How can you find existing data?
Description
Many datasets may already exist that you can reuse for your project. Even if you know the literature very well, you cannot assume that you know everything that is available. A suitable dataset may have been collected for the same purpose in an earlier project, or it may have been collected for a completely different purpose and still serve your goals.
Considerations
- Creating scientific data can be a costly process. For a research project to receive funding, the project's data management plan needs to justify why data creation is needed and why reuse is not possible. It is therefore advisable to always check first whether suitable data already exists that you can reuse for your project.
- When the outputs of a project are published, the method used to select a source dataset will be subject to peer review. Following community best practice for data discovery and documenting your method will help you later in reviews.
- List the characteristics of the datasets you are looking for, e.g. format, availability, coverage, etc. This enables you to formulate the search terms. For more information, see Gregory K. et al. Eleven quick tips for finding research data. PLoS Comput Biol 14(4): e1006038 (2018).
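The last consideration above, turning a list of dataset characteristics into search terms, can be sketched in a few lines of Python. This is only an illustration: the `DatasetCriteria` class, its field names and the `AND`/`OR` query syntax are assumptions made for this example, and not every search engine or repository supports boolean operators.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetCriteria:
    """Hypothetical record of the characteristics you are looking for."""
    topic: str
    formats: List[str] = field(default_factory=list)   # e.g. file formats
    coverage: List[str] = field(default_factory=list)  # e.g. region, species
    availability: Optional[str] = None                 # e.g. "open access"


def build_search_terms(criteria: DatasetCriteria) -> str:
    """Combine the characteristics into one boolean search string."""
    terms = [f'"{criteria.topic}"']
    if criteria.formats:
        # Any of the acceptable formats is fine, so join them with OR.
        terms.append("(" + " OR ".join(criteria.formats) + ")")
    # Every coverage requirement must match, so each becomes its own term.
    terms.extend(f'"{item}"' for item in criteria.coverage)
    if criteria.availability:
        terms.append(f'"{criteria.availability}"')
    return " AND ".join(terms)


if __name__ == "__main__":
    wanted = DatasetCriteria(
        topic="wheat transcriptome",
        formats=["FASTQ", "BAM"],
        coverage=["Europe"],
    )
    print(build_search_terms(wanted))
```

Keeping the criteria in a small structure like this also gives you something concrete to record when you document your data discovery method.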
Solutions
- Locate the repositories relevant for your field.
  - Check the bibliography of relevant publications, and check where the authors of those papers have stored their data. Note those repositories. If papers don't provide data, contact the authors.
  - Data papers provide peer-reviewed descriptions of publicly available datasets or databases and link to the data source in repositories. Data papers can be published in dedicated journals, such as Scientific Data, or be a specific article type in conventional journals.
  - Search for research communities in the field, and find out whether they have policies for data submission that mention data repositories. For instance, ELIXIR communities in Life Sciences.
  - Locate the primary journals in the field, and find out what data repositories they endorse.
    - Journal websites will have a "Submitter Guide", where you'll find lists of recommended deposition databases per discipline, or generalist repositories. For instance, Scientific Data's Recommended Repositories.
    - You can also find the databases supported by a journal through the policy interface of FAIRsharing.
- Search registries for suitable data repositories.
  - FAIRsharing is an ELIXIR resource listing repositories.
  - re3data lists repositories from all fields of science.
  - Google Dataset Search or DataCite can be used to locate datasets.
  - OmicsDI provides a knowledge discovery framework across heterogeneous omics data (genomics, proteomics, transcriptomics and metabolomics).
  - The ELIXIR Core Data Resources list contains knowledge resources recommended by ELIXIR.
  - OpenAIRE Explore provides linked open research datasets.
  - Mendeley data is linked with the Mendeley social network.
- Search through all repositories you found to identify what you could use. Give priority to curated repositories.
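Some of the registries listed above can also be queried programmatically. The following is a minimal sketch for DataCite, assuming its public REST API at `https://api.datacite.org/dois` with the `query`, `resource-type-id` and `page[size]` parameters, and a response that lists records under `data` with `attributes.doi` and `attributes.titles`. Check the current DataCite API documentation before relying on these details. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the public DataCite REST API.
DATACITE_DOIS = "https://api.datacite.org/dois"


def build_dataset_search_url(query: str, page_size: int = 5) -> str:
    """Build a search URL restricted to records of resource type 'dataset'."""
    params = {
        "query": query,
        "resource-type-id": "dataset",
        "page[size]": page_size,
    }
    return DATACITE_DOIS + "?" + urllib.parse.urlencode(params)


def summarize(payload: dict) -> list:
    """Extract (doi, first title) pairs from a decoded JSON response."""
    results = []
    for item in payload.get("data", []):
        attributes = item.get("attributes", {})
        titles = attributes.get("titles") or [{}]
        results.append((attributes.get("doi"), titles[0].get("title")))
    return results


def search_datasets(query: str, page_size: int = 5) -> list:
    """Run the search over the network and return (doi, title) pairs."""
    url = build_dataset_search_url(query, page_size)
    with urllib.request.urlopen(url, timeout=30) as response:
        return summarize(json.load(response))


if __name__ == "__main__":
    for doi, title in search_datasets("wheat transcriptome"):
        print(doi, "|", title)
```

A script like this does not replace browsing curated repositories, but it makes the search you performed easy to repeat and to document.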
How can you reuse existing data?
Description
When you find data of interest, you should first check whether its quality is good and whether you are allowed to use it for your purpose. This process can be difficult; the guidelines and tools below can help.
Considerations
- Before reusing the data, make sure to check if a licence is attached and that it allows your intended use of the data.
- Check if metadata or documentation are provided with the data. Metadata and documentation should provide enough information for a correct interpretation and reuse of the data. The use of standard metadata schemas and ontologies increases the reusability of the data.
- Quality of the data is of utmost importance. You should check whether there is a data curation process on the repository (automatic, manual, community). This information should be available on the repository's website. Check if the repository provides a quality status for each dataset (e.g. star rating system or quality indicators).
- The data you choose to reuse may be versioned. Before you start to reuse it, you should decide which version of the dataset you will use.
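The licence check in the first consideration above can be partly automated when a repository exposes a machine-readable licence identifier. The sketch below is a deliberately simplified illustration: the `LICENCE_TERMS` table covers only four Creative Commons licences, the function name and return shape are hypothetical, and the full licence text or repository policy remains the authority on what is allowed.

```python
# Simplified view of a few common licences, keyed by SPDX-style identifier.
# Assumption: the repository metadata gives you such an identifier.
LICENCE_TERMS = {
    "CC0-1.0": {"commercial": True, "share_adaptations": True, "attribution": False},
    "CC-BY-4.0": {"commercial": True, "share_adaptations": True, "attribution": True},
    "CC-BY-NC-4.0": {"commercial": False, "share_adaptations": True, "attribution": True},
    "CC-BY-ND-4.0": {"commercial": True, "share_adaptations": False, "attribution": True},
}


def check_reuse(licence_id, commercial=False, share_adaptations=False):
    """Return (allowed, note) for an intended use under a given licence."""
    if not licence_id:
        # No licence means no reuse allowed until the authors say otherwise.
        return False, "no licence found: contact the authors"
    terms = LICENCE_TERMS.get(licence_id)
    if terms is None:
        return False, "licence not in table: read the licence or repository policy"
    if commercial and not terms["commercial"]:
        return False, "commercial use is not permitted"
    if share_adaptations and not terms["share_adaptations"]:
        return False, "sharing adapted versions is not permitted"
    note = "cite the source as required" if terms["attribution"] else "no attribution required"
    return True, note


if __name__ == "__main__":
    print(check_reuse("CC-BY-4.0", commercial=True))
    print(check_reuse("CC-BY-NC-4.0", commercial=True))
    print(check_reuse(None))
```

Returning a note alongside the yes/no answer keeps the reason for a decision visible, which is useful when you later document why a dataset was or was not reused.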
Solutions
- Verify that the data is suitable for reuse.
  - Check the licences or repository policy for data usage.
    - Data from publications can generally be used, but make sure that you cite the publication as a reference.
    - If you cannot find the licence of the data, contact the authors. No licence means no reuse allowed.
  - If you are reusing personal (identifiable) or even sensitive data, some extra care needs to be taken (see Human data and Sensitive data pages):
    - Make sure you select a data repository that has a clear, published data access/use policy. You do not want to be liable for improper reuse of personal information. For instance, if you're downloading human data from some lab's website, make sure there is a statement/confirmation that the data was collected with ethical and legal considerations in place.
    - Sensitive data is often shared under restrictions. Check in the description of the access conditions whether these match your project (i.e. whether you would be able to successfully request access to the data). For instance, certain datasets can only be accessed by projects with Ethics/Institutional Review Board approval, and some can only be used within a specific research field.
- Verify the quality of the data. Some repositories have quality indicators, such as:
  - Star system indicating level of curation, e.g. for manually curated/non-curated entries.
  - Evidence and Conclusion Ontology (ECO).
  - Detailed quality assessment methods. For instance, PDB has several structure quality assessment metrics.
- If metadata is available, check the quality of the metadata. For instance, information about experimental setup, sample preparation, and data analysis/processing can be necessary to reuse the data and reproduce the experiments.
- Decide which version (if present) of the data you will use.
  - You can decide to always use the version that is available at the start of the project. You would do this if switching to new versions would not be very beneficial to the project or would require major changes. In this case, you need to make sure that you, and others who want to reproduce your results, can still access the old version at a later stage.
  - You can update to the latest versions if new ones come out during your project. You would do this if the new version does not require major changes in your project workflow, and/or if the updates could improve your project. In this case, consider that you may need to redo all your calculations based on the new version of the dataset, and make sure that everything stays consistent.
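Whichever versioning strategy you choose, it helps to record exactly which copy of a dataset you used. A minimal sketch, assuming the dataset is a single downloaded file: store its identifier, version label, retrieval date and a SHA-256 checksum, and compare the checksum again later. The function names and the record layout are hypothetical.

```python
import hashlib
from datetime import date
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pin_dataset(path, identifier, version, retrieved=None):
    """Describe the exact dataset copy used, for the project records."""
    return {
        "identifier": identifier,  # e.g. a DOI or accession number
        "version": version,        # version label given by the repository
        "retrieved": (retrieved or date.today()).isoformat(),
        "file": Path(path).name,
        "sha256": sha256_of(path),
    }


def is_unchanged(path, record):
    """Check that a local file still matches the pinned record."""
    return sha256_of(path) == record["sha256"]
```

For example, `pin_dataset("expression.tsv", "10.1234/example", "v2")` (placeholder values) returns a dictionary you can save as JSON next to your analysis. If you later update to a newer version, pinning it again and comparing the records shows at a glance that the inputs changed and that calculations need to be rerun.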
More information
Links to FAIR Cookbook
FAIR Cookbook is an online, open and live resource for the Life Sciences with recipes that help you to make and keep data Findable, Accessible, Interoperable and Reusable; in one word FAIR.
Links to DSW
With Data Stewardship Wizard (DSW), you can create, plan, collaborate, and bring your data management plans to life with a tool trusted by thousands of people worldwide — from data management pioneers, to international research institutes.
Tools and resources on this page
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
DataCite | A search engine for the complete collection of publicly available DataCite DOIs | Standards/Databases | |
ELIXIR Core Data Resources | Set of European data resources of fundamental importance to the wider life-science community and the long-term preservation of biological data | COVID-19 Data Portal | Standards/Databases |
Evidence and Conclusion Ontology (ECO) | Controlled vocabulary that describes types of evidence and assertion methods | Documentation and meta... | Standards/Databases |
FAIRsharing | A curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies. | FAIRtracks Health data Microbial biotechnology Plant sciences Data discoverability Data provenance Data publication Machine actionability Documentation and meta... | Standards/Databases Training |
Google Dataset Search | Search engine for datasets | ||
Mendeley data | Multidisciplinary, free-to-use open repository specialized for research data | Biomolecular simulatio... Data publication | Standards/Databases |
OmicsDI | Omics Discovery Index (OmicsDI) provides a knowledge discovery framework across heterogeneous omics data (genomics, proteomics, transcriptomics and metabolomics) | Galaxy Machine actionability | Tool info Standards/Databases Training |
OpenAIRE Explore | Explore Open Access research outcomes from OpenAIRE network | ||
re3data | Registry of Research Data Repositories | Data discoverability Data publication Licensing | Training |
Scientific Data's Recommended Repositories | List of repositories recommended by Scientific Data; contains both discipline-specific and general repositories. | | |
National resources
Tools and resources tailored to users in different countries.
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
Swiss Pathogen Surveillance Platform (SPSP) | A secure One-health online platform that enables near real-time sharing under controlled access of pathogen whole genome sequences (WGS) and their associated clinical/epidemiological metadata. Since 20221 it has centralized and processed all SARS-CoV-2 sequencing data within the national genomic surveillance program. | COVID-19 Data Portal | |
Czech National Repository | National Repository (NR) is a service provided to the scientific and research communities in the Czech Republic to store their generated research data together with persistent DOI identifier. NR service is currently under the pilot program. | Researcher Data Steward Research Software Engi... Data storage Identifiers Data management plan | |
Fairdata.fi | With the Fairdata Services you can store, share and publish your research data with easy-to-use web tools. | CSC Researcher Data Steward Data storage Data publication | |
Federated EGA Finland | FEGA allows you to store and share sensitive data in Finland in a way that fulfils all the requirements of the General Data Protection Regulation (GDPR). (Related resource: The European Genome-phenome Archive (EGA)) | CSC Researcher Data Steward Data sensitivity Data publication Human data | |
Findata | The Health and Social Data Permit Authority. Findata offers services and enables secure and efficient utilisation of data materials containing health and social data. | CSC Researcher Data Steward Data sensitivity Human data | |
BBMRI catalogue | Biobanking Netherlands makes biosamples, images and data findable, accessible and usable for health research. | Human data Researcher Data analysis Data storage | |
cBioPortal for Cancer Genomics | cBioPortal provides a web-based resource for researchers to explore, visualize, analyze, and share multidimensional cancer genomic datasets, as well as other studies involving multidimensional genomic data. | Human data Researcher Data analysis Data storage | |
CBS, Statistics Netherlands | The national statistical office, Statistics Netherlands (CBS), provides reliable statistical information and data in the life sciences and health domain. | Human data Researcher | |
Dutch COVID-19 Data Support Programme | To support investigators and health care professionals with tools and services in their search for ways to overcome the pandemic and its health consequences. | Human data Researcher | |
Health-RI Service Catalogue | Health-RI provides a set of tools and services available to the biomedical research community. | Human data Researcher Data analysis Data storage | |
RIVM Health and Healthcare Data | The Dutch National Institute for Public Health and the Environment (RIVM), together with other organisations, provides numbers and explanation on relevant topics, to prevent duplication of data collection. | Human data Researcher | |
Federated EGA Norway node | Federated instance collects metadata of -omics data collections stored in national or regional archives and makes them available for search through the main EGA portal. With this solution, sensitive data will not physically leave the country, but will reside on TSD. (Related resource: The European Genome-phenome Archive (EGA)) | Human data Data sensitivity Data publication TSD | |
Norwegian COVID-19 Data Portal | The Norwegian COVID-19 Data Portal aims to bundle the Norwegian research efforts and offers guidelines, tools, databases and services to support Norwegian COVID-19 researchers. | Human data Data sensitivity Data publication COVID-19 Data Portal | |
usegalaxy.no | Galaxy is an open-source, web-based platform for data-intensive biomedical research. This instance of Galaxy is coupled with NeLS for easy data transfer. (Related resource: Galaxy) | Data analysis Data sensitivity Data publication NeLS | |
Federated EGA Sweden node | Secure archiving and sharing of genetic and phenotypic data resulting from Swedish biomedical research projects. (Related resource: The European Genome-phenome Archive (EGA)) | Human data Data sensitivity Data publication | |
SciLifeLab Data Repository (Figshare) | A repository for publishing any kind of research-related data, e.g. documents, figures, or presentations. (Related resource: FigShare) | Data publication | |
Swedish Pathogens Portal | The Swedish Pathogens Portal provides information, guidelines, tools and services to support researchers to utilise Swedish and European infrastructures for data sharing. | COVID-19 Data Portal Human data Data sensitivity Data publication | |