
Your tasks: Existing data

How can you find existing data?

Description

Many datasets may already exist that you can reuse for your project. Even if you know the literature very well, you cannot assume that you know everything that is available. The datasets you are looking for may have been collected for the same purpose in an earlier project, or for a completely different purpose, and still serve your goals.

Considerations

  • Creation of scientific data can be a costly process. For a research project to receive funding, one needs to justify, in the project’s data management plan, the need for data creation and why reuse is not possible. It is therefore advisable to always check first whether suitable data already exists that you can reuse for your project.

  • When the outputs of a project are to be published, the methodology of selecting a source dataset will be subjected to peer review. Following community best practice for data discovery and documenting your method will help you later in reviews.

  • List the characteristics of the datasets you are looking for, e.g. format, availability, coverage, etc. This enables you to formulate the search terms. Please see Gregory K. et al. Eleven quick tips for finding research data. PLoS Comput Biol 14(4): e1006038 (2018) for more information.
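
The last consideration, listing the characteristics you need and deriving search terms from them, can be sketched in a few lines of Python. This is only an illustration: the `DatasetCriteria` class, its fields and the example values are hypothetical and do not come from any repository or standard.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetCriteria:
    """Characteristics of the dataset you are looking for (hypothetical fields)."""
    topic: str
    formats: list[str] = field(default_factory=list)
    coverage: str = ""
    open_access_only: bool = True


def search_terms(criteria: DatasetCriteria) -> list[str]:
    """Flatten the criteria into keywords you can paste into a repository search box."""
    terms = [criteria.topic]
    terms.extend(criteria.formats)
    if criteria.coverage:
        terms.append(criteria.coverage)
    if criteria.open_access_only:
        terms.append("open access")
    return terms


# Illustrative values only; replace them with the needs of your own project.
criteria = DatasetCriteria(topic="soil metagenome", formats=["FASTQ"], coverage="Europe")
print(" ".join(search_terms(criteria)))
```

Writing the criteria down in a structured form also gives you a record of your selection method that you can reuse when documenting it for peer review.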

Solutions

  • Locate the repositories relevant for your field.
    • Check the bibliographies of relevant publications, and check where the authors of those papers have stored their data. Note those repositories. If papers don’t provide data, contact the authors.
    • Data papers provide peer-reviewed descriptions of publicly available datasets or databases and link to the data source in repositories. Data papers can be published in dedicated journals, such as Scientific Data, or be a specific article type in conventional journals.
    • Search for research communities in the field, and find out whether they have policies for data submission that mention data repositories. Examples include the ELIXIR communities in the life sciences.
  • Locate the primary journals in the field, and find out what data repositories they endorse.
    • Journal websites will have a “Submitter Guide”, where you’ll find lists of recommended deposition databases per discipline, or generalist repositories. For instance, Scientific Data’s Recommended Repositories.
    • You can also find the databases supported by a journal through the policy interface of FAIRsharing.
  • Search registries for suitable data repositories.
  • Search through all repositories you found to identify what you could use. Give priority to curated repositories.
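
Searching a registry can also be done programmatically. The sketch below builds a dataset query for the DataCite REST API and summarises a response. The endpoint, the `query`, `resource-type-id` and `page[size]` parameters and the response field names are written from memory of the public DataCite API and should be treated as assumptions to verify against its documentation. The sample response is made up so that the example runs offline.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the DataCite REST API; verify before relying on it.
DATACITE_API = "https://api.datacite.org/dois"


def build_query_url(keywords: str, page_size: int = 10) -> str:
    """Build a search URL restricted to records of resource type 'dataset'."""
    params = {"query": keywords, "resource-type-id": "dataset", "page[size]": page_size}
    return f"{DATACITE_API}?{urlencode(params)}"


def summarise(response_text: str) -> list[dict]:
    """Extract DOI, title, publisher and year from a JSON response body."""
    records = json.loads(response_text).get("data", [])
    summary = []
    for record in records:
        attrs = record.get("attributes", {})
        titles = attrs.get("titles") or [{}]
        summary.append({
            "doi": attrs.get("doi"),
            "title": titles[0].get("title"),
            "publisher": attrs.get("publisher"),
            "year": attrs.get("publicationYear"),
        })
    return summary


# Made-up response in the assumed shape, so the sketch runs without network access.
sample_response = json.dumps({
    "data": [{
        "attributes": {
            "doi": "10.0000/example",
            "titles": [{"title": "Example dataset"}],
            "publisher": "Example Repository",
            "publicationYear": 2020,
        }
    }]
})

print(build_query_url("soil metagenome"))
print(summarise(sample_response))
```

To query the live service, the URL returned by `build_query_url` could be fetched with `urllib.request.urlopen` and the decoded body passed to `summarise`. The publisher field tells you which repository holds each hit, which helps when giving priority to curated repositories.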

How can you reuse existing data?

Description

When you find data of interest, you should first check whether the quality is good and whether you are allowed to use the data for your purpose. This process can be difficult; the guidelines and tools below can help.

Considerations

  • Before reusing the data, make sure to check if a licence is attached and that it allows your intended use of the data.

  • Check if metadata or documentation are provided with the data. Metadata and documentation should provide enough information for a correct interpretation and reuse of the data. The use of standard metadata schemas and ontologies increases the reusability of the data.

  • Quality of the data is of utmost importance. You should check whether there is a data curation process on the repository (automatic, manual, community). This information should be available on the repository’s website. Check if the repository provides a quality status of each dataset (e.g. star rating system or quality indicators).

  • The data you choose to reuse may be versioned. Before you start to reuse it you should decide which version of the dataset you will use.
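
The considerations above can be turned into a simple pre-reuse checklist. The following is a minimal sketch, assuming that you have copied the relevant facts about a dataset into a dictionary. The keys `licence`, `documentation` and `version`, and the set of allowed licences, are hypothetical choices for illustration; which licences permit your intended use is something you have to decide for your own project.

```python
def reuse_blockers(record: dict, allowed_licences: set[str]) -> list[str]:
    """Return a list of reasons why a dataset should not be reused yet."""
    problems = []
    licence = record.get("licence")
    if not licence:
        # Without a licence you may not assume that reuse is permitted.
        problems.append("no licence found: contact the authors")
    elif licence not in allowed_licences:
        problems.append(f"licence {licence!r} is not on your allowed list")
    if not record.get("documentation"):
        problems.append("no metadata or documentation provided")
    if not record.get("version"):
        problems.append("no version stated")
    return problems


# Illustrative values only.
allowed = {"CC0-1.0", "CC-BY-4.0"}
candidate = {"licence": "CC-BY-4.0", "documentation": "README.txt", "version": "2.1"}
unlicensed = {"documentation": "README.txt", "version": "1.0"}

print(reuse_blockers(candidate, allowed))
print(reuse_blockers(unlicensed, allowed))
```

An empty list means that none of the checks failed. It does not replace reading the licence text or judging the data quality yourself.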

Solutions

  • Verify that the data is suitable for reuse.
    • Check the licences or repository policy for data usage.
    • Data from publications can generally be used, but make sure that you cite the publication as a reference.
    • If you cannot find the licence of the data, contact the authors. No licence means no reuse allowed.
    • If you are reusing personal (identifiable) or even sensitive data, some extra care needs to be taken (see Human data and Sensitive data pages):
      • Make sure you select a data repository that has a clear, published data access/use policy. You do not want to be liable for improper reuse of personal information. For instance, if you’re downloading human data from some lab’s website make sure there is a statement/confirmation that the data was collected with ethical and legal considerations in place.
      • Sensitive data is often shared under restrictions. Check in the description of the access conditions whether these match with your project (i.e. whether you would be able to successfully ask to get access to the data). For instance, certain datasets can only be accessed by projects with Ethics/Institutional Review Board approval or some can only be used within a specific research field.
  • Verify the quality of the data. Some repositories provide quality indicators for each dataset, such as a star rating system.
  • If metadata is available, check the quality of metadata. For instance, information about experimental setup, sample preparation, data analysis/processing can be necessary to reuse the data and reproduce the experiments.

  • Decide which version (if present) of the data you will use.
    • You can decide to always use the version that is available at the start of the project. You would do this if switching to a new version would not be very beneficial to the project or would require major changes. In this case, you need to make sure that you, and others who want to reproduce your results, can still access the old version at a later stage.
    • You can update to the latest versions if new ones come out during your project. You would do this if the new version does not require major changes in your project workflow, and/or if the updates could improve your project. In this case, consider that you may need to re-do all your calculations based on a new version of the dataset and make sure that everything stays consistent.
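
Whichever versioning option you choose, it helps to record exactly which copy of the data you used. The sketch below is one possible approach, not a prescribed method: it stores the identifier, version and retrieval date together with a SHA-256 checksum, so that you or others can later confirm that a file is the same one. The function names and the placeholder identifier are hypothetical.

```python
import hashlib
import json
from datetime import date


def provenance_record(data: bytes, identifier: str, version: str, retrieved: date) -> dict:
    """Describe the exact copy of a dataset that was used in the project."""
    return {
        "identifier": identifier,
        "version": version,
        "retrieved": retrieved.isoformat(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def matches(data: bytes, record: dict) -> bool:
    """Check whether the given bytes are identical to the recorded copy."""
    return hashlib.sha256(data).hexdigest() == record["sha256"]


# Illustrative content and placeholder identifier; in practice read the downloaded file.
data = b"sample_id,value\nS1,0.42\n"
record = provenance_record(data, "doi:placeholder", "1.0", date(2024, 1, 15))

print(json.dumps(record, indent=2))
print(matches(data, record))
```

If you update to a newer version during the project, creating a new record and keeping the old one makes it clear which results were computed from which version.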

More information

Tools and resources on this page

Tool or resource: description (with related pages and registry links where listed)

  • DataCite: A search engine for the complete collection of publicly available DataCite DOIs. Registry: Standards/Databases.
  • ELIXIR Core Data Resources: Set of European data resources of fundamental importance to the wider life-science community and the long-term preservation of biological data. Registry: Standards/Databases.
  • Evidence and Conclusion Ontology (ECO): Controlled vocabulary that describes types of evidence and assertion methods. Related pages: Documentation and meta... Registry: Tool info, Standards/Databases.
  • FAIRsharing: A curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies. Related pages: FAIRtracks, Microbial biotechnology, Plant sciences, Data provenance, Data publication, Documentation and meta... Registry: Standards/Databases, Training.
  • Google Dataset Search: Search engine for datasets.
  • Mendeley data: Multidisciplinary, free-to-use open repository specialized for research data. Related pages: Biomolecular simulatio..., Data publication. Registry: Standards/Databases.
  • OmicsDI: Omics Discovery Index (OmicsDI) provides a knowledge discovery framework across heterogeneous omics data (genomics, proteomics, transcriptomics and metabolomics). Related pages: Galaxy, Machine actionability. Registry: Tool info, Standards/Databases, Training.
  • OpenAIRE Explore: Explore Open Access research outcomes from OpenAIRE network.
  • re3data: Registry of Research Data Repositories. Related pages: Data publication, Licensing. Registry: Training.
  • Scientific Data's Recommended Repositories: List of repositories recommended by Scientific Data; contains both discipline-specific and general repositories.

National resources
Swiss Pathogen Surveillance Platform (SPSP)

A secure One-health online platform that enables near real-time sharing under controlled access of pathogen whole genome sequences (WGS) and their associated clinical/epidemiological metadata. Since 20221 it has centralized and processed all SARS-CoV-2 sequencing data within the national genomic surveillance program.

COVID-19 Data Portal
Czech National Repository

National Repository (NR) is a service provided to the scientific and research communities in the Czech Republic to store their generated research data together with a persistent DOI identifier. The NR service is currently in a pilot program.

Researcher Data Steward Research Software Engi... Data storage Identifiers Data management plan
Fairdata.fi

With the Fairdata Services you can store, share and publish your research data with easy-to-use web tools.

CSC Researcher Data Steward Data storage Data publication
Federated EGA Finland

FEGA allows you to store and share sensitive data in Finland in a way that fulfils all the requirements of the General Data Protection Regulation (GDPR).

The European Genome-phenome Archive (EGA)
CSC Researcher Data Steward Data sensitivity Data publication Human data
Findata

The Health and Social Data Permit Authority. Findata offers services and enables secure and efficient utilisation of data materials containing health and social data.

CSC Researcher Data Steward Data sensitivity Human data
BBMRI catalogue

Biobanking Netherlands makes biosamples, images and data findable, accessible and usable for health research.

Human data Researcher Data analysis Data storage
CBS, Statistics Netherlands

The national statistical office, Statistics Netherlands (CBS), provides reliable statistical information and data in the life sciences and health domain.

Human data Researcher
Dutch COVID-19 Data Support Programme

To support investigators and health care professionals with tools and services in their search for ways to overcome the pandemic and its health consequences.

Human data Researcher
Health-RI Service Catalogue

Health-RI provides a set of tools and services available to the biomedical research community.

Human data Researcher Data analysis Data storage
RIVM Health and Healthcare Data

The Dutch National Institute for Public Health and the Environment (RIVM), together with other organisations, provides numbers and explanation on relevant topics, to prevent duplication of data collection.

Human data Researcher
Federated EGA Norway node

The federated instance collects metadata of -omics data collections stored in national or regional archives and makes them available for search through the main EGA portal. With this solution, sensitive data will not physically leave the country, but will reside on TSD.

The European Genome-phenome Archive (EGA)
Human data Data sensitivity Data publication TSD
Norwegian COVID-19 Data Portal

The Norwegian COVID-19 Data Portal aims to bundle the Norwegian research efforts and offers guidelines, tools, databases and services to support Norwegian COVID-19 researchers.

Human data Data sensitivity Data publication
usegalaxy.no

Galaxy is an open-source, web-based platform for data-intensive biomedical research. This instance of Galaxy is coupled with NeLS for easy data transfer.

Galaxy
Data analysis Data sensitivity Data publication NeLS
Federated EGA Sweden node

Secure archiving and sharing of genetic and phenotypic data resulting from Swedish biomedical research projects.

The European Genome-phenome Archive (EGA)
Human data Data sensitivity Data publication
SciLifeLab Data Repository (Figshare)

A repository for publishing any kind of research-related data, e.g. documents, figures, or presentations.

FigShare
Data publication
Swedish Pathogens Portal

The Swedish Pathogens Portal provides information, guidelines, tools and services to support researchers to utilise Swedish and European infrastructures for data sharing.

COVID-19 Data Portal Human data Data sensitivity Data publication