Your tasks: Existing data

How can you find existing data?

Description

Many datasets may already exist that you can reuse for your project. Even if you know the literature very well, you cannot assume that you know everything that is available. The datasets you are looking for may have been collected for the same purpose in an earlier project, or for a completely different purpose and still serve your goals.

Considerations

Creation of scientific data can be a costly process. For a research project to receive funding one needs to justify, in the project’s data management plan, the need for data creation and why reuse is not possible. It is therefore advisable to always check first whether suitable data already exists that you can reuse for your project.
When the outputs of a project are to be published, the methodology of selecting a source dataset will be subjected to peer review. Following community best practice for data discovery and documenting your method will help you later in reviews.
List the characteristics of the datasets you are looking for, e.g. format, availability, coverage, etc. This enables you to formulate the search terms. Please see Gregory K. et al. Eleven quick tips for finding research data. PLoS Comput Biol 14(4): e1006038 (2018) for more information.
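
The step of turning dataset characteristics into search terms, and keeping a record of each search for later review, could be sketched as follows. This is only an illustration: the class `DatasetCriteria`, its fields, and the `log_entry` helper are hypothetical names, and the `AND`/`OR` syntax is an assumption that will need adapting to each search engine.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DatasetCriteria:
    """Characteristics of the dataset you are looking for (illustrative fields)."""
    keywords: list[str]
    formats: list[str] = field(default_factory=list)
    coverage: list[str] = field(default_factory=list)
    open_access_only: bool = True

    def search_terms(self) -> str:
        """Combine the characteristics into one free-text search string."""
        groups = [self.keywords, self.formats, self.coverage]
        # Terms within a group are alternatives (OR); non-empty groups are joined with AND.
        parts = ["(" + " OR ".join(f'"{t}"' for t in g) + ")" for g in groups if g]
        return " AND ".join(parts)


def log_entry(criteria: DatasetCriteria, source: str, hits: int, on: date) -> dict:
    """Build one row of a search log that can be reported in a methods section."""
    return {
        "date": on.isoformat(),
        "source": source,
        "query": criteria.search_terms(),
        "open_access_only": criteria.open_access_only,
        "hits": hits,
    }
```

For example, `DatasetCriteria(keywords=["soil microbiome", "rhizosphere"], formats=["FASTQ"]).search_terms()` gives `("soil microbiome" OR "rhizosphere") AND ("FASTQ")`. Keeping one log row per search and per source makes the selection method easier to describe during peer review.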

Solutions

Locate the repositories relevant for your field.
- Check the bibliography on relevant publications, and check where the authors of those papers have stored their data. Note those repositories. If papers don’t provide data, contact the authors.
- Data papers provide peer-reviewed descriptions of publicly available datasets or databases and link to the data source in repositories. Data papers can be published in dedicated journals, such as Scientific Data, or be a specific article type in conventional journals.
- Search for research communities in the field, and find out whether they have policies for data submission that mention data repositories. For instance, ELIXIR communities in Life Sciences.
Locate the primary journals in the field, and find out what data repositories they endorse.
- Journal websites will have a “Submitter Guide”, where you’ll find lists of recommended deposition databases per discipline, or generalist repositories. For instance, Scientific Data’s Recommended Repositories.
- You can also find the databases supported by a journal through the policy interface of FAIRsharing.
Search registries for suitable data repositories.
- FAIRsharing is an ELIXIR resource listing repositories.
- re3data lists repositories from all fields of science.
- Google Dataset Search and DataCite can be used to locate datasets.
- The OmicsDI provides a knowledge discovery framework across heterogeneous omics data (genomics, proteomics, transcriptomics and metabolomics).
- ELIXIR Core Data Resources is a list of knowledge resources recommended by ELIXIR.
- OpenAIRE Explore provides linked open research datasets.
- Mendeley data is linked with the Mendeley social network.
Search through all repositories you found to identify what you could use. Give priority to curated repositories.
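
Some of the registries above can also be queried programmatically. The sketch below targets the public DataCite REST API; the query parameters (`query`, `resource-type-id`, `page[size]`) and the response layout (`data`, `attributes`, `titles`, `publisher`, `publicationYear`) are assumptions based on that API and should be checked against its current documentation. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

DATACITE_API = "https://api.datacite.org/dois"  # assumed public endpoint


def build_search_url(query: str, page_size: int = 10) -> str:
    """Build a search URL restricted to records typed as datasets."""
    params = {
        "query": query,
        "resource-type-id": "dataset",
        "page[size]": page_size,
    }
    return f"{DATACITE_API}?{urlencode(params)}"


def summarise(response: dict) -> list[dict]:
    """Reduce a response to DOI, title, publisher and year per record."""
    rows = []
    for record in response.get("data", []):
        attrs = record.get("attributes", {})
        titles = attrs.get("titles") or [{}]  # tolerate records without a title
        rows.append({
            "doi": record.get("id"),
            "title": titles[0].get("title"),
            "publisher": attrs.get("publisher"),
            "year": attrs.get("publicationYear"),
        })
    return rows


def search(query: str, page_size: int = 10) -> list[dict]:
    """Run the search; this call needs network access."""
    with urlopen(build_search_url(query, page_size), timeout=30) as resp:
        return summarise(json.load(resp))
```

A call such as `search("soil microbiome")` returns a short list of candidate records to inspect by hand. It complements browsing the repositories themselves and does not replace it, since not every dataset has a DataCite DOI.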

How can you reuse existing data?

Description

When you find data of interest, you should first check if the quality is good and if you are allowed to use the data for your purpose. This process can be difficult; the guidelines and tools below can help.

Considerations

Before reusing the data, make sure to check if a licence is attached and that it allows your intended use of the data.
Check if metadata or documentation are provided with the data. Metadata and documentation should provide enough information for a correct interpretation and reuse of the data. The use of standard metadata schemas and ontologies increases the reusability of the data.
Quality of the data is of utmost importance. You should check whether there is a data curation process on the repository (automatic, manual, community). This information should be available on the repository’s website. Check if the repository provides a quality status of each dataset (e.g. star rating system or quality indicators).
The data you choose to reuse may be versioned. Before you start to reuse it you should decide which version of the dataset you will use.
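
The licence and metadata checks described above can be turned into a simple screening step. The following is a minimal sketch under stated assumptions: the record is a plain dictionary, the field names (`licence`, `title`, `creators`, `version`, `methods`) and the allow-list of licence identifiers are hypothetical, and the right values depend on your repository and intended use.

```python
# Illustrative allow-list; which licences fit depends on your intended use.
ACCEPTED_LICENCES = {"cc0-1.0", "cc-by-4.0"}
# Hypothetical minimum set of metadata fields needed to interpret the data.
REQUIRED_FIELDS = ("title", "creators", "version", "methods")


def reuse_issues(record: dict) -> list[str]:
    """Return the reasons a dataset record is not yet cleared for reuse."""
    issues = []
    licence = (record.get("licence") or "").strip().lower()
    if not licence:
        # A missing licence is treated as a blocker, not as permission.
        issues.append("no licence found: contact the authors before reuse")
    elif licence not in ACCEPTED_LICENCES:
        issues.append(f"licence '{licence}' needs manual review")
    for name in REQUIRED_FIELDS:
        if not record.get(name):  # absent or empty counts as missing
            issues.append(f"missing metadata field: {name}")
    return issues
```

An empty list means the record passed this first screen. It does not replace reading the licence text or the repository's data usage policy.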

Solutions

Verify that the data is suitable for reuse.
- Check the licences or repository policy for data usage.
- Data from publications can generally be used, but make sure that you cite the publication as a reference.
- If you cannot find the licence of the data, contact the authors. Without a licence, you should assume that reuse is not allowed.
- If you are reusing personal (identifiable) or even sensitive data, some extra care needs to be taken (see Human data and Sensitive data pages):
  - Make sure you select a data repository that has a clear, published data access/use policy. You do not want to be liable for improper reuse of personal information. For instance, if you’re downloading human data from some lab’s website make sure there is a statement/confirmation that the data was collected with ethical and legal considerations in place.
  - Sensitive data is often shared under restrictions. Check in the description of the access conditions whether these match your project (i.e. whether you would be able to successfully request access to the data). For instance, certain datasets can only be accessed by projects with Ethics/Institutional Review Board approval, and some can only be used within a specific research field.
Verify the quality of the data. Some repositories have quality indicators, such as:
- Star system indicating level of curation, e.g. for manually curated/non-curated entries.
- Evidence and Conclusion Ontology (ECO).
- Detailed quality assessment methods. For instance, PDB has several structure quality assessment metrics.
If metadata is available, check the quality of metadata. For instance, information about experimental setup, sample preparation, data analysis/processing can be necessary to reuse the data and reproduce the experiments.
Decide which version (if present) of the data you will use.
- You can decide to always use the version that is available at the start of the project. You would do this if switching to new versions would not be very beneficial to the project or would require major changes. In this case, you need to make sure that you and others who want to reproduce your results can still access the old version at a later stage.
- You can update to the latest versions if new ones come out during your project. You would do this if the new version does not require major changes in your project workflow, and/or if the updates could improve your project. In this case, consider that you may need to re-do all your calculations based on a new version of the dataset and make sure that everything stays consistent.
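
Whichever versioning strategy you choose, it helps to record exactly which copy of the data you used. The sketch below shows one possible way to do this with a checksum-based manifest; the function names and manifest fields are hypothetical, and the identifier and version strings are whatever the repository provides.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file without loading it all at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pin_dataset(path: Path, identifier: str, version: str, retrieved: str) -> dict:
    """Describe the exact copy of a dataset used in the project."""
    return {
        "identifier": identifier,  # e.g. the DOI or accession given by the repository
        "version": version,        # version label as published by the repository
        "retrieved": retrieved,    # date of download, ISO format
        "file": path.name,
        "sha256": sha256_of(path),
    }


def is_unchanged(path: Path, manifest: dict) -> bool:
    """Check that a local file still matches the pinned checksum."""
    return sha256_of(path) == manifest["sha256"]
```

Storing this manifest alongside your analysis lets you, and anyone reproducing your results, confirm that the same version of the data is being used. If you move to a newer version, a new manifest entry documents the switch.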

Tool or resource	Description	Related pages	Registry
DataCite	A search engine for the complete collection of publicly available DataCite DOIs		Standards/Databases
ELIXIR Core Data Resources	Set of European data resources of fundamental importance to the wider life-science community and the long-term preservation of biological data	COVID-19 Data Portal	Standards/Databases
Evidence and Conclusion Ontology (ECO)	Controlled vocabulary that describes types of evidence and assertion methods	Documentation and meta...	Standards/Databases
FAIRsharing	A curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.	FAIRtracks Health data Microbial biotechnology Plant sciences Virology Data discoverability Data provenance Data publication Machine actionability Documentation and meta...	Standards/Databases Training
Google Dataset Search	Search engine for datasets
Mendeley data	Multidisciplinary, free-to-use open repository specialized for research data	Biomolecular simulatio... Data publication	Standards/Databases
OmicsDI	Omics Discovery Index (OmicsDI) provides a knowledge discovery framework across heterogeneous omics data (genomics, proteomics, transcriptomics and metabolomics)	Galaxy Data interlinking Machine actionability	Tool info Standards/Databases Training
OpenAIRE Explore	Explore Open Access research outcomes from OpenAIRE network
re3data	Registry of Research Data Repositories	Data discoverability Data publication Licensing	Training
Scientific Data's Recommended Repositories	List of repositories recommended by Scientific Data, contains both discipline-specific and general repositories.

National resources

Tools and resources tailored to users in different countries.

Tool or resource	Description	Related pages
Swiss Pathogen Surveillance Platform (SPSP)	A secure One-health online platform that enables near real-time sharing under controlled access of pathogen whole genome sequences (WGS) and their associated clinical/epidemiological metadata. Since 2021 it has centralised and processed all SARS-CoV-2 sequencing data within the national genomic surveillance program.	COVID-19 Data Portal
Czech National Repository	National Repository (NR) is a service provided to the scientific and research communities in the Czech Republic to store their generated research data together with a persistent DOI identifier. The NR service is currently in a pilot programme.	Researcher Data Steward Research Software Engi... Data storage Identifiers Data management plan
Fairdata.fi	With the Fairdata Services you can store, share and publish your research data with easy-to-use web tools.	CSC Researcher Data Steward Data storage Data publication
Federated EGA Finland	FEGA allows you to store and share sensitive data in Finland in a way that fulfils all the requirements of the General Data Protection Regulation (GDPR). Related resource: The European Genome-phenome Archive (EGA).	CSC Researcher Data Steward Data sensitivity Data publication Human data
Findata	The Health and Social Data Permit Authority. Findata offers services and enables secure and efficient utilisation of data materials containing health and social data.	CSC Researcher Data Steward Data sensitivity Human data
National instance of Genomic Data Infrastructure for ELIXIR Greece	An instance of the Genomic Data Infrastructure GDI on ELIXIR Greece, for secure genomic data management, including storage, discovery, access, and reception. This is a pilot instance based on the GDI Starter Kit.	Human data Data sensitivity Data publication
BBMRI catalogue	Biobanking Netherlands makes biosamples, images and data findable, accessible and usable for health research.	Human data Researcher Data analysis Data storage
cBioPortal for Cancer Genomics	cBioPortal provides a web-based resource for researchers to explore, visualise, analyse, and share multidimensional cancer genomic datasets, as well as other studies involving multidimensional genomic data.	Human data Researcher Data analysis Data storage
CBS, Statistics Netherlands	The national statistical office, Statistics Netherlands (CBS), provides reliable statistical information and data in the life sciences and health domain.	Human data Researcher
Dutch COVID-19 Data Support Programme	Supports investigators and health care professionals with tools and services in their search for ways to overcome the pandemic and its health consequences.	Human data Researcher
Health-RI Service Catalogue	Health-RI provides a set of tools and services available to the biomedical research community.	Human data Researcher Data analysis Data storage
RIVM Health and Healthcare Data	The Dutch National Institute for Public Health and the Environment (RIVM), together with other organisations, provides numbers and explanation on relevant topics, to prevent duplication of data collection.	Human data Researcher
Federated EGA Norway node	The federated instance collects metadata of -omics data collections stored in national or regional archives and makes them available for search through the main EGA portal. With this solution, sensitive data does not physically leave the country but resides on TSD. Related resource: The European Genome-phenome Archive (EGA).	Human data Data sensitivity Data publication TSD
Pathogens Portal Norway	The portal provides information about available datasets, resources, tools, and services related to pandemic preparedness in Norway. The portal gives researchers, clinicians and policymakers access to an extensive collection of biomolecular data about pathogens. Related resource: Pathogens Portal.	Human data Data sensitivity Data publication COVID-19 Data Portal
usegalaxy.no	Galaxy is an open-source, web-based platform for data-intensive biomedical research. This instance of Galaxy is coupled with NeLS for easy data transfer. Related resource: Galaxy.	Data analysis Data sensitivity Data publication NeLS
dados.gov.pt	Open data portal of the Portuguese Public Administration.	Researcher Data Steward Data publication
INDEXAR	National directory of repositories and digital scientific journals, in the fields of science and culture.	Researcher Data Steward Data publication
FEGA Sweden	The Swedish node of Federated European Genome-phenome Archive (FEGA), which offers secure archiving and sharing of genetic and phenotypic data resulting from Swedish biomedical research projects. Related resource: The European Genome-phenome Archive (EGA).	Human data Data sensitivity GDPR compliance Data publication
Researchdata.se	Researchdata.se is a national web portal where you can find, share, and reuse research data from a wide range of research fields. You can also find helpful advice and recommendations on managing research data.	Data publication Data management plan
SciLifeLab Data Repository (Figshare)	A repository for publishing life science research outputs that do not fit within existing domain-specific repositories, including datasets, documents, figures, and presentations, hosted on SciLifeLab’s local Figshare instance. Related resource: FigShare.	Data publication
SciLifeLab Precision Medicine Portal	This portal is a service for researchers in the precision medicine field, designed to support and accelerate data-driven life science research in Sweden. It provides links to various data sources, customised dashboards, and resources for navigating data management challenges. Researchers can also find guidance on handling sensitive data and links to relevant tools and services.	Health data Human data GDPR compliance
Swedish Biodiversity Data Infrastructure (SBDI)	SBDI is a national research infrastructure that provides access to biodiversity data and related services for researchers in Sweden. SBDI aims to facilitate the use of biodiversity data for research, conservation, and sustainable development.	Biodiversity Data publication
Swedish Pathogens Portal	The Swedish Pathogens Portal provides researchers with information, guidelines, tools and services that enable effective use of national and European infrastructures for sharing data related to pathogens. Related resource: Pathogens Portal.	COVID-19 Data Portal Human data Data sensitivity Data publication
Swedish Reference Genome Portal	A web platform for aggregating, sharing, and visualising non-human eukaryotic genome assemblies and genome annotations (co-)produced by researchers affiliated with Swedish institutions.	Biodiversity Data publication