Your tasks: Data publication

Can you really deposit your data in a public repository?

Description

Sometimes it is difficult to determine whether publishing the data you have at hand is the right thing to do. Reasons for hesitation might be that you have not yet used the data in a publication and don’t want to be scooped, that the data contains personal information about patients, or that the data was collected or produced in a collaboration.

Considerations

Publishing data does not necessarily mean making it open access or public. Data can be published with closed or restricted access.
Data doesn’t have to be published immediately while you are still working on the project. Data can be made available during the revision of the paper or after the publication of the paper.
Make sure to have the rights or permissions to publish the data.
- Is the data commercially-sensitive?
- Does the data contain confidential/restricted information?
- Who controls the data?

Solutions

If ethical, legal or contractual issues apply to your data type (e.g. personal or sensitive data, confidential or third-party data, data with copyright, data with potential economic or commercial value, intellectual property or IP data, etc.), ask for help and advice from the Legal Team, Tech Transfer Office, and/or Data Protection Officer of your institute.
Decide what is the right type of access for your data, for instance:
- Open access.
- Registered access or with authentication procedure.
- Controlled access or via Data Access Committees (DACs).
Decide what licence should be applied to your metadata and data.
Certain repositories offer solutions for depositing data that need to be under restricted access. This allows data to be findable even when it cannot be published openly. One example is The European Genome-phenome Archive (EGA), which can be used to deposit potentially identifiable genetic and phenotypic human data.
Many repositories provide the option to put an embargo on a deposited dataset. This might be useful if you prefer to use the data in a publication before making it available for others to use.
Establish an agreement outlining the controllership of the data and each collaborator’s rights and responsibilities.
Even if the data cannot be published, it is good practice to publish the metadata of your datasets.
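The access types, licence and embargo described above can be written down as a small record before you approach a repository. The following is a minimal sketch only: the class names, fields and the example dataset are hypothetical, and real repositories define their own access levels and embargo rules.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class AccessLevel(Enum):
    OPEN = "open"
    REGISTERED = "registered"    # access after authentication
    CONTROLLED = "controlled"    # access granted by a Data Access Committee


@dataclass
class DepositPlan:
    """Hypothetical record of the publication decisions for one dataset."""
    title: str
    access: AccessLevel
    licence: str
    embargo_until: Optional[date] = None  # None means no embargo

    def metadata_is_public(self) -> bool:
        # Metadata is published in every case so the dataset stays findable.
        return True

    def data_is_openly_downloadable(self, today: date) -> bool:
        # Only open-access data with no active embargo is freely downloadable.
        if self.access is not AccessLevel.OPEN:
            return False
        return self.embargo_until is None or today >= self.embargo_until


# Example values are made up for illustration.
plan = DepositPlan(
    title="Example expression dataset",
    access=AccessLevel.OPEN,
    licence="CC-BY",
    embargo_until=date(2030, 1, 1),
)
print(plan.data_is_openly_downloadable(date(2029, 6, 1)))
```

Keeping the metadata check separate from the data check mirrors the point that metadata can be published even when the data itself cannot.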

Which repository should you use to publish your data?

Description

Once you have completed your experiments and performed quality control of your data, it is good scientific practice to share your data in a public repository. Publishing your data is often required by funders and publishers.

The most suitable repository will depend on the data type and your discipline.

Considerations

What type of data are you planning to publish?
Does the repository need to provide solutions for restricted access for sensitive data?
Do you have the rights to publish the data via the repository?
How sustainable is the repository? Will the data remain public over time?
How FAIR is the repository?
Does the funding agency or the scientific journal pose specific requirements regarding data sharing?
What are the repository’s policies concerning licences and data reuse?

Solutions

Based on the possible ethical, legal and contractual implications of your data, decide:
- The right type of access for your data.
- The licence that should be applied to your metadata and data.
Check whether, and which, discipline-specific repositories can apply the necessary access conditions and licences to your (meta)data.
Discipline-specific repositories: if a discipline-specific repository recognised by the community exists, it should be your first choice, since discipline-specific repositories often increase the FAIRness of the data.
- The EMBL-EBI’s data submission wizard can help you choose a suitable repository based on your data type.
- There are lists of discipline-specific, community-recognised repositories e.g.:
  - ELIXIR Deposition Databases for Biomolecular Data including ArrayExpress, BioModels, BioStudies, European Nucleotide Archive (ENA), PDB, PRIDE
  - Scientific Data journal’s recommended repositories
  - Wellcome Open Research - Data Guidelines
Check if there are repositories available for specific data formats, such as images (e.g. BioImageArchive, EMPIAR) or earth and environmental science data (e.g. PANGAEA).
General-purpose and institutional repositories: For other cases, a repository that accepts data of different types and disciplines should be considered. It could be a general-purpose repository, such as Zenodo, Mendeley data, FigShare, Dryad or a centralised repository provided by your institution or university.
re3data and Repository Finder gather information about existing repositories and allow you to filter them based on access and licence types.
re3data and FAIRsharing websites gather features of repositories, which you can filter by discipline, data type, taxonomy and many other features.
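The order of preference described above (a discipline-specific repository first, a general-purpose one otherwise) can be sketched as a simple lookup. This is an illustration under stated assumptions: the catalogue is hand-written from repositories named on this page, the data-type labels are hypothetical, and in practice you would consult re3data, FAIRsharing or Repository Finder rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repository:
    name: str
    data_types: frozenset  # an empty set marks a general-purpose repository


# Illustrative catalogue only; the data-type labels are assumptions.
CATALOGUE = [
    Repository("European Nucleotide Archive (ENA)", frozenset({"nucleotide sequence"})),
    Repository("PRIDE", frozenset({"proteomics"})),
    Repository("BioImageArchive", frozenset({"image"})),
    Repository("EMPIAR", frozenset({"image"})),
    Repository("PANGAEA", frozenset({"earth and environmental science"})),
    Repository("Zenodo", frozenset()),
    Repository("Dryad", frozenset()),
]


def shortlist(data_type, catalogue=CATALOGUE):
    """Return discipline-specific matches first, then general-purpose options."""
    specific = [r.name for r in catalogue if data_type in r.data_types]
    general = [r.name for r in catalogue if not r.data_types]
    return specific + general


print(shortlist("proteomics"))
print(shortlist("survey"))
```

A shortlist like this is only a starting point; access conditions, licences and sustainability still have to be checked for each candidate.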

How do you prepare your data for publication in data repositories?

Description

Once you have decided where to publish your data, you will have to make your (meta)data ready for repository submission. For this reason, it is recommended to become aware of the repository’s requirements before you start collecting the data.

Considerations

What file formats should be used for the data?
How is the data uploaded?
What metadata do you need to provide?
Under which licence should the data be published?
Should sensitive data and metadata be anonymised or pseudonymised prior to publication? This could notably be the case if you work with human data.
After data is submitted to a public repository, should the original copy of the data be retained at the central brokering platform and linked to its public counterpart? Or should it be removed and replaced with the ID of the public record?
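To make the pseudonymisation question concrete, one common technique is to replace direct identifiers with keyed hashes. The sketch below is an assumption-laden illustration, not a recommendation from this page: the function name, the `P-` prefix and the 16-character length are hypothetical, and whether this approach is sufficient for your data is a question for your Data Protection Officer.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256.

    The same identifier and key always give the same pseudonym, so records
    stay linkable, while the key (stored separately and never published)
    is needed to reproduce the mapping.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return "P-" + digest.hexdigest()[:16]


# The key and patient identifier below are made-up example values.
key = b"keep-this-key-out-of-the-published-dataset"
print(pseudonymise("patient-0042", key))
```

Pseudonymised data can still count as personal data, because whoever holds the key can re-link it; anonymisation requires that re-identification is no longer reasonably possible.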

Solutions

Learn the following information about the chosen repositories:
- Required metadata schemas
- Required ontologies or controlled vocabularies
- Accepted file formats for data and metadata
- Costs for sharing and storing data
Repositories generally have information about data formats, metadata requirements and how data can be uploaded under a section called “submit”, “submit data”, “for submitters” or something similar. Read this section in detail.
To ensure reusability, data should be released with a clear and accessible data usage licence. We suggest making your data available under licences that permit free reuse of data, e.g. a Creative Commons licence, such as CC0 or CC-BY.
- Note that a repository may apply a single default licence to all datasets. For instance, sequence data submitted to the European Nucleotide Archive (ENA) are implicitly free to reuse by others, as specified by the International Nucleotide Sequence Database Collaboration (INSDC).
See the corresponding pages for more detailed information about metadata, licences and data transfer.
There are many tools available to remove human reads from your non-human data, e.g. Metagen-FastQC.
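The information listed above (required metadata, accepted file formats, licence) can be turned into a quick check that you run before uploading. This is a minimal sketch: the required fields and accepted extensions are hypothetical placeholders, and should be replaced with what your chosen repository actually documents in its submission section.

```python
# Hypothetical requirements; copy the real ones from the repository's
# "submit data" documentation.
REQUIRED_METADATA = {"title", "creators", "description", "licence"}
ACCEPTED_EXTENSIONS = {".csv", ".tsv", ".fastq.gz"}


def check_submission(metadata: dict, filenames: list) -> list:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # A field counts as present only if it has a non-empty value.
    provided = {key for key, value in metadata.items() if value}
    for field in sorted(REQUIRED_METADATA - provided):
        problems.append(f"missing metadata field: {field}")
    for name in filenames:
        if not any(name.lower().endswith(ext) for ext in ACCEPTED_EXTENSIONS):
            problems.append(f"file format not accepted: {name}")
    return problems


issues = check_submission({"title": "Example dataset"}, ["counts.xlsx"])
for issue in issues:
    print(issue)
```

Running such a check early, ideally before data collection starts, makes it easier to capture the metadata the repository will ask for.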

How do you update or delete a published entry from a data repository?

Description

You will sometimes need to update or delete entries that were incomplete or wrongly submitted for publication. Note, however, that when a new record is created, data is generally tagged for distribution and selected metadata fields may be exchanged with other repositories. Redistribution of updated records may not be triggered automatically, and fully updating records can be a time-consuming and manual process for the repository. Also, in general, submitted data may not be deleted, but may be suppressed from public view upon request. It is therefore safer to submit the right entry from the start than to update it or ask for its withdrawal at a later stage.

Considerations

Does the repository offer the possibility to update a submission? For the data submitter, is this a manual procedure (e.g. email, web interface) or is it available through an Application Programming Interface (API) or Command Line Interface (CLI)?
Does the repository offer the possibility to delete (or hide) submissions?
Does the repository have a test server where data can be submitted for testing purposes?

Solutions

Solutions are very much repository-dependent. For example, on the European Nucleotide Archive (ENA), entries can be easily updated using a CLI. However, the updated information is not automatically redistributed to other registries linked to ENA. Upon email request, entries may also be suppressed from public view. Note that ENA also has a test server to make test submissions before submitting to the actual production server, which can be very useful when sending large batches of data to test for any systematic errors. Please check these points with your repository of choice.
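As an illustration of testing an update before touching production, the sketch below builds a submission document and selects a test or production endpoint without sending anything. Everything specific here is an assumption to verify against ENA's own submission documentation: the two URLs and the SUBMISSION/ACTIONS/ACTION XML layout are given as I understand them, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed endpoints; confirm both in the ENA documentation before use.
ENA_TEST_URL = "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/"
ENA_PRODUCTION_URL = "https://www.ebi.ac.uk/ena/submit/drop-box/submit/"


def submission_endpoint(test: bool = True) -> str:
    """Default to the test server so mistakes do not reach production."""
    return ENA_TEST_URL if test else ENA_PRODUCTION_URL


def build_submission_xml(action: str) -> str:
    """Build a submission document with one action: ADD (new) or MODIFY (update)."""
    if action not in {"ADD", "MODIFY"}:
        raise ValueError(f"unsupported action: {action}")
    submission = ET.Element("SUBMISSION")
    actions = ET.SubElement(submission, "ACTIONS")
    action_element = ET.SubElement(actions, "ACTION")
    ET.SubElement(action_element, action)
    return ET.tostring(submission, encoding="unicode")


print(submission_endpoint(test=True))
print(build_submission_xml("MODIFY"))
```

Defaulting to the test endpoint means a large batch is first checked for systematic errors, and production is only used when explicitly requested.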

Should you include dataset accession numbers in pre-prints or theses?

Description

Researchers often deposit their data in public databases (e.g. ArrayExpress, Gene Expression Omnibus (GEO)) and receive an accession number for that data. They often place the data under embargo, a period during which the dataset remains unavailable to the public, until the final publication. Including accession number(s) in a pre-print or a thesis could end the embargo and make the dataset publicly available before the final publication. Whether to include accession number(s) therefore depends on whether the researcher wants to make the data publicly accessible early or keep it private until later.

Considerations

Pre-prints and theses are often treated as citable publications.
Some databases may release data as soon as an accession number is cited in a pre-print or in a published thesis.
Policies vary between repositories: some release data proactively, while others do so only upon inquiry.
Once an accession number is made public in a pre-print, the dataset may become publicly accessible, even if the pre-print is intended to be temporary and even if a second manuscript describing the same data is pending.

Solutions

If public availability before final publication is not an issue, citing the accession number ensures accessibility.
Check the guidelines or policies of the data repository of interest before deciding whether to include the accession number in a pre-print. For example, GEO’s policy requires that data be publicly accessible once an accession number is cited, even if the manuscript is a pre-print. More information can be found here: GEO FAQ.
To avoid unintended data release, authors should refrain from including accession numbers in pre-prints or mask them.
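If you decide to mask accession numbers, a quick scan of the manuscript text can help catch ones you forgot. The sketch below is illustrative only: the regular expression covers three identifier shapes (GEO series such as GSE12345, ArrayExpress such as E-MTAB-1234, and INSDC project accessions such as PRJEB1234) and is an assumption about their format, not a complete or official list; the function names and placeholder text are hypothetical.

```python
import re

# Assumed accession shapes; extend this for the repositories you use.
ACCESSION_PATTERN = re.compile(
    r"\b(?:GSE\d+|E-[A-Z]{4}-\d+|PRJ[EDN][A-Z]\d+)\b"
)


def find_accessions(text: str) -> list:
    """List every accession-like string found in the text, in order."""
    return ACCESSION_PATTERN.findall(text)


def mask_accessions(text: str, placeholder: str = "[accession masked]") -> str:
    """Replace every accession-like string with a placeholder."""
    return ACCESSION_PATTERN.sub(placeholder, text)


draft = "Data are available in GEO (GSE12345) and ArrayExpress (E-MTAB-6789)."
print(find_accessions(draft))
print(mask_accessions(draft))
```

A pattern-based scan can miss identifiers in figures, supplementary files or unusual formats, so it complements rather than replaces a manual read-through.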

More information

Links to FAIR Cookbook

FAIR Cookbook is an online, open and live resource for the Life Sciences with recipes that help you to make and keep data Findable, Accessible, Interoperable and Reusable; in one word FAIR.

Depositing to generic repositories - Zenodo use case

Tools and resources on this page

Tool or resource	Description	Related pages	Registry
ArrayExpress	A collection in BioStudies for archiving and publishing data from high-throughput functional genomics experiments.	Microbial biotechnology Single-cell sequencing Data interlinking	Tool info Standards/Databases Training
BioImageArchive	The BioImage Archive stores and distributes biological images that are useful to life-science researchers.	Biodiversity Bioimaging data	Standards/Databases
BioModels	A repository of mathematical models for application in biological sciences	Enzymology and biocata... Microbial biotechnology	Tool info Standards/Databases Training
BioStudies	A database hosting datasets from biological studies. Useful for storing or accessing life sciences data without community-accepted repositories, and for linking components of data from multi-omics studies.	Microbial biotechnology Plant sciences Single-cell sequencing Data interlinking Project data managemen...	Tool info Standards/Databases Training
Dryad	Open-source, community-led data curation, publishing, and preservation platform for CC0 publicly available research data	Bioimaging data Biomolecular simulatio...	Standards/Databases Training
ELIXIR Deposition Databases for Biomolecular Data	List of discipline-specific deposition databases recommended by ELIXIR.	CSC IFB Marine Metagenomics NeLS Data discoverability Data interlinking Documentation and meta...	Standards/Databases Training
EMBL-EBI's data submission wizard	EMBL-EBI's wizard for finding the right EMBL-EBI repository for your data.
EMPIAR	Electron Microscopy Public Image Archive is a public resource for raw, 2D electron microscopy images. You can browse, upload and download the raw images used to build a 3D structure	OMERO Bioimaging data Structural Bioinformatics	Tool info Standards/Databases Training
European Nucleotide Archive (ENA)	A record of sequence information scaling from raw sequencing reads to assemblies and functional annotation	Galaxy Plant Genomics Biodiversity Epitranscriptome data Human pathogen genomics Microbial biotechnology Single-cell sequencing Virology Data brokering Data interlinking Project data managemen...	Tool info Standards/Databases Training
FAIRsharing	A curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.	FAIRtracks Health data Microbial biotechnology Plant sciences Virology Data discoverability Data provenance Existing data Machine actionability Documentation and meta...	Standards/Databases Training
FigShare	Data publishing platform SciLifeLab Data Repository (Figshare)	Biomolecular simulatio... Enzymology and biocata... Identifiers Documentation and meta...	Standards/Databases Training
Gene Expression Omnibus (GEO)	A repository of MIAME-compliant genomics data from arrays and high-throughput sequencing	Microbial biotechnology Single-cell sequencing	Standards/Databases
International Nucleotide Sequence Database Collaboration	The International Nucleotide Sequence Database Collaboration (INSDC) is a long-standing foundational initiative that operates between DDBJ, EMBL-EBI and NCBI. INSDC covers the spectrum of data raw reads, through alignments and assemblies to functional annotation, enriched with contextual information relating to samples and experimental configurations.	Galaxy Biodiversity Microbial biotechnology Plant sciences	Training
International Nucleotide Sequence Database Collaboration (INSDC)	A collaborative database of genetic sequence datasets from DDBJ, EMBL-EBI and NCBI	Galaxy Biodiversity Microbial biotechnology Plant sciences	Tool info Training
Mendeley data	Multidisciplinary, free-to-use open repository specialized for research data	Biomolecular simulatio... Existing data	Standards/Databases
Metagen-FastQC	Cleans metagenomic reads to remove adapters, low-quality bases and host (e.g. human) contamination
PANGAEA	Data Publisher for Earth and Environmental Science		Tool info Standards/Databases Training
PDB	The Protein Data Bank (PDB)	Galaxy Intrinsically disorder... Structural Bioinformatics	Tool info Training
PRIDE	PRoteomics IDEntifications (PRIDE) Archive database	Proteomics	Tool info Standards/Databases Training
re3data	Registry of Research Data Repositories	Data discoverability Existing data Licensing	Training
Repository Finder	Repository Finder can help you find an appropriate repository to deposit your research data. The tool is hosted by DataCite and queries the re3data registry of research data repositories.
The European Genome-phenome Archive (EGA)	EGA is a service for permanent archiving and sharing of all types of personally identifiable genetic and phenotypic data resulting from biomedical research projects Federated EGA Finland Federated EGA Norway node FEGA Sweden	CSC TSD Cancer data Human data Virology Data interlinking	Tool info Standards/Databases Training
Wellcome Open Research - Data Guidelines	Wellcome Open Research requires that the source data underlying the results are made available as soon as an article is published. This page provides information about data you need to include, where your data can be stored, and how your data should be presented.		Standards/Databases
Zenodo	Generalist research data repository built and developed by OpenAIRE and CERN	FAIRtracks Plant Phenomics Bioimaging data Biomolecular simulatio... Enzymology and biocata... Plant sciences Single-cell sequencing Identifiers	Standards/Databases Training

National resources

Tools and resources tailored to users in different countries.

Tool or resource	Description	Related pages
DORA	Digital Object Repository at Lib4RI (the library of the 4 ETH Domain Research Institutes: EAWAG, EMPA, PSI and WSL).
OLOS	OLOS is a Swiss-based data management portal, to help Swiss researchers safely manage, publish and preserve their data.	Data storage
SWISSUbase	SWISSUbase is a national cross-disciplinary solution for Swiss universities and other research organisations in need of local institutional data repositories for their researchers. The platform relies on international archiving standards and processes to ensure that data are preserved and accessible in the long-term.	Data storage
PUBLISSO	Open access publishing platform for life sciences.	Researcher Data Steward
Fairdata.fi	With the Fairdata Services you can store, share and publish your research data with easy-to-use web tools.	CSC Researcher Data Steward Data storage Existing data
Federated EGA Finland	FEGA allows you to store and share sensitive data in Finland in a way that fulfils all the requirements of the General Data Protection Regulation (GDPR). The European Genome-phenome Archive (EGA)	CSC Researcher Data Steward Data sensitivity Existing data Human data
Sensitive Data Services for Research	CSC Sensitive Data Services for Research are designed to support secure sensitive data management through web-user interfaces accessible from the user’s own computer.	CSC Researcher Data Steward Data sensitivity Data analysis Data storage Human data
National instance of Genomic Data Infrastructure for ELIXIR Greece	An instance of the Genomic Data Infrastructure GDI on ELIXIR Greece, for secure genomic data management, including storage, discovery, access, and reception. This is a pilot instance based on the GDI Starter Kit.	Human data Data sensitivity Existing data
FAIR-Aware	Online tool which helps researchers and data managers assess how much they know about the requirements for making datasets findable, accessible, interoperable, and reusable (FAIR) before uploading them into a data repository.	Researcher Data management plan Compliance monitoring ...
DataverseNO	DataverseNO is a national, generic repository for open research data. Various Norwegian research institutions have established partner agreements about using DataverseNO as institutional repositories for open research data. DATAVERSE
Federated EGA Norway node	Federated instance collects metadata of -omics data collections stored in national or regional archives and makes them available for search through the main EGA portal. With this solution, sensitive data will not physically leave the country, but will reside on TSD. The European Genome-phenome Archive (EGA)	Human data Data sensitivity Existing data TSD
Pathogens Portal Norway	The portal provides information about available datasets, resources, tools, and services related to pandemic preparedness in Norway. The portal gives researchers, clinicians and policymakers access to an extensive collection of biomolecular data about pathogens. Pathogens Portal	Human data Data sensitivity Existing data COVID-19 Data Portal
usegalaxy.no	Galaxy is an open-source, web-based platform for data-intensive biomedical research. This instance of Galaxy is coupled with NeLS for easy data transfer. Galaxy	Data analysis Data sensitivity Existing data NeLS
dados.gov.pt	Open data portal of the Portuguese Public Administration.	Researcher Data Steward Existing data
INDEXAR	National directory of repositories and digital scientific journals, in the fields of science and culture.	Researcher Data Steward Existing data
FEGA Sweden	The Swedish node of Federated European Genome-phenome Archive (FEGA), which offers secure archiving and sharing of genetic and phenotypic data resulting from Swedish biomedical research projects. The European Genome-phenome Archive (EGA)	Human data Data sensitivity Existing data GDPR compliance
NBIS Data Management Consultation	Free consultation service regarding data management questions in life science research.	Data management plan Data sensitivity
NBIS data submission documentation	A GitHub repository with documentation and guidelines for submitting data to domain-specific repositories, provided by NBIS data stewards.
Researchdata.se	Researchdata.se is a national web portal where you can find, share, and reuse research data from a wide range of research fields. You can also find helpful advice and recommendations on managing research data.	Existing data Data management plan
SciLifeLab Data Repository (Figshare)	A repository for publishing life science research outputs that do not fit within existing domain-specific repositories, including datasets, documents, figures, and presentations, hosted on SciLifeLab’s local Figshare instance. FigShare	Existing data
Swedish Biodiversity Data Infrastructure (SBDI)	SBDI is a national research infrastructure that provides access to biodiversity data and related services for researchers in Sweden. SBDI aims to facilitate the use of biodiversity data for research, conservation, and sustainable development.	Biodiversity Existing data
Swedish Pathogens Portal	The Swedish Pathogens Portal provides researchers with information, guidelines, tools and services that enable effective use of national and European infrastructures for sharing data related to pathogens. Pathogens Portal	COVID-19 Data Portal Human data Data sensitivity Existing data
Swedish Reference Genome Portal	A web platform for aggregating, sharing, and visualising non-human eukaryotic genome assemblies and genome annotations (co-)produced by researchers affiliated with Swedish institutions.	Biodiversity Existing data