Your domain: Virology

Introduction

Virology is a rapidly evolving field that generates diverse and complex datasets, from genomic sequences to clinical trial results and epidemiological data. Effective research data management (RDM) is essential to ensure that these valuable datasets are organised, shared, and preserved in a manner that enables long-term impact.

Data Heterogeneity

Description

Outbreak surveillance in virology requires the integration of multiple data types, including sequencing, clinical, and epidemiological data. These data types originate from diverse sources, such as hospitals, research laboratories, and public health agencies, making their harmonisation and interoperability critical. Without standardised data management approaches, inconsistencies in metadata, sampling protocols, and data formats can hinder effective outbreak response, cross-study comparisons, and reproducibility of findings.

Considerations

Key challenges in managing data heterogeneity include:

Standardising sample collection protocols across clinical and environmental settings to ensure comparability.
Integrating epidemiological and genomic data to enhance outbreak detection and support coordinated global responses.

Solutions

To effectively manage data heterogeneity in virology outbreak surveillance, several domain-specific solutions are recommended:

The Data Provenance page provides comprehensive guidance on capturing metadata, documenting data origins, and maintaining transparent records, ensuring data traceability and reproducibility.
Use standardised protocols for sample collection: identify and review existing guidelines from organisations such as the World Health Organization (WHO), the European Centre for Disease Prevention and Control (ECDC), and the International Organization for Standardization (ISO) to ensure alignment with best practices.
Implement metadata standards for data interoperability
- The Documentation and metadata page offers practical recommendations and standards for accurately describing data, ensuring consistent interpretation, interoperability, and long-term usability.
- Apply Minimum Information about any (x) Sequence (MIxS) and the derived Minimum Information about an Uncultivated Virus Genome (MIUViG) for annotating viral sequence data with the necessary metadata fields. Visit the NCBI submission portal for an overview.
- Adapt generic dataset descriptions (e.g. Data Catalog Vocabulary (DCAT), Dublin Core Metadata Terms, Bioschemas) for basic dataset descriptions.
- Follow BioSamples Metadata Schema from EBI for structured viral sample metadata.
Ensure metadata compliance with international repositories
- The Data Brokering page describes strategies and tools for integrating, harmonising, and translating data across diverse formats and sources.
- Format metadata to meet the submission requirements of repositories like Global Initiative on Sharing All Influenza Data (GISAID) and European Nucleotide Archive (ENA) to ensure smooth data deposition and retrieval. For instance, search for virology-related checklists in the ENA sample checklists.
- Utilise validation tools and checklists to ensure completeness and accuracy before data submission.
Use FAIRsharing resources for metadata selection
- Refer to the EVORA collection of FAIRsharing-referenced metadata standards for pandemic preparedness and response.
- Consult FAIRsharing registries to identify suitable standards for different types of virology datasets.
Integrate epidemiological and genomic data for outbreak analysis
- Utilise phylodynamic workflows (e.g. BEAST, Nextstrain, or TreeTime) to enable real-time outbreak tracking.
- Implement structured pipelines for linking viral genome sequences with patient and outbreak metadata.
Apply domain-specific vocabularies for consistent annotation
- Utilise the EVORA Ontology and the ICTV taxonomy to ensure that virus-related metadata terms are standardised and interoperable.
- Monitor updates in virology-specific ontologies to maintain alignment with evolving standards.

By implementing these best practices, outbreak surveillance data can be better structured, more interoperable, and more effectively shared, ultimately improving global response efforts to viral outbreaks.
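
The completeness check and the linking of sequences to sample metadata recommended above can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: the field names in `REQUIRED_FIELDS` and the helpers `missing_fields` and `link_sequences` are hypothetical, and a real pipeline would take its required fields from the checklist of the target repository (for example an ENA sample checklist or a MIxS checklist).

```python
# Hypothetical required fields, for illustration only; the real list comes
# from the checklist of the repository you submit to.
REQUIRED_FIELDS = ("sample_id", "collection_date", "geo_location", "host")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a record."""
    return [
        field
        for field in REQUIRED_FIELDS
        if record.get(field) is None or not str(record[field]).strip()
    ]


def link_sequences(sequences: dict[str, str], metadata: list[dict]):
    """Pair each complete metadata record with its sequence by sample_id.

    Returns (linked, rejected): linked maps sample_id -> (sequence, record),
    rejected maps sample_id -> the reason the record was not linked.
    """
    linked, rejected = {}, {}
    for record in metadata:
        sample_id = record.get("sample_id") or "<no id>"
        gaps = missing_fields(record)
        if gaps:
            rejected[sample_id] = "missing: " + ", ".join(gaps)
        elif sample_id not in sequences:
            rejected[sample_id] = "no sequence for this sample"
        else:
            linked[sample_id] = (sequences[sample_id], record)
    return linked, rejected
```

Keeping the rejected records together with a reason, rather than silently dropping them, makes it easier to send incomplete metadata back to the data producer before submission.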

Data Sensitivity & Ethics

Description

Handling sensitive data in virology outbreak surveillance involves managing patient data, ensuring proper anonymisation, and complying with regulations such as the General Data Protection Regulation (GDPR). The ethical and legal aspects of handling such data are crucial to maintaining patient privacy while allowing researchers and public health authorities to analyse and respond to viral threats effectively. Proper data governance strategies must balance the need for data accessibility and data security, ensuring that only authorised personnel can access identifiable information. Secure storage, controlled access mechanisms, and appropriate anonymisation techniques must be implemented to meet legal and ethical standards.

Considerations

Key challenges in managing data sensitivity and ethics include:

Ensuring compliance with GDPR and other legal frameworks when collecting, processing, and sharing sensitive patient data.
Anonymising and pseudonymising patient data to reduce privacy risks while preserving data utility for research.
Implementing secure and scalable storage solutions that support controlled data access, encryption, and long-term preservation.

Solutions

Organisations must implement robust governance policies, encryption techniques, and controlled data access mechanisms to maintain patient privacy while enabling scientific advancements. The following strategies outline best practices for secure data management and ethical compliance.

Ensure GDPR and regulatory compliance
- For guidance on managing sensitive data ethically and in compliance with regulations, the pages on GDPR compliance, Ethical aspects, and Compliance monitoring & measurement provide detailed recommendations on protecting personal data, navigating ethical responsibilities, and implementing effective compliance monitoring practices.
- Identify and mitigate privacy risks associated with human and health-related data (see Human data and Health data domain pages).
- Establish clear data governance policies defining roles and responsibilities regarding data processing and sharing.
- Ensure contractual coverage for data exchanges through appropriate agreements (e.g. Data Use Agreements, Consortium Agreements, Data Sharing Agreements), explicitly outlining data usage, retention, reuse, and publication policies. Material Transfer Agreements (MTAs) should similarly address the handling of human data derived from received samples.
- You can also check the practices included in the Infectious Diseases Toolkit (IDTk) related to the management of human and pathogen data in the context of infectious diseases.
Implement effective data anonymisation and pseudonymisation strategies to reduce data sensitivity and protect privacy, following best practices detailed on the Data Sensitivity page.
Adopt secure and scalable data storage solutions to ensure long-term data integrity, controlled access, and compliance with relevant standards and best practices, as described on the Data Storage page.
- Store sensitive data in repositories designed for controlled access (e.g. The European Genome-phenome Archive (EGA)).
- Use encryption for data at rest and in transit to prevent unauthorised access.
- Establish strict access control policies, using role-based access and multi-factor authentication where appropriate.
Implement best practices for data versioning and backup to maintain clear data histories, prevent loss, and enhance reproducibility, following guidelines provided on the Data Organisation page.
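
As an illustration of the pseudonymisation step mentioned above, the sketch below replaces a patient identifier with a keyed hash (HMAC-SHA256) and drops direct identifiers from a record. It is only one possible approach, shown under assumptions: the field names (`patient_id`, `name`, `date_of_birth`) and the helpers `pseudonymise` and `strip_direct_identifiers` are hypothetical, and the secret key is assumed to be stored separately from the data under strict access control. Pseudonymised data generally remain personal data under GDPR, so this step reduces risk but does not remove the need for the other safeguards listed.

```python
import hashlib
import hmac


def pseudonymise(patient_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from a patient identifier with HMAC-SHA256."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    # Truncated to 16 hex characters for readability; keep the full digest
    # if collisions across very large cohorts are a concern.
    return digest.hexdigest()[:16]


def strip_direct_identifiers(record: dict, secret_key: bytes) -> dict:
    """Replace patient_id with a pseudonym and drop direct identifiers."""
    # Hypothetical field names, for illustration only.
    direct_identifiers = {"patient_id", "name", "date_of_birth"}
    cleaned = {k: v for k, v in record.items() if k not in direct_identifiers}
    cleaned["pseudonym"] = pseudonymise(record["patient_id"], secret_key)
    return cleaned
```

Because the hash is keyed, the same patient always maps to the same pseudonym within a project, while someone without the key cannot recompute the mapping from a list of known identifiers.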

Data Sharing & Access

Description

Effective data sharing and access enable timely surveillance of viral genome evolution, facilitating early detection of mutations and emerging variants. This rapid exchange of outbreak information supports swift global responses while simultaneously addressing critical concerns around data security and patient privacy. Researchers have a responsibility to share outbreak-related data ethically, legally, and efficiently, allowing authorised users appropriate access to sensitive materials, such as patient records and genomic sequences, in alignment with national and EU regulatory frameworks. Furthermore, assigning persistent identifiers and utilising structured data repositories enhance the long-term usability, traceability, and discoverability of outbreak data, significantly improving preparedness and response strategies.

Considerations

Key challenges in data sharing and access include:

Balancing accessibility with security, ensuring that sensitive data is accessible to authorised users without compromising privacy.
Adhering to national and international regulations to ensure ethical data sharing.
Ensuring proper data deposition in repositories that align with domain-specific standards.
Maintaining dataset integrity and discoverability using persistent identifiers like DOIs and accession numbers.

Solutions

To ensure secure and effective data sharing in virology outbreak surveillance, researchers and institutions should implement structured strategies:

Balance accessibility with security in data sharing by implementing measures that protect sensitive data while ensuring it remains discoverable and accessible, following the guidelines provided on the Data Security and Data Discoverability pages.
- Use controlled-access repositories, findable via VODAN - GO FAIR, to manage and access sensitive patient or outbreak data securely.
- Implement data access governance models that allow tiered access based on user credentials and need.
- Utilise Pathoplexus for managing metadata and access rights in a structured manner. It is an open-source database dedicated to the efficient sharing of human viral pathogen genomic data, fostering global collaboration and public health response.
Ensure proper deposition of outbreak data into trusted repositories to facilitate data sharing, citation, and reuse, following best practices outlined on the Data Publication page.
- Submit viral genomic sequences and epidemiological data to Global Initiative on Sharing All Influenza Data (GISAID), NCBI GenBank, and European Nucleotide Archive (ENA) for public accessibility.
- Use standard submission pipelines to ensure compliance with repository-specific metadata and formatting guidelines.
Assign persistent identifiers to datasets to enhance discoverability, citation, and long-term accessibility, following recommendations on the Identifiers page.
- Register DOIs or accession numbers for datasets to facilitate long-term accessibility and proper citation.
- Ensure that the metadata linked to persistent identifiers is properly formatted and recorded.
Comply with national and EU data-sharing regulations by consulting country-specific guidelines and legal frameworks available on the National Resources page.
- Align data-sharing practices with GDPR, national laws, and institutional ethical guidelines.
- Implement data-sharing agreements between collaborating institutions to formalise responsibilities.
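
The tiered access model mentioned above can be illustrated with a short Python sketch in which each access tier sees only a subset of a record's fields. The tier names, the field names, and the function `view_for_tier` are hypothetical and chosen for illustration; an actual governance model would be defined by the data access committee and enforced by the repository or its authentication infrastructure.

```python
# Hypothetical tiers and field names, for illustration only.
TIER_FIELDS = {
    "public": {"accession", "virus", "collection_year", "country"},
    "registered": {
        "accession", "virus", "collection_year", "country",
        "collection_date", "region",
    },
    "controlled": None,  # None means every field is visible
}


def view_for_tier(record: dict, tier: str) -> dict:
    """Return only the fields of a record that the given tier may see."""
    if tier not in TIER_FIELDS:
        raise ValueError(f"unknown access tier: {tier!r}")
    allowed = TIER_FIELDS[tier]
    if allowed is None:
        return dict(record)
    return {k: v for k, v in record.items() if k in allowed}
```

Listing the visible fields per tier explicitly means that a new field added to a record stays hidden from the lower tiers until someone decides to expose it.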

Your tasks

Data brokering

Information on brokering data to data repositories on behalf of data producers.

GDPR compliance

How to protect your research data, and how to make research data compliant with GDPR.

Ethical aspects

How to handle aspects of research data management that can raise ethical issues.

Compliance monitoring & measurement

How to measure compliance with data management regulations and standards.

Data sensitivity

How to identify the sensitivity of different research data types.

Data security

How to ensure that your data is handled securely.

Data discoverability

How to make data discoverable.

Identifiers

How to use identifiers for research data.

Tool assembly

COVID-19 Data Portal

The COVID-19 Data Portal brings together relevant datasets for sharing and analysis to accelerate coronavirus research.

More information

Links to FAIRsharing

FAIRsharing is a curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.

Virology collection

Tools and resources on this page

Tool or resource	Description	Related pages	Registry
BioSamples	BioSamples stores and supplies descriptions and metadata about biological samples used in research and development by academia and industry.	Plant Genomics Biodiversity Plant sciences Data interlinking	Tool info Standards/Databases Training
Bioschemas	Bioschemas aims to improve the Findability on the Web of life sciences resources such as datasets, software, and training materials	Plant Phenomics Enzymology and biocata... Intrinsically disorder... Data discoverability Machine actionability Documentation and meta...	Standards/Databases Training
Data Catalog Vocabulary (DCAT)	DCAT is an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web.	Rare disease data Data discoverability Machine actionability Documentation and meta...	Standards/Databases
Dublin Core Metadata Terms	A metadata standard for describing resources of any kind.	Data discoverability Data provenance Machine actionability Documentation and meta...	Standards/Databases
European Nucleotide Archive (ENA)	A record of sequence information scaling from raw sequencing reads to assemblies and functional annotation	Galaxy Plant Genomics Biodiversity Epitranscriptome data Human pathogen genomics Microbial biotechnology Single-cell sequencing Data brokering Data interlinking Data publication Project data managemen...	Tool info Standards/Databases Training
EVORA Ontology	European Viral Outbreak Response Alliance Ontology
FAIRsharing	A curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.	FAIRtracks Health data Microbial biotechnology Plant sciences Data discoverability Data provenance Data publication Existing data Machine actionability Documentation and meta...	Standards/Databases Training
GenBank	A database of genetic sequence information. GenBank may also refer to the data format used for storing information around genetic sequence data.	Microbial biotechnology	Tool info Standards/Databases Training
Global Initiative on Sharing All Influenza Data (GISAID)	A web-based platform for sharing viral sequence data, initially for influenza data, and now for other pathogens (including SARS-CoV-2).	Human pathogen genomics Machine actionability	Tool info Standards/Databases
ICTV	The International Committee on Taxonomy of Viruses (ICTV) taxonomic database is a maintained resource comprised of a universal taxonomic scheme for all the viruses infecting animals (vertebrates, invertebrates and protozoa), plants (higher plants and algae), fungi, bacteria and archaea.		Tool info Standards/Databases
Infectious Diseases Toolkit (IDTk)	Discover tools and best practices for working with infectious disease data. IDTk provides general guidance as well as specific information for pathogen characterisation, socioeconomic data, human biomolecular data, and human clinical and health data.	COVID-19 Data Portal Human pathogen genomics
Minimum Information about an Uncultivated Virus Genome (MIUViG)	MIxS - MIUViG (Minimum Information about an Uncultivated Virus Genome) is a checklist that is a part of the larger MIxS standard.		Standards/Databases
Minimum Information about any (x) Sequence (MIxS)	An overarching framework of standard metadata that includes sequence-type and technology-specific checklists.	Biodiversity Human pathogen genomics Marine metagenomics	Standards/Databases
Nextstrain	Nextstrain is an open-source project to harness the scientific and public health potential of pathogen genome data.		Tool info Standards/Databases Training
Pathoplexus	Pathoplexus is an open-source database designed to enhance the sharing and analysis of human viral pathogen genomic data.		Standards/Databases
The European Genome-phenome Archive (EGA)	EGA is a service for permanent archiving and sharing of all types of personally identifiable genetic and phenotypic data resulting from biomedical research projects Federated EGA Finland Federated EGA Norway node FEGA Sweden	CSC TSD Cancer data Human data Data interlinking Data publication	Tool info Standards/Databases Training