Your domain: Human pathogen genomics

Introduction

The human pathogen genomics domain focuses on studying the genetic code of organisms that cause disease in humans. Studies to identify and understand pathogens are conducted across different types of organisations ranging from research institutes to regional public health authorities. The aims can include urgent outbreak response, prevention measures, and developing remedies such as treatments and vaccines.

Data management challenges in this domain include the potential urgency of data sharing and secondary use of data across initiatives emerging from research, public health and policymakers. While pathogenic organisms are the object of interest, there are many considerations to take into account when dealing with samples collected from patients, pathogen surveillance, and human research subjects.

The genomic data can represent anything from the genetic sequence of a single pathogen isolate to various fragments of genetic materials from a flora of pathogens in a larger population. Other data can represent a wide range of contextual information about the human host, the disease, and various environmental factors.

Planning a study with pathogen genomic data

Description

While the objects of interest in this domain are pathogens, the data is usually derived from samples originating from human research subjects. This means that you must plan to either remove or handle human data during your study.

Considerations

What legal and ethical aspects do you need to consider?
- Can you separate pathogen and human host material and data?
- What data protection measures should be implemented in contracts and procedures dealing with suppliers and collaborators?
- What is the appropriate scope for the legal and ethical agreements necessary for the study?
- How should statements related to data processing be phrased to allow timely and efficient data sharing?
- How much time would be required to negotiate access to the samples and data for the study?
What public health and research initiatives should you consider aligning with?
- What data could be shared with or reused from other initiatives during the project?
- How will you align your practices with these initiatives to maximise the impact of the data and insight generated by the project?
- How will you share data with your collaborators and other initiatives?
What conventions will you adopt when planning your study?
- What existing protocols should you consider adopting for sample preparation, sequencing, variant calling and other operations?
- What conventions should you adopt for documenting your research?

Solutions

Working with human data

Ensure that the project’s procedures conform with good practices for handling human data. In particular, consult the following sections of the RDMkit:
- Planning for projects with human data
- Processing and analysing human data
You can also check the practices included in the Infectious Diseases Toolkit (IDTk) related to the management of human and pathogen data in the context of infectious diseases.

Isolate pathogen from host information

Depending on the pathogen and how it interacts with the host or the methods applied, it can be possible to generate clean isolates that do not contain host-related material. Data produced from a clean isolate could potentially be handled with fewer restrictions, while other data will be considered to be sensitive and will need protection.

Public health initiatives

National and international recommendations from public health authorities, epidemic surveillance programs and research data communities should be considered when planning a new study or surveillance programme. In particular, you could consult conventions for relevant surveillance programs while considering widely adopted guidelines for research documentation, and instructions from the data-sharing platforms.
- The European Centre for Disease Prevention and Control (ECDC) coordinates disease and laboratory networks, and issues surveillance and reporting protocols as well as technical guidance on sequencing.
- WHO has issued a genomic surveillance strategy and guidance on implementation for maximum impact on public health, and there are published reports that advise on implementing quality management systems in public health laboratories.
- Refer to national resources for information on regional authorities and national considerations.

Sequencing experiments

Good practices for genome experiments suggest that the documentation, at a minimum, should describe the design of the study or surveillance program, the collected specimens and how the samples were prepared, the experimental setup and protocols, and the analysis workflow.
- Adopt specific genomics and pathogen genomics recommendations such as those outlined in Stevens et al. (2020) [1].
- Refer to the general guidance on providing documentation and metadata during your project.
Adopt standards, conventions and robust protocols to maximise the reuse potential of the data in parallel initiatives and your future projects.
- The Genomic Standards Consortium (GSC) develops and maintains the Minimum Information about any (x) Sequence (MIxS) standards, including the Minimum Information about a (Meta)Genome Sequence (MIxS - MIGS/MIMS) checklists. These define core and extended descriptors for genomes and metagenomes, their associated samples and their environment, and guide scientists on how to capture the metadata essential for high-quality research.
- The GenEpiO Consortium develops and maintains the Genomic Epidemiology Ontology (GenEpiO) to support data sharing and integration specifically for foodborne infectious disease surveillance and outbreak investigations.
- The Public Health Alliance for Genomic Epidemiology (PHA4GE) supports openness and interoperability in public health bioinformatics. The Data Structures working group develops, adapts and standardises data models for microbial sequence data, contextual metadata, results and workflow metrics, such as the SARS-CoV-2 contextual data specification.
- The International Organization for Standardization (ISO) has issued standards that can be referenced when designing or commissioning genomic sequencing and informatics services, such as

Collecting and processing pathogen genomic data

Considerations

What information should you consider recording when collecting data?
- What should you note when collecting, storing, and preparing the samples?
- How will you capture information about the configuration and quality of the sequencing results?
- How will you ensure that the captured information is complete and correct?
What data and file formats should you consider for your project?
- What are the de-facto standards used for the experiment type and downstream analysis pipelines?
- Where are the instrument-specific aspects of the data and file formats documented?
What existing data will you integrate or use as a reference in your project?
- What reference genome(s) will you need access to?
- What is the recommended citation for the data and their versions?

Solutions

Filtering genomic reads corresponding to human DNA fragments

Data files with reads produced by sequencing experiments sometimes contain fragments of the host organism’s DNA. When the host is a human research subject, these fragments can be masked or removed to produce files that could potentially be handled with fewer restrictions. The approach chosen to mask the host-associated reads leads to different trade-offs. Make sure to include this as a factor in your risk assessment.
- Mapping to (human) host reference genomes can inadvertently leave some host-associated reads unmasked [2].
- Mapping to pathogen reference genomes can inadvertently mask some pathogen-associated reads and still leave some host-associated reads unmasked.
- The Galaxy Training Network tutorial "Removal of human reads from SARS-CoV-2 sequencing data" gives a worked example.
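The removal step described above can be sketched as follows. This is a minimal illustration, assuming the reads have already been mapped to a human reference by a separate tool and the identifiers of the host-matching reads are available as a set. The function names (`parse_fastq`, `read_id`, `drop_host_reads`) and the example reads are hypothetical, and production pipelines would normally rely on dedicated mapping and filtering tools rather than hand-written parsing.

```python
from typing import Iterable, Iterator, List, Set, Tuple

FastqRecord = Tuple[str, str, str, str]  # header, sequence, separator, quality


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield four-line FASTQ records from an iterable of text lines."""
    block: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line and not block:
            continue  # ignore blank lines between records
        block.append(line)
        if len(block) == 4:
            yield (block[0], block[1], block[2], block[3])
            block = []
    if block:
        raise ValueError("truncated FASTQ record at end of input")


def read_id(header: str) -> str:
    """Return the identifier: the text after '@' up to the first whitespace."""
    parts = header[1:].split()
    return parts[0] if parts else ""


def drop_host_reads(
    records: Iterable[FastqRecord], host_ids: Set[str]
) -> Tuple[List[FastqRecord], int]:
    """Return the records to keep and a count of the records removed."""
    kept: List[FastqRecord] = []
    removed = 0
    for record in records:
        if read_id(record[0]) in host_ids:
            removed += 1
        else:
            kept.append(record)
    return kept, removed


# Hypothetical example: read2 is assumed to have mapped to the human reference.
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n"
kept, removed = drop_host_reads(parse_fastq(fastq_text.splitlines()), {"read2"})
print(f"kept {len(kept)} read(s), removed {removed}")
```

Reporting the number of removed reads alongside the filtered file gives the risk assessment something concrete to work with, since either mapping strategy can leave residual host reads behind.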

Contextual information about the sample

Information about the host phenotype, context, and disease is often necessary to answer questions in a research study or policy perspective. Other contextual information can include non-host-related environmental factors, such as interactions with other pathogens, drugs and geographic proliferation. It can also include information about the sampled material and how it was processed for sequencing.
Adopt common reporting checklists, data dictionaries, terms and vocabularies to simplify data sharing across initiatives.
- European Nucleotide Archive (ENA) hosts a selection of sample checklists that can be used to annotate sequencing experiments, including checklists derived from Minimum Information about any (x) Sequence (MIxS). The ENA virus pathogen reporting standard checklist has been widely used for SARS-CoV-2 genomic studies.
- Reuse terms and definitions from existing vocabularies, such as the Phenotypic QualiTy Ontology (PATO), NCBI Taxonomy, Disease Ontology (DOID), ChEBI, and UBER anatomy ONtology (UBERON).
- The PHA4GE SARS-CoV-2 contextual data specification is a comprehensive example including a reporting checklist, related protocols, and mappings to relevant vocabularies and data sharing platforms.
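Checking records against a reporting checklist before submission can be sketched as below. The field names and patterns in `CHECKLIST` are hypothetical and only illustrate the idea; they are not the ENA or PHA4GE checklists, which define their own fields, formats and controlled vocabularies.

```python
import re
from typing import Dict, List

# Hypothetical checklist: field name -> regular expression the value must match.
CHECKLIST: Dict[str, str] = {
    "sample_id": r"\S+",
    "collection_date": r"\d{4}(-\d{2}(-\d{2})?)?",  # YYYY, YYYY-MM or YYYY-MM-DD
    "geographic_location": r".+",
    "host_taxon_id": r"\d+",  # e.g. an NCBI Taxonomy identifier
}


def validate_record(record: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one sample metadata record."""
    problems: List[str] = []
    for field, pattern in CHECKLIST.items():
        value = record.get(field, "").strip()
        if not value:
            problems.append(f"missing: {field}")
        elif not re.fullmatch(pattern, value):
            problems.append(f"invalid: {field}={value!r}")
    for field in record:
        if field not in CHECKLIST:
            problems.append(f"unexpected: {field}")
    return problems


# Hypothetical record with a free-text date and no host taxon identifier.
record = {
    "sample_id": "S001",
    "collection_date": "March 2021",
    "geographic_location": "Sweden",
}
for problem in validate_record(record):
    print(problem)
```

Running such a check when the data is captured, rather than at submission time, makes it easier to correct values while the people who collected the sample can still be consulted.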

Generating genomic data

Establish protocols and document the steps taken in the lab to process the sample and in the computational workflow to prepare the resulting data. Make sure to keep information from quality assurance procedures and strive to make your labwork and computational process as reproducible as possible.
- See the High-Throughput Sequencing page of LifeScienceRDMLookUp.
- The Beyond 1 Million genomes project provides guidelines that cover the minimum quality requirements [3] for the generation of genome sequencing data.
- Data repositories generally have information about recommended data file formats and metadata.
- The FAIR Cookbook provides instructions on validation of file formats.
- A good place to look for scientific and technical information about data quality validation software tools for pathogenomics is bio.tools.
- The Infectious Diseases Toolkit (IDTk) has a showcase on an automated SARS-CoV-2 genome surveillance system built around Galaxy.
- The Galaxy Training Network provides free online training materials on quality control.
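As an illustration of what a basic file format check involves, the sketch below tests the structure of FASTQ records. It is a simplified example rather than a replacement for established validation or quality control tools; the function name `check_fastq` is hypothetical, and the set of allowed bases is an assumption that excludes the other IUPAC ambiguity codes.

```python
from typing import List

ALLOWED_BASES = set("ACGTN")  # assumption: other IUPAC codes are not expected


def check_fastq(lines: List[str]) -> List[str]:
    """Return structural errors found in a list of FASTQ text lines."""
    errors: List[str] = []
    lines = [line.rstrip("\n") for line in lines]
    remainder = len(lines) % 4
    if remainder:
        errors.append(f"line count {len(lines)} is not a multiple of 4")
    # Check every complete four-line record.
    for start in range(0, len(lines) - remainder, 4):
        header, seq, sep, qual = lines[start:start + 4]
        number = start // 4 + 1
        if not header.startswith("@"):
            errors.append(f"record {number}: header does not start with '@'")
        if not sep.startswith("+"):
            errors.append(f"record {number}: separator does not start with '+'")
        if len(seq) != len(qual):
            errors.append(f"record {number}: sequence and quality lengths differ")
        if set(seq.upper()) - ALLOWED_BASES:
            errors.append(f"record {number}: unexpected characters in sequence")
    return errors


# Hypothetical records: the second has a bad base and a short quality string.
example = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGX", "+", "III"]
for error in check_fastq(example):
    print(error)
```

Keeping the output of checks like this together with the data is one way to retain the quality assurance information mentioned above.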

Sharing and preserving pathogen genomic data

Considerations

What data needs to be preserved by the project and for how long?
What is preserved by others and how would someone find and access the data?
What databases should you use to share human pathogen genomics data?
What other research information (such as protocols, computational tools, and samples) can the project share?

Solutions

Some host-related information can be personal and/or sensitive and care should be taken when storing and sharing it. Apply data masking and aggregation techniques to pseudonymise or anonymise the contextual information and take measures to separate personal and sensitive information from the pathogen data when possible.
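The masking and aggregation techniques mentioned above can be sketched as follows, assuming a keyed hash is acceptable for pseudonymisation and that ten-year age bands and month-level dates are coarse enough for the study. All field names, the band width and the key are hypothetical. Note that a keyed hash yields pseudonymised data, not anonymised data, and whether any given masking is sufficient is a matter for the project's legal and ethical assessment.

```python
import hashlib
import hmac
from typing import Dict, Union

Record = Dict[str, Union[str, int]]


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from an identifier using HMAC-SHA256."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def age_band(age: int, width: int = 10) -> str:
    """Aggregate an exact age into a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def mask_record(record: Record, secret_key: bytes) -> Record:
    """Return a copy with direct identifiers replaced and details coarsened."""
    return {
        "pseudonym": pseudonymise(str(record["patient_id"]), secret_key),
        "age_band": age_band(int(record["age"])),
        # Keep only YYYY-MM from an ISO 8601 date (YYYY-MM-DD).
        "collection_month": str(record["collection_date"])[:7],
        "region": record["region"],
    }


# Placeholder key for illustration only; a real key must be generated
# securely and stored separately from the shared data.
key = b"replace-with-a-securely-stored-key"
original = {
    "patient_id": "P-000123",
    "age": 37,
    "collection_date": "2021-03-14",
    "region": "Uppsala",
}
masked = mask_record(original, key)
print(masked["age_band"], masked["collection_month"])
```

Because the pseudonym is derived from a secret key, the same patient maps to the same pseudonym across datasets processed with that key, while anyone without the key cannot recompute the mapping from the identifier alone.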
Adopt solutions for federated analysis to support distributed analyses on information that could otherwise not be shared, such as establishing contractual agreements with suitable regional or international data infrastructures.
GA4GH is a global organisation that frames policy and builds standards to meet the real-world needs of the genomics and health community. Its GDPR & International Health Data Sharing Forum shares GDPR Briefs that represent a consensus position among its Forum Members (not legal advice) regarding the current understanding of the GDPR and its implications for genomic and health-related research, such as
- GDPR Brief: data protection implications of publishing metadata to enable discovery;
- GDPR Brief: federated analysis for responsible data sharing under the GDPR.

You should adopt good practices for data sharing and identify which data sharing platforms to use to reach the relevant stakeholders. You can use more than one platform but care should be taken to make sure that data is interconnected where possible to enable deduplication in downstream analyses.
- European healthcare surveillance systems, such as ECDC’s TESSy/EpiPulse, are administered and used by public health authorities.
- International research data exchanges, such as the European Nucleotide Archive (ENA), can be used for non-sensitive genomic data.
- There are also pathogen-specific initiatives, such as Pathogens Portal and Pathogen Detection, and initiatives focusing specifically on viruses, certain pathogens or certain data types, such as the Global Initiative on Sharing All Influenza Data (GISAID) for observations and assembled consensus sequences on a selection of pathogens.
Investigate if there are national resources or a data brokering organisation available to facilitate data sharing.
- Pathogens Portal Data Hubs network for sensitive data.
- COVID-19 Data Portal.

Bibliography

[1] Stevens, I. et al. Ten Simple Rules for Annotating Sequencing Experiments. PLOS Computational Biology 16, (2020).
[2] Bush, S. J., Connor, T. R., Peto, T. E. A., Crook, D. W. & Walker, A. S. Evaluation of Methods for Detecting Human Reads in Microbial Sequencing Datasets. Microbial Genomics 6, (2020).
[3] Gut, I. et al. B1MG D3.1 - Quality Metrics for Sequencing. (2021) doi:10.5281/ZENODO.5018495.

Your tasks

Data brokering

Information on brokering data to data repositories on behalf of data producers.

Documentation and metadata

How to document and describe your data.

Data transfer

How to transfer data files.

Data security

How to ensure that your data is handled securely.

Data quality

How to ensure high quality of research data.

More information

Links to FAIRsharing

FAIRsharing is a curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.

Human pathogen genomics collection

Tools and resources on this page

Tool or resource	Description	Related pages	Registry
bio.tools	Essential scientific and technical information about software tools, databases and services for bioinformatics and the life sciences.	Biodiversity Data analysis	Tool info Standards/Databases Training
ChEBI	Dictionary of molecular entities focused on 'small' chemical compounds	Enzymology and biocata... Microbial biotechnology	Tool info Standards/Databases Training
Chemical Entities of Biological Interest (ChEBI)	ChEBI is a dictionary describing small chemical compounds including distinct synthetic or natural atoms, molecules, ions, ion pairs, radicals, radical ions, complexes, and conformers.	Enzymology and biocata... Microbial biotechnology	Tool info Standards/Databases Training
COVID-19 Data Portal	The COVID-19 Data Portal enables researchers to upload, access and analyse COVID-19 related reference data and specialist datasets. The aim of the COVID-19 Data Portal is to facilitate data sharing and analysis, and to accelerate coronavirus research. The portal includes relevant datasets submitted to EMBL-EBI as well as other major centres for biomedical data. The COVID-19 Data Portal is the primary entry point into the functions of a wider project, the European COVID-19 Data Platform. Estonian COVID-19 Data Portal Spanish COVID-19 Data Portal Greek COVID-19 Data Portal Luxembourg COVID-19 Data Portal	COVID-19 Data Portal	Tool info Standards/Databases Training
Disease Ontology (DOID)	The Disease Ontology has been developed as a standardised ontology for human disease to provide the biomedical community with consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics, underlying mechanisms and related medical vocabulary disease concepts.		Standards/Databases
European Nucleotide Archive (ENA)	A record of sequence information scaling from raw sequencing reads to assemblies and functional annotation	Galaxy Plant Genomics Biodiversity Epitranscriptome data Microbial biotechnology Single-cell sequencing Virology Data brokering Data interlinking Data publication Project data managemen...	Tool info Standards/Databases Training
FAIR Cookbook	FAIR Cookbook is an online resource for the Life Sciences with recipes that help you to make and keep data Findable, Accessible, Interoperable and Reusable (FAIR)	TransMed Health data Compliance monitoring ...	Tool info Training
Genomic Epidemiology Ontology (GenEpiO)	The Genomic Epidemiology Ontology (GenEpiO) covers vocabulary necessary to identify, document and research food-borne pathogens, infectious disease surveillance and outbreak investigations.		Standards/Databases
Global Initiative on Sharing All Influenza Data (GISAID)	A web-based platform for sharing viral sequence data, initially for influenza data, and now for other pathogens (including SARS-CoV-2).	Virology Machine actionability	Tool info Standards/Databases
Infectious Diseases Toolkit (IDTk)	Discover tools and best practices for working with infectious disease data. IDTk provides general guidance as well as specific information for pathogen characterisation, socioeconomic data, human biomolecular data, and human clinical and health data.	COVID-19 Data Portal Virology
Minimum Information about a (Meta)Genome Sequence (MIxS - MIGS/MIMS)	A conceptual structure for extending the core INSDC information to describe genomic and metagenomic sequences.	Marine metagenomics	Standards/Databases
Minimum Information about any (x) Sequence (MIxS)	An overarching framework of standard metadata that includes sequence-type and technology-specific checklists.	Biodiversity Marine metagenomics Virology	Standards/Databases
NCBI Taxonomy	NCBI's taxonomy browser is a database of biodiversity information	Biodiversity Microbial biotechnology	Standards/Databases
Pathogen Detection	NCBI Pathogen Detection integrates bacterial and fungal pathogen genomic sequences from numerous ongoing surveillance and research efforts whose sources include food, environmental sources such as water or production facilities, and patient samples. Foodborne, hospital-acquired, and other clinically infectious pathogens are included.		Training
Pathogens Portal	The Pathogens Portal aims to provide access to data and tools relating to pathogens, their human and animal hosts and their vectors. Current content spans bacterial, viral and eukaryotic parasite lineages alongside human host data. Pathogens Portal Norway Swedish Pathogens Portal		Standards/Databases Training
Phenotypic QualiTy Ontology (PATO)	PATO is an ontology of phenotypic qualities, intended for use primarily in phenotype annotation.		Tool info Standards/Databases
UBER anatomy ONtology (UBERON)	Uberon is an integrated cross-species anatomy ontology covering animals and bridging multiple species-specific ontologies.		Standards/Databases