Your domain: Human pathogen genomics


The human pathogen genomics domain focuses on studying the genetic code of organisms that cause disease in humans. Studies to identify and understand pathogens are conducted across different types of organisations ranging from research institutes to regional public health authorities. The aims can include urgent outbreak response, prevention measures, and developing remedies such as treatments and vaccines.

Data management challenges in this domain include the potential urgency of data sharing and secondary use of data across initiatives emerging from research, public health and policy. While the pathogenic organisms are the object of interest, there are many considerations to account for when dealing with samples collected from patients, pathogen surveillance, and human research subjects.

The genomic data can represent anything from the genetic sequence of a single pathogen isolate to fragments of genetic material from a flora of pathogens in a larger population. Other data can represent a wide range of contextual information about the human host, the disease, and various environmental factors.

Planning a study with pathogen genomic data


While the objects of interest in this domain are pathogens, the data are usually derived from samples originating from patients and human research subjects. This means that you must plan either to remove or to handle human data during your study.


  • What legal and ethical aspects do you need to consider?
    • Can you separate pathogen and human host material and data?
    • What data protection measures should be implemented in contracts and procedures dealing with suppliers and collaborators?
    • What is the appropriate scope for the legal and ethical agreements necessary for the study?
    • How should statements related to data processing be phrased to allow timely and efficient data sharing?
    • How much time would be required to negotiate access to the samples and data for the study?
  • What public health and research initiatives should you consider aligning with?
    • What data could be shared with or reused from other initiatives during the project?
    • How will you align your practices with these initiatives to maximise the impact of the data and insight generated by the project?
    • How will you share data with your collaborators and other initiatives?
  • What conventions will you adopt when planning your study?
    • What existing protocols should you consider adopting for sample preparation, sequencing, variant calling, etc.?
    • What conventions should you adopt for documenting your research?


Working with human data

Isolate pathogen from host information

  • Depending on the pathogen, how it interacts with the host, or the methods applied, it can be possible to generate clean isolates that do not contain host-related material. Data produced from a clean isolate could potentially be handled with few restrictions, while other data will be considered personal and sensitive and will need protection.

Public health initiatives

Sequencing experiments

Collecting and processing pathogen genomic data


  • What information should you consider recording when collecting data?
    • What should you note when collecting, storing and preparing the samples?
    • How will you capture information about the configuration and quality of the sequencing results?
    • How will you ensure that the information captured is complete and correct?
  • What data and file formats should you consider for your project?
    • What are the de facto standards used for the experiment type and downstream analysis pipelines?
    • Where are the instrument-specific aspects of the data and file formats documented?
  • What existing data will you integrate or use as a reference in your project?
    • What reference genome(s) will you need access to?
    • What is the recommended citation for the data and their versions?
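
One way to approach the question of whether captured information is complete and correct is an automated check on each sample record. The following is a minimal sketch only: the field names in REQUIRED_FIELDS and the function check_record are hypothetical and do not come from any particular standard or checklist, so replace them with the fields your chosen conventions require.

```python
from datetime import date

# Hypothetical required fields; substitute those of the checklist you adopt.
REQUIRED_FIELDS = (
    "sample_id",
    "collection_date",
    "storage_conditions",
    "sequencing_platform",
)


def check_record(record: dict) -> list:
    """Return a list of human-readable problems found in one sample record."""
    problems = []

    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing: {field}")

    # Correctness (one example): the collection date must parse as an ISO date.
    raw_date = record.get("collection_date")
    if raw_date:
        try:
            date.fromisoformat(str(raw_date))
        except ValueError:
            problems.append("invalid collection_date (expected ISO 8601 date)")

    return problems


# Illustrative records with made-up values.
good_record = {
    "sample_id": "S001",
    "collection_date": "2024-01-31",
    "storage_conditions": "-80C",
    "sequencing_platform": "example-platform",
}
bad_record = {
    "sample_id": "S002",
    "collection_date": "2024-13-40",
    "storage_conditions": "-80C",
}

good_problems = check_record(good_record)
bad_problems = check_record(bad_record)
```

A check like this could run when records are entered rather than at submission time, so that gaps are caught while the person who collected the sample can still be asked.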


Filtering genomic reads corresponding to human DNA fragments

  • Data files with reads produced by sequencing experiments sometimes contain fragments of the host organism’s DNA. When the host is a human research subject or patient, these fragments can be masked or removed to produce files that could potentially be handled with fewer restrictions. The approach chosen to mask the host-associated reads leads to different trade-offs. Make sure to include this as a factor in your risk assessment.
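
The difference between removing and masking host reads can be illustrated with a small sketch. It assumes that an upstream step, not shown here, has already classified some reads as human (for example by comparing them to a human reference) and produced a set of read identifiers, here called host_ids. All function names and the example FASTQ text are hypothetical, and a real pipeline would normally use an established tool rather than this code.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Read = Tuple[str, str, str]  # (read_id, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield (read_id, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between or after records
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line, ignored
        qual = next(it).rstrip("\n")
        read_id = header[1:].split()[0]  # drop leading '@' and any description
        yield read_id, seq, qual


def remove_host_reads(reads: List[Read], host_ids: Set[str]) -> List[Read]:
    """Drop host-classified reads entirely; the record count changes."""
    return [r for r in reads if r[0] not in host_ids]


def mask_host_reads(reads: List[Read], host_ids: Set[str]) -> List[Read]:
    """Keep every record but replace host-classified bases with 'N'."""
    return [
        (rid, "N" * len(seq), qual) if rid in host_ids else (rid, seq, qual)
        for rid, seq, qual in reads
    ]


# Made-up example data: three short reads, one assumed to be of human origin.
fastq_text = """@read1
ACGT
+
IIII
@read2
GGCC
+
IIII
@read3
TTAA
+
IIII
"""
host_ids = {"read2"}  # assumption: produced by a separate classification step

reads = list(parse_fastq(fastq_text.splitlines()))
kept = remove_host_reads(reads, host_ids)
masked = mask_host_reads(reads, host_ids)
```

In this sketch, removal leaves two records while masking leaves three, with the bases of the host read replaced by N. Neither variant is safer than the classification step that produced host_ids, so reads that the classifier misses remain in the output either way.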

Contextual information about the sample

Generating genomic data

Sharing and preserving pathogen genomic data


  • What data need to be preserved by the project and for how long?
  • What is preserved by others and how would someone find and access the data?
  • What databases should you use to share human pathogen genomics data?
  • What other research information (such as protocols, computational tools, samples) can the project share?


Sharing pathogen genomic data

