
Your domain: Cancer data

Introduction

Cancer is a heterogeneous disease that affects almost everyone: as a patient, survivor, relative or friend. This page focuses on the key data management challenges, considerations and solutions that are relevant across all stages of the patient’s journey, from cancer prevention, diagnosis and treatment to the assessment of patient outcomes and their monitoring at follow-up visits.

Each stage of the patient journey has different associated data types and a number of technical, ethical, legal and organisational challenges. On this page we focus on the management of human health data generated from patients diagnosed with solid or liquid tumors (oncology or hematology). Data might be collected by a number of different means, e.g. from clinical trials and non-interventional studies (NIS), as real-world data (RWD) from observational studies (e.g. registries), or captured in electronic health record (EHR) hospital systems during primary care.

Efficient data management and interoperability are essential for ensuring timely and accurate diagnoses and treatment while maintaining compliance with ethical and regulatory standards. In the following sections we will address data management best practices, considerations and possible solutions for effective management of data in each stage (Figure 1) of the individual patient’s journey.

Figure 1. Cancer data management challenges at different stages of the patient journey.

Cancer prevention

Description

Primary prevention data

Primary prevention in cancer care refers to strategies aimed at reducing the risk of developing cancer. Data from cancer registries (which include spatial patterns of cancer incidence, as well as stage, survival and mortality) in combination with genetic predisposition and/or exposome data (including exposure to environmental factors and socio-economic characteristics) can be used to identify risk factors for developing cancer. These cancer registries are information systems designed for the collection, storage, and management of data on persons with cancer and play a critical role in cancer research, surveillance, cancer prevention and control interventions. Key challenges include heterogeneity in data collection and integrating diverse datasets from different sources, e.g. linkage of exposome data to the health data from cancer registries.

Secondary prevention data

Secondary prevention in cancer care focuses on early detection and intervention to identify cancer at an early stage when it is more treatable and potentially curable. Survival rate improvement in most major tumour types depends on early detection, which has prompted screening programs in many European countries. These programs produce highly relevant data sets for further (data-driven) research on early cancer diagnostics. This data typically consists of health and bioimaging data, such as mammograms, colonoscopies, or blood tests. Most of this data contains personal health information and must be managed in compliance with privacy regulations such as GDPR.

Key challenges include integrating diverse datasets and ensuring data accuracy, since screening programs may be organised at national or regional level. Additionally, the risks and benefits of screening programs must be balanced.

Considerations

Primary prevention data

  • Consider local cancer registries in the different European countries as they can be organised locally, regionally or nationally.
  • Think about the health data access procedures which could be different for each cancer registry.
  • Bear in mind that the interoperability of variables from the exposome data could be suboptimal due to heterogeneous data collection between different sources.
  • Linkage between different data types, e.g. exposome and health data, could be non-trivial. Think about the following:
    • Does the geographical grid match?
    • Do the timestamps of the datasets match?
  • Exposome data is considered non-personal data, but once linked to personal data the linked dataset becomes personal data and privacy has to be ensured in compliance with applicable legislation (e.g. GDPR).
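
The two linkage questions above (matching grid and matching time period) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record layout, the grid cell identifiers, the exposure values and the one-year lag are invented and do not come from any registry or exposome dataset.

```python
from datetime import date

# Hypothetical exposome table: one value per grid cell and calendar year.
# Values are invented annual mean PM2.5 concentrations in µg/m³.
EXPOSOME = {
    ("cell_A", 2018): 12.4,
    ("cell_A", 2019): 11.8,
    ("cell_B", 2019): 19.1,
}

# Hypothetical, already pseudonymised registry records.
REGISTRY = [
    {"pid": "p1", "grid_cell": "cell_A", "diagnosis_date": date(2019, 5, 2)},
    {"pid": "p2", "grid_cell": "cell_B", "diagnosis_date": date(2018, 11, 20)},
    {"pid": "p3", "grid_cell": "cell_C", "diagnosis_date": date(2019, 1, 15)},
]


def link_exposure(record, exposome, lag_years=1):
    """Look up the exposure for the record's grid cell `lag_years` before diagnosis.

    Returns None when either the grid cell or the time period has no match.
    """
    key = (record["grid_cell"], record["diagnosis_date"].year - lag_years)
    return exposome.get(key)


# Once linked, every row is personal data and must be handled accordingly.
linked = [{**r, "pm25": link_exposure(r, EXPOSOME)} for r in REGISTRY]

# Records that could not be linked are reported rather than silently dropped.
unmatched = [r["pid"] for r in linked if r["pm25"] is None]
print(unmatched)
```

Keeping the unmatched records visible makes it easier to judge whether a mismatch in grid resolution or in time coverage is biasing the linked dataset.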

Secondary prevention data

  • The data access procedure could be different for different data sources. Be mindful to contact the relevant data creators and managers for the relevant access rights.
  • Interoperability of data originating from multiple screening programs is not guaranteed.
  • For general health data considerations, see the Health data page.
  • For general bioimaging data considerations, see the Bioimaging data page.

Solutions

Primary prevention data

Secondary prevention data

As there are no commonly accepted data collection standards currently, EOSC4Cancer developed a harmonised codebook for colorectal cancer screening (based on Dutch, Catalan, Italian and Czech screening codebooks), which could be used as a common basis to be extended to other cancer types.
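
To show what working with a harmonised codebook can look like in practice, here is a small sketch that maps programme-specific codes to a shared set of values. The programme names, local codes and harmonised values are all hypothetical and are not taken from the EOSC4Cancer codebook.

```python
# Hypothetical harmonised values for a screening test result variable.
HARMONISED_VALUES = {"positive", "negative", "inadequate"}

# Hypothetical local codebooks, one per screening programme.
LOCAL_TO_HARMONISED = {
    "programme_1": {"1": "positive", "0": "negative", "9": "inadequate"},
    "programme_2": {"POS": "positive", "NEG": "negative", "INV": "inadequate"},
}


def harmonise(programme, local_value):
    """Translate a local code into the harmonised value, failing loudly if unknown."""
    mapping = LOCAL_TO_HARMONISED.get(programme)
    if mapping is None:
        raise KeyError(f"no mapping defined for programme {programme!r}")
    value = mapping.get(str(local_value).strip())
    if value is None:
        raise ValueError(f"unmapped code {local_value!r} in {programme!r}")
    return value


# Sanity check: every mapping target must exist in the harmonised codebook.
for codes in LOCAL_TO_HARMONISED.values():
    assert set(codes.values()) <= HARMONISED_VALUES

print(harmonise("programme_1", 1), harmonise("programme_2", "NEG"))
```

Raising an error on unmapped codes, instead of passing them through, is a deliberate choice in this sketch: it surfaces gaps in the codebook before the data are pooled.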

Cancer diagnosis

Description

A cancer patient’s journey starts with a confirmed diagnosis, which involves clinical evaluation, imaging, laboratory tests, and testing of molecular biomarkers by different methods (e.g. immunohistochemistry, next-generation sequencing, in situ hybridisation) in biopsies (e.g. tissue or liquid biopsies) to confirm malignancy and assess tumour characteristics.

Cancer diagnosis is a multi-step process that begins with a patient’s clinical presentation, such as symptoms or incidental findings during routine check-ups or specialized screening programs (e.g. mammography for breast cancer, FIT tests and consequent colonoscopy for colorectal cancer, Pap smears and HPV PCR for cervical cancer). If cancer is suspected, the diagnostic journey typically involves a combination of medical imaging (CT, MRI, PET scans, ultrasound), laboratory tests, and biopsy procedures to confirm malignancy, each step producing a different data type. Integrating diverse data sources, including clinical history, imaging, pathology, and genomic data, allows for a more comprehensive understanding of the disease and personalized treatment strategies.

Cancer diagnosis relies on data from multiple sources that are also often used at other stages of the cancer patient’s journey (prevention, treatment, follow-up):

  • Imaging (MRI, CT, PET scans) provides tumor size, location, and spread
  • Pathology (biopsy analysis) confirms tumor type and molecular characteristics
  • Genetic/Genomic profiling can identify tumor genomic alterations relevant for molecularly matched therapies, pharmacogenomic biomarkers relevant for drug metabolism, germline alterations
  • Clinical data (patient history, symptoms, lab tests) provide context on overall health and treatment history

Managing cancer data for diagnosing and determining the best treatment for localized tumors presents several challenges, as it requires working with a wide range of sensitive patient data, coming from different departments/sources, including clinical records, radiological images (radiology), histopathological evaluation (pathology) and genomic profiles (pathology or other specialized laboratory). This makes interoperability and data integration essential to enable a holistic approach to cancer care. Consequently, data management must be precise to ensure that healthcare professionals have accurate and comprehensive information for tumour identification and treatment decisions. Data security and compliance with ethical guidelines (such as GDPR) are critical to protecting patient privacy when dealing with personal health records and sensitive tumour data. Furthermore, the need for data to be accessible across different healthcare providers and research institutions adds complexity to the management process.

Considerations

  • Are all clinically relevant variables collected using standard vocabularies across data domains including socio-demographics, risk factors, and tumor-specific metadata (e.g. Tumour-Nodes-Metastases (TNM) stage, histology, genomic alterations)?
  • Are diagnostic data and images stored in standardized formats (e.g. Digital Imaging and Communications in Medicine (DICOM) for imaging, Variant Call Format for genomics) to allow long-term usability and reanalysis?
  • Is there a data management system in place to ensure interoperability between different data types (e.g. imaging, molecular, and health records)?
  • Are there AI-based tools or decision-support systems integrated into the workflow to assist oncologists in making diagnostic decisions?
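
As an illustration of why standardized formats support reanalysis, the sketch below reads the eight fixed columns of a single Variant Call Format data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The example line is invented, and the function `parse_vcf_line` is a hypothetical helper written for this page; a dedicated VCF library would be preferable for real data.

```python
def parse_vcf_line(line):
    """Parse the eight fixed, tab-separated columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),                      # 1-based position
        "id": None if variant_id == "." else variant_id,
        "ref": ref,
        "alt": alt.split(","),                # several ALT alleles are allowed
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }


# Invented example line, not a real variant call.
example = "17\t7674220\t.\tC\tT\t50\tPASS\tDP=100;SOMATIC"
record = parse_vcf_line(example)
print(record["chrom"], record["pos"], record["alt"], record["info"])
```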

Solutions

Cancer treatment

Description

Cancer treatment varies depending on the type and stage of cancer (e.g. locally advanced, metastatic disease), as well as the overall health and preferences of the patient. The use of advanced diagnostic techniques such as PET-CT/MRI, molecular profiling (e.g. next-generation sequencing, comprehensive genomic profiling (CGP), whole genome sequencing (WGS)) and liquid biopsies (e.g. ctDNA) has tremendously increased the density and complexity of the data to be dealt with at this stage of the disease.

Cancer treatment employs a wide range of data types, such as patients’ therapeutic regimens, including surgery techniques, stem cell transplantation, radiotherapy, systemic therapies (e.g. hormone, chemotherapy, immunotherapy and targeted therapies) as well as imaging data, biomarker assessments, therapy response data, clinical trial outcomes, drug efficacy, and adverse reactions. Cancer treatment data is commonly associated with further clinical data and patients’ information. Due to its sensitive nature, the data must be managed following ethical guidelines, data protection laws, and FAIR (Findable, Accessible, Interoperable, and Reusable) principles.

Although cancer treatment data is crucial for developing personalized medicine approaches, improving patient outcomes and advancing research, comprehensive documentation of cancer treatment data remains limited in cancer registries and public datasets. This challenge is often due to data privacy regulations, ethical concerns, and varying reporting standards, which highlight disparities arising from resource limitations, national database structures, and language barriers. In addition, while cancer treatment data publication has increased, it remains inconsistent due to the lack of data standardization along with sparse ontologies. The increasing use of electronic health records across western countries, along with standardized cancer classification systems (e.g. WHO, ICD, CAP), staging systems (e.g. Tumour, Nodes, Metastases (TNM) cancer staging system), and pioneering drug (e.g. DRON, PDRO) and side effect (e.g. OAE) ontologies, facilitates data collection. However, clear guidelines for cancer treatment data collection and tools for unified analysis still need to be developed.

Considerations

  • Do you use human data? You can find more information on the Human data page.
  • Are the required clinical variables related to the treatment available?
  • How will clinical variables be integrated with molecular or imaging data?
  • Which resources are available for downloading and analysing cancer treatment data?
  • Where can you access standard-of-care cancer clinical guidelines?
  • How can you access cancer treatment data from clinical trials or side effect registries?
  • How can you propose cancer treatments based on cancer multi-omics data?

Solutions

In order to obtain information about oncological clinical practice guidelines several medical societies provide guidance:

A more unified approach to cancer treatment data collection is crucial for improving outcome analysis and supporting all stakeholders. To support this aim, several consortia and institutions provide annotated reference datasets with cancer treatment data:

Reference databases and platforms:

Drug and trial public repositories:

Genomics & multi-omics resources:

  • MTB Portal: provides a general framework to interpret the functional and predictive relevance of a given list of gene variants in interactive reports.
  • PanDrugs: a platform to prioritize cancer drug treatments according to individual multi-omics data (SNVs, CNVs and gene expression).
  • Cancer Genome Interpreter: flags genomic biomarkers of drug response with clinical relevance.
  • Clinical Interpretations of Variants in Cancer (CIViC): a free resource to identify the best cancer treatment options based on DNA alterations.
  • Topograph: Therapy-Oriented Precision Oncology Guidelines for Recommending Anti-cancer Pharmaceuticals.

Monitoring of outcomes during follow-up visits

Description

The follow-up phase in cancer care is a critical component of comprehensive patient management, ensuring long-term monitoring and well-being of cancer survivors. This stage focuses on assessing treatment outcomes, detecting potential recurrences, managing long-term side effects, and enhancing the overall quality of life. Effective follow-up strategies integrate not only systematic clinical evaluations, which include routine medical visits, imaging exams (e.g. MRI, CT, PET), and biomarker testing (e.g. CEA, PSA, ctDNA), but also patient-reported outcomes (PROs).

In this context, the increasing adoption of digital health technologies, including wearable devices and mobile health applications, as well as artificial intelligence and predictive analytics, has transformed post-treatment monitoring. On the one hand, wearables and mobile applications enable real-time remote tracking of health metrics (e.g. physical activity, heart rate, sleep patterns), facilitating early detection of potential complications. On the other hand, predictive analytics help anticipate complications and tailor follow-up schedules to individual patients’ needs. Both scenarios generate diverse data types. Additionally, cancer registries (CRs) and clinical trial databases play a fundamental role in storing longitudinal data on disease progression, survival rates, and treatment efficacy, allowing researchers to analyze trends, identify recurrence risk factors, and refine personalized follow-up guidelines.

However, due to the wide heterogeneity of data types, sources, and healthcare systems, achieving seamless interoperability and standardisation of follow-up data to support individualized patient management and optimize data reuse in cancer research remains a major challenge. In addition, data collection and management at this stage present other challenges, including: (i) the sensitive nature of the data, requiring strict adherence to regulatory and ethical frameworks, (ii) the lack of consistency and/or quality of patient follow-up information, and (iii) the lack of standardization and inherent subjectivity in survivorship quality-of-life data, influenced by patient perception, reporting methods, and assessment tools.

Considerations

Different considerations should be taken into account depending on the type of data being managed:

  • Use specific standards and methods to extract and transform data included in the Electronic Health Record (clinical data, diagnoses, demographics, procedures, medications, vital signs, laboratory results). For considerations on improving the reuse of EHR data, refer to the relevant section of the Health data page.
  • Considerations for managing imaging data (and histopathological data), binary files, as well as the associated metadata can be found on the Bioimaging data page.
  • For human genomic data, established research ethical guidelines and legislation must be followed as described on the Human data page.
  • Since health data falls under the “special category of data” as defined by the GDPR, strict guidelines and considerations must be followed when handling this information. These are covered in the GDPR compliance and Data Sensitivity pages of the RDMkit.
  • When collecting PROs directly from cancer patients and/or survivors, follow the considerations listed on the Health data page. Since these PROs focus on quality of life and are inherently subjective, additional considerations must be addressed:

    • Are questionnaires designed to minimize ambiguity and ensure that all patients interpret questions in a similar way?
    • Are there methods in place to differentiate between true changes in quality of life and variations due to individual perception or recall bias?
    • Have statistical or methodological approaches been considered to adjust for subjectivity in self-reported data?
    • How is the potential discrepancy between patient-reported outcomes and clinician assessments addressed?
  • For data collected from wearable devices and mobile applications, the following considerations should be taken into account:

    • Are the wearable devices and mobile applications validated to provide accurate real-time measurements?
    • How is the data quality ensured, considering potential sensor calibration issues, environmental factors, or user error?
    • Are there mechanisms in place to handle missing or incomplete data, such as when the device is not worn or battery levels are low?
    • How are transient fluctuations in health metrics differentiated from clinically significant changes?
    • Are patient-specific factors incorporated into the analysis to improve data interpretation?
    • Is there a system for aggregating data from multiple devices or platforms to create a comprehensive view of the patient’s health metrics over time?
  • To address the potential lack of consistency and/or quality in patient follow-up information, particularly over the long term, the following considerations should be taken into account:

    • How will data consistency be maintained if patients change healthcare providers, devices, or platforms over time?
    • Are there standardized processes in place to ensure that follow-up data from different sources can be seamlessly integrated and compared?
    • Is there a plan to handle potential gaps in data, such as missed follow-up appointments or missing reports?
    • What strategies are in place to encourage continuous patient engagement and adherence to follow-up protocols?
    • Are there periodic checks or audits in place to validate data quality and identify potential discrepancies or inconsistencies?
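
Two of the wearable-data questions above, handling missing readings and separating transient fluctuations from sustained changes, can be sketched as follows. The daily resting heart rate values, the 7-day baseline, the 10 bpm threshold and the 3-day minimum run are arbitrary assumptions chosen for illustration, not clinically validated parameters.

```python
from statistics import median


def summarise(readings, baseline_days=7, threshold=10, min_run=3):
    """Summarise daily readings; None marks a day without data (device not worn).

    A change counts as sustained only if at least `min_run` consecutive days
    exceed the baseline median by `threshold` or more.
    """
    observed = [r for r in readings if r is not None]
    baseline_values = [r for r in readings[:baseline_days] if r is not None]
    if not baseline_values:
        raise ValueError("no observed readings in the baseline period")
    baseline = median(baseline_values)

    run = longest_run = 0
    for r in readings[baseline_days:]:
        if r is not None and r - baseline >= threshold:
            run += 1
            longest_run = max(longest_run, run)
        else:
            run = 0  # a normal or missing day interrupts the run

    return {
        "missing_fraction": 1 - len(observed) / len(readings),
        "baseline": baseline,
        "longest_run": longest_run,
        "sustained_change": longest_run >= min_run,
    }


# Invented daily resting heart rate values (beats per minute).
daily_hr = [62, 60, None, 61, 63, 62, 61, 75, 62, 61, None, 74, 76, 75, 73]
result = summarise(daily_hr)
print(result)
```

In this sketch a missing day resets the run, which is a conservative choice; whether gaps should instead be bridged or imputed depends on the study protocol.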

Solutions

Related pages

More information

Tool or resource Description Related pages Registry
AACR-GENIE Real-world cancer genomic dataset aggregated from multiple institutions.
All of Us NIH research program with diverse patient health, genetic, and lifestyle data.
Apple HealthKit A framework for health and fitness data integration on Apple devices.
Cancer Genome Interpreter Cancer Genome Interpreter (CGI) is designed to support the identification of tumor alterations that drive the disease and detect those that may be therapeutically actionable. Human data Tool info
Cancer Pharmacogenomics Pharmacogenomics knowledge base including cancer pharmacogenomic information.
Cancer Therapeutics Response Portal (CTRP) Resource to explore relationships between cancer cell line genotypes and drug response.
cBioPortal A platform for exploring, analysing, and visualising cancer genomics data.
Clinical Interpretations of Variants in Cancer (CIViC) A community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer
ClinicalTrials.gov ClinicalTrials.gov is a resource maintained by the National Library of Medicine that makes privately and publicly funded clinical trials available. Toxicology data Standards/Databases
Data Use Ontology (DUO) DUO allows datasets to be semantically tagged with restrictions on their usage. Human data Ethical aspects Standards/Databases Training
dbGAP The database of Genotypes and Phenotypes (dbGaP) archives and distributes data from studies investigating the interaction of genotype and phenotype in humans. Human data Tool info Standards/Databases Training
Digital Imaging and Communications in Medicine (DICOM) A standard for medical imaging data storage and transmission.
DOME-ML Data, Optimisation, Model and Evaluation in Machine Learning (DOME-ML) is a set of community guidelines, recommendations and checklists for supervised ML validation in biology. Machine learning Standards/Databases
Drug-Gene Interaction Database (DGIdb) Drug-Gene Interaction Database linking drugs to their target genes.
DrugBank Comprehensive database containing information on drugs and drug targets.
DrugMap Database for drug mechanisms, targets, pathways, and indications.
Eastern Cooperative Oncology Group (ECOG) A scale for assessing patient functional status.
European Cancer Imaging Initiative (EUCAIM) A federated platform for cancer imaging data.
European Medicines Agency's (EMA) medicine finder European Medicines Agency's (EMA) portal for authorised medicines in the EU.
European Organisation for Research and Treatment of Cancer (EORTC) Quality of Life Questionnaires (QLQ) A set of questionnaires developed to assess the quality of life of cancer patients.
FDA-approved tools Various digital health tools approved by the FDA for medical use (e.g., wearables, software, diagnostics).
Federated EGA The Federated EGA is an infrastructure built upon the European Genome-phenome Archive (EGA), an EMBL-EBI and CRG data resource for secure archiving and sharing of human sensitive biomolecular and phenotypic data resulting from biomedical research projects. Training
Genomics of Drug Sensitivity in Cancer (GDSC) Database of drug sensitivity and genomic data from ~1000 cancer cell lines.
Google Fit A health-tracking platform by Google for collecting and accessing fitness data.
Hartwig Medical Foundation (HMF) Multi-omic cancer dataset of >7000 patients, including whole genome sequencing and clinical outcomes.
HL7 FHIR A standard used for health care data exchange. Health data
ICD-O-3 Coding Materials A classification system for oncology, used in cancer registries and pathology reporting.
ICGC-ARGO Integrated clinical and genomic data for over 100,000 cancer patients from the International Cancer Genome Consortium.
Informatics for Integrating Biology & the Bedside (i2b2) An open-source data warehouse for clinical and translational research.
Medical Dictionary for Regulatory Activities (MedDRA) A single standardised international medical terminology.
MSK-CHORD Platform from Memorial Sloan Kettering providing molecular and clinical data to support precision oncology.
MTB Portal Provides a general framework to interpret the functional and predictive relevance of a list of gene variants through interactive reports.
National Cancer Institute (NCI) treatment drugs A database of cancer-related drugs, including descriptions, uses, and approvals.
OMOP-CDM OMOP is a common data model for the harmonisation of observational health data. TransMed Data quality
OncoKB Precision oncology knowledge base that annotates the effects and treatment implications of somatic mutations in cancer.
Open mHealth An open-source platform for integrating mobile health data to improve healthcare insights.
Paige AI AI solutions for digital pathology in cancer detection.
PanDrugs Platform to prioritise cancer drug treatments based on individual multi-omics data including SNVs, CNVs, and gene expression.
PathAI AI-powered pathology tools for cancer diagnosis and research.
Patient-Reported Outcomes Measurement Information System (PROMIS) A set of person-centred measures that evaluates and monitors physical, mental, and social health in adults and children.
PET Response Criteria in Solid Tumors (PERCIST) A standardised method for assessing tumor response via PET imaging.
Qure.ai AI-based medical imaging analysis for radiology and oncology.
REDCap REDCap is a secure web application for building and managing online surveys and databases. While REDCap can be used to collect virtually any type of data in any environment, it is specifically geared to support online and offline data capture for research studies and operations. TransMed Health data Data quality Identifiers Tool info Training
Sequence Read Archive Sequence Read Archive (SRA) data, available through multiple cloud providers and NCBI servers, is the largest publicly available repository of high throughput sequencing data. Single-cell sequencing Standards/Databases Training
SIDER Side Effect Resource containing information on marketed medicines and their recorded adverse drug reactions.
Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) Terminology standard consisting of codes, terms, synonyms and definitions used in clinical documentation and reporting. Health data
The Cancer Imaging Archive A repository of medical images related to cancer.
The European Genome-phenome Archive (EGA) EGA is a service for permanent archiving and sharing of all types of personally identifiable genetic and phenotypic data resulting from biomedical research projects CSC TSD Human data Data publication Tool info Standards/Databases Training
Therapeutic Target Database (TTD) Database of known and explored therapeutic targets and their related drugs.
Transparent Reporting of a Multivariable Prediction Model Guidelines for reporting predictive model studies in medicine.
Tumour, Nodes, Metastases (TNM) cancer staging system A cancer staging system that describes the size of the tumor (T), lymph node involvement (N), and metastasis (M).
UK Biobank Large-scale biomedical database with genetic, clinical, and lifestyle data from 500,000 UK participants.
Variant Call Format A common file format that contains information about variants found at specific positions in a reference genome.
WAYFIND-R A federated real-world evidence platform containing clinical and genomic data from patients with solid tumours.
WHO Classification of Tumours The standard for diagnostic oncology, defining tumor types and classifications.
XNAT Open source imaging informatics platform. It facilitates common management, productivity, and quality assurance tasks for imaging and associated data. TransMed XNAT-PIC Bioimaging data
Contributors