
Sensitive data

Is your data sensitive?

Description

In general, the term “sensitive data” is used for any data that could cause harm (for example to people, organisations, countries, or ecosystems) if it were openly available. This can be personal or commercial information, but also information such as the breeding grounds of endangered species. Any such data must be protected against unauthorized access. What is considered sensitive information is usually regulated by national laws and may differ between countries. Be cautious whenever you are dealing with sensitive, or potentially sensitive, information.

Considerations

  • If you deal with any information about individuals from the EU, you are bound by the General Data Protection Regulation (GDPR). In GDPR, such data is called “personal data”.
  • In the context of GDPR, “special category data” is a subclass of “personal data” that is potentially even more harmful, and GDPR prescribes very strict rules for dealing with this data. Article 9 of GDPR defines the special categories as personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, as well as genetic data, biometric data, data concerning health, and data concerning a natural person’s sex life or sexual orientation. Confusingly, these special categories are sometimes colloquially called “sensitive data”. Note that this page is concerned with the broader definition of “sensitive data”.
  • Information in Life Science projects is for the most part categorised as health and genetic data, and is therefore considered special category data under the GDPR.
  • You need to assess whether or not your dataset contains personally identifying attributes. Note that attributes that are not identifiable on their own can be identifiable in combination.
  • You need to know the de-identification status of your data. Life Science research data rarely contains directly identifying attributes. Research data would typically be pseudonymised or anonymised. If you work with personal data, you must understand the difference between these two (see under de-identification below).
  • For some studies there is a cohort owner, often a clinical party or a trusted third party that can map study participant keys back to names and surnames. Such data is considered pseudonymous.
  • If there are no means to map the data back to individuals, then the data is considered anonymous and is out of the scope of the GDPR.
  • Keep in mind that anonymising data is notoriously difficult. Does your dataset contain a wide array of attributes, or exhibit traits or patterns so distinctive that you can reasonably expect no more than a dozen people in the world to share them? If so, you cannot assume that it is anonymous. Such data risks being linked back to individuals through various technical means. Future techniques for identifying people may also be more powerful than today's, so data that is anonymous now may not remain anonymous.
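The point about attribute combinations can be made concrete with a small check. The following is a minimal sketch, not a complete re-identification risk assessment; the records and the attribute names (`birth_year`, `postcode`, `diagnosis`) are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy records; none of the attributes is a direct identifier.
records = [
    {"birth_year": 1980, "postcode": "0150", "diagnosis": "asthma"},
    {"birth_year": 1980, "postcode": "0450", "diagnosis": "diabetes"},
    {"birth_year": 1975, "postcode": "0150", "diagnosis": "diabetes"},
    {"birth_year": 1975, "postcode": "0450", "diagnosis": "asthma"},
]


def unique_records(rows, attributes):
    """Return the rows whose values for `attributes` occur exactly once."""
    keys = [tuple(row[a] for a in attributes) for row in rows]
    counts = Counter(keys)
    return [row for row, key in zip(rows, keys) if counts[key] == 1]


# Number of records singled out by each attribute on its own.
singles = {a: len(unique_records(records, [a])) for a in records[0]}

# Number of records singled out by two attributes taken together.
combined = len(unique_records(records, ["birth_year", "postcode"]))

print(singles)
print(combined)
```

In this toy dataset every single attribute value is shared by two people, yet birth year and postcode together single out every record. A real assessment would also have to consider external datasets that could be linked to yours.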

Solutions

  • Identify the legislation and regulations that you are expected to follow. Your institution’s website may give you hints on where to look for information about sensitive data.
  • If you cannot determine if your data is sensitive, contact someone with expert knowledge in that area.

How can you de-identify your data?

Description

Data anonymization is the process of irreversibly modifying personal data in such a way that subjects cannot be identified directly or indirectly by anyone, including the study team. If data are anonymized, no one can link data back to the subject.

Pseudonymization is a process where identifying fields within data records are replaced by artificial identifiers called pseudonyms or pseudonymized IDs. Pseudonymization ensures that no one can link data back to the subject, apart from nominated members of the study team, who are able to link pseudonyms to identifying records such as name and address.

Data anonymization involves modifying a dataset so that it is impossible to identify a subject from their data. Pseudonymization involves replacing identifying data with artificial IDs, for example, replacing a healthcare record ID with an internal participant ID only known to a named clinician working in the study.
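As an illustration of pseudonymization, the sketch below replaces identifying fields with a random participant ID and returns the ID-to-identity mapping as a separate table. The function name `pseudonymize`, the field names, and the example records are assumptions made for this example; in practice the key table would be stored apart from the research data, under access control.

```python
import secrets


def pseudonymize(rows, identifying_fields):
    """Split rows into shareable records and a separately held key table."""
    shareable = []
    key_table = {}
    for row in rows:
        # Random pseudonym, not derived from the data, so it cannot be
        # reversed without the key table.
        pseudonym = "P-" + secrets.token_hex(8)
        while pseudonym in key_table:
            pseudonym = "P-" + secrets.token_hex(8)
        key_table[pseudonym] = {f: row[f] for f in identifying_fields}
        record = {f: v for f, v in row.items() if f not in identifying_fields}
        record["participant_id"] = pseudonym
        shareable.append(record)
    return shareable, key_table


# Hypothetical input records.
rows = [
    {"record_id": "HR-1001", "name": "Ada Example",
     "address": "1 Sample Street", "haemoglobin": 13.2},
    {"record_id": "HR-1002", "name": "Bo Example",
     "address": "2 Sample Street", "haemoglobin": 14.8},
]

shared, key_table = pseudonymize(rows, ["record_id", "name", "address"])
```

Here `shared` holds only the measurements and the pseudonym, while `key_table` is what the nominated study team member keeps. The result is pseudonymous, not anonymous: anyone holding the key table can re-identify participants, and the remaining attributes may still be identifying in combination.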

Considerations

  • Both anonymization and pseudonymization are approaches that comply with the GDPR.
  • Simply removing identifiers cannot guarantee anonymity, because a dataset may contain unique traits or patterns that identify individuals. For example, two seemingly unrelated attributes, such as a rare disease and the country of residence, identify a person if there is only a single case of that disease in that country.
  • Data that is anonymous now may not be anonymous in the future: later datasets about the same individual may disclose their identity.
  • Anonymization techniques can damage the statistical properties of the data, for example when a participant's exact age is replaced by an age range.
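The age-range example can be sketched as a simple generalization step. The helper `age_band` and the band width of 10 years are assumptions for illustration; the appropriate width depends on the dataset and the analysis it must still support.

```python
def age_band(age, width=10):
    """Map an exact age to a range label such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# Hypothetical exact ages of five participants.
ages = [31, 34, 38, 52, 57]
bands = [age_band(a) for a in ages]

print(bands)
```

Five distinct ages collapse into two bands, which makes individuals harder to single out, but any analysis that needs exact ages is no longer possible on the generalized data.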

Solutions

  • An example of pseudonymization is where participants in a study are assigned a non-identifying ID and all identifying data (such as name and address) are removed from the metadata to be shared. The mapping of this ID to personal data is held separately and securely by a named researcher who will not share this data.
  • There are well-established data anonymization approaches, such as k-anonymity, l-diversity, and differential privacy.
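To give a feel for two of these approaches, the following sketch measures k-anonymity and (distinct) l-diversity of a small table. The function names, column names, and records are hypothetical; the code only measures these properties and does not transform data to achieve them, which dedicated anonymization tools do.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return groups


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group: each record matches at least k-1 others."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return min(len(g) for g in groups.values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return min(len({r[sensitive] for r in g}) for g in groups.values())


# Hypothetical, already generalized records (rows must be non-empty).
rows = [
    {"age_band": "30-39", "country": "NO", "diagnosis": "asthma"},
    {"age_band": "30-39", "country": "NO", "diagnosis": "diabetes"},
    {"age_band": "30-39", "country": "NO", "diagnosis": "asthma"},
    {"age_band": "50-59", "country": "NO", "diagnosis": "diabetes"},
    {"age_band": "50-59", "country": "NO", "diagnosis": "diabetes"},
]

k = k_anonymity(rows, ["age_band", "country"])
l = l_diversity(rows, ["age_band", "country"], "diagnosis")
```

For this table `k` is 2 but `l` is 1: every record shares its quasi-identifiers with at least one other, yet everyone in the 50-59 group has the same diagnosis, so knowing that a person is in that group reveals their diagnosis. This is why k-anonymity alone is often considered insufficient.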

More information

Relevant tools and resources

Tool or resource | Description | Related pages | Registry
Amnesia | Amnesia is a GDPR compliant high accuracy data anonymization tool | - | -
BBMRI-ERIC's ELSI Knowledge Base | The ELSI Knowledge Base is an open-access resource platform that aims at providing practical know-how for responsible research. | Data protection, Data steward policy, Data steward research, Human data | -
ELIXIR-AAI | The ELIXIR Authentication and Authorisation Infrastructure (AAI) | NeLS, TSD, TransMed | TeSS
GA4GH Data Security Toolkit | Principled and practical framework for the responsible sharing of genomic and health-related data. | Data publication, Data steward policy, Data steward research, Data steward infrastructure, Human data | -
GA4GH Regulatory and Ethics toolkit | Framework for Responsible Sharing of Genomic and Health-Related Data | Data protection, Data steward policy, Data steward research, Data steward infrastructure, Human data | -
Nettskjema | Form and survey tool, also for sensitive data | TSD | -
Tryggve ELSI Checklist | A list of Ethical, Legal, and Societal Implications (ELSI) to consider for research projects on human subjects | Data steward policy, Data steward research, Human data, NeLS, CSC, TSD | -
TSD | Norwegian Services for sensitive data | TSD, Data storage | TeSS
National resources
Norwegian COVID-19 Data Portal

The Norwegian COVID-19 Data Portal aims to bundle the Norwegian research efforts and offers guidelines, tools, databases and services to support Norwegian COVID-19 researchers.

Human data Existing data Data publication
Norwegian Federated EGA

The federated instance collects metadata of -omics data collections stored in national or regional archives and makes them available for search through the main EGA portal. With this solution, sensitive data will not physically leave the country, but will reside on TSD.

The European Genome-phenome Archive (EGA)
Human data Existing data Data publication TSD
usegalaxy.no

Galaxy is an open source, web-based platform for data intensive biomedical research. This instance of Galaxy is coupled with NeLS for easy data transfer.

Galaxy
Data analysis Existing data Data publication NeLS
Educloud Research

Educloud Research is a platform provided by the Centre for Information Technology (USIT) at the University of Oslo (UiO). The platform provides a work environment that is accessible to collaborators from other institutions or countries. The service includes a storage solution and a low-threshold HPC system that offers batch job submission (SLURM) and interactive nodes. Data up to the red classification level can be stored and analysed.

Data analysis Data storage
TSD

TSD, the Service for Sensitive Data, is a platform for collecting, storing, analysing and sharing sensitive data in compliance with the Norwegian privacy regulation. TSD is developed and operated by UiO.

Human data Data analysis Data storage TSD
HUNTCloud

HUNT Cloud, established in 2013, aims to improve and develop the collection, accessibility and exploration of large-scale information. HUNT Cloud offers cloud services and lab management, and is a key service that has established a framework for data protection, data security, and data management. HUNT Cloud is owned by NTNU and operated by HUNT Research Centre at the Department of Public Health and Nursing at the Faculty of Medicine and Health Sciences.

Human data Data analysis Data storage
SAFE

SAFE (secure access to research data and e-infrastructure) is a solution for secure processing of sensitive personal data in research at the University of Bergen. SAFE is based on the “Norwegian Code of conduct for information security in the health and care sector” (Normen) and ensures that confidentiality, integrity, and availability are preserved when processing sensitive personal data. Through SAFE, the IT department offers a service where employees, students and external partners get access to dedicated resources for processing sensitive personal data.

Human data Data analysis Data storage
RETTE

System for Risk and compliance. Processing of personal data in research and student projects at UiB.

Human data Data protection Data steward policy Data steward research