Skip to content Skip to footer

Data quality

How do you ensure the quality of research data?

Description

Data quality is a term that can be understood in many ways. In enterprise context, it often refers to master data management as defined by the ISO 8000 standards. In science, the quality of data is closely linked to the suitability of the data for (re)use for a particular purpose and it is a key attribute of research data. Data quality affects the reliability of research results and it is a key factor increasing the reusability of data for secondary research. Data quality control can take place at any stage during the research data lifecycle. That said, you should ensure that the necessary procedures are defined during data management planning.

Considerations

  • What is your data collection mechanism? Quality control is most typically performed during data collection. The elements of data collection in your research will determine the quality measures you can take. Examples of such measures are:
    • setup data management working group (DMWG) that includes people who generate data, analyse data and data managers;
    • for data collection: DMWG to plan and define data dictionary (including validation rules) before collecting data;
    • for metadata collection: DMWG to plan and define metadata data templates;
    • use electronic data capture systems;
    • automated quality monitoring through tools, pipelines, dashboards;
    • training of study participants and researchers, surveyors or other staff involved;
    • adopting standards;
    • instrument calibrations;
    • repeated samples;
    • post collection data curation;
    • data peer-review.
  • Are there standards or established working practices for quality in your field of study? Certain areas such as clinical studies, or those involving Next Generation Sequencing have commonly working methods to ensure data quality.

  • There are many frameworks proposed in the literature to define and evaluate overall data quality, such as:
    • the four data quality dimensions (Accuracy, Relevancy, Representation, Accessibility) by Wang;
    • the five C’s of Sherman (Clean, Consistent, Conformed, Current and Comprehensive), and the three categories from Kahn (Conformance, Completeness and Plausibility), for electronic health data. Kahn also proposes two different modes to evaluate these components:
      • verification (focusing on the intrinsic consistency, such as adherence to a format or specified value range);
      • validation (focusing on the alignment of values with respect to external benchmarks).
  • For health data, a nice example of working out what data quality means can be found in the OHDSI community. The context in this case is observational healthcare data represented in the OMOP Common Data Model.

Solutions

  • Electronic data capturing system: REDCap allows you to design electronic data capture forms and allows you to monitor the quality of data collected via those forms.
  • An example of data dictionary illustrating the elements and factors that should be defined for the variable needed by data collection.
  • The World Bank provides quality assurance guidance for survey design and execution.
  • The U.S. National Institute’s of Health’s provides introductory training material on data quality.
  • Bio.tools’ listing for computational tools and pipelines for data quality control in life sciences.
  • Data integration tools that include pre-defined building blocks to monitor and check data quality, such as Pentaho Community Edition (CE), Talend Open Studio.
  • Data curation tools such as OpenRefine that help you to identify quality issues, correct (curate) them, carry out transformations in the collected data with easy-to-use graphic interface and visualisation. It also documents all the steps during the curation for reproducibility and backtracking.

  • For health data.
    • The Book of OHDSI has several chapters on methods for assessing the data quality of observational health datasets, split out by data quality, clinical validity, software validity and method validity.
    • The OHDSI DataQualityDashboard is a software framework for asesssing the quality and suitability of routinely generated healthcare data that is represented in the OMOP Common Data Model. Frameworks proposed in the literature, to define and evaluate overall data quality, could be used to create computational representations of the data quality of a dataset, such as the one visualized in the OHDSI Data Quality Dashboard, which leverages the Kahn framework referenced above (adapted from original thehyve.nl blogpost).

Relevant tools and resources

Skip tool table
Tool or resource Description Related pages Registry
OpenRefine Data curation tool for working with messy data Data quality TeSS
REDCap REDCap is a secure web application for building and managing online surveys and databases. While REDCap can be used to collect virtually any type of data in any environment, it is specifically geared to support online and offline data capture for research studies and operations. Identifiers Data steward infrastructure Data steward research Data quality bio.tools
Contributors