Your tasks: Data quality
How do you ensure the quality of research data?
Data quality is a term that can be understood in many ways. In an enterprise context, it often refers to master data management as defined by the ISO 8000 standards. In science, the quality of data is closely linked to its suitability for (re)use for a particular purpose, and it is a key attribute of research data. Data quality affects the reliability of research results and is a key factor in increasing the reusability of data for secondary research. Data quality control can take place at any stage of the research data lifecycle. That said, you should ensure that the necessary procedures are defined during data management planning.
Quality control is most typically performed during data collection, but it should not be neglected in later stages of the research data lifecycle. The type of data, as well as the instruments and processes adopted for data collection in your research, will determine the quality assurance measures you can take. Examples of such measures are:
- set up a data management working group (DMWG) that includes the people who generate data, the people who analyse data, and data managers;
- for data collection: have the DMWG plan and define a data dictionary (including validation rules) before collecting data;
- for metadata collection: have the DMWG plan and define metadata templates;
- use electronic data capture systems;
- automated quality monitoring through tools, pipelines and dashboards;
- training of study participants and researchers, surveyors or other staff involved;
- adopting standards;
- instrument calibrations;
- repeated samples;
- post-collection data curation;
- data peer-review.
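As an illustration of the data dictionary measure listed above, the following is a minimal sketch of a dictionary with validation rules and a function that checks one collected record against it. The variable names (`participant_id`, `age`, `weight`), the rules and the helper `check_record` are hypothetical and chosen only for this example; a real DMWG would define its own.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class Variable:
    """One entry of a hypothetical data dictionary."""
    name: str
    dtype: type
    required: bool = True
    unit: Optional[str] = None
    rule: Optional[Callable[[Any], bool]] = None  # validation rule, if any
    rule_text: str = ""                            # human-readable form of the rule


# Hypothetical dictionary, agreed on before data collection starts.
DATA_DICTIONARY = [
    Variable("participant_id", str,
             rule=lambda v: len(v) == 6 and v.startswith("P"),
             rule_text="'P' followed by 5 characters"),
    Variable("age", int, unit="years",
             rule=lambda v: 0 <= v <= 120, rule_text="0 to 120"),
    Variable("weight", float, required=False, unit="kg",
             rule=lambda v: 1.0 <= v <= 400.0, rule_text="1 to 400"),
]


def check_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for var in DATA_DICTIONARY:
        value = record.get(var.name)
        if value is None:
            if var.required:
                problems.append(f"{var.name}: missing required value")
            continue
        if not isinstance(value, var.dtype):
            problems.append(
                f"{var.name}: expected {var.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if var.rule is not None and not var.rule(value):
            problems.append(
                f"{var.name}: {value!r} violates rule ({var.rule_text})"
            )
    return problems


if __name__ == "__main__":
    for problem in check_record({"participant_id": "X1", "age": 150}):
        print(problem)
```

Keeping the rules next to the variable definitions means the same dictionary can serve as documentation and as the input to automated checks.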
Certain areas, such as clinical studies or those involving Next Generation Sequencing, have commonly used methods to ensure data quality. Consider familiarizing yourself with data quality standards or established working practices in your field of study.
There are many frameworks proposed in the literature to define and evaluate overall data quality, such as:
- the four data quality dimensions (Accuracy, Relevancy, Representation, Accessibility) by Wang;
- the five C’s of Sherman (Clean, Consistent, Conformed, Current and Comprehensive);
- the three categories from Kahn (Conformance, Completeness and Plausibility) for electronic health data. Kahn also proposes two different modes to evaluate these components:
  - verification (focusing on the intrinsic consistency, such as adherence to a format or specified value range);
  - validation (focusing on the alignment of values with respect to external benchmarks).
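The difference between the two evaluation modes can be sketched in a few lines. In the example below, the verification functions look only at the data itself (format and value range), while the validation function compares the data with an externally supplied benchmark. All function names, the date pattern and the tolerance are assumptions made for illustration and are not taken from Kahn's framework.

```python
import re
from statistics import mean
from typing import Iterable, List

# Assumed expected shape of a date string: YYYY-MM-DD (shape only, not calendar validity).
ISO_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def verify_format(values: Iterable[str], pattern=ISO_DATE_SHAPE) -> List[str]:
    """Verification: return the values that do not adhere to the expected format."""
    return [v for v in values if not pattern.fullmatch(v)]


def verify_range(values: Iterable[float], low: float, high: float) -> List[float]:
    """Verification: return the values outside the specified value range."""
    return [v for v in values if not (low <= v <= high)]


def validate_against_benchmark(values: List[float],
                               benchmark_mean: float,
                               tolerance: float) -> bool:
    """Validation: is the observed mean within `tolerance` of an external benchmark?"""
    return abs(mean(values) - benchmark_mean) <= tolerance


if __name__ == "__main__":
    dates = ["2021-03-01", "01/03/2021", "2021-3-1"]
    heights_cm = [170.0, 165.0, 180.0, 175.0]
    print(verify_format(dates))
    print(verify_range(heights_cm + [17.0], 50.0, 250.0))
    # The benchmark value would come from an external source, e.g. a reference population.
    print(validate_against_benchmark(heights_cm, benchmark_mean=171.0, tolerance=5.0))
```

Verification can run without any outside information, which makes it easy to automate during collection; validation requires choosing a trustworthy benchmark and an acceptable deviation.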
For health data, a nice example of working out what data quality means can be found in the OHDSI community. The context in this case is observational healthcare data represented in the OMOP Common Data Model.
- Electronic data capture system: REDCap allows you to design electronic data capture forms and to monitor the quality of data collected via those forms.
- An example of a data dictionary illustrating the elements and factors that should be defined for each variable needed for data collection.
- The World Bank provides quality assurance guidance for survey design and execution.
- The U.S. National Institutes of Health provides introductory training material on data quality.
- bio.tools provides a listing of computational tools and pipelines for data quality control in life sciences.
- Data integration tools that include pre-defined building blocks to monitor and check data quality, such as Pentaho Community Edition (CE), Talend Open Studio.
- Data curation tools such as OpenRefine help you to identify quality issues, correct (curate) them, and carry out transformations on the collected data through an easy-to-use graphical interface and visualisation. OpenRefine also documents all the steps taken during curation, for reproducibility and backtracking.
- For health data, the Book of OHDSI has several chapters on methods for assessing the data quality of observational health datasets, split out by data quality, clinical validity, software validity and method validity. Frameworks proposed in the literature to define and evaluate overall data quality could be used to create computational representations of the data quality of a dataset. The OHDSI DataQualityDashboard, which leverages the Kahn framework referenced above (adapted from original thehyve.nl blogpost), is a software framework for assessing the quality and suitability of routinely generated healthcare data that is represented in the OMOP Common Data Model.
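To make the idea of a computational representation of data quality more concrete, here is a minimal sketch that records the outcome of individual checks and summarises them by Kahn category. It is not the DataQualityDashboard API; the `CheckResult` class, the `summarise` function and the per-check failure thresholds are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

KAHN_CATEGORIES = ("Conformance", "Completeness", "Plausibility")


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one hypothetical data quality check."""
    name: str
    category: str          # one of KAHN_CATEGORIES
    rows_checked: int
    rows_failed: int
    threshold_pct: float   # assumed maximum tolerated percentage of failing rows

    @property
    def failed_pct(self) -> float:
        if self.rows_checked == 0:
            return 0.0
        return 100.0 * self.rows_failed / self.rows_checked

    @property
    def passed(self) -> bool:
        return self.failed_pct <= self.threshold_pct


def summarise(results: Iterable[CheckResult]) -> Dict[str, Dict[str, int]]:
    """Count passed and failed checks per Kahn category."""
    summary = {c: {"passed": 0, "failed": 0} for c in KAHN_CATEGORIES}
    for result in results:
        if result.category not in summary:
            raise ValueError(f"unknown category: {result.category}")
        summary[result.category]["passed" if result.passed else "failed"] += 1
    return summary


if __name__ == "__main__":
    results = [
        CheckResult("age is an integer", "Conformance", 1000, 0, 0.0),
        CheckResult("birth_date is not null", "Completeness", 1000, 30, 5.0),
        CheckResult("age within 0 to 120", "Plausibility", 1000, 20, 1.0),
    ]
    for category, counts in summarise(results).items():
        print(category, counts)
```

A summary of this kind can be stored alongside the dataset so that potential reusers can judge its suitability for their purpose.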
Relevant tools and resources
| Tool or resource | Description | Related pages | Registry |
| OpenRefine | Data curation tool for working with messy data | | Training |
| REDCap | REDCap is a secure web application for building and managing online surveys and databases. While REDCap can be used to collect virtually any type of data in any environment, it is specifically geared to support online and offline data capture for research studies and operations. | Identifiers, Data Steward: infrastructure, Data Steward: research | Tool info, Training |
This is the Estonian instance of REDCap, which is a secure web platform for building and managing online databases and surveys.
|Data Steward: research|