
Your tasks: Data provenance

How to document and track your data provenance?

Description

Provenance is the documentation of why and how data (as well as datasets, computational analyses and other research outputs) were produced, where, when and by whom. Data provenance is often used interchangeably with the term “data lineage”, although their definitions might differ slightly in some contexts. Data provenance/lineage means tracing the movements and changes of the data that occurred between their origin and their destination system.

Well-documented data provenance is essential for assessing the authenticity, credibility, trustworthiness, quality (it helps to find errors) and reusability of data, as well as the reproducibility of the results.

However, knowing the best way to document provenance can be challenging because of the large amount and variety of information that needs to be recorded.

Considerations

  • Provenance is part of documentation and metadata.
  • Many aspects of data documentation and metadata are related to provenance information, such as history log, versioning, licence, citation, identifiers, etc. Moreover, data provenance is related to several other aspects of data management, namely data access rights, governance, privacy and security.
  • Provenance information can be recorded:
    • as free text and unstructured information (mainly readable for humans, not for machines/software), describing data collection and processing methods.
    • according to metadata schemas or standards, which can be generic (e.g. Dublin Core) or discipline-specific (e.g. ISO 19115-2).
    • according to the Provenance Data Model (PROV-DM) and the PROV ontology (PROV-O).
  • As for documentation and metadata, the medium used to capture provenance information can also vary. Provenance trails can be captured:
    • in text files or spreadsheets
    • in registries or databases
    • in dedicated software/platforms (such as LIMS)
    • internally and automatically by software tools during their processing activity (such as workflow management systems)
  • As for documentation and metadata, provenance information can be recorded and displayed/visualised in machine-readable (see the Machine actionability page) and/or human-readable form.
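
The considerations above can be illustrated with a minimal Python sketch that records one history-log entry (where the data came from, how, by whom and when) and renders it both as machine-readable JSON and as a human-readable sentence. The function names, field names and example values are hypothetical and do not follow any particular metadata schema; a real project would map these fields to a schema chosen for its discipline.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def file_checksum(data: bytes) -> str:
    """Return a SHA-256 hex digest so a later reader can verify the data."""
    return hashlib.sha256(data).hexdigest()


def make_provenance_record(data: bytes, source: str, method: str,
                           agent: str, when: Optional[datetime] = None) -> dict:
    """Build one history-log entry (hypothetical field names)."""
    when = when or datetime.now(timezone.utc)
    return {
        "sha256": file_checksum(data),      # which data the entry refers to
        "source": source,                   # where the data came from
        "method": method,                   # how the data were produced
        "agent": agent,                     # by whom
        "recorded_at": when.isoformat(),    # when
    }


def to_text(record: dict) -> str:
    """Render the same entry as a README-style sentence."""
    return (f"{record['recorded_at']}: {record['agent']} produced data "
            f"(sha256 {record['sha256'][:12]}...) from {record['source']} "
            f"using {record['method']}.")


# Example values are placeholders, not real data.
record = make_provenance_record(
    data=b"sample_id,value\nS1,4.2\n",
    source="instrument export (hypothetical)",
    method="manual CSV export, no filtering",
    agent="A. Researcher (placeholder)",
    when=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(json.dumps(record, indent=2))  # machine-readable form
print(to_text(record))               # human-readable form
```

Keeping a single structured record and deriving the free-text form from it means the two representations cannot drift apart.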

Solutions

  • Record provenance according to schemas or defined profiles. These can be generic or domain-specific, and can be found in the RDA Metadata Standards Catalog or FAIRsharing. Use metadata schemas containing provenance information in your README file and in any kind of data documentation and metadata file. Best practices for documentation and metadata, and for data organisation, should be applied to provenance files as well.
  • Implement a serialisation specification of the PROV model (PROV-DM) in your data management tools to record provenance in a machine-actionable format (RDF, Linked Data, OWL, XML, etc.).
  • Use RO-Crate specifications and/or specific profiles for provenance (e.g., RO-Crate profiles to capture the provenance of workflow runs).
  • Make use of tools and software that help you record provenance in a manual or an automated way. Use:
    • Electronic Data Capture (EDC) systems, Laboratory Information Management Systems (LIMS) or similar tools.
    • Workflow management systems (such as Kepler, Galaxy, Taverna, VisTrails); provenance information embedded in such software or tools is usually available to users of the same tool or can be exported as a separate file in several formats, such as RO-Crate.
    • Registries such as WorkflowHub.
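
As a rough illustration of machine-actionable provenance, the Python sketch below builds a small document from the core PROV-DM concepts (entity, activity, agent and the relations between them) and follows the derivation links back to the origin of a dataset. The layout follows the PROV-JSON serialisation but has not been validated against that specification; the `ex:` identifiers, labels, timestamps and the `trace_origin` helper are hypothetical, and the helper assumes each entity is derived from at most one other entity.

```python
import json
from typing import List


def build_prov_document() -> dict:
    """Describe one cleaning step: raw data -> cleaned data (example values)."""
    return {
        "prefix": {"ex": "https://example.org/"},  # placeholder namespace
        "entity": {
            "ex:raw-data": {"prov:label": "raw measurements"},
            "ex:clean-data": {"prov:label": "cleaned measurements"},
        },
        "activity": {
            "ex:cleaning-run-1": {
                "prov:startTime": "2024-01-15T09:30:00Z",
                "prov:endTime": "2024-01-15T09:31:00Z",
            }
        },
        "agent": {"ex:researcher-1": {"prov:label": "A. Researcher"}},
        # Relations: what the activity used, produced, and who ran it.
        "used": {
            "_:u1": {"prov:activity": "ex:cleaning-run-1",
                     "prov:entity": "ex:raw-data"}
        },
        "wasGeneratedBy": {
            "_:g1": {"prov:entity": "ex:clean-data",
                     "prov:activity": "ex:cleaning-run-1"}
        },
        "wasAssociatedWith": {
            "_:a1": {"prov:activity": "ex:cleaning-run-1",
                     "prov:agent": "ex:researcher-1"}
        },
        "wasDerivedFrom": {
            "_:d1": {"prov:generatedEntity": "ex:clean-data",
                     "prov:usedEntity": "ex:raw-data"}
        },
    }


def trace_origin(doc: dict, entity_id: str) -> List[str]:
    """Follow wasDerivedFrom links from an entity back to its origin."""
    parents = {
        rel["prov:generatedEntity"]: rel["prov:usedEntity"]
        for rel in doc.get("wasDerivedFrom", {}).values()
    }
    chain = [entity_id]
    # Stop at an entity with no parent, or if a cycle would be revisited.
    while chain[-1] in parents and parents[chain[-1]] not in chain:
        chain.append(parents[chain[-1]])
    return chain


doc = build_prov_document()
print(json.dumps(doc, indent=2))
print(" <- ".join(trace_origin(doc, "ex:clean-data")))
# prints "ex:clean-data <- ex:raw-data"
```

Workflow management systems typically generate records of this kind automatically; writing them by hand is mainly useful for understanding what such exports contain.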

Relevant tools and resources

Tool or resource | Description | Related pages | Registry
PROV-DM: The PROV Data Model | PROV-DM is the conceptual data model that forms a basis for the W3C provenance (PROV) family of specifications. | |
Research Object Crate (RO-Crate) | RO-Crate is a lightweight approach to packaging research data with their metadata, using schema.org. An RO-Crate is a structured archive of all the items that contributed to the research outcome, including their identifiers, provenance, relations and annotations. | Documentation and metadata; Data storage; Data organisation; Data Steward: research; Researcher; Microbial biotechnology; Machine actionability | Standards/Databases
Contributors