How to document and track your data provenance?
Provenance is the documentation of why and how data (as well as datasets, computational analyses and other research outputs) were produced, and where, when and by whom. The term “data provenance” is often used interchangeably with “data lineage”, although their definitions may differ slightly in some contexts. Data provenance or lineage means tracing the movements of, and the changes to, the data between their origin and their destination system.
Well-documented data provenance is essential for assessing the authenticity, credibility, trustworthiness, quality (it helps in finding errors) and reusability of data, as well as the reproducibility of the results.
However, choosing the best way to document provenance can be challenging because of the large amount and variety of information that needs to be recorded.
- Provenance is part of documentation and metadata.
- Many aspects of data documentation and metadata are related to provenance information, such as history log, versioning, licence, citation, identifiers, etc. Moreover, data provenance is related to several other aspects of data management, namely data access rights, governance, privacy and security.
- Provenance information can be recorded:
- as free text and unstructured information (mainly readable by humans, not by machines/software), describing the data collection and processing methods.
- according to metadata schemas or standards, which can be generic (e.g. Dublin Core) or discipline-specific (e.g. ISO 19115-2).
- according to the W3C provenance data model (PROV-DM: The PROV Data Model) and ontology (PROV-O).
- As with documentation and metadata, the medium used to capture provenance information can also vary. Provenance trails can be captured:
- in text files or spreadsheets
- in registries or databases
- in dedicated software/platforms (such as LIMS)
- internally and automatically by software tools during their processing activity (such as workflow management systems)
- As with documentation and metadata, provenance information can be recorded and displayed or visualised in machine-readable (see the Machine actionability page) and/or human-readable form.
- Record provenance according to schemas or defined profiles. These can be generic or domain-specific, and can be found in RDA Standards or FAIRsharing. Use metadata schemas that contain provenance information in your README file and in any other data documentation or metadata file. Best practices for documentation and metadata, and for data organisation, should be applied to provenance files as well.
- Implement a serialisation of the PROV model in your data management tools to record provenance in a machine-actionable format (RDF, Linked Data, OWL, XML, etc.).
- Use RO-Crate specifications and/or specific profiles for provenance (e.g., RO-Crate profiles to capture the provenance of workflow runs).
- Make use of tools and software that help you record provenance manually or automatically. Use:
- Electronic Data Capture (EDC) systems, Laboratory Information Management Systems (LIMS) or similar tools.
- Workflow management systems (such as Kepler, Galaxy, Taverna, VisTrails); the provenance information embedded in such software or tools is usually available to users of the same tool, or can be exported as a separate file in several formats, such as Research Object Crate (RO-Crate).
- Registries such as WorkflowHub.
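As an illustration of recording provenance in a machine-actionable form, the sketch below describes a single processing step with PROV-O terms (`prov:Entity`, `prov:Activity`, `prov:Agent` and the relations between them) and serialises it as JSON-LD using only the Python standard library. The file names, the `ex:` namespace, the `ex:sha256` property and the function names are hypothetical and chosen for illustration only; a real project would more likely rely on an established PROV or RO-Crate library and on identifiers of its own.

```python
import hashlib
import json
from datetime import datetime, timezone

PROV_NS = "http://www.w3.org/ns/prov#"
XSD_NS = "http://www.w3.org/2001/XMLSchema#"


def sha256_of(content: bytes) -> str:
    """Return a SHA-256 hex digest that pins an entity to its exact bytes."""
    return hashlib.sha256(content).hexdigest()


def timestamp(moment: datetime) -> dict:
    """Wrap a datetime as a typed xsd:dateTime JSON-LD literal."""
    return {"@value": moment.isoformat(), "@type": "xsd:dateTime"}


def build_provenance(raw: bytes, cleaned: bytes, person: str,
                     started: datetime, ended: datetime) -> dict:
    """Describe one cleaning step (raw.csv -> cleaned.csv) with PROV-O terms."""
    return {
        "@context": {
            "prov": PROV_NS,
            "xsd": XSD_NS,
            "ex": "https://example.org/",  # hypothetical namespace
        },
        "@graph": [
            # What: the input and output files, as PROV entities.
            {"@id": "ex:raw.csv", "@type": "prov:Entity",
             "ex:sha256": sha256_of(raw)},  # ex:sha256 is an assumed property
            {"@id": "ex:cleaned.csv", "@type": "prov:Entity",
             "ex:sha256": sha256_of(cleaned),
             "prov:wasDerivedFrom": {"@id": "ex:raw.csv"},
             "prov:wasGeneratedBy": {"@id": "ex:cleaning-run-1"}},
            # How and when: the processing step, as a PROV activity.
            {"@id": "ex:cleaning-run-1", "@type": "prov:Activity",
             "prov:used": {"@id": "ex:raw.csv"},
             "prov:wasAssociatedWith": {"@id": person},
             "prov:startedAtTime": timestamp(started),
             "prov:endedAtTime": timestamp(ended)},
            # By whom: the person responsible, as a PROV agent.
            {"@id": person, "@type": "prov:Agent"},
        ],
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    document = build_provenance(b"id,value\n1, 2 \n", b"id,value\n1,2\n",
                                "ex:alice", now, now)
    print(json.dumps(document, indent=2))
```

The resulting JSON can be stored next to the data, for example as a sidecar file, so that the same record remains readable by people and parseable by software.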
Relevant tools and resources
|Tool or resource||Description||Related pages||Registry|
|FAIRsharing||A curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.||Microbial biotechnology Plant sciences Data publication Existing data Documentation and meta...||Standards/Databases Training|
|Galaxy||Open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses.||Marine Metagenomics Data analysis Data storage||Tool info Training|
|PROV-DM: The PROV Data Model||PROV-DM is the conceptual data model that forms a basis for the W3C provenance (PROV) family of specifications.|
|RDA Standards||Directory of standard metadata, divided into different research areas||Documentation and meta...|
|Research Object Crate (RO-Crate)||RO-Crate is a lightweight approach to packaging research data with their metadata, using schema.org. An RO-Crate is a structured archive of all the items that contributed to the research outcome, including their identifiers, provenance, relations and annotations.||Galaxy Microbial biotechnology||Standards/Databases|
|WorkflowHub||WorkflowHub is a registry for describing, sharing and publishing scientific computational workflows.||Galaxy Data analysis||Tool info Standards/Databases Training|
An electronic lab notebook (ELN) for the BioData.pt community.
|Researcher Data Steward Principal Investigator... Documentation and meta... Data quality Project data managemen... Machine actionability|