How do you find appropriate standard metadata for datasets or samples?
There are multiple standards for different types of data, ranging from generic dataset descriptions (e.g. DCAT, Dublin Core, (bio)schema.org) to specific data types (e.g. MIABIS for biosamples). Finding the right metadata standard and an appropriate repository for depositing your data are therefore relevant questions.
- Decide at the beginning of the project which repositories are recommended for your data types.
- Note that you can use several repositories if you have different data types.
- Distinguish between generic (e.g. Zenodo) and data type (technique) specific repositories (e.g. EBI repositories).
- If you have a repository in mind:
  - Go to the repository website and check the “help”, “guide” or “how to submit” tab to find information about required metadata.
  - On the repository website, go through the submission process (try to submit some dummy data) to identify metadata requirements. For instance, if you are considering publishing your transcriptomic data in ArrayExpress, you can create your metadata spreadsheet with the Annotare 2.0 submission tool at the beginning of the project.
  - Be aware that data type specific repositories usually have checklists for metadata. For example, the European Nucleotide Archive provides sample checklists that can also be downloaded as a spreadsheet.
- If you do not yet know which repository you will use, look for the recommended minimum information (i.e. “Minimum Information …your topic”, e.g. MIAME, MINSEQE or MIAPPE) required for your type of data in your community, or for other metadata standards, at the following resources:
  - Research Data Alliance (RDA): Metadata Dictionary: Standards
  - FAIRsharing.org at “Standards” and “Collections”
  - The Digital Curation Centre (DCC): List of Metadata Standards
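Once you have a checklist of required metadata, a small script can flag incomplete rows in your metadata spreadsheet before submission. The following is a minimal sketch in Python, assuming the spreadsheet has been exported as CSV. The field names in `REQUIRED_FIELDS` and the function `find_missing_metadata` are hypothetical and are not taken from any real repository checklist; replace them with the fields your chosen repository or minimum information standard requires.

```python
import csv
import io

# Hypothetical checklist: these field names are illustrative only and are
# NOT taken from any real repository checklist.
REQUIRED_FIELDS = ["sample_id", "organism", "collection_date"]


def find_missing_metadata(csv_text, required_fields=REQUIRED_FIELDS):
    """Return {row_number: [missing field names]} for incomplete rows."""
    problems = {}
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        # A field counts as missing if the column is absent or the cell is blank.
        missing = [
            field for field in required_fields
            if not (row.get(field) or "").strip()
        ]
        if missing:
            problems[row_number] = missing
    return problems


# Made-up example spreadsheet content.
example_sheet = """sample_id,organism,collection_date
S1,Arabidopsis thaliana,2021-05-04
S2,,2021-05-05
S3,Zea mays,
"""

report = find_missing_metadata(example_sheet)
for row_number, missing in report.items():
    print(f"row {row_number}: missing {', '.join(missing)}")
```

Running such a check from the start of the project, rather than at submission time, makes gaps visible while the missing information can still be collected.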
How do you find appropriate vocabularies or ontologies?
Vocabularies and ontologies are meant for describing concepts and relationships within a knowledge domain. Used wisely, they can enable both humans and computers to understand your data. There is no clear-cut division between the terms “vocabulary” and “ontology”, but the latter is more commonly used when dealing with complex (and perhaps more formal) collections of terms.
There are many vocabularies and ontologies to be found on the web. Finding a suitable one can be both difficult and time-consuming.
- Check whether you really need to find a suitable ontology or vocabulary yourself. Perhaps the repository where you are about to submit your data has recommendations? Or the journal where you plan to publish your results?
- Understand your goal in sharing data. Which formal requirements (e.g. from a funder or publisher) need to be fulfilled? Which parts of your data would benefit the most from adopting ontologies?
- Learn the basics about ontologies. This will be helpful when you search for terms in ontologies and want to understand how terms are related to one another.
- Accept that one ontology may not be sufficient to describe your data. It is very common that you have to combine terms from more than one ontology.
- Accept terms that are good enough. Sometimes you cannot find a term that perfectly matches what you want to express. Choosing the best available term is often better than not choosing a term at all. Note that the same concept may also be present in multiple ontologies.
- Define a list of terms that you want to find ontologies for. Also include any alternative term names that you are aware of.
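- Search for your listed terms on dedicated web portals. A few are listed in the table below, for example the EMBL-EBI Ontology Lookup Service, Linked Open Vocabularies (LOV) and Ontobee.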
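The term list and the "good enough" matching described in the steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: `terms_to_find`, `candidate_terms` and `match_terms` are hypothetical names, and the ontology names and identifiers (`ONTO_A`, `ONTO_B`) are placeholders standing in for results you would collect from a portal search, not real ontology terms.

```python
# Terms you want to annotate, each with the alternative names you are aware of.
terms_to_find = {
    "leaf": ["leaves", "foliage"],
    "drought stress": ["water deficit"],
    "greenhouse": [],
}

# Hypothetical search results collected from ontology portals.
# Ontology names and identifiers are placeholders, not real terms.
candidate_terms = [
    {"ontology": "ONTO_A", "id": "ONTO_A:0001", "label": "leaf"},
    {"ontology": "ONTO_B", "id": "ONTO_B:0042", "label": "water deficit"},
    {"ontology": "ONTO_B", "id": "ONTO_B:0007", "label": "Leaf"},
]


def match_terms(terms, candidates):
    """Map each term to candidates whose label equals the term or a synonym."""
    matches = {}
    for term, synonyms in terms.items():
        # Compare case-insensitively against the term and all its synonyms.
        names = {term.lower(), *(s.lower() for s in synonyms)}
        matches[term] = [
            c["id"] for c in candidates if c["label"].lower() in names
        ]
    return matches


matches = match_terms(terms_to_find, candidate_terms)
# Terms without any candidate need a wider search or a "good enough" choice.
unmatched = [term for term, ids in matches.items() if not ids]
```

In this made-up example, "leaf" is found in two ontologies and "drought stress" is only found through its alternative name, which is why recording synonyms up front pays off. "greenhouse" remains unmatched and would need a search in further ontologies.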
Relevant tools and resources
|Tool or resource|Description|Tags|
|---|---|---|
|BioSamples|BioSamples stores and supplies descriptions and metadata about biological samples used in research and development by academia and industry.|metadata, plants|
|BioStudies|The BioStudies database holds descriptions of biological studies and links to data from these studies in other databases.|metadata, plants|
|COPO|Portal that helps scientists broker rich metadata alongside data to public repositories.|metadata, researcher, plants|
|Digital Curation Centre Metadata list|List of metadata standards|metadata, researcher, data manager|
|EMBL-EBI Ontology Lookup Service|EMBL-EBI’s web portal for finding ontologies|metadata, data manager, researcher|
|FAIRDOMHub|Data, model and SOPs management for projects, from preliminary data to publication, support for running SBML models etc. (public SEEK instance)|storage, researcher, nels, metadata, micro biotech|
|FAIRsharing|A curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.|metadata, data publication, policy officer, data manager, researcher, micro biotech|
|IDPO|Intrinsically disordered proteins ontology|IDP, metadata|
|Linked Open Vocabularies (LOV)|Web portal for finding ontologies|metadata, data manager, researcher|
|MCPD|The Multi-Crop Passport Descriptors are the international metadata standard for plant genetic resources maintained ex situ by genebanks; they facilitate germplasm passport information exchange.|metadata, researcher, IT support, policy officer, plants|
|MIADE|Minimum Information About Disorder Experiments (MIADE) standard|metadata, researcher, data manager, IDP|
|MIAPPE|Minimum Information About a Plant Phenotyping Experiment|metadata, researcher, data manager, plants|
|MIGS/MIMS|Minimum Information about a (Meta)Genome Sequence|metadata, researcher, data manager, marine, micro biotech|
|MIxS|Minimum Information about any (x) Sequence|metadata, researcher, data manager, marine|
|Ontobee|A web portal to search and visualise ontologies|metadata, data manager, researcher|
|OTP|One Touch Pipeline (OTP) is a data management platform for running bioinformatics pipelines in a high-throughput setting, and for organising the resulting data and metadata.|human data, metadata, DMP, data analysis|
|RDA Standards|Directory of standard metadata, divided into different research areas|metadata, researcher, data manager|
|Research Object Crate (RO-Crate)|RO-Crate is a lightweight approach to packaging research data with their metadata, using schema.org. An RO-Crate is a structured archive of all the items that contributed to the research outcome, including their identifiers, provenance, relations and annotations.|metadata, storage, data organisation, data manager, researcher, micro biotech|
|RightField|RightField is an open-source tool for adding ontology term selection to Excel spreadsheets|researcher, metadata, data manager, micro biotech|
|Schemapedia|Web portal for finding ontologies|metadata, data manager, researcher|
|The Genomic Standards Consortium (GSC)|Develops the Minimum Information about any (x) Sequence (MIxS) standards|metadata, researcher, IT support, policy officer, human data|
|The Open Biological and Biomedical Ontology (OBO) Foundry|Collaborative effort to develop interoperable ontologies for the biological sciences|metadata, data manager, researcher|
|UniProt|Comprehensive resource for protein sequence and annotation data|metadata, researcher, IDP, micro biotech|