Skip to content Skip to footer

Your tasks: Data discoverability

How can you make your data more discoverable?

Description

Data discovery involves processes and tools that help users understand what data is available, where it is stored, and how to access it. It includes querying datasets to find specific information based on given conditions. However, for data to be discoverable, it must be well-prepared. Making your data discoverable maximises its impact and utility, enabling others to find, access, and use it effectively. Discoverable data promotes transparency, reproducibility, and scientific progress. Achieving this requires detailed metadata and documentation, depositing data in public and institutional repositories, and using standardised formats for interoperability.

Considerations

  • Detailed Metadata: Provide comprehensive metadata for your datasets, including titles, descriptions, keywords, creators, dates, and other relevant information.
  • Documentation: Include thorough documentation that explains the dataset, its structure, how it was collected and processed, and any limitations.
  • Standard Schemas: Use standardised metadata schemas to ensure consistency and interoperability.
  • Standard Formats: Use widely accepted data formats to ensure compatibility and ease of use.
  • Public Repositories: Deposit your data in reputable public data repositories that are indexed by search engines and widely used by the research community.

Solutions

  • There are several appropriate tools to create detailed (or comprehensive) metadata and document data properly for the project. Check the Documentation and metadata page for more information.
  • Some scientific communities utilise platforms such as CEDAR, Semares, FAIRDOM-SEEK, FAIRDOMHub, and COPO for managing metadata and data.
  • Various standards exist for different data types, from general dataset descriptions such as DCAT, Dublin Core, and (bio)schema.org, to those tailored for specific data types, such as MIABIS for biosamples. Hence, selecting the appropriate standard at the project’s outset is crucial. Typically, if you choose a suitable data repository for your data, it will come with an integrated metadata scheme, simplifying your work by eliminating the need to develop a separate metadata profile.
  • Decide at the beginning of the project the right repository for your data type. To search for it, you can use the ELIXIR Deposition Databases for Biomolecular Data, re3data or FAIRsharing at “Databases”.
  • If your chosen repository lacks some of the metadata fields you wish to include and you need to add a separate file with this information (such as in Zenodo), you should adhere to the appropriate metadata schema. To identify the correct schema, you have several options:
  • RDA Standards
  • FAIRsharing at “Standards” and “Collections”
  • Data Curation Centre Metadata list
  • The ideal file formats vary based on the type of data, the availability and common acceptance of open file formats, and the research domain. There isn’t a universal solution, so selecting the most suitable format for your specific needs is essential. The Data Organisation page provides a table with recommended file formats and best practices for research data management.

How can you discover controlled access data?

Description

Discovering research data for re-analysis can occur at different levels of granularity. Initially, researchers browse online catalogues that describe studies, datasets, related publications, variables, and some data distributions. This basic discovery may suffice if the datasets meet all the criteria. However, to find dataset that meet specific combinations of attributes — such as identifying datasets with particular combinations of attributes, like ‘adults diagnosed with COVID-19 in the last year, fully vaccinated, with no underlying health conditions’ (for example) — researchers must either contact the authors or request data access and verify themselves. This process is feasible for a small number of datasets and cooperative data controllers but it is usually time-consuming and uncertain. To streamline this, data discovery at the source allows users to query data non-disclosively, determining its relevance before requesting full access.

Considerations

  • Detailed Metadata: ensure comprehensive metadata for your datasets, including detailed descriptions of studies, datasets, variables, and any available distributions.
  • Data Catalogs and Repositories: use well-maintained online catalogs and repositories that support controlled access data, and check for advanced search features to filter datasets by specific attributes.
  • Data Access Policies: get familiar with the data access policies of different repositories and datasets, understanding the requirements and procedures for requesting access to controlled data.
  • Ethical and Legal Compliance: ensure compliance with ethical guidelines and legal regulations governing data use and sharing, and obtain necessary approvals from institutional review boards or ethics committees if required. Check the GDPR compliance and Ethical aspects pages for more information.
  • Data Access Request Process: be aware that the process for requesting and obtaining data access can be time-consuming, and prepare detailed justifications for data access requests, including research objectives and intended analyses.
  • Privacy and Security Measures: implement robust privacy and security measures to protect sensitive data during discovery and after access is granted, ensuring data handling practices comply with data protection regulations. Check the Data sensitivity page for more information.

Solutions

Beacon, developed through the Global Alliance for Genomics and Health (GA4GH) Discovery workstream, and with substantial support from ELIXIR, serves as a data discovery protocol and specification defining an open standard for discovering genomic and phenoclinic data in biomedical research and clinical applications.

The latest version, Beacon v2, introduced expanded query options, enabling the retrieval of biological or technical (meta)data through filters defined via CURIEs. This includes, but not limited to, parameters such as phenotypes, disease codes, sex, or age, providing researchers with a nuanced approach to data inquiries.

Beacon v2 is organised in two main blocks: the Beacon Framework and the Beacon Model. The Framework defines the format for the requests and responses, whereas the Model defines the structure of the biological data response.

This dual-system approach not only broadens the scope for diverse Models – using different domains such as images, pathogens, or infectious diseases – but also reinforces the adaptability of the Framework. The overall function of these components is to provide the instructions to design a REST API that could be implemented as a stand-alone product or, preferably, extending existing data management solutions.

Consequently, the ‘beaconised’ data represents a significant enhancement in data discoverability with minimal risks. Currently, there are two ways to implement a Beacon:

  • API on top of existing tools: This APIs is targeted to those organizations equipped with well-organised and structured data housed in databases, whether SQL or NoSQL, and possess the necessary resources and expertise to interpret and implement the Beacon v2 specification and construct an API on top of an existing tool.

    Beacon API

    Figure 1. Beacon API functionality (Source)

  • Beacon v2 Reference Implementation (B2RI): an out-of-the-box example implementation of the Beacon v2 protocol. It is an open-source toolkit based on Python programming language and consists of tools for loading metadata, e.g. phenotypic data, from a CSV file and genomic variants from a VCF file into a MongoDB database. It also features the Beacon query engine (REST API) and comes bundled with an example dataset (CINECA synthetic cohort EUROPE UK1) comprising synthetic data. You can find the GitHub Repository for Beacon v2 here.

    Beacon RI

    Figure 2. Beacon RI functionality. (Source)

Building on B2RI, the Beacon for OMOP (B4OMOP) software allows for the integration of a Beacon onto any OMOP Common Data Model (CDM) database. This enables organizations using the OMOP CDM to leverage the Beacon framework for querying and sharing genomic and phenotypic data.

More information

Skip tool table
Tool or resource Description Related pages Registry
Beacon The Beacon protocol defines an open standard for genomics data discovery. Human data Tool info Standards/Databases Training
Beacon for OMOP (B4OMOP) Beacon for OMOP (B4OMOP) software allows for the integration of a Beacon onto any OMOP Common Data Model (CDM) database.
Beacon v2 Developed through the Global Alliance for Genomics and Health (GA4GH) Discovery workstream with support from ELIXIR, Beacon is a data discovery protocol defining an open standard for discovering genomic and phenoclinic data in research and clinical applications. Training
Beacon v2 Reference Implementation (B2RI) An open-source out-of-the-box toolkit to initiate a Beacon v2. B2RI includes tools for loading metadata, such as phenotypic data and genomic variants into a MongoDB database, and features a Beacon query engine (REST API).
CEDAR CEDAR is making data submission smarter and faster, so that scientific researchers and analysts can create and use better metadata. Documentation and meta... Tool info Standards/Databases
COPO Portal for scientists to broker more easily rich metadata alongside data to public repos. Plant Phenomics Plant sciences Documentation and meta... Tool info Standards/Databases
Data Curation Centre Metadata list List of metadata standards Documentation and meta...
ELIXIR Deposition Databases for Biomolecular Data List of discipline-specific deposition databases recommended by ELIXIR. CSC IFB Marine Metagenomics NeLS Data publication Documentation and meta... Standards/Databases
FAIRDOM-SEEK A data Management Platform for organising, sharing and publishing research datasets, models, protocols, samples, publications and other research outcomes. NeLS Plant Phenomics Microbial biotechnology Plant sciences Documentation and meta... Data storage Tool info
FAIRDOMHub Data, model and SOPs management for projects, from preliminary data to publication, support for running SBML models, etc. (public SEEK instance) NeLS Plant Genomics Plant Phenomics Microbial biotechnology Plant sciences Documentation and meta... Standards/Databases Training
FAIRsharing A curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies. FAIRtracks Health data Microbial biotechnology Plant sciences Data provenance Data publication Existing data Machine actionability Documentation and meta... Standards/Databases Training
RDA Standards Directory of standard metadata, divided into different research areas Data provenance Documentation and meta...
re3data Registry of Research Data Repositories Data publication Existing data Licensing Training
Semares All-in-one platform for life science data management, semantic data integration, data analysis and visualization Documentation and meta... Data storage
Contributors