Your domain: Biomolecular simulation data

Introduction

Biomolecular simulations are an important technique for understanding and designing biological molecules and their interactions. Simulation methods are demonstrating rapidly growing impact in areas as diverse as biocatalysis, drug delivery, biomaterials, biotechnology, and drug or protein design. Simulations offer the potential of uniquely detailed, atomic-level insight into mechanisms, dynamics, and processes, as well as increasingly accurate predictions of molecular properties. Yet the field has only relatively recently started to store and share (bio)simulation data for reuse in new, unexpected projects, and to discuss the FAIRification of biomolecular simulation data (i.e. making them Findable, Accessible, Interoperable and Reusable). Here we show several current possibilities moving in this direction, but we stress that these guidelines are not carved in stone and that the biomolecular simulation community still needs to address challenges to FAIRify its data.

Description

Biomolecular simulation data come in several forms and multiple formats, which unfortunately are not completely interoperable. Different methods also require slightly different metadata descriptions.

Considerations

What type of data do you have?
- Molecular dynamics data - by far the most typical and largest biomolecular simulation data. Each molecular dynamics simulation is determined by the engine used, the force field, and multiple other, often hidden, simulation parameters, and produces trajectories that are further analysed.
- Molecular docking data - docking provides the structures of the complex (e.g. ligand-protein, protein-protein, protein-nucleic acid, etc.) and its score/energy.
- Virtual screening data - virtual screening is used to select active compounds from a pool of candidates, and its output is usually a list of compound IDs with their scores/energies.
- Free energies and other analysis data - data calculable from the analysis of the simulations.
Where should you store this data?
- Since there is no common community repository able to gather the often voluminous simulation data, the field has not stored them systematically. Recently, multiple possibilities for storing the data have emerged. The repositories can be divided into two main branches:
  - Generic: Repositories that can be used to store any kind of data.
  - Specific: Repositories designed to store specific data (e.g. MD data).
- Are you looking for long-term or short-term storage? Repositories have different options (and sometimes prices) for the storage time of your data.
- Do you need a static reference for your data? A code (identifier) that can uniquely identify and refer to your data?
What data should you store?
- Decide which of the data generated in your project should be stored. Again, the type of data might vary depending on the biomolecular simulation field.
- Consider what is essential (absolutely needed to reproduce the simulated experiment) versus what can be extracted from this data (analyses).
How do you want your data to be shared?
- Consider the terms under which other scientists can access, use, modify, or redistribute your data in other projects.

Solutions

Deposit your data to a suitable repository for sharing. There is a long (and incomplete) list of repositories available for data sharing. Repositories are divided into two main categories, general-purpose and discipline-specific, and both categories are utilised in the domain of biomolecular modelling and simulation. For a general introduction to repositories, you are advised to read the data publication page.
- General-purpose repositories such as Zenodo, FigShare, Mendeley data, Dryad, and OpenScienceFramework can be used.
- Discipline-specific repositories can be used when the repository supports the type of data to be shared e.g. molecular dynamics data. Repositories for various data types and models are listed below:
  - Molecular Dynamics repositories
    - GPCRmd - for GPCR protein simulations, with submission process.
    - MoDEL - (https://bio.tools/model) specific database for protein MD simulations.
    - BigNASim - (https://bio.tools/bignasim) specific database for Nucleic Acids MD simulations, with submission process.
    - MoDEL-CNS - specific database for Central Nervous System-related, mainly membrane protein, MD simulations.
    - NMRlipids - project to validate lipid force fields with NMR data, with submission process.
    - MolSSI - BioExcel COVID-19 therapeutics hub - database with COVID-19 related simulations, with submission process.
  - Molecular Dynamics databases - allow access to precalculated data
    - Dynameomics - database of folding/unfolding pathways
    - MemProtMD - database of automatically generated membrane proteins from PDB inserted into simulated lipid bilayers
  - Docking repositories
    - MolSSI - BioExcel COVID-19 therapeutics hub - database with COVID-19 related simulations, with submission process.
    - PDB-Dev - prototype archiving system for structural models using integrative or hybrid modelling, with submission process.
    - ModelArchive - theoretical models of macromolecular structures, with submission process.
  - Virtual Screening repositories:
    - Bioactive Conformational Ensemble - small molecule conformations, with submission process.
    - BindingDB - database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug targets with small, drug-like molecules, with submission process.
  - Repositories for the analysed data from simulations:
    - MolMeDB - for molecule-membrane interactions and free energy profiles, with submission process.
    - ChannelsDB 2.0 - resource of channels, pores and tunnels found in biomacromolecules, with submission process.
Based on the type of data to be shared, pay attention to which data and metadata should be included in the deposit. Some suggested examples of essential and optional data describing biomolecular simulation data are listed below:
- Molecular Dynamics:
  - Essentials:
    - Metadata (Temperature, pressure, program, version, …)
    - Complete set of input files that were used in the simulations
    - Trajectory(ies)
    - Topology(ies)
  - Optionals:
    - Analysis data (Free energy, snapshots, clustering)
- Docking poses:
  - Essentials:
    - The complete set of molecules tested as well as the scoring functions used and the high-ranking, final poses (3D-structures)
    - Metadata (Identifiers (SMILES, InChI-Key), target (PDBID), energies/scores, program, version, box definition)
  - Optionals:
    - Complete ensemble of poses
- Virtual Screening:
  - Essentials:
    - Sorted list of molecules
    - Metadata (identifiers of ligands and decoy molecules, target, program+version, type of VS (QSAR, ML, Docking,…))
  - Optionals:
    - Details of the method, scores, …
- Free energies and other analyses:
  - Essentials:
    - Metadata (model, method, program, version, force field(s), etc.)
    - Values (Free energy values, channels, etc.)
  - Optionals:
    - Link to Trajectory (Dynamic PDB?)
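
The essentials listed above can be turned into a simple pre-deposit check. The following is a minimal sketch, not an established tool: the item keys (`input_files`, `final_poses`, and so on) and the function name `missing_essentials` are hypothetical labels chosen for illustration, and the sets only mirror the suggested essentials from the list above.

```python
# Essentials per data type, transcribed from the suggested list above.
# The short item keys ("metadata", "input_files", ...) are hypothetical labels.
ESSENTIALS = {
    "molecular_dynamics": {"metadata", "input_files", "trajectories", "topologies"},
    "docking": {"molecules_tested", "scoring_functions", "final_poses", "metadata"},
    "virtual_screening": {"sorted_molecule_list", "metadata"},
    "free_energy_analysis": {"metadata", "values"},
}


def missing_essentials(data_type, provided):
    """Return the essential items absent from `provided`, sorted by name."""
    if data_type not in ESSENTIALS:
        raise ValueError(f"unknown data type: {data_type!r}")
    return sorted(ESSENTIALS[data_type] - set(provided))


if __name__ == "__main__":
    # A deposit that so far contains only trajectories and topologies.
    print(missing_essentials("molecular_dynamics", ["trajectories", "topologies"]))
```

Optional items (analysis data, the complete ensemble of poses, and so on) are deliberately ignored by the check, so extra entries in `provided` do not cause an error.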
Associate a license with the data and/or source code e.g. models. Licenses mainly differ on openness vs restrictiveness, and it is crucial to understand the differences among licenses before sharing your research outputs. The RDMkit licensing page lists resources that can help you understand licensing and choose an appropriate license.

File formats: The biomolecular simulation field tends to produce a multitude of input/output formats, most of them tied to a single software package. This makes interoperability and reproducibility difficult: you can share your data, but they will only be useful if the interested scientist has access to the tool that generated them. The field is working on possible standards (e.g. TNG trajectory).
Metadata standards: There is no existing standard defining the type and format of the metadata needed to describe a particular project and its associated data. How to store the program, version, parameters used, input files, etc., is still an open question, which has been addressed in many ways and using many formats (JSON, XML, TXT, etc.). Several initiatives are trying to address this issue (see further references).
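
As one illustration of the JSON option mentioned above, the sketch below writes simulation metadata to a machine-readable record using only the Python standard library. Because no community standard exists, every field name here (`program`, `force_field`, `temperature_K`, and so on) is an assumption made for this example, and the values are placeholders rather than data from a real simulation.

```python
import json


def build_md_metadata(program, version, force_field, temperature_k, pressure_bar, input_files):
    """Collect simulation metadata in a plain dict (hypothetical field names)."""
    return {
        "program": program,
        "version": version,
        "force_field": force_field,
        "temperature_K": temperature_k,
        "pressure_bar": pressure_bar,
        "input_files": list(input_files),
    }


def to_json(metadata):
    """Serialise with sorted keys and indentation so the output is stable and readable."""
    return json.dumps(metadata, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Placeholder values for illustration only, not taken from a real simulation.
    record = build_md_metadata(
        program="example-md-engine",
        version="0.0.0",
        force_field="example-ff",
        temperature_k=300.0,
        pressure_bar=1.0,
        input_files=["system.top", "run.params"],
    )
    print(to_json(record))
```

Putting the unit in the key name (`temperature_K`, `pressure_bar`) is one way to keep the record unambiguous when no schema is available to define units.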
Data size: The volume of data generated in the biomolecular simulation field is growing at an alarming pace. Making these data available to the scientific community sometimes means transferring them to long-term storage, and even this seemingly straightforward process can be cumbersome because of the large data size.
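
A back-of-the-envelope estimate helps when planning storage and transfer. The sketch below assumes an uncompressed trajectory that stores only Cartesian coordinates in single precision (4 bytes per value); real formats may compress the data or add velocities, box information, and headers, so treat the result as a rough guide. The function names and the example system size are hypothetical.

```python
def trajectory_size_bytes(n_atoms, n_frames, bytes_per_coordinate=4):
    """Uncompressed size of the coordinates alone: atoms x 3 x bytes x frames."""
    if min(n_atoms, n_frames, bytes_per_coordinate) < 0:
        raise ValueError("all arguments must be non-negative")
    return n_atoms * 3 * bytes_per_coordinate * n_frames


def transfer_time_hours(size_bytes, megabits_per_second):
    """Ideal transfer time at a sustained bandwidth, ignoring protocol overhead."""
    if megabits_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    bytes_per_second = megabits_per_second * 1_000_000 / 8
    return size_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    # Illustrative system: 100,000 atoms saved for 10,000 frames.
    size = trajectory_size_bytes(n_atoms=100_000, n_frames=10_000)
    print(f"{size / 1e9:.1f} GB")
    print(f"{transfer_time_hours(size, megabits_per_second=100):.2f} h at 100 Mbit/s")
```

For this illustrative system the coordinates alone come to 12 GB (100,000 atoms x 3 coordinates x 4 bytes x 10,000 frames), which takes about 16 minutes to move at a sustained 100 Mbit/s.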

Your tasks

Data publication

How to prepare data and find repositories for publication.

Documentation and metadata

How to document and describe your data.

Data storage

How to find appropriate storage solutions.

Tool or resource	Description	Related pages	Registry
BigNASim	Repository for Nucleic Acids MD simulations		Tool info
BindingDB	Public, web-accessible database of measured binding affinities		Tool info Standards/Databases
Bioactive Conformational Ensemble	Platform designed to efficiently generate bioactive conformers and speed up the drug discovery process.		Tool info
ChannelsDB 2.0	A comprehensive resource of channels, pores and tunnels found in biomacromolecular structures deposited in the Protein Data Bank.		Tool info Standards/Databases
Dryad	Open-source, community-led data curation, publishing, and preservation platform for CC0 publicly available research data	Bioimaging data Data publication	Standards/Databases Training
Dynameomics	Database of folding / unfolding pathway of representatives from all known protein folds by MD simulation
FigShare	Data publishing platform SciLifeLab Data Repository (Figshare)	Enzymology and biocata... Data publication Identifiers Documentation and meta...	Standards/Databases Training
GPCRmd	Repository of GPCR protein simulations		Tool info
MemProtMD	Database of over 5000 intrinsic membrane protein structures		Tool info
Mendeley data	Multidisciplinary, free-to-use open repository specialized for research data	Data publication Existing data	Standards/Databases
MoDEL	Database of Protein Molecular Dynamics simulations representing different structural clusters of the PDB		Tool info Training
MoDEL-CNS	Repository for Central Nervous System-related mainly membrane protein MD simulations
ModelArchive	Repository for theoretical models of macromolecular structures with DOIs for models	Structural Bioinformatics	Tool info Standards/Databases Training
MolMeDB	Database about interactions of molecules with membranes		Tool info Standards/Databases
MolSSI - BioExcel COVID-19 therapeutics hub	Aggregating critical information to accelerate COVID-19 drug discovery for the molecular modeling and simulation community.
NMRlipids	Repository for lipid MD simulations to validate force fields with NMR data		Tool info Standards/Databases
OpenScienceFramework	Free and open-source project management tool that supports the entire research lifecycle: planning, execution, reporting, archiving, and discovery	Documentation and meta...	Standards/Databases
PDB-Dev	Prototype archiving system for structural models obtained using integrative or hybrid modeling	Structural Bioinformatics
Zenodo	Generalist research data repository built and developed by OpenAIRE and CERN	FAIRtracks LabID Plant Phenomics Bioimaging data Enzymology and biocata... Plant sciences Single-cell sequencing Data publication Identifiers	Standards/Databases Training