Introduction
The enzymology and biocatalysis domain is concerned with collecting experimental data that characterises biocatalysts with regard to their activity, specificity, selectivity, and catalytic mechanism in a chemical reaction. Biocatalysts are macromolecules of biological origin (or synthetic equivalents thereof) that are directly involved in, and enhance the rates of, chemical reactions without themselves being produced or consumed in them. They are mostly proteins (called enzymes) or RNA molecules (ribozymes), but can also be entire networks of enzymes, including those operating in living organisms. The complete characterisation of a biocatalyst comprises both structural and functional properties. Amino acid sequence, covalent modification, 3-dimensional structure and formulation constitute the former, whereas the kinetics, reaction stoichiometries (as given by the complete chemical equation of the reaction(s) catalysed), selectivity and specificity contribute to the latter. These functional properties are often referred to as ‘enzyme activity data’. They are ultimately determined by the structural properties of the biocatalysts and their reactants, but computing the functional properties from the structural ones is still much less accurate than inferring them from direct experimental measurement, which is what the domain ‘Enzymology and Biocatalysis’ focuses on.
Thermodynamic parameters such as the equilibrium constant and standard Gibbs or metabolic energy of the reaction are properties only of the complete chemical reaction catalysed, yet worth mentioning in biocatalyst characterisation because they can limit reaction rate, yield and efficiency. Similarly, the process conditions have a major effect on the kinetic and thermodynamic parameters. The parameters are calculated from a series of experimental raw data. Enzyme activity data are widely distributed in the literature and databases and are used when inferring and comparing reaction specificity and mechanisms, when comparing enzymes from various organisms, when trying to understand the physiological function of an enzyme and when making dynamic pathway models to understand genome function.
Biochemical parameter values vary with the conditions around the biocatalyst. Neither the comparison nor corroboration, analysis or interpretation of this data is possible without specifying those conditions in the assessments. A complete description of the experiment including the materials and methods is therefore necessary for the data to be interoperable and FAIR, i.e. useful for the scientific community at large. Standards for reporting data, definitions of metadata and frameworks for structuring the data for the publication in databases and repositories and for the data exchange greatly assist researchers in carrying out the above tasks by sharing data, reproducing recent findings and collaborating in distributed projects.
Adherence to standards comes with extra requirements in terms of equipment, expense, training, assay conditions and assay time and is not always of immediate benefit to the study at hand. Yet the utility of the resulting data is greatly enhanced by standardisation, as is the impact of the work done. The increased impact and citation rate, as well as the ‘quality stamp’ obtained, should motivate the community to adopt the standards in order to generate complete and robust datasets (manually written or in electronic notebooks), to apply tools that allow interchangeability of data, and to make these datasets publicly findable, accessible, interoperable and reusable in high quality.
Standards-compliant data collection
Description
The determination of functional parameters of biocatalysts does not always require catalytic turnover; equilibrium binding constants are functionally important too. Collections and reports of biocatalyst function data must contain a description of the experimental setup. In order to be complete, this should include the identity of the catalytic or binding entity (enzyme, protein, nucleic acid or other molecule), the biological origin or source of the molecule, its amino acid sequence, its purity (also in terms of the fraction that is in the native rather than a denatured state), its formulation, and other characteristics such as post-translational modifications, prosthetic groups, mutations, any modifications made to facilitate expression or purification, and oligomeric state if the biocatalyst forms complexes with itself. The methods and technologies used as well as the experimental conditions of the assay (temperature, pH, pressure, additives and solvents, pMg, ionic strength, medium osmolarity (if >0.3 M) and approximate macromolecule concentration) must be described, certainly if it is a new assay, but also if the assay has been published before. All too often minor modifications of experimental conditions are not reported, which makes results irreproducible and of little use; referencing publications with the original assay description is still recommended. Specific information regarding the operation mode, type of reaction vessel and mixing, and the environment of the reaction needs to be provided in the reports. The manner in which the concentration of the added substrates was determined should be described. In instances where catalytic activity or binding cannot be detected, an estimate of the limit of detection based on the sensitivity and error analysis of the assay should be provided. In such cases, a positive control should be assayed, and this should be described. In case activity is observed, a negative control (e.g. with denatured biocatalyst) should be included and described. In end-point assays, the way the reaction was brought to a halt should be described.
The analytical equipment (e.g. spectrophotometer, mass spectrometer), relevant metadata (e.g. wavelength), and the measured raw data should be reported, including the ways the measurements were calibrated (e.g. proton release by acid titration). By reporting the measured data (e.g. absorption) and the data used for calibration, the evaluation of concentrations becomes reproducible. Ambiguous terms such as “not detectable” should be avoided or extended by ‘at a detection limit of …’. A description of the software used for data analysis should be included along with calculated errors for all parameters and the biochemical model of the biocatalyst used by the analysis, inclusive of the mathematical equations used.
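As an illustration of why reporting both raw readings and calibration data makes concentrations reproducible, the following minimal sketch fits a linear calibration line and converts a raw signal into a concentration. The standards, absorbance values and function names are hypothetical and chosen only for this example; a real assay may need a different calibration model.

```python
def fit_calibration(concentrations, signals):
    """Ordinary least-squares line: signal = slope * concentration + intercept."""
    n = len(concentrations)
    if n != len(signals) or n < 2:
        raise ValueError("need at least two paired calibration points")
    mean_c = sum(concentrations) / n
    mean_s = sum(signals) / n
    sxx = sum((c - mean_c) ** 2 for c in concentrations)
    sxy = sum((c - mean_c) * (s - mean_s) for c, s in zip(concentrations, signals))
    slope = sxy / sxx
    intercept = mean_s - slope * mean_c
    return slope, intercept


def signal_to_concentration(signal, slope, intercept):
    """Invert the calibration line to recover a concentration from a raw signal."""
    return (signal - intercept) / slope


# Hypothetical calibration standards (mM) and their absorbance readings
standards_mM = [0.0, 0.05, 0.10, 0.20]
absorbance = [0.002, 0.313, 0.624, 1.246]

slope, intercept = fit_calibration(standards_mM, absorbance)

# A raw reading from the assay, converted with the reported calibration
raw_reading = 0.500
concentration_mM = signal_to_concentration(raw_reading, slope, intercept)
print(f"slope={slope:.3f} per mM, intercept={intercept:.4f}, c={concentration_mM:.4f} mM")
```

Anyone holding the raw reading and the calibration points can repeat this conversion, or apply a different calibration model, without access to the original instrument.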
Additional information is required for both the investigation and reporting of the apparent equilibrium constants of reactions. Both the equilibrium constants and the standard states/conditions need to be clearly defined. When calculating and reporting the value of an equilibrium constant, the units of concentrations, the direction of the reaction as well as the procedure used to calculate the equilibrium constant must be specified. Control experiments should always be performed to detect systematic errors or external effects and to exclude interferences from the enzyme or the solution. Chemical equilibrium is reached when the forward and reverse reactions proceed at the same rate, i.e. when the reaction quotient Q no longer changes with time. It should be demonstrated (e.g. by addition of more substrate) that the reaction did not stop because the biocatalyst lost activity. It should also be demonstrated that the measured equilibrium constant depends neither on the amount of added enzyme nor on the initial amount of substrate or product added.
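The checks on the reaction quotient can be sketched in a few lines. The example below assumes a simple isomerisation S ⇌ P with invented concentration time courses, approached once from the substrate side and once from the product side; the function names and the 5% tolerance are illustrative choices, not part of any published recommendation.

```python
import math


def reaction_quotient(conc, stoich):
    """Q = product of c_i ** nu_i (nu_i < 0 for substrates, > 0 for products)."""
    q = 1.0
    for species, nu in stoich.items():
        q *= conc[species] ** nu
    return q


def has_plateaued(values, rel_tol=0.05, window=3):
    """True if the last `window` values all lie within rel_tol of their mean."""
    tail = values[-window:]
    if len(tail) < window:
        return False
    mean = sum(tail) / len(tail)
    return all(abs(v - mean) <= rel_tol * abs(mean) for v in tail)


# Direction of the reaction as written: S -> P, so Q = [P]/[S]
stoich = {"S": -1, "P": 1}

# Hypothetical time courses in mM (units cancel for this 1:1 reaction)
forward = [{"S": 6.0, "P": 4.0}, {"S": 4.0, "P": 6.0}, {"S": 3.40, "P": 6.60},
           {"S": 3.35, "P": 6.65}, {"S": 3.33, "P": 6.67}]
reverse = [{"S": 0.5, "P": 9.5}, {"S": 2.0, "P": 8.0}, {"S": 3.30, "P": 6.70},
           {"S": 3.32, "P": 6.68}, {"S": 3.33, "P": 6.67}]

q_forward = [reaction_quotient(c, stoich) for c in forward]
q_reverse = [reaction_quotient(c, stoich) for c in reverse]

# Q should stop changing with time, and both directions should converge
print("forward plateau:", has_plateaued(q_forward))
print("reverse plateau:", has_plateaued(q_reverse))
print("directions agree:", math.isclose(q_forward[-1], q_reverse[-1], rel_tol=0.05))
```

Agreement between the two directions supports the conclusion that the plateau reflects equilibrium rather than loss of biocatalyst activity, although it does not replace the control experiments described above.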
Considerations
A prerequisite for the reproducibility of enzyme activity datasets is complete reporting, without omitting essential information that other researchers need to corroborate, interpret and reuse the data. The major questions in this respect therefore include:
- Which data are required to provide complete data sets?
- What is the minimal data accepted to be considered complete?
- What is the minimal data set required to make the data useful for studies of metabolic pathways?
- Which metadata describe both the materials and methods data and results data most appropriately?
- From where can I obtain the metadata?
- What is the best way to define the metabolic energy of reactions, substrates and metabolites?
- How can I address thermodynamic issues in my experiments?
- How can I ensure that my datasets are reproducible?
Solutions
The Commission of Standards for Reporting Enzyme Data (STRENDA), a community-driven initiative [1], has developed reporting standards that are unique to this domain and include:
- STRENDA Guidelines List Level 1a - Data required for a complete description of an experiment
- STRENDA Guidelines List Level 1b - Description of enzyme activity data
- Recommendations on the design and execution of experiments to obtain the apparent equilibrium constants of enzyme catalyzed reactions
- Recommendations on the reporting of the results to determine the equilibrium constant
- STRENDA Biocatalysis Guidelines and Metadata catalogue, an extension of the STRENDA Guidelines towards some specialities in biocatalysis
The STRENDA Guidelines, which initially proposed the reporting of fundamental enzymology data, have been extended by additional parameters for the description of biocatalysis experiments, such as reactors and vessels, mixing conditions and the formulation of the biocatalyst, which can be studied in a dissolved or immobilised form. Not only have parameters been defined, but also the corresponding metadata, which assist researchers in describing both the biocatalyst and the reaction conditions in more detail.
The nomenclature of an enzyme should follow the systematic classification and numbering recommended by the Enzyme Commission (EC) of the International Union of Biochemistry and Molecular Biology (IUBMB). The EC number classifies enzymes based on the reaction they catalyse. If a new enzyme is discovered or its classification changes, an EC number can be proposed for review at https://www.enzyme-database.org/newform.php.
Data structure and exchange
Description
When sampling and collecting data for deposition, sharing, and exchange, the data needs to be structured in such a way that both the sender and the recipient can integrate it directly into their workflows. Ontologies and metadata catalogues are valuable means of structuring data, as they supply controlled vocabularies and additional information that enrich the experimental data. Structured data that follows community-based principles is also required to increase the findability of the data on the web.
Considerations
Before data can be sampled in a structured way, frameworks and tools are required to assist the researcher in compiling complete and high-quality datasets. Therefore, the major questions that address the requirement of standardised and structured data include:
- How to assist with the implementation of repositories, databases and electronic lab notebooks?
- How to structure the data?
- Which ontologies to use?
- How to include ontologies in the definition of data fields?
- How to define metadata so as to enrich experimental data optimally?
- How to increase the findability of datasets?
- How to enhance the utility of data for users?
Solutions
EnzymeML is a data exchange format enabling the transfer of enzyme function data between instruments, electronic lab notebooks, modelling tools, and databases. It includes experimental raw data as well as metadata compliant with the Standards for Reporting Enzyme Data (STRENDA) and Systems Biology Markup Language (SBML).
Systems Biology Markup Language (SBML) is a data exchange format for computational systems biology to describe biological models including enzyme networks, their reactions and kinetic properties.
The model repositories JWS Online and BioModels are home to a large variety of detailed kinetic models of cell biochemistry that are exchangeable through Systems Biology Markup Language (SBML). The models are populated with curated parameter values based on experimental data. As such they serve to structure the data concerning enzymological parameters in a way that shows the implications of the parameter values for physiological functions. This makes models and data immediately useful for the non-modelling expert.
Bioschemas is a framework that adds bio-related properties and types to Schema.org, with the aim of increasing the findability of datasets on the web. Schema.org is a general framework that enriches any webpage with additional metadata; Bioschemas extends it with a domain-specific controlled vocabulary.
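To make the idea of web-page markup concrete, the following sketch builds a Schema.org `Dataset` description as JSON-LD and wraps it in the script tag that is typically embedded in a dataset landing page. The dataset name, identifier and URLs are placeholders invented for this example, and the selection of properties is an assumption; the current Bioschemas Dataset profile should be consulted for the properties it actually requires.

```python
import json

# Placeholder metadata for a hypothetical enzyme kinetics dataset
dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Initial-rate kinetics of a hypothetical alcohol dehydrogenase",
    "description": (
        "Raw absorbance time courses and fitted kinetic parameters, "
        "reported together with assay conditions."
    ),
    "identifier": "https://example.org/datasets/adh-kinetics-001",
    "url": "https://example.org/datasets/adh-kinetics-001",
    "keywords": ["enzyme kinetics", "biocatalysis", "Michaelis-Menten"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# JSON-LD is usually embedded in the HTML of the landing page
html_snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(dataset_markup, indent=2)
    + "\n</script>"
)
print(html_snippet)
```

Search engines and registries that harvest such markup can then index the dataset without parsing the human-readable page.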
The Ontology Lookup Service provides more than 260 ontologies across various fields. For the biocatalysis and enzymology domain, the following ontologies may be of interest:
- Bioassay Ontology
- ChEBI
- Cell Ontology for cell types
- BRENDA Tissue Ontology, source of an enzyme comprising tissues, cell lines, cell types and cell cultures
- Protein MODification (PSI-MOD)
- PRotein Ontology (PRO)
Deposition, sharing and reusing of data
Description
Because of their usefulness for comparative enzymology, as well as for understanding the functioning of biochemical pathways and networks through systems biology, data and results should be made available worldwide in the sense of FAIR data. This means that the data needs to be findable and accessible so that other researchers and software can reuse and reproduce it and generate new knowledge through comparison and integration. Various databases cover different aspects of enzymology data including enzyme occurrence, catalyzed reactions, binding mechanisms, or kinetic properties. Most of the databases contain manually curated data from literature, while only a few also support direct submission of experimental data.
Considerations
- How to find appropriate resources for the deposition and reuse of enzymology data?
- How to upload enzymology and biocatalysis data?
- What is the prerequisite for the upload of data?
- What are the requirements for the publication of data in databases and repositories?
- Are there any tools that support researchers in uploading their data?
- Are the suggested databases and repositories open (freely accessible by the researchers)?
- How to meet the requirements of journals and funding agencies to provide a meaningful data availability statement?
- Is there an indicator of the quality of the data (e.g. in the senses of accuracy and proven reproducibility)?
Solutions
- For the collection of experimental data in the laboratory, electronic lab notebooks (e.g. Chemotion, openBIS) should be used to store raw and processed data along with metadata and corresponding protocols.
- Enzyme-related repositories which enable the direct submission of enzyme function data and experimental conditions are STRENDA DB and SABIO-RK via the data exchange formats EnzymeML and Systems Biology Markup Language (SBML).
- STRENDA DB is a storage and search platform for authors who are preparing a manuscript containing functional enzymology data. Data sets entered in STRENDA DB are automatically checked for compliance with the Standards for Reporting Enzyme Data (STRENDA) and prepared for publication after the journal’s reviewing process together with the corresponding paper. The direct deposition of experimental data in STRENDA DB by the authors not only ensures the completeness of information but also simplifies the integration of the data into other enzyme databases.
- Information about the effects of chemical compounds on enzyme protein targets can be uploaded to the ChEMBL database.
- Most of the databases containing enzymology data are based on information manually extracted from literature. The structured format of the literature data in such databases allows the export and reuse of enzyme data (e.g. kinetic parameters) as well as the automatic integration in processing tools for modelling or visualisation. Manually curated databases containing enzyme function data are: UniProt, BRENDA, SABIO-RK, M-CSA (Mechanism and Catalytic Site Atlas), EzCatDB, MetaCyc. An overview of general and more specialised enzyme databases was published in 2023 [2].
- Besides literature-based information, there are databases such as GotEnzymes containing predicted kinetic parameters for enzymes or TopEnzyme containing predicted enzyme structure models.
- Further repositories, which are neither domain-specific nor store data in a structured way, are Zenodo, FigShare and DATAVERSE. These repositories are often suggested by journals and funding agencies that may be unaware of the structured repositories mentioned above.
Data processing
Description
The primary data that emerge directly from experimental analysis are mostly extensive tables of observations (e.g. measured concentrations) as a function of time or of variations in conditions such as substrate concentrations. The functional data of enzyme kinetics and biocatalysis is a much-reduced set, typically comprising kinetic and thermodynamic parameters such as equilibrium dissociation constants and Michaelis-Menten and catalytic constants. These parameters completely pin down mathematical models for the biocatalysts. These models then enable the calculation of the rate of the reaction catalysed by the enzyme for a wide range of concentrations of substrates, products, modifiers, pH values and temperatures. A frequent example of such a model is the reversible Michaelis-Menten equation. The appropriateness of this model for the description of the biocatalyst function is often uncertain or an approximation. It is important that the model used is validated for the biocatalyst under consideration.
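As a concrete example of how a small parameter set defines a rate model, the sketch below implements one common one-substrate, one-product form of the reversible Michaelis-Menten equation, v = (Vf/Ks)(S - P/Keq) / (1 + S/Ks + P/Kp). The particular form, the parameter names and the numerical values are assumptions made for illustration; the appropriate rate law for a given biocatalyst has to be validated experimentally.

```python
def reversible_mm_rate(s, p, vf, ks, kp, keq):
    """Net rate of S <-> P for a one-substrate, one-product reversible
    Michaelis-Menten mechanism.

    s, p : substrate and product concentrations
    vf   : maximal forward rate
    ks   : Michaelis constant for the substrate
    kp   : Michaelis constant for the product
    keq  : equilibrium constant of the reaction as written (P/S at equilibrium)
    """
    return (vf / ks) * (s - p / keq) / (1.0 + s / ks + p / kp)


# Hypothetical parameter values (arbitrary but consistent units)
params = {"vf": 10.0, "ks": 0.5, "kp": 2.0, "keq": 4.0}

# Without product the expression reduces to the irreversible form vf*s/(ks+s)
rate_no_product = reversible_mm_rate(0.5, 0.0, **params)

# At p = keq * s the net rate vanishes, as required at equilibrium
rate_at_equilibrium = reversible_mm_rate(1.0, 4.0, **params)

print("rate without product:", rate_no_product)
print("rate at equilibrium:", rate_at_equilibrium)
```

The two limiting cases shown are useful sanity checks when implementing any reversible rate law: it should collapse to the familiar irreversible expression in the absence of product and give zero net rate at equilibrium.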
Considerations
- How to find the most appropriate model for my data?
- How to infer the kinetic and thermodynamic parameters from the primary data?
- How to assess the accuracy and reliability of the kinetic and thermodynamic parameters?
- How and to what extent are experimental tests required for the validity of these kinetic models for the particular biocatalysts?
Solutions
- Many examples of kinetic models for enzymes are available in standard enzyme kinetics textbooks. Moreover, simulation tools such as Copasi and JWS Online provide a predefined list of reference kinetic models from which the user can choose when entering a new enzyme reaction.
- EnzymeML provides a standard format and data model for capturing the raw data, the metadata of the kinetic experiment, the kinetic model used for fitting and the final kinetic and thermodynamic parameters.
- The PyEnzyme library facilitates the processing of data by providing a programmatic interface to EnzymeML. PyEnzyme also interfaces with computational systems biology tools such as PySCeS and Copasi to assist with numerically solving the kinetic models for parameter estimation.
- Fitting statistics such as the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC), which are commonly reported by fitting libraries, can be used to select the most appropriate model from different alternatives.
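The model selection step in the last point can be sketched as follows, using the least-squares forms of the two criteria, which assume independent Gaussian errors and drop constant terms. The candidate models, residual sums of squares and number of data points are invented for illustration; in practice these values come from fitting each model to the same dataset.

```python
import math


def aic_least_squares(rss, n_points, n_params):
    """AIC up to an additive constant: n*ln(RSS/n) + 2k."""
    return n_points * math.log(rss / n_points) + 2 * n_params


def bic_least_squares(rss, n_points, n_params):
    """BIC up to an additive constant: n*ln(RSS/n) + k*ln(n)."""
    return n_points * math.log(rss / n_points) + n_params * math.log(n_points)


# Hypothetical fit results for two candidate models on the same 24 data points
n_points = 24
candidates = {
    "irreversible Michaelis-Menten": {"rss": 0.84, "n_params": 2},
    "with substrate inhibition": {"rss": 0.80, "n_params": 3},
}

scores = {
    name: (
        aic_least_squares(fit["rss"], n_points, fit["n_params"]),
        bic_least_squares(fit["rss"], n_points, fit["n_params"]),
    )
    for name, fit in candidates.items()
}

# Lower values are preferred; only differences between models are meaningful
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])

for name, (aic, bic) in scores.items():
    print(f"{name}: AIC={aic:.2f}, BIC={bic:.2f}")
print("preferred by AIC:", best_by_aic)
print("preferred by BIC:", best_by_bic)
```

In this invented case the extra parameter lowers the residuals only slightly, so both criteria favour the simpler model. The criteria are only comparable between models fitted to the same data.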
Bibliography
- Tipton, K. F. et al. Standards for Reporting Enzyme Data: The STRENDA Consortium: What it aims to do and why it should be helpful. Perspectives in Science 1, 131–137 (2014).
- Prešern, U. & Goličnik, M. Enzyme Databases in the Era of Omics and Artificial Intelligence. International Journal of Molecular Sciences 24, 16918 (2023).
More information
Tools and resources on this page
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
Bioassay Ontology | The BioAssay Ontology (BAO) describes chemical biology screening assays and their results, including high-throughput screening (HTS) data, to categorise assays and data analysis. | | Standards/Databases |
BioModels | A repository of mathematical models for application in biological sciences | Microbial biotechnology Data publication | Tool info Standards/Databases Training |
Bioschemas | Bioschemas aims to improve the Findability on the Web of life sciences resources such as datasets, software, and training materials | Intrinsically disorder... Virology Machine actionability | Standards/Databases Training |
BRENDA | Database of enzyme and enzyme-ligand information, across all taxonomic groups, manually extracted from primary literature and extended by text mining procedures | Microbial biotechnology | Tool info Standards/Databases Training |
BRENDA Tissue Ontology | A structured controlled vocabulary for the source of an enzyme. It comprises terms for tissues, cell lines, cell types and cell cultures from uni- and multicellular organisms. | | Standards/Databases |
Cell Ontology | The Cell Ontology (CL) is a candidate OBO Foundry ontology for the representation of cell types. | | Standards/Databases |
ChEBI | Dictionary of molecular entities focused on 'small' chemical compounds | Human pathogen genomics Microbial biotechnology | Tool info Standards/Databases Training |
ChEMBL | Database of bioactive drug-like small molecules, it contains 2-D structures, calculated properties and abstracted bioactivities. | Toxicology data | Tool info Standards/Databases Training |
Chemical Entities of Biological Interest (ChEBI) | ChEBI is a dictionary describing small chemical compounds including distinct synthetic or natural atoms, molecules, ions, ion pairs, radicals, radical ions, complexes, and conformers. | Human pathogen genomics Microbial biotechnology | Tool info Standards/Databases Training |
Chemotion | Chemotion is a repository for chemistry research data that provides solutions for current challenges to store research data in a feasible manner, allowing the conservation of domain specific information in a machine readable format. | | Tool info Standards/Databases |
Copasi | COPASI is a software application for simulation and analysis of biochemical networks and their dynamics. | | Tool info |
DATAVERSE | Open source research data repository software. | Plant Phenomics Plant sciences Machine actionability Data storage | Training |
EnzymeML | EnzymeML is a standardised data format for catalytic reaction data, designed to ensure consistency and interoperability. It enables researchers to store, share, and enrich reaction data with detailed metadata in JSON or XML formats. | | Tool info |
EzCatDB | EzCatDB is a database of enzyme catalytic mechanisms. | ||
FigShare | Data publishing platform | Biomolecular simulatio... Data publication Identifiers Documentation and meta... | Standards/Databases Training |
GotEnzymes | GotEnzymes: an extensive database of enzyme parameter predictions. | ||
JWS Online | JWS-Online is a systems biology tool for the construction, modification and simulation of kinetic models and for the storage of curated models. | NeLS | Standards/Databases |
M-CSA (Mechanism and Catalytic Site Atlas) | M-CSA is a database of enzyme reaction mechanisms. | | Standards/Databases |
MetaCyc | The MetaCyc database is a comprehensive resource for metabolic pathways and enzymes from all domains of life. | | Tool info Standards/Databases |
Ontology Lookup Service | EMBL-EBI's web portal for finding ontologies | FAIRtracks Bioimaging data Health data Documentation and meta... | Tool info Standards/Databases Training |
openBIS | openBIS (open Biology Information System) is an Electronic Laboratory Notebook and a Laboratory Information Management System (ELN-LIMS) solution suitable for life science laboratories. | ||
Protein MODification (PSI-MOD) | PSI-MOD is an ontology consisting of terms that describe protein chemical modifications. | | Standards/Databases |
PRotein Ontology (PRO) | Protein Ontology (PRO) provides an ontological representation of protein-related entities by explicitly defining them and showing the relationships between them. | | Standards/Databases |
PyEnzyme | PyEnzyme is a comprehensive software solution for manipulating EnzymeML files. | ||
PySCeS | The Python Simulator for Cellular Systems (PySCeS) is a flexible, user friendly tool for the analysis of cellular systems. | | Tool info |
SABIO-RK | SABIO-RK is a curated database that contains information about biochemical reactions, their kinetic rate equations with parameters and experimental conditions. | | Tool info Standards/Databases |
Schema.org | Schema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond. | Machine learning Machine actionability | Standards/Databases Training |
Standards for Reporting Enzyme Data (STRENDA) | Resource of standards for reporting enzyme data | Microbial biotechnology | Standards/Databases |
STRENDA DB | STRENDA DB is a storage and search platform that incorporates the STRENDA Guidelines in a user-friendly, web-based system. | | Standards/Databases |
Systems Biology Markup Language (SBML) | An open format for computational models of biological processes | Microbial biotechnology | Tool info |
TopEnzyme | TopEnzyme is a framework and database for structural coverage of the functional enzyme space. | | Tool info |
UniProt | Comprehensive resource for protein sequence and annotation data | Galaxy Intrinsically disorder... Proteomics Single-cell sequencing Structural Bioinformatics Machine actionability | Tool info Standards/Databases Training |
Zenodo | Generalist research data repository built and developed by OpenAIRE and CERN | FAIRtracks Plant Phenomics Bioimaging data Biomolecular simulatio... Plant sciences Single-cell sequencing Data publication Identifiers | Standards/Databases Training |