Introduction
The proteomics domain encompasses standard data formats, software tools, and data repositories for mass spectrometry-based proteomics data. In proteomics, the wide range of mass spectrometry technologies, devices, protocols, study designs and data analysis approaches makes the standardised description and storage of data and associated metadata particularly challenging. This compelled the proteomics community to define suitable standard data formats relatively early in its history, which in turn encouraged the development of software tools that can import and export results in standardised formats, and of data repositories in which proteomics data can be stored in a standardised way. The challenge for the community now is to build on these data management achievements to fulfil the FAIR principles more completely and to close the remaining gaps.
Standard data formats
Description
To make proteomics data interoperable and reproducible from the first to the last mile of proteomics data analysis pipelines, comprehensive metadata accompanying the data is needed. The crucial metadata includes information on study design, proteomics technology, lab protocol, device, device settings and software settings. All of them have an enormous impact on the resulting data. Thus, to enable data reusability in proteomics, appropriate standard data formats are needed.
Considerations
For different proteomics experiments and different steps of the respective data analysis pipelines, there are different kinds of data and metadata that should be recorded. Consequently, the main challenges for data and metadata standardisation include:
- What are the definitions of proteomics-specific terms that are needed to describe proteomics experiments?
- What is the minimum information needed to describe a proteomics experiment?
- How should the data and metadata of proteomics raw data and peak lists be stored?
- How should the data and metadata of proteomics identification results be stored?
- How should the data and metadata of proteomics quantification results be stored?
- How can proteomics data and metadata be stored in a simple and human-readable way?
Solutions
The Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI) is a community-driven organisation that provides controlled vocabularies, standard data formats, and converter and validator software tools. The most important include:
- Controlled vocabularies: HUPO PSI Mass Spectrometry Controlled Vocabulary (PSI-MS CV), PSI Molecular Interaction Controlled Vocabulary (PSI-MI CV), and HUPO-PSI cross-linking and derivatization reagents controlled vocabulary (XLMOD), which are provided as OBO files.
- Minimum Information About a Proteomics Experiment (MIAPE) guidelines document.
- mz Markup Language (mzML) - a standard format for encoding raw mass spectrometer output.
- mz peptide and protein Identification Markup Language (mzIdentML) - a standard exchange format for peptides and proteins identified from mass spectra.
- mzTab - a tab-delimited text file format to report proteomics and metabolomics results.
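To illustrate why a tab-delimited format such as mzTab is easy to process, the following minimal sketch splits mzTab-style text into metadata and protein rows using the line prefixes `COM`, `MTD`, `PRH` and `PRT`. The sample content is made up for illustration and is neither a complete nor a validated mzTab file; the function name `parse_mztab` is hypothetical and only a small part of the format is handled.

```python
import csv
import io

# Illustrative mzTab-style content (made-up values, not a complete or validated file).
SAMPLE_MZTAB = "\n".join([
    "COM\tToy example for illustration only",
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tdescription\tToy data set",
    "PRH\taccession\tdescription\tbest_search_engine_score[1]",
    "PRT\tP12345\tExample protein A\t0.95",
    "PRT\tP67890\tExample protein B\t0.80",
])


def parse_mztab(text):
    """Split mzTab-style text into a metadata dict and a list of protein rows.

    Hypothetical helper: only COM, MTD, PRH and PRT lines are handled.
    """
    metadata = {}
    protein_header = []
    proteins = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields or fields[0] == "COM":
            continue  # skip blank lines and comment lines
        prefix = fields[0]
        if prefix == "MTD" and len(fields) >= 3:
            metadata[fields[1]] = fields[2]  # metadata key and value
        elif prefix == "PRH":
            protein_header = fields[1:]  # column names of the protein section
        elif prefix == "PRT":
            proteins.append(dict(zip(protein_header, fields[1:])))
    return metadata, proteins


metadata, proteins = parse_mztab(SAMPLE_MZTAB)
print(metadata["mzTab-version"], [p["accession"] for p in proteins])
```

For real data, a dedicated library or the validators referenced on the HUPO-PSI format pages are likely a better choice than a hand-written parser.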
Processing and analysis of proteomics data
Description
For all steps within a FAIR proteomics data analysis pipeline, software is needed that imports standard data formats and exports standard data formats, including all needed results and metadata.
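As a rough illustration of what importing a standard format can look like, the sketch below reads the MS level of each spectrum from a simplified mzML-like fragment using only the Python standard library. The fragment is hand-written for this example and is not schema-valid or complete; the function name `ms_levels` is hypothetical. It relies on the PSI-MS CV accession `MS:1000511` ("ms level") being attached to each spectrum as a `cvParam`.

```python
import xml.etree.ElementTree as ET

# Simplified mzML-like fragment for illustration only; NOT schema-valid or complete.
SAMPLE_MZML = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="toy_run">
    <spectrumList count="2">
      <spectrum index="0" id="scan=1" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
      </spectrum>
      <spectrum index="1" id="scan=2" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
    </spectrumList>
  </run>
</mzML>"""

NS = {"mz": "http://psi.hupo.org/ms/mzml"}  # mzML XML namespace
MS_LEVEL = "MS:1000511"  # PSI-MS CV accession for "ms level"


def ms_levels(xml_text):
    """Return a dict mapping spectrum id to MS level (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    levels = {}
    for spectrum in root.iterfind(".//mz:spectrum", NS):
        for param in spectrum.iterfind("mz:cvParam", NS):
            if param.get("accession") == MS_LEVEL:
                levels[spectrum.get("id")] = int(param.get("value"))
    return levels


levels = ms_levels(SAMPLE_MZML)
print(levels)
```

Because the metadata is expressed through controlled vocabulary accessions rather than free text, the same lookup works regardless of which instrument or converter produced the file.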
Considerations
- Can your proteomics raw data recorded by a mass spectrometer be stored as an mz Markup Language (mzML) file?
- Is it possible to convert your raw data to mz Markup Language (mzML)?
- Does your search engine support mz Markup Language (mzML) and/or mz peptide and protein Identification Markup Language (mzIdentML)?
- Does your quantification software support mzTab?
Solutions
- Within the proteomics community, various converter software tools such as msconvert have been implemented. They support the conversion of mass spectrometer output formats to the mzML standard data format, as well as other conversions to standard data formats.
- Information on software tools that support HUPO-PSI data formats can be found on the specific web pages for mz Markup Language (mzML), mz peptide and protein Identification Markup Language (mzIdentML), and mzTab. The following list shows just a few tools using standard data formats as input and/or output:
Preserving and sharing proteomics data
Description
FAIR public data repositories are needed to make proteomics data and results findable and accessible worldwide for other researchers and software.
Considerations
- How can I find an appropriate proteomics data repository?
- How can I upload my proteomics data into a specific proteomics data repository?
- What are the requirements for my data to be uploaded into a proteomics data repository?
- What are the advantages of uploading data into proteomics data repositories?
- How can public proteomics data be used by other researchers?
- How can I increase the transparency and reproducibility of my shared data?
Solutions
- You can find an appropriate data repository via the website of the ProteomeXchange Consortium. ProteomeXchange was established to provide globally coordinated standard data submission and dissemination pipelines involving the main proteomics repositories, and to encourage open data policies in the field. Currently, member repositories include PRIDE, PeptideAtlas, MassIVE, Japan Proteome Standard Repository (jPOSTrepo), iProX, and Panorama Public.
- Information on data uploads can be found on the ProteomeXchange submissions page or on the websites of the individual data repositories. For example, PRIDE uploads are conducted via the PRIDE Submission Tool. Note that each data repository has its own specific requirements.
- Advantages of data publication include fulfilment of journal requirements, higher visibility of research, free storage, worldwide accessibility, basic re-analysis by repository-associated tools, and possible integration into more specialised knowledgebases such as Human Protein Atlas, Mass Centric Peptide Database (MaCPepDB), STRING, Unimod, InterPro, UniProt or CATH.
- You can increase the transparency and reproducibility of mass spectrometry-based proteomics data by providing a sample and data relationship file in the Sample and Data Relationship Format for Proteomics (SDRF) along with your submission to a data repository (e.g. one in the ProteomeXchange consortium).
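The idea behind a sample and data relationship file can be sketched as a small tab-delimited table that links each sample to its data files. The example below is a minimal sketch only: it uses a small subset of SDRF-style column names (`source name`, `characteristics[organism]`, `assay name`, `comment[data file]`), made-up sample and file names, and hypothetical helper functions. A real SDRF-Proteomics file requires more columns, so the specification should be consulted before submission.

```python
import csv
import io

# Small, illustrative subset of SDRF-style columns; a real file needs more.
COLUMNS = [
    "source name",
    "characteristics[organism]",
    "assay name",
    "comment[data file]",
]

# Made-up samples and file names for illustration only.
ROWS = [
    {"source name": "sample 1", "characteristics[organism]": "homo sapiens",
     "assay name": "run 1", "comment[data file]": "sample1_f1.raw"},
    {"source name": "sample 1", "characteristics[organism]": "homo sapiens",
     "assay name": "run 2", "comment[data file]": "sample1_f2.raw"},
    {"source name": "sample 2", "characteristics[organism]": "homo sapiens",
     "assay name": "run 3", "comment[data file]": "sample2_f1.raw"},
]


def write_sdrf(rows, columns=COLUMNS):
    """Serialise rows as tab-delimited text with a header line (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def files_per_sample(sdrf_text):
    """Map each sample (source name) to the data files recorded for it."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(sdrf_text), delimiter="\t"):
        mapping.setdefault(row["source name"], []).append(row["comment[data file]"])
    return mapping


sdrf_text = write_sdrf(ROWS)
print(files_per_sample(sdrf_text))
```

Keeping one row per sample and data file combination is what lets other researchers and software reconstruct which file belongs to which sample without reading the publication.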
Related pages
More information
Links to FAIRsharing
FAIRsharing is a curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.
Training
Tools and resources on this page
| Tool or resource | Description | Related pages | Registry |
|---|---|---|---|
| Agile Protein Interactomes Dataserver (APID Interactomes) | APID (Agile Protein Interactomes DataServer) is a server that provides a comprehensive collection of protein interactomes for more than 400 organisms, based on the integration of known, experimentally validated protein-protein physical interactions (PPIs). | | Tool info Standards/Databases |
| CATH | A hierarchical domain classification of protein structures in the Protein Data Bank. | | Tool info Standards/Databases Training |
| Comet | Open source tandem mass spectrometry (MS/MS) sequence database search tool. | | Tool info |
| Human Protein Atlas | The Human Protein Atlas contains information for a large majority of all human protein-coding genes regarding the expression and localization of the corresponding proteins based on both RNA and protein data. | | Standards/Databases Training |
| HUPO PSI Mass Spectrometry Controlled Vocabulary (PSI-MS CV) | The PSI-MS Controlled Vocabulary consists of a large collection of structured terms covering description and use of Mass Spectrometry instrumentation as well as Protein Identification and Quantitation software. | | Standards/Databases |
| HUPO-PSI cross-linking and derivatization reagents controlled vocabulary (XLMOD) | The XLMOD ontology is a structured, controlled vocabulary for cross-linking reagents and cross-linker-related post-translational modifications used in cross-linking mass spectrometry experiments and derivatisation reagents for GC-MS. | | Standards/Databases |
| InterPro | Functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites | | Tool info Standards/Databases Training |
| iProX | iProX is a public platform for collecting and sharing raw data, analysis results and metadata obtained from proteomics experiments. | | Tool info Standards/Databases |
| Japan Proteome Standard Repository (jPOSTrepo) | jPOSTrepo (Japan ProteOme STandard Repository) is a data repository for sharing MS raw/processed data. | | Standards/Databases |
| Mascot | Powerful search engine which uses mass spectrometry data to identify proteins from DNA, RNA and protein sequence databases as well as spectral libraries. | | Tool info Standards/Databases Training |
| Mass Centric Peptide Database (MaCPepDB) | A database to quickly access all tryptic peptides of the UniProtKB | | Standards/Databases |
| MassIVE | MassIVE (Mass Spectrometry Interactive Virtual Environment) is a community resource for the free, global exchange of mass spectrometry data. | | Standards/Databases Training |
| Minimum Information About a Proteomics Experiment (MIAPE) | MIAPE defines the minimum set of information about whole proteomics experiments that would be required in a public repository. | | Standards/Databases |
| msconvert | Command-line conversion tool from ProteoWizard, which provides a set of open-source, cross-platform software libraries and tools that facilitate proteomics data analysis. | | Tool info Training |
| mz Markup Language (mzML) | mzML was formed to amalgamate two formats for encoding raw spectrometer data; mzData, developed by the PSI, and mzXML, developed at the Seattle Proteome Center at the Institute for Systems Biology. | | Standards/Databases |
| mz peptide and protein Identification Markup Language (mzIdentML) | mzIdentML provides a common format for the export of identification results from any search engine. | | Standards/Databases |
| mzTab | This format aims to present the results of a proteomics experiment in a computationally accessible overview. | | Standards/Databases |
| OpenMS | OpenMS offers an open-source C++ library (+ Python bindings) for LC/MS data management, analysis and visualization. | | Tool info Training |
| PAA | PAA is an R/Bioconductor tool for protein microarray data analysis aimed at biomarker discovery. | | Tool info |
| Panorama Public | Panorama Public is a data repository for sharing and disseminating results from analysing mass spectrometry data with the Skyline software, supporting targeted analysis of proteomics or metabolomics data from a variety of mass spectrometry data acquisition techniques. | | Standards/Databases |
| PeptideAtlas | Database of multi-organism, publicly accessible compendium of peptides identified in a large set of tandem mass spectrometry proteomics experiments. | | Tool info Standards/Databases |
| PIA - Protein Inference Algorithms | PIA is a toolbox for mass spectrometry-based protein inference and identification analysis. | | Tool info |
| PRIDE | PRoteomics IDEntifications (PRIDE) Archive database | Data publication | Tool info Standards/Databases Training |
| PRIDE Submission Tool | Main tool used to submit proteomics datasets to PRIDE Archive | | |
| ProteomeXchange | ProteomeXchange provides globally coordinated standard data submission and dissemination pipelines | | Tool info Standards/Databases Training |
| PSI Molecular Interaction Controlled Vocabulary (PSI-MI CV) | A structured controlled vocabulary for the annotation of experiments concerned with protein-protein interactions. | | Standards/Databases |
| Sample and Data Relationship Format for Proteomics (SDRF) | Information about a proposed standard for sample metadata annotations in public repositories called Sample and Data Relationship File (SDRF)-Proteomics format. | | Standards/Databases |
| Skyline | Freely-available, open-source Windows client application for building Selected Reaction Monitoring (SRM) / Multiple Reaction Monitoring (MRM), Parallel Reaction Monitoring (PRM), DIA/SWATH and targeted DDA quantitative methods and analyzing the resulting mass spectrometer data. | | Tool info |
| STRING | Known and predicted protein-protein interactions. | | Tool info Standards/Databases Training |
| The Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI) | The HUPO Proteomics Standards Initiative defines community standards for data representation in proteomics and interactomics to facilitate data comparison, exchange and verification. | Microbial biotechnology | Tool info Standards/Databases |
| Unimod | Protein modification for mass spectrometry | | Tool info Standards/Databases |
| UniProt | Comprehensive resource for protein sequence and annotation data | Galaxy Enzymology and biocata... Intrinsically disorder... Single-cell sequencing Structural Bioinformatics Machine actionability | Tool info Standards/Databases Training |
National resources
Tools and resources tailored to users in different countries.
| Tool or resource | Description | Related pages | Registry |
|---|---|---|---|
| Technology Hotels | More than 130 Technology Hotels offer access to high-end technology and expertise in the field of bioimaging, bioinformatics, genomics, medical imaging, metabolomics, phenotyping, proteomics, structural biology, and/or systems biology. | Human data Bioimaging data Researcher Compliance monitoring ... | |