The proteomics domain deals with standard data formats, software tools and data repositories for mass spectrometry-based proteomics data. In proteomics, the relatively wide range of mass spectrometry technologies, devices, protocols, study designs and data analysis approaches poses a particular challenge for the standardized description and storage of data and associated metadata. This circumstance forced the proteomics community to deal with the complex definition of suitable standard data formats relatively early in its history. This encouraged, among other things, the development of software tools that can import and export results in standardized formats, and of data repositories in which proteomics data can be stored in a standardized way. The particular challenge for the proteomics community now is to evolve its achievements in data management to date towards a more complete fulfillment of FAIR research data management and to close the remaining gaps in this regard.
Standard data formats
To make proteomics data interoperable and reproducible from the first to the last mile of proteomics data analysis pipelines, comprehensive metadata accompanying the data is needed. The crucial metadata includes information on study design, proteomics technology, lab protocol, device, device settings and software settings, all of which have an enormous impact on the resulting data. Thus, to enable data reusability in proteomics, appropriate standard data formats are needed.
For different proteomics experiments and different steps of the respective data analysis pipelines there are different kinds of data and metadata that should be recorded. Consequently, the main challenges for data and metadata standardization include:
- What are the definitions of proteomics-specific terms that are needed to describe proteomics experiments?
- What is the minimal information needed to describe a proteomics experiment?
- How should the data and metadata of proteomics raw data and peak lists be stored?
- How should the data and metadata of proteomics identification results be stored?
- How should the data and metadata of proteomics quantification results be stored?
- How can proteomics data and metadata be stored in a simple and human-readable way?
The Human Proteome Organisation (HUPO) Proteomics Standards Initiative (HUPO-PSI), a proteomics community-driven organization, provides several controlled vocabularies and standard data formats, as well as converter and validator software tools. The most important include:
- Controlled vocabularies: PSI-MS, PSI-MI, XLMOD and sepCV, which are provided as OBO files.
- The Minimum Information About a Proteomics Experiment (MIAPE) guidelines document.
- mzML - a standard format for encoding raw mass spectrometer output.
- mzIdentML - a standard exchange format for peptides and proteins identified from mass spectra.
- mzQuantML - a standard format that is intended to store the systematic description of workflows quantifying molecules (principally peptides and proteins) by mass spectrometry.
- mzTab - a tab-delimited text file format to report proteomics and metabolomics results.
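To illustrate why a tab-delimited format such as mzTab is easy to read by both humans and software, the following minimal sketch splits an mzTab-like text into its metadata and result tables using the three-letter line prefixes (for example MTD, PRH and PRT). The example content, the function name `parse_mztab` and the reduced set of columns are assumptions made for illustration only; real mzTab files contain many more mandatory fields, and a dedicated library or the HUPO-PSI validator should be preferred for production use.

```python
from collections import defaultdict

# Hypothetical, heavily simplified mzTab-like content (tab-separated).
# Real files carry many more metadata entries and columns.
EXAMPLE_MZTAB = "\n".join([
    "COM\tIllustrative example only",
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tdescription\tToy data set",
    "PRH\taccession\tdescription",
    "PRT\tP12345\tExample protein A",
    "PRT\tQ67890\tExample protein B",
])

# Header prefix -> prefix of the rows that belong to that header.
HEADER_TO_ROW = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def parse_mztab(text):
    """Return (metadata dict, {row prefix: list of row dicts})."""
    metadata = {}
    headers = {}
    tables = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        prefix, *fields = line.split("\t")
        if prefix == "COM":
            continue  # comment lines carry no data
        if prefix == "MTD":
            if len(fields) >= 2:
                metadata[fields[0]] = fields[1]
        elif prefix in HEADER_TO_ROW:
            # Remember the column names for the matching row type.
            headers[HEADER_TO_ROW[prefix]] = fields
        elif prefix in headers:
            tables[prefix].append(dict(zip(headers[prefix], fields)))
    return metadata, dict(tables)


metadata, tables = parse_mztab(EXAMPLE_MZTAB)
```

After running the sketch, `metadata` holds the key-value pairs from the MTD lines and `tables["PRT"]` holds one dictionary per protein row.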
Processing and analysis of proteomics data
For all steps within a FAIR proteomics data analysis pipeline, software is needed that both imports and exports standard data formats, including all required results and metadata.
- Can your proteomics raw data recorded by a mass spectrometer be stored as an mzML file?
- Is it possible to convert your raw data to mzML?
- Does your search engine support mzML and/or mzIdentML?
- Does your quantification software support mzQuantML or mzTab?
- Within the proteomics community, various converter software tools such as msconvert have been implemented. They support the conversion of mass spectrometer output formats to the mzML standard data format, as well as other conversions to standard data formats.
- Information on software tools that support HUPO-PSI standard data formats can be found on the standard format-specific web pages of the HUPO-PSI (e.g., mzML, mzIdentML and mzTab).
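As a rough sketch of how a raw-to-mzML conversion step could be scripted, the following example assembles an msconvert command line. It assumes that msconvert (part of ProteoWizard) is installed and on the PATH, and it uses only the `--mzML` and `-o` options. The helper names, the file name `sample01.raw` and the output directory `converted` are hypothetical, and the predicted output file name is an assumption based on msconvert's usual behavior of keeping the input file stem.

```python
from pathlib import Path


def build_msconvert_command(raw_file, out_dir):
    """Assemble an msconvert call that writes mzML into out_dir.

    Only builds the argument list; nothing is executed here.
    """
    return ["msconvert", str(raw_file), "--mzML", "-o", str(out_dir)]


def expected_output(raw_file, out_dir):
    """Assumed output path: same file stem as the input, .mzML extension."""
    return Path(out_dir) / (Path(raw_file).stem + ".mzML")


# Hypothetical input file and output directory.
command = build_msconvert_command("sample01.raw", "converted")
target = expected_output("sample01.raw", "converted")
```

On a machine where msconvert is available, the list could be passed to `subprocess.run(command, check=True)`. Keeping command construction separate from execution makes the step easy to log alongside the other software settings that belong to the metadata.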
Preserving and sharing proteomics data
In order to make proteomics data and results worldwide findable and accessible for other researchers and software, FAIR public data repositories are needed.
- How can I find an appropriate proteomics data repository?
- How can I upload my proteomics data into a specific proteomics data repository?
- What are the requirements for my data to be uploaded into a proteomics data repository?
- What are the advantages of uploading data into proteomics data repositories?
- How can public proteomics data be used by other researchers?
- You can find an appropriate data repository via the website of the ProteomeXchange Consortium. ProteomeXchange was established to provide globally coordinated standard data submission and dissemination pipelines involving the main proteomics repositories, and to encourage open data policies in the field. Currently, member repositories include PRIDE, PeptideAtlas, MassIVE, jPOST, iProX and Panorama Public.
- Information on data uploads can be found on proteomexchange.org or on the websites of the particular data repositories. For example, PRIDE uploads are conducted via a submission tool. The requirements that data must meet for upload are repository-specific.
- Advantages of data publication: fulfillment of journal requirements, higher visibility of research, free storage, worldwide accessibility, basic re-analysis by repository-associated tools
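Data sets submitted through ProteomeXchange receive an accession that other researchers can cite and search for. The following minimal sketch collects such accessions from free text, assuming the commonly seen form of `PXD` followed by six digits (for example PXD000001). The pattern, the function name and the example sentence are illustrative assumptions and do not cover every identifier type the repositories may issue.

```python
import re

# Assumed accession shape: "PXD" followed by exactly six digits.
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_accessions(text):
    """Return the unique PXD-style accessions found in text, sorted."""
    return sorted(set(PXD_PATTERN.findall(text)))


# Hypothetical data availability statement.
statement = (
    "Data are available via ProteomeXchange with identifier PXD000001 "
    "(see also PXD000001 and PXD012345)."
)
accessions = find_pxd_accessions(statement)
```

A list obtained this way could then be looked up manually on proteomexchange.org or in the member repository that hosts the data set.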
Relevant tools and resources
|Tool or resource||Description||Related pages||Registry|
|BIONDA||BIONDA is a free and open-access biomarker database, which employs various text mining methods to extract structured information on biomarkers from abstracts of scientific publications||Data storage Researcher Human data Proteomics||bio.tools|
|CalibraCurve||A highly useful and flexible tool for calibration of targeted MS-based measurements. CalibraCurve enables an automated batch-mode determination of dynamic linear ranges and quantification limits for both targeted proteomics and similar assays. The software uses a variety of measures to assess the accuracy of the calibration and provides intuitive visualizations.||Data analysis Proteomics||bio.tools|
|Human Protein Atlas||The Human Protein Atlas contains information for a large majority of all human protein-coding genes regarding the expression and localization of the corresponding proteins based on both RNA and protein data.||Proteomics||FAIRsharing TeSS|
|OmicsDI||Omics Discovery Index (OmicsDI) provides a knowledge discovery framework across heterogeneous omics data (genomics, proteomics, transcriptomics and metabolomics)||Existing data Proteomics||bio.tools FAIRsharing TeSS|
|PAA||PAA is an R/Bioconductor tool for protein microarray data analysis aimed at biomarker discovery.||Data analysis Researcher Human data Proteomics||bio.tools|
|PIA - Protein Inference Algorithms||PIA is a toolbox for mass spectrometry-based protein inference and identification analysis.||Data analysis Researcher Proteomics||bio.tools|
|PRIDE||PRoteomics IDEntifications (PRIDE) Archive database||Proteomics||bio.tools FAIRsharing TeSS|
|ProteomeXchange||ProteomeXchange provides globally coordinated standard data submission and dissemination pipelines||Proteomics||bio.tools FAIRsharing TeSS|
|Proteomics Standards Initiative||The HUPO Proteomics Standards Initiative defines community standards for data representation in proteomics and interactomics to facilitate data comparison, exchange and verification.||Proteomics|
|STRING||Known and predicted protein-protein interactions.||Proteomics||bio.tools FAIRsharing TeSS|
|UniProt||Comprehensive resource for protein sequence and annotation data||Documentation and metadata Researcher Intrinsically disordered proteins Microbial biotechnology Proteomics||bio.tools FAIRsharing TeSS|