Your domain: Plant sciences
Data management challenges in plant sciences
The plant science domain includes studying the adaptation of plants to their environment, with applications ranging from improving crop yield or resistance to environmental conditions, to managing forest ecosystems. Data integration and reuse facilitate understanding of the interplay between genotype and environment that produces a phenotype, which requires integrating phenotyping experiments and genomic assays made on the same plant material with geo-climatic data. Moreover, cross-species comparisons are often necessary to understand the mechanisms behind phenotypic traits, especially at the genotypic level, due to the gap in genomic knowledge between well-studied plant species (namely Arabidopsis) and newly sequenced ones.
The challenges to data integration stem from the multiple levels of heterogeneity in this domain. It encompasses a variety of species, ranging from model organisms, to crop species, to wild plants such as forest trees. These often need to be detailed at infra-specific levels (e.g. subspecies, variety), but naming at these levels sometimes lacks consensus. Studies can take place in a diversity of settings, including indoor (e.g. growth chamber, greenhouse) and outdoor settings (e.g. cultivated field, forest), which differ fundamentally in the requirements and manner of characterizing the environment. Phenotypic data can be collected manually or automatically (by sensors and drones), and be very diverse in nature, spanning physical measurements, the results of biochemical assays, and images. Some omics data can also be considered molecular phenotypes (e.g. transcriptomes, metabolomes, …). Thus the extent and depth of metadata required to describe a plant experiment in a FAIR-compliant way is very demanding for researchers.
Another particularity of this domain is the absence of central deposition databases for certain important data types, in particular data deriving from plant phenotyping experiments. Whereas datasets from plant omics experiments are typically deposited in global deposition databases for that type of experiment, those from phenotyping experiments remain in institutional or at best national repositories. This makes it difficult to find, access and interconnect plant phenotyping data.
Plant biological materials: (meta)data collection and sharing
Plant genetic studies, such as genomic-based prediction of phenotypes, require the integration of genomic and phenotypic data with data about the plants' environment. While phenotypic and environmental data are typically stored together in phenotyping databases, genomic and other types of molecular data are typically deposited in international deposition databases, for example those of the International Nucleotide Sequence Database Collaboration (INSDC).
It can be challenging to integrate phenotypic and molecular data even within a single project, particularly if the project involves studying a panel of genetic resources in different conditions. It is paramount to maintain the link between the plant material in the field, the samples extracted from them (e.g. at different development stages), and the results of omics experiments (e.g. transcriptomics, metabolomics) performed on those samples, across all datasets that will be generated and published.
Integrating phenotyping and molecular data, both within and between studies, hinges entirely on precise identification of the plant material under study (down to the variety, or even the seed lot), as well as of the samples that are collected from these plants.
- Are you working with established plant varieties, namely crop plants?
  - Can you trace their provenance to a genebank accession?
  - Are they identified in a germplasm database with an accession number?
- Are you working with crosses of established plant varieties?
  - Can you trace the genealogy of the crosses to plant varieties from a genebank or identified in a germplasm database?
- Are you working with experimental material?
  - Can you trace a genealogy to other material?
  - How do you unambiguously identify your material?
Identification of plant biological materials
- Detailed metadata needs to be captured on the biological materials used in the study—the accession in the genebank or the experimental identification and, when applicable, the seed lots or the parent plants as well as the possible samples taken from the plant—as they are the key to integrating omics and phenotyping datasets.
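One way to keep the chain from plant material to sample to omics result checkable is to store the parent identifier on every record and test for dangling references. The following is a minimal sketch under that assumption; the class names, field names, and identifiers are hypothetical and are not taken from MIAPPE, MCPD, or any repository schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PlantMaterial:
    material_id: str                 # genebank accession or experimental ID
    seed_lot: Optional[str] = None   # seed lot, when applicable


@dataclass
class Sample:
    sample_id: str
    material_id: str                 # plant material the sample was taken from
    development_stage: str = ""


@dataclass
class OmicsResult:
    result_id: str
    sample_id: str                   # sample the assay was performed on
    assay_type: str = ""


def find_broken_links(
    materials: List[PlantMaterial],
    samples: List[Sample],
    results: List[OmicsResult],
) -> Tuple[List[str], List[str]]:
    """Return (sample IDs without a known material, result IDs without a known sample)."""
    material_ids = {m.material_id for m in materials}
    sample_ids = {s.sample_id for s in samples}
    orphan_samples = [s.sample_id for s in samples if s.material_id not in material_ids]
    orphan_results = [r.result_id for r in results if r.sample_id not in sample_ids]
    return orphan_samples, orphan_results


# Hypothetical identifiers, for illustration only.
materials = [PlantMaterial("ACC-0001", seed_lot="LOT-2021-A")]
samples = [
    Sample("S1", "ACC-0001", "flowering"),
    Sample("S2", "ACC-9999", "seedling"),     # points at an unknown material
]
results = [
    OmicsResult("R1", "S1", "transcriptomics"),
    OmicsResult("R2", "S3", "metabolomics"),  # points at an unknown sample
]

orphan_samples, orphan_results = find_broken_links(materials, samples, results)
print(orphan_samples, orphan_results)
```

Running such a check before publication can reveal datasets whose samples can no longer be traced back to identified plant material.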
Checklists and metadata standard
- The identification and description of plant materials should comply with the standard for the identification of plant genetic resources, the Multi-Crop Passport Descriptors (MCPD).
- If you are studying experimental plant materials that cannot be traced to an existing genebank or germplasm database, you should describe them in accordance with the MCPD in as much detail as possible.
- If your plant materials can be traced to an existing genebank or germplasm database, you need only to cross reference to the MCPD information already published in the genebank or germplasm database.
- The minimal fields from MCPD are listed in the Biological Material section of the Minimum Information About Plant Phenotyping Experiments (MIAPPE) metadata standard.
- For wild plants and accessions from tree collections, precise identification often requires the GPS coordinates of the tree. MIAPPE provides the necessary fields.
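The recommendations above can be turned into a simple completeness check on a biological material record. This is a sketch only: the dictionary keys are illustrative paraphrases rather than the official MIAPPE or MCPD field names, the set of keys treated as required is an assumption, and the identifiers are hypothetical. The MIAPPE checklist remains the reference for which fields are mandatory.

```python
from typing import Dict, List

# Assumption for illustration: keys treated as required for every material.
REQUIRED_KEYS = ("biological_material_id", "organism")
# Extra keys expected for wild plants and trees identified by their GPS position.
COORDINATE_KEYS = ("latitude", "longitude")


def missing_fields(record: Dict[str, object], needs_coordinates: bool = False) -> List[str]:
    """List the required keys that are absent or empty in a material record."""
    required = REQUIRED_KEYS + (COORDINATE_KEYS if needs_coordinates else ())
    return [key for key in required if record.get(key) in (None, "")]


# Material traceable to a genebank: a cross reference to the accession is enough.
genebank_material = {
    "biological_material_id": "ACC-0001",      # hypothetical accession
    "organism": "Arabidopsis thaliana",
    "material_source_id": "ACC-0001",
}

# Tree from a collection: coordinates are part of its identification.
forest_tree = {
    "biological_material_id": "TREE-042",      # hypothetical identifier
    "organism": "Quercus robur",
    "latitude": 48.1,
}

print(missing_fields(genebank_material))
print(missing_fields(forest_tree, needs_coordinates=True))
```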
Tools for (meta)data collection
- For identifying your plant material in a plant genetic resource repository (genebank or germplasm database), you can consult the European Cooperative Programme for Plant Genetic Resources (ECPGR), which includes a central germplasm database and a catalogue of relevant external databases.
- Other key databases for identifying plant material are:
  - the European Search Catalogue for Plant Genetic Resources (EURISCO), which provides information about more than 2 million accessions of crop plants and their wild relatives, from hundreds of European institutes in 43 member countries;
  - Genesys, an online platform with a search engine for Plant Genetic Resources for Food and Agriculture (PGRFA) conserved in genebanks worldwide.
- The “Biological Material” section of the MIAPPE checklist deals with sample description.
(Meta)Data sharing and publication
- For identifying samples from which molecular data was produced, the BioSamples database is recommended as a provider of international unique identifiers.
- The plant-miappe.json model provided by BioSamples is aligned with all recommendations provided above for plant identification and is therefore recommended for your sample submission.
- It is also recommended that you provide permanent access to a description of the project or study, that contains links to all the data, molecular or phenotypic. Several databases are recommended for this purpose including:
Phenotyping: (meta)data collection and publication
Archiving, sharing, and publication of plant phenotyping data can be challenging, given that there is no global centralized archive for this type of data. Research projects often involve multiple partners, some of which collate data into their own (institutional) data management platforms, whereas others collate data into Excel spreadsheets.
For researchers, it is highly desirable that the datasets collected in different media by the partners in a research project (or across different collaborative projects) can be shared in a way that enables their integration, for collective analysis and for facilitating deposition into a dedicated repository. For managers of plant phenotyping data repositories that support a project or institution, it is essential to ensure that the uptake of data is easy and includes a step of metadata validation upon intake.
It is recommended that metadata collection is planned from the start of the experiment and that the working environment facilitates (meta)data collection, storage, and validation throughout the project. In field studies, it is critical to record the geographical coordinates and time of the experiment, for linkage with geo-climatic data. For all study types (field, growth chamber or greenhouse), the environmental conditions that were measured should be described in detail.
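Because location and time are what allow a field study to be linked with geo-climatic data, it can help to check them at the moment they are recorded. The sketch below assumes decimal-degree coordinates and an ISO 8601 date string; the function name and the example values are hypothetical.

```python
from datetime import datetime
from typing import List


def check_study_location_time(latitude: float, longitude: float, start_date: str) -> List[str]:
    """Return a list of problems found in the location and time of a study."""
    problems = []
    if not -90 <= latitude <= 90:
        problems.append(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        problems.append(f"longitude out of range: {longitude}")
    try:
        # Accepts ISO 8601 strings such as 2021-04-15 or 2021-04-15T08:30:00.
        datetime.fromisoformat(start_date)
    except ValueError:
        problems.append(f"start date is not ISO 8601: {start_date}")
    return problems


print(check_study_location_time(48.85, 2.35, "2021-04-15"))
print(check_study_location_time(148.0, 2.35, "15/04/2021"))
```

An empty list means that the values are at least well formed; it does not confirm that they describe the actual site or date of the experiment.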
- Did you collect the metadata for the identification of your plant material according to the recommendation provided in the above section?
- Have you documented your phenotyping and environment assays (i.e. measurement or computation methodology based on the trait, method, scale triplet) both for direct measures (data collection) and computed data (after data processing or analysis)?
- Is there an existing Crop Ontology for the species you are studying, and does it describe your assay? If not, have you described your data following the trait, method, scale triplet?
- Do you have your own system to collect data and is it compliant with the MIAPPE standard?
- Are you exchanging data with individual researchers?
  - In what media is data being collected?
  - Is the data described in a MIAPPE-compliant manner?
- Are you exchanging data across different data management platforms?
  - Do these platforms implement the Breeding API (BrAPI) specification?
  - If not, are they MIAPPE-compliant and do they enable automated data exchange?
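The trait, method, scale triplet mentioned in the questions above can be represented as a small record, so that each measured or computed variable is documented in the same way. This is an illustrative sketch: the class and attribute names are assumptions, and the example variable is hypothetical rather than an actual Crop Ontology term.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedVariable:
    trait: str             # what is observed, e.g. plant height
    method: str            # how it is measured or computed
    scale: str             # unit or scale of the recorded values
    ontology_id: str = ""  # Crop Ontology term ID, if one exists for the species

    def is_complete(self) -> bool:
        """True when all three parts of the triplet are filled in."""
        return all(part.strip() for part in (self.trait, self.method, self.scale))

    def label(self) -> str:
        return f"{self.trait} | {self.method} | {self.scale}"


height = ObservedVariable(
    trait="Plant height",
    method="Ruler measurement from soil surface to top of plant",
    scale="cm",
)
undocumented = ObservedVariable(trait="Leaf area", method="", scale="cm2")

print(height.label(), height.is_complete())
print(undocumented.is_complete())
```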
Checklists and ontologies
- The metadata standard applicable to plant phenotyping experiments is MIAPPE.
  - There is a section dedicated to the identification of plant biological materials that follows the Multi-Crop Passport Descriptors (MCPD) described above.
  - There is a section to describe the phenotyping assays based on the Crop Ontology recommendations.
  - There is a section describing the type of experiment (greenhouse, field, etc.), and it is advisable to collect the location (geographical coordinates) and the time at which it was performed, for linkage with geo-climatic data.
  - Other sections include description of investigations, studies, people involved, data files, environmental parameters, experimental factors, events, observed variables.
- Tools and resources for data collection and management:
  - FAIRDOM-SEEK is a free data management platform for which MIAPPE templates are in development.
  - Dataverse is a free data management platform for which MIAPPE templates are in development. It is used in several repositories such as Recherche Data Gouv.
  - e!DAL is a free data management platform for which MIAPPE templates are in development.
  - The ISA-Tools also include a configuration for MIAPPE and can be used both for filling in metadata and for validating it.
  - Collaborative Open Plant Omics (COPO) is a data management platform specific to the plant sciences.
  - FAIRsharing is a manually curated registry of reporting guidelines, vocabularies, identifier schemes, models, formats, repositories, knowledge bases, and data policies that includes many resources relevant for managing plant phenotyping data.
- Validation of MIAPPE compliance can be done via ISA-Tools or upon data deposition in a Breeding API (BrAPI) compliant repository.
- If you or your partners collect data manually, it is critical to adopt a spreadsheet template that is compatible with the structure of the database that will be used for data deposition.
  - If the database is MIAPPE compliant, you can use the MIAPPE-compliant spreadsheet template.
  - This template could make use of tools for handling ontology annotations in a spreadsheet, such as RightField or OntoMaton.
- If you or your partners collect data into data management platforms:
  - If a platform implements BrAPI, you can exchange data using BrAPI calls.
  - If it does not implement BrAPI, the simplest solution would be to export data into the MIAPPE spreadsheet template, or another formally defined data template.
- For data deposition, it is highly recommended that you opt for one of the many repositories that implement BrAPI. These repositories enhance findability through the ELIXIR plant data discovery service, FAIR Data-finder for Agronomic Research (FAIDARE), and enable machine-actionable access to MIAPPE-compliant data as well as validation of that compliance.
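As an illustration of what exchanging data using BrAPI calls can look like, the sketch below builds a request URL for a BrAPI v2 endpoint and reads one page of a response. The base URL and the response content are hypothetical, and no network request is made; the response layout (a `metadata.pagination` block and a `result.data` list) follows the general BrAPI v2 response structure, but the specification and the target server's documentation should be checked for the endpoints and filters actually supported.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode


def brapi_url(base_url: str, endpoint: str, **params) -> str:
    """Build the URL of a BrAPI v2 call, skipping parameters set to None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base_url.rstrip('/')}/brapi/v2/{endpoint.strip('/')}"
    return f"{url}?{query}" if query else url


def extract_page(response_text: str) -> Tuple[List[dict], bool]:
    """Return the records of one response page and whether more pages remain."""
    payload = json.loads(response_text)
    pagination = payload["metadata"]["pagination"]
    records = payload["result"]["data"]
    has_more = pagination["currentPage"] + 1 < pagination["totalPages"]
    return records, has_more


# Hypothetical server; replace with the base URL of a BrAPI-compliant repository.
url = brapi_url("https://example.org/api/", "studies", page=0, pageSize=2)
print(url)

# Canned response standing in for what such a server might return.
canned = json.dumps({
    "metadata": {"pagination": {"currentPage": 0, "pageSize": 2,
                                "totalCount": 3, "totalPages": 2}},
    "result": {"data": [
        {"studyDbId": "st1", "studyName": "Field trial A"},
        {"studyDbId": "st2", "studyName": "Greenhouse trial B"},
    ]},
})
records, has_more = extract_page(canned)
print([r["studyDbId"] for r in records], has_more)
```

In practice the URL would be fetched with an HTTP client and the call repeated with an increasing `page` value until no pages remain.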
Genotyping: (meta)data collection and publication
This section describes the mandatory, recommended and optional metadata fields for data interoperability and reuse, as well as for data deposition in the European Variation Archive (EVA), EMBL-EBI’s open-access genetic variation archive, which is connected to BioSamples, described above.
- Did you collect the metadata for the identification of your plant samples according to the recommendations provided in the above section?
- Is the reference genome assembly available in an INSDC archive, and does it have a Genome Collections Accession number, either GCA or GCF?
- Is the analytic approach used for creating the VCF file described in a publication that has a Digital Object Identifier (DOI)?
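The identifiers asked for above have recognizable shapes, so a quick format check can catch typing errors before submission. The patterns below are assumptions based on the commonly seen forms of these identifiers (`GCA_` or `GCF_` followed by nine digits and a version, and DOIs starting with `10.`); they only test the format and do not confirm that the accession or DOI exists. The example values are used purely to show the expected shape.

```python
import re

# Assumed shapes, for format checking only.
ASSEMBLY_ACCESSION = re.compile(r"GC[AF]_\d{9}\.\d+")
DOI = re.compile(r"10\.\d{4,9}/\S+")


def looks_like_assembly_accession(value: str) -> bool:
    """True for strings shaped like GCA_000000000.1 or GCF_000000000.1."""
    return ASSEMBLY_ACCESSION.fullmatch(value) is not None


def looks_like_doi(value: str) -> bool:
    """True for strings shaped like a bare DOI, without a resolver prefix."""
    return DOI.fullmatch(value) is not None


print(looks_like_assembly_accession("GCA_000001735.2"))
print(looks_like_assembly_accession("GCA_1735"))
print(looks_like_doi("10.1234/example.doi"))
print(looks_like_doi("doi:10.1234/example.doi"))
```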
Checklists, ontologies and file formats
- Sharing plant genotyping data files involves the use of the Variant Call Format (VCF) standard.
- The findability and reusability of VCF files depend on the supplied metadata, in particular on a MIAPPE-compliant description of the biological material. The plant genomic and genetic variation data submission recipe provides guidance on this topic.
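Since the link between a VCF file and its biological material runs through the sample names, it can be useful to inspect the header before submission. The sketch below reads simple `##key=value` header lines and the sample columns of the `#CHROM` line. It assumes a project convention in which VCF sample names are BioSamples accessions, which is not something the VCF format itself requires; the accession pattern and all example values are likewise assumptions for illustration.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed shape of a BioSamples accession (e.g. SAMEA..., SAMN..., SAMD...).
BIOSAMPLES_ACCESSION = re.compile(r"SAM[NED][A-Z]?\d+")


def read_vcf_header(lines: Iterable[str]) -> Tuple[Dict[str, str], List[str]]:
    """Collect simple ##key=value metadata and the sample names of a VCF header."""
    meta: Dict[str, str] = {}
    samples: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            key, sep, value = line[2:].partition("=")
            if sep and not value.startswith("<"):   # skip structured lines such as ##INFO=<...>
                meta[key] = value
        elif line.startswith("#CHROM"):
            # Eight fixed columns plus FORMAT come first; samples start at the tenth column.
            samples = line.split("\t")[9:]
            break
    return meta, samples


# Hypothetical header, for illustration only.
header = [
    "##fileformat=VCFv4.3",
    "##reference=GCA_000001735.2",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMEA0000001\tplant_17",
]

meta, samples = read_vcf_header(header)
not_accessions = [s for s in samples if not BIOSAMPLES_ACCESSION.fullmatch(s)]
print(meta)
print(samples, not_accessions)
```

Sample names reported as not matching would need to be mapped to registered samples in the submission metadata, according to the repository's own instructions.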
Data sharing and publication
- Once the VCF file is ready with all necessary metadata, it can be submitted to the European Variation Archive (EVA). You will find all necessary information on the submission steps on the EVA submission page.
Relevant tools and resources
|Tool or resource||Description||Related pages||Registry|
|AgroPortal||Browser for ontologies for agricultural science based on NCBO BioPortal.||Plant Phenomics Documentation and metadata||Tool info Standards/Databases|
|BioSamples||BioSamples stores and supplies descriptions and metadata about biological samples used in research and development by academia and industry.||Plant Genomics Plant Phenomics||Tool info Standards/Databases Training|
|BioStudies||A database hosting datasets from biological studies. Useful for storing or accessing data that is not compliant for mainstream repositories.||Microbial biotechnology Data publication||Tool info Standards/Databases Training|
|BrAPI||Specification for a standard API for plant data: plant material, plant phenotyping data||Data Steward: infrastructure Plant Phenomics||Training|
|COPO||Portal for scientists to broker more easily rich metadata alongside data to public repos.||Documentation and metadata Researcher Machine actionability Plant Phenomics Plant Genomics||Tool info Standards/Databases|
|Crop Ontology||The Crop Ontology compiles concepts to curate phenotyping assays on crop plants, including anatomy, structure and phenotype.||Researcher Data Steward: research Data Steward: infrastructure Plant Phenomics||Standards/Databases Training|
|Data INRAE||Dataverse for life sciences and agronomic related data||Plant Genomics Researcher Data Steward: research Plant Phenomics||Standards/Databases|
|e!DAL-PGP||Plant Genomics and Phenomics Research Data Repository||Plant Genomics Researcher Data Steward: research Data Steward: infrastructure Data publication Documentation and metadata Plant Phenomics||Standards/Databases|
|ECPGR||Hub for the identification of plant genetic resources in Europe||Researcher Data Steward: research|
|Ensembl Plants||Open-access database of full genomes of plant species.||Plant Genomics||Standards/Databases Training|
|EURISCO||European Search Catalogue for Plant Genetic Resources||Researcher Data Steward: research Plant Phenomics||Tool info|
|FAIDARE||FAIDARE is a tool for searching data across distinct databases that implement BrAPI.||Researcher Data Steward: research IFB Plant Phenomics Plant Genomics||Tool info|
|GnpIS||A multispecies integrative information system dedicated to plant and fungi pests. It allows researchers to access genetic, phenotypic and genomic data. It is used by both large international projects and the French National Research Institute for Agriculture, Food and Environment.||Plant Phenomics||Tool info Standards/Databases|
|ISA4J||Open source software library that can be used to generate an ISA-TAB export from in-house data sets. These comprise, e.g., local databases or local file-system-based experimental data.||Machine actionability Plant Phenomics||Tool info|
|MIAPPE||Minimum Information About a Plant Phenotyping Experiment||Documentation and metadata Researcher Data Steward: research Plant Genomics Plant Phenomics||Standards/Databases Training|
|Multi-Crop Passport Descriptor (MCPD)||The Multi-Crop Passport Descriptor is the metadata standard for plant genetic resources maintained ex situ by genebanks.||Documentation and metadata Researcher Data Steward: infrastructure Data Steward: policy Plant Phenomics Plant Genomics||Standards/Databases Training|
|PHIS||The open-source Phenotyping Hybrid Information System (PHIS) manages and collects data from plant phenotyping and high-throughput phenotyping experiments on a day-to-day basis.||Plant Phenomics IFB||Training|
|PLAZA||Access point for plant comparative genomics, centralizing genomic data produced by different genome sequencing initiatives.||Plant Genomics Researcher||Standards/Databases Training|
|PIPPA||PIPPA, the PSB Interface for Plant Phenotype Analysis, is the central web interface and database that provides the tools for the management of the plant imaging robots on the one hand, and the analysis of images and data on the other hand.||Plant Phenomics Data Steward: research Researcher Data Steward: infrastructure||Tool info|
| ||Data Management Plan (DMP) generator that focuses on plant science.||Data management plan||