Your domain: Bioimaging data
Introduction
Bioimaging specialists are acquiring an ever-growing amount of data: images, associated metadata, etc. However, image data management often does not receive the attention it requires, or is avoided altogether because it is considered a burdensome task. At the same time, storing images on personal computers or USB keys is no longer an option, assuming it ever was! Data volume is increasing exponentially, and it is not only the acquired images that need storing: processed images will also be generated and will need to be kept alongside the originals. It is critical to proactively identify where the data will be stored, for how long, who will cover the cost of the hardware, and who will cover the cost of managing the infrastructure. All stakeholders (biologists, facility managers, data analysts, IT support, etc.) need to be involved in the preliminary discussions to ensure that the requirements are understood and met.
What constitutes bioimage data
An image is much more than a collection of zeros and ones. The image will contain the binary data representing the pixels on screen, but it is usually also packed with useful metadata. You will find the obvious keys indicating how to interpret the zeros and ones, as well as a lot of acquisition metadata, e.g. the hardware/instrument used, the settings used, etc.
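To illustrate why the keys that say how to interpret the zeros and ones matter, here is a minimal sketch using only the Python standard library. The four bytes are made up for the example; the point is that the same bytes give different pixel values depending on the pixel type and byte order recorded in the metadata.

```python
import struct

# Four raw bytes as they might sit in an image file (made-up values).
raw = bytes([0x01, 0x00, 0x02, 0x00])

# Without metadata the bytes are ambiguous: each reading below is valid.
as_uint8 = list(raw)                            # four 8-bit pixels
as_uint16_le = list(struct.unpack("<2H", raw))  # two 16-bit pixels, little-endian
as_uint16_be = list(struct.unpack(">2H", raw))  # two 16-bit pixels, big-endian

print(as_uint8)      # [1, 0, 2, 0]
print(as_uint16_le)  # [1, 2]
print(as_uint16_be)  # [256, 512]
```

Only the metadata stored with the image tells a reader which of these interpretations is the right one.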
The number of proprietary image formats is very large and keeps increasing. It is challenging to support so many proprietary file formats, i.e. to read them and extract their metadata. The Bio-Formats library currently supports over 150 different file formats. The Dataset Structure Table shows the extension of the files to read and indicates the structure of the image itself, e.g. single file, multiple files, one image file and a companion file, etc.
Data management challenges
The number of files and their size could be extremely large. Deleting/misplacing a file could invalidate the study itself, preventing its reuse.
Managing images quickly becomes a larger problem: not only the binary files need to be handled, but also the associated metadata. Several efforts have been made, and are still ongoing, to capture those metadata. Understanding and capturing the metadata is critical for many reasons, to mention just a few: analysis and detection of possible faults in acquisition systems. It is important to decide how much detail will be recorded, since this could dramatically increase the metadata volume and therefore the effort required to capture it.
The collection of images could be:
- data acquired within a facility;
- data acquired in another facility (commissioned work or external guest user) and “transported” by the users to their own facility;
- slides scanned.
After acquisition, data are usually moved to more permanent storage with different levels of permissions. This depends on the facility policies and could prevent collaborative work. Users will also adopt their own “organisation” conventions, which could make it very difficult to find or understand the data when, for example, the data are migrated to a new location or the researcher who acquired the data leaves the lab.
Standard (meta)data formats
Description
Unlike other domains, the bioimaging community has not yet agreed on a single standard data format which is generated by all acquisition systems. Instead, the images described above are most frequently collected in proprietary file formats (PFFs) defined by hardware vendors. Currently, there are several hundred such formats that the researchers may encounter. These formats combine critical acquisition metadata with the multidimensional binary data but are often optimized for quickly writing the data to disk. Tools and strategies are outlined below to ease working with this data.
Considerations
- When purchasing a microscope, consider carefully how the resulting files will be processed. If open source tools will be used, proprietary file formats may require a time-consuming conversion. Discuss with your vendor if an open format is available.
- If data from multiple vendors is to be combined, a similar conversion may be necessary to make the data comparable.
- Imaging data brings special considerations due to the large, often continuous nature of the data. Single terabyte-scale files are not uncommon. Sharing these can require special infrastructure, like a data management server (described below) or a cloud-native format (described below). One goal of such infrastructure is to enable the selective (i.e. interactive) zooming of your image data without the need to download the entire volume, thereby reducing your internet bandwidth and costs.
- Importantly, most acquisition systems produce proprietary file formats. Understanding how well they are supported by the imaging community could be a key factor in a successful study. Will it be possible to analyse or view the images using open-source software? Will it be possible to deposit the images in public repositories when published? The choice of proprietary file format could prevent the use of any tools that are not tied to the acquisition system.
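The scale mentioned in the considerations above can be estimated with simple arithmetic. The sketch below assumes a hypothetical acquisition (the dimensions and pixel depth are invented for illustration) and compares the uncompressed size of the whole dataset with what a viewer needs in order to draw a single region.

```python
def raw_size_bytes(size_x, size_y, size_z=1, size_c=1, size_t=1, bytes_per_pixel=2):
    """Uncompressed size in bytes of an image with up to five dimensions."""
    return size_x * size_y * size_z * size_c * size_t * bytes_per_pixel


# Hypothetical acquisition: 2048 x 2048 pixels, 500 planes,
# 2 channels, 300 time points, 16-bit pixels.
full = raw_size_bytes(2048, 2048, size_z=500, size_c=2, size_t=300, bytes_per_pixel=2)

# What a viewer needs to draw one 1024 x 1024 region of a single plane.
viewport = raw_size_bytes(1024, 1024, bytes_per_pixel=2)

print(f"full volume:  {full / 1e12:.2f} TB")      # full volume:  2.52 TB
print(f"one viewport: {viewport / 1e6:.2f} MB")   # one viewport: 2.10 MB
```

The gap between the two numbers is why infrastructure that serves only the requested region is preferable to downloading the entire volume.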
Solutions
Vendor libraries: Some vendors provide open source libraries for parsing their proprietary file formats. See libCZI from Zeiss.
Open source translators: Members of the community have developed multi-format translators that can be used to access your data on the fly, i.e. the original format is preserved and no file is written to disk. This implies that you will need to perform this translation each time you access your data and, depending on the size of the image(s), you could run out of memory. Translation libraries include:
- Bio-Formats (Java) - supports over 150 file formats
- OpenSlide (C++) - primarily for whole-slide imaging (WSI) formats
- aicsimageio (Python) - wraps vendor libraries and Bio-Formats to support a wide range of formats in Python
Permanent conversion: An alternative is to permanently convert your data to an open format:
- OME-Files - The Open Microscopy Environment (OME) has developed an open format, “OME-TIFF”, to which you can convert your data. The Bio-Formats library (above) comes with a command-line tool, bfconvert, that can be used to convert files to OME-TIFF.
- The bioformats2raw and raw2ometiff toolchain provided by Glencoe Software allows more performant conversion of your data, but requires an extra intermediate copy of the data. If you have the space available, this toolchain is also an option to consider.
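The two conversion routes can be scripted. The sketch below only assembles the command lines in their basic "input then output" form and runs them if the tools are found on your PATH. The file names and helper function names are hypothetical, and each tool's own documentation should be checked for the options relevant to your data.

```python
import shutil
import subprocess
from pathlib import Path


def bfconvert_command(src, dst):
    """Single-step conversion with the Bio-Formats command-line tool."""
    return ["bfconvert", str(src), str(dst)]


def two_step_commands(src, intermediate, dst):
    """bioformats2raw writes the intermediate copy; raw2ometiff turns it into OME-TIFF."""
    return [
        ["bioformats2raw", str(src), str(intermediate)],
        ["raw2ometiff", str(intermediate), str(dst)],
    ]


def run_all(commands):
    """Run each command in order; fail early if a tool is missing or exits with an error."""
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} is not on PATH")
        subprocess.run(cmd, check=True)


# Hypothetical file names, for illustration only.
src = Path("experiment.czi")
single = [bfconvert_command(src, src.with_suffix(".ome.tiff"))]
staged = two_step_commands(src, "experiment_raw.zarr", "experiment.ome.tiff")
print(single)
print(staged)
# run_all(single)  # uncomment once the tools are installed
```

Note that the two-step route needs enough free space for the intermediate copy in addition to the final file.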
Cloud (or “object”) storage: If you are storing your data in the cloud, you will likely need a different file format since most current image file formats are not suitable for cloud storage. OME is currently developing a next-generation file format (NGFF) that you can use.
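One idea behind cloud-friendly storage is to split an image into chunks that can each be fetched on their own. The sketch below assumes a single 2D plane cut into fixed-size chunks; the function name, image size and chunk size are hypothetical, and it illustrates the general principle rather than the NGFF layout itself.

```python
def chunks_for_region(y0, y1, x0, x1, chunk_y, chunk_x):
    """Return (row, col) indices of the chunks overlapping [y0, y1) x [x0, x1)."""
    rows = range(y0 // chunk_y, (y1 - 1) // chunk_y + 1)
    cols = range(x0 // chunk_x, (x1 - 1) // chunk_x + 1)
    return [(r, c) for r in rows for c in cols]


# Hypothetical 100,000 x 100,000 pixel plane stored in 1024 x 1024 chunks.
plane = 100_000
chunk = 1024
chunks_per_axis = -(-plane // chunk)   # ceiling division
total_chunks = chunks_per_axis ** 2

# Region a user is currently looking at.
needed = chunks_for_region(50_000, 50_800, 20_000, 21_500, chunk, chunk)

print(needed)  # [(48, 19), (48, 20), (49, 19), (49, 20)]
print(f"{len(needed)} of {total_chunks} chunks")  # 4 of 9604 chunks
```

Only the chunks that overlap the region have to be requested from the object store, which is what makes interactive viewing of very large images practical.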
Metadata: If metadata are stored separately from the image data, the format of the metadata should follow the subject-specific standards regarding the schema, vocabulary or ontologies and storage format used such as:
- OME model: XML-based representation of microscopy data.
- Quality assessment and Reproducibility in Light Microscopy (QUAREP-LiMi).
- REMBI.
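As an illustration of what an XML-based representation looks like in practice, the sketch below reads the pixel dimensions from a small, hand-written OME-XML fragment using the Python standard library. The fragment is invented for the example and is far smaller than what real files carry; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, minimal OME-XML fragment for illustration only.
OME_XML = """<OME xmlns="http://www.openmicroscopy.org/Schemas/OME/2016-06">
  <Image ID="Image:0" Name="example">
    <Pixels ID="Pixels:0" DimensionOrder="XYZCT" Type="uint16"
            SizeX="512" SizeY="512" SizeZ="10" SizeC="2" SizeT="1"
            PhysicalSizeX="0.65" PhysicalSizeY="0.65"/>
  </Image>
</OME>"""

NS = {"ome": "http://www.openmicroscopy.org/Schemas/OME/2016-06"}


def pixel_summary(xml_text):
    """Return name, pixel type and dimensions for each Image element."""
    root = ET.fromstring(xml_text)
    summary = []
    for image in root.findall("ome:Image", NS):
        pixels = image.find("ome:Pixels", NS)
        summary.append({
            "name": image.get("Name"),
            "type": pixels.get("Type"),
            "shape": {axis: int(pixels.get(f"Size{axis}")) for axis in "XYZCT"},
        })
    return summary


print(pixel_summary(OME_XML))
```

Because the metadata follow a shared schema, the same few lines work regardless of which instrument produced the image.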
(Meta)Data collection
Description
The acquisition of bioimaging data takes place in various environments. The (usually) light or electron microscope may be in a core facility, in a research lab or even remotely in a different institution. Regardless of where the instrument is located, the acquired imaging data are likely to be stored, at least temporarily, on a local, vendor-specific PC next to the acquisition system because of their complexity and size. This is often unavoidable in order to store the data securely as quickly as the acquisition process produces them.
Due to the scale of data, keeping track of the image data and the associated data and metadata is essential, particularly in life sciences and medical fields. Organising, storing, sharing, publishing image data and metadata can be very challenging.
Considerations
- Consider using an image management software platform. Image management software platforms offer a way to centralize, organize, view, distribute and track all of your digital images and photos. They allow you to take control over how your images are managed, used and shared within research groups.
- When evaluating an image management software platform, check if it allows you to:
  - Control the access you wish to give to your data and how you wish to work, e.g. only the PI can view and annotate your data, or you can choose to work on a project with some collaborators.
  - Access data from anywhere via Web or Desktop clients and an API.
  - Store the metadata with your images. For example, analytical results can be linked to your imaging data and made easily findable.
  - Add value to your imaging data by, for example, linking them to external resources like ontologies.
  - Make your data publicly available and move towards FAIRness.
- Try to avoid storing bioimaging data on the local system’s PC.
- If possible, make a transfer to central storage mandatory. If not possible, enable automation of data backup to central storage.
- Consider support for minimal standards (metadata schemas, file formats, etc.) in your domain.
- Consider reusing existing data.
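Automating the transfer to central storage can be sketched with the Python standard library alone. The example below copies every file from one directory to another and checks each copy against a checksum of the original; the function names and directory names are hypothetical, and a production setup would typically add scheduling, logging and a retention policy for the local copies.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, block_size=1024 * 1024):
    """Checksum a file in blocks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def transfer(source_dir, target_dir):
    """Copy every file under source_dir to target_dir and verify each copy.

    Returns a dict mapping relative paths to checksums. Source files are left in place.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    manifest = {}
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = target_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        checksum = sha256_of(src)
        if sha256_of(dst) != checksum:
            raise IOError(f"checksum mismatch for {rel}")
        manifest[rel.as_posix()] = checksum
    return manifest


# Demonstration with throwaway directories standing in for the acquisition PC
# and the central storage.
with tempfile.TemporaryDirectory() as tmp:
    acquisition = Path(tmp) / "acquisition_pc"
    central = Path(tmp) / "central_storage"
    (acquisition / "plate1").mkdir(parents=True)
    (acquisition / "plate1" / "well_A1.tif").write_bytes(b"\x00\x01\x02\x03")
    manifest = transfer(acquisition, central)
    print(manifest)
```

Keeping the returned manifest alongside the data also makes it possible to detect later whether a file has gone missing or been altered.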
Solutions
- Agnostic platforms that can be used to bridge between domain data include:
  - iRODS.
  - b2share.
- Image-specific data management platforms include:
  - OMERO - broad support for a large number of imaging formats.
  - Cytomine-IMS - image specific.
  - XNAT - medical imaging platform, DICOM-based.
  - MyTardis - largely file-system based platform handling the transfer of data.
  - BisQue - resource for management and analysis of 5D biological images.
- Platforms like OMERO and b2share also allow you to publish the data associated with a given project.
- Metadata standards can be found at the Metadata Standards Directory Working Group.
- Ontology resources are available at:
  - Zooma - Resource to find ontology mapping for free text terms.
  - Ontology Search - Ontology lookup service.
  - BioPortal - Biomedical ontologies.
- Existing data can be found by using the following resources:
  - LINCS.
  - Research Data repositories Registry.
Data publication and archiving
Description
Public data archives are an essential component of biological research. However, publishing image data and metadata can be very challenging for multiple reasons, just to mention a few: limited infrastructure for some domains, data support, sparse data.
Bioimaging tools and resources lag behind what is available for sequencing, for example, mainly due to the limited infrastructure capable of hosting the data. There are a few ongoing efforts to bridge that gap.
Two distinct types of resources should be considered:
- Data archives (“storage”): long-lasting storage for data and metadata that makes those data easily accessible to the community.
- Added-value archives: store enhanced, curated data, typically aimed at a scientific community.
Considerations
- If you only need to make your data available online and have limited metadata associated, consider publishing in a Data archive.
- If your data should be considered as a reference dataset, consider an Added-value archive.
- Select the repositories based on the following characteristics:
  - Storage vs Added-value resources.
  - Image format support.
  - Supported licenses, e.g. CC0 or CC-BY license. For example the Image Data Resource (IDR) uses Creative Commons Licenses for submitted datasets and encourages submitting authors to choose.
  - Which types of access are required for the users, e.g. download only, browse, search and view data and metadata, API access.
  - Does an entry have an accession number, e.g. idr-xxx, EMPIAR-#####?
  - Does an entry have a DOI (Digital Object Identifier)?
Solutions
Comparative table of some repositories that can be used to deposit imaging data:
Repository | Type | Data Restrictions | Data Upload Restrictions | DOI | Cost |
---|---|---|---|---|---|
BioImageArchive | Archive | No PHI data | None | - | Free |
Dryad | Archive | No PHI data | 300GB | Yes | Charged over 50GB (*) |
EMPIAR | Added-value | Electron microscopy imaging data | None | Yes | Free |
IDR | Added-value | Cell/Tissue imaging data, no PHI data | None | Yes | Free |
SSBD:database | Added-value | Biological dynamics imaging data | None | - | Free |
SSBD:repository | Archive | Biological dynamics imaging data | None | - | Free |
Zenodo | Archive | None | 50GB per dataset | Yes | Free |
- PHI: Protected health information.
- (*) unless the submitter is based at a member institution.
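The table can be turned into a small lookup to shortlist repositories. The sketch below encodes only the type, upload limit and DOI columns exactly as listed above; the class and function names are hypothetical, a DOI cell without a value is treated as "not listed", and the data restrictions column still has to be checked by hand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Repository:
    name: str
    kind: str                     # "Archive" or "Added-value"
    max_upload_gb: Optional[int]  # None where the table lists no upload limit
    doi: Optional[bool]           # None where the table gives no value


REPOSITORIES = [
    Repository("BioImageArchive", "Archive", None, None),
    Repository("Dryad", "Archive", 300, True),
    Repository("EMPIAR", "Added-value", None, True),
    Repository("IDR", "Added-value", None, True),
    Repository("SSBD:database", "Added-value", None, None),
    Repository("SSBD:repository", "Archive", None, None),
    Repository("Zenodo", "Archive", 50, True),
]


def candidates(size_gb, kind=None, need_doi=False):
    """Names of repositories whose listed limits fit the dataset."""
    result = []
    for repo in REPOSITORIES:
        if kind is not None and repo.kind != kind:
            continue
        if repo.max_upload_gb is not None and size_gb > repo.max_upload_gb:
            continue
        if need_doi and not repo.doi:
            continue
        result.append(repo.name)
    return result


print(candidates(120, kind="Archive"))
# ['BioImageArchive', 'Dryad', 'SSBD:repository']
print(candidates(120, kind="Archive", need_doi=True))
# ['Dryad']
```

Repository policies change over time, so treat the result as a starting shortlist and confirm the current limits with each repository before depositing.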
Related pages
- Data management plan: How to write a Data Management Plan (DMP).
- Data organisation: Best practices to name and organise research data.
- Data publication: How to prepare data and find repositories for publication.
- Existing data: How to find and reuse existing data.
- Data transfer: How to transfer data files.
- Licensing: How to license research data.
- Documentation and metadata: How to document and describe your data.
- Data storage: How to find appropriate storage solutions.
More information
Relevant tools and resources
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
4DN-BINA-OME-QUAREP (NBO-Q) Microscopy Metadata Specifications | Rigorous record-keeping and quality control are required to ensure the quality, reproducibility and value of imaging data. The 4DN Initiative and BINA have published light Microscopy Metadata Specifications that extend the OME Data Model, scale with experimental intent and complexity, and make it possible for scientists to create comprehensive records of imaging experiments. The Microscopy Metadata Specifications have been adopted by QUAREP-LiMi and are being revised in QUAREP-LiMi in collaboration with instrument manufacturers | Data publication OMERO | Standards/Databases |
b2share | Store and publish your research data. Can be used to bridge between domains | Data storage Data publication | Standards/Databases |
Bio-Formats | Bio-Formats is a software tool for reading and writing image data using standardized, open formats | OMERO | Tool info Training |
BioImageArchive | The BioImage Archive stores and distributes biological images that are useful to life-science researchers. | Data publication | Standards/Databases |
BisQue | Resource for management and analysis of 5D biological images | Data organisation Data Steward: research Data analysis | Tool info |
Cytomine-IMS | Image Data management | Data Steward: research | |
Dryad | Open-source, community-led data curation, publishing, and preservation platform for CC0 publicly available research data | Data publication Biomolecular simulation data | Standards/Databases |
EMPIAR | Electron Microscopy Public Image Archive is a public resource for raw, 2D electron microscopy images. You can browse, upload and download the raw images used to build a 3D structure | Data publication OMERO | Tool info Standards/Databases Training |
FigShare | Data publishing platform | Data publication Biomolecular simulation data Identifiers | Standards/Databases Training |
Gene Expression Omnibus (GEO) | A repository of MIAME-compliant genomics data from arrays and high-throughput sequencing | Microbial biotechnology Data publication Documentation and metadata Data transfer OMERO Toxicology data | |
Image Data Resource (IDR) | A repository of image datasets from scientific publications | Microbial biotechnology Data publication Documentation and metadata Data transfer OMERO | Tool info Standards/Databases |
iRODS | Integrated Rule-Oriented Data System (iRODS) is open source data management software for a cancer genome analysis workflow. | Data storage Data Steward: infrastructure TransMed | Tool info |
MyTARDIS | A file-system based platform handling the transfer of data | Data Steward: research Data transfer | |
OMERO | OMERO is an open-source client-server platform for managing, visualizing and analyzing microscopy images and associated metadata | Documentation and metadata Data Steward: research Data Steward: infrastructure Data storage OMERO | Tool info Training |
SSBD:database | Added-value database for biological dynamics images | Data publication | Standards/Databases |
SSBD:repository | An open data archive that stores and publishes bioimaging and biological quantitative datasets | Data publication | |
XNAT | Open source imaging informatics platform. It facilitates common management, productivity, and quality assurance tasks for imaging and associated data. | Researcher Data analysis TransMed XNAT-PIC | |
Zenodo | Generalist research data repository built and developed by OpenAIRE and CERN | Data publication Biomolecular simulation data Plant Phenomics | Standards/Databases Training |
National resources
Tool or resource | Description | Related pages |
---|---|---|
Technology Hotels | More than 130 Technology Hotels offer access to high-end technology and expertise in the field of bioimaging, bioinformatics, genomics, medical imaging, metabolomics, phenotyping, proteomics, structural biology, and/or systems biology. | Human data Proteomics Researcher Compliance monitoring & measurement |