Your domain: Bioimaging data
Bioimaging specialists are acquiring an ever-growing amount of data: images, associated metadata, etc. However, image data management often does not receive the attention it requires, or is avoided altogether because it is considered a burdensome task. At the same time, storing images on personal computers or USB keys is no longer an option, assuming it ever was! Data volume is increasing exponentially, and not only the acquired images need storing: processed images will often be generated and will need to be kept alongside the originals. It is critical to proactively identify where the data will be stored, for how long, who will cover the cost of the hardware, and who will cover the cost of managing the infrastructure. All stakeholders need to be involved in the preliminary discussions (biologists, facility managers, data analysts, IT support, etc.) to ensure that the requirements are understood and met.
What constitutes bioimage data
An image is much more than a collection of zeros and ones. The image contains the binary data representing the pixels on screen, but it is usually also packed with useful metadata. Alongside the obvious keys indicating how to interpret the zeros and ones, you can find a lot of acquisition metadata, e.g. the hardware/instrument used, the settings used, etc. Managing images therefore immediately becomes a larger problem: not only the binary files need to be handled, but also the associated metadata. Several efforts have been made, and are still ongoing, to capture those metadata. Understanding and capturing the metadata is critical for many reasons, such as analysis and the detection of possible faults in acquisition systems. It is important to decide how much detail will be recorded, since this can dramatically increase the metadata volume and therefore the effort required to capture it.
The collection of images can take several forms:
- Data acquired within a facility
- Data acquired in another facility (commissioned work or external guest user) and “transported” by the users to their own facility
- Scanned slides, for example.
Data management challenges
The number and size of files can be extremely large. Deleting or misplacing a file could invalidate the study itself and prevent its reuse. As described above, the associated metadata must be managed together with the binary files, and the level of detail recorded determines both the metadata volume and the effort required to capture it.
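One way to notice deleted, misplaced or altered files early is to keep a checksum manifest of the dataset and compare it against the current state of the storage. The following is a minimal sketch using only the Python standard library; the function names (`sha256_of`, `build_manifest`, `compare_manifests`) are illustrative assumptions, not part of any bioimaging tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file block by block so large images are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file under root (relative POSIX path) to its SHA-256 checksum."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Report files that disappeared, appeared, or changed content between two manifests."""
    return {
        "missing": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "changed": sorted(name for name in old.keys() & new.keys() if old[name] != new[name]),
    }
```

A manifest built right after acquisition can be saved outside the dataset folder (for example as JSON) and compared with a freshly built one whenever the data is moved or before it is published.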
Standard (meta)data formats
Unlike other domains, the bioimaging community has not yet agreed on a single standard data format which is generated by all acquisition systems. Instead, the images described above are most frequently collected in proprietary file formats (PFFs) defined by hardware vendors. Currently, there are several hundred such formats that the researchers may encounter. These formats combine critical acquisition metadata with the multidimensional binary data but are often optimized for quickly writing the data to disk. Tools and strategies are outlined below to ease working with this data.
- When purchasing a microscope, consider carefully how the resulting files will be processed. If open source tools will be used, proprietary file formats may require a time-consuming conversion. Discuss with your vendor if an open format is available.
- If data from multiple vendors is to be combined, a similar conversion may be necessary to make the data comparable.
- Imaging data brings special considerations due to the large, often continuous nature of the data. Single terabyte-scale files are not uncommon. Sharing these can require special infrastructure, like a data management server (described below) or a cloud-native format (described below). One goal of such infrastructure is to enable the selective (i.e. interactive) zooming of your image data without the need to download the entire volume, thereby reducing your internet bandwidth and costs.
- Importantly, most acquisition systems produce proprietary file formats. Understanding how well they are supported by the imaging community could be a key factor in a successful study. Will it be possible to analyse or view the images using open-source software? Will it be possible to deposit the images in public repositories when published? The choice of a proprietary file format could prevent you from using any tools other than those tied to the acquisition system.
Vendor libraries: Some vendors provide open source libraries for parsing their proprietary file formats. See libCZI from Zeiss.
Open source translators: Members of the community have developed multi-format translators that can be used to access your data without needing to transform it. The translation is performed each time you access your data.
- Bio-Formats (Java) - supports over 150 file formats
- OpenSlide (C++) - primarily for whole-slide imaging (WSI) formats
- aicsimageio (Python) - wraps vendor libraries and Bio-Formats to support a wide range of formats in Python
Permanent conversion: An alternative is to permanently convert your data to an open format:
- OME-Files - The Open Microscopy Environment (OME) has developed an open format, “OME-TIFF”, to which you can convert your data. Bio-Formats (above) is likely the most straightforward way to convert your data to OME-TIFF.
- The bioformats2raw and raw2ometiff toolchain provided by Glencoe Software allows a more performant conversion of your data, but requires an extra intermediate copy of the data. If you have available space, you might want to give it a try.
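The two conversion routes above can be scripted so that every file is converted the same way. The sketch below only builds the command lines. It assumes that the Bio-Formats command-line tool `bfconvert` and the `bioformats2raw` / `raw2ometiff` executables are installed and on the `PATH`, and that their basic "input then output" invocation is sufficient. The helper names and the output naming scheme are hypothetical.

```python
from pathlib import Path


def bfconvert_command(source: Path, workdir: Path) -> list[str]:
    """Single-step route: Bio-Formats' bfconvert writes an OME-TIFF directly."""
    target = workdir / f"{source.stem}.ome.tiff"
    return ["bfconvert", str(source), str(target)]


def two_step_commands(source: Path, workdir: Path) -> list[list[str]]:
    """Two-step route: bioformats2raw writes an intermediate copy, raw2ometiff finishes."""
    intermediate = workdir / f"{source.stem}_raw"  # the extra intermediate copy
    target = workdir / f"{source.stem}.ome.tiff"
    return [
        ["bioformats2raw", str(source), str(intermediate)],
        ["raw2ometiff", str(intermediate), str(target)],
    ]
```

Each command list can be passed to `subprocess.run(command, check=True)`. Keeping the commands as lists rather than strings avoids quoting problems with file names that contain spaces.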
Cloud (or “object”) storage: If you are storing your data in the cloud, you will likely need a different file format since most current image file formats are not suitable for cloud storage. OME is currently developing a next-generation file format (NGFF) that you can use.
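Formats designed for object storage typically split an image into independently addressable chunks, which is what makes selective viewing possible without downloading the whole volume. The arithmetic can be sketched as follows. The chunk size, image size and function names are illustrative assumptions and do not describe the layout of any particular format.

```python
from itertools import product


def _ceil_div(value: int, size: int) -> int:
    """Integer ceiling division, exact for arbitrarily large pixel counts."""
    return -(-value // size)


def chunks_for_region(region, chunk_shape):
    """Return grid indices of every chunk overlapping a half-open pixel region.

    region: sequence of (start, stop) pixel ranges, one per axis.
    chunk_shape: chunk size in pixels along each axis.
    """
    per_axis = [
        range(start // size, _ceil_div(stop, size))
        for (start, stop), size in zip(region, chunk_shape)
    ]
    return list(product(*per_axis))


def total_chunks(image_shape, chunk_shape):
    """Number of chunks needed to tile the whole image."""
    count = 1
    for extent, size in zip(image_shape, chunk_shape):
        count *= _ceil_div(extent, size)
    return count
```

For a hypothetical 100000 x 80000 pixel plane stored in 1024 x 1024 chunks, viewing rows 2000 to 3000 and columns 5000 to 6500 touches 6 of the 7742 chunks, so only a small fraction of the data has to be fetched.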
Metadata: If metadata are stored separately from the image data, their format should follow the subject-specific standards for the schema, vocabulary or ontologies, and storage format used.
The acquisition of bioimaging data takes place in various environments. The microscope (usually light or electron) may be in a core facility, in a research lab or even remotely in a different institution. Regardless of where the instrument is located, the acquired imaging data is likely to be stored, at least temporarily, on a local, vendor-specific PC next to the acquisition system because of its complexity and size. This is often unavoidable, since the data must be stored securely as fast as it is acquired.
Due to the scale of data, keeping track of the image data and the associated data and metadata is essential, particularly in life sciences and medical fields. Organising, storing, sharing, publishing image data and metadata can be very challenging.
- Consider using an image management software platform. Image management software platforms offer a way to centralize, organize, view, distribute and track all of your digital images. They allow you to take control over how your images are managed, used and shared within research groups.
- When evaluating an image management software platform, check if it allows you to:
- Control the access you wish to give to your data and how you wish to work, e.g. only the PI can view and annotate the data, or you can choose to work on a project with selected collaborators.
- Access data from anywhere via either Web or Desktop clients and API.
- Store the metadata with your images. For example, analytical results can be linked to your imaging data and easily found.
- Add value to your imaging data by for example linking them to external resources like ontologies.
- Make your data publicly available and move gradually towards FAIRness.
- Try to avoid storing bioimaging data on the local system’s PC.
- If possible, make a transfer to central storage mandatory. If not possible, enable automation of data backup to central storage.
- Consider support for minimal standards (metadata schemas, file formats etc.) in your domain.
- Consider reusing existing data.
- Agnostic platforms that can be used to bridge between domain data include:
- Image-specific data management platforms include:
- Platforms like OMERO and b2share also allow you to publish the data associated with a given project.
- Metadata standards can be found at the Metadata Standards Directory Working Group.
- Ontologies Resources available at:
- Existing data can be found by using the following resources: LINCS, Research Data repositories Registry.
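The recommendation above to automate transfer from the acquisition PC to central storage can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not a replacement for a managed transfer service: the function names are hypothetical, both locations are assumed to be mounted as ordinary directories, and a source file that changed after an earlier transfer overwrites the central copy.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB blocks to cope with very large images."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def transfer_to_central(acquisition_dir: Path, central_dir: Path) -> list[Path]:
    """Copy new or changed files to central storage and verify each copy by checksum."""
    copied = []
    for source in sorted(p for p in acquisition_dir.rglob("*") if p.is_file()):
        target = central_dir / source.relative_to(acquisition_dir)
        if target.exists() and file_digest(target) == file_digest(source):
            continue  # already transferred and intact
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        if file_digest(target) != file_digest(source):
            raise IOError(f"Checksum mismatch after copying {source}")
        copied.append(target)
    return copied
```

Run on a schedule, such a script leaves the local copy untouched until the central copy has been verified, so local files can then be deleted according to the facility's retention policy.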
Data publication and archiving
Public data archives are an essential component of biological research. However, publishing image data and metadata can be very challenging for multiple reasons, just to mention a few: limited infrastructure for some domains, data support, sparse data.
Bioimaging tools and resources lag behind what is available in, for example, sequencing, mainly due to limited infrastructure capable of hosting the data. There are a few ongoing efforts to bridge that gap.
Two distinct types of resources should be considered:
- Data archives (“storage”): long-lasting storage for data and metadata, making those data easily accessible to the community.
- Added-value archives: store enhanced, curated data, typically aimed at a specific scientific community.
- If you only need to make your data available online and have limited metadata associated, consider publishing in a Data archive.
- If your data should be considered as a reference dataset, consider an Added-value archive.
- Select and choose the repositories based on the following characteristics:
- Storage vs Added-value resources
- Image format support
- Supported licenses e.g. CC0 or CC-BY license. For example the Image Data Resource (IDR) uses Creative Commons Licenses for submitted datasets and encourages submitting authors to choose.
- Which types of access are required for the users, e.g. download only, browse, search and view data and metadata, API access, etc.
- Does an entry have an accession number, e.g. idr-xxx, EMPIAR-#####?
- Does an entry have a DOI (Digital Object Identifier)?
Comparative table of some repositories that can be used to deposit imaging data:
| Repository | Type | Data Restrictions | Data Upload Restrictions | DOI | Cost |
|---|---|---|---|---|---|
| BioImageArchive | Archive | No PHI data | None | — | Free |
| Dryad | Archive | No PHI data | 300GB | Yes | Charged over 50GB (*) |
| EMPIAR | Added-value | Electron microscopy imaging data | None | Yes | Free |
| IDR | Added-value | Cell/Tissue imaging data, no PHI data | None | Yes | Free |
| SSBD:database | Added-value | Biological dynamics imaging data | None | — | Free |
| SSBD:repository | Archive | Biological dynamics imaging data | None | — | Free |
| Zenodo | Archive | None | 50GB per dataset | Yes | Free |

- PHI: Protected health information.
- (*) unless the submitter is based at a member institution.
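The selection criteria and the comparative table above can be turned into a simple shortlist filter. The sketch below encodes only the type, upload limit and DOI columns of the table. It treats a dash in the DOI column as "no DOI", which is an assumption, and it ignores the data restrictions column, which still has to be checked by hand. The class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Repository:
    name: str
    kind: str                        # "Archive" or "Added-value"
    upload_limit_gb: Optional[int]   # None means no limit is listed
    doi: bool                        # assumption: a dash in the table means no DOI


# Values transcribed from the comparative table.
REPOSITORIES = [
    Repository("BioImageArchive", "Archive", None, False),
    Repository("Dryad", "Archive", 300, True),
    Repository("EMPIAR", "Added-value", None, True),
    Repository("IDR", "Added-value", None, True),
    Repository("SSBD:database", "Added-value", None, False),
    Repository("SSBD:repository", "Archive", None, False),
    Repository("Zenodo", "Archive", 50, True),
]


def shortlist(size_gb: float, need_doi: bool, kind: Optional[str] = None) -> list[str]:
    """Names of repositories that accept the dataset size and meet the DOI/type needs."""
    return [
        repo.name
        for repo in REPOSITORIES
        if (repo.upload_limit_gb is None or size_gb <= repo.upload_limit_gb)
        and (repo.doi or not need_doi)
        and (kind is None or repo.kind == kind)
    ]
```

For example, `shortlist(120, need_doi=True)` excludes Zenodo because of its 50GB per-dataset limit, while Dryad remains because 120GB is below its 300GB limit.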
Relevant tools and resources
| Tool or resource | Description | Related pages | Registry |
|---|---|---|---|
| b2share | Store and publish your research data. Can be used to bridge between domains | Data storage, Data publication | Standards/Databases |
| Bio-Formats | Bio-Formats is a software tool for reading and writing image data using standardized, open formats | OMERO | Tool info, Training |
| BioImageArchive | The BioImage Archive stores and distributes biological images that are useful to life-science researchers. | Data publication | Standards/Databases |
| BisQue | Resource for management and analysis of 5D biological images | Data organisation, Data Steward: research, Data analysis | Tool info |
| Cytomine-IMS | Image Data management | Data Steward: research | |
| Dryad | Open-source, community-led data curation, publishing, and preservation platform for CC0 publicly available research data | Data publication, Biomolecular simulation data | Standards/Databases |
| EMPIAR | Electron Microscopy Public Image Archive is a public resource for raw, 2D electron microscopy images. You can browse, upload and download the raw images used to build a 3D structure | Data publication, OMERO | |
| FigShare | Data publishing platform | Data publication, Biomolecular simulation data | Standards/Databases, Training |
| Gene Expression Omnibus (GEO) | A repository of MIAME-compliant genomics data from arrays and high-throughput sequencing | Microbial biotechnology, Data publication, Documentation and metadata, Data transfer, OMERO, Toxicology data | |
| Image Data Resource (IDR) | A repository of image datasets from scientific publications | Microbial biotechnology, Data publication, Documentation and metadata, Data transfer, OMERO | Tool info, Standards/Databases |
| iRODS | Integrated Rule-Oriented Data System (iRODS) is open source data management software for a cancer genome analysis workflow. | Data storage, Data Steward: infrastructure, TransMed | Tool info |
| MyTardis | A file-system based platform handling the transfer of data | Data Steward: research, Data transfer | |
| OMERO | OMERO is an open-source client-server platform for managing, visualizing and analyzing microscopy images and associated metadata | Documentation and metadata, Data Steward: research, Data Steward: infrastructure, Data storage, OMERO | Tool info, Training |
| SSBD:database | Added-value database for biological dynamics images | Data publication | |
| SSBD:repository | An open data archive that stores and publishes bioimaging and biological quantitative datasets | Data publication | |
| XNAT | Open source imaging informatics platform. It facilitates common management, productivity, and quality assurance tasks for imaging and associated data. | Researcher, Data analysis, TransMed, XNAT-PIC | |
| Zenodo | Generalist research data repository built and developed by OpenAIRE and CERN | Data publication, Biomolecular simulation data | Standards/Databases, Training |