Your tasks: Data storage
What features do you need in a storage solution when collecting data?
The need for data storage arises early in a research project, because you need somewhere to put your data as soon as collection or generation starts. It is therefore good practice to think about storage solutions during the data management planning phase, and to request (and, if necessary, pay for) storage in advance.
The storage solution for your data should fulfil certain criteria (e.g. space, access and transfer speed, duration of storage), which should be discussed with the IT team. You may choose a tiered storage system, which assigns data to different types of storage media based on requirements for access, performance, recovery and cost. Tiered storage allows you to classify data by level of importance, assign it to the appropriate tier and move it between tiers later. For example, once analysis is completed you can move the data to a lower tier for preservation or archiving.
Tiered storage is commonly classified as “hot” or “cold”. “Hot” storage is associated with fast access, high access frequency and high-value data, and consists of faster drives such as solid-state drives (SSDs). It is usually located close to the user, such as on campus, and incurs high costs. “Cold” storage is associated with slow access and low access frequency, and consists of slower drives or tapes. It is usually off-premises and incurs low costs.
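The hot/cold distinction can be sketched as a simple rule. The example below is only an illustration: the `Dataset` fields, the `suggest_tier` function and the threshold of 10 accesses per month are hypothetical and should be replaced by criteria agreed with your IT team.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    accesses_per_month: int   # how often the data is read or written
    analysis_completed: bool  # True once active processing has finished


def suggest_tier(dataset: Dataset, hot_threshold: int = 10) -> str:
    """Suggest "hot" or "cold" storage (the threshold is an illustrative assumption)."""
    if dataset.analysis_completed:
        # Finished data can move to a lower tier for preservation or archiving.
        return "cold"
    if dataset.accesses_per_month >= hot_threshold:
        return "hot"
    return "cold"


datasets = [
    Dataset("sequencing_run_raw", accesses_per_month=40, analysis_completed=False),
    Dataset("pilot_study_2019", accesses_per_month=1, analysis_completed=True),
]
for ds in datasets:
    print(f"{ds.name}: {suggest_tier(ds)}")
```

In practice the decision also depends on cost, recovery requirements and the value of the data, which a single threshold does not capture.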
When looking for solutions to store your data during the collection or generation phase, you should consider the following aspects:
- The volume of your data is an important factor in determining the appropriate storage solution. At a minimum, try to estimate the volume of raw data that you are going to generate or collect.
- What kind of access/transfer speed and access frequency will be required for your data?
- Knowing where the data will come from is also crucial. If the data comes from an external facility or needs to be transferred to a different server, you should think about an appropriate data transfer method.
- It is a good practice to have a copy of the original raw data in a separate location, to keep it untouched and unchanged (not editable).
- Knowing how long the raw data, data processing pipelines and analysis workflows need to be stored, especially after the end of the project, is also relevant when choosing storage.
- It is highly recommended to have metadata, such as an identifier and file description, associated with your data (see Metadata management page). This is useful if you want to retrieve the data years later or if your data needs to be shared with your colleagues for collaboration. Make sure to keep metadata together with the data or establish a clear link between data and metadata files.
- In addition to the original “read-only” raw (meta)data files, you need storage for files used for data processing and analysis as well as the workflows/processes used to produce the data. For these, you should consider:
- Who is allowed to access the data (in case of collaborative projects), how they expect to access it and for what purpose.
- Check if you have the rights to give access to the data, in case of legal limitations or third party rights (for instance, collaboration with industry).
- Consult policy for data sharing outside the institute/country (see Compliance and Monitoring page).
- Keeping track of the changes (version control), conflict resolution and back-tracing capabilities.
- Provide an estimate about the volume of your raw data (i.e., is it in the order of Megabytes, Gigabytes or Terabytes?) to the IT support in your institute when consulting for storage solutions.
- Clarify if your data needs to be transferred from one location to another. Try to provide IT with as much information as possible about the system where the data will come from. See our Data Transfer page for additional information.
- Ask for a tiered storage solution that gives you easy and fast access to the data for processing and analysis. Explain to the IT support what machine or infrastructure you need to access the data from and if other researchers should have access as well (in case of collaborative projects).
- Ask if the storage solution includes an automatic management of versioning, conflict resolution and back-tracing capabilities (see also our Data Organisation page).
- Ask the IT support in your institute if they offer technical solutions to keep a copy of your (raw) data secure and untouched (snapshot, read-only access, backup…). You could also keep a copy of the original data file in a separate folder as “read-only”.
- For small data files and private or collaborative projects within your institute, commonly accessible Cloud Storage is usually provided by the institute, such as NextCloud (on-premises), Microsoft OneDrive, DropBox, Box, etc. Do not use personal accounts on similar services for this purpose; adhere to the policies of your institute.
- Funders or universities usually require raw data and data analysis workflows (for reproducible results) to be stored for a certain amount of time after the end of the project (see our Preserve page). Check the data policy for your project or institute to know if a copy of the data should also be stored at your institute for a specific time after the project. Knowing this helps you budget for storage costs and helps your IT support estimate the storage resources needed.
- Make sure to generate good documentation (e.g. a README file) and metadata together with the data. Follow best practices for folder structure, file naming and versioning systems (see our Data Organisation page). Check if your institute provides a (meta)data management system, such as iRODS, DataVerse, FAIRDOM-SEEK or OSF. See the All tools and resources table below for additional tools.
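As a minimal sketch of the advice above on keeping an untouched copy of the raw data together with its metadata, the following Python example copies a file to a separate folder, verifies the copy with a checksum, writes a small metadata file next to it and removes write permission. The function names, the `.metadata.json` naming and the metadata fields are assumptions made for illustration; institutional snapshot or backup services are usually preferable where available.

```python
import hashlib
import json
import shutil
import stat
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def keep_raw_copy(source: Path, raw_dir: Path, identifier: str, description: str) -> Path:
    """Copy a raw file to a separate folder, verify it, and mark it read-only."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / source.name
    shutil.copy2(source, target)

    # Verify that the copy is identical to the original.
    checksum = sha256sum(source)
    if sha256sum(target) != checksum:
        raise OSError(f"Checksum mismatch after copying {source}")

    # Keep the metadata next to the data so the link between them is explicit.
    sidecar = raw_dir / (source.name + ".metadata.json")
    sidecar.write_text(json.dumps({
        "identifier": identifier,      # hypothetical field names
        "description": description,
        "file": source.name,
        "sha256": checksum,
    }, indent=2))

    # Remove write permission for everyone (read-only copy).
    target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return target
```

A call such as `keep_raw_copy(Path("run01.fastq"), Path("raw_readonly"), "DS-0001", "Raw reads, run 01")` (file name and identifier are made up) would leave a read-only copy and a `run01.fastq.metadata.json` file in `raw_readonly`. The stored checksum can be recomputed later to confirm that the file has not changed.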
How do you estimate computational resources for data processing and analysis?
To process and analyse your data, you will need access to computational resources. These range from your laptop and local compute clusters to High Performance Computing (HPC) infrastructures. However, it can be difficult to estimate the amount of computational resources needed for a process or an analysis.
Below, you can find some aspects to consider when estimating the computational resources needed for data processing and analysis:
- The total volume of data is an important factor in estimating the computational resources needed.
- Consider how much data volume you need “concurrently or at once”. For example, consider the possibility to analyse a large dataset by downloading or accessing only a subset of the data at a time (e.g., stream 1 TB at a time from a big dataset of 500 TB).
- Define the expected speed and the reliability of connection between storage and compute.
- Determine which software you are going to use. If it is proprietary software, you should check possible licensing issues. Check if it only runs on specific operating systems (Windows, macOS, Linux…).
- Establish if and what reference datasets you need.
- In the case of collaborative projects, define who can access the data and the computational resources for analysis (specify from what device, if possible). Check the policies on data access between different countries. Try to establish a versioning system.
- Try to estimate the volume of:
- Raw data files necessary for the process/analysis.
- Data files generated during the computational analysis as intermediate files.
- Results data files.
- Communicate your expectations about speed and the reliability of connection between storage and compute to the IT team. This could depend on the communication protocols that the compute and storage systems use.
- Ask colleagues or bioinformatics support who have done similar work before how long the analysis is likely to take. This could save you money and time.
- If you need reference datasets (e.g. reference genomes such as the human genome), ask IT if they provide them, or consult bioinformaticians who can set up automated retrieval of public reference datasets.
- For small data files and private projects, using the computational resources of your own laptop might be fine, but make sure to preserve the reproducibility of your work by using data analysis software such as Galaxy or R Markdown.
- For small data volumes and small collaborative projects, a commonly accessible Cloud Storage, such as Nextcloud (on-premises) or Owncloud, might be fine. Adhere to the policies of your institute.
- For large data volumes and bigger collaborative projects, you need a large storage volume on fast hardware that is closely tied to a computational resource accessible to multiple users.
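The volume and speed considerations above can be turned into a rough back-of-the-envelope calculation. The sketch below is an illustration only: the function names, the multipliers for intermediate and results files, the safety margin and the link efficiency are all assumptions that you should replace with figures from your own project and IT team.

```python
def estimate_storage_tb(raw_tb: float, intermediate_factor: float,
                        results_factor: float, safety_margin: float = 0.2) -> float:
    """Estimate total storage (TB) for raw, intermediate and results files.

    The factors express intermediate and results volumes as multiples of the
    raw data volume; the safety margin adds headroom (all illustrative).
    """
    intermediate_tb = raw_tb * intermediate_factor
    results_tb = raw_tb * results_factor
    return (raw_tb + intermediate_tb + results_tb) * (1 + safety_margin)


def transfer_time_hours(volume_tb: float, bandwidth_gbit_per_s: float,
                        efficiency: float = 0.7) -> float:
    """Estimate hours needed to move a volume between storage and compute.

    Uses decimal units (1 TB = 1e12 bytes) and assumes that only a fraction
    (`efficiency`) of the nominal bandwidth is achieved in practice.
    """
    bits = volume_tb * 1e12 * 8
    seconds = bits / (bandwidth_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    total = estimate_storage_tb(raw_tb=2, intermediate_factor=1.5, results_factor=0.1)
    print(f"Estimated storage needed: {total:.2f} TB")
    hours = transfer_time_hours(volume_tb=1, bandwidth_gbit_per_s=10)
    print(f"Moving 1 TB over a 10 Gbit/s link: about {hours:.2f} hours")
```

Numbers like these give the IT team a concrete starting point when discussing how much data can realistically be streamed to the compute resource at a time.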
Where should you store the data after the end of the project?
After the end of the project, all the relevant (meta)data (to guarantee reproducibility) should be preserved for a certain amount of time, which is usually defined by funder or institutional policy. However, where to preserve data that are no longer needed for active processing or analysis is a common question in data management.
- Data preservation refers neither to a place nor to a specific storage solution, but rather to “how” data are stored. As described in our Preservation page, numerous precautions need to be implemented by people with a variety of technical skills to preserve data.
- Estimate the volume of the (meta)data files that need to be preserved after the end of the project. Consider using a compressed file format to minimize the data volume.
- Define the amount of time (hours, days…) that you could wait in case the data needs to be reanalysed in the future.
- It is a good practice to publish your data in public data repositories. Usually, data publication in repositories is a requirement for scientific journals and funders. Repositories preserve your data for a long time, sometimes for free. See our Data Publication page for more information.
- Institutes or universities could have specific policies for data preservation. For example, your institute can ask you to preserve the data internally for 5 years after the project, even if the same data is available in public repositories.
- Based on the funder or institutional policy on data preservation, the data volume and the retrieval time span, discuss with the IT team what preservation solutions they can offer (e.g. data archiving services in your country) and the costs, so that you can budget for it in your DMP.
- Publish your data in public repositories, and they will preserve the data for you.
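As one possible illustration of using a compressed file format to reduce the volume of data to be preserved, the sketch below compresses a file with gzip from the Python standard library and reports how much space is saved. The function names are hypothetical, and gzip is only one option; domain-specific compressed formats may be more appropriate for your data type.

```python
import gzip
import shutil
from pathlib import Path


def compress_file(source: Path) -> Path:
    """Write a gzip-compressed copy of `source` next to it and return its path."""
    target = source.with_name(source.name + ".gz")
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def compression_ratio(source: Path, compressed: Path) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return compressed.stat().st_size / source.stat().st_size
```

The achievable ratio depends strongly on the data: text-based formats usually compress well, whereas already compressed files (many image or video formats) gain little. Keep in mind that decompression adds to the time needed before the data can be reanalysed.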
NeLS provides the necessary tools for data management, aimed at researchers in Norway and their collaborators.
The Sensitive Data Service (TSD) provides a platform to store, compute and analyse sensitive research data in compliance with Norwegian regulations regarding individuals’ privacy.
OMERO is a software platform for managing, sharing and analysing image data.
TransMed from ELIXIR Luxembourg supports clinical and translational projects in translational biomedicine.
XNAT for Preclinical Imaging Centers (XNAT-PIC) is a set of tools to store, process and share preclinical imaging studies, built on top of the XNAT imaging informatics platform.
Relevant tools and resources
|Tool or resource||Description||Related pages||Registry|
|Amazon Web Services||Amazon Web Services||Data analysis Data transfer||Training|
|b2share||Store and publish your research data. Can be used to bridge between domains||Data publication Bioimaging data||Standards/Databases|
|BIONDA||BIONDA is a free and open-access biomarker database, which employs various text mining methods to extract structured information on biomarkers from abstracts of scientific publications||Researcher Human data Proteomics||Tool info|
|Box||Cloud storage and file sharing service||Data Steward: infrastructure Data transfer||Training|
|CERNBox||CERNBox cloud data storage, sharing and synchronization|
|CS3||Cloud Storage Services for Synchronization and Sharing (CS3)|
|DATAVERSE||Open source research data repository software. Different instances available||Researcher Data Steward: research Data Steward: infrastructure IFB||Training|
|DropBox||Cloud storage and file sharing service||Data Steward: infrastructure Data transfer|
|e!DAL||Electronic data archive library is a framework for publishing and sharing research data||Data Steward: infrastructure||Tool info|
|FAIRDOM-SEEK||Data, model and SOPs management for projects, from preliminary data to publication, support for running SBML models etc.||Data Steward: infrastructure NeLS Microbial biotechnology IFB Machine actionability||Tool info Training|
|FAIRDOMHub||Data, model and SOPs management for projects, from preliminary data to publication, support for running SBML models etc. (public SEEK instance)||Researcher NeLS Documentation and metadata Microbial biotechnology Machine actionability||Standards/Databases|
|Google Drive||Cloud Storage for Work and Home||Data transfer|
|iCloud||Data sharing||Data analysis Data transfer|
|iRODS||Integrated Rule-Oriented Data System (iRODS) is open source data management software for a cancer genome analysis workflow.||Data Steward: infrastructure TransMed Bioimaging data||Tool info|
|Microsoft Azure||Cloud storage and file sharing service from Microsoft||Data Steward: infrastructure Data transfer|
|Microsoft OneDrive||Cloud storage and file sharing service from Microsoft||Data Steward: infrastructure|
|NextCloud||As fully on-premises solution, Nextcloud Hub provides the benefits of online collaboration without the compliance and security risks.||Data Steward: infrastructure Data transfer|
|OHDSI||Multi-stakeholder, interdisciplinary collaborative to bring out the value of health data through large-scale analytics. All our solutions are open-source.||Researcher Data Steward: research Data analysis TransMed Toxicology data||Tool info|
|OMERO||OMERO is an open-source client-server platform for managing, visualizing and analyzing microscopy images and associated metadata||Documentation and metadata Data Steward: research Data Steward: infrastructure OMERO Bioimaging data||Tool info Training|
|OpenStack||OpenStack is an open source cloud computing infrastructure software project and is one of the three most active open source projects in the world. Different instances available||Data analysis TransMed IFB||Training|
|OSF||OSF (Open Science Framework) is a free, open platform to support your research and enable collaboration.||Researcher Data Steward: research||Training|
|OwnCloud||Cloud storage and file sharing service||Data Steward: infrastructure Data transfer Data analysis|
|Research Data Management Platform (RDMP)||Data management platform for automated loading, storage, linkage and provision of data sets||Data Steward: infrastructure||Tool info|
|Research Object Crate (RO-Crate)||RO-Crate is a lightweight approach to packaging research data with their metadata, using schema.org. An RO-Crate is a structured archive of all the items that contributed to the research outcome, including their identifiers, provenance, relations and annotations.||Documentation and metadata Data organisation Data Steward: research Researcher Microbial biotechnology Machine actionability||Standards/Databases|
|Rucio||Rucio - Scientific Data Management||Data analysis Data transfer|
|ScienceMesh||ScienceMesh - frictionless scientific collaboration and access to research services||Data analysis Data transfer|
|SeaFile||SeaFile File Synchronization and Share Solution||Data transfer|
|semares||All-in-one platform for life science data management, semantic data integration, data analysis and visualization||Researcher Data Steward: research Documentation and metadata Data analysis Data Steward: infrastructure|
|tranSMART||Knowledge management and high-content analysis platform enabling analysis of integrated data for the purposes of hypothesis generation, hypothesis validation, and cohort discovery in translational research.||Researcher Data Steward: research Data analysis TransMed||Tool info|
|TSD||Norwegian Services for sensitive data||Sensitive data TSD||Training|
|Flemish Supercomputing Center (VSC)||VSC is Flanders’ most highly integrated high-performance research computing environment, providing world-class services to government, industry, and researchers.||Data Steward: research Data Steward: infrastructure Data analysis|
| ||Plant Genomics and Phenomics Research Data Repository||Documentation and metadata Researcher Data Steward: research Data Steward: infrastructure Plant sciences Plant Genomics|
| ||The German Human Genome-Phenome Archive||Documentation and metadata Researcher Data Steward: research|
| ||Data management platform for organising, sharing and publishing research datasets, models, protocols, samples, publications and other research outcomes.||Documentation and metadata Researcher Data Steward: research Data Steward: infrastructure|
| ||Data Publisher for Earth & Environmental Science||Documentation and metadata Researcher Data Steward: research|
| ||With the Fairdata Services you can store, share and publish your research data with easy-to-use web tools.||CSC Researcher Data Steward: research Data publication Existing data|
|Sensitive Data Services for Research||CSC Sensitive Data Services for Research are designed to support secure sensitive data management through web-user interfaces accessible from the user’s own computer.||CSC Researcher Data Steward: research Sensitive data Data analysis Data publication Human data|
| ||The National Infrastructure for Research Data (NIRD) offers storage services, archiving services, and processing capacity for computing on the stored data. It offers services and capacities to any scientific discipline that requires access to advanced, large-scale, or high-end resources for storing, processing, publishing research data or searching digital databases and collections. This service is owned and operated by Sigma2 NRIS, which is a joint collaboration between UiO, UiB, NTNU, UiT, and UNINETT Sigma2.||Data transfer NeLS|
|Norwegian Research and Education Cloud (NREC)||NREC is an Infrastructure-as-a-Service (IaaS) project, commonly referred to as a cloud infrastructure, between the University of Bergen and the University of Oslo, with additional contributions from NeIC (Nordic e-Infrastructure Collaboration) and Uninett. An IaaS is a self-service infrastructure where you spawn standardized servers and storage instantly, as needed, from a given resource quota.|
| ||Educloud Research is a platform provided by the Centre for Information Technology (USIT) at the University of Oslo (UiO). This platform provides access to a work environment accessible to collaborators from other institutions or countries. This service provides a storage solution and a low-threshold HPC system that offers batch job submission (SLURM) and interactive nodes. Data up to the red classification level can be stored/analysed.||Data analysis Sensitive data|
| ||The TSD – Service for Sensitive Data, is a platform for collecting, storing, analysing and sharing sensitive data in compliance with the Norwegian privacy regulation. TSD is developed and operated by UiO.||Human data Data analysis Sensitive data TSD|
| ||The HUNT Cloud, established in 2013, aims to improve and develop the collection, accessibility and exploration of large scale information. HUNT Cloud offers cloud services, lab management, and is a key service that has established a framework for data protection, data security, and data management. HUNT Cloud is owned by NTNU and operated by HUNT Research Centre at the Department of Public Health and Nursing at the Faculty of Medicine and Health Sciences.||Human data Data analysis Sensitive data|
| ||SAFE (secure access to research data and e-infrastructure) is a solution for secure processing of sensitive personal data in research at the University of Bergen. SAFE is based on the “Norwegian Code of conduct for information security in the health and care sector” (Normen) and ensures confidentiality, integrity, and availability are preserved when processing sensitive personal data. Through SAFE, the IT department offers a service where employees, students and external partners get access to dedicated resources for processing of sensitive personal data.||Human data Data analysis Sensitive data|
|BioData.pt Service Hub||BioData.pt Service Hub includes several data management resources, tools and services available for researchers in Life Sciences.||Researcher Data Steward: research Data analysis|
|BioData.pt Data Management Portal (DMPortal)||This instance of DataVerse is provided by BioData.pt. We can help you write and maintain data management plans for your research.||Researcher Data Steward: research|
| ||The Swedish National Infrastructure for Computing (SNIC) is a national research infrastructure that makes available large-scale high-performance computing resources, storage capacity, and advanced user support, for Swedish research.|