Your tasks: Data storage

What features do you need in a storage solution when collecting data?

Description

The need for data storage arises early in a research project, because you need space for your data as soon as collection or generation starts. It is therefore good practice to think about storage solutions during the data management planning phase, and to request (and, where necessary, pay for) storage in advance.

The storage solution for your data should fulfil certain criteria (e.g. space, access and transfer speed, duration of storage), which should be discussed with the IT team. You may choose a tiered storage system, which assigns data to different types of storage media based on requirements for access, performance, recovery and cost. Tiered storage allows you to classify data according to levels of importance, assign it to the appropriate tier, and move it to a different tier later. For example, once analysis is completed you can move the data to a lower tier for preservation or archiving.

Storage tiers are commonly classified as “hot” or “cold”. “Hot” storage is associated with fast access, high access frequency and high-value data, and consists of faster drives such as Solid State Drives (SSD). It is usually located close to the user, such as on campus, and incurs high costs. “Cold” storage is associated with slow access and low access frequency, and consists of slower drives or tapes. It is usually off-premises and incurs low costs.
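
As a rough illustration of the hot/cold distinction, the sketch below suggests a tier from access frequency and analysis status. The `Dataset` fields, the `suggest_tier` helper and the threshold of 10 accesses per month are assumptions made for this example, not values from any policy. The real criteria should be agreed with your IT team.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    accesses_per_month: int   # estimated number of reads per month
    analysis_complete: bool   # True once active analysis has finished


def suggest_tier(dataset: Dataset, hot_threshold: int = 10) -> str:
    """Suggest 'hot' or 'cold' storage for a dataset.

    hot_threshold is a hypothetical cut-off chosen for illustration only.
    """
    # Finished analyses are candidates for a lower (cheaper) tier.
    if dataset.analysis_complete:
        return "cold"
    # Frequently accessed data benefits from fast, nearby storage.
    if dataset.accesses_per_month >= hot_threshold:
        return "hot"
    return "cold"


if __name__ == "__main__":
    print(suggest_tier(Dataset("raw_reads", accesses_per_month=40, analysis_complete=False)))
```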

Considerations

When looking for solutions to store your data during the collection or generation phase, you should consider the following aspects.

  • The volume of your data is an important factor in determining the appropriate storage solution. At a minimum, try to estimate the volume of raw data that you are going to generate or collect.
  • The access/transfer speed and access frequency that your data will require.
  • Knowing where the data will come from is also crucial. If the data comes from an external facility or needs to be transferred to a different server, you should think about an appropriate data transfer method.
  • It is a good practice to have a copy of the original raw data in a separate location, to keep it untouched and unchanged (not editable).
  • Knowing for how long the raw data, as well as data processing pipelines and analysis workflows need to be stored, especially after the end of the project, is also a relevant aspect for storage.
  • It is highly recommended to have metadata, such as an identifier and file description, associated with your data (see Documentation and metadata page). This is useful if you want to retrieve the data years later or if your data needs to be shared with your colleagues for collaboration. Make sure to keep metadata together with the data or establish a clear link between data and metadata files.
  • In addition to the original “read-only” raw (meta)data files, you need storage for files used for data processing and analysis as well as the workflows/processes used to produce the data. For these, you should consider:
    • who is allowed to access the data (in case of collaborative projects), how do they expect to access the data and for what purpose;
    • check if you have the rights to give access to the data, in case of legal limitations or third party rights (for instance, collaboration with industry);
    • consult policy for data sharing outside the institute/country (see Compliance monitoring page).
  • Whether you need to keep track of changes (version control), resolve conflicts and trace changes back.
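
One possible way to estimate raw data volume is to measure a pilot dataset and scale it up. The sketch below assumes the pilot data sits in a local folder. The helper names and the default safety factor of 1.2 are hypothetical choices for this example, and sizes are reported in decimal units (1 TB = 10^12 bytes).

```python
import os


def directory_size(path: str) -> int:
    """Return the total size in bytes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total


def human_readable(n_bytes: float) -> str:
    """Format a byte count using decimal units (kB, MB, GB, ...)."""
    units = ["B", "kB", "MB", "GB", "TB", "PB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


def project_raw_volume(pilot_bytes: float, pilot_samples: int,
                       planned_samples: int, safety_factor: float = 1.2) -> float:
    """Scale a pilot measurement to the planned number of samples.

    safety_factor is an assumed margin, not a recommended value.
    """
    return pilot_bytes / pilot_samples * planned_samples * safety_factor


if __name__ == "__main__":
    pilot = directory_size(".")  # replace "." with your pilot data folder
    print(human_readable(project_raw_volume(pilot, pilot_samples=4, planned_samples=200)))
```

The resulting order of magnitude (Megabytes, Gigabytes or Terabytes) is the figure to bring to your IT support.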

Solutions

  • Provide an estimate of the volume of your raw data (e.g., is it in the order of Megabytes, Gigabytes or Terabytes?) to the IT support in your institute when consulting for storage solutions.
  • Clarify if your data needs to be transferred from one location to another. Try to provide IT with as much information as possible about the system where the data will come from. See our Data transfer page for additional information.
  • Ask for a tiered storage solution that gives you easy and fast access to the data for processing and analysis. Explain to the IT support what machine or infrastructure you need to access the data from and if other researchers should have access as well (in case of collaborative projects).
  • Ask if the storage solution includes automatic management of versioning, conflict resolution and back-tracing capabilities (see also our Data organisation page).
  • Ask the IT support in your institute if they offer technical solutions to keep a copy of your (raw) data secure and untouched (snapshot, read-only access, backup…). You could also keep a copy of the original data file in a separate folder as “read-only”.
  • For small data files and private or collaborative projects within your institute, commonly accessible Cloud Storage is usually provided by the institute, such as Nextcloud (on-premises), Microsoft OneDrive, Dropbox, Box, etc. Do not use personal accounts on similar services for this purpose; adhere to the policies of your institute.
  • For large data sets, consider cloud storage services (such as ScienceMesh or OpenStack) and cloud synchronization and sharing services (CS3), such as CERNBox or SeaFile.
  • Funders or universities usually require raw data and data analysis workflows (for reproducible results) to be stored for a certain amount of time after the end of the project (see our Preserve page). Check the data policy for your project or institute to know if a copy of the data should also be stored at your institute for a specific time after the project. Knowing this helps you budget for storage costs and helps your IT support estimate the storage resources needed.
  • Make sure to generate good documentation (e.g., a README file) and metadata together with the data. Follow best practices for folder structure, file naming and versioning systems (see our Data organisation page). Check if your institute provides a (meta)data management system, such as iRODS, DATAVERSE, FAIRDOM-SEEK or OSF.
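
If your institute does not offer snapshots or managed read-only storage, a “read-only” copy with basic metadata can be sketched as follows. This is a minimal example that assumes a local file system. The function names, the `.metadata.json` sidecar naming and the metadata fields are illustrative choices, not a standard.

```python
import hashlib
import json
import os
import shutil
import stat


def sha256sum(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 checksum of a file without loading it all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def protect_raw_copy(source: str, backup_dir: str, description: str):
    """Copy a raw data file, verify the copy, make it read-only, add metadata."""
    os.makedirs(backup_dir, exist_ok=True)
    target = os.path.join(backup_dir, os.path.basename(source))
    shutil.copy2(source, target)  # copy2 also preserves timestamps

    checksum = sha256sum(target)
    if checksum != sha256sum(source):
        raise IOError("Copy does not match the original file")

    # Remove write permission for owner, group and others.
    os.chmod(target, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)

    # Keep metadata next to the data (hypothetical sidecar layout).
    with open(target + ".metadata.json", "w", encoding="utf-8") as handle:
        json.dump({"file": os.path.basename(target),
                   "sha256": checksum,
                   "description": description}, handle, indent=2)
    return target, checksum
```

Storing the checksum alongside the copy lets you confirm years later that the raw data is still unchanged. Note that file permissions alone do not protect against deletion or hardware failure, so this complements rather than replaces a backup.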

How do you estimate computational resources for data processing and analysis?

Description

In order to process and analyse your data, you will need access to computational resources. These range from your laptop and local compute clusters to High Performance Computing (HPC) infrastructures. However, it can be difficult to estimate the amount of computational resources needed for a process or an analysis.

Considerations

Below, you can find some aspects that you need to consider to be able to estimate the computational resource needed for data processing and analysis.

  • The total volume of data is an important factor in estimating the computational resources needed.
  • Consider how much data volume you need “concurrently or at once”. For example, consider the possibility to analyse a large dataset by downloading or accessing only a subset of the data at a time (e.g., stream 1 TB at a time from a big dataset of 500 TB).
  • Define the expected speed and the reliability of connection between storage and compute.
  • Determine which software you are going to use. If it is proprietary software, you should check possible licensing issues. Check if it only runs on specific operating systems (Windows, macOS, Linux,…).
  • Establish if and what reference datasets you need.
  • In the case of collaborative projects, define who can access the data and the computational resource for analysis (specify from what device, if possible). Check the policy about data access between different countries. Try to establish a versioning system.
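
The idea of working on a subset of the data at a time can be sketched with a generator that reads a file in fixed-size blocks, so memory use stays bounded regardless of file size. The example assumes a local file. The function names and the default block size of 64 MiB are arbitrary choices for illustration. For remote data, the same pattern would apply using whatever client your storage system provides.

```python
def iter_chunks(path: str, chunk_size: int = 64 * 1024 * 1024):
    """Yield successive blocks of bytes from a file, chunk_size bytes at a time."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def count_lines(path: str, chunk_size: int = 64 * 1024 * 1024) -> int:
    """Count newline characters without loading the whole file into memory."""
    return sum(chunk.count(b"\n") for chunk in iter_chunks(path, chunk_size))
```

Here only one block is held in memory at any time, which is what makes the amount of data needed "at once" much smaller than the total volume.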

Solutions

  • Try to estimate the volume of:
    • raw data files necessary for the process/analysis;
    • data files generated during the computational analysis as intermediate files;
    • results data files.
  • Communicate your expectations about speed and the reliability of connection between storage and compute to the IT team. This could depend on the communication protocols that the compute and storage systems use.
  • Ask colleagues or bioinformatics support who have done similar work before how long the analysis is likely to take. This could save you money and time.
  • If you need reference datasets (e.g., reference genomes such as the human genome), ask IT if they provide them, or consult bioinformaticians who can set up automated retrieval of public reference datasets.
  • For small data files and private projects, using the computational resources of your own laptop might be fine, but make sure to preserve the reproducibility of your work by using data analysis software such as Galaxy or R Markdown.
  • For small data volume and small collaborative projects, a commonly accessible cloud storage, such as Nextcloud (on-premises) or ownCloud might be fine. Adhere to the policies of your institute.
  • For large data volume and bigger collaborative projects, you need a large storage volume on fast hardware that is closely tied to a computational resource accessible to multiple users, such as Rucio, tranSMART, Semares or Research Data Management Platform (RDMP).
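
To turn these estimates into numbers you can discuss with the IT team, a back-of-the-envelope calculation such as the one below may help. The helper names, the default multipliers for intermediate and results files, and the 80% link efficiency are all assumptions for illustration. Replace them with figures from colleagues who have run similar analyses.

```python
def workspace_bytes(raw_bytes: float, intermediate_factor: float = 2.0,
                    results_factor: float = 0.1) -> float:
    """Estimate storage needed for raw, intermediate and results files.

    The factors express intermediate and results volume relative to raw data.
    """
    return raw_bytes * (1 + intermediate_factor + results_factor)


def transfer_time_hours(n_bytes: float, bandwidth_gbit_per_s: float,
                        efficiency: float = 0.8) -> float:
    """Estimate hours needed to move n_bytes between storage and compute."""
    if bandwidth_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    bits = n_bytes * 8
    seconds = bits / (bandwidth_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    raw = 5e12  # 5 TB of raw data (example value)
    print(f"Workspace needed: {workspace_bytes(raw) / 1e12:.1f} TB")
    print(f"Transfer over 10 Gbit/s: {transfer_time_hours(raw, 10):.1f} h")
```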

Where should you store the data after the end of the project?

Description

After the end of the project, all the relevant (meta)data (to guarantee reproducibility) should be preserved for a certain amount of time, which is usually defined by funder or institutional policy. However, where to preserve data that is no longer needed for active processing or analysis is a common question in data management.

Considerations

  • Data preservation refers neither to a place nor to a specific storage solution, but rather to the way or “how” data can be stored. As described in our Preservation page, numerous precautions need to be implemented by people with a variety of technical skills to preserve data.
  • Estimate the volume of the (meta)data files that need to be preserved after the end of the project. Consider using a compressed file format to minimize the data volume.
  • Define the amount of time (hours, days…) that you could wait in case the data needs to be reanalysed in the future.
  • It is a good practice to publish your data in public data repositories. Usually, data publication in repositories is a requirement for scientific journals and funders. Repositories preserve your data for a long time, sometimes for free. See our Data publication page for more information.
  • Institutes or universities could have specific policies for data preservation. For example, your institute can ask you to preserve the data internally for 5 years after the project, even if the same data is available in public repositories.
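
As an example of reducing data volume before preservation, the sketch below compresses a file with gzip from the Python standard library and reports the compression ratio. The function names are illustrative, and gzip is only one possible format. Whether it suits your data, and which format your archive accepts, should be checked first.

```python
import gzip
import os
import shutil


def compress_file(source: str, level: int = 6) -> str:
    """Write a gzip-compressed copy of source next to it and return its path."""
    target = source + ".gz"
    with open(source, "rb") as src, gzip.open(target, "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst)  # streams the file in blocks
    return target


def compression_ratio(source: str, compressed: str) -> float:
    """Return original size divided by compressed size."""
    return os.path.getsize(source) / os.path.getsize(compressed)
```

Text-based formats usually compress well, whereas files that are already compressed (many image and sequencing formats) gain little. Keep the original until you have verified that the compressed copy can be read back.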

Solutions

  • Based on the funder or institutional policy about data preservation, the data volume and the retrieval time span, discuss with the IT team what preservation solutions they can offer (e.g., data archiving services in your country) and the costs, so that you can budget for them in your DMP.
  • Publish your data in public repositories, and they will preserve the data for you.
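
For the budget line in your DMP, the preservation cost can be approximated as volume × duration × unit price. The function below is a simple sketch. The price per TB per year and the number of copies are placeholders that you would replace with the figures quoted by your IT team or archiving service.

```python
def preservation_cost(volume_tb: float, years: float,
                      price_per_tb_year: float, copies: int = 1) -> float:
    """Estimate the total cost of preserving data for a number of years.

    price_per_tb_year is a hypothetical unit price in your local currency.
    """
    if min(volume_tb, years, price_per_tb_year) < 0 or copies < 1:
        raise ValueError("inputs must be non-negative and copies at least 1")
    return volume_tb * years * price_per_tb_year * copies


if __name__ == "__main__":
    # Example: 2 TB kept for 5 years in two copies at an assumed 100 per TB-year.
    print(preservation_cost(2, 5, 100, copies=2))
```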

Tools and resources on this page

| Tool or resource | Description | Related pages | Registry |
|---|---|---|---|
| Box | Cloud storage and file sharing service | Data transfer | Training |
| CERNBox | CERNBox cloud data storage, sharing and synchronization | | |
| CS3 | Cloud Storage Services for Synchronization and Sharing (CS3) | | |
| DATAVERSE | Open source research data repository software. | Plant Phenomics, Plant sciences, Machine actionability | Standards/Databases, Training |
| Dropbox | Cloud storage and file sharing service | Data transfer, Documentation and meta... | |
| FAIRDOM-SEEK | A data Management Platform for organising, sharing and publishing research datasets, models, protocols, samples, publications and other research outcomes. | NeLS, Plant Phenomics, Microbial biotechnology, Plant sciences, Documentation and meta... | Tool info |
| Galaxy | Open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses. | Marine Metagenomics, Single-cell sequencing, Data analysis, Data provenance | Tool info, Training |
| iRODS | Integrated Rule-Oriented Data System (iRODS) is open source data management software for a cancer genome analysis workflow. | TransMed, Bioimaging data | Tool info |
| Microsoft OneDrive | Cloud storage and file sharing service from Microsoft | Data transfer | |
| Nextcloud | As a fully on-premises solution, Nextcloud Hub provides the benefits of online collaboration without the compliance and security risks | | |
| OpenStack | OpenStack is an open source cloud computing infrastructure software project and is one of the three most active open source projects in the world | Data analysis | Training |
| OSF | OSF (Open Science Framework) is a free, open platform to support your research and enable collaboration. | | Training |
| ownCloud | Cloud storage and file sharing service | Data transfer | |
| R Markdown | R Markdown documents are fully reproducible. Use a productive notebook interface to weave together narrative text and code to produce elegantly formatted output. Use multiple languages including R, Python, and SQL. | | Training |
| Research Data Management Platform (RDMP) | Data management platform for automated loading, storage, linkage and provision of data sets | | Tool info |
| Rucio | Rucio - Scientific Data Management | Data transfer | |
| ScienceMesh | ScienceMesh - frictionless scientific collaboration and access to research services | Data transfer | |
| SeaFile | SeaFile File Synchronization and Share Solution | Data transfer | |
| Semares | All-in-one platform for life science data management, semantic data integration, data analysis and visualization | Documentation and meta... | |
| tranSMART | Knowledge management and high-content analysis platform enabling analysis of integrated data for the purposes of hypothesis generation, hypothesis validation, and cohort discovery in translational research. | TransMed | Tool info |
National resources
Flemish Supercomputing Center (VSC)

VSC is Flanders’ most highly integrated high-performance research computing environment, providing world-class services to government, industry, and researchers.

Data Steward Research Software Engi... Data analysis
OLOS

OLOS is a Swiss-based data management portal, to help Swiss researchers safely manage, publish and preserve their data.

Data publication
SWISSUbase

SWISSUbase is a national cross-disciplinary solution for Swiss universities and other research organizations in need of local institutional data repositories for their researchers. The platform relies on international archiving standards and processes to ensure that data are preserved and accessible in the long-term.

Data publication
Czech National Repository

National Repository (NR) is a service provided to the scientific and research communities in the Czech Republic to store their generated research data together with a persistent DOI identifier. The NR service is currently in a pilot program.

Researcher Data Steward Research Software Engi... Existing data Identifiers Data management plan
e-INFRA CZ (Supercomputing and Data Services)

e-INFRA CZ provides an integrated high-performance research computing and data storage environment, offering world-class services to government, industry, and researchers. It also cooperates on the implementation of the European Open Science Cloud (EOSC) in the Czech Republic.

Data Steward Research Software Engi... Data analysis
ownCloud@CESNET

CESNET-hosted ownCloud is a 100 GB cloud storage freely available for Czech scientists to manage their data from any research projects.

ownCloud
Researcher Research Software Engi... Data organisation
GHGA

The German Human Genome-Phenome Archive.

Documentation and meta... Researcher Data Steward
Fairdata.fi

With the Fairdata Services you can store, share and publish your research data with easy-to-use web tools.

CSC Researcher Data Steward Data publication Existing data
Sensitive Data Services for Research

CSC Sensitive Data Services for Research are designed to support secure sensitive data management through web-user interfaces accessible from the user’s own computer.

CSC Researcher Data Steward Data sensitivity Data analysis Data publication Human data
BBMRI catalogue

Biobanking Netherlands makes biosamples, images and data findable, accessible and usable for health research.

Human data Researcher Data analysis Existing data
Health-RI Service Catalogue

Health-RI provides a set of tools and services available to the biomedical research community.

Human data Researcher Data analysis Existing data
Educloud Research

Educloud Research is a platform provided by the Centre for Information Technology (USIT) at the University of Oslo (UiO). This platform provides access to a work environment accessible to collaborators from other institutions or countries. This service provides a storage solution and a low-threshold HPC system that offers batch job submission (SLURM) and interactive nodes. Data up to the red classification level can be stored/analysed.

Data analysis Data sensitivity
HUNTCloud

The HUNT Cloud, established in 2013, aims to improve and develop the collection, accessibility and exploration of large-scale information. HUNT Cloud offers cloud services and lab management. It is a key service that has established a framework for data protection, data security, and data management. HUNT Cloud is owned by NTNU and operated by HUNT Research Centre at the Department of Public Health and Nursing at the Faculty of Medicine and Health Sciences.

Human data Data analysis Data sensitivity
NIRD

The National Infrastructure for Research Data (NIRD) offers storage services, archiving services, and processing capacity for computing on the stored data. It offers services and capacities to any scientific discipline that requires access to advanced, large-scale, or high-end resources for storing, processing, publishing research data or searching digital databases and collections. This service is owned and operated by Sigma2 NRIS, which is a joint collaboration between UiO, UiB, NTNU, UiT, and UNINETT Sigma2.

Data transfer NeLS FAIRtracks
Norwegian Research and Education Cloud (NREC)

NREC is an Infrastructure-as-a-Service (IaaS) project, commonly referred to as a cloud infrastructure, between the University of Bergen and the University of Oslo, with additional contributions from NeIC (Nordic e-Infrastructure Collaboration) and Uninett. An IaaS is a self-service infrastructure where you spawn standardized servers and storage instantly, as needed, from a given resource quota.

OpenStack
Data analysis
SAFE

SAFE (secure access to research data and e-infrastructure) is the solution for the secure processing of sensitive personal data in research at the University of Bergen. SAFE is based on the “Norwegian Code of conduct for information security in the health and care sector” (Normen) and ensures confidentiality, integrity, and availability are preserved when processing sensitive personal data. Through SAFE, the IT department offers a service where employees, students and external partners get access to dedicated resources for processing of sensitive personal data.

Human data Data analysis Data sensitivity
TSD

TSD (Service for Sensitive Data) is a platform for collecting, storing, analysing and sharing sensitive data in compliance with the Norwegian privacy regulation. TSD is developed and operated by UiO.

Human data Data analysis Data sensitivity TSD
BioData.pt Data Management Portal (DMPortal)

This instance of DataVerse is provided by BioData.pt. We can help you write and maintain data management plans for your research.

DATAVERSE
Researcher Data Steward
BioData.pt Service Hub

BioData.pt Service Hub includes several data management resources, tools and services available for researchers in Life Sciences.

Researcher Data Steward Data analysis
NAISS

The National Academic Infrastructure for Super­computing in Sweden (NAISS) is a national research infrastructure that makes available large-scale high-performance computing resources, storage capacity, and advanced user support, for Swedish research.

Data analysis