Your tasks: Data analysis
What are the best practices for data analysis?
When carrying out your analysis, keep in mind that all your data analysis has to be reproducible. This complements your research data management approach, since not only your data but also your tools and analysis environments will be FAIR compliant. In other words, you should be able to tell what data and what code or tools were used to generate your results.
This will help to tackle reproducibility problems and will also improve the impact of your research through collaborations with scientists who will reproduce your in silico experiments.
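One way to make "what data and what code or tools were used" concrete is to save a small provenance record next to each result. The following is a minimal sketch, not a standard format: the function names (`sha256_of`, `package_version`, `provenance_record`, `write_provenance`) and the layout of the record are assumptions made for illustration.

```python
import hashlib
import json
import platform
from importlib import metadata
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_version(name):
    """Return the installed version of a Python package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def provenance_record(input_files, packages):
    """Describe which data and which software were used for an analysis."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "inputs": {str(p): sha256_of(p) for p in input_files},
        "packages": {name: package_version(name) for name in packages},
    }


def write_provenance(record, destination):
    """Store the record as JSON so it can be shared with the results."""
    Path(destination).write_text(json.dumps(record, indent=2, sort_keys=True))
```

Calling `provenance_record(["counts.tsv"], ["numpy"])` (both names are hypothetical) at the end of an analysis script and saving the result gives collaborators a checksum to verify they are working from the same input data and a list of the software versions involved.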
There are many ways to make your data analysis reproducible. You can act at several levels:
- By providing your code.
- By providing your execution environment.
- By providing your workflows.
- By providing your data analysis execution.
- Make your code available. If you have to develop some software for your data analysis, it is always a good idea to publish your code. The Git version control system offers both a way to release your code and a way to track its versions. You can also use Git to interact with your software users. Be sure to specify a license for your code (see the licensing section).
- Use a package and environment management system. By using package and environment management systems like Conda and its bioinformatics-specialized channel Bioconda, researchers who have access to your code will be able to easily install specific versions of tools, even older ones, in an isolated environment. They will be able to compile and run your code in an equivalent computational environment, including any dependencies such as the correct version of R or particular libraries and command-line tools your code uses. You can also share and preserve your setup by specifying in an environment file which tools you installed.
- Use container environments. As an alternative to package management systems you can consider container environments like Docker or Singularity.
- Use workflow management systems. Scientific workflow management systems will help you organize and automate how computational tools are to be executed. Compared to composing tools using a standalone script, workflow systems also help document the different computational analyses applied to your data, and can help with scalability, such as cloud execution. Reproducibility is also enhanced by the use of workflows, as they typically have bindings for specifying software packages or containers for the tools you use from the workflow, allowing others to re-run your workflow without needing to pre-install every piece of software it needs. It is a flourishing field and many workflow management systems are available, some of which are general-purpose (e.g. they can run any command line tool), while others are domain-specific and have tighter tool integration. Among the many workflow management systems available, one can mention:
- Workflow platforms that manage your data and provide an interface (web, GUI, APIs) to run complex pipelines and review their results. For instance: Galaxy and Arvados (CWL-based, open source).
- Workflow runners that take a workflow written in a proprietary or standardized format (such as the CWL standard) and execute it locally or on a remote compute infrastructure. For instance, toil-cwl-runner, the reference CWL runner (cwltool), Nextflow, Snakemake, Cromwell.
- Use notebooks. Using notebooks, you will be able to create reproducible documents mixing text and code, which can help explain your analysis choices but can also be used as an exploratory method to examine data in detail. Notebooks can be used in conjunction with the other solutions mentioned above, as typically the notebook can be converted to a script. Some of the most well-known notebook systems are Jupyter, with built-in support for code in Python, R and Julia and many other kernels, and RStudio, based on R. See the table below for additional tools.
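To illustrate what a workflow management system automates, the sketch below derives an execution order from the inputs and outputs that each step declares, which is typically how workflow systems decide what to run and when. It is a toy illustration built on Python's standard `graphlib` module, not the API of Galaxy, CWL, Nextflow, Snakemake or any other system named above; the step and file names are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(steps):
    """Order steps so that each runs after the steps producing its inputs.

    `steps` maps a step name to {"inputs": [...], "outputs": [...]}.
    """
    # Record which step produces each file.
    producer = {}
    for name, step in steps.items():
        for output in step["outputs"]:
            producer[output] = name

    # A step depends on the producers of its inputs; inputs that no step
    # produces are treated as raw data that already exists.
    graph = {
        name: {producer[i] for i in step["inputs"] if i in producer}
        for name, step in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


# Hypothetical three-step analysis, deliberately listed out of order.
steps = {
    "report": {"inputs": ["counts.tsv"], "outputs": ["report.html"]},
    "align": {"inputs": ["reads.fastq"], "outputs": ["aligned.bam"]},
    "count": {"inputs": ["aligned.bam"], "outputs": ["counts.tsv"]},
}

print(execution_order(steps))  # ['align', 'count', 'report']
```

Real workflow systems add much more on top of this ordering, such as the software package or container bindings and the scalability support mentioned above.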
How can you use package and environment management systems?
By using package and environment management systems like Conda and its bioinformatics-specialized channel Bioconda, you will be able to easily install specific versions of tools, even older ones, in an isolated environment. You can also share and preserve your setup by specifying in an environment file which tools you installed.
Conda works by making a nested folder containing the traditional UNIX directory structure (such as lib/), but installed from Conda’s repositories instead of from a Linux distribution.
- As such Conda enables consistent installation of computational tools independent of your distribution or operating system version. Conda is available for Linux, macOS and Windows, giving consistent experience across operating systems (although not all software is available for all OSes).
- Package management systems work particularly well for installing free and open source software, but can also be useful for creating an isolated environment for installing commercial software packages, for instance if they require an older Python version than the one you have pre-installed.
- Conda is one example of a generic package management system, but individual programming languages typically have their own environment management and package repositories.
- You may want to consider submitting a release of your own code, or at least the general bits of it, to the package repositories for your programming language.
- macOS-specific package management systems: Homebrew, MacPorts.
- Windows-specific package management systems: Chocolatey and Windows Package Manager.
- Linux distributions also have their own package management systems (such as apt) that have a wide variety of tools available, but at the cost of less flexibility in terms of the tool versions, to ensure the tools can be installed alongside each other.
- Tips and tricks to navigate the landscape of software package management solutions:
- If you need multiple tools/programming languages, but your machines have different OS types or versions, list packages in a Conda environment file.
- If you need conflicting versions of some tools/libraries for different operations, make separate Conda environments.
- If you need a few open source libraries for your Python script, none of which require compilation, make a
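As an illustration of the environment file mentioned above, the sketch below renders one with pinned versions from Python. The helper `render_environment`, the environment name and the pinned versions are assumptions chosen for the example; only the overall layout (`name`, `channels`, `dependencies`) follows the Conda environment file format.

```python
def render_environment(name, channels, pinned):
    """Render a Conda environment file as text.

    `pinned` maps package names to the exact versions to install.
    """
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    # Sort the packages so the file is stable under version control.
    lines += [f"  - {pkg}={version}" for pkg, version in sorted(pinned.items())]
    return "\n".join(lines) + "\n"


text = render_environment(
    name="my-analysis",  # hypothetical environment name
    channels=["conda-forge", "bioconda"],
    pinned={"python": "3.10", "samtools": "1.17"},  # illustrative pins
)
print(text)
```

Saving this text as `environment.yml` and sharing it with your code should let others recreate the environment with `conda env create -f environment.yml`, subject to the pinned versions being available for their operating system.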
How can you use container environments?
In short, a container works almost like a virtual machine (VM), in that it re-creates a whole Linux distribution with separation of processes, files and network.
- Containers are more lightweight than VMs since they don’t virtualize hardware. This allows a container to run with a fixed version of the distribution independent of the host, and have just the right, minimal dependencies installed.
- Containers also add a level of isolation which, although not as secure as that of VMs, can reduce the attack vectors. For instance, if the database container were compromised by unwelcome visitors, they would not have access to modify the web server configuration, and the container would not be able to expose additional services to the Internet.
- A big advantage of containers is that there are large registries of community-provided container images.
- Note that modifying things inside a container is harder than in a usual machine, as changes from the image are lost when a container is recreated.
- Typically a container runs just one tool or application. For service deployment this is useful, for instance, to run a MySQL database in a separate container from a Node.js application.
- Docker is the most well-known container runtime, followed by Singularity. These require (and could be used to access) system administrator privileges to be set up.
- udocker and Podman are user space alternatives that have compatible command line usage.
- Large registries of community-provided container images are Docker Hub and Red Hat Quay.io. These images are often ready-to-go, not requiring any additional configuration or installations, allowing your application to quickly have access to open source server solutions.
- BioContainers has a large selection of bioinformatics tools.
- To customize a Docker image, it is possible to use techniques such as volumes to store data and a Dockerfile. This is useful for installing your own application inside a new container image, based on a suitable base image where you can do your apt install and software setup in a reproducible fashion, and share your own application as an image on Docker Hub.
- Container linkage can be done by container composition using tools like Docker Compose.
- More advanced container deployment solutions like Kubernetes and Computational Workflow Management systems can also manage cloud instances and handle analytical usage.
- Tips and tricks to navigate the landscape of container solutions:
- If you just need to run a database server, describe how to run it as a Docker/Singularity container.
- If you need several servers running, connected together, set up containers in Docker Compose.
- If you need to install many things, some of which are not available as packages, make a new Dockerfile recipe to build a container image.
- If you need to use multiple tools in a pipeline, find Conda or container images, compose them in a Computational Workflow.
- If you need to run tools in a cloud instance, but it has nothing preinstalled, use Conda or containers to ensure the installation on the cloud VM matches your local machine.
- If you just need a particular open source tool installed, e.g. ImageMagick, check its documentation for how to install it. For Ubuntu 20.04, try apt install imagemagick.
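When a containerized tool is called from an analysis script, it helps to build the command in one place so that the image tag and the mounted data folder are explicit. The sketch below assembles a `docker run` argument list; the helper name `docker_run_command`, the `/data` mount point and the example command are assumptions made for illustration, and nothing is executed unless you pass the list to `subprocess.run` on a machine with Docker installed.

```python
import shlex
from pathlib import Path


def docker_run_command(image, tool_args, data_dir, mount_point="/data"):
    """Build a `docker run` argument list for a single tool invocation.

    The host folder `data_dir` is mounted as a volume, so files written
    under `mount_point` remain on the host after the container is removed.
    """
    host_dir = Path(data_dir).resolve()
    return [
        "docker", "run",
        "--rm",  # remove the container once the tool has finished
        "--volume", f"{host_dir}:{mount_point}",
        "--workdir", mount_point,
        image,  # use a fixed tag so the same image is run every time
        *tool_args,
    ]


cmd = docker_run_command("ubuntu:20.04", ["wc", "-l", "input.txt"], data_dir=".")
print(shlex.join(cmd))
# To execute it (requires Docker): subprocess.run(cmd, check=True)
```

Because changes made inside a container are lost when it is recreated, keeping inputs and outputs in the mounted folder is what makes the results persist.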
NeLS provides the necessary tools for data management aimed at researchers in Norway and their collaborators.
XNAT for preclinical imaging centers (XNAT-PIC) is a set of tools to store, process and share preclinical imaging studies, built on top of the XNAT imaging informatics platform.
TransMed from ELIXIR Luxembourg supports clinical and translational projects in translational biomedicine.
OMERO is a software platform for managing, sharing and analysing image data.
Support for DMP on: Did you choose the workflow engine you will be using?
Support for DMP on: How will you work with your data?
Support for DMP on: Will you use a central repository for all tools and their versions as used in your project?
Support for DMP on: How will you make sure to know what exactly has been run?
Relevant tools and resources
|Tool or resource||Description||Related pages||Registry|
|Ada Discovery Analytics (Ada)||Ada is a performant and highly configurable system for secured integration, visualization, and collaborative analysis of heterogeneous data sets, primarily targeting clinical and experimental sources.||TransMed|
|Amazon Web Services||Amazon Web Services||Data storage Data transfer||Training|
|Arvados||With Arvados, bioinformaticians run and scale compute-intensive workflows, developers create biomedical applications, and IT administrators manage large compute and storage resources.||Data Steward: infrastructure Data Steward: policy Researcher|
|BIAFLOWS||BIAFLOWS is an open-source web framework to reproducibly deploy and benchmark bioimage analysis workflows||Tool info|
|BIII||The BioImage Informatics Index is a registry of software tools, image databases for benchmarking, and training materials for bioimage analysis||Data Steward: infrastructure||Tool info|
|Bioconda||Bioconda is a bioinformatics channel for the Conda package manager||Data Steward: infrastructure||Tool info Training|
|BisQue||Resource for management and analysis of 5D biological images||Data organisation Data Steward: research Bioimaging data||Tool info|
|BoostDM||BoostDM is a method to score all possible point mutations (single base substitutions) in cancer genes for their potential to be involved in tumorigenesis.||Human data||Tool info|
|CalibraCurve||A highly useful and flexible tool for calibration of targeted MS-based measurements. CalibraCurve enables an automated batch-mode determination of dynamic linear ranges and quantification limits for both targeted proteomics and similar assays. The software uses a variety of measures to assess the accuracy of the calibration and provides intuitive visualizations.||Proteomics||Tool info|
|Cancer Genome Interpreter||Cancer Genome Interpreter (CGI) is designed to support the identification of tumor alterations that drive the disease and detect those that may be therapeutically actionable.||Human data||Tool info|
|ChEMBL||Database of bioactive drug-like small molecules, it contains 2-D structures, calculated properties and abstracted bioactivities.||Researcher Toxicology data||Tool info Standards/Databases Training|
|Common Workflow Language (CWL)||An open standard for describing workflows that are built from command line tools||Data Steward: infrastructure Researcher||Standards/Databases Training|
|Conda||Open source package management system||Data Steward: infrastructure||Training|
|DisGeNET||A discovery platform containing collections of genes and variants associated to human diseases.||Human data Researcher Toxicology data||Tool info Standards/Databases|
|Docker||Docker is a software for the execution of applications in virtualized environments called containers. It is linked to DockerHub, a library for sharing container images||Data Steward: infrastructure||Standards/Databases Training|
|Galaxy||Open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses. Different instances available||NeLS Marine Metagenomics Researcher Data Steward: infrastructure IFB||Tool info Training|
|GENEID||Geneid is an ab initio gene finding program used to predict genes along DNA sequences in a large set of organisms.||Researcher||Tool info|
|GRAPE 2.0||The GRAPE pipeline provides an extensive pipeline for RNA-Seq analyses. It allows the creation of an automated and integrated workflow to manage, analyse and visualize RNA-Seq data.||Tool info|
|HumanMine||HumanMine integrates many types of human data and provides a powerful query engine, export for results, analysis for lists of data and FAIR access via web services.||Data organisation Data Steward: research Researcher Human data||Tool info Standards/Databases Training|
|iCloud||Data sharing||Data storage Data transfer|
|IntoGen||IntoGen collects and analyses somatic mutations in thousands of tumor genomes to identify cancer driver genes.||Human data||Tool info|
|Jupyter||Jupyter notebooks allow you to share code and documentation||Data Steward: infrastructure||Training|
|LUMI||EuroHPC world-class supercomputer||Researcher Data Steward: infrastructure CSC||Tool info|
|Nextflow||Nextflow is a framework for data analysis workflow execution||Data Steward: infrastructure||Tool info Training|
|OHDSI||Multi-stakeholder, interdisciplinary collaborative to bring out the value of health data through large-scale analytics. All our solutions are open-source.||Researcher Data Steward: research Data storage TransMed Toxicology data||Tool info|
|OpenEBench||ELIXIR benchmarking platform to support community-led scientific benchmarking efforts and the technical monitoring of bioinformatics resources||Data Steward: research Data Steward: infrastructure||Tool info|
|OpenStack||OpenStack is an open source cloud computing infrastructure software project and is one of the three most active open source projects in the world. Different instances available||Data storage TransMed IFB||Training|
|OTP||One Touch Pipeline (OTP) is a data management platform for running bioinformatics pipelines in a high-throughput setting, and for organising the resulting data and metadata.||Human data Documentation and metadata Data management plan||Tool info|
|OwnCloud||Cloud storage and file sharing service||Data storage Data Steward: infrastructure Data transfer|
|PAA||PAA is an R/Bioconductor tool for protein microarray data analysis aimed at biomarker discovery.||Researcher Human data Proteomics||Tool info|
|PIA - Protein Inference Algorithms||PIA is a toolbox for mass spectrometry based protein inference and identification analysis.||Researcher Proteomics||Tool info|
|PMut||Platform for the study of the impact of pathological mutations in protein structures.||Human data||Tool info|
|R Markdown||R Markdown documents are fully reproducible. Use a productive notebook interface to weave together narrative text and code to produce elegantly formatted output. Use multiple languages including R, Python, and SQL.||Researcher||Training|
|Reva||Reva connects cloud storages and application providers||Data transfer||Tool info|
|RStudio||RStudio notebooks allow you to share code and documentation||Data Steward: infrastructure Researcher||Tool info Training|
|Rucio||Rucio - Scientific Data Management||Data storage Data transfer|
|ScienceMesh||ScienceMesh - frictionless scientific collaboration and access to research services||Data storage Data transfer|
|semares||All-in-one platform for life science data management, semantic data integration, data analysis and visualization||Researcher Data Steward: research Documentation and metadata Data Steward: infrastructure Data storage|
|Singularity||Singularity is a container platform.||Data Steward: infrastructure TSD||Training|
|Snakemake||Snakemake is a framework for data analysis workflow execution||Data Steward: infrastructure||Tool info Training|
|tranSMART||Knowledge management and high-content analysis platform enabling analysis of integrated data for the purposes of hypothesis generation, hypothesis validation, and cohort discovery in translational research.||Researcher Data Steward: research Data storage TransMed||Tool info|
|TXG-MAPr||A tool that contains weighted gene co-expression networks obtained from the Primary Human Hepatocytes, rat kidney, and liver TG-GATEs dataset.||Researcher Toxicology data||Tool info|
|XNAT||Open source imaging informatics platform. It facilitates common management, productivity, and quality assurance tasks for imaging and associated data.||Researcher TransMed XNAT-PIC Bioimaging data|
|XNAT-PIC Pipelines||Analysing of single or multiple subjects within the same project in XNAT||Researcher Data Steward: research XNAT-PIC|
Galaxy Belgium is a Galaxy instance managed by the Belgian ELIXIR node and funded by the Flemish government, which utilizes infrastructure provided by the Flemish Supercomputer Center (VSC).
|Flemish Supercomputing Center (VSC)||VSC is Flanders’ most highly integrated high-performance research computing environment, providing world-class services to government, industry, and researchers.||Data Steward: research Data Steward: infrastructure Data storage|
This is the Estonian instance of Galaxy, which is an open source, web-based platform for data intensive biomedical research.
Chipster is a user-friendly analysis software for high-throughput data such as RNA-seq and single cell RNA-seq. It contains analysis tools and a large reference genome collection.
|CSC Researcher Data Steward: infrastructure|
|Sensitive Data Services for Research||CSC Sensitive Data Services for Research are designed to support secure sensitive data management through web user interfaces accessible from the user’s own computer.||CSC Researcher Data Steward: research Sensitive data Data storage Data publication Human data|
|High performance computing||The performance of the CSC supercomputers Puhti, Mahti and LUMI ranges from medium-scale simulations to one of the most competitive supercomputers in the world.||CSC Researcher Data Steward: research|
CSC offers a variety of cloud computing services: the Pouta IaaS services and the Rahti container cloud service.
|CSC Researcher Data Steward: research|
META-pipe is a pipeline for annotation and analysis of marine metagenomics samples, which provides insight into phylogenetic diversity, metabolic and functional potential of environmental communities.
MarDB includes all non-complete marine microbial genomes regardless of level of completeness. Each entry contains 120 metadata fields including information about sampling environment or host, organism and taxonomy, phenotype, pathogenicity, assembly and annotation.
MarFun is a manually curated marine fungi genome database.
Galaxy is an open source, web-based platform for data intensive biomedical research. This instance of Galaxy is coupled with NeLS for easy data transfer.
|Sensitive data Existing data Data publication NeLS|
|Sigma2 HPC systems||The current Norwegian academic HPC infrastructure consists of three systems for different purposes. The Norwegian academic high-performance computing and storage infrastructure is maintained by Sigma2 NRIS, which is a joint collaboration between UiO, UiB, NTNU, UiT, and UNINETT Sigma2.|
|Norwegian Research and Education Cloud (NREC)||NREC is an Infrastructure-as-a-Service (IaaS) project, commonly referred to as a cloud infrastructure, between the University of Bergen and the University of Oslo, with additional contributions from NeIC (Nordic e-Infrastructure Collaboration) and Uninett. An IaaS is a self-service infrastructure where you spawn standardized servers and storage instantly, as needed, from a given resource quota.|
Educloud Research is a platform provided by the Centre for Information Technology (USIT) at the University of Oslo (UiO). This platform provides a work environment accessible to collaborators from other institutions or countries. The service provides a storage solution and a low-threshold HPC system that offers batch job submission (SLURM) and interactive nodes. Data up to the red classification level can be stored and analysed.
|Sensitive data Data storage|
TSD (Service for Sensitive Data) is a platform for collecting, storing, analysing and sharing sensitive data in compliance with the Norwegian privacy regulation. TSD is developed and operated by UiO.
|Human data Sensitive data Data storage TSD|
The HUNT Cloud, established in 2013, aims to improve and develop the collection, accessibility and exploration of large scale information. HUNT Cloud offers cloud services, lab management, and is a key service that has established a framework for data protection, data security, and data management. HUNT Cloud is owned by NTNU and operated by HUNT Research Centre at the Department of Public Health and Nursing at the Faculty of Medicine and Health Sciences.
|Human data Sensitive data Data storage|
SAFE (secure access to research data and e-infrastructure) is a solution for secure processing of sensitive personal data in research at the University of Bergen. SAFE is based on the “Norwegian Code of conduct for information security in the health and care sector” (Normen) and ensures that confidentiality, integrity, and availability are preserved when processing sensitive personal data. Through SAFE, the IT department offers a service where employees, students and external partners get access to dedicated resources for processing sensitive personal data.
|Human data Sensitive data Data storage|
|BioData.pt Service Hub||BioData.pt Service Hub includes several data management resources, tools and services available for researchers in Life Sciences.||Researcher Data Steward: research Data storage|
The Swedish National Infrastructure for Computing (SNIC) is a national research infrastructure that makes available large-scale high-performance computing resources, storage capacity, and advanced user support, for Swedish research.