Your tasks: Data organisation
What is the best way to name a file?
Description
Brief and descriptive file names are important for keeping your data files organised. A file name is the principal identifier for a file. A good name tells you what the file contains and helps with sorting, but only if you apply the naming consistently.
Considerations
- Best practice is to develop a file naming convention, with the elements that are important to your project, at the start of the project.
- When working in collaboration with others, it is important to follow the same file naming convention.
Solutions
Tips for naming files
- Balance the number of elements: too many make the name difficult to understand, while too few make it too general.
- Order the elements from general to specific.
- Use meaningful abbreviations.
- Use an underscore (_), a hyphen (-) or capitalized letters to separate elements in the name. Don’t use spaces or special characters: ? ! & , * % # ; ( ) @ $ ^ ~ ‘ { } [ ] < >.
- Use the ISO 8601 date format, YYYYMMDD, and add the time as HHMMSS if needed.
- Include a unique identifier (see: Identifiers)
- Include a version number if appropriate: use at least two digits (V02) and extend it if needed for minor corrections (V02-03). The leading zeros ensure that the files are sorted correctly.
- Write your file naming convention down and explain abbreviations in your data documentation.
- If you need to rename many files to organise your project data and manage your files better, you can use applications such as Bulk Rename Utility (Windows, free) and Renamer4Mac (Mac).
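The tips above can be turned into a small helper. The following is a minimal sketch in Python, not a prescribed implementation: the function name `build_file_name`, the choice of underscore as separator and the rule for allowed characters are assumptions made for illustration.

```python
import re
from datetime import date

# Assumed rule: an element may contain only letters, digits and hyphens.
_ELEMENT_PATTERN = re.compile(r"[A-Za-z0-9-]+")


def build_file_name(created, elements, version, extension):
    """Join a date, descriptive elements and a version into one file name."""
    for element in elements:
        if not _ELEMENT_PATTERN.fullmatch(element):
            raise ValueError(f"Invalid characters in element: {element!r}")
    date_part = created.strftime("%Y%m%d")   # ISO 8601 basic format, YYYYMMDD
    version_part = f"V{version:02d}"         # at least two digits, e.g. V03
    stem = "_".join([date_part, *elements, version_part])
    return f"{stem}.{extension}"


name = build_file_name(date(2020, 12, 2), ["HB", "EXP2", "HEL", "DATA"], 3, "xls")
print(name)  # 20201202_HB_EXP2_HEL_DATA_V03.xls

# Leading zeros keep plain alphabetical sorting in version order.
print(sorted(["V10", "V2", "V1"]))    # ['V1', 'V10', 'V2']
print(sorted(["V10", "V02", "V01"]))  # ['V01', 'V02', 'V10']
```

Rejecting invalid elements, rather than silently replacing characters, keeps the convention visible to whoever creates the file.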
Example elements to include in the file name
- Date of creation
- Project number / Experiment / Acronym
- Type of data (Sample ID, Analysis, Conditions, Modifications, etc.)
- Location / Coordinates
- Name / Initials of the creator
- Version number
- Reserve the last three letters for the file format extension (e.g. .xls, .rtf, .mov, .tif, .doc)
Examples of good file names
- Honeybee project, experiment 2 done in Helsinki, data file created on the second of December 2020
  - File name: 20201202_HB_EXP2_HEL_DATA_V03.xls
  - Explanation: Time_ProjectAbbreviation_ExperimentNumber_Location_TypeOfData_VersionNumber
- Cropped image of an ant head taken on the third of December 2020 by Meg Megson
  - File name: 20201203_MM_HEAD_CROPPED_V1.psd
  - Explanation: Time_Creator_TypeOfData_Modification_Version
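A consistent convention also makes file names machine-readable. The sketch below, in Python, splits the honeybee file name into its elements. The field names in `FIELDS` and the function `parse_file_name` are hypothetical and would need to match your own convention.

```python
from datetime import datetime
from pathlib import Path

# Hypothetical element order, matching the honeybee example.
FIELDS = ["date", "project", "experiment", "location", "data_type", "version"]


def parse_file_name(file_name):
    """Split a file name into its named elements."""
    path = Path(file_name)
    parts = path.stem.split("_")
    if len(parts) != len(FIELDS):
        raise ValueError(f"Expected {len(FIELDS)} elements, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    # Fail early if the first element is not a valid YYYYMMDD date.
    record["date"] = datetime.strptime(record["date"], "%Y%m%d").date()
    record["extension"] = path.suffix.lstrip(".")
    return record


info = parse_file_name("20201202_HB_EXP2_HEL_DATA_V03.xls")
print(info["date"], info["project"], info["version"])  # 2020-12-02 HB V03
```

A parser like this can double as a check that new files follow the agreed convention.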
How do you manage file versioning?
Description
File versioning is a way to keep track of changes made to files and datasets. While the implementation of a good file naming convention will indicate that different versions exist, this will not explain the difference between two (or more) versions. File versioning will enable transparency about which actions and changes were made and when. This makes it easy to backtrack and find something that was present in a previous version, but was later deleted or changed.
Considerations
- Do you need to collaborate on files, perhaps at the same time?
- Is there a need to be able to backtrack and restore a previous version?
- Will there be many changes made?
Solutions
- Small versioning needs can be managed manually, e.g. by keeping a log in which the changes to each file are documented, version by version.
- For automatic management of versioning, conflict resolution and back-tracing capabilities, use version control software such as Git, hosted by e.g. GitHub or Bitbucket.
- Use a Cloud Storage service (see Data storage page) that provides automatic file versioning. It can be very handy for spreadsheets, text files and slides.
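For the manual option, a change log can be as simple as a CSV file with one row per file version. The following Python sketch shows one possible shape. The function `log_change`, the column names, the log file name and the example entry are all assumptions for illustration.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path


def log_change(log_path, file_name, version, description, changed_on=None):
    """Append one row describing a new file version to a CSV change log."""
    log_path = Path(log_path)
    is_new = not log_path.exists()
    changed_on = changed_on or date.today()
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            # Write the header only when the log is created.
            writer.writerow(["date", "file", "version", "change"])
        writer.writerow([changed_on.isoformat(), file_name, version, description])


# Demonstration in a temporary folder; the logged change is a made-up example.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "CHANGELOG.csv"
    log_change(log, "20201202_HB_EXP2_HEL_DATA_V03.xls", "V03",
               "Removed duplicate rows", date(2020, 12, 4))
    print(log.read_text(encoding="utf-8"))
```

Once several people edit the same files, or changes become frequent, a version control system is usually a better fit than a hand-maintained log.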
How do you organise files in a folder structure?
Description
A carefully planned folder structure, with intelligible folder names and an intuitive design, is the foundation for good data organisation. The folder structure gives an overview of which information can be found where, enabling present as well as future stakeholders to understand what files have been produced in the project.
Considerations
- The decisions on how to organise the files should be made during planning and design of the project, so that the strategy can be implemented from the start.
- Consider applying the same strategy consistently in every project within the research group.
Solutions
Folders should:
- follow a structure with folders and subfolders that correspond to the project design and workflow
- have a self-explanatory name that is only as long as is necessary
- have a unique name – avoid assigning the same name to a folder and a subfolder
The top folder should have a README.txt file describing the folder structure and what files are contained within the folders. This file should also contain an explanation of the file naming convention. See also A Quick Guide to Organizing Computational Biology Projects.
An example:
project/
    code/            code needed to go from input files to final results
    data/            raw and primary data (never edit!)
        raw_external/
        raw_internal/
        meta/
    doc/             documentation of the study
    intermediate/    output files from intermediate analysis steps
    logs/            logs from the different analysis steps
    notebooks/       notebooks that document your day-to-day work
    results/         output from workflows and analyses
        figures/
        reports/
        tables/
    scratch/         temporary files that can safely be deleted or lost
    README.txt       file and folder description
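To apply the same structure in every project, the skeleton can be created by a short script. This Python sketch reproduces the example layout above. The function name `create_project` and the placeholder README text are assumptions.

```python
import tempfile
from pathlib import Path

# Folder layout taken from the example; each list holds the subfolders.
LAYOUT = {
    "code": [],
    "data": ["raw_external", "raw_internal", "meta"],
    "doc": [],
    "intermediate": [],
    "logs": [],
    "notebooks": [],
    "results": ["figures", "reports", "tables"],
    "scratch": [],
}


def create_project(root):
    """Create the folder skeleton and a placeholder README.txt under root."""
    root = Path(root)
    for folder, subfolders in LAYOUT.items():
        (root / folder).mkdir(parents=True, exist_ok=True)
        for subfolder in subfolders:
            (root / folder / subfolder).mkdir(exist_ok=True)
    readme = root / "README.txt"
    if not readme.exists():  # never overwrite an existing description
        readme.write_text(
            "Describe the folder structure and file naming convention here.\n",
            encoding="utf-8",
        )
    return root


# Demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "project")
    print(sorted(p.name for p in project.iterdir()))
```

A project template tool such as Cookiecutter, listed in the table below, serves the same purpose when the template needs to be shared across a group.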
Related pages
- OMERO: OMERO is a software platform for managing, sharing and analysing image data.
- TransMed: The TransMed tool assembly from ELIXIR Luxembourg supports projects in clinical and translational biomedicine.
- XNAT-PIC: XNAT for Preclinical Imaging Centers (XNAT-PIC) is a set of tools to store, process and share preclinical imaging studies, built on top of the XNAT imaging informatics platform.
More information
Relevant tools and resources
Tool or resource | Description | Related pages | Registry |
---|---|---|---|
BisQue | Resource for management and analysis of 5D biological images | Data Steward: research Data analysis Bioimaging data | Tool info |
Bitbucket | Git based code hosting and collaboration tool, built for teams. | Data Steward: research Data Steward: infrastructure | Standards/Databases |
Bulk Rename Utility | File renaming software for Windows | Data Steward: research Researcher | |
Cookiecutter | A command-line utility that creates projects from cookiecutters (project templates), e.g. creating a Python package project from a Python package project template. | Data Steward: infrastructure Data Steward: research | |
Git | Distributed version control system designed to handle everything from small to very large projects | Data Steward: research Data Steward: infrastructure | Training |
GitHub | Versioning system, used for sharing code, as well as for sharing of small data | Data publication Data Steward: infrastructure Data Steward: research | Standards/Databases Standards/Databases Training |
GitLab | GitLab is an open source end-to-end software development platform with built-in version control, issue tracking, code review, CI/CD, and more. Self-host GitLab on your own servers, in a container, or on a cloud provider. | Data publication Data Steward: infrastructure Data Steward: research | Standards/Databases Training |
HumanMine | HumanMine integrates many types of human data and provides a powerful query engine, export for results, analysis for lists of data and FAIR access via web services. | Data Steward: research Researcher Human data Data analysis | Tool info Standards/Databases Training |
pISA-tree | A data management solution for intra-institutional organization and structured storage of life science project-associated research data, with emphasis on the generation of adequate metadata. | Microbial biotechnology Researcher Data Steward: research Documentation and metadata Plant Phenomics Plant Genomics | Tool info |
Renamer4Mac | File renaming software for Mac | Data Steward: research Researcher | |
Research Object Crate (RO-Crate) | RO-Crate is a lightweight approach to packaging research data with their metadata, using schema.org. An RO-Crate is a structured archive of all the items that contributed to the research outcome, including their identifiers, provenance, relations and annotations. | Documentation and metadata Data storage Data Steward: research Researcher Microbial biotechnology Machine actionability Data provenance | Standards/Databases |
SMASCH | SMASCH (Smart Scheduling) is a web-based tool designed for longitudinal clinical studies requiring recurrent follow-up visits of the participants. SMASCH controls and simplifies the scheduling of a large database of patients. SMASCH is also used to organise the daily planning (delegation of tasks) for the different medical professionals such as doctors, nurses and neuropsychologists. | TransMed | |
National resources | | | |
ownCloud@CESNET | CESNET-hosted ownCloud is a 100 GB cloud storage freely available for Czech scientists to manage their data from any research projects. ownCloud | Researcher Data Steward: infrastructure Data storage | |