Which types of identifiers can you use during data collection?
A lot of (meta)data is collected in the form of tables, representing quantitative or qualitative measurements (values in cells) of certain named properties (variables in columns) of a range of subjects or samples (records or observations in rows). It helps your research considerably if you can address each of these records, variables and values unambiguously, i.e. if each has a unique identifier. This is also true for (meta)data that is not in tabular format (“key”:value format, unstructured data, etc.). Identifiers should always be used for metadata and data, independently of the format.
If the research institute or group has a centralised and structured system (such as a central electronic database) in place to describe and store (meta)data, this process can be quite straightforward for the researcher. However, if there is no such system, researchers often have to set up an internal “database” to keep track of each record or observation in a study. This situation can be challenging for many reasons, one of which is assigning identifiers. The use of identifiers for records, variables and values will increase the reusability and interoperability of the data for you, your future self and others.
- At the beginning of your research project, check if your institute or research group has a centralised database where data must be entered during data collection. Usually, large and international research projects, industries, research institutes or hospitals have a centralised electronic database, an Electronic Data Capture (EDC) system, a Laboratory Information Management System (LIMS) or an Electronic Lab Notebook (ELN) with a user interface for data entry. More details about using ELNs are given by e.g. University of Cambridge - Electronic Research Notebook Products and Harvard Medical School - Electronic Lab Notebooks.
- If you can choose how to manage your data entry system, consider the level of exposure of the identifier for each record or observation in the dataset. Define the context in which the identifier should be used and within which it is unique. This is key to deciding what kind of identifier is appropriate for each individual record in your case.
- Should the identifier of a record or observation be unique within your spreadsheet, your entire research project files or across the whole institute? What’s the reference system (or “target audience”) of your identifier?
- Will your reference system change in due time? If it will be opened up later, assigning globally unique identifiers from the beginning may save time.
- Will the identifiers for individual records or observations be made openly accessible on the internet during data collection?
- If the identifier of an individual record or observation should be unique only within your research group (within an intranet), and it will not be available on the internet, it can be considered an “internal or local identifier”. A local identifier is unique only in a specific local context (e.g. single collection or dataset).
- Local identifiers can be applied not only for individual records or observations in a tabular dataset but also for each variable or even value (columns and cells in a tabular dataset, respectively).
- Identifiers for an individual record, variable and value in a dataset can be assigned by using ontology terms (see metadata page) or accession numbers provided by public databases, such as EBI and National Center for Biotechnology Information (NCBI) repositories. Here are a few examples for tabular (meta)data, but the same type of identifiers can be applied independently of the (meta)data structure and format.
- The patient ID is in its own row, a column header is the variable “disease” from the EFO ontology (ID EFO:0000408), and the value in the cell is the child term “chronic fatigue syndrome” (ID EFO:0004540) of “disease”.
- The specimen ID is in its own row, a column header is the variable “Ensembl gene ID” from the Ensembl genome browser and the value in the cell is the identifier for BRCA1 gene ENSG00000012048.
- If your institute or research group makes use of centralised electronic databases (EDC, LIMS, ELN, etc.), follow the related guidelines for generating and assigning identifiers to individual records or observations, within the database. Some institutes have a centralised way of providing identifiers; ask the responsible team for help.
- Internal or local identifiers should be unique names based on a specific naming convention and a formal pattern, such as a regular expression. Encode the regular expression into your spreadsheet or software and make sure to describe it in the documentation (README file or codebook). Avoid any ambiguity! Identifiers for specimens (such as a biopsy or a blood sample), animal or plant models or patients could be written on the specimen tubes, the animal or plant model tags and the patients’ files, respectively.
- Avoid embedding meaning into your local identifier. If you need to convey meaning in a short name, implement a “label” for human readability only (Lesson 4. Avoid embedding meaning or relying on it for uniqueness).
- Do not use problematic characters and patterns in your local identifier (Lesson 5. Design new identifiers for diverse uses by others). Problematic strings can be misinterpreted by some software. If you cannot avoid them, it is better to fix the bugs in the software or to explicitly declare the possible issue in the documentation.
- Ontology terms or accession numbers provided by public databases, such as EBI and NCBI repositories, can be applied to uniquely identify genes, proteins, chemical compounds, diseases, species, etc. Choose exactly one identifier scheme for each type of entity, so that your own datasets remain interoperable with each other. Identifiers for molecules, assigned by EBI and NCBI repositories, keep track of relations between identifiers (for instance, different versions of a molecule). You can also submit your newly identified molecules to EBI or NCBI repositories to get a unique identifier.
- Applying ontologies to variables keeps the structure and the relations between variables clear (e.g. “compound & dose”, “variable & unit”). Software tools that allow you to integrate ontology terms into a spreadsheet include Rightfield and OntoMaton.
- If you keep track of each record in a tabular format that gets new rows every day, use a versioning system to track the changes. Many cloud storage services offer automatic versioning; alternatively, keep a versioning log (see data organisation page). Some parts of the tabular (meta)data file must be stable to be useful: do not delete or duplicate essential columns. Generate documentation about your tabular (meta)data file (README file, codebook, etc.).
- If you collect data from a database that is frequently updated (dynamic or evolving database), it is recommended to keep track not only of the database ID, but also of the used version (by timestamp, or by recording date and time of data collection) and of the exact queries that you performed. In this way, the exact queries can be re-executed against the timestamped data store (Data citation of evolving data).
- If you reuse an existing dataset, keep the provided identifier for provenance and give a new identifier according to your system, but preserve the relation with the original identifier to be able to trace back to the source. Use a spreadsheet or create a mapping file to keep the relation between provenance and internal identifier.
- To set up a centralised machine readable database, an EDC, a LIMS or an ELN for large research projects or institutes (available on intranet), highly specialised technical skills in databases, programming and computer science might be needed. We encourage you to talk to the IT team or experts in the field to find software and tools to implement such a system.
- Software tools for building machine-readable databases and data collection systems are available. Their interfaces are quite user friendly, but command-line skills might be needed depending on how you intend to use them.
- MOLGENIS is a modular web application for scientific data. MOLGENIS was born from molecular genetics research but has grown to be used in many scientific areas such as biobanking, rare disease research, patient registries and even energy research. MOLGENIS provides researchers with user friendly and scalable software infrastructures to capture, exchange, and exploit the large amounts of data that is being produced by scientific organisations all around the world.
- Castor is an EDC system for researchers and institutions. With Castor, you can create and customize your own database in no time. Without any prior technical knowledge, you can build a study in just a few clicks using an intuitive Form Builder. Simply define your data points and start collecting high quality data; all you need is a web browser.
- REDCap is a secure web application for building and managing online surveys and databases. While REDCap can be used to collect virtually any type of data in any environment, it is specifically geared to support online and offline data capture for research studies and operations.
- We don’t encourage setting up a centralised electronic database that will be exposed to the internet, unless really necessary. We encourage you to use existing and professional deposition databases to publish and share your datasets (see below).
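The advice above on local identifiers (a documented pattern, no embedded meaning, uniqueness within the chosen context, and a mapping back to the identifiers of reused data) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the `RG` prefix, the six-digit counter, the function names and the example source identifiers are all hypothetical and should be replaced by your own documented convention.

```python
import re
from collections import Counter

# Hypothetical convention: a fixed "RG" prefix followed by a zero-padded,
# six-digit counter. The letter prefix keeps spreadsheets from stripping
# leading zeros; the counter carries no meaning about the record itself.
LOCAL_ID_PATTERN = re.compile(r"RG[0-9]{6}")


def make_local_id(counter: int) -> str:
    """Build a local identifier from a running counter."""
    return f"RG{counter:06d}"


def check_ids(ids):
    """Return (malformed, duplicated) identifiers found in a column."""
    malformed = [i for i in ids if not LOCAL_ID_PATTERN.fullmatch(i)]
    duplicated = sorted(i for i, n in Counter(ids).items() if n > 1)
    return malformed, duplicated


def map_source_ids(source_ids, start=1):
    """Assign a new local identifier to each identifier of a reused
    dataset, keeping the original so provenance can be traced back."""
    return {src: make_local_id(n) for n, src in enumerate(source_ids, start)}


# Example: validate a column of identifiers, then build a mapping table.
column = ["RG000001", "RG000002", "RG000002", "rg-3"]
malformed, duplicated = check_ids(column)
print("malformed:", malformed)
print("duplicated:", duplicated)
print(map_source_ids(["sample_A", "sample_B"]))
```

Writing the mapping returned by `map_source_ids` to a file, and describing the pattern in the README or codebook, keeps the convention visible to anyone who reuses the data.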
Which type of identifiers should you use for data publication?
When all records and measurements have been collected and you are ready to share your entire dataset with others, it is good practice to assign globally unique persistent identifiers in order to make your dataset more FAIR. “A Globally Unique Identifier (GUID) is a unique number that can be used as an identifier for anything in the universe and the uniqueness of a GUID relies on the algorithm that was used to generate it” (What is a GUID?). “A persistent identifier (PID) is a long-lasting reference to a resource. That resource might be a publication, dataset or person. Equally it could be a scientific sample, funding body, set of geographical coordinates, unpublished report or piece of software. Whatever it is, the primary purpose of the PID is to provide the information required to reliably identify, verify and locate it. A PID may be connected to a set of metadata describing an item rather than to the item itself” (What is a persistent identifier, OpenAIRE). This means that any dataset with a PID will be findable even if the location of the dataset and its web address (URL) change. The central registry that manages PIDs ensures that a given PID points to the digital resource’s current location. There are different types of PID, such as DOI, PURL, Handle, IGSN and URN. The GO FAIR organisation provides examples of GUIDs, PIDs and services that supply identifiers.
PIDs are essential to make your digital objects (datasets or resources) citable, enabling you to claim and receive credit for your research output. In turn, when you reuse someone else’s research output, you have to cite it.
There are different ways to obtain a globally unique persistent identifier, and you need to decide which one is the best solution for your dataset or resource.
- By publishing into an existing public repository. For most types of data, this is usually the best option because the repository will assign a globally unique persistent identifier or an accession number. Update your internal database to keep the relationship with public identifiers.
- By opening up your local database to the public. This requires that the resource has a sustainability plan, as well as policies for versioning and naming of identifiers. While this option could be a viable solution if there is no public repository that allows for the right level of exposure of your data, it puts a lot of responsibility on your shoulders for future maintenance and availability.
- If you want to publish your data into an existing public repository, please first see our data publication page. The repository will provide globally unique persistent identifiers for your data. Check the repository’s guidelines if you need to edit or update your dataset after publication. Generic repositories (such as Zenodo and FigShare) use versioned DOIs to update a public dataset or document.
- If you want to publish your data in an institutional public repository, ask the institution to obtain a namespace at Identifiers.org in order to obtain globally unique persistent identifiers for your data.
- If you have the resources and skills to open up your database to the public, obtain a namespace at Identifiers.org in order to acquire globally unique persistent identifiers for your data.
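Identifiers.org works with Compact Identifiers of the form `prefix:accession`. As a rough illustration of how such an identifier relates to a resolvable web address, the sketch below splits a compact identifier and builds a URL for it. The regular expression is a simplified assumption rather than the official grammar, the function names are hypothetical, and whether a given prefix (here `efo` and `ensembl`) is registered should be checked in the Identifiers.org registry.

```python
import re

# Base address of the resolver; the URL layout used below
# (resolver + "prefix:accession") is an assumption to verify
# against the Identifiers.org documentation.
RESOLVER = "https://identifiers.org/"

# Simplified pattern: a prefix without colons or spaces, a colon,
# then an accession without whitespace.
COMPACT_ID = re.compile(r"(?P<prefix>[A-Za-z0-9._]+):(?P<accession>\S+)")


def split_compact_id(compact_id: str):
    """Split 'prefix:accession' into a lower-cased prefix and the accession."""
    match = COMPACT_ID.fullmatch(compact_id)
    if match is None:
        raise ValueError(f"not a compact identifier: {compact_id!r}")
    return match["prefix"].lower(), match["accession"]


def resolver_url(compact_id: str) -> str:
    """Build a resolver URL for a compact identifier."""
    prefix, accession = split_compact_id(compact_id)
    return f"{RESOLVER}{prefix}:{accession}"


# Identifiers taken from the examples earlier on this page.
print(resolver_url("EFO:0004540"))
print(resolver_url("ensembl:ENSG00000012048"))
```

Storing the compact identifier next to your internal identifier, rather than a full URL, keeps the record readable while still allowing it to be resolved later.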
Relevant tools and resources
|Tool or resource|Description|Related pages|Registry|
|---|---|---|---|
|Castor|Castor is an EDC system for researchers and institutions. With Castor, you can create and customize your own database in no time. Without any prior technical knowledge, you can build a study in just a few clicks using our intuitive Form Builder. Simply define your data points and start collecting high quality data; all you need is a web browser.| |Tool info Standards/Databases|
|Ensembl|Genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation.| |Tool info Standards/Databases Training|
|FigShare|Data publishing platform|Biomolecular simulatio... Data publication Documentation and meta...|Standards/Databases Training|
|Harvard Medical School - Electronic Lab Notebooks|ELN Comparison Grid by Harvard Medical School|Documentation and meta...| |
|Identifiers.org|The Identifiers.org Resolution Service provides consistent access to life science data using Compact Identifiers. Compact Identifiers consist of an assigned unique prefix and a local provider designated accession number (prefix:accession).| |Tool info Standards/Databases Training|
|MOLGENIS|MOLGENIS is a modular web application for scientific data. MOLGENIS provides researchers with user friendly and scalable software infrastructures to capture, exchange, and exploit the large amounts of data that is being produced by scientific organisations all around the world.|MOLGENIS|Tool info|
|National Center for Biotechnology Information (NCBI)|Online database hosting a vast amount of biotechnological information including nucleic acids, proteins, genomes and publications. Also boasts integrated tools for analysis.| | |
|OntoMaton|OntoMaton facilitates ontology search and tagging functionalities within Google Spreadsheets.|Plant sciences| |
|REDCap|REDCap is a secure web application for building and managing online surveys and databases. While REDCap can be used to collect virtually any type of data in any environment, it is specifically geared to support online and offline data capture for research studies and operations.|TransMed Data quality|Tool info Training|
|Rightfield|RightField is an open-source tool for adding ontology term selection to Excel spreadsheets|Microbial biotechnology Plant sciences|Tool info|
|University of Cambridge - Electronic Research Notebook Products|List of Electronic Research Notebook Products by University of Cambridge|Documentation and meta...| |
|Zenodo|Generalist research data repository built and developed by OpenAIRE and CERN|Plant Phenomics Bioimaging data Biomolecular simulatio... Plant sciences Data publication|Standards/Databases Training|
|Czech National Repository|National Repository (NR) is a service provided to the scientific and research communities in the Czech Republic to store their generated research data together with a persistent DOI identifier. The NR service is currently in a pilot program.|Researcher Data Steward Research Software Engi... Data storage Existing data Data management plan| |