Storing research data can have many purposes, from facilitating study replication to allowing further hypotheses to be tested. Nevertheless, archiving only the data points, with no information regarding their purpose, provenance or collection method, will exponentially decrease their value over time, as both other researchers and the authors of the data will find impossible to reuse them without further information about the syntax and semantics.
Metadata, data about data, is the information created, stored and shared in order to describe objects (either physical or digital), facilitating the interaction with said objects for obtaining knowledge (Riley 2017). Metadata can describe various aspects of the underlying dataset, and it is often useful to split the various attributes based on their purpose.
Descriptive metadata, such as the title, creators of the dataset or description, is useful for allowing others to find and achieve a basic understanding of the dataset. Often linked to this is the licencing and rights metadata that describes the legal ways in which the data can be shared and reused; it includes the copyright statement, rights holder and reuse terms.
Technical metadata, which most often includes information on the data files, such as their formats and size, is useful for transferring and processing data across systems and its general management. Preservation metadata will often enhance the technical attributes by including information useful for ensuring that data remains accessible and usable, such as the checksum or replica replacement events (see Sect. 2.2). Finally, structural metadata describes the way in which data files are organised and their formats.
Given its complexity, producing metadata can become a significant undertaking, its complexity exceeding even that of the underlying data in certain cases. This is one of the reasons for which the standardisation of metadata has become mandatory, this happening at three main levels.
At the structural level, standardisation ensures that, on one hand, a minimum set of attributes is always attached to datasets and, on the other, that enough attributes are present for ensuring proper description of any possible artefact, no matter its origin, subject area or geographical location. Multiple such standards have been developed, from more succinct ones, such as the DataCite SchemaFootnote 4 or the Dublin Core Metadata Element Set,Footnote 5 to more extensive, such as MARC21Footnote 6 or the Common European Research Information Format (CERIF).Footnote 7 The usage of these standards might vary across subject areas (e.g. the Document, Discover and Interoperate (DDI) standard is targeted at survey data in social, behavioural and health sciences (DDI Alliance 2018)) or the main purpose of the metadata (e.g. the METS standard emphasises technical, preservation and structural metadata more than the DataCite schema).
At the semantic level, the focus is on ensuring that the language used for describing data observes a controlled variability both inside a research area and across domains. For example, the CRediT vocabulary (CASRAI 2012) defines various roles involved in research activities, Friend of a Friend (FOAF) establishes the terminology for describing and linking persons, institutions and other entities (Brickley and Miller 2014), and the Multipurpose Internet Mail Extensions (MIME) standard defines the various file formats (Freed Innosoft and Borenstein 1996).
3.1 Unique and Persistent Identifiers
One important aspect of metadata, considered by most standards, vocabularies and formats relates to the usage of identifiers. Similar to social security numbers for humans or serial numbers for devices, when it comes to research data, the aim of identifiers is to uniquely and persistently describe it. This has become a stringent necessity in the age of Internet, both due to the requirement to maintain resources accessible for periods of times of the order of years or even decades, no matter the status or location of the system preserving them at any discrete moment,Footnote 10 and also due to the necessity of linking various resources across systems. Thus, various elements of research data started receiving identifiers, various initiatives and standards becoming available.
Even before the prevalence of research data sharing, bibliographic records received identifiers, such as International Standard Book Numbers (ISBN) for books and International Standard Serial Numbers (ISSN) for periodicals. For research data, Archival Resource Keys (ARK) and HandlesFootnote 11 are more prevalent, as these mechanisms facilitate issuing new identifiers and, thus, are more suited for the larger volume of produced records.
The Digital Object Identifier (DOI) systemFootnote 12 is quickly emerging as the industry standard; it builds upon the Handle infrastructure, but adds an additional dimension over it, namely, persistence (DOI Foundation 2017). At a practical level, this is implemented using a number of processes that ensure that an identified object will remain available online (possibly only at the metadata level) even when the original holding server becomes unavailable and the resource needs to be transferred elsewhere. A DOI is linked to the metadata of the object and is usually assigned when the object becomes public. The metadata of the object can be updated at any time and, for example, the online address where the object resides, could be updated when the object’s location changes; so-called resolver applications are in charge of redirecting accesses of the DOI to the actual address of the underlying object.
A second important dimension of research outputs relates to persons and institutions. ORCiD is currently the most widespread identifier for researchers, with over 5 million registrations (ORCID 2018), while the International Standard Name Identifier (ISNI)Footnote 13 and the Global Research Identifier Database (GRID)Footnote 14 provide identifiers for research institutions, groups and funding bodies.
Identifiers have been developed for other entities of significant importance in terms of sharing and interoperability. For example, the Protein Data Bank provides identifiers for the proteins, nucleic acids and other complex assemblies (RCSB PDB 2018), while GenBank indexes genetic sequences using so-called accession numbers (National Center for Biotechnology Information 2017).
Research Resource Identifiers (RRID) (FORCE11 2018) aim to cover a wider area, providing identifiers for any type of asset used in the scientific pursuit; the current registry includes entities ranging from organisms, cells and antibodies to software applications, databases and even institutions. Research Resource Identifiers have been adopted by a considerable number of publishing institutions and are quickly converging towards a community standard.
The main takeaway here is that, it is in general better to use a standard unique and possibly persistent identifier for describing and citing a research-related entity, as this will ensure both its common understanding and accessibility over time.
3.2 The FAIR Principles
As outlined, producing quality metadata for research data can prove to be an overwhelming effort, due to the wide array of choices in terms of standards and formats, the broad target audience or the high number of requirements. To overcome this, the community has devised the FAIR principles (Wilkinson et al. 2016), a concise set of recommendations for scientific data management and stewardship which focuses on the aims of metadata.
The FAIR principles are one of the first attempts to systematically address the issues around data management and stewardship; they were formulated by a large consortium of research individuals and organisations and are intended for both data producers and data publishers, targeting the promotion of maximum use and reuse of data. The acronym FAIR stands for the four properties research data should present.
Findability relates to the possibility of coming across the resource using one of the many Internet facilities. This requires that the attached metadata is rich enough (e.g. description and keywords are crucial for this), that a persistent identifier is associated and included in the metadata and that all this information is made publicly available on the Internet.
Accessibility mostly considers the methods through which data can be retrieved. As such, a standard and open protocol, like the ones used over the Internet, should be employed. Moreover, metadata should always remain available, even when the object ceases to be, in order to provide the continuity of the record.
Interoperability considers the ways in which data can be used, processed and analysed across systems, both by human operators and machines. For this, metadata should both “use a formal, accessible, shared, and broadly applicable language for knowledge representation” (FORCE11 2015) and standardised vocabularies.Footnote 15
Moreover, the interoperability guideline requires that metadata contains qualified references to other metadata. This links both the persistent and unique identifiers described earlier, but also to the relations between them, the foundation of Linked Data, a concept introduced by the inventor of the World Wide Web, Tim Berners-Lee. This concept relies heavily on the Resource Description Framework (RDF) specification which allows describing a graph linking pieces of information (W3C RDF Working Group 2014). The linked data concept is of utmost importance to the scholarly workflow, as it can provide a wider image over scientific research, as proven by projects such as SciGraph, which defines over one billion relationships between entities such as journals, institutions, funders or clinical trials (Springer Nature 2017), or SCHOLIXFootnote 16 which links research data to the outputs that reference it.
Finally, the FAIR principles mandate that research data should be reusable, thus allowing for study replicability and reproducibility. For this, it requires that metadata contains accurate and relevant attributes (e.g. it describes the columns of tabular data) and information about its provenance. Moreover, it touches on certain legal aspects, such as the need for clear and accessible licencing and adherence to “domain-relevant community standards”, such as, for example, the requirements on patient data protection.