Identifying and relating biological concepts in the Catalogue of Life
- 7.5k Downloads
In this paper we describe our experience of adding globally unique identifiers to the Species 2000 and ITIS Catalogue of Life, an on-line index of organisms which is intended, ultimately, to cover all the world's known species. The scientific species names held in the Catalogue are names that already play an extensive role as terms in the organisation of information about living organisms in bioinformatics and other domains, but the effectiveness of their use is hindered by variation in individuals' opinions and understanding of these terms; indeed, in some cases more than one name will have been used to refer to the same organism. This means that it is desirable to be able to give unique labels to each of these differing concepts within the catalogue and to be able to determine which concepts are being used in other systems, in order that they can be associated with the concepts in the catalogue. Not only is this needed, but it is also necessary to know the relationships between alternative concepts that scientists might have employed, as these determine what can be inferred when data associated with related concepts is being processed. A further complication is that the catalogue itself is evolving as scientific opinion changes due to an increasing understanding of life.
We describe how we are using Life Science Identifiers (LSIDs) as globally unique identifiers in the Catalogue of Life, explaining how the mapping to species concepts is performed, how concepts are associated with specific editions of the catalogue, and how the Taxon Concept Schema has been adopted in order to express information about concepts and their relationships. We explore the implications of using globally unique identifiers in order to refer to abstract concepts such as species, which incorporate at least a measure of subjectivity in their definition, in contrast with the more traditional use of such identifiers to refer to more tangible entities, events, documents, observations, etc.
A major reason for adopting identifiers such as LSIDs is to facilitate data integration. We have demonstrated the incorporation of LSIDs into the Catalogue of Life, in a manner consistent with the biodiversity informatics community's conventions for LSID use. The Catalogue of Life is therefore available as a taxonomy of organisms for use within various disciplines, including biomedical research, by software written with an awareness of these conventions.
KeywordsUnique Identifier Species Concept Common Data Model Global Biodiversity Information Facility Taxonomic Concept
As in many areas of scientific research, there is an ever-increasing need to be able to access species-related information reliably, and to be sure that various researchers are either referring to the same entity or that they know they are not. This is particularly important to the biodiversity informatics community where they are frequently using terms and scientific names which have to be understood within the context where they appear. It is not necessarily always appreciated that this issue extends beyond biodiversity informatics to other areas in which species names are used, such as bioinformatics, biomedical informatics and ecoinformatics. The use of Globally Unique Identifiers (GUIDs) can help address this problem electronically. In this paper we explain how GUIDs - and, in particular, Life Science Identifiers (LSIDs)  - are being used in biodiversity informatics systems. One of the most challenging problems is to manage species names effectively, due to the variability of the concepts to which they are applied, and the majority of this paper concerns the approach we have taken to solving some of these problems in recent editions of the Species 2000 Catalogue of Life system, and strategies for addressing the remaining issues. A key requirement for the Catalogue to be used to its full potential is interoperability across application domains. For example, use of species names in biomedical literature, with the associated problems of synonymy, is an important issue . Indexing of biological material and data organised by species is important [3, 4], not least as a means of providing users and electronic systems with alternative search terms for species, and the Catalogue of Life is a key resource in achieving this in an effective manner. We shall see that there are some inconvenient external constraints that have been imposed on our current approach, but they do not preclude the Catalogue's use for such purposes. More generally, the basic problem addressed in this paper is one that is inevitably encountered whenever there are differences of expert opinion about the categories that should be used for classifying entities, especially when opinions develop and change over time.
The Species 2000 and ITIS Catalogue of Life project
The Catalogue of Life (CoL)  is seeking to build a catalogue of all known species. It uses a distributed architecture , which is important in order to provide suppliers of component databases with the autonomy and control they require. Users of scientific names are faced with the problem that disagreement amongst the taxonomists who publish and organise these names will lead to different scientific names being used to refer to the same organism, and to variation in the range of organisms that a given name might refer to. In order to provide a complete "synonymic index" of all the world's species, the Species 2000 programme was set up. It is creating a catalogue of known species, with their accepted names, ambiguous and unambiguous synonyms, misapplied names, vernacular names, and some other basic data, by dynamically linking available checklist databases for different higher taxa (nodes higher than species in the taxanomic hierarchy), with the ultimate aim of complete coverage of the taxonomic hierarchy and hence all known species. In partnership with the North American ITIS organisation, it has been delivering the Catalogue of Life (CoL) in two main forms: the Dynamic Checklist, updated on the Web as the component federated databases are updated, and the Annual Checklist, a snapshot of the CoL released on CD and on the Web every year. The distinction between the Annual Checklist and the Dynamic Checklist is becoming less explicit, as the main updates at present are quarterly updates to the Annual Checklist on the Web. The uniqueness of the CoL lies in its target of creating a complete catalogue, organised by concepts (the species), comprising a single classification assembled from peer-reviewed checklist databases. It complements other initiatives such as uBio , which gathers information from a variety of resources such as published literature and web documents into its Name Bank, and provides means to organise these names.
Role of the Catalogue of Life in a semantically-linked Web
In essence the Catalogue of Life is an electronic taxonomic checklist. Researchers can use the Dynamic or Annual Checklist in order to find the status of a scientific name of their choice and - if it is not an accepted name in the catalogue - to find the corresponding accepted name. However, this checklist is also accessible via Web Services and the complete checklist can be downloaded in various forms if desired . Other major providers of species information such as the Global Biodiversity Information Facility (GBIF)  and the Encyclopedia of Life  use the Catalogue of Life in order to enhance users' searches. For example, if a user searches the Encyclopedia of Life for Drosera aldrovanda, one of the results (s)he will obtain is for Aldrovanda vesiculosa, because the Catalogue of Life holds information that Drosera aldrovanda is a synonym of this latter scientific name:
Accepted scientific name:
Aldrovanda vesiculosa L.
Aldrovanda verticillata Roxb.
Drosera aldrovanda F. Muell.
The combination of accepted name and synonyms can be regarded as the Catalogue of Life's definition of a single species. If in future editions of the catalogue the list of synonyms changes, this may be due to a change of opinion about the circumscription of the species and hence a change of opinion about which synonyms are applicable.
The circumscription (range of variation) associated with each of these names may be different.
The differences between the 2008 and 2009 Droseraceae classifications relate to the species level and below, but there is also some evidence of historical difference of opinion regarding higher levels of the classification, in particular the genus Drosera. As noted earlier, the species Aldrovanda vesiculosa has a synonym Drosera aldrovanda. This indicates that at least part of it has been considered to belong within the genus Drosera by some authors at some point.
In the above example we have illustrated how scientific names can be unstable as a means of referring to specific taxonomic concepts. It is desirable, therefore, to have a more reliable, unambiguous way of referring to a specific species concept so that it can be known whether records relating to a given species (such as maps of occurrence records, descriptions, chemical and medicinal properties, etc.) do in fact relate to precisely the same concept of the species. It is further desirable for information about the relationships between different, but related concepts to be known, in order that appropriate inference can be performed. For example, if species concept A is contained within species concept B then any occurrence records relating to A will apply also to B. We shall return to this point later.
The SPICE-TIP project
At the time when the SPICE-TIP (Species 2000 Interoperability Co-ordination Environment - TDWG Infrastructure Project) project was funded (2007) the biodiversity informatics community - and particularly the Biodiversity Informatics Standards (also known as Taxonomic Databases Working Group (TDWG)) Community  - was starting to adopt Life Science Identifiers (LSIDs)  in order to label objects persistently and uniquely. However in most cases these objects were easily identifiable and immutable, such as species names, or observation events. This is similar to the originally-envisaged uses for LSIDs to (for example) label a molecular sequence [12, 13]. The authors of the present paper were asked to investigate how LSIDs could be incorporated into the Catalogue of Life, which raised various issues that had not been addressed in these other projects, relating to labelling of concepts and tracking change. Another then-current development was the Taxon Concept Schema (TCS)  (also referred to as the Taxonomic Concept Transfer Schema), and the SPICE-TIP project was taken as an opportunity to explore the effectiveness of the TCS for representing the concepts inherent in the Catalogue of Life. The TCS had been developed on the premise that scientific names are not satisfactory as identifiers for species concepts (as we have ourselves explained above), and that a way of solving this problem is to define the concepts in relation to other concepts and their context (such as publication information) . Although this does not completely solve the problem of determining which concept a scientist may have actually used (see later in the present paper), the TCS does provide a means of expressing relationships between the concepts in different taxonomies and so is suited to our purpose. Similarly, it cannot necessarily be assumed that two sources of species-related data using the same species LSID will hold information about precisely the same species, and not different concepts, but if responsible use of the LSIDs is made, then such an assumption is reasonable.
There are various alternative options for persistent unique identifiers and metadata (as we shall discuss later), but due to the context of the project the authors were constrained to the investigation of the suitability of specific technologies rather than to identify the best technologies for the task at that stage.
As described earlier, the research reported in this paper depends on the use of Life Science Identifiers (LSIDs) and the Taxon Concept Schema (TCS). In this section we describe LSIDs and TCS in more detail; how they were introduced into the Catalogue of Life will be described later, in the Results section. We also discuss the requirements for an effective GUID system for the Catalogue of Life, which we sought to satisfy in our design and implementation, as far as was possible within the constraints we were given.
Life Science Identifiers
The Object Management Group (OMG) defines LSIDs as:
"persistent, location-independent, resource identifiers for uniquely naming biologically significant resources" .
The nature of the LSID resolution process (described fully in reference ) means that specific software must be implemented that uses the LSID resolution protocol, such as these clients. Also, to make LSIDs resolvable, the authority domain part of the LSID shown in Figure 4 requires the existence of an SRV record on the Domain Name Server (DNS). This record must refer to the actual end point (the IP address of the server and the TCP/IP port number) for the resolution service. The implementation of such a service is independent of how a client accesses it and thus, as long as it makes provisions for the standard data and metadata requests, the programming language used to develop the resolver is left up to the authority. Typical actions of a client include looking up the SRV record, calling the endpoint and parsing the RDF response when it is returned. The namespace and object identifier parts of the LSID are used by the resolver to locate or build the corresponding data or metadata from local resources such as a database. LSIDs are intended to be semantically opaque , so no assumptions can be made about the individual objects referred to based on the identifiers used, other than the class of the object. If this were not the case, this may lead to unwarranted assumptions and predictions being made by external clients about the internal representation of the data or dynamic formation of the authority's other LSIDs. In some cases, nevertheless, individual implementers have created a syntactical relationship between their LSIDs and the underlying data (for example, between LSIDs and a specially-constructed export schema ), but such relationships appear to have been adopted specifically for implementational convenience.
The end part of the LSID is the optional version field which many authorities omit altogether for the main reason that the persistent requirement of LSIDs means that change is signified by the assignment of a new identifier rather than versioning of an existing one. However, this LSID component plays an important role in the CoL implementation as we shall explain later.
The Taxon Concept Schema
TDWG provides an ontology built on top of TCS (Taxon Concept Schema) elements to assist and standardise the RDF metadata returned by LSID resolvers . Among the concepts provided by the TDWG ontology are TaxonName and TaxonConcept. This makes it possible to distinguish between the concepts which individual researchers might hold in their minds regarding a particular group of organisms which together form a taxon on the one hand, and the name(s) which might have been used for this concept on the other. The relationship between the two notions can sometimes be many-to-many. Other relation types provided include IsParentTaxonOf and IsChildTaxonOf, which make it possible to capture the hierarchical relationships within a taxonomic classification tree, and publishedInCitation, which makes it possible to cite supporting literature.
The TCS can be used to represent a fully organised set of taxonomic concepts, such as is the basis of the Catalogue of Life. It has also been used in other contexts, e.g. to support the process of taxonomy where the concepts and names are fluid and under debate while an attempt is being made to study and classify some group of organisms .
The combination of LSIDs for taxon identifiers and TCS for the information retrieved by resolving these identifiers was therefore a suitable candidate for implementing a system that uniquely labels the Catalogue of Life concepts and allows these concepts to be retrieved.
Requirements for an effective CoL GUID system
In this section we shall discuss the requirements for an effective GUID system for the Catalogue of Life. We will later observe that the choice of GUID system (LSID) which we were asked to make meant that some of these requirements could not be met. Nevertheless, where possible we tailored our use of LSIDs with these requirements in mind.
The identifier should ideally identify a specific object which at any given time exists in a single, unique location, supported by a mechanism that ensures any copies of this object are up-to-date.
The way identifiers are used should be interoperable across disciplines. (Currently the use in biodiversity informatics is to some extent standardised, in that LSIDs are typically used and there are recommendations for the format of LSID metadata , but this does not automatically provide interoperability with LSID use in other domains and hence tools have to be implemented that are aware of the way these identifiers are used in the various domains.)
The identifier should preferably not require special tools to be installed on a user's computer in order for him or her to be able to access the objects referred to.
Clients should be implemented that fully support navigation via the identifiers, in a way analogous to the service provided by crossref.org (http://www.crossref.org/).
Where possible, a "human-friendly" option for composing LSIDs for important objects should be available, hence enabling users to have at least an initial entry point into a network of related objects. For example, in the Catalogue of Life, an identifier for each scientific name having some obvious form such as "...catalogueoflife.org:species:Drosera-filiformis" could be supported, and the object that this resolved to would contain metadata supporting navigation to the current species corresponding to this name, etc. Of course, this would be in opposition to "best practices" that have been advocated in the past . An argument for identifiers not being "human-friendly" is that they are for linking data in a way that should be transparent to human users. On the other hand, it is not unreasonable for authors to want to cite globally unique identifiers in traditionally-published papers, e.g. in order to refer unambiguously to a species, and in such cases it would be convenient if the identifier could readily be typed in by the reader of such a document.
- 6.A means should be provided of discovering:
GUIDs which either refer to the same object or to distinct objects that represent the same real-world entity, such as a concept
GUIDs for related objects and concepts (including, for example, taxa which replace the current one in a checklist), and metadata about the nature of these relationships
usage of GUIDs (electronic documents, etc, which refer to GUIDs). This last point implies that either every time a GUID is used, a reference to this usage gets recorded in a registry, or, perhaps more realistically, that an LSID crawler service of some description needs to be implemented.
The above requirements form the background to the approach taken to the implementation of Globally Unique Identifiers in the Catalogue of Life, which we shall now describe.
In order to introduce LSIDs into the Catalogue of Life, some modification of the existing Catalogue of Life software and database schemas was necessary, but in addition to this, an LSID resolver had to be set up, able to provide taxon data in response to LSID metadata requests. We shall describe these two aspects first, and then explain how relationships with concepts from other sources of taxonomic information are expressed. Finally in this section we discuss the issue of managing change - identifying when new LSIDs need to be issued.
Modifying the CoL
SPICE-TIP (SPICE TDWG Infrastructure Project) added LSID support to both the Annual Checklist (AC) and, as an experiment, to the Dynamic Checklist (DC). As a result of this project, annual checklists from 2008 onwards contain LSIDs, and LSIDs from the 2008 AC will generally be used as examples in this paper. The CoL partners decided to use Universally Unique Identifiers (UUIDs) [23, 24] as the object identifiers within CoL LSIDs, because they can be generated by a simple algorithmic process which virtually guarantees that no two UUIDs will ever be the same. This has the advantage that not only will all CoL LSIDs be globally unique (as a combination of the domain name and identifiers unique within the CoL), but the UUID is unique in its own right. It can therefore be used in contexts unrelated to LSIDs, including being used in other systems of resolvable identifiers, such as Linked Data URIs . They also decided that the LSID version field should be used. Using the version field allows the reuse of an object identifier between different AC years (e.g. :ac2008, :ac2009) and between the AC and the DC (e.g. :dc) in cases where they refer to the same taxon. That is, if the underlying concept has not changed, the associated LSID does not change (other than the version). This allows the possibility for software or humans simply to compare the object identification parts of two LSIDs to determine whether they refer to the same taxon, and to determine which CoL edition the LSID came from, without the need to call the resolution service. The implications for taxon matching are discussed later in this paper.
In the AC, a new field was added to an existing table, containing all taxa, to hold the LSID. In the DC, the data retrieved from the contributing databases is held in a cache database. However, to incorporate LSIDs a separate database was created which would hold the data in a more suitable format than this cache. In both cases, the MySQL UUID() function was used to generate the object identifier LSID part. (This does not fulfil our suggested requirement for a "human-friendly" LSID composition option, but ensures the more fundamental requirement of GUID uniqueness.) In the AC, LSIDs were assigned by running the appropriate MySQL query to add them to the species (and also the higher taxa, such as genus, family, etc.) prior to being released. In the DC, code was introduced into the SPICE CAS (Common Access System) to assign LSIDs to new taxa as they enter the system. Additions were made to both the AC and DC interfaces to display LSIDs and the SPICE Web Service was modified in order to communicate them to clients including the DC interface. The modified systems also allow data providers the option to supply their own GUIDs for their species and other taxa, and also GUIDs for the names themselves (accepted names and synonyms). In contrast, within the Catalogue of Life, we have assigned LSIDs to the concepts it contains, not to the various names (accepted names and synonyms) which might be associated with each concept. The provider GUIDs are not made visible in the user interfaces, and are accessed via the CoL LSID resolver.
Supporting infrastructure: resolver and metadata
The resolver for AC and DC LSIDs is written in Java using the LSID Java Toolkit available from the LSID resolution project site . The resolver has access to both the AC database and the SPICE cache database in order to form the response for a metadata request. (All the CoL information associated with an LSID is held in the metadata of the object referred to by the LSID, not its data.) It also has access to what we will refer to as the "LSID Repository", which is a database that contains both a table to hold DC LSIDs and a table to hold relationships between LSIDs. The latter is consulted during every resolution process to enrich the current LSID with relationships to others (and also with relationships to other taxa in the same checklist). These may be relationships between the AC and the DC or between either the AC or DC and a provider-assigned LSID. Modification of SPICE means that this table can be populated with external LSIDs (if they are given by the provider) during the caching process.
As described earlier, the TDWG Taxon Concept Schema and TWDG ontology were used in order to represent the data that is returned by the resolver. It should be noted that the Catalogue of Life is not stored internally as a complete TCS document. A TCS document is generated in response to individual LSID resolution requests, and provides information about the taxon concerned, and about which other taxa it is related to. The most fundamental concepts used are TaxonName and TaxonConcept. As the Catalogue is a provider of a complete taxonomy using the names contained in its databases, it was decided that the core element (the one which has the about = X attribute where X is the current LSID) returned by the CoL's resolver for each LSID should be of type TaxonConcept, i.e. it is one of the agreed concepts which together comprise a consistent Catalogue of Life; each concept has a single accepted name. This could also be described as the "idea" in a person's mind when using a certain name to describe a taxon. The ontology also provides a means to relate the core TaxonConcept to others to better give it context by using the hasRelationship object property and the Relationship class. The HasVernacular and HasSynonym relationship types (or categories) are used to relate the current concept to its common name and synonym counterparts.
Relationships between CoL LSIDs and LSIDs from other providers
Most of the discussion in this paper has focussed on the assignment of LSIDs in the 2008 edition of the CoL Annual Checklist. Since the Catalogue does not only grow (by new species being introduced) but change (reflecting changing expert opinion), not all these LSIDs will be applicable to concepts in subsequent editions of the Catalogue.
Prior to the introduction of LSIDs, the CoL was criticized for using identifiers which changed from year to year . The internal identifiers have never been intended to be used in other systems linking to the CoL, of course, but this criticism draws attention to the demand for persistent identifiers that are designed for use by other systems. The CoL still does not guarantee to maintain the same internal identifiers, because there appears to be no need to insist on this as a requirement, but it does now provide persistent globally unique, publicly available identifiers. The fact that some LSIDs will become deprecated through time may lead to the misunderstanding that the issue of stability still has not been addressed. It needs to be understood that in relation to concepts the Catalogue is intentionally not stable, so if a client is wishing to link to a name, not a concept, the client should use any LSID available for the name (or just the name itself), not a CoL-supplied taxon LSID. It should also be noted that it is intended that deprecated concepts will be accessible via their LSIDs in perpetuity, and the metadata retrieved will include information about the concepts' relationships to relevant current concepts (such as inclusion, etc.).
Due to the fact that not all suppliers of data to the CoL also provide any kind of GUID, any system to handle change must both:
use information coming from the supplier database, where available, which indicates by a new GUID that change has occurred, and
analyse the checklist to detect changes.
The first of these options is the more straightforward to implement, although it requires an understanding of each supplier's policy in order to ensure that LSIDs are changing if and only if the associated concepts are.
The second option is something that would be infeasible to undertake entirely by hand for approximately 1.5 million species. We have implemented a "Taxon Matcher" program which performs basic comparisons of species in different checklists . This program is currently what is used by the CoL when issuing LSIDs for a new edition of the Catalogue. To determine whether to retain the same UUID as used for some taxon in the previous edition, or to create a new UUID, the program has to discover whether a taxon in the new edition matches one in the previous edition. Because no tracking of changing taxon concepts is currently carried out by the providers of individual checklists ("GSDs") to the CoL, it is necessary for Taxon Matcher to compare each taxon in the new edition with every taxon in the previous edition to find matching taxa.
In future work it is intended to refine the procedure substantially, to allow editors to identify taxa with minor changes as being in fact identical, and implementing heuristics such as:
If a new synonym has been added to a taxon, and the synonym did not previously occur elsewhere in the checklist, the concept is probably unchanged.
In this paper we have explained how we enhanced the Catalogue of Life with Life Science Identifiers, and extended the Catalogue software to be able to respond to requests for data about individual taxa with documents in Taxon Concept Schema form, expressed as RDF. We now discuss four important issues raised by our work: the extent to which the use of LSIDs has made interoperability of the Catalogue of Life with other systems possible (especially systems for other application domains); the extent to which the precision afforded by LSIDs (or other GUIDs) might imply accuracy; the extent to which CoL LSIDs have been adopted - and the ways in which alternative GUID schemes could be supported; and the applicability of our approach to other domains.
There has been a fragmentary, disjointed approach to globally unique identifiers across disciplines. Although some GUID schemes have been deployed for specific applications (e.g. DOIs  for publications, LSIDs for data objects in Life Science domains), and hence have had differing requirements to meet, the lack of a generally-accepted resolvable GUID specification means that the possibility of creating general-purpose software that can use GUIDs to access objects from multiple application domains is limited. One very widely used type of GUID is the UUID . However, UUIDs are not in themselves resolvable, although (as we have seen in this paper) there is good reason to make them part of a resolvable identifier such as an LSID. This does in turn at least mean that we have a basis for producing more than one kind of resolvable identifier for the same object if identifier schemas that can embed UUIDs are used. This then means that such objects can be used in a wider range of application domains.
It is not the purpose of this paper to compare the various kinds of GUID that are available (Laibe & Le Novère  and Altman & King  include useful overviews in their papers), but even given the requirement that LSIDs be used, there are still decisions to be made which potentially limit interoperability:
The TDWG community decided to use the metadata component for holding all information relating to the objects referred to by LSIDs; the data field is not used. This is in contrast with typical use of LSIDs, and means that (for example) bioinformatics software will not necessarily be able to interpret them.
Use of TCS to describe the taxon concepts presents at least two problems. The first of these is that development of the associated TDWG ontology has been intermittent, and it is not clear whether extending some other ontology (such as the SEEK Ecoinformatics Ontology ) might be a more sustainable approach. The other problem is that the TCS provides its own specialised terminology for expressing taxonomic concepts and their relationship to each other. While some terminology is inevitably specialised, some (such as IsCongruentTo and Overlaps) are much more generally applicable and it would be desirable to standardise across domains so that reasoning with such relationships can be performed in a more general context.
In future work we plan to explore alternative persistent identifier schemes and to investigate how the Catalogue can be made available as Linked Data  in order to make the knowledge contained in the Catalogue more readily associated with other Web-accessible information.
Precision and accuracy
Unfortunately, although the Catalogue of Life is being assembled with attention being given to the appropriate selection of species concepts for inclusion, one cannot necessarily be sure that, even if a scientist has used the Catalogue to look up a name, (s)he is using the name as the creators of the relevant part of the Catalogue had intended.
This means that although the use of LSIDs ensures that one can refer to a specific concept, present at a specific time in the Catalogue, with precision, this does not solve the problem of ensuring that scientists recording information associated with a given species concept have correctly identified the concept to use. We are currently investigating ways of detecting this problem in practice, but it seems inevitable that no complete solution will be found, because of the subjective nature of species definition. Clearly a development that would be beneficial in future is to make it easier for users to determine whether their species concepts coincide with those in the Catalogue. One current relevant development is the Encyclopedia of Life, referred to earlier, which includes descriptive species pages; another possibility would be the creation of "identification keys" linked explicitly to the Catalogue of Life classification. This would be an enormous undertaking, however, when one considers that merely enumerating the known species is a task which the Catalogue of Life has been performing for over a decade, and which it will never be able definitively to complete due to the regular discovery of new species.
Adoption of CoL LSIDs; other GUID schemes
The adoption of CoL LSIDs has not yet been as widespread as originally anticipated, although examples of their use can be found in:
There are perhaps a number of reasons for this somewhat limited adoption, which are not specific to the CoL. The end user unfamiliar with LSIDs may be disappointed to find that they cannot be resolved by simply pasting them into a web browser, although we noted earlier that browser plug-ins and web proxies exist to allow Web browsers to be used. Another reason for the relatively low rate of adoption of LSIDs may be the high profile accorded to the Semantic Web and the use of Linked Data URIs as alternative means to provide metadata about objects and concepts, especially as these are based on existing Web technology which provides less of a barrier to users unfamiliar with them.
However, because the CoL has based its identifiers on the use of UUIDs which do not necessarily require the existence of the LSID system in order to be resolved, the CoL has the option to deploy its taxon concept identifiers in other ways. For example, the CoL UUIDs can be incorporated in HTTP URIs. If this is done, the edition of the CoL from which the identifier was obtained could be omitted, and the resolution mechanism should still be able to provide metadata about which editions this taxon concept was included in and what changes, if any, occurred between them. Such changes would not relate to changes of concept, but (for example) to additional data being included. If it is important to be able to specify the edition from which the concept was obtained, then the edition code from the revision element of the LSID could simply be concatenated with the UUID part. There is no reason why the CoL cannot deploy its identifiers in multiple ways simultaneously, and so the approach we have taken allows us to accomodate changing external requirements on the provision of globally unique identifiers.
Applicability to other domains
We have focussed in this paper upon the enhancement of a Catalogue of Life by adding LSIDs to the concepts defined in the Catalogue and incorporating information about relationships between these concepts in the LSID metadata. Yet concepts and variation of professional opinion are not phenomena unique to the domain of taxonomy, and not dissimilar benefits might be expected where globally unique identifiers are used in a comparable way in other domains, for example in the Unified Medical Language System (UMLS) , which defines the notion of a Concept Unique Identifier (CUI), and in which concepts from various biomedical taxonomies need to be related to each other.
As we have seen, some of the metadata used for taxonomic concepts is domain-specific and may not be applicable to other domains; we anticipate that the reverse would also be true. In other words, although we have been advocating the desirability of a cross-disciplinary approach to the problem of adopting GUIDs, domain-specific extensions to capture some of the nuances associated with concepts, in particular, are needed. Appropriate ontologies underpinning these extensions are also needed, if software that is not application domain-specific is to be able to make any sense of these extra details.
In this paper we have demonstrated the feasibility of enhancing the Catalogue of Life with globally unique identifiers and suitable explicit metadata relating the Catalogue's concepts to each other. We have also explained how the Catalogue of Life may be regarded as a specialised ontology, providing knowledge that is needed to support semantic interoperability in biomedical and other disciplines when dealing with species-related data. Using identifiers such as LSIDs to refer to species concepts instead of using species names (which might be subject to various interpretations) can help in data integration. This is because it can be assumed (with the caveats expressed earlier in this paper) that two sources of species-related data using the same species LSID will hold information about precisely the same species, not different concepts. The provision of unique identifiers is a key aspect of the recently-commenced i4Life project : i4Life aims to cross-map between taxonomies supplied by various organisations, including the NCBI taxonomy, IUCN Red List, etc., using the Catalogue of Life as a source both of names and of concepts (assemblages of names into species and other taxa) to support the cross-mapping process.
In future, there are a number of areas that would benefit from further development, including some areas in which the requirements for an effective CoL GUID system listed earlier in this paper are not yet satisfied. We have mentioned these in the previous sections but, to summarise, we are planning:
enhanced tracking of concept changes;
tracking LSID usage (see requirement (6));
suitable default behaviour in cases where a CoL LSID without a version field is submitted to the resolver, perhaps from systems that cannot accommodate the version field. At present, the resolver fails; our intention is to change this behaviour so that the resolver will return metadata associated with the current version of the LSID.
exploration of alternative GUID schemes and options for expressing concept metadata;
investigation of appropriate ways to provide a "human-friendly" option for composing LSIDs for important objects (see requirement (5));
making the Catalogue of Life available as Linked Data, and
experimenting to determine ways of automatically identifying where users are erroneously referring to the same concept.
Availability of supporting data
The Catalogue of Life is accessible via its Web interface at: http://www.catalogueoflife.org/
LSIDs used in the Catalogue of Life are of the form: urn:lsid:catalogueoflife.org:... and these LSIDs may be used to retrieve a TCS representation of the relevant part of the Catalogue.
The software described in this paper - enhanced Catalogue of Life (SPICE) software, software for resolving Catalogue of Life LSIDs as TCS documents, and the Taxon Matcher program - is available at: http://biodiversity.cs.cf.ac.uk/
The initial stages of the work reported in this paper were supported by a GBIF/TDWG TWDG Infrastructure Project grant. Indexing for Life (i4Life) is a European e-Infrastructure project, co-funded by the European Commission's Seventh Framework Programme for Research and Technological Development, in which ACJ and RJW are participants. The current user interface to the Catalogue of Life system was developed by ETI Bioinformatics (Leiden), and the Catalogue content is maintained by Species 2000. An earlier version of some parts of this paper was presented at the BNCOD 2008 workshop "Biodiversity Informatics: challenges in modelling and managing biodiversity knowledge", and can be downloaded from the Workshop Web Site .
A project of the kind reported in this paper inevitably involves discussion with a wide range of colleagues, but we would particularly like to express our appreciation to Prof. Frank Bisby, Dr. Alan Paton, Dr. Guy Cochrane and members of the GBIF LSID-GUID task group for discussions that have helped in forming some of the ideas expressed here, and to Prof. Alun Preece for agreeing to read and comment on a draft of this paper.
We would also like to thank the anonymous reviewers of an earlier version of our paper for their constructive comments and the relevant information that they have provided. The idea of improving the resolver's response in cases where no version is provided was suggested by one of the reviewers.
- 5.Species 2000-ITIS Catalogue of Life. [http://www.catalogueoflife.org/]
- 6.Jones AC, Xu X, Pittas N, Gray WA, Fiddian NJ, White RJ, Robinson JS, Bisby FA, Brandt SM: SPICE: a flexible architecture for integrating autonomous databases to comprise a distributed catalogue of life. 11th International Conference on Database and Expert Systems Applications (LNCS 1873). Edited by: Ibrahim M, Küng J, Revell N. 2000, Springer Verlag, 981-992.CrossRefGoogle Scholar
- 7.The uBio project. [http://www.ubio.org/]
- 8.Catalogue of Life Services and Download Documentation. [http://www.catalogueoflife.org/services/]
- 9.Global Biodiversity Information Facility (GBIF). [http://www.gbif.org/]
- 10.Encyclopedia of Life. [http://www.eol.org/]
- 11.Biodiversity Informatics Standards (TDWG). [http://www.tdwg.org/]
- 14.TDWG Taxon Concept Schema (TCS). [http://www.tdwg.org/standards/117/]
- 15.Kennedy JB, Kukla R, Paterson T: Scientific Names Are Ambiguous as Identifiers for Biological Taxa: Their Context and Definition Are Required for Accurate Data Integration. Data Integration in the Life Sciences, Volume 3615 of Lecture Notes in Computer Science. Edited by: Ludäscher B, Raschid L. 2005, Springer Berlin/Heidelberg, 80-95.Google Scholar
- 16.LSID Tester, a tool for testing Life Science Identifier resolution services. Source Code for Biology and Medicine. 2008, 3: 2-10.1186/1751-0473-3-2.Google Scholar
- 17.TDWG LSID Web Resolver. [http://lsid.tdwg.org/]
- 18.Smith D, Szekely B: LSID Best practices. 2005, [http://www.ibm.com/developerworks/opensource/library/os-lsidbp/]Google Scholar
- 20.TDWG LSID Vocabularies. [http://wiki.tdwg.org/twiki/bin/view/TAG/LsidVocs#The_TDWG_LSID_Vocabularies]
- 22.TDWG Life Sciences Identifiers (LSID) Applicability Statement. [http://www.tdwg.org/fileadmin/subgroups/guid/LSID_Applicability_Statement_draft.pdf]
- 26.LSID Java Toolkit. [http://sourceforge.net/projects/lsid/]
- 27.Index Fungorum. [http://www.indexfungorum.org]
- 28.Species 2000 Data Standards. [http://biodiversity.cs.cf.ac.uk/sp2000/protocol/]
- 30.Catalogue of Life Taxon Matcher. [http://biodiversity.cs.cf.ac.uk/repository/TaxMatch/]
- 31.Paskin N: Digital Object Identifier (DOI) System. Encyclopedia of Library and Information Sciences. 2010, Taylor & Francis, 1586-1592. 3Google Scholar
- 33.Altman M, King G: A Proposed Standard for the Scholarly Citation of Quantitative Data. D-Lib Magazine. 2007, 13(3/4), 10.1045/march2007-altman.Google Scholar
- 35.Belgian Species List. [http://www.species.be/]
- 36.CultureSheet dot Org. [http://culturesheet.org/]
- 37.Atlas of Living Australia. [http://www.ala.org.au/]
- 40.Bodenreider O: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research. 2004, 267-270. 32 DatabaseGoogle Scholar
- 41.i4Life: Indexing for Life. [http://www.i4life.eu/]
- 42.BNCOD 2008 Workshop: "Biodiversity Informatics: challenges in modelling and managing biodiversity knowledge". [http://biodiversity.cs.cf.ac.uk/bncod/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.