1 Introduction

There has been tremendous interest in moving towards linked data as a means of discovery across the library, museum, and archives space for many years now. In 2011, Stanford held a linked data workshop with representatives from major information providers across North America and Europe [5]. One of the outcomes of that meeting was a “Manifesto for Linked Libraries (And Museums And Archives And …)” [6].

Although the practices enumerated there are stated simply, their implementation is difficult. Of particular interest is structuring data semantically. The Resource Description Framework (RDF), established by the W3C, has been widely adopted as a model for expressing data semantically on the Web. But even if the data is recorded semantically, entities will not “link” unless the same identifier has been used. Libraries have approached this issue in the past by creating authority records for entities they wish to establish so that references to the same entity can be linked by a unique text string. These authority records can be converted to identifiers representing the real-world object they describe and so may be used to link matching entities on the Web. As library metadata is converted to linked data, however, the appropriate authority record or identifier can easily be missed, especially if the text string varies from that used in the authority record. In addition, there are many entities for which no authority record was ever created, making conversion of this metadata to linked data problematic. The process of reconciliation, or the linking of identifiers for matching entities, becomes a critical step in the conversion of library metadata to linked data.

2 Current Services

The need for reconciliation is widespread. Organizations such as the Bibliothèque nationale de France (BnF) have incorporated reconciliation into their digital services platform. As a national library, the BnF produces data of high quality, and much reconciliation can be resolved through the use of authority files. Works, however, are a particularly difficult problem, as authority records are infrequently produced for them. The BnF is currently working on algorithms for the extraction of work identifiers from bibliographic data.

Culturegraph [7], a platform for services around data networking for cultural entities, has taken another approach. Projects such as their resolution and look-up service make available an open, central infrastructure for, among other things, the identification of equivalent records through a common URI.

Europeana [8] has a more complex task. The data they work with has neither the uniform quality of the BnF nor the more limited scope of the Culturegraph resolution and look-up service. The heterogeneous nature of their data, along with varying standards of construction and supporting authority files, makes reconciliation of the data very complex.

Similar to Europeana, Linked Data for Production [1], or LD4P, must work with data from a mix of institutions with varying quality standards. Many headings lack authority records, making their reconciliation dependent upon clues in the bibliographic data. Authify, in part, exploits this information in an attempt to reconcile entities when standard authority records or identifiers are lacking.

3 Authify

The Authify reconciliation service represents the heart of an ecosystem developed by @Cult and Casalini Libri called the SHARE-Virtual Discovery Environment, or SHARE-VDE [9]. It offers several search and detection services with the aim of creating a ‘cluster’ of variant name forms coming from different sources but referring to the same entity. The process produces an Authority Knowledge Base (AKB) composed of entity clusters that are continuously expanded as new sources are encountered and ingested. The idea of Authify arose at the beginning of the SHARE-VDE project as a way of overcoming some limitations of the public Virtual International Authority File (VIAF) Web APIs. VIAF [10], being a public project, does not allow massive numbers of calls against its APIs; for use cases that require bulk access, it instead provides a download of the whole dataset. Authify indexes and stores the VIAF clusters dataset and provides powerful full-text and bibliographic search services built upon it.

VIAF was the first source to be added to Authify. Other sources, not only in RDF but also in other formats, are now also considered by Authify for inclusion in the AKB. Thanks to this broadening of sources, the module is even more effective and can fulfill the requirement, expressed by libraries, that external datasets be usable in the detection and clusterization processes. Examples of these sources are: the Library of Congress Name Authority File (LC NAF) [11], Library of Congress Subject Headings (LCSH) [12], Faceted Application of Subject Terminology (FAST) [13], the Gemeinsame Normdatei (GND) [14], and ISNI [15].

Authify uses these sources by applying different strategies, depending on how the data are made available by each source:

  • if the source is available as a database dump (in formats such as MARC, XML, or RDF), the data are indexed into a SOLR component or an RDF triple store so that they can be queried;

  • if the source is not available as a dump but offers APIs or web services, Authify uses these interfaces to query the source and retrieve the relevant information.

Each source has its own endpoint, and the URL declares which source is being queried (e.g. /viaf/names, /fast/subjects).
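
A minimal sketch of how a client might query these per-source endpoints is shown below. The base URL, the response shape, and the parameter name q are assumptions made for illustration; only the endpoint path pattern (/viaf/names, /fast/subjects) comes from the description above.

    from urllib.parse import urlencode, urljoin
    from urllib.request import urlopen
    import json

    # Hypothetical base URL; real deployment details are not given in this paper.
    BASE_URL = "https://authify.example.org/"

    # Endpoint paths follow the pattern described above: the URL declares the source.
    ENDPOINTS = {
        ("viaf", "names"): "viaf/names",
        ("fast", "subjects"): "fast/subjects",
    }

    def search_source(source: str, entity_type: str, form: str) -> dict:
        """Query one source-specific endpoint with a heading (name, title, or subject)."""
        path = ENDPOINTS[(source, entity_type)]
        url = urljoin(BASE_URL, path) + "?" + urlencode({"q": form})
        with urlopen(url) as resp:  # plain HTTP GET; authentication and paging omitted
            return json.load(resp)

    # Example (assumes a JSON response):
    # matches = search_source("viaf", "names", "Doe, Jane, 1900-1980")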

The ability to search and retrieve data from different sources enhances the project ‘clusters’ that represent entities in the real world (each cluster is thus considered a Real-World Object entity, with all the attributes necessary to identify it). The detection and clusterization processes mentioned above make possible the identification of an entity (as a person, a work, a subject, etc.), the identification of the role the entity has in relationship to a resource, and the creation of a ‘cluster’ that identifies the entity with an ID and gathers together the different attributes useful for identification.

The logic of creating a new cluster begins with a search for the entity within the databases used in the project: a Postgres relational database, used to register clusters upon creation before they are added to the AKB, and the Authify SOLR database, built from the external sources used in the project (VIAF, NAF, etc.). Data extracted from library records are used to query these databases to ascertain whether a cluster already exists, whether a VIAF ID exists, whether the form used by the library exists as a preferred or variant form, and so on. Both the normalized forms used for the queries and the responses received from Authify are registered in the Postgres database so that they can be used for the creation (or feeding) of the AKB. Before being used in queries, however, the library authority and bibliographic records pass through two preliminary processes: normalization, which eliminates the sub-field separators and any non-standard punctuation, and the creation of a sort-form, which transforms the original string to uppercase and removes diacritics, accents, and special characters. In the construction of the normalized string and the sort-form, the tags and sub-fields coming from the authority and bibliographic records are used, depending on the type of tag.
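
The two preliminary steps can be illustrated with a short sketch. The exact punctuation rules, the sub-field separator syntax, and the per-tag configuration are internal to Authify, so the regular expressions below are assumptions rather than the production logic.

    import re
    import unicodedata

    def normalize(heading: str) -> str:
        """Normalization: drop sub-field separators and non-standard punctuation."""
        text = re.sub(r"\$[a-z0-9]", " ", heading)      # assumed $a/$d-style separators
        text = re.sub(r"[^\w\s,.\-']", " ", text)       # keep only a small punctuation set
        return re.sub(r"\s+", " ", text).strip()

    def sort_form(heading: str) -> str:
        """Sort-form: uppercase, with diacritics, accents, and special characters removed."""
        decomposed = unicodedata.normalize("NFKD", normalize(heading))
        no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
        return re.sub(r"[^A-Z0-9 ]", "", no_marks.upper()).strip()

    # Example:
    # normalize("$aBrontë, Charlotte,$d1816-1855")  ->  "Brontë, Charlotte, 1816-1855"
    # sort_form("$aBrontë, Charlotte,$d1816-1855")  ->  "BRONTE CHARLOTTE 18161855"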

If the search returns a positive answer (a cluster for this entity already exists), the existing cluster is expanded to include the new variant form; if the variant form is already present, no action is taken on the cluster. If the search returns a negative answer, a new cluster is created in the Postgres database and in the AKB.
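
Sketched below is this create-or-extend decision, under the assumption that the lookup key is the sort-form; the dataclass, the in-memory dictionary standing in for the Postgres and SOLR lookups, and the UUID-based cluster ID are illustrative choices, not the SHARE-VDE implementation.

    import uuid
    from dataclasses import dataclass, field

    @dataclass
    class Cluster:
        cluster_id: str                      # becomes the SHARE-VDE URI on conversion to RDF
        preferred_form: str
        variant_forms: set = field(default_factory=set)

    # In SHARE-VDE the lookup runs against Postgres and the Authify SOLR index;
    # a dictionary keyed by the sort-form stands in for both in this sketch.
    clusters_by_sort_form = {}

    def reconcile(heading: str, sort: str) -> Cluster:
        """Extend the matching cluster with a variant form, or create a new cluster."""
        cluster = clusters_by_sort_form.get(sort)
        if cluster is None:
            # Negative answer: register a new cluster (in Postgres and in the AKB).
            cluster = Cluster(cluster_id=str(uuid.uuid4()), preferred_form=heading)
            clusters_by_sort_form[sort] = cluster
        elif heading != cluster.preferred_form:
            # Positive answer: add the heading as a variant (a no-op if already present).
            cluster.variant_forms.add(heading)
        return cluster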

All variants retrieved from external sources are grouped by source (e.g. VIAF, ISNI, etc.), and all variants belonging to a given source are associated with the same URI. The final SHARE-VDE cluster is composed of:

  • a cluster ID (the SHARE-VDE URI);

  • all variant forms from local authority files, which inherit the same SHARE-VDE URI and are brought together with the SameAs relationship;

  • all forms from external sources (each source having a preferred form and variant forms), all with the same source URI and brought together with the SameAs relationship;

  • all variant forms from bibliographic records that do not match authoritative forms but that inherit the same SHARE-VDE URI and are brought together with the SameAs relationship;

  • additional information (such as authority notes);

  • operational data such as the cluster creation date, the update date, the cluster type, etc.
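
An illustrative rendering of such a cluster is given below as a Python literal; the field names, URIs, and values are placeholders chosen for readability, not the published SHARE-VDE data model.

    example_cluster = {
        "cluster_id": "https://svde.example.org/agent/000001",   # hypothetical SHARE-VDE URI
        "preferred_form": "Doe, Jane, 1900-1980",
        # Variant forms from local authority files; all inherit the SHARE-VDE URI (SameAs).
        "local_authority_forms": ["Doe, Jane, 1900-1980"],
        # Forms from external sources, grouped by source; each group shares one source URI (SameAs).
        "external_sources": {
            "VIAF": {"uri": "http://viaf.org/viaf/<id>", "preferred": "Doe, Jane",
                     "variants": ["Doe, J."]},
            "ISNI": {"uri": "https://isni.org/isni/<id>", "preferred": "Jane Doe",
                     "variants": []},
        },
        # Headings from bibliographic records that match no authoritative form (SameAs).
        "bibliographic_variants": ["Doe, Jane M."],
        # Additional information and operational data.
        "authority_notes": [],
        "operational": {"created": "2019-01-15", "updated": "2019-03-02", "type": "Person"},
    }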

One of the most relevant functions in Authify is the cluster search service. As the name suggests, this provides a full-text search service that queries names, works, and other entities available from different sources. All search services are made available as HTTP endpoints. The parameter used to start a search is the name form (or title, or subject) used in the project’s original data source (a heading, in the case of bibliographic or authority data). The search Web API uses an “invisible queries” approach in order to find as precise a match as possible for the search parameter among the forms already present in the external sources, or in the AKB.

The invisible queries approach makes everything transparent to the user. Following a single search request, the system executes a chain of different searches with different priorities. The first search that produces a result populates the response returned. Each new response progressively populates the new (or already existing) cluster in the AKB. For debugging purposes, the response also includes the search that produced the results. The goal of each search strategy is to return as precise a result as possible with the lowest recall possible.
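
A hedged sketch of the invisible-queries chain follows; the concrete strategies, their ordering, and the response fields are assumptions, since the paper only states that searches run in priority order and that the winning search is reported for debugging.

    from typing import Callable, List, Dict

    # A strategy maps a heading to a (possibly empty) list of candidate matches,
    # e.g. an exact match on the sort-form, then a phrase match, then a keyword match.
    Strategy = Callable[[str], List[Dict]]

    def run_invisible_queries(heading: str, strategies: List[Strategy]) -> Dict:
        """Run the chained searches in priority order; the first hit populates the response."""
        for strategy in strategies:
            results = strategy(heading)
            if results:
                return {"results": results,
                        "matched_by": strategy.__name__}   # kept for debugging, as noted above
        return {"results": [], "matched_by": None}

    # Example wiring (each strategy would query SOLR, the external endpoints, or the AKB):
    # response = run_invisible_queries("DOE JANE 19001980",
    #                                  [exact_sort_form_match, phrase_match, keyword_match])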

Query responses are used in a series of data analysis steps that are part of a process called ‘Similarity score’. This process assigns a weight to the various results in order to identify the elements (for example, the variant forms of a name) to be assigned to the same cluster. The Similarity score allows the system to decide if and when to feed an already existing cluster or to create a new one. At the end of the search process, the heading is either assigned to an already existing cluster in the AKB or, if none exists, produces a new cluster. Each cluster, for each entity type, is marked with an identifier (the cluster ID) used to produce the URI that will identify the entity in the RDF conversion process. At the end of each process, the AKB holds one name form marked as ‘preferred’ and a number of other forms marked as variants, which are useful for creating ‘sameAs’ relationships. Additional attributes are available to enrich the AKB, such as the original source of each variant/preferred form and the URIs/IDs for each form.
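
The role of the Similarity score can be illustrated with the toy weighting below; the features, weights, and threshold are purely illustrative, as the actual scoring formula is not published here.

    from difflib import SequenceMatcher
    from typing import Optional

    def similarity_score(candidate: dict, sort: str) -> float:
        """Weight one candidate result; features and weights are illustrative only."""
        score = 0.6 * SequenceMatcher(None, sort, candidate["sort_form"]).ratio()
        if candidate.get("dates_match"):        # e.g. agreement on birth/death dates
            score += 0.3
        if candidate.get("viaf_id"):            # candidate already carries an external identifier
            score += 0.1
        return score

    def assign_cluster(candidates: list, sort: str, threshold: float = 0.75) -> Optional[dict]:
        """Return the best-scoring existing cluster, or None to signal that a new one is needed."""
        best = max(candidates, key=lambda c: similarity_score(c, sort), default=None)
        if best is not None and similarity_score(best, sort) >= threshold:
            return best      # feed the existing cluster
        return None          # create a new cluster in the AKB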

One of the most delicate processes in the handling of bibliographic data is ‘Entity recognition’ or entity detection. In some cases, this step is crucial to the identification of an entity and relates to the identification of the role that a person has had in the creation or production of a resource. In the bibliographic world, the identification of a person is usually realized through the relationship with his/her work and, vice-versa, the identification of a work is realized through the association with its creator.

Authify uses the “Relator term detection” service to identify these relationships. Starting from a MARC record, the system analyses all (configured) tags that contain a name and tries to determine the corresponding role within the work represented by the given record using the statements of responsibility and other note fields.

To identify the role that an Agent has in relation to a resource, the ‘Relator term detection’ service uses a ‘Roles Knowledge Base’ that is progressively fed (through text analysis processes, automatically and manually) with all possible expressions useful for identifying a role. As an example, two main roles may be detected: author and an unclassified role (other). The “other” role is a catch-all used when no valuable information can be gathered from the analysis. At the end of each entity detection process, the system produces a report of non-matching role expressions associated with the bibliographic record identifiers. This report enables library cataloguers to check the record and to add the specific role term or code in the appropriate subfield. Behind the simple token-matching analysis there is a more complicated logic that tries (using, among other things, the search services described above) to find the role of each name, either through its variant forms or through a set of tokens that could identify such a role (e.g. edited by, by, illustrated by). At the end of this process, a certain number of records are enriched with roles, and a related report of ‘undefined’ roles is made available to allow for manual checks by professional users. This added element of human curation can help resolve issues that an automated process would find too difficult at present.
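
The token-matching part of the service might look like the sketch below; the Roles Knowledge Base entries, the surname heuristic, and the handling of the unclassified ‘other’ role are simplified assumptions, and the example tokens (edited by, by, illustrated by) come from the description above.

    import re

    # A tiny stand-in for the Roles Knowledge Base, which in Authify is fed
    # automatically and manually with expressions that signal a role.
    ROLE_TOKENS = {
        "author": ["by", "written by"],
        "editor": ["edited by"],
        "illustrator": ["illustrated by"],
        "translator": ["translated by"],
    }

    def detect_role(name: str, statement_of_responsibility: str) -> str:
        """Guess the role of a name from the statement of responsibility; 'other' when unclear."""
        text = statement_of_responsibility.lower()
        surname = name.split(",")[0].strip().lower()   # crude surname match for the sketch
        # Try the most specific expressions first so that "edited by" wins over the bare "by".
        pairs = sorted(((t, role) for role, toks in ROLE_TOKENS.items() for t in toks),
                       key=lambda p: len(p[0]), reverse=True)
        for token, role in pairs:
            if re.search(rf"\b{re.escape(token)}\b[^;/]*\b{re.escape(surname)}\b", text):
                return role
        return "other"   # no usable evidence: reported for manual review by cataloguers

    # Example:
    # detect_role("Doe, Jane", "edited by Jane Doe ; illustrated by John Roe")  ->  "editor"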

4 Conclusion

Libraries’ shift to the semantic web through the conversion of their MARC data has been underway for a number of years. However, millions of headings within libraries’ MARC metadata are uncontrolled, that is, they have no matching authority record. On conversion to RDF, each such heading will receive a unique identifier even if that heading already exists in another bibliographic record within the library’s holdings, or in another library’s holdings worldwide. This proliferation of multiple identifiers for the same entity makes the linking of these entities problematic.

Authify is one of the first tools available to libraries both to convert their metadata to linked data and to enrich the reconciliation process with semantic data hidden within the MARC fields. By making use of additional data points, such as role, contained in free text within the MARC record, services such as Authify can match related entities even if the text strings for those entities do not match. As libraries worldwide will need to convert hundreds of millions of MARC records in their library systems to RDF, sophisticated, automated services for the conversion and reconciliation of their data will be critical for their transition to the semantic web.