1 Introduction

There has been tremendous interest in moving towards linked data as a means of discovery across the library, museum, and archives space for many years now. In 2011, Stanford held a linked data workshop with representatives from major information providers across North America and Europe [5]. One of the outcomes of that meeting was a “Manifesto for Linked Libraries (And Museums And Archives And …)” [6].

Although the practices enumerated there are stated simply, their implementation is difficult. Of particular interest is structuring data semantically. The Resource Description Framework (RDF), established by the W3C, has been widely adopted as a model for expressing data semantically on the Web. But even if the data is recorded semantically, entities will not “link” unless the same identifier has been used. Libraries have approached this issue in the past by creating authority records for entities they wish to establish so that references to the same entity can be linked by a unique text string. These authority records can be converted to identifiers representing the real-world object they describe and so may be used to link matching entities on the Web. As library metadata is converted to linked data, however, the appropriate authority record or identifier can easily be missed, especially if the text string varies from that used in the authority record. In addition, there are many entities for which no authority record was ever created, making conversion of this metadata to linked data problematic. The process of reconciliation, or the linking of identifiers for matching entities, becomes a critical step in the conversion of library metadata to linked data.

2 Current Services

The need for reconciliation is widespread. Organizations such as the Bibliothèque nationale de France (BnF) have incorporated reconciliation into their digital services platform. As a national library, the BnF produces data of high quality, and much reconciliation can be resolved through the use of authority files. Works, however, are a particularly difficult problem, as authority records are infrequently produced for them. The BnF is currently working on algorithms for the extraction of work identifiers from bibliographic data.

Culturegraph [7], a platform for services around data networking for cultural entities, has taken another approach. Projects such as their resolution and look-up service make available an open, central infrastructure for, among other things, the identification of equivalent records through a common URI.

Europeana [8] has a more complex task. The data they work with has neither the uniform quality of the BnF nor the more limited scope of the Culturegraph resolution and look-up service. The heterogeneous nature of their data, along with varying standards of construction and supporting authority files, makes reconciliation of the data very complex.

Similar to Europeana, Linked Data for Production [1], or LD4P, must work with data from a mix of institutions with varying quality standards. Many headings lack authority records, making their reconciliation dependent upon clues in the bibliographic data. Authify, in part, exploits this information in an attempt to reconcile entities when standard authority records or identifiers are lacking.

3 Authify

The Authify reconciliation service represents the heart of an ecosystem developed by @Cult and Casalini Libri called the SHARE-Virtual Discovery Environment, or SHARE-VDE [9]. It offers several search and detection services with the aim of creating a ‘cluster’ of variant name forms coming from different sources but referring to the same entity. The process produces an Authority Knowledge Base (AKB) composed of entity clusters that are continuously expanded as new sources are encountered and ingested. The idea of Authify arose at the beginning of the SHARE-VDE project as a way of overcoming some limitations of the public Virtual International Authority File (VIAF) Web APIs. VIAF [10], being a public project, does not allow massive numbers of calls against its APIs; for use cases that require bulk access, it instead provides a download of the whole dataset. Authify indexes and stores the VIAF clusters dataset and provides powerful full-text and bibliographic search services built upon it.

VIAF was the first source to be added to Authify. Other sources, not only in RDF but also in other formats, are now also considered by Authify for inclusion in the AKB. Thanks to this broadening of sources, the module is even more effective and can fulfill the requirement, expressed by libraries, that external datasets be usable in the detection and clusterization processes. Examples of these sources are: the Library of Congress Name Authority File (LC NAF) [11], Library of Congress Subject Headings (LCSH) [12], Faceted Application of Subject Terminology (FAST) [13], the Gemeinsame Normdatei (GND) [14], and ISNI [15].

Authify uses these sources by applying different strategies, depending on how the data are made available by each source:

  • if the source is available as a database dump (in formats such as MARC, XML, or RDF), the data are indexed into a SOLR component or an RDF triple store so that they can be queried;

  • if the source is not available as a dump but offers APIs or web services, Authify uses these interfaces to query the source and retrieve the relevant information.

Each source has its own endpoint, and the URL declares which source is being queried (e.g. /viaf/names, /fast/subjects).
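
A minimal sketch of how a client might query these per-source endpoints is shown below. The base URL, the response shape, and the parameter name q are assumptions made for illustration; only the endpoint path pattern (/viaf/names, /fast/subjects) comes from the description above.

    from urllib.parse import urlencode, urljoin
    from urllib.request import urlopen
    import json

    # Hypothetical base URL; real deployment details are not given in this paper.
    BASE_URL = "https://authify.example.org/"

    # Endpoint paths follow the pattern described above: the URL declares the source.
    ENDPOINTS = {
        ("viaf", "names"): "viaf/names",
        ("fast", "subjects"): "fast/subjects",
    }

    def search_source(source: str, entity_type: str, form: str) -> dict:
        """Query one source-specific endpoint with a heading (name, title, or subject)."""
        path = ENDPOINTS[(source, entity_type)]
        url = urljoin(BASE_URL, path) + "?" + urlencode({"q": form})
        with urlopen(url) as resp:  # plain HTTP GET; authentication and paging omitted
            return json.load(resp)

    # Example (assumes a JSON response):
    # matches = search_source("viaf", "names", "Doe, Jane, 1900-1980")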

The ability to search and retrieve data from different sources enhances the project ‘clusters’ that represent entities in the real world (each cluster is thus considered a Real-World Object entity, with all the attributes necessary to identify it). The detection and clusterization processes mentioned above make possible the identification of an entity (as a person, a work, a subject, etc.), the identification of the role the entity has in relationship to a resource, and the creation of a ‘cluster’ that identifies the entity with an ID and gathers together the different attributes useful for identification.

The logic of creating a new cluster begins with a search for the entity within the databases used in the project: a Postgres relational database, used to register clusters upon creation before they are added to the AKB, and the Authify SOLR database, built from the external sources used in the project (VIAF, NAF, etc.). Data extracted from library records are used to query these databases to ascertain whether a cluster already exists, whether a VIAF ID exists, whether the form used by the library exists as a preferred or variant form, and so on. Both the normalized forms used for the queries and the responses received from Authify are registered in the Postgres database so that they can be used for the creation (or feeding) of the AKB. Before being used in queries, however, the library authority and bibliographic records pass through two preliminary processes: normalization, which eliminates the sub-field separators and any non-standard punctuation, and the creation of a sort-form, which transforms the original string to uppercase and removes diacritics, accents, and special characters. In the construction of the normalized string and the sort-form, the tags and sub-fields coming from the authority and bibliographic records are used, depending on the type of tag.
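
The two preliminary steps can be illustrated with a short sketch. The exact punctuation rules, the sub-field separator syntax, and the per-tag configuration are internal to Authify, so the regular expressions below are assumptions rather than the production logic.

    import re
    import unicodedata

    def normalize(heading: str) -> str:
        """Normalization: drop sub-field separators and non-standard punctuation."""
        text = re.sub(r"\$[a-z0-9]", " ", heading)      # assumed $a/$d-style separators
        text = re.sub(r"[^\w\s,.\-']", " ", text)       # keep only a small punctuation set
        return re.sub(r"\s+", " ", text).strip()

    def sort_form(heading: str) -> str:
        """Sort-form: uppercase, with diacritics, accents, and special characters removed."""
        decomposed = unicodedata.normalize("NFKD", normalize(heading))
        no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
        return re.sub(r"[^A-Z0-9 ]", "", no_marks.upper()).strip()

    # Example:
    # normalize("$aBrontë, Charlotte,$d1816-1855")  ->  "Brontë, Charlotte, 1816-1855"
    # sort_form("$aBrontë, Charlotte,$d1816-1855")  ->  "BRONTE CHARLOTTE 18161855"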

If the search returns a positive answer (a cluster for this entity already exists), the existing cluster is expanded to include the new variant form; if the variant form is already present, no action is taken on the cluster. If the search returns a negative answer, a new cluster is created in the Postgres database and in the AKB.
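
Sketched below is this create-or-extend decision, under the assumption that the lookup key is the sort-form; the dataclass, the in-memory dictionary standing in for the Postgres and SOLR lookups, and the UUID-based cluster ID are illustrative choices, not the SHARE-VDE implementation.

    import uuid
    from dataclasses import dataclass, field

    @dataclass
    class Cluster:
        cluster_id: str                      # becomes the SHARE-VDE URI on conversion to RDF
        preferred_form: str
        variant_forms: set = field(default_factory=set)

    # In SHARE-VDE the lookup runs against Postgres and the Authify SOLR index;
    # a dictionary keyed by the sort-form stands in for both in this sketch.
    clusters_by_sort_form = {}

    def reconcile(heading: str, sort: str) -> Cluster:
        """Extend the matching cluster with a variant form, or create a new cluster."""
        cluster = clusters_by_sort_form.get(sort)
        if cluster is None:
            # Negative answer: register a new cluster (in Postgres and in the AKB).
            cluster = Cluster(cluster_id=str(uuid.uuid4()), preferred_form=heading)
            clusters_by_sort_form[sort] = cluster
        elif heading != cluster.preferred_form:
            # Positive answer: add the heading as a variant (a no-op if already present).
            cluster.variant_forms.add(heading)
        return cluster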

All variants retrieved from external sources are grouped by source (e.g. VIAF, ISNI, etc.), and all variants belonging to a given source are associated with the same URI. The final SHARE-VDE cluster is composed of:

  • a cluster ID (the SHARE-VDE URI);

  • all variant forms from local authority files, which inherit the same SHARE-VDE URI and are brought together with the SameAs relationship;

  • all forms from external sources (each source having a preferred form and variant forms), all with the same source URI and brought together with the SameAs relationship;

  • all variant forms from bibliographic records that do not match authoritative forms but that inherit the same SHARE-VDE URI and are brought together with the SameAs relationship;

  • additional information (such as authority notes);

  • operational data such as the cluster creation date, the update date, the cluster type, etc.
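
An illustrative rendering of such a cluster is given below as a Python literal; the field names, URIs, and values are placeholders chosen for readability, not the published SHARE-VDE data model.

    example_cluster = {
        "cluster_id": "https://svde.example.org/agent/000001",   # hypothetical SHARE-VDE URI
        "preferred_form": "Doe, Jane, 1900-1980",
        # Variant forms from local authority files; all inherit the SHARE-VDE URI (SameAs).
        "local_authority_forms": ["Doe, Jane, 1900-1980"],
        # Forms from external sources, grouped by source; each group shares one source URI (SameAs).
        "external_sources": {
            "VIAF": {"uri": "http://viaf.org/viaf/<id>", "preferred": "Doe, Jane",
                     "variants": ["Doe, J."]},
            "ISNI": {"uri": "https://isni.org/isni/<id>", "preferred": "Jane Doe",
                     "variants": []},
        },
        # Headings from bibliographic records that match no authoritative form (SameAs).
        "bibliographic_variants": ["Doe, Jane M."],
        # Additional information and operational data.
        "authority_notes": [],
        "operational": {"created": "2019-01-15", "updated": "2019-03-02", "type": "Person"},
    }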

One of the most relevant functions in Authify is the cluster search service. As the name suggests, this provides a full-text search service that queries names, works, and other entities available from different sources. All search services are made available as HTTP endpoints. The parameter used to start a search is the name form (or title, or subject) used in the project’s original data source (a heading, in the case of bibliographic or authority data). The search Web API uses an “invisible queries” approach in order to find as precise a match as possible for the search parameter among the forms already present in the external sources, or in the AKB.

The invisible queries approach makes everything transparent to the user. Following a single search request, the system executes a chain of different searches with different priorities. The first search that produces a result populates the response returned. Each new response progressively populates the new (or already existing) cluster in the AKB. For debugging purposes, the response also includes the search that produced the results. The goal of each search strategy is to return as precise a result as possible with the lowest recall possible.
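
A hedged sketch of the invisible-queries chain follows; the concrete strategies, their ordering, and the response fields are assumptions, since the paper only states that searches run in priority order and that the winning search is reported for debugging.

    from typing import Callable, List, Dict

    # A strategy maps a heading to a (possibly empty) list of candidate matches,
    # e.g. an exact match on the sort-form, then a phrase match, then a keyword match.
    Strategy = Callable[[str], List[Dict]]

    def run_invisible_queries(heading: str, strategies: List[Strategy]) -> Dict:
        """Run the chained searches in priority order; the first hit populates the response."""
        for strategy in strategies:
            results = strategy(heading)
            if results:
                return {"results": results,
                        "matched_by": strategy.__name__}   # kept for debugging, as noted above
        return {"results": [], "matched_by": None}

    # Example wiring (each strategy would query SOLR, the external endpoints, or the AKB):
    # response = run_invisible_queries("DOE JANE 19001980",
    #                                  [exact_sort_form_match, phrase_match, keyword_match])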

Query responses are used in a series of data analysis steps that are part of a process called ‘Similarity score’. This process assigns a weight to the various results in order to identify the elements (for example, the variant forms of a name) to be assigned to the same cluster. The Similarity score allows the system to decide if and when to feed an already existing cluster or to create a new one. At the end of the search process, the heading is either assigned to an already existing cluster in the AKB or, if none exists, produces a new cluster. Each cluster, for each entity type, is marked with an identifier (the cluster ID) used to produce the URI that will identify the entity in the RDF conversion process. At the end of each process, the AKB holds one name form marked as ‘preferred’ and a number of other forms marked as variants, which are useful for creating ‘sameAs’ relationships. Additional attributes are available to enrich the AKB, such as the original source of each variant/preferred form and the URIs/IDs for each form.
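
The role of the Similarity score can be illustrated with the toy weighting below; the features, weights, and threshold are purely illustrative, as the actual scoring formula is not published here.

    from difflib import SequenceMatcher
    from typing import Optional

    def similarity_score(candidate: dict, sort: str) -> float:
        """Weight one candidate result; features and weights are illustrative only."""
        score = 0.6 * SequenceMatcher(None, sort, candidate["sort_form"]).ratio()
        if candidate.get("dates_match"):        # e.g. agreement on birth/death dates
            score += 0.3
        if candidate.get("viaf_id"):            # candidate already carries an external identifier
            score += 0.1
        return score

    def assign_cluster(candidates: list, sort: str, threshold: float = 0.75) -> Optional[dict]:
        """Return the best-scoring existing cluster, or None to signal that a new one is needed."""
        best = max(candidates, key=lambda c: similarity_score(c, sort), default=None)
        if best is not None and similarity_score(best, sort) >= threshold:
            return best      # feed the existing cluster
        return None          # create a new cluster in the AKB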

One of the most delicate processes in the handling of bibliographic data is ‘Entity recognition’ or entity detection. In some cases, this step is crucial to the identification of an entity and relates to the identification of the role that a person has had in the creation or production of a resource. In the bibliographic world, the identification of a person is usually realized through the relationship with his/her work and, vice-versa, the identification of a work is realized through the association with its creator.

Authify uses the “Relator term detection” service to identify these relationships. Starting from a MARC record, the system analyses all (configured) tags that contain a name and tries to determine the corresponding role within the work represented by the given record using the statements of responsibility and other note fields.

To identify the role that an Agent has in relation to a resource, the ‘Relator term detection’ service uses a ‘Roles Knowledge Base’ that is progressively fed (through text analysis processes, automatically and manually) with all possible expressions useful for identifying a role. As an example, two main roles may be detected: author and an unclassified role (other). The “other” role is a catch-all used when no valuable information can be gathered from the analysis. At the end of each entity detection process, the system produces a report of non-matching role expressions associated with the bibliographic record identifiers. This report enables library cataloguers to check the record and to add the specific role term or code in the appropriate subfield. Behind the simple token-matching analysis there is a more complicated logic that tries (using, among other things, the search services described above) to find the role of each name, either through its variant forms or through a set of tokens that could identify such a role (e.g. edited by, by, illustrated by). At the end of this process, a certain number of records are enriched with roles, and a related report of ‘undefined’ roles is made available to allow for manual checks by professional users. This added element of human curation can help resolve issues that an automated process would find too difficult at present.
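
The token-matching part of the service might look like the sketch below; the Roles Knowledge Base entries, the surname heuristic, and the handling of the unclassified ‘other’ role are simplified assumptions, and the example tokens (edited by, by, illustrated by) come from the description above.

    import re

    # A tiny stand-in for the Roles Knowledge Base, which in Authify is fed
    # automatically and manually with expressions that signal a role.
    ROLE_TOKENS = {
        "author": ["by", "written by"],
        "editor": ["edited by"],
        "illustrator": ["illustrated by"],
        "translator": ["translated by"],
    }

    def detect_role(name: str, statement_of_responsibility: str) -> str:
        """Guess the role of a name from the statement of responsibility; 'other' when unclear."""
        text = statement_of_responsibility.lower()
        surname = name.split(",")[0].strip().lower()   # crude surname match for the sketch
        # Try the most specific expressions first so that "edited by" wins over the bare "by".
        pairs = sorted(((t, role) for role, toks in ROLE_TOKENS.items() for t in toks),
                       key=lambda p: len(p[0]), reverse=True)
        for token, role in pairs:
            if re.search(rf"\b{re.escape(token)}\b[^;/]*\b{re.escape(surname)}\b", text):
                return role
        return "other"   # no usable evidence: reported for manual review by cataloguers

    # Example:
    # detect_role("Doe, Jane", "edited by Jane Doe ; illustrated by John Roe")  ->  "editor"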

4 Conclusion

Libraries’ shift to the semantic web through the conversion of their MARC data has been underway for a number of years. However, millions of headings within libraries’ MARC metadata are uncontrolled, that is, they have no matching authority record. On conversion to RDF, each such heading will receive a unique identifier even if that heading already exists in another bibliographic record within the library’s holdings, or in another library’s holdings worldwide. This proliferation of multiple identifiers for the same entity makes the linking of these entities problematic.

Authify is one of the first tools available to libraries both to convert their metadata to linked data and to enrich the reconciliation process with semantic data hidden within the MARC fields. By making use of additional data points, such as role, contained in free text within the MARC record, services such as Authify can match related entities even if the text strings for those entities do not match. As libraries worldwide will need to convert hundreds of millions of MARC records in their library systems to RDF, sophisticated, automated services for the conversion and reconciliation of their data will be critical for their transition to the semantic web.