1 Introduction

MARC has long been used as a standard format for exchanging library records across information systems within the library community. Most MARC-based metadata records, whether bibliographic or authority records, are locked inside closed systems and cannot be integrated into the web or found by search engines such as Google. Although a vast volume of records has been created and maintained by libraries, these records are nearly all isolated from the web. Linked Open Data (LOD) has become libraries' preferred approach for turning MARC-based legacy data into part of the semantic web. Based on the principles of LOD, legacy records can be deconstructed into LOD data, which can then be enriched by aggregation with external resources and their contexts on the web. However, there is still no best practice, nor a systematic account of the related issues, to guide libraries in converting catalogue records into LOD.

2 Literature Review

In recent years, LOD has been used to associate related information with diverse viewpoints for the same resource, especially in the domain of cultural heritage. Traditionally, libraries have played the role of gatekeeper, providing information to support scholarly communication and research while locking these well-organized catalogue data inside proprietary library information systems, far away from users of the Internet. Libraries have shown great interest in freeing catalogue data to become part of the web, because these data, stored in complicated formats such as MARC, can enrich the content and context of semantic web resources. According to the analysis of Baker et al. [1], datasets, value vocabularies and metadata element sets in libraries are available for Linked Data (LD) reuse. Terms from value vocabularies (e.g., the Virtual International Authority File (VIAF) and the Library of Congress Subject Headings) and metadata element sets (e.g., Dublin Core element terms) can be used to describe LD such as persons, organizations, geographic names, temporal periods, works, concepts, and events.

According to the definition provided by Berners-Lee [3], there are four principles for Linked Data (LD): use URIs as names for things, use HTTP URIs so that those names can be looked up, provide useful information using standards such as RDF and SPARQL, and link to more data with URIs. In order to facilitate wide, free adoption and reuse of data, the concept of open data has been integrated with LD into a new concept, LOD [27]. In practice, LD and LOD are used interchangeably; in this study LOD is used as the standard term for both.
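To make the four principles concrete, they can be sketched in a few lines of Python that name things with HTTP URIs, describe them with RDF triples, and link out to an external dataset. This is a minimal illustrative sketch, not any case's actual data; all URIs below (including the VIAF identifier) are hypothetical examples.

```python
# Principles 1 & 2: HTTP URIs name things, and those names can be looked up.
# (All URIs here are invented placeholders for illustration.)
book = "http://example.org/resource/book/12345"
person = "http://example.org/resource/person/67890"

# Principle 3: useful information is provided in a standard model (RDF).
# Each triple is (subject, predicate, object).
triples = [
    (book, "http://purl.org/dc/terms/title", '"A Sample Title"'),
    (book, "http://purl.org/dc/terms/creator", person),
    # Principle 4: link to more data with URIs (a dummy VIAF identifier).
    (person, "http://www.w3.org/2002/07/owl#sameAs",
     "http://viaf.org/viaf/00000000"),
]

def to_ntriples(triples):
    """Serialize triples as N-Triples; literal objects are assumed pre-quoted."""
    lines = []
    for s, p, o in triples:
        obj = o if o.startswith('"') else f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

print(to_ntriples(triples))
```

The `owl:sameAs` link in the last triple is what allows an aggregator to merge a library's description of a person with descriptions held elsewhere on the web.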

Although best practices have been released by W3C, such as Hyland and Villazón-Terrazas [21] and Hyland et al. [20], there is a gap in the transfer “from theoretical discussion into practical implementation” of LOD for libraries, as pointed out by Hanson [17]. In terms of metadata design and development, LOD is not only “a conceptual shift from document-centric to data-centric and metadata-based approaches”, as stated by Di Noia et al. [11], but also a data model distinct from that of the library community, as pointed out by Cole et al. [8] and Di Noia et al. [11]. According to the four principles of LOD, libraries have to transform the legacy records (e.g., MARC) stored in their catalogues into URI-based data. To do this, Hanson [17] argued that libraries have to know how LOD is actually created and published in practice, and Bowen [6] likewise pointed out the significance of LOD and the many unexpected issues faced by libraries, as exemplified in the current case studies. For example, in a case study that transformed the MARC-based authority files of the National Library and Archive of Iran into LOD, Eslami and Vaghefzadeh [12] pointed out that it is difficult for libraries to select exact terms and rules for data RDFization and linking when authoring LOD. In a case transforming the MARC records of 30,000 digitized books at the University of Illinois at Urbana-Champaign Library, Cole et al. [8] pointed out that there is no common consistency across current examples of library LOD records, based on the cases of OCLC WorldCat and the British Library’s British National Bibliography (BL’s BNB). They further raised the issue of the lack of clear and consistent decisions to guide libraries in integrating links and selecting the exact URIs. Furthermore, manual editing and intervention are needed for batch transformation from catalogue data into LOD, as reported by Bowen [6], Lampert and Southwick [24], Park and Kipp [28] and Zeng et al. [35].
Therefore, there is an urgent need for customized LOD best practices and workflow for libraries as stated in several studies including Bowen [6], Cole et al. [8], Di Noia et al. [11], Hallo et al. [15], Hanson [17], Lampert and Southwick [24], and Southwick [30].

3 Methodology

Several documents, including Bauer and Kaltenböck [2], Heath and Bizer [19], Hyland et al. [20], Hyland and Villazón-Terrazas [21] and Hyland and Wood [22], can be regarded as useful references for authoring and publishing LOD. The document authored by Hyland and Villazón-Terrazas [21] is not only an official publication released by W3C; it has also been extended into several derivatives such as Hyland et al. [20] and Hyland and Wood [22], and its procedures are more abstract than those of Hyland et al. [20]. Thus, in this study, the Linked Data Cookbook authored by Hyland and Villazón-Terrazas [21] was selected as the framework for examining the characteristics and issues of Library LD (LLD). The framework is composed of the following key components: modeling (including identifying and modeling datasets), naming with URIs, reusing existing terms, publishing human- and machine-readable descriptions, RDF conversion, licensing, and hosting and announcement. Based on this framework, content analysis was used in this study to analyze existing practices for transforming library data into LOD. In total, sixteen case studies were selected as subjects: six national libraries, four university libraries, and six related LOD pilot projects and approaches (shown in Table 1). Each selected case study had to offer information addressing more than four components of the framework defined by Hyland and Villazón-Terrazas [21]. In addition to the published journal articles, the LOD websites and documents (e.g., presentations, specifications, FAQs, about pages, data models, technical reports, LOD web catalogs, and example LOD datasets) offered by the selected cases were also cross-checked.

Table 1. 16 cases of LLD practice for review

4 Results

4.1 Identifying Data

Originally the principle defined by Hyland and Villazón-Terrazas [21] was to select real-world objects of interest, but this was adapted by Hyland et al. [20] to include datasets unique to the public. Eight cases (BNE, DNB, Harvard Library datasets, LIBRIS, Talis, Universidad de Alicante, UIUC Library and XC) addressed this task; two (EPN and NCSU Libraries) selected data for expected audience interest, and three (music resources, Puglia Digital Library, and UNLV Libraries) selected unique data for LLD. In addition to interest and uniqueness, data popularity (i.e., NLAI) and integration of data (i.e., BNF) are also regarded as criteria for identifying data. However, in the case of the British Library’s BNB, the selection principle has been extended into more detailed categories, including authority, consistency, vast amount and clear rights of data.

In terms of the type of library legacy records, three patterns can be generalized. The first is that most cases focus on bibliographic data and then extend to specific value vocabularies, including BL’s BNB, BNF, EPN, Harvard Library datasets, LIBRIS, music resources, Puglia Digital Library, UNLV Libraries, Universidad de Alicante and XC. Second, in some cases authority data was selected as the subject for LLD, including NCSU Libraries, NLAI and UIUC Library. Lastly, both bibliographic and authority data were selected for LLD in only three cases: BNE, DNB and Talis.

4.2 Modeling Data

Modeling data is required to express the relations between data and to reuse existing terms from standards. There are three types of data modeling for LLD. The first is to select a reference model or an ontology as a basis for data modeling. As they hold MARC-based legacy records, most cases have adopted the Functional Requirements for Bibliographic Records (FRBR) for modeling data, covering group 1 (i.e., work, expression, manifestation and item), group 2 (i.e., person, family and corporate body) and group 3 (i.e., subject). Some cases focus only on group 1 (i.e., EPN and LIBRIS), some on groups 1 and 2 (i.e., Talis), and some on groups 1–3 (BNE, BNF, music resources, NLAI, Universidad de Alicante and XC). On the other hand, one case (UNLV Libraries) used the Europeana Data Model (EDM) to model data for LLD, one adopted the Music Ontology (i.e., music resources), and one (Harvard Library datasets) employed PROV to model LLD with provenance. Second, in some cases terms from more than two existing LOD standards were reused to model the data (i.e., BL’s BNB, DNB, NCSU Libraries and Puglia Digital Library). Third, and most interestingly, in one case MARC was first converted into the MODS format, and existing terms were then reused for LLD (i.e., UIUC Library).
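The FRBR group 1 hierarchy described above can be sketched as a small Python data model. This is an illustrative sketch only, not the model of any reviewed case; the entity names follow FRBR, while the sample work, ISBN and barcode are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    barcode: str                       # a single exemplar of a manifestation

@dataclass
class Manifestation:
    isbn: str                          # a physical embodiment (an edition)
    items: list = field(default_factory=list)

@dataclass
class Expression:
    language: str                      # an intellectual realization (e.g., a translation)
    manifestations: list = field(default_factory=list)

@dataclass
class Work:
    title: str                         # the abstract intellectual creation
    expressions: list = field(default_factory=list)

# "Hamlet" as a Work realized in one Expression, embodied in one
# Manifestation, and exemplified by one Item (all sample data invented).
work = Work("Hamlet", [
    Expression("en", [
        Manifestation("978-0-00-000000-0", [Item("B0001")])
    ])
])
```

Modeling legacy MARC records against such a hierarchy is what allows one catalogue record, which typically flattens all four levels into a single record, to be decomposed into separately addressable, URI-named entities.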

4.3 Naming with URIs

All the cases developed their own URIs for LLD. In terms of URI expression, in most cases a URL was adopted for each piece of data, but in several cases special expressive methods were used, such as ARK (i.e., BNF and UIUC Library), compact URIs (i.e., BNE), and hash values (i.e., Talis). Furthermore, Bowen [6] suggested that libraries should use a metadata registry such as the NSDL Metadata Registry to register and reuse their LOD terms with URIs at no cost.
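Two of the naming strategies above can be sketched as follows: a pattern-based HTTP URI minted from a local catalogue identifier, and a hash-derived identifier for data lacking a stable local key. The base namespace and the truncated-digest convention are hypothetical assumptions for illustration, not the scheme of any reviewed case.

```python
import hashlib

# Hypothetical base namespace; a real library would use its own domain.
BASE = "http://example.org/resource"

def mint_uri(record_type: str, local_id: str) -> str:
    """Mint a predictable, dereferenceable HTTP URI from a catalogue ID."""
    return f"{BASE}/{record_type}/{local_id}"

def hash_uri(record_type: str, descriptive_key: str) -> str:
    """Derive a stable URI from descriptive content via a hash value,
    useful when no persistent local identifier exists."""
    digest = hashlib.sha1(descriptive_key.encode("utf-8")).hexdigest()[:12]
    return f"{BASE}/{record_type}/{digest}"

print(mint_uri("bib", "000123456"))
# http://example.org/resource/bib/000123456
print(hash_uri("person", "Smith, John, 1950-"))
```

The pattern-based form is easier for humans to read and debug, while the hash-based form stays stable even when the descriptive source has no assigned identifier, which is one reason cases differ in their choice.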

4.4 Reusing Existing Terms

Existing LOD terms (shown in Table 2) have been used to achieve two functions for LLD: data modeling and linking to external resources. The former describes the data, and the latter aggregates different information about the same resource. According to the practices selected for review in this study, three types of existing LOD terms are reused for LLD: metadata elements, value vocabularies, and the classes/entities and relations between resources defined by conceptual reference models/ontologies. To describe data, in most cases only one appropriate standard and its terms are selected. However, in a few cases, such as BL’s BNB, terms from more than two standards spanning metadata elements, value vocabularies and conceptual reference models/ontologies were adopted to aggregate a broader range of information for LLD.
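The three types of reused terms can be illustrated in a single hypothetical record description: an ontology class types the resource, a metadata element describes it, and a value vocabulary URI links it to an external subject heading. The record URI and the LCSH identifier below are invented placeholders, not real identifiers.

```python
# A hypothetical record mixing the three kinds of reused terms.
record = "http://example.org/bib/42"  # invented record URI

triples = [
    # Ontology class: type the resource (Bibliographic Ontology).
    (record, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://purl.org/ontology/bibo/Book"),
    # Metadata element set: describe the data (Dublin Core Terms).
    (record, "http://purl.org/dc/terms/title",
     '"Semantic Web for Libraries"'),
    # Value vocabulary: link to an external subject heading for
    # aggregation (dummy LCSH identifier, not a real one).
    (record, "http://purl.org/dc/terms/subject",
     "http://id.loc.gov/authorities/subjects/sh00000000"),
]
```

Combining terms from several standards in this way is what lets a single LLD record be both self-descriptive and connected to the wider web of data.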

Table 2. Existing terms reused by LLD for data modeling and external linking

4.5 Other Steps

In addition to publishing machine-readable descriptions for LLD in various formats such as RDF/XML, JSON and Turtle, human-readable descriptions are also important to enable users to browse. Most cases offer both human- and machine-readable descriptions, including BL’s BNB, BNE, BNF, DNB, LIBRIS, NCSU Libraries, Puglia Digital Library, Universidad de Alicante and UNLV Libraries.

In terms of the RDF conversion task, eleven cases (BL’s BNB, BNE, BNF, EPN, Harvard Library datasets, LIBRIS, NCSU Libraries, Talis, UIUC Library, Universidad de Alicante, and XC) developed proprietary software to convert library legacy records into RDFized form, while others used open-source software such as OpenRefine (i.e., UNLV Libraries). Furthermore, owing to the inconsistency of library catalogue data, manual intervention and editing are still required for RDF conversion. In releasing LLD, open licensing terms (i.e., Creative Commons Zero) were selected in nine cases (BL’s BNB, BNE, BNF, DNB, EPN, LIBRIS, NCSU Libraries, Puglia Digital Library, and Universidad de Alicante) to allow users to download RDF-based data for reuse. National libraries in particular, such as BL’s BNB, BNE, BNF, DNB, and LIBRIS, have provided users with a “data dump” service to download LLD in batch rather than record by record. In ten cases (BL’s BNB, BNF, DNB, EPN, LIBRIS, NCSU Libraries, Puglia Digital Library, UIUC Library, UNLV Libraries, and Universidad de Alicante) a website was also provided to host and announce the LLD.
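The interplay between batch RDF conversion and manual intervention can be sketched as follows: mapped MARC fields become triples automatically, while unmapped or problematic fields are routed to a review queue for human editing. The field mapping, record URI and sample fields are illustrative assumptions, not the rules of any reviewed case.

```python
# Illustrative mapping rules: MARC (tag, subfield) -> reused term.
FIELD_MAP = {
    ("245", "a"): "http://purl.org/dc/terms/title",
    ("100", "a"): "http://purl.org/dc/terms/creator",
    ("260", "c"): "http://purl.org/dc/terms/issued",
}

def convert(record_uri, fields):
    """Convert (tag, subfield, value) tuples into triples where a mapping
    rule exists; queue everything else for manual review."""
    triples, needs_review = [], []
    for tag, sub, value in fields:
        predicate = FIELD_MAP.get((tag, sub))
        if predicate and value.strip():
            # Trim trailing ISBD punctuation left over from MARC practice.
            triples.append((record_uri, predicate, value.strip(" /.")))
        else:
            needs_review.append((tag, sub, value))
    return triples, needs_review

triples, review = convert("http://example.org/bib/1", [
    ("245", "a", "Linked data for libraries /"),
    ("100", "a", "Doe, Jane."),
    ("500", "a", "Includes bibliographical references."),  # no mapping rule
])
```

Here two fields convert automatically and the unmapped 500 note lands in the review queue, mirroring the reported need for manual editing alongside batch processing: the less consistent the source data, the larger that queue becomes.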

5 Discussion

5.1 Selection of Data for LLD

According to the best practices provided by W3C, the principle of data selection for LOD is to “look for real world objects of interest” [21]. Selection policy is naturally rooted in the library community through many professional tasks, such as collection development and cataloging. In the case of BNB, the British Library has attempted to develop library-oriented principles to identify data for LLD, including authority, consistency, vast amount and clear rights [9]. Although interest and uniqueness of data are important, data quality, in terms of authority and consistency, is also regarded as an essential criterion for selecting data for LLD. Based on the experiences reported by Bowen [6], Gracy et al. [13] and Zeng et al. [35], inconsistency in the original library catalogue data results in unresolved issues during conversion from MARC to RDF-based LLD. Fully automatic transformation from library legacy records into LLD is not possible without human intervention and editing, even when mapping rules and specifications have been clearly defined. Usually, the more consistent the data, the higher the degree of automatic conversion that can be achieved for LLD. Furthermore, clear rights are another important criterion, allowing users to determine how to reuse the LLD in future extended applications. Therefore, reaching a common agreement on the essential criteria for selecting data for LLD is an urgent issue for libraries.

5.2 Selection of Existing LOD Terms

Selecting appropriate existing terms from standards is fundamental for data modeling. How to select an appropriate standard and its terms consistently has become a hot issue for LLD, raised by Cole et al. [8], Deliot [9], Lampert and Southwick [24] and Park and Kipp [28], with Cole et al. [8] further highlighting that inconsistency directly impacts interoperability. According to the selection criteria for terms in the best practices provided by Hyland et al. [20], terms should be documented, self-descriptive, described in more than one language, used by other datasets, accessible for a long period, published by a trusted group or organization, have persistent URLs, and provide a versioning policy. According to the information outlined in the cases studied here, long-term viability [28], authoritative source [9] and relevance [13] are three further criteria for LLD, in addition to popularity [12]. Traditionally, trusted metadata registries have been published and maintained to facilitate data interoperability, including the Open Metadata Registry, the RDA Registry, and the NSDL Metadata Registry. A trusted metadata registry of LLD terms drawn from various standards (e.g., AAT, Bibliographic Ontology, Dewey.info, Dublin Core Terms, LCNAF, LCSH, TGN, ULAN, VIAF, MARC21’s language/country/role lists, and so on) and provided by authoritative organizations is required for libraries. If LLD is an important approach for pushing library legacy records into the semantic web, trusted registries composed of terms from standards, including metadata element sets, value vocabularies and conceptual reference models/ontologies (e.g., Bibliographic Ontology, FRBR and MARCOnt), need to be collected, created and maintained persistently in order to meet the aforementioned criteria defined by Hyland et al. [20] and pave the way for future interoperability.

5.3 1-to-1 Mapping Principle

Mapping is an essential task in selecting the right standard and terms for LLD. In the digital library domain, the 1-to-1 mapping principle defined by Dublin Core is widely accepted globally. In the case of Europeana, however, Haslhofer and Isaac [18] pointed out that the 1-to-1 mapping principle was not applied to the Europeana LOD Data Pilot, as the same data can apply to various resources, resulting in complicated networks of aggregations. In other words, many individual terms can be used for the same data. Furthermore, Deliot [9] emphasized that BL’s BNB has employed more than two individual terms from different standards for LLD. The 1-to-1 mapping principle is therefore apparently not entirely suitable for deciding and selecting the right term for data modeling and external linking in LLD. It seems that “context”, as pointed out by Gracy et al. [13], Hallo et al. [15], Han [16], Lampert and Southwick [24] and Zeng et al. [35], could be the key to this complicated issue of selecting the right terms for described resources. Therefore, a contextual mapping principle and its decision rules are required for judging and selecting the appropriate standards and terms for LLD.
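The contextual alternative to 1-to-1 mapping can be sketched as a rule table keyed on both the source field and the record's context, so that the same field maps to different target terms in different situations. The contexts, the MARC 700 example and the fallback choice are illustrative assumptions, not established decision rules.

```python
def contextual_map(tag, context):
    """Choose a target predicate for a MARC field based on record context,
    instead of applying a single fixed 1-to-1 rule."""
    rules = {
        # (tag, context) -> reused term (illustrative choices only)
        # 700 added entry as translator in a translated work:
        ("700", "translation"): "http://purl.org/dc/terms/contributor",
        # 700 added entry as co-author in a collected work:
        ("700", "collection"):  "http://purl.org/dc/terms/creator",
    }
    # Fall back to the generic contributor term when no rule applies.
    return rules.get((tag, context), "http://purl.org/dc/terms/contributor")

print(contextual_map("700", "collection"))
# http://purl.org/dc/terms/creator
```

A mapping table of this shape makes the context-dependent decisions explicit and reviewable, which is precisely what a contextual mapping principle would need to standardize.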

5.4 Long-Term Preservation of LLD

One of the essential requirements of LOD is to assign resources URIs to facilitate the aggregation of information from various sources. However, the changeability and disappearance of URLs has long been an unsolved issue. If URLs cannot be kept intact or persistent, the aggregated network effects of LOD will diminish. Based on a review of five case studies of LOD in digital libraries, Hallo et al. [14] suggested that the preservation of linked datasets should be considered one of the tasks of a library. In order to play an infrastructure role in the semantic web, libraries need to evaluate feasible solutions to this issue.

6 Conclusion

Most of the tasks defined by Hyland and Villazón-Terrazas [21] were adopted by the 16 LLD cases examined here for transforming library legacy records (e.g., MARC) into LOD. However, many cases have also developed policies that extend or refine the tasks defined by the Linked Data Cookbook to meet their specific LLD requirements. More studies are needed to investigate the decision principles and rules for selecting standards and their terms during data modeling for LLD. On the other hand, based on the results of this study, the concept of an application profile should be adopted to examine the application levels of the tasks provided by Hyland and Villazón-Terrazas [21]. In the future, a library-oriented workflow for LLD will be developed from this study.