Abstract
Current practices for transforming library legacy records into linked data for libraries were studied. The Linked Data Cookbook released by W3C was used as an analytical framework. A total of sixteen library linked data case studies focused on converting library catalogue data into linked open data were selected as subjects to analyze the details of transformation according to the following categories: identifying data, modeling data, naming with URIs, reusing existing terms, publishing human and machine readable descriptions, RDF conversion, license, download, host and announcement. It was found that although most tasks defined by the Linked Data Cookbook were adopted, some extensions and refinements were adopted to meet specific library-oriented requirements. Related issues including selection of data, selection of terms, 1-to-1 mapping principle and long-term preservation of library linked data are discussed.
1 Introduction
MARC has long been the standard format for exchanging library records between information systems within the library community. Most MARC-based metadata records, whether bibliographic or authority records, are locked in closed systems and cannot be integrated into the web or found by search engines such as Google. Although a vast volume of records has been created and maintained by libraries, nearly all of them are isolated from the web. Linked Open Data (LOD) has become the preferred approach by which libraries convert MARC-based legacy data into part of the semantic web. Based on the principles of LOD, legacy records can be deconstructed into LOD that can be enriched by aggregation with external resources and their contexts on the web. However, there is still no established best practice, and no clear account of the related issues, to tell libraries how to convert library catalogue records into LOD.
2 Literature Review
In recent years, LOD has been used to associate related information from diverse viewpoints with the same resource, especially in the domain of cultural heritage. Traditionally, libraries have played the role of gatekeeper, providing information to support scholarly communication and research while keeping their well-organized catalogue data inside proprietary library information systems, far away from users of the Internet. Libraries have shown great interest in freeing catalogue data to become part of the web, because these data, although stored in a complicated format such as MARC, can enrich the content and context of resources on the semantic web. According to the analysis of Baker et al. [1], datasets, value vocabularies and metadata element sets in libraries are available for Linked Data (LD) reuse. Terms from value vocabularies (e.g., terms of the Virtual International Authority File (VIAF) and Library of Congress Subject Headings) and metadata element sets (e.g., Dublin Core Element terms) can be used in LD to describe persons, organizations, geographic names, temporal periods, works, concepts, events, and so on.
According to the definition provided by Berners-Lee [3], there are four principles for LD, as follows: use URIs as names for identification of things, use HTTP URIs so that those names can be looked up, use standards such as RDF and SPARQL to provide useful information, and link to more data with URIs. In order to facilitate wide and free adoption and reuse of data, the concept of open data has been integrated with LD into a new concept, that is, LOD [27]. In practice, LD and LOD are used interchangeably. In this study, LOD is used as the standard term for both LD and LOD.
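The four principles can be read as checks on a set of triples. The following is a minimal sketch rather than part of any cited specification: the function name check_principles, the example.org URIs and the placeholder VIAF identifier are assumptions made for illustration. The third principle (returning RDF or answering SPARQL on lookup) cannot be verified offline, so only the naming and linking principles are checked here.

```python
from urllib.parse import urlparse

# Triples as (subject, predicate, object) strings.
# All URIs below are illustrative placeholders, not real library identifiers.
TRIPLES = [
    ("http://example.org/book/1", "http://purl.org/dc/terms/title", '"Don Quixote"'),
    ("http://example.org/book/1", "http://purl.org/dc/terms/creator", "http://example.org/person/7"),
    ("http://example.org/person/7", "http://www.w3.org/2002/07/owl#sameAs", "http://viaf.org/viaf/0000000"),
]


def is_http_uri(value):
    """True when the value parses as an http or https URI with a host."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def check_principles(triples, local_host):
    """Check principles 1-2 (HTTP URIs as names) and 4 (links to other data)."""
    subjects_ok = all(is_http_uri(s) for s, _, _ in triples)
    links_out = any(
        is_http_uri(o) and urlparse(o).netloc != local_host for _, _, o in triples
    )
    return {"names_are_http_uris": subjects_ok, "links_to_other_data": links_out}


report = check_principles(TRIPLES, "example.org")
print(report)
```

A dataset that names things with non-HTTP identifiers, or that never points outside its own host, would fail one of the two checks.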
Although W3C has released best practices such as Hyland and Villazón-Terrazas [21] and Hyland et al. [20], there is a gap in the transfer “from theoretical discussion into practical implementation” of LOD for libraries, as pointed out by Hanson [17]. In terms of metadata design and development, LOD is not only “a conceptual shift from document-centric to data-centric and metadata-based approaches”, as stated by Di Noia et al. [11], but also a data model that is distinct from that of the library community, as pointed out by Cole et al. [8] and Di Noia et al. [11]. According to the four principles of LOD, libraries have to transform current legacy records (e.g., MARC) stored in the catalogue into URI-based data. To do this, Hanson [17] argued that libraries have to know how LOD is actually created and published in practice, and Bowen [6] also pointed out the significance of LOD and the many unexpected issues faced by libraries, as exemplified in current case studies. For example, in a case study that transformed the MARC-based authority files of the National Library and Archive of Iran into LOD, Eslami and Vaghefzadeh [12] pointed out that it is difficult for libraries to select exact terms and rules for RDFizing and linking data when authoring LOD. In a case involving the transformation of MARC records for 30,000 digitized books at the University of Illinois at Urbana-Champaign Library, Cole et al. [8] pointed out that current examples of library LOD records, such as those of OCLC WorldCat and the British Library’s British National Bibliography (BL’s BNB), are not consistent with one another. They further raised the issue of the lack of clear and consistent guidance for libraries on integrating links and selecting exact URIs. Furthermore, manual editing and intervention are needed for batch transformation of catalogue data into LOD, as reported by Bowen [6], Lampert and Southwick [24], Park and Kipp [28] and Zeng et al. [35].
Therefore, there is an urgent need for customized LOD best practices and workflow for libraries as stated in several studies including Bowen [6], Cole et al. [8], Di Noia et al. [11], Hallo et al. [15], Hanson [17], Lampert and Southwick [24], and Southwick [30].
3 Methodology
Several types of documents, including Bauer and Kaltenböck [2], Heath and Bizer [19], Hyland et al. [20], Hyland and Villazón-Terrazas [21] and Hyland and Wood [22], can be regarded as useful references for authoring and publishing LOD. The document authored by Hyland and Villazón-Terrazas [21] is not only an official publication released by W3C; it has also been extended into several derivatives such as Hyland et al. [20] and Hyland and Wood [22], and its procedures are more abstract than those of Hyland et al. [20]. Thus, in this study, the Linked Data Cookbook authored by Hyland and Villazón-Terrazas [21] was selected as a framework to examine the characteristics and issues related to Library LD (LLD). The framework is composed of the following key components: modeling (including identifying and modeling datasets), naming with URIs, reusing existing terms, publishing human and machine readable descriptions, RDF conversion, license, host and announcement. Based on this framework, content analysis was used to analyze existing practices for transforming library data into LOD. In total, sixteen case studies were selected as subjects: six national libraries, four university libraries, and six related LOD pilot projects and approaches (shown in Table 1). Each selected case study had to offer information addressing more than four components of the framework defined by Hyland and Villazón-Terrazas [21]. In addition to the published journal articles, LOD websites and their documents (e.g., ppt, specification, FAQ, about, data model, technical report, LOD web catalogs, example LOD datasets) offered by the selected cases were also cross-checked.
4 Results
4.1 Identifying Data
Originally the principle defined by Hyland and Villazón-Terrazas [21] was to select real world objects of interest, but that has been adapted to include unique datasets for the public by Hyland et al. [20]. Eight cases (BNE, DNB, Harvard Library datasets, LIBRIS, Talis, Universidad de Alicante, UIUC Library and XC) addressed this task, and two (EPN and NCSU Libraries) selected the data for expected audience interest, and three (music resources, Puglia Digital Library, and UNLV Libraries) selected unique data for LLD. In addition to interest and uniqueness selection for LLD, data popularity (i.e., NLAI) and integration of data (i.e., BNF) are also regarded as selection criteria for identification of data. However, the selection principle has been extended in the case of British Library BNB into more detailed categories, including authority, consistency, vast amount and clear rights of data.
In terms of the type of library legacy records, three patterns can be generalized as follows. The first is that most cases focus on bibliographic data and then extend to specific value vocabularies, including the following cases: BL’s BNB, BNF, EPN, Harvard Library datasets, LIBRIS, music resources, Puglia Digital Library, UNLV Libraries, Universidad de Alicante and XC. Second, in some cases authority data was selected as the subject for LLD, including NCSU Libraries, NLAI and UIUC Library. Lastly, both bibliographic and authority data were selected for LLD in only three cases: BNE, DNB and Talis.
4.2 Modeling Data
Modeling data is required to express the relations between data and to reuse existing terms from standards. There are three types of data modeling for LLD. The first is to select a reference model or an ontology as a basis for data modeling. As they have MARC-based legacy records, most cases have adopted Functional Requirements for Bibliographic Records (FRBR) for modeling data, including group 1 (i.e., work, expression, manifestation and item), group 2 (i.e., person, family and corporate body) and group 3 (i.e., subject). Some cases focus only on group 1 (i.e., EPN and LIBRIS), some on groups 1 and 2 (i.e., Talis), and some on groups 1–3 (BNE, BNF, music resources, NLAI, Universidad de Alicante and XC). On the other hand, one case (UNLV Libraries) used the Europeana Data Model (EDM) to model data for LLD, one case adopted the Music Ontology (i.e., music resources), and one case (Harvard Library datasets) employed PROV to model LLD with provenance. Second, in some cases more than two existing LOD terms from different standards were reused to model the data (i.e., BL’s BNB, DNB, NCSU Libraries and Puglia Digital Library). Third, and most interestingly, MARC was converted into MODS format, and then existing terms were reused for LLD (i.e., UIUC Library).
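To make the FRBR group 1 structure concrete, the following minimal sketch splits one flat catalogue record into work, expression, manifestation and item. The record keys (title, creator, language, publisher, year, barcode), the sample values and the function name split_record are hypothetical; none of the reviewed cases is claimed to model data this way, and real MARC-to-FRBR conversion needs clustering heuristics that are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    title: str
    creator: str


@dataclass(frozen=True)
class Expression:
    work: Work
    language: str


@dataclass(frozen=True)
class Manifestation:
    expression: Expression
    publisher: str
    year: str


@dataclass(frozen=True)
class Item:
    manifestation: Manifestation
    barcode: str


def split_record(record):
    """Split one flat record (a dict) into the four group 1 entities."""
    work = Work(record["title"], record["creator"])
    expression = Expression(work, record["language"])
    manifestation = Manifestation(expression, record["publisher"], record["year"])
    return Item(manifestation, record["barcode"])


# Two hypothetical records for different editions of the same text.
flat_a = {"title": "Don Quixote", "creator": "Cervantes", "language": "eng",
          "publisher": "Publisher A", "year": "1998", "barcode": "0001"}
flat_b = {"title": "Don Quixote", "creator": "Cervantes", "language": "eng",
          "publisher": "Publisher B", "year": "2005", "barcode": "0002"}

item_a = split_record(flat_a)
item_b = split_record(flat_b)

# Both items resolve to an equal Work, although their manifestations differ.
same_work = item_a.manifestation.expression.work == item_b.manifestation.expression.work
print(same_work)
```

The point of the layering is that two editions held as separate records collapse onto one work, which is what allows a single work-level URI to aggregate them.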
4.3 Naming with URIs
All the cases developed their own URIs for LLD. In terms of URI expression, in most cases a URL was adopted for each piece of data, but in several cases special expressive methods were used, such as ARK (including BNF and UIUC Library), compact URI (i.e., BNE), and hash value (i.e., Talis). Furthermore, Bowen [6] suggested that libraries should use a metadata registry such as the NSDL Metadata Registry to register and reuse their LOD terms with URIs at no cost.
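The four URI styles mentioned above can be sketched as small minting functions. This is an illustration under stated assumptions, not the scheme of any reviewed case: the example.org base, the placeholder NAAN 12345, the path segments and the function names are all hypothetical, and "hash value" is assumed here to mean a digest of the record content.

```python
import hashlib

BASE = "http://example.org"  # hypothetical publishing domain
NAAN = "12345"               # placeholder Name Assigning Authority Number


def url_style(record_id):
    """Plain URL built from a local record identifier."""
    return f"{BASE}/resource/{record_id}"


def ark_style(record_id):
    """ARK identifiers have the shape ark:/NAAN/name, served under a resolver."""
    return f"{BASE}/ark:/{NAAN}/{record_id}"


def compact_uri(prefix, local_id):
    """Compact URI: a short prefix standing in for a long namespace."""
    return f"{prefix}:{local_id}"


def expand_compact_uri(curie, prefixes):
    """Expand a compact URI back to a full URI using a prefix table."""
    prefix, local_id = curie.split(":", 1)
    return prefixes[prefix] + local_id


def hash_style(record_text):
    """URI whose last segment is a SHA-1 digest of the record content (assumption)."""
    digest = hashlib.sha1(record_text.encode("utf-8")).hexdigest()
    return f"{BASE}/id/{digest}"


prefixes = {"ex": f"{BASE}/resource/"}
print(url_style("b42"))
print(ark_style("b42"))
print(expand_compact_uri(compact_uri("ex", "b42"), prefixes))
print(hash_style("Don Quixote / Cervantes"))
```

A content-derived digest yields the same URI whenever the same record text is processed again, whereas the other three styles depend on a stable local identifier.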
4.4 Reusing Existing Terms
Existing LOD terms (shown in Table 2) have been used to achieve two functions for LLD as follows: data modeling and linking to external resources. The former is used to describe the data, and the latter is employed to aggregate different information for the same resource. According to the selected practices for review by this study, there are three types of reuse of existing LOD terms for LLD, as follows: metadata elements, value vocabularies, and classes/entities and relations of conceptual reference models/ontologies between resources. In order to describe data, in most cases, only an appropriate standard and its terms are selected. However, in a few cases, terms from more than two standards of metadata elements, value vocabularies and conceptual reference models/ontologies were adopted to aggregate with a broader range of information for LLD, such as the case of BL’s BNB.
4.5 Other Steps
In addition to publishing machine readable descriptions for LLD in various formats such as RDF, JSON and Turtle, human readable descriptions are also important to enable users to browse. Most cases are inclined to offer both human and machine readable descriptions, including BL’s BNB, BNE, BNF, DNB, LIBRIS, NCSU Libraries, Puglia Digital Library, Universidad de Alicante and UNLV Libraries.
In terms of the RDF conversion task, eleven cases (BL’s BNB, BNE, BNF, EPN, Harvard Library datasets, LIBRIS, NCSU Libraries, Talis, UIUC Library, Universidad de Alicante, and XC) developed proprietary software for conversion from library legacy records into RDFized form, while one case used open software such as OpenRefine (i.e., UNLV Libraries). Furthermore, owing to the inconsistency of library catalogue data, manual intervention and editing are still required for RDF conversion. In releasing LLD, open licensing terms (i.e., Creative Commons 0) were selected in nine cases (BL’s BNB, BNE, BNF, DNB, EPN, LIBRIS, NCSU Libraries, Puglia Digital Library, and Universidad de Alicante) to allow users to download RDF-based data for reuse. National libraries in particular, such as the British Library BNB, BNE, BNF, DNB, and LIBRIS, have provided users with a “data dump” service to download LLD by batch, rather than record by record. In ten cases (BL’s BNB, BNF, DNB, EPN, LIBRIS, NCSU Libraries, Puglia Digital Library, UIUC Library, UNLV Libraries, and Universidad de Alicante) a website was also provided as host and announcement for LLD.
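The combination of rule-based conversion and manual intervention can be sketched as a converter that emits N-Triples for fields it has a rule for and queues everything else for a cataloguer. The field-to-term table below is an illustrative assumption (it pairs common MARC tags with Dublin Core terms) and is not the mapping used by any of the reviewed cases; the record URI and the function names are likewise hypothetical.

```python
DCTERMS = "http://purl.org/dc/terms/"

# Illustrative mapping from (MARC tag, subfield code) to an existing term.
FIELD_MAP = {
    ("245", "a"): DCTERMS + "title",
    ("100", "a"): DCTERMS + "creator",
    ("260", "b"): DCTERMS + "publisher",
    ("650", "a"): DCTERMS + "subject",
}


def escape_literal(text):
    """Escape backslashes and double quotes only; enough for this sketch."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def convert(record_uri, fields):
    """Return (ntriples_lines, needs_review) for a list of (tag, code, value)."""
    lines, needs_review = [], []
    for tag, code, value in fields:
        predicate = FIELD_MAP.get((tag, code))
        value = value.strip()
        if predicate is None or not value:
            # No rule, or an empty value: leave it for manual editing.
            needs_review.append((tag, code, value))
            continue
        lines.append(f'<{record_uri}> <{predicate}> "{escape_literal(value)}" .')
    return lines, needs_review


fields = [
    ("245", "a", "Don Quixote"),
    ("100", "a", "Cervantes Saavedra, Miguel de"),
    ("590", "a", "Local note"),
]
lines, needs_review = convert("http://example.org/resource/b42", fields)
print("\n".join(lines))
print(needs_review)
```

The size of the review queue relative to the number of converted fields gives a rough measure of how much manual work a batch still needs.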
5 Discussion
5.1 Selection of Data for LLD
According to the best practices provided by W3C, the principle of data selection for LOD is to “look for real world objects of interest” [21]. Naturally, selection policy is rooted in the library community for many professional tasks, such as collection development and cataloging. In the case of BNB, the British Library has attempted to develop library-oriented principles to identify data for LLD, including authority, consistency, vast amount and clear rights [9]. Although the interest and uniqueness of data are important, data quality in terms of authority and consistency is also regarded as an essential criterion for selecting data for LLD. Based on experiences learned from Bowen [6], Gracy et al. [13] and Zeng et al. [35], inconsistency in the original library catalogue data will result in unsolved issues during conversion from MARC to RDF-based LLD. Fully automatic transformation from library legacy records into LLD is not possible without human intervention and editing, even when mapping rules and specifications have been clearly defined. Usually, the more consistent the data, the higher the degree of automation that conversion to LLD can achieve. Furthermore, clear rights are another important criterion, because they let users determine how to reuse LLD in future extended applications. Therefore, achieving a common agreement about which criteria are essential for the selection of data for LLD is an urgent issue for libraries.
5.2 Selection of Existing LOD Terms
Selection of appropriate existing terms from standards is fundamental for data modeling. How to select an appropriate standard and its terms consistently has also become a hot issue for LLD, raised by Cole et al. [8], Deliot [9], Lampert and Southwick [24] and Park and Kipp [28], with Cole et al. [8] further highlighting that inconsistency will impact interoperability directly. According to the selection criteria for terms in the best practices provided by Hyland et al. [20], terms should be documented, self-descriptive, described in more than one language, used by other datasets, accessible for a long period, published by a trusted group or organization, have persistent URLs, and provide a versioning policy. According to the information outlined in the selected cases studied here, long-term viability [28], authoritative source [9] and relevance [13] are three further criteria for LLD, in addition to popularity [12]. Traditionally, trusted metadata registries have been published and maintained to facilitate data interoperability, including the Open Metadata Registry, RDA Registry, and NSDL Metadata Registry. A trusted metadata registry of LLD terms composed from various standards (e.g., AAT, Bibliographic Ontology, Dewey.info, Dublin Core Terms, LCNAF, LCSH, TGN, ULAN, VIAF, MARC21’s language/country/role and so on) provided by authoritative organizations is required for libraries. If LLD is an important approach to push library legacy records as part of the semantic web, trusted registries composed of terms from standards including metadata element sets, value vocabularies and conceptual reference models/ontologies (e.g., Bibliographic Ontology, FRBR and MARCOnt) need to be collected, created and maintained persistently in order to meet the aforementioned criteria defined by Hyland et al. [20] and pave the way for interoperability in the future.
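The eight criteria listed by Hyland et al. [20] can be applied as a simple checklist when a candidate vocabulary is assessed. The sketch below is only one possible encoding: the criterion keys, the function name unmet_criteria and the sample assessment are hypothetical, and the assessment values are invented for illustration rather than taken from any real vocabulary.

```python
# Criteria for vocabulary terms, paraphrased from the list in the text.
CRITERIA = (
    "documented",
    "self_descriptive",
    "multilingual",
    "used_by_other_datasets",
    "long_term_accessible",
    "trusted_publisher",
    "persistent_urls",
    "versioning_policy",
)


def unmet_criteria(assessment):
    """Return the criteria that an assessment does not mark as satisfied."""
    return [name for name in CRITERIA if not assessment.get(name, False)]


# Hypothetical assessment of an unnamed candidate vocabulary.
candidate = {
    "documented": True,
    "self_descriptive": True,
    "multilingual": False,
    "used_by_other_datasets": True,
    "long_term_accessible": True,
    "trusted_publisher": True,
    "persistent_urls": True,
    # versioning_policy not assessed yet, so it counts as unmet
}

print(unmet_criteria(candidate))
```

Recording assessments in this form would let a registry report, per vocabulary, which criteria still block its recommendation.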
5.3 1-to-1 Mapping Principle
Mapping is an essential task for the selection of the right standard and its terms for LLD. In the domain of the digital library, the 1-to-1 mapping principle defined by Dublin Core is widely accepted. In the case of Europeana, Haslhofer and Isaac [18] pointed out that the 1-to-1 mapping principle was not applied to the Europeana LOD Data Pilot, because data can be applied to various resources, resulting in complicated networks of aggregations. In other words, many individual terms can be used for the same data. Furthermore, Deliot [9] emphasized that the BL’s BNB has employed more than two individual terms from different standards for LLD. Therefore, the 1-to-1 mapping principle is apparently not entirely suitable for deciding on and selecting the right term for data modeling and external linking in LLD. It seems that “context”, as pointed out by Gracy et al. [13], Hallo et al. [15], Han [16], Lampert and Southwick [24] and Zeng et al. [35], could be the key reason for this complicated issue in selecting the right terms for described resources. Therefore, a contextual mapping principle and its decision rules are required for judging and selecting the appropriate standards and their terms for LLD.
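The difference between a fixed 1-to-1 table and contextual decision rules can be sketched as follows. This follows the framing used in this section (one source element, one or several target terms) and is not a proposal from any cited study: the source element names, the role values, the rule table and the function names are hypothetical, and the prefixed terms are written as plain strings.

```python
# Fixed table: every source element always maps to the same single term.
ONE_TO_ONE = {"name": "dcterms:creator", "title": "dcterms:title"}

# Contextual rules: tried in order; the first rule whose conditions all match wins.
CONTEXT_RULES = {
    "name": [
        ({"role": "author"}, ["dcterms:creator"]),
        ({"role": "illustrator"}, ["dcterms:contributor"]),
        ({}, ["dcterms:contributor"]),  # fallback when the role is unknown
    ],
}


def map_one_to_one(element):
    """Context-free mapping: always exactly one term."""
    return [ONE_TO_ONE[element]]


def map_in_context(element, context):
    """Pick target terms using the context of the described resource."""
    for conditions, terms in CONTEXT_RULES.get(element, []):
        if all(context.get(key) == value for key, value in conditions.items()):
            return terms
    # No contextual rule for this element: fall back to the fixed table.
    return map_one_to_one(element)


print(map_one_to_one("name"))
print(map_in_context("name", {"role": "illustrator"}))
print(map_in_context("title", {}))
```

With the fixed table, an illustrator's name is published as a creator; the contextual rules use the role to choose a different term for the same source element.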
5.4 Long-Term Preservation of LLD
One of the essential requirements of LOD is to assign URIs to resources in order to facilitate the aggregation of information from various sources. However, the changeability and vanishing of URLs has been an unsolved issue for a long time. If URLs cannot be kept intact or persistent, the aggregated network effects of LOD will diminish. Based on a review of five case studies of LOD in digital libraries, Hallo et al. [14] suggested that preservation of linked datasets should be considered one of the tasks of a library. In order to play an infrastructure role in the semantic web, libraries need to evaluate feasible solutions to this issue.
6 Conclusion
Most of the tasks defined by Hyland and Villazón-Terrazas [21] are adopted by the 16 LLD cases examined for transforming library legacy records (e.g., MARC) into LOD. However, many cases have also developed policies that extend or refine the tasks defined by the Linked Data Cookbook to meet their specific LLD requirements. More studies are needed to investigate the decision principles and rules for the selection of standards and their terms during data modeling for LLD. In addition, based on the results of this study, the concept of an application profile should be adopted to examine the application levels of the tasks provided by Hyland and Villazón-Terrazas [21]. In the future, a library-oriented workflow for LLD will be developed from this study.
References
1. Baker, T., Bermès, E., Coyle, K., Dunsire, G., Isaac, A., Murray, P., Panzer, M., Schneider, J., Singer, R., Summers, E., Waites, W., Young, J., Zeng, M.: Library linked data incubator group final report: W3C incubator group report (2011). https://www.w3.org/2005/Incubator/lld/XGR-lld-20111025/
2. Bauer, F., Kaltenböck, M.: Linked Open Data: The Essentials. Edition Mono, Vienna (2012). https://www.semantic-web.at/LOD-TheEssentials.pdf
3. Berners-Lee, T.: Linked data (2006). https://www.w3.org/DesignIssues/LinkedData.html
4. BNE’s Data model Homepage. http://www.bne.es/en/Inicio/Perfiles/Bibliotecarios/DatosEnlazados/Modelos/
5. BNF’s semantic web and data model Homepage. http://data.bnf.fr/en/semanticweb#Ancre2
6. Bowen, J.: Moving library metadata toward linked data: opportunities provided by the extensible catalog. In: Proceedings of International Conference on Dublin Core and Metadata Applications 2010 (2010). http://dcpapers.dublincore.org/pubs/article/view/1010/979
7. Candela, G., Escobar, P., Marco-Such, M., Carrasco, R.C.: Transformation of a library catalogue into RDA linked open data. In: Kapidakis, S., Mazurek, C., Werla, M. (eds.) TPDL 2015. LNCS, vol. 9316, pp. 321–325. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24592-8_26
8. Cole, T.W., Han, M.-J., Weathers, W.F., Joyner, E.: Library MARC records into linked open data: challenges and opportunities. J. Libr. Metadata 13(2–3), 163–196 (2013)
9. Deliot, C.: Publishing the British National Bibliography as linked open data. Cat. Index 174, 13–18 (2014). http://www.bl.uk/bibliographic/pdfs/publishing_bnb_as_lod.pdf
10. Deutsche Nationalbibliothek: The linked data service of the German National Library: modelling of bibliographic data (2016). http://www.dnb.de/SharedDocs/Downloads/EN/DNB/service/linkedDataModellierungTiteldaten.pdf?__blob=publicationFile
11. Di Noia, T., Ragone, A., Maurino, A., Mongiello, M., Marzoccca, M.P., Cultrera, G., Bruno, M.P.: Linking data in digital libraries: the case of Puglia Digital Library. In: Proceedings of 1st Workshop on Humanities in the Semantic Web (WHiSe 2016) (2016). http://ceur-ws.org/Vol-1608/paper-05.pdf
12. Eslami, S., Vaghefzadeh, M.H.: Publishing Persian linked data of national library and archive of Iran. In: Proceedings of IFLA WLIC 2013 (2013). http://library.ifla.org/193/1/222-eslami-en.pdf
13. Gracy, K.F., Zeng, M.L., Skirvin, L.: Exploring methods to improve access to music resources by aligning library data with linked data: a report of methodologies and preliminary findings. J. Am. Soc. Inform. Sci. Technol. 64(10), 2078–2099 (2013)
14. Hallo, M., Luján-Mora, S., Maté, A., Trujillo, J.: Current state of linked data in digital libraries. J. Inf. Sci. 42(2), 117–127 (2016)
15. Hallo, M., Luján-Mora, S., Trujillo, J.: Transforming library catalogs into linked data. In: Proceedings of the 7th International Conference of Education, Research and Innovation (ICERI 2014), Seville, Spain, 17–19 November 2014, pp. 1845–1853 (2014). https://rua.ua.es/dspace/bitstream/10045/50586/1/transforming-library-catalogs-linked-data.pdf
16. Han, M.-J.: Linked data in library services: transforming the library catalog to linked data. In: Proceedings of Semantic Web in Libraries. Hamburg, 23–25 November 2015 (2015). http://www.bi-international.de/download/file/SWIB2015-MJHan-Report.pdf
17. Hanson, E.: A beginner’s guide to creating library linked data: lessons from NCSU’s organization name linked data project. Ser. Rev. 40(40), 251–258 (2014)
18. Haslhofer, B., Isaac, A.: data.europeana.edu: the Europeana linked open data pilot. In: Proceedings of International Conference on Dublin Core and Metadata Applications 2011 (2011). http://dcpapers.dublincore.org/pubs/article/view/3625/1851
19. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space, 1st edn. Morgan & Claypool, London (2011). http://linkeddatabook.com/book/
20. Hyland, B., Atemezing, G.A., Villazón-Terrazas, B.: Best practices for publishing linked data (2017). https://dvcs.w3.org/hg/gld/raw-file/cb6dde2928e7/bp/index.html
21. Hyland, B., Villazón-Terrazas, B.: Linked data cookbook: cookbook for open government linked data (2011). https://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook
22. Hyland, B., Wood, D.: The joy of data – a cookbook for publishing linked government data on the web. In: Wood, D. (ed.) Linking Government Data, pp. 3–26. Springer, New York (2011). https://doi.org/10.1007/978-1-4614-1767-5_1
23. Kumar, S., Ujjal, M., Utpal, B.: Exposing MARC21 format for bibliographic data as linked data with provenance. J. Libr. Metadata 13(2–3), 212–229 (2013)
24. Lampert, C.K., Southwick, S.B.: Leading to linking: introducing linked data to academic library digital collections. J. Libr. Metadata 13(2–3), 230–253 (2013)
25. Malmsten, M.: Making a library catalogue part of the semantic web. In: Proceedings of International Conference on Dublin Core and Metadata Applications, pp. 146–152 (2008). http://dcpapers.dublincore.org/pubs/article/view/927/923
26. Malmsten, M.: Exposing library data as linked data. IFLA 2009 (2009). http://wtlab.um.ac.ir/images/e-library/linked_data/other/Exposing%20Library%20Data%20as%20Linked%20Data.pdf
27. Miller, P.: Linked data horizon scan (2010). http://cloudofdata.s3.amazonaws.com/FINAL-201001-LinkedDataHorizonScan.pdf
28. Park, H., Kipp, M.E.I.: Evaluation of mappings from MARC to linked data. In: Proceedings of 25th ASIS SIG/CR Classification Research Workshop (2015). http://journals.lib.washington.edu/index.php/acro/article/view/14908/12495
29. Simon, A., Wenz, R., Michel, V., Di Mascio, A.: Publishing bibliographic records on the web of data: opportunities for the BnF (French National Library). In: Cimiano, P., Corcho, O., Presutti, V., Hollink, L., Rudolph, S. (eds.) ESWC 2013. LNCS, vol. 7882, pp. 563–577. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38288-8_38
30. Southwick, S.B.: A guide for transforming digital collections metadata into linked data using open source technologies. J. Libr. Metadata 15(1), 1–35 (2015)
31. Styles, B., Ayers, D., Shabir, N.: Semantic MARC, MARC21 and the semantic web. In: Proceedings of Linked Data on the Web Workshop, 17th International World Wide Web Conference (WWW2008), Beijing, China, 22 April 2008 (2008). http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-369/paper02.pdf
32. Vila-Suero, D., Rodríguez-Escolano, E.: Linked data at the Spanish national library and the application of IFLA RDFS models. IFLA SCATNews 35, 5–6 (2011)
33. Vila-Suero, D., Villazón-Terrazas, B., Gómez-Pérez, A.: Datos.bne.es: a library linked data dataset. Sem. Web J. 4(3), 307–313 (2012)
34. Wenz, R.: Linked open data for new library services: the example of data.bnf.fr. JLIS.it 4(1), 403–415 (2013). http://leo.cineca.it/index.php/jlis/article/viewFile/5509/7919
35. Zeng, M.L., Gracy, K.F., Skirvin, L.: Navigating the intersection of library bibliographic data and linked music information sources: a study of the identification of useful metadata elements for interlinking. J. Libr. Metadata 13(2–3), 254–278 (2013)
Acknowledgements
This paper was supported by the Ministry of Science and Technology of Taiwan under MOST Grants: MOST 105-2410-H-032-057.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Chen, YN. (2017). A Review of Practices for Transforming Library Legacy Records into Linked Open Data. In: Garoufallou, E., Virkus, S., Siatri, R., Koutsomiha, D. (eds) Metadata and Semantic Research. MTSR 2017. Communications in Computer and Information Science, vol 755. Springer, Cham. https://doi.org/10.1007/978-3-319-70863-8_12
DOI: https://doi.org/10.1007/978-3-319-70863-8_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70862-1
Online ISBN: 978-3-319-70863-8
eBook Packages: Computer Science, Computer Science (R0)