Abstract
The digital era raises new challenges for traditional library services, in which information must be delivered and supported by technology-enhanced systems. The increasing need for rapid access to information requires librarians to re-evaluate the way they develop, manage, and deliver resources and services. However, most information extraction systems are not designed to work with PDF files generated through Optical Character Recognition (OCR), and several problems arise when restructuring the recognized text, such as disrupted paragraphs, improper page breaks, and loss of content structure. This paper introduces a pre-processing pipeline designed to support university libraries in adequately indexing old document collections. The extracted text is indexed into Elasticsearch, which facilitates keyword-based search for relevant documents. The information extraction system is designed to assist librarians in the digitization process by enabling a systematic review of documents, which leads to more accurate representations of the indexed files.
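To make the restructuring problems named in the abstract concrete, the following is a minimal illustrative sketch in Python, not the authors' pipeline. It assumes OCR output arrives as one plain-text string per page with blank lines between paragraphs. The function names, the `title` and `content` field names, and the punctuation heuristic are all hypothetical. The final helper builds a standard Elasticsearch `match` query body as a plain dictionary, so nothing here contacts a cluster.

```python
import re

# Assumption: a paragraph that ends with one of these characters is complete.
TERMINAL = (".", "!", "?", ":", '"', "\u201d")


def page_to_paragraphs(page_text):
    """Split one OCR page into paragraphs (blank lines separate them)."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", page_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        text = lines[0]
        for line in lines[1:]:
            if text.endswith("-"):
                # Naive de-hyphenation: join a word split at a line break.
                # This also joins genuine hyphenated compounds.
                text = text[:-1] + line
            else:
                text += " " + line
        paragraphs.append(text)
    return paragraphs


def reconstruct(pages):
    """Merge paragraphs that a page break cut in two."""
    result = []
    for page in pages:
        for i, para in enumerate(page_to_paragraphs(page)):
            # Heuristic: the first paragraph of a page continues the previous
            # one when that one does not end with terminal punctuation.
            continues = bool(i == 0 and result and not result[-1].endswith(TERMINAL))
            if continues:
                prev = result.pop()
                joined = prev[:-1] + para if prev.endswith("-") else prev + " " + para
                result.append(joined)
            else:
                result.append(para)
    return result


def to_index_document(title, pages):
    """Build a JSON-ready document body (hypothetical field names)."""
    return {"title": title, "content": "\n\n".join(reconstruct(pages))}


def keyword_query(keywords):
    """Build an Elasticsearch `match` query body on the assumed `content` field."""
    return {"query": {"match": {"content": keywords}}}


pages = [
    "The library holds old docu-\nments that were scanned.\n\nA second paragraph runs over the",
    "page break and ends here.\n\nLast paragraph.",
]
paragraphs = reconstruct(pages)
```

The punctuation heuristic is deliberately simple: a heading or caption at the bottom of a page would be merged into the next page's first paragraph, so a real pipeline would need layout cues such as font size or indentation to decide where paragraphs end.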
Acknowledgement
This work was supported by a grant of the Romanian Ministry of Research and Innovation, CCCDI - UEFISCDI, project number PN-III-P1-1.2-PCCDI-2017-0689/“Lib2Life - Revitalizarea bibliotecilor si a patrimoniului cultural prin tehnologii avansate”/“Revitalizing Libraries and Cultural Heritage through Advanced Technologies”, within PNCDI III.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Nitu, M., Dascalu, M., Dascalu, MI., Cotet, TM., Tomescu, S. (2020). Reconstructing Scanned Documents for Full-Text Indexing to Empower Digital Library Services. In: Popescu, E., Hao, T., Hsu, TC., Xie, H., Temperini, M., Chen, W. (eds) Emerging Technologies for Education. SETE 2019. Lecture Notes in Computer Science, vol 11984. Springer, Cham. https://doi.org/10.1007/978-3-030-38778-5_21
DOI: https://doi.org/10.1007/978-3-030-38778-5_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38777-8
Online ISBN: 978-3-030-38778-5
eBook Packages: Computer Science, Computer Science (R0)