Abstract
This paper addresses information extraction from structured documents scanned with an ordinary office scanner. It explores the processing pipeline from the scanned paper document to the extraction of the searched information, such as names, addresses, dates, and other numerical values.
We propose a system design decomposed into four consecutive modules: preprocessing, optical character recognition, information extraction with a database, and information extraction without a database. In the preprocessing module, two essential techniques are presented: image quality improvement and image deskewing. Optical character recognition solutions and approaches to information extraction are compared in terms of the performance of the whole system. For information extraction with the database, the best performance was obtained with the Locality-sensitive Hashing algorithm.
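To make the database-backed extraction step concrete, the following is a minimal sketch of how Locality-sensitive Hashing can match a noisy OCR string against known database records. The abstract does not state which LSH variant or parameters the system uses, so everything here is an assumption for illustration: MinHash over character trigrams with banding, and the hypothetical names `shingles`, `minhash_signature`, and `MinHashLSH`.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of character k-grams of a case- and whitespace-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) < k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; each seed acts as a separate hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=64):
    """MinHash signature: the minimum hash over all shingles, for each seed."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(num_perm))


class MinHashLSH:
    """Hypothetical banded MinHash index (illustrative, not the paper's implementation)."""

    def __init__(self, num_perm=64, bands=32):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.num_perm = num_perm
        self.bands = bands
        self.rows = num_perm // bands
        self.buckets = defaultdict(set)  # (band index, band slice) -> record indices
        self.records = []

    def _band_keys(self, signature):
        # Split the signature into bands; each band is one bucket key.
        for b in range(self.bands):
            yield (b, signature[b * self.rows:(b + 1) * self.rows])

    def add(self, record):
        """Insert a database record into every bucket its signature bands map to."""
        idx = len(self.records)
        self.records.append(record)
        for key in self._band_keys(minhash_signature(record, self.num_perm)):
            self.buckets[key].add(idx)

    def query(self, text):
        """Return (jaccard, record) pairs for bucket-colliding records, best first."""
        candidates = set()
        for key in self._band_keys(minhash_signature(text, self.num_perm)):
            candidates |= self.buckets.get(key, set())
        q = shingles(text)
        scored = []
        for i in candidates:
            r = shingles(self.records[i])
            scored.append((len(q & r) / len(q | r), self.records[i]))
        return sorted(scored, reverse=True)


if __name__ == "__main__":
    index = MinHashLSH()
    for name in ["University of West Bohemia", "Springer Nature Switzerland AG"]:
        index.add(name)
    # An OCR-like misreading ("l" for "i"); candidates are re-ranked by exact Jaccard.
    print(index.query("Unlversity of West Bohemia"))
```

Only records that share at least one band with the query are scored, which is what keeps lookups cheap on a large database. The number of bands trades recall against candidate-set size and would need tuning on real OCR output.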
Acknowledgments
This publication was supported by the project LO1506 of the Czech Ministry of Education, Youth and Sports and by the grant of the University of West Bohemia, project No. SGS-2019-027.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Bureš, L., Neduchal, P., Müller, L. (2020). Automatic Information Extraction from Scanned Documents. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_9
DOI: https://doi.org/10.1007/978-3-030-60276-5_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60275-8
Online ISBN: 978-3-030-60276-5
eBook Packages: Computer Science, Computer Science (R0)