Automatic Information Extraction from Scanned Documents

Bureš, Lukáš; Neduchal, Petr; Müller, Luděk

doi:10.1007/978-3-030-60276-5_9

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12335)

Included in the following conference series: International Conference on Speech and Computer

Abstract

This paper deals with the task of information extraction from structured documents scanned with an ordinary office scanner. It explores the processing pipeline from scanned paper documents to the extraction of the sought information, such as names, addresses, dates, and other numerical values.

We propose a system design decomposed into four consecutive modules: preprocessing, optical character recognition, information extraction with a database, and information extraction without a database. In the preprocessing module, two essential techniques are presented: image quality improvement and image deskewing. Optical character recognition solutions and approaches to information extraction are compared by the performance of the whole system. The best performance of information extraction with the database was obtained by the Locality-sensitive Hashing algorithm.
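The abstract names Locality-sensitive Hashing (LSH) as the best-performing approach for extraction with a database but does not describe how it was implemented. As a rough illustration only, the sketch below assumes one common LSH variant: MinHash signatures over character trigrams with banding, used to match a noisy OCR string against known database entries. The names `shingles`, `minhash`, and `LshIndex`, the parameter values, and the sample entries are all hypothetical and not taken from the paper.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=3):
    """Return the set of character n-grams of a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) <= n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text, num_perm=64):
    """MinHash signature: the minimum hash of the n-gram set under each seed."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(num_perm))


class LshIndex:
    """Banded MinHash index (an assumed design, not the paper's implementation)."""

    def __init__(self, num_perm=64, bands=32):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.num_perm = num_perm
        self.bands = bands
        self.rows = num_perm // bands
        self.buckets = defaultdict(set)  # (band index, band slice) -> entries
        self.entries = {}                # entry -> its n-gram set

    def _keys(self, signature):
        # Split the signature into bands; each band is one bucket key.
        for b in range(self.bands):
            yield b, signature[b * self.rows:(b + 1) * self.rows]

    def add(self, entry):
        self.entries[entry] = shingles(entry)
        for key in self._keys(minhash(entry, self.num_perm)):
            self.buckets[key].add(entry)

    def query(self, text):
        """Return (entry, Jaccard similarity) pairs for colliding entries, best first."""
        candidates = set()
        for key in self._keys(minhash(text, self.num_perm)):
            candidates |= self.buckets.get(key, set())
        grams = shingles(text)
        scored = [
            (e, len(grams & self.entries[e]) / len(grams | self.entries[e]))
            for e in candidates
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    index = LshIndex()
    for name in ["Jan Novák", "Petr Svoboda", "Univerzitní 8, Plzeň"]:  # hypothetical database
        index.add(name)
    # An OCR result with a character error; matching is probabilistic.
    print(index.query("Petr Svobada"))
```

Banding trades recall for speed: only entries that share at least one band with the query are scored exactly, so a lookup avoids comparing the OCR string against every database entry. An identical string always collides in every band, while a noisy string collides with a probability that grows with its Jaccard similarity to the entry.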

Acknowledgments

This publication was supported by the project LO1506 of the Czech Ministry of Education, Youth and Sports and by the grant of the University of West Bohemia, project No. SGS-2019-027.

Author information

Authors and Affiliations

Faculty of Applied Sciences, New Technologies for the Information Society, University of West Bohemia, Univerzitní 8, 306 14, Plzeň, Czech Republic
Lukáš Bureš, Petr Neduchal & Luděk Müller

Corresponding author

Correspondence to Lukáš Bureš.

Editor information

Editors and Affiliations

St. Petersburg Institute for Informatics and Automation, Russian Academy of Sciences, St. Petersburg, Russia
Alexey Karpov
Institute for Applied and Mathematical Linguistics, Moscow State Linguistic University, Moscow, Russia
Rodmonga Potapova

About this paper

Cite this paper

Bureš, L., Neduchal, P., Müller, L. (2020). Automatic Information Extraction from Scanned Documents. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_9

DOI: https://doi.org/10.1007/978-3-030-60276-5_9
Published: 29 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60275-8
Online ISBN: 978-3-030-60276-5
eBook Packages: Computer Science, Computer Science (R0)
