Automatic Information Extraction from Scanned Documents

  • Conference paper
Speech and Computer (SPECOM 2020)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12335)

Abstract

This paper deals with the task of extracting information from structured documents scanned by an ordinary office scanner. It explores the processing pipeline from scanned paper documents to the extraction of the sought information, such as names, addresses, dates, and other numerical values.

We propose a system design decomposed into four consecutive modules: preprocessing, optical character recognition, information extraction with a database, and information extraction without a database. In the preprocessing module, two essential techniques are presented: image quality improvement and image deskewing. Optical character recognition solutions and approaches to information extraction are compared by the performance of the whole system. The best performance of information extraction with the database was achieved with the locality-sensitive hashing algorithm.
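The abstract does not say how locality-sensitive hashing is applied, so the following is only a minimal sketch of one common variant: MinHash signatures over character trigrams, with banding to retrieve database entries that resemble a noisy OCR string. All names (`shingles`, `minhash`, `LSHIndex`) and parameters (`num_perm`, `bands`) are hypothetical and are not taken from the paper or from the libraries listed in the notes.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=3):
    """Character n-grams of a lower-cased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) <= n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def _hash(seed, gram):
    """Seeded 64-bit hash of one shingle (assumed stand-in for a random permutation)."""
    data = f"{seed}:{gram}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash(text, num_perm=32):
    """MinHash signature: for each seed, the minimum hash over all shingles."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(num_perm))


class LSHIndex:
    """Banded MinHash index: entries sharing any band become candidates."""

    def __init__(self, num_perm=32, bands=16):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.num_perm = num_perm
        self.bands = bands
        self.rows = num_perm // bands
        self.buckets = defaultdict(set)
        self.signatures = {}

    def _band_keys(self, sig):
        # Each key pairs the band number with that band's slice of the signature.
        for b in range(self.bands):
            yield b, sig[b * self.rows:(b + 1) * self.rows]

    def add(self, entry):
        sig = minhash(entry, self.num_perm)
        self.signatures[entry] = sig
        for key in self._band_keys(sig):
            self.buckets[key].add(entry)

    def query(self, text):
        """Return candidate entries, best estimated Jaccard similarity first."""
        sig = minhash(text, self.num_perm)
        candidates = set()
        for key in self._band_keys(sig):
            candidates |= self.buckets.get(key, set())

        def score(entry):
            other = self.signatures[entry]
            return sum(a == b for a, b in zip(sig, other)) / self.num_perm

        return sorted(candidates, key=score, reverse=True)
```

A database of known names or addresses would be loaded once with `add`, and each OCR output string would then be passed to `query`. Retrieval of a noisy string is probabilistic: more bands with fewer rows per band raise recall at the cost of more false candidates, so a final check with an edit distance on the returned candidates is a reasonable assumption for a complete system.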

Notes

  1. https://github.com/go2starr/lshhdc

  2. https://pypi.org/project/lshash/

  3. http://www.cs.ubc.ca/research/flann

Acknowledgments

This publication was supported by the project LO1506 of the Czech Ministry of Education, Youth and Sports and by the grant of the University of West Bohemia, project No. SGS-2019-027.

Author information

Corresponding author

Correspondence to Lukáš Bureš.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Bureš, L., Neduchal, P., Müller, L. (2020). Automatic Information Extraction from Scanned Documents. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_9

  • DOI: https://doi.org/10.1007/978-3-030-60276-5_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60275-8

  • Online ISBN: 978-3-030-60276-5

  • eBook Packages: Computer Science, Computer Science (R0)
