Abstract
Automatic cataloging of thousands of paper-based structured documents is a crucial cost-saving task for future document management systems. Current optical character recognition (OCR) systems process tabular data with a sufficient level of character-level accuracy; however, recovering the overall structure of the document metadata is still an open practical problem.
In this paper, we introduce the OCRMiner system, designed to extract the indexing metadata of structured documents obtained from an image scanning process and OCR. We present the details of the system's modular architecture and evaluate the detection of text block types that appear within invoice documents. The system is based on text analysis in combination with layout features, and is developed and tested in cooperation with a renowned copy machine producer. The system uses an open-source OCR engine and reaches an overall accuracy of 80.1%.
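As a rough illustration of how text analysis can be combined with layout features to assign block types, consider the following minimal sketch. It is not the OCRMiner implementation: the `Block` structure, the label names, the keyword patterns, and the position thresholds are all assumptions chosen for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Block:
    """One OCR text block; x and y are the top-left corner, normalized to 0..1."""
    text: str
    x: float
    y: float


# Hypothetical keyword cues; a real system would use far richer text analysis.
TEXT_CUES = {
    "invoice_number": re.compile(r"\binvoice\s*(no|number|#)", re.I),
    "date": re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b"),
    "total": re.compile(r"\b(total|amount due)\b", re.I),
}


def classify(block: Block) -> str:
    """Label a block using text cues first, then layout as a fallback."""
    for label, pattern in TEXT_CUES.items():
        if pattern.search(block.text):
            return label
    # Assumed layout heuristic: party details tend to sit near the top of the page.
    if block.y < 0.3:
        return "seller" if block.x < 0.5 else "buyer"
    return "other"


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of blocks whose predicted type matches the annotated type."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

An overall accuracy figure such as the one reported above is the kind of value that `accuracy` computes: the fraction of blocks whose predicted type agrees with a manually annotated type.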
Acknowledgments
This work has been partly supported by Konica Minolta Business Solution Czech within the OCR Miner project and by the Masaryk University project MUNI/33/55939/2017.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Ha, H.T., Medved’, M., Nevěřilová, Z., Horák, A. (2018). Recognition of OCR Invoice Metadata Block Types. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science, vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_33
DOI: https://doi.org/10.1007/978-3-030-00794-2_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00793-5
Online ISBN: 978-3-030-00794-2
eBook Packages: Computer Science, Computer Science (R0)