Text Detection in Document Images by Machine Learning Algorithms

Zelenika, Darko; Povh, Janez; Ženko, Bernard

doi:10.1007/978-3-319-26227-7_16

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 403)

Abstract

In this paper, we consider the problem of text detection in document images. This problem plays an important role in OCR systems and remains a challenging task. Our approach has two steps. In the first step, a self-adjusting bottom-up segmentation algorithm, based on the Sobel edge detection method, segments a document image into a set of connected components (CCs). In the second step, each CC is described by 27 features, and a machine learning algorithm classifies it as text or nontext. To test the approach, we collected a dataset (ASTRoID) containing 500 images of text blocks and 500 images of nontext blocks. We empirically compare the performance of the proposed text detection method across seven different machine learning algorithms.
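
The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a grayscale image stored as a list of rows, a fixed edge threshold in place of the self-adjusting one, and four hypothetical features (width, height, aspect ratio, density) standing in for the 27 features, which the abstract does not list. All function names are illustrative.

```python
from collections import deque


def sobel_magnitude(img):
    """Approximate gradient magnitude |gx| + |gy| with 3x3 Sobel kernels.

    Border pixels are left at 0 because the kernel does not fit there.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = abs(gx) + abs(gy)
    return out


def connected_components(mask):
    """Group 8-connected True pixels; return one list of (y, x) per component."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:  # breadth-first flood fill
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def describe(pixels):
    """Hypothetical bounding-box features for one component."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        "density": len(pixels) / (width * height),
    }


def segment(img, threshold=200):
    """Edge map -> binary mask -> components -> feature dictionaries.

    The fixed threshold is an assumption made for this sketch.
    """
    magnitude = sobel_magnitude(img)
    mask = [[m >= threshold for m in row] for row in magnitude]
    return [describe(c) for c in connected_components(mask)]


if __name__ == "__main__":
    # White page with two dark blocks of different widths.
    page = [[255] * 20 for _ in range(9)]
    for y in range(3, 6):
        for x in range(3, 6):
            page[y][x] = 0
        for x in range(10, 17):
            page[y][x] = 0
    for features in segment(page):
        print(features)
```

The feature dictionaries returned by `segment` would then be passed to a classifier trained on labelled text and nontext examples. Which classifier to use is the question the paper's comparison of seven algorithms addresses, so the sketch stops before that step.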

Acknowledgments

The presented work was supported by the Creative Core FISNM-3330-13-500033 ‘Simulations’ project, funded by the European Union through the European Regional Development Fund. The operation is carried out within the framework of the Operational Programme for Strengthening Regional Development Potentials for the period 2007–2013, Development Priority 1: Competitiveness and research excellence, Priority Guideline 1.1: Improving the competitive skills and research excellence.

Author information

Authors and Affiliations

Laboratory of Data Technologies, Faculty of Information Studies, Ulica Talcev 3, 8000, Novo Mesto, Slovenia
Darko Zelenika & Janez Povh
Jožef Stefan Institute, Department of Knowledge Technologies, Jamova Cesta 39, 1000, Ljubljana, Slovenia
Bernard Ženko

Corresponding author

Correspondence to Darko Zelenika.

Editor information

Editors and Affiliations

Department of Systems, Wrocław University of Technology, Wroclaw, Poland
Robert Burduk
Department of Systems and Computer, Wrocław University of Technology, Wroclaw, Poland
Konrad Jackowski
Department of Systems and Computer, Wrocław University of Technology, Wroclaw, Poland
Marek Kurzyński
Dept. of Systems and Computer Networks, Wrocław University of Technology, Wroclaw, Poland
Michał Woźniak
Department of Systems, Wrocław University of Technology, Wroclaw, Poland
Andrzej Żołnierek

About this paper

Cite this paper

Zelenika, D., Povh, J., Ženko, B. (2016). Text Detection in Document Images by Machine Learning Algorithms. In: Burduk, R., Jackowski, K., Kurzyński, M., Woźniak, M., Żołnierek, A. (eds) Proceedings of the 9th International Conference on Computer Recognition Systems CORES 2015. Advances in Intelligent Systems and Computing, vol 403. Springer, Cham. https://doi.org/10.1007/978-3-319-26227-7_16

DOI: https://doi.org/10.1007/978-3-319-26227-7_16
Published: 05 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26225-3
Online ISBN: 978-3-319-26227-7
eBook Packages: Engineering, Engineering (R0)
