Text Detection in Document Images by Machine Learning Algorithms

  • Conference paper
Proceedings of the 9th International Conference on Computer Recognition Systems CORES 2015

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 403))

Abstract

In this paper, we consider the problem of text detection in document images, a challenging task that plays an important role in OCR systems. In the first step of the proposed approach, a self-adjusting bottom-up segmentation algorithm, based on the Sobel edge detection method, segments a document image into a set of connected components (CCs). In the second step, each CC is described by 27 features, and a machine learning algorithm classifies it as text or nontext. To test the approach, we collected a dataset (ASTRoID) containing 500 images of text blocks and 500 images of nontext blocks. We empirically compare the performance of the proposed text detection method with seven different machine learning algorithms.
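To make the two-step idea concrete, the following is a minimal sketch of the first step only: Sobel edge magnitude, thresholding, and grouping of edge pixels into connected components, each summarized by a few shape descriptors. It is not the authors' implementation. The fixed `threshold` stands in for the self-adjusting rule, which the abstract does not specify, and the four descriptors returned by the hypothetical `describe` helper are illustrative placeholders rather than the paper's 27 features. The classification step is omitted.

```python
from collections import deque


def sobel_magnitude(img):
    """Approximate Sobel gradient magnitude (|gx| + |gy|) of a 2D grayscale image."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    # Border pixels are left at 0 because the 3x3 kernel does not fit there.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = abs(gx) + abs(gy)
    return out


def connected_components(mask):
    """Group True pixels into 8-connected components (lists of (y, x) tuples)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited edge pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def describe(pixels):
    """Illustrative descriptors for one component (NOT the paper's 27 features)."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        "fill": len(pixels) / (width * height),  # share of bounding box covered
    }


def segment(img, threshold=200):
    """Sobel edges -> binary mask -> connected components -> descriptors.

    `threshold` is an assumed fixed value, not the paper's self-adjusting rule.
    """
    magnitude = sobel_magnitude(img)
    mask = [[value >= threshold for value in row] for row in magnitude]
    return [describe(component) for component in connected_components(mask)]
```

In a full pipeline along the lines the abstract describes, each descriptor dictionary would become one feature vector passed to a text/nontext classifier.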

Acknowledgments

This work was supported by the Creative Core FISNM-3330-13-500033 ‘Simulations’ project, funded by the European Union through the European Regional Development Fund. The operation is carried out within the framework of the Operational Programme for Strengthening Regional Development Potentials for the period 2007–2013, Development Priority 1: Competitiveness and research excellence, Priority Guideline 1.1: Improving the competitive skills and research excellence.

Corresponding author

Correspondence to Darko Zelenika.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Zelenika, D., Povh, J., Ženko, B. (2016). Text Detection in Document Images by Machine Learning Algorithms. In: Burduk, R., Jackowski, K., Kurzyński, M., Woźniak, M., Żołnierek, A. (eds) Proceedings of the 9th International Conference on Computer Recognition Systems CORES 2015. Advances in Intelligent Systems and Computing, vol 403. Springer, Cham. https://doi.org/10.1007/978-3-319-26227-7_16

  • DOI: https://doi.org/10.1007/978-3-319-26227-7_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26225-3

  • Online ISBN: 978-3-319-26227-7

  • eBook Packages: Engineering, Engineering (R0)
