Generalization of Hindi OCR Using Adaptive Segmentation and Font Files


Part of the book series: Advances in Pattern Recognition (ACVPR)

Abstract

In this chapter, we describe an adaptive Indic OCR system implemented as part of a rapidly retargetable language tool effort; it extends the work in [2, 20]. The system comprises script identification, character segmentation, training sample creation, and character recognition. For script identification, Hindi words are identified in bilingual or multilingual document images using features of the Devanagari script and a support vector machine (SVM). Identified words are then segmented into individual characters by a font-model-based intelligent character segmentation and recognition system. Using characteristics of structurally similar TrueType fonts, the system automatically builds a model for segmenting and recognizing the new script, independent of glyph composition. The key is a reliance on known font attributes. In the recognition system, three feature extraction methods are used to demonstrate the importance of appropriate features for classification. The methods are tested on both Latin and non-Latin scripts. Results show that character-level recognition accuracy exceeds 92% for non-Latin and 96% for Latin text on degraded documents. This work is a step toward recognizing the scripts of low-density languages, which typically do not warrant the development of commercial OCR yet often have complete TrueType font descriptions.
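
The abstract does not say which Devanagari features drive script identification. As a rough illustration only, the sketch below assumes one well-known structural property of the script, the horizontal headline (shirorekha) that joins the characters of a word, and measures it with a row projection profile. The function names, the toy images, and the fixed threshold are all hypothetical; in the system described above, features of this kind would be passed to an SVM rather than compared with a hand-set threshold.

```python
from typing import List

Image = List[List[int]]  # binary word image: 1 = ink, 0 = background


def row_profile(img: Image) -> List[float]:
    """Fraction of ink pixels in each row of the word image."""
    return [sum(row) / len(row) for row in img]


def headline_feature(img: Image, top_fraction: float = 0.5) -> float:
    """Peak row density within the top part of the image.

    A Devanagari word usually has one nearly solid row (the headline)
    in its upper half; a Latin word usually does not.
    """
    profile = row_profile(img)
    top_rows = max(1, int(len(profile) * top_fraction))
    return max(profile[:top_rows])


def looks_like_devanagari(img: Image, threshold: float = 0.8) -> bool:
    """Hypothetical stand-in for the classifier: a fixed threshold."""
    return headline_feature(img) >= threshold


# Toy word with a solid headline in row 1 (Devanagari-like).
devanagari_like = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

# Toy word with separated strokes and no headline (Latin-like).
latin_like = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

if __name__ == "__main__":
    print(looks_like_devanagari(devanagari_like))
    print(looks_like_devanagari(latin_like))
```

A single projection feature is fragile on skewed or degraded scans, which is one reason a trained classifier over several features is the more plausible design for real documents.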


References

  1. Abd, M.A., Paschos, G.: Effective Arabic character recognition using Support Vector Machines. In: Innovations and Advanced Techniques in Computer and Information Sciences and Engineering, pp. 7–11. Springer-Verlag, London, UK (2007)

  2. Agrawal, M., Doermann, D.: Re-targetable OCR with intelligent character segmentation. In: The Eighth IAPR International Workshop on Document Analysis Systems (DAS ’08), pp. 183–190 (2008)

  3. Bansal, V.: Integrating knowledge sources in Devanagari text recognition. Ph.D. thesis, Indian Institute of Technology, Kanpur, India (1999)

  4. Bansal, V., Sinha, R.: Segmentation of touching and fused Devanagari characters. Pattern Recognition 35, 875–893 (2002)

  5. Bhattacharya, U., Das, T., Datta, A., Parui, S., Chaudhuri, B.: A hybrid scheme for handprinted numeral recognition based on a self-organizing network and MLP classifiers. International Journal for Pattern Recognition and Artificial Intelligence 16(7), 845–864 (2002)

  6. Britto, A., Sabourin, R., Bortolozzi, F., Suen, C.: The recognition of handwritten numerals strings using a two-stage HMM based method. International Journal of Document Analysis and Recognition 5, 102–117 (2003)

  7. Casey, G., Lecolinet, E.: A survey of methods and strategies in character segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 690–706 (1996)

  8. Chaudhuri, B., Pal, U.: An OCR system to read two Indian language scripts: Bangla and Devanagari (Hindi). In: Proceedings of 4th International Conference on Document Analysis and Recognition, pp. 1011–1016. Germany (1997)

  9. Chaudhuri, B., Pal, U.: Skew angle detection of digitized Indian script documents. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(2), 182–186 (1997)

  10. Chi, Y., Yan, H.: Handwritten numeral recognition using self-organizing maps and fuzzy rules. Pattern Recognition 28, 56–66 (1995)

  11. Choisy, C., Belaid, A.: Cross-learning in analytic word recognition without segmentation. International Journal on Document Analysis and Recognition 4, 281–289 (2002)

  12. Dhanya, D., Ramakrishnan, A.G.: Optimal feature extraction for bilingual OCR. In: DAS ’02: Proceedings of the 5th International Workshop on Document Analysis Systems V, pp. 25–36. Springer-Verlag, London, UK (2002)

  13. O’Gorman, L., Kasturi, R.: Document image analysis: A bibliography. Machine Vision and Applications 5(3), 231–243 (1992)

  14. Granlund, G.H.: Fourier preprocessing for Hand Print Character Recognition. IEEE Transactions on Computers C–21(2), 195–201 (1972)

  15. Hull, J.J.: Document image skew detection: Survey and annotated bibliography. In: J.J. Hull, S.L. Taylor (eds.) Document Analysis Systems II. World Scientific (1998)

  16. Ittner, D.J., Baird, H.S.: Language-free layout analysis. In: IAPR 2nd Int’l Conf. on Document Analysis and Recognition, pp. 336–340. Tsukuba Science City, Japan (1993)

  17. Kato, N., Suzuki, M., Omachi, S., Aso, H., Nemoto, Y.: A handwritten character recognition system using directional element feature and asymmetric Mahalanobis distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(3), 258–262 (1999)

  18. Khotanzad, A., Hong, Y.H.: Invariant image recognition by Zernike moments. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(5), 489–497 (1990)

  19. Kopec, G., Chou, P.: Document image decoding using Markov source models. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(6), 602–617 (1994)

  20. Ma, H., Doermann, D.: Adaptive Hindi OCR using generalized Hausdorff image comparison. ACM Transactions on Asian Language Information Processing 26(2), 198–213 (2003)

  21. Ma, H., Doermann, D.: Bootstrapping structured page segmentation. In: SPIE Conference Document Recognition and Retrieval, pp. 179–188. Santa Clara, CA (2003)

  22. Ma, H., Doermann, D.: Gabor filter based multi-class classifier for scanned document images. In: 7th International Conference on Document Analysis and Recognition (ICDAR), pp. 968–972. Edinburgh, Scotland (2003)

  23. Ma, H., Doermann, D.: Word level script identification for scanned document images. In: SPIE Conference Document Recognition and Retrieval. San Jose, CA (2004). To appear

  24. Ma, H., Doermann, D.: Adaptive OCR with limited user feedback. In: International Conference on Document Analysis and Recognition, pp. 814–818 (2005)

  25. Mahmoud, S.: Arabic character recognition using Fourier descriptors and character contour encoding. Pattern Recognition 27, 815–824 (1994)

  26. McGregor, R.: The Oxford Hindi-English Dictionary. Oxford University Press, Oxford, Delhi (1993). ISBN 0-19-864339-X

  27. Nartker, T.A., Rice, S.V., Lumos, S.E.: Software tools and test data for research and testing of page-reading OCR systems. Document Recognition and Retrieval XII 5676, 37–47 (2005)

  28. O’Gorman, L.: The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1162–1173 (1993)

  29. Plamondon, R., Srihari, S.: On-line and off-line handwriting recognition: A comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 62–84 (2000)

  30. Stefano, D., Cioppa, A., Marcelli, A.: Handwritten numeral recognition by means of Evolutionary Algorithms. International Conference on Document Analysis and Recognition 00, 804–807 (1999)

  31. Suen, Y., Berthod, M., Mori, S.: Automatic recognition of hand-printed character-the state of art. Proceedings of IEEE 68, 469–487 (1980)

  32. Teague, M.: Image analysis via the general theory of moments. Journal of the Optical Society of America 70(8), 920–930 (1979)

Author information

Correspondence to Mudit Agrawal.

Copyright information

© 2009 Springer-Verlag London Limited

About this chapter

Cite this chapter

Agrawal, M., Ma, H., Doermann, D. (2009). Generalization of Hindi OCR Using Adaptive Segmentation and Font Files. In: Govindaraju, V., Setlur, S. (eds) Guide to OCR for Indic Scripts. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84800-330-9_10

  • DOI: https://doi.org/10.1007/978-1-84800-330-9_10

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84800-329-3

  • Online ISBN: 978-1-84800-330-9

  • eBook Packages: Computer Science, Computer Science (R0)
