Generalization of Hindi OCR Using Adaptive Segmentation and Font Files

Agrawal, Mudit; Ma, Huanfeng; Doermann, David

doi:10.1007/978-1-84800-330-9_10

Part of the book series: Advances in Pattern Recognition (ACVPR)

Abstract

In this chapter, we describe an adaptive Indic OCR system implemented as part of a rapidly retargetable language tool effort, extending the work in [20, 2]. The system comprises script identification, character segmentation, training sample creation, and character recognition. For script identification, Hindi words are identified in bilingual or multilingual document images using features of the Devanagari script and a support vector machine (SVM). Identified words are then segmented into individual characters by a font-model-based intelligent character segmentation and recognition system. Using characteristics of structurally similar TrueType fonts, our system automatically builds a model to be used for the segmentation and recognition of the new script, independent of glyph composition. The key is a reliance on known font attributes. In our recognition system, three feature extraction methods are used to demonstrate the importance of appropriate features for classification. The methods are tested on both Latin and non-Latin scripts. Results show that the character-level recognition accuracy exceeds 92% for non-Latin and 96% for Latin text on degraded documents. This work is a step toward the recognition of scripts of low-density languages, which typically do not warrant the development of commercial OCR yet often have complete TrueType font descriptions.
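
The abstract does not list the Devanagari features used for script identification. As a rough illustration only, the sketch below uses one well-known structural property of the script, the headline (shirorekha) that joins the characters of a word, measured with a horizontal projection profile. The function names, the toy images, and the 0.8 threshold are all hypothetical, and the simple threshold rule stands in for the SVM described in the chapter.

```python
# Illustrative sketch only: the feature and the threshold below are assumptions,
# not the chapter's actual feature set or its trained SVM.

def row_profile(word):
    """Horizontal projection profile: number of ink pixels (1s) in each row."""
    return [sum(row) for row in word]

def headline_score(word):
    """Fraction of the word width covered by the densest row in the upper half.

    Devanagari words usually carry a connecting headline, so this value tends
    to be close to 1.0 for them and lower for Latin words.
    """
    if not word or not word[0]:
        return 0.0
    width = len(word[0])
    upper = row_profile(word)[: max(1, len(word) // 2)]
    return max(upper) / width

def looks_devanagari(word, threshold=0.8):
    """Hypothetical threshold rule standing in for the SVM decision."""
    return headline_score(word) >= threshold

# Toy binary word images (1 = ink, 0 = background).
hindi_like = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],  # headline spans the full word width
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
latin_like = [
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

print(looks_devanagari(hindi_like), looks_devanagari(latin_like))
```

In a system like the one described, a score of this kind would be one entry in a feature vector passed to a trained classifier rather than compared against a fixed cutoff.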

References

1. Abd, M.A., Paschos, G.: Effective Arabic character recognition using Support Vector Machines. In: Innovations and Advanced Techniques in Computer and Information Sciences and Engineering, pp. 7–11. Springer-Verlag, London, UK (2007)
2. Agrawal, M., Doermann, D.: Re-targetable OCR with intelligent character segmentation. In: The Eighth IAPR International Workshop on Document Analysis Systems (DAS ’08), pp. 183–190 (2008)
3. Bansal, V.: Integrating knowledge sources in Devanagari text recognition. Ph.D. thesis, Indian Institute of Technology, Kanpur, India (1999)
4. Bansal, V., Sinha, R.: Segmentation of touching and fused Devanagari characters. Pattern Recognition 35, 875–893 (2002)
5. Bhattacharya, U., Das, T., Datta, A., Parui, S., Chaudhuri, B.: A hybrid scheme for handprinted numeral recognition based on a self-organizing network and MLP classifiers. International Journal of Pattern Recognition and Artificial Intelligence 16(7), 845–864 (2002)
6. Britto, A., Sabourin, R., Bortolozzi, F., Suen, C.: The recognition of handwritten numerals strings using a two-stage HMM based method. International Journal on Document Analysis and Recognition 5, 102–117 (2003)
7. Casey, G., Lecolinet, E.: A survey of methods and strategies in character segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 690–706 (1996)
8. Chaudhuri, B., Pal, U.: An OCR system to read two Indian language scripts: Bangla and Devanagari (Hindi). In: Proceedings of 4th International Conference on Document Analysis and Recognition, pp. 1011–1016. Germany (1997)
9. Chaudhuri, B., Pal, U.: Skew angle detection of digitized Indian script documents. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(2), 182–186 (1997)
10. Chi, Y., Yan, H.: Handwritten numeral recognition using self-organizing maps and fuzzy rules. Pattern Recognition 28, 56–66 (1995)
11. Choisy, C., Belaid, A.: Cross-learning in analytic word recognition without segmentation. International Journal on Document Analysis and Recognition 4, 281–289 (2002)
12. Dhanya, D., Ramakrishnan, A.G.: Optimal feature extraction for bilingual OCR. In: DAS ’02: Proceedings of the 5th International Workshop on Document Analysis Systems V, pp. 25–36. Springer-Verlag, London, UK (2002)
13. Gorman, L.O., Kasturi, R.: Document image analysis: A bibliography. Machine Vision and Applications 5(3), 231–243 (1992)
14. Granlund, G.H.: Fourier preprocessing for Hand Print Character Recognition. IEEE Transactions on Computers C–21(2), 195–201 (1972)
15. Hull, J.J.: Document image skew detection: Survey and annotated bibliography. In: J.J. Hull, S.L. Taylor (eds.) Document Analysis Systems II. World Scientific (1998)
16. Ittner, D.J., Baird, H.S.: Language-free layout analysis. In: IAPR 2nd Int’l Conf. on Document Analysis and Recognition, pp. 336–340. Tsukuba Science City, Japan (1993)
17. Kato, N., Suzuki, M., Omachi, S., Aso, H., Nemoto, Y.: A handwritten character recognition system using directional element feature and asymmetric Mahalanobis distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(3), 258–262 (1999)
18. Khotanzad, A., Hong, Y.H.: Invariant image recognition by Zernike moments. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(5), 489–497 (1990)
19. Kopec, G., Chou, P.: Document image decoding using Markov source models. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(6), 602–617 (1994)
20. Ma, H., Doermann, D.: Adaptive Hindi OCR using generalized Hausdorff image comparison. ACM Transactions on Asian Language Information Processing 26(2), 198–213 (2003)
21. Ma, H., Doermann, D.: Bootstrapping structured page segmentation. In: SPIE Conference Document Recognition and Retrieval, pp. 179–188. Santa Clara, CA (2003)
22. Ma, H., Doermann, D.: Gabor filter based multi-class classifier for scanned document images. In: 7th International Conference on Document Analysis and Recognition (ICDAR), pp. 968–972. Edinburgh, Scotland (2003)
23. Ma, H., Doermann, D.: Word level script identification for scanned document images. In: SPIE Conference Document Recognition and Retrieval. San Jose, CA (2004). To appear
24. Ma, H., Doermann, D.: Adaptive OCR with limited user feedback. In: International Conference on Document Analysis and Recognition, pp. 814–818 (2005)
25. Mahmoud, S.: Arabic character recognition using Fourier descriptors and character contour encoding. Pattern Recognition 27, 815–824 (1994)
26. McGregor, R.: The Oxford Hindi-English Dictionary. Oxford University Press, Oxford, Delhi (1993). ISBN 0-19-864339-X
27. Nartker, T.A., Rice, S.V., Lumos, S.E.: Software tools and test data for research and testing of page-reading OCR systems. Document Recognition and Retrieval XII 5676, 37–47 (2005)
28. O’Gorman, L.: The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1162–1173 (1993)
29. Plamondon, R., Srihari, S.: On-line and off-line handwritten recognition: A comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 62–84 (2000)
30. Stefano, D., Cioppa, A., Marcelli, A.: Handwritten numeral recognition by means of Evolutionary Algorithms. International Conference on Document Analysis and Recognition 00, 804–807 (1999)
31. Suen, Y., Berthod, M., Mori, S.: Automatic recognition of hand-printed character-the state of art. Proceedings of IEEE 68, 469–487 (1980)
32. Teague, M.: Image analysis via the general theory of moments. Journal of the Optical Society of America 70(8), 920–930 (1979)

Author information

Authors and Affiliations

LAMP of UMIACS, University of Maryland, 20742, College Park, MD, USA
Mudit Agrawal & David Doermann
19026, Drexel Hill, PA, USA
Huanfeng Ma

Corresponding author

Correspondence to Mudit Agrawal.

Editor information

Editors and Affiliations

Center of Excellence for Document Analysis & Recognition (CEDAR), Lee Entrance 520, Amherst, 14228, U.S.A.
Venu Govindaraju & Srirangaraj (Ranga) Setlur

About this chapter

Cite this chapter

Agrawal, M., Ma, H., Doermann, D. (2009). Generalization of Hindi OCR Using Adaptive Segmentation and Font Files. In: Govindaraju, V., Setlur, S. (eds) Guide to OCR for Indic Scripts. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84800-330-9_10

DOI: https://doi.org/10.1007/978-1-84800-330-9_10
Published: 28 August 2009
Publisher Name: Springer, London
Print ISBN: 978-1-84800-329-3
Online ISBN: 978-1-84800-330-9
eBook Packages: Computer Science, Computer Science (R0)