Abstract
In this chapter, we describe an adaptive Indic OCR system that was implemented as part of a rapidly retargetable language tool effort and that extends the work in [20, 2]. The system comprises script identification, character segmentation, training sample creation, and character recognition. For script identification, Hindi words are identified in bilingual or multilingual document images using features of the Devanagari script and a support vector machine (SVM). Identified words are then segmented into individual characters by a font-model-based intelligent character segmentation and recognition system. Using characteristics of structurally similar TrueType fonts, the system automatically builds a model for segmenting and recognizing the new script, independent of glyph composition. The key is a reliance on known font attributes. In the recognition system, three feature extraction methods are used to demonstrate the importance of appropriate features for classification. The methods are tested on both Latin and non-Latin scripts. Results show that character-level recognition accuracy exceeds 92% for non-Latin text and 96% for Latin text on degraded documents. This work is a step toward recognizing the scripts of low-density languages, which typically do not warrant the development of commercial OCR yet often have complete TrueType font descriptions.
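The abstract reports character-level recognition accuracy but does not say how it is computed. As a minimal sketch, assuming the common edit-distance definition (reference length minus the Levenshtein distance to the OCR output, divided by reference length), the figure could be obtained as follows. The function names `edit_distance` and `character_accuracy` are illustrative and not taken from the chapter.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_accuracy(ref: str, hyp: str) -> float:
    """Assumed metric: (len(ref) - edit distance) / len(ref)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return (len(ref) - edit_distance(ref, hyp)) / len(ref)


# One substituted character in a four-character reference gives 0.75.
print(character_accuracy("abcd", "abxd"))
```

Note that Python strings are sequences of Unicode code points, so for Devanagari this sketch counts vowel signs and modifiers as separate characters; an evaluation over composed glyphs would need a different tokenization. The value can also fall below zero when the OCR output is much longer than the reference.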
References
Abd, M.A., Paschos, G.: Effective Arabic character recognition using Support Vector Machines. In: Innovations and Advanced Techniques in Computer and Information Sciences and Engineering, pp. 7–11. Springer-Verlag, London, UK (2007)
Agrawal, M., Doermann, D.: Re-targetable OCR with intelligent character segmentation. In: The Eighth IAPR International Workshop on Document Analysis Systems (DAS ’08), pp. 183–190 (2008)
Bansal, V.: Integrating knowledge sources in Devanagari text recognition. Ph.D. thesis, Indian Institute of Technology, Kanpur, India (1999)
Bansal, V., Sinha, R.: Segmentation of touching and fused Devanagari characters. Pattern Recognition 35, 875–893 (2002)
Bhattacharya, U., Das, T., Datta, A., Parui, S., Chaudhuri, B.: A hybrid scheme for handprinted numeral recognition based on a self-organizing network and MLP classifiers. International Journal for Pattern Recognition and Artificial Intelligence 16(7), 845–864 (2002)
Britto, A., Sabourin, R., Bortolozzi, F., Suen, C.: The recognition of handwritten numeral strings using a two-stage HMM-based method. International Journal of Document Analysis and Recognition 5, 102–117 (2003)
Casey, G., Lecolinet, E.: A survey of methods and strategies in character segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 690–706 (1996)
Chaudhuri, B., Pal, U.: An OCR system to read two Indian language scripts: Bangla and Devanagari (Hindi). In: Proceedings of 4th International Conference on Document Analysis and Recognition, pp. 1011–1016. Germany (1997)
Chaudhuri, B., Pal, U.: Skew angle detection of digitized Indian script documents. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(2), 182–186 (1997)
Chi, Y., Yan, H.: Handwritten numeral recognition using self-organizing maps and fuzzy rules. Pattern Recognition 28, 56–66 (1995)
Choisy, C., Belaid, A.: Cross-learning in analytic word recognition without segmentation. International Journal on Document Analysis and Recognition 4, 281–289 (2002)
Dhanya, D., Ramakrishnan, A.G.: Optimal feature extraction for bilingual OCR. In: DAS ’02: Proceedings of the 5th International Workshop on Document Analysis Systems V, pp. 25–36. Springer-Verlag, London, UK (2002)
O’Gorman, L., Kasturi, R.: Document image analysis: A bibliography. Machine Vision and Applications 5(3), 231–243 (1992)
Granlund, G.H.: Fourier preprocessing for Hand Print Character Recognition. IEEE Transactions on Computers C–21(2), 195–201 (1972)
Hull, J.J.: Document image skew detection: Survey and annotated bibliography. In: J.J. Hull, S.L. Taylor (eds.) Document Analysis Systems II. World Scientific (1998)
Ittner, D.J., Baird, H.S.: Language-free layout analysis. In: IAPR 2nd Int’l Conf. on Document Analysis and Recognition, pp. 336–340. Tsukuba Science City, Japan (1993)
Kato, N., Suzuki, M., Omachi, S., Aso, H., Nemoto, Y.: A handwritten character recognition system using directional element feature and asymmetric Mahalanobis distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(3), 258–262 (1999)
Khotanzad, A., Hong, Y.H.: Invariant image recognition by Zernike moments. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(5), 489–497 (1990)
Kopec, G., Chou, P.: Document image decoding using Markov source models. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(6), 602–617 (1994)
Ma, H., Doermann, D.: Adaptive Hindi OCR using generalized Hausdorff image comparison. ACM Transactions on Asian Language Information Processing 26(2), 198–213 (2003)
Ma, H., Doermann, D.: Bootstrapping structured page segmentation. In: SPIE Conference Document Recognition and Retrieval, pp. 179–188. Santa Clara, CA (2003)
Ma, H., Doermann, D.: Gabor filter based multi-class classifier for scanned document images. In: 7th International Conference on Document Analysis and Recognition (ICDAR), pp. 968–972. Edinburgh, Scotland (2003)
Ma, H., Doermann, D.: Word level script identification for scanned document images. In: SPIE Conference Document Recognition and Retrieval. San Jose, CA (2004). To appear
Ma, H., Doermann, D.: Adaptive OCR with limited user feedback. In: International Conference on Document Analysis and Recognition, pp. 814–818 (2005)
Mahmoud, S.: Arabic character recognition using Fourier descriptors and character contour encoding. Pattern Recognition 27, 815–824 (1994)
McGregor, R.: The Oxford Hindi-English Dictionary. Oxford University Press, Oxford, Delhi (1993). ISBN 0-19-864339-X
Nartker, T.A., Rice, S.V., Lumos, S.E.: Software tools and test data for research and testing of page-reading OCR systems. Document Recognition and Retrieval XII 5676, 37–47 (2005)
O’Gorman, L.: The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1162–1173 (1993)
Plamondon, R., Srihari, S.: On-line and off-line handwriting recognition: A comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 62–84 (2000)
Stefano, D., Cioppa, A., Marcelli, A.: Handwritten numeral recognition by means of Evolutionary Algorithms. International Conference on Document Analysis and Recognition 00, 804–807 (1999)
Suen, Y., Berthod, M., Mori, S.: Automatic recognition of handprinted characters: The state of the art. Proceedings of the IEEE 68, 469–487 (1980)
Teague, M.: Image analysis via the general theory of moments. Journal of the Optical Society of America 70(8), 920–930 (1980)
Copyright information
© 2009 Springer-Verlag London Limited
About this chapter
Cite this chapter
Agrawal, M., Ma, H., Doermann, D. (2009). Generalization of Hindi OCR Using Adaptive Segmentation and Font Files. In: Govindaraju, V., Setlur, S. (eds) Guide to OCR for Indic Scripts. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84800-330-9_10
DOI: https://doi.org/10.1007/978-1-84800-330-9_10
Publisher Name: Springer, London
Print ISBN: 978-1-84800-329-3
Online ISBN: 978-1-84800-330-9
eBook Packages: Computer Science, Computer Science (R0)