Abstract
Indexing and retrieval of Indian language documents is an important problem. We present an interactive access scheme for Indian language document collections using word-image-based search techniques. The compression and retrieval paradigm we propose is applicable even to those Indian scripts for which reliable OCR technology is not available. Our word-spotting technique exploits the geometrical features of the word image. These features are represented as a graph called the geometric feature graph (GFG). The GFG is encoded as a string, which serves as a compressed representation of the word image skeleton. We have also augmented GFG-based word image spotting with latent semantic analysis for more effective retrieval. The query is specified as a set of word images, and the documents that best match the query representation in the latent semantic space are retrieved. The retrieval paradigm is further raised to the conceptual level by using domain knowledge about document image content, specified in the form of an ontology.
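Because each word image is reduced to a string, one plausible way to spot words is approximate string matching over those strings. The abstract does not say how GFG strings are compared, so the following is only a minimal sketch under that assumption: it uses plain Levenshtein edit distance, and the symbol alphabet (`L`, `J`, `E`), the word identifiers, and the function names `edit_distance` and `spot` are all hypothetical.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i symbols of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def spot(query: str, index: Dict[str, str], max_dist: int) -> List[Tuple[str, int]]:
    """Return (word_id, distance) pairs within max_dist, closest first."""
    hits = [(word_id, edit_distance(query, code)) for word_id, code in index.items()]
    hits = [(word_id, dist) for word_id, dist in hits if dist <= max_dist]
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


# Hypothetical index: word-image id -> encoded string (made-up symbols).
INDEX = {"w1": "LJLEJ", "w2": "LJLEE", "w3": "EJJLLJE"}

if __name__ == "__main__":
    print(spot("LJLEJ", INDEX, max_dist=1))
```

The distance threshold trades recall for precision: a larger `max_dist` tolerates more variation between renderings of the same word but admits more false matches.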
Copyright information
© 2009 Springer-Verlag London Limited
About this chapter
Cite this chapter
Harit, G., Chaudhury, S., Garg, R. (2009). GFG-Based Compression and Retrieval of Document Images in Indian Scripts. In: Govindaraju, V., Setlur, S. (eds) Guide to OCR for Indic Scripts. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84800-330-9_14
DOI: https://doi.org/10.1007/978-1-84800-330-9_14
Publisher Name: Springer, London
Print ISBN: 978-1-84800-329-3
Online ISBN: 978-1-84800-330-9
eBook Packages: Computer Science (R0)