Indian Language Information Retrieval

Guide to OCR for Indic Scripts

Part of the book series: Advances in Pattern Recognition (ACVPR)

Abstract

With the proliferation of the Internet in South Asia over the last decade, the availability of digital documents in Indian languages has increased considerably, and the need for effective information access methods for these languages is increasingly felt. Although Indian language information retrieval (ILIR) research is at a relatively nascent stage (especially with regard to large-scale quantitative evaluation), several research efforts in this area have been reported in the recent past. This chapter reviews the current state of the art in monolingual and cross-lingual information access in Indian languages and outlines a recent project that aims to create a comprehensive, end-to-end IR system for Indian languages, along with a standardized evaluation framework (in the spirit of TREC, CLEF, or NTCIR) that will provide a sound empirical basis for further work.

Copyright information

© 2009 Springer-Verlag London Limited

About this chapter

Cite this chapter

Majumder, P., Mitra, M. (2009). Indian Language Information Retrieval. In: Govindaraju, V., Setlur, S. (eds) Guide to OCR for Indic Scripts. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84800-330-9_16

  • DOI: https://doi.org/10.1007/978-1-84800-330-9_16

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84800-329-3

  • Online ISBN: 978-1-84800-330-9

  • eBook Packages: Computer Science (R0)
