Term Conflation and Blind Relevance Feedback for Information Retrieval on Indian Languages

  • Conference paper
Multilingual Information Access in South Asian Languages

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7536)

Abstract

For the first participation of Dublin City University (DCU) in the FIRE 2010 evaluation campaign, Information Retrieval (IR) experiments on English, Bengali, Hindi, and Marathi documents were performed to investigate term conflation, Blind Relevance Feedback (BRF), and manual and automatic query translation. The experiments are based on BM25 and on language modeling (LM) for IR. Results show that term conflation always improves Mean Average Precision (MAP) compared to indexing unprocessed word forms, but different approaches seem to work best for different languages. For example, in monolingual Marathi experiments, indexing 5-prefixes outperforms our corpus-based stemmer, whereas in Hindi the corpus-based stemming approach achieves a higher MAP. For Bengali, the LM retrieval model with the rule-based stemmer achieves a higher (but not significantly higher) MAP than BM25 with a corpus-based stemmer (0.4583 vs. 0.4526). In all experiments, BRF yields considerably higher MAP than the corresponding experiments without it. Bilingual IR experiments (English to Bengali and English to Hindi) are based on query translations obtained from native speakers and from the Google Translate web service. For the automatically translated queries, MAP is slightly (but not significantly) lower than for the manual query translations. The bilingual English to Bengali experiments achieve 81.7%-83.3% of the best corresponding monolingual experiments; the bilingual English to Hindi experiments achieve 78.0%-80.6%.
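As an illustration of two notions used in the abstract, the following minimal Python sketch shows one way to conflate terms by indexing 5-prefixes and one common way to compute MAP. It is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, prefixes are taken over Unicode code points (the abstract does not say how characters are counted for Indic scripts), and MAP is computed with the usual definition of uninterpolated average precision.

```python
from typing import Dict, List, Set


def prefix_conflate(token: str, n: int = 5) -> str:
    """Map a token to its first n code points (assumption: code points, not
    grapheme clusters). Tokens shorter than n are returned unchanged."""
    return token[:n]


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Uninterpolated average precision of one ranked list of document ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document retrieved.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all queries in runs."""
    if not runs:
        return 0.0
    scores = [average_precision(runs[q], qrels.get(q, set())) for q in runs]
    return sum(scores) / len(scores)


# Hypothetical toy data: two word forms collapse to the same index term.
index_terms = {prefix_conflate(t) for t in ["retrieval", "retrieving"]}
toy_map = mean_average_precision(
    {"q1": ["d1", "d2", "d3"]}, {"q1": {"d1", "d3"}}
)
```

In this toy data, both word forms share the single index term `retri`, which is the effect term conflation aims for. A real system would apply the same truncation to query terms as to document terms.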

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Leveling, J., Ganguly, D., Jones, G.J.F. (2013). Term Conflation and Blind Relevance Feedback for Information Retrieval on Indian Languages. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_28

  • DOI: https://doi.org/10.1007/978-3-642-40087-2_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40086-5

  • Online ISBN: 978-3-642-40087-2

  • eBook Packages: Computer Science (R0)
