Term Conflation and Blind Relevance Feedback for Information Retrieval on Indian Languages

Leveling, Johannes; Ganguly, Debasis; Jones, Gareth J. F.

doi:10.1007/978-3-642-40087-2_28

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7536))

Abstract

For the first participation of Dublin City University (DCU) in the FIRE 2010 evaluation campaign, Information Retrieval (IR) experiments on English, Bengali, Hindi, and Marathi documents were performed to investigate term conflation, Blind Relevance Feedback (BRF), and manual and automatic query translation. The experiments are based on BM25 and on language modeling (LM) for IR. Results show that term conflation always improves Mean Average Precision (MAP) compared to indexing unprocessed word forms, but different approaches seem to work best for different languages. For example, in monolingual Marathi experiments, indexing 5-prefixes outperforms our corpus-based stemmer; in Hindi, the corpus-based stemming approach achieves a higher MAP. For Bengali, the LM retrieval model with the rule-based stemmer achieves a higher (but not significantly higher) MAP than BM25 with a corpus-based stemmer (0.4583 vs. 0.4526). In all experiments, BRF yields considerably higher MAP than the corresponding experiments without it. Bilingual IR experiments (English to Bengali and English to Hindi) are based on query translations obtained from native speakers and from the Google Translate web service. For the automatically translated queries, MAP is slightly (but not significantly) lower than in experiments with manual query translations. The bilingual English to Bengali (English to Hindi) experiments achieve 81.7%-83.3% (78.0%-80.6%) of the MAP of the best corresponding monolingual experiments.
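As a rough illustration of two techniques named in the abstract, the following Python sketch shows n-prefix term conflation (with n = 5, as in the 5-prefix runs) and a very simple form of blind relevance feedback. It is not the authors' implementation. The function names, the parameters `top_docs` and `top_terms`, and the selection of expansion terms by raw frequency are assumptions made for illustration only; the abstract does not specify how expansion terms were chosen. Note also that the sketch truncates by Unicode code points, which may differ from how prefixes were counted for Indic scripts in the paper.

```python
from collections import Counter


def prefix_conflate(token: str, n: int = 5) -> str:
    """Conflate a word form to its first n characters (n-prefix truncation).

    Assumption: characters are counted as Unicode code points.
    """
    return token[:n]


def expand_query(query_terms, ranked_docs, top_docs=2, top_terms=2):
    """Blind relevance feedback sketch.

    The top-ranked documents are assumed to be relevant without any user
    judgment, and their most frequent terms that are not already in the
    query are appended to it. Frequency-based term selection is an
    illustrative assumption, not the method reported in the paper.
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(t for t in doc if t not in query_terms)
    expansion = [term for term, _ in counts.most_common(top_terms)]
    return list(query_terms) + expansion
```

In use, the same conflation would be applied to both document and query tokens before indexing and retrieval, so that different inflected forms sharing a 5-character prefix map to the same index term. The expanded query would then be run against the index a second time to produce the final ranking.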

Author information

Authors and Affiliations

CNGL, School of Computing, Dublin City University, Dublin, 9, Ireland
Johannes Leveling, Debasis Ganguly & Gareth J. F. Jones

Editor information

Editors and Affiliations

Dhirubhai Ambani Institute of Information and Communication Technology, Gujarat, India
Prasenjit Majumder
Indian Statistical Institute, Kolkata, India
Mandar Mitra
Indian Institute of Technology, Bombay, India
Pushpak Bhattacharyya
IBM Research New Delhi, India
L. Venkata Subramaniam & Danish Contractor
NLE Lab - ELiRF, Universitat Politècnica de València, Valencia, Spain
Paolo Rosso

Cite this paper

Leveling, J., Ganguly, D., Jones, G.J.F. (2013). Term Conflation and Blind Relevance Feedback for Information Retrieval on Indian Languages. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_28

DOI: https://doi.org/10.1007/978-3-642-40087-2_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40086-5
Online ISBN: 978-3-642-40087-2
eBook Packages: Computer Science, Computer Science (R0)
