Abstract
This paper presents results for a generative method that manages morphological variation of query keywords in Bengali, Gujarati and Marathi. The method, called Frequent Case Generation (FCG), is based on the skewed distribution of word forms in natural languages and is suitable for languages that either have a fair amount of morphological variation or are morphologically very rich. We participated in the ad hoc task at FIRE 2011 and applied the FCG method to the monolingual Bengali, Gujarati and Marathi test collections. Our evaluation was carried out with the title and description fields of the test topics and the Lemur search engine. We used a plain, unprocessed word index as the baseline, and n-gramming and stemming as competing methods. The evaluation results show relative improvements in mean average precision of 30%, 16% and 70% for Bengali, Gujarati and Marathi, respectively, when the FCG method is compared to plain words. The method also performs competitively with n-gramming and stemming.
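The idea behind FCG can be illustrated with a minimal sketch. It assumes that suffix frequencies have already been counted from a corpus and that a keyword is expanded by simple concatenation. The function names, the coverage threshold and the suffix strings are hypothetical placeholders for illustration only; they are not the authors' implementation and not real Bengali, Gujarati or Marathi case endings.

```python
from collections import Counter


def select_frequent_suffixes(suffix_counts, coverage=0.75):
    """Pick the most frequent suffixes until they cover the requested
    share of all observed word-form occurrences.

    suffix_counts maps a suffix (hypothetical placeholder strings here)
    to its corpus frequency; "" stands for the unsuffixed base form.
    """
    total = sum(suffix_counts.values())
    selected = []
    covered = 0
    # Walk the suffixes from most to least frequent.
    for suffix, count in Counter(suffix_counts).most_common():
        if covered / total >= coverage:
            break
        selected.append(suffix)
        covered += count
    return selected


def expand_query(keywords, suffixes):
    """Generate the frequent forms of each keyword by concatenation.

    Real morphology usually needs stem alternation rules as well;
    plain concatenation is an assumption made to keep the sketch short.
    """
    return {kw: [kw + s for s in suffixes] for kw in keywords}


# Invented, skewed frequency distribution over five case forms.
counts = {"": 50, "ka": 30, "ti": 10, "ro": 6, "mu": 4}
frequent = select_frequent_suffixes(counts, coverage=0.75)
expanded = expand_query(["stem"], frequent)
```

Because the distribution is skewed, two of the five forms already cover 80% of the occurrences in this invented example, so each keyword is expanded into only two variants. The expanded forms would then be matched against a plain, unprocessed word index.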
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Paik, J.H., Kettunen, K., Pal, D., Järvelin, K. (2013). Frequent Case Generation in Ad Hoc Retrieval of Three Indian Languages – Bengali, Gujarati and Marathi. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_4
DOI: https://doi.org/10.1007/978-3-642-40087-2_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40086-5
Online ISBN: 978-3-642-40087-2
eBook Packages: Computer Science, Computer Science (R0)