Abstract
Information Retrieval (IR) systems commonly apply stop-word elimination and stemming during indexing. This paper investigates the impact of stop-word removal and stemming on Hindi IR. Three different stemmers were used in this study and their performance was compared. The experiments were conducted on a test collection constructed from Hindi documents in the EMILLE corpus. We created a Hindi stop-word list by extracting high-frequency words from the collection and adding some words manually. Evaluation was carried out in terms of precision, recall, and reduction in index size. The experiments suggest that stop-word removal improves retrieval significantly. However, we observed a small drop in retrieval precision with all three stemmers.
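As an illustration only, the following Python sketch shows one way a frequency-based stop-word list and the precision and recall measures mentioned in the abstract could be computed. It is not the authors' implementation: the function names, the `top_k` cutoff, the whitespace tokenisation, and the toy documents are all assumptions made for this example.

```python
from collections import Counter


def build_stopword_list(documents, top_k, manual_additions=()):
    """Return the top_k most frequent tokens plus manually added words.

    Tokenisation is plain whitespace splitting (an assumption of this sketch).
    """
    counts = Counter(token for doc in documents for token in doc.split())
    frequent = {word for word, _ in counts.most_common(top_k)}
    return frequent | set(manual_additions)


def index_terms(document, stopwords):
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in document.split() if token not in stopwords]


def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy collection, not taken from the EMILLE corpus.
docs = ["राम ने फल खाया", "सीता ने पानी पिया", "राम ने पानी पिया"]
stopwords = build_stopword_list(docs, top_k=1, manual_additions=("का",))
terms = index_terms(docs[0], stopwords)
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

In this toy collection the postposition "ने" occurs in every document, so it is the single most frequent token and is dropped from the index terms. A real study would choose the cutoff by inspecting the frequency distribution of the whole collection.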
Copyright information
© 2009 Indian Institute of Information Technology, India
About this paper
Cite this paper
Pandey, A.K., Siddiqui, T.J. (2009). Evaluating Effect of Stemming and Stop-word Removal on Hindi Text Retrieval. In: Tiwary, U.S., Siddiqui, T.J., Radhakrishna, M., Tiwari, M.D. (eds) Proceedings of the First International Conference on Intelligent Human Computer Interaction. Springer, New Delhi. https://doi.org/10.1007/978-81-8489-203-1_31
DOI: https://doi.org/10.1007/978-81-8489-203-1_31
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-8489-404-2
Online ISBN: 978-81-8489-203-1
eBook Packages: Computer Science, Computer Science (R0)