Abstract

IR systems commonly use stop-word elimination and stemming in indexing. This paper investigates the impact of stop-word removal and stemming on Hindi Information Retrieval (IR). Three different stemmers have been used in this study and their performance has been compared. The experiments have been conducted on a test collection constructed using Hindi documents from the EMILLE corpus. We created a Hindi stop-word list by extracting the high-frequency words from the collection and adding some words manually. The evaluation has been made in terms of precision, recall and reduction of index size. The experimental investigation suggests that stop-word removal improves retrieval significantly. However, we observed a small drop in retrieval precision with all three stemmers.
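The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, the `top_k` cutoff, the toy sentences and the use of posting counts as the measure of index size are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_stopword_list(documents, top_k, manual_additions=()):
    """Take the top_k most frequent tokens and add manually chosen words."""
    counts = Counter()
    for text in documents:
        counts.update(text.split())  # assumption: whitespace tokenization
    frequent = {term for term, _ in counts.most_common(top_k)}
    return frequent | set(manual_additions)


def build_index(documents, stopwords=frozenset()):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.split():
            if term not in stopwords:
                index[term].add(doc_id)
    return index


def posting_count(index):
    """Index size measured as the number of (term, document) postings."""
    return sum(len(doc_ids) for doc_ids in index.values())


def index_size_reduction(baseline, reduced):
    """Fraction of postings removed relative to the baseline index."""
    return 1.0 - posting_count(reduced) / posting_count(baseline)


def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy collection (illustrative only, not taken from the EMILLE corpus).
docs = ["राम का घर है", "सीता का बाग है", "घर का बाग"]
stopwords = build_stopword_list(docs, top_k=1, manual_additions=["है"])
baseline_index = build_index(docs)
reduced_index = build_index(docs, stopwords)
reduction = index_size_reduction(baseline_index, reduced_index)
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

A stemmer would slot into `build_index` as a per-term normalization step before the stop-word check; comparing precision and recall with and without that step mirrors the kind of comparison the abstract reports.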

Copyright information

© 2009 Indian Institute of Information Technology, India

About this paper

Cite this paper

Pandey, A.K., Siddiqui, T.J. (2009). Evaluating Effect of Stemming and Stop-word Removal on Hindi Text Retrieval. In: Tiwary, U.S., Siddiqui, T.J., Radhakrishna, M., Tiwari, M.D. (eds) Proceedings of the First International Conference on Intelligent Human Computer Interaction. Springer, New Delhi. https://doi.org/10.1007/978-81-8489-203-1_31

  • DOI: https://doi.org/10.1007/978-81-8489-203-1_31

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-8489-404-2

  • Online ISBN: 978-81-8489-203-1

  • eBook Packages: Computer Science, Computer Science (R0)
