Evaluating Effect of Stemming and Stop-word Removal on Hindi Text Retrieval

  • Amaresh Kumar Pandey
  • Tanvver J Siddiqui

Abstract

IR systems mainly use stop-word elimination and stemming during indexing. This paper investigates the impact of stop-word removal and stemming on Hindi Information Retrieval (IR). Three different stemmers are used in this study and their performance is compared. The experiments were conducted on a test collection constructed from Hindi documents in the EMILLE corpus. We created a Hindi stop-word list by extracting the high-frequency words from the collection and adding some words manually. The evaluation is made in terms of precision, recall and reduction in index size. The experimental investigation suggests that stop-word removal improves retrieval significantly. However, we observed a small drop in retrieval precision with all three stemmers.
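
The stop-list construction and the evaluation measures named above can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the `top_k` cutoff, the toy documents, and the choice to measure index size as the number of (term, document) postings are all assumptions made for illustration.

```python
from collections import Counter


def build_stop_list(docs, top_k, manual_additions=()):
    """Take the top_k most frequent tokens and add hand-picked words.

    Assumption: tokens are whitespace-separated; a real Hindi pipeline
    would need proper tokenization and normalization.
    """
    counts = Counter(token for doc in docs for token in doc.split())
    frequent = {token for token, _ in counts.most_common(top_k)}
    return frequent | set(manual_additions)


def posting_count(docs, stop_words=frozenset()):
    """Index size, measured here (an assumption) as (term, document) pairs."""
    return sum(len({t for t in doc.split() if t not in stop_words}) for doc in docs)


def index_size_reduction(docs, stop_words):
    """Fraction of postings removed when stop words are dropped."""
    before = posting_count(docs)
    after = posting_count(docs, stop_words)
    return (before - after) / before if before else 0.0


def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data, purely illustrative (not taken from the EMILLE corpus).
docs = ["राम का घर", "सीता का घर", "राम का भाई"]
stop_words = build_stop_list(docs, top_k=1, manual_additions=("है",))
reduction = index_size_reduction(docs, stop_words)
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

On the toy data the single most frequent token is "का", so removing it drops one posting from each of the three documents. A stemmer would be evaluated the same way, by mapping each token to its stem before counting postings and before matching queries against documents.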

Keywords

Average Precision, Index Size, Stop Word, Test Collection, Computational Linguistics
These keywords were added by machine and not by the authors.

Copyright information

© Indian Institute of Information Technology, India 2009

Authors and Affiliations

  • Amaresh Kumar Pandey (1)
  • Tanvver J Siddiqui (2)

  1. Hughes Systique Corporation, Gurgaon, India
  2. Indian Institute of Information Technology, Allahabad, Uttar Pradesh, India