Evaluating Effect of Stemming and Stop-word Removal on Hindi Text Retrieval

Pandey, Amaresh Kumar; Siddiqui, Tanveer J

doi:10.1007/978-81-8489-203-1_31

Abstract

IR systems mainly use stop-word elimination and stemming during indexing. This paper investigates the impact of stop-word removal and stemming on Hindi Information Retrieval (IR). Three different stemmers have been used in this study and their performance has been compared. The experiments have been conducted on a test collection constructed from Hindi documents in the EMILLE corpus. We created a Hindi stop-word list by extracting the high-frequency words from the collection and adding some words manually. The evaluation has been made in terms of precision, recall and reduction of index size. The experimental investigation suggests that stop-word removal improves retrieval significantly. However, we observed a small drop in retrieval precision with all three stemmers.
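The stop-word construction and the three evaluation measures named in the abstract can be illustrated with a minimal sketch. The abstract does not give the frequency cutoff, the tokenizer, or the retrieval model, so everything here is an assumption made for illustration: the whitespace tokenizer, the `top_k` cutoff, the toy documents, and the helper names `build_stopword_list`, `vocabulary`, `precision_recall` and `index_reduction` are hypothetical and are not taken from the paper.

```python
from collections import Counter

PUNCT = "।,.?!\"'()"  # danda plus common punctuation (assumed set)

def tokenize(text):
    # Assumption: plain whitespace tokenization with punctuation stripped.
    tokens = (t.strip(PUNCT) for t in text.split())
    return [t for t in tokens if t]

def build_stopword_list(docs, top_k, manual=()):
    # Take the top_k most frequent terms in the collection,
    # then add manually chosen words, as the abstract describes.
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    frequent = [word for word, _ in counts.most_common(top_k)]
    return set(frequent) | set(manual)

def vocabulary(docs, stopwords=frozenset()):
    # Distinct index terms left after stop-word removal.
    return {tok for doc in docs for tok in tokenize(doc)} - set(stopwords)

def precision_recall(retrieved, relevant):
    # Set-based precision and recall for a single query.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def index_reduction(terms_before, terms_after):
    # Fraction of distinct index terms removed.
    return 1 - terms_after / terms_before if terms_before else 0.0

# Toy collection (invented for illustration, not from EMILLE).
docs = [
    "राम के घर में पानी है",
    "सीता के स्कूल में किताब है",
    "मोहन के खेत में गाय है",
]

stopwords = build_stopword_list(docs, top_k=3, manual=["और"])
before = len(vocabulary(docs))
after = len(vocabulary(docs, stopwords))
reduction = index_reduction(before, after)
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

In this sketch index size is approximated by the number of distinct terms. An actual evaluation would more likely measure postings or bytes on disk, and would average precision and recall over a set of queries.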

Author information

Authors and Affiliations

Hughes Systique Corporation, Sec-33 Infocity, Gurgaon, India
Amaresh Kumar Pandey
Indian Institute of Information Technology, Deoghat, Allahabad, Uttar Pradesh, India, 211012
Tanveer J Siddiqui

Editor information

Editors and Affiliations

Indian Institute of Information Technology, Allahabad, India
U. S. Tiwary (Professor), Tanveer J. Siddiqui (Assistant Professor), M. Radhakrishna (Professor) & M. D. Tiwari (Director)

About this paper

Cite this paper

Pandey, A.K., Siddiqui, T.J. (2009). Evaluating Effect of Stemming and Stop-word Removal on Hindi Text Retrieval. In: Tiwary, U.S., Siddiqui, T.J., Radhakrishna, M., Tiwari, M.D. (eds) Proceedings of the First International Conference on Intelligent Human Computer Interaction. Springer, New Delhi. https://doi.org/10.1007/978-81-8489-203-1_31

DOI: https://doi.org/10.1007/978-81-8489-203-1_31
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-8489-404-2
Online ISBN: 978-81-8489-203-1
eBook Packages: Computer Science, Computer Science (R0)
