Design and Implementation of Rule-Based Hindi Stemmer for Hindi Information Retrieval

  • Rakesh Kumar
  • Atul Kumar Ramotra
  • Amit Mahajan
  • Vibhakar Mansotra
Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 165)

Abstract

Stemming is a process that maps morphologically similar words to a common root (stem) by removing their prefixes or suffixes. In Natural Language Processing, stemming plays an important role in Information Retrieval, Machine Translation, Text Summarization, and related tasks. A stemmer reduces an inflected word to its root form without performing any morphological analysis, so, unlike a lemmatizer, which always returns a meaningful dictionary word, it does not always produce a meaningful dictionary root. For example, the Hindi word pakshion is formed as paksh plus a suffix; removing the suffix leaves paksh, which is not a meaningful Hindi dictionary word. In the context of information retrieval, a stemmer reduces morphologically inflected variants to a common form, thereby reducing the index size of the inverted file and increasing recall. In this paper, the authors develop a rule-based Hindi stemmer that uses a suffix-stripping approach for Hindi Information Retrieval. A Python-based web interface has been designed to implement the proposed algorithm. The stemmer was tested for accuracy and efficiency in two scenarios: first as an independent stemmer, and second as a supporting module for indexing in Hindi Information Retrieval. The proposed stemmer achieved an accuracy of 71% as an independent stemmer and reduced the index size by approximately 26% when used in indexing.
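
The suffix-stripping idea and its effect on an inverted index can be illustrated with a minimal Python sketch. This is not the authors' algorithm or rule set: it works on the romanized forms quoted in the abstract (pakshion, paksh), the suffix list, the minimum stem length, the extra sample words, and the function names `stem` and `build_index` are all hypothetical. A real Hindi stemmer would presumably operate on Devanagari strings with a curated suffix list.

```python
from collections import defaultdict

# Hypothetical romanized suffix list (assumption, not the paper's rules).
# Only "ion" is implied by the abstract's example pakshion -> paksh.
# Sorted longest first so that the longest matching suffix is stripped.
SUFFIXES = sorted(["ion", "iyan", "on", "en"], key=len, reverse=True)
MIN_STEM_LEN = 2  # assumed guard so very short words are left untouched


def stem(word):
    """Strip the longest matching suffix; no morphological analysis is done."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def build_index(docs, use_stemmer):
    """Build a toy inverted index mapping each term to a set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            term = stem(token) if use_stemmer else token
            index[term].add(doc_id)
    return index


# Made-up three-document collection of romanized tokens.
docs = {
    1: "pakshion ladkon",
    2: "paksh ladkiyan",
    3: "ladkon kitaben",
}

raw = build_index(docs, use_stemmer=False)
stemmed = build_index(docs, use_stemmer=True)

print(stem("pakshion"))           # paksh
print(len(raw), len(stemmed))     # 5 3
print(sorted(stemmed["paksh"]))   # [1, 2]
```

In this toy collection the vocabulary shrinks from five terms to three once variants share a stem, and a lookup for paksh matches documents 1 and 2 instead of document 2 alone, which is the index-size and recall effect the abstract describes. The numbers here come from the made-up data and say nothing about the 71% accuracy or 26% reduction reported in the paper.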

Keywords

Stemming; Suffix stripping; Suffix substitution; N-gram; Information Retrieval (IR)

Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • Rakesh Kumar (1), corresponding author
  • Atul Kumar Ramotra (1)
  • Amit Mahajan (1)
  • Vibhakar Mansotra (1)
  1. University of Jammu, Jammu, India
