
Stemming for Kurdish Information Retrieval

  • Conference paper
Information Retrieval Technology (AIRS 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8281)


Abstract

Resource scarcity and diversity, in both dialect and script, are the two primary challenges in Kurdish language processing. In this paper, we aim to address these two problems by building stemmers for the two main dialects of the Kurdish language (Sorani and Kurmanji) and investigating their effectiveness in Kurdish information retrieval.

More specifically, we build Jedar, the first rule-based stemmer for both Sorani and Kurmanji. We also implement GRAS, a state-of-the-art statistical stemming technique, and apply it to both Kurdish dialects. We then conduct a comprehensive experimental study to compare the effectiveness of these stemmers.

Our experimental results show that stemming can significantly improve retrieval performance on Kurdish documents, by up to 35%. Furthermore, they indicate that the gains from the rule-based and the statistical approaches are comparable.
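As a rough illustration of what a rule-based stemmer does, the following minimal Python sketch strips the longest matching suffix from a word. It is not Jedar's rule set, which the abstract does not describe: the suffix list (a few Latin-transliterated Sorani noun endings), the minimum stem length, and the function name `strip_suffix` are all assumptions made for this example.

```python
# Illustrative suffixes only (Latin-transliterated Sorani noun endings).
# This is an assumed placeholder list, NOT the rule set used by Jedar.
SUFFIXES = ("ekan", "eke", "an", "êk")

# Assumed guard: never leave a stem shorter than this many characters.
MIN_STEM_LEN = 3


def strip_suffix(word, suffixes=SUFFIXES, min_stem_len=MIN_STEM_LEN):
    """Remove the longest matching suffix if the remaining stem is long enough."""
    # Try longer suffixes first so "ekan" is preferred over "an".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No rule applied: return the word unchanged.
    return word


if __name__ == "__main__":
    for w in ("kitêbekan", "kitêbeke", "nan"):
        print(w, "->", strip_suffix(w))
```

Trying the longest suffix first and enforcing a minimum stem length are common safeguards against over-stemming in suffix strippers. A real stemmer for Kurdish would also have to handle both scripts and dialect-specific morphology, which this sketch ignores.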




Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Salavati, S., Sheykh Esmaili, K., Akhlaghian, F. (2013). Stemming for Kurdish Information Retrieval. In: Banchs, R.E., Silvestri, F., Liu, T.-Y., Zhang, M., Gao, S., Lang, J. (eds) Information Retrieval Technology. AIRS 2013. Lecture Notes in Computer Science, vol 8281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45068-6_24


  • DOI: https://doi.org/10.1007/978-3-642-45068-6_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-45067-9

  • Online ISBN: 978-3-642-45068-6

  • eBook Packages: Computer Science, Computer Science (R0)
