Stemming for Kurdish Information Retrieval

Salavati, Shahin; Sheykh Esmaili, Kyumars; Akhlaghian, Fardin

doi:10.1007/978-3-642-45068-6_24

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8281)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

Resource scarcity and diversity, in both dialect and script, are the two primary challenges in Kurdish language processing. In this paper we address these two problems by building stemmers for the two main dialects of the Kurdish language (Sorani and Kurmanji) and investigating their effectiveness in Kurdish Information Retrieval.

More specifically, we build Jedar, the first rule-based stemmer for both Sorani and Kurmanji. We also implement GRAS, a state-of-the-art statistical stemming technique, and apply it to both Kurdish dialects. We then conduct a comprehensive experimental study to compare the effectiveness of these stemmers.

Our experimental results show that stemming can significantly improve retrieval performance on Kurdish documents, by up to 35%. Furthermore, they indicate that the gains from the rule-based and the statistical approaches are comparable.
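
To make the idea of a rule-based stemmer concrete, the following is a minimal sketch of longest-match suffix stripping. It is not Jedar's implementation: the suffix list (Latin transliterations of a few common Sorani nominal endings), the minimum stem length, and the function name `strip_suffix` are all assumptions chosen for illustration only.

```python
# Illustrative suffix list only; NOT the rule set used by Jedar.
# Transliterated endings: "-ekan" (definite plural), "-eke" (definite), "-an" (plural).
SUFFIXES = ["ekan", "eke", "an"]

# Assumed guard against over-stemming: keep at least this many characters.
MIN_STEM_LEN = 2


def strip_suffix(word, suffixes=SUFFIXES, min_stem_len=MIN_STEM_LEN):
    """Remove the longest matching suffix, if the remaining stem is long enough."""
    # Try longer suffixes first so "ekan" wins over "an".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No rule applied: return the word unchanged.
    return word


if __name__ == "__main__":
    for w in ["kurekan", "kureke", "kuran", "kur"]:
        print(w, "->", strip_suffix(w))
```

In a retrieval setting, the same function would be applied to both document terms and query terms at indexing and query time, so that inflected variants map to a shared index term. A statistical stemmer such as GRAS would instead learn suffix pairs from the corpus lexicon rather than rely on a hand-written list.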

Author information

Authors and Affiliations

University of Kurdistan, Sanandaj, Iran
Shahin Salavati & Fardin Akhlaghian
Nanyang Technological University, Singapore
Kyumars Sheykh Esmaili

Editor information

Editors and Affiliations

Institute for Infocomm Research, Human Language Technology, 1 Fusionopolis Way #21-01, Connexis South, 138632, Singapore
Rafael E. Banchs, Min Zhang, Sheng Gao & Jun Lang
Yahoo Labs, Avinguda Diagonal 177, 08018, Barcelona, Spain
Fabrizio Silvestri
Microsoft Research Asia, No. 5, Danling Street, Haidian District, 100080, Beijing, China
Tie-Yan Liu

About this paper

Cite this paper

Salavati, S., Sheykh Esmaili, K., Akhlaghian, F. (2013). Stemming for Kurdish Information Retrieval. In: Banchs, R.E., Silvestri, F., Liu, TY., Zhang, M., Gao, S., Lang, J. (eds) Information Retrieval Technology. AIRS 2013. Lecture Notes in Computer Science, vol 8281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45068-6_24

DOI: https://doi.org/10.1007/978-3-642-45068-6_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45067-9
Online ISBN: 978-3-642-45068-6
eBook Packages: Computer Science, Computer Science (R0)
