Abstract
Stemming is the process of reducing a word to its root or stem form. Kannada is a morphologically rich language in which words are inflected into different forms based on person, number, gender and tense. Stemming is an important pre-processing step in any Natural Language Processing application. In this paper, stemming is performed on Kannada words with an unsupervised method based on suffix arrays. An accuracy of 0.58 % was achieved with this method. The performance of the stemmer is further improved by combining the unsupervised method with a stem-list dictionary. A list of 18,804 Kannada stem words was created manually as part of this work. A 10 % improvement in performance is observed. The effect of the proposed stemmer on text classification of Kannada documents is compared for the Naïve Bayes and Maximum Entropy methods. The results show that stemming improves the performance of text classification.
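The general idea of combining a stem list with unsupervised suffix stripping can be illustrated with a minimal sketch. This is not the stemmer described in the paper: simple frequency counting of word endings stands in for the suffix-array method, the function names `learn_suffixes` and `stem` and all thresholds are hypothetical, and the words are transliterated toy examples rather than data from the paper.

```python
from collections import Counter


def learn_suffixes(words, min_count=2, max_len=5):
    """Collect word endings shared by at least min_count words.

    Assumption: plain frequency counting replaces the paper's
    suffix-array method, whose details are not given in the abstract.
    """
    counts = Counter()
    for w in words:
        # Consider endings of length 1..max_len, leaving at least one character.
        for k in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-k:]] += 1
    return {s for s, c in counts.items() if c >= min_count}


def stem(word, suffixes, stem_list, min_stem_len=2):
    """Prefer a stem-list match; otherwise strip the longest learned suffix."""
    if word in stem_list:
        return word
    # Dictionary step: longest proper prefix of the word found in the stem list.
    for end in range(len(word) - 1, min_stem_len - 1, -1):
        if word[:end] in stem_list:
            return word[:end]
    # Unsupervised fallback: remove the longest known suffix,
    # keeping at least min_stem_len characters of the word.
    for k in range(len(word) - min_stem_len, 0, -1):
        if word[-k:] in suffixes:
            return word[:-k]
    return word


# Toy transliterated word forms (illustrative only).
words = ["manege", "maneyalli", "maneyinda", "kadige", "kadiyalli"]
suffixes = learn_suffixes(words)

print(stem("maneyalli", suffixes, {"mane"}))  # stem-list match
print(stem("kadiyalli", suffixes, set()))     # suffix-stripping fallback
```

In this sketch the stem list takes priority because a manually verified stem is more reliable than a statistically learned suffix; the unsupervised step only handles words the list does not cover.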
Copyright information
© 2015 Springer India
About this paper
Cite this paper
Deepamala, N., Ramakanth Kumar, P. (2015). Kannada Stemmer and Its Effect on Kannada Documents Classification. In: Jain, L., Behera, H., Mandal, J., Mohapatra, D. (eds) Computational Intelligence in Data Mining - Volume 3. Smart Innovation, Systems and Technologies, vol 33. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2202-6_7
DOI: https://doi.org/10.1007/978-81-322-2202-6_7
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2201-9
Online ISBN: 978-81-322-2202-6
eBook Packages: Engineering, Engineering (R0)