
The Effect of Stemming and Stop-Word-Removal on Automatic Text Classification in Turkish Language

  • Mustafa Çağataylı
  • Erbuğ Çelebi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9489)

Abstract

Text classification is the labeling of unstructured natural-language text documents with predefined categories or classes. Such classification not only helps organizations improve their business communication skills and customer satisfaction levels, but also improves the use of unstructured data in the academic and non-academic worlds. The aim of this study is to analyze the effect of stemming, over-sampling, and stopword removal on the automatic classification of Turkish content. After obtaining a Turkish corpus, stemming, balancing, and stopword removal are applied and the results are evaluated.
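The three preprocessing steps named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the stop-word list and suffix list are tiny hypothetical samples, `naive_stem` is a crude suffix stripper rather than a real Turkish morphological analyzer, and `random_oversample` balances classes by plain random duplication because the abstract does not say which over-sampling method was used.

```python
import random
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOPWORDS = {"ve", "bir", "bu", "için", "ile", "de", "da"}

# Hypothetical suffix list, longest first. This is NOT a morphological analyzer.
SUFFIXES = ["lerin", "ların", "ler", "lar", "in", "ın", "de", "da"]


def tokenize(text):
    # Whitespace tokenization. Note: str.lower() is not Turkish-aware
    # (dotted/dotless I), so the examples here stay in lowercase input.
    return text.lower().split()


def remove_stopwords(tokens):
    # Drop every token that appears in the stop-word list.
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token, min_stem=2):
    # Strip the first matching suffix, keeping at least min_stem characters.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    # Stop-word removal followed by stemming.
    return [naive_stem(t) for t in remove_stopwords(tokenize(text))]


def random_oversample(docs, labels, seed=0):
    # Duplicate randomly chosen minority-class documents until every
    # class has as many documents as the largest class.
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_docs, out_labels = list(docs), list(labels)
    for label, count in counts.items():
        pool = [d for d, l in zip(docs, labels) if l == label]
        for _ in range(target - count):
            out_docs.append(rng.choice(pool))
            out_labels.append(label)
    return out_docs, out_labels
```

For example, `preprocess("kitaplar ve evler için")` drops the two stop words and strips the plural suffixes from the remaining tokens, and `random_oversample` returns a corpus in which all classes have the same number of documents. A real experiment would replace the stemmer and stop-word list with properly built Turkish resources.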

Keywords

Text classification, Turkish text classification, Stemming, Stopword removal, Over-sampling

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Cyprus International University, North Nicosia, North Cyprus
