Effects of Light Stemming on Feature Extraction and Selection for Arabic Documents Classification
Abstract
This chapter studies the effects of light stemming on feature extraction for Arabic document classification, where Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) are used to extract features. Feature selection methods, namely Chi-square (Chi2), Information Gain (IG), and Singular Value Decomposition (SVD), are then used to select the most relevant features. K-Nearest Neighbor (kNN), Logistic Regression (LR), and Support Vector Machine (SVM) classifiers are used to build the classification model. Experiments are conducted on a public dataset collected from Arabic websites, namely the BBC Arabic dataset. The experimental results show that SVM outperforms LR and kNN. Furthermore, BoW outperforms TF-IDF when no stemming technique is used. Using the Robust Arabic Light Stemmer (ARLStem) as our main light stemmer has a positive effect over the baseline when combined with TF-IDF. With Chi2 as the feature selection technique, SVM achieved an F1-micro of 0.9568 using BoW features and 5000 selected features. With IG as the feature selection method, SVM achieved an F1-micro of 0.9588 with BoW and 4000 selected features. Finally, with SVD as the feature selection technique, SVM reached an F1-micro of 0.9569 using BoW and 5000 selected features. The results reported above are the best achieved when stemming is not employed.
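The feature extraction step described above can be illustrated with a small, self-contained sketch. It is not the chapter's implementation: the affix lists, the `light_stem` helper, the toy corpus, and the specific TF-IDF variant (idf = ln(N/df) + 1) are all assumptions chosen for illustration, and the stemmer is a deliberately simplified stand-in for ARLStem.

```python
import math
from collections import Counter

# Hypothetical affix lists for illustration only; NOT the rules used by ARLStem.
PREFIXES = ("وال", "بال", "كال", "فال", "ال")
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")


def light_stem(token, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for p in PREFIXES:
        if token.startswith(p) and len(token) - len(p) >= min_len:
            token = token[len(p):]
            break
    for s in SUFFIXES:
        if token.endswith(s) and len(token) - len(s) >= min_len:
            token = token[:-len(s)]
            break
    return token


def bag_of_words(docs, stem=False):
    """Return one Counter of raw term counts per document (BoW)."""
    return [
        Counter(light_stem(t) if stem else t for t in doc.split())
        for doc in docs
    ]


def tf_idf(bows):
    """Reweight BoW counts by idf = ln(N / df) + 1 (one common variant, assumed here)."""
    n_docs = len(bows)
    # Document frequency: number of documents containing each term.
    df = Counter(term for bow in bows for term in bow)
    return [
        {term: count * (math.log(n_docs / df[term]) + 1.0)
         for term, count in bow.items()}
        for bow in bows
    ]


# Toy corpus (an assumption, unrelated to the BBC Arabic dataset).
docs = ["الكتاب الجديد", "كتاب قديم", "المعلمون في المدرسة"]

raw = bag_of_words(docs)
stemmed = bag_of_words(docs, stem=True)
weights = tf_idf(stemmed)

vocab_raw = {t for bow in raw for t in bow}
vocab_stemmed = {t for bow in stemmed for t in bow}
```

On this toy corpus, stemming merges the definite and indefinite forms of the same word, so the stemmed vocabulary is smaller than the raw one. After TF-IDF weighting, a term that appears in only one document receives a higher weight than a term with the same count that is shared across documents. Whether that smaller vocabulary helps or hurts a classifier is the empirical question the chapter investigates.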
Keywords
Arabic text classification, Feature extraction, Feature selection, Stemming technique
Acknowledgements
This work was partially supported by the National Natural Science Foundation of China (Grant No. 61672398, 61806151), the Defense Industrial Technology Development Program (Grant No. JCKY2018110C165), and the Hubei Provincial Natural Science Foundation of China (Grant No. 2017CFA012).