Automatic Kurdish Text Classification Using KDC 4007 Dataset

Rashid, Tarik A.; Mustafa, Arazo M.; Saeed, Ari M.

doi:10.1007/978-3-319-59463-7_19

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 6)

Included in the following conference series:

International Conference on Emerging Internetworking, Data & Web Technologies

Abstract

With the large volume of text documents uploaded to the Internet daily, the quantity of Kurdish documents available on the web grows rapidly. In news outlets in particular, documents belonging to categories such as health, politics, and sport often appear in the wrong category, or are placed in a nonspecific category called "others". This paper is concerned with the classification of Kurdish text documents, that is, placing an article or an email into the correct class according to its content. Although a considerable number of studies address text classification in other languages, studies on Kurdish are extremely limited because open, readily available datasets are lacking. In this paper, a new dataset named KDC-4007 is created that can be widely used in text classification studies of Kurdish news and articles. The file formats of the KDC-4007 dataset are compatible with well-known text mining tools. Three well-known text classification algorithms, Support Vector Machine (SVM), Naïve Bayes (NB), and Decision Tree (DT), are compared on KDC-4007 together with the TF × IDF feature weighting method. The paper also studies the effect of a Kurdish stemmer on the effectiveness of these classifiers. The experimental results indicate that the SVM classifier achieves a good accuracy of 91.03%, especially when stemming and TF × IDF feature weighting are included in the preprocessing phase. The KDC-4007 dataset is publicly available, and the results of this study can serve as a baseline for other researchers evaluating other classifiers.
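
To make the TF × IDF feature weighting mentioned in the abstract concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the abstract does not state which TF × IDF variant was used, so the formula here (raw term count multiplied by log(N / df)) is an assumption, and the function name `tf_idf` and the toy corpus are hypothetical. The corpus uses English placeholder tokens and is not taken from KDC-4007.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: raw term count times log(N / df), where N is the
    number of documents and df is the number of documents containing
    the term. The paper's exact formula is not given in the abstract.
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document it occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weighted.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weighted


# Hypothetical toy corpus (placeholder tokens, not KDC-4007 data).
docs = [
    ["match", "goal", "team", "goal"],
    ["election", "vote", "team"],
    ["hospital", "doctor", "vote"],
]

weights = tf_idf(docs)
for doc_weights in weights:
    # Print terms from highest to lowest weight.
    print(sorted(doc_weights.items(), key=lambda kv: -kv[1]))
```

Under this variant, a term that is frequent in one document but rare across the collection (such as "goal" above) receives a high weight, while a term shared by several documents (such as "team") is down-weighted. The resulting weight vectors are the kind of features a classifier such as SVM, NB, or DT would be trained on.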

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Kurdistan Hewler, Erbil, Kurdistan, Iraq
Tarik A. Rashid
School of Computer Science, College of Science, University of Sulaimania, Sulaymaniyah, Kurdistan, Iraq
Arazo M. Mustafa
Department of Computer Science, College of Science, University of Halabja, Halabja, Kurdistan, Iraq
Ari M. Saeed

Corresponding author

Correspondence to Tarik A. Rashid.

Editor information

Editors and Affiliations

Fukuoka Institute of Technology, Fukuoka, Japan
Leonard Barolli
School of Computer Sciences, Hubei University of Technology, Wuhan, China
Mingwu Zhang
Department of Electronic Technology, Key, Engineering University of CAPF, Xi’an, Xizang, China
Xu An Wang

About this paper

Cite this paper

Rashid, T.A., Mustafa, A.M., Saeed, A.M. (2018). Automatic Kurdish Text Classification Using KDC 4007 Dataset. In: Barolli, L., Zhang, M., Wang, X. (eds) Advances in Internetworking, Data & Web Technologies. EIDWT 2017. Lecture Notes on Data Engineering and Communications Technologies, vol 6. Springer, Cham. https://doi.org/10.1007/978-3-319-59463-7_19

DOI: https://doi.org/10.1007/978-3-319-59463-7_19
Published: 28 May 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59462-0
Online ISBN: 978-3-319-59463-7
eBook Packages: Engineering, Engineering (R0)
