An Arabic text categorization approach using term weighting and multiple reducts
Abstract
Text categorization is the process of assigning a predefined category label to an unlabeled document based on its content. One of the challenges of automatic text categorization is the high dimensionality of data, which may affect the performance of the categorization model. This paper proposes an approach for the categorization of Arabic text based on term weighting and the reduct concept of the rough set theory to reduce the number of terms used to generate the classification rules that form the classifier. The paper proposes a multiple minimal reduct extraction algorithm that improves the Quick Reduct algorithm. The multiple reducts are used to generate the set of classification rules that represent the rough set classifier. To evaluate the proposed approach, an Arabic corpus of 2700 documents distributed over nine categories is used. In the experiments, we compared the results of the proposed approach when using multiple and single minimal reducts. The results showed that the proposed approach achieved an accuracy of 94% when using multiple reducts, outperforming the single reduct method, which achieved an accuracy of 86%. The results also showed that the proposed approach outperforms both the KNN and J48 algorithms regarding classification accuracy on the dataset at hand.
Keywords
Rough set theory · Arabic text categorization · Reducts extraction · Single reduct · Multiple reducts

1 Introduction
Due to the rapid increase in textual information available on the internet, the process of getting relevant information becomes more difficult. Text mining, defined as the process of extracting knowledge from huge amounts of textual data, is one of the techniques that can be used to overcome the problem of the increasing size of textual information and to further facilitate extracting useful information from text. Researchers in the fields of data mining and information retrieval have investigated different types of text mining tasks, such as text categorization (Al-Radaideh et al. 2011; Ghareb et al. 2018), text clustering (Abualigah et al. 2018), and text summarization (Al-Radaideh and Bataineh 2018).
Text categorization (TC) is the process of assigning a predefined category (label) to an unlabeled document based on its content (Lam et al. 1999). In recent years, and with the rapid increase in the size of information on the Web, text categorization has attracted the attention of many researchers to use TC as a way to simplify the access to useful information. Text categorization has been used for several applications such as spam filtering, improving the performance of information retrieval systems, and in medical information systems (Lam et al. 1999; Abualigah and Hanandeh 2015; Wang et al. 2006; Zhang et al. 2009).
In practice, the typical text categorization system consists of four main phases, where each phase may further include several other steps. These phases start with the text preprocessing phase, which comprises several steps that aim to prepare the text in the documents for use by the categorization model. Usually, this phase includes three processes: text tokenization, stopword removal, and term weighting.
One of the most frequent challenges in automatic text categorization is the high dimensionality of terms, which may affect the performance of the categorization model. A good solution to overcome this challenge is to use feature selection methods, which select a subset of the data that best represents the whole data (Rasim and Telceken 2018; Abualigah and Khader 2017; Abualigah et al. 2018). Rough set theory (RST) is one of the tools that have been used by researchers for dimensionality reduction and feature selection through the reduct concept, which allows representing the whole data using only part of that data (Pawlak 1982).
The Arabic language needs careful preprocessing since it has some features that are different from other languages. As summarized by Duwairi and ElOrfali (2014), the Arabic language differs from other languages in its orthographical nature where the words are written from right to left and the shape of letters changes according to their position in the word. The Arabic language includes three major parts of speech (noun, verb, and particle). Nouns and verbs are obtained from roots by applying templates to the roots in order to generate stems and then by introducing prefixes and suffixes (Darwish 2002).
The Arabic language has its challenges too. Arabic does not support letter capitalization, and it has no strict punctuation rules. The tokenization process is not a straightforward job for Arabic. The Arabic language is a morphologically rich language; this issue also complicates the tokenization process. Arabic words are compact, where a word can correspond to an entire phrase or sentence; for example, one Arabic word may contain four tokens. Moreover, Arabic dialects vary from one Arab country to another. All these challenges may affect the results of the next process of the text such as classification, or sentiment analysis.
Although Arabic text categorization has been adopted by many researchers who have implemented several categorization algorithms, this field still needs more effort to be enriched with new and improved algorithms. After reviewing the literature, we noticed that very little research related to Arabic text categorization has considered using rough set theory for the categorization process. This motivated us to address this issue.
The main problem addressed in this research is how to build an Arabic text categorization model using the power of rough set theory in feature subset selection and dimension reduction to achieve an acceptable classification performance. As will be mentioned later, the reduct is the main concept of the rough set theory that is used for feature subset selection. One of the well-known rough set theory-based subset selection methods, called Quick Reduct (Chouchoulas 1999), produces a single reduct for the dataset.
This research also addresses the question of what happens if multiple reducts are used instead of a single reduct. The hypothesis of this research is that generating multiple reducts will yield better classification accuracy. The reduct generation step is used as a second term selection step on top of the TF–IDF step, which is used for term weighting and selection.
The main objectives of this paper can be summarized in the following points: (1) investigating the possibility of building a rough set theory-based model for the categorization of Arabic text that achieves acceptable performance in comparison with other classification algorithms; (2) proposing an improved version of the Quick Reduct method to generate multiple reducts.
The rest of this paper is organized as follows: Section 2 presents some preliminaries of rough set theory. Section 3 reviews some approaches that have been applied for the categorization of Arabic text and also reviews the use of rough set theory in text categorization and feature selection. Section 4 describes the proposed method for building the rough setbased Arabic text categorization model with a detailed description of its components. Section 5 presents and discusses the experimental results of the proposed approach, and Sect. 6 presents the conclusions, and future work.
2 Rough set theory preliminaries
The rough set theory is a mathematical tool proposed by Z. Pawlak in the early 1980s for knowledge discovery and data analysis that can be used for the analysis of vague, imprecise, uncertain, and incomplete data (Pawlak 1982). The rough set theory has the advantage of serving as a tool to reduce the number of attributes and to discover data dependencies (Velayutham and Thangavel 2011). This advantage has made rough set theory applicable in many domains, such as decision support systems, engineering, banking, and medicine, among other applications (Pawlak 1991).
Han et al. (2012) summarized the definition of rough set concept as “A rough set definition for a given class, X, is approximated by two sets—a lower approximation of X and an upper approximation of X. The lower approximation of X consists of all the data tuples that, based on the knowledge of the attributes, are certain to belong to X without ambiguity. The upper approximation of X consists of all the tuples that, based on the knowledge of the attributes, cannot be described as not belonging to X. The lower and upper approximations for a class X are shown in Fig. 1”.
2.1 Decision table in rough set theory
An example of a decision table
\(x_{i} \in U\)  \(a_{1}\)  \(a_{2}\)  \(a_{3}\)  \(a_{4}\)  d 

\(x_{1}\)  1  0  2  2  0 
\(x_{2}\)  0  1  1  1  2 
\(x_{3}\)  2  0  0  1  1 
\(x_{4}\)  1  1  0  2  2 
\(x_{5}\)  1  0  2  0  1 
\(x_{6}\)  2  2  0  1  1 
\(x_{7}\)  2  1  1  1  2 
\(x_{8}\)  0  1  1  0  1 
2.2 The reduct concept
As defined by Pawlak (1991), a reduct is a minimal subset of attributes \(B \subset A\) of the decision table that can be used to discern each object in the table. In practice, the reducts extraction process is one of the most important concepts in rough set theory and may require the longest time and effort, as the computation is highly influenced by the number of attributes in the decision table. Reducts computation has been proven to be an NP-hard problem that requires time exponential in the number of attributes in the decision table (Skowron and Rauszer 1992).
The reduct concept represents one of the most important concepts in the application of rough set theory to data mining. Reducts extraction methods allow selecting a subset of the attribute set instead of dealing with the whole set of attributes (Lin 1996). Rough set theory provides two types of reducts: full reducts and object reducts. Full reducts can be used to discern all objects from each other, whereas an object reduct is the minimal set of attributes that discerns a particular object from all other objects. Several methods have been proposed for reducts extraction, such as the exhaustive search algorithm, heuristic search algorithms, Johnson's algorithm, feature weighting algorithms, and genetic algorithms (Al-Radaideh et al. 2005; Zhong et al. 2001).
2.3 Reducts extraction
The following are some concepts of the rough set theory (Pawlak 1991) that can be used to extract reducts. A complete example of how to compute and use these concepts to extract reducts can be found in (Chouchoulas 1999).
(1) The total function f(x, a) denotes the value of attribute \(a \in A\) in object \(x \in U\). The function f(x, a) defines an equivalence relation over U: concerning a given attribute a, the function partitions the universe into a set of pairwise disjoint subsets of U. Given a subset of the set of attributes \(P \subset A\), two objects x and y in U are indiscernible with respect to P if and only if \(f(x, q) = f(y, q)~\forall q \in P\).
(2) The indiscernibility relation IND(P) denotes the indiscernibility relation for a subset \(P \subset A\). U/IND(P) is used to denote the partition of U given IND(P), and is calculated as
$$\begin{aligned} U/\hbox {IND}~(P) = \otimes \{q \in P : U/\hbox {IND} (q)\}, \end{aligned}$$
where
$$\begin{aligned} A \otimes B = \{X \cap Y: \forall X \in A, \forall Y \in B, X \cap Y \ne \emptyset \} \end{aligned}$$
(3) Given an equivalence relation IND(P), the lower and upper approximations of a set \(Y \subseteq U\) are defined as follows:
$$\begin{aligned} \text {Lower approximation:}~\underline{P}Y= & {} \cup \{X: X \in U/\hbox {IND}~(P), X \subseteq Y\}\\ \text {Upper approximation:}~\overline{P}Y= & {} \cup \{X: X \in U/\hbox {IND}~(P), X \cap Y \ne \emptyset \} \end{aligned}$$
The other dimension in reduction is to keep only those attributes that preserve the indiscernibility relation and, consequently, the set approximation. The rejected attributes are redundant since their removal cannot worsen the classification. There are usually several such subsets of attributes, and those which are minimal are called reducts.
(4) Pawlak (1991) defined the degree of dependency of a set Q of decision attributes on a set of conditional attributes P as follows:
$$\begin{aligned} \gamma _{P} (Q) = \frac{\parallel \hbox {POS}_{P} \left( Q\right) \parallel }{\parallel U \parallel } \end{aligned}$$
(5) The significance of an attribute is defined by calculating the change of dependency when removing the attribute from the set of the considered conditional attributes. The significance of an attribute \(\sigma \) can be defined as follows. Given P, Q and an attribute \(x \in P\):
$$\begin{aligned} \sigma _{P}(Q, x) = \gamma _{P}(Q) - \gamma _{P \setminus \{x\}}(Q) \end{aligned}$$
(6) Attribute reduction (reduct) involves removing attributes that have no significance to the classification at hand. The dataset may have more than one attribute reduct set. Recalling that D is the set of decision attributes and C is the set of conditional attributes, the set of reducts R is defined as follows:
$$\begin{aligned} R = \{X: X \subseteq C, \gamma _{C} (D) = \gamma _{X} (D)\} \end{aligned}$$
(7) The minimal reduct \(R_{\mathrm{min}} \subseteq R\) is the set of shortest reducts, defined as follows:
$$\begin{aligned} R_{\mathrm{min}} = \{X: X \in R, \forall Y \in R, \parallel X \parallel \le \parallel Y \parallel \} \end{aligned}$$
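To make these definitions concrete, the following is a minimal Python sketch (not the paper's implementation) that applies them to the decision table of Table 1: it computes the partition U/IND(P), the lower and upper approximations, the dependency degree \(\gamma \), and attribute significance \(\sigma \):

```python
# Table 1 rows as (a1, a2, a3, a4, d) tuples for objects x1..x8.
ROWS = [(1, 0, 2, 2, 0), (0, 1, 1, 1, 2), (2, 0, 0, 1, 1), (1, 1, 0, 2, 2),
        (1, 0, 2, 0, 1), (2, 2, 0, 1, 1), (2, 1, 1, 1, 2), (0, 1, 1, 0, 1)]
COND, DEC = [0, 1, 2, 3], 4  # conditional attribute indices and decision index

def partition(attrs):
    """U/IND(P): blocks of objects that agree on every attribute in attrs."""
    blocks = {}
    for i, row in enumerate(ROWS):
        blocks.setdefault(tuple(row[a] for a in attrs), set()).add(i)
    return list(blocks.values())

def lower_upper(attrs, Y):
    """Lower and upper approximations of a set of objects Y w.r.t. IND(attrs)."""
    lower = set().union(*([b for b in partition(attrs) if b <= Y] or [set()]))
    upper = set().union(*([b for b in partition(attrs) if b & Y] or [set()]))
    return lower, upper

def gamma(P):
    """Degree of dependency gamma_P(d): size of the positive region over |U|."""
    dec = partition([DEC])
    pos = [b for b in partition(P) if any(b <= db for db in dec)]
    return sum(len(b) for b in pos) / len(ROWS)

def sigma(P, x):
    """Significance of attribute x: drop in dependency when x is removed."""
    return gamma(P) - gamma([a for a in P if a != x])
```

On this table, for example, \(\gamma \) of the full conditional attribute set is 1.0, and removing a1 does not change it (\(\sigma = 0\)), so a1 is dispensable.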
2.4 The quick reduct algorithm
Several approaches have been proposed in the literature to generate the reduct sets. The Quick Reduct algorithm was proposed by Chouchoulas (1999) and used for the purpose of text classification. The Quick Reduct algorithm attempts to find reducts without the need to generate all possible subsets exhaustively. For this reason, the Quick Reduct algorithm is considered a good option to overcome the NP-hard nature of the problem, whose time is exponential in the number of attributes in the decision table. Using the Quick Reduct algorithm, a reduct may be found by searching the first set of generated subsets at the lowest level.
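The greedy idea can be sketched as follows, assuming the usual formulation of Quick Reduct: start from an empty set R and repeatedly add the attribute that most increases the dependency \(\gamma \), until R has the same dependency as the full attribute set. The data is the example table of Table 1:

```python
# Table 1 rows as (a1, a2, a3, a4, d); indices 0..3 are a1..a4, index 4 is d.
ROWS = [(1, 0, 2, 2, 0), (0, 1, 1, 1, 2), (2, 0, 0, 1, 1), (1, 1, 0, 2, 2),
        (1, 0, 2, 0, 1), (2, 2, 0, 1, 1), (2, 1, 1, 1, 2), (0, 1, 1, 0, 1)]
COND, DEC = [0, 1, 2, 3], 4

def gamma(P):
    """Dependency gamma_P(d): fraction of objects whose P-block is consistent
    with a single decision value."""
    blocks, dec = {}, {}
    for i, row in enumerate(ROWS):
        blocks.setdefault(tuple(row[a] for a in P), set()).add(i)
        dec.setdefault(row[DEC], set()).add(i)
    pos = sum(len(b) for b in blocks.values()
              if any(b <= db for db in dec.values()))
    return pos / len(ROWS)

def quick_reduct():
    """Grow R greedily by the attribute that most increases gamma,
    stopping once R reaches the dependency of the full attribute set."""
    target, R = gamma(COND), []
    while gamma(R) < target:
        R.append(max((a for a in COND if a not in R),
                     key=lambda a: gamma(R + [a])))
    return R
```

On the example table this returns a single reduct, {a4, a2} (indices [3, 1]); the greedy search commits to one reduct even when others exist.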
3 Related work
This section reviews several Arabic text categorization approaches. The first subsection reviews some approaches that have been implemented for the categorization of Arabic text, and the second subsection reviews the use of rough set theory in text categorization.
3.1 Arabic text categorization
In the last few years, the importance of text categorization for the Arabic language has attracted many researchers. For example, starting from 2006, Al-Shalabi et al. (2006) applied the K-nearest neighbor method for the categorization of Arabic text, Duwairi (2006) proposed a distance-based approach for the categorization of Arabic text, and Syiam et al. (2006) presented an intelligent system that used K-nearest neighbor and Rocchio classifiers for categorizing documents collected from some Egyptian newspapers.
Mesleh (2007) proposed an approach using the support vector machine (SVM) for Arabic text categorization, then Duwairi (2007) compared three classifiers, Naïve Bayes (NB), K-nearest neighbor (KNN), and distance-based, for categorizing Arabic text. After that, Hmeidi et al. (2008) evaluated the support vector machine and KNN classifiers regarding their ability to categorize Arabic text.
Duwairi et al. (2009) applied the K-nearest neighbor algorithm for the categorization of Arabic text to compare among three stemming techniques: full stemming, light stemming, and word clusters. After that, a text categorization approach based on artificial neural networks was proposed by Harrag and El-Qawasmeh (2009), and Thabtah et al. (2009) proposed an approach based on the Naïve Bayesian method for Arabic text categorization, using Chi-square for feature selection and Naïve Bayesian for categorization. Gharib et al. (2009) applied the support vector machine (SVM) for Arabic text categorization, and Harrag et al. (2009) proposed an approach based on a decision tree algorithm for the categorization of Arabic text.
Noaman et al. (2010) applied the Naïve Bayesian classifier for the categorization of Arabic text. Al-Dhaheri (2010) proposed an approach based on the artificial neural network (ANN) algorithm and a feature reduction technique that combines the feature selection methods term frequency–inverse document frequency (TF–IDF) and document frequency–category frequency (DF–CF) with principal component analysis (PCA). Harrag et al. (2010) applied the back propagation neural network algorithm with the objective of comparing five dimension reduction techniques, including full stemming, light stemming, document frequency, term frequency–inverse document frequency, and latent semantic indexing.
In 2011, several studies were published dealing with Arabic text categorization. Alsaleem (2011) applied support vector machine and Naïve Bayesian classifiers for Arabic text categorization, using Arabic documents collected from some Saudi Arabian newspapers. Al-Salemi and Aziz (2011) proposed an approach based on Bayesian learning models for the categorization of Arabic text. Al-Radaideh et al. (2011) proposed an approach based on association rule mining for the categorization of Arabic text. Hussien et al. (2011) applied the sequential minimal optimization (SMO), Naïve Bayesian, and J48 (C4.5) algorithms to compare these three algorithms and find the most applicable one for the categorization of Arabic text. Chantar and Corne (2011) proposed a hybrid approach based on binary particle swarm optimization and K-nearest neighbor algorithms for the categorization of Arabic text. Wahbeh et al. (2011) studied the effect of stemming on Arabic text classification, applying three classifiers, sequential minimal optimization (SMO), J48, and Naïve Bayes (NB), to classify stemmed and non-stemmed Arabic textual datasets.
Azara et al. (2012) proposed an approach for the categorization of Arabic text based on the learning vector quantization (LVQ) neural network algorithm. The algorithm is based on the Kohonen self-organizing map (SOM), which can organize big document collections according to textual similarities. Al-Diabat (2012) compared some of the rule-based algorithms for the categorization of Arabic text. For this purpose, the author compared four well-known rule-based algorithms, One Rule, rule induction (RIPPER), decision trees (C4.5), and hybrid (PART), to select the most applicable algorithm among them for categorizing Arabic text.
Hmeidi et al. (2015) presented a survey and a comparative study of several text categorization approaches for Arabic text. In addition, Al-Radaideh and Al-Khateeb (2015) applied the associative classifier approach to classify Arabic articles related to the medical domain. The experimental results reported by the authors showed that the associative classification approach outperformed the C4.5, RIPPER, and SVM algorithms based on a corpus of 1000 Arabic medical articles that belong to 10 different diseases (classes).
Ghareb et al. (2016) proposed a hybrid feature selection approach which combines the advantages of several filter feature selection methods with an enhanced version of genetic algorithm. Recently Ghareb et al. (2018) proposed three enhanced filter feature selection methods that can be used for text classification. The methods are: Category Relevant Feature Measure, Modified Category Discriminated Measure, and Odd Ratio2.
3.2 Text categorization using rough set theory
With the growing interest in text categorization, several machine learning algorithms have been proposed, and the rough set theory is one of those that have been used by some researchers for text categorization. For example, Bao et al. (2001) proposed a hybrid text categorization model based on rough set theory and latent semantic indexing. The objective of the proposed model was to overcome the challenge of the high dimensionality of data. The results showed that using rough set theory produced a set of rules small enough to be understood by humans, and that using latent semantic indexing to group keywords yielded a larger improvement than using the rough set-based approach alone.
Jensen (2005) proposed an approach that combines rough sets and fuzzy sets for feature selection, named fuzzy-rough feature selection (FRFS), with the aim of reducing dimensionality. Zhao and Zhang (2005) proposed a model based on the rough set theory for the classification of emails into three categories: spam, non-spam, and suspicious. The experimental comparison between the rough set classifier and the Naïve Bayesian classifier showed that the rough set-based model reduced the error rate of classifying non-spam emails as spam more than the Naïve Bayesian classifier did.
Dai et al. (2008) presented an approach based on rough set theory and a modified version of the Chi-square statistic for text categorization. They used the reduct concept of rough set theory for generating rules that can be used for text categorization. The comparison with other classification methods showed that the rough set classifier produced the highest accuracy among the three compared classifiers.
Yin et al. (2008) proposed an approach for text categorization based on rough set theory. In the proposed approach, the authors initially created a decision table, then all terms in the decision table were weighted, the features were selected, and finally the classification rules were extracted. The results showed that the proposed approach produced higher accuracy in comparison with the support vector machine (SVM) classifier. Chen and Liu (2008) proposed a model that combines rough set theory and support vector machine for text categorization. In the proposed model, rough set theory was employed as an attribute reduction method, where the reduced set of attributes was used as input to the support vector machine classifier. The experimental results showed that the model produced the highest accuracy in comparison with several other classifiers.
Thangavel and Pethalakshmi (2009) reviewed some techniques for dimensionality reduction under the rough set theory environment. They also reviewed the hybridization of rough sets with fuzzy sets, neural networks, and meta-heuristic algorithms. The performance of the algorithms was discussed in connection with the classification task.
In the field of using rough set theory for the categorization of Arabic text, Yahia (2011) presented a study that reviewed some approaches that were implemented for the categorization of Arabic text, such as Naïve Bayesian (NB), K-nearest neighbor (KNN), and support vector machine (SVM). The author focused on using rough set theory for Arabic text categorization and reported that rough set-based reasoning in Arabic text categorization is needed. Yahia (2011) reported a plan to build a model for the categorization of Arabic text based on rough set theory principles for feature selection and support vector machine (SVM) for classification, or modified Chi-square for feature selection and rough sets for classification. No experiments were reported in that short paper.
Another work for Arabic text was introduced by Al-Radaideh and Twaiq (2014), who evaluated two reduct computation methods for Arabic sentiment categorization: the Johnson reducer and the genetic algorithm-based reducer. To evaluate the two methods, they used a corpus of Egyptian dialect tweets and measured the performance of the two algorithms in terms of the number of generated reducts and the number of generated rules for each algorithm. The reported preliminary results showed that the classification process using genetic algorithms for reduct generation achieved an accuracy of 55%, which outperformed the classification process using the Johnson reducer.
Recently, Al-Radaideh and Al-Qudah (2017) investigated using the rough set theory concepts for feature selection for sentiment analysis of Arabic text. The work is an extension of the previous work of Al-Radaideh and Twaiq (2014). The study investigated four reduct computation algorithms and two rule generation algorithms. The conclusion of the work indicates that using rough set theory concepts for feature selection is appealing and can achieve good results in comparison with using the full set of terms.
4 The proposed approach
4.1 Building the rough set classifier (phase 1)
This phase consists of two main steps: the first is text preprocessing, and the second is building the rough set theory classifier.
In general, the text preprocessing step includes document tokenization, stopword removal, term weighting, and document representation. The result of this phase is used for generating a set of rules based on the rough set theory that can be used for categorizing testing documents. In this phase, the rough set theory is used for feature selection via the extraction of reducts. This process selects a subset of terms that best represents the whole document. After the reducts are extracted, the set of minimal reducts is used for generating a set of rules that constructs the rough set model (classifier), represented as a set of if–then rules.
4.2 Evaluating the classifier (phase 2)
 (1)
Using the built model to categorize the test dataset and evaluating the model using some well-known evaluation metrics that include precision, recall, F-measure, and accuracy.
 (2)
Comparing the results of the proposed model with some traditional categorization models.
4.3 Text preprocessing

Document tokenization is the process of breaking up a sequence of strings into meaningful pieces called tokens such as words, keywords, phrases, and symbols.

Stopword removal is the process of removing punctuation marks, formatting tags, digits, prepositions, pronouns, conjunctions, and auxiliary verbs.
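As an illustration of these two preprocessing steps, the following sketch tokenizes a string and then filters a deliberately tiny, illustrative Arabic stopword list; a real system would use a full stopword lexicon:

```python
import re

# Illustrative stopword list; a real system would use a full Arabic
# stopword lexicon (prepositions, pronouns, particles, ...).
STOPWORDS = {"في", "من", "على", "هذا"}

def preprocess(text):
    """Tokenize into letter-only tokens (dropping digits and punctuation),
    then remove stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text)
    return [t for t in tokens if t not in STOPWORDS]
```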
4.4 Term weighting using TF–IDF
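This step relies on the standard TF–IDF weighting, where the weight of a term t in document d is its term frequency multiplied by the inverse document frequency log(N/df(t)). A minimal sketch, assuming this standard formulation:

```python
import math
from collections import Counter

def tfidf(docs):
    """weight(t, d) = tf(t, d) * log(N / df(t)), where df(t) is the number
    of documents that contain the term t at least once."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]
```

Note that a term appearing in every document gets weight 0 and is effectively discarded, which is the dimensionality-reducing effect this step is used for.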
4.5 Building the RST classifier
To build the classifier, the dataset is partitioned into a training set and a testing set. The training documents are used as input to the rough set training phase. In general, the rough set training phase includes steps such as building the decision table, reducts extraction, and classification rules generation.
Sample of the decision table
4.6 The proposed reducts extraction method
To generate the set of multiple reducts, we propose an improvement of the Quick Reduct algorithm for extracting reducts from the decision table (D). The Quick Reduct algorithm returns only a single minimal reduct, which may not be adequate for text categorization. Hu et al. (2004) noticed that using a single minimal reduct for categorization may not be an appropriate solution in most cases. The multiple reducts concept, which was used in other works such as Ishii et al. (2010), can produce better categorization performance than a single minimal reduct. In the proposed approach, we modified the Quick Reduct algorithm to return the set of all minimal reducts, instead of only a single minimal reduct, and used them to generate the rules for text categorization.
Sample of generated rules
If R is still empty \((R = \{\})\), the algorithm finds the significance of all possible subsets of size 2 and compares the significance of each generated subset with the calculated significance of the entire decision system. For each subset whose significance equals that of the entire decision system, the algorithm sets R to this subset \((R = \{\hbox {subset}\})\) and adds R to the list of minimal reducts. If the list of minimal reducts is not empty, the search stops; otherwise, the same procedure is repeated for subsets of size 3 up to size \(n-1\). Finally, if R is still empty, the algorithm takes the subset of size n, adds it to the list of minimal reducts, and returns the list of minimal reducts (LMR); in this case, the list contains only a single reduct.
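The level-wise procedure described above can be sketched as follows, illustrated on the small example table of Table 1 rather than a real term decision table: at the first subset size that reaches the dependency of the full attribute set, all such subsets are returned as the list of minimal reducts (LMR):

```python
from itertools import combinations

# Stand-in decision table (Table 1 rows); in the approach itself the
# rows would come from the TF-IDF-weighted term decision table D.
ROWS = [(1, 0, 2, 2, 0), (0, 1, 1, 1, 2), (2, 0, 0, 1, 1), (1, 1, 0, 2, 2),
        (1, 0, 2, 0, 1), (2, 2, 0, 1, 1), (2, 1, 1, 1, 2), (0, 1, 1, 0, 1)]
COND, DEC = [0, 1, 2, 3], 4

def gamma(P):
    """Dependency of the decision on P: size of the positive region over |U|."""
    blocks, dec = {}, {}
    for i, row in enumerate(ROWS):
        blocks.setdefault(tuple(row[a] for a in P), set()).add(i)
        dec.setdefault(row[DEC], set()).add(i)
    return sum(len(b) for b in blocks.values()
               if any(b <= db for db in dec.values())) / len(ROWS)

def multiple_minimal_reducts():
    """Level-wise search: at the first subset size k where some subsets
    match the dependency of the whole attribute set, return ALL of them."""
    target = gamma(COND)
    for k in range(1, len(COND)):
        lmr = [list(s) for s in combinations(COND, k)
               if gamma(list(s)) == target]
        if lmr:
            return lmr
    return [list(COND)]    # fall back: the full set is the single reduct
```

On the example table, two minimal reducts are found at size 2, {a2, a4} and {a3, a4}, whereas the original Quick Reduct would commit to only one of them.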
4.7 Rules generation/building the classifier
Rules generation is the process of using the contents of the attribute reducts produced in the previous step to generate the set of classification rules that form the classifier. In the proposed approach, we used all minimal attribute reducts for generating the set of rules, which are kept in a dictionary, or as a list, in the form of if–then rules. This set of rules represents the categorization model that is used for categorizing Arabic documents.
The majority voting method was used to avoid the conflict that may occur among rules, because some of the documents may be matched by rules of more than one category. The proposed approach is designed to assign only one category from the predefined categories to each document. Table 3 shows a sample of the generated rules.
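A minimal sketch of majority voting over fired rules; the rules and term conditions below are illustrative stand-ins, not the rules generated by the approach:

```python
from collections import Counter

# Hypothetical if-then rules: each maps required term values to a
# category label (illustrative stand-ins for the generated rules).
RULES = [
    ({"economy": 1, "market": 1}, "Economy"),
    ({"market": 1}, "Economy"),
    ({"match": 1}, "Sport"),
    ({"market": 1, "match": 1}, "Sport"),
]

def classify(doc, rules=RULES):
    """Fire every rule whose conditions all hold in the document, then
    resolve conflicts by majority vote over the fired rules' categories."""
    votes = Counter(cat for cond, cat in rules
                    if all(doc.get(t) == v for t, v in cond.items()))
    return votes.most_common(1)[0][0] if votes else None
```

A document matching rules from two categories is assigned the category with the most fired rules, so exactly one predefined category is returned per document.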
4.8 Model evaluation

True positive (TP) refers to the number of documents that were correctly classified by the classification model (classifier) as category A and actually belong to category A. This actual category is the original category provided with the document.

True negative (TN) represents the number of documents that the classifier correctly classified as not belonging to category A (i.e., belonging to other categories) and that actually do not belong to that category.

False positive (FP) represents the number of documents that the classifier incorrectly classified under category A, although they do not actually belong to category A.

False negative (FN) represents the number of documents that the classifier misclassified to another category, although the actual class indicates that they should be classified under category A.
 Precision can be thought of as a measure of exactness; which refers to the percentage of test documents that were correctly classified as category A, and they actually belong to category A.$$\begin{aligned} \hbox {Precision} = (\hbox {TP})/(\hbox {TP}+\hbox {FP}) \end{aligned}$$
 Recall is a measure of completeness; it refers to the percentage of test documents that actually belong to category A and were correctly classified as category A.$$\begin{aligned} \hbox {Recall} = (\hbox {TP})/(\hbox {TP} + \hbox {FN}) \end{aligned}$$
 F-measure is an alternative way to use precision and recall by combining them into a single measure. The F-measure is the harmonic mean of precision and recall.$$\begin{aligned} F\hbox {-measure} = (2 * \hbox {Precision} * \hbox {Recall}) / (\hbox {Precision} + \hbox {Recall}) \end{aligned}$$
 Accuracy of a classifier on a given test set of documents is the percentage of test documents that are correctly classified by the classifier, either as belonging to category A or to other categories.$$\begin{aligned} \hbox {Accuracy} = (\hbox {TP} + \hbox {TN})/(\hbox {TP} + \hbox {TN} + \hbox {FP} + \hbox {FN}) \end{aligned}$$
Confusion matrix structure

Actual category     Predicted: Category A   Predicted: Other categories
Category A          TP                      FN
Other categories    FP                      TN
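The four metrics above can be computed directly from the confusion-matrix counts. The following is a minimal sketch (the function name and the sample counts are illustrative, not taken from the paper):

```python
def metrics(tp, tn, fp, fn):
    """Compute precision, recall, F-measure, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy

# Hypothetical counts for one category of a 270-document test fold:
p, r, f, a = metrics(tp=90, tn=170, fp=6, fn=4)
print(round(p, 2), round(r, 2), round(f, 2), round(a, 2))  # 0.94 0.96 0.95 0.96
```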
To compute these metrics, we used two main accuracy estimation methods: K-fold cross-validation (CV) and percentage split. In K-fold cross-validation, the set of documents is partitioned into K folds; K-1 folds are used to train the classifier and the remaining fold is used for testing. The process is repeated K times, each time selecting a different fold for testing. After the K runs, the evaluation metrics are averaged to obtain the final value of each metric.

In the percentage split method, the set of documents is simply partitioned at random into two parts. The first partition, usually two-thirds of the document set, is used to train the classification method and build the classifier, while the other partition is used to evaluate the classifier by computing the evaluation metrics.
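The K-fold partitioning described above can be sketched as follows (a stdlib-only illustration; the function name and the fixed seed are our own, not from the paper):

```python
import random

def kfold_indices(n_docs, k, seed=42):
    """Shuffle document indices and partition them into k folds;
    each fold serves once as the test set while the rest train the classifier."""
    idx = list(range(n_docs))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# With the paper's setting of 2700 documents and 10 folds,
# every test fold holds exactly 270 documents:
sizes = [len(test) for _, test in kfold_indices(2700, 10)]
print(sizes)
```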
5 Experiments and discussion of results
Table 5  Results using single reduct and the tenfold CV method

Category  Precision  Recall  F-measure

Art  0.89  0.90  0.89 
Economy  0.76  0.92  0.84 
Health  0.92  0.96  0.94 
Law  0.74  0.88  0.80 
Literature  0.84  0.78  0.80 
Politics  0.86  0.76  0.80 
Religion  0.92  0.84  0.88 
Sport  0.97  0.91  0.94 
Technology  0.98  0.79  0.87 
Table 6  Results using multiple reducts and the tenfold CV method

Category  Precision  Recall  F-measure

Art  0.98  0.97  0.97 
Economy  0.90  0.98  0.93 
Health  0.98  0.97  0.98 
Law  0.85  0.95  0.89 
Literature  0.94  0.93  0.93 
Politics  0.91  0.89  0.90 
Religion  0.97  0.87  0.92 
Sport  1.00  0.99  0.99 
Technology  0.98  0.92  0.95 
5.1 The corpus
Table 7  Coverage of the proposed approach using multiple and single reducts

Fold#  Multiple reducts: # classified  Multiple reducts: # unclassified  Single reduct: # classified  Single reduct: # unclassified
1  270  0  221  49 
2  270  0  218  52 
3  270  0  245  25 
4  270  0  225  45 
5  270  0  223  47 
6  270  0  237  33 
7  270  0  241  29 
8  270  0  228  42 
9  270  0  236  34 
10  270  0  229  41 
Total  2700  0  2303  397 
Average  2700/2700 = 100%  0/2700 = 0%  2303/2700 = 85%  397/2700 = 15%
5.2 Experiment using single reduct and multiple reducts
In this section, we applied the proposed approach to the testing documents using a single reduct in order to compare the results with those obtained using multiple reducts. Table 5 shows the results of the proposed approach using a single reduct, with tenfold CV used for splitting the dataset. The highest precision, 98%, was achieved for the Technology category, whereas the lowest, 74%, was achieved for the Law category. The overall classification accuracy of the classifier built with the single reduct method was 86%.

Table 6 shows the results when using the multiple reducts approach. Most categories achieved a precision greater than 90%; the lowest precision was obtained for the Law category and the highest for the Sport category. In terms of recall, Table 6 shows that the lowest value was obtained for the Religion category and the highest for the Sport category. The overall classification accuracy of the classifier built with multiple reducts was 94%.

Table 7 compares the multiple and single reduct methods in terms of the number of classified and unclassified documents in each fold when using the tenfold CV method. Figure 6 shows the F-measure results of the proposed approach using multiple and single reducts.

Referring to the results in Table 7, the rules generated from a single reduct were not sufficient to categorize all testing documents: 15% of the testing documents did not match any rule. On the other hand, the results in Table 6 indicate that the proposed approach with multiple reducts assigned most documents to their correct, actual categories.

The results in Table 7 thus show that multiple reducts produced better coverage than a single reduct: the rules generated from multiple reducts were sufficient to classify all testing documents, whereas the rules generated from a single reduct were not.

From Fig. 6 it can be noticed that the F-measure obtained with multiple reducts is better than that obtained with a single reduct in all categories. In addition, the single reduct results cover only 85% of the testing documents, while the remaining 15% were not assigned to any category.
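The coverage difference can be illustrated with a minimal sketch of rule-based classification over several reducts: a document that matches no rule from one reduct may still match a rule generated from another. The rule representation and names below are our own illustration, not the paper's implementation:

```python
def classify(doc_terms, rule_sets):
    """Try the rules generated from each reduct in turn; return the first match.

    Each rule is a (required_terms, category) pair; rule_sets holds one rule
    list per minimal reduct. A document is unclassified only if no rule from
    any reduct matches it.
    """
    for rules in rule_sets:
        for terms, category in rules:
            if terms <= doc_terms:      # all rule terms occur in the document
                return category
    return None                          # unclassified

# Hypothetical toy rules from two reducts:
rules_reduct1 = [({"match", "goal"}, "Sport")]
rules_reduct2 = [({"market", "trade"}, "Economy")]
print(classify({"goal", "match", "team"}, [rules_reduct1, rules_reduct2]))  # Sport
print(classify({"market", "trade"}, [rules_reduct1, rules_reduct2]))        # Economy
```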
5.3 Comparison with other classification methods
To compare the results of the proposed approach with those of other text categorization algorithms, we used the WEKA toolkit (Hall et al. 2009) on the same dataset used by the proposed approach. We used two well-known classification methods: the K-nearest neighbor (KNN) algorithm and the J48 algorithm, which is the Java implementation of the well-known C4.5 decision tree method.
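The KNN baseline labels a test document by majority vote among its k most similar training documents over the full term space. The following is an illustrative pure-Python re-implementation of that idea, not the WEKA code used in the experiments; the toy documents and term-frequency representation are hypothetical:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term-frequency dictionaries."""
    dot = sum(freq * b.get(term, 0) for term, freq in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_predict(doc, training, k=3):
    """Label a document by majority vote among its k most similar training docs."""
    ranked = sorted(training, key=lambda pair: cosine(doc, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set of (term-frequency vector, category) pairs:
train = [(Counter("goal match team".split()), "Sport"),
         (Counter("goal striker match".split()), "Sport"),
         (Counter("market trade price".split()), "Economy")]
print(knn_predict(Counter("match goal referee".split()), train, k=3))  # Sport
```

Unlike the proposed approach, which classifies with rules built from a reduced term set, KNN compares documents over all terms, which is one reason its per-category results vary widely in Table 8.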
Table 8  Results of K-nearest neighbor using the tenfold CV method

Category  Precision  Recall  F-measure

Art  0.67  0.41  0.51 
Economy  0.89  0.48  0.62 
Health  0.76  0.88  0.82 
Law  0.56  0.73  0.64 
Literature  0.95  0.23  0.37 
Politics  1.00  0.13  0.23 
Religion  0.24  0.97  0.39 
Sport  1.00  0.76  0.86 
Technology  0.99  0.24  0.39 
Table 9  Results of J48 using the tenfold CV method

Category  Precision  Recall  F-measure

Art  0.89  0.89  0.89 
Economy  0.82  0.82  0.82 
Health  0.83  0.84  0.83 
Law  0.64  0.64  0.64 
Literature  0.80  0.74  0.77 
Politics  0.61  0.66  0.63 
Religion  0.84  0.76  0.80 
Sport  0.95  0.94  0.95 
Technology  0.80  0.84  0.82 
As for the J48 method, Table 9 shows its results using the tenfold CV method. Figure 8 compares the F-measure of the proposed approach with that of the J48 algorithm under tenfold CV.

Comparing the proposed approach with the other classification methods, the J48 results in Table 9 indicate that a considerable number of documents were not assigned to their correct, actual categories. Figure 8 also shows that the F-measure values produced by the proposed approach are higher than those of the J48 algorithm in all categories. This indicates that selecting the best set of terms through multiple reducts was more appropriate for classifying these documents than the gain ratio measure used by J48.

Reviewing the KNN results in Table 8, precision and recall vary considerably from one category to another. Figure 7 also shows that the proposed approach produced better results than the KNN algorithm in all categories. This indicates that the reduced set of terms used by the proposed approach was more discriminative than the whole set of terms used by the KNN method.

As an end note, the rough set approach is applicable to the categorization of Arabic text and produced an acceptable accuracy of 94%. As for other rough set based classification methods, we are not aware of prior work that applies the rough set methodology to the classification of Arabic documents with which to compare.
6 Conclusion
In this paper, we proposed a rough set theory-based approach for the categorization of Arabic text using multiple reducts. Based on the experiments and the results presented throughout this paper, we conclude that this research has achieved its objectives. The proposed approach was tested in practice to demonstrate its applicability to the categorization of Arabic text, and the experimental results support its significance. Using a document set of 2700 documents in 9 categories, the multiple reducts strategy produced an accuracy of 94%, better than the 86% accuracy produced by the single reduct strategy. In addition, the rules generated from a single reduct were not sufficient to categorize all testing documents: 15% of the testing documents were left uncategorized under the tenfold CV method. From these results, we conclude that multiple reducts outperform a single reduct when the rough set approach is used for the categorization of Arabic text.

A comparison of the proposed approach with the KNN and J48 algorithms showed that the proposed approach reached an accuracy of 94%, outperforming both KNN, which reached 55%, and the J48 algorithm, which reached 79%.

As a concluding remark, the research presented in this paper met its objectives and supported its hypothesis. It showed that the rough set approach is applicable to the categorization of Arabic text, and that generating multiple reducts, rather than a single reduct, is a necessary step to improve the performance of the rough set classifier.

While conducting this research, we identified several ideas for future work and extensions of the proposed approach: (1) enhancing the approach to assign a document to multiple categories when its content indicates that it belongs to more than one category, and (2) studying the effect of stemming on the performance of the approach.
Notes
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
References
Abualigah LMQ, Hanandeh ES (2015) Applying genetic algorithms to information retrieval using vector space model. Int J Comput Sci Eng Appl 5(1):19–28
Abualigah LM, Khader AT (2017) Unsupervised text feature selection technique based on hybrid particle swarm optimization algorithm with genetic operators for the text clustering. J Supercomput 73(11):4773–4795
Abualigah LM, Khader AT, Hanandeh ES (2018) A new feature selection method to improve the document clustering using particle swarm optimization algorithm. J Comput Sci 25:456–466. https://doi.org/10.1016/j.jocs.2017.07.018
Al-Dhaheri S (2010) Arabic text categorization based on features reduction using artificial neural network. Master thesis, Faculty of Graduate Studies, The University of Jordan
Al-Diabat M (2012) Arabic text categorization using classification rule mining. Appl Math Sci 6:4033–4046
Al-Radaideh Q, Al-Khateeb S (2015) An associative rule-based classifier for Arabic medical text. Int J Knowl Eng Data Min 3(3–4):255–273
Al-Radaideh Q, Al-Qudah G (2017) Application of rough set-based feature selection for Arabic sentiment analysis. Cognit Comput 9(4):436–445
Al-Radaideh Q, Bataineh D (2018) A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms. Cognit Comput. https://doi.org/10.1007/s12559-018-9547-z
Al-Radaideh Q, Al-Shawakfa E, Ghareb A, Abu Salem H (2011) An approach for Arabic text categorization using association rule mining. Int J Comput Process Lang 23(1):81–106
Al-Radaideh Q, Sulaiman MN, Selamat MH, Ibrahim H (2005) Approximate reduct computation by rough sets based attribute weighting. In: Proceedings of the IEEE international conference on granular computing, pp 383–386
Al-Radaideh Q, Twaiq L (2014) Rough set theory for Arabic sentiment classification. In: Proceedings of the 2014 international conference on future internet of things and cloud. IEEE Computer Society
Alsaleem S (2011) Automated Arabic text categorization using SVM and NB. Int Arab J e-Technol 2(2):124–128
Al-Salemi B, Aziz M (2011) Statistical Bayesian learning for automatic Arabic text categorization. J Comput Sci 7(1):39–45
Al-Shalabi R, Kanaan G, Gharaibeh M (2006) Arabic text categorization using KNN algorithm. In: Proceedings of the 4th international multiconference on computer science and information technology, Amman, Jordan
Azara M, Fatayer T, El-Halees A (2012) Arabic text classification using learning vector quantization. In: Proceedings of the 8th international conference on informatics and systems (INFOS 2012), pp 39–43
Bao Y, Aoyama S, Du X, Yamada K, Ishii N (2001) A rough set based hybrid method to text categorization. In: Proceedings of the 2nd international conference on web information systems engineering. IEEE Computer Society, pp 254–261
Chantar HK, Corne DW (2011) Feature subset selection for Arabic document categorization using BPSO-KNN. In: Nature and biologically inspired computing (NaBIC), pp 545–551
Chen Y, Zeng Z, Lu J (2017) Neighborhood rough set reduction with fish swarm algorithm. Soft Comput 21(23):6907–6918
Chen P, Liu S (2008) Rough set-based SVM classifier for text categorization. In: Proceedings of the fourth international conference on natural computation (ICNC), pp 153–157
Chouchoulas A (1999) A rough set approach to text classification. Master thesis, School of Artificial Intelligence, Division of Informatics, The University of Edinburgh
Dai L, Hu J, Liu W (2008) Using modified CHI square and rough set for text categorization with many redundant features. In: Proceedings of the international symposium on computational intelligence and design (ISCIS), vol 1, pp 182–185
Darwish K (2002) Building a shallow Arabic morphological analyzer in one day. In: Proceedings of the ACL workshop on computational approaches to semitic languages. ACL
Duwairi R (2006) Machine learning for Arabic text categorization. J Am Soc Inf Sci Technol 57(8):1005–1010
Duwairi R (2007) Arabic text categorization. Arab J Inf Technol 4(2):125–131
Duwairi R, El-Orfali M (2014) A study of the effects of preprocessing strategies on sentiment analysis for Arabic text. J Inf Sci 40(4):501–513
Duwairi R, Al-Refai M, Khasawneh N (2009) Feature reduction techniques for Arabic text categorization. J Am Soc Inf Sci 60(11):2347–2352
Ghareb A, Hamdan A, Bakar A (2016) Hybrid feature selection based on enhanced genetic algorithm for text categorization. Expert Syst Appl 49:31–47
Ghareb A, Bakar AA, Al-Radaideh Q, Hamdan A (2018) Enhanced filter feature selection methods for Arabic text categorization. Int J Inf Retr Res 8(2):1–24
Gharib TF, Habib MB, Fayed ZT (2009) Arabic text classification using support vector machines. Int J Comput Appl 16(4):1–8
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten I (2009) The WEKA data mining software: an update. SIGKDD Explor 11(1):10–18
Han J, Kamber M, Pei J (2012) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann, Los Altos
Harrag F, El-Qawasmeh E, Al-Salman AS (2010) Comparing dimension reduction techniques for Arabic text classification using BPNN algorithm. In: Proceedings of the 2010 first international conference on integrated intelligent computing, pp 6–11
Harrag F, El-Qawasmeh E (2009) Neural network for Arabic text classification. In: Proceedings of the international conference on applications of digital information and web technologies (ICADIWT '09), pp 778–783
Harrag F, El-Qawasmeh E, Pichappan P (2009) Improving Arabic text categorization using decision trees. In: Proceedings of the 1st international conference of NDT '09, pp 110–115
Hmeidi I, Hawashin B, El-Qawasmeh E (2008) Performance of KNN and SVM classifiers on full word Arabic articles. Adv Eng Inform 22:106–111
Hmeidi I, Al-Ayyoub M, Abdulla N, Almodawar A, Abooraig R, Mahyoub N (2015) Automatic Arabic text categorization: a comprehensive comparative study. J Inf Sci 41(1):114–124
Hussien MI, Olayah F, ALdwan M, Shamsan A (2011) Arabic text classification using SMO, Naive Bayesian, J48 algorithm. Int J Res Rev Appl Sci 9(2):306–316
Hu Q, Yu D, Xie Z (2004) Improvement on classification performance based on multiple reduct ensembles. In: Proceedings of the 2004 IEEE conference on cybernetics and intelligent systems, vol 2, pp 1016–1021
Ishii N, Morioka Y, Kimura H, Bao Y (2010) Classification by partial data of multiple reducts kNN with confidence. In: Proceedings of the 22nd IEEE international conference on tools with artificial intelligence, pp 94–101
Jensen R (2005) Combining rough and fuzzy sets for feature selection. PhD thesis, School of Informatics, University of Edinburgh
Lam W, Ruiz M, Srinivasan P (1999) Automatic text categorization and its application to text retrieval. IEEE Trans Knowl Data Eng 11(6):865–879
Lin TY (1996) Rough set theory in very large databases. In: Proceedings of the symposium on modeling analysis and simulation, CESA '96 IMACS multiconference on computational engineering in systems applications, pp 936–941
Mesleh A (2007) Chi-square feature extraction based SVMs Arabic language text categorization system. J Comput Sci 3(6):430–435
Noaman H, Elmougy S, Ghoneim A, Hamza T (2010) Naïve Bayes classifier based Arabic document categorization. In: Proceedings of the 7th international conference on informatics and systems (INFOS 2010), Cairo, Egypt
Pawlak Z (1982) Rough sets. Int J Comput Inform Sci 11(5):341–356
Pawlak Z (1991) Rough sets: theoretical aspects of reasoning about data. Kluwer, Dordrecht
Cekik R, Telceken S (2018) A new classification method based on rough sets theory. Soft Comput 22(6):1881–1889
Skowron A, Rauszer C (1992) The discernibility matrices and functions in information systems. In: Słowiński R (ed) Intelligent decision support: handbook of applications and advances of the rough sets theory. Kluwer, Dordrecht
Syiam MM, Fayed ZT, Habib MB (2006) An intelligent system for Arabic text categorization. Int J Intell Comput Inf Sci 6(1):1–19
Thabtah F, Eljinini M, Zamzeer M, Hadi W (2009) Naïve Bayesian based on chi-square to categorize Arabic data. In: Proceedings of the 11th international business information management association (IBIMA) conference on innovation and knowledge management in twin track economies, Cairo, pp 930–935
Thangavel K, Pethalakshmi A (2009) Dimensionality reduction based on rough set theory: a review. Appl Soft Comput 9(1):1–12
Velayutham C, Thangavel K (2011) Unsupervised quick reduct algorithm using rough set theory. J Electron Sci Technol 9(3):193–201
Wahbeh A, Al-Kabi M, Al-Radaideh Q, Al-Shawakfa E, Alsmadi I (2011) The effect of stemming on Arabic text classification: an empirical study. Int J Inf Retr Res 1(3):54–70
Wang Z, Sun X, Li X, Zhang D (2006) An efficient SVM-based spam filtering algorithm. In: Proceedings of the fifth international conference on machine learning and cybernetics, pp 3682–3686
Wang N, Wang P, Zhang B (2010) An improved TF–IDF weights function based on information theory. In: Proceedings of the international conference on computer and communication technologies in agriculture engineering, pp 439–441
Yahia ME (2011) Arabic text categorization based on rough set classification. In: Proceedings of the 9th IEEE/ACS international conference on computer systems and applications, pp 293–294
Yin S, Huang Z, Chen L, Qiu Y (2008) An approach for text classification feature dimensionality reduction and rule generation on rough set. In: Proceedings of the third international conference on innovative computing, information and control (ICICIC 2008). IEEE CS
Zhang Q, Tan J, Zhou H, Tao W, He K (2009) Machine learning methods for medical text categorization. In: Proceedings of the Pacific-Asia conference on circuits, communications and system, pp 494–497
Zhao W, Zhang Z (2005) An email classification model based on rough set theory. In: Proceedings of the 2005 international conference on active media technology (AMT 2005), pp 403–408
Zhong N, Dong J, Ohsuga S (2001) Using rough sets with heuristics for feature selection. J Intell Inf Syst 16(3):199–214
Zhu XZ, Zhu W, Fan XN (2017) Rough set methods in feature selection via submodular function. Soft Comput 21(13):3699–3711