
Soft Computing

Volume 23, Issue 14, pp 5849–5863

An Arabic text categorization approach using term weighting and multiple reducts

  • Qasem A. Al-Radaideh
  • Mohammed A. Al-Abrat
Methodologies and Application

Abstract

Text categorization is the process of assigning a predefined category label to an unlabeled document based on its content. One of the challenges of automatic text categorization is the high dimensionality of the data, which may affect the performance of the categorization model. This paper proposes an approach for the categorization of Arabic text based on term weighting and the reduct concept of rough set theory to reduce the number of terms used to generate the classification rules that form the classifier. The paper proposes a multiple minimal reduct extraction algorithm obtained by improving the Quick Reduct algorithm. The multiple reducts are used to generate the set of classification rules that represent the rough set classifier. To evaluate the proposed approach, an Arabic corpus of 2700 documents distributed over nine categories is used. In the experiments, we compared the results of the proposed approach when using multiple and single minimal reducts. The results showed that the proposed approach achieved an accuracy of 94% when using multiple reducts, outperforming the single reduct method, which achieved an accuracy of 86%. The results of the experiments also showed that the proposed approach outperforms both the K-NN and J48 algorithms in terms of classification accuracy on the dataset used.

Keywords

Rough set theory · Arabic text categorization · Reducts extraction · Single reduct · Multiple reducts

1 Introduction

Due to the rapid increase in textual information available on the internet, the process of getting relevant information becomes more difficult. Text mining, defined as the process of extracting knowledge from huge amounts of textual data, is one of the techniques that can be used to cope with the growing volume of textual information and to facilitate extracting useful information from text. Researchers in the fields of data mining and information retrieval have investigated different types of text mining tasks, such as text categorization (Al-Radaideh et al. 2011; Ghareb et al. 2018), text clustering (Abualigah et al. 2018), and text summarization (Al-Radaideh and Bataineh 2018).

Text categorization (TC) is the process of assigning a predefined category (label) to an unlabeled document based on its content (Lam et al. 1999). In recent years, and with the rapid increase in the size of information on the Web, text categorization has attracted the attention of many researchers to use TC as a way to simplify the access to useful information. Text categorization has been used for several applications such as spam filtering, improving the performance of information retrieval systems, and in medical information systems (Lam et al. 1999; Abualigah and Hanandeh 2015; Wang et al. 2006; Zhang et al. 2009).

In practice, a typical text categorization system consists of four main phases, where each phase may further include several steps. The first of these is the text preprocessing phase, which aims to prepare the text of the documents to be used by the categorization model. Usually, this phase includes three processes: tokenization, stop-word removal, and term weighting.

One of the most frequent challenges in automatic text categorization is the high dimensionality of terms, which may affect the performance of the categorization model. A good solution to overcome this challenge is to use feature selection methods, which select a subset of the data that best represents the whole data (Rasim and Telceken 2018; Abualigah and Khader 2017; Abualigah et al. 2018). Rough set theory (RST) is one of the tools that researchers have used for dimensionality reduction and feature selection through the reduct concept, which allows the whole data to be represented using only part of that data (Pawlak 1982).

The Arabic language needs careful preprocessing since it has some features that are different from other languages. As summarized by Duwairi and El-Orfali (2014), the Arabic language differs from other languages in its orthographical nature where the words are written from right to left and the shape of letters changes according to their position in the word. The Arabic language includes three major parts of speech (noun, verb, and particle). Nouns and verbs are obtained from roots by applying templates to the roots in order to generate stems and then by introducing prefixes and suffixes (Darwish 2002).

The Arabic language has its challenges too. Arabic does not support letter capitalization, and it has no strict punctuation rules. The tokenization process is not straightforward for Arabic, and the fact that Arabic is a morphologically rich language further complicates it. Arabic words are compact, where a word can correspond to an entire phrase or sentence; for example, one Arabic word may contain four tokens. Moreover, Arabic dialects vary from one Arab country to another. All these challenges may affect the results of subsequent text processing tasks such as classification or sentiment analysis.

Although the process of Arabic text categorization was adopted by many researchers who have implemented several categorization algorithms, this field still needs more efforts to be enriched with new and improved algorithms. After reviewing the literature, we noticed that very little research related to Arabic Text Categorization has considered using rough set theory for the categorization process. This motivated us to address this issue.

The main problem addressed in this research is how to build an Arabic text categorization model using the power of rough set theory in feature subset selection and dimension reduction to achieve an acceptable classification performance. As will be mentioned later, reduct is the main concept of the rough set theory that is used for feature subset selection. One of the well-known rough set theory-based subset selection methods, called Quick Reduct (Chouchoulas 1999), produces a single reduct for the dataset.

This research also addresses the question of what happens if multiple reducts are used instead of a single reduct. The hypothesis of this research is that generating multiple reducts will yield better classification accuracy. The reduct generation step is used as a second term selection step on top of the TF–IDF step, which is used for term weighting and selection.

The main objectives of this paper can be summarized in the following points: (1) investigating the possibility of building a rough set theory-based model for the categorization of Arabic text that achieves acceptable performance in comparison with other classification algorithms; (2) proposing an improved version of the Quick Reduct method to generate multiple reducts.

The rest of this paper is organized as follows: Section 2 presents some preliminaries of rough set theory. Section 3 reviews some approaches that have been applied for the categorization of Arabic text and also reviews the use of rough set theory in text categorization and feature selection. Section 4 describes the proposed method for building the rough set-based Arabic text categorization model with a detailed description of its components. Section 5 presents and discusses the experimental results of the proposed approach, and Sect. 6 presents the conclusions, and future work.

2 Rough set theory preliminaries

Rough set theory is a mathematical tool proposed by Z. Pawlak in the early 1980s for knowledge discovery and data analysis; it can be used for the analysis of vague, imprecise, uncertain, and incomplete data (Pawlak 1982). Rough set theory also has the advantage of serving as a tool to reduce the number of attributes and to discover data dependencies (Velayutham and Thangavel 2011). This advantage has made it applicable in many domains, such as decision support systems, engineering, banking, and medicine, among other applications (Pawlak 1991).

Han et al. (2012) summarized the definition of rough set concept as “A rough set definition for a given class, X, is approximated by two sets—a lower approximation of X and an upper approximation of X. The lower approximation of X consists of all the data tuples that, based on the knowledge of the attributes, are certain to belong to X without ambiguity. The upper approximation of X consists of all the tuples that, based on the knowledge of the attributes, cannot be described as not belonging to X. The lower and upper approximations for a class X are shown in Fig. 1”.

2.1 Decision table in rough set theory

Rough set theory bears the assumption that defining a set requires some information about the elements of the universe (Pawlak 1991). In practice, the information about the elements is presented in the form of a decision table (DT). The decision table is formally defined as \(\hbox {DT} = (U, A \cup \{d\})\), where \(U =\{x_{1}, {\ldots }, x_{n}\}\) is a nonempty finite set of objects \((x_{i})\), called the universe, and \(A = \{a_{1}, {\ldots }, a_{k}\}\) is a nonempty finite set of attributes \((a_{i})\), called conditional attributes, while the attribute d is called the decision attribute. Every attribute \(a \in A\) is a total function \(a: U\rightarrow V_a\), where \(V_a\) is the set of allowable values for the attribute a. An example of a decision table adopted from Pawlak (1991) is presented in Table 1. In this decision table, U denotes the set of all objects in the dataset \(U = \{x_{1}, x_{2}, x_{3}, x_{4}, x_{5}, x_{6}, x_{7}, x_{8}\}\), A is the set of all attributes \(A = \{a_{1}, a_{2}, a_{3}, a_{4}\}\), and d is the decision attribute.
Fig. 1

Positive, Negative, and Boundary Regions in Rough Set (Pawlak 1991)

Table 1  An example of a decision table

x_i in U    a1    a2    a3    a4    d
x1           1     0     2     2    0
x2           0     1     1     1    2
x3           2     0     0     1    1
x4           1     1     0     2    2
x5           1     0     2     0    1
x6           2     2     0     1    1
x7           2     1     1     1    2
x8           0     1     1     0    1

2.2 The reduct concept

As defined by Pawlak (1991), a reduct is a minimal subset of attributes \(B \subset A\) from the decision table that can be used to discern each object in the table. In practice, reduct extraction is one of the most important operations in rough set theory, and it may require the most time and effort since the computation is highly influenced by the number of attributes in the decision table. Reduct computation has been proven to be an NP-hard problem whose time is exponential in the number of attributes in the decision table (Skowron and Rauszer 1992).

The reduct concept represents one of the most important concepts in applying rough set theory to data mining. Reduct extraction methods allow selecting a subset of the attribute set instead of dealing with the whole set of attributes (Lin 1996). Rough set theory provides two types of reducts: full reducts and object reducts. A full reduct can be used to discern all objects from each other, whereas an object reduct is the minimal set of attributes that discerns a particular object from all other objects. Several methods have been proposed for reduct extraction, such as the exhaustive search algorithm, heuristic search algorithms, Johnson's algorithm, feature weighting algorithms, and genetic algorithms (Al-Radaideh et al. 2005; Zhong et al. 2001).

2.3 Reducts extraction

The following are some concepts of rough set theory (Pawlak 1991) that can be used to extract reducts. A complete example of how to compute and use these concepts to extract reducts can be found in Chouchoulas (1999).

  (1) The total function f(x, a) denotes the value of attribute \(a \in A\) for object \(x \in U\). For a given attribute a, the function defines an equivalence relation over U and partitions the universe into a set of pairwise disjoint subsets of U. Given a subset of attributes \(P \subseteq A\), two objects x and y in U are indiscernible with respect to P if and only if \(f(x, q)=f(y, q)\) for all \(q \in P\).

     
  (2) For a subset of attributes \(P \subseteq A\), IND(P) denotes the indiscernibility relation with respect to P. U/IND(P) denotes the partition of U induced by IND(P) and is calculated as:
    $$\begin{aligned} U/\hbox {IND}~(P) = \otimes \{q \in P : U/\hbox {IND} (q)\}, \end{aligned}$$
    where
    $$\begin{aligned} A \otimes B = \{X \cap Y: \forall X \in A, \forall Y \in B, X \cap Y\ne \emptyset \} \end{aligned}$$
     
  (3) Given an equivalence relation IND(P), the lower and upper approximations of a set \(Y \subseteq U\) are defined as follows:
    $$\begin{aligned} \text {Lower approximation:}~\underline{P}Y&= \cup \{X: X \in U/\hbox {IND}(P), X \subseteq Y\}\\ \text {Upper approximation:}~\overline{P}Y&= \cup \{X: X \in U/\hbox {IND}(P), X \cap Y \ne \emptyset \} \end{aligned}$$
     
Assume P and Q are equivalence relations over U; the positive, negative, and boundary regions, denoted \(\hbox {POS}_{P}(Q)\), \(\hbox {NEG}_{P}(Q)\), and \(\hbox {BN}_{P}(Q)\), respectively, are defined as follows. Figure 1 illustrates the positive, negative, and boundary regions of a concept X (Chouchoulas 1999).
$$\begin{aligned} \hbox {POS}_{P}(Q)&= \bigcup \nolimits _{X \in U/Q} \underline{P}X\\ \hbox {NEG}_{P}(Q)&= U - \bigcup \nolimits _{X \in U/Q} \overline{P}X\\ \hbox {BN}_{P}(Q)&= \bigcup \nolimits _{X \in U/Q} \overline{P}X - \bigcup \nolimits _{X \in U/Q} \underline{P}X \end{aligned}$$
Let \(\hbox {IS} = (U, A)\) be an information system and let B be a subset of A. We can approximate X using only the information contained in B by constructing the B-lower and B-upper approximations of X, denoted by \(\underline{B}X\) and \(\overline{B}X\), respectively.

The other dimension of reduction is to keep only those attributes that preserve the indiscernibility relation and, consequently, the set approximation. The rejected attributes are redundant since their removal cannot worsen the classification. There are usually several such subsets of attributes, and those which are minimal are called reducts.

Given an information system \(\hbox {IS} = (U, A)\), a reduct of IS is a minimal set of attributes \(B \subseteq A\) such that \(\hbox {IND}_{\mathrm{IS}}~(B) = \hbox {IND}_{\mathrm{IS}}~(A)\).
  (4) Pawlak (1991) defined the degree of dependency of a set Q of decision attributes on a set of conditional attributes P as follows:
    $$\begin{aligned} \gamma _{P} (Q) = \frac{\parallel \hbox {POS}_{P} \left( Q\right) \parallel }{\parallel U \parallel } \end{aligned}$$
     
If \(\gamma = 0\), there is no dependency; if \(0< \gamma < 1\), there is a partial dependency; and if \(\gamma = 1\), there is a complete dependency. The symbol \(\parallel U\parallel \) represents the cardinality of the set U. A small code sketch illustrating these computations is given after this list.
  (5) The significance of an attribute is defined by calculating the change in dependency when removing the attribute from the set of the considered conditional attributes. The significance \(\sigma \) of an attribute can be defined as follows. Given P, Q, and an attribute \(x \in P\):
    $$\begin{aligned} \sigma _{P}(Q, x)=\gamma _{P}(Q)-\gamma _{P-\{x\}}(Q) \end{aligned}$$
     
  (6) Attribute reduction (reduct) involves removing attributes that have no significance to the classification at hand. The dataset may have more than one attribute reduct set. Recalling that D is the set of decision attributes and C is the set of conditional attributes, the set of reducts R is defined as follows:
    $$\begin{aligned} R = \{X: X \subseteq C, \gamma _{C} (D) = \gamma _{X} (D)\} \end{aligned}$$
     
  (7) The set of minimal reducts \(R_{\mathrm{min}} \subseteq R\) is the set of shortest reducts, defined as follows:
    $$\begin{aligned} R_{\mathrm{min}} = \{X: X \in R, \forall Y \in R, |X| \le |Y|\} \end{aligned}$$
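To make these definitions concrete, the following is a minimal Python sketch (our own illustration, not code from the paper) that computes the partition U/IND(P), the lower approximation, and the dependency degree for the decision table of Table 1; the function names and the dictionary layout are assumptions made for readability.

# Decision table of Table 1: rows are objects, a1..a4 are conditional
# attributes, and d is the decision attribute.
table = {
    'x1': {'a1': 1, 'a2': 0, 'a3': 2, 'a4': 2, 'd': 0},
    'x2': {'a1': 0, 'a2': 1, 'a3': 1, 'a4': 1, 'd': 2},
    'x3': {'a1': 2, 'a2': 0, 'a3': 0, 'a4': 1, 'd': 1},
    'x4': {'a1': 1, 'a2': 1, 'a3': 0, 'a4': 2, 'd': 2},
    'x5': {'a1': 1, 'a2': 0, 'a3': 2, 'a4': 0, 'd': 1},
    'x6': {'a1': 2, 'a2': 2, 'a3': 0, 'a4': 1, 'd': 1},
    'x7': {'a1': 2, 'a2': 1, 'a3': 1, 'a4': 1, 'd': 2},
    'x8': {'a1': 0, 'a2': 1, 'a3': 1, 'a4': 0, 'd': 1},
}

def partition(attrs):
    """U/IND(attrs): group objects that agree on every attribute in attrs."""
    blocks = {}
    for obj, row in table.items():
        key = tuple(row[a] for a in attrs)
        blocks.setdefault(key, set()).add(obj)
    return list(blocks.values())

def lower_approximation(attrs, target):
    """Union of the IND(attrs) classes that are fully contained in target."""
    return set().union(*(block for block in partition(attrs) if block <= target))

def dependency(attrs, decision='d'):
    """gamma_P(Q) = |POS_P(Q)| / |U|, with P = attrs and Q = {decision}."""
    pos = set()
    for decision_class in partition([decision]):
        pos |= lower_approximation(attrs, decision_class)
    return len(pos) / len(table)

print(dependency(['a1', 'a2', 'a3', 'a4']))   # gamma for the full attribute set
print(dependency(['a1']))                     # gamma for a single attribute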
     

2.4 The quick reduct algorithm

Several approaches have been proposed in the literature to generate reduct sets. The Quick Reduct algorithm was proposed by Chouchoulas (1999) and was used for the purpose of text classification. The Quick Reduct algorithm attempts to find a reduct without the necessity of generating all possible subsets exhaustively. For this reason, the Quick Reduct algorithm is considered a good option for avoiding the NP-hard exhaustive computation, whose time is exponential in the number of attributes in the decision table. Using the Quick Reduct algorithm, a reduct may be found as the first satisfying subset generated at the lowest level of the search.

The Quick Reduct algorithm is based on the idea of attribute significances and dependencies for extracting the reducts. Figure 2 presents the pseudocode of the Quick Reduct algorithm as presented by Chouchoulas (1999) for reduct extraction.
Fig. 2

Pseudocode of quick reduct algorithm (Chouchoulas 1999)
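As a rough illustration of this greedy strategy, the following minimal Python sketch (a simplification under our own assumptions, not the authors' implementation) grows a candidate reduct one attribute at a time until its dependency degree matches that of the full conditional attribute set; the `dependency` function is assumed to behave as in the sketch given in Sect. 2.3.

def quick_reduct(conditional_attrs, dependency):
    """Greedy Quick Reduct sketch: grow R until gamma_R equals gamma_C."""
    full_gamma = dependency(list(conditional_attrs))
    reduct = []
    while dependency(reduct) < full_gamma:
        # add the attribute giving the largest increase in dependency degree
        best = max((a for a in conditional_attrs if a not in reduct),
                   key=lambda a: dependency(reduct + [a]))
        reduct.append(best)
    return reduct

# Example usage with the decision table sketch from Sect. 2.3:
# quick_reduct(['a1', 'a2', 'a3', 'a4'], dependency)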

3 Related work

This section reviews several Arabic text categorization approaches. The first subsection reviews some approaches that have been implemented for the categorization of Arabic text. The second subsection reviews the use of rough set theory in the field of text categorization.

3.1 Arabic text categorization

In the last few years, the importance of text categorization for the Arabic language has attracted many researchers. For example and starting from 2006, Al-Shalabi et al. (2006) applied the K-nearest neighbor method for the categorization of Arabic text, Duwairi (2006) proposed a distance-based approach for the categorization of Arabic text, and Syiam et al. (2006) presented an intelligent system that used K-nearest neighbor and Rocchio classifiers for categorizing documents that were collected from some Egyptian newspapers.

Mesleh (2007) proposed an approach using support vector machine (SVM) for Arabic text categorization then Duwairi (2007) compared the three classifiers Naïve Bayes (NB), K-nearest neighbor (K-NN), and distance-based for categorizing Arabic text. After that, Hmeidi et al. (2008) evaluated the support vector machine and K-NN classifiers regarding their ability to categorize Arabic text.

Duwairi et al. (2009) applied the K-nearest neighbor algorithm for the categorization of Arabic text to compare among three stemming techniques: full stemming, light stemming, and word clusters. After that, a text categorization approach based on the application of artificial neural networks was proposed by Harrag and El-Qawasmeh (2009) and Thabtah et al. (2009) proposed an approach based on Naïve Bayesian method for Arabic text categorization. For this purpose, the authors used CHI square for feature selection and Naïve Bayesian for categorization. Gharib et al. (2009) applied support vector machine (SVM) for Arabic text categorization, and Harrag et al. (2009) proposed an approach based on decision tree algorithm for categorization of Arabic text.

Noaman et al. (2010) applied the Naïve Bayesian classifier for the categorization of Arabic text. Al-Dhaheri (2010) proposed an approach based on the artificial neural network (ANN) algorithm and a feature reduction technique combining the feature selection methods term frequency–inverse document frequency (TF–IDF) and document frequency–category frequency (DF–CF) with principal component analysis (PCA). Harrag et al. (2010) applied the back-propagation neural network algorithm with the objective of comparing five dimension reduction techniques, including full stemming, light stemming, document frequency, term frequency–inverse document frequency, and latent semantic indexing.

In 2011, several research papers were published dealing with Arabic text categorization. Alsaleem (2011) applied support vector machine and Naïve Bayesian classifiers for Arabic text categorization, using Arabic documents collected from some Saudi Arabian newspapers. Al-Salemi and Aziz (2011) proposed an approach based on Bayesian learning models for the categorization of Arabic text. Al-Radaideh et al. (2011) proposed an approach based on association rule mining for the categorization of Arabic text. Hussien et al. (2011) applied the sequential minimal optimization (SMO), Naïve Bayesian, and J48 (C4.5) algorithms to compare these three algorithms and find the most applicable one for the categorization of Arabic text. Chantar and Corne (2011) proposed a hybrid approach based on binary particle swarm optimization and the K-nearest neighbor algorithm for the categorization of Arabic text. Wahbeh et al. (2011) studied the effect of stemming on Arabic text classification. The researchers applied three classifiers, including the sequential minimal optimization (SMO), J48, and Naïve Bayes (NB) algorithms, to classify stemmed and non-stemmed Arabic textual datasets.

Azara et al. (2012) proposed an approach based on a neural network algorithm, namely the learning vector quantization (LVQ) algorithm, for the categorization of Arabic text. The algorithm is based on the Kohonen self-organizing map (SOM), which can organize large document collections according to textual similarities. Al-Diabat (2012) compared some of the rule-based algorithms for the categorization of Arabic text. For this purpose, the author compared four well-known rule-based algorithms, including One Rule, rule induction (RIPPER), decision trees (C4.5), and the hybrid PART algorithm, to select the most applicable algorithm among them for categorizing Arabic text.

Hmeidi et al. (2015) presented a survey and a comparative study of several text categorization approaches for Arabic text. In addition, Al-Radaideh and Al-Khateeb (2015) applied the associative classifier approach to classify Arabic articles related to the medical domain. The experimental results reported by the authors showed that the associative classification approach outperformed the C4.5, Ripper and SVM algorithms based on a corpus of 1000 Arabic medical articles that belong to 10 different diseases (classes).

Ghareb et al. (2016) proposed a hybrid feature selection approach which combines the advantages of several filter feature selection methods with an enhanced version of genetic algorithm. Recently Ghareb et al. (2018) proposed three enhanced filter feature selection methods that can be used for text classification. The methods are: Category Relevant Feature Measure, Modified Category Discriminated Measure, and Odd Ratio2.

3.2 Text categorization using rough set theory

With the growing interest in text categorization, several machine learning algorithms have been proposed. Rough set theory is one of the approaches that have been used by some researchers for text categorization. For example, Bao et al. (2001) proposed a hybrid text categorization model based on rough set theory and latent semantic indexing. The objective of the proposed model was to categorize text while overcoming the challenge of the high dimensionality of the data. The results showed that using rough set theory resulted in a set of rules small enough to be understood by humans, and that using latent semantic indexing to group keywords yielded a considerable improvement over the rough set-based approach alone.

Jensen (2005) proposed an approach that combines rough sets and fuzzy sets for feature selection, named fuzzy-rough feature selection (FRFS), with the aim of reducing dimensionality. Zhao and Zhang (2005) proposed a model based on rough set theory for the classification of e-mails into three categories: spam, non-spam, and suspicious. The researchers compared the results of the rough set-based model with the Naïve Bayesian classifier. The experimental comparison showed that the rough set-based model reduced the error rate of classifying non-spam e-mails as spam more than the Naïve Bayesian classifier did.

Dai et al. (2008) presented an approach based on rough set theory and a modified version of the Chi-square statistic for text categorization. They used the reduct concept of rough set theory to generate rules that can be used for text categorization. The comparison with other classification methods showed that the rough set classifier produced the highest accuracy among the three compared classifiers.

Yin et al. (2008) proposed an approach for text categorization based on Rough Set theory. In the proposed approach, the authors initially created a decision table then all terms in the decision table were weighted, then the features were selected and finally the classification rules were extracted. The results showed that the proposed approach had produced higher accuracy in comparison with support vector machine (SVM) classifier. Chen and Liu (2008) proposed a model that combines rough set theory and support vector machine for text categorization. In the proposed model, rough set theory was employed as an attribute reduction method where the reduced set of attributes was used as an input to the support vector machine classifier. The experimental result showed that the model had produced the highest accuracy in comparison with several other classifiers.

Thangavel and Pethalakshmi (2009) reviewed some techniques for dimensionality reduction under rough set theory environment. They also reviewed the rough sets hybridization with fuzzy sets, neural network and meta-heuristic algorithms. The performance analysis of the algorithms has been discussed in connection with the classification task.

In the field of using rough set theory for categorization of Arabic text, Yahia (2011) presented a study that reviewed some approaches that were implemented for the categorization of Arabic text such as Naïve Bayesian (NB), K-nearest neighbor (K-NN), and support vector machine (SVM). The author focused on using rough set theory for Arabic text categorization and reported that using rough set-based reasoning in Arabic text categorization is needed. Yahia (2011) reported that he has a plan to build a model for the categorization of Arabic text based on rough set theory principles for feature selection and support vector machine (SVM) for classification, or modified Chi-square for feature selection, and rough set for classification. No experiments were reported in that short paper.

Another work for Arabic text was introduced by Al-Radaideh and Twaiq (2014), where they evaluated two reduct computation methods for Arabic sentiment categorization. The two algorithms are the Johnson Reducer and the genetic algorithms based reducer. To evaluate the two methods, they used a corpus of Egyptian dialects tweets, and they measured the performance of the two algorithms in terms of the number of generated reducts and the number of the generated rules for each algorithm. The reported preliminary results showed that the classification process after using genetic algorithms for reduct generation achieved an accuracy of 55%, which outperformed the classification process using Johnson reducer.

Recently, Al-Radaideh and Al-Qudah (2017) investigated using the rough set theory concepts for feature selection for sentiment analysis of Arabic text. The work considered an extension to the previous work of Al-Radaideh and Twaiq (2014). The study investigated four reduct computation algorithms and two rule generation algorithms. The conclusion of the work indicates that using rough set theory concepts for feature selection is appealing and can achieve good results in comparison with using the full set of terms.

4 The proposed approach

This section presents a detailed description of the proposed approach. The proposed approach can be outlined in two main phases: building the rough set classifier and evaluating the classifier. The main steps of building the rough set classifier are illustrated in Fig. 3.
Fig. 3

Main steps of building the rough set classifier

4.1 Building the rough set classifier (phase 1)

This phase consists of two main steps: text preprocessing and building the rough set classifier.

In general, the text preprocessing step includes document tokenization, stop-word removal, term weighting, and document representation. The result of this phase is used to generate a set of rules based on rough set theory that can be used for categorizing testing documents. In this phase, rough set theory is used for feature selection via the extraction of reducts. This process selects a subset of terms that best represents the whole document. After the reducts are extracted, the set of minimal reducts is then used to generate a set of rules for constructing the rough set model (classifier), represented as a set of if–then rules.

4.2 Evaluating the classifier (phase 2)

In this phase two activities are performed:
  1. (1)

    Using the built model to categorize the test dataset and evaluating it using well-known evaluation metrics, which include precision, recall, F-measure, and accuracy.

     
  2. (2)

    Comparing the results of the proposed model with some traditional categorization models.

     

4.3 Text preprocessing

In its simplest form, the text preprocessing step consists of several traditional text processing steps, including tokenization, stop-word removal, term weighting, and document representation. The goal of text preprocessing is to prepare the text in the documents to extract features and then decide which features can be used for classifier learning to generate rules to be used later for categorizing test documents.
  • Document tokenization is the process of breaking up a sequence of strings into meaningful pieces called tokens such as words, keywords, phrases, and symbols.

  • Stop-word removal is the process of removing punctuation marks, formatting tags, digits, prepositions, pronouns, conjunctions, and auxiliary verbs (a minimal code sketch of both preprocessing steps follows this list).
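The following Python sketch illustrates these two steps; the regular expression and the tiny stop-word list are assumptions made for illustration only and are not the resources used in the paper.

import re

# A tiny illustrative stop-word set (prepositions/pronouns); the actual list
# used for the corpus is not specified here, so this is an assumption.
ARABIC_STOPWORDS = {"في", "من", "على", "إلى", "عن", "هذا", "هذه", "التي", "الذي"}

def preprocess(document):
    """Tokenize into Arabic-letter runs, then drop stop words.
    Digits, punctuation, and formatting tags are discarded by the pattern."""
    tokens = re.findall(r"[\u0621-\u064A]+", document)
    return [token for token in tokens if token not in ARABIC_STOPWORDS]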

4.4 Term weighting using TF–IDF

The most popular method for computing the weight of a term is term frequency–inverse document frequency (TF–IDF) (Wang et al. 2010). TF–IDF is a weighting method used to choose the best terms, i.e., those with the highest weight. The term frequency (TF) weight refers to the number of occurrences of a term i in a document j. The final TF–IDF term weight W(i, j) is calculated as follows.
$$\begin{aligned} W(i, j) = { TF} (i, j) * \hbox {log} \left( {\frac{N}{n_i }} \right) \end{aligned}$$
where N is the total number of documents in the collection; \(n_{i}\), is the number of documents in the collection that contain term i.
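This weighting can be sketched in a few lines of Python (an illustrative sketch; the variable names and the use of log base 10 are our own assumptions, matching the formula's log(N/n_i)):

import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: W(i, j)} dict per document."""
    N = len(documents)
    # n_i: number of documents in the collection that contain term i
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # TF(i, j): occurrences of term i in document j
        weights.append({term: count * math.log10(N / doc_freq[term])
                        for term, count in tf.items()})
    return weights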

4.5 Building the RST classifier

To build the classifier, the dataset is partitioned into a training set and a testing set. Training documents are used as the input for the rough set training phase. In general, the rough set training phase includes steps such as building the decision table, reduct extraction, and classification rule generation.

Building the decision table is the first step toward applying the rough set theory for building the classifier. This process is achieved by selecting the top 10 terms \((t_{1},\ldots ,t_{10})\) of each document that have the highest weights, then building the decision table (D) in which the columns represent terms and the rows represent the documents. The last column (Class) represents the category of the document. A sample decision table for eight documents is presented in Table 2.
Table 2

Sample of the decision table
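A minimal sketch of this decision table construction step is shown below (our own illustration; the boolean term encoding and the helper names are assumptions, since the sample values of Table 2 are not reproduced here):

def top_terms(term_weights, k=10):
    """Return the k terms of one document with the highest TF-IDF weights."""
    return [t for t, _ in sorted(term_weights.items(), key=lambda kv: -kv[1])[:k]]

def build_decision_table(doc_weights, labels, k=10):
    """Rows are documents, columns are the selected terms, last column is the class."""
    selected = [top_terms(w, k) for w in doc_weights]
    vocabulary = sorted({t for terms in selected for t in terms})
    rows = []
    for terms, label in zip(selected, labels):
        row = {term: int(term in terms) for term in vocabulary}  # assumed boolean encoding
        row['Class'] = label
        rows.append(row)
    return rows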

4.6 The proposed reducts extraction method

To generate the set of multiple reducts, we propose an improvement of the Quick Reduct algorithm for extracting reducts from the decision table (D). The original Quick Reduct algorithm returns only a single minimal reduct, which may not be adequate for text categorization. Hu et al. (2004) noticed that using a single minimal reduct for categorization may not be an appropriate solution in most cases. The multiple reducts concept, which was used in other works such as Ishii et al. (2010), can produce better categorization performance than using a single minimal reduct. In the proposed approach, we modified the Quick Reduct algorithm to return the set of all minimal reducts, instead of only a single one, and used them to generate the rules for text categorization.

Figure 4 presents the pseudocode of the improved Quick Reduct algorithm, and the flowchart that illustrates the main steps of reduct extraction is presented in Fig. 5. The algorithm starts by finding the significance of the entire decision system as one unit, then creates an empty list that will hold all minimal reducts \((R = \{\})\). The second step finds the significances of all possible subsets of size 1 in the decision system and compares the significance of each generated subset with the calculated significance of the entire decision system. For each subset whose significance equals the significance of the entire decision system, this subset is taken as a minimal reduct \((R = \{\hbox {subset}\})\) and R is added to the list of minimal reducts. If the list of minimal reducts is not empty, the algorithm stops searching for other reducts at higher levels; otherwise, the algorithm continues to the next step.
Fig. 4

Pseudocode of the improved quick reduct algorithm

Fig. 5

Flow diagram of the improved quick reduct algorithm

Table 3

Sample of generated rules

If R is still empty \((R= \{\})\), the algorithm finds the significances of all possible subsets of size 2 and compares the significance of each generated subset with the calculated significance of the entire decision system. For each subset whose significance equals that of the entire decision system, R is set to this subset \((R = \{\hbox {subset}\})\) and R is added to the list of minimal reducts. Again, if the list of minimal reducts is not empty, the algorithm stops searching for other reducts at higher levels; otherwise, it proceeds in the same way for subsets of size 3, 4, and so on, up to size \(n-1\). If R is still empty after examining the subsets of size \(n-1\), the algorithm takes the single subset of size n (the full attribute set), adds it to the list of minimal reducts, and returns the list of minimal reducts (LMR); in this case, the list contains only a single reduct.
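The level-by-level search described above can be sketched as follows (a simplified illustration under the assumption that `dependency` computes the significance of an attribute subset as in Sect. 2.3; it is not the authors' implementation):

from itertools import combinations

def multiple_minimal_reducts(conditional_attrs, dependency):
    """Collect every subset, at the lowest size, whose dependency equals that of
    the full attribute set; fall back to the full set if none is found."""
    full_gamma = dependency(list(conditional_attrs))
    for size in range(1, len(conditional_attrs)):
        reducts = [list(subset)
                   for subset in combinations(conditional_attrs, size)
                   if dependency(list(subset)) == full_gamma]
        if reducts:  # stop at the first (lowest) level that yields reducts
            return reducts
    return [list(conditional_attrs)]  # only the single full-size reduct remains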

4.7 Rules generation/building the classifier

Rule generation is the process of using the contents of the attribute reducts produced in the previous step to generate the set of classification rules that form the classifier. In the proposed approach, we used all minimal attribute reducts to generate the set of rules, which are kept in a dictionary, or as a list, in the form of if–then rules. This set of rules represents the categorization model that is used for categorizing Arabic documents.

The majority voting method was used to avoid the conflict that may occur among rules, because some of the documents may be matched by rules of more than one category. The proposed approach is designed to assign only one category from the predefined categories to each document. Table 3 shows a sample of the generated rules.
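As an illustration of how such rules might be applied with majority voting, consider the following sketch (the rule representation, a list of (conditions, category) pairs, is an assumption made for this example):

from collections import Counter

def classify(document_row, rules):
    """document_row: {term: value}. rules: list of (conditions, category) pairs,
    where conditions is an iterable of (term, value) tests derived from a reduct."""
    votes = Counter(category
                    for conditions, category in rules
                    if all(document_row.get(term) == value for term, value in conditions))
    if not votes:
        return None                      # the document is not covered by any rule
    return votes.most_common(1)[0][0]    # majority vote across matching rules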

4.8 Model evaluation

Evaluating the effectiveness of text categorization systems can be achieved using several metrics (Han et al. 2012). In the proposed approach, the evaluation metrics used are precision, recall, F-measure, and accuracy. To compute these metrics, the counts of true positives, true negatives, false positives, and false negatives are used. These counts are usually presented in a confusion matrix (Han et al. 2012), as in Table 4. The metrics and counts are defined as follows:
  • True positive (TP) refers to the number of documents that were correctly classified by the classification model (classifier) as Category A and they actually belong to this Category (A). This actual category is the original category provided with the document.

  • True negative (TN) represents the number of documents that the classifier correctly classified as not belonging to category A (i.e., as belonging to other categories) and that actually do not belong to that category.

  • False positive (FP) represents the number of documents that the classifier incorrectly classified under category A, although they do not actually belong to category A.

  • False negative (FN) represents the number of documents that the classifier misclassified into another category, although their actual class indicates that they should be classified under category A.

  • Precision can be thought of as a measure of exactness; it refers to the percentage of test documents classified as category A that actually belong to category A.
    $$\begin{aligned} \hbox {Precision} = (\hbox {TP})/(\hbox {TP}+\hbox {FP}) \end{aligned}$$
  • Recall is a measure of completeness; it refers to the percentage of test documents that actually belong to category A and were correctly classified as category A.
    $$\begin{aligned} \hbox {Recall} = (\hbox {TP})/(\hbox {TP} + \hbox {FN}) \end{aligned}$$
  • F-measure is an alternative way to use precision and recall by combining them into a single measure. The F-measure is a harmonic mean of precision and recall.
    $$\begin{aligned} F\hbox {-measure}\! =\! (2 * \hbox {Precision} * \hbox {Recall}) / (\hbox {Precision} \!+\! \hbox {Recall}) \end{aligned}$$
  • Accuracy of a classifier on a given test set of documents is the percentage of test documents that are correctly classified by the classifier, whether as belonging to category A or to other categories.
    $$\begin{aligned} \hbox {Accuracy} = (\hbox {TP} + \hbox {TN})/(\hbox {TP} + \hbox {TN} + \hbox {FP} + \hbox {FN}) \end{aligned}$$
Table 4  Confusion matrix structure

Actual category      Predicted: Category A    Predicted: Other categories
Category A           TP                       FN
Other categories     FP                       TN

To compute these metrics, we used two main accuracy estimation methods: K-fold cross-validation (CV) and percentage split. In K-fold cross-validation, the set of documents is partitioned into K folds; \(K-1\) folds are used to train the classifier, and one fold is used for testing. This process is repeated K times, each time selecting a different fold for testing. After the K repetitions, the evaluation metrics are averaged to find the final value of each metric.

In the percentage split method, the set of documents is simply randomly partitioned into two parts. The first partition, which is usually set to be two-thirds of the documents set, is used to train the classification method to build the classifier, while the other partition is used to evaluate the classifier by computing the evaluation metrics.
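For completeness, the following sketch shows how tenfold cross-validation accuracy could be computed; `train_model` and `predict` stand for the (assumed) model-building and prediction steps and are placeholders, not functions defined in the paper.

import random

def ten_fold_cv_accuracy(documents, labels, train_model, predict, k=10):
    """Split the documents into k folds, train on k-1 folds, test on the rest,
    and return the accuracy averaged over the k repetitions."""
    indices = list(range(len(documents)))
    random.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    accuracies = []
    for test_idx in folds:
        train_idx = [i for i in indices if i not in test_idx]
        model = train_model([documents[i] for i in train_idx],
                            [labels[i] for i in train_idx])
        correct = sum(predict(model, documents[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k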

5 Experiments and discussion of results

This section presents and discusses the experimental results of the proposed approach. We first give a brief description of the Arabic dataset (corpus) used in the experiments. Next, we present the results of using the proposed approach for categorizing Arabic text using single and multiple reducts and compare the two settings. Finally, we compare the proposed approach with some other well-known categorization algorithms, namely K-nearest neighbor and J48, using the same corpus.
Table 5  Results using single reduct and tenfold CV method

Category      Precision   Recall   F-measure
Art             0.89       0.90      0.89
Economy         0.76       0.92      0.84
Health          0.92       0.96      0.94
Law             0.74       0.88      0.80
Literature      0.84       0.78      0.80
Politics        0.86       0.76      0.80
Religion        0.92       0.84      0.88
Sport           0.97       0.91      0.94
Technology      0.98       0.79      0.87

Table 6  Results using multiple reducts and tenfold CV

Category      Precision   Recall   F-measure
Art             0.98       0.97      0.97
Economy         0.90       0.98      0.93
Health          0.98       0.97      0.98
Law             0.85       0.95      0.89
Literature      0.94       0.93      0.93
Politics        0.91       0.89      0.90
Religion        0.97       0.87      0.92
Sport           1.00       0.99      0.99
Technology      0.98       0.92      0.95

5.1 The corpus

To evaluate the methods presented in this research, we used an Arabic corpus collected from www.diab.edublogs.org. The corpus consists of 2700 documents that have been categorized manually by human experts into nine categories (Art, Economy, Health, Law, Literature, Politics, Religion, Sport, and Technology). Documents in the corpus were evenly distributed over the nine categories, with 300 documents each.
Table 7  Coverage of the proposed approach using multiple and single reducts

            Coverage of multiple reducts           Coverage of single reduct
Fold#       # classified    # unclassified         # classified    # unclassified
1               270               0                    221               49
2               270               0                    218               52
3               270               0                    245               25
4               270               0                    225               45
5               270               0                    223               47
6               270               0                    237               33
7               270               0                    241               29
8               270               0                    228               42
9               270               0                    236               34
10              270               0                    229               41
Total          2700               0                   2303              397
Average    2700/2700 = 100%   0/2700 = 0%         2303/2700 = 85%   397/2700 = 15%

Fig. 6

F-measure when using multiple and single reducts with tenfold CV

5.2 Experiment using single reduct and multiple reducts

In this section, we used the proposed approach for the categorization of the testing documents using a single reduct to compare with the results obtained when using multiple reducts. Table 5 shows the results of the proposed approach using a single reduct, where tenfold CV is used for splitting the dataset. The highest precision was for the Technology category, which achieved a precision of 98%, whereas the lowest was for the Law category, which achieved a precision of 74%. As for accuracy, the final overall classification accuracy of the built classifier using the single reduct method was 86%.

Table 6 shows the results when using the multiple reducts approach. We can notice that most categories have a precision value greater than 90%, where the lowest precision obtained was for the Law category and the highest was for the Sport category. In terms of recall, the results in Table 6 show that the lowest recall was obtained for the Religion category and the highest recall was obtained for the Sport category. As for accuracy, the final classification accuracy of the built classifier using the multiple reducts method was 94%.

Table 7 shows the comparison between using multiple and single reducts in terms of the number of classified and unclassified documents in each experiment when using the tenfold CV method. Figure 6 shows the F-measure results of the proposed approach using multiple and single reducts.

Referring to the results presented in Table 7, we can notice that when using a single reduct, the number of generated rules was not enough for the categorization of all testing documents: 15% of the testing documents were not matched by any rule. On the other hand, the results in Table 6 indicate that most of the documents were categorized by the proposed approach using multiple reducts into their correct and actual categories.

The results in Table 7 show that using multiple reducts produced better coverage than using a single reduct. The multiple reducts method generated enough rules to classify all testing documents, whereas the single reduct method did not.

From Fig. 6, it can be noticed that the F-measure when using multiple reducts is better than the F-measure produced when using a single reduct in all categories. In addition, the single reduct results cover only 85% of the testing documents, while the remaining 15% of the testing documents were not classified into any category.

5.3 Comparison with other classification methods

To compare the results of the proposed approach with the results of other text categorization algorithms, we used the WEKA toolkit (Hall et al. 2009) with the same dataset that was used by the proposed approach. We used two well-known classification methods: the K-nearest neighbor (K-NN) algorithm and the J48 algorithm, which is the Java implementation of the well-known C4.5 decision tree classification method.

Table 8 shows the results of the K-NN algorithm using the tenfold CV method. In terms of precision, the highest precision values were obtained for the Politics and Sport categories, whereas the lowest precision was obtained for the Religion category. In terms of recall, the highest recall was obtained for the Religion category, whereas the lowest recall was obtained for the Politics category. In general, the results in Table 8 show that a large number of documents were not categorized into their correct and actual categories. Figure 7 compares the F-measure of the rough set-based approach with that of the K-NN algorithm when using the tenfold CV method.
Table 8  Results of K-nearest neighbor using tenfold CV method

Category      Precision   Recall   F-measure
Art             0.67       0.41      0.51
Economy         0.89       0.48      0.62
Health          0.76       0.88      0.82
Law             0.56       0.73      0.64
Literature      0.95       0.23      0.37
Politics        1.00       0.13      0.23
Religion        0.24       0.97      0.39
Sport           1.00       0.76      0.86
Technology      0.99       0.24      0.39

Table 9  Results of J48 using tenfold CV method

Category      Precision   Recall   F-measure
Art             0.89       0.89      0.89
Economy         0.82       0.82      0.82
Health          0.83       0.84      0.83
Law             0.64       0.64      0.64
Literature      0.80       0.74      0.77
Politics        0.61       0.66      0.63
Religion        0.84       0.76      0.80
Sport           0.95       0.94      0.95
Technology      0.80       0.84      0.82

As for the J48 method, Table 9 shows the results of the J48 algorithm using tenfold CV method. Figure 8 shows a comparison of the F-measure metric between the results of the proposed approach with the results of the J48 algorithm when using tenfold CV method.

When comparing the results of the proposed approach with other classification methods, the results of the J48 method presented in Table 9 indicate that a considerable number of documents were not categorized into their correct and actual categories. From Fig. 8, we can notice as well that the F-measure values produced by the proposed approach are higher than those produced by the J48 algorithm in all categories. This indicates that the method of selecting the best set of terms (multiple reducts) used to classify the documents in the proposed approach was more appropriate than the gain ratio measure used in J48.

When reviewing the results presented in Table 8 for the K-NN method, we can notice that there is a variation in terms of precision and recall from one category to another. We can also notice from Fig. 7 that the proposed approach has produced results better than the K-NN algorithm in all categories. This indicates that the set of terms used to classify the documents using the proposed approach was more significant than using the whole set of terms as in the K-NN method.

In addition, when we reviewed the results of some methods presented in the literature review section, we noticed that the approach presented in this paper achieved acceptable results in terms of classification accuracy in comparison with other approaches. For example, the decision tree algorithm proposed by Harrag et al. (2009) was tested using two Arabic corpora, a scientific corpus and a literary corpus; the decision tree achieved an accuracy of 93% for the scientific corpus and 91% for the literary corpus, while the proposed approach achieved an accuracy of 94%.
Fig. 7

F-measure of the proposed approach and K-NN algorithm with tenfold CV

Fig. 8

F-measure of the proposed approach and J48 algorithm using tenfold CV method

As an end note, it can be noticed that the rough set approach is applicable to the categorization of Arabic text and produced an acceptable accuracy of 94%. As for other rough set-based classification methods, there is no other work that uses the rough set methodology to classify Arabic documents with which to compare.

6 Conclusion

In this paper, we proposed a rough set theory-based approach for the categorization of Arabic text using multiple reducts. Based on the experiments and the results presented throughout this paper, we can conclude that this research work has achieved its objectives. The proposed approach has been tested in practice to demonstrate its applicability to the categorization of Arabic text, and the experimental results support the significance of the approach. The results of the proposed approach on a set of 2700 documents in nine categories showed that the multiple reducts strategy produced an accuracy of 94%, which is better than the 86% accuracy produced by the single reduct strategy. In addition, the number of rules generated when using a single reduct was not enough for the categorization of all testing documents: 15% of the testing documents were not categorized when using the tenfold CV method. From the results, we conclude that using multiple reducts produces better performance than using a single reduct when applying the rough set approach to the categorization of Arabic text.

On the other hand, the comparison of the proposed approach with the K-NN and J48 algorithms showed that the proposed approach produced an accuracy of 94%, which outperformed the accuracy produced by K-NN, which reached 55%, and the accuracy produced by the J48 algorithm, which reached 79%.

As a concluding remark, the research presented in this paper supports its hypothesis and meets its objectives. The research concluded that the rough set approach is applicable to the categorization of Arabic text. Moreover, the results indicate that generating multiple reducts is a necessary step for improving the performance of the rough set classifier in comparison with generating a single reduct.

While conducting this research, we identified several ideas that can be considered for future research and extensions of the proposed approach. These include: (1) enhancing the proposed approach to categorize a document into multiple categories if its content indicates that it belongs to more than one category, and (2) studying the effect of stemming on the performance of the proposed approach.

Notes

Funding

This research received no specific grant from any funding agency in public, commercial, or not-for-profit sectors.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

References

  1. Abualigah LMQ, Hanandeh ES (2015) Applying genetic algorithms to information retrieval using vector space model. Int J Comput Sci Eng Appl 5(1):19–28
  2. Abualigah LM, Khader AT (2017) Unsupervised text feature selection technique based on hybrid particle swarm optimization algorithm with genetic operators for the text clustering. J Supercomput 73(11):4773–4795
  3. Abualigah LM, Khader AT, Hanandeh ES (2018) A new feature selection method to improve the document clustering using particle swarm optimization algorithm. J Comput Sci 25:456–466. https://doi.org/10.1016/j.jocs.2017.07.018
  4. Al-Dhaheri S (2010) Arabic text categorization based on features reduction using artificial neural network. Master Thesis, Faculty of Graduate Studies, The University of Jordan
  5. Al-Diabat M (2012) Arabic text categorization using classification rule mining. Appl Math Sci 6:4033–4046
  6. Al-Radaideh Q, Al-Khateeb S (2015) An associative rule-based classifier for Arabic medical text. Int J Knowl Eng Data Min 3(3–4):255–273
  7. Al-Radaideh Q, Al-Qudah G (2017) Application of rough set-based feature selection for Arabic sentiment analysis. Cognit Comput 9(4):436–445
  8. Al-Radaideh Q, Bataineh D (2018) A hybrid approach for Arabic text summarization using domain knowledge and genetic algorithms. Cognit Comput. https://doi.org/10.1007/s12559-018-9547-z
  9. Al-Radaideh Q, Al-Shawakfa E, Ghareb A, Abu Salem H (2011) An approach for Arabic text categorization using association rule mining. Int J Comput Process Lang 23(1):81–106
  10. Al-Radaideh Q, Sulaiman MN, Selamat MH, Ibrahim H (2005) Approximate reduct computation by rough sets based attribute weighting. In: Proceedings of the IEEE international conference on granular computing, pp 383–386
  11. Al-Radaideh Q, Twaiq L (2014) Rough set theory for Arabic sentiment classification. In: Proceedings of the 2014 international conference on future internet of things and cloud. IEEE Computer Society
  12. Alsaleem S (2011) Automated Arabic text categorization using SVM and NB. Int Arab J e-Technol 2(2):124–128
  13. Al-Salemi B, Aziz M (2011) Statistical Bayesian learning for automatic Arabic text categorization. J Comput Sci 7(1):39–45
  14. Al-Shalabi R, Kanaan G, Gharaibeh M (2006) Arabic text categorization using KNN algorithm. In: Proceedings of the 4th international multi-conference on computer science and information technology, Amman, Jordan
  15. Azara M, Fatayer T, El-Halees A (2012) Arabic text classification using learning vector quantization. In: Proceedings of the 8th international conference on informatics and systems (INFOS 2012), pp 39–43
  16. Bao Y, Aoyama S, Du X, Yamada K, Ishii N (2001) A rough set based hybrid method to text categorization. In: Proceedings of the 2nd international conference on web information systems engineering. IEEE Computer Society, pp 254–261
  17. Chantar HK, Corne DW (2011) Feature subset selection for Arabic document categorization using BPSO-KNN. In: Nature and biologically inspired computing (NaBIC), pp 545–551
  18. Chen Y, Zeng Z, Lu J (2017) Neighborhood rough set reduction with fish swarm algorithm. Soft Comput 21(23):6907–6918
  19. Chen P, Liu S (2008) Rough set-based SVM classifier for text categorization. In: Proceedings of the fourth international conference on natural computation (ICNC), pp 153–157
  20. Chouchoulas A (1999) A rough set approach to text classification. Master Thesis, School of Artificial Intelligence, Division of Informatics, The University of Edinburgh
  21. Dai L, Hu J, Liu W (2008) Using modified CHI square and rough set for text categorization with many redundant features. In: Proceedings of the international symposium on computational intelligence and design (ISCIS), vol 1, pp 182–185
  22. Darwish K (2002) Building a shallow Arabic morphological analyzer in one day. In: Proceedings of the ACL workshop on computational approaches to semitic languages. ACL
  23. Duwairi R (2006) Machine learning for Arabic text categorization. J Am Soc Inf Sci Technol 57(8):1005–1010
  24. Duwairi R (2007) Arabic text categorization. Arab J Inf Technol 4(2):125–131
  25. Duwairi R, El-Orfali M (2014) A study of the effects of preprocessing strategies on sentiment analysis for Arabic text. J Inf Sci 40(4):501–513
  26. Duwairi R, Al-Refai M, Khasawneh N (2009) Feature reduction techniques for Arabic text categorization. J Am Soc Inf Sci 60(11):2347–2352
  27. Ghareb A, Hamdan A, Bakar A (2016) Hybrid feature selection based on enhanced genetic algorithm for text categorization. Exp Syst Appl 49:31–47
  28. Ghareb A, Bakar AA, Al-Radaideh Q, Hamdan A (2018) Enhanced filter feature selection methods for Arabic text categorization. Int J Inf Retr Res 8(2):1–24
  29. Gharib TF, Habib MB, Fayed ZT (2009) Arabic text classification using support vector machines. Int J Comput Appl 16(4):1–8
  30. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten I (2009) The WEKA data mining software: an update. SIGKDD Explor 11(1):10–18
  31. Han J, Kamber M, Pei J (2012) Data mining concepts and techniques, 3rd edn. Morgan Kaufmann Publishers, Los Altos
  32. Harrag F, El-Qawasmah E, Al-Salman AS (2010) Comparing dimension reduction techniques for Arabic text classification using BPNN algorithm. In: Proceedings of the 2010 first international conference on integrated intelligent computing, pp 6–11
  33. Harrag F, El-Qawasmeh E (2009) Neural network for Arabic text classification. In: Proceedings of the international conference of applications of digital information and web technologies, ICADIWT '09, pp 778–783
  34. Harrag F, El-Qawasmeh E, Pichappan P (2009) Improving Arabic text categorization using decision trees. In: Proceedings of the 1st international conference of NDT '09, pp 110–115
  35. Hmeidi I, Hawashin B, El-Qawasmeh E (2008) Performance of KNN and SVM classifiers on full word Arabic articles. Adv Eng Inform 22:106–111
  36. Hmeidi I, Al-Ayyoub M, Abdulla N, Almodawar A, Abooraig R, Mahyoub N (2015) Automatic Arabic text categorization: a comprehensive comparative study. J Inf Sci 41(1):114–124
  37. Hussien MI, Olayah F, AL-dwan M, Shamsan A (2011) Arabic text classification using SMO, Naive Bayesian, J48 algorithm. Int J Res Rev Appl Sci 9(2):306–316
  38. Hu Q, Yu D, Xie Z (2004) Improvement on classification performance based on multiple reduct ensembles. In: Proceedings of the 2004 IEEE conference on cybernetics and intelligent systems, vol 2, pp 1016–1021
  39. Ishii N, Morioka Y, Kimura H, Bao Y (2010) Classification by partial data of multiple reducts kNN with confidence. In: Proceedings of the 22nd IEEE international conference on tools with artificial intelligence, pp 94–101
  40. Jensen R (2005) Combining rough and fuzzy sets for feature selection. Ph.D. Thesis, School of Informatics, University of Edinburgh
  41. Lam W, Ruiz M, Srinivasan P (1999) Automatic text categorization and its application to text retrieval. IEEE Trans Knowl Data Eng 11(6):865–879
  42. Lin TY (1996) Rough set theory in very large databases. In: Proceedings of the symposium on modeling analysis and simulation, CESA'96 IMACS multi-conference on computational engineering in systems applications, pp 936–941
  43. Mesleh A (2007) Chi-square feature extraction based SVMs Arabic language text categorization system. J Comput Sci 3(6):430–435
  44. Noaman H, Elmougy S, Ghoneim A, Hamza T (2010) Naïve Bayes classifier based Arabic document categorization. In: Proceedings of the 7th international conference on informatics and systems (INFOS 2010), Cairo, Egypt
  45. Pawlak Z (1982) Rough sets. Int J Comput Inform Sci 11(5):341–356
  46. Pawlak Z (1991) Rough sets: theoretical aspects of reasoning about data. Kluwer, Dordrecht
  47. Rasim Cekik R, Telceken S (2018) A new classification method based on rough sets theory. Soft Comput 22(6):1881–1889
  48. Skowron A, Rauszer C (1992) The discernibility matrices and functions in information systems. In: Słowiński R (ed) Intelligent decision support
  49. Syiam MM, Fayed ZT, Habib MB (2006) An intelligent system for Arabic text categorization. Int J Intell Comput Inf Sci 6(1):1–19
  50. Thabtah F, Eljinini M, Zamzeer M, Hadi W (2009) Naïve Bayesian based on chi-square to categorize Arabic data. In: Proceedings of the 11th international business information management association (IBIMA) conference on innovation and knowledge management in twin track economies, Cairo, pp 930–935
  51. Thangavel K, Pethalakshmi A (2009) Dimensionality reduction based on rough set theory: a review. Appl Soft Comput 9(1):1–12
  52. Velayutham C, Thangavel K (2011) Unsupervised quick reduct algorithm using rough set theory. J Electron Sci Technol (JEST) 9(3):193–201
  53. Wahbeh A, Al-Kabi M, Al-Radaideh Q, Al-Shawakfa E, Alsmadi I (2011) The effect of stemming on Arabic text classification: an empirical study. Int J Inf Retr Res 1(3):54–70
  54. Wang Z, Sun X, Li X, Zhang D (2006) An efficient SVM-based spam filtering algorithm. In: Proceedings of the fifth international conference on machine learning and cybernetics, pp 3682–3686
  55. Wang N, Wang P, Zhang B (2010) An improved TF–IDF weights function based on information theory. In: Proceedings of the international conference on computer and communication technologies in agriculture engineering, pp 439–441
  56. Yahia ME (2011) Arabic text categorization based on rough set classification. In: Proceedings of the 9th IEEE/ACS international conference on computer systems and applications, pp 293–294
  57. Yin S, Huang Z, Chen L, Qiu Y (2008) An approach for text classification feature dimensionality reduction and rule generation on rough set. In: Proceedings of the third international conference on innovative computing, information and control (ICICIC 2008). IEEE CS
  58. Zhang Q, Tan J, Zhou H, Tao W, He K (2009) Machine learning methods for medical text categorization. In: Proceedings of the Pacific-Asia conference on circuits, communications and system, pp 494–497
  59. Zhao W, Zhang Z (2005) An E-mail classification model based on rough set theory. In: Proceedings of the 2005 international conference on active media technology (AMT 2005), pp 403–408
  60. Zhong N, Dong J, Ohsuga S (2001) Using rough sets with heuristics for feature selection. J Intell Inf Syst 16(3):199–214
  61. Zhu XZ, Zhu W, Fan XN (2017) Rough set methods in feature selection via submodular function. Soft Comput 21(13):3699–3711

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Information Systems, Faculty of Information Technology and Computer Sciences, Yarmouk University, Irbid, Jordan
