1 Introduction

The task of feature subset selection (FS) is a necessary preprocessing step for building learning models, as it increases predictive accuracy and model comprehensibility. Finding informative features is especially challenging under inconsistent and imprecise information. Classical rough set theory (RST), proposed by Pawlak [14], is an essential mathematical method for dealing with imprecise information without requiring additional information about the data. RST is only applicable to categorical decision systems [14]; thus, it requires prior discretization for numerical decision systems.

A given decision system has many reducts, and even a single reduct is adequate for inducing a reliable classification model. However, the generalizability of the classifier induced by a reduct varies from one reduct to another, and there is no guarantee that a preferred single reduct yields the best performance. Hence, researchers are interested in computing the best (optimal) reduct out of all possible reducts. Several heuristic approaches based on dependency measures and the discernibility matrix have been proposed in the literature for single reduct computation [8, 15, 17]. Even though these approaches are computationally efficient with polynomial time complexity, they cannot guarantee computation of an optimal reduct.

In 1992, Skowron et al. [17] introduced a Boolean reasoning based approach for computing all reducts using a crisp discernibility matrix. An optimal reduct can thus be selected by applying the optimality criterion to all possible reducts. At present, this is the only way to obtain an optimal reduct. However, computing minimal/all reducts in RST is an NP-hard problem [17]. Subsequently, several evolutionary algorithms have been investigated for near-optimal reduct computation, such as the genetic algorithm (GA) [19], ant colony optimization (ACO) [9], simulated annealing [11], and particle swarm optimization (PSO) [18]. In this context, Wroblewski [19] proposed three approaches combining GA with greedy heuristics to generate a minimal reduct. Jensen et al. [9] adopted a stochastic approach based on ACO to create a near-optimal reduct. While Wang et al. [18] proposed a reduct computation approach through PSO, Chen et al. [2] incorporated fish swarm optimization with rough sets for finding the reduct. Jensen et al. [11] proposed a feature selection mechanism that combined simulated annealing with rough set theory.

However, most existing optimal/near-optimal reduct computation approaches adopt the optimality criterion of a reduct having the minimum number of features (shortest length reduct). The dependency measures used in reduct computation algorithms, such as the gamma measure [8, 20], conditional information entropy measure [10], and discernibility based measure [1, 17], favour attributes with larger domain cardinality, as the resulting finer granular space achieves a better dependency measure value and can yield a reduct of the smallest size. There is a correspondence between reducts and rule induction: rules are induced from the granules (equivalence classes) obtained using the reduct attributes [4]. In a finer granular space, the number of rules will be larger and the strength of each rule smaller, which can hurt the generalizability of reduct-induced classifiers. In contrast, a coarser granular space generates a smaller rule set with higher strength and can induce better classifiers. Hence, in this work the optimality criterion is taken as minimizing the number of equivalence classes (granules) induced by the reduct attributes. Under this criterion, the resulting optimal reduct produces the coarsest granular space.

In this work, a consistent heuristic is proposed based on the considered optimality criterion. An optimal reduct computation algorithm (\(A^*RSOR\)) is developed by using the proposed consistent heuristic in \(A^*\) search. The resulting approach is significant because an optimal reduct can be computed without generating all possible reducts. The significance of \(A^{*}RSOR\) is validated both theoretically and through experimental analysis.

The remainder of this paper is structured as follows. The theoretical background of classical rough sets, along with preliminaries of the relative dependency measure and the \(A^*\) search strategy, is discussed in Sect. 2. In Sect. 3, a detailed theoretical explanation of the proposed optimal rough set reduct computation algorithm \(A^*RSOR\) is given. Section 4 covers the comparative experimental evaluation and analysis of results. Lastly, Sect. 5 concludes the paper with remarks and future directions for the proposed method.

2 Theoretical Background

2.1 Rough Set Theory

The decision system is represented as \(DS = (\mathbb {U},\mathbb {C}\cup \mathbb {D} )\), where \(\mathbb {U}\) represents the non-empty finite set of objects called the universe, and \(\mathbb {C}\) is the set of conditional attributes such that \(a:\mathbb {U} \rightarrow V_a\), \(\forall a \in \mathbb {C}\), where \(V_a\) is the set of domain values of ‘a’. \(\mathbb {D}\) is a set of decision attributes or response variables, usually \(\mathbb {D} = \{d\}\) containing a single decision attribute. For any subset \( \mathbb {B} \subseteq \mathbb {C}\), there exists an associated equivalence relation called the indiscernibility relation \(IND( \mathbb {B})\), defined as:

$$\begin{aligned} IND( \mathbb {B})=\{(u_i,u_j)\in \mathbb {U}^2~|~\forall a\in \mathbb {B},a(u_i)=a(u_j)\} \end{aligned}$$
(1)

The collection of all equivalence classes (granules) of \(IND( \mathbb {B})\) is represented as \(\mathbb {U}/ IND(\mathbb {B})\) or \(\mathbb {U}/ \mathbb {B} \), and is defined as: \( \mathbb {U}/ IND( \mathbb {B})= \otimes \left\{ \mathbb {U}/IND(\{a\}) : a \in \mathbb {B} \right\} \), where \(\otimes \) is the refinement operator. In the rest of the paper, the granules of \(IND(\mathbb {B})\) are represented as \(\mathbb {U}/ \mathbb {B}\).
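
To make the granular space concrete, the following minimal Python sketch computes \(\mathbb {U}/ \mathbb {B}\) by grouping objects on their attribute-value signatures. (The paper's implementation is in Matlab; this is an illustrative analogue, and the dict-of-dicts table layout is our own assumption.)

```python
from collections import defaultdict

def partition(universe, attrs, table):
    """U/IND(B): group objects that agree on every attribute in attrs."""
    blocks = defaultdict(list)
    for u in universe:
        key = tuple(table[u][a] for a in attrs)  # signature of u w.r.t. B
        blocks[key].append(u)
    return list(blocks.values())

# Toy decision table: conditional attributes a1, a2 and decision d.
table = {
    1: {'a1': 0, 'a2': 0, 'd': 0},
    2: {'a1': 0, 'a2': 1, 'd': 0},
    3: {'a1': 1, 'a2': 0, 'd': 1},
    4: {'a1': 1, 'a2': 0, 'd': 1},
}
U = list(table)
print(partition(U, ['a1'], table))        # [[1, 2], [3, 4]]
print(partition(U, ['a1', 'a2'], table))  # refinement: [[1], [2], [3, 4]]
```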

The set of granules \(\mathbb {U}/ \mathbb {B}\) constitutes the granular space through which rough set based approximations are defined. Let \( X \subseteq \mathbb {U}\) be the concept to be approximated; then the \(\mathbb {B}\)-lower and \(\mathbb {B}\)-upper approximations of X are computed as follows:

$$\begin{aligned} \underline{ \mathbb {B}}X=\{u \in \mathbb {U}~|~[u]_{ \mathbb {B}} \subseteq X\}, \qquad \overline{ \mathbb {B}}X=\{u \in \mathbb {U}~|~[u]_{ \mathbb {B}} \cap X \ne \emptyset \} \end{aligned}$$
(2)

where \([u]_ \mathbb {B}\) is the equivalence class of ‘u’.

The positive region \(POS_{ \mathbb {B}}(\mathbb {D})\) is the collection of all objects that can be classified with certainty into the decision classes of \(\mathbb {D}\), defined as:

$$\begin{aligned} POS_{ \mathbb {B}}(\mathbb {D}) = \bigcup _{X \in \mathbb {U}/\mathbb {D}} \underline{ \mathbb {B}}X \end{aligned}$$
(3)

Consistent Decision System: A decision system DS is said to be a consistent decision system (CDS) if and only if \(POS_{\mathbb {C}}(\mathbb {D})=\mathbb {U}\). In case the given decision system is inconsistent, the generalized decision operator given in [5] is applied to convert DS into a CDS. Hence, in the rest of the paper, DS is assumed to be consistent.
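
Continuing the sketch above, Eq. (3) and the CDS check can be expressed as follows: a granule contributes to the positive region exactly when all of its objects carry a single decision value, i.e., it lies inside some lower approximation \(\underline{ \mathbb {B}}X\). (A hedged illustration reusing the `partition` helper and toy `table` from above.)

```python
def positive_region(universe, cond_attrs, dec_attrs, table):
    """POS_B(D): union of B-granules that are pure w.r.t. the decision (Eq. 3)."""
    pos = set()
    for granule in partition(universe, cond_attrs, table):
        decisions = {tuple(table[u][d] for d in dec_attrs) for u in granule}
        if len(decisions) == 1:      # granule lies inside one decision class
            pos.update(granule)
    return pos

def is_consistent(universe, cond_attrs, dec_attrs, table):
    """CDS check: POS_C(D) must equal the whole universe."""
    return positive_region(universe, cond_attrs, dec_attrs, table) == set(universe)

print(is_consistent(U, ['a1', 'a2'], ['d'], table))  # True for the toy table
```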

2.2 Relative Dependency Measure

Han et al. [6] defined a dependency measure named relative dependency. Let \( \mathbb {B} \subseteq \mathbb {C}\); then the degree of relative attribute dependency of \(\mathbb {D}\) on \( \mathbb {B}\), denoted \(\kappa _\mathbb {B}(\mathbb {D})\), is defined as:

$$\begin{aligned} \kappa _ \mathbb {B}(\mathbb {D})=\frac{\left| \mathbb {U}/ \mathbb {B} \right| }{\left| \mathbb {U}/( \mathbb {B} \cup \mathbb {D}) \right| } \end{aligned}$$
(4)

\( \mathbb {B}\) is a reduct of a CDS if and only if \(\kappa _ \mathbb {B}(\mathbb {D})=\kappa _\mathbb {C}(\mathbb {D})=1\) and \(\forall Q \subset \mathbb {B},\kappa _Q(\mathbb {D})\ne \kappa _\mathbb {C}(\mathbb {D})\). Here |S|, for any set S, represents the cardinality of S.
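
A direct, hedged rendering of Eq. (4) and the reduct test on the toy example of Sect. 2.1 (for a CDS, \(\kappa _\mathbb {C}(\mathbb {D})=1\), so the minimality condition reduces to checking that no proper subset attains 1):

```python
from itertools import combinations

def kappa(universe, B, D, table):
    """Relative dependency (Eq. 4): |U/B| / |U/(B u D)|."""
    return len(partition(universe, B, table)) / len(partition(universe, B + D, table))

def is_reduct(universe, B, D, table):
    """B is a reduct iff kappa_B(D) = 1 and no proper subset Q attains kappa_Q(D) = 1."""
    if kappa(universe, B, D, table) != 1.0:
        return False
    return all(kappa(universe, list(Q), D, table) != 1.0
               for r in range(len(B)) for Q in combinations(B, r))

print(kappa(U, ['a1'], ['d'], table))       # 1.0
print(is_reduct(U, ['a1'], ['d'], table))   # True: {a1} is a reduct
```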

2.3 \(A^*\) Search Algorithm

In several domains, the solution to a problem can be formulated as a state-space search, represented as a graph containing a source node and several possible goal nodes. The solution is given in the form of a path or a state, according to the nature of the problem. \(A^*\) is a popular search strategy [7], introduced as part of the Shakey project [13]. It is a complete and optimal search algorithm, formulated by combining the mechanisms of Dijkstra's shortest path and best-first search. Along with the path cost, each state ‘n’ is associated with a heuristic cost that predicts the cost of reaching the goal node from ‘n’. Hence, the total cost to reach the goal state is estimated as:

$$\begin{aligned} f(n) = g(n)+ h(n) \end{aligned}$$
(5)

where g(n) denotes the path cost from the source node to the current node ‘n’, and h(n) is the prediction heuristic which estimates the cost to reach the goal from node ‘n’. Any node with h(n) equal to zero is a candidate goal node.

Generally, \(A^*\) uses an openlist, defined as a priority queue, to conduct the repeated selection of the least-cost node to explore, and a closelist, defined as the collection of explored nodes. The process starts by placing the source node into the openlist. At each iteration, the node with the lowest f(n) value is removed from the openlist and placed in the closelist, and its successor nodes with updated f values are added to the openlist. If the node selected for exploration from the openlist has a heuristic value of zero, the \(A^*\) algorithm stops and returns the solution associated with that node.

\(A^*\) search is optimal if h(n) is admissible and consistent. A heuristic is admissible if the value of h(n) never overestimates the actual cost. A heuristic is consistent if \(h(n) \le c(n,n') + h(n')\), where \(n'\) is a child node of n in the search space graph and \(c(n,n')\) denotes the corresponding edge cost. In general, every consistent heuristic is admissible [7].
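
The following self-contained Python sketch (illustrative only; not from the paper) implements the openlist/closelist procedure of this subsection on a toy graph with a consistent heuristic, stopping at the first popped node with h(n) = 0:

```python
import heapq

def a_star(source, successors, h):
    """Minimal A*: successors(n) yields (child, edge_cost); h(n) == 0 marks a goal."""
    openlist = [(h(source), source, [source])]  # priority queue ordered by f = g + h
    g = {source: 0}
    closelist = set()
    while openlist:
        _, n, path = heapq.heappop(openlist)
        if h(n) == 0:
            return path                          # goal popped first => optimal cost
        if n in closelist:
            continue
        closelist.add(n)
        for child, cost in successors(n):
            new_g = g[n] + cost
            if child not in g or new_g < g[child]:
                g[child] = new_g
                heapq.heappush(openlist, (new_g + h(child), child, path + [child]))
    return None

# Toy graph: reach 'G' (h = 0) from 'S'; S->B->G (cost 5) beats S->A->G (cost 6).
graph = {'S': [('A', 1), ('B', 4)], 'A': [('G', 5)], 'B': [('G', 1)], 'G': []}
hvals = {'S': 4, 'A': 5, 'B': 1, 'G': 0}
print(a_star('S', lambda n: graph[n], lambda n: hvals[n]))  # ['S', 'B', 'G']
```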

3 \(A^*RSOR\) Search Algorithm

The purpose of the \(A^*RSOR\) method is to find the reduct with the coarsest granular space, inducing the least number of rules. Let RED(CDS) represent the set of all possible reducts of a CDS; then an optimal reduct \( \mathbb {B}^*\) with the coarsest granular space must satisfy the property:

$$\begin{aligned} | \mathbb {U}/ \mathbb {B}^* |=\underset{ \mathbb {B} \in RED(CDS)}{min} | \mathbb {U}/ \mathbb {B} | \end{aligned}$$
(6)
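
Given RED(CDS) (e.g., enumerated by an exhaustive tool such as RSES), Eq. (6) amounts to a one-line selection; a small sketch reusing the `partition` helper of Sect. 2.1:

```python
def optimal_reduct(universe, all_reducts, table):
    """Eq. (6): among all reducts, pick one inducing the coarsest granular space."""
    return min(all_reducts, key=lambda B: len(partition(universe, B, table)))

# e.g. optimal_reduct(U, [['a1']], table) -> ['a1'] on the toy table of Sect. 2.1
```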

The computation of a reduct is a search problem in the space of the power set of \(\mathbb {C}\). The traditional sequential backward elimination (SBE) and sequential forward selection (SFS) control strategies are examples of hill-climbing approaches in this search space. The goal state corresponds to a reduct in RED(CDS). In formulating an \(A^*\) reduct computation algorithm, a heuristic function is needed for evaluating the cost of reaching a goal state. We devise such a heuristic function for states in the reduct computation search space in the next subsection.

3.1 Partition Refinement Heuristic (PR-Heuristic)

The kappa measure \(\kappa _{ \mathbb {B}}(\mathbb {D})\) measures the purity of the granular space induced by \( \mathbb {B}\). A granule \(g\in \mathbb {U}/ \mathbb {B}\) is pure if ‘g’ contains objects of a single decision class. A granular space is said to be pure if all of its granules are pure. For a pure granule, no further refinement takes place with the inclusion of the decision attribute. As every reduct of a CDS induces a pure granular space, we have \(\kappa _{ \mathbb {B}}(\mathbb {D})=1, \forall ~\mathbb {B}\in RED(CDS)\), since \(|\mathbb {U}/ \mathbb {B}| = |\mathbb {U}/( \mathbb {B} \cup \mathbb {D})|\).

If an attribute collection \( \mathbb {B}\) is almost pure, then many of the granules in \(\mathbb {U}/ \mathbb {B}\) are pure; that is, \(|\mathbb {U}/( \mathbb {B}\cup \mathbb {D})|\) will be only slightly higher than \(|\mathbb {U}/ \mathbb {B}|\), as only the remaining granules participate in the refinement, resulting in \(\kappa _ \mathbb {B}(\mathbb {D})\) near one. If \( \mathbb {B}\) induces an almost impure granular space, then \(|\mathbb {U}/( \mathbb {B}\cup \mathbb {D})|\) is much higher than \(|\mathbb {U}/ \mathbb {B}|\), resulting in \(\kappa _ \mathbb {B}(\mathbb {D})\) near zero. Hence, the number of additional splits occurring through the refinement of \(IND( \mathbb {B})\) into \(IND( \mathbb {B} \cup \mathbb {D})\) estimates how far the current attribute set \( \mathbb {B}\) is from becoming a reduct. This motivated us to formulate a heuristic, named the partition refinement heuristic (PR-Heuristic), to estimate the cost for an attribute set \( \mathbb {B}\) to become a reduct. For \( \mathbb {B} \subseteq \mathbb {C}\), the PR-heuristic \(h_{PR}\) is given by

$$\begin{aligned} h_{PR}( \mathbb {B})= |\mathbb {U}/( \mathbb {B} \cup \mathbb {D})|- |\mathbb {U}/ \mathbb {B}| \end{aligned}$$
(7)

Here \(h_{PR}( \mathbb {B})\) is the number of splits (refinements) in \(\mathbb {U}/ \mathbb {B}\) occurring with the inclusion of \(\mathbb {D}\). In the reduct computation search space, let the child node of \( \mathbb {B}\) be \( \mathbb {B}'= \mathbb {B} \cup \{a\}\) for any \(a \in {\mathbb {C}- \mathbb {B}}\). The cost of refinement from \( \mathbb {B}\) to \( \mathbb {B}'\) is \(c( \mathbb {B}, \mathbb {B}')= |\mathbb {U}/ \mathbb {B}'|-|\mathbb {U}/ \mathbb {B}|\). \(h_{PR}\) is a consistent heuristic if and only if \(h_{PR}( \mathbb {B})\le h_{PR}( \mathbb {B}')+c( \mathbb {B}, \mathbb {B}')\). Theorem 1 establishes that \(h_{PR}\) is a consistent heuristic.
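
A hedged sketch of Eq. (7) and the edge cost on the toy table of Sect. 2.1, together with a numerical check of the consistency inequality \(h_{PR}( \mathbb {B})\le c( \mathbb {B}, \mathbb {B}') + h_{PR}( \mathbb {B}')\):

```python
def h_pr(universe, B, D, table):
    """PR-heuristic (Eq. 7): number of extra splits when U/B is refined by D."""
    return len(partition(universe, B + D, table)) - len(partition(universe, B, table))

def edge_cost(universe, B, B2, table):
    """c(B, B') = |U/B'| - |U/B| for B' = B u {a}."""
    return len(partition(universe, B2, table)) - len(partition(universe, B, table))

B, B2 = ['a2'], ['a2', 'a1']
lhs = h_pr(U, B, ['d'], table)                                # 1
rhs = edge_cost(U, B, B2, table) + h_pr(U, B2, ['d'], table)  # 1 + 0
assert lhs <= rhs                                             # consistency holds
```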

Theorem 1

The partition refinement heuristic \(h_{PR}\) for state space of reduct computation in CDS is a consistent heuristic.

Proof

Let \(h_{PR}\) denote the PR-heuristic. Let \( \mathbb {B} \subseteq \mathbb {C}\) denote a node being explored in the \(A^*\) algorithm, and for any \(a\in \mathbb {C}- \mathbb {B}\), let a new node \(\mathbb {B'}= \mathbb {B} \cup \{a\}\) be generated. Using Eq. 7, it follows that:

\(h_{PR}( \mathbb {B})= \left| \mathbb {U}/({ \mathbb {B} \cup \mathbb {D})} \right| -\left| \mathbb {U}/ \mathbb {B} \right| \) and \(h_{PR}( \mathbb {B}^{'}) = |\mathbb {U}/({ \mathbb {B} \cup \{a\} \cup \mathbb {D}) }|- |\mathbb {U}/ ( \mathbb {B}\cup \{a\})|\)

The edge cost \(c( \mathbb {B}, \mathbb {B}^{'})= |\mathbb {U}/({ \mathbb {B} \cup \{a\})}|- |\mathbb {U}/ \mathbb {B}|\), then consider,

\(h_{PR}( \mathbb {B}^{'})+c( \mathbb {B}, \mathbb {B}^{'}) =|\mathbb {U}/({ \mathbb {B} \cup \{a\} \cup \mathbb {D}) }|-|\mathbb {U}/ \mathbb {B}|\) \(\ge | \mathbb {U}/({ \mathbb {B} \cup \mathbb {D})}|- |\mathbb {U}/ \mathbb {B}| = h_{PR}( \mathbb {B})\)

(Since \(IND( \mathbb {B} \cup \{a\} \cup \mathbb {D})\) is a refinement of \(IND( \mathbb {B} \cup \mathbb {D})\), we have \(|\mathbb {U}/( \mathbb {B} \cup \{a\} \cup \mathbb {D})| \ge |\mathbb {U}/( \mathbb {B} \cup \mathbb {D})|\).) Hence, it is proved that \(h_{PR}\) is a consistent heuristic.

3.2 \(A^*\) Algorithm with PR-Heuristic

An \(A^*\) search based algorithm using the PR-heuristic is formulated as the \(A^*\) rough set optimal reduct (\(A^*RSOR\)) algorithm, presented in Algorithm 1. For a set of attributes \( \mathbb {B} \subseteq \mathbb {C}\), the path cost \(g( \mathbb {B})\) represents the cardinality of the granular space induced by \( \mathbb {B}\), i.e., \(g( \mathbb {B}) = |\mathbb {U}/ \mathbb {B}|\). Hence, the total cost \(f( \mathbb {B})\) becomes \(f( \mathbb {B}) = g( \mathbb {B}) + h_{PR}( \mathbb {B}) = |\mathbb {U}/ (\mathbb {B} \cup \mathbb {D})|\).

Algorithm 1. \(A^*RSOR\): optimal reduct computation using \(A^*\) search with the PR-heuristic

In the \(A^*RSOR\) algorithm, the openlist is a priority queue of frontier nodes in increasing order of ‘f’ values. Initially, the openlist is populated with the nodes corresponding to the individual attributes of \(\mathbb {C}\). In each iteration, the node CN with the least ‘f’ value is removed from the openlist and inserted into the closelist. If \(h_{PR}(CN) = 0\), the optimal reduct has been identified, and the attribute set of CN is returned as the optimal reduct. Otherwise, a child node CS is generated for each attribute not already included in CN. CS is inserted into the openlist if and only if it is present in neither the openlist nor the closelist. In case CS corresponds to an attribute set that is a superset of a candidate reduct node in the openlist (a node with \(h_{PR} = 0\)), it is not included in the openlist, as it would result in a superset of a reduct. This verification is represented as the function SuperReductCheck(CS, openlist).
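
The sketch below reconstructs this loop in Python from the description above (it is not the authors' Algorithm 1; the `seen` set and `candidates` list are our simplified stand-ins for the openlist/closelist membership tests and SuperReductCheck). It reuses `partition` and `h_pr` from the earlier sketches:

```python
import heapq
from itertools import count

def a_star_rsor(universe, C, D, table):
    """A*RSOR sketch: f(B) = g(B) + h_PR(B) = |U/(B u D)|; stop when h_PR = 0."""
    tie = count()                                # tie-breaker for equal f values
    f = lambda B: len(partition(universe, sorted(B) + D, table))
    start = [frozenset([a]) for a in C]          # one node per individual attribute
    openlist = [(f(B), next(tie), B) for B in start]
    heapq.heapify(openlist)
    seen = set(start)                            # nodes already generated
    candidates = []                              # candidate reducts (h_PR = 0)
    while openlist:
        _, _, B = heapq.heappop(openlist)        # least-f node CN
        if h_pr(universe, sorted(B), D, table) == 0:
            return sorted(B)                     # optimal reduct
        for a in set(C) - B:
            CS = B | {a}
            if CS in seen or any(R < CS for R in candidates):
                continue                         # SuperReductCheck(CS, openlist)
            if h_pr(universe, sorted(CS), D, table) == 0:
                candidates.append(CS)
            seen.add(CS)
            heapq.heappush(openlist, (f(CS), next(tie), CS))
    return None

print(a_star_rsor(U, ['a1', 'a2'], ['d'], table))  # ['a1'] on the toy table
```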

4 Empirical Results and Observations

The proposed algorithm \(A^*RSOR\) is implemented in the Matlab-2017a environment, and comparative experiments are conducted on a system with the following configuration: 3.40 GHz \(\times \) 4 Intel(R) Core i5-7500 processor, 8 GB DDR4 RAM, Ubuntu-16.04.1 LTS 64-bit operating system. Eight categorical benchmark datasets from the UCI machine learning repository [3] are used, as shown in Table 1. The Wine, Sahart and Zoo datasets are discretized using the “mdlp” discretization method [12] to transform them into categorical datasets. Optimal reduct computation by \(A^*RSOR\) is verified through a ranking experiment and through 10-fold cross-validation analysis of induced classifier performance. Comparative studies are conducted with the simulated annealing based near-optimal reduct computation algorithm SimRSAR [11] and the hill-climbing based greedy reduct computation algorithm \(IQRA\_IG\) [15].

Table 1. The description of experimental datasets

4.1 Ranking Experiments

The correctness of the proposed method and its implementation is verified in a ranking experiment by checking whether the computed reduct satisfies the optimality criterion. Towards this objective, for each dataset, all reducts are computed using the rough set exploration system (RSES) [16] and are ranked in increasing order of the optimality criterion (\(f(\mathbb {R}) = | \mathbb {U}/IND(\mathbb {R})|\)). The ranks obtained by the reducts from the \(A^*RSOR\), SimRSAR, and \(IQRA\_IG\) approaches are reported along with the obtained ‘f’ values in Table 2. Table 2 also reports the reduct length obtained by each algorithm. The total number of reducts obtained by the RSES tool is reported under the column |RSESReducts|.

Table 2. Results of ranking experiment with total granular space and the corresponding length of the optimal reduct.

Analysis of Results: The results demonstrate that the implemented \(A^*RSOR\) algorithm achieves the optimal reduct, obtaining rank one in all the datasets. The compared approaches SimRSAR and \(IQRA\_IG\) have varying rankings across the datasets. A significant difference in ranking order is observed in the Austra, Diab, Heart, Wine, and Zoo datasets, marked with a \(*\) symbol. Of these, only the Austra and Wine datasets show a significant variation in \(f(\mathbb {R})\) values.

4.2 Ten-Fold Experiments

In this section, we conduct 10-fold cross-validation experiments on the given benchmark datasets to assess the relevance of the reducts in inducing different classifiers (Naive Bayes (NB), CART and Random Forest (RF)). In all classifiers, default options are used, and all attributes are treated as categorical. Table 3 depicts the mean and standard deviation of reduct length and computational time (in seconds) resulting from the ten-fold experiments. Table 4 presents the mean and standard deviation \((\mu \pm \sigma )\) of classification accuracies obtained in the ten-fold experiments. In Table 4, the column ALLAttributes refers to results from the un-reduced training data.
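
For reproducibility, a hedged scikit-learn analogue of this protocol is sketched below (the paper's experiments are in Matlab; here the classifier choices map to CategoricalNB, DecisionTreeClassifier as CART, and RandomForestClassifier, with X assumed to be integer-encoded categorical features restricted to the reduct attributes):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import CategoricalNB
from sklearn.tree import DecisionTreeClassifier          # CART
from sklearn.ensemble import RandomForestClassifier

def ten_fold(X, y):
    """Report mean and std of 10-fold accuracies for NB, CART and RF."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, clf in [('NB', CategoricalNB()),
                      ('CART', DecisionTreeClassifier(random_state=0)),
                      ('RF', RandomForestClassifier(random_state=0))]:
        acc = cross_val_score(clf, X, y, cv=cv)
        print(f'{name}: {acc.mean():.3f} +/- {acc.std():.3f}')
```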

Table 3. Reduct length and computational time in ten-fold cross-validation experiment

A Student's t-test is conducted on the classification accuracies to compare the proposed algorithm \(A^*RSOR\) with \(IQRA\_IG \), SimRSAR and ALLAttributes, and the t-test results are depicted in Table 4 along with the classification accuracies. The entries of Table 4 show four significance levels, indicated as (+/-)*, (+/-)**, (+/-)***, and #, according to the p-value of the t-test: statistically significant, indicated as * \((p\text{-}value \le 0.05)\); statistically highly significant, as ** \((p\text{-}value \le 0.01)\); statistically extremely significant, as *** \((p\text{-}value \le 0.001)\); and no statistical difference, indicated as #. The prefix ‘+’ denotes that \(A^*RSOR\) performed better than the compared algorithm, and ‘-’ that it underperformed.
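
A small sketch of how such markers can be derived from per-fold accuracies (the paper does not state whether a paired or two-sample test is used; a two-sample test is assumed here for illustration):

```python
from scipy import stats

def significance_marker(acc_a, acc_b):
    """Return the Table 4 style marker comparing A*RSOR (acc_a) against a baseline."""
    _, p = stats.ttest_ind(acc_a, acc_b)      # assumed two-sample t-test
    sign = '+' if sum(acc_a) > sum(acc_b) else '-'
    for threshold, stars in [(0.001, '***'), (0.01, '**'), (0.05, '*')]:
        if p <= threshold:                    # most stringent level checked first
            return sign + stars
    return '#'
```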

Analysis of Results: In Table 3, the \(IQRA\_IG\) algorithm obtains a much lower computational time than the \(A^*RSOR\) and SimRSAR algorithms. This correlates with the theoretically lower complexity of a greedy hill-climbing search in comparison to the multiple-subspace searches of \(A^*RSOR\) and SimRSAR. The practical time complexity of the \(A^*RSOR\) search algorithm depends on the depth at which the optimal reduct is found. For instance, in the Diabetes and Sahart datasets, the proposed \(A^*RSOR\) algorithm incurred only 3.6 and 3.2 s on average, whereas SimRSAR incurred 250 and 36.5 s. In contrast, SimRSAR obtained significantly lower computational times in the Austra, Heart, Wine, and Zoo datasets. It is to be noted that, of these three algorithms, the worst-case time complexity of \(A^*RSOR\) is exponential, while the other two algorithms have polynomial time complexity. Both SimRSAR and \(IQRA\_IG\) are formulated towards obtaining a shorter length reduct. Hence, the reduct lengths of \(A^*RSOR\) are slightly higher than or equal to those of the other algorithms.

Table 4. Ten-fold cross-validation results for classification

With the NB, CART and RF classifiers, \(A^*RSOR\) obtained classification accuracies statistically similar to the compared algorithms in the Breastcancer, Diab, Lymphography and Zoo datasets. In the Sahart dataset, the \(A^*RSOR\) approach performed statistically significantly better than SimRSAR with the NB classifier. In the Austra and Wine datasets, the \(A^*RSOR\) algorithm achieved statistically extremely significantly better accuracies than the compared algorithms with all three classifiers. Also, in the Wine dataset, the \(A^*RSOR\) algorithm performed extremely significantly better than ALLAttributes with both the NB and CART classifiers. Likewise, in the Austra dataset, the \(A^*RSOR\) algorithm achieved statistically significantly better accuracy than ALLAttributes with the NB classifier. The results also demonstrate an additional advantage for RF in achieving better accuracies with ALLAttributes, as RF is an ensemble classifier with bagging over the attribute space. It is further noted that \(A^*RSOR\) achieves statistically similar results to ALLAttributes with RF on all datasets except the Wine dataset.

The datasets in which \(A^*RSOR\) obtained better accuracies, i.e., Austra and Wine, are also the datasets in which the \(IQRA\_IG\) and SimRSAR algorithms obtained higher-rank reducts along with significantly higher \(f(\mathbb {R})\) values. This establishes that the chosen optimality criterion of the coarsest granular space is relevant for obtaining reducts with greater potential for building better classification models.

5 Conclusion

Most existing rough set based reduct computation approaches aim at computing the shortest length reduct, either optimally or near-optimally. In this work, the need for an alternative optimality criterion for reduct computation is identified, and the \(A^*RSOR\) algorithm is developed for the computation of an optimal reduct using \(A^*\) search. The partition refinement heuristic is introduced and proved to be a consistent heuristic. Comparative experimental results validated the utility of the proposed optimality criterion. In the future, scalable variants of \(A^*RSOR\) will be developed to enhance applicability to large-scale decision systems.