
1 Introduction

In areas such as pattern recognition, data mining, machine learning, and statistical analysis, and in general in tasks involving data analysis or knowledge discovery from datasets, it is common to process collections of objects characterized by many features. In this situation, it is reasonable to assume that by retaining all features we will get better knowledge about the objects of study. In practice, however, this is usually not true, because many features may be either irrelevant or redundant. Indeed, it is well known that irrelevant and redundant features can have an adverse impact on learning algorithms, decreasing the performance of supervised classifiers and producing biased or even incorrect models [1]. Feature selection methods [2, 3] have been shown to be a useful tool for alleviating this problem; their aim is to identify and eliminate irrelevant and/or redundant features without significantly decreasing the prediction accuracy of a classifier built using only the selected features. Moreover, feature selection not only reduces the dimensionality of the data, facilitating their visualization and understanding, but also commonly leads to more compact models with better generalization ability [4].

In the literature on feature selection for supervised classification, most feature selection methods (selectors) have been designed for either numeric or non-numeric data. However, these methods cannot be directly applied to mixed datasets, where objects are simultaneously described by both numerical and non-numerical features. Mixed data [5] is very common and appears in many real-world problems; for example, in biomedical and health-care applications, socio-economic and business analysis, software cost estimation, and so on.

In practice, to use supervised feature selection methods developed exclusively for numerical or non-numerical data on mixed data problems, it is common to apply feature transformations. The process of transforming non-numerical features into numerical ones is called encoding; its main drawback is that the categories of a non-numerical feature must be coded as numerical values. This codification introduces an artificial order between feature values which does not necessarily reflect their original nature. Moreover, some mathematical operations such as addition and multiplication by a scalar do not make sense over the transformed data [6, 7]. Conversely, when the feature selection methods require non-numerical data as input, an a priori data discretization for converting numerical features into non-numerical ones is needed. However, this discretization brings with it an inherent loss of information due to the binning process [7], and consequently, the results of feature selection become highly dependent on the applied discretization method. Another solution that has been considered in some supervised feature selection methods [7, 8] is to analyze the numerical and non-numerical features separately and then merge the two sets of results. However, as noted in [5], with this solution the associations that exist between numerical and non-numerical features are ignored.

Building on the results reported in [9], where the spectral gap score combined with a kernel function was successfully used for feature selection on unsupervised mixed datasets, in this paper we propose an extension of that method and show how using a supervised kernel and a simple leave-one-out search strategy yields a filter feature selection method suitable for selecting relevant features in supervised mixed datasets. The proposed method does not transform the original feature space or process the data separately, and it can produce either a feature ranking or a feature subset composed of only relevant features. Experimental results on real-world datasets show that the proposed method achieves outstanding performance compared to state-of-the-art methods.

The rest of this paper is organized as follows. In Sect. 2, we provide a brief review of related work, and in Sect. 3 we describe the proposed method. Experiments are presented and discussed in Sect. 4. Finally, Sect. 5 concludes this paper and outlines further research on this topic.

2 Related Work

In the literature, many supervised feature selection methods have been proposed [10, 11], and according to the feature selection approach they can be categorized as filter, wrapper, or hybrid. Among the more classical and relevant filter feature selection methods for supervised classification we can mention Information Gain (IG) [12], Fisher score [13], Gini index [14], Relieff [15], and CFS [16]. IG, Fisher score, Gini index, and Relieff are univariate filter methods (also called ranking-based methods) that evaluate features according to some quality criterion quantifying the relevance of each feature individually; meanwhile, CFS is a multivariate feature selection method that quantifies the relevance of features jointly and therefore provides a feature subset as a result. IG was designed to work on non-numerical features, while Fisher score and Gini index can only process numerical features. On the other hand, both CFS and Relieff, according to their respective authors, can process mixed data: CFS processes numerical and non-numerical features as non-numerical (numerical features are discretized), while Relieff deals with mixed data by using the Hamming distance for non-numerical features and the Euclidean distance for numerical ones.

On the other hand, some supervised feature selection methods developed exclusively for mixed data have also been introduced, and they can be classified into four main approaches: Statistic/Probabilistic [8, 17, 18], Information Theory [7, 19, 20, 21, 22], Fuzzy/Rough set theory [23, 24, 25, 26], and kernel-based [27, 28] methods. In the first approach, the basic idea is to evaluate the relevance of features using measures such as the joint error probability, or using different correlation measures (one for each type of feature) to quantify the degree of association among features. Information-theory-based methods evaluate features using measures such as mutual information or entropy. Fuzzy/Rough set theory methods evaluate features based on fuzzy relations or equivalence classes (also called granules). Finally, kernel-based methods perform feature selection using three components: a dedicated kernel that can handle mixed data, a feature search strategy, and a classifier (usually an SVM) which allows quantifying the importance of each feature through the objective function using the dedicated kernel.

3 Proposed Method

The proposed method is inspired by the Unsupervised Spectral Feature Selection Method for mixed data (USFSM) introduced in [9], which is based on Spectral Feature Selection [1]. USFSM quantifies feature consistency by analyzing the changes in the spectrum distribution (spectral gaps) of the Symmetrical Normalized Laplacian matrix when each feature is excluded separately.

Formally, let \(X_{F}=\{{{\varvec{x}}}_{1},{{\varvec{x}}}_{2},\ldots ,{{\varvec{x}}}_{m}\}\) be a collection of m objects described by a set of n numerical or non-numerical features \(F= \{\varvec{f}_{1},\varvec{f}_{2},\ldots , \varvec{f}_{n}\}\), let \(T=\{t_{1},t_{2},\ldots ,t_{m}\}\) be a target concept indicating the objects' class labels, and let S be an \(m~{\times }~m\) similarity (kernel) matrix containing the similarities \(s_{ij}\ge 0\) between all pairs of objects \({{\varvec{x}}}_{i}\), \({{\varvec{x}}}_{j}\) \(\in \) \(X_{F}\). Structural information from \(X_{F}\) can be obtained from the eigensystem of the Symmetrical Normalized Laplacian matrix \(\mathcal {L}({S})\) [29] (Laplacian graph) derived from S. Specifically, the first \(c+1\) eigenvalues of \(\mathcal {L}({S})\) (arranged in ascending order) and the corresponding eigenvectors contain information for separating the data \(X_{F}\) into c classes [1]. Nevertheless, to quantify the feature consistency using the spectral gap score [9] in a supervised context, we need to specify a good similarity function that uses both the information contained in the features in F and the information provided by the target concept T. To do this, we propose to build the objects' similarity matrix S using the following supervised kernel function:

$$\begin{aligned} s_{ij}= \dfrac{K({{\varvec{x}}}_{i},{{\varvec{x}}}_{j}) + D(t_{i},t_{j})}{2} \end{aligned}$$
(1)

where \(D(t_{i},t_{j})=1\) if \(t_{i}=t_{j}\); otherwise \(D(t_{i},t_{j})=0\), and \(K({{\varvec{x}}}_{i},{{\varvec{x}}}_{j})\) is the Clinical kernel defined as in [9]. With this kernel function, we model the data structure in a supervised way while, at the same time, taking into account the contribution of each feature to the objects' similarity.
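To make the construction of S concrete, the following minimal sketch (Python with NumPy, not part of the original paper) shows one possible implementation of the supervised kernel in (1). It assumes the Clinical kernel averages per-feature similarities, using \(1-|a-b|/\mathrm{range}\) for numerical features and an equality indicator for non-numerical ones; the names clinical_kernel, supervised_similarity, and numeric_mask are illustrative.

```python
import numpy as np

def clinical_kernel(X, numeric_mask):
    """Average of per-feature similarities for a mixed dataset.

    X is an (m x n) object-dtype array holding numerical and non-numerical
    values; numeric_mask[f] is True when feature f is numerical.  Numerical
    features contribute 1 - |a - b| / range, non-numerical features an
    equality indicator (assumed form of the Clinical kernel in [9]).
    """
    m, n = X.shape
    K = np.zeros((m, m))
    for f in range(n):
        col = X[:, f]
        if numeric_mask[f]:
            col = col.astype(float)
            rng = col.max() - col.min()
            if rng == 0:                      # constant feature: all objects alike
                sim = np.ones((m, m))
            else:
                sim = 1.0 - np.abs(col[:, None] - col[None, :]) / rng
        else:
            sim = (col[:, None] == col[None, :]).astype(float)
        K += sim
    return K / n

def supervised_similarity(X, y, numeric_mask):
    """Eq. (1): average of the Clinical kernel and the class-agreement term D."""
    K = clinical_kernel(X, numeric_mask)
    y = np.asarray(y)
    D = (y[:, None] == y[None, :]).astype(float)
    return (K + D) / 2.0
```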

To quantify the consistency of each feature \(\varvec{f}_{i} \in F\), we measure the change produced in the spectral gap score when \(\varvec{f}_{i}\) is eliminated from the dataset \(X_{F}\), following a leave-one-out search strategy, i.e.:

$$\begin{aligned} \varphi (\varvec{f}_{i}) = \gamma (X_{F},c)-\gamma ({X_{F_{i}}},c) \end{aligned}$$
(2)

where \(X_{F_{i}}\) denotes the dataset described by the set of features \(F_{i}\), which contains all features except \(\varvec{f}_{i}\), and \(\gamma (\cdot ,\cdot )\) is the spectral gap score defined as:

$$\begin{aligned} \gamma (X,c)=\sum _{i=2}^{c+1}\sum _{j=i+1}^{c+2}\left| \frac{\lambda _{i}-\lambda _{j}}{\tau }\right| \end{aligned}$$
(3)

where \(\lambda _{i}\), \(i=2,\ldots ,c+2\), are the first \(c+1\) nontrivial eigenvalues of the spectrum of \(\mathcal {L}({S})\), S is the similarity matrix of X, c is the number of classes in the dataset, and \(\tau =\sum _{i=2}^{c+2}\lambda _{i}\) is a normalization term. According to this score, the larger the gaps among the first \(c+1\) nontrivial eigenvalues of \(\mathcal {L}(S)\), the better the separation between the classes.
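As a complement to (3), the following sketch (again Python/NumPy, not from the original paper) computes the spectral gap score from a similarity matrix. It assumes the trivial (smallest) eigenvalue of the Laplacian is simply skipped, which is our reading of the definition above.

```python
def spectral_gap_score(S, c):
    """Eq. (3): sum of normalized pairwise gaps among the first c+1
    nontrivial eigenvalues of the symmetric normalized Laplacian of S."""
    d = S.sum(axis=1)                                  # node degrees
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    lam = np.sort(np.linalg.eigvalsh(L))               # ascending eigenvalues
    lam = lam[1:c + 2]                                 # lambda_2, ..., lambda_{c+2}
    tau = lam.sum()                                    # normalization term
    score = 0.0
    for i in range(len(lam)):
        for j in range(i + 1, len(lam)):
            score += abs((lam[i] - lam[j]) / tau)
    return score
```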

Algorithm 1. SSFSM: Supervised Spectral Feature Selection method for Mixed data (pseudocode).

Our proposed method begins with the construction of the similarity matrix S from the original dataset \(X_{F}\), along with the class label information contained in T, using (1). Then, we construct the Symmetrical Normalized Laplacian matrix \(\mathcal {L}(S)\) and obtain its spectrum. Afterwards, using (3), the spectral gap score \(\gamma (\cdot ,\cdot )\) for \(X_{F}\) is computed. This procedure is repeated n times using a leave-one-out feature elimination strategy over F to quantify the relevance of each feature \(\varvec{f}_{i}\) through (2). Finally, the features in F are ranked from the most to the least consistent according to the feature weights w[i], \(i=1,2,\ldots , n\), corresponding to each feature \(\varvec{f}_{i} \in F\), and a feature subset \(F_{Sub}\) consisting of those features with \(w[i]>0\) (relevant features) is obtained. The pseudocode of our filter method, named Supervised Spectral Feature Selection method for Mixed data (SSFSM), is described in Algorithm 1.
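The whole procedure can be summarized in a few lines. The sketch below (Python/NumPy, not the authors' implementation) reuses the supervised_similarity and spectral_gap_score helpers sketched above and follows the textual description of Algorithm 1, which may differ in implementation details.

```python
def ssfsm(X, y, numeric_mask, c):
    """Sketch of SSFSM: weight each feature by the change in the spectral
    gap score when it is left out (Eq. (2)), then return the feature
    ranking, the subset F_Sub of features with positive weight, and the
    weights themselves."""
    mask = np.asarray(numeric_mask)
    n = X.shape[1]
    base = spectral_gap_score(supervised_similarity(X, y, mask), c)
    w = np.zeros(n)
    for i in range(n):
        keep = [j for j in range(n) if j != i]              # leave f_i out
        S_i = supervised_similarity(X[:, keep], y, mask[keep])
        w[i] = base - spectral_gap_score(S_i, c)            # phi(f_i)
    ranking = np.argsort(-w)              # most to least consistent feature
    subset = np.flatnonzero(w > 0)        # relevant features (F_Sub)
    return ranking, subset, w
```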

4 Experimental Results

In this section, we first describe the experimental setup in Sect. 4.1; then, in Sect. 4.2, we present the results obtained from the evaluation of the supervised filter methods on the selected datasets.

4.1 Experimental Setup

To evaluate the effectiveness of our method, experiments were carried out on real-world mixed datasets taken from the UCI Machine Learning repository [30]; detailed information about these datasets is summarized in Table 1. We compared our proposal against two state-of-the-art feature subset selectors that can handle mixed data, namely, CFS [16] and the Fuzzy Rough Set theory-based method introduced by Zhang et al. in [24]. Moreover, in order to contrast our method against more classical and relevant ranking-based filter supervised feature selection methods of the state-of-the-art, we also compared it with IG [12], Relieff [15], Fisher score [13], and Gini index [14].

Following the standard way of assessing supervised feature selection methods, for the comparison among all selectors we evaluate the quality of the feature selection results using the classification accuracy (ACC) of the well-known and broadly used SVM [31] classifier. For the evaluation against the feature subset selectors (CFS and the Zhang et al. method), we applied stratified ten-fold cross-validation, and the final classification performance is reported as the average accuracy over the ten folds. For each fold, each feature selection method is first applied on the training set to obtain a feature subset. Then, after training the classifier using the selected features, the respective test set is used for assessing the classifier through its accuracy. Meanwhile, for the comparison against the ranking-based methods (IG, Relieff, Fisher score, and Gini index), we compute the aggregated accuracy [32]. The aggregated accuracy is obtained by averaging the average accuracy achieved by the classifiers using the top \(1, 2, \ldots , n-1, n\) features according to the ranking produced by each selector; in this way, we can evaluate how good the feature ranking obtained by each selector is. In order to measure the statistical significance of the results, the Wilcoxon test [33] was employed over the results obtained on each dataset, and we have marked with the "+" symbol those datasets where there is a statistically significant difference between the results of the corresponding method and the results of our method.
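For clarity, the following sketch (Python with scikit-learn, not part of the original experimental code) shows one way to compute the aggregated accuracy of a ranking as described above. It assumes the mixed features have already been numerically encoded for the SVM classifier, and the helper name aggregated_accuracy is illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def aggregated_accuracy(X_enc, y, ranking, cv=10):
    """Average, over k = 1..n, of the mean cross-validated SVM accuracy
    obtained with the top-k features of a given feature ranking."""
    n = X_enc.shape[1]
    accs = []
    for k in range(1, n + 1):
        top_k = ranking[:k]                       # top-k ranked features
        accs.append(cross_val_score(SVC(), X_enc[:, top_k], y, cv=cv).mean())
    return float(np.mean(accs))
```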

The implementation of the ranking-based methods was taken from the Matlab Feature Selection package available in the ASU Feature Selection repository [34]. Meanwhile, for the CFS and Zhang et al. methods, we used the authors' implementations with the parameters recommended by their respective authors. For SSFSM, the Apache Commons Math library was used for matrix operations and eigensystem computation. All experiments were run in Matlab® R2018a with Java 9.04, using a computer with an Intel Core i7-2600 3.40 GHz \(\times \) 8 processor with 32 GB DDR4 RAM, running the 64-bit Ubuntu 16.04 LTS (GNU/Linux 4.13.0-38 generic) operating system.

4.2 Experimental Results and Comparisons

Tables 2 and 3 show, respectively, the classification accuracy reached using the evaluated feature subset methods and the aggregated accuracy of the ranking-based methods, with SVM, on the datasets of Table 1. In these tables, the last row shows the overall average results obtained over all tested datasets. Additionally, the last column of Table 2 includes the classification accuracy using the whole set of features of each dataset.

Table 1. Details of the real-world mixed datasets used in our experiments.
Table 2. Classification accuracy of SVM on the feature subsets produced by SSFSM, CFS, and Zhang et al. on the mixed datasets of Table 1.
Table 3. Aggregated accuracy of SVM on the feature ranking produced by SSFSM, IG, Gini index, Relieff and Fisher score on the datasets of Table 1.
Fig. 1. Classification accuracy of SVM on the feature rankings produced by SSFSM, IG, Gini index, Relieff and Fisher score on six datasets of Table 1. The x-axis is the number of selected features; the y-axis is the classification accuracy.

As we can see in Table 2, regarding the methods for mixed data that select subsets of features, the best average results were obtained by our method, which outperformed the CFS and Zhang et al. selectors and was statistically significantly better on the Automobile, Cylinder-bands, and Flags datasets. Moreover, as we can observe in this table, our method was the only one that achieved an average result better than the one obtained using the whole set of features.

On the other hand, regarding the ranking-based methods, Table 3 shows that again our method got the best results on average (average aggregated accuracy), and it was significantly better than Gini index and Fisher score on the Acute-inflammation, Horse-colic, Credit-approval, and Automobile datasets. This indicates that the ranking produced by our method is, on average, better than the rankings produced by the other evaluated ranking-based selectors, as can be corroborated in most of the feature ranking plots shown in Fig. 1.

5 Conclusions and Future Work

To solve the feature selection problem in mixed data, in this paper we introduced a new supervised filter feature selection method. Our method is based on Spectral Feature Selection (the spectral gap score) combined with a new supervised kernel and a simple leave-one-out search strategy. This results in a ranking-based/feature subset selector that can effectively solve feature selection problems in supervised mixed datasets. To show the effectiveness of our method, we tested it on several public mixed datasets. The results have shown that our method is better than CFS [16] and the method introduced by Zhang et al. in [24], two filter feature subset selectors that can handle mixed data. Furthermore, our method achieves better results than popular ranking-based filter feature selection methods of the state-of-the-art, such as IG, Relieff, Fisher score, and Gini index.

A comparison using other classifiers and selectors is necessary, and it is part of the future work of this research. Another interesting research direction is to perform a broader study of wrapper strategies that, combined with our method, would allow improving its performance in terms of accuracy.