
1 Introduction

Classification is one of the most important tasks in machine learning. Numerous classification approaches, such as k Nearest Neighbor (kNN) [9], Decision Tree (DT), Naïve Bayes (NB), and Support Vector Machine (SVM), have been well developed and applied in many applications. However, most classifiers struggle with imbalanced class distributions, and learning from imbalanced datasets is regarded as one of the top ten challenging problems in data mining research [20].

To address the class imbalance problem, various strategies have been proposed, which can be grouped into two broad categories, namely data-oriented and algorithm-oriented approaches. Data-oriented approaches use sampling techniques: to balance the dataset, they either oversample the minority class or select instances (under-sample) from the majority class. One such sampling technique, the Synthetic Minority Over-sampling TEchnique (SMOTE), increases the number of minority class instances by creating artificial, non-repeated samples [4].

In contrast, algorithm-oriented approaches are modifications of traditional algorithms such as DT and kNN. The DTs modified for imbalanced classification include Hellinger Distance DT (HDDT) [5], Class Confidence Proportion DT (CCPDT) [13] and Weighted Inter-node Hellinger Distance DT (iHDwDT) [1]. These DTs use different splitting criteria when selecting a feature at a split point.

kNN is one of the simplest classifiers. Despite its simplicity, kNN is considered one of the most influential data mining algorithms [19]. Traditional kNN finds the k closest instances in the training data to a query instance and treats all neighbors equally. Dudani proposed a distance-based weighted kNN which assigns larger weights to closer neighbors [8]. Another variant, Generalized Mean Distance based kNN (GMDKNN) [10], introduces the multi-generalized mean distance and the nested generalized mean distance. All these variants of kNN are sensitive to the majority instances and thus perform poorly on imbalanced datasets.

Considering this imbalance problem, several researchers have extended kNN for imbalanced datasets [7, 11, 12]. In Exemplar-based kNN (kENN) [11], Li and Zhang expand the decision boundary for the minority class by identifying exemplar minority instances. A weighting algorithm, Class Confidence Weighted kNN (CCWKNN), has been presented in [12], where the probability of feature values given the class labels is used as the weight. Dubey and Pudi proposed a weighted kNN (WKNN) [7] which considers the class distribution in a wider region around a query instance; the class weight for each training instance is estimated by taking the local class distributions into account.

The purpose of these existing studies is to improve the overall performance on imbalanced data. However, these methods overlook the problem of uncertainty, which is prevalent in almost all datasets [18]. The reason behind this uncertainty is that complete statistical knowledge of the conditional density function of each class is hardly ever available [6]. To address this problem, kNN has been extended with the Dempster-Shafer Theory of evidence (DST) to better model uncertain data, resulting in the Evidential kNN (EKNN) [6]. In EKNN, each neighbor assigns basic belief to classes based on a distance measure. Nevertheless, this approach again does not take the class imbalance problem into consideration.

To address the aforementioned problems, we propose a Proximity weighted Evidential kNN (PEkNN) classifier and make the following contributions. First, we propose a confidence (posterior) assignment procedure for each neighbor of a query instance. Second, we propose to use the proximity of a neighbor as a weight to discount its confidence; we show that this weighted confidence increases the likelihood of classifying a minority class instance correctly. Third, the DST framework is used to combine the evidence from different neighbors.

2 Dempster-Shafer Theory of Evidence

Dempster-Shafer theory of evidence is a generalized form of Bayesian theory. It assigns degrees of belief to all possible subsets of the hypothesis set. Let \(C = \{C_1, \dots , C_M\}\) be a finite set of mutually exclusive and exhaustive hypotheses. The belief in a hypothesis assigned based on a piece of evidence is a number in [0, 1]. A Basic Belief Assignment (BBA) is a function \(m : 2^C \rightarrow [0, 1]\) which satisfies the following properties:

$$\begin{aligned} m(\emptyset ) = 0 \quad \text {and}\quad \sum _{A \subseteq C} m(A) = 1 \end{aligned}$$
(1)

where m(A) is a degree of belief (referred to as mass) which reflects how strongly A is supported by the piece of evidence, and m(C) represents the degree of ignorance.

Several pieces of evidence characterized by their BBAs can be fused using Dempster’s rule of combination [16]. For two BBAs \(m_1({.})\) and \(m_2({.})\) which are not totally conflicting, the combination rule can be expressed using Eq. (2).

$$\begin{aligned} m(A) = \frac{\sum _{B \cap C = A}m_1(B)m_2(C)}{1 - \sum _{B \cap C = \emptyset }m_1(B) m_2(C)} \quad A \ne \emptyset \end{aligned}$$
(2)

where \(A, B, C \in 2^C\) and \(\sum _{B \cap C = \emptyset }m_1(B) m_2(C) < 1\).

For decision making, the Belief, Plausibility and betting Probability (\(P_{bet}\)) functions are usually used. For a singleton class A, \(P_{bet}(A)\) is given in Eq. (3), where \({\mid } B {\mid }\) represents the cardinality of the element B.

$$\begin{aligned} P_{bet}(A) = \sum _{A \subseteq B} \frac{{\mid } A \cap B {\mid }}{{\mid } B {\mid }} \times m(B) \end{aligned}$$
(3)
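For concreteness, the following minimal Python sketch implements Dempster's rule (Eq. (2)) and the betting probability (Eq. (3)); the dictionary-of-frozensets representation, the function names and the two example BBAs are our own illustrative choices, not part of the original paper.

```python
# BBAs are represented as dicts mapping frozensets of class labels to masses.

def dempster_combine(m1, m2):
    """Dempster's rule (Eq. 2); assumes the BBAs are not totally conflicting."""
    combined, conflict = {}, 0.0
    for A, mA in m1.items():
        for B, mB in m2.items():
            inter = A & B
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mA * mB
            else:
                conflict += mA * mB          # mass that falls on the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {A: v / (1.0 - conflict) for A, v in combined.items()}

def pignistic(m):
    """Betting probability P_bet for every singleton class (Eq. 3)."""
    classes = set().union(*m.keys())
    return {c: sum(v / len(A) for A, v in m.items() if c in A) for c in classes}

if __name__ == "__main__":
    AB = frozenset({"A", "B"})
    m1 = {frozenset({"A"}): 0.6, AB: 0.4}    # evidence favouring class A
    m2 = {frozenset({"B"}): 0.5, AB: 0.5}    # evidence favouring class B
    print(pignistic(dempster_combine(m1, m2)))   # approx. A: 0.571, B: 0.429
```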

3 Proximity Weighted Evidential kNN (PEkNN)

kNN faces difficulty on imbalanced datasets because it treats all neighbors of the query instance equally, and most of those neighbors will belong to the majority class. To deal with this issue, the proposed algorithm gives more importance to neighbors with a higher proximity weighted confidence. Here, the confidence of an instance is the conditional probability of its class label given the instance, estimated from the training data. Algorithms such as NB also use conditional probabilities when classifying a query instance; however, the performance of NB degrades due to poor estimation of the conditional density of the query instance for each class. In contrast, PEkNN computes the conditional probability of the neighboring instances rather than of the query instance. Furthermore, uncertainty is prevalent in almost all datasets [18], and it is even more significant for imbalanced datasets where little information is available for the minority class. To deal with this issue, PEkNN uses DST to combine the evidence provided by each neighbor.

For a new query instance (\(x_t\)), PEkNN first finds its k closest neighbors according to some distance measure (e.g. Euclidean distance). Let \(S(x_t, k)\) be the set of k closest neighbors of \(x_t\); each member of \(S(x_t, k)\) is considered a piece of evidence that assigns mass values to the subsets of C, i.e. a BBA.

Now, consider \(x_i\) as the i-th neighbor of \(x_t\), belonging to class \(C_q\). As \(x_i\) is a piece of evidence for \(C_q\), some part of its belief is committed to \(C_q\). The rest of the belief cannot be distributed to any subset of C other than C itself. The BBA provided by \(x_i\) can be represented by Eqs. (4), (5) and (6), where \(0< \beta _0 < 1\).

$$\begin{aligned} m_i(\{C_q\}) = \beta = \beta _0 \times \varPsi (x_i, x_t) \end{aligned}$$
(4)
$$\begin{aligned} m_i(A) = 0 \quad \forall A \in 2^C {\setminus } \{C, \{C_q\}\} \end{aligned}$$
(5)
$$\begin{aligned} m_i(C) = 1 - \beta \end{aligned}$$
(6)

We now discuss two intuitions. First, a piece of evidence belonging to \(C_q\) should assign a larger belief to \(C_q\) when the evidence is more reliable, which we call its confidence: a piece of evidence with a higher posterior probability should receive more confidence than one lying in a lower posterior probability region. Second, a neighbor should assign more belief to a specific class when the neighbor and the query instance are more proximate. The function \(\varPsi ({.})\) defined in Eq. (7) satisfies both intuitions, where \(p_i\) is the confidence of \(x_i\), represented by the probability of the class label (\(y_i\)) given \(x_i\), and \(prx(x_i, x_t)\) represents the proximity between \(x_i\) and \(x_t\).

$$\begin{aligned} \varPsi (x_i, x_t) = prx(x_i, x_t) \times p_i \end{aligned}$$
(7)
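As a small illustrative sketch (ours, not the paper's), the BBA of a single neighbor defined by Eqs. (4)-(6), with the weight of Eq. (7), can be written as follows; \(\beta_0 = 0.95\) is the value used later in the experiments.

```python
def neighbor_bba(class_label, confidence, proximity, all_classes, beta0=0.95):
    """BBA of one neighbor (Eqs. 4-6): beta = beta0 * prx * p_i is committed to
    the neighbor's class, the remaining 1 - beta to the whole frame C."""
    beta = beta0 * proximity * confidence          # Eq. (7) plugged into Eq. (4)
    return {frozenset({class_label}): beta,
            frozenset(all_classes): 1.0 - beta}
```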

The procedure by which the PEkNN algorithm classifies a query instance is presented in Algorithm 1. The confidence assignment, proximity estimation and decision making steps are described in detail in Sects. 3.1, 3.2 and 3.3 respectively.

Algorithm 1. PEkNN classification of a query instance (pseudocode)
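As one possible reading of the steps Algorithm 1 describes in the text (neighbor search, per-neighbor BBAs via Eqs. (4)-(7), Dempster combination and betting probability), the following self-contained Python sketch shows how a query instance could be classified; all function and variable names are ours, and the per-instance confidences are assumed to be precomputed as in Sect. 3.1.

```python
import numpy as np
from scipy.spatial.distance import pdist

def classify_peknn(X_train, y_train, confidence, x_t, k=5, beta0=0.95):
    """Sketch of Algorithm 1 (PEkNN) for one query instance x_t.
    X_train    : (n, l) array of training instances
    y_train    : length-n array of class labels
    confidence : length-n array of p_i = P(y_i | x_i), precomputed (Sect. 3.1)
    """
    classes = np.unique(y_train)
    omega = frozenset(classes.tolist())            # the whole frame C

    # 1. k nearest neighbors by Euclidean distance
    dist = np.linalg.norm(X_train - x_t, axis=1)
    nn = np.argsort(dist)[:k]

    # 2. proximity of each neighbor (Eq. 11)
    d_max = pdist(X_train).max()
    prox = 1.0 - dist[nn] / d_max

    # 3. one BBA per neighbor (Eqs. 4-7), fused by Dempster's rule (Eq. 2)
    combined = {omega: 1.0}                        # vacuous BBA, neutral element
    for i, pr in zip(nn, prox):
        beta = beta0 * pr * confidence[i]
        combined = _dempster(combined,
                             {frozenset({y_train[i]}): beta, omega: 1.0 - beta})

    # 4. betting probability (Eq. 3) and decision (Eq. 12)
    p_bet = {c: sum(v / len(A) for A, v in combined.items() if c in A)
             for c in classes}
    return max(p_bet, key=p_bet.get)

def _dempster(m1, m2):
    """Dempster's rule for two BBAs (assumed not totally conflicting)."""
    out, conflict = {}, 0.0
    for A, a in m1.items():
        for B, b in m2.items():
            inter = A & B
            if inter:
                out[inter] = out.get(inter, 0.0) + a * b
            else:
                conflict += a * b
    return {A: v / (1.0 - conflict) for A, v in out.items()}
```

In a full implementation, quantities that depend only on the training set, such as `d_max` and the confidence array, would naturally be computed once at training time rather than per query.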

3.1 Estimation of Confidence

The confidence (\(p_i\)) of an instance \(x_i\) (\(x_i \in \mathbb {R}^l\)) belonging to class \(y_i\) is assigned using Bayes' theorem, as given in Eq. (8).

$$\begin{aligned} p_i = P(y_i \mid x_i) = \frac{P(y_i) \times P(x_i \mid y_i)}{\sum _{j=1}^M P(C_j) \times P(x_i \mid C_j)} \end{aligned}$$
(8)

where \(y_i \in \{C_1, C_2, \dots , C_M\}\), \(P(C_j)\) represents the prior of \(C_j\) in the training space and \(P(x_i|C_j)\) represents the likelihood in Bayes' theorem. Two approaches for estimating the class-wise Probability Density Function (PDF) are presented here: the first uses a Single Gaussian Model (SGM) and the second a Gaussian Mixture Model (GMM). When PEkNN uses the confidence derived from the SGM we call it sPEkNN, and mPEkNN when it uses the confidence derived from the GMM.

The single Gaussian model assumes that all features are independent and that the continuous values associated with each class follow a normal distribution. Under these assumptions, the likelihood function can be represented as Eq. (9).

$$\begin{aligned} P(x) = \prod _{j=1}^l P(x_j) = \prod _{j=1}^l f(x_j ; \mu _j, {\sigma _j}^2) = \prod _{j=1}^l \frac{1}{\sqrt{2\pi }\sigma _j} \times \exp (-\frac{(x_j - \mu _j)^2}{2\sigma _j^2}) \end{aligned}$$
(9)

where \(x_j\) denotes the j-th feature of x and f(.) represents the normally distributed PDF parameterized by mean (\(\mu \)) and variance (\(\sigma ^2\)).
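As an illustration only, the confidence of Eq. (8) with the SGM likelihood of Eq. (9) could be estimated along the following lines; the variable names are ours, and a small variance floor is added for numerical safety, which the paper does not discuss.

```python
import numpy as np

def sgm_confidence(X, y):
    """p_i = P(y_i | x_i) for every training instance, using class priors and
    the per-class, per-feature Gaussian likelihood of Eq. (9)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    params = {c: (X[y == c].mean(axis=0), X[y == c].var(axis=0) + 1e-9)
              for c in classes}                          # (mu_j, sigma_j^2) per class

    def likelihood(x, c):                                # Eq. (9)
        mu, var = params[c]
        return np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))

    conf = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        joint = {c: priors[c] * likelihood(X[i], c) for c in classes}
        conf[i] = joint[y[i]] / sum(joint.values())      # Eq. (8)
    return conf
```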

On the other hand, a GMM can be used to estimate the PDF of multivariate data. The class-wise PDF using an m-component mixture model is given in Eq. (10).

$$\begin{aligned} P(x) = \sum _{i=1}^{m} \alpha _i P(x \mid Z_i) \end{aligned}$$
(10)

The procedure for finding the complete set of parameters (\(Z_1, \dots , Z_m, \alpha _1, \dots , \alpha _m\)) specifying the mixture model is briefly described in [14].
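Assuming an off-the-shelf EM implementation is acceptable (the paper cites [14] only for the fitting procedure), a class-wise GMM density as in Eq. (10) could be plugged into Eq. (8) roughly as follows; the use of scikit-learn and the default of m = 3 components are our assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_confidence(X, y, m=3):
    """p_i = P(y_i | x_i) using one m-component GMM per class (Eq. 10)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    models = {}
    for c in classes:
        Xc = X[y == c]
        # never request more components than the class has samples
        models[c] = GaussianMixture(n_components=min(m, len(Xc)),
                                    covariance_type="full",
                                    random_state=0).fit(Xc)

    dens = {c: np.exp(models[c].score_samples(X)) for c in classes}   # P(x | C_j)
    joint = np.column_stack([priors[c] * dens[c] for c in classes])
    post = joint / joint.sum(axis=1, keepdims=True)                   # Eq. (8)
    idx = {c: j for j, c in enumerate(classes)}
    return np.array([post[i, idx[y[i]]] for i in range(len(y))])
```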

3.2 Estimation of Proximity

To capture the proximity between two instances, a distance measure can be used. The proximity between two training instances (\(x_i\) and \(x_j\)) is maximal when \(x_i\) and \(x_j\) are identical. On the other hand, it is lowest when they are the two farthest instances in the feature space. To measure this proximity, a normalization is applied as in Eq. (11) so that \(prx(x_i, x_j) \in [0,1]\). Here, \(d_{max}\) is the distance between the two farthest training instances.

$$\begin{aligned} prx(x_i, x_j) = 1 - \frac{d(x_i, x_j)}{d_{max}} \end{aligned}$$
(11)
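A direct transcription of Eq. (11) might look as follows; this is a sketch, and the helper name is ours.

```python
import numpy as np

def proximity(x_i, x_j, d_max):
    """Eq. (11): returns 1 for identical instances and 0 for the two farthest
    training instances; d_max is the largest pairwise training distance."""
    return 1.0 - np.linalg.norm(np.asarray(x_i) - np.asarray(x_j)) / d_max
```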

3.3 Decision Making

According to Eq. (7), \(\varPsi ({.})\) returns a larger value when a neighbor is more confident and closer to the query instance. For each of the k nearest neighbors, the BBAs are defined using Eqs. (4), (5) and (6). In order to classify \(x_t\), these BBAs are combined using DST. The betting probability (\(P_{bet}\)) for each singleton class is then calculated from the combined BBA using Eq. (3). Finally, the decision is taken from \(P_{bet}\) using Eq. (12).

$$\begin{aligned} \hat{y} = \mathop {\mathrm {arg\,max}}\limits _{c \in \{C_1, \dots , C_M\}} P_{bet}(c) \end{aligned}$$
(12)

where c is a singleton class, i.e. the cardinality of c is 1.

Property of \(\beta \): the value of \(\beta \) is bounded between 0 and 1.

Proof

From Eqs. (4), (7) and (8), it can be derived that

$$\begin{aligned} \beta = \beta _0 \times P(y_i \mid x_i) \times prx(x_i, x_t) \end{aligned}$$
(13)

Here, \(\beta _0\) is a user-given constant satisfying \(0< \beta _0 < 1\). The second term, \(P(y_i \mid x_i)\), is a posterior probability. The last term, \(prx(x_i, x_t)\), is at most 1 and at least 0. As can be seen from Eq. (13), \(\beta \) is a product of three terms, each bounded between 0 and 1, so \(\beta \) must also be bounded between 0 and 1.

Fig. 1. A synthetic imbalanced dataset

3.4 An Illustrative Example

Figure 1 shows the instances of a two-class imbalance problem where (\(+\))s and (\(\bullet \))s represent the minority (Class-A) and majority (Class-B) class instances respectively. The class boundaries are represented as dotted lines and three query instances (\(t_1, t_2, t_3\)) are marked with (\(\star \))s. Here, the first query instance \(t_1\) is situated in a majority class region bounded by minority instances. Both kNN and PEkNN can successfully classify \(t_1\), whereas traditional algorithms such as C4.5 and NB face difficulties in this situation.

The two other query instances \(t_2\) and \(t_3\) are associated with a region named \(A_1\) (see Fig. 1b). Here, for both \(t_2\) and \(t_3\), the four nearest neighbors are \(x_a\), \(x_b\), \(x_c\) and \(x_d\). Traditional kNN with \(k=4\) will classify both \(t_2\) and \(t_3\) as Class-B. PEkNN, on the other hand, considers the confidence of each neighbor. Here, \(x_d\) provides a higher confidence compared to the majority class instances (\(x_a\), \(x_b\) and \(x_c\)). Assume the confidences of \(x_a\), \(x_b\), \(x_c\) and \(x_d\) are 0.30, 0.40, 0.30 and 0.75 respectively, and their proximities with respect to \(t_2\) are 0.90, 0.95, 0.85 and 0.95 respectively. Then the BBAs assigned by PEkNN for these neighbors are \(m_a(\{B\}) = 0.2565\), \(m_a(\{A,B\}) = 0.7435\), \(m_b(\{B\}) = 0.3610\), \(m_b(\{A, B\}) = 0.6390\), \(m_c(\{B\}) = 0.2423\), \(m_c(\{A, B\}) = 0.7577\), \(m_d(\{A\}) = 0.6769\) and \(m_d(\{A, B\}) = 0.3231\), where \(\beta _0\) is set to 0.95. Combining these BBAs using DST, we get \(P_{bet}(A) = 0.5325\) and \(P_{bet}(B) = 0.4675\), which indicates that \(t_2\) will be correctly classified as Class-A.

On the other hand, for the query instance \(t_3\), the proximities of \(x_a\), \(x_b\), \(x_c\) and \(x_d\) are 0.85, 0.95, 0.95 and 0.85 respectively. We therefore get \(P_{bet}(A) = 0.4661\) and \(P_{bet}(B) = 0.5339\), indicating that \(t_3\) will be classified as Class-B. Thus \(t_3\) is correctly classified as a majority class instance even though the neighbors of \(t_2\) and \(t_3\) are the same.
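The numbers in this example can be checked with a short script that rebuilds the four BBAs from the stated confidences and proximities and combines them as in Sect. 2; the helper functions are restated here so the snippet runs on its own.

```python
def combine(m1, m2):                               # Dempster's rule (Eq. 2)
    out, conflict = {}, 0.0
    for A, a in m1.items():
        for B, b in m2.items():
            inter = A & B
            if inter:
                out[inter] = out.get(inter, 0.0) + a * b
            else:
                conflict += a * b
    return {A: v / (1.0 - conflict) for A, v in out.items()}

def p_bet(m):                                      # betting probability (Eq. 3)
    cls = set().union(*m.keys())
    return {c: sum(v / len(A) for A, v in m.items() if c in A) for c in cls}

AB, beta0 = frozenset("AB"), 0.95
conf = {"a": 0.30, "b": 0.40, "c": 0.30, "d": 0.75}      # confidences p_i
labels = {"a": "B", "b": "B", "c": "B", "d": "A"}        # class of each neighbor

for query, prox in [("t2", {"a": 0.90, "b": 0.95, "c": 0.85, "d": 0.95}),
                    ("t3", {"a": 0.85, "b": 0.95, "c": 0.95, "d": 0.85})]:
    m = {AB: 1.0}                                        # vacuous BBA
    for i in conf:
        beta = beta0 * prox[i] * conf[i]                 # Eqs. (4) and (7)
        m = combine(m, {frozenset(labels[i]): beta, AB: 1.0 - beta})
    print(query, {c: round(p, 4) for c, p in p_bet(m).items()})
    # expected: t2 -> A: 0.5325, B: 0.4675 ; t3 -> A: 0.4661, B: 0.5339
```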

Instead of DST, consider simpler techniques for combining the evidence, such as summing the class-wise proximity weighted confidences or taking their maximum. If we simply sum the class-wise proximity weighted confidences, both \(t_2\) and \(t_3\) get a higher value for Class-B as three of the four neighbors belong to that class. To avoid this bias, a query can instead be assigned to the class for which it gets the maximum proximity weighted confidence among the neighbors. But this method does not consider the local neighborhood priors, so it classifies both \(t_2\) and \(t_3\) as the minority class, which is not desired. PEkNN, on the other hand, uses the DST framework and successfully classifies both query instances.

4 Experiments and Results

The dataset description, implementation details and performance metrics, followed by the experimental results and discussion, are given in the following subsections.

4.1 Dataset Description

The characteristics of the 30 benchmark datasets, collected from the UCI machine learning repository [3] and the KEEL Imbalanced Datasets [2], are shown in Table 1. The Imbalance Ratio (IR) between the majority and minority class samples of the datasets used in these experiments is at least 1.5, and all feature values are numeric. A dataset is considered highly imbalanced when its IR is very high.

Table 1. Descriptions of Imbalanced Datasets. Idx, #Inst, #Cl and #Ftr represent index of a dataset, number of instances, classes and features respectively.
Table 2. Performance comparison among different algorithms on imbalanced datasets in terms of AUC (%). The best result for each dataset is in bold. SMT+kNN represents kNN followed by SMOTE sampling.

4.2 Implementation Details and Performance Metrics

PEkNN is benchmarked against other algorithms including traditional learning algorithms (kNN, C4.5, NB), an oversampling strategy (SMOTE), recent algorithms in the kNN family (EKNN, WKNN, CCWKNN, kENN, GMDKNN) and a few recent tree-based algorithms for imbalanced classification (CCPDT, HDDT, iHDwDT). For PEkNN, we use \(\beta _0=0.95\) in this experiment. For kENN, the confidence level is set to 0.1, and we set \(p=1\) for GMDKNN.

We conduct 10-fold stratified cross validation to evaluate the performance of the proposed method. The Receiver Operating Characteristic (ROC) curve [17] is widely used to evaluate imbalanced classification, and we use the Area Under the ROC Curve (AUC) to evaluate classifier performance. For comparison, all the classifiers are ranked on each dataset in terms of AUC, with rank 1 being the best. We also perform Friedman tests on the ranks. After the Friedman test rejects the null hypothesis that all the classifiers are equivalent, the Nemenyi post-hoc test [15] is used to determine which classifiers perform significantly better than the others.
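A rough sketch of this evaluation protocol for one classifier on one dataset, together with the Friedman test over an AUC table, is shown below; `clf` is assumed to be any scikit-learn-style classifier with `predict_proba` on a binary problem, and the Nemenyi post-hoc step is omitted because it needs an additional package.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cv_auc(clf, X, y, folds=10, seed=0):
    """Mean AUC over 10-fold stratified cross validation."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    aucs = []
    for tr, te in skf.split(X, y):
        clf.fit(X[tr], y[tr])
        score = clf.predict_proba(X[te])[:, 1]       # probability of the positive class
        aucs.append(roc_auc_score(y[te], score))
    return np.mean(aucs)

def friedman_over_datasets(auc_table):
    """auc_table: (n_datasets, n_classifiers) AUC matrix.
    Returns the average ranks (1 = best) and the Friedman test p-value."""
    ranks = np.vstack([rankdata(-row) for row in auc_table])   # higher AUC -> rank 1
    stat, p_value = friedmanchisquare(*auc_table.T)
    return ranks.mean(axis=0), p_value
```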

4.3 Result and Discussion

Table 2 presents the comparison of 14 classifiers over 30 imbalanced datasets. The average ranks of these classifiers indicate that kNN performs better than the other traditional classifiers on imbalanced datasets. Although kNN performs better than C4.5, the tree-based algorithms modified for imbalanced data perform better than kNN. Moreover, kNN on SMOTE-sampled datasets performs slightly better than kNN without sampling.

Comparing kNN with its variants, it can be observed that kENN and WKNN improve the overall performance of traditional kNN, although another variant, CCWKNN, fails to improve the performance in most cases over the experimented datasets. Moreover, the recent generalized mean based kNN approach, GMDKNN, performs worse than kNN on imbalanced datasets. In contrast, we can observe from Table 2 that EKNN performs better than all other classifiers except the proposed sPEkNN and mPEkNN, which indicates that handling uncertainty can improve the performance of kNN on imbalanced datasets. Finally, the average ranks show that mPEkNN is the best performing classifier on the imbalanced datasets.

In addition, Table 2 summarizes the counts of Win-Tie-Loss (W-T-L) of sPEkNN and mPEkNN against the other classifiers, which indicates that mPEkNN performs better than the other classifiers in most cases. From the Win-Tie-Loss counts, it is observed that mPEkNN wins on up to 29 datasets with no loss against the C4.5 and GMDKNN classifiers. In the worst case, mPEkNN performs better on 19 datasets and worse on 7 datasets compared to EKNN.

The results of the Friedman test (Fr. Test) with two base classifiers (sPEkNN and mPEkNN) are shown in the last two lines of Table 2. From the Friedman test with 14 classifiers and 30 datasets, we conclude that the fourteen classifiers are not all equivalent. After rejecting the hypothesis that all fourteen classifiers perform equivalently, the Nemenyi test is performed to determine which classifiers perform significantly better than the others. A tick (✓) sign under a classifier indicates that the Nemenyi test suggests the performance of that classifier is significantly different from the base classifier in a pairwise comparison at the \(95\%\) confidence level. The Nemenyi test states that sPEkNN performs significantly better than all compared classifiers except EKNN, CCPDT and HDDT. More importantly, the test suggests that mPEkNN is the best performing classifier among the twelve compared classifiers.

4.4 Effects of Neighborhood Size and Imbalance Ratio

Here, we show the effects of neighborhood size and Imbalance Ratio (IR) on the performance of the proposed method compared to other kNN variants. Due to page limitations, only one dataset (Ionosphere) is used to present the comparison in terms of AUC for values of k ranging from 1 to 20. It is clear from Fig. 2a that sPEkNN and mPEkNN consistently perform better than the other algorithms and are less sensitive to the value of k.

Fig. 2. Performance comparison among the algorithms belonging to the kNN family

To visualize the effect of IR, we use a synthetic two-class dataset in a two-dimensional space where the instances of each class are drawn from a mixture of two Gaussian distributions. The characteristics of the dataset are given below, where Class-A is the minority class and Class-B is the majority class.

$$\begin{aligned} \begin{aligned} \eta _{1}^{A} = 0.6 \text {, } \eta _{2}^{A} = 0.4 \text {, } \mu _{1}^{A} = \begin{bmatrix} 3&3\end{bmatrix}^{T} \text {, } \mu _{2}^{A} = \begin{bmatrix} -2&-2\end{bmatrix}^{T} \text {, } \varSigma _{1}^{A}=3I \text { and } \varSigma _{2}^{A}=I \\ \eta _{1}^{B} = 0.9 \text {, } \eta _{2}^{B} = 0.1 \text {, } \mu _{1}^{B} = \begin{bmatrix} 0&0\end{bmatrix}^{T} \text {, } \mu _{2}^{B} = \begin{bmatrix} 4&3\end{bmatrix}^{T} \text {, } \varSigma _{1}^{B}=8I \text { and } \varSigma _{2}^{B}=I \end{aligned} \end{aligned}$$

Here \(\eta \) represents the mixture proportion and I is the identity matrix. Datasets of 1500 samples are generated with the class imbalance ratio varying from 2 to 10. It can be observed from Fig. 2b that, as the imbalance ratio increases, the performance of mPEkNN remains steadier than that of the other kNN variants, indicating that mPEkNN is less sensitive to the imbalance ratio on these synthetic datasets.
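Under our own reading of the parameters above, the synthetic datasets of Fig. 2b could be regenerated roughly as follows; the sampling routine, the seed and the exact IR grid are illustrative assumptions.

```python
import numpy as np

def sample_mixture(n, weights, means, covs, rng):
    """Draw n points from a 2-component Gaussian mixture."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return np.vstack([rng.multivariate_normal(means[c], covs[c]) for c in comp])

def make_synthetic(ir, n_total=1500, seed=0):
    """Two-class 2-D dataset with imbalance ratio ir (majority B : minority A)."""
    rng = np.random.default_rng(seed)
    I = np.eye(2)
    n_min = int(round(n_total / (1 + ir)))         # minority class A
    n_maj = n_total - n_min                        # majority class B
    X_a = sample_mixture(n_min, [0.6, 0.4], [[3, 3], [-2, -2]], [3 * I, I], rng)
    X_b = sample_mixture(n_maj, [0.9, 0.1], [[0, 0], [4, 3]], [8 * I, I], rng)
    X = np.vstack([X_a, X_b])
    y = np.array(["A"] * n_min + ["B"] * n_maj)
    return X, y

# e.g. one dataset per imbalance ratio
datasets = [make_synthetic(ir) for ir in (2, 4, 6, 8, 10)]
```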

5 Conclusion

This paper proposes an extended kNN algorithm that improves the performance of existing kNN by making it robust to the class imbalance problem. In PEkNN, for a query instance, we calculate a confidence for each neighbor from the posterior probability of that instance, which is then discounted by the proximity of that instance to the query instance. We show that this proximity weighted confidence increases the likelihood of correctly classifying a minority class instance. To calculate the confidence, we use two methods: one using a single Gaussian model (sPEkNN) and the other using a Gaussian mixture model (mPEkNN). Results over 30 datasets provide evidence that the proposed approach outperforms twelve relevant methods on imbalanced datasets. One limitation of the proposed method is that it assumes all feature values are numeric. As a future research direction, we plan to extend the work to categorical features.