
1 Introduction

Classification is one of the most important tasks in machine learning. Numerous classification approaches, such as k Nearest Neighbor (kNN) [9], Decision Tree (DT), Naïve Bayes (NB), and Support Vector Machine (SVM), have been well developed and applied in many applications. However, most classifiers struggle with imbalanced class distributions, and learning from imbalanced datasets is regarded as one of the top ten challenging problems in data mining research [20].

To address the class imbalance problem, various strategies have been proposed, which can be grouped into two broad categories, namely data-oriented and algorithm-oriented approaches. Data-oriented approaches use sampling techniques: to balance the dataset, they either oversample the minority class or select instances (under-sample) from the majority class. One such sampling technique, the Synthetic Minority Over-sampling TEchnique (SMOTE), increases the number of minority class instances by creating artificial, non-repeated samples [4].

In contrast, algorithm-oriented approaches are modifications of traditional algorithms such as DT and kNN. The DTs modified for imbalanced classification include Hellinger Distance DT (HDDT) [5], Class Confidence Proportion DT (CCPDT) [13] and Weighted Inter-node Hellinger Distance DT (iHDwDT) [1]. These DTs use different splitting criteria when selecting a feature at a split point.

kNN is one of the simplest classifiers. Despite its simplicity, kNN is considered one of the most influential data mining algorithms [19]. Traditional kNN finds the k closest instances in the training data to a query instance and treats all neighbors equally. Dudani proposed a distance-based weighted kNN which assigns larger weights to closer neighbors [8]. Another variant, Generalized Mean Distance based kNN (GMDKNN) [10], introduces the multi-generalized mean distance and the nested generalized mean distance. All these variants of kNN are sensitive to the majority instances and thus perform poorly on imbalanced datasets.

Considering this imbalance problem, several researchers have extended kNN for imbalanced datasets [7, 11, 12]. In Exemplar-based kNN (kENN) [11], Li and Zhang expand the decision boundary for the minority class by identifying exemplar minority instances. A weighting algorithm, Class Confidence Weighted kNN (CCWKNN), has been presented in [12], where the probability of feature values given the class labels is used as the weight. Dubey and Pudi proposed a weighted kNN (WKNN) [7] which considers the class distribution in a wider region around a query instance; the class weight for each training instance is estimated by taking the local class distributions into account.

The purpose of these existing studies is to improve the overall performance on imbalanced data. However, these methods overlook the problem of uncertainty, which is prevalent in almost all datasets [18]. The reason behind this uncertainty is that complete statistical knowledge of the conditional density function of each class is hardly ever available [6]. To address this problem, kNN has been extended with the Dempster-Shafer Theory of evidence (DST) to better model uncertain data, resulting in the Evidential kNN (EKNN) [6]. In EKNN, each neighbor assigns basic belief to classes based on a distance measure. Nevertheless, this approach again does not take the class imbalance problem into consideration.

To address the aforementioned problems, we propose a Proximity weighted Evidential kNN (PEkNN) classifier and make the following contributions. First, we propose a confidence (posterior) assignment procedure for each neighbor of a query instance. Second, we propose to use the proximity of a neighbor as a weight to discount its confidence; we show that this weighted confidence increases the likelihood of classifying a minority class instance correctly. Third, the DST framework is used to combine the evidence from different neighbors.

2 Dempster-Shafer Theory of Evidence

Dempster-Shafer theory of evidence is a generalized form of Bayesian theory. It assigns degrees of belief to all possible subsets of the hypothesis set. Let \(C = \{C_1, \dots , C_M\}\) be a finite set of mutually exclusive and exhaustive hypotheses. The belief in a hypothesis assigned based on a piece of evidence is a number in [0, 1]. A Basic Belief Assignment (BBA) is a function \(m : 2^C \rightarrow [0, 1]\) which satisfies the following properties:

$$\begin{aligned} m(\emptyset ) = 0 \quad \text {and}\quad \sum _{A \subseteq C} m(A) = 1 \end{aligned}$$
(1)

where m(A) is a degree of belief (referred to as mass) which reflects how strongly A is supported by the piece of evidence, and m(C) represents the degree of ignorance.

Several pieces of evidence characterized by their BBAs can be fused using Dempster’s rule of combination [16]. For two BBAs \(m_1({.})\) and \(m_2({.})\) which are not totally conflicting, the combination rule can be expressed using Eq. (2).

$$\begin{aligned} m(A) = \frac{\sum _{B \cap C = A}m_1(B)m_2(C)}{1 - \sum _{B \cap C = \emptyset }m_1(B) m_2(C)} \quad A \ne \emptyset \end{aligned}$$
(2)

where \(A, B, C \in 2^C\) and \(\sum _{B \cap C = \emptyset }m_1(B) m_2(C) < 1\).

For decision making, the Belief, Plausibility and betting Probability (\(P_{bet}\)) functions are usually used. For a singleton class A, \(P_{bet}(A)\) is given in Eq. (3), where \({\mid } B {\mid }\) represents the cardinality of the element B.

$$\begin{aligned} P_{bet}(A) = \sum _{A \subseteq B} \frac{{\mid } A \cap B {\mid }}{{\mid } B {\mid }} \times m(B) \end{aligned}$$
(3)
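For concreteness, the following minimal Python sketch implements Dempster's rule (Eq. (2)) and the betting probability (Eq. (3)); the dictionary-of-frozensets representation, the function names and the two example BBAs are our own illustrative choices, not part of the original paper.

```python
# BBAs are represented as dicts mapping frozensets of class labels to masses.

def dempster_combine(m1, m2):
    """Dempster's rule (Eq. 2); assumes the BBAs are not totally conflicting."""
    combined, conflict = {}, 0.0
    for A, mA in m1.items():
        for B, mB in m2.items():
            inter = A & B
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mA * mB
            else:
                conflict += mA * mB          # mass that falls on the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {A: v / (1.0 - conflict) for A, v in combined.items()}

def pignistic(m):
    """Betting probability P_bet for every singleton class (Eq. 3)."""
    classes = set().union(*m.keys())
    return {c: sum(v / len(A) for A, v in m.items() if c in A) for c in classes}

if __name__ == "__main__":
    AB = frozenset({"A", "B"})
    m1 = {frozenset({"A"}): 0.6, AB: 0.4}    # evidence favouring class A
    m2 = {frozenset({"B"}): 0.5, AB: 0.5}    # evidence favouring class B
    print(pignistic(dempster_combine(m1, m2)))   # approx. A: 0.571, B: 0.429
```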

3 Proximity Weighted Evidential kNN (PEkNN)

kNN faces difficulty on imbalanced datasets because it treats all neighbors of the query instance equally, and most of those neighbors will belong to the majority class. To deal with this issue, the proposed algorithm gives more importance to neighbors with a higher proximity weighted confidence. Here, the confidence of an instance is the conditional probability of its class label given the instance, estimated from the training data. Algorithms such as NB also use conditional probabilities when classifying a query instance; however, the performance of NB degrades due to poor estimation of the conditional density of the query instance for each class. In contrast, PEkNN computes the conditional probability of the neighboring instances rather than of the query instance. Furthermore, uncertainty is prevalent in almost all datasets [18], and it is even more significant for imbalanced datasets where little information is available for the minority class. To deal with this issue, PEkNN uses DST to combine the evidence provided by each neighbor.

For a new query instance (\(x_t\)), PEkNN first finds its k closest neighbors according to some distance measure (e.g. Euclidean distance). Let \(S(x_t, k)\) be the set of k closest neighbors of \(x_t\); each member of \(S(x_t, k)\) is considered a piece of evidence that assigns mass values to the subsets of C, i.e. a BBA.

Now, consider \(x_i\) as the i-th neighbor of \(x_t\), belonging to class \(C_q\). As \(x_i\) is a piece of evidence for \(C_q\), some part of its belief is committed to \(C_q\). The rest of the belief cannot be distributed to any subset of C other than C itself. The BBA provided by \(x_i\) can be represented by Eqs. (4), (5) and (6), where \(0< \beta _0 < 1\).

$$\begin{aligned} m_i(\{C_q\}) = \beta = \beta _0 \times \varPsi (x_i, x_t) \end{aligned}$$
(4)
$$\begin{aligned} m_i(A) = 0 \quad \forall A \in 2^C {\setminus } \{C, \{C_q\}\} \end{aligned}$$
(5)
$$\begin{aligned} m_i(C) = 1 - \beta \end{aligned}$$
(6)

We now discuss two intuitions. First, a piece of evidence belonging to \(C_q\) should assign a larger belief to \(C_q\) when the evidence is more reliable, which we call its confidence: a piece of evidence with a higher posterior probability should receive more confidence than one lying in a lower posterior probability region. Second, a neighbor should assign more belief to a specific class when the neighbor and the query instance are more proximate. The function \(\varPsi ({.})\) defined in Eq. (7) satisfies both intuitions, where \(p_i\) is the confidence of \(x_i\), represented by the probability of the class label (\(y_i\)) given \(x_i\), and \(prx(x_i, x_t)\) represents the proximity between \(x_i\) and \(x_t\).

$$\begin{aligned} \varPsi (x_i, x_t) = prx(x_i, x_t) \times p_i \end{aligned}$$
(7)
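As a small illustrative sketch (ours, not the paper's), the BBA of a single neighbor defined by Eqs. (4)-(6), with the weight of Eq. (7), can be written as follows; \(\beta_0 = 0.95\) is the value used later in the experiments.

```python
def neighbor_bba(class_label, confidence, proximity, all_classes, beta0=0.95):
    """BBA of one neighbor (Eqs. 4-6): beta = beta0 * prx * p_i is committed to
    the neighbor's class, the remaining 1 - beta to the whole frame C."""
    beta = beta0 * proximity * confidence          # Eq. (7) plugged into Eq. (4)
    return {frozenset({class_label}): beta,
            frozenset(all_classes): 1.0 - beta}
```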

The procedure by which the PEkNN algorithm classifies a query instance is presented in Algorithm 1. The confidence assignment, proximity estimation and decision making steps are described in detail in Sects. 3.1, 3.2 and 3.3 respectively.

Algorithm 1. PEkNN classification of a query instance (pseudocode)
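As one possible reading of the steps Algorithm 1 describes in the text (neighbor search, per-neighbor BBAs via Eqs. (4)-(7), Dempster combination and betting probability), the following self-contained Python sketch shows how a query instance could be classified; all function and variable names are ours, and the per-instance confidences are assumed to be precomputed as in Sect. 3.1.

```python
import numpy as np
from scipy.spatial.distance import pdist

def classify_peknn(X_train, y_train, confidence, x_t, k=5, beta0=0.95):
    """Sketch of Algorithm 1 (PEkNN) for one query instance x_t.
    X_train    : (n, l) array of training instances
    y_train    : length-n array of class labels
    confidence : length-n array of p_i = P(y_i | x_i), precomputed (Sect. 3.1)
    """
    classes = np.unique(y_train)
    omega = frozenset(classes.tolist())            # the whole frame C

    # 1. k nearest neighbors by Euclidean distance
    dist = np.linalg.norm(X_train - x_t, axis=1)
    nn = np.argsort(dist)[:k]

    # 2. proximity of each neighbor (Eq. 11)
    d_max = pdist(X_train).max()
    prox = 1.0 - dist[nn] / d_max

    # 3. one BBA per neighbor (Eqs. 4-7), fused by Dempster's rule (Eq. 2)
    combined = {omega: 1.0}                        # vacuous BBA, neutral element
    for i, pr in zip(nn, prox):
        beta = beta0 * pr * confidence[i]
        combined = _dempster(combined,
                             {frozenset({y_train[i]}): beta, omega: 1.0 - beta})

    # 4. betting probability (Eq. 3) and decision (Eq. 12)
    p_bet = {c: sum(v / len(A) for A, v in combined.items() if c in A)
             for c in classes}
    return max(p_bet, key=p_bet.get)

def _dempster(m1, m2):
    """Dempster's rule for two BBAs (assumed not totally conflicting)."""
    out, conflict = {}, 0.0
    for A, a in m1.items():
        for B, b in m2.items():
            inter = A & B
            if inter:
                out[inter] = out.get(inter, 0.0) + a * b
            else:
                conflict += a * b
    return {A: v / (1.0 - conflict) for A, v in out.items()}
```

In a full implementation, quantities that depend only on the training set, such as `d_max` and the confidence array, would naturally be computed once at training time rather than per query.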

3.1 Estimation of Confidence

The confidence (\(p_i\)) of an instance \(x_i\) (\(x_i \in \mathbb {R}^l\)) belonging to class \(y_i\) is assigned using Bayes' theorem, as given in Eq. (8).

$$\begin{aligned} p_i = P(y_i \mid x_i) = \frac{P(y_i) \times P(x_i \mid y_i)}{\sum _{j=1}^M P(C_j) \times P(x_i \mid C_j)} \end{aligned}$$
(8)

where \(y_i \in \{C_1, C_2, \dots , C_M\}\), \(P(C_j)\) represents the prior of \(C_j\) in the training space and \(P(x_i|C_j)\) represents the likelihood in Bayes' theorem. Two approaches for estimating the class-wise Probability Density Function (PDF) are presented here: the first uses a Single Gaussian Model (SGM) and the second a Gaussian Mixture Model (GMM). When PEkNN uses the confidence derived from the SGM we call it sPEkNN, and mPEkNN when it uses the confidence derived from the GMM.

The single Gaussian model assumes that all features are independent and that the continuous values associated with each class follow a normal distribution. Under these assumptions, the likelihood function can be represented as Eq. (9).

$$\begin{aligned} P(x) = \prod _{j=1}^l P(x_j) = \prod _{j=1}^l f(x_j ; \mu _j, {\sigma _j}^2) = \prod _{j=1}^l \frac{1}{\sqrt{2\pi }\sigma _j} \times \exp (-\frac{(x_j - \mu _j)^2}{2\sigma _j^2}) \end{aligned}$$
(9)

where \(x_j\) denotes the j-th feature of x and f(.) represents the normally distributed PDF parameterized by mean (\(\mu \)) and variance (\(\sigma ^2\)).
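As an illustration only, the confidence of Eq. (8) with the SGM likelihood of Eq. (9) could be estimated along the following lines; the variable names are ours, and a small variance floor is added for numerical safety, which the paper does not discuss.

```python
import numpy as np

def sgm_confidence(X, y):
    """p_i = P(y_i | x_i) for every training instance, using class priors and
    the per-class, per-feature Gaussian likelihood of Eq. (9)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    params = {c: (X[y == c].mean(axis=0), X[y == c].var(axis=0) + 1e-9)
              for c in classes}                          # (mu_j, sigma_j^2) per class

    def likelihood(x, c):                                # Eq. (9)
        mu, var = params[c]
        return np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))

    conf = np.empty(X.shape[0])
    for i in range(X.shape[0]):
        joint = {c: priors[c] * likelihood(X[i], c) for c in classes}
        conf[i] = joint[y[i]] / sum(joint.values())      # Eq. (8)
    return conf
```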

On the other hand, a GMM can be used to estimate the PDF of multivariate data. The class-wise PDF using an m-component mixture model is given in Eq. (10).

$$\begin{aligned} P(x) = \sum _{i=1}^{m} \alpha _i P(x \mid Z_i) \end{aligned}$$
(10)

The procedure for finding the complete set of parameters (\(Z_1, \dots , Z_m, \alpha _1, \dots , \alpha _m\)) specifying the mixture model is briefly described in [14].
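Assuming an off-the-shelf EM implementation is acceptable (the paper cites [14] only for the fitting procedure), a class-wise GMM density as in Eq. (10) could be plugged into Eq. (8) roughly as follows; the use of scikit-learn and the default of m = 3 components are our assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_confidence(X, y, m=3):
    """p_i = P(y_i | x_i) using one m-component GMM per class (Eq. 10)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    models = {}
    for c in classes:
        Xc = X[y == c]
        # never request more components than the class has samples
        models[c] = GaussianMixture(n_components=min(m, len(Xc)),
                                    covariance_type="full",
                                    random_state=0).fit(Xc)

    dens = {c: np.exp(models[c].score_samples(X)) for c in classes}   # P(x | C_j)
    joint = np.column_stack([priors[c] * dens[c] for c in classes])
    post = joint / joint.sum(axis=1, keepdims=True)                   # Eq. (8)
    idx = {c: j for j, c in enumerate(classes)}
    return np.array([post[i, idx[y[i]]] for i in range(len(y))])
```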

3.2 Estimation of Proximity

To capture the proximity between two instances, a distance measure can be used. The proximity between two training instances (\(x_i\) and \(x_j\)) is maximal when \(x_i\) and \(x_j\) are identical. On the other hand, it is lowest when they are the two farthest instances in the feature space. To measure this proximity, a normalization is applied as in Eq. (11) so that \(prx(x_i, x_j) \in [0,1]\). Here, \(d_{max}\) is the distance between the two farthest training instances.

$$\begin{aligned} prx(x_i, x_j) = 1 - \frac{d(x_i, x_j)}{d_{max}} \end{aligned}$$
(11)
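A direct transcription of Eq. (11) might look as follows; this is a sketch, and the helper name is ours.

```python
import numpy as np

def proximity(x_i, x_j, d_max):
    """Eq. (11): returns 1 for identical instances and 0 for the two farthest
    training instances; d_max is the largest pairwise training distance."""
    return 1.0 - np.linalg.norm(np.asarray(x_i) - np.asarray(x_j)) / d_max
```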

3.3 Decision Making

According to Eq. (7), \(\varPsi ({.})\) returns a larger value when a neighbor is more confident and closer to the query instance. For each of the k nearest neighbors, the BBAs are defined using Eqs. (4), (5) and (6). In order to classify \(x_t\), these BBAs are combined using DST. The betting probability (\(P_{bet}\)) for each singleton class is then calculated from the combined BBA using Eq. (3). Finally, the decision is taken from \(P_{bet}\) using Eq. (12).

$$\begin{aligned} \hat{y} = \mathop {\mathrm {arg\,max}}\limits _{c \in \{C_1, \dots , C_M\}} P_{bet}(c) \end{aligned}$$
(12)

where c is a singleton class, i.e. the cardinality of c is 1.

Property of \(\beta \): the value of \(\beta \) is bounded between 0 and 1.

Proof

From Eqs. (4), (7) and (8), it can be derived that

$$\begin{aligned} \beta = \beta _0 \times P(y_i \mid x_i) \times prx(x_i, x_t) \end{aligned}$$
(13)

Here, \(\beta _0\) is a user-given constant satisfying \(0< \beta _0 < 1\). The second term, \(P(y_i \mid x_i)\), is a posterior probability. The last term, \(prx(x_i, x_t)\), is at most 1 and at least 0. As can be seen from Eq. (13), \(\beta \) is a product of three terms, each bounded between 0 and 1, so \(\beta \) must also be bounded between 0 and 1.

Fig. 1. A synthetic imbalanced dataset

3.4 An Illustrative Example

Figure 1 shows the instances of a two-class imbalance problem where (\(+\))s and (\(\bullet \))s represent the minority (Class-A) and majority (Class-B) class instances respectively. The class boundaries are represented as dotted lines and three query instances (\(t_1, t_2, t_3\)) are marked with (\(\star \))s. Here, the first query instance \(t_1\) is situated in a majority class region bounded by minority instances. Both kNN and PEkNN can successfully classify \(t_1\), whereas traditional algorithms such as C4.5 and NB face difficulties in this situation.

The two other query instances \(t_2\) and \(t_3\) are associated with a region named \(A_1\) (see Fig. 1b). Here, for both \(t_2\) and \(t_3\), the four nearest neighbors are \(x_a\), \(x_b\), \(x_c\) and \(x_d\). Traditional kNN with \(k=4\) will classify both \(t_2\) and \(t_3\) as Class-B. PEkNN, on the other hand, considers the confidence of each neighbor. Here, \(x_d\) provides a higher confidence compared to the majority class instances (\(x_a\), \(x_b\) and \(x_c\)). Assume the confidences of \(x_a\), \(x_b\), \(x_c\) and \(x_d\) are 0.30, 0.40, 0.30 and 0.75 respectively, and their proximities with respect to \(t_2\) are 0.90, 0.95, 0.85 and 0.95 respectively. Then the BBAs assigned by PEkNN for these neighbors are \(m_a(\{B\}) = 0.2565\), \(m_a(\{A,B\}) = 0.7435\), \(m_b(\{B\}) = 0.3610\), \(m_b(\{A, B\}) = 0.6390\), \(m_c(\{B\}) = 0.2423\), \(m_c(\{A, B\}) = 0.7577\), \(m_d(\{A\}) = 0.6769\) and \(m_d(\{A, B\}) = 0.3231\), where \(\beta _0\) is set to 0.95. Combining these BBAs using DST, we get \(P_{bet}(A) = 0.5325\) and \(P_{bet}(B) = 0.4675\), which indicates that \(t_2\) will be correctly classified as Class-A.

On the other hand, for the query instance \(t_3\), the proximities of \(x_a\), \(x_b\), \(x_c\) and \(x_d\) are 0.85, 0.95, 0.95 and 0.85 respectively. We therefore get \(P_{bet}(A) = 0.4661\) and \(P_{bet}(B) = 0.5339\), indicating that \(t_3\) will be classified as Class-B. Thus \(t_3\) is correctly classified as a majority class instance even though the neighbors of \(t_2\) and \(t_3\) are the same.
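The numbers in this example can be checked with a short script that rebuilds the four BBAs from the stated confidences and proximities and combines them as in Sect. 2; the helper functions are restated here so the snippet runs on its own.

```python
def combine(m1, m2):                               # Dempster's rule (Eq. 2)
    out, conflict = {}, 0.0
    for A, a in m1.items():
        for B, b in m2.items():
            inter = A & B
            if inter:
                out[inter] = out.get(inter, 0.0) + a * b
            else:
                conflict += a * b
    return {A: v / (1.0 - conflict) for A, v in out.items()}

def p_bet(m):                                      # betting probability (Eq. 3)
    cls = set().union(*m.keys())
    return {c: sum(v / len(A) for A, v in m.items() if c in A) for c in cls}

AB, beta0 = frozenset("AB"), 0.95
conf = {"a": 0.30, "b": 0.40, "c": 0.30, "d": 0.75}      # confidences p_i
labels = {"a": "B", "b": "B", "c": "B", "d": "A"}        # class of each neighbor

for query, prox in [("t2", {"a": 0.90, "b": 0.95, "c": 0.85, "d": 0.95}),
                    ("t3", {"a": 0.85, "b": 0.95, "c": 0.95, "d": 0.85})]:
    m = {AB: 1.0}                                        # vacuous BBA
    for i in conf:
        beta = beta0 * prox[i] * conf[i]                 # Eqs. (4) and (7)
        m = combine(m, {frozenset(labels[i]): beta, AB: 1.0 - beta})
    print(query, {c: round(p, 4) for c, p in p_bet(m).items()})
    # expected: t2 -> A: 0.5325, B: 0.4675 ; t3 -> A: 0.4661, B: 0.5339
```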

Instead of DST, consider simpler techniques for combining the evidence, such as summing the class-wise proximity weighted confidences or taking their maximum. If we simply sum the class-wise proximity weighted confidences, both \(t_2\) and \(t_3\) get a higher value for Class-B as three of the four neighbors belong to that class. To avoid this bias, a query can instead be assigned to the class for which it gets the maximum proximity weighted confidence among the neighbors. But this method does not consider the local neighborhood priors, so it classifies both \(t_2\) and \(t_3\) as the minority class, which is not desired. PEkNN, on the other hand, uses the DST framework and successfully classifies both query instances.

4 Experiments and Results

The dataset description, implementation details and performance metrics, followed by the experimental results and discussion, are given in the following subsections.

4.1 Dataset Description

The characteristics of the 30 benchmark datasets, collected from the UCI machine learning repository [3] and the KEEL Imbalanced Datasets [2], are shown in Table 1. The Imbalance Ratio (IR) between the majority and minority class samples of the datasets used in these experiments is at least 1.5, and all feature values are numeric. A dataset is considered highly imbalanced when its IR is very high.

Table 1. Descriptions of Imbalanced Datasets. Idx, #Inst, #Cl and #Ftr represent index of a dataset, number of instances, classes and features respectively.
Table 2. Performance comparison among different algorithms on imbalanced datasets in terms of AUC (%). The best result for each dataset is in bold. SMT+kNN represents kNN followed by SMOTE sampling.

4.2 Implementation Details and Performance Metrics

PEkNN is benchmarked against other algorithms including traditional learning algorithms (kNN, C4.5, NB), an oversampling strategy (SMOTE), recent algorithms in the kNN family (EKNN, WKNN, CCWKNN, kENN, GMDKNN) and a few recent tree-based algorithms for imbalanced classification (CCPDT, HDDT, iHDwDT). For PEkNN, we use \(\beta _0=0.95\) in this experiment. For kENN, the confidence level is set to 0.1, and we set \(p=1\) for GMDKNN.

We conduct 10-fold stratified cross validation to evaluate the performance of the proposed method. The Receiver Operating Characteristic (ROC) curve [17] is widely used to evaluate imbalanced classification, and we use the Area Under the ROC Curve (AUC) to evaluate classifier performance. For comparison, all the classifiers are ranked on each dataset in terms of AUC, with rank 1 being the best. We also perform Friedman tests on the ranks. After the Friedman test rejects the null hypothesis that all the classifiers are equivalent, the Nemenyi post-hoc test [15] is used to determine which classifiers perform significantly better than the others.
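A rough sketch of this evaluation protocol for one classifier on one dataset, together with the Friedman test over an AUC table, is shown below; `clf` is assumed to be any scikit-learn-style classifier with `predict_proba` on a binary problem, and the Nemenyi post-hoc step is omitted because it needs an additional package.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cv_auc(clf, X, y, folds=10, seed=0):
    """Mean AUC over 10-fold stratified cross validation."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    aucs = []
    for tr, te in skf.split(X, y):
        clf.fit(X[tr], y[tr])
        score = clf.predict_proba(X[te])[:, 1]       # probability of the positive class
        aucs.append(roc_auc_score(y[te], score))
    return np.mean(aucs)

def friedman_over_datasets(auc_table):
    """auc_table: (n_datasets, n_classifiers) AUC matrix.
    Returns the average ranks (1 = best) and the Friedman test p-value."""
    ranks = np.vstack([rankdata(-row) for row in auc_table])   # higher AUC -> rank 1
    stat, p_value = friedmanchisquare(*auc_table.T)
    return ranks.mean(axis=0), p_value
```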

4.3 Result and Discussion

Table 2 presents the comparison of 14 classifiers over 30 imbalanced datasets. The average ranks of these classifiers indicate that kNN performs better than the other traditional classifiers on imbalanced datasets. Although kNN performs better than C4.5, the tree-based algorithms modified for imbalanced data perform better than kNN. Moreover, kNN on SMOTE-sampled datasets performs slightly better than kNN without sampling.

Comparing kNN with its variants, it can be observed that kENN and WKNN improve the overall performance of traditional kNN, although another variant, CCWKNN, fails to improve the performance in most cases over the experimented datasets. Moreover, the recent generalized mean based kNN approach, GMDKNN, performs worse than kNN on imbalanced datasets. In contrast, we can observe from Table 2 that EKNN performs better than all other classifiers except the proposed sPEkNN and mPEkNN, which indicates that handling uncertainty can improve the performance of kNN on imbalanced datasets. Finally, the average ranks show that mPEkNN is the best performing classifier on the imbalanced datasets.

In addition, Table 2 summarizes the counts of Win-Tie-Loss (W-T-L) of sPEkNN and mPEkNN against the other classifiers, which indicates that mPEkNN performs better than the other classifiers in most cases. From the Win-Tie-Loss counts, it is observed that mPEkNN wins on up to 29 datasets with no loss against the C4.5 and GMDKNN classifiers. In the worst case, mPEkNN performs better on 19 datasets and worse on 7 datasets compared to EKNN.

The results of the Friedman test (Fr. Test) with two base classifiers (sPEkNN and mPEkNN) are shown in the last two lines of Table 2. From the Friedman test with 14 classifiers and 30 datasets, we conclude that the fourteen classifiers are not all equivalent. After rejecting the hypothesis that all fourteen classifiers perform equivalently, the Nemenyi test is performed to determine which classifiers perform significantly better than the others. A tick (✓) sign under a classifier indicates that the Nemenyi test suggests the performance of that classifier is significantly different from the base classifier in a pairwise comparison at the \(95\%\) confidence level. The Nemenyi test states that sPEkNN performs significantly better than all compared classifiers except EKNN, CCPDT and HDDT. More importantly, the test suggests that mPEkNN is the best performing classifier among the twelve compared classifiers.

4.4 Effects of Neighborhood Size and Imbalance Ratio

Here, we show the effects of neighborhood size and Imbalance Ratio (IR) on the performance of the proposed method compared to other kNN variants. Due to page limitations, only one dataset (Ionosphere) is used to present the comparison in terms of AUC for values of k ranging from 1 to 20. It is clear from Fig. 2a that sPEkNN and mPEkNN consistently perform better than the other algorithms and are less sensitive to the value of k.

Fig. 2. Performance comparison among the algorithms belonging to the kNN family

To visualize the effect of IR, we use a synthetic two-class dataset in a two-dimensional space where the instances of each class are drawn from a mixture of two Gaussian distributions. The characteristics of the dataset are given below, where Class-A is the minority class and Class-B is the majority class.

$$\begin{aligned} \begin{aligned} \eta _{1}^{A} = 0.6 \text {, } \eta _{2}^{A} = 0.4 \text {, } \mu _{1}^{A} = \begin{bmatrix} 3&3\end{bmatrix}^{T} \text {, } \mu _{2}^{A} = \begin{bmatrix} -2&-2\end{bmatrix}^{T} \text {, } \varSigma _{1}^{A}=3I \text { and } \varSigma _{2}^{A}=I \\ \eta _{1}^{B} = 0.9 \text {, } \eta _{2}^{B} = 0.1 \text {, } \mu _{1}^{B} = \begin{bmatrix} 0&0\end{bmatrix}^{T} \text {, } \mu _{2}^{B} = \begin{bmatrix} 4&3\end{bmatrix}^{T} \text {, } \varSigma _{1}^{B}=8I \text { and } \varSigma _{2}^{B}=I \end{aligned} \end{aligned}$$

Here \(\eta \) represents the mixture proportion and I is the identity matrix. Datasets of 1500 samples are generated with the class imbalance ratio varying from 2 to 10. It can be observed from Fig. 2b that, as the imbalance ratio increases, the performance of mPEkNN remains steadier than that of the other kNN variants, indicating that mPEkNN is less sensitive to the imbalance ratio on these synthetic datasets.
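Under our own reading of the parameters above, the synthetic datasets of Fig. 2b could be regenerated roughly as follows; the sampling routine, the seed and the exact IR grid are illustrative assumptions.

```python
import numpy as np

def sample_mixture(n, weights, means, covs, rng):
    """Draw n points from a 2-component Gaussian mixture."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return np.vstack([rng.multivariate_normal(means[c], covs[c]) for c in comp])

def make_synthetic(ir, n_total=1500, seed=0):
    """Two-class 2-D dataset with imbalance ratio ir (majority B : minority A)."""
    rng = np.random.default_rng(seed)
    I = np.eye(2)
    n_min = int(round(n_total / (1 + ir)))         # minority class A
    n_maj = n_total - n_min                        # majority class B
    X_a = sample_mixture(n_min, [0.6, 0.4], [[3, 3], [-2, -2]], [3 * I, I], rng)
    X_b = sample_mixture(n_maj, [0.9, 0.1], [[0, 0], [4, 3]], [8 * I, I], rng)
    X = np.vstack([X_a, X_b])
    y = np.array(["A"] * n_min + ["B"] * n_maj)
    return X, y

# e.g. one dataset per imbalance ratio
datasets = [make_synthetic(ir) for ir in (2, 4, 6, 8, 10)]
```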

5 Conclusion

This paper proposes an extended kNN algorithm that improves the performance of existing kNN by making it robust to the class imbalance problem. In PEkNN, for a query instance, we calculate a confidence for each neighbor from the posterior probability of that instance, which is then discounted by the proximity of that instance to the query instance. We show that this proximity weighted confidence increases the likelihood of correctly classifying a minority class instance. To calculate the confidence, we use two methods: one using a single Gaussian model (sPEkNN) and the other using a Gaussian mixture model (mPEkNN). Results over 30 datasets provide evidence that the proposed approach outperforms twelve relevant methods on imbalanced datasets. One limitation of the proposed method is that it assumes all feature values are numeric. As a future research direction, we plan to extend the work to categorical features.