1 Introduction

The classification problem is an important part of machine learning. Traditional classification models are trained under the assumption that the samples are uniformly distributed across classes, so every sample carries the same misclassification cost. In realistic datasets, however, this assumption is rarely satisfied: in pursuing global accuracy, a traditional classifier easily delivers unsatisfactory performance on the minority samples, making them hard to recognize. The imbalanced classification problem appears in many fields, such as bioinformatics [1, 2], remote sensing image recognition [3], and privacy protection in cybersecurity [4,5,6]. The wide coverage of the imbalance problem gives it great practical significance.

The traditional imbalanced classification problem has two characteristics: the difference in sample size between classes, and the difference in misclassification cost between classes. Scholars have proposed data-level methods for the first characteristic and algorithm-level methods for the second. Because data-level methods can be regarded as a preprocessing step before training the classifier, they have been popular for many years. They can be divided into oversampling, which adds minority samples; under-sampling, which removes majority samples; and hybrid sampling. All of them aim to obtain a balanced dataset, where balance is judged by a measurement of the imbalanced dataset.

Measurements of imbalanced datasets can be divided into two types: local measurements and global measurements. Local measurements [7, 8] traverse every sample in the dataset, usually computing a per-sample value with the k-NN algorithm, and define the overall measurement as the mean of these per-sample values. Because this kind of measurement is computed for each sample, it can be used inside a sampling algorithm to find a smaller dataset that still carries enough of the information in the original one. A global measurement [9] is a single result calculated over the entire dataset, or one of various indicators derived from statistical analysis; it is usually obtained by combining separate results for the positive and negative subsets. Such measurements cannot be attributed to an individual sample, so they can only describe the dataset as a whole, and they are hard to exploit in a sampling algorithm because moving a single sample hardly changes the measurement.

The number of samples has a noticeable effect on the classification results, so the imbalance ratio (IR) [8,9,10] of the sample sizes of the two classes has long been popular as a measurement of imbalanced datasets. Based on IR, scholars have proposed many sampling algorithms that balance the dataset to relieve the effect of the imbalance in sample size on classification performance, so the measurement plays a very important role in imbalanced classification. However, IR, as a global measurement, is not informative enough to characterize a specific dataset: studies [10] have shown that when the number of minority samples is relatively large, the imbalance does not reduce the classification performance of the minority class, whereas when the number of samples is seriously insufficient, the rarity of the minority samples leads to a low recognition rate. Local measurements improve on global ones by taking the distribution into consideration; meanwhile, with a better understanding of classifiers and data, distribution-based sampling methods [12,13,14] also exploit distribution information, which encourages new measurements to contain it as well.

This paper proposes a measurement that contains distribution information. It is motivated by the observation that the closer a sample is to samples with the same label, the more easily it can be classified correctly. The proposed method calculates, separately for each class subset, the average weighted fraction of the k nearest neighbors that belong to the same class under a weighted k-NN; the square root of the product of these average values is then regarded as the measurement of the dataset. This improves the correlation between the measurement and the final classification performance, indicating that the proposed measurement is more informative. The paper is organized as follows: Sect. 2 describes related work on measurements of imbalanced datasets, Sect. 3 presents the proposed measurement, the improved generalized imbalance ratio (IGIR), Sect. 4 reports the experimental results and analysis, and the final section concludes and discusses future work.

2 Related Work

Many different factors affect imbalanced classification, resulting in various measurements that consider different factors. For example, the Imbalance Ratio (IR) is based on the difference in sample size, the maximum Fisher's discriminant ratio (F1) is based on the overlap of feature values from different classes, and the Complexity Measurement (CM), the Generalized Imbalance Ratio (GIR), and the proposed Improved Generalized Imbalance Ratio (IGIR) are based on the idea that the data distribution plays an important role in imbalanced classification. These measurements are used in two ways: to indicate whether a dataset is easy to classify, and to assess the sampled subset in sampling methods. Therefore, to be useful, a measurement should have a relatively high correlation with the classification results.

Given a dataset X, which contains \( N_+ \) positive samples (the minority class) and \( N_- \) negative samples (the majority class), the total number of samples is \( N = N_- + N_+ \).

2.1 IR

The imbalance ratio [11,12,13] is defined as the ratio of the sample sizes of the two classes:

$$ IR = \frac{N_-}{N_+} $$
(1)

When samples with different labels share the same distribution, the sample sizes can reflect whether the samples are easy to classify; otherwise, IR is not informative enough to indicate how difficult the dataset is. For example, in Fig. 1 the IR of the data in (a) is 4 and in (b) is 1, but the two classes in (a) have a clear linear boundary while those in (b) do not, so the same linear model reaches 100% accuracy on (a) but not on (b), which is the opposite of what the comparison of IR values suggests. Since IR is only the ratio of sample sizes and contains no information about the sample distribution, the complexity of the data distribution cannot be represented by IR.

Fig. 1. The dilemma of IR.

2.2 F1

A classical measure of the discriminative power of the covariates, or features, is Fisher's discriminant ratio [9], and F1 is the maximum Fisher's discriminant ratio over all features:

$$ f = \frac{\left( \mu_1 - \mu_2 \right)^2}{\sigma_1^2 + \sigma_2^2} $$
(2)

Where \( \mu_1 \), \( \mu_2 \), \( \sigma_1^2 \), \( \sigma_2^2 \) are the means and variances of the feature in the positive and negative subsets, respectively. For multidimensional problems, the maximum f over all features is used. However, f = 0 does not necessarily mean that the classes are not separable; it may simply be that the separating boundary is not parallel to any of the feature axes.
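For concreteness, a minimal NumPy sketch of Eq. (2) applied feature-wise is given below; the function name and the zero-variance guard are illustrative choices rather than part of the original definition.

```python
import numpy as np

def max_fisher_discriminant_ratio(X, y):
    """F1: maximum of Eq. (2) over all features (binary problem assumed)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    c1, c2 = np.unique(y)                      # exactly two class labels expected
    X1, X2 = X[y == c1], X[y == c2]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0)
    f = num / np.maximum(den, np.finfo(float).eps)   # guard against zero variance
    return float(f.max())
```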

2.3 CM

CM [7] focuses on the local information for each data point via the nearest neighbors, and uses this information to capture data complexity.

$$ \mathrm{CM}_{k(j)} = I\left( \frac{\text{number of patterns } j' \in N_j \text{ with } y_{j'} = y_j}{k} \le 0.5 \right) $$
(3)

Where I(.) is the indicator function. The overall measurement is

$$ {\text{CM}}_{k} = \frac{1}{n}\sum\nolimits_{j = 1}^{n} {CM_{k\left( j \right)} } $$
(4)

CM is determined by the labels of a sample's neighbors: if the neighbors of a sample mostly share its class, the sample is easy to classify; on the contrary, if the sample is surrounded by samples with different labels, it is difficult to classify correctly. The average of the per-sample indicators in Eq. (3) over the whole dataset is used as the measurement, so the higher the CM, the more difficult the dataset is to learn.
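The following is a small NumPy/scikit-learn sketch of Eqs. (3)-(4), assuming each sample is excluded from its own neighbor list; the helper name is ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def complexity_measure(X, y, k=5):
    """CM_k: fraction of samples whose k nearest neighbours are
    not dominated by their own class (Eqs. (3)-(4))."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)               # first neighbour is the sample itself
    same = (y[idx[:, 1:]] == y[:, None])    # which of the k neighbours share the label
    return float((same.mean(axis=1) <= 0.5).mean())
```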

2.4 GIR

GIR [8] is an improvement of CM that focuses on the difference in classification difficulty between the classes. A dataset with a larger GIR makes it harder to obtain good minority-class performance, because the classifier tends to be fitted to the easier samples according to the principle of Occam's razor: we tend to use the simplest classifier that fits the whole dataset, while the more difficult samples would need a more complex classifier, which may cause overfitting with a single model. This is also why ensembles can be effective in imbalanced classification, since their different base classifiers can correspond to different levels of sample classification difficulty.

$$ T_+ = \frac{1}{N_+}\sum\nolimits_{x \in P} \frac{1}{k}\sum\nolimits_{r=1}^{k} IR(x, X) = \frac{1}{N_+}\sum\nolimits_{x \in P} t_k(x) $$
(5)
$$ T_- = \frac{1}{N_-}\sum\nolimits_{x \in N} t_k(x) $$
(6)
$$ \mathrm{GIR} = T_- - T_+ $$
(7)

Where IR(x, X) is an indicator function: for a sample x, it equals 1 if the label of the r-th nearest neighbor of x is the same as that of x, and 0 otherwise.

GIR computes separate measurements for the positive and negative subsets, which is an improvement over CM, and is defined as the difference between the two. Paper [8] successfully applies GIR to oversampling and under-sampling algorithms, and the experimental results show that GIR-based resampling can effectively improve classification performance.
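A sketch of Eqs. (5)-(7) under the same conventions (k-NN excluding the query sample, with the minority class taken as positive) could look as follows; the function name is illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def gir(X, y, minority_label, k=5):
    """GIR = T_minus - T_plus, with t_k(x) the fraction of the k nearest
    neighbours of x that share its label (Eqs. (5)-(7))."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    t_k = (y[idx[:, 1:]] == y[:, None]).mean(axis=1)
    t_plus = t_k[y == minority_label].mean()      # T_+ over the positive subset
    t_minus = t_k[y != minority_label].mean()     # T_- over the negative subset
    return float(t_minus - t_plus)
```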

However, GIR has two problems. First, in the classification process, besides the labels of the k nearest neighbors, their distances from the sample also affect the classification result. Second, the GIR of a dataset is the measurement of the negative class minus that of the positive class, so GIR is a relative measurement. As shown in Table 1, the two datasets have the same GIR, yet dataset (b) is clearly more difficult to classify than (a). Therefore, GIR is not sufficient to fully describe the complexity of the dataset distribution.

Table 1. The dilemma of GIR
Table 2. Confusion metrics
Table 3. R2 of measurements and classification results.

3 The Proposed Method

We propose an improved measurement, called IGIR, in this paper. It is based on the idea that the sample distribution plays an important role in the classification result: if there are many samples with the same label around a sample, the sample is easily classified, and on the contrary it is hard to classify. In addition, the k nearest neighbors at different distances have different effects on the classification result of the sample.

$$ weight_r = \frac{k - r}{k},\quad r = 1, 2, \cdots, k $$
(8)
$$ \text{wei-}T_+ = \frac{1}{N_+}\sum\nolimits_{x \in P} \frac{1}{k}\sum\nolimits_{r=1}^{k} weight_r \cdot IR(x, X) = \frac{1}{N_+}\sum\nolimits_{x \in P} t_k(x) $$
(9)
$$ \text{wei-}T_- = \frac{1}{N_-}\sum\nolimits_{x \in N} t_k(x) $$
(10)
$$ \text{wei-IGIR} = \sqrt{\text{wei-}T_- \cdot \text{wei-}T_+} $$
(11)

In the calculation of IGIR, the k nearest neighbors of each sample are found first and their class labels are retained. According to formula (8), the weights of the k nearest neighbors decrease gradually to 0. The main reason for using neighbor ranks rather than raw distances is that the distances between different samples and their neighbors are not on a common scale, which would make the weights of different samples incomparable and leave no common standard for the overall result. Second, to describe the dataset with an absolute measurement, to avoid the relativity of the original GIR, and inspired by the definition of the geometric mean, IGIR is defined as the compound of the measurements of the positive and negative subsets. The square root is taken so that the order of magnitude is unchanged, which better reflects the classification difficulty of the dataset.


In the proposed method, we first calculate the weights according to formula (8); second, compute the k nearest neighbors of each sample and the weighted score \( t_k(x) \) for each sample; third, compute the average \( t_k(x) \) of the positive and negative subsets; and finally, compute the IGIR according to formula (11).
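A minimal sketch of these steps is shown below. The rank weights (k - r)/k for r = 1..k and the exclusion of each sample from its own neighbor list follow our reading of formulas (8)-(11); the function name is illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def igir(X, y, minority_label, k=5):
    """IGIR = sqrt(wei-T_minus * wei-T_plus), formulas (8)-(11)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    weights = (k - np.arange(1, k + 1)) / k            # formula (8): (k-1)/k, ..., 0
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                          # column 0 is the sample itself
    same = (y[idx[:, 1:]] == y[:, None]).astype(float) # IR(x, X) for each neighbour
    t_k = (same * weights).mean(axis=1)                # weighted t_k(x), formula (9)
    wei_t_plus = t_k[y == minority_label].mean()
    wei_t_minus = t_k[y != minority_label].mean()
    return float(np.sqrt(wei_t_minus * wei_t_plus))    # formula (11)
```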

IGIR can be regarded as the average classification accuracy under a weighted k-NN: the more same-class neighbors a sample has, the more likely it is to be classified into its own class, so IGIR is by construction related to the final classification performance.

4 Experimental Results

4.1 Datasets

The experimental data in this paper come from the UCI machine learning repository [14]. Some of the datasets are multi-class; to obtain harder classification problems, we select one class as the minority class and regard the remaining classes as the majority. The details are shown in Table 4.

Table 4. Datasets

4.2 Evaluation

In binary imbalanced classification, the confusion matrix, defined in Table 2, is often used to evaluate the performance of the classifier:

FN is the number of positive samples incorrectly classified as negative, and FP is the number of negative samples incorrectly classified as positive. Compound evaluations such as the F-value and Gmean [15] are built from these counts.

$$ \text{precision} = \frac{TP}{TP + FP} $$
(12)
$$ {\text{recall}} = \frac{TP}{TP + FN} $$
(13)
$$ \text{F-value} = \frac{(1 + \beta^2) \times \text{recall} \times \text{precision}}{\beta^2 \times \text{recall} + \text{precision}} $$
(14)
$$ \text{Gmean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}} $$
(15)
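As a reference, a small helper that turns the confusion-matrix counts of the minority (positive) class into these measures might look as follows; zero-division handling is omitted for brevity.

```python
import numpy as np

def imbalance_metrics(tp, fn, fp, tn, beta=1.0):
    """Precision, recall, F-value (Eq. (14)) and Gmean (Eq. (15))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # sensitivity / recall of the minority class
    f_value = (1 + beta ** 2) * recall * precision / (beta ** 2 * recall + precision)
    gmean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
    return precision, recall, f_value, gmean
```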

4.3 Experimental Settings and Results

We set β = 1 in the F-value (denoted F1_min), use k = 5 in all k-NN computations, and use C4.5 as the classifier; all results are averages over 10 runs of 10-fold cross-validation.
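A sketch of this protocol with scikit-learn is given below; DecisionTreeClassifier with the entropy criterion is used as a stand-in for C4.5, and the minority class is assumed to be labeled 1. The function name is illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

def mean_sensitivity(X, y, n_repeats=10, n_splits=10, seed=0):
    """10 x 10-fold CV; returns the average minority-class sensitivity."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    scores = []
    for rep in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed + rep)
        for train, test in skf.split(X, y):
            clf = DecisionTreeClassifier(criterion="entropy").fit(X[train], y[train])
            tn, fp, fn, tp = confusion_matrix(y[test], clf.predict(X[test]),
                                              labels=[0, 1]).ravel()
            scores.append(tp / (tp + fn) if tp + fn else 0.0)
    return float(np.mean(scores))
```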

The classification results and measurements of the different datasets are shown in Table 5. As the table shows, IR and F1 have no upper bound, so their values can be very large, while CM, GIR and IGIR are limited to [0, 1]. Scatter plots are used to show the relation between the measurements and the classification results more clearly, as shown in Fig. 2. Taking the sensitivity of the minority class as an example, Fig. 2 shows the relationship between the different measurements and the classification results. It can be seen that CM and IGIR have a stronger linear relation with sensitivity, while there is no obvious trend for the remaining measurements. In addition, the points for CM are more dispersed and those for IGIR are more concentrated, which means that datasets with the same IGIR are more likely to share the same degree of classification difficulty than those with the same CM.

Table 5. Measurements and classification results
Fig. 2. Measurements and sensitivity.

4.4 Analysis

To quantitatively analyze the relationship between the different measurements and the classification results, the results are further analyzed with the coefficient of determination R2. R2 reflects the proportion of the variance of the response variable Y that can be explained by the explanatory variable X.

$$ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} $$
(16)

Where SST = SSR + SSE; SST is the total sum of squares, SSR the regression sum of squares, and SSE the error sum of squares.
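A short sketch of this computation via a simple least-squares fit of the classification results on a measurement is shown below; the arrays `measurement` and `performance` are hypothetical inputs.

```python
import numpy as np

def r_squared(measurement, performance):
    """Coefficient of determination of Eq. (16) for a linear fit."""
    x = np.asarray(measurement, dtype=float)
    y = np.asarray(performance, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)         # least-squares regression line
    sse = np.sum((y - (slope * x + intercept)) ** 2)   # error sum of squares
    sst = np.sum((y - y.mean()) ** 2)                  # total sum of squares
    return float(1.0 - sse / sst)
```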

The R2 values in Table 3 also show the superiority of IGIR. The proposed IGIR is better able to indicate the classification results; it has a stronger relevance to the final classification performance and can serve as a better indicator of the sampled subset in resampling methods.

In IGIR, we compute for each sample the weighted fraction of its k nearest neighbors that belong to its own class, so the computed value can be considered the probability that the sample is classified into its own class. To a certain extent, the measurement can therefore be regarded as the Gmean under a k-NN classifier, and it is reasonable for it to indicate the classification performance of other classifiers.

5 Conclusion

In this paper, an improved measurement for imbalanced datasets is proposed. It takes distribution information into consideration and is based on the idea that a sample surrounded by more same-class samples is easier to classify. For each class subset, the proposed method calculates the average weighted fraction of the k nearest neighbors that belong to the same class under a weighted k-NN, and the square root of the product of these averages is taken as the measurement of the dataset. The experimental results show that the proposed measurement has a higher correlation with the classification results and can be used in sampling algorithms. Future work will develop sampling algorithms based on this measurement to improve classification results.