
1 Introduction

The classification problem is a central part of machine learning and a first step toward artificial intelligence understanding human life. Most classifiers assume that the samples of different classes are evenly distributed and that the misclassification costs are the same. In reality, however, the data people care most about is often scarce, as in credit card fraud detection and medical disease diagnosis. In medical diagnosis, most results are normal and only a small proportion are diagnosed as diseased, which reflects the uneven distribution across classes. Moreover, if a healthy person is misdiagnosed as diseased, the error can be caught by further examinations and rarely causes serious harm; but if a diseased person is diagnosed as healthy, the patient may miss the best treatment window, with serious consequences. This is the second feature of imbalanced classification problems: the misclassification costs of different classes are inconsistent. At the same time, labeling as many samples as possible as diseased for fear of missing true cases would waste medical resources and intensify conflicts between doctors and patients. Declaring all samples diseased is therefore not feasible; the best approach is to separate the two outcomes as correctly as possible. Because of the scarcity of minority samples and the definition of global accuracy, classifiers pay less attention to the minority class, so its recognition performance is unsatisfactory. Imbalanced classification problems arise in many fields, such as bioinformatics [1, 2], remote sensing image recognition [3], and privacy protection in cybersecurity [4,5,6]. They are widespread and of great practical significance.

Traditional solutions to imbalanced problems fall into two groups: algorithm-level methods and data-level methods. Algorithm-level methods mainly address the different misclassification costs. For example, an improved neural network [7] uses an approximation of the minority-class F1 value as its cost function; a bagging algorithm [8] repeatedly boosts the misclassified minority samples to improve their recognition rate; and structured SVM [9] optimizes the F1 value of the minority class directly, giving better performance on minority-class classification.

Data-level methods address the imbalance in sample size, mainly by resampling the data to reduce its impact on classification performance. They can be divided into over-sampling, under-sampling, and hybrid sampling. Over-sampling adds minority samples to the training process; it can effectively improve minority-class performance but offers no guarantee that the added samples are reasonable. Under-sampling [10] removes majority samples before training, which reaches balance quickly but risks discarding valuable samples.

Over-sampling methods can be divided into random sampling and informed sampling. Random sampling creates new samples from the known ones, including simple repetition [11], linear interpolation [12], and nonlinear interpolation [13]. SMOTE [12], a classic over-sampling algorithm, interpolates linearly between minority samples; compared with simple repetition it increases the information content and plausibility of the synthesized samples and improves classification. Borderline-SMOTE [14] reduces the risk of overfitting by interpolating only the minority samples near the class boundary. These over-sampling methods consider only the sample size and the local sample distribution, ignoring the overall distribution of the samples, which carries more information for classification.
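To make the interpolation idea concrete, the following is a minimal sketch of SMOTE-style linear interpolation, not the reference implementation; the neighbor search and the function name `smote_sample` are simplifications introduced here for illustration.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=np.random.default_rng(0)):
    """Minimal SMOTE-style interpolation: pick a random minority sample,
    choose one of its k nearest minority neighbors, and interpolate linearly."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)   # distances to all minority samples
        neighbors = np.argsort(d)[1:k + 1]             # skip the sample itself
        j = rng.choice(neighbors)
        gap = rng.random()                             # uniform in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```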

Informed sampling [15] uses the distribution information in the samples to fit their probability distribution function (PDF) and then samples according to it. Chen [16] proposed a normal-distribution-based over-sampling approach that assumes the minority class follows a Gaussian distribution whose parameters are estimated from the minority samples with the EM algorithm; the experimental results are better than SMOTE and random over-sampling. Over-sampling algorithms based on various distributions have been proposed, such as the Gaussian distribution [16, 17] and the Weibull distribution [18]. Thanks to the distribution information, these algorithms improve on random over-sampling. However, their problems are also obvious: they require a prior assumption about the real distribution and typically assume the features are independent of each other. If the real distribution meets these hypotheses, the results are good; otherwise the improvement is limited, so their effect is inconsistent across datasets.
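The contrast can be illustrated with a small sketch in the spirit of the normal-distribution-based approaches cited above; this is not the algorithm of [16] or [17], just an example of the per-feature Gaussian assumption being criticized.

```python
import numpy as np

def gaussian_oversample(X_min, n_new, rng=np.random.default_rng(0)):
    """Fit an independent Gaussian to each feature of the minority class and
    draw new samples. This bakes in both assumptions discussed in the text:
    a fixed parametric form and feature independence."""
    mu = X_min.mean(axis=0)
    sigma = X_min.std(axis=0, ddof=1)
    return rng.normal(loc=mu, scale=sigma, size=(n_new, X_min.shape[1]))
```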

Data-level methods matter greatly in imbalanced classification: since they can be regarded as a data-preprocessing step, they have a direct effect on the final classification results. The factors that affect classification include not only the sample size but also the sample distribution, yet current over-sampling methods do not make full use of the distribution information and cannot guarantee that the generated samples are reasonable.

In this paper, we propose an over-sampling method based on the variational auto-encoder (VAE) [19] to generate minority samples. The method is motivated by the observation that distribution information plays an important role in over-sampling, and it aims at the rationality of the generated samples. We use a VAE to generate minority instances because, first, the output dimension of a neural network is not limited, so it can generate data of any dimension; and second, the strong fitting ability of neural networks can approximate arbitrary distribution functions without any prior knowledge. We use this model to capture the distribution of the minority samples and over-sample from it. The proposed method needs neither a prior distribution assumption nor a feature-independence assumption, and the experimental results demonstrate its effectiveness.

We organize the paper as follows. Section 2 describes related work. Section 3 presents and analyzes the proposed algorithm. Section 4 reports the experimental results. Section 5 concludes the paper.

2 Related Work

In 2013, Kingma and Welling [19] proposed the VAE: it adds variational inference to the auto-encoder and uses the reparameterization trick so that variational inference can be combined with stochastic gradient descent. The overall structure of the VAE network is shown in Fig. 1. Since it assumes the latent variables follow a standard Gaussian distribution, sampling is easy while the final probability distribution function is not fixed in advance, which coincides with the requirements of distribution-based over-sampling.

Fig. 1. Structure of the variational auto-encoder.

In a VAE, we assume the observed variables are determined by a hidden compression code z: the encoder maps X to z and constrains z to obey a particular distribution (such as a Gaussian), while the decoder maps z back to X. Knowing the distribution of z and the decoding function, we can sample z and decode it to obtain new x, generating an unlimited number of samples in theory.
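A minimal sketch of the encoder/decoder structure of Fig. 1, assuming PyTorch; layer sizes and the class name `VAE` are placeholders chosen here, not values from the paper. The encoder maps x to the mean and log-variance of Q(z|X), the reparameterization trick draws z = mu + sigma * eps, and the decoder maps z back to a reconstruction of x.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d_in, d_hidden=64, d_z=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mu = nn.Linear(d_hidden, d_z)        # mean of Q(z|X)
        self.logvar = nn.Linear(d_hidden, d_z)    # log-variance of Q(z|X)
        self.dec = nn.Sequential(nn.Linear(d_z, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, d_in))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                # reparameterization trick
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar
```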

Assume z is a latent variable with distribution function p(z); marginalizing over z gives P(X):

$$ p(X) = \int p(X \mid z)\, p(z)\, dz $$
(1)

However, under the prior distribution of z, most values of z cannot generate reliable samples, that is, p(X|z) tends to 0 and so does p(X|z)p(z). To simplify the calculation, we only need to consider the values of z with large P(X|z), which are represented by P(z|X) from the encoder. But considering only this part of z cannot generate samples beyond the original data, so we assume a distribution for P(z|X) and let the decoder compensate for the error.

Q(z) is our assumption for the real posterior distribution; we use the KL divergence to measure the difference between the real distribution and the assumption:

$$ D(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx $$
(2)

Formula (2) shows that when two distributions are close, the KL divergence tends to 0. The objective of the VAE model is

$$ \arg\min\; D(Q(z) \,\|\, P(z \mid X)) $$
(3)

Applying formula (2) to formula (3):

$$ D[Q(z) \,\|\, P(z \mid X)] = \mathbb{E}_{z \sim Q}[\log Q(z) - \log P(z \mid X)] $$
(4)

Applying Bayes' rule to \( P(z \mid X) \) brings both \( P(X) \) and \( P(X \mid z) \) into the expression:

$$ D[Q(z) \,\|\, P(z \mid X)] = \mathbb{E}_{z \sim Q}[\log Q(z) - \log P(X \mid z) - \log P(z)] + \log P(X) $$
(5)

Rearranging (5) gives (6) below. Note that X is fixed and Q can be any distribution, not only one that maps X to the z values likely to reproduce X. Since we are interested in inferring P(X), it makes sense to construct a Q that depends on X and, in particular, one that makes \( D[Q(z) \,\|\, P(z \mid X)] \) small. Because \( \log P(X) \) is fixed, minimizing \( D[Q(z) \,\|\, P(z \mid X)] \) is equivalent to maximizing the right-hand side of (6). There, \( \log P(X \mid z) \) is the probability of X being decoded from z, computed as the cross-entropy or mean-squared error against the original sample, and the second term \( D[Q(z) \,\|\, P(z)] \) can be regarded as the difference between the assumed prior and the distribution of z produced by the encoder.

$$ \log P(X) - D[Q(z) \,\|\, P(z \mid X)] = \mathbb{E}_{z \sim Q}[\log P(X \mid z)] - D[Q(z) \,\|\, P(z)] $$
(6)
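Translating (6) into a training loss, a common sketch is shown below, assuming the PyTorch model sketched earlier and a standard-normal prior, for which the KL term has a closed form; the choice of mean-squared error for the reconstruction term is one of the two options mentioned above.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    """Negative of the right-hand side of (6): reconstruction error for
    E[log P(X|z)] (here mean-squared error) plus the closed-form
    KL divergence D[Q(z|X) || N(0, I)]."""
    recon = F.mse_loss(x_hat, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```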

3 The Proposed Method

In this paper, an over-sampling method based on the VAE is proposed, motivated by the idea that distribution information is important in over-sampling. Without any prior assumption on the real PDF of the minority samples, and without any independence assumption on the features, the proposed method automatically models the PDF from the original data. One practical issue remains: the data may contain discrete features, while the features generated by a network trained with stochastic gradient descent are continuous. These discrete features are therefore separated out before VAE training using formula (9); after the continuous features are generated, a 1-NN search matches each generated sample to its nearest original sample, and the generated continuous features are combined with the discrete features of that nearest sample into a new composite sample.

We do not have explicit information about whether a feature is discrete, so we treat a feature as discrete if it takes no more than 2 distinct values over the whole dataset. In fact, a feature with only one distinct value across the dataset is useless for classification anyway.

Given a training dataset \( X = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\} \), where \( x_i \in R^d \) is a d-dimensional sample and \( y_i \in \{0, 1\} \) is the label representing the negative or positive class. We use P and N to represent the positive and negative sample subsets, where P contains \( N_+ \) positive samples, N contains \( N_- \) negative samples, and \( N_+ + N_- = N \).

During the training of the VAE model, \( nelements_j \) represents the number of distinct feature values in the \( j \)-th dimension of the positive subset, as shown in (7):

$$ nelements_{j} = \left| \left\{ x_{ij} \mid 1 \le i \le N_{+} \right\} \right|, \quad 1 \le j \le d $$
(7)
$$ x_i = \{ x_{i1}, x_{i2}, \cdots, x_{ik} \} \cup \{ x_{i(k+1)}, \cdots, x_{id} \} $$
(8)
$$ \mathrm{s.t.} \quad \begin{cases} nelements_j > 2, & 1 \le j \le k \\ nelements_j \le 2, & k+1 \le j \le d \end{cases} $$
(9)

If \( nelements_j \) is no more than 2, feature j is discrete; otherwise it is continuous. The features in the positive subset are reordered into continuous features followed by discrete features, and the continuous features are used as the training set for the VAE:

$$ X_{trainvae} = \begin{bmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{N_+ 1} & \cdots & x_{N_+ k} \end{bmatrix} $$
(10)
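A small sketch of the split in (7)-(10), assuming the positive samples are stored in a NumPy array; the function name `split_features` is introduced here for illustration only.

```python
import numpy as np

def split_features(P):
    """Split positive samples P (N+ x d) into continuous and discrete columns
    following (7)-(9): a column with more than two distinct values is
    treated as continuous, otherwise as discrete."""
    nelements = np.array([len(np.unique(P[:, j])) for j in range(P.shape[1])])
    cont_idx = np.where(nelements > 2)[0]   # columns kept for VAE training
    disc_idx = np.where(nelements <= 2)[0]  # columns re-attached later by 1-NN
    X_trainvae = P[:, cont_idx]             # this is (10)
    return X_trainvae, cont_idx, disc_idx
```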

Train a VAE model with \( X_{trainvae} \) and sample from it randomly; let \( X_{new} \) denote a synthetic sample:

$$ X_{final_i} = X_{new_i} \cup \left\{ x_{lm} \mid k+1 \le m \le d \right\}, \quad \mathrm{where}\; l = \arg\min_{l} \sum_{j=1}^{k} \left( X_{new_{ij}} - x_{lj} \right)^2 $$
(11)

\( X_{final} \) is the final set of synthetic samples, and \( X \cup X_{final} \) is the final training set, called \( X_{ov} \).
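The recombination in (11) can be sketched as follows: for each generated continuous vector, find the nearest original positive sample in the continuous feature space and copy its discrete features. The helper name `attach_discrete` and the column-index arguments are conventions chosen for this sketch.

```python
import numpy as np

def attach_discrete(X_new, P, cont_idx, disc_idx):
    """For each synthetic sample (continuous features only), find its 1-NN
    among the original positive samples on the continuous features (the
    argmin in (11)) and append that neighbor's discrete feature values."""
    P_cont = P[:, cont_idx]
    final = []
    for x in X_new:
        l = np.argmin(np.sum((P_cont - x) ** 2, axis=1))  # nearest neighbor
        row = np.empty(P.shape[1])
        row[cont_idx] = x
        row[disc_idx] = P[l, disc_idx]
        final.append(row)
    return np.array(final)                                # X_final
```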

Algorithm 1. The proposed VAE-based over-sampling method.

The whole process is described as Algorithm 1. First, normalize the dataset to scale the range of the data; divide(X) is a function that splits the dataset into training and testing sets, and, to keep the class distribution unchanged in these subsets, the positive and negative samples are split separately. Second, select the features with more than two distinct values and use them as \( X_{trainvae} \). Third, train a VAE model and sample from the trained model; the generated samples are denoted \( X_{new} \). Finally, attach discrete features to the generated samples using their nearest neighbors' discrete features, giving \( X_{final} \) (Table 1).
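A high-level outline of Algorithm 1 as a sketch that ties the earlier helpers together (`split_features`, `attach_discrete`, the `VAE` class, and `vae_loss`); the normalization and train/test split steps are omitted, and the optimizer, learning rate, and epoch count are assumptions, not values from the paper.

```python
import numpy as np
import torch

def oversample_with_vae(X, y, n_new, epochs=200):
    """Sketch of Algorithm 1: train a VAE on the continuous minority features,
    sample new continuous vectors, and re-attach discrete features by 1-NN."""
    P = X[y == 1]                                   # positive (minority) subset
    X_trainvae, cont_idx, disc_idx = split_features(P)

    model = VAE(d_in=X_trainvae.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    data = torch.tensor(X_trainvae, dtype=torch.float32)
    for _ in range(epochs):                         # train the VAE on continuous features
        x_hat, mu, logvar = model(data)
        loss = vae_loss(data, x_hat, mu, logvar)
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                           # sample z ~ N(0, I) and decode
        z = torch.randn(n_new, model.mu.out_features)
        X_new = model.dec(z).numpy()

    X_final = attach_discrete(X_new, P, cont_idx, disc_idx)
    return np.vstack([X, X_final]), np.concatenate([y, np.ones(n_new)])  # X_ov
```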

Table 1. Dataset.

4 Experiment

4.1 Dataset and Evaluation

All datasets used in this paper are from the UCI Machine Learning Repository [20]. Some of them are multi-class datasets, so we select one class as the minority class and treat the remaining samples as the majority class. Missing values are filled with the most frequent value. After that, the data are normalized using the formula shown in (12):

$$ x_{i,new} = \frac{x_i - \bar{x}}{s} $$
(12)
$$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} $$
(13)
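A small sketch of the per-feature standardization in (12)-(13), assuming NumPy; computing the statistics on the training split and reusing them for the test split is a common convention, not something stated in the paper.

```python
import numpy as np

def standardize(X_train, X_test):
    """z-score each feature using the training statistics of (12)-(13)."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)   # unbiased estimate, as in (13)
    return (X_train - mean) / std, (X_test - mean) / std
```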

In traditional classification, global accuracy is used as the evaluation metric, but in imbalanced problems this masks the performance on the minority class. In an extreme case, if the dataset contains only 1% minority samples and the classifier labels every sample as majority, the accuracy still reaches 99% even though the recognition rate of the minority class is 0. In binary imbalanced classification, the confusion matrix in Table 2 is often used to evaluate the performance of a classifier.

Table 2. Confusion matrix.

Here, FN is the number of positive samples misclassified as negative, and FP is the number of negative samples misclassified as positive. Several evaluation metrics based on the confusion matrix, such as the F-value and G-mean [21], measure the precision and recall on imbalanced data.

$$ precision\; = \;\frac{TP}{TP + FP} $$
(14)
$$ recall\; = \;\frac{TP}{TP + FN} $$
(15)
$$ F - value\; = \; \frac{{(1 + \beta^{2} )\, \times \,recall\, \times \,precision}}{{\beta^{2} \, \times \,recall\, + \,precision}} $$
(16)

where \( \beta \in [0, +\infty) \).

$$ Gmean\; = \;\sqrt {\frac{\text{TP}}{\text{TP + FN}}\, \times \,\frac{\text{TN}}{\text{TN + FP}}} $$
(17)

In this experiment, we choose \( \beta = 1 \) for the F-value, which makes it the harmonic mean of recall and precision. Gmean is the geometric mean of the classification accuracies on the minority and majority classes; it is high only when the accuracies of both classes are high at the same time.
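The metrics in (14)-(17) follow directly from the confusion-matrix counts of Table 2; a minimal sketch (the function name is introduced here for illustration):

```python
def imbalance_metrics(TP, FP, TN, FN, beta=1.0):
    """Precision, recall, F-value (14)-(16) and Gmean (17) from the
    confusion-matrix counts of Table 2."""
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f_value = (1 + beta ** 2) * recall * precision / (beta ** 2 * recall + precision)
    gmean = (TP / (TP + FN) * TN / (TN + FP)) ** 0.5
    return precision, recall, f_value, gmean
```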

4.2 Experiment Results

We compare the proposed method with other over-sampling algorithms: NDO-sampling [17] and the random interpolation algorithm SMOTE [12] (SMO). The classifier is naïve Bayes, chosen to reduce the impact of classifier parameters on classification performance. To reduce randomness in the final results, each algorithm reports the average of 10 runs of 10-fold cross-validation. The results of NDO are taken from the corresponding paper, k = 5 is used in SMOTE, the structure of the proposed method is that shown in Fig. 1, and new samples are generated by random sampling from the trained model.
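A sketch of this evaluation protocol, assuming scikit-learn's GaussianNB and StratifiedKFold; the `oversampler` argument is any callable that returns an augmented training set (for example, a wrapper around `oversample_with_vae` with a fixed over-sampling amount), which is an assumption of this sketch rather than the paper's exact setup.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

def evaluate(X, y, oversampler, n_repeats=10, rng_seed=0):
    """Average minority-class F1 over n_repeats runs of 10-fold CV,
    oversampling only the training folds."""
    scores = []
    for r in range(n_repeats):
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=rng_seed + r)
        for train_idx, test_idx in skf.split(X, y):
            X_tr, y_tr = oversampler(X[train_idx], y[train_idx])
            clf = GaussianNB().fit(X_tr, y_tr)
            scores.append(f1_score(y[test_idx], clf.predict(X[test_idx]), pos_label=1))
    return np.mean(scores)
```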

The results in Table 3 indicate that the VAE generates better samples than NDO and SMOTE at the same over-sampling amount, as it can generate more reasonable samples carrying more information. As the over-sampling rate grows, all sampling methods help to improve classification performance, which indicates that the original minority samples do not contain enough information for a classifier to separate them correctly from the negative samples.

Table 3. F1-min of different algorithms and oversampling rate.

Meanwhile, the results in Table 4 show that, compared with traditional over-sampling algorithms, which sacrifice some majority-class performance to improve the minority class, the proposed method preserves a reasonable distribution of synthetic samples and also improves the classification performance on the majority samples, indicating a stronger classifier.

Table 4. F1-maj of different algorithms and oversampling rate.

The proposed method produces more reasonable samples, as can be concluded from the results in Table 5: a classifier trained with samples generated by the proposed method achieves better overall classification performance, since Gmean is the geometric mean of the accuracies on the minority and majority samples, and at higher over-sampling rates the classifier obtains the best results with the proposed method.

Table 5. Gmean of different algorithms and oversampling rate.

The experimental results also show that, for all over-sampling methods, a higher over-sampling rate leads to better classification performance, with the best performance reached when the over-sampled minority class equals the majority class in size. This suggests that sample size alone has a limited effect on classification performance; more informative samples and a stronger classifier play the bigger role.

5 Conclusion

In this paper, we propose an over-sampling algorithm based on the VAE in order to make full use of the distribution information in the dataset. It generates more reasonable samples with no prior assumption on the real distribution and no assumption that the features are independent. Moreover, we separate the features into discrete and continuous ones and copy the discrete features of the nearest original sample onto each generated sample, so that the synthetic samples are as meaningful as possible. The experimental results prove the effectiveness of the proposed method: it improves the overall performance rather than only that of the minority class. The sampling procedure is still too rough to guarantee the generated samples' impact on the classifier, and overcoming this drawback is left for future work.