
1 Introduction

In binary classification, it is often the case that one class is much more heavily represented than the other. The over-represented class is usually referred to as the negative or majority class, and the other as the positive or minority class (we use these terms interchangeably in this paper). Most standard algorithms (such as k-NN, SVM, decision trees and deep neural networks) are designed to be trained on balanced data. Given imbalanced training data, they have difficulty learning proper decision rules and produce imbalanced performance on test data; in other words, they are more likely to make mistakes on minority class instances than on majority class ones.

Most existing explanations attribute the imbalanced performance phenomenon to the imbalanced training data. Therefore, many approaches which aim to balance the data distribution have been proposed. For example, random over-sampling creates positives by randomly replicating minority class instances. Although the data distribution can be balanced in this way, it creates no new information and easily leads to over-fitted classifiers. SMOTE [1] mitigates this weakness by synthesizing new, non-replicated positives; however, it is still believed to be insufficient to solve the imbalanced data problem. Another way to balance the data distribution is to under-sample the over-represented negatives. For example, random under-sampling, like random over-sampling, selects the negatives randomly to construct a balanced set of positives and negatives. There are also several more advanced under-sampling methods, such as One-Sided Selection (OSS) [2] and the Nearest Neighbor Cleaning Rule [3]. These methods try to remove redundant, borderline or mislabeled samples and strive for more reliable ones. However, they are suited to low-dimensional, highly structured training data of limited size, usually in vector form; on data that is high-dimensional, unstructured and large in quantity, they perform unsatisfactorily.

In this paper, in addition to the imbalanced training data, we explore deeper and more fundamental explanations for imbalanced performance in deep neural networks. We choose deep neural networks to study the imbalance problem for two reasons. Firstly, deep neural networks have achieved great success on various problems in the last several years, such as image classification [4], object detection [5] and machine translation [6], and have become the most popular models in machine learning. Secondly, usually used as end-to-end models, they can handle raw input data directly without complex preprocessing, and are therefore well suited to high-dimensional, unstructured raw training data. Using deep neural networks, we find that imbalanced data is neither a sufficient nor a necessary condition for imbalanced performance. By this we mean:

  • balanced training data can also lead to imbalanced performance on test data (shown in Fig. 1b and c), and

  • classifiers trained on imbalanced training data do not necessarily produce imbalanced performance on test data (shown in Fig. 1d).

Fig. 1. Illustration of the bias of a linear separator induced over training data in a one-dimensional example. Triangles and circles denote positive and negative samples, respectively. The corresponding latent distributions are shown. The dotted line (\(\hat{w}\)) is the hypothesis induced over the training data, while the solid line (\(w^*\)) depicts the optimal separator of the underlying distributions.

Additionally, we find that another important factor in imbalanced performance is the distance between the negative samples and the positives (or the latent decision hyperplane). Specifically, sampling negatives which are distant from the positives pushes the learnt decision hyperplane towards the space of the negatives (shown in Fig. 1b). On the other hand, sampling negatives which are near the positives, or aggressive ones (which tend to be mislabeled and invade the space of the positives), pushes the learnt decision hyperplane towards the space of the positives (shown in Fig. 1c). Neither of these two sampling strategies balances the performance on the two classes of test data. Based on these observations, we propose to sample the negatives which are moderately distant from the positives (shown in Fig. 1e). These negatives are less likely to be mislabeled and are more reliable than the nearest ones, while being more informative than those which are distant from the positives. We call the proposed under-sampling method Moderate Negative Mining (MNM). Experiments conducted on various datasets validate the proposed approach.

2 Related Work

There is a large literature on the imbalanced data problem. Existing approaches can be roughly categorized into two groups: internal approaches [7, 8] and external approaches [1,2,3, 9]. Internal approaches modify existing algorithms or create new ones to handle the imbalanced classification problem. For example, cost-sensitive learning algorithms impose a heavier penalty on misclassifying minority class instances than on misclassifying majority class ones [8]. Although internal approaches are effective in certain cases, they have the disadvantage of being algorithm specific; that is, it might be quite difficult, or sometimes impossible, to transfer the modification proposed for one classifier to others [9]. External approaches, on the other hand, leave the learning algorithms unchanged and alter the data distribution to solve the imbalanced classification problem. They are also known as data re-sampling methods and can be divided into two groups: over-sampling and under-sampling methods. Over-sampling methods increase the quantity of minority class instances by replicating them or synthesizing new ones to balance the data distribution [1]; they are criticized for the over-fitting problem [10]. Under-sampling methods, in the other direction, under-sample the majority class instances to balance the distribution. For example, One-Sided Selection (OSS) [2] adopts Tomek links [11] to eliminate borderline and noisy examples, and the Nearest Neighbor Cleaning Rule [3] decreases the size of the negatives by removing unreliable ones. Both approaches are based on a simple nearest-neighbor assumption and therefore have limited performance when used with deep neural networks.

Despite the large literature on the imbalance problem, few works [12,13,14] study the imbalanced classification problem in deep neural networks. In [12], over-sampling and under-sampling are combined through a complementary neural network and the SMOTE algorithm to solve the imbalanced classification problem. The effects of sampling and threshold-moving in training cost-sensitive neural networks are studied in [13]. It is demonstrated in [14] that more discriminative deep representations can be learned by enforcing a deep network to maintain both inter-cluster and inter-class margins. In this paper, we explore further insights into learning from imbalanced data with deep neural networks and propose a new under-sampling method to solve the imbalanced classification problem. We believe our approach, which can be categorized as an external approach, is a good complement to existing approaches for solving the imbalanced classification problem in deep neural networks.

3 Imbalanced Data and Imbalanced Performance

In this section, we empirically validate our first observation: imbalanced data is neither a sufficient nor a necessary condition for imbalanced performance in deep neural networks. The validation consists of two stages. In the first stage, we show that balanced training data can also lead to imbalanced performance. In the second stage, we show that imbalanced data does not necessarily lead to imbalanced performance. The experimental settings of these two stages are largely the same, so we first briefly introduce them.

We use the publicly available CIFAR-10 [15] dataset to construct the imbalanced datasets. CIFAR-10 has 10 classes. To construct imbalanced datasets for the binary classification problem, we treat one class as the positive class and the remaining nine as the negative class. In this way, we construct 10 imbalanced datasets (CIFAR10-AIRPLANE, CIFAR10-AUTOMOBILE, CIFAR10-BIRD, CIFAR10-CAT, CIFAR10-DEER, CIFAR10-DOG, CIFAR10-FROG, CIFAR10-HORSE, CIFAR10-SHIP, CIFAR10-TRUCK). The original training images are used for training, and the test images are split equally into two groups, one for validation and the other for testing.
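As a concrete illustration, the following is a minimal sketch of this one-vs-rest construction using torchvision's CIFAR-10 loader; the seeded random half/half split of the test images is our assumption, not necessarily the exact protocol used in the experiments.

```python
# Sketch: build a binary (one-vs-rest) CIFAR-10 dataset, e.g. CIFAR10-AIRPLANE.
# The 50/50 validation/test split below uses a fixed seed, which is an assumption.
import numpy as np
from torchvision.datasets import CIFAR10

def make_one_vs_rest(root, positive_class, seed=0):
    train = CIFAR10(root, train=True, download=True)
    test = CIFAR10(root, train=False, download=True)

    # Relabel: 1 for the chosen positive class, 0 for the other nine classes.
    x_train = np.asarray(train.data)
    y_train = (np.asarray(train.targets) == positive_class).astype(np.int64)

    x_test_all = np.asarray(test.data)
    y_test_all = (np.asarray(test.targets) == positive_class).astype(np.int64)

    # Split the original test images equally into validation and test halves.
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(x_test_all))
    half = len(idx) // 2
    val_idx, test_idx = idx[:half], idx[half:]

    return (x_train, y_train), \
           (x_test_all[val_idx], y_test_all[val_idx]), \
           (x_test_all[test_idx], y_test_all[test_idx])

# Example: CIFAR10-AIRPLANE (class index 0 in torchvision's CIFAR-10).
train_xy, val_xy, test_xy = make_one_vs_rest("./data", positive_class=0)
```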

In this paper, we study the imbalanced classification problem in deep neural networks. Our networks consist of 6 convolutional layers, each followed by a ReLU [18] nonlinear activation. The detailed architecture is listed in Table 1, where the number following each @ symbol denotes the number of kernels in the corresponding convolutional layer.

Table 1. Architecture of deep neural networks

These networks are optimized with the Stochastic Gradient Descent (SGD) algorithm. The base learning rate is set to 0.01 and decreases by \(50\%\) every 10 epochs. The maximum number of epochs is set to 40, and the batch size is set to 256.
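The optimisation settings above can be written down as the following PyTorch sketch. The six-convolutional-layer network is only indicative: the channel widths from Table 1 are not reproduced here, so those in the sketch, as well as the pooling layers, are placeholders of our own.

```python
# Sketch of the optimisation setup: SGD, base lr 0.01 halved every 10 epochs,
# 40 epochs in total, batch size 256. Channel widths are placeholders.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # One convolutional layer followed by a ReLU activation.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())

model = nn.Sequential(
    conv_block(3, 32), conv_block(32, 32), nn.MaxPool2d(2),
    conv_block(32, 64), conv_block(64, 64), nn.MaxPool2d(2),
    conv_block(64, 128), conv_block(128, 128),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 2),
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
criterion = nn.CrossEntropyLoss()

# for epoch in range(40):
#     for x, y in loader:            # DataLoader with batch_size=256
#         optimizer.zero_grad()
#         loss = criterion(model(x), y)
#         loss.backward()
#         optimizer.step()
#     scheduler.step()
```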

3.1 Balanced Training Data Can Lead to Imbalanced Performance

Balanced training data can also lead to imbalanced performance, as shown in Fig. 1b and c. In Fig. 1b, the negatives which are distant from the positives are sampled; on the contrary, samples which are near or across the boundary are sampled in Fig. 1c. The intuition illustrated in Fig. 1 is simple, but it leaves open how to measure the distance between two data points, especially for high-dimensional raw data such as images. We propose a simple method to approach this problem, sketched below. Firstly, we randomly under-sample the negatives to balance the training data and use the balanced data to train a basic network, which we call the metric network. The metric network follows the same architecture as the final classifier (although it could have a different one, as shown in [16]) and is used to calculate the distance between the negatives and the decision boundary in its embedding space. Finally, we sample the same number of negatives as positives to construct balanced training data. To make the performance of the trained classifier imbalanced, we sample only those negatives which are most (or least) distant from the decision boundary (or the positives).
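A minimal sketch of the first two steps, random under-sampling and metric-network training, is given below; train_network and predict_p0 are hypothetical helpers standing in for the training and inference code described above.

```python
# Sketch: randomly under-sample the negatives to balance the data, then train
# a "metric network" whose softmax output p0 will later rank the negatives by
# their distance to the decision boundary.
import numpy as np

def random_undersample(x, y, seed=0):
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    rng = np.random.RandomState(seed)
    # Keep as many randomly chosen negatives as there are positives.
    neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    keep = np.concatenate([pos_idx, neg_keep])
    return x[keep], y[keep]

# x_bal, y_bal = random_undersample(x_train, y_train)
# metric_net = train_network(x_bal, y_bal)   # same architecture as the final classifier
# p0_neg = predict_p0(metric_net, x_train[y_train == 0])  # negative-class probability per negative
```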

In the trained embedding space, we assume the optimal decision hyperplane is approximately the decision hyperplane learnt by the classifier (a reasonable assumption because the embedding space and the classifier are optimized jointly). For a softmax classifier, we usually take 0.5 as the decision threshold in a two-class classification problem. The predicted probability of a softmax classifier is

$$\begin{aligned} p_i = \frac{e^{W_i^{T}\cdot \text {x}}}{\sum _{j}{e^{W_j^{T}\cdot \text {x}}}}, \end{aligned}$$
(1)

where \(i \in \{0, 1\}\), \(W_i\) denotes the learnt weights of the softmax classifier for class \(i\), and \(\text {x}\) is a data point in the embedding space. The decision hyperplane is thus

$$\begin{aligned} (W_{0} - W_1)^{T} \cdot \text {x} = 0. \end{aligned}$$
(2)

For any negative data point \(\text {x}\), its signed distance to the decision hyperplane is

$$\begin{aligned} \begin{aligned} d&=\frac{(W_{0} - W_1)^{T} \cdot \text {x}}{||W_{0} - W_1||} \\&\varpropto (W_{0} - W_1)^{T}\cdot \text {x} \\&= \ln {(p_0\cdot \sum _{j}{e^{W_j^T\cdot \text {x}}})} - \ln {(p_1\cdot \sum _{j}{e^{W_j^T\cdot \text {x}}})}\\&= \ln {\frac{p_0}{1-p_0}} \\&\varpropto p_0. \end{aligned} \end{aligned}$$
(3)

Since \(\ln \frac{p_0}{1-p_0}\) is monotonically increasing in \(p_0\), the predicted probability is a good indicator of the distance between a data point and the decision hyperplane in the embedding space.

In our experiments, to give a better understanding of the validation method proposed above, we categorize the negatives into 9 groups (group 0 to group 8) based on their predicted probabilities, since the negatives are nine times as numerous as the positives. From group 0 to group 8, the predicted probability of the target label (\(p_0\)) goes from the largest to the smallest, and hence the distance between the negatives and the decision hyperplane goes from the largest to the smallest. Every group has the same number of negatives. Therefore, for every constructed imbalanced dataset, we can construct 9 balanced training datasets from its 9 groups of negatives and all the positives, as in the sketch below.
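The grouping step might look as follows, assuming p0_neg holds the metric network's predicted probability of the negative (target) label for every negative in the training set; the helper name group_negatives is ours.

```python
# Sketch: rank all negatives by p0 and split them into 9 equally sized groups.
# Group 0 holds the negatives farthest from the decision boundary (largest p0),
# group 8 the nearest ones (smallest p0).
import numpy as np

def group_negatives(p0_neg, n_groups=9):
    order = np.argsort(-p0_neg)                # descending p0: most distant first
    return np.array_split(order, n_groups)     # list of index arrays, groups 0..8

# groups = group_negatives(p0_neg)
# balanced dataset k = all positives + negatives indexed by groups[k]
```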

We use sensitivity and specificity to measure the imbalanced performance. Sensitivity, also called the true positive rate, is defined as

$$\begin{aligned} sensitivity = \frac{TP}{TP + FN}. \end{aligned}$$
(4)

Specificity, also called true negative rate, is defined as

$$\begin{aligned} specificity = \frac{TN}{TN + FP}. \end{aligned}$$
(5)
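For reference, these two measures can be computed from hard predictions as in the following sketch (labels: 1 for the positive class, 0 for the negative class); the helper name is ours.

```python
# Sketch: sensitivity and specificity following Eqs. (4) and (5).
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

# sens, spec = sensitivity_specificity(y_test, classifier_predictions)
```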

For a fair comparison, we keep all experimental settings the same during the training of every deep classifier, except for the training data. In Fig. 2, we plot the sensitivity and specificity of classifiers trained on data comprising the negatives of one group and all the positives.

As shown in Fig. 2, the performance of classifiers trained on the balanced datasets comprising the negatives in group 0 and all the positives is rather imbalanced: the sensitivity is generally much larger than the specificity for group 0 on all datasets. As the group number increases within a certain range (i.e. as the distance between the negative instances and the decision hyperplane decreases), the difference between sensitivity and specificity becomes smaller and the imbalanced performance gradually disappears. However, as the group number increases further, the difference rebounds and the imbalance arises again. This experiment empirically validates that balanced training data can also lead to imbalanced performance on test data.

Fig. 2. Sensitivity (solid lines) and specificity (dashed lines) of classifiers trained on data comprising all the positives and the negatives in one group. A solid line and a dashed line of the same color correspond to the same classifier. (Color figure online)

3.2 Imbalanced Training Data Can Lead to Balanced Performance

Fig. 2 shows that balanced training data can lead to imbalanced performance. Based on this finding, we can balance the performance by altering the data distribution. For example, classifiers trained on the negatives in group 0 tend to make more mistakes on negatives than on positives; if we increase the number of negatives in the training data, the imbalanced performance can be balanced. In our experiment, we keep the positives unchanged and increase the number of negatives, starting from group 0, to balance the performance. Each time we increase the number of negatives, we add to the training data the most distant negatives not yet contained in it. Figure 3 shows how the performance of the trained classifier changes as the ratio between the number of negatives and positives in the CIFAR10-AUTOMOBILE dataset increases. When the ratio is about 6 (in which case the training data is rather imbalanced), the performance becomes balanced. Although we only plot results for the CIFAR10-AUTOMOBILE dataset here, experiments on the other datasets give similar results.
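A possible sketch of this ratio sweep is given below; train_fn and eval_fn are hypothetical callbacks that train a classifier on the given training indices and return its (sensitivity, specificity) on the test split.

```python
# Sketch: keep all positives, start from the group-0 negatives, and repeatedly
# add the most distant negatives not yet used, retraining at each ratio.
import numpy as np

def sweep_negative_ratio(pos_idx, neg_idx_by_distance, ratios, train_fn, eval_fn):
    results = []
    for r in ratios:                              # e.g. ratios = range(1, 10)
        n_neg = int(r * len(pos_idx))
        neg_idx = neg_idx_by_distance[:n_neg]     # most distant negatives first
        model = train_fn(np.concatenate([pos_idx, neg_idx]))
        results.append((r, eval_fn(model)))       # (ratio, (sensitivity, specificity))
    return results
```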

Fig. 3. Performance of the trained classifiers as we change the ratio between the number of negatives and positives, on the CIFAR10-AUTOMOBILE dataset (chosen at random).

4 Moderate Negative Mining

Although the method in Sect. 3.2 can balance classifiers' performance on test data, it requires carefully tuning the number of negatives. In addition, in practice we desire not only more balanced but also better performance. In this section, we propose a new under-sampling method named Moderate Negative Mining (MNM) to help deep neural networks learn from imbalanced data. In our experiments, the proposed approach not only solves the imbalance problem but also produces better overall performance than other existing under-sampling approaches.

As shown in Sect. 3, both the most and the least distant negatives lead to imbalanced performance, for two reasons. Firstly, the least distant negatives tend to be unreliable or mislabeled, and such negatives have been shown to degrade the performance of trained classifiers. Secondly, the most distant negatives are far away from the decision hyperplane and are therefore not informative enough to represent the space of the negatives. Moderate Negative Mining exploits a simple strategy to under-sample the over-represented negatives: we hypothesize that the moderately distant negatives, lying between the most and the least distant ones, are ideal for training the classifier. This is a reasonable assumption because the moderately distant negatives are less likely to be polluted by mislabeled samples or by redundant, uninformative ones.

As described in Sect. 3, we use the predicted probability to measure the distance between a negative instance and the decision hyperplane. For a better comparison, in the same way as in Sect. 3.1, we categorize the negatives into 9 groups (from the most distant negatives in group 0 to the least distant ones in group 8) and report the performance of classifiers trained on the negatives in each group. We use deep neural networks with the same architecture as described in Sect. 3, and all experimental settings are kept the same during training except for the training data.

Fig. 4. AUROC (solid lines) and AUPR (dashed lines) of deep classifiers trained on data comprising all the positives and the negatives in one group. Horizontal lines denote the performance of random under-sampling.

Table 2. Performance on CIFAR-10

Since a deep classifier outputs a score for both the positive and the negative class, we must specify a score threshold (usually 0.5 for binary classification) to trade off false positives against false negatives. When evaluating performance, however, we want a measure of overall quality that is independent of the score threshold. We sidestep the threshold selection problem by employing AUROC (Area Under the Receiver Operating Characteristic curve) and AUPR (Area Under the Precision-Recall curve). Figure 4 depicts how the AUROC and AUPR values change as we choose different groups of negatives as training data. For almost all datasets, we can find a group (or several groups) which achieves better performance than random under-sampling.
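Both threshold-free metrics can be computed with scikit-learn, as in the following sketch; the scores are assumed to be the classifier's predicted probabilities of the positive class.

```python
# Sketch: threshold-free evaluation with AUROC and AUPR via scikit-learn.
from sklearn.metrics import roc_auc_score, average_precision_score

def auroc_aupr(y_true, pos_scores):
    # pos_scores: predicted probability of the positive class for each sample.
    return (roc_auc_score(y_true, pos_scores),
            average_precision_score(y_true, pos_scores))
```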

The proposed Moderate Negative Mining (MNM) method selects the best classifier on the validation data, as sketched below. For comparison with other under-sampling methods, Table 2 lists the AUROC and AUPR values of the proposed method on test data, together with those of Random Under-Sampling (RUS), One-Sided Selection (OSS) [2] and the Nearest Neighbor Cleaning Rule (NCL) [3], all well-known under-sampling methods for the imbalanced data problem. The proposed method achieves better performance on almost all datasets, and its superiority is more obvious in PR space than in ROC space, because PR curves are more informative than ROC curves when the data is imbalanced [17].
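The selection procedure can be summarized by the following sketch; train_fn and val_score_fn are hypothetical helpers, and scoring candidates by validation AUPR is our assumption about the concrete selection criterion.

```python
# Sketch of MNM as a selection procedure over the 9 balanced candidate sets:
# train one classifier per group of negatives and keep the one that scores
# best on the validation split.
def moderate_negative_mining(pos_idx, groups, train_fn, val_score_fn):
    best_model, best_group, best_score = None, None, float("-inf")
    for g, neg_idx in enumerate(groups):        # groups built as in Sect. 3.1
        model = train_fn(pos_idx, neg_idx)      # balanced: all positives + one group
        score = val_score_fn(model)             # e.g. AUPR on the validation set
        if score > best_score:
            best_model, best_group, best_score = model, g, score
    return best_model, best_group
```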

5 Discussion

Although imbalanced performance is often believed to be caused by imbalanced training data, we showed in Sect. 3 that imbalanced training data is neither a sufficient nor a necessary condition. Therefore, besides constructing balanced training data, we can also construct imbalanced training data to balance the classifier's performance on the positives and negatives. For brevity, we call the former the balance method and the latter the imbalance method.

Since deep neural networks usually perform better with more training data, we might expect better performance from the imbalance method. However, our experiments show that the balance method achieves better performance. Although the imbalance method trains the model with more examples, most of the sampled negatives are easy ones that carry little information for learning the decision rule. This is further validated by the proposed Moderate Negative Mining method, in which we discard both the easy negatives and the hardest ones; both types of negatives can hinder the learning of the classifier.

6 Conclusion

In this paper we empirically show that imbalanced data is neither a sufficient nor a necessary condition for imbalanced performance on test data, although imbalanced training data is often believed to be the chief culprit. We find that another important factor in imbalanced performance is the distance between the majority class instances and the decision boundary. Based on these observations, we propose a new under-sampling method, named Moderate Negative Mining, to solve the imbalanced classification problem in deep neural networks. Several experiments demonstrate the superiority of the proposed method.

In future work, we aim to select the majority class instances directly rather than by tedious validation. Furthermore, we would like to formulate negative under-sampling as a learning problem, which is challenging because the sampling decisions are discrete.