
1 Introduction

In supervised classification, imbalanced data arises when some classes have far fewer objects than others. Usually, the minority class (the class with the fewest objects) is the class of interest in a class imbalance problem. However, when working on imbalanced datasets, classifiers tend to be biased towards the majority class (the class with the most objects), resulting in poor classification performance for the minority class. This behavior becomes more pronounced as the imbalance among classes grows.

Some researchers have tackled the class imbalance problem [12, 13] through oversampling methods [4–10]. Oversampling methods have the advantage of being independent of the classification method to be used, since they generate synthetic objects based only on the training set. SMOTE [2] is one of the best known and most widely used oversampling methods; it generates synthetic objects along the line segments joining objects of the minority class with some of their nearest neighbors. Thus, by increasing the number of objects of the minority class, SMOTE tries to balance the number of objects across all classes. SMOTE has a random component for generating synthetic objects, producing a different result each time it is applied. Therefore, if SMOTE is applied several times, choosing the best result becomes an issue.
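To make the contrast with our deterministic method concrete, the following minimal Python sketch (our illustration, not the original SMOTE implementation) shows SMOTE's core interpolation step; the random gap is what makes each run produce a different result.

```python
# Minimal sketch of SMOTE's interpolation step (illustrative, not the original code).
import numpy as np

def smote_interpolate(x, neighbor, rng):
    """Place one synthetic object on the segment joining x and one of its neighbors."""
    gap = rng.random()               # random position in [0, 1); source of non-determinism
    return x + gap * (neighbor - x)

rng = np.random.default_rng(0)
x, neighbor = np.array([1.0, 2.0]), np.array([3.0, 4.0])
synthetic = smote_interpolate(x, neighbor, rng)  # differs with each new rng draw
```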

In this paper, we propose a new oversampling method based on SMOTE, which computes, in a deterministic way, how many new synthetic objects should be generated from each object of the minority class and where these new objects should be placed. According to our experiments, the proposed method performs better than SMOTE on datasets with different imbalance levels.

The rest of this document is organized as follows: in Sect. 2, some related works are described; in Sect. 3, the proposed method is introduced; in Sect. 4, the experimental setup is described; in Sect. 5, the experimental results are shown; and in Sect. 6, our conclusions and some future work directions are discussed.

2 Related Work

In the literature, there are two types of extensions of SMOTE: those which combine SMOTE with other methods, like noise filters (SMOTE-IPF) [6], subsampling methods (SMOTE-RSB*) [5], or feature selectors (E-SMOTE) [7]; and those that modify SMOTE, like Borderline-SMOTE [4], Safe-Level-SMOTE [8], SMOTE-OUT [9], SMOTE-COSINE [9], or Random-SMOTE [10]. Our work belongs to this last kind of method. Thus, we briefly describe some methods that modify SMOTE:

Borderline-SMOTE oversamples only the objects on the borderline of the minority class. First, it identifies the borderline objects of the minority class; then, synthetic objects are generated from these objects and added to the original training set. Borderline-SMOTE works as follows:

  • For each object in the minority class, its k nearest neighbors, taken from the whole training set, are calculated.

  • If the k nearest neighbors contain objects from both the minority and the majority class, and the number of neighbors in the majority class is larger than the number of neighbors in the minority class, the object is considered a borderline object.

For each borderline object, Borderline-SMOTE computes its k nearest neighbors in the minority class and generates synthetic objects in the same way as SMOTE.
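A possible sketch of the borderline test just described, assuming scikit-learn's NearestNeighbors for the neighbor search over the whole training set (X, y) and y_min as the minority label; this is our illustration, not the authors' code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def borderline_objects(X, y, y_min, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each object is its own neighbor
    _, idx = nn.kneighbors(X[y == y_min])
    borderline = []
    for x, neighbors in zip(X[y == y_min], idx[:, 1:]):  # drop the object itself
        n_maj = np.sum(y[neighbors] != y_min)
        # mixed neighborhood where majority neighbors outnumber minority ones
        if n_maj < k and n_maj > k - n_maj:
            borderline.append(x)
    return np.array(borderline)
```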

Safe-Level-SMOTE (Safe-Level Synthetic Minority Oversampling Technique) assigns to each object in the minority class its safe level before generating synthetic objects. Each synthetic object is positioned closer to the object with the largest safe level; in this way, all the synthetic objects are generated in safe regions. The safe level of an object is defined as the number of minority class objects among its k nearest neighbors.
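The safe-level computation can be sketched in the same style (again an assumed helper, not the original implementation): for each minority object, count the minority objects among its k nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def safe_levels(X, y, y_min, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[y == y_min])
    # safe level of each minority object = minority objects among its k neighbors
    return np.sum(y[idx[:, 1:]] == y_min, axis=1)
```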

SMOTE-OUT randomly generates synthetic objects outside the line segment of attributes joining an object of the minority class and its nearest neighbor of the majority class. Once a synthetic object has been generated, SMOTE-OUT finds its nearest object in the minority class and randomly generates another synthetic object along the line of attributes between them.

SMOTE-Cosine works like SMOTE, but it computes the k nearest neighbors by voting using two distance metrics (Euclidean and cosine).

Random-SMOTE generates temporary synthetic objects along the line of attributes between two randomly selected objects of the minority class. Afterwards, synthetic objects are generated along the line of attributes between each temporary synthetic object and one object of the minority class.
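As we read it, Random-SMOTE's two-step interpolation could be sketched as follows (a hypothetical rendering of the description above; X_min holds the minority-class objects row-wise).

```python
import numpy as np

def random_smote_one(X_min, rng):
    i, j, l = rng.choice(len(X_min), size=3)
    # temporary synthetic object between two randomly selected minority objects
    temp = X_min[i] + rng.random() * (X_min[j] - X_min[i])
    # final synthetic object between the temporary object and a minority object
    return X_min[l] + rng.random() * (temp - X_min[l])
```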

3 Proposed Method

The proposed method takes into account the distances between each object of the minority class and its k nearest neighbors in the same class in order to determine how many synthetic objects should be generated from each object. For each object in the minority class, the higher the dispersion of the distances to its k nearest neighbors, the higher the number of synthetic objects to generate. Additionally, the distance between an object and each one of its k nearest neighbors is also taken into account individually, to determine how many objects should be generated between each pair of objects: the larger the distance between a pair of objects, the greater the number of synthetic objects generated between them. The new synthetic objects are generated by dividing the attribute difference between two objects by the number of objects to be generated between them plus one. In this way, the synthetic objects are created in a deterministic and uniform way.

For evaluating the dispersion of the objects around each \(object_{i}\) of the minority class, we propose to use the standard deviation (\(\sigma _{i}\)) of the distances between \(object_{i}\) and its k nearest neighbors (\(object_{ij}\), \(j=1,\dots,k\)). We propose generating, around each \(object_{i}\) in the minority class, a number of synthetic objects proportional to the fraction that its standard deviation of distances (\(\sigma _{i}\)) represents with respect to the sum of the standard deviations of distances computed for all the objects in the minority class (\(\sum _{i=1}^{m}\sigma _{i}\), where m is the number of objects in the minority class). Thus, more objects will be created around the objects of the minority class whose distances to their k nearest neighbors have a higher standard deviation.
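A short sketch of this dispersion step, assuming Euclidean distances and scikit-learn's NearestNeighbors (the experiments also use HVDM, which is not sketched here):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dispersion_proportions(X_min, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    dist, _ = nn.kneighbors(X_min)    # column 0 holds each object's zero self-distance
    sigma = dist[:, 1:].std(axis=1)   # sigma_i over the k neighbor distances
    return sigma / sigma.sum()        # p_i = sigma_i / sum of all sigma
```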

Fig. 1. Three objects of a minority class with distance values and standard deviation of distances for 3-nearest neighbors.

After determining the proportion (\(p_{i}\)) of synthetic objects to generate from each object of the minority class, we calculate the proportion (\(p_{ij}\)) of synthetic objects to generate between each object of the minority class and each one of its k nearest neighbors. This proportion is calculated as the fraction that the distance between the object and each nearest neighbor represents with respect to the sum of the distances between the object and all of its k nearest neighbors.

In Fig. 1, we can see an example with three objects of the minority class and their 3-nearest neighbors (within the minority class), where \(\sum _{i=1}^{m}\sigma _{i}=2\). Here, the fraction (\(p_1\)) of \(\sigma _{1}\) is \(p_{1}= \sigma _{1} /\sum _{i=1}^{m}\sigma _{i} =0.5\); therefore, 50 % of the synthetic objects will be generated from \(object_1\). For \(\sigma _{2}\) the fraction is 30 %, and 20 % for \(\sigma _{3}\). If we have to generate a total of 10 synthetic objects, 5 would be generated around \(object_{1}\), 3 around \(object_{2}\), and 2 around \(object_{3}\); this allocation is reproduced in the sketch below.
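The allocation can be checked numerically with the sigma values implied by the text (\(\sigma_{1}=1.0\), \(\sigma_{2}=0.6\), \(\sigma_{3}=0.4\), which are our reading of Fig. 1):

```python
import numpy as np

sigma = np.array([1.0, 0.6, 0.4])   # standard deviations from the Fig. 1 example
p = sigma / sigma.sum()             # proportions: [0.5, 0.3, 0.2]
n = 10                              # total synthetic objects to generate
print(np.round(p * n).astype(int))  # -> [5 3 2]
```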

In order to determine the number of objects to generate (\(s_{ij}=p_{ij}*p_{i}*n\)) between an \(object_{i}\) and each one of its nearest neighbors (\(object_{ij}\)), we take into account the total number of objects to generate for the minority class (n). This number is given by the difference between the number of objects in the majority and minority classes, \(n=(M-m)*R\), where M is the number of objects in the majority class, m is the number of objects in the minority class, and R is a parameter defining the proportion of the difference to be reached.
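As a small worked instance (with hypothetical class sizes of our choosing):

```python
M, m, R = 400, 100, 1.0   # hypothetical class sizes and proportion parameter
n = (M - m) * R           # 300 synthetic objects for the minority class in total
p_i, p_ij = 0.05, 0.25    # example proportions for one object and one of its neighbors
s_ij = p_i * p_ij * n     # 3.75 objects to place between this pair (before rounding)
```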

Fig. 2. Generation of 3 synthetic objects between an \(object_{i}\) of the minority class and one of its nearest neighbors \(object_{ij}\).

After calculating the number of synthetic objects \(s_{ij}\) to generate between each \(object_{i}\) and each one of its nearest neighbors (\(object_{ij}, j=1,\dots ,k\)), the attribute differences \(diff_{ij}\) between \(object_{i}\) and \(object_{ij}\) are calculated. These differences are divided by the number of synthetic objects to generate plus one, obtaining \(diff'_{ij}=diff_{ij}/(s_{ij}+1)\). This difference is added \(s_{ij}\) times to \(object_{i}\); with each addition we obtain a new synthetic object, and all the new synthetic objects are added to the original minority class. In Fig. 2, we can see an example of the generation of three synthetic objects between an object of the minority class \(object_{i}\) and one of its k nearest neighbors \(object_{ij}\).
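The uniform generation step of Fig. 2 can be sketched as follows (assuming numeric attribute vectors): the difference is split into \(s_{ij}+1\) equal parts and added cumulatively to \(object_{i}\).

```python
import numpy as np

def generate_between(x_i, x_ij, s_ij):
    step = (x_ij - x_i) / (s_ij + 1)                     # diff'_ij
    return [x_i + t * step for t in range(1, s_ij + 1)]  # one object per addition

# three evenly spaced synthetic objects between two minority objects
print(generate_between(np.array([0.0, 0.0]), np.array([4.0, 4.0]), 3))
```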

If, for an object in the minority class, the number of objects to be generated according to its proportion \(p_{i}\) (the fraction that its standard deviation of distances represents with respect to the sum of all standard deviations) amounts to less than one object, SMOTE-D does not generate synthetic objects from it. The same happens if the proportion \(p_{ij}\) (the fraction that the distance between the object and one of its nearest neighbors represents with respect to the sum of the distances to all its k nearest neighbors) is less than 1 %.

Given a training set with M objects in the majority class and m objects in the minority class, the detailed procedure of SMOTE-D is as follows (a sketch in code appears after the list):

  • Calculate the amount of objects to be generated for the minority class (\(n=(M-m)*R\)) according to a parameter \(R\in [0,1]\).

  • Calculate the distances (\(d_{ij}\)) between each \(object_{i}\) in the minority class and its k nearest neighbors within the same class, \(j=1,\dots,k\) (k is a parameter).

  • Calculate the standard deviation (\(\sigma _{i}\)) of the distances between each \(object_{i}\) and its k-nearest neighbors.

  • Calculate, for each \(object_{i}\), the fraction (\(p_{i}\)) that its standard deviation (\(\sigma _{i}\)) represents of the total sum of all standard deviations, as \(p_{i}=\sigma _{i}/\sum _{i=1}^{m}\sigma _{i} \).

  • Calculate the fraction (\(p_{ij}\)) of each distance (\(d_{ij}\)) with respect to the sum of distances of each \(object_{i}\) and its k nearest neighbors as \(p_{ij}=d_{ij}/\sum _{j=1}^{k}d_{ij}\).

  • Calculate the number of objects (\(s_{ij}\)) to generate between an \(object_{i}\) and one of its nearest neighbors \(object_{ij}\) as \(s_{ij}=p_{i}*p_{ij}*n\).

  • Get the attribute difference \(diff_{ij}\) between an \(object_{i}\) and each one of its k nearest neighbors \(object_{ij}\) as \(diff_{ij}=object_{ij}-object_{i} ; j=1, \dots ,k\).

    • Divide the difference between an object and each one of its neighbors by the amount of synthetic objects to be generated from this pair plus 1. (\(diff_{ij}^{'}=diff_{ij}/(s_{ij}+1)\))

    • Add the difference \(diff^{'}_{ij}\) to the object of the minority class as many times as objects to generate (\(s_{ij}\)).

  • Add the generated synthetic objects to the minority class.
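Putting the steps together, a compact end-to-end sketch of the procedure could look as follows. It assumes Euclidean distances, scikit-learn's NearestNeighbors, and truncation of the fractional \(s_{ij}\) (our reading of the "less than 1 object" rule above); it is an illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_d(X_min, M, k=5, R=1.0):
    m = len(X_min)
    n = (M - m) * R                                # total objects to generate
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    dist, idx = nn.kneighbors(X_min)
    d, neigh = dist[:, 1:], idx[:, 1:]             # drop each object's self-distance
    sigma = d.std(axis=1)
    p = sigma / sigma.sum()                        # p_i from the sigma fractions
    p_pair = d / d.sum(axis=1, keepdims=True)      # p_ij from the distance fractions
    synthetic = []
    for i in range(m):
        for j in range(k):
            s_ij = int(p[i] * p_pair[i, j] * n)    # < 1 truncates to 0: nothing generated
            step = (X_min[neigh[i, j]] - X_min[i]) / (s_ij + 1)
            synthetic.extend(X_min[i] + t * step for t in range(1, s_ij + 1))
    return np.array(synthetic).reshape(-1, X_min.shape[1])
```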

4 Experimental Setup

For evaluating the proposed method, we used 66 datasets taken from the KEEL repository [1], applying 5-fold cross-validation. In order to measure the degree of imbalance of a dataset, the imbalance ratio (IR) is commonly used in the literature. The IR of a dataset with M objects in the majority class and m objects in the minority class is computed as \(IR=M/m\). All the datasets used in our experiments are binary problems with numeric attributes and IR ranging from 1.82 to 129.44 (see Table 1).

Table 1. Summary of the datasets used in our experiments.

In the KEEL repository, these 66 datasets are provided together with the results of applying SMOTE with \(k=5\) as the number of nearest neighbors, HVDM [3] as the distance function, and an oversampling rate N such that \(IR=1.0\). Thus, for our experiments, SMOTE-D was configured in the same way (\(k=5\), HVDM, and \(R=1\) in order to get \(IR=1.0\)).

For comparing the results of SMOTE-D and SMOTE, decision trees, support vector machines (SVM), and KNN with \(k=5\) were used. We applied SMOTE-D to all datasets and compared the classification results against those obtained by using SMOTE.

One of the measures commonly used for assessing the quality of classifiers on imbalanced datasets is the F-measure [11]. Therefore, in our experiments we used this measure to assess our results; additionally, we applied a t-test at the 5 % significance level between the results of SMOTE and SMOTE-D.
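A sketch of this evaluation protocol (our reconstruction, with KNN as the classifier and scikit-learn's f1_score as the F-measure; smote_d refers to the sketch in Sect. 3, and oversampling is applied to the training folds only):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def evaluate(X, y, y_min):
    scores = []
    for train, test in StratifiedKFold(n_splits=5).split(X, y):
        X_tr, y_tr = X[train], y[train]
        X_syn = smote_d(X_tr[y_tr == y_min], M=int(np.sum(y_tr != y_min)))
        X_tr = np.vstack([X_tr, X_syn])                       # oversampled training fold
        y_tr = np.concatenate([y_tr, np.full(len(X_syn), y_min)])
        clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
        scores.append(f1_score(y[test], clf.predict(X[test]), pos_label=y_min))
    return float(np.mean(scores))
```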

Table 2. Results of the evaluated classifiers on the datasets oversampled with SMOTE and SMOTE-D, using the HVDM distance.
Table 3. Results of the evaluated classifiers on the datasets oversampled with SMOTE and SMOTE-D, using the Euclidean distance.

5 Experimental Results

The results of comparing SMOTE-D and SMOTE with different classifiers in terms of F-measure are shown in Tables 2 and 3. In both tables, the first column shows the name of the dataset, and the following columns show the classification results, in terms of F-measure, for the decision tree, KNN, and SVM classifiers, respectively. The last row shows the average over the 66 datasets for each classifier. The best F-measure results appear boldfaced. Datasets marked with (*) are those where the t-test showed a statistically significant difference at the 5 % significance level.

In the results shown in Table 2, the proposed method obtains a better performance in 67 % of the datasets using decision trees, in 61 % using KNN, and in 73 % using SVM. On average, the results show a statistically significant improvement of 3.14 %, 2.73 %, and 4.26 % in terms of F-measure for decision trees, KNN, and SVM, respectively. Considering only those datasets with an IR greater than 15.8, the results obtained when SMOTE-D is applied are statistically significantly better for all the classifiers used; these datasets appear with their names in bold.

In the results shown in Table 3, the proposed method obtains a better performance in 50 % of the datasets using decision trees, in 46 % using KNN, and in 50 % using SVM. On average, the results do not show a statistically significant difference for any classifier. Considering only those datasets with an IR greater than 10.59, the results obtained by the SVM classifier when SMOTE-D is applied are statistically significantly better; these datasets appear with their names in bold.

6 Conclusions

This paper introduces a new oversampling method, SMOTE-D, which is a deterministic version of SMOTE. Comparisons against SMOTE in terms of F-measure, using decision tree, KNN, and SVM classifiers, show that SMOTE-D gets better results than SMOTE. These results give evidence that estimating the dispersion of the objects of the minority class (based on the standard deviation of distances), in order to determine how many objects should be generated around each object of the minority class and how many should be created between each object and its nearest neighbors, together with a deterministic and uniform creation of the synthetic objects, allows oversampling the minority class in such a way that better results than SMOTE can be obtained.

From our experiments, we can conclude that when a dataset has an imbalance ratio higher than 10.0, SMOTE-D, using either the Euclidean or the HVDM distance, performs better than SMOTE.

As future work, we are going to extend the proposed method for working with nominal attributes.