
1 Introduction

In supervised classification, imbalanced data arises when some classes have far fewer objects than others. Usually, the minority class (the class with the fewest objects) is the class of interest in a class imbalance problem. However, when working on imbalanced datasets, classifiers tend to be biased towards the majority class (the class with the most objects), resulting in poor classification performance for the minority class. This behavior becomes more pronounced as the imbalance among classes grows.

Some researchers have tackled the class imbalance problem [12, 13] through oversampling methods [4–10]. Oversampling methods have the advantage of being independent of the classification method to be used, since they generate synthetic objects based only on the training set. SMOTE [2] is one of the best known and most widely used oversampling methods; it generates synthetic objects along the line segments joining objects of the minority class with some of their nearest neighbors. Thus, by increasing the number of objects of the minority class, SMOTE tries to balance the number of objects across all classes. SMOTE has a random component for generating synthetic objects, producing a different result each time it is applied. Therefore, if SMOTE is applied several times, choosing the best result becomes an issue.
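To make the contrast with our deterministic method concrete, the following minimal Python sketch (our illustration, not the original SMOTE implementation) shows SMOTE's core interpolation step; the random gap is what makes each run produce a different result.

```python
# Minimal sketch of SMOTE's interpolation step (illustrative, not the original code).
import numpy as np

def smote_interpolate(x, neighbor, rng):
    """Place one synthetic object on the segment joining x and one of its neighbors."""
    gap = rng.random()               # random position in [0, 1); source of non-determinism
    return x + gap * (neighbor - x)

rng = np.random.default_rng(0)
x, neighbor = np.array([1.0, 2.0]), np.array([3.0, 4.0])
synthetic = smote_interpolate(x, neighbor, rng)  # differs with each new rng draw
```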

In this paper, we propose a new oversampling method based on SMOTE, which computes, in a deterministic way, how many new synthetic objects should be generated from each object of the minority class and where these new objects should be placed. According to our experiments, the proposed method performs better than SMOTE on datasets with different imbalance levels.

The rest of this document is organized as follows: in Sect. 2, some related works are described; in Sect. 3, the proposed method is introduced; in Sect. 4, the experimental setup is described; in Sect. 5, the experimental results are shown; and in Sect. 6, our conclusions and some future work directions are discussed.

2 Related Work

In the literature, there are two types of extensions of SMOTE: those which combine SMOTE with other methods, like noise filters (SMOTE-IPF) [6], subsampling methods (SMOTE-RSB*) [5], or feature selectors (E-SMOTE) [7]; and those that modify SMOTE, like Borderline-SMOTE [4], Safe-Level-SMOTE [8], SMOTE-OUT [9], SMOTE-COSINE [9], or Random-SMOTE [10]. Our work belongs to this last kind of method. Thus, we briefly describe some methods that modify SMOTE:

Borderline-SMOTE oversamples only the objects on the borderline of the minority class. First, it identifies the borderline objects of the minority class; then, synthetic objects are generated from these objects and added to the original training set. Borderline-SMOTE works as follows:

  • For each object in the minority class, its k nearest neighbors, taken from the whole training set, are calculated.

  • If the k nearest neighbors contain objects from both the minority and the majority class, and the number of neighbors in the majority class is larger than the number of neighbors in the minority class, the object is considered a borderline object.

For each borderline object, Borderline-SMOTE computes its k nearest neighbors in the minority class and generates synthetic objects in the same way as SMOTE.
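A possible sketch of the borderline test just described, assuming scikit-learn's NearestNeighbors for the neighbor search over the whole training set (X, y) and y_min as the minority label; this is our illustration, not the authors' code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def borderline_objects(X, y, y_min, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each object is its own neighbor
    _, idx = nn.kneighbors(X[y == y_min])
    borderline = []
    for x, neighbors in zip(X[y == y_min], idx[:, 1:]):  # drop the object itself
        n_maj = np.sum(y[neighbors] != y_min)
        # mixed neighborhood where majority neighbors outnumber minority ones
        if n_maj < k and n_maj > k - n_maj:
            borderline.append(x)
    return np.array(borderline)
```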

Safe-Level-SMOTE (Safe-Level Synthetic Minority Oversampling Technique) assigns to each object in the minority class its safe level before generating synthetic objects. Each synthetic object is positioned closer to the object with the largest safe level; in this way, all the synthetic objects are generated in safe regions. The safe level of an object is defined as the number of minority class objects among its k nearest neighbors.
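The safe-level computation can be sketched in the same style (again an assumed helper, not the original implementation): for each minority object, count the minority objects among its k nearest neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def safe_levels(X, y, y_min, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[y == y_min])
    # safe level of each minority object = minority objects among its k neighbors
    return np.sum(y[idx[:, 1:]] == y_min, axis=1)
```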

SMOTE-OUT randomly generates synthetic objects outside the line segment of attributes joining an object of the minority class and its nearest neighbor of the majority class. Once a synthetic object has been generated, SMOTE-OUT finds its nearest object in the minority class and randomly generates another synthetic object along the line of attributes between them.

SMOTE-Cosine works like SMOTE, but it computes the k nearest neighbors by voting using two distance metrics (Euclidean and cosine).

Random-SMOTE generates temporary synthetic objects along the line of attributes between two randomly selected objects of the minority class. Afterwards, synthetic objects are generated along the line of attributes between each temporary synthetic object and one object of the minority class.
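As we read it, Random-SMOTE's two-step interpolation could be sketched as follows (a hypothetical rendering of the description above; X_min holds the minority-class objects row-wise).

```python
import numpy as np

def random_smote_one(X_min, rng):
    i, j, l = rng.choice(len(X_min), size=3)
    # temporary synthetic object between two randomly selected minority objects
    temp = X_min[i] + rng.random() * (X_min[j] - X_min[i])
    # final synthetic object between the temporary object and a minority object
    return X_min[l] + rng.random() * (temp - X_min[l])
```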

3 Proposed Method

The proposed method takes into account the distances between each object of the minority class and its k nearest neighbors in the same class in order to determine how many synthetic objects should be generated from each object. For each object in the minority class, the higher the dispersion of the distances to its k nearest neighbors, the higher the number of synthetic objects to generate. Additionally, the distance between an object and each one of its k nearest neighbors is also taken into account individually, to determine how many objects should be generated between each pair of objects: the larger the distance between a pair of objects, the greater the number of synthetic objects generated between them. The new synthetic objects are generated by dividing the attribute difference between two objects by the number of objects to be generated between them plus one. In this way, the synthetic objects are created in a deterministic and uniform way.

For evaluating the dispersion of the objects around each \(object_{i}\) of the minority class, we propose to use the standard deviation (\(\sigma _{i}\)) of the distances between \(object_{i}\) and its k nearest neighbors (\(object_{ij}\), \(j=1,\dots,k\)). We propose generating, around each \(object_{i}\) in the minority class, a number of synthetic objects proportional to the fraction that its standard deviation of distances (\(\sigma _{i}\)) represents with respect to the sum of the standard deviations of distances computed for all the objects in the minority class (\(\sum _{i=1}^{m}\sigma _{i}\), where m is the number of objects in the minority class). Thus, more objects will be created around the objects of the minority class whose distances to their k nearest neighbors have a higher standard deviation.
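A short sketch of this dispersion step, assuming Euclidean distances and scikit-learn's NearestNeighbors (the experiments also use HVDM, which is not sketched here):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dispersion_proportions(X_min, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    dist, _ = nn.kneighbors(X_min)    # column 0 holds each object's zero self-distance
    sigma = dist[:, 1:].std(axis=1)   # sigma_i over the k neighbor distances
    return sigma / sigma.sum()        # p_i = sigma_i / sum of all sigma
```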

Fig. 1. Three objects of a minority class with distance values and standard deviation of distances for 3-nearest neighbors.

After determining the proportion (\(p_{i}\)) of synthetic objects to generate from each object of the minority class, we calculate the proportion (\(p_{ij}\)) of synthetic objects to generate between each object of the minority class and each one of its k nearest neighbors. This proportion is calculated as the fraction that the distance between the object and each nearest neighbor represents with respect to the sum of the distances between the object and all of its k nearest neighbors.

In Fig. 1, we can see an example with three objects of the minority class and their 3-nearest neighbors (within the minority class), where \(\sum _{i=1}^{m}\sigma _{i}=2\). Here, the fraction (\(p_1\)) of \(\sigma _{1}\) is \(p_{1}= \sigma _{1} /\sum _{i=1}^{m}\sigma _{i} =0.5\); therefore, 50 % of the synthetic objects will be generated from \(object_1\). For \(\sigma _{2}\) the fraction is 30 %, and 20 % for \(\sigma _{3}\). If we have to generate a total of 10 synthetic objects, 5 would be generated around \(object_{1}\), 3 around \(object_{2}\), and 2 around \(object_{3}\); this allocation is reproduced in the sketch below.
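The allocation can be checked numerically with the sigma values implied by the text (\(\sigma_{1}=1.0\), \(\sigma_{2}=0.6\), \(\sigma_{3}=0.4\), which are our reading of Fig. 1):

```python
import numpy as np

sigma = np.array([1.0, 0.6, 0.4])   # standard deviations from the Fig. 1 example
p = sigma / sigma.sum()             # proportions: [0.5, 0.3, 0.2]
n = 10                              # total synthetic objects to generate
print(np.round(p * n).astype(int))  # -> [5 3 2]
```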

In order to determine the number of objects to generate (\(s_{ij}=p_{ij}*p_{i}*n\)) between an \(object_{i}\) and each one of its nearest neighbors (\(object_{ij}\)), we take into account the total number of objects to generate for the minority class (n). This number is given by the difference between the number of objects in the majority and minority classes, \(n=(M-m)*R\), where M is the number of objects in the majority class, m is the number of objects in the minority class, and R is a parameter defining the proportion of the difference to be reached.
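As a small worked instance (with hypothetical class sizes of our choosing):

```python
M, m, R = 400, 100, 1.0   # hypothetical class sizes and proportion parameter
n = (M - m) * R           # 300 synthetic objects for the minority class in total
p_i, p_ij = 0.05, 0.25    # example proportions for one object and one of its neighbors
s_ij = p_i * p_ij * n     # 3.75 objects to place between this pair (before rounding)
```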

Fig. 2. Generation of 3 synthetic objects between an \(object_{i}\) of the minority class and one of its nearest neighbors \(object_{ij}\).

After calculating the number of synthetic objects \(s_{ij}\) to generate between each \(object_{i}\) and each one of its nearest neighbors (\(object_{ij}, j=1,\dots ,k\)), the attribute differences \(diff_{ij}\) between \(object_{i}\) and \(object_{ij}\) are calculated. These differences are divided by the number of synthetic objects to generate plus one, obtaining \(diff'_{ij}=diff_{ij}/(s_{ij}+1)\). This difference is added \(s_{ij}\) times to \(object_{i}\); with each addition we obtain a new synthetic object, and all the new synthetic objects are added to the original minority class. In Fig. 2, we can see an example of the generation of three synthetic objects between an object of the minority class \(object_{i}\) and one of its k nearest neighbors \(object_{ij}\).
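The uniform generation step of Fig. 2 can be sketched as follows (assuming numeric attribute vectors): the difference is split into \(s_{ij}+1\) equal parts and added cumulatively to \(object_{i}\).

```python
import numpy as np

def generate_between(x_i, x_ij, s_ij):
    step = (x_ij - x_i) / (s_ij + 1)                     # diff'_ij
    return [x_i + t * step for t in range(1, s_ij + 1)]  # one object per addition

# three evenly spaced synthetic objects between two minority objects
print(generate_between(np.array([0.0, 0.0]), np.array([4.0, 4.0]), 3))
```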

If, for an object in the minority class, the number of objects to be generated according to its proportion \(p_{i}\) (the fraction that its standard deviation of distances represents with respect to the sum of all standard deviations) amounts to less than one object, SMOTE-D does not generate synthetic objects from it. The same happens if the proportion \(p_{ij}\) (the fraction that the distance between the object and one of its nearest neighbors represents with respect to the sum of the distances to all its k nearest neighbors) is less than 1 %.

Given a training set with M objects in the majority class and m objects in the minority class, the detailed procedure of SMOTE-D is as follows (a sketch in code appears after the list):

  • Calculate the amount of objects to be generated for the minority class (\(n=(M-m)*R\)) according to a parameter \(R\in [0,1]\).

  • Calculate the distances (\(d_{ij}\)) between each \(object_{i}\) in the minority class and its k nearest neighbors within the same class, \(j=1,\dots,k\) (k is a parameter).

  • Calculate the standard deviation (\(\sigma _{i}\)) of the distances between each \(object_{i}\) and its k-nearest neighbors.

  • Calculate, for each \(object_{i}\), the fraction (\(p_{i}\)) that its standard deviation (\(\sigma _{i}\)) represents of the total sum of all standard deviations, as \(p_{i}=\sigma _{i}/\sum _{i=1}^{m}\sigma _{i} \).

  • Calculate the fraction (\(p_{ij}\)) of each distance (\(d_{ij}\)) with respect to the sum of distances of each \(object_{i}\) and its k nearest neighbors as \(p_{ij}=d_{ij}/\sum _{j=1}^{k}d_{ij}\).

  • Calculate the number of objects (\(s_{ij}\)) to generate between an \(object_{i}\) and one of its nearest neighbors \(object_{ij}\) as \(s_{ij}=p_{i}*p_{ij}*n\).

  • Get the attribute difference \(diff_{ij}\) between an \(object_{i}\) and each one of its k nearest neighbors \(object_{ij}\) as \(diff_{ij}=object_{ij}-object_{i} ; j=1, \dots ,k\).

    • Divide the difference between an object and each one of its neighbors by the amount of synthetic objects to be generated from this pair plus 1. (\(diff_{ij}^{'}=diff_{ij}/(s_{ij}+1)\))

    • Add the difference \(diff^{'}_{ij}\) to the object of the minority class as many times as objects to generate (\(s_{ij}\)).

  • Add the generated synthetic objects to the minority class.
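Putting the steps together, a compact end-to-end sketch of the procedure could look as follows. It assumes Euclidean distances, scikit-learn's NearestNeighbors, and truncation of the fractional \(s_{ij}\) (our reading of the "less than 1 object" rule above); it is an illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_d(X_min, M, k=5, R=1.0):
    m = len(X_min)
    n = (M - m) * R                                # total objects to generate
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    dist, idx = nn.kneighbors(X_min)
    d, neigh = dist[:, 1:], idx[:, 1:]             # drop each object's self-distance
    sigma = d.std(axis=1)
    p = sigma / sigma.sum()                        # p_i from the sigma fractions
    p_pair = d / d.sum(axis=1, keepdims=True)      # p_ij from the distance fractions
    synthetic = []
    for i in range(m):
        for j in range(k):
            s_ij = int(p[i] * p_pair[i, j] * n)    # < 1 truncates to 0: nothing generated
            step = (X_min[neigh[i, j]] - X_min[i]) / (s_ij + 1)
            synthetic.extend(X_min[i] + t * step for t in range(1, s_ij + 1))
    return np.array(synthetic).reshape(-1, X_min.shape[1])
```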

4 Experimental Setup

For evaluating the proposed method, we used 66 datasets taken from the KEEL repository [1], applying 5-fold cross-validation. In order to measure the degree of imbalance of a dataset, the imbalance ratio (IR) is commonly used in the literature. The IR of a dataset with M objects in the majority class and m objects in the minority class is computed as \(IR=M/m\). All the datasets used in our experiments are binary problems with numeric attributes and IR ranging from 1.82 to 129.44 (see Table 1).

Table 1. Summary of the datasets used in our experiments.

In the KEEL repository, these 66 datasets are provided together with the results of applying SMOTE with \(k=5\) as the number of nearest neighbors, HVDM [3] as the distance function, and an oversampling rate N such that \(IR=1.0\). Thus, for our experiments, SMOTE-D was configured in the same way (\(k=5\), HVDM, and \(R=1\) in order to get \(IR=1.0\)).

For comparing the results of SMOTE-D and SMOTE, decision trees, support vector machines (SVM), and KNN with \(k=5\) were used. We applied SMOTE-D to all datasets and compared the classification results against those obtained by using SMOTE.

One of the measures commonly used for assessing the quality of classifiers on imbalanced datasets is the F-measure [11]. Therefore, in our experiments we used this measure to assess our results; additionally, we applied a t-test at the 5 % significance level between the results of SMOTE and SMOTE-D.
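A sketch of this evaluation protocol (our reconstruction, with KNN as the classifier and scikit-learn's f1_score as the F-measure; smote_d refers to the sketch in Sect. 3, and oversampling is applied to the training folds only):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def evaluate(X, y, y_min):
    scores = []
    for train, test in StratifiedKFold(n_splits=5).split(X, y):
        X_tr, y_tr = X[train], y[train]
        X_syn = smote_d(X_tr[y_tr == y_min], M=int(np.sum(y_tr != y_min)))
        X_tr = np.vstack([X_tr, X_syn])                       # oversampled training fold
        y_tr = np.concatenate([y_tr, np.full(len(X_syn), y_min)])
        clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
        scores.append(f1_score(y[test], clf.predict(X[test]), pos_label=y_min))
    return float(np.mean(scores))
```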

Table 2. Results of the evaluated classifiers on the datasets oversampled with SMOTE and SMOTE-D, using the HVDM distance.
Table 3. Results of the evaluated classifiers on the datasets oversampled with SMOTE and SMOTE-D, using the Euclidean distance.

5 Experimental Results

The results of comparing SMOTE-D and SMOTE with different classifiers in terms of F-measure are shown in Tables 2 and 3. In both tables, the first column shows the name of the dataset, and the following columns show the classification results, in terms of F-measure, for the decision tree, KNN, and SVM classifiers, respectively. The last row shows the average over the 66 datasets for each classifier. The best F-measure results appear boldfaced. Datasets marked with (*) are those where the t-test showed a statistically significant difference at the 5 % significance level.

In the results shown in Table 2, the proposed method obtains a better performance in 67 % of the datasets using decision trees, in 61 % using KNN, and in 73 % using SVM. On average, the results show a statistically significant improvement of 3.14 %, 2.73 %, and 4.26 % in terms of F-measure for decision trees, KNN, and SVM, respectively. Considering only those datasets with an IR greater than 15.8, the results obtained when SMOTE-D is applied are statistically significantly better for all the classifiers used; these datasets appear with their names in bold.

In the results shown in Table 3, the proposed method obtains a better performance in 50 % of the datasets using decision trees, in 46 % using KNN, and in 50 % using SVM. On average, the results do not show a statistically significant difference for any classifier. Considering only those datasets with an IR greater than 10.59, the results obtained by the SVM classifier when SMOTE-D is applied are statistically significantly better; these datasets appear with their names in bold.

6 Conclusions

This paper introduces a new oversampling method, SMOTE-D, which is a deterministic version of SMOTE. Comparisons against SMOTE in terms of F-measure, using decision tree, KNN, and SVM classifiers, show that SMOTE-D gets better results than SMOTE. These results give evidence that estimating the dispersion of the objects of the minority class (based on the standard deviation of distances), in order to determine how many objects should be generated around each object of the minority class and how many should be created between each object and its nearest neighbors, together with a deterministic and uniform creation of the synthetic objects, allows oversampling the minority class in such a way that better results than SMOTE can be obtained.

From our experiments, we can conclude that when a dataset has an imbalance ratio higher than 10.0, SMOTE-D, using either the Euclidean or the HVDM distance, performs better than SMOTE.

As future work, we are going to extend the proposed method for working with nominal attributes.