Abstract
Imbalanced class distribution is a challenging problem in many real-life classification tasks. Existing synthetic oversampling methods suffer from the curse of dimensionality because they rely heavily on Euclidean distance. This paper proposes a new method, called Minority Oversampling Technique based on Local Densities in Low-Dimensional Space (MOT2LD for short). MOT2LD first maps each training sample into a low-dimensional space and clusters the resulting low-dimensional representations. It then assigns each minority sample a weight, defined as the product of two quantities, its local minority density and its local majority count, which indicates the sample's importance for sampling. Each synthetic minority class sample is then generated inside a minority cluster. MOT2LD has been evaluated on 15 real-world data sets. The experimental results show that the method outperforms several existing methods, including SMOTE, Borderline-SMOTE, ADASYN, and MWMOTE, in terms of G-mean and F-measure.
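The weighting and generation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the two quantities, so the sketch assumes that "local minority density" is the number of other minority samples within a fixed radius and that "local majority count" is the number of majority samples within the same radius. It also assumes that the low-dimensional embedding and the cluster labels have already been computed by some other means, and that a synthetic sample is a random interpolation between a weighted seed and another minority sample from the same cluster. The names `minority_weights`, `oversample`, and `radius` are hypothetical.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minority_weights(points, labels, radius, minority=1):
    """Return {index: weight} for every minority sample.

    ASSUMPTION: weight = (# other minority samples within `radius`)
                       * (# majority samples within `radius`),
    measured on the low-dimensional points. The paper's exact
    definitions may differ.
    """
    weights = {}
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != minority:
            continue
        n_min = n_maj = 0
        for j, (q, z) in enumerate(zip(points, labels)):
            if i == j or euclid(p, q) > radius:
                continue
            if z == minority:
                n_min += 1
            else:
                n_maj += 1
        weights[i] = n_min * n_maj
    return weights


def oversample(features, clusters, weights, n_new, rng):
    """Generate `n_new` synthetic minority samples.

    ASSUMPTION: a seed is drawn with probability proportional to its
    weight, a partner is drawn uniformly from the minority samples in
    the same cluster, and the new sample is a random point on the
    segment between them.
    """
    idx = list(weights)
    w = [weights[i] for i in idx]
    if sum(w) == 0:
        w = [1.0] * len(idx)  # fall back to uniform sampling
    synthetic = []
    for _ in range(n_new):
        seed = rng.choices(idx, weights=w, k=1)[0]
        mates = [j for j in idx if j != seed and clusters[j] == clusters[seed]]
        mate = rng.choice(mates) if mates else seed
        t = rng.random()
        synthetic.append(
            [a + t * (b - a) for a, b in zip(features[seed], features[mate])]
        )
    return synthetic


# Toy data: four minority samples (label 1) and two majority samples (label 0).
# The same 2-D points stand in for both the embedding and the feature vectors.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (0.5, 0.5), (5.0, 6.0)]
labels = [1, 1, 1, 1, 0, 0]
clusters = [0, 0, 0, 1, 0, 1]  # hypothetical cluster assignment

weights = minority_weights(points, labels, radius=1.5)
synthetic = oversample(points, clusters, weights, n_new=5, rng=random.Random(0))
```

Under these assumed definitions, a minority sample with no minority neighbours (index 3 above) receives weight zero and is never chosen as a seed, while samples that lie in a dense minority region close to majority samples receive the largest weights.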
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Xie, Z., Jiang, L., Ye, T., Li, X. (2015). A Synthetic Minority Oversampling Method Based on Local Densities in Low-Dimensional Space for Imbalanced Learning. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9050. Springer, Cham. https://doi.org/10.1007/978-3-319-18123-3_1
DOI: https://doi.org/10.1007/978-3-319-18123-3_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18122-6
Online ISBN: 978-3-319-18123-3
eBook Packages: Computer Science (R0)