Abstract
We present a framework that addresses the imbalanced data problem using semi-supervised learning. Specifically, we recast a supervised problem as a semi-supervised one and then use a semi-supervised learning method to identify the most relevant instances, from which a well-defined training set is established. Extensive experimental results demonstrate that the proposed framework significantly outperforms all other sampling algorithms in 67% of the cases across three different classifiers and ranks second best in the remaining 33% of the cases.
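The abstract does not specify how the semi-supervised problem is built or which method selects instances, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes that candidate minority instances are treated as unlabeled points, that a simple label-spreading scheme over an RBF similarity graph assigns them labels, and that only candidates labeled as minority are kept. The function names, the toy data, and the parameters `gamma`, `alpha`, and `iters` are all hypothetical.

```python
import math

def rbf(a, b, gamma=1.0):
    """RBF similarity between two feature vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def spread_labels(points, labels, gamma=1.0, alpha=0.9, iters=50):
    """Label spreading on an RBF graph; labels[i] is 0, 1, or None (unlabeled)."""
    n = len(points)
    # Pairwise similarities with a zero diagonal.
    w = [[0.0 if i == j else rbf(points[i], points[j], gamma) for j in range(n)]
         for i in range(n)]
    deg = [sum(row) or 1.0 for row in w]  # guard against isolated points
    # Symmetrically normalized similarity matrix.
    s = [[w[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # One-hot initial labels; unlabeled points start at zero for both classes.
    y = [[1.0 if lab == c else 0.0 for c in (0, 1)] for lab in labels]
    f = [row[:] for row in y]
    for _ in range(iters):
        f = [[alpha * sum(s[i][j] * f[j][c] for j in range(n)) + (1 - alpha) * y[i][c]
              for c in (0, 1)] for i in range(n)]
    return [0 if row[0] >= row[1] else 1 for row in f]

def select_minority_candidates(majority, minority, candidates, **kw):
    """Keep only the unlabeled candidates that label spreading assigns to class 1."""
    points = majority + minority + candidates
    labels = [0] * len(majority) + [1] * len(minority) + [None] * len(candidates)
    pred = spread_labels(points, labels, **kw)
    start = len(majority) + len(minority)
    return [c for c, p in zip(candidates, pred[start:]) if p == 1]

# Toy data (assumed for illustration): class 0 is the majority, class 1 the minority.
majority = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (-0.5, 0.0), (0.0, -0.5)]
minority = [(3.0, 3.0), (3.5, 3.0)]
candidates = [(3.1, 2.9), (0.1, -0.1)]  # hypothetical unlabeled candidates

kept = select_minority_candidates(majority, minority, candidates)
augmented_minority = minority + kept  # training set with the selected instances added
print(kept)
```

In this sketch the candidate lying inside the majority region is discarded rather than added to the minority class, which is the kind of filtering the abstract describes as identifying the most relevant instances.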
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Almogahed, B.A., Kakadiaris, I.A. (2014). Empowering Imbalanced Data in Supervised Learning: A Semi-supervised Learning Approach. In: Wermter, S., et al. Artificial Neural Networks and Machine Learning – ICANN 2014. ICANN 2014. Lecture Notes in Computer Science, vol 8681. Springer, Cham. https://doi.org/10.1007/978-3-319-11179-7_66
DOI: https://doi.org/10.1007/978-3-319-11179-7_66
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11178-0
Online ISBN: 978-3-319-11179-7
eBook Packages: Computer Science, Computer Science (R0)