A Normal Distribution-Based Over-Sampling Approach to Imbalanced Data Classification

Zhang, Huaxiang; Wang, Zhichao

doi:10.1007/978-3-642-25853-4_7


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7120)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications


Abstract

This study proposes a normal distribution-based over-sampling approach to balance the number of instances belonging to different classes in a data set. The balanced training data are used to learn unbiased classifiers for the original data set. Under some conditions, the proposed over-sampling approach generates samples whose expected mean and variance are similar to those of the original minority class data. Because the approach tries to generate synthetic data with a probability distribution similar to that of the original data, and expands the class boundaries of the minority class, it may improve classification performance on the minority class. Experimental results show that the proposed approach outperforms alternative methods on benchmark data sets in most cases when used with several classical classification algorithms.
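The general idea of over-sampling from a fitted normal distribution can be illustrated with a minimal sketch. The abstract does not give the authors' actual procedure, so everything below is an assumption made for illustration: each minority-class feature is modelled as an independent normal distribution whose mean and standard deviation are estimated from the minority rows, and the hypothetical helpers `gaussian_oversample` and `balance` are not names from the paper.

```python
import random
import statistics


def gaussian_oversample(minority, n_new, seed=0):
    """Draw n_new synthetic rows, sampling each feature independently.

    Assumption (not from the paper): every feature is modelled as
    N(mean, std^2), with mean and std estimated from the minority rows.
    """
    rng = random.Random(seed)
    columns = list(zip(*minority))  # transpose rows into feature columns
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]  # needs >= 2 rows
    return [
        [rng.gauss(mu, sigma) for mu, sigma in zip(means, stds)]
        for _ in range(n_new)
    ]


def balance(majority, minority, seed=0):
    """Top up the minority class until both classes have the same size."""
    deficit = max(0, len(majority) - len(minority))
    return list(minority) + gaussian_oversample(minority, deficit, seed)


if __name__ == "__main__":
    majority = [[float(i), float(i)] for i in range(50)]
    minority = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 9.0], [2.5, 10.5]]
    balanced = balance(majority, minority)
    print(len(majority), len(balanced))
```

Because the synthetic rows are drawn from a distribution fitted to the minority class, their sample mean and variance approach the fitted values as more rows are drawn. Sampling features independently ignores correlations between them, which a fuller treatment would need to address.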



Author information

Authors and Affiliations

Department of Computer Science, Shandong Normal University, Jinan, 250014, Shandong, China
Huaxiang Zhang & Zhichao Wang
Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan, 250014, China
Huaxiang Zhang & Zhichao Wang


Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Jie Tang & Jianyong Wang
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, SAR, China
Irwin King
Faculty of Engineering and Information Technology, University of Technology, 2007, Sydney, NSW, Australia
Ling Chen


About this paper

Cite this paper

Zhang, H., Wang, Z. (2011). A Normal Distribution-Based Over-Sampling Approach to Imbalanced Data Classification. In: Tang, J., King, I., Chen, L., Wang, J. (eds) Advanced Data Mining and Applications. ADMA 2011. Lecture Notes in Computer Science, vol 7120. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25853-4_7


DOI: https://doi.org/10.1007/978-3-642-25853-4_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25852-7
Online ISBN: 978-3-642-25853-4
eBook Packages: Computer Science, Computer Science (R0)
