A Normal Distribution-Based Over-Sampling Approach to Imbalanced Data Classification

  • Conference paper
Advanced Data Mining and Applications (ADMA 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7120)

Abstract

This study proposes a normal distribution-based over-sampling approach to balance the number of instances belonging to different classes in a data set. The balanced training data are used to learn unbiased classifiers for the original data set. Under some conditions, the proposed over-sampling approach generates samples whose expected mean and variance are similar to those of the original minority class data. Because the approach tries to generate synthetic data with a probability distribution similar to that of the original data, and expands the class boundaries of the minority class, it may improve classification performance on the minority class. Experimental results show that, in most cases, the proposed approach outperforms alternative methods on benchmark data sets when several classical classification algorithms are used.
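
The abstract does not spell out the sampling procedure, so the following is only a minimal sketch under an assumed reading: fit an independent normal distribution to each feature of the minority class and draw synthetic rows from it until the classes are balanced. The function names `normal_oversample` and `balance`, the per-feature independence assumption, and the `seed` parameter are all hypothetical and are not taken from the paper.

```python
import random
import statistics


def normal_oversample(minority, n_new, seed=0):
    """Draw n_new synthetic rows from per-feature normal distributions.

    Assumption (not from the paper): features are treated as independent,
    each modelled by N(mean, std) estimated from the minority class.
    """
    rng = random.Random(seed)
    columns = list(zip(*minority))  # transpose rows into feature columns
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]  # needs >= 2 rows
    return [
        [rng.gauss(mu, sigma) for mu, sigma in zip(means, stds)]
        for _ in range(n_new)
    ]


def balance(majority, minority, seed=0):
    """Return the minority rows padded with synthetic rows up to the majority size."""
    deficit = max(0, len(majority) - len(minority))
    return list(minority) + normal_oversample(minority, deficit, seed)


if __name__ == "__main__":
    majority = [[float(i), 2.0 * i] for i in range(50)]
    minority = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.5, 13.0]]
    balanced_minority = balance(majority, minority)
    print(len(majority), len(balanced_minority))
```

Because every synthetic value is drawn from a normal distribution fitted to the minority class, the expected mean and variance of the generated rows match the fitted estimates, which is the property the abstract highlights. Sampling features independently ignores correlations between them; that simplification belongs to this sketch and is not a claim about the authors' method.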

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, H., Wang, Z. (2011). A Normal Distribution-Based Over-Sampling Approach to Imbalanced Data Classification. In: Tang, J., King, I., Chen, L., Wang, J. (eds) Advanced Data Mining and Applications. ADMA 2011. Lecture Notes in Computer Science, vol 7120. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25853-4_7

  • DOI: https://doi.org/10.1007/978-3-642-25853-4_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25852-7

  • Online ISBN: 978-3-642-25853-4

  • eBook Packages: Computer Science, Computer Science (R0)
