A Comparison of Re-sampling Techniques for Pattern Classification in Imbalanced Data-Sets

  • Conference paper
Advances in Computational Intelligence Systems (UKCI 2018)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 840)

Abstract

Class imbalance is a common challenge in pattern classification of real-world medical data-sets. A widely used and effective counter-measure is re-sampling. In this paper, we train an artificial neural network (ANN) with different re-sampling techniques and compare the resulting classification performance. The re-sampling strategies were a control, under-sampling, over-sampling, and a combination of the two. We found that over-sampling and the combination of under- and over-sampling both led to significantly better classifier performance than under-sampling alone in correctly predicting labelled classes.
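The four strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not say which re-sampling algorithms the paper used, so this example assumes plain random re-sampling with NumPy: the control is taken to mean no re-sampling, under-sampling shrinks every class to the smallest class size, over-sampling grows every class to the largest class size by drawing with replacement, and the combination moves every class to the mean class size. The function name `resample` and the `strategy` values are hypothetical and are not taken from the paper.

```python
import numpy as np


def resample(X, y, strategy="control", seed=0):
    """Return a randomly re-sampled copy of (X, y).

    Illustrative only: strategy names and target sizes are assumptions,
    not the procedure used in the paper.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)

    if strategy == "control":
        # Assumed meaning of "control": leave the class distribution untouched.
        return X.copy(), y.copy()
    if strategy == "under":
        target = int(counts.min())           # shrink to the minority size
    elif strategy == "over":
        target = int(counts.max())           # grow to the majority size
    elif strategy == "combined":
        target = int(round(counts.mean()))   # meet in the middle
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    picked = []
    for cls, n in zip(classes, counts):
        members = np.flatnonzero(y == cls)
        # Draw with replacement only when the class has to grow.
        picked.append(rng.choice(members, size=target, replace=bool(n < target)))

    idx = np.concatenate(picked)
    rng.shuffle(idx)  # avoid class-ordered batches during training
    return X[idx], y[idx]


if __name__ == "__main__":
    # Toy imbalanced data: 8 samples of class 0, 2 samples of class 1.
    X = np.arange(20).reshape(10, 2)
    y = np.array([0] * 8 + [1] * 2)
    for name in ("control", "under", "over", "combined"):
        _, y_new = resample(X, y, name)
        print(name, np.bincount(y_new).tolist())
```

Re-sampling of this kind is normally applied to the training split only, so that the evaluation data keep the original class distribution. Random over-sampling duplicates minority samples exactly; synthetic methods such as SMOTE generate new points instead and would need a different implementation.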

Author information

Corresponding author

Correspondence to Marcia Amstelvina Saul.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Saul, M.A., Rostami, S. (2019). A Comparison of Re-sampling Techniques for Pattern Classification in Imbalanced Data-Sets. In: Lotfi, A., Bouchachia, H., Gegov, A., Langensiepen, C., McGinnity, M. (eds) Advances in Computational Intelligence Systems. UKCI 2018. Advances in Intelligent Systems and Computing, vol 840. Springer, Cham. https://doi.org/10.1007/978-3-319-97982-3_20
