A Comparison of Re-sampling Techniques for Pattern Classification in Imbalanced Data-Sets

  • Marcia Amstelvina Saul
  • Shahin Rostami
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 840)

Abstract

Class imbalance is a common challenge in pattern classification of real-world medical data-sets. Re-sampling is a widely used and effective counter-measure. In this paper we implement an artificial neural network (ANN) with different re-sampling techniques and compare the resulting classification performance. The re-sampling strategies were a control, under-sampling, over-sampling, and a combination of the two. We found that over-sampling and the combination of under- and over-sampling both led to significantly better classifier performance than under-sampling alone in correctly predicting labelled classes.
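As a rough illustration of the four strategies named in the abstract, the sketch below rebalances a label vector by random re-sampling with NumPy. It is not the authors' implementation: the abstract does not say which re-sampling algorithms were used, so plain random sampling, the reading of "control" as no re-sampling, the choice of the mean class size as the target for the combined strategy, and the function name `resample_indices` are all assumptions made for this example.

```python
import numpy as np


def resample_indices(y, strategy, seed=0):
    """Return row indices that rebalance the labels `y`.

    Hypothetical helper for illustration only.
    strategy: "control"  - keep the data as-is (assumed meaning of control)
              "under"    - shrink every class to the smallest class size
              "over"     - grow every class to the largest class size
              "combined" - move every class to the mean class size
                           (an assumed target, not taken from the paper)
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)

    if strategy == "control":
        return np.arange(len(y))
    if strategy == "under":
        target = int(counts.min())
    elif strategy == "over":
        target = int(counts.max())
    elif strategy == "combined":
        target = int(round(counts.mean()))
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    picked = []
    for cls, n in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Sample with replacement only when the class has to grow.
        picked.append(rng.choice(idx, size=target, replace=bool(n < target)))
    return np.concatenate(picked)


# Toy labels: 90 majority-class rows and 10 minority-class rows.
y = np.array([0] * 90 + [1] * 10)
for name in ("control", "under", "over", "combined"):
    idx = resample_indices(y, name)
    print(name, np.bincount(y[idx]))
```

The returned indices can be applied to both the feature matrix and the labels of the training split. Re-sampling the test split as well would distort the evaluation, so in a sketch like this it is normally left untouched.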

Keywords

Machine learning, Imbalanced data, Over-sampling, Under-sampling

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Faculty of Science & Technology, Bournemouth University, Bournemouth, UK