A Comparison of Re-sampling Techniques for Pattern Classification in Imbalanced Data-Sets

Saul, Marcia Amstelvina; Rostami, Shahin

doi:10.1007/978-3-319-97982-3_20


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 840)

Included in the following conference series:

UK Workshop on Computational Intelligence


Abstract

Class imbalance is a common challenge in pattern classification of real-world medical data-sets. A widely used counter-measure is re-sampling. In this paper we implement an artificial neural network (ANN) with different re-sampling techniques and compare the resulting performance. The re-sampling strategies were a control, under-sampling, over-sampling, and a combination of the two. We found that over-sampling, and the combination of under- and over-sampling, each led to significantly better classifier performance than under-sampling alone in correctly predicting labelled classes.
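
The four strategies named in the abstract can be illustrated with a minimal sketch. It uses plain random removal and random duplication, which is an assumption made for illustration and not necessarily the re-sampling method used in the paper. The helper names `resample_to` and `strategy_targets`, the toy data, the choice of the midpoint as the combined target, and the reading of the control as the unmodified data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def resample_to(X, y, target, seed=0):
    """Randomly re-sample every class to exactly `target` samples.

    Classes larger than `target` are under-sampled (random removal);
    classes smaller than `target` are over-sampled (random duplication).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)

    X_out, y_out = [], []
    for label, rows in by_class.items():
        if len(rows) > target:
            chosen = rng.sample(rows, target)  # without replacement
        else:
            extra = [rng.choice(rows) for _ in range(target - len(rows))]
            chosen = rows + extra  # originals plus random duplicates
        X_out.extend(chosen)
        y_out.extend([label] * len(chosen))
    return X_out, y_out


def strategy_targets(y):
    """Per-class target size for each strategy (illustrative choices)."""
    counts = Counter(y)
    smallest, largest = min(counts.values()), max(counts.values())
    return {
        "under": smallest,                      # shrink to the minority size
        "over": largest,                        # grow to the majority size
        "combined": (smallest + largest) // 2,  # meet in the middle (assumption)
    }


# Toy imbalanced data-set: 90 samples of class 0, 10 samples of class 1.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

results = {"control": (X, y)}  # control assumed to be the data left as-is
for name, target in strategy_targets(y).items():
    results[name] = resample_to(X, y, target)

class_counts = {name: dict(Counter(labels)) for name, (_, labels) in results.items()}
for name, counts in class_counts.items():
    print(name, counts)
```

In a setting like the one described, re-sampling of this kind would normally be applied to the training split only, so that the evaluation data keeps its original class distribution.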



Author information

Authors and Affiliations

Faculty of Science & Technology, Bournemouth University, Bournemouth, BH12 5BB, UK
Marcia Amstelvina Saul & Shahin Rostami


Corresponding author

Correspondence to Marcia Amstelvina Saul.

Editor information

Editors and Affiliations

School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
Ahmad Lotfi
Faculty of Science and Technology, Bournemouth University, Poole, Dorset, United Kingdom
Hamid Bouchachia
School of Computing, University of Portsmouth, Portsmouth, Hampshire, United Kingdom
Alexander Gegov
School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
Caroline Langensiepen
College of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
Martin McGinnity

About this paper

Cite this paper

Saul, M.A., Rostami, S. (2019). A Comparison of Re-sampling Techniques for Pattern Classification in Imbalanced Data-Sets. In: Lotfi, A., Bouchachia, H., Gegov, A., Langensiepen, C., McGinnity, M. (eds) Advances in Computational Intelligence Systems. UKCI 2018. Advances in Intelligent Systems and Computing, vol 840. Springer, Cham. https://doi.org/10.1007/978-3-319-97982-3_20

DOI: https://doi.org/10.1007/978-3-319-97982-3_20
Published: 11 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97981-6
Online ISBN: 978-3-319-97982-3
eBook Packages: Intelligent Technologies and Robotics (R0)
