
Handling Imbalanced Ratio for Class Imbalance Problem Using SMOTE

  • Nurulfitrah Noorhalim
  • Aida Ali
  • Siti Mariyam Shamsuddin
Conference paper

Abstract

Dataset classification raises many issues. One of them is class imbalance, which often appears as extreme skewness in many real-world domains and is a fundamental obstacle to building robust classifiers. In this paper, a sampling method (SMOTE) was used to evaluate the classification performance of the k-NN and C4.5 classifiers under ten-fold cross-validation. The experimental results showed that sampling greatly improved classification performance on class imbalance problems by refining the class boundary region, especially for datasets with an extreme imbalance ratio. This result demonstrates that class imbalance affects many real-world application domains.
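
The snippet below is a minimal sketch (not the authors' code) of the experimental setup the abstract describes, assuming the Python libraries scikit-learn and imbalanced-learn: SMOTE over-sampling combined with a k-NN classifier and a decision tree under ten-fold cross-validation. Because C4.5 is not available in scikit-learn, its CART-based DecisionTreeClassifier is used as a stand-in, and the synthetic dataset and F1 metric are illustrative choices rather than the paper's.

# Minimal sketch of the described setup: SMOTE + k-NN / decision tree
# with ten-fold cross-validation. Assumes scikit-learn and imbalanced-learn.
# DecisionTreeClassifier (CART) stands in for C4.5, which scikit-learn lacks.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Synthetic imbalanced dataset (roughly 9:1 majority-to-minority ratio),
# used here purely for illustration.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

for name, clf in [("k-NN", KNeighborsClassifier(n_neighbors=5)),
                  ("decision tree (CART stand-in for C4.5)",
                   DecisionTreeClassifier(random_state=42))]:
    # SMOTE is placed inside the pipeline so that synthetic minority samples
    # are generated only from the training folds of each CV split.
    pipe = Pipeline([("smote", SMOTE(random_state=42)), ("clf", clf)])
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")

Keeping SMOTE inside the cross-validation pipeline restricts over-sampling to the training folds, so no synthetic minority samples leak into the evaluation folds.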

Keywords

Sampling · Synthetic minority over-sampling · Imbalanced dataset

Notes

Acknowledgements

The authors would like to express their appreciation to the UTM Big Data Centre of Universiti Teknologi Malaysia and Y.M. Said for their support in this study. The authors gratefully acknowledge the Research Management Centre, UTM and the Ministry of Higher Education for the financial support through Research University Grant (RUG) Vot. No. Q.JI30000.2528.13H30.


Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Nurulfitrah Noorhalim (1)
  • Aida Ali (1)
  • Siti Mariyam Shamsuddin (1)

  1. Faculty of Computing, Universiti Teknologi Malaysia (UTM), Skudai, Malaysia
