HCAB-SMOTE: A Hybrid Clustered Affinitive Borderline SMOTE Approach for Imbalanced Data Binary Classification

  • Hisham Al Majzoub
  • Islam Elgedawy
  • Öykü Akaydın
  • Mehtap Köse Ulukök
Research Article - Computer Engineering and Computer Science


Binary datasets are considered imbalanced when one of their two classes (the minority class) contains less than 40% of the total number of data instances. Existing classification algorithms are biased when applied to imbalanced binary datasets, as they tend to misclassify instances of the minority class. Many techniques have been proposed to reduce this bias and increase classification accuracy. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known approach to this problem: it generates new synthetic data instances to balance the dataset. Unfortunately, it generates these instances randomly, which produces useless new instances and wastes time and memory. Several SMOTE derivatives (such as Borderline SMOTE) were proposed to overcome this problem, yet they only slightly changed the number of generated instances. To address this, the paper proposes a novel approach for generating synthetic data instances, known as Hybrid Clustered Affinitive Borderline SMOTE (HCAB-SMOTE). It minimizes the number of generated instances while increasing classification accuracy. It combines undersampling, which removes majority noise instances, with oversampling, which enhances the density of the borderline. It applies k-means clustering to the borderline area and identifies which clusters to oversample to achieve better results. Experimental results show that HCAB-SMOTE outperforms the SMOTE, Borderline SMOTE, AB-SMOTE and CAB-SMOTE approaches, which were developed before HCAB-SMOTE, as it provides the highest classification accuracy with the fewest generated instances.
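
The baseline techniques named in the abstract can be illustrated with a minimal sketch. It is not the HCAB-SMOTE procedure, since the abstract does not specify the noise-removal step or how clusters are selected. The sketch only assumes the commonly described SMOTE interpolation (a synthetic point placed on the segment between a minority instance and one of its k nearest minority neighbours) and the Borderline SMOTE "danger" rule (a minority instance with at least half, but not all, of its k nearest neighbours in the majority class). All function names and the default k=5 are hypothetical choices made for illustration.

```python
import math
import random


def minority_share(labels):
    """Return (minority_label, fraction of instances in the smaller class)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    minority = min(counts, key=counts.get)
    return minority, counts[minority] / len(labels)


def borderline_instances(X, y, minority_label, k=5):
    """Borderline SMOTE 'danger' set (assumed rule): minority points with
    at least half, but not all, of their k nearest neighbours in the
    majority class. Points whose neighbours are all majority are treated
    as noise and skipped."""
    labeled = list(zip(X, y))
    danger = []
    for point, label in labeled:
        if label != minority_label:
            continue
        # k nearest neighbours among all other instances (Euclidean distance)
        others = [p for p in labeled if p[0] is not point]
        neighbors = sorted(others, key=lambda p: math.dist(point, p[0]))[:k]
        n_major = sum(1 for _, lab in neighbors if lab != minority_label)
        if k / 2 <= n_major < k:
            danger.append(point)
    return danger


def smote_interpolate(seeds, minority, n_new, k=5, rng=None):
    """Generate n_new synthetic points. Each one lies on the segment between
    a randomly chosen seed and one of its k nearest minority neighbours.
    Requires at least two minority instances."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(seeds)
        others = [m for m in minority if m is not base]
        neighbor = rng.choice(
            sorted(others, key=lambda m: math.dist(base, m))[:k]
        )
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic
```

Passing only the borderline instances as `seeds` restricts oversampling to the class boundary, which is the idea that Borderline SMOTE and its derivatives build on; plain SMOTE would pass the whole minority class instead.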


Imbalanced data, Borderline SMOTE, Oversampling, SMOTE, AB-SMOTE, k-means clustering



Copyright information

© King Fahd University of Petroleum & Minerals 2020

Authors and Affiliations

  1. Management Information Systems Department, School of Applied Sciences, Cyprus International University, Nicosia, Turkey
  2. Computer Engineering Department, Middle East Technical University, Northern Cyprus Campus, Kalkanlı, Guzelyurt, Mersin 10, Turkey
  3. Department of Computer Engineering, Cyprus International University, Nicosia, Turkey
  4. Department of Software Engineering, University of City Island, Famagusta, Turkey
