Ensemble learning via constraint projection and undersampling technique for class-imbalance problem

  • Huaping Guo
  • Jun Zhou
  • Chang-an Wu

Abstract

Ensemble learning is an effective technique for the class-imbalance problem, and the key to obtaining a successful ensemble is to create individual base classifiers with high accuracy and diversity. In this paper, we propose a novel ensemble learning method based on constraint projection and an undersampling technique, which constructs each base classifier in two steps: (1) construct a set of pairwise constraints by undersampling examples from the minority and majority class sets, and learn a projection matrix from this constraint set; (2) undersample the original training set to obtain a new training set, on which a base classifier is trained in the new feature space defined by the projection matrix. In the first step, the projection matrix enhances the separability between examples of different classes, improving the performance of the base classifier, while the undersampling creates diverse constraint sets and therefore diverse projection matrices, introducing diversity among the base classifiers. In the second step, the undersampling improves the performance of base classifiers on the minority class and further increases the diversity between the individual base classifiers. Experimental results on 29 datasets with various data distributions and imbalance ratios show that the proposed method performs significantly better than state-of-the-art methods in terms of recall, g-mean, f-measure and AUC.
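
The abstract fixes the overall two-step construction but not the projection objective, the base learner, or the combination rule. The following is a minimal Python sketch under assumed choices: the constraint projection is realized as a generalized eigenproblem that spreads cannot-link (between-class) pairs while keeping must-link (within-class) pairs close, the base learner is a decision tree, the minority class is labeled 1, and predictions are combined by majority vote. The paper's exact formulation may differ on all of these points.

```python
# Sketch of the two-step ensemble described in the abstract.
# Assumptions (not specified in the abstract): generalized-eigenproblem
# constraint projection, decision-tree base learner, majority voting.
import numpy as np
from scipy.linalg import eigh
from sklearn.tree import DecisionTreeClassifier


def scatter(pairs):
    """Scatter matrix of difference vectors for a list of (x_i, x_j) pairs."""
    diffs = np.array([xi - xj for xi, xj in pairs])
    return diffs.T @ diffs / len(pairs)


def learn_projection(X_min, X_maj, n_pairs, n_dims, rng):
    """Step 1: undersample examples to form pairwise constraints and
    learn a projection matrix that separates the two classes."""
    # Cannot-link pairs: one minority example, one majority example.
    cl = [(X_min[rng.integers(len(X_min))], X_maj[rng.integers(len(X_maj))])
          for _ in range(n_pairs)]
    # Must-link pairs: two examples drawn from the same class.
    ml = [(X_min[rng.integers(len(X_min))], X_min[rng.integers(len(X_min))])
          for _ in range(n_pairs // 2)]
    ml += [(X_maj[rng.integers(len(X_maj))], X_maj[rng.integers(len(X_maj))])
           for _ in range(n_pairs // 2)]
    S_cl, S_ml = scatter(cl), scatter(ml)
    # Directions that spread cannot-link pairs while keeping must-link pairs
    # close: top eigenvectors of S_cl v = lambda (S_ml + eps I) v.
    d = S_cl.shape[0]
    _, vecs = eigh(S_cl, S_ml + 1e-6 * np.eye(d))
    return vecs[:, -n_dims:]          # projection matrix P, shape (d, n_dims)


def fit_ensemble(X, y, n_estimators=10, n_pairs=100, n_dims=None, seed=0):
    """Step 2: each member gets its own projection and its own balanced
    undersample of the training set."""
    rng = np.random.default_rng(seed)
    X_min, X_maj = X[y == 1], X[y == 0]   # assume 1 = minority, 0 = majority
    n_dims = n_dims or X.shape[1]
    members = []
    for _ in range(n_estimators):
        P = learn_projection(X_min, X_maj, n_pairs, n_dims, rng)
        # Undersample the majority class down to the minority-class size.
        idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
        X_bal = np.vstack([X_min, X_maj[idx]])
        y_bal = np.concatenate([np.ones(len(X_min)), np.zeros(len(X_min))])
        clf = DecisionTreeClassifier().fit(X_bal @ P, y_bal)
        members.append((P, clf))
    return members


def predict(members, X):
    """Combine base classifiers by majority vote (an assumed rule)."""
    votes = np.array([clf.predict(X @ P) for P, clf in members])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Because each ensemble member draws its own constraint pairs and its own balanced undersample, both the projection matrix and the training set vary across members, which is the source of diversity the abstract attributes to the two undersampling steps.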

Keywords

Ensemble learning · Constraint projection · Undersampling technique · Class-imbalance

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (No. 61802329), in part by Project of Science and Technology Department of Henan Province (No. 182102210132), in part by the Innovation Team Support Plan of the University of Science and Technology of Henan Province (No. 19IRTSTHN014), and in part by Nanhu Scholars Program for Young Scholars of XYNU.

Compliance with ethical standards

Conflict of interest

Huaping Guo declares that he has no conflict of interest. Jun Zhou declares that he has no conflict of interest. Chang-an Wu declares that he has no conflict of interest.

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. School of Computer and Information Technology, Xinyang Normal University, Xinyang, China
  2. Henan Key Lab. of Analysis and Applications of Education Big Data, Xinyang Normal University, Xinyang, China