Frontiers of Computer Science, Volume 12, Issue 2, pp 331–350

Evolutionary under-sampling based bagging ensemble method for imbalanced data classification

  • Bo Sun
  • Haiyan Chen
  • Jiandong Wang
  • Hua Xie
Research Article

Abstract

In class-imbalanced learning, traditional machine learning algorithms that optimize overall accuracy tend to classify poorly, especially on the minority class, which is usually the class of most interest. Many effective approaches have been proposed to address this problem. Among them, bagging ensemble methods that integrate under-sampling techniques have demonstrated better performance than several alternatives, including bagging ensembles integrated with over-sampling techniques and cost-sensitive methods. These under-sampling techniques promote diversity among the generated base classifiers by randomly partitioning or sampling the majority class, but they take no measure to ensure the classification performance of each individual base classifier, which limits the achievable ensemble performance. Evolutionary under-sampling (EUS), a novel under-sampling technique, has been successfully applied to search for the best majority class subset for training a well-performing nearest neighbor classifier. Inspired by EUS, in this paper we introduce it into the under-sampling bagging framework and propose an EUS-based bagging ensemble method, EUS-Bag, by designing a new fitness function that considers three factors to make EUS better suited to the framework. With this fitness function, EUS-Bag can generate a set of accurate and diverse base classifiers. To verify the effectiveness of EUS-Bag, we conduct a series of comparison experiments on 22 two-class imbalanced classification problems. Experimental results measured using recall, geometric mean, and AUC all demonstrate its superior performance.
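The abstract does not state the three factors of the EUS-Bag fitness function, so the following is only a minimal sketch of the random under-sampling bagging baseline that the abstract contrasts with, not EUS-Bag itself. The function names, the toy Gaussian data, the 1-NN base learner, and the majority vote are all assumptions made for illustration. The sketch also shows how recall and the geometric mean (the square root of recall times specificity) could be computed.

```python
import math
import random


def make_toy_data(n_major, n_minor, seed):
    """Hypothetical 2-D data: majority class (label 0) near (0, 0), minority (label 1) near (2, 2)."""
    rng = random.Random(seed)
    major = [((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)), 0) for _ in range(n_major)]
    minor = [((rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)), 1) for _ in range(n_minor)]
    return major + minor


def under_sampling_bagging(data, n_estimators=11, seed=0):
    """Build balanced training subsets: all minority examples plus an equally
    sized random sample (without replacement) of the majority class.
    Assumes the majority class is at least as large as the minority class."""
    rng = random.Random(seed)
    minority = [p for p in data if p[1] == 1]
    majority = [p for p in data if p[1] == 0]
    return [minority + rng.sample(majority, len(minority)) for _ in range(n_estimators)]


def nearest_neighbor_predict(subset, x):
    """1-NN base learner: label of the training point closest to x."""
    closest = min(subset, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return closest[1]


def ensemble_predict(subsets, x):
    """Majority vote over the base learners (an odd ensemble size avoids ties)."""
    votes = sum(nearest_neighbor_predict(s, x) for s in subsets)
    return 1 if 2 * votes > len(subsets) else 0


def recall_and_gmean(y_true, y_pred):
    """Minority-class recall and geometric mean of recall and specificity.
    Assumes both classes occur in y_true."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    recall = tp / positives
    specificity = tn / negatives
    return recall, math.sqrt(recall * specificity)


train_set = make_toy_data(200, 20, seed=1)
test_set = make_toy_data(100, 10, seed=2)
subsets = under_sampling_bagging(train_set, n_estimators=11, seed=0)
y_true = [label for _, label in test_set]
y_pred = [ensemble_predict(subsets, x) for x, _ in test_set]
recall, gmean = recall_and_gmean(y_true, y_pred)
```

In this baseline each majority subset is drawn purely at random, which is the limitation the abstract points to: nothing steers a subset toward producing an accurate base classifier. An evolutionary search would replace the `rng.sample` step with an optimized subset selection.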

Keywords

class imbalanced problem, under-sampling, bagging, evolutionary under-sampling, ensemble learning, machine learning, data mining

Notes

Acknowledgements

We would like to express our gratitude to both the associate editor and the anonymous reviewers for their constructive comments, which greatly improved the quality of our manuscript. This work was supported by the National Natural Science Foundation of China (Grant No. 61501229) and the Fundamental Research Funds for the Central Universities (NS2015091, NS2014067, NJ20160013).

Supplementary material

11704_2016_5306_MOESM1_ESM.ppt (350 kb)

Copyright information

© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
  2. National Key Lab of ATFM, Nanjing University of Aeronautics and Astronautics, Nanjing, China