A memetic approach for training set selection in imbalanced data sets

  • Bahareh Nikpour
  • Hossein Nezamabadi-pour (corresponding author)
Original Article

Abstract

Imbalanced data classification is a challenging problem in machine learning. It arises when samples are unevenly distributed among the classes, a setting in which classical classifiers perform poorly. To address this problem, this paper selects the best training samples from the available data, with the goal of improving classifier performance on imbalanced data. To this end, several heuristic methods are presented that use local information to indicate whether each training sample should be removed or retained. These methods are then treated as local search algorithms and combined with a global search algorithm in a single framework to form memetic algorithms. The global search used here is the binary quantum-inspired gravitational search algorithm (BQIGSA), a recent metaheuristic for binary-encoded optimization problems. BQIGSA is employed because the problem calls for a highly stochastic search algorithm. Six local search algorithms are proposed: three are application-oriented and were designed specifically for this problem, and the other three are general-purpose. The best local search is then determined. Experiments are performed on 45 standard data sets, with G-mean and AUC as evaluation criteria. The same data sets are then used to compare the best memetic approaches with several popular state-of-the-art algorithms and with a recently proposed memetic algorithm, and the results show the superiority of the proposed approaches. Finally, the performance of the proposed algorithm is evaluated with four different classifiers, and the best classifier for use with this method is determined.
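
To make the idea of training set selection concrete, the following minimal Python sketch encodes a candidate training subset as a binary mask (1 = retain, 0 = remove) and scores it with G-mean, the geometric mean of the true positive rate and the true negative rate. Everything in it is an illustrative assumption rather than the authors' method: the function names, the 1-nearest-neighbour classifier, the separate validation set, and the plain bit-flip hill climbing. The hill climbing merely stands in for a local search; it is neither BQIGSA nor any of the six local searches mentioned in the abstract.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)


def one_nn_predict(train_X, train_y, x):
    """Label of the nearest training sample (squared Euclidean distance)."""
    nearest = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[nearest]


def fitness(mask, X_train, y_train, X_val, y_val):
    """G-mean on the validation set of a 1-NN built from the retained samples."""
    kept_X = [x for x, keep in zip(X_train, mask) if keep]
    kept_y = [y for y, keep in zip(y_train, mask) if keep]
    if not kept_X:  # an empty training set cannot classify anything
        return 0.0
    preds = [one_nn_predict(kept_X, kept_y, x) for x in X_val]
    return g_mean(y_val, preds)


def bit_flip_hill_climb(mask, X_train, y_train, X_val, y_val):
    """Hypothetical local search: flip one bit at a time, keep strict improvements."""
    mask = list(mask)
    best = fitness(mask, X_train, y_train, X_val, y_val)
    improved = True
    while improved:
        improved = False
        for i in range(len(mask)):
            mask[i] = 1 - mask[i]          # tentatively remove or restore sample i
            score = fitness(mask, X_train, y_train, X_val, y_val)
            if score > best:
                best, improved = score, True
            else:
                mask[i] = 1 - mask[i]      # undo the flip
    return mask, best


# Toy one-dimensional data: four majority samples (0) and one minority sample (1).
X_train = [(0.0,), (0.1,), (0.2,), (0.6,), (1.0,)]
y_train = [0, 0, 0, 0, 1]
X_val = [(0.75,), (0.9,), (0.05,), (0.15,)]
y_val = [1, 1, 0, 0]

start = [1, 1, 1, 1, 1]  # retain every training sample
selected, score = bit_flip_hill_climb(start, X_train, y_train, X_val, y_val)
```

In this toy setting the search removes the majority sample at 0.6, which sits close to the minority region and causes one minority validation point to be misclassified. In a memetic scheme of the kind the abstract describes, a refinement step like this would be applied to candidate masks produced by the global search, and an AUC-based score could replace G-mean in the fitness function.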

Keywords

Imbalanced data; Under-sampling methods; Training set selection; Metaheuristics; Memetic algorithms; Binary quantum-inspired gravitational search algorithm

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. Intelligent Data Processing Laboratory (IDPL), Department of Electrical Engineering, Shahid Bahonar University of Kerman, Kerman, Iran
  2. Mahani Mathematical Research Center, Shahid Bahonar University of Kerman, Kerman, Iran
