Abstract
The availability of data from applications such as surveillance systems, security appliances, and finance has been expanding continuously. Many machine learning (ML) and data mining models have shown promise in learning from such data. However, learning an ML classifier from imbalanced data remains challenging; this is often referred to as the imbalanced learning problem. In this problem, far more information is available for the majority classes than for the minority classes. In such a learning environment, the classifier over-fits to the majority classes and under-fits to the minority classes during training. Distance-based strategies such as SMOTE, which generate new samples from the nearest neighbors among the available samples, have been quite useful for oversampling the minority classes. In this paper, we propose employing a genetic algorithm (GA) that learns the probability distribution of the available data in order to generate minority class samples for binary classification problems. We validate and test the proposed oversampling strategy by training three different kinds of classifiers. A comparative analysis of SMOTE-based oversampling and the proposed GA-based oversampling shows promising results on ten popular imbalanced datasets.
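As a rough illustration of the distance-based oversampling that the abstract uses as its baseline, the following sketch interpolates between a minority sample and one of its nearest minority neighbors, in the spirit of SMOTE. It is not the GA-based method proposed in the chapter, whose details are not given here. The function name `smote_like_oversample`, its parameters, and the toy data are assumptions made only for this example.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points built by SMOTE-style interpolation.

    Hypothetical helper for illustration; not the chapter's GA method.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base sample (excluding itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbor)])
    return synthetic


# Toy minority class with two features (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1]]
new_points = smote_like_oversample(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already covered by the minority class. This is the kind of locality that a distribution-learning approach would aim to relax.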
Notes
1. We intentionally avoid common GA terminology such as population, chromosome, crossover, and mutation; this causes no loss of generality in understanding the methodology.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Saladi, P.S.M., Dash, T. (2019). Genetic Algorithm-Based Oversampling Technique to Learn from Imbalanced Data. In: Bansal, J., Das, K., Nagar, A., Deep, K., Ojha, A. (eds) Soft Computing for Problem Solving. Advances in Intelligent Systems and Computing, vol 816. Springer, Singapore. https://doi.org/10.1007/978-981-13-1592-3_30
DOI: https://doi.org/10.1007/978-981-13-1592-3_30
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1591-6
Online ISBN: 978-981-13-1592-3
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)