Genetic Algorithm-Based Oversampling Technique to Learn from Imbalanced Data

  • Conference paper
Soft Computing for Problem Solving

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 816)

Abstract

The availability of data from applications such as surveillance systems, security appliances, and finance has been expanding continuously. Many machine learning (ML) and data mining models have shown promise in learning from such data. However, learning an ML classifier from imbalanced data remains challenging; this is often referred to as the imbalanced learning problem. Here, far more information is available about the majority classes than about the minority classes. In such a setting, the classifier over-fits to the majority classes and under-fits to the minority classes during training. Distance-based strategies such as SMOTE, which oversample the minority classes by using nearest-neighbor samples from the available data, have been quite useful. In this paper, we propose employing a genetic algorithm (GA) that essentially learns the probability distribution of the available data in order to generate minority class samples for binary classification problems. We validate and test the proposed oversampling strategy by training three different kinds of classifiers. A comparative analysis of SMOTE-based oversampling and the proposed GA-based oversampling shows promising results on ten popular imbalanced datasets.
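
To make the distance-based baseline concrete, the following is a minimal sketch of SMOTE-style interpolation between nearest minority neighbors. It is an illustration under stated assumptions, not the implementation evaluated in the paper and not the proposed GA-based method: the function name `smote_like_oversample`, the parameters `k` and `seed`, and the toy data are all hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbours.

    Illustrative sketch only; names and defaults are assumptions.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point lies on a line segment between two existing minority samples, this kind of oversampling never leaves the region spanned by the observed data. The abstract contrasts this with generating samples from a learned probability distribution.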

Notes

  1. We intentionally avoid common GA terminology such as population, chromosome, crossover, and mutation, without any loss of generality in understanding the methodology.

Author information

Corresponding author

Correspondence to Tirtharaj Dash.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Saladi, P.S.M., Dash, T. (2019). Genetic Algorithm-Based Oversampling Technique to Learn from Imbalanced Data. In: Bansal, J., Das, K., Nagar, A., Deep, K., Ojha, A. (eds) Soft Computing for Problem Solving. Advances in Intelligent Systems and Computing, vol 816. Springer, Singapore. https://doi.org/10.1007/978-981-13-1592-3_30
