An Empirical Study of Oversampling and Undersampling for Instance Selection Methods on Imbalance Datasets

  • Julio Hernandez
  • Jesús Ariel Carrasco-Ochoa
  • José Francisco Martínez-Trinidad
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8258)

Abstract

Instance selection methods achieve low accuracy on imbalanced databases. In the literature, class imbalance has been tackled by applying oversampling or undersampling methods. In this paper, we therefore present an empirical study of the use of oversampling and undersampling methods to improve the accuracy of instance selection methods on imbalanced databases. We apply different oversampling and undersampling methods jointly with instance selectors on several public imbalanced databases. Our experimental results show that using oversampling and undersampling methods significantly improves accuracy for the minority class.
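
The abstract does not say which resampling methods or instance selectors were combined, so the following is only a minimal sketch of the general idea. It assumes random oversampling as the resampler and a Wilson-style edited nearest neighbour rule as the instance selector. The function names (`random_oversample`, `enn_select`) and the toy data are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen instances until every class is as large as the largest one.
    (Assumed resampler; the abstract does not name a specific method.)"""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def enn_select(X, y, k=3):
    """Edited-nearest-neighbour style selector (assumed for illustration):
    keep an instance only if the majority label of its k nearest neighbours matches its own."""
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        others = [j for j in range(len(X)) if j != i]
        neighbours = sorted(others, key=lambda j: sq_dist(xi, X[j]))[:k]
        votes = Counter(y[j] for j in neighbours)
        if votes.most_common(1)[0][0] == yi:
            keep_X.append(xi)
            keep_y.append(yi)
    return keep_X, keep_y


# Hypothetical toy data: 8 majority instances (label 0), 2 minority instances (label 1).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (2, 2), (2, 1), (1, 2),
     (2.5, 2.5), (3, 3)]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Instance selection applied directly to the imbalanced data.
_, y_plain = enn_select(X, y)
minority_plain = Counter(y_plain)[1]

# Oversampling first, then instance selection.
X_bal, y_bal = random_oversample(X, y)
_, y_combined = enn_select(X_bal, y_bal)
minority_combined = Counter(y_combined)[1]

print("minority instances kept without resampling:", minority_plain)
print("minority instances kept with oversampling:", minority_combined)
```

On this toy data, the selector alone discards both minority instances because their neighbourhoods are dominated by the majority class, while oversampling first lets the minority instances survive selection. This only illustrates the motivation stated in the abstract; it does not reproduce the study's experiments.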

Keywords

supervised classification; instance selection; oversampling; undersampling; imbalanced datasets

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Julio Hernandez (1)
  • Jesús Ariel Carrasco-Ochoa (1)
  • José Francisco Martínez-Trinidad (1)
  1. Computer Science Department, Instituto Nacional de Astrofísica Óptica y Electrónica, Puebla, Mexico
