
An Empirical Study of Oversampling and Undersampling Methods for LCMine an Emerging Pattern Based Classifier

  • Octavio Loyola-González
  • Milton García-Borroto
  • Miguel Angel Medina-Pérez
  • José Fco. Martínez-Trinidad
  • Jesús Ariel Carrasco-Ochoa
  • Guillermo De Ita
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7914)

Abstract

Classifiers based on emerging patterns are usually easier for humans to understand than those based on more complex mathematical models. However, most classifiers based on emerging patterns attain low accuracy on problems with imbalanced databases. This problem has been tackled through oversampling and undersampling methods; nevertheless, to the best of our knowledge, these methods have not been tested with classifiers based on emerging patterns. Therefore, in this paper, we present an empirical study on the use of oversampling and undersampling methods to improve the accuracy of a classifier based on emerging patterns. We apply the most popular oversampling and undersampling methods to 30 databases from the UCI Repository of Machine Learning. Our experimental results show that using oversampling and undersampling methods significantly improves the accuracy of the classifier on the minority class.
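The resampling step studied in the paper can be illustrated with a short sketch. The Python/NumPy code below is a minimal, hypothetical example and not the authors' implementation (LCMine itself is not shown): it rebalances a two-class training set by random oversampling (duplicating minority-class examples) and random undersampling (discarding majority-class examples) before any classifier is trained. The function names random_oversample and random_undersample are illustrative assumptions; SMOTE and the other methods compared in the paper follow the same overall scheme but generate or select examples differently.

```python
import numpy as np

def random_oversample(X, y, seed=None):
    """Duplicate randomly chosen minority-class examples until both classes
    have as many examples as the majority class (random oversampling)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def random_undersample(X, y, seed=None):
    """Discard randomly chosen majority-class examples until both classes
    have as many examples as the minority class (random undersampling)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    majority_idx = np.flatnonzero(y == majority)
    kept_majority = rng.choice(majority_idx, size=counts.min(), replace=False)
    keep = np.concatenate([np.flatnonzero(y == minority), kept_majority])
    return X[keep], y[keep]

# Toy imbalanced dataset: 90 majority (class 0) vs. 10 minority (class 1) examples.
X = np.random.default_rng(0).normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)

X_over, y_over = random_oversample(X, y, seed=0)      # class counts become [90, 90]
X_under, y_under = random_undersample(X, y, seed=0)   # class counts become [10, 10]
print(np.bincount(y_over), np.bincount(y_under))
```

In a setup like the one described in the abstract, either rebalanced training set would then be given to the emerging pattern based classifier in place of the original imbalanced data.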

Keywords

supervised classification · emerging patterns · imbalanced databases · oversampling · undersampling

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Octavio Loyola-González (1, 2)
  • Milton García-Borroto (1)
  • Miguel Angel Medina-Pérez (2)
  • José Fco. Martínez-Trinidad (2)
  • Jesús Ariel Carrasco-Ochoa (2)
  • Guillermo De Ita (3)

  1. Centro de Bioplantas, Universidad de Ciego de Ávila, Ciego de Ávila, Cuba
  2. Instituto Nacional de Astrofísica, Óptica y Electrónica, Sta. María Tonanzintla, México
  3. Faculty of Computer Science, Benemérita Universidad Autónoma de Puebla, México
