Abstract
Sampling methods are a direct approach to tackling the problem of class imbalance. These methods sample a data set in order to alter its class distribution, usually to obtain a more balanced one. An open question about sampling methods is which distribution, if any, provides the best results. In this work we carry out a broad empirical study aiming to provide more insight into this question. Our results suggest that altering the class distribution can improve classification performance when AUC is used as the performance metric. Furthermore, as a general recommendation, random over-sampling to balance the class distribution is a good starting point for dealing with class imbalance.
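To make the recommendation concrete, the following is a minimal sketch of random over-sampling to a balanced class distribution, using only the Python standard library. The function name `random_oversample`, the `seed` parameter, and the toy data are illustrative assumptions; this is not the experimental setup used in the paper.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Hypothetical helper: duplicate randomly chosen examples of the smaller
    classes (sampling with replacement) until every class has as many
    examples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the majority class

    # Start from the original data; only duplicates are appended.
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Indices of the original examples that belong to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(idx)
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


# Toy data (assumed for illustration): four majority examples, one minority.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

In this sketch the original examples are kept unchanged and only duplicates of minority-class examples are appended. In an evaluation, over-sampling of this kind would typically be applied to the training split only, so that duplicated examples do not leak into the test data.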
Copyright information
© 2008 International Federation for Information Processing
About this paper
Cite this paper
Prati, R.C., Batista, G.E.A.P.A., Monard, M.C. (2008). A Study with Class Imbalance and Random Sampling for a Decision Tree Learning System. In: Bramer, M. (eds) Artificial Intelligence in Theory and Practice II. IFIP AI 2008. IFIP – The International Federation for Information Processing, vol 276. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09695-7_13
DOI: https://doi.org/10.1007/978-0-387-09695-7_13
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-09694-0
Online ISBN: 978-0-387-09695-7
eBook Packages: Computer Science, Computer Science (R0)