Abstract
Learning from unbalanced datasets presents a challenging problem in which traditional learning algorithms may perform poorly. The objective functions used to learn such classifiers typically tend to favor the larger, less important classes. This paper compares the performance of several popular decision tree splitting criteria (information gain, the Gini measure, and DKM) and identifies Hellinger distance as a new skew-insensitive measure. We outline the strengths of Hellinger distance under class imbalance, propose its application in constructing decision trees, and perform a comprehensive comparative analysis of the resulting decision tree construction methods. In addition, we consider the performance of each tree within a powerful sampling wrapper framework to capture the interaction between splitting metric and sampling. We evaluate these methods over a wide range of datasets and determine which operate best under class imbalance.
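To make the criterion concrete: the Hellinger-distance measure scores a candidate split by the distance between the class-conditional distributions it induces over the branches. Because each branch count is normalized by its own class size, the class priors cancel out, which is what makes the measure skew-insensitive. The following is a minimal Python sketch of this idea for a binary class problem and a two-way numeric split; the function name, the label convention, and the exhaustive threshold search are illustrative assumptions, not the paper's implementation.

import numpy as np

def hellinger_split_value(feature, labels, threshold):
    """Hellinger distance between the two class-conditional distributions
    induced by the split feature <= threshold vs. feature > threshold.
    A minimal sketch, assuming binary labels with 1 as the minority class."""
    pos, neg = labels == 1, labels == 0
    left = feature <= threshold
    total = 0.0
    for branch in (left, ~left):
        # Fraction of each class routed into this branch; the class priors
        # never enter, which is the source of the skew insensitivity.
        # (The max(..., 1) guards against an absent class.)
        p = (branch & pos).sum() / max(pos.sum(), 1)  # P(branch | class 1)
        n = (branch & neg).sum() / max(neg.sum(), 1)  # P(branch | class 0)
        total += (np.sqrt(p) - np.sqrt(n)) ** 2
    return np.sqrt(total)

# Illustrative usage: choose the threshold maximizing the Hellinger
# distance, where information gain or the Gini measure would otherwise
# be maximized.
x = np.array([0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 2.0, 2.2])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
best = max(np.unique(x), key=lambda t: hellinger_split_value(x, y, t))

Because the value depends only on the within-class branching rates, a criterion built this way prefers the same splits no matter how rare the minority class is, whereas prior-dependent criteria such as information gain and the Gini measure shift as the class distribution becomes more skewed.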
Cite this paper
Cieslak, D.A., Chawla, N.V.: Learning Decision Trees for Unbalanced Data. In: Daelemans, W., Goethals, B., Morik, K. (eds.) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2008. Lecture Notes in Computer Science, vol. 5211. Springer, Berlin, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87479-9_34