Abstract
A tacit assumption in classifier induction is that the class distribution of the training set must match that of the test set. A direct implementation of this assumption is to retrain the model on a data set with the matching class distribution every time the operating condition changes (the matching-model approach). The alternative is to adapt the decision rule of a previously trained model to the new operating condition; this single-model approach is commonly used and recommended by many researchers. In this paper, we argue, with empirical support from decision trees, that learning from the matching class distribution is desirable. We also make explicit the differences and limitations of the two methods used in the single-model approach: rescaling and thresholding.
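The two single-model adjustments named in the abstract can be illustrated with a short sketch. The function names and the Bayes-rule prior correction below are illustrative assumptions, not taken from the paper; they show how rescaling a model's probability estimates and moving the decision threshold are two views of the same shift in class prior.

```python
def corrected_score(p_train, train_pos_prior, test_pos_prior):
    """Rescaling view (illustrative, not the paper's notation):
    reweight a model's estimated P(positive | x), learned under
    train_pos_prior, to the new operating prior test_pos_prior
    via Bayes' rule."""
    num = p_train * (test_pos_prior / train_pos_prior)
    den = num + (1 - p_train) * ((1 - test_pos_prior) / (1 - train_pos_prior))
    return num / den


def thresholded_label(p_train, train_pos_prior, test_pos_prior):
    """Thresholding view: leave the raw score untouched and move the
    0.5 cutoff instead of rescaling every probability."""
    threshold = train_pos_prior * (1 - test_pos_prior) / (
        train_pos_prior * (1 - test_pos_prior)
        + (1 - train_pos_prior) * test_pos_prior
    )
    return int(p_train >= threshold)
```

Under these assumptions the two views assign the same label: `corrected_score(p, pi, pi_new) >= 0.5` exactly when `p` clears the moved threshold. Neither adjustment, however, changes the model structure itself, which is the limitation the paper's matching-model argument targets.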
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Ting, K.M. (2004). Matching Model Versus Single Model: A Study of the Requirement to Match Class Distribution Using Decision Trees. In: Boulicaut, JF., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Machine Learning: ECML 2004. ECML 2004. Lecture Notes in Computer Science(), vol 3201. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30115-8_40
Print ISBN: 978-3-540-23105-9
Online ISBN: 978-3-540-30115-8