The Impact of Small Disjuncts on Classifier Learning

Data Mining

Part of the book series: Annals of Information Systems (AOIS, volume 8)

Abstract

Many classifier induction systems express the induced classifier as a disjunctive description. Small disjuncts are those that classify few training examples. These disjuncts are interesting because they are known to have a much higher error rate than large disjuncts and are responsible for many, if not most, classification errors. Previous research has investigated this phenomenon by performing ad hoc analyses of a small number of data sets. In this chapter we provide a much more systematic study of small disjuncts and analyze how they affect classifiers induced from 30 real-world data sets. A new metric, error concentration, is used to show that for these 30 data sets classification errors are often heavily concentrated toward the smaller disjuncts. Various factors, including pruning, training set size, noise, and class imbalance, are then analyzed to determine how they affect small disjuncts and the distribution of errors across disjuncts. This analysis provides many insights into why some data sets are difficult to learn from and also provides a better understanding of classifier learning in general. We believe that such an understanding is critical to the development of improved classifier induction algorithms.
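The error-concentration idea in the abstract can be sketched in code. The function below is a hypothetical illustration of the general technique, not the chapter's exact definition: disjuncts are ordered from smallest to largest, the curve of cumulative error fraction versus cumulative correct-classification fraction is traced, and the area under that curve is rescaled so that a uniform spread of errors gives 0 and errors piled entirely into the smallest disjuncts give a value near 1.

```python
from typing import List, Tuple

def error_concentration(disjuncts: List[Tuple[int, int]]) -> float:
    """Sketch of an error-concentration-style metric (illustrative only).

    Each disjunct is a (correct, errors) pair: how many examples it
    classifies correctly and incorrectly.  A value near +1 means errors
    are concentrated in the smallest disjuncts; near 0 means errors are
    spread evenly; negative means the larger disjuncts err more.
    """
    # Order disjuncts by size (examples covered), smallest first.
    ordered = sorted(disjuncts, key=lambda d: d[0] + d[1])
    total_correct = sum(c for c, _ in ordered) or 1
    total_errors = sum(e for _, e in ordered) or 1

    # Trace cumulative %errors vs. cumulative %correct and accumulate
    # the area under the curve with the trapezoid rule.
    area = 0.0
    cum_c = cum_e = 0.0
    prev_x = prev_y = 0.0
    for c, e in ordered:
        cum_c += c / total_correct
        cum_e += e / total_errors
        area += (cum_c - prev_x) * (prev_y + cum_e) / 2.0
        prev_x, prev_y = cum_c, cum_e

    # Rescale: uniform error distribution -> 0, maximal concentration -> ~1.
    return 2.0 * area - 1.0
```

For example, a tiny disjunct holding every error alongside two large, error-free disjuncts yields a concentration close to 1, while three identical disjuncts with equal error rates yield 0.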



Author information

Correspondence to Gary M. Weiss.


Copyright information

© 2010 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Weiss, G.M. (2010). The Impact of Small Disjuncts on Classifier Learning. In: Stahlbock, R., Crone, S., Lessmann, S. (eds) Data Mining. Annals of Information Systems, vol 8. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-1280-0_9


  • Print ISBN: 978-1-4419-1279-4

  • Online ISBN: 978-1-4419-1280-0

  • eBook Packages: Computer Science, Computer Science (R0)
