
Modelling Classification Performance for Large Data Sets

An Empirical Study

  • Conference paper
  • First Online:
Advances in Web-Age Information Management (WAIM 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2118)

Included in the conference series: WAIM (Web-Age Information Management)

Abstract

For many learning algorithms, learning accuracy increases with the size of the training data, forming the well-known learning curve. A learning curve is usually fitted by interpolating or extrapolating some points on it with a specified model. The fitted curve can then be used to predict the maximum achievable learning accuracy, or to estimate how much data is needed to reach a desired accuracy; both uses are especially valuable for data mining on large data sets. Although several models have been proposed for learning curves, most have not been tested for applicability to large data sets. In this paper, we focus on this issue. We empirically compare six potentially useful models by fitting the learning curves of two typical classification algorithms, C4.5 (decision tree) and LOG (logistic discrimination), on eight large UCI benchmark data sets. Using all available data for learning, we fit a full-length learning curve; using a small portion of the data, we fit a part-length learning curve. The models are then compared on two criteria: (1) how well they fit a full-length learning curve, and (2) how well a fitted part-length learning curve predicts learning accuracy at the full length. Experimental results show that the power law (y = a - b * x^(-c)) is the best of the six models on both criteria, for both algorithms and all the data sets. These results support the applicability of learning curves to data mining.
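To make the fitting procedure concrete, the sketch below fits the power-law model named in the abstract, y = a - b * x^(-c), to a part-length learning curve and extrapolates it to the full data size. It is a minimal illustration in Python using NumPy and SciPy (the paper does not specify an implementation); the training-set sizes and accuracies are invented for the example and are not the paper's measurements.

    # Fit the power-law learning-curve model y = a - b * x**(-c) by
    # nonlinear least squares, then extrapolate it. Illustrative sketch
    # only: the data points below are invented, not the paper's results.
    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(x, a, b, c):
        # a: asymptotic (maximum achievable) accuracy
        # b, c: shape parameters controlling how fast the curve saturates
        return a - b * np.power(x, -c)

    # Hypothetical part-length learning curve: training-set sizes and the
    # classifier's measured accuracy at each size.
    sizes = np.array([500.0, 1000.0, 2000.0, 4000.0, 8000.0])
    accuracy = np.array([0.71, 0.76, 0.80, 0.83, 0.85])

    # p0 supplies rough starting values for the iterative fit.
    (a, b, c), _ = curve_fit(power_law, sizes, accuracy, p0=(0.9, 1.0, 0.5))

    print(f"fitted curve: y = {a:.3f} - {b:.3f} * x^(-{c:.3f})")
    # Predicted accuracy if training continued on the full data set,
    # i.e. the part-length-to-full-length prediction the paper evaluates.
    print(f"predicted accuracy at n=100000: {power_law(100000.0, a, b, c):.3f}")
    # The asymptote a estimates the maximum achievable accuracy.
    print(f"estimated asymptotic accuracy: {a:.3f}")

The same routine applies to the other candidate models by swapping in their functional forms; goodness of fit on the full-length curve and prediction error at the full length correspond to the two criteria the paper compares.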


References

  1. S. Amari, N. Fujita, and S. Shinomoto. Four types of learning curves. Neural Computation, 4(4):605–618, 1992.

  2. Y. Bard. Nonlinear Parameter Estimation. Academic Press, 1974.

  3. C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. (http://www.ics.uci.edu/~mlearn/MLRepository.html).

  4. C. Cortes. Prediction of Generalization Ability in Learning Machines. PhD thesis, Department of Computer Science, University of Rochester, New York, 1993.

  5. L.J. Frey and D.H. Fisher. Modeling decision tree performance with the power law. In Proceedings of the Seventh International Workshop on Artificial Intelligence and Statistics. Morgan Kaufmann, 1999.

  6. B.H. Gu, F.F. Hu, and H. Liu. An empirical study of fitting learning curves. Technical report, Department of Computer Science, National University of Singapore, 2001.

  7. H. Gu and H. Takahashi. Exponential or polynomial learning curves? Case-based studies. Neural Computation, 12(4):795–809, 2000.

  8. G.H. John and P. Langley. Static versus dynamic sampling for data mining. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96). AAAI/MIT Press, 1996.

  9. C.M. Kadie. Seer: Maximum Likelihood Regression for Learning-Speed Curves. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1995.

  10. T. Lim, W. Loh, and Y. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 1999.

  11. T.-S. Lim. Users' guide for logdiscr version 2.0, 1999. (http://recursive-partitioning.com/logdiscr/).

  12. D. Michie, D.J. Spiegelhalter, and C.C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, Hemel Hempstead, England, 1994.

  13. F. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99), pages 23–32. AAAI/MIT Press, 1999.

  14. F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Machine Learning, pages 1–42, 1999.

  15. J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993. (http://www.cse.unsw.edu.au/quinlan/).

  16. D.A. Ratkowsky. Handbook of Nonlinear Regression Models. Marcel Dekker, Inc., 1990.

  17. G.A.F. Seber and C.J. Wild. Nonlinear Regression. John Wiley & Sons, 1989.

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gu, B., Hu, F., Liu, H. (2001). Modelling Classification Performance for Large Data Sets. In: Wang, X.S., Yu, G., Lu, H. (eds) Advances in Web-Age Information Management. WAIM 2001. Lecture Notes in Computer Science, vol 2118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47714-4_29

  • DOI: https://doi.org/10.1007/3-540-47714-4_29

  • Published:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42298-3

  • Online ISBN: 978-3-540-47714-3

  • eBook Packages: Springer Book Archive
