Abstract
This paper proposes a method for generating classifiers from large datasets by building a committee of simple base classifiers using a standard boosting algorithm. It permits the processing of large datasets even if the underlying base learning algorithm cannot efficiently do so. The basic idea is to split the incoming data into chunks and form a committee from classifiers trained on the individual chunks. Our method extends earlier work by introducing a procedure for adaptively pruning the committee. This is essential when applying the algorithm in practice because it dramatically reduces the algorithm's running time and memory consumption. It also makes it possible to efficiently "race" committees corresponding to different chunk sizes. This is important because our empirical results show that the accuracy of the resulting committee can vary significantly with the chunk size. They also show that pruning is indeed crucial to make the method practical for large datasets in terms of running time and memory requirements. Surprisingly, the results also demonstrate that pruning can improve accuracy.
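To make the idea concrete, the following minimal Python sketch shows one way chunk-based committee building with adaptive pruning and a race over chunk sizes might be organised. It is an illustration only: every name here (Committee, build_committee, prune, race, and so on) is invented for this sketch, the uniform-weight vote stands in for the boosting weights the paper actually uses, and the greedy holdout-based pruning rule is an assumed stand-in for the paper's adaptive pruning criterion.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Committee:
    """A weighted committee of base classifiers built from data chunks."""
    members: List[Tuple[Callable, float]] = field(default_factory=list)

    def predict(self, x):
        # Weighted vote over all members; the paper derives weights from a
        # standard boosting algorithm, whereas here they are uniform.
        votes = {}
        for clf, weight in self.members:
            label = clf(x)
            votes[label] = votes.get(label, 0.0) + weight
        return max(votes, key=votes.get)

def chunks(stream, chunk_size):
    """Split an incoming data stream into fixed-size chunks."""
    chunk = []
    for example in stream:
        chunk.append(example)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def accuracy(committee, holdout):
    """Fraction of labelled holdout examples the committee gets right."""
    return sum(committee.predict(x) == y for x, y in holdout) / len(holdout)

def prune(committee, holdout, margin=0.0):
    """Greedily drop members whose removal costs at most `margin` holdout
    accuracy (an assumed stand-in for the paper's pruning criterion)."""
    base = accuracy(committee, holdout)
    kept = []
    for i, member in enumerate(committee.members):
        trial = Committee(kept + committee.members[i + 1:])
        if trial.members and accuracy(trial, holdout) >= base - margin:
            continue  # removal is (nearly) harmless: prune this member
        kept.append(member)
    return Committee(kept)

def build_committee(stream, chunk_size, learn, holdout):
    """Train one base classifier per chunk, pruning as the committee grows."""
    committee = Committee()
    for chunk in chunks(stream, chunk_size):
        committee.members.append((learn(chunk), 1.0))
        committee = prune(committee, holdout)
    return committee

def race(stream_factory, chunk_sizes, learn, holdout):
    """Compare finished committees, one per chunk size; the paper instead
    interleaves their construction and drops losing chunk sizes early."""
    best = None
    for size in chunk_sizes:
        committee = build_committee(stream_factory(), size, learn, holdout)
        acc = accuracy(committee, holdout)
        if best is None or acc > best[0]:
            best = (acc, size, committee)
    return best

Because pruning runs after every chunk, the committee's size (and hence its prediction cost and memory footprint) stays bounded throughout training, which is what makes racing several chunk sizes side by side affordable.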
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Frank, E., Holmes, G., Kirkby, R., Hall, M. (2002). Racing Committees for Large Datasets. In: Lange, S., Satoh, K., Smith, C.H. (eds) Discovery Science. DS 2002. Lecture Notes in Computer Science, vol 2534. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36182-0_15
DOI: https://doi.org/10.1007/3-540-36182-0_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00188-1
Online ISBN: 978-3-540-36182-4
eBook Packages: Springer Book Archive