Abstract
This paper proposes a method for generating classifiers from large datasets by building a committee of simple base classifiers using a standard boosting algorithm. It permits the processing of large datasets even if the underlying base learning algorithm cannot efficiently do so. The basic idea is to split the incoming data into chunks and form a committee from classifiers trained on the individual chunks. Our method extends earlier work by introducing a procedure for adaptively pruning the committee. This is essential when applying the algorithm in practice because it dramatically reduces the algorithm's running time and memory consumption. It also makes it possible to efficiently "race" committees corresponding to different chunk sizes. This is important because our empirical results show that the accuracy of the resulting committee can vary significantly with the chunk size. They also show that pruning is indeed crucial to make the method practical for large datasets in terms of running time and memory requirements. Surprisingly, the results also demonstrate that pruning can improve accuracy.
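To make the idea concrete, the following minimal Python sketch shows one way chunk-based committee building with adaptive pruning and a race over chunk sizes might be organised. It is an illustration only: every name here (Committee, build_committee, prune, race, and so on) is invented for this sketch, the uniform-weight vote stands in for the boosting weights the paper actually uses, and the greedy holdout-based pruning rule is an assumed stand-in for the paper's adaptive pruning criterion.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Committee:
    """A weighted committee of base classifiers built from data chunks."""
    members: List[Tuple[Callable, float]] = field(default_factory=list)

    def predict(self, x):
        # Weighted vote over all members; the paper derives weights from a
        # standard boosting algorithm, whereas here they are uniform.
        votes = {}
        for clf, weight in self.members:
            label = clf(x)
            votes[label] = votes.get(label, 0.0) + weight
        return max(votes, key=votes.get)

def chunks(stream, chunk_size):
    """Split an incoming data stream into fixed-size chunks."""
    chunk = []
    for example in stream:
        chunk.append(example)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def accuracy(committee, holdout):
    """Fraction of labelled holdout examples the committee gets right."""
    return sum(committee.predict(x) == y for x, y in holdout) / len(holdout)

def prune(committee, holdout, margin=0.0):
    """Greedily drop members whose removal costs at most `margin` holdout
    accuracy (an assumed stand-in for the paper's pruning criterion)."""
    base = accuracy(committee, holdout)
    kept = []
    for i, member in enumerate(committee.members):
        trial = Committee(kept + committee.members[i + 1:])
        if trial.members and accuracy(trial, holdout) >= base - margin:
            continue  # removal is (nearly) harmless: prune this member
        kept.append(member)
    return Committee(kept)

def build_committee(stream, chunk_size, learn, holdout):
    """Train one base classifier per chunk, pruning as the committee grows."""
    committee = Committee()
    for chunk in chunks(stream, chunk_size):
        committee.members.append((learn(chunk), 1.0))
        committee = prune(committee, holdout)
    return committee

def race(stream_factory, chunk_sizes, learn, holdout):
    """Compare finished committees, one per chunk size; the paper instead
    interleaves their construction and drops losing chunk sizes early."""
    best = None
    for size in chunk_sizes:
        committee = build_committee(stream_factory(), size, learn, holdout)
        acc = accuracy(committee, holdout)
        if best is None or acc > best[0]:
            best = (acc, size, committee)
    return best

Because pruning runs after every chunk, the committee's size (and hence its prediction cost and memory footprint) stays bounded throughout training, which is what makes racing several chunk sizes side by side affordable.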
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Frank, E., Holmes, G., Kirkby, R., Hall, M. (2002). Racing Committees for Large Datasets. In: Lange, S., Satoh, K., Smith, C.H. (eds) Discovery Science. DS 2002. Lecture Notes in Computer Science, vol 2534. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36182-0_15
DOI: https://doi.org/10.1007/3-540-36182-0_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00188-1
Online ISBN: 978-3-540-36182-4
eBook Packages: Springer Book Archive