
Racing Committees for Large Datasets

  • Conference paper
Discovery Science (DS 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2534)


Abstract

This paper proposes a method for generating classifiers from large datasets by building a committee of simple base classifiers using a standard boosting algorithm. It permits the processing of large datasets even if the underlying base learning algorithm cannot efficiently do so. The basic idea is to split the incoming data into chunks and form a committee from the classifiers learned on these individual chunks. Our method extends earlier work by introducing a procedure for adaptively pruning the committee. This is essential when applying the algorithm in practice because it dramatically reduces the algorithm’s running time and memory consumption. It also makes it possible to efficiently “race” committees corresponding to different chunk sizes. This is important because our empirical results show that the accuracy of the resulting committee can vary significantly with the chunk size. They also show that pruning is indeed crucial to keep the method’s running time and memory requirements practical for large datasets. Surprisingly, the results demonstrate that pruning can also improve accuracy.
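
As a rough illustration of the idea sketched in the abstract, the Python snippet below builds a committee from data chunks, prunes it adaptively, and compares ("races") committees built with different chunk sizes. It is a minimal sketch, not the authors' algorithm: shallow scikit-learn decision trees stand in for the base learner, the pruning rule (discard a new member if it lowers held-out accuracy) and the racing criterion (compare finished committees on a validation set) are simplifying assumptions, and the paper's actual racing interleaves training so that unpromising chunk sizes can be abandoned early.

    # Minimal sketch of the chunk-based committee idea from the abstract.
    # The pruning rule and racing criterion here are illustrative
    # assumptions, not the paper's exact algorithm.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def committee_accuracy(committee, X, y):
        """Accuracy of an unweighted majority vote (assumes integer labels)."""
        votes = np.stack([clf.predict(X) for clf in committee])
        preds = np.apply_along_axis(
            lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
        return float(np.mean(preds == y))

    def build_pruned_committee(X, y, X_val, y_val, chunk_size):
        """Train one simple base classifier per data chunk, keeping a new
        member only if it does not hurt accuracy on a validation set."""
        committee, best_acc = [], 0.0
        for start in range(0, len(X), chunk_size):
            Xc = X[start:start + chunk_size]
            yc = y[start:start + chunk_size]
            if len(np.unique(yc)) < 2:
                continue  # skip degenerate chunks containing a single class
            committee.append(DecisionTreeClassifier(max_depth=3).fit(Xc, yc))
            acc = committee_accuracy(committee, X_val, y_val)
            if acc >= best_acc:
                best_acc = acc
            else:
                committee.pop()  # adaptive pruning: member hurt accuracy
        return committee, best_acc

    def race_committees(X, y, X_val, y_val, chunk_sizes=(100, 500, 2000)):
        """Compare committees built with different chunk sizes and return
        the winner; the paper instead races them incrementally."""
        results = {cs: build_pruned_committee(X, y, X_val, y_val, cs)
                   for cs in chunk_sizes}
        best = max(results, key=lambda cs: results[cs][1])
        return best, results[best][0]

The chunk size matters because, as the abstract notes, accuracy can vary significantly with it; racing lets the validation data arbitrate rather than fixing the size in advance.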





Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Frank, E., Holmes, G., Kirkby, R., Hall, M. (2002). Racing Committees for Large Datasets. In: Lange, S., Satoh, K., Smith, C.H. (eds) Discovery Science. DS 2002. Lecture Notes in Computer Science, vol 2534. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36182-0_15


  • DOI: https://doi.org/10.1007/3-540-36182-0_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00188-1

  • Online ISBN: 978-3-540-36182-4

  • eBook Packages: Springer Book Archive
