Boosting


Part of the book series: Springer Texts in Statistics (STS)

Abstract

In this chapter we continue with fitting ensembles, focusing on boosting with an emphasis on classifiers. The unifying idea is that the statistical learning procedure makes many passes through the data and constructs fitted values on each pass. With each pass, however, observations that were fit more poorly on the previous pass are given more weight, so the algorithm works more diligently to fit the hard-to-fit observations. In the end, the sets of fitted values are combined in an averaging process that serves as a regularizer. Boosting can be a very effective statistical learning procedure.


Notes

  1. Even though the response is binary, it is treated as a numeric. The reason will be clear later.

  2. A more general definition is provided by Schapire and his colleagues (1998: 1697). A wonderfully rich and more recent discussion of the central role of margins in boosting can be found in Schapire and Freund (2012). Written from a computer science perspective, the book is a remarkable mix of formal mathematics and very accessible discussion of what the mathematics means.

  3. In computer science parlance, an “example” is an observation or case.

  4. The response is represented by y_i because it is a numeric residual regardless of whether the original response was numeric or categorical. Categorical response variables are coded as a numeric 1 or 0 so that arithmetic operators can be applied.

  5. Θ here does not represent random integers used in random forests.

  6. Naming conventions get a little confusing at this point. Formally, a gradient is a set (vector) of partial derivatives, but it is also common to speak of the individual partial derivatives as gradients (see the illustration below).
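
As a small illustration of the distinction, in generic notation not taken from the text: with L a loss function and f(x_i) the fitted value for observation i, the gradient is the full vector of partial derivatives, while each component is often loosely called "a gradient."

```latex
% Generic notation for illustration only: L is a loss function and f(x_i) the
% fitted value for observation i; the gradient collects all N partial derivatives.
\nabla L(f) \;=\;
\left( \frac{\partial L}{\partial f(x_1)},\; \frac{\partial L}{\partial f(x_2)},\;
\ldots,\; \frac{\partial L}{\partial f(x_N)} \right)
```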

  7. The response variable is no longer shown because it does not change in the minimization process.

  8. For very large datasets, there is a scalable version of tree boosting called XGBoost (Chen and Guestrin 2016) that can provide noticeable speed improvements but has yet to include the range of useful loss functions found in gbm. The data also must be configured in particular ways that differ from data frames (see the sketch below). But for large datasets, it is worth the trouble. More will be said about XGBoost when “deep learning” is briefly considered in Chap. 8.
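
A minimal sketch of the different data configuration for a binary response with the R xgboost library; the object names (predictors, y) are hypothetical and the tuning values are placeholders, not recommendations.

```r
library(xgboost)

# xgboost does not accept a data frame directly: build a numeric model matrix
# and supply the 0/1 response as a separate label vector.
X <- model.matrix(~ . - 1, data = predictors)   # predictors: hypothetical data frame
dtrain <- xgb.DMatrix(data = X, label = y)      # y: hypothetical 0/1 response vector

fit.xgb <- xgb.train(
  params = list(objective = "binary:logistic", max_depth = 3, eta = 0.05),
  data = dtrain,
  nrounds = 500
)
```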

  9. Other initializations, such as least squares regression, could be used, depending on the loss function (e.g., for a quantitative response variable).

  10. The original gbm was written by Greg Ridgeway. The current and substantially revised gbm is written by Brandon Greenwell, Bradley Boehmke, and Jay Cunningham, who build on Ridgeway’s original code.

  11. For gbm, the data not selected for each tree are called “out-of-bag” data, although that is not fully consistent with the usual definition because in gbm the sampling is without replacement (see the sketch below).
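
A minimal sketch of how this arises in practice; dat and y are hypothetical and the tuning values are placeholders. With bag.fraction less than 1, each tree is grown on a subsample drawn without replacement, and the omitted cases supply gbm's out-of-bag performance estimate.

```r
library(gbm)

fit <- gbm(y ~ ., data = dat, distribution = "bernoulli",
           n.trees = 3000, interaction.depth = 3,
           shrinkage = 0.01, bag.fraction = 0.5)

# Out-of-bag estimate of the best number of trees (iterations)
best.iter <- gbm.perf(fit, method = "OOB")
```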

  12. The partial dependence plots that are part of gbm work well. One can also use the partial dependence plots in the library pdp, written by Brandon Greenwell, which offer more options (see the sketch below).
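
For instance, assuming the gbm fit and best iteration from the sketch above and a hypothetical predictor named x1, the two routes to a partial dependence plot look like this.

```r
# Built-in partial dependence display from gbm
plot(fit, i.var = "x1", n.trees = best.iter)

# The pdp library computes the same quantity with more plotting options
library(pdp)
pd <- partial(fit, pred.var = "x1", n.trees = best.iter)
plotPartial(pd)
```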

  13. A useful indication of whether there are too many passes through the data is whether the distribution of the fitted proportions/probabilities starts to look bimodal. There is no formal reason why a bimodal distribution is necessarily problematic, but with large trees, the distribution should not be terribly lumpy. A lumpy distribution is far less of a potential problem if there is no interest in interpreting the fitted proportions/probabilities.
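
One quick way to check, assuming the hypothetical gbm objects from the earlier sketches, is simply to plot the distribution of the fitted probabilities.

```r
# Fitted probabilities for the training data (dat, fit, best.iter as above)
p.hat <- predict(fit, newdata = dat, n.trees = best.iter, type = "response")
hist(p.hat, breaks = 30, main = "Distribution of fitted probabilities",
     xlab = "Fitted probability")
```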

  14. The caret library is written by Max Kuhn. There are also procedures to automate gbm tuning, but as with all automated tuning, subject-matter expertise plays no role. Considerable caution is warranted (a bare-bones example follows).
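
A minimal sketch of grid-search tuning with caret, assuming the same hypothetical data frame dat with a 0/1 response y (recoded as a factor, which caret prefers); the grid values are placeholders, not recommendations.

```r
library(caret)
library(gbm)

grid <- expand.grid(n.trees = c(1000, 3000),
                    interaction.depth = c(1, 3),
                    shrinkage = c(0.01, 0.1),
                    n.minobsinnode = 10)

dat.c <- dat
dat.c$y <- factor(dat.c$y, labels = c("No", "Yes"))  # caret expects a factor response

tuned <- train(y ~ ., data = dat.c, method = "gbm",
               trControl = trainControl(method = "cv", number = 5),
               tuneGrid = grid, verbose = FALSE)
tuned$bestTune   # tuning values chosen by cross-validation
```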

  15. The R procedure XGBoost allows for categorical response variables with more than two categories. Weighting can still be used to tune the results, although usually more trial and error is required than in the binary case.
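
A minimal sketch, assuming a numeric predictor matrix X as in the earlier xgboost sketch, class labels y.multi coded 0, 1, 2, and a vector of case weights w used to tune the relative costs of classification errors (all names hypothetical).

```r
library(xgboost)

dtrain <- xgb.DMatrix(data = X, label = y.multi, weight = w)

fit.multi <- xgb.train(
  params = list(objective = "multi:softprob",   # class probabilities for each case
                num_class = 3, max_depth = 3, eta = 0.05),
  data = dtrain,
  nrounds = 500
)
```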

  16. Because gbm depends on regression trees even for binary outcome variables, outcome variables need to be numeric.
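
A minimal sketch of the recoding step assumed in the earlier gbm sketches; fate is a hypothetical two-level factor.

```r
# gbm with distribution = "bernoulli" expects a numeric 0/1 response, not a factor.
dat$y <- as.numeric(dat$fate == "survived")
```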

  17. For these analyses, the work was done on an iMac using a single core; the processor was a 3.4 GHz Intel Core i7.

  18. If forecasting were on the table, it might have been useful to try a much larger number of iterations to reduce estimates of out-of-sample error.

  19. The plots are shown just as gbm builds them, and very few options are provided. But just as with random forests, the underlying data can be stored and then used to construct new plots more responsive to the preferences of data analysts (see the sketch below). One also has the option of doing the plotting in pdp.
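
For example, with the hypothetical gbm objects from the earlier sketches and a hypothetical predictor x1, the values behind the plot can be recovered and replotted.

```r
# return.grid = TRUE returns the partial dependence values instead of plotting them
grid.x1 <- plot(fit, i.var = "x1", n.trees = best.iter, return.grid = TRUE)

# A custom version of the same plot
plot(grid.x1$x1, grid.x1$y, type = "l",
     xlab = "x1", ylab = "Partial dependence")
```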

  20. Because both inputs are integers, the transition from one value to the next is the midpoint between the two.

  21. It is not appropriate to compare the overall error rates in the two tables (0.18 and 0.21) because the errors are not weighted by costs. In Table 6.2, classification errors for those who perished are about five times more costly than in Table 6.1.
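
One common way to impose such asymmetric costs in gbm is through its weights argument; the sketch below is an illustration of the general idea with hypothetical names and a 5-to-1 weighting, not necessarily the exact settings behind the tables.

```r
# Give cases in the costlier-to-misclassify class (here coded y == 1) five
# times the weight of the other class.
w <- ifelse(dat$y == 1, 5, 1)

fit.wtd <- gbm(y ~ ., data = dat, weights = w,
               distribution = "bernoulli",
               n.trees = 3000, shrinkage = 0.01)
```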

  22. Ideally, there would be far more data and a much narrower 95% confidence interval. If a reader runs the code shown again, slightly different results are almost certain because of a new random split of the data. It is also possible to get an upper or lower bound outside the 0.0–1.0 range. Negative values should be treated as 0.0, and values larger than 1.0 should be treated as 1.0 (see the sketch below); valid coverage remains.
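
A minimal sketch of the clamping rule just described; ci.lower and ci.upper are hypothetical bounds from the estimated 95% confidence interval.

```r
ci.lower <- max(ci.lower, 0)  # a negative lower bound is treated as 0.0
ci.upper <- min(ci.upper, 1)  # an upper bound above 1.0 is treated as 1.0
```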

  23. Only order matters, so compared to estimates of conditional means, a lot of information is discarded. Working with very high or very low quantiles can exacerbate the problem because the data are usually less dense toward the tails.

  24. The out-of-bag approach was not available in gbm for boosted quantile regression; a cross-validated alternative is sketched below.
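
A minimal sketch, assuming a hypothetical numeric response fare in dat; the 0.75 quantile and tuning values are placeholders.

```r
# Boosted quantile regression in gbm uses a list-valued distribution argument
fit.q75 <- gbm(fare ~ ., data = dat,
               distribution = list(name = "quantile", alpha = 0.75),
               n.trees = 3000, shrinkage = 0.01, cv.folds = 5)

# With no OOB criterion available, cross-validation picks the number of trees
best.q <- gbm.perf(fit.q75, method = "cv")
```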

  25. The library pdp constructs partial dependence plots for a variety of machine learning methods and offers a wide range of useful options, including access to the ICEbox library (see the sketch below). ICEbox is written by Alex Goldstein, Adam Kapelner, and Justin Bleich; pdp is written by Brandon Greenwell.
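
For example, assuming the hypothetical gbm objects from the earlier sketches and a hypothetical predictor x1, pdp can produce individual conditional expectation (ICE) curves as well as the averaged partial dependence curve.

```r
library(pdp)

# ice = TRUE returns one curve per observation rather than their average
ice.x1 <- partial(fit, pred.var = "x1", n.trees = best.iter, ice = TRUE)
plotPartial(ice.x1)
```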

  26. The size of the correlation is substantially determined by actual fares over $200. They are still being fit badly, but not a great deal worse.

  27. Causality is not a feature of the data itself. It is an interpretive overlay that depends on how the data were generated (i.e., an experiment in which the intervention is manipulated) and/or on subject-matter theory (e.g., force equals mass times acceleration). Discussions about the nature of causality have a very long history. The current view, and the one taken here, is widely, if not universally, accepted. Imbens and Rubin (2015) provide an excellent and accessible treatment.

  28. T can also be numeric, most commonly conceptualized as a “dose” of some intervention. For purposes of this discussion, we need not consider numeric interventions.

  29. Whether the assigned intervention is actually delivered in practice is another matter, beyond the scope of this discussion. Randomized experiments often are jeopardized when the assigned intervention is not delivered.

  30. For a binary treatment, the average treatment effect (ATE) is defined as the difference between the response variable’s mean or proportion under one study condition and its mean or proportion under the other. One imagines the average outcome were all of the study subjects to receive the treatment compared to the average outcome were none to receive it. A properly implemented randomized experiment provides an unbiased estimate. These ideas easily can be extended to studies with more than two study conditions.
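
In potential-outcomes notation (a standard formulation, not taken verbatim from the text), with Y_i(1) and Y_i(0) the outcomes for case i under treatment and control:

```latex
% ATE as the expected difference in potential outcomes; under a properly
% implemented randomized experiment it is estimated without bias by the
% difference in observed group means (or proportions).
\text{ATE} \;=\; E\!\left[\,Y_i(1) - Y_i(0)\,\right]
\;=\; E\!\left[\,Y_i(1)\,\right] - E\!\left[\,Y_i(0)\,\right],
\qquad
\widehat{\text{ATE}} \;=\; \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}.
```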

  31. Regression diagnostics can help, but then what? You may learn that linearity is not plausible, but what functional forms are? What if the apparent nonlinearities are really caused by omitted variables not in the data?

References

  • Bartlett, P. L., & Traskin, M. (2007). AdaBoost is consistent. Journal of Machine Learning Research, 8, 2347–2368.

  • Berk, R. A., Kriegler, B., & Ylvisaker, D. (2008). Counting the homeless in Los Angeles County. In D. Nolan & T. Speed (Eds.), Probability and statistics: Essays in honor of David A. Freedman. Monograph series of the Institute of Mathematical Statistics.

  • Bühlmann, P., & Yu, B. (2004). Discussion. The Annals of Statistics, 32, 96–107.

  • Buja, A., & Stuetzle, W. (2006). Observations on bagging. Statistica Sinica, 16(2), 323–352.

  • Buja, A., Mease, D., & Wyner, A. J. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4), 506–512.

  • Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. arXiv:1603.02754v1 [cs.LG].

  • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., & Newey, W. (2017). Double/debiased/Neyman machine learning of treatment effects. arXiv:1701.08687v1 [stat.ML].

  • Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine learning: Proceedings of the thirteenth international conference (pp. 148–156). San Francisco: Morgan Kaufmann.

  • Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.

  • Freund, Y., & Schapire, R. E. (1999). A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14, 771–780.

  • Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189–1232.

  • Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis, 38, 367–378.

  • Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion). The Annals of Statistics, 28, 337–407.

  • Friedman, J. H., Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). Discussion of boosting papers. The Annals of Statistics, 32, 102–107.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). New York: Springer.

  • Imbens, G., & Rubin, D. B. (2015). Causal inference for statistics, social, and biomedical sciences: An introduction. Cambridge: Cambridge University Press.

  • Jiang, W. (2004). Process consistency for AdaBoost. The Annals of Statistics, 32, 13–29.

  • Mease, D., & Wyner, A. J. (2008). Evidence contrary to the statistical view of boosting (with discussion). Journal of Machine Learning Research, 9, 1–26.

  • Mease, D., Wyner, A. J., & Buja, A. (2007). Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research, 8, 409–439.

  • Ridgeway, G. (1999). The state of boosting. Computing Science and Statistics, 31, 172–181.

  • Ridgeway, G. (2012). Generalized boosted models: A guide to the gbm package. Available from the gbm documentation in R.

  • Rosenbaum, P. R. (2010). Design of observational studies. New York: Springer.

  • Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.

  • Schapire, R. E. (1999). A brief introduction to boosting. In Proceedings of the sixteenth international joint conference on artificial intelligence.

  • Schapire, R. E., & Freund, Y. (2012). Boosting: Foundations and algorithms. Cambridge, MA: MIT Press.

  • Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W.-S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5), 1651–1686.

  • Scharfstein, D. O., Rotnitzky, A., & Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94, 1096–1120.

  • Tchetgen Tchetgen, E. J., Robins, J. M., & Rotnitzky, A. (2010). On doubly robust estimation in a semiparametric odds ratio model. Biometrika, 97(1), 171–180.

  • Wyner, A. J., Olson, M., Bleich, J., & Mease, D. (2017). Explaining the success of AdaBoost and random forests as interpolating classifiers. Journal of Machine Learning Research, 18, 1–33.

  • Zhang, T., & Yu, B. (2005). Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4), 1538–1579.


Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Berk, R.A. (2020). Boosting. In: Statistical Learning from a Regression Perspective. Springer Texts in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-030-40189-4_6
