Boosting


Part of the book series: Springer Texts in Statistics (STS)

Abstract

In this chapter we continue with fitting ensembles, focusing on boosting with an emphasis on classifiers. The unifying idea is that the statistical learning procedure makes many passes through the data and constructs fitted values on each pass. With each pass, however, observations that were fit more poorly on the previous pass are given more weight, so the algorithm works more diligently to fit the hard-to-fit observations. In the end, the sets of fitted values are combined in an averaging process that serves as a regularizer. Boosting can be a very effective statistical learning procedure.


Notes

  1. Even though the response is binary, it is treated as a numeric. The reason will be clear later.

  2. A more general definition is provided by Schapire and his colleagues (1998: 1697). A wonderfully rich and more recent discussion of the central role of margins in boosting can be found in Schapire and Freund (2012). Written from a computer science perspective, the book is a remarkable mix of formal mathematics and very accessible discussion of what the mathematics means.

  3. In computer science parlance, an “example” is an observation or case.

  4. The response is represented by y_i because it is a numeric residual regardless of whether the original response was numeric or categorical. Categorical response variables are coded as a numeric 1 or 0 so that arithmetic operators can be applied.

  5. Θ here does not represent random integers used in random forests.

  6. Naming conventions get a little confusing at this point. Formally, a gradient is a set (vector) of partial derivatives, but it is also common to speak of the individual partial derivatives as gradients (see the illustration below).
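
As a small illustration of the distinction, in generic notation not taken from the text: with L a loss function and f(x_i) the fitted value for observation i, the gradient is the full vector of partial derivatives, while each component is often loosely called "a gradient."

```latex
% Generic notation for illustration only: L is a loss function and f(x_i) the
% fitted value for observation i; the gradient collects all N partial derivatives.
\nabla L(f) \;=\;
\left( \frac{\partial L}{\partial f(x_1)},\; \frac{\partial L}{\partial f(x_2)},\;
\ldots,\; \frac{\partial L}{\partial f(x_N)} \right)
```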

  7. The response variable is no longer shown because it does not change in the minimization process.

  8. For very large datasets, there is a scalable version of tree boosting called XGBoost (Chen and Guestrin 2016) that can provide noticeable speed improvements but has yet to include the range of useful loss functions found in gbm. The data also must be configured in particular ways that differ from data frames (see the sketch below). But for large datasets, it is worth the trouble. More will be said about XGBoost when “deep learning” is briefly considered in Chap. 8.
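
A minimal sketch of the different data configuration for a binary response with the R xgboost library; the object names (predictors, y) are hypothetical and the tuning values are placeholders, not recommendations.

```r
library(xgboost)

# xgboost does not accept a data frame directly: build a numeric model matrix
# and supply the 0/1 response as a separate label vector.
X <- model.matrix(~ . - 1, data = predictors)   # predictors: hypothetical data frame
dtrain <- xgb.DMatrix(data = X, label = y)      # y: hypothetical 0/1 response vector

fit.xgb <- xgb.train(
  params = list(objective = "binary:logistic", max_depth = 3, eta = 0.05),
  data = dtrain,
  nrounds = 500
)
```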

  9. Other initializations, such as least squares regression, could be used, depending on the loss function (e.g., for a quantitative response variable).

  10. The original gbm was written by Greg Ridgeway. The current and substantially revised gbm is written by Brandon Greenwell, Bradley Boehmke, and Jay Cunningham, who build on Ridgeway’s original code.

  11. For gbm, the data not selected for each tree are called “out-of-bag” data, although that is not fully consistent with the usual definition because in gbm the sampling is without replacement (see the sketch below).
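
A minimal sketch of how this arises in practice; dat and y are hypothetical and the tuning values are placeholders. With bag.fraction less than 1, each tree is grown on a subsample drawn without replacement, and the omitted cases supply gbm's out-of-bag performance estimate.

```r
library(gbm)

fit <- gbm(y ~ ., data = dat, distribution = "bernoulli",
           n.trees = 3000, interaction.depth = 3,
           shrinkage = 0.01, bag.fraction = 0.5)

# Out-of-bag estimate of the best number of trees (iterations)
best.iter <- gbm.perf(fit, method = "OOB")
```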

  12. The partial dependence plots that are part of gbm work well. One can also use the partial dependence plots in the library pdp, written by Brandon Greenwell, which offer more options (see the sketch below).
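
For instance, assuming the gbm fit and best iteration from the sketch above and a hypothetical predictor named x1, the two routes to a partial dependence plot look like this.

```r
# Built-in partial dependence display from gbm
plot(fit, i.var = "x1", n.trees = best.iter)

# The pdp library computes the same quantity with more plotting options
library(pdp)
pd <- partial(fit, pred.var = "x1", n.trees = best.iter)
plotPartial(pd)
```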

  13. A useful indication of whether there are too many passes through the data is whether the distribution of the fitted proportions/probabilities starts to look bimodal. There is no formal reason why a bimodal distribution is necessarily problematic, but with large trees, the distribution should not be terribly lumpy. A lumpy distribution is far less of a potential problem if there is no interest in interpreting the fitted proportions/probabilities.
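
One quick way to check, assuming the hypothetical gbm objects from the earlier sketches, is simply to plot the distribution of the fitted probabilities.

```r
# Fitted probabilities for the training data (dat, fit, best.iter as above)
p.hat <- predict(fit, newdata = dat, n.trees = best.iter, type = "response")
hist(p.hat, breaks = 30, main = "Distribution of fitted probabilities",
     xlab = "Fitted probability")
```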

  14. The caret library is written by Max Kuhn. There are also procedures to automate gbm tuning, but as with all automated tuning, subject-matter expertise plays no role. Considerable caution is warranted (a bare-bones example follows).
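
A minimal sketch of grid-search tuning with caret, assuming the same hypothetical data frame dat with a 0/1 response y (recoded as a factor, which caret prefers); the grid values are placeholders, not recommendations.

```r
library(caret)
library(gbm)

grid <- expand.grid(n.trees = c(1000, 3000),
                    interaction.depth = c(1, 3),
                    shrinkage = c(0.01, 0.1),
                    n.minobsinnode = 10)

dat.c <- dat
dat.c$y <- factor(dat.c$y, labels = c("No", "Yes"))  # caret expects a factor response

tuned <- train(y ~ ., data = dat.c, method = "gbm",
               trControl = trainControl(method = "cv", number = 5),
               tuneGrid = grid, verbose = FALSE)
tuned$bestTune   # tuning values chosen by cross-validation
```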

  15. The R procedure XGBoost allows for categorical response variables with more than two categories. Weighting can still be used to tune the results, although usually more trial and error is required than in the binary case.
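
A minimal sketch, assuming a numeric predictor matrix X as in the earlier xgboost sketch, class labels y.multi coded 0, 1, 2, and a vector of case weights w used to tune the relative costs of classification errors (all names hypothetical).

```r
library(xgboost)

dtrain <- xgb.DMatrix(data = X, label = y.multi, weight = w)

fit.multi <- xgb.train(
  params = list(objective = "multi:softprob",   # class probabilities for each case
                num_class = 3, max_depth = 3, eta = 0.05),
  data = dtrain,
  nrounds = 500
)
```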

  16. Because gbm depends on regression trees even for binary outcome variables, outcome variables need to be numeric.
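
A minimal sketch of the recoding step assumed in the earlier gbm sketches; fate is a hypothetical two-level factor.

```r
# gbm with distribution = "bernoulli" expects a numeric 0/1 response, not a factor.
dat$y <- as.numeric(dat$fate == "survived")
```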

  17. For these analyses, the work was done on an iMac using a single core; the processor was a 3.4 GHz Intel Core i7.

  18. If forecasting were on the table, it might have been useful to try a much larger number of iterations to reduce estimates of out-of-sample error.

  19. The plots are shown just as gbm builds them, and very few options are provided. But just as with random forests, the underlying data can be stored and then used to construct new plots more responsive to the preferences of data analysts (see the sketch below). One also has the option of doing the plotting in pdp.
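
For example, with the hypothetical gbm objects from the earlier sketches and a hypothetical predictor x1, the values behind the plot can be recovered and replotted.

```r
# return.grid = TRUE returns the partial dependence values instead of plotting them
grid.x1 <- plot(fit, i.var = "x1", n.trees = best.iter, return.grid = TRUE)

# A custom version of the same plot
plot(grid.x1$x1, grid.x1$y, type = "l",
     xlab = "x1", ylab = "Partial dependence")
```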

  20. Because both inputs are integers, the transition from one value to the next is the midpoint between the two.

  21. It is not appropriate to compare the overall error rates in the two tables (0.18 and 0.21) because the errors are not weighted by costs. In Table 6.2, classification errors for those who perished are about five times more costly than in Table 6.1.
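
One common way to impose such asymmetric costs in gbm is through its weights argument; the sketch below is an illustration of the general idea with hypothetical names and a 5-to-1 weighting, not necessarily the exact settings behind the tables.

```r
# Give cases in the costlier-to-misclassify class (here coded y == 1) five
# times the weight of the other class.
w <- ifelse(dat$y == 1, 5, 1)

fit.wtd <- gbm(y ~ ., data = dat, weights = w,
               distribution = "bernoulli",
               n.trees = 3000, shrinkage = 0.01)
```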

  22. Ideally, there would be far more data and a much narrower 95% confidence interval. If a reader runs the code shown again, slightly different results are almost certain because of a new random split of the data. It is also possible to get an upper or lower bound outside the 0.0–1.0 range. Negative values should be treated as 0.0, and values larger than 1.0 should be treated as 1.0 (see the sketch below); valid coverage remains.
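
A minimal sketch of the clamping rule just described; ci.lower and ci.upper are hypothetical bounds from the estimated 95% confidence interval.

```r
ci.lower <- max(ci.lower, 0)  # a negative lower bound is treated as 0.0
ci.upper <- min(ci.upper, 1)  # an upper bound above 1.0 is treated as 1.0
```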

  23. Only order matters, so compared to estimates of conditional means, a lot of information is discarded. Working with very high or very low quantiles can exacerbate the problem because the data are usually less dense toward the tails.

  24. The out-of-bag approach was not available in gbm for boosted quantile regression; a cross-validated alternative is sketched below.
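
A minimal sketch, assuming a hypothetical numeric response fare in dat; the 0.75 quantile and tuning values are placeholders.

```r
# Boosted quantile regression in gbm uses a list-valued distribution argument
fit.q75 <- gbm(fare ~ ., data = dat,
               distribution = list(name = "quantile", alpha = 0.75),
               n.trees = 3000, shrinkage = 0.01, cv.folds = 5)

# With no OOB criterion available, cross-validation picks the number of trees
best.q <- gbm.perf(fit.q75, method = "cv")
```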

  25. The library pdp constructs partial dependence plots for a variety of machine learning methods and offers a wide range of useful options, including access to the ICEbox library (see the sketch below). ICEbox is written by Alex Goldstein, Adam Kapelner, and Justin Bleich; pdp is written by Brandon Greenwell.
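
For example, assuming the hypothetical gbm objects from the earlier sketches and a hypothetical predictor x1, pdp can produce individual conditional expectation (ICE) curves as well as the averaged partial dependence curve.

```r
library(pdp)

# ice = TRUE returns one curve per observation rather than their average
ice.x1 <- partial(fit, pred.var = "x1", n.trees = best.iter, ice = TRUE)
plotPartial(ice.x1)
```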

  26. The size of the correlation is substantially determined by actual fares over $200. They are still being fit badly, but not a great deal worse.

  27. Causality is not a feature of the data itself. It is an interpretive overlay that depends on how the data were generated (i.e., an experiment in which the intervention is manipulated) and/or on subject-matter theory (e.g., force equals mass times acceleration). Discussions about the nature of causality have a very long history. The current view, and the one taken here, is widely, if not universally, accepted. Imbens and Rubin (2015) provide an excellent and accessible treatment.

  28. T can also be numeric, most commonly conceptualized as a “dose” of some intervention. For purposes of this discussion, we need not consider numeric interventions.

  29. Whether the assigned intervention is actually delivered in practice is another matter, beyond the scope of this discussion. Randomized experiments often are jeopardized when the assigned intervention is not delivered.

  30. For a binary treatment, the average treatment effect (ATE) is defined as the difference between the response variable’s mean or proportion under one study condition and its mean or proportion under the other. One imagines the average outcome were all of the study subjects to receive the treatment compared to the average outcome were none to receive it. A properly implemented randomized experiment provides an unbiased estimate. These ideas easily can be extended to studies with more than two study conditions.
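
In potential-outcomes notation (a standard formulation, not taken verbatim from the text), with Y_i(1) and Y_i(0) the outcomes for case i under treatment and control:

```latex
% ATE as the expected difference in potential outcomes; under a properly
% implemented randomized experiment it is estimated without bias by the
% difference in observed group means (or proportions).
\text{ATE} \;=\; E\!\left[\,Y_i(1) - Y_i(0)\,\right]
\;=\; E\!\left[\,Y_i(1)\,\right] - E\!\left[\,Y_i(0)\,\right],
\qquad
\widehat{\text{ATE}} \;=\; \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}.
```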

  31. Regression diagnostics can help, but then what? You may learn that linearity is not plausible, but what functional forms are? What if the apparent nonlinearities are really caused by omitted variables not in the data?

References

  • Bartlett, P. L., & Traskin, M. (2007). AdaBoost is consistent. Journal of Machine Learning Research, 8, 2347–2368.

  • Berk, R. A., Kriegler, B., & Ylvisaker, D. (2008). Counting the homeless in Los Angeles County. In D. Nolan & T. Speed (Eds.), Probability and statistics: Essays in honor of David A. Freedman. Monograph series of the Institute of Mathematical Statistics.

  • Bühlmann, P., & Yu, B. (2004). Discussion. The Annals of Statistics, 32, 96–107.

  • Buja, A., & Stuetzle, W. (2006). Observations on bagging. Statistica Sinica, 16(2), 323–352.

  • Buja, A., Mease, D., & Wyner, A. J. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4), 506–512.

  • Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. arXiv:1603.02754v1 [cs.LG].

  • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., & Newey, W. (2017). Double/debiased/Neyman machine learning of treatment effects. arXiv:1701.08687v1 [stat.ML].

  • Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine learning: Proceedings of the thirteenth international conference (pp. 148–156). San Francisco: Morgan Kaufmann.

  • Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.

  • Freund, Y., & Schapire, R. E. (1999). A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14, 771–780.

  • Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189–1232.

  • Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis, 38, 367–378.

  • Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion). The Annals of Statistics, 28, 337–407.

  • Friedman, J. H., Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). Discussion of boosting papers. The Annals of Statistics, 32, 102–107.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). New York: Springer.

  • Imbens, G., & Rubin, D. B. (2015). Causal inference for statistics, social, and biomedical sciences: An introduction. Cambridge: Cambridge University Press.

  • Jiang, W. (2004). Process consistency for AdaBoost. The Annals of Statistics, 32, 13–29.

  • Mease, D., & Wyner, A. J. (2008). Evidence contrary to the statistical view of boosting (with discussion). Journal of Machine Learning Research, 9, 1–26.

  • Mease, D., Wyner, A. J., & Buja, A. (2007). Boosted classification trees and class probability/quantile estimation. Journal of Machine Learning Research, 8, 409–439.

  • Ridgeway, G. (1999). The state of boosting. Computing Science and Statistics, 31, 172–181.

  • Ridgeway, G. (2012). Generalized boosted models: A guide to the gbm package. Available from the gbm documentation in R.

  • Rosenbaum, P. R. (2010). Design of observational studies. New York: Springer.

  • Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.

  • Schapire, R. E. (1999). A brief introduction to boosting. In Proceedings of the sixteenth international joint conference on artificial intelligence.

  • Schapire, R. E., & Freund, Y. (2012). Boosting: Foundations and algorithms. Cambridge, MA: MIT Press.

  • Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W.-S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5), 1651–1686.

  • Scharfstein, D. O., Rotnitzky, A., & Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94, 1096–1120.

  • Tchetgen Tchetgen, E. J., Robins, J. M., & Rotnitzky, A. (2010). On doubly robust estimation in a semiparametric odds ratio model. Biometrika, 97(1), 171–180.

  • Wyner, A. J., Olson, M., Bleich, J., & Mease, D. (2017). Explaining the success of AdaBoost and random forests as interpolating classifiers. Journal of Machine Learning Research, 18, 1–33.

  • Zhang, T., & Yu, B. (2005). Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4), 1538–1579.


Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Berk, R.A. (2020). Boosting. In: Statistical Learning from a Regression Perspective. Springer Texts in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-030-40189-4_6
