On the choice and influence of the number of boosting steps for high-dimensional linear Cox-models
In biomedical research, boosting-based regression approaches have gained much attention in the last decade. Their intrinsic variable selection mechanism and their ability to shrink the estimates of the regression coefficients toward zero make these techniques well suited to fitting prediction models to high-dimensional data, e.g. gene expression data. Their prediction performance, however, depends strongly on specific tuning parameters, in particular on the number of boosting iterations to perform. This crucial parameter is usually selected via cross-validation. The result of the cross-validation procedure, in turn, may depend strongly on a purely random component, namely the chosen fold partition. We empirically study how much this randomness affects the results of the boosting techniques, in terms of the selected predictors and the prediction ability of the resulting models. We use four publicly available data sets related to four different diseases. In these studies, the goal is to predict survival endpoints when a large number of continuous candidate predictors are available. We focus on two well-known boosting approaches implemented in the R packages CoxBoost and mboost, assuming the validity of the proportional hazards assumption and the linearity of the effects of the predictors. We show that this variability in the selected predictors and in the prediction ability of the model can be reduced by averaging over several repetitions of cross-validation when selecting the tuning parameters.
Keywords: Boosting · Cross-validation · Parameter tuning · High-dimensional data · Survival analysis
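To make the stabilization strategy studied here concrete, the following R sketch repeats 10-fold cross-validation with several fresh random fold partitions and averages the selected stopping iteration, using the mboost interface cited in the references. This is not the authors' original code: the data set is simulated and the object names (n_rep, mstop_hat, mstop_avg) are illustrative, though glmboost, CoxPH, cv, cvrisk, and mstop are actual mboost functions.

```r
## Minimal sketch (not the authors' code): stabilizing the choice of the
## number of boosting iterations by averaging over repeated cross-validation.
library(mboost)
library(survival)

set.seed(1)
n <- 100; p <- 50
x <- matrix(rnorm(n * p), nrow = n)
colnames(x) <- paste0("x", seq_len(p))
time <- rexp(n, rate = exp(0.5 * x[, 1] - 0.5 * x[, 2]))
status <- rbinom(n, size = 1, prob = 0.7)   # rough random censoring
dat <- data.frame(time = time, status = status, x)

## Component-wise boosting for a linear Cox model
fit <- glmboost(Surv(time, status) ~ ., data = dat, family = CoxPH(),
                control = boost_control(mstop = 200))

## Repeat 10-fold cross-validation, each time with a new random fold
## partition, and record the selected stopping iteration
n_rep <- 10
mstop_hat <- replicate(n_rep, {
  folds <- cv(model.weights(fit), type = "kfold", B = 10)
  mstop(cvrisk(fit, folds = folds))
})

## Average over repetitions instead of relying on a single partition
mstop_avg <- round(mean(mstop_hat))
fit[mstop_avg]      # set the model to the averaged stopping iteration
names(coef(fit))    # predictors selected at this stopping iteration
```

For CoxBoost, the analogous step would use its cv.CoxBoost function, whose selected optimal boosting step can be averaged over repetitions in the same way.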
We thank Rory Wilson and Jenny Lee for language improvements. HS and RDB were supported by Grants BO3139/4-1, BO3139/4-2 and BO3139/2-3 to ALB from the German Research Foundation (DFG).
- Binder H (2013) CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks, R package version 1.4. http://CRAN.R-project.org/package=CoxBoost
- Freund Y, Schapire R (1996) Experiments with a new boosting algorithm. In: Proceedings of the 13th international conference on machine learning. Morgan Kaufmann, pp 148–156
- Fuchs M, Hornung R, De Bin R, Boulesteix AL (2013) A U-statistic estimator for the variance of resampling-based error estimators. Technical Report 148, University of Munich
- Hothorn T, Bühlmann P, Kneib T, Schmid M, Hofner B (2015) mboost: model-based boosting, R package version 2.4-2. http://CRAN.R-project.org/package=mboost
- Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of international joint conference on artificial intelligence, pp 1137–1145
- Ridgeway G (1999) Generalization of boosting algorithms and applications of Bayesian inference for massive datasets. PhD thesis, University of Washington