Computational Statistics, Volume 33, Issue 3, pp 1195–1215

On the choice and influence of the number of boosting steps for high-dimensional linear Cox-models

  • Heidi Seibold
  • Christoph Bernau
  • Anne-Laure Boulesteix
  • Riccardo De Bin
Original Paper

Abstract

In biomedical research, boosting-based regression approaches have gained much attention over the last decade. Their intrinsic variable selection procedure and their ability to shrink the estimates of the regression coefficients toward zero make these techniques well suited to fitting prediction models to high-dimensional data, e.g. gene expression data. Their prediction performance, however, depends strongly on specific tuning parameters, in particular on the number of boosting iterations to perform. This crucial parameter is usually selected via cross-validation, a procedure whose outcome may in turn depend heavily on a purely random component: the chosen fold partition. We empirically study how much this randomness affects the results of the boosting techniques, in terms of both the selected predictors and the prediction ability of the related models. We use four publicly available data sets related to four different diseases; in each of these studies, the goal is to predict survival endpoints when a large number of continuous candidate predictors is available. We focus on two well-known boosting approaches implemented in the R packages CoxBoost and mboost, assuming the validity of the proportional hazards assumption and the linearity of the effects of the predictors. We show that the variability in the selected predictors and in the prediction ability of the model is reduced by averaging over several repetitions of cross-validation when selecting the tuning parameters.
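
As a concrete illustration of the strategy evaluated in the paper, the following R sketch tunes the number of boosting iterations (mstop) by repeating 10-fold cross-validation over several random fold partitions and averaging the estimated optima. This is a minimal sketch, not the authors' code: the data are simulated, and the sample size, number of repetitions, and averaging rule are arbitrary illustration choices.

  ## Minimal sketch of tuning mstop by repeated cross-validation with mboost;
  ## the data below are simulated for illustration only.
  library(mboost)
  library(survival)

  set.seed(1)
  n <- 100; p <- 50                       # small stand-in for a high-dimensional setting
  x <- matrix(rnorm(n * p), n, p)
  colnames(x) <- paste0("gene", seq_len(p))
  lp <- x[, 1] - x[, 2]                   # only two truly informative predictors
  dat <- data.frame(time   = rexp(n, rate = exp(lp)),
                    status = rbinom(n, 1, 0.8),   # crude censoring indicator
                    x)

  ## Component-wise linear Cox boosting; mstop is set generously and tuned below
  fit <- glmboost(Surv(time, status) ~ ., data = dat, family = CoxPH(),
                  control = boost_control(mstop = 200, nu = 0.1))

  ## Re-estimate the optimal mstop over five random 10-fold partitions
  ## and average the results (one simple way to aggregate the repetitions)
  opt <- replicate(5, {
    folds <- cv(model.weights(fit), type = "kfold", B = 10)
    mstop(cvrisk(fit, folds = folds))
  })
  fit <- fit[round(mean(opt))]            # set mstop to the averaged optimum
  coef(fit)                               # nonzero coefficients = selected predictors

A repetition of the same kind can be wrapped around the cross-validation routine of the CoxBoost package (cv.CoxBoost) to tune the number of boosting steps for the likelihood-based approach.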

Keywords

Boosting · Cross-validation · Parameter tuning · High-dimensional data · Survival analysis

Acknowledgements

We thank Rory Wilson and Jenny Lee for language improvements. HS and RDB were supported by grants BO3139/4-1, BO3139/4-2 and BO3139/2-3, awarded to ALB by the German Research Foundation (DFG).

Supplementary material

Supplementary material 1: 180_2017_773_MOESM1_ESM.pdf (PDF, 216 KB)
Supplementary material 2: 180_2017_773_MOESM2_ESM.txt (TXT, 1 KB)
Supplementary material 3: 180_2017_773_MOESM3_ESM.r (R script, 13 KB)
Supplementary material 4: 180_2017_773_MOESM4_ESM.pdf (PDF, 134 KB)

Copyright information

© Springer-Verlag GmbH Germany 2017

Authors and Affiliations

  1. Institute for Medical Information Processing, Biometry and Epidemiology, LMU Munich, Munich, Germany
  2. Epidemiology, Biostatistics and Prevention Institute (EBPI), University of Zurich, Zurich, Switzerland
  3. Leibniz Supercomputing Centre, Munich, Germany
  4. Department of Mathematics, University of Oslo, Oslo, Norway