Lifetime Data Analysis

Volume 25, Issue 1, pp 150–167

The Wally plot approach to assess the calibration of clinical prediction models

  • Paul Blanche
  • Thomas A. Gerds
  • Claus T. Ekstrøm


A prediction model is calibrated if, roughly, for any percentage x we can expect that x subjects out of 100 experience the event among all subjects that have a predicted risk of x%. Typically, the calibration assumption is assessed graphically, but in practice it is often challenging to judge whether a “disappointing” calibration plot is the consequence of a departure from the calibration assumption or just “bad luck” due to sampling variability. To address this issue, we propose a graphical approach that visualizes how well a calibration plot agrees with the calibration assumption. The approach is mainly based on generating new plots that mimic the available data under the calibration assumption. The method handles the common non-trivial situations in which the data contain censored observations and occurrences of competing events. This is done by building on ideas from constrained non-parametric maximum likelihood estimation methods. Two examples from large cohort data illustrate our proposal. The ‘wally’ R package is provided to make the methodology easily usable.
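The idea of hiding the observed calibration plot among plots generated under the calibration assumption can be sketched in a few lines. The Python sketch below is only an illustration under strong simplifying assumptions: outcomes are binary and fully observed (no censoring, no competing events), so the constrained non-parametric maximum likelihood step mentioned above is not needed and is not shown. The function names (`binned_calibration`, `simulate_calibrated_outcomes`, `wally_lineup`) and the toy data are hypothetical and are not taken from the ‘wally’ R package.

```python
import random


def binned_calibration(risks, outcomes, n_bins=5):
    """Return one (mean predicted risk, observed event frequency) pair per risk group."""
    if len(risks) < n_bins:
        raise ValueError("need at least one subject per group")
    # Sort subjects by predicted risk and cut them into groups of equal size.
    order = sorted(range(len(risks)), key=lambda i: risks[i])
    size = len(order) // n_bins
    points = []
    for b in range(n_bins):
        # The last group absorbs any remainder.
        idx = order[b * size:] if b == n_bins - 1 else order[b * size:(b + 1) * size]
        mean_risk = sum(risks[i] for i in idx) / len(idx)
        event_freq = sum(outcomes[i] for i in idx) / len(idx)
        points.append((mean_risk, event_freq))
    return points


def simulate_calibrated_outcomes(risks, rng):
    """Draw outcomes under the calibration assumption: P(event) equals the predicted risk."""
    return [1 if rng.random() < p else 0 for p in risks]


def wally_lineup(risks, outcomes, n_panels=9, seed=1):
    """Hide the observed calibration curve among curves simulated under calibration.

    Returns the list of curves and the (secret) position of the observed one.
    """
    rng = random.Random(seed)
    panels = [
        binned_calibration(risks, simulate_calibrated_outcomes(risks, rng))
        for _ in range(n_panels - 1)
    ]
    position = rng.randrange(n_panels)
    panels.insert(position, binned_calibration(risks, outcomes))
    return panels, position


# Toy data (assumption): the true risk is 1.5 times the predicted risk,
# so the hypothetical model is miscalibrated.
rng = random.Random(0)
risks = [rng.uniform(0.05, 0.6) for _ in range(500)]
outcomes = [1 if rng.random() < min(1.0, 1.5 * p) else 0 for p in risks]

panels, position = wally_lineup(risks, outcomes)
for k, panel in enumerate(panels):
    # Largest gap between observed frequency and mean predicted risk in each panel.
    gap = max(abs(obs - pred) for pred, obs in panel)
    print(k, round(gap, 3))
```

If the observed panel cannot be picked out from the simulated ones, the data give little evidence against calibration. If it stands out, the departure is unlikely to be sampling variability alone. In practice the panels would be drawn as plots and inspected visually rather than summarized by a single gap value.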


Keywords: Censoring; Competing risks; Model validation; Prediction modeling; Residual plot; Survival analysis



PB is grateful to the Bettencourt Schueller foundation for its support. We thank the DIVAT consortium and the Three-City study group for providing the data of the DIVAT and of the Three-City cohorts. Their supports are listed at and



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Authors and Affiliations

  • Paul Blanche (1), corresponding author
  • Thomas A. Gerds (2)
  • Claus T. Ekstrøm (2)

  1. LMBA, University of South Brittany, Vannes, France
  2. Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
