Understanding increments in model performance metrics

Lifetime Data Analysis

Abstract

The area under the receiver operating characteristic curve (AUC) is the most commonly reported measure of discrimination for prediction models with binary outcomes. However, recently it has been criticized for its inability to increase when important risk factors are added to a baseline model with good discrimination. This has led to the claim that the reliance on the AUC as a measure of discrimination may miss important improvements in clinical performance of risk prediction rules derived from a baseline model. In this paper we investigate this claim by relating the AUC to measures of clinical performance based on sensitivity and specificity under the assumption of multivariate normality. The behavior of the AUC is contrasted with that of discrimination slope. We show that unless rules with very good specificity are desired, the change in the AUC does an adequate job as a predictor of the change in measures of clinical performance. However, stronger or more numerous predictors are needed to achieve the same increment in the AUC for baseline models with good versus poor discrimination. When excellent specificity is desired, our results suggest that the discrimination slope might be a better measure of model improvement than AUC. The theoretical results are illustrated using a Framingham Heart Study example of a model for predicting the 10-year incidence of atrial fibrillation.
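The dependence of the AUC increment on baseline discrimination can be illustrated with a short numerical sketch. It is not taken from the paper; it relies on the standard result that, for multivariate normal predictors with a common covariance matrix in events and non-events, the optimal linear score has AUC = Φ(√(M²/2)), where M² is the squared Mahalanobis distance between the two group means. The function names are hypothetical, and the sketch assumes the added predictor is uncorrelated with the existing ones within each group, so that its squared standardized effect size simply adds to M².

```python
from math import sqrt
from statistics import NormalDist

# Standard normal distribution, used for Phi and its inverse.
_STD_NORMAL = NormalDist()


def auc_from_mahalanobis_sq(m2):
    """AUC of the optimal linear score under equal-covariance normality."""
    return _STD_NORMAL.cdf(sqrt(m2 / 2.0))


def mahalanobis_sq_from_auc(auc):
    """Invert the relation above: M^2 = 2 * (Phi^{-1}(AUC))^2, for AUC >= 0.5."""
    return 2.0 * _STD_NORMAL.inv_cdf(auc) ** 2


def auc_increment(baseline_auc, effect_size):
    """AUC gain from adding one predictor with the given standardized effect size.

    Assumption (ours, for illustration): the new predictor is uncorrelated
    with the baseline predictors within each outcome group, so M^2 increases
    by effect_size ** 2.
    """
    m2_new = mahalanobis_sq_from_auc(baseline_auc) + effect_size ** 2
    return auc_from_mahalanobis_sq(m2_new) - baseline_auc


def sensitivity_at_specificity(m2, specificity):
    """Sensitivity of the optimal linear score at a fixed specificity.

    With scores standardized to N(0, 1) in non-events and N(M, 1) in events,
    sensitivity = Phi(M - Phi^{-1}(specificity)).
    """
    return _STD_NORMAL.cdf(sqrt(m2) - _STD_NORMAL.inv_cdf(specificity))


if __name__ == "__main__":
    # Same added predictor (effect size 0.5), different baseline models.
    for base in (0.60, 0.70, 0.80, 0.90):
        gain = auc_increment(base, 0.5)
        print(f"baseline AUC {base:.2f}: increment {gain:.4f}")
```

Under these assumptions the same predictor produces a smaller AUC increment as the baseline AUC rises, which is consistent with the abstract's statement that stronger or more numerous predictors are needed to achieve a given increment when the baseline model already discriminates well.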

Acknowledgments

This research has been supported by National Heart, Lung, and Blood Institute’s Framingham Heart Study; contract/Grant Number: N01-HC-25195. Dr. Pencina has been additionally supported by NIH/ARRA Risk Prediction of Atrial Fibrillation; Grant Number: RC1HL101056.

Author information

Corresponding author

Correspondence to Michael J. Pencina.

About this article

Cite this article

Pencina, M.J., D’Agostino, R.B. & Massaro, J.M. Understanding increments in model performance metrics. Lifetime Data Anal 19, 202–218 (2013). https://doi.org/10.1007/s10985-012-9238-0

  • DOI: https://doi.org/10.1007/s10985-012-9238-0
