Abstract
The area under the receiver operating characteristic curve (AUC) is the most commonly reported measure of discrimination for prediction models with binary outcomes. However, recently it has been criticized for its inability to increase when important risk factors are added to a baseline model with good discrimination. This has led to the claim that the reliance on the AUC as a measure of discrimination may miss important improvements in clinical performance of risk prediction rules derived from a baseline model. In this paper we investigate this claim by relating the AUC to measures of clinical performance based on sensitivity and specificity under the assumption of multivariate normality. The behavior of the AUC is contrasted with that of discrimination slope. We show that unless rules with very good specificity are desired, the change in the AUC does an adequate job as a predictor of the change in measures of clinical performance. However, stronger or more numerous predictors are needed to achieve the same increment in the AUC for baseline models with good versus poor discrimination. When excellent specificity is desired, our results suggest that the discrimination slope might be a better measure of model improvement than AUC. The theoretical results are illustrated using a Framingham Heart Study example of a model for predicting the 10-year incidence of atrial fibrillation.
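Under the binormality assumption the abstract refers to, the AUC has a closed form, AUC = Φ(δ/√2), where δ is the standardized mean separation between events and non-events, and the discrimination slope is the difference in mean predicted risk between the two groups. The sketch below illustrates both quantities on simulated data; the helper names (`binormal_auc`, `discrimination_slope`) and the logistic mapping from score to risk are illustrative assumptions, not code from the paper.

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def binormal_auc(delta):
    """Theoretical AUC for equal-variance normal scores whose means are
    separated by delta standard deviations: AUC = Phi(delta / sqrt(2))."""
    return norm_cdf(delta / sqrt(2.0))

def empirical_auc(pos, neg):
    """Mann-Whitney (concordance) estimate of the AUC."""
    pos = np.asarray(pos)[:, None]
    neg = np.asarray(neg)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

def discrimination_slope(risk_pos, risk_neg):
    """Mean predicted risk among events minus mean among non-events."""
    return float(np.mean(risk_pos) - np.mean(risk_neg))

rng = np.random.default_rng(0)
delta = 1.0                              # standardized mean separation
neg = rng.normal(0.0, 1.0, 2000)         # model scores for non-events
pos = rng.normal(delta, 1.0, 2000)       # model scores for events

print(binormal_auc(delta))               # ~0.760
print(empirical_auc(pos, neg))           # close to the theoretical value

# Map scores to risks with a logistic link (an illustrative choice),
# then compute the discrimination slope on the risk scale.
to_risk = lambda x: 1.0 / (1.0 + np.exp(-(x - delta / 2.0)))
print(discrimination_slope(to_risk(pos), to_risk(neg)))
```

Raising δ (e.g. by adding a strong predictor to the model) increases both quantities, which is the relationship the paper examines for baseline models of varying strength.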
Acknowledgments
This research was supported by the National Heart, Lung, and Blood Institute's Framingham Heart Study (Contract/Grant N01-HC-25195). Dr. Pencina was additionally supported by the NIH/ARRA Risk Prediction of Atrial Fibrillation grant (RC1HL101056).
Cite this article
Pencina, M.J., D’Agostino, R.B. & Massaro, J.M. Understanding increments in model performance metrics. Lifetime Data Anal 19, 202–218 (2013). https://doi.org/10.1007/s10985-012-9238-0