Abstract
The area under the receiver operating characteristic curve (AUC) is the most commonly reported measure of discrimination for prediction models with binary outcomes. However, recently it has been criticized for its inability to increase when important risk factors are added to a baseline model with good discrimination. This has led to the claim that the reliance on the AUC as a measure of discrimination may miss important improvements in clinical performance of risk prediction rules derived from a baseline model. In this paper we investigate this claim by relating the AUC to measures of clinical performance based on sensitivity and specificity under the assumption of multivariate normality. The behavior of the AUC is contrasted with that of discrimination slope. We show that unless rules with very good specificity are desired, the change in the AUC does an adequate job as a predictor of the change in measures of clinical performance. However, stronger or more numerous predictors are needed to achieve the same increment in the AUC for baseline models with good versus poor discrimination. When excellent specificity is desired, our results suggest that the discrimination slope might be a better measure of model improvement than AUC. The theoretical results are illustrated using a Framingham Heart Study example of a model for predicting the 10-year incidence of atrial fibrillation.
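Under the binormality assumption the abstract refers to, the AUC has a closed form, AUC = Φ(δ/√2), where δ is the standardized mean separation between events and non-events, and the discrimination slope is the difference in mean predicted risk between the two groups. The sketch below illustrates both quantities on simulated data; the helper names (`binormal_auc`, `discrimination_slope`) and the logistic mapping from score to risk are illustrative assumptions, not code from the paper.

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def binormal_auc(delta):
    """Theoretical AUC for equal-variance normal scores whose means are
    separated by delta standard deviations: AUC = Phi(delta / sqrt(2))."""
    return norm_cdf(delta / sqrt(2.0))

def empirical_auc(pos, neg):
    """Mann-Whitney (concordance) estimate of the AUC."""
    pos = np.asarray(pos)[:, None]
    neg = np.asarray(neg)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

def discrimination_slope(risk_pos, risk_neg):
    """Mean predicted risk among events minus mean among non-events."""
    return float(np.mean(risk_pos) - np.mean(risk_neg))

rng = np.random.default_rng(0)
delta = 1.0                              # standardized mean separation
neg = rng.normal(0.0, 1.0, 2000)         # model scores for non-events
pos = rng.normal(delta, 1.0, 2000)       # model scores for events

print(binormal_auc(delta))               # ~0.760
print(empirical_auc(pos, neg))           # close to the theoretical value

# Map scores to risks with a logistic link (an illustrative choice),
# then compute the discrimination slope on the risk scale.
to_risk = lambda x: 1.0 / (1.0 + np.exp(-(x - delta / 2.0)))
print(discrimination_slope(to_risk(pos), to_risk(neg)))
```

Raising δ (e.g. by adding a strong predictor to the model) increases both quantities, which is the relationship the paper examines for baseline models of varying strength.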
Acknowledgments
This research was supported by the National Heart, Lung, and Blood Institute's Framingham Heart Study (Contract/Grant N01-HC-25195). Dr. Pencina was additionally supported by the NIH/ARRA Risk Prediction of Atrial Fibrillation grant (RC1HL101056).
Cite this article
Pencina, M.J., D’Agostino, R.B. & Massaro, J.M. Understanding increments in model performance metrics. Lifetime Data Anal 19, 202–218 (2013). https://doi.org/10.1007/s10985-012-9238-0