Evaluating discrete choice prediction models when the evaluation data is corrupted: analytic results and bias corrections for the area under the ROC
There has been a growing recognition that issues of data quality, which are routine in practice, can materially affect the assessment of learned model performance. In this paper, we develop analytic results that are useful in sizing the biases associated with tests of discriminatory model power when these tests are performed using corrupt (“noisy”) data. Because it is sometimes unavoidable to test models with data that are known to be corrupt, we also provide guidance on interpreting the results of such tests. In some cases, with appropriate knowledge of the corruption mechanism, the true values of performance statistics such as the area under the ROC curve may be recovered (in expectation), even when the underlying data have been corrupted. We also provide estimators of the standard errors of such recovered performance statistics. An analysis of these estimators reveals interesting behavior, including the observation that “noisy” data do not “cancel out” across models even when the same corrupt data set is used to test multiple candidate models. Because our results are analytic, they may be applied in a broad range of settings without the need for simulation.
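To make the abstract's claim about recovery concrete, the sketch below illustrates one simple corruption mechanism of the general flavor discussed here: evaluation labels are flipped at known rates that are independent of the model scores, so the AUC computed on the noisy labels is an affine function of the true AUC and can be inverted. This is an expository illustration under assumed conditions (continuous, tie-free scores; known flip rates), not the paper's own estimator, and the function names (`empirical_auc`, `recovered_auc`, `recovered_auc_se`) are hypothetical.

```python
import numpy as np


def empirical_auc(scores_pos, scores_neg):
    """Mann-Whitney estimate of AUC = P(score of a positive > score of a negative).

    Ties count 1/2. The O(n*m) pairwise form is fine for an illustration."""
    diff = np.subtract.outer(scores_pos, scores_neg)
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))


def recovered_auc(noisy_auc, alpha, beta):
    """Invert the affine relation

        AUC_noisy = (1 - alpha - beta) * AUC_true + (alpha + beta) / 2,

    which holds when a known fraction alpha of records labeled positive are truly
    negative, a known fraction beta of records labeled negative are truly positive,
    the flips are independent of the model scores, and the scores are continuous."""
    if alpha + beta >= 1:
        raise ValueError("recovery is undefined when alpha + beta >= 1")
    return (noisy_auc - 0.5 * (alpha + beta)) / (1.0 - alpha - beta)


def recovered_auc_se(noisy_auc_se, alpha, beta):
    """With alpha and beta treated as known constants, the recovery is affine in the
    noisy AUC, so its standard error is the noisy standard error scaled by
    1 / (1 - alpha - beta)."""
    return noisy_auc_se / (1.0 - alpha - beta)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha, beta = 0.10, 0.05          # assumed-known corruption rates
    n_pos, n_neg = 1000, 4000

    # Observed groups built to match the assumed mechanism: a fraction alpha of
    # "observed positives" are draws from the negative score distribution, and a
    # fraction beta of "observed negatives" are draws from the positive one.
    obs_pos = np.concatenate([rng.normal(1.0, 1.0, int((1 - alpha) * n_pos)),
                              rng.normal(0.0, 1.0, int(alpha * n_pos))])
    obs_neg = np.concatenate([rng.normal(0.0, 1.0, int((1 - beta) * n_neg)),
                              rng.normal(1.0, 1.0, int(beta * n_neg))])

    clean_auc = empirical_auc(rng.normal(1.0, 1.0, n_pos), rng.normal(0.0, 1.0, n_neg))
    noisy_auc = empirical_auc(obs_pos, obs_neg)

    print(f"AUC on corrupted labels : {noisy_auc:.3f}")
    print(f"recovered AUC           : {recovered_auc(noisy_auc, alpha, beta):.3f}")
    print(f"AUC on clean draws      : {clean_auc:.3f}")
```

Because the correction in this sketch is an affine transformation with known coefficients, the standard error of the recovered AUC is the noisy-AUC standard error inflated by 1/(1 - alpha - beta), which is one simple way to see why such corrections can remove bias while widening confidence intervals.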
Keywords: ROC · Model validation · Prediction · Data corruption · Bias correction · Misclassification · Credit models · Machine learning
Mathematics Subject Classification: 62-07 · 62G10
The author is grateful to Sanjiv Das, David Fagnan, Lisa Goldberg and Mitchell Petersen for detailed comments on earlier drafts of this paper. The author is particularly grateful to Foster Provost, who provided extensive and detailed suggestions on improving the exposition and extending the results, including suggesting the idea of a recovered ROC. This article was greatly improved by the observations and suggestions of three anonymous reviewers. All errors are, of course, my own. The views expressed in this article are those of the author and do not represent the views of former employers or any of their affiliates.