Improving class probability estimates for imbalanced data

Wallace, Byron C.; Dahabreh, Issa J.

doi:10.1007/s10115-013-0670-6

Improving class probability estimates for imbalanced data

Regular paper
Published: 12 July 2013

Volume 41, pages 33–52, (2014)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Byron C. Wallace¹ &
Issa J. Dahabreh¹

1992 Accesses
37 Citations
1 Altmetric
Explore all metrics

Abstract

Obtaining good probability estimates is imperative for many applications. The increased uncertainty and typically asymmetric costs surrounding rare events increase this need. Experts (and classification systems) often rely on probabilities to inform decisions. However, we demonstrate that class probability estimates obtained via supervised learning in imbalanced scenarios systematically underestimate the probabilities for minority class instances, despite ostensibly good overall calibration. To our knowledge, this problem has not previously been explored. We propose a new metric, the stratified Brier score, to capture class-specific calibration, analogous to the per-class metrics widely used to assess the discriminative performance of classifiers in imbalanced scenarios. We propose a simple, effective method to mitigate the bias of probability estimates for imbalanced data that bags estimators independently calibrated over balanced bootstrap samples. This approach drastically improves performance on the minority instances without greatly affecting overall calibration. We extend our previous work in this direction by providing ample additional empirical evidence for the utility of this strategy, using both support vector machines and boosted decision trees as base learners. Finally, we show that additional uncertainty can be exploited via a Bayesian approach by considering posterior distributions over bagged probability estimates.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Constrained Naïve Bayes with application to unbalanced data classification

Article Open access 20 October 2021

Non-parametric Bayesian Isotonic Calibration: Fighting Over-Confidence in Binary Classification

On Properties of Undersampling Bagging and Its Extensions for Imbalanced Data

Notes

http://www.cebm.brown.edu/static/imbalanced-datasets.zip.
We note that Cieslak and Chawla [7] have investigated the specific case of probability estimation trees (PETs) for imbalanced data and that Foster and Stine [13] have considered the related task of variable selection for prediction under imbalance.
Somewhat confusingly, the term ‘calibration’ is often used both to refer to the process of calibrating classifiers and to measure the accuracy of probability estimates.
We used the somewhat arbitrary relative weight of 10.
See Table 1.
This limit is undefined in general; here \(\tilde{\pi }\) is coming from the positive side.
Recall that the training time of SVMs scales quadratically with the number of instances.
http://www.cebm.brown.edu/static/imbalanced-datasets.zip.

References

Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
MATH MathSciNet Google Scholar
Breslow NE, Day NE (1980) Statistical methods in cancer research, vol. 1. The analysis of case-control studies, vol 1. Distributed for IARC by WHO, Geneva, Switzerland
Brier GW (1950) Verification of forecasts expressed in terms of probability. Mon Weather Rev 78(1):1–3
Article Google Scholar
Chawla NV (2010) Data mining for imbalanced datasets: an overview. Data Mining Knowledge Discovery Handbook, pp 875–886
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
MATH Google Scholar
Chawla NV, Cieslak DA, Hall LO, Joshi A (2008) Automatically countering imbalance and its empirical relationship to cost. Data Min Knowl Discov 17(2):225–252
Article MathSciNet Google Scholar
Cieslak D, Chawla N (2008) Analyzing pets on imbalanced datasets when training and testing class distributions differ. Adv Knowl Discov Data Min 5012: 519–526
Google Scholar
Cohen AM, Hersh WR, Peterson K, Yen PY (2006a) Reducing workload in systematic review preparation using automated citation classification. J Am Med Inform Assoc 13(2):206–219
Article Google Scholar
Cohen G, Hilario M, Sax H, Hugonnet S, Geissbuhler A (2006b) Learning from imbalanced data in surveillance of nosocomial infection. Artif Intell Med 37(1):7–18
Article Google Scholar
Cohen I, Goldszmidt M (2004) Properties and benefits of calibrated classifiers. In: 15th European conference on machine learning (ECML04) and 8th European conference on principles and practice of knowledge discovery in databases (PKDD04), pp 125–136
Elkan C (2001) The foundations of cost-sensitive learning. In: Proceedings of the 7th international joint conference on, artificial intelligence (IJCAI01), vol 17, pp 973–978
Firth D (1993) Bias reduction of maximum likelihood estimates. Biometrika 80(1):27–38
Article MATH MathSciNet Google Scholar
Foster DP, Stine RA (2004) Variable selection in data mining: building a predictive model for bankruptcy. J Am Stat Assoc 99(466):303–313
Article MATH MathSciNet Google Scholar
Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 4th International conference on natural computation (ICNC08), pp 192–201
Haibo H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Article Google Scholar
Hastie T, Tibshirani R (1998) Classification by pairwise coupling. Ann Stat 26(2):451–471
Article MATH MathSciNet Google Scholar
Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6(5):429–449
MATH Google Scholar
King G, Zeng L (2001) Logistic regression in rare events data. Political Anal 9(2):137–163
Article Google Scholar
Lin HT, Lin CJ, Weng RC (2007) A note on platt’s probabilistic outputs for support vector machines. Mach Learn 68(3):267–276
Article Google Scholar
McCullagh P, Nelder JA (1987) Generalized linear models. Chapman and Hall, London
Google Scholar
Niculescu-Mizil A, Caruana R (2005a) Obtaining calibrated probabilities from boosting. In: Proceedings of the 21st conference on uncertainty in artificial intelligence (UAI05), pp 413–420
Niculescu-Mizil A, Caruana R (2005b) Predicting good probabilities with supervised learning. In: Proceedings of the 22nd international conference on machine learning (ICML05). ACM, pp 625–632
Platt J (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classif 10(3):61–74
Google Scholar
Provost F (2000) Machine learning from imbalanced data sets 101 (invited talk). In: Proceedings of the AAAI workshop on learning from imbalanced data sets (AAAI00)
Rabe-Hesketh S, Skrondal A (2008) Multilevel and longitudinal modeling using Stata. Stata Corp, College Station, Texas
Van Hulse J, Khoshgoftaar TM, Napolitano A (2007) Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th international conference on machine learning (ICML07), pp 935–942
Wallace BC, Dahabreh IJ (2012) Class probability estimates are unreliable for imbalanced data (and how to fix them). In: Proceedings of the 12th international conference on data mining (ICDM12). IEEE, pp 695–704
Wallace BC, Small K, Brodley CE, Trikalinos TA (2011) Class imbalance, redux. In: Proceedings of the 11th international conference on data mining (ICDM11). IEEE, pp 754–763
Wallace BC, Trikalinos TA, Lau J, Brodley C, Schmid CH (2010) Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinform 11(1):55
Article Google Scholar
Walter SD (1985) Small sample estimation of log odds ratios from logistic regression and fourfold tables. Stat Med 4(4):437–444
Article Google Scholar
Yang W, Wu X (2006) 10 challenging problems in data mining research. Int J Inf Technol Dec Making 5(4):597–604
Article Google Scholar
Zadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining (KDD02). ACM, pp 694–699
Zhu J, Hovy E, (2007) Active learning for word sense disambiguation with methods for addressing the class imbalance problem. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural, language learning, pp 783–790

Download references

Author information

Authors and Affiliations

Center for Evidence-Based Medicine, Brown University, Providence, RI, USA
Byron C. Wallace & Issa J. Dahabreh

Authors

Byron C. Wallace
View author publications
You can also search for this author in PubMed Google Scholar
Issa J. Dahabreh
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Byron C. Wallace.

Additional information

This article is an extended version of our ICDM 2012 paper [27]. This work was supported in part by a grant from the Agency for Healthcare Research and Quality (AHRQ, grant HS018494-01). The findings and conclusions in this paper are those of the authors, who are responsible for its content, and do not necessarily represent the views of the AHRQ. No statement in this report should be construed as an official position of the AHRQ or of the U.S. Department of Health and Human Services.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Wallace, B.C., Dahabreh, I.J. Improving class probability estimates for imbalanced data. Knowl Inf Syst 41, 33–52 (2014). https://doi.org/10.1007/s10115-013-0670-6

Download citation

Received: 10 February 2013
Revised: 12 April 2013
Accepted: 29 June 2013
Published: 12 July 2013
Issue Date: October 2014
DOI: https://doi.org/10.1007/s10115-013-0670-6

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Improving class probability estimates for imbalanced data

Abstract

Access this article

Similar content being viewed by others

Constrained Naïve Bayes with application to unbalanced data classification

Non-parametric Bayesian Isotonic Calibration: Fighting Over-Confidence in Binary Classification

On Properties of Undersampling Bagging and Its Extensions for Imbalanced Data

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Improving class probability estimates for imbalanced data

Abstract

Access this article

Similar content being viewed by others

Constrained Naïve Bayes with application to unbalanced data classification

Non-parametric Bayesian Isotonic Calibration: Fighting Over-Confidence in Binary Classification

On Properties of Undersampling Bagging and Its Extensions for Imbalanced Data

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation