
Sample Selection Bias Correction Theory

  • Conference paper
Algorithmic Learning Theory (ALT 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5254)

Included in the following conference series: Algorithmic Learning Theory (ALT)

Abstract

This paper presents a theoretical analysis of sample selection bias correction. The sample bias correction technique commonly used in machine learning consists of reweighting the cost of an error on each training point of a biased sample to more closely reflect the unbiased distribution. This relies on weights derived by various estimation techniques based on finite samples. We analyze the effect of an error in that estimation on the accuracy of the hypothesis returned by the learning algorithm for two estimation techniques: a cluster-based estimation technique and kernel mean matching. We also report the results of sample bias correction experiments with several data sets using these techniques. Our analysis is based on the novel concept of distributional stability which generalizes the existing concept of point-based stability. Much of our work and proof techniques can be used to analyze other importance weighting techniques and their effect on accuracy when using a distributionally stable algorithm.
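
To make the reweighting scheme discussed above concrete, the following is a minimal sketch (not taken from the paper): it estimates importance weights with a simple cluster-based technique and then rescales the cost of an error on each training point. The synthetic data, the variable names, and the use of scikit-learn's KMeans and ridge regression are assumptions made for illustration; kernel mean matching would instead obtain the weights by matching feature means in a reproducing kernel Hilbert space via a quadratic program.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)

    # Unlabeled pool drawn from the true (unbiased) distribution.
    X_pool = rng.normal(size=(2000, 5))

    # Biased selection: points with a large first coordinate are more likely
    # to be sampled, so the labeled training set is not representative.
    selected = rng.random(2000) < 1.0 / (1.0 + np.exp(-2.0 * X_pool[:, 0]))
    X_train = X_pool[selected]
    y_train = (X_train @ np.array([1.0, -2.0, 0.5, 0.0, 0.0])
               + 0.1 * rng.normal(size=selected.sum()))

    # Cluster-based weight estimation: partition the unbiased pool into
    # clusters and estimate the sampling probability Pr[s = 1 | cluster]
    # by the fraction of each cluster that entered the biased sample.
    k = 10
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pool)
    pool_counts = np.bincount(km.labels_, minlength=k)
    train_counts = np.bincount(km.labels_[selected], minlength=k)
    p_sample = np.maximum(train_counts, 1) / pool_counts
    weights = 1.0 / p_sample[km.labels_[selected]]  # w(x) estimates 1 / Pr[s = 1 | x]

    # Bias-corrected training: the cost of an error on each training point is
    # reweighted, here through the sample_weight argument of ridge regression.
    corrected = Ridge(alpha=1.0).fit(X_train, y_train, sample_weight=weights)
    uncorrected = Ridge(alpha=1.0).fit(X_train, y_train)

Because the weights are estimated from finite samples, they carry estimation error; the paper's distributional-stability analysis bounds how such errors in the weights affect the hypothesis returned by a learning algorithm of this kind.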




Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cortes, C., Mohri, M., Riley, M., Rostamizadeh, A. (2008). Sample Selection Bias Correction Theory. In: Freund, Y., Györfi, L., Turán, G., Zeugmann, T. (eds) Algorithmic Learning Theory. ALT 2008. Lecture Notes in Computer Science (LNAI), vol 5254. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87987-9_8


  • DOI: https://doi.org/10.1007/978-3-540-87987-9_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-87986-2

  • Online ISBN: 978-3-540-87987-9

  • eBook Packages: Computer Science (R0)
