A Novel Evaluation Methodology for Assessing Off-Policy Learning Methods in Contextual Bandits

  • Conference paper
  • In: Advances in Artificial Intelligence (Canadian AI 2018)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10832)

Abstract

We propose a novel evaluation methodology for assessing off-policy learning methods in contextual bandits. In particular, we provide a way to use data from any given Randomized Controlled Trial (RCT) to generate a range of observational studies with synthesized “outcome functions” that match the user’s specified degrees of sample selection bias; these studies can then be used to comprehensively assess a given learning method. This is especially important in evaluating methods developed for precision medicine, where deploying a bad policy can have devastating effects. As the outcome function specifies the real-valued quality of any treatment for any instance, we can accurately compute the quality of any proposed treatment policy. This paper uses this evaluation methodology to establish a common ground for comparing the robustness and performance of the off-policy learning methods available in the literature.
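To make the generation step concrete, here is a minimal Python sketch of one way to derive a selection-biased observational study from RCT records. The selection model (a random linear scorer over contexts), the binary treatment encoding, and the names `synthesize_observational_study` and `bias_strength` are illustrative assumptions for this sketch, not the paper's actual construction, which is defined via its synthesized outcome functions.

```python
import numpy as np

def synthesize_observational_study(X, T, Y, bias_strength, seed=0):
    """Subsample RCT records (contexts X, randomized binary treatments T,
    outcomes Y) so that which records are kept depends on the context,
    emulating sample selection bias of a user-specified strength.

    bias_strength = 0 preserves the RCT's randomization; larger values
    make the retained treatment assignments increasingly confounded with X.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])   # hypothetical linear scorer over contexts
    score = X @ w
    # Keep a record with higher probability when its context score "agrees"
    # with the treatment the RCT happened to assign (T in {0, 1}).
    keep_prob = 1.0 / (1.0 + np.exp(-bias_strength * score * (2 * T - 1)))
    keep = rng.random(len(T)) < keep_prob
    # Return the biased study plus the true selection propensities,
    # which an evaluator can hand to IPS-style estimators.
    return X[keep], T[keep], Y[keep], keep_prob[keep]
```

Because the selection propensities are known by construction in such a scheme, a policy learned from the biased sample can still be scored exactly against the full RCT, which is what makes the methodology usable as a benchmark.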

Acknowledgments. The authors were supported by NSERC and Amii.

Notes

  1. This is not one-hot encoding, as there may be instances with multiple associated labels, e.g., a news article concerning political initiatives on climate change.

  2. Note that the test set remains intact for evaluating the learned policy.

  3. This means the X values are realistic. By contrast, we do not know whether the X values from a supervised dataset look like those in realistic [medical] observational studies.

  4. A low \(R^2\) measure suggests that there must exist [some] unobserved confounder(s) that [significantly] contribute to the outcome.

  5. Our implementation of IPS (and SN, below) is obtained from the Policy Optimizer for Exponential Models (POEM) [19]. We extended POEM substantially to include the missing components (i.e., OP and DR), as well as an implementation of the proposed evaluation methodology; a brief sketch of the basic estimators follows these notes.
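For concreteness, here is a minimal Python sketch of the vanilla IPS estimator [11] and its self-normalized (SN) variant [15] named in the note above. It illustrates the estimators only and is not drawn from POEM's code; the inputs assumed are the logging policy's propensities mu(a|x) for the logged actions, the logged rewards r, and the target policy's probabilities pi(a|x) for those same actions.

```python
import numpy as np

def ips_value(mu, r, pi):
    """Inverse propensity scoring: mean of r * pi(a|x) / mu(a|x).
    Unbiased when mu gives the true logging propensities."""
    w = pi / mu
    return np.mean(w * r)

def sn_value(mu, r, pi):
    """Self-normalized IPS: divide by the sum of importance weights
    instead of the sample size, trading a small bias for lower variance."""
    w = pi / mu
    return np.sum(w * r) / np.sum(w)
```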

References

  1. Pearl, J.: Causality. Cambridge University Press, New York (2009)

  2. Imbens, G.W., Rubin, D.B.: Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York (2015)

  3. Wang, C.C., Kulkarni, S.R., Poor, H.V.: Bandit problems with side observations. IEEE Trans. Autom. Control 50(3), 338–355 (2005)

  4. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, vol. 1. MIT Press, Cambridge (1998)

  5. Bottou, L., Peters, J., Candela, J.Q., Charles, D.X., Chickering, M., Portugaly, E., Ray, D., Simard, P.Y., Snelson, E.: Counterfactual reasoning and learning systems: the example of computational advertising. JMLR 14(1), 3207–3260 (2013)

  6. Li, L., Chen, S., Kleban, J., Gupta, A.: Counterfactual estimation and optimization of click metrics in search engines: a case study. In: Proceedings of the 24th International Conference on World Wide Web. ACM (2015)

  7. Schnabel, T., Swaminathan, A., Singh, A., Chandak, N., Joachims, T.: Recommendations as treatments: debiasing learning and evaluation. In: Proceedings of the 33rd International Conference on Machine Learning, vol. 48 (2016)

  8. Liu, Y.E., Mandel, T., Brunskill, E., Popovic, Z.: Trading off scientific knowledge and user learning with multi-armed bandits. In: Educational Data Mining (2014)

  9. Li, L., Chu, W., Langford, J., Schapire, R.E.: A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th International Conference on World Wide Web. ACM (2010)

  10. Li, L., Chu, W., Langford, J., Wang, X.: Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In: Proceedings of the 4th International Conference on Web Search and Data Mining, Hong Kong (2011)

  11. Horvitz, D.G., Thompson, D.J.: A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47(260), 663–685 (1952)

  12. Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55 (1983)

  13. Dudík, M., Langford, J., Li, L.: Doubly robust policy evaluation and learning. In: International Conference on Machine Learning (2011)

  14. Robins, J.M., Rotnitzky, A., Zhao, L.P.: Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 89(427), 846–866 (1994)

  15. Swaminathan, A., Joachims, T.: The self-normalized estimator for counterfactual learning. In: Advances in Neural Information Processing Systems (2015)

  16. Hirano, K., Imbens, G.W., Ridder, G.: Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71(4), 1161–1189 (2003)

  17. Swaminathan, A., Joachims, T.: Counterfactual risk minimization: learning from logged bandit feedback. In: International Conference on Machine Learning (2015)

  18. Beygelzimer, A., Langford, J.: The offset tree for learning with partial labels. In: Proceedings of the 15th ACM SIGKDD. ACM (2009)

  19. Swaminathan, A., Joachims, T.: Batch learning from logged bandit feedback through counterfactual risk minimization. JMLR 16, 1731–1755 (2015)

  20. Rasmussen, C.E., Williams, C.K.: Gaussian Processes for Machine Learning, vol. 1. MIT Press, Cambridge (2006)

  21. Vickers, A.J., Rees, R.W., Zollman, C.E., McCarney, R., Smith, C.M., Ellis, N., Fisher, P., Van Haselen, R.: Acupuncture for chronic headache in primary care: large, pragmatic, randomised trial. BMJ 328(7442), 744 (2004)

  22. Vickers, A.J.: Whose data set is it anyway? Sharing raw data from randomized trials. Trials 7(1), 15 (2006)

  23. Hypericum Depression Trial Study Group, et al.: Effect of Hypericum perforatum (St. John’s Wort) in major depressive disorder: a randomized controlled trial. JAMA 287(14), 1807–1814 (2002)

Author information

Correspondence to Negar Hassanpour.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Hassanpour, N., Greiner, R. (2018). A Novel Evaluation Methodology for Assessing Off-Policy Learning Methods in Contextual Bandits. In: Bagheri, E., Cheung, J. (eds.) Advances in Artificial Intelligence. Canadian AI 2018. Lecture Notes in Computer Science (LNAI), vol. 10832. Springer, Cham. https://doi.org/10.1007/978-3-319-89656-4_3

  • DOI: https://doi.org/10.1007/978-3-319-89656-4_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-89655-7

  • Online ISBN: 978-3-319-89656-4
