Abstract
We propose a novel evaluation methodology for assessing off-policy learning methods in contextual bandits. In particular, we provide a way to use data from any given Randomized Controlled Trial (RCT) to generate a range of observational studies with synthesized “outcome functions” that match the user’s specified degrees of sample selection bias, which can then be used to comprehensively assess a given learning method. This is especially important in evaluating methods developed for precision medicine, where deploying a bad policy can have devastating effects. As the outcome function specifies the real-valued quality of any treatment for any instance, we can accurately compute the quality of any proposed treatment policy. This paper uses this evaluation methodology to establish a common ground for comparing the robustness and performance of the available off-policy learning methods in the literature.
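The general idea of deriving a biased observational study from RCT records can be sketched as follows. This is an illustrative simplification, not the paper's exact sampling procedure: `selection_prob` is a hypothetical user-supplied function encoding the desired degree of sample selection bias, and the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def biased_subsample(X, T, Y, selection_prob):
    """Turn RCT records (contexts X, randomized treatments T, outcomes Y)
    into a synthetic observational study: keep each record with a
    probability that depends on (x, t), which injects sample selection
    bias of a user-controlled strength."""
    keep_prob = np.array([selection_prob(x, t) for x, t in zip(X, T)])
    keep = rng.random(len(T)) < keep_prob
    return X[keep], T[keep], Y[keep]
```

Because the RCT assigns treatments at random, the retained records have known (designed-in) treatment-assignment bias, so a learning method's output policy can be scored against the synthesized outcome function.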
R. Greiner—The authors were supported by NSERC and Amii.
Notes
1. This is not one-hot encoding, as there may be instances with multiple associated labels – e.g., a news article concerning political initiatives on climate change.
2. Note that the test set remains intact for evaluating the learned policy.
3. This means the X values are realistic. By contrast, we do not know whether the X values from a supervised dataset look like realistic [medical] observational studies.
4. A low \(R^2\) measure suggests that there must exist [some] unobserved confounder(s) that [significantly] contribute to the outcome.
5. Our implementation of IPS (and SN, below) is obtained from the Policy Optimizer for Exponential Models (POEM [19]). We extended POEM substantially to include the missing components (i.e., OP and DR), as well as an implementation of the proposed evaluation methodology.
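For context, the IPS and SN estimators named in note 5 can be sketched as follows. This is a minimal textbook-style illustration, not the POEM implementation; the function and argument names are our own.

```python
import numpy as np

def ips_estimate(rewards, logging_propensities, target_propensities):
    """Inverse Propensity Scoring (IPS) estimate of a target policy's value
    from logged bandit feedback: reweight each observed reward by the ratio
    of the target policy's action probability to the logging policy's,
    then average over the n logged records."""
    weights = target_propensities / logging_propensities
    return np.mean(weights * rewards)

def sn_estimate(rewards, logging_propensities, target_propensities):
    """Self-Normalized (SN) estimator: normalize by the sum of importance
    weights instead of n, trading a small bias for reduced variance
    relative to plain IPS."""
    weights = target_propensities / logging_propensities
    return np.sum(weights * rewards) / np.sum(weights)
```

When the target and logging policies coincide, both estimators reduce to the empirical mean reward; they diverge as the importance weights become uneven.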
References
1. Pearl, J.: Causality. Cambridge University Press, New York (2009)
2. Imbens, G.W., Rubin, D.B.: Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York (2015)
3. Wang, C.C., Kulkarni, S.R., Poor, H.V.: Bandit problems with side observations. IEEE Trans. Autom. Control 50(3), 338–355 (2005)
4. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, vol. 1. MIT Press, Cambridge (1998)
5. Bottou, L., Peters, J., Candela, J.Q., Charles, D.X., Chickering, M., Portugaly, E., Ray, D., Simard, P.Y., Snelson, E.: Counterfactual reasoning and learning systems: the example of computational advertising. JMLR 14(1), 3207–3260 (2013)
6. Li, L., Chen, S., Kleban, J., Gupta, A.: Counterfactual estimation and optimization of click metrics in search engines: a case study. In: Proceedings of the 24th International Conference on World Wide Web. ACM (2015)
7. Schnabel, T., Swaminathan, A., Singh, A., Chandak, N., Joachims, T.: Recommendations as treatments: debiasing learning and evaluation. In: Proceedings of the 33rd International Conference on Machine Learning, vol. 48 (2016)
8. Liu, Y.E., Mandel, T., Brunskill, E., Popovic, Z.: Trading off scientific knowledge and user learning with multi-armed bandits. In: Educational Data Mining (2014)
9. Li, L., Chu, W., Langford, J., Schapire, R.E.: A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th International Conference on World Wide Web. ACM (2010)
10. Li, L., Chu, W., Langford, J., Wang, X.: Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In: Proceedings of the 4th International Conference on Web Search and Data Mining, Hong Kong (2011)
11. Horvitz, D.G., Thompson, D.J.: A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47(260), 663–685 (1952)
12. Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55 (1983)
13. Dudík, M., Langford, J., Li, L.: Doubly robust policy evaluation and learning. In: International Conference on Machine Learning (2011)
14. Robins, J.M., Rotnitzky, A., Zhao, L.P.: Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 89(427), 846–866 (1994)
15. Swaminathan, A., Joachims, T.: The self-normalized estimator for counterfactual learning. In: Advances in Neural Information Processing Systems (2015)
16. Hirano, K., Imbens, G.W., Ridder, G.: Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71(4), 1161–1189 (2003)
17. Swaminathan, A., Joachims, T.: Counterfactual risk minimization: learning from logged bandit feedback. In: International Conference on Machine Learning (2015)
18. Beygelzimer, A., Langford, J.: The offset tree for learning with partial labels. In: Proceedings of the 15th ACM SIGKDD. ACM (2009)
19. Swaminathan, A., Joachims, T.: Batch learning from logged bandit feedback through counterfactual risk minimization. JMLR 16, 1731–1755 (2015)
20. Rasmussen, C.E., Williams, C.K.: Gaussian Processes for Machine Learning, vol. 1. MIT Press, Cambridge (2006)
21. Vickers, A.J., Rees, R.W., Zollman, C.E., McCarney, R., Smith, C.M., Ellis, N., Fisher, P., Van Haselen, R.: Acupuncture for chronic headache in primary care: large, pragmatic, randomised trial. BMJ 328(7442), 744 (2004)
22. Vickers, A.J.: Whose data set is it anyway? Sharing raw data from randomized trials. Trials 7(1), 15 (2006)
23. Hypericum Depression Trial Study Group, et al.: Effect of Hypericum perforatum (St. John’s Wort) in major depressive disorder: a randomized controlled trial. JAMA 287(14), 1807–1814 (2002)
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
Hassanpour, N., Greiner, R. (2018). A Novel Evaluation Methodology for Assessing Off-Policy Learning Methods in Contextual Bandits. In: Bagheri, E., Cheung, J. (eds) Advances in Artificial Intelligence. Canadian AI 2018. Lecture Notes in Computer Science(), vol 10832. Springer, Cham. https://doi.org/10.1007/978-3-319-89656-4_3
Print ISBN: 978-3-319-89655-7
Online ISBN: 978-3-319-89656-4