Statistical Reinforcement Learning

  • Chapter in:
Statistical Methods for Dynamic Treatment Regimes

Part of the book series: Statistics for Biology and Health (SBH)

Abstract

Constructing optimal dynamic treatment regimes for chronic disorders based on patient data is a problem of multi-stage decision making about the best sequence of treatments. This problem bears strong resemblance to the problem of reinforcement learning in computer science, a branch of machine learning that deals with the problem of multi-stage, sequential decision making by a learning agent. In this chapter, we review the necessary concepts of reinforcement learning, connect them to the relevant statistical literature, and develop a mathematical framework that will enable us to treat the problem of estimating the optimal dynamic treatment regimes rigorously.


Notes

  1.

    In some settings, there may only be a terminal reward for the entire sequence of agent-environment interactions.

  2.

    In the case of a SMART, this policy consists of the randomization probabilities and is known by design; in an observational study, it can be estimated by the propensity score (see Sect. 3.5 for the definition).
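A minimal sketch of the distinction in this note, with a hypothetical helper `propensity_by_stratum`: in an observational study the treatment-assignment probability must be estimated from the data (here by the crude stratum-specific treated fraction, standing in for the model-based propensity-score estimators of Sect. 3.5), whereas in a SMART it is fixed by design (e.g. 0.5 under balanced randomization).

```python
from collections import defaultdict

def propensity_by_stratum(data):
    """Estimate P(A = 1 | X = x) by the treated fraction within each
    covariate stratum.  `data` is a list of (x, a) pairs with a
    binary treatment indicator a in {0, 1}."""
    counts = defaultdict(lambda: [0, 0])  # x -> [n_treated, n_total]
    for x, a in data:
        counts[x][0] += a
        counts[x][1] += 1
    return {x: n_treated / n for x, (n_treated, n) in counts.items()}

# Observational data: assignment probability varies with the stratum x.
data = [(0, 1), (0, 0), (0, 1), (0, 1),
        (1, 0), (1, 0), (1, 1), (1, 0)]
ps = propensity_by_stratum(data)  # {0: 0.75, 1: 0.25}

# In a SMART the analogous quantity needs no estimation; it is the
# known randomization probability, e.g. 0.5 for every stratum.
```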

  3.

    The version of Q-learning we will be using in this book is similar to the fitted Q-iteration algorithm in the RL literature. This version is an adaptation of Watkins’ classical Q-learning to batch data, involving function approximation.
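The batch (fitted-Q) recursion behind this note can be sketched for a two-stage regime with discrete states and actions. The function name `qlearn_two_stage` is illustrative, and cell means stand in for the regression-based function approximation the book actually uses; the backward-induction structure (fit stage 2, form the pseudo-outcome, fit stage 1) is the same.

```python
from collections import defaultdict

def qlearn_two_stage(batch):
    """Tabular sketch of batch Q-learning for two decision stages.
    Each record in `batch` is (s1, a1, s2, a2, y), with discrete
    states, discrete actions, and a terminal outcome y."""
    # Stage 2: Q2(s2, a2) = mean outcome within each (s2, a2) cell.
    cell2 = defaultdict(list)
    for s1, a1, s2, a2, y in batch:
        cell2[(s2, a2)].append(y)
    q2 = {k: sum(v) / len(v) for k, v in cell2.items()}

    def v2(s2):
        # Value of following the optimal stage-2 rule from state s2.
        return max(q for (s, a), q in q2.items() if s == s2)

    # Stage 1: fit the pseudo-outcome max_a Q2(s2, a) within (s1, a1) cells.
    cell1 = defaultdict(list)
    for s1, a1, s2, a2, y in batch:
        cell1[(s1, a1)].append(v2(s2))
    q1 = {k: sum(v) / len(v) for k, v in cell1.items()}
    return q1, q2

batch = [(0, 0, 0, 0, 1.0), (0, 0, 0, 1, 3.0),
         (0, 1, 1, 0, 2.0), (0, 1, 1, 1, 0.0)]
q1, q2 = qlearn_two_stage(batch)
# The optimal stage-1 action in state 0 is argmax_a q1[(0, a)].
```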

  4.

    Inference for stage 1 parameters in Q-learning is problematic due to an underlying lack of smoothness, so usual bootstrap inference is not theoretically valid. Nevertheless, we use it here for illustrative purposes only. Valid inference procedures will be discussed in Chap. 8.
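The "usual bootstrap" that this note refers to is the standard n-out-of-n percentile bootstrap, sketched below for a generic statistic (the function name `bootstrap_ci` and the toy data are illustrative). It is exactly this procedure whose theoretical validity fails for the nonsmooth stage-1 parameters; Chap. 8 discusses modifications that restore validity.

```python
import random

def bootstrap_ci(sample, stat, b=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample the data with
    replacement b times, recompute the statistic each time, and take
    the alpha/2 and 1 - alpha/2 quantiles of the replicates."""
    rng = random.Random(seed)
    n = len(sample)
    reps = sorted(stat([rng.choice(sample) for _ in range(n)])
                  for _ in range(b))
    lo = reps[int((alpha / 2) * b)]
    hi = reps[int((1 - alpha / 2) * b) - 1]
    return lo, hi

mean = lambda xs: sum(xs) / len(xs)
sample = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0]
lo, hi = bootstrap_ci(sample, mean)  # interval for the mean
```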



Copyright information

© 2013 Springer Science+Business Media New York

Cite this chapter

Chakraborty, B., Moodie, E.E.M. (2013). Statistical Reinforcement Learning. In: Statistical Methods for Dynamic Treatment Regimes. Statistics for Biology and Health. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-7428-9_3
