Abstract
Constructing optimal dynamic treatment regimes for chronic disorders from patient data is a problem of multi-stage decision making about the best sequence of treatments. It bears a strong resemblance to reinforcement learning, a branch of machine learning that deals with multi-stage, sequential decision making by a learning agent. In this chapter, we review the necessary concepts of reinforcement learning, connect them to the relevant statistical literature, and develop a mathematical framework that enables a rigorous treatment of the problem of estimating optimal dynamic treatment regimes.
Notes
1. In some settings, there may only be a terminal reward for the entire sequence of agent-environment interactions.
2. In the case of a SMART, this policy consists of the randomization probabilities and is known by design; in an observational study, it can be estimated by the propensity score (see Sect. 3.5 for the definition).
3. The version of Q-learning used in this book is similar to the fitted Q-iteration algorithm in the RL literature; it adapts Watkins' classical Q-learning to batch data via function approximation.
4. Inference for stage-1 parameters in Q-learning is problematic due to an underlying lack of smoothness, so the usual bootstrap inference is not theoretically valid; we use it here for illustrative purposes only. Valid inference procedures are discussed in Chap. 8.
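The batch Q-learning of note 3 can be sketched as a two-stage backward recursion: fit a working model for the stage-2 Q-function, form a pseudo-outcome by maximizing it over the stage-2 treatment, then regress that pseudo-outcome at stage 1. The sketch below uses simulated data, linear working models, and hypothetical variable names (`o1`, `a1`, `o2`, `a2`, `y`); it illustrates the recursion, not the book's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated two-stage trial: covariates o_t, randomized treatments a_t in {-1, +1}
o1 = rng.normal(size=n)
a1 = rng.choice([-1.0, 1.0], size=n)
o2 = o1 + rng.normal(size=n)
a2 = rng.choice([-1.0, 1.0], size=n)

# Terminal reward observed after stage 2 (see note 1)
y = o2 + a2 * (0.5 + o2) + rng.normal(size=n)

def fit_ols(X, y):
    """Least-squares coefficients for a linear working Q-function model."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def q2(o, a, beta):
    """Fitted stage-2 Q-function under the working model (1, o, a, a*o)."""
    X = np.column_stack([np.ones_like(o), o, a, a * o])
    return X @ beta

# Stage 2: regress the reward on (1, o2, a2, a2*o2)
X2 = np.column_stack([np.ones(n), o2, a2, a2 * o2])
beta2 = fit_ols(X2, y)

# Pseudo-outcome: maximize the fitted stage-2 Q-function over a2
v2 = np.maximum(q2(o2, np.ones(n), beta2), q2(o2, -np.ones(n), beta2))

# Stage 1: regress the pseudo-outcome on (1, o1, a1, a1*o1)
X1 = np.column_stack([np.ones(n), o1, a1, a1 * o1])
beta1 = fit_ols(X1, v2)

# Estimated optimal rules: sign of the fitted treatment contrast at each stage
d2 = np.sign(beta2[2] + beta2[3] * o2)
d1 = np.sign(beta1[2] + beta1[3] * o1)
```

Under these working models, the decision rule at each stage reduces to the sign of the treatment-by-covariate contrast, which is why inference on the stage-1 coefficients (note 4) matters for the estimated regime.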
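The illustrative bootstrap of note 4 is the ordinary nonparametric bootstrap with percentile intervals. A minimal sketch, on simulated data with a hypothetical regression coefficient standing in for a stage-1 parameter, looks like this; note 4's caveat is precisely that this naive procedure is not theoretically valid at stage 1 of Q-learning.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)  # true slope is 2.0

def slope(x, y):
    """OLS slope from a simple linear regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Resample (x_i, y_i) pairs with replacement and refit each time
B = 1000
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b] = slope(x[idx], y[idx])

# 95% percentile interval from the bootstrap distribution
lo, hi = np.percentile(boot, [2.5, 97.5])
```

In regular settings this interval has the usual asymptotic coverage; the non-smoothness of the Q-learning stage-1 estimand breaks that guarantee, motivating the valid procedures of Chap. 8.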
Copyright information
© 2013 Springer Science+Business Media New York
Cite this chapter
Chakraborty, B., Moodie, E.E.M. (2013). Statistical Reinforcement Learning. In: Statistical Methods for Dynamic Treatment Regimes. Statistics for Biology and Health. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-7428-9_3
Print ISBN: 978-1-4614-7427-2
Online ISBN: 978-1-4614-7428-9
eBook Packages: Mathematics and Statistics (R0)