Statistical Reinforcement Learning

  • Chapter in:
Statistical Methods for Dynamic Treatment Regimes

Part of the book series: Statistics for Biology and Health (SBH)

Abstract

Constructing optimal dynamic treatment regimes for chronic disorders based on patient data is a problem of multi-stage decision making about the best sequence of treatments. This problem bears strong resemblance to the problem of reinforcement learning in computer science, a branch of machine learning that deals with the problem of multi-stage, sequential decision making by a learning agent. In this chapter, we review the necessary concepts of reinforcement learning, connect them to the relevant statistical literature, and develop a mathematical framework that will enable us to treat the problem of estimating the optimal dynamic treatment regimes rigorously.


Notes

  1.

    In some settings, there may only be a terminal reward for the entire sequence of agent-environment interactions.

  2.

    In the case of a SMART, this policy consists of the randomization probabilities and is known by design; in an observational study, it can be estimated by the propensity score (see Sect. 3.5 for the definition).
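A minimal sketch of the distinction in this note, with a hypothetical helper `propensity_by_stratum`: in an observational study the treatment-assignment probability must be estimated from the data (here by the crude stratum-specific treated fraction, standing in for the model-based propensity-score estimators of Sect. 3.5), whereas in a SMART it is fixed by design (e.g. 0.5 under balanced randomization).

```python
from collections import defaultdict

def propensity_by_stratum(data):
    """Estimate P(A = 1 | X = x) by the treated fraction within each
    covariate stratum.  `data` is a list of (x, a) pairs with a
    binary treatment indicator a in {0, 1}."""
    counts = defaultdict(lambda: [0, 0])  # x -> [n_treated, n_total]
    for x, a in data:
        counts[x][0] += a
        counts[x][1] += 1
    return {x: n_treated / n for x, (n_treated, n) in counts.items()}

# Observational data: assignment probability varies with the stratum x.
data = [(0, 1), (0, 0), (0, 1), (0, 1),
        (1, 0), (1, 0), (1, 1), (1, 0)]
ps = propensity_by_stratum(data)  # {0: 0.75, 1: 0.25}

# In a SMART the analogous quantity needs no estimation; it is the
# known randomization probability, e.g. 0.5 for every stratum.
```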

  3.

    The version of Q-learning we will be using in this book is similar to the fitted Q-iteration algorithm in the RL literature. This version is an adaptation of Watkins’ classical Q-learning to batch data, involving function approximation.
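The batch (fitted-Q) recursion behind this note can be sketched for a two-stage regime with discrete states and actions. The function name `qlearn_two_stage` is illustrative, and cell means stand in for the regression-based function approximation the book actually uses; the backward-induction structure (fit stage 2, form the pseudo-outcome, fit stage 1) is the same.

```python
from collections import defaultdict

def qlearn_two_stage(batch):
    """Tabular sketch of batch Q-learning for two decision stages.
    Each record in `batch` is (s1, a1, s2, a2, y), with discrete
    states, discrete actions, and a terminal outcome y."""
    # Stage 2: Q2(s2, a2) = mean outcome within each (s2, a2) cell.
    cell2 = defaultdict(list)
    for s1, a1, s2, a2, y in batch:
        cell2[(s2, a2)].append(y)
    q2 = {k: sum(v) / len(v) for k, v in cell2.items()}

    def v2(s2):
        # Value of following the optimal stage-2 rule from state s2.
        return max(q for (s, a), q in q2.items() if s == s2)

    # Stage 1: fit the pseudo-outcome max_a Q2(s2, a) within (s1, a1) cells.
    cell1 = defaultdict(list)
    for s1, a1, s2, a2, y in batch:
        cell1[(s1, a1)].append(v2(s2))
    q1 = {k: sum(v) / len(v) for k, v in cell1.items()}
    return q1, q2

batch = [(0, 0, 0, 0, 1.0), (0, 0, 0, 1, 3.0),
         (0, 1, 1, 0, 2.0), (0, 1, 1, 1, 0.0)]
q1, q2 = qlearn_two_stage(batch)
# The optimal stage-1 action in state 0 is argmax_a q1[(0, a)].
```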

  4.

    Inference for stage 1 parameters in Q-learning is problematic due to an underlying lack of smoothness, so usual bootstrap inference is not theoretically valid. Nevertheless, we use it here for illustrative purposes only. Valid inference procedures will be discussed in Chap. 8.
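The "usual bootstrap" that this note refers to is the standard n-out-of-n percentile bootstrap, sketched below for a generic statistic (the function name `bootstrap_ci` and the toy data are illustrative). It is exactly this procedure whose theoretical validity fails for the nonsmooth stage-1 parameters; Chap. 8 discusses modifications that restore validity.

```python
import random

def bootstrap_ci(sample, stat, b=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample the data with
    replacement b times, recompute the statistic each time, and take
    the alpha/2 and 1 - alpha/2 quantiles of the replicates."""
    rng = random.Random(seed)
    n = len(sample)
    reps = sorted(stat([rng.choice(sample) for _ in range(n)])
                  for _ in range(b))
    lo = reps[int((alpha / 2) * b)]
    hi = reps[int((1 - alpha / 2) * b) - 1]
    return lo, hi

mean = lambda xs: sum(xs) / len(xs)
sample = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0]
lo, hi = bootstrap_ci(sample, mean)  # interval for the mean
```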



Copyright information

© 2013 Springer Science+Business Media New York

Cite this chapter

Chakraborty, B., Moodie, E.E.M. (2013). Statistical Reinforcement Learning. In: Statistical Methods for Dynamic Treatment Regimes. Statistics for Biology and Health. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-7428-9_3
