Applications of Reinforcement Learning

  • Matthew F. Dixon
  • Igor Halperin
  • Paul Bilokon


This chapter considers real-world applications of reinforcement learning in finance, as well as further advances in the theory presented in the previous chapter. We start with one of the most common problems of quantitative finance: optimal portfolio trading in discrete time. Many practical problems of trading or risk management amount to different forms of dynamic portfolio optimization, with different optimization criteria, portfolio compositions, and constraints. The chapter introduces a reinforcement learning approach to option pricing that generalizes the classical Black–Scholes model to a data-driven approach using Q-learning. It then presents a probabilistic extension of Q-learning called G-learning and shows how it can be used for dynamic portfolio optimization. For certain specifications of reward functions, G-learning is semi-analytically tractable and amounts to a probabilistic version of linear quadratic regulators (LQR). We present detailed analyses of such cases and illustrate their solutions with examples from dynamic portfolio optimization and wealth management.
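The G-learning idea mentioned above replaces the hard max of the standard Q-learning target with an entropy-regularized ("free-energy") value computed under a reference policy, as in the soft-update formulation of Fox et al. (2015). The following is a minimal sketch of one such update on a toy tabular MDP; the state/action counts, uniform prior policy, and hyperparameters are illustrative assumptions, not the chapter's specific portfolio model.

```python
import numpy as np

# Illustrative sketch of a G-learning (soft Q-learning) update on a toy MDP.
# All sizes and hyperparameters below are assumptions for demonstration only.
n_states, n_actions = 5, 3
gamma, alpha, beta = 0.9, 0.1, 5.0   # discount, learning rate, inverse temperature

G = np.zeros((n_states, n_actions))          # G-function (soft action-value table)
prior = np.full(n_actions, 1.0 / n_actions)  # uniform reference (prior) policy

def soft_value(g_row):
    # Free-energy value: (1/beta) * log sum_a prior(a) * exp(beta * G(s,a)),
    # with the max subtracted for numerical stability.
    m = g_row.max()
    return m + np.log(np.sum(prior * np.exp(beta * (g_row - m)))) / beta

def g_update(s, a, r, s_next):
    # Soft Bellman target: the free-energy value replaces max_a' Q(s', a').
    target = r + gamma * soft_value(G[s_next])
    G[s, a] += alpha * (target - G[s, a])

# One illustrative transition.
g_update(s=0, a=1, r=1.0, s_next=2)
print(G[0, 1])  # → 0.1 (alpha times the TD error, since G started at zero)
```

As beta → ∞ the free-energy value tends to max over actions and standard Q-learning is recovered; as beta → 0 the learned policy is pulled toward the prior, which is what makes G-learning a probabilistic relaxation of Q-learning.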


  1. Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81(3), 637–654.
  2. Boyd, S., Busseti, E., Diamond, S., Kahn, R., Koh, K., Nystrup, P., et al. (2017). Multi-period trading via convex optimization. Foundations and Trends in Optimization, 1–74.
  3. Browne, S. (1996). Reaching goals by a deadline: digital options and continuous-time active portfolio management.
  4. Carr, P., Ellis, K., & Gupta, V. (1998). Static hedging of exotic options. Journal of Finance, 53(3), 1165–1190.
  5. Černý, A., & Kallsen, J. (2007). Hedging by sequential regression revisited. Working paper, City University London and TU München.
  6. Cheung, K. C., & Yang, H. (2007). Optimal investment-consumption strategy in a discrete-time model with regime switching. Discrete and Continuous Dynamical Systems, 8(2), 315–332.
  7. Das, S. R., Ostrov, D., Radhakrishnan, A., & Srivastav, D. (2018). Dynamic portfolio allocation in goals-based wealth management.
  8. Duan, J. C., & Simonato, J. G. (2001). American option pricing under GARCH by a Markov chain approximation. Journal of Economic Dynamics and Control, 25, 1689–1718.
  9. Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 503–556.
  10. Föllmer, H., & Schweizer, M. (1989). Hedging by sequential regression: An introduction to the mathematics of option trading. ASTIN Bulletin, 18, 147–160.
  11. Fox, R., Pakman, A., & Tishby, N. (2015). Taming the noise in reinforcement learning via soft updates. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI).
  12. Garleanu, N., & Pedersen, L. H. (2013). Dynamic trading with predictable returns and transaction costs. Journal of Finance, 68(6), 2309–2340.
  13. Gosavi, A. (2015). Finite horizon Markov control with one-step variance penalties. In Proceedings of the Allerton Conference, Allerton, IL.
  14. Grau, A. J. (2007). Applications of least-square regressions to pricing and hedging of financial derivatives. Ph.D. thesis, Technische Universität München.
  15. Halperin, I. (2018). QLBS: Q-learner in the Black-Scholes(-Merton) worlds. Journal of Derivatives (2020, to be published).
  16. Halperin, I. (2019). The QLBS Q-learner goes NuQLear: Fitted Q iteration, inverse RL, and option portfolios. Quantitative Finance, 19(9).
  17. Halperin, I., & Feldshteyn, I. (2018). Market self-learning of signals, impact and optimal trading: invisible hand inference with free energy (or, how we learned to stop worrying and love bounded rationality).
  18. Lin, C., Zeng, L., & Wu, H. (2019). Multi-period portfolio optimization in a defined contribution pension plan during the decumulation phase. Journal of Industrial and Management Optimization, 15(1), 401–427.
  19. Longstaff, F. A., & Schwartz, E. S. (2001). Valuing American options by simulation: a simple least-squares approach. The Review of Financial Studies, 14(1), 113–147.
  20. Markowitz, H. (1959). Portfolio selection: efficient diversification of investments. John Wiley.
  21. Marschinski, R., Rossi, P., Tavoni, M., & Cocco, F. (2007). Portfolio selection with probabilistic utility. Annals of Operations Research, 151(1), 223–239.
  22. Merton, R. C. (1971). Optimum consumption and portfolio rules in a continuous-time model. Journal of Economic Theory, 3(4), 373–413.
  23. Merton, R. C. (1973). Theory of rational option pricing. Bell Journal of Economics and Management Science, 4(1), 141–183.
  24. Murphy, S. A. (2005). A generalization error for Q-learning. Journal of Machine Learning Research, 6, 1073–1097.
  25. Ortega, P. A., & Lee, D. D. (2014). An adversarial interpretation of information-theoretic bounded rationality. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence.
  26. Petrelli, A., Balachandran, R., Siu, O., Chatterjee, R., Jun, Z., & Kapoor, V. (2010). Optimal dynamic hedging of equity options: residual-risks, transaction-costs. Working paper.
  27. Potters, M., Bouchaud, J., & Sestovic, D. (2001). Hedged Monte Carlo: low variance derivative pricing with objective probabilities. Physica A, 289, 517–525.
  28. Sato, Y. (2019). Model-free reinforcement learning for financial portfolios: a brief survey.
  29. Schweizer, M. (1995). Variance-optimal hedging in discrete time. Mathematics of Operations Research, 20, 1–32.
  30. Todorov, E., & Li, W. (2005). A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the American Control Conference, Portland, OR, USA, pp. 300–306.
  31. van Hasselt, H. (2010). Double Q-learning. In Advances in Neural Information Processing Systems.
  32. Watkins, C. J. (1989). Learning from delayed rewards. Ph.D. thesis, King's College, Cambridge, England.
  33. Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.
  34. Wilmott, P. (1998). Derivatives: the theory and practice of financial engineering. Wiley.

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Matthew F. Dixon, Department of Applied Mathematics, Illinois Institute of Technology, Chicago, USA
  • Igor Halperin, Tandon School of Engineering, New York University, Brooklyn, USA
  • Paul Bilokon, Department of Mathematics, Imperial College London, London, UK