Reinforcement Learning

  • Ke-Lin Du
  • M. N. S. Swamy


One of the primary goals of AI is to produce fully autonomous agents that learn optimal behaviors through trial and error by interacting with their environments. The reinforcement learning paradigm is, in essence, learning through interaction, and it has its roots in behaviorist psychology. Reinforcement learning is also influenced by optimal control, which is underpinned by the mathematical formalism of dynamic programming. This chapter deals with reinforcement learning.
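The trial-and-error interaction described above can be illustrated with tabular Q-learning, the classic model-free method rooted in the dynamic-programming formalism. The sketch below is illustrative only: the chain environment, its five states, and all learning parameters are assumptions for the example, not taken from the chapter.

```python
import random

def q_learning(n_states=5, n_actions=2, episodes=300,
               alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain task: action 1 moves
    right, action 0 moves left; reaching the last state pays reward 1."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap episode length
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)  # explore (trial and error)
            else:
                best = max(Q[s])  # exploit, breaking ties randomly
                a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])
            s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == goal else 0.0
            # Temporal-difference update toward the Bellman target
            # r + gamma * max_a' Q(s', a')
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
            if s == goal:
                break
    return Q

Q = q_learning()
# Greedy policy in each non-terminal state (1 = move right).
policy = [Q[s].index(max(Q[s])) for s in range(len(Q) - 1)]
```

After enough episodes the greedy policy moves right in every state, and the Q-value of the rewarded transition approaches the Bellman fixed point.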



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada
  2. Xonlink Inc., Hangzhou, China
