Deep Reinforcement Learning

  • Charu C. Aggarwal


“The reward of suffering is experience.”—Harry S. Truman


[10] D. Amodei et al. Concrete problems in AI safety. arXiv:1606.06565, 2016.
[19] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. arXiv:1611.02167, 2016.
[22] J. Baxter, A. Tridgell, and L. Weaver. KnightCap: a chess program that learns by combining TD(λ) with game-tree search. arXiv:cs/9901002, 1999.
[25] M. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, pp. 253–279, 2013.
[26] R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.
[46] M. Bojarski et al. End to end learning for self-driving cars. arXiv:1604.07316, 2016.
[47] M. Bojarski et al. Explaining how a deep neural network trained with end-to-end learning steers a car. arXiv:1704.07911, 2017.
[52] C. Browne et al. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), pp. 1–43, 2012.
[73] C. Clark and A. Storkey. Training deep convolutional neural networks to play Go. ICML Conference, pp. 1766–1774, 2015.
[135] S. Gelly et al. The grand challenge of computer Go: Monte Carlo tree search and extensions. Communications of the ACM, 55, pp. 106–113, 2012.
[142] P. Glynn. Likelihood ratio gradient estimation: an overview. Proceedings of the 1987 Winter Simulation Conference, pp. 366–375, 1987.
[162] I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, 42(6), pp. 1291–1307, 2012.
[165] X. Guo, S. Singh, H. Lee, R. Lewis, and X. Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. NIPS Conference, pp. 3338–3346, 2014.
[174] H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. AAAI Conference, 2016.
[187] N. Heess et al. Emergence of locomotion behaviours in rich environments. arXiv:1707.02286, 2017.
[230] S. Kakade. A natural policy gradient. NIPS Conference, pp. 1057–1063, 2002.
[246] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. ECML Conference, pp. 282–293, 2006.
[259] M. Lai. Giraffe: Using deep reinforcement learning to play chess. arXiv:1509.01549, 2015.
[286] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39), pp. 1–40, 2016.
[290] M. Lewis, D. Yarats, Y. Dauphin, D. Parikh, and D. Batra. Deal or no deal? End-to-end learning for negotiation dialogues. arXiv:1706.05125, 2017.
[291] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv:1606.01541, 2016.
[293] Y. Li. Deep reinforcement learning: An overview. arXiv:1701.07274, 2017.
[296] L.-J. Lin. Reinforcement learning for robots using neural networks. Technical Report, DTIC Document, 1993.
[307] C. Maddison, A. Huang, I. Sutskever, and D. Silver. Move evaluation in Go using deep convolutional neural networks. International Conference on Learning Representations, 2015.
[335] V. Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540), pp. 529–533, 2015.
[336] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv:1312.5602, 2013.
[337] V. Mnih et al. Asynchronous methods for deep reinforcement learning. ICML Conference, pp. 1928–1937, 2016.
[338] V. Mnih, N. Heess, and A. Graves. Recurrent models of visual attention. NIPS Conference, pp. 2204–2212, 2014.
[343] A. Moore and C. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1), pp. 103–130, 1993.
[346] M. Müller, M. Enzenberger, B. Arneson, and R. Segal. Fuego: an open-source framework for board games and Go engine based on Monte-Carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 2, pp. 259–270, 2010.
[349] K. S. Narendra and K. Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1), pp. 4–27, 1990.
[355] A. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. Uncertainty in Artificial Intelligence, pp. 406–415, 2000.
[374] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), pp. 682–697, 2008.
[381] D. Pomerleau. ALVINN, an autonomous land vehicle in a neural network. Technical Report, Carnegie Mellon University, 1989.
[412] G. Rummery and M. Niranjan. Online Q-learning using connectionist systems. Technical Report, Department of Engineering, University of Cambridge, 1994.
[421] A. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, pp. 210–229, 1959.
[424] W. Saunders, G. Sastry, A. Stuhlmueller, and O. Evans. Trial without error: Towards safe reinforcement learning via human intervention. arXiv:1707.05173, 2017.
[427] S. Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6), pp. 233–242, 1999.
[428] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. arXiv:1511.05952, 2015.
[432] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. ICML Conference, 2015.
[433] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. ICLR Conference, 2016.
[440] I. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. AAAI Conference, pp. 3776–3784, 2016.
[445] D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), pp. 484–489, 2016.
[446] D. Silver et al. Mastering the game of Go without human knowledge. Nature, 550(7676), pp. 354–359, 2017.
[447] D. Silver et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv:1712.01815, 2017.
[453] H. Simon. The Sciences of the Artificial. MIT Press, 1996.
[481] I. Sutskever and V. Nair. Mimicking Go experts with convolutional neural networks. International Conference on Artificial Neural Networks, pp. 101–110, 2008.
[482] R. Sutton. Learning to predict by the method of temporal differences. Machine Learning, 3, pp. 9–44, 1988.
[483] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[484] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. NIPS Conference, pp. 1057–1063, 2000.
[492] G. Tesauro. Practical issues in temporal difference learning. NIPS Conference, pp. 259–266, 1992.
[493] G. Tesauro. TD-Gammon: a self-teaching backgammon program. Applications of Neural Networks, Springer, pp. 267–285, 1992.
[494] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), pp. 58–68, 1995.
[496] S. Thrun. Learning to play the game of chess. NIPS Conference, pp. 1069–1076, 1995.
[498] Y. Tian, Q. Gong, W. Shang, Y. Wu, and L. Zitnick. ELF: An extensive, lightweight and flexible research platform for real-time strategy games. arXiv:1707.01067, 2017.
[508] O. Vinyals and Q. Le. A neural conversational model. arXiv:1506.05869, 2015.
[519] C. J. C. H. Watkins. Learning from delayed rewards. PhD Thesis, King's College, Cambridge, 1989.
[520] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3–4), pp. 279–292, 1992.
[533] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), pp. 229–256, 1992.
[540] K. Xu et al. Show, attend, and tell: Neural image caption generation with visual attention. ICML Conference, 2015.
[563] V. Zhong, C. Xiong, and R. Socher. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv:1709.00103, 2017.
[569] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv:1611.01578, 2016.
[608] googles-ai-won-the-game-go-by-defying-millennia-of-basic-human-instinct/

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Charu C. Aggarwal
    IBM T. J. Watson Research Center, Yorktown Heights, USA
