Deep Reinforcement Learning with Hidden Layers on Future States

  • Hirotaka KamekoEmail author
  • Jun Suzuki
  • Naoki Mizukami
  • Yoshimasa Tsuruoka
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 818)


Deep reinforcement learning algorithms such as Deep Q-Networks have successfully been used to construct a strong agent for Atari games by only performing direct evaluation of the current state and actions. This is in stark contrast to the algorithms for traditional board games such as Chess or Go, where a look-ahead search mechanism is indispensable to build a strong agent. In this paper, we present a novel deep reinforcement learning architecture that can both effectively and efficiently use information on future states in video games. First, we demonstrate that such information is indeed quite useful in deep reinforcement learning by using exact state transition information obtained from the emulator. We then propose a method that predicts future states using Long Short Term Memory (LSTM), such that the agent can look ahead without the emulator. In this work, we applied our method to the asynchronous advantage actor-critic (A3C) architecture. The experimental results show that our proposed method with predicted future states substantially outperforms the vanilla A3C in several Atari games.


  1. 1.
    Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013)Google Scholar
  2. 2.
    Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., Munos, R.: Unifying count-based exploration and intrinsic motivation. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 1471–1479. Curran Associates, Inc. (2016)Google Scholar
  3. 3.
    Bellemare, M., Veness, J., Bowling, M.: Investigating contingency awareness using Atari 2600 games. In: AAAI Conference on Artificial Intelligence, pp. 864–871 (2012)Google Scholar
  4. 4.
    Guo, X., Singh, S., Lewis, R., Lee, H.: Deep learning for reward design to improve Monte Carlo tree search in Atari games. In: Proceedings of 25th International Joint Conference on Artificial Intelligence, pp. 1519–1525 (2016)Google Scholar
  5. 5.
    Hausknecht, M., Stone, P.: Deep recurrent Q-learning for partially observable MDPs. In: 2015 AAAI Fall Symposium Series, pp. 29–37 (2015)Google Scholar
  6. 6.
    Jaderberg, M., Mnih, V., Czarnecki, W.M., Schaul, T., Leibo, J.Z., Silver, D., Kavukcuoglu, K.: Reinforcement learning with unsupervised auxiliary tasks. CoRR abs/1611.05397 (2016).
  7. 7.
    Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 282–293. Springer, Heidelberg (2006). CrossRefGoogle Scholar
  8. 8.
    Lin, L.J.: Reinforcement learning for robots using neural networks. Ph.D. thesis, Carnegie Mellon University (1992)Google Scholar
  9. 9.
    Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T.P., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1928–1937 (2016)Google Scholar
  10. 10.
    Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. In: NIPS Deep Learning Workshop (2013)Google Scholar
  11. 11.
    Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)CrossRefGoogle Scholar
  12. 12.
    Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., et al.: Massively parallel methods for deep reinforcement learning. In: ICML Deep Learning Workshop (2015)Google Scholar
  13. 13.
    Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-2010), pp. 807–814 (2010)Google Scholar
  14. 14.
    Osband, I., Blundell, C., Pritzel, A., Van Roy, B.: Deep exploration via bootstrapped DQN. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 4026–4034. Curran Associates, Inc. (2016)Google Scholar
  15. 15.
    Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 1889–1897 (2015)Google Scholar
  16. 16.
    Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)CrossRefGoogle Scholar
  17. 17.
    Sutton, R.S.: Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: Proceedings of the 7th International Conference on Machine Learning, pp. 216–224 (1990)Google Scholar
  18. 18.
    Sutton, R.S.: Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bull. 2(4), 160–163 (1991)CrossRefGoogle Scholar
  19. 19.
    Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, X., Duan, Y., Schulman, J., De Turck, F., Abbeel, P.: #Exploration: a study of count-based exploration for deep reinforcement learning. In: NIPS Deep Reinforcement Learning Workshop (2016)Google Scholar
  20. 20.
    Tieleman, T., Hinton, G.: Lecture 6e RMSprop: divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning (2012)Google Scholar
  21. 21.
    Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source framework for deep learning. In: Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-Ninth Annual Conference on Neural Information Processing Systems (NIPS) (2015)Google Scholar
  22. 22.
    Van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. In: AAAI Conference on Artificial Intelligence, pp. 2094–2100 (2016)Google Scholar
  23. 23.
    Wang, Z., de Freitas, N., Lanctot, M.: Dueling network architectures for deep reinforcement learning. CoRR abs/1511.06581 (2015).
  24. 24.
    Watkins, C.J.C.H.: Learning from delayed rewards. Ph.D. thesis, University of Cambridge, England (1989)Google Scholar
  25. 25.
    Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). Google Scholar

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Hirotaka Kameko
    • 1
    Email author
  • Jun Suzuki
    • 2
    • 3
  • Naoki Mizukami
    • 1
  • Yoshimasa Tsuruoka
    • 1
  1. 1.Graduate School of EngineeringThe University of TokyoTokyoJapan
  2. 2.NTT Communication Science LaboratoriesNTT CorporationTokyoJapan
  3. 3.RIKEN Center for Advanced Intelligence ProjectTokyoJapan

Personalised recommendations