Robust Reinforcement Learning with a Stochastic Value Function

  • Reiji Hatsugai
  • Mary Inaba
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10710)


The field of reinforcement learning has been significantly advanced by the application of deep learning. The Deep Deterministic Policy Gradient (DDPG), an actor-critic method for continuous control, can derive satisfactory policies using a deep neural network. However, in common with other deep neural network methods, DDPG requires a large number of training samples and careful hyperparameter tuning.

In this paper, we propose a Stochastic Value Function (SVF) that treats a value function such as the Q function as a stochastic variable that can be sampled from \(N(\mu_{Q}, \sigma_{Q})\). To learn appropriate value functions, we use Bayesian regression with KL divergence in place of simple regression with squared errors. We demonstrate that the technique used in Trust Region Policy Optimization (TRPO) can provide efficient learning. We implemented DDPG with SVF (DDPG-SVF) and confirmed (1) that DDPG-SVF converged well, with high sampling efficiency, (2) that DDPG-SVF obtained good results while requiring less hyperparameter tuning, and (3) that the TRPO technique offers an effective way of addressing the hyperparameter tuning problem.
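
The two ingredients named above, sampling a Q value from a Gaussian and measuring regression error with a KL divergence, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `sample_q` and `gaussian_kl` are hypothetical, \(\sigma_{Q}\) is assumed to be a standard deviation, and the direction of the KL divergence and the choice of target distribution are assumptions that the abstract does not specify.

```python
import math
import random


def sample_q(mu_q, sigma_q, rng=random):
    """Draw one Q-value sample from N(mu_q, sigma_q).

    Assumption: sigma_q is a standard deviation. The sample is written as
    mu + sigma * eps with eps ~ N(0, 1), i.e. the reparameterized form.
    """
    eps = rng.gauss(0.0, 1.0)
    return mu_q + sigma_q * eps


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2)) for scalars.

    Both standard deviations must be strictly positive.
    """
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
        - 0.5
    )


if __name__ == "__main__":
    # Hypothetical numbers: a target distribution and a critic prediction.
    target_mu, target_sigma = 1.0, 0.5
    pred_mu, pred_sigma = 0.8, 0.7
    print("sampled Q:", sample_q(pred_mu, pred_sigma))
    print("KL loss:", gaussian_kl(target_mu, target_sigma, pred_mu, pred_sigma))
```

Unlike a squared error, which compares only the means, the KL term also penalizes a mismatch between the predicted and target spreads, and it is zero only when both the mean and the standard deviation agree.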



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. The University of Tokyo, Tokyo, Japan
