Efficient Sample Reuse in EM-Based Policy Search

  • Hirotaka Hachiya
  • Jan Peters
  • Masashi Sugiyama
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5781)


Direct policy search is a promising reinforcement learning framework, in particular for control in continuous, high-dimensional systems such as anthropomorphic robots. Because of its high flexibility, policy search often requires a large number of samples to obtain a stable estimator of the policy update. This is prohibitive when sampling is expensive. In this paper, we extend an EM-based policy search method so that previously collected samples can be efficiently reused. The usefulness of the proposed method, called Reward-weighted Regression with sample Reuse (R3), is demonstrated through a robot learning experiment.
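To make the idea of sample reuse concrete, the following is a minimal sketch of one importance-weighted, reward-weighted regression step. It is not the R3 algorithm itself, whose details are not given in this abstract. The setting is an assumption chosen for brevity: single-step episodes, a scalar state `s` and action `a`, a linear-Gaussian policy a ~ N(theta * s, sigma^2), and non-negative rewards. All function names and the toy reward in the demo are hypothetical. Samples collected under an older parameter `theta_behavior` are reweighted by the density ratio between the current policy and the policy that generated them.

```python
import math
import random


def gaussian_pdf(a, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at a."""
    z = (a - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def importance_weight(s, a, theta_target, theta_behavior, sigma):
    """Ratio pi_target(a|s) / pi_behavior(a|s) for the assumed linear-Gaussian policy."""
    return (gaussian_pdf(a, theta_target * s, sigma)
            / gaussian_pdf(a, theta_behavior * s, sigma))


def rwr_update(samples, theta_current, sigma):
    """One reward-weighted regression step with importance-weighted samples.

    samples: list of (s, a, r, theta_behavior) tuples with r >= 0, where
    theta_behavior is the parameter of the policy that generated the sample.
    Returns the maximizer of sum_i w_i * r_i * log pi_theta(a_i | s_i),
    which for a linear mean is a weighted least-squares solution.
    """
    num = 0.0
    den = 0.0
    for s, a, r, theta_b in samples:
        w = importance_weight(s, a, theta_current, theta_b, sigma)
        num += w * r * s * a
        den += w * r * s * s
    # Keep the current parameter if the weighted data carry no information.
    return num / den if den > 0.0 else theta_current


if __name__ == "__main__":
    random.seed(0)
    sigma = 1.0
    theta = 0.5
    # Collect one batch under the initial policy and reuse it in every update.
    batch = []
    for _ in range(500):
        s = random.uniform(-1.0, 1.0)
        a = random.gauss(theta * s, sigma)
        r = math.exp(-(a - 2.0 * s) ** 2)  # toy reward, largest when a = 2 * s
        batch.append((s, a, r, theta))
    for _ in range(5):
        theta = rwr_update(batch, theta, sigma)
        print(theta)
```

When every sample was generated by the current policy, all weights equal one and the step reduces to ordinary reward-weighted regression. Plain importance weights can have high variance when the current policy has moved far from the one that generated the data, so this sketch should be read only as an illustration of the reweighting idea.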


Reinforcement Learning, Importance Sampling, Importance Weight, Reward Function, Neural Information Processing System
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Hirotaka Hachiya (1)
  • Jan Peters (2)
  • Masashi Sugiyama (1)
  1. Tokyo Institute of Technology, Tokyo, Japan
  2. Max-Planck Institute for Biological Cybernetics, Tübingen, Germany
