
Artificial Intelligence Review, Volume 52, Issue 3, pp 2039–2059

Importance sampling policy gradient algorithms in reproducing kernel Hilbert space

  • Tuyen Pham Le
  • Vien Anh Ngo
  • P. Marlith Jaramillo
  • TaeChoong ChungEmail author
Article

Abstract

Modeling policies in a reproducing kernel Hilbert space (RKHS) offers a flexible and powerful family of policy gradient algorithms, called RKHS policy gradient algorithms, which are designed to optimize over spaces of very high- or infinite-dimensional policies. However, they are known to suffer from high estimation variance: the current policy is updated with a functional gradient that does not exploit the old episodes sampled by previous policies. In this paper, we introduce a generalized RKHS policy gradient algorithm that integrates three key ideas: (i) policy modeling in RKHS; (ii) normalized importance sampling, which reduces the estimation variance by reusing previously sampled episodes in a principled way; and (iii) regularization terms, which prevent the updated policy from over-fitting to the sampled data. In the experiment section, we analyze the proposed algorithm on benchmark domains. The results show that the proposed algorithm retains the powerful policy modeling of RKHS while achieving greater data efficiency.
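As a rough illustration of ideas (i) and (ii), the following Python sketch shows a policy mean represented as an RKHS kernel expansion and a normalized importance sampling estimate that reuses returns from episodes generated by earlier policies. The kernel choice, function names, and estimator below are illustrative assumptions for exposition, not the paper's exact algorithm.

# Illustrative sketch only (names and estimator are assumptions, not the
# paper's exact algorithm): a policy mean represented as an RKHS expansion
# h(s) = sum_i alpha_i * k(c_i, s), and a normalized importance sampling
# estimate that reuses returns of episodes drawn under earlier policies.
import numpy as np

def rbf(s1, s2, bandwidth=1.0):
    # Gaussian (RBF) kernel between two state vectors.
    d = np.asarray(s1, dtype=float) - np.asarray(s2, dtype=float)
    return float(np.exp(-np.dot(d, d) / (2.0 * bandwidth ** 2)))

def rkhs_mean(centers, alphas, s):
    # Non-parametric policy mean: a weighted sum of kernels placed on
    # previously visited states (the "centers").
    return sum(a * rbf(c, s) for c, a in zip(centers, alphas))

def normalized_is_weights(log_p_new, log_p_old):
    # w_m proportional to p_new(episode_m) / p_old(episode_m), normalized to
    # sum to one; normalization trades a small bias for much lower variance.
    w = np.exp(np.asarray(log_p_new) - np.asarray(log_p_old))
    return w / w.sum()

def off_policy_return_estimate(returns, log_p_new, log_p_old):
    # Expected return under the current policy, estimated from all stored
    # episodes rather than only the most recent batch.
    w = normalized_is_weights(log_p_new, log_p_old)
    return float(np.dot(w, returns))

In the setting described in the abstract, the same normalized weights would also multiply the per-episode terms of the functional gradient before a regularized update of the expansion coefficients, so that the policy is improved using all stored episodes without over-fitting to them.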

Keywords

Reproducing kernel Hilbert space · Policy search · Reinforcement learning · Importance sampling · Policy gradient · Non-parametric

Notes

Acknowledgements

The authors gratefully acknowledge support from the Basic Science Research Program through the National Research Foundation of Korea (NRF-2017R1D1A1B04036354) and from Kyung Hee University (KHU-20160601).

Funding

This study was funded by the National Research Foundation of Korea (NRF-2017R1D1A1B04036354) and Kyung Hee University (KHU-20160601).

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent is not applicable, as this article does not contain any studies with human participants.


Copyright information

© Springer Science+Business Media B.V. 2017

Authors and Affiliations

  • Tuyen Pham Le (1)
  • Vien Anh Ngo (2)
  • P. Marlith Jaramillo (1)
  • TaeChoong Chung (1), corresponding author

  1. Artificial Intelligence Lab, Computer Science and Engineering Department, Kyung Hee University, Yongin, South Korea
  2. EEECS/ECIT, Queen's University Belfast, Belfast, UK
