Importance sampling policy gradient algorithms in reproducing kernel Hilbert space
Modeling policies in a reproducing kernel Hilbert space (RKHS) yields a flexible and powerful new family of policy gradient algorithms, called RKHS policy gradient algorithms, which are designed to optimize over spaces of very high- or infinite-dimensional policies. However, these algorithms are known to suffer from high variance: each update of the current policy relies on a functional gradient that does not exploit the old episodes sampled by previous policies. In this paper, we introduce a generalized RKHS policy gradient algorithm that integrates three key ideas: (i) policy modeling in an RKHS; (ii) normalized importance sampling, which reduces the estimation variance by reusing previously sampled episodes in a principled way; and (iii) regularization terms, which prevent the updated policy from over-fitting to the sampled data. In the experiments section, we analyze the proposed algorithm on benchmark domains. The experimental results show that the proposed algorithm retains the powerful policy modeling of the RKHS setting while achieving greater data efficiency.
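To make the three ingredients concrete, here is a minimal, self-contained sketch (not the paper's actual algorithm or code; the class and function names, the toy one-step task, and all hyper-parameters are illustrative assumptions): a Gaussian policy whose mean function lives in an RKHS as a kernel expansion, a self-normalized importance-sampling estimate of the functional policy gradient over stored episodes, and an RKHS-norm shrinkage term as the regularizer.

```python
import numpy as np

# Illustrative sketch only: all names, the toy one-step task, and the
# hyper-parameters below are assumptions, not the paper's code.

def rbf(s1, s2, bw=0.5):
    """Gaussian (RBF) kernel on scalar states."""
    return np.exp(-(s1 - s2) ** 2 / (2.0 * bw ** 2))

class RKHSGaussianPolicy:
    """Gaussian policy whose mean lives in an RKHS: h(s) = sum_i alpha_i k(c_i, s)."""
    def __init__(self, sigma=0.5):
        self.centers, self.alphas = [], []   # kernel expansion of the mean function
        self.sigma = sigma                   # fixed exploration noise

    def mean(self, s):
        return sum(a * rbf(c, s) for a, c in zip(self.alphas, self.centers))

    def sample(self, s, rng):
        return self.mean(s) + self.sigma * rng.normal()

    def logpdf(self, s, a):
        z = (a - self.mean(s)) / self.sigma
        return -0.5 * z ** 2 - np.log(self.sigma * np.sqrt(2.0 * np.pi))

def nis_functional_gradient(policy, episodes):
    """Self-normalized importance-sampled functional gradient for one-step
    episodes (s, a, r, logp_old); returns new kernel atoms (center, coeff)."""
    # importance weight of each stored episode under the *current* policy
    logw = np.array([policy.logpdf(s, a) - lp for (s, a, r, lp) in episodes])
    w = np.exp(logw - logw.max())
    w /= w.sum()                              # self-normalization reduces variance
    atoms = []
    for wi, (s, a, r, _) in zip(w, episodes):
        # grad_h log pi(a|s) = (a - h(s)) / sigma^2 * k(s, .) -> one atom per episode
        atoms.append((s, wi * r * (a - policy.mean(s)) / policy.sigma ** 2))
    return atoms

rng = np.random.default_rng(0)
policy, episodes = RKHSGaussianPolicy(), []
lr, lam, window = 0.5, 0.05, 25               # step size, regularizer, replay window

for it in range(100):
    s = rng.uniform(-1.0, 1.0)                # toy task: reward peaks at a = sin(pi s)
    a = policy.sample(s, rng)
    r = float(np.exp(-(a - np.sin(np.pi * s)) ** 2))
    episodes.append((s, a, r, policy.logpdf(s, a)))

    # shrink old coefficients: gradient step on the RKHS-norm penalty -lam/2 ||h||^2
    policy.alphas = [(1.0 - lr * lam) * al for al in policy.alphas]
    # functional gradient ascent step, reusing the last `window` stored episodes
    for center, coeff in nis_functional_gradient(policy, episodes[-window:]):
        policy.centers.append(center)
        policy.alphas.append(lr * coeff)
```

In the full episodic setting the importance weight of a trajectory would be the product of per-step action-probability ratios along that trajectory; the one-step case above keeps the sketch short.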
Keywords: Reproducing kernel Hilbert space · Policy search · Reinforcement learning · Importance sampling · Policy gradient · Non-parametric
The authors gratefully acknowledge the support of the Basic Science Research Program through the National Research Foundation of Korea (NRF-2017R1D1A1B04036354) and of Kyung Hee University (KHU-20160601).
Funding: This study was funded by the National Research Foundation of Korea (NRF-2017R1D1A1B04036354) and Kyung Hee University (KHU-20160601).
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.
This article does not contain any studies with human participants or animals performed by any of the authors.