Abstract
This paper investigates a novel model-free reinforcement learning architecture, the Natural Actor-Critic. The actor updates are based on stochastic policy gradients employing Amari’s natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression. We show that actor improvements with natural policy gradients are particularly appealing as these are independent of the coordinate frame of the chosen policy representation and can be estimated more efficiently than regular policy gradients. The critic makes use of a special basis function parameterization motivated by the policy-gradient compatible function approximation. We show that several well-known reinforcement learning methods, such as the original Actor-Critic and Bradtke’s Linear Quadratic Q-Learning, are in fact Natural Actor-Critic algorithms. Empirical evaluations illustrate the effectiveness of our techniques in comparison to previous methods, and also demonstrate their applicability for learning control on an anthropomorphic robot arm.
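As a rough illustration of this architecture, the following minimal Python sketch implements a simplified variant: the critic solves a least-squares temporal-difference system over the compatible features (the gradient of the log-policy) stacked with ordinary value-function basis functions, and the actor then steps along the resulting natural-gradient estimate. The one-dimensional linear dynamics, Gaussian policy, basis functions, and step sizes below are illustrative assumptions, not the paper's experimental setup; the helper names `lqr_step`, `grad_log_pi`, and `value_features` are hypothetical.

```python
# Minimal sketch of a Natural Actor-Critic loop, assuming a Gaussian policy
# pi(a|s) = N(theta * s, sigma^2) on a toy 1-D linear system. All dynamics,
# features, and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5    # fixed exploration noise of the Gaussian policy
gamma = 0.95   # discount factor
theta = 0.0    # policy parameter: a ~ N(theta * s, sigma^2)

def lqr_step(s, a):
    """Toy linear dynamics with quadratic cost (reward = -cost)."""
    s_next = 0.9 * s + a + 0.01 * rng.standard_normal()
    reward = -(s ** 2 + 0.1 * a ** 2)
    return s_next, reward

def grad_log_pi(s, a):
    """Compatible feature: gradient of log pi(a|s) w.r.t. theta."""
    return np.array([(a - theta * s) * s / sigma ** 2])

def value_features(s):
    """Basis functions for the additional value-function parameters v."""
    return np.array([s ** 2, 1.0])

for iteration in range(100):
    # Critic: accumulate a least-squares TD system for [w; v], where
    # Q(s, a) is approximated by w^T grad_log_pi(s, a) + v^T phi(s).
    A = np.zeros((3, 3))
    b = np.zeros(3)
    s = rng.standard_normal()
    for t in range(200):
        a = theta * s + sigma * rng.standard_normal()
        s_next, r = lqr_step(s, a)
        phi = np.concatenate([grad_log_pi(s, a), value_features(s)])
        phi_next = np.concatenate([[0.0], value_features(s_next)])
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
        s = s_next
    w_v = np.linalg.solve(A + 1e-6 * np.eye(3), b)
    # Actor: the first block of the solution is the natural-gradient
    # estimate w; step the policy parameter along it.
    theta += 0.05 * w_v[0]

print("learned feedback gain theta:", theta)
```

The key point the sketch tries to convey is that the natural gradient falls out of the critic's regression for free: once the advantage function is represented in the compatible basis, its weights w are precisely the natural-gradient direction, so the actor update reduces to a step along w.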
References
Amari, S.: Natural gradient works efficiently in learning. Neural Computation 10, 251–276 (1998)
Bagnell, J., Schneider, J.: Covariant policy search. In: International Joint Conference on Artificial Intelligence (2003)
Baird, L.C.: Advantage Updating. Wright Lab. Tech. Rep. WL-TR-93-1146 (1993)
Baird, L.C., Moore, A.W.: Gradient descent for general reinforcement learning. In: Advances in Neural Information Processing Systems 11 (1999)
Bartlett, P.: An introduction to reinforcement learning theory: Value function methods. In: Mendelson, S., Smola, A.J. (eds.) Advanced Lectures on Machine Learning. LNCS (LNAI), vol. 2600, pp. 184–202. Springer, Heidelberg (2003)
Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific, Belmont (1996)
Boyan, J.: Least-squares temporal difference learning. In: Machine Learning: Proceedings of the Sixteenth International Conference, pp. 49–56 (1999)
Bradtke, S., Ydstie, E., Barto, A.G.: Adaptive Linear Quadratic Control Using Policy Iteration. University of Massachusetts, Amherst, MA (1994)
Ijspeert, A., Nakanishi, J., Schaal, S.: Learning rhythmic movements by demonstration using nonlinear oscillators. In: IEEE International Conference on Intelligent Robots and Systems (IROS 2002), pp. 958–963 (2002)
Kakade, S.: A natural policy gradient. In: Advances in Neural Information Processing Systems 14 (2002)
Konda, V., Tsitsiklis, J.: Actor-critic algorithms. In: Advances in Neural Information Processing Systems 12 (2000)
Moon, T., Stirling, W.: Mathematical Methods and Algorithms for Signal Processing. Prentice Hall, Englewood Cliffs (2000)
Peters, J., Vijayakumar, S., Schaal, S.: Reinforcement learning for humanoid robotics. In: IEEE International Conference on Humanoid Robots (2003)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems 12 (2000)
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Peters, J., Vijayakumar, S., Schaal, S. (2005). Natural Actor-Critic. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds) Machine Learning: ECML 2005. Lecture Notes in Computer Science, vol. 3720. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564096_29
Print ISBN: 978-3-540-29243-2
Online ISBN: 978-3-540-31692-3