Abstract
Although tabular reinforcement learning (RL) methods have been proven to converge to an optimal policy, combining particular conventional RL techniques with function approximators can lead to divergence. In this paper we show why off-policy RL methods combined with linear function approximators can diverge. Furthermore, we analyze two different types of updates: standard and averaging RL updates. Although averaging RL methods cannot diverge, we show that they can converge to incorrect value functions. In our experiments we compare standard to averaging value iteration (VI) with CMACs; the results show that for small values of the discount factor averaging VI works better, whereas for large values of the discount factor standard VI performs better, although it does not always converge.
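To make the two kinds of updates concrete, below is a minimal Python sketch; it is not the paper's CMAC setup, and the MDPs, function names, and step sizes are illustrative assumptions. The standard update takes a semi-gradient step toward the bootstrapped target and can diverge on the well-known two-state linear-approximation counterexample of Tsitsiklis and Van Roy; the averaging update pulls each weight toward the target by a convex combination, so it stays bounded, but with aggregated features it settles on a value that is wrong for every individual state.

```python
import numpy as np

# --- Standard (extrapolating) update can diverge ---------------------------
# Two-state counterexample (a stand-in, not the paper's experiment):
# V(s1) = w, V(s2) = 2w, zero rewards, transitions s1 -> s2 and s2 -> s2.
# The true value function is w = 0.
def standard_vi(gamma=0.95, alpha=0.1, sweeps=100):
    phi = np.array([1.0, 2.0])   # linear feature of each state
    nxt = np.array([1, 1])       # both states transition to s2
    w = 1.0                      # initial (wrong) parameter
    for _ in range(sweeps):
        for s in range(2):
            target = gamma * phi[nxt[s]] * w      # r = 0 everywhere
            td_err = target - phi[s] * w
            w += alpha * phi[s] * td_err          # semi-gradient step
    # For small alpha the per-sweep multiplier is roughly 1 + alpha*(6*gamma - 5),
    # so w diverges whenever gamma > 5/6 in this example.
    return w

# --- Averaging update cannot diverge, but can be wrong ---------------------
# Two terminal states share one aggregated weight: r(A) = 0, r(B) = 1,
# so the true values are V(A) = 0 and V(B) = 1.
def averaging_vi(alpha=0.1, sweeps=200):
    rewards = np.array([0.0, 1.0])
    w = 0.0                      # one weight covers both states
    for _ in range(sweeps):
        for s in range(2):
            w += alpha * (rewards[s] - w)   # convex step toward the target
    return w

print("standard VI, gamma=0.95:", standard_vi())   # |w| grows without bound
print("averaging VI:", averaging_vi())             # ~0.5: bounded but wrong
```

Running the sketch, the standard update's parameter grows without bound at a discount factor of 0.95, while the averaging update oscillates near 0.5, which is far from the true values V(A) = 0 and V(B) = 1 of both aggregated states.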
Keywords
- Optimal Policy
- Discount Factor
- Markov Decision Process
- Policy Iteration
- Reinforcement Learning Algorithm
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
Cite this paper
Wiering, M.A. (2004). Convergence and Divergence in Standard and Averaging Reinforcement Learning. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) Machine Learning: ECML 2004. Lecture Notes in Computer Science, vol. 3201. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30115-8_44
DOI: https://doi.org/10.1007/978-3-540-30115-8_44
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23105-9
Online ISBN: 978-3-540-30115-8