Abstract
Although direct, model-free reinforcement learning often outperforms model-based approaches in practice, to date only the latter enjoy theoretical guarantees of finite-sample convergence. A major difficulty in analyzing the direct approach in an online setting is the absence of a definitive exploration strategy. We extend the notion of admissibility to direct reinforcement learning and show that standard Q-learning with optimistic initial values and a constant learning rate is admissible. Admissibility justifies a purely greedy strategy that, we argue, performs well in practice, and it provides a theoretical foothold for deriving finite-sample convergence results for direct reinforcement learning. We present empirical evidence supporting this idea.
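To make the setting concrete, the following is a minimal sketch of the algorithm the abstract describes: tabular Q-learning with optimistic initial values, a constant learning rate, and purely greedy action selection. The 3-state chain MDP, the `chain_step` function, and all parameter values (`q_init=10.0`, `alpha=0.1`, and so on) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def greedy_q_learning(n_states, n_actions, step, alpha=0.1, gamma=0.9,
                      q_init=10.0, episodes=200, horizon=50, seed=0):
    """Tabular Q-learning: optimistic init, constant alpha, greedy policy."""
    rng = np.random.default_rng(seed)
    # Optimistic initialization: every entry starts above any attainable
    # return, so untried actions look best and the greedy policy is
    # driven to explore them without any explicit randomization.
    Q = np.full((n_states, n_actions), q_init)
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            a = int(np.argmax(Q[s]))            # greedy: no epsilon, no decay
            s_next, r, done = step(s, a, rng)
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])  # constant learning rate
            if done:
                break
            s = s_next
    return Q

# Hypothetical 3-state chain: action 1 moves right, action 0 resets to the
# start; reaching the last state pays +1 and ends the episode.
def chain_step(s, a, rng):
    if a == 1:
        s_next = s + 1
        if s_next == 2:
            return s_next, 1.0, True
        return s_next, 0.0, False
    return 0, 0.0, False

Q = greedy_q_learning(n_states=3, n_actions=2, step=chain_step)
print(Q)  # the greedy policy should come to prefer action 1 in states 0 and 1
```

In this sketch the optimism does the exploring: once a greedy action is tried, its overestimated value is pulled down toward the bootstrapped target, so the still-optimistic untried actions get their turn, which is the intuition behind analyzing the greedy strategy directly.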
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Lim, S.H., DeJong, G. (2005). Towards Finite-Sample Convergence of Direct Reinforcement Learning. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds) Machine Learning: ECML 2005. Lecture Notes in Computer Science, vol 3720. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564096_25