Abstract
In this paper, we analyze the convergence of Q-learning with linear function approximation. We identify a set of conditions under which the method converges with probability 1 when a fixed learning policy is used. We discuss the differences and similarities between our results and those obtained in several related works, and we consider how the method behaves when a changing policy is used. Finally, we describe the applicability of this approximate method in partially observable scenarios.
This work was partially supported by the Programa Operacional Sociedade do Conhecimento (POS_C), which includes FEDER funds.
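For context, the method analyzed here applies the standard Q-learning update to a linear parameterization Q(s, a) ≈ θ·φ(s, a), adjusting the weight vector θ along the temporal-difference error. The following Python sketch is a minimal illustration of that standard update; the one-hot feature map, step size alpha, and discount gamma are placeholder choices for the example only and do not encode the convergence conditions established in the paper.

    import numpy as np

    def q_update(theta, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        # Q-learning TD error: greedy (max) backup over next actions.
        q_sa = theta @ phi(s, a)
        q_next = max(theta @ phi(s_next, b) for b in actions)
        td_error = r + gamma * q_next - q_sa
        # Adjust the weights along the feature direction of the visited pair.
        return theta + alpha * td_error * phi(s, a)

    # Toy usage: 3 states, 2 actions, one-hot features over (state, action)
    # pairs, for which the update reduces to tabular Q-learning.
    n_states, n_actions = 3, 2

    def phi(s, a):
        f = np.zeros(n_states * n_actions)
        f[s * n_actions + a] = 1.0
        return f

    theta = np.zeros(n_states * n_actions)
    theta = q_update(theta, phi, s=0, a=1, r=1.0, s_next=2,
                     actions=range(n_actions))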
Copyright information
© 2007 Springer Berlin Heidelberg
Cite this paper
Melo, F.S., Ribeiro, M.I. (2007). Q-Learning with Linear Function Approximation. In: Bshouty, N.H., Gentile, C. (eds.) Learning Theory. COLT 2007. Lecture Notes in Computer Science, vol. 4539. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72927-3_23
DOI: https://doi.org/10.1007/978-3-540-72927-3_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72925-9
Online ISBN: 978-3-540-72927-3