Multiscale Q-learning with linear function approximation
- 495 Downloads
We present in this article a two-timescale variant of Q-learning with linear function approximation. Both Q-values and policies are assumed to be parameterized with the policy parameter updated on a faster timescale as compared to the Q-value parameter. This timescale separation is seen to result in significantly improved numerical performance of the proposed algorithm over Q-learning. We show that the proposed algorithm converges almost surely to a closed connected internally chain transitive invariant set of an associated differential inclusion.
KeywordsQ-learning with linear function approximation Reinforcement learning Stochastic approximation Ordinary differential equation Differential inclusion Multi-stage Stochastic shortest path problem
The authors thank the Editor Prof. C. G. Cassandras, the Associate Editor, and all the anonymous reviewers for their detailed comments and criticisms on the various drafts of this paper, that led to several corrections in the proof and presentation. In particular, the authors gratefully thank the reviewer who suggested that they follow a differential inclusions based approach for the slower scale dynamics. The authors thank Prof. V. S. Borkar for helpful discussions. This work was partially supported through projects from the Department of Science and Technology (Government of India), Xerox Corporation (USA), and the Robert Bosch Centre (Indian Institute of Science).
- Azar MG, Gomez V, Kappen HJ (2011) Dynamic policy programming with function approximation. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics (AISTATS), Fort LauderdaleGoogle Scholar
- Baird LC (1995) Residual algorithms: reinforcement learning with function approximation. In: Proceedings of ICML. Morgan Kaufmann, pp 30–37Google Scholar
- Bertsekas DP (2007) Dynamic programming and optimal control, vol II, 3rd ed. Athena Scientific, BelmontGoogle Scholar
- Borkar VS (2008) Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press and Hindustan Book AgencyGoogle Scholar
- Maei HR, Szepesvari C, Bhatnagar S, Precup D, Silver D, Sutton RS (2009) Convergent temporal-difference learning with arbitrary smooth function approximation. Proceedings of NIPSGoogle Scholar
- Maei HR, Szepesvari Cs, Bhatnagar S, Sutton RS (2010) Toward off-policy learning control with function approximation. Proceedings of ICML, HaifaGoogle Scholar
- Melo F, Ribeiro M (2007) Q-learning with linear function approximation. Learning Theory, Springer, pp 308–322Google Scholar
- Sutton RS (1988) Learning to predict by the method of temporal differences. Mach Learn 3:9–44Google Scholar
- Sutton RS, Barto A (1998) Reinforcement learning: an introduction. MIT Press, CambridgeGoogle Scholar
- Sutton RS, Szepesvari Cs, Maei HR (2009) A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In: Proceedings of NIPS. MIT Press, pp 1609–1616Google Scholar
- Sutton RS, Maei HR, Precup D, Bhatnagar S, Silver D, Szepesvari Cs, Wiewiora E (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: Proceedings of ICML. ACM, pp 993–1000Google Scholar