# Multiscale Q-learning with linear function approximation

## Abstract

We present a two-timescale variant of Q-learning with linear function approximation. Both the Q-values and the policies are parameterized, and the policy parameter is updated on a faster timescale than the Q-value parameter. This timescale separation is seen to result in significantly improved numerical performance of the proposed algorithm over Q-learning. We show that the proposed algorithm converges almost surely to a closed, connected, internally chain transitive invariant set of an associated differential inclusion.
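The abstract does not state the update rules, so the following is only a minimal sketch of the timescale-separation idea, not the authors' recursions. Everything specific in it is an assumption made for illustration: the softmax policy, the one-hot features `phi`, the step sizes in `step_sizes`, the projection bound `W_MAX`, the reward-maximizing convention, and the two-state environment `toy_env`. The policy parameter `w` uses the larger step size `a_n`, the Q-value parameter `theta` uses the smaller step size `b_n`, and `b_n / a_n` tends to zero.

```python
import math
import random

GAMMA = 0.9                    # discount factor (assumed)
N_STATES, N_ACTIONS = 2, 2     # toy problem size (assumed)
W_MAX = 5.0                    # projection bound for the policy parameter (assumed)

def phi(s, a):
    """Hypothetical one-hot feature vector for the pair (s, a)."""
    v = [0.0] * (N_STATES * N_ACTIONS)
    v[s * N_ACTIONS + a] = 1.0
    return v

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def policy(w, s):
    """Softmax policy over the linear preferences w . phi(s, a)."""
    prefs = [dot(w, phi(s, a)) for a in range(N_ACTIONS)]
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def step_sizes(n):
    """Fast step a_n (policy) and slow step b_n (Q-values); b_n / a_n -> 0."""
    return 1.0 / (n + 1) ** 0.6, 1.0 / (n + 1)

def toy_env(s, a, rng):
    """Hypothetical 2-state MDP: action 1 pays most when taken in state 1."""
    reward = 1.0 if (s == 1 and a == 1) else 0.1 * a
    s_next = a if rng.random() < 0.8 else 1 - a
    return reward, s_next

def train(n_iters=5000, seed=0):
    rng = random.Random(seed)
    dim = N_STATES * N_ACTIONS
    theta = [0.0] * dim        # Q-value parameter, slow timescale
    w = [0.0] * dim            # policy parameter, fast timescale
    s = 0
    for n in range(n_iters):
        a_n, b_n = step_sizes(n)
        a = rng.choices(range(N_ACTIONS), weights=policy(w, s))[0]
        r, s_next = toy_env(s, a, rng)

        # Fast timescale: gradient ascent on sum_b pi_w(b | s') * theta . phi(s', b),
        # followed by projection of each coordinate onto [-W_MAX, W_MAX].
        pi = policy(w, s_next)
        feats = [phi(s_next, b) for b in range(N_ACTIONS)]
        q = [dot(theta, f) for f in feats]
        mean_feat = [sum(pi[b] * feats[b][i] for b in range(N_ACTIONS))
                     for i in range(dim)]
        for i in range(dim):
            grad_i = sum(pi[b] * q[b] * (feats[b][i] - mean_feat[i])
                         for b in range(N_ACTIONS))
            w[i] = max(-W_MAX, min(W_MAX, w[i] + a_n * grad_i))

        # Slow timescale: TD-style update in which the next action is sampled
        # from the current policy instead of taking a max over actions.
        a_next = rng.choices(range(N_ACTIONS), weights=policy(w, s_next))[0]
        f = phi(s, a)
        td = r + GAMMA * dot(theta, phi(s_next, a_next)) - dot(theta, f)
        theta = [t + b_n * td * fi for t, fi in zip(theta, f)]
        s = s_next
    return theta, w
```

Calling `theta, w = train()` returns the two parameter vectors. Because `b_n` decays faster than `a_n`, the policy parameter sees the Q-value parameter as nearly constant between its own updates, which is the separation the abstract refers to.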

## Keywords

- Q-learning with linear function approximation
- Reinforcement learning
- Stochastic approximation
- Ordinary differential equation
- Differential inclusion
- Multi-stage stochastic shortest path problem

## Notes

### Acknowledgments

The authors thank the Editor, Prof. C. G. Cassandras, the Associate Editor, and all the anonymous reviewers for their detailed comments on and criticisms of the various drafts of this paper, which led to several corrections in the proof and presentation. In particular, the authors gratefully thank the reviewer who suggested that they follow a differential-inclusions-based approach for the slower-scale dynamics. The authors thank Prof. V. S. Borkar for helpful discussions. This work was partially supported through projects from the Department of Science and Technology (Government of India), Xerox Corporation (USA), and the Robert Bosch Centre (Indian Institute of Science).
