Discrete Event Dynamic Systems, Volume 26, Issue 3, pp 477–509

Multiscale Q-learning with linear function approximation



Abstract

In this article, we present a two-timescale variant of Q-learning with linear function approximation. Both the Q-values and the policy are parameterized, with the policy parameter updated on a faster timescale than the Q-value parameter. This timescale separation is observed to yield significantly better numerical performance of the proposed algorithm over Q-learning. We show that the proposed algorithm converges almost surely to a closed connected internally chain transitive invariant set of an associated differential inclusion.
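To give a concrete picture of the coupled updates described in the abstract, the following Python sketch pairs a slowly updated linear Q-value parameter with a more rapidly updated softmax policy parameter. It is a minimal illustration under assumed interfaces (the environment `env`, the feature map `phi`) and an assumed form of the policy update; it is not the exact algorithm analysed in the paper.

```python
import numpy as np

def two_timescale_q_learning(env, phi, n_features, n_actions,
                             gamma=0.9, n_steps=10000, seed=0):
    """Illustrative two-timescale Q-learning with linear function approximation.

    theta parameterizes Q(s, a) ~ theta . phi(s, a)   (slower timescale)
    w     parameterizes a softmax policy over actions (faster timescale)

    The environment interface and the specific policy update below are
    assumptions for illustration only.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)   # Q-value parameter (slow updates)
    w = np.zeros(n_features)       # policy parameter (fast updates)

    s = env.reset()
    for n in range(1, n_steps + 1):
        # Diminishing step sizes with a(n)/b(n) -> 0, so the policy
        # parameter w evolves on the faster timescale.
        a_n = 1.0 / n          # slow step size for theta
        b_n = 1.0 / n ** 0.6   # fast step size for w

        # Softmax (Gibbs) policy induced by w over state-action features.
        prefs = np.array([w @ phi(s, b) for b in range(n_actions)])
        probs = np.exp(prefs - prefs.max())
        probs /= probs.sum()
        a = rng.choice(n_actions, p=probs)

        s_next, r, done = env.step(a)

        # One-step temporal-difference error with a greedy backup.
        q_next = max(theta @ phi(s_next, b) for b in range(n_actions))
        delta = r + (0.0 if done else gamma * q_next) - theta @ phi(s, a)

        # Slow update: Q-value parameter follows a TD(0)-style rule.
        theta += a_n * delta * phi(s, a)

        # Fast update: policy parameter ascends a score-function estimate
        # of the current Q-value under the softmax policy (assumed form).
        grad_log_pi = phi(s, a) - sum(probs[b] * phi(s, b)
                                      for b in range(n_actions))
        w += b_n * (theta @ phi(s, a)) * grad_log_pi

        s = env.reset() if done else s_next

    return theta, w
```

Any step-size pair a(n), b(n) with a(n)/b(n) → 0 produces the same qualitative separation: from the point of view of the policy recursion, the Q-value parameter appears quasi-static, while the policy appears essentially equilibrated from the point of view of the Q-value recursion.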


Keywords

Q-learning with linear function approximation · Reinforcement learning · Stochastic approximation · Ordinary differential equation · Differential inclusion · Multi-stage stochastic shortest path problem



Acknowledgements

The authors thank the Editor, Prof. C. G. Cassandras, the Associate Editor, and the anonymous reviewers for their detailed comments and criticisms on the various drafts of this paper, which led to several corrections in the proof and the presentation. In particular, the authors gratefully thank the reviewer who suggested following a differential inclusions based approach for the slower-timescale dynamics. The authors also thank Prof. V. S. Borkar for helpful discussions. This work was partially supported through projects from the Department of Science and Technology (Government of India), Xerox Corporation (USA), and the Robert Bosch Centre (Indian Institute of Science).



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
  2. Department of Mechanical Engineering, National University of Singapore, Singapore
