Multiscale Q-learning with linear function approximation

Bhatnagar, Shalabh; Lakshmanan, K.

doi:10.1007/s10626-015-0216-z

Published: 30 August 2015

Volume 26, pages 477–509, (2016)
Discrete Event Dynamic Systems

Abstract

We present in this article a two-timescale variant of Q-learning with linear function approximation. Both Q-values and policies are assumed to be parameterized, with the policy parameter updated on a faster timescale than the Q-value parameter. This timescale separation is seen to result in significantly improved numerical performance of the proposed algorithm over Q-learning. We show that the proposed algorithm converges almost surely to a closed connected internally chain transitive invariant set of an associated differential inclusion.

References

Abdulla MS, Bhatnagar S (2007) Reinforcement learning based algorithms for average cost Markov decision processes. Discrete Event Dyn Syst Theory Appl 17(1):23–52
Abounadi J, Bertsekas D, Borkar VS (2001) Learning algorithms for Markov decision processes. SIAM J Control Optim 40:681–698
Aubin J, Cellina A (1984) Differential inclusions: set-valued maps and viability theory. Springer, New York
Azar MG, Gomez V, Kappen HJ (2011) Dynamic policy programming with function approximation. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics (AISTATS), Fort Lauderdale
Baird LC (1995) Residual algorithms: reinforcement learning with function approximation. In: Proceedings of ICML. Morgan Kaufmann, pp 30–37
Benaim M, Hofbauer J, Sorin S (2005) Stochastic approximations and differential inclusions. SIAM J Control Optim 44(1):328–348
Benaim M, Hofbauer J, Sorin S (2006) Stochastic approximations and differential inclusions, Part II: applications. Math Oper Res 31(4):673–695
Bertsekas DP (2005) Dynamic programming and optimal control, 3rd ed. Athena Scientific, Belmont
Bertsekas DP (2007) Dynamic programming and optimal control, vol II, 3rd ed. Athena Scientific, Belmont
Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific, Belmont
Bhatnagar S, Babu KM (2008) New algorithms of the Q-learning type. Automatica 44(4):1111–1119
Bhatnagar S, Borkar VS (1997) Multiscale stochastic approximation for parametric optimization of hidden Markov models. Probab Eng Inf Sci 11:509–522
Bhatnagar S, Fu MC, Marcus SI, Wang I-J (2003) Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation 13(2):180–209
Bhatnagar S, Kumar S (2004) A simultaneous perturbation stochastic approximation based actor–critic algorithm for Markov decision processes. IEEE Trans Autom Control 49(4):592–598
Bhatnagar S (2005) Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization. ACM Transactions on Modeling and Computer Simulation 15(1):74–107
Bhatnagar S (2007) Adaptive Newton-based multivariate smoothed functional algorithms for simulation optimization. ACM Transactions on Modeling and Computer Simulation 18(1):2:1–2:35
Bhatnagar S, Prasad HL, Prashanth LA (2013) Stochastic recursive algorithms for optimization: simultaneous perturbation methods, lecture notes in control and information sciences. Springer, London
Bhatnagar S, Sutton RS, Ghavamzadeh M, Lee M (2009) Natural actor-critic algorithms. Automatica 45:2471–2482
Bhatnagar S, Lakshmanan K (2012) An online actor-critic algorithm with function approximation for constrained Markov decision processes. J Optim Theory Appl 153(3):688–708
Borkar VS (1995) Probability theory: an advanced course. Springer, New York
Borkar VS (1997) Stochastic approximation with two timescales. Syst Control Lett 29:291–294
Borkar VS (2008) Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press and Hindustan Book Agency
Borkar VS, Meyn SP (2000) The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J Control Optim 38(2):447–469
Brandiere O (1998) Some pathological traps for stochastic approximation. SIAM J Control Optim 36:1293–1314
Ephremides A, Varaiya P, Walrand J (1980) A simple dynamic routing problem. IEEE Trans Autom Control 25(4):690–693
Gelfand SB, Mitter SK (1991) Recursive stochastic algorithms for global optimization in ${\mathcal R}^{d_{*}}$. SIAM J Control Optim 29(5):999–1018
Konda VR, Borkar VS (1999) Actor–critic like learning algorithms for Markov decision processes. SIAM J Control Optim 38(1):94–123
Konda VR, Tsitsiklis JN (2003) On actor–critic algorithms. SIAM J Control Optim 42(4):1143–1166
Kushner HJ, Clark DS (1978) Stochastic approximation methods for constrained and unconstrained systems. Springer, New York
Kushner HJ, Yin GG (1997) Stochastic approximation algorithms and applications. Springer, New York
Maei HR, Szepesvari C, Bhatnagar S, Precup D, Silver D, Sutton RS (2009) Convergent temporal-difference learning with arbitrary smooth function approximation. Proceedings of NIPS
Maei HR, Szepesvari Cs, Bhatnagar S, Sutton RS (2010) Toward off-policy learning control with function approximation. Proceedings of ICML, Haifa
Melo F, Ribeiro M (2007) Q-learning with linear function approximation. Learning Theory, Springer, pp 308–322
Pemantle R (1990) Nonconvergence to unstable points in urn models and stochastic approximations. Ann Probab 18:698–712
Prashanth LA, Chatterjee A, Bhatnagar S (2014) Two timescale convergent Q-learning for sleep scheduling in wireless sensor networks. Wirel Netw 20:2589–2604
Puterman ML (1994) Markov decision processes: discrete stochastic dynamic programming. Wiley, New York
Schweitzer PJ (1968) Perturbation theory and finite Markov chains. J Appl Probab 5:401–413
Sutton RS (1988) Learning to predict by the method of temporal differences. Mach Learn 3:9–44
Sutton RS, Barto A (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
Sutton RS, Szepesvari Cs, Maei HR (2009) A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In: Proceedings of NIPS. MIT Press, pp 1609–1616
Sutton RS, Maei HR, Precup D, Bhatnagar S, Silver D, Szepesvari Cs, Wiewiora E (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: Proceedings of ICML. ACM, pp 993–1000
Spall JC (1992) Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans Autom Control 37(3):332–341
Spall JC (1997) A one-measurement form of simultaneous perturbation stochastic approximation. Automatica 33:109–112
Szepesvari C, Smart WD (2004) Interpolation-based Q-learning. In: Proceedings of ICML. Banff, Canada
Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Mach Learn 16:185–202
Tsitsiklis JN, Van Roy B (1997) An analysis of temporal-difference learning with function approximation. IEEE Trans Autom Control 42(5):674–690
Tsitsiklis JN, Van Roy B (1999) Average cost temporal-difference learning. Automatica 35:1799–1808
Walrand J (1988) An introduction to queueing networks. Prentice Hall, New Jersey
Watkins C, Dayan P (1992) Q-learning. Mach Learn 8:279–292
Weber RW (1978) On the optimal assignment of customers to parallel servers. J Appl Probab 15:406–413
Acknowledgments

The authors thank the Editor Prof. C. G. Cassandras, the Associate Editor, and all the anonymous reviewers for their detailed comments and criticisms on the various drafts of this paper, which led to several corrections in the proof and presentation. In particular, the authors gratefully thank the reviewer who suggested that they follow a differential inclusions based approach for the slower scale dynamics. The authors thank Prof. V. S. Borkar for helpful discussions. This work was partially supported through projects from the Department of Science and Technology (Government of India), Xerox Corporation (USA), and the Robert Bosch Centre (Indian Institute of Science).

Author information

Authors and Affiliations

Department of Computer Science and Automation, Indian Institute of Science, Bangalore, 560 012, India
Shalabh Bhatnagar
Department of Mechanical Engineering, National University of Singapore, Singapore, Singapore
K. Lakshmanan

Corresponding author

Correspondence to Shalabh Bhatnagar.

Appendix

In this section, we present detailed proofs of some of the results given in Section 3.

Proof of Proposition 1

Note that the n-step (n > 1) transition probability of going from state (i, a) to (j, b) is

$$\begin{aligned} p_{w}^{n}(i,a;j,b) &= P(X_{n}=j, Z_{n}=b \mid X_{0}=i, Z_{0}=a, \pi_{w}) \\ &= q_{w}^{n}(i,a,j)\,\pi_{w}(j,b), \end{aligned}$$

where $q_{w}^{n}(i,a,j)$ is the n-step probability of going to state $j$ when the initial state is $i$ and action $a$ is chosen (in state $i$), while actions in the other stages (from 1 to $n-1$) are chosen according to the SRP $\pi_w$. It is easy to see that $\sum_{j\in S} q_{w}^{n}(i,a,j) = 1$ for all $i \in S$, $a \in A(i)$.

Let $l \in S$ be such that $p(i,a,l) > 0$. By Assumption 1, the chain $X_n$, $n \geq 0$, is irreducible under any SRP $\pi_w$. Thus, given an SRP $\pi_w$ and states $l, j$, there exists an integer $n_1 > 0$ such that

$$p^{n_{1}}(l,j,\pi_{w}) \overset{\triangle}{=} P(X_{n_{1}}=j\mid X_{0}=l,\pi_{w}) >0.$$

Note that in computing $p^{n}(l,j,\pi_w)$, it is assumed that the actions at each of the $n$ stages are picked according to the policy $\pi_w$. This is unlike $q_{w}^{n}(l,a,j)$, where the first action picked in state $l$ is $a$ while the actions in the remaining $n-1$ stages are picked according to $\pi_w$. Now observe that

$$p^{n}_{w}(i,a; j,b) \geq p(i,a,l) p^{n-1}(l,j,\pi_{w})\pi_{w}(j,b). $$

Thus, $p_{w}^{n_{1}+1}(i,a;j,b) > 0$. Similarly, it can be shown that there exists an integer $n_2 > 0$ such that $p_{w}^{n_{2}+1}(j,b;i,a) > 0$. Thus, $\{(X_n, Z_n)\}$ is an irreducible Markov chain when $Z_n$, $n \geq 0$, are obtained according to $\pi_w$.

Next, we show that $\{(X_n, Z_n)\}$ is aperiodic. Again let $l \in S$ be such that $p(i,a,l) > 0$. Since the process $\{X_n\}$ is aperiodic under $\pi_w$ by Assumption 1, there exists an integer $M > 0$ such that $p^{n}(l,l,\pi_w) > 0$ for all $n \geq M$; see, for instance, Lemma 5.3.2, p. 99 of Borkar (1995). By irreducibility of $\{X_n\}$ under $\pi_w$, there exists an integer $n_3 > 0$ such that $p^{n_{3}}(l,i,\pi_{w}) > 0$. Now note that

$$p^{1+n+n_{3}}_{w}(i,a; i,a) \geq p(i,a,l)\, p^{n}(l,l,\pi_{w})\, p^{n_{3}}(l,i,\pi_{w})\,\pi_{w}(i,a) > 0, \quad \forall n \geq M.$$

Thus, $p_{w}^{n}(i,a;i,a) > 0$ for all $n \geq 1+M+n_{3}$. Hence, $\{(X_n, Z_n)\}$ is aperiodic under $\pi_w$ as well. Finally, since $S \times A(S)$ is a finite set, $\{(X_n, Z_n)\}$ is also positive recurrent. The claim follows. □
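The argument above can be checked numerically on a small example. The following is a minimal sketch, assuming a hypothetical two-state, two-action MDP whose kernel `p` and policy `pi` are made up for illustration. It builds the joint kernel $p_w(i,a;j,b) = p(i,a,j)\pi_w(j,b)$ and searches for a power of it with all entries positive; for a finite chain, such a power implies both irreducibility and aperiodicity.

```python
from itertools import product

# Hypothetical toy MDP (not from the paper): 2 states, 2 actions per state.
# p[i][a][j] = probability of moving to state j from state i under action a.
p = [
    [[0.0, 1.0], [0.5, 0.5]],
    [[1.0, 0.0], [0.3, 0.7]],
]
# pi[j][b] = probability that the randomized policy picks action b in state j.
pi = [[0.6, 0.4], [0.2, 0.8]]

pairs = list(product(range(2), range(2)))  # state-action pairs (i, a)

# One-step kernel of the joint chain: p_w(i,a; j,b) = p(i,a,j) * pi_w(j,b).
P = [[p[i][a][j] * pi[j][b] for (j, b) in pairs] for (i, a) in pairs]


def matmul(A, B):
    """Plain-Python product of two square matrices."""
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def primitivity_index(P, max_power=50):
    """Smallest n with every entry of P^n positive, or None if none is found."""
    Q = P
    for n in range(1, max_power + 1):
        if all(x > 0 for row in Q for x in row):
            return n
        Q = matmul(Q, P)
    return None


row_sums = [sum(row) for row in P]
index = primitivity_index(P)
print(row_sums, index)
```

In this example the one-step kernel has zero entries (action 0 in state 0 always leads to state 1), yet the two-step kernel is strictly positive, which mirrors the role of the intermediate state $l$ in the proof.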

Proof of Lemma 1

We shall use a key result from Schweitzer (1968) for the proof. Let $P_{w}^{\infty} = \lim_{m\rightarrow\infty} \frac{1}{m}\sum\limits_{n=1}^{m} P_{w}^{n}$ and $Z_{w} \overset{\triangle}{=} (I - P_{w} + P_{w}^{\infty})^{-1}$, where $I$ denotes the $(|S \times A(S)| \times |S \times A(S)|)$ identity matrix and $P_{w}^{m}$ is the matrix of m-step transition probabilities $p^{m}_{w}(i,a;j,b)$, $i, j \in S$, $a \in A(i)$, $b \in A(j)$. From Theorem 2, pp. 402–403 of Schweitzer (1968), one can write

$$\mathbb{f}_{w+\xi e_{i}} = {\mathbb{f}}_{w}(I + (P_{w+\xi e_{i}}-P_{w})Z_{w} + o(\xi)), $$

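where $\xi > 0$ is a small quantity and $e_i$, $i \in \{1,\ldots,N\}$, is the unit vector with 1 as its $i$th entry and 0s elsewhere. Subtracting ${\mathbb{f}}_{w}$ from both sides, dividing by $\xi$ and letting $\xi \rightarrow 0$, we get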

$$\nabla_{w,i} {\mathbb{f}}_{w} = {\mathbb{f}}_{w} \nabla_{w,i} P_{w} Z_{w}, i=1,\ldots, N. $$

Thus, $\nabla_{w} {\mathbb{f}}_{w} = {\mathbb{f}}_{w} \nabla_{w} P_{w} Z_{w}$. Now since $p_w(i,a;j,b) = p(i,a,j)\pi_w(j,b)$, it follows from Assumption 2 that the $p_w(i,a;j,b)$ are continuously differentiable in $w$, i.e., $\nabla_w P_w$ exists and is continuous. Hence, $\nabla_{w} {\mathbb{f}}_{w}$ exists.

Next we verify that $\nabla_{w} {\mathbb{f}}_{w}$ is continuous as well. Note that ${\mathbb{f}}_{w}$ is continuous since it is differentiable. Further, $\nabla_w P_w$ is continuous as noted above. Also, from Cramer's rule, it follows that $Z_w$ is continuously differentiable and hence also continuous over $w \in C$. Since $\nabla_{w} {\mathbb{f}}_{w} = {\mathbb{f}}_{w} \nabla_{w} P_{w} Z_{w}$ is a product of continuous functions on $C$, a compact subset of ${\mathcal R}^{N}$, it is continuous as well. The claim follows. □
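The derivative formula can be sanity-checked on a chain small enough to solve by hand. The sketch below assumes a hypothetical two-state chain with scalar parameter $w$ (the sigmoid parameterization and the constant `B` are illustrative choices, not from the paper) and takes the fundamental matrix in the usual form $(I - P_w + P_w^{\infty})^{-1}$. It compares the product of the stationary distribution, $dP_w/dw$ and $Z_w$ against a central finite-difference derivative of the stationary distribution.

```python
import math

B = 0.3  # fixed probability of leaving state 1 (hypothetical value)


def a_of(w):
    """Probability of leaving state 0, a smooth function of the scalar w."""
    return 1.0 / (1.0 + math.exp(-w))


def stationary(w):
    """Stationary distribution of P_w = [[1-a, a], [B, 1-B]]."""
    a = a_of(w)
    return [B / (a + B), a / (a + B)]


def inv2(M):
    """Inverse of a 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]


def vec_mat(v, M):
    """Row vector times 2x2 matrix."""
    return [v[0] * M[0][c] + v[1] * M[1][c] for c in range(2)]


def gradient_by_formula(w):
    """Evaluate f_w * (dP_w/dw) * Z_w with Z_w = (I - P_w + P_w^inf)^(-1)."""
    a = a_of(w)
    f = stationary(w)
    P = [[1 - a, a], [B, 1 - B]]
    P_inf = [f, f]  # every row of the limit matrix is f_w
    M = [[(1.0 if r == c else 0.0) - P[r][c] + P_inf[r][c] for c in range(2)]
         for r in range(2)]
    Z = inv2(M)
    da = a * (1 - a)  # derivative of the sigmoid
    dP = [[-da, da], [0.0, 0.0]]
    return vec_mat(vec_mat(f, dP), Z)


def gradient_by_differences(w, h=1e-6):
    """Central finite-difference derivative of the stationary distribution."""
    fp, fm = stationary(w + h), stationary(w - h)
    return [(fp[k] - fm[k]) / (2 * h) for k in range(2)]


w0 = 0.4
g_formula = gradient_by_formula(w0)
g_numeric = gradient_by_differences(w0)
print(g_formula, g_numeric)
```

The two components of the derivative sum to zero, as they must, since the stationary probabilities sum to one for every $w$.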

Proof of Lemma 2

It is easy to see from the definition of $R(\theta,w)$ and Lemma 1 that the partial derivatives of $R(\theta,w)$ with respect to $\theta \in {\mathcal R}^{d}$ and $w \in C$ exist. Note that, by definition, for a given $w \in C$,

$$\nabla_{\theta} R(\theta,w) = \sum\limits_{(i,a)\in S\times A(S)} f_{w}(i,a)\phi_{i,a},$$

which is constant in $\theta$ and hence continuous. Now consider

$$\nabla_{w} R(\theta,w) = (\nabla_{w,1}R(\theta,w),\ldots,\nabla_{w,N}R(\theta,w))^{T}, $$

where $\nabla_{w,i} R(\theta,w)$ is the partial derivative of $R(\theta,w)$ with respect to $w_i$, given $\theta \in D$. Note that $\sup_{\theta \in D} \parallel\theta\parallel < \infty$, since $D$ is bounded. Now, given $\theta \in D$,

$$\nabla_{w} R(\theta,w) = \sum\limits_{(i, a)\in S\times A(S)} \nabla_{w} f_{w}(i,a) \theta^{T} \phi_{i,a},$$

since $S \times A(S)$ is a finite set. Let $w^{1}$ and $w^{2}$ be two points in $C$. Then,

$$\begin{aligned} \parallel \nabla_{w} R(\theta,w^{1}) - \nabla_{w} R(\theta,w^{2}) \parallel &\leq \sum\limits_{(i,a)\in S\times A(S)} \parallel \nabla_{w} f_{w^{1}}(i,a) {\theta}^{T} \phi_{i,a} - \nabla_{w} f_{w^{2}}(i,a) {\theta}^{T} \phi_{i,a} \parallel \\ &\leq \sum\limits_{(i,a)\in S\times A(S)} \parallel \nabla_{w} f_{w^{1}}(i,a)-\nabla_{w} f_{w^{2}}(i,a)\parallel |{\theta}^{T} \phi_{i,a}|. \end{aligned}$$

Now since D is a compact set, note that

$$L_{2} \overset{\triangle}{=} \max_{(i,a)\in S\times A(S)}\max_{\theta\in D}|\theta^{T}\phi_{i,a}|<\infty. $$

The claim now follows since $\nabla_w f_w(i,a)$ is a continuous function by Lemma 1 (in fact uniformly continuous, since $w \in C$, a compact set). □
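To see concretely why the constant $L_2$ is finite, consider the special case where $D$ is a box. The sketch below is illustrative only: the feature vectors, the dimension $d = 3$ and the box $D = [-2, 2]^3$ are all assumptions, not taken from the paper. Because $|\theta^{T}\phi_{i,a}|$ is convex in $\theta$, its maximum over a box is attained at a vertex and equals the box radius times the 1-norm of $\phi_{i,a}$; the code confirms this by brute force.

```python
from itertools import product

# Hypothetical feature vectors phi_{i,a} for four state-action pairs (d = 3).
features = {
    (0, 0): [1.0, 0.0, 0.5],
    (0, 1): [0.0, -2.0, 1.0],
    (1, 0): [0.5, 0.5, -0.5],
    (1, 1): [-1.0, 1.0, 0.0],
}
RADIUS = 2.0  # assumed: D = [-RADIUS, RADIUS]^3


def bound_closed_form(features, radius):
    """Max over pairs of radius * ||phi||_1, the maximum of |theta^T phi| on the box."""
    return max(radius * sum(abs(x) for x in phi) for phi in features.values())


def bound_by_vertices(features, radius):
    """Brute force over box vertices, where a convex function attains its maximum."""
    d = len(next(iter(features.values())))
    best = 0.0
    for signs in product((-1.0, 1.0), repeat=d):
        theta = [radius * s for s in signs]
        for phi in features.values():
            best = max(best, abs(sum(t * x for t, x in zip(theta, phi))))
    return best


L2_closed = bound_closed_form(features, RADIUS)
L2_brute = bound_by_vertices(features, RADIUS)
print(L2_closed, L2_brute)
```

For a general compact $D$ the same conclusion follows from continuity of $\theta \mapsto |\theta^{T}\phi_{i,a}|$ together with finiteness of $S \times A(S)$; the box is chosen here only because it admits a closed form.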

Proof of Lemma 6

We first show the claim in (3.5). Recall from Lemma 5 that

$$\parallel w_{n+s}-w_{n} \parallel \rightarrow 0 ~\text{as}~ n\rightarrow\infty, $$

almost surely, for all s ∈ {1,…,P}. From Lemma 2 and the above, it follows that

$$\parallel \nabla_{w,k} R(\theta, w_{n+s}) - \nabla_{w,k} R(\theta, w_{n}) \parallel \rightarrow 0 \,\,\text{as}\,\, n\rightarrow\infty, $$

for all $s \in \{1,\ldots,P\}$, $k \in \{1,\ldots,N\}$. By letting $M = P$ in Assumption 5, it follows that $a(j)/a(m) \rightarrow 1$ as $m \rightarrow \infty$ for any $j \in \{m,\ldots,m+P-1\}$. Note also that $P$ is an even integer. As a consequence of Lemma 4, one can split any set of the type $A_{m} \overset{\triangle}{=} \{m,m+1,\ldots,m+P-1\}$ into two disjoint subsets $A_{m,k,l}^{+}$ and $A_{m,k,l}^{-}$, each having the same number of elements, with $A_{m,k,l}^{+} \cup A_{m,k,l}^{-} = A_{m}$ and such that $\frac{\triangle_{n}^{k}}{\triangle_{n}^{l}}$ takes the value $+1$ for all $n \in A_{m,k,l}^{+}$ and $-1$ for all $n \in A_{m,k,l}^{-}$. Thus,

$$\parallel \sum\limits_{n=m}^{m+P-1} \!\!\frac{a(n)}{a(m)} \frac{{\triangle_{n}^{k}}}{{\triangle_{n}^{l}}} \nabla_{w,k} R(\theta, w_{n}) \parallel = \parallel\!\! \sum\limits_{n \in A_{m,k,l}^{+}} \!\frac{a(n)}{a(m)} \nabla_{w,k}R(\theta, w_{n}) -\!\! \sum\limits_{n \in A_{m,k,l}^{-}} \frac{a(n)}{a(m)} \nabla_{w,k}R(\theta, w_{n}) \parallel. $$

It now follows as a consequence of the above that

$$\parallel \sum\limits_{n=m}^{m+P-1} \frac{a(n)}{a(m)} \frac{{\triangle_{n}^{k}}}{{\triangle_{n}^{l}}} \nabla_{w,k} R(\theta, w_{n}) \parallel \rightarrow 0, $$

almost surely as m → ∞. Finally, the claim in (3.6) follows from Lemma 5, Lemma 2 and Assumption 5, in a similar manner as (3.5). □
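The cancellation used in this proof can be illustrated with a deterministic perturbation sequence. Lemma 4 is not reproduced here, so the sketch below assumes a Hadamard-matrix construction in the spirit of Bhatnagar et al. (2003): perturbation vectors are rows of a Sylvester Hadamard matrix of order $P$ with the all-ones column dropped, cycled with period $P$. The values `N = 3`, `P = 4`, the step size $a(n) = 1/(n+1)$ and the frozen gradient value are hypothetical choices for illustration.

```python
def sylvester_hadamard(order):
    """Hadamard matrix of the given order (a power of two), Sylvester construction."""
    H = [[1]]
    while len(H) < order:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H


N = 3  # number of perturbed components (hypothetical)
P = 4  # assumed period: smallest power of two that is at least N + 1
H = sylvester_hadamard(P)


def delta(n):
    """Perturbation vector at step n: row n mod P of H without the all-ones column."""
    return H[n % P][1:N + 1]


def ratio(n, k, l):
    """Delta_n^k / Delta_n^l; equals the product because entries are +1 or -1."""
    d = delta(n)
    return d[k] * d[l]


def plus_count(m, k, l):
    """Number of n in {m, ..., m+P-1} for which the ratio equals +1."""
    return sum(1 for n in range(m, m + P) if ratio(n, k, l) == 1)


def step(n):
    """Hypothetical step size a(n) = 1/(n+1), so a(j)/a(m) -> 1 within a window."""
    return 1.0 / (n + 1)


def weighted_sum(m, k, l, grad=1.0):
    """|sum_{n=m}^{m+P-1} a(n)/a(m) * ratio * grad| with the gradient frozen."""
    return abs(sum(step(n) / step(m) * ratio(n, k, l) * grad
                   for n in range(m, m + P)))


counts = [plus_count(m, k, l) for m in range(8)
          for k in range(N) for l in range(N) if k != l]
print(counts[:6], weighted_sum(4, 0, 1), weighted_sum(4000, 0, 1))
```

Every window of length $P$ splits evenly between ratios $+1$ and $-1$, and the weighted sum shrinks as $m$ grows because the step-size ratios approach one. In the lemma itself the gradient terms vary with $n$, and it is Lemma 2 together with Lemma 5 that makes their differences within a window vanish.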

About this article

Cite this article

Bhatnagar, S., Lakshmanan, K. Multiscale Q-learning with linear function approximation. Discrete Event Dyn Syst 26, 477–509 (2016). https://doi.org/10.1007/s10626-015-0216-z

Received: 25 September 2012
Accepted: 10 August 2015
Published: 30 August 2015
Issue Date: September 2016
DOI: https://doi.org/10.1007/s10626-015-0216-z
