Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

Dyna is an effective reinforcement learning (RL) approach that combines value function evaluation with model learning. However, existing work on Dyna has mostly examined its efficiency only in RL problems with discrete action spaces. This paper proposes a novel Dyna variant, called Dyna-LSTD-PA (Dyna based on least-squares temporal difference (LSTD) and policy approximation), to handle problems with continuous action spaces. Dyna-LSTD-PA consists of two simultaneous, interacting processes. The learning process determines the probability distribution over the action space with a Gaussian distribution, estimates the underlying value function, policy, and model by linear representations, and updates their parameter vectors online by LSTD(λ). The planning process updates the parameter vector of the value function again by offline LSTD(λ). Dyna-LSTD-PA also uses the Sherman–Morrison formula to improve the efficiency of LSTD(λ), and weights the value-function parameter vectors from the two processes to bring them together. Theoretically, a global error bound is derived by accounting for the approximation, estimation, and model errors. Experimentally, Dyna-LSTD-PA outperforms two representative methods in convergence rate, success rate, and stability on four benchmark RL problems.
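The abstract credits the Sherman–Morrison formula with keeping LSTD(λ) efficient. As a rough illustration of that general idea, and not of the authors' exact Dyna-LSTD-PA update, the sketch below maintains the inverse of the LSTD matrix A directly, so each observed transition costs O(d²) instead of the O(d³) of re-inverting A from scratch. The class name RecursiveLSTD, the epsilon regularizer, and all hyperparameter defaults are assumptions made for this example.

```python
import numpy as np

class RecursiveLSTD:
    """Recursive LSTD(lambda) policy evaluation.

    Hypothetical sketch (not the paper's exact algorithm): the inverse of the
    LSTD matrix A is maintained directly via the Sherman-Morrison rank-one
    update, so each transition costs O(d^2) rather than O(d^3).
    """

    def __init__(self, n_features, gamma=0.99, lam=0.9, epsilon=1e-2):
        self.gamma = gamma                          # discount factor
        self.lam = lam                              # trace-decay parameter lambda
        self.inv_A = np.eye(n_features) / epsilon   # A^{-1}, with A initialised to epsilon*I
        self.b = np.zeros(n_features)               # accumulated reward statistics
        self.z = np.zeros(n_features)               # eligibility trace

    def update(self, phi, reward, phi_next):
        """Process one transition with features phi(s), reward r, features phi(s')."""
        self.z = self.gamma * self.lam * self.z + phi
        u = self.z
        v = phi - self.gamma * phi_next

        # Sherman-Morrison: (A + u v^T)^{-1} = A^{-1} - A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u)
        Au = self.inv_A @ u
        vA = v @ self.inv_A
        self.inv_A -= np.outer(Au, vA) / (1.0 + vA @ u)

        self.b += reward * self.z

    def theta(self):
        """Current value-function weights theta = A^{-1} b."""
        return self.inv_A @ self.b


if __name__ == "__main__":
    # Toy usage on random features, only to show the call pattern.
    rng = np.random.default_rng(0)
    lstd = RecursiveLSTD(n_features=8)
    phi = rng.standard_normal(8)
    for _ in range(100):
        phi_next = rng.standard_normal(8)
        lstd.update(phi, reward=rng.standard_normal(), phi_next=phi_next)
        phi = phi_next
    print(lstd.theta())
```

In a Dyna-style setting, an update of this form could in principle be driven by real transitions during learning and by simulated transitions from the learned model during planning, which is the interplay between the two processes that the abstract describes.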


Author information

Corresponding author

Correspondence to Quan Liu.

Additional information

Shan Zhong is a PhD candidate in the School of Computer Science and Technology at Soochow University, China. She received her master's degree from Jiangsu University, China, and is also a lecturer at Changshu Institute of Technology, China. Her main research interests include machine learning and deep learning.

Quan Liu is a professor and PhD supervisor in the School of Computer Science and Technology at Soochow University, China. He received his PhD degree from Jilin University, China in 2004 and worked as a postdoctoral researcher at Nanjing University, China from 2006 to 2008. He is a senior member of the China Computer Federation. His main research interests include reinforcement learning, intelligent information processing, and automated reasoning.

Zongzhang Zhang received his PhD degree in computer science from the University of Science and Technology of China in 2012. He is currently an associate professor at Soochow University, China. He worked as a research fellow at the National University of Singapore from 2012 to 2014 and as a visiting scholar at Rutgers University, USA from 2010 to 2011. His research interests include POMDPs, reinforcement learning, and multi-agent systems.

Qiming Fu received his master's and PhD degrees from the School of Computer Science and Technology at Soochow University, China in 2011 and 2014, respectively. He is a lecturer at Suzhou University of Science and Technology, China. His main research interests include reinforcement learning, Bayesian methods, and deep learning.

About this article

Cite this article

Zhong, S., Liu, Q., Zhang, Z. et al. Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation. Front. Comput. Sci. 13, 106–126 (2019). https://doi.org/10.1007/s11704-017-6222-6
