Abstract
Dyna is an effective reinforcement learning (RL) framework that combines value function evaluation with model learning. However, most existing work on Dyna addresses only RL problems with discrete action spaces. This paper proposes a novel Dyna variant, called Dyna-LSTD-PA, that handles problems with continuous action spaces. Dyna-LSTD-PA stands for Dyna based on least-squares temporal difference (LSTD) and policy approximation, and it consists of two simultaneous, interacting processes. The learning process models the distribution over the continuous action space as a Gaussian; approximates the underlying value function, policy, and model with linear representations; and updates their parameter vectors online with LSTD(λ). The planning process further updates the value-function parameter vector with offline LSTD(λ). Dyna-LSTD-PA also uses the Sherman–Morrison formula to improve the efficiency of LSTD(λ), and weights the value-function parameter vectors to couple the two processes. Theoretically, a global error bound is derived that accounts for approximation, estimation, and model errors. Experimentally, Dyna-LSTD-PA outperforms two representative methods in convergence rate, success rate, and stability on four benchmark RL problems.
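The efficiency claim in the abstract rests on two concrete ideas: actions are drawn from a Gaussian whose mean is linear in the state features, and the O(d³) linear solve inside LSTD(λ) is replaced by an O(d²) Sherman–Morrison rank-one inverse update. The listing below is a minimal NumPy sketch of both ideas under stated assumptions; names such as RecursiveLSTDLambda, w_mu, and epsilon are illustrative, not the paper's notation.

import numpy as np

class RecursiveLSTDLambda:
    """LSTD(lambda) maintained incrementally: instead of re-solving
    A theta = b at every step, keep P = A^{-1} up to date with the
    Sherman-Morrison rank-one inverse update (a sketch, not the
    paper's exact algorithm)."""

    def __init__(self, dim, gamma=0.99, lam=0.8, epsilon=1.0):
        self.gamma, self.lam = gamma, lam
        self.P = np.eye(dim) / epsilon  # P = A^{-1}, from the init A = epsilon * I
        self.b = np.zeros(dim)          # accumulated trace-weighted rewards
        self.z = np.zeros(dim)          # eligibility trace
        self.theta = np.zeros(dim)      # value-function parameter vector

    def update(self, phi_s, reward, phi_next):
        # Decay and accumulate the eligibility trace: z <- gamma*lambda*z + phi(s).
        self.z = self.gamma * self.lam * self.z + phi_s
        # LSTD(lambda) adds the rank-one term z (phi(s) - gamma phi(s'))^T to A.
        # Sherman-Morrison updates the inverse directly, in O(d^2):
        #   (A + u v^T)^{-1} = P - (P u)(v^T P) / (1 + v^T P u).
        u, v = self.z, phi_s - self.gamma * phi_next
        Pu = self.P @ u
        self.P -= np.outer(Pu, v @ self.P) / (1.0 + v @ Pu)
        # b <- b + z * r; the solution theta = A^{-1} b is then one product.
        self.b += self.z * reward
        self.theta = self.P @ self.b
        return self.theta

def gaussian_action(w_mu, phi_s, sigma, rng=None):
    """Sample a continuous action from a Gaussian policy whose mean is
    linear in the state features: a ~ N(w_mu^T phi(s), sigma^2)."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.normal(loc=w_mu @ phi_s, scale=sigma)

Each update costs O(d²) time and memory in the feature dimension d, which is what makes it plausible to run the same routine both in the online learning process and in the offline planning sweeps that Dyna interleaves.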
Author information
Shan Zhong is a PhD candidate in the School of Computer Science and Technology at Soochow University, China. She received her master's degree from Jiangsu University, China. She is also a lecturer at Changshu Institute of Technology, China. Her main research interests include machine learning and deep learning.
Quan Liu is a professor and PhD supervisor in the School of Computer Science and Technology at Soochow University, China. He received his PhD degree from Jilin University, China in 2004 and worked as a postdoctoral researcher at Nanjing University, China from 2006 to 2008. He is a senior member of the China Computer Federation. His main research interests include reinforcement learning, intelligent information processing, and automated reasoning.
Zongzhang Zhang received his PhD degree in computer science from the University of Science and Technology of China in 2012. He is currently an associate professor at Soochow University, China. He worked as a research fellow at the National University of Singapore from 2012 to 2014 and as a visiting scholar at Rutgers University, USA from 2010 to 2011. His research interests include POMDPs, reinforcement learning, and multi-agent systems.
Qiming Fu received his master's and PhD degrees from the School of Computer Science and Technology at Soochow University, China in 2011 and 2014, respectively. He is a lecturer at Suzhou University of Science and Technology, China. His main research interests include reinforcement learning, Bayesian methods, and deep learning.
Cite this article
Zhong, S., Liu, Q., Zhang, Z. et al. Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation. Front. Comput. Sci. 13, 106–126 (2019). https://doi.org/10.1007/s11704-017-6222-6