Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

Dyna is an effective reinforcement learning (RL) approach that combines value function evaluation with model learning. However, existing work on Dyna has mostly examined its efficiency only in RL problems with discrete action spaces. This paper proposes a novel Dyna variant, called Dyna-LSTD-PA (Dyna based on least-squares temporal difference (LSTD) and policy approximation), to handle problems with continuous action spaces. Dyna-LSTD-PA consists of two simultaneous, interacting processes. The learning process determines the probability distribution over the action space with a Gaussian distribution, estimates the underlying value function, policy, and model by linear representations, and updates their parameter vectors online by LSTD(λ). The planning process updates the parameter vector of the value function again by offline LSTD(λ). Dyna-LSTD-PA also uses the Sherman–Morrison formula to improve the efficiency of LSTD(λ), and weights the value-function parameter vectors from the two processes to bring them together. Theoretically, a global error bound is derived by accounting for the approximation, estimation, and model errors. Experimentally, Dyna-LSTD-PA outperforms two representative methods in convergence rate, success rate, and stability on four benchmark RL problems.
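The abstract credits the Sherman–Morrison formula with keeping LSTD(λ) efficient. As a rough illustration of that general idea, and not of the authors' exact Dyna-LSTD-PA update, the sketch below maintains the inverse of the LSTD matrix A directly, so each observed transition costs O(d²) instead of the O(d³) of re-inverting A from scratch. The class name RecursiveLSTD, the epsilon regularizer, and all hyperparameter defaults are assumptions made for this example.

```python
import numpy as np

class RecursiveLSTD:
    """Recursive LSTD(lambda) policy evaluation.

    Hypothetical sketch (not the paper's exact algorithm): the inverse of the
    LSTD matrix A is maintained directly via the Sherman-Morrison rank-one
    update, so each transition costs O(d^2) rather than O(d^3).
    """

    def __init__(self, n_features, gamma=0.99, lam=0.9, epsilon=1e-2):
        self.gamma = gamma                          # discount factor
        self.lam = lam                              # trace-decay parameter lambda
        self.inv_A = np.eye(n_features) / epsilon   # A^{-1}, with A initialised to epsilon*I
        self.b = np.zeros(n_features)               # accumulated reward statistics
        self.z = np.zeros(n_features)               # eligibility trace

    def update(self, phi, reward, phi_next):
        """Process one transition with features phi(s), reward r, features phi(s')."""
        self.z = self.gamma * self.lam * self.z + phi
        u = self.z
        v = phi - self.gamma * phi_next

        # Sherman-Morrison: (A + u v^T)^{-1} = A^{-1} - A^{-1} u v^T A^{-1} / (1 + v^T A^{-1} u)
        Au = self.inv_A @ u
        vA = v @ self.inv_A
        self.inv_A -= np.outer(Au, vA) / (1.0 + vA @ u)

        self.b += reward * self.z

    def theta(self):
        """Current value-function weights theta = A^{-1} b."""
        return self.inv_A @ self.b


if __name__ == "__main__":
    # Toy usage on random features, only to show the call pattern.
    rng = np.random.default_rng(0)
    lstd = RecursiveLSTD(n_features=8)
    phi = rng.standard_normal(8)
    for _ in range(100):
        phi_next = rng.standard_normal(8)
        lstd.update(phi, reward=rng.standard_normal(), phi_next=phi_next)
        phi = phi_next
    print(lstd.theta())
```

In a Dyna-style setting, an update of this form could in principle be driven by real transitions during learning and by simulated transitions from the learned model during planning, which is the interplay between the two processes that the abstract describes.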


Author information

Corresponding author

Correspondence to Quan Liu.

Additional information

Shan Zhong is a PhD candidate in the School of Computer Science and Technology at Soochow University, China. She received her master's degree from Jiangsu University, China, and is also a lecturer at Changshu Institute of Technology, China. Her main research interests include machine learning and deep learning.

Quan Liu is a professor and PhD supervisor in the School of Computer Science and Technology at Soochow University, China. He received his PhD degree from Jilin University, China in 2004 and worked as a postdoctoral researcher at Nanjing University, China from 2006 to 2008. He is a senior member of the China Computer Federation. His main research interests include reinforcement learning, intelligent information processing, and automated reasoning.

Zongzhang Zhang received his PhD degree in computer science from the University of Science and Technology of China in 2012. He is currently an associate professor at Soochow University, China. He worked as a research fellow at the National University of Singapore from 2012 to 2014 and as a visiting scholar at Rutgers University, USA from 2010 to 2011. His research interests include POMDPs, reinforcement learning, and multi-agent systems.

Qiming Fu received his master's and PhD degrees from the School of Computer Science and Technology at Soochow University, China in 2011 and 2014, respectively. He is a lecturer at Suzhou University of Science and Technology, China. His main research interests include reinforcement learning, Bayesian methods, and deep learning.

About this article

Cite this article

Zhong, S., Liu, Q., Zhang, Z. et al. Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation. Front. Comput. Sci. 13, 106–126 (2019). https://doi.org/10.1007/s11704-017-6222-6
