Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation

Research Article

Abstract

Dyna is an effective reinforcement learning (RL) approach that combines value-function evaluation with model learning. However, most existing work on Dyna addresses only RL problems with discrete action spaces. This paper proposes a novel Dyna variant, Dyna-LSTD-PA (Dyna based on least-squares temporal difference (LSTD) and policy approximation), to handle problems with continuous action spaces. Dyna-LSTD-PA consists of two simultaneous, interacting processes. The learning process defines the probability distribution over the action space with a Gaussian distribution, represents the underlying value function, policy, and model linearly, and updates their parameter vectors online by LSTD(λ). The planning process updates the value-function parameter vector again using offline LSTD(λ). Dyna-LSTD-PA also applies the Sherman–Morrison formula to improve the efficiency of LSTD(λ), and weights the value-function parameter vectors from the two processes to combine them. Theoretically, a global error bound is derived that accounts for approximation, estimation, and model errors. Experimentally, Dyna-LSTD-PA outperforms two representative methods in convergence rate, success rate, and stability on four benchmark RL problems.
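
To make the abstract's mechanics concrete, the sketch below shows, in Python and under assumptions not taken from the paper, the two ingredients it names: a policy whose Gaussian mean is linear in the state features, and a recursive LSTD(λ) estimator that maintains the inverse of its A matrix with the Sherman–Morrison formula, so each update costs O(d²) rather than O(d³). This is not the authors' implementation; the class names, the regularization constant `delta`, and the fixed exploration width `sigma` are illustrative, and the policy-weight update and the offline planning pass are omitted.

```python
import numpy as np

class RecursiveLSTDLambda:
    """Recursive LSTD(lambda): keep P ~= A^{-1} so each sample costs O(d^2)."""
    def __init__(self, dim, gamma=0.99, lam=0.7, delta=1e-2):
        self.gamma, self.lam = gamma, lam
        self.P = np.eye(dim) / delta   # inverse of the regularized A matrix
        self.b = np.zeros(dim)
        self.z = np.zeros(dim)         # eligibility trace

    def update(self, phi, reward, phi_next):
        self.z = self.gamma * self.lam * self.z + phi
        u, v = self.z, phi - self.gamma * phi_next
        Pu = self.P @ u
        # Sherman-Morrison: (A + u v^T)^{-1} = P - (P u)(v^T P) / (1 + v^T P u)
        self.P -= np.outer(Pu, v @ self.P) / (1.0 + v @ Pu)
        self.b += reward * self.z

    def theta(self):
        """Value-function weight vector theta = A^{-1} b."""
        return self.P @ self.b

class LinearGaussianPolicy:
    """Continuous action drawn from N(w^T phi(s), sigma^2)."""
    def __init__(self, dim, sigma=0.3, seed=0):
        self.w = np.zeros(dim)
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)

    def sample(self, phi):
        return self.rng.normal(self.w @ phi, self.sigma)

# Single synthetic transition, purely for illustration.
critic = RecursiveLSTDLambda(dim=4)
policy = LinearGaussianPolicy(dim=4)
phi, phi_next = np.array([1., 0., 0., 0.]), np.array([0., 1., 0., 0.])
action = policy.sample(phi)
critic.update(phi, reward=1.0, phi_next=phi_next)
print(critic.theta())
```

The point of Sherman–Morrison here is that A grows by one rank-one outer product per transition, so A⁻¹ can be updated directly without refactorizing; this is what keeps the online LSTD(λ) pass cheap enough to run alongside planning.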

Keywords

problem solving, control methods, heuristic search methods, dynamic programming

Acknowledgements

This work was partially supported by the Innovation Center of Novel Software Technology and Industrialization, the National Natural Science Foundation of China (Grant Nos. 61772355, 61702055, 61303108, 61373094, 61472262, 61502323, 61502329), the Natural Science Foundation of Jiangsu (BK2012616), the Provincial Natural Science Foundation of Jiangsu (BK20151260), the High School Natural Foundation of Jiangsu (13KJB520020, 16KJD520001), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04, 93K172017K18), and the Suzhou Industrial Application of Basic Research Program (SYG201422).

Supplementary material

11704_2017_6222_MOESM1_ESM.ppt (184 kb)
Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation

Copyright information

© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Shan Zhong (1, 2, 3, 5)
  • Quan Liu (1, 4, 5)
  • Zongzhang Zhang (1)
  • Qiming Fu (1, 3, 5, 6)

  1. School of Computer Science and Technology, Soochow University, Suzhou, China
  2. School of Computer Science and Engineering, Changshu Institute of Technology, Changshu, China
  3. Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, China
  4. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, China
  5. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China
  6. College of Electronic & Information Engineering, Suzhou University of Science and Technology, Suzhou, China