
Journal of Intelligent & Robotic Systems, Volume 86, Issue 2, pp 153–173

Survey of Model-Based Reinforcement Learning: Applications on Robotics

  • Athanasios S. Polydoros
  • Lazaros Nalpantidis

Abstract

Reinforcement learning is an appealing approach for allowing robots to learn new tasks. The relevant literature offers a plethora of methods, but at the same time reveals a lack of implementations that cope with real-life challenges. Current expectations raise the demand for adaptable robots. We argue that, by employing model-based reinforcement learning, the currently limited adaptability of robotic systems can be expanded. Moreover, model-based reinforcement learning exhibits advantages that make it more applicable to real-life use cases than model-free methods. Thus, in this survey, we cover model-based methods that have been applied in robotics and categorize them based on the derivation of the optimal policy, the definition of the returns function, the type of the transition model, and the learned task. Finally, we discuss the applicability of model-based reinforcement learning approaches to new applications, taking into consideration the state of the art in both algorithms and hardware.
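
As a concrete illustration of the pattern surveyed here, the following minimal Python sketch separates the two ingredients of model-based reinforcement learning: a transition and reward model estimated from interaction data, and a policy derived purely by planning against that learned model. The toy corridor environment and all names are hypothetical and do not correspond to any specific method covered in the survey.

# Minimal sketch of model-based RL: learn a certainty-equivalent model from
# experience, then derive a policy by planning against the learned model.
# Environment and names are illustrative only.
import numpy as np

N_STATES, N_ACTIONS, GOAL = 6, 2, 5          # states 0..5; actions: 0 = left, 1 = right
gamma = 0.95
rng = np.random.default_rng(0)

def step(state, action):
    """True dynamics (unknown to the learner): move left/right, reward 1 at the goal."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    return nxt, float(nxt == GOAL)

# 1) Collect experience with an exploratory (here: uniformly random) policy.
counts  = np.zeros((N_STATES, N_ACTIONS, N_STATES))   # transition counts
rew_sum = np.zeros((N_STATES, N_ACTIONS))              # accumulated rewards
for _ in range(2000):
    s = rng.integers(GOAL)                              # any non-goal start state
    a = rng.integers(N_ACTIONS)
    s2, r = step(s, a)
    counts[s, a, s2] += 1
    rew_sum[s, a]    += r

# 2) Fit the model: empirical transition probabilities and mean rewards.
n_sa = np.maximum(counts.sum(axis=2), 1)
P = counts / n_sa[..., None]                            # P[s, a, s'] ~ Pr(s' | s, a)
R = rew_sum / n_sa                                      # R[s, a]    ~ E[r | s, a]

# 3) Plan against the learned model (value iteration), without further interaction.
Q = np.zeros((N_STATES, N_ACTIONS))
for _ in range(200):
    Q = R + gamma * P @ Q.max(axis=1)

print("Greedy policy (0 = left, 1 = right):", Q.argmax(axis=1))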

Keywords

Intelligent robotics, Machine learning, Model-based reinforcement learning, Robot learning, Policy search, Transition models, Reward functions

Copyright information

© Springer Science+Business Media Dordrecht 2017

Authors and Affiliations

  1. Department of Mechanical and Manufacturing Engineering, Aalborg University, Copenhagen SV, Denmark
