
AdaPT: Zero-Shot Adaptive Policy Transfer for Stochastic Dynamical Systems

  • Conference paper
  • In: Robotics Research

Abstract

Model-free policy learning has enabled good performance on complex tasks that were previously intractable with traditional control techniques. However, the very high sample complexity of model-free methods makes training directly on a physical target system infeasible, so policies must instead be trained on an approximate model of the system, and model mismatch due to dynamics parameter differences and unmodeled dynamics error may cause suboptimal or unsafe behavior upon direct transfer. We introduce the Adaptive Policy Transfer for Stochastic Dynamics (AdaPT) algorithm, which achieves provably safe and robust, dynamically-feasible zero-shot transfer of RL policies to new domains with dynamics error. AdaPT combines the strengths of offline policy learning in a black-box source simulator with online tube-based MPC to attenuate bounded dynamics mismatch between the source and target dynamics. AdaPT allows online transfer of policies, trained solely in simulation offline, to a family of unknown targets without fine-tuning. We also formally show that (i) AdaPT guarantees bounded state and control deviation through state-action tubes under relatively weak technical assumptions, and (ii) AdaPT results in a bounded loss of reward accumulation relative to a policy trained and evaluated in the source environment. We evaluate AdaPT on two continuous, non-holonomic simulated dynamical systems with four different disturbance models, and find that AdaPT performs between \(50\%\) and \(300\%\) better on mean reward accrual than direct policy transfer.
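
As a rough illustration of the procedure the abstract describes, the Python sketch below separates the two phases of AdaPT: a policy trained offline in a black-box source simulator, and an online correction that keeps the target system close to the nominal trajectory despite dynamics mismatch. It is a minimal sketch under placeholder assumptions: the double-integrator source model, the hand-written stand-in for the learned policy, and the fixed feedback gain used in place of the paper's tube-based MPC optimization are all illustrative and do not come from the paper.

import numpy as np

DT = 0.05

def source_dynamics(x, u):
    # Nominal (simulator) double-integrator: state x = [position, velocity].
    return np.array([x[0] + DT * x[1], x[1] + DT * u])

def target_dynamics(x, u, rng, drag=0.3, noise=0.01):
    # Perturbed target: unmodeled drag plus a bounded disturbance.
    w = rng.uniform(-noise, noise, size=2)
    return np.array([x[0] + DT * x[1], x[1] + DT * (u - drag * x[1])]) + w

def offline_policy(x, goal=1.0):
    # Stand-in for the policy learned offline in the source simulator.
    return 2.0 * (goal - x[0]) - 1.5 * x[1]

def adapt_step(x_nom, x_real, k_fb=np.array([4.0, 2.5])):
    # One step of the transfer loop: the policy acts on the nominal state,
    # and a feedback correction (a crude stand-in for the tube-based MPC
    # subproblem) steers the real state back toward the nominal trajectory.
    u_nom = offline_policy(x_nom)
    u = u_nom + float(k_fb @ (x_nom - x_real))
    x_nom_next = source_dynamics(x_nom, u_nom)  # nominal rollout, no correction
    return u, x_nom_next

rng = np.random.default_rng(0)
x_nom = np.zeros(2)
x_real = np.zeros(2)
for _ in range(100):
    u, x_nom = adapt_step(x_nom, x_real)
    x_real = target_dynamics(x_real, u, rng)
print("target state:", x_real, " nominal state:", x_nom)

The structural point the sketch preserves is that the nominal trajectory is rolled out under the source dynamics with the unmodified policy action, while the target system receives that action plus a correction computed from the deviation; this is how bounded mismatch can be attenuated online without retraining or fine-tuning the policy.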



Acknowledgements

This work was supported by the Office of Naval Research (Grant N00014-15-1-2673) and by the Toyota Research Institute (“TRI”). This article solely reflects the opinions and conclusions of its authors and not ONR, TRI, or any other Toyota entity. James Harrison was supported in part by the Stanford Graduate Fellowship and the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information

Corresponding author

Correspondence to James Harrison.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Harrison, J. et al. (2020). AdaPT: Zero-Shot Adaptive Policy Transfer for Stochastic Dynamical Systems. In: Amato, N., Hager, G., Thomas, S., Torres-Torriti, M. (eds) Robotics Research. Springer Proceedings in Advanced Robotics, vol 10. Springer, Cham. https://doi.org/10.1007/978-3-030-28619-4_34
