Runtime-Safety-Guided Policy Repair

Conference paper

In: Runtime Verification (RV 2020)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 12399)

Abstract

We study the problem of policy repair for learning-based control policies in safety-critical settings. We consider an architecture in which a high-performance learning-based control policy (e.g., one trained as a neural network) is paired with a model-based safety controller. The safety controller is endowed with the ability to predict whether the trained policy will lead the system to an unsafe state, and to take over control when necessary. While this architecture can provide added safety assurances, intermittent and frequent switching between the trained policy and the safety controller can result in undesirable behaviors and reduced performance. We propose to reduce or even eliminate control switching by 'repairing' the trained policy based on runtime data produced by the safety controller, in a way that deviates minimally from the original policy. The key idea behind our approach is the formulation of a trajectory optimization problem that allows joint reasoning about the policy update and the safety constraints. Experimental results demonstrate that our approach is effective even when the true system model is unknown and the safety controller relies only on an approximation.
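To make the switching architecture concrete, the following minimal Python sketch illustrates the control loop the abstract describes: the trained policy acts by default, a model-based safety controller predicts whether that policy would reach an unsafe state over a short horizon and takes over when it would, and each intervention is logged as runtime data for a subsequent repair step. Everything here (the toy dynamics, the two controllers, the horizon and safety limit) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of the switching architecture described in the abstract,
# assuming toy dynamics and controllers; none of the names below
# (dynamics, learned_policy, safety_controller, ...) come from the paper.
import numpy as np

HORIZON = 10   # lookahead steps used by the safety controller
X_LIMIT = 1.0  # toy safety constraint: |position| must stay below this


def dynamics(x, u):
    """Toy double-integrator model; x = [position, velocity]."""
    pos, vel = x
    return np.array([pos + 0.1 * vel, vel + 0.1 * u])


def learned_policy(x):
    """Stand-in for the trained (e.g., neural-network) policy."""
    return 2.0 * x[1] + 0.5  # high-performance but unverified control law


def safety_controller(x):
    """Stand-in model-based fallback: brake toward zero velocity."""
    return -5.0 * x[1]


def is_safe(x):
    return abs(x[0]) < X_LIMIT


def rollout_is_safe(x, policy, horizon=HORIZON):
    """Predict whether following `policy` keeps the system safe."""
    for _ in range(horizon):
        x = dynamics(x, policy(x))
        if not is_safe(x):
            return False
    return True


def step(x):
    """Use the trained policy unless its predicted rollout is unsafe,
    in which case the safety controller takes over."""
    if rollout_is_safe(x, learned_policy):
        return learned_policy(x), "learned"
    return safety_controller(x), "safety"


interventions = []  # (state, safe action) pairs: runtime data for repair
x = np.array([0.0, 0.0])
for t in range(50):
    u, source = step(x)
    if source == "safety":
        interventions.append((x.copy(), u))
    x = dynamics(x, u)

print(f"safety controller intervened {len(interventions)} times")
```

The paper's repair step then uses runtime data of this kind inside a trajectory optimization that jointly encodes the policy update and the safety constraints, so that the repaired policy deviates minimally from the original while no longer triggering interventions; the sketch above only shows where that data comes from.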


Notes

  1. We use the terms 'controller' and 'control policy' (or simply 'policy') interchangeably in this paper. The latter is more common in the machine learning literature.

  2. A proof can be found in the extended version: https://arxiv.org/abs/2008.07667.

  3. https://gym.openai.com/envs/MountainCarContinuous-v0/ (a minimal usage sketch follows these notes).
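For reference, a minimal sketch of loading the benchmark environment from footnote 3, assuming the classic OpenAI Gym API (where reset returns an observation and step returns a 4-tuple), which matches Gym versions contemporaneous with this paper; the random action is a placeholder, not the trained policy studied in the paper.

```python
# Minimal sketch: loading the MountainCarContinuous-v0 benchmark (footnote 3).
# Assumes the classic OpenAI Gym API (gym < 0.26); requires `pip install gym`.
import gym

env = gym.make("MountainCarContinuous-v0")
obs = env.reset()
for _ in range(200):
    action = env.action_space.sample()  # placeholder for a trained policy
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```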


Acknowledgements

We gratefully acknowledge the support from the National Science Foundation (NSF) under grant 1646497.

Author information

Corresponding author

Correspondence to Weichao Zhou.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhou, W., Gao, R., Kim, B., Kang, E., Li, W. (2020). Runtime-Safety-Guided Policy Repair. In: Deshmukh, J., Ničković, D. (eds) Runtime Verification. RV 2020. Lecture Notes in Computer Science(), vol 12399. Springer, Cham. https://doi.org/10.1007/978-3-030-60508-7_7

  • DOI: https://doi.org/10.1007/978-3-030-60508-7_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60507-0

  • Online ISBN: 978-3-030-60508-7

  • eBook Packages: Computer Science (R0)
