Runtime-Safety-Guided Policy Repair

Conference paper

In: Runtime Verification (RV 2020)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 12399)

Abstract

We study the problem of policy repair for learning-based control policies in safety-critical settings. We consider an architecture in which a high-performance learning-based control policy (e.g., one trained as a neural network) is paired with a model-based safety controller. The safety controller is endowed with the ability to predict whether the trained policy will lead the system to an unsafe state, and to take over control when necessary. While this architecture can provide added safety assurances, intermittent and frequent switching between the trained policy and the safety controller can result in undesirable behaviors and reduced performance. We propose to reduce or even eliminate control switching by 'repairing' the trained policy based on runtime data produced by the safety controller, in a way that deviates minimally from the original policy. The key idea behind our approach is the formulation of a trajectory optimization problem that allows joint reasoning about the policy update and the safety constraints. Experimental results demonstrate that our approach is effective even when the true system model is unknown and the safety controller relies only on an approximation.
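To make the switching architecture concrete, the following minimal Python sketch illustrates the control loop the abstract describes: the trained policy acts by default, a model-based safety controller predicts whether that policy would reach an unsafe state over a short horizon and takes over when it would, and each intervention is logged as runtime data for a subsequent repair step. Everything here (the toy dynamics, the two controllers, the horizon and safety limit) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of the switching architecture described in the abstract,
# assuming toy dynamics and controllers; none of the names below
# (dynamics, learned_policy, safety_controller, ...) come from the paper.
import numpy as np

HORIZON = 10   # lookahead steps used by the safety controller
X_LIMIT = 1.0  # toy safety constraint: |position| must stay below this


def dynamics(x, u):
    """Toy double-integrator model; x = [position, velocity]."""
    pos, vel = x
    return np.array([pos + 0.1 * vel, vel + 0.1 * u])


def learned_policy(x):
    """Stand-in for the trained (e.g., neural-network) policy."""
    return 2.0 * x[1] + 0.5  # high-performance but unverified control law


def safety_controller(x):
    """Stand-in model-based fallback: brake toward zero velocity."""
    return -5.0 * x[1]


def is_safe(x):
    return abs(x[0]) < X_LIMIT


def rollout_is_safe(x, policy, horizon=HORIZON):
    """Predict whether following `policy` keeps the system safe."""
    for _ in range(horizon):
        x = dynamics(x, policy(x))
        if not is_safe(x):
            return False
    return True


def step(x):
    """Use the trained policy unless its predicted rollout is unsafe,
    in which case the safety controller takes over."""
    if rollout_is_safe(x, learned_policy):
        return learned_policy(x), "learned"
    return safety_controller(x), "safety"


interventions = []  # (state, safe action) pairs: runtime data for repair
x = np.array([0.0, 0.0])
for t in range(50):
    u, source = step(x)
    if source == "safety":
        interventions.append((x.copy(), u))
    x = dynamics(x, u)

print(f"safety controller intervened {len(interventions)} times")
```

The paper's repair step then uses runtime data of this kind inside a trajectory optimization that jointly encodes the policy update and the safety constraints, so that the repaired policy deviates minimally from the original while no longer triggering interventions; the sketch above only shows where that data comes from.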


Notes

  1. We use the terms 'controller' and 'control policy' (or simply 'policy') interchangeably in this paper. The latter is more common in the machine learning literature.

  2. A proof can be found in the extended version: https://arxiv.org/abs/2008.07667.

  3. https://gym.openai.com/envs/MountainCarContinuous-v0/ (a minimal usage sketch follows these notes).
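For reference, a minimal sketch of loading the benchmark environment from footnote 3, assuming the classic OpenAI Gym API (where reset returns an observation and step returns a 4-tuple), which matches Gym versions contemporaneous with this paper; the random action is a placeholder, not the trained policy studied in the paper.

```python
# Minimal sketch: loading the MountainCarContinuous-v0 benchmark (footnote 3).
# Assumes the classic OpenAI Gym API (gym < 0.26); requires `pip install gym`.
import gym

env = gym.make("MountainCarContinuous-v0")
obs = env.reset()
for _ in range(200):
    action = env.action_space.sample()  # placeholder for a trained policy
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```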


Acknowledgements

We gratefully acknowledge the support from the National Science Foundation (NSF) under grant 1646497.

Author information

Corresponding author

Correspondence to Weichao Zhou.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhou, W., Gao, R., Kim, B., Kang, E., Li, W. (2020). Runtime-Safety-Guided Policy Repair. In: Deshmukh, J., Ničković, D. (eds) Runtime Verification. RV 2020. Lecture Notes in Computer Science(), vol 12399. Springer, Cham. https://doi.org/10.1007/978-3-030-60508-7_7

  • DOI: https://doi.org/10.1007/978-3-030-60508-7_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60507-0

  • Online ISBN: 978-3-030-60508-7

  • eBook Packages: Computer Science (R0)
