Abstract
Reinforcement learning (RL) agents can learn optimal policies for sequential decision-making problems. Although increasingly successful, RL cannot be used in applications where unpredictable agent behaviour may have significant unintended negative consequences. We address this limitation by introducing an assured reinforcement learning (ARL) method that uses quantitative verification (QV) to restrict agent behaviour to regions that satisfy safety, reliability and performance constraints specified in probabilistic temporal logic. To this end, ARL builds an abstract Markov decision process (AMDP) that models the problem at a high level of abstraction, and uses QV to identify a set of Pareto-optimal AMDP policies that satisfy the constraints. These formally verified abstract policies define regions of the agent's behaviour space in which RL can occur without constraint violations. We demonstrate the effectiveness of ARL through two case studies: a benchmark flag-collection navigation task and an assisted-living planning system.
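The core idea of constraining RL to verified regions of the behaviour space can be illustrated with a minimal sketch. The example below is a hypothetical toy, not the chapter's implementation: the "allowed" action sets are assumed inputs that in ARL would be derived from quantitatively verified abstract policies (e.g. obtained with a probabilistic model checker such as PRISM), and the environment is a trivial corridor task.

```python
import random

# Toy illustration of ARL's restriction idea: Q-learning explores only
# state-action pairs permitted by a (here, hand-written) verified policy.
N_STATES = 5                                     # states 0..4 on a line; 4 is the goal
ALLOWED = {s: ({+1} if s == 0 else {+1, -1}) for s in range(N_STATES)}
ALLOWED[2] = {+1}                                # assume verification ruled out "left" in state 2

def step(state, action):
    """Deterministic toy environment: move left/right, reward 1 at the goal."""
    next_state = max(0, min(N_STATES - 1, state + action))
    return next_state, (1.0 if next_state == N_STATES - 1 else 0.0)

def q_learn(episodes=200, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in (+1, -1)}
    for _ in range(episodes):
        s = 0
        while s != N_STATES - 1:
            acts = list(ALLOWED[s])              # explore only verified-safe actions
            if rng.random() < eps:
                a = rng.choice(acts)
            else:
                a = max(acts, key=lambda b: q[(s, b)])
            s2, r = step(s, a)
            best_next = 0.0 if s2 == N_STATES - 1 else max(q[(s2, b)] for b in ALLOWED[s2])
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
    return q

q = q_learn()
policy = {s: max(ALLOWED[s], key=lambda b: q[(s, b)]) for s in range(N_STATES - 1)}
```

By construction, the learned policy can never select an action outside the allowed sets, so any constraint encoded in those sets holds throughout both learning and execution; here the agent learns to move right in every non-terminal state.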
Acknowledgements
This paper presents research sponsored by the UK MOD. The information contained in it should not be interpreted as representing the views of the UK MOD, nor should it be assumed it reflects any current or future UK MOD policy.
Copyright information
© 2018 Springer International Publishing AG
Cite this chapter
Mason, G., Calinescu, R., Kudenko, D., Banks, A. (2018). Assurance in Reinforcement Learning Using Quantitative Verification. In: Hatzilygeroudis, I., Palade, V. (eds) Advances in Hybridization of Intelligent Methods. Smart Innovation, Systems and Technologies, vol 85. Springer, Cham. https://doi.org/10.1007/978-3-319-66790-4_5
Print ISBN: 978-3-319-66789-8
Online ISBN: 978-3-319-66790-4