
1 Introduction

The rapid progress of artificial intelligence (AI) comes with a growing concern over its safety when deployed in real-life systems and situations. As highlighted in [3], if the objective function of an AI agent is wrongly specified, then maximizing that objective function may lead to harmful results. In addition, the objective function or the training data may focus only on accomplishing a specific task and ignore other aspects of the environment, such as safety constraints. In this paper, we propose a novel framework that combines explicit safety specification with learning from data. We consider safety specifications expressed in Probabilistic Computation Tree Logic (PCTL) and show how probabilistic model checking can be used to ensure safety while retaining the performance of a learning algorithm known as apprenticeship learning (AL).

We consider the formulation of apprenticeship learning by Abbeel and Ng [1]. The concept of AL is closely related to reinforcement learning (RL), where an agent learns what actions to take in an environment (known as a policy) by maximizing some notion of long-term reward. In AL, however, the agent is not given the reward function, but instead has to first estimate it from a set of expert demonstrations via a technique called inverse reinforcement learning [18]. The formulation assumes that the reward function is expressible as a linear combination of known state features. An expert demonstrates the task by maximizing this reward function and the agent tries to derive a policy that can match the feature expectations of the expert's demonstrations. Apprenticeship learning can also be viewed as an instance of the class of techniques known as Learning from Demonstration (LfD). One issue with LfD is that the expert often can only demonstrate how the task works but not how the task may fail, because failure may cause irrecoverable damage to the system, such as crashing a vehicle. In general, the lack of "negative examples" can heavily bias the reward estimate constructed by the learning agent. In fact, even if all the demonstrations are safe, the agent may still end up learning an unsafe policy.

The key idea of this paper is to incorporate formal verification in apprenticeship learning. We are inspired by the line of work on formal inductive synthesis [10] and counterexample-guided inductive synthesis [22]. Our approach is also similar in spirit to the recent work on safety-constrained reinforcement learning [11]. However, our approach uses the results of model checking in a novel way. We consider safety specification expressed in probabilistic computation tree logic (PCTL). We employ a verification-in-the-loop approach by embedding PCTL model checking as a safety checking mechanism inside the learning phase of AL. In particular, when a learnt policy does not satisfy the PCTL formula, we leverage counterexamples generated by the model checker to steer the policy search in AL. In essence, counterexample generation can be viewed as supplementing negative examples for the learner. Thus, the learner will try to find a policy that not only imitates the expert’s demonstrations but also stays away from the failure scenarios as captured by the counterexamples.

In summary, we make the following contributions in this paper.

  • We propose a novel framework for incorporating formal safety guarantees in Learning from Demonstration.

  • We develop a novel algorithm called CounterExample Guided Apprenticeship Learning (CEGAL) that combines probabilistic model checking with the optimization-based approach of apprenticeship learning.

  • We demonstrate that our approach can guarantee safety for a set of case studies and attain performance comparable to that of using apprenticeship learning alone.

The rest of the paper is organized as follows. Section 2 reviews background information on apprenticeship learning and PCTL model checking. Section 3 defines the safety-aware apprenticeship learning problem and gives an overview of our approach. Section 4 illustrates the counterexample-guided learning framework. Section 5 describes the proposed algorithm in detail. Section 6 presents a set of experimental results demonstrating the effectiveness of our approach. Section 7 discusses related work. Section 8 concludes and offers future directions.

2 Preliminaries

2.1 Markov Decision Process and Discrete-Time Markov Chain

A Markov Decision Process (MDP) is a tuple \(M = (S,A,T,\gamma ,s_0,R)\), where S is a finite set of states; A is a set of actions; \(T: S\times A\times S\rightarrow [0, 1]\) is a transition function describing the probability of transitioning from one state \(s\in S\) to another state by taking action \(a\in A\) in state s; \(R: S\rightarrow \mathbb {R}\ \)is a reward function which maps each state \(s\in S\) to a real number indicating the reward of being in state s; \(s_0\in S\) is the initial state; \(\gamma \in [0, 1)\) is a discount factor which describes how future rewards attenuate when a sequence of transitions is made. A deterministic and stationary (or memoryless) policy \(\pi : S \rightarrow A\) for an MDP M is a mapping from states to actions, i.e. the policy deterministically selects what action to take based solely on the current state. In this paper, we restrict ourselves to deterministic and stationary policies. A policy \(\pi \) for an MDP M induces a Discrete-Time Markov Chain (DTMC) \(M_{\pi }=(S, T_\pi , s_0)\), where \(T_\pi :S\times S\rightarrow [0, 1]\) is the probability of transitioning from a state s to another state in one step. A trajectory \(\tau = s_0\xrightarrow {T_\pi (s_0, s_1)>0} s_1\xrightarrow {T_\pi (s_1, s_2)>0} s_2,\ldots ,\) is a sequence of states where \(s_i\in S\). The accumulated reward of \(\tau \) is \(\sum \limits _{i=0}^{\infty } \gamma ^i R(s_i)\). The value function \(V_\pi : S\rightarrow \mathbb {R}\) measures the expected accumulated reward \(E[\sum \limits _{i=0}^{\infty }\gamma ^i R(s_i)]\) when starting from a state s and following policy \(\pi \). An optimal policy \(\pi \) for an MDP M is a policy that maximizes the value function [4].
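To make the notation concrete, the following minimal Python sketch (ours, not the paper's implementation) represents an MDP, the DTMC induced by a deterministic stationary policy, and the value function obtained by solving the policy-evaluation linear system \(V = R + \gamma T_\pi V\).

```python
# Minimal sketch: an MDP, the DTMC induced by a deterministic stationary
# policy, and exact policy evaluation. Not the paper's implementation.
import numpy as np

class MDP:
    def __init__(self, T, R, gamma, s0):
        # T[a] is the |S| x |S| transition matrix for action a
        self.T, self.R, self.gamma, self.s0 = T, R, gamma, s0
        self.nS = R.shape[0]

def induced_dtmc(mdp, policy):
    """T_pi[s, s'] = T(s, policy(s), s') for a deterministic policy."""
    return np.stack([mdp.T[policy[s]][s] for s in range(mdp.nS)])

def evaluate_policy(mdp, policy):
    """Value function V_pi, obtained by solving (I - gamma * T_pi) V = R."""
    T_pi = induced_dtmc(mdp, policy)
    return np.linalg.solve(np.eye(mdp.nS) - mdp.gamma * T_pi, mdp.R)
```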

2.2 Apprenticeship Learning via Inverse Reinforcement Learning

Inverse reinforcement learning (IRL) aims at recovering the reward function R of \(M\backslash R = (S, A, T, \gamma , s_0)\) from a set of m trajectories \(\varGamma _E=\{\tau _0, \tau _1, \ldots , \tau _{m-1}\}\) demonstrated by an expert. Apprenticeship learning (AL) [1] assumes that the reward function is a linear combination of state features, i.e. \(R(s) = \omega ^Tf(s)\) where \(f : S \rightarrow [0,\ 1]^k\) is a vector of known features over states S and \(\omega \in \mathbb {R}^k\) is an unknown weight vector that satisfies \(||\omega ||_2\le 1\). The expected features of a policy \(\pi \) are the expected values of the cumulative discounted state features f(s) obtained by following \(\pi \) on M, i.e. \(\mu _\pi = E[\sum ^\infty _{t=0} \gamma ^t f(s_t) | \pi ]\). Let \(\mu _{E}\) denote the expected features of the unknown expert's policy \(\pi _E\). \(\mu _{E}\) can be approximated by the expected features of the expert's m demonstrated trajectories \(\hat{\mu }_E=\frac{1}{m} \sum \limits _{\tau \in \varGamma _E}\sum \limits _{t=0}^{\infty }\gamma ^t f(s_{t})\) when m is large enough. With a slight abuse of notation, we also use \(\mu _\varGamma \) to denote the expected features of a set of paths \(\varGamma \). Given an error bound \(\epsilon \), a policy \(\pi ^*\) is said to be \(\epsilon \)-close to \(\pi _E\) if its expected features \(\mu _{\pi ^*}\) satisfy \(||\mu _E - \mu _{\pi ^*}||_2 \le \epsilon \). The expected features of a policy can be computed using Monte Carlo methods, value iteration, or linear programming [1, 4].
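For illustration, the empirical estimate \(\hat{\mu }_E\) can be computed directly from the demonstrated trajectories; a minimal sketch, assuming states are integer indices and `features[s]` stores \(f(s)\):

```python
import numpy as np

def empirical_feature_expectations(trajectories, features, gamma):
    """Estimate mu_E = (1/m) * sum_tau sum_t gamma^t f(s_t) from expert
    trajectories; `features` is an |S| x k matrix of state features."""
    mu = np.zeros(features.shape[1])
    for tau in trajectories:              # tau is a list of state indices
        for t, s in enumerate(tau):
            mu += (gamma ** t) * features[s]
    return mu / len(trajectories)
```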

The algorithm proposed by Abbeel and Ng [1] starts with a random policy \(\pi _0\) and its expected features \(\mu _{\pi _0}\). Assuming that in iteration i, a set of i candidate policies \(\varPi =\{\pi _0, \pi _1,\ldots , \pi _{i-1}\}\) and their corresponding expected features \(\{\mu _\pi |\pi \in \varPi \}\) have been found, the algorithm solves the following optimization problem.

$$\begin{aligned} \delta = \max \limits _{\omega }\min \limits _{\pi \in \varPi }\ \omega ^T(\hat{\mu }_E - \mu _{\pi })\qquad s.t.\,||\omega ||_2\le 1 \end{aligned}$$
(1)

The optimal \(\omega \) is used to find the corresponding optimal policy \(\pi _{i}\) and its expected features \(\mu _{\pi _i}\). If \(\delta \le \epsilon \), the algorithm terminates and \(\pi _{i}\) is produced as the output. Otherwise, \(\pi _i\) is added to the candidate policy set \(\varPi \) along with its expected features \(\mu _{\pi _i}\), and the algorithm continues to the next iteration.
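The max-min problem (1) can be rewritten as maximizing a margin variable \(\delta \) subject to one linear constraint per candidate policy and the norm bound on \(\omega \), which is a second-order cone program. A sketch using cvxpy follows; the solver choice is our assumption and is not prescribed by the paper.

```python
import cvxpy as cp

def max_margin_weights(mu_E, mu_candidates):
    """Solve (1): maximize delta s.t. omega^T (mu_E - mu_pi) >= delta for
    every candidate policy pi and ||omega||_2 <= 1."""
    omega = cp.Variable(mu_E.shape[0])
    delta = cp.Variable()
    constraints = [cp.norm(omega, 2) <= 1]
    constraints += [omega @ (mu_E - mu_pi) >= delta for mu_pi in mu_candidates]
    cp.Problem(cp.Maximize(delta), constraints).solve()
    return delta.value, omega.value
```

If the returned \(\delta \) is at most \(\epsilon \), the loop terminates; otherwise the new \(\omega \) is handed to a standard MDP solver to produce the next candidate policy.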

2.3 PCTL Model Checking

Probabilistic model checking can be used to verify properties of a stochastic system such as "is the probability that the agent reaches the unsafe area within 10 steps smaller than 5%?". Probabilistic Computation Tree Logic (PCTL) [7] allows for probabilistic quantification of properties. The syntax of PCTL includes state formulas and path formulas [13]. A state formula \(\phi \) asserts a property of a single state \(s\in S\), whereas a path formula \(\psi \) asserts a property of a trajectory.

$$\begin{aligned} \phi \, {:}{:}\!=&true\ |\ l_i\ |\ \lnot \phi _i\ |\ \phi _i \wedge \phi _j\ |\ P_{\bowtie p^*}[\psi ]\end{aligned}$$
(2)
$$\begin{aligned} \psi \,{:}{:}\!=&\mathbf{X } \phi \ |\ \phi _1 \mathbf{U } ^{\le k}\phi _2\ |\ \phi _1\mathbf{U } \phi _2\ \end{aligned}$$
(3)

where \(l_i\) is an atomic proposition and \(\phi _i, \phi _j\) are state formulas; \(\,\bowtie \,\in \{\le , \ge , <, >\}\); \(P_{\bowtie p^*}[\psi ]\) means that the probability of generating a trajectory that satisfies formula \(\psi \) is \(\bowtie p^*\). \(\mathbf{X } \phi \) asserts that the state following the initial state in the trajectory satisfies \(\phi \); \(\phi _1\,\mathbf{U } ^{\le k}\,\phi _2\) asserts that \(\phi _2\) is satisfied within at most k transitions and all preceding states satisfy \(\phi _1\); \(\phi _1\,\mathbf{U } \,\phi _2\) asserts that \(\phi _2\) will eventually be satisfied and all preceding states satisfy \(\phi _1\). The semantics of PCTL is defined by a satisfaction relation \(\models \) as follows.

$$\begin{aligned} s\,\models \,&true&~ \text {iff state }s\in S\end{aligned}$$
(4)
$$\begin{aligned} s\,\models \,&\phi&~ \text {iff state s satisfies the state formula }\phi \end{aligned}$$
(5)
$$\begin{aligned} \tau \,\models \,&\psi&~ \text {iff trajectory }\tau \text { satisfies the path formula }\psi . \end{aligned}$$
(6)

Additionally, \(\models _{min}\) denotes the minimal satisfaction relation [6] between \(\tau \) and \(\psi \). Define \(pref(\tau )\) as the set of all prefixes of trajectory \(\tau \) including \(\tau \) itself; then \(\tau \,\models _{min}\,\psi \) iff \((\tau \,\models \,\psi ) \wedge (\forall \tau '\in pref(\tau )\backslash \tau , \tau ' \nvDash \psi )\). For instance, if \(\psi =\phi _1\,\mathbf{U } ^{\le k}\,\phi _2\), then for any finite trajectory \(\tau \,\models _{min}\,\phi _1 \mathbf{U } ^{\le k}\phi _2\), only the final state in \(\tau \) satisfies \(\phi _2\). Let \(P(\tau )\) be the probability of transitioning along a trajectory \(\tau \), and let \(\varGamma _\psi \) be the set of all finite trajectories \(\tau \) such that \(\tau \,\models _{min}\,\psi \). The value of the PCTL property \(\psi \) is then defined as \(P_{=?|s_0}[\psi ]=\sum \limits _{\tau \in \varGamma _\psi }P(\tau )\). For a DTMC \(M_{\pi }\) and a state formula \(\phi = P_{\le p^*}[\psi ]\), \(M_{\pi }\,\models \,\phi \) iff \(P_{=?|s_0}[\psi ]\le p^*\).
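As an illustration, the value of a bounded-until formula on a DTMC can be computed by the standard backward recursion over the step bound; a minimal sketch:

```python
import numpy as np

def bounded_until_prob(T_pi, sat_phi1, sat_phi2, k):
    """P_{=?}[phi1 U^{<=k} phi2] for every state of the DTMC T_pi;
    sat_phi1 and sat_phi2 are boolean arrays over states."""
    p = sat_phi2.astype(float)
    for _ in range(k):
        p_next = T_pi @ p
        p_next[~sat_phi1 & ~sat_phi2] = 0.0
        p_next[sat_phi2] = 1.0
        p = p_next
    return p

# e.g. M_pi |= P_{<= p*}[ true U^{<=64} unsafe ] iff
# bounded_until_prob(T_pi, np.ones(nS, bool), unsafe_mask, 64)[s0] <= p_star
```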

A counterexample of \(\phi \) is a set \(cex\subseteq \varGamma _\psi \) that satisfies \(\sum \limits _{\tau \in cex}P(\tau )> p^*\). Let \(\mathbb {P}(\varGamma ) = \sum \limits _{\tau \in \varGamma }P(\tau )\) be the sum of the probabilities of all trajectories in a set \(\varGamma \). Let \(CEX_{\phi }\subseteq 2^{\varGamma _\psi }\) be the set of all counterexamples for a formula \(\phi \), so that \((\forall cex\in CEX_{\phi },\mathbb {P}(cex)> p^*)\) and \((\forall \varGamma \in 2^{\varGamma _\psi }\backslash CEX_{\phi }, \mathbb {P}(\varGamma )\le p^*)\). A minimal counterexample is a set \(cex\in CEX_{\phi }\) such that \(\forall cex'\in CEX_{\phi }, |cex|\le |cex'|\). By converting the DTMC \(M_\pi \) into a weighted directed graph, a counterexample can be found by solving a k-shortest paths (KSP) problem or a hop-constrained KSP (HKSP) problem [6]. Alternatively, counterexamples can be found by using Satisfiability Modulo Theories (SMT) solving or mixed integer linear programming to determine minimal critical subsystems that capture the counterexamples in \(M_\pi \) [23].
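The following sketch illustrates the path-enumeration view of counterexample generation: trajectories are enumerated in order of decreasing probability (a best-first search over \(-\log \) transition probabilities, in the spirit of the KSP formulation) until their total probability exceeds \(p^*\). For brevity it ignores the step bound of bounded-until formulas, which the HKSP variant would enforce; it is an illustrative sketch rather than the algorithm of [6] or [23].

```python
import heapq
import math

def most_probable_paths(T_pi, s0, targets):
    """Lazily yield finite trajectories from s0 that first hit `targets`,
    in order of decreasing probability (best-first over -log probabilities)."""
    heap = [(0.0, [s0])]                      # (-log prob, path)
    while heap:
        neglog, path = heapq.heappop(heap)
        s = path[-1]
        if s in targets:
            yield math.exp(-neglog), path     # a minimal satisfying trajectory
            continue
        for s_next, p in enumerate(T_pi[s]):
            if p > 0.0:
                heapq.heappush(heap, (neglog - math.log(p), path + [s_next]))

def counterexample(T_pi, s0, targets, p_star, max_paths=10000):
    """Collect highest-probability violating trajectories until their total
    probability exceeds p*; returns None if no counterexample is found."""
    cex, total = [], 0.0
    for prob, path in most_probable_paths(T_pi, s0, targets):
        cex.append(path)
        total += prob
        if total > p_star or len(cex) >= max_paths:
            break
    return cex if total > p_star else None
```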

A policy can also be synthesized by solving the objective \(\underset{\pi }{min}\ P_{=?}[\psi ]\) for an MDP M. This problem can be solved by linear programming or policy iteration (and value iteration for step-bounded reachability) [14].
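For the step-bounded case, such a synthesis step can be sketched with value iteration as below. Note that for step-bounded reachability the truly optimal scheduler is in general step-dependent; the stationary policy returned here simply keeps the action choices of the last iteration, while the returned probabilities are the exact step-bounded minima.

```python
import numpy as np

def min_bounded_reachability(T, unsafe, k):
    """Backward induction for min_pi P[true U^{<=k} unsafe].
    T[a] is the |S| x |S| transition matrix of action a; `unsafe` is a
    boolean mask over states."""
    p = unsafe.astype(float)
    policy = np.zeros(len(p), dtype=int)
    for _ in range(k):
        q = np.stack([T[a] @ p for a in range(len(T))])   # q[a, s]
        policy = q.argmin(axis=0)        # greedy action choices (heuristic)
        p = q.min(axis=0)
        p[unsafe] = 1.0
    return policy, p   # p[s0] is the minimal k-step reachability probability
```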

3 Problem Formulation and Overview

Suppose there are some unsafe states in an \(MDP\backslash R\) \(M = (S,A,T,\gamma ,s_0)\). A safety issue in apprenticeship learning means that an agent following the learnt policy has a higher probability of entering those unsafe states than it should. There are multiple reasons that can give rise to this issue. First, it is possible that the expert policy \(\pi _E\) itself has a high probability of reaching the unsafe states. Second, human experts often tend to perform only successful demonstrations that do not highlight the unwanted situations [21]. This lack of negative examples in the training set can leave the learning agent unaware of the existence of those unsafe states.

Fig. 1. The 8 \(\times \) 8 grid-world. (a) Lighter grid cells have higher rewards than the darker ones. The two black grid cells have the lowest rewards, while the two white ones have the highest rewards. The grid cells enclosed by red lines are considered unsafe. (b) The blue line is an example trajectory demonstrated by the expert. (c) Only the goal states are assigned high rewards and there is little difference between the unsafe states and their nearby states. As a result, the learnt policy has a high probability of passing through the unsafe states, as indicated by the cyan line. (d) \(p^*=20\%\). The learnt policy is optimal for a reward function that correctly assigns low rewards to the unsafe states. (Color figure online)

We use an 8 \(\times \) 8 grid-world navigation example, shown in Fig. 1, to illustrate this problem. An agent starts from the upper-left corner and moves from cell to cell until it reaches the lower-right corner. The 'unsafe' cells are enclosed by the red lines; these represent regions that the agent should avoid. In each step, the agent can choose to stay in the current cell or move to an adjacent cell, but with a \(20\%\) chance of moving randomly instead of following its decision. The goal area, the unsafe area and the reward mapping for all states are shown in Fig. 1(a). For each state \(s\in S\), its feature vector consists of 4 radial basis feature functions with respect to the squared Euclidean distances between s and the 4 states with the highest or lowest rewards as shown in Fig. 1(a). In addition, a specification \(\varPhi \) formalized in PCTL is used to capture the safety requirement. In (7), \(p^*\) is the required upper bound on the probability of reaching an unsafe state within \(t=64\) steps.

$$\begin{aligned} \varPhi \, {:}{:}\!= P_{\le p^*}[\texttt {true}\ \mathbf{U } ^{\le t}\ \texttt {unsafe}] \end{aligned}$$
(7)

Let \(\pi _E\) be the optimal policy under the reward map shown in Fig. 1(a). The probability of entering an unsafe region within 64 steps by following \(\pi _E\) is 24.6%. Now consider the scenario where the expert performs a number of demonstrations by following \(\pi _E\). All demonstrated trajectories in this case successfully reach the goal area without ever passing through any of the unsafe regions. Figure 1(b) shows a representative trajectory (in blue) among 10,000 such demonstrated trajectories. The reward map obtained by running the AL algorithm on these 10,000 demonstrations is shown in Fig. 1(c). Observe that only the goal area has been learnt, whereas the agent is oblivious to the unsafe regions (treating them in the same way as other dark cells). In fact, the probability of reaching an unsafe state within 64 steps with this policy turns out to be 82.6%, violating the safety requirement by a large margin. To make matters worse, the value of \(p^*\) may be decided or revised after a policy has been learnt. In that case, even the original expert policy \(\pi _E\) may be unsafe, e.g., when \(p^*=20\%\). Thus, we need to adapt the original AL algorithm so that it takes such safety requirements into account. Figure 1(d) shows the reward map learned using our proposed algorithm (described in detail later) for \(p^*=20\%\). It clearly matches the color differentiation in the original reward map and captures both the goal states and the unsafe regions. This policy has an unsafe probability of 19.0%. We are now ready to state our problem.
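As an illustration of the feature construction used in this example, the sketch below builds one radial-basis feature per anchor cell. The bandwidth `beta` and the exact anchor coordinates are assumptions for illustration; the paper does not report these values.

```python
import numpy as np

def grid_features(n, anchors, beta=0.25):
    """(n*n) x len(anchors) feature matrix for an n x n grid; feature j of
    cell (r, c) is exp(-beta * squared Euclidean distance to anchor j)."""
    feats = np.zeros((n * n, len(anchors)))
    for r in range(n):
        for c in range(n):
            s = r * n + c
            for j, (ar, ac) in enumerate(anchors):
                feats[s, j] = np.exp(-beta * ((r - ar) ** 2 + (c - ac) ** 2))
    return feats

# e.g. grid_features(8, anchors=[(r1, c1), (r2, c2), (r3, c3), (r4, c4)])
# with the two highest- and two lowest-reward cells of Fig. 1(a) as anchors.
```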

Definition 1

The safety-aware apprenticeship learning (SafeAL) problem is, given an \(MDP\backslash R\), a set of m trajectories \(\{\tau _0, \tau _1,\ldots , \tau _{m-1}\}\) demonstrated by an expert, and a specification \(\varPhi \), to learn a policy \(\pi \) that satisfies \(\varPhi \) and is \(\epsilon \)-close to the expert policy \(\pi _E\).

Remark 1

We note that a solution may not always exist for the SafeAL problem. While the decision problem of checking whether a solution exists is of theoretical interest, in this paper, we focus on tackling the problem of finding a policy \(\pi \) that satisfies a PCTL formula \(\varPhi \) (if \(\varPhi \) is satisfiable) and whose performance is as close to that of the expert’s as possible, i.e. we relax the condition on \(\mu _{\pi }\) being \(\epsilon \)-close to \(\mu _E\).

4 A Framework for Safety-Aware Learning

In this section, we describe a general framework for safety-aware learning. This framework utilizes information from both the expert demonstrations and a verifier. The proposed framework is illustrated in Fig. 2. Similar to the counterexample-guided inductive synthesis (CEGIS) paradigm [22], our framework consists of a verifier and a learner. The verifier checks whether a candidate policy satisfies the safety specification \(\varPhi \). In case \(\varPhi \) is not satisfied, the verifier generates a counterexample for \(\varPhi \). The main difference from CEGIS is that our framework considers not only functional correctness, e.g., safety, but also performance (as captured by the learning objective). Starting from an initial policy \(\pi _0\), each time the learner learns a new policy, the verifier checks whether the specification is satisfied. If it is, the policy is added to the candidate set; otherwise, the verifier generates a (minimal) counterexample and adds it to the counterexample set. During the learning phase, the learner uses both the counterexample set and the candidate set to find a policy that is close to the (unknown) expert policy and far away from the counterexamples. The goal is to find a policy that is \(\epsilon \)-close to the expert policy and satisfies the specification. For the grid-world example introduced in Sect. 3, when \(p^*=5\%\) (a stricter safety requirement than the expert policy \(\pi _E\) itself satisfies), our approach produces a policy with only a 4.2% probability of reaching an unsafe state within 64 steps (with the correspondingly inferred reward mapping shown in Fig. 1(d)).

Fig. 2. Our safety-aware learning framework. Given an initial policy \(\pi _0\), a specification \(\varPhi \) and a learning objective (as captured by \(\epsilon \)), the framework iterates between a verifier and a learner to search for a policy \(\pi ^*\) that satisfies both \(\varPhi \) and \(\epsilon \). One invariant that this framework maintains is that all the \(\pi _i\)'s in the candidate policy set satisfy \(\varPhi \).

Fig. 3. (a) Learn from expert. (b) Learn from both expert demonstrations and counterexamples.

Learning from a (minimal) counterexample \(cex_{\pi }\) of a policy \(\pi \) is similar to learning from expert demonstrations. The basic principle of the AL algorithm proposed in [1] is to find a weight vector \(\omega \) under which the expected reward of \(\pi _E\) maximally outperforms any mixture of the policies in the candidate policy set \(\varPi =\{\pi _0, \pi _1, \pi _2, \ldots \}\). Thus, \(\omega \) can be viewed as the normal vector of the hyperplane \(\omega ^T(\mu - \mu _E) = 0\) that has the maximal distance to the convex hull of the set \(\{\mu _{\pi }\,|\,\pi \in \varPi \}\) as illustrated in the 2D feature space in Fig. 3(a). It can be shown that \(\omega ^T \mu _\pi \ge \omega ^T \mu _{\pi '}\) for all previously found \(\pi '\)s. Intuitively, this helps to move the candidate \(\mu _\pi \) closer to \(\mu _E\). Similarly, we can apply the same max-margin separation principle to maximize the distance between the candidate policies and the counterexamples (in the \(\mu \) space). Let \({CEX}= \{cex_0, cex_1, cex_2, ...\}\) denote the set of counterexamples of the policies that do not satisfy the specification \(\varPhi \). Maximizing the distance between the convex hulls of the sets \(\{\mu _{cex}\,|\,cex\in {CEX}\}\) and \(\{\mu _{\pi }\,|\,\pi \in \varPi \}\) is equivalent to maximizing the distance between the parallel supporting hyperplanes of the two convex hulls as shown in Fig. 3(b). The corresponding optimization function is given in Eq. (8).

$$\begin{aligned} \delta = \max \limits _{\omega }\min \limits _{\pi \in \varPi , cex\in CEX}\ \omega ^T(\mu _{\pi }-\mu _{cex})\qquad s.t.\,||\omega ||_2\le 1 \end{aligned}$$
(8)

To attain good performance similar to that of the expert, we still want to learn from \(\mu _E\). Thus, the overall problem can be formulated as a multi-objective optimization problem that combines (1) and (8) into (9).

$$\begin{aligned} \max \limits _\omega \min \limits _{\pi \in \varPi , \tilde{\pi }\in \varPi , cex \in CEX} (\omega ^T (\mu _E - \mu _{\pi }),\ \omega ^T (\mu _{\tilde{\pi }} - \mu _{cex}) )\qquad s.t.\,||\omega ||_2\le 1 \end{aligned}$$
(9)

5 Counterexample-Guided Apprenticeship Learning

In this section, we introduce the CounterExample Guided Apprenticeship Learning (CEGAL) algorithm for solving the SafeAL problem. It can be viewed as a special case of the safety-aware learning framework described in the previous section. In addition to combining policy verification, counterexample generation and AL, our approach uses an adaptive weighting scheme to balance the separation from \(\mu _E\) against the separation from \(\mu _{cex}\).

$$\begin{aligned}&\underset{\omega }{\max }\underset{\pi \in \varPi _S,\tilde{\pi }\in \varPi _S, cex\in CEX}{\min }\ \omega ^T(k(\mu _E - \mu _{\pi })+(1-k)(\mu _{ \tilde{\pi }} - \mu _{cex}))\\&s.t.\, ||\omega ||_2\le 1 ,\, k\in [0, 1] \nonumber \\&\quad \ \,\omega ^T(\mu _E - \mu _{\pi })\le \omega ^T(\mu _E - \mu _{\pi '}),\ \forall \pi '\in \varPi _S \nonumber \\&\quad \ \,\omega ^T(\mu _{\tilde{\pi }} - \mu _{cex})\le \omega ^T(\mu _{ \tilde{\pi }'} - \mu _{cex'}),\ \forall \tilde{\pi }'\in \varPi _S, \forall cex'\in {CEX} \nonumber \end{aligned}$$
(10)

In essence, we take a weighted-sum approach to solving the multi-objective optimization problem (9). Assume that \(\varPi _S=\{\pi _{1},\pi _{2},\pi _{3},\ldots \}\) is a set of candidate policies that all satisfy \(\varPhi \) and that \({CEX} =\{cex_1,cex_2,cex_3,\ldots \}\) is a set of counterexamples. We introduce a parameter k and turn (9) into the weighted-sum optimization problem (10). Note that \(\pi \) and \(\tilde{\pi }\) can be different. The optimal \(\omega \) obtained from (10) can be used to generate a new policy \(\pi _\omega \) using algorithms such as policy iteration. We use a probabilistic model checker, such as PRISM [13], to check whether \(\pi _\omega \) satisfies \(\varPhi \). If it does, it is added to \(\varPi _S\). Otherwise, a counterexample generator, such as COMICS [9], is used to generate a (minimal) counterexample \(cex_{\pi _\omega }\), which is added to CEX.
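Because the two inner minimizations in (10) range over independent choices, the problem separates into two margin variables and, like (1), can be solved as an SOCP. A sketch using cvxpy (the solver choice is our assumption) is given below.

```python
import cvxpy as cp

def cegal_weights(mu_E, mus_safe, mus_cex, k):
    """Solve the weighted-sum problem (10) given the expected features of the
    expert (mu_E), of the safe candidate policies (mus_safe) and of the
    counterexamples (mus_cex), for a fixed weight k in [0, 1]."""
    omega = cp.Variable(mu_E.shape[0])
    d_exp, d_cex = cp.Variable(), cp.Variable()
    cons = [cp.norm(omega, 2) <= 1]
    cons += [omega @ (mu_E - mu_pi) >= d_exp for mu_pi in mus_safe]
    if mus_cex:
        cons += [omega @ (mu_pi - mu_c) >= d_cex
                 for mu_pi in mus_safe for mu_c in mus_cex]
    else:
        cons += [d_cex == 0]     # no counterexamples yet: (10) reduces to (1)
    # the min over independent choices separates into the two margins
    prob = cp.Problem(cp.Maximize(k * d_exp + (1 - k) * d_cex), cons)
    prob.solve()
    return omega.value, prob.value
```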

Algorithm 1. CounterExample Guided Apprenticeship Learning (CEGAL).

Algorithm 1 describes CEGAL in detail. With a constant \(sup=1\) and a variable \(inf\in [0, sup]\) as the upper and lower bounds respectively, the learner determines the value of k within \([inf, sup]\) in each iteration depending on the outcome of the verifier and uses k when solving (10) in line 27. Like most nonlinear optimization algorithms, this algorithm requires an initial guess, namely an initial safe policy \(\pi _0\) that makes \(\varPi _S\) nonempty. A good initial candidate is the maximally safe policy, which can be obtained, for example, using PRISM-games [15]. Without loss of generality, we assume this policy satisfies \(\varPhi \). Suppose that in iteration i, an intermediate policy \(\pi _i\) learnt in iteration \(i-1\) is verified to satisfy \(\varPhi \); then we increase inf to k and reset k to sup as shown in line 22. If \(\pi _i\) does not satisfy \(\varPhi \), we reduce k to \(k=\alpha \cdot inf + (1 - \alpha )k\) as shown in line 26, where \(\alpha \in (0, 1)\) is a step-length parameter. If \(|k-inf|\le \sigma \) and \(\pi _i\) still does not satisfy \(\varPhi \), the algorithm chooses from \(\varPi _S\) the best safe policy \(\pi ^*\), i.e. the one with the smallest margin to \(\pi _E\), as shown in line 24. If \(\pi _i\) satisfies \(\varPhi \) and is \(\epsilon \)-close to \(\pi _E\), the algorithm outputs \(\pi _i\) as shown in line 19. When \(\pi _i\) satisfies \(\varPhi \) and \(inf = sup = k = 1\), solving (10) is equivalent to solving (1) as in the original AL algorithm.
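The loop just described can be summarized by the following condensed sketch. The functions `model_check`, `generate_cex`, `best_response_policy` and `expected_features` are hypothetical wrappers (e.g. around PRISM, COMICS and policy iteration, respectively), not library APIs; `cegal_weights` refers to the sketch after Eq. (10). The sketch omits details of Algorithm 1 such as caching of expected features.

```python
import numpy as np

def cegal(mu_E, pi0, eps, sigma, alpha, max_iter=50):
    # Pi_S: candidate policies known to satisfy Phi; CEX: counterexamples.
    Pi_S, CEX = [pi0], []
    inf, sup = 0.0, 1.0
    k = sup
    for _ in range(max_iter):
        mus_S = [expected_features(p) for p in Pi_S]
        mus_C = [expected_features(c) for c in CEX]
        omega, _ = cegal_weights(mu_E, mus_S, mus_C, k)   # learner, Eq. (10)
        pi = best_response_policy(omega)                  # e.g. policy iteration
        if model_check(pi):                               # verifier: pi |= Phi ?
            if np.linalg.norm(mu_E - expected_features(pi)) <= eps:
                return pi                                 # safe and eps-close
            Pi_S.append(pi)
            inf, k = k, sup                               # raise inf, reset k
        else:
            CEX.append(generate_cex(pi))                  # negative examples
            if abs(k - inf) <= sigma:
                break                                     # give up shrinking k
            k = alpha * inf + (1 - alpha) * k             # shrink k toward inf
    # fall back to the safe candidate closest to the expert features
    return min(Pi_S, key=lambda p: np.linalg.norm(mu_E - expected_features(p)))
```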

Remark 2

The initial policy \(\pi _0\) does not have to be maximally safe, although such a policy can be used to verify if \(\varPhi \) is satisfiable at all. Naively safe policies often suffice for obtaining a safe and performant output at the end. Such a policy can be obtained easily in many settings, e.g., in the grid-world example one safe policy is simply staying in the initial cell. In both cases, \(\pi _0\) typically has very low performance since satisfying \(\varPhi \) is the only requirement for it.

Theorem 1

Given an initial policy \(\pi _0\) that satisfies \(\varPhi \), Algorithm 1 is guaranteed to output a policy \(\pi ^*\), such that (1) \(\pi ^*\) satisfies \(\varPhi \), and (2) the performance of \(\pi ^*\) is at least as good as that of \(\pi _0\) when compared to \(\pi _E\), i.e. \(\Vert \mu _E - \mu _{\pi ^*}\Vert _2\le \Vert \mu _E - \mu _{\pi _0}\Vert _2\).

Proof Sketch. The first part of the guarantee can be proven by case splitting. Algorithm 1 outputs \(\pi ^*\) either when \(\pi ^*\) satisfies \(\varPhi \) and is \(\epsilon \)-close to \(\pi _E\), or when \(|k-inf|\le \sigma \) in some iteration. In the first case, \(\pi ^*\) clearly satisfies \(\varPhi \). In the second case, \(\pi ^*\) is selected from the set \(\varPi _S\), which contains all the policies found to satisfy \(\varPhi \) so far, so \(\pi ^*\) satisfies \(\varPhi \). For the second part of the guarantee, the initial policy \(\pi _0\) is the final output \(\pi ^*\) if \(\pi _0\) satisfies \(\varPhi \) and is \(\epsilon \)-close to \(\pi _E\). Otherwise, \(\pi _0\) is added to \(\varPi _S\) if it satisfies \(\varPhi \). If \(|k-inf|\le \sigma \) in some iteration, then the final output is \(\pi ^*=\underset{{\pi }\in \varPi _S}{argmin}||\mu _E - \mu _{{\pi }}||_2\), so it must satisfy \(\Vert \mu _E - \mu _{\pi ^*}\Vert _2\le \Vert \mu _E - \mu _{\pi _0}\Vert _2\). If a learnt policy \(\pi ^*\) satisfies \(\varPhi \) and is \(\epsilon \)-close to \(\pi _E\), then Algorithm 1 outputs \(\pi ^*\) without adding it to \(\varPi _S\). Since \(\Vert \mu _E - \mu _{{\pi }}\Vert _2>\epsilon \) for all \({\pi }\in \varPi _S\), it again follows that \(\Vert \mu _E - \mu _{\pi ^*}\Vert _2\le \Vert \mu _E - \mu _{\pi _0}\Vert _2\).

Discussion. In the worst case, CEGAL returns the initial safe policy. However, this can happen because a policy that both satisfies \(\varPhi \) and is \(\epsilon \)-close to the expert's demonstrations simply does not exist. Compared to AL, which offers no safety guarantee, and to the maximally safe policy, which typically has very poor performance, CEGAL provides a principled way of guaranteeing safety while retaining performance.

Convergence. Algorithm 1 is guaranteed to terminate. Let \(inf_t\) be the \(t^{th}\) assigned value of inf. After \(inf_t\) is given, k is decreased from \(k_0=sup\) iteratively by \(k_{i}=\alpha \cdot inf_t + (1 - \alpha )k_{i-1}\) until either \(|k_i-inf_t|\le \sigma \) in line 24 or a new safe policy is found in line 18. The update of k satisfies the following equality.

$$\begin{aligned} \frac{|k_{i+1} - inf_t|}{|k_i - inf_t|}= & {} \frac{\alpha \cdot inf_t + (1 - \alpha )k_i - inf_t}{k_i - inf_t} = 1-\alpha \end{aligned}$$
(11)

Thus, it takes no more than \(1+\log _{1-\alpha }\frac{\sigma }{sup - inf_t}\) iterations for the algorithm either to terminate in line 24 or to find a new safe policy in line 18. If a new safe policy is found in line 18, inf is updated in line 22 to the current value of k, i.e. \(inf_{t+1}=k\), which satisfies \(inf_{t+1} - inf_t \ge (1-\alpha )\sigma \). After the assignment of \(inf_{t+1}\), the iterative update of k resumes. Since \(sup-inf_t \le 1\), the following inequality holds.

$$\begin{aligned} \frac{|inf_{t+1} - sup|}{|inf_{t} - sup|}\le & {} \frac{sup -inf_{t} - (1-\alpha )\sigma }{sup- inf_{t}}\le 1 - (1-\alpha )\sigma \end{aligned}$$
(12)

Starting from an initial \(inf=inf_0<sup\), the alternating updates of inf and k keep moving inf closer to sup unless the algorithm terminates in line 24 or a safe policy \(\epsilon \)-close to \(\pi _E\) is found in line 19. In the extreme case, \(inf=sup\) after no more than \(\frac{sup-inf_0}{(1-\alpha )\sigma }\) updates of inf, at which point the problem reduces to AL. Therefore, the worst case of this algorithm consists of two phases. In the first phase, inf increases from \(inf=0\) to \(inf=sup\). Between two consecutive updates \((t, t+1)\) of inf, there are no more than \(\log _{1-\alpha }\frac{(1-\alpha )\sigma }{sup - inf_t}\) updates of k before inf is increased from \(inf_t\) to \(inf_{t+1}\). Overall, this phase takes no more than

$$\begin{aligned} {\underset{0\le i< \frac{sup-inf_0}{(1-\alpha )\sigma }}{\sum }\log _{1-\alpha } \frac{(1-\alpha )\sigma }{sup-inf_0-i\cdot (1-\alpha )\sigma }} = {\underset{0\le i< \frac{1}{(1-\alpha )\sigma }}{\sum }\log _{1-\alpha } \frac{(1-\alpha )\sigma }{1-i\cdot (1-\alpha )\sigma }} \end{aligned}$$
(13)

iterations to reduce the multi-objective optimization problem to the original apprenticeship learning problem, after which the second phase begins. Since \(k=sup\), the iteration stops immediately once an unsafe policy is learnt, as in line 24. This phase takes no more iterations than the original AL algorithm does to converge; the convergence result of AL is given in [1].
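For concreteness, the inner-loop bound above can be instantiated with the parameter values used in the experiments of Sect. 6 (\(\alpha =0.5\), \(\sigma =10^{-5}\)); the following small calculation, taking \(inf_t=0\), is purely illustrative.

```python
import math

# Inner-loop bound 1 + log_{1-alpha}(sigma / (sup - inf_t)) for the
# experimental settings alpha = 0.5, sigma = 1e-5, with inf_t = 0.
alpha, sigma, sup, inf_t = 0.5, 1e-5, 1.0, 0.0
inner_bound = 1 + math.log(sigma / (sup - inf_t), 1 - alpha)
print(math.ceil(inner_bound))   # 18: at most ~18 updates of k per value of inf
```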

In each iteration, the algorithm first solves a second-order cone programming (SOCP) problem (10) to learn a policy. SOCP problems can be solved in polynomial time by interior-point (IP) methods [12]. PCTL model checking for DTMCs can be done in time linear in the size of the formula and polynomial in the size of the state space [7]. Counterexample generation can be performed either by enumerating paths using a k-shortest-paths algorithm or by determining a critical subsystem using an SMT formulation or mixed integer linear programming (MILP) [23]. The k-shortest-paths-based approach can be computationally expensive when \(p^*\) is large, since it may need to enumerate a large number of paths (i.e. a large k). This can be alleviated by using a smaller \(p^*\) during the computation, which is equivalent to considering only the paths that have high probabilities.

6 Experiments

We evaluate our algorithm on three case studies: (1) grid-world, (2) cart-pole, and (3) mountain-car. The cart-pole environment and the mountain-car environment are obtained from OpenAI Gym. All experiments are carried out on a quad-core i7-7700K processor running at 3.6 GHz with 16 GB of memory. Our prototype tool is implemented in Python. The parameters are \(\gamma =0.99, \epsilon =10, \sigma =10^{-5}, \alpha =0.5\) and the maximum number of iterations is 50. For the OpenAI Gym experiments, in each step the agent sends an action to the OpenAI environment and the environment returns an observation and a reward (0 or 1). We show that our algorithm can guarantee safety while retaining the performance of the learnt policy compared with using AL alone.

6.1 Grid World

We first evaluate the scalability of our tool using the grid-world example. Table 1 shows the average runtime (per iteration) of the individual components of our tool as the size of the grid-world increases. The first and second columns indicate the size of the grid world and the resulting state space. The third column shows the average runtime that policy iteration takes to compute an optimal policy \(\pi \) for a known reward function. The fourth column indicates the average runtime that policy iteration takes to compute the expected features \(\mu \) of a known policy. The fifth column indicates the average runtime of verifying the PCTL formula using PRISM. The last column indicates the average runtime of generating a counterexample using COMICS.

Table 1. Average runtime per iteration in seconds.

6.2 Cart-Pole from OpenAI Gym

In the cart-pole environment shown in Fig. 4(a), the goal is to keep the pole on a cart from falling over for as long as possible by moving the cart either to the left or to the right in each time step. The maximum number of steps is \(t=200\). The position, velocity and angle of the cart and the pole are continuous and observable, but the actual dynamics of the system are unknown.

Fig. 4. (a) The cart-pole environment. (b) The cart is at −0.3 and pole angle is \(-20^{\circ }\). (c) The cart is at 0.3 and pole angle is \(20^{\circ }\).

A maneuver is deemed unsafe if the pole angle exceeds \(\pm 20^{\circ }\) while the cart's horizontal position is beyond \(\pm 0.3\), as shown in Fig. 4(b) and (c). We formalize the safety requirement in PCTL as (14).

(14)
Table 2. In the cart-pole environment, higher average steps mean better performance. The safest policy is synthesized using PRISM-games.

We used 2000 expert demonstrations, in each of which the pole is held upright for all 200 steps without violating any of the safety conditions. The safest policy synthesized by PRISM-games is used as the initial safe policy. We also compare the policies learned by CEGAL for different safety thresholds \(p^*\). In Table 2, the policies are compared in terms of the model checking result ('MC Result') for the PCTL property in (14) on the constructed MDP, the average number of steps ('Avg. Steps') that a policy (executed in the OpenAI environment) holds the pole up across 5000 rounds (the higher the better), and the number of iterations ('Num. of Iters') the algorithm takes to terminate (either converging to an \(\epsilon \)-close policy, terminating due to \(\sigma \), or terminating after 50 iterations). The policy in the first row is the result of using AL alone, which has the best performance but also a 49.1% probability of violating the safety requirement. The safest policy, shown in the second row, is always safe but has almost no performance at all: it simply lets the pole fall and thus never risks moving the cart out of the range \([-0.3, 0.3]\). In contrast, the policies learnt using CEGAL always satisfy the safety requirement. From \(p^*\) = 30% to 10%, the performance of the learnt policy is comparable to that of the AL policy. However, when the safety threshold becomes very low, e.g., \(p^*\) = 5%, the performance of the learnt policy drops significantly. This reflects the fact that the tighter the safety condition, the less room the agent has to maneuver while achieving good performance.

6.3 Mountain-Car from OpenAI Gym

Our third experiment uses the mountain-car environment from OpenAI Gym. As shown in Fig. 5(a), a car starts from the bottom of the valley and tries to reach the mountaintop on the right as quickly as possible. In each time step the car can perform one of three actions: accelerating to the left, coasting, or accelerating to the right. The agent fails if the number of steps reaches the maximum (\(t=66\)). The velocity and position of the car are continuous and observable, while the exact dynamics are unknown. In this game setting, the car cannot reach the right mountaintop by simply accelerating to the right; it has to accumulate momentum first by moving back and forth in the valley. The safety rules we enforce are shown in Fig. 5(b). They correspond to speed limits when the car is close to the left or the right mountaintop (in case there is a cliff on the other side of the mountaintop). Similar to the previous experiments, we consider 2000 expert demonstrations, all of which successfully reach the right mountaintop without violating any of the safety conditions. The average number of steps for the expert to drive the car to the right mountaintop is 40. We formalize the safety requirement in PCTL as (15).

(15)
Fig. 5. (a) The original mountain-car environment. (b) The mountain-car environment with traffic rules: when the distance from the car to the left edge or the right edge is shorter than 0.1, the speed of the car should be lower than 0.04.

We compare the different policies using the same set of categories as in the cart-pole example. The numbers are averaged over 5000 runs. As shown in the first row, the policy learnt via AL has the highest probability of going over the speed limits. We observed that this policy made the car speed up all the way to the left mountaintop to maximize its potential energy. The safest policy corresponds to simply staying at the bottom of the valley. The policies learnt via CEGAL for safety thresholds \(p^*\) ranging from 60% to 50% not only have a lower probability of violating the speed limits but also achieve comparable performance. As the safety threshold \(p^*\) decreases further, the agent becomes more conservative and it takes more time for the car to finish the task. For \(p^*\) = 20%, the agent never succeeds in reaching the top within 66 steps (Table 3).

Table 3. In the mountain-car environment, lower average steps mean better performance. The safest policy is synthesized via PRISM-games.

7 Related Work

A taxonomy of AI safety problems is given in [3], where the issues of misspecified objectives or rewards and insufficient or poorly curated training data are highlighted. There have been several attempts to address these issues from different angles. The problem of safe exploration is studied in [8, 17]. In particular, the latter work proposes to add a safety constraint, evaluated by the amount of damage, to the optimization problem so that the optimal policy can maximize the return without violating the limit on the expected damage. An obvious shortcoming of this approach is that actual failures have to occur in order to properly assess damage.

Formal methods have been applied to the problem of AI safety. In [5], the authors propose to combine machine learning with reachability analysis for dynamical models to achieve high performance while guaranteeing safety. In this work, we focus on probabilistic models, which are natural in many modern machine learning methods. In [20], the authors propose to use formal specifications to synthesize a control policy for reinforcement learning. They consider specifications captured in Linear Temporal Logic, whereas we consider PCTL, which matches better with the underlying probabilistic model. Recently, the problem of safe reinforcement learning was explored in [2], where a monitor (called a shield) is used to enforce temporal logic properties either during the learning phase or the execution phase of the reinforcement learning algorithm. The shield provides a list of safe actions each time the agent makes a decision, so that the temporal property is preserved. In [11], the authors also propose an approach for controller synthesis in reinforcement learning. In this case, an SMT solver is used to find a scheduler (policy) for the synchronous product of an MDP and a DTMC so that it satisfies both a probabilistic reachability property and an expected-cost property. Another approach that leverages PCTL model checking is proposed in [16]: a so-called abstract Markov decision process (AMDP) model of the environment is first built, and PCTL model checking is then used to check the satisfiability of the safety specification. Our work is similar to these in spirit in its application of formal methods. However, while the concept of AL is closely related to reinforcement learning, an agent in the AL paradigm needs to learn a policy from demonstrations without knowing the reward function a priori.

A distinguishing characteristic of our method is the tight integration of formal verification with learning from data (apprenticeship learning in particular). Among imitation and apprenticeship learning methods, margin-based algorithms [1, 18, 19] try to maximize the margin between the expert's policy and all learnt policies until the one with the smallest margin is produced. The apprenticeship learning algorithm proposed by Abbeel and Ng [1] was largely motivated by support vector machines (SVMs), in that the feature expectations of the expert demonstrations are maximally separated from those of all other candidate policies. Our algorithm makes use of this observation when using counterexamples to steer the policy search process. Recently, the idea of learning from failed demonstrations has started to emerge. In [21], the authors propose an IRL algorithm that can learn from both successful and failed demonstrations. This is done by reformulating the maximum-entropy IRL algorithm of [24] to find a policy that maximally deviates from the failed demonstrations while approaching the successful ones as closely as possible. However, this entropy-based method requires obtaining many failed demonstrations and can be very costly in practice.

Finally, our approach is inspired by the work on formal inductive synthesis [10] and counterexample-guided inductive synthesis (CEGIS) [22]. These frameworks typically combine a constraint-based synthesizer with a verification oracle. In each iteration, the agent refines her hypothesis (i.e. generates a new candidate solution) based on counterexamples provided by the oracle. Our approach can be viewed as an extension of CEGIS where the objective is not just functional correctness but also meeting certain learning criteria.

8 Conclusion and Future Work

We propose a counterexample-guided approach for combining probabilistic model checking with apprenticeship learning to ensure the safety of the apprenticeship learning outcome. Our approach makes novel use of counterexamples to steer the policy search process by reformulating the feature-matching problem as a multi-objective optimization problem that additionally takes safety into account. Our experiments indicate that the proposed approach can guarantee safety and retain performance on a set of benchmarks, including examples drawn from OpenAI Gym. In the future, we would like to explore other imitation and apprenticeship learning algorithms and extend our techniques to those settings.