
1 Introduction

The standard models for dynamic stochastic systems with both probabilistic and nondeterministic behaviour are Markov decision processes (MDPs) [1–3]. They are widely used in verification of probabilistic systems [4, 5] in several ways. Firstly, in concurrent probabilistic systems, such as communication protocols, the nondeterminism arises from scheduling [6, 7]. Secondly, in probabilistic systems operating in open environments, such as various stochastic reactive systems, nondeterminism arises from environmental inputs [8, 9]. Thirdly, for underspecified probabilistic systems, a controller is synthesized, resolving the nondeterminism in a way that optimizes some objective, such as energy consumption or time constraints in embedded systems [4, 5].

In analysis of MDPs, the behaviour under all possible strategies (schedulers, controllers, policies) is examined. For example, in the first two cases, the result of the verification process is either a guarantee that a given property holds under all strategies, or a counterexample strategy. In the third case, either a witness strategy guaranteeing a given property is synthesized, or its non-existence is stated. In all settings, it is desirable that the output strategies be not only correct but also “small and understandable”. Intuitively, such a strategy has a representation small enough for a human debugger to read and understand where the bug is (in the verification setting), or for the programmer to implement in the device (in the synthesis setting). In this paper, we focus on the verification setting and illustrate our approach mainly on probabilistic protocols. Nonetheless, our results immediately carry over to the synthesis setting.

Obtaining a small and simple strategy may be impossible if the strategy is required to be optimal, i.e., in our setting reaching the error state with the highest possible probability. Therefore, there is a trade-off between simplicity and optimality of the strategies. However, in order to debug a system, a simple counterexample or a series thereof is more valuable than the most comprehensive, but incomprehensible counterexample. In practice, a simple strategy reaching the error with probability smaller by a factor of \(\varepsilon \), e.g. one per cent, is a more valuable source of information than a huge description of an optimal strategy. Similarly, controllers in embedded devices should strive for optimality, but only as long as they are small enough to fit in the device. In summary, we are interested in finding small and simple close-to-optimal (\(\varepsilon \)-optimal) strategies.

How can one obtain a small and simple strategy? This seems to require some understanding of the particular system and the bug. How can we do something like that automatically? The approaches so far have been limited to BDD representations of the strategy, or to generating subchains representing a subset of paths induced by the strategy. While BDDs provide a succinct representation, they are not easily readable or understandable. Further, subchains do not focus on the decisions the strategy makes at all. In contrast, a huge effort has been spent on methods to obtain “understanding” from large sets of data, using machine learning methods. In this paper, we propose to extend their use in verification, namely of reachability properties in MDPs, in several ways. Our first aim in using these methods is to efficiently exploit the structure that is present in the models, written e.g. in the PRISM language with variables and commands. This structure gets lost in the traditional numerical analysis of the MDPs generated from the PRISM language description. The second aim is to distil more information from the generated MDPs, namely the importance of each decision. Both lead to an improved understanding of the strategy’s decisions.

Our Approach. We propose three steps to obtain the desired strategies. Each of them has a positive effect on the resulting size.

(1) Obtaining a (Possibly Partially Defined and Liberal) \(\varepsilon \)-optimal Strategy. The \(\varepsilon \)-optimal strategies produced by standard methods, such as the value iteration of PRISM [10], may be too large to compute and overly specific. Firstly, as argued in [11], typically only a small fraction of the system needs to be explored in order to find an \(\varepsilon \)-optimal strategy, whereas most states are reached only with a very small probability. Without much loss, the strategy may be left undefined there. For example, in the MDP M depicted in Fig. 1, the decision in q (and the \(v_i\)’s) is almost irrelevant for the overall probability of reaching t from s. Such a partially defined strategy can be obtained using learning methods [11].

Secondly, while the usual strategies prescribe which action to play, liberal strategies leave more choices open. There are several advantages of liberal strategies, and similar notions of strategies, called permissive strategies, have been studied in [12–14]. A liberal strategy, instead of choosing an action in each state, chooses a set of actions to be played uniformly at every state. First, each liberal strategy represents a set of strategies, and thus covers more behaviour. Second, in counter-example guided abstraction-refinement (CEGAR) analysis, since liberal strategies can represent sets of counter-examples, they accelerate the abstraction-refinement loop by ruling out several counter-examples at once. Finally, they also allow for more robust learning of smaller strategies in Step 3. We show that such strategies can be obtained from the standard value iteration as well as from the methods of [11]. Further processing of the strategies in Steps 2 and 3 accepts liberal strategies as input and preserves liberality in the small representation of the strategy.

Fig. 1. An MDP M with reachability objective t

(2) Identifying Important Parts of the Strategy. We define a concept of importance of a state w.r.t. a strategy, corresponding to the probability of visiting the state by the strategy. Observe that only a fraction of states can be reached while following the strategy, and thus have positive importance. On the unreachable states, with zero importance, the definition of the strategy is useless. For instance, in M, both states p and q must have been explored when constructing the strategy in order to find out whether it is better to take action a or b. However, if the resulting strategy is to use b and d, the information about what to do in the \(u_i\)’s is useless. In addition, we consider the \(v_i\)’s to be of zero importance, too, since they are never reached on the way to the target.

Furthermore, apart from ignoring states with zero importance, we want to partially ignore decisions that are unlikely to be made (in less important states such as q), and in contrast, stress the decisions in important states likely to be visited (such as s). Note that this is difficult to achieve in data structures that remember all the stored data exactly, such as BDDs. Of course, we could store only the decisions in states with importance above a certain threshold. However, we obtain much smaller representations if we allow more variability and reflect the whole quantitative information, as shown in Step 3.

(3) Data Structures for Compact Representation of Strategies. The explicit representation of a strategy by a table of pairs (state, action to play) results in a huge amount of data since the systems often have millions of states. Therefore, a symbolic representation by binary decision diagrams (BDD) looks like a reasonable option. However, there are several drawbacks of using BDDs. Firstly, due to the bit-level representation of the state-action pairs, the resulting BDD is not very readable. Secondly, it is often still too large to be understood by a human, for instance due to a bad ordering of the variables. Thirdly, it cannot quantitatively reflect the differences in the importance of states.

Therefore, we propose to use decision trees instead, e.g. [15], a structure similar to BDDs, but with nodes labelled by various predicates over the system’s variables. They have several advantages. Firstly, they yield an explanation of the decision, as opposed to e.g. neural networks, and thus explain how the strategy works. Secondly, sophisticated algorithms for their construction, based on entropy, result in smaller representations than BDDs, where a good ordering of variables is known to be notoriously difficult to find [4]. Thirdly, as suggested in Step 2, they allow less stressed data to be remembered with lower probability if this sufficiently simplifies the tree and decreases its size. Finally, the major drawback of decision trees in machine learning—frequent overfitting of the training data—is not an issue in our setting since the tree is not used for classification of test data, but only of the training data.

Summary of Our Contribution. In summary, our contributions are as follows:

  • We provide a method for obtaining a succinct representation of \(\varepsilon \)-optimal strategies as decision trees. The method is based on a new concept of importance measure and on well-established machine learning techniques.

  • Experimental data show that even for some systems larger than the available memory, our method yields trees with only several dozen nodes.

  • We illustrate the understandability of the representation on several examples from PRISM benchmarks [16], reading off respective bug explanations.

Related Work. In artificial intelligence, compact (factored) representations of MDP structure have been developed using dynamic Bayesian networks [17, 18], probabilistic STRIPS [19], algebraic decision diagrams [20], and also decision trees [17]. Formalisms used to represent MDPs can, in principle, be used to represent values and policies as well. In particular, variants of decision trees are probably the most used [17, 21, 22]. For a detailed survey of compact representations see [23]. In the context of verification, MDPs are often represented using variants of (MT)BDDs [24–26], and strategies by BDDs [27].

Decision trees have been used in connection with real-time dynamic programming and reinforcement learning [28, 29]. Learning a compact decision tree representation of a policy has been investigated in [30] for the case of body sensor networks, but the paper aims at a completely different application field (a simple model of sensor networks as opposed to generic PRISM models), uses different objectives (discounted rewards), and does not consider the importance of a state that, as we show, may substantially decrease sizes of resulting policies.

Our results are related to the problem of computing minimal/small counterexamples in probabilistic verification. Most papers concentrate on solving this problem for Markov chains and linear-time properties [31–34], branching-time properties [35–37], and in the context of simulation [38]. A couple of tools have been developed for probabilistic counterexample generation, namely DiPro [39] and COMICS [40]. For a detailed survey see [41]. While previous approaches focus on presenting diagnostic paths forming the counterexample, our approach focuses on decisions made by the respective strategy.

Concerning MDPs, [33] uses mixed integer linear programming to compute minimal critical sub-systems, i.e. whole sub-MDPs, as opposed to a compact representation of the “right” decisions computed by our methods. [42] uses a directed on-the-fly search to compute sets of most probable diagnostic paths (which somewhat resembles our notion of importance), but the paths are encoded explicitly by AND/OR trees, as opposed to our use of decision trees. Neither of these papers takes advantage of the internal structure of states, and their methods substantially differ from ours. The notion of paths encoded as AND/OR trees has also been studied in [43] to represent probabilistic counter-examples visually as fault trees, and then derive causal (cause and effect) relationships between events. [44] develops an abstraction-based framework for model checking MDPs based on games, which allows one to trade compactness for precision, but does not give a procedure for constructing (a compact representation of) counterexample strategies. [45, 46] compute a smallest set of guarded commands (of a PRISM-like language) that induce a critical subsystem, but, unlike our methods, do not provide a compact representation of the actual decisions needed to reach an erroneous state; moreover, a command-based counterexample does not always exist.

Counter-examples play a crucial role in CEGAR analysis of MDPs, and have been widely studied; examples include game-based abstraction refinement [47], a non-compositional CEGAR approach for reachability [48] and safe-pCTL [49], compositional CEGAR approaches for safe-pCTL and qualitative logics [38, 50], and abstraction-refinement for quantitative properties [51, 52]. All of these works consider only a single strategy represented explicitly, whereas our approach considers a succinct representation of a set of strategies, and can accelerate the abstraction-refinement loop.

2 Preliminaries

We use \(\mathbb {N}\), \(\mathbb {Q}\), and \(\mathbb {R}\) to denote the sets of positive integers, rational and real numbers, respectively. The set of all rational probability distributions over a finite set X is denoted by \( Dist (X)\). Further, \(d\in Dist (X)\) is Dirac if \(d(x)=1\) for some \(x\in X\). Given a function \(f:X\rightarrow \mathbb {R}\), we write \({{\mathrm{arg\,max}}}_{x\in X} f(x)=\{x\in X\mid f(x)=\max _{x'\in X} f(x')\}\).

Markov Chains. A Markov chain is a tuple \(M = (L,P,\mu )\) where L is a finite set of locations, \(P:L\rightarrow Dist (L)\) is a probabilistic transition function, and \(\mu \in Dist (L)\) is the initial probability distribution. We denote the respective unique probability measure for M by \(\mathbb P\).

Markov Decision Processes. A Markov decision process (MDP) is a tuple \(G=(S,A, Act ,\delta ,\hat{s})\) where S is a finite set of states, A is a finite set of actions, \( Act : S\rightarrow 2^A\setminus \{\emptyset \}\) assigns to each state s the set \( Act (s)\) of actions enabled in s, \(\delta : S\times A\rightarrow Dist (S)\) is a probabilistic transition function that, given a state and an action, gives a probability distribution over the successor states, and \(\hat{s}\) is the initial state. A run in G is an infinite alternating sequence of states and actions \(\omega =s_1 a_1 s_2 a_2\cdots \) such that for all \(i \ge 1\), we have \(a_i\in Act (s_i)\) and \(\delta (s_i,a_i)(s_{i+1}) > 0\). A path of length k in G is a finite prefix \(w= s_1 a_1\cdots a_{k-1} s_k\) of a run in G.
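For concreteness, the sketches in the remainder of this paper assume the following minimal Python encoding of an MDP (an illustration of the definition only; the field names and the dictionary-based representation of \(\delta \) are our own choices and not the data structures used by PRISM):

    # Minimal sketch of an MDP G = (S, A, Act, delta, s_hat); the encoding
    # (sets and dictionaries, with delta(s, a) stored as a dictionary of
    # successor probabilities) is an illustrative assumption.
    from dataclasses import dataclass

    @dataclass
    class MDP:
        states: set   # S
        act: dict     # Act: state -> non-empty set of enabled actions
        delta: dict   # (state, action) -> {successor state: probability}
        init: object  # the initial state s_hat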

Strategies and Plays. Intuitively, a strategy (or a policy) in an MDP G is a “recipe” to choose actions. Formally, a strategy is a function \(\sigma :S\rightarrow Dist (A)\) that, given the current state of a play, gives a probability distribution over the enabled actions. In general, a strategy may randomize, i.e. return non-Dirac distributions. A strategy is deterministic if it gives a Dirac distribution for every argument.

A play of G determined by a strategy \(\sigma \) and a state \(\bar{s}\in S\) is a Markov chain \(G^\sigma _{\bar{s}}\) where the set of locations is S, the initial distribution \(\mu \) is Dirac with \(\mu (\bar{s})=1\) and

$$ P(s)(s')=\sum _{a\in A}\sigma (s)(a) \cdot \delta (s,a)(s')\,. $$

The induced probability measure is denoted by \(\mathbb P^{\sigma }_{\bar{s}}\) and “almost surely” or “almost all runs” refers to happening with probability 1 according to this measure. We usually write \(\mathbb P^{\sigma }\) instead of \(\mathbb P^{\sigma }_{\hat{s}}\) (here \(\hat{s}\) is the initial state of G).
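As an illustration of the construction above, the transition function of the play \(G^\sigma _{\bar{s}}\) can be computed directly from the displayed formula; the sketch below uses the MDP encoding introduced earlier and represents \(\sigma \) as a dictionary mapping each state to a distribution over its enabled actions.

    # Sketch: transition function P and initial distribution mu of the play
    # G^sigma_sbar, following P(s)(s') = sum_a sigma(s)(a) * delta(s, a)(s').
    def induced_chain(mdp, sigma, s_bar):
        P = {}
        for s in mdp.states:
            dist = {}
            for a, p_a in sigma[s].items():
                for t, p_t in mdp.delta[(s, a)].items():
                    dist[t] = dist.get(t, 0.0) + p_a * p_t
            P[s] = dist
        mu = {s_bar: 1.0}  # Dirac initial distribution
        return P, mu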

Liberal Strategies. A liberal strategy is a function \(\varsigma :S\rightarrow 2^A\) such that for every \(s\in S\) we have that \(\emptyset \not = \varsigma (s)\subseteq Act (s)\). Given a liberal strategy \(\varsigma \) and a state s, an action \(a\in Act (s)\) is good (in s w.r.t. \(\varsigma \)) if \(a\in \varsigma (s)\), and bad otherwise. Abusing notation, we denote by \(\varsigma \) the strategy that to every state s assigns the uniform distribution on \(\varsigma (s)\) (which, in particular, allows us to use \(G^{\varsigma }_s\), \(\mathbb P^{\varsigma }_s\) and apply the notion of \(\varepsilon \)-optimality to \(\varsigma \)).

Reachability Objectives. Given a set \(F\subseteq S\) of target states, we denote by \(\Diamond {F}\) the set of all runs that visit a state of F. For a state \(s\in S\), the maximal reachability probability (or simply value) in s, is \( Val (s):=\max _{\sigma } \mathbb P^{\sigma }_s[\Diamond {F}]\). Given \(\varepsilon \ge 0\), we say that a strategy \(\sigma \) is \(\varepsilon \)-optimal if \(\mathbb P^{\sigma }[\Diamond {F}] \ge Val (\hat{s})-\varepsilon \), and we call a 0-optimal strategy optimal. To avoid overly technical notation, we assume that states of F, subject to the reachability objective, are absorbing, i.e. for all \(s\in F,a\in Act (s)\) we have \(\delta (s,a)(s)=1\).

End Components. A non-empty set \(S'\subseteq S\) is an end component (EC) of G if there is \( Act ':S'\rightarrow 2^A\setminus \{\emptyset \}\) such that (1) for all \(s\in S'\) we have \( Act '(s)\subseteq Act (s)\), (2) for all \(s\in S'\), we have \(a\in Act '(s)\) iff \(\delta (s,a)\in Dist (S')\), and (3) for all \(s,t\in S'\) there is a path \(\omega = s_1 a_1\cdots a_{k-1} s_k\) such that \(s_1 = s\), \(s_k=t\), and \(s_i\in S',a_i\in Act '(s_i)\) for every i. An end component is a maximal end component (MEC) if it is maximal with respect to the subset ordering. Given an MDP, the set of MECs is denoted by \(\mathsf {MEC}\). Given a MEC, actions of \( Act '(s)\) and \( Act (s)\setminus Act '(s)\) are called internal and external (in state s), respectively.

3 Computing \(\varepsilon \)-Optimal Strategies

There are many algorithms for solving quantitative reachability in MDPs, such as the value iteration, the strategy improvement, linear programming based methods etc., see [2]. The main method implemented in PRISM is the value iteration, which successively (under)approximates the value \( Val (s,a)=\sum _{s'\in S} \delta (s,a)(s')\cdot Val (s')\) of every state-action pair (s, a) by a value V(s, a), and stops when the approximation is good enough. Denoting \(V(s):=\max _{a\in Act (s)} V(s,a)\), every step of the value iteration improves the approximation V(s, a) by assigning \(V(s,a):=\sum _{s'\in S} \delta (s,a)(s')\cdot V(s')\) (we start with V such that \(V(s)=1\) if \(s\in F\), and \(V(s)=0\) otherwise).
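The following sketch illustrates this update scheme (values are updated in place and a stopping tolerance is used; both are illustrative simplifications of PRISM’s actual implementation and termination criteria):

    # Sketch of the value iteration described above: start from V(s) = 1 on F
    # and 0 elsewhere, and repeatedly assign
    # V(s, a) := sum_{s'} delta(s, a)(s') * V(s').
    def value_iteration(mdp, F, tol=1e-6):
        V = {(s, a): (1.0 if s in F else 0.0)
             for s in mdp.states for a in mdp.act[s]}

        def v(s):  # V(s) = max_{a in Act(s)} V(s, a)
            return max(V[(s, a)] for a in mdp.act[s])

        while True:
            change = 0.0
            for s in mdp.states:
                if s in F:
                    continue  # target states are absorbing
                for a in mdp.act[s]:
                    new = sum(p * v(t) for t, p in mdp.delta[(s, a)].items())
                    change = max(change, abs(new - V[(s, a)]))
                    V[(s, a)] = new
            if change < tol:
                return V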

The disadvantage of the standard value iteration (and also of most of the above-mentioned traditional methods) is that it works with the whole state space of the MDP (or at least with its reachable part). For instance, consider the states \(u_i,v_i\) of Fig. 1. The paper [11] adapts methods of bounded real-time dynamic programming (BRTDP, see e.g. [53]) to speed up the value iteration by improving V(s, a) only on “important” state-action pairs identified by simulations.

Even though RTDP methods may substantially reduce the size of an \(\varepsilon \)-optimal strategy, its explicit representation is usually large and difficult to understand. Thus we develop succinct representations of strategies, based on decision trees, that reduce the size even further and also provide a human-readable representation. Even though the above methods are capable of yielding deterministic \(\varepsilon \)-optimal strategies that can be immediately fed into machine learning algorithms, we found it advantageous to give the learning algorithm more freedom in the sense that if there are more \(\varepsilon \)-optimal strategies, we let the algorithm choose (uniformly). This is especially useful within MECs, where many actions have the same value. Therefore, we extract liberal \(\varepsilon \)-optimal strategies from the value approximation V, output either by the value iteration or by BRTDP.

Computing Liberal \(\varepsilon \)-Optimal Strategies. Let us show how to obtain a liberal strategy \(\varsigma \) from the value iteration or BRTDP. For simplicity, we start with MDPs without MECs.

MDP without End Components. We say that \(V:S\times A\rightarrow [0,1]\) is a valid \(\varepsilon \)-underapproximation if the following conditions hold:

  1. \(V(s,a)\le Val (s,a)\) for all \(s\in S\) and \(a\in A\)

  2. \( Val (\hat{s})-V(\hat{s})\le \varepsilon \)

  3. \(V(s,a)\le \sum _{s'\in S} \delta (s,a)(s')\cdot V(s')\) for all \(s\in S\) and \(a\in Act (s)\)

The outputs V of both the value iteration and BRTDP are valid \(\varepsilon \)-underapproximations. We define a liberal strategy \(\varsigma ^V\) by \(\varsigma ^V(s)={{\mathrm{arg\,max}}}_{a\in Act (s)}\, V(s,a)\) for all \(s\in S\).

Lemma 1

For every \(\varepsilon >0\) and a valid \(\varepsilon \)-underapproximation V, \(\varsigma ^V\) is \(\varepsilon \)-optimal.
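In code, extracting \(\varsigma ^V\) from a valid \(\varepsilon \)-underapproximation V amounts to keeping all V-maximal actions in each state; a sketch (reusing the MDP encoding above, and assuming exact comparison of the stored values) follows.

    # Sketch: the liberal strategy varsigma^V(s) = arg max_{a in Act(s)} V(s, a)
    # for an MDP without end components; V is the table produced by the value
    # iteration or BRTDP.
    def liberal_strategy(mdp, V):
        varsigma = {}
        for s in mdp.states:
            best = max(V[(s, a)] for a in mdp.act[s])
            varsigma[s] = {a for a in mdp.act[s] if V[(s, a)] == best}
        return varsigma  # varsigma[s] is the set of good actions in s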

General MDP. For MDPs with end components we have to extend the definition of a valid \(\varepsilon \)-underapproximation. Given a MEC \(S'\subseteq S\), we say that \((s,a)\in S\times A\) is maximal-external in \(S'\) if \(s\in S'\), \(a\in Act (s)\) is external and \(V(s,a)\ge V(s',a')\) for all \(s'\in S'\) and \(a'\in Act (s')\). A state \(s\in S'\) is an exit (of \(S'\)) if \((s,a)\) is maximal-external in \(S'\) for some \(a\in Act (s)\). We add the following condition to the definition of a valid \(\varepsilon \)-underapproximation:

  4. Each MEC \(S'\subseteq S\) has at least one exit.

Now the definition of \(\varsigma ^V\) is also more complicated:

  • For every \(s\in S\) which is not in any MEC, we put \(\varsigma ^V(s)={{\mathrm{arg\,max}}}_{a\in Act (s)}\, V(s,a)\).

  • For every \(s\in S\) which is in a MEC \(S'\),

    • if s is an exit, then \(\varsigma ^V(s)=\{a\in Act (s)\mid (s,a)\text { is maximal-external in }S'\}\)

    • otherwise, \(\varsigma ^V(s)=\{a\in Act (s)\mid a\text { is internal}\}\)

Using these extended definitions, Lemma 1 remains valid. Further, note that \(\varsigma ^V(s)\) is defined even for states with a trivial underapproximation \(V(s)=0\), for instance a state s that was never subject to any value iteration improvement. Then the values \(\varsigma (s)\) need not be stored explicitly, but follow implicitly from not storing any V(s), thus assuming \(V(s,\cdot )=0\).
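A sketch of this case distinction is shown below; it assumes that the MECs, together with their internal action sets \( Act '\), have already been computed by a standard MEC decomposition, which is not part of the sketch.

    # Sketch: varsigma^V for general MDPs. `mecs` is assumed to be a list of
    # pairs (S2, act2) where S2 is a MEC and act2[s] is the set of internal
    # actions Act'(s) of a state s in S2.
    def liberal_strategy_general(mdp, V, mecs):
        varsigma = {}
        in_mec = {s: (S2, act2) for (S2, act2) in mecs for s in S2}
        for s in mdp.states:
            if s not in in_mec:  # state outside all MECs: keep V-maximal actions
                best = max(V[(s, a)] for a in mdp.act[s])
                varsigma[s] = {a for a in mdp.act[s] if V[(s, a)] == best}
            else:
                S2, act2 = in_mec[s]
                top = max(V[(t, a)] for t in S2 for a in mdp.act[t])
                exits = {a for a in mdp.act[s] - act2[s] if V[(s, a)] == top}
                varsigma[s] = exits if exits else set(act2[s])  # exit, else internal
        return varsigma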

4 Importance of Decisions

Note that once we have computed an \(\varepsilon \)-optimal liberal strategy \(\varsigma \), we may, in principle, compute a compact representation of \(\varsigma \) (using e.g. BDDs), and obtain a strategy with possibly smaller representation than above.

However, we go one step further as follows. Given a liberal strategy \(\varsigma \) and a state \(s\in S\), we define the importance of s by

$$ Imp ^\varsigma (s):=\mathbb {P}^{\varsigma }[\Diamond s\mid \Diamond F] $$

the probability of visiting s conditioned on reaching F (afterwards). Intuitively, the importance is high for states where a good decision can help to reach the target.

Example 1

For the MDP of Fig. 1 with the objective \(\Diamond \{t\}\) and a strategy \(\varsigma \) choosing b, we have \( Imp ^{\varsigma }(s)=1\) and \( Imp ^{\varsigma }(q)=5/995\). Trivially, \( Imp ^\varsigma (t)=1\). For all other states, the importance is zero.

Obviously, decisions made in states of zero importance do not affect \(\mathbb P^{\varsigma }[\Diamond {F}]\) since these states never occur on paths from \(\hat{s}\) to F. However, note that many states of S may be reachable in \(G^{\varsigma }\) with positive but negligibly small probability. Clearly, the value of \(\mathbb P^{\varsigma }[\Diamond {F}]\) depends only marginally on choices made in these states. Formally, let \(\varsigma _{\varDelta }\) be a strategy obtained from \(\varsigma \) by changing each \(\varsigma (s)\) with \( Imp ^\varsigma (s)\le \varDelta \) to an arbitrary subset of \( Act (s)\). We obtain the following obvious property:

Lemma 2

For every liberal strategy \(\varsigma \), we have \(\displaystyle \lim _{\varDelta \rightarrow 0} \mathbb P^{\varsigma _{\varDelta }}[\Diamond {F}]=\mathbb P^{\varsigma }[\Diamond {F}]\).

In fact, every \(\varDelta <\min (\{ Imp ^\varsigma (s)\mid s\in S\}\setminus \{0\})\) satisfies \(\mathbb P^{\varsigma _{\varDelta }}[\Diamond {F}]=\mathbb P^{\varsigma }[\Diamond {F}]\). But often even larger \(\varDelta \) may give \(\mathbb P^{\varsigma _{\varDelta }}[\Diamond {F}]\) sufficiently close to \(\mathbb P^{\varsigma }[\Diamond {F}]\). Such a \(\varDelta \) may be found using e.g. a trial-and-error approach.
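In an explicit representation, forming \(\varsigma _{\varDelta }\) simply means dropping the entries whose importance is at most \(\varDelta \); a small sketch follows, where imp is an estimate of \( Imp ^\varsigma \) (obtained e.g. by simulation, as described in Sect. 5).

    # Sketch: varsigma_Delta obtained by not storing decisions in states whose
    # (estimated) importance is at most Delta.
    def threshold_strategy(varsigma, imp, delta):
        return {s: acts for s, acts in varsigma.items()
                if imp.get(s, 0.0) > delta}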

Most importantly, we can use the importance of a state to affect the probability that decisions in this state are indeed remembered in the data structure. Data structures with such a feature are used in various learning algorithms. In the next section, we discuss decision trees. Due to this extra variability in which decisions to learn, the resulting decision trees are smaller than BDDs for the strictly defined \(\varsigma _\varDelta \).

5 Efficient Representations

Let \(G=(S,A, Act ,\delta ,\hat{s})\) be an MDP. In order to symbolically represent strategies in G, we need to assume that states and actions have some internal structure. Inspired by the PRISM language [5], we consider a set \(\mathcal {V}=\{v_1,\ldots ,v_n\}\) of integer variables, where each \(v_i\) takes values from a finite domain \( Dom (v_i)\). We suppose that \(S=\prod _{i=1}^n Dom (v_i)\subseteq \mathbb {Z}^n\), i.e. each state is a vector of integers. Further, we assume that the MDP arises as a product of m modules, each of which can separately perform non-synchronizing actions as well as perform a synchronizing action jointly with the other modules. Therefore, we suppose \(A\subseteq \bar{A} \times \{0,\ldots ,m\}\), where \(\bar{A}\subseteq \mathbb {N}\) is a finite set and the second component determines the module performing the action (0 stands for synchronizing actions).

Since a liberal strategy is a function of the form \(\varsigma :S \rightarrow 2^A\), assigning to each state its good actions, it can be explicitly represented as a list of state-action pairs, i.e., as a subset of

$$\begin{aligned} S\times A=\prod _{i=1}^n Dom (v_i)\times \bar{A} \times \{0,1,\ldots ,m\} \end{aligned}$$
(1)

In addition, standard optimization algorithms implemented in PRISM use an explicit “don’t-care” value \(-2\) for the action in each unreachable state, meaning that the strategy is not defined there. However, one could simply not list these pairs at all. Thus a smaller list is obtained, containing only the states where \(\varsigma \) is defined. Recall that one may also omit states s satisfying \( Imp ^{\varsigma }(s)=0\), thus ignoring reachable states with zero probability of reaching the target. Further optimization may be achieved by omitting states s satisfying \( Imp ^{\varsigma }(s)<\varDelta \) for a suitable \(\varDelta >0\).

5.1 BDD Representation

The explicit set representation can be encoded as a binary decision diagram (BDD). This has been used in e.g. [27, 55]. The principle of the BDD representation of a set is that (1) each element is encoded as a string of bits and (2) an automaton, in the form of a binary directed acyclic graph, is created so that (3) the accepted language is exactly the set of the given bit strings. Although BDDs are quite efficient, see Sect. 6, each of these three steps can be significantly improved:

  1. Instead of a string of bits describing all variables, a string of integers (one per variable) can be used. Branching is then done not on the value of each bit, but according to an inequality comparing the variable to a constant. This significantly improves the readability.

  2. Instead of building the automaton according to a chosen order of bits, we let a heuristic choose the order of the inequalities and the actual constants in the inequalities.

  3. Instead of representing the language precisely, we allow the heuristic to choose which data to represent and which not. The likelihood that each datum is represented corresponds to its importance, which we provide as another input.

The latter two steps lead to significantly smaller graphs than BDDs. All this can be done in an efficient way using decision tree learning.

5.2 Representation Using Decision Trees

Decision Trees. A decision tree for a domain \(\prod _{i=1}^d X_i\subseteq \mathbb {Z}^d\) is a tuple \(\mathcal {T}=(T,\rho ,\theta )\) where T is a finite rooted binary (ordered) tree with a set of inner nodes N and a set of leaves L, \(\rho \) assigns to every inner node a predicate of the form \([x_i\sim const ]\) where \(i\in \{1,\ldots ,d\}\), \(x_i\in X_i\), \( const \in \mathbb {Z}\), \(\mathord {\sim }\in \{\le ,<,\ge ,>,=\}\), and \(\theta \) assigns to every leaf a value \( good \) or \( bad \).

Similarly to BDDs, the language \(\mathcal {L}(\mathcal {T})\subseteq \mathbb {N}^d\) of the tree is defined as follows. For a vector \(\mathbf {\bar{x}}=(\bar{x}_1,\ldots ,\bar{x}_d)\in \mathbb {N}^d\), we find a path p from the root to a leaf such that for each inner node n on the path, the predicate \(\rho (n)\) is satisfied by the substitution \(x_i=\bar{x}_i\) iff the first child of n is on p. Denote the leaf on this particular path by \(\ell \). Then \(\mathbf {\bar{x}}\) is in the language \(\mathcal {L}(\mathcal {T})\) of \(\mathcal T\) iff \(\theta (\ell )= good \).

Fig. 2. A decision tree for \(\{1,2,3,7\}\subseteq \{1,\ldots ,7\}\)

Example 2

Consider dimension \(d=1\), domain \(X_1=\{1,\ldots ,7\}\). A tree representing a set \(\{1,2,3,7\}\) is depicted in Fig. 2. To depict the ordered tree clearly, we use unbroken lines for the first child, corresponding to the satisfied predicate, and dashed line for the second one, corresponding to the unsatisfied predicate.

In our setting, we use the domain \(S\times A\) defined by Equation (1) which is of the form \(\prod _{i=1}^{n+2} X_i\) where for each \(1\le i\le n\) we have \(X_i= Dom (v_i)\), \(X_{n+1}=\bar{A}\) and \(X_{n+2}=\{0,1,\ldots ,m\}\). Here the coordinates \( Dom (v_i)\) are considered “unbounded” and, consequently, the respective predicates use inequalities. In contrast, we know the possible values of \(\bar{A}\times \{0,1,\ldots ,m\}\) in advance and they are not too many. Therefore, these coordinates are considered “discrete” and the respective predicates use equality. Examples of such trees are given in Sect. 6 in Figs. 4 and 5. Now a decision tree \(\mathcal {T}\) for this domain determines a liberal strategy \(\varsigma :S\rightarrow 2^A\) by \(a\in \varsigma (s)\) iff \((s,a)\in \mathcal {L}(\mathcal {T})\).
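The following sketch shows how such a tree is evaluated on a state-action pair and how the induced liberal strategy is read off. The tuple-based node layout (coordinate index, comparison operator, constant, first child, second child) is an illustrative choice of ours, not the format produced by the tools used later.

    # Sketch: membership in L(T) and the induced liberal strategy. Inner nodes
    # are tuples (i, op, const, first_child, second_child); leaves are the
    # strings "good" and "bad".
    import operator

    OPS = {"<=": operator.le, "<": operator.lt, ">=": operator.ge,
           ">": operator.gt, "=": operator.eq}

    def is_good(tree, x):  # x: state vector extended by the pair (action, module)
        node = tree
        while not isinstance(node, str):
            i, op, c, first, second = node
            node = first if OPS[op](x[i], c) else second
        return node == "good"

    def tree_strategy(tree, s, enabled):
        # good actions in state s according to the tree
        return {a for a in enabled if is_good(tree, tuple(s) + a)}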

Learning. We describe the process of learning a training set, which can also be understood as storing the input data. Given a training sequence (repetitions allowed!) \(\mathbf {x}^1,\ldots ,\mathbf {x}^k\), with each \(\mathbf {x}^i=(x^i_1,\ldots ,x^i_d)\in \mathbb {N}^d\), partitioned into the positive and negative subsequence, the process of learning according to the algorithm ID3 [15, 56] proceeds as follows:

  1. Start with a single node (root), and assign to it the whole training sequence.

  2. Given a node n with a sequence \(\tau \),

    • if all training examples in \(\tau \) are positive, set \(\theta (n)= good \) and stop;

    • if all training examples in \(\tau \) are negative, set \(\theta (n)= bad \) and stop;

    • otherwise,

      • choose a predicate with the “highest gain” (with lowest entropy, see e.g. [15, Sects. 3.4.1, 3.7.2]),

      • split \(\tau \) into the sequences satisfying and not satisfying the predicate, assign them to the first and the second child, respectively,

      • go to step 2 for each child.

Intuitively, the predicate with the highest gain splits the sequence so that it maximizes the portion of positive data in the satisfying subsequence and the portion of negative data in the non-satisfying subsequence.
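A compact sketch of this splitting step is given below (entropy-based gain only, no pruning); the representation of predicates as pairs of a description matching the node layout of the previous sketch and a test function is an assumption made for the illustration.

    # Sketch of ID3: grow the tree by always splitting on the predicate with
    # the highest gain (lowest weighted entropy of the split); C4.5 pruning
    # is omitted. The training sequence `seq` is a list of (example, label)
    # pairs with labels "good"/"bad".
    from math import log2

    def entropy(seq):
        if not seq:
            return 0.0
        p = sum(1 for _, lab in seq if lab == "good") / len(seq)
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    def id3(seq, predicates):  # predicates: list of ((i, op, const), test)
        labels = {lab for _, lab in seq}
        if len(labels) == 1:
            return labels.pop()  # all positive or all negative: a leaf
        def split_cost(pred):
            yes = [e for e in seq if pred[1](e[0])]
            no = [e for e in seq if not pred[1](e[0])]
            return (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(seq)
        descr, test = min(predicates, key=split_cost)  # highest-gain predicate
        yes = [e for e in seq if test(e[0])]
        no = [e for e in seq if not test(e[0])]
        if not yes or not no:  # chosen predicate does not separate: majority leaf
            good = sum(1 for _, lab in seq if lab == "good")
            return "good" if good >= len(seq) - good else "bad"
        return (*descr, id3(yes, predicates), id3(no, predicates))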

In addition, the final tree can be pruned. This means that some leaves are merged, resulting in a smaller tree at the cost of some imprecision of storing (the language of the tree changes). The pruning phase is quite sophisticated, hence for the sake of simplicity and brevity, we omit the details here. We use the standard C4.5 algorithm and refer to [15, 57]. In Sect. 6, we comment on effects of parameters used in pruning.

Learning a Strategy. Assume that we already have a liberal strategy \(\varsigma :S\rightarrow 2^A\). We show how we learn good and bad state-action pairs so that the language of the resulting tree is close to the set of good pairs. The training sequence will be composed of state-action pairs where good pairs are positive examples, and bad pairs are negative ones. Since our aim is to ensure that important states are learnt and not pruned away, we repeat pairs with more important states in the training sequence more frequently.

Formally, for every \(s\in S\) and \(a\in Act (s)\), we add the pair (s, a) to the training sequence \( repeat (s)\) times, where

$$ repeat (s)=c\cdot Imp ^\varsigma (s) $$

for some constant \(c\in \mathbb {N}\) (note that \( Imp ^\varsigma (s)\le 1\)). Since we want to avoid exact computation of \( Imp ^\varsigma (s)\), we estimate it using simulations. In practice, we thus run c simulation runs that reach the target and set \( repeat (s)\) to be the number of runs where s was also reached.
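This estimation and the construction of the training sequence can be sketched as follows; the function simulate (one simulation run under the uniform liberal strategy, truncated once F or a state with no path to F is reached) is assumed to be given and is not shown.

    # Sketch: estimate repeat(s) = c * Imp^varsigma(s) from c simulation runs
    # that reach the target, and build the training sequence of labelled
    # state-action pairs (assumes the target is reached with positive
    # probability under varsigma).
    def estimate_repeats(mdp, varsigma, F, simulate, c=10000):
        repeat = {}
        done = 0
        while done < c:
            path = simulate(mdp, varsigma)  # list of visited states
            if path[-1] not in F:
                continue  # only runs reaching the target count
            done += 1
            for s in set(path):
                repeat[s] = repeat.get(s, 0) + 1
        return repeat

    def training_sequence(mdp, varsigma, repeat):
        seq = []  # list of ((s, a), label) pairs
        for s, r in repeat.items():
            for a in mdp.act[s]:
                label = "good" if a in varsigma[s] else "bad"
                seq.extend([((s, a), label)] * r)
        return seq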

6 Experiments

In this section, we present the experimental evaluation of the presented methods, which we have implemented within the probabilistic model checker PRISM [5]. All the results presented in this section were obtained on a single Intel(R) Xeon(R) CPU (3.50 GHz) with memory limited to 10 GB.

First, we discuss several alternative options to construct the training data and to learn them in a decision tree. Further, we compare decision trees to other data structures, namely sets and BDDs, with respect to the sizes necessary for storing a strategy. Finally, we illustrate how the decision trees can be used to gain insight into our benchmarks.

6.1 Decision Tree Learning

Generating Training Data. The strategies we work with come from two different sources. Firstly, we consider strategies constructed by PRISM, which can be generated using the explicit or sparse model checking engine. Secondly, we consider strategies constructed by the BRTDP algorithm [11], which are defined on a part of the state space only.

Recall that given a strategy, the training data for the decision trees is constructed from c simulation runs according to the strategy. In our experiments, we found that \(c=10000\) produces good results in all the examples we consider. Note that we stop each simulation as soon as the target or a state with no path to the target state is reached.

Decision Tree Learning in Weka. The decision trees are constructed using the Weka machine learning package [58]. The Weka suite offers various decision tree classifiers. We use the J48 classifier, which is an implementation of the C4.5 algorithm [57]. The J48 classifier offers two parameters to control the pruning that affect the size of the decision tree:

  • The leaf size parameter \(M\in \mathbb {N}\) determines that each leaf node with fewer than M instances in the training data is merged with its siblings. Therefore, only values smaller than the number of instances per classification class are reasonable, since higher numbers always result in the trivial tree of size 1.

  • The confidence factor \(C\in (0,0.5)\) is used internally for determining the amount of pruning during decision tree construction. Smaller values incur more pruning and therefore smaller trees.

Detailed information and an empirical study of the parameters for J48 are available in [59].

Effects of the Parameters. We illustrate the effects of the parameters C and M on the resulting size of the decision tree on the mer benchmark. However, similar behaviour appears in all the examples. Figure 3a and b show the resulting size of the decision tree for several (random) executions. Each line in the plots corresponds to one decision tree, learned with 15 different values of the parameter. The C parameter scales linearly between 0.0001 and 0.5. The M parameter scales logarithmically between 1 and the minimum number of instances per class in the respective training set. The plots in Fig. 3 show that M is an effective parameter in calibrating the resulting tree size, whereas C plays less of a role. Hence, we use \(C=10^{-4}\). Furthermore, since the tree size is monotone in M, the parameter M can be used to retrieve a desired level of detail.

Fig. 3. Decision tree parameters

Figure 3c depicts the relation of the tree size to the relative error of the induced strategy. It shows that there is a threshold size under which the tree is not able to capture the strategy correctly anymore and the error rises quickly. Above the threshold size, the error is around \(1\,\%\), considered reasonable in order to extract reliable information. This threshold behaviour is observed in all our examples. Therefore, it is sensible to perform a binary search for the highest M ensuring the error at most \(1\,\%\).
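The search can be organized as in the sketch below, where learn_tree stands for the J48 call with leaf size parameter M and relative_error for model checking the strategy induced by the resulting tree; both are placeholders for the Weka and PRISM invocations, and the monotonicity observed above is assumed.

    # Sketch: binary search for the largest leaf size M whose tree keeps the
    # relative error of the induced strategy at most 1%.
    def largest_admissible_M(training_data, max_M, learn_tree, relative_error,
                             threshold=0.01):
        lo, hi, best = 1, max_M, None
        while lo <= hi:
            mid = (lo + hi) // 2
            tree = learn_tree(training_data, M=mid)
            if relative_error(tree) <= threshold:
                best, lo = (mid, tree), mid + 1  # admissible: try a larger M
            else:
                hi = mid - 1  # error too high: decrease M
        return best  # (M, tree), or None if no admissible M exists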

Table 1. Comparison of representation sizes for strategies obtained from PRISM and from BRTDP computation. Sizes are presented as the number of states for explicit lists of values, the number of nodes for BDDs, and the number of nodes for decision trees (DT). For DT, we display the tree size obtained from the binary search described above. DT Error reports the relative error of the strategy determined by the decision tree (on the induced Markov chain) compared to the optimal value, obtained by model checking with PRISM.

6.2 Results

First, we briefly introduce the four examples from the PRISM benchmark suite [16] on which we tested our method. Note that the majority of the states in the used benchmarks are non-deterministic, so the strategies are non-trivial in most states.

Table 2. Effects of various learning variants on the tree size. Smallest trees computed from PRISM or BRTDP and the average time to compute one number are presented.

firewire models the Tree Identify Protocol of the IEEE 1394 High Performance Serial Bus, which is used to transport video and audio signals within a network of multimedia devices. The reachability objective is that one node gets the root role and the other one the child role.

investor models a market investor and shares, which change their value probabilistically over time. The reachability objective is that the investor finishes a trade at a time when his shares are worth more than some threshold value.

mer is a mutual exclusion protocol that regulates the access of two users to two different resources. The protocol should prohibit both resources from being accessed simultaneously.

zeroconf is a network protocol which allows users to choose their IP addresses autonomously. The protocol should detect and prohibit IP address conflicts.

For every example, Table 1 shows the size of the state space, the value of the optimal strategy, and the sizes of strategies obtained from explicit model checking by PRISM and by BRTDP, for each discussed data structure.

Learning Variants. In order to justify our choice of the importance function \( Imp \), we compare it to several alternatives.

  1. When constructing the training data, we can use the importance measure \( Imp \) and add states as often as indicated by their importance (\(\textsc {I}\)), or neglect it and simply add every visited state exactly once (\(\textsc {O}\)).

  2. Further, states on a simulation can be learned conditioned on the fact that the target state is reached (\(\Diamond \)). Another option is to consider all simulations (\(\forall \)).

  3. Finally, instead of the probability to visit the state (\(\mathbb P\)), one can consider the expected number of visits (\(\mathbb E\)).

In Table 2, we report the sizes of the decision trees obtained for all the learning variants. We conclude that our choice (\(\textsc {I}\Diamond \mathbb P\)) is the most useful one.

6.3 Understanding Decision Trees

We show how the constructed decision trees can help us to gain insight into the essential features of the systems.

zeroconf example. In Fig. 4 we present a decision tree that is a strategy for zeroconf and shows how an unresolved IP address conflict can occur in the protocol. First we explain how to read the strategy represented in Fig. 4. Next we show how the strategy explains the conflict in the protocol. Assume that we are classifying a state-action pair (s, a), where action a is enabled in state s.

  1. No matter what the current state s is, the action rec is always classified as \( bad \) according to the root of the tree. Therefore, the action rec should be played with positive probability only if all other actions available in the current state are also classified as \( bad \).

  2. If action a is different from rec, the right son of the root node is reached. If action a is different from the action l>0&b=1&ip_mess=1 -> b’=0&z’=0&n1’=min(n1+1,8)&ip_mess’=0 (the whole PRISM command is a single action), then a is classified as \( good \) in state s. Otherwise, the left son is reached.

  3. In the node \(\mathtt{z \le 0}\), the classification of action a (the action that labels the parent node) depends on the variable valuation of the current state. If the value of variable z is greater than 0, then a is classified as \( good \) in state s; otherwise it is classified as \( bad \).

Action rec stands for a network host receiving a reply to a broadcast message, resulting in the resolution of an IP address conflict if one is present, which clearly does not help in constructing an unresolved conflict. The action labelling the right son of the root represents the detection of an IP address conflict by an arbitrary network host. This action is only good if the clock variable z is greater than 0 in the current state. The combined meaning of the two nodes is that an unresolved IP address conflict can occur if the conflict is detected too late.

Fig. 4. A decision tree for zeroconf

firewire example. For firewire, we obtain a trivial tree with a single node, labelled \( good \). Therefore, playing all available actions in each state guarantees reaching the target almost surely. In contrast to other representations, we have automatically obtained the information that the network always reaches the target configuration, regardless of the individual behaviour of its components and their interleaving.

mer example. In the case of mer, there exists a strategy that violates the required property that the two resources are not accessed simultaneously. The decision tree for the mer strategy is depicted in Fig. 5. In order to understand how a state is reached, where both resources are accessed at the same time, it is necessary to determine which user accesses which resource in that state.

  1. The two tree nodes labelled by 1 explain which resource user 1 should access. The root node, labelled by the action s1=0&r1=0 -> r1’=2, specifies that the request to access resource 2 (variable r1 is set to 2) is classified as \( bad \). The only remaining action for user 1 is to request access to resource 1. This action is classified as \( good \) by the right son of the root node.

  2. Analogously, the tree nodes labelled by 2 specify that user 2 should request access to resource 2 (follows from s2=0&r2=0 -> r2’=2). Once resource 2 is requested, user 2 should change its internal state s2 to 1 (follows from s2=0&r2=2 -> s2’=1). It follows that in the state violating the property, user 1 has access to resource 1 and user 2 to resource 2.

The model is supposed to correctly handle such overlapping requests, but fails to do so in a specific case. In order to further debug the model, one has to find the action of the scheduler that causes this undesired behaviour. The lower part of the tree specifies that u1_request_comm is a candidate for such an action. Inspecting a snippet of the code of u1_request_comm from the PRISM source code (shown below) reveals that in the given situation, the scheduler reacts inappropriately with some probability p.

(Code listing: snippet of the u1_request_comm command from the PRISM source code)

The remaining nodes of the tree that were not discussed are necessary to reset the situation if the non-faulty part (with probability \(1-p\)) of the u1_request_comm command was executed. It should be noted that executing the faulty u1_request_comm action does not lead to the undesired state right away. The action only grants user 1 access rights in a situation where he should not get these rights. Only a successive action leads to user 1 accessing the resource and the undesired state being reached. This is a common type of bug, where the command that triggered an error is not the cause of it.

Fig. 5. A decision tree for mer

7 Conclusion

In this work we presented a new approach to represent strategies in MDPs in a succinct and comprehensible way. We exploited machine learning methods to achieve our goals. Interesting directions for future work are to investigate whether other machine learning methods can be integrated with our approach, and to extend our approach from reachability objectives to other objectives (such as long-run average and discounted-sum).