
1 Introduction

Dynamic partial-order reduction (DPOR) [1, 10, 19] is a mature approach to mitigate the state explosion problem in stateless model checking of multithreaded programs. DPORs are based on Mazurkiewicz trace theory [13], a true-concurrency semantics where the set of executions of the program is partitioned into equivalence classes known as Mazurkiewicz traces (M-traces). In a DPOR, this partitioning is defined by an independence relation over concurrent actions that is computed dynamically and the method explores executions which are representatives of M-traces. The exploration is sound when it explores all M-traces, and it is considered optimal [1] when it explores each M-trace only once.

Since two independent actions might have to be explored from the same state in order to explore all M-traces, a DPOR algorithm uses independence to compute a provably-sufficient subset of the enabled transitions to explore for each state encountered. Typically this involves the combination of forward reasoning (persistent sets [11] or source sets [1, 4]) with backward reasoning (sleep sets [11]) to obtain a more efficient exploration. However, in order to obtain optimality, a DPOR needs to compute sequences of transitions (as opposed to sets of enabled transitions) that avoid visiting a previously visited M-trace. These sequences are stored in a data structure called wakeup trees in [1] and known as alternatives in [19]. Computing these sequences thus amounts to deciding whether the DPOR needs to visit yet another M-trace (or all have already been seen).

In this paper, we prove that computing alternatives in an optimal DPOR is an NP-complete problem. To the best of our knowledge this is the first formal complexity result on this important subproblem that optimal and non-optimal DPORs need to solve. The program shown in Fig. 1(a) illustrates a practical consequence of this result: the non-optimal, state-of-the-art SDPOR algorithm [1] can explore \(\mathop {\mathcal {O}} (2^n)\) interleavings here, although the program has only \(\mathop {\mathcal {O}} (n)\) M-traces.

Fig. 1. (a) Programs; (b) partially-ordered executions.

The program contains \(n \mathrel {:=}3\) writer threads \(w_0, w_1, w_2\), each writing to a different variable. Thread count increments a zero-initialized counter c exactly \(n-1\) times. Thread master reads c into variable i and writes to \(x_i\).
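To make the discussion concrete, the following Python sketch renders the program of Fig. 1(a) for \(n = 3\); the benchmarks in the paper are C programs with POSIX threads, so this rendition and its variable names are ours.

```python
import threading

n = 3
c = 0                  # zero-initialized counter, incremented n-1 times by thread count
x = [None] * n         # x[0], x[1], x[2]: one shared variable per writer

def writer(j, val):    # thread w_j writes to its own variable
    x[j] = val

def count():
    global c
    for _ in range(n - 1):
        c = c + 1

def master():
    i = c              # read the counter into i ...
    x[i] = 0           # ... and write to x_i: interferes with exactly one writer

threads = [threading.Thread(target=writer, args=(j, 7 + j)) for j in range(n)]
threads += [threading.Thread(target=count), threading.Thread(target=master)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```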

The statements \(x_0 = 7\) and \(x_1 = 8\) are independent because they produce the same state regardless of their execution order. Statement \(i = c\) and any statement of the count thread are dependent, or interfering: executing them in different orders results in different states. Similarly, \(x_i = 0\) interferes with exactly one writer thread, depending on the value of i.

Using this independence relation, the set of executions of this program can be partitioned into six M-traces, corresponding to the six partial orders shown in Fig. 1(b). Thus, an optimal DPOR explores six executions (2n executions for n writers). We now show why SDPOR explores \(\mathop {\mathcal {O}} (2^n)\) executions in the general case. Conceptually, SDPOR is a loop that (1) runs the program, (2) identifies two dependent statements that can be swapped, and (3) reverses them and re-executes the program. It terminates when no more dependent statements can be swapped.

Consider the interference on the counter variable c between the master and the count thread. Their execution order determines which writer thread interferes with the master statement \(x_i= 0\). If \(c=1\) is executed just before \(i=c\), then \(x_i=0\) interferes with \(w_1\). However, if \(i=c\) is executed before, then \(x_i=0\) interferes with \(w_0\). Since SDPOR does not track relations between dependent statements, it will naively try to reverse the race between \(x_i=0\) and every writer thread, which results in exploring \(\mathop {\mathcal {O}} (2^n)\) executions. In this program, exploring only six traces requires understanding the entanglement between the two interferences, as the order in which the first one is reversed determines the second.

As a trade-off between solving this NP-complete problem and potentially exploring an exponential number of redundant schedules, we propose a hybrid approach called Quasi-Optimal POR (QPOR), which can turn a non-optimal DPOR into an optimal one. In particular, we provide a polynomial algorithm to compute alternative executions that can arbitrarily approximate the optimal solution, controlled by a user-specified constant k. The key concept is the new notion of a k-partial alternative, which can intuitively be seen as a “good enough” alternative: it reverses two interfering statements while remembering the resolution of the last \(k-1\) interferences.

The major differences between QPOR and the DPORs of [1] are that: (1) QPOR is based on prime event structures [17], a partial-order semantics that has recently been applied to programs [19, 21], instead of a sequential view of thread interleavings, and (2) it computes k-partial alternatives with an \(\mathop {\mathcal {O}} (n^k)\) algorithm, while optimal DPOR corresponds to computing \(\infty \)-partial alternatives with an \(\mathop {\mathcal {O}} (2^n)\) algorithm. For the program shown in Fig. 1(a), QPOR achieves optimality with \(k=2\) because each race is coupled with (at most) one other race. As expected, both the cost of computing k-partial alternatives and the reductions obtained by the method increase with higher values of k.

Finding k-partial alternatives requires decision procedures for traversing the causality and conflict relations in event structures. Our main algorithmic contribution is to represent these relations as a set of trees, where each event is encoded as one or two nodes in up to two different trees. We show that checking causality or conflict between events amounts to an efficient traversal of one of these trees.

In summary, our main contributions are:

  • Proof that computing alternatives for optimal DPOR is NP-complete (Sect. 4).

  • Efficient data structures and algorithms for (1) computing k-partial alternatives in polynomial time, and (2) representing and traversing partial orders (Sect. 5).

  • Implementation of QPOR in a new tool called Dpu and experimental evaluations against SDPOR in Nidhugg and the testing tool Maple (Sect. 6).

  • Benchmarks with \(\mathop {\mathcal {O}} (n)\) M-traces where SDPOR explores \(\mathop {\mathcal {O}} (2^n)\) executions (Sect. 6).

Furthermore, in Sect. 6 we show that: (1) low values of k often achieve optimality; (2) even with non-optimal explorations Dpu greatly outperforms Nidhugg; (3) Dpu copes with production code in Debian packages and achieves much higher state space coverage and efficiency than Maple.

Proofs for all our formal results are available in the unabridged version [15].

2 Preliminaries

In this section we provide the formal background used throughout the paper.

Concurrent Programs. We consider deterministic concurrent programs composed of a fixed number of threads that communicate via shared memory and synchronize using mutexes (Fig. 1(a) can be trivially modified to satisfy this). We also assume that local statements can modify shared memory only within a mutex block. Therefore, it suffices to consider only races on mutex accesses.

Formally, a concurrent program is a structure \(P \mathrel {:=}\langle \mathcal {M}, \mathcal {L}, T, m_0, l_0\rangle \), where \(\mathcal {M}\) is the set of memory states (valuations of program variables, including instruction pointers), \(\mathcal {L}\) is the set of mutexes, \(m_0\) is the initial memory state, \(l_0\) is the initial state of the mutexes, and T is the set of thread statements. A thread statement \(t \mathrel {:=}\langle i,f\rangle \) is a pair where \(i \in \mathbb {N}\) is the thread identifier associated with the statement and \(f :\mathcal {M}\rightarrow (\mathcal {M}\times \varLambda )\) is a partial function that models the transformation of the memory as well as the effect \(\varLambda \mathrel {:=}{\{ \texttt {loc} \mathclose \}}\cup ({\{ \texttt {acq},\texttt {rel} \mathclose \}} \times \mathcal {L})\) of the statement with respect to thread synchronization. Statements with effect \(\texttt {loc}\) model local thread code. Statements associated with \(\langle \texttt {acq},x\rangle \) or \(\langle \texttt {rel},x\rangle \) model lock and unlock operations on a mutex x. Finally, we assume that (1) functions f are PTIME-decidable; (2) \(\texttt {acq}\)/\(\texttt {rel}\) statements do not modify the memory; and (3) \(\texttt {loc}\) statements modify thread-shared memory only within lock/unlock blocks. When (3) is violated, P has a data race (undefined behavior in almost all languages), and our technique can be used to find such statements, see Sect. 6.

We use labelled transition systems (\(LTS\)) semantics for our programs. We associate a program P with the \(LTS\) \(M_P \mathrel {:=}\langle \mathcal {S}, {\rightarrow }, A, s_0\rangle \). The set \(\mathcal {S}\mathrel {:=}\mathcal {M}\times (\mathcal {L}\rightarrow {\{ 0,1 \mathclose \}})\) contains the states of \(M_P\), i.e., pairs of the form \(\langle m,v\rangle \) where m is the state of the memory and v indicates when a mutex is locked (1) or unlocked (0). The actions in \(A \subseteq \mathbb {N}\times \varLambda \) are pairs \(\langle i,b\rangle \) where i is the identifier of the thread that executes some statement and b is the effect of the statement. We use the function \(p :A \rightarrow \mathbb {N}\) to retrieve the thread identifier. The transition relation \({\rightarrow } \subseteq \mathcal {S}\times A \times \mathcal {S}\) contains a triple \(\langle \langle m,v\rangle , \langle i,b\rangle , \langle m',v'\rangle \rangle \) exactly when there is some thread statement \(\langle i,f\rangle \in T\) such that \(f(m) = \langle m',b\rangle \) and either (1) \(b = \texttt {loc}\) and \(v' = v\), or (2) \(b = \langle \texttt {acq},x\rangle \) and \(v(x) = 0\) and \(v' = v_{|x\mapsto 1}\), or (3) \(b = \langle \texttt {rel},x\rangle \) and \(v' = v_{|x\mapsto 0}\). Notation \(f_{|x\mapsto y}\) denotes the function that behaves like f for all inputs except for x, which it maps to y. The initial state is \(s_0 \mathrel {:=}\langle m_0,l_0\rangle \).
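As an illustration, the following sketch spells out this transition relation for a single thread statement; the encoding of states and statements as Python values is our own assumption, not something prescribed by the paper.

```python
def step(state, stmt):
    """One transition of M_P for a thread statement <i, f>, or None if it is not enabled.
    A state is a pair (m, v): memory m and mutex valuation v (0 = unlocked, 1 = locked)."""
    (m, v), (i, f) = state, stmt
    res = f(m)                          # f is partial: None means no transition from m
    if res is None:
        return None
    m2, effect = res
    if effect == ("loc",):
        return (i, effect), (m2, v)     # local statement: mutexes unchanged
    kind, x = effect
    if kind == "acq":
        if v.get(x, 0) != 0:
            return None                 # mutex x already locked: acq not enabled
        return (i, effect), (m2, {**v, x: 1})
    if kind == "rel":
        return (i, effect), (m2, {**v, x: 0})
    return None

# A lock statement of thread 0 on mutex "m" (acq/rel statements leave memory untouched):
lock_m = (0, lambda mem: (mem, ("acq", "m")))
print(step(({"c": 0}, {}), lock_m))     # ((0, ('acq', 'm')), ({'c': 0}, {'m': 1}))
```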

Furthermore, if \(s \xrightarrow {a} s'\) is a transition, the action a is enabled at s. Let \(\mathop { enabl } (s)\) denote the set of actions enabled at s. A sequence \(\sigma \mathrel {:=}a_1 \ldots a_n \in A^*\) is a run when there are states \(s_1, \ldots , s_n\) satisfying \(s_0 \xrightarrow {a_1} s_1 \xrightarrow {a_2} \cdots \xrightarrow {a_n} s_n\). We define \(\mathop { state } (\sigma )\mathrel {:=}s_n\). We let \(\mathop { runs } (M_P)\) denote the set of all runs and \(\mathop { reach } (M_P) \mathrel {:=}{\{ \mathop { state } (\sigma )\in \mathcal {S}:\sigma \in \mathop { runs } (M_P) \mathclose \}}\) the set of all reachable states.

Independence. Dynamic partial-order reduction methods use a notion called independence to avoid exploring concurrent interleavings that lead to the same state. We recall the standard notion of independence for actions in [11]. Two actions \(a, a' \in A\) commute at a state \(s \in \mathcal {S}\) iff

  • if \(a \in \mathop { enabl } (s)\) and \(s \xrightarrow {a} s'\), then \(a' \in \mathop { enabl } (s)\) iff \(a' \in \mathop { enabl } (s')\); and

  • if \(a, a' \in \mathop { enabl } (s)\), then there is a state \(s'\) such that \(s \xrightarrow {a\,a'} s'\) and \(s \xrightarrow {a'\,a} s'\).

Independence between actions is an under-approximation of commutativity. A binary relation \({\mathrel {\diamondsuit }} \subseteq A \times A\) is an independence on \(M_P\) if it is symmetric, irreflexive, and every pair \(\langle a, a'\rangle \) in \(\mathrel {\diamondsuit }\) commutes at every state in \(\mathop { reach } (M_P)\).

In general \(M_P\) has multiple independence relations; clearly \(\emptyset \) is always one of them. We define the relation \({\mathrel {\diamondsuit }_P} \subseteq A \times A\) as the smallest irreflexive, symmetric relation where \(\langle i,b\rangle \mathrel {\diamondsuit }_P \langle i',b'\rangle \) holds if \(i \ne i'\) and either \(b = \texttt {loc}\), or \(b = \langle \texttt {acq},x\rangle \) and \(b' \not \in {\{ \langle \texttt {acq},x\rangle , \langle \texttt {rel},x\rangle \mathclose \}}\). By construction \(\mathrel {\diamondsuit }_P\) is always an independence.
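A small sketch of the check for \(\mathrel {\diamondsuit }_P\), using the same encoding of actions as in the previous sketch (pairs of a thread identifier and an effect); the encoding is our assumption.

```python
def independent(a, b):
    """Decide a ◇_P b for actions (i, effect), where effect is ("loc",), ("acq", x) or ("rel", x)."""
    (i, ea), (j, eb) = a, b
    if i == j:
        return False                      # ◇_P only relates actions of different threads

    def one_way(x, y):
        if x[0] == "loc":
            return True
        # x is an acq on some lock and y is neither acq nor rel on that same lock
        return x[0] == "acq" and not (y[0] in ("acq", "rel") and y[1] == x[1])

    return one_way(ea, eb) or one_way(eb, ea)

print(independent((0, ("acq", "m")), (1, ("acq", "m"))))   # False: same lock, dependent
print(independent((0, ("acq", "m")), (1, ("acq", "n"))))   # True: different locks
```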

Labelled Prime Event Structures. Prime event structures (pes) are well-known non-interleaving, partial-order semantics [7, 8, 16]. Let X be a set of actions. A pes over X is a structure \(\mathcal {E}\mathrel {:=}\langle E, <, \mathrel {\#}, h\rangle \) where E is a set of events, \({<} \subseteq E \times E\) is a strict partial order called causality relation, \({\mathrel {\#}} \subseteq E \times E\) is a symmetric, irreflexive conflict relation, and \(h :E \rightarrow X\) is a labelling function. Causality represents the happens-before relation between events, and conflict between two events expresses that any execution includes at most one of them. Figure 2(b) shows a pes over \(\mathbb {N}\times \varLambda \) where causality is depicted by arrows, conflicts by dotted lines, and the labelling h is shown next to the events, e.g., \(1 < 5\), \(8 < 12\), \(2 \mathrel {\#}8\), and \(h(1) = \langle 0,\texttt {loc}\rangle \). The history of an event e, \(\left\lceil e \right\rceil \mathrel {:=}{\{ e' \in E :e' < e \mathclose \}}\), is the least set of events that need to happen before e.

The notion of concurrent execution in a \(\textsc {pes} \) is captured by the concept of configuration. A configuration is a (partially ordered) execution of the system, i.e., a set \(C \subseteq E\) of events that is causally closed (if \(e \in C\), then \(\left\lceil e \right\rceil \subseteq C\)) and conflict-free (if \(e, e' \in C\), then \(\lnot (e \mathrel {\#}e')\)). In Fig. 2(b), the set \({\{ 8,9,15 \mathclose \}}\) is a configuration, but \({\{ 3 \mathclose \}}\) or \({\{ 1,2,8 \mathclose \}}\) are not. We let \(\mathop { conf } (\mathcal {E})\) denote the set of all configurations of \(\mathcal {E}\), and \([e] \mathrel {:=}\left\lceil e \right\rceil \cup {\{ e \mathclose \}}\) the local configuration of e. In Fig. 2(b), \([11] = {\{ 1,8,9,10,11 \mathclose \}}\). A configuration C represents a set \(\mathop { inter } (C)\) of interleavings over X: the sequences in \(X^*\) that label the topological sortings of the events in C. In Fig. 2(b), \(\mathop { inter } ({\{ 1,8 \mathclose \}}) = {\{ ab, ba \mathclose \}}\) with \(a \mathrel {:=}\langle 0,\texttt {loc}\rangle \) and \(b \mathrel {:=}\langle 1,\texttt {acq}\ m\rangle \).

The extensions of C are the events not in C whose histories are included in C: \(\mathop { ex } (C) \mathrel {:=}{\{ e \in E :e \notin C \wedge \left\lceil e \right\rceil \subseteq C \mathclose \}}\). The enabled events of C are the extensions that can form a larger configuration: \(\mathop { en } (C) \mathrel {:=}{\{ e \in \mathop { ex } (C) :C \cup {\{ e \mathclose \}} \in \mathop { conf } (\mathcal {E}) \mathclose \}}\). Finally, the conflicting extensions of C are the extensions that are not enabled: \(\mathop { cex } (C) \mathrel {:=}\mathop { ex } (C) \setminus \mathop { en } (C)\). In Fig. 2(b), \(\mathop { ex } ({\{ 1,8 \mathclose \}}) = {\{ 2,9,15 \mathclose \}}\), \(\mathop { en } ({\{ 1,8 \mathclose \}}) = {\{ 9,15 \mathclose \}}\), and \(\mathop { cex } ({\{ 1,8 \mathclose \}}) = {\{ 2 \mathclose \}}\). See [20] for more information on pes concepts.
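These definitions are straightforward to prototype; the following self-contained sketch does so for a four-event toy pes (unrelated to Fig. 2, and omitting conflict inheritance for brevity).

```python
E = {1, 2, 3, 4}                            # events
lt = {(1, 2), (1, 3)}                       # causality: 1 < 2 and 1 < 3
cf = {frozenset({2, 3})}                    # conflict:  2 # 3

def history(e):                             # history of e: {e' in E : e' < e}
    return {a for (a, b) in lt if b == e}

def conflict(a, b):
    return frozenset({a, b}) in cf

def is_config(C):                           # causally closed and conflict-free
    return all(history(e) <= C for e in C) and \
           all(not conflict(a, b) for a in C for b in C)

def ex(C):                                  # extensions of C
    return {e for e in E - C if history(e) <= C}

def en(C):                                  # enabled events of C
    return {e for e in ex(C) if is_config(C | {e})}

def cex(C):                                 # conflicting extensions of C
    return ex(C) - en(C)

C = {1, 2}
print(is_config(C), ex(C), en(C), cex(C))   # True {3, 4} {4} {3}
```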

Parametric Unfolding Semantics. We recall the program pes semantics of [19, 20] (modulo notation differences). For a program P and any independence \(\mathrel {\diamondsuit }\) on \(M_P\) we define a pes \(\mathcal {U}_{P,\mathrel {\diamondsuit }}\) that represents the behavior of P, i.e., such that the set of interleavings of its configurations equals \(\mathop { runs } (M_P)\).

Each event in \(\mathcal {U}_{P,\mathrel {\diamondsuit }}\) is defined by a canonical name of the form \(e \mathrel {:=}\langle a,H\rangle \), where \(a \in A\) is an action of \(M_P\) and H is a configuration of \(\mathcal {U}_{P,\mathrel {\diamondsuit }}\). Intuitively, e represents the action a after the history (or the causes) H. Figure 2(b) shows an example. Event 11 is \(\langle \langle 0, \texttt {acq}\ m\rangle ,{\{ 1,8,9,10 \mathclose \}}\rangle \) and event 1 is \(\langle \langle 0, \texttt {loc}\rangle , \emptyset \rangle \). Note the inductive nature of the name, and how it uniquely identifies each event. We define the state of a configuration as the state reached by any of its interleavings. Formally, for \(C \in \mathop { conf } (\mathcal {U}_{P,\mathrel {\diamondsuit }})\) we define \(\mathop { state } (C)\) as \(s_0\) if \(C = \emptyset \) and as \(\mathop { state } (\sigma )\) for some \(\sigma \in \mathop { inter } (C)\) if \(C \ne \emptyset \). Despite appearances, \(\mathop { state } (C)\) is well-defined because all sequences in \(\mathop { inter } (C)\) reach the same state, see [20] for a proof.

Definition 1

(Unfolding). Given a program P and some independence relation \(\mathrel {\diamondsuit }\) on \(M_P \mathrel {:=}\langle \mathcal {S}, \rightarrow , A, s_0\rangle \), the unfolding of P under \(\mathrel {\diamondsuit }\), denoted \(\mathcal {U}_{P,\mathrel {\diamondsuit }}\), is the pes over A constructed by the following fixpoint rules:

  1. Start with a pes \(\mathcal {E}\mathrel {:=}\langle E, <, {\mathrel {\#}}, h\rangle \) equal to \(\langle \emptyset , \emptyset , \emptyset , \emptyset \rangle \).

  2. Add a new event \(e \mathrel {:=}\langle a,C\rangle \) to E for any configuration \(C \in \mathop { conf } (\mathcal {E})\) and any action \(a \in A\) if a is enabled at \(\mathop { state } (C)\) and \(\lnot (a \mathrel {\diamondsuit }h(e'))\) holds for every <-maximal event \(e'\) in C.

  3. For any new e in E, update <, \(\mathrel {\#}\), and h as follows: for every \(e' \in C\), set \(e' < e\); for any \(e' \in E \setminus C\), set \(e' \mathrel {\#}e\) if \(e \ne e'\) and \(\lnot (a \mathrel {\diamondsuit }h(e'))\); set \(h(e) \mathrel {:=}a\).

  4. Repeat steps 2 and 3 until no new event can be added to E; return \(\mathcal {E}\).

Step 1 creates an empty pes with only one (empty) configuration. Step 2 inserts a new event \(\langle a,C\rangle \) by finding a configuration C that enables an action a which is dependent with all causality-maximal events in C. In Fig. 2, this initially creates events 1, 8, and 15. For event \(1 \mathrel {:=}\langle \langle 0,\texttt {loc}\rangle , \emptyset \rangle \), this is because action \(\langle 0,\texttt {loc}\rangle \) is enabled at \(\mathop { state } (\emptyset )= s_0\) and there is no <-maximal event in \(\emptyset \) to consider. Similarly, the state of \(C_1 \mathrel {:=}{\{ 1, 8, 9, 10 \mathclose \}}\) enables action \(a_1 \mathrel {:=}\langle 0, \texttt {acq}\ m\rangle \), and both h(1) and h(10) are dependent with \(a_1\) in \(\mathrel {\diamondsuit }_P\). As a result \(\langle a_1,C_1\rangle \) is an event (number 11). Furthermore, while \(a_2 \mathrel {:=}\langle 0, \texttt {loc}\rangle \) is enabled at \(\mathop { state } (C_2)\), with \(C_2 \mathrel {:=}{\{ 8,9,10 \mathclose \}}\), \(a_2\) is independent of h(10) and \(\langle a_2,C_2\rangle \) is not an event.

After inserting an event \(e \mathrel {:=}\langle a,C\rangle \), Definition 1 declares all events in C causal predecessors of e. For any event \(e'\) in E but not in [e] such that \(h(e')\) is dependent with a, the order of execution of e and \(e'\) yields different states. We thus set them in conflict. In Fig. 2, we set \(2 \mathrel {\#}8\) because h(2) is dependent with h(8) and \(2 \notin [8]\) and \(8 \notin [2]\).
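To make the fixpoint rule concrete, here is a small sketch of the check performed by steps 2 and 3 for one candidate event; the helper bundle `unf` (with state, enabled, maximal, label, and events) and the predicate `indep` are assumptions of the sketch, not an interface defined by the paper.

```python
def try_extend(a, C, unf, indep):
    """Attempt steps 2-3 of Definition 1 for the candidate event <a, C>.
    `unf` bundles assumed helpers: state(C), enabled(s), maximal(C), label(e), events()."""
    if a not in unf.enabled(unf.state(C)):
        return None                              # a is not enabled after C
    if any(indep(a, unf.label(e)) for e in unf.maximal(C)):
        return None                              # a is independent of a <-maximal event of C
    e = (a, frozenset(C))                        # canonical name <a, C>
    causes = set(C)                              # step 3: every event of C causally precedes e
    conflicts = {ep for ep in unf.events() - C   # step 3: conflicts with dependent events outside C
                 if not indep(a, unf.label(ep))}
    return e, causes, conflicts
```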

Fig. 2. (a) A program P; (b) its unfolding semantics \(\mathcal {U}_{P,\mathrel {\diamondsuit }_P}\).

3 Unfolding-Based DPOR

This section presents an algorithm that exhaustively explores all deadlock states of a given program (a deadlock is a state where no thread is enabled).

For the rest of the paper, unless otherwise stated, we let P be a terminating program (i.e., \(\mathop { runs } (M_P)\) is a finite set of finite sequences) and \(\mathrel {\diamondsuit }\) an independence on \(M_P\). Consequently, \(\mathcal {U}_{P,\mathrel {\diamondsuit }}\) has finitely many events and configurations.

Our POR algorithm (Algorithm 1) analyzes P by exploring the configurations of \(\mathcal {U}_{P,\mathrel {\diamondsuit }}\). It visits all \(\subseteq \)-maximal configurations of \(\mathcal {U}_{P,\mathrel {\diamondsuit }}\), which correspond to the deadlock states in \(\mathop { reach } (M_P)\), and organizes the exploration as a binary tree.

Algorithm 1 has a global set U that stores all events of \(\mathcal {U}_{P,\mathrel {\diamondsuit }}\) discovered so far. The three arguments are: C, the configuration to be explored; D (for disabled), a set of events that shall never again be included in C; and A (for add), used to direct the exploration towards a configuration that conflicts with D. A call with arguments C, D, and A visits all maximal configurations of \(\mathcal {U}_{P,\mathrel {\diamondsuit }}\) which contain C and no event of D, and the first one explored contains \(C \cup A\).

Algorithm 1.

The algorithm first adds \(\mathop { ex } (C)\) to U. If C is a maximal configuration (i.e., there is no enabled event), then line 5 returns. If C is not maximal but \(\mathop { en } (C) \subseteq D\), then all possible events that could be added to C have already been explored and this call was redundant work. In this case the algorithm also returns, and we say that it has explored a sleep-set blocked (SSB) execution [1]. Algorithm 1 next selects an event e enabled at C, if possible from A (lines 7 and 9), and makes a recursive call (line 10, left subtree) that explores all configurations that contain all events in \(C \cup {\{ e \mathclose \}}\) and no event from D. Since that call visits all maximal configurations containing C and e, it remains to visit those containing C but not e. At line 11 we determine if any such configuration exists by calling function Alt, which returns a set of configurations, so-called clues. A clue is a witness that a \(\subseteq \)-maximal configuration exists in \(\mathcal {U}_{P,\mathrel {\diamondsuit }}\) which contains C and not \(D \cup {\{ e \mathclose \}}\).
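Since the listing of Algorithm 1 is not reproduced above, the following Python sketch reconstructs the recursion from the preceding description; the helpers `unf` (extensions, enabled events, conflicts) and `alt` are assumed, and the correspondence with the paper's line numbers is approximate.

```python
def explore(C, D, A, U, unf, alt):
    """Visit all maximal configurations that contain C and avoid D; A marks the way."""
    U |= unf.ex(C)                                  # discover the extensions of C
    en = unf.en(C)
    if not en:
        return                                      # C is maximal: a deadlock was reached
    if en <= D:
        return                                      # sleep-set blocked (SSB) execution
    choices = (A & en) or (en - D)                  # prefer an event of the clue A when possible
    e = next(iter(choices))
    explore(C | {e}, D, A - {e}, U, unf, alt)       # left subtree: configurations with e
    J = next(iter(alt(U, C, D | {e})), None)        # line 11: ask for one clue
    if J is not None:
        explore(C, D | {e}, J - C, U, unf, alt)     # right subtree: configurations without e
    U &= C | D | unf.conflicting_with(C | D)        # line 13: keep only Q_{C,D}
```

An exploration would then start from the empty configuration, e.g. `explore(set(), set(), set(), set(), unf, alt)` under these assumptions.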

Definition 2

(Clue). Let D and U be sets of events, and C a configuration such that \(C \cap D = \emptyset \). A clue to D after C in U is a configuration \(J \subseteq U\) such that \(C \cup J\) is a configuration and \(D \cap J = \emptyset \).

Definition 3

(Alt function). Alt denotes any function such that a call to Alt with arguments B and F returns a set of clues to F after B in U, and the set is non-empty if \(\mathcal {U}_{P,\mathrel {\diamondsuit }}\) has at least one maximal configuration C where \(B \subseteq C\) and \(C \cap F = \emptyset \).

When Alt returns a clue J, the clue is passed in the second recursive call (line 12) to “mark the way” (using set A) in the subsequent recursive calls at line 10, and to guide the exploration towards the maximal configuration that J witnesses. Definition 3 does not identify a concrete implementation of Alt. It rather indicates how to implement it so that Algorithm 1 terminates and is complete (see below). Different PORs in the literature can be reframed in terms of Algorithm 1. SDPOR [1] uses clues that mark the way with only one event ahead (\(|J \setminus C| = 1\)) and can hit SSBs. Optimal DPORs [1, 19] use size-varying clues that guide the exploration while provably guaranteeing that any SSB is avoided.

Algorithm 1 is optimal when it never explores an SSB. To make Algorithm 1 optimal, Alt needs to return clues that are alternatives [19], which satisfy stronger constraints. When that happens, Algorithm 1 is equivalent to the DPOR in [19] and becomes optimal (see [20] for a proof).

Definition 4

(Alternative [19]). Let D and U be sets of events and C a configuration such that \(C \cap D = \emptyset \). An alternative to D after C in U is a clue J to D after C in U such that \(\forall e \in D: \exists e' \in J\), \(e \mathrel {\#}e'\).

Line 13 removes from U events that will not be necessary for Alt to find clues in the future. The events preserved, \(Q_{C,D} \mathrel {:=}C \cup D \cup \mathop { \# } (C \cup D)\), include all events in \(C \cup D\) as well as every event in U that is in conflict with some event in \(C \cup D\). The preserved events suffice to compute alternatives [19], but other non-optimal implementations of Alt could allow for more aggressive pruning.

The \(\subseteq \)-maximal configurations of Fig. 2(b) are \([7] \cup [17]\), [14], and [19]. Our algorithm starts at configuration \(C = \emptyset \). After 10 recursive calls it visits \(C = [7] \cup [17]\). Then it backtracks to \(C = {\{ 1 \mathclose \}}\), calls Alt, which provides, e.g., \(J = {\{ 1,8 \mathclose \}}\), and visits \(C = {\{ 1,8 \mathclose \}}\) with \(D = {\{ 2 \mathclose \}}\). After 6 more recursive calls it visits \(C = [14]\), backtracks to \(C = [12]\), calls Alt, which provides, e.g., \(J = {\{ 15 \mathclose \}}\), and after two more recursive calls it visits \(C = [12] \cup {\{ 15 \mathclose \}}\) with \(D = {\{ 2,13 \mathclose \}}\). Finally, after 4 more recursive calls it visits \(C = [19]\).

Finally, we focus on the correctness of Algorithm 1, and prove termination and completeness of the algorithm:

Theorem 1

(Termination). Regardless of its input, Algorithm 1 always stops.

Theorem 2

(Completeness). Let \(\hat{C}\) be a \(\subseteq \)-maximal configuration of \(\mathcal {U}_{P,\mathrel {\diamondsuit }}\). Then Algorithm 1 is invoked at least once with \(C=\hat{C}\).

4 Complexity

This section presents complexity results about the only non-trivial steps in Algorithm 1: computing \(\mathop { ex } (C)\) and the call to Alt. An implementation of Alt that systematically returns \({\{ B \mathclose \}}\) would satisfy Definition 3, but would also render Algorithm 1 unusable (equivalent to a DFS in \(M_P\)). On the other hand the algorithm becomes optimal when Alt returns alternatives. Optimality comes at a cost:

Theorem 3

Given a finite pes \(\mathcal {E}\), some configuration \(C \in \mathop { conf } (\mathcal {E})\), and a set \(D \subseteq \mathop { ex } (C)\), deciding if an alternative to D after C exists in \(\mathcal {E}\) is NP-complete.

Theorem 3 assumes that \(\mathcal {E}\) is an arbitrary pes. Assuming that \(\mathcal {E}\) is the unfolding of a program P under \(\mathrel {\diamondsuit }_P\) does not reduce this complexity:

Theorem 4

Let P be a program and U a causally-closed set of events from \(\mathcal {U}_{P,\mathrel {\diamondsuit }_P}\). For any configuration \(C \subseteq U\) and any \(D \subseteq \mathop { ex } (C)\), deciding if an alternative to D after C exists in U is NP-complete.

These complexity results lead us to consider (in the next section) new approaches that avoid the NP-hardness of computing alternatives while still retaining their capacity to prune the search.

Finally, we focus on the complexity of computing \(\mathop { ex } (C)\), which essentially reduces to computing \(\mathop { cex } (C)\), as computing \(\mathop { en } (C)\) is trivial. Assuming that \(\mathcal {E}\) is given, computing \(\mathop { cex } (C)\) for some \(C \in \mathop { conf } (\mathcal {E})\) is a linear problem. However, for any realistic implementation of Algorithm 1, \(\mathcal {E}\) is not available (the very goal of Algorithm 1 is to find all of its events). So a useful complexity result about \(\mathop { cex } (C)\) necessarily refers to the original system under analysis. When \(\mathcal {E}\) is the unfolding of a Petri net [14], computing \(\mathop { cex } (C)\) is NP-complete:

Theorem 5

Let N be a Petri net, t a transition of N, \(\mathcal {E}\) the unfolding of N and C a configuration of \(\mathcal {E}\). Deciding if \(h^{-1}(t) \cap \mathop { cex } (C) = \emptyset \) is NP-complete.

Fortunately, computing \(\mathop { cex } (C)\) for programs is a much simpler task. Algorithm 1 contains a function that computes and returns \(\mathop { cex } (C)\) when \(\mathcal {E}\) is the unfolding of some program; we explain it in detail in Sect. 5.3. Assuming that its two helper functions (Sect. 5.3) can be computed in constant time, and that relation < can be decided in \(\mathop {\mathcal {O}} (\log |C|)\), as we will show, it clearly works in time \(\mathop {\mathcal {O}} (n^2\log n)\), where \(n \mathrel {:=}|C|\), as both of its loops are bounded by the size of C.

5 New Algorithm for Computing Alternatives

This section introduces a new class of clues, called k-partial alternatives. These can arbitrarily reduce the number of redundant explorations (SSBs) performed by Algorithm 1 and can be computed in polynomial time. Specialized data structures and algorithms for k-partial alternatives are also presented.

Definition 5

(k-partial alternative). Let U be a set of events, \(C \subseteq U\) a configuration, \(D \subseteq U\) a set of events, and \(k \in \mathbb {N}\) a number. A configuration J is a k-partial alternative to D after C if there is some \(\hat{D} \subseteq D\) such that \(|\hat{D}| = k\) and J is an alternative to \(\hat{D}\) after C.

A k-partial alternative needs to conflict with only k (instead of all) events in D. An alternative is thus an \(\infty \)-partial alternative. If we reframe SDPOR in terms of Algorithm 1, it becomes an algorithm using singleton 1-partial alternatives. While k-partial alternatives are a very simple concept, much of their simplicity stems from the fact that they are expressed within the elegant framework of pes semantics. Defining the same concept on top of sequential semantics (often used in the POR literature [1, 2, 9,10,11, 23]) would have required much more complex machinery.
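For concreteness, Definitions 4 and 5 can be read directly as (brute-force) predicates; the helpers `is_configuration` and `in_conflict` are assumptions of this sketch, and a real implementation would instead rely on the data structures of Sect. 5.

```python
from itertools import combinations

def is_alternative(J, C, D, is_configuration, in_conflict):
    """Definition 4: J is a clue to D after C that conflicts with every event of D."""
    clue = is_configuration(J) and is_configuration(C | J) and not (D & J)
    return clue and all(any(in_conflict(e, ep) for ep in J) for e in D)

def is_k_partial_alternative(J, C, D, k, is_configuration, in_conflict):
    """Definition 5: J is an alternative to some subset of D of size k (or |D| if smaller)."""
    k = min(k, len(D))
    return any(is_alternative(J, C, set(Dh), is_configuration, in_conflict)
               for Dh in combinations(D, k))
```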

We compute k-partial alternatives using a comb data structure:

Definition 6

(Comb). Let A be a set. An A-comb c of size \(n \in \mathbb {N}\) is an ordered collection of spikes \(\langle s_1,\ldots ,s_n\rangle \), where each spike \(s_i \in A^*\) is a sequence of elements over A. Furthermore, a combination over c is any tuple \(\langle a_1, \ldots , a_n\rangle \) where \(a_i \in s_i\) is an element of the spike.

It is possible to compute k-partial alternatives (and by extension optimal alternatives) to D after C in U using a comb, as follows:

  1. Select k (or |D|, whichever is smaller) arbitrary events \(e_1, \ldots , e_k\) from D.

  2. Build a U-comb \(\langle s_1, \ldots , s_k\rangle \) of size k, where spike \(s_i\) contains all events in U in conflict with \(e_i\).

  3. Remove from \(s_i\) any event \(\hat{e}\) such that either \([\hat{e}] \cup C\) is not a configuration or \([\hat{e}] \cap D \ne \emptyset \).

  4. Find combinations \(\langle e'_1, \ldots , e'_k\rangle \) in the comb satisfying \(\lnot (e'_i \mathrel {\#}e'_j)\) for \(i \ne j\).

  5. For any such combination the set \(J \mathrel {:=}[e'_1] \cup \ldots \cup [e'_k]\) is a k-partial alternative.

Step 3 guarantees that J is a clue. Steps 1 and 2 guarantee that it will conflict with at least k events from D. It is straightforward to prove that the procedure will find a k-partial alternative to D after C in U when an \(\infty \)-partial alternative to D after C exists in U. It can thus be used to implement Definition 3.
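A sketch of this comb-based search follows; the helpers `in_conflict`, `is_configuration` and `history` (returning the local configuration [e]) are assumed to be available for the events stored in U.

```python
from itertools import product

def k_partial_alternatives(U, C, D, k, in_conflict, is_configuration, history):
    """Enumerate the k-partial alternatives to D after C found by the comb search (steps 1-5)."""
    spikes = []
    for e in list(D)[:min(k, len(D))]:                   # step 1: pick k events of D
        spike = [x for x in U if in_conflict(x, e)]      # step 2: events of U in conflict with e
        spike = [x for x in spike                        # step 3: prune useless events
                 if is_configuration(C | history(x)) and not (history(x) & D)]
        spikes.append(spike)
    for combo in product(*spikes):                       # step 4: one event per spike
        if all(not in_conflict(a, b)
               for i, a in enumerate(combo) for b in combo[i + 1:]):
            yield set().union(*(history(x) for x in combo))   # step 5: union of local configurations
```

Since the comb has at most k spikes, each containing at most |U| events, the search examines at most \(|U|^k\) combinations, which is where the polynomial bound of Sect. 1 comes from.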

Steps 2, 3, and 4 require deciding whether a given pair of events is in conflict. Similarly, step 3 requires deciding if two events are causally related. Efficiently computing k-partial alternatives thus reduces to efficiently computing causality and conflict between events.

5.1 Computing Causality and Conflict for PES Events

In this section we introduce an efficient data structure for deciding whether two events in the unfolding of a program are causally related or in conflict.

As in Sect. 3, let P be a program, \(M_P\) its LTS semantics, and \(\mathrel {\diamondsuit }_P\) its independence relation (defined in Sect. 2). Additionally, let \(\mathcal {E}\) denote the pes \(\mathcal {U}_{P,\mathrel {\diamondsuit }_P}\) of P extended with a new event \(\bot \) that causally precedes every event in \(\mathcal {U}_{P,\mathrel {\diamondsuit }_P}\).

The unfolding \(\mathcal {E}\) represents the dependency of actions in \(M_P\) through the causality and conflict relations between events. By definition of \(\mathrel {\diamondsuit }_P\) we know that for any two events \(e,e' \in \mathcal {E}\):

  • If e and \(e'\) are events from the same thread, then they are either causally related or in conflict.

  • If e and \(e'\) are lock/unlock operations on the same variable, then similarly they are either causally related or in conflict.

This means that the causality/conflict relations between all events of one thread can be tracked using a tree. For every thread of the program we define and maintain a so-called thread tree. Each event of the thread has a corresponding node in the tree. A tree node n is the parent of another tree node \(n'\) iff the event associated with n is the immediate causal predecessor of the event associated with \(n'\). That is, the ancestor relation of the tree encodes the causality relations of events in the thread, and the branching of the tree represents conflict. Given two distinct events \(e, e'\) of the same thread, we have that \(e < e'\) holds iff the tree node of e is a proper ancestor of the tree node of \(e'\), and \(e \mathrel {\#}e'\) holds iff neither node is an ancestor of the other.

We apply the same idea to track causality/conflict between \(\texttt {acq}\) and \(\texttt {rel}\) events. For every lock \(l \in \mathcal {L}\) we maintain a separate lock tree, containing a node for each event labelled by either \(\langle \texttt {acq}, l\rangle \) or \(\langle \texttt {rel}, l\rangle \). As before, the ancestor relation in a lock tree encodes the causality relations of all events represented in that tree. Events of type \(\texttt {acq}\)/\(\texttt {rel}\) have tree nodes in both their lock and thread trees. Events for \(\texttt {loc}\) actions are associated to only one node in the thread tree.

This idea gives a procedure to decide a causality/conflict query for two events when they belong to the same thread or operate on the same lock. But we still need to decide causality and conflict for other pairs of events, e.g., \(\texttt {loc}\) events of different threads. Again by construction of \(\mathrel {\diamondsuit }_P\), the only source of causality/conflict between two such events is the causality/conflict relations between their causal predecessors. These relations can be summarized by keeping two mappings for each event:

Definition 7

Let \(e \in E\) be an event of \(\mathcal {E}\). We define the thread mapping \( tmax :E \times \mathbb {N}\rightarrow E\) as the only function that maps every pair \(\langle e,i\rangle \) to the unique <-maximal event from thread i in [e], or \(\bot \) if [e] contains no event from thread i. Similarly, the lock mapping \( lmax :E \times \mathcal {L}\rightarrow E\) maps every pair \(\langle e,l\rangle \) to the unique <-maximal event \(e' \in [e]\) such that \(h(e')\) is an action of the form \(\langle \texttt {acq}, l\rangle \) or \(\langle \texttt {rel}, l\rangle \), or \(\bot \) if no such event exists in [e].

The information stored by the thread and lock mappings enables us to decide causality and conflict queries for arbitrary pairs of events:

Theorem 6

Let \(e, e' \in \mathcal {E}\) be two arbitrary events from threads i and \(i'\), respectively, with \(i \ne i'\). Then \(e < e'\) holds iff \(e \leqslant \mathop { tmax } (e', i)\). And \(e \mathrel {\#}e'\) holds iff there is some \(l \in \mathcal {L}\) such that \(\mathop { lmax } (e, l) \mathrel {\#}\mathop { lmax } (e', l)\).

As a consequence of Theorem 6, deciding whether two events are related by causality or conflict reduces to deciding whether two nodes from the same lock or thread tree are ancestors.
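In code, Theorem 6 reduces both queries to a handful of tree lookups; in the sketch below the mappings tmax and lmax, the ancestor-or-equal tests on thread and lock trees, and the representation of \(\bot \) as None are all assumptions of the sketch.

```python
def causally_precedes(e, ep, thread_of, tmax, thread_tree_leq):
    """e < e' for events of different threads: e must lie below tmax(e', thread(e))."""
    m = tmax(ep, thread_of(e))
    return m is not None and thread_tree_leq(e, m)        # ancestor-or-equal in the thread tree

def in_conflict(e, ep, locks, lmax, lock_tree_leq):
    """e # e': the lock histories of e and e' diverge on some lock l."""
    for l in locks:
        a, b = lmax(e, l), lmax(ep, l)
        if a is None or b is None or a is b:
            continue
        if not lock_tree_leq(a, b) and not lock_tree_leq(b, a):
            return True                                    # a # b: different branches of the lock tree
    return False
```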

5.2 Computing Causality and Conflict for Tree Nodes

This section presents an efficient algorithm to decide if two nodes of a tree are ancestors. The algorithm is similar to a search in a skip list [18].

Let \(\langle N, {\lessdot }, r\rangle \) denote a tree, where N is a set of nodes, \({\lessdot } \subseteq N \times N\) is the parent relation, and \(r \in N\) is the root. Let d(n) be the depth of each node in the tree, with \(d(r) = 0\). A node n is an ancestor of \(n'\) if it belongs to the only path from r to \(n'\). Finally, for a node \(n \in N\) and some integer \(g \in \mathbb {N}\) such that \(g \le d(n)\) let \(q(n,g)\) denote the unique ancestor \(n'\) of n such that \(d(n') = g\).

Given two distinct nodes \(n, n' \in N\), we need to efficiently decide whether n is an ancestor of \(n'\). The key idea is that if \(d(n) = d(n')\), then the answer is clearly negative; and if the depths are different and w.l.o.g. \(d(n) < d(n')\), then we have that n is an ancestor of \(n'\) iff nodes n and \(n'' \mathrel {:=}q(n',d(n))\) are the same node.

To find \(n''\) from \(n'\), a linear traversal of the branch starting from \(n'\) would be expensive for deep trees. Instead, we propose to use a data structure similar to a skip list. Each node stores a pointer to its parent node and also a number of pointers to ancestor nodes at distances \(s^1, s^2, s^3, \ldots \), where \(s \in \mathbb {N}\) is a user-defined step. The number of pointers stored at a node n is equal to the number of trailing zeros in the s-ary representation of d(n). For instance, for \(s \mathrel {:=}2\) a node at depth 4 stores 2 pointers (apart from the pointer to the parent) pointing to the nodes at depth \(4-s^1 = 2\) and depth \(4-s^2 = 0\). Similarly a node at depth 12 stores a pointer to its parent (at depth 11) and pointers to the ancestors at depths 10 and 8. With this data structure, computing \(q(n,g)\) requires traversing only \(\mathop {\mathcal {O}} (\log (d(n) - g))\) nodes of the tree.
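A minimal sketch of this skip-list-like tree, assuming the tree is built top-down by attaching children to existing nodes:

```python
class Node:
    """Tree node with skip pointers to ancestors at distances s, s^2, s^3, ..."""
    def __init__(self, parent, step=2):
        self.parent = parent
        self.step = step
        self.depth = 0 if parent is None else parent.depth + 1
        self.skips = []                     # skips[j] is the ancestor at distance step**(j+1)
        d, dist = self.depth, step
        while d > 0 and d % step == 0:      # one pointer per trailing zero of depth in base `step`
            self.skips.append(ancestor(self, self.depth - dist))
            d //= step
            dist *= step

def ancestor(n, g):
    """q(n, g): the unique ancestor of n at depth g (assumes g <= n.depth)."""
    while n.depth > g:
        hop = n.parent                      # fall back to the parent pointer ...
        for tgt in n.skips:                 # ... or take the farthest skip not overshooting depth g
            if tgt.depth >= g:
                hop = tgt
        n = hop
    return n

def is_ancestor(a, b):
    """Decide whether node a lies on the branch from the root to node b."""
    return a.depth <= b.depth and ancestor(b, a.depth) is a

# A single branch of depth 12 with step 2.
branch = [Node(None)]
for _ in range(12):
    branch.append(Node(branch[-1]))
print(is_ancestor(branch[3], branch[12]), is_ancestor(branch[5], branch[4]))   # True False
```

With step \(s = 2\), the node at depth 12 of this branch stores, besides its parent pointer, skips to the ancestors at depths 10 and 8, matching the example above.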

5.3 Computing Conflicting Extensions

We now explain how the function of Algorithm 1 that computes conflicting extensions works. A call to it with argument C constructs and returns all events in \(\mathop { cex } (C)\). The function works only when the pes being explored is the unfolding of a program P under the independence \(\mathrel {\diamondsuit }_P\).

Owing to the properties of \(\mathcal {U}_{P,\mathrel {\diamondsuit }_P}\), all events in \(\mathop { cex } (C)\) are labelled by \(\texttt {acq}\) actions. Broadly speaking, this is because only the actions from different threads that are co-enabled and are dependent create conflicts in \(\mathcal {U}_{P,\mathrel {\diamondsuit }_P}\). And this is only possible for \(\texttt {acq}\) statements. For the same reason, an event labelled by \(a \mathrel {:=}\langle i, \langle \texttt {acq}, l\rangle \rangle \) exists in \(\mathop { cex } (C)\) iff there is some event \(e \in C\) such that \(h(e) = a\).

The function exploits these facts and the lock trees introduced in Sect. 5.1 to compute \(\mathop { cex } (C)\). Intuitively, it finds every event e labelled by an \(\langle \texttt {acq}, l\rangle \) statement and tries to “execute” it before the \(\langle \texttt {rel}, l\rangle \) that happened before e (if there is one). If it can, it creates a new event \(\hat{e}\) with the same label as e.

Two helper functions are used. The first returns the only immediate causal predecessor of event e within its own thread. The second, defined for \(\texttt {acq}\)/\(\texttt {rel}\) events, returns the parent of event e in its lock tree (or \(\bot \) if e is the root). So for an \(\texttt {acq}\) event the latter returns a \(\texttt {rel}\) event, and for a \(\texttt {rel}\) event it returns an \(\texttt {acq}\) event.

6 Experimental Evaluation

We implemented QPOR in a new tool called Dpu (Dynamic Program Unfolder, available at https://github.com/cesaro/dpu/releases/tag/v0.5.2). Dpu is a stateless model checker for C programs with POSIX threading. It uses the LLVM infrastructure to parse, instrument, and JIT-compile the program, which is assumed to be data-deterministic. It implements k-partial alternatives (k is an input), optimal POR, and context-switch bounding [6].

Dpu does not use data races as a source of thread interference for POR: it will not explore two execution orders for two instructions that exhibit a data race. However, it can be instructed to detect and report data races found during the POR exploration. When requested, this detection happens for a user-provided percentage of the executions explored by POR.

6.1 Comparison to SDPOR

In this section we investigate the following experimental questions: (a) How does QPOR compare against SDPOR? (b) For which values of k do k-partial alternatives yield optimal exploration?

We use realistic programs that expose complex thread synchronization patterns, including a job dispatcher, a multiple-producer multiple-consumer scheme, parallel computation of \(\pi \), and a thread pool. Complex synchronization patterns are frequent in these examples, including nested and intertwined critical sections or conditional interactions between threads based on the processed data, and provide a means to highlight the differences between POR approaches and drive improvement. Each program contains between 2 and 8 assertions, often ensuring invariants of the used data structures. All programs are safe and have between 90 and 200 lines of code. We also considered the SV-COMP’17 benchmarks, but almost all of them contain very simple synchronization patterns, not representative of more complex concurrent algorithms. On these benchmarks QPOR and SDPOR perform an almost identical exploration, both time out on exactly the same instances, and both find exactly the same bugs.

Table 1. Comparing QPOR and SDPOR. Machine: Linux, Intel Xeon 2.4 GHz. TO: timeout after 8 min. Columns are: Th: nr. of threads; Confs: maximal configurations; Time in seconds, Memory in MB; SSB: Sleep-set blocked executions. N/A: analysis with lower k yielded 0 SSBs.

In Table 1, we present a comparison between Dpu and Nidhugg [2], an efficient implementation of SDPOR for multithreaded C programs. We run Dpu with k-partial alternatives for \(k \in {\{ 1, 2, 3 \mathclose \}}\) and with optimal alternatives. The number of SSB executions dramatically decreases as k increases. With \(k=3\) almost no instance produces SSBs (except MPC(4,5)) and optimality is achieved with \(k=4\). Programs with simple synchronization patterns, e.g., the Pi benchmark, are explored optimally both with \(k=1\) and by SDPOR, while more complex synchronization patterns require \(k > 1\).

Overall, if the benchmark exhibits many SSBs, the run time decreases as k increases, and optimal exploration is the fastest option. However, when the benchmark contains few SSBs (cf. Mpat, Pi, Poke), k-partial alternatives can be slightly faster than optimal POR, an observation in line with previous literature [1]. Code profiling revealed that when the comb is large and contains many solutions, both optimal and non-optimal POR easily find them, but optimal POR spends additional time constructing a larger comb. This suggests that optimal POR would profit from a lazy comb construction algorithm.

Dpu is faster than Nidhugg on the majority of the benchmarks because it can greatly reduce the number of SSBs. In the cases where both tools explore the same set of executions, Dpu is in general faster than Nidhugg because it JIT-compiles the program, while Nidhugg interprets it. All the benchmarks in Table 1 are data-race-free, but Nidhugg cannot be instructed to ignore data races and will attempt to revert them. Dpu was run with data-race detection disabled; enabling it incurs approximately 10% overhead. In contrast with previous observations [1, 2], the results in Table 1 show that SSBs can dramatically slow down the execution of SDPOR.

6.2 Evaluation of the Tree-Based Algorithms

We now evaluate the efficiency of our tree-based algorithms from Sect. 5 by answering: (a) What are the average/maximal depths of the thread/lock trees? (b) What is the average depth difference in causality/conflict queries? (c) What is the best step for the branch skip lists? We do not compare our algorithms against others because, to the best of our knowledge, none is available (other than a naive implementation of the mathematical definition of causality/conflict).

Fig. 3. (a), (b) Depths of trees; (c), (d) frequency of depth distances.

We run Dpu with an optimal exploration over 15 selected programs from Table 1, with 380 to 204K maximal configurations in the unfolding. In total, the 15 unfoldings contain 246 trees (150 thread trees and 96 lock trees) with 5.2M nodes. Figure 3 shows the average depth of the nodes in each tree (subfigure a) and the maximum depth of the trees (subfigure b), for each of the 246 trees.

While the average depth of a node is 22.7, as much as 80% of the trees have a maximum depth of less than 8 nodes, and 90% of them less than 16 nodes. The average node depth of 22.7 is nonetheless larger than these figures suggest because deeper trees contain proportionally more nodes. The depth of the deepest node per tree ranged between 3 and 77.

We next evaluate depth differences in the causality and conflict queries over these trees. Figure 3(c) and (d) respectively show the frequency of various depth distances associated with causality and conflict queries made by optimal POR.

Surprisingly, depth differences are very small for both causality and conflict queries. When deciding causality between events, as much as 92% of the queries were for tree nodes separated by a distance between 1 and 4, and 70% had a difference of 1 or 2 nodes. This means that optimal POR, and specifically the procedure that adds \(\mathop { ex } (C)\) to the unfolding (which is the main source of causality queries), systematically performs causality queries which are trivial with the proposed data structures. The situation is similar for checking conflicts: 82% of the queries are about tree nodes whose depth difference is between 1 and 4.

These experiments show that most queries on the causality trees require very short walks, which strongly argues for the data structure proposed in Sect. 5. Finally, we chose a (rather arbitrary) skip step of 4. We observed that other values do not significantly impact the run time or memory consumption for most benchmarks, since the depth difference in causality/conflict queries is very low.

6.3 Evaluation Against the State-of-the-Art on System Code

We now evaluate the scalability and applicability of Dpu on five multithreaded programs in two Debian packages: blktrace [5], a block layer I/O tracing mechanism, and mafft [12], a tool for multiple alignment of amino acid or nucleotide sequences. The code size of these utilities ranges from 2K to 40K LOC, and mafft is parametric in the number of threads.

Table 2. Comparing Dpu with Maple (same machine). LOC: lines of code; Execs: nr. of executions; R: safe or unsafe. Other columns as before. Timeout: 8 min.

We compared Dpu against Maple [24], a state-of-the-art testing tool for multithreaded programs, as the top ranked verification tools from SV-COMP’17 are still unable to cope with such large and complex multithreaded code. Unfortunately we could not compare against Nidhugg because it cannot deal with the (abundant) C-library calls in these programs.

Table 2 presents our experimental results. We use Dpu with optimal exploration and the modified version of Maple used in [22]. To test the effectiveness of both approaches in state-space coverage and bug finding, we introduce bugs in 4 of the benchmarks (Add, Dnd, Mdl, Pla). For the safe benchmark Blk, we perform exhaustive state-space exploration using Maple’s DFS mode. On this benchmark, Dpu outperforms Maple by several orders of magnitude: Dpu explores up to 20K executions covering the entire state space in 10 s, while Maple only explores up to 108 executions in 8 min.

For the remaining benchmarks, we use the random scheduler of Maple, considered to be the best baseline for bug finding [22]. First, we run Dpu to obtain a bound on the number of executions, and then check whether both tools are able to find the bug within that number of random executions. Maple found bugs in all buggy programs (except for one variant of Add), even though Dpu greatly outperforms it and achieves much higher state-space coverage.

6.4 Profiling a Stateless POR

In order to understand the cost of each component of the algorithm, we profile Dpu on a selection of 7 programs from Table 1. Dpu spends between 30% and 90% of the run time executing the program (65% on average). The remaining time is spent computing alternatives, distributed as follows: adding events to the event structure (15% to 30%), building the spikes of a new comb (1% to 50%), searching for solutions in the comb (less than 5%), and computing conflicting extensions (less than 5%). Counterintuitively, building the comb is more expensive than exploring it, even in the optimal case. Filling the spikes seems to be more memory-intensive than exploring the comb, which exploits data locality.

7 Conclusion

We have shown that computing alternatives in an optimal DPOR exploration is NP-complete. To mitigate this problem, we introduced a new approach to compute alternatives in polynomial time, approximating the optimal exploration up to a user-defined constant. Experiments conducted on benchmarks including Debian packages show that our implementation outperforms current verification tools and that the proposed data structures are well suited to the queries performed during the exploration. Our profiling results show that running the program is often more expensive than computing alternatives. Hence, efforts to reduce the number of redundant executions, even if significantly costly, are likely to reduce the overall execution time.