
1 Introduction

Concurrent programs are notoriously difficult to test. Executions of different threads can interleave arbitrarily, and any such interleaving may trigger unexpected errors and lead to serious production failures [13]. Traditional testing of concurrent programs relies on the system scheduler to interleave executions (or events) and is limited in the bugs it can detect, because some interleavings are tested repeatedly while many others are missed.

Systematic testing [9, 16, 18, 28,29,30], instead of relying on the system scheduler, utilizes formal methods to systematically schedule concurrent events, attempting to cover all possible interleavings. However, the interleaving space of a concurrent program is exponential in the execution length and often far exceeds the testing budget, leading to the so-called state-space explosion problem. Techniques such as partial order reduction (POR) [1, 2, 8, 10] and dynamic interface reduction [11] have been introduced to reduce the interleaving space. But, in most cases, the reduced space of a complex concurrent program is still too large to test exhaustively. Moreover, systematic testing often uses a deterministic search algorithm (e.g., depth-first search) that only slightly adjusts the interleaving at each iteration, e.g., flipping the order of two events. Such a search may very well get stuck in a homogeneous interleaving subspace and waste the testing budget exploring mostly equivalent interleavings.

To mitigate the state-space explosion problem, randomized scheduling algorithms have been proposed to sample, rather than enumerate, the interleaving space while still preserving the diversity of the interleavings explored [28]. The most straightforward sampling algorithm is random walk: at each step, randomly pick an enabled event to execute. Previous work showed that even such simple sampling outperformed exhaustive search at finding errors in real-world concurrent programs [24]. This can be explained by applying the small-scope hypothesis [12, Sect. 5.1.3] to the domain of concurrency error detection [17]: errors in real-world concurrent programs are non-adversarial and can often be triggered if a small number of events happen in the right order, which sampling has a good probability of achieving.
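To make this baseline concrete, here is a minimal sketch of a random-walk sampler; the `enabled_events` and `execute` methods are hypothetical stand-ins for whatever execution-control interface a testing framework provides:

```python
import random

def random_walk(state):
    """Sample one terminating trace by picking a uniformly random
    enabled event at every step."""
    trace = []
    while True:
        enabled = state.enabled_events()  # hypothetical: currently runnable events
        if not enabled:                   # terminating state reached
            return trace
        event = random.choice(enabled)    # uniform choice at each step
        state = state.execute(event)      # hypothetical: take one step
        trace.append(event)
```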

Random walk, however, has an unsurprisingly poor probabilistic guarantee of error detection. Consider the program in Fig. 1a. The assertion of thread A fails if, and only if, the statement “x = 1” of thread B is executed before the assertion. Without knowing a priori which order (between the assertion and “x = 1”) triggers this failure, we should sample both orders uniformly, because the probabilistic guarantee of detecting this error is the minimum sampling probability of the two orders. Unfortunately, random walk may yield extremely non-uniform sampling probabilities when only a small number of events matter. In this example, to trigger the failure, the assertion of thread A has to be delayed (or not picked) m times in random walk, making its probabilistic guarantee as low as \(1{/}2^{m}\).

Fig. 1. (a) An example illustrating random walk’s weakness in the probabilistic guarantee of error detection, where variable x is initially 0; (b) an example illustrating PCT’s redundancy in exploring partial orders.

To sample different orders more uniformly, Probabilistic Concurrency Testing (PCT) [4] relies on a user-provided parameter d, the number of events to delay: it randomly picks d events within the execution and inserts a preemption before each of them. Since the events are picked randomly, the corresponding interleaving space is sampled more uniformly, resulting in a much stronger probabilistic guarantee than random walk. Consider the program in Fig. 1a again. To trigger the failure, no event needs to be delayed; it suffices that the right thread (i.e., thread B) runs first. Thus, the probability of triggering (or avoiding) the failure is \(1{/}2\), which is much higher than \(1{/}2^{m}\).
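The following is a rough sketch of PCT as just described, assuming hypothetical `enabled_threads` and `step` interfaces; the original PCT formulation handles priority change points slightly differently, so this is an illustration under those assumptions rather than a faithful reimplementation:

```python
import random

def pct_sample(state, threads, num_steps, d):
    """Sketch of PCT: a random initial thread priority order plus d
    uniformly chosen preemption points, each of which delays the event
    that was about to execute."""
    priority = list(threads)
    random.shuffle(priority)                      # random priority order
    change_points = set(random.sample(range(num_steps), d))
    trace = []
    for step in range(num_steps):
        enabled = state.enabled_threads()         # hypothetical interface
        if not enabled:
            break                                 # program terminated early
        current = next(t for t in priority if t in enabled)
        if step in change_points:
            # Preempt: demote the running thread to the lowest priority,
            # delaying its pending event past all other threads.
            priority.remove(current)
            priority.append(current)
            current = next(t for t in priority if t in enabled)
        trace.append(state.step(current))         # hypothetical: run one event
    return trace
```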

However, PCT does not consider the partial order of events entailed by a concurrent program, so the explored interleavings are still quite redundant. Consider the example in Fig. 1b. Both A1 and B1 are executed before the barrier and do not race with any statement. Statements A2 and B2 form a race, and so do statements A3 and B3. Depending on how each race is resolved, the program events have four different partial orders in total. However, without considering the effect of the barrier, PCT will attempt to delay A1 or B1 in vain. Furthermore, without considering the races, PCT may first test an interleaving A2 \(\rightarrow\) A3 \(\rightarrow\) B2 \(\rightarrow\) B3 (by delaying A3 and B2), and then test a partial-order-equivalent, and thus completely redundant, interleaving A2 \(\rightarrow\) B2 \(\rightarrow\) A3 \(\rightarrow\) B3 (by delaying A3 and B3). Such redundancy wastes testing resources and weakens PCT’s probabilistic guarantee.

Towards addressing the above challenges, this paper makes three main contributions. First, we present a concurrency testing approach, named partial order sampling (POS), that samples concurrent program executions based on their partial orders and provides strong probabilistic guarantees of error detection. In contrast to the sophisticated algorithms and heavy bookkeeping used in prior POR work, the core algorithm of POS is much more straightforward. In POS, each event is assigned a random priority and, at each step, the event with the highest priority is executed. After each execution, all events that race with the executed event are reassigned fresh random priorities. Since each event has its own priority, POS (1) samples the orders of a group of dependent events uniformly and (2) uses one execution to sample independent event groups in parallel, both of which benefit its probabilistic guarantee. The priority reassignment is also critical. Consider racing events \(e_1\) and \(e_2\), and an initial priority assignment that runs \(e_1\) first. Without the priority reassignment, \(e_2\) may very well be delayed again when a new racing event \(e_3\) occurs, because \(e_2\)’s priority is likely to be small (the reason \(e_2\) was delayed after \(e_1\) in the first place). Priority reassignment ensures that POS samples the two orders of \(e_2\) and \(e_3\) uniformly.

Second, the probabilistic guarantee of POS has been formally analyzed and shown to be exponentially stronger than that of random walk and PCT for general programs. The probability for POS to execute any partial order can be calculated by modeling the ordering constraints as a bipartite graph and computing the probability that these constraints are satisfied by a random priority assignment. Although prior POR work typically has soundness proofs of its space reduction [1, 8], those proofs depend on an exhaustive search strategy, and it is unclear how they can be adapted to randomized algorithms. Some randomized algorithms leverage POR to heuristically avoid redundant exploration, but no formal analysis of their probabilistic guarantees is given [22, 28]. To the best of our knowledge, POS is the first work to sample partial orders with a formal probabilistic guarantee of error detection.

Lastly, POS has been implemented and evaluated using both randomly generated programs and real-world concurrent software in SCTBench [24], such as Firefox’s JavaScript engine SpiderMonkey. Our POS implementation supports shared-memory multithreaded programs using Pthreads. The evaluation results show that POS provided \(134.1{\times }\) stronger overall guarantees than random walk and PCT on randomly generated programs, and detected errors \(2.6{\times }\) faster than random walk and PCT on SCTBench. POS found the six most difficult bugs in SCTBench with the highest probability among all evaluated algorithms and performed the best on 20 of the 32 non-trivial bugs in our evaluation.

Related Work. There is a rich literature on concurrency testing. Systematic testing [9, 14, 18, 28] exhaustively enumerates all possible schedules of a program, which suffers from the state-space explosion problem. Partial order reduction techniques [1, 2, 8, 10] alleviate this problem by avoiding schedules that are redundant under partial order equivalence, but they rely on bookkeeping a massive exploration history to identify redundancy, and it is unclear how they can be applied to sampling methods.

PCT [4] explores schedules containing orderings of small sets of events and provides a probabilistic guarantee of finding bugs that involve rare orders of a small number of events. PCT, however, does not take partial orders into account and becomes ineffective when a large number of events need to be ordered. Also, the need for user-provided parameters weakens the coverage guarantee, as the parameters are often imprecise. Chistikov et al. [5] introduced hitting families to cover all admissible total orders of a set of events. However, this approach may cover redundant total orders that correspond to the same partial order. RAPOS [22] leverages ideas from partial order reduction and resembles our work in its goal, but does not provide a formal proof of its probabilistic guarantee. Our micro-benchmarks show that POS has a \(5.0{\times }\) overall advantage over RAPOS (see Sect. 6.1).

Coverage-driven concurrency testing [26, 32] leverages relaxed coverage metrics to discover rarely explored interleavings. Directed testing [21, 23] focuses on exploring specific types of interleavings, such as data races and atomicity violations, to reveal bugs. There is a large body of other work on detecting concurrency bugs using static analysis [19, 25] or dynamic analysis [7, 15, 20]. But none of them can be effectively applied to real-world software systems while still providing formal probabilistic guarantees.

2 Running Example

Figure 2 shows the running example of this paper. In this example, we assume that memory accesses are sequentially consistent and that all shared variables (e.g., x, w, etc.) are initialized to 0. The program consists of two threads, A and B. Thread B blocks at B4 on wait(w) until \(\mathtt{w} > 0\). Thread A sets w to 1 at A3 via signal(w), unblocking thread B. The assertion at A4 fails if, and only if, the program is executed in the following total order:

$$ \mathtt{B1}\rightarrow \mathtt{A1}\rightarrow \mathtt{B2}\rightarrow \mathtt{B3}\rightarrow \mathtt{A2}\rightarrow \mathtt{A3}\rightarrow \mathtt{B4}\rightarrow \mathtt{B5}\rightarrow \mathtt{B6}\rightarrow \mathtt{A4} $$
Fig. 2. The running example involving two threads.

To detect this bug, random walk has to make the correct choice at every step. Among the ten steps, three have only a single option: A2 and A3 must be executed to enable B4, and A4 is the only statement left at the last step. Thus, the probability of reaching the bug is \(1{/}2^7 = 1{/}128\). As for PCT, two preemption points have to be inserted just before statements B2 and A2 among the ten statements; the probability for PCT is thus \(1{/}10 \times 1{/}10 \times 1{/}2 = 1{/}200\), where the factor \(1{/}2\) comes from the requirement that thread B be executed first.

In POS, this bug is detected with a substantial probability of \(1{/}48\), much higher than with the other approaches. Indeed, our formal guarantees ensure that any behavior of this program is covered with a probability of at least \(1{/}60\).

3 Preliminaries

Concurrent Machine Model. Our concurrent abstract machine models a finite set of processes and a set of shared objects. The machine state, denoted s, consists of the local state of each process and the state of the shared objects. The abstract machine assumes sequential consistency and allows arbitrary interleaving among all processes. At each step, starting from s, any running process can be selected to make a move, updating the state to \(s'\) and generating an event e, denoted \(s\xrightarrow {e}{}s'\).

An event e is a tuple \(e:= (\mathtt {pid}{}, \mathtt {intr}{}, \mathtt {obj}{}, \mathtt {ind}{})\), where \(\mathtt {pid}{}\) is the process ID, \(\mathtt {intr}{}\) is the statement (or instruction) pointer, \(\mathtt {obj}{}\) is the shared object accessed by this step (we assume each statement accesses at most a single shared object), and \(\mathtt {ind}{}\) indicates how many times this \(\mathtt {intr}{}\) has been executed, distinguishing different runs of the same instruction. For example, the execution of the statement “A2: y++” in Fig. 2 generates the event \(\mathtt{(A, A2, y, 0)}\). Such an event captures the information of the corresponding step and can be used to replay the execution: given the starting state s and the event e, the resulting state \(s'\) of a step “\(\xrightarrow {e}{}\)” is determined.
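A minimal encoding of this event tuple, with the event from Fig. 2 as an example (the field names simply mirror the definitions above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    pid: str   # process ID
    intr: str  # statement (instruction) pointer
    obj: str   # the single shared object this step accesses
    ind: int   # how many runs of this statement came before

# The execution of "A2: y++" in Fig. 2 generates:
event = Event(pid="A", intr="A2", obj="y", ind=0)
```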

A trace t is a list of events generated by a sequence of program transitions (or steps) starting from the initial machine state (denoted as \(s_0\)). For example, the following program execution:

$$\begin{aligned} s_0 \xrightarrow {e_0}{} s_1 \xrightarrow {e_1}{}\cdots \xrightarrow {e_n}{} s_{n+1} \end{aligned}$$

generates the trace \(t = e_0 :: e_1 :: \cdots :: e_n\), where the symbol “\(::\)” means “cons-ing” (appending) an event to the trace. Trace events can be accessed by index (e.g., \(t[1] = e_1\)).

A trace can be used to replay a sequence of executions. In other words, given the initial machine state \(s_0\) and the trace t, the resulting state of running t (denoted as “\(\mathtt {State}(t)\)”) is determined.

We write \(\mathtt {En}(s) := \{e \mid \exists s', s \xrightarrow {e}{} s'\}\) for the set of events enabled (or allowed to be executed) at state s. Take the program in Fig. 2 as an example: initially, both A1 and B1 can be executed, and the corresponding two events form the enabled set \(\mathtt {En}(s_0)\). The blocking wait at B4, however, can be enabled only after being signaled at A3. A state s is called a terminating state if, and only if, \(\mathtt {En}(s) = \emptyset \). We assume that any disabled event will eventually become enabled and that every execution must end in either a terminating state or an error state; this implies that all traces are finite. For readability, we often abbreviate \(\mathtt {En}(\mathtt {State}(t))\), i.e., the enabled event set after executing trace t, as \(\mathtt {En}(t)\).

Partial Order of Traces. Two events \(e_0\) and \(e_1\) are called independent events (denoted as \(e_0 \bot e_1\)) if, and only if, they neither belong to the same process nor access the same object:

$$ e_0 \bot e_1 := \,\, (e_0.\mathtt {pid}\ne e_1.\mathtt {pid}) \ \wedge \ (e_0.\mathtt {obj}\ne e_1.\mathtt {obj}) $$

The execution order of independent events does not affect the resulting state. If a trace t can be generated by swapping adjacent and independent events of another trace \(t'\), then these two traces t and \(t'\) are partial order equivalent. Intuitively, partial order equivalent traces are guaranteed to lead the program to the same state. The partial order of a trace is characterized by the orders between all dependent events plus their transitive closure. Given a trace t, its partial order relation “\(\sqsubset _t\)” is defined as the minimal relation over its events that satisfies:

  1. \(\forall i\ j,\ i < j\ \wedge \ t[i] \not \perp t[j]\ \implies \ t[i] \sqsubset _t t[j]\)

  2. \(\forall i\ j\ k,\ t[i] \sqsubset _t t[j]\ \wedge \ t[j] \sqsubset _t t[k]\ \implies \ t[i] \sqsubset _t t[k]\)

Two traces with the same partial order relation and the same event set must be partial order equivalent.
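The two conditions above translate directly into code. The sketch below, reusing the hypothetical `Event` encoding shown earlier, computes \(\sqsubset _t\) by collecting the trace order of all dependent event pairs and closing it under transitivity (a naive fixed point, adequate for short traces):

```python
def independent(e0, e1):
    """e0 ⊥ e1: different processes and different objects."""
    return e0.pid != e1.pid and e0.obj != e1.obj

def partial_order(trace):
    """Compute ⊏_t as a set of ordered event pairs: condition (1) orders
    all dependent pairs by trace position; condition (2) is transitivity."""
    n = len(trace)
    order = {(i, j) for i in range(n) for j in range(i + 1, n)
             if not independent(trace[i], trace[j])}
    changed = True
    while changed:                      # naive transitive closure
        changed = False
        for (i, j) in list(order):
            for (j2, k) in list(order):
                if j == j2 and (i, k) not in order:
                    order.add((i, k))
                    changed = True
    return {(trace[i], trace[j]) for (i, j) in order}
```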

Given an event order \(\mathcal {E}\) and its order relation \(\sqsubset _\mathcal {E}\), we say a trace t follows \(\mathcal {E}\) and write “\(t \simeq \mathcal {E}\)” if, and only if,

$$ \forall e_0\ e_1,\ e_0 \sqsubset _t e_1 \implies e_0 \sqsubset _\mathcal {E}e_1 $$

We write “\(t \models \mathcal {E}\)” to denote that \(\mathcal {E}\) is exactly the partial order of trace t:

$$ t \models \mathcal {E}:= \quad \forall e_0\ e_1,\ e_0 \sqsubset _t e_1 \iff e_0 \sqsubset _\mathcal {E}e_1 $$

Probabilistic Error-Detection Guarantees. Each partial order of a concurrent program may lead to a different, potentially incorrect, outcome. Therefore, every possible partial order has to be explored. The minimum probability over these explorations is called the probabilistic error-detection guarantee of a randomized scheduler.

Algorithm 1 presents a framework to formally reason about this guarantee. A sampling procedure \(\mathrm {Sample}\) samples a terminating trace t of a program. It starts with the empty trace and repeatedly invokes a randomized scheduler (denoted as \(\mathtt {Sch}\)) to append an event to the trace until the program terminates. The randomized scheduler \(\mathtt {Sch}\) selects an enabled event from \(\mathtt {En}(t)\) and the randomness comes from the random variable parameter, i.e., \(\mathtt {R}\).

A naive scheduler can be purely random without any strategy. A sophisticated scheduler may utilize additional information, such as the properties of the current trace and the enabled event set.

Algorithm 1. The sampling framework \(\mathrm {Sample}\).
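A sketch of this framework, with `state.enabled()` and `state.execute(e)` as hypothetical stand-ins for the abstract machine:

```python
def sample(sch, r, initial_state):
    """The sampling framework: repeatedly ask the randomized scheduler
    `sch` for the next event until the program terminates; `r` carries
    the scheduler's randomness (the random variable R)."""
    trace, state = [], initial_state
    while True:
        enabled = state.enabled()       # En(t); hypothetical interface
        if not enabled:                 # terminating state: En(t) is empty
            return trace
        event = sch(trace, enabled, r)  # scheduler picks one enabled event
        state = state.execute(event)
        trace.append(event)
```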

Given the randomized scheduler \(\mathtt {Sch}\) on \(\mathtt {R}\) and any partial order \(\mathcal {E}\) of a program, we write “\(P(\mathrm {Sample}(\mathtt {Sch}, \mathtt {R}) \models \mathcal {E})\)” to denote the probability of covering \(\mathcal {E}\), i.e., generating a trace whose partial order is exactly \(\mathcal {E}\) using Algorithm 1. The probabilistic error-detection guarantee of the scheduler \(\mathtt {Sch}\) on \(\mathtt {R}\) is then defined as the minimum probability of covering the partial order \(\mathcal {E}\) of any terminating trace of the program:

$$ \min _\mathcal {E}P(\mathrm {Sample}(\mathtt {Sch}, \mathtt {R}) \models \mathcal {E}) $$

4 POS - Algorithm and Analysis

In this section, we first present BasicPOS, a priority-based scheduler, and analyze its probability of covering a given partial order (see Sect. 4.1). Based on this analysis, we then show that such a priority-based algorithm can be dramatically improved by introducing priority reassignment, resulting in our POS algorithm (see Sect. 4.2). Finally, we show how to calculate the probabilistic error-detection guarantee of POS on general programs (see Sect. 4.3).

4.1 BasicPOS

In BasicPOS, each event is associated with a random and immutable priority and, at each step, the enabled event with the highest priority is picked to execute. We use \(\mathtt {Pri}\) to denote the map from events to priorities and describe BasicPOS in Algorithm 2, which instantiates the random variable \(\mathtt {R}\) of Algorithm 1 with \(\mathtt {Pri}\). The priority \(\mathtt {Pri}(e)\) of each event e is independent of the others and follows the uniform distribution \(\mathcal {U}(0,1)\).

Algorithm 2. BasicPOS.
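BasicPOS can be expressed as a scheduler for the `sample` framework sketched above; drawing each priority lazily on first use is distributionally equivalent to fixing the whole map up front, since every event's priority is an independent \(\mathcal {U}(0,1)\) draw:

```python
import random

def basic_pos(trace, enabled, pri):
    """BasicPOS as a scheduler: each event keeps one immutable priority
    drawn from U(0, 1); the enabled event with the highest priority runs."""
    for e in enabled:
        if e not in pri:
            pri[e] = random.random()    # drawn once, never changed
    return max(enabled, key=lambda e: pri[e])

# Usage: trace = sample(basic_pos, {}, initial_state)
```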

We now consider under what condition BasicPOS samples a trace that follows a given partial order \(\mathcal {E}\) of a program. It requires that the generated trace t, at the end of each loop iteration (line 5 in Algorithm 2), satisfies the invariant “\(t \simeq \mathcal {E}\)”. Thus, the event priorities have to be properly ordered such that, given a trace t satisfying “\(t \simeq \mathcal {E}\)”, the enabled event \(e^*\) with the highest priority satisfies “\(t :: e^* \simeq \mathcal {E}\)”. In other words, given “\(t \simeq \mathcal {E}\)”, for any \(e \in \mathtt {En}(t)\) with “\(t :: e \not \simeq \mathcal {E}\)”, there must be some \(e'\in \mathtt {En}(t)\) satisfying “\(t :: e' \simeq \mathcal {E}\)” and given a higher priority by the priority map, i.e., \(\mathtt {Pri}(e') > \mathtt {Pri}(e)\), so that e will not be selected as the event \(e^*\) at line 4 in Algorithm 2. The following Lemma 1 shows that such an event \(e'\) always exists:

Lemma 1

Given the partial order \(\mathcal {E}\) of a terminating trace, for any non-terminated trace t with \(t \simeq \mathcal {E}\) and any \(e \in \mathtt {En}(t)\) with \(t :: e \not \simeq \mathcal {E}\), there exists \(e' \in \mathtt {En}(t)\) such that \(e' \sqsubset _\mathcal {E}e\) and \(t :: e' \simeq \mathcal {E}\).

Proof

We prove it by contradiction. Since traces are finite, assume that some traces are counterexamples to the lemma and that t is the longest such trace. In other words, we have \(t \simeq \mathcal {E}\) and there exists \(e \in \mathtt {En}(t)\) with \(t :: e \not \simeq \mathcal {E}\) such that:

$$ \forall e' \in \mathtt {En}(t),\ e' \sqsubset _\mathcal {E}e \implies t :: e' \not \simeq \mathcal {E} $$
(1)

Since \(\mathcal {E}\) is the partial order of a terminating trace and the trace t has not terminated yet, there must exist an event \(e' \in \mathtt {En}(t)\) such that \(t :: e' \simeq \mathcal {E}\). Let \(t' = t :: e'\); by (1), we have \(\lnot (e' \sqsubset _\mathcal {E}e)\) and:

  1. \(t' \simeq \mathcal {E}\);

  2. \(e \in \mathtt {En}(t')\) and \(t' :: e \not \simeq \mathcal {E}\);

  3. \(\forall e'' \in \mathtt {En}(t'),\ e'' \sqsubset _\mathcal {E}e \implies t' :: e'' \not \simeq \mathcal {E}\).

The first two statements are intuitive. The third one also holds: otherwise, \(e' \sqsubset _\mathcal {E}e\) would be implied by the transitivity of partial orders through such an \(e''\). Thus, \(t'\) is a counterexample that is longer than t, contradicting our assumption.    \(\square \)

Thanks to Lemma 1, we then only need to construct a priority map such that this \(e'\) has a higher priority. Let “\(e \bowtie _\mathcal {E}e' := \exists t,\ t \simeq \mathcal {E} \wedge \{e, e'\} \subseteq \mathtt {En}(t)\)” denote that e and \(e'\) can be simultaneously enabled under \(\mathcal {E}\). We write

$$ \mathtt {PS}_\mathcal {E}(e) :=\ \{ e' \mid e' \sqsubset _\mathcal {E}e\ \wedge \ e \bowtie _\mathcal {E}e' \} $$

as the set of events that can be simultaneously enabled with e but have to be selected prior to e in order to follow \(\mathcal {E}\). Any \(e'\) specified by Lemma 1 must belong to \(\mathtt {PS}_\mathcal {E}(e)\). Let \(V_\mathcal {E}\) be the event set ordered by \(\mathcal {E}\). The priority map \(\mathtt {Pri}\) can be constructed as below:

$$\begin{aligned} \bigwedge _{e \in V_\mathcal {E},\ e' \in \mathtt {PS}_\mathcal {E}(e)} \mathtt {Pri}(e) < \mathtt {Pri}(e') \end{aligned}$$
(Cond-BasicPOS)

The traces sampled by BasicPOS using this \(\mathtt {Pri}\) will always follow \(\mathcal {E}\).

Although (Cond-BasicPOS) is not a necessary condition for sampling a trace that follows the desired partial order, in our observation it gives a good estimate of the worst cases. This points to the major weakness of BasicPOS: the propagation of priority constraints. An event e with a large \(\mathtt {PS}_\mathcal {E}(e)\) set may end up with a relatively low priority, since its priority has to be lower than those of all the events in \(\mathtt {PS}_\mathcal {E}(e)\). Thus, for any simultaneously enabled event \(e'\) that has to be delayed after e, \(\mathtt {Pri}(e')\) must be even smaller than \(\mathtt {Pri}(e)\), which is unnecessarily hard to satisfy for a random \(\mathtt {Pri}(e')\). Due to this constraint propagation, the probability that a priority map \(\mathtt {Pri}\) satisfies (Cond-BasicPOS) can be as low as \(1/|V_\mathcal {E}|!\).

Algorithm 3. POS.

4.2 POS

We will now show how to improve BasicPOS by eliminating the propagation of priority constraints. Consider the situation in which an event e, delayed at some trace t, becomes eligible to schedule right after some \(e'\) is scheduled, i.e.,

$$ t :: e \not \simeq \mathcal {E} \quad \wedge \quad (t :: e') :: e \simeq \mathcal {E} $$

If we reset the priority of e right after scheduling \(e'\), none of the constraints that caused the delay of e will be propagated to any event \(e''\) with \(e \in \mathtt {PS}_\mathcal {E}(e'')\). However, there is no way to know which e should be reset after \(e'\) during the sampling, since \(\mathcal {E}\) is unknown and not provided. Notice that such an event e depends on \(e'\) and, since both are simultaneously enabled and thus from different processes, e must access the same object as \(e'\). Therefore, if we reset the priorities of all the events that access the same object as \(e'\), the propagation of priority constraints is also eliminated.

To analyze how POS manages to follow \(\mathcal {E}\) under this reassignment scheme, we have to model how many priorities need to be reset at each step. Note that blindly reassigning the priorities of all delayed events at each step would be sub-optimal: it degenerates the algorithm to random walk. To give a formal and more precise analysis, we introduce the object index functions for a trace t and a partial order \(\mathcal {E}\):

$$ \begin{array}{ll} \mathtt {I}(t, e) :=\ |\, \{ e' \mid e' \in t \ \wedge \ e.\mathtt {obj}= e'.\mathtt {obj}\}\, | \\ \mathtt {I}_\mathcal {E}(e) :=\ |\, \{ e' \mid e' \sqsubset _\mathcal {E}e \ \wedge \ e.\mathtt {obj}= e'.\mathtt {obj}\}\, | \\ \end{array} $$

Intuitively, when \(e \in \mathtt {En}(t)\), scheduling e on t will operate \(e.\mathtt {obj}\) after \(\mathtt {I}(t, e)\) previous events. A trace t follows \(\mathcal {E}\) if every step (indicated by t[i]) operates the object \(t[i].\mathtt {obj}\) after \(\mathtt {I}_\mathcal {E}(t[i])\) previous events in the trace.

We then index (or version) the priority of event e using the index function, as \(\mathtt {Pri}(e, \mathtt {I}(t, e))\), and introduce POS in Algorithm 3. By proving that

$$ \forall e\ e',\ e.\mathtt {obj}= e'.\mathtt {obj}\implies \mathtt {I}(t :: e,\ e') = \mathtt {I}(t, e') + 1 $$

we have that scheduling an event e increases the priority version of all the events accessing \(e.\mathtt {obj}\), resulting in the priority reassignment.
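A sketch of this versioning scheme as a scheduler for the `sample` framework from Sect. 3; as with BasicPOS, priorities are drawn lazily per (event, version) key, which is distributionally equivalent to drawing them all up front:

```python
import random

def obj_index(trace, e):
    """I(t, e): the number of earlier events in the trace on e's object."""
    return sum(1 for prev in trace if prev.obj == e.obj)

def pos(trace, enabled, pri):
    """POS as a scheduler: priorities are versioned by the object index,
    so executing an event on some object implicitly hands every event on
    that object a fresh random priority at its next version."""
    def priority(e):
        key = (e, obj_index(trace, e))   # Pri(e, I(t, e))
        if key not in pri:
            pri[key] = random.random()
        return pri[key]
    return max(enabled, key=priority)

# Usage: trace = sample(pos, {}, initial_state)
```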

We can then prove that a trace t follows \(\mathcal {E}\) if, and only if, each of its events e is scheduled exactly when its object index reaches \(\mathtt {I}_\mathcal {E}(e)\). To ensure that the selection of \(e^*\) on trace t follows \(\mathcal {E}\) at line 4 of Algorithm 3, any e satisfying \(\mathtt {I}(t, e) < \mathtt {I}_\mathcal {E}(e)\) has to have a smaller priority than some \(e'\) satisfying \(\mathtt {I}(t, e') = \mathtt {I}_\mathcal {E}(e')\), and such an \(e'\) must exist by Lemma 1. In this way, the priority constraints for POS to sample \(\mathcal {E}\) are as below:

$$ \bigwedge \mathtt {Pri}(e, i)< \mathtt {Pri}(e', \mathtt {I}_\mathcal {E}(e')) \text { for some } i < \mathtt {I}_\mathcal {E}(e) $$

which is bipartite and the propagation of priority constraints is eliminated. The effectiveness of POS is guaranteed by Theorem 1.

Theorem 1

Given any partial order \(\mathcal {E}\) of a program with \(\mathcal {P}> 1\) processes, let

$$ D_\mathcal {E}:= |\, \{ (e, e') \mid e \sqsubset _\mathcal {E}e'\ \wedge \ e \not \perp e' \ \wedge \ e \bowtie _\mathcal {E}e' \}\,| $$

be the number of races in \(\mathcal {E}\), we have that

  1.

    \(D_\mathcal {E}\le |V_\mathcal {E}| \times (\mathcal {P}- 1)\), and

  2.

    POS has at least the following probability to sample a trace \(t \simeq \mathcal {E}\):

    $$ \left( \frac{1}{\mathcal {P}}\right) ^{|V_\mathcal {E}|} R^U $$

    where \(R = \mathcal {P}\times |V_\mathcal {E}|/(|V_\mathcal {E}|+D_\mathcal {E}) \ge 1\) and \(U = (|V_\mathcal {E}|-\lceil D_\mathcal {E}/(\mathcal {P}- 1) \rceil ) / 2 \ge 0\)

Please refer to the technical report [33] for the detailed proof and the construction of priority constraints.
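For concreteness, the Theorem 1 bound is straightforward to evaluate numerically; the helper below is a hypothetical convenience for plugging in values, not part of the paper's artifact:

```python
from math import ceil

def pos_lower_bound(v, d, p):
    """Evaluate the Theorem 1 lower bound (1/P)^|V_E| * R^U for
    |V_E| = v events, D_E = d races, and P = p > 1 processes."""
    assert p > 1 and d <= v * (p - 1)   # Theorem 1's first statement
    r = p * v / (v + d)                 # R >= 1
    u = (v - ceil(d / (p - 1))) / 2     # U >= 0
    return (1 / p) ** v * r ** u
```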


4.3 Probability Guarantee of POS on General Programs

We now analyze how POS performs on general programs compared to random walk and PCT. Consider a program with \(\mathcal {P}\) processes and \(\mathcal {N}\) total events. It is common for a program to have a substantial number of non-racing events, for example, accesses to shared variables protected by locks, semaphores, or condition variables. We assume that there exists a ratio \(0 \le \alpha \le 1\) such that any partial order contains at least \(\alpha \mathcal {N}\) non-racing events.

Under this assumption, for random walk, we can construct an adversarial program with a worst-case probability of \(1/\mathcal {P}^\mathcal {N}\) for almost any \(\alpha \) [33]. For PCT, since only the order of the \((1-\alpha )\mathcal {N}\) racing events may affect the partial order, the number of preemptions needed for a partial order in the worst case becomes \((1-\alpha )\mathcal {N}\), and thus the worst-case probability bound is \(1/\mathcal {N}^{(1-\alpha )\mathcal {N}}\). For POS, the number of races \(D_\mathcal {E}\) is at most \((1-\alpha ) \mathcal {N}\times (\mathcal {P}- 1)\) in the worst case, and Theorem 1 guarantees the probability lower bound

$$ \frac{1}{\mathcal {P}^\mathcal {N}} \left( \frac{1}{1 - (1-1/\mathcal {P})\alpha }\right) ^{\alpha \mathcal {N}/ 2} $$

Thus, POS outperforms random walk when \(\alpha > 0\) and degenerates to random walk when \(\alpha = 0\). Also, POS outperforms PCT if \(\mathcal {N}> \mathcal {P}\) (when \(\alpha = 0\)) or \(\mathcal {N}^{1/\alpha - 1} > \mathcal {P}^{1/\alpha }\sqrt{1+\alpha /\mathcal {P}-\alpha }\) (when \(0< \alpha < 1\)). For example, when \(\mathcal {P}= 2\) and \(\alpha = 1{/}2\), POS outperforms PCT if \(\mathcal {N}>2\sqrt{3}\); in other words, in this case, POS is better than PCT once there are at least four total events.
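The comparison condition can be checked numerically; the snippet below is a hypothetical helper that evaluates the PCT-comparison inequality above:

```python
from math import sqrt

def pos_beats_pct(n, p, alpha):
    """Check N^(1/alpha - 1) > P^(1/alpha) * sqrt(1 + alpha/P - alpha),
    the condition (for 0 < alpha < 1) under which POS's bound beats PCT's."""
    return n ** (1 / alpha - 1) > p ** (1 / alpha) * sqrt(1 + alpha / p - alpha)

# The example above: p = 2, alpha = 1/2 reduces to n > 2*sqrt(3) ~= 3.46.
print(pos_beats_pct(4, 2, 0.5))  # True: four total events suffice
```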

5 Implementation

The POS algorithm assumes a pre-determined priority map, whereas the implementation can decide event priorities on demand as new events appear. The implementation of POS is shown in Algorithm 4, where lines 14–18 perform the priority reassignment. The variable s represents the current program state with the following interfaces:

  • \(s.\mathrm {Enabled}()\) returns the current set of enabled events.

  • \(s.\mathrm {Execute}(e)\) returns the resulting state after executing e in the state of s.

  • \(s.\mathrm {IsRacing}(e, e')\) returns if there is a race between e and \(e'\).

In the algorithm, if a race is detected during the scheduling, the priority of the delayed event in the race will be removed and then be reassigned at lines 6–9.

Algorithm 4. The implementation of POS.
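A condensed sketch of Algorithm 4 around the three state interfaces listed above; the `pri` map and control flow are illustrative, and the exact line structure of the algorithm differs:

```python
import random

def pos_run(s):
    """One sampled execution: priorities are assigned on demand, and after
    each step every delayed event that races with the executed event loses
    its priority, to be re-drawn at the next iteration."""
    pri, trace = {}, []
    while True:
        enabled = s.Enabled()
        if not enabled:
            return trace
        for e in enabled:                # on-demand priority assignment
            if e not in pri:
                pri[e] = random.random()
        winner = max(enabled, key=lambda e: pri[e])
        s = s.Execute(winner)
        trace.append(winner)
        for e in enabled:                # priority reassignment
            if e != winner and s.IsRacing(winner, e):
                del pri[e]               # fresh priority next round
```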

Relaxation for Read-Only Events. The abstract interface \(s.\mathrm {IsRacing}(\ldots )\) allows us to relax our model for read-only events. When both e and \(e'\) are read-only events, \(s.\mathrm {IsRacing}(e, e')\) returns false even if they are accessing the same object. Our evaluations show that this relaxation improves the execution time of POS.

Fairness Workaround. POS is probabilistically fair: for an enabled event e with priority \(p > 0\), the probability that e is delayed for k steps without racing is at most \((1-p^{\mathcal {P}})^k\), which tends to 0 as \(k \rightarrow \infty \). However, POS may still delay events for a prolonged time, slowing down the test. To alleviate this, the current implementation resets all event priorities after every \(10^3\) voluntary context-switch events, e.g., sched_yield() calls. This only speeds up a few benchmark programs that contain busy loops (the sched_yield() calls were added by the SCTBench creators) and has minimal impact on the probability of hitting bugs.

6 Evaluation

To understand the performance of POS and compare it with other sampling methods, we conducted experiments on both micro benchmarks (automatically generated programs) and macro benchmarks (including real-world programs).

6.1 Micro Benchmark

We generated programs with a small number of static events as the micro benchmarks. We assumed multi-threaded programs with t threads, where each thread executes m events accessing o objects. To keep the program space tractable, we chose \(t = m = o = 4\), resulting in 16 total events. To simulate different object access patterns in real programs, we randomly distributed the events accessing different objects with the following configurations (a generator sketch follows the list):

  • Each object has respectively {4,4,4,4} accessing events. (Uniform)

  • Each object has respectively {2,2,6,6} accessing events. (Skewed)
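A hypothetical generator in this spirit (the actual benchmark generator is not described beyond these parameters) might look as follows; it only decides which object each event slot accesses:

```python
import random

def generate_program(threads=4, events_per_thread=4,
                     accesses_per_object=(4, 4, 4, 4)):
    """Randomly distribute object accesses over the t*m event slots.
    Returns, per thread, the list of objects its events access in order."""
    assert sum(accesses_per_object) == threads * events_per_thread
    # Build the multiset of accesses, e.g. {4,4,4,4} or {2,2,6,6}.
    slots = [obj for obj, n in enumerate(accesses_per_object) for _ in range(n)]
    random.shuffle(slots)
    return [slots[i * events_per_thread:(i + 1) * events_per_thread]
            for i in range(threads)]

# Uniform configuration: generate_program(accesses_per_object=(4, 4, 4, 4))
# Skewed configuration:  generate_program(accesses_per_object=(2, 2, 6, 6))
```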

Table 1. Coverage on the micro benchmark programs. Columns under “benchmark” are program characteristics explained in Sect. 6.1. “0(x)” represents incomplete coverage.
Table 2. Coverage on the micro benchmark programs - 50% read

The results are shown in Table 1. The benchmark columns show the characteristics of each generated program, including (1) the configuration used for generating the program; (2) the number of distinct partial orders in the program; (3) the maximum number of preemptions needed for covering all partial orders; and (4) the maximum number of races in any partial order. We measured the coverage of each sampling method on each program by the minimum hit ratio on any partial order of the program. On every program, we ran each sampling method \(5\times 10^7\) times (except for random walk, for which we calculated the exact probabilities). If a program was not fully covered by an algorithm within the sample limit, the coverage is denoted “0(x)”, where x is the number of covered partial orders. We let PCT use the exact number of preemptions needed for each case. We tweaked PCT to improve its coverage by adding a dummy event at the beginning of each thread, as otherwise PCT cannot preempt the actual first event of each thread. The results show that POS performed the best among all algorithms. For each algorithm, we calculated the overall performance as the geometric mean of the coverage. POS overall performed \({\sim } 7.0{\times }\) better than the other algorithms (\({\sim } 134.1{\times }\) excluding RAPOS and BasicPOS).

To understand our relaxation for read-only events, we generated another set of programs with the same configurations, but with half of the events read-only. The results are shown in Table 2, where the relaxed algorithm is denoted POS\(^*\). Overall, POS\(^*\) performed \({\sim } 1.4{\times }\) as well as POS and \({\sim } 5.0{\times }\) better than the other algorithms (\({\sim } 226.4{\times }\) excluding RAPOS and BasicPOS).

6.2 Macro Benchmark

We used SCTBench [24], a collection of concurrency bugs in multi-threaded programs, to evaluate POS on practical programs. SCTBench collects 49 concurrency bugs from previous parallel workloads [3, 27] and concurrency testing/verification work [4, 6, 18, 21, 31]. SCTBench comes with a concurrency testing tool, Maple [32], which intercepts pthread primitives and shared-memory accesses and controls their interleaving. When a bug is triggered, it is caught by Maple and reported. We implemented POS, with the relaxation for read-only events, in Maple. Each sampling method was evaluated on each SCTBench case by the ratio of bug hits to tries. For each case, we ran each sampling method until the number of tries reached \(10^4\). We recorded the bug hit count h and the total run count t, and calculated the hit ratio as \(h/t\).

Two cases in SCTBench were not adopted, parsec-2.0-streamcluster2 and radbench-bug1, because none of the algorithms could hit their bugs even once, which conflicts with previously reported results. We strengthened the case safestack-bug1 by internally repeating the case \(10^4\) times (and shrinking the run limit to 500); this amortizes the per-run overhead of Maple, which can take up to a few seconds. We modified PCT to reset at every internal loop. We evaluated variants of the PCT algorithm, PCT-d, representing PCT with \(d-1\) preemption points, to reduce the disadvantage of a sub-optimal d. The results are shown in Table 3. We omit cases in which all algorithms hit the bugs in more than half of their tries. The cases are sorted by the minimum hit ratio across algorithms. The performance of each algorithm is aggregated as the geometric mean of its hit ratios on every case. The best hit ratio for each case is marked in blue.

Table 3. Bug hit ratios on macro benchmark programs

The results of the macro benchmark experiments are highlighted below:

  • Overall, POS performed the best at hitting bugs in SCTBench. The geometric mean of POS’s hit ratios is \({\sim } 2.6 {\times }\) better than PCT’s and \({\sim } 4.7{\times }\) better than random walk’s. Because the buggy interleavings in each case are not necessarily the most difficult ones to sample, POS does not outperform the others as overwhelmingly as in the micro benchmarks.

  • Among the 32 cases shown in the table, POS performed the best of all algorithms in 20 cases, while PCT variants were the best in 10 cases and random walk in three.

  • POS hit every bug in SCTBench, while all PCT variants missed one case within the limit (and had one case with a hit ratio of 0.0002), and random walk missed three cases (and had one case with a hit ratio of 0.0003).

7 Conclusion

We have presented POS, a concurrency testing approach to sample the partial order of concurrent programs. POS’s core algorithm is simple and lightweight: (1) assign a random priority to each event in a program; (2) repeatedly execute the event with the highest priority; and (3) after executing an event, reassign its racing events with random priorities. We have formally shown that POS has an exponentially stronger probabilistic error-detection guarantee than existing randomized scheduling algorithms. Evaluations have shown that POS is effective in covering the partial-order space of micro-benchmarks and finding concurrency bugs in real-world programs such as Firefox’s JavaScript engine SpiderMonkey.