Compositional Verification of Compiler Optimisations on Relaxed Memory
Abstract
A valid compiler optimisation transforms a block in a program without introducing new observable behaviours to the program as a whole. Deciding which optimisations are valid can be difficult, and depends closely on the semantic model of the programming language. Axiomatic relaxed models, such as C++11, present particular challenges for determining validity, because such models allow subtle effects of a block transformation to be observed by the rest of the program. In this paper we present a denotational theory that captures optimisation validity on an axiomatic model corresponding to a fragment of C++11. Our theory allows verifying an optimisation compositionally, by considering only the block it transforms instead of the whole program. Using this property, we realise the theory in the first push-button tool that can verify real-world optimisations under an axiomatic memory model.
1 Introduction
Context and Objectives. Any program defines a collection of observable behaviours: a sorting algorithm maps unsorted to sorted sequences, and a paint program responds to mouse clicks by updating a rendering. It is often desirable to transform a program without introducing new observable behaviours – for example, in a compiler optimisation or programmer refactoring. Such transformations are called observational refinements, and they ensure that properties of the original program will carry over to the transformed version. It is also desirable for transformations to be compositional, meaning that they can be applied to a block of code irrespective of the surrounding program context. Compositional transformations are particularly useful for automated systems such as compilers, where they are known as peephole optimisations.
The semantics of the language is highly significant in determining which transformations are valid, because it determines the ways that a block of code being transformed can interact with its context and thereby affect the observable behaviour of the whole program. Our work applies to a relaxed memory concurrent setting. Thus, the context of a code-block includes both code sequentially before and after the block, and code that runs in parallel. Relaxed memory means that different threads can observe different, apparently contradictory orders of events – such behaviour is permitted by programming languages to reflect CPU-level relaxations and to allow compiler optimisations.
We focus on axiomatic memory models of the type used in C/C++ and Java. In axiomatic models, program executions are represented by structures of memory actions and relations on them, and program semantics is defined by a set of axioms constraining these structures. Reasoning about the correctness of program transformations on such memory models is very challenging, and indeed, compiler optimisations have been repeatedly shown unsound with respect to models they were intended to support [23, 25]. The fundamental difficulty is that axiomatic models are defined in a global, non-compositional way, making it very challenging to reason compositionally about the single code-block being transformed.
Approach. Suppose we have a code-block B, embedded into an unknown program context. We define a denotation for the code-block which summarises its behaviour in a restricted representative context. The denotation consists of a set of histories which track interactions across the boundary between the code-block and its context, but abstract from internal structure of the code-block. We can then validate a transformation from code-block B to \(B'\) by comparing their denotations. This approach is compositional: it requires reasoning only about the code-blocks and representative contexts; the validity of the transformation in an arbitrary context will follow. It is also fully abstract, meaning that it can verify any valid transformation: considering only representative contexts and histories does not lose generality.
We also define a variant of our denotation that is finite at the cost of losing full abstraction. We achieve this by further restricting the form of contexts one needs to consider in exchange for tracking more information in histories. For example, it is unnecessary to consider executions where two context operations read from the same write.
Using this finite denotation, we implement a prototype verification tool, Stellite. Our tool converts an input transformation into a model in the Alloy language [12], and then checks that the transformation is valid using the Alloy* solver [18]. Our tool can prove or disprove a range of introduction, elimination, and exchange compiler optimisations. Many of these were verified by hand in previous work; our tool verifies them automatically.
Contributions. Our contribution is twofold. First, we define the first fully abstract denotational semantics for an axiomatic relaxed model. Previous proposals in this space targeted either non-relaxed sequential consistency [6] or much more restrictive operational relaxed models [7, 13, 21]. Second, we show it is feasible to automatically verify relaxed-memory program transformations. Previous techniques required laborious proofs by hand or in a proof assistant [23, 24, 25, 26, 27]. Our target model is derived from the C/C++ 2011 standard [22]. However, our aim is not to handle C/C++ per se (especially as the model is in flux in several respects; see Sect. 3.7). Rather we target the simplest axiomatic model rich enough to demonstrate our approach.
2 Observation and Transformation
The language semantics is highly significant in determining observational refinement. For example, the code blocks \(B_1 :\texttt {store(x,5)}\) and \(B_2 :\texttt {store(x,2); store(x,5)}\) are observationally equivalent in a sequential setting. However, in a concurrent setting the intermediate state, \(\mathtt{x} = 2\), can be observed in \(B_2\) but not \(B_1\), meaning the code-blocks are no longer observationally equivalent. In a relaxed-memory setting there is no global state seen by all threads, which further complicates the notion of observation.
Compositional Verification. To establish \(B_1 \preccurlyeq _\mathsf{bl}B_2\), it is infeasible to examine all possible syntactic contexts. Our approach is to construct a denotation for each code-block – a simplified, ideally finite, summary of possible interactions between the block and its context. We then define a refinement relation on denotations and use it to establish observational refinement. We write \(B_1 \sqsubseteq B_2\) when the denotation of \(B_1\) refines that of \(B_2\).
Refinement on denotations should be adequate, i.e., it should validly approximate observational refinement: \( B_1 \sqsubseteq B_2 \implies B_1 \preccurlyeq _\mathsf{bl}B_2 \). Hence, if \(B_1 \sqsubseteq B_2\), then \(B_2 \leadsto B_1\) is a valid transformation. It is also desirable for the denotation to be fully abstract: \( B_1 \preccurlyeq _\mathsf{bl}B_2 \implies B_1 \sqsubseteq B_2 \). This means any valid transformation can be verified by comparing denotations. Below we define several versions of \(\sqsubseteq \) with different properties.
3 Target Language and Core Memory Model
Our language’s memory model is derived from the C/C++ 2011 standard (henceforth ‘C11’), as formalised by [5, 22]. However, we simplify our model in several ways; see the end of the section for details. In C11 terms, our model covers release-acquire and non-atomic operations, and sequentially consistent fences. To simplify the presentation, at first we omit non-atomics, and extend our approach to cover them in Sect. 7. Thus, all operations in this section correspond to C11’s release-acquire.
3.1 Relaxed Memory Primer
In most relaxed models \(\mathtt{v1}=\mathtt{v2}=0\) is a possible post-state for SB. This cannot occur on a sequentially consistent system: if \(\mathtt{v1}=0\), then store(y,1) must be ordered after the load of y, which would order store(x,1) before the load of x, forcing it to assign \(\mathtt{v2} =1\). In some relaxed models, \(\mathtt{b} = 1 \wedge \mathtt{r} = 0\) is a possible post-state for MP. This is undesirable if, for example, x is a complex data structure and f is a flag indicating it has been safely created.
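The sequentially consistent argument above can be checked by brute force. The following sketch (our own illustration, not part of the paper's formal development; thread and outcome representations are our choice) enumerates every interleaving of the two SB threads and confirms that \(\mathtt{v1}=\mathtt{v2}=0\) never arises under sequential consistency:

```python
# SB litmus test: thread 1 is store(x,1); v1 := load(y),
# thread 2 is store(y,1); v2 := load(x). Initial values are 0.
T1 = [("store", "x", 1), ("load", "y", "v1")]
T2 = [("store", "y", 1), ("load", "x", "v2")]

def interleavings(a, b):
    # All merges of the two op lists preserving each thread's program order.
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    mem = {"x": 0, "y": 0}   # single shared memory: sequential consistency
    regs = {}
    for op in schedule:
        if op[0] == "store":
            _, var, val = op
            mem[var] = val
        else:
            _, var, reg = op
            regs[reg] = mem[var]
    return (regs["v1"], regs["v2"])

outcomes = {run(s) for s in interleavings(T1, T2)}
assert (0, 0) not in outcomes   # the relaxed outcome is SC-unreachable
```

Relaxed architectures and the model of this paper additionally permit the (0, 0) outcome, which is exactly what this enumeration of a single global memory cannot produce.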
3.2 Language Syntax
The construct \(\{  \}\) represents a block-shaped hole in the program. To simplify our presentation, we assume that at most one hole appears in the program. Transformations that apply to multiple blocks at once can be simulated by using the fact that our approach is compositional: transformations can be applied in sequence using different divisions of the program into code-block and context.
We restrict Prog, Contx and Block to ensure LL/SC pairs are matched correctly. Each SC must be preceded in program order by an LL to the same location. Other types of operations may occur between the LL and SC, but intervening SC operations are forbidden. For example, the program LL(x); SC(x,v1); SC(x,v2); is forbidden. We also forbid LL/SC pairs from spanning parallel compositions, and from spanning the block/context boundary.
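The per-thread well-formedness condition on LL/SC pairs can be phrased as a simple scan. The following sketch is our own illustration (the op-tuple representation is an assumption, not the paper's syntax):

```python
def ll_sc_well_formed(thread):
    """Check the LL/SC pairing discipline for one thread's op list.
    Ops are tuples (kind, location), e.g. ("LL", "x") or ("store", "y")."""
    pending = None  # location of the currently open LL, if any
    for kind, loc in thread:
        if kind == "LL":
            pending = loc              # opens an LL
        elif kind == "SC":
            if pending != loc:
                return False           # SC with no matching preceding LL
            pending = None             # the SC closes the pair

    return True

# The forbidden program from the text: LL(x); SC(x,v1); SC(x,v2)
assert not ll_sc_well_formed([("LL", "x"), ("SC", "x"), ("SC", "x")])
# A legal pair with an intervening non-SC operation
assert ll_sc_well_formed([("LL", "x"), ("store", "y"), ("SC", "x")])
```

The second SC fails because the first SC already consumed the pending LL, matching the restriction that intervening SC operations are forbidden.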
3.3 Memory Model Structure
The semantics of a whole program P is given by a set \(\llbracket P \rrbracket \) of executions, which consist of actions, representing memory events on global variables, and several relations on these. Actions are tuples in the set \(\mathsf{Action}\, {\mathop {=}\limits ^{\varDelta }}\, \mathsf{ActID} \times \mathsf{Kind} \times \mathsf{Option}(\mathsf{GVar}) \times \mathsf{Val}^{*} \). In an action \((a,k,z,b) \in \mathsf{Action}\): \(a \in \mathsf{ActID}\) is the unique action identifier; \(k \in \mathsf{Kind}\) is the kind of action – we use load, store, LL, SC, and the failed variant \({\mathsf{SC}_f}\) in the semantics, and will introduce further kinds as needed; \(z \in \mathsf{Option}(\mathsf{GVar})\) is an option type consisting of either a single global variable \(\mathsf{Just}(x)\) or None; and \(b \in \mathsf{Val}^{*}\) is the vector of values (actions with multiple values are used in Sect. 4).
The semantics of a program \(P \in \mathsf{Prog}\) is defined in two stages. First, a thread-local semantics of P produces a set \(\langle P \rangle \) of pre-executions \(({\mathcal {A}}, \mathsf{sb}) \in \mathsf{PreExec}\). A pre-execution contains a finite set of memory actions \({\mathcal {A}}\subseteq \mathsf{Action}\) that could be produced by the program. It has a transitive and irreflexive sequence-before relation \(\mathsf{sb}\subseteq {\mathcal {A}}\times {\mathcal {A}}\), which defines the sequential order imposed by the program syntax.
For example, two sequential statements in the same thread produce actions ordered in sb. The thread-local semantics takes into account control flow in P’s threads and operations on local variables. However, it does not constrain the behaviour of global variables: the values threads read from them are chosen arbitrarily. This is addressed by extending pre-executions with extra relations, and filtering the resulting executions using validity axioms.
3.4 Thread-Local Semantics
3.5 Execution Structure and Validity Axioms
The semantics of a program P is a set \(\llbracket P \rrbracket \) of executions \(X = ({\mathcal {A}}, \mathsf{sb}, \mathsf{at}, \mathsf{rf}, \mathsf{mo}, \mathsf{hb}) \in \mathsf{Exec}\), where \(({\mathcal {A}}, \mathsf{sb})\) is a preexecution and \(\mathsf{at}, \mathsf{rf}, \mathsf{mo}, \mathsf{hb}\subseteq {\mathcal {A}}\times {\mathcal {A}}\). Given an execution X we sometimes write \({\mathcal {A}}(X), \mathsf{sb}(X), \ldots \) as selectors for the appropriate set or relation. The relations have the following purposes.

Reads-from (rf) is an injective map from reads to writes at the same location with the same value. A read r and a write w are related \(w \xrightarrow{\mathsf{rf}} r\) if r takes its value from w.

Modification order (mo) is an irreflexive, total order on write actions to each distinct variable. This is a per-variable order in which all threads observe writes to the variable; two threads cannot observe these writes in different orders.

Happens-before (hb) is analogous to global temporal order – but unlike the sequentially consistent notion of time, it is partial. Happens-before is defined as \((\mathsf{sb}\cup \mathsf{rf})^+\): therefore statements ordered in the program syntax are ordered in time, as are reads with the writes they observe.

Atomicity (\(\mathsf{at}\subseteq \mathsf{sb}\)) is an extension to standard C11 which we use to support LL/SC (see below). It is an injective function from a successful load-link action to a successful store-conditional, giving an LL/SC pair.
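The happens-before construction \((\mathsf{sb}\cup \mathsf{rf})^+\) is straightforward to compute explicitly. As an illustration (our own sketch; the action names are made up), the following computes the closure for an MP-shaped execution and checks that the store to x is ordered before the load of x:

```python
def transitive_closure(edges):
    """Least transitive relation containing `edges` (a set of pairs)."""
    rel = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(rel):
            for (c, d) in list(rel):
                if b == c and (a, d) not in rel:
                    rel.add((a, d))
                    changed = True
    return rel

def acyclic(rel):
    # No element is related to itself after closing under transitivity.
    return all(a != b for (a, b) in transitive_closure(rel))

# thread 1: Wx1 ; Wy1        thread 2: Ry1 ; Rx
sb = {("Wx1", "Wy1"), ("Ry1", "Rx")}
rf = {("Wy1", "Ry1")}       # the load of y reads from the store of y
hb = transitive_closure(sb | rf)
assert ("Wx1", "Rx") in hb  # derived by transitivity through the rf edge
assert acyclic(sb | rf)
```

This transitively derived edge is exactly the kind of hb edge that the Coherence reasoning in Sect. 3.5 relies on.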

HBdef: \(\mathsf{hb}= (\mathsf{sb}\cup \mathsf{rf})^{+}\), and \(\mathsf{hb}\) is acyclic.
This axiom defines hb and enforces the intuitive property that there are no cycles in the temporal order. It also prevents an action reading from its hb-future: as rf is included in hb, this would result in a cycle.

HBvsMO: \(\lnot \exists w_1, w_2.\ w_1 \xrightarrow{\mathsf{hb}} w_2 \wedge w_2 \xrightarrow{\mathsf{mo}} w_1\). This axiom requires that the order in which writes to a location become visible to threads cannot contradict the temporal order. Note, however, that writes may be ordered in mo but not in hb.

Coherence: \(\lnot \exists w_1, w_2, r.\ w_1 \xrightarrow{\mathsf{mo}} w_2 \wedge (w_2, r) \in (\mathsf{hb}\cup \mathsf{mo}) \wedge w_1 \xrightarrow{\mathsf{rf}} r\). This axiom generalises the sequentially consistent prohibition on reading overwritten values. If two writes are ordered in mo, then intuitively the second overwrites the first. A read that follows some write in hb or mo cannot read from writes earlier in mo – these earlier writes have been overwritten. However, unlike in sequential consistency, hb is partial, so there may be multiple writes that an action can legally read from.

RFval: \(\forall r.\ (\lnot \exists w.\ w \xrightarrow{\mathsf{rf}} r) \implies \mathsf{val}(r) = 0 \wedge \lnot \exists w'.\ w' \xrightarrow{\mathsf{hb}} r \wedge \mathsf{loc}(w') = \mathsf{loc}(r)\). Most reads must take their value from a write, represented by an rf edge. However, the RFval axiom allows the rf edge to be omitted if the read takes the initial value 0 and there is no hb-earlier write to the same location. Intuitively, an hb-earlier write would supersede the initial value in a similar way to Coherence.

Atom: \(\lnot \exists \mathit{ll}, \mathit{sc}, w_1, w_2.\ \mathit{ll} \xrightarrow{\mathsf{at}} \mathit{sc} \wedge w_1 \xrightarrow{\mathsf{rf}} \mathit{ll} \wedge w_1 \xrightarrow{\mathsf{mo}} w_2 \wedge w_2 \xrightarrow{\mathsf{mo}} \mathit{sc}\). This axiom is adapted from [16]. For an LL/SC pair \(\textit{ll}\) and \(\textit{sc}\), it ensures that there is no mo-intervening write \(w_2\) that would invalidate the store.
Our model forbids the problematic relaxed behaviour of the message-passing (MP) program in Fig. 1 that yields \(\mathtt{b} = 1 \wedge \mathtt{r} = 0\). Figure 3 shows an (invalid) execution that would exhibit this behaviour. To avoid clutter, here and in the following we omit hb edges obtained by transitivity and local variable values. This execution is allowed by the thread-local semantics of the MP program, but it is ruled out by the Coherence validity axiom. As hb is transitively closed, there is a derived hb edge \(\mathsf{store}(\mathtt{x},1) \xrightarrow{\mathsf{hb}} \mathsf{load}(\mathtt{x},0)\), which forms a Coherence violation. Thus, this is not an execution of the MP program. Indeed, any execution ending in \(\mathsf{load}(\mathtt{x}, 0)\) is forbidden for the same reason, meaning that the MP relaxed behaviour cannot occur.
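This Coherence reasoning can be replayed mechanically. The sketch below is our own illustration, under two stated assumptions: Coherence is taken in the simple form "a read may not take its value from a write mo-overwritten by a write hb-before the read", and the initial value of x is modelled as an explicit initialisation write (the paper instead handles initial values via RFval). Action names are made up:

```python
def transitive_closure(edges):
    rel = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(rel):
            for (c, d) in list(rel):
                if b == c and (a, d) not in rel:
                    rel.add((a, d))
                    changed = True
    return rel

def coherent(sb, rf, mo):
    """No read takes its value from a write w1 that is mo-overwritten
    by a write w2 ordered hb-before the read."""
    hb = transitive_closure(sb | rf)
    for (w1, r) in rf:
        for (w1b, w2) in mo:
            if w1b == w1 and (w2, r) in hb:
                return False
    return True

# Candidate MP execution with the forbidden outcome load(x,0):
# Wx0 (init) ; Wx1 ; Wy1  on thread 1,   Ry1 ; Rx0  on thread 2.
sb = {("Wx0", "Wx1"), ("Wx1", "Wy1"), ("Ry1", "Rx0")}
rf = {("Wy1", "Ry1"), ("Wx0", "Rx0")}   # Rx0 reads the stale init write
mo = {("Wx0", "Wx1")}
assert not coherent(sb, rf, mo)          # the relaxed MP outcome is ruled out
```

The derived edge from Wx1 to Rx0 (through the rf edge on y) is what makes the stale read a Coherence violation, mirroring the argument in the text.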
3.6 Relaxed Observations
3.7 Differences from C11
Our language’s memory model is derived from the C11 formalisation in [5], with a number of simplifications. We chose C11 because it demonstrates most of the important features of axiomatic language models. However, we do not target the precise C11 model: rather we target an abstracted model that is rich enough to demonstrate our approach. Relaxed language semantics is still a very active topic of research, and several C11 features are known to be significantly flawed, with multiple competing fixes proposed. Some of our differences from [5] are intended to avoid such problematic features so that we can cleanly demonstrate our approach.

We omit sequentially consistent accesses because their semantics is known to be flawed in C11 [17]. We do handle sequentially consistent fences, but these are stronger than those of C11: we use the semantics proposed in [16]. It has been proved sound under existing compilation strategies to common multiprocessors.

We omit relaxed (RLX) accesses to avoid well-known problems with thin-air values [4]. There are multiple recent competing proposals for fixing these problems, e.g. [14, 15, 20].

Our model does not include infinite computations, because their semantics in C11-style axiomatic models remains undecided in the literature [4]. However, our proofs do not depend on the assumption that execution contexts are finite.

Our language is based on shared variables, not dynamically allocated addressable memory, so for example we cannot write y:=*x; z:=*y. This simplifies our theory by allowing us to fix the variables accessed by a code-block up front. We believe our results can be extended to support addressable memory, because C11-style models grant no special status to pointers; we elaborate on this in Sect. 4.

We add LL/SC atomic instructions to our language in addition to C11’s standard CAS. To do this, we adapt the approach of [16]. This increases the observational power of a context and is necessary for full abstraction in the presence of non-atomics; see Sect. 8. LL/SC is available as a hardware instruction on many platforms supporting C11, such as Power and ARM. However, we do not propose adding LL/SC to C11: rather, it supports an interesting result in relaxed memory model theory. Our adequacy results do not depend on LL/SC.
4 Denotations of Code-Blocks
We construct the denotation for a code-block in two steps: (1) generate the block-local executions under a set of special cut-down contexts; (2) from each execution, extract a summary of interactions between the code-block and the context, called a history.
4.1 Block-Local Executions

Local variables. A context can include code that precedes and follows the block on the same thread, with interaction through local variables, but – due to syntactic restriction – not through \(\mathsf{LL}\)/\(\mathsf{SC}\) atomic regions. We capture this with special actions \(\mathsf{call} (\sigma )\) at the start of the block and \(\mathsf{ret} (\sigma ')\) at the end, where \(\sigma , \sigma ' :\mathsf{LVar} \rightarrow \mathsf{Val}\) record the values of local variables at these points. Assume that the variables in LVar are ordered: \(l_1, l_2, \ldots , l_n\). Then \(\mathsf{call} (\sigma )\) is encoded by the action \( (i, \mathsf{call}, \mathsf{None}, [\sigma (l_1), \ldots , \sigma (l_n)] ) \), with fresh identifier i. We encode \(\mathsf{ret} \) in the same way.

Global variable actions. The context can also interact with the block through concurrent reads and writes to global variables. These interactions are represented by a set \({\mathcal {A}}_B\) of context actions, added to the ones generated by the thread-local semantics of the block. This set only contains actions on the variables \(\mathsf{VS}_B\) that B can access (\(\mathsf{VS}_B\) can be constructed syntactically). Given an execution X constructed using \({\mathcal {A}}_B\) (see below) we write \(\mathsf{contx}(X)\) to recover the set \({\mathcal {A}}_B\).
 Context happens-before. The context can generate \(\mathsf{hb}\) edges between its actions, which affect the behaviour of the block. We track these effects with a relation \(R_B\) over actions in \({\mathcal {A}}_B\), call, and ret: the context can generate hb edges between actions directly if they are on the same thread, or indirectly through inter-thread reads. Likewise, call/ret may be related to context actions on the same or different threads.

Context atomicity. The context can generate \(\mathsf{at}\) edges between its actions that we capture in the relation \(S_B \subseteq {\mathcal {A}}_B \times {\mathcal {A}}_B\). We require this relation to be an injective function from \(\mathsf{LL}\) to \(\mathsf{SC}\) actions. We consider only cases where LL/SC pairs do not cross block boundaries, so we need not consider boundary-crossing \(\mathsf{at}\) edges.
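The call/ret encoding described above can be illustrated concretely. The following sketch is our own (the two-variable LVar ordering and dictionary representation of \(\sigma\) are assumptions for illustration); it builds actions in the tuple format \((\mathsf{ActID}, \mathsf{Kind}, \mathsf{Option}(\mathsf{GVar}), \mathsf{Val}^{*})\) from Sect. 3.3:

```python
from itertools import count

LVAR_ORDER = ["l1", "l2"]   # assumed fixed ordering of LVar
_fresh = count()            # source of fresh action identifiers

def encode(kind, sigma):
    """Encode call(sigma) or ret(sigma') as an action tuple
    (ActID, Kind, Option(GVar), Val*); the global-variable slot is None."""
    return (next(_fresh), kind, None, [sigma[l] for l in LVAR_ORDER])

call = encode("call", {"l1": 5, "l2": 0})
ret  = encode("ret",  {"l1": 5, "l2": 7})
assert call[1:] == ("call", None, [5, 0])
assert ret[1:]  == ("ret",  None, [5, 7])
```

The value vector records the whole local-variable state at the block boundary, which is how the denotation exposes local-variable interaction with the context without exposing the block's internal actions.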
 1.
\({\mathcal {A}}_B \subseteq {\mathcal {A}}\) and there exist variable maps \(\sigma , \sigma '\) such that \(\{ \mathsf{call} (\sigma ), \mathsf{ret} (\sigma ')\} \subseteq {\mathcal {A}}\). That is, the call, return, and extra context actions are included in the execution.
 2.
There exists a set \({\mathcal {A}}_l\) and relation \(\mathsf{sb}_l\) such that (i) \(({\mathcal {A}}_l, \mathsf{sb}_l, \sigma ') \in \langle B, \sigma \rangle \); (ii) \({\mathcal {A}}_l = {\mathcal {A}}\setminus ({\mathcal {A}}_B \cup \{\mathsf{call}, \mathsf{ret} \})\); (iii) \(\mathsf{sb}_l = \mathsf{sb}\setminus \{ (\mathsf{call}, u), (u, \mathsf{ret}) \mid u \in {\mathcal {A}}_l \}\). That is, actions from the code-block satisfy the thread-local semantics, beginning with map \(\sigma \) and deriving map \(\sigma '\). All actions arising from the block are between \(\mathsf{call} \) and \(\mathsf{ret} \) in sb.
 3.
X satisfies the validity axioms, but with modified axioms HBdef\('\) and Atom\('\). We define HBdef\('\) as: \(\mathsf{hb}= (\mathsf{sb}\cup \mathsf{rf}\cup R_B)^{+}\), and hb is acyclic. That is, the context relation \(R_B\) is added to \(\mathsf{hb}\). Atom\('\) is defined analogously, with \(S_B\) added to \(\mathsf{at}\).
4.2 Histories

The action set \({\mathcal {A}}\) is the projection of X’s action set to call, ret, and \(\mathsf{contx}(X)\).
 The guarantee relation G is the projection of \(\mathsf{hb}(X)\) to call, ret, and the context actions \(\mathsf{contx}(X)\).
The guarantee summarises the code-block’s effect on its context: it suffices to only track hb and ignore other relations. Note the guarantee definition is similar to the context relation \(R_B\), definition (4). The difference is that call and ret are switched: this is because the guarantee represents hb edges generated by the code-block, while \(R_B\) represents the edges generated by the context. The right of Fig. 4 shows the history corresponding to the block-local execution on the left.
It is desirable for our denotation to hide the precise operations inside the block – this lets it relate syntactically distinct blocks. Nonetheless, the history must record hb effects such as those above that are visible to the context. In Execution 1, the Coherence violation is still visible if we only consider context operations, call, ret, and the guarantee G – i.e. the history. In Execution 2, the fact that the read is permitted is likewise visible from examining the history. Thus the guarantee, combined with the local variable post-states, captures the effect of the block on the context without recording the actions inside the block.
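History extraction is essentially a projection. The sketch below is our own simplified illustration (the paper's guarantee is more refined; action names are made up, and we assume hb is already transitively closed): it keeps only call, ret, and context actions, and projects hb onto them:

```python
def history(hb, context_actions):
    """Simplified history: the visible action set together with the
    guarantee G, the projection of hb onto the visible actions."""
    visible = set(context_actions) | {"call", "ret"}
    G = {(a, b) for (a, b) in hb if a in visible and b in visible}
    return visible, G

# hb of a block-local execution: the block's internal store Wx1 sits
# between call and ret, and a context read Rctx observes it.
hb = {("call", "Wx1"), ("Wx1", "ret"), ("call", "ret"),
      ("Wx1", "Rctx"), ("call", "Rctx")}
A, G = history(hb, {"Rctx"})
assert A == {"call", "ret", "Rctx"}
assert G == {("call", "ret"), ("call", "Rctx")}
```

Note that the internal action Wx1 disappears, while its transitively derived effect on the context (the edge from call to Rctx) survives in G: this is the sense in which the history records visible hb effects without recording the block's internal actions.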
4.3 Comparing Denotations
Theorem 1
(Adequacy of \(\sqsubseteq _\mathsf{q}\)). \(B_1 \sqsubseteq _\mathsf{q}B_2 \implies B_1 \preccurlyeq _\mathsf{bl}B_2\).
Theorem 2
(Full abstraction of \(\sqsubseteq _\mathsf{q}\)). \(B_1 \preccurlyeq _\mathsf{bl}B_2 \implies B_1 \sqsubseteq _\mathsf{q}B_2\).
4.4 Example Transformation
In Fig. 6, we illustrate the necessary reasoning for an execution \(X_1 \in \llbracket B_1, {\mathcal {A}}, R, S \rrbracket \), with a context action set \({\mathcal {A}}\) consisting of a single load \(\mathtt{x} = 1\), a context relation R relating ret to the load, and an empty S relation. This choice of R forces the context load to read from the store in the block. We can exhibit an execution \(X_2 \in \llbracket B_2, {\mathcal {A}}, R, S \rrbracket \) with a matching history by making the context load read from the final store in the block.
5 A Finite Denotation

We remove the quantification over context relation R from definition (7) by fixing it as \(\emptyset \). In exchange, we extend the history with an extra component called a deny.

We eliminate redundant block-local executions from the denotation, and only consider a reduced set of executions X that satisfy a predicate \(\mathsf{cut}(X)\).
These two steps are both necessary to achieve finiteness. Removing the R relation reduces the amount of structure in the context. This makes it possible to then remove redundant patterns – for example, duplicate reads from the same write.
Theorem 3
(Adequacy of \(\sqsubseteq _\mathsf{c}\)). \(B_1 \sqsubseteq _\mathsf{c}B_2 \implies B_1 \preccurlyeq _\mathsf{bl}B_2\).
5.1 Cutting Predicate
Removing the context relation R in definition (8) removes a large amount of structure from the context. However, there are still unboundedly many block-local executions with an empty R – for example, we can have an unbounded number of reads and writes that do not interact with the block. The cutting predicate identifies these redundant executions.
It should be intuitively clear why the first two of the above patterns are redundant. The main surprise is the third pattern, which preserves some non-visible writes. This is required by Theorem 3 for technical reasons connected to per-location coherence. We illustrate the application of \(\mathsf{cut}()\) to a block-local execution in Fig. 7.
5.2 Extended History
The deny records the hb edges that cannot be enforced due to the execution structure. For example, consider the block-local execution of Fig. 8.
This pattern could not occur in a context that generates the dashed edge D as an hb edge – to do so would violate the HBvsMO axiom. In our previous definition of \(\sqsubseteq _\mathsf{q}\), we explicitly represented the presence or absence of this edge through the R relation. In our new formulation, we represent such ‘forbidden’ edges in the history by a deny edge.
One can think of a deny edge as an ‘almost’ violation of an axiom. For example, if \(\textsf {HBvsMOd}(u,v)\) holds, then the context cannot generate an extra hb edge from u to v – to do so would violate HBvsMO.
5.3 Finiteness
Theorem 4
(Finiteness). If for a block B and state \(\sigma \) the set of thread-local executions \(\langle B, \sigma \rangle \) is finite, then so is the set of resulting block-local executions.
Proof
(sketch). It is easy to see that for a given thread-local execution there are finitely many possible visible reads and writes. Any two non-visible writes must be distinguished by at least one visible write, limiting their number. \(\square \)
Theorem 4 means that any transformation can be checked automatically if the two blocks have finite sets of thread-local executions. We assume a finite data domain, meaning actions can only take finitely many distinct values in Val. Recall also that our language does not include loops. Given these facts, any transformation written in our language will satisfy finiteness, and can therefore be automatically checked.
6 Prototype Verification Tool
Stellite is our prototype tool that verifies transformations using the Alloy* model checker [12, 18]. Our tool takes an input transformation \(B_2 \leadsto B_1\) written in a C-like syntax. It automatically converts the transformation into an Alloy* model encoding \(B_1 \sqsubseteq _\mathsf{c}B_2\). If the tool reports success, then the transformation is verified for unboundedly large syntactic contexts and executions.
An Alloy model consists of a collection of predicates on relations, and an instance of the model is a set of relations that satisfy the predicates. As previously noted in [28], there is therefore a natural fit between Alloy models and axiomatic memory models.
 1.
The two sides of an input transformation \(B_1\) and \(B_2\) are automatically converted into Alloy predicates expressing their syntactic structure. Intuitively, these block predicates are built by following the thread-local semantics from Sect. 3.
 2.
The block predicates are linked with a predefined Alloy model expressing the memory model and \(\sqsubseteq _\mathsf{c}\).
 3.
The Alloy* solver searches (using SAT) for a history of \(B_1\) that has no matching history of \(B_2\). We use the higher-order Alloy* solver of [18] because the standard Alloy solver cannot support the existential quantification on histories in \(\sqsubseteq _\mathsf{c}\).
The Alloy* solver is parameterised by the maximum size of the model it will examine. However, our finiteness theorem for \(\sqsubseteq _\mathsf{c}\) (Theorem 4) means there is a bound on the size of cut-down context that needs to be considered to verify any given transformation. If our tool reports that a transformation is correct, it is verified in all syntactic contexts of unbounded size.
Given a query \(B_1 \sqsubseteq _\mathsf{c}B_2\), the required context bound grows in proportion to the number of internal actions on distinct locations in \(B_1\). This is because our cutting predicate permits context actions if they interact with internal actions, either directly, or by interleaving between internal actions. In our experiments we run the tool with a model bound of 10, sufficient to give soundness for all the transformations we consider. Note that most of our example transformations do not require such a large bound, and execution times improve if it is reduced.
If a counterexample is discovered, the problematic execution and history can be viewed using the Alloy model visualiser, which has a similar appearance to the execution diagrams in this paper. The output model generated by our tool encodes the history of \(B_1\) for which no history of \(B_2\) could be found. As \(\sqsubseteq _\mathsf{c}\) is not fully abstract, this counterexample could, of course, be spurious.
Experimental Results. We have tested our tool on a range of different transformations. A table of experimental results is given in Fig. 9. Many of our examples are derived from [23] – we cover all their examples that fit into our tool’s input language. Transformations of the sort that we check have led to realworld bugs in GCC [19] and LLVM [8]. Note that some transformations are invalid because of their effect on local variables, e.g. \(\mathtt{skip}\leadsto l\ {:}{=}\ \mathtt{load}(x)\). The closely related transformation \(\mathtt{skip}\leadsto \mathtt{load}(x)\) throws away the result of the read, and is consequently valid.
Our tool takes significant time to verify some of the above examples, and two of the transformations cause the tool to time out. This is due to the complexity and nondeterminism of the C11 model. In particular, our execution times are comparable to existing C++ model simulators such as Cppmem when they run on a few lines of code [3]. However, our tool is a sound transformation verifier, rather than a simulator, and thus solves a more difficult problem: transformations are verified for unboundedly large syntactic contexts and executions, rather than for a single execution.
7 Transformations with Non-atomics
We now extend our approach to non-atomic (i.e. unsynchronised) accesses. C11 non-atomics are intended to enable sequential compiler optimisations that would otherwise be unsound in a concurrent context. To achieve this, any concurrent read-write or write-write pair of non-atomic actions on the same location is declared a data race, which causes the whole program to have undefined behaviour. Therefore, adding non-atomics impacts not just the model, but also our denotation.
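The race condition just described – a read-write or write-write pair of non-atomic actions on the same location, unordered by hb – can be sketched as a simple check over an execution. This is our own illustration; the action representation (name mapped to kind and location) is an assumption:

```python
def racy(actions, hb, non_atomic):
    """actions: name -> (kind, location); hb: transitively closed relation;
    non_atomic: the set of NA action names. A race is a pair of non-atomic
    actions on the same location, at least one a store, with no hb ordering
    in either direction."""
    names = sorted(actions)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (ka, la), (kb, lb) = actions[a], actions[b]
            if la != lb or a not in non_atomic or b not in non_atomic:
                continue
            if "store" in (ka, kb) and (a, b) not in hb and (b, a) not in hb:
                return True
    return False

acts = {"Rx": ("load", "x"), "Wx": ("store", "x")}
assert racy(acts, set(), {"Rx", "Wx"})               # unordered: a race
assert not racy(acts, {("Rx", "Wx")}, {"Rx", "Wx"})  # hb-ordered: no race
```

Because one racy execution makes the whole program undefined, the denotation defined below must track not just histories but the point at which such a race can first occur.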
7.1 Memory Model with Non-atomics

RFHBNA: if \(w \xrightarrow{\mathsf{rf}} r\) and w or r is non-atomic, then \(w \xrightarrow{\mathsf{hb}} r\). That is, non-atomic reads must take their values from writes that happen before them.

HBdef: \(\mathsf{hb}= (\mathsf{sb}\cup (\mathsf{rf}\cap \{(w,r) \mid w,r \notin \mathsf{NA}\}))^{+}\)

modification order (mo) relates writes to atomic y, but not non-atomic x;

the first load of x is forced to read from the initialisation by RFHBNA; and

the second read of x is forced to read 1 because the \(\mathsf{hb}\) created by the load of y obscures the now-stale initialisation write, in accordance with CoherNA.
We write \(\mathsf{safe}(X)\) if an execution satisfies this axiom. Returning to the left of Fig. 10, we see that there is a violation of DRF – a race on non-atomics – between the first load of x and the store of x on the left-hand thread.
7.2 Denotation with Non-atomics
We now define our denotation for non-atomics, \(\mathrel {\sqsubseteq _\mathsf{q}^\mathsf{NA}}\), building on the ‘quantified’ denotation \(\sqsubseteq _\mathsf{q}\) defined in Sect. 4. (We have also defined a finite variant of this denotation using the cutting strategy described in Sect. 5 – we leave this to [10, Sect. C].)
Non-atomic actions do not participate in happens-before (hb) or coherence order (mo). For this reason, we need not change the structure of the history. However, non-atomics introduce undefined behaviour \(\top \), which is a special kind of observable behaviour. If a block races with its context in some execution, the whole program becomes unsafe, for all executions. Therefore, our denotation must identify how a block may race with its context. In particular, for the denotation to be adequate, for any context C and two blocks \(B_1 \mathrel {\sqsubseteq _\mathsf{q}^\mathsf{NA}}B_2\), we must have that if \(C(B_1)\) is racy, then \(C(B_2)\) is also racy.
The code on the left of Fig. 10 represents a context, composed with \(B_2\), and the execution of Fig. 10 demonstrates that together they are racy. If we were to apply our transformation to the fragment \(B_2\) of the right-hand thread, then we would produce the code on the right in Fig. 10. On the right in Fig. 10, we present a similar execution to the one given on the left. The reordering on the right-hand thread has led to the second load of x taking the value 0 rather than 1, in accordance with RFHBNA. Note that the execution still has a race on the first load of x, albeit with different following events. As this example illustrates, when considering racy executions in the definition of \(\mathrel {\sqsubseteq _\mathsf{q}^\mathsf{NA}}\), we may need to match executions of the two code blocks that behave differently after a race. This is the key subtlety in our definition of \(\mathrel {\sqsubseteq _\mathsf{q}^\mathsf{NA}}\).
In more detail, for two related blocks \(B_1 \mathrel {\sqsubseteq _\mathsf{q}^\mathsf{NA}}B_2\), if \(B_2\) generates a race in a block-local execution under a given (reduced) context, then we require \(B_1\) and \(B_2\) to have corresponding histories only up to the point the race occurs. Once the race has occurred, the following behaviours of \(B_1\) and \(B_2\) may differ. This still ensures adequacy: when the blocks \(B_1\) and \(B_2\) are embedded into a syntactic context C, this ensures that a race can be reproduced in \(C(B_2)\), and hence, \(C(B_1) \preccurlyeq _\mathsf{pr}^\mathsf{NA}C(B_2)\).

If \(X_2\) is safe, then the situation corresponds to \(\sqsubseteq _\mathsf{q}\) – see Sect. 4, def. (7). In fact, if \(B_2\) is certain to be safe, for example because it has no nonatomic accesses, then the above definition is equivalent to \(\sqsubseteq _\mathsf{q}\).

If \(X_2\) is unsafe then it has a race, and we do not have to relate the whole executions \(X_1\) and \(X_2\). We need only show that the race in \(X_2\) is feasible by finding a prefix in \(X_1\) that refines the prefix leading to the race in \(X_2\). In other words, \(X_2\) will behave consistently with \(X_1\) until it becomes unsafe. This ensures that the race in \(X_2\) will in fact occur, and its undefined behaviour will subsume the behaviour of \(B_1\). After \(X_2\) becomes unsafe, the two blocks can behave entirely differently, so we need not show that the complete histories of \(X_1\) and \(X_2\) are related.
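The two cases above can be sketched as a simple check over execution summaries. The encoding is ours and deliberately simplified: each execution carries a history tuple, a safety flag, and, if unsafe, the index of the first racy event; 'related histories' is simplified to equality.

```python
def block_refines(x1, x2):
    """Sketch of the two cases in the refinement check described above."""
    if x2["safe"]:
        # Safe case: the complete histories must be related
        # (simplified here to equality).
        return x1["history"] == x2["history"]
    # Unsafe case: only the prefix of x2 up to and including the race
    # needs a matching prefix in x1; behaviour afterwards is free,
    # since the race makes the whole program undefined.
    k = x2["race_index"] + 1
    return x1["history"][:k] == x2["history"][:k]

# An unsafe execution whose history diverges only after the race
# still refines: the suffixes need not agree.
x1 = {"history": ("Rx=0", "Ry=1", "Rx=1"), "safe": True}
x2_unsafe = {"history": ("Rx=0", "Ry=1", "Rx=0"),
             "safe": False, "race_index": 0}
print(block_refines(x1, x2_unsafe))  # True: prefixes agree up to the race
```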
In Fig. 11 we give an execution \(X_1 \in \llbracket B_1, {\mathcal {A}}, R, S \rrbracket \), with a context action set \({\mathcal {A}}\) consisting of a non-atomic store of \(\mathtt{x} = 1\) and an atomic store of \(\mathtt{y} = 1\), and a context relation R relating the store of \(\mathtt{x}\) to the store of \(\mathtt{y}\). Note that this choice of context actions matches the left-hand thread in the code listings of Fig. 10, and there are data races between the loads and the store on x.
To prove the refinement for this execution, we exhibit a corresponding unsafe execution \(X_2 \in \llbracket B_2, {\mathcal {A}}, R, S \rrbracket _v\). The histories of the complete executions \(X_1\) and \(X_2\) differ in their return action. In \(X_2\) the load of \(\mathtt{y}\) takes the value of the context store, so CoherNA forces the second load of \(\mathtt{x}\) to read from the context store of \(\mathtt{x}\). This changes the values of local variables recorded in \(\mathsf{ret} '\). However, because \(X_2\) is unsafe, we can select a prefix \(X'_2\) which includes the race (we denote in grey the parts that we do not include). Similarly, we can select a prefix \(X'_1\) of \(X_1\). We have that \(\mathsf{hist}(X'_1) = \mathsf{hist}(X'_2)\) (shown in the figure), even though the histories \(\mathsf{hist}(X_1)\) and \(\mathsf{hist}(X_2)\) do not correspond.
Theorem 5
(Adequacy of \(\mathrel {\sqsubseteq _\mathsf{q}^\mathsf{NA}}\)). \(B_1 \mathrel {\sqsubseteq _\mathsf{q}^\mathsf{NA}}B_2 \implies B_1 \preccurlyeq _\mathsf{bl}^\mathsf{NA}B_2 \).
Theorem 6
(Full abstraction of \(\mathrel {\sqsubseteq _\mathsf{q}^\mathsf{NA}}\)). \(B_1 \preccurlyeq _\mathsf{bl}^\mathsf{NA}B_2 \Rightarrow B_1 \mathrel {\sqsubseteq _\mathsf{q}^\mathsf{NA}}B_2 \).
We prove Theorem 5 in [10, Sect. B] and Theorem 6 in [10, Sect. F]. Note that the prefixing in our definition of \(\mathrel {\sqsubseteq _\mathsf{q}^\mathsf{NA}}\) is required for full abstraction; a definition that always required complete executions with related histories would still be adequate.
8 Full Abstraction
The key idea of our proofs of full abstraction (Theorems 2 and 6, given in full in [10, Sect. F]) is to construct a special syntactic context that is sensitive to one particular history. Namely, given an execution X produced from a block B with context happens-before R, this context \(C_X\) guarantees: (1) that X is the block portion of an execution of \(C_X(B)\); and (2) for any block \(B'\), if \(C_X(B')\) has a different block history from X, then this is visible in different observable behaviour. Therefore for any blocks that are distinguished by different histories, \(C_X\) can produce a program with different observable behaviour, establishing full abstraction.
Special Context Construction. The precise definition of the special context construction \(C_X\) is given in [10, Sect. F] – here we sketch its behaviour. \(C_X\) executes the context operations from X in parallel with the block. It wraps these operations in auxiliary wrapper code to enforce context happens-before, R, and to check the history. If a check in the wrapper code fails, it writes to an error variable, thereby altering the observable behaviour.
The context must also prohibit history edges beyond those in the original guarantee G, and again it uses watchdog variables. For each (u, v) not in G, the special context writes to watchdog variable \(g_{(u,v)}\) before u and reads \(g_{(u,v)}\) after v. If the read of \(g_{(u,v)}\) does read the value written before u, then there is an errant history edge, and the error location is written. An erroneous execution has the shape given on the right in Fig. 12 (omitting the write to the error location).
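The watchdog discipline can be summarised abstractly as follows. This is a sketch with our own encoding: `watchdog_reads` records, for each candidate edge (u, v), whether the read of \(g_{(u,v)}\) after v saw the write placed before u.

```python
def errant_edges(guarantee, watchdog_reads):
    """For each candidate history edge (u, v) outside the guarantee G,
    the context writes g_(u,v) before u and reads it after v; the read
    seeing that write signals an errant edge, so the error location
    would be written."""
    return {edge for edge, saw_write in watchdog_reads.items()
            if saw_write and edge not in guarantee}

G = {("a", "b")}
reads = {("a", "b"): True,   # in G: permitted
         ("a", "c"): True,   # not in G and observed: errant
         ("b", "c"): False}  # not observed: fine
print(errant_edges(G, reads))  # {('a', 'c')}: error location written
```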
Full Abstraction and LLSC. Our proof of full abstraction for the language with C11 non-atomics requires the language to also include LLSC, not just C11’s standard CAS: the former operation increases the observational power of the context. However, without non-atomics (Sect. 4), CAS would be sufficient to prove full abstraction.
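One intuition for why LL/SC observes more than CAS is the classic ABA effect: a store-conditional fails after any intervening write, whereas CAS only compares values. The toy sequential cell below is our own illustration of this gap, not the paper's semantics.

```python
class Cell:
    """Toy cell: 'version' counts writes, letting store_conditional
    detect intervening stores that a value-comparing CAS cannot."""
    def __init__(self, v):
        self.value, self.version = v, 0

    def store(self, v):
        self.value, self.version = v, self.version + 1

    def cas(self, expected, new):
        if self.value == expected:
            self.store(new)
            return True
        return False

    def load_linked(self):
        return self.value, self.version

    def store_conditional(self, link, new):
        if self.version == link[1]:   # no writes since the load-linked
            self.store(new)
            return True
        return False

c = Cell(0)
link = c.load_linked()
c.store(1); c.store(0)                # value returns to 0 (ABA)
print(c.store_conditional(link, 2))   # False: LL/SC saw the writes
print(c.cas(0, 2))                    # True: CAS cannot tell
```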
9 Related Work
Our approach builds on our prior work [3], which generalises linearizability [11] to the C11 memory model. This work represented interactions between a library and its clients by sets of histories consisting of a guarantee and a deny; we do the same for code block and context. However, our previous work assumed information hiding, i.e., that the variables used by the library cannot be directly accessed by clients; we lift this assumption here. We also establish both adequacy and full abstraction, propose a finite denotation, and build an automated verification tool.
Our approach is similar in structure to the seminal concurrency semantics of Brookes [6]: i.e. a code block is represented by a denotation capturing possible interactions with an abstracted context. In [6], denotations are sets of traces, consisting of sequences of global program states; context actions are represented by changes in these states. To handle the more complex axiomatic memory model, our denotation consists of sets of context actions and relations on them, with context actions explicitly represented as such. Also, in order to achieve full abstraction, Brookes assumes a powerful atomic await() instruction which blocks until the global state satisfies a predicate. Our result does not require this: all our instructions operate on single locations, and our strongest instruction is LLSC, which is commonly available on hardware.
Brookes-like approaches have been applied to several relaxed models: operational hardware models [7], TSO [13], and SC-DRF [21]. Also, [7, 21] define tools for verifying program transformations. All three approaches are based on traces rather than partial orders, and are therefore not directly portable to C11-style axiomatic memory models. All three also target substantially stronger (i.e. more restrictive) models.
Methods for verifying code transformations, either manually or using proof assistants, have been proposed for several relaxed models: TSO [24, 26, 27], Java [25] and C/C++ [23]. These methods are noncompositional in the sense that verifying a transformation requires considering the trace set of the entire program—there is no abstraction of the context. We abstract both the sequential and concurrent context and thereby support automated verification. The above methods also model transformations as rewrites on program executions, whereas we treat them directly as modifications of program syntax; the latter corresponds more closely to actual compilers. Finally, these methods all require considerable proof effort; we build an automated verification tool.
Our tool is a sound verification tool – that is, transformations are verified for all contexts and all executions of unbounded size. Several tools exist for testing (not verifying) program transformations on axiomatic memory models by searching for counterexamples to correctness, e.g., [16] for GCC and [8] for LLVM. Alloy was used by [28] in a testing tool for comparing memory models – this includes comparing language-level constructs with their compiled forms.
10 Conclusions
We have proposed the first fully abstract denotational semantics for an axiomatic relaxed memory model, and using this, we have built the first tool capable of automatically verifying program transformations on such a model. Our theory lays the groundwork for further research into the properties of axiomatic models. In particular, our definition of the denotation as a set of histories and our context reduction should be portable to other axiomatic models based on happens-before, such as those for hardware [1].
Footnotes
 1.
This definition relies on the fact that our language supports a fixed set of global variables, not dynamically allocated addressable memory (see Sect. 3.7). We believe that in the future our results can be extended to support dynamic memory. For this, the block-local construction would need to quantify over actions on all possible memory locations, not just the static variable set \(\mathsf{VS}_B\). The rest of our theory would remain the same, because C11-style models grant no special status to pointer values. Cutting down to a finite denotation, as in Sect. 5 below, would require some extra abstraction over memory – for example, a separation logic domain such as [9].
 2.
We choose these poststates for exposition purposes – in fact these blocks are also distinguishable through local variable l1 alone.
 3.
We use this execution for illustration, but in fact the \(\mathsf{cut}()\) predicate would forbid the load.
 4.
This example was provided to us by Lahav, Giannarakis and Vafeiadis in personal communication.
Acknowledgements
Thanks to Jeremy Jacob, Viktor Vafeiadis, and John Wickerson for comments and suggestions. Dodds was supported by a Royal Society Industrial Fellowship, and undertook this work while faculty at the University of York. Batty is supported by a Lloyds Register Foundation and Royal Academy of Engineering Research Fellowship.
References
 1. Alglave, J., Maranget, L., Tautschnig, M.: Herding cats: modelling, simulation, testing, and data mining for weak memory. ACM Trans. Program. Lang. Syst. 36(2), 7:1–7:74 (2014)
 2. Anderson, J.H., Moir, M.: Universal constructions for multi-object operations. In: Symposium on Principles of Distributed Computing (PODC), pp. 184–193 (1995)
 3. Batty, M., Dodds, M., Gotsman, A.: Library abstraction for C/C++ concurrency. In: Symposium on Principles of Programming Languages (POPL), pp. 235–248 (2013)
 4. Batty, M., Memarian, K., Nienhuis, K., Pichon-Pharabod, J., Sewell, P.: The problem of programming language concurrency semantics. In: Vitek, J. (ed.) ESOP 2015. LNCS, vol. 9032, pp. 283–307. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46669-8_12
 5. Batty, M., Owens, S., Sarkar, S., Sewell, P., Weber, T.: Mathematizing C++ concurrency. In: Symposium on Principles of Programming Languages (POPL), pp. 55–66 (2011)
 6. Brookes, S.: Full abstraction for a shared-variable parallel language. Inf. Comput. 127(2), 145–163 (1996)
 7. Burckhardt, S., Musuvathi, M., Singh, V.: Verifying local transformations on relaxed memory models. In: International Conference on Compiler Construction (CC), pp. 104–123 (2010)
 8. Chakraborty, S., Vafeiadis, V.: Validating optimizations of concurrent C/C++ programs. In: International Symposium on Code Generation and Optimization (CGO), pp. 216–226 (2016)
 9. Distefano, D., O’Hearn, P.W., Yang, H.: A local shape analysis based on separation logic. In: Hermanns, H., Palsberg, J. (eds.) TACAS 2006. LNCS, vol. 3920, pp. 287–302. Springer, Heidelberg (2006). https://doi.org/10.1007/11691372_19
 10. Dodds, M., Batty, M., Gotsman, A.: Compositional verification of compiler optimisations on relaxed memory (extended version). CoRR, arXiv:1802.05918 (2018)
 11. Herlihy, M.P., Wing, J.M.: Linearizability: a correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. 12(3), 463–492 (1990)
 12. Jackson, D.: Software Abstractions: Logic, Language, and Analysis, Revised edn. MIT Press, Cambridge (2012)
 13. Jagadeesan, R., Petri, G., Riely, J.: Brookes is relaxed, almost! In: International Conference on Foundations of Software Science and Computational Structures (FOSSACS), pp. 180–194 (2012)
 14. Jeffrey, A., Riely, J.: On thin air reads: towards an event structures model of relaxed memory. In: Symposium on Logic in Computer Science (LICS), pp. 759–767 (2016)
 15. Kang, J., Hur, C.-K., Lahav, O., Vafeiadis, V., Dreyer, D.: A promising semantics for relaxed-memory concurrency. In: Symposium on Principles of Programming Languages (POPL), pp. 175–189 (2017)
 16. Lahav, O., Giannarakis, N., Vafeiadis, V.: Taming release-acquire consistency. In: Symposium on Principles of Programming Languages (POPL), pp. 649–662 (2016)
 17. Lahav, O., Vafeiadis, V., Kang, J., Hur, C.-K., Dreyer, D.: Repairing sequential consistency in C/C++11. In: Conference on Programming Language Design and Implementation (PLDI), pp. 618–632 (2017)
 18. Milicevic, A., Near, J.P., Kang, E., Jackson, D.: Alloy*: a general-purpose higher-order relational constraint solver. In: International Conference on Software Engineering (ICSE), pp. 609–619 (2015)
 19. Morisset, R., Pawan, P., Zappa Nardelli, F.: Compiler testing via a theory of sound optimisations in the C11/C++11 memory model. In: Conference on Programming Language Design and Implementation (PLDI), pp. 187–196 (2013)
 20. Pichon-Pharabod, J., Sewell, P.: A concurrency semantics for relaxed atomics that permits optimisation and avoids thin-air executions. In: Symposium on Principles of Programming Languages (POPL), pp. 622–633 (2016)
 21. Poetzl, D., Kroening, D.: Formalizing and checking thread refinement for data-race-free execution models. In: International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pp. 515–530 (2016)
 22. The C++ Standards Committee: Programming Languages – C++ (2011). ISO/IEC JTC1 SC22 WG21
 23. Vafeiadis, V., Balabonski, T., Chakraborty, S., Morisset, R., Zappa Nardelli, F.: Common compiler optimisations are invalid in the C11 memory model and what we can do about it. In: Symposium on Principles of Programming Languages (POPL), pp. 209–220 (2015)
 24. Vafeiadis, V., Zappa Nardelli, F.: Verifying fence elimination optimisations. In: International Conference on Static Analysis (SAS), pp. 146–162 (2011)
 25. Ševčík, J., Aspinall, D.: On validity of program transformations in the Java memory model. In: Vitek, J. (ed.) ECOOP 2008. LNCS, vol. 5142, pp. 27–51. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-70592-5_3
 26. Ševčík, J., Vafeiadis, V., Zappa Nardelli, F., Jagannathan, S., Sewell, P.: Relaxed-memory concurrency and verified compilation. In: Symposium on Principles of Programming Languages (POPL), pp. 43–54 (2011)
 27. Ševčík, J., Vafeiadis, V., Zappa Nardelli, F., Jagannathan, S., Sewell, P.: CompCertTSO: a verified compiler for relaxed-memory concurrency. J. ACM 60(3), 22:1–22:50 (2013)
 28. Wickerson, J., Batty, M., Sorensen, T., Constantinides, G.A.: Automatically comparing memory consistency models. In: Symposium on Principles of Programming Languages (POPL), pp. 190–204 (2017)
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.