1 Introduction

A modern high-performance, cache-coherent, distributed-memory hardware system is inherently complex. Such systems by their nature are highly concurrent and nondeterministic. The goal of this work is to provide a framework for full verification of complex hardware systems.

Modularity has long been understood as a key property for effective design and verification in this domain, decomposing systems into pieces that can be specified and verified independently. In our design, processors and memory systems independently employ intricate optimizations that exploit opportunities for parallelism. We are able to prove that each of these two main components still provides strong guarantees to support sequential consistency (SC) [25], and then compose those proofs into a result for the full system. Either component may be optimized further without requiring any changes to the implementation, specification, or proof of the other. Our concrete optimizations include speculation in processors and using a hierarchy of caches in memory.

We thus present the first mechanized proof of correctness of a realistic multiprocessor, shared-memory hardware system, including the first mechanized correctness proof of a directory-based cache-coherence protocol for arbitrary cache hierarchies; that is, the proof is parameterized over an arbitrary number of processors connected to an arbitrary number of caches arranged in an arbitrary number of levels (e.g., L1, L2). Our proof has been carried out in the Coq proof assistant and is available at http://github.com/vmurali/SeqConsistency. Because our technique is based on a proof assistant rather than on state-space exploration, the cost of verification does not grow with the choice of parameters. In the process, we introduce a methodology for modular verification of hardware designs, based on the theory of labeled transition systems (LTSes).

LTSes as hardware descriptions are an established idea [2, 17, 18], and there are compilers that convert LTSes into efficient hardware. Our work is based on the Bluespec language [3, 6], whose semantics match the formalism of this paper. Bluespec specifies hardware components as atomic rules of a transition system over state elements, and its commercial compiler synthesizes these specifications into circuits (i.e., Verilog code) with competitive performance. The model that we verify is nearly a literal transliteration of real Bluespec designs that have been compiled to hardware. Our cache-coherent memory system is based on a Bluespec implementation [13] used to build an FPGA-based simulator for a cache-coherent multiprocessor PowerPC system [23]. The hardware synthesized from that implementation is rather efficient: an 8-core system with a 2-level cache hierarchy runs 55 million instructions per second on the BEE FPGA board [10]. Within Coq we adopt a semantics style very close to Bluespec's, using inductive definitions of state transition systems, where each transition rule corresponds to an atomic Bluespec rule.

Our high-level agenda is to import into the hardware-verification domain good ideas from the worlds of programming-language semantics and formal software verification, and to demonstrate some advantages of human-guided deductive techniques over model checking, which less readily supports modularity and generalization over infinite families of systems, and which may provide less insight to hardware designers (e.g., by not yielding human-understandable invariants about systems).

Paper Organization: We begin with a discussion of related work in Sect. 2. Section 3 introduces our flavor of the labeled-transition-systems formalism, including a definition of trace refinement. Section 4 shows a generic decomposition of any multiprocessor system, independent of the memory model that it implements, and discusses the store atomicity property of the memory subcomponent. Section 5 gives a simple formal model of sequential consistency, and Sect. 6 respecifies it as a composition of decoupled processors and a simple memory. The following sections refine the two main subcomponents of our multiprocessor system: Sect. 7 defines and verifies a speculative processor model, and Sect. 8 defines and proves our hierarchical cache-coherence protocol. Finally, Sect. 9 gives the whole-system modular proof, and Sect. 10 concludes.

2 Related Work

Hardware verification is dominated by model checking – for example, processor verification [8, 29] and (more recently) Intel's execution-cluster verification [22]. Many abstraction techniques are used to reduce designs to finite state spaces, which can be explored exhaustively. There are limits to the construction of sound abstractions, so verifications of protocols such as cache coherence have mostly treated systems with concrete topologies, involving particular finite numbers of caches and processors. For instance, explicit-state model-checking tools like Murphi [15] or TLC [21, 26] can handle only single-level cache hierarchies with fewer than ten addresses and ten CPUs, as opposed to the billions of addresses in a real system, or the ever-growing number of CPUs. Symbolic model-checking techniques have fared better: McMillan et al. verified a two-level MSI protocol based on the Gigamax distributed multiprocessor using SMV [31]. Optimizations of these techniques (e.g., partial-order reduction [4], symmetry reduction [5, 11, 12, 16, 19, 37], compositional reasoning [20, 28, 30], extended-FSM [14]) scale the approach further, verifying up to two levels of cache hierarchy, but they cannot handle multi-level hierarchical protocols. In fact, Zhang et al. [37] argue that parameterization must be restricted to a single dimension for state-of-the-art tools to scale practically. In all these cases, finding invariants automatically remains hard. Chou et al. [12] require manual insertion of extra invariants, called "non-interference lemmas," to eliminate counterexamples that violate the required property. The flow-based methodology [35] gives yet another way of specifying invariants manually. In general, we believe that the complexity of the manually specified invariants in those approaches is similar to that in ours. Moreover, we might hope to achieve higher assurance and understanding of design ideas by verifying infinite families of hardware designs, which resist reduction to finite-state models. Past work by Zhang et al. [37] model-checked hierarchical cache-coherence protocols [38], but with a restriction to binary trees of caches, relying on paper-and-pencil proofs about the behavior of fractal-like systems. Those authors agree that, as a result, the protocol suffers from a serious performance handicap. The cache protocol in this paper is chosen to support more realistic performance scaling.

Theorem provers have also been used to verify microprocessors, e.g., HOL to verify the academic microprocessor AVI-1 [36]. Cache-coherence proofs have also used mechanized theorem provers, though all previous work has verified only single-level hierarchies. Examples include using ACL2 to verify a bus-based snoop protocol [32]; using a combination of model checking and PVS [33] to verify the FLASH protocol [24]; and using PVS to mechanize some portions of a paper-and-pencil proof that the Cachet cache-coherence protocol [34] does not violate the CRF memory model. The first two of these works do not provide insights that can be used to design and verify other protocols, and the last falls short of proving a "full functional correctness" property of a memory system. In this paper, we verify that property for a complex cache protocol, based on human-meaningful invariants that generalize to related protocols.

3 Labeled Transition Systems

We make extensive use of the general theory of labeled transition systems, a semantics approach especially relevant to communicating concurrent systems. Because we are formalizing processors for Turing-complete machine languages, it is challenging to state theorems that preserve every aspect of processor behavior from a model such as SC. To focus our theorems, we pick the time-honored property of termination: an optimized system should terminate or diverge iff the reference system could also terminate or diverge, respectively. All sorts of other interesting program properties are reducible to this one, in the style of computability theory. Our basic definitions of transition systems build in special treatment of halting, so that we need not mention it explicitly in most of the following contexts.

Definition 1

A labeled transition system (LTS) is a ternary relation over \(\mathcal S^H \times \mathcal L^\epsilon \times \mathcal S^H\), for some sets \(\mathcal S\) of states and \(\mathcal L\) of labels. We usually do not mention these sets explicitly, as they tend to be clear from context. We write \(X^\epsilon \) for the lifting of a set X with an extra "empty" element \(\epsilon \) (like an option type in ML), and \(X^H\) for the lifting of a set X with an extra "halt" element H. We also implicitly associate each LTS with an initial state in \(\mathcal S\).

For LTS A, we write \((s) \xrightarrow [\textstyle \text {A}]{\textstyle \ell } (s')\) as shorthand for \((s, \ell , s') \in A\), and we write \(A_0\) for A’s initial state. The intuition is that A is one process within a concurrent system. The label \(\ell \) from set \(\mathcal L\) of labels is produced when A participates in some IO exchange with another process; otherwise it is an empty or “silent” label \(\epsilon \). For brevity, we may omit labels for \(\epsilon \) steps.
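To make the formalism concrete, here is a minimal Coq sketch of Definition 1. The names lifted, LTS, step, and init are ours for illustration, not those of the actual development, and we render \(\mathcal L^\epsilon \) with Coq's option type.

```coq
(* Sketch only: one way to encode Definition 1 in Coq. *)
Inductive lifted (X : Type) : Type :=
| Base : X -> lifted X   (* an ordinary element of X *)
| Halt : lifted X.       (* the extra "halt" element H *)

(* An LTS over states S and labels L: a ternary relation on
   S^H x L^eps x S^H, together with a distinguished initial state. *)
Record LTS (S L : Type) : Type := {
  step : lifted S -> option L -> lifted S -> Prop;
  init : S
}.
```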

3.1 Basic Constructions on LTSes

From an LTS representing single-step system evolution, we can build an LTS capturing arbitrary-length evolutions.

Definition 2

The transitive-reflexive closure of A, written \(A^{*}\), is a derived LTS. Where A’s states and labels are \(\mathcal S\) and \(\mathcal L\), the states of \(A^{*}\) are \(\mathcal S\), and the labels are \(\mathcal L^{*}\), or sequences of labels from the original system. \(A^{*}\) steps from s to \(s'\) when there exist zero or more transitions in A that move from s to \(s'\). The label of this transition is the concatenation of all labels generated in A, where the empty or “silent” label \(\epsilon \) is treated as an identity element for concatenation.
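A corresponding Coq sketch of Definition 2, under the same illustrative encoding (option labels for \(\epsilon \), label lists for \(\mathcal L^{*}\)):

```coq
(* Sketch: the transitive-reflexive closure of a step relation.
   Silent (None) steps contribute nothing to the trace; labeled
   steps are concatenated in order. *)
Inductive star {S L : Type} (step : S -> option L -> S -> Prop)
  : S -> list L -> S -> Prop :=
| StarRefl : forall s,
    star step s nil s
| StarEps : forall s s' s'' t,
    step s None s' -> star step s' t s'' -> star step s t s''
| StarLbl : forall s s' s'' l t,
    step s (Some l) s' -> star step s' t s'' -> star step s (l :: t) s''.
```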

We also want to compose n copies of an LTS together, with no explicit communication between them. We apply this construction later to lift a single-CPU system to a multi-CPU system.

Definition 3

The n -repetition of A, written \(A^{n}\), is a derived LTS. Where A’s states and labels are \(\mathcal S\) and \(\mathcal L\), the states of \(A^{n}\) are \(\mathcal S^n\), and the labels are \([1, n] \times \mathcal L\), or pairs that tag labels with which component system generated them. These labels are generated only when the component system generates a label. The whole system halts whenever one of the components halts.
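In the same illustrative style, one possible rendering of n-repetition; the propagation of halting is elided for brevity.

```coq
(* Sketch: n-repetition of a component step relation.  A state maps
   component indices to component states; one component steps while
   the others stay fixed, and any label is tagged with the index. *)
Inductive rep {S L : Type} (step : S -> option L -> S -> Prop) (n : nat)
  : (nat -> S) -> option (nat * L) -> (nat -> S) -> Prop :=
| RepStep : forall (st st' : nat -> S) (i : nat) (l : option L),
    i < n ->
    step (st i) l (st' i) ->
    (forall j, j <> i -> st' j = st j) ->
    rep step n st (option_map (pair i) l) st'.
```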

Eventually, we need processes to be able to communicate with each other, which we formalize via the \(+\) composition operator that connects same-label transitions in the two systems, treating the label as a cooperative communication event that may now be hidden from the outside world, as an \(\epsilon \) label.

Definition 4

Where A and B are two LTSes sharing label set \(\mathcal L\), with state sets \(\mathcal S_A\) and \(\mathcal S_B\) respectively, the communicating composition \(A + B\) is a new LTS with states \(\mathcal S_A \times \mathcal S_B\) and an empty label set, defined as follows:

\(\dfrac{(s_A) \xrightarrow [\textstyle \text {A}]{\textstyle \epsilon } (s_A')}{((s_A, s_B)) \xrightarrow [\textstyle \text {A+B}]{\textstyle \epsilon } ((s_A', s_B))}\) \(\quad \dfrac{(s_B) \xrightarrow [\textstyle \text {B}]{\textstyle \epsilon } (s_B')}{((s_A, s_B)) \xrightarrow [\textstyle \text {A+B}]{\textstyle \epsilon } ((s_A, s_B'))}\) \(\quad \dfrac{(s_A) \xrightarrow [\textstyle \text {A}]{\textstyle \ell } (s_A') \quad (s_B) \xrightarrow [\textstyle \text {B}]{\textstyle \ell } (s_B')}{((s_A, s_B)) \xrightarrow [\textstyle \text {A+B}]{\textstyle \epsilon } ((s_A', s_B'))}\)
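A Coq sketch of the same composition, again with option labels; the composite relation is unlabeled, reflecting the empty label set.

```coq
(* Sketch: communicating composition.  Matching labels synchronize
   and become silent; silent steps of either side interleave freely. *)
Inductive compose {SA SB L : Type}
  (stepA : SA -> option L -> SA -> Prop)
  (stepB : SB -> option L -> SB -> Prop)
  : SA * SB -> SA * SB -> Prop :=
| CSilentA : forall a a' b,
    stepA a None a' -> compose stepA stepB (a, b) (a', b)
| CSilentB : forall a b b',
    stepB b None b' -> compose stepA stepB (a, b) (a, b')
| CSync : forall a a' b b' l,
    stepA a (Some l) a' -> stepB b (Some l) b' ->
    compose stepA stepB (a, b) (a', b').
```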

3.2 Refinement Between LTSes

We need a notion of when one LTS “implements” another. Intuitively, transition labels and halting are all that the outside world can observe. Two systems that produce identical labels and termination behavior under all circumstances can be considered as safe substitutes for one another. We need only an asymmetrical notion of compatibility:

Definition 5

For some label domain \(\mathcal L\), let \(f : \mathcal L \rightarrow \mathcal L^\epsilon \) be a function that is able to replace labels with alternative labels, or erase them altogether. Let LTSes A and B have the same label set \(\mathcal L\). We say that A trace-refines B w.r.t. f, or \(A \sqsubseteq _f B\), if:

\(\forall \eta , s_A. \; (A_0) \xrightarrow [\textstyle \text {A}^{*}]{\textstyle \eta } (s_A) \implies \exists s_B. \; (B_0) \xrightarrow [\textstyle \text {B}^{*}]{\textstyle f(\eta )} (s_B) \wedge (s_A = H \Leftrightarrow s_B = H)\)

Each label in the trace is replaced by the mapping of f on it, and labels mapped to \(\epsilon \) by f are dropped. f is overloaded to denote the multilabel version when applied to \(\eta \).

For brevity, we write \(A \sqsubseteq B\) for \(A \sqsubseteq _\mathsf {id} B\), for identity function \(\mathsf {id}\), forcing traces in the two systems to match exactly. Under this notion of identical traces, we say that A is sound w.r.t. B. That case matches traditional notions of trace refinement, often proved with simulation arguments, which we also adopt.
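The relabeling behavior of f on whole traces can be pinned down with a short Coq sketch; mapTrace and refines are our illustrative names, and the halting condition is elided.

```coq
Require Import List.
Import ListNotations.

(* Sketch: apply a relabeling f to a whole trace, dropping labels
   that f erases (the overloaded use of f on eta in Definition 5). *)
Definition mapTrace {L : Type} (f : L -> option L) (eta : list L)
  : list L :=
  flat_map (fun l => match f l with
                     | Some l' => [l']
                     | None => []
                     end) eta.

(* Sketch: trace refinement w.r.t. f, phrased over trace sets. *)
Definition refines {L : Type} (f : L -> option L)
  (tracesA tracesB : list L -> Prop) : Prop :=
  forall eta, tracesA eta -> tracesB (mapTrace f eta).
```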

3.3 A Few Useful Lemmas

We need the following theorems in our proof.

Theorem 1

\(\sqsubseteq \) is reflexive and transitive.

Theorem 2

If \(A \sqsubseteq _f B\), then \(A^{n} \sqsubseteq _{f^n} B^{n}\), where \(f^n\) is f lifted appropriately to deal with indices (\(f^n(i, \ell ) = (i, \ell ')\) when \(f(\ell ) = \ell '\), and \(f^n(i, \ell ) = \epsilon \) when \(f(\ell ) = \epsilon \)).

Theorem 3

If \(A \sqsubseteq _f A'\) and \(B \sqsubseteq _f B'\), then \(A + B \sqsubseteq _\mathsf {id} A' + B'\).

All these theorems can be proved using standard techniques.

4 Decomposing a Shared-Memory Multiprocessor System

Any conventional multiprocessor system can be divided logically into three components, as shown in Fig. 1. The top-level system design is shown in the middle, while the details of its components, the memory system and the processor (\(P_i\)), are shown in the magnified boxes. The processor component \(P_i\) can be implemented in a variety of ways, ranging from one that executes instructions one at a time in program order to a complex one that speculatively executes many instructions concurrently to exploit parallelism. The memory component is normally implemented using a hierarchy of caches to increase overall system performance, because the latency of accessing memory directly is large compared to that of accessing a much smaller cache. Between each processor and the global memory subsystem sits a local buffer \(LB_i\), specific to processor \(P_i\).

Fig. 1. Components of a multiprocessor system

Popular ISAs, such as Intel x86, ARM, and PowerPC, do not guarantee sequential consistency. However, we want to emphasize that, in every weak-memory system we are aware of, the main memory still exposes atomic loads and stores! Weaker semantics in a core \(P_i\) arise only because of (1) reordering of memory instructions by the core and/or (2) the properties of the local buffers \(LB_i\) connected to \(P_i\).

Consequently, we focus on this opportunity to simplify proof decomposition. We prove that our main memory component satisfies an intuitive store atomicity property – which is an appropriate specification of the memory component even for implementations of weaker memory models. Store atomicity can be understood via the operational semantics of Fig. 2, describing an LTS that receives load and store requests (\(\mathsf {Ld}\) and \(\mathsf {St}\)) from processors and sends back load responses (\(\mathsf {Ld}\mathsf {Rp}\)). The transfer happens via input buffers \( ins (p)\) from processor p and output buffers \( outs (p)\) to processor p. Note that this model allows the memory system to answer pending memory requests in any order (as indicated by the bag union operator \(\uplus \)), even potentially reordering requests from a single processor, so long as, whenever it does process a request, that action appears to take place atomically.

Fig. 2. LTS for a simple memory

Figure 2 provides our first example of a hardware component specified as an LTS via a set of inference rules. Such notation may seem far from the domain of synthesizable hardware, but it is actually extremely close to Bluespec notation, and the Bluespec compiler translates such rules automatically into hardware circuits in Verilog [1].
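To suggest the flavor, here is a hedged Coq sketch of the Fig. 2 memory. All names are illustrative, bags of pending requests are modeled as lists identified up to permutation, and the labeled transitions that move messages between processors and these buffers are elided.

```coq
(* Sketch only: the simple memory of Fig. 2.  Requests may be served
   in any order, but each one reads or writes the memory atomically. *)
Require Import List Permutation.
Import ListNotations.

Section SimpleMemory.
  Variables (Proc Tag Addr Val : Type).
  Variable addr_eq : forall a b : Addr, {a = b} + {a <> b}.
  Variable proc_eq : forall p q : Proc, {p = q} + {p <> q}.

  Inductive Req := LdRq (t : Tag) (a : Addr) | StRq (a : Addr) (v : Val).
  Inductive Rsp := LdRp (t : Tag) (v : Val).

  (* Pointwise update of a function at one point. *)
  Definition upd {A B : Type} (eq : forall x y : A, {x = y} + {x <> y})
    (f : A -> B) (x : A) (b : B) : A -> B :=
    fun y => if eq y x then b else f y.

  Record MState := { ins : Proc -> list Req;    (* requests from processors *)
                     outs : Proc -> list Rsp;   (* responses back to them *)
                     mem : Addr -> Val }.

  Inductive mstep : MState -> MState -> Prop :=
  | ServeLd : forall st p t a rest,   (* serve any pending load atomically *)
      Permutation (ins st p) (LdRq t a :: rest) ->
      mstep st {| ins := upd proc_eq (ins st) p rest;
                  outs := upd proc_eq (outs st) p
                            (outs st p ++ [LdRp t (mem st a)]);
                  mem := mem st |}
  | ServeSt : forall st p a v rest,   (* serve any pending store atomically *)
      Permutation (ins st p) (StRq a v :: rest) ->
      mstep st {| ins := upd proc_eq (ins st) p rest;
                  outs := outs st;
                  mem := upd addr_eq (mem st) a v |}.
End SimpleMemory.
```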

The memory component is composed of a hierarchy of caches, with cache nodes labeled like “L1,” “L2,” etc., to avoid the latency of round trips with main memory. Therefore, it is the responsibility of the hierarchy of caches (which forms the memory subcomponent) to implement the store atomicity property. In fact, as we prove in Sect. 8, the purpose of the cache-coherence protocol is to establish this invariant for the memory subcomponent. Concretely, we have verified a directory-based protocol for coordinating an arbitrary tree of caches, where each node stores a conservative approximation of its children’s states.

As an instance of the above decomposition, we prove that a multiprocessor system with no local buffering between the processor and memory components indeed implements SC. We implement a highly speculative processor that executes instructions and issues loads out of order but commits instructions (once some "verification" is done) in order.

The processor itself can be decomposed into several components. In the zoomed-in version of Fig. 1, we show a highly speculative out-of-order-issue processor. We have the normal architectural state, such as values of registers. Our proofs are generic over a family of instruction set architectures, with parameters for opcode sets and functions for executing opcodes and decoding them from memory. Other key components are a branch predictor, which guesses at the control-flow path that a processor will follow, to facilitate speculation; and a reorder buffer (ROB), which decides which instructions along that path to try executing ahead of schedule. Our proofs apply to an arbitrary branch predictor and any reorder buffer satisfying a simple semantic condition.

Our framework establishes theorems of the form “if system A has a run with some particular observable behavior, then system B also has a run with the same behavior.” In this sense, we say that A correctly implements B. Other important properties, such as deadlock freedom for A (which might get stuck without producing any useful behavior), are left for future work.

5 Specifying Sequential Consistency

Our final theorem in this paper establishes that a particular complex hardware system implements sequential consistency (SC) properly. We state the theorem in terms of the trace refinement relation \(\sqsubseteq \) developed in Sect. 3. Therefore, we need to define an LTS embodying SC. The simpler this system, the better. We need not worry about its performance, since we prove that an optimized system remains faithful to it.

Figure 3 defines an LTS for an n-processor system that is sequentially consistent, parameterized over details of the ISA. In particular, the ISA gives us some domains of architectural states s (e.g., register files) and of program counters \( pc \). A function \(\mathsf {dec}(s, pc )\) figures out which instruction \( pc \) references in the current state, returning the instruction’s “decoded” form. A companion function \(\mathsf {exec}(s, pc , d)\) actually executes the instruction, returning a new state \(s'\) and the next program counter \( pc '\).

Fig. 3. LTS for SC with n simple processors

The legal instruction forms, which are outputs of \(\mathsf {dec}\), are \((\mathsf {Nm}, x)\), for an operation not accessing memory; \((\mathsf {Ld}, x, a)\), for a memory load from address a; \((\mathsf {St}, a, v)\), for a memory store of value v to address a; and H, for a “halt” instruction that moves the LTS to state H. The parameter x above represents the rest of the instruction, including the opcode, registers, constants, etc.

The legal inputs to \(\mathsf {exec}\) encode both a decoded instruction and any relevant responses from the memory system. These inputs are \((\mathsf {Nm}, x)\) and \(\mathsf {St}\), which need no extra input from the memory; and \((\mathsf {Ld}, x, v)\), where v gives the contents of the requested memory cell.
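The ISA parameterization can be summarized with a small Coq sketch; the constructor and type names here are ours, with Rest standing for the opaque remainder x of an instruction.

```coq
(* Sketch of the ISA interface assumed by the SC LTS of Fig. 3. *)
Section ISA.
  Variables (ArchState PC Addr Val Rest : Type).

  Inductive DecodedInst :=      (* the outputs of dec *)
  | DNm : Rest -> DecodedInst            (* (Nm, x): no memory access *)
  | DLd : Rest -> Addr -> DecodedInst    (* (Ld, x, a): load from a *)
  | DSt : Addr -> Val -> DecodedInst     (* (St, a, v): store v to a *)
  | DHalt : DecodedInst.                 (* H: halt *)

  Inductive ExecInput :=        (* the inputs to exec *)
  | ENm : Rest -> ExecInput
  | ESt : ExecInput
  | ELd : Rest -> Val -> ExecInput.      (* carries the loaded value v *)

  Variable dec : ArchState -> PC -> DecodedInst.
  Variable exec : ArchState -> PC -> ExecInput -> ArchState * PC.
End ISA.
```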

We define the initial state of SC as \((\theta _0, m_0)\), where \(m_0\) is some initial memory fixed throughout our development, mapping every address to value \(v_0\); and \(\theta _0\) maps every processor ID to \((s_0, pc _0)\), using architecture-specific default values \(s_0\) and \( pc _0\).

This LTS encodes Lamport's notion of SC, where processors take turns executing nondeterministically in a simple interleaving. Note that, in this setting, an operational specification such as the SC LTS is precisely the proper characterization of full functional correctness for a hardware design, much as a precondition-postcondition pair plays that role in a partial-correctness Hoare logic. Our SC LTS fully constrains the observable behavior of a system to remain consistent with simple interleaving. Similar operational models are possible as top-level specifications for systems following weaker memory models, by giving an LTS for the local-buffer component and composing all three components.

Our final, optimized system is parameterized over an ISA in the same way as SC is. In the course of the rest of this paper, we define an optimized system O and prove \(O \sqsubseteq \text {SC}\). To support a modular proof decomposition, however, we need to introduce a few intermediate systems first.

Fig. 4. LTS for a simple decoupled processor (P\(_{\text {ref}}\))

6 Respecifying Sequential Consistency with Communication

Realistic hardware systems do not implement the monolithic SC of Fig. 3 directly. Instead, there is usually a split between processors and memory. Here we formalize that split using LTSes that compose to produce a system refining the SC model.

Figure 4 defines an LTS for a simple decoupled processor (P\(_\text {ref}\)). Memory does not appear within a processor’s state. Instead, to load from or store to an address, requests are sent to the memory system and responses are received. Both kinds of messages are encoded as labels: \(\text {ToM}\) for requests to memory and \(\text {ToP}\) for responses from memory back to the processor.

A state of P\(_{\text {ref}}\) is a triple \((s, pc , wait )\), giving the current architectural state s and program counter \( pc \), as well as a Boolean flag \( wait \) indicating whether the processor is blocked waiting for a response from the memory system. As in the SC model, the state of the processor is changed to H whenever \(\mathsf {dec}\) returns H.

As initial state for system \(P_\text {ref}\), we use \((s_0, pc _0, \bot )\).

The simple memory defined earlier in Fig. 2 is meant to be composed with P\(_{\text {ref}}\) processors. A request to memory like \((t, \mathsf {Ld}, a)\) asks for the value of memory cell a, associating a tag t that the processor can use to match responses to requests. Those responses take the form \((t, \mathsf {Ld}, v)\), giving the value v of the requested memory address.

A memory state is a triple \(( ins , outs , m)\), giving not just the memory m itself, but also buffers \( ins \) and \( outs \) for receiving processor requests and batching up responses to processors, respectively. We define the initial state of the \(M_m\) LTS as \((\emptyset , \emptyset , m_0)\), with empty queues.

Now we can compose these LTSes to produce an implementation of SC.

For a system of n processors, our decoupled SC implementation is \(\text {P}^n_{\text {ref}} + M_m\).

Theorem 4

\(P^n_{{ref}} + M_m \sqsubseteq SC\)

Proof

By induction on traces of the decoupled system, relating them to traces of the SC reference (similar to the technique of WEB refinement [27]). We must choose an abstraction function f from states of the complex system to states of the simple system. This function must be inductive in the appropriate sense: a step from s to \(s'\) on the left of the simulation relation must be matched by a sequence of steps on the right from f(s) to \(f(s')\). We choose an f that simply preserves state components in states with no pending memory-to-processor responses; when such responses exist, f first applies them to the appropriate processors.    \(\square \)

7 Speculative Out-of-Order Processor

We implement a speculative processor, which may create many simultaneous outstanding requests to the memory – as an optimization to increase parallelism. Our processor proof is in some sense generic over correct speculation strategies. We parameterize over two key components of a processor design: a branch predictor (which makes guesses about processor-local control flow in advance of resolving conditional jumps) and a reorder buffer (which decides what speculative instructions – such as memory loads – are worth issuing at which moments, in effect reordering later instructions to happen before earlier instructions have finished).

The branch predictor is the simpler of the two components; we write \( bp \) as a metavariable for its state. The operations on such state are \(\mathsf {curPpc}( bp )\) (to extract the current program-counter prediction); \(\mathsf {nextPpc}( bp )\) (to advance to predicting the next instruction); and \(\mathsf {setNextPpc}( bp , pc )\) (to reset prediction to begin at a known-accurate position \( pc \)). We need not impose any explicit correctness criterion on branch predictors: the processor uses predictions only as hints, and it always resets the predictor using \(\mathsf {setNextPpc}\) after detecting an inaccurate hint.
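A Coq-style sketch of this interface, with illustrative names:

```coq
(* Sketch: the branch-predictor interface.  Bp is the abstract
   predictor-state type written bp in the text; no correctness
   condition is imposed on it. *)
Record BranchPredictor (PC Bp : Type) := {
  curPpc : Bp -> PC;            (* current program-counter prediction *)
  nextPpc : Bp -> Bp;           (* advance to the next prediction *)
  setNextPpc : Bp -> PC -> Bp   (* reset to a known-accurate pc *)
}.
```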

The interface and formal contract of a reorder buffer are more involved. We write \( rob \) as a metavariable for reorder-buffer state, and \(\phi \) denotes the state of an empty buffer. The operations associated with \( rob \) are as follows (a sketch of this interface appears after the list):

  • \(\mathsf {insert}( pc , rob )\), which appends the program instruction at location \( pc \) to the list of instructions that the buffer is allowed to consider executing.

  • \(\mathsf {compute}( rob )\), which models a step of computation inside the buffer, returning both an updated buffer state and an optional speculative load to issue. For instance, it invokes the \(\mathsf {dec}\) and \(\mathsf {exec}\) functions (as defined for SC) internally to obtain the next program counter, state, etc. (without updating the actual architectural state).

  • \(\mathsf {upd}\mathsf {Ld}( rob ,t,v)\), which informs the buffer that the memory has returned result value v for the speculative load with tag \(t \ne \epsilon \).

  • \(\mathsf {commit}( rob )\), which returns the next instruction in serial program order, if we have accumulated enough memory responses to execute it accurately, or returns \(\epsilon \) otherwise. When \(\mathsf {commit}\) returns an instruction, it also returns the associated program counter plus the next program counter to which it would advance afterward. Furthermore, the instruction is extended with any relevant response from memory (used only for load instructions, obtained through \(\mathsf {upd}\mathsf {Ld}\)) and with the new architectural state (e.g., register file) after execution.

  • \(\mathsf {retire}( rob )\), which informs the buffer that its \(\mathsf {commit}\) instruction was executed successfully, so it is time to move on to the next instruction.
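Here is the promised sketch of the reorder-buffer interface. The result types are our guesses at a faithful rendering of the description above, not the development's actual definitions.

```coq
(* Sketch: the reorder-buffer interface.  Rob is the abstract buffer
   state; Inst stands for a committed instruction extended with any
   memory response and the new architectural state. *)
Record ROB (Rob PC Tag Addr Val Inst : Type) := {
  robEmpty : Rob;                               (* the empty buffer, phi *)
  insert : PC -> Rob -> Rob;                    (* enqueue instruction at pc *)
  compute : Rob -> Rob * option (Tag * Addr);   (* maybe issue a speculative load *)
  updLd : Rob -> Tag -> Val -> Rob;             (* memory returned v for tag t *)
  commit : Rob -> option (PC * Inst * PC);      (* next in-order instruction,
                                                   its pc, and the next pc *)
  retire : Rob -> Rob                           (* commit succeeded; advance *)
}.
```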

Fig. 5. Speculating, out-of-order issue processor

Figure 5 defines the speculative processor LTS P\(_\mathtt{so }\). This processor may issue arbitrary speculative loads, but it commits only the instruction that comes next in serial program order. The processor issues two kinds of loads: speculative loads (whose tags are not \(\epsilon \)) and commit or real loads (whose tags are \(\epsilon \)). To maintain SC, every speculative load must be followed by a matching verification load, and we maintain the illusion that the program depends only on the results of verification loads, which, along with stores, must be issued in serial program order.

When committing a previously issued speculative load instruction, the associated speculative memory-load response is verified against the new commit-load response. If the resulting values do not match, the processor discards all past uncommitted speculation, emptying the reorder buffer and resetting the branch predictor's next predicted program counter to the correct next value. In common cases, the cost of executing loads twice is low, because the verification load is likely to find the address already in a local cache – thanks to the recent processing of the speculative load. Moreover, 60% to 90% of verification loads can be avoided by tracking speculative loads [9]; in the future, we plan to extend our proofs to include such optimizations.

A full processor state is \((s, pc , wait , rob , bp )\), comprising architectural state, the program counter, a Boolean flag indicating whether the processor is waiting for a memory response about an instruction being committed, and the reorder-buffer and branch-predictor states. Its initial state is given by \((s_0, pc _0, \bot , \phi , bp _0)\). The interface of this processor with memory (i.e., communication labels with \(\mathsf {ToM}, \mathsf {ToP}\)) is identical to that of the reference processor.

Finally, we impose a general correctness condition on reorder buffers (Fig. 6). Intuitively, whenever the buffer claims (via a \(\mathsf {commit}\) output) that a particular instruction is next to execute (thus causing certain state changes), that instruction must really be next in line according to how the program runs in the SC system, and its execution must really cause those state changes.

Fig. 6. Correctness of reorder buffer

When this condition holds, we may conclude the correctness theorem for out-of-order processors. We use a trace-transformation function \(\mathsf {noSpec}\) that drops all speculative-load requests and responses (i.e., those load requests and responses whose tags are not \(\epsilon \)); see Definition 5 for how we use such functions in framing trace refinement. Intuitively, we prove that any behavior by the speculating processor can be matched by the simple processor, with speculative messages erased.
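A sketch of the relabeling itself, assuming a hypothetical accessor tagOf for the tag field of a message:

```coq
(* Sketch: noSpec as a label-mapping function in the sense of
   Definition 5.  A present tag marks speculative load traffic,
   which is erased; tag epsilon (None) marks real commit traffic. *)
Definition noSpec {Msg Tag : Type} (tagOf : Msg -> option Tag)
  (m : Msg) : option Msg :=
  match tagOf m with
  | Some _ => None    (* speculative request/response: dropped *)
  | None => Some m    (* commit message: preserved *)
  end.
```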

Theorem 5

\(P_{{so}} \sqsubseteq _{\mathsf {noSpec}} P_{{ref}}\)

Proof

By induction on P\(_\text {so}\) traces, using an abstraction function that drops the speculative messages and the \( rob \) and \( bp \) states to relate the two systems. The reorder-buffer correctness condition is crucial to relate its behavior with the simple in-order execution of P\(_\text {ref}\).   \(\square \)

Corollary 1

\(P^n_{so} \sqsubseteq _{\mathsf {noSpec}^n} P_{ref}^n\)

Proof

Direct consequence of Theorems 5 and 2 (the latter is about n-repetition).   \(\square \)

8 Cache-Based Memory System

We now turn our attention to a more efficient implementation of memory. With the cache hierarchy of Fig. 1, many processors interact concurrently with many caches, and the relationship with the original \(M_m\) system is far from direct. However, this intricate concurrent execution is crucial to hiding the latency of main-memory access. Figure 7 formalizes as an LTS \(M_c\) the algorithm we implemented (based on a published implementation [13]) for providing the memory abstraction on top of a cache hierarchy: an invalidating, directory-based, hierarchical cache-coherence protocol.

Fig. 7. LTS for cache-coherent shared-memory system

We describe a state of the system using fields \( d \), \( ch \), \( cs \), \( dir \), \( w \), \( dirw \), \( ins \), \( outs \). The \( ins \) and \( outs \) sets are the interfaces to the processors and are exactly the same as in \(M_m\) (Fig. 2). We use \(\textit{parent}(c,p)\) to denote that p is the parent of c.

A coherence state is M, S, or I, broadly representing permissions to modify, read, or do nothing with an address, respectively, the decreasing permissions denoted by \(M > S > I\). More precisely, if a node n is in coherence state M or S for some address, then there might be some node in n’s subtree that has write or read permissions, respectively, for that address. Coherence state of cache c for address a is denoted by \( cs (c,a)\). \( d (c,a)\) represents the data in cache c for address a.
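The permission order can be made concrete with a tiny Coq sketch; the constructor names CI, CS, CM (for I, S, M) are ours, chosen to avoid clashing with Coq's nat constructor S.

```coq
(* Sketch: the coherence lattice I < S < M as a decidable order. *)
Inductive CState := CI | CS | CM.   (* I, S, M respectively *)

Definition rank (x : CState) : nat :=
  match x with CI => 0 | CS => 1 | CM => 2 end.

(* cle x y tests x <= y in the permission order. *)
Definition cle (x y : CState) : bool := Nat.leb (rank x) (rank y).
```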

\( w (c,a)\) stores the permission an address a in cache c is waiting for, if any. That is, cache c has decided to upgrade its coherence state for address a to a more permissive value, but it is waiting for acknowledgment from its parent before upgrading.

\( dir (p,c,a)\) represents the parent p’s notion of the coherence state of the child c for address a. We later prove that this notion is always conservative, i.e., if the parent assumes that a child does not have a particular permission, then it is guaranteed in this system that the child will not have that permission. \( dirw (p,c,a)\) denotes whether the parent p is waiting for any downgrade response from its child c for address a, and if so, the coherence state that the child must downgrade to as well.

There are three types of communication channels in the system: (i) \( ch (p,c,\mathsf {RR})\), which carries both downgrade-request and upgrade-response messages from parent p to its child c; (ii) \( ch (c,p,\mathsf {Rq})\), which carries upgrade-request messages from child c to its parent p; and (iii) \( ch (c,p,\mathsf {Rp})\), which carries downgrade-response messages from child c to its parent p. While the \( ch (c,p,\mathsf {Rp})\) and \( ch (p,c,\mathsf {RR})\) channels deliver messages between the same pair of nodes in the same order in which the messages were injected (i.e., they obey the FIFO property, indicated by the ordered-queue notation in Fig. 7), \( ch (c,p,\mathsf {Rq})\) need not obey such a property (indicated by the use of \(\uplus \) for unordered bags in Fig. 7). This asymmetry arises because only one downgrade request can be outstanding for a given parent-child pair and address.

Here is the intuition for how the transitions work in the common case. A cache can spontaneously decide to upgrade its coherence state, in which case it sends an upgrade request to its parent. The parent then makes a local decision on whether to respond to the requesting child, based on its directory approximation and its own coherence state \( cs \). If \( cs \) is lower than the requested upgrade, the parent cannot handle the request and must instead decide to upgrade \( cs \) itself. Once the parent's \( cs \) is not lower than the requested upgrade, it makes sure that the rest of its children are "compatible" with the requested upgrade (as captured by the \(\mathsf {dirCompat}\) definition below); if they are not, the parent sends requests to the incompatible children to downgrade. Finally, when the \( cs \) upgrade and the children's downgrade responses have all been received, the original request can be answered. A request in \( ins \) can be processed by an L1 cache only if the cache is in the appropriate state; otherwise, the cache must first request an upgrade for that address.

Definition 6

\( \mathsf {dirCompat}(p, c, x, a) = \left\{ \begin{array}{l} x = M \Rightarrow \forall c' \ne c. \; dir (p,c',a) = I\\ x = S \Rightarrow \forall c' \ne c. \; dir (p,c',a) \le S \end{array} \right. \)
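Rendering Definition 6 over the lattice sketch above: Cache and Addr are abstract parameters, CState and cle come from that sketch, and the case x = I, which Definition 6 leaves unconstrained, becomes True.

```coq
(* Sketch: dirCompat, following Definition 6. *)
Section DirCompat.
  Variables (Cache Addr : Type).
  Variable dir : Cache -> Cache -> Addr -> CState.

  Definition dirCompat (p c : Cache) (x : CState) (a : Addr) : Prop :=
    match x with
    | CM => forall c', c' <> c -> dir p c' a = CI
    | CS => forall c', c' <> c -> cle (dir p c' a) CS = true
    | CI => True   (* no constraint when requesting I *)
    end.
End DirCompat.
```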

A complication arises because a cache can voluntarily decide to downgrade its state; this transition models invalidation of cache lines to make room for a different location. As a result, the parent's \( dir \) and the corresponding \( cs \) of the child may go out of sync, and the parent may request a downgrade from a child that has already performed it. To handle this situation, the child drops a downgrade request when it has already downgraded to the required state (rule DropReq in Fig. 7); simply never dequeuing the request would lead to deadlock.

8.1 Proving \(M_c\) is Store Atomic

We must prove the following theorem, stating that the cache-based system is sound with respect to the simple memory.

Theorem 6

\(M_c \sqsubseteq M_m\)

We present the key theorem needed for this proof below. Throughout this section, we use time to denote the number of transitions that occurred before reaching the state in question.

Theorem 7

A is store atomic, i.e., \(A \sqsubseteq M_m\) and \(M_m \sqsubseteq A\), iff for any load request \(\mathsf {ToM}(t, \mathsf {Ld}, a)\) received, the response \(\mathsf {ToP}(t, \mathsf {Ld}, v)\) sent at time T satisfies one of the following:

1. \(v = v_0\) (the initial value of any memory address) and no store request \(\mathsf {ToM}(\mathsf {St},a, v')\) has been processed at any time \(T'\) such that \(T' < T\); or

2. there is a store request \(\mathsf {ToM}(\mathsf {St}, a, v)\) that was processed at time \(T_q\) such that \(T_q < T\), and no other store request \(\mathsf {ToM}(\mathsf {St}, a, v')\) was processed at any time \(T'\) such that \(T_q < T' < T\).

The proof that \(M_c\) obeys the properties in Theorem 7 is involved enough that we state only key lemmas that we used.

Lemma 1

At any time T, if address a in cache c obeys \( cs (c,a) \ge S\) and \(\forall i.\; dir (c,i,a) \le S\), then a has the latest value, i.e., one of the following holds:

1. \( d (c,a) = v_0\) and no store request \(\mathsf {ToM}(\mathsf {St}, a, v)\) has been processed at any time \(T'\) such that \(T' < T\); or

2. there is a store request \(\mathsf {ToM}(\mathsf {St}, a, v)\) that was processed at time \(T_q\) such that \(T_q < T \wedge d (c,a) = v\), and no other store request \(\mathsf {ToM}(\mathsf {St}, a, v')\) was processed at any time \(T'\) such that \(T_q < T' < T\).

It is relatively straightforward to prove the properties of Theorem 7 given Lemma 1. To prove Lemma 1, we decompose it further into the following lemmas, each of which holds at any time.

Lemma 2

If some response m for an address a is in transit (i.e., we are considering any time T such that \(T_s \le T \le T_r\) where \(T_s\) is the time of sending m and \(T_r\) the time of receiving m), then no cache can process store requests for a, and m must be sent from a cache c where \( cs (c, a) \ge S\) and \(\forall i.\; dir(c,i,a)\le S\).

Lemma 3

At any time, \(\forall p, c, a.\; \textit{parent}(c,p) \Rightarrow cs (c, a) \le dir (p, c, a) \wedge \mathsf {dirCompat}(p,c, dir (p,c,a),a) \wedge dir (p,c,a) \le cs (p,a)\)

The same proof structure can be used to verify other invalidation-based protocols with inclusive caches (where any address present in a cache is also present in its parent), such as MESI, MOSI, and MOESI; we omit the discussion of extending the proof to these protocols for space reasons. The MSI proof is about 12,000 lines of Coq code, of which 80% can be reused as-is for the other protocols.

9 The Final Result

With our two main results about optimized processors and memories, we can complete the correctness proof of the composed optimized system.

First, we need to know that, whenever the simple memory can generate some trace of messages, it could also generate the same trace with all speculative messages removed. We need this property to justify the introduction of speculation, during our final series of refinements from the optimized system to SC.

Theorem 8

\(M_m \sqsubseteq _{\mathsf {noSpec}^n} M_m\)

Proof

By induction on traces, with an identity abstraction function.    \(\square \)

That theorem turns out to be the crucial ingredient justifying the placement of a speculative processor in context with the simple memory.

Theorem 9

\(P^n_{so} + M_m \sqsubseteq P^n_{ref} + M_m\)

Proof

Follows from Theorem 3 (our result about \(+\)), Corollary 1, and Theorem 8.    \(\square \)

The last theorem kept the memory the same while refining the processor. The next one does the opposite, switching out memory.

Theorem 10

\(P^n_{{so}} + M_c \sqsubseteq P^n_{{so}} + M_m\)

Proof

Follows from Theorems 6 and 3 plus reflexivity of \(\sqsubseteq \) (Theorem 1).    \(\square \)

Theorem 11

\(P^n_{{so}} + M_c \sqsubseteq SC\)

Proof

We twice apply transitivity of \(\sqsubseteq \) (Theorem 1) to connect Theorems 10, 9, and 4.   \(\square \)

10 Conclusions and Future Work

In this paper, we developed a mechanized, modular proof of a parametric hierarchical cache-coherence protocol in Coq and used it, modularly, to verify sequential consistency for a complete system containing out-of-order processors. Our proof modularization corresponds naturally to the modularization seen in hardware implementations, allowing verification to be carried out in tandem with design. Our overall goal is to enable the design of formally verified hardware systems. To this end, we have been working on a DSL in Coq for translating to and from Bluespec, and we are developing appropriate libraries and proof automation, extending the work of Braibant et al. [7] with support for modular specification and verification and systematizing some elements of this paper's Coq development that are specialized to our particular proof.

While we provide a clean interface for an SC system, we are also working on encompassing relaxed memory models commonly used in modern processors.