Caches and Shared Memory
Accesses to shared memory hardware are performed in parallel as often as possible. After all, the purpose of multi-core computing is to gain speed through parallelism. If memory accesses were sequentialized as in the atomic protocols, memory would become a sequential bottleneck.
Accesses to cache-based hardware memory systems take one, two, or many more hardware cycles. Thus, they are certainly not performed in an atomic fashion.
In Sect. 8.1 we introduce what we call abstract caches and show that the common basic types of single caches (direct mapped, k-way set associative, fully associative) can be modeled as abstract caches. This will later permit us to simplify notation considerably. It also permits us to unify most of the theory of shared memory constructions for all basic cache types. However, at present our definition of abstract caches does not yet include eviction addresses. The construction we present uses direct mapped caches, and we have to deal with eviction addresses below the abstract cache level. Adapting this small part of the construction and the corresponding arguments to other basic cache types should not be hard. Modifying the definition of abstract caches so that they can be used entirely as a black box is still future work. In the classroom it suffices to show that direct mapped caches can be modeled as abstract caches.
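As a minimal illustrative sketch (the names and interface below are ours, not the book's formal notation): an abstract cache can be viewed as a partial map from addresses to data, and a direct mapped cache with L lines implements such a map by storing address a in line a mod L together with tag a div L. Writing to an address may evict the previous occupant of the same line.

```python
# Hypothetical sketch of a direct mapped cache viewed as an abstract cache,
# i.e. as a partial map from addresses to data. Not the book's definitions.

class DirectMappedCache:
    def __init__(self, num_lines):
        self.L = num_lines
        self.valid = [False] * num_lines   # line occupied?
        self.tag = [0] * num_lines         # a div L for the stored address a
        self.data = [None] * num_lines

    def hit(self, a):
        i = a % self.L
        return self.valid[i] and self.tag[i] == a // self.L

    def read(self, a):
        # defined only on a hit, as for a partial map
        assert self.hit(a)
        return self.data[a % self.L]

    def write(self, a, d):
        # inserting a may evict the old occupant of line a mod L
        i = a % self.L
        self.valid[i], self.tag[i], self.data[i] = True, a // self.L, d

    def abstraction(self):
        # the abstract cache currently represented: the partial map {a -> d}
        return {self.tag[i] * self.L + i: self.data[i]
                for i in range(self.L) if self.valid[i]}
```

With 4 lines, addresses 5 and 9 both map to line 1, so writing 9 evicts 5; the abstraction shrinks accordingly, which is why eviction addresses have to be handled somewhere in the construction.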
In Sect. 8.2 we develop a formalism for dealing with i) atomic protocols, ii) hardware shared memory systems, and iii) simulations between them. It is the best formalism we can presently come up with; suggestions for improvement are welcome. If one aims at correctness proofs, there is no way to avoid this section (or an improved version of it) in the classroom.
Section 8.3 formulates, in the framework of Sect. 8.2, the classical theory of the atomic MOESI protocol together with some auxiliary technical results that are needed later. We have also enriched the standard MOESI protocol with a treatment of compare-and-swap (CAS) operations. We did this for two reasons: i) compare-and-swap operations are essential for the implementation of locks; thus, multi-core machines without such operations are of limited use, to put it mildly; ii) compare-and-swap is not a read followed by a conditional write; it is an atomic read followed by a conditional write, and this makes a large difference for the implementation.
A hardware-level implementation of the protocol for direct mapped caches is presented in Sect. 8.4. It has the obvious three parts: i) data paths, ii) control automata, and iii) bus arbiter. The construction of data paths and control automata is not exactly straightforward. Caches in the data paths generally consist of 2-port RAMs, because they have to serve their processor and participate in the snooping bus protocol at the same time. We provide each processor with two control automata: a master automaton processing the requests of its own processor and a slave automaton organizing the cache's response to the requests of other masters on the bus. The arbiter performs round robin scheduling of bus requests. One should sleep an extra hour at night before presenting this material in the classroom.
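The round robin policy of the arbiter can be sketched in a few lines (a hypothetical software model, not the book's circuit): among the pending requests, grant the first requester at or after the successor of the previously granted one, so that every persistent requester is eventually served.

```python
# Hypothetical model of round robin bus arbitration.
# req: list of request bits, one per master; last: index granted previously.

def round_robin_grant(req, last):
    """Return the index of the next master to be granted, or None if no requests."""
    n = len(req)
    for off in range(1, n + 1):
        i = (last + off) % n     # scan cyclically, starting after the last grant
        if req[i]:
            return i
    return None
```

A hardware arbiter computes the same priority function combinationally from the request lines and a register holding the last grant; the cyclic scan is what guarantees fairness.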
The correctness proof for the shared memory system is presented in Sect. 8.5. An outline of the proof is given at the start of the section. Roughly speaking, the proof contains the following kinds of arguments: i) showing that bus arbitration guarantees at any time that at most one master automaton is not idle, ii) showing the absence of bus contention (except on open collector buses), among other things by showing that during global transactions (involving more than one cache) master and participating slave automata stay “in sync”, iii) concluding that control signals and data “somehow corresponding to the atomic protocol” are exchanged via the buses, iv) abstracting memory accesses in the sense of Sect. 8.2 from the hardware computation and ordering them sequentially by their end cycle; it turns out that for accesses with identical end cycles it does not matter how we order them among each other, and v) showing (by induction on the end cycles of accesses) that the data exchanged via the buses are exactly the data exchanged by the atomic protocol, if it were run in the memory system configuration at the end cycle of the access. This establishes simulation and allows us to conclude that cache invariants are maintained in the hardware computation (because hardware simulates the atomic protocol, and there it was model-checked that the invariants are maintained).
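Step iv) of the outline can be pictured with a tiny sketch (illustrative names only): each abstracted access carries an end cycle, and the simulation orders accesses by end cycle, breaking ties arbitrarily (here by cache id), which by the observation above does not affect the result.

```python
# Illustrative sketch of sequentializing abstracted accesses by end cycle.
# accesses: list of (end_cycle, cache_id, op) tuples; ties in end_cycle are
# broken by cache_id, but any tie-breaking order would do.

def sequentialize(accesses):
    return sorted(accesses, key=lambda acc: (acc[0], acc[1]))
```

The induction in step v) then runs over this sequential order of end cycles.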
Many of the arguments of parts i) to iv) are tedious bookkeeping; in the classroom it suffices to state the corresponding lemmas and present only a few typical proofs. However, even in this preliminary bookkeeping phase of the proof, the order of arguments is of great importance: the absence of bus contention often hinges on the cache invariants. Part v) is not only hard; it also turns out to be highly dependent on properties of the particular cache protocol we are using. Thus, reinspection of the corresponding portions of the proof is necessary if one wants to establish shared memory correctness for a different protocol.