Caches and Shared Memory

  • Mikhail Kovalev
  • Silvia M. Müller
  • Wolfgang J. Paul
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9000)


In this chapter we implement a cache based shared memory system and prove that it is sequentially consistent. Sequential consistency means that i) answers to read accesses behave as if all accesses to the memory system were performed in some sequential order, and ii) this order is consistent with the local order of accesses [7]. Cache coherence is maintained by the classical MOESI protocol as introduced in [16]. That a sequentially consistent shared memory system can be built at the gate level is in a sense the fundamental result of multi-core computing. Evidence that it holds is overwhelming: such systems have been part of commercial multi-core processors for decades. Much to our surprise, when preparing the lectures for this chapter, we found in the open literature only one (undocumented) published gate level design of a cache based shared memory system [17]. Closely related to our subject, there is of course also an abundance of literature in the model checking community showing, for a great variety of cache protocols, that desirable invariants - including cache coherence - are maintained if accesses to the memory system are performed atomically at arbitrary caches in an arbitrary sequential order. In what follows we call this variety of protocols atomic protocols. For a survey of verification techniques for cache coherence protocols see [13]; for the model checking of the MOESI protocol we refer the reader to [4]. Atomic protocols and shared memory hardware differ in several important aspects:
  • Accesses to shared memory hardware are as often as possible performed in parallel. After all, the purpose of multi-core computing is gaining speed by parallelism. If memory accesses were sequential as in the atomic protocols, memory would be a sequential bottleneck.

  • Accesses to cache based hardware memory systems take one, two, or many more hardware cycles. Thus, they are certainly not performed in an atomic fashion.
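
The two conditions of sequential consistency can be made concrete with a small sketch (the encoding of accesses and all names are our own, not the chapter's formalism): given a candidate sequential order of all accesses, condition ii) requires that each processor's accesses appear in their local order, and condition i) requires that replaying the order against a serial memory reproduces every read's answer.

```python
def is_sequentially_consistent(order, num_procs):
    """order: list of accesses (proc, kind, addr, value, local_idx),
    kind 'r' or 'w'. Checks whether `order` witnesses sequential consistency."""
    # Condition ii): the order must be consistent with each processor's
    # local order of accesses, i.e. local indices appear increasing per proc.
    last_local = [-1] * num_procs
    for proc, kind, addr, value, local_idx in order:
        if local_idx <= last_local[proc]:
            return False
        last_local[proc] = local_idx
    # Condition i): replay against a serial memory; every read must return
    # the value of the latest preceding write to its address (0 if none).
    mem = {}
    for proc, kind, addr, value, local_idx in order:
        if kind == 'w':
            mem[addr] = value
        elif mem.get(addr, 0) != value:
            return False
    return True
```

For example, the order `[(0,'w',0,1,0), (1,'r',0,1,0), (0,'r',0,1,1)]` (processor 0 writes 1, processor 1 reads 1, processor 0 reads 1 back) satisfies both conditions, whereas a read that returns a value never written does not.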

Fortunately, we are able to use the model checked invariants literally as lemmas in the hardware correctness proof presented here, but very considerable extra proof effort is required to establish a simulation between the hardware computation and the atomic protocol. Once this simulation is established, sequential consistency of the hardware system follows easily, because the atomic computation is sequential to begin with.

In Sect. 8.1 we introduce what we call abstract caches and show that the common basic types of single caches (direct mapped, k-way set associative, fully associative) can be modelled as abstract caches. This will later permit us to simplify notation considerably. It also permits us to unify most of the theory of shared memory constructions for all basic cache types. However, our present definition of abstract caches does not yet include eviction addresses. The construction we present involves direct mapped caches, and we have to deal with eviction addresses below the abstract cache level. Adapting this small part of the construction and the corresponding arguments to other basic cache types should not be hard. Modifying the definition of abstract caches so that they can be used completely as a black box is still future work. In the classroom it suffices to show that direct mapped caches can be modelled as abstract caches.
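The idea of the abstraction can be sketched as follows (a minimal illustration with our own naming, not the chapter's construction): a direct mapped cache splits a line address into a tag and an index, and the abstract cache it realizes is simply the partial map from line addresses to line data formed by its valid entries. The eviction address mentioned above is the address of the line previously stored at the index being filled.

```python
class DirectMappedCache:
    """Illustrative direct mapped cache; num_lines would be a power of two."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.valid = [False] * num_lines
        self.tag = [0] * num_lines
        self.data = [None] * num_lines

    def _split(self, line_addr):
        # line address = tag * num_lines + index
        return line_addr // self.num_lines, line_addr % self.num_lines

    def hit(self, line_addr):
        t, i = self._split(line_addr)
        return self.valid[i] and self.tag[i] == t

    def read(self, line_addr):
        t, i = self._split(line_addr)
        assert self.hit(line_addr)
        return self.data[i]

    def fill(self, line_addr, line_data):
        # any line previously stored at this index is evicted
        t, i = self._split(line_addr)
        self.valid[i], self.tag[i], self.data[i] = True, t, line_data

    def abstraction(self):
        # the abstract cache: a partial map from line addresses to data
        return {self.tag[i] * self.num_lines + i: self.data[i]
                for i in range(self.num_lines) if self.valid[i]}
```

Filling line address 9 into a 4-line cache evicts a previously filled line address 5, since both map to index 1; the abstraction before and after the fill shows exactly the partial-map update.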

In Sect. 8.2 we develop a formalism for dealing with i) atomic protocols, ii) hardware shared memory systems, and iii) simulations between them. It is the best formalism we can presently come up with; suggestions for improvement are welcome. If one aims at correctness proofs, there is no way to avoid this section (or an improved version of it) in the classroom.

Section 8.3 formulates, in the framework of Sect. 8.2, the classical theory of the atomic MOESI protocol together with some auxiliary technical results that are needed later. We have also enriched the standard MOESI protocol with a treatment of compare-and-swap (CAS) operations, for two reasons: i) compare-and-swap operations are essential for the implementation of locks, so multi-core machines without such operations are, to put it mildly, of limited use; ii) compare-and-swap is not simply a read followed by a conditional write; the read and the conditional write are performed as one atomic operation, and this makes a large difference for the implementation.

A hardware-level implementation of the protocol for direct mapped caches is presented in Sect. 8.4. It has the obvious three parts: i) data paths, ii) control automata, and iii) bus arbiter. The construction of the data paths and control automata is not exactly straightforward. Caches in the data paths generally consist of general 2-port RAMs, because they have to be able to serve their processor and to participate in the snooping bus protocol at the same time. We provide each processor with two control automata: a master automaton processing requests of the processor, and a slave automaton organizing the cache's response to the requests of other masters on the bus. The arbiter performs round-robin scheduling of bus requests. One should sleep an extra hour at night before presenting this material in the classroom.
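The arbitration policy can be sketched in a few lines (our own formulation of round-robin, not the chapter's circuit): among the ports currently requesting the bus, the grant goes to the first requester found starting just after the previously granted port, so every requester is served within a bounded number of arbitration rounds.

```python
def round_robin_grant(requests, last_grant):
    """requests: list of booleans, one per bus master;
    last_grant: index of the port granted in the previous round.
    Returns the index of the newly granted port, or None if nobody requests."""
    n = len(requests)
    for offset in range(1, n + 1):
        i = (last_grant + offset) % n
        if requests[i]:
            return i
    return None
```

With requests `[True, False, True]` and last grant 0, port 2 wins; on the next round with the same requests, the grant wraps back to port 0, which is the fairness property round-robin provides.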

The correctness proof for the shared memory system is presented in Sect. 8.5. An outline of the proof is given at the start of the section. Roughly speaking, the proof contains the following kinds of arguments:

  • i) showing that bus arbitration guarantees that at any time at most one master automaton is not idle;

  • ii) showing the absence of bus contention (except on open collector buses), among other things by showing that during global transactions (involving more than one cache) the master and participating slave automata stay “in sync”;

  • iii) concluding that control signals and data “somehow corresponding to the atomic protocol” are exchanged via the buses;

  • iv) abstracting memory accesses in the sense of Sect. 8.2 from the hardware computation and ordering them sequentially by their end cycle; it turns out that accesses with identical end cycles can be ordered among each other arbitrarily;

  • v) showing (by induction on the end cycles of accesses) that the data exchanged via the buses are exactly the data that the atomic protocol would exchange if it were run in the memory system configuration at the end cycle of the access.

This establishes the simulation and allows us to conclude that the cache invariants are maintained in the hardware computation: the hardware simulates the atomic protocol, and for the atomic protocol it was model checked that the invariants are maintained.

Many of the arguments in parts i) to iv) are tedious bookkeeping; in the classroom it suffices to state the corresponding lemmas and to present only a few typical proofs. However, even in this preliminary bookkeeping phase of the proof the order of arguments is of great importance: the absence of bus contention often hinges on the cache invariants. Part v) is not only hard; it also turns out to be highly dependent on properties of the particular cache protocol we are using. Thus, reinspection of the corresponding portions of the proof is necessary if one wants to establish shared memory correctness for a different protocol.



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Mikhail Kovalev, Sirrix AG, Saarbrücken, Germany
  • Silvia M. Müller, IBM Germany Research and Development GmbH, Böblingen, Germany
  • Wolfgang J. Paul, Saarland University, Saarbrücken, Germany
