Implementing Application-Specific Cache-Coherence Protocols in Configurable Hardware

  • David Brooks
  • Margaret Martonosi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1602)


Streamlining communication is key to achieving good performance in shared-memory parallel programs. While full hardware support for cache coherence generally offers the best performance, not all parallel machines provide it. Instead, software layers using Shared Virtual Memory (SVM) can be built to enforce coherence at a higher level. In prior work, researchers have studied application-specific cache-coherence protocols implemented either in SVM systems or as handlers run by programmable protocol processors. Because these protocols are specialized to the needs of a single application, they can be particularly helpful in reducing the long latencies and processing overhead that sometimes degrade performance in SVM systems. This paper studies implementing application-specific protocols in hardware, but not via an instruction-based protocol processor as is typical. Instead, we consider configurable implementations based on Field-Programmable Gate Arrays (FPGAs). This approach can be faster than software-based techniques and less expensive than some hardware-based techniques. We study one application, appbt, in detail, including a VHDL-level design of the configurable protocol implementation. We sketch out approaches for other applications as well. Implementing protocol operations in configurable hardware improves communication performance by roughly 11X for a 32-node system. While overall speedups are a more modest 12%, our method is promising because of its flexibility and because it offers a new way of harnessing configurable hardware at the network interface, where it already exists or could be easily added to current systems.
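The gap between the roughly 11X communication improvement and the 12% overall speedup can be illustrated with Amdahl's law. The sketch below is not taken from the paper. It assumes that "12%" means an overall speedup factor of 1.12, that only communication time is accelerated, and that communication does not overlap with computation. The function names are hypothetical.

```python
def amdahl_speedup(fraction: float, component_speedup: float) -> float:
    """Overall speedup when `fraction` of baseline run time is sped up
    by `component_speedup` and the remainder is unchanged (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


def implied_fraction(component_speedup: float, overall_speedup: float) -> float:
    """Invert Amdahl's law: the fraction of baseline run time that must have
    been spent in the accelerated component to yield `overall_speedup`."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / component_speedup)


if __name__ == "__main__":
    # Assumed inputs from the abstract: 11X communication speedup, 1.12X overall.
    f = implied_fraction(11.0, 1.12)
    print(f"implied communication fraction of baseline run time: {f:.3f}")
    print(f"check, overall speedup from that fraction: {amdahl_speedup(f, 11.0):.3f}")
```

Under these assumptions, the two figures are consistent if communication accounts for a little under 12% of baseline run time, which is why a large communication speedup translates into a modest overall gain.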


Communication Time, Baseline System, Configurable Hardware, Protocol Processing, Correct Processor
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • David Brooks (1)
  • Margaret Martonosi (1)
  1. Dept. of Electrical Engineering, Princeton University
