Skip to main content
Log in

DEAM: Decoupled, Expressive, Area-Efficient Metadata Cache

  • Regular Paper
  • Published:
Journal of Computer Science and Technology Aims and scope Submit manuscript

Abstract

Chip multiprocessor presents brand new opportunities for holistic on-chip data and coherence management solutions. An intelligent protocol should be adaptive to the fine-grain accessing behavior. And in terms of storage of metadata, the size of conventional directory grows as the square of the number of processors, making it very expensive in large-scale systems. In this paper, we propose a metadata cache framework to achieve three goals: 1) reducing the latency of data access and coherence activities, 2) saving the storage of metadata, and 3) providing support for other optimization techniques. The metadata is implemented with compact structures and tracks the dynamically changing access pattern. The pattern information is used to guide the delegation and replication of decoupled data and metadata to allow fast access. We also use our metadata cache as a building block to enhance stream prefetching. Using detailed execution-driven simulation, we demonstrate that our protocol achieves an average speedup of 1.12X compared with a shared cache protocol with 1/5 of the storage of metadata.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Conway P, Kalyanasundharam N, Donley G, Lepak K, Hughes B. Cache hierarchy and memory subsystem of the AMD Opteron processor. IEEE Micro, 2010, 30(2): 16-29.

    Article  Google Scholar 

  2. Kalla R, Sinharoy B, Starke W, Floyd M. Power7: IBM’s next-generation server processor. IEEE Micro, 2010, 30(2): 7-15.

    Article  Google Scholar 

  3. Shah M, Barren J, Brooks J et al. UltraSPARC T2: A highly-treaded, power-efficient, SPARC SOC. In Proc. IEEE Asian Solid-State Circuits Conf., Nov. 2007, pp.22-25.

  4. Mcnairy C, Bhatia R. Montecito: A dual-core, dual-thread Itanium processor. IEEE Micro, 2005, 25(2): 10-20.

    Article  Google Scholar 

  5. Keltcher C N, Mcgrath K J, Ahmed A, Conway P. The AMD Opteron processor for multiprocessor servers. IEEE Micro, 2003, 23(2): 66-76.

    Article  Google Scholar 

  6. Barroso L A, Gharachorloo K, Mcnamara R, Nowatzyk A, Qadeer S, Sano B, Smith S, Stets R, Verghese B. Piranha: A scalable architecture based on single-chip multiprocessing. In Proc. the 27th Int. Symp. Comp. Arch., Jun. 2000, pp.282-293.

  7. Tendler J M, Dodson J S, Fields J S, Le H, Sinharoy B. POWER4 system microarchitecture. IBM Journal of Research and Development, 2002, 46(1): 5-25.

    Article  Google Scholar 

  8. Agarwal V, Hrishikesh M S, Keckler S W, Burger D. Clock rate versus IPC: The end of the road for conventional microarchitectures. In Proc. the 27th Int. Symp. Comp. Arch., Jun. 2000, pp.248-259.

  9. Ho R, Mai K W, Horowitz M A. The future of wires. Proceedings of the IEEE, 2001, 89(4): 490-504.

    Article  Google Scholar 

  10. Zhang M, Asanović K. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. In Proc. the 32nd Int. Symp. Comp. Arch., Jun. 2005, pp.336-345.

  11. Huh J, Kim C, Shafi H, Zhang L, Burger D, Keckler S W. A NUCA substrate for flexible CMP cache sharing. IEEE Trans. Parallel and Distributed Systems, 2007, 18(8): 1028-1040.

    Article  Google Scholar 

  12. Chishti Z, Powell M D, Vijaykumar T N. Optimizing replication, communication, and capacity allocation in CMPs. In Proc. the 32nd Int. Symp. Comp. Arch., June 2005, pp.357-368.

  13. Hossain H, Dwarkadas S, Huang M C. DDCache: Decoupled and delegable cache data and metadata. In Proc. the 18th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2009, pp.227-236.

  14. Marty M R, Hill M D. Virtual hierarchies to support server consolidation. In Proc. the 34th Int. Symp. Comp. Arch., Jun. 2007, pp.46-56.

  15. Censier L M, Feautrier P. A new solution to coherence problems in multicache systems. IEEE Trans. Computers, 1978, 27(12): 1112-1118.

    Article  MATH  Google Scholar 

  16. Zebchuk J, Moshovos A. RegionTracker: A case for dualgrain tracking in the memory system. Technical Report, University of Toronto, 2006.

  17. Zebchuk J, Safi E, Moshovos A. A framework for coarsegrain optimizations in the on-chip memory hierarchy. In Proc. the 40th Int. Symp. Microarch., Dec. 2007, pp.314-327.

  18. Bowen N S, Pradham D K. Processor- and memory-based checkpoint and rollback recovery. Computer, 1993, 26(2): 22-31.

    Article  Google Scholar 

  19. Xue J, Garg A, Ciftcioglu B et al. An intra-chip free-space optical interconnect: Extended technical report. Technical Report, Dept. Electrical & Computer Engineering, University of Rochester, 2010.

  20. Compaq Computer Corporation. Alpha 21264/EV67 microprocessor hardware reference manual. Sept. 2000. http://download.majix.org/dec/21264ev67_hrm.pdf, May 2014.

  21. Burger D, Austin T M. The SimpleScalar tool set, version 2.0. ACM SIGARCH Computer Architecture News, 1997, 25(3): 13-25.

  22. Sheng L, Ahn J, Strong R D, Brockman J B, Tullsen D M, Jouppi N P. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proc. the 42nd Int. Symp. Microarch., Dec. 2009, pp.469-480.

  23. Woo S, Ohara M, Torrie E, Singh J, Gupta A. The SPLASH-2 programs: Characterization and methodological considerations. In Proc. the 22nd Int. Symp. Comp. Arch., Jun. 1995, pp.24-36.

  24. Bienia C, Kumar S, Singh J, Li K. The PARSEC bench mark suite: Characterization and architectural implications. In Proc. the 17th Int. Conf. Parallel Arch. and Compilation Techniques, Oct. 2008, pp.72-81.

  25. Zhang M, Asanovic K. Victim migration: Dynamically adapting between private and shared CMP caches. Technical Report, MIT-CSAIL, Oct. 2005.

  26. Jouppi N P. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proc. the 17th Int. Symp. Comp. Arch., May 1990, pp.364-373.

  27. Ganusov I, Burtscher M. On the importance of optimizing the configuration of stream prefetchers. In Proc. the 2005 Workshop on Memory System Performance, Jun. 2005, pp.54-61.

  28. Ros A, Acacio M E, Garcia J M. Direct coherence: Bringing together performance and scalability in shared-memory multiprocessors. In Proc. the 14th Int. Conf. High Performance Computing, Dec. 2007, pp.147-160.

  29. Ros A, Acacio M E, Garcia J M. DiCo-CMP: Efficient cache coherency in tiled CMP architectures. In Proc. Int. Symp. Parallel and Distributed Processing, Apr. 2008.

  30. Hossain H, Dwarkadas S, Huang M C. Improving support for locality and fine-grain sharing in chip multiprocessors. In Proc. the 17th Int. Conf. Parallel Arch. and Compilation Techniques, Oct. 2008, pp.155-165.

  31. Ros A, Kaxiras S. Complexity-effective multicore coherence. In Proc. the 21st Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2012, pp.241-252.

  32. Pugsley S H, Spjut J B, Nellans D W, Balasubramonian R. SWEL: Hardware cache coherence protocols to map shared data onto shared caches. In Proc. the 19th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2010, pp.465-476.

  33. Khan O, Hoffmann H, Lis M, Hijaz F, Agarwal A, Devadas S. ARCc: A case for an architecturally redundant cache-coherence architecture for large multicores. In Proc. the 29th IEEE Int. Conf. Computer Design, Oct. 2011, pp.411-418.

  34. Beckmann B M, Marty M R, Wood D A. ASR: Adaptive selective replication for CMP caches. In Proc. the 39th Int. Symp. Microarch., Dec. 2006, pp.443-454.

  35. Hardavellas N, Ferdman M, Falsafi B, Ailamaki A. Reactive NUCA: Near-optimal block placement and replication in distributed caches. In Proc. the 36th Int. Symp. Comp. Arch., Jun. 2009, pp.184-195.

  36. Agarwal A, Simoni R, Hennessy J et al. An evaluation of directory schemes for cache coherence. In Proc. the 15th Int. Symp. Comp. Arch., May 30-June 2, 1988, pp.280-289.

  37. Chaiken D, Kubiatowicz J, Agarwal A. LimitLESS directories: A scalable cache coherence scheme. In Proc. the 4th Int. Conf. Arch. Support for Prog. Lang. and Operating Systems, Apr. 1991, pp.224-234.

  38. Maa Y, Pradhan D, Thiebaut D. Two economical directory schemes for large-scale cache coherent multiprocessors. ACM SIGARCH Computer Architecture News, 1991, 19(5): 10-18.

    Article  Google Scholar 

  39. Gupta A, Weber W, Mowry T. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In Proc. Int. Conf. Parallel Processing, Aug. 1990, pp.312-321.

  40. Sanchez D, Kozyrakis C. SCD: A scalable coherence directory with flexible sharer set encoding. In Proc. the 18th IEEE Int. Symp. High-Perf. Comp. Arch., Feb. 2012, pp.1-12.

  41. Zhao H, Shriraman A, Dwarkadas S. SPACE: Sharing pattern-based directory coherence for multicore scalability. In Proc. the 19th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2010, pp.135-146.

  42. Zhao H, Shriraman A, Dwarkadas S, Srinivasan V. SPATL: Honey, I shrunk the coherence directory. In Proc. Int. Conf. Parallel Arch. and Compilation Techniques, Oct. 2011, pp.33-44.

  43. Choi J, Park K. Segment directory enhancing the limited directory cache coherence schemes. In Proc. the 10th Int. Parallel and Distributed Processing Symp., Apr. 1999, pp.258-267.

  44. Cuesta B, Ros A, Gómez M, Robles A, Duato J. Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. In Proc. Int. Symp. Comp. Arch., Jun. 2011, pp.93-104.

  45. Alisafaee M. Spatiotemporal coherence tracking. In Proc. the 45th Int. Symp. Microarch., Dec. 2012, pp.341-350.

  46. Fang L, Liu P, Hu Q, Huang M C, Jiang G F. Building expressive, area-efficient coherence directories. In Proc. the 22nd Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 2013, pp.299-308.

  47. Ebrahimi E, Mutlu O, Chang J L, Patt Y N. Coordinated control of multiple prefetchers in multi-core systems. In Proc. the 42nd Int. Symp. Microarch., Dec. 2009, pp.316-326.

  48. Dahlgren F, Dubois M, Stenstrom P. Fixed and adaptive sequential prefetching in shared memory multiprocessors. In Proc. Int. Conf. Parallel Processing, Aug. 1993, pp.56-63.

  49. Nesbit K J, Dhodapkar A S, Smith J E. AC/DC: An adaptive data cache prefetcher. In Proc. the 13th Int. Conf. Parallel Arch. and Compilation Techniques, Sept. 29-Oct. 3, 2004, pp.135-145.

  50. Gendler A, Mendelson A, Birk Y. A PAB-based multi-prefetcher mechanism. International Journal of Parallel Programming, 2006, 34(2): 171-188.

    Article  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Peng Liu.

Additional information

The work is supported by the Joint Research Fund for Overseas Chinese Scholars and Scholars in Hong Kong and Macao of the National Natural Science Foundation of China under Grant No. 61028004, and the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2014A08.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1

(PDF 229 kb)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Liu, P., Fang, L. & Huang, M.C. DEAM: Decoupled, Expressive, Area-Efficient Metadata Cache. J. Comput. Sci. Technol. 29, 679–691 (2014). https://doi.org/10.1007/s11390-014-1459-0

Download citation

  • Received:

  • Revised:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11390-014-1459-0

Keywords

Navigation