DEAM: Decoupled, Expressive, Area-Efficient Metadata Cache

Liu, Peng; Fang, Lei; Huang, Michael C.

doi:10.1007/s11390-014-1459-0

Regular Paper
Published: 04 July 2014

Journal of Computer Science and Technology, Volume 29, pages 679–691 (2014)

Peng Liu¹,², Lei Fang¹ & Michael C. Huang³

Abstract

Chip multiprocessors present new opportunities for holistic on-chip data and coherence management. An intelligent protocol should adapt to fine-grain access behavior. In terms of metadata storage, the size of a conventional directory grows as the square of the number of processors, making it very expensive in large-scale systems. In this paper, we propose a metadata cache framework to achieve three goals: 1) reducing the latency of data access and coherence activities, 2) reducing the storage needed for metadata, and 3) providing support for other optimization techniques. The metadata is implemented with compact structures and tracks the dynamically changing access pattern. The pattern information guides the delegation and replication of decoupled data and metadata to allow fast access. We also use our metadata cache as a building block to enhance stream prefetching. Using detailed execution-driven simulation, we demonstrate that our protocol achieves an average speedup of 1.12x over a shared-cache protocol while using 1/5 of the metadata storage.
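
The quadratic growth of a conventional directory mentioned in the abstract can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes a full-map (bit-vector) directory with one slice per core, one presence bit per core in every entry, and a hypothetical slice size of 16K blocks; state bits and tags are ignored. The function name and parameters are illustrative only.

```python
def full_map_directory_bits(num_cores: int, blocks_per_slice: int) -> int:
    """Total sharer-vector bits for a full-map (bit-vector) directory.

    Assumption: one directory slice per core, each tracking
    blocks_per_slice blocks, with one presence bit per core in every
    entry. State bits and tags are left out of the count.
    """
    bits_per_entry = num_cores               # one presence bit per core
    entries = num_cores * blocks_per_slice   # one slice per core
    return entries * bits_per_entry          # grows as num_cores squared


if __name__ == "__main__":
    blocks = 16 * 1024  # hypothetical: a 1 MB slice of 64-byte blocks
    for cores in (16, 32, 64):
        kib = full_map_directory_bits(cores, blocks) / 8 / 1024
        print(f"{cores:3d} cores: {kib:8.0f} KiB of sharer vectors")
```

Under these assumptions, doubling the core count quadruples the sharer-vector storage, which is the scaling cost that compact metadata structures aim to avoid.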

Author information

Authors and Affiliations

Department of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, 310027, China
Peng Liu & Lei Fang
State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, 214125, China
Peng Liu
Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY, 14627-0231, U.S.A.
Michael C. Huang

Corresponding author

Correspondence to Peng Liu.

Additional information

The work is supported by the Joint Research Fund for Overseas Chinese Scholars and Scholars in Hong Kong and Macao of the National Natural Science Foundation of China under Grant No. 61028004, and the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2014A08.

Electronic supplementary material

ESM 1 (PDF, 229 KB)

About this article

Cite this article

Liu, P., Fang, L. & Huang, M.C. DEAM: Decoupled, Expressive, Area-Efficient Metadata Cache. J. Comput. Sci. Technol. 29, 679–691 (2014). https://doi.org/10.1007/s11390-014-1459-0

Received: 03 March 2014
Revised: 12 May 2014
Published: 04 July 2014
Issue Date: July 2014
DOI: https://doi.org/10.1007/s11390-014-1459-0
