BACH: A Bandwidth-Aware Hybrid Cache Hierarchy Design with Nonvolatile Memories

Zhao, Jishen; Xu, Cong; Zhang, Tao; Xie, Yuan

doi:10.1007/s11390-016-1609-7

Regular Paper
Published: 08 January 2016

Journal of Computer Science and Technology, Volume 31, pages 20–35 (2016)

Jishen Zhao¹,
Cong Xu²,
Tao Zhang³ &
Yuan Xie⁴

Abstract

Limited main memory bandwidth is becoming a fundamental performance bottleneck in chip multiprocessor (CMP) design, yet directly increasing the peak memory bandwidth can incur high cost and power consumption. In this paper, we address this problem by proposing BACH, a bandwidth-aware reconfigurable cache hierarchy built with hybrid memory technologies. The BACH design comprises a hybrid cache hierarchy, a reconfiguration mechanism, and a statistical prediction engine. The hybrid cache hierarchy configures each level with a memory technology chosen from options with different bandwidth characteristics, such as spin-transfer torque memory (STT-MRAM), resistive memory (ReRAM), and embedded DRAM (eDRAM), so that the peak bandwidth of the overall cache hierarchy is optimized. The reconfiguration mechanism dynamically adjusts the cache capacity of each level based on the bandwidth demands of running workloads, as predicted by the prediction engine. We evaluate the system performance gain of the BACH design with a set of multithreaded and multiprogrammed workloads, both with and without a system power budget limit. Compared with a traditional SRAM-based cache design, BACH improves system throughput by 58% and 14% with multithreaded and multiprogrammed workloads, respectively.
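
The interplay between predicting bandwidth demand and reconfiguring cache capacity can be illustrated with a minimal sketch. This is not the BACH mechanism itself: the abstract does not specify the predictor or the reconfiguration policy, so the exponentially weighted moving average, the `CacheConfig` fields, the selection rule, and every number below are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical cache configuration; fields and values are illustrative."""
    name: str
    peak_bandwidth_gbps: float
    capacity_mb: int


class EwmaBandwidthPredictor:
    """Assumed stand-in for a prediction engine: an exponentially weighted
    moving average over per-interval bandwidth samples."""

    def __init__(self, alpha: float = 0.5) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.estimate: Optional[float] = None

    def update(self, observed_gbps: float) -> float:
        """Fold one observed sample into the estimate and return it."""
        if self.estimate is None:
            self.estimate = observed_gbps
        else:
            self.estimate = (self.alpha * observed_gbps
                             + (1.0 - self.alpha) * self.estimate)
        return self.estimate


def choose_config(predicted_gbps: float,
                  configs: List[CacheConfig]) -> CacheConfig:
    """Assumed policy: among configurations whose peak bandwidth covers the
    predicted demand, prefer the largest capacity; if none covers it, fall
    back to the configuration with the highest peak bandwidth."""
    if not configs:
        raise ValueError("configs must not be empty")
    feasible = [c for c in configs if c.peak_bandwidth_gbps >= predicted_gbps]
    if feasible:
        return max(feasible, key=lambda c: c.capacity_mb)
    return max(configs, key=lambda c: c.peak_bandwidth_gbps)


# Made-up configurations that trade capacity for peak bandwidth.
EXAMPLE_CONFIGS = [
    CacheConfig("small-fast", peak_bandwidth_gbps=40.0, capacity_mb=2),
    CacheConfig("medium", peak_bandwidth_gbps=20.0, capacity_mb=8),
    CacheConfig("large-slow", peak_bandwidth_gbps=8.0, capacity_mb=32),
]

if __name__ == "__main__":
    predictor = EwmaBandwidthPredictor(alpha=0.5)
    for sample in (10.0, 20.0, 30.0):  # synthetic bandwidth samples in GB/s
        demand = predictor.update(sample)
        print(sample, demand, choose_config(demand, EXAMPLE_CONFIGS).name)
```

The sketch only shows the control loop implied by the abstract (observe, predict, reconfigure); a real design would also have to account for reconfiguration cost and the power budget, which are omitted here.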

Author information

Authors and Affiliations

Department of Computer Engineering, University of California at Santa Cruz, Santa Cruz, CA, 95064, U.S.A.
Jishen Zhao
Hewlett-Packard Labs, Palo Alto, CA, 94304, U.S.A.
Cong Xu
NVIDIA Corporation, Santa Clara, CA, 95050, U.S.A.
Tao Zhang
Department of Electrical and Computer Engineering, University of California at Santa Barbara, Santa Barbara, CA, 93106, U.S.A.
Yuan Xie

Corresponding author

Correspondence to Jishen Zhao.

About this article

Cite this article

Zhao, J., Xu, C., Zhang, T. et al. BACH: A Bandwidth-Aware Hybrid Cache Hierarchy Design with Nonvolatile Memories. J. Comput. Sci. Technol. 31, 20–35 (2016). https://doi.org/10.1007/s11390-016-1609-7

Received: 08 September 2015
Revised: 10 December 2015
Published: 08 January 2016
Issue Date: January 2016
DOI: https://doi.org/10.1007/s11390-016-1609-7
