
Frontiers of Computer Science, Volume 12, Issue 6, pp 1090–1104

HSCS: a hybrid shared cache scheduling scheme for multiprogrammed workloads

  • Jingyu Zhang
  • Chentao Wu
  • Dingyu Yang
  • Yuanyi Chen
  • Xiaodong Meng
  • Liting Xu
  • Minyi Guo
Research Article

Abstract

The traditional dynamic random-access memory (DRAM) storage medium can be integrated on chip via modern 3D-stacking technology to architect a DRAM shared cache in multicore systems. Compared with static random-access memory (SRAM), DRAM offers larger capacity but slower access. Much existing work has been devoted to improving workload performance by using SRAM and stacked DRAM together in shared cache systems, ranging from improving the SRAM structure to optimizing cache tags and data access. However, little attention has been paid to designing a shared cache scheduling scheme for multiprogrammed workloads with different memory footprints in multicore systems. Motivated by this, we propose a hybrid shared cache scheduling scheme (HSCS) that allows a multicore system to utilize SRAM and 3D-stacked DRAM efficiently, thus achieving better workload performance. The scheme employs (1) a cache monitor, which collects cache statistics; (2) a cache evaluator, which evaluates the cache information while programs execute; and (3) a cache switcher, which self-adaptively chooses between the SRAM and DRAM shared cache modules. A cache data migration policy is also developed to guarantee that the scheduling scheme works correctly. Extensive experiments show that our method improves multiprogrammed workload performance by up to 25% compared with state-of-the-art methods (including conventional and DRAM cache systems).
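The monitor/evaluator/switcher pipeline described above can be illustrated with a minimal sketch. Note that the class names, the miss-rate statistic, and the switching threshold here are illustrative assumptions for exposition; the paper's actual statistics, evaluation criteria, and parameters may differ.

```python
# Hypothetical sketch of the monitor -> evaluator -> switcher loop from the
# abstract. Thresholds and statistics are illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class CacheStats:
    accesses: int = 0
    misses: int = 0

    @property
    def miss_rate(self) -> float:
        return self.misses / self.accesses if self.accesses else 0.0


class CacheMonitor:
    """Collects per-interval cache statistics for a running program."""
    def __init__(self):
        self.stats = CacheStats()

    def record(self, hit: bool) -> None:
        self.stats.accesses += 1
        if not hit:
            self.stats.misses += 1


class CacheEvaluator:
    """Evaluates collected statistics against a miss-rate threshold."""
    def __init__(self, threshold: float = 0.3):
        self.threshold = threshold

    def prefers_dram(self, stats: CacheStats) -> bool:
        # A high miss rate suggests the working set exceeds the small,
        # fast SRAM cache, so the larger stacked-DRAM cache may win.
        return stats.miss_rate > self.threshold


class CacheSwitcher:
    """Self-adaptively selects the SRAM or DRAM shared-cache module."""
    def __init__(self):
        self.active = "SRAM"

    def update(self, use_dram: bool) -> str:
        target = "DRAM" if use_dram else "SRAM"
        if target != self.active:
            # A real system would migrate cached data here (the paper's
            # cache data migration policy) before switching modules.
            self.active = target
        return self.active


# One scheduling interval: a workload that misses on every other access.
monitor, evaluator, switcher = CacheMonitor(), CacheEvaluator(), CacheSwitcher()
for i in range(100):
    monitor.record(hit=(i % 2 == 0))   # 50 hits, 50 misses
choice = switcher.update(evaluator.prefers_dram(monitor.stats))
print(choice)
```

In this toy interval the 50% miss rate exceeds the assumed threshold, so the switcher selects the DRAM module; a memory-light workload would stay on SRAM.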

Keywords

multicore system, shared cache, workload performance



Acknowledgements

We would like to thank the editors and anonymous reviewers for their careful work and instructive suggestions. We also thank Dr. Zhi-Jie Wang for his warm help and advice. This work was supported by the National Basic Research Program of China (2015CB352403), the National Natural Science Foundation of China (Grant Nos. 61261160502, 61272099, 61303012, 61572323, and 61628208), the Scientific Innovation Act of STCSM (13511504200), the EU FP7 CLIMBER project (PIRSES-GA-2012-318939), and the CCF-Tencent Open Fund.

Supplementary material

11704_2017_6349_MOESM1_ESM.ppt (188 KB)


Copyright information

© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
  2. Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation, School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, China
  3. School of Electronics and Information, Shanghai Dianji University, Shanghai, China
