The Journal of Supercomputing, Volume 74, Issue 4, pp 1609–1635

Data-type specific cache compression in GPGPUs


Abstract

In this paper, we evaluate the compressibility of the L1 data caches and the L2 cache in general-purpose graphics processing units (GPGPUs). Our proposed scheme is geared toward improving the performance and power of GPGPUs through cache compression. GPGPUs are throughput-oriented devices that execute thousands of threads simultaneously. To handle the working sets of this massive number of threads, modern GPGPUs employ several levels of caches, and design trends show that cache sizes continue to grow to support even more thread-level parallelism. We propose using cache compression to increase effective cache capacity, improve performance, and reduce power consumption in GPGPUs. Our work is motivated by the observation that the values within a cache block tend to be similar, i.e., the arithmetic difference between two successive values in a block is small. To reduce data redundancy in the L1 data caches and the L2 cache, we use the low-cost, implementation-efficient base-delta-immediate (BDI) algorithm, which replaces a cache block with a base and an array of deltas whose combined size is smaller than the original block. We also study the locality of fields in integer and floating-point numbers and find that the entropy of these fields varies across data types. Based on this entropy, we offer different BDI compression schemes for integer and floating-point numbers, and we augment our scheme with a simple yet effective predictor that determines the type of values dynamically in hardware, without the help of a compiler or programmer. Evaluation results show that, on average, cache compression improves performance by 8% and reduces cache energy by 9%.
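To make the BDI encoding concrete, the sketch below shows one BDI configuration in C: a 4-byte base with 1-byte deltas over a 32-byte block. The block geometry, delta width, and function names are illustrative assumptions rather than the paper's hardware design; actual BDI hardware evaluates several base/delta geometries in parallel and keeps the smallest valid encoding, and the paper's data-type-specific variants further tailor these choices to the entropy of integer and floating-point fields.

    /* Minimal, illustrative sketch of one BDI configuration:
     * a 4-byte base plus 1-byte deltas over a 32-byte block.
     * Names and geometry are assumptions for illustration only. */
    #include <stdint.h>
    #include <stdio.h>

    #define WORDS_PER_BLOCK 8   /* 32-byte block of 4-byte words (assumed) */

    /* Try to encode a block as one 4-byte base plus 1-byte deltas.
     * Returns 1 on success (12 bytes instead of 32), 0 if any delta
     * falls outside the signed 8-bit range. */
    int bdi_compress(const int32_t block[WORDS_PER_BLOCK],
                     int32_t *base, int8_t deltas[WORDS_PER_BLOCK])
    {
        *base = block[0];                     /* first word serves as the base */
        for (int i = 0; i < WORDS_PER_BLOCK; i++) {
            int64_t d = (int64_t)block[i] - (int64_t)*base;
            if (d < INT8_MIN || d > INT8_MAX) /* delta too wide: incompressible */
                return 0;
            deltas[i] = (int8_t)d;
        }
        return 1;
    }

    /* Decompression is a single add per word, which keeps latency low. */
    void bdi_decompress(int32_t base, const int8_t deltas[WORDS_PER_BLOCK],
                        int32_t block[WORDS_PER_BLOCK])
    {
        for (int i = 0; i < WORDS_PER_BLOCK; i++)
            block[i] = base + deltas[i];
    }

    int main(void)
    {
        /* Successive values in a block are often close to one another
         * (e.g. array indices or pointers), which is what BDI exploits. */
        int32_t block[WORDS_PER_BLOCK] =
            {0x1000, 0x1004, 0x1008, 0x100C, 0x1010, 0x1014, 0x1018, 0x101C};
        int32_t base;
        int8_t deltas[WORDS_PER_BLOCK];

        if (bdi_compress(block, &base, deltas))
            printf("compressed 32 bytes down to %zu bytes\n",
                   sizeof base + sizeof deltas); /* 4-byte base + 8 deltas = 12 */
        return 0;
    }

For the example block above, the encoding shrinks 32 bytes to 12 bytes, plus a few metadata bits identifying the chosen geometry.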

Keywords

GPGPU · Cache compression · Data-type predictor · Performance · Energy

Notes

Acknowledgements

This work was supported by the Natural Sciences and Engineering Research Council of Canada.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Authors and Affiliations

Electrical Engineering Department, Lakehead University, Thunder Bay, Canada
