A Study on L1 Data Cache Bypassing Methods for High-Performance GPUs

  • Cong Thuan Do
  • Min Goo Moon
  • Jong Myon Kim
  • Cheol Hong Kim
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 931)


Graphics Processing Units (GPUs) with massively parallel architectures have been widely used to boost the performance of both graphics and general-purpose programs. General-purpose GPUs (GPGPUs) have become one of the most attractive platforms for exploiting plentiful thread-level parallelism. In recent GPUs, cache hierarchies have been employed to handle applications with irregular memory access patterns. Unfortunately, GPU caches exhibit poor efficiency because of performance challenges such as cache contention and resource congestion, both caused by the large number of active threads in GPUs. Cache bypassing can reduce the impact of cache contention and resource congestion. In this paper, we introduce a new cache bypassing technique that makes effective bypassing decisions. In particular, the proposed mechanism employs a small memory, which can be accessed before the actual cache access, to record the tag information of the L1 data cache. Using this information, the mechanism knows the status of the L1 data cache and uses it as a bypassing hint, bringing the cache bypassing decision close to optimal. Our experimental results on a modern GPU platform show that the proposed cache bypassing technique improves IPC by up to 10.4% on average.
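The abstract does not give the decision rule, so the following is only a minimal sketch of the general idea: a small tag store is probed before the L1 data cache, and its contents serve as a bypassing hint. The cache geometry, the `TagStore` class, and the reuse-bit policy (bypass when the LRU line of a full set has been reused, otherwise replace it) are all hypothetical choices made for illustration, not the mechanism evaluated in the paper.

```python
from collections import OrderedDict

# Assumed geometry, kept tiny for illustration (not taken from the paper).
LINE_SIZE = 128  # bytes per cache line
NUM_SETS = 4
NUM_WAYS = 2


class TagStore:
    """Hypothetical small memory mirroring the L1 data cache tags.

    It is probed before the cache itself, and the probe result is used
    as a hint for the bypass decision.
    """

    def __init__(self, num_sets=NUM_SETS, num_ways=NUM_WAYS):
        self.num_ways = num_ways
        self.num_sets = num_sets
        # One OrderedDict per set: tag -> reused bit, ordered LRU to MRU.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def split(self, addr):
        """Split a byte address into (set index, tag)."""
        line = addr // LINE_SIZE
        return line % self.num_sets, line // self.num_sets

    def access(self, addr):
        """Return 'hit', 'fill', or 'bypass' for one load address."""
        index, tag = self.split(addr)
        tags = self.sets[index]
        if tag in tags:
            # Line is resident: mark it reused and make it MRU.
            tags[tag] = True
            tags.move_to_end(tag)
            return "hit"
        if len(tags) < self.num_ways:
            # Predicted miss, but the set has a free way: allocate.
            tags[tag] = False
            return "fill"
        lru_tag, lru_reused = next(iter(tags.items()))
        if lru_reused:
            # Assumed policy: protect a reused line once, then age it so
            # that it can be replaced by a later miss.
            tags[lru_tag] = False
            return "bypass"
        # LRU line showed no reuse: evict it and allocate the new line.
        tags.popitem(last=False)
        tags[tag] = False
        return "fill"


if __name__ == "__main__":
    store = TagStore()
    # These addresses all map to set 0 under the assumed geometry.
    for addr in (0, 512, 0, 512, 1024, 1024, 0):
        print(hex(addr), store.access(addr))
```

In this sketch a request that returns `bypass` would be sent to the next memory level without allocating an L1 line, so lines that have shown reuse are not evicted by streaming accesses. A real design would also have to keep the tag store coherent with the actual cache contents, which the sketch sidesteps by treating the tag store as the only source of truth.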


Keywords: GPU, CPU, Bypassing, Cache, Performance



This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1A2B6005740), and by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-2016-0-00314) supervised by the IITP (Institute for Information & communications Technology Promotion).



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Cong Thuan Do (1)
  • Min Goo Moon (2)
  • Jong Myon Kim (3)
  • Cheol Hong Kim (2)

  1. Department of Computer Science, Korea University, Seoul, Korea
  2. School of Electronics and Computer Engineering, Chonnam National University, Gwangju, Korea
  3. School of Electrical Engineering, University of Ulsan, Ulsan, Korea
