Dynamic Selective Warp Scheduling for GPUs Using L1 Data Cache Locality Information

  • Gwang Bok Kim
  • Jong Myon Kim
  • Cheol Hong Kim
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 931)


The warp scheduling policy of a GPU has a significant impact on performance, since the order in which warps are executed determines the degree of data cache locality. Greedy warp scheduling policies such as greedy-then-oldest (GTO) outperform fair scheduling policies such as loose round-robin (LRR) for numerous applications. However, locality across multiple warps is underutilized when GTO is adopted, which degrades overall performance. In this paper, we propose a dynamic selective warp scheduling technique that exploits the data locality of the workload. Inter-warp locality and intra-warp locality are determined from the access history of the L1 data cache. By adjusting the scheduling policy dynamically, the proposed technique significantly improves performance and cache efficiency compared with LRR and GTO. According to our experimental results, it improves IPC by 19% and 3.8% over LRR and GTO, respectively.
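The idea of classifying locality from L1 data cache access history can be illustrated with a small software model. The following is only a minimal sketch under stated assumptions, not the hardware mechanism of the paper: the class name `LocalityMonitor`, the LRU replacement, the per-line "last warp" tag, and the rule that maps dominant inter-warp locality to LRR and dominant intra-warp locality to GTO are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class LocalityMonitor:
    """Toy LRU cache model (hypothetical) that tags each line with the
    last warp that touched it, so hits can be split into intra-warp
    and inter-warp reuse."""

    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # line address -> id of last accessing warp
        self.intra_hits = 0         # hits by the same warp that last used the line
        self.inter_hits = 0         # hits by a different warp

    def access(self, warp_id, line_addr):
        """Record one L1 access and classify it if it hits."""
        if line_addr in self.lines:
            if self.lines[line_addr] == warp_id:
                self.intra_hits += 1
            else:
                self.inter_hits += 1
            self.lines.move_to_end(line_addr)   # mark as most recently used
        elif len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)      # evict the least recently used line
        self.lines[line_addr] = warp_id         # update the last-warp tag

    def choose_policy(self):
        """Pick a policy for the next interval and reset the counters.

        Assumption: inter-warp reuse favors round-robin (LRR), while
        intra-warp reuse favors greedy scheduling (GTO)."""
        policy = "LRR" if self.inter_hits > self.intra_hits else "GTO"
        self.intra_hits = 0
        self.inter_hits = 0
        return policy
```

In this sketch, a scheduler would call `access` for every L1 data cache request during a sampling interval and call `choose_policy` at the end of the interval; the interval length and the tie-breaking toward GTO are likewise assumptions.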


GPU, Warp scheduling, Cache, Data locality, Access history



This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF2018R1A2B6005740).



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Gwang Bok Kim (1)
  • Jong Myon Kim (2)
  • Cheol Hong Kim (1, corresponding author)
  1. School of Electronics and Computer Engineering, Chonnam National University, Gwangju, Korea
  2. School of Electronical Engineering, University of Ulsan, Ulsan, Korea
