International Journal of Parallel Programming, Volume 44, Issue 1, pp 109–129

A Credit-Based Load-Balance-Aware CTA Scheduling Optimization Scheme in GPGPU



Abstract

GPGPUs improve computing performance through massive parallelism. The cooperative-thread-array (CTA) schedulers employed by current GPGPUs greedily issue CTAs to GPU cores as soon as resources become available, in pursuit of higher thread-level parallelism. Because the memory controller favors locality, CTA execution time varies across cores, which leads to a load imbalance in CTA issuance among the cores. This imbalance leaves computing resources under-utilized and thus leaves room for further performance improvement. However, existing warp and CTA scheduling policies do not take load balance into account. We propose a credit-based load-balance-aware CTA scheduling optimization scheme (CLASO) that piggybacks on a standard GPGPU scheduling system. CLASO uses credits to limit the number of CTAs issued to each core, avoiding both greedy issuance to faster-executing cores and starvation of the remaining cores. In addition, CLASO employs global credits and two tuning parameters, active levels and loose levels, to enhance load balance and robustness. Rather than being a standalone scheduling policy, CLASO is compatible with existing CTA and warp schedulers. Experiments on several paradigmatic benchmarks show that CLASO effectively improves load balance, reducing idle cycles by 52.4% on average and achieving up to a 26.6% speedup over the GPGPU baseline scheduling policy.
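The credit mechanism lends itself to a short sketch. The C++ fragment below is a minimal illustration only, not the authors' implementation: the types Core and CreditScheduler, the replenish-on-completion policy, and all names are assumptions made for exposition, and the active-level and loose-level tuning described in the abstract is omitted.

```cpp
// Hypothetical sketch of credit-gated CTA issuance (not the paper's code).
// Each core holds a per-core credit budget, and a shared global pool caps
// total outstanding issuance, so a fast core cannot run arbitrarily far
// ahead of slower ones.
#include <cstddef>
#include <vector>

struct Core {
    int credits;         // remaining per-core budget of in-flight CTAs (assumed semantics)
    bool has_free_slot;  // resource availability (registers, shared memory, ...)
};

class CreditScheduler {
public:
    CreditScheduler(std::size_t n_cores, int per_core_credits, int global_credits)
        : cores_(n_cores, Core{per_core_credits, true}),
          global_credits_(global_credits) {}

    // Try to issue one pending CTA to core_id; returns false when either
    // the per-core or the global credit budget blocks the issuance.
    bool try_issue(std::size_t core_id) {
        Core &c = cores_[core_id];
        if (!c.has_free_slot || c.credits == 0 || global_credits_ == 0)
            return false;  // greedy issuance to this core is throttled
        --c.credits;
        --global_credits_;
        return true;       // caller dispatches the CTA to core_id
    }

    // On CTA completion, return the credits so the core can accept more work.
    void on_complete(std::size_t core_id) {
        ++cores_[core_id].credits;
        ++global_credits_;
    }

private:
    std::vector<Core> cores_;
    int global_credits_;   // shared pool enforcing a global issuance cap
};
```

In such a scheme, a hardware CTA scheduler would sweep try_issue over the cores each scheduling cycle; the per-core budget curbs issuance to fast cores, while the shared global pool keeps total issuance balanced across the chip.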


Keywords: GPGPU · CTA scheduler · Credit-based load-balance-aware scheduling scheme · Load balance



This research is partially sponsored by the U.S. National Science Foundation (NSF) grants CCF-1102624 and CNS-1218960, and by National Natural Science Foundation of China grants 61033012 and 11372067. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Yulong Yu (1, 2)
  • Xubin He (2)
  • He Guo (1)
  • Yuxin Wang (3)
  • Xin Chen (1)

  1. School of Software Technology, Dalian University of Technology, Dalian, China
  2. Department of Electrical and Computer Engineering, Virginia Commonwealth University, Richmond, USA
  3. School of Computer Science and Technology, Dalian University of Technology, Dalian, China
