Kernel concurrency opportunities based on GPU benchmarks characterization

Cluster Computing

Abstract

Graphics Processing Units (GPUs) have become an important platform for general-purpose computing, thanks to their high performance and low cost compared to CPUs. Modern GPU architectures are constantly evolving, with growing resources. To take advantage of all the available resources and increase GPU efficiency, new-generation GPUs include support for concurrent kernel execution: different kernels can execute at the same time and share the GPU resources. Benchmark suites developed to evaluate GPU performance and scalability should therefore take this aspect into account, which can make them quite different from traditional CPU benchmarks. Currently, SHOC, Parboil, and Rodinia are the main benchmark suites for evaluating GPUs. This work analyzes these benchmark suites in a novel way. We propose to categorize the kernels of each application in these suites by multiple criteria, based on their behavior in terms of computation type (integer or floating point), usage of the memory hierarchy, efficiency, and hardware occupancy. Based on the characterization results, we analyze kernel concurrency opportunities. The focus is on disclosing the resource requirements of the kernels in these benchmarks and explaining their behavior when executed concurrently.
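To make the link between per-kernel resource requirements and concurrency concrete, the following is a minimal sketch, not the method used in the paper. It checks whether one thread block from each of several kernels could be resident on a single streaming multiprocessor (SM) at the same time. The class names, the function `blocks_fit`, the example kernels, and the per-SM limits are all hypothetical values chosen for illustration; real limits depend on the GPU architecture, and real hardware applies further constraints (block-count limits, allocation granularity) that are ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelBlock:
    """Resource demand of one thread block of a kernel (illustrative fields)."""
    threads: int
    registers_per_thread: int
    shared_mem_bytes: int


@dataclass(frozen=True)
class SMLimits:
    """Hypothetical per-SM limits; actual values vary by GPU generation."""
    max_threads: int = 2048
    max_registers: int = 65536
    max_shared_mem_bytes: int = 49152


def blocks_fit(blocks, limits=SMLimits()):
    """Return True if one block of each kernel fits on one SM simultaneously."""
    threads = sum(b.threads for b in blocks)
    registers = sum(b.threads * b.registers_per_thread for b in blocks)
    shared = sum(b.shared_mem_bytes for b in blocks)
    return (
        threads <= limits.max_threads
        and registers <= limits.max_registers
        and shared <= limits.max_shared_mem_bytes
    )


# Hypothetical kernels with different resource profiles.
compute_heavy = KernelBlock(threads=256, registers_per_thread=64, shared_mem_bytes=0)
memory_heavy = KernelBlock(threads=512, registers_per_thread=16, shared_mem_bytes=32768)
shared_hungry = KernelBlock(threads=256, registers_per_thread=32, shared_mem_bytes=40960)

# Complementary demands leave room for both blocks on the same SM.
print(blocks_fit([compute_heavy, memory_heavy]))
# Two kernels that both need large shared memory exceed the assumed limit.
print(blocks_fit([memory_heavy, shared_hungry]))
```

Under these assumed limits, kernels with complementary demands can co-reside, while two kernels that both require large amounts of shared memory cannot. This is the kind of interaction that a per-kernel resource characterization is meant to expose.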

Author information

Corresponding author

Correspondence to Cristiana Bentes.

About this article

Cite this article

Carvalho, P., Cruz, R., Drummond, L.M.A. et al. Kernel concurrency opportunities based on GPU benchmarks characterization. Cluster Comput 23, 177–188 (2020). https://doi.org/10.1007/s10586-018-02901-1

  • DOI: https://doi.org/10.1007/s10586-018-02901-1
