Kernel concurrency opportunities based on GPU benchmarks characterization

Carvalho, Pablo; Cruz, Rommel; Drummond, Lucia M. A.; Bentes, Cristiana; Clua, Esteban; Cataldo, Edson; Marzulo, Leandro A. J.

doi:10.1007/s10586-018-02901-1

Published: 17 January 2019

Cluster Computing, volume 23, pages 177–188 (2020)

Pablo Carvalho¹,
Rommel Cruz¹,
Lucia M. A. Drummond¹,
Cristiana Bentes² (ORCID: orcid.org/0000-0001-9092-6007),
Esteban Clua¹,
Edson Cataldo³ &
Leandro A. J. Marzulo⁴

Abstract

Graphics Processing Units (GPUs) have become an important platform for general-purpose computing, thanks to their high performance and low cost compared to CPUs. Modern GPU architectures are constantly evolving, with growing resources. To take advantage of all the available resources and increase GPU efficiency, new-generation GPUs include support for concurrent kernel execution: different kernels can execute at the same time and share the GPU resources. Benchmark suites developed to evaluate GPU performance and scalability should therefore take this aspect into account, which can make them quite different from traditional CPU benchmarks. Currently, SHOC, Parboil, and Rodinia are the main benchmark suites for evaluating GPUs. This work analyzes these benchmark suites in a novel way. We propose to categorize the kernels of each application in these suites by multiple criteria, based on their behavior in terms of computation type (integer or floating point), usage of the memory hierarchy, efficiency, and hardware occupancy. Based on the characterization results, we analyze kernel concurrency opportunities. The focus is on disclosing the resource requirements of the kernels in these benchmarks and on explaining their behavior when executed concurrently.
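As a rough illustration of how a multi-criteria categorization could feed a search for concurrency opportunities, the sketch below labels kernels along three of the criteria named in the abstract and lists pairs whose demands look complementary. It is not the authors' method: the `KernelProfile` fields, the 0.5 threshold, the kernel names, the sample values, and the pairing heuristic (one memory-bound kernel, one compute-bound kernel, combined occupancy of at most 1.0) are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class KernelProfile:
    """Hypothetical per-kernel summary; field names and scales are assumptions."""
    name: str
    float_fraction: float    # share of arithmetic instructions that are floating point (0..1)
    memory_intensity: float  # normalized pressure on the memory hierarchy (0..1)
    occupancy: float         # achieved hardware occupancy (0..1)


def categorize(k, threshold=0.5):
    """Map a profile to coarse labels along three criteria (threshold is an assumption)."""
    return {
        "compute_type": "float" if k.float_fraction >= threshold else "integer",
        "bound": "memory" if k.memory_intensity >= threshold else "compute",
        "occupancy": "high" if k.occupancy >= threshold else "low",
    }


def concurrency_candidates(kernels, threshold=0.5):
    """Return name pairs whose coarse categories suggest complementary demands.

    Illustrative heuristic only: the two kernels stress different resources
    (memory vs. compute) and their occupancies sum to at most 1.0.
    """
    pairs = []
    for a, b in combinations(kernels, 2):
        ca, cb = categorize(a, threshold), categorize(b, threshold)
        if ca["bound"] != cb["bound"] and a.occupancy + b.occupancy <= 1.0:
            pairs.append((a.name, b.name))
    return pairs


# Made-up profiles, not measurements from SHOC, Parboil, or Rodinia.
kernels = [
    KernelProfile("kernel_a", 0.9, 0.2, 0.40),
    KernelProfile("kernel_b", 0.1, 0.8, 0.50),
    KernelProfile("kernel_c", 0.7, 0.9, 0.95),
]
print(concurrency_candidates(kernels))
```

With these made-up values, only `kernel_a` and `kernel_b` are paired: `kernel_c` is memory-bound like `kernel_b`, and its occupancy leaves no room alongside `kernel_a`. In practice the inputs would come from profiler measurements, and the pairing rule would have to be validated against actual concurrent runs.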

Notes

http://iss.ices.utexas.edu/?p=projects/galois/lonestargpu.

Author information

Authors and Affiliations

Instituto de Computação - Universidade Federal Fluminense, Niterói, Brazil
Pablo Carvalho, Rommel Cruz, Lucia M. A. Drummond & Esteban Clua
Engenharia de Sistemas e Computação - Universidade do Estado do Rio de Janeiro, Rio de Janeiro, Brazil
Cristiana Bentes
Programa de Pós-graduação em Engenharia Elétrica e de Telecomunicações - Universidade Federal Fluminense, Niterói, Brazil
Edson Cataldo
Google, Sunnyvale, USA
Leandro A. J. Marzulo

Corresponding author

Correspondence to Cristiana Bentes.

About this article

Cite this article

Carvalho, P., Cruz, R., Drummond, L.M.A. et al. Kernel concurrency opportunities based on GPU benchmarks characterization. Cluster Comput 23, 177–188 (2020). https://doi.org/10.1007/s10586-018-02901-1

Received: 29 January 2018
Revised: 07 August 2018
Accepted: 24 December 2018
Published: 17 January 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s10586-018-02901-1
