
Adaptive Simultaneous Multi-tenancy for GPUs

Conference paper in: Job Scheduling Strategies for Parallel Processing (JSSPP 2018). Part of the book series: Lecture Notes in Computer Science, volume 11332.

Abstract

Graphics Processing Units (GPUs) are energy-efficient, massively parallel accelerators that are increasingly deployed in multi-tenant environments, such as data centers, for general-purpose computing as well as graphics applications. Using GPUs in multi-tenant setups requires an efficient, low-overhead method for sharing the device among multiple users that improves system throughput while adapting to changes in the workload. Achieving this requires mechanisms to control the resources allocated to each kernel, and an efficient policy for making allocation decisions.

In this paper, we propose adaptive simultaneous multi-tenancy to address these issues. Adaptive simultaneous multi-tenancy shares the GPU among multiple kernels, as opposed to single-kernel multi-tenancy, which runs only one kernel on the GPU at any given time, and static simultaneous multi-tenancy, which does not adapt to events in the system. Our proposed system dynamically adjusts the kernels' parameters at run-time when a new kernel arrives or a running kernel ends. Evaluations using our prototype implementation show that, compared to sequentially executing the kernels, system throughput is improved by an average of 9.8% (and up to 22.4%) for combinations of kernels that include at least one low-utilization kernel.
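The core idea of the abstract can be sketched as a small event-driven scheduler: whenever a kernel arrives or finishes, the resource shares of all co-running kernels are recomputed. The sketch below is hypothetical (not the authors' implementation); the `utilization` estimate, the proportional-share policy, and the SM count of 80 are all illustrative assumptions.

```python
# Hypothetical sketch of adaptive simultaneous multi-tenancy: on every
# kernel arrival or completion, re-partition the GPU's streaming
# multiprocessors (SMs) among the co-running kernels.

from dataclasses import dataclass

TOTAL_SMS = 80  # assumed GPU size, for illustration only


@dataclass
class Kernel:
    name: str
    utilization: float  # estimated fraction of the GPU this kernel can keep busy


class AdaptiveScheduler:
    def __init__(self, total_sms=TOTAL_SMS):
        self.total_sms = total_sms
        self.running = []

    def _repartition(self):
        # Give each kernel SMs proportional to its estimated utilization,
        # so a low-utilization kernel leaves room for its co-runners.
        total = sum(k.utilization for k in self.running)
        return {
            k.name: max(1, int(self.total_sms * k.utilization / total))
            for k in self.running
        }

    def arrive(self, kernel):
        # A new kernel triggers a re-partitioning of the whole GPU.
        self.running.append(kernel)
        return self._repartition()

    def finish(self, name):
        # A departing kernel frees its SMs for the remaining kernels.
        self.running = [k for k in self.running if k.name != name]
        return self._repartition() if self.running else {}
```

For example, a high-utilization kernel `A` (0.9) initially receives the whole GPU; when a low-utilization kernel `B` (0.3) arrives, the shares are rebalanced to 60 and 20 SMs, and when `A` finishes, `B` reclaims the full device.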


Notes

  1. The recently announced NVIDIA Volta architecture solves head-of-line blocking at the GPU block scheduler by dividing the GPU into smaller virtual GPUs, but it lacks the flexibility provided by persistent threads.

  2. Scratchpad memory in NVIDIA terminology is called shared memory.
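The persistent-threads style mentioned in Note 1 keeps a fixed pool of long-lived workers that repeatedly pull work items from a shared queue, so work can be redistributed without relaunching anything. The sketch below renders the pattern with Python threads rather than CUDA thread blocks; the squaring "kernel body" and the worker count are illustrative assumptions, not part of the paper.

```python
# Illustrative sketch of the persistent-threads pattern (Note 1), using
# Python threads as a stand-in for CUDA thread blocks: workers stay alive
# and drain a shared work queue instead of being launched per work item.

import queue
import threading


def persistent_worker(tasks, results, stop):
    # Each worker loops until retired, pulling items as they appear.
    while not stop.is_set():
        try:
            item = tasks.get(timeout=0.05)
        except queue.Empty:
            continue
        results.append(item * item)  # stand-in for the real kernel body
        tasks.task_done()


def run(items, num_workers=4):
    tasks, results, stop = queue.Queue(), [], threading.Event()
    workers = [
        threading.Thread(target=persistent_worker, args=(tasks, results, stop))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    for it in items:
        tasks.put(it)
    tasks.join()  # wait until the queue is fully drained
    stop.set()    # retire the persistent workers
    for w in workers:
        w.join()
    return sorted(results)
```

The flexibility Note 1 attributes to persistent threads comes from this indirection: because workers fetch work rather than being bound to it at launch, the scheduler can resize or rebalance the pool at run-time, which a fixed hardware partitioning of the GPU cannot.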


Acknowledgments

This work is supported in part by the National Science Foundation (CCF-1335443) and equipment donations from NVIDIA.

Author information

Correspondence to Ramin Bashizade.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Bashizade, R., Li, Y., Lebeck, A.R. (2019). Adaptive Simultaneous Multi-tenancy for GPUs. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2018. Lecture Notes in Computer Science(), vol 11332. Springer, Cham. https://doi.org/10.1007/978-3-030-10632-4_5

  • DOI: https://doi.org/10.1007/978-3-030-10632-4_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-10631-7

  • Online ISBN: 978-3-030-10632-4

