Abstract
With the prevalence of general-purpose computation on GPUs, shared memory programming models have been proposed to ease the pain of GPU programming. However, as workloads grow more demanding, it is desirable to port GPU programs to more scalable distributed memory environments, such as multi-GPU systems. To achieve this, programs must be rewritten with mixed programming models (e.g., CUDA and message passing): programmers must work carefully not only on workload distribution, but also on scheduling mechanisms that ensure efficient execution. In this paper, we study the possibility of automating this parallelization to multi-GPUs. Starting from a GPU program written in a shared memory model, our framework analyzes the access patterns of arrays in kernel functions to derive data partition schemes. To acquire the access patterns, we propose a three-tier approach: static analysis, profile-based analysis, and user annotation. Experiments show that most access patterns can be derived correctly by the first two tiers, meaning that no manual effort is needed to port an existing application to a distributed memory environment. We use our framework to parallelize several applications, and show that for certain kinds of applications, CUDA-Zero achieves efficient parallelization in a multi-GPU environment.
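The partition-scheme derivation described above can be sketched in miniature. The following is an illustrative reconstruction, not CUDA-Zero's actual implementation: it assumes the first (static-analysis) tier has recognized an affine access pattern `index = a*tid + b` over the thread index `tid`, from which contiguous per-GPU array slices follow; the function `partition` and its parameters are hypothetical.

```python
# Illustrative sketch (not CUDA-Zero's actual code): derive a block
# partition of an array across GPUs from a statically recognized
# affine access pattern index = a*tid + b.

def partition(array_len, num_gpus, a=1, b=0):
    """Return per-GPU (start, end) element ranges of the array.

    Assumes each thread tid accesses element a*tid + b, so contiguous
    thread ranges touch contiguous array slices, and the array can be
    split into near-equal blocks, one per GPU.
    """
    # Number of threads needed to cover the array under this pattern.
    n_threads = (array_len - b + a - 1) // a
    per_gpu = (n_threads + num_gpus - 1) // num_gpus
    parts = []
    for g in range(num_gpus):
        lo_tid = g * per_gpu
        hi_tid = min((g + 1) * per_gpu, n_threads)
        # Map the thread range back to the array elements it touches.
        parts.append((a * lo_tid + b,
                      min(a * (hi_tid - 1) + b + 1, array_len)))
    return parts

print(partition(1000, 4))  # → [(0, 250), (250, 500), (500, 750), (750, 1000)]
```

When static analysis cannot determine `a` and `b` (e.g., indirect indexing), the second tier would instead observe actual accesses at profiling time, and the third tier would fall back to a user-supplied annotation.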
Chen, D., Chen, W. & Zheng, W. CUDA-Zero: a framework for porting shared memory GPU applications to multi-GPUs. Sci. China Inf. Sci. 55, 663–676 (2012). https://doi.org/10.1007/s11432-011-4497-z