Abstract
With the prevalence of general-purpose computation on GPUs, shared memory programming models have been proposed to ease the pain of GPU programming. However, as workloads grow more demanding, it is desirable to port GPU programs to more scalable distributed memory environments, such as multi-GPU systems. To achieve this, programs must be rewritten with mixed programming models (e.g., CUDA and message passing): programmers must work carefully not only on workload distribution, but also on scheduling mechanisms that ensure efficient execution. In this paper, we study the possibility of automating this parallelization to multi-GPUs. Starting from a GPU program written in a shared memory model, our framework analyzes the access patterns of arrays in kernel functions to derive data partition schemes. To acquire the access patterns, we propose a three-tier approach: static analysis, profile-based analysis, and user annotation. Experiments show that most access patterns can be derived correctly by the first two tiers, meaning that no manual effort is needed to port an existing application to a distributed memory environment. We use our framework to parallelize several applications, and show that for certain kinds of applications, CUDA-Zero achieves efficient parallelization in a multi-GPU environment.
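The partition-scheme derivation described above can be sketched in miniature. The following is an illustrative reconstruction, not CUDA-Zero's actual implementation: it assumes the first (static-analysis) tier has recognized an affine access pattern `index = a*tid + b` over the thread index `tid`, from which contiguous per-GPU array slices follow; the function `partition` and its parameters are hypothetical.

```python
# Illustrative sketch (not CUDA-Zero's actual code): derive a block
# partition of an array across GPUs from a statically recognized
# affine access pattern index = a*tid + b.

def partition(array_len, num_gpus, a=1, b=0):
    """Return per-GPU (start, end) element ranges of the array.

    Assumes each thread tid accesses element a*tid + b, so contiguous
    thread ranges touch contiguous array slices, and the array can be
    split into near-equal blocks, one per GPU.
    """
    # Number of threads needed to cover the array under this pattern.
    n_threads = (array_len - b + a - 1) // a
    per_gpu = (n_threads + num_gpus - 1) // num_gpus
    parts = []
    for g in range(num_gpus):
        lo_tid = g * per_gpu
        hi_tid = min((g + 1) * per_gpu, n_threads)
        # Map the thread range back to the array elements it touches.
        parts.append((a * lo_tid + b,
                      min(a * (hi_tid - 1) + b + 1, array_len)))
    return parts

print(partition(1000, 4))  # → [(0, 250), (250, 500), (500, 750), (750, 1000)]
```

When static analysis cannot determine `a` and `b` (e.g., indirect indexing), the second tier would instead observe actual accesses at profiling time, and the third tier would fall back to a user-supplied annotation.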
Chen, D., Chen, W. & Zheng, W. CUDA-Zero: a framework for porting shared memory GPU applications to multi-GPUs. Sci. China Inf. Sci. 55, 663–676 (2012). https://doi.org/10.1007/s11432-011-4497-z