CUDA-Zero: a framework for porting shared memory GPU applications to multi-GPUs

  • Research Paper, Science China Information Sciences

Abstract

With the prevalence of general-purpose computation on GPUs, shared memory programming models have been proposed to ease the pain of GPU programming. However, as workloads grow more demanding, it is desirable to port GPU programs to more scalable distributed memory environments, such as multi-GPU systems. To achieve this, programs must be rewritten with mixed programming models (e.g., CUDA plus message passing), and programmers must work carefully not only on workload distribution but also on scheduling mechanisms to ensure efficient execution. In this paper, we study the possibility of automating this parallelization to multi-GPUs. Starting from a GPU program written in a shared memory model, our framework analyzes the access patterns of arrays in kernel functions to derive data partition schemes. To acquire the access patterns, we propose a three-tier approach: static analysis, profile-based analysis, and user annotation. Experiments show that most access patterns can be derived correctly by the first two tiers, meaning that zero effort is needed to port an existing application to a distributed memory environment. We use our framework to parallelize several applications and show that, for certain kinds of applications, CUDA-Zero achieves efficient parallelization in a multi-GPU environment.
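
To make the idea concrete, here is a minimal sketch (our own illustration, not code from the paper) of the kind of transformation the framework targets. The kernel below has an affine access pattern: thread i touches only element i of each array, so static analysis alone can conclude that the arrays are block-partitionable, and the single-device launch can be rewritten as independent per-GPU launches with no inter-device communication. The names scale and scale_multi_gpu are hypothetical.

    #include <cuda_runtime.h>

    // Hypothetical single-GPU kernel: thread i reads b[i] and writes a[i].
    // This affine access pattern is what tier-1 static analysis can detect.
    __global__ void scale(float *a, const float *b, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] = s * b[i];
    }

    // Hand-written version of the multi-GPU code such a framework would
    // generate: each device owns one contiguous chunk of a and b.
    // (A real implementation would use streams to overlap the devices.)
    void scale_multi_gpu(float *h_a, const float *h_b, float s, int n) {
        int ngpu = 0;
        cudaGetDeviceCount(&ngpu);
        if (ngpu == 0) return;
        int chunk = (n + ngpu - 1) / ngpu;   // block partition size
        for (int d = 0; d < ngpu; ++d) {
            int lo = d * chunk;
            int len = (n - lo < chunk) ? (n - lo) : chunk;
            if (len <= 0) break;
            cudaSetDevice(d);
            float *d_a, *d_b;
            cudaMalloc(&d_a, len * sizeof(float));
            cudaMalloc(&d_b, len * sizeof(float));
            cudaMemcpy(d_b, h_b + lo, len * sizeof(float), cudaMemcpyHostToDevice);
            scale<<<(len + 255) / 256, 256>>>(d_a, d_b, s, len);
            cudaMemcpy(h_a + lo, d_a, len * sizeof(float), cudaMemcpyDeviceToHost);
            cudaFree(d_a);
            cudaFree(d_b);
        }
    }

Because each GPU's chunk is self-contained, no message passing is needed here. For kernels with overlapping accesses (e.g., a stencil reading a[i-1] and a[i+1]), the derived partition would instead require communication between devices, and patterns that static analysis cannot resolve are the kind of case the profile-based and user-annotation tiers are meant to cover.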


Author information

Correspondence to DeHao Chen.


Cite this article

Chen, D., Chen, W. & Zheng, W. CUDA-Zero: a framework for porting shared memory GPU applications to multi-GPUs. Sci. China Inf. Sci. 55, 663–676 (2012). https://doi.org/10.1007/s11432-011-4497-z
