Abstract
Rapid advances in multi-core processor architectures, coupled with low-cost, low-latency, high-bandwidth interconnects, have made clusters of multi-core machines a common computing resource. Unfortunately, writing good parallel programs that efficiently utilize all the resources in such a cluster remains a major challenge. Various programming languages have been proposed to address this problem, but they have yet to be widely adopted for performance-critical code, mainly because of relatively immature software frameworks and the effort involved in rewriting existing code in a new language. In this paper, we motivate and describe our initial study exploring CUDA as a programming language for a cluster of multi-core machines. We develop CUDA-For-Clusters (CFC), a framework that transparently orchestrates the execution of CUDA kernels on such a cluster. The well-structured nature of a CUDA kernel, together with the growing popularity, support and stability of the CUDA software stack, makes CUDA a good candidate programming language for a cluster. CFC uses a combination of source-to-source compiler transformations, a work-distribution runtime, and a lightweight software distributed shared memory to manage parallel execution. Initial experiments with several standard CUDA benchmark programs show speedups of up to 7.5X on a cluster with 8 nodes, opening up an interesting direction for further research.
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Prabhakar, R., Govindarajan, R., Thazhuthaveetil, M.J. (2012). CUDA-For-Clusters: A System for Efficient Execution of CUDA Kernels on Multi-core Clusters. In: Kaklamanis, C., Papatheodorou, T., Spirakis, P.G. (eds) Euro-Par 2012 Parallel Processing. Euro-Par 2012. Lecture Notes in Computer Science, vol 7484. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32820-6_42
Print ISBN: 978-3-642-32819-0
Online ISBN: 978-3-642-32820-6