Performance Study of OpenMP and Hybrid Programming Models on CPU–GPU Cluster
Optimizing the complex code of scientific and engineering applications is a challenging area of research. Many parallel and distributed programming frameworks exist to optimize such code for performance. In this study, we compare the performance of several parallel computing models on irregular graph algorithms: Floyd’s algorithm (shortest path problems) and Kruskal’s algorithm (minimum spanning tree problems). We consider OpenMP and hybrid [MPI + OpenMP] on a CPU cluster, and hybrid [MPI + CUDA] on a GPU cluster, with the aim of improving performance on a shared–distributed memory architecture by minimizing the communication and computation-overlap overhead between individual nodes. A single MPI process per node launches small chunks of a large irregular graph algorithm on the nodes of the cluster, and CUDA distributes the work among the GPU cores within a cluster node. The results show that, from a performance perspective, the GPU implementation of the graph algorithms is more effective than the CPU implementation. For Floyd’s algorithm, the hybrid [MPI + CUDA] framework on the GPU cluster yields an average speedup of 19.03 over OpenMP and a speedup of 15.96 over the CPU cluster with the hybrid [MPI + OpenMP] framework. For Kruskal’s algorithm, the average speedup is 27.26 over OpenMP and 20.74 over the CPU cluster with the hybrid [MPI + OpenMP] framework.
Keywords: CPU, GPU, CUDA, MPI, OpenMP
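To make the shared-memory baseline concrete, the following is a minimal sketch of how Floyd’s all-pairs shortest path algorithm might be parallelized with OpenMP. It is not the implementation evaluated in the study: the function name `floyd_warshall`, the `INF` sentinel, the row-major `int` matrix layout, and the static schedule are all illustrative assumptions. It also assumes non-negative edge weights and a zero diagonal, so that row `k` and column `k` are not modified during iteration `k` and the rows can be updated concurrently.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Sentinel for "no edge" (an assumption of this sketch); halved so that
   adding a finite distance to INF cannot overflow int. */
#define INF (INT_MAX / 2)

/* In-place Floyd's algorithm on a row-major n x n distance matrix.
   Assumes non-negative weights and dist[i][i] == 0. */
void floyd_warshall(int n, int *dist)
{
    assert(n >= 0 && (n == 0 || dist != NULL));

    /* The k loop carries a dependency and must stay sequential. */
    for (int k = 0; k < n; k++) {
        /* For a fixed k, each row i can be relaxed independently,
           so the rows are divided among the OpenMP threads. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            int dik = dist[(size_t)i * n + k];
            if (dik >= INF)
                continue; /* no path i -> k, nothing to relax */
            for (int j = 0; j < n; j++) {
                int through_k = dik + dist[(size_t)k * n + j];
                if (through_k < dist[(size_t)i * n + j])
                    dist[(size_t)i * n + j] = through_k;
            }
        }
    }
}
```

Built with an OpenMP-enabled compiler (for example `gcc -fopenmp`), the inner row loop runs across the available cores; without that flag the pragma is ignored and the same code runs sequentially. In a hybrid setting such as the one the abstract describes, each MPI process would plausibly own a block of rows and receive row `k` from its owner at every step, but that distribution is not shown here.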