Dynamic Programming Parallelization of Matrix Chain Multiplication on GPU: A Comparative Study
Dynamic programming underlies many important optimization problems, including the optimal binary search tree, longest common subsequence, binary knapsack, and matrix chain multiplication (MCM). For a chain of n matrices, MCM computes the parenthesization that minimizes the cost of the matrix product; the dynamic programming solution takes O(n^3) time and uses a table of size O(n^2). We propose MCM parallelization techniques at the thread level on a multi-core CPU and for groups of threads on an NVIDIA GPU. The prime objective of this paper is to present and analyze massively parallel implementations of the MCM algorithm using OpenMP on an Intel Xeon CPU and CUDA on an NVIDIA GPU. Relative to its serial implementation, the parallel MCM algorithm achieved a speedup of 10× on the Intel Xeon using OpenMP and a speedup of 7× on an NVIDIA Quadro FX 3800 GPU, so in these experiments the speedup on the multi-core CPU exceeded the speedup on the GPU. This paper also compares OpenMP performance as the chunk size of loop iterations and the technique for scheduling those chunks among cores are varied.
Keywords: GPU, Dynamic programming, CUDA, OpenMP, Matrix chain multiplication
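The O(n^3) time and O(n^2) table mentioned in the abstract come from the textbook MCM recurrence. The following is a minimal serial sketch of that recurrence in Python, not the paper's OpenMP or CUDA implementation, which is not shown here. The names `matrix_chain_order`, `parenthesize`, and `dims` are chosen for illustration only. Matrix i (0-based) is assumed to have shape dims[i] x dims[i + 1].

```python
from typing import List, Tuple


def matrix_chain_order(dims: List[int]) -> Tuple[int, List[List[int]]]:
    """Return (minimum scalar multiplications, split table) for a matrix chain.

    Matrix i (0-based) has shape dims[i] x dims[i + 1], so len(dims) == n + 1.
    """
    n = len(dims) - 1
    if n < 1:
        raise ValueError("dims must describe at least one matrix")

    # Two n x n tables: this is the O(n^2) storage.
    cost = [[0] * n for _ in range(n)]
    split = [[-1] * n for _ in range(n)]

    # Fill the table by increasing chain length (one diagonal at a time).
    # Three nested loops over n give the O(n^3) running time.
    for length in range(2, n + 1):
        # Every cell on this diagonal reads only cells from shorter chains,
        # so the iterations of this loop are independent of one another.
        # That independence is what a parallel version can exploit
        # (an observation about the recurrence, not the paper's code).
        for i in range(0, n - length + 1):
            j = i + length - 1
            best, best_k = None, -1
            for k in range(i, j):
                c = (cost[i][k] + cost[k + 1][j]
                     + dims[i] * dims[k + 1] * dims[j + 1])
                if best is None or c < best:
                    best, best_k = c, k
            cost[i][j] = best
            split[i][j] = best_k

    return cost[0][n - 1], split


def parenthesize(split: List[List[int]], i: int, j: int) -> str:
    """Rebuild the optimal parenthesization of matrices i..j (0-based)."""
    if i == j:
        return f"A{i + 1}"
    k = split[i][j]
    return f"({parenthesize(split, i, k)} x {parenthesize(split, k + 1, j)})"


# Example: A1 is 10x30, A2 is 30x5, A3 is 5x60.
best_cost, split_table = matrix_chain_order([10, 30, 5, 60])
print(best_cost)                         # 4500
print(parenthesize(split_table, 0, 2))   # ((A1 x A2) x A3)
```

In this example, multiplying A1 and A2 first costs 10·30·5 + 10·5·60 = 4500 scalar multiplications, whereas multiplying A2 and A3 first costs 30·5·60 + 10·30·60 = 27000, which is why the choice of parenthesization matters.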