A Hybrid CPU-GPU Multifrontal Optimizing Method in Sparse Cholesky Factorization
In many scientific computing applications, sparse Cholesky factorization is used to solve large sparse systems of linear equations in distributed environments, and GPU computing offers a new way to accelerate it. However, sparse Cholesky factorization on the GPU can hardly achieve high performance, because the matrix structure is irregular and GPU resources are poorly utilized. A hybrid CPU-GPU implementation of sparse Cholesky factorization based on the multifrontal method is proposed. In this method, the large sparse coefficient matrix is decomposed into a series of small dense matrices (frontal matrices), on which multiple GEMM (general matrix-matrix multiplication) operations are then performed. GEMMs are the dominant operations in sparse Cholesky factorization, but it is difficult to make them run efficiently in parallel on the GPU. To improve performance, a multiple-task-queue scheme is adopted to execute multiple GEMMs in parallel within the multifrontal method, and all GEMM tasks are scheduled dynamically onto the GPU and the CPU according to their computation scale, so as to balance the load and reduce the computing time. Experimental results show that the approach outperforms the cuBLAS implementation, achieving speedups of up to 1.98× on a GTX 460 (Fermi microarchitecture) and 3.06× on a K20m (Kepler microarchitecture).
Keywords: Multifrontal method · Multiple task queues scheme · Task allocation · GPU acceleration
This work was supported by the National Natural Science Foundation of China (grant No. 61133008) and the National Basic Research Program of China (973 Program, grant No. 2013CB2282036).
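As a rough illustration of the task-allocation idea described in the abstract, the sketch below dispatches each GEMM to the CPU or to the GPU (via cuBLAS) according to its flop count. It is a minimal sketch, not the authors' implementation: the flop-count threshold, the naive CPU kernel, and the per-task host-device copies are illustrative assumptions, and the multiple task queues and dynamic runtime scheduling of the actual method are omitted.

```cpp
// Minimal sketch of computation-scale-based GEMM dispatch between CPU and GPU.
// Assumptions (not from the paper): a fixed flop-count threshold, a naive CPU
// kernel, and per-task host-device copies.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

// Naive CPU GEMM, column-major: C = A(m x k) * B(k x n).
static void cpu_gemm(int m, int n, int k,
                     const double* A, const double* B, double* C) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double s = 0.0;
            for (int p = 0; p < k; ++p) s += A[i + p * m] * B[p + j * k];
            C[i + j * m] = s;
        }
}

// GPU GEMM via cuBLAS; copies the operands in and the result back.
static void gpu_gemm(cublasHandle_t h, int m, int n, int k,
                     const double* A, const double* B, double* C) {
    double *dA, *dB, *dC;
    const double one = 1.0, zero = 0.0;
    cudaMalloc((void**)&dA, sizeof(double) * m * k);
    cudaMalloc((void**)&dB, sizeof(double) * k * n);
    cudaMalloc((void**)&dC, sizeof(double) * m * n);
    cudaMemcpy(dA, A, sizeof(double) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(double) * k * n, cudaMemcpyHostToDevice);
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &one, dA, m, dB, k, &zero, dC, m);
    cudaMemcpy(C, dC, sizeof(double) * m * n, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}

// Dispatch one GEMM task by its computation scale (flop count): small tasks
// stay on the CPU, large ones go to the GPU. The threshold is a placeholder
// that would be tuned per device in a real load-balancing scheme.
static void dispatch_gemm(cublasHandle_t h, int m, int n, int k,
                          const double* A, const double* B, double* C) {
    const long long flops = 2LL * m * n * k;
    const long long threshold = 1LL << 24;  // assumed cut-off (~16 Mflop)
    if (flops < threshold) cpu_gemm(m, n, k, A, B, C);
    else                   gpu_gemm(h, m, n, k, A, B, C);
}

int main() {
    cublasHandle_t h;
    cublasCreate(&h);
    // Two toy "frontal matrix" updates of very different sizes.
    for (int dim : {64, 1024}) {
        std::vector<double> A(dim * dim, 1.0), B(dim * dim, 1.0), C(dim * dim);
        dispatch_gemm(h, dim, dim, dim, A.data(), B.data(), C.data());
        std::printf("dim=%d  C[0]=%.1f (expect %d)\n", dim, C[0], dim);
    }
    cublasDestroy(h);
    return 0;
}
```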