Abstract
This paper presents the first hybrid MPI+OpenMP+CUDA implementation of a distributed memory right-looking unsymmetric sparse direct solver (i.e., sparse LU factorization) that uses static pivoting. While BLAS calls can account for more than 40% of the overall factorization time, the difficulty is that small problem sizes dominate the workload, making efficient GPU utilization challenging. This fact motivates our approach, which is to find ways to aggregate collections of small BLAS operations into larger ones; to schedule operations to achieve load balance and hide long-latency operations, such as PCIe transfers; and to simultaneously exploit all of a node’s available CPU cores and GPUs.
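To make the aggregation idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that several small updates share the same left factor (a hypothetical `l_panel`), so their right-hand blocks (`u_blocks`) can be concatenated column-wise, multiplied in one larger product, and split back afterwards. Pure Python lists stand in for BLAS GEMM calls, and all function and variable names are illustrative assumptions.

```python
# Illustrative sketch only: plain Python stands in for BLAS GEMM calls.
# All names here (matmul, separate_updates, aggregated_update, l_panel,
# u_blocks) are hypothetical and do not come from the paper.

def matmul(a, b):
    """Dense product of two row-major list-of-lists matrices."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def separate_updates(l_panel, u_blocks):
    """Baseline: one small product per block (many tiny GEMM-like calls)."""
    return [matmul(l_panel, u) for u in u_blocks]


def aggregated_update(l_panel, u_blocks):
    """Concatenate blocks column-wise, do one larger product, then split."""
    widths = [len(u[0]) for u in u_blocks]
    n_rows = len(u_blocks[0])
    # Row r of the wide matrix is row r of every block laid side by side.
    u_wide = [sum((u[r] for u in u_blocks), []) for r in range(n_rows)]
    v_wide = matmul(l_panel, u_wide)  # the single, larger GEMM-like call
    # Cut the wide result back into one piece per original block.
    results, start = [], 0
    for w in widths:
        results.append([row[start:start + w] for row in v_wide])
        start += w
    return results


# Small example: a 3x2 panel times a 2x1 block and a 2x2 block.
l_panel = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
u_blocks = [
    [[1.0], [0.0]],
    [[0.0, 2.0], [1.0, 1.0]],
]
same = aggregated_update(l_panel, u_blocks) == separate_updates(l_panel, u_blocks)
print("aggregated result matches separate products:", same)
```

In a real solver the single wide product would map to one GEMM call on the CPU or GPU, and the split step would correspond to scattering results into the sparse destination blocks. The sketch only shows that the batched and per-block results agree.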
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Sao, P., Vuduc, R., Li, X.S. (2014). A Distributed CPU-GPU Sparse Direct Solver. In: Silva, F., Dutra, I., Santos Costa, V. (eds) Euro-Par 2014 Parallel Processing. Euro-Par 2014. Lecture Notes in Computer Science, vol 8632. Springer, Cham. https://doi.org/10.1007/978-3-319-09873-9_41
DOI: https://doi.org/10.1007/978-3-319-09873-9_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09872-2
Online ISBN: 978-3-319-09873-9
eBook Packages: Computer Science (R0)