Abstract
In this paper, we develop a high-performance GPU kernel for dense matrix-vector multiplication, one of the most widely used dense linear algebra operations. The target hardware is the Nvidia Tesla 20-series (Fermi architecture), at the time the most recent Nvidia GPU line and one designed from the ground up for scientific computing. We show that achieving high performance for dense matrix-vector multiplication is essentially a matter of fully utilizing the fine-grained parallelism of the many-core GPU. We further show that auto-tuning can be successfully applied to the GPU kernel so that it performs well for all matrix shapes and sizes.
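The abstract's two ideas, exposing enough fine-grained parallelism and auto-tuning the kernel per matrix shape, can be illustrated with a small CPU-side model. This is a minimal sketch under stated assumptions, not the paper's Fermi kernel: it assumes a 2D tiling in which each tile (the hypothetical parameters `rows_per_tile` and `cols_per_tile`) stands in for the work of one GPU thread block, and it treats auto-tuning as an exhaustive timed search over a hypothetical candidate list. Splitting the columns as well as the rows keeps the number of independent tasks high even for short, wide matrices, at the cost of a second pass that reduces the partial sums. Timings of pure Python say nothing about GPU performance; only the structure carries over.

```python
import time
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Config = Tuple[int, int]  # hypothetical (rows_per_tile, cols_per_tile)


def gemv_tiled(A: Matrix, x: Sequence[float],
               rows_per_tile: int, cols_per_tile: int) -> List[float]:
    """Compute y = A x by splitting A into independent tiles.

    Each tile writes one partial sum per row into its own column-block slot,
    so no two tiles write to the same location (a model of one GPU thread
    block per tile). A second pass reduces the partial sums of each row.
    """
    m, n = len(A), len(x)
    col_blocks = (n + cols_per_tile - 1) // cols_per_tile
    partial = [[0.0] * m for _ in range(col_blocks)]

    # Phase 1: every (row block, column block) tile is an independent task.
    for r0 in range(0, m, rows_per_tile):
        for cb in range(col_blocks):
            c0 = cb * cols_per_tile
            c1 = min(c0 + cols_per_tile, n)
            for i in range(r0, min(r0 + rows_per_tile, m)):
                row = A[i]
                partial[cb][i] = sum(row[j] * x[j] for j in range(c0, c1))

    # Phase 2: reduce the partial sums of each row across column blocks.
    return [sum(partial[cb][i] for cb in range(col_blocks)) for i in range(m)]


def autotune(A: Matrix, x: Sequence[float],
             candidates: List[Config], repeats: int = 3) -> Config:
    """Return the candidate tiling with the lowest measured run time."""
    best, best_time = candidates[0], float("inf")
    for cfg in candidates:
        start = time.perf_counter()
        for _ in range(repeats):
            gemv_tiled(A, x, *cfg)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = cfg, elapsed
    return best


# A short, wide 4 x 1000 matrix: tiling by rows only, (1, 1000), yields just
# 4 independent tasks, while (4, 32) yields 32 column-block tasks.
A = [[float((i + j) % 7) for j in range(1000)] for i in range(4)]
x = [1.0] * 1000
candidates: List[Config] = [(1, 1000), (1, 250), (2, 100), (4, 32)]
cfg = autotune(A, x, candidates)
y = gemv_tiled(A, x, *cfg)
print("chosen tiling:", cfg)
```

On a GPU, the same search would time the real kernel for each candidate launch configuration, and the winner could be stored per matrix shape so that the search runs only once; the candidate list above is purely illustrative.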
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sørensen, H.H.B. (2012). High-Performance Matrix-Vector Multiplication on the GPU. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7155. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29737-3_42
DOI: https://doi.org/10.1007/978-3-642-29737-3_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29736-6
Online ISBN: 978-3-642-29737-3
eBook Packages: Computer Science, Computer Science (R0)