High-Performance Matrix-Vector Multiplication on the GPU

  • Hans Henrik Brandenborg Sørensen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7155)


In this paper, we develop a high-performance GPU kernel for one of the most popular dense linear algebra operations, the matrix-vector multiplication. The target hardware is the most recent NVIDIA Tesla 20-series (Fermi architecture), which is designed from the ground up for scientific computing. We show that achieving high performance for dense matrix-vector multiplication is essentially a matter of fully utilizing the fine-grained parallelism of the many-core GPU. We also show that auto-tuning can be successfully applied to the GPU kernel so that it performs well for all matrix shapes and sizes.
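To make "fine-grained parallelism" concrete for y = Ax, the sketch below emulates one common decomposition serially on the CPU. One "thread block" handles each row, each "thread" accumulates a strided slice of that row, and the partial sums are combined by a tree reduction. This is only an illustration under stated assumptions and not the kernel developed in the paper: the function name gemv_blocked, the row-major storage, and the parameter threads_per_block are all hypothetical choices made here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial CPU emulation of a "one thread block per row" layout for y = A * x.
// Assumptions (not from the paper): A is dense and row-major, and
// threads_per_block is a hypothetical tuning parameter that is a power of two.
std::vector<double> gemv_blocked(const std::vector<double>& A,
                                 const std::vector<double>& x,
                                 std::size_t rows, std::size_t cols,
                                 std::size_t threads_per_block) {
    assert(A.size() == rows * cols);
    assert(x.size() == cols);
    assert(threads_per_block > 0 &&
           (threads_per_block & (threads_per_block - 1)) == 0);

    std::vector<double> y(rows, 0.0);
    // Stands in for the per-block scratch storage a GPU kernel would use.
    std::vector<double> partial(threads_per_block);

    for (std::size_t row = 0; row < rows; ++row) {  // one "block" per row
        // Step 1: each "thread" t accumulates a strided slice of the row.
        for (std::size_t t = 0; t < threads_per_block; ++t) {
            double sum = 0.0;
            for (std::size_t j = t; j < cols; j += threads_per_block) {
                sum += A[row * cols + j] * x[j];
            }
            partial[t] = sum;
        }
        // Step 2: tree reduction, halving the number of active "threads".
        for (std::size_t stride = threads_per_block / 2; stride > 0; stride /= 2) {
            for (std::size_t t = 0; t < stride; ++t) {
                partial[t] += partial[t + stride];
            }
        }
        y[row] = partial[0];
    }
    return y;
}
```

Calling gemv_blocked(A, x, rows, cols, 4) returns the product vector. Changing the last argument changes only how the work is split, not the mathematical result, and the split is the kind of parameter an auto-tuner might vary across matrix shapes and sizes.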


Keywords: GPU, Matrix-Vector Multiplication, Dense linear algebra





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Hans Henrik Brandenborg Sørensen (1)
  1. Informatics and Mathematical Modelling, Technical University of Denmark, Lyngby, Denmark
