Journal of Signal Processing Systems, Volume 90, Issue 1, pp 69–86

LightSpMV: Faster CUDA-Compatible Sparse Matrix-Vector Multiplication Using Compressed Sparse Rows



Compressed sparse row (CSR) is one of the most frequently used sparse matrix storage formats. However, the efficiency of existing CUDA-compatible CSR-based sparse matrix-vector multiplication (SpMV) implementations is relatively low. We address this issue by presenting LightSpMV, a parallelized CSR-based SpMV implementation programmed in CUDA C++. This algorithm achieves high speed by employing atomic and warp shuffle instructions to implement fine-grained dynamic distribution of matrix rows over vectors/warps as well as efficient vector dot product computation. Moreover, we propose a unified cache hit rate computation approach to consistently investigate the caching behavior of different SpMV kernels, which may place their data differently in the hierarchical memory space of CUDA-enabled GPUs. We have assessed LightSpMV using a set of sparse matrices and compared it to the CSR-based SpMV kernels in the top-performing CUSP, ViennaCL and cuSPARSE libraries. Our experimental results demonstrate that LightSpMV is superior to CUSP, ViennaCL and cuSPARSE on the same Kepler-based Tesla K40c GPU, running up to 2.63× and 2.65× faster than CUSP, up to 2.52× and 1.96× faster than ViennaCL, and up to 1.94× and 1.79× faster than cuSPARSE with respect to single and double precision, respectively. In addition, when accelerating the PageRank graph application, LightSpMV remains consistently faster than these three libraries. LightSpMV is open-source and publicly available at
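As a rough illustration of the CSR layout and of distributing rows through an atomic counter, the following is a minimal CPU-side sketch in C++. It is not the LightSpMV CUDA kernel: the names `CsrMatrix`, `spmv_worker` and `spmv_csr` are hypothetical, the warp shuffle reduction is replaced by a plain sequential loop per row, and rows are claimed one at a time by a single worker.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// CSR storage (hypothetical struct for illustration):
// row r owns the entries in [row_offsets[r], row_offsets[r + 1]).
struct CsrMatrix {
    std::size_t num_rows;
    std::size_t num_cols;
    std::vector<std::size_t> row_offsets;     // length num_rows + 1
    std::vector<std::size_t> column_indices;  // length nnz
    std::vector<double> values;               // length nnz
};

// Worker loop: repeatedly claims the next unprocessed row from a shared
// atomic counter and stores that row's dot product with x into y.
// Each row index is handed out exactly once, so several workers could
// share the same counter without writing to the same element of y.
void spmv_worker(const CsrMatrix& a, const std::vector<double>& x,
                 std::vector<double>& y, std::atomic<std::size_t>& next_row) {
    for (;;) {
        const std::size_t row = next_row.fetch_add(1);
        if (row >= a.num_rows) {
            break;  // no rows left to claim
        }
        double sum = 0.0;
        for (std::size_t k = a.row_offsets[row]; k < a.row_offsets[row + 1]; ++k) {
            sum += a.values[k] * x[a.column_indices[k]];
        }
        y[row] = sum;
    }
}

// Computes y = A * x using one worker (assumption: single-threaded driver).
std::vector<double> spmv_csr(const CsrMatrix& a, const std::vector<double>& x) {
    assert(a.row_offsets.size() == a.num_rows + 1);
    assert(x.size() == a.num_cols);
    std::vector<double> y(a.num_rows, 0.0);
    std::atomic<std::size_t> next_row{0};
    spmv_worker(a, x, y, next_row);
    return y;
}
```

For the 3×3 matrix with rows (1, 0, 2), (0, 3, 0) and (4, 0, 5), the CSR arrays are `row_offsets = {0, 2, 3, 5}`, `column_indices = {0, 2, 1, 0, 2}` and `values = {1, 2, 3, 4, 5}`; multiplying by x = (1, 2, 3) gives y = (7, 6, 19). On a GPU, the shared counter would presumably be advanced with an atomic instruction by each vector or warp, which is the role the abstract assigns to atomics.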


Keywords: Sparse matrix-vector multiplication, Compressed sparse row, CUDA, GPU



We acknowledge funding by the Center for Computational Sciences (SRFN), Johannes Gutenberg University Mainz, and the Carl-Zeiss-Foundation.


Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta, USA
  2. Institute of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
