
Benchmarking SpMV Methods on Many-Core Platforms

  • Biwei Xie
  • Zhen Jia
  • Yungang Bao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11459)

Abstract

SpMV is an essential kernel in many HPC and data-center applications. Meanwhile, emerging many-core hardware provides promising computational power and is widely used for acceleration. Many methods and formats have been proposed to improve SpMV performance on many-core platforms. However, there is still a lack of comprehensive comparisons of SpMV methods that show their performance differences across sparse matrices with various sparsity patterns. Moreover, no systematic work has yet bridged the gap between SpMV performance and sparsity pattern.
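
To make the kernel concrete, below is a minimal sketch of the baseline CSR (compressed sparse row) SpMV, y = A*x, which the benchmarked methods all specialize in one way or another. The function name, zero-based indexing, and the OpenMP loop are illustrative assumptions, not the paper's implementation.

    /* Baseline CSR SpMV: y = A*x. The OpenMP pragma stands in for the
     * many-core parallelization that each benchmarked method refines. */
    #include <stddef.h>

    void spmv_csr(size_t n_rows, const size_t *row_ptr,
                  const size_t *col_idx, const double *vals,
                  const double *x, double *y)
    {
        #pragma omp parallel for schedule(dynamic, 64)
        for (size_t i = 0; i < n_rows; i++) {
            double sum = 0.0;
            /* Row i's nonzeros occupy vals[row_ptr[i] .. row_ptr[i+1]) */
            for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; j++)
                sum += vals[j] * x[col_idx[j]];
            y[i] = sum;
        }
    }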

In this paper, we investigate the performance of 27 SpMV methods on more than 1500 sparse matrices on two many-core platforms: the Intel Xeon Phi (Knights Landing 7250) and an Nvidia GPGPU (Tesla M40). Our work shows that no single SpMV method is optimal for all sparsity patterns, but some methods achieve close to the best performance on most sparse matrices. We further select 13 features to describe the sparsity pattern and analyze their correlation with the performance of each SpMV method. Our observations should help researchers and practitioners better understand SpMV performance and provide guidance for selecting a suitable SpMV method.
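
The 13 features are not enumerated on this page; as an illustration of the kind of sparsity-pattern features such an analysis relies on, the sketch below computes three common ones (mean and standard deviation of nonzeros per row, and overall density) from the CSR row pointer. The feature names and selection are illustrative assumptions, not the paper's feature set.

    #include <math.h>
    #include <stddef.h>

    /* Illustrative sparsity-pattern features derived from CSR metadata.
     * These are common examples, not the paper's 13 features. */
    void pattern_features(size_t n_rows, size_t n_cols, const size_t *row_ptr,
                          double *nnz_mean, double *nnz_stddev, double *density)
    {
        size_t nnz = row_ptr[n_rows];            /* total nonzero count */
        double mean = (double)nnz / (double)n_rows;
        double var = 0.0;
        for (size_t i = 0; i < n_rows; i++) {
            double d = (double)(row_ptr[i + 1] - row_ptr[i]) - mean;
            var += d * d;
        }
        *nnz_mean = mean;
        *nnz_stddev = sqrt(var / (double)n_rows);
        *density = (double)nnz / ((double)n_rows * (double)n_cols);
    }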

Keywords

Benchmarking · SpMV · Many-core · Evaluation

Notes

Acknowledgement

This work was supported in part by the National Key R&D Program of China (2016YFB1000201), the National Natural Science Foundation of China (Grant No. 61420106013), and the Youth Innovation Promotion Association of the Chinese Academy of Sciences (2013073).

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  2. Department of Computer Science, Princeton University, Princeton, USA
  3. University of Chinese Academy of Sciences, Beijing, China
