High Performance Tensor–Vector Multiplication on Shared-Memory Systems

  • Filip Pawłowski
  • Bora Uçar
  • Albert-Jan Yzelman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12043)


Tensor–vector multiplication is one of the core components in tensor computations. We have recently investigated high-performance, single-core implementations of this bandwidth-bound operation. Here, we investigate its efficient shared-memory implementations. After carefully analyzing the design space, we implement a number of alternatives using OpenMP and compare them experimentally. Experimental results on systems with up to 8 sockets show near-peak performance for the proposed algorithms.
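For readers unfamiliar with the operation, the following is a minimal illustrative sketch of a mode-k tensor–vector multiplication on a dense tensor. It is not the authors' implementation: the function name `tvm`, the flat row-major storage, and the loop order are assumptions made here for illustration only. Each tensor element is read once and used in a single multiply-add, which is why the operation is limited by memory bandwidth rather than by arithmetic.

```python
from math import prod

def tvm(x, shape, k, v):
    """Mode-k tensor-vector product (hypothetical helper, for illustration).

    x     : dense tensor stored as a flat, row-major list
    shape : tuple of mode sizes (n_0, ..., n_{d-1})
    k     : mode to contract with the vector v (len(v) == shape[k])
    Returns the result as a flat, row-major list with mode k removed.
    """
    n_k = shape[k]
    assert len(v) == n_k and len(x) == prod(shape)
    left = prod(shape[:k])        # combined size of the modes before k
    right = prod(shape[k + 1:])   # combined size of the modes after k
    y = [0.0] * (left * right)
    # Iterations over i write disjoint parts of y, so this outer loop is a
    # natural candidate for distribution over threads (an assumption of this
    # sketch, not a description of the paper's algorithms).
    for i in range(left):
        for j in range(n_k):
            base = (i * n_k + j) * right
            for r in range(right):
                y[i * right + r] += x[base + r] * v[j]
    return y

# A 2 x 3 x 2 tensor holding 0..11, contracted along mode 1 with a vector of ones.
x = [float(i) for i in range(12)]
y = tvm(x, (2, 3, 2), 1, [1.0, 1.0, 1.0])
print(y)  # [6.0, 9.0, 24.0, 27.0]
```

In this view the tensor is treated as a left × n_k × right array, so the same three loops handle every mode without physically reordering the data.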


Keywords: Tensor computations; Tensor–vector multiplication; Shared-memory systems



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Huawei Technologies France, Boulogne-Billancourt, France
  2. ENS Lyon, Lyon, France
  3. CNRS and LIP (UMR5668, CNRS - ENS Lyon - UCB Lyon 1 - INRIA), Lyon, France
