High Performance Tensor–Vector Multiplication on Shared-Memory Systems
Tensor–vector multiplication is one of the core components in tensor computations. We have recently investigated a high-performance, single-core implementation of this bandwidth-bound operation. Here, we investigate efficient shared-memory implementations of it. After carefully analyzing the design space, we implement a number of alternatives using OpenMP and compare them experimentally. Experimental results on systems with up to 8 sockets show near-peak performance for the proposed algorithms.
Keywords: Tensor computations · Tensor–vector multiplication · Shared-memory systems
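For readers unfamiliar with the operation named in the abstract, the following is a minimal sequential sketch of a mode-k tensor–vector multiplication. It is not the authors' algorithm. It assumes a dense tensor stored as a flat list in row-major order (last index fastest). The function name `tvm` and its parameters are hypothetical and chosen only for illustration.

```python
from math import prod


def tvm(x, shape, v, k):
    """Mode-k tensor-vector product of a flat, row-major dense tensor.

    Assumed layout (illustrative, not taken from the paper): x holds
    prod(shape) entries with the last index varying fastest.
    The result drops mode k and is returned in the same flat layout.
    """
    n = shape[k]
    if len(x) != prod(shape) or len(v) != n:
        raise ValueError("size mismatch between tensor, shape and vector")

    left = prod(shape[:k])       # product of the dimensions before mode k
    right = prod(shape[k + 1:])  # product of the dimensions after mode k
    y = [0.0] * (left * right)

    # Each tensor entry is read once and used in a single multiply-add,
    # which is why the operation is limited by memory bandwidth.
    for l in range(left):
        base_y = l * right
        for j in range(n):
            vj = v[j]
            base_x = (l * n + j) * right
            for r in range(right):
                y[base_y + r] += x[base_x + r] * vj
    return y
```

For example, `tvm([1, 2, 3, 4, 5, 6], (2, 3), [1, 1], 0)` contracts the first mode of a 2×3 tensor and sums its two rows. The iterations of the outer loop over `l` write to disjoint parts of `y`, so that loop is one natural candidate for a shared-memory parallelization. The schemes actually evaluated in the paper are not described on this page.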