Abstract
Tensor operations are emerging as the computational building blocks of a variety of scientific simulations, and developing high-performance kernels for such operations is known to be a challenging task. While standardized interfaces and highly optimized libraries (BLAS) exist for operations on one- and two-dimensional tensors, for higher-dimensional tensors neither standards nor highly tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists of breaking the contraction down into operations that involve only matrices and vectors. Since in general a contraction can be decomposed in many alternative ways, we methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology for accurately identifying the fastest algorithms in this family without executing them; instead, the goal is accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions constructed from these benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time required to execute the algorithms directly.
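To make the decomposition concrete, the following sketch (our own illustration, not code from the paper) shows how a hypothetical contraction \(C_{abc} = \sum_i A_{ai} B_{ibc}\) can be reduced to a sequence of matrix-matrix multiplications by slicing along the free index \(c\):

```python
import numpy as np

# Hypothetical contraction C[a,b,c] = sum_i A[a,i] * B[i,b,c].
# Slicing B and C along the free index c reduces the contraction
# to one matrix-matrix multiplication (gemm) per slice.
A = np.random.rand(30, 40)        # A[a,i]
B = np.random.rand(40, 50, 20)    # B[i,b,c]
C = np.empty((30, 50, 20))        # C[a,b,c]

for c in range(B.shape[2]):
    C[:, :, c] = A @ B[:, :, c]   # one gemm per slice

# Sanity check against a direct evaluation of the contraction
assert np.allclose(C, np.einsum("ai,ibc->abc", A, B))
```

Slicing along a different index (or slicing down to vectors) yields alternative decompositions built on other BLAS kernels, which is what gives rise to the large family of algorithms.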
Notes
1. gemm is the BLAS-3 routine for matrix-matrix multiplication, which on many systems is optimized to within a few percent of peak performance.
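For reference, one way to invoke an optimized gemm directly is through SciPy's BLAS bindings (a minimal sketch; the call is dispatched to whichever optimized BLAS the installation is linked against):

```python
import numpy as np
from scipy.linalg.blas import dgemm  # double-precision gemm

# Fortran (column-major) layout avoids internal copies in the BLAS call.
A = np.asfortranarray(np.random.rand(1000, 800))
B = np.asfortranarray(np.random.rand(800, 1200))

# C = alpha * A @ B, computed by the underlying optimized gemm kernel.
C = dgemm(alpha=1.0, a=A, b=B)

assert np.allclose(C, A @ B)
```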
2. For simplicity and without loss of generality, we ignore the distinction between covariant and contravariant vectors; that is, we treat every index as a subscript.
3. In the Matlab-like notation used in this paper, 1:\(b\) denotes the numbers from 1 to \(b\), while an index : in a tensor refers to all elements along that dimension; e.g., \(C\)[:,b] is the \(b\)-th column of \(C\).
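As a small illustration of this notation (our own, using NumPy as the closest widely available analogue to Matlab-style slicing):

```python
import numpy as np

b = 4
indices = list(range(1, b + 1))  # "1:b" -> [1, 2, 3, 4]

C = np.random.rand(5, 6)
col = C[:, b]  # "C[:,b]": all rows of column b
               # (caveat: NumPy indexing is 0-based, Matlab's is 1-based)
```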
4. The pictogram next to each algorithm visualizes the slicing of the tensors from which the algorithm's sequence of gemvs originates. The red objects represent the operands of the BLAS kernel.
5. Algorithm names consist of two parts: the first is the list of sliced tensor indices iterated over by the algorithm's loops, with an apostrophe \('\) for each copy kernel; the second is the BLAS kernel at the algorithm's core.
6. For algorithms with more than one for-loop, all slicings are visualized in blue and only the kernel operands (the intersections of the slicings) are shown in red.
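By way of example (a sketch under our own assumptions, mirroring the naming convention of note 5), an algorithm named \(bc\)-gemv for the hypothetical contraction \(C_{abc} = \sum_i A_{ai} B_{ibc}\) would loop over the sliced indices \(b\) and \(c\) with a gemv at its core:

```python
import numpy as np
from scipy.linalg.blas import dgemv  # double-precision matrix-vector product

# Hypothetical contraction C[a,b,c] = sum_i A[a,i] * B[i,b,c],
# in the style of an algorithm named "bc-gemv": two loops over the
# sliced indices b and c, one gemv per iteration.
A = np.asfortranarray(np.random.rand(30, 40))  # A[a,i]
B = np.random.rand(40, 50, 20)                 # B[i,b,c]
C = np.empty((30, 50, 20))                     # C[a,b,c]

for b in range(B.shape[1]):
    for c in range(B.shape[2]):
        C[:, b, c] = dgemv(1.0, A, B[:, b, c])  # y = alpha * A @ x

assert np.allclose(C, np.einsum("ai,ibc->abc", A, B))
```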
7. 2 GHz, 4 cores, 4 double-precision flops/cycle/core, 6 MB of L2 cache per 2 cores.
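For concreteness, the theoretical peak performance implied by these figures is the product of frequency, core count, and per-core throughput (simple arithmetic on the numbers above, not a figure from the paper):

```python
freq_ghz = 2.0        # clock frequency in GHz
cores = 4
flops_per_cycle = 4   # double-precision flops per cycle per core

peak_gflops = freq_ghz * cores * flops_per_cycle
print(f"theoretical peak: {peak_gflops} GFlops/s")  # 32.0
```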
8. Due to the regular storage format and memory-access strides of dense linear algebra operations such as the considered tensor contractions, this simplifying assumption does not affect the reliability of the results.
9. The cache-line size is \(64\,\mathrm{B} = 8\) doubles.
10. Slow tensor-contraction algorithms were stopped before reaching the largest test cases by limiting the total measurement time per algorithm to 15 minutes.
11. Using 10 cores, the theoretical peak performance is 80 flops/cycle.
Acknowledgments
Financial support from the Deutsche Forschungsgemeinschaft (DFG) through Grant GSC 111 and the Deutsche Telekom Stiftung is gratefully acknowledged.