On the Performance Prediction of BLAS-based Tensor Contractions

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8966)

Abstract

Tensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.
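To make the decomposition idea concrete, here is a minimal sketch (not one of the paper's test cases; the contraction \(C_{abc} = \sum_i A_{ai} B_{ibc}\) and all dimensions are hypothetical) showing two alternative ways of reducing a three-dimensional contraction to a sequence of BLAS-like kernels, one based on matrix-matrix products (gemm) and one based on matrix-vector products (gemv):

```python
import numpy as np

# Hypothetical contraction C[a,b,c] = sum_i A[a,i] * B[i,b,c]
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))        # A[a, i]
B = rng.standard_normal((5, 6, 3))     # B[i, b, c]

# Decomposition 1: slice B and C along index c.
# Each iteration is one matrix-matrix product (a gemm).
C = np.empty((4, 6, 3))
for c in range(B.shape[2]):
    C[:, :, c] = A @ B[:, :, c]        # (a,i) x (i,b) -> (a,b)

# Decomposition 2: slice along b and c.
# Each iteration is one matrix-vector product (a gemv).
C2 = np.empty_like(C)
for b in range(B.shape[1]):
    for c in range(B.shape[2]):
        C2[:, b, c] = A @ B[:, b, c]   # (a,i) x (i,) -> (a,)

# Both decompositions compute the same contraction.
assert np.allclose(C, np.einsum('ai,ibc->abc', A, B))
assert np.allclose(C2, C)
```

Both variants compute the same result but issue different kernels on differently shaped operands, which is why their performance can differ substantially and why predicting the fastest variant without executing all of them is valuable.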


Notes

  1. gemm is the BLAS-3 routine for matrix-matrix multiplication, which on many systems is optimized to within a few percent of peak performance.

  2. For the sake of simplicity and without loss of generality, we ignore the distinction between covariant and contravariant vectors; that is, we treat every index as a subscript.

  3. In the Matlab-like notation used in this paper, 1:\(b\) denotes the numbers from 1 to \(b\), while an index : in a tensor refers to all elements along that dimension; e.g., \(C\)[:,b] is the \(b\)-th column of \(C\).

  4. The pictogram next to each algorithm visualizes the slicing of the tensors from which the algorithm's sequence of gemvs originates. The red objects represent the operands of the BLAS kernel.

  5. Algorithm names consist of two parts: the first is the list of sliced tensor indices iterated over by the algorithm's loops, with an apostrophe \('\) for each copy kernel; the second is the BLAS kernel at the algorithm's core.

  6. For algorithms with more than one for-loop, all slicings are visualized in blue and only the kernel operands (the intersections of the slicings) are in red.

  7. 2 GHz, 4 cores, 4 double-precision flops/cycle/core, 6 MB of L2 cache per 2 cores.

  8. Owing to the regular storage format and memory-access strides of dense linear algebra operations such as the considered tensor contractions, this simplifying assumption does not affect the reliability of the results.

  9. The cache-line size is \(64\,\mathrm{B}\) = 8 doubles.

  10. Slow tensor-contraction algorithms were stopped before reaching the largest test cases by limiting the total measurement time per algorithm to 15 minutes.

  11. Using 10 cores, the theoretical peak performance is 80 flops/cycle.
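As a sanity check on the machine parameters quoted in notes 7 and 11, the implied peak floating-point rates work out as follows (a back-of-the-envelope sketch; the exact processor models are not given in this excerpt):

```python
# Peak performance for the machine in note 7:
# 2 GHz, 4 cores, 4 double-precision flops/cycle/core.
freq_ghz = 2.0
cores = 4
flops_per_cycle_per_core = 4

peak_gflops = freq_ghz * cores * flops_per_cycle_per_core
assert peak_gflops == 32.0  # 32 GFLOPs/s across the 4-core chip

# Note 11's 10-core system reaches 80 flops/cycle in total,
# i.e., 8 flops/cycle/core.
assert 80 / 10 == 8
```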


Acknowledgments

Financial support from the Deutsche Forschungsgemeinschaft (DFG) through Grant GSC 111 and the Deutsche Telekom Stiftung is gratefully acknowledged.

Author information

Correspondence to Elmar Peise.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Peise, E., Fabregat-Traver, D., Bientinesi, P. (2015). On the Performance Prediction of BLAS-based Tensor Contractions. In: Jarvis, S., Wright, S., Hammond, S. (eds) High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. PMBS 2014. Lecture Notes in Computer Science, vol 8966. Springer, Cham. https://doi.org/10.1007/978-3-319-17248-4_10

  • DOI: https://doi.org/10.1007/978-3-319-17248-4_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-17247-7

  • Online ISBN: 978-3-319-17248-4

  • eBook Packages: Computer Science (R0)
