Optimizing tensor contraction expressions for hybrid CPU-GPU execution

Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste; Kowalski, Karol; Agrawal, Gagan

doi:10.1007/s10586-011-0179-2

Optimizing tensor contraction expressions for hybrid CPU-GPU execution

Published: 24 November 2011

Volume 16, pages 131–155, (2013)
Cite this article

Cluster Computing Aims and scope Submit manuscript

Wenjing Ma¹,
Sriram Krishnamoorthy¹,
Oreste Villa¹,
Karol Kowalski^1,3 &
…
Gagan Agrawal²

543 Accesses
26 Citations
Explore all metrics

Abstract

Tensor contractions are generalized multidimensional matrix multiplication operations that widely occur in quantum chemistry. Efficient execution of tensor contractions on Graphics Processing Units (GPUs) requires several challenges to be addressed, including index permutation and small dimension-sizes reducing thread block utilization. Moreover, to apply the same optimizations to various expressions, we need a code generation tool. In this paper, we present our approach to automatically generate CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU. To evaluate our tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate speedup over a factor of 8.4 using one GPU as compared to one CPU core and over 2.6 when utilizing the entire system using hybrid CPU+GPU solution with 2 GPUs and 5 cores (instead of 7 cores per node). We further investigate tensor contraction code on a new series of GPUs, the Fermi GPUs, and provide several effective optimization algorithms. For the same computation of CCSD(T), on a cluster with Fermi GPUs, we achieve a speedup of 3.4 over a cluster with T10 GPUs. With a single Fermi GPU on each node, we achieve a speedup of 43 over the sequential CPU version.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Anzt, H., Hahn, T., Heuveline, V., Rocker, B.: GPU accelerated scientific computing: evaluation of the NVIDIA Fermi architecture; elementary kernels and linear solvers (2010). http://www.emcl.kit.edu/preprints/emcl-preprint-2010-04.pdf
Aprà, E., Rendell, A.P., Harrison, R.J., Tipparaju, V., deJong, W.A., Xantheas, S.S.: Liquid water: obtaining the right answer for the right reasons. In: Proceedings of the ACM/IEEE SC Conference on High Performance Networking and Computing, pp. 1–7 (2009). doi:10.1145/1654059.1654127
Chapter Google Scholar
Auer, A., Baumgartner, G., Bernholdt, D., Bibireata, A., Choppella, V., Cociorva, D., Gao, X., Harrison, R., Krishnamoorthy, S., Krishnan, S., Lam, C., Lu, Q., Nooijen, M., Pitzer, R., Ramanujam, J., Sadayappan, P., Sibiryakov, A.: Automatic code generation for many-body electronic structure methods: the tensor contraction engine. Mol. Phys. 2, 211 (2006)
Article Google Scholar
Baghsorkhi, S.S., Delahaye, M., Patel, S.J., Gropp, W.D., Hwu, W.M.: An adaptive performance modeling tool for GPU architectures. In: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 105–114 (2010). doi:10.1145/1693453.1693470
Google Scholar
Bartlett, R.J., Musiał, M.: Coupled-cluster theory in quantum chemistry. Rev. Mod. Phys. 79(1), 291–352 (2007). doi:10.1103/RevModPhys.79.291
Article Google Scholar
Baskaran, M.M., Bondhugula, U., Krishnamoorthy, S., Ramanujam, J., Rountev, A., Sadayappan, P.: A compiler framework for optimization of affine loop nests for GPGPUs. In: Proceedings of the International Conference on Supercomputing (ICS), pp. 225–234 (2008). doi:10.1145/1375527.1375562
Google Scholar
Baumgartner, G., Auer, A., Bernholdt, D., Bibireata, A., Choppella, V., Cociorva, D., Gao, X., Harrison, R., Hirata, S., Krishnamoorthy, S., et al.: Synthesis of high-performance parallel programs for a class of ab initio quantum chemistry models. Proc. IEEE 93(2), 276–292 (2005)
Article Google Scholar
Boyer, M., Tarjan, D., Acton, S.T., Skadron, K.: Accelerating leukocyte tracking using CUDA: a case study in leveraging manycore coprocessors. In: Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), pp. 1–12 (2009). doi:10.1109/IPDPS.2009.5160984
Google Scholar
Che, S., Meng, J., Sheaffer, J.W., Skadron, K.: A performance study of general-purpose applications on graphics processors using CUDA. J. Parallel Distrib. Comput. 68(10), 1370–1380 (2008). doi:10.1016/j.jpdc.2008.05.014
Article Google Scholar
Choi, J.W., Singh, A., Vuduc, R.W.: Model-driven autotuning of sparse matrix-vector multiply on GPUs. In: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 115–126 (2010). doi:10.1145/1693453.1693471
Google Scholar
Čižek, J.: On correlation problem in atomic and molecular systems. Calculation of wavefunction components in ursell-type expansion using quantum-field theoretical methods. J. Chem. Phys. 45(11), 4256–4266 (1966)
Article Google Scholar
Consortium, H.T.: PCI Express 3.0 specification. http://www.hypertransport.org/docs/twgdocs/HTC20051222-00046-0028.pdf (2011)
DePrince, A.E., Hammond, J.R.: Coupled cluster theory on graphics processing units I. The coupled cluster doubles method. J. Chem. Theory Comput. 7(5), 1287–1295 (2011). doi:10.1021/ct100584w. http://pubs.acs.org/doi/abs/10.1021/ct100584w
Article Google Scholar
Dotsenko, Y., Baghsorkhi, S.S., Lloyd, B., Govindaraju, N.K.: Auto-tuning of fast Fourier transform on graphics processors. In: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP ’11, pp. 257–266. ACM Press, New York (2011). doi:10.1145/1941553.1941589. URL http://doi.acm.org/10.1145/1941553.1941589
Chapter Google Scholar
Dunning, T.: Gaussian basis sets for use in correlated molecular calculations I. The atoms boron through neon and hydrogen. J. Chem. Phys. 90, 1007–1023 (1989)
Article Google Scholar
Filippi, C., Zaccheddu, M., Buda, F.: Absorption spectrum of the green fluorescent protein chromophore: a difficult case for ab initio methods? J. Chem. Theory Comput. 5, 2074–2087 (2009)
Article Google Scholar
Gordon, M.I., Thies, W., Amarasinghe, S.: Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. Oper. Syst. Rev. 40(5), 151–162 (2006). doi:10.1145/1168917.1168877
Article Google Scholar
Hammond, J.R., De Prince, III, A.E.: Evaluating one-sided programming models for gpu cluster computations. http://saahpc.ncsa.illinois.edu/papers/paper_43.pdf (2011)
Harish, P., Narayanan, P.: Accelerating large graph algorithms on the GPU using CUDA. In: Proceedings of the International Conference on High Performance Computing (HiPC), pp. 197–208 (2007)
Google Scholar
Hirata, S.: Tensor contraction engine: abstraction and automated parallel implementation of configuration-interaction, coupled-cluster, and many-body perturbation theories. J. Phys. Chem. 107(46), 9887–9897 (2003)
Article Google Scholar
Hong, S., Kim, H.: An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In: ISCA ’09: Proceedings of the 36th Annual International Symposium on Computer Architecture, pp. 152–163. ACM Press, New York (2009). doi:10.1145/1555754.1555775
Chapter Google Scholar
Intel: An introduction to the Intel QuickPath Interconnect. Document Number: 320412, January 2009, http://www.intel.com/technology/quickpath/introduction.pdf
Kowalski, K., Krishnamoorthy, S., Olson, R.M., Tipparaju, V., Apra, E.: Scalable implementations of accurate excited-state coupled cluster theories: application of high-level methods to porphyrin-based systems. In: Proceedings of the ACM/IEEE SC Conference on High Performance Networking and Computing (2011). doi:10.1145/2063384.2063481
Google Scholar
Li, Y., Dongarra, J., Tomov, S.: A note on auto-tuning GEMM for GPUs. In: Proceedings of the International Conference on Computational Science (ICCS), pp. 884–892 (2009). doi:10.1007/978-3-642-01970-8-89
Google Scholar
Lu, Q., Krishnamoorthy, S., Sadayappan, P.: Combining analytical and empirical approaches in tuning matrix transposition. In: Proceedings of the Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 233–242 (2006). doi:10.1145/1152154.1152190
Chapter Google Scholar
Ma, W., Agrawal, G.: A translation system for enabling data mining applications on GPUs. In: Proceedings of the International Conference on Supercomputing (ICS), pp. 400–409 (2009). doi:10.1145/1542275.1542331
Google Scholar
Ma, W., Krishnamoorthy, S., Villa, O., Kowalski, K.: GPU-based implementations of the noniterative regularized-CCSD(T) corrections: applications to strongly correlated systems. J. Chem. Theory Comput. 7(5), 1316–1327 (2011). doi:10.1021/ct1007247. URL http://pubs.acs.org/doi/abs/10.1021/ct1007247
Article Google Scholar
Molka, D., Hackenberg, D., Schone, R., Muller, M.S.: Memory performance and cache coherency effects on an intel nehalem multiprocessor system. In: Proceedings of the Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 261–270 (2009). doi:10.1109/PACT.2009.22
Google Scholar
Murthy, S.G.: Optimal loop unrolling for GPGPU programs. Master’s thesis, The Ohio State University (2009)
Nath, R., Tomov, S., Dongarra, J.: An improved MAGMA GEMM for fermi GPUs. http://icl.cs.utk.edu/projectsfiles/magma/pubs/fermi_gemm.pdf (2010)
Nickolls, J., Buck, I., Garland, M., Skadron, K.: Scalable parallel programming with CUDA. ACM Queue 6(2), 40–53 (2008). doi:10.1145/1365490.1365500
Article Google Scholar
Nieplocha, J., Tipparaju, V., Krishnan, M., Panda, D.: High performance remote memory access communication: the armci approach. Int. J. High Perform. Comput. Appl. 20(2), 233 (2006)
Article Google Scholar
Nukada, A., Ogata, Y., Endo, T., Matsuoka, S.: Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. In: Proceedings of the ACM/IEEE SC Conference on High Performance Networking and Computing, pp. 1–11 (2008)
Google Scholar
Nvidia: NVIDIA’s next generation CUDA compute architecture: Fermi. http://www.nvidia.com/object/fermi_architecture.html
NVIDIA: NVIDIA CUDA Programming guide, version 3.0 (2010)
Paldus, J., Li, X.: A critical assessment of coupled cluster method in quantum chemistry. Adv. Chem. Phys. 110, 1–175 (1999)
Article Google Scholar
Raghavachari, K., Trucks, G.W., Pople, J.A., Head-Gordon, M.: A 5th-order perturbation comparison of electron correlation theories. Chem. Phys. Lett. 157(6), 479–483 (1989)
Article Google Scholar
Ryoo, S., Rodrigues, C.I., Baghsorkhi, S.S., Stone, S.S., Kirk, D.B., Hwu, W.M.: Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 73–82 (2008). doi:10.1145/1345206.1345220
Chapter Google Scholar
Ryoo, S., Rodrigues, C.I., Stone, S.S., Baghsorkhi, S.S., Ueng, S.Z., Stratton, J.A., Hwu, W.M.W.: Program optimization space pruning for a multithreaded GPU. In: Proceedings of the International Symposium on Code Generation and Optimization (CGO), pp. 195–204 (2008). doi:10.1145/1356058.1356084
Google Scholar
Schatz, M., Trapnell, C., Delcher, A., Varshney, A.: High-throughput sequence alignment using graphics processing units. BMC Bioinform. 8(1), 474 (2007). doi:10.1186/1471-2105-8-474
Article Google Scholar
TOP500: http://www.top500.org (2011)
Udupa, A., Govindarajan, R., Thazhuthaveetil, M.J.: Software pipelined execution of stream programs on GPUs. In: Proceedings of the International Symposium on Code Generation and Optimization (CGO), pp. 200–209 (2009). doi:10.1109/CGO.2009.20
Chapter Google Scholar
Valiev, M., Bylaska, E., Govind, N., Kowalski, K., Straatsma, T., Dam, H.V., Wang, D., Nieplocha, J., Apra, E., Windus, T., de Jong, W.: NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations. Comput. Phys. Commun. 181(9), 1477–1489 (2010). doi:10.1016/j.cpc.2010.04.018. URL http://www.sciencedirect.com/science/article/pii/S0010465510001438
Article MATH Google Scholar
Volkov, V., Demmel, J.: LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs. Tech. Rep. UCB/EECS-2008-49, EECS Department. University of California, Berkeley (2008). URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-49.html
Volkov, V., Demmel, J.W.: Benchmarking GPUs to tune dense linear algebra. In: Proceedings of the ACM/IEEE SC Conference on High Performance Networking and Computing, pp. 1–11 (2008)
Google Scholar

Download references

Author information

Authors and Affiliations

Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory, P.O. Box 999, MS-IN K8-91, Richland, WA, USA
Wenjing Ma, Sriram Krishnamoorthy, Oreste Villa & Karol Kowalski
Department of Computer Sciences and Engineering, The Ohio State University, 2015 Neil Ave, Columbus, OH, USA
Gagan Agrawal
Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, P.O. Box 999, MS-IN K8-91, Richland, WA, USA
Karol Kowalski

Authors

Wenjing Ma
View author publications
You can also search for this author in PubMed Google Scholar
Sriram Krishnamoorthy
View author publications
You can also search for this author in PubMed Google Scholar
Oreste Villa
View author publications
You can also search for this author in PubMed Google Scholar
Karol Kowalski
View author publications
You can also search for this author in PubMed Google Scholar
Gagan Agrawal
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Sriram Krishnamoorthy.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ma, W., Krishnamoorthy, S., Villa, O. et al. Optimizing tensor contraction expressions for hybrid CPU-GPU execution. Cluster Comput 16, 131–155 (2013). https://doi.org/10.1007/s10586-011-0179-2

Download citation

Received: 28 July 2011
Accepted: 11 August 2011
Published: 24 November 2011
Issue Date: March 2013
DOI: https://doi.org/10.1007/s10586-011-0179-2

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Optimizing tensor contraction expressions for hybrid CPU-GPU execution

Abstract

Access this article

Similar content being viewed by others

GPU Tuning for First-Principle Electronic Structure Simulations

A Performance Study of Quantum ESPRESSO’s PWscf Code on Multi-core and GPU Systems

Highly scalable implementation of an implicit matrix-free solver for gas dynamics on GPU-accelerated clusters

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Optimizing tensor contraction expressions for hybrid CPU-GPU execution

Abstract

Access this article

Similar content being viewed by others

GPU Tuning for First-Principle Electronic Structure Simulations

A Performance Study of Quantum ESPRESSO’s PWscf Code on Multi-core and GPU Systems

Highly scalable implementation of an implicit matrix-free solver for gas dynamics on GPU-accelerated clusters

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation