
Lessons Learned from Optimizing Kernels for Adaptive Aggregation Multi-grid Solvers in Lattice QCD

  • Conference paper
  • In: High Performance Computing (ISC High Performance 2018)

Abstract

In recent years, adaptive aggregation multi-grid (AAMG) methods have become the gold standard for solving the Dirac equation in Lattice QCD (LQCD) using Wilson-Clover fermions. These methods are able to overcome the critical slowing down as quark masses approach their physical values and are thus the go-to method for performing Lattice QCD calculations at realistic physical parameters. In this paper we discuss the optimization of a specific building block for implementing AAMG for Wilson-Clover fermions from LQCD, known as the coarse restrictor operator, on contemporary Intel processors featuring large SIMD widths and high thread counts. We discuss in detail the efficient use of OpenMP and Intel vector intrinsics in our attempts to exploit fine-grained parallelism on the coarsest levels. We present performance optimizations and discuss the ramifications for implementing a full AAMG stack on Intel Xeon Phi Knights Landing and Skylake processors.


Notes

  1. Due to a bug in the libgomp runtime, we replaced spread with true to achieve correct binding for the GCC compiler tests.


Acknowledgment

This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and of the ALCF, which is supported by DOE/SC under contract DE-AC02-06CH11357. B. Joo acknowledges funding from the DOE Office of Science, Offices of Nuclear Physics and Advanced Scientific Computing Research, through the SciDAC program. B. Joo also acknowledges support from the U.S. DOE Exascale Computing Project (ECP). This work is supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177. B. Joo would like to thank and acknowledge Kate Clark of NVIDIA for many discussions about expressing and mapping parallelism in multi-grid solver components across a variety of programming models and hardware, and for her helpful comments after reading this manuscript, as well as Christian Trott of Sandia Labs for discussions about nested parallelism in OpenMP. This work used resources provided by the Performance Research Laboratory at the University of Oregon. We would especially like to thank Sameer Shende and Rob Yelle for their professional support of the Performance Research Laboratory computers and their timely response to our requests.

Author information

Correspondence to Thorsten Kurth.


Appendices

Appendix A - AVX512 SIMD Routines

We show below the code for the complex matrix-vector multiplication, in which a SIMD-length column in_v is multiplied by a complex scalar whose first element is pointed to by in_s. The resulting vector is accumulated onto out_v.

[Listing (figure a): the CMadd routine]
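Since the published listing is not reproduced above, the following is a minimal sketch of what such a CMadd routine could look like with AVX-512 intrinsics. It assumes single-precision complex numbers stored with interleaved real/imaginary parts and a vector length of 16 floats; the data layout and the exact signature are assumptions, not the paper's implementation.

    #include <immintrin.h>

    // Sketch only: out_v += s * in_v for one SIMD vector of single-precision
    // complexes stored re/im interleaved. The complex scalar s is read from
    // in_s (in_s[0] = Re(s), in_s[1] = Im(s)).
    static inline void CMadd(const float* in_s, const float* in_v, float* out_v)
    {
        const __m512 v     = _mm512_loadu_ps(in_v);       // [re0, im0, re1, im1, ...]
        const __m512 v_swp = _mm512_permute_ps(v, 0xB1);   // [im0, re0, im1, re1, ...]
        const __m512 s_re  = _mm512_set1_ps(in_s[0]);      // broadcast Re(s)
        const __m512 s_im  = _mm512_set1_ps(in_s[1]);      // broadcast Im(s)

        // Even lanes (real parts): s_re*re - s_im*im; odd lanes (imag parts): s_re*im + s_im*re.
        const __m512 prod  = _mm512_fmaddsub_ps(s_re, v, _mm512_mul_ps(s_im, v_swp));

        _mm512_storeu_ps(out_v, _mm512_add_ps(_mm512_loadu_ps(out_v), prod));
    }

With a 16-wide single-precision register this processes 8 complex numbers, i.e. one SIMD-length column, per call.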

Once a routine such as CMadd is available, the two block-diagonal matrices for a given site can be applied with code like that shown below. Here we assume that the final output has been initialized outside this routine (either to zero, or already holding some inner site's worth of data).

[Listing (figure b): applying the two block-diagonal matrices for one site]
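Again as a sketch rather than the published listing: assuming each chiral block is an N x N complex matrix stored column-major, with one column filling exactly one SIMD chunk, the per-site application could be structured as below. The constants, the array layout, and the name siteApplyBlockDiag are assumptions.

    // Assumes a CMadd(in_s, in_v, out_v) routine as sketched above:
    // out_v += (*in_s) * in_v for one SIMD-length column of complexes.
    void CMadd(const float* in_s, const float* in_v, float* out_v);

    constexpr int N_BLOCKS = 2;       // two chiral blocks per site
    constexpr int N        = 8;       // rows/columns per block (8 complexes)
    constexpr int VECLEN   = 2 * N;   // floats per matrix column / per block vector

    // 'out' must be initialized by the caller (to zero, or to a partial sum).
    void siteApplyBlockDiag(const float* A,   // [N_BLOCKS][N][VECLEN] column-major blocks
                            const float* in,  // [N_BLOCKS][VECLEN]
                            float* out)       // [N_BLOCKS][VECLEN]
    {
        for (int b = 0; b < N_BLOCKS; ++b) {
            const float* A_b   = A   + b * N * VECLEN;
            const float* in_b  = in  + b * VECLEN;
            float*       out_b = out + b * VECLEN;

            for (int col = 0; col < N; ++col) {
                // out_b += A_b(:, col) * in_b[col]  (complex axpy of one matrix column)
                CMadd(&in_b[2 * col], &A_b[col * VECLEN], out_b);
            }
        }
    }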

Appendix B - Parallel Reduction for Manual Nesting

The code for the parallel reduction (parallel over blocks, with one site per block) is listed below. We note that, rather than closing the preceding parallel region and re-opening it as below, we could have kept a single parallel region and used #pragma omp barrier to ensure all results were written before summing. Ideally, however, the barrier would need to apply only to the group of threads within a block, whereas the OpenMP barrier would synchronize all active threads in the team. Given that some may be idle, this could lead to much messier code.

[Listing (figure c): parallel reduction over blocks]
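As an illustration of the structure described above (not the original listing), the re-opened region might be a parallel loop over blocks in which each thread sums the partial results previously written for its block. The partial-result layout and all names here are assumptions.

    // Sketch only: the preceding parallel region has written per-thread partial
    // results into 'partial' (threads_per_block slices per block); this routine
    // re-opens a parallel region over blocks and sums the slices for each block.
    void reduceOverBlocks(int n_blocks, int threads_per_block, int block_len,
                          const float* partial, // [n_blocks][threads_per_block][block_len]
                          float* out)           // [n_blocks][block_len]
    {
    #pragma omp parallel for schedule(static)
        for (int blk = 0; blk < n_blocks; ++blk) {
            float* out_b = out + blk * block_len;
            for (int i = 0; i < block_len; ++i) out_b[i] = 0.0f;

            for (int t = 0; t < threads_per_block; ++t) {
                const float* p = partial + (blk * threads_per_block + t) * block_len;
                for (int i = 0; i < block_len; ++i) out_b[i] += p[i];
            }
        }
    }

Because the loop is parallel only over blocks, no barrier across the full thread team is needed; each block's sum is computed entirely by the thread that owns it.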


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Joó, B., Kurth, T. (2018). Lessons Learned from Optimizing Kernels for Adaptive Aggregation Multi-grid Solvers in Lattice QCD. In: Yokota, R., Weiland, M., Shalf, J., Alam, S. (eds) High Performance Computing. ISC High Performance 2018. Lecture Notes in Computer Science, vol. 11203. Springer, Cham. https://doi.org/10.1007/978-3-030-02465-9_34


  • DOI: https://doi.org/10.1007/978-3-030-02465-9_34


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-02464-2

  • Online ISBN: 978-3-030-02465-9

  • eBook Packages: Computer Science, Computer Science (R0)
