# Lessons Learned from Optimizing Kernels for Adaptive Aggregation Multi-grid Solvers in Lattice QCD

## Abstract

In recent years, adaptive aggregation multi-grid (AAMG) methods have become the gold standard for solving the Dirac equation in Lattice QCD (LQCD) using Wilson-Clover fermions. These methods are able to overcome the critical slowing down as quark masses approach their physical values and are thus the go-to method for performing Lattice QCD calculations at realistic physical parameters. In this paper we discuss the optimization of a specific building block for implementing AAMG for Wilson-Clover fermions from LQCD, known as the coarse restrictor operator, on contemporary Intel processors featuring large SIMD widths and high thread counts. We will discuss in detail the efficient use of OpenMP and Intel vector intrinsics in our attempts to exploit fine grained parallelism on the coarsest levels. We present performance optimizations and discuss the ramifications for implementing a full AAMG stack on Intel Xeon Phi Knights Landing and Skylake processors.

## Notes

### Acknowledgment

This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and of the ALCF, which is supported by DOE/SC under contract DE-AC02-06CH11357. B. Joo acknowledges funding from the DOE Office Of Science, Offices of Nuclear Physics and Advanced Scientific Computing Research through the SciDAC program. B. Joo also acknowledges support from the U.S. DOE Exascale Computing Project (ECP). This work is supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177. B. Joo would like to thank and acknowledge Kate Clark of NVIDIA for many discussions about expressing and mapping parallelism in multi-grid solver components in a variety of programming models and hardware and her helpful comments after a reading of this manuscript, as well as Christian Trott of Sandia Labs for discussions about nested paralleism in OpenMP. This work used resources provided by the Performance Research Laboratory at the University of Oregon. We would especially like to thank Sameer Shende and Rob Yelle for their professional support of the Performance Research Laboratory computers and their timely response to our requests.

## References

- 1.NERSC Cori Website. https://www.nersc.gov/users/computational-systems/cori/
- 2.Babich, R., Brannick, J., Brower, R., Clark, M., Manteuffel, T., et al.: Adaptive multigrid algorithm for the lattice Wilson-Dirac operator. Phys. Rev. Lett.
**105**, 201602 (2010)CrossRefGoogle Scholar - 3.Babich, R., Clark, M.A., Joo, B., Shi, G., Brower, R.C., Gottlieb, S.: Scaling lattice QCD beyond 100 GPUs. In: SC 2011 International Conference for High Performance Computing, Networking, Storage and Analysis Seattle, Washington, 12–18 November 2011 (2011). http://inspirehep.net/record/927455/files/arXiv:1109.2935.pdf
- 4.Boyle, P.A.: Hierarchically deflated conjugate gradient (2014)Google Scholar
- 5.Brannick, J., Brower, R.C., Clark, M.A., Osborn, J.C., Rebbi, C.: Adaptive multigrid algorithm for lattice QCD. Phys. Rev. Lett.
**100**, 041601 (2008)CrossRefGoogle Scholar - 6.Brower, R.C., Weinberg, E., Clark, M.A., Strelchenko, A.: Phys. Rev. D
**97**, 114513 (2018). https://journals.aps.org/prd/abstract/10.1103/PhysRevD.97.114513CrossRefGoogle Scholar - 7.Clark, M.A., Babich, R., Barros, K., Brower, R.C., Rebbi, C.: Solving Lattice QCD systems of equations using mixed precision solvers on GPUs. Comput. Phys. Commun.
**181**, 1517–1528 (2010)CrossRefGoogle Scholar - 8.Clark, M.A., Brower, R., Cheng, M.: Hierarchical algorithms on heterogeneous architectures: adaptive multigrid solvers for LQCD on GPUs. In: Proceedings of the 2014 GPU Technology Conference (2014)Google Scholar
- 9.Clark, M.A., Joó, B., Strelchenko, A., Cheng, M., Gambhir, A., Brower, R.: Accelerating lattice QCD multigrid on GPUs using fine-grained parallelization. In: ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis, Salt Lake City, Utah (2016)Google Scholar
- 10.Cohen, S.D., Brower, R.C., Clark, M.A., Osborn, J.C.: Multigrid algorithms for domain-wall fermions. In: PoS LATTICE 2011, 030 (2011)Google Scholar
- 11.Creutz, M.: Quarks, Gluons and Lattices. Cambridge Monographs on Mathematical Physics, 169 p. Cambridge University Press, Cambridge (1983)Google Scholar
- 12.Frommer, A., Kahl, K., Krieg, S., Leder, B., Rottmann, M.: Adaptive aggregation based domain decomposition multigrid for the lattice wilson dirac operator. SIAM J. Sci. Comput.
**36**, A1581–A1608 (2014)MathSciNetCrossRefGoogle Scholar - 13.Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand.
**49**(6), 409–436 (1952)MathSciNetCrossRefGoogle Scholar - 14.Heybrock, S., Rottmann, M., Georg, P., Wettig, T.: Adaptive algebraic multigrid on SIMD architectures. In: PoS LATTICE 2015, 036 (2016)Google Scholar
- 15.
- 16.Luscher, M.: Deflation acceleration of lattice QCD simulations. JHEP
**12**, 011 (2007)CrossRefGoogle Scholar - 17.Montvay, I., Munster, G.: Quantum Fields on a Lattice. Cambridge Monographs on Mathematical Physics, 491 p. Cambridge University Press, Cambridge (1994)Google Scholar
- 18.Osborn, J., Babich, R., Brannick, J., Brower, R., Clark, M., et al.: Multigrid solver for clover fermions. In: PoS LATTICE 2010, 037 (2010)Google Scholar
- 19.Rothe, H.J.: Lattice Gauge Theories: An Introduction. World Scientific Lecture Notes in Physics, vol. 74, pp. 1–605 (2005)Google Scholar
- 20.Sheikholeslami, B., Wohlert, R.: Improved continuum limit lattice action for QCD with wilson fermions. Nucl. Phys. B
**259**, 572 (1985)CrossRefGoogle Scholar - 21.van der Vorst, H.A.: Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Stat. Comput.
**13**(2), 631–644 (1992)MathSciNetCrossRefGoogle Scholar - 22.Winter, F.T., Clark, M.A., Edwards, R.G., Joó, B.: A framework for lattice QCD calculations on GPUs. In: Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, IPDPS 2014, pp. 1073–1082. IEEE Computer Society, Washington, DC, USA (2014). http://dx.doi.org/10.1109/IPDPS.2014.112
- 23.Yamaguchi, A., Boyle, P.: Hierarchically deflated conjugate residual. In: PoS LATTICE 2016, 374 (2016)Google Scholar