Abstract
In recent years, adaptive aggregation multi-grid (AAMG) methods have become the gold standard for solving the Dirac equation in Lattice QCD (LQCD) with Wilson-Clover fermions. These methods overcome the critical slowing down that occurs as quark masses approach their physical values and are thus the method of choice for performing LQCD calculations at realistic physical parameters. In this paper we discuss the optimization of a specific building block of AAMG for Wilson-Clover fermions, known as the coarse restrictor operator, on contemporary Intel processors featuring large SIMD widths and high thread counts. We discuss in detail the efficient use of OpenMP and Intel vector intrinsics in our attempts to exploit fine-grained parallelism on the coarsest levels, present performance optimizations, and discuss the ramifications for implementing a full AAMG stack on Intel Xeon Phi Knights Landing and Skylake processors.
Notes
1. Due to a bug in the libgomp runtime, we replaced spread with true to achieve correct thread binding for the GCC compiler tests.
Acknowledgment
This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and of the ALCF, which is supported by DOE/SC under contract DE-AC02-06CH11357. B. Joo acknowledges funding from the DOE Office of Science, Offices of Nuclear Physics and Advanced Scientific Computing Research, through the SciDAC program. B. Joo also acknowledges support from the U.S. DOE Exascale Computing Project (ECP). This work is supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177. B. Joo would like to thank and acknowledge Kate Clark of NVIDIA for many discussions about expressing and mapping parallelism in multi-grid solver components in a variety of programming models and hardware, and for her helpful comments after a reading of this manuscript, as well as Christian Trott of Sandia Labs for discussions about nested parallelism in OpenMP. This work used resources provided by the Performance Research Laboratory at the University of Oregon. We would especially like to thank Sameer Shende and Rob Yelle for their professional support of the Performance Research Laboratory computers and their timely response to our requests.
Appendices
Appendix A - AVX512 SIMD Routines
We show below the code for the complex matrix-vector multiplication, in which a SIMD-length column in_v is multiplied by a complex scalar whose first element is pointed to by in_s. The resulting vector is accumulated onto out_v.
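The AVX512 listing itself is not reproduced here; as a guide to its semantics, the following scalar C sketch (with hypothetical names and a hypothetical SIMD length VLEN) shows the complex multiply-accumulate that the intrinsics vectorize: each of the VLEN complex elements of in_v is multiplied by the complex scalar at in_s and accumulated onto out_v.

```c
#include <assert.h>

#define VLEN 8 /* hypothetical SIMD length: complex numbers per vector */

/* Scalar reference for the CMadd semantics: out_v += s * in_v, where s
 * is the complex scalar whose real part is pointed to by in_s (imaginary
 * part immediately follows) and vectors are interleaved re,im pairs. */
static void cmadd_ref(float *restrict out_v,
                      const float *restrict in_s,
                      const float *restrict in_v)
{
    const float s_re = in_s[0];
    const float s_im = in_s[1];
    for (int i = 0; i < VLEN; ++i) {
        const float v_re = in_v[2 * i];
        const float v_im = in_v[2 * i + 1];
        out_v[2 * i]     += s_re * v_re - s_im * v_im; /* real part */
        out_v[2 * i + 1] += s_re * v_im + s_im * v_re; /* imag part */
    }
}
```

In the AVX512 version the loop body becomes `_mm512` shuffle and FMA intrinsics over 512-bit registers, but the arithmetic is the same.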
Once a routine such as CMadd is available, the two block-diagonal matrices for a given site can be applied with code like that below. Here we assume that the final output has been initialized outside this routine (either to zero, or already holding some inner sites' worth of data).
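As a hypothetical sketch of that application step (names, the block dimension N_BLOCK, and a scalar stand-in for CMadd are all illustrative assumptions): each of the two column-major block-diagonal matrices is applied by one CMadd-style call per matrix element, accumulating into an output that the caller has already initialized.

```c
#include <assert.h>

#define N_BLOCK 2 /* hypothetical block dimension, for illustration only */

/* Scalar stand-in for the SIMD CMadd routine: out += s * in for one
 * complex element, interleaved re/im storage. */
static void cmadd(float *out, const float *s, const float *in)
{
    out[0] += s[0] * in[0] - s[1] * in[1];
    out[1] += s[0] * in[1] + s[1] * in[0];
}

/* Apply the two block-diagonal matrices of one site: out_b += M_b * in_b
 * for b = 0,1, each M_b an N_BLOCK x N_BLOCK complex matrix stored
 * column-major. out is assumed initialized by the caller, as in the text. */
static void apply_site(float *restrict out,
                       const float *restrict blocks,
                       const float *restrict in)
{
    for (int b = 0; b < 2; ++b) {
        const float *blk  = blocks + b * 2 * N_BLOCK * N_BLOCK;
        const float *in_b = in + b * 2 * N_BLOCK;
        float *out_b      = out + b * 2 * N_BLOCK;

        for (int col = 0; col < N_BLOCK; ++col)
            for (int row = 0; row < N_BLOCK; ++row)
                cmadd(out_b + 2 * row,
                      blk + 2 * (col * N_BLOCK + row),
                      in_b + 2 * col);
    }
}
```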
Appendix B - Parallel Reduction for Manual Nesting
The code for the parallel reduction (parallel over blocks, with one site per block) is listed below. We note that, rather than closing the preceding region and re-opening it as shown, we could have attempted to keep a single parallel region and used #pragma omp barrier to ensure all results were written before summing. However, the barrier would ideally apply only to the group of threads within a block, whereas an OpenMP barrier synchronizes all active threads in the team. Given that some may be idle, this could lead to much messier code.
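The structure can be sketched as follows (hypothetical names and sizes): a first parallel region writes one partial result per block, the implicit barrier at the close of that region guarantees all partials are visible, and a second region reduces them.

```c
#include <assert.h>

#define N_BLOCKS 16 /* hypothetical: one partial result per block */

/* Two-region reduction: closing the first parallel region acts as the
 * barrier ensuring every per-block partial is written before the sum. */
static double block_reduce(const double *site_data, double *partial)
{
#pragma omp parallel for
    for (int blk = 0; blk < N_BLOCKS; ++blk)
        partial[blk] = site_data[blk] * site_data[blk]; /* e.g. local norm2 */

    /* Second region: reduce the per-block partials. */
    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
    for (int blk = 0; blk < N_BLOCKS; ++blk)
        sum += partial[blk];
    return sum;
}
```

Without -fopenmp the pragmas are ignored and the code runs serially with the same result, which makes the sketch easy to check.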
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Joó, B., Kurth, T. (2018). Lessons Learned from Optimizing Kernels for Adaptive Aggregation Multi-grid Solvers in Lattice QCD. In: Yokota, R., Weiland, M., Shalf, J., Alam, S. (eds) High Performance Computing. ISC High Performance 2018. Lecture Notes in Computer Science(), vol 11203. Springer, Cham. https://doi.org/10.1007/978-3-030-02465-9_34
Print ISBN: 978-3-030-02464-2
Online ISBN: 978-3-030-02465-9