
Lessons Learned from Optimizing Kernels for Adaptive Aggregation Multi-grid Solvers in Lattice QCD

  • Conference paper
  • In: High Performance Computing (ISC High Performance 2018)

Abstract

In recent years, adaptive aggregation multi-grid (AAMG) methods have become the gold standard for solving the Dirac equation in Lattice QCD (LQCD) using Wilson-Clover fermions. These methods are able to overcome the critical slowing down as quark masses approach their physical values and are thus the go-to method for performing Lattice QCD calculations at realistic physical parameters. In this paper we discuss the optimization of a specific building block for implementing AAMG for Wilson-Clover fermions from LQCD, known as the coarse restrictor operator, on contemporary Intel processors featuring large SIMD widths and high thread counts. We discuss in detail the efficient use of OpenMP and Intel vector intrinsics in our attempts to exploit fine-grained parallelism on the coarsest levels. We present performance optimizations and discuss the ramifications for implementing a full AAMG stack on Intel Xeon Phi Knights Landing and Skylake processors.


Notes

  1. Due to a bug in the libgomp runtime, we replaced spread with true to achieve correct binding for the GCC compiler tests.


Acknowledgment

This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and of the ALCF, which is supported by DOE/SC under contract DE-AC02-06CH11357. B. Joo acknowledges funding from the DOE Office of Science, Offices of Nuclear Physics and Advanced Scientific Computing Research, through the SciDAC program. B. Joo also acknowledges support from the U.S. DOE Exascale Computing Project (ECP). This work is supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177. B. Joo would like to thank and acknowledge Kate Clark of NVIDIA for many discussions about expressing and mapping parallelism in multi-grid solver components across a variety of programming models and hardware, and for her helpful comments after reading this manuscript, as well as Christian Trott of Sandia Labs for discussions about nested parallelism in OpenMP. This work used resources provided by the Performance Research Laboratory at the University of Oregon. We would especially like to thank Sameer Shende and Rob Yelle for their professional support of the Performance Research Laboratory computers and their timely response to our requests.

Author information

Correspondence to Thorsten Kurth.


Appendices

Appendix A - AVX512 SIMD Routines

We show below the code for the complex matrix-vector multiplication, in which a SIMD-length column in_v is multiplied by a complex scalar whose first element is pointed to by in_s. The resulting vector is accumulated onto out_v.

[Listing (figure a): the CMadd routine]
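Since the published listing is not reproduced above, the following is a minimal sketch of what such a CMadd routine could look like with AVX-512 intrinsics. It assumes single-precision complex numbers stored with interleaved real/imaginary parts and a vector length of 16 floats; the data layout and the exact signature are assumptions, not the paper's implementation.

    #include <immintrin.h>

    // Sketch only: out_v += s * in_v for one SIMD vector of single-precision
    // complexes stored re/im interleaved. The complex scalar s is read from
    // in_s (in_s[0] = Re(s), in_s[1] = Im(s)).
    static inline void CMadd(const float* in_s, const float* in_v, float* out_v)
    {
        const __m512 v     = _mm512_loadu_ps(in_v);       // [re0, im0, re1, im1, ...]
        const __m512 v_swp = _mm512_permute_ps(v, 0xB1);   // [im0, re0, im1, re1, ...]
        const __m512 s_re  = _mm512_set1_ps(in_s[0]);      // broadcast Re(s)
        const __m512 s_im  = _mm512_set1_ps(in_s[1]);      // broadcast Im(s)

        // Even lanes (real parts): s_re*re - s_im*im; odd lanes (imag parts): s_re*im + s_im*re.
        const __m512 prod  = _mm512_fmaddsub_ps(s_re, v, _mm512_mul_ps(s_im, v_swp));

        _mm512_storeu_ps(out_v, _mm512_add_ps(_mm512_loadu_ps(out_v), prod));
    }

With a 16-wide single-precision register this processes 8 complex numbers, i.e. one SIMD-length column, per call.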

Once a routine such as CMadd is available, the two block-diagonal matrices for a given site can be applied with code like that shown below. Here we assume that the final output has been initialized outside this routine (either to zero, or already holding some inner site's worth of data).

[Listing (figure b): applying the two block-diagonal matrices for one site]
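Again as a sketch rather than the published listing: assuming each chiral block is an N x N complex matrix stored column-major, with one column filling exactly one SIMD chunk, the per-site application could be structured as below. The constants, the array layout, and the name siteApplyBlockDiag are assumptions.

    // Assumes a CMadd(in_s, in_v, out_v) routine as sketched above:
    // out_v += (*in_s) * in_v for one SIMD-length column of complexes.
    void CMadd(const float* in_s, const float* in_v, float* out_v);

    constexpr int N_BLOCKS = 2;       // two chiral blocks per site
    constexpr int N        = 8;       // rows/columns per block (8 complexes)
    constexpr int VECLEN   = 2 * N;   // floats per matrix column / per block vector

    // 'out' must be initialized by the caller (to zero, or to a partial sum).
    void siteApplyBlockDiag(const float* A,   // [N_BLOCKS][N][VECLEN] column-major blocks
                            const float* in,  // [N_BLOCKS][VECLEN]
                            float* out)       // [N_BLOCKS][VECLEN]
    {
        for (int b = 0; b < N_BLOCKS; ++b) {
            const float* A_b   = A   + b * N * VECLEN;
            const float* in_b  = in  + b * VECLEN;
            float*       out_b = out + b * VECLEN;

            for (int col = 0; col < N; ++col) {
                // out_b += A_b(:, col) * in_b[col]  (complex axpy of one matrix column)
                CMadd(&in_b[2 * col], &A_b[col * VECLEN], out_b);
            }
        }
    }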

Appendix B - Parallel Reduction for Manual Nesting

The code for the parallel reduction (parallel over blocks, with one site per block) is listed below. We note that, rather than closing the preceding parallel region and re-opening it as below, we could have kept a single parallel region and used #pragma omp barrier to ensure all results were written before summing. Ideally, however, the barrier would need to apply only to the group of threads within a block, whereas the OpenMP barrier would synchronize all active threads in the team. Given that some may be idle, this could lead to much messier code.

[Listing (figure c): parallel reduction over blocks]
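As an illustration of the structure described above (not the original listing), the re-opened region might be a parallel loop over blocks in which each thread sums the partial results previously written for its block. The partial-result layout and all names here are assumptions.

    // Sketch only: the preceding parallel region has written per-thread partial
    // results into 'partial' (threads_per_block slices per block); this routine
    // re-opens a parallel region over blocks and sums the slices for each block.
    void reduceOverBlocks(int n_blocks, int threads_per_block, int block_len,
                          const float* partial, // [n_blocks][threads_per_block][block_len]
                          float* out)           // [n_blocks][block_len]
    {
    #pragma omp parallel for schedule(static)
        for (int blk = 0; blk < n_blocks; ++blk) {
            float* out_b = out + blk * block_len;
            for (int i = 0; i < block_len; ++i) out_b[i] = 0.0f;

            for (int t = 0; t < threads_per_block; ++t) {
                const float* p = partial + (blk * threads_per_block + t) * block_len;
                for (int i = 0; i < block_len; ++i) out_b[i] += p[i];
            }
        }
    }

Because the loop is parallel only over blocks, no barrier across the full thread team is needed; each block's sum is computed entirely by the thread that owns it.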


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Joó, B., Kurth, T. (2018). Lessons Learned from Optimizing Kernels for Adaptive Aggregation Multi-grid Solvers in Lattice QCD. In: Yokota, R., Weiland, M., Shalf, J., Alam, S. (eds) High Performance Computing. ISC High Performance 2018. Lecture Notes in Computer Science, vol. 11203. Springer, Cham. https://doi.org/10.1007/978-3-030-02465-9_34


  • DOI: https://doi.org/10.1007/978-3-030-02465-9_34


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-02464-2

  • Online ISBN: 978-3-030-02465-9

  • eBook Packages: Computer Science, Computer Science (R0)
