Optimizing a Multiple Right-Hand Side Dslash Kernel for Intel Knights Corner

Walden, Aaron; Khan, Sabbir; Joó, Bálint; Ranjan, Desh; Zubair, Mohammad

doi:10.1007/978-3-319-46079-6_28

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9945)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

There is significant interest in the computational physics community in performing lattice quantum chromodynamics (LQCD) simulations, which can run into the trillions of operations. LQCD computations solve a sparse linear system using a Wilson Dslash kernel, which has an arithmetic intensity of 0.88–2.29. This makes Dslash memory bandwidth-bound on most architectures, including Intel Xeon Phi Knights Corner (KNC). Most research on optimizing the Dslash operator has focused on single right-hand side (SRHS) linear solvers. A class of LQCD computations instead solves systems with multiple right-hand sides (MRHS), which presents additional opportunities for data reuse and vectorization. We present two approaches to MRHS Dslash: a vector register blocking approach and one using the software package QPhiX with a custom code generator for low-level intrinsics. We observed significant speedups using our approaches, with sustained performance of over 700 GFLOPS (single precision) in one instance. We achieved up to 29% of theoretical peak performance, compared to a maximum of 13% obtained by the previous SRHS method using QPhiX.
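The link between arithmetic intensity, bandwidth-bound performance, and MRHS data reuse can be illustrated with a small traffic model. The sketch below is not the authors' model. It assumes a commonly quoted count of 1320 flops per lattice site, 24-real spinors, uncompressed 18-real gauge links, no cache reuse of neighboring spinors, and single precision. The function names and the bandwidth and peak parameters are hypothetical, and the model is not expected to reproduce the 0.88–2.29 range quoted above exactly.

```python
# Simple per-site traffic model for a Wilson Dslash update (single precision).
# Every constant here is an assumption of this sketch, not a figure from the paper.
FLOPS_PER_SITE = 1320   # commonly quoted flop count per lattice site
SPINOR_REALS = 24       # 4 spins x 3 colors x (re, im)
GAUGE_REALS = 18        # 3x3 complex link matrix, uncompressed
NEIGHBORS = 8           # forward and backward neighbors in 4 dimensions
BYTES_PER_REAL = 4      # single precision


def arithmetic_intensity(n_rhs: int) -> float:
    """Flops per byte when n_rhs right-hand sides share one load of the gauge links."""
    if n_rhs < 1:
        raise ValueError("n_rhs must be at least 1")
    flops = FLOPS_PER_SITE * n_rhs
    # Each right-hand side reads 8 neighbor spinors and writes 1 output spinor.
    spinor_reals = (NEIGHBORS + 1) * SPINOR_REALS * n_rhs
    # The 8 gauge links are loaded once per site and reused for every right-hand side.
    gauge_reals = NEIGHBORS * GAUGE_REALS
    return flops / ((spinor_reals + gauge_reals) * BYTES_PER_REAL)


def bandwidth_bound_gflops(intensity: float, bandwidth_gb_s: float, peak_gflops: float) -> float:
    """Roofline-style bound: the lower of compute peak and intensity times bandwidth."""
    return min(peak_gflops, intensity * bandwidth_gb_s)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(arithmetic_intensity(n), 3))
```

Under these assumptions the modeled intensity rises from about 0.92 flops per byte for one right-hand side to about 1.41 for eight, because the gauge-link traffic is amortized across the right-hand sides. While the kernel stays bandwidth-bound, the attainable rate in `bandwidth_bound_gflops` scales with that intensity.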

Notice: Authored by Jefferson Science Associates, LLC under U.S. DOE Contract No. DE-AC05-06OR23177. The U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce this manuscript for U.S. Government purposes.

Acknowledgments

This work was partially supported by a grant from Jefferson Lab. Aaron Walden and Sabbir Khan were also partially supported by the Old Dominion University Modeling and Simulation Fellowship Program and gratefully acknowledge this support. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177.

Author information

Authors and Affiliations

Department of Computer Science, Old Dominion University, Norfolk, VA, 23529, USA
Aaron Walden, Sabbir Khan, Desh Ranjan & Mohammad Zubair
Thomas Jefferson National Accelerator Facility, Newport News, VA, 23606, USA
Bálint Joó

Corresponding author

Correspondence to Aaron Walden.

Editor information

Editors and Affiliations

University of Delaware, Newark, Delaware, USA
Michela Taufer
Forschungszentrum Jülich, Jülich, Germany
Bernd Mohr
DKRZ, Hamburg, Germany
Julian M. Kunkel

About this paper

Cite this paper

Walden, A., Khan, S., Joó, B., Ranjan, D., Zubair, M. (2016). Optimizing a Multiple Right-Hand Side Dslash Kernel for Intel Knights Corner. In: Taufer, M., Mohr, B., Kunkel, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9945. Springer, Cham. https://doi.org/10.1007/978-3-319-46079-6_28

DOI: https://doi.org/10.1007/978-3-319-46079-6_28
Published: 06 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46078-9
Online ISBN: 978-3-319-46079-6
eBook Packages: Computer Science (R0)
