Optimizing a Multiple Right-Hand Side Dslash Kernel for Intel Knights Corner

  • Conference paper
High Performance Computing (ISC High Performance 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9945)

Abstract

There is significant interest in the computational physics community in performing lattice quantum chromodynamics (LQCD) simulations, which can run into the trillions of operations. LQCD computations solve a sparse linear system using a Wilson Dslash kernel, which has an arithmetic intensity of 0.88–2.29. This makes Dslash memory-bandwidth-bound on most architectures, including the Intel Xeon Phi Knights Corner (KNC). Most research on optimizing the Dslash operator has focused on single right-hand side (SRHS) linear solvers. A class of LQCD computations instead solves systems with multiple right-hand sides (MRHS), which presents additional opportunities for data reuse and vectorization. We present two approaches to MRHS Dslash: a vector register blocking approach, and one using the software package QPhiX with a custom code generator for low-level intrinsics. We observed significant speedups using our approaches, with sustained performance of over 700 GFLOPS (single precision) in one instance. We achieved up to 29% of theoretical peak performance, compared to a maximum of 13% obtained by the previous SRHS method using QPhiX.
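The link between low arithmetic intensity and a bandwidth-bound kernel can be illustrated with a minimal roofline-style sketch. It is not taken from the paper: the function name and the peak and bandwidth values are hypothetical placeholders rather than measured KNC figures, and reading the quoted intensity as flops per byte is an assumption.

```python
def roofline_gflops(arithmetic_intensity, bandwidth_gb_s, peak_gflops):
    """Upper bound on attainable GFLOPS in a simple roofline model.

    A kernel can go no faster than the machine's compute peak, and no
    faster than (flops per byte) * (bytes per second from memory).
    """
    return min(peak_gflops, arithmetic_intensity * bandwidth_gb_s)


# Placeholder machine parameters: assumptions for illustration only,
# not values reported in the paper.
PEAK_GFLOPS = 2000.0      # hypothetical single-precision peak
BANDWIDTH_GB_S = 150.0    # hypothetical sustained memory bandwidth

# The two ends of the arithmetic-intensity range quoted in the abstract.
for ai in (0.88, 2.29):
    bound = roofline_gflops(ai, BANDWIDTH_GB_S, PEAK_GFLOPS)
    share = 100.0 * bound / PEAK_GFLOPS
    print(f"AI = {ai:.2f} flop/byte: at most {bound:.1f} GFLOPS "
          f"({share:.1f}% of peak)")
```

With any machine whose peak-to-bandwidth ratio is well above 2.29 flops per byte, both ends of the range fall on the bandwidth-limited side of the model. This is why more reuse of each loaded byte, as in the MRHS setting, is the natural lever for raising the fraction of peak.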

Notice: Authored by Jefferson Science Associates, LLC under U.S. DOE Contract No. DE-AC05-06OR23177. The U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce this manuscript for U.S. Government purposes.

Acknowledgments

This work was partially supported by a grant from Jefferson Lab. Aaron Walden and Sabbir Khan were also partially supported by the Old Dominion University Modeling and Simulation Fellowship Program and gratefully acknowledge this support. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177.

Author information

Corresponding author

Correspondence to Aaron Walden.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Walden, A., Khan, S., Joó, B., Ranjan, D., Zubair, M. (2016). Optimizing a Multiple Right-Hand Side Dslash Kernel for Intel Knights Corner. In: Taufer, M., Mohr, B., Kunkel, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9945. Springer, Cham. https://doi.org/10.1007/978-3-319-46079-6_28

  • DOI: https://doi.org/10.1007/978-3-319-46079-6_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46078-9

  • Online ISBN: 978-3-319-46079-6

  • eBook Packages: Computer Science; Computer Science (R0)
