Optimizing a Multiple Right-Hand Side Dslash Kernel for Intel Knights Corner

  • Aaron Walden
  • Sabbir Khan
  • Bálint Joó
  • Desh Ranjan
  • Mohammad Zubair
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9945)


There is significant interest in the computational physics community in performing lattice quantum chromodynamics (LQCD) simulations, which can require trillions of operations. LQCD computations solve a sparse linear system using the Wilson Dslash kernel, which has an arithmetic intensity of 0.88–2.29. This makes Dslash memory bandwidth-bound on most architectures, including Intel Xeon Phi Knights Corner (KNC). Most research on optimizing the Dslash operator has focused on single right-hand side (SRHS) linear solvers. A class of LQCD computations instead solves systems with multiple right-hand sides (MRHS), which presents additional opportunities for data reuse and vectorization. We present two approaches to MRHS Dslash: a vector register blocking approach and one using the software package QPhiX with a custom code generator for low-level intrinsics. We observed significant speedups using our approaches, with sustained performance of over 700 GFLOPS (single precision) in one instance. We achieved up to 29% of theoretical peak performance, compared to a maximum of 13% obtained by the previous SRHS method using QPhiX.
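The data-reuse argument above can be illustrated with a toy calculation. The sketch below is not the paper's Dslash kernel: it assumes a single 3x3 complex matrix in single precision, loaded once and applied to several right-hand-side 3-component complex vectors, and counts flops against bytes moved. The function names `arithmetic_intensity` and `roofline_gflops` are hypothetical, and any peak or bandwidth figures passed to the roofline bound are placeholders rather than measured KNC values.

```python
# Toy model (an illustrative assumption, not the paper's kernel):
# one 3x3 complex matrix is loaded once and applied to n_rhs vectors.

COMPLEX_BYTES = 8                   # single-precision complex: two 4-byte floats
MATRIX_BYTES = 9 * COMPLEX_BYTES    # 3x3 complex matrix
VECTOR_BYTES = 3 * COMPLEX_BYTES    # 3-component complex vector

# Each output component needs 3 complex multiplies (6 flops each)
# and 2 complex adds (2 flops each); there are 3 output components.
FLOPS_PER_MATVEC = 3 * (3 * 6 + 2 * 2)  # 66 flops


def arithmetic_intensity(n_rhs):
    """Flops per byte when one matrix is reused across n_rhs vectors."""
    flops = FLOPS_PER_MATVEC * n_rhs
    # The matrix is read once; every vector is read once and written once.
    bytes_moved = MATRIX_BYTES + 2 * VECTOR_BYTES * n_rhs
    return flops / bytes_moved


def roofline_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Simple roofline bound: limited by compute or by memory traffic."""
    return min(peak_gflops, intensity * bandwidth_gbs)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(arithmetic_intensity(n), 3))
```

In this toy model the intensity rises from 0.55 flops per byte for one right-hand side toward a limit of 66/48 = 1.375 as more vectors share the same matrix, which is the mechanism by which MRHS can relieve a bandwidth bound.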


LQCD, Optimization, Performance, Wilson-Dslash, Code generator, Parallel programming, Vectorization, Xeon Phi Knights Corner



This work was partially supported by a grant from Jefferson Lab. Aaron Walden and Sabbir Khan were also partially supported by the Old Dominion University Modeling and Simulation Fellowship Program and gratefully acknowledge this support. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics under contract DE-AC05-06OR23177.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Aaron Walden (1, corresponding author)
  • Sabbir Khan (1)
  • Bálint Joó (2)
  • Desh Ranjan (1)
  • Mohammad Zubair (1)
  1. Department of Computer Science, Old Dominion University, Norfolk, USA
  2. Thomas Jefferson National Accelerator Facility, Newport News, USA
