The Journal of Supercomputing, Volume 75, Issue 12, pp 7895–7908

Auto-tuning GEMM kernels on the Intel KNL and Intel Skylake-SP processors

  • Roktaek Lim
  • Yeongha Lee
  • Raehyun Kim
  • Jaeyoung Choi (corresponding author)
  • Myungho Lee


Abstract

General matrix–matrix multiplication (GEMM) is a core building block for implementing the Basic Linear Algebra Subprograms (BLAS). This paper presents a methodology for automatically producing matrix–matrix multiplication kernels tuned for the Intel Xeon Phi processor code-named Knights Landing (KNL) and the Intel Skylake-SP processors using AVX-512 intrinsic functions. The architectures of the latest manycore processors have grown complicated in their levels of parallelism and cache hierarchies, so it is not easy to find the best combination of optimization techniques for a given application. Our approach produces matrix multiplication kernels through heuristic auto-tuning: it generates multiple candidate kernels and selects the fastest ones through performance tests. The tuning parameters include the sizes of the block matrices for registers and caches, the prefetch distances, and the loop unrolling depth. Parameters for multithreaded execution, such as which loops to parallelize and the optimal number of threads for those loops, are also investigated. We also present a method to reduce the parameter search space based on our previous research results.


Keywords: Manycore · Intel Xeon Phi · Intel Skylake-SP · Auto-tuning · Matrix–matrix multiplication · AVX-512
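The paper's kernels are generated in C with AVX-512 intrinsics and benchmarked on real hardware; the generate-and-measure idea behind the auto-tuner can nevertheless be illustrated with a minimal, language-agnostic sketch. The Python code below (all function names are hypothetical, not from the paper) times a cache-blocked matrix multiply over a small set of candidate block sizes and keeps the fastest configuration, mirroring the abstract's "generate multiple kernels and select the fastest through performance tests" loop for a single tuning parameter.

```python
import time

def blocked_matmul(A, B, n, mb, nb, kb):
    """Blocked n-by-n matrix multiply on flat row-major lists.

    (mb, nb, kb) are the cache-block sizes: hypothetical stand-ins for
    the register/cache blocking parameters tuned in the paper.
    """
    C = [0.0] * (n * n)
    for i0 in range(0, n, mb):
        for k0 in range(0, n, kb):
            for j0 in range(0, n, nb):
                for i in range(i0, min(i0 + mb, n)):
                    for k in range(k0, min(k0 + kb, n)):
                        a = A[i * n + k]
                        for j in range(j0, min(j0 + nb, n)):
                            C[i * n + j] += a * B[k * n + j]
    return C

def autotune(n, candidates, trials=1):
    """Empirically pick the fastest (mb, nb, kb) from `candidates`."""
    A = [1.0] * (n * n)
    B = [1.0] * (n * n)
    best_time, best_params = None, None
    for mb, nb, kb in candidates:
        t0 = time.perf_counter()
        for _ in range(trials):
            blocked_matmul(A, B, n, mb, nb, kb)
        elapsed = time.perf_counter() - t0
        if best_time is None or elapsed < best_time:
            best_time, best_params = elapsed, (mb, nb, kb)
    return best_params
```

The real tuner additionally sweeps prefetch distances, unrolling depths, and threading parameters, and prunes the search space using prior results; this sketch shows only the innermost select-the-fastest step.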



The work was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Korea government (MSIT) (NRF-2015M3C4A7065662).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Soongsil University, Seoul, Korea
  2. Myongji University, Yongin, Korea
