A High Arithmetic Intensity Krylov Subspace Method Based on Stencil Compiler Programs

  • Simplice Donfack
  • Patrick Sanan
  • Olaf Schenk
  • Bram Reps
  • Wim Vanroose
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11087)


Stencil calculations and matrix-free Krylov subspace solvers are important components of many scientific computing applications. In these solvers, stencil applications are often the dominant part of the computation, so an efficient parallel implementation of the kernel is crucial to reduce the time to solution. Inspired by polynomial preconditioning, we remove upper bounds on the arithmetic intensity of the Krylov subspace building block by replacing the matrix with a higher-degree matrix polynomial. State-of-the-art stencil compiler programs with temporal blocking reduce memory bandwidth usage and consequently make better use of SIMD vectorization, which yields speedups on modern hardware. Using them, we obtain performance improvements for higher polynomial degrees than simpler cache-blocking approaches have yielded in the past, demonstrating the new appeal of polynomial techniques on emerging architectures. We present results in a shared-memory environment and an extension to a distributed-memory environment with local shared memory.
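To make the central idea concrete, the following is a minimal sketch of replacing a matrix A by a matrix polynomial p(A) in a matrix-free setting. It is not the authors' implementation: the 1D three-point Laplacian stencil, the zero boundary values, the coefficient ordering, and the function names `apply_stencil` and `apply_polynomial` are all assumptions chosen for illustration. A degree-d polynomial is applied with d consecutive stencil sweeps via Horner's rule; these back-to-back sweeps are the kind of loop nest that a stencil compiler with temporal blocking could fuse, which this plain sketch does not attempt.

```python
from typing import List, Sequence


def apply_stencil(x: Sequence[float]) -> List[float]:
    """Matrix-free 1D three-point Laplacian (illustrative assumption).

    Computes (A x)_i = 2 x_i - x_{i-1} - x_{i+1}, treating values
    outside the grid as zero.
    """
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y[i] = 2.0 * x[i] - left - right
    return y


def apply_polynomial(coeffs: Sequence[float], x: Sequence[float]) -> List[float]:
    """Apply p(A) x with p(t) = c0 + c1 t + ... + cd t^d using Horner's rule.

    coeffs is [c0, c1, ..., cd]. The loop performs d stencil sweeps and
    never forms A or any power of A explicitly.
    """
    # Start with the highest-degree coefficient: result = cd * x.
    result = [coeffs[-1] * xi for xi in x]
    # Each step computes result = c * x + A @ result.
    for c in reversed(coeffs[:-1]):
        a_result = apply_stencil(result)
        result = [c * xi + ai for xi, ai in zip(x, a_result)]
    return result


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    # p(A) x = x + 2 A x + 3 A^2 x
    print(apply_polynomial([1.0, 2.0, 3.0], x))
```

In a Krylov method, `apply_polynomial` would take the place of the single operator application per iteration, so each iteration does more arithmetic per vector read from memory.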


Keywords: Stencil compilers, Performance engineering, Krylov methods, Code generation, Autotuning, HPC, CG, Polynomial preconditioning



We thank Uday Bondhugula for helpful correspondence and upgrades of PLUTO, Karl Rupp for the data in Fig. 2, and Radim Janalik for initial results used in Fig. 5. We acknowledge the Swiss National Supercomputing Center (CSCS) and the University of Erlangen for computing resources. This research has been funded under the EU FP7-ICT project “Exascale Algorithms and Advanced Computational Techniques” (project reference 610741).



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Simplice Donfack (1)
  • Patrick Sanan (1)
  • Olaf Schenk (1, corresponding author)
  • Bram Reps (2)
  • Wim Vanroose (2)
  1. Institute of Computational Science, Università della Svizzera italiana (USI), Lugano, Switzerland
  2. University of Antwerp, Antwerp, Belgium
