The Journal of Supercomputing, Volume 75, Issue 12, pp 7778–7789

The DiamondCandy LRnLA algorithm: raising efficiency of the 3D cross-stencil schemes

  • Anastasia Perepelkina
  • Vadim Levchenko
  • Sergey Khilkov


Parallel efficiency is raised by increasing the locality of calculation. Using the locally recursive non-locally asynchronous (LRnLA) algorithms method, we have constructed a new algorithm that improves the locality of the cross-stencil scheme implementation by decomposing the 3D computational domain in both time and space. The decomposition is based on a tiling of the 3D1T space (three spatial dimensions plus time) into hexahedrons that closely fit the octahedron shape. This shape leads to an algorithm that is less intuitive than the rectangular domain decomposition, but since it follows the natural shape of the dependency region of the cross stencil, it has advantages in data localization and parallelization possibilities. We show its construction, analysis, and implementation possibilities. We present the benchmark results and show that the algorithm follows quantitative estimations: the performance exceeds the memory-bound limit of the stepwise implementation and does not degrade when the data of the whole domain do not fit into the higher cache levels.
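The remark that the octahedron is the natural shape of the cross-stencil dependency region can be checked with a short script. The sketch below is only an illustration of that geometric fact, not the DiamondCandy algorithm itself. It assumes the simplest 7-point cross stencil (a cell and its six face neighbours), and the function names are hypothetical.

```python
from itertools import product

# Offsets of an assumed 7-point cross stencil in 3D:
# the cell itself plus its six face neighbours.
CROSS = [
    (0, 0, 0),
    (1, 0, 0), (-1, 0, 0),
    (0, 1, 0), (0, -1, 0),
    (0, 0, 1), (0, 0, -1),
]


def dependency_region(steps):
    """Cells at the initial time that the cell (0, 0, 0) depends on after `steps` updates."""
    region = {(0, 0, 0)}
    for _ in range(steps):
        # One more time step back: every cell pulls in its stencil neighbours.
        region = {
            (x + dx, y + dy, z + dz)
            for (x, y, z) in region
            for (dx, dy, dz) in CROSS
        }
    return region


def octahedron(radius):
    """Lattice points with |x| + |y| + |z| <= radius."""
    r = range(-radius, radius + 1)
    return {
        (x, y, z)
        for x, y, z in product(r, r, r)
        if abs(x) + abs(y) + abs(z) <= radius
    }


if __name__ == "__main__":
    for n in range(1, 5):
        region = dependency_region(n)
        cube_cells = (2 * n + 1) ** 3  # bounding cube of the same extent
        print(n, region == octahedron(n), len(region), cube_cells)
```

For each number of steps, the script compares the dependency region with the lattice octahedron of the same radius and prints its cell count next to that of the bounding cube. The octahedron holds far fewer cells than the cube, which is one way to see why a rectangular decomposition in time and space carries cells that the cross stencil never touches.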


LRnLA algorithms; Cross stencil; Parallel algorithms; Roofline model


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Anastasia Perepelkina (1)
  • Vadim Levchenko (1)
  • Sergey Khilkov (2)

  1. Keldysh Institute of Applied Mathematics, Moscow, Russia
  2. HIPERCONE Ltd., Moscow, Russia
