The DiamondTetris Algorithm for Maximum Performance Vectorized Stencil Computation

  • Vadim Levchenko
  • Anastasia Perepelkina
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10421)


DiamondTetris, an algorithm from the LRnLA family for stencil computation, is constructed. It targets Many Integrated Core processors of the Xeon Phi family. The algorithm and its implementation are described for a simulation based on the wave equation. Its strong points are locality, efficient use of the memory hierarchy, and, most importantly, seamless vectorization: only one vector rearrange operation is necessary per cell value update. The performance is estimated with the roofline model. The algorithm is implemented in code and tested on Xeon and Xeon Phi machines.
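The roofline estimate mentioned in the abstract can be illustrated with a minimal sketch. In the roofline model, attainable performance is bounded by the smaller of the peak compute rate and the product of memory bandwidth and operational intensity (FLOP per byte moved). The function names and the machine numbers below are hypothetical placeholders chosen for illustration; they are not taken from the paper and do not describe any particular Xeon or Xeon Phi system.

```python
def roofline(peak_gflops, bandwidth_gbs, intensity):
    """Attainable GFLOP/s under the roofline model.

    peak_gflops   -- peak compute rate, GFLOP/s
    bandwidth_gbs -- memory bandwidth, GB/s
    intensity     -- operational intensity, FLOP per byte
    """
    if peak_gflops <= 0 or bandwidth_gbs <= 0 or intensity < 0:
        raise ValueError("peak and bandwidth must be positive, intensity non-negative")
    # Memory-bound below the ridge point, compute-bound above it.
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Operational intensity at which the two bounds meet."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine parameters (assumptions, not measured values).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

if __name__ == "__main__":
    for intensity in (0.5, 2.0, 10.0, 50.0):
        bound = roofline(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
        print(f"intensity {intensity:5.1f} FLOP/byte -> at most {bound:7.1f} GFLOP/s")
```

Under this model, a stencil code whose operational intensity lies below the ridge point is limited by memory traffic, which is why improving locality and data reuse matters for this class of computation.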



Access to computing resources with Intel Xeon Phi KNL was provided by Colfax Research in the course of the “Deep Dive” HOW series.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Keldysh Institute of Applied Mathematics RAS, Moscow, Russia
