Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures

  • Tom Henretty
  • Kevin Stock
  • Louis-Noël Pouchet
  • Franz Franchetti
  • J. Ramanujam
  • P. Sadayappan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6601)


Stencil computations are at the core of applications in many domains, such as computational electromagnetics, image processing, and partial differential equation solvers used in a variety of scientific and engineering applications. Short-vector SIMD instruction sets such as SSE and VMX provide a promising and widely available avenue for enhancing performance on modern processors. However, a fundamental memory stream alignment issue limits the performance achieved by stencil computations on modern short-vector SIMD architectures. In this paper, we propose a novel data layout transformation that avoids the stream alignment conflict, along with a static analysis technique for determining where this transformation is applicable. Significant performance increases are demonstrated for a variety of stencil codes on three modern SIMD-capable processors.
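
The stream alignment conflict can be made concrete with a small scalar model. The sketch below is illustrative only and is not taken from the abstract: the vector length V = 4, the 3-point stencil, and every function name are assumptions, and the transposed layout is one possible transformation of this kind (view the array as a V x (N/V) matrix and transpose it), not necessarily the exact scheme the paper implements. It shows that in the original layout at most one of the three operand streams can start on a vector boundary, while in the transposed layout the neighbours of every interior vector are themselves whole, aligned vectors.

```python
V = 4  # assumed vector length, e.g. four single-precision floats per 128-bit register


def aligned_operands(i, offsets=(-1, 0, 1), v=V):
    """Stencil offsets whose load address i + off falls on a vector boundary."""
    return [off for off in offsets if (i + off) % v == 0]


def to_transposed_layout(a, v=V):
    """View a as a v x (n/v) row-major matrix and store its transpose."""
    n = len(a)
    assert n % v == 0 and n // v >= 2
    cols = n // v
    return [a[lane * cols + j] for j in range(cols) for lane in range(v)]


def from_transposed_layout(t, v=V):
    """Inverse of to_transposed_layout."""
    cols = len(t) // v
    return [t[j * v + lane] for lane in range(v) for j in range(cols)]


def stencil_reference(a):
    """out[i] = a[i-1] + a[i] + a[i+1] on interior points, ends copied."""
    out = list(a)
    for i in range(1, len(a) - 1):
        out[i] = a[i - 1] + a[i] + a[i + 1]
    return out


def stencil_transposed(t, v=V):
    """Same stencil, computed directly on the transposed layout."""
    cols = len(t) // v
    out = list(t)
    # Interior vectors: operands are vectors j-1, j, j+1, each starting at a
    # multiple of v, so a SIMD version would need only aligned loads here.
    for j in range(1, cols - 1):
        for lane in range(v):
            out[j * v + lane] = (
                t[(j - 1) * v + lane] + t[j * v + lane] + t[(j + 1) * v + lane]
            )
    # Boundary vectors: the missing neighbour lives in the adjacent lane of the
    # vector at the other end (a lane shift in a SIMD implementation).
    last = cols - 1
    for lane in range(v):
        if lane > 0:  # first vector, left neighbour wraps to the previous lane
            out[lane] = t[last * v + lane - 1] + t[lane] + t[v + lane]
        if lane < v - 1:  # last vector, right neighbour wraps to the next lane
            out[last * v + lane] = (
                t[(last - 1) * v + lane] + t[last * v + lane] + t[lane + 1]
            )
    return out
```

In this model, `aligned_operands(i)` never returns more than one offset for any `i`, which is the conflict in miniature: aligning one stream misaligns the other two. Converting with `to_transposed_layout`, applying `stencil_transposed`, and converting back gives the same result as `stencil_reference`, so the layout change alters only where operands sit in memory, not what is computed.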


Keywords: Single Precision; Data Layout; Access Function; Reuse Distance; Innermost Loop
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Tom Henretty (1)
  • Kevin Stock (1)
  • Louis-Noël Pouchet (1)
  • Franz Franchetti (2)
  • J. Ramanujam (3)
  • P. Sadayappan (1)

  1. The Ohio State University, USA
  2. Carnegie Mellon University, USA
  3. Louisiana State University, USA
