International Journal of Parallel Programming, Volume 39, Issue 1, pp 115–142

A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations



Abstract

Iterative stencil loops (ISLs) are used in many applications, and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processing units (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in distributed environments. To automate this process on shared-memory systems, we establish a performance model using NVIDIA's Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the best-performing ghost zone size and generate the appropriate code. The modeling is validated on four diverse ISL applications, for which the predicted ghost zone configurations achieve no less than 95% of the optimal speedup.
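To make the trade-off concrete, below is a minimal CUDA sketch (not the authors' generated code) of a 1-D Jacobi stencil in which each thread block loads a ghost zone of GHOST cells on either side of its tile and then advances GHOST time steps entirely in shared memory, so a global synchronization (a kernel relaunch) is needed only once every GHOST steps. The tile width, ghost zone width, and 3-point stencil weights are illustrative assumptions.

```cuda
// Minimal sketch of ghost zone (halo replication) tiling for a 1-D Jacobi
// stencil. TILE and GHOST are hypothetical parameters; GHOST also equals the
// number of time steps computed per kernel launch.
#include <cuda_runtime.h>

#define TILE  256   // interior cells each block produces (assumption)
#define GHOST 4     // ghost zone width = time steps per launch (assumption)

__global__ void jacobi_ghost(const float *in, float *out, int n)
{
    // Double-buffered shared memory: tile plus GHOST halo cells per side.
    __shared__ float buf[2][TILE + 2 * GHOST];
    int base = blockIdx.x * TILE - GHOST;  // leftmost global cell this block reads

    // Cooperatively load tile + halo, clamping indices at the array ends.
    for (int i = threadIdx.x; i < TILE + 2 * GHOST; i += blockDim.x)
        buf[0][i] = in[min(max(base + i, 0), n - 1)];
    __syncthreads();

    // Each step consumes one ghost cell per side; the redundant border work
    // replaces inter-block communication and global synchronization.
    int cur = 0;
    for (int t = 1; t <= GHOST; ++t) {
        for (int i = threadIdx.x + t; i < TILE + 2 * GHOST - t; i += blockDim.x)
            buf[cur ^ 1][i] = 0.5f * buf[cur][i]
                            + 0.25f * (buf[cur][i - 1] + buf[cur][i + 1]);
        cur ^= 1;
        __syncthreads();
    }

    // Write back only the interior cells this block owns.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x) {
        int g = base + GHOST + i;
        if (g < n) out[g] = buf[cur][GHOST + i];
    }
}
```

Launching with a grid of ceil(n / TILE) blocks, a larger GHOST amortizes more kernel launches per time step but enlarges the redundantly computed border region; selecting the value that balances these two costs is precisely what the paper's performance model automates.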


Keywords: Ghost zone · Halo · Performance model · Iterative stencil loops · GPU · Tiling





Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

1. Department of Computer Science, University of Virginia, Charlottesville, USA
