Model-Driven Tile Size Selection for DOACROSS Loops on GPUs

Di, Peng; Xue, Jingling

doi:10.1007/978-3-642-23397-5_40

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6853)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

DOALL loops are tiled to exploit DOALL parallelism and data locality on GPUs. In contrast, because of their loop-carried dependences, DOACROSS loops must first be skewed to make tiling legal and to exploit wavefront parallelism both across the tiles and within a tile. Tile size selection, which is performance-critical, is therefore more complex for DOACROSS loops than for DOALL loops on GPUs. This paper presents a model-driven approach to automating this process. Validation using 1D, 2D and 3D SOR solvers shows that our framework can find tile sizes for these representative DOACROSS loops that achieve performance close to the best observed for a range of problem sizes tested.
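
As an illustration of the skew-then-tile idea described in the abstract, the following is a minimal CPU-side sketch in Python for a 1D SOR-style loop nest. It is not the authors' GPU framework or performance model. The function names, the update formula, and the tile sizes `bt` and `bs` are assumptions chosen for illustration. The time-iterated loop has a dependence with a negative component, so rectangular tiles are not legal in the original (t, i) space. Skewing with s = i + t makes every dependence distance non-negative, after which tiles on the same anti-diagonal (tt + ss = w) are mutually independent and could be run in parallel.

```python
def sor_1d_sequential(a, steps, omega=1.0):
    """Reference in-place 1D SOR-style sweeps in the original (t, i) order."""
    a = list(a)
    n = len(a)
    for _t in range(steps):
        for i in range(1, n - 1):
            a[i] = (1.0 - omega) * a[i] + omega * 0.5 * (a[i - 1] + a[i + 1])
    return a


def sor_1d_tiled_wavefront(a, steps, bt, bs, omega=1.0):
    """Same computation after skewing (s = i + t) and rectangular tiling.

    bt and bs are hypothetical tile sizes along t and s. Tiles are visited
    wavefront by wavefront; tiles within one wavefront are independent.
    """
    a = list(a)
    n = len(a)
    if n < 3 or steps <= 0:
        return a
    s_max = (n - 2) + (steps - 1)        # largest skewed index s = i + t
    num_tt = (steps + bt - 1) // bt      # number of tiles along t
    num_ss = s_max // bs + 1             # number of tiles along s
    for w in range(num_tt + num_ss - 1):             # wavefront of tiles
        tt_lo = max(0, w - num_ss + 1)
        tt_hi = min(w, num_tt - 1)
        for tt in range(tt_lo, tt_hi + 1):           # independent tiles
            ss = w - tt
            for t in range(tt * bt, min((tt + 1) * bt, steps)):
                for s in range(ss * bs, min((ss + 1) * bs, s_max + 1)):
                    i = s - t                        # undo the skew
                    if 1 <= i <= n - 2:              # skip points outside the domain
                        a[i] = (1.0 - omega) * a[i] + omega * 0.5 * (a[i - 1] + a[i + 1])
    return a


if __name__ == "__main__":
    grid = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]
    ref = sor_1d_sequential(grid, steps=5, omega=1.5)
    out = sor_1d_tiled_wavefront(grid, steps=5, bt=2, bs=3, omega=1.5)
    print(out == ref)
```

In this sketch the loop over `tt` is the one a GPU implementation would distribute across thread blocks. The choice of `bt` and `bs` changes how many tiles each wavefront contains and how much work each tile holds, which is the trade-off that tile size selection has to balance.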

This research is supported by an Australian Research Council Grant DP110104628.

Author information

Authors and Affiliations

Programming Languages and Compilers Group, School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
Peng Di & Jingling Xue

Editor information

Editors and Affiliations

Equipe Runtime, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France
Emmanuel Jeannot & Raymond Namyst
Equipe HIEPACS, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France
Jean Roman

About this paper

Cite this paper

Di, P., Xue, J. (2011). Model-Driven Tile Size Selection for DOACROSS Loops on GPUs. In: Jeannot, E., Namyst, R., Roman, J. (eds) Euro-Par 2011 Parallel Processing. Euro-Par 2011. Lecture Notes in Computer Science, vol 6853. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23397-5_40

DOI: https://doi.org/10.1007/978-3-642-23397-5_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23396-8
Online ISBN: 978-3-642-23397-5
eBook Packages: Computer Science, Computer Science (R0)
