Abstract
DOALL loops are tiled to exploit DOALL parallelism and data locality on GPUs. In contrast, because of loop-carried dependences, DOACROSS loops must first be skewed to make tiling legal and to expose wavefront parallelism both across tiles and within a tile. Tile size selection, which is performance-critical, is therefore more complex for DOACROSS loops than for DOALL loops on GPUs. This paper presents a model-driven approach to automating tile size selection. Validation using 1D, 2D and 3D SOR solvers shows that our framework finds tile sizes for these representative DOACROSS loops that achieve performance close to the best observed across the range of problem sizes tested.
This research is supported by Australian Research Council Grant DP110104628.
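The skew-then-tile idea summarized in the abstract can be illustrated with a minimal serial sketch. It assumes an in-place 1D SOR sweep repeated over several time steps, which is not necessarily the exact kernel used in the paper. In the original (t, i) iteration space the dependence distances include (1, -1), so rectangular tiling is not legal; after the skew j = i + t they become (0, 1), (1, 0) and (1, 1), so rectangular tiles in (t, j) are legal and all tiles with the same T + J form one wavefront. The function names, the `update` helper and the tile parameters `bt` and `bj` are hypothetical, and the sketch only demonstrates legality; it does not implement the paper's tile size model or any GPU mapping.

```python
def update(a, i, w):
    # One in-place 1D SOR update with relaxation factor w (assumed stencil).
    a[i] = (1.0 - w) * a[i] + w * 0.5 * (a[i - 1] + a[i + 1])


def sor_sequential(a, steps, w):
    # Reference order: time loop outermost, space loop innermost.
    a = list(a)
    for t in range(steps):
        for i in range(1, len(a) - 1):
            update(a, i, w)
    return a


def sor_skewed_tiled(a, steps, w, bt, bj):
    # Same computation after skewing j = i + t and tiling (t, j) with
    # bt x bj rectangles; tiles are visited wavefront by wavefront.
    a = list(a)
    n = len(a)
    if steps == 0 or n < 3:
        return a
    j_max = (n - 2) + (steps - 1)          # largest skewed index i + t
    num_t_tiles = (steps + bt - 1) // bt
    num_j_tiles = j_max // bj + 1
    for wave in range(num_t_tiles + num_j_tiles - 1):
        # Tiles with T + J == wave do not depend on each other, so on a
        # GPU they could be assigned to different thread blocks.
        t_lo = max(0, wave - num_j_tiles + 1)
        t_hi = min(wave, num_t_tiles - 1)
        for T in range(t_lo, t_hi + 1):
            J = wave - T
            for t in range(T * bt, min((T + 1) * bt, steps)):
                for j in range(J * bj, (J + 1) * bj):
                    i = j - t              # undo the skew
                    if 1 <= i <= n - 2:
                        update(a, i, w)
    return a
```

Because every update reads exactly the values it would read in the sequential order, calling `sor_skewed_tiled` with any positive `bt` and `bj` should return the same list as `sor_sequential`. Which tile sizes run fastest on a given GPU is the question the paper's model addresses.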
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Di, P., Xue, J. (2011). Model-Driven Tile Size Selection for DOACROSS Loops on GPUs. In: Jeannot, E., Namyst, R., Roman, J. (eds) Euro-Par 2011 Parallel Processing. Euro-Par 2011. Lecture Notes in Computer Science, vol 6853. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23397-5_40
DOI: https://doi.org/10.1007/978-3-642-23397-5_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23396-8
Online ISBN: 978-3-642-23397-5
eBook Packages: Computer Science (R0)