Optimizing Stencil Application on Multi-thread GPU Architecture Using Stream Programming Model

Xudong, Fang; Yuhua, Tang; Guibin, Wang; Tao, Tang; Ying, Zhang

doi:10.1007/978-3-642-11950-7_21

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5974)

Included in the following conference series:

International Conference on Architecture of Computing Systems

Abstract

With the rapid development of GPU hardware and software, using GPUs to accelerate non-graphics CPU applications is becoming an inevitable trend. GPUs excel at ALU-intensive computation and offer high peak performance; however, harnessing that computing capacity to accelerate scientific computing applications remains a major challenge. In this paper, we implement the whole Mgrid application from the SPEC2000 benchmarks on an AMD GPU and propose several optimization strategies for the stencil computations in the naive GPU code. We first improve thread utilization by using vector types and the multiple-output-stream mechanism provided by the Brook+ programming language. By tuning thread granularity, we seek the right balance between locality within each thread and parallelism among threads. Then, we reorganize the stream layout by transforming the 3D data stream into a 2D stream in a blocked manner, which exploits more data locality in the cache. Further, we propose branch elimination, which converts control dependence into data dependence and so suits the GPU's strength in ALU-intensive processing. Finally, we redistribute computations between the CPU and the GPU so that computing resources are used more appropriately for different problem sizes. We demonstrate the effectiveness of the proposed optimization strategies on an AMD Radeon HD4870 GPU using the Brook+ programming language. Using a double-precision floating-point implementation, the experimental results show that the optimized GPU version of Mgrid achieves a 2.38x speedup over the naive GPU code and up to a 15.06x speedup over the CPU implementation running on an Intel Xeon E5405 CPU.
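Two of the strategies named in the abstract, branch elimination and the blocked 3D-to-2D stream layout, can be illustrated with a small sketch. This is not the authors' Brook+ code: it is plain Python standing in for per-thread kernel logic, the 1D 3-point stencil is a simplified stand-in for Mgrid's 3D stencils, and every function name and the tiling scheme (one tile per z-slice) are assumptions made for illustration.

```python
# Illustrative sketch only. All names are hypothetical and not taken from the paper.

def smooth_branchy(u):
    """3-point average that copies the two boundary cells, using a branch."""
    n = len(u)
    out = [0.0] * n
    for i in range(n):
        if i == 0 or i == n - 1:          # control dependence on position
            out[i] = u[i]
        else:
            out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def smooth_branch_free(u):
    """Same result, with the boundary test turned into a 0/1 arithmetic mask."""
    n = len(u)
    out = [0.0] * n
    for i in range(n):
        # Clamp neighbour indices so every read is in range; on a GPU these
        # would typically be min/max instructions rather than branches.
        left = max(i - 1, 0)
        right = min(i + 1, n - 1)
        interior = float(0 < i < n - 1)   # 1.0 inside, 0.0 on the boundary
        avg = (u[left] + u[i] + u[right]) / 3.0
        # Both candidates are computed; the mask selects one (data dependence).
        out[i] = interior * avg + (1.0 - interior) * u[i]
    return out


def to_2d_blocked(x, y, z, nx, ny, tiles_per_row):
    """Map a 3D coordinate to (row, col) in a 2D stream.

    Assumed layout: each z-slice is an ny-by-nx tile, and tiles are placed
    left to right, tiles_per_row per row of tiles.
    """
    col = (z % tiles_per_row) * nx + x
    row = (z // tiles_per_row) * ny + y
    return row, col


if __name__ == "__main__":
    data = [1.0, 4.0, 7.0, 2.0, 9.0]
    print(smooth_branchy(data) == smooth_branch_free(data))
    print(to_2d_blocked(x=1, y=2, z=4, nx=4, ny=3, tiles_per_row=3))
```

The branch-free variant does extra arithmetic on boundary cells in exchange for a uniform instruction stream, which is the trade the abstract describes as catering to ALU-intensive hardware. In the assumed layout, neighbours in x and y stay adjacent in the 2D stream, while a z-neighbour sits one tile away.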

Author information

Authors and Affiliations

National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha, Hunan, China
Fang Xudong, Tang Yuhua, Wang Guibin, Tang Tao & Zhang Ying

Editor information

Editors and Affiliations

Leibniz University Hannover, Appelstraße 4, 30167, Hannover, Germany
Christian Müller-Schloer
Karlsruhe Institute of Technology (KIT), Kaiserstraße 12, 76131, Karlsruhe, Germany
Wolfgang Karl
Thales Research and Technology, Campus Polytechnique, 1 Avenue Augustin Fresnel, 91767, Palaiseau Cedex, France
Sami Yehia

About this paper

Cite this paper

Xudong, F., Yuhua, T., Guibin, W., Tao, T., Ying, Z. (2010). Optimizing Stencil Application on Multi-thread GPU Architecture Using Stream Programming Model. In: Müller-Schloer, C., Karl, W., Yehia, S. (eds) Architecture of Computing Systems - ARCS 2010. ARCS 2010. Lecture Notes in Computer Science, vol 5974. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11950-7_21

DOI: https://doi.org/10.1007/978-3-642-11950-7_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-11949-1
Online ISBN: 978-3-642-11950-7
eBook Packages: Computer Science, Computer Science (R0)
