Multi-workgroup Tiling to Improve the Locality of Explicit One-Step Methods for ODE Systems with Limited Access Distance on GPUs

Korch, Matthias; Werner, Tim

doi:10.1007/978-3-030-43229-4_1

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12043)

Included in the following conference series:

International Conference on Parallel Processing and Applied Mathematics

Abstract

Solving an initial value problem of a large system of ordinary differential equations (ODEs) on a GPU is often memory bound, which makes optimizing the locality of memory references important. We exploit the limited access distance, which is a property of a large class of right-hand-side functions, to enable hexagonal or trapezoidal tiling across the stages of the ODE method. Since previous work showed that the traditional approach of launching one workgroup per tile is worthwhile only for small limited access distances, we introduce an approach where several workgroups cooperate on a tile (multi-workgroup tiling) and investigate several optimizations and variations. Finally, we show the superiority of the multi-workgroup tiling over the traditional single-workgroup tiling for large access distances by a detailed experimental evaluation using two different Runge–Kutta (RK) methods.
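The idea of tiling across stages can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: a hypothetical right-hand side whose component j reads only y[j-D], y[j] and y[j+D] (access distance D, clamped at the boundaries), Heun's two-stage method as the explicit one-step method, and a simple overlapped trapezoidal tile that recomputes a halo of width D instead of synchronizing with its neighbours. It runs sequentially on the CPU and says nothing about workgroups; it only shows why a limited access distance lets one tile advance through all stages using a bounded index range.

```python
import math

D = 2  # assumed access distance of the hypothetical right-hand side


def rhs_range(y, lo, hi):
    """Components lo..hi-1 of f(y); f_j reads only y[j-D], y[j], y[j+D] (clamped)."""
    n = len(y)
    return [y[max(j - D, 0)] - 2.0 * y[j] + y[min(j + D, n - 1)]
            for j in range(lo, hi)]


def heun_step_reference(y, h):
    """Untiled Heun step: each stage sweeps the whole vector."""
    n = len(y)
    k1 = rhs_range(y, 0, n)
    w = [y[j] + h * k1[j] for j in range(n)]
    k2 = rhs_range(w, 0, n)
    return [y[j] + 0.5 * h * (k1[j] + k2[j]) for j in range(n)]


def heun_step_tiled(y, h, tile):
    """Tiled Heun step: each tile runs both stages on a shrinking index range."""
    n = len(y)
    y_new = [0.0] * n
    for a in range(0, n, tile):
        b = min(a + tile, n)
        # Stage 1 is evaluated on the tile widened by D on each side.
        lo, hi = max(a - D, 0), min(b + D, n)
        k1 = rhs_range(y, lo, hi)
        # Stage-2 argument exists only on [lo, hi); NaN marks everything else,
        # so any access outside the tile's footprint would poison the result.
        w = [math.nan] * n
        for j in range(lo, hi):
            w[j] = y[j] + h * k1[j - lo]
        # Stage 2 is evaluated on the tile itself, D narrower on each side.
        k2 = rhs_range(w, a, b)
        for j in range(a, b):
            y_new[j] = y[j] + 0.5 * h * (k1[j - lo] + k2[j - a])
    return y_new


y0 = [math.sin(0.1 * j) for j in range(50)]
ref = heun_step_reference(y0, 0.01)
til = heun_step_tiled(y0, 0.01, 8)
max_diff = max(abs(r - t) for r, t in zip(ref, til))
```

In this sketch the halo grows with D times the number of stages, so the share of redundant work per tile rises as the access distance grows. That is consistent with the abstract's remark that one-workgroup-per-tile schemes pay off only for small access distances, although the paper's hexagonal and trapezoidal schemes may differ from this overlapped variant.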

Acknowledgment

This work has been supported by the German Research Foundation (DFG) under grant KO 2252/3-2.

Author information

Authors and Affiliations

Department of Computer Science, University of Bayreuth, Bayreuth, Germany
Matthias Korch & Tim Werner

Corresponding author

Correspondence to Matthias Korch.

Editor information

Editors and Affiliations

Czestochowa University of Technology, Czestochowa, Poland
Roman Wyrzykowski
University of Southern California, Marina del Rey, CA, USA
Ewa Deelman
University of Tennessee, Knoxville, TN, USA
Jack Dongarra
Czestochowa University of Technology, Czestochowa, Poland
Konrad Karczewski

About this paper

Cite this paper

Korch, M., Werner, T. (2020). Multi-workgroup Tiling to Improve the Locality of Explicit One-Step Methods for ODE Systems with Limited Access Distance on GPUs. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2019. Lecture Notes in Computer Science, vol 12043. Springer, Cham. https://doi.org/10.1007/978-3-030-43229-4_1

DOI: https://doi.org/10.1007/978-3-030-43229-4_1
Published: 19 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43228-7
Online ISBN: 978-3-030-43229-4
eBook Packages: Computer Science, Computer Science (R0)
