Multi-workgroup Tiling to Improve the Locality of Explicit One-Step Methods for ODE Systems with Limited Access Distance on GPUs

  • Conference paper
Parallel Processing and Applied Mathematics (PPAM 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12043)

Abstract

Solving an initial value problem of a large system of ordinary differential equations (ODEs) on a GPU is often memory bound, which makes optimizing the locality of memory references important. We exploit the limited access distance, which is a property of a large class of right-hand-side functions, to enable hexagonal or trapezoidal tiling across the stages of the ODE method. Since previous work showed that the traditional approach of launching one workgroup per tile is worthwhile only for small limited access distances, we introduce an approach where several workgroups cooperate on a tile (multi-workgroup tiling) and investigate several optimizations and variations. Finally, we show the superiority of the multi-workgroup tiling over the traditional single-workgroup tiling for large access distances by a detailed experimental evaluation using two different Runge–Kutta (RK) methods.
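The role of the limited access distance can be illustrated with a minimal sketch. It assumes the usual reading of the term: a right-hand side has access distance d if its component j reads only components j-d to j+d of its argument. Under that assumption, an s-stage explicit method needs a halo of s·d components on each side of a tile, and the region with valid values shrinks by d per stage, which is what gives a tile its trapezoidal shape. Everything below is hypothetical and chosen for illustration: the 1D diffusion stencil, the two-stage Heun method, and the names `rhs`, `heun_step`, and `heun_step_tile`. It is a sequential, overlapped variant that recomputes halo values, not the hexagonal or multi-workgroup scheme of the paper.

```python
D = 1  # access distance of the example right-hand side (assumption)


def rhs(y):
    """Example right-hand side with access distance D = 1 (1D diffusion stencil).

    Component j reads only y[j-1], y[j], y[j+1]; the two end components
    are held fixed, so their derivative is zero.
    """
    n = len(y)
    out = [0.0] * n
    for j in range(1, n - 1):
        out[j] = y[j - 1] - 2.0 * y[j] + y[j + 1]
    return out


def heun_step(y, h):
    """One time step of the two-stage explicit Heun method on all of y."""
    k1 = rhs(y)
    k2 = rhs([yj + h * kj for yj, kj in zip(y, k1)])
    return [yj + 0.5 * h * (a + b) for yj, a, b in zip(y, k1, k2)]


def heun_step_tile(y, h, lo, hi, stages=2):
    """Compute components lo..hi-1 of the step from a halo-extended slice.

    Each stage invalidates D components at every cut edge of the slice,
    so a halo of stages * D components per side keeps lo..hi-1 exact.
    """
    a = max(lo - stages * D, 0)
    b = min(hi + stages * D, len(y))
    local = heun_step(y[a:b], h)  # both stages run on the slice only
    return local[lo - a:hi - a]


if __name__ == "__main__":
    y0 = [float((3 * i) % 7) for i in range(32)]
    full = heun_step(y0, 0.1)
    tile = heun_step_tile(y0, 0.1, 10, 20)
    print(tile == full[10:20])
```

Because both stages of a tile run on the same small slice, the intermediate stage values never have to travel through main memory, which is the locality benefit the abstract refers to. The halo grows with both the number of stages and the access distance, so for large access distances a tile may no longer fit into the fast memory of a single workgroup; the abstract names this as the case where several workgroups cooperating on one tile pay off.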

Acknowledgment

This work has been supported by the German Research Foundation (DFG) under grant KO 2252/3-2.

Author information

Corresponding author

Correspondence to Matthias Korch.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Korch, M., Werner, T. (2020). Multi-workgroup Tiling to Improve the Locality of Explicit One-Step Methods for ODE Systems with Limited Access Distance on GPUs. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2019. Lecture Notes in Computer Science, vol 12043. Springer, Cham. https://doi.org/10.1007/978-3-030-43229-4_1

  • DOI: https://doi.org/10.1007/978-3-030-43229-4_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-43228-7

  • Online ISBN: 978-3-030-43229-4

  • eBook Packages: Computer Science (R0)
