
Vectorized Higher Order Finite Difference Kernels

  • Conference paper
Applied Parallel and Scientific Computing (PARA 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7782)

Abstract

Several highly optimized implementations of Finite Difference schemes are discussed. The combination of vectorization with an interleaved data layout, spatial and temporal loop tiling, loop unrolling, and parameter tuning leads to efficient computational kernels in one to three spatial dimensions, with truncation errors of order two to twelve, for both isotropic and compact anisotropic stencils. The kernels are implemented on and tuned for several processor architectures: recent Intel Sandy Bridge, Ivy Bridge, and AMD Bulldozer CPU cores, all with AVX vector instructions; Nvidia Kepler and Fermi and AMD Southern Islands and Northern Islands GPU architectures; and some older architectures for comparison. The kernels are based either on a cache-aware spatial loop or on time slicing, which computes several time steps at once. Furthermore, vector components can be independent, grouped in short vectors of SSE, AVX, or GPU warp size, or grouped in larger virtual vectors with explicit synchronization. The optimal choice of the algorithm and its parameters depends on both the Finite Difference stencil and the processor architecture.
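To make two of the ingredients named in the abstract concrete, here is a minimal sketch of a one-dimensional central stencil for the second derivative with a selectable truncation order, applied through a spatially tiled loop. It is an illustration under stated assumptions, not the paper's kernels: the function names `second_derivative_weights` and `apply_stencil` and the `tile` parameter are hypothetical, the code is scalar Python rather than AVX or GPU code, and temporal tiling, interleaved layouts, and unrolling are not shown.

```python
from fractions import Fraction
from math import factorial


def second_derivative_weights(order):
    """Central weights for u'' with truncation error O(h**order).

    `order` must be even (2, 4, ..., 12, ...). The stencil has
    order + 1 points. Weights are computed exactly and then
    converted to floats.
    """
    if order < 2 or order % 2:
        raise ValueError("order must be a positive even integer")
    p = order // 2  # stencil half-width
    # Closed form for the off-center weights at distance k = 1..p.
    side = [
        Fraction(2 * (-1) ** (k + 1) * factorial(p) ** 2,
                 k * k * factorial(p - k) * factorial(p + k))
        for k in range(1, p + 1)
    ]
    # The weights must sum to zero, which fixes the center weight.
    center = -2 * sum(side)
    left = [float(w) for w in reversed(side)]
    right = [float(w) for w in side]
    return left + [float(center)] + right


def apply_stencil(u, weights, h, tile=None):
    """Approximate u'' at interior grid points with spacing h.

    `tile` is a hypothetical block size: the interior is processed
    in contiguous chunks of that many points, mimicking spatial
    loop tiling. Points closer than the half-width to either end
    are left at 0.0.
    """
    p = len(weights) // 2
    n = len(u)
    out = [0.0] * n
    tile = tile or n
    for start in range(p, n - p, tile):      # loop over tiles
        stop = min(start + tile, n - p)
        for i in range(start, stop):         # loop inside one tile
            acc = 0.0
            for k, w in enumerate(weights):
                acc += w * u[i - p + k]
            out[i] = acc / (h * h)
    return out
```

For example, `second_derivative_weights(2)` gives the familiar three-point weights 1, -2, 1, and `apply_stencil(u, second_derivative_weights(4), h, tile=8)` evaluates the five-point stencil in blocks of eight points. Tiling changes only the traversal order, so the tiled and untiled results agree; in a tuned kernel the block size would be chosen per architecture so that a tile and its halo fit in cache.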


© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zumbusch, G. (2013). Vectorized Higher Order Finite Difference Kernels. In: Manninen, P., Öster, P. (eds) Applied Parallel and Scientific Computing. PARA 2012. Lecture Notes in Computer Science, vol 7782. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36803-5_25

  • DOI: https://doi.org/10.1007/978-3-642-36803-5_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36802-8

  • Online ISBN: 978-3-642-36803-5

  • eBook Packages: Computer Science, Computer Science (R0)
