A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures

Sahasrabudhe, Damodar; Phipps, Eric T.; Rajamanickam, Sivasankaran; Berzins, Martin

doi:10.1007/978-3-030-49943-3_7

Conference paper
First Online: 09 June 2020

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 12017)

Abstract

As computer architectures evolve rapidly (e.g., those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a GPU back-end for these primitives makes such code non-portable. The unified, portable Single Instruction Multiple Data (SIMD) primitive proposed in this work allows intrinsics-based vectorization on CPUs and many-core architectures such as Intel Knights Landing (KNL), and also facilitates Single Instruction Multiple Threads (SIMT) based execution on GPUs. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code that is portable across heterogeneous platforms. The new SIMD primitive is used on different architectures to test the performance boost against a hard-to-auto-vectorize baseline, to measure the overhead against an efficiently vectorized baseline, and to evaluate a new feature called the “logical vector length” (LVL). The SIMD primitive provides portability across CPUs and GPUs, with no performance degradation observed experimentally.
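To make the idea of a vector primitive with a user-chosen vector length more concrete, the following is a minimal C++ sketch. It is not the paper's implementation and does not use Kokkos or intrinsics. The names `simd_sketch` and `axpy` are hypothetical. The sketch also assumes, for illustration only, that a logical vector is a fixed-size bundle of lanes that a back-end could map onto hardware vector registers on a CPU or onto threads on a GPU. Here the lanes are simply stored in an array and processed with scalar loops.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a portable SIMD type (not the paper's API).
// LogicalLength is the number of lanes the user code sees; a real back-end
// could map these lanes onto vector registers or GPU threads.
template <typename T, std::size_t LogicalLength>
struct simd_sketch {
  static_assert(LogicalLength > 0, "a vector needs at least one lane");
  std::array<T, LogicalLength> lanes{};

  simd_sketch() = default;
  // Broadcast one scalar value to every lane.
  explicit simd_sketch(T value) { lanes.fill(value); }

  // Read LogicalLength contiguous elements starting at src.
  static simd_sketch load(const T* src) {
    simd_sketch v;
    for (std::size_t i = 0; i < LogicalLength; ++i) v.lanes[i] = src[i];
    return v;
  }

  // Write all lanes to LogicalLength contiguous elements starting at dst.
  void store(T* dst) const {
    for (std::size_t i = 0; i < LogicalLength; ++i) dst[i] = lanes[i];
  }

  // Lane-wise addition.
  friend simd_sketch operator+(const simd_sketch& a, const simd_sketch& b) {
    simd_sketch r;
    for (std::size_t i = 0; i < LogicalLength; ++i)
      r.lanes[i] = a.lanes[i] + b.lanes[i];
    return r;
  }

  // Lane-wise multiplication.
  friend simd_sketch operator*(const simd_sketch& a, const simd_sketch& b) {
    simd_sketch r;
    for (std::size_t i = 0; i < LogicalLength; ++i)
      r.lanes[i] = a.lanes[i] * b.lanes[i];
    return r;
  }
};

// y = a * x + y, written once against the vector type. The kernel body does
// not change when the logical length L changes.
template <typename T, std::size_t L>
void axpy(T a, const T* x, T* y, std::size_t n) {
  using vec = simd_sketch<T, L>;
  const vec va(a);
  std::size_t i = 0;
  for (; i + L <= n; i += L) {
    (va * vec::load(x + i) + vec::load(y + i)).store(y + i);
  }
  // Scalar loop for the elements that do not fill a whole vector.
  for (; i < n; ++i) y[i] = a * x[i] + y[i];
}
```

Calling `axpy<double, 4>(2.0, x, y, n)` and `axpy<double, 8>(2.0, x, y, n)` on the same inputs gives the same result. Only the number of lanes handled per loop iteration differs, which is the kind of separation between user code and vector width that a portable primitive aims for.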

The authors thank Sandia National Lab and Department of Energy, National Nuclear Security Administration (under Award Number(s) DE-NA0002375), for funding this work. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525. The authors are grateful to Sandia and also to the Center for High Performance Computing, University of Utah, for extending the resources to run the experiments. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

Author information

Authors and Affiliations

Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT, USA
Damodar Sahasrabudhe & Martin Berzins
Center for Computing Research, Sandia National Laboratories, Albuquerque, NM, USA
Eric T. Phipps & Sivasankaran Rajamanickam

Corresponding author

Correspondence to Damodar Sahasrabudhe.

Editor information

Editors and Affiliations

RWTH Aachen University, Aachen, Germany
Sandra Wienke
Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Sridutt Bhalachandra

About this paper

Cite this paper

Sahasrabudhe, D., Phipps, E.T., Rajamanickam, S., Berzins, M. (2020). A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures. In: Wienke, S., Bhalachandra, S. (eds) Accelerator Programming Using Directives. WACCPD 2019. Lecture Notes in Computer Science, vol 12017. Springer, Cham. https://doi.org/10.1007/978-3-030-49943-3_7

DOI: https://doi.org/10.1007/978-3-030-49943-3_7
Published: 09 June 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49942-6
Online ISBN: 978-3-030-49943-3
eBook Packages: Computer Science, Computer Science (R0)
