Abstract
As computer architectures evolve rapidly (e.g., those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a GPU back-end for these primitives makes such code non-portable. The unified, portable Single Instruction Multiple Data (SIMD) primitive proposed in this work allows intrinsics-based vectorization on CPUs and many-core architectures such as Intel Knights Landing (KNL), and also facilitates Single Instruction Multiple Threads (SIMT) based execution on GPUs. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code that is portable across heterogeneous platforms. The new SIMD primitive is used on different architectures to test the performance boost against a hard-to-auto-vectorize baseline, to measure the overhead against an efficiently vectorized baseline, and to evaluate a new feature called the “logical vector length” (LVL). The SIMD primitive provides portability across CPUs and GPUs, and no performance degradation was observed experimentally.
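The idea of writing a kernel once against an abstract vector type can be illustrated with a minimal sketch. This is not the primitive described in the paper and does not use Kokkos: `simd_sketch`, `axpy_sketch`, and the `Width` parameter are hypothetical names introduced here for illustration only. It assumes that the programmer-visible vector length is a template parameter, that a CPU back end would replace the plain loops with intrinsics, and that a GPU back end could assign one lane per thread; how the paper's LVL actually maps to hardware is not stated in the abstract.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a portable SIMD type (NOT the paper's API).
// Width is the vector length the programmer sees. A CPU back end would
// specialise this type with intrinsics; a GPU back end could give each
// thread a single lane. Here plain loops stand in for both.
template <typename T, std::size_t Width>
struct simd_sketch {
  std::array<T, Width> lane{};

  // Read Width contiguous values starting at p.
  static simd_sketch load(const T* p) {
    simd_sketch v;
    for (std::size_t i = 0; i < Width; ++i) v.lane[i] = p[i];
    return v;
  }

  // Fill every lane with the same scalar.
  static simd_sketch broadcast(T s) {
    simd_sketch v;
    v.lane.fill(s);
    return v;
  }

  // Write Width contiguous values starting at p.
  void store(T* p) const {
    for (std::size_t i = 0; i < Width; ++i) p[i] = lane[i];
  }

  // Lane-wise addition.
  friend simd_sketch operator+(simd_sketch a, const simd_sketch& b) {
    for (std::size_t i = 0; i < Width; ++i) a.lane[i] += b.lane[i];
    return a;
  }

  // Lane-wise multiplication.
  friend simd_sketch operator*(simd_sketch a, const simd_sketch& b) {
    for (std::size_t i = 0; i < Width; ++i) a.lane[i] *= b.lane[i];
    return a;
  }
};

// y = a * x + y, written once against the abstract vector type.
template <std::size_t Width>
void axpy_sketch(double a, const std::vector<double>& x,
                 std::vector<double>& y) {
  assert(x.size() == y.size());
  using vec = simd_sketch<double, Width>;
  const vec va = vec::broadcast(a);
  const std::size_t n = x.size();
  std::size_t i = 0;
  // Process full chunks of Width elements.
  for (; i + Width <= n; i += Width) {
    const vec r = va * vec::load(&x[i]) + vec::load(&y[i]);
    r.store(&y[i]);
  }
  // Scalar loop for the elements that do not fill a whole chunk.
  for (; i < n; ++i) y[i] = a * x[i] + y[i];
}
```

Calling `axpy_sketch<4>(2.0, x, y)` and `axpy_sketch<1>(2.0, x, y)` on the same input gives the same result. Only the width seen by the kernel changes, which is the property that lets one source target both wide CPU vectors and per-thread GPU execution.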
The authors thank Sandia National Laboratories and the Department of Energy, National Nuclear Security Administration (under Award Number(s) DE-NA0002375), for funding this work. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA-0003525. The authors are grateful to Sandia and to the Center for High Performance Computing, University of Utah, for providing the resources to run the experiments. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Sahasrabudhe, D., Phipps, E.T., Rajamanickam, S., Berzins, M. (2020). A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures. In: Wienke, S., Bhalachandra, S. (eds) Accelerator Programming Using Directives. WACCPD 2019. Lecture Notes in Computer Science, vol 12017. Springer, Cham. https://doi.org/10.1007/978-3-030-49943-3_7
DOI: https://doi.org/10.1007/978-3-030-49943-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49942-6
Online ISBN: 978-3-030-49943-3
eBook Packages: Computer Science, Computer Science (R0)