Pragmatic Performance Portability with OpenMP 4.x
In this paper we investigate the current compiler technologies supporting OpenMP 4.x features targeting a range of devices: the Cray compiler 8.5.0 targeting an Intel Xeon Broadwell and NVIDIA K20x, IBM’s OpenMP 4.5 Clang branch (clang-ykt) targeting an NVIDIA K20x, the Intel compiler 16 targeting an Intel Xeon Phi Knights Landing, and GCC 6.1 targeting an AMD APU. We outline the mechanisms that these compilers use to map the OpenMP model onto their target architectures, and conduct performance testing with a number of representative data parallel kernels. We then discuss the current state of performance portability and propose some straightforward guidelines, derived from our observations, for writing performance portable code. At the time of writing, developers will likely have to rely on the pre-processor for certain kernels to achieve functional portability, but we expect that future homogenisation of the required directives between compilers and architectures is feasible.
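As a minimal sketch of the pre-processor approach mentioned above, the following C kernel selects between an offload directive and a host directive at compile time. The macro name TARGET_OFFLOAD and the daxpy kernel are assumptions made for illustration and are not taken from the paper; the directives actually required may differ between compilers and devices.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative kernel: y = a * x + y over n elements.
 * TARGET_OFFLOAD is a hypothetical build-time macro (e.g. -DTARGET_OFFLOAD),
 * not a name defined by OpenMP or by any of the compilers discussed. */
void daxpy(int n, double a, const double *x, double *y)
{
    assert(n >= 0 && x != NULL && y != NULL);

#ifdef TARGET_OFFLOAD
    /* Accelerator build: offload the loop and map the arrays to the device. */
#pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
#else
    /* Host build: thread and vectorise the loop on the CPU. */
#pragma omp parallel for simd
#endif
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The loop body is shared between both builds, so only the directive line varies. If OpenMP support is not enabled at compile time, the directives are ignored and the loop runs serially.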
Keywords: OpenMP 4.x, Performance portability, Parallel programming
We would like to thank Cray Inc. for providing access to their XC40 supercomputer Swan, which hosted the Intel Xeon Broadwell and NVIDIA K20x processors. The Intel Xeon Phi KNL was provided by the Intel Parallel Computing Center at the University of Bristol, and we would like to thank Jim Cownie at Intel for his support. We also want to thank the sponsors of this research, EPSRC and the UK Atomic Weapons Establishment.