GPU-STREAM v2.0: Benchmarking the Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models
Many scientific codes consist of memory bandwidth bound kernels — the dominating factor of the runtime is the speed at which data can be loaded from memory into the Arithmetic Logic Units, before results are written back to memory. One major advantage of many-core devices such as General Purpose Graphics Processing Units (GPGPUs) and the Intel Xeon Phi is their focus on providing increased memory bandwidth over traditional CPU architectures. However, as with CPUs, this peak memory bandwidth is usually unachievable in practice and so benchmarks are required to measure a practical upper bound on expected performance.
The choice of one programming model over another should ideally not limit the performance that can be achieved on a device. GPU-STREAM has been updated to incorporate a wide variety of the latest parallel programming models, all implementing the same parallel scheme. As such this tool can be used as a kind of Rosetta Stone which provides both a cross-platform and cross-programming model array of results of achievable memory bandwidth.
KeywordsProgramming Model Application Program Interface Memory Bandwidth Parallel Programming Model General Purpose Graphic Processing Unit
We would like to thank Cray Inc. for providing access to the Cray XC40 supercomputer, Swan, and the Cray CS cluster, Falcon. Our thanks to Codeplay for access to the ComputeCpp SYCL compiler and to Douglas Miles at PGI (NVIDIA) for access to the PGI compiler. We would also like to that the University of Bristol Intel Parallel Computing Center (IPCC). This work was carried out using the computational facilities of the Advanced Computing Research Centre, University of Bristol - http://www.bris.ac.uk/acrc/. Thanks also go to the University of Oxford for access to the Power 8 system.
- 1.Bhat, K.: clpeak (2015). https://github.com/krrishnarraj/clpeak
- 2.Codeplay: ComputeCpp. https://www.codeplay.com/products/computecpp
- 3.Danalis, A., Marin, G., McCurdy, C., Meredith, J.S., Roth, P.C., Spafford, K., Tipparaju, V., Vetter, J.S.: The scalable heterogeneous computing (SHOC) benchmark suite. In: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU-3, pp. 63–74. ACM, New York (2010). http://doi.acm.org/10.1145/1735688.1735702
- 4.Deakin, T., McIntosh-Smith, S.: GPU-STREAM: benchmarking the achievable memory bandwidth of graphics processing units (poster). In: Supercomputing, Austin, Texas (2015)Google Scholar
- 5.Edwards, H.C., Sunderland, D.: Kokkos array performance-portable manycore programming model. In: Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM 2012), pp. 1–10. ACM (2012)Google Scholar
- 6.Heroux, M., Doerfler, D., et al.: Improving performance via mini-applications. Technical report, SAND2009-5574, Sandia National Laboratories (2009)Google Scholar
- 7.Hornung, R.D., Keasler, J.A.: The RAJA Portability Layer: Overview and Status (2014)Google Scholar
- 8.Khronos OpenCL Working Group SYCL subgroup: SYCL Provisional Specification (2016)Google Scholar
- 9.Martineau, M., McIntosh-Smith, S., Boulton, M., Gaudin, W.: An evaluation of emerging many-core parallel programming models. In: Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycore, PMAM 2016, pp. 1–10. ACM, New York (2016). http://doi.acm.org/10.1145/2883404.2883420
- 10.McCalpin, J.D.: Memory bandwidth and machine balance in current high performance computers. IEEE Comput. Soc. Tech. Comm. Comput. Archit. (TCCA) Newslett. 19–25 (1995)Google Scholar
- 11.Munshi, A.: The OpenCL Specification, Version 1.1 (2011)Google Scholar
- 12.NVIDIA: CUDA Toolkit 7.5Google Scholar
- 13.OpenACC-Standard.org: The OpenACC Application Programming Interface - Version 2.5 (2015)Google Scholar
- 14.OpenMP Architecture Review Board: OpenMP Application Program Interface, Version 4.5 (2015)Google Scholar
- 15.Reguly, I.Z., Keita, A.K., Giles, M.B.: Benchmarking the IBM Power8 processor. In: Proceedings of the 25th Annual International Conference on Computer Science and Software Engineering, pp. 61–69. IBM Corporation, Riverton (2015)Google Scholar
- 16.Standard Performance Evaluation Corporation: SPEC Accel (2016). https://www.spec.org/accel/