Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10732)


Accelerator devices are increasingly used to build large supercomputers, and current installations usually include more than one accelerator per system node. To keep all devices busy, kernels have to be executed concurrently, which can be achieved via asynchronous kernel launches. This work compares the performance of an implementation of the Conjugate Gradient method with CUDA, OpenCL, and OpenACC on NVIDIA Pascal GPUs. Furthermore, it examines Intel Xeon Phi coprocessors programmed with OpenCL and OpenMP. In doing so, it addresses the question of whether the higher abstraction level of directive-based models comes at the cost of performance compared with lower-level paradigms.
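The asynchronous launch pattern described above can be sketched, for the CUDA backend, roughly as follows. This is a minimal illustration, not the paper's implementation: the kernel, the `DevData` structure, the partitioning, and the launch configuration are all assumptions. The key point is that each `<<<...>>>` launch returns immediately, so the host can dispatch work to the next device before the previous one finishes.

```cuda
#include <cuda_runtime.h>

// Illustrative per-device data for a row-wise partition of a CSR matrix
// (hypothetical layout, not taken from the paper).
struct DevData {
    const double *val;   // nonzero values of the local partition
    const int *col;      // column indices
    const int *rowPtr;   // row pointers (localRows + 1 entries)
    const double *x;     // input vector (replicated or gathered)
    double *y;           // output rows owned by this device
};

// One sparse matrix-vector product over the device-local rows,
// the dominant kernel in a Conjugate Gradient iteration.
__global__ void spmv_partition(DevData d, int localRows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < localRows) {
        double sum = 0.0;
        for (int i = d.rowPtr[row]; i < d.rowPtr[row + 1]; ++i)
            sum += d.val[i] * d.x[d.col[i]];
        d.y[row] = sum;
    }
}

// Launch on every device without waiting in between: the asynchronous
// launches keep all GPUs computing their partitions concurrently.
void spmv_all_devices(int numDevices, const DevData *dev, int rowsPerDev) {
    const int threads = 256;
    const int blocks = (rowsPerDev + threads - 1) / threads;
    for (int d = 0; d < numDevices; ++d) {
        cudaSetDevice(d);
        spmv_partition<<<blocks, threads>>>(dev[d], rowsPerDev);
    }
    // Only after all kernels are in flight does the host wait for them.
    for (int d = 0; d < numDevices; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
}
```

The same structure carries over to the directive-based models: OpenACC expresses the non-blocking launch with an `async` clause, and OpenMP with `nowait` on a `target` construct, followed by a corresponding wait point.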



The experiments were performed with computing resources granted by JARA-HPC from RWTH Aachen University under project jara0001.



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. JARA-HPC, Chair for High Performance Computing, RWTH Aachen University, Aachen, Germany
  2. IT Center, RWTH Aachen University, Aachen, Germany
  3. Department of Computer Science, University of Bristol, Bristol, UK
