Research on Matrix Multiplication Based on the Combination of OpenACC and CUDA

Yuexing Wang

Conference paper. Part of the Communications in Computer and Information Science book series (CCIS, volume 980).


With the improvement of GPUs' general-purpose computing capacity, using parallel computing to solve problems with large data volumes and compute-intensive tasks has become the trend. In GPU general-purpose computing, CUDA and OpenCL have been widely used and studied. However, both programming models share the weakness that their APIs are too close to the underlying hardware, which makes programming inefficient and ill-suited to large-scale parallel tasks that must be implemented quickly. OpenACC is a comparatively high-level and simple programming model that enables rapid parallelization, but the resulting performance is generally lower than that of CUDA. Therefore, this paper combines CUDA and OpenACC for mixed parallelization. This approach not only greatly reduces the workload of code conversion, but also achieves computing performance no worse than that of a pure CUDA program.


Keywords: CUDA · OpenACC · matrix multiplication



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

Hebei University of Engineering, Handan, China
