International Journal of Parallel Programming, Volume 42, Issue 6, pp 1032–1047

Reducing Communication Overhead in Multi-GPU Hybrid Solver for 2D Laplace’s Equation

  • Michał Czapiński
  • Chris Thompson
  • Stuart Barnes


Abstract

The possibility of porting algorithms to graphics processing units (GPUs) has raised significant interest among researchers. The natural next step is to employ multiple GPUs, but communication overhead may limit further performance gains. In this paper, we investigate techniques for reducing this overhead on hybrid CPU–GPU platforms, including careful data layout, judicious use of GPU memory spaces, and non-blocking communication. In addition, we propose an accurate automatic load balancing technique for heterogeneous environments. We validate our approach on a hybrid Jacobi solver for the 2D Laplace equation. Experiments carried out on various graphics hardware and types of connectivity confirm that the proposed data layout allows our fastest CUDA kernels to reach the analytical memory bandwidth limit (up to 106 GB/s on an NVIDIA GTX 480), and that non-blocking communication significantly reduces overhead, allowing almost linear speed-up even when communication is carried out over relatively slow networks.
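The core computation the abstract refers to can be illustrated with a minimal plain-Python reference sketch (not the authors' CUDA implementation): one Jacobi sweep for the 2D Laplace equation replaces each interior grid point with the average of its four neighbours, while boundary values are held fixed (Dirichlet conditions). The grid size, boundary values, and iteration count below are illustrative.

```python
def jacobi_sweep(u):
    """One Jacobi sweep for the 2D Laplace equation: each interior
    point becomes the average of its four neighbours; boundary
    values are held fixed (Dirichlet conditions)."""
    n = len(u)
    v = [row[:] for row in u]  # new buffer; boundaries copied unchanged
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            v[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                              u[i][j - 1] + u[i][j + 1])
    return v

def solve(n=16, iters=500):
    """Iterate Jacobi sweeps on an n x n grid with the top edge at 1."""
    u = [[0.0] * n for _ in range(n)]
    u[0] = [1.0] * n  # top edge held at 1
    for _ in range(iters):
        u = jacobi_sweep(u)  # double buffering, as in a GPU version
    return u

grid = solve()
print(round(grid[8][8], 3))  # interior value between the boundary extremes
```

A multi-GPU version of this solver would partition the grid into strips, assign each strip to a device, and exchange one-row halos between neighbouring devices after every sweep; that halo exchange is the communication the paper overlaps with computation.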


Keywords: Hybrid parallelism · Multiple GPUs · Heterogeneous architectures · Non-blocking communication · Laplace solver · CUDA



The authors wish to thank Dr. Mark Stillwell for proof-reading the original manuscript and his valuable and constructive comments.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Michał Czapiński (1)
  • Chris Thompson (1)
  • Stuart Barnes (1)

  1. Applied Mathematics and Computing Group, Cranfield University, Cranfield, UK
