Dual buffer rotation four-stage pipeline for CPU–GPU cooperative computing

  • Methodologies and Application
  • Published in: Soft Computing

Abstract

Accelerators such as GPUs have become popular general-purpose computing devices in the field of high-performance computing. As storage and computational capacities grow, solving complex scientific and engineering problems on CPU–GPU heterogeneous systems has become increasingly important in the big data era. Compute-intensive problems have been solved successfully with CPU–GPU cooperative computing, but large-scale data-intensive problems remain difficult to handle, especially those limited by GPU device memory. In this paper, the dual buffer rotation four-stage pipeline (DBFP) mechanism is proposed for CPU–GPU cooperative computing to efficiently handle data-intensive problems that require more memory than a single GPU provides. A pipeline computing strategy based on data-block partitioning is designed on top of the DBFP mechanism. On the one hand, it breaks through the bottleneck of limited GPU device memory; on the other hand, it exploits the computing power of both the CPU and the GPU by overlapping data transfer with computation. Furthermore, the DBFP mechanism extends easily to heterogeneous systems equipped with multiple GPUs while achieving high resource utilization. The results show that it achieves 99% and 90% of theoretical performance for dense general matrix multiplication on one GPU and two GPUs, respectively, with Nvidia GTX 480 or K40 GPUs. It also enables the K-means and T-nearest-neighbor algorithms to process larger datasets than the GPU device memory previously permitted. We achieve nearly 1.9-fold performance gains through dynamic task scheduling on two GPUs when the performance bottleneck is GPU computation or data transmission.
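To make the transfer/computation overlap concrete, the sketch below shows a generic dual-buffer pipeline built on CUDA streams: while one buffer's block is being computed, the next block is copied into the other buffer. This is a minimal illustration of the double-buffering principle, not the authors' DBFP implementation; the kernel process_block and the constants NBLOCKS and BLOCKSIZE are hypothetical placeholders, and the CPU pre/post-processing stage of the four-stage pipeline is omitted for brevity.

/* Minimal sketch of a dual-buffer pipeline using CUDA streams (an
 * illustration of the double-buffering principle, not the authors'
 * DBFP code; process_block, NBLOCKS and BLOCKSIZE are hypothetical). */
#include <cuda_runtime.h>

#define NBLOCKS   8          /* number of data blocks (hypothetical) */
#define BLOCKSIZE (1 << 20)  /* elements per block (hypothetical)    */

/* Placeholder kernel standing in for the real per-block computation. */
__global__ void process_block(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    float *h_data, *d_buf[2];
    cudaStream_t stream[2];

    /* Pinned host memory is required for truly asynchronous copies. */
    cudaMallocHost(&h_data, (size_t)NBLOCKS * BLOCKSIZE * sizeof(float));
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_buf[b], BLOCKSIZE * sizeof(float));
        cudaStreamCreate(&stream[b]);
    }

    for (int i = 0; i < NBLOCKS; ++i) {
        int b = i & 1;  /* rotate between the two device buffers */
        float *h_blk = h_data + (size_t)i * BLOCKSIZE;

        /* Upload, compute and download block i on the stream owning
         * buffer b.  While this stream works on block i, the other
         * stream is still processing block i-1, so the copies of one
         * block overlap the computation of the other.  Buffer b is
         * reused only at iteration i+2, behind its stream's earlier
         * work, so in-stream ordering prevents any race. */
        cudaMemcpyAsync(d_buf[b], h_blk, BLOCKSIZE * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        process_block<<<(BLOCKSIZE + 255) / 256, 256, 0, stream[b]>>>(
            d_buf[b], BLOCKSIZE);
        cudaMemcpyAsync(h_blk, d_buf[b], BLOCKSIZE * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();  /* wait for both streams to drain */

    for (int b = 0; b < 2; ++b) {
        cudaStreamDestroy(stream[b]);
        cudaFree(d_buf[b]);
    }
    cudaFreeHost(h_data);
    return 0;
}

Because consecutive blocks are issued to alternating streams, the host-to-device copy for block i+1 runs while the kernel for block i executes, which is exactly the overlap the abstract describes; a multi-GPU extension would rotate buffer/stream pairs per device in the same way.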

Acknowledgements

Funding was provided by Natural Science Foundation of Tianjin City (Grant No. 16JCYBJC15200), Science and Technology Support Program of Tianjin (Grant No. 15ZXDSGX00020), Specialized Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20130031120028) and Research Plan in Application Foundation and Advanced Technologies in Tianjin (Grant No. 14JCQNJC00700).

Author information

Corresponding author

Correspondence to Xiaoli Gong.

Ethics declarations

Conflict of interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled “Dual buffer rotation four-stage pipeline for CPU–GPU cooperative computing”.

Additional information

Communicated by V. Loia.

About this article

Cite this article

Li, T., Dong, Q., Wang, Y. et al. Dual buffer rotation four-stage pipeline for CPU–GPU cooperative computing. Soft Comput 23, 859–869 (2019). https://doi.org/10.1007/s00500-017-2795-0
