Abstract
Besides traditional CPU-based parallel computers, heterogeneous parallel architectures that consist of both CPUs and GPGPUs are used in many emerging large-scale clusters and supercomputers. To better utilize both the CPU and the GPU, an application can divide its workload and distribute it to the two types of hardware at the same time. However, it is not trivial to find an optimal allocation for all applications offline, because applications have different characteristics and therefore achieve different speedup ratios on a GPGPU relative to a CPU. To solve this problem, this chapter presents techniques that balance the application workload across heterogeneous hardware.
Part of the content of this chapter has been published in the International Workshop on Programming Models and Applications for Multicores and Manycores. Reprinted from Ref. [14], with permission from ACM.
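To make the abstract's point about speedup ratios concrete, the following is a minimal sketch of a static proportional split, not the technique presented in the chapter. It assumes each device processes items at a constant, independently measured rate and ignores data-transfer overhead; the names `split_workload`, `makespan`, `cpu_rate`, and `gpu_rate` are hypothetical and chosen only for illustration.

```python
from typing import Tuple


def split_workload(total_items: int, cpu_rate: float, gpu_rate: float) -> Tuple[int, int]:
    """Return (cpu_items, gpu_items) so both devices finish at about the same time.

    Assumption: cpu_rate and gpu_rate are measured throughputs (items per
    second) for this particular application; transfer costs are ignored.
    """
    if total_items < 0:
        raise ValueError("total_items must be non-negative")
    if cpu_rate <= 0 or gpu_rate <= 0:
        raise ValueError("rates must be positive")
    # Give each device a share proportional to its throughput.
    gpu_items = round(total_items * gpu_rate / (cpu_rate + gpu_rate))
    cpu_items = total_items - gpu_items
    return cpu_items, gpu_items


def makespan(cpu_items: int, gpu_items: int, cpu_rate: float, gpu_rate: float) -> float:
    """Completion time when both devices run concurrently."""
    return max(cpu_items / cpu_rate, gpu_items / gpu_rate)


# An application whose GPU throughput is 4x its CPU throughput.
cpu_share, gpu_share = split_workload(1000, cpu_rate=1.0, gpu_rate=4.0)
```

Under these assumptions, an application with a different GPU speedup needs a different split, which is why a single allocation fixed offline cannot suit every application.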
References
C. Augonnet, S. Thibault, R. Namyst, P. Wacrenier, StarPU: A unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience 23 (2) (2011) 187–198.
S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, W.-m. W. Hwu, An adaptive performance modeling tool for GPU architectures, in: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’10, ACM, New York, NY, USA, 2010, pp. 105–114.
I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan, Brook for GPUs: stream computing on graphics hardware, in: ACM SIGGRAPH 2004 Papers, SIGGRAPH ’04, ACM, New York, NY, USA, 2004, pp. 777–786.
J. Bueno, L. Martinell, A. Duran, M. Farreras, X. Martorell, R. Badia, E. Ayguade, J. Labarta, Productive cluster programming with OmpSS, Euro-Par 2011 Parallel Processing (2011) 555–566.
B. He, W. Fang, Q. Luo, N. K. Govindaraju, T. Wang, Mars: a mapreduce framework on graphics processors, in: Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT ’08, ACM, New York, NY, USA, 2008, pp. 260–269.
S. Hong, H. Kim, An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness, in: Proceedings of the 36th annual international symposium on Computer architecture, ISCA ’09, ACM, New York, NY, USA, 2009, pp. 152–163.
S. Hong, H. Kim, An integrated GPU power and performance model, in: Proceedings of the 37th annual international symposium on Computer architecture, ISCA ’10, ACM, New York, NY, USA, 2010, pp. 280–289.
C.-K. Luk, S. Hong, H. Kim, Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping, in: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, ACM, 2009, pp. 45–55.
P. McCormick, J. Inman, J. Ahrens, J. Mohd-Yusof, G. Roth, S. Cummins, Scout: a data-parallel programming language for graphics processors, Parallel Computing 33 (10–11) (2007) 648–662.
A. Munshi, The OpenCL specification version: 1.2 (2011).
NVIDIA Corporation, CUDA C programming guide 5.0 (2012).
S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, W.-m. W. Hwu, Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, in: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, PPoPP ’08, ACM, New York, NY, USA, 2008, pp. 73–82.
T. R. Scogland, B. Rountree, W.-c. Feng, B. R. De Supinski, Heterogeneous task scheduling for accelerated OpenMP, in: Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, IEEE, 2012, pp. 144–155.
Z. Wang, L. Zheng, Q. Chen, M. Guo, CAP: co-scheduling based on asymptotic profiling in CPU+GPU hybrid systems, in: Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores, ACM, 2013, pp. 107–114.
Y. Zhang, J. Owens, A quantitative performance analysis model for GPU architectures, in: High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, 2011, pp. 382–393.
F. Zhang, B. Wu, J. Zhai, B. He, W. Chen, FinePar: irregularity-aware fine-grained workload partitioning on integrated architectures, in: Proceedings of the 2017 International Symposium on Code Generation and Optimization, IEEE Press, 2017, pp. 27–38.
Copyright information
© 2017 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Chen, Q., Guo, M. (2017). Load Balancing for Heterogeneous Parallel Architecture. In: Task Scheduling for Multi-core and Parallel Architectures. Springer, Singapore. https://doi.org/10.1007/978-981-10-6238-4_6
DOI: https://doi.org/10.1007/978-981-10-6238-4_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6237-7
Online ISBN: 978-981-10-6238-4
eBook Packages: Computer Science, Computer Science (R0)