Load balancing in a heterogeneous world: CPU-Xeon Phi co-execution of data-parallel kernels

The Journal of Supercomputing

Abstract

Heterogeneous systems composed of a CPU and a set of hardware accelerators are very compelling thanks to their excellent performance and energy consumption characteristics. One of the most important problems in such systems is distributing the workload among their devices. This paper describes an extension of the Maat library that allows the co-execution of a data-parallel OpenCL kernel on a heterogeneous system composed of a CPU and an Intel Xeon Phi. Maat provides an abstract view of the heterogeneous system as well as a set of load balancing algorithms to squeeze the performance out of the node. It automatically partitions the data and distributes it among the devices, generates the kernels and efficiently merges the partial outputs. Experimental results show that this approach always outperforms the baseline that uses only the Xeon Phi, giving excellent performance and energy efficiency. Furthermore, selecting the right load balancing algorithm is essential because it has a huge impact on system performance and energy consumption.
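To illustrate what distributing a data-parallel workload among devices can involve, the following Python sketch contrasts two generic strategies: a static split proportional to each device's relative speed, and a dynamic scheme in which the device that becomes free first takes the next fixed-size chunk. This is not Maat's API or a reproduction of its algorithms. The function names `static_partition` and `dynamic_schedule`, the relative speeds, and the chunk size are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def static_partition(total_items: int, speeds: List[float]) -> List[Tuple[int, int]]:
    """Split [0, total_items) into contiguous ranges proportional to `speeds`.

    `speeds` holds assumed relative throughputs (items per time unit),
    one entry per device. Returns one (start, end) range per device.
    """
    if total_items < 0 or not speeds or any(s <= 0 for s in speeds):
        raise ValueError("need total_items >= 0 and positive speeds")
    total_speed = sum(speeds)
    ranges = []
    start = 0
    cumulative = 0.0
    for i, speed in enumerate(speeds):
        cumulative += speed
        # The last device takes whatever remains, so no item is lost to rounding.
        if i == len(speeds) - 1:
            end = total_items
        else:
            end = round(total_items * cumulative / total_speed)
        ranges.append((start, end))
        start = end
    return ranges


def dynamic_schedule(total_items: int, speeds: List[float], chunk: int) -> Tuple[List[int], float]:
    """Simulate devices pulling fixed-size chunks from a shared work queue.

    Returns the number of items each device processed and the simulated
    completion time (the time at which the slowest device finishes).
    """
    if chunk <= 0 or not speeds or any(s <= 0 for s in speeds):
        raise ValueError("need chunk > 0 and positive speeds")
    clock = [0.0] * len(speeds)   # simulated time at which each device is free
    done = [0] * len(speeds)      # items processed by each device
    next_item = 0
    while next_item < total_items:
        # The device that becomes free first takes the next chunk.
        dev = min(range(len(speeds)), key=lambda d: clock[d])
        size = min(chunk, total_items - next_item)
        clock[dev] += size / speeds[dev]
        done[dev] += size
        next_item += size
    return done, max(clock)


if __name__ == "__main__":
    # Hypothetical node: one device four times faster than the other.
    print(static_partition(1000, [1.0, 4.0]))
    print(dynamic_schedule(1000, [1.0, 4.0], chunk=100))
```

The static split depends on knowing the relative speeds in advance, whereas the dynamic scheme adapts at run time at the cost of more scheduling steps. This kind of trade-off is one reason the choice of load balancing algorithm can matter.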

Acknowledgements

This work has been supported by the Spanish Ministry of Education, FPU grant FPU16/03299, the University of Cantabria, grant CVE-2014-18166, the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and TIN2016-81840-REDT (CAPAP-H6 network), the European Research Council (G.A. No. 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No. 671697.

Author information

Corresponding author

Correspondence to Raúl Nozal.

About this article

Cite this article

Nozal, R., Perez, B., Bosque, J.L. et al. Load balancing in a heterogeneous world: CPU-Xeon Phi co-execution of data-parallel kernels. J Supercomput 75, 1123–1136 (2019). https://doi.org/10.1007/s11227-018-2318-5

  • DOI: https://doi.org/10.1007/s11227-018-2318-5
