dOCAL: high-level distributed programming with OpenCL and CUDA

  • Ari Rasch
  • Julian Bigge
  • Martin Wrodarczyk
  • Richard Schulze
  • Sergei Gorlatch


In the state-of-the-art parallel programming approaches OpenCL and CUDA, so-called host code is required to execute a program. Implementing host code efficiently is often cumbersome, especially when OpenCL and CUDA programs are executed on systems with multiple nodes, each comprising different devices, e.g., multi-core CPUs and graphics processing units (GPUs). The programmer is responsible for explicitly managing node and device memory, for synchronizing computations with data transfers between devices of potentially different nodes, and for optimizing data transfers between devices’ memories and nodes’ main memories, e.g., by using pinned main memory to accelerate data transfers and to overlap the transfers with computations. We develop the distributed OpenCL/CUDA abstraction layer (dOCAL), a novel high-level C++ library that simplifies the development of host code. dOCAL combines the following major advantages over state-of-the-art high-level approaches: (1) it simplifies implementing both OpenCL and CUDA host code by providing a simple-to-use, high-level abstraction API; (2) it supports executing arbitrary OpenCL and CUDA programs; (3) it allows conveniently targeting the devices of different nodes by automatically managing node-to-node communications; (4) it simplifies implementing data transfer optimizations by providing different, specially allocated memory regions, e.g., pinned main memory for overlapping data transfers with computations; (5) it optimizes memory management by automatically avoiding unnecessary data transfers; (6) it enables interoperability between OpenCL and CUDA host code for systems with devices from different vendors. Our experiments show that dOCAL significantly simplifies the development of host code for heterogeneous and distributed systems, with a low runtime overhead.
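To make advantage (5) more concrete, the following is a minimal sketch of one common way a library can avoid unnecessary host/device transfers: it tracks which copy of a buffer is currently valid and copies only when a stale copy is about to be accessed. This is an illustrative assumption, not dOCAL's actual API or implementation. The names `TrackedBuffer`, `scale_kernel`, and `demo_transfer_count` are hypothetical, and the "device" is simulated by a second `std::vector` so that the sketch runs without OpenCL or CUDA.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration only: this is NOT dOCAL's API.
// The "device" memory is simulated by a second std::vector.
class TrackedBuffer {
public:
    explicit TrackedBuffer(std::size_t n) : host_(n, 0.0f), device_(n, 0.0f) {}

    // Host-side write access: afterwards the device copy is stale.
    std::vector<float>& host_write() {
        sync_to_host();
        device_valid_ = false;
        return host_;
    }

    // Host-side read access: copies back only if the host copy is stale.
    const std::vector<float>& host_read() {
        sync_to_host();
        return host_;
    }

    // Device-side write access (used by a simulated kernel):
    // afterwards the host copy is stale.
    std::vector<float>& device_write() {
        sync_to_device();
        host_valid_ = false;
        return device_;
    }

    // Number of simulated host<->device copies performed so far.
    int transfers() const { return transfers_; }

private:
    void sync_to_device() {
        if (!device_valid_) { device_ = host_; device_valid_ = true; ++transfers_; }
    }
    void sync_to_host() {
        if (!host_valid_) { host_ = device_; host_valid_ = true; ++transfers_; }
    }

    std::vector<float> host_, device_;
    bool host_valid_ = true;     // host copy starts out as the valid one
    bool device_valid_ = false;  // device copy must be filled before first use
    int transfers_ = 0;
};

// Simulated kernel: scales the device copy in place.
void scale_kernel(TrackedBuffer& buf, float factor) {
    for (float& x : buf.device_write()) x *= factor;
}

// Usage: two kernel launches and two host reads need only two transfers.
int demo_transfer_count() {
    TrackedBuffer buf(4);
    for (float& x : buf.host_write()) x = 1.0f;  // no transfer: host copy is valid
    scale_kernel(buf, 2.0f);                     // host -> device copy
    scale_kernel(buf, 3.0f);                     // no transfer: device copy is valid
    float first = buf.host_read()[0];            // device -> host copy
    buf.host_read();                             // no transfer: host copy is valid
    return (first == 6.0f) ? buf.transfers() : -1;
}
```

In this sketch, a naive host program that copies the buffer to the device before each kernel and back after it would perform four transfers, whereas the validity flags reduce this to two. A real implementation would additionally have to deal with asynchronous transfers, multiple devices, and node-to-node communication, which the sketch deliberately leaves out.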


Keywords: OpenCL, CUDA, host code, distributed system, heterogeneous system, interoperability, data transfer optimization




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Mathematics and Computer Science, University of Münster, Münster, Germany
