Journal of Zhejiang University SCIENCE C

, Volume 14, Issue 11, pp 859–872 | Cite as

Efficient fine-grained shared buffer management for multiple OpenCL devices

  • Chang-qing Xun
  • Dong Chen
  • Qiang Lan
  • Chun-yuan Zhang


OpenCL programming provides full code portability between different hardware platforms, and can serve as a good programming candidate for heterogeneous systems, which typically consist of a host processor and several accelerators. However, to make full use of the computing capacity of such a system, programmers are requested to manage diverse OpenCL-enabled devices explicitly, including distributing the workload between different devices and managing data transfer between multiple devices. All these tedious jobs pose a huge challenge for programmers. In this paper, a distributed shared OpenCL memory (DSOM) is presented, which relieves users of having to manage data transfer explicitly, by supporting shared buffers across devices. DSOM allocates shared buffers in the system memory and treats the on-device memory as a software managed virtual cache buffer. To support fine-grained shared buffer management, we designed a kernel parser in DSOM for buffer access range analysis. A basic modified, shared, invalid cache coherency is implemented for DSOM to maintain coherency for cache buffers. In addition, we propose a novel strategy to minimize communication cost between devices by launching each necessary data transfer as early as possible. This strategy enables overlap of data transfer with kernel execution. Our experimental results show that the applicability of our method for buffer access range analysis is good, and the efficiency of DSOM is high.

Key words

Shared buffer OpenCL Heterogeneous programming Fine grained 

CLC number



Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. Agarwal, A., Bianchini, R., Chaiken, D., Johnson, K.L., Kranz, D., Kubiatowicz, J., Lim, B.H., Mackenzie, K., Yeung, D., 1995. The MIT Alewife Machine: Architecture and Performance. Proc. 22nd Annual Int. Symp. on Computer Architecture, p.2–13. [doi:10.1145/223982.223985]CrossRefGoogle Scholar
  2. Bal, H.E., Tanenbaum, A.S., 1988. Distributed Programming with Shared Data. Proc. Int. Conf. on Computer Languages, p.82–91. [doi:10.1109/ICCL.1988.13046]Google Scholar
  3. Balasundaram, V., Kennedy, K., 1989. A Technique for Summarizing Data Access and Its Use in Parallelism Enhancing Transformations. Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation, p.41–53. [doi:10.1145/73141.74822]Google Scholar
  4. Bershad, B.N., Zekauskas, M.J., Sawdon, W.A., 1993. The Midway Distributed Shared Memory System. Compcon Spring, Digest of Papers, p.528–537. [doi:10.1109/CMPCON.1993.289730]CrossRefGoogle Scholar
  5. Cadar, C., Dunbar, D., Engler, D., 2008. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. Proc. 8th USENIX Conf. on Operating Systems Design and Implementation, p.209–224.Google Scholar
  6. Callahan, D., Kennedy, K., 1988. Analysis of interprocedural side effects in a parallel programming environment. J. Parall. Distr. Comput., 5(5):517–550. [doi:10.1016/0743-7315(88)90011-1]CrossRefGoogle Scholar
  7. Danalis, A., Marin, G., McCurdy, C., Meredith, J.S., Roth, P.C., Spafford, K., Tipparaju, V., Vetter, J.S., 2010. The Scalable Heterogeneous Computing (SHOC) Benchmark Suite. Proc. 3rd Workshop on General-Purpose Computation on Graphics Processing Units, p.63–74. [doi:10.1145/1735688.1735702]CrossRefGoogle Scholar
  8. Dantzig, G.B., Curtis, E.B., 1973. Fourier-Motzkin elimination and its dual. J. Comb. Theory A, 14(3):288–297.CrossRefMATHGoogle Scholar
  9. Dasgupta, P., LeBlanc, R.J.Jr., Ahamad, M., Ramachandran, U., 1991. The clouds distributed operating system. Computer, 24(11):34–44. [doi:10.1109/2.116849]CrossRefGoogle Scholar
  10. Delp, G., Sethi, A., Farber, D., 1988. An Analysis of Memnet—an Experiment in High-Speed Shared-Memory Local Networking. Symp. Proc. on Communications architectures and protocols, p.165–174. [doi:10.1145/52324.52342]Google Scholar
  11. Frank, S., Burkhardt, H., Rothnie, J., 1993. The KSR 1: Bridging the Gap Between Shared Memory and MPPs. Compcon Spring, Digest of Papers, p.285–294. [doi:10.1109/CMPCON.1993.289682]CrossRefGoogle Scholar
  12. Gelado, I., Stone, J.E., Cabezas, J., Patel, S., Navarro, N., Hwu, W.W., 2010. An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems. Proc. 15th ASPLOS on Architectural Support for Programming Languages and Operating Systems, p.347–358. [doi:10.1145/1736020.1736059]CrossRefGoogle Scholar
  13. Jablin, T.B., Prabhu, P., Jablin, J.A., Johnson, N.P., Beard, S.R., August, D.I., 2011. Automatic CPU-GPU Communication Management and Optimization. Proc. 32nd ACM SIGPLAN Conf. on Programming Language Design and Implementation, p.142–151. [doi:10.1145/1993498.1993516]CrossRefGoogle Scholar
  14. Jablin, T.B., Jablin, J.A., Prabhu, P., Liu, F., August, D.I., 2012. Dynamically Managed Data for CPU-GPU Architectures. Proc. 10th Int. Symp. on Code Generation and Optimization, p.165–174. [doi:10.1145/2259016.2259038]Google Scholar
  15. Kim, J., Kim, H., Lee, J.H., Lee, J., 2011. Achieving a Single Compute Device Image in OpenCL for Multiple GPUs. Proc. 16th ACM Symp. on Principles and Practice of Parallel Programming, p.277–288. [doi:10.1145/1941553.1941591]Google Scholar
  16. Lattner, C., Adve, V., 2004. LLVM: a Compilation Framework for Lifelong Program Analysis & Transformation. Proc. Int. Symp. on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, p.75–87.Google Scholar
  17. Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., et al., 2010. Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. SIGARCH Comput. Archit. News, 38(3):451–460. [doi:10.1145/1816038.1816021]CrossRefGoogle Scholar
  18. Paek, Y., Hoeflinger, J., Padua, D., 2002. Efficient and precise array access analysis. ACM Trans. Progr. Lang. Syst., 24(1):65–109. [doi:10.1145/509705.509708]CrossRefGoogle Scholar
  19. Pai, S., Govindarajan, R., Thazhuthaveetil, M.J., 2012. Fast and Efficient Automatic Memory Management for GPUs Using Compiler-Assisted Runtime Coherence Scheme. Proc. 21st Int. Conf. on Parallel Architectures and Compilation Techniques, p.33–42. [doi:10.1145/2370816.2370 824]CrossRefGoogle Scholar
  20. Pugh, W., 1992. A practical algorithm for exact array dependence analysis. ACM Commun., 35(8):102–114. [doi:10.1145/135226.135233]CrossRefGoogle Scholar
  21. Seo, S., Jo, G., Lee, J., 2011. Performance Characterization of the NAS Parallel Benchmarks in OpenCL. IEEE Int. Symp. on Workload Characterization, p.137–148. [doi:10.1109/IISWC.2011.6114174]Google Scholar
  22. Shen, Z., Li, Z., Yew, P.C., 1990. An empirical study of Fortran programs for parallelizing compilers. IEEE Trans. Parall. Distr. Syst., 1(3):356–364. [doi:10.1109/71.80162]CrossRefGoogle Scholar
  23. Stratton, J.A., Stone, S.S., Hwu, W.W., 2008. Languages and Compilers for Parallel Computing. Springer-Verlag Berlin Heidelberg, p.16–30. [doi:10.1007/978-3-540-89740-8]CrossRefGoogle Scholar
  24. Stratton, J.A., Rodrigues, C., Sung, R., Obeid, N., Chang, L.W., Anssari, N., Liu, D., Hwu, W.W., 2012. Parboil: a Revised Benchmark Suite for Scientific and Commercial Throughput Computing. IMPACT Technical Report No. IMPACT-12-01, Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, Champaign, Illinois, USA.Google Scholar
  25. Triolet, R., Irigoin, F., Feautrier, P., 1986. Direct parallelization of call statements. SIGPLAN Not., 21(7):176–185. [doi:10. 1145/13310.13329]CrossRefGoogle Scholar
  26. Wilson, A.W.Jr., LaRowe, R.P.Jr., Teller, M.J., 1993. Hardware Assist for Distributed Shared Memory. Proc. 13th Int. Conf. on Distributed Computing Systems, p.246–255. [doi:10.1109/ICDCS.1993.287702]Google Scholar
  27. Wolfe, M., 2010. Implementing the PGI Accelerator Model. Proc. 3rd Workshop on General-Purpose Computation on Graphics Processing Units, p.43–50. [doi:10.1145/1735 688.1735697]CrossRefGoogle Scholar
  28. Yan, Y., Grossman, M., Sarkar, V., 2009. JCUDA: a Programmer-Friendly Interface for Accelerating Java Programs with CUDA. Proc. 15th Int. Euro-Par Conf. on Parallel Processing, p.887–899. [doi:10.1007/978-3-642-03869-3-82]Google Scholar

Copyright information

© Journal of Zhejiang University Science Editorial Office and Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Chang-qing Xun
    • 1
    • 2
  • Dong Chen
    • 1
    • 2
  • Qiang Lan
    • 1
    • 2
  • Chun-yuan Zhang
    • 1
    • 2
  1. 1.College of ComputerNational University of Defense TechnologyChangshaChina
  2. 2.State Key Laboratory of High Performance ComputingNational University of Defense TechnologyChangshaChina

Personalised recommendations