CUDA-DTM: Distributed Transactional Memory for GPU Clusters

  • Samuel Irving
  • Sui Chen
  • Lu Peng
  • Costas Busch
  • Maurice Herlihy
  • Christopher J. Michael
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11704)


We present CUDA-DTM, the first Distributed Transactional Memory framework written in CUDA for large-scale GPU clusters. Transactional Memory has become an attractive auto-coherence scheme for GPU applications with irregular memory access patterns because it avoids serializing threads while preserving programmability. We extend GPU Software Transactional Memory so that threads across many GPUs can access a coherent distributed shared memory space, and we propose a scheme for GPU-to-GPU communication using CUDA-Aware MPI. We evaluate CUDA-DTM using a suite of seven irregular-memory-access benchmarks with varying degrees of compute intensity, contention, and node-to-node communication frequency. On a cluster of 256 devices, our experiments show that GPU clusters using CUDA-DTM can be up to 115x faster than CPU clusters.


Distributed Transactional Memory · GPU cluster · CUDA


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Samuel Irving (1)
  • Sui Chen (1)
  • Lu Peng (1), corresponding author
  • Costas Busch (1)
  • Maurice Herlihy (2)
  • Christopher J. Michael (1)

  1. Louisiana State University, Baton Rouge, USA
  2. Brown University, Providence, USA
