Improving the Performance of Distributed MXNet with RDMA

  • Mingfan Li
  • Ke Wen
  • Han Lin
  • Xu Jin
  • Zheng Wu
  • Hong An
  • Mengxian Chi


As one of the most influential deep learning frameworks, MXNet has achieved excellent performance and many breakthroughs in academia and industry across a variety of machine learning tasks. The initial implementation of MXNet uses a proxy-socket interface, which delivers suboptimal performance in distributed environments. In a massively parallel training task, parameters are updated frequently during each training loop, so network performance becomes the dominant factor in overall performance. Over the past decade, high-performance interconnects have employed remote direct memory access (RDMA) technology to provide excellent performance for numerous scientific domains. In this paper, we describe an efficient design that extends the open-source MXNet to make it RDMA-capable via RDMA-based parameter server interfaces. With modest optimizations of memory usage and transmission overhead, RDMA-based MXNet achieves a substantial performance improvement over the original software. Our experiments reveal that, for the communication subsystem of MXNet, the new design achieves a 16x speedup (up to 21x at peak) over 1 Gigabit Ethernet (1GigE). For the two training cases on MXNet, the optimized implementation achieves 5x and 9x speedups, respectively. Compared with experiments over the IP-over-InfiniBand (IPoIB) protocol, it improves performance by nearly 30%, scales better, and alleviates bottlenecks.


Distributed MXNet, Parameter server, RDMA, InfiniBand, Network optimization



This work is supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000403).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. University of Science and Technology of China, Hefei, China
