Multi-stage Gradient Compression: Overcoming the Communication Bottleneck in Distributed Deep Learning

  • Qu Lu
  • Wantao Liu (corresponding author)
  • Jizhong Han
  • Jinrong Guo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11301)


Due to the huge size of deep learning models and the limited bandwidth of networks, communication cost has become a salient bottleneck in distributed training. Gradient compression is an effective way to relieve bandwidth pressure and increase the scalability of distributed training. In this paper, we propose a novel gradient compression technique, Multi-Stage Gradient Compression (MGC), with Sparsity Automatic Adjustment and Gradient Recession. These techniques divide the whole training process into three stages, each suited to a different compression strategy. To handle error and preserve accuracy, we accumulate the quantization error and sparse gradients locally with momentum correction. Our experiments show that MGC achieves a compression ratio of up to 3800x without incurring accuracy loss. We compress the gradient size of ResNet-50 from 97 MB to 0.03 MB, and that of AlexNet from 233 MB to 0.06 MB. We even obtain better accuracy than the baseline on GoogLeNet. Experiments also demonstrate the scalability of MGC.
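The idea of sending only a sparse part of the gradient while accumulating the remainder locally can be illustrated with a minimal sketch. This is a generic top-k sparsification with local residual accumulation, not the MGC algorithm itself: it omits quantization, momentum correction, and the three-stage schedule, and the function name `sparsify_with_residual` and the parameter `k` are hypothetical names chosen for illustration.

```python
def sparsify_with_residual(grad, residual, k):
    """Generic top-k sparsification sketch (an assumption, not the paper's MGC).

    Returns (sparse, new_residual): the k largest-magnitude entries are
    selected for communication, the rest are kept locally for later steps.
    """
    # Add the locally accumulated leftovers to the fresh gradient.
    acc = [g + r for g, r in zip(grad, residual)]
    # Indices of the k entries with the largest absolute value.
    top = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    # Sparse message: index -> value pairs that would be communicated.
    sparse = {i: acc[i] for i in top}
    # Sent entries are cleared; everything else waits for a later step.
    new_residual = [0.0 if i in sparse else v for i, v in enumerate(acc)]
    return sparse, new_residual


# Two training steps on a toy 4-element gradient, sending one value per step.
residual = [0.0, 0.0, 0.0, 0.0]
sparse1, residual = sparsify_with_residual([0.5, -0.1, 0.2, 0.05], residual, k=1)
sparse2, residual = sparsify_with_residual([0.0, -0.1, 0.25, 0.05], residual, k=1)
```

In the second step, the entry at index 2 is selected only because its unsent value from the first step was carried over in `residual`. This shows how local accumulation prevents small gradients from being discarded.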


Keywords: Gradient compression, Communication optimization, Distributed system, Network bottleneck, Deep learning



This research is supported by the National Key Research and Development Program of China (No. 2017YFB1010000).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
  2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
