
Sparse Gradient Compression for Distributed SGD

  • Conference paper
  • In: Database Systems for Advanced Applications (DASFAA 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11447)

Abstract

Communication bandwidth is a bottleneck in distributed machine learning and limits system scalability. The transmission of gradients often dominates the communication in distributed SGD. One promising technique for reducing this cost is gradient compression. Many compression approaches have recently been developed for deep neural networks, but they still suffer from high memory cost, slow convergence, and serious staleness when applied to sparse high-dimensional models. In this work, we propose Sparse Gradient Compression (SGC) to efficiently train both sparse models and deep neural networks. SGC uses momentum approximation to reduce the memory cost with negligible accuracy degradation. It then improves accuracy with long-term gradient compensation, which maintains a global momentum to make up for the information loss caused by the approximation. Finally, to alleviate the staleness problem, SGC updates the model weights locally with the accumulation of delayed gradients, a technique we call local update. Experiments on sparse high-dimensional models and deep neural networks show that SGC can compress 99.99% of the gradients in every iteration without performance degradation and reduces the communication cost by up to 48×.
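
The steps named in the abstract follow the common pattern of sparsified SGD with locally accumulated compensation. The Python/NumPy sketch below is an illustrative assumption of that pattern only: the function name sparsify_gradient, the momentum factor, and the top-k selection are placeholders and do not reproduce SGC's momentum approximation, long-term gradient compensation, or local update technique as defined in the paper.

import numpy as np

def sparsify_gradient(grad, residual, compress_ratio=0.9999, momentum=0.9):
    # Accumulate the incoming gradient into the locally kept residual.
    # (A plain momentum factor here stands in for SGC's momentum
    # approximation, which the paper defines more carefully.)
    acc = momentum * residual + grad
    # Transmit only the top (1 - compress_ratio) fraction of entries by
    # absolute value, i.e. 0.01% of the gradient when compress_ratio = 0.9999.
    k = max(1, int(acc.size * (1.0 - compress_ratio)))
    idx = np.argpartition(np.abs(acc), -k)[-k:]
    values = acc[idx]
    # Entries that were not transmitted stay in the residual and are
    # compensated for in later iterations.
    new_residual = acc.copy()
    new_residual[idx] = 0.0
    return (idx, values), new_residual

# One worker compressing a gradient with one million parameters.
rng = np.random.default_rng(0)
grad = rng.standard_normal(1_000_000)
residual = np.zeros_like(grad)
(sent_idx, sent_vals), residual = sparsify_gradient(grad, residual)
print("sent %d of %d entries (%.4f%%)"
      % (sent_idx.size, grad.size, 100.0 * sent_idx.size / grad.size))

In this sketch, only the index-value pairs would be sent to the parameter server, while the zeroed-out remainder is retained on the worker so no gradient information is permanently discarded.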



Acknowledgements

This work is supported by the National Key Research and Development Program of China (No. 2018YFB1004403), NSFC (Nos. 61832001, 61702015, 61702016, 61572039), and the PKU-Tencent Joint Research Lab.

Author information

Corresponding author: Haobo Sun



Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Sun, H. et al. (2019). Sparse Gradient Compression for Distributed SGD. In: Li, G., Yang, J., Gama, J., Natwichai, J., Tong, Y. (eds) Database Systems for Advanced Applications. DASFAA 2019. Lecture Notes in Computer Science, vol 11447. Springer, Cham. https://doi.org/10.1007/978-3-030-18579-4_9


  • DOI: https://doi.org/10.1007/978-3-030-18579-4_9


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-18578-7

  • Online ISBN: 978-3-030-18579-4

  • eBook Packages: Computer Science, Computer Science (R0)
