Abstract
Communication bandwidth is a major bottleneck in distributed machine learning and limits system scalability; in distributed SGD, the transmission of gradients often dominates the communication cost. Gradient compression is a promising technique for reducing this cost. Many compression approaches have recently been developed for deep neural networks, but they still suffer from high memory cost, slow convergence, and serious staleness on sparse high-dimensional models. In this work, we propose Sparse Gradient Compression (SGC) to efficiently train both sparse models and deep neural networks. SGC uses momentum approximation to reduce memory cost with negligible accuracy degradation. It then improves accuracy with long-term gradient compensation, which maintains a global momentum to make up for the information lost by the approximation. Finally, to alleviate staleness, SGC updates the model weights with delayed gradients accumulated locally, a technique we call local update. Experiments on sparse high-dimensional models and deep neural networks show that SGC can compress 99.99% of the gradients in every iteration without performance degradation, reducing communication cost by up to 48×.
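To make the mechanisms named in the abstract concrete, the sketch below shows one worker's step of momentum-corrected top-k gradient sparsification with local accumulation of unsent updates, the general family of techniques SGC belongs to. This is a minimal NumPy illustration under assumed names (`sparsify_step`, `beta`, and `ratio` are ours, not the paper's), not the authors' exact SGC algorithm.

```python
import numpy as np

def sparsify_step(grad, residual, momentum, beta=0.9, ratio=1e-4):
    """One worker's compression step: returns the (indices, values) to send.

    residual and momentum are persistent per-worker buffers, updated in place.
    A sketch of momentum-corrected top-k sparsification, not the paper's SGC.
    """
    # Fold the new stochastic gradient into the local momentum buffer.
    momentum[:] = beta * momentum + grad
    # Accumulate momentum into the residual of not-yet-transmitted updates.
    residual[:] += momentum
    # Transmit only the largest-magnitude fraction `ratio` of coordinates
    # (0.01% here), i.e. 99.99% of the entries are compressed away per step.
    k = max(1, int(ratio * residual.size))
    idx = np.argpartition(np.abs(residual), -k)[-k:]
    vals = residual[idx]
    # Sent coordinates restart their accumulation; unsent ones stay local
    # and are transmitted later, compensating the loss from sparsification.
    residual[idx] = 0.0
    momentum[idx] = 0.0
    return idx, vals

# Usage: each worker sends (idx, vals); the server applies them to the weights.
dim = 1_000_000
residual, momentum = np.zeros(dim), np.zeros(dim)
idx, vals = sparsify_step(np.random.randn(dim), residual, momentum)
print(len(idx))  # ~100 coordinates sent instead of 1,000,000
```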
Acknowledgements
This work is supported by the National Key Research and Development Program of China (No. 2018YFB1004403), NSFC (No. 61832001, 61702015, 61702016, 61572039), and the PKU-Tencent Joint Research Lab.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Sun, H. et al. (2019). Sparse Gradient Compression for Distributed SGD. In: Li, G., Yang, J., Gama, J., Natwichai, J., Tong, Y. (eds) Database Systems for Advanced Applications. DASFAA 2019. Lecture Notes in Computer Science, vol 11447. Springer, Cham. https://doi.org/10.1007/978-3-030-18579-4_9
DOI: https://doi.org/10.1007/978-3-030-18579-4_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-18578-7
Online ISBN: 978-3-030-18579-4
eBook Packages: Computer Science, Computer Science (R0)