Abstract
Communication bandwidth is a major bottleneck in distributed machine learning and limits system scalability; in distributed SGD, the transmission of gradients often dominates the communication cost. Gradient compression is a promising technique for reducing this cost. Many compression approaches have recently been developed for deep neural networks, but they still suffer from high memory cost, slow convergence, and serious staleness on sparse high-dimensional models. In this work, we propose Sparse Gradient Compression (SGC) to efficiently train both sparse models and deep neural networks. SGC uses momentum approximation to reduce memory cost with negligible accuracy degradation. It then improves accuracy with long-term gradient compensation, which maintains a global momentum to make up for the information lost by the approximation. Finally, to alleviate staleness, SGC updates the model weights with delayed gradients accumulated locally, a technique we call local update. Experiments on sparse high-dimensional models and deep neural networks show that SGC can compress 99.99% of the gradients in every iteration without performance degradation, reducing communication cost by up to 48×.
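To make the mechanisms named in the abstract concrete, the sketch below shows one worker's step of momentum-corrected top-k gradient sparsification with local accumulation of unsent updates, the general family of techniques SGC belongs to. This is a minimal NumPy illustration under assumed names (`sparsify_step`, `beta`, and `ratio` are ours, not the paper's), not the authors' exact SGC algorithm.

```python
import numpy as np

def sparsify_step(grad, residual, momentum, beta=0.9, ratio=1e-4):
    """One worker's compression step: returns the (indices, values) to send.

    residual and momentum are persistent per-worker buffers, updated in place.
    A sketch of momentum-corrected top-k sparsification, not the paper's SGC.
    """
    # Fold the new stochastic gradient into the local momentum buffer.
    momentum[:] = beta * momentum + grad
    # Accumulate momentum into the residual of not-yet-transmitted updates.
    residual[:] += momentum
    # Transmit only the largest-magnitude fraction `ratio` of coordinates
    # (0.01% here), i.e. 99.99% of the entries are compressed away per step.
    k = max(1, int(ratio * residual.size))
    idx = np.argpartition(np.abs(residual), -k)[-k:]
    vals = residual[idx]
    # Sent coordinates restart their accumulation; unsent ones stay local
    # and are transmitted later, compensating the loss from sparsification.
    residual[idx] = 0.0
    momentum[idx] = 0.0
    return idx, vals

# Usage: each worker sends (idx, vals); the server applies them to the weights.
dim = 1_000_000
residual, momentum = np.zeros(dim), np.zeros(dim)
idx, vals = sparsify_step(np.random.randn(dim), residual, momentum)
print(len(idx))  # ~100 coordinates sent instead of 1,000,000
```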
Acknowledgements
This work is supported by the National Key Research and Development Program of China (No. 2018YFB1004403), NSFC (No. 61832001, 61702015, 61702016, 61572039), and the PKU-Tencent Joint Research Lab.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Sun, H. et al. (2019). Sparse Gradient Compression for Distributed SGD. In: Li, G., Yang, J., Gama, J., Natwichai, J., Tong, Y. (eds) Database Systems for Advanced Applications. DASFAA 2019. Lecture Notes in Computer Science, vol 11447. Springer, Cham. https://doi.org/10.1007/978-3-030-18579-4_9
DOI: https://doi.org/10.1007/978-3-030-18579-4_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-18578-7
Online ISBN: 978-3-030-18579-4
eBook Packages: Computer Science, Computer Science (R0)