Abstract
Asynchronous FTRL-proximal and L2 regularization performed at the server are two widely used tricks on Parameter Server, an implementation of delayed SGD. What they have in common is leaving part of the update computation on the server, which makes the transmitted data sparse and thus reduces the network burden. However, the convergence of these tricks has not been well established. In this paper, building on this commonality, we propose a more general algorithm, asynchronous COMID, and prove its regret bound. We show that asynchronous FTRL-proximal and server-side L2 regularization are applications of asynchronous COMID, which establishes the convergence of both tricks. We then conduct experiments to verify the theoretical results. Experimental results show that, compared with delayed SGD on Parameter Server, asynchronous COMID reduces the network burden without harming the convergence speed or the final output.
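To make the shared idea concrete, the following minimal sketch (an assumption for illustration, not code from the paper; the names `server_comid_step`, `prox_l1`, and `prox_l2` are hypothetical) shows a single COMID-style update in which a worker transmits only a gradient computed on stale weights, while the regularization part of the update stays on the server. With the Euclidean Bregman divergence, this composite step reduces to a proximal operator applied to a plain gradient step.

```python
import numpy as np

def prox_l2(v, tau):
    """Proximal operator of (tau/2) * ||.||_2^2: simple shrinkage."""
    return v / (1.0 + tau)

def prox_l1(v, tau):
    """Proximal operator of tau * ||.||_1: soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def server_comid_step(w, grad, eta, lam, prox=prox_l2):
    """One COMID-style update done on the server (illustrative sketch).

    The worker only transmits `grad`, a possibly stale gradient computed on
    an old copy of the weights; the regularization part of the update stays
    on the server:
        w <- argmin_v  eta*<grad, v> + 0.5*||v - w||^2 + eta*lam*r(v),
    which, for the Euclidean Bregman divergence used here, is the proximal
    operator of eta*lam*r applied to the plain gradient step.
    """
    return prox(w - eta * grad, eta * lam)

# Hypothetical usage: a delayed worker sends a gradient computed on stale
# weights, and the server applies the composite (COMID) step.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=5)           # current server weights
    stale_grad = rng.normal(size=5)  # gradient from a delayed worker
    w = server_comid_step(w, stale_grad, eta=0.1, lam=0.01, prox=prox_l1)
    print(w)
```

Because the proximal/regularization step is applied only at the server, the worker-to-server traffic is just the (often sparse) gradient, which is the data-sparsification effect the abstract refers to.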
Acknowledgement
This work was supported by the National Natural Science Foundation of China under Grant Nos. 61502450, 61432018, and 61521092, and by the National Key R&D Program of China under Grant Nos. 2016YFB0200800, 2017YFB0202302, and 2016YFE0100300.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Daning, C., Shigang, L., Yunquan, Z. (2019). Asynchronous COMID: The Theoretic Basis for Transmitted Data Sparsification Tricks on Parameter Server. In: Ren, R., Zheng, C., Zhan, J. (eds) Big Scientific Data Benchmarks, Architecture, and Systems. SDBA 2018. Communications in Computer and Information Science, vol 911. Springer, Singapore. https://doi.org/10.1007/978-981-13-5910-1_6
DOI: https://doi.org/10.1007/978-981-13-5910-1_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-5909-5
Online ISBN: 978-981-13-5910-1