More Effective Distributed Deep Learning Using Staleness Based Parameter Updating

  • Yan Ye (corresponding author)
  • Mengqiang Chen
  • Zijie Yan
  • Weigang Wu
  • Nong Xiao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11335)


Deep learning technology has been widely applied for various purposes, especially big data analysis. However, the computation required for deep learning is becoming ever larger and more complex. To accelerate the training of large-scale deep networks, various distributed parallel training protocols have been proposed. In this paper, we design a novel asynchronous training protocol, Weighted Asynchronous Parallel (WASP), which updates neural network parameters in a more effective way. The core of WASP is "gradient staleness", a metric based on parameter version numbers that is used to weight gradients and reduce the influence of stale parameters. Moreover, through periodic forced synchronization of parameters, WASP combines the advantages of synchronous and asynchronous training models and can speed up training while maintaining a rapid convergence rate. We conduct experiments with two classical convolutional neural networks, LeNet-5 and ResNet-101, on the Tianhe-2 supercomputing system, and the results show that WASP achieves much higher acceleration than existing asynchronous parallel training protocols.


Distributed deep learning · Parallel computing · Parameter server · Asynchronous parallel · Supercomputing system
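The staleness-based weighting described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' exact WASP implementation: the weighting function `1 / (staleness + 1)` and the names `wasp_update` and `SYNC_PERIOD` are assumptions chosen for clarity.

```python
import numpy as np

SYNC_PERIOD = 100  # hypothetical interval for forced synchronization


def wasp_update(params, grad, worker_version, global_version, lr=0.1):
    """One staleness-weighted asynchronous gradient update.

    Staleness is the number of global parameter versions committed
    since this worker pulled its copy. Older gradients are down-weighted
    (here by 1 / (staleness + 1), one plausible choice), so stale
    updates perturb the shared parameters less than fresh ones.
    Returns the new parameters and the incremented global version.
    """
    staleness = global_version - worker_version
    weight = 1.0 / (staleness + 1)
    new_params = params - lr * weight * grad
    new_version = global_version + 1
    # Periodic forced synchronization: every SYNC_PERIOD versions, all
    # workers would block and pull the latest parameters, recovering
    # the consistency benefits of synchronous training.
    force_sync = (new_version % SYNC_PERIOD == 0)
    return new_params, new_version, force_sync
```

For example, a gradient computed from up-to-date parameters (staleness 0) is applied at full strength, while one computed three versions ago is scaled by 1/4 before being applied.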



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Yan Ye (1, 2) (corresponding author)
  • Mengqiang Chen (1, 3)
  • Zijie Yan (1, 2)
  • Weigang Wu (1, 3)
  • Nong Xiao (1, 2)
  1. School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
  2. Guangdong Province Key Laboratory of Big Data Analysis and Processing, Guangzhou, China
  3. MoE Key Laboratory of Machine Intelligence and Advanced Computing, Guangzhou, China