
Towards Evaluation of Tensorflow Performance in a Distributed Compute Environment

  • Conference paper
Performance Evaluation and Benchmarking for the Era of Artificial Intelligence (TPCTC 2018)

Part of the book series: Lecture Notes in Computer Science, volume 11135


Abstract

TensorFlow (TF) is a highly popular Deep Learning (DL) software framework. Neural network training, a critical part of the DL workflow, is a computationally intensive process that can take days or even weeks. Achieving faster training times is therefore an active area of research and practice. TF supports multi-GPU parallelization both within a single machine and across multiple physical servers. However, the distributed case is difficult to set up, and consequently almost all published performance data comes from the single-machine use case. To fill this gap, we benchmark TensorFlow in a GPU-equipped distributed environment. Our work evaluates the performance of various hardware and software combinations. In particular, we examine several interconnect technologies to determine their impact on performance. Our results show that with the right choice of input parameters and appropriate hardware, GPU-equipped general-purpose compute clusters can provide deep learning training performance comparable to that of specialized machines designed for AI workloads.
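Comparisons like the one in the abstract are usually stated in terms of training throughput (images/sec, the standard TensorFlow benchmark metric) and scaling efficiency, i.e. the fraction of ideal linear speedup a multi-GPU run achieves. The following is a minimal sketch of that calculation; all throughput numbers are hypothetical placeholders, not measurements from the paper.

```python
# Scaling efficiency of a distributed training run relative to ideal
# linear scaling from a single-GPU baseline. Throughput values below
# are hypothetical, chosen only to illustrate the arithmetic.

def scaling_efficiency(single_gpu_ips: float, n_gpus: int,
                       measured_ips: float) -> float:
    """Fraction of ideal linear speedup achieved by an n-GPU run.

    single_gpu_ips: baseline throughput of one GPU (images/sec)
    n_gpus:         number of GPUs in the distributed run
    measured_ips:   measured aggregate throughput (images/sec)
    """
    ideal_ips = single_gpu_ips * n_gpus  # perfect linear scaling
    return measured_ips / ideal_ips

# Example: a single GPU trains at 200 images/sec; an 8-GPU cluster
# run reaches 1400 images/sec in aggregate.
eff = scaling_efficiency(200.0, 8, 1400.0)
print(f"speedup: {1400.0 / 200.0:.1f}x, efficiency: {eff:.0%}")
```

Interconnect choice matters precisely because gradient exchange between workers eats into this efficiency: the slower the link, the further the measured throughput falls below the ideal linear line.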



Author information

Corresponding author: Ajay Dholakia.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Hodak, M., Dholakia, A. (2019). Towards Evaluation of Tensorflow Performance in a Distributed Compute Environment. In: Nambiar, R., Poess, M. (eds) Performance Evaluation and Benchmarking for the Era of Artificial Intelligence. TPCTC 2018. Lecture Notes in Computer Science, vol 11135. Springer, Cham. https://doi.org/10.1007/978-3-030-11404-6_7


  • DOI: https://doi.org/10.1007/978-3-030-11404-6_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-11403-9

  • Online ISBN: 978-3-030-11404-6

  • eBook Packages: Computer Science (R0)
