Abstract
TensorFlow (TF) is a highly popular Deep Learning (DL) software framework. Neural network training, a critical part of the DL workflow, is a computationally intensive process that can take days or even weeks. Achieving faster training times is therefore an active area of research and practice. TF supports multi-GPU parallelization, both within a single machine and across multiple physical servers. However, the distributed case is hard to use, and consequently almost all published performance data comes from the single-machine use case. To fill this gap, we benchmark TensorFlow in a GPU-equipped distributed environment. Our work evaluates the performance of various hardware and software combinations. In particular, we examine several types of interconnect technology to determine their impact on performance. Our results show that with the right choice of input parameters and appropriate hardware, GPU-equipped general-purpose compute clusters can provide deep learning training performance comparable to that of specialized machines designed for AI workloads.
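For context, the following Python sketch shows one common way to run the data-parallel distributed training mode discussed above, using the Horovod library with one process per GPU. It is a minimal illustration only: the model (ResNet50), dataset (CIFAR-10), batch size, and learning rate are placeholder assumptions, not the benchmark configuration evaluated in this paper.

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    # Initialize Horovod; one process is launched per GPU, e.g.
    #   horovodrun -np 8 -H server1:4,server2:4 python train.py
    hvd.init()

    # Pin each process to a single local GPU.
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    # Toy dataset standing in for ImageNet; each worker reads a distinct shard.
    (x, y), _ = tf.keras.datasets.cifar10.load_data()
    dataset = (tf.data.Dataset.from_tensor_slices((x / 255.0, y))
               .shard(hvd.size(), hvd.rank())
               .shuffle(10000)
               .batch(64))

    model = tf.keras.applications.ResNet50(weights=None, classes=10,
                                           input_shape=(32, 32, 3))

    # Scale the learning rate with worker count (the linear scaling rule) and
    # wrap the optimizer so gradients are averaged across workers by allreduce.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(optimizer=opt, loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    # Broadcast initial weights from rank 0 so all workers start identically.
    model.fit(dataset, epochs=1,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
              verbose=1 if hvd.rank() == 0 else 0)

Under this scheme, the inter-server interconnect carries the gradient allreduce traffic at every step, which is why the choice of interconnect technology can dominate multi-node scaling behavior.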
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Hodak, M., Dholakia, A. (2019). Towards Evaluation of Tensorflow Performance in a Distributed Compute Environment. In: Nambiar, R., Poess, M. (eds.) Performance Evaluation and Benchmarking for the Era of Artificial Intelligence. TPCTC 2018. Lecture Notes in Computer Science, vol. 11135. Springer, Cham. https://doi.org/10.1007/978-3-030-11404-6_7
DOI: https://doi.org/10.1007/978-3-030-11404-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11403-9
Online ISBN: 978-3-030-11404-6
eBook Packages: Computer Science, Computer Science (R0)