Benchmarking the NVIDIA V100 GPU and Tensor Cores

  • Matt Martineau
  • Patrick Atkinson
  • Simon McIntosh-Smith
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11339)


The V100 GPU is the newest server-grade GPU produced by NVIDIA and introduces a number of new hardware and API features. This paper details the results of benchmarking the V100 GPU and demonstrates that it is a significant generational improvement, with higher memory bandwidth, higher cache bandwidth, and lower latency. A major new addition is the set of Tensor core units, which have been marketed as deep learning acceleration features and which enable the computation of a \(4\times 4\times 4\) half precision matrix-multiply-accumulate operation in a single clock cycle. This paper confirms that the Tensor cores offer considerable performance gains for half precision general matrix multiplication; however, programming them requires fine control of the memory hierarchy that is typically unnecessary for other applications.
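To make the \(4\times 4\times 4\) matrix-multiply-accumulate operation concrete, the following is a minimal Python sketch of the arithmetic only, \(D = A \times B + C\) on 4×4 matrices (64 multiply-adds in total). It is not taken from the paper and does not model the hardware: the function names `to_half` and `mma_4x4x4` are hypothetical, the inputs are rounded to half precision using the standard-library `struct` module, and the accumulation is done in ordinary Python floats as a simplifying assumption.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def mma_4x4x4(a, b, c):
    """Return d = a * b + c for 4x4 matrices given as lists of rows.

    The entries of a and b are rounded to half precision first. Products and
    sums are kept in Python floats (an illustrative assumption, not a model
    of the accumulator used by the hardware).
    """
    n = 4
    a_h = [[to_half(v) for v in row] for row in a]
    b_h = [[to_half(v) for v in row] for row in b]
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = c[i][j]  # start from the accumulator input
            for k in range(n):
                acc += a_h[i][k] * b_h[k][j]
            d[i][j] = acc
    return d


# Usage: multiplying by the identity and adding ones shifts every entry by 1.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]
d = mma_4x4x4(identity, b, ones)
```

The small integers used here are exactly representable in half precision, so the rounding step leaves them unchanged; a value such as 0.1 would be altered by `to_half`, which is the precision loss that half precision inputs imply.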



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Matt Martineau (1)
  • Patrick Atkinson (1)
  • Simon McIntosh-Smith (1)
  1. HPC Group, University of Bristol, Bristol, UK
