Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer

  • Feng Wang
  • Can-Qun Yang
  • Yun-Fei Du
  • Juan Chen
  • Hui-Zhan Yi
  • Wei-Xia Xu


In this paper we present the programming of the Linpack benchmark on the TianHe-1 system, the first petascale supercomputer in China and the largest GPU-accelerated heterogeneous system ever attempted. We describe a hybrid programming model consisting of MPI, OpenMP, and streaming computing that exploits the task, thread, and data parallelism of Linpack. We explain how we optimized the load distribution across the CPUs and GPUs using a two-level adaptive method and describe the implementation in detail. To overcome the low bandwidth of CPU-GPU communication, we present a software pipelining technique that hides the communication overhead. Combined with other traditional optimizations, our Linpack implementation achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained with the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list in November 2009.
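The idea of adaptive load distribution between a CPU and a GPU can be illustrated with a minimal sketch. This is not the two-level method of the paper, whose details the abstract does not give. It is a generic single-level rule that assumes each device's throughput can be estimated from the time it took on its previous share of the work. The function names `update_gpu_fraction` and `simulate`, and the fixed device speeds, are hypothetical and chosen only for illustration.

```python
def update_gpu_fraction(gpu_fraction, gpu_time, cpu_time):
    """Return a new GPU work share from the times measured for the last split.

    gpu_fraction: share of the work the GPU handled (strictly between 0 and 1).
    gpu_time, cpu_time: measured seconds each device took for its own share.
    Assumption: throughput is roughly constant from one step to the next.
    """
    if not 0.0 < gpu_fraction < 1.0:
        raise ValueError("gpu_fraction must be strictly between 0 and 1")
    if gpu_time <= 0.0 or cpu_time <= 0.0:
        raise ValueError("measured times must be positive")
    gpu_rate = gpu_fraction / gpu_time          # work per second on the GPU
    cpu_rate = (1.0 - gpu_fraction) / cpu_time  # work per second on the CPU
    # Split the next step in proportion to the measured rates, so that
    # both devices are expected to finish at the same time.
    return gpu_rate / (gpu_rate + cpu_rate)


def simulate(gpu_speed, cpu_speed, start_fraction=0.5, steps=3):
    """Hypothetical driver: devices with fixed speeds, re-balanced each step."""
    fraction = start_fraction
    history = [fraction]
    for _ in range(steps):
        gpu_time = fraction / gpu_speed
        cpu_time = (1.0 - fraction) / cpu_speed
        fraction = update_gpu_fraction(fraction, gpu_time, cpu_time)
        history.append(fraction)
    return history


if __name__ == "__main__":
    # A GPU assumed to be four times faster than the CPU, starting from an even split.
    print(simulate(gpu_speed=4.0, cpu_speed=1.0))
```

In this simplified model, a GPU that is four times faster than the CPU ends up with four fifths of the work after one update. On real hardware the measured rates vary with problem size, so the update would be repeated as the computation proceeds.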


Keywords: petascale, Linpack, GPU, heterogeneous, supercomputer




Copyright information

© Springer Science+Business Media, LLC & Science Press, China 2011

Authors and Affiliations

  • Feng Wang (1)
  • Can-Qun Yang (1)
  • Yun-Fei Du (1)
  • Juan Chen (1)
  • Hui-Zhan Yi (1)
  • Wei-Xia Xu (1)

  1. School of Computer Science, National University of Defense Technology, Changsha, China
