Abstract
Dawning Nebulae is a heterogeneous system composed of 9280 multi-core x86 CPUs and 4640 NVIDIA Fermi GPUs. With a Linpack performance of 1.271 petaFLOPS, it was ranked second in the TOP500 list released in June 2010. This paper introduces key issues in the system design of Dawning Nebulae. System tuning methodologies aimed at a petaFLOPS Linpack result are presented, including algorithmic optimization and communication improvement. The design of its file I/O subsystem, including HVFS and the underlying DCFS3, is also described. Performance evaluations show that the Linpack efficiency of each node reaches 69.89%, and that the 1024-node aggregate read and write bandwidths exceed 100 GB/s and 70 GB/s, respectively. The success of Dawning Nebulae demonstrates the viability of the CPU/GPU heterogeneous structure for future supercomputer designs.
Additional information
This work is supported by the National Hi-Tech Research and Development 863 Program of China under Grant No. 2009AA01A129, the National Natural Science Foundation of China under Grant Nos. 60633040, 60803030, and 61033009, the National Basic Research 973 Program of China under Grant No. 2011CB302500, the National Natural Science Foundation for Distinguished Young Scholars of China under Grant No. 60925009, and the Foundation for Innovative Research Groups of the National Natural Science Foundation of China under Grant No. 60921002.
Cite this article
Sun, NH., Xing, J., Huo, ZG. et al. Dawning Nebulae: A PetaFLOPS Supercomputer with a Heterogeneous Structure. J. Comput. Sci. Technol. 26, 352–362 (2011). https://doi.org/10.1007/s11390-011-1138-3