ONFS: a hierarchical hybrid file system based on memory, SSD, and HDD for high performance computers

  • Xin Liu
  • Yu-tong Lu
  • Jie Yu
  • Peng-fei Wang
  • Jie-ting Wu
  • Ying Lu


As supercomputers develop towards exascale, the number of compute cores increases dramatically, making larger and more complex applications possible. Large-scale applications, workflow applications, and their checkpointing require substantial input/output (I/O) bandwidth and extremely low latency, posing a serious challenge to high performance computing (HPC) storage systems. Current underlying storage systems based on hard disk drives (HDDs) are increasingly unable to meet the requirements of next-generation exascale supercomputers. To rise to this challenge, we propose a hierarchical hybrid storage system, the on-line and near-line file system (ONFS). It leverages dynamic random access memory (DRAM) and solid state drives (SSDs) in compute nodes, and HDDs in storage servers, to build a three-level storage system in a unified namespace. It supports portable operating system interface (POSIX) semantics, and provides high bandwidth, low latency, and huge storage capacity. In this paper, we present the technical details of distributed metadata management, the strategy for borrowing and returning memory, data consistency, parallel access control, and the mechanisms guiding downward and upward migration in ONFS. We implement an ONFS prototype on the TH-1A supercomputer, and conduct experiments to test its I/O performance and scalability. The results show that the bandwidths of single-thread and multi-thread ‘read’/‘write’ are 6-fold and 5-fold higher than those of HDD-based Lustre, respectively. The I/O bandwidth of data-intensive applications in ONFS can be 6.35 times that in Lustre.
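To make the idea of a three-level store with downward and upward migration concrete, the following is a minimal single-process sketch. It is not the ONFS implementation: the class names (`Tier`, `TieredStore`), the capacity unit (number of files), and the least-recently-used eviction policy are all assumptions chosen for illustration, since the abstract does not state the migration policy. The sketch shows only one path namespace spanning all levels, overflow pushing data downward, and a read pulling data back up.

```python
from collections import OrderedDict


class Tier:
    """One storage level. Capacity is a file count (hypothetical unit)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity      # None means unbounded
        self.files = OrderedDict()    # path -> data, least recently used first


class TieredStore:
    """Toy three-level store with a single namespace across all tiers."""

    def __init__(self, capacities):
        # capacities: list of (name, capacity), fastest tier first.
        # The last tier must be unbounded (capacity None) in this sketch.
        self.tiers = [Tier(name, cap) for name, cap in capacities]

    def _put(self, level, path, data):
        tier = self.tiers[level]
        tier.files[path] = data
        tier.files.move_to_end(path)  # mark as most recently used
        # Downward migration (assumed LRU policy): overflow moves one level down.
        while tier.capacity is not None and len(tier.files) > tier.capacity:
            old_path, old_data = tier.files.popitem(last=False)
            self._put(level + 1, old_path, old_data)

    def write(self, path, data):
        # Drop any stale copy so each path lives in exactly one tier.
        for tier in self.tiers:
            tier.files.pop(path, None)
        self._put(0, path, data)

    def read(self, path):
        for tier in self.tiers:
            if path in tier.files:
                data = tier.files.pop(path)
                self._put(0, path, data)  # upward migration on access
                return data
        raise FileNotFoundError(path)

    def locate(self, path):
        """Return the name of the tier currently holding path, or None."""
        for tier in self.tiers:
            if path in tier.files:
                return tier.name
        return None


store = TieredStore([("DRAM", 1), ("SSD", 1), ("HDD", None)])
for name in ("/a", "/b", "/c"):
    store.write(name, name.encode())
print(store.locate("/a"), store.locate("/b"), store.locate("/c"))
store.read("/a")
print(store.locate("/a"), store.locate("/b"), store.locate("/c"))
```

In this toy model the caller always addresses a file by the same path, regardless of which level currently holds it. A real system would additionally need the distributed metadata, consistency, and access control mechanisms the paper describes, none of which are modelled here.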

Key words

High performance computing; Hierarchical hybrid storage system; Distributed metadata management; Data migration



Copyright information

© Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2017

Authors and Affiliations

  1. School of Computer, National University of Defense Technology, Changsha, China
  2. Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, USA
  3. National Supercomputing Center, Tianjin, China
