ONFS: a hierarchical hybrid file system based on memory, SSD, and HDD for high performance computers

  • Xin Liu
  • Yu-tong Lu
  • Jie Yu
  • Peng-fei Wang
  • Jie-ting Wu
  • Ying Lu


As supercomputers develop towards exascale, the number of compute cores increases dramatically, making larger and more complex applications possible. The input/output (I/O) requirements of large-scale applications, workflow applications, and their checkpointing include substantial bandwidth and extremely low latency, posing a serious challenge to high performance computing (HPC) storage systems. Current underlying storage systems based on hard disk drives (HDDs) are increasingly unable to meet the requirements of next-generation exascale supercomputers. To rise to the challenge, we propose a hierarchical hybrid storage system, the on-line and near-line file system (ONFS). It leverages dynamic random access memory (DRAM) and solid state drives (SSDs) in compute nodes, and HDDs in storage servers, to build a three-level storage system in a unified namespace. It supports portable operating system interface (POSIX) semantics, and provides high bandwidth, low latency, and huge storage capacity. In this paper, we present the technical details of distributed metadata management, the memory borrow-and-return strategy, data consistency, parallel access control, and the mechanisms guiding downward and upward migration in ONFS. We implement an ONFS prototype on the TH-1A supercomputer, and conduct experiments to test its I/O performance and scalability. The results show that the single-thread and multi-thread read/write bandwidths are 6-fold and 5-fold better than those of HDD-based Lustre, respectively. The I/O bandwidth of data-intensive applications in ONFS can be 6.35 times that in Lustre.
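The idea of a three-level hierarchy with downward and upward migration under one namespace can be illustrated with a toy model. The sketch below is not the ONFS implementation: the class names (`Tier`, `ToyTieredStore`), the capacities counted in files, and the least-recently-used eviction rule are all assumptions chosen for illustration, since the abstract does not state the actual migration policy.

```python
from collections import OrderedDict


class Tier:
    """One storage level; capacity is counted in files (an assumption)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.files = OrderedDict()  # path -> data, least recently used first


class ToyTieredStore:
    """Hypothetical DRAM/SSD/HDD hierarchy behind a single path namespace."""

    def __init__(self, dram_capacity, ssd_capacity):
        # The HDD level is modelled as unbounded.
        self.tiers = [
            Tier("DRAM", dram_capacity),
            Tier("SSD", ssd_capacity),
            Tier("HDD", float("inf")),
        ]

    def locate(self, path):
        """Return the level index (0=DRAM, 1=SSD, 2=HDD) holding path, or None."""
        for level, tier in enumerate(self.tiers):
            if path in tier.files:
                return level
        return None

    def _insert(self, level, path, data):
        tier = self.tiers[level]
        tier.files[path] = data
        tier.files.move_to_end(path)  # mark as most recently used
        # Downward migration: push the LRU file one level down when full.
        while len(tier.files) > tier.capacity:
            victim, victim_data = tier.files.popitem(last=False)
            self._insert(level + 1, victim, victim_data)

    def write(self, path, data):
        level = self.locate(path)
        if level is not None:
            del self.tiers[level].files[path]  # drop the stale copy
        self._insert(0, path, data)  # new data always lands in DRAM

    def read(self, path):
        level = self.locate(path)
        if level is None:
            raise FileNotFoundError(path)
        data = self.tiers[level].files.pop(path)
        self._insert(0, path, data)  # upward migration on access
        return data


store = ToyTieredStore(dram_capacity=1, ssd_capacity=1)
for name in ("a", "b", "c"):
    store.write(name, name.upper())
levels_after_writes = [store.locate(p) for p in ("a", "b", "c")]
value = store.read("a")  # promotes "a" back to DRAM
levels_after_read = [store.locate(p) for p in ("a", "b", "c")]
```

In this toy model a caller only ever names a path; which level holds the data is an internal detail, which is the property a unified namespace provides. A real system would migrate by size, access pattern, or free-space thresholds rather than by a simple file count.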

Key words

High performance computing; Hierarchical hybrid storage system; Distributed metadata management; Data migration






Copyright information

© Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2017

Authors and Affiliations

  1. School of Computer, National University of Defense Technology, Changsha, China
  2. Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, USA
  3. National Supercomputing Center, Tianjin, China
