System Software for Data-Intensive Science

  • Osamu Tatebe
  • Yoshihiro Oyama
  • Masahiro Tanaka
  • Hiroki Ohtsuji
  • Fuyumasa Takatsu
  • Xieming Li


Storage performance is an issue for supercomputers that aim to facilitate data-intensive science. To scale storage bandwidth with the number of compute nodes, we assume a node-local scale-out storage architecture: the number of local storage devices grows with the number of compute nodes, so the total storage bandwidth scales accordingly. Our research targets are a distributed file system for the node-local storage architecture, an operating system for compute nodes, and runtime systems that use the distributed file system on node-local storage for workflow systems, MapReduce, MPI-IO, and batch job schedulers.



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Osamu Tatebe (1, email author)
  • Yoshihiro Oyama (1)
  • Masahiro Tanaka (2)
  • Hiroki Ohtsuji (3)
  • Fuyumasa Takatsu (4)
  • Xieming Li (4)

  1. University of Tsukuba, Tsukuba, Japan
  2. Keio University, Fujisawa, Japan
  3. University of Tsukuba (currently Fujitsu Laboratories Ltd.), Kawasaki, Japan
  4. Yahoo Japan Corporation, Chiyoda-ku, Japan
