Software Technology That Deals with Deeper Memory Hierarchy in Post-petascale Era

  • Toshio Endo
  • Hiroko Midorikawa
  • Yukinori Sato


There is an urgent need for technology that realizes larger, finer, and faster simulations in meteorology, bioinformatics, disaster mitigation, and other fields in the post-petascale era. However, the “memory wall” problem will be one of the largest obstacles: memory bandwidth and capacity will grow even more slowly than processor throughput. To cope with this, we assume a system architecture whose memory hierarchy combines hybrid memory devices, including nonvolatile RAM (NVRAM), and we develop new software technology that efficiently exploits this hierarchy. Our research covers new compiler technology, memory management, and application algorithms.
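A central technique for exploiting such a deep memory hierarchy is blocking (tiling): the working set of a computation is kept in fast memory while the bulk of the data resides in slower NVRAM or flash. The following is a minimal illustrative sketch, not the authors' implementation; the function names (`naive_jacobi`, `blocked_jacobi`) and the 1-D Jacobi stencil are chosen here only to show how temporal blocking lets several time steps be advanced per tile before data is written back to slow memory.

```python
import numpy as np

def naive_jacobi(a, steps):
    # Reference version: sweep the whole array once per time step.
    a = a.copy()
    for _ in range(steps):
        a[1:-1] = 0.5 * (a[:-2] + a[2:])
    return a

def blocked_jacobi(a, steps, block):
    # Temporally blocked version: process the domain tile by tile.
    # Each tile is loaded with a halo of `steps` cells per side, so
    # `steps` time steps can be advanced entirely in fast memory
    # before the tile is written back to the (slow) backing store.
    n = len(a)
    out = a.copy()
    for start in range(1, n - 1, block):
        end = min(start + block, n - 1)
        lo, hi = max(start - steps, 0), min(end + steps, n)
        tile = a[lo:hi].copy()                      # "read" from slow memory
        for _ in range(steps):
            tile[1:-1] = 0.5 * (tile[:-2] + tile[2:])
        out[start:end] = tile[start - lo:end - lo]  # "write back" valid cells
    return out

x = np.random.default_rng(0).random(101)
assert np.allclose(naive_jacobi(x, 4), blocked_jacobi(x, 4, block=16))
```

The halo width equals the number of fused time steps because the region of valid cells shrinks by one per side per step; in a real out-of-core setting the tile copy and write-back would correspond to transfers between DRAM and NVRAM or an SSD.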



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Global Scientific Information and Computing Center, Tokyo Institute of Technology, Tokyo, Japan
  2. Seikei University, Tokyo, Japan
  3. Toyohashi University of Technology, Aichi, Japan