Highly Productive, High-Performance Application Frameworks for Post-Petascale Computing

  • Naoya Maruyama
  • Takayuki Aoki
  • Kenjiro Taura
  • Rio Yokota
  • Mohamed Wahib
  • Motohiko Matsuda
  • Keisuke Fukuda
  • Takashi Shimokawabe
  • Naoyuki Onodera
  • Michel Müller
  • Shintaro Iwasaki


We present an overview of our project, which aimed to achieve both high performance and high productivity. To that end, we designed and developed high-level domain-specific frameworks that automate many of the tedious and complicated program optimizations for certain computation patterns. This article walks through some of our research results and highlights how we achieved both high performance and high productivity.


Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Naoya Maruyama (1)
  • Takayuki Aoki (2)
  • Kenjiro Taura (3)
  • Rio Yokota (2)
  • Mohamed Wahib (1)
  • Motohiko Matsuda (1)
  • Keisuke Fukuda (2)
  • Takashi Shimokawabe (3)
  • Naoyuki Onodera (2)
  • Michel Müller (2)
  • Shintaro Iwasaki (3)

  1. RIKEN AICS, Kobe, Japan
  2. Tokyo Institute of Technology, Meguro-ku, Japan
  3. University of Tokyo, Bunkyo-ku, Japan