GPU Implementation of a Sophisticated Implicit Low-Order Finite Element Solver with FP21-32-64 Computation Using OpenACC

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12017)


Accelerating applications while preserving portability and maintainability is one of the major challenges in science and engineering. We previously developed a fast implicit low-order three-dimensional finite element solver whose complicated algorithm includes artificial intelligence and transprecision computing. Because every possible tuning for the target architecture was implemented, the solver has poor portability and maintainability. In this paper, we apply OpenACC to the solver. The directive-based approach of OpenACC allows GPU computation to be introduced at a lower development cost, even for complex codes. In performance measurements on the AI Bridging Cloud Infrastructure (ABCI), a reasonable speedup was attained on GPUs: the elapsed time of the entire solver was reduced to 1/14 of that of the original CPU implementation running on CPUs. Our proposed template for transprecision computing with our custom FP21 data type is publicly available and can therefore serve as a working example for other scientific computing applications.


Keywords: OpenACC, Finite element analysis, Conjugate gradient solver, Transprecision computing, Lower-precision data types



Our results were obtained using the computational resources of the AI Bridging Cloud Infrastructure (ABCI), National Institute of Advanced Industrial Science and Technology (AIST). We acknowledge support from the Post K computer project (Priority Issue 3 - Development of integrated simulation systems for hazards and disasters induced by earthquakes and tsunamis) and the Japan Society for the Promotion of Science (17K14719, 18H05239, 18K18873). Part of our results were obtained using Summit at the Oak Ridge Leadership Computing Facility, a US Department of Energy Office of Science User Facility at Oak Ridge National Laboratory.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. The University of Tokyo, Bunkyo, Japan
  2. Center for Computational Science, RIKEN, Chuo, Kobe, Japan
  3. NVIDIA Corporation, Minato, Japan
  4. Japan Agency for Marine-Earth Science and Technology, Kanazawa, Yokohama, Japan
