
GPU-Accelerated Language and Communication Support by FPGA

  • Taisuke Boku
  • Toshihiro Hanawa
  • Hitoshi Murai
  • Masahiro Nakao
  • Yohei Miki
  • Hideharu Amano
  • Masayuki Umemura
Chapter

Abstract

Although the GPU is one of the most successful accelerator devices for HPC, several issues arise when it is used in large-scale parallel systems. To express real applications on GPU-ready parallel systems, programmers must combine different programming paradigms, such as CUDA/OpenCL, MPI, and OpenMP. On the hardware side, inter-GPU communication must pass through the PCIe channel with CPU assistance, which incurs a large overhead and becomes a bottleneck for overall parallel performance. In the project described in this chapter, we developed an FPGA-based platform that reduces the latency of inter-GPU communication, together with a PGAS language for distributed-memory programming with accelerators such as GPUs. This work provides a new approach that compensates for the hardware and software weaknesses of parallel GPU computing. We also describe FPGA technology for accelerating both computation and communication in an astrophysical problem for which GPU or CPU computation alone does not deliver sufficient performance.
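
To make the paradigm mixing concrete, below is a minimal MPI+CUDA sketch (illustrative only, not code from the chapter) of the conventional inter-GPU exchange the abstract refers to: data computed on the GPU is staged to host memory over PCIe, sent with MPI by the CPU, and copied back to the device. The kernel, buffer size, and ring-neighbor ranks are hypothetical placeholders.

    /* Minimal MPI+CUDA sketch of a conventional inter-GPU halo exchange.
       Every transfer makes three hops: GPU -> host (PCIe), host -> host
       (MPI driven by the CPU), host -> GPU (PCIe). This staging latency
       is the bottleneck an FPGA-based communication platform targets.   */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define N 1024                      /* halo size in doubles (placeholder) */

    __global__ void compute(double *buf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] += 1.0;       /* stand-in for real GPU work */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int right = (rank + 1) % size;          /* ring neighbors */
        int left  = (rank + size - 1) % size;

        double *d_buf;                          /* device buffer        */
        double *h_send, *h_recv;                /* host staging buffers */
        cudaMalloc(&d_buf, N * sizeof(double));
        cudaMemset(d_buf, 0, N * sizeof(double));
        h_send = (double *)malloc(N * sizeof(double));
        h_recv = (double *)malloc(N * sizeof(double));

        compute<<<(N + 255) / 256, 256>>>(d_buf, N);
        cudaDeviceSynchronize();

        /* Hop 1: GPU -> host over PCIe */
        cudaMemcpy(h_send, d_buf, N * sizeof(double), cudaMemcpyDeviceToHost);
        /* Hop 2: CPU-driven MPI exchange with ring neighbors */
        MPI_Sendrecv(h_send, N, MPI_DOUBLE, right, 0,
                     h_recv, N, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Hop 3: host -> GPU over PCIe */
        cudaMemcpy(d_buf, h_recv, N * sizeof(double), cudaMemcpyHostToDevice);

        cudaFree(d_buf);
        free(h_send);
        free(h_recv);
        MPI_Finalize();
        return 0;
    }

Approaches such as GPUDirect RDMA, and the FPGA-based communication support developed in this project, aim to remove the host staging copies (hops 1 and 3) and the CPU involvement from this critical path.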


Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Taisuke Boku (1)
  • Toshihiro Hanawa (2)
  • Hitoshi Murai (3)
  • Masahiro Nakao (3)
  • Yohei Miki (2)
  • Hideharu Amano (4)
  • Masayuki Umemura (1)

  1. Center for Computational Sciences, University of Tsukuba, Tsukuba, Japan
  2. Information Technology Center, The University of Tokyo, Tokyo, Japan
  3. Center for Computational Science, RIKEN, Kobe, Japan
  4. Department of Information and Computer Science, Keio University, Tokyo, Japan