
Early Experience on Running OpenStaPLE on DAVIDE

  • Claudio Bonati
  • Enrico Calore (corresponding author)
  • Massimo D’Elia
  • Michele Mesiti
  • Francesco Negro
  • Sebastiano Fabio Schifano
  • Giorgio Silvi
  • Raffaele Tripiccione
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11203)

Abstract

In this contribution we measure the computing and energy performance of the recently developed DAVIDE HPC cluster, a massively parallel machine based on IBM POWER CPUs and NVIDIA Pascal GPUs. As an application benchmark we use the OpenStaPLE Lattice QCD code, written using the OpenACC programming framework. Our code exploits the computing performance of GPUs through OpenACC directives and uses OpenMPI to manage the parallelism among several GPUs. We analyze the speed-up and the aggregate performance of the code and try to identify possible bottlenecks that harm performance. Using the power monitoring tools available on DAVIDE, we also discuss energy aspects, pointing out the best trade-offs between time-to-solution and energy-to-solution.
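The programming model described above (OpenACC directives for the GPU kernels, MPI to parallelize the work across several GPUs) can be illustrated with a minimal, self-contained sketch. The code below is not taken from OpenStaPLE: the field phi, the toy nearest-neighbour update, and the local size LOCAL_N are placeholders chosen for illustration, and the halo exchange stages boundary data through the host rather than using CUDA-aware MPI. It only shows the general pattern of binding one MPI rank to one GPU and offloading the local lattice update with OpenACC directives.

/*
 * Illustrative sketch (not the actual OpenStaPLE source): one MPI rank per
 * GPU, a 1D domain decomposition, and an OpenACC kernel acting on the local
 * sub-lattice, with the halo exchange performed through MPI.
 */
#include <mpi.h>
#include <openacc.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_N 1024          /* local lattice sites per rank (toy size) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Bind each MPI rank to one GPU of the node (round-robin). */
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (ngpus > 0)
        acc_set_device_num(rank % ngpus, acc_device_nvidia);

    /* Local field with two halo sites (index 0 and LOCAL_N+1). */
    double *phi = malloc((LOCAL_N + 2) * sizeof(double));
    double *tmp = malloc((LOCAL_N + 2) * sizeof(double));
    for (int i = 0; i < LOCAL_N + 2; i++)
        phi[i] = (double)(rank * LOCAL_N + i);

    int left  = (rank - 1 + nranks) % nranks;   /* periodic neighbours */
    int right = (rank + 1) % nranks;

    #pragma acc data copy(phi[0:LOCAL_N+2]) create(tmp[0:LOCAL_N+2])
    {
        for (int iter = 0; iter < 100; iter++) {
            /* Halo exchange: copy boundary sites to the host, swap them
               with the neighbouring ranks, push them back to the GPU
               (CUDA-aware MPI could exchange device buffers directly). */
            #pragma acc update host(phi[1:1], phi[LOCAL_N:1])
            MPI_Sendrecv(&phi[LOCAL_N], 1, MPI_DOUBLE, right, 0,
                         &phi[0],       1, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&phi[1],           1, MPI_DOUBLE, left,  1,
                         &phi[LOCAL_N + 1], 1, MPI_DOUBLE, right, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            #pragma acc update device(phi[0:1], phi[LOCAL_N+1:1])

            /* Toy nearest-neighbour update, offloaded to the GPU. */
            #pragma acc parallel loop present(phi, tmp)
            for (int i = 1; i <= LOCAL_N; i++)
                tmp[i] = 0.5 * (phi[i - 1] + phi[i + 1]);

            #pragma acc parallel loop present(phi, tmp)
            for (int i = 1; i <= LOCAL_N; i++)
                phi[i] = tmp[i];
        }
    }

    if (rank == 0)
        printf("phi[1] on rank 0 after relaxation: %f\n", phi[1]);

    free(phi);
    free(tmp);
    MPI_Finalize();
    return 0;
}

Under these assumptions the sketch can be compiled, for instance, with an OpenACC-capable MPI wrapper (e.g. mpicc -acc from the NVIDIA HPC SDK) and launched with one MPI rank per GPU.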

Keywords

LQCD · OpenACC · POWER8 · NVLink

Acknowledgements

We thank CINECA and E4 Computer Engineering for granting access to the DAVIDE cluster and for their support. We thank Università degli Studi di Ferrara and INFN Ferrara for granting access to the COKA cluster. This work has been developed in the framework of the COKA and COSA projects of INFN.


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Claudio Bonati (1)
  • Enrico Calore (2, corresponding author)
  • Massimo D’Elia (1)
  • Michele Mesiti (3)
  • Francesco Negro (4)
  • Sebastiano Fabio Schifano (2)
  • Giorgio Silvi (5)
  • Raffaele Tripiccione (2)

  1. Università di Pisa and INFN Sezione di Pisa, Pisa, Italy
  2. Università degli Studi di Ferrara and INFN Sezione di Ferrara, Ferrara, Italy
  3. Academy of Advanced Computing, Swansea University, Swansea, UK
  4. INFN Sezione di Pisa, Pisa, Italy
  5. Jülich Supercomputing Centre, Jülich, Germany
