The Pangeo Ecosystem: Interactive Computing Tools for the Geosciences: Benchmarking on HPC

  • Tina Erica Odaka
  • Anderson Banihirwe
  • Guillaume Eynard-Bontemps
  • Aurelien Ponte
  • Guillaume Maze
  • Kevin Paul
  • Jared Baker
  • Ryan Abernathey
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1190)

Abstract

The Pangeo ecosystem is an interactive computing software stack for HPC and public cloud infrastructures. In this paper, we present benchmarking results for the Pangeo platform on two different HPC systems. Four geoscience operations were considered in this benchmarking study, with varying chunk sizes and chunking schemes, and both strong and weak scaling analyses were performed. Chunk sizes between 64 MB and 512 MB were considered, with the best scalability obtained for 512 MB. The auto chunking scheme scaled well compared with certain manual chunking schemes.
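
To make the chunking comparison concrete, the following is a minimal sketch of manual versus auto chunking in an Xarray/Dask workflow of the kind Pangeo supports. The array shape, variable name, and target chunk size are illustrative assumptions, not taken from the paper's benchmarks:

```python
# Minimal sketch (not the paper's benchmark code) of the two chunking
# schemes compared in this study. The array shape, variable name, and
# target chunk size are illustrative assumptions.
import dask
import numpy as np
import xarray as xr

# Synthetic stand-in for a gridded geoscience variable (time, lat, lon).
data = xr.DataArray(
    np.random.rand(365, 360, 720).astype("float32"),
    dims=("time", "lat", "lon"),
    name="sst",
)

# Manual chunking: the user fixes the chunk shape explicitly.
# A (73, 360, 720) float32 chunk is roughly 72 MiB, near the low end
# of the 64 MB to 512 MB range considered in the paper.
manual = data.chunk({"time": 73, "lat": 360, "lon": 720})

# Auto chunking: Dask chooses the chunk shape itself, guided by its
# "array.chunk-size" configuration option.
with dask.config.set({"array.chunk-size": "256MiB"}):
    auto = data.chunk("auto")

# Both variants are now lazy Dask arrays; a reduction such as a time
# mean runs in parallel across chunks when .compute() is called.
result = auto.mean("time").compute()
print(manual.chunks, auto.chunks)
```

Either scheme produces the same result; the benchmarking question is how chunk granularity and the resulting task-graph size affect strong and weak scaling as workers are added.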

Keywords

Pangeo · Interactive computing · HPC · Cloud · Benchmarking · Dask · Xarray

Acknowledgment

Dr. Abernathey was supported by NSF Earthcube award 1740648. Dr. Paul and Mr. Banihirwe were both supported by NSF Earthcube award 1740633.


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Laboratory for Ocean Physics and Satellite Remote Sensing UMR LOPS, Ifremer, Univ. Brest, CNRS, IRD, IUEM, Brest, France
  2. National Center for Atmospheric Research, Boulder, USA
  3. CNES Computing Center Team, Centre National d'Etudes Spatiales, Toulouse, France
  4. Lamont-Doherty Earth Observatory, Columbia University, New York, USA
