
Enabling Continuous Testing of HPC Systems Using ReFrame

  • Vasileios Karakasis (email author)
  • Theofilos Manitaras
  • Victor Holanda Rusu
  • Rafael Sarmiento-Pérez
  • Christopher Bignamini
  • Matthias Kraushaar
  • Andreas Jocksch
  • Samuel Omlin
  • Guilherme Peretti-Pezzi
  • João P. S. C. Augusto
  • Brian Friesen
  • Yun He
  • Lisa Gerhardt
  • Brandon Cook
  • Zhi-Qiang You
  • Samuel Khuvis
  • Karen Tomko
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1190)

Abstract

Regression testing of HPC systems is crucially important for ensuring the quality of service offered to end users. At the same time, it poses a great challenge to systems and application engineers, who must continuously maintain regression tests covering as many aspects of the user experience as possible. In this paper, we briefly present ReFrame, a framework for writing regression tests for HPC systems, and describe how it is used by CSCS, NERSC and OSC to continuously test their systems. ReFrame is designed to abstract away the complexity of interacting with the system and to separate the logic of a regression test from the low-level details pertaining to the system configuration and setup. Regression tests in ReFrame are simple Python classes that specify the basic parameters of the test plus any additional logic. The framework loads the test and sends it down a well-defined pipeline that takes care of its execution. ReFrame can be easily set up on any cluster, and its straightforward invocation allows it to be integrated with common continuous integration/deployment (CI/CD) tools in order to perform continuous testing of an HPC system. Finally, its ability to feed the collected performance data to well-known log channels, such as Syslog, Graylog or, simply, parsable log files, also makes it a powerful tool for continuously monitoring the health of the system from the user's perspective.
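As an illustration of how such tests are expressed, the following is a minimal sketch of a ReFrame test class, written against the framework's documented Python API from around the time of this paper. The class name, the source file hello.c and the specific attribute values are illustrative assumptions, not an example taken from the paper.

    import reframe as rfm
    import reframe.utility.sanity as sn

    @rfm.simple_test
    class HelloWorldTest(rfm.RegressionTest):
        def __init__(self):
            self.descr = 'Compile and run a minimal MPI "Hello, World!" program'
            # Systems and programming environments the test is valid for;
            # '*' matches anything defined in the site configuration.
            self.valid_systems = ['*']
            self.valid_prog_environs = ['*']
            # Source file to build; ReFrame resolves it relative to the
            # test's source directory and picks the appropriate compiler.
            self.sourcepath = 'hello.c'
            self.num_tasks = 2
            # Sanity check: the test passes only if the expected string
            # appears in the job's standard output.
            self.sanity_patterns = sn.assert_found(r'Hello, World!', self.stdout)

With a site configuration in place, a test like this is typically picked up and executed by pointing the reframe front-end at the test file, along the lines of "reframe -c hello_test.py -r"; the exact invocation and configuration options depend on the installed version and the local setup.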

Notes

Acknowledgements

CSCS would like to thank the members of the User Engagement and Support and the HPC Operations units for their valuable feedback on the framework and their contributions to writing regression tests for the system.

This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Vasileios Karakasis (1) (email author)
  • Theofilos Manitaras (1)
  • Victor Holanda Rusu (1)
  • Rafael Sarmiento-Pérez (1)
  • Christopher Bignamini (1)
  • Matthias Kraushaar (1)
  • Andreas Jocksch (1)
  • Samuel Omlin (1)
  • Guilherme Peretti-Pezzi (1)
  • João P. S. C. Augusto (2)
  • Brian Friesen (3)
  • Yun He (3)
  • Lisa Gerhardt (3)
  • Brandon Cook (3)
  • Zhi-Qiang You (4)
  • Samuel Khuvis (4)
  • Karen Tomko (4)

  1. Swiss National Supercomputing Centre, Lugano, Switzerland
  2. Università della Svizzera Italiana, Lugano, Switzerland
  3. Lawrence Berkeley National Laboratory, Berkeley, USA
  4. Ohio Supercomputer Center, Columbus, USA
