On-the-Fly Calculation of Performance Metrics with Adaptive Time Resolution for HPC Compute Jobs

  • Konstantin Stefanov
  • Vadim Voevodin
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 965)


Performance monitoring is a method for debugging performance issues in different types of applications. It uses performance metrics obtained from the servers the application runs on, and may also use metrics produced by the application itself. The common approach to building performance monitoring systems is to store all the data in a database, then retrieve the data corresponding to a specific job and analyze that portion. This approach works well when the data stream is not very large. For a large performance monitoring data stream, it incurs heavy I/O and imposes high requirements on the storage systems that process the data.

In this paper we propose an adaptive on-the-fly approach to performance monitoring of High Performance Computing (HPC) compute jobs which significantly reduces the data stream written to storage. We used this approach to implement a performance monitoring system for an HPC cluster to monitor compute jobs. The output of our performance monitoring system is a time-series graph representing aggregated performance metrics for the job. The time resolution of the resulting graph is adaptive and depends on the duration of the analyzed job.
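The abstract does not spell out how the adaptive resolution is computed, so the following is only an illustrative sketch of one way such a scheme could work, not the authors' implementation. It assumes a fixed cap on the number of stored points: whenever the cap is reached, adjacent points are merged pairwise and the time step doubles, so a long job ends up with a coarser graph than a short one while memory stays bounded. The class name `AdaptiveSeries` and the parameters `max_points` and `step` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptiveSeries:
    """Bounded series of averaged samples; the time step grows with job duration.

    Hypothetical illustration only: names and the halving policy are assumptions.
    """
    max_points: int = 8          # assumed cap on stored points (must be even)
    step: float = 1.0            # time covered by one stored point, in seconds
    points: list = field(default_factory=list)  # finished per-point averages
    _sum: float = 0.0            # running sum for the point being filled
    _count: int = 0              # raw samples accumulated in that point
    _per_point: int = 1          # raw samples that make up one finished point

    def __post_init__(self) -> None:
        if self.max_points < 2 or self.max_points % 2:
            raise ValueError("max_points must be an even number >= 2")

    def add(self, value: float) -> None:
        """Consume one raw sample as it arrives (no raw data is kept)."""
        self._sum += value
        self._count += 1
        if self._count == self._per_point:
            self.points.append(self._sum / self._count)
            self._sum, self._count = 0.0, 0
            if len(self.points) == self.max_points:
                self._halve()

    def _halve(self) -> None:
        """Merge adjacent points pairwise and double the time step.

        Both points in a pair cover the same number of raw samples,
        so the mean of the two means equals the mean of all their samples.
        """
        evens, odds = self.points[0::2], self.points[1::2]
        self.points = [(a + b) / 2 for a, b in zip(evens, odds)]
        self.step *= 2
        self._per_point *= 2


if __name__ == "__main__":
    series = AdaptiveSeries(max_points=4)
    for sample in range(1, 9):   # eight raw samples, one per second
        series.add(float(sample))
    print(series.step, series.points)
```

Under these assumptions, only `max_points` averages are ever held per metric, whatever the job's length, which is the kind of reduction in stored data the abstract describes.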


Performance, Performance monitoring, Adaptive performance monitoring, Supercomputer, HPC



The work is supported by the Russian Foundation for Basic Research, grant 16-07-01121. The research is carried out using the equipment of the shared research facilities of HPC computing resources at M.V. Lomonosov Moscow State University. This material is based upon work supported by the Russian Presidential study grant (SP-1981.2016.5).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. M.V. Lomonosov Moscow State University, Moscow, Russia
