Abstract
Effective resource utilization is a pressing challenge, especially for HPC centers that run top-level supercomputing facilities with high energy consumption and a significant number of workgroups. Many approaches to efficiency analysis that rely on system monitoring share a weakness: they are oriented toward professionals and the analysis of specific jobs, and are hardly accessible to regular users. The proposed all-round performance analysis approach covers single-application performance, project-level utilization, and overall system resource utilization. Based on system monitoring data, it promises to be an effective, low-cost technique aimed at all types of HPC center users. Every user of an HPC center can access details on any of their executed jobs to better understand application behavior and sequences of job runs, including scalability studies, which in turn helps them perform appropriate optimizations and apply co-design techniques. By taking all levels into consideration (user, project manager, administrator), the approach helps improve the output of HPC centers.
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Nikitenko, D., Stefanov, K., Zhumatiy, S., Voevodin, V., Teplov, A., Shvets, P. (2016). System Monitoring-Based Holistic Resource Utilization Analysis for Every User of a Large HPC Center. In: Carretero, J., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2016. Lecture Notes in Computer Science(), vol 10049. Springer, Cham. https://doi.org/10.1007/978-3-319-49956-7_24
DOI: https://doi.org/10.1007/978-3-319-49956-7_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49955-0
Online ISBN: 978-3-319-49956-7
eBook Packages: Computer Science (R0)