System Monitoring-Based Holistic Resource Utilization Analysis for Every User of a Large HPC Center

Nikitenko, Dmitry; Stefanov, Konstantin; Zhumatiy, Sergey; Voevodin, Vadim; Teplov, Alexey; Shvets, Pavel

doi:10.1007/978-3-319-49956-7_24

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10049)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Effective resource utilization is a pressing challenge, especially for HPC centers that run top-level supercomputing facilities with high energy consumption and a significant number of workgroups. The weakness of many system-monitoring-based approaches to efficiency studies is that they are oriented primarily toward professionals and toward the analysis of specific jobs, which leaves them largely unavailable to regular users. The proposed all-round performance analysis approach covers single-application performance as well as project-level and overall system resource utilization. Because it is based on system monitoring data, it promises to be an effective, low-cost technique aimed at all types of HPC center users. Every user of an HPC center can access details on any of their executed jobs to better understand application behavior and sequences of job runs, including scalability studies, which in turn helps them perform appropriate optimizations and apply co-design techniques. By taking all levels into consideration (user, project manager, administrator), the approach helps improve the output of HPC centers.

Author information

Authors and Affiliations

Research Computing Center of Moscow State University, Moscow, Russia
Dmitry Nikitenko, Konstantin Stefanov, Sergey Zhumatiy, Vadim Voevodin, Alexey Teplov & Pavel Shvets

Corresponding author

Correspondence to Dmitry Nikitenko.

Editor information

Editors and Affiliations

Carlos III University of Madrid, Getafe, Spain
Jesus Carretero
Carlos III University of Madrid, Getafe, Spain
Javier Garcia-Blas
Mathematical Support for Computers, N. I. Lobachevsky State University of Nizhny Novgorod, Nizhny Novgorod, Russia
Victor Gergel
Research Computing Center (RCC), Moscow State University, Moscow, Russia
Vladimir Voevodin
Research Computing Center (RCC), Moscow State University, Moscow, Russia
Iosif Meyerov
E.U. Politécnica, Universidad de Extremadura, Cáceres, Spain
Juan A. Rico-Gallego
Ingenieria de Sistemas Informáticos, Universidad de Extremadura, Cáceres, Spain
Juan C. Díaz-Martín
Universitat Politécnica de València, Valencia, Spain
Pedro Alonso
Distributed and Parallel Systems Group, Institute for Computer Science, Innsbruck, Austria
Juan Durillo
Carlos III University of Madrid, Getafe, Spain
José Daniel Garcia Sánchez
UCD School of Computer Science, University College Dublin, Dublin, Ireland
Alexey L. Lastovetsky
University of Calabria, Rende (CS), Italy
Fabrizio Marozzo
Information Science and Engineering, Central South University, Changsha, Hunan, China
Qin Liu
Information Science and Engineering, Central South University, Changsha, Hunan, China
Zakirul Alam Bhuiyan
Ludwig Maximilian University of Munich, Munich, Germany
Karl Fürlinger
Informatik 10 - Rechnertechnik, Technische Universität München, Munich, Germany
Josef Weidendorfer
High Performance Computing Center (HLRS), Stuttgart, Germany
José Gracia

About this paper

Cite this paper

Nikitenko, D., Stefanov, K., Zhumatiy, S., Voevodin, V., Teplov, A., Shvets, P. (2016). System Monitoring-Based Holistic Resource Utilization Analysis for Every User of a Large HPC Center. In: Carretero, J., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2016. Lecture Notes in Computer Science, vol 10049. Springer, Cham. https://doi.org/10.1007/978-3-319-49956-7_24

DOI: https://doi.org/10.1007/978-3-319-49956-7_24
Published: 19 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49955-0
Online ISBN: 978-3-319-49956-7
eBook Packages: Computer Science, Computer Science (R0)
