JobDigest – Detailed System Monitoring-Based Supercomputer Application Behavior Analysis

Nikitenko, Dmitry; Antonov, Alexander; Shvets, Pavel; Sobolev, Sergey; Stefanov, Konstantin; Voevodin, Vadim; Voevodin, Vladimir; Zhumatiy, Sergey

doi:10.1007/978-3-319-71255-0_42

Conference paper
First Online: 15 November 2017

Part of the book series: Communications in Computer and Information Science (CCIS, volume 793)

Abstract

The efficiency with which user applications utilize computing resources can be analyzed in various ways. The JobDigest approach, based on system monitoring, was developed at Moscow State University and is currently used in the everyday practice of its supercomputing center, the largest in Russia. The approach provides application behavior analysis for every job run on an HPC system, including: a set of dynamic application characteristics, i.e. time series of values representing utilization of CPU, memory, network, storage, etc., with diagrams and heat maps; integral characteristics representing average utilization rates; and job tagging and categorization, with means of informing system administrators and managers about suspicious or abnormal applications. The paper describes the principles and workflow of the approach. It also demonstrates JobDigest use cases and positions the proposed techniques within the set of tools and methods used in the MSU HPC Center to ensure its efficient and productive 24/7 operation.
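As an illustration only, the relationship between the dynamic characteristics, the integral characteristics, and job tagging described above can be sketched as follows. This is not the JobDigest implementation: the metric names, sample values, tag names, and thresholds are all hypothetical and chosen purely for the example.

```python
from statistics import mean

# Hypothetical dynamic characteristics of one job: one sample per
# monitoring interval. Metric names and units are assumptions.
job_series = {
    "cpu_user_pct": [92.0, 95.0, 3.0, 2.0, 1.0, 2.0],
    "mem_used_gb": [10.0, 12.0, 12.0, 12.0, 12.0, 12.0],
    "net_mb_per_s": [40.0, 35.0, 0.0, 0.0, 0.0, 0.0],
}


def integral_characteristics(series):
    """Reduce each time series to its average over the job's run."""
    return {name: mean(values) for name, values in series.items()}


# Hypothetical tagging rules: tag -> (metric, upper bound on the average).
# A job receives the tag when its average falls below the bound.
TAG_RULES = {
    "low_cpu": ("cpu_user_pct", 50.0),
    "low_network": ("net_mb_per_s", 1.0),
}


def tag_job(averages, rules=TAG_RULES):
    """Return the sorted tags whose rule matches the job's averages."""
    return sorted(
        tag
        for tag, (metric, limit) in rules.items()
        if metric in averages and averages[metric] < limit
    )


averages = integral_characteristics(job_series)
tags = tag_job(averages)
print(averages)
print(tags)
```

In this sketch the job computes for only the first two intervals and then idles, so its average CPU utilization is low and it is tagged accordingly. A tag of this kind is the sort of signal that could be passed on to administrators for review.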

Notes

1. JobDigest® is a registered trademark in the Russian Federation. A patent application for the JobDigest approach has been filed.

Acknowledgements

The results were obtained in the Research Computing Center of M.V. Lomonosov Moscow State University. The work is funded in part by the Russian Foundation for Basic Research, grants №17-07-00719 and №16-07-01121, and by a Russian Presidential study grant (SP-1981.2016.5).

Author information

Authors and Affiliations

Research Computing Center, Lomonosov Moscow State University, Moscow, 119234, Russian Federation
Dmitry Nikitenko, Alexander Antonov, Pavel Shvets, Sergey Sobolev, Konstantin Stefanov, Vadim Voevodin, Vladimir Voevodin & Sergey Zhumatiy

Authors

Dmitry Nikitenko
Alexander Antonov
Pavel Shvets
Sergey Sobolev
Konstantin Stefanov
Vadim Voevodin
Vladimir Voevodin
Sergey Zhumatiy

Corresponding author

Correspondence to Dmitry Nikitenko.

Editor information

Editors and Affiliations

Research Computing Center (RCC), Moscow State University, Moscow, Russia
Vladimir Voevodin
Moscow State University, Moscow, Russia
Sergey Sobolev

About this paper

Cite this paper

Nikitenko, D. et al. (2017). JobDigest – Detailed System Monitoring-Based Supercomputer Application Behavior Analysis. In: Voevodin, V., Sobolev, S. (eds) Supercomputing. RuSCDays 2017. Communications in Computer and Information Science, vol 793. Springer, Cham. https://doi.org/10.1007/978-3-319-71255-0_42

DOI: https://doi.org/10.1007/978-3-319-71255-0_42
Published: 15 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71254-3
Online ISBN: 978-3-319-71255-0
eBook Packages: Computer Science, Computer Science (R0)
