Abstract
The efficiency with which user applications utilize computing resources can be analyzed in various ways. The JobDigest approach, based on system monitoring, was developed at Moscow State University and is currently used in the everyday practice of its supercomputing center, the largest in Russia. The approach analyzes application behavior for every job run on the HPC system and provides: a set of dynamic application characteristics, i.e. time series of values representing utilization of CPU, memory, network, storage, etc., with diagrams and heat maps; integral characteristics representing average utilization rates; and job tagging and categorization, with means of informing system administrators and managers about suspicious or abnormal applications. The paper describes the principles and workflow of the approach. It also demonstrates JobDigest use cases and positions the proposed techniques within the set of tools and methods used in the MSU HPC Center to ensure its efficient and productive 24/7 operation.
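The relation between the dynamic characteristics, the integral characteristics, and the job tags described above can be illustrated with a minimal sketch. It is not the JobDigest implementation: the sensor names, sample values, the `low_cpu_utilization` tag, and the 20 % threshold are all hypothetical and chosen only for illustration. The sketch assumes that an integral characteristic is the plain average of a sensor's time series over the job run and that a tag is assigned when that average falls below a threshold.

```python
from statistics import mean

# Hypothetical monitoring samples for one job: sensor name -> time series.
# Sensor names, units, and values are illustrative, not taken from JobDigest.
job_series = {
    "cpu_user_pct": [12.0, 10.0, 8.0, 10.0],
    "mem_used_pct": [40.0, 42.0, 41.0, 41.0],
}


def integral_characteristics(series):
    """Reduce each time series to its average value over the job run."""
    return {name: mean(values) for name, values in series.items()}


def tag_job(integral, rules):
    """Return the tags whose sensor average lies below the rule threshold."""
    return sorted(
        tag
        for tag, (sensor, threshold) in rules.items()
        if integral.get(sensor, float("inf")) < threshold
    )


# Hypothetical rule: flag jobs whose average CPU user load is under 20 %.
rules = {"low_cpu_utilization": ("cpu_user_pct", 20.0)}

integral = integral_characteristics(job_series)
tags = tag_job(integral, rules)
print(integral, tags)
```

A job whose sensors are missing from the monitoring data is never tagged in this sketch, because an absent sensor is treated as infinitely large; a real system would more likely report the gap explicitly.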
Notes
1. JobDigest® is a registered trademark in the Russian Federation. A patent application for the JobDigest approach has been filed.
Acknowledgements
The results were obtained at the Research Computing Center of M.V. Lomonosov Moscow State University. The work is funded in part by the Russian Foundation for Basic Research, grants №17-07-00719 and №16-07-01121, and by a Russian Presidential study grant (SP-1981.2016.5).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Nikitenko, D. et al. (2017). JobDigest – Detailed System Monitoring-Based Supercomputer Application Behavior Analysis. In: Voevodin, V., Sobolev, S. (eds) Supercomputing. RuSCDays 2017. Communications in Computer and Information Science, vol 793. Springer, Cham. https://doi.org/10.1007/978-3-319-71255-0_42
DOI: https://doi.org/10.1007/978-3-319-71255-0_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71254-3
Online ISBN: 978-3-319-71255-0
eBook Packages: Computer Science, Computer Science (R0)