JobDigest – Detailed System Monitoring-Based Supercomputer Application Behavior Analysis

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 793)

Abstract

The efficiency with which user applications utilize computing resources can be analyzed in various ways. The JobDigest approach, based on system monitoring, was developed at Moscow State University and is currently used in the everyday practice of the university's supercomputing center, the largest in Russia. The approach provides application behavior analysis for every job run on an HPC system, comprising: a set of dynamic application characteristics, i.e., time series of values representing utilization of CPU, memory, network, storage, etc., presented as diagrams and heat maps; integral characteristics representing average utilization rates; and job tagging and categorization, with means of informing system administrators and managers about suspicious or abnormal applications. The paper describes the principles and workflow of the approach. It also demonstrates JobDigest use cases and positions the proposed techniques within the set of tools and methods used in the MSU HPC Center to ensure its efficient and productive 24/7 operation.
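
The relationship between the dynamic characteristics, the integral characteristics, and job tags can be illustrated with a minimal sketch. It is not the JobDigest implementation: the metric names, sample values, the tag name, and the 50% threshold are all hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical monitoring samples for one job: metric name -> time series.
# Metric names, units, and values are illustrative assumptions.
job_series = {
    "cpu_user_pct": [92.0, 95.0, 90.0, 3.0, 2.0, 1.0],
    "mem_used_pct": [40.0, 42.0, 41.0, 41.0, 41.0, 41.0],
}


def integral_characteristics(series):
    """Average each non-empty time series into one value per metric."""
    return {name: mean(values) for name, values in series.items() if values}


def tag_job(integrals, low_cpu_threshold=50.0):
    """Attach illustrative tags; the threshold is an assumed value."""
    tags = []
    cpu = integrals.get("cpu_user_pct")
    # Tag only when the metric was actually collected for this job.
    if cpu is not None and cpu < low_cpu_threshold:
        tags.append("low-cpu-utilization")
    return tags


integrals = integral_characteristics(job_series)
tags = tag_job(integrals)
```

In this sketch the job's CPU load drops sharply halfway through the run, so its average falls below the assumed threshold and the job receives a tag that could then be reported to administrators. The time series itself would still be needed to see when the drop occurred.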

Notes

  1. The JobDigest® is a registered trademark in the Russian Federation. A patent application for the JobDigest approach has been filed.

Acknowledgements

The results were obtained at the Research Computing Center of M.V. Lomonosov Moscow State University. The work is funded in part by the Russian Foundation for Basic Research, grants №17-07-00719 and №16-07-01121, and by a Russian Presidential study grant (SP-1981.2016.5).

Author information

Corresponding author

Correspondence to Dmitry Nikitenko.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Nikitenko, D. et al. (2017). JobDigest – Detailed System Monitoring-Based Supercomputer Application Behavior Analysis. In: Voevodin, V., Sobolev, S. (eds) Supercomputing. RuSCDays 2017. Communications in Computer and Information Science, vol 793. Springer, Cham. https://doi.org/10.1007/978-3-319-71255-0_42

  • DOI: https://doi.org/10.1007/978-3-319-71255-0_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-71254-3

  • Online ISBN: 978-3-319-71255-0

  • eBook Packages: Computer Science (R0)
