System Monitoring-Based Holistic Resource Utilization Analysis for Every User of a Large HPC Center

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2016)

Abstract

Effective resource utilization is a serious challenge today, especially for HPC centers that run top-level supercomputing facilities with high energy consumption and a significant number of workgroups. The weakness of many approaches to efficiency studies based on system monitoring is that they are oriented primarily toward professionals and toward the analysis of specific jobs, and are hardly available to regular users. The proposed all-round performance analysis approach covers single-application performance, project-level utilization, and overall system resource utilization. Being based on system monitoring data, it promises to be an effective and low-cost technique aimed at all types of HPC center users. Every user of an HPC center can access details on any of their executed jobs to better understand application behavior and sequences of job runs, including scalability studies, which in turn helps them perform appropriate optimizations and implement co-design techniques. By taking all levels into consideration (user, project manager, administrator), the approach helps improve the output of HPC centers.

Corresponding author

Correspondence to Dmitry Nikitenko.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Nikitenko, D., Stefanov, K., Zhumatiy, S., Voevodin, V., Teplov, A., Shvets, P. (2016). System Monitoring-Based Holistic Resource Utilization Analysis for Every User of a Large HPC Center. In: Carretero, J., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2016. Lecture Notes in Computer Science, vol 10049. Springer, Cham. https://doi.org/10.1007/978-3-319-49956-7_24

  • DOI: https://doi.org/10.1007/978-3-319-49956-7_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49955-0

  • Online ISBN: 978-3-319-49956-7

  • eBook Packages: Computer Science, Computer Science (R0)
