Theory and Practice of Efficient Supercomputer Management

Voevodin, Vadim

doi:10.1007/978-3-319-66896-3_1

Conference paper
First Online: 26 August 2017

Abstract

Modern supercomputer systems are used with very low efficiency because of their high complexity. Controlling the state of a supercomputer is getting harder, yet the cost of low efficiency can be very significant. Solving this problem requires software for efficient supercomputer management. This paper describes a set of tools being developed at the Research Computing Center of Lomonosov Moscow State University (RCC MSU) that is intended to provide a holistic approach to efficiency analysis from different points of view. The tools address the efficiency of individual user applications and of the whole supercomputer job flow, the efficiency of computational resource utilization, supercomputer reliability, and HPC facility management.

Acknowledgements

This material is based upon work supported in part by the Russian Foundation for Basic Research (grant No. 16-07-00972) and a Russian Presidential study grant (SP-1981.2016.5).

Author information

Authors and Affiliations

Research Computing Center of Lomonosov Moscow State University, Moscow, Russia
Vadim Voevodin

Authors

Vadim Voevodin

Corresponding author

Correspondence to Vadim Voevodin.

Editor information

Editors and Affiliations

High Performance Computing Center (HLRS), University of Stuttgart, Stuttgart, Baden-Württemberg, Germany
Michael M. Resch
Europe GmbH, NEC High Performance Computing, Düsseldorf, Nordrhein-Westfalen, Germany
Wolfgang Bez
Europe GmbH, NEC High Performance Computing, Stuttgart, Germany
Erich Focht
High Performance Computing Center (HLRS), University of Stuttgart, Stuttgart, Germany
Michael Gienger
Cyberscience Center, Tohoku University, Sendai, Japan
Hiroaki Kobayashi

About this paper

Cite this paper

Voevodin, V. (2017). Theory and Practice of Efficient Supercomputer Management. In: Resch, M., Bez, W., Focht, E., Gienger, M., Kobayashi, H. (eds) Sustained Simulation Performance 2017. Springer, Cham. https://doi.org/10.1007/978-3-319-66896-3_1

DOI: https://doi.org/10.1007/978-3-319-66896-3_1
Published: 26 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66895-6
Online ISBN: 978-3-319-66896-3
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
