Abstract
Modern supercomputer systems are used very inefficiently because of their high complexity. Controlling the state of a supercomputer is getting harder, while the cost of low efficiency can be very significant. Solving this problem requires software for efficient supercomputer management. This paper describes a set of tools being developed at the Research Computing Center of Lomonosov Moscow State University (RCC MSU) that is intended to provide a holistic approach to efficiency analysis from different points of view. The tools address the efficiency of individual user applications and of the whole supercomputer job flow, the efficiency of computational resource utilization, supercomputer reliability, and HPC facility management.
Acknowledgements
This material is based upon work supported in part by the Russian Foundation for Basic Research (grant No. 16-07-00972) and a Russian Presidential study grant (SP-1981.2016.5).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Voevodin, V. (2017). Theory and Practice of Efficient Supercomputer Management. In: Resch, M., Bez, W., Focht, E., Gienger, M., Kobayashi, H. (eds) Sustained Simulation Performance 2017. Springer, Cham. https://doi.org/10.1007/978-3-319-66896-3_1
DOI: https://doi.org/10.1007/978-3-319-66896-3_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66895-6
Online ISBN: 978-3-319-66896-3
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)