Theory and Practice of Efficient Supercomputer Management

  • Conference paper

Abstract

Modern supercomputer systems are used with very low efficiency because of their high complexity. Controlling the state of a supercomputer is becoming harder, yet the cost of low efficiency can be very significant. Solving this problem requires software for efficient supercomputer management. This paper describes a set of tools being developed at the Research Computing Center of Lomonosov Moscow State University (RCC MSU) that is intended to provide a holistic approach to efficiency analysis from different points of view. The tools address the efficiency of individual user applications and of the whole supercomputer job flow, the efficiency of computational resource utilization, supercomputer reliability, and HPC facility management.

Acknowledgements

This material is based upon work supported in part by the Russian Foundation for Basic Research (grant No. 16-07-00972) and a Russian Presidential study grant (SP-1981.2016.5).

Author information

Corresponding author

Correspondence to Vadim Voevodin.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Voevodin, V. (2017). Theory and Practice of Efficient Supercomputer Management. In: Resch, M., Bez, W., Focht, E., Gienger, M., Kobayashi, H. (eds) Sustained Simulation Performance 2017. Springer, Cham. https://doi.org/10.1007/978-3-319-66896-3_1
