Pattern-Based Modeling of High-Performance Computing Resilience

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10659)

Abstract

As high-performance computing (HPC) systems grow in scale and complexity, resilience solutions that ensure continuity of service despite frequent errors and component failures must be designed methodically to balance reliability requirements against performance and power overheads. Design patterns enable a structured approach to developing resilience solutions, giving hardware and software designers the building blocks for rapidly developing novel solutions and for adapting existing technologies to emerging, extreme-scale HPC environments. In this paper, we develop analytical models that enable designers to evaluate the reliability and performance characteristics of the design patterns. These models are particularly useful in building a unified framework that analyzes and compares resilience solutions built from combinations of patterns.

Keywords

High-performance computing; Resilience; Patterns; Performance; Reliability; Modeling

Notes

Acknowledgements

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program manager Lucy Nowell, under contract number DE-AC05-00OR22725.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, USA