Pattern-Based Modeling of High-Performance Computing Resilience
Abstract
As high-performance computing (HPC) systems grow in scale and complexity, resilience solutions that ensure continuity of service despite frequent errors and component failures must be methodically designed to balance reliability requirements against performance and power overheads. Design patterns enable a structured approach to developing resilience solutions, providing hardware and software designers with the building blocks for rapidly developing novel solutions and for adapting existing technologies to emerging, extreme-scale HPC environments. In this paper, we develop analytical models that enable designers to evaluate the reliability and performance characteristics of the design patterns. These models are particularly useful in building a unified framework for analyzing and comparing resilience solutions built from combinations of patterns.
Keywords
High-performance computing, Resilience, Patterns, Performance, Reliability, Modeling
Acknowledgements
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program manager Lucy Nowell, under contract number DE-AC05-00OR22725.