On the Resilience of Conjugate Gradient and Multigrid Methods to Node Failures
In this paper, we examine the inherent resilience of multigrid (MG) and conjugate gradient (CG) methods in the search for algorithm-based approaches to dealing with node failures in large parallel HPC systems. In previous work, silent data corruption has been modeled as the perturbation of values in the work arrays of an MG solver, and it was concluded that MG recovers quickly from errors of this type. We explore how quickly MG and CG methods recover from the loss of a contiguous section of their working memory, which models a node failure. Since MG and CG differ in their convergence rates, we propose a methodology for comparing their resilience: time is measured as a fraction of the iterations required to reach a given target precision, and failures are injected when the residual norm falls below a given threshold. We apply the two solvers to a linear system arising from a model elliptic partial differential equation and experimentally evaluate the overhead caused by the injected faults. Additionally, we observe the behavior of the CG solver under node failures for further test problems. Approximating the lost values of the solution by interpolation reduces the overhead for MG, but its effect on the CG solver is minimal. We conclude that both methods also have an inherent ability to recover from node failures; however, we show that the relative overhead caused by node failures is significant.
Keywords: Node failure · Conjugate gradient · Multigrid · Resilience
This work has been supported by the Vienna Science and Technology Fund (WWTF) through project ICT15-113.
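The failure model described in the abstract can be illustrated with a small sketch: run CG on a 1D model Poisson system and, when the relative residual first drops below a threshold, wipe a contiguous slice of the iterate (the lost node's share of the solution vector), optionally patching it by linear interpolation before restarting the iteration from the damaged state. This is a minimal illustration under assumptions of our own (1D problem, slice boundaries, zero Dirichlet data), not the paper's actual experimental setup; the function names and parameters (`cg`, `fail_at`, `lost`, `recover`) are hypothetical.

```python
import numpy as np

def make_poisson_1d(n):
    # Dense stencil matrix for the 1D model elliptic problem -u'' = f
    # (finite differences, zero Dirichlet boundary values).
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def cg(A, b, tol=1e-8, max_iter=10000, fail_at=None, lost=None, recover=False):
    # Plain CG. If `fail_at` is set, the first time the relative residual
    # drops below it, the contiguous slice `lost = (lo, hi)` of the iterate
    # is zeroed to simulate a node failure; with `recover=True` the slice is
    # patched by linear interpolation from its endpoints before restarting.
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    failed = False
    for k in range(max_iter):
        if np.sqrt(rs) <= tol * b_norm:
            return x, k
        if fail_at is not None and not failed and np.sqrt(rs) <= fail_at * b_norm:
            failed = True
            lo, hi = lost
            x[lo:hi] = 0.0  # lose a contiguous section of the solution vector
            if recover:
                left = x[lo - 1] if lo > 0 else 0.0   # zero boundary value
                right = x[hi] if hi < n else 0.0
                x[lo:hi] = np.interp(np.arange(lo, hi), [lo - 1, hi], [left, right])
            # restart CG from the damaged iterate
            r = b - A @ x
            p = r.copy()
            rs = r @ r
            continue
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter
```

Comparing the iteration count of a clean run against runs with `fail_at` set (with and without `recover=True`) gives a crude version of the overhead measurement: the extra iterations needed after the failure, expressed relative to the clean count.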