On the Resilience of Conjugate Gradient and Multigrid Methods to Node Failures

  • Carlos Pachajoa
  • Wilfried N. Gansterer
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10659)

Abstract

In this paper, we examine the inherent resilience of multigrid (MG) and conjugate gradient (CG) methods in the search for algorithm-based approaches to deal with node failures in large parallel HPC systems. In previous work, silent data corruption has been modeled as the perturbation of values in the work arrays of an MG solver, and it was concluded that MG recovers quickly from errors of this type. We explore how quickly MG and CG methods recover from the loss of a contiguous section of their working memory, which models a node failure. Since MG and CG methods differ in their convergence rates, we propose a methodology to compare their resilience: time is represented as a fraction of the iterations required to reach a certain target precision, and failures are introduced when the residual norm reaches a certain threshold. We use the two solvers on a linear system that represents a model elliptic partial differential equation, and we experimentally evaluate the overhead caused by the introduced faults. Additionally, we observe the behavior of the CG solver under node failures for additional test problems. Approximating the lost values of the solution using interpolation reduces the overhead for MG, but the effect on the CG solver is minimal. We conclude that the methods also have the inherent ability to recover from node failures. However, we illustrate that the relative overhead caused by node failures is significant.
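The failure model described in the abstract can be illustrated with a small experiment. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses NumPy instead of a parallel solver library, a dense 2D Poisson matrix as the model elliptic problem, zero-filling of the lost entries (no interpolation), and a full CG restart after the failure. The names `poisson_2d`, `cg_with_failure`, `fail_at`, and `lost` are hypothetical and chosen only for this example.

```python
import numpy as np


def poisson_2d(n):
    """Dense 5-point Laplacian on an n x n grid (assumed model problem)."""
    T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    I = np.eye(n)
    return np.kron(I, T) + np.kron(T, I)


def cg_with_failure(A, b, tol=1e-8, fail_at=None, lost=None, max_iter=10_000):
    """Plain CG. If fail_at is given, the entries x[lost] are wiped once,
    the first time the relative residual drops to fail_at or below, and
    CG is restarted from the damaged iterate.
    Returns (x, number of iterations)."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    failed = fail_at is None  # nothing to inject in the failure-free run
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        rel = np.sqrt(rs_new) / b_norm
        if rel <= tol:
            return x, k
        if not failed and rel <= fail_at:
            x[lost] = 0.0      # simulated node failure: contiguous block lost
            r = b - A @ x      # recompute the residual from surviving data
            p = r.copy()       # restart: search direction is reset
            rs = r @ r
            failed = True
            continue
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter


A = poisson_2d(16)
b = np.random.default_rng(0).standard_normal(A.shape[0])
lost = slice(64, 128)  # one quarter of the unknowns, stored contiguously

x_ref, k_ref = cg_with_failure(A, b)
x_fail, k_fail = cg_with_failure(A, b, fail_at=1e-4, lost=lost)

# Relative overhead, measured in units of failure-free iterations.
overhead = (k_fail - k_ref) / k_ref
print(f"failure-free: {k_ref} iterations, with failure: {k_fail}, "
      f"relative overhead: {overhead:.2f}")
```

Expressing the overhead as a fraction of the failure-free iteration count follows the normalization idea in the abstract, which is what makes solvers with different convergence rates comparable. Varying `fail_at` changes how far the solver has progressed when the block is lost.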

Keywords

Node failure · Conjugate gradient · Multigrid · Resilience

Notes

Acknowledgement

This work has been supported by the Vienna Science and Technology Fund (WWTF) through project ICT15-113.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. University of Vienna, Faculty of Computer Science, Vienna, Austria
