Abstract
Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, which account for much of the run time of many scientific applications. We show that single bit-flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several fault-tolerance strategies for both the inner and outer solvers, appropriate across a range of error rates. We implement them by extending Trilinos' solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots and multi-version data structures with portable, rich error checking and recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.
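To make the fault model concrete, here is a minimal Python sketch of what a single bit flip does to one double-precision value. It is an illustration only and not the fault-injection method used in the paper; the helper name `flip_bit` is hypothetical, and the sketch assumes IEEE-754 binary64 floats, which Python uses on all common platforms.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 binary64 encoding inverted.

    Bit 0 is the least significant mantissa bit, bits 52-62 are the
    exponent, and bit 63 is the sign. (Hypothetical helper for illustration.)
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the chosen bit and reinterpret the result as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    x = 1.0
    for bit in (0, 52, 62, 63):
        print(f"bit {bit:2d}: {x!r} -> {flip_bit(x, bit)!r}")
```

The impact depends strongly on which bit is hit: a flip in the low mantissa bits perturbs the value only slightly, while a flip in the high exponent bits can turn an ordinary value into a very large or non-finite one. This is one plausible reason why some errors cost only extra iterations and others prevent convergence.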
Acknowledgments
We thank Mark Hoemmen of Sandia National Laboratories for his advice. This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Award DE-SC0008603 and Contract DE-AC02-06CH11357, and by the DOE National Nuclear Security Administration (NNSA) Advanced Simulation and Computing (ASC) program. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy National Nuclear Security Administration under contract DE-AC04-94AL85000.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Zheng, Z., Chien, A.A., Teranishi, K. (2015). Fault Tolerance in an Inner-Outer Solver: A GVR-Enabled Case Study. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science -- VECPAR 2014. VECPAR 2014. Lecture Notes in Computer Science, vol 8969. Springer, Cham. https://doi.org/10.1007/978-3-319-17353-5_11
DOI: https://doi.org/10.1007/978-3-319-17353-5_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17352-8
Online ISBN: 978-3-319-17353-5
eBook Packages: Computer Science, Computer Science (R0)