Fault Tolerance in an Inner-Outer Solver: A GVR-Enabled Case Study

  • Conference paper
High Performance Computing for Computational Science -- VECPAR 2014 (VECPAR 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8969)

Abstract

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, because they account for much of the run time of many scientific applications. We show that single bit-flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the correct answer. Informed by these results, we design and evaluate several fault-tolerance strategies for both the inner and outer solvers, appropriate across a range of error rates. We implement them by extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots and multi-version data structures with portable, rich error checking and recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.

Acknowledgments

We thank Mark Hoemmen from Sandia National Laboratories for his advice. This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Award DE-SC0008603 and Contract DE-AC02-06CH11357, and by the DOE National Nuclear Security Administration (NNSA) Advanced Simulation and Computing (ASC) program. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy National Nuclear Security Administration under contract DE-AC04-94AL85000.

Corresponding author

Correspondence to Andrew A. Chien.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Zheng, Z., Chien, A.A., Teranishi, K. (2015). Fault Tolerance in an Inner-Outer Solver: A GVR-Enabled Case Study. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science -- VECPAR 2014. VECPAR 2014. Lecture Notes in Computer Science, vol 8969. Springer, Cham. https://doi.org/10.1007/978-3-319-17353-5_11

  • DOI: https://doi.org/10.1007/978-3-319-17353-5_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-17352-8

  • Online ISBN: 978-3-319-17353-5

  • eBook Packages: Computer Science, Computer Science (R0)
