Scalability of Partial Differential Equations Preconditioner Resilient to Soft and Hard Faults

  • Conference paper
High Performance Computing (ISC High Performance 2016)

Abstract

We present a resilient domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm reformulates the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to both soft and hard faults. We discuss an implementation based on a server-client model where all state information is held by the servers, while clients are designed solely as computational units. Servers are assumed to be “sandboxed”, while no assumption is made on the reliability of the clients. We explore the scalability of the algorithm up to \(\sim\)12k cores, build an SST/macro skeleton to extrapolate to \(\sim\)50k cores, and show the resilience under simulated hard and soft faults for a 2D linear Poisson equation.
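The abstract gives no algorithmic detail, so the following is only a toy sketch of the general idea, under several assumptions that are not taken from the paper: a 1D Poisson problem \(-u'' = 1\) on \([0,1]\) with homogeneous Dirichlet conditions, two overlapping subdomains, hard faults modeled as lost client results, soft faults modeled as a large additive corruption, and a Theil-Sen (median-of-slopes) fit standing in for whatever robust regression the authors actually use. All function names, the fault schedule, and the grid parameters are hypothetical. The server samples Dirichlet values on each subdomain boundary, unreliable clients solve the subdomain problems, and the server recovers the affine boundary-to-boundary maps from the surviving samples before solving a small interface system.

```python
import numpy as np

# Hypothetical setup: uniform grid on [0, 1], subdomains [0, b] and [a, 1].
N = 41                # number of grid nodes
H = 1.0 / (N - 1)     # grid spacing
IA, IB = 16, 24       # node indices of a = 0.4 and b = 0.6 (overlap between them)

def client_solve(left, right, n_interior, h):
    """Finite-difference solve of -u'' = 1 with Dirichlet data; returns interior values."""
    A = (2.0 * np.eye(n_interior)
         - np.eye(n_interior, k=1)
         - np.eye(n_interior, k=-1)) / h**2
    rhs = np.ones(n_interior)
    rhs[0] += left / h**2
    rhs[-1] += right / h**2
    return np.linalg.solve(A, rhs)

def unreliable_client(task_id, which, value):
    """Client task with an assumed, deterministic fault schedule (illustration only)."""
    if task_id % 5 == 4:
        return None                      # hard fault: the result never arrives
    if which == 1:                       # subdomain [0, b]: sampled u(b), report u(a)
        out = client_solve(0.0, value, IB - 1, H)[IA - 1]
    else:                                # subdomain [a, 1]: sampled u(a), report u(b)
        out = client_solve(value, 0.0, N - 2 - IA, H)[IB - IA - 1]
    if task_id % 7 == 3:
        out += 1.0e3                     # soft fault: silently corrupted value
    return out

def robust_affine_fit(x, y):
    """Theil-Sen fit: median of pairwise slopes, then median of the intercepts."""
    i, j = np.triu_indices(len(x), k=1)
    slope = np.median((y[j] - y[i]) / (x[j] - x[i]))
    intercept = np.median(y - slope * x)
    return intercept, slope

def server_fit_map(which, samples):
    """Server side: dispatch samples, drop lost results, fit the affine map robustly."""
    xs, ys = [], []
    for task_id, s in enumerate(samples):
        result = unreliable_client(task_id, which, s)
        if result is not None:
            xs.append(s)
            ys.append(result)
    return robust_affine_fit(np.array(xs), np.array(ys))

def solve_interface():
    """Recover u(a) and u(b) from the two fitted boundary-to-boundary maps."""
    samples = np.linspace(-1.0, 1.0, 20)
    c1, m1 = server_fit_map(1, samples)  # u(a) = c1 + m1 * u(b)
    c2, m2 = server_fit_map(2, samples)  # u(b) = c2 + m2 * u(a)
    alpha, beta = np.linalg.solve([[1.0, -m1], [-m2, 1.0]], [c1, c2])
    return alpha, beta

if __name__ == "__main__":
    u_a, u_b = solve_interface()
    print(u_a, u_b)  # compare with the exact solution u(x) = x (1 - x) / 2
```

In this sketch the clients hold no state: every task is defined entirely by the sampled boundary value the server sends, so a lost or corrupted result costs only one sample. The median-based fit tolerates the corrupted samples as long as uncorrupted pairs remain the majority, which holds for the fault schedule assumed here.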

Acknowledgments

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Award Number 13-016717. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Author information

Corresponding author

Correspondence to Karla Morris.

Copyright information

© 2016 Springer International Publishing Switzerland (outside the US)

About this paper

Cite this paper

Morris, K. et al. (2016). Scalability of Partial Differential Equations Preconditioner Resilient to Soft and Hard Faults. In: Kunkel, J., Balaji, P., Dongarra, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9697. Springer, Cham. https://doi.org/10.1007/978-3-319-41321-1_24

  • DOI: https://doi.org/10.1007/978-3-319-41321-1_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41320-4

  • Online ISBN: 978-3-319-41321-1

  • eBook Packages: Computer Science, Computer Science (R0)
