Epidemic Fault Tolerance for Extreme-Scale Parallel Computing

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9258)

Abstract

Process failure rates in the next generation of high-performance computing systems are expected to be very high. The MPI Forum is working on semantics and support for fault tolerance, and the Run-Through Stabilization, User-Level Failure Mitigation and Process Recovery proposals are the result. The Run-Through Stabilization and User-Level Failure Mitigation proposals require a fault-tolerant failure detection and consensus algorithm that informs the application of failures, so that it can employ Algorithm-Based Fault Tolerance for quicker recovery and continued execution. This paper briefly discusses the proposals, surveys the failure detectors available in the literature, and explains why they are unsuitable for realizing fault tolerance in MPI. It then outlines an inherently fault-tolerant and scalable epidemic (gossip-based) approach to failure detection and consensus. Simulations and an initial experimental analysis indicate that this is a promising research direction.
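
To make the idea of gossip-based failure detection concrete, the following is a minimal sketch of a generic heartbeat-gossip detector. It is not the algorithm proposed in the paper, which the abstract does not specify. The function name `simulate`, its parameters (`n`, `fail_node`, `fail_round`, `t_fail`, `rounds`, `seed`) and the round-based, single-process simulation are all assumptions made for illustration. Each node keeps a heartbeat counter per peer, pushes its table to one random peer per round, and suspects any peer whose counter has not risen for more than `t_fail` rounds.

```python
import random


def simulate(n=8, fail_node=3, fail_round=10, t_fail=30, rounds=200, seed=0):
    """Return {alive_node: set_of_suspected_nodes} after `rounds` gossip rounds.

    Illustrative assumption: one crash, no message loss between live nodes,
    and nodes take turns within a round instead of running concurrently.
    """
    rng = random.Random(seed)
    # table[i][j] = [highest heartbeat node i has seen for node j,
    #                round in which node i last raised that value]
    table = [[[0, 0] for _ in range(n)] for _ in range(n)]
    alive = [True] * n

    for now in range(1, rounds + 1):
        if now == fail_round:
            alive[fail_node] = False  # crash: the node stops gossiping
        for i in range(n):
            if not alive[i]:
                continue
            # Bump the node's own heartbeat, then push its table to one peer.
            table[i][i] = [table[i][i][0] + 1, now]
            peer = rng.choice([p for p in range(n) if p != i])
            if not alive[peer]:
                continue  # a message sent to a crashed node is lost
            for j in range(n):
                # The receiver keeps the larger counter and timestamps the change.
                if table[i][j][0] > table[peer][j][0]:
                    table[peer][j] = [table[i][j][0], now]

    # A node suspects every peer whose counter has been stale for > t_fail rounds.
    return {
        i: {j for j in range(n) if rounds - table[i][j][1] > t_fail}
        for i in range(n)
        if alive[i]
    }
```

Calling `simulate()` with these defaults should leave every surviving node suspecting only node 3, so the local suspect lists coincide. Comparing those lists is only a stand-in for the consensus step the abstract mentions. The timeout is the main design choice in such a detector: a small `t_fail` detects crashes sooner but risks suspecting live nodes whose heartbeats happen to spread slowly.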

Author information

Corresponding author

Correspondence to Amogh Katti.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Katti, A., Di Fatta, G. (2015). Epidemic Fault Tolerance for Extreme-Scale Parallel Computing. In: Di Fatta, G., Fortino, G., Li, W., Pathan, M., Stahl, F., Guerrieri, A. (eds) Internet and Distributed Computing Systems. IDCS 2015. Lecture Notes in Computer Science, vol 9258. Springer, Cham. https://doi.org/10.1007/978-3-319-23237-9_18

  • DOI: https://doi.org/10.1007/978-3-319-23237-9_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23236-2

  • Online ISBN: 978-3-319-23237-9

  • eBook Packages: Computer Science, Computer Science (R0)
