Epidemic Fault Tolerance for Extreme-Scale Parallel Computing

Katti, Amogh; Di Fatta, Giuseppe

doi:10.1007/978-3-319-23237-9_18


Conference paper
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9258)

Abstract

The process failure rate in the next generation of high-performance computing systems is expected to be very high. The MPI Forum is working on semantics and support for fault tolerance, which has resulted in the Run-Through Stabilization, User-Level Failure Mitigation and Process Recovery proposals. The Run-Through Stabilization and User-Level Failure Mitigation proposals require a fault-tolerant failure detection and consensus algorithm to inform the application of failures, so that it can employ Algorithm-Based Fault Tolerance for quicker recovery and continued execution. This paper briefly discusses the proposals, reviews the failure detectors available in the literature, and explains why they are unsuitable for realizing fault tolerance in MPI. It then outlines an inherently fault-tolerant and scalable epidemic (or gossip-based) approach to failure detection and consensus. Simulations and an initial experimental analysis are presented, which indicate that this is a promising research direction.
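
To make the idea of gossip-based failure detection and consensus concrete, the following is a minimal sketch of a generic heartbeat-gossip detector. It is not the algorithm from the paper, whose details are not given in the abstract. The function name `simulate`, its parameters, the push-pull exchange and the timeout rule are all assumptions chosen for illustration. Each live process increments its own heartbeat every cycle, exchanges its heartbeat table with one random peer, and suspects any process whose heartbeat has not advanced for more than `timeout` cycles. Consensus is approximated here as the first cycle at which every live process suspects exactly the failed set.

```python
import random


def simulate(n=16, failed=frozenset({5}), fail_at=10, timeout=20, cycles=200, seed=1):
    """Return the first cycle at which all live processes suspect exactly `failed`.

    Illustrative assumptions: synchronous cycles, uniform random peer choice,
    crash-stop failures at cycle `fail_at`, and a fixed suspicion timeout.
    """
    rng = random.Random(seed)
    hb = [[0] * n for _ in range(n)]    # hb[i][j]: highest heartbeat of j seen by i
    last = [[0] * n for _ in range(n)]  # last[i][j]: cycle when hb[i][j] last increased
    for cycle in range(1, cycles + 1):
        alive = {p for p in range(n) if p not in failed or cycle < fail_at}
        for p in alive:                 # each live process ticks its own heartbeat
            hb[p][p] += 1
            last[p][p] = cycle
        for p in sorted(alive):
            q = rng.choice([x for x in range(n) if x != p])
            if q not in alive:
                continue                # gossip sent to a failed process is lost
            for j in range(n):          # push-pull: both sides keep the larger value
                top = max(hb[p][j], hb[q][j])
                for r in (p, q):
                    if hb[r][j] < top:
                        hb[r][j] = top
                        last[r][j] = cycle
        if cycle >= fail_at:
            # a process suspects j once j's heartbeat has been stale for too long
            views = [{j for j in range(n) if cycle - last[p][j] > timeout} for p in alive]
            if all(v == set(failed) for v in views):
                return cycle
    return None


if __name__ == "__main__":
    print("all live processes agree on the failed set at cycle", simulate())
```

In this sketch no process coordinates the others, so the loss of any single process does not stop detection. The timeout trades detection latency against false suspicions: it must exceed the number of cycles a fresh heartbeat typically needs to spread to all processes, which in this kind of random exchange is expected to grow roughly logarithmically with the number of processes.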

Author information

Authors and Affiliations

School of Systems Engineering, University of Reading, Whiteknights, Reading, Berkshire, RG6 6AY, UK
Amogh Katti & Giuseppe Di Fatta

Corresponding author

Correspondence to Amogh Katti.

Editor information

Editors and Affiliations

School of Systems Engineering, University of Reading, Reading, Berkshire, United Kingdom
Giuseppe Di Fatta
Dipartimento di Ingegneria Informatica, Modellistica, Elettronica e Sistemistica, University of Calabria, Rende, Italy
Giancarlo Fortino
School of Logistics and Engineer, University of Technology Wuhan, Wuhan, China
Wenfeng Li
CSIRO ICT, Acton, Australia
Mukaddim Pathan
School of Systems Engineering, University of Reading, Whiteknights, Reading, United Kingdom
Frederic Stahl
Dipartimento di Ingegneria Informatica, Modellistica, Elettronica e Sistemistica, University of Calabria, Rende, Italy
Antonio Guerrieri

About this paper

Cite this paper

Katti, A., Di Fatta, G. (2015). Epidemic Fault Tolerance for Extreme-Scale Parallel Computing. In: Di Fatta, G., Fortino, G., Li, W., Pathan, M., Stahl, F., Guerrieri, A. (eds) Internet and Distributed Computing Systems. IDCS 2015. Lecture Notes in Computer Science, vol 9258. Springer, Cham. https://doi.org/10.1007/978-3-319-23237-9_18

DOI: https://doi.org/10.1007/978-3-319-23237-9_18
Published: 25 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23236-2
Online ISBN: 978-3-319-23237-9
eBook Packages: Computer Science, Computer Science (R0)
