Scalable Fault Tolerant MPI: Extending the Recovery Algorithm

  • Graham E. Fagg
  • Thara Angskun
  • George Bosilca
  • Jelena Pjesivac-Grbovic
  • Jack J. Dongarra
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3666)


Fault Tolerant MPI (FT-MPI) [6] was designed to offer applications a range of methods for handling process failures beyond simple checkpoint-restart schemes. The initial implementation of FT-MPI included a robust, heavyweight system-state recovery algorithm designed to manage the membership of MPI communicators during multiple failures. Although robust, the algorithm and its implementation were very conservative, which limited scalability on both very large clusters and distributed systems. This paper details the FT-MPI recovery algorithm and our initial experiments with new recovery algorithms aimed at being both scalable and latency tolerant. Our conclusions show that combining topology-aware collective communication with distributed consensus algorithms produces the best results.
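The recovery step the abstract describes can be thought of as a consensus problem: after failures, the surviving processes must agree on a single membership list before communicators are rebuilt. The following toy simulation sketches that idea in plain Python. It is an illustration only, not FT-MPI's actual algorithm; the function name, the leader-election rule (lowest surviving rank), and the gather/broadcast structure are all assumptions for the sake of the sketch.

```python
def agree_on_membership(all_ranks, failed_ranks):
    """Toy membership consensus: every survivor must end up with the
    same (leader, members) pair before rebuilding a communicator.
    Illustrative only -- not FT-MPI's real recovery protocol."""
    survivors = sorted(r for r in all_ranks if r not in failed_ranks)
    if not survivors:
        raise RuntimeError("no surviving processes to reach consensus")
    leader = survivors[0]  # assumed rule: lowest surviving rank coordinates
    # "Gather" phase: the leader collects each survivor's view of who is alive.
    views = {r: set(survivors) for r in survivors}
    # The leader intersects the views so only commonly seen ranks remain.
    agreed = set(survivors)
    for view in views.values():
        agreed &= view
    # "Broadcast" phase: every survivor adopts the leader's agreed list.
    return leader, sorted(agreed)

# Example: 8 processes, ranks 2 and 5 have died.
leader, members = agree_on_membership(range(8), failed_ranks={2, 5})
print(leader, members)  # 0 [0, 1, 3, 4, 6, 7]
```

In a real implementation both phases would be messages over the network, and it is exactly these gather/broadcast steps that the paper's topology-aware collectives aim to make scalable and latency tolerant.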


Keywords: Death Event · Current Algorithm · Recovery Algorithm · Collective Communication · Collective Operation




  1. Agbaria, A., Friedman, R.: Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations. In: 8th IEEE International Symposium on High Performance Distributed Computing (1999)
  2. Batchu, R., Neelamegam, J., Cui, Z., Beddhua, M., Skjellum, A., Dandass, Y., Apte, M.: MPI/FT: Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing. In: Proceedings of the 1st IEEE International Symposium on Cluster Computing and the Grid, Melbourne, Australia (2001)
  3. Beck, M., Dongarra, J.J., Fagg, G.E., Geist, G.A., Gray, P., Kohl, J., Migliardi, M., Moore, K., Moore, T., Papadopoulous, P., Scott, S.L., Sunderam, V.: HARNESS: a next generation distributed virtual machine. Future Generation Computer Systems 15 (1999)
  4. Bosilca, G., Bouteiller, A., Cappello, F., Djilali, S., Fédak, G., Germain, C., Hérault, T., Lemarinier, P., Lodygensky, O., Magniette, F., Néri, V., Selikhov, A.: MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In: SuperComputing, Baltimore, USA (November 2002)
  5. Burns, G., Daoud, R.: Robust MPI message delivery through guaranteed resources. In: MPI Developers Conference (June 1995)
  6. Fagg, G.E., Bukovsky, A., Dongarra, J.J.: HARNESS and fault tolerant MPI. Parallel Computing 27, 1479–1496 (2001)
  7. Fagg, G.E., Moore, K., Dongarra, J.J.: Scalable networked information processing environment (SNIPE). Future Generation Computing Systems 15, 571–582 (1999)
  8. Graham, R.L., Choi, S.-E., Daniel, D.J., Desai, N.N., Minnich, R.G., Rasmussen, C.E., Risinger, L.D., Sukalski, M.W.: A network-failure-tolerant message-passing system for terascale clusters. In: ICS, New York, USA, June 22-26 (2002)
  9. Louca, S., Neophytou, N., Lachanas, A., Evripidou, P.: MPI-FT: Portable fault tolerance scheme for MPI. Parallel Processing Letters 10(4), 371–382. World Scientific Publishing Company, Singapore (2000)
  10. Message Passing Interface Forum: MPI: A Message Passing Interface Standard (June 1995)
  11. Message Passing Interface Forum: MPI-2: Extensions to the Message Passing Interface (July 1997)
  12. Stellner, G.: CoCheck: Checkpointing and process migration for MPI. In: Proceedings of the 10th International Parallel Processing Symposium (IPPS 1996), Honolulu, Hawaii (1996)
  13. Vadhiyar, S.S., Fagg, G.E., Dongarra, J.J.: Performance modeling for self-adapting collective communications for MPI. In: LACSI Symposium, Eldorado Hotel, Santa Fe, NM, October 15-18. Springer, Heidelberg (2001)
  14. Sankaran, S., Squyres, J.M., Barrett, B., Lumsdaine, A., Duell, J., Hargrove, P., Roman, E.: The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing. In: LACSI Symposium, Santa Fe, NM (October 2003)
  15. Gabriel, E., Fagg, G.E., Bosilca, G., Angskun, T., Dongarra, J.J., Squyres, J.M., Sahay, V., Kambadur, P., Barrett, B., Lumsdaine, A., Castain, R.H., Daniel, D.J., Graham, R.L., Woodall, T.S.: Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In: Proceedings of the 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary (2004)
  16. Kielmann, T., Hofman, R.F.H., Bal, H.E., Plaat, A., Bhoedjang, R.A.F.: MagPIe: MPI's collective communication operations for clustered wide area systems. In: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 1999), May 1999, vol. 34(8), pp. 131–140 (1999)
  17. Tanenbaum, A.S., van Steen, M.: Distributed Systems: Principles and Paradigms. Prentice Hall, Englewood Cliffs (2002)

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Graham E. Fagg (1)
  • Thara Angskun (1)
  • George Bosilca (1)
  • Jelena Pjesivac-Grbovic (1)
  • Jack J. Dongarra (1)

  1. Dept. of Computer Science, The University of Tennessee, Knoxville, USA
