Part of the book series: Computer Communications and Networks (CCN)

Abstract

As supercomputers enter an era of massive parallelism in which faults are increasingly frequent, the MPI standard remains distressingly vague about the consequences of failures for MPI communications. In this chapter, we present the spectrum of techniques that can be applied to enable MPI application recovery, ranging from fully automatic to completely user-driven. First, we present the effective deployment of the most advanced checkpoint/restart techniques within the MPI implementation, so that failed processors are automatically restarted in a state consistent with surviving processes, at a performance cost. Then, we investigate how MPI can support application-driven recovery techniques, and introduce a set of extensions to MPI that restore communication capabilities while maintaining the extreme level of performance to which MPI users have become accustomed.

Notes

  1. http://www.top500.org/.

  2. http://meetings.mpi-forum.org/mpi3.0_ft.php.

  3. The interested reader may refer to Chap. 17 of the complete draft, available from http://fault-tolerance.org/ulfm/ulfm-specification.

  4. https://asc.llnl.gov/sequoia/benchmarks/#amg.

Acknowledgments

The author would like to thank the coworkers who contributed to the software or to the source material presented in this chapter, in particular Christine Morin and Thomas Ropars, who led the effort on the optimistic message logging protocol; Wesley Bland and Joshua Hursey, who helped with the implementation of ULFM; and George Bosilca and Thomas Herault. This work has received support from the NSF under award Adapt #1339763, and from the CREST project of the Japan Science and Technology Agency (JST).

Corresponding author

Correspondence to Aurélien Bouteiller.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Bouteiller, A. (2015). Fault-Tolerant MPI. In: Herault, T., Robert, Y. (eds) Fault-Tolerance Techniques for High-Performance Computing. Computer Communications and Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-20943-2_3

  • DOI: https://doi.org/10.1007/978-3-319-20943-2_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-20942-5

  • Online ISBN: 978-3-319-20943-2

  • eBook Packages: Computer Science, Computer Science (R0)