Abstract
As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI standard remains distressingly vague on the consequence of failures on MPI communications. In this chapter, we present the spectrum of techniques that can be applied to enable MPI application recovery, ranging from fully automatic to completely user driven. First, we present the effective deployment of most advanced checkpoint/restart techniques within the MPI implementation, so that failed processors are automatically restarted in a consistent state with surviving processes, at a performance cost. Then, we investigate how MPI can support application-driven recovery techniques, and introduce a set of extensions to MPI that allow restoring communication capabilities, while maintaining the extreme level of performance to which MPI users have become accustomed.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
- 2.
- 3.
The interested reader may refer to Chap. 17 of the complete draft, available from http://fault-tolerance.org/ulfm/ulfm-specification.
- 4.
References
Ali MM, Southern J, Strazdins PE, Harding B (2014) Application level fault recovery: using fault-tolerant open MPI in a PDE solver. In: 2014 IEEE international parallel and distributed processing symposium workshops, Phoenix, AZ, USA, 19–23 May 2014, pp 1169–1178. IEEE
Alvisi L, Marzullo K (1995) Message logging: pessimistic, optimistic, and causal. In: Proceedings of the 15th international conference on distributed computing systems (ICDCS 1995). IEEE CS Press, pp 229–236
Alvisi L, Elnozahy E, Rao S, Husain SA, Mel AD (1999) An analysis of communication induced checkpointing. In: 29th symposium on fault-tolerant computing (FTCS’99). IEEE CS Press
Bautista-Gomez L, Tsuboi S, Komatitsch D, Cappello F, Maruyama N, Matsuoka S (2011) FTI: high performance fault tolerance interface for hybrid systems. In: International conference high performance computing, networking, storage and analysis SC’11
Bland W, Bouteiller A, Herault T, Hursey J, Bosilca G, Dongarra JJ (2013) An evaluation of user-level failure mitigation support in MPI. Computing 95(12):1171–1184
Bosilca G, Delmas R, Dongarra J, Langou J (2009) Algorithm-based fault tolerance applied to high performance computing. J Parallel Distrib Comput 69(4):410–416
Bosilca G, Bouteiller A, Herault T, Lemarinier P, Dongarra JJ (2010) Dodging the cost of unavoidable memory copies in message logging protocols. In: Keller R, Gabriel E, Resch MM, Dongarra J (eds) EuroMPI, Lecture Notes in Computer Science, Vol 6305. pp 189–197
Bouteiller A, Collin B, Herault T, Lemarinier P, Cappello F (2005) Impact of event logger on causal message logging protocols for fault tolerant MPI. In: IPDPS’05: proceedings of the 19th IEEE international parallel and distributed processing symposium (IPDPS’05)—papers. IEEE Computer Society, Washington, DC, USA, p 97
Bouteiller A, Herault T, Krawezik G, Lemarinier P, Cappello F (2006) MPICH-V project: a multiprotocol automatic fault tolerant MPI. Int J High Perform Comput Appl 20:319–333
Bouteiller A, Ropars T, Bosilca G, Morin C, Dongarra J (2009) Reasons to be pessimist or optimist for failure recovery in high performance clusters. In: IEEE, editor, proceedings of the 2009 IEEE cluster conference
Bouteiller A, Bosilca G, Dongarra J (2010) Redesigning the message logging model for high performance. Concurr Comput: Pract Exp 22(16):2196–2211
Bouteiller A, Herault T, Bosilca G, Dongarra JJ (2013) Correlated set coordination in fault tolerant message logging protocols for many-core clusters. Concurr Comput: Pract Exp 25(4):572–585
Bouteiller A, Herault T, Bosilca G, Du P, Dongarra J (2015) Algorithm-based fault tolerance for dense matrix factorizations, multiple failures and accuracy. ACM Trans Parallel Comput 1(2):10:1–10:28
Bronevetsky G (2007) Portable checkpointing for parallel applications. Ph.D. thesis, Cornell University, Department of Computer Science
Buntinas D, Coti C, Herault T, Lemarinier P, Pilard L, Rezmerita A, Rodriguez E, Cappello F (2008) Blocking versus non-blocking coordinated checkpointing for large-scale fault tolerant MPI protocols. Future Gener Comput Syst 24(1):73–84
Cappello F, Geist A, Gropp B, Kalé LV, Kramer B, Snir M (2009) Toward exascale resilience. Int J High Perform Comput Appl 23(4):374–388
Cappello F, Casanova H, Robert Y (2011) Preventive migration versus preventive checkpointing for extreme scale supercomputers. Parallel Process Lett, pp 111–132
Chakravorty S, Mendes CL, Kalé LV (2006) Proactive fault tolerance in MPI applications via task migration. In: HiPC 2006, the IEEE high performance computing conference. IEEE Computer Society Press, pp 485–496
Chandra TD, Toueg S (1996) Unreliable failure detectors for reliable distributed systems. J ACM (JACM) 43(2):225–267
Chandy KM, Lamport L (1985) Distributed snapshots: determining global states of distributed systems. Trans Comput Syst 3(1):63–75. ACM
Chen Z, Fagg GE, Gabriel E, Langou J, Angskun T, Bosilca G, Dongarra J (2005) Fault tolerant high performance computing by a coding approach. In: Proceedings of the tenth ACM SIGPLAN symposium on principles and practice of parallel programming, PPoPP’05. ACM, New York, USA, pp 213–223
Daly JT (2006) A higher order estimate of the optimum checkpoint interval for restart dumps. Future Gener Comput Syst 22:303–312
Damani OP, Wang Y-M, Garg VK (2003) Distributed recovery with K-optimistic logging. J Parallel Distrib Comput 63:1193–1218
Davies T, Karlsson C, Liu H, Ding C, Chen Z (2011) High performance Linpack benchmark: a fault tolerant implementation without checkpointing. In: Proceedings of the 25th ACM international conference on supercomputing (ICS 2011). ACM
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design and implementation—vol 6, OSDI’04. USENIX Association, Berkeley, CA, USA, pp 10–10
Dongarra J, Blackford L, Choi J et al (1997) ScaLAPACK user’s guide. Society for Industrial and Applied Mathematics, Philadelphia
Dongarra J, Beckman P, Moore T, Aerts P, Aloisio G, Andre J-C, Barkai D, Berthou J-Y, Boku T, Braunschweig B, Cappello F, Chapman B, Chi X, Choudhary A, Dosanjh S, Dunning T, Fiore S, Geist A, Gropp B, Harrison R, Hereld M, Heroux M, Hoisie A, Hotta K, Jin Z, Ishikawa Y, Johnson F, Kale S, Kenway R, Keyes D, Kramer B, Labarta J, Lichnewsky A, Lippert T, Lucas B, Maccabe B, Matsuoka S, Messina P, Michielse P, Mohr B, Mueller MS, Nagel WE, Nakashima H, Papka ME, Reed D, Sato M, Seidel E, Shalf J, Skinner D, Snir M, Sterling T, Stevens R, Streitz F, Sugar B, Sumimoto S, Tang W, Taylor J, Thakur R, Trefethen A, Valero M, Van Der Steen A, Vetter J, Williams P, Wisniewski R, Yelick K (2011) The international exascale software project roadmap. Int J High Perform Comput Appl 25(1):3–60
Duell J (2002) The design and implementation of Berkeley lab’s Linux checkpoint/restart. Technical report LBNL-54941, Livermore-Berkeley National Laboratory
Elnozahy E, Zwaenepoel W (1992) Manetho: transparent rollback-recovery with low overhead, limited rollback and fast output. IEEE Trans Comput 41(5):526–531
Elnozahy M, Alvisi L, Wang YM, Johnson DB (2002) A survey of rollback-recovery protocols in message-passing systems. ACM Comput Surv (CSUR) 34(3):375–408
Fagg G, Dongarra J (2000) FT-MPI: fault tolerant MPI, supporting dynamic applications in a dynamic world. In: 7th Euro PVM/MPI user’s group meeting 2000, vol 1908/2000, Balatonfüred, “ Hungary”. Springer, Heidelberg
Fagg GE, Bukovsky A, Dongarra JJ (2001) HARNESS and fault tolerant MPI. Parallel Comput 27(11):1479–1495
Ferreira K, Stearley J, Laros J, Oldfield R, Pedretti K, Brightwell R, Riesen R, Bridges P, Arnold D (2011) Evaluating the viability of process replication reliability for exascale systems. In: International conference for high performance computing, networking, storage and analysis (SC), 2011. ACM Request Permissions, pp 1–12
Fischer M, Lynch N, Paterson M (1985) Impossibility of distributed consensus with one faulty process. J ACM 32:374–382
Gabriel E, Fagg GE, Bosilca G, Angskun T, Dongarra JJ, Squyres JM, Sahay V, Kambadur P, Barrett B, Lumsdaine A, Castain RH, Daniel DJ, Graham RL, Woodall TS (2004) Open MPI: goals, concept, and design of a next generation MPI implementation. In: Proceedings of the 11th European PVM/MPI users’ group meeting, Budapest, Hungary, pp 97–104
Gamell M, Katz DS, Kolla H, Chen J, Klasky S, Parashar M (2014) Exploring automatic, online failure recovery for scientific applications at extreme scales. In: Proceedings of the international conference for high performance computing, networking, storage and analysis, SC’14. IEEE Press, Piscataway, NJ, pp 895–906
Gao Q, Huang W, Koop MJ, Panda DK (2007) Group-based coordinated checkpointing for MPI: a case study on infiniband. In: International conference on parallel processing, 2007. ICPP 2007
Geist GA, Kohl JA, Papadopoulos PM (1996) PVM and MPI: a comparison of features. Calc Paralleles 8:137–150
Gelenbe E (1979) On the optimum checkpoint interval. J ACM 26:259–270
Geoffray P (2002) OPIOM: off-processor i/o with myrinet. Future Gener Comput Syst 18(4):491–499
Goglin B (2008) Improving message passing over ethernet with i/oat copy offload in open-mx. In: Proceedings of the 2008 IEEE international conference on cluster computing. IEEE, pp 223–231
Gropp W, Lusk E (1997) Why are PVM and MPI so different? In: Bubak M, Dongarra J, Waśniewski J (eds) Recent advances in parallel virtual machine and message passing interface, Lecture Notes in Computer Science, vol 1332. Springer, Berlin, pp 1–10
Gropp W, Lusk E (2004) Fault tolerance in message passing interface programs. Int J High Perform Comput Appl 18:363–372
Guermouche A, Ropars T, Snir M, Cappello F (2012) HydEE: failure containment without event logging for large scale send-deterministic MPI applications. In: IEEE 26th international parallel distributed processing symposium (IPDPS), 2012, pp 1216–1227
Hadzilacos V, Toueg S (1993) Fault-tolerant broadcasts and related problems. In: Mullender S (ed) Distributed systems, 2nd edn. ACM/Addison-Wesley, Boston, pp 97–145 (chapter 5)
Hassani A, Skjellum A, Brightwell R (2014) Design and evaluation of FA-MPI, a transactional resilience scheme for non-blocking MPI. In: International conference on dependable systems and networks (DSN), 2014 44th annual IEEE/IFIP, pp 750–755
Ho JCY, Wang C-L, Lau FCM (2008) Scalable group-based checkpoint/restart for large-scale message-passing systems. In: Proceedings of the 22nd IEEE international symposium on parallel and distributed processing (IPDPS). IEEE, pp 1–12
Huang K, Abraham J (1984) Algorithm-based fault tolerance for matrix operations. IEEE Trans Comput 100(6):518–528
Hursey J, Squyres J, Mattox T, Lumsdaine A (2007) The design and implementation of checkpoint/restart process fault tolerance for open MPI. In: IEEE international parallel and distributed processing symposium, 2007. IPDPS 2007, pp 1–8
Hursey J, Naughton T, Vallee G, Graham RL (2011) A log-scaling fault tolerant agreement algorithm for a fault tolerant MPI. In: EuroMPI 2011: proceedings of the 18th EuroMPI conference, Santorini, Greece
Hursey J, Graham RL, Bronevetsky G, Buntinas D, Pritchard H, Solt DG (2011) Run-through stabilization: an MPI proposal for process fault tolerance. In: EuroMPI 2011: proceedings of the 18th EuroMPI conference, Santorini, Greece
Hélary J-M, Mostefaoui A, Raynal M (1999) Communication-induced determination of consistent snapshots. IEEE Trans Parallel Distrib Syst 10(9):865–877
Lamport L (1978) Time, clocks, and the ordering of events in a distributed system. Commun ACM 21(7):558–565
Lemarinier P, Bouteiller A, Herault T, Krawezik G, Cappello F (2004) Improved message logging versus improved coordinated checkpointing for fault tolerant MPI. In: IEEE international conference on cluster computing. IEEE CS Press
Liu X, Xu X, Ren X, Tang Y, Dai Z (2013) A message logging protocol based on user level failure mitigation. In: Kołodziej J, Di Martino B, Talia D, Xiong K (eds) Algorithms and architectures for parallel processing, Lecture Notes in Computer Science, vol 8285. Springer International Publishing, Switzerland, pp 312–323
Lu C-D, Reed DA (2005) Scalable diskless checkpointing for large parallel systems. Ph.D. thesis, University of Illinois at Urbana-Champain
Luk F, Park H (1988) An analysis of algorithm-based fault tolerance techniques. J Parallel Distrib Comput 5(2):172–184
Meneses E, Mendes CL, Kalé LV (2010) Team-based message logging: preliminary results. In: Proceedings of the 2010 10th IEEE/ACM international conference on cluster, cloud and grid computing, CCGRID’10. IEEE Computer Society, Washington, pp 697–702
Mohan C, Lindsay B (1985) Efficient commit protocols for the tree of processes model of distributed transactions. SIGOPS OSR, vol 19. ACM, New York, pp 40–52
Moody A, Bronevetsky G, Mohror K, de Supinski BR (2010) Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: Proceedings of the 2010 ACM/IEEE international conference for high performance computing. Networking, storage and analysis, pp 1–11
Negara S, Zheng G, Pan K-C, Negara N, Johnson RE, Kalé LV, Ricker PM (2011) Automatic MPI to AMPI program transformation using photran. In: Proceedings of the 2010 conference on parallel processing, Euro-Par 2010. Springer, Berlin, pp 531–539
Pauli S, Kohler M, Arbenz P (2013) A fault tolerant implementation of multi-level monte carlo methods. In: Bader M, Bode A, Bungartz H, Gerndt M, Joubert GR, Peters FJ (eds) Parallel computing: accelerating computational science and engineering (CSE), proceedings of the international conference on parallel computing, ParCo 2013, 10–13 September 2013, Garching (near Munich), Germany, advances in parallel computing, vol 25. IOS Press, pp 471–480
Petrini F, Frachtenberg E, Hoisie A, Coll S (2003) Performance evaluation of the quadrics interconnection network. Clust Comput 6(2):125–142
Plank JS (1993) Efficient checkpointing on MIMD architectures. Ph.D. thesis, Princeton University
Plank JS, Thomason MG (2001) Processor allocation and checkpoint interval selection in cluster computing systems. J Parallel Distrib Comput 61:1590
Rao S, Alvisi L, Vin HM (1998) The cost of recovery in message logging protocols. In: 17th symposium on reliable distributed systems (SRDS). IEEE CS Press, pp 10–18
Rao S, Alvisi L, Vin HM (1999) Egida: an extensible toolkit for low-overhead fault-tolerance. In : 29th symposium on fault-tolerant computing (FTCS’99). IEEE CS Press, pp 48–55
Rieker M, Ansel J, Cooperman G (2006) Transparent user-level checkpointing for the native posix thread library for linux. In: The 2006 international conference on parallel and distributed processing techniques and applications, Las Vegas, NV
Ropars T, Morin C (2011) Active optimistic and distributed message logging for message-passing applications. Concurr. Comput. : Pract. Exper. 23(17):2167–2178
Ropars T, Guermouche A, Uçar B, Meneses E, Kalé LV, Cappello F (2011) On the use of cluster-based partial message logging to improve fault tolerance for MPI HPC applications. In: Proceedings of the 17th international conference on parallel processing—volume Part I, Euro-Par’11. Springer, Berlin, pp 567–578
Roy-Chowdhury A, Banerjee P (1996) Algorithm-based fault location and recovery for matrix computations on multiprocessor systems. IEEE Trans Comput 45(11):1239–1247
Ruscio JF, Heffner MA, Varadarajan S (2006) DejaVu: transparent user-level checkpointing, migration and recovery for distributed systems. In: SC’06: proceedings of the 2006 ACM/IEEE conference on supercomputing. ACM Press, New York, USA, pp 158
Sankaran S, Squyres JM, Barrett B, Lumsdaine A, Duell J, Hargrove P, Roman E (2003) The LAM/MPI checkpoint/restart framework: system-initiated checkpointing. In: Proceedings, LACSI symposium, Sante Fe, New Mexico, USA
Schroeder B, Gibson G (2007) Understanding failures in petascale computers. In: Journal of physics: conference series, vol 78. IOP Publishing, pp 12–22
Schulz M, de Supinski B (2006) A flexible and dynamic infrastructure for MPI tool interoperability. In: International conference on parallel processing, 2006. ICPP 2006, pp 193–202
Schulz M, Bronevetsky G, Supinski BR (2008) On the performance of transparent MPI piggyback messages. In: Proceedings of the 15th European PVM/MPI users’ group meeting on recent advances in parallel virtual machine and message passing interface. Springer, Berlin, pp 194–201
Shantharam M, Srinivasmurthy S, Raghavan P (2012) Fault tolerant preconditioned conjugate gradient for sparse linear system solution. In: Proceedings of the 26th ACM international conference on supercomputing, ICS’12. ACM, New York, pp 69–78
Silva LM, Silva JG (1998) System-level versus user-defined checkpointing. In: Proceedings of the the 17th IEEE symposium on reliable distributed systems, SRDS’98. IEEE Computer Society, Washington, DC, p 68
Singhal M, Kshemkalyani A (1992) An efficient implementation of vector clocks. Inf Process Lett 43(1):47–52
Sistla AP, Welch JL (1989) Efficient distributed recovery using message logging. In: PODC ’89: proceedings of the eighth annual ACM symposium on principles of distributed computing. ACM Press, New York, pp 223–238
Smith SW, Johnson DB, Tygar JD (1995) Completely asynchronous optimistic recovery with minimal rollbacks. In: FTCS-25: 25th international symposium on fault tolerant computing digest of papers. Pasadena, California, pp 361–371
Snell QO, Mikler AR, Gustafson JL (1996) NetPIPE: a network protocol independent performance evaluator. In: IASTED international conference on intelligent information management and systems
Stellner G (1996) CoCheck: checkpointing and process migration for MPI. In: Proceedings of the 10th international parallel processing symposium (IPPS’96), Honolulu, Hawaii. IEEE CS Press
Stricker T, Gross T (1995) Optimizing memory system performance for communication in parallel computers. In: ISCA’95: proceedings of the 22nd annual international symposium on computer architecture. ACM, New York, pp 308–319
Strom R, Yemini S (1985) Optimistic recovery in distributed systems. ACM Trans Comput Syst 3(3):204–226
Teranishi K, Heroux MA (2014) Toward local failure local recovery resilience model using MPI-ULFM. In: Proceedings of the 21st European MPI users’ group meeting, EuroMPI/ASIA’14. ACM, New York, pp 51:51–51:56
The MPI Forum (1993) MPI: a message passing interface. In: Supercomputing’93: proceedings of the 1993 ACM/IEEE conference on supercomputing. ACM Press, New York, pp 878–883
The MPI Forum (2012) MPI: a message-passing interface standard, version 3.0. The Universtity of Tennessee, Knoxville
The SciDB Development Team (2010) Overview of SciDB: large scale array storage, processing and analysis. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data, SIGMOD’10. ACM, New York, pp 963–968
Vaidyanathan K, Chai L, Huang W, Panda DK (2007) Efficient asynchronous memory copy operations on multi-core systems and I/OAT. In: CLUSTER’07: proceedings of the 2007 IEEE international conference on cluster computing. IEEE Computer Society, Washington, DC, pp 159–168
Wang C, Mueller F, Engelmann C, Scott SL (2008) Proactive process-level live migration in hpc environments. In: SC’08: proceedings of the 2008 ACM/IEEE conference on supercomputing. IEEE Press, Piscataway, NJ, pp 1–12
Young JW (1974) A first order approximation to the optimum checkpoint interval. Commun ACM 17:530–531
Acknowledgments
The author would like to thank the coworkers who participated in the creation of the software or participated in the elaboration of some source material in this chapter, in particular, Christine Morin and Thomas Ropars, for leading the effort with the optimistic message logging protocol; Wesley Bland and Joshua Hursey which helped with the implementation of ulfm; and George Bosilca and Thomas Herault. This work has received support from the NSF under award Adapt #1339763, and from the CREST project of the Japan Science and Technology Agency (JST).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Bouteiller, A. (2015). Fault-Tolerant MPI. In: Herault, T., Robert, Y. (eds) Fault-Tolerance Techniques for High-Performance Computing. Computer Communications and Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-20943-2_3
Download citation
DOI: https://doi.org/10.1007/978-3-319-20943-2_3
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20942-5
Online ISBN: 978-3-319-20943-2
eBook Packages: Computer ScienceComputer Science (R0)