Fault-Tolerant MPI

  • Aurélien Bouteiller
Chapter
Part of the Computer Communications and Networks book series (CCN)

Abstract

As supercomputers enter an era of massive parallelism in which the frequency of faults is increasing, the MPI standard remains distressingly vague about the consequences of failures for MPI communications. In this chapter, we present the spectrum of techniques that can be applied to enable MPI application recovery, ranging from fully automatic to completely user-driven. First, we present the effective deployment of the most advanced checkpoint/restart techniques within the MPI implementation, so that failed processors are automatically restarted in a state consistent with surviving processes, at a performance cost. Then, we investigate how MPI can support application-driven recovery techniques, and introduce a set of extensions to MPI that allow restoring communication capabilities while maintaining the extreme level of performance to which MPI users have become accustomed.

Keywords

Message Passing Interface; Collective Operation; Message Logging; Message Passing Interface Process; Open Message Passing Interface
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgments

The author would like to thank the coworkers who contributed to the software or to the source material for this chapter, in particular Christine Morin and Thomas Ropars, for leading the effort on the optimistic message logging protocol; Wesley Bland and Joshua Hursey, who helped with the implementation of ULFM; and George Bosilca and Thomas Herault. This work has received support from the NSF under award Adapt #1339763, and from the CREST project of the Japan Science and Technology Agency (JST).

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Department of EECS, University of Tennessee Knoxville, Knoxville, USA
