Using Replication for Resilience on Exascale Systems

  • Henri Casanova
  • Frédéric Vivien
  • Dounia Zaidouni
Part of the Computer Communications and Networks book series (CCN)


High-performance computing applications must be resilient to faults. The traditional fault tolerance solution is checkpoint–recovery, in which application state is saved to secondary storage throughout execution and recovered from it after a failure. It has been shown that, even with an optimal checkpointing strategy, the checkpointing overhead precludes high parallel efficiency at large scale. Additional fault tolerance mechanisms must thus be used. One such mechanism is replication, which can be used in addition to checkpoint–recovery. With replication, multiple processors perform the same computation, so that a processor failure does not necessarily mean application failure. While at first glance replication may seem wasteful, it may be significantly more efficient than checkpoint–recovery alone at large scale. In this work we investigate two approaches to replication. In the first approach, entire application instances are replicated. In the second approach, each process in a single application instance is (transparently) replicated. We provide a theoretical study of these two approaches, comparing them to the pure checkpoint–recovery approach in terms of expected application execution time.
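The claim that checkpointing overhead limits efficiency at scale can be illustrated with a minimal sketch. It is not the chapter's model: it assumes the classical first-order approximation of the optimal checkpoint interval attributed to Young, independent processor failures (so the platform MTBF is the per-processor MTBF divided by the processor count), and it ignores downtime and recovery costs. All function names and numeric values are hypothetical, chosen only for illustration.

```python
import math


def platform_mtbf(proc_mtbf_s: float, n_procs: int) -> float:
    """Platform MTBF, assuming independent failures across processors."""
    return proc_mtbf_s / n_procs


def young_interval(checkpoint_s: float, mtbf_s: float) -> float:
    """First-order optimal checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_s * mtbf_s)


def first_order_waste(checkpoint_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time lost to checkpoints and re-execution.

    Uses C / T + T / (2 * MTBF) at the first-order optimal interval T,
    capped at 1.0 because the approximation breaks down when failures
    are as frequent as checkpoints.
    """
    t = young_interval(checkpoint_s, mtbf_s)
    return min(1.0, checkpoint_s / t + t / (2.0 * mtbf_s))


if __name__ == "__main__":
    proc_mtbf_s = 10 * 365 * 24 * 3600.0  # assumed: 10-year MTBF per processor
    checkpoint_s = 600.0                  # assumed: 10-minute checkpoint
    for n_procs in (10**3, 10**4, 10**5):
        mtbf = platform_mtbf(proc_mtbf_s, n_procs)
        waste = first_order_waste(checkpoint_s, mtbf)
        print(f"N={n_procs:>6}  platform MTBF={mtbf:10.0f} s  waste={waste:.3f}")
```

Under these assumptions the wasted fraction grows with the square root of the processor count, which is the scaling problem that motivates adding replication to checkpoint–recovery.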



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Henri Casanova (1)
  • Frédéric Vivien (2)
  • Dounia Zaidouni (2)

  1. University of Hawai‘i, Manoa, USA
  2. INRIA & Ecole Normale Supérieure de Lyon, Lyon, France
