Using Simulation to Evaluate the Performance of Resilience Strategies at Scale

  • Scott Levy
  • Bryan Topp
  • Kurt B. Ferreira
  • Dorian Arnold
  • Torsten Hoefler
  • Patrick Widener
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8551)


Fault tolerance has been identified as a major challenge for future extreme-scale systems. Current predictions suggest that, as systems grow in size, failures will occur more frequently. Because increases in failure frequency reduce the performance and scalability of these systems, significant effort has been devoted to developing and refining resilience mechanisms to mitigate the impact of failures. However, effective evaluation of these mechanisms has been challenging: current systems are smaller than the next-generation systems we expect to see, and their architectural features (e.g., interconnect, persistent storage) differ significantly. To overcome these challenges, we propose the use of simulation. Simulation has been shown to be an effective tool for investigating performance characteristics of applications on future systems. In this work, we: identify the set of system characteristics that are necessary for accurate performance prediction of resilience mechanisms for HPC systems and applications; demonstrate how these system characteristics can be incorporated into an existing large-scale simulator; and evaluate the predictive performance of our modified simulator. We also describe how we optimized the simulator for large temporal and spatial scales, allowing it to run 4x faster and use over 100x less memory.


Fault Tolerance; Wall Clock Time; Trace Length; Simulated Node; Resilience Strategy
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Scott Levy (1)
  • Bryan Topp (1)
  • Kurt B. Ferreira (2)
  • Dorian Arnold (1)
  • Torsten Hoefler (3)
  • Patrick Widener (2)
  1. Department of Computer Science, University of New Mexico, Albuquerque, USA
  2. Scalable System Software, Sandia National Laboratories, Albuquerque, USA
  3. Computer Science Department, ETH Zürich, Zürich, Switzerland
