It’s Not the Heat, It’s the Humidity: Scheduling Resilience Activity at Scale

  • Patrick M. Widener
  • Kurt B. Ferreira
  • Scott Levy
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10659)


Maintaining the performance of high-performance computing (HPC) applications with the expected increase in failures is a major challenge for next-generation extreme-scale systems. With increasing scale, resilience activities (e.g. checkpointing) are expected to become more diverse, less tightly synchronized, and more computationally intensive. Few existing studies, however, have examined how decisions about scheduling resilience activities impact application performance. In this work, we examine the relationship between the duration and frequency of resilience activities and application performance. Our study reveals several key findings: (i) the aggregate amount of time consumed by resilience activities is not an effective metric for predicting application performance; (ii) the duration of the interruptions due to resilience activities has the greatest influence on application performance; shorter, but more frequent, interruptions are correlated with better application performance; and (iii) the differential impact of resilience activities across applications is related to the applications’ inter-collective frequencies; the performance of applications that perform infrequent collective operations scales better in the presence of resilience activities than the performance of applications that perform more frequent collective operations. This initial study demonstrates the importance of considering how resilience activities are scheduled. We provide critical analysis and direct guidance on how the resilience challenges of future systems can be met while minimizing the impact on application performance.


Resilience Scheduling Performance Collectives 


  1. 1.
    Alvisi, L., Elnozahy, E., Rao, S., Husain, S., de Mel, A.: An analysis of communication induced checkpointing. In: Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, 1999. Digest of Papers, pp. 242–249 (1999)Google Scholar
  2. 2.
    Bronevetsky, G.: Communication-sensitive static dataflow for parallel message passing applications. In: Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization, pp. 1–12. IEEE Computer Society (2009)Google Scholar
  3. 3.
    Cappello, F., Geist, A., Gropp, W., Kale, S., Kramer, B., Snir, M.: Toward exascale resilience: 2014 update. Supercomput. Front. Innov. 1(1) (2014).
  4. 4.
    Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K.E., Santos, E., Subramonian, R., von Eicken, T.: LogP: towards a realistic model of parallel computation. In: Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 1993, pp. 1–12. ACM, New York (1993).
  5. 5.
    Elnozahy, E.N., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002)CrossRefGoogle Scholar
  6. 6.
    Ferreira, K., Riesen, R., Stearley, J., Laros III, J.H., Oldfield, R., Pedretti, K., Bridges, P., Arnold, D., Brightwell, R.: Evaluating the viability of process replication reliability for exascale systems. In: Proceedings of the ACM/IEEE International Conference on High Performance Computing, Networking, Storage, and Analysis, (SC 2011), November 2011Google Scholar
  7. 7.
    Ferreira, K.B., Bridges, P., Brightwell, R.: Characterizing application sensitivity to OS interference using kernel-level noise injection. In: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, p. 19. IEEE Press (2008)Google Scholar
  8. 8.
    Ferreira, K.B., Levy, S., Widener, P., Arnold, D., Hoefler, T.: Understanding the effects of communication and coordination on checkpointing at scale. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2014), pp. 883–894. IEEE Press (2014)Google Scholar
  9. 9.
    Ferreira, K.B., Riesen, R., Brighwell, R., Bridges, P., Arnold, D.: libhashckpt: hash-based incremental checkpointing using GPU’s. In: Cotronis, Y., Danalis, A., Nikolopoulos, D.S., Dongarra, J. (eds.) EuroMPI 2011. LNCS, vol. 6960, pp. 272–281. Springer, Heidelberg (2011). CrossRefGoogle Scholar
  10. 10.
    Gioiosa, R., Sancho, J.C., Jiang, S., Petrini, F., Davis, K.: Transparent, incremental checkpointing at kernel level: a foundation for fault tolerance for parallel computers. In: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, p. 9. IEEE Computer Society (2005)Google Scholar
  11. 11.
    Guermouche, A., Ropars, T., Brunet, E., Snir, M., Cappello, F.: Uncoordinated checkpointing without domino effect for send-deterministic MPI applications. In: International Parallel Distributed Processing Symposium (IPDPS), pp. 989–1000, May 2011Google Scholar
  12. 12.
    Heroux, M.A., Doerfler, D.W., Crozier, P.S., Willenbring, J.M., Edwards, H.C., Williams, A., Rajan, M., Keiter, E.R., Thornquist, H.K., Numrich, R.W.: Improving performance via mini-applications. Technical report, Sandia National Laboratories (2009)Google Scholar
  13. 13.
    Hoefler, T., Schneider, T., Lumsdaine, A.: LogGOPSim - simulating large-scale applications in the LogGOPS model. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pp. 597–604. ACM, June 2010Google Scholar
  14. 14.
    Ibtesham, D., Arnold, D., Bridges, P.G., Ferreira, K.B., Brightwell, R.: On the viability of compression for reducing the overheads of checkpoint/restart-based fault tolerance. In: 2012 41st International Conference on Parallel Processing (ICPP), pp. 148–157. IEEE (2012)Google Scholar
  15. 15.
    Islam, T.Z., Mohror, K., Bagchi, S., Moody, A., De Supinski, B.R., Eigenmann, R.: McrEngine: a scalable checkpointing system using data-aware aggregation and compression. In: 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–11. IEEE (2012)Google Scholar
  16. 16.
    Johnson, D.B., Zwaenepoel, W.: Recovery in distributed systems using asynchronous message logging and checkpointing. In: Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing, pp. 171–181 (1988)Google Scholar
  17. 17.
    Karlin, I., Bhatele, A., Chamberlain, B.L., Cohen, J., Devito, Z., Gokhale, M., Haque, R., Hornung, R., Keasler, J., Laney, D., Luke, E., Lloyd, S., McGraw, J., Neely, R., Richards, D., Schulz, M., Still, C.H., Wang, F., Wong, D.: LULESH programming model and performance ports overview. Technical report, LLNL-TR-608824, Lawrence Livermore National Laboratory, December 2012Google Scholar
  18. 18.
    Lamport, L., Shostak, R., Pease, M.: The Byzantine generals problem. ACM Trans. Program. Lang. Syst. (TOPLAS) 4(3), 382–401 (1982)CrossRefzbMATHGoogle Scholar
  19. 19.
    Levy, S., Ferreira, K.B., Bridges, P.G.: Improving application resilience to memory errors with lightweight compression. In: SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 323–334. IEEE (2016)Google Scholar
  20. 20.
    Levy, S., Topp, B., Ferreira, K.B., Arnold, D., Hoefler, T., Widener, P.: Using simulation to evaluate the performance of resilience strategies at scale. In: 2013 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC). IEEE (2013)Google Scholar
  21. 21.
    Maloney, A., Goscinski, A.: A survey and review of the current state of rollback-recovery for cluster systems. Concurr. Comput. Pract. Exp. 21(12), 1632–1666 (2009)CrossRefGoogle Scholar
  22. 22.
    Monnet, S., Morin, C., Badrinath, R.: A hierarchical checkpointing protocol for parallel applications in cluster federations. In: Proceedings of 18th International Parallel and Distributed Processing Symposium, p. 211. IEEE (2004)Google Scholar
  23. 23.
    Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B.R.: Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC10), SC 2010, pp. 1–11. IEEE Computer Society, Washington, DC (2010).
  24. 24.
    Plimpton, S.: Fast parallel algorithms for short-range molecular-dynamics. J. Comput. Phys. 117(1), 1–19 (1995)CrossRefzbMATHGoogle Scholar
  25. 25.
    Widener, P.M., Ferreira, K.B., Levy, S.: Horseshoes and hand grenades: the case for approximate coordination in local checkpointing protocols. In: Desprez, F. (ed.) Euro-Par 2016. LNCS, vol. 10104, pp. 623–634. Springer, Cham (2017). CrossRefGoogle Scholar

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Patrick M. Widener
    • 1
  • Kurt B. Ferreira
    • 1
  • Scott Levy
    • 1
  1. 1.Sandia National Laboratories, Center for Computing ResearchAlbuquerqueUSA

Personalised recommendations