Optimal Checkpointing Period: Time vs. Energy

  • Guillaume Aupy
  • Anne Benoit
  • Thomas Hérault
  • Yves Robert
  • Jack Dongarra
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8551)

Abstract

This short paper deals with parallel scientific applications that use non-blocking, periodic coordinated checkpointing to enforce resilience. We provide a model and detailed formulas for total execution time and consumed energy. We characterize the optimal period for both objectives, and we assess the range of time/energy trade-offs by instantiating the model with a set of realistic scenarios for Exascale systems. We place particular emphasis on I/O transfers, because the relative cost of communication is expected to increase dramatically on future Exascale platforms, in terms of both latency and consumed energy.

Keywords

Execution Time; Total Execution Time; Optimal Period; Mean Time Between Failure; Power Overhead
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Guillaume Aupy (1), email author
  • Anne Benoit (1)
  • Thomas Hérault (2)
  • Yves Robert (1, 2)
  • Jack Dongarra (2)
  1. Laboratoire LIP, École Normale Supérieure de Lyon, Lyon, France
  2. University of Tennessee, Knoxville, USA
