Journal of Zhejiang University SCIENCE C, Volume 13, Issue 12, pp 891–900

Optimizing checkpoint for scientific simulations



Restarting a long-running simulation from the beginning after a failure is extremely time-consuming. Checkpointing is a viable solution: it periodically saves the simulation state so that, after a failure, the run can resume from the most recent checkpoint rather than from the start. We study three models for determining the optimal interval between consecutive checkpoints, that is, the interval that minimizes the total execution time, and we demonstrate that optimal checkpointing can facilitate self-optimization of simulations. This study advances both the understanding and the practice of optimizing long-running scientific simulations.
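As a rough illustration of what an "optimal checkpoint interval" means, the sketch below uses a standard exponential-failure model. It is an assumption chosen for illustration and is not necessarily any of the paper's three models. Failures arrive as a Poisson process with mean time between failures M, each checkpoint costs C, each restart costs R (failures can also strike during a restart), and the job writes a checkpoint after every τ seconds of useful work. Under these assumptions the expected time to finish one segment of length τ + C is M·e^(R/M)·(e^((τ+C)/M) − 1), and the expected total for W seconds of work is that quantity multiplied by W/τ. The function names (`expected_runtime`, `young_interval`, `optimal_interval`) and the search bracket are hypothetical choices for this example; `young_interval` is Young's well-known first-order approximation τ ≈ √(2CM).

```python
import math


def expected_runtime(tau, work, ckpt_cost, restart_cost, mtbf):
    """Expected wall-clock time to finish `work` seconds of computation.

    Illustrative model (an assumption, not the paper's method): Poisson
    failures with mean time between failures `mtbf`, a checkpoint of cost
    `ckpt_cost` after every `tau` seconds of work, and a restart of cost
    `restart_cost` after each failure, followed by redoing the lost segment.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    segments = work / tau  # continuous approximation of the segment count
    # Expected time to complete one segment of length tau + C:
    # M * e^(R/M) * (e^((tau + C)/M) - 1)
    per_segment = (mtbf * math.exp(restart_cost / mtbf)
                   * math.expm1((tau + ckpt_cost) / mtbf))
    return segments * per_segment


def young_interval(ckpt_cost, mtbf):
    """Young's first-order approximation: tau ~ sqrt(2 * C * M)."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)


def optimal_interval(work, ckpt_cost, restart_cost, mtbf, tol=1e-6):
    """Minimize expected_runtime over tau by golden-section search.

    The cost is unimodal in tau under this model, so golden-section search
    converges to the unique minimizer. The bracket [tol, 10 * mtbf] is a
    heuristic assumption.
    """
    def cost(t):
        return expected_runtime(t, work, ckpt_cost, restart_cost, mtbf)

    lo, hi = tol, 10.0 * mtbf
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # about 0.618
    a = hi - inv_phi * (hi - lo)
    b = lo + inv_phi * (hi - lo)
    fa, fb = cost(a), cost(b)
    while hi - lo > tol:
        if fa < fb:
            # Minimum lies in [lo, b]; the old a becomes the new b.
            hi, b, fb = b, a, fa
            a = hi - inv_phi * (hi - lo)
            fa = cost(a)
        else:
            # Minimum lies in [a, hi]; the old b becomes the new a.
            lo, a, fa = a, b, fb
            b = lo + inv_phi * (hi - lo)
            fb = cost(b)
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical inputs: 100 h of work, 10 min checkpoints and restarts,
    # one failure per day on average.
    W, C, R, M = 360000.0, 600.0, 600.0, 86400.0
    tau = optimal_interval(W, C, R, M)
    print("numerical optimum tau:", tau)
    print("Young's estimate:     ", young_interval(C, M))
    print("expected runtime:     ", expected_runtime(tau, W, C, R, M))
```

Setting the derivative of this cost to zero shows that the minimizer satisfies e^((τ+C)/M)·(τ/M − 1) + 1 = 0, so under this particular model the optimal τ depends only on C and M, not on W or R. With C = 600 s and M = 86 400 s, Young's estimate is about 10 182 s, and the numerical search returns a somewhat shorter interval, near 9.8 × 10³ s. Golden-section search is used here only because it needs nothing beyond the standard library.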

Key words

Checkpoint; Long-running; Optimizing; Simulation


Copyright information

© Journal of Zhejiang University Science Editorial Office and Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Economics & Management College, Southwest Jiaotong University, Chengdu, China
  2. Industrial and Commercial College, Guizhou University of Finance and Economics, Guiyang, China
  3. College of Business, University of North Alabama, Florence, USA
