The Checkpoint-Timing for Backward Fault-Tolerant Schemes

  • Min Zhang
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 908)

Abstract

To improve the performance of backward fault-tolerant schemes in long-running parallel applications, a general checkpoint-timing method is proposed that determines unequal checkpointing intervals for an arbitrary failure rate, with the aim of reducing the total execution time. First, a new model is introduced to evaluate the mean expected execution time. Second, the optimality condition for a constant failure rate is derived from this model, from which the optimal equal checkpointing interval is easily obtained. Subsequently, a general method is derived to determine the checkpoint timing for other failure rates. The final results show that the proposal is a practical way to trade off the re-processing overhead against the checkpointing overhead in a backward fault-tolerant scheme.
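
To make the trade-off in the constant-failure-rate case concrete, the following is a minimal sketch that uses Young's classical first-order approximation rather than the model of this paper, which the abstract does not reproduce. It assumes a fixed checkpoint cost `checkpoint_cost` (seconds) and a constant failure rate `failure_rate` (failures per second); both names, and the helper functions themselves, are illustrative assumptions.

```python
import math


def young_interval(checkpoint_cost: float, failure_rate: float) -> float:
    """Equal checkpoint interval sqrt(2*C/lambda) from Young's first-order
    approximation. This is NOT the paper's model, only a well-known baseline."""
    if checkpoint_cost <= 0 or failure_rate <= 0:
        raise ValueError("checkpoint_cost and failure_rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


def overhead_fraction(interval: float, checkpoint_cost: float,
                      failure_rate: float) -> float:
    """Approximate fraction of time lost per unit of useful work:
    C/T for writing checkpoints plus lambda*T/2 for re-processing
    (on average half an interval is redone after each failure)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost / interval + failure_rate * interval / 2.0


if __name__ == "__main__":
    C = 60.0            # assumed checkpoint cost: one minute
    lam = 1.0 / 86400   # assumed failure rate: one failure per day on average
    T = young_interval(C, lam)
    for label, t in (("T/2", T / 2), ("T", T), ("2T", 2 * T)):
        print(f"{label:>3}: interval={t:9.1f} s, "
              f"overhead={overhead_fraction(t, C, lam):.4f}")
```

Checkpointing more often than `T` raises the checkpointing term, while checkpointing less often raises the re-processing term. A time-varying failure rate breaks this symmetry, which is the situation the unequal intervals of the proposed method are meant to address.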

Keywords

Parallel computation, Fault tolerance, Checkpointing, Failure rate

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Lianyungang JARI Electronics Co., Ltd. of CSIC, Lianyungang, China
