Abstract
Checkpointing is an effective measure to ensure the completion of long-running jobs in Desktop Grids, which are subject to frequent resource failures. We focus on checkpointing strategies in the context of Desktop Grids, including volunteer computing systems, where individual hosts follow diverse failure distributions. We propose an algorithm that computes a sequence of checkpoint interval lengths for each individual host according to a sample of its availability interval lengths. This algorithm directly approximates the probability distribution of availability interval lengths with the sample, without deriving a closed form of the probability distribution. In simulations with synthetic trace data and trace data from a real volunteer computing project, this sample-based strategy wastes less time than a periodic strategy in most cases.
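To make the comparison criterion concrete, the following is a minimal sketch of how wasted time could be estimated for a given sequence of checkpoint intervals by replaying it against a sample of availability interval lengths, with each sample value weighted equally (that is, using the empirical distribution). It is not the algorithm proposed in the paper; the failure model (all work since the last completed checkpoint is lost, and an interrupted checkpoint saves nothing), the fixed checkpoint cost, and the names `wasted_time`, `mean_wasted_time`, and `periodic` are assumptions made for illustration.

```python
def wasted_time(schedule, availability, cost):
    """Time in one availability interval that does not become saved work.

    schedule: checkpoint interval lengths (work time before each checkpoint).
    availability: length of the availability interval (time until failure).
    cost: assumed fixed time needed to write one checkpoint.
    """
    elapsed = 0.0
    saved = 0.0
    for work in schedule:
        # Assumption: a checkpoint only counts if it finishes before the failure.
        if elapsed + work + cost > availability:
            break
        elapsed += work + cost
        saved += work
    # Checkpoint overhead and unsaved work both count as waste. If the
    # schedule runs out before the failure, the remaining time is also waste.
    return availability - saved


def mean_wasted_time(schedule, sample, cost):
    """Average wasted time over the sample (empirical distribution)."""
    return sum(wasted_time(schedule, a, cost) for a in sample) / len(sample)


def periodic(interval, horizon):
    """A periodic schedule long enough to cover `horizon` time units."""
    return [interval] * (int(horizon // interval) + 1)


if __name__ == "__main__":
    sample = [10.0, 25.0]          # hypothetical availability interval lengths
    cost = 1.0                     # hypothetical checkpoint cost
    candidates = [2.0, 4.0, 8.0]   # hypothetical periodic intervals to compare
    for t in candidates:
        schedule = periodic(t, max(sample))
        print(t, mean_wasted_time(schedule, sample, cost))
```

Any non-periodic sequence of interval lengths can be passed as `schedule` in the same way, so the same replay can score a per-host sequence against a periodic baseline on that host's sample.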
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, D., Gong, B. (2012). On the Checkpointing Strategy in Desktop Grids. In: Xiang, Y., Pathan, M., Tao, X., Wang, H. (eds) Internet and Distributed Computing Systems. IDCS 2012. Lecture Notes in Computer Science, vol 7646. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34883-9_17
DOI: https://doi.org/10.1007/978-3-642-34883-9_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34882-2
Online ISBN: 978-3-642-34883-9
eBook Packages: Computer Science (R0)