A new approach for high performance computing systems with various checkpointing schemes
Roll-forward recovery schemes were proposed to enhance the performance of fault-tolerant systems that employ checkpointing. In these schemes, multiple processors are used for simultaneous roll-forward and validation processing. This paper proposes the sample comparison approach for use along with checkpointing, which further improves performance by reducing the overhead that checkpointing imposes. We also develop general analytical models for estimating availability, which are applicable to any checkpointing scheme. Performance comparisons reveal that the availabilities of the checkpointing schemes with sample comparison are higher than those of the schemes without it, while the required checkpoint interval is larger.
Keywords: availability, checkpointing, fault-tolerant, rollback, roll-forward