Roll-forward checkpointing schemes

Pradhan, Dhiraj K.; Das Sharma, Debendra; Vaidya, Nitin H.

doi:10.1007/BFb0020026

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 774)

Included in the following conference series:

Workshop on Fault Tolerance

Abstract

In modular redundant systems, tasks are replicated to achieve fault tolerance. Checkpointing schemes that exploit this replication can perform better than schemes that ignore how the fault detection mechanism is implemented [24]. This chapter presents two such schemes: the Dynamic Roll-Forward Checkpointing Scheme and the Static Roll-Forward Checkpointing Scheme.

In the dynamic scheme for duplex systems, each task is assumed to execute simultaneously on two processing modules. At each checkpoint, the states of the two modules executing the task are compared to detect faults. If a fault is detected, both modules continue execution into the next checkpoint interval instead of performing the usual roll-back. The failed checkpoint interval is ‘retried’ on a spare module, which helps identify the failed processing module and make its state consistent.
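The duplex behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' algorithm: the names `step`, `run_interval` and `duplex_roll_forward`, the XOR corruption model, and the rule that the spare also re-executes the roll-forward interval before it is committed are all hypothetical choices made for illustration.

```python
from typing import List, Set, Tuple

Fault = Tuple[str, int]  # (module name, interval index) hit by a transient fault


def step(state: int, interval: int) -> int:
    """Deterministic stand-in for the work done in one checkpoint interval."""
    return (state * 31 + interval + 1) % 1_000_003


def run_interval(module: str, state: int, interval: int, pending: Set[Fault]) -> int:
    """Execute one interval; a pending transient fault corrupts the result once."""
    result = step(state, interval)
    if (module, interval) in pending:
        pending.discard((module, interval))  # transient: it does not recur on retry
        result ^= 0x5A5A + ord(module[0])    # hypothetical, module-specific corruption
    return result


def duplex_roll_forward(n_intervals: int, faults: Set[Fault]) -> Tuple[int, List[str]]:
    """Simulate modules A and B with a spare S; return (final state, event log)."""
    pending = set(faults)
    ckpt, i, log = 0, 0, []  # ckpt is the last checkpoint state known to be good
    while i < n_intervals:
        a1 = run_interval("A", ckpt, i, pending)
        b1 = run_interval("B", ckpt, i, pending)
        if a1 == b1:  # checkpoint comparison succeeds
            ckpt, i = a1, i + 1
            continue
        # Mismatch: A and B roll forward into interval i+1 while S retries interval i.
        has_next = i + 1 < n_intervals
        a2 = run_interval("A", a1, i + 1, pending) if has_next else a1
        b2 = run_interval("B", b1, i + 1, pending) if has_next else b1
        s1 = run_interval("S", ckpt, i, pending)
        if s1 == a1:
            good1, good2, bad = a1, a2, "B"
        elif s1 == b1:
            good1, good2, bad = b1, b2, "A"
        else:
            log.append(f"interval {i}: spare matches neither module, roll back")
            continue  # retry interval i from the last good checkpoint
        log.append(f"interval {i}: module {bad} faulty, rolled forward")
        ckpt, i = good1, i + 1  # interval i is now confirmed by two executions
        if has_next:
            # Assumption: the spare re-executes interval i+1 to confirm the
            # single unchecked execution before that checkpoint is committed.
            s2 = run_interval("S", s1, i, pending)
            if s2 == good2:
                ckpt, i = good2, i + 1  # the faulty module adopts this state
            else:
                log.append(f"interval {i}: unverified, roll back")
    return ckpt, log
```

Calling `duplex_roll_forward(4, {("B", 1)})` in this model finishes with the fault-free state and logs a single roll-forward event, whereas a simultaneous fault on both A and B in the same interval forces a roll-back of that interval.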

It is demonstrated that this scheme increases the likelihood of a task completing within a specified deadline in spite of transient faults. The dynamic scheme also yields a lower average execution time, with lower variance, than the usual duplex roll-back schemes.

The dynamic scheme avoids a roll-back in most cases if the transient faults are independent. For correlated faults, however, it may cause multiple roll-backs. The static scheme is capable of tolerating both independent and correlated faults. In the static scheme for triplex systems, each task is assumed to execute on three processing modules. At each checkpoint, the states of the three processing modules are compared to detect faults, so the scheme tolerates all single faults by masking. In the event of multiple failures, none of the checkpoints match. In that case, various recovery actions are possible depending on the choice of concurrent depth. To initiate a roll-forward action, one of the three processing modules is rolled back to execute the interval that experienced the failure, while the other two modules continue execution into the next checkpoint interval. The module that was rolled back helps identify the faulty modules, and the recovery action continues. This roll-forward scheme does not require any spare modules, thereby avoiding the need for task migration. Simulation results indicate that this scheme outperforms the dynamic scheme in meeting deadlines in the presence of correlated faults. It also results in a lower execution time with lower variance as compared to the static scheme.
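The triplex behaviour can be sketched in the same style. Again this is a hedged illustration rather than the authors' algorithm: `triplex_static`, the XOR corruption model, the fixed choice of module C as the one that is rolled back, and the extra verification of the roll-forward interval are assumptions, and the "concurrent depth" parameter mentioned above is not modelled.

```python
from collections import Counter
from typing import List, Set, Tuple

Fault = Tuple[str, int]  # (module name, interval index) hit by a transient fault


def step(state: int, interval: int) -> int:
    """Deterministic stand-in for the work done in one checkpoint interval."""
    return (state * 31 + interval + 1) % 1_000_003


def run_interval(module: str, state: int, interval: int, pending: Set[Fault]) -> int:
    """Execute one interval; a pending transient fault corrupts the result once."""
    result = step(state, interval)
    if (module, interval) in pending:
        pending.discard((module, interval))
        result ^= 0x5A5A + ord(module[0])  # hypothetical, module-specific corruption
    return result


def triplex_static(n_intervals: int, faults: Set[Fault]) -> Tuple[int, List[str]]:
    """Simulate modules A, B and C without a spare; return (final state, log)."""
    pending = set(faults)
    ckpt, i, log = 0, 0, []
    while i < n_intervals:
        states = {m: run_interval(m, ckpt, i, pending) for m in "ABC"}
        value, votes = Counter(states.values()).most_common(1)[0]
        if votes >= 2:  # at most one fault: the majority masks it
            if votes == 2:
                log.append(f"interval {i}: single fault masked")
            ckpt, i = value, i + 1
            continue
        # No two checkpoints match. Assumption: C is the module rolled back to
        # re-execute interval i while A and B continue into interval i+1.
        has_next = i + 1 < n_intervals
        a2 = run_interval("A", states["A"], i + 1, pending) if has_next else states["A"]
        b2 = run_interval("B", states["B"], i + 1, pending) if has_next else states["B"]
        c1 = run_interval("C", ckpt, i, pending)
        if c1 == states["A"]:
            good1, good2 = states["A"], a2
        elif c1 == states["B"]:
            good1, good2 = states["B"], b2
        else:
            log.append(f"interval {i}: no module confirmed, roll back")
            continue  # all three retry interval i from the last good checkpoint
        log.append(f"interval {i}: multiple faults, rolled forward")
        ckpt, i = good1, i + 1
        if has_next:
            # Assumption: C also re-executes interval i+1 to confirm the
            # unchecked forward execution before it is committed.
            c2 = run_interval("C", c1, i, pending)
            if c2 == good2:
                ckpt, i = good2, i + 1
            else:
                log.append(f"interval {i}: unverified, roll back")
    return ckpt, log
```

In this model a single fault such as `{("B", 2)}` is masked by the vote, while a double fault such as `{("B", 1), ("C", 1)}` is recovered by roll-forward with no spare module involved.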

Research supported in part by ONR

References

P. Agrawal, “Fault Tolerance in Multiprocessor Systems without Dedicated Redundancy”, IEEE Trans. Comput., vol. 37, no. 3, Mar. 1988, pp. 358–362.
P. A. Bernstein, “Sequoia: A Fault-Tolerant Tightly Coupled Multiprocessor for Transaction Processing”, Computer, Feb. 1988, pp. 37–45.
K. M. Chandy and C. V. Ramamoorthy, “Rollback and Recovery Strategies for Computer Programs”, IEEE Trans. Comput., vol. 21, no. 6, June 1972, pp. 546–556.
P. F. Chimento and K. S. Trivedi, “The Performance of Block Structured Programs on Processors Subject to Failure and Repair”, in High Performance Computer Systems, E. Gelenbe (Ed.), Elsevier Science Publishers, 1988.
D. Das Sharma and D. K. Pradhan, “A Static Roll-Forward Checkpointing Scheme Using Three Processors”, Tech. Rep. TR-93-050, Dept. of Computer Science, Texas A&M Univ., 1993.
C. I. Dimmer, “The Tandem Non-Stop System”, in Resilient Computing Systems, T. Anderson (Ed.), vol. 1, John Wiley and Sons, 1985.
E. Gelenbe and D. Derochette, “Performance of Rollback Recovery Systems under Intermittent Failures”, Comm. ACM, vol. 21, no. 6, June 1978, pp. 493–499.
S. R. Kane et al., “Impulsive Phase of Solar Flares”, in Solar Flares: A Monograph from Skylab Solar Workshop II, P. A. Sturrock (Ed.), Univ. of Colorado Press, Boulder, CO, 1980.
C. M. Krishna and A. D. Singh, “Modeling Correlated Transient Failures in Fault-Tolerant Systems”, Proc. IEEE Intl. Symp. on Fault-Tolerant Computing, 1989, pp. 374–381.
V. G. Kulkarni, V. F. Nicola and K. S. Trivedi, “Effects of Checkpointing and Queuing on Program Performance”, Comm. Stat.-Stochastic Models, vol. 4, no. 6, 1990, pp. 615–648.
P. L'Ecuyer and J. Malenfant, “Computing Optimal Checkpointing Strategies for Rollback and Recovery Systems”, IEEE Trans. Comput., vol. 37, no. 4, Apr. 1988, pp. 491–496.
J. Long, W. K. Fuchs and J. A. Abraham, “Implementing Forward Recovery using Checkpoints in Distributed Systems”, IFIP 2nd Intl. Working Conf. Dependable Computing for Critical Applications, Feb. 1991.
J. Long, W. K. Fuchs and J. A. Abraham, “Forward Recovery using Checkpointing in Parallel Systems”, Proc. Intl. Conf. Parallel Proc., Jan. 1990, pp. 1272–1275.
Y. K. Malaiya, “Linearly Correlated Intermittent Failures”, IEEE Trans. Reliability, vol. R-31, no. 2, June 1982, pp. 211–215.
D. K. Pradhan and N. H. Vaidya, “Roll-Forward Checkpointing Scheme: A Novel Fault-Tolerant Architecture”, submitted to IEEE Trans. Comput., Dec. 1992, revised Nov. 1993.
D. K. Pradhan and N. H. Vaidya, “Roll-Forward Checkpointing Scheme: Concurrent Retry with Nondedicated Spares”, Proc. IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, July 1992, pp. 166–174.
D. K. Pradhan, “Redundancy Schemes for Recovery”, Tech. Rep. TR-89-CSE-16, Elect. & Comp. Engg., Univ. of Massachusetts, Amherst, 1989.
D. K. Pradhan (Ed.), “Fault-Tolerant Computing: Theory and Techniques”, Vol. I & II, Prentice Hall, NJ, 1986.
O. Serlin, “Fault-Tolerant Systems in Commercial Applications”, Computer, Aug. 1984, pp. 19–30.
K. G. Shin, T.-H. Lin and Y.-H. Lee, “Optimal Checkpointing of Real-Time Tasks”, IEEE Trans. Comput., vol. 36, no. 11, Nov. 1987, pp. 1328–1341.
D. P. Siewiorek et al., “A Case Study of C.MMP, CM* and C.VMP. Experiences with Fault-Tolerance in Multiprocessor Systems”, Proc. IEEE, Oct. 1978, pp. 1178–1199.
J. J. Stiffler, “Architectural Design for Near-100% Fault Coverage”, Proc. Intl. Symp. on Fault-Tolerant Computing, 1976, pp. 134–137.
D. Tang, R. Iyer and S. Subramani, “Failure Analysis and Modeling of a VAX Cluster System”, Proc. IEEE Intl. Symp. on Fault-Tolerant Computing, 1990, pp. 244–251.
N. H. Vaidya, Ph.D. Dissertation, Elect. & Computer Engg., University of Massachusetts, Amherst, MA 01003, 1993.
N. H. Vaidya and D. K. Pradhan, “Concurrent Retry with Nondedicated Spares: A Fault-Tolerant Checkpointing Scheme without Rollback”, Tech. Rep. TR-91-CSE-23, Elect. & Comp. Engg., Univ. of Massachusetts, Amherst, Oct. 1991.
N. H. Vaidya and D. K. Pradhan, “A Fault Tolerance Scheme for a System of Duplicated Communicating Processes”, Proc. IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, July 1992, pp. 166–174.

Author information

Authors and Affiliations

Department of Computer Science, Texas A&M University, College Station, TX 77843-3112
Dhiraj K. Pradhan, Debendra Das Sharma & Nitin H. Vaidya

Editor information

Michel Banâtre, Peter A. Lee

Cite this paper

Pradhan, D.K., Das Sharma, D., Vaidya, N.H. (1994). Roll-forward checkpointing schemes. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020026

DOI: https://doi.org/10.1007/BFb0020026
Published: 10 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57767-6
Online ISBN: 978-3-540-48330-4
eBook Packages: Springer Book Archive
