Roll-forward checkpointing schemes

  • Hardware Architectures for Fault Tolerance
  • Conference paper
Hardware and Software Architectures for Fault Tolerance (Fault Tolerance 1993)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 774)

Abstract

In modular redundant systems, tasks are replicated to achieve fault tolerance. Checkpointing schemes that exploit this replication can achieve better performance than schemes that ignore how the fault detection mechanism is implemented [24]. This chapter presents two such schemes: the Dynamic Roll-Forward Checkpointing Scheme and the Static Roll-Forward Checkpointing Scheme.

In the dynamic scheme for duplex systems, each task is assumed to be executing simultaneously on two processing modules. At each checkpoint, the states of the two modules executing the task are compared to detect faults. If a fault is detected, instead of the usual roll-back, both modules continue execution into the next checkpoint interval. The failed checkpoint interval is ‘retried’ on a spare module, which helps in identifying the failed processing module and making its state consistent.

It is demonstrated that this scheme increases the likelihood of a task completing within a specified deadline in spite of transient faults. The dynamic scheme also results in a lower average execution time with a lower variance as compared to the usual duplex roll-back schemes.
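The checkpoint comparison and spare retry described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `Outcome`, `resolve`, and `make_consistent` are hypothetical, states are modelled as plain comparable values, and two behaviours are assumptions rather than facts from the abstract (falling back to roll-back when the spare agrees with neither module, and repairing the failed module by copying the fault-free module's forward state).

```python
from enum import Enum


class Outcome(Enum):
    MATCH = "checkpoints agree"
    A_FAULTY = "module A failed"
    B_FAULTY = "module B failed"
    ROLL_BACK = "spare matches neither module"


def resolve(state_a, state_b, state_spare=None):
    """Classify one checkpoint of a duplex task.

    state_a, state_b: checkpoint states of the two modules.
    state_spare: state produced by retrying the same interval on the spare;
                 only consulted when the two modules disagree.
    """
    if state_a == state_b:
        return Outcome.MATCH
    if state_spare == state_a:
        return Outcome.B_FAULTY
    if state_spare == state_b:
        return Outcome.A_FAULTY
    # Assumption: if the retry agrees with neither module, roll back.
    return Outcome.ROLL_BACK


def make_consistent(outcome, forward_a, forward_b):
    """Return the (A, B) states after recovery.

    forward_a, forward_b: states reached by continuing into the next
    interval while the spare was retrying. Assumption: the fault-free
    module's forward state overwrites the failed module's state.
    """
    if outcome is Outcome.B_FAULTY:
        return forward_a, forward_a
    if outcome is Outcome.A_FAULTY:
        return forward_b, forward_b
    if outcome is Outcome.MATCH:
        return forward_a, forward_b
    raise RuntimeError("roll-back required: no fault-free module identified")
```

For example, `resolve(5, 6, 5)` classifies module B as failed, and `make_consistent(Outcome.B_FAULTY, 10, 99)` then gives both modules module A's forward state, so no interval has to be re-executed by the duplex pair.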

The dynamic scheme avoids a roll-back in most cases if the transient faults are independent. However, for correlated faults, it may cause multiple roll-backs. The static scheme is capable of tolerating both independent and correlated faults.

In the static scheme for triplex systems, each task is assumed to be executing on three processing modules. At each checkpoint, the states of the three processing modules are compared to detect faults. Thus, the scheme can tolerate all single faults by masking. In the event of multiple failures, none of the checkpoints match. In that case, various recovery actions are possible depending on the choice of concurrent depth. To initiate a roll-forward action, one of the three processing modules is rolled back to execute the interval that experienced failure, while the other two modules continue execution into the next checkpoint interval. The module that was rolled back helps in identifying the faulty modules, and the recovery action continues. This roll-forward scheme does not require any spare modules, thereby avoiding the need for task migration. Simulation results indicate that this scheme outperforms the dynamic scheme in meeting deadlines in the presence of correlated faults. It also results in a lower execution time with lower variance as compared to the static scheme.
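The triplex comparison can be illustrated with a small Python sketch. It is a minimal illustration under stated assumptions, not the scheme as published: `vote` and `identify_after_retry` are hypothetical names, checkpoint states are modelled as hashable values, and the abstract does not say how the module to roll back is chosen or how concurrent depth enters the decision, so neither is modelled here.

```python
from collections import Counter


def vote(states):
    """Compare three checkpoint states.

    Returns (majority_state, faulty_indices) when at least two modules
    agree, which masks a single fault. Returns (None, None) when no two
    checkpoints match, the multiple-failure case.
    """
    value, count = Counter(states).most_common(1)[0]
    if count >= 2:
        faulty = [i for i, s in enumerate(states) if s != value]
        return value, faulty
    return None, None


def identify_after_retry(saved_states, retried_state):
    """Indices of modules whose saved checkpoint matches the retry.

    saved_states: the three mismatching checkpoint states.
    retried_state: state produced by the rolled-back module after it
    re-executes the failed interval. An empty list means the retry
    confirmed none of the saved checkpoints.
    """
    return [i for i, s in enumerate(saved_states) if s == retried_state]
```

With states `(7, 7, 9)`, `vote` returns the majority value 7 and flags module 2 as faulty. With `(1, 2, 3)` it reports no agreement; if the rolled-back module then produces 2, `identify_after_retry((1, 2, 3), 2)` singles out module 1 as the one whose checkpoint was correct.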

Research supported in part by ONR

References

  1. P. Agrawal, “Fault Tolerance in Multiprocessor Systems without Dedicated Redundancy”, IEEE Trans. Comp., No. 3, vol. 37, Mar. 1988, pp. 358–362.

  2. P. A. Bernstein, “Sequoia: A Fault-Tolerant Tightly Coupled Multiprocessor for Transaction Processing”, Computer, Feb. 1988, pp. 37–45.

  3. K. M. Chandy and C. V. Ramamoorthy, “Rollback and Recovery Strategies for Computer Programs”, IEEE Trans. Comp., No. 6, vol. 21, June 1972, pp. 546–556.

  4. P. F. Chimento and K. S. Trivedi, “The Performance of Block Structured Programs on Processors Subject to Failure and Repair”, in High Performance Computer Systems, E. Gelenbe (Ed.), Elsevier Science Publishers, 1988.

  5. D. Das Sharma and D. K. Pradhan, “A Static Roll-Forward Checkpointing Scheme Using Three Processors”, Tech. Rep. TR-93-050, Dept. of Computer Science, Texas A&M Univ., 1993.

  6. C. I. Dimmer, “The Tandem Non-Stop System”, in Resilient Computing Systems, T. Anderson, ed., vol. 1, John Wiley and Sons, 1985.

  7. E. Gelenbe and D. Derochette, “Performance of Rollback Recovery Systems under Intermittent Failures”, Comm. ACM, No. 6, vol. 21, June 1978, pp. 493–499.

  8. S. R. Kane et al., “Impulsive Phase of Solar Flares”, in P. A. Sturrock ed., Solar Flares: A Monograph from Skylab Solar Workshop II, Univ of Colorado Press, Boulder, CO, 1980.

  9. C. M. Krishna and A. D. Singh, “Modeling Correlated Transient Failures in Fault-Tolerant Systems”, Proc. IEEE Intl. Symp. on Fault-Tolerant Computing, 1989, pp. 374–381.

  10. V. G. Kulkarni, V. F. Nicola and K. S. Trivedi, “Effects of Checkpointing and Queuing on Program Performance”, Comm. Stat.-Stochastic Models, No. 6, vol. 4, 1990, pp. 615–648.

  11. P. L'Ecuyer and J. Malenfant, “Computing Optimal Checkpointing Strategies for Rollback and Recovery Systems”, IEEE Trans. Comp., No. 4, vol. 37, Apr. 1988, pp. 491–496.

  12. J. Long, W. K. Fuchs and J. A. Abraham, “Implementing Forward Recovery using Checkpoints in Distributed Systems”, IFIP 2nd Intl. Working Conf. Dependable Computing for Critical Applications, Feb. 1991.

  13. J. Long, W. K. Fuchs and J. A. Abraham, “Forward Recovery using Checkpointing in Parallel Systems”, Proc. Intl. Conf. Parallel Proc., Jan. 1990, pp. 1272–1275.

  14. Y. K. Malaiya, “Linearly Correlated Intermittent Failures”, IEEE Trans. Reliability, Vol. R-31, No. 2, June 1982, pp. 211–215.

  15. D. K. Pradhan and N. H. Vaidya, “Roll-forward Checkpointing Scheme: A Novel Fault-Tolerant Architecture”, Submitted to IEEE Transactions on Computers, Dec. 1992, Revised Nov. 1993.

  16. D. K. Pradhan and N. H. Vaidya, “Roll-Forward Checkpointing Scheme: Concurrent Retry with Nondedicated Spares”, Proc. IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, July 1992, pp. 166–174.

  17. D. K. Pradhan, “Redundancy Schemes for Recovery”, Tech. Rep. TR-89-CSE-16, Elect. & Comp. Engg., Univ. of Massachusetts, Amherst, 1989.

  18. D. K. Pradhan ed., “Fault-Tolerant Computing: Theory and Techniques”, Vol. I & II, Prentice Hall, NJ, 1986.

  19. O. Serlin, “Fault-Tolerant Systems in Commercial Applications”, Computer, Aug. 1984, pp. 19–30.

  20. K. G. Shin, T.-H. Lin and Y.-H. Lee, “Optimal Checkpointing of Real-Time Tasks”, IEEE Trans. Comp., No. 11, vol. 36, Nov. 1987, pp. 1328–1341.

  21. D. P. Siewiorek et al., “A Case Study of C.MMP, CM* and C.VMP. Experiences with Fault-Tolerance in Multiprocessor Systems”, Proc. IEEE, Oct 1978, pp. 1178–1199.

  22. J. J. Stiffler, “Architectural Design for Near-100% Coverage”, Proc. Intl. Symp. on Fault Tolerant Computing, 1976, pp. 134–137.

  23. D. Tang, R. Iyer and S. Subramani, “Failure Analysis and Modeling of a VAXcluster System”, Proc. IEEE Intl. Symp. on Fault-Tolerant Computing, 1990, pp. 244–251.

  24. N. H. Vaidya, Ph.D. Dissertation, Elect. & Computer Engg., University of Massachusetts, Amherst, MA 01003, 1993.

  25. N. H. Vaidya and D. K. Pradhan, “Concurrent Retry with Nondedicated Spares: A Fault-Tolerant Checkpointing Scheme without Rollback”, Tech. Rep. TR-91-CSE-23, Elect. & Comp. Engg., Univ. of Massachusetts, Amherst, Oct. 1991.

  26. N. H. Vaidya and D. K. Pradhan, “A Fault Tolerance Scheme for a System of Duplicated Communicating Processes”, Proc. IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, July 1992, pp. 166–174.

Editor information

Michel Banâtre, Peter A. Lee

Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pradhan, D.K., Das Sharma, D., Vaidya, N.H. (1994). Roll-forward checkpointing schemes. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020026

  • DOI: https://doi.org/10.1007/BFb0020026

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-57767-6

  • Online ISBN: 978-3-540-48330-4

  • eBook Packages: Springer Book Archive
