Abstract
In modular redundant systems, tasks are replicated to achieve fault tolerance. Checkpointing schemes that exploit this replication can perform better than schemes that ignore how the fault detection mechanism is implemented [24]. This chapter presents two such schemes: the Dynamic Roll-Forward Checkpointing Scheme and the Static Roll-Forward Checkpointing Scheme.
In the dynamic scheme for duplex systems, each task is assumed to be executing simultaneously on two processing modules. At each checkpoint, the states of the two modules executing the task are compared to detect faults. If a fault is detected, both modules continue execution into the next checkpoint interval instead of performing the usual roll-back. The failed checkpoint interval is ‘retried’ on a spare module, which helps in identifying the failed processing module and making its state consistent.
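The duplex behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the module names `A`, `B` and `S`, the work function inside `run_interval`, the fault-injection set, and the time accounting (one unit per interval, zero comparison and copy overhead) are all hypothetical. The overlap between the spare's retry and the next interval is modelled only in the time count.

```python
# Distinct offsets so that two faulty results never agree by accident
# (an assumption of this sketch).
CORRUPTION = {"A": 1, "B": 2, "S": 3}


def run_interval(module, state, interval, faults):
    """Hypothetical deterministic work for one checkpoint interval."""
    result = state * 31 + interval + 7
    if (module, interval) in faults:
        faults.discard((module, interval))  # transient: hits one attempt only
        result += CORRUPTION[module]
    return result


def identify_faulty(a, b, spare):
    """Return the module that disagrees with the spare, or None if unresolved."""
    if spare == a:
        return "B"
    if spare == b:
        return "A"
    return None


def simulate(n_intervals, faults, roll_forward=True):
    """Return (final_state, time_units, rollbacks) for a duplex task."""
    faults = set(faults)  # work on a copy; the caller's set is untouched
    state, time, rollbacks, i = 0, 0, 0, 0
    while i < n_intervals:
        a = run_interval("A", state, i, faults)
        b = run_interval("B", state, i, faults)
        time += 1
        if a == b:  # checkpoints match: commit and move on
            state, i = a, i + 1
            continue
        if roll_forward:
            # The spare retries interval i from the last good checkpoint
            # while A and B run ahead, so the retry costs no extra time.
            spare = run_interval("S", state, i, faults)
            faulty = identify_faulty(a, b, spare)
            if faulty is not None:
                state = a if faulty == "B" else b  # repair from good module
                i += 1
                continue
            time += 1  # the look-ahead interval was wasted
        rollbacks += 1  # re-execute interval i from the last good checkpoint
    return state, time, rollbacks
```

With a single fault, `simulate(5, {("A", 2)})` finishes in 5 time units with no roll-back, while `simulate(5, {("A", 2)}, roll_forward=False)` needs 6. When both modules fail in the same interval, the spare matches neither checkpoint and, in this model, roll-forward costs one unit more than plain roll-back.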
It is demonstrated that this scheme increases the likelihood of a task completing within a specified deadline in spite of transient faults. The dynamic scheme also results in a lower average execution time with a lower variance as compared to the usual duplex roll-back schemes.
The dynamic scheme avoids a roll-back in most cases if the transient faults are independent. For correlated faults, however, it may cause multiple roll-backs. The static scheme is capable of tolerating both independent and correlated faults.
In the static scheme for triplex systems, each task is assumed to be executing on three processing modules. At each checkpoint, the states of the three processing modules are compared to detect faults, so the scheme can tolerate all single faults by masking. In the event of multiple failures, none of the checkpoints match. In that case, various recovery actions are possible depending on the choice of concurrent depth. To initiate a roll-forward action, one of the three processing modules is rolled back to execute the interval that experienced the failure, while the other two modules continue execution into the next checkpoint interval. The module that was rolled back helps in identifying the faulty modules, and the recovery action continues. This roll-forward scheme does not require any spare modules, thereby avoiding the need for task migration. Simulation results indicate that this scheme outperforms the dynamic scheme in meeting deadlines in the presence of correlated faults. It also results in a lower execution time with lower variance as compared to the static scheme.
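The checkpoint comparison in the triplex scheme can be sketched as a majority vote followed by a retry-based check. This is a minimal illustration, not the authors' algorithm: the module names `P1` to `P3` and the functions `vote` and `resolve_with_retry` are hypothetical, checkpoint states are assumed to be hashable values, and the handling of an unresolved retry (which depends on the concurrent depth) is left out.

```python
from collections import Counter


def vote(checkpoints):
    """Compare three checkpoint states.

    checkpoints maps a module name to its state. Returns
    (agreed_state, faulty_modules) when at least two states match,
    or (None, None) when no two checkpoints agree.
    """
    counts = Counter(checkpoints.values())
    state, n = counts.most_common(1)[0]
    if n >= 2:
        # A single fault is masked: the majority state is taken as correct.
        faulty = sorted(m for m, s in checkpoints.items() if s != state)
        return state, faulty
    return None, None


def resolve_with_retry(checkpoints, retry_state):
    """Use the rolled-back module's re-execution to look for a good checkpoint.

    Returns the modules whose original checkpoint agrees with the retry;
    an empty list means the failure is still unresolved.
    """
    return sorted(m for m, s in checkpoints.items() if s == retry_state)
```

For example, `vote({"P1": 5, "P2": 5, "P3": 9})` masks the fault in `P3`, whereas `vote({"P1": 5, "P2": 6, "P3": 9})` finds no agreement. If the rolled-back module then recomputes `6`, `resolve_with_retry` points to `P2` as the module whose original checkpoint was consistent with the retry.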
Research supported in part by ONR
References
P. Agrawal, “Fault Tolerance in Multiprocessor Systems without Dedicated Redundancy”, IEEE Trans. Comp., No. 3, vol. 37, Mar. 1988, pp. 358–362.
P. A. Bernstein, “Sequoia: A Fault-Tolerant Tightly Coupled Multiprocessor for Transaction Processing”, Computer, Feb. 1988, pp. 37–45.
K. M. Chandy and C. V. Ramamoorthy, “Rollback and Recovery Strategies for Computer Programs”, IEEE Trans. Comp., No. 6, vol. 21, June 1972, pp. 546–556.
P. F. Chimento and K. S. Trivedi, “The Performance of Block Structured Programs on Processors Subject to Failure and Repair”, in High Performance Computer Systems, E. Gelenbe (Ed.), Elsevier Science Publishers, 1988.
D. Das Sharma and D. K. Pradhan, “A Static Roll-Forward Checkpointing Scheme Using Three Processors”, Tech. Rep. TR-93-050, Dept. of Computer Science, Texas A&M Univ., 1993.
C. I. Dimmer, “The Tandem Non-Stop System”, in Resilient Computing Systems, T. Anderson, ed., vol. 1, John Wiley and Sons, 1985.
E. Gelenbe and D. Derochette, “Performance of Rollback Recovery Systems under Intermittent Failures”, Comm. ACM, No. 6, vol. 21, June 1978, pp. 493–499.
S. R. Kane et al., “Impulsive Phase of Solar Flares”, in P. A. Sturrock ed., Solar Flares: A Monograph from Skylab Solar Workshop II, Univ of Colorado Press, Boulder, CO, 1980.
C. M. Krishna and A. D. Singh, “Modeling Correlated Transient Failures in Fault-Tolerant Systems”, Proc. IEEE Intl. Symp. on Fault-Tolerant Computing, 1989, pp. 374–381.
V. G. Kulkarni, V. F. Nicola and K. S. Trivedi, “Effects of Checkpointing and Queuing on Program Performance”, Comm. Stat.-Stochastic Models, No. 6, vol. 4, 1990, pp. 615–648.
P. L'Ecuyer and J. Malenfant, “Computing Optimal Checkpointing Strategies for Rollback and Recovery Systems”, IEEE Trans. Comp., No. 4, vol. 37, Apr. 1988, pp. 491–496.
J. Long, W. K. Fuchs and J. A. Abraham, “Implementing Forward Recovery using Checkpoints in Distributed Systems”, IFIP 2nd Intl. Working Conf. Dependable Computing for Critical Applications, Feb. 1991.
J. Long, W. K. Fuchs and J. A. Abraham, “Forward Recovery using Check-pointing in Parallel Systems”, Proc. Intl. Conf. Parallel Proc., Jan 1990, pp. 1272–1275.
Y. K. Malaiya, “Linearly Correlated Intermittent Failures”, IEEE Trans. Reliability, Vol. R-31, No. 2, June 1982, pp. 211–215.
D. K. Pradhan and N. H. Vaidya, “Roll-forward Checkpointing Scheme: A Novel Fault-Tolerant Architecture”, Submitted to IEEE Transactions on Computers, Dec. 1992, Revised Nov. 1993.
D. K. Pradhan and N. H. Vaidya, “Roll-Forward Checkpointing Scheme: Concurrent Retry with Nondedicated Spares”, Proc. IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, July 1992, pp. 166–174.
D. K. Pradhan, “Redundancy Schemes for Recovery”, Tech. Rep. TR-89-CSE-16, Elect. & Comp. Engg., Univ. of Massachusetts, Amherst, 1989.
D. K. Pradhan ed., “Fault-Tolerant Computing: Theory and Techniques”, Vol. I & II, Prentice Hall, NJ, 1986.
O. Serlin, “Fault-Tolerant Systems in Commercial Applications”, Computer, Aug. 1984, pp. 19–30.
K. G. Shin, T.-H. Lin and Y.-H. Lee, “Optimal Checkpointing of Real-Time Tasks”, IEEE Trans. Comp., No. 11, vol. 36, Nov. 1987, pp. 1328–1341.
D. P. Siewiorek et al., “A Case Study of C.MMP, CM* and C.VMP. Experiences with Fault-Tolerance in Multiprocessor Systems”, Proc. IEEE, Oct 1978, pp. 1178–1199.
J. J. Stiffler, “Architectural design for near-100% coverage”, Proc. Intl. Symp. on Fault Tolerant Computing, 1976, pp. 134–137.
D. Tang, R. Iyer and S. Subramani, “Failure Analysis and Modeling a VAX Cluster System”, Proc. IEEE Intl. Symp. on Fault-Tolerant Computing, 1990, pp. 244–251.
N. H. Vaidya, Ph.D. Dissertation, Elect. & Computer Engg., University of Massachusetts, Amherst, MA 01003, 1993.
N. H. Vaidya and D. K. Pradhan, “Concurrent Retry with Nondedicated Spares: A Fault-Tolerant Checkpointing Scheme without Rollback”, Tech. Rep. TR-91-CSE-23, Elect. & Comp. Engg., Univ. of Massachusetts, Amherst, Oct. 1991.
N. H. Vaidya and D. K. Pradhan, “A Fault Tolerance Scheme for a System of Duplicated Communicating Processes”, Proc. IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, July 1992, pp. 166–174.
Copyright information
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pradhan, D.K., Das Sharma, D., Vaidya, N.H. (1994). Roll-forward checkpointing schemes. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020026
DOI: https://doi.org/10.1007/BFb0020026
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57767-6
Online ISBN: 978-3-540-48330-4
eBook Packages: Springer Book Archive