Abstract
This paper discusses the problem of adapting backward error recovery to parallel real-time systems. Because errors propagate among cooperating processes, an error occurring in one process may affect important outputs of other processes. A local output therefore has to be delayed until its validity is confirmed globally. Moreover, since backward error recovery relies on redundancy of computing time rather than of processing equipment, the actual execution time of a cooperating process may vary widely if it works in an unreliable environment. These problems are the primary obstacles to be removed. Previous studies focus on how to eliminate the domino effect dynamically, but backward error recovery cannot be applied directly in parallel real-time systems even when no domino effect exists. How can output delays be reduced efficiently once no domino effect remains? How can this delay be estimated? How can the actual execution time of every process be calculated, and how can these processes be scheduled under unstable conditions? These questions have unfortunately been neglected in the literature. This paper aims to provide satisfactory solutions to them, making it possible to adopt backward error recovery efficiently in parallel real-time systems.
Cite this article
Zhou, D. Adapting backward error recovery to parallel real time systems. J. of Comput. Sci. & Technol. 7, 257–267 (1992). https://doi.org/10.1007/BF02946576