Dependability of Distributed Programs: Algorithms and Performance

Chabridon, S.; Gelenbe, E.

doi:10.1007/978-3-642-79917-4_15


Part of the book series: Esprit Basic Research Series (ESPRIT BASIC)

Summary

In this paper, we use task graph models to represent the behaviour of parallel programs. These models are characterized by the execution times of the tasks and by the precedence relation between them. The precedence relation can be represented by a probabilistic ordering, or given as a specific known ordering for a particular application. When failures occur in the processing system, we consider a recovery mechanism based on failure detection and subsequent task restart. Both operations take additional processing time, which is explicitly represented in the task graph characterization. Failures themselves are represented in the model by variable failure rates. We report on the design, analysis and simulation of novel algorithms which ensure that application software runs correctly on an MIMD system in which processing units (PUs) may fail. These algorithms rely on certain existing tasks selected within the program, which we call agents. As soon as an agent has completed its own assigned work, it carries out failure detection and, if necessary, restarts other tasks. The effect of these algorithms is evaluated using analytical approximations and simulation, as a function of failure rates and other system parameters. The comparison of the simulation results with the approximate analytical results shows a very good level of accuracy for this degree of complexity, which indicates that simple analytical formulae can be used to obtain robust first-order estimates of program execution times with and without failures. We also provide specific examples of task graphs for two well-known computations (matrix multiplication and the Fast Fourier Transform) and their parallel implementations. Finally, we provide simulation results which evaluate the proposed failure detection and recovery algorithms for the specific case of the FFT algorithm.
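
The kind of model the summary describes can be illustrated with a minimal simulation sketch. It is not the authors' algorithm: the function name `completion_time`, the four-task fork-join graph, and all parameter values are hypothetical. The sketch also makes simplifying assumptions that the summary does not state: one PU per task, a constant exponential failure rate (the chapter uses variable rates), fixed detection and restart delays in place of agent-based detection, and loss of all work in progress when a failure occurs.

```python
import random

# Hypothetical fork-join task graph: a -> (b, c) -> d.
# TASKS maps each task to its failure-free execution time;
# PREDS maps each task to the tasks that must finish before it starts.
TASKS = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 1.0}
PREDS = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}


def completion_time(tasks, preds, fail_rate=0.0, detect=0.0, restart=0.0, rng=None):
    """Simulate one run of the task graph and return the program finish time.

    Assumptions (illustrative only): tasks are listed in topological order,
    every task has its own PU, failures arrive at a constant exponential
    rate `fail_rate`, and each failure costs the elapsed work plus fixed
    `detect` and `restart` delays before the task starts again from scratch.
    """
    rng = rng or random.Random(0)
    finish = {}
    for name, work in tasks.items():
        # A task starts when all of its predecessors have finished.
        t = max((finish[p] for p in preds.get(name, [])), default=0.0)
        while True:
            time_to_failure = rng.expovariate(fail_rate) if fail_rate > 0 else float("inf")
            if time_to_failure >= work:
                t += work  # the attempt completes without a failure
                break
            # The attempt fails: pay lost work, detection and restart overhead.
            t += time_to_failure + detect + restart
        finish[name] = t
    return max(finish.values())


if __name__ == "__main__":
    print("no failures:  ", completion_time(TASKS, PREDS))
    print("with failures:", completion_time(TASKS, PREDS, fail_rate=0.5,
                                            detect=0.1, restart=0.2))
```

Without failures the result is the length of the longest path through the graph (here a, c, d). Averaging many runs with different seeds gives a simulated mean execution time of the sort that could be set against an analytical approximation.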

Author information

Authors and Affiliations

EHEI, Université René Descartes, France
S. Chabridon
Department of Electrical Engineering, Duke University, USA
E. Gelenbe

Editor information

Editors and Affiliations

INRIA Institut National de Recherche en Informatique et en Automatique 2004, route des Lucioles, F-06561, Valbonne Cedex, France
François Baccelli & Alain Jean-Marie
Department of Computing Science, University of Newcastle, Newcastle upon Tyne, NE1 7RU, UK
Isi Mitrani

About this chapter

Cite this chapter

Chabridon, S., Gelenbe, E. (1995). Dependability of Distributed Programs: Algorithms and Performance. In: Baccelli, F., Jean-Marie, A., Mitrani, I. (eds) Quantitative Methods in Parallel Systems. Esprit Basic Research Series. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-79917-4_15

DOI: https://doi.org/10.1007/978-3-642-79917-4_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-79919-8
Online ISBN: 978-3-642-79917-4
eBook Packages: Springer Book Archive
