Dependability of Distributed Programs: Algorithms and Performance

Quantitative Methods in Parallel Systems

Part of the book series: Esprit Basic Research Series ((ESPRIT BASIC))

Summary

In this paper, we use task graph models to represent the behaviour of parallel programs. These models are characterized by the execution times of the tasks and by the precedence relation between them. The precedence relation can be represented by a probabilistic ordering, or it can be given as a specific known ordering for a particular application. When failures occur in the processing system, we consider a recovery mechanism based on failure detection followed by task restart. Both operations take additional processing time, which is explicitly represented in the task graph characterization. Failures themselves are represented in the model by variable failure rates. We report on the design, analysis and simulation of novel algorithms which ensure that application software runs correctly on an MIMD system in which processing units (PUs) may fail. These algorithms rely on certain existing tasks selected within the program, which we call agents. As soon as an agent has completed its own assigned work, its role is to carry out failure detection and, if necessary, to restart other tasks. The effect of these algorithms is evaluated, using analytical approximations and simulation, as a function of failure rates and other system parameters. Comparing the simulation results with the approximate analytical results shows a very good level of accuracy for this degree of complexity, which indicates that simple analytical formulae can be used to obtain robust first-order estimates of program execution times with and without failures. We also provide specific examples of task graphs for two well-known computations (matrix multiplication and the Fast Fourier Transform) and their parallel implementation. Finally, we provide simulation results which evaluate the proposed failure detection and recovery algorithms for the specific case of the FFT algorithm.
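To make the modelling idea in the summary concrete, the following is a minimal sketch of a task graph with execution times, a precedence relation, and restart after failure. It is not the chapter's agent-based algorithm. It assumes a constant failure rate (exponentially distributed time to failure), one processing unit per task, and a single fixed `overhead` that lumps together failure detection and restart. All names (`run_task`, `completion_time`, `expected_task_time`, `DURATIONS`, `PREDS`) are hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical fork-join task graph: a -> {b, c} -> d.
DURATIONS = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 1.0}
PREDS = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}


def run_task(duration, failure_rate, overhead, rng):
    """Simulated time to finish one task that restarts from scratch on failure.

    Assumption: failures arrive at a constant rate while the task runs, and
    each failure costs `overhead` (detection plus restart) before the retry.
    """
    elapsed = 0.0
    while True:
        if failure_rate <= 0.0:
            return elapsed + duration
        time_to_failure = rng.expovariate(failure_rate)
        if time_to_failure >= duration:
            return elapsed + duration  # this attempt completes
        elapsed += time_to_failure + overhead  # lost work plus recovery cost


def completion_time(durations, preds, failure_rate, overhead, rng):
    """Finish time of the whole graph, assuming one processing unit per task."""
    finish = {}

    def finish_of(task):
        if task not in finish:
            # A task starts once all of its predecessors have finished.
            start = max((finish_of(p) for p in preds.get(task, ())), default=0.0)
            finish[task] = start + run_task(
                durations[task], failure_rate, overhead, rng
            )
        return finish[task]

    return max(finish_of(task) for task in durations)


def expected_task_time(duration, failure_rate, overhead):
    """Closed form for a single task under the same assumptions as run_task."""
    if failure_rate <= 0.0:
        return duration
    return (1.0 / failure_rate + overhead) * math.expm1(failure_rate * duration)


if __name__ == "__main__":
    rng = random.Random(0)
    runs = 10_000
    mean = sum(
        completion_time(DURATIONS, PREDS, 0.2, 0.5, rng) for _ in range(runs)
    ) / runs
    print("failure-free completion time:",
          completion_time(DURATIONS, PREDS, 0.0, 0.5, rng))
    print("simulated mean with failures:", mean)
```

With a failure rate of zero the sketch returns the critical-path length of the example graph, 5.0 (tasks a, c, d). The single-task closed form gives a simple first-order check of the simulator in the spirit of the comparison described in the summary, though the chapter's own approximations may differ.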


Copyright information

© 1995 ECSC-EC-EAEC, Brussels-Luxembourg

About this chapter

Cite this chapter

Chabridon, S., Gelenbe, E. (1995). Dependability of Distributed Programs: Algorithms and Performance. In: Baccelli, F., Jean-Marie, A., Mitrani, I. (eds) Quantitative Methods in Parallel Systems. Esprit Basic Research Series. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-79917-4_15

  • DOI: https://doi.org/10.1007/978-3-642-79917-4_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-79919-8

  • Online ISBN: 978-3-642-79917-4

  • eBook Packages: Springer Book Archive
