Formal Methods in System Design

, Volume 26, Issue 2, pp 183–196 | Cite as

Automated Analysis of Fault-Tolerance in Distributed Systems



A method for automated analysis of fault-tolerance of distributed systems is presented. It is based on a stream (or data-flow) model of distributed computation. Temporal (ordering) relationships between messages received by a component on different channels are not captured by this model. This makes the analysis more efficient and forces the use of conservative approximations in analysis of systems whose behavior depends on such inter-channel orderings. To further support efficient analysis, our framework includes abstractions for the contents, number, and ordering of messages sent on each channel. Analysis of a reliable broadcast protocol illustrates the method.


Operating System Conservative Approximation Automate Analysis Efficient Analysis Broadcast Protocol 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    P.A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison Wesley, 1987.Google Scholar
  2. 2.
    B. Boigelot and P. Godefroid, “Symbolic verification of communication protocols with infinite state spaces using QDDs,” Formal Methods in System Design Vol. 4, No. 3, pp. 237–255, 1999.CrossRefGoogle Scholar
  3. 3.
    A. Bouajjani and P. Habermehl, “Symbolic reachability analysis of fifo-channel systems with nonregular sets of configurations,” Theoretical Computer Science, Vol. 221, Nos. 1–2, pp. 211–250, 1999.CrossRefGoogle Scholar
  4. 4.
    J.D. Brock, A formal model of non-determinate dataflow computation,” Ph.D. thesis, Massachusetts Institute of Technology. Available as MIT Laboratory for Computer Science Technical Report TR-309, 1983.Google Scholar
  5. 5.
    J.D. Brock and W.B. Ackerman, “Scenarios: A model of non-determinate computation,” in: J. Diaz and I. Ramos (Eds.), Formalisation of Programming Concepts, Vol. 107, of Lecture Notes in Computer Science, Springer-Verlag, pp. 252–259, 1981.Google Scholar
  6. 6.
    M. Broy, “Semantics of finite and infinite networks of concurrent communicating agents,” Distributed Computing, Vol. 2, No. 1, pp. 13–31, 1987.CrossRefGoogle Scholar
  7. 7.
    M. Broy, “Nondeterministic data flow programs: How to avoid the merge anomaly,” Science of Computer Programming Vol. 10, pp. 65–85, 1988.CrossRefGoogle Scholar
  8. 8.
    M. Broy, “Functional specification of time sensitive communicating systems,” in: J. de Bakker, W.-P. de Roever, and G. Rozenberg (Eds.), Proceedings of the REX Workshop on Stepwise Refinement of Distributed Systems, Vol. 430 of Lecture Notes in Computer Science, Springer-Verlag, 1990, pp. 153–179.Google Scholar
  9. 9.
    M. Broy and C. Dendorfer, “Modelling operating system structures by timed stream processing functions,” Journal of Functional Programming, Vol. 2, pp. 1–21, 1992.Google Scholar
  10. 10.
    E.M. Clarke, O. Grumberg, and D.E. Long, “Model Checking and Abstraction,” ACM Transactions on Programming Languages and Systems Vol. 16, No. 5, pp. 1512–1542, 1994.CrossRefGoogle Scholar
  11. 11.
    P. Cousot and R. Cousot, “A unified lattice model for static analysis of programs by construction or approximation of fixpoints,” in Conference Record of the Fourth Annual ACM Symposium on Principles of Programming Languages 1977, pp. 238–252.Google Scholar
  12. 12.
    B.L. Di Vito, R.W. Butler, and J.L. Caldwell, “High level design proof of a reliable computing platform,” in J.F. Meyer and R.D. Schlichting (Eds.), Dependable Computing for Critical Applications 2, Vol. 6 of Dependable Computing and Fault-Tolerant Systems, Springer-Verlag, pp. 279–306, 1991.Google Scholar
  13. 13.
    V. Hadzilacos and S. Toueg, “A modular approach to fault-tolerant broadcasts and related problems,” Technical Report TR 94-1425, Cornell University, Department of Computer Science, 1994.Google Scholar
  14. 14.
    G.J. Holzmann, “The Spin Model Checker,” IEEE Transactions on Software Engineering, Vol. 23, No. 5, pp. 279–295, 1997.CrossRefGoogle Scholar
  15. 15.
    G.J. Holzmann and D. Peled, “An improvement in formal verification,” in Proc. 7th Int’l. Conference on Formal Description Techniques (FORTE ‘94), Chapman & Hall, 1995, pp. 197–211.Google Scholar
  16. 16.
    ITU-T: 1996, “ITU-TS Recommendation Z.120: Message Sequence Chart (MSC)”.Google Scholar
  17. 17.
    N.D. Jones and F. Nielson, “Abstract Interpretation: A semantics-based tool for program analysis,” in Handbook of Logic in Computer Science, Oxford University Press, 1994, pp. 527–629.Google Scholar
  18. 18.
    G. Kahn, “The semantics of a simple language for parallel programming,” in J.L. Rosenfeld (Ed.), Information Processing 74: Proceedings of the IFIP Congress 74, North-Holland, 1974, pp. 471–475.Google Scholar
  19. 19.
    R.M. Keller, “Denotational models for parallel programs with indeterminate operators,” in E.J. Neuhold (Ed.), Formal Description of Programming Concepts. North Holland, pp. 337–366, 1978.Google Scholar
  20. 20.
    R.P. Kurshan, “Analysis of discrete event coordination,” in J. de Bakker, W.-P. de Roever, and G. Rozenberg (Eds.): Proc. of the REX Workshop on Stepwise Refinement of Distributed Systems, Vol. 430 of Lecture Notes in Computer Science, Springer-Verlag, 1989, pp. 414–453.Google Scholar
  21. 21.
    R.P. Kurshan, Computer-Aided Verification of Coordinating Processes: The Automata-Theoretic Approach, Princeton University Press, 1994.Google Scholar
  22. 22.
    L. Lamport, R. Shostak, and M. Pease, “The Byzantine Generals Problem,” ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, pp. 382–401, 1982.CrossRefGoogle Scholar
  23. 23.
    X. Leroy, “The Caml Light System,” INRIA, 1997. Available via http:///
  24. 24.
    N.A. Lynch, Distributed Algorithms, Morgan Kaufmann, 1996.Google Scholar
  25. 25.
    S. Mishra, C. Fetzer, and F. Cristian, “The timewheel asynchronous atomic broadcast protocol,” in Proceedings of the 1997 International Conference on Parallel and Distributed Processing Techniques and Applications, CSREA Press, pp. 1239–1248, 1997.Google Scholar
  26. 26.
    S. Mishra, C. Fetzer, and F. Cristian, “The timewheel group membership protocol,” in Proceedings of the 3rd IEEE Workshop on Fault-tolerant Parallel and Distributed Systems, 1998.Google Scholar
  27. 27.
    J. Ousterhout, Tcl and the Tk Toolkit, Addison Wesley, 1994.Google Scholar
  28. 28.
    V.R. Pratt, “On the composition of processes,” in Conference Record of the Ninth Annual ACM Symposium on Principles of Programming Languages, ACM Press, 1982, pp. 213–223.Google Scholar
  29. 29.
    J. Rushby, “A fault-masking and transient-recovery model for digital flight-control systems,” in J. Vytopil (Ed.), Formal Techniques in Real-Time and Fault-Tolerant Systems, Kluwer, 1993, pp. 109–136. Also appeared in SRI International Computer Science Laboratory Technical Report SRI-CSL-93-04.Google Scholar
  30. 30.
    F.B. Schneider, “Towards fault-tolerant and secure agentry,” in M. Mavronikolas (Ed.), Proc. 11th International Workshop on Distributed Algorithms (WDAG ‘97), Vol. 1320 of Lecture Notes in Computer Science. Springer-Verlag, 1997.Google Scholar
  31. 31.
    S.D. Stoller, “A method and tool for analyzing fault-tolerance in systems,” Ph.D. thesis, Cornell University, 1997.Google Scholar
  32. 32.
    S.D. Stoller and F.B. Schneider, “Automated stream-based analysis of fault-tolerance,” in Fifth International Symposium on Formal Techniques in Real Time and Fault Tolerant Systems (FTRTFT), Vol. 1486 of Lecture Notes in Computer Science, Springer-Verlag, 1998, pp. 113–122.Google Scholar

Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  1. 1.Computer Science Dept.State University of New York at Stony BrookStony Brook
  2. 2.Dept. of Computer ScienceCornell UniversityIthaca

Personalised recommendations