On the Impact of Fast Failure Detectors on Real-Time Fault-Tolerant Systems

  • Marcos K. Aguilera
  • Gérard Le Lann
  • Sam Toueg
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2508)


We investigate whether fast failure detectors can be useful, and if so by how much, in the design of real-time fault-tolerant systems. Specifically, we show how fast failure detectors can speed up consensus and fault-tolerant broadcasts in synchronous systems with crashes, by providing fast algorithms and deriving some matching lower bounds. These results show that a fast failure detector service (implemented using specialized hardware or expedited message delivery) can be an important tool in the design of real-time mission-critical systems.







Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Marcos K. Aguilera (1)
  • Gérard Le Lann (2)
  • Sam Toueg (3)

  1. HP Systems Research Center, Palo Alto, USA
  2. INRIA Rocquencourt, Le Chesnay, France
  3. Department of Computer Science, University of Toronto, Toronto, Canada
