Implementing Fault-Tolerant Services Using State Machines: Beyond Replication

  • Conference paper
Distributed Computing (DISC 2010)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6343)

Abstract

This paper describes a method to implement fault-tolerant services in distributed systems based on the idea of fused state machines. The theory of fused state machines combines coding theory with replication to achieve efficiency as well as savings in storage and messages during normal operation. Fused state machines may incur higher overhead during recovery from crash or Byzantine faults, but that may be acceptable if the probability of faults is low. Assuming n different state machines, pure replication-based schemes require n(f + 1) replicas to tolerate f crash faults in a system and n(2f + 1) replicas to tolerate f Byzantine faults. For crash faults, we give an algorithm that requires the optimal f backup state machines for tolerating f faults in the system of n machines. For Byzantine faults, we propose an algorithm that requires only nf + f additional state machines, as opposed to 2nf additional state machines under pure replication. The resulting overhead during normal operation is low, and the number of copies required to tolerate f faults stays small.
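The machine counts quoted in the abstract can be compared with a small calculation. The sketch below is only an illustration of that arithmetic, not code from the paper: the function name `backups_needed` and the dictionary keys are hypothetical, and the replication figures count additional machines only, that is, n(f + 1) and n(2f + 1) minus the n original machines.

```python
def backups_needed(n: int, f: int) -> dict:
    """Additional (backup) machines needed to tolerate f faults among n machines.

    The formulas are taken from the abstract; the names are illustrative only.
    """
    if n < 1 or f < 0:
        raise ValueError("need n >= 1 machines and f >= 0 faults")
    return {
        # Pure replication: n(f + 1) replicas in total, so n*f extra machines.
        "crash_replication": n * f,
        # Fused approach for crash faults: f backup machines.
        "crash_fused": f,
        # Pure replication: n(2f + 1) replicas in total, so 2*n*f extra machines.
        "byzantine_replication": 2 * n * f,
        # Fused approach for Byzantine faults: n*f + f extra machines.
        "byzantine_fused": n * f + f,
    }


if __name__ == "__main__":
    for name, count in backups_needed(n=10, f=2).items():
        print(f"{name}: {count}")
```

For n = 10 and f = 2, the fused scheme needs 2 backups instead of 20 for crash faults, and 22 instead of 40 for Byzantine faults.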

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Garg, V.K. (2010). Implementing Fault-Tolerant Services Using State Machines: Beyond Replication. In: Lynch, N.A., Shvartsman, A.A. (eds) Distributed Computing. DISC 2010. Lecture Notes in Computer Science, vol 6343. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15763-9_44

  • DOI: https://doi.org/10.1007/978-3-642-15763-9_44

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15762-2

  • Online ISBN: 978-3-642-15763-9

  • eBook Packages: Computer Science, Computer Science (R0)