Abstract
Uncoordinated checkpointing is one technique used to build processes that can recover to a consistent state after crashing. This technique requires each process to periodically record its state in a checkpoint. Furthermore, the threads executing on each process log any non-deterministic action that they take following the latest checkpointed state. When a process crashes, a new process, initialized with the appropriate recorded local state, is created in its place. The new process restarts executing, and whenever one of its threads confronts a non-deterministic choice, the thread references the log in order to reproduce the same action performed before the crash. Thus, uncoordinated checkpointing implements an abstraction of a resilient process in which the crash of a process is translated into intermittent unavailability of that process.
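The checkpoint, log, and replay cycle described above can be illustrated with a minimal single-threaded sketch. It is not the protocol derived in the paper: the class `ResilientProcess`, its methods, and the helper `run_with_crash` are hypothetical names introduced here, the only non-deterministic action is a random draw, and the checkpoint and log are assumed to survive the crash on stable storage (modeled simply by copying them before the crashed process is discarded).

```python
import copy
import random


class ResilientProcess:
    """Toy process that checkpoints its state and logs non-deterministic choices."""

    def __init__(self, seed=None):
        self.state = {"total": 0, "steps": 0}
        self.checkpoint = copy.deepcopy(self.state)  # latest recorded state
        self.log = []           # non-deterministic choices made since the checkpoint
        self.replay_pos = None  # index into the log while recovering, else None
        self.rng = random.Random(seed)

    def take_checkpoint(self):
        # Record the current state; choices made before it no longer need replaying.
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def choose(self):
        # While recovering, reproduce the logged choice instead of drawing a new one.
        if self.replay_pos is not None and self.replay_pos < len(self.log):
            value = self.log[self.replay_pos]
            self.replay_pos += 1
            return value
        # Normal execution: make a fresh choice and log it.
        self.replay_pos = None
        value = self.rng.randint(1, 100)
        self.log.append(value)
        return value

    def step(self):
        self.state["total"] += self.choose()
        self.state["steps"] += 1

    @classmethod
    def recover(cls, checkpoint, log):
        # Create a replacement process initialized from the recorded state and log.
        p = cls()
        p.state = copy.deepcopy(checkpoint)
        p.checkpoint = copy.deepcopy(checkpoint)
        p.log = list(log)
        p.replay_pos = 0
        return p


def run_with_crash():
    """Run a process, crash it, recover it, and return (pre-crash, recovered) states."""
    p = ResilientProcess(seed=1)
    p.step()
    p.step()
    p.take_checkpoint()
    p.step()
    p.step()
    p.step()
    pre_crash_state = copy.deepcopy(p.state)

    # Assumption: the checkpoint and log live on stable storage and outlive the crash.
    stable_checkpoint = copy.deepcopy(p.checkpoint)
    stable_log = list(p.log)
    del p  # the crash: the volatile process is gone

    q = ResilientProcess.recover(stable_checkpoint, stable_log)
    for _ in range(len(stable_log)):
        q.step()  # re-execution consults the log at every non-deterministic choice
    return pre_crash_state, q.state
```

In this sketch the recovered process reaches the same state the original had before crashing, so an observer sees only a period of unavailability. The multithreaded, shared-memory setting the paper addresses additionally has to track dependencies between threads, which this example does not attempt.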
We give a specification of the consistency property “no orphan threads” in the context of multithreaded processes running on a shared memory multiprocessor. We also give a definition of optimality for uncoordinated checkpointing protocols given a memory coherency protocol. We then use this specification to derive an existing uncoordinated checkpoint protocol and show that it is optimal. This protocol assumes that once a process crashes, no further processes crash until the first process completes recovery.
This author was supported in part by the Office of Naval Research under contract N00014-91-J-1219, the National Science Foundation under Grant No. CCR-9003440, DARPA/NSF Grant No. CCR-9014363, NASA/DARPA grant NAG-2-893, and AFOSR grant F49620-94-1-0198. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author and do not reflect the views of these agencies.
This author was supported in part by the Defense Advanced Research Projects Agency (DoD) under NASA Ames grant number NAG 2-593, Contract N00140-87-C-8904 and by AFOSR grant number F49620-93-1-0242. The views, opinions, and findings contained in this report are those of the authors and should not be construed as an official Department of Defense position, policy, or decision.
© 1995 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Alvisi, L., Marzullo, K. (1995). Deriving optimal checkpoint protocols for distributed shared memory architectures. In: Birman, K.P., Mattern, F., Schiper, A. (eds) Theory and Practice in Distributed Systems. Lecture Notes in Computer Science, vol 938. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60042-6_8
DOI: https://doi.org/10.1007/3-540-60042-6_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-60042-8
Online ISBN: 978-3-540-49409-6
eBook Packages: Springer Book Archive