Abstract
In this paper, we describe an efficient coordinated checkpointing and recovery algorithm that works even when channels are non-FIFO and messages may be lost. Nodes are assumed to be autonomous and do not block while taking checkpoints. Based on its local conditions, any process can ask the previous coordinator for permission to initiate a new checkpoint. Allowing multiple initiators avoids the bottleneck associated with a single initiator, but the algorithm permits only one instance of the checkpointing process at any given time, which reduces much of the overhead associated with multiple initiators in distributed algorithms.
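The permission handoff described above can be illustrated with a small in-memory sketch. This is not the algorithm from the paper: it models only the idea that a would-be initiator asks the previous coordinator for permission, and that permission is refused while a checkpointing round is still active. Message passing, non-FIFO channels, message loss, and recovery are all left out, and every name here (`Process`, `System`, `request_permission`, `finish_round`) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Process:
    pid: int
    busy: bool = False  # True while this process coordinates an active round


@dataclass
class System:
    """Toy model (assumption): tracks who coordinated the latest round."""
    procs: Dict[int, Process] = field(default_factory=dict)
    coordinator: int = 0  # pid of the previous (most recent) coordinator
    rounds: int = 0       # number of checkpointing rounds started so far

    def add(self, pid: int) -> None:
        self.procs[pid] = Process(pid)

    def request_permission(self, requester: int) -> bool:
        """Ask the previous coordinator for permission to start a round.

        Permission is granted only if the previous coordinator has no
        round in progress, so at most one round is active at any time.
        """
        previous = self.procs[self.coordinator]
        if previous.busy:
            return False  # a round is still running; the request is refused
        # Hand the coordinator role to the requester and start its round.
        self.coordinator = requester
        self.procs[requester].busy = True
        self.rounds += 1
        return True

    def finish_round(self, pid: int) -> None:
        """Mark the round coordinated by `pid` as complete."""
        if pid == self.coordinator:
            self.procs[pid].busy = False
```

In this sketch any process may become the initiator, but a second request made while a round is running is refused until `finish_round` is called for the current coordinator. A real protocol would also need to handle lost requests and coordinator failure, which this toy model ignores.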
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kumar, K.P.K., Hansdah, R.C. (2006). An Efficient and Scalable Checkpointing and Recovery Algorithm for Distributed Systems. In: Chaudhuri, S., Das, S.R., Paul, H.S., Tirthapura, S. (eds) Distributed Computing and Networking. ICDCN 2006. Lecture Notes in Computer Science, vol 4308. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11947950_11
DOI: https://doi.org/10.1007/11947950_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68139-7
Online ISBN: 978-3-540-68140-3
eBook Packages: Computer Science (R0)