Reliable and Efficient Distributed Checkpointing System for Grid Environments

Islam, Tanzima Zerin; Bagchi, Saurabh; Eigenmann, Rudolf

doi:10.1007/s10723-014-9297-4

Published: 20 May 2014

Volume 12, pages 593–613, (2014)

Journal of Grid Computing

Tanzima Zerin Islam¹,
Saurabh Bagchi² &
Rudolf Eigenmann²

Abstract

In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as their performance degradation is tolerable. However, unpredictable evictions of guest jobs lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such “failures”. Today’s FGCS systems often use expensive, high-performance dedicated checkpoint servers. However, in geographically distributed clusters, this may incur high checkpoint transfer latencies. In this paper we present a distributed checkpointing system called Falcon that uses available disk resources of the FGCS machines as shared checkpoint repositories. However, an unavailable storage host may lead to loss of checkpoint data. Therefore, we model the failures of a storage host and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with Falcon in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability.
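The idea of choosing reliable checkpoint repositories can be illustrated with a minimal sketch. It is not Falcon's actual prediction algorithm, which the abstract does not detail. It assumes that some predictor has already produced a per-host probability of remaining available, that checkpoints are stored as full replicas, and that host failures are independent. The host names, probabilities, and the functions `choose_repositories` and `survival_probability` are hypothetical.

```python
from math import prod


def choose_repositories(predicted_availability, k):
    """Return the k hosts with the highest predicted availability.

    predicted_availability maps a host name to an assumed probability
    (0.0 to 1.0) that the host stays available until the next recovery.
    """
    ranked = sorted(
        predicted_availability.items(), key=lambda kv: kv[1], reverse=True
    )
    return [host for host, _ in ranked[:k]]


def survival_probability(predicted_availability, hosts):
    """Probability that at least one replica survives.

    Assumes full replication on every chosen host and independent
    host failures; both are simplifying assumptions of this sketch.
    """
    return 1.0 - prod(1.0 - predicted_availability[h] for h in hosts)


# Hypothetical candidate storage hosts with made-up predictions.
candidates = {"hostA": 0.90, "hostB": 0.60, "hostC": 0.75, "hostD": 0.30}

chosen = choose_repositories(candidates, k=2)
p = survival_probability(candidates, chosen)
print(chosen, round(p, 3))
```

Ranking by predicted availability is only one possible selection policy; a real system would likely also weigh network bandwidth to each host and storage overhead.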

Author information

Authors and Affiliations

Lawrence Livermore National Laboratory, Box 808, L-560, Livermore, CA, 94551-0808, USA
Tanzima Zerin Islam
Center for Applied Scientific Computing (CASC), Lawrence Livermore National Laboratory, Livermore, CA, USA
Saurabh Bagchi & Rudolf Eigenmann

Corresponding author

Correspondence to Tanzima Zerin Islam.

About this article

Cite this article

Islam, T.Z., Bagchi, S. & Eigenmann, R. Reliable and Efficient Distributed Checkpointing System for Grid Environments. J Grid Computing 12, 593–613 (2014). https://doi.org/10.1007/s10723-014-9297-4

Received: 08 April 2013
Accepted: 13 March 2014
Published: 20 May 2014
Issue Date: December 2014
DOI: https://doi.org/10.1007/s10723-014-9297-4
