Reliable and Efficient Distributed Checkpointing System for Grid Environments

Journal of Grid Computing

Abstract

In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as the performance degradation they experience is tolerable. However, unpredictable evictions of guest jobs lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such “failures”. Today’s FGCS systems often use expensive, high-performance dedicated checkpoint servers. In geographically distributed clusters, however, relying on such servers may incur high checkpoint transfer latencies. In this paper we present Falcon, a distributed checkpointing system that uses available disk resources of the FGCS machines as shared checkpoint repositories. Because an unavailable storage host may lead to loss of checkpoint data, we model the failures of a storage host and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with Falcon in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability.
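
The idea of choosing reliable checkpoint repositories can be illustrated with a small sketch. It is not Falcon's actual prediction algorithm, which the abstract does not specify. It assumes that some predictor has already assigned each candidate storage host a probability of staying available (`p_available`) and that a bandwidth estimate (`bandwidth_mbps`) is known; both fields, the `StorageHost` type, and the two helper functions are hypothetical names introduced here. The sketch ranks hosts by predicted availability and, assuming independent host failures, estimates the chance that at least one stored copy of a checkpoint survives.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class StorageHost:
    name: str
    p_available: float     # hypothetical: predicted probability the host stays available
    bandwidth_mbps: float  # hypothetical: estimated bandwidth from the compute host


def choose_repositories(hosts, k):
    """Pick the k hosts with the highest predicted availability.

    Ties in availability are broken in favor of higher bandwidth.
    """
    ranked = sorted(
        hosts,
        key=lambda h: (h.p_available, h.bandwidth_mbps),
        reverse=True,
    )
    return ranked[:k]


def survival_probability(chosen):
    """Probability that at least one full copy survives.

    Assumes each chosen host holds a complete replica and that hosts
    fail independently, which is a simplification.
    """
    return 1.0 - prod(1.0 - h.p_available for h in chosen)


# Illustrative values only; they are not measurements from the paper.
hosts = [
    StorageHost("a", 0.9, 100.0),
    StorageHost("b", 0.5, 1000.0),
    StorageHost("c", 0.9, 200.0),
    StorageHost("d", 0.7, 50.0),
]
chosen = choose_repositories(hosts, 2)
print([h.name for h in chosen], survival_probability(chosen))
```

With these made-up values the two hosts with the highest predicted availability are selected, and the faster of the two is listed first. A real system would also have to weigh transfer latency against reliability, which this sketch reduces to a tie-break.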

Corresponding author

Correspondence to Tanzima Zerin Islam.

Cite this article

Islam, T.Z., Bagchi, S. & Eigenmann, R. Reliable and Efficient Distributed Checkpointing System for Grid Environments. J Grid Computing 12, 593–613 (2014). https://doi.org/10.1007/s10723-014-9297-4

  • DOI: https://doi.org/10.1007/s10723-014-9297-4
