Journal of Grid Computing, Volume 12, Issue 4, pp 593–613

Reliable and Efficient Distributed Checkpointing System for Grid Environments

  • Tanzima Zerin Islam
  • Saurabh Bagchi
  • Rudolf Eigenmann


In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as their performance degradation is tolerable. However, unpredictable evictions of guest jobs lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such “failures”. Today’s FGCS systems often use expensive, high-performance dedicated checkpoint servers. However, in geographically distributed clusters, this may incur high checkpoint transfer latencies. In this paper we present a distributed checkpointing system called Falcon that uses available disk resources of the FGCS machines as shared checkpoint repositories. However, an unavailable storage host may lead to loss of checkpoint data. Therefore, we model the failures of a storage host and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with Falcon in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability.
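The idea of choosing reliable checkpoint repositories can be illustrated with a minimal sketch. This is not Falcon's prediction algorithm, which the abstract does not spell out. It assumes that each candidate storage host already has a predicted probability of staying available, that hosts fail independently, and that a checkpoint is split into n fragments of which any m are enough to recover it (generic erasure coding). The function names `choose_repositories` and `survival_probability` and the host names are hypothetical.

```python
from itertools import combinations
from math import prod


def choose_repositories(predicted, n):
    """Return the n hosts with the highest predicted availability.

    predicted: dict mapping host name -> availability probability
    (hypothetical input; a real system would derive it from a failure model).
    """
    ranked = sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)
    return [host for host, _ in ranked[:n]]


def survival_probability(avail, m):
    """Probability that at least m of the given hosts stay available.

    avail: list of per-host availability probabilities.
    Assumes host failures are independent, which is a simplification.
    """
    n = len(avail)
    total = 0.0
    # Sum over every subset of hosts that could be the surviving set.
    for k in range(m, n + 1):
        for up in combinations(range(n), k):
            up_set = set(up)
            total += prod(
                avail[i] if i in up_set else 1.0 - avail[i]
                for i in range(n)
            )
    return total


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the paper.
    predicted = {"hostA": 0.9, "hostB": 0.5, "hostC": 0.8, "hostD": 0.99}
    chosen = choose_repositories(predicted, 3)
    probs = [predicted[h] for h in chosen]
    print(chosen, survival_probability(probs, 2))
```

Enumerating subsets is exponential in the number of hosts, which is acceptable for the handful of fragments a single checkpoint would use; a larger deployment would need a dynamic-programming formulation instead.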


Keywords: Checkpoint, Checkpointing, Recovery, Reliability, Cycle sharing system, FGCS, Condor, Efficient, Data parallel checkpointing, Erasure encoding, Checkpoint/Restart





Copyright information

© Springer Science+Business Media Dordrecht (outside the USA) 2014

Authors and Affiliations

  • Tanzima Zerin Islam (1)
  • Saurabh Bagchi (2)
  • Rudolf Eigenmann (2)

  1. Lawrence Livermore National Laboratory, Livermore, USA
  2. Center for Applied Scientific Computing (CASC), Lawrence Livermore National Laboratory, Livermore, USA