Persistent Fault-Tolerance for Divide-and-Conquer Applications on the Grid

  • Gosia Wrzesinska
  • Ana-Maria Oprescu
  • Thilo Kielmann
  • Henri Bal
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4641)


Grid applications need to be fault tolerant, malleable, and migratable. In previous work, we presented orphan saving, an efficient mechanism that addresses these issues for divide-and-conquer applications. In this paper, we present a mechanism for writing partial results to checkpoint files, which adds the ability to tolerate the total loss of all processors and to suspend and later resume an application.

Both mechanisms have negligible overhead in the absence of faults, even with checkpointing intervals as short as one minute. When faults occur, the new checkpointing mechanism outperforms orphan saving by 10% to 15%. Suspending and resuming an application also incurs little overhead, which makes our approach attractive for writing grid applications.
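The idea of writing partial results to a checkpoint file can be illustrated with a minimal sketch. This is not the mechanism implemented in the paper; it only assumes that each subproblem of a divide-and-conquer tree has a unique key, that finished results are written to a file at a fixed interval, and that a resumed run skips every subtree whose result is already in the file. The names `CheckpointStore`, `tree_sum`, and `demo` are hypothetical.

```python
import json
import os
import tempfile
import time


class CheckpointStore:
    """Hypothetical store mapping a subproblem key to its finished result."""

    def __init__(self, path, interval_s=60.0):
        self.path = path
        self.interval_s = interval_s      # assumed checkpoint interval (seconds)
        self.results = {}
        self.last_write = time.monotonic()
        if os.path.exists(path):          # resuming: reload earlier partial results
            with open(path) as f:
                self.results = json.load(f)

    def record(self, key, value):
        self.results[key] = value
        if time.monotonic() - self.last_write >= self.interval_s:
            self.flush()

    def flush(self):
        # Write to a temporary file, then rename, so that a crash during
        # the write does not leave a truncated checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.results, f)
        os.replace(tmp, self.path)
        self.last_write = time.monotonic()


def tree_sum(data, lo, hi, store, stats):
    """Divide-and-conquer sum of data[lo:hi] that reuses checkpointed subtrees."""
    key = f"{lo}:{hi}"                    # unique position in the execution tree
    if key in store.results:
        return store.results[key]         # subtree already finished in an earlier run
    stats["computed"] += 1
    if hi - lo <= 2:
        value = sum(data[lo:hi])          # leaf: solve directly
    else:
        mid = (lo + hi) // 2
        value = (tree_sum(data, lo, mid, store, stats)
                 + tree_sum(data, mid, hi, store, stats))
    store.record(key, value)
    return value


def demo():
    """Simulate a run that stops after the left half, then a resumed run."""
    data = list(range(16))
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "checkpoint.json")
        first = CheckpointStore(path)
        tree_sum(data, 0, 8, first, {"computed": 0})
        first.flush()                     # stands in for the periodic checkpoint
        resumed = CheckpointStore(path)   # all in-memory state was lost
        stats = {"computed": 0}
        total = tree_sum(data, 0, 16, resumed, stats)
        return total, stats["computed"]
```

In `demo`, the resumed run recomputes only the root and the right half of the tree, because the result for the left half is found in the checkpoint file.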


Keywords: Fault Tolerance, Travel Salesman Problem, Grid Environment, Grid Application, Execution Tree
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Gosia Wrzesinska (1)
  • Ana-Maria Oprescu (1)
  • Thilo Kielmann (1)
  • Henri Bal (1)

  1. Vrije Universiteit Amsterdam
