Migol: A Fault-Tolerant Service Framework for MPI Applications in the Grid

  • André Luckow
  • Bettina Schnor
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3666)


In a distributed, inherently dynamic Grid environment, the reliability of individual resources cannot be guaranteed. The more resources and components are involved, the more error-prone the system becomes. Therefore, it is important to enhance the dependability of the system with fault-tolerance mechanisms. In this paper, we present Migol, a fault-tolerant, self-healing Grid service infrastructure for MPI applications.

The benefit of the Grid is that, in case of a failure, an application may be migrated and restarted from a checkpoint file on another site. This approach requires a service infrastructure that handles the necessary activities transparently for the application. However, a migration framework cannot support fault-tolerant applications if it is not fault-tolerant itself.


Grid computing, fault-tolerance, migration, MPI, Globus



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • André Luckow, Institute of Computer Science, University Potsdam, Germany
  • Bettina Schnor, Institute of Computer Science, University Potsdam, Germany
