
Migol: A Fault-Tolerant Service Framework for MPI Applications in the Grid

  • André Luckow
  • Bettina Schnor
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3666)

Abstract

In a distributed, inherently dynamic Grid environment, the reliability of individual resources cannot be guaranteed. The more resources and components are involved, the more error-prone the system becomes. It is therefore important to enhance the dependability of the system with fault-tolerance mechanisms. In this paper, we present Migol, a fault-tolerant, self-healing Grid service infrastructure for MPI applications.

The benefit of the Grid is that, in case of a failure, an application may be migrated and restarted from a checkpoint file on another site. This approach requires a service infrastructure that handles the necessary activities transparently for the application. However, a migration framework cannot support fault-tolerant applications if it is not fault-tolerant itself.
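
The restart-from-checkpoint idea can be illustrated with a minimal, single-process sketch. It is not Migol's implementation or API: the names `run`, `run_with_migration`, and `SimulatedFailure` are hypothetical, the "application" is a toy summation loop, and the second site is only simulated by calling the same function again with the same checkpoint file.

```python
import json
import os
import tempfile


class SimulatedFailure(Exception):
    """Hypothetical stand-in for a site or node failure."""


def run(checkpoint_path, total_steps, fail_at=None, interval=10):
    """Sum 0..total_steps-1, writing a checkpoint every `interval` steps."""
    step, acc = 0, 0
    # Restart path: resume from the last checkpoint if one exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
        step, acc = state["step"], state["acc"]

    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise SimulatedFailure(f"failed at step {step}")
        acc += step
        step += 1
        if step % interval == 0:
            # Write to a temporary file and rename it, so a crash during
            # the write never leaves a half-written checkpoint behind.
            tmp = checkpoint_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"step": step, "acc": acc}, f)
            os.replace(tmp, checkpoint_path)
    return acc


def run_with_migration(total_steps, fail_at):
    """Run once; on failure, restart from the last checkpoint."""
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "app.ckpt")
        try:
            return run(ckpt, total_steps, fail_at=fail_at)
        except SimulatedFailure:
            # Assumption: in a real Grid the checkpoint file would first
            # be transferred to another site; here it is reused in place.
            return run(ckpt, total_steps)
```

Calling `run_with_migration(100, 37)` fails at step 37, resumes from the checkpoint written at step 30, and returns the same sum as an uninterrupted run. Only the work done since the last checkpoint is repeated, which is why the checkpoint interval trades restart cost against checkpointing overhead.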

Keywords

Grid computing, fault-tolerance, migration, MPI, Globus

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • André Luckow (1)
  • Bettina Schnor (1)
  1. Institute of Computer Science, University Potsdam, Germany
