
Applicability of Generic Naming Services and Fault-Tolerant Metacomputing with FT-MPI

  • David Dewolfs
  • Dawid Kurzyniec
  • Vaidy Sunderam
  • Jan Broeckhove
  • Tom Dhaene
  • Graham Fagg
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3666)

Abstract

There is growing interest in deploying MPI over multiple, heterogeneous, and geographically distributed resources to perform very large-scale computations. However, spreading a computation across more resources and wider geographic areas creates problems with interoperability and fault-tolerance. FT-MPI offers an interesting approach to adding fault-tolerance to MPI, but it suffers from interoperability limitations and potential single points of failure when crossing multiple administrative domains. We propose to overcome these limitations by adding “pluggability” for one potential single point of failure, the name service used by FT-MPI, and by combining FT-MPI with the H2O metacomputing framework.

Keywords

FT-MPI, H2O, metacomputing, fault-tolerance, heterogeneity

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • David Dewolfs (1)
  • Dawid Kurzyniec (1)
  • Vaidy Sunderam (1)
  • Jan Broeckhove (1)
  • Tom Dhaene (1)
  • Graham Fagg (1)

  1. Depts. of Math and Computer Science, University of Antwerp, Belgium and Emory University, Atlanta, USA
