Self-Healing Network for Scalable Fault Tolerant Runtime Environments

  • Thara Angskun
  • Graham E. Fagg
  • George Bosilca
  • Jelena Pješivac-Grbović
  • Jack J. Dongarra

Keywords

Span Tree; Multicast Group; Distribute Hash Table; Broadcast Message; Runtime Environment
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Thara Angskun
    • 1
  • Graham E. Fagg
    • 2
  • George Bosilca
    • 3
  • Jelena Pješivac-Grbović
    • 4
  • Jack J. Dongarra
    • 5
  1. Dept. of Computer Science, The University of Tennessee, Knoxville, USA
  2. Dept. of Computer Science, The University of Tennessee, Knoxville, USA
  3. Dept. of Computer Science, University of Tennessee, Knoxville, USA
  4. Dept. of Computer Science, The University of Tennessee, Knoxville, USA
  5. Dept. of Computer Science, The University of Tennessee, Knoxville, USA
