Toward Understanding Soft Faults in High Performance Cluster Networks

Evans, Jeffrey J.; Baik, Seongbok; Hood, Cynthia S.; Gropp, William

doi:10.1007/978-0-387-35674-7_14

Part of the book series: IFIP — The International Federation for Information Processing (IFIPAICT, volume 118)

Included in the following conference series:

International Symposium on Integrated Network Management

Abstract

Fault management in high performance cluster networks has focused on hard faults (i.e., link or node failures). Network degradations that hurt performance without causing outright failures often go unnoticed. In this paper, we classify such degradations as soft faults. We also identify consistent performance as an important service requirement in cluster networks and, using this requirement, describe a comprehensive strategy for cluster fault management.

The original version of this chapter was revised: The copyright line was incorrect. This has been corrected. The Erratum to this chapter is available at DOI: 10.1007/978-0-387-35674-7_66

Author information

Authors and Affiliations

Department of Computer Science, Illinois Institute of Technology, 10 West 31st St., Chicago, Illinois, 60616, USA
Jeffrey J. Evans, Seongbok Baik & Cynthia S. Hood
Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, Illinois, 60439, USA
William Gropp

Editor information

Editors and Affiliations

IBM Research, USA
Germán Goldszmidt
University of Osnabrück, Germany
Jürgen Schönwälder

About this chapter

Cite this chapter

Evans, J.J., Baik, S., Hood, C.S., Gropp, W. (2003). Toward Understanding Soft Faults in High Performance Cluster Networks. In: Goldszmidt, G., Schönwälder, J. (eds) Integrated Network Management VIII. IM 2003. IFIP — The International Federation for Information Processing, vol 118. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-35674-7_14

DOI: https://doi.org/10.1007/978-0-387-35674-7_14
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4757-5521-3
Online ISBN: 978-0-387-35674-7
eBook Packages: Springer Book Archive
