Abstract
Fault management in high-performance cluster networks has focused on hard faults (i.e., link or node failures). Network degradations that hurt performance without causing failures often go unnoticed. In this paper, we classify such degradations as soft faults. We also identify consistent performance as an important service requirement in cluster networks and, using this requirement, describe a comprehensive strategy for cluster fault management.
The original version of this chapter was revised: The copyright line was incorrect. This has been corrected. The Erratum to this chapter is available at DOI: 10.1007/978-0-387-35674-7_66
Copyright information
© 2003 IFIP International Federation for Information Processing
Cite this chapter
Evans, J.J., Baik, S., Hood, C.S., Gropp, W. (2003). Toward Understanding Soft Faults in High Performance Cluster Networks. In: Goldszmidt, G., Schönwälder, J. (eds) Integrated Network Management VIII. IM 2003. IFIP — The International Federation for Information Processing, vol 118. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-35674-7_14
DOI: https://doi.org/10.1007/978-0-387-35674-7_14
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4757-5521-3
Online ISBN: 978-0-387-35674-7
eBook Packages: Springer Book Archive