Abstract
Network fault management systems are mission-critical, for they are most needed when part of the network is faulty. Distributed system-level diagnosis offers a practical and theoretically sound solution for fault-tolerant fault monitoring: it guarantees that faults do not impair the fault management process. Results from applying distributed system-level diagnosis to SNMP-based LAN fault management have recently been reported [1, 2]. In this paper we extend those results by presenting a new algorithm for the diagnosis of non-broadcast networks, applied to point-to-point network fault management. In the algorithm, nodes test links periodically and disseminate link time-out information to all their fault-free neighbors in parallel. Upon receiving link time-out information, a node computes which portion of the network has become unreachable. This approach is closer to reality than previous algorithms, for it is impossible to distinguish a faulty node from a node to which all routes are faulty. The diagnosis latency of the algorithm is optimal: nodes report events in parallel, so latency is proportional to the diameter of the network. The dissemination step includes mechanisms to reduce the number of redundant messages introduced by the parallel strategy. We present a MIB for the algorithm and an SNMP-based implementation. An evaluation of the algorithm's impact on network performance shows that the bandwidth required is less than 0.1% for popular link capacities. We conclude by demonstrating the integration of LAN and WAN fault diagnosis into a unified framework.
The author has a scholarship from the Brazilian research council, CNPq.
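The reachability step mentioned in the abstract, in which a node works out which portion of the network has become unreachable once it learns of link time-outs, can be illustrated with a minimal sketch. This is not the chapter's algorithm: it assumes the node already holds a full adjacency map and a set of timed-out links, and it uses a plain breadth-first search. The names `reachable_nodes`, `unreachable_nodes`, and the five-node `topology` are hypothetical and chosen only for illustration.

```python
from collections import deque


def reachable_nodes(adjacency, source, timed_out_links):
    """Return the nodes reachable from `source` when timed-out links are skipped.

    Assumption: links are undirected, so each one is stored as a frozenset.
    """
    failed = {frozenset(link) for link in timed_out_links}
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            # Skip links reported as timed out and nodes already visited.
            if frozenset((node, neighbor)) in failed or neighbor in seen:
                continue
            seen.add(neighbor)
            queue.append(neighbor)
    return seen


def unreachable_nodes(adjacency, source, timed_out_links):
    """Return the portion of the network that `source` can no longer reach."""
    return set(adjacency) - reachable_nodes(adjacency, source, timed_out_links)


# Hypothetical five-node point-to-point topology (not taken from the chapter).
topology = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

# The only link to E has timed out, so E is unreachable from A.
print(sorted(unreachable_nodes(topology, "A", [("D", "E")])))
# Both routes to D have timed out, so D and E are unreachable from A.
print(sorted(unreachable_nodes(topology, "A", [("B", "D"), ("C", "D")])))
```

In the second call node D itself may be fault-free. From A's point of view it is simply unreachable, which matches the abstract's observation that a faulty node cannot be distinguished from a node to which all routes are faulty.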
References
E.P. Duarte Jr., and T. Nanya, “An SNMP-based Implementation of The Adaptive DSD Algorithm for LAN Fault Management,” Proc. IEEE/IFIP NOMS’96, pp. 530–539, Kyoto, April 1996.
E.P. Duarte Jr., and T. Nanya, “Hierarchical Distributed System-Level Diagnosis Applied for SNMP-based Network Fault Management”, Proc. IEEE 16th Symp. Reliable Distributed Systems, Niagara, September 1996.
F. Preparata, G. Metze, and R.T. Chien, “On The Connection Assignment Problem of Diagnosable Systems,” IEEE Transactions on Electronic Computers, Vol. 16, pp. 848–854, 1968.
S.L. Hakimi, and A.T. Amin, “Characterization of Connection Assignments of Diagnosable Systems,” IEEE Transactions on Computers, Vol. 23, pp. 86–88, 1974.
S.L. Hakimi, and K. Nakajima, “On Adaptive System Diagnosis,” IEEE Transactions on Computers, Vol. 33, pp. 234–240, 1984.
J.G. Kuhl, and S.M. Reddy, “Distributed Fault-Tolerance for Large Multiprocessor Systems,” Proc. 7th Annual Symp. Computer Architecture, pp. 23–30, 1980.
J.G. Kuhl, and S.M. Reddy, “Fault-Diagnosis in Fully Distributed Systems,” Proc. 11th Fault Tolerant Computing Symp, pp. 100–105, 1981.
S.H. Hosseini, J.G. Kuhl, and S.M. Reddy, “A Diagnosis Algorithm for Distributed Computing Systems with Failure and Repair,” IEEE Transactions on Computers, Vol. 33, pp. 223–233, 1984.
R.P. Bianchini, K. Goodwin, and D.S. Nydick, “Practical Application and Implementation of System-Level Diagnosis Theory,” Proc. 20th Fault Tolerant Computing Symp, pp. 332–339, 1990.
R.P. Bianchini, and R. Buskens, “An Adaptive Distributed System-Level Diagnosis Algorithm and Its Implementation,” Proc. 21st Fault Tolerant Computing Symp, pp. 222–229, 1991.
R.P. Bianchini, and R. Buskens, “Implementation of On-Line Distributed System-Level Diagnosis Theory,” IEEE Transactions on Computers, Vol. 41, pp. 616–626, 1992.
A. Bagchi, and S.L. Hakimi, “An Optimal Algorithm for Distributed System-Level Diagnosis,” Proc. 21st Fault Tolerant Computing Symp, June 1991.
M. Stahl, R. Buskens, and R. Bianchini, “On-Line Diagnosis on General Topology Networks,” Proc. Workshop Fault-Tolerant Parallel and Distributed Systems, July 1992.
M. Stahl, R. Buskens, and R. Bianchini, “Simulation of the Adapt On-Line Diagnosis Algorithm for General Topology Networks,” Proc. IEEE 11th Symp. Reliable Distributed Systems, October 1992.
S. Rangarajan, A.T. Dahbura, and E.A. Ziegler, “A Distributed System-Level Diagnosis Algorithm for Arbitrary Network Topologies,” IEEE Transactions on Computers, Vol. 44, pp. 312–333, 1995.
G. Mansfield, M. Ouchi, K. Jayanthi, Y. Kimura, K. Ohta, and Y. Nemoto, “Techniques for automated Network Map Generation using SNMP,” Proc. of INFOCOM’96, pp. 473–480, March 1996.
M. Rose, and K. McCloghrie, “Structure and Identification of Management Information for TCP/IP-based Internets,” RFC 1155, 1990.
J.D. Case, M.S. Fedor, M.L. Schoffstall, and J.R. Davin, “A Simple Network Management Protocol,” RFC 1157, 1990.
K. McCloghrie, and M.T. Rose, “Management Information Base for Network Management of TCP/IP-based Internets,” RFC 1213, 1991.
Copyright information
© 1997 Springer Science+Business Media Dordrecht
Cite this chapter
Duarte, E.P., Nanya, T., Noguchi, S., Mansfield, G. (1997). Non-Broadcast Network Fault-Monitoring Based on System-Level Diagnosis. In: Lazar, A.A., Saracco, R., Stadler, R. (eds) Integrated Network Management V. IM 1997. IFIP — The International Federation for Information Processing. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-35180-3_44
DOI: https://doi.org/10.1007/978-0-387-35180-3_44
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4757-5519-0
Online ISBN: 978-0-387-35180-3
eBook Packages: Springer Book Archive