Network Management: Fault Management, Performance Management, and Planned Maintenance

Yates, Jennifer M.; Ge, Zihui

doi:10.1007/978-1-84882-828-5_12

Part of the book series: Computer Communications and Networks (CCN)

Abstract

This chapter discusses the systems, activities, and challenges associated with the daily operation of large IP/MPLS networks. Specifically, it focuses on detecting, troubleshooting, and resolving faults and performance events. It highlights how network performance and health are managed over time, with emphasis on the application and challenges of exploratory data mining in this context. Finally, the chapter explores planned maintenance: the activities that operations personnel perform as part of the continued operation, evolution, and growth of large IP/MPLS networks.

Notes

1. Although "root cause analysis" is a term often used by event management system vendors, we prefer the term “event correlation” here, because "root cause" generally implies a far more detailed explanation than event management systems can provide. More details are provided later in this chapter.

Acknowledgements

The authors thank the AT&T network and service operations teams for their invaluable collaboration with us, their Research partners, over the years. In particular, we thank Bobbi Bailey, Heather Robinett, and Joanne Emmons (AT&T) for detailed discussions related to this chapter and beyond. Finally, we thank Stuart Mackie of EMC for discussions regarding alarm correlation.

Author information

Authors and Affiliations

AT&T Labs – Research, Florham Park, NJ, USA
Jennifer M. Yates & Zihui Ge

Authors

Jennifer M. Yates
Zihui Ge

Corresponding author

Correspondence to Jennifer M. Yates.

Editor information

Editors and Affiliations

AT&T Labs Research, Park Ave. 180, Florham Park, 07932, USA
Charles R. Kalmanek
School of Information Technology, Indian Institute of Technology Kharagpur, Kharagpur, 721302, India
Sudip Misra
Dept. Computer Science, Yale University, Prospect St. 51, New Haven, 06511, USA
Yang (Richard) Yang

About this chapter

Cite this chapter

Yates, J.M., Ge, Z. (2010). Network Management: Fault Management, Performance Management, and Planned Maintenance. In: Kalmanek, C., Misra, S., Yang, Y. (eds) Guide to Reliable Internet Services and Applications. Computer Communications and Networks. Springer, London. https://doi.org/10.1007/978-1-84882-828-5_12

DOI: https://doi.org/10.1007/978-1-84882-828-5_12
Published: 25 January 2010
Publisher Name: Springer, London
Print ISBN: 978-1-84882-827-8
Online ISBN: 978-1-84882-828-5
eBook Packages: Computer Science, Computer Science (R0)
