Abstract
This chapter discusses the systems, activities, and challenges associated with the daily operation of large IP/MPLS networks. Specifically, it focuses on detecting, troubleshooting, and resolving faults and performance events. It highlights how network performance and health are managed over time, with emphasis on the application and challenges of exploratory data mining in this context. Finally, the chapter explores planned maintenance: the activities that operations personnel perform as part of the continued operation, evolution, and growth of large IP/MPLS networks.
Notes
1. Note that although root cause analysis is a term often used by event management system vendors, we prefer to use the term “event correlation” here, as root cause more generally implies a far more detailed explanation than can be provided by event management systems. More details are provided later in this chapter.
Acknowledgements
The authors thank the AT&T network and service operations teams for invaluable collaborations with us, their Research partners, over the years. In particular, we thank Bobbi Bailey, Heather Robinett, and Joanne Emmons (AT&T) for detailed discussions related to this chapter and beyond. Finally, we acknowledge Stuart Mackie from EMC, for discussions regarding alarm correlation.
Copyright information
© 2010 Springer-Verlag London
About this chapter
Cite this chapter
Yates, J.M., Ge, Z. (2010). Network Management: Fault Management, Performance Management, and Planned Maintenance. In: Kalmanek, C., Misra, S., Yang, Y. (eds) Guide to Reliable Internet Services and Applications. Computer Communications and Networks. Springer, London. https://doi.org/10.1007/978-1-84882-828-5_12
DOI: https://doi.org/10.1007/978-1-84882-828-5_12
Publisher Name: Springer, London
Print ISBN: 978-1-84882-827-8
Online ISBN: 978-1-84882-828-5
eBook Packages: Computer Science, Computer Science (R0)