Network Management: Fault Management, Performance Management, and Planned Maintenance

Guide to Reliable Internet Services and Applications

Part of the book series: Computer Communications and Networks (CCN)

Abstract

This chapter discusses the systems, activities, and challenges associated with the daily operation of large IP/MPLS networks. Specifically, it focuses on detecting, troubleshooting, and resolving faults and performance events. It highlights how network performance and health are managed over time, with emphasis on the application and challenges of exploratory data mining in this context. Finally, the chapter explores planned maintenance: the activities that operations personnel perform as part of the continued operation, evolution, and growth of large IP/MPLS networks.

Notes

  1. Note that although root cause analysis is a term often used by event management system vendors, we prefer to use the term “event correlation” here, as root cause more generally implies a far more detailed explanation than can be provided by event management systems. More details are provided later in this chapter.

Acknowledgements

The authors thank the AT&T network and service operations teams for invaluable collaborations with us, their Research partners, over the years. In particular, we thank Bobbi Bailey, Heather Robinett, and Joanne Emmons (AT&T) for detailed discussions related to this chapter and beyond. Finally, we acknowledge Stuart Mackie from EMC for discussions regarding alarm correlation.

Corresponding author

Correspondence to Jennifer M. Yates.

Copyright information

© 2010 Springer-Verlag London

About this chapter

Cite this chapter

Yates, J.M., Ge, Z. (2010). Network Management: Fault Management, Performance Management, and Planned Maintenance. In: Kalmanek, C., Misra, S., Yang, Y. (eds) Guide to Reliable Internet Services and Applications. Computer Communications and Networks. Springer, London. https://doi.org/10.1007/978-1-84882-828-5_12

  • DOI: https://doi.org/10.1007/978-1-84882-828-5_12

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84882-827-8

  • Online ISBN: 978-1-84882-828-5

  • eBook Packages: Computer Science, Computer Science (R0)
