Building Large-Scale, Reliable Network Services

Guide to Reliable Internet Services and Applications

Part of the book series: Computer Communications and Networks (CCN)

Abstract

Large-scale network services can be built in a manner that provides a high level of reliability and availability, thereby minimizing the number of failures as well as the impact of a failure. In this chapter, we discuss various techniques, including organizational considerations that facilitate the production of reliable software, with a particular emphasis on software architecture. In spite of all attempts to eliminate failures, some inevitably occur. We also discuss techniques that aid in troubleshooting failed systems as well as those that tend to minimize the duration of a failure.

Notes

  1. Gaining concurrence from the customer or sponsor may require more than the production of a high-level requirements document, such as the development of demonstration software.

  2. Overall system availability (e.g., 99.99% availability) excludes such maintenance activities; i.e., availability is measured against all time other than scheduled maintenance activities.

  3. The sponsor or customer should provide a load forecast to aid in the formulation of the performance requirements.

  4. Network engineering is covered elsewhere in this book and is not covered here, except for a few recommendations that aid overall service availability.

  5. Such access may be from an end-user of the service (e.g., via a browser or email client) or from another system, either internal to the service or from a customer or third-party server.

  6. To simplify the specification of alarm correlation rules, the system and/or tools used to perform alarm correlation will drive commonality requirements on logging (e.g., common date and time formats, allowing rules to determine multiple failures within a given time interval).

  7. “Test early, test often” should be followed in any case; it fosters easier and faster bug detection than waiting.

  8. In cases where continuous operation is not a hard requirement, automatic or scheduled process restart (sometimes called process rejuvenation) can be used to get “clean” memory. That said, software that exhibits no leaks is probably always more reliable than software that leaks.

  9. This is in addition to the event log documentation described earlier in the chapter.

  10. A special class of alarms falls into the category of events that “should almost never happen.” For such alarms, directing operations staff to call the developer is acceptable. However, if what “should almost never happen” begins to occur frequently, then the developer should provide more detail on the action to be taken by the operations staff.

  11. It is a good idea to simply provide an informational log entry when a problem that had been previously reported has been cleared.

  12. One technique is to encapsulate all the date and time stamp string generation in a single project module. That module needs to be configurable (e.g., via a configuration parameter) to always return the same value.

  13. A critical bug is one that prevents the service or system from functioning; it is sometimes referred to as a “severity 1 problem.” A major bug is one that prevents a portion of the service or system from functioning; it is sometimes referred to as a “severity 2 problem.”

  14. Software is like wine – it improves with time.

  15. This is not to say that the known version is perfect. Operations staff will always prefer the devil that they know to the one that they do not know.

  16. For web-server reporting, webalizer is a useful tool.
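
The technique in note 12 (a single module that generates every date and time stamp string, with a configuration parameter that makes it always return the same value) might be sketched as follows. This is a minimal illustration, not the chapter's implementation: the function names `configure`, `timestamp`, and `log_line` and the UTC timestamp format are assumptions made for the example. A fixed value of this kind would presumably let output that contains timestamps be compared byte for byte across test runs; that motivation is an assumption, since the note itself does not state one.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical configuration parameter: when set, every call to
# timestamp() returns this exact string instead of the current time.
_fixed_value: Optional[str] = None


def configure(fixed_value: Optional[str]) -> None:
    """Set (or clear, with None) the fixed timestamp string."""
    global _fixed_value
    _fixed_value = fixed_value


def timestamp() -> str:
    """Return the timestamp string used by every log entry in the project."""
    if _fixed_value is not None:
        return _fixed_value
    # One common format for the whole project (assumed here: UTC, second resolution).
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def log_line(message: str) -> str:
    """Build a log entry; callers never format dates themselves."""
    return f"{timestamp()} {message}"


if __name__ == "__main__":
    configure("2010-01-01T00:00:00Z")   # test mode: output is repeatable
    print(log_line("disk alarm cleared"))
    configure(None)                      # normal mode: real clock
    print(log_line("disk alarm cleared"))
```

Because all formatting goes through `timestamp()`, the same module is also a natural place to enforce the common date and time format that note 6 asks for.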

Author information

Correspondence to Alan L. Glasser.

Copyright information

© 2010 Springer-Verlag London

About this chapter

Cite this chapter

Glasser, A.L. (2010). Building Large-Scale, Reliable Network Services. In: Kalmanek, C., Misra, S., Yang, Y. (eds) Guide to Reliable Internet Services and Applications. Computer Communications and Networks. Springer, London. https://doi.org/10.1007/978-1-84882-828-5_15

  • DOI: https://doi.org/10.1007/978-1-84882-828-5_15

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84882-827-8

  • Online ISBN: 978-1-84882-828-5

  • eBook Packages: Computer Science (R0)
