Building Large-Scale, Reliable Network Services

Glasser, Alan L.

doi:10.1007/978-1-84882-828-5_15

Part of the book series: Computer Communications and Networks ((CCN))

Abstract

Large-scale network services can be built in a manner that provides a high level of reliability and availability, thereby minimizing the number of failures as well as the impact of a failure. In this chapter, we discuss various techniques, including organizational considerations that facilitate the production of reliable software, with a particular emphasis on software architecture. In spite of all attempts to eliminate failures, some inevitably occur. We also discuss techniques that aid in troubleshooting failed systems as well as those that tend to minimize the duration of a failure.

Notes

1. Gaining concurrence from the customer or sponsor may require more than the production of a high-level requirements document, such as the development of demonstration software.
2. Overall system availability (e.g., 99.99% availability) excludes such maintenance activities; i.e., availability is measured against all time other than scheduled maintenance activities.
3. The sponsor or customer should provide a load forecast to aid in the formulation of the performance requirements.
4. Network engineering is covered elsewhere in this book and is not covered here, except for a few recommendations that aid overall service availability.
5. Such access may be from an end-user of the service (e.g., via a browser or email client) or from another system, either internal to the service or from a customer or third-party server.
6. To simplify the specification of alarm correlation rules, the system and/or tools used to perform alarm correlation will drive commonality requirements on logging (e.g., common date and time formats, allowing rules to determine multiple failures within a given time interval).
7. “Test early, test often” should be followed in any case; it fosters easier and faster bug detection than waiting.
8. In cases where continuous operation is not a hard requirement, automatic or scheduled process restart (sometimes called process rejuvenation) can be used to get “clean” memory. That said, software that exhibits no leaks is probably always more reliable than software that leaks.
9. This is in addition to the event log documentation described earlier in the chapter.
10. A special class of alarms falls into the category of events that “should almost never happen.” For such alarms, directing operations staff to call the developer is acceptable. However, if what “should almost never happen” begins to occur frequently, then the developer should provide more detail on the action to be taken by the operations staff.
11. It is a good idea to simply provide an informational log entry when a problem that had been previously reported has been cleared.
12. One technique is to encapsulate all the date and time stamp string generation in a single project module. That module needs to be configurable (e.g., via a configuration parameter) to always return the same value.
13. A critical bug is one that prevents the service or system from functioning; it is sometimes referred to as a “severity 1 problem.” A major bug is one that prevents a portion of the service or system from functioning; it is sometimes referred to as a “severity 2 problem.”
14. Software is like wine – it improves with time.
15. This is not to say that the known version is perfect. Operations staff will always prefer the devil that they know to the one that they do not know.
16. For web-server reporting, webalizer is a useful tool.
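
The technique in note 12 can be illustrated with a minimal sketch. It assumes the fixed value is meant to make output repeatable from run to run, which the note itself does not state. The module layout, the SERVICE_FIXED_TIMESTAMP parameter name, the ISO 8601 UTC format, and the log_line helper are all hypothetical choices made for illustration, not taken from the chapter.

```python
import os
from datetime import datetime, timezone

# Hypothetical configuration parameter: when set, every call returns its value.
FIXED_TIMESTAMP_PARAM = "SERVICE_FIXED_TIMESTAMP"

# One format shared by every log producer (assumed here: ISO 8601 in UTC).
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def timestamp(config=None):
    """Return the date and time stamp string used in log entries.

    If the configuration parameter is set, return it verbatim so that
    every call yields the same value; otherwise format the current UTC time.
    """
    config = os.environ if config is None else config
    fixed = config.get(FIXED_TIMESTAMP_PARAM)
    if fixed:
        return fixed
    return datetime.now(timezone.utc).strftime(TIMESTAMP_FORMAT)


def log_line(severity, message, config=None):
    """Build a log entry; all timestamps come from timestamp() above."""
    return f"{timestamp(config)} {severity} {message}"
```

With the parameter set to a fixed string, two runs of the same scenario should produce identical log text. Because every caller goes through timestamp(), the same module is also a natural place to enforce a common date and time format across components.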

Author information

Authors and Affiliations

AT&T Labs Research, Middletown, NJ, USA
Alan L. Glasser (Distinguished Member of Technical Staff)

Editor information

Editors and Affiliations

AT&T Labs Research, Park Ave. 180, Florham Park, 07932, USA
Charles R. Kalmanek
School of Information Technology, Indian Institute of Technology Kharagpur, Kharagpur, 721302, India
Sudip Misra
Dept. Computer Science, Yale University, Prospect St. 51, New Haven, 06511, USA
Yang (Richard) Yang

About this chapter

Cite this chapter

Glasser, A.L. (2010). Building Large-Scale, Reliable Network Services. In: Kalmanek, C., Misra, S., Yang, Y. (eds) Guide to Reliable Internet Services and Applications. Computer Communications and Networks. Springer, London. https://doi.org/10.1007/978-1-84882-828-5_15

DOI: https://doi.org/10.1007/978-1-84882-828-5_15
Published: 25 January 2010
Publisher Name: Springer, London
Print ISBN: 978-1-84882-827-8
Online ISBN: 978-1-84882-828-5
eBook Packages: Computer Science, Computer Science (R0)
