Abstract
Proper capacity/performance engineering is critical to the success of developing and deploying any complex networked application. In this chapter, we discuss the typical capacity, performance, reliability, and scalability engineering activities required to deploy a networked service platform. These activities begin at the earliest stages and span the entire platform life cycle: from architecture, design, and development, through service test and deployment, to ongoing capacity management. The goal of this chapter is not to present an exhaustive “how to” manual, but rather to highlight areas where proper capacity/performance engineering is especially critical to success. We use an ISP e-mail platform as a unifying case study to illustrate many of these tasks. This chapter covers the following topics:

- Architecture Assessment – elements, transactions, flows, and bottlenecks
- Workload Assessment – workload, requirements, budgeting, and estimation
- Availability/Reliability Assessment – modeling and failure-mode analysis
- Capacity/Performance Assessment – measurement, modeling, and overload
- Scalability Assessment – demand projections, modeling, and engineering rules
- Capacity/Performance Management – monitoring, growth, and automation
- Capacity/Performance Engineering – “best practice” principles
Notes
1. The term “capacity/performance engineering” in the chapter title and throughout this chapter broadly refers to the expansive set of activities required to assess and manage platform capacity, performance, availability, reliability, and scalability.
2. This Markovian property results from the memoryless nature of the exponential distribution, and is referred to as Poisson Arrivals See Time Averages (PASTA).
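The memoryless property behind this note can be checked numerically: for an exponential random variable X, P(X > s + t | X > s) equals P(X > t). The following sketch (with an illustrative rate and sample size, not taken from the chapter) estimates both probabilities by simulation:

```python
import random

random.seed(42)
rate = 1.0          # illustrative arrival rate (lambda)
n = 200_000
samples = [random.expovariate(rate) for _ in range(n)]

s, t = 0.5, 1.0
# Unconditional tail probability P(X > t)
p_tail = sum(x > t for x in samples) / n
# Conditional tail probability P(X > s + t | X > s)
survivors = [x for x in samples if x > s]
p_cond = sum(x > s + t for x in survivors) / len(survivors)

print(round(p_tail, 3), round(p_cond, 3))
```

With rate = 1.0 both estimates land near e⁻¹ ≈ 0.368; for a non-exponential holding-time distribution the two estimates would diverge, which is why the memoryless assumption should be validated before it is relied upon.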
3. The coefficient of variation (CV) is a normalized measure of dispersion of a distribution, defined as the ratio of the standard deviation σ to the mean μ (CV = σ ∕ μ).
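As defined above, the CV is straightforward to compute from measured data. A minimal sketch (the service times below are hypothetical, for illustration only):

```python
from statistics import mean, pstdev

# Hypothetical per-transaction service times (seconds)
service_times = [0.12, 0.10, 0.45, 0.08, 0.22, 0.15, 0.95, 0.11]

# CV = sigma / mu (population standard deviation over mean)
cv = pstdev(service_times) / mean(service_times)
print(f"CV = {cv:.2f}")
```

An exponential distribution has CV = 1; a measured CV well above 1 signals heavier-than-exponential variability, which affects which queueing approximations are appropriate.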
4. In reality, ISPs typically support multiple applications in addition to e-mail (e.g., newsgroups and web hosting). These applications typically share physical resources, either through virtualization, common transactions (e.g., authentication), or shared infrastructure (e.g., LANs). For the purpose of illustrating the C/PE tasks, we assume that all physical resources are dedicated to the single e-mail application. In the case of resource sharing/virtualization, the C/PE analysis must account for the impact of additional workload, reduced resource availability, and contention.
5. This expression results from a BoE model for delay W reviewed in Section 16.2.
6. As discussed in Section 16.2, both analytic modeling and practical experience suggest that the average delay for user-initiated jobs with common code execution is typically one-third to half of the 95th percentile delay. As part of the budgeting exercise, we can perform sensitivity analyses around this 95th-percentile-to-mean assumption.
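The one-third-to-half rule of thumb in note 6 can be sanity-checked for simple delay distributions. For an exponential delay, the 95th percentile is exactly −ln(0.05) ≈ 3.0 times the mean (so the mean is about one-third of the 95th percentile); for a lower-variability Erlang-2 delay, the ratio falls toward 2. A sketch (illustrative distributions, not the chapter's model):

```python
import math
import random

random.seed(1)

# Exponential delay: p95 / mean = -ln(0.05), independent of the rate
exp_ratio = -math.log(0.05)

# Erlang-2 delay (sum of two i.i.d. exponentials): lower CV, smaller ratio
n = 100_000
erlang2 = sorted(random.expovariate(1.0) + random.expovariate(1.0)
                 for _ in range(n))
mean2 = sum(erlang2) / n
p95_2 = erlang2[int(0.95 * n)]
erlang2_ratio = p95_2 / mean2

print(round(exp_ratio, 2), round(erlang2_ratio, 2))
```

Both ratios fall in the 2-to-3 range, consistent with budgeting the mean delay at one-third to one-half of the 95th percentile target.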
Abbreviations
- ACL: access control list
- AS/V: anti-spam/virus filtering server
- BH: busy hour
- B5M: busy 5 min.
- BoE: back-of-the-envelope
- C/PE: capacity/performance engineering
- DMoQ: direct measure of quality
- DPM: defects per million
- DSL: digital subscriber line
- DT: downtime
- FIFO: first-in-first-out
- FIT: fault insertion testing
- FMEA: failure modes and effects analysis
- FTP: File Transfer Protocol
- FTTH: fiber-to-the-home
- GW: IB SMTP Gateway server
- HT: headroom threshold
- HTTP: Hypertext Transfer Protocol
- HTTPS: Secure HTTP
- HW: hardware
- IMAP: Internet Message Access Protocol
- IB: inbound
- i.i.d.: independent identically distributed
- I/O: input/output
- ISP: Internet service provider
- LAN: local area network
- LIFO: last-in-first-out
- MIB: management information base
- MR: OB Mail Relay server
- MRA: modification request analysis
- MTTF: mean-time-to-failure
- MTTR: mean-time-to-restore
- NAS: network attached storage
- NFS: network file system
- OB: outbound
- PO: Post Office server
- POP: Post Office Protocol
- PP: POP Proxy server
- PS: processor-sharing
- RBD: reliability block diagram
- SAN: storage area network
- SLA: service-level agreement
- SLO: service-level objective
- SNMP: Simple Network Management Protocol
- SPoF: single point of failure
- SRE: software reliability engineering
- SMTP: Simple Mail Transfer Protocol
- tps: transactions per second
- VIP: virtual IP address (aka VLAN)
- WM: WebMail server
Copyright information
© 2010 Springer-Verlag London
Cite this chapter
Reeser, P. (2010). Capacity and Performance Engineering for Networked Application Servers: A Case Study in E-mail Platform Planning. In: Kalmanek, C., Misra, S., Yang, Y. (eds) Guide to Reliable Internet Services and Applications. Computer Communications and Networks. Springer, London. https://doi.org/10.1007/978-1-84882-828-5_16
Publisher Name: Springer, London
Print ISBN: 978-1-84882-827-8
Online ISBN: 978-1-84882-828-5