Abstract
Datacenter downtime has become a major concern because every minute of downtime means money lost. An unplanned outage can easily cost a datacenter $8,000 per minute, and costs can reach $16,000 per minute. Most unplanned failures are attributed to power system failure and human error. Hardware failures, such as server failures, account for only about 4–5% of unplanned downtime, but they are often much more difficult and costly to recover from. As a result, despite their low rate of occurrence, unplanned datacenter outages caused by server failures incur the highest costs of any root cause, as seen in the first figure of this chapter. This motivates the work in this chapter, in which we develop a framework for reducing such hardware failures subject to performance constraints.
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Tan, S., Tahoori, M., Kim, T., Wang, S., Sun, Z., Kiamehr, S. (2019). Cross-Layer DRM and Optimization for Datacenter Systems. In: Long-Term Reliability of Nanometer VLSI Systems. Springer, Cham. https://doi.org/10.1007/978-3-030-26172-6_12
DOI: https://doi.org/10.1007/978-3-030-26172-6_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26171-9
Online ISBN: 978-3-030-26172-6
eBook Packages: Physics and Astronomy (R0)