Cross-Layer DRM and Optimization for Datacenter Systems

Long-Term Reliability of Nanometer VLSI Systems

Abstract

Datacenter downtime has become a major concern because every minute of downtime costs money. An unplanned outage can easily cost a datacenter $8,000 per minute, and costs can reach $16,000 per minute. Most unplanned failures are attributed to power system failure and human error. Hardware failures, such as server failures, account for only about 4–5% of unplanned downtime. However, these failures are often much more difficult and costly to recover from. As a result, despite their low rate of occurrence, unplanned datacenter outages caused by server failures incur the highest costs of any root cause, as seen in the first figure of this chapter. This motivates the work in this chapter, in which we develop a framework for reducing such hardware failures subject to performance constraints.




© 2019 Springer Nature Switzerland AG


Cite this chapter

Tan, S., Tahoori, M., Kim, T., Wang, S., Sun, Z., Kiamehr, S. (2019). Cross-Layer DRM and Optimization for Datacenter Systems. In: Long-Term Reliability of Nanometer VLSI Systems. Springer, Cham. https://doi.org/10.1007/978-3-030-26172-6_12
