Failure Data-Driven Selective Node-Level Duplication to Improve MTTF in High Performance Computing Systems

  • Conference paper
High Performance Computing Systems and Applications (HPCS 2009)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5976)

Abstract

This paper presents our analysis of the failure behavior of large-scale systems using the failure logs collected by Los Alamos National Laboratory on 22 of its computing clusters. We note that not all nodes in these systems show similar failure behavior. Our objective, therefore, was to arrive at an ordering of nodes to be incrementally (one by one) selected for duplication so as to achieve a target mean time to failure (MTTF) for the system while duplicating the fewest nodes. We arrived at a model for the fault coverage provided by duplicating each node and ordered the nodes according to the coverage each provides. Compared with the traditional approach of randomly choosing nodes for duplication, our model-driven approach provides improvements ranging from 82% to 1700%, depending on the targeted improvement in MTTF and the failure distribution of the nodes in the system.
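
The incremental selection described in the abstract can be illustrated with a small sketch. It is not the paper's coverage model, which the abstract does not spell out. It assumes, purely for illustration, that each node has a constant failure rate, that the system fails when any non-duplicated node fails (so the system MTTF is the reciprocal of the summed rates), that a duplicated node no longer contributes failures, and that a node's failure rate stands in for the coverage gained by duplicating it. The function names and the example rates are hypothetical.

```python
def system_mttf(rates, duplicated):
    """MTTF of a series system: 1 / (sum of rates of non-duplicated nodes).

    Assumption (not from the paper): a duplicated node never causes a
    system failure, so its rate is dropped from the sum.
    """
    remaining = sum(r for i, r in enumerate(rates) if i not in duplicated)
    return float("inf") if remaining == 0 else 1.0 / remaining


def rank_by_failure_rate(rates):
    """Order node indices from highest to lowest failure rate.

    The failure rate is used here as a stand-in for the per-node coverage
    that the paper models.
    """
    return sorted(range(len(rates)), key=lambda i: rates[i], reverse=True)


def nodes_needed(rates, order, target_mttf):
    """Duplicate nodes one by one in the given order until the target MTTF
    is reached, and return how many nodes were duplicated."""
    duplicated = set()
    for node in order:
        if system_mttf(rates, duplicated) >= target_mttf:
            break
        duplicated.add(node)
    return len(duplicated)


# Hypothetical failure rates (failures per year) for six nodes;
# node 0 fails far more often than the rest.
rates = [5.0, 1.0, 1.0, 1.0, 1.0, 1.0]
target = 0.15  # target system MTTF in years (baseline is 1/10 = 0.1)

ranked = rank_by_failure_rate(rates)
arbitrary = [1, 2, 3, 4, 5, 0]  # one unlucky order that leaves node 0 for last

print(nodes_needed(rates, ranked, target))     # ranked order
print(nodes_needed(rates, arbitrary, target))  # arbitrary order
```

With these made-up rates, the ranked order reaches the target after duplicating a single node, whereas the arbitrary order shown needs four. The gap depends entirely on how skewed the per-node failure rates are, which is consistent with the abstract's remark that the benefit varies with the failure distribution of the nodes.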

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nakka, N., Choudhary, A. (2010). Failure Data-Driven Selective Node-Level Duplication to Improve MTTF in High Performance Computing Systems. In: Mewhort, D.J.K., Cann, N.M., Slater, G.W., Naughton, T.J. (eds) High Performance Computing Systems and Applications. HPCS 2009. Lecture Notes in Computer Science, vol 5976. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12659-8_23

  • DOI: https://doi.org/10.1007/978-3-642-12659-8_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12658-1

  • Online ISBN: 978-3-642-12659-8

  • eBook Packages: Computer Science, Computer Science (R0)