Failure Data-Driven Selective Node-Level Duplication to Improve MTTF in High Performance Computing Systems

Nakka, Nithin; Choudhary, Alok

doi:10.1007/978-3-642-12659-8_23

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5976)

Included in the following conference series:

International Symposium on High Performance Computing Systems and Applications

Abstract

This paper presents our analysis of the failure behavior of large-scale systems using the failure logs collected by Los Alamos National Laboratory on 22 of their computing clusters. We note that not all nodes in these systems show similar failure behavior. Our objective, therefore, was to arrive at an ordering of nodes to be selected incrementally (one by one) for duplication, so as to achieve a target MTTF for the system while duplicating the fewest nodes. We arrived at a model for the fault coverage provided by duplicating each node and ordered the nodes according to the coverage each provides. Compared with the traditional approach of randomly choosing nodes for duplication, our model-driven approach provides improvements ranging from 82% to 1700%, depending on the targeted improvement in MTTF and the failure distribution of the nodes in the system.
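The selection idea in the abstract can be illustrated with a minimal sketch. The paper's actual coverage model is not given in this abstract, so the sketch swaps in a simple stand-in and should not be read as the authors' method. It assumes that each node has a constant failure rate estimated from its log, that the system fails whenever any non-duplicated node fails (so system MTTF is the reciprocal of the summed rates), and that duplicating a node removes its contribution entirely. The function names and the failure rates are hypothetical.

```python
import random


def system_mttf(rates, duplicated):
    """MTTF of a series system; duplicated nodes are assumed never to fail it."""
    residual = sum(r for i, r in enumerate(rates) if i not in duplicated)
    return float("inf") if residual == 0 else 1.0 / residual


def nodes_needed(rates, order, target_mttf):
    """Duplicate nodes one by one in `order`; return how many reach the target."""
    duplicated = set()
    if system_mttf(rates, duplicated) >= target_mttf:
        return 0
    for count, node in enumerate(order, start=1):
        duplicated.add(node)
        if system_mttf(rates, duplicated) >= target_mttf:
            return count
    return None  # target not reached with the nodes listed in `order`


def ranked_order(rates):
    """Order nodes by assumed coverage: highest failure rate first."""
    return sorted(range(len(rates)), key=lambda i: rates[i], reverse=True)


# Hypothetical per-node failure rates (failures per 1000 hours), not LANL data.
rates = [50, 10, 10, 10, 10, 10, 5, 5]
target = 2 * system_mttf(rates, set())  # aim to double the baseline MTTF

random_order = list(range(len(rates)))
random.Random(0).shuffle(random_order)  # seeded so the run is repeatable

print("ranked selection:", nodes_needed(rates, ranked_order(rates), target))
print("random selection:", nodes_needed(rates, random_order, target))
```

Under these assumptions the ranked order never needs more duplicated nodes than any other order, because the k highest-rate nodes remove the largest possible share of the total failure rate for every k. How large the gap to random selection is depends on how skewed the per-node failure rates are, which is consistent with the abstract's remark that the improvement depends on the failure distribution.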

Author information

Authors and Affiliations

Department of Electrical Engineering and Computer Science, Northwestern University, 2145 Sheridan Rd, Tech Inst. Bldg., EECS Dept., Evanston, IL, 60201
Nithin Nakka & Alok Choudhary

Editor information

Editors and Affiliations

Dept. of Psychology, Queen’s University, 62 Arch St, K7L 3N6, Kingston, Ontario, Canada
Douglas J. K. Mewhort
Dept. of Chemistry, Queen’s University, Chernoff Hall, K7L 3N6, Kingston, Ontario, Canada
Natalie M. Cann
University of Ottawa, Hagen Hall, 115 Séraphin-Marion, K1N 6N5, Ottawa, Ontario, Canada
Gary W. Slater
Oak Ridge National Laboratory, 1 Bethel Valley Road, Bldg. 5100, MS-6173, Oak Ridge, 37831-6173, TN, USA
Thomas J. Naughton

About this paper

Cite this paper

Nakka, N., Choudhary, A. (2010). Failure Data-Driven Selective Node-Level Duplication to Improve MTTF in High Performance Computing Systems. In: Mewhort, D.J.K., Cann, N.M., Slater, G.W., Naughton, T.J. (eds) High Performance Computing Systems and Applications. HPCS 2009. Lecture Notes in Computer Science, vol 5976. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12659-8_23

DOI: https://doi.org/10.1007/978-3-642-12659-8_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12658-1
Online ISBN: 978-3-642-12659-8
eBook Packages: Computer Science, Computer Science (R0)
