Abstract
This paper presents our analysis of the failure behavior of large-scale systems, using the failure logs collected by Los Alamos National Laboratory on 22 of its computing clusters. We note that not all nodes in these systems show similar failure behavior. Our objective, therefore, was to arrive at an ordering in which nodes are selected incrementally (one by one) for duplication, so that a target MTTF for the system is reached after duplicating the fewest nodes. We developed a model for the fault coverage provided by duplicating each node and ordered the nodes according to that coverage. Compared with the traditional approach of randomly choosing nodes for duplication, our model-driven approach provides improvements ranging from 82% to 1700%, depending on the targeted improvement in MTTF and the failure distribution of the nodes in the system.
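The selection procedure described above can be illustrated with a minimal sketch. It is not the paper's coverage model: it assumes, purely for illustration, that each node has a constant failure rate, that the system fails whenever any non-duplicated node fails, and that duplicating a node masks all of that node's failures. The function names and the failure rates are hypothetical.

```python
import random


def mttf(rates, duplicated):
    """System MTTF under the illustrative assumption that only
    non-duplicated nodes contribute to the system failure rate."""
    residual = sum(r for i, r in enumerate(rates) if i not in duplicated)
    return float("inf") if residual == 0 else 1.0 / residual


def coverage_order(rates):
    """Order node indices by failure rate, highest first. The failure rate
    stands in here for the coverage gained by duplicating the node."""
    return sorted(range(len(rates)), key=lambda i: rates[i], reverse=True)


def random_order(rates, seed=0):
    """Baseline: a random ordering of node indices (seeded for repeatability)."""
    order = list(range(len(rates)))
    random.Random(seed).shuffle(order)
    return order


def select_nodes(rates, target_mttf, order):
    """Duplicate nodes one by one in the given order until the target is met."""
    duplicated = set()
    for node in order:
        if mttf(rates, duplicated) >= target_mttf:
            break
        duplicated.add(node)
    return duplicated


# Hypothetical failure rates (failures per year); node 0 fails far more often.
rates = [8, 2, 2, 2, 2]
target = 0.125  # years; double the undisturbed MTTF of 1/16 year

guided = select_nodes(rates, target, coverage_order(rates))
unlucky = select_nodes(rates, target, [1, 2, 3, 4, 0])
print(len(guided), len(unlucky))
```

In this toy setting the rate-ordered selection meets the target after duplicating one node, whereas an ordering that happens to reach the failure-prone node last needs four. The gap depends entirely on how skewed the assumed failure rates are, which mirrors the abstract's remark that the improvement depends on the failure distribution of the nodes.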
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nakka, N., Choudhary, A. (2010). Failure Data-Driven Selective Node-Level Duplication to Improve MTTF in High Performance Computing Systems. In: Mewhort, D.J.K., Cann, N.M., Slater, G.W., Naughton, T.J. (eds) High Performance Computing Systems and Applications. HPCS 2009. Lecture Notes in Computer Science, vol 5976. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12659-8_23
DOI: https://doi.org/10.1007/978-3-642-12659-8_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12658-1
Online ISBN: 978-3-642-12659-8
eBook Packages: Computer Science (R0)