Highly Reliable Linux HPC Clusters: Self-Awareness Approach

Leangsuksun, Chokchai; Liu, Tong; Liu, Yudan; Scott, Stephen L.; Libby, Richard; Haddad, Ibrahim

doi:10.1007/978-3-540-30566-8_27

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3358)

Included in the following conference series:

International Symposium on Parallel and Distributed Processing and Applications

Abstract

Current fault-tolerance solutions for HPC systems focus on dealing with the result of a failure. However, most cannot handle runtime system configuration changes caused by transient failures and instead require a complete restart of the entire machine. The recently released HA-OSCAR software stack is one effort making inroads here. This paper discusses detailed solutions for enhancing the high availability and serviceability of clusters with HA-OSCAR via multi-head-node failover and a service-level fault tolerance mechanism. Our solution employs self-configuration and introduces Adaptive Self Healing (ASH) techniques. An analysis of HA-OSCAR availability improvement was also conducted with various sensitivity factors. Finally, the paper details the system layering strategy, dependability modeling, and analysis of an actual experimental system using a Petri net-based model, the Stochastic Reward Net (SRN).
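
To make the availability argument concrete, the following minimal sketch compares a single head node with redundant head nodes using the textbook steady-state formula A = MTTF / (MTTF + MTTR). It is not the paper's SRN model: it assumes independent failures and instantaneous, perfect failover, and the function names and the MTTF/MTTR figures are hypothetical values chosen only for illustration.

```python
# Illustrative sketch only; this is NOT the SRN model used in the paper.
# Assumptions: head-node failures are independent and failover is
# instantaneous and always succeeds. All names and numbers are hypothetical.

HOURS_PER_YEAR = 8760.0


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time one node is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def redundant_availability(single: float, n_heads: int) -> float:
    """Service is up unless all n_heads head nodes are down at once."""
    return 1.0 - (1.0 - single) ** n_heads


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical head node: fails every 5000 h on average, 4 h to repair.
    one_head = steady_state_availability(5000.0, 4.0)
    two_heads = redundant_availability(one_head, 2)
    print(f"single head: A={one_head:.6f}, "
          f"downtime={annual_downtime_hours(one_head):.2f} h/yr")
    print(f"dual head:   A={two_heads:.8f}, "
          f"downtime={annual_downtime_hours(two_heads):.4f} h/yr")
```

A real analysis would also have to account for failover delay, imperfect failure detection, and repair policies, which is the kind of detail a stochastic Petri net model can capture and this closed-form sketch cannot.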

Author information

Authors and Affiliations

Computer Science Department, Louisiana Tech University,
Chokchai Leangsuksun & Yudan Liu
Enterprise Platforms Group, Dell Corp.,
Tong Liu
Oak Ridge National Laboratory,
Stephen L. Scott
Intel Corporation,
Richard Libby
Ericsson Research,
Ibrahim Haddad

Editor information

Editors and Affiliations

Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong, China
Jiannong Cao
Department of Computer Science, St. Francis Xavier University, Antigonish, Canada
Laurence T. Yang
Department of Computer Science and Engineering, Shanghai Jiao Tong University, 200030, Shanghai, China
Minyi Guo
Department of Computer Science, The University of Hong Kong, Pokfulam
Francis Lau

About this paper

Cite this paper

Leangsuksun, C., Liu, T., Liu, Y., Scott, S.L., Libby, R., Haddad, I. (2004). Highly Reliable Linux HPC Clusters: Self-Awareness Approach. In: Cao, J., Yang, L.T., Guo, M., Lau, F. (eds) Parallel and Distributed Processing and Applications. ISPA 2004. Lecture Notes in Computer Science, vol 3358. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30566-8_27

DOI: https://doi.org/10.1007/978-3-540-30566-8_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24128-7
Online ISBN: 978-3-540-30566-8
eBook Packages: Computer Science (R0)
