Abstract
Current fault-tolerance solutions for HPC systems focus on dealing with the result of a failure. However, most cannot handle runtime system configuration changes caused by transient failures, and they require a complete restart of the entire machine. The recently released HA-OSCAR software stack is one effort making inroads on this problem. This paper details how HA-OSCAR enhances the high availability and serviceability of clusters through multi-head-node failover and a service-level fault-tolerance mechanism. Our solution employs self-configuration and introduces Adaptive Self Healing (ASH) techniques. We also analyze the availability improvement provided by HA-OSCAR under various sensitivity factors. Finally, the paper presents the system layering strategy, the dependability modeling, and an analysis of an actual experimental system using a Petri net-based model, the Stochastic Reward Net (SRN).
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Leangsuksun, C., Liu, T., Liu, Y., Scott, S.L., Libby, R., Haddad, I. (2004). Highly Reliable Linux HPC Clusters: Self-Awareness Approach. In: Cao, J., Yang, L.T., Guo, M., Lau, F. (eds) Parallel and Distributed Processing and Applications. ISPA 2004. Lecture Notes in Computer Science, vol 3358. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30566-8_27
DOI: https://doi.org/10.1007/978-3-540-30566-8_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24128-7
Online ISBN: 978-3-540-30566-8
eBook Packages: Computer Science, Computer Science (R0)