Highly Reliable Linux HPC Clusters: Self-Awareness Approach

  • Conference paper
Parallel and Distributed Processing and Applications (ISPA 2004)

Abstract

Current fault-tolerance solutions for HPC systems focus on dealing with the consequences of a failure. However, most cannot handle runtime system configuration changes caused by transient failures and instead require a complete restart of the entire machine. The recently released HA-OSCAR software stack is one effort making inroads on this problem. This paper discusses in detail how HA-OSCAR enhances the high availability and serviceability of clusters through multi-head-node failover and a service-level fault tolerance mechanism. Our solution employs self-configuration and introduces Adaptive Self Healing (ASH) techniques. An analysis of the HA-OSCAR availability improvement was also conducted with various sensitivity factors. Finally, the paper details the system layering strategy, the dependability modeling, and the analysis of an actual experimental system using a Petri net-based model, the Stochastic Reward Net (SRN).
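
As a rough illustration of why head-node redundancy improves availability, the sketch below compares a single head node with a redundant pair. It is not the SRN model used in the paper: it assumes the two head nodes fail and are repaired independently, that failover is instantaneous and always succeeds, and it uses hypothetical MTTF and MTTR values chosen only for the example.

```python
# Simplified steady-state availability sketch (NOT the paper's SRN model).
# Assumptions: independent head-node failures, perfect instantaneous failover,
# and hypothetical MTTF/MTTR figures.

HOURS_PER_YEAR = 8760.0


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one repairable component."""
    return mttf_hours / (mttf_hours + mttr_hours)


def redundant_pair_availability(a_single: float) -> float:
    """Availability of two independent head nodes when either one suffices."""
    return 1.0 - (1.0 - a_single) ** 2


def downtime_hours_per_year(a: float) -> float:
    """Expected unavailable hours per year for availability a."""
    return (1.0 - a) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    single = availability(mttf_hours=5000.0, mttr_hours=4.0)
    pair = redundant_pair_availability(single)
    print(f"single head node: A={single:.6f}, "
          f"downtime={downtime_hours_per_year(single):.2f} h/year")
    print(f"redundant pair:   A={pair:.8f}, "
          f"downtime={downtime_hours_per_year(pair):.4f} h/year")
```

A real failover system has imperfect failure detection and a nonzero switchover time, so this parallel-redundancy formula is an optimistic upper bound; capturing those effects is what a state-based model such as an SRN is for.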

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Leangsuksun, C., Liu, T., Liu, Y., Scott, S.L., Libby, R., Haddad, I. (2004). Highly Reliable Linux HPC Clusters: Self-Awareness Approach. In: Cao, J., Yang, L.T., Guo, M., Lau, F. (eds) Parallel and Distributed Processing and Applications. ISPA 2004. Lecture Notes in Computer Science, vol 3358. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30566-8_27

  • DOI: https://doi.org/10.1007/978-3-540-30566-8_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24128-7

  • Online ISBN: 978-3-540-30566-8

  • eBook Packages: Computer Science, Computer Science (R0)
