A Case for Virtual Machine Based Fault Injection in a High-Performance Computing Environment

Naughton, Thomas; Vallée, Geoffroy; Engelmann, Christian; Scott, Stephen L.

doi:10.1007/978-3-642-29737-3_27

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7155)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

Large-scale computing platforms provide tremendous capabilities for scientific discovery. As applications and system software scale up to multi-petaflops and beyond to exascale platforms, failures will become much more common. This has given rise to a push in fault-tolerance and resilience research for high-performance computing (HPC) systems. This includes work on log analysis to identify types of failures, enhancements to the Message Passing Interface (MPI) to incorporate fault awareness, and a variety of fault-tolerance mechanisms that span redundant computation, algorithm-based fault tolerance, and advanced checkpoint/restart techniques.

While there is much work to be done on the fault-tolerance (FT)/resilience mechanisms for such large-scale systems, there is also a profound gap in the tools for experimentation. This gap is compounded by the fact that HPC environments have stringent performance requirements and are often highly customized. The tool chains for these systems are often tailored to the platform, and the operating environments typically contain many site- and machine-specific enhancements. Therefore, it is desirable to maintain a consistent execution environment to minimize end-user (scientist) interruption.

The work on system-level virtualization for HPC systems offers a unique opportunity to maintain a consistent execution environment via a virtual machine (VM). Recent work on virtualization for HPC has shown that low-overhead, high-performance systems can be realized [7, 15]. Virtualization also provides a clean abstraction for building experimental tools for investigating the effects of failures in HPC and the related research on FT/resilience mechanisms and policies. In this paper we discuss the motivation for tools to perform fault injection in an HPC context. We also present the design of a new fault injection framework that can leverage virtualization.

ORNL’s work was supported by the U.S. Department of Energy, under Contract DE-AC05-00OR22725.

References

1. Buntinas, D., Bosilca, G., Graham, R.L., Vallée, G., Watson, G.R.: A Scalable Tools Communication Infrastructure. In: Proceedings of the 22nd International High Performance Computing Symposium (HPCS 2008), June 9-11, session track: 6th Annual Symposium on OSCAR and HPC Cluster Systems (OSCAR 2008). IEEE Computer Society (2008), http://www.csm.ornl.gov/oscar08/
2. Gupta, R., Beckman, P., Park, B.H., Lusk, E., Hargrove, P., Geist, A., Lumsdaine, A., Dongarra, J.: Cifts: A coordinated infrastructure for fault-tolerant systems. In: International Conference on Parallel Processing, ICPP (2009)
3. Hoarau, W., Lemarinier, P., Herault, T., Rodriguez, E., Tixeuil, S., Cappello, F.: Fail-mpi: How fault-tolerant is fault-tolerant mpi? In: IEEE International Conference on Cluster Computing, pp. 1–10 (September 2006)
4. Hoarau, W., Tixeuil, S., Vauchelles, F.: Fail-fci: Versatile fault injection. Future Generation Computer Systems 23(7), 913–919 (2007), http://www.sciencedirect.com/science/article/pii/S0167739X07000209
5. Hsueh, M.C., Tsai, T.K., Iyer, R.K.: Fault injection techniques and tools. Computer 30(4), 75–82 (1997)
6. Carreira, J., Madeira, H., Silva, J.G.: Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers. IEEE Transactions on Software Engineering 24(2) (February 1998), http://www.xception.org/files/IEEETSE98.pdf
7. Lange, J., Pedretti, K., Hudson, T., Dinda, P., Cui, Z., Xia, L., Bridges, P., Jaconette, S., Levenhagen, M., Brightwell, R., Widener, P.: Palacios and Kitten: High Performance Operating Systems For Scalable Virtualized and Native Supercomputing. Tech. Rep. NWU-EECS-09-14, Northwestern University, July 20 (2009), http://v3vee.org/papers/NWU-EECS-09-14.pdf
8. Le, M., Gallagher, A., Tamir, Y.: Challenges and Opportunities with Fault Injection in Virtualized Systems. In: First International Workshop on Virtualization Performance: Analysis, Characterization, and Tools, Austin, Texas, USA (April 2008), http://www.cs.ucla.edu/~tamir/papers/vpact08.pdf
9. Marinescu, P.D., Candea, G.: LFI: A Practical and General Library-Level Fault Injector. In: Proceedings of the 39th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2009), June 29 - July 2. IEEE (2009), http://dslab.epfl.ch/pubs/lfi/index.html
10. Potyra, S., Sieh, V., Cin, M.D.: Evaluating fault-tolerant system designs using FAUmachine. In: Proceedings of the 2007 Workshop on Engineering Fault Tolerant Systems (EFTS 2007), p. 9. ACM, New York (2007)
11. Silva, J.G., Carreira, J., Madeira, H., Costa, D., Moreira, F.: Experimental assessment of parallel systems. In: Proceedings of the 26th Annual International Symposium on Fault-Tolerant Computing (FTCS 1996), June 25-27, pp. 415–424 (1996)
12. Stott, D.T., Floering, B., Burke, D., Kalbarczyk, Z., Iyer, R.K.: NFTAPE: A framework for assessing dependability in distributed systems with lightweight fault injectors. In: Proceedings of the 4th IEEE International Computer Performance and Dependability Symposium (IPDS), pp. 91–100. IEEE (March 2000)
13. Süßkraut, M., Creutz, S., Fetzer, C.: Fast fault injection with virtual machines (fast abstract). In: Supplement of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2007) (June 2007), http://wwwse.inf.tu-dresden.de/papers/preprint-suesskraut2007DSNb.pdf
14. Vallée, G., Naughton, T., Scott, S.L.: System Management Software for Virtual Environments. In: Proceedings of the ACM International Conference on Computing Frontiers (CF 2007), Ischia, Italy, May 7-9 (2007)
15. Youseff, L., Seymour, K., You, H., Dongarra, J., Wolski, R.: The impact of paravirtualized memory hierarchy on linear algebra computational kernels and software. In: Proceedings of the 17th International Symposium on High Performance Distributed Computing (HPDC 2008), pp. 141–152. ACM, New York (2008)

Author information

Authors and Affiliations

Oak Ridge National Laboratory, Computer Science and Mathematics Division, Oak Ridge, TN, 37831, USA
Thomas Naughton, Geoffroy Vallée, Christian Engelmann & Stephen L. Scott

Editor information

Editors and Affiliations

Scilytics, Koellnerhofgasse 3/15A, 1010, Vienna, Austria
Michael Alexander
ICAR-CNR, Via P. Castellino, 111, 80131, Napoli, Italy
Pasqua D’Ambra
University of Amsterdam, 1090, Amsterdam, Netherlands
Adam Belloum
Innovative Computing Laboratory, The University of Tennessee, USA
George Bosilca
Department of Experimental Medicine and Clinic, University Magna Græcia, 88100, Catanzaro, Italy
Mario Cannataro
Computer Science Department, University of Pisa, Italy
Marco Danelutto
Second University of Naples, Italy
Beniamino Di Martino
TU München, Boltzmannstr. 3, 85748, Garching, Germany
Michael Gerndt
Equipe Runtime, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France
Emmanuel Jeannot & Raymond Namyst
Equipe HIEPACS, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France
Jean Roman
Oak Ridge National Laboratory, Computer Science and Mathematics Division, 37831-6164, Oak Ridge, TN, USA
Stephen L. Scott
Department of Scientific Computing, University of Vienna, Nordbergstr. 15/3C, 1090, Vienna, Austria
Jesper Larsson Träff
Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831, Oak Ridge, TN, USA
Geoffroy Vallée
Technische Universität München, Germany
Josef Weidendorfer

Cite this paper

Naughton, T., Vallée, G., Engelmann, C., Scott, S.L. (2012). A Case for Virtual Machine Based Fault Injection in a High-Performance Computing Environment. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7155. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29737-3_27

DOI: https://doi.org/10.1007/978-3-642-29737-3_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29736-6
Online ISBN: 978-3-642-29737-3
eBook Packages: Computer Science, Computer Science (R0)
