Abstract
Large-scale computing platforms provide tremendous capabilities for scientific discovery. As applications and system software scale up to multi-petaflops and beyond to exascale platforms, failures will become much more common. This has given rise to a push in fault-tolerance and resilience research for high-performance computing (HPC) systems. This research includes work on log analysis to identify types of failures, enhancements to the Message Passing Interface (MPI) to incorporate fault awareness, and a variety of fault-tolerance mechanisms that span redundant computation, algorithm-based fault tolerance, and advanced checkpoint/restart techniques.
While there is much work to be done on the fault-tolerance and resilience (FT/Resilience) mechanisms for such large-scale systems, there is also a profound gap in the tools for experimentation. This gap is compounded by the fact that HPC environments have stringent performance requirements and are often highly customized. The tool chains for these systems are often tailored to the platform, and the operating environments typically contain many site- and machine-specific enhancements. Therefore, it is desirable to maintain a consistent execution environment to minimize interruption to end users (scientists).
The work on system-level virtualization for HPC systems offers a unique opportunity to maintain a consistent execution environment via a virtual machine (VM). Recent work on virtualization for HPC has shown that low-overhead, high-performance systems can be realized [7, 15]. Virtualization also provides a clean abstraction for building experimental tools to investigate the effects of failures in HPC and the related research on FT/Resilience mechanisms and policies. In this paper we discuss the motivation for tools to perform fault injection in an HPC context. We also present the design of a new fault injection framework that can leverage virtualization.
ORNL’s work was supported by the U.S. Department of Energy, under Contract DE-AC05-00OR22725.
References
Buntinas, D., Bosilca, G., Graham, R.L., Vallée, G., Watson, G.R.: A Scalable Tools Communication Infrastructure. In: Proceedings of the 22nd International High Performance Computing Symposium (HPCS 2008), June 9-11, session track: 6th Annual Symposium on OSCAR and HPC Cluster Systems (OSCAR 2008). IEEE Computer Society (2008), http://www.csm.ornl.gov/oscar08/
Gupta, R., Beckman, P., Park, B.H., Lusk, E., Hargrove, P., Geist, A., Lumsdaine, A., Dongarra, J.: Cifts: A coordinated infrastructure for fault-tolerant systems. In: International Conference on Parallel Processing, ICPP (2009)
Hoarau, W., Lemarinier, P., Herault, T., Rodriguez, E., Tixeuil, S., Cappello, F.: Fail-mpi: How fault-tolerant is fault-tolerant mpi? In: IEEE International Conference on Cluster Computing, pp. 1–10 (September 2006)
Hoarau, W., Tixeuil, S., Vauchelles, F.: Fail-fci: Versatile fault injection. Future Generation Computer Systems 23(7), 913–919 (2007), http://www.sciencedirect.com/science/article/pii/S0167739X07000209
Hsueh, M.C., Tsai, T.K., Iyer, R.K.: Fault injection techniques and tools. Computer 30(4), 75–82 (1997)
Carreira, J., Madeira, H., Silva, J.G.: Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers. IEEE Transactions on Software Engineering 24(2) (February 1998), http://www.xception.org/files/IEEETSE98.pdf
Lange, J., Pedretti, K., Hudson, T., Dinda, P., Cui, Z., Xia, L., Bridges, P., Jaconette, S., Levenhagen, M., Brightwell, R., Widener, P.: Palacios and Kitten: High Performance Operating Systems For Scalable Virtualized and Native Supercomputing. Tech. Rep. NWU-EECS-09-14, Northwestern University, July 20 (2009), http://v3vee.org/papers/NWU-EECS-09-14.pdf
Le, M., Gallagher, A., Tamir, Y.: Challenges and Opportunities with Fault Injection in Virtualized Systems. In: First International Workshop on Virtualization Performance: Analysis, Characterization, and Tools, Austin, Texas, USA (April 2008), http://www.cs.ucla.edu/~tamir/papers/vpact08.pdf
Marinescu, P.D., Candea, G.: LFI: A Practical and General Library-Level Fault Injector. In: Proceedings of the 39th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2009), June 29 - July 2. IEEE (2009), http://dslab.epfl.ch/pubs/lfi/index.html
Potyra, S., Sieh, V., Cin, M.D.: Evaluating fault-tolerant system designs using FAUmachine. In: Proceedings of the 2007 Workshop on Engineering Fault Tolerant Systems (EFTS 2007), p. 9. ACM, New York (2007)
Silva, J.G., Carreira, J., Madeira, H., Costa, D., Moreira, F.: Experimental assessment of parallel systems. In: Proceedings of the 26th Annual International Symposium on Fault-Tolerant Computing (FTCS 1996), June 25-27, pp. 415–424 (1996)
Stott, D.T., Floering, B., Burke, D., Kalbarczyk, Z., Iyer, R.K.: NFTAPE: A framework for assessing dependability in distributed systems with lightweight fault injectors. In: Proceedings of the 4th IEEE International Computer Performance and Dependability Symposium (IPDS), pp. 91–100. IEEE (March 2000)
Süßkraut, M., Creutz, S., Fetzer, C.: Fast fault injection with virtual machines (fast abstract). In: Supplement of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN2007) (June 2007), http://wwwse.inf.tu-dresden.de/papers/preprint-suesskraut2007DSNb.pdf
Vallée, G., Naughton, T., Scott, S.L.: System Management Software for Virtual Environments. In: Proceedings of the ACM International Conference on Computing Frontiers (CF 2007), Ischia, Italy, May 7-9 (2007)
Youseff, L., Seymour, K., You, H., Dongarra, J., Wolski, R.: The impact of paravirtualized memory hierarchy on linear algebra computational kernels and software. In: Proceedings of the 17th International Symposium on High Performance Distributed Computing (HPDC 2008), pp. 141–152. ACM, New York (2008)
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Naughton, T., Vallée, G., Engelmann, C., Scott, S.L. (2012). A Case for Virtual Machine Based Fault Injection in a High-Performance Computing Environment. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7155. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29737-3_27
DOI: https://doi.org/10.1007/978-3-642-29737-3_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29736-6
Online ISBN: 978-3-642-29737-3
eBook Packages: Computer Science (R0)