A Case for Virtual Machine Based Fault Injection in a High-Performance Computing Environment
Large-scale computing platforms provide tremendous capabilities for scientific discovery. As applications and system software scale up to multi-petaflops and beyond to exascale platforms, failures will become much more common. This has given rise to a push in fault-tolerance and resilience research for high-performance computing (HPC) systems. This includes work on log analysis to identify types of failures, enhancements to the Message Passing Interface (MPI) to incorporate fault awareness, and a variety of fault tolerance mechanisms that span redundant computation, algorithm-based fault tolerance, and advanced checkpoint/restart techniques.
While there is much work to be done on the FT/Resilience mechanisms for such large-scale systems, there is also a profound gap in the tools for experimentation. This gap is compounded by the fact that HPC environments have stringent performance requirements and are often highly customized. The tool chain for these systems is often tailored for the platform, and the operating environments typically contain many site- and machine-specific enhancements. Therefore, it is desirable to maintain a consistent execution environment to minimize end-user (scientist) interruption.
The work on system-level virtualization for HPC systems offers a unique opportunity to maintain a consistent execution environment via a virtual machine (VM). Recent work on virtualization for HPC has shown that low-overhead, high-performance systems can be realized [7, 15]. Virtualization also provides a clean abstraction for building experimental tools to investigate the effects of failures in HPC and the related research on FT/Resilience mechanisms and policies. In this paper we discuss the motivation for tools to perform fault injection in an HPC context. We also present the design of a new fault injection framework that can leverage virtualization.
Keywords: Virtual Machine, Message Passing Interface, System Under Test, Fault Injection, Fault Tolerance Mechanism
- 1. Buntinas, D., Bosilica, G., Graham, R.L., Vallée, G., Watson, G.R.: A Scalable Tools Communication Infrastructure. In: Proceedings of the 22nd International High Performance Computing Symposium (HPCS 2008), June 9-11, session track: 6th Annual Symposium on OSCAR and HPC Cluster Systems (OSCAR 2008). IEEE Computer Society (2008), http://www.csm.ornl.gov/oscar08/
- 2. Gupta, R., Beckman, P., Park, B.H., Lusk, E., Hargrove, P., Geist, A., Lumsdaine, A., Dongarra, J.: Cifts: A coordinated infrastructure for fault-tolerant systems. In: International Conference on Parallel Processing, ICPP (2009)
- 3. Hoarau, W., Lemarinier, P., Herault, T., Rodriguez, E., Tixeuil, S., Cappello, F.: Fail-mpi: How fault-tolerant is fault-tolerant mpi? In: IEEE International Conference on Cluster Computing, pp. 1–10 (September 2006)
- 4. Hoarau, W., Tixeuil, S., Vauchelles, F.: Fail-fci: Versatile fault injection. Future Generation Computer Systems 23(7), 913–919 (2007), http://www.sciencedirect.com/science/article/pii/S0167739X07000209
- 6. Carreira, J., Madeira, H., Silva, J.G.: Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers. IEEE Transactions on Software Engineering 24(2) (February 1998), http://www.xception.org/files/IEEETSE98.pdf
- 7. Lange, J., Pedretti, K., Hudson, T., Dinda, P., Cui, Z., Xia, L., Bridges, P., Jaconette, S., Levenhagen, M., Brightwell, R., Widener, P.: Palacios and Kitten: High Performance Operating Systems For Scalable Virtualized and Native Supercomputing. Tech. Rep. NWU-EECS-09-14, Northwestern University, July 20 (2009), http://v3vee.org/papers/NWU-EECS-09-14.pdf
- 8. Le, M., Gallagher, A., Tamir, Y.: Challenges and Opportunities with Fault Injection in Virtualized Systems. In: First International Workshop on Virtualization Performance: Analysis, Characterization, and Tools, Austin, Texas, USA (April 2008), http://www.cs.ucla.edu/~tamir/papers/vpact08.pdf
- 9. Marinescu, P.D., Candea, G.: LFI: A Practical and General Library-Level Fault Injector. In: Proceedings of the 39th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2009), June 29 - July 2. IEEE (2009), http://dslab.epfl.ch/pubs/lfi/index.html
- 11. Silva, J.G., Carreira, J., Madeira, H., Costa, D., Moreira, F.: Experimental assessment of parallel systems. In: Proceedings of the 26th Annual International Symposium on Fault-Tolerant Computing (FTCS 1996), June 25-27, pp. 415–424 (1996)
- 12. Stott, D.T., Floering, B., Burke, D., Kalbarczyk, Z., Iyer, R.K.: NFTAPE: A framework for assessing dependability in distributed systems with lightweight fault injectors. In: Proceedings of the 4th IEEE International Computer Performance and Dependability Symposium (IPDS), pp. 91–100. IEEE (March 2000)
- 13. Süßkraut, M., Creutz, S., Fetzer, C.: Fast fault injection with virtual machines (fast abstract). In: Supplement of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2007) (June 2007), http://wwwse.inf.tu-dresden.de/papers/preprint-suesskraut2007DSNb.pdf
- 14. Vallée, G., Naughton, T., Scott, S.L.: System Management Software for Virtual Environments. In: Proceedings of the ACM International Conference on Computing Frontiers (CF 2007), Ischia, Italy, May 7-9 (2007)
- 15. Youseff, L., Seymour, K., You, H., Dongarra, J., Wolski, R.: The impact of paravirtualized memory hierarchy on linear algebra computational kernels and software. In: Proceedings of the 17th International Symposium on High Performance Distributed Computing (HPDC 2008), pp. 141–152. ACM, New York (2008)