Abstract
Computing grids are large-scale, highly distributed, and often hierarchical platforms. At such scales, failures are no longer exceptions but part of normal behavior, so developers designing software for grids have to take them into account. It is crucial to run experiments at a large scale, under various volatility conditions, in order to measure the impact of failures on the whole system. This paper presents an experimental tool that lets the user inject failures during a practical evaluation of fault-tolerant systems. We illustrate the usefulness of the tool through an evaluation of a hierarchical grid failure detector.
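As a rough illustration of what injecting failures under a given volatility condition can look like, the Python sketch below draws at most one failure time per node from an exponential distribution. It is not the tool presented in this paper: the function name `failure_schedule`, the parameters `mtbf_s` and `duration_s`, and the exponential failure model are all assumptions made for this example.

```python
import random


def failure_schedule(nodes, mtbf_s, duration_s, seed=0):
    """Return a time-sorted list of (time_s, node) failure events.

    Hypothetical helper, not part of the paper's tool.
    Each node fails at most once, at a time drawn from an exponential
    distribution with mean mtbf_s (an assumed volatility model).
    Failures that would occur after duration_s are dropped.
    """
    rng = random.Random(seed)  # seeded so an experiment can be replayed
    events = []
    for node in nodes:
        t = rng.expovariate(1.0 / mtbf_s)
        if t < duration_s:
            events.append((t, node))
    events.sort()
    return events


if __name__ == "__main__":
    # Example: 8 hypothetical nodes, mean time between failures of
    # 30 minutes, over a one-hour experiment.
    nodes = [f"node-{i}" for i in range(8)]
    for t, node in failure_schedule(nodes, mtbf_s=1800.0, duration_s=3600.0):
        print(f"{t:8.1f}s  kill {node}")
```

Fixing the seed makes a run reproducible, so the same failure scenario can be replayed against different configurations of the system under test. Changing `mtbf_s` is one simple way to vary the volatility condition between runs.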
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Monnet, S., Bertier, M. (2007). Using Failure Injection Mechanisms to Experiment and Evaluate a Grid Failure Detector. In: Daydé, M., Palma, J.M.L.M., Coutinho, Á.L.G.A., Pacitti, E., Lopes, J.C. (eds) High Performance Computing for Computational Science - VECPAR 2006. VECPAR 2006. Lecture Notes in Computer Science, vol 4395. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71351-7_48
DOI: https://doi.org/10.1007/978-3-540-71351-7_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71350-0
Online ISBN: 978-3-540-71351-7
eBook Packages: Computer Science, Computer Science (R0)