Using Failure Injection Mechanisms to Experiment and Evaluate a Grid Failure Detector

Monnet, Sébastien; Bertier, Marin

doi:10.1007/978-3-540-71351-7_48

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4395)

Included in the following conference series:

International Conference on High Performance Computing for Computational Science

Abstract

Computing grids are large-scale, highly distributed, and often hierarchical platforms. At such scales, failures are no longer exceptions but part of normal behavior. When designing software for grids, developers have to take failures into account. It is therefore crucial to run experiments at a large scale, under various volatility conditions, in order to measure the impact of failures on the whole system. This paper presents an experimental tool that allows the user to inject failures during a practical evaluation of fault-tolerant systems. We illustrate the usefulness of our tool through an evaluation of a hierarchical grid failure detector.

Author information

Authors and Affiliations

IRISA/University of Rennes I: Sébastien Monnet
IRISA/INSA: Marin Bertier

Editor information

Michel Daydé, José M. L. M. Palma, Álvaro L. G. A. Coutinho, Esther Pacitti, João Correia Lopes

About this paper

Cite this paper

Monnet, S., Bertier, M. (2007). Using Failure Injection Mechanisms to Experiment and Evaluate a Grid Failure Detector. In: Daydé, M., Palma, J.M.L.M., Coutinho, Á.L.G.A., Pacitti, E., Lopes, J.C. (eds) High Performance Computing for Computational Science - VECPAR 2006. VECPAR 2006. Lecture Notes in Computer Science, vol 4395. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71351-7_48

DOI: https://doi.org/10.1007/978-3-540-71351-7_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71350-0
Online ISBN: 978-3-540-71351-7
eBook Packages: Computer Science, Computer Science (R0)
