Using Failure Injection Mechanisms to Experiment and Evaluate a Grid Failure Detector

  • Conference paper
High Performance Computing for Computational Science - VECPAR 2006 (VECPAR 2006)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4395)

Abstract

Computing grids are large-scale, highly distributed, and often hierarchical platforms. At such scales, failures are no longer exceptions but part of normal behavior. Developers designing software for grids therefore have to take failures into account. To measure the impact of failures on the whole system, it is crucial to run experiments at a large scale under various volatility conditions. This paper presents an experimental tool that lets the user inject failures during a practical evaluation of fault-tolerant systems. We illustrate the usefulness of our tool through an evaluation of a hierarchical grid failure detector.
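
To make the idea of failure injection concrete, here is a minimal Python sketch. It is not the tool presented in the paper. It assumes that crash times are drawn from an exponential law with a given mean time between failures (MTBF), and that the system under test is a plain heartbeat detector with a fixed period and timeout. All function names and parameters are hypothetical, and network delay is ignored.

```python
import random


def draw_crash_times(nodes, mtbf, horizon, seed=0):
    """Draw at most one crash time per node (assumption: exponential law, mean `mtbf`).

    Nodes whose drawn time falls after `horizon` do not crash during the run.
    """
    rng = random.Random(seed)  # seeded so that an experiment can be replayed
    schedule = {}
    for node in nodes:
        t = rng.expovariate(1.0 / mtbf)
        if t <= horizon:
            schedule[node] = t
    return schedule


def detection_time(crash_time, period, timeout):
    """Time at which a simple heartbeat detector suspects a crashed node.

    Assumption: the node sends heartbeats at 0, period, 2*period, ... and the
    monitor suspects it once `timeout` has elapsed since the last heartbeat.
    """
    last_heartbeat = (crash_time // period) * period
    return last_heartbeat + timeout


def run_experiment(nodes, mtbf, horizon, period, timeout, seed=0):
    """Inject the drawn crashes and return the detection delay for each crashed node."""
    schedule = draw_crash_times(nodes, mtbf, horizon, seed)
    return {
        node: detection_time(t, period, timeout) - t
        for node, t in schedule.items()
    }


if __name__ == "__main__":
    nodes = [f"node-{i}" for i in range(20)]
    delays = run_experiment(nodes, mtbf=100.0, horizon=200.0, period=1.0, timeout=3.0)
    for node, delay in sorted(delays.items()):
        print(f"{node}: suspected {delay:.2f} s after its crash")
```

Fixing the seed keeps the injected failure schedule identical across runs, so different detector settings (here `period` and `timeout`) can be compared under the same volatility conditions.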

Editor information

Michel Daydé, José M. L. M. Palma, Álvaro L. G. A. Coutinho, Esther Pacitti, João Correia Lopes

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Monnet, S., Bertier, M. (2007). Using Failure Injection Mechanisms to Experiment and Evaluate a Grid Failure Detector. In: Daydé, M., Palma, J.M.L.M., Coutinho, Á.L.G.A., Pacitti, E., Lopes, J.C. (eds) High Performance Computing for Computational Science - VECPAR 2006. VECPAR 2006. Lecture Notes in Computer Science, vol 4395. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71351-7_48

  • DOI: https://doi.org/10.1007/978-3-540-71351-7_48

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-71350-0

  • Online ISBN: 978-3-540-71351-7

  • eBook Packages: Computer Science, Computer Science (R0)
