Fault Detection Service Architecture for Grid Computing Systems

Abawajy, J. H.

doi:10.1007/978-3-540-24709-8_12

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3044)

Included in the following conference series:

International Conference on Computational Science and Its Applications

Abstract

The ability to tolerate failures while effectively exploiting grid computing resources in a scalable and transparent manner must be an integral part of grid computing infrastructure. Hence, a fault-detection service is a necessary prerequisite to fault tolerance and fault recovery in grid computing. To this end, we present a scalable fault-detection service architecture. The proposed fault-detection system provides services that monitor user applications, grid middleware, and the dynamically changing state of a collection of distributed resources. It reports summaries of this information to the appropriate agents on demand, or instantaneously in the event of failures.
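
The two reporting modes mentioned in the abstract (summaries on demand, and immediate notification on failure) can be illustrated with a small sketch. The abstract does not say how failures are detected, so the heartbeat-timeout mechanism below is an assumption, and every name in it (FaultDetector, heartbeat, check, summary, on_failure) is hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class FaultDetector:
    """Hypothetical monitor; heartbeat timeouts are an assumed mechanism."""
    timeout: float                      # seconds of silence before an entity is marked failed
    on_failure: Callable[[str], None]   # push-style callback, called once per new failure
    last_seen: Dict[str, float] = field(default_factory=dict)
    failed: Set[str] = field(default_factory=set)

    def heartbeat(self, entity: str, now: float) -> None:
        # Record a liveness signal from an application, middleware component, or resource.
        self.last_seen[entity] = now
        self.failed.discard(entity)

    def check(self, now: float) -> List[str]:
        # Mark entities whose last heartbeat is older than the timeout and notify immediately.
        newly_failed = []
        for entity, seen in self.last_seen.items():
            if entity not in self.failed and now - seen > self.timeout:
                self.failed.add(entity)
                newly_failed.append(entity)
                self.on_failure(entity)
        return newly_failed

    def summary(self) -> Dict[str, str]:
        # Pull-style report: current status of every monitored entity.
        return {e: ("failed" if e in self.failed else "alive") for e in self.last_seen}


# Usage: two monitored entities, one of which stops sending heartbeats.
alerts: List[str] = []
detector = FaultDetector(timeout=5.0, on_failure=alerts.append)
detector.heartbeat("app-1", now=0.0)
detector.heartbeat("node-7", now=0.0)
detector.heartbeat("app-1", now=4.0)
detector.check(now=6.0)
print(alerts)
print(detector.summary())
```

Timestamps are passed in explicitly so the behaviour is deterministic; a real service would read a clock and run the check periodically.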

Author information

Authors and Affiliations

School of Information Technology, Deakin University, Geelong, Victoria, Australia
J. H. Abawajy

Editor information

Editors and Affiliations

Department of Chemistry, University of Perugia, Via Elce di Sotto, 8, I-06123, Perugia, Italy
Antonio Laganá
Department of Computer Science, University of Calgary, 2500 University Drive N.W., T2N 1N4, Calgary AB, Canada
Marina L. Gavrilova
William Norris Professor, Head of the Computer Science and Engineering Department, University of Minnesota, USA
Vipin Kumar
School of Computing, Soongsil University, Seoul, Korea
Youngsong Mun
OptimaNumerics Ltd., Cathedral House, 23-31 Waring Street, BT1 2DX, Belfast, UK
C. J. Kenneth Tan
Department of Mathematics and Computer Science, University of Perugia, via Vanvitelli, 1, I-06123, Perugia, Italy
Osvaldo Gervasi

About this paper

Cite this paper

Abawajy, J.H. (2004). Fault Detection Service Architecture for Grid Computing Systems. In: Laganá, A., Gavrilova, M.L., Kumar, V., Mun, Y., Tan, C.J.K., Gervasi, O. (eds) Computational Science and Its Applications – ICCSA 2004. ICCSA 2004. Lecture Notes in Computer Science, vol 3044. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24709-8_12

DOI: https://doi.org/10.1007/978-3-540-24709-8_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22056-5
Online ISBN: 978-3-540-24709-8
eBook Packages: Springer Book Archive
