Abstract
The ability to tolerate failures while effectively exploiting grid computing resources in a scalable and transparent manner must be an integral part of grid computing infrastructure. Hence, a fault-detection service is a necessary prerequisite to fault tolerance and fault recovery in grid computing. To this end, we present a scalable fault-detection service architecture. The proposed fault-detection system provides services that monitor user applications, grid middleware, and the dynamically changing state of a collection of distributed resources. It reports summaries of this information to the appropriate agents on demand, or instantaneously in the event of failures.
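The two reporting modes named in the abstract (summaries on demand, and immediate notification on failure) can be illustrated with a minimal sketch. The abstract does not say how failures are detected, so the heartbeat-timeout rule used here is an assumption, and the class and method names (`FaultDetector`, `heartbeat`, `check`, `summary`, `subscribe`) are hypothetical and not taken from the paper.

```python
import time
from typing import Callable, Dict, List, Set


class FaultDetector:
    """Illustrative monitor; heartbeat timeouts are an assumed detection rule."""

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout          # seconds of silence tolerated
        self.clock = clock              # injectable so the logic can be tested
        self.last_seen: Dict[str, float] = {}
        self.failed: Set[str] = set()
        self.subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        """Register an agent to be notified as soon as a failure is detected."""
        self.subscribers.append(callback)

    def heartbeat(self, name: str) -> None:
        """Record that a monitored entity (application, middleware, resource) is alive."""
        self.last_seen[name] = self.clock()
        self.failed.discard(name)

    def check(self) -> List[str]:
        """Mark silent entities as failed and push a notification for each new failure."""
        now = self.clock()
        newly_failed = [
            name for name, seen in self.last_seen.items()
            if name not in self.failed and now - seen > self.timeout
        ]
        for name in newly_failed:
            self.failed.add(name)
            for notify in self.subscribers:
                notify(name)
        return newly_failed

    def summary(self) -> Dict[str, str]:
        """On-demand report of the last known state of every monitored entity."""
        return {
            name: "failed" if name in self.failed else "alive"
            for name in self.last_seen
        }
```

In this sketch an agent that wants immediate alerts calls `subscribe`, while an agent that only needs periodic status calls `summary`. A real grid deployment would also need distribution and scalability mechanisms, which this single-process example leaves out.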
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Abawajy, J.H. (2004). Fault Detection Service Architecture for Grid Computing Systems. In: Laganá, A., Gavrilova, M.L., Kumar, V., Mun, Y., Tan, C.J.K., Gervasi, O. (eds) Computational Science and Its Applications – ICCSA 2004. ICCSA 2004. Lecture Notes in Computer Science, vol 3044. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24709-8_12
DOI: https://doi.org/10.1007/978-3-540-24709-8_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22056-5
Online ISBN: 978-3-540-24709-8
eBook Packages: Springer Book Archive