
ABHA: A Framework for Autonomic Job Recovery

  • Charles Earl
  • Emilio Remolina
  • Jim Ong
  • John Brown
  • Chris Kuszmaul
  • Brad Stone
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3278)

Abstract

Key issues in autonomic job recovery for cluster computing are recognizing job failure; understanding the failure well enough to know whether and how to restart the job; and rapidly integrating this information into the cluster architecture so that the failure is better mitigated in the future. The Agent Based High Availability (ABHA) system provides an API and a collection of services for building autonomic batch job recovery into cluster computing environments. Its agent API allows users to define agents for failure diagnosis and recovery. ABHA is currently being evaluated in the U.S. Department of Energy’s STAR project.


Copyright information

© IFIP International Federation for Information Processing 2004

Authors and Affiliations

  • Charles Earl (1)
  • Emilio Remolina (1)
  • Jim Ong (1)
  • John Brown (2)
  • Chris Kuszmaul (2)
  • Brad Stone (2)

  1. Stottler Henke Associates, USA
  2. Pentum Group, Inc., USA
