A Framework for Automated Fault Recovery Planning in Large-Scale Virtualized Infrastructures

  • Feng Liu
  • Vitalian A. Danciu
  • Pavlo Kerestey
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6473)


Multi-layered provisioning architectures such as those in emergent virtualized (e.g. cloud) infrastructures exacerbate the cost of faults to a degree where automation effectively constitutes a prerequisite for operations. The acquisition of management information and the execution of routine tasks have been automated to some degree; however, the decision processes behind fault management in large-scale environments have not. This paper addresses the automation of such decision processes by proposing a planning-based fault recovery algorithm built on hierarchical task networks, together with data models for the knowledge that the recovery process requires. We embed these concepts in a generic architecture and evaluate its prototypical implementation with respect to function and scalability.


fault management, AI planning, virtualization, cloud computing
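
To make the idea of recovery planning with hierarchical task networks more concrete, the following is a minimal sketch of a total-order HTN planner in Python. It is not the algorithm proposed in the paper. The state encoding (facts such as `on:vm1:hostA`), the operators `migrate_vm` and `restart_host`, the compound task `recover_vm` and its two methods are all hypothetical names chosen for illustration. The sketch only shows the general technique: a compound recovery task is decomposed by methods into primitive operators, with backtracking over alternative methods.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

State = FrozenSet[str]      # set of ground facts, e.g. "up:hostB"
Task = Tuple[str, ...]      # (task name, arg1, arg2, ...)


# --- Primitive operators (hypothetical): return the new state, or None
# --- if the precondition does not hold.
def migrate_vm(state: State, vm: str, src: str, dst: str) -> Optional[State]:
    if f"on:{vm}:{src}" in state and f"up:{dst}" in state:
        return (state - {f"on:{vm}:{src}"}) | {f"on:{vm}:{dst}"}
    return None


def restart_host(state: State, host: str) -> Optional[State]:
    if f"down:{host}" in state:
        return (state - {f"down:{host}"}) | {f"up:{host}"}
    return None


OPERATORS: Dict[str, Callable[..., Optional[State]]] = {
    "migrate_vm": migrate_vm,
    "restart_host": restart_host,
}


# --- Methods (hypothetical): decompose a compound task into subtasks,
# --- or return None if the method is not applicable in this state.
def recover_by_migration(state: State, vm: str, host: str) -> Optional[List[Task]]:
    for fact in sorted(state):  # sorted for a deterministic choice
        if fact.startswith("up:"):
            spare = fact.split(":", 1)[1]
            if spare != host:
                return [("migrate_vm", vm, host, spare)]
    return None


def recover_by_restart(state: State, vm: str, host: str) -> Optional[List[Task]]:
    return [("restart_host", host)]


# Methods are tried in list order; later entries act as fallbacks.
METHODS: Dict[str, List[Callable[..., Optional[List[Task]]]]] = {
    "recover_vm": [recover_by_migration, recover_by_restart],
}


def plan(state: State, tasks: List[Task]) -> Optional[List[Task]]:
    """Return a list of primitive tasks achieving `tasks`, or None."""
    if not tasks:
        return []
    head, rest = tasks[0], tasks[1:]
    name, args = head[0], head[1:]
    if name in OPERATORS:
        new_state = OPERATORS[name](state, *args)
        if new_state is None:
            return None
        tail = plan(new_state, rest)
        return None if tail is None else [head] + tail
    for method in METHODS.get(name, []):
        subtasks = method(state, *args)
        if subtasks is None:
            continue
        result = plan(state, subtasks + rest)
        if result is not None:
            return result  # first successful decomposition wins
    return None


if __name__ == "__main__":
    initial = frozenset({"on:vm1:hostA", "down:hostA", "up:hostB"})
    print(plan(initial, [("recover_vm", "vm1", "hostA")]))
```

In this toy domain the planner prefers migration to a spare host and falls back to restarting the failed host when no spare is available. A realistic recovery planner would additionally need cost or utility information to rank alternative plans, which this sketch omits.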





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Feng Liu (1)
  • Vitalian A. Danciu (1)
  • Pavlo Kerestey (2)
  1. Munich Network Management Team, Ludwig-Maximilians-Universität München
  2. Technische Universität München, Germany
