Design and Implementation of a Fault Tolerant Job Flow Manager Using Job Flow Patterns and Recovery Policies

  • Selim Kalayci
  • Onyeka Ezenwoye
  • Balaji Viswanathan
  • Gargi Dasgupta
  • S. Masoud Sadjadi
  • Liana Fong
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5364)


Many grid applications are developed as job flows composed of multiple jobs. Executing a job flow requires the support of a job flow manager and a job scheduler. Because job flows are long-running, support for fault tolerance and recovery policies is especially important. This support is inherently complicated by the sequencing and dependency of jobs within a flow, and by the coordination required between workflow engines and job schedulers. In this paper, we describe the design and implementation of a job flow manager that supports fault tolerance. First, we identify and label job flow patterns within a job flow at deployment time. Next, at runtime, we introduce a proxy that intercepts and resolves faults using job flow patterns and their corresponding fault-recovery policies. Our design has three advantages: it separates the job flow logic from the fault-handling logic, it requires no manipulation at modeling time, and it provides flexibility in fault resolution at runtime. We validate our design with a prototype implementation based on the ActiveBPEL workflow engine and the GridWay meta-scheduler, with the Montage application as the case study.
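To make the runtime step concrete, the following is a minimal sketch of a proxy that intercepts job faults and resolves them according to a policy chosen by pattern label. It is not the implementation described in the paper, which builds on ActiveBPEL and GridWay. The names `JobFault`, `RecoveryPolicy`, `POLICIES`, and `FaultTolerantProxy`, the pattern labels `"sequence"` and `"parallel"`, and the retry and skip actions are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


class JobFault(Exception):
    """Hypothetical fault raised when a submitted job fails."""


@dataclass
class RecoveryPolicy:
    """Hypothetical recovery policy attached to a job flow pattern."""
    max_retries: int = 0            # resubmissions allowed after the first failure
    skip_on_failure: bool = False   # let the flow continue if all attempts fail


# Assumed mapping from pattern labels (assigned at deployment time)
# to recovery policies; the labels and values are illustrative only.
POLICIES: Dict[str, RecoveryPolicy] = {
    "sequence": RecoveryPolicy(max_retries=2),
    "parallel": RecoveryPolicy(max_retries=1, skip_on_failure=True),
}


class FaultTolerantProxy:
    """Sits between the flow logic and the scheduler and handles faults."""

    def __init__(self, submit: Callable[[str], str],
                 policies: Dict[str, RecoveryPolicy]) -> None:
        self._submit = submit        # stand-in for a call to the job scheduler
        self._policies = policies

    def run(self, job_id: str, pattern: str) -> Optional[str]:
        # Unlabeled patterns fall back to a policy with no recovery.
        policy = self._policies.get(pattern, RecoveryPolicy())
        attempts = policy.max_retries + 1
        for attempt in range(attempts):
            try:
                return self._submit(job_id)
            except JobFault:
                if attempt == attempts - 1:
                    if policy.skip_on_failure:
                        return None  # fault resolved by skipping the job
                    raise            # fault propagates to the flow
        return None
```

Because the policy lookup lives in the proxy, the flow definition only calls `run` and carries no fault-handling code of its own; changing an entry in `POLICIES` changes how faults are resolved without touching the flow.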


Keywords: Recovery Action, Deployment Time, Fault Handling, Recovery Policy, Montage Application



Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Selim Kalayci (1)
  • Onyeka Ezenwoye (2)
  • Balaji Viswanathan (3)
  • Gargi Dasgupta (3)
  • S. Masoud Sadjadi (1)
  • Liana Fong (4)

  1. Florida International University, Miami, USA
  2. South Dakota State University, Brookings, USA
  3. IBM India Research Lab, New Delhi, India
  4. IBM Watson Research Center, Hawthorne, NY, USA
