DDG Task Recovery for Cluster Computing

  • G. T. Nguyen
  • L. Hluchy
  • V. D. Tran
  • M. Kotocova
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2328)


This paper presents a solution to the problem of transparently recovering an asynchronous distributed computation on a cluster of workstations when a fault occurs on a node. If the system has fault-tolerant features, it can survive the fault and continue its computation. Performance degradation is unavoidable when no hardware redundancy is available. For a long-running application, it is a considerable advantage to restart from a checkpoint rather than restart the whole computation. The paper describes the fault-tolerance feature of the DDG environment, which targets cluster systems without spare hardware.
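The benefit of restarting from a checkpoint can be illustrated with a minimal sketch. It is not the DDG recovery mechanism, which this abstract does not detail; the task graph `GRAPH`, the `NodeFault` exception, the `run` and `demo` functions, and the JSON checkpoint file are all hypothetical names chosen for illustration. The sketch assumes each task's result is saved as soon as the task finishes, so a restarted run recomputes only the tasks that had not completed before the fault.

```python
import json
import os
import tempfile

# Hypothetical task graph: task name -> list of tasks it depends on.
GRAPH = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}


class NodeFault(Exception):
    """Simulated permanent fault on the node running a task."""


def run(graph, checkpoint_path, fail_on=None):
    """Run tasks in dependency order, checkpointing after each task.

    Returns (results, executed), where executed lists only the tasks
    computed in this call; tasks restored from the checkpoint are skipped.
    """
    results = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            results = json.load(f)
    executed = []
    pending = [t for t in graph if t not in results]
    while pending:
        # A task is ready once all of its inputs are available.
        ready = [t for t in pending if all(d in results for d in graph[t])]
        if not ready:
            raise ValueError("task graph has a cycle or a missing dependency")
        for task in ready:
            if task == fail_on:
                raise NodeFault(task)
            # Toy computation: 1 plus the sum of the input values.
            results[task] = 1 + sum(results[d] for d in graph[task])
            executed.append(task)
            with open(checkpoint_path, "w") as f:
                json.dump(results, f)
            pending.remove(task)
    return results, executed


def demo():
    """Interrupt a run with a simulated fault, then restart it."""
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    try:
        run(GRAPH, path, fail_on="c")
    except NodeFault:
        pass  # the fault interrupts the first run before task "c"
    return run(GRAPH, path)
```

In `demo`, the first run completes tasks `a` and `b` before the simulated fault; the second run restores them from the checkpoint and computes only `c` and `d`. A real cluster environment would additionally have to reassign the faulty node's tasks to the surviving nodes, which this single-process sketch does not model.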


Keywords: Parallel Program; Task Graph; Multiprocessor System; Permanent Fault; Faulty Node
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • G. T. Nguyen (1)
  • L. Hluchy (1)
  • V. D. Tran (1)
  • M. Kotocova (2)
  1. SAS, Institute of Informatics, Bratislava, Slovakia
  2. Department of Computer Science, STU, Bratislava, Slovakia
