Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules: A Programmer’s Perspective

Gerlach, Sebastian; Schaeli, Basile; Hersch, Roger D.

doi:10.1007/11808107_9

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 4028)

Abstract

Dynamic Parallel Schedules (DPS) is a flow-graph-based framework for developing parallel applications on clusters of workstations. The DPS flow graph execution model enables automatic pipelined parallel execution of applications. DPS supports graceful degradation of parallel applications when nodes fail. The fault-tolerance mechanism relies on a set of backup threads, stored in the volatile storage of alternate nodes, which are kept up to date by both duplicating transmitted data objects and performing periodic checkpointing. The current state of a failed node can be reconstructed on its backup threads by re-executing the application since the last checkpoint. A valid execution order is automatically deduced from the flow graph. Adding fault tolerance to a DPS application requires only minor changes to the application’s source code. The present contribution focuses on the development of fault-tolerant parallel applications with DPS from a programmer’s perspective.
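
The recovery scheme described in the abstract (duplicated data objects plus periodic checkpoints, followed by re-execution since the last checkpoint) can be illustrated with a toy sketch. This is not the DPS API and makes no claim about how DPS is implemented: the names `BackupThread`, `ActiveThread`, `accumulate`, `run_demo`, and the `checkpoint_interval` parameter are all hypothetical, the thread state is reduced to a single integer, and everything runs in one process instead of on separate nodes.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical signature: (current state, data object) -> new state.
Operation = Callable[[int, int], int]


@dataclass
class BackupThread:
    """Toy stand-in for a backup kept in another node's volatile memory."""
    checkpoint: int = 0                           # state at the last checkpoint
    log: List[int] = field(default_factory=list)  # duplicates received since then

    def duplicate(self, obj: int) -> None:
        # Keep a copy of every data object sent to the active thread.
        self.log.append(obj)

    def store_checkpoint(self, state: int) -> None:
        # A checkpoint supersedes the duplicates that led up to it.
        self.checkpoint = state
        self.log.clear()

    def recover(self, operation: Operation) -> int:
        # Rebuild the failed thread's state by re-executing the logged
        # data objects on top of the last checkpoint.
        state = self.checkpoint
        for obj in self.log:
            state = operation(state, obj)
        return state


class ActiveThread:
    """Toy active thread that checkpoints every `checkpoint_interval` objects."""

    def __init__(self, operation: Operation, backup: BackupThread,
                 checkpoint_interval: int) -> None:
        self.operation = operation
        self.backup = backup
        self.checkpoint_interval = checkpoint_interval
        self.state = 0
        self.processed = 0

    def receive(self, obj: int) -> None:
        self.backup.duplicate(obj)                 # duplicate first, then process
        self.state = self.operation(self.state, obj)
        self.processed += 1
        if self.processed % self.checkpoint_interval == 0:
            self.backup.store_checkpoint(self.state)


def accumulate(state: int, obj: int) -> int:
    """Example deterministic operation: a running sum."""
    return state + obj


def run_demo():
    backup = BackupThread()
    active = ActiveThread(accumulate, backup, checkpoint_interval=3)
    for obj in range(1, 8):                        # data objects 1..7
        active.receive(obj)
    # Pretend the active node fails here; only the backup survives.
    recovered = backup.recover(accumulate)
    return active.state, backup.checkpoint, list(backup.log), recovered
```

In this sketch the replay order is simply the arrival order of the duplicated objects, and the operation is assumed to be deterministic; the abstract states that DPS instead deduces a valid execution order from the flow graph.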

Author information

Authors and Affiliations

Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Communication Sciences, Station 14, 1015, Ecublens, Switzerland
Sebastian Gerlach, Basile Schaeli & Roger D. Hersch

Editor information

Editors and Affiliations

Department of Informatics, University of Fribourg, Bd. de Pérolles 90, CH-1700, Fribourg, Switzerland
Jürg Kohlas
Eiffel Software, USA
Bertrand Meyer
École Polytechnique Fédérale de Lausanne (EPFL), 1015, Lausanne, Switzerland
André Schiper

About this chapter

Cite this chapter

Gerlach, S., Schaeli, B., Hersch, R.D. (2006). Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules: A Programmer’s Perspective. In: Kohlas, J., Meyer, B., Schiper, A. (eds) Dependable Systems: Software, Computing, Networks. Lecture Notes in Computer Science, vol 4028. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11808107_9

DOI: https://doi.org/10.1007/11808107_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-36821-2
Online ISBN: 978-3-540-36823-6
eBook Packages: Computer Science, Computer Science (R0)