
Scheduling for Fault-Tolerance: An Introduction

Chapter in Topics in Parallel and Distributed Computing

Abstract

Parallel execution time is expected to decrease as the number of processors increases. We show in this chapter that this is not as easy as it seems, even for perfectly parallel applications. In particular, processors are subject to faults. The more processors are available, the more likely faults will strike during execution. The main strategy to cope with faults in High Performance Computing is checkpointing. We introduce the reader to this approach, and explain how to determine the optimal checkpointing period through scheduling techniques. We also detail how to combine checkpointing with prediction and with replication.


Notes

  1. As a side note, one needs only 23 persons for the probability of a common birthday to reach 0.5 (a question often asked in geek evenings).

  2. By the way, there is a nice little exercise in “Appendix 6: Scheduling a Linear Chain of Tasks” if you are motivated to help.

References

  1. G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni. Checkpointing strategies with prediction windows. In 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing (PRDC), pages 1–10. IEEE, 2013.

  2. G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni. Checkpointing algorithms and fault prediction. Journal of Parallel and Distributed Computing, 74(2):2048–2064, 2014.

  3. M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien. Checkpointing strategies for parallel jobs. In Proceedings of SC’11, 2011.

  4. F. Cappello, A. Geist, B. Gropp, L. V. Kalé, B. Kramer, and M. Snir. Toward Exascale Resilience. Int. Journal of High Performance Computing Applications, 23(4):374–388, 2009.

  5. H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni. Combining process replication and checkpointing for resilience on exascale systems. Research report RR-7951, INRIA, May 2012.

  6. J. T. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. FGCS, 22(3):303–312, 2006.

  7. C. Engelmann, H. H. Ong, and S. L. Scott. The case for modular redundancy in large-scale high performance computing systems. In Proc. of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pages 189–194, 2009.

  8. K. Ferreira, J. Stearley, J. H. Laros III, R. Oldfield, K. Pedretti, R. Brightwell, R. Riesen, P. G. Bridges, and D. Arnold. Evaluating the Viability of Process Replication Reliability for Exascale Systems. In Proc. of the ACM/IEEE SC Conf., 2011.

  9. P. Flajolet, P. J. Grabner, P. Kirschenhofer, and H. Prodinger. On Ramanujan’s Q-Function. J. Computational and Applied Mathematics, 58:103–116, 1995.

  10. A. Gainaru, F. Cappello, and W. Kramer. Taming of the shrew: Modeling the normal and faulty behavior of large-scale HPC systems. In Proc. IPDPS’12, 2012.

  11. A. Gainaru, F. Cappello, M. Snir, and W. Kramer. Failure prediction for HPC systems and applications: Current situation and open issues. Int. J. High Perform. Comput. Appl., 27(3):273–282, 2013.

  12. F. Gärtner. Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Computing Surveys, 31(1), 1999.

  13. T. Hérault and Y. Robert, editors. Fault-Tolerance Techniques for High-Performance Computing. Computer Communications and Networks. Springer Verlag, 2015.

  14. O. Kella and W. Stadje. Superposition of renewal processes and an application to multi-server queues. Statistics & Probability Letters, 76(17):1914–1924, 2006.

  15. D. Kondo, A. Chien, and H. Casanova. Scheduling Task Parallel Applications for Rapid Application Turnaround on Enterprise Desktop Grids. J. Grid Computing, 5(4):379–405, 2007.

  16. S. M. Ross. Introduction to Probability Models, Eleventh Edition. Academic Press, 2009.

  17. B. Schroeder and G. Gibson. Understanding failures in petascale computers. Journal of Physics: Conference Series, 78(1), 2007.

  18. J. W. Young. A first order approximation to the optimum checkpoint interval. Comm. of the ACM, 17(9):530–531, 1974.

  19. L. Yu, Z. Zheng, Z. Lan, and S. Coghlan. Practical online failure prediction for Blue Gene/P: Period-based vs event-driven. In Dependable Systems and Networks Workshops (DSN-W), pages 259–264, 2011.

  20. Z. Zheng and Z. Lan. Reliability-aware scalability models for high performance computing. In Proc. of the IEEE Conference on Cluster Computing, 2009.

  21. Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman. A practical failure prediction with location and lead time for Blue Gene/P. In Dependable Systems and Networks Workshops (DSN-W), pages 15–22, 2010.

Author information

Correspondence to Yves Robert.

Appendices

Appendix 1: First-Order Approximation of T_FO

It is interesting to point out why the value of T_FO given by Eq. (8) is only a first-order approximation, even for large jobs. Indeed, several restrictions must hold for the approach to be valid:

  • We have stated that the expected number of faults during execution is \(N_{\text{faults}} = \frac {\textsc {Time}_{\text{final}} }{\mu } \), and that the expected time lost due to a fault is \(T_{\text{lost}} = \frac {T}{2} + D + R \). Both statements are true individually, but the expectation of a product is the product of the expectations only if the random variables are independent, which is not the case here, because \(\textsc {Time}_{\text{final}}\) depends upon the fault inter-arrival times.

  • In Eq. (4), we have to enforce C ≤ T in order to have Waste_FF ≤ 1.

  • In Eq. (5), we have to enforce \(D + R + \frac {T}{2} \leq \mu \) in order to have Waste_fault ≤ 1. We must cap the period to enforce this latter constraint. Intuitively, we need μ to be large enough for Eq. (5) to make sense. However, for large-scale platforms, regardless of the value of the individual MTBF μ_ind, there is always a threshold in the number of components p above which the platform MTBF, \(\mu = \frac {\mu _{\text{ind}}}{p}\), becomes too small for Eq. (5) to be valid.

  • Equation (5) is accurate only when two or more faults do not strike within the same period. Although this is unlikely when μ is large compared to T, the possibility of several faults occurring during the same period cannot be eliminated.

To ensure that the condition of having at most a single fault per period is met with high probability, we cap the length of the period: we enforce the condition T ≤ αμ, where α is a tuning parameter chosen as follows. The number of faults during a period of length T can be modeled as a Poisson process of parameter \(\beta = \frac {T}{\mu }\). The probability of having k ≥ 0 faults is \(P(X=k) = \frac {\beta ^{k}}{k!} e^{-\beta }\), where X is the random variable counting the number of faults. Hence the probability of having two or more faults is \(\pi = P(X \geq 2) = 1 - (P(X = 0) + P(X = 1)) = 1 - (1 + \beta )e^{-\beta }\). If we take α = 0.27, then π ≤ 0.03, so the approximation is valid when the period range is bounded accordingly. Indeed, with such a conservative value for α, overlapping faults strike only 3% of the checkpointing segments on average, so the model is quite reliable. For consistency, we also enforce the same type of bound on the checkpoint time, and on the downtime and recovery: C ≤ αμ and D + R ≤ αμ. However, enforcing these constraints may lead to using a sub-optimal period: it may well be the case that the optimal period \(\sqrt {2(\mu - (D +R )) C }\) of Eq. (8) does not belong to the admissible interval [C, αμ]. In that case, the waste is minimized at one of the bounds of the admissible interval, because, as seen from Eq. (7), the waste is a convex function of the period.
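As a quick sanity check of this bound (a small Python sketch of mine, with an arbitrary MTBF value), one can evaluate π = 1 − (1 + β)e^{−β} at the cap β = α:

```python
import math

def prob_two_or_more_faults(T, mu):
    """Probability of >= 2 faults in a period of length T,
    modeling fault arrivals as a Poisson process of rate 1/mu."""
    beta = T / mu
    return 1.0 - (1.0 + beta) * math.exp(-beta)

alpha = 0.27          # tuning parameter from the text
mu = 1000.0           # platform MTBF (arbitrary units, for illustration)
T_cap = alpha * mu    # capped period length

pi = prob_two_or_more_faults(T_cap, mu)
print(f"pi at the cap: {pi:.4f}")   # about 0.03, as claimed in the text
```

Shorter periods give smaller π, so the cap is indeed the worst case over the admissible range.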

We conclude this discussion on a positive note. While capping the period and enforcing a lower bound on the MTBF are mandatory for mathematical rigor, simulations in Aupy et al. [2] show that actual job executions can always use the value from Eq. (8), accounting for multiple faults whenever they occur by re-executing the work until success. The first-order model turns out to be surprisingly robust!
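The convexity argument can also be illustrated numerically. The sketch below is mine, with illustrative parameter values; it assumes the usual first-order combination Waste = Waste_FF + Waste_fault − Waste_FF · Waste_fault, with Waste_FF = C/T and Waste_fault = (D + R + T/2)/μ, locates the minimizer on a dense grid, and compares it with the closed form of Eq. (8):

```python
import math

# Illustrative parameters (not from the chapter); all times in the same unit
C, D, R, mu = 2.0, 1.0, 1.0, 1000.0

def waste(T):
    # First-order waste: fault-free overhead combined with fault-induced loss
    w_ff = C / T
    w_fault = (D + R + T / 2.0) / mu
    return w_ff + w_fault - w_ff * w_fault

# Dense grid search over the admissible period range [C, ...]
candidates = [C + i * 0.01 for i in range(20000)]
T_best = min(candidates, key=waste)

T_fo = math.sqrt(2 * (mu - (D + R)) * C)   # closed form of Eq. (8)
print(T_best, T_fo)  # both close to 63.18
```

With this combined waste, the grid minimizer matches the closed form essentially exactly, which is a good way to convince oneself that Eq. (8) is the right first-order answer.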

Appendix 2: Optimal Value of T_FO

There is a beautiful method to compute the optimal value of T_FO accurately. First we show how to compute the expected time \(\mathbb {E} (\textsc {Time} (T - C ,C ,D ,R ,\lambda ))\) needed to execute a work of duration T − C followed by a checkpoint of duration C, given the values of C, D, and R, and a fault distribution Exp(λ). If a fault interrupts a given trial before success, there is a downtime of duration D followed by a recovery of length R. We assume that faults can strike during checkpoint and recovery, but not during downtime.

Proposition 3

\(\ensuremath {\mathbb {E}} (\ensuremath {\textsc {Time}} (\ensuremath {T} - \ensuremath {C} ,\ensuremath {C} ,\ensuremath {D} ,\ensuremath {R} ,\lambda ))= e^{\lambda \ensuremath {R} } \left (\frac {1}{\lambda } + \ensuremath {D} \right ) (e^{\lambda \ensuremath {T} } -1 )\).

Proof

For simplification, we write Time instead of Time(T − C, C, D, R, λ) in the proof below. Consider the following two cases:

  1. Either there is no fault during the execution of the period, and the time needed is exactly T;

  2. Or a fault strikes before the period completes successfully, and some additional delays are incurred. More specifically, as seen for the first-order approximation, there are two sources of delay: the time spent computing by the processors before the fault (accounted for by the variable Time_lost), and the time spent for downtime and recovery (accounted for by the variable Time_rec). Once a successful recovery has been completed, there still remain T − C units of work to execute.

Thus Time obeys the following recursive equation:

$$\displaystyle \begin{aligned} \ensuremath{\textsc{Time}} =\left\{ \begin{array}{ll} \ensuremath{T} & \text{if there is no fault} \\ \ensuremath{\ensuremath{\textsc{Time}_{\text{lost}}} } + \ensuremath{\ensuremath{\textsc{Time}_{\text{rec}}} } +\ensuremath{\textsc{Time}} & \text{otherwise} \end{array}\right. {} \end{aligned} $$
(15)
Time_lost:

denotes the amount of time spent by the processors before the first fault, knowing that this fault occurs within the next T units of time. In other terms, it is the time that is wasted because the computation and checkpoint were not successfully completed (the corresponding value in Fig. 2 is T_lost − D − R).

Time_rec:

represents the amount of time needed by the system to recover from the fault (the corresponding value in Fig. 2 is D + R).

The expectation of Time can be computed from Eq. (15) by weighting each case by its probability to occur:

$$\displaystyle \begin{aligned} \ensuremath{\mathbb{E}} (\ensuremath{\textsc{Time}} ) &= \ensuremath{\mathbb{P}} \left ( \text{no fault}\right ) \cdot \ensuremath{T} +\ensuremath{\mathbb{P}} \left ( \text{a fault strikes}\right ) \cdot \ensuremath{\mathbb{E}} \left ( \ensuremath{\ensuremath{\textsc{Time}_{\text{lost}}} } + \ensuremath{\ensuremath{\textsc{Time}_{\text{rec}}} } +\ensuremath{\textsc{Time}} \right) \\ &= e^{- \lambda \ensuremath{T} } \ensuremath{T} + (1-e^{- \lambda \ensuremath{T} } ) \left( \ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{lost}}} } )+ \ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{rec}}} } ) +\ensuremath{\mathbb{E}} (\ensuremath{\textsc{Time}} ) \right)\ , \end{aligned} $$

which simplifies into:

$$\displaystyle \begin{aligned} \ensuremath{\mathbb{E}} (\ensuremath{\textsc{Time}} )= \ensuremath{T} + (e^{\lambda \ensuremath{T} } -1 ) \left ( \ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{lost}}} } )+ \ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{rec}}} } ) \right ) {} \end{aligned} $$
(16)

We have \(\ensuremath {\mathbb {E}} (\ensuremath {\ensuremath {\textsc {Time}_{\text{lost}}} } ) = \int _0^{\infty } x \ensuremath {\mathbb {P}} (X=x|X < \ensuremath {T} ) dx = \frac {1}{\ensuremath {\mathbb {P}} (X < \ensuremath {T} )}\int _0^{\ensuremath {T} } x \lambda e^{-\lambda x} dx\), where \(\ensuremath {\mathbb {P}} (X < \ensuremath {T} ) = 1-e^{-\lambda \ensuremath {T} }\). Integrating by parts, we derive that

$$\displaystyle \begin{aligned} \ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{lost}}} } ) = \frac{1}{\lambda} - \frac{\ensuremath{T} }{e^{\lambda \ensuremath{T} } - 1} {} \end{aligned} $$
(17)

Next, the reasoning to compute \(\ensuremath {\mathbb {E}} (\ensuremath {\ensuremath {\textsc {Time}_{\text{rec}}} } )\) is very similar to that for \(\ensuremath {\mathbb {E}} (\ensuremath {\textsc {Time}} )\) (note that there can be no fault during D, but there can be some during R):

$$\displaystyle \begin{aligned} \ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{rec}}} } ) = e^{-\lambda \ensuremath{R} } (\ensuremath{D} +\ensuremath{R} ) + (1-e^{-\lambda \ensuremath{R} }) (\ensuremath{D} +\ensuremath{\mathbb{E}} (\ensuremath{R_{lost}} )+\ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{rec}}} } )) \end{aligned}$$

Here, R_lost is the amount of time lost executing the recovery before a fault happens, knowing that this fault occurs within the next R units of time. Replacing T by R in Eq. (17), we obtain \(\ensuremath {\mathbb {E}} (\ensuremath {R_{lost}} ) = \frac {1}{\lambda } - \frac {\ensuremath {R} }{e^{\lambda \ensuremath {R} } - 1}\). The expression for \(\ensuremath {\mathbb {E}} (\ensuremath {\ensuremath {\textsc {Time}_{\text{rec}}} } )\) then simplifies to

$$\displaystyle \begin{aligned} \ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{rec}}} } ) = \ensuremath{D} e^{ \lambda \ensuremath{R} } + \frac{1}{\lambda} (e^{\lambda \ensuremath{R} } - 1)\end{aligned} $$

Plugging the values of \(\ensuremath {\mathbb {E}} (\ensuremath {\ensuremath {\textsc {Time}_{\text{lost}}} } )\) and \(\ensuremath {\mathbb {E}} (\ensuremath {\ensuremath {\textsc {Time}_{\text{rec}}} } )\) into Eq. (16) leads to the desired value:

$$\displaystyle \begin{aligned} \ensuremath{\mathbb{E}} (\ensuremath{\textsc{Time}} (\ensuremath{T} - \ensuremath{C} ,\ensuremath{C} ,\ensuremath{D} ,\ensuremath{R} ,\lambda))= e^{\lambda \ensuremath{R} } \left(\frac{1}{\lambda} + \ensuremath{D} \right) (e^{\lambda \ensuremath{T} } -1 ) \end{aligned}$$

Proposition 3 is the key to proving that the optimal checkpointing strategy is periodic. Indeed, consider an application of duration Time_base, and divide the execution into periods of different lengths T_i, each with a checkpoint at the end. The expectation of the total execution time is the sum of the expectations of the time needed for each period. Proposition 3 shows that the expected time for a period is a convex function of its length, hence all periods must be equal: T_i = T for all i.
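Proposition 3 is easy to check by simulation. The Python sketch below is mine, with illustrative parameters; it simulates the recursive process of Eq. (15), with faults striking during work, checkpoint, and recovery but not during downtime, and compares the empirical mean with the closed form:

```python
import math
import random

def expected_time(T, C, D, R, lam):
    """Closed form of Proposition 3 (T is the full period, work T - C plus checkpoint C)."""
    return math.exp(lam * R) * (1.0 / lam + D) * (math.exp(lam * T) - 1.0)

def simulate_once(T, C, D, R, lam, rng):
    """One random execution of a period of total length T (work T - C, then checkpoint C)."""
    elapsed = 0.0
    while True:
        x = rng.expovariate(lam)
        if x >= T:                     # no fault before work + checkpoint complete
            return elapsed + T
        elapsed += x + D               # lost work, then downtime
        while True:                    # recovery, which can itself be hit by faults
            y = rng.expovariate(lam)
            if y >= R:
                elapsed += R
                break
            elapsed += y + D

rng = random.Random(42)
T, C, D, R, lam = 10.0, 2.0, 1.0, 1.0, 0.01
n = 200_000
empirical = sum(simulate_once(T, C, D, R, lam, rng) for _ in range(n)) / n
print(empirical, expected_time(T, C, D, R, lam))  # both close to 10.7
```

The empirical mean converges to \(e^{\lambda R}(\frac{1}{\lambda} + D)(e^{\lambda T} - 1)\) as the number of trials grows.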

It remains to find the best number of periods, or equivalently, the size of each work chunk before checkpointing. With k periods of length \(T =\frac {\textsc {Time}_{\text{base}} }{k}\), we have to minimize a function that depends on k. This is easy for a skilled mathematician who knows the Lambert function \(\mathbb {L}\) (defined by \(\mathbb {L} (z)e^{\mathbb {L} (z)}=z\)). She would find the optimal rational value k_opt of k by differentiation, prove that the objective function is convex, and conclude that the optimal value is either ⌊k_opt⌋ or ⌈k_opt⌉, thereby determining the optimal period T_opt. What if you are not a skilled mathematician? No problem: simply use T_FO as a first-order approximation, and be comforted that the first-order term in the Taylor expansion of T_opt is … T_FO! See Bougeret et al. [3] for all details.
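In practice one can skip the Lambert function and minimize over the integer number of periods directly. The sketch below is mine, with illustrative parameters; it charges one checkpoint C per period, so with k periods each period covers a work chunk Time_base/k plus the checkpoint C, and its expected duration is given by Proposition 3:

```python
import math

# Illustrative parameters (not from the chapter); lam = 1/mu
time_base, C, D, R, lam = 1000.0, 2.0, 1.0, 1.0, 0.001

def expected_period_time(work):
    """Proposition 3: expected time for `work` units of work plus a checkpoint C."""
    T = work + C
    return math.exp(lam * R) * (1.0 / lam + D) * (math.exp(lam * T) - 1.0)

def expected_total_time(k):
    """k equal-size periods, each ending with a checkpoint."""
    return k * expected_period_time(time_base / k)

best_k = min(range(1, 101), key=expected_total_time)

mu = 1.0 / lam
T_fo = math.sqrt(2 * (mu - (D + R)) * C)   # Eq. (8)
k_fo = time_base / (T_fo - C)              # chunk count suggested by T_FO
print(best_k, round(k_fo, 2))              # best_k is floor or ceil of k_fo
```

For these parameters the exact integer optimum coincides with rounding the T_FO-based value, illustrating how good the first-order approximation is.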

Appendix 3: MTBF of a Platform with p Parallel Processors

In this section we give another proof of Proposition 1. Interestingly, it applies to any continuous probability distribution with bounded (nonzero) expectation, not just Exponential laws.

First we prove that Eq. (1) does hold true. Consider a single processor, say processor p_q. Let X_i, i ≥ 0, denote the IID (independent and identically distributed) random variables for the fault inter-arrival times on p_q, and assume that X_i ∼ D_X, where D_X is a continuous probability distribution with bounded (nonzero) expectation μ_ind. In particular, \(\mathbb E\left (X_{i}\right ) = \mu _{\text{ind}}\) for all i. Consider a fixed time bound F. Let n_q(F) be the number of faults on p_q until time F. More precisely, the n_q(F)-th fault is the last one to happen before time F or at time F, and the (n_q(F) + 1)-st fault is the first to happen after time F. By definition of n_q(F), we have

$$\displaystyle \begin{aligned}\sum_{i=1}^{n_{q}(F)} X_{i} \leq F < \sum_{i=1}^{n_{q}(F)+1} X_{i}.\end{aligned}$$

Using Wald’s equation (see the textbook of Ross [16, p. 420]), with n_q(F) as a stopping criterion, we derive:

$$\displaystyle \begin{aligned}\mathbb E\left(n_{q}(F)\right) \mu_{\text{ind}} \leq F \leq (\mathbb E\left(n_{q}(F)\right) +1) \mu_{\text{ind}},\end{aligned}$$

and we obtain:

$$\displaystyle \begin{aligned} \lim_{F \rightarrow +\infty} \frac{\mathbb E\left(n_{q}(F)\right)}{F} = \frac{1}{\mu_{\text{ind}}}. {} \end{aligned} $$
(18)

As promised, Eq. (18) is exactly Eq. (1).

Now consider a platform with p identical processors, whose fault inter-arrival times are IID random variables that follow the distribution D_X. Unfortunately, if D_X is not an Exponential law, then the inter-arrival times of the faults of the whole platform, i.e., of the super-processor of section “Checkpointing on a Parallel Platform”, are no longer IID. The minimum trick used in the proof of Proposition 1 works only for the first fault. For the following ones, we would need to remember the history of the previous faults, and things get too complicated. However, we can still define the MTBF μ of the super-processor. Using Eq. (18), μ must satisfy:

$$\displaystyle \begin{aligned}\lim_{F \rightarrow +\infty} \frac{\mathbb E\left(n(F)\right)}{F} = \frac{1}{\mu},\end{aligned}$$

where n(F) is the number of faults on the super-processor until time F. But does the limit always exist? And if so, what is its value?

The answer to both questions is not difficult. Consider a fixed time bound F as before. Let n(F) be the number of faults on the whole platform until time F, and let m_q(F) be the number of those faults that strike component q, so that \(n(F) = \sum _{q=1}^{p} m_{q}(F)\). From Eq. (18) again, we have for each component q:

$$\displaystyle \begin{aligned} \lim_{F \rightarrow +\infty} \frac{\mathbb E\left(m_{q}(F)\right)}{F} = \frac{1}{\mu_{\text{ind}}}. \end{aligned}$$

Since \(n(F) = \sum _{q=1}^{p} m_{q}(F)\), we also have:

$$\displaystyle \begin{aligned} \lim_{F \rightarrow +\infty} \frac{\mathbb E\left(n(F)\right)}{F} = \frac{p}{\mu_{\text{ind}}} \end{aligned} $$
(19)

which answers both questions at the same time!

The curious reader may ask how to extend Eq. (19) when processors have different fault rates. Let \(X_{i}^{(q)}\), i ≥ 0, denote the IID random variables for the fault inter-arrival times on p_q, and assume that \(X_{i}^{(q)} \sim D_{X}^{(q)}\), where \(D_{X}^{(q)}\) is a continuous probability distribution with bounded (nonzero) expectation μ^(q). For instance, if μ^(2) = 3μ^(1), then (in expectation) processor 1 experiences three times more failures than processor 2. As before, consider a fixed time bound F, and let n_q(F) be the number of faults on p_q until time F. Equation (18) now writes \(\lim _{F \rightarrow +\infty } \frac {\mathbb E\left (n_{q}(F)\right )}{F} = \frac {1}{\mu ^{(q)}}\). Now let n(F) be the total number of faults on the whole platform until time F. The same proof as above leads to

$$\displaystyle \begin{aligned} \lim_{F \rightarrow +\infty} \frac{\mathbb E\left(n(F)\right)}{F} = \sum_{q=1}^{p} \frac{1}{\mu^{(q)}} \end{aligned} $$
(20)

Kella and Stadje [14] provide more results on the superposition of renewal processes (which is the actual mathematical name of the topic discussed here!).
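Equation (20) is easy to test empirically, even for non-Exponential laws. The sketch below is mine, with illustrative MTBF values; it uses uniformly distributed inter-arrival times (a continuous distribution with bounded expectation, deliberately not Exponential) and checks that the platform fault rate converges to the sum of the individual rates:

```python
import random

def fault_count(mu, horizon, rng):
    """Number of faults until `horizon` for one processor whose inter-arrival
    times are Uniform(0, 2*mu), hence with expectation mu (not Exponential)."""
    t, count = 0.0, 0
    while True:
        t += rng.uniform(0.0, 2.0 * mu)
        if t > horizon:
            return count
        count += 1

rng = random.Random(1)
mus = [100.0, 200.0, 300.0]          # per-processor MTBFs
horizon = 100_000.0
n_total = sum(fault_count(mu, horizon, rng) for mu in mus)

rate = n_total / horizon
predicted = sum(1.0 / mu for mu in mus)   # right-hand side of Eq. (20)
print(rate, predicted)  # both close to 0.0183
```

With identical processors (all μ^(q) equal to μ_ind), the predicted rate collapses to p/μ_ind, recovering Eq. (19).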

Appendix 4: Going Further with Prediction

The discussion on predictions in section “Fault Prediction” has been kept deliberately simple. For instance, when a fault is predicted, sometimes there is not enough time to take proactive action, because we are already checkpointing. In this case, there is no choice other than to ignore the prediction.

Furthermore, a better strategy should take into account at which point in the period the prediction occurs. After all, there is no reason to always trust the predictor, in particular if it has poor precision. Intuitively, the later the prediction takes place in the period, the more we are inclined to trust the predictor and take proactive actions, because the amount of work that we could lose gets larger and larger. On the contrary, if the prediction happens at the beginning of the period, we have to trade off the probability that the proactive checkpoint may be useless (if we take a proactive action) against the small amount of work that may be lost if a fault does actually happen (if we do not trust the predictor). The optimal approach is to never trust the predictor at the beginning of a period, and to always trust it at the end; the cross-over point depends on the time to take a proactive checkpoint and on the precision of the predictor. See Aupy et al. [2] for details.

Finally, it is more realistic to assume that the predictor cannot give the exact moment where the fault is going to strike, but rather will provide an interval of time, a.k.a. a prediction window. Aupy et al. [1] provide more information.

Appendix 5: Going Further with Replication

In the context of replication, there are two natural options for “counting” faults. The option chosen in section “Replication” is to allow new faults to hit processors that have already been hit. This is the option chosen by Ferreira et al. [8], who introduced the problem. Another option is to count only faults that hit running processors, and thus effectively kill replica pairs and interrupt the application. This second option may seem more natural, as the running processors are the only ones that matter for the application execution. It turns out that both options are almost equivalent: their values of the MNFTI (Mean Number of Faults To Interruption) differ only by one, as shown by Casanova et al. [5].
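The off-by-one relation can be observed with a small Monte Carlo sketch (mine, for an illustrative platform of two replica pairs; `already_hit` follows the counting of Ferreira et al. [8], where faults may strike already-hit processors, while the other mode counts only faults on running processors):

```python
import random

def mnfti(n_pairs, already_hit, trials, rng):
    """Monte Carlo estimate of the mean number of faults to interruption
    for n_pairs replica pairs. Processors 2i and 2i+1 form pair i."""
    total = 0
    for _ in range(trials):
        dead = set()
        faults = 0
        while True:
            if already_hit:
                victim = rng.randrange(2 * n_pairs)     # may re-hit a dead processor
            else:
                alive = [p for p in range(2 * n_pairs) if p not in dead]
                victim = rng.choice(alive)              # only running processors
            faults += 1
            buddy = victim ^ 1
            if buddy in dead:        # both replicas of a pair are now gone
                break
            dead.add(victim)
        total += faults
    return total / trials

rng = random.Random(7)
m_ah = mnfti(2, True, 100_000, rng)
m_rp = mnfti(2, False, 100_000, rng)
print(m_ah, m_rp)  # about 11/3 and 8/3: the estimates differ by one
```

For two pairs the exact values are 11/3 and 8/3 (easily checked by a small case analysis), so the two counting options indeed differ by exactly one fault.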

Speaking of faults, an important question is: why don’t we repair (or rejuvenate) processors on the fly, instead of doing so only when the whole application is forced to stop, recover from the last checkpoint, and restart execution? The answer is technical: current HPC resource management systems assign the user a fixed set of resources for the execution, and do not allow new resources (such as spare nodes) to be added dynamically during the execution. In fact, a new configuration is frequently assigned to the user at restart time. But nothing prevents us from enhancing the tools! It should then be possible to reserve a few spare nodes in addition to the computing nodes. These nodes would be used to migrate the system image of a replica node as soon as its buddy fails, in order to re-create the failed node on the fly. Of course, the surviving node must be isolated from the application while the migration is taking place, in order to maintain a coherent view of both nodes, and this induces some overhead. It would be quite interesting to explore such strategies.

Here are a few bibliographical notes about replication. Replication has long been used as a fault-tolerance mechanism in distributed systems (see the survey of Gärtner [12]), and in the context of volunteer computing (see the work of Kondo et al. [15]). Replication has recently received attention in the context of HPC (High Performance Computing) applications. Representative papers are those by Schroeder and Gibson [17], Zheng and Lan [20], Engelmann, Ong, and Scott [7], and Ferreira et al. [8]. While replicating all processors is very expensive, replicating only critical processes, or only a fraction of all processes, is a direction currently being explored under the name partial replication.

Speaking of critical processes, we make a final digression. The de-facto standard to enforce fault-tolerance in critical or embedded systems is Triple Modular Redundancy, or TMR. Computations are triplicated on three different processors, and if their results differ, a voting mechanism is called. TMR is not used to protect from fail-stop faults, but rather to detect and correct errors in the execution of the application. While we all like, say, safe planes protected by TMR, the cost is tremendous: by definition, two thirds of the resources are wasted (and this does not include the overhead of voting when an error is identified).

Appendix 6: Scheduling a Linear Chain of Tasks

In this exercise you are asked to help Alice (again). She is still writing her thesis but she does not want to checkpoint at given periods of time. She hates being interrupted in the middle of something because she loses concentration. She now wants to checkpoint only at the end of a chapter. She still has to decide after which chapters it is best to checkpoint.

The difference with the original problem is that checkpoints can only be taken at given time-steps. Formulated in an abstract way, we have a linear chain of n tasks (the n chapters in Alice’s thesis), T_1, T_2, …, T_n. Each task T_i has weight w_i (the time it takes to write that chapter). The cost to checkpoint after T_i is C_i. The time to recover from a fault depends upon where the last checkpoint was taken. For example, assume that T_i was checkpointed, and that T_{i+1} and T_{i+2} were not. If a fault strikes during the execution of T_{i+3}, we need to roll back and read the checkpoint of T_i from stable storage, which costs R_i. Then we start re-executing T_{i+1} and the following tasks. Note that the costs C_i and R_i are likely proportional to the chapter length.

As before, the inter-arrival times of the faults are IID random variables following the Exponential law Exp(λ). We must decide after which tasks to checkpoint, in order to minimize the expectation of the total time. Figure 12 gives you a hint. Time_C(i) is the optimal solution for the execution of tasks T_1, T_2, …, T_i. The solution to the problem is Time_C(n), and we use a dynamic programming algorithm to compute it. In the algorithm, we need Time_Z(i+1, j), the expected time to execute the segment of tasks [T_{i+1}, …, T_j] and to checkpoint the last one, T_j, knowing that there is a checkpoint before the first one (hence after T_i) and that no intermediate checkpoint is taken. Time_Z stands for Zero intermediate checkpoint. It turns out that we already know the value of Time_Z(i+1, j): check that we have

$$\displaystyle \begin{aligned}\ensuremath{\textsc{Time}_{\text{Z}}} (i+1,j) = \ensuremath{\mathbb{E}} \left (\ensuremath{\textsc{Time}} \left (\sum_{k=i+1}^{j} w_{k},\ensuremath{C} _{j},\ensuremath{D} ,\ensuremath{R} _{i},\lambda\right ) \right)\end{aligned}$$

and use Proposition 3.
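A possible solution to the exercise, sketched in Python (my own code, on a made-up instance; it uses Proposition 3 for Time_Z and the recurrence Time_C(j) = min over i < j of Time_C(i) + Time_Z(i+1, j), with the convention that a fault before the first checkpoint rolls back to the initial state at cost R_0):

```python
import math
from itertools import combinations

def time_z(work, ckpt, down, rec, lam):
    """Proposition 3: expected time for `work` units of work plus checkpoint `ckpt`."""
    T = work + ckpt
    return math.exp(lam * rec) * (1.0 / lam + down) * (math.exp(lam * T) - 1.0)

def best_checkpoint_plan(w, C, R, D, lam):
    """Dynamic programming over the position of the previous checkpoint.
    w[1..n] task weights, C[i]/R[i] checkpoint/recovery costs after task i,
    R[0] the cost of restarting from the initial state (w[0], C[0] are dummies)."""
    n = len(w) - 1
    best = [0.0] * (n + 1)
    for j in range(1, n + 1):
        best[j] = min(
            best[i] + time_z(sum(w[i + 1:j + 1]), C[j], D, R[i], lam)
            for i in range(j)
        )
    return best[n]

# Small illustrative instance: 5 chapters (all values made up)
w = [0.0, 3.0, 5.0, 2.0, 4.0, 1.0]
C = [0.0, 0.5, 0.8, 0.4, 0.7, 0.3]
R = [0.2, 0.5, 0.8, 0.4, 0.7, 0.3]
D, lam = 0.1, 0.05

dp = best_checkpoint_plan(w, C, R, D, lam)

# Brute-force check: enumerate every subset of intermediate checkpoints
def plan_cost(ckpts):            # ckpts: sorted checkpoint positions, ending with n
    total, prev = 0.0, 0
    for j in ckpts:
        total += time_z(sum(w[prev + 1:j + 1]), C[j], D, R[prev], lam)
        prev = j
    return total

n = len(w) - 1
brute = min(plan_cost(sorted(s) + [n]) for k in range(n)
            for s in combinations(range(1, n), k))
print(round(dp, 6), round(brute, 6))  # identical
```

The dynamic program runs in O(n²) calls to Time_Z (O(n) with prefix sums of the weights), whereas the brute force explores all 2^{n−1} checkpoint subsets; both agree on the optimum, which validates the recurrence.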

Fig. 12 (figure not shown): Hint for the exercise


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter


Cite this chapter

Aupy, G., Robert, Y. (2018). Scheduling for Fault-Tolerance: An Introduction. In: Prasad, S., Gupta, A., Rosenberg, A., Sussman, A., Weems, C. (eds) Topics in Parallel and Distributed Computing. Springer, Cham. https://doi.org/10.1007/978-3-319-93109-8_6


  • DOI: https://doi.org/10.1007/978-3-319-93109-8_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93108-1

  • Online ISBN: 978-3-319-93109-8

  • eBook Packages: Computer Science (R0)
