
Scheduling for Fault-Tolerance: An Introduction

Chapter in Topics in Parallel and Distributed Computing

Abstract

Parallel execution time is expected to decrease as the number of processors increases. We show in this chapter that this is not as easy as it seems, even for perfectly parallel applications. In particular, processors are subject to faults. The more processors are available, the more likely faults will strike during execution. The main strategy to cope with faults in High Performance Computing is checkpointing. We introduce the reader to this approach, and explain how to determine the optimal checkpointing period through scheduling techniques. We also detail how to combine checkpointing with prediction and with replication.


Notes

  1. As a side note, one needs only 23 persons for the probability of a common birthday to reach 0.5 (a question often asked in geek evenings).

  2. By the way, there is a nice little exercise in “Appendix 6: Scheduling a Linear Chain of Tasks” if you are motivated to help.

References

  1. G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni. Checkpointing strategies with prediction windows. In 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing (PRDC), pages 1–10. IEEE, 2013.

  2. G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni. Checkpointing algorithms and fault prediction. Journal of Parallel and Distributed Computing, 74(2):2048–2064, 2014.

  3. M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien. Checkpointing strategies for parallel jobs. In Proceedings of SC’11, 2011.

  4. F. Cappello, A. Geist, B. Gropp, L. V. Kalé, B. Kramer, and M. Snir. Toward Exascale Resilience. Int. Journal of High Performance Computing Applications, 23(4):374–388, 2009.

  5. H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni. Combining process replication and checkpointing for resilience on exascale systems. Research report RR-7951, INRIA, May 2012.

  6. J. T. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. FGCS, 22(3):303–312, 2006.

  7. C. Engelmann, H. H. Ong, and S. L. Scott. The case for modular redundancy in large-scale high performance computing systems. In Proc. of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pages 189–194, 2009.

  8. K. Ferreira, J. Stearley, J. H. Laros III, R. Oldfield, K. Pedretti, R. Brightwell, R. Riesen, P. G. Bridges, and D. Arnold. Evaluating the Viability of Process Replication Reliability for Exascale Systems. In Proc. of the ACM/IEEE SC Conf., 2011.

  9. P. Flajolet, P. J. Grabner, P. Kirschenhofer, and H. Prodinger. On Ramanujan’s Q-Function. J. Computational and Applied Mathematics, 58:103–116, 1995.

  10. A. Gainaru, F. Cappello, and W. Kramer. Taming of the shrew: Modeling the normal and faulty behavior of large-scale HPC systems. In Proc. IPDPS’12, 2012.

  11. A. Gainaru, F. Cappello, M. Snir, and W. Kramer. Failure prediction for HPC systems and applications: Current situation and open issues. Int. J. High Perform. Comput. Appl., 27(3):273–282, 2013.

  12. F. Gärtner. Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Computing Surveys, 31(1), 1999.

  13. T. Hérault and Y. Robert, editors. Fault-Tolerance Techniques for High-Performance Computing. Computer Communications and Networks. Springer Verlag, 2015.

  14. O. Kella and W. Stadje. Superposition of renewal processes and an application to multi-server queues. Statistics & Probability Letters, 76(17):1914–1924, 2006.

  15. D. Kondo, A. Chien, and H. Casanova. Scheduling Task Parallel Applications for Rapid Application Turnaround on Enterprise Desktop Grids. J. Grid Computing, 5(4):379–405, 2007.

  16. S. M. Ross. Introduction to Probability Models, Eleventh Edition. Academic Press, 2009.

  17. B. Schroeder and G. Gibson. Understanding failures in petascale computers. Journal of Physics: Conference Series, 78(1), 2007.

  18. J. W. Young. A first order approximation to the optimum checkpoint interval. Comm. of the ACM, 17(9):530–531, 1974.

  19. L. Yu, Z. Zheng, Z. Lan, and S. Coghlan. Practical online failure prediction for Blue Gene/P: Period-based vs event-driven. In Dependable Systems and Networks Workshops (DSN-W), pages 259–264, 2011.

  20. Z. Zheng and Z. Lan. Reliability-aware scalability models for high performance computing. In Proc. of the IEEE Conference on Cluster Computing, 2009.

  21. Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman. A practical failure prediction with location and lead time for Blue Gene/P. In Dependable Systems and Networks Workshops (DSN-W), pages 15–22, 2010.

Author information

Correspondence to Yves Robert.

Appendices

Appendix 1: First-Order Approximation of T_FO

It is interesting to point out why the value of T_FO given by Eq. (8) is only a first-order approximation, even for large jobs. Indeed, several restrictions must hold for the approach to be valid:

  • We have stated that the expected number of faults during execution is \(N_{\text{faults}} = \frac {\textsc {Time}_{\text{final}} }{\mu } \), and that the expected time lost due to a fault is \(T_{\text{lost}} = \frac {T}{2} + D + R \). Both statements are true individually, but the expectation of a product is the product of the expectations only if the random variables are independent, which is not the case here, because \(\textsc {Time}_{\text{final}}\) depends upon the fault inter-arrival times.

  • In Eq. (4), we have to enforce C ≤ T in order to have Waste_FF ≤ 1.

  • In Eq. (5), we have to enforce \(D + R + \frac {T}{2} \leq \mu \) in order to have Waste_fault ≤ 1. We must cap the period to enforce this latter constraint. Intuitively, we need μ to be large enough for Eq. (5) to make sense. However, for large-scale platforms, regardless of the value of the individual MTBF μ_ind, there is always a threshold in the number of components p above which the platform MTBF, \(\mu = \frac {\mu _{\text{ind}}}{p}\), becomes too small for Eq. (5) to be valid.

  • Equation (5) is accurate only when two or more faults do not strike within the same period. Although this is unlikely when μ is large compared to T, the possibility of several faults occurring during the same period cannot be eliminated.

To ensure that the condition of having at most a single fault per period is met with high probability, we cap the length of the period: we enforce the condition T ≤ αμ, where α is a tuning parameter chosen as follows. The number of faults during a period of length T can be modeled as a Poisson process of parameter \(\beta = \frac {T}{\mu }\). The probability of having k ≥ 0 faults is \(P(X=k) = \frac {\beta ^{k}}{k!} e^{-\beta }\), where X is the random variable counting the number of faults. Hence the probability of having two or more faults is \(\pi = P(X \geq 2) = 1 - (P(X = 0) + P(X = 1)) = 1 - (1 + \beta )e^{-\beta }\). If we take α = 0.27, then π ≤ 0.03, so the approximation is valid when the period range is bounded accordingly. Indeed, with such a conservative value for α, overlapping faults strike only 3% of the checkpointing segments on average, so the model is quite reliable. For consistency, we also enforce the same type of bound on the checkpoint time, and on the downtime and recovery: C ≤ αμ and D + R ≤ αμ. However, enforcing these constraints may lead to using a sub-optimal period: it may well be the case that the optimal period \(\sqrt {2(\mu - (D +R )) C }\) of Eq. (8) does not belong to the admissible interval [C, αμ]. In that case, the waste is minimized at one of the bounds of the admissible interval, because, as seen from Eq. (7), the waste is a convex function of the period.
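As a quick sanity check of this bound (a small Python sketch of mine, with an arbitrary MTBF value), one can evaluate π = 1 − (1 + β)e^{−β} at the cap β = α:

```python
import math

def prob_two_or_more_faults(T, mu):
    """Probability of >= 2 faults in a period of length T,
    modeling fault arrivals as a Poisson process of rate 1/mu."""
    beta = T / mu
    return 1.0 - (1.0 + beta) * math.exp(-beta)

alpha = 0.27          # tuning parameter from the text
mu = 1000.0           # platform MTBF (arbitrary units, for illustration)
T_cap = alpha * mu    # capped period length

pi = prob_two_or_more_faults(T_cap, mu)
print(f"pi at the cap: {pi:.4f}")   # about 0.03, as claimed in the text
```

Shorter periods give smaller π, so the cap is indeed the worst case over the admissible range.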

We conclude this discussion on a positive note. While capping the period and enforcing a lower bound on the MTBF are mandatory for mathematical rigor, simulations in Aupy et al. [2] show that actual job executions can always use the value from Eq. (8), accounting for multiple faults whenever they occur by re-executing the work until success. The first-order model turns out to be surprisingly robust!
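The convexity argument can also be illustrated numerically. The sketch below is mine, with illustrative parameter values; it assumes the usual first-order combination Waste = Waste_FF + Waste_fault − Waste_FF · Waste_fault, with Waste_FF = C/T and Waste_fault = (D + R + T/2)/μ, locates the minimizer on a dense grid, and compares it with the closed form of Eq. (8):

```python
import math

# Illustrative parameters (not from the chapter); all times in the same unit
C, D, R, mu = 2.0, 1.0, 1.0, 1000.0

def waste(T):
    # First-order waste: fault-free overhead combined with fault-induced loss
    w_ff = C / T
    w_fault = (D + R + T / 2.0) / mu
    return w_ff + w_fault - w_ff * w_fault

# Dense grid search over the admissible period range [C, ...]
candidates = [C + i * 0.01 for i in range(20000)]
T_best = min(candidates, key=waste)

T_fo = math.sqrt(2 * (mu - (D + R)) * C)   # closed form of Eq. (8)
print(T_best, T_fo)  # both close to 63.18
```

With this combined waste, the grid minimizer matches the closed form essentially exactly, which is a good way to convince oneself that Eq. (8) is the right first-order answer.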

Appendix 2: Optimal Value of T_FO

There is a beautiful method to compute the optimal value of T_FO accurately. First we show how to compute the expected time \(\mathbb {E} (\textsc {Time} (T - C ,C ,D ,R ,\lambda ))\) needed to execute a work of duration T − C followed by a checkpoint of duration C, given the values of C, D, and R, and a fault distribution Exp(λ). If a fault interrupts a given trial before success, there is a downtime of duration D followed by a recovery of length R. We assume that faults can strike during checkpoint and recovery, but not during downtime.

Proposition 3

\(\ensuremath {\mathbb {E}} (\ensuremath {\textsc {Time}} (\ensuremath {T} - \ensuremath {C} ,\ensuremath {C} ,\ensuremath {D} ,\ensuremath {R} ,\lambda ))= e^{\lambda \ensuremath {R} } \left (\frac {1}{\lambda } + \ensuremath {D} \right ) (e^{\lambda \ensuremath {T} } -1 )\).

Proof

For simplification, we write Time instead of Time(T − C, C, D, R, λ) in the proof below. Consider the following two cases:

  1. Either there is no fault during the execution of the period, and the time needed is exactly T;

  2. Or a fault strikes before the period completes successfully, and some additional delays are incurred. More specifically, as seen for the first-order approximation, there are two sources of delay: the time spent computing by the processors before the fault (accounted for by the variable Time_lost), and the time spent for downtime and recovery (accounted for by the variable Time_rec). Once a successful recovery has been completed, there still remain T − C units of work to execute.

Thus Time obeys the following recursive equation:

$$\displaystyle \begin{aligned} \ensuremath{\textsc{Time}} =\left\{ \begin{array}{ll} \ensuremath{T} & \text{if there is no fault} \\ \ensuremath{\ensuremath{\textsc{Time}_{\text{lost}}} } + \ensuremath{\ensuremath{\textsc{Time}_{\text{rec}}} } +\ensuremath{\textsc{Time}} & \text{otherwise} \end{array}\right. {} \end{aligned} $$
(15)
Time_lost:

denotes the amount of time spent by the processors before the first fault, knowing that this fault occurs within the next T units of time. In other terms, it is the time that is wasted because the computation and checkpoint were not successfully completed (the corresponding value in Fig. 2 is T_lost − D − R).

Time_rec:

represents the amount of time needed by the system to recover from the fault (the corresponding value in Fig. 2 is D + R).

The expectation of Time can be computed from Eq. (15) by weighting each case by its probability to occur:

$$\displaystyle \begin{aligned} \ensuremath{\mathbb{E}} (\ensuremath{\textsc{Time}} ) &= \ensuremath{\mathbb{P}} \left ( \text{no fault}\right ) \cdot \ensuremath{T} +\ensuremath{\mathbb{P}} \left ( \text{a fault strikes}\right ) \cdot \ensuremath{\mathbb{E}} \left ( \ensuremath{\ensuremath{\textsc{Time}_{\text{lost}}} } + \ensuremath{\ensuremath{\textsc{Time}_{\text{rec}}} } +\ensuremath{\textsc{Time}} \right) \\ &= e^{- \lambda \ensuremath{T} } \ensuremath{T} + (1-e^{- \lambda \ensuremath{T} } ) \left( \ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{lost}}} } )+ \ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{rec}}} } ) +\ensuremath{\mathbb{E}} (\ensuremath{\textsc{Time}} ) \right)\ , \end{aligned} $$

which simplifies into:

$$\displaystyle \begin{aligned} \ensuremath{\mathbb{E}} (\ensuremath{\textsc{Time}} )= \ensuremath{T} + (e^{\lambda \ensuremath{T} } -1 ) \left ( \ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{lost}}} } )+ \ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{rec}}} } ) \right ) {} \end{aligned} $$
(16)

We have \(\ensuremath {\mathbb {E}} (\ensuremath {\ensuremath {\textsc {Time}_{\text{lost}}} } ) = \int _0^{\infty } x \ensuremath {\mathbb {P}} (X=x|X < \ensuremath {T} ) dx = \frac {1}{\ensuremath {\mathbb {P}} (X < \ensuremath {T} )}\int _0^{\ensuremath {T} } x \lambda e^{-\lambda x} dx\), where \(\ensuremath {\mathbb {P}} (X < \ensuremath {T} ) = 1-e^{-\lambda \ensuremath {T} }\). Integrating by parts, we derive that

$$\displaystyle \begin{aligned} \ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{lost}}} } ) = \frac{1}{\lambda} - \frac{\ensuremath{T} }{e^{\lambda \ensuremath{T} } - 1} {} \end{aligned} $$
(17)

Next, the reasoning to compute \(\ensuremath {\mathbb {E}} (\ensuremath {\ensuremath {\textsc {Time}_{\text{rec}}} } )\) is very similar to that for \(\ensuremath {\mathbb {E}} (\ensuremath {\textsc {Time}} )\) (note that there can be no fault during D, but there can be some during R):

$$\displaystyle \begin{aligned} \ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{rec}}} } ) = e^{-\lambda \ensuremath{R} } (\ensuremath{D} +\ensuremath{R} ) + (1-e^{-\lambda \ensuremath{R} }) (\ensuremath{D} +\ensuremath{\mathbb{E}} (\ensuremath{R_{lost}} )+\ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{rec}}} } )) \end{aligned}$$

Here, R_lost is the amount of time lost executing the recovery before a fault happens, knowing that this fault occurs within the next R units of time. Replacing T by R in Eq. (17), we obtain \(\ensuremath {\mathbb {E}} (\ensuremath {R_{lost}} ) = \frac {1}{\lambda } - \frac {\ensuremath {R} }{e^{\lambda \ensuremath {R} } - 1}\). The expression for \(\ensuremath {\mathbb {E}} (\ensuremath {\ensuremath {\textsc {Time}_{\text{rec}}} } )\) then simplifies to

$$\displaystyle \begin{aligned} \ensuremath{\mathbb{E}} (\ensuremath{\ensuremath{\textsc{Time}_{\text{rec}}} } ) = \ensuremath{D} e^{ \lambda \ensuremath{R} } + \frac{1}{\lambda} (e^{\lambda \ensuremath{R} } - 1)\end{aligned} $$

Plugging the values of \(\ensuremath {\mathbb {E}} (\ensuremath {\ensuremath {\textsc {Time}_{\text{lost}}} } )\) and \(\ensuremath {\mathbb {E}} (\ensuremath {\ensuremath {\textsc {Time}_{\text{rec}}} } )\) into Eq. (16) leads to the desired value:

$$\displaystyle \begin{aligned} \ensuremath{\mathbb{E}} (\ensuremath{\textsc{Time}} (\ensuremath{T} - \ensuremath{C} ,\ensuremath{C} ,\ensuremath{D} ,\ensuremath{R} ,\lambda))= e^{\lambda \ensuremath{R} } \left(\frac{1}{\lambda} + \ensuremath{D} \right) (e^{\lambda \ensuremath{T} } -1 ) \end{aligned}$$

Proposition 3 is the key to proving that the optimal checkpointing strategy is periodic. Indeed, consider an application of duration Time_base, and divide the execution into periods of different lengths T_i, each with a checkpoint at the end. The expectation of the total execution time is the sum of the expectations of the time needed for each period. Proposition 3 shows that the expected time for a period is a convex function of its length, hence all periods must be equal: T_i = T for all i.
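Proposition 3 is easy to check by simulation. The Python sketch below is mine, with illustrative parameters; it simulates the recursive process of Eq. (15), with faults striking during work, checkpoint, and recovery but not during downtime, and compares the empirical mean with the closed form:

```python
import math
import random

def expected_time(T, C, D, R, lam):
    """Closed form of Proposition 3 (T is the full period, work T - C plus checkpoint C)."""
    return math.exp(lam * R) * (1.0 / lam + D) * (math.exp(lam * T) - 1.0)

def simulate_once(T, C, D, R, lam, rng):
    """One random execution of a period of total length T (work T - C, then checkpoint C)."""
    elapsed = 0.0
    while True:
        x = rng.expovariate(lam)
        if x >= T:                     # no fault before work + checkpoint complete
            return elapsed + T
        elapsed += x + D               # lost work, then downtime
        while True:                    # recovery, which can itself be hit by faults
            y = rng.expovariate(lam)
            if y >= R:
                elapsed += R
                break
            elapsed += y + D

rng = random.Random(42)
T, C, D, R, lam = 10.0, 2.0, 1.0, 1.0, 0.01
n = 200_000
empirical = sum(simulate_once(T, C, D, R, lam, rng) for _ in range(n)) / n
print(empirical, expected_time(T, C, D, R, lam))  # both close to 10.7
```

The empirical mean converges to \(e^{\lambda R}(\frac{1}{\lambda} + D)(e^{\lambda T} - 1)\) as the number of trials grows.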

It remains to find the best number of periods, or equivalently, the size of each work chunk before checkpointing. With k periods of length \(T =\frac {\textsc {Time}_{\text{base}} }{k}\), we have to minimize a function that depends on k. This is easy for a skilled mathematician who knows the Lambert function \(\mathbb {L}\) (defined by \(\mathbb {L} (z)e^{\mathbb {L} (z)}=z\)). She would find the optimal rational value k_opt of k by differentiation, prove that the objective function is convex, and conclude that the optimal value is either ⌊k_opt⌋ or ⌈k_opt⌉, thereby determining the optimal period T_opt. What if you are not a skilled mathematician? No problem: simply use T_FO as a first-order approximation, and be comforted that the first-order term in the Taylor expansion of T_opt is … T_FO! See Bougeret et al. [3] for all details.
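In practice one can skip the Lambert function and minimize over the integer number of periods directly. The sketch below is mine, with illustrative parameters; it charges one checkpoint C per period, so with k periods each period covers a work chunk Time_base/k plus the checkpoint C, and its expected duration is given by Proposition 3:

```python
import math

# Illustrative parameters (not from the chapter); lam = 1/mu
time_base, C, D, R, lam = 1000.0, 2.0, 1.0, 1.0, 0.001

def expected_period_time(work):
    """Proposition 3: expected time for `work` units of work plus a checkpoint C."""
    T = work + C
    return math.exp(lam * R) * (1.0 / lam + D) * (math.exp(lam * T) - 1.0)

def expected_total_time(k):
    """k equal-size periods, each ending with a checkpoint."""
    return k * expected_period_time(time_base / k)

best_k = min(range(1, 101), key=expected_total_time)

mu = 1.0 / lam
T_fo = math.sqrt(2 * (mu - (D + R)) * C)   # Eq. (8)
k_fo = time_base / (T_fo - C)              # chunk count suggested by T_FO
print(best_k, round(k_fo, 2))              # best_k is floor or ceil of k_fo
```

For these parameters the exact integer optimum coincides with rounding the T_FO-based value, illustrating how good the first-order approximation is.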

Appendix 3: MTBF of a Platform with p Parallel Processors

In this section we give another proof of Proposition 1. Interestingly, it applies to any continuous probability distribution with bounded (nonzero) expectation, not just Exponential laws.

First we prove that Eq. (1) does hold true. Consider a single processor, say processor p_q. Let X_i, i ≥ 0, denote the IID (independent and identically distributed) random variables for the fault inter-arrival times on p_q, and assume that X_i ∼ D_X, where D_X is a continuous probability distribution with bounded (nonzero) expectation μ_ind. In particular, \(\mathbb E\left (X_{i}\right ) = \mu _{\text{ind}}\) for all i. Consider a fixed time bound F. Let n_q(F) be the number of faults on p_q until time F. More precisely, the n_q(F)-th fault is the last one to happen before time F or at time F, and the (n_q(F) + 1)-st fault is the first to happen after time F. By definition of n_q(F), we have

$$\displaystyle \begin{aligned}\sum_{i=1}^{n_{q}(F)} X_{i} \leq F < \sum_{i=1}^{n_{q}(F)+1} X_{i}.\end{aligned}$$

Using Wald’s equation (see the textbook of Ross [16, p. 420]), with n_q(F) as a stopping criterion, we derive:

$$\displaystyle \begin{aligned}\mathbb E\left(n_{q}(F)\right) \mu_{\text{ind}} \leq F \leq (\mathbb E\left(n_{q}(F)\right) +1) \mu_{\text{ind}},\end{aligned}$$

and we obtain:

$$\displaystyle \begin{aligned} \lim_{F \rightarrow +\infty} \frac{\mathbb E\left(n_{q}(F)\right)}{F} = \frac{1}{\mu_{\text{ind}}}. {} \end{aligned} $$
(18)

As promised, Eq. (18) is exactly Eq. (1).

Now consider a platform with p identical processors, whose fault inter-arrival times are IID random variables that follow the distribution D_X. Unfortunately, if D_X is not an Exponential law, then the inter-arrival times of the faults of the whole platform, i.e., of the super-processor of section “Checkpointing on a Parallel Platform”, are no longer IID. The minimum trick used in the proof of Proposition 1 works only for the first fault. For the following ones, we would need to remember the history of the previous faults, and things get too complicated. However, we can still define the MTBF μ of the super-processor. Using Eq. (18), μ must satisfy:

$$\displaystyle \begin{aligned}\lim_{F \rightarrow +\infty} \frac{\mathbb E\left(n(F)\right)}{F} = \frac{1}{\mu},\end{aligned}$$

where n(F) is the number of faults on the super-processor until time F. But does the limit always exist? And if so, what is its value?

The answer to both questions is not difficult. Consider a fixed time bound F as before. Let n(F) be the number of faults on the whole platform until time F, and let m_q(F) be the number of those faults that strike component q, so that \(n(F) = \sum _{q=1}^{p} m_{q}(F)\). From Eq. (18) again, we have for each component q:

$$\displaystyle \begin{aligned} \lim_{F \rightarrow +\infty} \frac{\mathbb E\left(m_{q}(F)\right)}{F} = \frac{1}{\mu_{\text{ind}}}. \end{aligned}$$

Since \(n(F) = \sum _{q=1}^{p} m_{q}(F)\), we also have:

$$\displaystyle \begin{aligned} \lim_{F \rightarrow +\infty} \frac{\mathbb E\left(n(F)\right)}{F} = \frac{p}{\mu_{\text{ind}}} \end{aligned} $$
(19)

which answers both questions at the same time!

The curious reader may ask how to extend Eq. (19) when processors have different fault rates. Let \(X_{i}^{(q)}\), i ≥ 0, denote the IID random variables for the fault inter-arrival times on p_q, and assume that \(X_{i}^{(q)} \sim D_{X}^{(q)}\), where \(D_{X}^{(q)}\) is a continuous probability distribution with bounded (nonzero) expectation μ^(q). For instance, if μ^(2) = 3μ^(1), then (in expectation) processor 1 experiences three times more failures than processor 2. As before, consider a fixed time bound F, and let n_q(F) be the number of faults on p_q until time F. Equation (18) now writes \(\lim _{F \rightarrow +\infty } \frac {\mathbb E\left (n_{q}(F)\right )}{F} = \frac {1}{\mu ^{(q)}}\). Now let n(F) be the total number of faults on the whole platform until time F. The same proof as above leads to

$$\displaystyle \begin{aligned} \lim_{F \rightarrow +\infty} \frac{\mathbb E\left(n(F)\right)}{F} = \sum_{q=1}^{p} \frac{1}{\mu^{(q)}} \end{aligned} $$
(20)

Kella and Stadje [14] provide more results on the superposition of renewal processes (which is the actual mathematical name of the topic discussed here!).
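Equation (20) is easy to test empirically, even for non-Exponential laws. The sketch below is mine, with illustrative MTBF values; it uses uniformly distributed inter-arrival times (a continuous distribution with bounded expectation, deliberately not Exponential) and checks that the platform fault rate converges to the sum of the individual rates:

```python
import random

def fault_count(mu, horizon, rng):
    """Number of faults until `horizon` for one processor whose inter-arrival
    times are Uniform(0, 2*mu), hence with expectation mu (not Exponential)."""
    t, count = 0.0, 0
    while True:
        t += rng.uniform(0.0, 2.0 * mu)
        if t > horizon:
            return count
        count += 1

rng = random.Random(1)
mus = [100.0, 200.0, 300.0]          # per-processor MTBFs
horizon = 100_000.0
n_total = sum(fault_count(mu, horizon, rng) for mu in mus)

rate = n_total / horizon
predicted = sum(1.0 / mu for mu in mus)   # right-hand side of Eq. (20)
print(rate, predicted)  # both close to 0.0183
```

With identical processors (all μ^(q) equal to μ_ind), the predicted rate collapses to p/μ_ind, recovering Eq. (19).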

Appendix 4: Going Further with Prediction

The discussion on predictions in section “Fault Prediction” has been kept deliberately simple. For instance, when a fault is predicted, sometimes there is not enough time to take proactive action, because we are already checkpointing. In this case, there is no choice other than to ignore the prediction.

Furthermore, a better strategy should take into account at which point in the period the prediction occurs. After all, there is no reason to always trust the predictor, in particular if it has poor precision. Intuitively, the later the prediction takes place in the period, the more we are inclined to trust the predictor and take proactive actions, because the amount of work that we could lose gets larger and larger. On the contrary, if the prediction happens at the beginning of the period, we have to trade off the probability that the proactive checkpoint may be useless (if we take a proactive action) against the small amount of work that may be lost if a fault does actually happen (if we do not trust the predictor). The optimal approach is to never trust the predictor at the beginning of a period, and to always trust it at the end; the cross-over point depends on the time to take a proactive checkpoint and on the precision of the predictor. See Aupy et al. [2] for details.

Finally, it is more realistic to assume that the predictor cannot give the exact moment where the fault is going to strike, but rather will provide an interval of time, a.k.a. a prediction window. Aupy et al. [1] provide more information.

Appendix 5: Going Further with Replication

In the context of replication, there are two natural options for “counting” faults. The option chosen in section “Replication” is to allow new faults to hit processors that have already been hit. This is the option chosen by Ferreira et al. [8], who introduced the problem. Another option is to count only faults that hit running processors, and thus effectively kill replica pairs and interrupt the application. This second option may seem more natural, as the running processors are the only ones that matter for the application execution. It turns out that both options are almost equivalent: their values of the MNFTI (Mean Number of Faults To Interruption) differ only by one, as shown by Casanova et al. [5].
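The off-by-one relation can be observed with a small Monte Carlo sketch (mine, for an illustrative platform of two replica pairs; `already_hit` follows the counting of Ferreira et al. [8], where faults may strike already-hit processors, while the other mode counts only faults on running processors):

```python
import random

def mnfti(n_pairs, already_hit, trials, rng):
    """Monte Carlo estimate of the mean number of faults to interruption
    for n_pairs replica pairs. Processors 2i and 2i+1 form pair i."""
    total = 0
    for _ in range(trials):
        dead = set()
        faults = 0
        while True:
            if already_hit:
                victim = rng.randrange(2 * n_pairs)     # may re-hit a dead processor
            else:
                alive = [p for p in range(2 * n_pairs) if p not in dead]
                victim = rng.choice(alive)              # only running processors
            faults += 1
            buddy = victim ^ 1
            if buddy in dead:        # both replicas of a pair are now gone
                break
            dead.add(victim)
        total += faults
    return total / trials

rng = random.Random(7)
m_ah = mnfti(2, True, 100_000, rng)
m_rp = mnfti(2, False, 100_000, rng)
print(m_ah, m_rp)  # about 11/3 and 8/3: the estimates differ by one
```

For two pairs the exact values are 11/3 and 8/3 (easily checked by a small case analysis), so the two counting options indeed differ by exactly one fault.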

Speaking of faults, an important question is: why don’t we repair (or rejuvenate) processors on the fly, instead of doing so only when the whole application is forced to stop, recover from the last checkpoint, and restart execution? The answer is technical: current HPC resource management systems assign the user a fixed set of resources for the execution, and do not allow new resources (such as spare nodes) to be added dynamically during the execution. In fact, a new configuration is frequently assigned to the user at restart time. But nothing prevents us from enhancing the tools! It should then be possible to reserve a few spare nodes in addition to the computing nodes. These nodes would be used to migrate the system image of a replica node as soon as its buddy fails, in order to re-create the failed node on the fly. Of course, the surviving node must be isolated from the application while the migration is taking place, in order to maintain a coherent view of both nodes, and this induces some overhead. It would be quite interesting to explore such strategies.

Here are a few bibliographical notes about replication. Replication has long been used as a fault-tolerance mechanism in distributed systems (see the survey of Gärtner [12]), and in the context of volunteer computing (see the work of Kondo et al. [15]). Replication has recently received attention in the context of HPC (High Performance Computing) applications. Representative papers are those by Schroeder and Gibson [17], Zheng and Lan [20], Engelmann, Ong, and Scott [7], and Ferreira et al. [8]. While replicating all processors is very expensive, replicating only critical processes, or only a fraction of all processes, is a direction currently being explored under the name partial replication.

Speaking of critical processes, we make a final digression. The de-facto standard to enforce fault-tolerance in critical or embedded systems is Triple Modular Redundancy, or TMR. Computations are triplicated on three different processors, and if their results differ, a voting mechanism is called. TMR is not used to protect from fail-stop faults, but rather to detect and correct errors in the execution of the application. While we all like, say, safe planes protected by TMR, the cost is tremendous: by definition, two thirds of the resources are wasted (and this does not include the overhead of voting when an error is identified).

Appendix 6: Scheduling a Linear Chain of Tasks

In this exercise you are asked to help Alice (again). She is still writing her thesis but she does not want to checkpoint at given periods of time. She hates being interrupted in the middle of something because she loses concentration. She now wants to checkpoint only at the end of a chapter. She still has to decide after which chapters it is best to checkpoint.

The difference with the original problem is that checkpoints can only be taken at given time-steps. Formulated in an abstract way, we have a linear chain of n tasks (the n chapters in Alice’s thesis), T_1, T_2, …, T_n. Each task T_i has weight w_i (the time it takes to write that chapter). The cost to checkpoint after T_i is C_i. The time to recover from a fault depends upon where the last checkpoint was taken. For example, assume that T_i was checkpointed, and that T_{i+1} and T_{i+2} were not. If a fault strikes during the execution of T_{i+3}, we need to roll back and read the checkpoint of T_i from stable storage, which costs R_i. Then we start re-executing T_{i+1} and the following tasks. Note that the costs C_i and R_i are likely proportional to the chapter length.

As before, the inter-arrival times of the faults are IID random variables following the Exponential law Exp(λ). We must decide after which tasks to checkpoint, in order to minimize the expectation of the total time. Figure 12 gives you a hint. Time_C(i) is the optimal solution for the execution of tasks T_1, T_2, …, T_i. The solution to the problem is Time_C(n), and we use a dynamic programming algorithm to compute it. In the algorithm, we need Time_Z(i+1, j), the expected time to execute the segment of tasks [T_{i+1}, …, T_j] and to checkpoint the last one, T_j, knowing that there is a checkpoint before the first one (hence after T_i) and that no intermediate checkpoint is taken. Time_Z stands for Zero intermediate checkpoint. It turns out that we already know the value of Time_Z(i+1, j): check that we have

$$\displaystyle \begin{aligned}\ensuremath{\textsc{Time}_{\text{Z}}} (i+1,j) = \ensuremath{\mathbb{E}} \left (\ensuremath{\textsc{Time}} \left (\sum_{k=i+1}^{j} w_{k},\ensuremath{C} _{j},\ensuremath{D} ,\ensuremath{R} _{i},\lambda\right ) \right)\end{aligned}$$

and use Proposition 3.
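A possible solution to the exercise, sketched in Python (my own code, on a made-up instance; it uses Proposition 3 for Time_Z and the recurrence Time_C(j) = min over i < j of Time_C(i) + Time_Z(i+1, j), with the convention that a fault before the first checkpoint rolls back to the initial state at cost R_0):

```python
import math
from itertools import combinations

def time_z(work, ckpt, down, rec, lam):
    """Proposition 3: expected time for `work` units of work plus checkpoint `ckpt`."""
    T = work + ckpt
    return math.exp(lam * rec) * (1.0 / lam + down) * (math.exp(lam * T) - 1.0)

def best_checkpoint_plan(w, C, R, D, lam):
    """Dynamic programming over the position of the previous checkpoint.
    w[1..n] task weights, C[i]/R[i] checkpoint/recovery costs after task i,
    R[0] the cost of restarting from the initial state (w[0], C[0] are dummies)."""
    n = len(w) - 1
    best = [0.0] * (n + 1)
    for j in range(1, n + 1):
        best[j] = min(
            best[i] + time_z(sum(w[i + 1:j + 1]), C[j], D, R[i], lam)
            for i in range(j)
        )
    return best[n]

# Small illustrative instance: 5 chapters (all values made up)
w = [0.0, 3.0, 5.0, 2.0, 4.0, 1.0]
C = [0.0, 0.5, 0.8, 0.4, 0.7, 0.3]
R = [0.2, 0.5, 0.8, 0.4, 0.7, 0.3]
D, lam = 0.1, 0.05

dp = best_checkpoint_plan(w, C, R, D, lam)

# Brute-force check: enumerate every subset of intermediate checkpoints
def plan_cost(ckpts):            # ckpts: sorted checkpoint positions, ending with n
    total, prev = 0.0, 0
    for j in ckpts:
        total += time_z(sum(w[prev + 1:j + 1]), C[j], D, R[prev], lam)
        prev = j
    return total

n = len(w) - 1
brute = min(plan_cost(sorted(s) + [n]) for k in range(n)
            for s in combinations(range(1, n), k))
print(round(dp, 6), round(brute, 6))  # identical
```

The dynamic program runs in O(n²) calls to Time_Z (O(n) with prefix sums of the weights), whereas the brute force explores all 2^{n−1} checkpoint subsets; both agree on the optimum, which validates the recurrence.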

Fig. 12 (figure not shown): Hint for the exercise


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter


Cite this chapter

Aupy, G., Robert, Y. (2018). Scheduling for Fault-Tolerance: An Introduction. In: Prasad, S., Gupta, A., Rosenberg, A., Sussman, A., Weems, C. (eds) Topics in Parallel and Distributed Computing. Springer, Cham. https://doi.org/10.1007/978-3-319-93109-8_6


  • DOI: https://doi.org/10.1007/978-3-319-93109-8_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93108-1

  • Online ISBN: 978-3-319-93109-8

  • eBook Packages: Computer Science (R0)
