
1 Introduction

Consider a long-running job that requests N processors from the batch scheduler. Resilience to fail-stop errors is provided by a Checkpoint/Restart (CR) mechanism, which is the de-facto standard approach for High-Performance Computing (HPC) applications. After each failure, the application restarts from the last checkpoint, but the number of available processors decreases, assuming the application can continue execution after a failure (e.g., using ULFM [3]). Up to which point should the execution proceed before requesting a new allocation with N fresh resources from the batch scheduler?

The answer depends upon the nature of the application. For a \(\textsc {Rigid}\) application, the number of processors must remain constant throughout the execution. The question is then to decide the number F of processors (out of the N available initially) that will be used as spares. With F spares, the application can tolerate F failures. The application always executes with \(N-F\) processors: after each failure, it restarts from the last checkpoint and continues executing with \(N-F\) processors, the faulty processor having been replaced by a spare. The application stops when the \((F+1)\)st failure strikes, and relinquishes the current allocation. It then asks for a new allocation with N processors, which takes a wait time, \(D \), to start (as other applications are most likely using the platform concurrently). The optimal value of F obviously depends on the value of \(D \), in addition to the application and resilience parameters. The wait time typically ranges from several hours to several days if the platform is over-subscribed (up to 10 days for large applications on the K-computer [24]). The metric to optimize here is the (expected) application yield, which is the fraction of useful work per second, averaged over the N resources, and computed in steady-state mode (expected value for multiple batch allocations of N resources).

For a \(\textsc {Moldable}\) application, the problem is different: here we assume that the application can use a different number of processors after each restart. The application starts executing with \(N \) processors; after the first failure, the application recovers from the last checkpoint and is able to continue with only \(N-1\) processors, albeit with a slowdown factor \(\frac{N-1}{N}\). After how many failures \(F \) should the application decide to stop and accept that no progress is made during \(D \), in order to request a new allocation? Again, the metric to optimize is the application yield.

Finally, consider an application which must have a given shape (or a set of given shapes) in terms of processor layout. Typically, these shapes are dictated by the algorithm. In this paper, we use the example of a \(\textsc {GridShaped}\) application, which is required to execute on a rectangular processor grid whose size can dynamically be chosen. Most dense linear algebra kernels (matrix multiplication, LU, Cholesky and QR factorizations) are \(\textsc {GridShaped}\) applications, and perform more efficiently on square processor grids than on elongated rectangle ones. The application starts with a square \(p \times p\) grid of \(N=p^{2}\) processors. After the first failure, execution continues on a \(p \times (p-1)\) rectangular grid, keeping \(p-1\) processors as spares for the next \(p-1\) failures. After p failures, the grid is shrunk again to a \((p-1) \times (p-1)\) square grid, and so on. We address the same question: after how many failures F should the application stop working on a smaller processor grid and request a new allocation, in order to optimize the application yield?

The major contribution of this paper is to present a detailed performance model and to provide analytical formulas for the expected yield of each application type. Due to lack of space, we instantiate the model for a single applicative scenario, for which we draw comparisons across application types. Our model is publicly available [21] so that more scenarios can be explored. Notably, the paper quantifies the optimal number of spares for the optimal yield, and the optimal length of a period between two full restarts; it also quantifies how much the yield and total work done within a period are improved by deploying \(\textsc {Moldable}\) applications w.r.t. \(\textsc {Rigid}\) applications.

The rest of the paper is organized as follows. Section 2 provides an overview of related work. Section 3 is devoted to formally defining the performance model. Section 4 provides formulas for the yield of \(\textsc {Rigid}\), \(\textsc {Moldable}\) and \(\textsc {GridShaped}\) applications. These formulas are instantiated through the applicative scenario in Sect. 5, to compare the different results. Finally, Sect. 6 provides final remarks and hints for future work.

2 Related Work

We first survey related work on checkpoint-restart. Then we discuss previous contributions on \(\textsc {Moldable}\) applications.

Checkpoint-Restart. Checkpoint/restart (CR) is the most common strategy employed to protect applications from underlying faults and failures on HPC platforms. Generally, CR periodically outputs snapshots (\(i.e.,\) checkpoints) of the application global, distributed state to some stable storage device. When a failure occurs, the last stored checkpoint is retrieved and used to restart the application.

A widely-used approach for HPC applications is to use a fixed checkpoint period (typically one or a few hours), but it is sub-optimal. Instead, application-specific metrics can (and should) be used to determine the optimal checkpoint period. The well-known Young/Daly formula [8, 25] yields the optimal checkpoint period for an application, \(\sqrt{2 \mu C}\) seconds, where C is the time to commit a checkpoint and \(\mu \) the application Mean Time Between Failures (MTBF) on the platform. We have \(\mu = \frac{\mu _{ ind }}{N}\), where \(N \) is the number of processors enrolled by the application and \(\mu _{ ind } \) is the MTBF of an individual processor [17].
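These two formulas translate directly into code. The sketch below uses hypothetical helper names and example parameter values (not taken from the paper), with all times expressed in seconds:

```python
import math

def platform_mtbf(mu_ind: float, n: int) -> float:
    """MTBF of the whole platform: mu = mu_ind / N."""
    return mu_ind / n

def young_daly_period(c: float, mu: float) -> float:
    """Optimal checkpoint period sqrt(2 * mu * C) (Young/Daly)."""
    return math.sqrt(2.0 * mu * c)

# Example: 20-year individual MTBF, 20,000 processors, 2-minute checkpoint.
mu = platform_mtbf(20 * 365 * 24 * 3600, 20_000)   # ~8.76 h platform MTBF
period = young_daly_period(120.0, mu)              # ~46 min between checkpoints
```

Note how aggressively the platform MTBF shrinks with scale: a 20-year individual MTBF leaves less than nine hours between failures at 20,000 processors.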

The Young/Daly formula minimizes platform waste, defined as the fraction of job execution time that does not contribute to its progress. The two sources of waste are the time spent taking checkpoints (which motivates longer checkpoint periods) and the time needed to recover and re-execute after each failure (which motivates shorter checkpoint periods). The Young/Daly period achieves the optimal trade-off between these sources to minimize the total waste.

For \(\textsc {Rigid}\) applications, both [18, 26] report experimental studies to determine the optimal number of processors and of spares that should be used. Furthermore, the optimal number of resources for a perfectly parallel job is computed via an iterative relaxation procedure in [18] and through analytical formulas in [5].

Moldable and GridShaped Applications. \(\textsc {Rigid}\) and \(\textsc {Moldable}\) applications have long been studied in the context of scientific applications. A detailed survey on various application types (\(\textsc {Rigid}\), \(\textsc {Moldable}\), malleable) was conducted in [10]. Resizing applications to improve performance has been investigated by many authors, including [6, 19, 22, 23]. A related recent study is the design of an MPI prototype for enabling fault tolerance in \(\textsc {Moldable}\) MapReduce applications [13].

The TORQUE/Maui scheduler has been extended to support evolving, malleable, and \(\textsc {Moldable}\) parallel jobs [20]. In addition, the scheduler may have system-wide spare nodes to replace failed nodes. In contrast, our scheme does not assume a change of behavior from the batch schedulers and resource allocators, but utilizes job-wide spare nodes: a node set including potential spare nodes is allocated and dedicated to a job at scheduling time, and the application can use these spares to restart within the same job after a failure.

An experimental validation of the feasibility of shrinking applications on the fly is provided in [2]. In that work, the authors used an iterative solver application to compare two recovery strategies, shrinking and spare node substitution. They use ULFM, the fault-tolerant extension of MPI that offers the possibility of dynamically resizing the execution after a failure. In [11, 15], the authors studied \(\textsc {Moldable}\) and \(\textsc {GridShaped}\) applications that continue executing after some failures. They focus on the performance degradation incurred after shrinking or spare node substitution, due to less efficient communications (in particular collective communications). A major difference with our work is that these studies focus on recovery overhead and do not address overall performance nor yield.

3 Performance Model

This section reviews the key parameters of the performance model. Some assumptions are made to simplify the computation of the yield. We discuss possible extensions in Sect. 6.

Application/Platform Framework. We consider perfectly parallel applications that execute on homogeneous parallel platforms. Without loss of generality, we assume that each processor has unit speed: we only need to know that the total amount of work done by p processors within T seconds requires \(\frac{p}{q}T\) seconds with q processors.

Mean Time Between Failures (MTBF). Each processor is subject to failures which are IID (independent and identically distributed) random variables following an Exponential probability distribution of mean \(\mu _{ ind }\), the individual processor MTBF. Then the MTBF of a section of the platform comprised of i processors is given by \(\mu _{i} = \frac{\mu _{ ind }}{i}\) [17].

Checkpoints. Processors checkpoint periodically, using the optimal Young/Daly period [8, 25]: for an application using i processors, this period is \(\sqrt{2 {C_{i}} \mu _{i}}\), where \({C_{i}}\) is the time to checkpoint with i processors. We consider two cases to define \({C_{i}}\). In both cases, the overall application memory footprint is considered constant at \( Mem_{tot} \), so the size of individual checkpoints is inversely proportional to the number of participating/surviving processors. In the first case, the I/O bandwidth is the bottleneck (which is often the case on HPC platforms, where it takes only a few processors to saturate the I/O bandwidth); then the checkpoint cost is constant and given by \({C_{i}} = \frac{ Mem_{tot} }{\tau _{io}}\), where \(\tau _{io}\) is the aggregated I/O bandwidth. In the second case, the processor network card is the bottleneck (which is the case for in-memory checkpointing, or checkpointing to NVRAM), and the checkpoint cost is inversely proportional to the number of active processors: \({C_{i}} = \frac{ Mem_{tot} }{\tau _{xnet} \times i}\), where \(\tau _{xnet}\) is the available network bandwidth, and \( \frac{ Mem_{tot} }{i}\) the checkpoint size.
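The two cost models can be sketched as follows (hypothetical function names; bandwidths and memory sizes in consistent units, e.g. bytes and bytes/s):

```python
def ckpt_cost_io_bound(mem_tot: float, tau_io: float) -> float:
    """First case: the aggregated I/O bandwidth is the bottleneck.
    The cost Mem_tot / tau_io does not depend on the processor count."""
    return mem_tot / tau_io

def ckpt_cost_net_bound(mem_tot: float, tau_xnet: float, i: int) -> float:
    """Second case: the network card is the bottleneck.
    Each of the i processors writes its share Mem_tot / i at rate tau_xnet."""
    return mem_tot / (tau_xnet * i)

# Example: 30 TB total footprint, 1.4 TB/s filesystem, 10 GB/s network cards.
c_io = ckpt_cost_io_bound(30e12, 1.4e12)           # ~21 s, for any i
c_net = ckpt_cost_net_bound(30e12, 10e9, 10_000)   # 0.3 s with 10,000 procs
```

In the first model \({C_{i}}\) stays constant as processors fail; in the second it grows as i decreases, which lengthens the Young/Daly period after each failure.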

We denote the recovery time with i processors as \(R_{i}\). For all simulations we use \(R_{i} ={C_{i}} \), assuming that the read and write bandwidths are identical.

Objective. We consider a long-lasting application that requests a resource allocation with N processors. We aim at deriving the optimal number of failures \(F \) that should be tolerated before paying the wait time and requesting a new allocation. The goal is to maximize the yield \({\mathcal Y} \) of the application, defined as the fraction of time, over the allocation length plus the wait time, during which the N resources perform useful work. Of course a spare does not perform useful work when idle, and no processor is active during the wait time, which explains why the yield is always smaller than 1. We derive the value of \(F \) that maximizes \({\mathcal Y} \) for the three application types.

4 Expected Yield

This section is the core of the paper. We compute the expected yield for each application type, \(\textsc {Rigid}\), \(\textsc {Moldable}\) and \(\textsc {GridShaped}\).

4.1 Rigid Application

We first consider a \(\textsc {Rigid}\) application that can be parallelized at compile-time to use any number of processors but cannot change this number until it reaches termination. There are \(N \) processors allocated to the application. We use \(N-F \) for execution and keep \(F \) as spares. The execution is protected from failures by checkpoints of duration \({C_{N-F }}\). Each failure striking the application will incur an in-place restart of duration \(R_{N-F } \), using a spare processor to replace the faulty one. However, when the \((F +1)^{st}\) failure strikes, the job will have to stop and perform a full restart, waiting for a new allocation of \(N \) processors to be granted by the job scheduler.

We define \({\mathcal T}_{R} \) as the expected duration of an execution period until the application is ready to continue after the \((F +1)^{st}\) failure strikes. We compute \({\mathcal T}_{R} \) using several first-order approximations. In particular, we ignore scenarios where failures strike during checkpoint, recovery or re-execution, thereby neglecting the probability of two failures within a short time window. Also, we approximate the time lost after a failure as half the checkpointing period. Finally, we assume an integer number of checkpointing periods between failures. The first failure is expected to strike after \(\mu _{N} \) seconds, the second failure \(\mu _{N-1} \) seconds after the first one, and so on. Without any overhead, the length of a period would be \(\sum _{i = N}^{N-F } \mu _{i} \). Except for the last failure, each failure incurs some overhead only if it strikes the application. This happens with probability \(\frac{N-F }{i}\), where i is the current number of live processors. In that case, the failure requires a restart and some re-execution, namely half the checkpoint period on average. The application always uses \(N-F \) processors, hence the checkpoint period remains equal to \(\sqrt{2 {C_{N-F }} \mu _{N-F }}\). On the contrary, if the failure strikes a spare, there is no overhead. The last failure always requires a wait time, and then a restart and re-execution. Therefore, we derive:

$${\mathcal T}_{R} = \sum _{i = N}^{N-F } \mu _{i} + \sum _{i = N}^{N-F +1} \frac{N-F }{i}\left( R_{N-F } + \frac{\sqrt{2{C_{N-F }} \mu _{N-F }}}{2}\right) + D + R_{N-F } + \frac{\sqrt{2{C_{N-F }} \mu _{N-F }}}{2}$$

What is the total amount of work \({\mathcal W}_{R}\) computed during a period? During the sub-period of length \(\mu _{i} \), there are \(\frac{\mu _{i}}{\sqrt{2{C_{N-F }} \mu _{N-F }}}\) checkpoints, each of length \({C_{N-F }}\), and each processor works during \(\frac{\mu _{i}}{ 1 + \frac{{C_{N-F }}}{\sqrt{2{C_{N-F }} \mu _{N-F }}} }\) seconds. There are \(N-F \) processors at work, hence

$${\mathcal W}_{R} = (N-F ) \cdot \sum _{i=N}^{N-F } \frac{\mu _{i}}{ 1 + \frac{{C_{N-F }}}{\sqrt{2{C_{N-F }} \mu _{N-F }}} }$$

During the duration \({\mathcal T}_{R} \) of the period, in the absence of failures and protection, the application could have used all \(N \) processors to compute. Thus the effective yield with protection for the application during \({\mathcal T}_{R} \) is reduced to \({\mathcal Y}_{R} \):

$${\mathcal Y}_{R} = \frac{{\mathcal W}_{R}}{N \cdot {\mathcal T}_{R}}$$
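The three formulas above evaluate directly. The sketch below (hypothetical helper names; constant checkpoint cost C and \(R = C\), as assumed in Sect. 3) computes \({\mathcal T}_{R}\), \({\mathcal W}_{R}\) and \({\mathcal Y}_{R}\) for a given F, and scans all values of F by brute force:

```python
import math

def rigid_yield(n: int, f: int, mu_ind: float, c: float,
                r: float, d: float) -> float:
    """Expected yield of a Rigid application on n - f processors with f
    spares, following the formulas for T_R, W_R and Y_R above."""
    mu = lambda i: mu_ind / i                      # MTBF with i live processors
    p = math.sqrt(2.0 * c * mu(n - f))             # Young/Daly period, n-f procs
    # T_R: inter-failure gaps, plus per-failure overhead (restart and half a
    # period of re-execution, with probability (n-f)/i of striking the app),
    # plus the final wait time, restart and re-execution.
    t = sum(mu(i) for i in range(n - f, n + 1))
    t += sum((n - f) / i * (r + p / 2.0) for i in range(n - f + 1, n + 1))
    t += d + r + p / 2.0
    # W_R: n - f working processors, discounted by the checkpoint overhead.
    w = (n - f) * sum(mu(i) / (1.0 + c / p) for i in range(n - f, n + 1))
    return w / (n * t)

def best_f_rigid(n: int, mu_ind: float, c: float, r: float, d: float) -> int:
    """Brute-force scan for the number of spares f maximizing the yield."""
    return max(range(n - 1), key=lambda f: rigid_yield(n, f, mu_ind, c, r, d))
```

The brute-force scan is quadratic in N; it is only meant to make the formulas concrete, not to replace the optimized computation in the publicly available model [21].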

4.2 Moldable Application

We now consider a \(\textsc {Moldable}\) application that can use a different number of processors after each restart. The application starts executing with \(N \) processors; after the first failure, the application recovers from the last checkpoint and is able to continue with only \(N-1\) processors after paying the restart cost \(R_{N-1} \), albeit with a slowdown factor \(\frac{N-1}{N}\) of the parallel work per time unit.

We define \({\mathcal T}_{M} \) as the expected duration of an execution period until the \((F +1)^{st}\) failure strikes. Without any overhead, the length of a period would be \(\sum _{i = N}^{N-F } \mu _{i} \), the same as for \(\textsc {Rigid}\) applications. But there are a few differences. First, each failure strikes the application, since it always uses all live processors. Second, the checkpoint period increases after each failure, since the number of live processors decreases. Third, the re-execution after a failure (except the last one) incurs a slowdown factor because we move from i processors to \(i-1\) processors. Fourth and finally, the re-execution after the last failure is performed faster, because there are more live processors. Altogether, we derive that

$${\mathcal T}_{M} = \sum _{i = N}^{N-F } \mu _{i} + \sum _{i = N}^{N-F +1} \left( R_{i-1} + \frac{i}{i - 1}\cdot \frac{\sqrt{2{C_{i}} \mu _{i}}}{2} \right) + D + R_{N} + \frac{N-F}{N}\frac{\sqrt{2C_{N-F}\mu _{N-F}}}{2}$$

To compute the total amount of work \({\mathcal W}_{M} \) during a period, we proceed as before and consider each sub-period. During the sub-period of length \(\mu _{i} \), there are \(\frac{\mu _{i}}{\sqrt{2{C_{i}} \mu _{i}}}\) checkpoints, each of length \({C_{i}}\), and each processor works during \(\frac{ \mu _{i} }{ 1 + \frac{{C_{i}}}{\sqrt{2{C_{i}} \mu _{i}}} }\) seconds. And there are i processors at work during that sub-period. Altogether:

$${\mathcal W}_{M} = \sum _{i=N}^{N-F } i \times \frac{ \mu _{i} }{ 1 + \frac{{C_{i}}}{\sqrt{2{C_{i}} \mu _{i}}} }, \quad \quad \text { and } {\mathcal Y}_{M} = \frac{{\mathcal W}_{M}}{N \cdot {\mathcal T}_{M}}$$

where \({\mathcal Y}_{M} \) is the yield of the \(\textsc {Moldable}\) application.
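The \(\textsc {Moldable}\) quantities can be sketched in the same fashion (hypothetical helper names; the callables c_of(i) and r_of(i) give the checkpoint and recovery costs with i processors, so both cost models of Sect. 3 are covered):

```python
import math

def moldable_yield(n, f, mu_ind, c_of, r_of, d):
    """Expected yield of a Moldable application, following T_M, W_M and
    Y_M above."""
    mu = lambda i: mu_ind / i
    period = lambda i: math.sqrt(2.0 * c_of(i) * mu(i))  # Young/Daly, i procs
    t = sum(mu(i) for i in range(n - f, n + 1))          # inter-failure gaps
    # Every failure strikes the application; re-executing half a period is
    # slowed down by i / (i - 1) since one processor has just been lost.
    t += sum(r_of(i - 1) + (i / (i - 1)) * period(i) / 2.0
             for i in range(n - f + 1, n + 1))
    # Last failure: wait time, restart with n fresh processors, and a faster
    # re-execution (factor (n - f) / n).
    t += d + r_of(n) + (n - f) / n * period(n - f) / 2.0
    # Work: i processors during the sub-period of length mu(i).
    w = sum(i * mu(i) / (1.0 + c_of(i) / period(i)) for i in range(n - f, n + 1))
    return w / (n * t)
```

For instance, passing `c_of=lambda i: mem_tot / tau_io` gives the constant-cost case, while `c_of=lambda i: mem_tot / (tau_xnet * i)` gives the network-bound case.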

4.3 GridShaped Application

Finally, we consider a \(\textsc {GridShaped}\) application, defined as a moldable execution which requires a rectangular processor grid. The application starts with a square \(p \times p\) grid of \(N =p^{2}\) processors. After the first failure, execution continues on a \(p \times (p-1)\) rectangular grid, keeping \(p-1\) processors as spares for the next \(p-1\) failures. After p failures, the grid is shrunk again to a \((p-1) \times (p-1)\) square grid, and the execution continues on this reduced-size square grid. After how many failures \(F \) should the application stop, in order to maximize the application yield? The derivations of the expected length of a period and of the total work are more complicated for \(\textsc {GridShaped}\) than for \(\textsc {Rigid}\) and \(\textsc {Moldable}\). Due to lack of space, we refer to the extended version [12], as well as to the publicly available software [21], for detailed formulas and an algorithm to compute the optimal value of \(F \).
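While the yield formulas are deferred to [12], the shrinking schedule itself is easy to simulate. The sketch below (hypothetical function name) shrinks one grid dimension only when the surviving processors no longer fill the current grid, reproducing the \(p \times (p-1)\) then \((p-1) \times (p-1)\) sequence described above:

```python
def grid_after_failures(p: int, k: int):
    """Grid shape (rows, cols) and spare count after k failures, starting
    from a p x p grid and shrinking one dimension at a time."""
    rows, cols = p, p
    live = p * p - k               # processors still alive
    while rows * cols > live:      # shrink only when the grid no longer fits
        if cols == rows:
            cols -= 1              # p x p  ->  p x (p-1)
        else:
            rows -= 1              # p x (p-1)  ->  (p-1) x (p-1)
    return rows, cols, live - rows * cols

# After the first failure on a 150 x 150 grid: a 150 x 149 grid, 149 spares.
```

Each shrink frees a whole row (or column) of processors as spares, which is why the \(\textsc {GridShaped}\) yield evolves in steps in the plots of Sect. 5.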

5 Applicative Scenario

As an applicative scenario, we consider a platform with 22,500 nodes (\(150^2\)), with a node MTBF of 20 years, and an application that would take 2 min to checkpoint (at 22,500 nodes). In other words, we let \(N =22,500\), \(\mu _{ ind } =20y\) and \(C_{i}=C = 120\)s. These values are inspired by existing platforms: the Titan supercomputer at OLCF [14], for example, holds 18,688 nodes and experiences a few node failures per day, implying a node MTBF between 18 and 25 years. The filesystem has a bandwidth of 1.4 TB/s, and the nodes altogether aggregate 100 TB of memory, thus a checkpoint that would save 30% of that system should take on the order of 2 min to complete. Further experiments varying \(N \) and \(\mu _{ ind } \), and with several scenarios for checkpoint costs, are available in the extended version [12].
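As a quick sanity check on these parameters (a sketch using the values stated above), the platform-level MTBF and the corresponding Young/Daly period follow directly from the formulas of Sect. 3:

```python
import math

mu_ind = 20 * 365 * 24 * 3600      # 20-year individual MTBF, in seconds
n = 150 ** 2                       # 22,500 nodes
mu_platform = mu_ind / n           # ~28,000 s: a failure roughly every 7.8 h
# Young/Daly checkpoint period with C = 120 s: ~43 min between checkpoints.
period = math.sqrt(2 * 120.0 * mu_platform)
```

So at this scale the application checkpoints roughly every three quarters of an hour and loses a node several times a day, which is the regime in which the choice of \(F \) matters.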

Figure 1 shows the yield that can be expected when performing a full restart after the optimal number of failures, as a function of the wait time, for the three kinds of applications considered (\(\textsc {Rigid}\), \(\textsc {Moldable}\) and \(\textsc {GridShaped}\)). We also plot the expected yield when the application experiences a full restart after each failure (\(\textsc {NoSpare}\)). First, one sees that the three approaches that avoid paying the cost of a wait time after every failure achieve comparable yields, while the performance of the \(\textsc {NoSpare}\) approach quickly degrades, down to a 30% yield when the wait time is around 14 h.

The zoom box differentiating the \(\textsc {Rigid}\), \(\textsc {Moldable}\) and \(\textsc {GridShaped}\) yields shows that the \(\textsc {Moldable}\) approach has a slightly higher yield than the other ones, but only by a tiny margin. This is expected, as the \(\textsc {Moldable}\) approach takes advantage of all living processors, while the \(\textsc {GridShaped}\) and \(\textsc {Rigid}\) approaches sacrifice the computing power of the spare nodes waiting for the next failure. However, the gain is small to the point of being negligible. The \(\textsc {GridShaped}\) approach experiences a yield that changes in steps. Both phenomena are explained below.

Fig. 1. Optimal yield as a function of the wait time, for the different types of applications.

Figure 2 shows the number of failures after which the application should perform a full restart to obtain the optimal yield, as a function of the wait time, for the three kinds of applications considered. We observe that this optimal number remains small: even with long wait times (e.g., 10 h), only 200 to 250 failures (depending on the method) should be tolerated within the allocation before relinquishing it. This is small compared to the number of nodes: less than 1% of the resources should be dedicated as spares for the \(\textsc {Rigid}\) approach, and after losing 1% of the resources, the \(\textsc {Moldable}\) approach should request a new allocation.

This is remarkable, taking into account the poor yield obtained by the approach that does not tolerate failures within the allocation. Even with a small wait time (assuming the platform would be capable of re-scheduling applications that experience failures in less than 2 h), Fig. 1 shows that the yield of the \(\textsc {NoSpare}\) approach would decrease to 70%. This represents a waste of 30%, which is much higher than the waste of 10% recommended for resilience on current HPC platforms [4, 7]. Comparatively, provisioning only 1% of additional resources as spares within the allocation allows every approach considered to achieve a yield over 88%, as long as the wait time does not exceed 20 h.

The \(\textsc {GridShaped}\) approach experiences steps that correspond to using all the spares created when redeploying the application over a smaller grid before relinquishing the allocation. As illustrated in Fig. 1, the yield evolves in steps, the slope of a linear approximation changing radically when redeploying over a smaller grid. As a consequence, the maximal yield is always reached at a slope change point, thus at the frontier of a new grid size. It is still remarkable that even with very small wait times, it is more beneficial to use spares (and thus to lose a full row of processors) than to redeploy immediately.

Fig. 2. Optimal number of failures tolerated between two full restarts, as a function of the wait time, for the different types of applications.

Figure 3 shows the length of an allocation providing the optimal yield (best value of \(F \)). After such a duration, the job has to fully restart in order to maintain the optimal yield. This figure illustrates the real difference between the \(\textsc {Rigid}\) and \(\textsc {Moldable}\) approaches: although both are capable of extracting the same yield, the \(\textsc {Moldable}\) approach can do so with significantly longer periods between full restarts. This is important for real-life applications, because applications using a \(\textsc {Moldable}\) approach have a higher chance of completing before the first full restart, and overall will always complete within a smaller number of allocations than with the \(\textsc {Rigid}\) approach.

Fig. 3. Optimal length of allocations, for the different types of applications.

Fig. 4. Maximum wait time allowed to reach a target yield.

Finally, Fig. 4 shows an upper limit on the wait time in order to guarantee a given yield for the three application types. In particular, we see that to reach a yield of 90%, an application that restarts its job at each fault would need that restart to be granted in less than 6 min, whereas the \(\textsc {Rigid}\) and \(\textsc {GridShaped}\) approaches need a full restart in less than approximately 3 h. This bound goes up to 7 h for the \(\textsc {Moldable}\) approach. In comparison, with a wait time of 1 h, the yield obtained using \(\textsc {NoSpare}\) is only 80%. This shows that, with these parameters, it seems impossible to guarantee the recommended waste of 10% without tolerating a (small) number of failures before rescheduling the job.

6 Conclusion

In this paper, we have compared the performance of \(\textsc {Rigid}\), \(\textsc {Moldable}\) and \(\textsc {GridShaped}\) applications when executed on large-scale failure-prone platforms. For each application type, we have computed the optimal number of faults that should be tolerated before requesting a new allocation, as a function of the wait time. Through a realistic applicative scenario inspired by state-of-the-art platforms, we have shown that the three application types experience an optimal yield when requesting a new allocation after experiencing a number of failures that represents a small percentage of the initial number of resources (hence a small percentage of spares for \(\textsc {Rigid}\) applications), and this even for large values of the wait time. On the contrary, the \(\textsc {NoSpare}\) strategy, where a new allocation is requested after each failure, sees its yield dramatically decrease when the wait time increases. We also observed that \(\textsc {Moldable}\) applications enjoy much longer execution periods in between two re-allocations, thereby decreasing the total execution time as compared to \(\textsc {Rigid}\) applications (and \(\textsc {GridShaped}\) applications lying in between).

Future work will be devoted to exploring more applicative scenarios. We also intend to extend the model in several directions. On the application side, we aim at dealing with applications that are not perfectly parallel, but instead have a speedup profile that obeys Amdahl's law [1]. We will also introduce a more refined speedup profile for \(\textsc {GridShaped}\) applications, with an execution speed that depends on the grid shape (a square usually being faster than an elongated rectangle). On the resilience side, we will address forward-recovery schemes, such as ABFT [9, 16], in replacement of, or in combination with, checkpoint-restart techniques.