
1 Introduction

Recently, edge computing has come to be seen as an effective solution to the problem of ever-larger data volumes, offering shorter response times and better service quality [1]. However, its reliability problems still urgently need to be solved. Existing fault-tolerant methods can be divided into two categories: reactive and proactive. Reactive schemes are known to yield low average resource utilization when application behavior is highly dynamic. Proactive schemes, which rely on fault prediction [2,3,4,5], can improve resource utilization. However, existing proactive schemes consider only a single factor when predicting failures, which greatly limits the accuracy of the prediction results.

In this paper, we jointly consider the CPU temperature and the time between failures (TBF) of a host to predict faults, and we propose an energy-aware fault-tolerant resource scheduling algorithm that improves reliability while reducing energy consumption. Specifically, we first use reliability- and energy-aware resource scheduling [2] to allocate resources to tasks. During task execution, the fault tolerance mechanism (VM migration) is triggered once the temperature reaches the upper threshold or the predicted failure time is reached.

The rest of this paper is organized as follows. The system model is presented in Sect. 2, followed by the resource scheduling algorithm in Sect. 3. The simulation experiments are presented in Sect. 4. Section 5 concludes the paper.

2 Fault-Tolerance Resource Scheduling Model

As shown in Fig. 1, the system is divided into two layers. The Users Layer is the producer and consumer of data. The Edge Cloud Layer is the data processing layer and consists of physical resources. Users submit their applications to the Edge Cloud Layer, and the resource management system (RSM) then allocates physical resources to the tasks. To improve the reliability of the system, the RSM can migrate a running VM from a deteriorating host to another host.

In this paper, we use the Bag-of-Tasks (BoT) application model, in which an application consists of a set of independent tasks. The tasks in each BoT are defined as \(T = \left\{ {task}_i \mid 1 \le i \le n \right\} \). \(l_i\) is the length of the task \({task}_i\), which directly affects its execution time \(T_i^{ex}\). Each task \({task}_i\) is allocated to a virtual machine \({vm_j} \in VM\), and each virtual machine \(vm_j\) runs a set of tasks \({T_j} \subseteq T\). In addition, \(N = \left\{ {node}_k \mid 1 \le k \le x \right\} \) denotes the set of physical hosts in the edge cloud.

Fig. 1. The system architecture

2.1 Failure Prediction Model

CPU Temperature Prediction: We use the simulated CPU temperature prediction function [3] as one of the methods for predicting the host failure time, as follows:

$$\begin{aligned} f(t|A,\omega ,{t_i},{t_{i + 1}}) = \left\{ {\begin{array}{*{20}{c}} {{e^t}}&{}{0 \le t \le {t_i}}\\ {{e^{{t_i}}}}&{}{{t_i} \le t \le {t_{i + 1}}}\\ {A\sin (\omega t - \omega {t_{i + 1}}) + {e^{{t_i}}}}&{}{{t_{i + 1}} \le t \le {t_{i + 2}}} \end{array}} \right. \end{aligned}$$
(1)

where i is a positive integer; \(t_i\) is a fixed value determined by \(e^{t_i}=35\); \(e^{t_i}\) is the temperature when the CPU is idle, which is always \({35{\,}^\circ }\mathrm{{C}}\); \(t_{i+1}\) is a random value; \(t_{i+2}\) is calculated by \(t_{i+2}=\pi /\omega + t_{i+1}\); A is the amplitude (lower than \({68{\,}^\circ }\mathrm{{C}}\)); and \(\omega \) determines the duration of the CPU execution load, which lasts \(\pi /\omega \).
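As an illustration, Eq. (1) can be evaluated for a single warm-up, idle, and load cycle with the minimal Python sketch below. The function and parameter names (`cpu_temperature`, `time_to_threshold`, `t_load_start` for \(t_{i+1}\)) are our own assumptions and do not come from [3]; the helper that solves for the time at which a given threshold is first reached in the load phase is likewise only one possible way of using the model.

```python
import math

IDLE_TEMP_C = 35.0                # e^{t_i}: idle temperature stated in the text
T_WARMUP = math.log(IDLE_TEMP_C)  # t_i, the solution of e^{t_i} = 35


def cpu_temperature(t, amplitude, omega, t_load_start):
    """Evaluate Eq. (1) at time t for one cycle.

    t_load_start stands for t_{i+1} (hypothetical name); the load phase
    ends at t_{i+2} = t_load_start + pi / omega.
    """
    if omega <= 0 or t_load_start < T_WARMUP:
        raise ValueError("need omega > 0 and t_load_start >= t_i")
    t_load_end = t_load_start + math.pi / omega
    if not 0 <= t <= t_load_end:
        raise ValueError("t lies outside the modelled cycle")
    if t <= T_WARMUP:            # warm-up branch: e^t
        return math.exp(t)
    if t <= t_load_start:        # idle branch: constant e^{t_i}
        return IDLE_TEMP_C
    # load branch: half a sine period on top of the idle temperature
    return amplitude * math.sin(omega * (t - t_load_start)) + IDLE_TEMP_C


def time_to_threshold(threshold, amplitude, omega, t_load_start):
    """Earliest time in the load phase at which the temperature reaches
    `threshold`, or None if the peak stays below it (assumed helper)."""
    rise = threshold - IDLE_TEMP_C
    if rise <= 0:
        raise ValueError("threshold must exceed the idle temperature")
    if rise > amplitude:
        return None
    return t_load_start + math.asin(rise / amplitude) / omega
```

Under these assumptions, the peak temperature of a load phase is the idle temperature plus A, reached at \(t_{i+1} + \pi /(2\omega )\).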

Time Between Failures Prediction: In addition to CPU temperature prediction, exponential smoothing [2] is used to predict the TBF. Suppose the host \({node}_k\) has a set of observed TBFs, \({TBF}_k = \left\{ {{tbf}_t|1 \le t \le n} \right\} \). Then the prediction for \({tbf}_{t+1}\) can be calculated as:

$$\begin{aligned} \left( {tb{f_k}} \right) _{t + 1}^\prime = \left\{ {\begin{array}{*{20}{c}} {\alpha \times {{(tb{f_k})}_t} + ((1 - \alpha ) \times \left( {tb{f_k}} \right) _t^\prime ),}&{}{n > 1}\\ {\left( {tb{f_k}} \right) _t^\prime }&{}{\text{otherwise}} \end{array}} \right. \end{aligned}$$
(2)

where \({({tbf}_k)}_t\) is the actual value of the TBF at time t, \({({tbf}_k)}_t^{\prime }\) is the predicted value of the TBF at time t, and \(\alpha \) is the smoothing constant.
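The recurrence in Eq. (2) can be sketched in Python as follows. The text does not state how the first prediction is initialized, so this sketch assumes the common convention of seeding it with the first observed TBF; the function name `predict_next_tbf` is also our own.

```python
def predict_next_tbf(tbf_history, alpha):
    """Predict the next TBF of a host by simple exponential smoothing.

    tbf_history: observed TBFs of one host, oldest first.
    alpha: smoothing constant in [0, 1].
    Assumption: the first prediction equals the first observation.
    """
    if not tbf_history:
        raise ValueError("at least one observed TBF is required")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    prediction = tbf_history[0]  # also the result when only one value exists
    for actual in tbf_history[1:]:
        # Eq. (2): new prediction = alpha * actual + (1 - alpha) * old prediction
        prediction = alpha * actual + (1.0 - alpha) * prediction
    return prediction
```

For example, with the observations 100, 80, and 120 and \(\alpha = 0.5\), the successive predictions are 100, 90, and finally 105. A larger \(\alpha \) makes the prediction follow recent failures more closely, while a smaller \(\alpha \) smooths out short-term fluctuations.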


2.2 Energy Consumption Model

Let \({vm}_j\) be the VM running on \({node}_k\) with utilization \(u_j\). Then the energy consumption of the task \({task}_i\) running on \({vm}_j\) can be calculated as

$$\begin{aligned} {E_{ij}} = ({P_k}({u_j}) \times T_{ij}^{ex}) + {E_{{extra}_{ij}}} \end{aligned}$$
(3)

where \(E_{{extra}_{ij}}\) is the energy consumed by VM migration, which can be calculated with the VM migration overhead model in [2]. Similar to [6], \({P_k}({u_j})\) can be calculated by

$$\begin{aligned} {P_k}({u_j}) = {P_{{{\min }_k}}} + ({P_{{{\max }_k}}} - {P_{{{\min }_k}}}) \times {u_j} \end{aligned}$$
(4)

where \(P_{{\min }_k}\) and \(P_{{\max }_k}\) are the power of \({node}_k\) at minimum and maximum utilization, respectively. The utilization \(u_j\) of the VM \({vm}_j\) is the sum of the utilizations \(u_i\) of its tasks, each of which is calculated by normalizing the task length \(l_i\) by the maximum task length \(l_{max}\) in the BoT B.
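Equations (3) and (4) can be combined into the short Python sketch below. The function names are our own, the migration energy \(E_{{extra}_{ij}}\) is passed in as a plain number because the overhead model of [2] is not reproduced here, and the units (watts, seconds, joules) are an assumption made only for illustration.

```python
def node_power(p_min, p_max, utilization):
    """Eq. (4): linear power model of a node, in watts (assumed unit)."""
    return p_min + (p_max - p_min) * utilization


def vm_utilization(vm_task_lengths, l_max):
    """Sum of task utilizations u_i = l_i / l_max for the tasks on one VM.

    l_max is the maximum task length in the whole BoT, not just on this VM.
    """
    if l_max <= 0:
        raise ValueError("l_max must be positive")
    return sum(length / l_max for length in vm_task_lengths)


def task_energy(p_min, p_max, utilization, exec_time, extra_energy=0.0):
    """Eq. (3): energy of a task, in joules if exec_time is in seconds.

    extra_energy stands for the migration energy, supplied by the caller.
    """
    return node_power(p_min, p_max, utilization) * exec_time + extra_energy
```

For instance, a node with a minimum power of 100 W and a maximum power of 200 W draws 150 W at a utilization of 0.5; a task that runs for 10 s at that utilization and incurs 50 J of migration overhead consumes 1550 J under these assumptions.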

3 Energy-Aware Fault-Tolerant Resource Scheduling Algorithm

Given a BoT B and the resource configuration of the data center, Algorithm 1 is used to configure resources for the tasks. First, the Best Fit Bin Packing algorithm [2] is used to allocate the tasks to VMs. Then, the reliability- and energy-aware strategy is used to configure physical resources for the VMs (lines 1–10). During task execution, once the temperature of a node reaches the upper threshold or the predicted failure time is reached, VM migration is triggered. The VM running on the deteriorating node then selects another node through Algorithm 1 and migrates to it.
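The two building blocks named above can be illustrated with the following Python sketch. It is not a reproduction of Algorithm 1: it shows a generic online best-fit packing of task utilizations into VMs, with the VM capacity of 1.0 as an assumed parameter, and a migration trigger that combines the two conditions from the text. All names are hypothetical.

```python
def best_fit(task_utils, vm_capacity=1.0):
    """Generic best-fit bin packing of tasks into VMs.

    Each task goes to the VM that leaves the least free capacity after
    placement; a new VM is opened when no existing VM can hold the task.
    Returns (assignment, loads), where assignment[i] is the VM index of
    task i and loads[j] is the total utilization placed on VM j.
    """
    eps = 1e-12  # tolerance for floating-point round-off
    loads, assignment = [], []
    for u in task_utils:
        if u > vm_capacity:
            raise ValueError("task does not fit into an empty VM")
        best_vm, best_residual = None, None
        for j, load in enumerate(loads):
            residual = vm_capacity - load - u
            if residual >= -eps and (best_residual is None or residual < best_residual):
                best_vm, best_residual = j, residual
        if best_vm is None:
            loads.append(u)
            assignment.append(len(loads) - 1)
        else:
            loads[best_vm] += u
            assignment.append(best_vm)
    return assignment, loads


def should_migrate(cpu_temp, temp_threshold, now, predicted_failure_time):
    """Trigger migration when either condition from the text holds."""
    return cpu_temp >= temp_threshold or now >= predicted_failure_time
```

With task utilizations 0.5, 0.75, 0.25, and 0.5, this sketch opens two VMs and fills both completely. In the full scheme, the reliability- and energy-aware selection of a physical node for each VM would follow this packing step; that step depends on details of [2] and is therefore not sketched here.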

4 Performance Evaluation

We conduct the simulation experiments by extending the simulator ‘CloudSim’ [3]. We download the Grid5000 failure dataset from the Failure Trace Archive (FTA) [2] and select the cluster G1/site1/c1 as the edge cloud data center. The parameter configuration model in [2] is used to set the configuration of each node and to generate the BoT workload, which consists of between 2000 and 3000 tasks. To evaluate the performance of the proposed algorithm (Tem/Tbf), we compare it with other fault-tolerant strategies. Specifically, ‘NoFT’ denotes the method with no fault tolerance mechanism, and ‘Restr’, ‘Pre-Tem’, ‘Pre-Tbf’, and ‘Tem/Tbf’ denote the methods that use resubmission, CPU temperature prediction, TBF prediction, and combined CPU temperature and TBF prediction as the fault-tolerant strategy, respectively.

Fig. 2. The task completion rate under different fault-tolerant strategies

Fig. 3. The energy consumption under different fault-tolerant strategies

4.1 Experimental Results

Figure 2 shows the task completion rate, and Fig. 3 shows the energy consumption. The task completion rate and the energy consumption are both highest with the Restr method. Among the strategies based on fault prediction, Tem/Tbf consumes only 30 kWh more energy than the other two. If the task completion rate is used to measure the reliability of the system, Restr is the most reliable method, but its excessive energy consumption would greatly harm the interests of operators. With the Tem/Tbf method, the reliability is much higher than with the other two proactive strategies, and the increase in energy is small. Therefore, the proposed method (Tem/Tbf) is more effective.

5 Conclusions

In this paper, we study how to improve the reliability of the edge cloud system while reducing energy consumption as much as possible. We first use the reliability- and energy-aware resource scheduling algorithm to allocate physical resources to tasks. Then, CPU temperature prediction and time-between-failures prediction are used to achieve fault tolerance. Compared with other fault-tolerant strategies, the proposed method is more effective.