# Multi-objective redundancy hardening with optimal task mapping for independent tasks on multi-cores


## Abstract

The rate of transient faults has increased significantly as technology scales down. Tolerance of transient faults has become an important issue in system design. Dual modular redundancy (DMR) and triple modular redundancy (TMR) are two commonly used techniques that achieve fault detection and masking through executing redundant tasks. As DMR and TMR have different time and cost overheads, we must carefully determine which one should be used for each task (i.e., task hardening) to achieve the optimal system design. Furthermore, for multi-core systems, the system-level design includes the allocation of cores to the tasks (i.e., task mapping) as well. This paper addresses task hardening and task mapping simultaneously for independent tasks on multi-cores with heterogeneous performance, in order to minimize the maximum completion time of all tasks (i.e., makespan). We demonstrate that once task hardening is given, task mapping of independent tasks can be achieved by employing min–max-weight perfect matching with a polynomial time complexity. Besides, as there is a trade-off between cost and time performance, we propose a multi-objective memetic algorithm (MOMA)-based task hardening method to obtain a set of solutions with different numbers of cores (i.e., costs), so the designer can choose different solutions according to different requirements. The key idea of the MOMA is to incorporate problem-specific knowledge into the global search of evolutionary algorithms. Our experimental studies have demonstrated the effectiveness of the proposed method and have shown that by combining the results of the MOMA and a multi-objective evolutionary algorithm (MOEA) we can provide a designer with a highly accurate set of solutions within a reasonable amount of time.

## Keywords

Fault tolerance · Multi-cores · Task hardening · Task mapping · Multi-objective optimization · Memetic algorithms

## 1 Introduction

The rapid technology advances, such as transistor size scaling, high operation frequencies, and low supply voltages, have led to multiple reliability threats to circuits and systems (Constantinescu 2003). For example, the rate of transient faults in circuits has increased dramatically in the past years. A transient fault is usually caused by a high-energy particle strike that flips the state of bits in an unpredictable way, which can corrupt the correct application execution state (Baumann 2005). Since the rate of transient faults is much higher than that of permanent faults, any future reliable system should employ an effective scheme to tolerate transient faults (Henkel et al. 2013a).

Multi-cores have emerged as a popular and powerful computing platform for many recent systems (Ebi et al. 2009; Henkel et al. 2013b; Jahn et al. 2011). As the technology keeps scaling down, more and more cores can be integrated into a single chip to meet the growing computing demand of modern applications. Such applications depend on high-performance computing (HPC), and high reliability in the presence of transient faults is required as well. On the other hand, the individual cores may exhibit significant frequency variations (e.g., up to 30%) because of process variations resulting from fabrication imprecision (Bowman et al. 2002; Dighe et al. 2011). This situation is worsening with technology scaling because of the difficulty of precise fabrication with reduced dimensions at the nanoscale.

To achieve reliability against transient faults, three types of redundancy have conventionally been applied, i.e., time redundancy, space redundancy, and hybrid redundancy. Time redundancy (i.e., re-execution) is based on checkpointing and re-executing the task in case of a fault (Kandasamy et al. 2003; Nikolov and Larsson 2016; Salehi et al. 2016a); it is applicable only after a fault has been detected. Local fault detection, however, is not free, since it is hard to achieve perfect transient fault detection. Space redundancy (i.e., replication), based on executing redundant replicas of a task independently on different cores, does not require any specific fault detection mechanism and uses result comparison (majority voting) for fault detection and masking (Koren and Krishna 2007; Pradhan 1996; Salehi et al. 2016b). To enable majority voting, the number of replicas should be odd, so triple modular redundancy (TMR) is commonly used (Chen et al. 2016; Lyons and Vanderkulk 1962). Although TMR is simple and predictable for tasks with deadlines, it requires more resources and energy than dual modular redundancy (DMR) (Dave and Jha 1999; Vadlamani et al. 2010). On the other hand, DMR is only capable of detecting a fault, since it cannot decide which result is correct. Therefore, another way to implement fault masking is to combine DMR-based detection and checkpoint-based re-execution, which can be regarded as hybrid redundancy (Kang et al. 2015; Pradhan and Vaidya 1994; Ziv and Bruck 1997). Both TMR and DMR are well suited to multi-core platforms, as multi-cores can provide multiple processing units and low-overhead communication for comparing and voting.

Since different fault-tolerant techniques are usually characterized by different time and space overheads, there is an optimization trade-off in *task hardening*, i.e., determining one of the fault-tolerant techniques (e.g., DMR or TMR) for each task. This trade-off, together with the traditional optimization in *task mapping*, i.e., mapping each task to one of the cores, makes the design of fault-tolerant multi-cores very challenging (Das et al. 2014; Gan et al. 2012; Khosravi et al. 2014; Pop et al. 2009; Stralen and Pimentel 2012). It is notable that *replication*-based *task hardening* will introduce new tasks, i.e., replicas, into the system, and these replicas should be mapped as well. In this case, the design of fault-tolerant multi-cores can be regarded as a bi-level optimization problem due to its nested structure. Specifically, the upper-level optimization subproblem is *task hardening* and the lower-level optimization subproblem is *task (including all original tasks and new tasks) mapping*, thereby making the overall optimization computationally very intensive.

As both *task hardening* and *task mapping* are NP-hard in general cases (Bolchini et al. 2011), most of the previous methods are based on meta-heuristics, especially evolutionary algorithms (EAs). Through mapping a task to a set of cores, *replication*-based *task hardening* can be implicitly implemented, e.g., the number of mappings and the list of cores for each task are encoded as an integer vector in (Glaß et al. 2007; Huang et al. 2011; Huang et al. 2012; Reimann et al. 2008), or the mappings to all cores for each task are encoded as a 0–1 vector in Kang et al. (2014a, b). As *task hardening* is implicitly implemented in *task mapping*, the chromosome length of *task mapping* must be overestimated according to the maximum number of replicas that may be introduced by *replication*-based hardening (Glaß et al. 2007; Huang et al. 2011; Huang et al. 2012; Reimann et al. 2008) or according to the total number of cores on the multi-core platform (Kang et al. 2014a, b). The problem of overestimation is relaxed in the nested two-layer EA (Bolchini and Miele 2013; Bolchini et al. 2011), where the outer-layer EA is used to explore *task hardening*, while the inner-layer EA is used to explore *task mapping* on each extended task graph (Jhumka et al. 2005) independently. The nested two-layer structure is a commonly used framework for bi-level optimization problems, but the EA-based inner layer makes the whole optimization process very time-consuming. Tabu search (TS) seems to be an alternative way of exploring both *task hardening* and *task mapping* at the same time (Izosimov et al. 2009; Lifa et al. 2010). As an individual/trajectory-based metaheuristic, TS has a simple and flexible encoding strategy. But the performance of TS highly depends on the initial solution, which is generated greedily in the TS-based methods, and TS tends to get trapped in local optima, especially for problems with complex landscapes.

As stated above, it is very difficult to develop an efficient method for joint *task hardening* and *task mapping* in general cases, because of the nested structure of the two NP-hard problems. This paper builds upon the analysis that, for independent tasks (Bleuse et al. 2017; Hong and Prasanna 2007) (i.e., there is no data dependence between tasks), if the solution of *task hardening* is given, the solution of *task mapping* can be optimally obtained by employing min–max-weight perfect matching (MMW-PM). Specifically, both DMR- and TMR-based hardening techniques are considered, and we show how to link the problem of *task mapping* with the goal of minimizing the worst-case makespan (i.e., the maximum completion time of all tasks) to the MMW-PM model; then, based on binary search and the Hungarian algorithm (Kuhn 1955), we use an efficient heuristic algorithm proposed in our previous work (Zhong et al. 2016) to obtain an MMW-PM from a bipartite graph with polynomial time complexity. Besides, as there is a trade-off between cost and time performance (Erbas et al. 2006), we propose a multi-objective memetic algorithm (MOMA)-based task hardening method to obtain a set of solutions with different numbers of cores (i.e., costs), so the designer can choose a solution from the Pareto front according to user preferences, such as cost budget or performance requirement. The key idea of a memetic algorithm (MA) (Chen et al. 2011; Wang et al. 2010) is to incorporate problem-specific knowledge into the global search of EAs. Our experimental studies have demonstrated the effectiveness of the proposed method and have shown that by combining the results of MOMA and MOEA we can provide a designer with a highly accurate set of solutions in a reasonable amount of time.

In short, the proposed MOMA-based *task hardening* can be regarded as the upper-level optimizer for the choice between DMR and TMR for each task, while the proposed MMW-PM-based *task mapping* can be regarded as the lower-level solver for task-to-core assignments (*for all original tasks and new tasks*) in an optimal way.

The rest of this paper is organized as follows. In Sect. 2, we introduce the adopted problem model. The proposed design method of fault-tolerant multi-cores, including the optimal MMW-PM-based *task mapping* and MOMA-based *task hardening*, is detailed in Sects. 3 and 4. The simulation results and discussions are given in Sect. 5. Finally, Sect. 6 concludes this paper.

## 2 Problem definition

### 2.1 System model

We consider a resource library *C* = {*c*_{1}, *c*_{2},…, *c*_{M}} consisting of *M* ISA-compatible RISC cores, each of which runs a single thread. Each core *c*_{i} has its own instruction and data cache to execute tasks. Due to performance heterogeneity, e.g., caused by process variations (Herbert et al. 2012; Raghunathan et al. 2013), each core *c*_{i} has its own clock frequency, denoted as *f*_{i} (in cycles per second). For notational brevity, we index the *M* cores in non-increasing order of frequency, i.e., *f*_{max} = *f*_{1} ≥ *f*_{2} ≥ … ≥ *f*_{M} = *f*_{min}.

Independent tasks have been used in modeling some practical applications (Bleuse et al. 2017; Hong and Prasanna 2007), e.g., Monte Carlo simulations and computational phylogeny. In these scenarios, each task is to process a fixed amount of source data. The source data of all the tasks initially reside on a single node in the system, which we call the root node.

An instance of the problem is described as a set *T* = {*t*_{1}, *t*_{2},…, *t*_{N}} of *N* independent tasks (Bleuse et al. 2017; Hong and Prasanna 2007). A task is the atomic unit executed by the multi-core platform. Due to functional differences, each task *t*_{i} has its own program to be executed by a core, and the number of instructions of the compiled program is denoted as *d*_{i}. For notational brevity, we index the *N* tasks in non-increasing order of the number of instructions, i.e., *d*_{max} = *d*_{1} ≥ *d*_{2} ≥ … ≥ *d*_{N} = *d*_{min}. The execution time of task *t*_{i} on core *c*_{j} can then be evaluated as *et*(*t*_{i}, *c*_{j}) = CPI·*d*_{i}/*f*_{j}, where CPI denotes the clock cycles per instruction.
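As a minimal sketch of this execution-time model (the function name is ours, not from the paper; the CPI value of 1.5 is taken from the experimental setup in Sect. 5):

```python
def execution_time(d_i, f_j, cpi=1.5):
    """Execution time (seconds) of a task with d_i instructions on a core
    running at f_j cycles per second, assuming a fixed CPI."""
    return cpi * d_i / f_j

# e.g., 10^7 instructions on a 1 GHz core take 15 ms
print(execution_time(1e7, 1e9))  # 0.015
```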

### 2.2 Hardening technique

In this paper, both DMR with re-execution (Kang et al. 2015; Pradhan and Vaidya 1994; Ziv and Bruck 1997) and TMR (Chen et al. 2016; Lyons and Vanderkulk 1962) are considered for tolerating transient faults. For a task hardened by DMR with re-execution, fault detection is achieved by DMR; if a fault is detected, the task is executed again after rolling back. For a task hardened by TMR, fault masking is directly achieved by majority voting. Both DMR with re-execution and TMR with majority voting can achieve very high reliability even under a high fault rate. For example, if the fault rate of a task on a core is *λ* = 10^{−6} or 10^{−7} (in the unit of #faults/cycle), chosen to represent high-fault scenarios as adopted in related works (Hu et al. 2006; Li et al. 2004), the fault rate of the task hardened by TMR can be evaluated as *λ*_{TMR} = *λ*^{3} + 3(1 − *λ*)*λ*^{2} ≈ 3 × 10^{−12} or 3 × 10^{−14}, and the fault rate of the task hardened by DMR with re-execution can be evaluated as *λ*_{DMR} = 1 − {(1 − *λ*)^{2} + (1 − (1 − *λ*)^{2})(1 − *λ*)^{2}} ≈ 4 × 10^{−12} or 4 × 10^{−14}; here, it is assumed that at most one fault occurs during the execution of each task.
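The two fault-rate estimates above can be reproduced with a short script; the function names are our own, and the at-most-one-fault-per-execution assumption of the text is kept:

```python
def lam_tmr(lam):
    # TMR fails when all three replicas fail, or exactly two of three fail
    return lam**3 + 3 * (1 - lam) * lam**2

def lam_dmr(lam):
    # DMR+re-execution succeeds if both replicas are fault-free, or a
    # detected mismatch is followed by a fault-free re-executed pair
    ok = (1 - lam)**2
    return 1 - (ok + (1 - ok) * ok)

print(lam_tmr(1e-6))  # ~3e-12, matching the text
print(lam_dmr(1e-6))  # ~4e-12, matching the text
```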

Although both DMR with re-execution and TMR can achieve very high reliability, they are characterized by different overheads in cost and time. Obviously, three cores are required by TMR, while only two cores are required by DMR with re-execution. On the other hand, the worst-case execution time (WCET) of a task *t*_{i} in TMR mode can be evaluated as *wcet*(*t*_{i}) = max{*et*(*t*_{i1}), *et*(*t*_{i2}), *et*(*t*_{i3})} + *et*_{v}, where tasks *t*_{i1}, *t*_{i2}, and *t*_{i3} are the replicas of task *t*_{i} and *et*_{v} is the execution time of majority voting, while the WCET of a task *t*_{i} in DMR mode can be evaluated as *wcet*(*t*_{i}) = 2max{*et*(*t*_{i1}), *et*(*t*_{i2})} + *et*_{r} + *et*_{c}, where tasks *t*_{i1} and *t*_{i2} are the replicas of task *t*_{i}, *et*_{c} is the execution time of comparison, and *et*_{r} is the execution time of rollback. Since a multi-core platform can provide low-overhead communication for comparing and voting, both *et*_{c} and *et*_{v} are negligible compared with *et*(*t*_{i}). Therefore, TMR mode is more efficient than DMR mode in terms of WCET.

### 2.3 Problem statement

Given a resource library *C* = {*c*_{1}, *c*_{2},…, *c*_{M}} with frequencies *f*_{max} = *f*_{1} ≥ *f*_{2} ≥ … ≥ *f*_{M} = *f*_{min} and an instance consisting of independent tasks *T* = {*t*_{1}, *t*_{2},…, *t*_{N}} with instruction counts *d*_{max} = *d*_{1} ≥ *d*_{2} ≥ … ≥ *d*_{N} = *d*_{min}, the goal of this paper is to (1) determine TMR mode or DMR mode for each task, and (2) allocate a core to each task (*including all original tasks and new tasks*), such that (1) the number of cores used, i.e., *K* = 2*K*_{DMR} + 3*K*_{TMR}, is minimized, where *K*_{DMR} and *K*_{TMR} are the numbers of tasks hardened by DMR and TMR, respectively, and *K*_{DMR} + *K*_{TMR} = *N*, and (2) the worst-case makespan, i.e., *wcet*(*T*) = max{*wcet*(*t*_{1}), *wcet*(*t*_{2}),…, *wcet*(*t*_{N})}, is minimized. It is assumed that the library contains enough cores so that no two tasks or replicas are mapped to the same core, in order to fully exploit the parallelism of the multi-core platform.

To our knowledge, this is the first work specialized for independent tasks in the design of fault-tolerant multi-cores, as all the previous works (Das et al. 2014; Gan et al. 2012; Khosravi et al. 2014; Pop et al. 2009; Stralen and Pimentel 2012) consider general task sets (e.g., there are data dependences between tasks). As mentioned above, *task mapping* for general task sets is a well-known NP-hard problem. Independent task set, as a special kind of task sets, has been widely applied in modeling many real-world applications (Alazzoni and Down 2008; Bleuse et al. 2017; Cortadella et al. 2006; Hong and Prasanna 2007), but without considering faults. As shown in Sect. 3, under the assumption that the execution modes for all tasks are already known beforehand (Sect. 4), the *task mapping* of independent tasks can be optimized by employing min–max-weight perfect matching (MMW-PM) to minimize the worst-case makespan with polynomial time complexity.

## 3 Optimal task mapping

### 3.1 Greedy mapping algorithm and a motivational example

The *greedy mapping* algorithm (Algorithm 1) maps each task to the cores on which it can be completed as fast as possible. In the following, we provide a motivational example to explain why *greedy mapping* is not good enough for *task mapping*. Suppose that we are given three tasks, i.e., *t*_{1}, *t*_{2}, and *t*_{3}. Task *t*_{1} is executed in TMR mode, while tasks *t*_{2} and *t*_{3} are executed in DMR mode. Now, we consider the *task mapping* problem to allocate the cores to the tasks for minimizing the worst-case makespan, and the execution times of the tasks on the cores are shown in Table 1. For simplicity, *et*_{v}, *et*_{c}, and *et*_{r} are assumed to be zero here. In this example, we can check all the possible mappings to obtain the optimal result, which is 74.16 ms, where TMR mode task *t*_{1} uses core group {*c*_{1}, *c*_{2}, *c*_{6}}, while DMR mode tasks *t*_{2} and *t*_{3} use core groups {*c*_{3}, *c*_{7}} and {*c*_{4}, *c*_{5}}, respectively. By using *greedy mapping* to assign the tasks and cores, the result is 79.08 ms, where TMR mode task *t*_{1} uses core group {*c*_{3}, *c*_{4}, *c*_{7}}, while DMR mode tasks *t*_{2} and *t*_{3} use core groups {*c*_{2}, *c*_{5}} and {*c*_{1}, *c*_{6}}, respectively. From this example, we can observe that the *greedy mapping* strategy is not good enough. As a consequence, such a *task mapping* problem requires a better strategy, whereas straightforward exhaustive search is obviously not feasible in practice owing to its high time complexity.

Table 1 Execution time of tasks on cores (ms)

| Task | *c*_{1} | *c*_{2} | *c*_{3} | *c*_{4} | *c*_{5} | *c*_{6} | *c*_{7} |
|---|---|---|---|---|---|---|---|
| *t*_{1} | 44.19 | 39.14 | 38.08 | 38.10 | 38.80 | 41.54 | 29.24 |
| *t*_{2} | 43.03 | 38.11 | 37.08 | 37.10 | 37.78 | 40.45 | 28.47 |
| *t*_{3} | 39.54 | 35.02 | 34.07 | 34.09 | 34.72 | 37.17 | 26.16 |
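Under the stated assumptions (*et*_{v} = *et*_{c} = *et*_{r} = 0, so the WCET is the maximum replica time for TMR and twice the maximum replica time for DMR), the example can be checked with a short brute-force script; the helper names are ours:

```python
from itertools import combinations

# Execution times from Table 1 (ms): rows t1..t3, columns c1..c7.
ET = [
    [44.19, 39.14, 38.08, 38.10, 38.80, 41.54, 29.24],  # t1 (TMR)
    [43.03, 38.11, 37.08, 37.10, 37.78, 40.45, 28.47],  # t2 (DMR)
    [39.54, 35.02, 34.07, 34.09, 34.72, 37.17, 26.16],  # t3 (DMR)
]
SIZES = [3, 2, 2]    # TMR needs 3 cores, DMR needs 2
FACTORS = [1, 2, 2]  # wcet = max(et) for TMR, 2*max(et) for DMR

def wcet(task, group):
    # worst-case execution time of one hardened task on a core group
    return FACTORS[task] * max(ET[task][c] for c in group)

def exhaustive():
    # enumerate every disjoint assignment of core groups to the three tasks
    best = float("inf")
    for g1 in combinations(range(7), SIZES[0]):
        rest = [c for c in range(7) if c not in g1]
        for g2 in combinations(rest, SIZES[1]):
            g3 = tuple(c for c in rest if c not in g2)
            best = min(best, max(wcet(0, g1), wcet(1, g2), wcet(2, g3)))
    return best

def greedy():
    # for each task in turn, take the free core group that finishes it fastest
    free = set(range(7))
    makespan = 0.0
    for task in range(3):
        group = min(combinations(sorted(free), SIZES[task]),
                    key=lambda g: wcet(task, g))
        free -= set(group)
        makespan = max(makespan, wcet(task, group))
    return makespan

print(exhaustive())  # 74.16 ms, the optimum quoted in the text
print(greedy())      # 79.08 ms, the greedy result quoted in the text
```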

### 3.2 Min–max-weight perfect matching

Consider a bipartite graph *G* = (*U*, *V*, *E*), where *U* and *V* are disjoint and all edges in *E* go between *U* and *V*. A matching is a subset of edges *M* ⊆ *E* such that for every vertex *v* ∈ *U* ∪ *V*, at most one edge of *M* is incident on *v*, that is, no two edges share a common vertex. We say that a vertex *v* ∈ *U* ∪ *V* is matched by matching *M* if some edge in *M* is incident on *v*; otherwise, *v* is unmatched. A perfect matching is a matching that matches all vertices of the graph, that is, every vertex of the graph is incident to exactly one edge of the matching. The problem of finding a perfect matching can be solved by the Hungarian algorithm (Kuhn 1955).

Min–max-weight perfect matching (MMW-PM) is defined on a weighted bipartite graph as a perfect matching in which the maximal weight of the edges has a minimal value among all perfect matchings of the given graph. One should note that the MMW-PM problem is different from the maximum (or minimum)-weight perfect matching, where the sum of the weights of the edges in the matching has a maximal (or minimal) value among all perfect matchings of the given graph.

It is easy to link the *task mapping* problem to the proposed MMW-PM model. Given a *task hardening* solution, we can construct a new task set \( T^{\prime } = \left\{ {t_{1}^{\prime } ,t_{2}^{\prime } , \ldots ,t_{K}^{\prime } } \right\} \) by duplicating each task in *T* = {*t*_{1}, *t*_{2},…, *t*_{N}} two times (DMR mode) or three times (TMR mode), where *K* = 2*K*_{DMR} + 3*K*_{TMR}, and *K*_{DMR} and *K*_{TMR} are the numbers of tasks hardened by DMR and TMR, respectively. On the other hand, we can obtain the resource allocation \( C^{\prime } = \left\{ {c_{1}^{\prime } ,c_{2}^{\prime } , \ldots ,c_{K}^{\prime } } \right\} \) by selecting the first *K* cores with the highest frequencies from *C* = {*c*_{1}, *c*_{2},…, *c*_{M}}. Then, a bipartite graph can be built as \( G^{\prime } = \left( {T^{\prime } ,C^{\prime } ,E^{\prime } } \right) \), where the edges in \( E^{\prime } \) connect each pair of \( t_{i}^{\prime } \) and \( c_{j}^{\prime } \). The associated weight of each edge \( \left( {t_{i}^{\prime } ,c_{j}^{\prime } } \right) \) is calculated as: (1) \( w\left( {t_{i}^{\prime } ,c_{j}^{\prime } } \right) = et\left( {t_{i}^{\prime } ,c_{j}^{\prime } } \right) + et_{v} \) if task \( t_{i}^{\prime } \) is executed in TMR mode, or (2) \( w\left( {t_{i}^{\prime } ,c_{j}^{\prime } } \right) = 2et\left( {t_{i}^{\prime } ,c_{j}^{\prime } } \right) + et_{r} + et_{c} \) if task \( t_{i}^{\prime } \) is executed in DMR mode. Then, for an MMW-PM \( M^{\prime } \) of \( G^{\prime } \), the edges in \( M^{\prime } \) represent the mappings from task replicas to cores, and the associated maximum edge weight is the worst-case makespan.

Therefore, we can obtain the real worst-case makespan of *T* by employing the above weighting strategy.

### 3.3 MMW-PM heuristic

The above section shows how to link the *task mapping* problem to the MMW-PM model. A naive way to find an MMW-PM (i.e., the optimal mapping) is to enumerate all perfect matchings in the given bipartite graph and then select the one whose maximal edge weight is minimal. Instead of using such an enumeration method, we use an efficient heuristic algorithm proposed in our previous work (Zhong et al. 2016) for the MMW-PM problem with low time complexity.

Consider an undirected weighted bipartite graph *G* = (*U*, *V*, *E*, *W*), where *U* and *V* are disjoint and all edges in *E* go between *U* and *V*. At first, we sort the edges in *G* in ascending order of their weights. Our heuristic selects an edge *e* in *G* and removes all the edges in *G* whose weights are larger than that of *e*, while a perfect matching method (i.e., the Hungarian algorithm (Kuhn 1955)) is used to check whether a perfect matching still exists. The framework of the heuristic is iterative, based on binary search, as shown in Algorithm 2 (Zhong et al. 2016). The algorithm starts with an initial perfect matching *M* obtained by the Hungarian algorithm (line 1), and then we obtain \( E^{\prime } \) by sorting the edges in *E* in ascending order of their weights (line 2). At each iteration, an edge *e* in \( E^{\prime } \) is selected by binary search (lines 5 and 6) and a new graph \( G^{\prime } \) is obtained by removing all the edges with larger weights from *G* (line 7); then we obtain a new matching \( M^{\prime } \) by running the Hungarian algorithm on \( G^{\prime } \) (line 8). If the cardinality of \( M^{\prime } \) equals that of *M*, i.e., \( M^{\prime } \) is a perfect matching as well, we update *M* to \( M^{\prime } \) (line 10) and set *high* to *mid* (line 11); otherwise, we set *low* to *mid* + 1 (line 13). The final solution *M* is returned when the loop terminates (line 16).

Obviously, Algorithm 2 (Zhong et al. 2016) returns a perfect matching of the input graph, as the resulting matching *M* has the same cardinality as the initial *M* obtained from the input graph (line 1). Besides, the edges in *G* are removed in ascending order of their weights, so the resulting matching *M* satisfies that the maximal weight of its edges has a minimal value among all perfect matchings of the input graph. Therefore, Algorithm 2 can indeed find an MMW-PM of the given graph, with a much higher efficiency than the enumeration method. Given an undirected weighted bipartite graph *G* = (*U*, *V*, *E*, *W*) with |*U*| = |*V*| = *n* (*n* = *K* in our case), the time complexities of the Hungarian algorithm and binary search are *O*(*n*^{3}) and *O*(log *n*), respectively. Therefore, the time complexity of Algorithm 2 is *O*(*n*^{3}log *n*).
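A minimal sketch of this binary-search scheme, assuming a square weight matrix and substituting a simple augmenting-path matching (Kuhn's algorithm) for the feasibility check; the names and structure are ours, not the authors' implementation:

```python
def kuhn_match(adj, n):
    """Maximum bipartite matching via Kuhn's augmenting-path algorithm.
    adj[u] lists the right-side vertices that left vertex u may match."""
    match = [-1] * n  # match[v] = left vertex currently matched to v

    def try_augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if match[v] == -1 or try_augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    return sum(try_augment(u, set()) for u in range(len(adj)))

def mmw_pm(W):
    """Min-max-weight perfect matching on an n x n weight matrix W:
    binary-search the bottleneck weight, keeping the smallest threshold
    under which a perfect matching still exists (cf. Algorithm 2)."""
    n = len(W)
    weights = sorted({w for row in W for w in row})
    lo, hi = 0, len(weights) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        adj = [[v for v in range(n) if W[u][v] <= weights[mid]]
               for u in range(n)]
        if kuhn_match(adj, n) == n:
            hi = mid       # a perfect matching survives: tighten the threshold
        else:
            lo = mid + 1   # threshold too tight: relax it
    return weights[lo]

# Weights for the Table 1 instance: t1 hardened by TMR (3 replicas,
# weight et), t2 and t3 by DMR (2 replicas each, weight 2*et);
# et_v = et_c = et_r = 0 as in the motivational example.
ET = [
    [44.19, 39.14, 38.08, 38.10, 38.80, 41.54, 29.24],
    [43.03, 38.11, 37.08, 37.10, 37.78, 40.45, 28.47],
    [39.54, 35.02, 34.07, 34.09, 34.72, 37.17, 26.16],
]
W = []
for task, (copies, factor) in enumerate([(3, 1), (2, 2), (2, 2)]):
    W.extend([[factor * e for e in ET[task]] for _ in range(copies)])

print(mmw_pm(W))  # 74.16 ms, the optimal makespan of the example
```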

## 4 Multi-objective redundancy hardening

### 4.1 Trade-offs in task hardening

Consider the two extreme solutions of *task hardening* for a problem instance: the solution in which all tasks are hardened by DMR, and the solution in which all tasks are hardened by TMR. "DMR for all tasks" uses the least number of cores, while "TMR for all tasks" achieves the best performance in terms of worst-case makespan. In order to provide the designer with a set of solutions with different trade-offs in cost and performance, we model the problem as a multi-objective optimization problem. Specifically, in terms of cost, the objective is to minimize the number of cores used, while in terms of performance, the objective is to minimize the worst-case makespan, so we have:

minimize *Object*_{1} = *K* = 2*K*_{DMR} + 3*K*_{TMR}, (1)

minimize *Object*_{2} = *wcet*(*T*) = max{*wcet*(*t*_{1}), *wcet*(*t*_{2}),…, *wcet*(*t*_{N})}. (2)

### 4.2 MOEA-based task hardening

The MOEA-based implementation of *task hardening* is given as follows.

- (1) Encoding: Each individual in the *task hardening* population is represented by an *N*-dimensional 0–1 vector *th* = {*th*_{1}, *th*_{2}, …, *th*_{N}} ∈ {0, 1}^{N}, where *N* is the number of tasks. Each position of the vector describes the fault-tolerant technique assigned to the corresponding task, i.e., *th*_{i} = 0 indicates that task *t*_{i} is hardened by DMR with re-execution, and *th*_{i} = 1 indicates that task *t*_{i} is hardened by TMR.

- (2) Crossover: The crossover operator recombines two parent *task hardening* solutions.

- (3) Mutation: The mutation operator perturbs a single *task hardening* solution.

- (4) Selection: Selection occurs twice during each generation of the GA. In our implementation, selection for reproduction is performed before the crossover operator is applied, on a purely random basis without bias toward any individual, while selection for survival is performed according to NSGA-II, i.e., a fast nondominated sorting approach combined with a selection process that considers both fitness and spread.

The outline of the *MOEA*-*based hardening* is given in Algorithm 3, where *PS*, *P*_{c}, and *P*_{m} indicate the population size, probability of crossover, and probability of mutation, respectively. The algorithm starts with a population *TH* consisting of *PS* random individuals for *task hardening* (line 1). The fitness in terms of cost, i.e., *Object*_{1}, is evaluated according to Eq. (1), and the fitness in terms of performance, i.e., *Object*_{2}, is evaluated based on the proposed *MMW*-*PM*-*based mapping* (Algorithm 2) and Eq. (2) (line 2). During each generation, *TH* is evolved and the population of *PS* individuals generates *PS* children through the crossover operation (lines 4–7) and the mutation operation (lines 8–10). Then, the offspring *NTH* are evaluated (line 11) and used to update the current population (line 12) based on the fast nondominated sort. When the given maximum number of fitness evaluations is reached, the algorithm stops (line 13).
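The fast nondominated sorting step used for survivor selection can be sketched as follows; this is a generic textbook version of the NSGA-II component (for minimization of both objectives), not the authors' code:

```python
def fast_nondominated_sort(objs):
    """NSGA-II-style fast nondominated sort for minimization.
    objs is a list of objective tuples, e.g. (Object1, Object2);
    returns a list of fronts, each a list of solution indices."""
    n = len(objs)
    dominates = lambda a, b: (all(x <= y for x, y in zip(a, b))
                              and any(x < y for x, y in zip(a, b)))
    S = [[] for _ in range(n)]   # S[i]: solutions dominated by i
    cnt = [0] * n                # cnt[i]: how many solutions dominate i
    fronts = [[]]
    for i in range(n):
        for j in range(n):
            if dominates(objs[i], objs[j]):
                S[i].append(j)
            elif dominates(objs[j], objs[i]):
                cnt[i] += 1
        if cnt[i] == 0:
            fronts[0].append(i)  # nondominated: first front
    k = 0
    while fronts[k]:
        nxt = []
        for i in fronts[k]:
            for j in S[i]:
                cnt[j] -= 1
                if cnt[j] == 0:  # all dominators already placed in fronts
                    nxt.append(j)
        k += 1
        fronts.append(nxt)
    return fronts[:-1]
```

For example, with hypothetical (cores, makespan) pairs, a solution that is worse in both objectives than some other solution falls into a later front.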

### 4.3 MOMA-based task hardening

As shown in the following experiments (Sect. 5), it is very difficult to obtain the whole Pareto front using the proposed *MOEA*-*based hardening* method, so we cannot provide well-distributed solutions for the designer to choose from. In particular, it is hard to minimize *Object*_{2} by NSGA-II, so the obtained solutions are all concentrated in the region where *Object*_{1} is minimized. This is because *Object*_{2} involves a complex nonlinear mapping from decision variables to fitness value (i.e., via *MMW*-*PM*-*based mapping*), while *Object*_{1} is just a linear combination of the variables (i.e., Eq. 1), and the selection pressure on *Object*_{2} is too weak in the adopted NSGA-II framework.

In order to incorporate search bias toward *Object*_{2}, we propose a problem-specific local search operator, which can be regarded as a kind of meme in the context of MAs (Rubio-Largo et al. 2016; Tersi et al. 2015; Yuan and Xu 2015). In combinatorial optimization, local search operators generally take the form of heuristics customized to a specific problem. In our implementation, the key idea of the heuristic is inherited from the previous greedy reassignment local search (GRLS) operators for *logic mapping* (Yuan et al. 2016; Yuan et al. 2014), which reassign the gene values of part of the parent chromosome by taking advantage of greedy information extracted from the problem instance. In this paper, this idea is extended to *task hardening*.

As analyzed in Sect. 2, TMR mode is more efficient than DMR mode in terms of WCET; a task is thus expected to complete earlier if it is executed in TMR mode. The GRLS operator designed for *task hardening* tries to flip the hardening technique from DMR to TMR for a task *t*_{i} with a large *d*_{i}, which is expected to reduce the worst-case makespan, i.e., *Object*_{2}. To limit the time overhead added to the iterative process of the GA, the time complexity of the operator should be as low as possible. In fact, the priority list of tasks in terms of *d*_{i} can be sorted in advance; thus, the greedy information of the problem instance only needs to be evaluated once, before the optimization process.

Since the operator is designed to complement the stochastic search of the GA, its incorporation should maintain randomness as well. An operator with strong greediness would weaken the stochastic nature of the GA and increase the risk of convergence to local optima. Therefore, care should be taken when setting the strength of the GRLS operator to achieve the best search performance. In our implementation, for a given solution, a control parameter is introduced to limit the number of tasks to which the GRLS operator is applied. Specifically, *N*·*μ* tasks are randomly selected, where 0 ≤ *μ* ≤ 1 is defined as the greedy strength factor that provides flexible control over the randomness or greediness of the GRLS operator.

The GRLS operator for *task hardening* is given in Algorithm 4. First, *N*·*μ* tasks are randomly selected and marked as unhardened (line 1). In the loop (lines 2–6), an unhardened task *t*_{i} with the maximum *d*_{i} is selected (line 3); if *t*_{i} is hardened by DMR, its mode is changed from DMR to TMR, and the loop terminates (line 4). By applying Algorithm 4 to each solution generated during the *MOEA*-*based hardening* algorithm, we obtain the *MOMA*-*based hardening* algorithm.
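A sketch of the GRLS operator as described above, under our own naming; the text leaves the exact sampling and tie-breaking details open, so this is only one plausible reading, not the authors' implementation:

```python
import random

def grls(th, d, mu, rng=random):
    """Greedy reassignment local search for task hardening: among N*mu
    randomly chosen tasks, flip the DMR-hardened one with the largest
    instruction count d_i to TMR (0 = DMR, 1 = TMR). Returns a copy."""
    th = list(th)
    n = len(th)
    picked = rng.sample(range(n), max(1, int(n * mu)))
    # consider only selected tasks currently in DMR mode, largest d_i first
    dmr = sorted((i for i in picked if th[i] == 0),
                 key=lambda i: d[i], reverse=True)
    if dmr:
        th[dmr[0]] = 1  # DMR -> TMR for the heaviest selected DMR task
    return th
```

With *μ* = 1 every task is selected, so the operator deterministically flips the DMR task with the globally largest *d*_{i}.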

## 5 Simulations and discussion

In this section, the performance of the proposed method is experimentally investigated. First, we show that the proposed *MMW*-*PM*-*based mapping* consistently outperforms *greedy mapping*, and we test the impact of the control parameter *μ* on the performance of the proposed *MOMA*-*based hardening*. Then, the effectiveness of *MOMA*-*based hardening* is verified against *MOEA*-*based hardening* by extensive experiments, especially on large-scale benchmarks. Finally, we show that by combining the results of *MOMA*-*based hardening* and *MOEA*-*based hardening* we can provide the designer with a highly accurate set of solutions in a reasonable amount of time.

In the experiments, a large set of benchmarks with different scales is synthesized, where *N* = 10, 30, 50, and 100 indicates the number of tasks in the problem instances, and accordingly *M* > 3*N* indicates the number of cores in the resource library. We generate four test instances for each case. The number of instructions (*d*) of each compiled task is randomly determined in the range from 10^{7} to 5 × 10^{7}, and the value of CPI is assumed to be 1.5. The frequencies (*f*) of the cores are normally distributed with mean 1 GHz and standard deviation 0.1 GHz. The voting time (*et*_{v}) involved in TMR, the comparison time (*et*_{c}) involved in DMR, and the recovery time (*et*_{r}) involved in re-execution are assumed to be 2 ms, 2 ms, and 6 ms, respectively.
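The benchmark synthesis described above can be sketched as follows; the choice *M* = 3*N* + 1 is our own placeholder for the paper's condition "*M* > 3*N*", and the function name is ours:

```python
import random

def make_instance(n, rng=random):
    """Synthesize one benchmark: n tasks with instruction counts d drawn
    uniformly from [1e7, 5e7], and M = 3n + 1 (> 3n) cores with
    frequencies drawn from N(1 GHz, 0.1 GHz). Both lists are sorted in
    the paper's non-increasing index order."""
    d = sorted((rng.uniform(1e7, 5e7) for _ in range(n)), reverse=True)
    m = 3 * n + 1  # assumed placeholder satisfying M > 3N
    f = sorted((rng.gauss(1e9, 1e8) for _ in range(m)), reverse=True)
    return d, f

d, f = make_instance(10, random.Random(42))
```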

GA is used as the evolutionary search engine, and the parameters of population size (*PS*), probability of crossover (*P*_{c}), and probability of mutation (*P*_{m}) are set experimentally as *PS* = 2*N*, *P*_{c} = 0.8, and *P*_{m} = 0.8/*N*. The maximum number of generations is set to 50 to limit the runtime (thus the number of function evaluations is 100*N*), and in most cases both MOMA and MOEA have converged to good solutions. Both generational distance (GD) and inverted generational distance (IGD) are used as performance metrics for the proposed *MOMA*-*based hardening* and *MOEA*-*based hardening*. GD is defined as the mean distance in the objective space from each obtained solution to its nearest Pareto optimal solution, while IGD is defined as the mean distance in the objective space from each Pareto optimal solution to its nearest obtained solution. GD thus measures the proximity of the nondominated solutions to the Pareto optimal front, while IGD measures both their proximity and their diversity with respect to the Pareto optimal front; the smaller the value, the better the Pareto optimal front is approximated. As is customary, the nondominated solutions obtained by multiple runs of both MOMA and MOEA with larger populations and more generations are taken as the Pareto optimal set, which serves as the reference set for GD and IGD. We implement the algorithms in MATLAB. All the experiments are performed on a 3.2 GHz Intel Core i5-6500 quad-core platform with 8 GB memory; however, all the tested algorithms are implemented as monolithic processes and no CPU core parallelism is exploited.
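
The two metrics, as defined above, can be computed directly; a minimal sketch, where solutions and reference points are tuples in objective space:

```python
import math

def gd(obtained, reference):
    """Generational distance: mean distance from each obtained solution
    to its nearest reference (Pareto optimal) solution."""
    return sum(min(math.dist(s, r) for r in reference)
               for s in obtained) / len(obtained)

def igd(obtained, reference):
    """Inverted generational distance: mean distance from each reference
    solution to its nearest obtained solution."""
    return sum(min(math.dist(r, s) for s in obtained)
               for r in reference) / len(reference)
```

Note the asymmetry: an obtained set that sits exactly on one part of the front has GD = 0 but a large IGD, which is why IGD also captures diversity.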

### 4.1 MMW-PM-based mapping versus greedy mapping

Once *task hardening* is given, the proposed *MMW*-*PM*-*based mapping* can obtain the optimal *task mapping* solution. Figure 3 shows the average worst-case makespan (WCET) of *greedy mapping* and *MMW*-*PM*-*based mapping* on test instances of different scales. In these simulations, since it is impossible to evaluate the algorithms with all possible *task hardening* solutions, each bar in the presented figures is obtained by averaging the results over multiple runs (e.g., 2^{9}, 2^{10}, 2^{11}, and 2^{12} runs for *N* = 10, 30, 50, and 100, respectively). As shown in Fig. 3, the proposed *MMW*-*PM*-*based mapping* consistently outperforms *greedy mapping* in all cases, and the average improvements are 3.40%, 3.74%, 2.74%, and 2.54% for the different problem scales (*N* = 10, 30, 50, and 100, respectively).
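
To make the idea behind the evaluated mapping concrete, min–max-weight perfect matching can be solved by binary-searching over the candidate makespan and testing whether every task can be matched to a distinct core using only assignments that finish within the threshold. The sketch below uses a simple augmenting-path bipartite matching as the feasibility test in place of the paper's Hungarian subroutine, and assumes at least as many cores as tasks:

```python
def _has_perfect_matching(cost, limit):
    """Can every task be matched to a distinct core using only
    edges with cost <= limit? (augmenting-path bipartite matching)"""
    n, m = len(cost), len(cost[0])
    match = [-1] * m  # core j -> task matched to it, or -1

    def augment(i, seen):
        for j in range(m):
            if cost[i][j] <= limit and j not in seen:
                seen.add(j)
                if match[j] == -1 or augment(match[j], seen):
                    match[j] = i
                    return True
        return False

    return all(augment(i, set()) for i in range(n))

def min_max_mapping(cost):
    """Minimize the maximum per-task cost over all perfect matchings
    by binary search on the sorted distinct edge costs."""
    weights = sorted({c for row in cost for c in row})
    lo, hi = 0, len(weights) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if _has_perfect_matching(cost, weights[mid]):
            hi = mid
        else:
            lo = mid + 1
    return weights[lo]
```

For example, with `cost = [[4, 2], [3, 5]]` the two perfect matchings have maxima 5 and 3, so the min–max value is 3.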

### 4.2 Sensitivity of parameter *μ*

The greedy strength factor *μ* controls the randomness or greediness of the GRLS operator; therefore, it provides a balance between the two conflicting objectives. Figure 4 shows the evolutionary curves in terms of IGD for MOMAs with different values of *μ* on the middle-size test instances of *N* = 50. To make the simulations more convincing, each group result is averaged over 30 runs. As shown in the figure, the larger the value of *μ*, the faster the MOMA approximates the Pareto optimal front in terms of IGD. However, a very large *μ* (e.g., *μ* = 0.9) results in poor performance at the final stage, because such strong greediness disturbs the stochastic search of GA too much. On the contrary, a very small *μ* (e.g., *μ* = 0.1) cannot accelerate the approximation to the Pareto front significantly, although it does lead to good results given a long runtime. Therefore, a moderate value of *μ* is preferred (e.g., *μ* = 0.3, 0.5, or 0.7), and the performance of *MOMA*-*based hardening* is not very sensitive to the value of *μ* in this range (from 0.3 to 0.7). We use *μ* = 0.5 in the following simulations.

### 4.3 MOMA-based hardening versus MOEA-based hardening

We compare the GD and IGD values obtained by *MOMA*-*based hardening* and *MOEA*-*based hardening* on test instances of different scales (*N* = 10, 30, 50, and 100). To make the simulations more convincing, each group result is averaged over 30 runs. It can be seen that: (1) for the small-scale problem (*N* = 10), *MOEA*-*based hardening* performs better than *MOMA*-*based hardening*; and (2) for the large-scale problems (*N* = 30, 50, and 100), the mean IGD values obtained by *MOMA*-*based hardening* are much lower than those obtained by *MOEA*-*based hardening* on most test instances, which indicates a better approximation to the Pareto optimal front. The maximum reductions in IGD are about 82%, 75%, and 86%, and the average reductions are about 64%, 70%, and 80% for the different problem scales (*N* = 30, 50, and 100, respectively). The better approximation can be attributed to the introduction of problem-specific knowledge, which demonstrates the effectiveness of the proposed GRLS operator for *task hardening*.

We have performed statistical tests for the GD and IGD values of MOEA and MOMA on each benchmark instance. A two-tailed *t* test is conducted with a null hypothesis, stating that there is no difference between two algorithms in comparison. The null hypothesis is rejected if the *p* value is smaller than the significance level *α* = 0.05. We find that (1) there is no significant difference between the GD values of MOEA and MOMA on all test instances, (2) the IGD values of MOEA are statistically better than those of MOMA on small-scale test instances (*N *= 10), and (3) the IGD values of MOMA are statistically better than those of MOEA on large-scale test instances (*N *= 30, 50, and 100).

### 4.4 A comprehensive comparison

We plot the nondominated solutions obtained by *MOMA*-*based hardening* and *MOEA*-*based hardening* on test instances of different scales (*N* = 10, 30, 50, and 100). In each figure, the green circle represents the all-"0" solution, i.e., *th* = {0}^{N}, meaning that all tasks are hardened by DMR, while the magenta star represents the all-"1" solution, i.e., *th* = {1}^{N}, meaning that all tasks are hardened by TMR. Obviously, the all-"0" solution minimizes the number of cores according to Eq. (1), while the all-"1" solution achieves the best performance in terms of worst-case makespan. As shown in the figures, solutions near the all-"0" solution can be obtained by the *MOEA*-*based hardening* method, while the *MOMA*-*based hardening* method approaches the best performance with fewer cores. Another observation is that the results obtained by *MOEA*-*based hardening* are mostly located in the area where *Object*_{1} (number of cores) is minimized, especially for the large-scale problems (*N* = 50 and 100). As analyzed above, this is because *Object*_{2} (WCET) involves a complex nonlinear mapping from the decision variables to the fitness value, while *Object*_{1} is just a linear combination of the decision variables; the selection pressure on *Object*_{2} is too weak in the adopted NSGA-II framework. By incorporating the proposed GRLS operator, *MOMA*-*based hardening* can cover this absent area, because the operator is biased toward minimizing WCET by using more space redundancy (from DMR to TMR). We can obtain a well-distributed Pareto front by combining the solutions of both methods; thus, we can provide the designer with a highly accurate set of solutions.

### 4.5 Hybrid fitness evaluation

Since the time complexity (*O*(*n*^{3} log *n*)) of the Hungarian-based optimal MMW-PM method (Algorithm 2) is high, the whole EA process is very time-consuming for large-scale problems, e.g., more than 2 days for the *n* = 100 test instances. As a remedy, we propose a hybrid fitness evaluation method to balance the accuracy and time efficiency of fitness evaluations. The key idea is to combine the optimal and greedy MMW-PM methods in the fitness evaluation process (Jin et al. 2002; Jin 2005). Specifically, we perform a single optimal MMW-PM evaluation for the whole population every *Δ* iterations, where *Δ* is called the exact evaluation gap. Several values of the exact evaluation gap (*Δ* = 1, 5, 10, and 25) are tested, as shown in Figs. 13 and 14, where *Δ* is marked as "GAP." The results show that the runtime can be reduced significantly if the performance degradation is acceptable to the designers, e.g., *Δ* = 5.
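
The exact-evaluation-gap scheme amounts to a simple dispatch; a minimal sketch, where `exact_eval` and `greedy_eval` stand in for the optimal (Hungarian-based) and greedy MMW-PM evaluations (names hypothetical):

```python
def evaluate_population(population, generation, gap, exact_eval, greedy_eval):
    """Hybrid fitness evaluation with exact evaluation gap `gap`:
    the expensive exact evaluator runs for the whole population once
    every `gap` generations; the cheap greedy evaluator runs otherwise."""
    if generation % gap == 0:
        return [exact_eval(ind) for ind in population]
    return [greedy_eval(ind) for ind in population]
```

With `gap = 5`, four out of every five generations pay only the greedy cost, which is where the reported runtime reduction comes from.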

After obtaining the Pareto front, how to make the final decision remains an open problem. From the point of view of multi-core system design, the designer can choose a solution from the Pareto front based on the available hardware resources; for example, if the resource requirements can be satisfied, the designer can implement a high-performance multi-core system at a low hardware cost. Alternatively, the designer could select the knee point of the Pareto front. It is important to note that the approximate Pareto front found by our algorithm informs the designer about the trade-offs among the conflicting objectives, which is essential for making an informed final decision.

## 5 Conclusion

This paper addresses *task hardening* and *task mapping* jointly for independent tasks on heterogeneous multi-core systems, in order to achieve low cost and high performance. To minimize the worst-case makespan with optimal *task mapping*, we show that the mapping problem can be modeled as a min–max-weight perfect matching problem. We then propose a polynomial-time heuristic algorithm that works in a binary search framework and employs the Hungarian algorithm as a subroutine. As there is a trade-off between cost and performance in *task hardening*, we model the *task hardening* problem as a multi-objective optimization problem. Since a MOEA can find only a partial Pareto front, we propose a problem-specific local search operator and incorporate it into the MOEA framework. As shown by our simulation results, the proposed MOMA method can cover the missing part of the Pareto front, so we can obtain the entire Pareto front by combining the solutions of MOEA and MOMA.

This work considers independent (parallel) tasks; in future work, we will extend it to tasks with data dependencies, e.g., sequential tasks. Besides, we will consider dynamic task mapping, where the frequencies of the cores change with heat and other factors.

## Notes

### Acknowledgements

This work was supported by the National Key R&D Program of China (Grant No. 2017YFC0804002), the Royal Society (NA160545), the National Natural Science Foundation of China (Grant Nos. 61503357 and 617611360), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No. 2017ZT07X386), Shenzhen Peacock Plan (Grant No. KQTD2016112514355531), the Science and Technology Innovation Committee Foundation of Shenzhen (Grant Nos. ZDSYS201703031748284, JCYJ20170307105521943, JCYJ20170817112421757 and JCYJ20180504165652917), and the Program for University Key Laboratory of Guangdong Province (Grant No. 2017KSYS008).

### Compliance with ethical standards

### Conflict of interest

The authors declare that they have no conflict of interest.

### Human and animal rights

This article does not contain any studies with human participants or animals.

## References

- Alazzoni I, Down DG (2008) Linear programming-based affinity scheduling of independent tasks on heterogeneous computing systems. IEEE Trans Parallel Distrib Syst 19(12):1671–1682
- Baumann RC (2005) Radiation-induced soft errors in advanced semiconductor technologies. IEEE Trans Device Mater Reliab 5(3):305–316
- Bleuse R, Hunold S, Kedad-Sidhoum S, Monna F, Mounie G, Trystram D (2017) Scheduling independent moldable tasks on multi-cores with GPUs. IEEE Trans Parallel Distrib Syst 28:9. https://doi.org/10.1109/tpds.2017.2675891
- Bolchini C, Miele A (2013) Reliability-driven system-level synthesis for mixed-critical embedded systems. IEEE Trans Comput 62(12):2489–2502
- Bolchini C, Miele A, Pilato C (2011) Combined architecture and hardening techniques exploration for reliable embedded system design. In: Proceedings of the Great Lakes symposium on VLSI, 2011, pp 301–306
- Bowman K, Duvall S, Meindl J (2002) Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration. IEEE J Solid State Circuits 37(2):183–190
- Chen X, Ong Y-S, Lim M-H, Tan KC (2011) A multi-facet survey on memetic computation. IEEE Trans Evol Comput 15(5):591–607
- Chen K-H, Chen J-J, Kriebel F, Rehman S, Shafique M, Henkel J (2016) Task mapping for redundant multithreading in multi-cores with reliability and performance heterogeneity. IEEE Trans Comput 65(11):3441–3455
- Constantinescu C (2003) Trends and challenges in VLSI circuit reliability. IEEE Micro 23(4):14–19
- Cortadella J, Kondratyev A, Lavagno L, Passerone C, Watanabe Y (2006) Quasi-static scheduling of independent tasks for reactive systems. IEEE Trans Comput Aided Des Integr Circuits Syst 24(10):1492–1514
- Das A, Kumar A, Veeravalli B, Bolchini C, Miele A (2014) Combined DVFS and mapping exploration for lifetime and soft-error susceptibility improvement in MPSoCs. In: Proceedings of the design automation and test in Europe conference and exhibition 2014, pp 1–6
- Dave B, Jha N (1999) COFTA: hardware-software co-synthesis of heterogeneous distributed embedded systems for low overhead fault tolerance. IEEE Trans Comput 48(4):417–441
- Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197
- Dighe S, Vangal SR, Aseron P, Kumar S, Jacob T, Bowman KA, Howard J, Tschanz J, Erraguntla V, Borkar N, De V, Borkar S (2011) Within-die variation-aware dynamic voltage-frequency-scaling with optimal core allocation and thread hopping for the 80-core teraflops processor. IEEE J Solid-State Circuits 46(1):184–193
- Ebi T, Faruque M, Henkel J (2009) Tape: thermal-aware agent based power economy multi/many-core architectures. In: Proceedings of the international conference on computer-aided design, pp 302–309
- Erbas C, Ceraverbas S, Pimentel AD (2006) Multiobjective optimization and evolutionary algorithms for the application mapping problem in multiprocessor system-on-chip design. IEEE Trans Evol Comput 10(3):358–374
- Gan J, Pop P, Gruian F, Madsen J (2012) Robust and flexible mapping for real-time distributed applications during the early design phases. In: Proceedings of the design, automation test on European conference exhibition, pp 935–940
- Glaß M, Lukasiewycz M, Streichert T, Haubelt C, Teich J (2007) Reliability-aware system synthesis. In: Proceedings of the design, automation test on European conference exhibition, pp 409–414
- Henkel J, Bauer L, Dutt N, Gupta P, Nassif S, Shafique M, Tahoori M, Wehn N (2013a) Reliable on-chip systems in the nano-era: lessons learnt and future trends. In: Proceedings of the 50th annual design, automation conference, pp 1–10
- Henkel J, Narayanan V, Parameswaran S, Teich J (2013b) Run-time adaption for highly-complex multi-core systems. In: Proceedings of the international conference on hardware/software codesign and system synthesis, pp 1–8
- Herbert S, Garg S, Marculescu D (2012) Exploiting process variability in voltage/frequency control. IEEE Trans Very Large Scale Integr Syst 20(8):1392–1404
- Hong B, Prasanna VK (2007) Adaptive allocation of independent tasks to maximize throughput. IEEE Trans Parallel Distrib Syst 18(10):1420–1435
- Hu J, Wang S, Ziavras S (2006) In-register duplication: exploiting narrow-width value for improving register file reliability. In: Proceedings of the international conference on dependable systems and networks, pp 281–290
- Huang J, Blech J, Raabe A, Buckl C, Knoll A (2011) Analysis and optimization of fault-tolerant task scheduling on multiprocessor embedded systems. In: Proceedings of the international conference on hardware/software codesign and system synthesis, pp 247–256
- Huang J, Huang K, Raabe A, Buckl C, Knoll A (2012) Towards fault-tolerant embedded systems with imperfect fault detection. In: Proceedings of the 49th annual design, automation conference, pp 188–196
- Izosimov V, Polian I, Pop P, Eles P, Peng Z (2009) Analysis and optimization of fault-tolerant embedded systems with hardened processors. In: Proceedings of the design, automation test European conference exhibition, pp 682–687
- Jahn J, Faruque M, Henkel J (2011) Carat: context-aware runtime adaptive task migration for multi core architectures. In: Proceedings of the design, automation test European conference exhibition, pp 1–6
- Jhumka A, Klaus S, Huss S (2005) A dependability-driven system-level design approach for embedded systems. In: Proceedings of the design, automation test on European conference exhibition, pp 372–377
- Jin Y (2005) A comprehensive survey of fitness approximation in evolutionary computation. Soft Comput 9(1):3–12
- Jin Y, Olhofer M, Sendhoff B (2002) A framework for evolutionary optimization with approximate fitness functions. IEEE Trans Evol Comput 6(5):481–494
- Kandasamy N, Hayes J, Murray B (2003) Transparent recovery from intermittent faults in time-triggered distributed systems. IEEE Trans Comput 52(2):113–125
- Kang S-H, Yang H, Kim S, Bacivarov I, Ha S, Thiele L (2014a) Reliability-aware mapping optimization of multi-core systems with mixed-criticality. In: Proceedings of the design, automation test European conference exhibition, pp 1–4
- Kang S-H, Yang H, Kim S, Bacivarov I, Ha S, Thiele L (2014b) Static mapping of mixed-critical applications for fault-tolerant MPSOCS. In: Proceedings of the 51st annual design automation conference, pp 1–6
- Kang S-H, Park H-W, Kim S, Oh H, Ha S (2015) Optimal checkpoint selection with dual-modular redundancy hardening. IEEE Trans Comput 64(7):2036–2048
- Khosravi F, Reimann F, Glaß M, Teich J (2014) Multi-objective local-search optimization using reliability importance measuring. In: Proceedings of the 51st annual design, automation conference, pp 1–6
- Koren I, Krishna CM (2007) Fault-tolerant systems. Morgan Kaufmann, San Francisco
- Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Q 2:83–97
- Li L, Degalahal V, Vijaykrishnan N, Kandemir M, Irwin M (2004) Soft error and energy consumption interactions: a data cache perspective. In: Proceedings-international symposium on low power electronics and design, pp 132–137
- Lifa A, Eles P, Peng Z, Izosimov V (2010) Hardware/software optimization of error detection implementation for real-time embedded systems. In: Proceedings of the international conference on hardware/software codesign and system synthesis, pp 41–50
- Lyons R, Vanderkulk W (1962) The use of triple-modular redundancy to improve computer reliability. IBM J Res Dev 6(2):200–209
- Nikolov D, Larsson E (2016) Optimizing the level of confidence for multiple jobs. IEEE Trans Comput 65(4):1239–1252
- Pop P, Izosimov V, Eles P, Peng Z (2009) Design optimization of time- and cost-constrained fault-tolerant embedded systems with checkpointing and replication. IEEE Trans Very Large Scale Integr Syst 17(3):389–402
- Pradhan DK (1996) Fault-tolerant computer system design. Prentice-Hall Inc., Upper Saddle River
- Pradhan D, Vaidya N (1994) Roll-forward checkpointing scheme: a novel fault-tolerant architecture. IEEE Trans Comput 43(10):1163–1174
- Raghunathan B, Turakhia Y, Garg S, Marculescu D (2013) Cherry-picking: exploiting process variations in dark-silicon homogeneous chip multi-processors. In: Proceedings of the design, automation test on European conference exhibition, pp 39–44
- Reimann F, Glaß M, Lukasiewycz M, Keinert J, Haubelt C, Teich J (2008) Symbolic voter placement for dependability-aware system synthesis. In: Proceedings of the international conference on hardware/software codesign and system synthesis, pp 237–242
- Rubio-Largo Á, Vega-Rodríguez MA, González-Álvarez DL (2016) A hybrid multiobjective memetic metaheuristic for multiple sequence alignment. IEEE Trans Evol Comput 20(4):499–514
- Salehi M, Ejlali A, Al-Hashimi BM (2016a) Two-phase low-energy N-modular redundancy for hard real-time multi-core systems. IEEE Trans Parallel Distrib Syst 27(5):1497–1510
- Salehi M, Tavana MK, Rehman S, Shafique M (2016b) Two-state checkpointing for energy-efficient fault tolerance in hard real-time system. IEEE Trans Very Large Scale Integr Syst 24(7):2426–2437
- Shen X, Yao X (2015) Mathematical modelling and multi-objective evolutionary algorithms applied to dynamic flexible job shop scheduling problems. Inf Sci 298:198–224
- Stralen P, Pimentel A (2012) A SAFE approach towards early design space exploration of fault-tolerant multimedia MPSoCs. In: Proceedings of the international conference on hardware/software codesign and system synthesis, pp 393–402
- Tersi L, Fantozzi S, Stagni R (2015) Characterization of the performance of memetic algorithms for the automation of bone tracking with fluoroscopy. IEEE Trans Evol Comput 19(1):19–30
- Vadlamani R, Zhao J, Burleson W, Tessier R (2010) Multicore soft error rate stabilization using adaptive dual modular redundancy. In: Proceedings of the design, automation test in European conference exhibition, 2010, pp 27–32
- Wang P, Emmerich M, Li R, Tang K, Baeck T, Yao X (2015) Convex hull-based multi-objective genetic programming for maximizing receiver operating characteristic performance. IEEE Trans Evol Comput 19(2):188–200
- Wang Z, Tang K, Yao X (2010) A memetic algorithm for multi-level redundancy allocation. IEEE Trans Reliab 5(4):754–765
- Yuan Y, Xu H (2015) Multiobjective flexible job shop scheduling using memetic algorithms. IEEE Trans Autom Sci Eng 12(1):336–353
- Yuan B, Li B, Weise T, Yao X (2014) A new memetic algorithm with fitness approximation for the defect-tolerant logic mapping in crossbar-based nanoarchitectures. IEEE Trans Evol Comput 18(6):846–859
- Yuan B, Li B, Chen H, Yao X (2016) Defect- and variation-tolerant logic mapping in nanocrossbar using bipartite matching and memetic algorithm. IEEE Trans Very Large Scale Integr Syst 24(9):2813–2826
- Zhong F, Yuan B, Li B (2016) A hybrid evolutionary algorithm for multiobjective variation tolerant logic mapping on nanoscale crossbar architectures. Appl Soft Comput 38:955–966
- Ziv A, Bruck J (1997) Performance optimization of checkpointing schemes with task duplication. IEEE Trans Comput 46(12):1381–1386

## Copyright information

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.