1 Introduction and Motivation

On-chip communication plays a crucial role in the design of multicore architectures. The Network-on-Chip (NoC) has been proposed to overcome the power consumption and network latency limitations of bus-based on-chip multicore architectures [1, 2]. Several factors affect NoC performance, such as the network topology, the routing algorithm, and the application mapping. The NoC topology is therefore an important design factor in an on-chip multicore architecture. The triplet-based multicore architecture (TriBA) proposed by our team uses an on-chip network that is a kind of multicore WK-recursive network [3, 4], which has several advantages such as scalability, regularity, locality and hierarchy.

Definition 1: Given a WK-recursive NoC topology with N\(^L\) (L \(\ge \) 0) cores (here N = 3), each core's ID is encoded as the sequence a\(_{L}\)a\(_{L-1}\)a\(_{L-2}\) \(\cdots \)a\(_{1}\)a\(_{0}\), where a\(_{i}\) \(\in \) {1, 2, \(\cdots \), N} (0 \(\le \) i \(\le \) L − 1). The sequence contains the level number and the core number after partitioning at level i, and the value of a\(_{i}\) gives the position within that level. Fig. 1 shows the TriBA NoC topology for L = 1, 2.

Fig. 1. TriBA multicore network topology with L = 1, 2

The TriBA network topology has a smaller node degree, bisection width, and network diameter, and fewer total links, than a 2DMesh topology with the same number of cores. This indicates that any TriBA core can send a data packet to other cores in less time. Meanwhile, the smaller number of links gives TriBA the potential for low power consumption. Studies [5, 6] report that the on-chip network accounts for 30% and 40% of the total on-chip power in the Terascale and MIT RAW multicore processors, respectively. Another crucial challenge in NoC design is therefore how to assign the tasks of an application to IP cores so as to reduce power consumption. This is the application mapping problem, and the choice of mapping algorithm is a crucial design decision for improving the performance of the overall multicore architecture at an early design phase. In this paper, we propose a heuristic mapping algorithm (the KLSAT mapping algorithm) that is based on the Kernighan–Lin (KL) algorithm, the simulated annealing (SA) algorithm and the WK-recursive multicore architecture TriBA to reduce the overall power consumption and network latency. The KL algorithm reduces the actual network communication cost by placing frequently communicating cores close together. The SA algorithm is an optimization and search method that originates from the annealing process in engineering. TriBA [7] (triplet-based architecture) is a novel WK-recursive on-chip multicore architecture characterized by scalability and locality.

2 Related Work

Several previous works have used specially designed application mapping algorithms, for example the Kernighan-Lin partitioning algorithm and the simulated annealing algorithm, to improve the performance of different NoC architectures or to reduce power consumption and network latency.

Sahu et al. proposed several mapping algorithms that extend the basic Kernighan-Lin bi-partitioning algorithm to enhance the static and dynamic performance of three different NoC architectures [8]. The authors of [9] explored opportunities for optimizing application mapping based on Kernighan-Lin algorithms for express channel-based on-chip networks. Manna et al. presented a KL bi-partitioning based approach for mapping the core graph of an application onto a 2DMesh-based NoC architecture [10]. In [11], the authors proposed an application mapping algorithm for the mesh-of-tree network topology. The authors of [12] presented a core mapping procedure based on the Kernighan-Lin graph bi-partitioning algorithm to select Through-Silicon-Via positions.

However, the KL mapping algorithm has its limitations, and the mappings it generates may not be globally optimal. In contrast to the KL algorithm, the SA algorithm has been observed to produce better application mappings. SA is a heuristic algorithm that has been used in a number of previous works for solving application mapping problems [13,14,15,16,17,18,19,20,21]. Compared with the KL mapping algorithm, the significant strength of SA is its ability to find the global optimum solution.

Hu and Marculescu first used the SA algorithm in the application mapping problem to evaluate the Branch and Bound application mapping algorithm on a 2DMesh NoC [13]. In [14], the authors proposed algorithms using simulated annealing and tabu search with a communication-weighted model to obtain low energy. The authors of [15] proposed an application mapping technique based on particle swarm optimization combined with simulated annealing to compare the performance of Zmesh with that of other NoC topologies. In [16], the authors proposed two heuristic mapping algorithms based on the simulated annealing method for solving the capacitated version of the location-routing problem. The authors of [17] used an SA algorithm with two functions to map applications onto a multiprocessor system-on-chip (MPSoC). Bo et al. [18] proposed an SA algorithm that uses the Nelder-Mead simplex method to select the set of parameters applied. Tosun et al. [19] presented a mapping algorithm based on simulated annealing for energy- and communication-aware mapping problems of mesh-based NoC architectures. In [20], the authors proposed a heuristic algorithm, CHMAP, to solve the application mapping problem on the mesh topology and reduce energy consumption. In [21], an optimized mapping algorithm based on simulated annealing, which allocates tasks with large communication volume to adjacent places on the mesh, was proposed to reduce the energy consumption of applications running on multicore architectures.

For the reasons above, the novelty of our proposed KLSAT mapping algorithm is that it combines the advantages of the KL algorithm and the SA algorithm for mapping applications onto the TriBA multicore architecture. Firstly, we use a Kernighan-Lin tri-partitioning algorithm whose idea comes from [22]. The modified Kernighan-Lin tri-partitioning algorithm, which fits the triplet-based characteristic of TriBA, ensures that the communication volume among cores in the same partition is maximized and the communication volume between cores in different partitions is minimized. Secondly, we employ an SA algorithm to find the final optimal mapping. To the best of our knowledge, the KLSAT mapping algorithm is the first work that applies the modified KL algorithm and the SA algorithm to TriBA; it satisfies the performance requirements of application mapping and minimizes the average power consumption and network latency. Our experimental results show that, compared to the random mapping algorithm, the KLSAT mapping algorithm reduces the average power consumption and network latency by 6.4% and 12.2% when mapping 27 cores, and by 29.5% and 26.7% when mapping 81 cores, respectively.

3 Problem Formulations

In this section, we focus on minimizing the power consumption associated with the application mapping.

3.1 Power Consumption Model

Ye et al. [23] proposed a power consumption model for evaluating the power consumption of switch fabrics in network routers. For an on-chip multicore architecture, however, the links between nodes should also be included in the power consumption model. Hu and Marculescu [24] therefore proposed a modified power consumption model for the on-chip multicore architecture. By evaluating the differences in power consumption among the various components of the on-chip multicore architecture, Hu and Marculescu found that the power consumed by buffering and internal wires is negligible compared with that of the switches and links. Thus, the power consumption model can be reduced to:

$$\begin{aligned} E_{bit}=E_{Sbit}+E_{Lbit} \end{aligned}$$
(1)

where E\(_{Sbit}\) and E\(_{Lbit}\) represent the energy consumed by the switch and the link, respectively. The energy consumed in sending one bit from node i to node j can then be expressed as follows:

$$\begin{aligned} E_{bit}^{i,j}=n_{hops}\times E_{Sbit}+(n_{hops}-1)\times E_{Lbit} \end{aligned}$$
(2)

where n\(_{hops}\) is the number of routers the bit passes on its way along a path from node i to node j.

The total power consumption of the NoC is then the sum of the weighted values of all edges, as follows:

$$\begin{aligned} E_{total}=\sum _{i,j}^{E}\sum _{bit}E_{bit}^{i,j} \end{aligned}$$
(3)
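As a rough illustration of Eqs. (2) and (3), the following sketch sums the per-bit energy over all communicating pairs. The function names, the energy constants and the hop counts are hypothetical placeholders rather than values from this paper, and the inner sum over bits in Eq. (3) is assumed to reduce to a multiplication by the number of bits transferred.

```python
def energy_per_bit(n_hops, e_sbit, e_lbit):
    """Eq. (2): energy to move one bit through n_hops routers."""
    if n_hops < 1:
        raise ValueError("a path traverses at least one router")
    return n_hops * e_sbit + (n_hops - 1) * e_lbit


def total_energy(traffic, hops, e_sbit, e_lbit):
    """Eq. (3): sum over all communicating pairs (i, j).

    traffic[(i, j)] is the number of bits sent from i to j and
    hops[(i, j)] is the router count on the path (both assumed inputs).
    """
    return sum(
        bits * energy_per_bit(hops[pair], e_sbit, e_lbit)
        for pair, bits in traffic.items()
    )


# Illustrative numbers only (arbitrary energy units).
traffic = {(0, 1): 100, (0, 2): 50}
hops = {(0, 1): 2, (0, 2): 3}
print(total_energy(traffic, hops, e_sbit=2.0, e_lbit=1.0))
```

Because the switch and link energies are constants, the only term a mapping algorithm can influence in this model is the hop count of each communicating pair.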

3.2 Definition of Application Mapping

The goal of application mapping algorithms is to assign each task to a specific core in the NoC so as to meet certain requirements, such as minimizing the network latency and power consumption.

Definition 2: The task core graph is an edge-weighted graph, C(V, E). A vertex v\(_i\) \(\in \) V represents a task and the weighted edge e\(_{i,j}\) \(\in \) E represents the communication between the tasks v\(_i\) and v\(_j\). Comm\(_{i,j}\) denotes the weight of edge e\(_{i,j}\), which indicates the bandwidth constraint of the communication from vertex v\(_i\) to vertex v\(_j\).

Definition 3: The NoC topology graph is a multicore interconnect architecture graph T(U, F). A vertex u\(_i\) \(\in \) U represents a node in the multicore NoC topology and the directed edge f\(_{i,j}\) \(\in \) F represents a physical link for direct communication between the vertices u\(_i\) and u\(_j\). Bw\(_{i,j}\) denotes the weight of the edge f\(_{i,j}\), which gives the available communication bandwidth across the edge f\(_{i,j}\).

The application mapping algorithm can be formulated as the following one-to-one mapping function:

Mapping algorithm: given a task core graph C(V, E) and the NoC topology graph T(U, F), find the function:

map: \(V\rightarrow U\), such that map(v\(_i\)) = u\(_j\), \(\forall \) v\(_i\) \(\in \) V, \(\exists \) u\(_j\) \(\in \) U, subject to:

\(\forall \) v\(_i\) \(\in \) V, map(v\(_i\)) \(\in \) U

\(\forall \) v\(_i\) \(\ne \) v\(_j\), map(v\(_i\)) \(\ne \) map(v\(_j\))

\(\vert \) V \(\vert \) \(\le \) \(\vert \) U \(\vert \)

with the objective of minimizing E\(_{total}\).
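The constraints of the mapping function can be checked mechanically. The sketch below is only an illustration: representing a mapping as a Python dictionary from tasks to nodes, and the name `is_valid_mapping`, are assumptions made here and do not come from the paper.

```python
def is_valid_mapping(mapping, tasks, nodes):
    """Check the one-to-one constraints of the mapping function.

    mapping: dict task -> node (hypothetical representation).
    """
    tasks, nodes = set(tasks), set(nodes)
    if len(tasks) > len(nodes):               # |V| <= |U|
        return False
    if set(mapping) != tasks:                 # every task is mapped
        return False
    images = list(mapping.values())
    if any(u not in nodes for u in images):   # map(v) lies in U
        return False
    return len(set(images)) == len(images)    # distinct tasks, distinct nodes


print(is_valid_mapping({"a": 0, "b": 1}, ["a", "b"], [0, 1, 2]))
print(is_valid_mapping({"a": 0, "b": 0}, ["a", "b"], [0, 1, 2]))
```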

4 The Proposed KLSAT Mapping Algorithm

In this section, we present the proposed KLSAT mapping algorithm, which combines the Kernighan-Lin partitioning algorithm and the simulated annealing algorithm to minimize the overall communication cost among all cores. The goal of the KL partitioning algorithm is to partition a task graph into subsets recursively while minimizing the communication cost between the subsets. We use the KL partitioning algorithm to obtain a first-stage solution, which serves as the initial solution for the second-stage SA algorithm. The simulated annealing algorithm is an effective global optimization algorithm that simulates the physical annealing process of solids and solves large-scale combinatorial optimization problems. With the Metropolis acceptance criterion introduced into the optimization process, simulated annealing reaches an approximate global optimal solution. We therefore apply the simulated annealing algorithm in the second stage to obtain the optimized mapping solutions.

The KL partitioning algorithm is applied to recursively partition the core graph. Initially, all cores are in one partition group at level-0. At level-1, there are three partition subsets, numbered 1, 2 and 3, each containing one third of the nodes of the core graph. At level-2, nine partitions are generated (three each from partition-1, partition-2 and partition-3 of level-1), numbered 11, 12, 13, 21, 22, 23, 31, 32 and 33. This continues until 3 cores are left in each partition for TriBA. Because the result of KL partitioning depends on the initial partition, we run the algorithm several times with different randomly generated initial partitions and keep the best result, which is used for subsequent mapping and iterative improvement. Figure 2 shows an example with N = 27 and how the IP-sets are merged. By merging three IP-sets, it finds the best contact between boundaries.
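The recursive numbering scheme can be sketched as follows. This is not the modified KL tri-partitioning algorithm of the paper. As a simplified stand-in, it starts from several random balanced splits and keeps applying pairwise swaps that lower the inter-partition communication. All function names are hypothetical, and the number of cores is assumed to be a power of 3.

```python
import random


def cut_cost(parts, comm):
    """Communication weight on edges whose endpoints lie in different parts."""
    owner = {v: k for k, part in enumerate(parts) for v in part}
    return sum(w for (a, b), w in comm.items()
               if a in owner and b in owner and owner[a] != owner[b])


def tri_partition(nodes, comm, tries=5, seed=0):
    """Split nodes into three equal parts with low inter-part traffic.

    Simplified stand-in for KL tri-partitioning: from several random
    balanced splits, apply cost-reducing pairwise swaps until none is left.
    """
    rng = random.Random(seed)
    best, best_cost = None, None
    for _ in range(tries):
        order = list(nodes)
        rng.shuffle(order)
        k = len(order) // 3
        parts = [order[:k], order[k:2 * k], order[2 * k:]]
        cost = cut_cost(parts, comm)
        improved = True
        while improved:
            improved = False
            for p, q in ((0, 1), (0, 2), (1, 2)):
                for i in range(len(parts[p])):
                    for j in range(len(parts[q])):
                        parts[p][i], parts[q][j] = parts[q][j], parts[p][i]
                        new_cost = cut_cost(parts, comm)
                        if new_cost < cost:      # keep the improving swap
                            cost, improved = new_cost, True
                        else:                    # undo the swap
                            parts[p][i], parts[q][j] = parts[q][j], parts[p][i]
        if best_cost is None or cost < best_cost:
            best, best_cost = [list(p) for p in parts], cost
    return best


def recursive_partition(nodes, comm, prefix=""):
    """Map partition labels ('1', '2', '3', then '11', '12', ...) to 3-core subsets."""
    if len(nodes) <= 3:
        return {prefix: list(nodes)}
    result = {}
    for k, part in enumerate(tri_partition(nodes, comm), start=1):
        result.update(recursive_partition(part, comm, prefix + str(k)))
    return result
```

For 27 cores, `recursive_partition` returns nine subsets labelled 11 to 33, which matches the level-2 numbering described above. The swap search here is a plain hill climb, so it only approximates what a gain-based KL pass would find.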

Fig. 2. An example of trinomial merging iteration (N = 27)

figure a

In the next stage, each of these 3-core subsets is assigned to an appropriate basic unit of the multicore architecture TriBA, where L is the level of TriBA and the number of cores is 3\(^L\). Because these 3-core subsets are initially attached to nearby basic units arbitrarily, there is still considerable room for the proposed KLSAT mapping algorithm to optimize the solution.

KLSAT Mapping Algorithm:

Once the system temperature has been initialized, the KLSAT mapping algorithm executes two nested loops. After the external loop, which uses the KL partitioning algorithm, reaches the global minimum, the internal loop refines the result and finds the locally optimal solution. The number of external loop iterations is limited to U\(^2\) as suggested in [14]. The internal loop randomly selects two nodes in an L-level subset and swaps them to produce a new solution. The algorithm then checks whether the new solution is better than the old one. If it is, the new solution replaces the current solution. Otherwise, the algorithm generates a random variable \(\gamma \) (0 \(\le \) \(\gamma \) \(\le \) 1) and compares it with the acceptance probability exp(\(-\)\(\varDelta \)P/Temperature) given by the Metropolis criterion. If the acceptance probability is higher than \(\gamma \), the new solution is accepted. The acceptance probability is high at high temperatures and decreases as the temperature of the system falls. The internal loop runs until L\(^2\) consecutive rejections occur, provided that the Temperature is above 0.01. When each internal loop completes, the temperature of the system is decreased and the algorithm starts a new loop, taking the accepted solution as the initial solution for the next iteration.
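The nested-loop structure can be sketched as below. This is a generic simulated annealing loop under stated assumptions, not the authors' implementation. The geometric cooling factor `alpha`, the cap `max_moves` on moves per temperature, the cost model (traffic weight times hop distance) and all function names are assumptions introduced here. Swaps are drawn from all tasks rather than from within an L-level subset.

```python
import math
import random


def comm_cost(mapping, comm, dist):
    """Traffic weight times hop distance, summed over all task pairs."""
    return sum(w * dist(mapping[a], mapping[b]) for (a, b), w in comm.items())


def anneal(initial, comm, dist, t0=5000.0, t_min=0.01, alpha=0.9,
           max_rejects=9, max_moves=200, seed=0):
    """Refine an initial task-to-node mapping by swap moves."""
    rng = random.Random(seed)
    current = dict(initial)
    cost = comm_cost(current, comm, dist)
    best, best_cost = dict(current), cost
    tasks = list(current)
    t = t0
    while t > t_min:                          # external loop: cooling
        rejects = 0
        for _ in range(max_moves):            # internal loop: swap moves
            a, b = rng.sample(tasks, 2)
            current[a], current[b] = current[b], current[a]
            delta = comm_cost(current, comm, dist) - cost
            # Metropolis criterion: always accept improvements, accept
            # worse moves with probability exp(-delta / t).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                cost += delta
                rejects = 0
                if cost < best_cost:
                    best, best_cost = dict(current), cost
            else:
                current[a], current[b] = current[b], current[a]  # undo
                rejects += 1
                if rejects >= max_rejects:    # stop after consecutive rejects
                    break
        t *= alpha                            # assumed geometric cooling
    return best, best_cost
```

The `max_moves` cap is needed in this sketch because almost every move is accepted at high temperature, so a stop condition based only on consecutive rejections might not trigger. In the two-stage scheme described above, `initial` would be the mapping produced by the partitioning stage.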

figure b

We produce a mapping by using the MAPPING (G) algorithm. At each level of tri-partitioning, we assign the partition numbers 1, 2 and 3 to the subsets in turn. These numbers are used in the address assignment process of the MAPPING (G) algorithm. In the mapping algorithm, the 3-core subsets are assigned according to the output of the KLSAT mapping algorithm. After the mapping algorithm completes, each core has an assigned (level number, subset number) pair that identifies its mapping position on the on-chip multicore TriBA.

Finally, when the KLSAT mapping algorithm completes, we obtain the global optimal solution, and all task cores are mapped onto the corresponding positions of the TriBA multicore architecture.

Fig. 3. An example for KLSAT mapping algorithm ((a) an example task graph, (b) communication cost of random mapping, (c) communication cost of KLSAT mapping)

In Fig. 3, we present an example of our KLSAT mapping algorithm. Figure 3(a) shows an example task graph with communication weights between nodes. Figure 3(b) and (c) show the communication cost with random mapping and with KLSAT mapping. With random mapping, the communication cost is Commcost = 1815. With KLSAT mapping, the communication cost becomes Commcost = 1240, which is accepted as the new solution. The KLSAT mapping algorithm continues iterating until the predefined termination condition is reached.
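As a quick check on the size of the improvement in this example, the relative reduction in communication cost implied by the two values is:

$$\begin{aligned} \frac{1815-1240}{1815}=\frac{575}{1815}\approx 31.7\% \end{aligned}$$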

5 Experimentation and Results

5.1 Simulator and Benchmarks

In this paper, we use Gem5, which is widely used as a configurable architecture simulator for research on multicore on-chip architectures, to evaluate the KLSAT mapping algorithm. In Gem5, the Orion [25] model is used to evaluate the power consumption of the various NoC topologies, and the PARSEC [26] benchmarks are used in the following experiments. We use the WK-recursive TriBA topology as the NoC topology; it is a regular topology with better characteristics than the 2DMesh topology, such as a smaller network diameter, fewer total links and a lower node degree. We compare the KLSAT mapping algorithm with several other algorithms on the TriBA NoC architecture: (1) BL_TriBA (the baseline), which maps the tasks onto the TriBA NoC topology randomly; (2) KL_TriBA, the KL mapping algorithm on the TriBA structure; (3) SA_TriBA, the conventional simulated annealing algorithm on the TriBA NoC structure; (4) KLSAT, our proposed mapping algorithm on the TriBA NoC structure.

5.2 Results and Analysis

Based on previous research experience, we set the initial parameters of the algorithm as follows: M = 4000, temperature\(_0\) = 5000, termination temperature \(\varepsilon \) = 0.01. We implement the algorithm in the Matlab R2013b environment. The host has an Intel Core i7 3.40 GHz CPU with 8 processor cores and 8 GB of memory, and runs Windows 7. The target machine sizes are 27 and 81 cores.

The network latency of the TriBA multicore architecture, normalized to the baseline case, is shown in Figs. 4 and 5. Because the communication characteristics of the benchmarks vary, the network latency results vary significantly. For TriBA with 27 cores, the KL_TriBA, SA_TriBA and KLSAT mapping algorithms decrease the network latency relative to the baseline by an average of 2.9%, 6.1% and 12.2%, respectively. For TriBA with 81 cores, the differences among the four mapping algorithms are more significant because the communication overhead among cores increases dramatically. The KLSAT mapping algorithm decreases the network latency by an average of 26.7% compared to the baseline, as shown in Fig. 5.

Fig. 4. Network latency of the four algorithms with 27 cores

Fig. 5. Network latency of the four algorithms with 81 cores

Figure 6 shows the power consumption of TriBA with 27 cores, normalized to the BL_TriBA random mapping algorithm. As shown in Fig. 6, the BL_TriBA random mapping algorithm has the highest power consumption, while the KLSAT mapping algorithm has the lowest, on average 6.4% lower than random mapping. Figure 7 shows the power consumption results for TriBA with 81 cores. Here, the KL_TriBA, SA_TriBA and KLSAT mapping algorithms decrease the power consumption relative to the baseline by an average of 12.0%, 23.1% and 29.5%, respectively. The power savings of the KLSAT mapping algorithm with 81 cores (Fig. 7) are more significant than those with 27 cores (Fig. 6). Overall, the KLSAT mapping algorithm reduces power consumption by an average of 29.5% compared to the baseline and achieves better performance than KL_TriBA and SA_TriBA.

Fig. 6. Power consumption of the four algorithms with 27 cores

Fig. 7. Power consumption of the four algorithms with 81 cores

The reason is that the KLSAT mapping algorithm is less likely to get trapped in a local optimum than the random mapping algorithm, because it uses the KL partition as its initial solution. The KL partition algorithm exploits the triplet-based characteristic of TriBA so that more communication takes place among groups of three cores, which are locally fully interconnected. Consequently, the solution generated by the KLSAT mapping algorithm has a lower network communication cost and lower power consumption than those of the other mapping algorithms.

In general, the KLSAT mapping algorithm sharply decreases the number of iterations, the power consumption and network latency, compared with the random mapping algorithm. Our proposed KLSAT algorithm achieves better performance than both the KL algorithm and the simulated annealing algorithm.

6 Conclusion

One of the important research fields in NoC is the design of application mapping algorithms. Several mapping algorithms have been presented to reduce network latency, lower power consumption, satisfy bandwidth constraints or minimize on-chip area. This paper focused on a new mapping algorithm based on the KL partition algorithm and the simulated annealing algorithm, with the aim of achieving better performance in application mapping problems. We designed and implemented an application mapping algorithm on the multicore architecture TriBA for performance simulation, based on the KL partition algorithm and simulated annealing, and verified the KLSAT mapping algorithm by experiments. Our experimental results show that the algorithm significantly reduces the number of iterations, the network latency and the power consumption. They also show that the algorithm can solve large-scale problems.