# A CASE STUDY ON CLUSTER BASED POWER-AWARE MAPPING STRATEGY FOR 2D NoC

# Salini S<sup>1</sup>, Aravindhan A<sup>2</sup> and Lakshminarayanan G<sup>3</sup>

<sup>1,2</sup>Department of Electronics and Communication Engineering, Saintgits College of Engineering, India <sup>3</sup>Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli, India

#### Abstract

Network on Chip (NoC) is a growing and prominent paradigm which improves the power and performance of the System on Chip (SoC). Application mapping is one of the major challenges in NoC which maps the various Intellectual Property (IP) cores to standard network topology. Among the various application mapping methods, Integer Linear programming (ILP) is one of the static mapping methods, which finds optimum communication cost. However, it consumes longer computation time. To overcome this limitation, cluster based mapping using KL algorithm has been introduced and it acts poorly at partitioning cut degree. Based on these studies, we propose Fidducia-Mattheyses (FM) algorithm for multi clustering to optimize power consumption and communication cost for different benchmarks of NoC. The effectiveness of the proposed method is verified through VOPD, MPEG 4 and PIP benchmarks. Experimental results show a 4.4% and 34% improvement on communication cost and power consumption respectively for FM algorithm with MPEG 4. However, for VOPD the communication cost and total power consumption is improved with 27% and 35% respectively. On the other hand, PIP benchmark offers identical results in total power consumed and communication cost minimization with existing methods.

Keywords:

Application Mapping, Integer Linear Programming (ILP), Cluster Based Mapping, Fidducia-Mattheyses (FM) Algorithm

# **1. INTRODUCTION**

The recent promotions in the semiconductor technology field results in the integration of hundreds or even more IP cores on a single chip. These increments in IP cores made SoC architecture more complex with interconnection problems and less scalability [1]. In order to address this interconnect problem, NoC has emerged as a promising technology with point to point links which improves the scalability and overall performance of the architecture like throughput and power consumption [2].

The NoC architecture design faces an outstanding research issue on the choice of communication infrastructure, application characterization and solution evaluation [3]-[4]. Application mapping is the process of topological placement of assigned cores in a Core Graph (CG) on to different tiles in a Topology Graph (TG) [5]. In general, application mapping on to standard topology based NoC architecture is an NP hard problem [4],[6]. In order to overcome this issue, several methods are proposed in [7]-[10] with various objective functions such as energy, power consumption, communication cost and bandwidth. The minimization of communication cost aims at minimizing the number of hops [11]. The optimization of power consumption in NoC helps the proper functionality of IP cores and routers for improving performance [12]. There are several other alternative methods which provide optimal mapping. ILP is the one among the static mapping methods which provide best mapping [13][11][6]. ILP is computed based on the simplex programming method explained in [14]. The ILP problems are formulated to minimize communication cost. However, it provides large computation time.

As a solution for this issue, cluster based mapping with KL algorithm was proposed in [11]. Nevertheless, the method has a high cut degree of partitioning. In order to overcome this issue, KL algorithm is replaced by tailor made algorithm [12][6] that minimizes cut degree and communication cost, unless, this method performs poorly on power consumption than the power consumption in KL algorithm [15]. In the same scenario, this method provides better results for Directed Cyclic Graphs (DCGs). Hence, for Directed Acyclic Graphs (DAGs), an alternative algorithm called DFS [16] is required to offer better communication cost. KL algorithm is also unsuitable for unequal partitions. As a result an alternative algorithm called FM [17]-[18], is proposed to cluster based mapping which is more suitable for multi-clustering. This method provides better improvement in power consumption, cut degree and communication cost. In this paper, we focus on the improvement in communication cost and power consumption.

The rest of the paper is organized as follows. In section 2 relations between graph theory and NoC is explained. In section 3, related work is discussed. Section 4 explains about the proposed methods. Experimental Results are discussed in section 5 and conclusion is provided in section 6.

# 2. NoC AND GRAPH THEORY

**Definition 1:** A CG is a graph G(V, E) where each vertex  $v_i \in V$  represents a task in the core and each edge  $e_{ij} \in E$  represents a dependency between two tasks (two cores)  $v_i$  and  $v_j$ . The data transfer between  $v_i$  and  $v_j$  is the weight  $w_{ij}$  for all  $e_{ij}$  and is given in bits per sec.

The Fig.1(a) shows an example CG representing the bench mark of PIP [2]. In general, CG is considered as input for mapping and consists of various applications to be mapped. Tiles connected together with links to form TG are required to map these applications to perform suitable mapping method to achieve improvement in communication cost and power consumption.

**Definition 2:** A TG is a graph W(H,R) where each node  $h_i \in H$  denote a tile in the topology and each edge denotes a physical link  $r_{ij}$  in between  $h_i$  and  $h_j$ .

The Fig.1(b) shows an example TG with 8 tiles connected in  $4\times 2$  mesh fashion. The application mapping between CG and TG is made based on the condition in Eq.(1)

$$\left|V\right| \le \left|H\right| \tag{1}$$

For finding a one to one mapping for  $F:V \rightarrow H$  from CG to TG with minimum communication cost.

$$Min: \{CommCost\} = \sum_{\forall t_{ij} \in T} number \ of \ hops \ \times \ w_{ij}$$
(2)

Communication cost is the amount of data transfer on the network and  $w_{ij}$  is the weight between two nodes  $v_i$  and  $v_j$ .

# **3. RELATED WORK**

In this section, first we summarize the related work on ILP for communication cost minimization. Then we explore the relative studies on cluster based mapping for KL algorithms for communication cost minimization.

| Table.1. Co | onstraints | and | variables | used in | formulations | [6, | 13] |
|-------------|------------|-----|-----------|---------|--------------|-----|-----|
|             |            |     |           |         |              |     |     |

| Constant<br>variables | Definitions                                                                                                                                            |  |  |  |
|-----------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|
| n                     | The number of tasks (i.e. nodes) in input CTG                                                                                                          |  |  |  |
| Wi,j                  | The communication weight between tasks <i>i</i> and <i>j</i> in CG                                                                                     |  |  |  |
| $X_{dim}$             | The size of the mesh architecture in <i>x</i> dimension                                                                                                |  |  |  |
| $Y_{dim}$             | The size of the mesh architecture in <i>y</i> dimension                                                                                                |  |  |  |
| $\alpha_{i,x,y}$      | Binary variable $\alpha_{i,x,y} = 1$ if task <i>i</i> is mapped to the tile in the coordinates ( <i>x</i> , <i>y</i> ). Otherwise $\alpha_{i,x,y} = 0$ |  |  |  |
| $X_{i,j,a}$           | Binary variable $X_{i,j,a}=1$ , if the distance in x dimension between tasks <i>i</i> and <i>j</i> is equal to <i>a</i> . Otherwise $X_{i,j,a}=0$      |  |  |  |
| $Y_{i,j,b}$           | Binary variable $Y_{i,j,b}=1$ , if the distance in x dimension between tasks <i>i</i> and <i>j</i> is equal to <i>a</i> . Otherwise $Y_{i,j,b}=0$      |  |  |  |
| X <sub>cost</sub>     | The total communication cost of the network in <i>x</i> dimension                                                                                      |  |  |  |
| Y <sub>cost</sub>     | The total communication cost of the network in <i>y</i> dimension                                                                                      |  |  |  |

#### 3.1 ILP FORMULATION

In this section we are explaining the ILP formulations presented in [6][11][13] to make this paper self contained. Generally for an ILP, problems are formulated and optimized with objective functions with the help of linear functions as constraints. However, the solution of ILP problems is always limited to be integers. ILP in application mapping area is considered to be an NP hard problem. This method assures to find an optimal solution for the problem of mapping [19]. In this ILP formulation, the chip area is considered as a 2D grid and assigns tasks to tiles with this grid [13]. The Table.1 gives the constant terms and variables used in our formulation.

The Eq.(3) indicates that each task *i* has a single tile with coordinates (x,y) to perform mapping. The Eq.(4) shows that there are some tiles remaining without mapping.

$$\sum_{x=0}^{X_{\text{dim}}} \sum_{y=0}^{Y_{\text{dim}}} \alpha_{i,x,y} = 1, \forall i$$
(3)

$$\sum_{i=1}^{n} \alpha_{i,x,y} \le 1, \,\forall x, y \tag{4}$$

. . . . .

The communication cost is calculated by the minimum number of hops (Manhattan distance [13]) between two tasks  $v_i$  and  $v_j$ . From Fig.1(a), if the tasks  $v_2$  and  $v_1$  are mapped with tiles (0,1) and (2,1) in Fig.1(b), then the values of the variables  $X_{i,j,a}$  and  $Y_{i,j,b}$  will become 1 with a = 2 and b = 0. The Eq.(5) and Eq.(6) are used to determine a and b for each pair of tasks.

-- -.

$$\begin{aligned} Ydist_{i,j,b} &\geq \alpha_{i,x_{i},y_{i}} + \alpha_{j,x_{j},y_{j}} - 1, \ \forall i, j \ e_{ij} \in E \\ 0 &\leq x_{i}, x_{j} \leq X_{dim}, \\ 0 &\leq y_{i}, y_{j} \leq Y_{dim} \\ such that b &= |y_{j} - y_{i}| \\ Xdist_{i,j,a} &\geq \alpha_{i,x_{i},y_{i}} + \alpha_{j,x_{j},y_{j}} - 1, \ \forall i, j \ e_{ij} \in E \\ 0 &\leq x_{i}, x_{j} \leq X_{dim}, \\ 0 &\leq y_{i}, y_{j} \leq Y_{dim} \\ such that a &= |x_{j} - x_{i}| \end{aligned}$$
(5)

The total communication cost in x and y dimension is calculated by Eq.(5) and Eq.(6).

$$X\cos t = \sum_{e_{ij}\in E} \sum_{a=1}^{X_{dim}-1} w_{ij} a X dist_{i,j,a}$$
(7)

$$Y\cos t = \sum_{e_{ij}\in E}\sum_{a=1}^{Y_{dim}} w_{ij}bYdist_{i,j,b}$$
(8)

Therefore, minimum communication cost is expressed as,

$$Min: \{Comm \ cost\} = X \ cost + Y \ cost$$
(9)

The major limitation of an ILP formulation is its computation time.

# 3.2 CLUSTER BASED MAPPING WITH KL ALGORITHM

In ILP, when the number of tasks increases the computation time also increases, this makes this method irrelevant to fix this timing problem. Therefore, cluster based mapping with KL algorithm [11] is taken into account for communication cost minimization.

This method has 3 steps (i) Mesh partitioning in which TG is cut into two parts (ii) Graph clustering in which CG is partitioned into two clusters (iii) Mapping and merging in which each cluster and each sub mesh are mapped together.

#### 3.2.1 Mesh Partitioning:

The  $X_{dim}$  and  $Y_{dim}$  values are given into the system such that it should satisfy  $|V| \leq |T| = X_{dim} \times Y_{dim}$ . At the initial stage, the mesh M is cut into two sub-meshes  $M_1$  and  $M_2$ . The partitioning supports the conditions given in Eq.(10)-(12).





Fig.1. (a). An example CG of PIP Benchmark [2], (b). An example TG with the size 4×2



Fig.2. Mesh partitioning example (a). Mesh partitioning at the initial stage; (b). Mesh partitioning at the final stage with dummy tiles added

The Fig.2(a) shows the mesh partitioning procedure of a  $4\times3$  mesh structure for MPEG 4 benchmark according to the Eq.(10)-Eq.(12) in the *x* direction.

$$X_{dim} = [X_{dim}/2] \tag{10}$$

$$X_{dim2} = X_{dim} - X_{dim1} \tag{11}$$

$$Y_{dim1} = Y_{dim2} = Y_{dim} \tag{12}$$

Subsequently, a  $1 \times Y_{dim}$  sized dummy mesh P1 is kept between the sub-meshes as in Fig.2(b).

#### 3.2.2 Graph Clustering:

Graph clustering is a general technique for partitioning the graph into groups by adopting a suitable algorithm. In this step, the given CG is divided into two clusters  $G_A$  and  $G_B$  with modified KL algorithm [15]. Then cut degree  $dG_A$ ,  $G_B$  [11] is calculated for the partitioned CG. The dummy nodes are inserted according to the cut degree to make two clusters connected.



Fig.3. (a). MPEG4 CG [6] (b). MPEG4 CG after partitioning with KL Algorithm [6]

#### 3.2.3 Mapping and Merging:

ILP formulation is applied to each partitioning after dummy node insertion. Modified ILP is applied to each part to perform mapping between each cluster to each mesh and dummy nodes to dummy tiles. After final mapping, the dummy node and dummy tiles are removed.

This method resolves the limitation of the ILP by taking less computation time. But the cut degree for KL algorithm is very high [12] ie., 6 for MPEG 4 shown in Fig.3(b). So we propose an alternative partitioning method to reduce cut degree.

#### 4. PROPOSED METHODOLOGY

The proposed methodology opts Tailor made algorithm, DFS algorithm and FM algorithm for partitioning the CG to reduce the cut degree.

**Definition 1:** DCG is a graph containing route from one node to another node that ends at one node itself.

**Definition 2**: DAG is a directed graph without a route from one node to another node.

#### 4.1 CLUSTER BASED MAPPING WITH TAILOR MADE ALGORITHM

In this method, CG is partitioned using tailor made algorithm [6],[12] into clusters. Initially, the given CG is converted in to traffic distribution matrix ( $\lambda$ ). Then based on the connected nodes, semi connected nodes and disconnected nodes, Adjacency (*A*) matrix[20],  $\check{A}$  matrix and Disconnectivity (*D*) matrix are generated.

- Connected nodes: Nodes with direct communication paths.
- *Semi connected nodes:* Nodes where the communication path exists through an intermediate node.

Later, partitioning is done based on the entries of D matrix. Finally, a redundancy cancelation and refinement steps are performed. The redundancy cancelation process is performed to remove the redundant nodes from the partitioned CG. The refinement process is performed if orphan nodes exist in the partitioned CG. The tailor made algorithm is applied to VOPD, MPEG 4 benchmarks.

Now the meshes are provided on two sides of a partitioned CG as per the requirement. The dummy tiles are introduced between sub meshes according to the number of dummy nodes. Then ILP is applied to find the optimum solution as in the previous case. This method offers good results for DCG because of the creation of orphan nodes when DAGs are partitioned. The MPEG 4 partitioned diagram is shown in [6].

## 4.2 CLUSTER BASED MAPPING BASED ON DFS ALGORITHM

To overcome the issue of tailor made algorithm, DFS is used to perform the initial mapping. DFS is a systematic way of visiting the nodes of the CGs [16]. It examines the incident nodes at several times till no other path exists. The algorithm then back traces to explore the remaining edges. Finally, the CG is partitioned equally. The remaining mapping procedure is similar to the previous case. The PIP partitioned diagram is shown in [6].

# 4.3 CLUSTER BASED MAPPING BASED ON FM ALGORITHM

The goal of multi-clustering is to assign all nodes to any partitions so as to minimize the cut degree by satisfying the balanced criteria. The FM algorithm is a partitioning heuristic which offers substantial improvements over the KL algorithm [17] with the following unique characteristics [17]-[18].

- Most applicable to partitions of unequal size.
- Cut degrees are extended for hyper graph so that all nets with two or more pins can be considered.
- Area of each individual cell is taken into account.
- Selection of cells to move is much faster.

The FM is used for two-level clustering in this paper. The Fig.4(a) shows the first level partitioning of PIP benchmark. Multi-level partitioning splits the prior partitioning into several levels so that heavily communicated nodes are clustered into the same portion [20]. The Fig.4(b) shows the Multi-level partitioning of PIP benchmark. The mapping procedure is similar to the previous case. The Fig.4(c) is showing the multi clustering and Fig.4(d) is showing the multi clustering after long range insertion [12] for PIP benchmark. The various terminologies are,

• The cell gain  $\Delta G(b)$  for cell b is defined as in Eq.(13),

$$DG(b) = ES(b) - TE(b)$$
(13)

where, ES(b) is the cut nets that connect only to b and TE(b) is the number of uncut nets connected to b

• The maximum positive gain  $g_m$  is calculated by the maximum sum of cell gains  $\Delta G$  for m moves in a pass is shown in Eq.(14)

$$g_m = \sum_{i=1}^m \Delta G_i \tag{14}$$

• The ratio factor *R* which provides the balance between the two partitions with respect to cell area is defined as in Eq.(15),

$$R = \frac{area(P)}{area(P) + area(Q)}$$
(15)

where, area (P) and area (Q) are the total respective areas of partitions P and Q

• A partitioning of V into two partitions P and Q is said to be balanced if

$$[R \cdot area(V) - area_{max}(V)] \le area(P) \le [R \cdot area(V) + area_{max}(V)]$$
(16)

where,  $area_{max}(V)$  is the maximum cell area

• A base cell is a cell b that has maximum cell gain G(b) among all free cells.

The Fig.5(a) and Fig.6(a) shows the first level partitioning of MPEG 4 and VOPD benchmark. The Fig.5(b) and Fig.6(b) is showing the multi clustering after long range insertion for MPEG 4 and VOPD [21] benchmark. The FM algorithm is explained in the Algorithm 1.

## 4.4 CALCULATION OF TOTAL POWER CONSUMPTION

For analyzing the power, a complete system level and circuit level analysis of global interconnection links and NoC routers is to be performed [12]. For each network topology, a unique Connectivity matrix (C) is generated which represents the minimum number of links a packet goes through during its transition from the source node to the destination node[12].

At the system level, the total power consumed  $(P_{sys})$  in the global links of a network topology can be estimated from Eq. (17). For Torus topology, there are set of physical links which connect the last router to the first one. The set of elements corresponding to those links in the *C* matrix is multiplied with an adjustment factor to correct the length of those links [12]

$$P_{sys} = \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_{ij} \cdot C_{ij} \cdot u_p \tag{17}$$

where *i* and *j* are the source and destination node indexes respectively,  $u_p$  represents a unit power and  $\lambda_{ij}$  is defined as the average number of packet associated with each edge.















Fig.5. (a). MPEG4 CG first level partitioning with FM algorithm [17], (b). MPEG4 CG second level partitioning with FM algorithm

At the circuit level, the power consumed in a global interconnect link ( $P_{link}$ ) with a certain number of wires ( $N_{wires}$ ) equals the summation of the dynamic and static power ( $P_{static}$ ). Dynamic power consumption is caused by switching power ( $P_{switching}$ ) and internal power ( $P_{short}$ ). Static power consumption is mainly caused by the leakage power. The global interconnection links power consumption can then be represented in Eq.(18)

$$P_{link} = P_{switching} + P_{short} + P_{static}$$
(18)

where,

$$P_{\text{switching}} = \frac{1}{2} N_{\text{wires}} V_{dd}^{2} (C_{L} \alpha_{L} + C_{C} \alpha_{C}) f \qquad (19)$$

$$P_{short} = N_{wires} \tau \, \alpha_L \, V_{dd} \, I_{short} f \tag{20}$$

$$P_{static} = N_{wires} V_{dd} (I_{bias,wire} + I_{leak})$$
(21)

# Algorithm 1: FM Algorithm for partitioning the CG

Input: graph G(V,E), ratio factor R

Output: partitioned graph G(V,E)

- 1. (LB, UB)=balance\_criterion (G, R)
- 2. (P,Q)=partition(G)
- 3.  $g_m = \infty$
- 4. while( $g_m > 0$ )
- 5. *k*=1
- 6.  $ORDER = \Phi$
- 7. for each (cell  $b \in V$ )
- 8.  $\Delta G[k][b] = ES(b) TE(b)$
- 9. STATUS[b]=free
- 10. while  $(!is_fixed(V))$
- 11.  $cell = \max\_gain(\Delta G[k], LB, UB)$
- 12. add(ORDER,(cell, ( $\Delta G[k]$ ))
- 13. critic\_nets=CRITIC\_NETS(cell)
- 14. if( $cell \in P$ )
- 15. try\_move(*cell*,*P*,*Q*)
- 16. else
- 17. try\_move(*cell*,*Q*,*P*)
- 18. STATUS[cell]=fixed

19. for each(NET NET  $\in$  critic\_nets) 20. for each(cell  $b \in$  NET,  $b \neq$  cell) 21. if(STATUS[b]==free) update\_gain( $\Delta G[k][b]$ ) 22. k=k+123.  $(g_m,m)=$ best\_moves(ORDER)

24. if( $g_m > 0$ )

25. confirm\_moves(ORDER,m)

where,  $V_{dd}$  denotes the supply voltage.  $C_L$  and  $C_C$  represent self and coupling capacitance of a wire and neighboring wires, respectively. The  $\alpha_L$  is the switching activity on a wire and  $\alpha_C$  is the switching activity from the adjacent wires. *f* denotes the clock frequency,  $\tau$  denotes a short circuit period during which  $I_{short}$  flows between source and ground.  $I_{bias,wire}$  represents the current flowing from the wire to its substrate, and  $I_{leak}$  is the leakage current flowing from source to ground regardless of the gate's switching activity. We used the Predictive Technology Model (PTM) [22] to obtain parameters for interconnection links and devices using the BSIM3 models. The total power consumption of the global interconnection links ( $P_{gl}$ ) is given by Eq.(22).

$$P_{gl} = \frac{P_{sys} \cdot P_{link}}{u_p} \tag{22}$$

To analyze the power consumption of the NoC router, a set of synthesized routers is modeled in VHDL. Then they are implemented on silicon using 0.18µm technologies. The power consumption of the whole network ( $P_t$ ) including global links and routers could be given in Eq.(23).

$$Pt = \sum_{i=1}^{n_r} P_{ri} + P_{gl}$$
(23)

where,  $n_r$  is the number of routers and  $P_{ri}$  is the router power.

#### 5. EXPERIMENTAL RESULTS

We considered three standard NoC benchmarks to conduct the experiments and comparing the performance. The experiments are conducted using MATLAB R 2014b<sup>TM</sup>. The communication cost [6], Power, cluster cut size and cut degree [6] are analyzed for MPEG 4, VOPD and PIP benchmarks. The results are provided in tabulation in various subsections.

#### 5.1 CUT DEGREE

The various partitioning algorithms are adopted for cluster based mapping method and cut degree is evaluated for first level partitioning. The results are given in Table.2.

Table.2. Experimental results- Cut degree

| Benchmark | KL<br>Algorithm | Tailor made<br>Algorithm | DFS | FM<br>Algorithm |
|-----------|-----------------|--------------------------|-----|-----------------|
| MPEG4     | 6               | 2                        | -   | 2               |
| VOPD      | 2               | 3                        | -   | 2               |
| PIP       | 2               | -                        | 2   | 1               |

By analyzing the Table.2, it is clear that FM algorithm minimizes the cut degree for all benchmarks.

#### 5.2 CLUSTER CUT SIZE

We estimate the cluster cut size after partitioning the CG with various partitioning algorithms. The results are given in Table.3. From Table.3, it is clear that the FM algorithm partitioning method provides maximum cluster cut size than the other methods for the MPEG 4, VOPD and PIP benchmark.

Table.3. Experimental results - Cluster Cut size

| Benchmark | KL<br>Algorithm | Tailor made<br>Algorithm | DFS | FM<br>Algorithm |
|-----------|-----------------|--------------------------|-----|-----------------|
| MPEG4     | 2               | 2                        | -   | 4               |
| VOPD      | 2               | 2                        | -   | 4               |
| PIP       | 2               | -                        | 2   | 3               |

## 5.3 COMMUNICATION COST

The results of communication cost for cluster based mapping with KL, tailor made and FM algorithm are shown in Table.4.

Table.4. Experimental results-Communication cost (bits/sec)

| Benchmark | ILP  | KL<br>Algorithm | Tailor made<br>Algorithm | DFS | FM<br>Algorithm |
|-----------|------|-----------------|--------------------------|-----|-----------------|
| MPEG4     | 3567 | 3567            | 3428                     | -   | 3587            |
| VOPD      | 4119 | 4205            | 3050                     | -   | 2478            |
| PIP       | 640  | 640             | -                        | 448 | 640             |



Fig.6. (a). VOPD CG first level partitioning with FM algorithm [17]; (b). VOPD CG second level partitioning with FM algorithm

From Table.4 (Fig.7), FM algorithm offers 4.4% and 27% improvement in communication cost than Tailor made algorithm for MPEG 4 and VOPD respectively.







On the other hand, PIP (Fig.8) offers 3% improvement in communication cost for DFS than all the other methods.

Fig.8. Communication Cost for PIP benchmark

#### 5.4 TOTAL POWER CONSUMPTION

The total power consumption is evaluated as the sum of global link power and router power as shown in Eq.(23). The global link power is estimated as the number of power units consumed to transmit data packets given in a TDG between the routers. The cluster based mapping with KL algorithm explained in Section 2 is also adopted for power calculation along with the proposed methods. The results given in Table.5 (Fig.9) are compared with FM, Tailor made algorithm, DFS and KL based methods. The power values are expressed in Watts (W).

Table.5. Experimental results- Total power consumed for mesh topology

| Bench<br>mark | KL<br>Algorithm | Tailor<br>made<br>Algorithm | DFS   | FM<br>Algorithm |
|---------------|-----------------|-----------------------------|-------|-----------------|
| MPEG4         | 0.978           | 1.385                       | -     | 0.914           |
| VOPD          | 1.242           | 1.83                        | -     | 1.175           |
| PIP           | 0.516           | -                           | 0.516 | 0.516           |

From Table.5 (Fig.9), the MPEG 4 and VOPD benchmarks achieved 34% and 35% power consumption for FM algorithm compared with tailor made algorithm respectively. Nevertheless, the PIP benchmark has no improvement in power consumption for mesh topology.



Fig.9. Total Power Consumption for MPEG 4 and VOPD benchmark

# 6. CONCLUSIONS AND FUTURE WORK

In this proposed method, various partitioning algorithms are adopted along with cluster based mapping to analyze the performance and to reduce cut degree, power consumption and communication cost. Tailor made algorithm, DFS and FM algorithm are the different algorithms chosen. Based on the experimental results it is clear that FM algorithm offers minimum cut degree for all the benchmarks. On the other hand, the cluster cut size is increasing for all the three benchmarks. However, there is improvement in power consumption and communication cost for MPEG4 and VOPD benchmarks with FM algorithm where as PIP benchmark shows improvement in communication cost for DFS algorithm.

In future, it is planned to extend this work for multi objective function minimization by adopting metaheuristic search methods along with cluster based mapping.

#### ACKNOWLEDGEMENT

The authors would like to thank Facility for Advanced Computing Testing and Simulation of Electronic Circuit (FACTS ElCi) for providing the lab facility for our research work.

# REFERENCES

- [1] Wen-Chung Tsai, Ying Cherng Lan, Yu Hen Hu and Sao Jie Chen, "Networks on Chips: Structure and Design Methodologies", *Journal of Electrical and Computer Engineering*, Vol. 2012, No. 2, pp. 1-15, 2012
- [2] Tobias Bjerregaard and Shankar Mahadevan, "A Survey of Research and Practices of Network-on-Chip", ACM Computing Surveys, Vol. 38, No. 1, pp. 1-51, 2006.
- [3] Radu Marculescu et.al., "Outstanding Research Problems in NoC Design: System, Micro Architecture, and Circuit Perspectives", *IEEE Transactions on Computer-Aided*

Design of Integrated Circuits and Systems, Vol. 2009, pp. 3-21, 2009.

- [4] Pradip Kumar Sahu and Santanu Chattopadhyay, "A Survey on Application Mapping Strategies for Network-on-Chip Design", *Systems Architecture*, Vol. 59, No. 1, pp. 60-76, 2013.
- [5] Coskun Celdk and Cuneyt F. Bazlamacci, "Effect of Application Mapping on Network on Chip Performance", *Proceedings of IEEE International Conference on Parallel Distributed and Network based processing*, pp. 465-472, 2012
- [6] A. Aravindhan, S. Salini and G. Lakshminarayanan, "Cluster Based Application Mapping Strategy for 2D NoC", Proceedings of 1<sup>st</sup> Global Colloquium on Recent Advancements and Effectual Researches in Engineering, Science and Technology, pp. 505-512, 2016.
- [7] Suleyman Tosun, "New Heuristic Algorithms for Energy Aware Application Mapping and Routing on Mesh-based NoCs", *Systems Architecture*, Vol. 57, No. 1, pp. 69-78, 2011.
- [8] Coskun Celik and Cuneyt F. Bazlamacci, "Evaluation of Energy and Buffer Aware Application Mapping for Networks-on-Chip", *Microprocessors and Microsystems*, Vol. 38, No. 4, pp. 325-336, 2014.
- [9] Srinivasan Murali and Giovanni De Micheli, "Bandwidth-Constrained Mapping of Cores onto NoC Architectures", *Proceedings of the Conference on Design, automation and test in Europe*, Vol. 2, pp. 20896, 2004.
- [10] Coskun Celik and Cuneyt F. Bazlamacci, "Energy and Buffer Aware Application Mapping for Networks-on-Chip with Self Similar Traffic", *Systems Architecture*, Vol. 59, No. 10, pp. 1364-1374, 2013.
- [11] Suleyman Tosun, "Cluster-based Application Mapping Method for Network-on-Chip", *Advances in Engineering Software*, Vol. 42, No. 10, pp. 868-874, 2011.
- [12] Haytham Elmiligi et.al., "Power Optimization for Application-Specific Networks-on-Chips: A Topology-

based Approach", *Microprocessors and Microsystems*, Vol. 33, No. 5-6, pp. 343-355, 2009.

- [13] Suleyman Tosun, Ozcan Ozturk and Meltem Ozen,. "An ILP Formulation for Application Mapping onto Network-on-Chip", Proceedings of 3<sup>rd</sup> International Conference on Application of Information and Communication Technologies, pp. 1-5, 2009.
- [14] D.T. Nguyen, Y. Bai, J. Qin, B. Han and Y. Hu, "Computational aspects of Linear Programming Simplex Method", Advances in Engineering Software, Vol. 31, No. 8-9, pp. 539-545, 2000.
- [15] B. W. Kernighan and S. Lin, "An Efficient Heuristic Procedure for Partitioning Graphs", *The Bell System Technology Journal*, Vol. 49, No. 2, pp. 291-307, 1972.
- [16] Charalampos Papamanthou, "Depth First Search and Directed Acyclic Graphs", Department of Computer Science, University of Crete, pp. 1-27, 2004.
- [17] Andrew B. Kahng, Jens Lienig, Igor L. Markov and Jin Hu, "VLSI Physical Design: From Graph Partitioning to Timing Closure", Springer Science and Business Media, 2011.
- [18] Sadiq M. Sait and Habib Youssef, "VLSI Physical Design Automation: Theory and Practice", World Scientific Publishing, 1999.
- [19] Elyas Khajekarimi and Mahmoud Reza Hashemi, "Energy-Aware ILP Formulation for Application Mapping on NoC based MPSoCs", *Proceedings of 21st Iranian Conference on Electrical Engineering*, pp. 1-5, 2013.
- [20] Tero Harju, "Lecture Notes on Graph Theory, Department of Mathematics", Available at: http://cs.bme.hu/fcs/graphtheory.pdf.
- [21] Asrani Hj Lit, Siti Kudnie Sahari, Rohana Sapawi and Ahmad Fariz Hasan, "Power Optimization for Mesh Network-on-Chip Architecture: Multi-level Network Partitioning Approach", *Proceedings of 6<sup>th</sup> Engineering Conference, Energy and Environment*, pp. 1-5, 2013.
- [22] Predictive Technology Model Available at: http://ptm.asu.edu/