Thermal-aware and uniform priority with scaled routing for high-performance network-on-chip


by

Stanley Okeke

Bachelor of Engineering, University of Nigeria, Nsukka, 2011

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

MASTER OF APPLIED SCIENCE

in the Department of Electrical and Computer Engineering

© Stanley Okeke, 2017

University of Victoria

Supervisory Committee

Dr. Fayez Gebali, Supervisor

(Department of Electrical and Computer Engineering)

Dr. Mihai Sima, Member


ABSTRACT

3D-NoC architectures combine 3D integration (die stacking in 3D-IC technology) with the scalability of the NoC paradigm. They were originally proposed to overcome the problem of increasing the number of cores in a 2D plane, which scales poorly because of long-distance interconnects. The architecture aims to optimize performance and power consumption, achieve low latency, and increase network bandwidth. However, as more dies are stacked vertically and IC operating frequencies increase, thermal issues arise, including high power density, which raises the average temperature. In addition, longer heat dissipation paths result in different heat dissipation in each layer of the NoC, which worsens the situation. An increase in overall power consumption increases the average temperature and reduces performance and reliability. In this thesis, an adaptive thermal-aware management scheme is proposed for 3D-NoCs, concentrating on the hotspot regions in the network. The proposed protocol uses the thermal state of intermediate nodes and the properties of flits, each following a random uniform distribution, for packet routing. The proposed algorithm increases network availability and tends to distribute the temperature of the system evenly within the network by ensuring that packets are generally not forwarded to hotspot node(s); only flits with certain properties in the distribution are forwarded to them. Before or during transmission, these two distributions must be calculated alongside the current node temperature in order to determine which state of the distribution the node and the flit belong to. Simulations show that this gives better throughput and reliability by reducing the number of hotspot nodes in the NoC. The proposed algorithm also reduces power consumption, which is a function of temperature. Simulations show that the proposed algorithm reduces the total power/energy consumed by more than 59% and improves throughput by 69% compared to traditional XYZ routing.


List of Tables

Table 2.1 Switching Techniques Comparison

Table 5.1 Simulation Parameters

Table 5.2 Improvement in throughput and total power for uniform random traffic

Table 5.3 Improvement in throughput and total power for transpose traffic

Table 5.4 Improvement in throughput and total power for shuffle traffic


List of Figures

Figure 2.1 Benefits of 3D NoC comparing to 2D NoC.

Figure 2.2 3D Mesh Topology with 7 Port Router.

Figure 2.3 A 4-layer 3D NoC.

Figure 2.4 3-D Router-Bus Hybrid Architecture With Bus Port

Figure 2.5 Store-and-Forward Switching

Figure 2.6 Virtual Cut-Through Switching

Figure 2.7 Wormhole Switching.

Figure 2.8 A Typical Router Architecture with Virtual Channels.

Figure 3.1 NoC Structure Model Hierarchy.

Figure 3.2 Full chip and tile micrograph for the Teraflops Intel 80 Processor.

Figure 3.3 Power breakdown; (a) at 4 GHz, (b) Estimated tile profile

Figure 4.1 Outline of our proposed algorithm.

Figure 4.2 Thermal-state uniform distribution (a)

Figure 4.3 Thermal-state uniform distribution (b)

Figure 4.4 Flits uniform distribution (a)

Figure 4.5 Thermal-state uniform distribution (b)

Figure 5.1 Network throughput comparison for various routing algorithm under different workload

Figure 5.2 Total energy consumed under different workload.

Figure 5.3 Temperature Distribution showing Active and Hotspot nodes for TAUPSR.

Figure 5.4 Temperature Distribution showing Active and Hotspot nodes for XYZ Routing.

Figure 5.5 Temperature Distribution showing Active and Hotspot nodes for North-Last Routing.

Figure 5.6 Temperature Distribution showing Active and Hotspot nodes for West-First Routing.


ACKNOWLEDGEMENTS

I would like to express my appreciation to my supervisor, Dr. Fayez Gebali, for his support and for sharing his professional experience during my research period. He has been a source of strength, encouragement and guidance, especially in the difficult moments of my research. I would also like to express my gratitude to my university for creating a wonderful platform for student learning at different levels, and especially for the opportunity to work with Dr. Peter Driessen on my first network-on-chip project. This project gave me a better and more in-depth understanding of Network-on-Chip in general before I embarked on this thesis subject and research.

I would also like to thank Mostafa Said (Ph.D.) for his thorough support, guidance and professional experience rendered during the course of this research; I worked alongside him to achieve this research goal. Finally, I would like to thank my family members for their constant support throughout my study.


DEDICATION

I would like to dedicate my work to my Professor Dr. Fayez Gebali, and my brother Mr. Paul Okeke for their sponsorship and support throughout the course of this research.


Chapter 1 Introduction

1.1 Introduction

For decades, the semiconductor industry has obsessed over smallness, and for good reason. The more transistors one can squeeze into a given chip, the greater the gains in speed and power efficiency, at a much lower cost. Moore's law, first stated in 1965 [1] [2], observes that the number of transistors per square inch doubles at a regular interval, commonly quoted as approximately every 18 months, resulting in exponential growth in chip complexity, a trend that is unlikely to continue indefinitely. In 1975, Moore revised his own estimate from a doubling every year to a doubling every two years. While much of the industry has fallen off that pace, it still regularly finds ways to shrink devices. According to the International Technology Roadmap for Semiconductors (ITRS), the number of processing elements is also expected to grow beyond 100 processors [3]. An increase in the number of processing elements can in turn lead to a dramatic increase in memory size. This growth relies on continuously shrinking transistor and wire dimensions so that more devices can be fabricated within the same silicon area. However, feature sizes are now down to tens of nanometers, where the cost of manufacturing is high.

This has prompted new thinking in the semiconductor industry on how to sustain increasing on-chip integration, which led to the concept of three-dimensional (3D) integration. 3D integration is a fast-emerging technological option for continuing the exponential trend described in [1]. This technology enables circuits to be built as 3D structures by stacking active silicon layers and using through-silicon vias (TSVs) for inter-tier connections, as opposed to the traditional 3D stacking method using wire bonding. It offers several advantages: increased device density, which allows complex designs to be implemented, significantly improved performance, shorter global interconnects, lower interconnect power consumption, higher packing density, and support for mixed-technology chips. The combination of NoC and 3D-IC gave rise to the three-dimensional network-on-chip (3D-NoC). With freedom in the third dimension, architectures that were impossible or prohibitive due to wiring constraints in planar ICs are now possible in 3D-NoC.

In recent times, 3D Networks-on-Chip have grown tremendously to become one of the most scalable, fast, reliable and efficient means of communication in large-scale, complex, high-performance computing applications and Systems-on-Chip (SoCs). Thermal issues due to temperature variation have become a critical and challenging drawback of 3D-NoC, leading to degraded performance and reliability. According to [4], the communication bandwidth in future network-on-chip architectures will probably be limited by prohibitive levels of power consumption. Moreover, a higher temperature increases leakage current and the delay of wire interconnects due to higher resistivity [5]. According to [6], there are two major kinds of thermal issues in Network-on-Chip systems: regional temperature differences and hotspots. Regional temperature differences are caused by an unbalanced thermal distribution in the network, which limits latency prediction and can thus cause system synchronization failures. The temperature is a function of the total energy consumed in processing requests. Regional temperature variation also gives 3D-NoC systems more overheated spots, thereby reducing system performance [4]. Through-Silicon Vias (TSVs) were initially proposed as a means of stacking multiple dies along the vertical axis and interconnecting them to achieve higher performance, with the limitation of overheated regions. It is therefore better for the routing algorithm implemented in a 3D-NoC system to consider the thermal balance of the entire network before forwarding packets. To maintain thermal balance within the network, only a limited number of packets, or rather only packets with special properties, should be forwarded through a very-high-temperature node. A hotspot node acts as a danger node, and no packet should be allowed to pass through it at any cost. An efficient routing algorithm for 3D-NoC should also be able to handle deadlock whenever it occurs. This can be done by providing a deadlock recovery mechanism [7] that serves as an escape route or redirects the packet.

1.2 Research goal and Objectives

Given the immense growth of the semiconductor industry in delivering integrated circuits of ever-increasing design complexity, and of more highly interconnected systems, NoCs promise major benefits but impose new constraints and limitations. These limitations concern throughput, the reliability of the system, and the energy consumption of intermediate nodes, which determines the temperature rise. This thesis presents a novel technique to improve the performance of 3D-NoC systems, increase network availability, and reduce the energy consumed by nodes, which simultaneously reduces the temperature of the system. According to the thermal model [4], the temperature is a function of the total energy consumed. Minimizing the energy consumed therefore leads to a decrease in the temperature of the node.

1.3 Contributions

The major contribution of this thesis is to improve the performance of the 3D-NoC system while also reducing the energy consumption of neighboring nodes, which determines their temperature. If the energy is reduced, the temperature is presumed to be reduced as well, since the temperature is directly proportional to the total power consumed by individual nodes in the network: T = αf(E), as discussed in Chapter 3. We achieve this by:

1. Proposing a new thermally-aware routing protocol that observes the network state before routing packets.

2. Simulating the network-on-chip using the proposed routing protocol and determining its performance.

3. Comparing our algorithm against some existing 3D routing algorithms to observe whether performance is improved.

This serves as a distinctive contribution to improving system performance in terms of network reliability (throughput and the number of packets/flits received) and total energy consumed.


1.4 Thesis organization

The thesis is divided into six chapters. Chapter 1 gives the introduction and describes the overall structure of the thesis, alongside the goal and objectives of the research. Chapter 2 gives an overview of three-dimensional network-on-chip, the router architecture, flow control, allocation mechanisms, routing, and selection strategy. Some existing 3D routing algorithms and thermal management schemes for 3D-NoC are also discussed in Chapter 2. In general, Chapter 2 is a survey of related work on 3D-NoC, thermal-aware routing, and other management schemes applied to mitigate performance impact. In Chapter 3, we discuss the thesis methodology, the thermal model, and the energy transformation model involved in our proposed routing algorithm. This chapter also describes how the temperature from a hotspot is distributed across the network, and the relationship between the temperature, conductivity and total energy of the system. In Chapter 4, the proposed model is discussed together with its pseudo-code; this chapter describes the thermal-state distribution and the flit distribution of our model. In Chapter 5, we implement our model using a cycle-accurate simulator written in SystemC (AccessNoxim). Chapter 6 concludes with our contributions and presents future work.


Chapter 2 Related Works

In this chapter, we present some of the work related to 3D-NoC. We start with the evolution of 3D-NoC from its 2D-NoC counterpart and then focus on the benefits of 3D-NoC designs compared to 2D. We also discuss the thermal management schemes associated with 3D-NoC.

2.1 Evolution of 3D-NoC

A 3D Network-on-Chip (3D NoC) consists of switches/routers, network adapters (NAs) or network interfaces (NIs), and links, and uses circuit or packet switching to transfer data inside a 3D IC. The links physically connect the nodes and implement the communication. The router implements the communication protocol, and the network interface establishes the logical connection between the IP cores and the network.

With advances in technology, it has become possible to integrate a very large number of Intellectual Property (IP) cores on a chip. However, exchanging data efficiently among a large number of nodes is difficult, which degrades the performance of chip multiprocessors (CMPs) and multiprocessor SoC (MPSoC) systems [8]. This is due to the complexity of wire routing in traditional point-to-point interconnection, which leads to a large layout area and very long transmission delays. By viewing the on-chip interconnection as a micro-network, Network-on-Chip (NoC) was proposed as a novel and practical solution for integrating a large number of IPs in a single silicon chip [9]. However, scaling on-chip networks over two dimensions to accommodate hundreds of cores is not efficient, as increasing the number of cores in a two-dimensional plane soon leads to a performance bottleneck due to long-distance interconnects [10]. This has led to the emergence of 3D integration technologies [11]. These technologies make it possible to stack several dies on a single chip, forming 3D integrated circuits (3D-ICs); examples include Through-Silicon Vias (TSVs) [12], silicon-silicon fine-pitch interconnect [13], wireless communication between 2D planes, and 3D wafer wire bonding [14]. 3D-ICs serve as a replacement for 2D-ICs, offering improved performance, energy efficiency, cost reduction, and product size reduction. Such technology is available from organizations such as IBM, IMEC, Tezzaron Semiconductor Corporation and MIT Lincoln Laboratory. According to [15], 3D ICs allow for performance enhancements even in the absence of scaling, as a result of the reduction in interconnect distance. Combining 3D-IC technology with NoC technology, which makes it possible to scale NoCs over three dimensions, resulted in the evolution of the three-dimensional Network-on-Chip (3D-NoC). According to [16] [17], 3D-NoC has the following advantages over traditional 2D-NoC:

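Figure 2.1: Benefits of 3D NoC compared to 2D NoC: (a) an 8 x 8 2D NoC and a 4 x 4 3D NoC; (b) longest physical distance, L√2 in 2D and 0.5L√2 in 3D; (c) a 2D router (N, E, S, W) and a 3D router (N, E, S, W, U, D).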

1. Better IP mapping density in the network: for example, stacking an 8 x 8 2D NoC into a 4 x 4 3D NoC decreases the form factor of the chip, so the IP mapping density is higher. See Fig. 2.1(a).

2. Shorter interconnection distance and lower network power consumption: Fig. 2.1(b) shows that in the 3D-NoC the longest physical distance is 0.5L√2, compared to L√2 in the 2D NoC. Compared with 2D NoCs, 3D-NoCs greatly reduce the network diameter and overall communication distance, thus improving communication performance and reducing power consumption.

3. Higher network bandwidth, scalability and routing flexibility: 3D-NoCs overcome the limited scalability of 2D NoCs over 2D planes by using the short and fast vertical interconnects of 3D-ICs. Adding two extra directions (Up and Down) to the traditional four planar directions (North, South, East, and West), as shown in Fig. 2.1(c), makes packet routing more flexible in terms of path diversity, thus improving network throughput.

Various studies have shown that 3D networks improve on 2D network scalability [18] and performance in terms of delay and throughput [19] [10].
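The distance argument in advantage 2 can be checked with a short calculation. The following Python sketch is illustrative only and is not taken from the thesis simulator: it assumes the same 64 nodes are arranged either as an 8 x 8 mesh or as a 4 x 4 mesh on four layers (as in Fig. 2.3), and the helper names `mesh_diameter` and `planar_diagonal` are hypothetical.

```python
import math


def mesh_diameter(dims):
    """Hop-count diameter of a mesh: (k - 1) hops along each dimension of size k."""
    return sum(k - 1 for k in dims)


def planar_diagonal(side):
    """Longest straight-line distance across a square die with the given edge."""
    return side * math.sqrt(2)


L = 1.0  # assumed: die edge of the 2D chip, normalised to 1

d2 = mesh_diameter((8, 8))     # 8 x 8 2D mesh, 64 nodes
d3 = mesh_diameter((4, 4, 4))  # 4 x 4 mesh on 4 layers, 64 nodes

print("hop diameter 2D:", d2)
print("hop diameter 3D:", d3)
print("diagonal 2D:", planar_diagonal(L))        # L * sqrt(2)
print("diagonal 3D:", planar_diagonal(0.5 * L))  # 0.5 * L * sqrt(2)
```

Under these assumptions the hop diameter falls from 14 to 9 and the planar diagonal is halved, which is consistent with the reduction in network diameter described above.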

2.2 The 3D-NoC Router Architecture

Extending the NoC router for a 2D mesh topology to the third dimension is done simply by adding two more channel ports, one for the layer above and one for the layer below, as shown in Fig. 2.2. The 2D router has 5 ports (NORTH, SOUTH, EAST, WEST, and LOCAL). When the 2D network is transformed into a 3D network, with four layers as in Fig. 2.3, each router has a total of 7 ports (NORTH, SOUTH, EAST, WEST, LOCAL, UP and DOWN). According to [20], there are two major classes of 3D-NoC architecture: symmetric and bus hybrid. However, the latter lacks concurrent communication in the vertical stack and suffers from possible contention and blocking in the vertical interconnects. The key performance metrics in 3D-NoCs include the zero-load latency and the power consumption of the network. To optimize these two metrics, various 3D-NoC topologies have been proposed.

Figure 2.2: 3D Mesh Topology with 7 Port Router.

Figure 2.3: A 4-layer 3D NoC.

The physical layer of a 3D-NoC architecture consists of longer horizontal interconnects that connect adjacent nodes in the same layer and shorter vertical interconnects that connect nodes on different layers. The short vertical links make traversing all layers of the 3D chip feasible in a single hop [21], whereas baseline 3D routers impose one hop from each layer to the next. A NoC-bus architecture was proposed in [22] which uses a bus link in the vertical dimension to take advantage of the fast vertical links in the network. The bus hybrid architecture requires just one additional port on the generic 5 x 5 crossbar instead of two, as in Fig. 2.4. Flits from different layers wishing to move up or down share the bus through a shared-medium protocol, which may lead to bus contention. To better optimize this inter-layer communication, an improved 3D NoC-Bus hybrid was proposed in [23], based on bypassing the router when a flit travels in the vertical dimension. Another approach to optimizing 3D NoC-Bus hybrid architectures, described in [24], is to form a dimensionally decomposed router by decoupling the inter-layer and intra-layer communication links. Lafi et al. [25] propose a router composed of two totally decoupled modules, one for inter-layer communication and another for intra-layer communication.

Figure 2.4: 3-D Router-Bus Hybrid Architecture With Bus Port

Also, in [26], a multi-layered 3D router was proposed: a router that spans all layers of a single 3D chip and is highly compatible with similarly multi-layered cores [27]. According to the authors, this design reduces total power consumption and increases system performance. In [28], the authors present another option, a true 3D crossbar that routes any permutation between a set of N x N I/O terminals on one layer of a 3D chip and a second set of N x N I/O terminals on another layer, using intermediate layers called Crossbar-Switch Layer Sets (CLSs). In [29], the authors propose a bufferless 3D router that uses a three-stage permutation network instead of an allocator and crossbar; single-cycle operation at 1.25 GHz was achieved in a 65 nm technology.

2.2.1 Switching Technique

3D-NoCs use different switching mechanisms, which define how and when an input channel of the switch is connected to the output channel selected by the routing algorithm. Data is transmitted through the network in the form of messages. A message is broken into multiple packets; each packet has header information that allows the receiver to reconstruct the original message. A packet may itself be broken into flow control units called flits, which do not carry additional headers. Two packets can follow different paths to a given destination, whereas the flits of a packet are always ordered and follow exactly the same path to the destination. According to [30], the two commonly used switching techniques are (i) packet switching and (ii) circuit switching.
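The message, packet and flit hierarchy described above can be sketched in a few lines of Python. This is a minimal illustration under assumed conventions rather than the format used by any particular NoC: the header fields (`msg_id`, `seq`, `total`), the flit kinds, and the function names are all hypothetical.

```python
def packetize(message: bytes, payload_size: int, msg_id: int = 0):
    """Split a message into packets; each packet header carries enough
    information (sequence number, packet count) to rebuild the message."""
    chunks = [message[i:i + payload_size]
              for i in range(0, len(message), payload_size)]
    return [{"msg_id": msg_id, "seq": n, "total": len(chunks), "payload": c}
            for n, c in enumerate(chunks)]


def flitize(packet, flit_size: int):
    """Split one packet into flits. Only the head flit carries the packet
    header; body flits carry raw payload and the tail flit closes the packet."""
    data = packet["payload"]
    pieces = [data[i:i + flit_size] for i in range(0, len(data), flit_size)]
    flits = [("head", (packet["msg_id"], packet["seq"], packet["total"]))]
    flits += [("body", piece) for piece in pieces]
    flits.append(("tail", b""))
    return flits


def reassemble(packets):
    """Packets may arrive out of order, so sort on the sequence number."""
    ordered = sorted(packets, key=lambda p: p["seq"])
    return b"".join(p["payload"] for p in ordered)


message = b"network-on-chip"
packets = packetize(message, payload_size=4)
flits = flitize(packets[0], flit_size=2)

print(len(packets), "packets")
print([kind for kind, _ in flits])
print(reassemble(list(reversed(packets))))
```

Reassembly sorts on the sequence number because packets may take different paths, while the flits of one packet need no such field since they stay in order behind their head flit.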


Packet Switching

Packet switching is a method of assembling the data to be transmitted over a network into packets of different sizes. Each packet is composed of a header and a payload, which holds the message to be sent [31]. According to [32], packet switching allows the packets of a message to be transmitted via different channel paths. This type of switching uses finer-granularity buffer and channel control at the flit level, by reserving the port that first received the header flit of a packet for the flits of that packet only.

Packet switching is categorized into:

1. Store-and-Forward (SAF)

2. Virtual Cut-Through (VCT)

3. Wormhole Switching (WS)

As shown in Fig. 2.5, the store-and-forward method of packet switching is based on receiving and storing the whole incoming packet before it is forwarded to the next router. Since the entire packet must be stored before forwarding, extra buffer space is required, and the latency of packets increases even when they are not blocked. In VCT switching, packets are forwarded as soon as the next router guarantees that the complete packet will be accepted. The router must still be able to store the entire packet in case the next router cannot guarantee to receive it. VCT has lower communication latency than SAF switching. As shown in Fig. 2.6, VCT pipelines a packet through a switch, allowing flits to cut through to the input of the next router before the packet has been completely received by the current router. In wormhole switching, each packet is subdivided into smaller, equally sized units called flits: a header flit followed by body flits. The header flit contains the routing information of the packet and is routed in the same way as packets in VCT. The remaining (body) flits follow the path established by the header flit. According to [33], this process increases the risk of deadlock, although it uses less buffer memory and achieves lower latency. The NoC architectures in [34], [35], [36] are based on wormhole switching, where the router is only required to store a few pipelined flits instead of the entire packet. As shown in Fig. 2.7, this approach is vulnerable to deadlock, because the flits of a blocked message occupy buffer slots in several routers. Wormhole switching is the same as cut-through, except that buffers in each router are allocated on a per-flit basis, not per-packet. In [37], the authors propose a technique that dynamically combines virtual cut-through and wormhole switching to achieve better throughput than either switching technique alone.
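The latency difference between store-and-forward and cut-through style switching can be illustrated with the usual first-order zero-load model. This is a textbook approximation rather than a result from this thesis: the per-hop routing delay `t_r`, the packet length and the channel bandwidth are assumed parameters, contention is ignored, and the function names are hypothetical.

```python
def saf_latency(hops, t_r, packet_bits, bandwidth):
    """Store-and-forward: the whole packet is received (serialised) at every
    hop before it moves on, so serialisation is paid once per hop."""
    return hops * (t_r + packet_bits / bandwidth)


def cut_through_latency(hops, t_r, packet_bits, bandwidth):
    """Virtual cut-through / wormhole at zero load: the header is pipelined
    through the routers and serialisation is paid only once."""
    return hops * t_r + packet_bits / bandwidth


# Assumed example values: 4 hops, 2 cycles of routing delay per hop,
# a 128-bit packet and a 32 bit/cycle channel.
saf = saf_latency(hops=4, t_r=2, packet_bits=128, bandwidth=32)
vct = cut_through_latency(hops=4, t_r=2, packet_bits=128, bandwidth=32)

print("store-and-forward:", saf, "cycles")
print("cut-through:      ", vct, "cycles")
```

With these example values the model gives 24 cycles for store-and-forward and 12 cycles for cut-through; the gap widens as the hop count or the packet length grows.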

Having listed the three categories of packet switching, we compare these switching techniques in terms of performance, buffering, hardware complexity and system flexibility in Table 2.1.

Figure 2.5: Store-and-Forward Switching (time-space diagram over channels 0 to 3).

Figure 2.6: Virtual Cut-Through Switching (time-space diagram over channels 0 to 3).

Circuit Switching

Circuit switching is an approach suited to bulk transfers. A request is first sent to reserve the channels; the request may be held at an intermediate router until the channel is available (hence circuit switching is not truly bufferless). Acknowledgements (ACKs) are then sent back, and subsequent packets/flits are routed with little effort.

Figure 2.7: Wormhole Switching. X is going from Node-1 to Node-4; Y is going from Node-0 to Node-5. Traffic analogy: Y is trying to make a left turn and X is trying to go straight, but there is no left-only lane with wormhole switching.

Table 2.1: Switching Techniques Comparison

Switching; Performance; Buffering; Complexity; Cost; Flexibility
Circuit switching; Reliable for bulk and long message transfer; I flit; Low; Low
Store-and-Forward; High for short and frequent messages; Packet; Low; High; Low
Virtual Cut-Through; High; Packet; High; High; Low
Wormhole Switching; High; A few

2.2.2 Flow Control Mechanism

Routing a message from a source to a destination node requires the allocation of several resources: the channel link, buffers, and control state. Once a packet starts transmitting over a channel, another packet cannot cut in; otherwise the buffer would mix up the flits of the two packets. Moreover, other packets cannot use the channel if the packet is blocked. To prevent a blocked packet from hindering the progress of other packets waiting in line, Virtual Channel (VC) flow control is implemented, following [38]. Virtual channels arbitrate for physical channel bandwidth on a flit-by-flit basis. A virtual channel between two resources X and Y is established by allocating time slots (by Time Division Multiplexing, TDM) in each switch on the path between X and Y. In [38], the author introduces an approach that assigns multiple virtual channels, each with its own associated buffer queue, to the same physical channel, as shown in Fig. 2.8.

Virtual Channel Flow Control

The virtual channel flow control mechanism can be summarized as follows:

1. Incoming flits are placed in buffers.

2. For a flit to advance to the next router, it requires three resources:

• A free virtual channel on its intended hop: a virtual channel becomes free when the tail flit has gone through.

• Free buffer entries for that virtual channel: this is determined with an on/off switch management scheme.

• A free cycle on the physical channel: this matters when packets compete to share a single physical channel.
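The three conditions above can be expressed as a small check. The sketch below is a simplified model and not the router logic used in this thesis: buffer availability is tracked with a slot counter instead of on/off signalling, every packet is assumed to have distinct head and tail flits, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flit:
    packet_id: int
    kind: str  # "head", "body" or "tail"


@dataclass
class VirtualChannel:
    owner: Optional[int] = None  # packet currently holding the VC, None if free
    free_slots: int = 2          # assumed buffer depth of the downstream VC


def can_advance(flit: Flit, vc: VirtualChannel, link_free: bool) -> bool:
    """A flit moves only if the VC, a buffer slot and a link cycle are available."""
    if flit.kind == "head":
        vc_ok = vc.owner is None               # head flit needs a free VC
    else:
        vc_ok = vc.owner == flit.packet_id     # other flits follow their head
    return vc_ok and vc.free_slots > 0 and link_free


def advance(flit: Flit, vc: VirtualChannel) -> None:
    """Move the flit into the downstream VC buffer."""
    if flit.kind == "head":
        vc.owner = flit.packet_id
    vc.free_slots -= 1
    if flit.kind == "tail":
        vc.owner = None                        # tail flit releases the VC


def drain(vc: VirtualChannel) -> None:
    """Downstream router forwarded one flit, freeing a buffer slot."""
    vc.free_slots += 1


vc = VirtualChannel()
advance(Flit(1, "head"), vc)
print(can_advance(Flit(2, "head"), vc, link_free=True))  # VC is held by packet 1
print(can_advance(Flit(1, "body"), vc, link_free=True))  # same packet may follow
```

In this model a second packet is blocked on that virtual channel until the tail flit of the first packet passes, while it remains free to use another virtual channel on the same physical link.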

2.2.3 Switching Allocation and Scheduling Policies

Router allocators are key pieces of router control logic. A router design usually involves two major factors: the buffering structure and the switch allocation scheme or policy. To allocate an output port, the scheduler receives requests from incoming flits or from flits pending in the input buffers and grants the output port according to the scheduling policy implemented. It must arbitrate effectively among flits requesting the same output port and among flits from different virtual channels requesting the same output channel. In typical 3D mesh NoCs, the routers usually employ virtual channels (VCs) as their input buffering structure, and use separate or integrated VC Allocation (VA) and Switch Allocation (SA) to schedule the flits of packets from an input port to an output port of the router, as in Fig. 2.8. The basic function of a switch allocator is to match the requests of flits to be transmitted with output ports. According to [30], allocation matching follows two basic rules:

1. Resources are only granted to requesters if a corresponding request exists.

2. At most one resource is assigned to each requester, with other resources going to other requesters as the case may be.

Figure 2.8: A Typical Router Architecture with Virtual Channels.

In general, according to [39], switch allocation involves the following basics: arbitration, allocation, and matching.

Common switch scheduling policies are:

• Fixed priority arbiter: This scheduling policy ensures that, at any given time, the router serves the highest-priority request of all those that are currently ready. This method is not often used because it can starve lower-priority requests. A fixed priority arbiter is useful only when the designer wants to give certain flit requests priority over others.

• Oblivious arbiter: This scheduling policy generates a priority that is independent of the last grant; instead, it depends on the priority of the last cycle. Hence, simple circuits such as shifters are used to generate the next priority, but oblivious arbiters give only weak fairness [40].

• Round-robin arbiter: This is the most widely used scheduling policy [41], [42] because it provides strong fairness by assigning the lowest priority to the last served requester. The generated priority vector is therefore a shifted version of the grant vector [43]. It is typically implemented using a ring counter (token) and priority-encoder-based arbiters. The arbiter controls the arbitration of the ports and resolves contention among flits. It keeps the status of all ports up to date and knows which ports are free and which are in use. Packets with the same priority that are destined for the same output port are scheduled with a round-robin arbiter.
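The difference between fixed-priority and round-robin arbitration can be seen in a short behavioural model. The sketch below is illustrative only and does not reproduce any particular hardware implementation; the class and function names are hypothetical, and the request pattern is an assumed example.

```python
from typing import List, Optional


def fixed_priority_grant(requests: List[bool]) -> Optional[int]:
    """Always grant the lowest-numbered active request (can starve the rest)."""
    for port, requesting in enumerate(requests):
        if requesting:
            return port
    return None


class RoundRobinArbiter:
    """Grant one request per cycle; the last winner gets lowest priority next."""

    def __init__(self, n_ports: int) -> None:
        self.n_ports = n_ports
        self.pointer = 0  # port that currently has the highest priority

    def grant(self, requests: List[bool]) -> Optional[int]:
        for offset in range(self.n_ports):
            port = (self.pointer + offset) % self.n_ports
            if requests[port]:
                # Rotate priority so the port just served is checked last.
                self.pointer = (port + 1) % self.n_ports
                return port
        return None  # no requests this cycle


requests = [True, False, True, True]  # ports 0, 2 and 3 keep requesting
arbiter = RoundRobinArbiter(4)
rr_grants = [arbiter.grant(requests) for _ in range(6)]
fp_grants = [fixed_priority_grant(requests) for _ in range(6)]

print("round-robin:   ", rr_grants)
print("fixed priority:", fp_grants)
```

With a persistent request pattern the round-robin arbiter cycles through ports 0, 2 and 3, whereas the fixed-priority arbiter grants port 0 every cycle and ports 2 and 3 are never served.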


2.2.4 Packet Starvation

Starvation is a problem encountered in concurrent computing where a process is perpetually denied the resources it needs to do its work because they are granted to other processes [44]. In a NoC, it is the situation where packets with lower priority cannot advance toward their destination node because packets with higher priority reserve the resources all the time. Starvation is usually caused by an overly simplistic scheduling algorithm, for example a (poorly designed) multitasking system that always switches between the first two tasks while a third never gets to run. It can be resolved by using a fair routing algorithm or by reserving some bandwidth for low-priority packets only.

2.3 3D Routing Algorithm

In recent years, many works on network-on-chip have been proposed to tackle different issues, either to improve performance or to reduce temperature. Depending on the purpose of the implementation, many 3D routing algorithms have been proposed, each with its own pros and cons. Among these protocols, some are thermal-aware for temperature control, while others are adaptive for either thermal distribution or traffic distribution. There are also custom routing schemes that aim to reduce the power consumption and thermal power, which is a very challenging design goal for 3D-NoC systems.

Among the earliest proposed 3D algorithms are congestion-oblivious routing algorithms such as XYZ [45], West-First [46], and North-Last [47]. Most routing algorithms follow the XYZ routing pattern, a vertically balanced routing algorithm that performs well because it is simple to implement, it is free of deadlock and livelock, and packet ordering is not required [48], [49].
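The XYZ pattern mentioned above can be sketched as dimension-order routing: a packet first corrects its X offset, then its Y offset, and finally its Z offset. The sketch below is illustrative only; the function names `xyz_next_hop` and `xyz_path` and the tuple representation of coordinates are assumptions made here, not taken from [45].

```python
def xyz_next_hop(cur, dst):
    """Return the next node on a dimension-order (X, then Y, then Z) route.

    `cur` and `dst` are (x, y, z) tuples. Returns None when cur == dst.
    """
    for axis in range(3):  # 0 = X, 1 = Y, 2 = Z
        if cur[axis] != dst[axis]:
            step = 1 if dst[axis] > cur[axis] else -1
            nxt = list(cur)
            nxt[axis] += step
            return tuple(nxt)
    return None


def xyz_path(src, dst):
    """Return the list of nodes visited from src to dst, inclusive."""
    path = [src]
    while path[-1] != dst:
        path.append(xyz_next_hop(path[-1], dst))
    return path
```

For example, `xyz_path((0, 0, 0), (2, 1, 1))` moves two hops along X, one along Y, and one along Z. Because the dimension order is fixed, the route depends only on the source and destination and not on congestion or temperature.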

Ramanujam et al. [50] presented an oblivious routing algorithm called randomized partially minimal (RPM) that aims to balance the traffic in the network, thereby improving the worst-case scenario. This protocol first sends packets to a random layer, then routes them along their X and Y dimensions using either XY or YX routing with equal probability. Finally, packets are sent to their final destination along the Z dimension. Ascia et al. [51] proposed another congestion-aware protocol called neighbor-on-path (NOP) that obtains more information about the network using adjacent nodes. Another congestion-aware protocol alongside NOP is DyXYZ [52], a fully adaptive routing protocol for 3D-NoCs that uses the congestion information at the input buffer of the neighbor router as a congestion metric for selecting the output channels for packet routing. Other congestion-aware routing protocols that have been used recently include DBAR [53] and CATRA [54]. Gratz et al. [55] proposed a regional congestion-aware protocol (RCA) that obtains local and non-local state information on the network but suffers from relative interference. The interference caused by RCA was later solved by a destination-based routing algorithm proposed in [56]. FT-OED-XY [57] is a fault-tolerant routing algorithm based on XY routing that acts as an enhancement mechanism for faulty links or nodes in the network. Suraj et al. [58] recently proposed an adaptive routing algorithm, a thermal balancing scheme that uses the thermal state and buffer state of the nodes for routing packets. In [59], the authors propose a traffic-distributed adaptive routing algorithm for 3D systems with limited bandwidth in vertical links. There are a few partially adaptive routing protocols for 3D-NoCs, such as MAR [60].

To address the thermal problem in 3D-NoC, Chao et al. [61] proposed a thermal-aware downward routing scheme. To reduce the overall thermal power, this protocol avoids routing through the upper layers, where the thermal power is higher than in the bottom layer next to the heat sink. In this approach, traffic is often sent to the bottom layer, which has the heat sink, because the upper layers in the network consume the highest thermal power. After the packet is sent to the bottom layer, it is routed along the X and Y dimensions before being sent to the destination node. F. Liu et al. [6] proposed a dynamic thermal-balance routing algorithm for network-on-chip that solves the problem of regional temperature differences and hotspots within the network. Previously, in [62], as the thermal problems of three-dimensional network-on-chip systems became more serious, the authors proposed Proactive Thermal-Budget-Based Beltway Routing (PTB3R), which balances the temperature distribution of the NoC system using two factors: a novel thermal-aware routing index and the mean time to throttle (MTTT). These factors capture the active time of a node before its temperature reaches the emergency level. Another thermal-aware routing scheme for 3D NoC is proposed in [63]. Here, the authors proposed an adaptive routing algorithm that is aware of traffic and throttling (TTAR) to address congestion caused by the throttling used for transient-temperature control. This algorithm balances the network traffic and detours around throttled tiles, thereby improving the throughput. In response to rising traffic imbalance in the network, which degrades system performance and causes a rapid rise in temperature, Chen et al. [64] proposed balancing the distribution of traffic and temperature in NSI-Mesh and regular mesh systems. This protocol also improves the network throughput with less area overhead by balancing the network traffic and temperature evenly.

More recently, as part of the research on improving the performance of 3D-NoC systems and reducing the overall power and energy consumption, [65] proposed a high-performance, virtual-channel-based, fully adaptive thermal-aware routing algorithm that uses a 12-bit register to store the state of routers one hop away instead of transmitting the topology information of the whole network. It also uses two virtual channels for each horizontal channel to achieve full adaptivity and high routability.

2.4 Thermal Management scheme for 3D-NoC

Because of the thermal issues associated with high-performance NoC systems, 3D-NoC routers run hotter and require a thermal management scheme to regulate the rise in temperature. This can be achieved starting with the design methodology proposed in [66]. The design methodology should suit the routing protocol so that it provides an efficient thermal management scheme for the referenced system. A study has shown that the easiest, most accurate, and most reliable way of obtaining the thermal distribution is to use thermal sensors [67]. However, this method suffers from hardware cost and from the large number of control links needed to transmit the thermal signals across the network. To reduce the hardware cost of thermal sensors, a thermal management scheme was proposed in [68]. Thermal management schemes can be categorized into design-time optimization (DTO) and dynamic thermal management (DTM) [69].

DTO is only considered if the network is stable, and its thermal balance is obtained during the offline design of the NoC system. This is usually achieved by taking the worst-case situation into consideration. DTO always tends to optimize the NoC thermal distribution at the cost of reduced data transmission performance. DTO thermal management schemes include thermal-aware task mapping, voltage and frequency scaling, etc.

Dynamic thermal management is required to keep the system temperature within a certain limit. In a dynamic thermal management scheme, the thermal distribution of the network is regulated dynamically based on the current thermal condition. This scheme is categorized as reactive or proactive DTM [70]. Reactive DTM (RDTM) operates only when the network reaches an emergency thermal level and avoids the thermal problem at the cost of decreased performance. The performance impact associated with RDTM is large, but a run-time thermal management scheme with minimal performance impact was proposed in [61]. Greg et al. [71] also proposed a traffic migration strategy in which the formation of hotspots is avoided by migrating hotspot traffic to other nodes. To ensure that chips operate within a safe temperature range while maintaining good performance, a proactive thermal management scheme based on a dynamic frequency scaling bus (DFSB) was proposed in [72] for developing thermal-aware 3-D NoC-bus architectures. Its innovations include a thermal-aware frequency scaling policy (TFSP) and frequency-aware adaptive routing (FAAR), for temporal and spatial management respectively. TFSP dynamically and proactively adjusts the frequency of the DFSB according to the predicted thermal variation, to throttle the data flow for heat dissipation.

Recently, [73] proposed a Kalman-based runtime thermal prediction scheme, consisting primarily of a thermal-aware routing algorithm and a proactive throttling scheme, to address the problem of forecasting temperature from noisy thermal sensors in the network-on-chip.


Chapter 3 Methodology

This work is based on a uniform and balanced adaptive routing algorithm that is thermal-aware. The proposed algorithm takes into consideration the thermal state of the neighbor routers and the output node temperatures, with the flits split into two regions/states (high-priority and low-priority flits). This flit property is one of our basic considerations: the head flit holds the packet information, and the body flits hold the actual payload to be transmitted. Our model defines the temperature of the nodes at any given instant and the property of the flit to be transmitted, in order to determine how each flit should be handled and managed. The model is also a reactive form of the dynamic thermal management (DTM) scheme described in [70], used as a measure for controlling the temperatures of 3D-NoC nodes to avoid transistor damage. The proposed model aims to outperform other routing algorithms in terms of throughput and energy consumption rate. Our approach is also proactive in that the temperature is dynamically balanced and uniformly distributed within the network. The methodologies involved in this thesis are explored below.

3.1 Traffic Modeling in 3D-NoC

A traffic model is a stochastic model of the traffic flows or data sources in a network. Various NoC architectures have been proposed, all based on states and state transitions that are affected by various actions in the NoC system. Different NoC architectures employ different traffic models for determining and quantifying the impact of critical parameters on the underlying network. Because the traffic in NoCs varies considerably during application execution, more advanced traffic models have to be incorporated in order to study the behavior of the targeted system in detail. However, the traffic patterns generated by the different modules in a NoC strongly depend on the application for which the NoC is designed. To model the system behavior, we discuss some established NoC traffic models, which are classified as realistic or synthetic.

Realistic traffic models are traces of application execution on NoCs, which represent a more specific class of applications, while synthetic traffic patterns are abstract models of the packets exchanged between the nodes of the NoC, generated from a mathematical model. Synthetic patterns do not represent real-life applications; therefore, they cannot be employed for accurate design-space exploration.

3.1.1 Realistic Traffic Model

Realistic traffic models are commonly used for evaluating traffic captured from real-life applications. The performance of a NoC depends on the generated traffic pattern [33]; therefore, the most reliable and accurate way to assess the characteristics of the NoC is to refer to the traffic profiles of all running applications. The realistic traffic model allows the NoC designer to analyze the power consumption and delay of the NoC in a real-life situation.

3.1.2 Synthetic Traffic Model

Uniform Traffic: This is one of the simplest traffic models and is considered a standard benchmark in network routing studies. In this model, each node sends messages to the other nodes with equal probability, and destination nodes are chosen at random using a uniform distribution. It is simple to implement, but its simplicity might lead to unsatisfactory evaluation results for a NoC architecture.

Transpose Traffic: In this type of traffic, the destination coordinates are the transpose of the source coordinates. Two transpose traffic patterns are considered. In the first, a node (i, j) only sends packets to node (n−j, n−i), where n is the network diameter (n × n mesh topology). Under this load, the network's diagonal bisection is a bottleneck, as all packets must cross it. In the second, a node (i, j) only sends packets to node (j, i).
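The two transpose patterns can be sketched as simple coordinate mappings. The sketch below assumes 0-based coordinates on an n × n mesh, under which the first pattern becomes (n−1−j, n−1−i) so that every destination stays inside the mesh; this indexing convention and the function names `transpose1` and `transpose2` are assumptions made here for illustration.

```python
def transpose1(src, n):
    """First transpose pattern: reflect across the anti-diagonal.

    `src` is an (i, j) tuple with 0 <= i, j < n (assumed 0-based indexing).
    """
    i, j = src
    return (n - 1 - j, n - 1 - i)


def transpose2(src):
    """Second transpose pattern: swap the two coordinates."""
    i, j = src
    return (j, i)
```

Under this convention, nodes on the anti-diagonal map to themselves in the first pattern, and nodes on the main diagonal map to themselves in the second.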


Bit Complement: This is a widely used traffic model in NoCs in which each node exchanges messages with a node on the opposite side of the network at a uniformly distributed random rate. The bit complement model does not closely match the actual traffic of a NoC architecture. To determine the coordinates of the destination node in this traffic model, a bit-wise inversion of the source coordinates is performed. This load stresses the horizontal and vertical network bisections, and a NoC that statically spreads traffic across all of the bisection links provides a perfectly balanced network load [30].

Hotspot Traffic Model: Hotspot nodes are very busy and congested nodes in a network. In this traffic pattern, each node sends packets to the other nodes with equal probability, with the exception of specific nodes, called hotspot nodes, that receive packets with a higher probability. This scenario selects [N/M]^2 of the nodes (N is the total number of nodes) as traffic hotspots, where M ∈ {2, 4, 8, 10, 12, ..., N}. A fraction p of the traffic (usually p ∈ {0.6, 0.7, 0.8}) is targeted at these hotspots (one is selected at a time by uniform random selection). The remaining traffic is sent uniformly to all other nodes. Both the number of hotspot nodes M and their traffic fraction p are user-defined parameters. A variant of this scheme selects different hotspots for each source.
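A minimal sketch of destination selection under hotspot traffic is given below. It assumes the hotspot set has already been chosen and only illustrates the probability split: with probability p a hotspot is picked uniformly at random, otherwise a non-hotspot node is picked uniformly. The function name `hotspot_destination`, its parameters, and the fallback for an empty candidate list are assumptions introduced here.

```python
import random


def hotspot_destination(src, nodes, hotspots, p, rng):
    """Pick a destination for a packet injected at `src`.

    nodes    : list of all node ids
    hotspots : list of hotspot node ids (assumed to be chosen beforehand)
    p        : fraction of traffic targeted at the hotspots
    rng      : a random.Random instance, so runs can be made reproducible
    """
    if rng.random() < p:
        candidates = [h for h in hotspots if h != src]
    else:
        candidates = [n for n in nodes if n not in hotspots and n != src]
    if not candidates:
        # Assumed fallback: any node other than the source.
        candidates = [n for n in nodes if n != src]
    return rng.choice(candidates)
```

For example, with `rng = random.Random(1)`, 16 nodes, and hotspots `[5, 10]`, setting p = 0.7 sends roughly 70 percent of the packets from any source to node 5 or node 10.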

Bit Reversal Traffic: In this type of traffic, each node sends packets to the destination node whose address is the bit reversal of the sender's address. The traffic pattern is generated according to the bit-reversal permutation: for an m-bit address, a packet generated at source node N = N_1 N_2 ... N_m is destined for node B(N) = N_m N_{m−1} N_{m−2} ... N_3 N_2 N_1.

Bit Rotation Traffic Model: This traffic model is similar to the other bit permutations; the destination address is obtained by rotating the bit string that represents the source address to the right by one. For bit rotation traffic, bit i of the destination address is d_i = s_{(i+1) mod m}.

Shuffle Traffic Model: This traffic pattern has similar features to the bit rotation traffic pattern. The only difference is its destination node, which is described as follows: d_i = s_{(i−1) mod m}.
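The bit-permutation patterns above (bit complement, bit reversal, bit rotation, and shuffle) can be sketched as operations on an m-bit node address. The sketch assumes that bit 0 is the least significant bit of the address; the function names are introduced here for illustration only.

```python
def bit_complement(src, m):
    """d_i = NOT s_i for every bit of an m-bit address."""
    return ~src & ((1 << m) - 1)


def bit_reversal(src, m):
    """d_i = s_(m-1-i): reverse the order of the m address bits."""
    dst = 0
    for i in range(m):
        if (src >> i) & 1:
            dst |= 1 << (m - 1 - i)
    return dst


def bit_rotation(src, m):
    """d_i = s_((i+1) mod m): rotate the address right by one bit."""
    return (src >> 1) | ((src & 1) << (m - 1))


def shuffle(src, m):
    """d_i = s_((i-1) mod m): rotate the address left by one bit."""
    mask = (1 << m) - 1
    return ((src << 1) & mask) | (src >> (m - 1))
```

For a 4-bit address `0b0011`, bit complement and bit reversal both give `0b1100`, bit rotation gives `0b1001`, and shuffle gives `0b0110`. Rotation and shuffle are inverses of each other.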

3.2 NoC Structure Model Hierarchy

In this section, we discuss the structural hierarchy of our NoC system, comprising the Tile, Memory, Router, and Processing Element. The tile contains the Router and Processing Element. Usually, the tile is set by the NoC instance, as shown in Fig. 3.1. The Processing Elements generate packets and inject them into the network according to the traffic patterns discussed above. The router, on the other hand, routes the packets in the network based on the routing functions discussed in Chapter 2 and selects the best output channel using a selection function.

[Figure 3.1: a Tile containing the Memory (Mem), the Router, and the Processing Element.]

Figure 3.1: NoC Structure Model Hierarchy.

3.2.1 The Thermal Model

Thermal modeling is used to evaluate an architectural design and to deliver energy efficiency and environmental comfort [74]. Due to the increase in power density that comes with 3-D integration and technology scaling, thermal modeling [75], [76] has been gaining more attention in recent times. The increasing power density increases the power/energy consumption in the system, which in turn leads to higher on-chip temperatures. As the system temperature increases, the system reliability is affected, since the failure rate rises exponentially with temperature. According to Junhui et al. [77], transistors and the materials used for making nodes start to fail or get damaged at a temperature of 80 °C.

The temperature model for our proposed algorithm (TAUSR) comprises two basic factors that determine a node's temperature in a network-on-chip:

1. Energy transformation within layers.
2. Thermal conduction in the form of heat.

Energy Transformation

Our reference core is based on the Intel 80-core energy model, in which energy transformation refers to the consumption of energy by each node. A node comprises the router, the processing element, and the memory. It is assumed that the energy consumed by the processing element is negligible. There are other leakage factors that can contribute to this change of energy as well as to its distribution among nodes. The energy consumed by these nodes is transformed into heat within a short period of time. The Intel 80-core architecture comprises a five-port, two-lane pipelined packet-switched router core with phase-tolerant mesochronous links. These components form the building block of the 80-tile NoC architecture. A typical tile is shown in Fig. 3.2 with its various internal energy-consuming components. The tile consists of a processing engine connected to a five-port router, which forwards packets between tiles. The processing engine, as shown in Fig. 3.2, contains two independent floating-point multiply-accumulator (FPMAC) units, 3 Kbytes of single-cycle instruction memory (IMEM), and 2 Kbytes of data memory (DMEM). The architecture allows scheduling to the FPMACs, DMEM loads and stores, and packet send/receive from the mesh network.

[Figure 3.2 labels: DMEM, RIB, MSINT, Router, CLK, IMEM, RF, Global clock spine + clock buffers, FPMAC0, FPMAC1.]

Figure 3.2: Full chip and tile micrograph for the Teraflops Intel 80 Processor.

The estimated power breakdown at the tile and router levels, which we simulated at 4 GHz, 1.2 V supply, and 110 °C, is shown in Fig. 3.3. The communication power is significant at 28 percent of the tile power, and the synchronous tile-level clock distribution accounts for 11 percent of the total. Clocking power (33 percent) is the largest component of router power, reflecting the high frequency of operation, with the input queues on both lanes and the associated data path being the second major component (22 percent). The MSINT blocks result in 6 percent power overhead [78].

[Figure 3.3: (a) router power breakdown: Queues + Datapath 22%, Arbiters + Control 7%, Clocking 33%, Links 17%, MSINT 6%, Crossbar 15%; (b) tile power breakdown: Clocking Distribution 11%, Dual FPMAC 36%, Router + Links 28%, 10 Port RF 4%, IMEM + DMEM 21%.]

Thermal Conduction

Thermal conduction is the transfer of energy (now heat) arising from temperature differences between adjacent nodes in the NoC. This conduction is ascribed to the exchange of energy between adjacent/neighbor nodes and the conducting medium (heat sink). Heat conduction among nodes raises or lowers the temperature of each node, as the case may be. The rate of heat flow in a node router is proportional to the cross-sectional area of the router and to the temperature difference between the node and its neighbor or a reference temperature, and inversely proportional to the length.

Our thermal model therefore takes the two factors mentioned above into consideration to determine the absolute node temperatures.

Energy Temperature relation

The temperature of the intermediate node i can be expressed as in Eq. 3.1:

$$T_i = T_i^{(0)} + T_i^{ET} + T_i^{TC}, \quad i = 1, 2, \ldots, N \tag{3.1}$$

where $T_i^{ET}$ is the energy transformation temperature associated with the intermediate node i, and $T_i^{TC}$ is the thermal conductivity temperature with reference to node i, that is, the heat flow that results from a change in the node temperature. $T_i^{ET}$ increases at each node i as a result of the energy consumed when flits are transmitted. Let $E_i^T$ be the overall energy consumed by the intermediate node i, which is associated with the steady-state router power of node i, $P_i^{SRP}$.

The energy transformation temperature changes from node to node by a coefficient α, which indicates how much the node temperature increases for a specific energy consumption (in °C per J). $T_i^{ET}$ is therefore expressed as:

$$E_i^T = P_i^{SRP} \tag{3.2}$$

$$T_i^{ET} = \alpha E_i^T \tag{3.3}$$

Thermal conductance refers to the way the conductive property of a given material (silicon) changes its specific resistance as the temperature changes between any two nodes i and j.

To measure the thermal conductivity temperature $T_i^{TC}$, we introduce the coefficient of resistance β, which is the resistance change factor per degree of temperature change between nodes. This can be expressed as:

$$T_i^{TC} = \sum_{j=1,\, j \neq i}^{N} \beta_{ij} \, \Delta T_{ij} \tag{3.4}$$

where $\beta_{ij}$ is the resistance change factor per degree of temperature change between nodes i and j, and $\Delta T_{ij}$ is the change in temperature between any two nodes i and j.

Note that we assume all nodes have the same initial temperature throughout this simulation. Knowing $T_i^{ET}$ and $T_i^{TC}$, we substitute (3.3) and (3.4) into (3.1) to obtain

$$T_i = T_i^{(0)} + \alpha E_i^T + \sum_{j=1,\, j \neq i}^{N} \beta_{ij} \, \Delta T_{ij}, \quad i = 1, 2, \ldots, N \tag{3.5}$$

$T_i^{(0)}$ is the initial temperature of node i after core warm-up, corresponding to the output temperature from the HotSpot simulation tool [79]. This output temperature serves as the initial node temperature, which is also affected by other factors such as the environmental (ambient) temperature. If all intermediate nodes have the same initial temperature $T_i^{(0)}$ from HotSpot, it follows from (3.3) that the node temperature can be related to the energy by:

$$T = \alpha E \tag{3.6}$$

From (3.6), as the total energy of the system increases, the node temperature increases by a factor α, and vice versa. Substituting (3.6) into (3.5) gives the approximate node temperature:

$$T_i = T_i^{(0)} + \alpha E_i^T + \sum_{j=1,\, j \neq i}^{N} \beta_{ij} \, \alpha \, \Delta E_{ij} = T_i^{(0)} + \alpha \left[ E_i^T + \sum_{j=1,\, j \neq i}^{N} \beta_{ij} \, \Delta E_{ij} \right] \tag{3.7}$$

where $\Delta E_{ij}$ is the difference in energy consumption between node i and node j.
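To make Eq. 3.7 concrete, the following sketch evaluates the node temperatures from per-node energies. It is illustrative only: the function name `node_temperatures`, the list-based inputs, and the sign convention ΔE_ij = E_i − E_j are assumptions made here, and the numeric values in the usage note are arbitrary rather than measured.

```python
def node_temperatures(t0, energy, alpha, beta):
    """Evaluate Eq. 3.7 for every node.

    t0     : list of initial temperatures T_i^(0)
    energy : list of consumed energies E_i^T
    alpha  : temperature rise per unit of energy
    beta   : matrix (list of lists) of coefficients beta_ij
    Assumes Delta E_ij = E_i - E_j (sign convention chosen for this sketch).
    """
    n = len(energy)
    temps = []
    for i in range(n):
        coupling = sum(
            beta[i][j] * (energy[i] - energy[j]) for j in range(n) if j != i
        )
        temps.append(t0[i] + alpha * (energy[i] + coupling))
    return temps
```

With `t0 = [50.0, 50.0]`, `alpha = 3.0`, and `beta = [[0.0, 0.5], [0.5, 0.0]]`, equal energies `[2.0, 2.0]` give the same temperature at both nodes, while unequal energies `[4.0, 2.0]` give a hotter first node, which is the imbalance the routing algorithm tries to remove.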

Our main aim is to reduce the total energy consumption in the system, thereby reducing the system temperature and increasing the system throughput.

3.2.2 System Energy Consumption

The NoC architecture contains tiles. Each tile contains components such as the Router for routing flits, the Memory for storage, and the Processing Element for packet generation. All these components consume energy depending on their operations. According to Fig. 3.2, the tile contains other system components that also consume energy. We adopted the Intel 80-core energy model shown in Fig. 3.3 [78].

These components are divided into two groups: (a) component functions called by the router and (b) component functions called by the tile. The energy-consuming component functions called by the router are:

1. Queue and datapath
2. MSINT
3. Arbiters and control
4. Crossbar
5. Links
6. Clocking

The component functions called by the tile are:

1. Dual FPMAC
2. IMEM
3. DMEM
4. Clocking and leakage

The energy consumed by each component is given in Eq. 3.8:

$$\begin{aligned}
E_{QNDataPath} &= k \, ESF, & E_{MSINT} &= m \, ESF, \\
E_{LINKS} &= n \, ESF, & E_{FPMACs} &= \psi \, ESF, \\
E_{CLOCKING} &= v \, ESF, & E_{IMEM} &= \gamma \, ESF, \\
E_{DMEM} &= u \, ESF, & E_{Crossbar} &= h \, ESF
\end{aligned} \tag{3.8}$$

where

$ESF$ = the energy scaling factor
$E_{QNDataPath}$ = energy consumed by the data path and queues
$E_{MSINT}$ = energy consumed by the mesochronous interface
$E_{FPMACs}$ = energy consumed by the floating-point multiply-accumulators (FPMAC)
$E_{IMEM}$ = energy consumed by the instruction memory
$E_{LINKS}$ = energy consumed by the link connections
$E_{DMEM}$ = energy consumed by the data memory
$E_{Crossbar}$ = energy consumed by the crossbar and switching
$E_{CLOCKING}$ = energy consumed by mesochronous clocking
$k, m, n, \psi, v, \gamma, u$ and $h$ = multiplier constants according to the Intel 80-core energy model

The transmission energy $E_{Trans}$ is obtained by summing the component energies in Eq. 3.8:

$$E_{Trans} = E_{QNDataPath} + E_{MSINT} + E_{LINKS} + E_{FPMACs} + E_{CLOCKING} + E_{IMEM} + E_{DMEM} + E_{Crossbar} \tag{3.9}$$

The total energy consumed by a node i can be represented as:

$$E_{Total,i} = E_i^T = E_{Trans} N_i \tag{3.10}$$

where $N_i$ is the total number of flits transmitted by node i.
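The energy bookkeeping of Eqs. 3.8 to 3.10 can be sketched as follows. The multiplier values are not listed in this section, so the sketch takes them as an input; the dictionary keys, the function names, and the placeholder values in the usage note are hypothetical and are not the constants of the Intel 80-core energy model.

```python
def transmission_energy(multipliers, esf):
    """Eqs. 3.8 and 3.9: sum of (multiplier * ESF) over all components.

    multipliers : dict mapping a component name to its multiplier constant
                  (k, m, n, psi, v, gamma, u, h in the text)
    esf         : the energy scaling factor
    """
    return sum(value * esf for value in multipliers.values())


def node_energy(e_trans, flits):
    """Eq. 3.10: total energy of a node that transmitted `flits` flits."""
    return e_trans * flits
```

As a usage example with placeholder values, setting all eight multipliers to 1.0 and ESF to 2.0 gives a transmission energy of 16.0, and a node that transmitted 10 flits then has a total energy of 160.0 in the same (arbitrary) units.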

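Substituting Eq. 3.10 into Eq. 3.7 gives the node temperature:

$$T_i = T_i^{(0)} + \alpha E_i^T + \sum_{j=1,\, j \neq i}^{N} \beta_{ij} \, \alpha \, \Delta E_{ij} = T_i^{(0)} + \alpha \left[ E_{Trans} N_i + \sum_{j=1,\, j \neq i}^{N} \beta_{ij} \, \Delta E_{ij} \right] \tag{3.11}$$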

From Eq. 3.11, if the traffic is balanced among the nodes, $\Delta E_{ij} = 0$ and all nodes have the same temperature across the network, according to Eq. 3.12:

$$T_i = T_i^{(0)} + \alpha E_{Trans} N_i \tag{3.12}$$


Chapter 4 Thermal-aware and Uniform Priority Scaled Routing

4.0.1 Our system Model

The thermal model presented in Section 3.2.1 gives a clear indication of how to represent our system model and algorithm. Based on this model, we propose our algorithm by employing the idea of the uniform random distribution of 802.11g, which acts as a balanced adaptive routing algorithm that distributes the network temperature uniformly across the entire chip and improves the performance of the system. The system model is divided into two main distributions: the flit distribution and the thermal-state distribution.

4.0.2 Thermal-State Distribution

The thermal-state distribution is a uniform distribution over all possible channel directions in the 3D-NoC, including the local direction (7 channels in total). It describes the thermal state of the individual nodes in the chip, and this thermal state determines whether it is a good idea to route flits to a node. The thermal-state distribution is divided into two parts: one part indicates the hotspot region, and the other indicates the active region, with one standby active node for packet redirection in case of deadlock. The hotspot region of the distribution is an emergency region, and no flit/packet is allowed to be routed in this direction, while the active region is a free region for flit/packet routing. Note that this is a uniform random distribution that is designed to change every simulation cycle. The distribution can be represented as shown in Eq. 4.1:

$$X_i \sim U(0, \mathrm{Channels} + 1) \tag{4.1}$$

where $X_i$ is the thermal-state uniform distribution for node i.

If the whole distribution turns out to be a hotspot region, we use an escape route via a virtual channel or a channel made for packet redirection.
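One possible reading of the thermal-state distribution is sketched below. It assumes that the slots 0 to 7 are split by a boundary drawn uniformly at random, that the slots below the boundary form the hotspot region, and that the highest slot is always kept as the packet redirection direction. The boundary mechanism and the function names `split_regions` and `draw_regions` are assumptions made for illustration, not the exact procedure of the thesis.

```python
import random

CHANNELS = 7  # channel directions in the 3D-NoC, including the local one


def split_regions(boundary, channels=CHANNELS):
    """Split slots 0..channels into (hotspot, active, redirect).

    Slots below `boundary` are treated as the hotspot region (assumption).
    The last slot is reserved for packet redirection and stays active.
    """
    if not 0 <= boundary <= channels:
        raise ValueError("boundary must lie between 0 and channels")
    slots = list(range(channels + 1))
    return slots[:boundary], slots[boundary:], slots[-1]


def draw_regions(rng, channels=CHANNELS):
    """Draw a new random split, for example once per simulation cycle."""
    return split_regions(rng.randint(0, channels), channels)
```

Calling `split_regions(5)` reproduces the split described for Fig. 4.2: hotspot slots 0 to 4, active slots 5 to 7, and slot 7 as the redirection direction. When the boundary reaches 7, only the redirection slot remains active, which corresponds to the escape route mentioned above.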

Fig. 4.1 shows an outline of our proposed algorithm with the source nodes, destination nodes, and hotspot nodes. The hotspot nodes form the hotspot region of Fig. 4.2.

[Figure 4.1: routers (R) arranged in four layers (Layer 0 to Layer 3) along the X-, Y-, and Z-axes, with hops between routers and a legend marking the source node, destination node, and hotspot node.]

Figure 4.1: Outline of our proposed algorithm.

[Figure 4.2: port directions 0 to 7, divided into disallowed directions (hotspot region) and allowed directions (active region), with one direction reserved for packet redirection.]

Figure 4.2: Thermal-state uniform distribution(a)

Fig. 4.2 shows a uniform random thermal-state distribution with its hotspot region and active region. The hotspot region nodes in Fig. 4.2 are node{0, 1, 2, 3 and 4}, and the active nodes are node{5, 6 and 7}, with node{7} acting as the packet redirection node in case of deadlock. The purpose of this algorithm is to exclude all output directions in the hotspot region and to direct packets only into the active region. This condition is applied prior to the flit distribution described in the following subsection. Fig. 4.3 describes a random hotspot and active region distribution with a larger active region than hotspot region. The active region nodes in Fig. 4.3 are node{0, 1, 5, 6, and 7}, while the hotspot region nodes are node{5, 6}.

[Figure 4.3: port directions 0 to 7, divided into disallowed directions (hotspot region) and allowed directions (active region), including the packet redirection direction.]

4.0.3 Flit Distribution

According to [73], routers consume an amount of energy equivalent to that of the IP cores in an on-chip network, and the energy consumed is directly associated with the traffic load in the form of flit transmissions. Each flit is identified by its source id. Flit ids (source ids) range from 0 to 255 and identify the individual flit on arrival.

[Figure 4.4: flit source ids from 0 to 255 (ticks at 0, 32, 64, 96, 128, 160, 192, 224, 255), divided into low-region flits (route via our algorithm) and high-region flits (deliver packet).]

Figure 4.4: Flits uniform distribution(a)

The flit distribution is a random uniform distribution over all flit ids that tags the flit property randomly in a uniformly distributed way. This distribution is also divided into two regions: the high-level flit region and the low-level flit region.

A flit whose id falls in the high-level flit region is treated as a special-case flit. Flits with ids in this region must be delivered to their immediate destination without any random wait or queueing delay, regardless of whether the destination node/direction is in the hotspot region.

Flits with ids in the low-level region are routed according to our proposed thermal-aware algorithm, described below in Section 4.0.5.
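The two flit regions can be sketched as a simple classification of the flit source id. The sketch assumes a single boundary, drawn uniformly at random, with ids below the boundary in the low-level region and the remaining ids in the high-level region; this boundary mechanism and the names `draw_flit_boundary` and `flit_region` are assumptions made for illustration.

```python
import random

FLIT_ID_COUNT = 256  # flit source ids range from 0 to 255


def draw_flit_boundary(rng):
    """Draw the low/high boundary uniformly from 0..256 (assumption)."""
    return rng.randint(0, FLIT_ID_COUNT)


def flit_region(flit_id, boundary):
    """Return "low" or "high" for a flit source id.

    "low"  : the flit is routed by the thermal-aware algorithm
    "high" : the flit is delivered directly, even toward a hotspot
    """
    if not 0 <= flit_id < FLIT_ID_COUNT:
        raise ValueError("flit id must lie between 0 and 255")
    return "low" if flit_id < boundary else "high"
```

For instance, with a boundary of 128, flit id 31 is in the low-level region and is routed by the thermal-aware algorithm, while flit id 200 is in the high-level region and is delivered without delay.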

[Figure 4.5: flit source ids from 0 to 255 (ticks at 0, 32, 64, 96, 128, 160, 192, 224, 255), with low-region flits (route via our algorithm) and high-region flits (deliver packet).]

Figure 4.5: Thermal-state uniform distribution (b)

4.0.4 Deadlock recovery mechanism

Performance degradation in most networks-on-chip is caused by the occurrence of deadlocks in the network. To handle this, a deadlock recovery mechanism is employed. We adopt the principle of [80], which proposes a dimension reversal method that serves as an adaptive way of overcoming deadlocks in the network. According to [80], the dimension reversal number is the number of times a packet has been routed from a channel in one dimension, i, to a channel in another dimension, j, where i < j. A deadlocked packet is routed to its destination node via the packet redirection channel assigned to that node.

4.0.5 Thermal-aware and Uniform Priority Scaled Routing Algorithm

A 3D-NoC has three coordinate axes, x, y, and z. The network-on-chip has source nodes and destination nodes, each of which is represented by a 3D coordinate. The main aim is to deliver packets from the source node to the intended destination node. Packets coming from the source node can be transferred along the different coordinate axes (x, y, and z), and the list of all possible output directions is stored as a set/vector. For every node in this set, the current node temperature, the flit distribution, and the thermal-state distribution variables are calculated as in Eq. (7), Eq. (8) and Eq. (9). This process is repeated until the packets are successfully delivered to the destination node. Our protocol is described in Algorithm 1.


Algorithm 1 Thermal-aware and Uniform Priority with Scaled Routing Algorithm.

Initialization of all system variables:
Directions: Set of all possible output directions in which packets can be routed from the current node; Directions = current node's coordinates
o_i ∈ Directions: Selected output node from Directions
C: 3D coordinate of the current network node, C{c_x, c_y, c_z}
D: 3D coordinate of the destination network node, D{d_x, d_y, d_z}
S: 3D coordinate of the source network node
Flits_id: Flit source id
CF_id: Current flit source id
w ∈ Flits_id: Set of all flit ids in the low-level region
x ∈ Directions: Set of output directions in the hotspot region
T_Ni: Intermediate output temperature for node o_i ∈ Directions
T_S: Temperature threshold
V_c: Packet redirection channel

1: Router in C receives a new packet from router in S
2: Calculate δX = d_x - c_x, δY = d_y - c_y, δZ = d_z - c_z
3: If (δX = 0) && (δY = 0) && (δZ = 0) direct packet to the local IP core
4: If (δX > 0) push (c_x + 1, c_y, c_z) to set Directions, else push (c_x - 1, c_y, c_z)
5: If (δY > 0) push (c_x, c_y + 1, c_z) to set Directions, else push (c_x, c_y - 1, c_z)
6: If (δZ > 0) push (c_x, c_y, c_z + 1) to set Directions, else push (c_x, c_y, c_z - 1)
7: For every individual output node o_i ∈ Directions, read the thermal information T_Ni of the node as in Eq. (7), Eq. (8) and Eq. (9)
8: ∃ i, o_i ∈ Directions, If (T_Ni < T_S) select direction i from the Directions set
9: ∀ i, o_i ∈ Directions, If (T_Ni > T_S) mark o_i as hotspot region and generate two uniform random distributions indicating the hotspot region for the thermal-state distribution and the low-level region for the flit distribution, as in Fig. 4.2 and Fig. 4.4
10: x ∈ Directions, push all output directions in the hotspot region of the thermal-state distribution (Fig. 4.2) into set x
11: w ∈ Flits_id, push all flit ids in the low-level region of the flit distribution (Fig. 4.4) into set w
12: If (o_i in set x && CF_id in set w) remove o_i from Directions, else do nothing
13: If (∄ o_i ∈ Directions, i.e. the Directions set is empty) push V_c into the Directions set
14: Update node information
15: Repeat steps 1-14 until the packet reaches the destination node
16: return Directions
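The per-hop decision in Algorithm 1 can be illustrated with a minimal Python sketch. It is not the thesis implementation, and it makes several simplifying assumptions: the uniform thermal-state and flit distributions of Eq. (7) to Eq. (9) are not modelled, so set x is approximated as the outputs whose temperature exceeds the threshold, set w is passed in by the caller, axes that are already aligned with the destination are skipped, and the names `candidate_directions`, `filter_directions` and `REDIRECT_CHANNEL` are hypothetical.

```python
# Placeholder for the packet redirection channel V_c (hypothetical name).
REDIRECT_CHANNEL = "Vc"


def candidate_directions(current, dest):
    """Steps 2-6: propose one neighbour per axis, moving towards dest.

    Assumption: an axis whose offset is zero contributes no candidate.
    An empty list means the packet is already at its destination (step 3).
    """
    directions = []
    for axis in range(3):
        delta = dest[axis] - current[axis]
        if delta == 0:
            continue
        neighbour = list(current)
        neighbour[axis] += 1 if delta > 0 else -1
        directions.append(tuple(neighbour))
    return directions


def filter_directions(directions, temps, threshold, flit_id, low_level_flits):
    """Steps 7-13: drop hotspot outputs for flits in the low-level set w.

    temps maps each candidate node to its temperature T_Ni.
    Assumption: set x is simply every output with T_Ni > threshold.
    """
    hotspot = {o for o in directions if temps[o] > threshold}  # set x
    if flit_id in low_level_flits:                             # CF_id in set w
        kept = [o for o in directions if o not in hotspot]
    else:
        kept = list(directions)
    # Step 13: fall back to the redirection channel if nothing is left.
    return kept or [REDIRECT_CHANNEL]


if __name__ == "__main__":
    dirs = candidate_directions((1, 1, 0), (3, 1, 2))
    temps = {(2, 1, 0): 90.0, (1, 1, 1): 60.0}
    print(filter_directions(dirs, temps, 80.0, flit_id=7, low_level_flits={7}))
```

In this sketch a flit that belongs to set w is steered away from hot neighbours, while any other flit keeps every candidate, which mirrors the condition in step 12.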


Chapter 5 Simulation Results and Environment

5.1 Simulation setup

In order to demonstrate the performance of the proposed routing algorithm, we set up a 3D mesh NoC system of size 8 x 8 x 4 using AccessNoxim [82], a cycle-accurate simulator written in SystemC [81]. AccessNoxim is an open-source NoC simulation tool [83] that extends Noxim [83] with thermal and power models. Each network node generates packets according to a Poisson distribution, and all packets transmitted in the simulation have a fixed length. During packet transmission, energy is consumed and the temperature of the individual node rises. This rise in temperature is monitored reactively for each node during the entire process. As the energy consumed increases, the temperature increases, and vice versa.
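As a rough illustration of Poisson packet generation at a single node, the following sketch draws exponentially distributed gaps between injections, which yields a Poisson arrival process. It is not taken from AccessNoxim; the function name `poisson_injection_cycles` and its parameters (an injection rate in packets per cycle, a horizon in cycles, an optional seed) are assumptions made for this example.

```python
import random


def poisson_injection_cycles(rate, horizon, seed=None):
    """Return the cycle indices at which one node injects a packet.

    rate is the mean number of packets per cycle (hypothetical parameter).
    Gaps between injections are exponential with mean 1 / rate, so the
    number of injections in any window follows a Poisson distribution.
    """
    rng = random.Random(seed)
    cycles = []
    t = rng.expovariate(rate)
    while t < horizon:
        # Truncate to a whole cycle; two packets may share the same cycle.
        cycles.append(int(t))
        t += rng.expovariate(rate)
    return cycles


if __name__ == "__main__":
    injections = poisson_injection_cycles(rate=0.1, horizon=10_000, seed=1)
    print(len(injections), "packets injected in 10,000 cycles")
```

With a rate of 0.1 packets per cycle, roughly one packet every ten cycles is expected on average, although the exact count varies from run to run.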

The HotSpot simulator [79] is open-source software for measuring temperature, and AccessNoxim uses it to measure the node temperature. Our simulation parameters are given in Table 5.1. To ensure the accuracy of the evaluation of our proposed algorithm, the simulator is warmed up for 10,000 cycles, and performance is then averaged over 100,000 simulation cycles. For adequate comparison, we consider three traffic patterns: (1) uniform random traffic, (2) transpose traffic and (3) shuffle traffic.


Table 5.1: SIMULATION PARAMETERS

Parameters                  Value
Network size                8 x 8
Buffer size                 4
Topology                    8 x 8 x 4 mesh
Routing                     DLADR, TAAR, XYZ, North-Last, West-First and TAUPSR
Packet injections           Poisson distribution
Traffic                     Uniform random, Transpose and Shuffle
Scheduling and Allocation   Arbiter Oblivious
Flow-Control                Virtual-Cut through

5.2 Performance analysis and result comparison

Our performance comparison covers three aspects: network throughput, total power consumed and network thermal distribution. We compare our proposed algorithm with the existing routing algorithms in AccessNoxim, such as XYZ, West-First, North-Last, Odd-Even, DLADR, and TAAR. The initial temperature of all nodes is set to 25 °C.

As the performance metric, we use the throughput of the network, which is the number of received flits divided by the total number of simulation cycles. Another metric is the total energy/power consumption of the system, which is the overall power consumed while flits are transmitted through the network. For the reliability and longevity of the system components, we also measure the number of active nodes compared with the number of hotspot nodes in the network for the various routing algorithms. The more active nodes there are, the more reliable the system is, whereas more hotspot nodes lead to system failure.
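The throughput and active-node metrics described above can be written down as a short sketch. The function names, the example temperatures and the 80-degree threshold are hypothetical; in particular, classifying a node as a hotspot when its temperature exceeds a single threshold is an assumption of this example.

```python
def throughput(received_flits, simulation_cycles):
    """Network throughput in flits per cycle."""
    if simulation_cycles <= 0:
        raise ValueError("simulation_cycles must be positive")
    return received_flits / simulation_cycles


def active_vs_hotspot(node_temps, threshold):
    """Return (active_nodes, hotspot_nodes) for a list of temperatures.

    Assumption: a node counts as a hotspot when its temperature is
    strictly above the threshold; every other node counts as active.
    """
    hotspot = sum(1 for t in node_temps if t > threshold)
    return len(node_temps) - hotspot, hotspot


if __name__ == "__main__":
    print(throughput(25_000, 100_000))                         # flits per cycle
    print(active_vs_hotspot([25.0, 90.0, 70.0, 85.0], 80.0))   # (active, hotspot)
```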
