
3D NoC simulation model for FPGA

Academic year 2019-2020

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Electrical Engineering - main subject Electronic Circuits and Systems

Counsellor: Dries Vercruyce

Supervisors: Prof. dr. ir. Dirk Stroobandt, Dr. ir. Poona Bahrebar

Student number: 01502025


Preface

I would like to thank my promotors Prof. Dirk Stroobandt and Dr. Poona Bahrebar for giving me the opportunity to work on this thesis and gain more experience in the research field. I would also like to thank them for guiding and counselling me during the process.

A special thanks to my family and friends, for supporting me through difficult times and allowing me to blow off some steam when needed.

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In all cases of other use, the copyright terms have to be respected, in particular with regard to the obligation to state explicitly the source when quoting results from this master dissertation.


Abstract

Networks-on-Chip (NoCs) are an often used communication scheme in Systems-on-Chip (SoCs). Simulators are used to test several parameters and their effect on the performance of the network. Several software simulators exist; they are flexible and very accurate, but too slow for larger designs. Field Programmable Gate Arrays (FPGAs) can speed up the simulation process significantly, while maintaining the same level of accuracy. Software simulators allow testing of both Two-Dimensional (2D) and Three-Dimensional (3D) networks, while only 2D simulators have been proposed on FPGA.

In this thesis, a 3D NoC simulation model for FPGA is proposed. The model is used to explore the possibilities and difficulties of 3D simulation. It is not a complete FPGA simulator, but a software implementation that replicates the behaviour of the FPGA simulator. This model makes use of a 3D Time-Division-Multiplexing (TDM) method to simulate up to 10,648 nodes on a single FPGA. This TDM or clustering method requires an estimate of the resource usage on FPGA. Therefore, a VHDL implementation is made for certain sub-modules of the simulator. These estimations are used to determine the optimal clustering. The model is compared with a reference simulator to verify its accuracy.


3D NoC simulation model for FPGA

Jonathan D’Hoore

Supervisor(s): Prof. Dr. ir. Dirk Stroobandt, Dr. ir. Poona Bahrebar

Abstract - Networks-on-Chip (NoCs) are an often used communication scheme in Systems-on-Chip (SoCs). Simulators are used to test several parameters and their effect on the performance of the network. Several software simulators exist; they are flexible and very accurate, but too slow for larger designs. Field Programmable Gate Arrays (FPGAs) can speed up the simulation process significantly, while maintaining the same level of accuracy. Software simulators allow testing of both Two-Dimensional (2D) and Three-Dimensional (3D) networks, while only 2D simulators have been proposed on FPGA.

In this thesis, a 3D NoC simulation model for FPGA is proposed. The model is used to explore the possibilities and difficulties of 3D simulation. It is not a complete FPGA simulator, but a software implementation that replicates the behaviour of the FPGA simulator. This model makes use of a 3D Time-Division-Multiplexing (TDM) method to simulate up to 10,648 nodes on a single FPGA. This TDM or clustering method requires an estimate of the resource usage on FPGA. Therefore, a VHDL implementation is made for certain sub-modules of the simulator. These estimations are used to determine the optimal clustering. The model is compared with a reference simulator to verify its accuracy.

Keywords – Network-on-Chip (NoC), Simulation, 3D simulator, FPGA

I. INTRODUCTION

Networks-on-Chip (NoCs) have emerged as an effective communication scheme for Systems-on-Chip (SoCs). A SoC consists of many processing elements, memory elements, etc. that are implemented on the same chip. NoCs use a network of routers to interconnect all components and allow efficient communication between them. Two-Dimensional (2D) NoCs already scale well to a large number of nodes. However, the performance of the network can be enhanced by moving to Three-Dimensional (3D) structures [1,2].

Before the actual implementation of the NoC, a design goes through several iterations to fine-tune certain parameters. Simulators are very important in these design steps, as they deliver accurate predictions about the performance of the network, such as the expected latency, throughput, etc. Several software simulators exist, for both 2D and 3D structures: full-system simulators such as gem5 [3], or standalone NoC simulators such as BookSim [4]. They provide very accurate results, while also being very flexible in their design. Parameters can easily be changed and extensions to the simulators can be made. However, they severely lack in simulation speed when designs become larger (hundreds to thousands of nodes).

Field Programmable Gate Array (FPGA) simulators can be used to obtain a higher simulation speed. Several 2D NoC emulators such as FNoC [5] have already been constructed. They can match the accuracy of software simulators, while enhancing the simulation speed significantly (up to 5000 times faster in some cases). Although they are less flexible, their simulation speed makes them very relevant for the simulation of large designs.

To simulate large designs on FPGA, a Time-Division-Multiplexing (TDM) approach is often used. The FPGA does not have enough resources to implement the complete network at once, so the network needs to be divided into several smaller groups called clusters. FNoC has proposed such a clustering method, which efficiently uses its resources and is able to simulate large designs [5]. A lot of research effort has been made to construct efficient 2D NoC FPGA simulators, but to the best of our knowledge, no 3D FPGA simulator has been proposed yet. In this thesis a model will be constructed to explore the possibilities and difficulties in 3D FPGA simulation. The 2D clustering method introduced in [5] will be extended to 3D networks.

The proposed model is not a complete 3D NoC simulator, but consists of a software implementation that replicates the behaviour on FPGA. Some sub-modules are implemented on FPGA to obtain credible estimations of the resource usage.

II. RELATED WORKS

In recent years, several research groups have constructed 2D FPGA emulators. A short overview is given here.

FIST [6] sacrifices accuracy for speed and area by simplifying the router models. Instead of completely modelling the router, it is replaced by a simple load-delay curve, obtained by offline or online training. However, this reduction in accuracy might lead to wrong conclusions about the behaviour of the network.

DART [7] uses a global interconnect, which provides all-to-all communication. Because of this, every topology can be implemented without need for reconfiguration. However, this global interconnect introduces area overhead, which limits the size of implementable designs.

AdapNoC [8] has moved parts of the simulation (traffic receptor and traffic generator) to MicroBlaze soft processors. This makes the design more flexible, but the communication between the soft processors and the FPGA becomes the performance bottleneck and impacts the simulation time.

DRNoC [9] tries to reduce the need for re-synthesis by making use of partial reconfiguration. This means that parts of the implementation can be changed at runtime, without having to re-run the synthesis. However, this was only demonstrated for small designs of up to 16 nodes.

FNoC [5] is one of the most promising 2D NoC simulators on FPGA. It has exactly the same accuracy as BookSim and allows designs with up to 128 × 128 nodes on a Virtex-7 FPGA. It can realise speedups of 5000x over BookSim for direct networks (2D mesh), and 232x for indirect networks (fat-tree). FNoC realises this by efficiently using memory (BRAM blocks) and using a novel TDM approach.


Previous works have focused on 2D NoC designs. To the best of our knowledge, the extension to 3D designs has not been made yet.

III. TERMINOLOGY

IV. SIMULATION OF 3D NOC

To simulate a 3D NoC, an adapted version of the 2D TDM approach presented in [5] is used. Due to the limited resources on the FPGA, the network cannot be implemented as a whole. Therefore, it is necessary to split the network into several clusters. One cluster of nodes is physically implemented on the FPGA; this cluster is called the physical cluster. The physical cluster subsequently simulates all logical clusters. Figure 1 illustrates the clustering of a 6x6x6 mesh. The network is divided into several logical clusters of size 2x2x3 (for the clarity of the figure, not all clusters and vertical connections are shown).

Figure 1: Clustering of a 6x6x6 mesh by using 2x2x3 clusters.

The high-level datapath is depicted in Figure 2. The physical cluster is the center of the design and will perform the actual simulation. The State Memory contains all state variables of every node of each logical cluster. Each logical cluster has its own entry in the memory. The state is loaded into the physical cluster, and after simulation the updated state is stored again into the State Memory.

Logical clusters are connected to each other by inter-cluster channels. The data on these channels is stored in a dedicated Out Buffer. However, caution needs to be taken when storing the data. An extra In Buffer is needed to prevent data from being overwritten due to the simulation order [5].

The datapath looks the same as for the 2D case, but the physical and logical clusters are now 3D and will require more resources. Also, the In and Out buffers will be larger, because there are more inter-cluster connections in a 3D network (up and down channels).

Figure 2: High-level datapath [5].

The simulation procedure (for one simulation cycle) can be summarized as follows:

1) The state of the first logical cluster is loaded into the physical cluster, as well as the data from the In buffer.
2) The physical cluster is simulated and all state variables and outgoing data are updated.
3) State variables are stored in the state memory of the logical cluster. Outgoing data is stored in the Out buffer.
4) Steps 1 to 3 are repeated for every logical cluster.
5) Once all logical clusters are simulated and their states stored in memory, the content of the Out buffers is copied to the In buffers.
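To make the procedure concrete, the following Java fragment sketches one network cycle of this TDM loop. It is a minimal sketch, not the actual thesis code: the ClusterState, InterClusterFlit and PhysicalCluster types and all method names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Placeholder for all state variables of one logical cluster. */
class ClusterState { /* buffer contents, credit counters, allocator state, ... */ }

/** A flit on an inter-cluster channel, tagged with its destination cluster. */
class InterClusterFlit {
    final int destCluster;
    InterClusterFlit(int destCluster) { this.destCluster = destCluster; }
}

/** The single physically implemented cluster (a software stand-in here). */
interface PhysicalCluster {
    void load(ClusterState state, List<InterClusterFlit> incoming); // step 1
    void simulateOneCycle();                                        // step 2
    ClusterState storeState();                                      // step 3 (state)
    List<InterClusterFlit> drainOutgoing();                         // step 3 (data)
}

class TdmScheduler {
    private final PhysicalCluster physical;
    private final ClusterState[] stateMemory;              // the State Memory
    private final List<List<InterClusterFlit>> inBuffers;  // the In buffers
    private final List<InterClusterFlit> outBuffer;        // the Out buffer

    TdmScheduler(PhysicalCluster physical, int numClusters) {
        this.physical = physical;
        this.stateMemory = new ClusterState[numClusters];
        this.inBuffers = new ArrayList<>();
        this.outBuffer = new ArrayList<>();
        for (int c = 0; c < numClusters; c++) {
            stateMemory[c] = new ClusterState();
            inBuffers.add(new ArrayList<>());
        }
    }

    /** One simulation cycle of the complete network (steps 1-5 above). */
    void simulateNetworkCycle() {
        for (int c = 0; c < stateMemory.length; c++) {                  // step 4
            physical.load(stateMemory[c],
                          new ArrayList<>(inBuffers.get(c)));           // step 1
            inBuffers.get(c).clear();
            physical.simulateOneCycle();                                // step 2
            stateMemory[c] = physical.storeState();                     // step 3
            outBuffer.addAll(physical.drainOutgoing());                 // step 3
        }
        for (InterClusterFlit f : outBuffer)                            // step 5
            inBuffers.get(f.destCluster).add(f);
        outBuffer.clear();
    }
}
```

Keeping the Out buffer separate from the In buffers until step 5 is what makes the per-cluster simulation order within a cycle irrelevant, which is exactly why the extra In buffer is needed [5].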

V. PROPOSED MODEL

Due to time restrictions, it was not feasible to create a complete VHDL implementation. Therefore, the choice has been made to create a model in software (Java) that replicates the behaviour on FPGA. This significantly speeds up the design process, because it is easier to adapt and debug modules in software. Using this model, several parameters and clustering possibilities can be tested. However, in order to do the clustering in a meaningful way, it is necessary to obtain an estimate of the resources needed on FPGA. Therefore, a VHDL implementation has been made for several modules: the router, IP core and memory. In fact, only the top controlling level (the clustering, communication between nodes, loading and storing of data) is not implemented on FPGA, but replicated in software. This makes it possible to study the properties of the 3D simulation for a range of parameters, in a limited amount of time.

A. Objective of the model

The objective of the proposed model is to obtain a high-level understanding of the clustering method and use it to explore how the implementation changes when certain parameters are changed. Special care needs to be taken with the data transfer and storage in the clustering method. All state variables need to be set in a proper way, to ensure correct operation of the simulator. The simulator introduced in this manuscript is constructed from scratch, based on the explanations given in [10], which is the reference book in the NoC domain. The proposed simulator makes use of the same clustering method as the one on FPGA. It can be seen as the top controlling level of the FPGA implementation.

To perform the clustering in a meaningful way, it is necessary to obtain an estimate of the resources needed on FPGA. Therefore, a VHDL implementation is made for several modules. This information is used to estimate the total resource usage and determine several clustering possibilities. Using the proposed model and the VHDL implementations of the modules, the optimal clustering can be determined for several configurations. This could help in future work when a complete FPGA implementation is made, as the design space and possibilities would be reduced.

B. Implementation details

The implemented router model is the 5-stage input-queued router. It uses Wormhole Flow Control with Virtual Channels (VCs) and credit-based flow control [10]. Flits and credits each have a separate physical channel in this implementation.

For the Virtual Channel Allocation and Switch Allocation stages, the iSLIP allocation scheme is used. This is a separable output allocator that uses round-robin arbiters [11].
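The round-robin arbiter that iSLIP is built from can be sketched in a few lines of Java. This is an illustrative fragment, not the thesis implementation; in particular, in full iSLIP the grant pointers are only rotated when a grant is accepted in the first iteration [11].

```java
/** Minimal round-robin arbiter of the kind used in iSLIP [11]. */
class RoundRobinArbiter {
    private final int n;   // number of requesters
    private int pointer;   // index with the highest priority (starts at 0)

    RoundRobinArbiter(int n) { this.n = n; }

    /** Grants the first requester at or after the pointer; returns -1 if idle. */
    int arbitrate(boolean[] requests) {
        for (int i = 0; i < n; i++) {
            int candidate = (pointer + i) % n;
            if (requests[candidate]) {
                pointer = (candidate + 1) % n; // rotate priority past the winner
                return candidate;
            }
        }
        return -1;
    }
}
```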

Both the resource usage of the simulator and the performance of the network depend on several parameters. The proposed model allows several parameters to be changed, such as the number of VCs, buffer size, source queue size, routing algorithm and traffic pattern.

The routing algorithm can be either Dimension-Order Routing (DOR) [12] or Minimal Adaptive Routing (MAR) [13].

The traffic pattern can be uniform, hotspot or local traffic. In the uniform traffic pattern, each node has the same probability of receiving a packet. In the hotspot traffic pattern, a few nodes are designated as hotspot nodes. These nodes receive more packets than the other nodes in the network: in addition to the regular uniform traffic, they receive an extra portion of hotspot traffic [14,15]. In the local traffic pattern, the probability of receiving a packet decreases as the distance between two nodes increases. It is described by the Communication Probability Distribution (CPD) defined in [16].
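The Java sketch below shows one way such destination selection could be implemented. The weight bookkeeping is an assumption of this sketch; the actual hotspot fractions and CPD values follow [14-16].

```java
import java.util.Random;

/** Illustrative destination selection for the three traffic patterns. */
class TrafficGeneratorSketch {
    private final Random rng = new Random();

    /** Uniform: every node except the source is equally likely. */
    int uniformDest(int source, int numNodes) {
        int dest;
        do { dest = rng.nextInt(numNodes); } while (dest == source);
        return dest;
    }

    /** Weighted choice used for hotspot and local traffic. For hotspot
     *  traffic, hotspot nodes carry an extra weight on top of the uniform
     *  weight; for local traffic, weight[i] is the CPD value for the hop
     *  distance between the source and node i. */
    int weightedDest(int source, double[] weight) {
        double total = 0;
        for (int i = 0; i < weight.length; i++)
            if (i != source) total += weight[i];
        double r = rng.nextDouble() * total;
        for (int i = 0; i < weight.length; i++) {
            if (i == source) continue;
            r -= weight[i];
            if (r <= 0) return i;
        }
        return weight.length - 1; // numerical fallback
    }
}
```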

VI. RESULTS

First, the resource usage of the physical cluster is reported. (The resource usage is calculated using the Vivado 2019.1 synthesis tool with the Virtex UltraScale xcvu190-flgb2104 FPGA.) From this, a first estimate of the maximal size of the physical cluster can be made. Secondly, the memory usage of the network is described. It depends on the used cluster and the network size. This puts another limitation on the cluster size, as well as on the maximum size of the network. Thirdly, the results from the first two sections are used to determine an optimal clustering, as a function of several parameters. Lastly, the simulation results from the proposed model are verified against a reference simulator, BookSim [4].

A. Resource usage of the physical cluster

The resource usage of the physical cluster is estimated based on the resources needed for the implementation of one node (a router and IP core). The resources used to implement them are LookUp Tables (LUTs) and Flip Flops (FFs). However, some resources need to be reserved for the top controlling level. It is assumed that 90% of the total LUTs and FFs can be used for implementing the physical cluster.

First, a baseline configuration is considered. The used parameters are listed in Table 1. The resource usage for this model is listed in Table 2. The largest part of the resource usage is determined by the router (approximately 95%). Based on these results, 42 nodes could be implemented in the physical cluster for this configuration.

Table 1: Parameters used for synthesis of baseline configuration.

Parameter            Value
Number of VCs        4
Buffer size          4
Source queue size    4
Routing algorithm    DOR (XYZ)
Traffic pattern      Uniform

Table 2: Resource usage for the baseline configuration.

Module    LUT (No.)   LUT (%)   FF (No.)   FF (%)
Router    21,800      2.03      10,386     0.48
IP core    1,092      0.10         373     0.02
Total     22,892      2.13      10,759     0.50

Several parameters have an influence on the resource usage. Their impact will be discussed in the following sections.

Impact of the number of VCs

Increasing the number of VCs has a large impact on the resource usage, as indicated in Table 3. The maximum number of nodes that can be implemented reduces from 42 to 13 if the number of VCs is increased to 8.

Table 3: Resource usage as a function of number of VCs.

Number of VCs   LUT (No.)   LUT (%)   FF (No.)   FF (%)   Max. nodes
2                8,044      0.75       5,507     0.26     120
4               22,892      2.13      10,759     0.50      42
6               68,176      6.35      17,374     0.81      14
8               70,386      6.55      23,743     1.11      13

Impact of the buffer size

The buffer size has a smaller impact on the amount of LUTs and FFs used. For a buffer of size 2, 48 nodes could be implemented, while for a buffer of size 8, only 36 nodes could be implemented (see Table 4).


Table 4: Resource usage as a function of buffer size.

Buffer size   LUT (No.)   LUT (%)   FF (No.)   FF (%)   Max. nodes
2             20,002      1.86       8,455     0.39     48
4             22,892      2.13      10,759     0.50     42
6             24,786      2.31      12,999     0.61     39
8             26,636      2.48      15,304     0.71     36

Impact of the routing algorithm

Changing the routing algorithm impacts the used resources as well. Not only does an adaptive routing algorithm result in a larger routing unit, also additional resources are needed to determine the ’stress level’ of the nodes. The resource usage is reported in Table 5.

Table 5: Resource usage as a function of the routing algorithm.

Routing alg.   LUT (No.)   LUT (%)   FF (No.)   FF (%)   Max. nodes
XYZ            22,892      2.13      10,759     0.50     42
MAR            24,963      2.32      10,587     0.49     38

Impact of the traffic pattern

To use a different traffic pattern (hotspot or local traffic), more resources are needed. Currently, the method is implemented by using an array in memory (BRAM) to differentiate between hotspot and non-hotspot nodes, or to store the CPD in the case of local traffic. This is not a very efficient approach, as the number of BRAMs is limited. The number of required BRAMs scales rapidly with the network size, so only small networks can be implemented in this way. Due to limited time, this has not been explored further.

If we restrict ourselves to smaller networks (radix smaller than 7), we can compare the resource usage between uniform and hotspot or local traffic (Table 6). In the current implementation, hotspot and local traffic require an equal amount of resources.

Table 6: Resource usage as a function of the traffic pattern.

Traffic pattern   LUT (No.)   LUT (%)   FF (No.)   FF (%)   BRAM (%)   Max. nodes
Uniform           22,493      2.09      10,578     0.49     0          42
Hotspot           25,634      2.39      10,655     0.50     2.65       34

Conclusion

The number of VCs has the largest impact on the maximal number of nodes that can be implemented in one cluster. When 8 VCs are used, only 13 nodes can be implemented. For most other parameters, the maximum number of nodes that can be implemented varies around 35 nodes.

B. Memory usage

The memory usage depends on the size of the network and the size of the clusters. Memory is mapped to BRAM. One BRAM can store up to 36 Kbit of data, and its depth can be configured to any power of 2 between 2^9 and 2^15. Because of this, the memory usage as a function of the number of clusters follows a staircase progression, as indicated in Figure 3. Adding clusters may initially only require remapping the BRAM to a larger depth. If this is not possible (due to the limited size of one BRAM), more BRAMs are needed, and a steep increase in the memory usage is seen. A simplified cost model reproducing this staircase is sketched after Figure 3.

Figure 3: Memory usage as a function of the number of clusters for a 1x1x2 cluster.
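The sketch below assumes the usual 36-Kbit BRAM configurations (512x72 down to 32Kx1); the exact modes of the targeted device may differ, so this is an illustrative estimate, not vendor data.

```java
/** Simplified staircase model for mapping a memory of `entries` words
 *  of `width` bits onto 36-Kbit BRAMs with power-of-2 depths. */
final class BramModel {
    static int bramsNeeded(int entries, int width) {
        int depth = 512;                        // minimum depth 2^9
        while (depth < entries) depth *= 2;     // round up to a power of 2
        if (depth > 32768) {                    // deeper than 2^15: stack BRAMs
            int stacks = (entries + 32767) / 32768;
            return stacks * bramsNeeded(32768, width);
        }
        int widthPerBram;
        switch (depth) {                        // assumed width per BRAM per depth
            case 512:   widthPerBram = 72; break;
            case 1024:  widthPerBram = 36; break;
            case 2048:  widthPerBram = 18; break;
            case 4096:  widthPerBram = 9;  break;
            case 8192:  widthPerBram = 4;  break;
            case 16384: widthPerBram = 2;  break;
            default:    widthPerBram = 1;  break; // depth 32768
        }
        return (width + widthPerBram - 1) / widthPerBram; // ceil division
    }
}
```

Each time the number of entries (here, the number of clusters) crosses a power of 2, the usable width per BRAM halves and the BRAM count roughly doubles, which produces the steps in Figure 3.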

Cluster configuration

The memory usage does not only depend on the number of clusters, but also on the configuration of the cluster itself (i.e. the size of the cluster in each dimension). For instance, a cluster with 12 nodes can be implemented as a 1x1x12, 1x2x6, 1x3x4 or a 2x2x3 cluster. The memory usage to implement the State Memory will be the same for each case, as the number of nodes is the same. However, the Out and In buffers will not be of the same size, since a more compact cluster has less outgoing data. This can be seen by comparing Figures 4 and 5.

Figure 4: 1x1x12 cluster.

Figure 5: 2x2x3 cluster.

The memory usage of a cluster of dimensions Z × Y × X is given by:

BRAMs = 4 × (XY + XZ + YZ)     (1)
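As a quick check of Eq. (1), the two 12-node clusters of Figures 4 and 5 give:

1x1x12 (Z = 1, Y = 1, X = 12):  BRAMs = 4 × (12·1 + 12·1 + 1·1) = 100
2x2x3  (Z = 2, Y = 2, X = 3):   BRAMs = 4 × (3·2 + 3·2 + 2·2)  = 64

confirming that the compact 2x2x3 shape needs considerably less buffer memory than the elongated 1x1x12 shape.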

Cluster size

Naturally, the memory usage also depends on the number of nodes per cluster. The memory needed for the State Memory increases linearly with the number of nodes per cluster, as can be seen in Figure 6. Increasing the number of nodes by a factor F increases the BRAM usage by the same factor F. This relationship can be used to determine the maximal number of nodes per cluster. In theory, only one measurement point is needed to determine this.

For the baseline configuration parameters from Table 1, a maximum of 22 nodes could be implemented in one cluster. This is significantly less than the limitation imposed by the physical cluster (42 nodes).
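Concretely, if a measured point shows that a cluster of n_0 nodes needs B_0 BRAMs of State Memory, the linear relationship gives

max. nodes per cluster ≈ floor(n_0 · B_avail / B_0),

where B_avail is the part of the BRAM budget reserved for the State Memory. (How that budget is split between State Memory and In/Out buffers is an assumption of this illustration, not a number from the synthesis results.)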


Figure 6: BRAM usage as a function of the number of nodes per cluster.

Impact of VCs

Increasing the number of VCs influences the memory usage. As seen in Figure 7, the relationship is linear. Doubling the number of VCs results in an increase in BRAM usage by a factor smaller than 2.

Figure 7: BRAM usage as a function of the number of VCs.

The maximal number of nodes per cluster, for each number of VCs, is presented in Table 7. The number of VCs has a large influence on the memory usage: with 8 VCs, only 12 nodes can be implemented in one cluster. This is similar to the limitation imposed by the physical cluster (13 nodes).

Table 7: Maximum number of nodes per cluster as a function of the number of VCs.

Number of VCs   Max. nodes
2               42
4               22
6               15
8               12

Impact of buffer size

Increasing the buffer size results in a larger memory usage. The impact is not as large as for increasing the number of VCs, since only the input buffers are affected. As Figure 8 indicates, the memory usage increases linearly with the buffer size. Table 8 shows the maximum number of nodes per cluster for several buffer sizes. The impact on the memory usage is higher than it was for the physical cluster: only 16 nodes can be implemented for a buffer of size 8 (compared to 36).

Figure 8: BRAM usage as a function of the buffer size.

Table 8: Maximum number of nodes per cluster as a function of the buffer size.

Buffer size   Max. nodes
2             27
4             22
6             16

Conclusion

Both the number of VCs and the buffer size have a significant impact on the memory usage. The limitations on the memory usage are stricter than the limitations imposed by the physical cluster. For the baseline model 22 nodes can be implemented. If more VCs are used or the buffer is larger, this is reduced to 12 or 16 nodes per cluster.

C. Clustering possibilities

Based on the results of the previous sections, several cluster configurations can be considered. A good clustering results in as few clusters as possible, to increase the simulation speed, while still meeting the restrictions on the resource and memory usage.

Fitting clusters into the network

In order to always make optimal use of the physically implemented nodes, it is beneficial that the network size is a multiple of the cluster size. This ensures that all routers will be used during simulation. Figure 9 gives an example of a 4x4x4 network. If it is clustered using a 2x2x2 cluster (a), the clusters fit nicely in the network (8 clusters needed in total). However, when a 1x3x3 cluster is used (b), some logical clusters contain only a few nodes that are actually part of the network. Even though more nodes are used per cluster (9 instead of 8), more clusters are needed to simulate the complete network (16 instead of 8).

Optimal cluster

Depending on the allowed number of nodes per cluster and the network size, an optimal cluster can be determined. This optimal cluster should minimize the number of clusters used. This will not always be the largest cluster, as indicated in the previous section. The optimal clusters for several network sizes are given in Table 9, for the baseline model (where 22 nodes can be used) and the model with 2 VCs (where 42 nodes can be used). These results show that the simulation time can be halved by having double the number of nodes per cluster. A brute-force search over the possible cluster shapes, as sketched below, suffices to find such optima.
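The Java sketch below (illustrative, not the thesis code) enumerates all cluster shapes within the node budget and, under the assumption that partially filled edge clusters are allowed, counts the clusters needed with a ceiling division per dimension. For a 5x5x5 network with at most 22 nodes it finds a 2x2x5 shape needing 9 clusters — the 5x2x2 optimum of Table 9, up to orientation.

```java
/** Brute-force search for the cluster shape that minimises the number
 *  of logical clusters for an nz x ny x nx mesh. */
final class ClusterSearch {
    static int clustersNeeded(int nz, int ny, int nx, int cz, int cy, int cx) {
        return ceilDiv(nz, cz) * ceilDiv(ny, cy) * ceilDiv(nx, cx);
    }

    /** Returns {cz, cy, cx} minimising the cluster count, subject to
     *  cz*cy*cx <= maxNodes (the resource/memory limit derived above). */
    static int[] optimalCluster(int nz, int ny, int nx, int maxNodes) {
        int[] best = null;
        int bestCount = Integer.MAX_VALUE;
        for (int cz = 1; cz <= nz; cz++)
            for (int cy = 1; cy <= ny; cy++)
                for (int cx = 1; cx <= nx; cx++) {
                    if (cz * cy * cx > maxNodes) continue; // resource limit
                    int count = clustersNeeded(nz, ny, nx, cz, cy, cx);
                    if (count < bestCount) {
                        bestCount = count;
                        best = new int[] { cz, cy, cx };
                    }
                }
        return best;
    }

    private static int ceilDiv(int a, int b) { return (a + b - 1) / b; }
}
```

Note that this only minimises the cluster count; the secondary effect of the cluster shape on the In/Out buffer memory (Eq. (1)) is ignored here.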


Figure 9: Clustering of a 4x4x4 mesh by (a) 2x2x2 cluster and (b) 1x3x3 cluster. For clarity of the figure, not all vertical connections and clusters are shown.

Table 9: Optimal clusters and number of clusters needed, for different configurations and network sizes.

Network size   Opt. cluster (Baseline)   Clusters needed (Baseline)   Opt. cluster (2 VCs)   Clusters needed (2 VCs)
5x5x5          5x2x2                     9                            5x5x1                  5
10x10x10       10x2x1                    50                           10x2x2                 25
15x15x15       5x4x1                     180                          8x5x1                  90
20x20x20       20x1x1                    400                          20x2x1                 200
22x22x22       21x1x1                    441                          21x2x1                 231

D. Model verification

In this section, the accuracy of the proposed model is verified. The baseline model is compared with the same model simulated in BookSim.

For other parameters, the trends are observed and compared with literature.

Baseline configuration

The parameters of the baseline configuration are listed in Table 1. To verify the accuracy, the baseline configuration is simulated for several network sizes, and the results are compared with BookSim. The results of one such simulation (for a 7x7x7 network) are shown in Figure 10. The results of the proposed model are almost the same as those of BookSim.

Impact of VCs

The impact of the number of VCs is illustrated in Figure 11. As expected, more VCs will result in better performance. Flits are less likely to get blocked and there will be less contention in the network. At low traffic loads the latency will be approximately the same, as the contention is low. At higher traffic loads, there is a large difference.

Figure 10: Latency vs. traffic load curve for a 7x7x7 network.

Figure 11: Latency vs. traffic load, for several numbers of VCs.

Impact of buffer and packet size

Buffer and packet size have a strong impact on the performance of the network. A packet that contains more flits will take longer to traverse the network. First of all, since there are more flits, the last flit of a large packet will be generated later than the last flit of a shorter packet. Even if there is no contention in the network, the packet latency will be larger. However, the flit latency might be the same in both cases. The average packet latency for different packet sizes is shown in Figure 12. The buffer size is set to 8.

Figure 12: Latency vs. traffic load as a function of packet size. The buffer size is set to 8 flits for all cases.
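The contention-free part of this behaviour can be made explicit with the standard zero-load latency decomposition from [10]. Assuming H hops, a router delay of t_r cycles per hop and one flit transferred per channel per cycle, a packet of P flits arrives after approximately

T_0 ≈ H · t_r + (P − 1)

cycles, so the serialization term (P − 1) grows with the packet size, while the latency of the head flit alone does not depend on P.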

There is also a dependency between the buffer and the packet size. If the buffer is larger than one packet, then a complete packet can be sent to a router without having to wait for a buffer slot to become available. This is not the case when the buffer is smaller than the packet: even if the flits at the downstream router are immediately processed (as is often the case for low traffic loads), there will still be some delay because of the round-trip time needed for the credits.

This is shown in Figure 13. In this figure, three packet sizes and two buffer sizes are tested. It can be seen that the packet size has a larger impact on the packet latency than the buffer size. At low traffic loads, a larger buffer will result in a lower latency (due to credit round-trip time). At larger traffic loads, the larger buffer will postpone the saturation point.

Figure 13: Latency vs traffic load as a function of packet size (P) and buffer size (B).

Traffic pattern

Figure 14 shows the latency curves for uniform, hotspot and local traffic for a 7x7x7 network. The hotspot node is placed in the middle of the network and receives 10x more traffic than the other nodes. At low traffic loads, the average packet latency is lower for local traffic, because the packets travel shorter distances in the network. Hotspot and uniform traffic have approximately the same packet latency: since there is not a lot of contention, the region around the hotspot node is not too congested yet, and the latency is comparable to that of the uniform traffic pattern (the average hop count is approximately the same because the hotspot node is in the centre of the network).

Figure 14: Latency vs. traffic load as a function of the traffic pattern.

At higher traffic loads, the network with hotspot traffic will saturate first. The majority of the traffic is sent towards the hotspot node, so as soon as packets start blocking in the area around the hotspot node, almost all traffic in the network is delayed. The network with local traffic saturates much later. Since packets on average travel shorter distances than in uniform and hotspot traffic, the chance of being blocked by another packet becomes smaller too. In other words, packets do not stay in the network long enough to sufficiently disturb other packets.

To verify the distribution of received packets for each traffic pattern, a heat map is made which shows the number of received packets at each node (Figures 15-17). The uniform and hotspot traffic patterns are easily recognized in the heat maps. The local traffic pattern requires more explanation. Nodes at the centre have more neighbours than nodes at the edge of the network. Since traffic is preferably sent to neighbouring nodes, the central nodes have a higher probability of receiving packets.

Figure 15: Heat map of the number of received packets for the uniform traffic pattern in a 7x7x7 network.

Figure 16: Heat map of the number of received packets for the hotspot traffic pattern in a 7x7x7 network.

Figure 17: Heat map of the number of received packets for the local traffic pattern in a 7x7x7 network.

Routing algorithm

Figure 18 shows the latency vs. traffic load curves for DOR and MAR routing, for several traffic patterns. At low traffic loads, the packet latency is the same, since there is not much contention and the routing algorithm is not that critical. The average packet latency differs at higher traffic loads. We see that the impact of the adaptive routing algorithm is minimal, and it actually performs worse than the XYZ routing for these traffic patterns. This is in correspondence with results in the literature [10]. The reason for this is that the implemented MAR algorithm only checks the traffic load at the channels connected to the node itself. This approach will not always lead to an optimal choice. Also, the implemented version of MAR uses Two-Block Partitioning, which is a simple, but not the most efficient, adaptive routing algorithm.

Figure 18: Latency vs traffic load, as a function of the routing algorithm and traffic pattern.

VIII. CONCLUSION

A model is proposed for the simulation of 3D NoCs using FPGA. Using a VHDL implementation of several modules, the resource usage of a physical cluster and the memory usage are estimated. Both depend on several parameters such as the number of VCs, buffer size, routing algorithm, etc. Increasing the number of nodes per cluster naturally increases the resource usage. Under certain assumptions, the maximum number of nodes per cluster was determined for several configurations. For the baseline model, 22 nodes can be implemented in one cluster, which can be used to simulate a network of maximally 10,648 nodes.

The resource estimations give rise to several clustering possibilities. The optimal clustering results in fewer clusters and a faster simulation time. The optimal cluster size is determined for several configurations. This information can be used in the software model to simulate the design as efficiently as possible.

The presented software model replicates the FPGA behaviour and demonstrates results comparable to the reference simulator BookSim for the baseline model. The model allows several parameters to be changed, such as the number of VCs, buffer size, routing algorithm and traffic pattern. The impact of each parameter on both the network performance and the simulator resources is reported. These results can be used to speed up implementation in future work: based on the parameters used, a cluster configuration can be chosen and implemented, which can save time in the design process. Using and analyzing the created software model might also speed up the implementation of the top controlling level on the FPGA, which has not been implemented in this thesis.

REFERENCES

[1] L. P. Carloni, P. Pande, and Y. Xie, "Networks-on-chip in emerging interconnect paradigms: Advantages and challenges," in 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip, pp. 93–102, IEEE, 2009.

[2] C.-H. Chao, K.-Y. Jheng, H.-Y. Wang, J.-C. Wu, and A.-Y. Wu, "Traffic- and thermal-aware run-time thermal management scheme for 3D NoC systems," in 2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip, pp. 223–230, IEEE, 2010.

[3] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.

[4] N. Jiang, D. U. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D. E. Shaw, J. Kim, and W. J. Dally, "A detailed and flexible cycle-accurate network-on-chip simulator," in 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 86–96, IEEE, 2013.

[5] T. V. Chu, S. Sato, and K. Kise, "Fast and cycle-accurate emulation of large-scale networks-on-chip using a single FPGA," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 10, no. 4, p. 27, 2017.

[6] M. K. Papamichael, J. C. Hoe, and O. Mutlu, "FIST: A fast, lightweight, FPGA-friendly packet latency estimator for NoC modeling in full-system simulations," in Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip, pp. 137–144, ACM, 2011.

[7] D. Wang, C. Lo, J. Vasiljevic, N. E. Jerger, and J. G. Steffan, "DART: A programmable architecture for NoC simulation on FPGAs," IEEE Transactions on Computers, vol. 63, no. 3, pp. 664–678, 2012.

[8] H. M. Kamali and S. Hessabi, "AdapNoC: A fast and flexible FPGA-based NoC simulator," in 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–8, 2016.

[9] Y. E. Krasteva, F. Criado, E. de la Torre, and T. Riesgo, "A fast emulation-based NoC prototyping framework," in 2008 International Conference on Reconfigurable Computing and FPGAs, pp. 211–216, IEEE, 2008.

[10] W. J. Dally and B. P. Towles, Principles and Practices of Interconnection Networks. Elsevier, 2004.

[11] N. McKeown, "The iSLIP scheduling algorithm for input-queued switches," IEEE/ACM Transactions on Networking, vol. 7, no. 2, pp. 188–201, 1999.

[12] W. J. Dally and C. L. Seitz, "Deadlock-free message routing in multiprocessor interconnection networks," 1988.

[13] M. Ebrahimi, M. Daneshtalab, P. Liljeberg, J. Plosila, J. Flich, and H. Tenhunen, "Path-based partitioning methods for 3D networks-on-chip with minimal adaptive routing," IEEE Transactions on Computers, vol. 63, no. 3, pp. 718–733, 2012.

[14] G.-M. Chiu, "The odd-even turn model for adaptive routing," IEEE Transactions on Parallel and Distributed Systems, vol. 11, no. 7, pp. 729–738, 2000.

[15] R. V. Boppana and S. Chalasani, "A comparison of adaptive wormhole routing algorithms," in Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 351–360, 1993.

[16] G. B. Bezerra, S. Forrest, M. Moses, A. Davis, and P. Zarkesh-Ha, "Modeling NoC traffic locality and energy consumption with Rent's communication probability distribution," in Proceedings of the 12th ACM/IEEE International Workshop on System Level Interconnect Prediction, pp. 3–8, 2010.


Contents

Preface
Abstract
Extended Abstract
List of Figures and Tables
List of Abbreviations and Symbols

1 Introduction

2 Network on Chip Basics
  2.1 NoC architecture
    2.1.1 Topology
    2.1.2 Routing algorithm
    2.1.3 Flow Control
  2.2 Router Architecture
    2.2.1 5-stage Router
    2.2.2 Ports & Virtual Channels
    2.2.3 Routing unit
    2.2.4 Switch
    2.2.5 Allocators
  2.3 3D NoC
    2.3.1 Network architecture
    2.3.2 Router architecture
    2.3.3 (Dis)advantages

3 Simulation of Network on Chip
  3.1 Simulation parameters
    3.1.1 Network and router architecture
    3.1.2 Traffic pattern
  3.2 Performance evaluation
    3.2.1 Evaluation criteria
    3.2.2 Reliable measurement
  3.3 Software simulators
  3.4 Simulation with FPGA
    3.4.1 Traffic management
    3.4.2 Related work
    3.4.3 Network stalling
  3.5 Clustering method for 2D networks

4 Simulation model
  4.1 Objective and relevance of the model
  4.2 Clustering method for 3D mesh
  4.3 Implementation details
    4.3.1 Router architecture
    4.3.2 Routing algorithm
    4.3.3 Traffic Pattern
    4.3.4 Allocators and arbiters
    4.3.5 Crossbar
    4.3.6 Topology
    4.3.7 Flow control Mechanism
    4.3.8 Cluster
  4.4 Code

5 Results
  5.1 Resource usage
    5.1.1 Estimations
    5.1.2 Baseline
    5.1.3 Impact of VCs
    5.1.4 Impact of buffer size
    5.1.5 Impact of source queue size
    5.1.6 Impact of routing algorithm
    5.1.7 Impact of Traffic Pattern
  5.2 Memory usage
    5.2.1 Block RAM
    5.2.2 Cluster configuration
    5.2.3 Cluster size
    5.2.4 Impact of VCs
    5.2.5 Impact of Buffer size
    5.2.6 Impact of SQ size
  5.3 Clustering possibilities
    5.3.1 Fitting clusters into network
    5.3.2 Optimal cluster
  5.4 Model verification
    5.4.1 Baseline
    5.4.2 Impact of parameters
    5.4.3 Network stalling

6 Conclusion

References


List of Figures and Tables

This page contains a list of figures and tables used in the thesis.

Figures

1 Communication structures used in SoC [1].
2 A 4x4 2D NoC (mesh), each node consists of an IP core, NI and router or switch (S).
3 Two direct topologies with 16 nodes: (a) mesh and (b) torus. Each circle represents a node (combination of IP core, NI and router).
4 Butterfly topology with 16 terminal nodes. Terminal nodes with the same number are literally the same nodes.
5 Tree topology with 16 terminal nodes.
6 Oblivious routing from source node (S) to destination node (D), in a 5 × 5 mesh. DOR (a) routes along a minimal path, while VRRA (b) first sends the packet to an intermediate node (I).
7 Minimal adaptive routing example. Two packets are being routed, one from node A to node F, another from node G to node I. In case (a) there is not a lot of congestion, while in case (b), some channels are more congested than others.
8 Breakdown of a message into packets and flits.
9 Virtual channels allow sharing of a physical channel [2].
10 Timeline of flow control between two nodes [3].
11 Schematic of input queued router architecture [3].
12 Packet traversing through pipelined routers [3].
13 (Output first) Separable Allocator.
14 Mesh structures with 64 nodes: (a) 2D mesh (8x8) and (b) 3D Symmetric Mesh (4x4x4).
15 3D NoC topologies with clustered routers: (a) CIT and (b) CMIT [4].
16 Visualisation of traffic patterns: (a) Uniform, (b) Hotspot and (c) Local (Rentian). Traffic is only generated at one node S. The colours indicate the number of received packets at each node.
17 Latency vs. offered traffic for a 4 × 4 mesh under uniform traffic with DOR.
18 Components of the TG [5].
19 Timeline of the network and a packet source: the enqueue process describes how the packet source generates and injects packet descriptors into the source queue, while the dequeue process describes how packet descriptors are ejected from the source queue by the network; the state of the source queue is determined by both the network's time and the packet source's time [5].
20 Clustering of 6x6 mesh by using 2x3 clusters.
21 High-level datapath for the 2D clustering method.
22 Simulation procedure.
24 BRAM usage as a function of the number of clusters for a 1x1x2 cluster.
25 1x1x12 cluster, inter-cluster channels are indicated with red dotted lines.
26 2x2x3 cluster, inter-cluster channels are indicated with red dotted lines.
27 BRAM usage as a function of the number of nodes per cluster.
28 BRAM usage as a function of the number of clusters, for several numbers of VCs. A 2x2x2 cluster is used.
29 BRAM usage as a function of the number of VCs. A 2x2x2 cluster is used. The depth is smaller than 2^9 = 512.
30 BRAM usage as a function of the number of nodes per cluster, for several numbers of VCs.
31 BRAM usage as a function of the number of clusters, for several buffer sizes (BS). A 2x2x2 cluster is used.
32 BRAM usage as a function of the number of clusters, for several buffer sizes. A 2x2x2 cluster is used.
33 BRAM usage as a function of the number of nodes per cluster, for several buffer sizes (BS).
34 Clustering of a 4x4x4 mesh by (a) 2x2x2 cluster and (b) 1x3x3 cluster. For clarity of the figure, not all vertical connections and clusters are shown.
35 Latency vs traffic load for several radices, compared with BookSim.
36 Distribution of the number of received packets for a 7x7x7 network under uniform traffic.
37 Latency vs traffic load as a function of the number of virtual channels for a 7x7x7 network.
38 Latency vs traffic load as a function of packet size for a 7x7x7 network. The buffer size is set to 8 flits for all cases.
39 Latency vs traffic load as a function of packet size (P) and buffer size (B) for a 7x7x7 network.
40 Packet latency vs traffic load for different traffic patterns and XYZ routing (7x7x7 mesh).
41 Number of received packets at every node, for different traffic patterns.
42 Number of received packets (sent by the centre node) at all nodes, for local traffic.
43 Packet latency vs traffic load for different routing algorithms and traffic patterns.
44 Stalling factor α and average packet latency as a function of traffic load.

Tables

1 CPD as a function of the number of hops (i.e. Manhattan distance) for p = 0.75 and a 10 × 10 mesh.
2 Comparison of FPGA-based NoC Simulators [5].
3 Parameters used for synthesis of baseline node.
5 Resource usage for one node, as a function of the number of VCs (DSPs not shown).
6 Resource usage for one node, as a function of the buffer size (DSPs not shown).
7 Resource usage for one node, as a function of the SQ size (DSPs not shown).
8 Resource usage for one node, for XYZ and MAR routing (DSPs not shown).
9 Resource usage of one node for different traffic patterns. Resources are tested for a 6x6x6 network.
10 BRAM usage for several cluster configurations, for a 12x12x12 network. Usage is split in memory needed for the nodes and memory for the inter-cluster data.
11 BRAM usage for the maximal number of nodes per cluster, for several numbers of VCs.
12 BRAM usage for different numbers of VCs, when the number of clusters is smaller than 512.
13 BRAM usage for the maximal number of nodes per cluster, for several buffer sizes.
14 BRAM usage for different SQ sizes, when the number of clusters is smaller than 512.
15 Optimal clusters and number of clusters needed for different configurations and network sizes.
16 Average hop count comparison between BookSim and own model.


List of Abbreviations and Symbols

Abbreviation Meaning

2D Two-dimensional

3D Three-dimensional

BRAM Block Random Access Memory

CIT Concentrated Inter-layer Topology

CMIT Clustered Mesh Inter-layer Topology

CPD Communication Probability Distribution

DOR Dimension-Order-Routing

FF Flip Flop

Flit Flow control digit

FPGA Field Programmable Gate Array

IP Intellectual Property

LT Link Traversal

LUT LookUp Table

MAR Minimal Adaptive Routing

NI Network Interface

NoC Network-on-Chip

P2P Point-to-Point

PE Processing Element

RC Route Computation

SA Switch Allocation

SoC System-on-Chip

ST Switch Traversal

TDM Time-Division-Multiplexing

TG Traffic Generator

TR Traffic Receptor

TSV Through-Silicon-Via

VA Virtual channel Allocation

VC Virtual Channel

1 Introduction

Networks-on-Chip (NoCs) have emerged as an effective communication scheme for Systems-on-Chip (SoCs). A SoC consists of many processing elements, memory elements, etc. that are implemented on the same chip. NoCs use a network of routers to interconnect all components and allow efficient communication between them. Two-Dimensional (2D) NoCs already scale well to a large number of nodes. However, the performance of the network can be enhanced by moving to Three-Dimensional (3D) structures.

Before the actual implementation of the NoC, a design goes through several iterations to fine-tune certain parameters. Simulators are very important in these design steps, as they deliver accurate predictions about the performance of the network, such as the expected latency, throughput, etc. Several software simulators exist, for both 2D and 3D structures: full-system simulators such as gem5 [6], or standalone NoC simulators such as BookSim [7]. They provide very accurate results, while also being very flexible in their design. Parameters can easily be changed and extensions to the simulators can be made. However, they severely lack in simulation speed when designs become larger (hundreds to thousands of nodes).

Field Programmable Gate Array (FPGA) simulators can be used to obtain a higher simulation speed. Several 2D NoC emulators such as FNoC [5] have already been constructed. They can match the accuracy of software simulators, while enhancing the simulation speed significantly (up to 5000 times faster in some cases). Although they are less flexible, their simulation speed makes them very relevant for the simulation of large designs.

To simulate large designs on FPGA, a Time-Division-Multiplexing (TDM) approach is often used. The FPGA does not have enough resources to implement the complete network at once, so the network needs to be divided into several smaller groups called clusters. FNoC [5] has proposed such a clustering method, which efficiently uses its resources and is able to simulate large designs.

A lot of research efforts have been made to construct efficient 2D NoC FPGA simulators, but to the best of our knowledge, no 3D FPGA simulator has been proposed yet. In this thesis a model will be constructed to explore the possibilities and difficulties in 3D FPGA simulation. The 2D clustering method introduced in [5] will be extended to 3D networks.

The proposed model is not a complete 3D NoC simulator, but consists of a software implementation that replicates the behaviour on FPGA. Some sub-modules are implemented on FPGA to obtain credible estimations of the resource usage.


The thesis is structured as follows:

1. The NoC concepts are introduced and explained in the first part. Several parameters can influence the structure and performance of the architecture.

2. The second part will discuss the simulation of NoCs. Both software and FPGA simulators are explored.

3. In the third part the model for simulation of 3D NoCs is proposed.

4. The results of the simulation model are presented in the fourth part. Using the estimated resource and memory usage, clustering possibilities can be discussed. The model is also verified against a reference simulator.

5. Finally, the conclusion is presented in the last chapter.

Figure 1: Communication structures used in SoC [1].

2 Network on Chip Basics

In recent years, more and more components are implemented on the same chip. In such System-on-Chip (SoC) designs, a lot of nodes need to be able to communicate with each other. A digital system is composed of three basic building blocks: logic, memory and communication. In most digital systems, communication has become the bottleneck for performance: most of the power is used to drive wires and most of the clock cycle is spent on wire delay, not gate delay [3].

Having an efficient communication model is thus of utmost importance. In the first SoC designs, a combination of Point-to-Point (P2P) links and shared bus systems was used. In the P2P structure, every two nodes that communicate with each other need to be connected with a dedicated link (see Figure 1(a)). This results in good performance, but scales poorly in terms of complexity, cost and design effort. In large designs, many (long) wires would be required.

Shared-bus architectures (Figure 1(b)) reduce the large amount of long wires by sharing bandwidth between several nodes. However, they also lack scalability, both in terms of power and performance. Because multiple nodes are connected to the same bus, there will be a large capacitive load at the bus drivers, resulting in higher energy consumption and longer delays. Performance is also reduced due to the limited bandwidth capacity [1, 8].

Interconnection networks (Figure 1(c)) such as the Network-on-Chip (NoC) architecture emerged as an efficient solution for the communication problem. Instead of using busses or dedicated P2P links, the Processing Elements (PEs) are connected to a network that routes packets between them [9]. An interconnection network effectively separates the computation resources from each other and from the communication network. This leads to an architecture that is scalable to an arbitrarily large number of nodes [1, 10].


Figure 2: A 4x4 2D NoC (mesh), each node consists of an IP core, NI and router or switch (S).

This chapter introduces the basic NoC concepts and the parameters that influence its performance. The basic terminology is first described for two-dimensional (2D) networks. In a last section, the extension to three-dimensional (3D) networks is made.

2.1 NoC architecture

A NoC is composed of four fundamental building blocks: Intellectual Property (IP) cores, Network Interfaces (NIs), routers and communication channels [11]. The IP core is connected through a NI with the router. This combination will be called a node in the remainder of this thesis. The routers are connected with each other by several communication channels. An example of a 2D NoC is depicted in Figure 2.

2.1.1 Topology

The connection pattern of the routers and channels defines the topology of the network, which is usually modelled by a graph [11]. The topology is an important aspect of a NoC, because it has a large influence on the used routing strategy, the flow control mechanisms (section 2.1.3) and the architecture of the routers itself, such as the number of ports of the router. The topology also has a major impact on the hop count, i.e. the number of channels a packet has to traverse before arriving at the destination node. A clear analogy is given in [3]: the topology can be seen as a roadmap. The channels (like roads) carry packets (like cars) from one router node (intersection) to another. In terms of topology, NoCs can be broadly classified as direct and indirect [3, 12].

Figure 3: Two direct topologies with 16 nodes: (a) mesh and (b) torus. Each circle represents a node (combination of IP core, NI and router).

2.1.1.1 Direct topology

In a direct topology, every router is connected to an IP core. Each node in the network consists of a router and an IP core (with its corresponding NI). This means that, at each node, a packet can be either injected/ejected or forwarded to another node. The node is said to be both a terminal and a switching node [3]. Figure 3 shows two examples of a direct topology: the 2D mesh and the 2D torus. In a 2D mesh, all nodes are placed on a grid with dimensions n × m. Each node is connected to four neighbouring nodes by bidirectional channels (except for nodes at the edges). Its design is very simple and allows for straightforward scaling to larger networks [10].

The 2D torus keeps the simplicity of the mesh, while reducing the latency and hop count, by adding extra wraparound channels. All routers now have the same number of neighbours, as can be seen in Figure 3(b). Although the torus network reduces the hop count, the long wraparound connections may result in excessive delay [13].


Figure 4: Butterfly topology with 16 terminal nodes. Terminal nodes with the same number are literally the same nodes.

2.1.1.2 Indirect topology

In an indirect topology, not every router is connected to an IP core. In this case the routers are switching nodes, while the IP cores are the terminal nodes. Figures 4 and 5 show two examples of indirect topologies: the butterfly and the tree network [3].

Figure 5: Tree topology with 16 terminal nodes.

2.1.2 Routing algorithm

When a packet needs to be sent from a source node to a destination node, the routing determines which path to take. Routing can be performed in several ways, which strongly influences the performance of the network. A good routing algorithm ensures that the traffic load is balanced evenly over the network, while keeping the path length to a minimum [3]. Often a trade-off needs to be made between these requirements.

Routing algorithms can be classified according to several criteria. In this section, two criteria will be discussed: adaptivity (i.e. path diversity) and path length [1, 3, 14].

2.1.2.1 Minimal vs. Non-minimal Routing

In terms of path length, a routing algorithm can be classified as either minimal or non-minimal. A minimal routing algorithm always chooses the shortest path between source and destination. A non-minimal routing algorithm allows longer paths to be chosen as well [1, 3, 14].

2.1.2.2 Oblivious vs. Adaptive Routing

In adaptive routing, information from the network (traffic load, congestion, etc.) is used in order to better balance the traffic load. Oblivious routing makes its decisions based on some algorithm, without information from the network [15].

Deterministic algorithms are a subset of oblivious routing. In deterministic routing, the same path between source and destination is always chosen, regardless of the traffic load in the network. Dimension-Order Routing (DOR) is an example of an often implemented deterministic routing algorithm, which is also a minimal routing algorithm [16].
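A minimal Java sketch of DOR, in the 3D (XYZ) variant used later in this thesis, is given below; the port naming is an illustrative assumption. The offsets are resolved one dimension at a time, which is what makes the algorithm deterministic and minimal.

```java
/** Output ports of a 3D mesh router; naming is illustrative. */
enum Port { LOCAL, X_PLUS, X_MINUS, Y_PLUS, Y_MINUS, Z_PLUS, Z_MINUS }

final class DimensionOrderRouting {
    /** XYZ dimension-order routing: resolve X first, then Y, then Z. */
    static Port route(int x, int y, int z, int destX, int destY, int destZ) {
        if (x != destX) return x < destX ? Port.X_PLUS : Port.X_MINUS;
        if (y != destY) return y < destY ? Port.Y_PLUS : Port.Y_MINUS;
        if (z != destZ) return z < destZ ? Port.Z_PLUS : Port.Z_MINUS;
        return Port.LOCAL; // arrived: eject to the IP core
    }
}
```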

However, not every oblivious algorithm needs to be deterministic. An example of an oblivious, non-deterministic routing algorithm is Valiant's Randomized Routing Algorithm (VRRA) [17], in which the packet is first routed to a random intermediate node I, before being routed to the destination node D. This balances the load in the network more, but it clearly is not a minimal routing algorithm. Figure 6 shows how a packet would be routed in the case of both DOR and VRRA.

Figure 6: Oblivious routing from source node (S) to destination node (D), in a 5 × 5 mesh. DOR (a) routes along a minimal path, while VRRA (b) first sends the packet to an intermediate node (I).

Adaptive routing algorithms take the state of the network into account when deciding which route to take. It will actively try to avoid congested regions in the network. Because they have more information available, better choices can be made, resulting in a better balanced traffic load. [18].

Minimal adaptive routing will choose between several minimal paths be-tween source and destination. An example of minimal adaptive routing is given in Figure 7. Two packets are created at nodes A and G and sent to nodes F and I respectively. In case (a) there is not a lot of congestion and a minimal route is chosen for both packets. In case (b), there is a lot of congestion along the path A − B − C and G − H − I. The packet created at node A will be routed along a different minimal path, to avoid the congested channels. However, since there is only one minimal path between nodes G and I, the congestion on this path can not be avoided. A disadvantage of minimal adaptive routing is that the load becomes less balanced if there are few minimal paths available. A non-minimal routing algorithm can balance the load better, but will be more complex [3, 18].

2.1.3 Flow Control

Flow control determines how a network's resources, such as channel bandwidth and buffer capacity, are allocated to packets traversing the network [3]. A good flow control mechanism makes sure all resources are being used efficiently. It should prevent some packets from occupying all the resources and blocking other packets. Some of the most common flow control techniques are studied below.

2.1.3.1 Bufferless Flow Control

Bufferless flow control is the simplest method, but in most cases not very effective. As the name implies, the network does not have any buffers to temporarily store packets. When a packet arrives at a router and is not able to allocate the wanted resources (channel bandwidth, control state, ...), it cannot be stored in a buffer and thus needs to be either dropped or misrouted [19].

A bufferless flow control method that prevents dropping or misrouting of packets is circuit switching. Before sending a packet, all the resources from source to destination need to be reserved. Once all the resources are available and allocated, one or multiple packets will be sent. This method introduces overhead due to the path setup time. Moreover, other messages may be blocked because the links within the reserved path cannot be used to set up other paths. This method might be useful if a lot of large packets need to be sent occasionally [3, 20, 21].


2.1.3.2 Packet-Buffer Flow Control

A more efficient flow control method introduces buffers to prevent dropping or misrouting the packets. In Packet-Buffer Flow Control, both the channels and the buffers are allocated in units of packets. This means that one buffer slot contains a complete packet, which is transmitted as a whole [3, 20].

Packet-Buffer Flow Control is advantageous when the messages are small and frequent [20]. However, allocating resources in units of packets is often not the most efficient flow control mechanism. Buffers are not being used optimally and this method might cause problems when multiple packets try to use the same resources. For instance, if a long, low-priority packet is using a channel, a high-priority packet has to wait for the entire low-priority packet to be transmitted before it can access the channel [3, 20].

2.1.3.3 Flit-Buffer Flow Control

More efficient use of buffers and channels can be made when the packets are divided into smaller elements called FLow control digITs (flits) [2], as depicted in Figure 8. Three kinds of flits can be distinguished:

The header flit is the first flit of a packet. It contains the routing information and allocates the resources along the path. It is followed by body flits, which contain the actual data of the packet. The body flits always follow the header flit and stay in order. The last flit is the tail flit, which deallocates the resources when it leaves a node, so that they can be used by the next packet [3, 22].

Figure 8: Breakdown of a message into packets and flits.
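In a software model, a flit can be represented by a small record carrying its type and, for header flits, the routing information. The field names below are illustrative, not the simulator's actual data layout.

from dataclasses import dataclass
from enum import Enum

class FlitType(Enum):
    HEAD = 0   # carries routing info, allocates resources
    BODY = 1   # carries payload, follows the head flit in order
    TAIL = 2   # deallocates resources when leaving a node

@dataclass
class Flit:
    ftype: FlitType
    packet_id: int               # packet the flit belongs to
    dest: tuple = None           # (x, y, z), only needed by head flits
    payload: object = None       # data carried by body (and tail) flits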

In Flit-Buffer Flow Control, resources are allocated to flits, instead of allocating resources (buffers and channels) to a complete packet. As a result, a flit is the basic unit of bandwidth and storage allocation in this mechanism, called Wormhole Flow Control. As a packet traverses a network, the head flit allocates the channels and buffers, and the tail flit deallocates them [3, 20, 23].

Wormhole Flow Control can be made more efficient by using Virtual Channels (VCs). VCs are used to share a single physical link between several packets. Each VC is connected to its own flit buffer and keeps track of several state variables.

VCs decouple the allocation of buffers from the allocation of channels by providing multiple buffers for each channel in the network [2]. This can prevent one packet from blocking all other packets that want to use the same channel, as illustrated with a simple example given in [2]:

Figure 9(a) shows two packets A and B, both arriving at the same router (R1), and both requesting use of the channel c1. Packet A has been allocated access to the channel first and its flits start traversing the channel. Further along in the network, the packet becomes blocked and no more flits can be sent. However, channel c1 is still allocated to Packet A, which prevents Packet B from progressing through the network. This means that Packet B needs to wait for Packet A to be completely sent, even though the channel is not in use. Introducing a VC removes this dependency between the two packets (Figure 9(b)): Packets A and B both allocate one VC at the same output port of R1. When Packet A becomes blocked, Packet B can still use the channel and send its packet to the destination.

2.1.3.4 Buffer Management: Credits

When using a buffered flow control mechanism, an extra communication mechanism must be in place to prevent buffer overflows. Each downstream router needs some way of making the upstream routers aware of the number of available buffer slots. Several implementations exist, such as credit-based, on/off and ack/nack flow control [3]. Credit-based flow control will be discussed in more detail below.

In credit-based flow control, the upstream router keeps a counter of the available buffer slots at the downstream router. Every time a flit is sent to the downstream router, the counter is decreased. When the counter is equal to zero, the router stops sending flits. This effectively prevents buffer overflow.

When a flit leaves the buffer at the downstream router, a credit is sent to the upstream router. Once the credit is received at the upstream router, the counter is incremented and a new flit can be sent [3, 24]. A timeline of the flow control between two nodes is given in Figure 10.
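The upstream bookkeeping amounts to a single counter per output VC, initialised to the downstream buffer depth. A minimal sketch follows; the method names are hypothetical.

class CreditCounter:
    """Upstream view of one downstream VC buffer (credit-based flow control)."""

    def __init__(self, buffer_depth):
        self.credits = buffer_depth  # downstream buffer starts empty

    def can_send(self):
        return self.credits > 0     # a flit may only leave if a slot is free

    def on_flit_sent(self):
        self.credits -= 1           # the flit will occupy one downstream slot

    def on_credit_received(self):
        self.credits += 1           # a downstream slot was freed

The router checks can_send() before forwarding a flit, calls on_flit_sent() when it does, and calls on_credit_received() whenever a credit arrives from downstream.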


Figure 11: Schematic of input queued router architecture [3].

2.2 Router Architecture

In this thesis, the implemented router is an input-queued, 5-stage router. It only has buffers at the input ports, none at the output ports. Figure 11 gives a schematic overview of the router architecture. In this section the router components and the five stages will be explained in more detail.

2.2.1 5-stage Router

In the 5-stage router model, the router is pipelined at the flit level. The header flits need to progress through five different stages: Routing Computation (RC), Virtual Channel Allocation (VA), Switch Allocation (SA), Switch Traversal (ST) and Link Traversal (LT). Body and tail flits only need to progress through the last three stages. This is schematically presented in Figure 12.

In the first cycle, the header flit arrives at Router 1 and is sent to the RC stage. Here the information in the header flit is used to determine to which output port the packet needs to be forwarded. In the second cycle, the header flit goes to the VA stage, while the first body flit arrives at the router. The body flit does not need to do anything in the RC stage, so it just waits in the buffer. During the VA stage, a VC is requested at the output port determined in the RC stage.

If during the VA stage a VC has been granted to the packet, the header flit moves to the SA stage in the third cycle. The first and second body flits still do not have to do anything yet, and just wait in the buffer. In the SA stage, the flit tries to obtain access to the switch (see further), to pass from the input to the output port. If access is granted, the flit traverses the switch in the ST stage in the next cycle. The body and tail flits also need to go through the SA and ST stages.

Figure 12: Packet traversing through pipelined routers [3].

After the ST stage, the flit traverses a communication link between two routers in the LT stage. When arriving at the next router, the flits start the same process again, going through all five stages [3].

During the VA and SA stages, a flit requests access to a certain resource. This request does not always result in a grant. In that case, the request needs to be repeated in the next cycle and an extra cycle is added to the pipelined process.
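The per-flit stage sequence can be summarised as follows, reusing FlitType from the earlier sketch; a flit whose VA or SA request is denied simply stays in its current stage for another cycle.

from enum import Enum

class Stage(Enum):
    RC = 1   # Routing Computation
    VA = 2   # Virtual Channel Allocation
    SA = 3   # Switch Allocation
    ST = 4   # Switch Traversal
    LT = 5   # Link Traversal

def stages_for(ftype):
    """Head flits pass through all five stages; body and tail flits
    only go through SA, ST and LT."""
    if ftype is FlitType.HEAD:
        return [Stage.RC, Stage.VA, Stage.SA, Stage.ST, Stage.LT]
    return [Stage.SA, Stage.ST, Stage.LT]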

2.2.2 Ports & Virtual Channels

Every input or output port is connected to one physical channel. Every physical channel is split into multiple VCs. At each input port, each VC is connected to an input unit. This element contains a flit buffer, as well as state variables that indicate how resources are allocated and in which stage the current packet is. The following state variables are saved:

• Global state: indicates in which stage the packet currently is. This global state is related to the pipeline stages: RC, VA and ACTIVE are used. In the ACTIVE state, flits consecutively progress through the SA and ST stages.

• Allocated output port: the output port that has been selected by the routing algorithm for this packet.


Each input unit can process one packet at a time. The buffer of the input unit can contain two back-to-back packets, but the second packet needs to wait until the first packet is completely sent before it can be processed (even though the RC and VA stages are not used by the first packet anymore). This could be circumvented by introducing an extra pair of state variables, so that both packets can be processed in parallel (the second packet can start the RC and VA stages, but not SA yet).
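Gathering the above, the per-VC input unit state reduces to a flit buffer plus a handful of state fields. The field names below are illustrative; the allocated output VC field is implied by the VA stage rather than listed explicitly above.

from collections import deque
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class InputUnit:
    """State kept per input VC (field names are illustrative)."""
    buffer: deque = field(default_factory=deque)  # flit FIFO of this VC
    global_state: str = 'IDLE'      # 'RC', 'VA' or 'ACTIVE' while processing a packet
    out_port: Optional[int] = None  # output port chosen in the RC stage
    out_vc: Optional[int] = None    # output VC granted in the VA stage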

2.2.3 Routing unit

The routing unit determines the output port for every packet according to a certain routing algorithm. For an oblivious routing algorithm, the routing unit will only need the destination address of the flit to determine the output port. For adaptive routing, the network state variables are needed as well.

2.2.4 Switch

The switch or crossbar in the router is a configurable connection between input and output ports. The configuration of the crossbar is determined during the SA stage.

In a router with N input ports and N output ports, the crossbar is N × N dimensional. This means that there can only be one connection for every input or output port. However, the crossbar can also be implemented with an extra input or output speedup. A crossbar with input speedup S can connect S × N input ports to N output ports, or vice versa for output speedup. This can be used to improve the throughput in the case of multiple VCs and to simplify the allocator in the SA stage [3].
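A behavioural model of the crossbar only has to track which input drives each output. Below is a sketch with input speedup S; the interface is hypothetical.

class Crossbar:
    """(S*N)-input, N-output crossbar; at most one input per output."""

    def __init__(self, n_ports, input_speedup=1):
        self.n_in = n_ports * input_speedup
        self.n_out = n_ports
        self.config = {}  # output port -> input port, rebuilt each cycle by SA

    def connect(self, inp, outp):
        assert 0 <= inp < self.n_in and 0 <= outp < self.n_out
        assert outp not in self.config  # one connection per output port
        self.config[outp] = inp

    def traverse(self, flits_at_inputs):
        """Move flits through the switch: {input: flit} -> {output: flit}."""
        return {outp: flits_at_inputs[inp]
                for outp, inp in self.config.items()
                if inp in flits_at_inputs}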

2.2.5 Allocators

Several resources need to be allocated in the router. Packets try to allocate an output VC during the VA stage, while flits try to allocate a transition on the crossbar during the SA stage. Both stages are regulated by an allocation algorithm. The allocator tries to find an optimal match between the requests and the available resources. A good allocation scheme also needs to be fair: all requesters should eventually be able to obtain access to the resources.

The allocators have an important influence on the latency and buffer usage of the network, as they dictate how long flits/packets need to wait before passing through the node [25].

2.2.5.1 Arbitration

Allocators are made up of several arbiters. An arbiter assigns a single resource to one of a group of requesters. Several arbiter implementations exist, but for this thesis
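Round-robin arbiters are a common choice in NoC routers because they are cheap in hardware and fair; the sketch below is a generic example of that scheme, not necessarily the variant implemented in this thesis.

class RoundRobinArbiter:
    """Grants one of n requesters per cycle; the priority pointer
    rotates so that every requester is eventually served (fairness)."""

    def __init__(self, n):
        self.n = n
        self.last = n - 1  # index granted most recently

    def arbitrate(self, requests):
        """requests: list of n booleans. Returns the granted index or None."""
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None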
