
Bachelor Informatica

On the Performance and Productivity of FPGA HLS solutions: an SpMV Comparative Study

Rico van Stigt

11325844

June 17, 2019

Supervisor(s):

Informatica
Universiteit van Amsterdam


Abstract

Field-Programmable Gate Arrays (FPGAs) are reconfigurable hardware devices that enable the design and implementation of specialized hardware units. These units are more (energy-)efficient than general-purpose processing units, such as CPUs or GPUs, but designing them requires FPGA programming.

Programming FPGAs can be difficult for non-expert hardware programmers, as it was traditionally done using specialized languages called HDLs (Hardware Description Languages). A new, alternative way of programming FPGAs comes in the form of High-Level Synthesis (HLS), which allows users to create FPGA designs using higher-level programming languages, such as C or Java, combined with a handful of pragmas or other special constructs. In this thesis, we propose a case study to compare two HLS solutions, Xilinx Vivado HLS and Maxeler MaxJ, on their performance and usability. Specifically, we use Sparse Matrix-Vector multiplication (SpMV) as the target application, provide an in-depth analysis of the design and implementation of multiple SpMV versions, and empirically analyze the programming constructs and achieved performance of each implementation.

Through this empirical analysis, we found the programmability aspect for both Vivado HLS and MaxJ to be good: despite several quirks and specific workflow challenges, the HLS solutions do offer a superior experience to hand-programming an FPGA for non-expert hardware programmers. Performance-wise, the conclusions are more diverse: we found our Xilinx implementations to be very predictable in their execution time, while our initial Maxeler implementations show a lot of data-dependent fluctuation.


Contents

1 Introduction
  1.1 Research question
  1.2 Thesis structure

2 Background
  2.1 FPGAs
  2.2 Sparse Matrix Vector Multiplication (SpMV)
    2.2.1 Sparse Matrix data formats
    2.2.2 SpMV using the CSR matrix format
    2.2.3 SpMV using the COO matrix format

3 Design and implementation
  3.1 Xilinx Vivado HLS
    3.1.1 Xilinx HLS Ecosystem and Workflow
    3.1.2 Implementations
    3.1.3 The COO version
    3.1.4 Challenges
    3.1.5 Optimizations
  3.2 Maxeler MaxJ
    3.2.1 Maxeler Ecosystem and Workflow
    3.2.2 Implementations
    3.2.3 Challenges
    3.2.4 Optimizations

4 Experimental setup
  4.1 Platforms
    4.1.1 Xilinx
    4.1.2 Maxeler
    4.1.3 CPU and GPU comparison
  4.2 Input datasets
    4.2.1 Theoretical maximum performance
  4.3 Expectations
    4.3.1 Time to synthesize
    4.3.2 Execution time
    4.3.3 Power Consumption

5 Results and analysis
  5.1 Time to synthesize
  5.2 Execution time
    5.2.1 Dense-matrix implementations
    5.2.2 Optimizations
    5.2.3 Sparse-matrix implementations
    5.2.4 Optimizations
  5.3 Resource Utilization
    5.3.1 Optimizations
  5.4 CPU and GPU performance
  5.5 Power Consumption
    5.5.1 Software reported power consumption
    5.5.2 Measured power consumption

6 Programmability
  6.1 Xilinx
  6.2 Maxeler

7 Related work
  7.1 FPGA performance
  7.2 HLS performance

8 Conclusion
  8.1 Main Findings
  8.2 Future Research


CHAPTER 1

Introduction

In recent years, the growth in transistor density has declined and is even expected to stagnate, as advancements in process nodes are becoming more and more difficult. This means that, in order to increase available processing power, we can no longer rely on an ever-increasing number of transistors in our chips. Instead, we need to look into increasing the efficiency of our hardware.

For example, general purpose CPUs are built to solve virtually any task with good performance. As such, a lot of the CPU’s architectural functionality (e.g., branch prediction for non-branching code, caches for streaming applications, or floating-point units for integer-only applications) is integrated to support only a part of the total workloads running on the machine, and it is useless for the rest. Effectively, CPUs have a large architectural overhead because they have to support a large diversity of applications. This overhead decreases their efficiency. Another example is the GPGPU, which is more specialized for throughput-oriented applications. GPGPUs use less space on the chip for control logic and caching, leaving more space for computation elements. This means GPGPUs are very good at performing the same operation many times in parallel, but are less suitable for sequential workloads. GPGPUs are more efficient in terms of hardware and energy because they target a smaller range of applications, and can thus streamline the architecture by removing the unnecessary units.

Thus, we argue that a possible solution to increase processor efficiency is to build and use more specialized hardware, which contains only the absolutely necessary logic and thus has less overhead. This can be very beneficial in specific cases, such as antennas or digital signal processors, where the function is fixed for a long time. When the functionality has to be changed more frequently, a custom hardware solution may prove to be too costly or too time-consuming to develop. A solution that provides a good trade-off between flexibility and efficiency, and which lies in between a general purpose processor and specialized hardware, is the use of reconfigurable hardware. The commonly-used implementation of this class of architectures are the Field-Programmable Gate Arrays (FPGAs). FPGAs (aim to) combine the ease of development of a general purpose processor with the performance and efficiency of specialised hardware. The first FPGA was released in 1985 by Xilinx to compete with Application-Specific Integrated Circuits (ASICs). Since then FPGAs, like CPUs, have undergone a lot of development: the first FPGA contained 64 programmable logic cells, whereas modern offerings from vendors such as Xilinx and Intel can contain millions of these cells. The packaging has also changed tremendously: the first FPGA came in a 48-pin Integrated Circuit package, but today there are many forms of FPGAs. These include FPGAs integrated on the same chip as a CPU, but also PCI-e expansion cards. Other vendors, such as Maxeler, offer completely integrated solutions, including the FPGA, but also the machine it runs in as well as any software packages required to program the FPGA.


Unfortunately, FPGAs meet these targets only partially. The performance is often very good (when the application is suited for FPGA acceleration), but the ease of development is lacking. FPGAs generally require low-level programming models, based on circuit diagrams or hardware description languages such as VHDL. This makes it very difficult for non-expert hardware programmers to create correct and efficient programs. To remedy this problem, High-Level Synthesis (HLS) can be used to transform a program written in a high-level language (such as C, OpenCL or even Python) into a design which can be run on an FPGA. While this approach potentially reduces the difficulty for the programmer, it is often unknown how the performance of these HLS solutions compares to an expert’s hand-optimized solution.

1.1 Research question

The aim of this project is to study two HLS solutions, namely MaxJ (for Maxeler boards) and C in Vivado (for Xilinx boards). We aim to compare these solutions on both performance and ease-of-use in a detailed case-study, using a single application and analysing its design and implementation for both environments. Moreover, as a frame of reference, we also aim to compare the achieved FPGA performance against a reference implementation for CPUs and GPUs. To discover the benefits and possible pitfalls of these two HLS solutions, we selected the sparse matrix-vector multiplication kernel (SpMV) as benchmark. This use-case is chosen since it is well documented and also has various possible implementations. The differences between these implementations may be handled differently depending on the HLS solution used, possibly leading to drastic changes in performance. Since the implementation of SpMV is somewhat simple, it also lends itself well to porting the algorithm to various FPGAs.

All of these aspects culminate in a single overarching research question:

What are the advantages and challenges of developing FPGA computational kernels using HLS?

In order to substantiate the answer to this question, we focus on three specific subquestions:

1. What performance can be achieved by using FPGA SpMV kernels in HLS?

2. What are the advantages of these kernels compared to CPU or GPU implementations?

3. What are the development challenges when using HLS solutions for FPGAs?

1.2 Thesis structure

This thesis is structured as follows.

In Chapter 2 we introduce the basic concepts related to FPGAs and their programming. We also introduce the SpMV algorithm.

In Chapter 3 we discuss the workflow, design and implementation on both Xilinx and Maxeler.

In Chapter 4 we describe the experiments we will use to evaluate the performance of our various implementations.

In Chapter 5 we discuss the results of the experiments and how these compare to each other, as well as to our expectations. Thus, this chapter provides the answers to our performance-related research questions.

In Chapter 6 we discuss our own experience with the solutions offered by Xilinx and Maxeler. Thus, this chapter provides the answers to our programmability-related research questions.

In Chapter 7 we compare our work to existing related studies on the subject of FPGAs or HLS.


CHAPTER 2

Background

In this chapter we introduce FPGAs and their basic architecture and properties, as well as SpMV, the algorithm we use for our case-study.

2.1 FPGAs

FPGAs were first introduced in 1985 by Xilinx as an alternative to Application-Specific Integrated Circuits (ASICs) in the low-cost, high-performance market [16]. FPGAs offered a strong cost proposition compared to ASICs, because they had less up-front cost, a reduced risk of failure because they could be reprogrammed, and a relatively minor performance penalty. However, modern FPGAs’ competition has shifted from ASICs to GPUs due to the large increase in GPGPU applications.

Figure 2.1: FPGA Hardware Elements. (a) Logic Cell; (b) Routing Network.

FPGAs consist of multiple types of elements: these include logic blocks, hard blocks, and on-chip memory in the form of Block RAM [12, 3].

Logic blocks consist of various elements: Lookup Tables (LUT), a Full Adder (FA), a D-Flip-Flop (DFF) and Multiplexors (MUX) to select between outputs. In the example in Figure 2.1a, two 3-input LUTs are used, the outputs of these are fed into either the FA or into the leftmost MUX.


The middle MUX can be programmed to select either the output from the FA or from the leftmost MUX. When the leftmost MUX is used, the logic cell essentially works as a 4-input LUT. Finally, the output from the middle MUX is connected to the DFF and the rightmost MUX. The rightmost MUX can be programmed to use the value either directly, by selecting the output from the middle MUX, or on the next clock cycle, by selecting the output from the DFF. These features allow the logic cell to be reprogrammed by the programmer to behave as a desired logic gate. Hard blocks, on the other hand, implement specific fixed functions directly in silicon; these can for instance be used for multiplications, digital signal processing, external high-speed I/O or even embedded processor cores. Because these blocks are directly implemented in silicon, they can’t be reprogrammed by the programmer, but they do offer ASIC-level performance and power efficiency.

To connect the various available blocks, a routing network is used. A simplified schematic is shown in Figure 2.1b. The routing network consists of programmable switch matrices (PSM) and connection lines to connect them together. The connection lines connect either adjacent PSMs with Singles or the second neighbouring PSM with Doubles. Singles are used to connect logic blocks (CLBs) that are close together, while Doubles are favourable for longer connections. The CLBs are arranged in a grid between the connection lines, and are further connected to these lines with interconnects which can also be programmed.

The routing network accounts for the largest used area on the FPGAs, and the largest part of the circuit delay comes from routing delay [1]. Therefore, it is very important to be able to route the design efficiently.

Besides the programmable logic, FPGAs also contain a memory hierarchy. This hierarchy consists of low-latency on-chip fast memory known as Block RAM (BRAM), as well as off-chip DRAM. A data access to BRAM typically takes a single clock cycle to complete, while DRAM accesses might take hundreds of cycles. When the FPGA is integrated in a System-on-a-Chip (SoC), as is the case with Xilinx’ Zynq SoC, the DRAM memory space may be shared between the FPGA and the processor.

An FPGA is programmed to implement a fixed function, without being influenced by any form of scheduling or interrupting. Thus, the execution time of any operation performed on an FPGA can be accurately estimated beforehand. This means FPGAs also lend themselves very well to real-time applications where timing is of utmost importance.

Since FPGAs are limited in size, but also in clock speed, various additional challenges arise. Besides creating the correct definition for logic block behaviour, the routing needs to be designed such that the design is able to fit on the FPGA, but also such that the combined logic and routing delay is small enough to meet the timing constraints for the clock cycle. These factors combined make FPGAs significantly more difficult to program than CPUs or even GPUs.


2.2 Sparse Matrix Vector Multiplication (SpMV)

Sparse Matrix-Vector multiplication consists of a multiplication between a sparse matrix and a dense vector. The definition of a sparse matrix relies on the number of zero elements in the matrix - i.e., a sparse matrix contains more zero elements than nonzero elements. Sparse matrices are frequently used in many scientific and application fields, such as structural engineering, computational fluid dynamics, computer graphics, or graph processing [6].

2.2.1 Sparse Matrix data formats

For SpMV, one of the fundamental design parameters is the sparse matrix storage format. Examples of formats include Compressed Sparse Row (CSR) and Coordinate list (COO) [7]. The CSR format uses three arrays to represent an n×m matrix with nz nonzero entries. Two arrays of size nz are dedicated to the values and column-indices, respectively, whereas the third array contains n row-lengths. COO consists of a single array of size nz which contains 3-tuples with the row-index, column-index and value.

    0 0 0 0 5 8 0 0 0 0 3 0 0 6 0 0    

(a) Standard Matrix

Rowlengths: 0 2 1 1

Index-value tuples: (0, 5) (1, 8) (2, 3) (1, 6)

(b) Compressed Sparse Row (CSR)

(0, 1, 5), (1, 1, 8), (2, 2, 3), (1, 3, 6)

(c) Coordinate List (COO)

Figure 2.2: SpMV Datatypes
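
In C, these two formats can be represented with a couple of small structures. The sketch below is purely illustrative; the field names are our own choice and are not taken from any code in this thesis:

    /* Minimal C representations of the CSR and COO storage formats.
       Field names (values, col_idx, row_lengths, ...) are illustrative. */
    #include <stdint.h>

    typedef struct {
        uint32_t n_rows;        /* n: number of rows            */
        uint32_t n_cols;        /* m: number of columns         */
        uint32_t nz;            /* number of nonzero entries    */
        double   *values;       /* nz nonzero values            */
        uint32_t *col_idx;      /* nz column indices            */
        uint32_t *row_lengths;  /* n entries: nonzeroes per row */
    } csr_matrix;

    typedef struct {
        uint32_t row;           /* row index     */
        uint32_t col;           /* column index  */
        double   value;         /* nonzero value */
    } coo_entry;

    typedef struct {
        uint32_t n_rows, n_cols, nz;
        coo_entry *entries;     /* nz (row, column, value) tuples */
    } coo_matrix;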

2.2.2 SpMV using the CSR matrix format

The processing of CSR on a simple CPU implementation works as follows. We iterate over all rows, and, for each row, we read the row length n from the rowlengths array, and set an intermediate value for the product element to 0. Then, we read n elements from the value-index array and, for each element, we read the corresponding vector value (using the value-index) and multiply it by the value from the matrix. This value is then added to the current value of the product element. After the row is completely processed, the intermediate product value is written to the output array, using the current row as index. A pseudo-code version of the CSR-based SpMV is presented in Algorithm 1.

Algorithm 1 CSR Matrix-Vector Multiplication

1: function CSR(rows, vector, rowlengths, output)
2:     for y ← 0, rows do
3:         rowlength ← rowlengths[y]
4:         intermediate ← 0
5:         for x ← 0, rowlength do
6:             column ← GetNextColumn()
7:             value ← GetNextValue()
8:             intermediate ← intermediate + value ∗ vector[column]
9:         output[y] ← intermediate
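
For reference, a plain-C version of Algorithm 1 might look as follows. This is a sketch, not the thesis’ actual code: the stream reads (GetNextColumn, GetNextValue) are replaced by a running position into the value and index arrays of the csr_matrix sketch given earlier.

    /* Plain-C sketch of Algorithm 1 over the csr_matrix structure. */
    void spmv_csr(const csr_matrix *A, const double *vector, double *output)
    {
        uint32_t pos = 0;                  /* position in the value/index arrays */
        for (uint32_t y = 0; y < A->n_rows; y++) {
            double intermediate = 0.0;
            for (uint32_t x = 0; x < A->row_lengths[y]; x++) {
                uint32_t column = A->col_idx[pos];
                double   value  = A->values[pos];
                intermediate += value * vector[column];
                pos++;
            }
            output[y] = intermediate;      /* one result per row */
        }
    }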


2.2.3 SpMV using the COO matrix format

The simple CPU implementation of COO processing works as follows. We initially set the output array to 0. We then iterate over all nonzero tuples. For each element, we obtain the vector value using the column index and multiply this by the value from the tuple. We then add this value to the output array at the row index. A pseudo-code version of this implementation is presented in Algorithm 2.

Algorithm 2 COO Matrix-Vector Multiplication

1: function COO(nonzeroes, vector, output)
2:     SetToZero(output)
3:     for i ← 0, nonzeroes do
4:         row ← GetNextRow()
5:         column ← GetNextColumn()
6:         value ← GetNextValue()
7:         output[row] ← output[row] + value ∗ vector[column]
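
A plain-C sketch of Algorithm 2, again with the stream reads replaced by array accesses over the coo_matrix sketch given earlier; this is illustrative rather than the thesis’ actual code.

    #include <string.h>

    /* Plain-C sketch of Algorithm 2 over the coo_matrix structure. */
    void spmv_coo(const coo_matrix *A, const double *vector, double *output)
    {
        /* The output vector doubles as the accumulator, so clear it first. */
        memset(output, 0, A->n_rows * sizeof(double));

        for (uint32_t i = 0; i < A->nz; i++) {
            const coo_entry *e = &A->entries[i];
            output[e->row] += e->value * vector[e->col];
        }
    }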


CHAPTER 3

Design and implementation

In this chapter we present the design and implementation of several SpMV solutions for both Xilinx and Maxeler. We also describe the workflow which is used to create the solutions, as well as any challenges we encountered during development.

Note that we provide three different SpMV versions for both platforms. First, we implement a simple dense matrix-vector multiplication to become familiar with the HLS itself. Once this version is fully operational, we further dive into SpMV-specific optimizations, such as using CSR and COO, as well as any HLS optimization directives provided by either Vivado or MaxJ. All implementations use the double-precision floating-point format (IEEE 754) for both the vector and matrix values, and 32-bit wide unsigned integers for any form of indexing (unless otherwise specified).

3.1 Xilinx Vivado HLS

Vivado HLS enables the use of HLS to generate the hardware representation of a computation kernel. However, in order for this design to be integrated into a fully-working design, which can be further tested with real data, an entire ecosystem needs to be designed. In the following sections we cover both the required ecosystem that enables the use of HLS, as well as the actual HLS kernel design.

3.1.1 Xilinx HLS Ecosystem and Workflow

Even the initial implementation on Vivado HLS (VHLS) requires some expertise, as the IP generated by VHLS needs to be integrated within Vivado and even further with the SDK. This includes port-mapping, memory-address specification and programming the calling of the generated IP. Because of this, creating an initial implementation on Vivado requires significant hardware-design knowledge from the programmer.

For example, to create valid port mappings, specific directives have to be given to VHLS to reflect how data communication between the design and the rest of the system should be implemented. In this work, all our VHLS implementations use streaming to send the data to the FPGA. The streams are integrated using the Direct Memory Access AXI IPs (DMA) provided by Xilinx [9]. This approach allows streaming from system memory, which means we are able to place matrices in any format we want on the ARM portion of the Zynq SoC and then stream them to the FPGA.

A Xilinx VHLS implementation requires a workflow consisting of three separate programs: Vivado HLS (VHLS), Vivado, and the SDK.

Vivado HLS

Vivado HLS is used to create the implementation of the actual target algorithm. Within this program, we can generate the IP, which becomes a stand-alone module that can be further used in later stages of the workflow. We can further test the implementation for correctness against a CPU counterpart. Within VHLS, we can also provide compiler directives to influence how the IP is generated. Examples of directives include the target clock speed, indications to pipeline or parallelize loops, and port types. All these directives have a strong impact on the generation of the hardware. For example, the target clock speed strongly influences the scheduling of operations: a higher clock speed means fewer operations can be run in a single clock period, which in turn means that a higher clock speed is not always desirable, as the design may need (many) more cycles to complete an operation, cancelling the gain in clock speed.

Vivado

The generated IP has to be integrated into an encompassing design using Vivado. The IP contains some basic IO, which has to be connected within this encompassing design. In our case, we use DMAs to send and receive data from the FPGA; these DMAs are connected to the ARM CPU processing system. Vivado offers a high level of automation, as many elements of the final design can be automatically placed and interconnected. However, the data streams to and from the generated IP have to be connected manually.

When the design is completed, it has to be synthesized and implemented for the used hardware. These operations can take anywhere from a few minutes to several hours, depending on the complexity and size of the design. Once the design has been synthesized and implemented, it can be exported to the SDK.

SDK

We use the SDK to interface with the actual development board, implement the CPU code, and integrate it with the IPs within the design. To integrate the CPU code with the IPs, we use automatically generated header files; each type of IP has an associated header file containing functions relevant to the IP’s functionality. Each IP has to be instantiated in the CPU code using a unique ID before other functions can be used. We use the CPU to load the required data into memory and interface with the DMAs and our own IP. We also test the received results from our HLS IP for correctness.

3.1.2 Implementations

Initial implementation

The simple, dense matrix-vector multiplication within VHLS is created using as few ports as possible, to keep the design simple. The initial design consists of two input parameters to indicate the dimensions of the matrix, two input streams to stream the vector and the matrix data, respectively, and an output stream to export the result. The format of the vector input stream is defined as m vector values, while the matrix stream format is defined as n×m matrix values in row-major order. The output stream consists of n vector values from the output vector. The vector is initially stored in the BRAM of the FPGA to allow us to quickly access the elements out of order. The intermediate result obtained when iterating over a row is stored in a temporary variable, and when the computation for a single row is finished, the value of the variable is streamed out. This initial implementation does not take advantage of any FPGA-specific optimizations such as pipelining or parallelization.

The CSR version

Our VHLS implementation of CSR-based SpMV differs from the original dense design in some key aspects. First, the two input streams have been transformed into three streams: one consisting of m vector elements, one consisting of n row-lengths, and one consisting of nz value-index tuples. Again, the vector is initially entered into BRAM and the rows are iterated on. For each row, we keep track of the intermediate value, read the value in the row-length stream, and read that amount of values from the value-index stream. The read index is used to retrieve a value from the vector, which is then multiplied with the value read from the matrix, and the result is added to the intermediate product value. When all elements of the row are read and processed, the intermediate value is streamed out.


The AXI DMA we use requires the bit-width of the datatype to be a power of 2 (i.e., 32, 64, 128, ...). Because of this, we use a 64-bit integer as the index within the value-index tuple. This is larger than needed for our purposes, leading to larger data transfers to the FPGA.

3.1.3 The COO version

The COO format is relatively similar to CSR, so the implementation is also similar to that of CSR. We employ two input streams, again one consisting of m vector elements and one consisting of nz (value, row-index, column-index) tuples. The vector is entered into BRAM, and the tuple stream is read from. Contrary to CSR, we no longer differentiate between rows, since the row index is streamed in as well. Because of this, we could potentially reorder the input stream without any negative side effects; while this is not yet relevant in our initial implementation, it would make it easier to parallelize the operations in a more optimized implementation. Unfortunately, this implementation also requires storing the intermediate product values in BRAM. This means we use more resources on the FPGA, and we also require an extra operation when indexing the rows. After the input stream is completed, the computation is also completed, so the contents of the BRAM (i.e., the result of SpMV) can be streamed out.

3.1.4 Challenges

Most of the challenges we have encountered on the Xilinx platform do not occur in the VHLS part, but rather emerge within Vivado and the SDK.

The initial creation of the IP within VHLS is relatively straightforward, as we can simply create it using CPU code. We do, however, need to indicate the method we use to pass the parameters to the FPGA. In our case, these parameters are the input and output streams, as well as the scalars. The scalars are set as AXI-Lite ports, which can be set from the ARM CPU, whereas the streams are set as AXI-Stream ports. For the streams we can also set the buffer depth - our initial setting for this parameter is the default value of 1.
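
As a rough illustration of what this port mapping looks like at the source level, the sketch below shows a dense kernel with interface directives of the kind described above. It is a hedged sketch with illustrative names (not our actual kernel code), written as C-style Vivado HLS code that is compiled as C++ because of the hls::stream class; the BRAM buffer size is an assumption.

    #include "hls_stream.h"

    void spmv_dense_top(unsigned int rows, unsigned int cols,
                        hls::stream<double> &vector_in,
                        hls::stream<double> &matrix_in,
                        hls::stream<double> &result_out)
    {
    /* Scalars become AXI-Lite registers, data arguments become AXI-Stream ports. */
    #pragma HLS INTERFACE s_axilite port=rows
    #pragma HLS INTERFACE s_axilite port=cols
    #pragma HLS INTERFACE s_axilite port=return
    #pragma HLS INTERFACE axis port=vector_in
    #pragma HLS INTERFACE axis port=matrix_in
    #pragma HLS INTERFACE axis port=result_out
        /* The FIFO buffer depth (default 1 in our designs) is set through a
           separate directive in the VHLS GUI or Tcl script. */
        double vector[4096];                   /* BRAM copy of the input vector */
        for (unsigned int i = 0; i < cols; i++)
            vector[i] = vector_in.read();

        for (unsigned int y = 0; y < rows; y++) {
            double intermediate = 0.0;
            for (unsigned int x = 0; x < cols; x++)
                intermediate += matrix_in.read() * vector[x];
            result_out.write(intermediate);    /* one output element per row */
        }
    }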

Within Vivado we have to correctly connect all ports on our design. Thus, the scalar inputs require a connection to an AXI interconnect, which must in turn be connected to the ARM processing system. The streams need to be connected to DMAs, which also need to be connected to the AXI interconnect, as well as an AXI SmartConnect. The SmartConnect is then also connected to the ARM processing system. While it is not required to know exactly how these elements work or even what their function is, this complex system design does introduce a steep learning curve on how to integrate a VHLS-generated IP within a working design.

To simplify this co-design part, we initially started with a simple design example from Xilinx, which was able to use a single DMA to send data to a First-In First-Out (FIFO) IP, which then sent the data back to the DMA. We replaced the FIFO IP with our own IP and added additional DMAs for the additional streams we employ. This ”learning-by-example” worked well for our first version.

Within the SDK, we have to further interface with the DMAs, as well as our own IP. This is done using generated functions. In the case of our IP, several functions are generated: initialization, start and stop functions, getters and setters for our scalar inputs, as well as various getters for the status of the IP. This is also the case for the DMAs, which also include an initialization function and various getters and setters. A transfer can be started using a function call in which we indicate which DMA to use, the memory address on the CPU, the size in bytes, and whether the data is to be sent or received.

After the transfer has been started, there are two distinct methods to check the status of the DMA transfer: by polling or by using interrupts. When the polling mode is used, the CPU actively checks whether the transfer is finished; in the interrupt mode, the CPU is interrupted when the transfer is completed. Our implementations use the polling mode, because the implementation is simpler and the performance is the same. This does mean, however, that the CPU can’t be used to perform other operations while a transfer is running. The transfers are also limited in size, meaning that large transfers (>8MB) must be split into multiple smaller transfers.
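
A hedged sketch of such a polling-mode transfer, as it is typically written with the Xilinx bare-metal AXI DMA driver (xaxidma.h), is shown below; the function and buffer names are placeholders for illustration only.

    #include "xaxidma.h"
    #include "xil_cache.h"
    #include "xstatus.h"

    /* Send one buffer to the FPGA over an already-initialized AXI DMA and
       poll until the memory-to-stream channel becomes idle. Buffers larger
       than the DMA's maximum transfer size would have to be split into
       multiple calls like this one. */
    int send_buffer(XAxiDma *dma, double *buf, u32 n_bytes)
    {
        /* Make the data visible to the DMA before starting the transfer. */
        Xil_DCacheFlushRange((UINTPTR)buf, n_bytes);

        int status = XAxiDma_SimpleTransfer(dma, (UINTPTR)buf, n_bytes,
                                            XAXIDMA_DMA_TO_DEVICE);
        if (status != XST_SUCCESS)
            return status;

        /* Polling mode: the CPU busy-waits and cannot do other work here. */
        while (XAxiDma_Busy(dma, XAXIDMA_DMA_TO_DEVICE))
            ;

        return XST_SUCCESS;
    }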

Since we used an example project from Xilinx as a starting point, many of the encountered challenges were relatively easy to overcome. This method also requires less background knowledge compared to starting from an empty project. However, it is also more difficult to debug certain issues, due to that same lack of background knowledge. This can be especially difficult when the design blocks during a transfer, which often happens due to incorrect parameter settings on either the IP or the DMAs. This can lead to a mismatch between the amount of data the IP expects and how much is sent using the DMA, but there is no clear indication showing us this is the case.

3.1.5 Optimizations

There are two distinct approaches we can take to optimize our designs: improving the efficiency of the design, or using more of the available space on the FPGA. To improve the efficiency of a design, we can use various directives to guide the HLS compiler. Many of these directives are described in the HLS optimization guide provided by Xilinx [11]. Following this guide, we attempt to use the pipeline directive for the inner loop in our design; effectively, this means the compiler tries to overlap different iterations of the loop to improve hardware usage and reduce the total execution time. For our designs, however, VHLS reports that it is unable to effectively pipeline the loops, because it cannot sufficiently overlap different iterations. Even though this warning is given, we still investigate whether this failed pipeline attempt has improved the design. We will also try to unroll the loop instead of pipelining it, to see what effect this has on performance.

The other approach is increasing the used space on the FPGA. Our original designs are relatively small; therefore, we can easily expand them without running out of space on the FPGA. We choose to use this space to add extra computational elements. For our dense implementation we have enough space to create a design which contains four of our multiplication IPs along with the necessary DMAs. In turn, we have to reduce the size of the BRAM buffer within our design, which means we are more limited in the size of the matrices the design can process. We employ the same tactic on our CSR and COO designs. With our CSR design we are limited to only two multiplication IPs due to space constraints, while our COO design uses too much space to even fit a second multiplication IP.
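
The pipeline and unroll directives mentioned above are attached directly to the loop in the VHLS source. A minimal sketch (with illustrative names, not our actual kernel code) looks as follows:

    /* Hedged sketch of the optimization directives applied to a CSR row loop. */
    void csr_row(const double *values, const unsigned int *col_idx,
                 const double *vector, unsigned int rowlength, double *result)
    {
        double intermediate = 0.0;
    row_loop:
        for (unsigned int x = 0; x < rowlength; x++) {
    #pragma HLS PIPELINE II=1
    /* alternatively: #pragma HLS UNROLL factor=4 */
            /* VHLS may report that it cannot fully pipeline this loop,
               likely because the floating-point accumulation creates a
               loop-carried dependency between iterations. */
            intermediate += values[x] * vector[col_idx[x]];
        }
        *result = intermediate;
    }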

To utilize the additional multiplication IPs, we must also split the workload into equal sections, ideally without altering the data format. In the case of our dense implementation, this is simple: we divide the matrix into (almost) equal slices along the rows, and each IP then processes one slice. For our CSR implementation we try to divide the workload similarly, but we are not able to simply split the matrix through the middle, since this might lead to uneven workloads. Instead, we move the split to a row chosen such that the number of nonzeroes assigned to each IP is as close as possible.
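
A small sketch of this balanced row split, assuming the csr_matrix layout from Chapter 2 and two IPs, could look like this:

    /* Find the row at which the first IP has accumulated roughly half of
       the nonzeroes; rows [0, split) go to IP 0, rows [split, n_rows) to IP 1. */
    uint32_t find_split_row(const csr_matrix *A)
    {
        uint32_t half = A->nz / 2;
        uint32_t seen = 0;

        for (uint32_t row = 0; row < A->n_rows; row++) {
            if (seen >= half)
                return row;
            seen += A->row_lengths[row];
        }
        return A->n_rows;   /* degenerate case: everything ends up on IP 0 */
    }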


3.2 Maxeler MaxJ

For the initial Maxeler implementation, we also start with a simple dense matrix-vector multiplication before moving on to different data formats and optimizations. Due to the difference in programming paradigm compared to conventional programs, Maxeler also requires some expertise from the programmer for the initial implementation.

3.2.1 Maxeler Ecosystem and Workflow

Maxeler’s workflow requires a single program, the Maxeler IDE (MaxIDE). Each implementation consists of CPU code and engine code. The engine code is written as MaxJ code, and it is further split into a managing portion, and the kernel itself, which runs on the FPGA. The manager can influence various parameters within the design, such as clock frequency or IO configuration. The CPU code can communicate with the kernel using a generated function call based on the kernel itself, as well as the manager.

The kernel itself is programmed using a dataflow programming model. Within this model, we specify the inputs and outputs, as well as the calculations that have to be made. When simple calculations, such as integer operations, are used, the compiler can easily meet timing constraints. But in our case, when using several floating point operations for each point, other steps have to be taken to meet these constraints.

The host code consists of C code used to load the required data into CPU memory and interface with the kernel. In our case, the CPU code parses the datasets, interfaces with the FPGA, and tests the results for correctness against a CPU counterpart.

Maxeler implementations can be executed using either simulation or the actual FPGA. When the FPGA is used, the kernel code needs to be synthesized and implemented for the target FPGA, which, just as with Xilinx, can take anywhere from a few minutes to several hours. Maxeler differs from Xilinx in this step because the synthesis is not deterministic: during the place and route step, a seed is used to create a placement map and, when the timing fails, another seed is tried. By default, 4 seeds are used, but this can be modified within the manager. Within the manager we can also set the place and route to build upon previous seeds which came close to meeting timing constraints.

When simulation is used, the engine code does not need to be synthesized and implemented. This path offers a quick way to test the implementation in a close-to-real simulation, and make changes more easily.

3.2.2 Implementations

Initial implementation

The simple implementation for Maxeler uses two input streams and a single output stream, as well as several scalar inputs to bound the computation. The first input stream is the vector input containing m values; the second input stream is the matrix input containing n×m values. The output stream consists of the resulting n values. Initially, the vector input stream is entered into BRAM, after which the actual computation is started. We iterate over the specified number of rows, and within this iteration we also iterate over the specified number of columns.

Due to the timing constraints we previously alluded to, we require the use of an OffsetAutoLoop. This is a construct we can use to meet timing constraints by introducing another inner loop, the constraints of which are automatically determined by the IDE to give enough time to complete the computation.

Within this innermost loop, we read the required elements from the matrix stream and the vector during the first iteration. Then we perform the calculation during the rest of the loop before writing the result in the final iteration.

When we start the column iteration, we set a register to 0, and in each step this register is incremented with the multiplication between the matrix input stream value and the corresponding vector value from BRAM. For each iteration, we also have to carry the value in the register to the next step. When the end of the iteration is reached, the register value is sent to the output stream. As with the initial Xilinx implementation, this implementation does not take advantage of any FPGA optimizations.

The CSR version

The conversion from a simple implementation to a CSR implementation contains different challenges when using Maxeler compared to Xilinx. We initially used a counter to iterate over all columns, but, with CSR, this approach has to be changed due to the variable bound of the "inner" loop. Since this is difficult to accomplish, our initial CSR version uses the length of the longest row as a fixed upper bound. This may cost a lot of performance when the row lengths vary greatly, but in many other cases the impact will be negligible. We do use the row length to stop the reading from the stream when all elements in the row are used, meaning that we do not need to pad the stream with zeroes for the "expanded" rows. Again, we also use an OffsetAutoLoop to meet timing constraints.

The design consists of four input streams and a single output stream. The input streams are: a vector stream containing m elements, a rowlength stream containing n elements, and, finally, the CSR-value and CSR-index streams containing nz elements each. Again, we initially enter the vector into BRAM before iterating over the rows. Inside each row iteration, we iterate up to the longest row length. At the start of the iteration, we set a register to 0, and in each step this register is updated with the multiplication result. We read from the rowlength stream at the start of the column iteration, and we only read from the CSR streams when the current iteration is still smaller than the current row length. At the end of each row iteration, we output the value of the register to the output stream.

The COO version

For the COO format, we use four input streams and a single output stream. A single stream is used for the m vector elements; the three remaining streams consist of nz elements each, which combined form (value, row-index, column-index) tuples. The output stream consists of n elements. We initially use BRAM on the FPGA to store the intermediate results before sending them through the output stream.

We start with the initialization of the memory. We enter the vector stream into memory, and also set the intermediate output memory to zero. After these operations, we iterate over the nz elements. Within this outer loop, we create a loop with two iterations, which allows us to split the reading and writing of the intermediate result, enabling us to prevent race conditions. Within this loop, we again utilize an OffsetAutoLoop to meet timing constraints. For each element, we read the value from the intermediate memory as well as the appropriate value from the vector memory during the read iteration. Then, we perform the multiplication between the vector value and the stream value, and update the intermediate result. During the write iteration, we write the new value back into BRAM. After all elements are processed, we stream all elements from the intermediate memory out.

3.2.3 Challenges

Many of the challenges encountered on the Maxeler platform come from the creation of the kernel code, since this requires the use of the dataflow paradigm. For programmers with a background in software development, this paradigm may initially be counter-intuitive, mainly because instructions are not necessarily processed sequentially. In our case, we had to contend with multiple nested loops, where the action varies based on the iteration within the loops. Coordinating these loops is much more involved within the dataflow paradigm compared to regular CPU programming.

Just as with Xilinx, we started with an example project provided by Maxeler, where a kernel takes a matrix as input and sums the values for each row. Because of the clever selection of this initial example, the implementation of the initial, dense matrix-vector multiplication was easy: it only required us to add the multiplication with the vector values.

Other challenges arise when interfacing with the kernel from the CPU code. While the kernel can be called using a single function, it does require some peripheral information. Most notably, we need to enter the number of cycles the kernel is expected to run. If this number is incorrect, the kernel will block indefinitely. In turn, this requires the inner working of the kernel to be understood even on the CPU side, to determine the correct number of cycles. This is not a problem when the same programmer creates both the kernel and the CPU code, but it may cause issues with integration.
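
The sketch below is a purely illustrative, hypothetical example of this point: the host code has to compute the expected number of kernel ticks before calling the generated entry point. The function name SpMVDense and its signature are placeholders, not Maxeler’s actual generated API, and the tick formula is an assumption about the dense kernel described above.

    #include <stdint.h>

    /* In a real project this prototype would come from the generated SLiC
       header; here it is a hypothetical placeholder. */
    void SpMVDense(uint64_t ticks, const double *matrix,
                   const double *vector, double *result);

    void run_dense(uint32_t rows, uint32_t cols, uint32_t loop_length,
                   const double *matrix, const double *vector, double *result)
    {
        /* Assumed: one tick per step of the innermost (offset) loop, for every
           matrix element. If this estimate is wrong, the call blocks forever. */
        uint64_t ticks = (uint64_t)rows * cols * loop_length;

        SpMVDense(ticks, matrix, vector, result);
    }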

Another important factor to take into account is the amount of abstraction used by MaxJ. While this does potentially make the initial implementation simpler, it is much harder to thoroughly understand what actions the IDE takes to implement the kernel on the FPGA. For instance, our CSR design had more trouble than the initial dense design in meeting the timing constraints at the default clock speed of 75MHz, while the processing portion is very similar in both designs. While the CSR design did eventually meet the timing constraints, it is unfortunately still unclear why it did not do so at first.

3.2.4 Optimizations

Unlike Xilinx, Maxeler does not provide a guide for design optimization. Because of this, we limit ourselves to increasing the performance by using more of the FPGA’s space. This can be coordinated using the manager portion of the engine code: we can assign an id variable to our kernel and then assign unique ids to each kernel instance within the manager. This theoretically allows us to easily increase the number of multiplication units within the design.

Unfortunately, these designs were unable to meet timing constraints and we were unable to rectify this issue. We are unsure why this is the case, since we essentially use the same kernel multiple times in parallel - given the availability of resources and the use of the default clock speed, we expected parallelization to be easy to achieve. However, the original design could not be replicated to generate a parallel version of our Maxeler SpMV.


CHAPTER 4

Experimental setup

In this chapter we present the design of our experiments. Specifically, we present the hardware platforms, software systems, and input datasets we have selected.

To test the performance of the created designs, we will measure both execution time and power consumption. Besides reporting these two performance metrics, we also measure and report two other relevant aspects: FPGA resource utilization and the time-to-compile.

4.1 Platforms

4.1.1 Xilinx

For testing the performance of our Xilinx design, we use a Pynq Z1 development board, containing an ARM SoC with two ARM Cortex A9 cores running at 650MHz and an integrated FPGA. This FPGA is relatively small - it contains only 85,000 logic cells, 0.61 MB of on-chip memory and 512MB of off-chip DDR3 RAM, as well as 220 DSPs.

To program the FPGA we use Vivado HLS 2018.3 for the creation of the HLS IP, Vivado 2018.3 to integrate the design, and Xilinx SDK 2018.3 to integrate the CPU code. To measure the execution time we use Xilinx’ time measurement library dedicated for the used hardware. For all of our designs we only measure the execution time for the processing of the data, we do not take the time it takes to prepare the data into account.

For power consumption, we first report the estimate as provided by the Vivado toolchain. We further report measured power consumption for some of our large datasets using an external power monitoring device. This power monitoring device uses an INA219 IC to measure the power drawn from the DC input on the Pynq board.

The FPGA resource utilization is also reported directly by Vivado.

The time-to-compile will be split into two separate metrics: the time to export the HLS IP within Vivado HLS, and the time required to synthesize the entire design within Vivado.

The software packages run on a system with an AMD Ryzen 7 2700 CPU, clocked to 2.9GHz to match the clock speed of the Maxeler host system. This processor is coupled with 16GB of DDR4 RAM running at 3000MHz. The compilation is set to use 4 threads.

The development board is entirely controlled from within the SDK using a USB connection to the host computer. The board itself is not running a dedicated OS, but rather executes the code directly. This also means we are severely limited in file reading options, a restriction that required a creative solution for loading the matrices: we use binary files stored on the SD card. These files are loaded with the help of a basic filesystem library offered by Xilinx, and entered into the system memory. The system memory is then used to stream the data to and from the FPGA using the DMAs.


4.1.2 Maxeler

For testing on Maxeler, we use a Galava PCI-e expansion card, which contains a Stratix V 5SGXA5 FPGA with 490,000 logic cells, 5.6 MB of on-chip memory and 12GB of off-chip DDR3 RAM. The FPGA also includes 512 small DSPs and 256 large DSPs. This card is plugged into a PCI-e slot of a host machine with an Intel Core i7 870 CPU, running at 2.93GHz. This processor is coupled with 16GB of DDR3 RAM. The system is running CentOS 6.8 with Linux kernel 2.6.32. To program the FPGA, we use MaxIDE 2016.1 and MaxelerOS 2016.1. To measure the execution time, we use the timespec structure defined in C's time library.
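
A small sketch of this wall-clock measurement is shown below; the exact wrapper code used in our experiments is not listed in this thesis, so the snippet is an illustrative assumption around the standard timespec facilities.

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        /* ... run the SpMV kernel on the dataflow engine ... */
        clock_gettime(CLOCK_MONOTONIC, &end);

        /* Convert the two timestamps into elapsed seconds. */
        double seconds = (end.tv_sec - start.tv_sec)
                       + (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("execution time: %f s\n", seconds);
        return 0;
    }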

Power consumption should be measured using a Maxeler tool called maxtop. However, maxtop's reporting seems rather unreliable, so we also use a Racktivity wall-meter. Given the different architecture of the Maxeler system (namely, a host and an accelerator card), the power measurement needs to be performed in three steps: (1) measure idle power, (2) measure power under workload, and (3) report the difference between the two as the dynamic power consumed by the actual SpMV computation.

FPGA resource utilization is directly reported by MaxIDE after compilation, together with the time to compile. Compilation will again be set to use 4 threads. To maximize the chance of meeting timing constraints, we also set the manager to continue with seeds that are close to meeting the timing constraints within place and route.

The input data is parsed directly from Matrix Market files and loaded into the CPU memory space; from here, it is streamed to the FPGA. The result is streamed back to the CPU memory space. Since the memory spaces of the CPU and FPGA are disjoint, we have a significant latency overhead from the PCI-e bus when transferring data to and from the card.

4.1.3 CPU and GPU comparison

To compare the performance of the FPGA solutions, we use an Nvidia Jetson TK1 development board and a processing node on the DAS-5. The Jetson TK1 has been chosen because its power consumption and size are relatively comparable to those of the FPGAs used. The Jetson TK1 SoC contains four ARM Cortex A15 CPU cores running at 2.32GHz, along with a single low-power ARM Cortex A15 core. The SoC also contains an integrated GPU from Nvidia's Kepler generation with 192 CUDA cores running at 756-951 MHz.

The processing node on the DAS-5, on the other hand, contains high-performance hardware, providing a more realistic estimate of the maximum achievable performance. This node consists of two Intel Xeon E5-2630 v3 processors with 8 cores and 16 threads each, running at 2.4GHz. The node also contains an Nvidia Titan X Pascal GPU with 3,584 CUDA cores running at 1,417-1,531MHz. For the GPU we are able to measure the power consumption using software, but we are not able to do this for the CPU.

For running the computation on these platforms, we use an OpenMP implementation for execution on the CPU, and for execution on the GPU we use cuSPARSE [4] - a specialized library for sparse matrix computations on Nvidia GPUs.
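
The CPU reference is of the kind sketched below; the thesis does not list its exact code, so the parallelization strategy (one thread processing a set of rows) and the row_start prefix-sum array are our own assumptions.

    #include <omp.h>
    #include <stdint.h>

    /* Hedged sketch of an OpenMP CSR SpMV used as a CPU reference. */
    void spmv_csr_omp(uint32_t n_rows, const uint32_t *row_lengths,
                      const uint32_t *row_start,   /* prefix sums of row_lengths */
                      const uint32_t *col_idx, const double *values,
                      const double *vector, double *output)
    {
        #pragma omp parallel for schedule(dynamic, 64)
        for (uint32_t y = 0; y < n_rows; y++) {
            double sum = 0.0;
            for (uint32_t k = 0; k < row_lengths[y]; k++) {
                uint32_t idx = row_start[y] + k;
                sum += values[idx] * vector[col_idx[idx]];
            }
            output[y] = sum;
        }
    }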

4.2 Input datasets

Given that the SpMV performance is highly dependent on the input data, we have selected a diverse set of sparse matrices from the SuiteSparse sparse matrix collection [6]. These matrices are collected using real-world applications and thus offer a more realistic performance estimate than randomly generated matrices. The matrices will be multiplied with dense vectors filled with random values, as these have no impact on performance.

The 15 selected matrices and their characteristics are presented in Table 4.1. Note the differences in size, sparsity, row-length variance, and shape. These differences should highlight the relative performance of the different implementations on different hardware.

Our results focus on comparing the performance of VHLS and MaxJ, with and without optimizations, to state-of-the-art implementations on both CPUs and GPUs.


We run each matrix multiplication 10 times to obtain an accurate mean and standard deviation. We measure the execution time for the entire operation, i.e., including data transfer times to and from the FPGA. The Maxeler FPGA will be set to its default target clock speed of 75MHz; the Xilinx FPGA was originally set to 100MHz, but we lower it to 75MHz so that we measure the performance differences between implementations rather than between the FPGAs.

4.2.1 Theoretical maximum performance

We will also estimate the theoretical maximum performance of our FPGAs based on the resources required for various floating-point operations; in our case, we are specifically interested in additions and multiplications.

For Xilinx, we can use a utilization report provided by Xilinx [10] to estimate the resource utilization for these operations. An addition uses 3 DSPs while a multiplication uses 7 DSPs, which means we need 10 DSPs to perform both a multiplication and an addition. Since our Xilinx FPGA contains 220 DSPs, we can theoretically fit 22 multiply-add modules. When we multiply this value by our clock speed of 75MHz, we obtain the theoretical maximum performance: (22 ∗ 2) FLOP ∗ 75 MHz = 3,300 MFLOPS = 3.3 GFLOPS.

Maxeler, on the other hand, does not provide such a resource utilization report, so we use the resource utilization of our own designs as an estimate. Our simplest design uses 8 small DSPs and 4 large DSPs for a single multiplication and addition. This means we can theoretically fit 64 multiply-add modules onto the FPGA, which works out to a theoretical maximum performance of (64 ∗ 2) FLOP ∗ 75 MHz = 9,600 MFLOPS = 9.6 GFLOPS.

For our designs, we can deduce the actual performance from the processing time per element, or rather, its inverse: elements per unit of time. For each element we perform two floating-point operations, an addition and a multiplication. The total performance can therefore be calculated by dividing the number of floating-point operations per element (2) by the time per element t:

FLOPS = 2 / t

For example, a processing time of 0.156 µs per element corresponds to 2 / 0.156 µs ≈ 12.8 MFLOPS.

Matrix | Columns | Rows | Nonzeroes (density) | Avg. row length | Shortest row | Longest row | Field | Characteristics
Lewis | 1,074 | 1,074 | 1,074 (0.0931%) | 1 | 1 | 1 | Structural problem | Square, symmetric, static row-lengths
Newman | 1,589 | 1,589 | 2,742 (0.109%) | 1.73 | 0 | 19 | Undirected weighted graph | Square, symmetric
Freitas | 1,182 | 1,182 | 2,881 (0.206%) | 2.44 | 1 | 12 | Eigenvalue/model reduction problem | Square, asymmetric
Godet-Thobie | 1,090 | 1,090 | 3,546 (0.298%) | 3.25 | 1 | 90 | Computational fluid dynamics problem | Square, asymmetric
Anderson | 12,406 | 8,081 | 6,323 (0.00631%) | 0.78 | 0 | 14 | Counter example problem | Wide, asymmetric, very low density
Cunningham | 11,107 | 11,107 | 6,639 (0.00538%) | 0.60 | 0 | 1 | Acoustics problem | Square, symmetric, very low density
Zenios | 2,873 | 2,873 | 15,302 (0.185%) | 5.33 | 1 | 37 | Optimization problem | Square, symmetric
Meszaros | 16,281 | 2,701 | 52,070 (0.118%) | 19.28 | 16 | 32 | Linear programming problem | Wide, asymmetric
Cheng | 14,318 | 11,028 | 57,376 (0.03673%) | 5.20 | 1 | 18 | Power network problem | Tall, asymmetric, varying row-lengths
Maragal | 3,320 | 4,654 | 93,091 (0.602%) | 20.00 | 1 | 1,491 | Least squares problem | Tall, asymmetric, varying row-lengths
Slater | 3,140 | 3,140 | 540,022 (5.48%) | 171.98 | 3 | 2,293 | Economic problem | Square, asymmetric, varying row-lengths
Wright | 6,001 | 6,001 | 1,137,751 (3.15%) | 189.59 | 1 | 5,570 | Optimization problem | Square, symmetric, varying row-lengths
ND | 9,000 | 9,000 | 1,644,345 (2.03%) | 182.71 | 1 | 453 | 2D/3D problem | Square, symmetric
Simon | 14,000 | 14,000 | 1,853,104 (0.945%) | 132.36 | 12 | 294 | Directed weighted random graph | Square, asymmetric
Belcastro | 14,340 | 14,340 | 9,041,364 (4.39%) | 630.50 | 1 | 5,570 | Human gene network | Square, symmetric, strongly varying row-lengths

Table 4.1: The selected matrices and their characteristics.


4.3 Expectations

4.3.1 Time to synthesize

It is difficult to gauge the time it takes to synthesize a design without any prior experiments. However, we argue that our designs are relatively simple and small in size and, because of this, we expect the time to synthesize to be closer to a few minutes rather than to several hours.

4.3.2 Execution time

We can give more meaningful predictions when discussing the execution time. We expect the execution time of both dense implementations to increase linearly with the total size of the input matrix. Moreover, the execution time for the sparse implementations should in theory increase linearly with the number of nonzero elements in the input matrix.

However, in practice, our Maxeler CSR version is likely to be heavily dependent on the variance of row lengths in the input matrix; thus, we expect matrices such as Belcastro and Wright to perform relatively poorly, while matrices such as Lewis and Cunningham should perform comparably to the COO implementation.

The performance difference between Xilinx and Maxeler is more difficult to predict, as it is not known what steps each solution takes to synthesize and implement the actual design. When we consider the difference in hardware used on Maxeler and Xilinx, we expect that the data transfer from the CPU to the FPGA has a much smaller latency on Xilinx, since the memory space is shared and does not have the additional latency of the PCIe bus.

The performance gain from our optimizations is also difficult to estimate, especially since the optimization directives are somewhat abstracted by the compiler. The performance of our parallel designs is easier to gauge, however. Assuming we are able to split the workload evenly, we expect the performance to increase linearly with the number of multiplication elements used, since we do not have a lot of architectural overhead. Finally, we also expect that, due to the nature of FPGAs and the designs we implemented, we are likely to see very little run-to-run variance in any of the implementations.

Compared to the CPU and GPU implementations, our FPGA implementations are potentially less sophisticated and optimized. For the Jetson TK1 board, we expect the performance of at least the GPU and maybe even the CPU to be either comparable or superior to our FPGA implementations. We should, however, take into account that this board uses a low-power CPU and GPU; this means that the performance we obtain is significantly lower than the maximum that can be obtained with superior hardware, such as that available on the DAS-5. For the DAS-5 we expect drastically superior results due to the availability of much more processing power.

4.3.3 Power Consumption

When we consider power consumption, we again have to consider the hardware we are using. Both the Xilinx and Jetson TK1 platforms are low-power development boards, while the DAS-5 and the Maxeler platform run on a full-size computer. Because of this, we expect the latter platforms to be significantly more power-hungry overall due to the larger amount of hardware. However, this may not be the case for the kernel we are running on the FPGA itself: since we compare the idle power draw with the power draw while the kernel is running, we expect this metric to be comparable to the same metric on Xilinx.


CHAPTER 5

Results and analysis

In this chapter we present the results of the experiments we have laid out in the previous chapter. We report the time to synthesize, and dive into the analysis of the performance results for our dense-matrix and sparse-matrix implementations.

5.1 Time to synthesize

[Figure: time to complete (minutes:seconds) of the HLS and synthesis steps for the Xilinx Dense/CSR/COO and Maxeler Dense/CSR/COO implementations.]

Figure 5.1: Time to synthesize over a single run.

The time to synthesize for our implementations on the two platforms is presented in Figure 5.1. Based on this data, we make the following observations.

1. The time to synthesize for the different implementations varies widely. While this is partially due to the small sample size, we can still note that synthesis on the Maxeler platform takes slightly longer.

2. The time it takes to synthesize the Maxeler CSR implementation is disproportionately large. This happens because this design had trouble meeting timing constraints, as alluded to in Chapter 3. Because of this, the synthesis had to do multiple rounds of place and route, leading to the large increase in time.


3. The synthesis, even for our relatively small designs, takes a significant amount of time, ranging from 15 minutes for the quickest runs to 50 minutes. When we look at the time it takes to export the HLS IP on Xilinx, we see that it is relatively short compared to the actual synthesis, but this does not take the updating of the IP within Vivado into account.

5.2 Execution time

We divide our experiments (and the reporting of their results) into dense and sparse implementations. This is done to provide a meaningful and fair comparison.

5.2.1 Dense-matrix implementations

Figure 5.2 presents the results for our dense-matrix implementations.

Figure 5.2: Execution time for the dense implementations, (a) Xilinx and (b) Maxeler. The matrices are ordered by their total size. (Y-axis: execution time in seconds; lower is better.)

Because we are unable to fit the largest matrices of our test set onto the memory of the Pynq board, we must omit these matrices from these results. Thus, the figure includes performance data for 9 out of the 15 total matrices, still enough to compare the Maxeler and Xilinx solutions.


We order the matrices by their total size (m × n) to create a clear picture of how the data size influences execution time. Based on the performance data, we make two important observations. First, as expected, all our small matrices - Lewis, Godet-Thobie, Freitas and Newman - complete the computation very quickly. The execution time sharply increases when the matrices increase in size, as can be seen in the performance gap between Newman, with 2.5 million elements, and Zenios, with 8.2 million elements.

Second, when comparing the performance of the two versions - Xilinx and Maxeler - we can see that Maxeler is slightly quicker across all matrices. The run-to-run variance is also very small: for both Xilinx and Maxeler, the standard deviation is in the range of microseconds, which makes it negligible. Figure 5.3 presents a normalized version of the performance data - i.e., we report time-per-element. We see no significant differences between the execution time per element on the different matrices, for both Xilinx and Maxeler. We can also observe that the time per element on the Xilinx implementation is slightly higher than on the Maxeler implementation for all of our matrices.

Figure 5.3: Execution time normalized by matrix size for the dense implementations, (a) Xilinx and (b) Maxeler. The matrices are ordered by their total size. (Y-axis: execution time in µs/element; lower is better.)

Moreover, from the execution time per element we can also deduce the amount of FLOPS. This works out to 12.82 MFLOPS for Xilinx and 13.28 MFLOPS for Maxeler.
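For reference, the conversion from normalized execution time to FLOPS is, as a sketch (assuming one floating-point operation is counted per processed matrix element):

\[ \mathrm{FLOPS} = \frac{f \cdot m \cdot n}{T} = \frac{f}{t_{\mathrm{element}}}, \]

where \(T\) is the total execution time, \(t_{\mathrm{element}} = T / (m \cdot n)\) is the normalized time from Figure 5.3, and \(f\) is the number of operations counted per element; with \(f = 1\), a time of roughly 0.078 µs per element reproduces the 12.82 MFLOPS reported above for Xilinx.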


5.2.2 Optimizations

Figure 5.4: The effect of optimizations (Unroll, Unroll 4x, Pipe, Pipe 4x) on our dense Xilinx implementation, as speedup over the baseline. The matrices are ordered by their total size.

The effect of our optimizations is shown in Figure 5.4. We can see a small speedup of 1.1x from unrolling the loop, but a much larger speedup of 2.2x when we pipeline the inner loop. This is unexpected, since the compiler reported it was unable to effectively pipeline the loop. When the resource utilization is increased, we see a very large speedup. The speedup effectively scales linearly with the amount of multiplication units we use in the design, meaning that we see a speedup of 4.3x for the unrolled loop version and over 9.5x for the pipelined version. We can also see that the effectiveness of the optimizations does not depend on the size of the matrix.

After these optimizations, the amount of FLOPS has also increased to 122.89 MFLOPS for the pipelined version with 4 multiplication units.
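To make the variants in Figure 5.4 concrete, the sketch below shows how the corresponding Vivado HLS directives are typically attached to an inner multiply-accumulate loop. The function and names are hypothetical and not the kernel from Chapter 3; only the UNROLL and PIPELINE directives themselves are standard Vivado HLS constructs.

/* Illustrative sketch only: hypothetical dense row kernel, not the design from Chapter 3. */
#define N 1024  /* assumed row length, for illustration */

float dense_row(const float row[N], const float vec[N]) {
    float acc = 0.0f;
inner_loop:
    for (int j = 0; j < N; j++) {
        /* "Pipe" variant; for the "Unroll" variant use: #pragma HLS UNROLL factor=4 */
#pragma HLS PIPELINE II=1
        acc += row[j] * vec[j];
    }
    return acc;
}

The 4x variants additionally instantiate four multiplication units, which would typically also require partitioning the on-chip buffers; we omit that here for brevity.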


5.2.3 Sparse-matrix implementations

Our dense implementations are very predictable, with their execution time only depending on the size of the matrix. This behavior changes for our sparse implementations, where we have to take into account various other aspects of the matrix, like the amount of nonzeroes, the differences in row sizes, the impact of empty rows, and the overall sparsity.

Figure 5.5: Execution time for our sparse implementations (COO and CSR) on the small matrices, (a) Xilinx and (b) Maxeler. The matrices are ordered by the amount of nonzeroes. (Y-axis: execution time in seconds; lower is better; one off-scale Maxeler CSR bar is annotated with its value of 0.0435 s.)

Figure 5.5 presents an execution time comparison between our two SpMV sparse versions, CSR and COO, for the small matrices, ordered by number of nonzeroes. We make the following observations.

1. For Xilinx, a steady execution time increase is visible as the matrices get more nonzero elements.

2. The Xilinx COO and CSR versions are comparable in terms of performance, regardless of the features of the matrix, with COO slightly faster across the board.

3. The Maxeler implementations are a lot less predictable.

4. The execution time of the Maxeler COO implementation mostly follows an upward trend as the amount of nonzeroes increases, but this is not the case for the CSR implementation. We suspect this variability is caused by the variance in row length: matrices such as Anderson and Godet-Thobie have a longest row that is much larger than the average row length, and our "round-up to the largest row" approach adds significant overhead there; matrices where CSR outperforms COO, such as Lewis and Cunningham, show a much smaller difference between the largest and average row length.

5. Our Maxeler implementations are slower than their Xilinx counterparts, with differences ranging from a small overhead (10-20%) to slow-downs as high as 4x.

Figure 5.6: Execution time for our sparse implementations (COO and CSR) on the large matrices, (a) Xilinx and (b) Maxeler. The matrices are ordered by the amount of nonzeroes. (Y-axis: execution time in seconds; lower is better; off-scale Maxeler bars are annotated with 3.1 s for COO and 12.8 s for CSR.)

We note that the large matrices show a similar behavior to the small ones: the Xilinx implementations are comparable to each other, and the execution time increases with the amount of nonzeroes. This is also the case for our Maxeler COO implementation, but not for the CSR implementation. The results for the CSR implementation do confirm our hypothesis that this behavior is due to the ratio between the average and longest row length: for the worst performing matrices, Maragal, Slater, and Wright, this ratio is 74.55, 13.33, and 29.38 respectively. For the best performing matrices, ND and Simon, the ratio is only 2.48 and 2.22 respectively.
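This matches a simple model of the padding overhead of the "round-up to the largest row" approach: assuming every row is processed as if it were as long as the longest row, the ratio of performed to useful work is

\[ \frac{m \cdot L_{\max}}{\mathit{nnz}} = \frac{L_{\max}}{L_{\mathrm{avg}}}, \qquad L_{\mathrm{avg}} = \frac{\mathit{nnz}}{m}, \]

where \(m\) is the number of rows and \(L_{\max}\) and \(L_{\mathrm{avg}}\) are the longest and average row lengths. Under this model, Maragal performs roughly 75 times the necessary work, while ND and Simon stay within a factor of about 2.5.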


Figure 5.7: Execution time, normalized over nonzero elements, for our sparse implementations (COO and CSR), (a) Xilinx and (b) Maxeler. The matrices are ordered by the amount of nonzeroes. (Y-axis: execution time in µs/element; lower is better; one off-scale Maxeler bar is annotated with 11.9.)

We present the execution time normalized by the amount of nonzeroes in Figure 5.7. We make the following two observations. First, for Xilinx, the execution time per nonzero element is rather constant, which is further indication of the linear correlation between execution time and increasing amount of nonzeroes. The only outliers are Anderson and Cunningham, which are both significantly more sparse than any of the other matrices. We suspect this diverging behavior is due to some form of overhead, either when loading the vector or when storing the intermediate results.

Second, for our Maxeler implementations, no clear correlation can be observed. Especially for the CSR implementation, performance varies a lot in execution time per element, with the most notable outliers being Maragal, Godet-Thobie and Anderson (where we previously noted the large ratio between the longest and average row lengths). The outliers with the COO implementation are mostly the smaller matrices and especially the very sparse ones, such as Anderson and Cunningham; larger matrices (starting from Zenios) show a much more stable performance and a linear correlation of the execution time to the amount of nonzeroes in the matrix.

We can again calculate the amount of FLOPS from our designs. In this case we divide by the amount of nonzeroes instead of the complete matrix size. For the best case on Xilinx, the Belcastro matrix, this works out to 12.81 MFLOPS for our COO implementation and 12.81 MFLOPS for our CSR implementation. For Maxeler, the best case matrix is Simon. For our COO implementation this works out to 5.79 MFLOPS, and to 5.60 MFLOPS for our CSR implementation.
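The same relation as in the dense case applies, with the full matrix size replaced by the number of nonzeroes (again a sketch, using the same counting convention):

\[ \mathrm{FLOPS} = \frac{f \cdot \mathit{nnz}}{T}. \]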

5.2.4 Optimizations

Figure 5.8: The effect of optimizations (Unroll, Pipe) on the Xilinx COO implementation, as speedup over the baseline. The matrices are ordered by the amount of nonzeroes.

Figure 5.8 shows the speedup achieved by our COO implementation after optimizations. For Xilinx, neither unrolling nor pipelining the loop has any effect on the execution time, and thus the speedup is 1x. Because of this, there are also no changes to the amount of FLOPS of the design, which stays at 12.81 MFLOPS. Since we were unable to fit any more multiplication elements on the Pynq, we do not have any indication of scaling behavior.


Figure 5.9: The effect of optimizations (Unroll, Unroll 2x, Pipe, Pipe 2x) on the Xilinx CSR implementation, as speedup over the baseline. The matrices are ordered by the amount of nonzeroes.

Figure 5.9 shows the speedup achieved by our CSR implementation after optimizations. For Xilinx, unrolling the loop does not increase the performance in any meaningful way. Pipelining, on the other hand, can improve performance significantly, showing close to 2.5x speedup. However, this only holds true for the larger matrices. For smaller matrices, particularly those with short rows such as Lewis, Anderson and Cunningham, pipelining offers almost no performance gain. When we increase the amount of multiplication units, the speedup on the large matrices roughly doubles, while the effect for the smaller matrices is slightly less. The highest overall speedup is close to 5x.
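A rough pipelining model illustrates this row-length dependence. Assuming an inner loop of \(L\) iterations (the row length), a pipeline depth of \(D\) cycles, and an initiation interval of \(II\) after pipelining, the speedup over the non-pipelined loop is approximately

\[ \mathrm{speedup} \approx \frac{L \cdot D}{D + (L - 1) \cdot II}, \]

which approaches \(D / II\) for long rows but stays close to 1 when \(L\) is small. This sketch ignores memory stalls and the fill/flush overhead between rows, but it is consistent with the observation that matrices with short rows gain little from pipelining.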

In terms of FLOPS, the optimizations lead to a performance of 61.10 MFLOPS for the best case scenario.

5.3 Resource Utilization

Both the Maxeler IDE and Vivado report the total utilization for different FPGA resources. These include logic cells, BRAM, Flip-Flops, and DSPs. Due to architectural differences between the different FPGAs we used, we limit ourselves to reporting data which is available on both platforms. Both platforms report the amount of used logic cells and Flip-Flops as a single absolute value, but the BRAM is reported in blocks. These blocks are different in size for the different architectures (i.e., 20KB blocks vs. 36KB blocks for the Maxeler FPGA and Xilinx FPGA, respectively). For simplicity, we convert these block values into KB. The DSPs also differ between the architectures, but in both cases they are used to perform various floating point operations.

Figures 5.10 and 5.11 present the resource utilization for the Xilinx and Maxeler designs, respectively.


Figure 5.10: Xilinx resource utilization (logic cells, Flip-Flops, BRAM in KB, and DSPs) for the Dense, CSR, and COO designs.

We note that our Xilinx implementations are similar in terms of resource utilization, with some variation. As would be expected, the simplest implementation, the dense-matrix one, consumes the least amount of resources. Our CSR and COO implementations both use more logic cells, Flip-Flops and BRAM, but the amount of DSPs remains the same. The BRAM usage in particular is significantly higher than that of the simple implementation. In the case of COO, this happens because of storing the output buffer; for CSR, this happens due to the additional rowlength buffer.

Figure 5.11: Maxeler resource utilization (logic cells, Flip-Flops, BRAM in KB, and DSPs) for the Dense, CSR, and COO designs.

Our Maxeler implementations show a similar trend to those from Xilinx. The simple implementation uses significantly fewer resources than either CSR or COO. We also notice that the CSR implementation uses more DSPs than the other implementations; it is unclear why this is the case, as CSR does not use more floating point operations than our other designs. We again see a large increase in memory usage for the COO design, also due to the additional output buffer.


The BRAM usage of the CSR design is also slightly higher than that of the dense design. When we look at the differences between Maxeler and Xilinx, we can see that Xilinx uses fewer logic cells, Flip-Flops and BRAM, but more DSPs. Since this is even the case for our simple designs, we suspect this is an effect of the architectural differences and the compilers/toolchains of the different FPGAs.

5.3.1 Optimizations

Figure 5.12: Xilinx resource utilization (logic cells, Flip-Flops, BRAM in KB, and DSPs) for the dense optimizations (Unroll, Pipe, Unroll 4x, Pipe 4x), compared to the baseline.

We present the difference in resource utilization between the basic implementation and the optimized versions in Figure 5.12. We can see a small increase in the amount of resources used when we apply either pipelining or loop unrolling. When we add additional multiplication units, we see a sharp increase in the amount of logic used.
