
Bachelor Informatica

OpenMP performance for GPU-acceleration

Danny Opdam

January 29, 2021

Supervisor(s): Ana-Lucia Varbanescu

Informatica, Universiteit van Amsterdam


Abstract

GPU offloading is a hot topic. In this thesis we compare two offloading methods, CUDA and OpenMP. OpenMP has some major benefits, mainly stemming from its ease of use and the possibility of porting existing code.

Through an examination of their performance on applications from the Rodinia[1] benchmark suite, we draw conclusions about OpenMP's worth on the GPU-offloading front. Along the way we reason about the performance gap and try to close it by iteratively improving our chosen applications. This highlights how good and bad practices in offloading affect performance.


Contents

1 Introduction
    1.1 Research question
    1.2 Ethics
2 Theoretical background
    2.1 Platforms
    2.2 Programming models
    2.3 Performance
    2.4 Related work
3 Methodology
    3.1 Methods
        3.1.1 Identifying parallelizable loops
        3.1.2 Offloading to the GPU
        3.1.3 Testing for correctness
        3.1.4 Benchmarking performance
        3.1.5 Optimizing the OpenMP-offloading implementation
    3.2 Applications
4 Empirical evaluation
    4.1 Experimental setup and tools
    4.2 Parallelization challenges
    4.3 Vector add
    4.4 Needleman-Wunsch (NW)
        4.4.1 Basic NW - version 1
        4.4.2 Improved NW - version 2
        4.4.3 Improved NW - version 3
        4.4.4 Downgrading the CUDA implementation
    4.5 Pathfinder
        4.5.1 Base version of PF - version 1
        4.5.2 Improved PF - version 2
        4.5.3 Degrading CUDA performance
    4.6 Nearest neighbour (NN)
5 Conclusion
    5.1 Main findings
    5.2 Future work


CHAPTER 1

Introduction

In many scientific domains and industrial applications, large amounts of data are being collected. As the amount of data gathered keeps growing, efficient ways to process these data become more and more important. To process these data efficiently, parallel processing has become the go-to approach.

Moreover, in recent years, more focus has been put on using GPUs for processing large amounts of data in parallel[2], because the GPU is, in a sense, 'made for this': it offers a large number of processing cores that are designed to work in a SIMD (single instruction, multiple data) fashion. The large number of cores a GPU offers - up to two orders of magnitude more than those available on a traditional CPU - means that more work can be done in parallel, and performance can be further improved.

The process of using GPUs for parallel processing is often called acceleration, and it is based on offloading: the parallel-processing parts of the code to be executed on the GPU are said to be offloaded to the accelerator. These offloaded parts of the code are called kernels.

Many different methods exist to enable GPU-processing for both new and existing software. In this thesis, we focus on two such approaches. The first is the well-known CUDA programming model, from NVIDIA, which requires programmers to explicitly code the kernels and the operations needed for offloading. The second is OpenMP-offloading, which is available in the newer OpenMP standards (4.5+). OpenMP handles accelerating code in a different way: working sequential implementations are enhanced with pragmas. A pragma can be seen as a sort of flag that, when compiled, turns the targeted sequential code into a GPU kernel and adds the additional code needed for the offloading.
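As a minimal sketch of this second approach (the loop and the variable names below are our own illustration, not code from any of the benchmark applications), the same element-wise addition can run sequentially or be offloaded to the GPU depending solely on the pragma placed above it:

    #include <stdio.h>

    /* Hypothetical example: element-wise addition of two vectors.
     * Compiled without OpenMP support, the pragma is ignored and the loop
     * runs sequentially on the CPU; compiled with an offloading-capable
     * OpenMP compiler, the loop body becomes a GPU kernel and the map
     * clauses describe the host-device data movement. */
    void vector_add(const float *a, const float *b, float *c, int size)
    {
        #pragma omp target teams distribute parallel for \
            map(to: a[0:size], b[0:size]) map(from: c[0:size])
        for (int i = 0; i < size; i++)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        enum { N = 8192 };
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }
        vector_add(a, b, c, N);
        printf("c[100] = %f\n", c[100]);
        return 0;
    }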

The fact that GPU-acceleration with OpenMP-offloading is as simple as adding a few pragmas to existing sequential code makes this approach an attractive choice for many applications. However, more attention needs to be paid to the performance of OpenMP-offloading. As this is a much newer approach to GPU-acceleration, there is little data to show how its performance compares to the native programming model, CUDA. Our research aims to find out how the performance of OpenMP-offloading compares to that of CUDA.


1.1 Research question

In this thesis we examine the performance gap between CUDA and OpenMP-offloading for GPU accelerators. Our main research question is:

What is the performance gap between CUDA and OpenMP-offloading for GPU accelerated applications?

To answer this question, we analyze the performance of several representative applications, answering the following questions along the way:

1. SQ1. What is a suitable OpenMP compiler?
2. SQ2. What are representative applications?
3. SQ3. How does OpenMP's performance compare against CUDA's?
4. SQ4. What are the reasons behind any potential performance gap?

To answer SQ1, we perform a literature study aiming to find out more about the differences between OpenMP compilers. Based on empirical evidence from this study, we pick the most promising one that works with OpenMP 4.5+ and is compatible with the DAS-5 (the Distributed ASCI Supercomputer 5) [3].

Next, to answer SQ2, we choose some of the best-suited applications from the Rodinia benchmark suite [1]. To enable our comparison in SQ3, we have to port these applications from their current CPU OpenMP implementations to OpenMP-offloading implementations. Using the new OpenMP-offloading implementations, we can measure and compare the performance of the two approaches - OpenMP-offloading and CUDA. Finally, to answer SQ4, we reason about the differences in performance by using the optimization process of writing OpenMP applications to identify good and bad practices when using OpenMP for GPU-accelerated applications.

1.2 Ethics

To the best of our abilities, we will be open and clear about our methods, in order to preserve the integrity and validity of our results. To further the domain of public knowledge, and as we use open-source APIs and tools, we will provide all code and experimental results as open source. Credit to the original authors will be given to the best of our ability.

As for the ethical implications of our work, we can be fairly brief: this is an empirical study aimed at analyzing and explaining a performance difference. As such, at least while respecting the previous points, there are (to our understanding) no negative implications or consequences of this thesis.


CHAPTER 2

Theoretical background

In this section we clarify some of the definitions used in this thesis and summarize relevant related work.

2.1 Platforms

The use of the GPU as an accelerator is so effective because GPUs offer many small cores that are specialized to handle many operations in parallel.

Not every part of an application is compute-intensive, however; there are parts that we do not want to offload because the time gained by parallelizing them would not be worth it. This is often because moving data from the CPU to the GPU is costly: such transfers incur a large overhead, and the speedup must be worth this initial cost. Using the inherent parallelism of the GPU's architecture can greatly increase the performance of some applications by tasking the GPU with the heavy lifting of parallelizable tasks and letting the CPU take care of the more serial portions of the application.

2.2 Programming models

Nvidia's CUDA provides the option of very fine-grained parallelism and control. When using CUDA, coding and optimization require extensive knowledge of the hardware layout and skill in programming parallelized applications. When an application does not require this fine-grained approach, OpenMP might provide a more efficient, coarse-grained way to parallelize it.

OpenMP is an API that seeks to simplify much of the difficulty in programming parallel applications. OpenMP-offloading uses a host/device model. The host is where the initial thread of the program begins execution, while the device is where the kernel is running. The newer standards of OpenMP have added features designed to also enable the use of GPUs as ’devices’. A schematic overview of the host/device execution can be found in Figure 2.1.

The performance of the newer (4.5 and up) versions of OpenMP in these accelerated applications, specifically when compared to hand-crafted CUDA code, still needs to be rigorously tested. The difference in performance must be compared and the causes analyzed and/or explained.


Figure 2.1: Schematic overview of the OpenMP 'workflow'. From Programming Your GPU with OpenMP [4]

2.3 Performance

For this thesis, we refer to the performance of a kernel/application as its execution time. We determine the performance of applications and kernels by benchmarking. To determine the execution time, we use profiling tools that report execution times of our kernels and the memory transfers.

2.4 Related work

Previous studies have compared the performance of OpenMP and OpenCL[5], as well as OpenCL and CUDA[6]. Both papers use applications from the Rodinia suite[1], a benchmarking suite with kernels which are available in CUDA, OpenMP and OpenCL. Following the example of previous work, we also use applications from Rodinia. Although Rodinia’s support for newer versions of OpenMP remains limited, the available applications provide us with a good starting point for our comparison.

In Che et al.[1], several Rodinia applications are analyzed, and some are optimized for GPUs in CUDA and for CPUs in OpenMP. The paper presents the speedup of the GPU implementations compared to sequential and 4-threaded CPU implementations, as illustrated in Figure 2.2. The authors conclude that, while they did not spend equal effort optimizing each application, the performance differences shown are the result of each application's characteristics. SRAD, HotSpot (HS) and Leukocyte (LC) are said to be more compute-intensive, while Needleman-Wunsch (NW), Breadth-First Search (BFS), Kmeans (KM) and Stream Cluster (SC) are limited by off-chip GPU memory bandwidth. We will use a similar comparison, but we compare the performance of CUDA code to that of OpenMP-offloading code.

These previous studies [1], [5], [6] also discuss some methods used for speeding up the GPU applications. For example, to maximize locality, individual GPU threads should traverse arrays in column-major order. Additionally, reducing GPU-CPU communication is an obvious optimization, as the overhead of this type of communication is still a large bottleneck in GPU-CPU systems. Other optimizations are the use of shared memory to maximize data reuse, and using


Figure 2.2: Speedup of GPU compared to 2 CPU implementations. From Rodinia: A benchmark suite for heterogeneous computing, Che et al. [1]

constant memory for frequently accessed read-only values. In our study, we will check if and how these optimizations can also be included in OpenMP-offloading code.

In Shen et al. [5], the performance gap between OpenMP and OpenCL is analyzed for three Rodinia applications: CFD, KMeans and PathFinder. The applications were specifically chosen because of the poor performance of OpenCL when compared to OpenMP. The performance of regular parallel (CPU) OpenMP code is used as the reference; this code was not aggressively optimized.

When analyzing Kmeans' performance in OpenCL, the authors find that there is a swap kernel that remaps array data from row- to column-major order. This turns out to negatively affect the performance of the application. They go on to explain that on GPUs, arrays are often accessed in column-major order to better utilize memory locality. As an optimization also found in Che et al.[1], this is definitely something to keep in mind when using arrays in our (largely) GPU-run applications.

CFD also turns out to make use of some array transposing, hampering performance in their CPU-based implementation. Further optimization mainly focuses on using less precise floating-point arithmetic, something that might not be directly translatable to our implementations but is definitely a trade-off to keep in mind.

Che et al.[7] state that many applications in Rodinia already take advantage of localizing data access patterns and inter-thread communication within blocks. The advantage of using shared memory for read-only data structures is once again stated to be of great use for enhancing performance. Applications such as Back Propagation, HotSpot, Needleman-Wunsch and StreamCluster make great use of shared memory. Kmeans, Leukocyte and MUMmer make great use of texture memory.

We have selected some of these applications (NW, Pathfinder) in our set (see Chapter 3), and we will discuss the impact of using these GPU-specific resources on performance in our evaluation (see Chapter 4).


CHAPTER 3

Methodology

In this chapter we present our methodology, and explain how we selected our applications and tools for comparing the performance of GPU applications written with CUDA and with OpenMP-offloading.

3.1 Methods

Using the Rodinia benchmarking suite as the source of the applications whose performance we compare puts us at a crossroads. Our end goal is comparing offloading-enabled applications implemented in CUDA to those implemented in OpenMP. We therefore either start from sequential code and port it to both CUDA and OpenMP from there, or we use the existing code as a framework to start from.

Because the CUDA versions of the applications are well established and, by design, form a good comparison to their OpenMP CPU counterparts, we elected to use the CUDA implementations as they are. We then have to port the CPU-parallel OpenMP code back to sequential code. This sequential code serves as the foundation of our OpenMP-offloading code. Another benefit of this method is that disabling OpenMP parallelism is easy by design, because OpenMP only differs from C(++) code in the pragmas used in the code. These pragmas allow for parallelization when compiled by an OpenMP-enabled compiler, but can be ignored when compiled regularly, in which case the code runs sequentially.
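As a sketch of why this works (the loop body is invented purely for illustration), consider a CPU-parallel OpenMP loop: built with -fopenmp it runs multi-threaded, while built without the flag the pragma is simply ignored (typically with an "unknown pragma" warning) and the exact same source runs sequentially.

    #include <stdio.h>

    /* hypothetical stand-in for the work done inside a Rodinia loop body */
    static double row_cost(int i) { return i * 0.5; }

    int main(void)
    {
        double total = 0.0;
        /* With -fopenmp: iterations are distributed over CPU threads.
         * Without it: the pragma is ignored and this is plain sequential C,
         * which is exactly how we obtain our sequential baseline. */
        #pragma omp parallel for reduction(+: total)
        for (int i = 0; i < 1000000; i++)
            total += row_cost(i);
        printf("total = %f\n", total);
        return 0;
    }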

Starting from the sequential implementation, we need to define the process that will allow us to accurately compare the performance of our OpenMP-offloading implementation to CUDA’s. The steps of this process will be explored in further detail in the section below. The general layout of the process is as follows:

• Identify loops for parallelization.
• Offload the loops to the GPU.
• Test for correctness.
• Benchmark performance.
• Optimize offloading to enhance performance.

Following this process for each application will ensure the reproducibility of our research when applied to other applications.

3.1.1 Identifying parallelizable loops

Because OpenMP inherently focuses on parallelizing loops, an important part of ensuring good performance of our GPU-offloading code is identifying which loops are good candidates for parallelization. Parallelizing loops that are poor candidates will surely hamper performance because of the increased overhead in data transfer between the CPU and GPU. We will therefore have to identify loops that are 'compute-intensive': loops whose ratio between overhead and computation leans heavily to the computation side.

Unfortunately, there is no concrete metric to ensure a loop has favourable compute-intensity, so we will try to come up with one that is both satisfactory and reproducible. Two candidates for this selection process are (1) using the Roofline model [8], and (2) using profiling tools alongside Amdahl’s law[9].

Using method 1 entails the use of the Roofline model, developed by Williams et al.[8]. This model calculates the maximum attainable GFLOP/s for a given parallel application running on a parallel system. Figure 3.1 explains the Roofline in more detail: the figure illustrates that there is an operational intensity (in flops per byte) at which the application is no longer bound by memory bandwidth, but by computing throughput. We could use a similar approach to calculate the operations per byte of our loops, and define a threshold above which we categorize a loop as compute-intensive.
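As a hedged illustration of such a threshold computation (the byte count assumes single-precision floats and no caching, and the ridge-point figure is only indicative): vector add performs one addition per element while moving three 4-byte values, so its operational intensity is

\[
I_{\text{vector add}} \;=\; \frac{1\ \text{flop}}{(2\ \text{loads} + 1\ \text{store}) \times 4\,\text{B}} \;\approx\; 0.08\ \text{flop/byte},
\]

far below the ridge point of a modern GPU (several flops per byte), so under this metric vector add would be classified as memory-bound rather than compute-intensive.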

Figure 3.1: Left: The performance bounds in the Roofline model explained. Right: Two Rooflines of different systems.

From Roofline: An insightful visual performance model for multicore architectures, Williams et al. [8]

Method 2 would entail using the sequential code and timing the execution of each loop. Using the execution times in conjunction with Amdahl's law[9] would give us a list of candidate loops that we can order easily. We would then again have to define a threshold above which we consider loops compute-intensive, or, alternatively, we could use the best X candidate loops as our starting point for parallelization. A third, final method is a lot less robust but much more convenient for us specifically: since OpenMP parallelizes on a loop basis, we can easily identify the loops that are being parallelized in the CPU versions. Using those loops as a starting point is often all we need. If a loop is not being parallelized on the CPU, it would almost certainly not be worth offloading, since offloading adds a lot of extra overhead in data movement and additional set-up costs. Because the Roofline model is inherently different for each architecture, and because the heavier loops are already highlighted in the OpenMP CPU code, we decided to skip the additional step of timing all loops and calculating which ones were best suited.

3.1.2 Offloading to the GPU

Having identified the candidate loops in our sequential code, we must then start the process of enabling them for parallelization using OpenMP-offloading pragmas. To keep the process as clear as possible, we perform this process in the same way for each application. Our expectations are that this starting implementation will be very slow, and it will not even outperform standard CPU parallelized OpenMP code. We use this process to ensure reproducibility of the method, as well as to ensure we have a working offloading application from which we can start the next steps.

3.1.3 Testing for correctness

To test for correct functionality, the OpenMP-offloading applications are empirically verified. Specifically, we make use of profilers to confirm GPU activity, and we compare the results for different input datasets against the CUDA and sequential versions. If the results match, we consider the OpenMP-offloading versions to be correct.
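A sketch of the kind of element-wise comparison we mean (the array names and the tolerance are placeholders of our own):

    #include <math.h>

    /* Returns 1 when two result arrays agree within a small tolerance.
     * 'reference' holds the sequential or CUDA output, 'candidate' the
     * OpenMP-offloading output; the tolerance absorbs floating-point
     * rounding differences between CPU and GPU arithmetic. */
    static int results_match(const float *reference, const float *candidate,
                             int n, float tolerance)
    {
        for (int i = 0; i < n; i++)
            if (fabsf(reference[i] - candidate[i]) > tolerance)
                return 0;
        return 1;
    }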

3.1.4 Benchmarking performance

With a working OpenMP-offloading application, we can start benchmarking the applications and comparing them to their respective CUDA implementations. As already mentioned, the main metric used in this analysis is execution time. To ensure the statistical significance of our results, we run each application (at least) 10 times, and present the average execution time (for more details, see Chapter 4).

3.1.5 Optimizing the OpenMP-offloading implementation

The first-version results will serve as a baseline which we aim to improve. Using different pragmas and other OpenMP tactics designed for offloading, we will incrementally improve our implementation. Each iteration of improvement has to be tested for correctness, and then benchmarked to find out whether our attempts boost, or hamper, performance.


3.2 Applications

Selecting benchmarking applications for this study gives us some degree of certainty that the applications we use are representative of real-world problems. We aim to select Rodinia applications that are relatively easy to translate from OpenMP/CUDA to OpenMP 4.5+, thus enabling several different applications to be tested. Furthermore, applications whose performance depends heavily on their input data (e.g., graph traversal applications) were excluded. This choice was made because we want to evaluate the performance of OpenMP-offloading itself: when performance is largely defined by the input instead of the accelerated code, the performance gap we evaluate becomes less meaningful.

Specifically, to assess ease of translation, we used several basic characteristics of each application. These characteristics are presented in Table 3.1 and discussed in detail in the following paragraphs.

Application         LoC*   Pragmas   Kernels   Dwarf, Domain
Vector add          100+   2         2         Linear algebra, Embarrassingly parallel
Needleman-Wunsch    300+   2         2         Dynamic Programming, Bioinformatics
PathFinder          200+   1         1         Dynamic Programming, Grid Traversal
Nearest Neighbour   300+   1         1         Dense Linear Algebra, Data Mining

Table 3.1: Applications and their respective properties

The first characteristic used in the selection process is the number of lines of code in the relevant application file(s). We elected to limit our count to these relevant files, as some applications come with their own input generators, multiple versions of the same application, and so forth. To ensure we only count the more relevant lines of code in our approximation, we kept our count exclusive to the files that contain the kernels and pragmas in the CUDA and OpenMP implementations, respectively.

The second feature applied in the selection process was the number of pragmas and kernels. Due to the limited time available for this project, we only selected applications with 1 or 2 kernels. The third and final characteristic we take into account is that applications should be compute-intensive rather than memory-intensive. This ensures that offloading the application actually benefits from GPU-acceleration.


CHAPTER 4

Empirical evaluation

In this chapter we demonstrate our methods in practice. We first describe the experimental setup and tools used in this project, and further dive into the results of the applications we tested, as presented in Table 3.1.

4.1 Experimental setup and tools

All the experiments presented in this thesis were executed on DAS-5[10], on the VU cluster. For all experiments we used a Titan X GPU. Each experiment has been executed 10+ times, and the reported time is the average of (almost) all runs. We say almost because the two reported times differ slightly: nvprof's results are the average of 10 runs, while the internal timer's results are the average of 9 runs. The first run in our internal-timer measurements is excluded, as we consider it a warm-up run.

To check how the OpenMP-offloading code uses the GPU, we used Nvidia's profiler, nvprof. When enabled at runtime, this profiler provides a detailed report on how the application is executed. From this report, we can extract the time spent on data movement from the device to the host (DtoH) and back (HtoD), as well as the kernel times. An example of nvprof output can be found in Figure 4.1. Additional flags used for nvprof were: --csv, to make the output easier to parse; --log-file, to output the data directly to a file; and -e sm_cta_launched and -e warps_launched to double-check the number of blocks and warps, respectively, used when executing the code.

Some of the graphs presented in this chapter feature a horizontal axis labeled with the number of blocks, the number of teams, or the thread-block size. These merit an explanation. In OpenMP-offloading, an extra clause containing the teams keyword can be added to the pragma. This clause sets the number of teams, each consisting of multiple threads, that are launched and subsequently used to run the offloaded part of the application. CUDA handles things the other way around: we specify a thread-block size, i.e., the number of threads launched per block. The application then calculates how many blocks are needed to cover the full problem size (typically, the problem size divided by the thread-block size).
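A small sketch of the conversion we use to put the two parameter spaces side by side (the function name is our own):

    /* For 'size' elements and a CUDA thread-block size of 'block_size',
     * the CUDA launch needs ceil(size / block_size) blocks; we pass the
     * same number to the OpenMP num_teams() clause so that both runtimes
     * partition the work into a comparable number of chunks.
     * Example: 8192 elements with a block size of 32 gives 256 teams. */
    static int equivalent_num_teams(int size, int block_size)
    {
        return (size + block_size - 1) / block_size;  /* rounded up */
    }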

When reporting the measured execution times, we present two values per combination of problem size and teams/block size. One value is calculated as the average of the execution times reported by nvprof. The second value, called "internal", is obtained by explicitly measuring the execution time with timers added to each application's code. Both times include the execution time of the kernel(s) and the data movements DtoH and HtoD.

The reason we include these two different values is that preliminary experiments showed that the reported times were often much higher than one would assume. After some debugging and in-depth data analysis, we identified the problem: each application, when run for the first time, has a very slow run (often called a "warm-up run"). This slower run is due to the initialization of caches, run-time systems, thread teams, etc. The exact slowdown of this warm-up run differs per application, and further changes with the execution parameters we used, but it can be as large as 1.5-2 times slower than the "regular" runs thereafter.

To combat the negative influence of this first run, we ensured that the accelerated part of the application (i.e., the offloaded part) was executed 10 times by adding a 10-iteration loop around the region of interest. Using the internal timers per iteration, we could collect each iteration's time. In all the following graphs, the "Time internal" value represents the average of the 9 "regular" runs, i.e., the runs after the initial warm-up run.
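A schematic version of this measurement loop (the region-of-interest function is a placeholder; only the run count of 10 and the exclusion of the first run mirror our actual setup):

    #include <stdio.h>
    #include <omp.h>

    #define RUNS 10

    /* hypothetical stand-in for the offloaded region of interest */
    static void offloaded_region(void)
    {
        /* target pragmas and kernel loops live here in the real codes */
    }

    int main(void)
    {
        double total = 0.0;
        for (int run = 0; run < RUNS; run++) {
            double start = omp_get_wtime();
            offloaded_region();
            double elapsed = omp_get_wtime() - start;
            /* run 0 is the slow warm-up run; the reported "Time internal"
             * is the average of the remaining 9 regular runs */
            if (run > 0)
                total += elapsed;
            printf("run %d: %.6f s\n", run, elapsed);
        }
        printf("average of regular runs: %.6f s\n", total / (RUNS - 1));
        return 0;
    }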

When reporting the execution time as measured with the internal timers, the entire offloading process has been executed, which means that our reported time also includes the setup required to start offloading and the processes that bring it to an end. This is why we also kept the execution times reported by nvprof in the graphs. The nvprof times come entirely from within the offloading process and should therefore, in theory, always be lower than those reported by the internal timers.

However, nvprof reports the total time of all 10 runs and, for most applications, it was not feasible to extract the warm-up run from this result. Specifically, this was difficult because the number of kernel executions and data transfers differs per application, problem size and number of teams/blocks used. Automating the gathering of the correct kernel and memory-transfer times would be extremely difficult, because each application and parameter combination would report different traces. We note that this "forced" inclusion of the warm-up run in the nvprof timings explains why, in some graphs, nvprof actually reports a higher total execution time than our internal timers. In these cases the warm-up run skews the result so much that it makes the average performance of the 10 runs, as reported by nvprof within the offloading process, worse than the time reported by our internal timers.

4.2 Parallelization challenges

While porting our sequential applications to OpenMP-offloading applications, we encountered some difficulties in keeping the processing as similar to CUDA as possible for each application. OpenMP handles parallelism by distributing the work of (mainly) for-loops, so we decided to start with the pragma #pragma omp target teams distribute parallel for. This pragma indicates that we want to offload (omp target) to a device, in our case the GPU. It then starts teams of threads that run on each compute unit, to which it distributes parts of the loop. Finally, parallel for instructs the processing elements in each compute unit to run the part of the loop assigned to them.
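To make the role of each keyword concrete, here is a minimal (invented) loop annotated clause by clause:

    /* target       -> offload the region below to the device (the GPU)
     * teams        -> start a league of thread teams, one per compute unit
     * distribute   -> divide the loop iterations over those teams
     * parallel for -> and further over the threads inside each team */
    void scale(float *x, int n, float factor)   /* illustrative loop only */
    {
        #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
        for (int i = 0; i < n; i++)
            x[i] *= factor;
    }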

However, we found that using this as the sole pragma does not work for every application. More often than not, using a standard approach without respecting the data structures and the necessary movement of those data can result in critical failures. This is exacerbated by the fact that a lot of offloading pragma options cannot be directly combined with this one-liner. Often we have to split the pragma into multiple pragmas, so that we can add varying team sizes or data mappings.


Figure 4.1: nvprof's output

4.3 Vector add

We start our comparison of OpenMP and CUDA with vector add, a very simple application that enables us to quickly test different input sizes in combination with either team sizes (for OpenMP) or thread-block sizes (for CUDA). Although thread-block sizes and team sizes are fundamentally opposite ways of handling parallelism, with a little extra computation on our end they result in a comparable distribution of the computational workload. For a direct comparison, we divide the problem size by the (thread) block size to get the respective number of teams. For simplicity's sake we report the matching combinations in the same place in the graph, so the first bar in Figure 4.3 (8192-256) compares directly to the first bar in Figure 4.2 (8192-32). Vector add also serves as a good baseline comparison with CUDA: the simplicity of the application means that many optimizations other CUDA applications use (mainly shared memory) do not skew our results in favour of the CUDA implementation. Because shared memory is not easily usable in OpenMP, it often gives the CUDA implementation a big performance boost that cannot be "replicated" in the OpenMP-offloading versions. Because there is little optimization to be added to vector add, we expect the performance of the two GPU-enabled versions to be fairly similar.

For vector add we tested 7 input sizes, which are the number of elements in each of the two vectors being added, varying from 8192 up to 1996080. This large upper limit is only possible because vector add is so simplistic; using similarly large values in some other applications could result in execution times far above acceptable limits. Alongside these varying input sizes, we varied the thread block sizes and team sizes, aiming to determine the impact of this parameter on the execution times.

The results of the OpenMP-offloading version can be found in Figure 4.3, and those of the CUDA version are presented in Figure 4.2.

The pragmas used in the OpenMP version are shown in pseudocode below:

    #pragma omp target teams num_teams(numteams) \
        map(to: a, b, size) map(from: c)
    {
        #pragma omp distribute parallel for
        for (i = 0; i < size; i++)
            c[i] = a[i] + b[i];
    }


Figure 4.2: The performance of vector-add using CUDA for each combination of problem size and block size.


Figure 4.3: The performance of vector-add using OpenMP offloading for each combination of problem size and number of teams.

Based on these results, we can draw several conclusions. First, with increased problem size comes increased execution time. We simply spend more time copying memory back and forth and executing calculations.

Next, we notice that CUDA performs fairly stably across different block sizes, while OpenMP's performance shows a definite preference for larger team sizes. However, the performance of the best OpenMP configuration is actually very similar to CUDA's, often being well within 10% of it. The performance of the worst OpenMP configurations is visibly worse, though, being up to 2.5 times slower. The inclination towards fewer teams makes sense: using a very small team not only requires more teams to execute, and thus more overhead, but it also means that each team performs only a few operations, so there is no intense computation to warrant its initial overhead. Our previous expectation - that nvprof reports a faster execution time than the internal timers - also holds true.

Comparing the best OpenMP time for each problem size to its respective (same team/block size) CUDA time, we see that they are almost always within 10% of each other, or at the very least extremely close to that. Comparing the best OpenMP time to the best CUDA time for each problem size gives a similar result. In the cases where CUDA's best configuration is not also OpenMP's best configuration, the gap obviously increases, but the execution times are still fairly similar.


4.4 Needleman-Wunsch (NW)

4.4.1 Basic NW - version 1

Needleman-Wunsch is a much more compute-intensive application than vector add, which means it is much harder to run large problem sizes with a bad algorithm, but there is also much more room for improvement, as we can see in the improved versions.

The CUDA version of NW unfortunately only offered 3 possible block sizes (16, 32 and 64). For fairness, we also only ran the OpenMP experiments with comparable numbers of teams. For the problem sizes we started with a base of 512, which is the 1D size of the 2D array on which the actual computation is performed. For larger problem sizes, we simply double this value to obtain the next size, double that to obtain the third one, and so forth. In the CUDA version, Figure 4.4, we stopped at a problem size of 16k. This was not because the CUDA execution times were too slow, but because the OpenMP-offloading version became too slow even at 8k.

We expect CUDA to outperform OpenMP heavily, especially in the early, non-improved versions. CUDA has the added advantage of several optimizations, one of the most important being the use of shared memory.

The first results for the basic version can be found in Figure 4.5. As expected, even the small problem sizes take a long time to run.

The pragma used in this first version is shown in the pseudocode below. From this pseudocode we can clearly see the root of the inefficiency problem: we move the entire arrays (with the map() clauses) back and forth between the host and device in each iteration of the loop over the maximum number of columns.

    for (i = max_columns; i >= 0; i--)
        #pragma omp target teams num_teams(omp_num_teams) \
            map(to: reference[0 : max_cols * max_rows]) \
            map(tofrom: input_itemsets[0 : max_rows * max_cols])
        #pragma omp distribute parallel for
        for (idx = 0; idx <= i; idx++)
            calculation_kernel

Comparing this to the pseudocode of vector add, we observe a possibility for improvement: moving the data to the device once, before the first for loop, would save us a lot of the time spent on moving redundant data there every iteration. Combining this with moving the data back to the host only once should render immediate improvements in performance. Note that, as mentioned in Table 3.1, the NW application has 2 kernels, the above one being the first, with a second, very similar one coming immediately after.


Figure 4.5: The performance of NW-V1 in OpenMP for small problem sizes and number of teams.

Comparing Figure 4.4 to Figure 4.5, we see exactly what we expected: OpenMP struggles with even the smallest problem sizes because of the many unnecessary data movements. CUDA, in the meantime, makes great use of small optimizations and has no problem running problem sizes 16 to 32 times larger in a fraction of the time.

Upon inspection, the CUDA code confirms that this version makes use of shared memory, something that will always give CUDA a significant performance boost. An additional optimization technique employed by CUDA is tiling; in essence, this technique changes the order in which the elements of an array are accessed. Traditionally we access elements in an array either row- or column-wise. With tiling, the array is divided into square chunks, which are essentially smaller arrays, so that we can make better use of caches, registers and locality of data access. This version of NW is also the first time we can see the influence of the warm-up run in a graph: especially for the larger problem sizes, nvprof reports a slower time than our internal timers, something that should in theory never happen.

The application's total execution time clearly grows quadratically with the problem size. This holds true for both the terrible OpenMP version and the CUDA version, and is largely due to the increasing memory transfers needed. The quadratic growth also makes a lot of sense: our problem size is the 1D dimension of a 2D array, so, by design, doubling the problem size quadruples the actual array size.

Comparing the best times of the OpenMP version to those of CUDA is possible, but the perfor-mance gap between these versions is so huge that we will not include any specific data: compared to the NW-V1 OpenMP version, CUDA is around 1000x faster.

4.4.2 Improved NW - version 2

The biggest improvement to be made was almost immediately obvious: moving data in every iteration of a for loop is never a good idea unless strictly necessary. This is where we made our first improvement, and, as can be seen in Figure 4.6, it made a huge difference in performance when compared to Figure 4.5 - for the smaller sizes we see a 10x improvement, while the larger sizes only become feasible with this version. The exact alteration of our pragmas can be seen in the pseudocode below.

    #pragma omp target data \
        map(to: reference[0 : max_cols * max_rows]) \
        map(tofrom: input_itemsets[0 : max_rows * max_cols])
    for (i = max_columns; i >= 0; i--)
        #pragma omp target teams distribute parallel for num_teams(omp_num_teams)
        for (idx = 0; idx <= i; idx++)
            calculation_kernel

Figure 4.6: The performance of NW-V2 in OpenMP for small-medium problem size and number of teams.

Moving the data mapping outside the outer loop means we move the data to the device once, at the start of our offloading, and back to the host once, after the calculations have been completed. This improved performance significantly: in the best case we can run a problem size twice as large in less time. We still see the same trends as in vector add; small numbers of teams generally perform worse. We can also clearly see that we still remain far from CUDA's performance, and we hypothesize that without using shared memory and tiling we will not be able to reach speeds similar to CUDA's.

However, the CUDA-OpenMP performance gap did close quite a bit compared to the first iteration. The best OpenMP configurations for the larger problem sizes are now around 20-40 times slower than their respective CUDA versions. Comparing best configuration to best configuration, this grows to around 22-45 times slower, and for the low problem sizes OpenMP can be up to 100 times slower.


4.4.3 Improved NW - version 3

For the final improvement to NW, we moved the data once for both kernels. In the previous versions, we mapped data once before and after the first kernel, and then did the same for the second kernel. In V3, we move the mapping outside both kernels, giving us another small boost in performance, as can be seen in Figure 4.7. The pseudocode of this improvement is shown below. Notice the use of brackets on otherwise empty lines; this is vital to make the data mapping successful. Everything between the brackets is seen as part of the code that requires data mapping; without the brackets the data mapping would only succeed for the first kernel, or not at all.

    #pragma omp target data \
        map(to: reference[0 : max_cols * max_rows]) \
        map(tofrom: input_itemsets[0 : max_rows * max_cols])
    {
        for (i = max_columns; i >= 0; i--)
            #pragma omp target teams distribute parallel for num_teams(omp_num_teams)
            for (idx = 0; idx <= i; idx++)
                calculation_kernel_1

        for (i = max_columns; i >= 0; i--)
            #pragma omp target teams distribute parallel for num_teams(omp_num_teams)
            for (idx = 0; idx <= i; idx++)
                calculation_kernel_2
    }

Figure 4.7: The performance of NW-V3 in OpenMP for small-medium problem size and number of teams.

This version features only very minor improvements, and, as such, it is no surprise that the performance gain is fairly minimal - we see a 3-7% improvement over the previous version. Comparing the respective OpenMP-V3 and CUDA configurations gives a 25-35 times faster CUDA version for the larger problem sizes. Comparing best to best gives a slightly worse speed-up of 28-37 times for the larger problem sizes. The lower problem sizes are again much slower, around 50-80 times.

4.4.4 Downgrading the CUDA implementation

In a final attempt to close the gap between the OpenMP version and the CUDA version, we attempted to reduce the performance of the CUDA kernel by reversing some of its optimizations. When we removed the use of shared memory, the CUDA performance degraded visibly (up to 2-3x, depending on the input sizes). This behaviour provides empirical evidence that a large part of the performance gap is due to these CUDA-specific optimizations. As reversing all optimizations would effectively amount to re-implementing the CUDA code from scratch, we did not pursue this further, and focused on the other applications instead.

4.5 Pathfinder

Turning Pathfinder (PF) into an offloading-enabled application turned out to be trickier than the previous applications, because the sequential implementation was not ready to be turned into an offloading implementation from the start. Several dependencies were found, and the code was altered to allow for offloading. The CUDA version again had block-size restrictions, although not as strict as NW's. This time we could not use a block size of 32, so, instead, we use block sizes from 64 to 1024, and the respective numbers of teams in the OpenMP versions. For input, the application originally used a base width of 10000; this is the width of an array-like structure through which the algorithm tries to find the least costly path. We elected to use a base of 10240 instead. This is neatly divisible by all the different block sizes and is therefore guaranteed not to leave part of a block unutilized, or, worse yet, leave part of the calculations undone. We then stepped this value up in factors of 10, because the application ran quite well for problem sizes up to a million.

4.5.1 Base version of PF - version 1

The first offloading version of PF has the same problem as NW-V1: we move data from within a for loop. This is, of course, also the improvement we make in version 2. Still, this version provides a realistic baseline (i.e., beginner programmers could easily make this mistake) to build our improvements upon. The results of the first iteration in OpenMP can be found in Figure 4.9, and we compare them against the performance of the CUDA version, presented in Figure 4.8. An overview of the algorithm in pseudocode can be found below.

    for (i = 0; i < max_rows; i++)
        #pragma omp target teams distribute parallel for num_teams(omp_num_teams) \
            map(tofrom: t) \
            map(to: rows, cols, data[0 : rows * cols], src[0 : cols]) \
            map(from: dst[0 : cols])
        for (idx = 0; idx < max_cols; idx++)
            calculation_kernel


Figure 4.8: The performance of PF in CUDA for each combination of problem size and block size.

Figure 4.9: The performance of PF-V1 in OpenMP for each combination of problem size and number of teams.

Comparing the performance in the two figures - Figure 4.8 vs. Figure 4.9 - we observe that there is significant room for improvement for OpenMP. However, we can also observe a similar growth in execution time for both versions when increasing the problem size: we see a 10x increase in execution time for a 10x larger problem size, which confirms that the linear complexity of the code is preserved in both versions.

The optimizations to be made again have much to do with our memory transfers: if we can move the data transfers outside the loops, as we did for NW, we should be able to cut down the execution time and, therefore, see a significant boost in performance. The effect of larger numbers of teams is subtler than in NW, although still present, for mostly the same reasons. Finally, the nvprof-reported time for the CUDA version is slower than our internal timer, which is again due to the warm-up run skewing the average.

Again, for the CUDA version, we observe little difference in performance under different block sizes. A possible explanation of why OpenMP does seem sensitive to this is that teams are simply very costly to launch, synchronize and manage. The CUDA version starts out with nvprof being consistently faster than our internal timers, but, for the largest problem size, this changes. This hints at memory transfers being a large part of why our first run is so slow. After some research, it would seem that veteran CUDA developers agree: a warm-up run, or the "cooking" of a kernel, is fairly commonplace. It serves the same purpose as our exclusion of the first run. The benefits are especially worthwhile when the kernel execution is in the order of milliseconds, as most of our kernels are. As mentioned before, possible reasons for the slower first run include the warming up of caches, just-in-time compilation (the compilation of code at run time), the transfer of the kernel to the GPU, or the GPU being in a power-saving mode.

Performance-wise, CUDA once again significantly outperforms the naive OpenMP version. Comparing each respective configuration gives a speedup of around 75 times for the smallest size, 150 for the intermediate one, and 138 for the largest. Comparing the best to the best gives very similar results, as the CUDA results are extremely stable under changes in block size.

4.5.2 Improved PF - version 2

Implementing more efficient data movements, similar to those we used for NW, leads to an immediate improvement of the OpenMP-offloading PF, as seen in Figure 4.10. Specifically, we observe a speedup of around 10x compared to the first, naive version. We also note the same clear sensitivity to the number of teams that we observed in the other OpenMP applications. Performance, unfortunately, is still around 10x worse than the CUDA implementation, but we did close the gap by a factor of 10 simply by moving our data in a more efficient manner. We again speculate that the remaining performance gap is largely due to CUDA making great use of shared memory and tiling, something that is not easily doable in OpenMP.

Comparing against the best OpenMP configuration, we now see a CUDA speedup of around 7.5 times for the largest problem size, 10 times for the intermediate one, and 12 times for the smallest. A best-to-best comparison shows very similar results, as the CUDA results do not change when the block sizes change.

    #pragma omp target data map(to: rows, cols, data[0 : rows * cols], src[0 : cols]) \
        map(from: dst[0 : cols])
    for (i = 0; i < max_rows; i++)
        #pragma omp target teams distribute parallel for num_teams(omp_num_teams) \
            map(tofrom: t)
        for (idx = 0; idx < max_cols; idx++)
            calculation_kernel


Figure 4.10: The performance of PF-V2 in OpenMP for each combination of problem size and number of teams.

4.5.3 Degrading CUDA performance

Upon inspection, we observed that the CUDA version uses three specific optimizations: it uses tiling, it uses shared memory, and it reduces the number of kernel invocations. We again attempted to remove these optimizations from the CUDA version, to see whether this would close the gap between CUDA and the OpenMP-offloading version. In this case, the "naive CUDA" version is only around 30% slower than the optimized version. This small gap indicates that the optimizations alone cannot be assumed to be responsible for the majority of the performance gap. Instead, it is likely that there are additional implementation differences between the code generated by the CUDA compiler and the code OpenMP generates. The analysis of these differences is beyond the scope of this thesis and is left as future work.

4.6 Nearest neighbour (NN)

Nearest Neighbour was the second application that required some rewriting of the code to enable offloading. This was because some of the functions used in the original CPU implementation were simply not available on the GPU (e.g., sscanf and sqrt), which meant that, in order to successfully test an offloading-enabled version, some of the data structures and the way the overall data was stored in memory had to be rewritten.

The results of the CUDA version can be found in Figure 4.11, those of OpenMP in Figure 4.12, and the pseudocode of the offloading kernel is shown below.

    #pragma omp target teams distribute parallel for num_teams(teams) \
        map(tofrom: kernel_results[0 : size]) \
        map(to: target_longitude, target_latitude, coordinates[0 : size])
    for (i = 0; i < size; i++)
        calculation_kernel

Figure 4.11: The performance of NN in CUDA for each combination of problem size and block size.


Figure 4.12: The performance of NN in OpenMP for each combination of problem size and number of teams.

Comparing our OpenMP-offloading implementation to the CUDA version shows familiar results: larger numbers of teams perform worse than smaller numbers, nvprof reports a lower execution time than our internal timers, and increasing the problem size increases the execution time about as much as we would expect given the linear complexity of the problem.

Comparing the best OpenMP configurations to their respective CUDA configurations also gives familiar results: the OpenMP-offloading version is around 4-5 times slower than CUDA. Comparing best to best gives the same result because, again, the CUDA version shows very stable performance when changing block sizes.

The reason this first implementation is already closer to the CUDA version than the final versions of the other applications has much to do with the application itself. The single for loop around the kernel makes data movement possible in only one way, which means there is less optimization required to reach the 'best' result.


CHAPTER 5

Conclusion

Using GPUs for accelerating computation is a common trend in many applications and domains. However, programming GPUs using native models such as CUDA remains difficult. In an effort to reduce these challenges, simpler methods have been designed, especially aiming to aid porting existing code to GPU-accelerated code. OpenMP-offloading is such a method. In this work, we aimed to understand how easy OpenMP-offloading really is when aiming to achieve the same level of performance as native models such as CUDA.

Using OpenMP for offloading sounds straightforward: simply add pragmas to existing sequen-tial code and reap the benefits. The reality, however, is that offloading brings a lot of challenges to the table that sequential and CPU-parallel codes never have to deal with. The GPU having its own distinct memory is not only the source of a lot of frustration when it comes to porting applications for offloading, it is also the reason performance is hampered by memory transfers.

The need for GPU offloading is clear, and its benefits are undeniable. How we offload is the real choice, and OpenMP has a lot of potential to provide an easier way to reap those benefits. There are, however, some issues that need working out to realize OpenMP's potential. Moreover, as a newer addition to the GPU-programming scene, OpenMP-offloading provides less documentation and has a less developed community available for help. What adds to the frustration is that there are quite complex rules on the ways in which pragmas can, and cannot, be combined. All of these programmability lapses mean that one of the major benefits OpenMP should offer, its ease of use, is partially negated.

5.1 Main findings

This work was driven by four different research questions. We present our findings for each subquestion and, finally, present the answer to the main research question.

SQ1. We first set out to determine which compilers are suitable for enabling OpenMP offloading. We found that both gcc and clang can offload computation to GPUs, but the installation and use of GCC proved more cumbersome. Moreover, as the two compilers require different flags and libraries, all the results presented in this thesis are from applications compiled with clang.

SQ2. We selected a set of four applications - one very simple benchmark and three representative applications from the benchmarking suite Rodinia. We found that there are different challenges when porting these applications, and we illustrated different solutions for OpenMP-offloading. As our applications are sourced from a benchmarking suite, we consider them inherently representative of real problems. In addition, we use the simple vector add benchmark to give a more bare-bones comparison of CUDA to OpenMP. Given that there is very little room for fancy improvements using tiling, shared memory and the like, we showed that OpenMP does not inherently perform much worse than CUDA.

SQ3. The measured performance gap between OpenMP-offloading and CUDA ranges from within roughly 10% for a simple application such as vector add, where little optimization is available, to factors of 10 worse when many optimization techniques are used. Utilizing smarter data movements in OpenMP has always closed this performance gap significantly, but never completely.

SQ4. We have seen several reasons for the CUDA-OpenMP performance gap. The most important one remains the high risk of inefficient data transfers, which requires specific programming skills to avoid. Added to this are the inherent restrictions the OpenMP model puts in place by design, as its parallelization and offloading are automated in a loop-based manner. Finally, we can also add the lack of standard mechanisms to enable shared memory, the lack of easy-to-implement (loop) tiling, the absence of control over the number of kernels to be launched, the less comprehensive and flexible data movements, and the apparent management cost of launching many teams. As stated before, we hypothesize that without an effective integration of all these elements in OpenMP, the performance gap between OpenMP-offloading and CUDA is here to stay.

In summary, our results indicate that, for non-trivial applications, there is a large performance gap between CUDA and OpenMP-offloading, and its origins lie both in the programming model and in the current compilers that support it. Finally, we also note that, due to the complex pragmas and the limited GPU support for some of the CPU library functions, the programmability of OpenMP offloading is not yet as good as that of the classical CPU-based version.

5.2 Future work

The most important findings of this work are (1) that offloading constructs are not as trivial as expected, and (2) that the performance of the OpenMP-offloading and CUDA versions can differ significantly. To gain further understanding of these issues, further research should focus on evaluating the performance gap between the two models for more applications and more compilers. Moreover, better guidelines for the different offloading pragmas and parallelizations should be extracted, to allow users to avoid the potentially unfortunate combinations.

Currently there is research being done into enabling the use of shared memory[11] in OpenMP along with other techniques already available in CUDA. In the future these might feature in the out-of-the-box version of OpenMP, although some techniques might come with restrictions around how and on what device these can be used.

Additionally, different compilers that support OpenMP offloading are being developed; when they become available, they offer another interesting avenue for further research.


Bibliography

[1] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in 2009 IEEE International Symposium on Workload Characterization (IISWC), IEEE, 2009, pp. 44–54.

[2] J. M. Diaz, K. Friedline, S. Pophale, O. Hernandez, D. E. Bernholdt, and S. Chandrasekaran, “Analysis of OpenMP 4.5 offloading in implementations: Correctness and overhead,” Parallel Computing, vol. 89, p. 102546, 2019.

[3] H. Bal, D. Epema, C. de Laat, R. van Nieuwpoort, J. Romein, F. Seinstra, C. Snoek, and H. Wijshoff, “A medium-scale distributed system for computer science research: Infrastructure for the long term,” Computer, vol. 49, no. 5, pp. 54–63, 2016.

[4] S. McIntosh-Smith, M. Martineau, A. Poenaru, and P. Atkinson, “Programming your GPU with OpenMP,”

[5] J. Shen, J. Fang, H. Sips, and A. L. Varbanescu, “Performance gaps between OpenMP and OpenCL for multi-core CPUs,” in 2012 41st International Conference on Parallel Processing Workshops, 2012, pp. 116–125. doi: 10.1109/ICPPW.2012.18.

[6] J. Fang, A. L. Varbanescu, and H. Sips, “A comprehensive performance comparison of CUDA and OpenCL,” in 2011 International Conference on Parallel Processing, 2011, pp. 216–225. doi: 10.1109/ICPP.2011.45.

[7] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron, “A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads,” in IEEE International Symposium on Workload Characterization (IISWC’10), IEEE, 2010, pp. 1–11.

[8] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.

[9] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” in Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, 1967, pp. 483–485.

[10] H. Bal, D. Epema, C. de Laat, R. van Nieuwpoort, J. Romein, F. Seinstra, C. Snoek, and H. Wijshoff, “A medium-scale distributed system for computer science research: Infrastructure for the long term,” Computer, vol. 49, no. 5, pp. 54–63, 2016, issn: 0018-9162. doi: 10.1109/mc.2016.127. [Online]. Available: http://doi.org/10.1109/mc.2016.127.

[11] G.-T. Bercea, C. Bertolli, A. C. Jacob, A. Eichenberger, A. Bataev, G. Rokos, H. Sung, T. Chen, and K. O’Brien, “Implementing implicit OpenMP data sharing on GPUs,” in Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC, 2017, pp. 1–12.
