
Bachelor Informatica

From CUDA to OpenACC in

Graph Processing Applications

Wouter de Bruijn

June 17, 2019

Informatica
Universiteit van Amsterdam

Abstract

Most GPU-accelerated programs are written using the NVIDIA proprietary API CUDA. CUDA has an extensive collection of users and libraries, but functions only on NVIDIA GPUs and is completely proprietary. This thesis proposes standard ways to convert CUDA to the higher level programming model OpenACC, examines the difficulty of this process, and analyzes the performance of the final converted program.

We have applied our porting methodology to two different graph processing algorithms. We chose Breadth First Search for its relative simplicity and large amount of memory operations, and PageRank for its combination of memory operations and computational sections. The results show that OpenACC was significantly faster than CUDA for PageRank, and was more or less tied with CUDA in Breadth First Search.

In the end, the performance of OpenACC was close enough to CUDA in most cases, and actually faster than CUDA in one case. OpenACC did, however, lag behind OpenMP in performance and consistency on multi-core CPUs.

Our systematic process of porting CUDA to OpenACC was therefore successful for the two different graph processing algorithms. The OpenACC ecosystem does still suffer from a lack of user support and documentation, which makes the process of writing larger and more complicated OpenACC programs more difficult than it should be for (beginner) programmers.


Contents

1 Introduction
  1.1 Context
  1.2 Research Questions
  1.3 Thesis structure

2 Background & Related Work
  2.1 Background
    2.1.1 CUDA, OpenACC and OpenMP
  2.2 Definitions
  2.3 Related Work

3 Methodology
  3.1 Measuring performance
  3.2 Measuring ease-of-use
  3.3 Performance Analysis and Debugging

4 Breadth First Search
  4.1 The Algorithm
  4.2 Porting
    4.2.1 Porting memory transfers
    4.2.2 Porting compute operations
  4.3 Performance
  4.4 Optimising
    4.4.1 Diagnosing and patching
    4.4.2 Optimised results

5 PageRank
  5.1 The Algorithm
  5.2 Porting
    5.2.1 Porting memory transfers
    5.2.2 Porting compute operations
  5.3 Performance
  5.4 Optimising the OpenACC implementation
    5.4.1 Diagnosing and patching
    5.4.2 Optimised results

6 Multithreading OpenACC
  6.1 OpenMP vs. OpenACC
  6.2 Results
    6.2.1 Unoptimised OpenACC
    6.2.2 Optimised OpenACC

7 Conclusion and Future Work
  7.1 Discussion
  7.2 Conclusion
    7.2.1 Systematic way of porting
    7.2.2 Difficulty of OpenACC
    7.2.3 Performance benefits or drawbacks
    7.2.4 OpenMP vs. OpenACC
  7.3 Future Work

8 Appendices
  8.1 Appendix A: CUDA max-reduce
  8.2 Appendix B: Graph sizes
  8.3 Appendix C: Raw unoptimised GPU performance numbers
    8.3.1 graph500
    8.3.2 KONECT
    8.3.3 SNAP
  8.4 Appendix D: Raw optimised GPU performance numbers
    8.4.1 graph500
    8.4.2 KONECT
    8.4.3 SNAP
  8.5 Appendix E: Raw unoptimised CPU performance numbers
    8.5.1 graph500
    8.5.2 KONECT
    8.5.3 SNAP
  8.6 Appendix F: Raw optimised CPU performance numbers
    8.6.1 graph500
    8.6.2 KONECT

CHAPTER 1

Introduction

1.1 Context

As we continue to increase the amount of data we collect and process, we need more efficient ways of storing and processing that data. When data consists of entities which can have some sort of relationship between them, a graph can be used to model the data. A graph is a mathematical structure consisting of vertices and edges, where each edge connects a pair of vertices. A large variety of data collected today can be represented using graphs, including data from fields like social networks, linguistics, physics, chemistry and other computational sciences [21] [23]. Currently, a lot of graph processing is done using GPU acceleration, as this can bring major performance benefits.

Most GPU-accelerated programs are written using the NVIDIA proprietary API CUDA. CUDA is a widely used and supported programming model used for programming NVIDIA GPUs. The downside of CUDA, however, is that it functions only on NVIDIA GPUs and is completely proprietary to NVIDIA. Moreover, it requires a different way of thinking compared to normal sequential CPU code. For this reason, it would be beneficial to have a higher level and portable API that can be used instead of CUDA without any loss of performance.

In this work, we choose OpenACC as the high-level API. OpenACC is a completely portable acceleration API which aims to let programmers write accelerated code once, and then have the compiler compile it for any type of accelerator. This includes GPUs from potentially any brand, multi-core CPUs and large compute clusters.

1.2 Research Questions

This thesis aims to determine whether high-performance OpenACC code can be derived from CUDA code in a systematic manner. We will then investigate the portability of this new OpenACC code, by comparing it to CPU multithreaded OpenMP. To investigate this properly, we aim to answer the following questions:

1. Can we create a systematic way of porting CUDA to OpenACC, and how easy is this porting process?

2. What is the difficulty of implementing our algorithms in OpenACC compared to the difficulty of implementing our algorithms in CUDA?

3. What are the performance benefits or drawbacks to using the high-level portable programming model OpenACC instead of the proprietary CUDA API?

4. How does the multi-core CPU performance of our OpenACC code compare to that of equivalent OpenMP code?


We answer these questions by porting two graph processing algorithms written in C and CUDA to C and OpenACC, and reviewing this process. We identify any quirks or difficulties encountered during this process in order to answer research question 2. In order to answer research question 1, we search for patterns of CUDA calls that are (or should be) always replaced by the same OpenACC code; these become systematic one-to-one translations. We further compare the runtime of the original code and the ported code to answer question 3. Finally, in order to answer research question 4, we compare the performance of both of our OpenACC benchmarks against similar OpenMP code.

1.3 Thesis structure

This thesis is structured in a per-algorithm way. We start with a background chapter (see chapter 2) containing short explanations of the CUDA, OpenACC, and OpenMP APIs, as well as work related to this thesis. Then, in chapter 3, we describe our exact testing and investigation methodologies. Further, for each of the two algorithms, we describe the porting process and any difficulties encountered during this process (see chapters 4 and 5). For each algorithm, we also include the performance comparison of the two versions. We explain any performance differences, and improve the ported code based on this investigation. After discussing the main algorithms, we select PageRank as a case study for an in-depth comparison against the OpenMP version of the algorithm, running on a CPU (see chapter 6). Finally, we conclude this thesis with a summary of our findings and provide suggestions for potential future research (see chapter 7).


CHAPTER 2

Background & Related Work

2.1 Background

Graph processing is becoming increasingly relevant for many scientific and daily life applications. The massive size of some of the graphs around us (social networks or infrastructure networks) requires high-performance graph processing. Many frameworks have been devised to help users write algorithms from scratch [22] [7] [25] [26], but there are also a lot of high-performance C and CUDA implementations of graph processing applications. These applications are, however, difficult to read, modify, and/or maintain by regular users who do not have experience with those specific frameworks. Thus, obtaining higher-level versions of these codes is desirable for many domains and users. Obtaining these versions is the main goal of our work.

2.1.1 CUDA, OpenACC and OpenMP

In this section we explain (in short) the differences between the three APIs we are using in this work.

CUDA

CUDA is a proprietary NVIDIA API for the C, C++ and Fortran programming languages which can be used to program NVIDIA GPUs. It has an extensive collection of libraries, a large amount of support behind it, and a sizeable community. These things make CUDA one of the most attractive choices when programmers need to use GPU acceleration in their applications.

On a basic level, CUDA works by having the programmer manually write CUDA kernels, which are basically small functions that run on the GPU. These kernels are then called from the "normal" code running on the CPU. CUDA is a low-level API in the sense that the programmer always has to manually manage memory transfers between the GPU and the rest of the machine, and continually has to think in a GPU-centric way when programming these kernels. Although the code in the CUDA kernels is just C++, the programmer has to consider the inner workings of the GPU while writing these kernels to avoid the performance penalties of, for example, warp divergence and branch divergence [6].

Thus, the low level of abstraction and the tight coupling with the hardware make it harder for a programmer used to writing standard CPU-bound sequential (or even parallel) code to create efficient CUDA code.
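As an illustration of this programming model, the following minimal CUDA vector-addition sketch (our own example, not code from the thesis) shows the kernel together with the explicit device allocations and transfers the programmer has to manage:

#include <stdlib.h>
#include <cuda_runtime.h>

#define VSIZE (100000)

// Kernel: each thread adds one pair of elements.
__global__ void vec_add(const int *a, const int *b, int *c, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if(i < n) {
        c[i] = a[i] + b[i];
    }
}

int main(void) {
    int *a = (int*)malloc(VSIZE * sizeof(int));
    int *b = (int*)malloc(VSIZE * sizeof(int));
    int *c = (int*)malloc(VSIZE * sizeof(int));
    for(int i = 0; i < VSIZE; i++) { a[i] = 10; b[i] = 20; }

    // The programmer manages all device memory and transfers explicitly.
    int *a_dev, *b_dev, *c_dev;
    cudaMalloc(&a_dev, VSIZE * sizeof(int));
    cudaMalloc(&b_dev, VSIZE * sizeof(int));
    cudaMalloc(&c_dev, VSIZE * sizeof(int));
    cudaMemcpy(a_dev, a, VSIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(b_dev, b, VSIZE * sizeof(int), cudaMemcpyHostToDevice);

    // Launch enough 128-thread blocks to cover every element.
    vec_add<<<(VSIZE / 128) + 1, 128>>>(a_dev, b_dev, c_dev, VSIZE);

    // Copy the result back and release the device memory.
    cudaMemcpy(c, c_dev, VSIZE * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(a_dev); cudaFree(b_dev); cudaFree(c_dev);

    free(a); free(b); free(c);
    return 0;
}

Even for this trivial computation, the grid configuration, the duplicated pointers and the transfer calls are all the programmer's responsibility.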

OpenACC

OpenACC is a portable, higher-level API for accelerating a program. Being a compiler-directive-based API, OpenACC works by instructing the compiler to offload certain sections of code to the accelerator device. In practice this means that the programmer can take sequentially written code, and have the compiler transform that code to the proper API for the target device. This approach has the major advantage of being able to accelerate most code by a small amount, as most sequentially written code has at least a couple of sections that can be parallelised. It also means that it is much easier to add OpenACC acceleration to existing code than it is to accelerate existing code using CUDA.

There are a couple of drawbacks to this compiler-centric approach, however. First, the final performance of the OpenACC accelerated code is very heavily dependent on the compiler being used [16]. This makes the choice of compiler much more important than it usually is, as a compiler switch can bring major performance gains or losses.

Second, when the compiler doesn’t understand the structure of the code to be accelerated, it might not parallelise the code in the way that one might expect. This can result in entire sections not being transformed or not working anymore, and a long struggle to adapt the code to the point where the compiler understands how certain sections of code can be parallelised. This might result in much of the original code of the program having to be rewritten in order to take full advantage of OpenACC acceleration. On top of that, the compiler also has to manage memory access to and from the GPU if the programmer does not explicitly handle this.

The OpenACC API does contain a number of options and directives to control the offloading of data and code to and from the GPU more precisely. This makes it easier for the compiler to offload the code, but requires more programmer skill and code modification.

To summarise, OpenACC aims to be both simpler to reason about than CUDA, and simpler to add to existing codebases than CUDA.

OpenMP

Like OpenACC, OpenMP is a directive-based API. It shares most of the benefits and drawbacks that OpenACC has, but has the additional benefit of being much more mature and much more widely implemented. As such, it enjoys better compiler support and optimisations. The major difference is that instead of being made for writing code for (potentially) any acceleration device, it was mainly focused on just multithreaded CPUs, with OpenMP gaining GPU support only very recently in version 4.5 of the OpenMP standard [18].

Due to its syntactical similarity to OpenACC and its focus on multithreaded CPU acceleration, OpenMP is a prime candidate for comparison with OpenACC's multithreaded CPU support.
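As a brief illustration of that syntactical similarity, a minimal OpenMP sketch (our own example, not taken from the thesis benchmarks) that sums two vectors on a multi-core CPU looks very much like the OpenACC directives shown in the next section:

#define VSIZE (100000)

int sum_vectors(const int *vec_a, const int *vec_b) {
    int final = 0;

    // One directive: OpenMP splits the iterations over the CPU threads and
    // combines the per-thread partial sums at the end of the loop.
    #pragma omp parallel for reduction(+:final)
    for(int i = 0; i < VSIZE; i++) {
        final += vec_a[i] + vec_b[i];
    }

    return final;
}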

2.2 Definitions

The OpenACC API contains a number of new constructs and compiler directives. In this section, we explain most of the ones relevant to this thesis.

In figure 2.1 we provide a simple example application which contains most of the important OpenACC compiler directives that we use in this thesis. In order to explain them, we start at the main function. In this function we see three compiler directives. The directive

#pragma acc enter data copyin(vec_a[0:VSIZE], vec_b[0:VSIZE])

is called a data directive, and is an example of explicit data movement: we explicitly state that elements 0 through VSIZE of both vectors should be copied to the accelerator device.

Directive

#pragma acc kernels

is the simplest and broadest directive that OpenACC provides. It basically tells the compiler that all code in the following region should be scanned for possible parallelism, and then be parallelised by the compiler. This results in the compiler doing all of the data movement, and all of the compute porting.

The final directive in the main function is

#pragma acc exit data delete(vec_a, vec_b)

which is also an explicit data movement directive. In this case, it tells the compiler to free both vectors from the accelerator device.

Moving to the next function, we see that it has four compiler directives. The first directive is

#pragma acc data present(vec_a[0:VSIZE], vec_b[0:VSIZE])

This directive is made up of two parts. The first part is acc data. This declares that we are now entering a data region. A data region informs the compiler that all accelerator data is shared between the compute regions inside it, meaning that there is no need to copy data back and forth between the accelerator and the host in between them. The second part of the directive, present(vec_a[0:VSIZE], vec_b[0:VSIZE]), is a declaration that the two vectors from elements 0 through VSIZE are already present on the GPU (due to the explicit data movement directives from the main function) and do not have to be copied.

The next three directives all share the same core:

#pragma acc parallel loop

This is the basic OpenACC directive. It declares the following loop to be a parallel one that can be converted into a kernel for the accelerator. Each directive applies to a single loop. It is up to the programmer to ensure that the loop is fully parallel and free of data dependences (i.e., that the result of one loop iteration does not influence another loop iteration).

The final directive carries an additional clause: reduction(+:final). This informs the compiler that this loop is not fully parallel, but is in fact a reduction on the variable final with operator +. This enables the compiler to apply specific, optimised algorithms to perform this reduction as fast as possible.

2.3 Related Work

The usability of OpenACC has already been examined earlier [24]; the authors reached the conclusion that OpenACC has a "promising ratio of development effort to performance". Additionally, earlier performance comparison studies between CUDA and OpenACC, like Hoshino et al. [10], have shown OpenACC to be slower than CUDA. The common belief was that this behaviour is due to the fact that OpenACC lacks the low-level functionality (like shared memory) that enables the programming tricks that CUDA can do [10]. This lack of performance is also reported by Christgau et al. [5], where "the platform independent approach does not reach the speed of the native CUDA code". In this paper, the authors state that "a deeper analysis shows that memory access patterns have a critical impact on the compute kernels' performance, although this seems to be caused by the compiler in use". This supports the points we made in section 2.1.1, where we stated that the performance of OpenACC is heavily affected by the choice (and thus the optimisation level) of compiler.

This performance deficit is not a commonly shared conclusion, however, as Herdman et al. [9] conclude that "OpenACC is an extremely viable programming model for accelerator devices, improving programmer productivity and achieving better performance than OpenCL and CUDA". To add to this, Ledur, Zeve, and Anjos [13] conclude that "OpenACC presented an excellent execution time compared with the other languages" (the other languages being OpenMP and CUDA). In their conclusion, the authors also note that "CUDA presented good execution times too, but the complexity to construct code is bigger than OpenACC and OpenMP."

Examining ease-of-use and difficulty of programming (see research question 2), we have Memeti et al. [15], which, based on the number of lines of code used for the APIs, concludes that "on average OpenACC requires about 6.7x less programming effort compared to OpenCL", and that "Programming with OpenCL on average requires two times more effort than programming with CUDA for the Rodinia benchmark suite". Using some rough maths, this would suggest that OpenACC requires roughly 3.4x less programming effort than CUDA.

#include <stdlib.h>

#define VSIZE (100000)

int do_vec(int *vec_a, int *vec_b) {
    int* vec_c = malloc(sizeof(int) * VSIZE);
    int final = 0;

    #pragma acc data present(vec_a[0:VSIZE], vec_b[0:VSIZE])
    {
        #pragma acc parallel loop
        for(int i = 0; i < VSIZE; i++) {
            vec_c[i] = 0;
        }

        #pragma acc parallel loop
        for(int i = 0; i < VSIZE; i++) {
            vec_c[i] = vec_a[i] + vec_b[i];
        }

        #pragma acc parallel loop reduction(+:final)
        for(int i = 0; i < VSIZE; i++) {
            final += vec_c[i];
        }
    }

    free(vec_c);

    return final;
}

int main(int argc, char *argv[]) {
    int* vec_a = malloc(sizeof(int) * VSIZE);
    int* vec_b = malloc(sizeof(int) * VSIZE);
    #pragma acc enter data copyin(vec_a[0:VSIZE], vec_b[0:VSIZE])

    #pragma acc kernels
    {
        for(int i = 0; i < VSIZE; i++) {
            vec_a[i] = 5+5;
            vec_b[i] = 10+10;
        }
    }

    // Run the computation before the device and host copies are released.
    int result = do_vec(vec_a, vec_b);

    #pragma acc exit data delete(vec_a, vec_b)
    free(vec_a);
    free(vec_b);

    return result;
}

Figure 2.1: A simple OpenACC example application


CHAPTER 3

Methodology

We answer our research questions using empirical analysis of two different graph processing algorithms: Breadth First Search and PageRank. We have selected these workloads because they are both iterative, like most graph processing algorithms, but cover different types of graph processing, and could therefore require different CUDA and OpenACC constructs. Specifically, Breadth First Search is an algorithm that contains a lot of memory operations, but does not have computationally expensive steps. It also traverses different nodes in every iteration. PageRank contains a mix of both memory intensive steps and computationally expensive steps, and is therefore more balanced. Also, contrary to Breadth First Search, it traverses every node in each iteration.

3.1 Measuring performance

In order to fully explore the performance capabilities and the usability of OpenACC compared to CUDA, we will be using an iterative approach to porting the algorithms. We create a base port of the algorithms, compare it to the original non-ported version and look for any differences in performance and behaviour. Based on the results of this comparison, we adapt the ported code and compare this adapted version to the non-ported version again. This process is repeated until no more realistic performance gains can be found.

We measure the performance timings of the programs by using operating-system-specific timing libraries. For the Linux and MacOS operating systems these are the functions contained in the "<sys/time.h>" C header. For Windows we use the "QueryPerformanceCounter()" and "QueryPerformanceFrequency()" functions from the "<windows.h>" header.
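As a sketch of how the Linux/MacOS timing functions can be wrapped (the helper name is ours, not the exact function used in the thesis code):

#include <sys/time.h>

// Return the current wall-clock time in seconds (Linux/MacOS variant).
static double wallclock_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

A measurement then simply takes the difference between two such calls placed around the region of interest.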

For GPU-accelerated programs, the offloading execution model (see chapter 2) requires the additional step of data movement between host and device. Thus, in our performance analysis, we time these transfers explicitly. Collecting such fine-grained performance data allows for a better understanding of the sources of performance discrepancies between CUDA and OpenACC.

All performance measurements used are performed on the DAS-5 distributed supercomputer [1], the full specifications of which can be found in table 3.1. We repeat each experiment 200 times for PageRank and 6400 times for BFS before taking an average of the measured runtimes. It is well known that the performance of graph processing algorithms depends on the structure of the input graph. Therefore, we have selected a set of 36 diverse input graphs. This set includes synthetic graphs taken from the graph500 set [17], and snapshots of real-life graphs taken from the KONECT [12] and SNAP [14] repositories. The list of graphs we use for testing can be found in section 8.2, appendix B, along with their sizes.

CPU Type      Dual Intel Xeon E5-2630 v3
CPU Cores     8-core, 16-thread (32 total threads)
CPU Speed     2.4 GHz
RAM           64 GB
GPU Type      NVIDIA Titan X
GPU Cores     3584
GPU Speed     1417 MHz
GPU Memory    12 GB GDDR5X @ 480 GB/sec

Table 3.1: DAS-5 Specifications

C with OpenACC/OpenMP Compiler    PGI C compiler (pgcc) @ version 19.4-0 (LLVM)
C/C++ CUDA Compiler               NVIDIA CUDA compiler (nvcc) @ version 10.0.130
OpenACC Profiler                  PGI C profiler (pgprof) @ version 19.4
CUDA Profiler                     NVIDIA profiler (nvprof) @ version 10.1.168

Table 3.2: Software used

3.2 Measuring ease-of-use

To measure ease-of-use, we document the full porting process. This includes things like setting up the toolchain and configuring the compiler. We also investigate in detail any weird errors, subtle performance killers and pitfalls that the programmer might encounter while writing OpenACC code. Finally, we reflect on how complex and how systematic the process of identifying and fixing OpenACC performance bugs is. Our goal is, again, to extract potential patterns that can eventually lead to guidelines and, when possible, rules for writing well-behaved OpenACC code.

3.3 Performance Analysis and Debugging

In order to optimise and analyse our compiled binaries, we use debugging and profiling tools, in addition to the information provided by the compiler. The full list of software used along with their versions can be found in table 3.2.

To explain any unexpected performance differences, we use a combination of the NVIDIA and PGI GPU profiling tools and the pgcc option "-Minfo" [20]. "-Minfo" provides detailed information about the compiler's interpretation of our OpenACC directives, and can indicate whether any implicit data regions have been inserted, which implicit optimisations have been applied, and how our OpenACC regions have been parallelised.

The NVIDIA CUDA profiler, nvprof (version 10.1.168) provides detailed profiling information for our CUDA executables. It can provide, for example, detailed information about the number of times a function was called, the average runtime and total time spent on a function. It can also provide this information at the level of every CUDA API call. The PGI group OpenACC profiler, pgprof (version 19.4) does the same things for our OpenACC code, but instead of just measuring functions and API calls, it also profiles OpenACC compute regions.


CHAPTER 4

Breadth First Search

In this chapter we present the different OpenACC BFS implementations we have designed in the context of this thesis. We further provide a detailed analysis of their performance from the perspective of how competitive they are against their CUDA counterparts.

4.1 The Algorithm

Breadth First Search is a basic graph exploration algorithm, in which the goal is to visit all nodes in a graph starting from a specific root node. It is "breadth first" because we explore the nodes "in layers": we first visit the nodes connected to our root node, then the connections of those nodes, and so on. We will be using an expanded version of this algorithm, in which we try to find the distance (also called depth) of each node relative to our chosen root node.

We implement this algorithm in an edge-centric manner: we loop over every edge in the graph, and check if the origin of that edge has been explored. If it has, we mark the destination node as explored and give it the depth of the origin node plus one. The exploration terminates once we go through a loop iteration in which no new nodes have been discovered.

In standard sequential implementations of BFS, the algorithm works by placing newly discovered nodes in a queue, and then processing new nodes as they appear at the front of the queue. In our parallel CUDA implementation, we simply process each newly found node in parallel instead of placing it in a queue to be handled later. In each step of the algorithm, we travel along the edges of each node that was discovered in the previous step. This makes the algorithm inherently parallel and removes the need for a queue. It also makes keeping track of the current depth easier, because nodes with equal depths will always be discovered in the same step. The downside of this variant is that in each step we check every edge, and simply do nothing for the edges that do not need to be explored in that step. This results in more total work as a trade-off for avoiding the load imbalance.
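As a sketch of this edge-centric variant, the following sequential C version (with hypothetical edge_t and result_t types that mirror the ones used in the implementations below) captures the two phases of each step:

#include <stdint.h>

typedef struct { uint32_t from, to; } edge_t;
typedef enum { node_unvisited, node_reachable, node_toprocess, node_visited } node_state_t;
typedef struct { node_state_t state; uint32_t depth; } result_t;

// Edge-centric BFS: sweep over all edges until no node changes state anymore.
// The caller marks the root node as node_toprocess with depth 0 beforehand.
void bfs_edge_centric(const edge_t *edges, uint64_t edge_count,
                      result_t *results, uint32_t node_count) {
    int was_updated;
    do {
        // Expand: every edge leaving a to-process node marks its destination.
        for(uint64_t i = 0; i < edge_count; i++) {
            uint32_t from = edges[i].from, to = edges[i].to;
            if(results[from].state == node_toprocess &&
               results[to].state == node_unvisited) {
                results[to].state = node_reachable;
                results[to].depth = results[from].depth + 1;
            }
        }
        // Advance: reachable nodes become to-process, to-process nodes become visited.
        was_updated = 0;
        for(uint32_t i = 0; i < node_count; i++) {
            if(results[i].state == node_reachable) {
                results[i].state = node_toprocess;
                was_updated = 1;
            } else if(results[i].state == node_toprocess) {
                results[i].state = node_visited;
                was_updated = 1;
            }
        }
    } while(was_updated);
}

The two inner loops correspond directly to the two CUDA kernels (bfs_search and bfs_update_state) discussed below.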

4.2 Porting

Our (rough) CUDA implementation can be seen in figure 4.1. The structure of the code consists of roughly the following steps:

1. Allocate the GPU memory

2. Copy the graph edges and the results data structure to the GPU.
3. Set the value of the was_updated variable on the GPU to zero.
4. Run the bfs_search kernel, which marks the destination node of every edge whose origin node is marked as to-process as reachable.
5. Run the bfs_update_state kernel, which advances the node states and sets was_updated if any state changed.

result_t* do_bfs_cuda(graph_edge_t* graph) {

    result_t* results = malloc(graph->node_count * sizeof(result_t));
    results[0].state = node_toprocess; // Start at node 0

    timer_start(timer_memtransfers);
    copy_data_to_gpu(graph, results);

    // Do BFS
    int was_updated;
    timer_start(timer_nomemtransfers);
    do {
        cudaMemset(was_updated_device, 0, sizeof(int)); // set "was_updated" on device to 0
        bfs_search<<<block_count, CUDA_BLOCKSIZE>>>(edges, results, edge_count);
        bfs_update_state<<<block_count, CUDA_BLOCKSIZE>>>(results, node_count, was_updated);
        cudaMemcpy(&was_updated, was_updated_device, sizeof(int),
                   cudaMemcpyDeviceToHost); // Get the "was_updated" value back from the device
    } while(was_updated == 1);
    timer_stop(timer_nomemtransfers);

    cudaMemcpy(results, results_device, graph->node_count * sizeof(result_t),
               cudaMemcpyDeviceToHost); // Copy results back

    delete_data_from_gpu(graph, results); // Delete data from GPU
    timer_stop(timer_memtransfers);

    return results;
}

Figure 4.1: The core of the Breadth First Search algorithm (pseudo-code, shortened)

6. Get the was_updated value back from the GPU. This was changed to 1 if the state of any node changed in the previous step.
7. Go back to step three if was_updated is set to 1.
8. Copy the results structure back from the GPU.
9. Deallocate the GPU memory.

One might notice that there can be race conditions in step 4: Multiple edges might mark the same destination node as reachable, and one thread might be changing the state from unvisited to reachable while another thread is still seeing it as unvisited. However, this is not a problem for this implementation, because it does not matter which thread marks the target node as reachable: the depth will always be the same for each step due to the parallel nature of this algorithm. This results in the sequential and the parallel versions of the algorithm always ending with the same result, even with the race conditions.

We discuss the porting of this code in two parts: the memory operations and the compute operations.

4.2.1 Porting memory transfers

The memory operations consist of allocating the GPU memory, copying the initial data to the GPU, the intermediate memory operations, copying the results back to the host, and finally deallocating the GPU memory.

result_t* results = malloc(node_count * sizeof(result_t));

cudaMalloc(&edges_device, graph->edge_count * sizeof(edge_t));
cudaMalloc(&results_device, graph->node_count * sizeof(result_t));
cudaMalloc(&was_updated_device, sizeof(int));

cudaMemcpy(edges_device, graph->edges, edge_count * sizeof(edge_t), cudaMemcpyHostToDevice);
cudaMemcpy(results_device, results, node_count * sizeof(result_t), cudaMemcpyHostToDevice);

Figure 4.2: BFS CUDA initial data copying

result_t* results = malloc(node_count * sizeof(result_t));
...
#pragma acc data
{
    ...
    do {
        ...
        do_bfs_search();

        update_bfs_states();
        ...
    } while(was_updated == 1);
    ...
    copy_results_back();
}

Figure 4.3: BFS OpenACC initial data copying

In OpenACC, the allocation and copying of the initial data happens in a single step, except when using specialised memory directives, which are not relevant in this thesis.

The code for the initial data copy consists of five lines, and can be seen in figure 4.2. There are three memory allocations, and only two memory copies, because the was_updated_device variable is set in the BFS algorithm loop itself.

Because OpenACC works with compiler directives, we can make the compiler figure out the memory transfers by itself in certain situations. We find that "certain situations" can be quite arbitrary, as the compiler needs to understand in what way memory will be used and allocated. This process of recognition is extremely unpredictable and prone to errors.

To help the compiler, we can declare a region in which GPU data will be shared with

#pragma acc data

In figure 4.3 we see how this looks in our code. We have wrapped our main algorithm loop in one of these data regions, to declare that all GPU data in this region should be shared with all other code requiring this data. In practice, this means that no updating back and forth between the GPU and host is required in this region when the data is only accessed in regions that are offloaded to the GPU. This also means no extra allocations and deallocations other than at the beginning and end of this zone.

result_t* results = malloc(node_count * sizeof(result_t));
...
#pragma acc data copy(results[0:node_count])
{
    ...
    do {
        ...
        do_bfs_search();

        update_bfs_states();
        ...
    } while(was_updated == 1);
    ...
    copy_results_back();
}

Figure 4.4: BFS OpenACC initial data copying with size hint

With only this bare data region, however, the compiler could not determine the shape and size of the results array, even though it managed to do so for the edges array. Why the compiler could recognise the shape and size of an array which was allocated in a different function, while it could not for an array which was allocated several lines above, is unclear to us.

To solve this problem, we can explicitly state what we want to do with the results data. We want the data copied to the GPU from the host before the algorithm starts, and we want it copied back to the host when the algorithm has finished. OpenACC structured data pragmas (like our data region) support the copy(x) argument, which states that we want to copy the given data to the GPU at the start of our structured region, and vice-versa when the region ends. Our updated code can be seen in figure 4.4.

At this point, the compiler generated functioning code which gave correct results. All other memory operations (including the updating of the was_updated variable) were implicitly generated by the compiler without us giving any hints.

4.2.2 Porting compute operations

As seen in our initial CUDA code in figure 4.1, there are two CUDA kernels which have to be ported to OpenACC: bfs_search and bfs_update_state.

bfs_search

The original CUDA implementation of this kernel can be seen in figure 4.5. Here we can see that each instance of the bfs_search kernel handles one edge. This is supported by the call to the kernel, in which we declare that we need a number of blocks equal to the number of edges in the graph divided by the size of a CUDA block (with one extra block so that no edges are missed when this division is not exact).

For our purposes this means that we can transform the code into a standard for loop iterating over all edges in the graph. We can then annotate this loop with

#pragma acc parallel loop

to tell the compiler that this loop is completely parallel and should be offloaded to the GPU. Our ported code can be seen in figure 4.6. In essence, all we have done is inline the CUDA kernel, wrap it in a for loop, and remove any CUDA-specific code.

This shows us that simple algorithms and CUDA kernels can be ported to OpenACC easily, sometimes resulting in shorter code. This can be one of the main benefits of OpenACC: you do not need to know how GPUs work in order to write code that can run on GPUs. Our bfs_search CUDA code already requires some knowledge about GPUs (block sizes and block count), while the OpenACC variant is basically annotated sequential CPU code.

__global__ void bfs_search(edge_t* edges, result_t* results, edge_count_t edge_count) {
    unsigned int i = threadIdx.x + blockDim.x * blockIdx.x;

    if(i < edge_count) {
        uint32_t origin_index = edges[i].from;
        uint32_t destination_index = edges[i].to;

        if(results[origin_index].state == node_toprocess) {
            if(results[destination_index].state == node_unvisited) {
                results[destination_index].state = node_reachable;
                results[destination_index].depth = results[origin_index].depth + 1;
            }
        }
    }
}

result_t* do_bfs_cuda(graph_edge_t* graph) {
    ...
    do {
        ...
        bfs_search<<<(edge_count / CUDA_BLOCKSIZE) + 1, CUDA_BLOCKSIZE>>>
            (edges_device, results_device, edge_count);
        ...
    } while(was_updated == 1);
    ...
}

Figure 4.5: BFS Search CUDA code

result_t* do_bfs_cuda(graph_edge_t* graph) {
    ...
    do {
        ...
        #pragma acc parallel loop
        for(edge_count_t i = 0; i < edge_count; i++) {
            uint32_t origin_index = edges[i].from;
            uint32_t destination_index = edges[i].to;

            if(results[origin_index].state == node_toprocess) {
                if(results[destination_index].state == node_unvisited) {
                    results[destination_index].state = node_reachable;
                    results[destination_index].depth = results[origin_index].depth + 1;
                }
            }
        }
        ...
    } while(was_updated == 1);
    ...
}

Figure 4.6: BFS Search OpenACC code

bfs_update_state

The CUDA code to be ported can be seen in figure 4.7. This code consists of a kernel in which each instance of the kernel processes a single element in the results array. As we can see in the allocation of this results array (figures 4.1 and 4.2), it has a length equal to node_count. This is supported by the call of this kernel, in which the number of blocks used is equal to the number of nodes in the graph divided by the number of threads per block (with one extra block to circumvent issues with node counts which are not a multiple of this threads-per-block value). This means that, just like with the bfs_search kernel, we can wrap it in a for loop with

node_count iterations.

Now we can annotate this for loop with

#pragma acc parallel loop

to tell the compiler that this is a fully parallel loop that should be offloaded. Our ported code can be found in figure 4.8. Again, all we have done is inlined the CUDA kernel, wrapped it in a for loop, and removed the CUDA specific code.

4.3 Performance

In figure 4.9 we can see the normalised performance comparison between the CUDA and OpenACC implementations of BFS. In this figure we can see that our OpenACC implementation is faster for the smaller graphs (10 through 13), but loses to the original CUDA implementation for the other graphs.

Figure 4.10 shows the same comparison for the KONECT graphs. Again, we can see that CUDA is faster by a large margin, except for the opsahl-ucsocial graph. Looking at the graph sizes, we see that the opsahl-ucsocial graph is the smallest of all the KONECT graphs.

__global__ void bfs_update_state(result_t* results, uint32_t node_count, int* was_updated) {
    unsigned int i = threadIdx.x + blockDim.x * blockIdx.x;

    if(i < node_count) {
        switch(results[i].state) {
            case node_unvisited:
            case node_visited:
                break;
            case node_reachable:
                results[i].state = node_toprocess;
                *(was_updated) = 1;
                break;
            case node_toprocess:
                results[i].state = node_visited;
                *(was_updated) = 1;
                break;
        }
    }
}

result_t* do_bfs_cuda(graph_edge_t* graph) {
    ...
    do {
        ...
        bfs_update_state<<<(node_count / CUDA_BLOCKSIZE) + 1, CUDA_BLOCKSIZE>>>
            (results_device, node_count, was_updated_device);
        ...
    } while(was_updated == 1);
    ...
}

Figure 4.7: BFS Update state CUDA code

result_t* do_bfs_cuda(graph_edge_t* graph) {
    ...
    do {
        ...
        #pragma acc parallel loop
        for(uint32_t i = 0; i < node_count; i++) {
            switch(results[i].state) {
                case node_unvisited:
                case node_visited:
                    break;
                case node_reachable:
                    results[i].state = node_toprocess;
                    was_updated = 1;
                    break;
                case node_toprocess:
                    results[i].state = node_visited;
                    was_updated = 1;
                    break;
            }
        }
        ...
    } while(was_updated == 1);
    ...
}

Figure 4.8: BFS Update state OpenACC code

Figure 4.9: Unoptimised BFS performance comparison for Graph500 graphs.

Figure 4.10: Unoptimised BFS performance comparison for KONECT graphs.

Finally, figure 4.11 contains the comparison for the SNAP graphs. Again, OpenACC has been beaten by CUDA in every graph.

When we combine the results from these figures, we can conclude that the unoptimised version of our OpenACC BFS algorithm is faster than CUDA for very small graphs only. Once the graph size crosses the multi-megabyte line, CUDA starts being faster. In the next section, we will be taking a look at why this is the case, and how we might optimise our OpenACC code to more closely match the original CUDA implementation.

4.4 Optimising

4.4.1 Diagnosing and patching

In order to optimise our BFS OpenACC implementation, we should first diagnose any potential performance problems and determine how we might solve them. To start diagnosing, we look at the actual raw performance numbers. These can be found in appendix C, section 8.3.

In these tables, we can immediately notice two strange results: the difference in total processing time gets larger as the graphs increase in size, and the OpenACC memory transfer time is significantly shorter than the memory transfer time for CUDA. This is especially obvious for the graph graph500-24, where the memory transfer time of CUDA is 5x longer, but the time spent in the main BFS loop is more than an order of magnitude shorter.

This could mean that OpenACC is simply way faster with memory transfers, but this seems unlikely, as it would mean that OpenACC is able to use GPU memory bandwidth more efficiently than CUDA. A significant hint can be found in the output our compiler gives us while compiling the OpenACC BFS code:

40, Loop not vectorized/parallelized: contains call
    FMA (fused multiply-add) instruction(s) generated
44, Generating Tesla code
    45, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
44, Generating implicit copyin(edges[:edge_count])
45, Loop not fused: no successor loop
    Loop not vectorized: data dependency
58, Generating Tesla code
    59, #pragma acc loop gang /* blockIdx.x */
59, Scalar last value needed after loop for was_updated at line 74
    Loop not fused: no successor loop
75, get_highrestime inlined, size=4 (inline) file bfs/bfs_acc.c (39)
79, get_highrestime inlined, size=4 (inline) file bfs/bfs_acc.c (39)

Figure 4.11: Unoptimised BFS performance comparison for SNAP graphs.

In this snippet of compiler output, we can see that the main copying should be taking place at line 35 of our program. However, we can also spot that the compiler has implicitly placed a copyin directive at line 44, right in the middle of our main loop. This results in all the edges of the graph being copied to the GPU again at every step of our BFS loop. As this main loop is not measured as part of the memory transfer, this results in the memory transfer time being artificially (and incorrectly) short. It also introduces a lot of latency and overhead in our main loop, as the edges are copied unnecessarily over and over again. For large graphs (like graph500-24) this introduces a drastic performance hit.

Luckily, this problem is easily solved. Looking at our final memory transfer code in figure 4.4, we simply change the line

#pragma acc data copy(results[0:node_count])

to

#pragma acc data copy(results[0:node_count]) copyin(edges[0:edge_count])

This extra directive instructs the compiler to copy the edges to the GPU at the start of our memory region. This means the compiler no longer needs to do the implicit transfer, and thus also does not place it incorrectly. In theory, this should reduce the processing time of the OpenACC BFS algorithm significantly, while bringing the memory transfer time up to be more in line with the original CUDA implementation.

The second major optimisation was also obtained while looking at the compiler output. Observe the following output:

44, Generating Tesla code

    45, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
58, Generating Tesla code

59, #pragma acc loop gang /* blockIdx.x */

Here we can see that the compiler decided to vectorise the first for-loop, but has not done the same for the second for-loop. As the loops are completely independent and vectorised behaviour is desired, we make this vectorisation behaviour explicit by turning

#pragma acc parallel loop

into

#pragma acc parallel loop gang vector

This tells our compiler that the loop can exploit both gang-level and vector-level parallelism. These roughly map to a CUDA block and a CUDA thread, respectively.

In this stage some other small optimisations were made as well. For instance, all OpenACC parallel loop constructs were made async, meaning that the host does not wait for the GPU to finish its task before moving on with the other code. In our case this came down to the host being able to queue the parallel for-loops instead of starting them sequentially.
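As a rough illustration of this async pattern (our own minimal example, not the thesis code), both loops below are queued on the device's default asynchronous queue and the host only blocks at the wait directive:

// Minimal sketch of queueing two loops with "async" and synchronising once.
float scale_and_sum(float *data, int n) {
    float total = 0.0f;

    #pragma acc data copy(data[0:n])
    {
        // Queued on the default async queue; the host does not block here.
        #pragma acc parallel loop async
        for(int i = 0; i < n; i++) {
            data[i] *= 2.0f;
        }

        // Same queue, so this loop only runs after the previous one finishes.
        #pragma acc parallel loop reduction(+:total) async
        for(int i = 0; i < n; i++) {
            total += data[i];
        }

        // Block the host until both queued loops (and the reduction) are done.
        #pragma acc wait
    }

    return total;
}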

4.4.2 Optimised results

Here we can see that our optimisations have had a major effect on the runtime of the OpenACC port. In figure 4.12, we see that CUDA's advantage has almost completely disappeared, except for graph 23, where it is about 20% faster than OpenACC. For the other graphs in this figure, OpenACC is generally a lot faster.

Figure 4.13 paints a similar picture. OpenACC is, again, generally faster than CUDA except for two outliers, being dbpedia-all and orkut-links. The actual speed increase of OpenACC differs per graph, but seems to hover around 25% faster than CUDA.

Finally, in figure 4.14 we see the biggest improvement. Whereas in the unoptimised comparison in figure 4.11 we saw that CUDA was a lot faster (sometimes up to 95% faster), the two implementations are now very similar. Although OpenACC is not beating CUDA as confidently as it was for the other graphs, there is still a major performance improvement compared to the unoptimised version.

The raw performance numbers for the optimised algorithm can be found in appendix D, section 8.4.


Figure 4.12: Optimised BFS performance comparison for Graph500 graphs.

Figure 4.13: Optimised BFS performance comparison for KONECT graphs.

Figure 4.14: Optimised BFS performance comparison for SNAP graphs.

CHAPTER 5

PageRank

In this chapter we present the different OpenACC PageRank implementations we have designed in the context of this thesis. We further provide a detailed analysis of their performance from the perspective of how competitive they are against their CUDA counterparts.

5.1 The Algorithm

PageRank is an algorithm built to measure the "importance" of a node in a graph relative to the other nodes. It is the original algorithm behind the Google search service [19] [3], but is definitely not limited to use on web pages. The equation defining a node's PageRank score is as follows:

PR(p_i) = 1 - d + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

Where:

• d is the damping factor, a way to dampen the final results;
• M(x) is the set of nodes with an edge to node x;
• L(x) is the total number of edges originating from node x;
• PR(x) is the PageRank score of node x.

There has been some debate about the correctness of this algorithm, however, as the paper describing the actual Google implementation [3] describes the sum of all PageRank scores in a graph as being equal to 1, which would result in the following equation:

PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

Where N is the total number of nodes in the graph.

In our implementation, we will be using the second algorithm, as the first is prone to floating-point overflow errors.

PageRank requires that all nodes that have incoming edges, also have at least one outgoing edge [4]. There are multiple ways of solving this problem, but we have solved it by adding the reverse of all incoming edges as outgoing edges for nodes without outgoing edges. For example, if we have the edge (a, b) and b has no outgoing edges, we add edge (b, a).
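A sketch of that preprocessing step over a plain edge list (the types and function name are our own, hypothetical helpers):

#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t from, to; } edge_t;

// For every node that only receives edges, append the reverse of each edge
// pointing to it, so that no node is left without outgoing edges.
// Returns the (possibly reallocated) edge array and updates *edge_count.
edge_t *fix_dangling_nodes(edge_t *edges, uint64_t *edge_count, uint32_t node_count) {
    uint32_t *out_degree = calloc(node_count, sizeof(uint32_t));
    for(uint64_t i = 0; i < *edge_count; i++) {
        out_degree[edges[i].from]++;
    }

    // Count how many reverse edges are needed.
    uint64_t extra = 0;
    for(uint64_t i = 0; i < *edge_count; i++) {
        if(out_degree[edges[i].to] == 0) extra++;
    }

    edge_t *fixed = realloc(edges, (*edge_count + extra) * sizeof(edge_t));
    uint64_t next = *edge_count;
    for(uint64_t i = 0; i < *edge_count; i++) {
        if(out_degree[fixed[i].to] == 0) {
            fixed[next].from = fixed[i].to;   // add (b, a) for edge (a, b)
            fixed[next].to = fixed[i].from;
            next++;
        }
    }

    *edge_count = next;
    free(out_degree);
    return fixed;
}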

pagerank_t* pagerank(graph_cuda_t* graph) {
    pagerank_t* ranks = init_pagerank(graph);

    timer_start(timer_memtransfers);
    copy_graph_to_gpu(graph);
    timer_start(timer_nomemtransfers);

    do {
        // Do PageRank
        pagerank_do<<<block_count, CUDA_BLOCKSIZE>>>(graph, ranks, next_ranks);
        // Find max change
        pagerank_max_reduce<<<block_count / 2, CUDA_BLOCKSIZE, sharedmemsize>>>
            (ranks, ranks_next, max_array, nodecount);
        // ranks = ranks_next
        pagerank_shift<<<block_count, CUDA_BLOCKSIZE>>>(ranks, ranks_next, nodecount);
    } while(max_change > PAGERANK_THRESHOLD);

    timer_stop(timer_nomemtransfers);

    cudaMemcpy(ranks, ranks_device, nodecount * sizeof(pagerank_t), cudaMemcpyDeviceToHost);

    delete_graph_from_gpu(graph);

    timer_stop(timer_memtransfers);

    return ranks;
}

Figure 5.1: Shortened PageRank CUDA structure

We implement this algorithm by looping over all nodes in the graph, and then looping over all incoming edges of each node in a nested loop. When we have calculated the next PageRank iteration, we compute the percentage of change for each node, and then find the maximum change percentage over all nodes. If this percentage is lower than our threshold, we consider the PageRank scores to be calculated. In order to verify the correctness and equivalence of both implementations, we make sure that both the final PageRank scores and the number of iterations of both versions of the algorithm are equal.
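A sequential sketch of this iteration (our own reference version: the threshold value and the types are assumptions, and ranks is expected to be initialised to 1/N by the caller):

#include <math.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGERANK_D 0.85f
#define PAGERANK_THRESHOLD 0.5f  /* maximum change in percent; value assumed */

typedef struct {
    uint32_t *in;        /* indices of nodes with an edge to this node */
    uint32_t in_count;
    uint32_t out_count;  /* number of outgoing edges */
} node_t;

void pagerank_seq(const node_t *nodes, uint32_t node_count, float *ranks) {
    float *next = malloc(node_count * sizeof(float));
    float max_change;

    do {
        max_change = 0.0f;
        for(uint32_t i = 0; i < node_count; i++) {
            // Sum the contributions of all incoming edges (the nested loop).
            float rank = 0.0f;
            for(uint32_t j = 0; j < nodes[i].in_count; j++) {
                uint32_t in = nodes[i].in[j];
                rank += ranks[in] / nodes[in].out_count;
            }
            next[i] = (1.0f - PAGERANK_D) / node_count + PAGERANK_D * rank;

            // Track the largest percentage of change over all nodes.
            float change = fabsf(ranks[i] - next[i]) / ranks[i] * 100.0f;
            if(change > max_change) max_change = change;
        }
        for(uint32_t i = 0; i < node_count; i++) {
            ranks[i] = next[i];
        }
    } while(max_change > PAGERANK_THRESHOLD);

    free(next);
}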

5.2 Porting

Our original CUDA implementation is roughly described by the code in figure 5.1. Here we can see that there are roughly six parts to port:

1. Allocate the GPU memory

2. Copy the relevant graph fields to the GPU.
3. Do a PageRank iteration.

4. Find the maximum change.

5. Shift the newly computed PageRank values to our array with "current" values.
6. Copy the final PageRank back to the host.

static void copy_graph_to_gpu(graph_cuda_t* graph) {
    for(uint32_t i = 0; i < graph->node_count; i++) {
        // Copy each node's in array
        cudaMalloc(&graph->nodes.host[i].in.device,
                   graph->nodes.host[i].in_count * sizeof(uint32_t));
        cudaMemcpyAsync(graph->nodes.host[i].in.device, graph->nodes.host[i].in.host,
                        graph->nodes.host[i].in_count * sizeof(uint32_t), cudaMemcpyHostToDevice);
        // The outgoing edges are not useful in our case, so we don't copy them
    }
    cudaMalloc(&graph->nodes.device, graph->node_count * sizeof(node_cuda_t));
    cudaMemcpyAsync(graph->nodes.device, graph->nodes.host,
                    graph->node_count * sizeof(node_cuda_t), cudaMemcpyHostToDevice);
}

Figure 5.2: CUDA function to copy the graph to the GPU

These can be reduced to two main parts: the memory transfer operations and the computing operations. We describe these processes separately.

5.2.1 Porting memory transfers

By far the easiest memory operation to port is the copying of the final PageRank back to the host. As OpenACC has a simple pragma to update a host variable by taking the matching device variable, we simply changed the line

cudaMemcpy(ranks, ranks_device, graph->node_count * sizeof(pagerank_t), cudaMemcpyDeviceToHost);

to

#pragma acc update host(ranks[0:graph->node_count])

The more complicated functions are the ones copying the graph to the GPU, and deallocating it after the algorithm is complete. In figure 5.2 we can see our original CUDA implementation. This consists of a series of memory allocations and asynchronous memory copies. As the CUDA malloc function gives us a pointer to a device memory location, we have to keep track of two sets of pointers: one to the host memory, and one to the device memory. This forces us to modify our graph data structure, as we now have to declare "dual pointer" types containing a host and a device pointer. This makes our code more verbose and error-prone.

Figure 5.3 presents our OpenACC implementation of this function. The first noticeable thing is that the function itself is much shorter and a lot less verbose. To replace our CUDA malloc/memcpy combination, we can use the OpenACC enter data copyin(host[start:count]) construct, which declares that we want to allocate memory space on the device of size

sizeof(host_type) * count

and then copy the count elements starting at host[start] from our host memory to the device.

We also declare these memory operations to be async, which places them in a queue. This allows us to schedule all copy operations and have them run concurrently to the rest of the code. The first time we need the memory, we simply call

#pragma acc wait

and force these async operations to finish before we move on.
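A minimal sketch of this queue-then-wait pattern (our own example with a hypothetical array):

// Start the host-to-device copy early, let other host work proceed, and only
// block right before the first compute region that needs the data.
float sum_on_gpu(float *values, int n) {
    float total = 0.0f;

    #pragma acc enter data copyin(values[0:n]) async

    /* ... unrelated host-side setup could run here ... */

    #pragma acc wait

    #pragma acc parallel loop reduction(+:total) present(values[0:n])
    for(int i = 0; i < n; i++) {
        total += values[i];
    }

    #pragma acc exit data delete(values[0:n])
    return total;
}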

static void copy_graph_to_gpu(graph_acc_t* graph) {
    #pragma acc enter data copyin(graph[0:1]) async
    #pragma acc enter data copyin(graph->nodes[0:graph->node_count]) async

    for(uint32_t i = 0; i < graph->node_count; i++) {
        #pragma acc enter data \
            copyin(graph->nodes[i].in[0:graph->nodes[i].in_count]) async
    }
}

Figure 5.3: OpenACC function to copy the graph to the GPU

For the deallocation of the graph from the GPU using CUDA, we kept the same structure and for-loop, but changed the cudaMalloc/cudaMemcpy calls into cudaFree calls on the same variables and reversed the order. With OpenACC we also reversed the order, and replaced the

#pragma acc enter data copyin(var) async

pragmas with

#pragma acc exit data delete(graph[0:1]) async

This states that we want to deallocate the memory from the GPU without copying it back to the host, as we do the copying manually when needed.

At this point, we have ported our memory operations to OpenACC, and we also encounter our first catch: our compiler cannot determine whether a variable is in the GPU memory if we declare the copy operations ourselves. This results in the compiler implicitly adding copyin and

copyout operations around the places where we do GPU computations on the data. As we have

two main OpenACC parallel compute sections (which we will explain in detail in section 5.2.2), this results in four extra (and quite large) memory operations. More importantly however, the compiler fails to correctly determine sizes of arrays to copy and also incorrectly guesses that each pointer variable holds an array. This results in invalid memory accesses, and a lot of runtime crashes.

To counteract this, we have to declare which variables are present. This causes the compiler to skip the allocation and copying of these variables. In figure 5.4 we can see an example of how that would look in our ported code. In general, these present declarations are always needed in OpenACC code in which the programmer does manual memory management (as opposed to having the compiler figure it out itself), and can quickly bloat code to the point where the code saved by the shorter copyin/copyout operations is then lost again to present pragmas.

5.2.2 Porting compute operations

Assuming that all the GPU memory has now been handled, we continue with porting the actual compute kernels to OpenACC. Our CUDA code contains three kernels. The first kernel computes a new PageRank score for each node in our graph, based on their previous scores stored in prev_rank, and stores it in the next_rank array. The second kernel determines the change percentage for each node, and then reduces all these change percentages to their collective maximum (a classic max-reduce problem). The final kernel shifts each element in next_rank to the same index in prev_rank.

PageRank Kernel

In figure 5.5 we can see our basic PageRank kernel. In this kernel, we create a number of threads equal to the number of nodes in the graph. We then make each thread compute the PageRank score of its respective node.

copy_graph_to_gpu(graph);

#pragma acc data present(ranks[0:graph->node_count], ranks_next[0:graph->node_count])
#pragma acc data present(graph->nodes->in, graph->nodes, graph)
{
    do {
        ...
    } while(max_change > PAGERANK_THRESHOLD);

    #pragma acc update host(ranks[0:graph->node_count])
}

delete_graph_from_gpu(graph);

Figure 5.4: Addition of present pragmas. Notice the extra set of brackets to signal in which code block the variables are present.

__global__ void pagerank_kernel(graph_cuda_t* graph,
                                pagerank_t* prev_rank, pagerank_t* next_rank) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if(i < graph->node_count) {
        pagerank_t rank = 0;

        for(uint32_t j = 0; j < graph->nodes.device[i].in_count; j++) {
            uint32_t in_index = graph->nodes.device[i].in.device[j];
            rank += (prev_rank[in_index] / graph->nodes.device[in_index].out_count);
        }

        next_rank[i] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * rank);
    }
}

pagerank_t* pagerank(graph_cuda_t* graph) {
    ...
    unsigned int block_count = (graph->node_count / CUDA_BLOCKSIZE) + 1;
    pagerank_do<<<block_count, CUDA_BLOCKSIZE>>>(graph_device, ranks_device, ranks_device_next);
    ...
}

Figure 5.5: The CUDA PageRank kernel


pagerank_t* pagerank(graph_cuda_t* graph) {
    ...
    do {
        // Do PageRank
        #pragma acc parallel loop
        for(uint32_t i = 0; i < graph->node_count; i++) {
            pagerank_t rank = 0;

            #pragma acc loop reduction (+:rank)
            for(uint32_t j = 0; j < graph->nodes[i].in_count; j++) {
                uint32_t in_index = graph->nodes[i].in[j];
                rank += (ranks[in_index] / graph->nodes[in_index].out_count);
            }

            ranks_next[i] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * rank);

            // Determine the threshold
            thresholds[i] = (fabsf(ranks[i] - ranks_next[i]) / ranks[i]) * 100;
        }
        ...
    } while(max_change > PAGERANK_THRESHOLD);
    ...
}

Figure 5.6: OpenACC: PageRank kernel, ported.

The first order of business is rewriting the kernel body into a sequential form and inlining it into the calling function. This allows us to annotate it with OpenACC pragmas. Luckily for us, this is fairly easy for the PageRank function. Once we have copied the body and rewritten it as a basic for-loop, we can annotate the for loop with

#pragma acc parallel loop

indicating that this for loop is completely parallel and should be offloaded to the device. As a small optimisation step, we put the computation of the change percentage in this loop. In CUDA, this change percentage is computed in the maximum change percentage kernel.

We have also annotated the inner loop with

#pragma acc loop reduction (+:rank)

This tells the compiler that this loop contains a reduction on the variable rank (the sum part of the PageRank equation), enabling the compiler to insert specific optimisations for the reduction. At this point, we have fully ported a basic version of the PageRank kernel to OpenACC. Our final result can be seen in figure 5.6.

Maximum Threshold Kernel

During the porting of this kernel, the high-level nature of OpenACC really starts to make a difference. CUDA does not implement any reduction algorithms natively, which means that we have to implement them ourselves, or use third-party libraries like Thrust [2]. This does open up some huge optimisation opportunities for us. Using the basic reduction optimisation idea taken from NVIDIA itself [8], we can implement a reduction kernel that takes full advantage of the architecture of NVIDIA GPUs. As we can compare at most two elements at the same time, we can in theory halve the number of values to be reduced at each step of the process. We can also take advantage of latency hiding, and try to remove any warp divergence.


pagerank_t* pagerank(graph_cuda_t* graph) {
    ...
    do {
        // Do PageRank
        ...
        // Max change
        max_change = 0.0f;
        #pragma acc parallel loop reduction(max:max_change)
        for(uint32_t i = 0; i < graph->node_count; i++) {
            max_change = fmaxf(max_change, thresholds[i]);
        }
        ...
    } while(max_change > PAGERANK_THRESHOLD);
    ...
}

Figure 5.7: OpenACC maximum change reduction

with a very long and verbose function (more than 60 lines in total). The final code for this max-reduce function can be found in the appendix, section 8.1; it is mostly the same as the code from the NVIDIA lecture [8], with the additions replaced by calls to the function fmaxf(a, b).
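For illustration, a condensed sketch of this pattern (not the full appendix version; the kernel name, the two-elements-per-thread load and the separate second pass are our own simplifications) could look as follows. It assumes the threshold values are non-negative, so 0 is a safe identity value for the maximum:

    // Each block reduces a chunk of `thresholds` to a single maximum in `block_max`.
    __global__ void max_reduce_kernel(const float* thresholds, float* block_max,
                                      unsigned int n) {
        extern __shared__ float sdata[];

        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

        // Each thread loads (up to) two elements, halving the number of blocks needed.
        float v = (i < n) ? thresholds[i] : 0.0f;
        if (i + blockDim.x < n) {
            v = fmaxf(v, thresholds[i + blockDim.x]);
        }
        sdata[tid] = v;
        __syncthreads();

        // Tree reduction in shared memory; keeping the active threads contiguous
        // avoids warp divergence.
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
            }
            __syncthreads();
        }

        // Thread 0 writes this block's maximum; the per-block results still need
        // a second, much smaller reduction pass (on the device or on the host).
        if (tid == 0) {
            block_max[blockIdx.x] = sdata[0];
        }
    }

The kernel would be launched with blockDim.x * sizeof(float) bytes of dynamic shared memory, for example as max_reduce_kernel<<<blocks, threads, threads * sizeof(float)>>>(thresholds, block_max, n).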

OpenACC makes this process a lot easier and shorter. We simply implement a standard sequential max-reduce for loop and annotate it with

#pragma acc parallel loop reduction(max:max_change)

This allows OpenACC to generate the optimal reduction algorithm for the platform we are compiling for. As we can see in figure 5.7, this code is only five lines in total, and thus much easier to read and comprehend.

5.3 Performance

In figures 5.8, 5.9 and 5.10 you can see our benchmarking results for the PageRank algorithm. Surprisingly, our OpenACC port is faster on every graph by a significant margin. The standard deviation is also generally lower, which means that the OpenACC port is generally more consistent as well. The full data can be found in appendix C, section 8.3.

When investigating the specific graph results, we can see that the dbpedia-starring graph in figure 5.9 is the only graph where the CUDA performance is within 20% of OpenACC.

In the next section, we investigate why our OpenACC implementation is faster, and how we can potentially increase this performance gap.

5.4 Optimising the OpenACC implementation

5.4.1 Diagnosing and patching

In order to (potentially) exploit the behaviour that makes our OpenACC port faster than the original CUDA implementation, we must first figure out what this behaviour specifically is. The NVIDIA profiler, nvprof, does not help us here, as it only shows us that OpenACC is faster, not why. As such, we turn to the compiler information output:


Figure 5.8: Unoptimised PageRank performance comparison for Graph500 graphs.

Figure 5.9: Unoptimised PageRank performance comparison for KONECT graphs.

Figure 5.10: Unoptimised PageRank performance comparison for SNAP graphs.

    71, #pragma acc loop vector(128) /* threadIdx.x */
        Generating reduction(+:rank)
    67, Loop not fused: no successor loop
    71, Loop is parallelizable
    84, Generating Tesla code
    85, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
        Generating reduction(max:max_threshold)
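For reference, this kind of feedback is produced by the -Minfo=accel flag of the PGI/NVIDIA OpenACC compiler (the exact invocation and file name below are only an example):

    pgcc -fast -acc -ta=tesla -Minfo=accel pagerank.c -o pagerank

The line numbers in the output refer to the loops in the source file.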

At this point we can start to understand why OpenACC is faster: it treats our main PageRank loops (seen in figure 5.11) differently than we do in CUDA. In our original CUDA implementation, each thread handles a single node: it calculates the new PageRank score based on all the incoming connections of that node, which corresponds to the inner loop in figure 5.11.

The OpenACC compiler has taken a different approach: it treats the innermost loop as a vector calculation and the outer loop as a gang (a block in CUDA terms). This launches many more threads: each gang now handles a single node, and the threads within the gang cooperate on summing the contributions of that node's incoming edges. Thus, in OpenACC, the work per thread is smaller, as multiple threads work on the update of a single node. This results in a much better load balance than the CUDA version achieves.

In the end, this amounts to a completely different algorithmic approach than the one taken in the original CUDA code, and one that is clearly faster. Although it is possible to reproduce this behaviour in CUDA by changing the algorithm, doing so is (significantly) more difficult, because the nodes all have a different number of incoming connections: we would have to write extra code to launch enough CUDA threads, map each thread to both a node and an incoming edge, and then combine the per-edge threads again for the final new-rank calculation.
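To give an impression of what this would involve, here is a rough sketch of one possible CUDA equivalent (not the thesis code): one block per node, with the threads of the block striding over that node's incoming edges and combining their partial sums through a shared-memory reduction. Even in this simplified form, which assumes blockDim.x equals CUDA_BLOCKSIZE and is a power of two, it is considerably more code than the single OpenACC directive:

    __global__ void pagerank_block_per_node(graph_cuda_t* graph,
                                            pagerank_t* prev_rank, pagerank_t* next_rank) {
        __shared__ pagerank_t partial[CUDA_BLOCKSIZE];

        unsigned int node = blockIdx.x;   // one block (gang) per node
        unsigned int tid  = threadIdx.x;

        // The threads of the block stride over this node's incoming edges.
        pagerank_t rank = 0;
        for (uint32_t j = tid; j < graph->nodes.device[node].in_count; j += blockDim.x) {
            uint32_t in_index = graph->nodes.device[node].in.device[j];
            rank += prev_rank[in_index] / graph->nodes.device[in_index].out_count;
        }
        partial[tid] = rank;
        __syncthreads();

        // Combine the per-thread partial sums with a tree reduction.
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) {
                partial[tid] += partial[tid + s];
            }
            __syncthreads();
        }

        if (tid == 0) {
            next_rank[node] = ((1.0f - PAGERANK_D) / graph->node_count)
                            + (PAGERANK_D * partial[0]);
        }
    }

    // Launched with one block per node, e.g.:
    // pagerank_block_per_node<<<graph->node_count, CUDA_BLOCKSIZE>>>(graph_device,
    //                                                                ranks_device, ranks_device_next);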


// Do PageRank itself
#pragma acc parallel loop
for(uint32_t i = 0; i < graph->node_count; i++) {
    pagerank_t rank = 0;

    #pragma acc loop reduction(+:rank)
    for(uint32_t j = 0; j < graph->nodes[i].in_count; j++) {
        uint32_t in_index = graph->nodes[i].in[j] - 1;
        rank += (ranks[in_index] / graph->nodes[in_index].out_count);
    }

    ranks_next[i] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * rank);

    // Determine the threshold
    thresholds[i] = (fabsf(ranks[i] - ranks_next[i]) / ranks[i]) * 100;
}

Figure 5.11: Main PageRank OpenACC loop.

To make this scheduling explicit, we change the

#pragma acc parallel loop

directive above our outer loop to

#pragma acc parallel loop gang

signalling that this outer loop exploits gang-level parallelism. This means that each iteration of the loop is handled by a different gang (CUDA block).

For the inner loop, we change

#pragma acc loop reduction(+:rank)

to

#pragma acc loop vector reduction(+:rank)

to make the vectorised behaviour explicit.

Returning to the compiler output, we observe the following lines:

    71, #pragma acc loop vector(128) /* threadIdx.x */
        Generating reduction(+:rank)

Our compiler is notifying us that the inner loop is vectorised with a vector length of 128: the inner loop processes exactly 128 edges at a time. When there are more than 128 elements, the loop is scheduled multiple times until all elements have been processed. When there are fewer than 128 elements, the threads that have no work are still started and scheduled, but simply do nothing. This introduces a tradeoff: a larger vector length can increase total parallelism, as more elements are processed at a time, while a smaller vector length results in fewer idle threads.

For our implementation, each element is one incoming edge. As most nodes do not have more than 32 edges (and GPUs perform best when the vector length is a multiple of 32), we choose 32 as our vector length. In theory, this should result in fewer idle threads (provided we do not regularly have vertices with more than 32 edges). To achieve this, we change our outer-loop directive from

#pragma acc parallel loop gang

to

#pragma acc parallel loop gang vector_length(32)

This sets all the vectors in the gang to size 32.

Figure 5.12: Optimised PageRank performance comparison for Graph500 graphs.

In addition to the previous optimisations, we have also turned all OpenACC regions into async regions. This lets us queue the next region before the previous one has been completely processed. When we need the data of one of the regions on the host, we simply call

#pragma acc wait

to wait until all the regions in the queue have been processed.
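Sketched on the loops shown above (an assumed shape, not the exact final code), the pattern looks roughly as follows: both regions go into the same async queue, and the host only blocks when it needs max_change for the convergence test. The arrays are assumed to be present on the device through the surrounding data region.

    do {
        // PageRank update, queued on async queue 1; the host continues immediately.
        #pragma acc parallel loop gang vector_length(32) async(1)
        for (uint32_t i = 0; i < graph->node_count; i++) {
            pagerank_t rank = 0;

            #pragma acc loop vector reduction(+:rank)
            for (uint32_t j = 0; j < graph->nodes[i].in_count; j++) {
                uint32_t in_index = graph->nodes[i].in[j] - 1;
                rank += ranks[in_index] / graph->nodes[in_index].out_count;
            }

            ranks_next[i] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * rank);
            thresholds[i] = (fabsf(ranks[i] - ranks_next[i]) / ranks[i]) * 100;
        }

        // Maximum-change reduction in the same queue, so it runs after the update.
        max_change = 0.0f;
        #pragma acc parallel loop reduction(max:max_change) async(1)
        for (uint32_t i = 0; i < graph->node_count; i++) {
            max_change = fmaxf(max_change, thresholds[i]);
        }

        // The host needs max_change for the loop condition, so block here.
        #pragma acc wait(1)

        // (swap ranks and ranks_next here, as in the full implementation)
    } while (max_change > PAGERANK_THRESHOLD);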

5.4.2 Optimised results

After running our benchmarks again, we observe that not a lot has changed. In figure 5.12 we can see that, on average, the optimised version is slightly faster. This is not the case for all graphs in this figure, however: the larger graphs (22, 23 and 24) in particular have lost performance relative to their unoptimised counterparts. We think this behaviour is due to our decision to set the vector length to 32 edges. For larger graphs, where nodes have more edges on average, this can result in lower performance.

In figure 5.13 we can observe the same kind of results as in figure 5.12. For some graphs, our new version has gained performance, while for other graphs it lost performance. Again, we hypothesize this behaviour is due to our decision to set the vector length at 32 instead of a higher number.

The SNAP graphs in figure 5.14 show better results. Although no graph shows major gains over our previous, unoptimised version, none of the graphs perform worse for OpenACC than they did in the previous version.

These results tell us that it might be possible to optimise OpenACC even further by choosing the vector length based on the average number of incoming edges for each node.


Figure 5.13: Optimised PageRank performance comparison for KONECT graphs.

Figure 5.14: Optimised PageRank performance comparison for SNAP graphs.

CHAPTER 6

Multithreading OpenACC

One of the main advantages of OpenACC is its cross-platform portability. In fact, one of the main goals of OpenACC is that one should be able to write a single code base for all platforms and have the compiler do the porting. In our previous tests, we compiled OpenACC for the NVIDIA Tesla architecture in order to compare GPU performance against a lower-level language (CUDA in our case). In these tests, we have shown that OpenACC compilers are currently able to generate code that performs on par with or better than CUDA. Thus, in this chapter, we examine whether the same code also behaves well on the CPU. Specifically, we do this by comparing against an equivalent OpenMP implementation.

For this comparison, we use our PageRank implementation, as it contains a combination of memory-intensive and computationally intensive sections. This is in contrast to BFS, where there is almost no computation and the algorithm consists mainly of memory operations. We make sure the OpenMP and OpenACC implementations are equivalent by comparing their final PageRank scores and the total number of iterations run.

6.1 OpenMP vs. OpenACC

Like OpenACC, OpenMP is a compiler-directive-based API. This means that one first writes sequential code. Once this code is fully functional, the programmer adds OpenMP compiler directives to convert it into multithreaded code. The main difference from OpenACC is that OpenMP was created for CPU multithreading, while OpenACC was designed to be more general.

As OpenACC and OpenMP are so similar, porting the code is not that interesting. In order to create the OpenMP benchmark, we simply removed all OpenACC directives and replaced them with their OpenMP equivalents. This meant replacing

#pragma acc parallel loop

with

#pragma omp parallel for

to start with. We also removed all OpenACC data directives, as there is no need to move data around to any accelerator device.

Finally, we made some OpenMP-specific optimisations, such as annotating the for loops that perform reductions with the equivalent reduction(operator:variable) clauses, and determining the optimal scheduling algorithm by benchmarking. Specifically, most nodes in our graphs have a different number of incoming edges, which means that they involve a different amount of work. Static scheduling gives each thread an equal number of iterations (i.e., nodes in our case) to process.
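As an illustration (an assumed form, not the exact benchmark code; the schedule clause and its chunk size are the kind of parameters we tuned by benchmarking), the core OpenMP loops look roughly like this:

    // Main PageRank loop: nodes have varying in-degrees, so a dynamic or guided
    // schedule can balance the load better than the default static schedule.
    #pragma omp parallel for schedule(dynamic, 1024)
    for (uint32_t i = 0; i < graph->node_count; i++) {
        pagerank_t rank = 0;

        for (uint32_t j = 0; j < graph->nodes[i].in_count; j++) {
            uint32_t in_index = graph->nodes[i].in[j] - 1;
            rank += ranks[in_index] / graph->nodes[in_index].out_count;
        }

        ranks_next[i] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * rank);
        thresholds[i] = (fabsf(ranks[i] - ranks_next[i]) / ranks[i]) * 100;
    }

    // Maximum-change reduction, using OpenMP's built-in max reduction.
    max_change = 0.0f;
    #pragma omp parallel for reduction(max:max_change)
    for (uint32_t i = 0; i < graph->node_count; i++) {
        max_change = fmaxf(max_change, thresholds[i]);
    }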
