
Bachelor Informatica

Graph Processing in OpenMP: a Study on Performance Portability

Duncan Kampert

June 24, 2019

Informatica - Universiteit van Amsterdam


Abstract

Graph processing is an important branch of computer science, and there is a desire to have this processing done in parallel. Despite the significant mismatches between parallel processing and graph processing, there is ample evidence that parallel graph processing, in modern architectures such as multi-core CPUs and many-core GPUs, outperforms sequential solutions. It remains, however, a challenge to easily program graph processing algorithms for these platforms, especially when the programming models are fundamentally different. In this work, we investigate whether OpenMP can act as a portable programming model for parallel graph processing.

Our investigation relies on two graph processing algorithms: BFS and PageRank. We evaluate these graph processing algorithms from a performance portability standpoint, and find how well they perform when implemented in OpenMP, in a portable manner. We show several portable optimization steps, in OpenMP, and demonstrate they can lead to better performing algorithms.

Finally, the optimized algorithms are then benchmarked, and compared against well-known high-performance benchmarks for graph processing. Our results show that the performance of our implementation can be competitive with CPU benchmarks; on the GPU side, the performance we achieved is still an order of magnitude away, but we explain some possible reasons as to why this may be the case.

We conclude that OpenMP is competitive against the GAP benchmark, and less competitive against GunRock. The performance of OpenMP on GPUs can likely be improved in future work, using more in-depth analysis to find unnecessary operations.


Contents

1 Introduction
  1.1 Problem statement and approach
  1.2 Organization

2 Theoretical background
  2.1 Parallel hardware platforms
  2.2 Performance Portability
  2.3 OpenMP and code offloading
  2.4 Graphs and graph processing
    2.4.1 Breadth First Search (BFS)
    2.4.2 PageRank

3 Algorithms design and implementations
  3.1 BFS
  3.2 PageRank
  3.3 Storing Graphs

4 Experimental Setup
  4.1 Hardware platforms
  4.2 Input Graphs

5 Using OpenMP: Parallelization, Offloading, and Optimizations
  5.1 Performance Parameters
    5.1.1 Pragma Switching
    5.1.2 Exceptions
    5.1.3 Performance Data
    5.1.4 Results
  5.2 Removing Atomic Operations
  5.3 Different Scheduling Methods

6 Performance Portability Analysis
  6.1 Performance portability baselines
  6.2 Performance Results
  6.3 Results
    6.3.1 CPU results
    6.3.2 GPU results

7 Conclusion
  7.1 Portability
  7.2 BFS performance
  7.3 PageRank performance
  7.4 Final Words

8 Appendix A

9 Appendix B


CHAPTER 1

Introduction

Modern processors can no longer rely on near-yearly performance boosts through increases in single-core performance. This near-annual increase of computational power, as described by Moore's Law, was historically achieved by raising processor frequencies. However, this frequency increase has hit a soft cap due to physical limitations [1, 18]. To increase the performance of computers beyond what is possible with higher processor frequencies, programmers have to move towards multi-threaded programming. Multi-core processors have been around for a long time, but the problem herein comes from the complexity of multi-threaded programs.

Multi- and many-core processors can be found in many types of parallel architectures, in which accelerators form a special class due to their mode of operation: parallel programs (or parts thereof) can be offloaded to accelerators, for example to GPUs. The advantage of offloading code to accelerators is the ability of the program to make use of their specialized features. For example, when offloading code to a GPU, the code can make use of the high-performance features of these architectures: they have more computing cores and their hardware is optimized for large-scale parallel execution. In general, offloading is seen as an opportunity to gain performance through faster computation, at the expense of additional data movement. This trade-off makes offloading code a double-edged sword.

A problem which stems from offloading code to accelerators is that it takes extra time for a developer to rewrite/redesign the code to be offloaded as a specialized algorithm for the accelerator. In turn, this decreases productivity. Thus, there are programming languages/models/frameworks that have been created specifically to address this productivity challenge. For example, OpenMP is a pragma-based model, where sequential code is decorated with specialized pragmas, which instruct the compiler to parallelize code [21]. Since version 4.5, an offloading mechanism is present in OpenMP: additional pragmas allow code to be run on a parallel CPU or a GPU without significant alterations [21]. The problem with this high-level pragma-based approach is that the compiler processes the pragmas and creates the machine code. This creates a larger difference between the original source code and the machine code which is being executed. Because of this difference, it becomes more difficult to see what a program will actually do based on the original source code, which in turn makes it harder to optimize a program for certain hardware.

There are many algorithms well suited for parallelism - i.e., applications for which using parallelism easily leads to increased performance. Examples include image processing, heavy engineering simulations, or deep learning. However, graph processing applications are among the problems for which performance is more and more important, but parallelism remains challenging [17]. Graph problems require high performance because they are rapidly growing in scale. For example, ranking web pages in a search engine is a massive application: the input graph in this example consists of a node for every web page online, creating a graph of billions of nodes.


It is however not straightforward to parallelize a graph algorithm to fully benefit from massive parallelism and gain significant performance, because there are fundamental mismatches between parallel processing and graph processing [17]. Specifically, graph applications are often sparse in processing and heavy on memory operations, and irregular in terms of parallelism and memory accesses; all these properties have a negative impact on the performance gain a parallel algorithm can achieve. Moreover, graph processing performance is heavily dependent on the structure of the input graph, the selected algorithm to solve a given problem, and the actual parallel hardware being used. For best possible performance, it is crucial for programmers to be able to easily experiment with different types of algorithms and platforms. This is not feasible when using a different programming model for every type of machine of interest. Instead, it is desirable to use portable programming models (same code running on multiple types of machines) - iff their performance is suitable.

1.1

Problem statement and approach

In this thesis, we aim to determine how well portable graph algorithms perform compared to specialized implementations of these same algorithms. More specifically, we will compare graph algorithms in a portable language to algorithms which are specialized for particular platforms, in turn showing the performance portability of our graph algorithms. For this work, our target programming model is OpenMP. Our method is empirical in nature, and consists of two main steps: (1) find the optimal implementation of a graph algorithm in OpenMP, and compare its performance to hardware-specialized implementations, and (2) determine how well the implementation(s) perform across platforms without changing the program.

1.2

Organization

This report is organized as follows.

In chapter 2 we present a short overview of the types of hardware available, and the difference between them. This is followed by the concept of performance portability, which is used to define performance over a set of platforms. We then continue explaining what portable language we used, and what makes this language different from a usual programming language. The last segment of our background explains what graphs are, and the two algorithms we used throughout this thesis.

Next, in chapter 3 we give an in-depth view of the two algorithms used, and how they were implemented. This also highlights what optimizations we have done, and what optimizations we have not. This chapter finishes with an overview of our data structure used for storing graphs.

Chapter 4 gives technical details about the hardware we used for all of our experiments. This is followed by a short overview of the graphs we used as input for our algorithms.

In chapter 5 we go over how we tuned our algorithms to give better performance. This chapter shows exactly what approach to optimization we took, with results to back up the choices we made.

Continuing with chapter 6, we show final results of our optimized algorithms compared to baseline implementations. These baseline implementations are known benchmarks, and are platform-specialized.

We finish with chapter 7 and explain the implications of our results. We highlight what problems our implementation might have, as well as show possible solutions to these problems.


CHAPTER 2

Theoretical background

In this chapter we discuss the background knowledge required to understand the remainder of this thesis. We start by explaining the classes of hardware, and the differences between them. We further discuss our take on performance portability (Section 2.2), as it will be used for the remainder of this work. Then we continue with explaining the language we used for our project, and what makes it different from a standard programming language. We also expand on this and show what makes it possible for this language to run on multiple classes of hardware. Lastly we explain the two algorithms we used for all our tests, Breadth First Search and PageRank.

2.1

Parallel hardware platforms

In this section, we highlight the architectural differences between CPUs and GPUs, the main targets for our performance portability study.

There are multiple differences between CPUs and GPUs which make them distinct classes of hardware. Firstly, the number of computing cores in CPUs is generally around 4 for private use, up to around 16 for specialized use. The number of cores in GPUs is orders of magnitude higher, namely in the thousands. Secondly, the clock frequency of CPUs is generally higher than that of GPUs: CPUs sit at around 3 GHz, while GPUs usually run at around 1.5 GHz. These two are the main technical differences between the types of hardware, but the main difference comes to light in the use cases.

Figure 2.1: Schematic hardware comparison between CPU (left) and GPU (right). Every green rectangle corresponds to a core. Source: http://ixbtlabs.com/articles3/video/cuda-1-p1.html


GPUs have many cores, which are specialized to all execute the same piece of code in parallel. For example in image processing, every pixel can be processed by a GPU core simultaneously. This makes GPUs a good choice when the given problem is more processing heavy and can be done in parallel.

CPUs have a higher clock frequency, and can thus execute sequential code faster than GPUs. This makes CPUs better when executing a sequence of statements in which each statement depends on the previous one.

Besides differences in hardware, GPUs have something else which makes them vastly distinct from CPUs. CPUs have a direct connection to the main memory, which means they have direct access to all data required to run a program. As seen in the schematic in Figure 2.1, GPUs have their own memory. This means that when offloading a segment of code to a GPU, the data required by that segment also has to be transferred to the GPU.

2.2

Performance Portability

Performance portability is, intuitively, a combination of two properties, summing up to the ability of the same application code to get similar performance on different platforms. The main advantages of using performance portable applications are visible in the context of heterogeneous platforms: only one version of the code needs to be maintained, and different (hardware) platforms can be used interchangeably.

Portability is tied to the programming model used when implementing an algorithm. This means using a portable programming model enables the same code to run on multiple platforms. For example, an application implemented in OpenMP can be run on a larger set of platforms than the same algorithm implemented in CUDA.

However, a portable program might not perform well on every platform in the defined set. In order to quantify these performance gaps, a performance portability metric (PPM) has been proposed [23]. There are three input variables in this metric, where a is the application, p is the problem, and H is the set of platforms. The PPM is defined as the harmonic mean of the efficiency of a program across all platforms in H [23]:

\[
PP(a, p, H) =
\begin{cases}
\dfrac{|H|}{\sum_{i \in H} \frac{1}{e_i(a, p)}} & \text{if } i \text{ is supported } \forall i \in H \\[2ex]
0 & \text{otherwise}
\end{cases}
\]

where e_i(a, p) is the efficiency of application a, solving problem p, on platform i.
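As a small worked example (with made-up efficiency values, not measurements from this thesis): for a set of two platforms on which an application reaches efficiencies of 0.8 and 0.4, the metric evaluates to

\[
PP = \frac{2}{\frac{1}{0.8} + \frac{1}{0.4}} = \frac{2}{1.25 + 2.5} \approx 0.53,
\]

illustrating that the harmonic mean is pulled towards the worst-performing platform.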

In the context of PPM, there are two different performance metrics of interest: application efficiency and architectural efficiency.

Application efficiency is a ratio which specifies how close the portable application gets, on a given platform, to the best known performance of the same application on that platform. In practice, to compute application efficiency, an algorithm is compared to the current best-known implementation of the same algorithm. Of course, in case the best-known implementation is far from optimal, the results will be misleading. Application efficiency is a practical metric, but it cannot be used without context: it becomes of greater or lesser importance based on the knowledge of the reference implementation.

Architectural efficiency is a ratio between the achieved performance of the application and the theoretical peak performance of the target platform. Depending on the type of application (compute- or memory-bound), we estimate architectural efficiency using either the computational throughput (i.e., the maximum number of FLOPs) or the memory bandwidth (i.e., the maximum transfer rate in GB/s). This metric is an objective approach to measuring performance, but it does not tell the whole story. For example, if the portable application is not implemented correctly, it can include unnecessary operations, which may inflate the architectural efficiency unfairly. Conversely, when it is impossible for an application to reach the peak performance, the value will be smaller than it should be. Both cases may lead to wrong conclusions being drawn from the measured value, so architectural efficiency has to be used in the context of the application.

Shortcomings of this metric appear when adding more heterogeneous platforms to the set [7], for example adding a CPU to a set of GPUs. In such cases the PPM for an algorithm can decrease drastically compared to a set of exclusively GPUs.

In this thesis, we exclusively use application efficiency to reason about performance. To ensure portability, we select a portable programming model, which is further explained in the next section.

2.3

OpenMP and code offloading

There are several cross-platform portable languages available for (parallel) programming. While Java is the first example that comes to mind for general computation, the typical example for parallel computing is OpenCL [25]. OpenCL has been designed based on a cross-platform open standard, and its goal was to support portability across CPUs, GPUs, and beyond (currently, this list includes FPGAs, too). However, studies have shown OpenCL not to be performance portable out of the box [8, 9]. Another example is OpenACC, which allows the same code to be compiled for CPUs or GPUs, and executes correctly on both.

While both OpenACC and OpenCL have been specifically designed for portability across multi- and many-core systems, OpenMP, the most famous shared-memory parallel programming model, has only recently been enhanced with the capability of running code on GPUs [21] without massive code modifications. This capability is called offloading, and we briefly discuss it below. In this work, we use OpenMP for three main reasons: (1) it is the de-facto standard programming model for parallel processing, meaning it is a popular choice for many programmers, (2) it is pragma-based, thus we expected it to be a high-productivity model, and (3) its offloading mechanisms are brand new, so an objective evaluation can be indicative of OpenMP's future success.

Pragma-based models

The main advantage of a pragma-based model is its high level of abstraction: programmers do not have to think about architectural details, workload distribution among processing threads, or about thread synchronization: these will all be solved through pragmas. A second advantage is that the programmer does not have to significantly alter the original sequential algorithm. Depending on what algorithm has to be parallelized, small code changes and the addition of a few lines in the form of pragmas should be enough to change a sequential implementation into a parallel one.
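As an illustration of this style (a minimal sketch, not code from this thesis; the function and array names are hypothetical), a sequential loop becomes parallel by adding a single pragma:

#include <omp.h>

void scale(float *a, int n, float factor) {
    /* The pragma instructs the compiler to divide the loop
       iterations over the available threads. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] *= factor;
    }
}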


Code Offloading

Offloading allows a segment of code to be executed on a target device; it is typically assumed that this is a different device than the CPU itself. This is especially useful when the target device is more specialized for parallel processing. For example, GPUs are a lot faster than CPUs when running parallelized code, but they (might) require a rewrite of the code in a GPU-specific programming model (like CUDA). Because the device does not necessarily share the same memory as the main program, the data is initially not available. To solve this, all data that is used in this offloaded segment of the code has to be sent to the device. Because of this, offloading initially incurs a performance loss compared to executing solely on the main processing unit. There can however be a large performance gain when running code on an external device which is more specialized for concurrency.
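A minimal offloading sketch (again with hypothetical names, not the thesis code): the map clause makes the required data explicitly available on the device, and the copy back at the end of the region is part of the data-movement cost described above.

void scale_on_device(float *a, int n, float factor) {
    /* Copy a[0..n-1] to the device, distribute the loop over the
       GPU's teams and threads, and copy the results back afterwards. */
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (int i = 0; i < n; i++) {
        a[i] *= factor;
    }
}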

Compilers

One disadvantage of pragma-based models is that a lot of the optimization burden is put on the compiler. Programmers have very little control over how the compiler handles parallelization, and surprising decisions (followed by surprising performance) might be observed. In turn, if the program shows poor performance (or other forms of weird behavior), it might be more difficult to pinpoint the problem. Additionally, a more practical disadvantage of pragma-based models is that not every compiler supports the pragma extensions to a language, thus restricting users to a subset of possible compilers.

Even for compilers that support the pragmas, there can be a lot of variance in the performance of the same algorithm when using different compilers [6, 15]. This makes optimizing an algorithm an even more difficult task. In the case of OpenMP, the compiler is free in how to interpret large parts of the language, mostly due to ambiguity in the language specification. This is a problem when optimizing an algorithm, as some optimizations might be good for some compilers, but have a negative effect with different compilers. Though the influence of the compiler on the performance of a performance-portable program is an interesting question in itself, it is not in the scope of our project.

Compilers also need to support offloading. Offloading is necessary to be able to run a program on other devices, in this case GPUs. Because OpenMP support for offloading is fairly new, not every compiler supports it. So, to support performance portability testing, the correct compiler has to be found.

2.4

Graphs and graph processing

Graphs are data structures which consist of nodes and edges. These data structures can be used to represent different types of information. For example a node can be an intersection in a road network, where edges are the roads connecting two intersections. This then creates an abstract representation of an actual road network.

Graph processing is then used to extract information from these abstract representations. To build upon the previous example, every edge (road) may then have an extra attribute specifying the actual length of the road. This extra information can then be used to find the shortest path from one node (intersection) to another node.


2.4.1

Breadth First Search (BFS)

The first algorithm we picked is BFS. It was chosen because it is a relatively simple algorithm. A lot of the variance in performance for BFS is in the choice of graph [27]. BFS traverses over every node in a graph layer by layer, starting from a predefined root-node.

while (!frontier.empty()) {
    for (node in frontier) {
        for (neighbour in node.neighbours()) {
            if (!neighbour.traversed) {
                neighbour.traversed = true
                next_frontier.add(neighbour)
            }
        }
    }
    swap(next_frontier, frontier)
}

An important term to understand BFS is the frontier. Frontier is defined as the active layer in BFS. For the first iteration that would be just the root-node. In the next iteration all the neighbours of the root node would be in the frontier. This continues until all nodes have been traversed exactly once.

BFS is not very compute-heavy; it is mainly memory-intensive.

2.4.2

PageRank

The second algorithm we implemented is PageRank [22]. PageRank is an algorithm by Google, which aims to rank pages by their relevance. This relevance is based on the relevance of their neighbours. For example when a highly relevant page (Wikipedia) links to a page, it is likely the linked page is also of high relevance.

 1  # Initialize
 2  for (node in nodes) {
 3      node.rank = 1 / nodes.length
 4  }
 5
 6  # Calculate rank
 7  while (iterations < n) {
 8      stop = false
 9      for (node in nodes) {
10          node.distribute_rank = node.rank / node.edges
11      }
12
13      for (node in nodes) {
14          node.rank = 0
15          for (neighbour in node.neighbours()) {
16              node.rank += neighbour.distribute_rank
17          }
18      }
19
20      if (nodes.current_ranks() - nodes.previous_ranks() < threshold) {
21          break
22      }
23  }


CHAPTER 3

Algorithms design and implementations

In this chapter we discuss the design and implementation of the two case-study algorithms in OpenMP.

3.1

BFS

There are many parallel BFS implementations available [27], and some of their specific features will be discussed later in this thesis as potential performance optimizations. Our initial BFS version is based on the Rodinia benchmark implementation [5]. We chose this implementation because it has no hardware-specific optimizations. We wanted our original version to be platform-agnostic; this enables us to use this basic implementation as a foundation to build upon, and to see the performance changes on both GPUs and CPUs when we add different types of optimizations.

The BFS implementation makes use of two masks. These masks are currently defined as byte arrays, but could be changed to bit masks for better memory usage. The first one (i.e., t_frontier) keeps track of which nodes will be traversed in the current BFS iteration, while the second one (i.e., t_next) keeps track of which nodes will be traversed in the next BFS iteration. This approach allows the two masks to be swapped at the end of every iteration. The second mask, t_next, has a race condition when multiple nodes have the same neighbour: more than one thread will try to write a value to the same index in the array. However, this value, 1, is the same for every thread. So even though BFS contains a race condition, it does not affect the result.
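The sketch below illustrates one iteration of this masked, top-down approach (it assumes the CSR-style arrays introduced in Section 3.3; array and function names are illustrative, not the actual Rodinia-derived code):

void bfs_step(const int *offsets, const int *degrees, const int *edges,
              unsigned char *t_frontier, unsigned char *t_next,
              unsigned char *visited, int n_nodes) {
    #pragma omp parallel for
    for (int v = 0; v < n_nodes; v++) {
        if (!t_frontier[v]) continue;
        t_frontier[v] = 0;
        for (int e = offsets[v]; e < offsets[v] + degrees[v]; e++) {
            int nb = edges[e];
            if (!visited[nb]) {
                visited[nb] = 1;   /* benign race: every writer stores 1 */
                t_next[nb] = 1;
            }
        }
    }
    /* The caller swaps t_frontier and t_next after every iteration. */
}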

Direction optimization

One of the best-known optimizations for parallel BFS is direction optimization [3], which combines the top-down approach with the bottom-up approach. The former starts from the frontier, and adds all of its neighbours to the frontier of the next iteration. The latter starts from all untraversed nodes, and checks whether any of their neighbours are in the frontier. If any neighbour is in the frontier, that node will be in the frontier in the next iteration of BFS.


Figure 3.1: Difference between top-down and bottom-up BFS.

Our initial BFS version uses the top-down implementation, without any runtime switching. We opted to exclude direction switching because (1) we have found it adds even more graph-specific fluctuation in performance, and (2) the switching parameters/strategy differ a lot between CPUs and GPUs [3, 28]. Avoiding this perturbation makes it easier to compare the performance of the algorithms themselves. We note that, if the algorithms themselves are equivalent, performance-wise, between OpenMP and native code, switching will have the same impact, regardless of the programming model.

3.2

PageRank

Our PageRank version was initially based on the specification of the original algorithm. We later added some algorithmic optimizations, taken from GAP [4], to this initial implementation. Much like our BFS implementation, this version is platform-agnostic. Although there are many other optimizations available for PageRank, they are not relevant at this point in our analysis.

The PageRank algorithm can be split up into three steps: initialization, calculation, and finalization. The initialization of the ranks is done by giving every page the same rank. These initialization values do not matter much, as the algorithm converges given enough iterations. The second step, calculation, iterates over all nodes in the network and, for every node, calculates its rank divided by its number of neighbours; the node then communicates this value to its neighbours. When every node has communicated its rank, the array with ranks is first normalized, and then compared to the array of the previous step. The difference between these arrays is used to determine whether the ranks have converged.

The third (and last) step, finalization, takes care of the sink nodes which are not ranked in the previous step. Then, the ranks are distributed one last time, and normalized at the end. These final operations do not change the ranks of the network much, but they ensure the sink nodes always have a rank in the network, based on their neighbours.

There are, however, two exceptions to the calculation step as defined in the original paper. First, a vertex should normalize using the total number of neighbours which are not sinks; in this case, a sink is a node which has no neighbours. Second, it should not communicate its rank to these sinks in this step. These two exceptions are made because sinks do not distribute their rank further to other nodes in the network, which in turn means these nodes would keep accumulating rank, while the rest of the network only loses rank. However, this is not a problem for our graphs, as we only use undirected graphs, which cannot have sinks because any node with an inbound connection will always have an outbound connection as well.

While these two exceptions are highlighted in the original paper, they are not implemented in all benchmarks. Thus, we have decided to exclude this part of the implementation from all performance measurements (GAP and GunRock included), and focus on the performance of the core algorithm.

Push vs Pull

There are two distinct methods which can be used to calculate PageRank: push-based and pull-based.

The push method has every vertex distribute its rank evenly among its neighbours (i.e., the vertex is a source). With every thread responsible for one vertex, each thread will update a memory location belonging to each neighbour. However, there might be multiple nodes writing to the same neighbour's memory location, which leads to a potential data race. To solve this conflict, atomic operations have to be introduced for the distribution of ranks. As rank distribution is the largest part of the algorithm, this potential serialization (also dependent on the input graph) is likely to have a large impact on the algorithm's performance.

The pull method also uses one thread per vertex, but defines every node as a recipient. This means every node only needs to sum the rank distributed by all its neighbours; thus, each thread only needs to read its neighbours' shared memory locations. For reading, no atomic operations are required. However, this approach does require the rank distribution per node to be calculated in advance. So our pull implementation has an extra loop which calculates the rank every node will distribute. These rank values are then stored in an array, with a rank for every node.
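A minimal sketch of the two variants (assuming the CSR-style arrays from Section 3.3; names are illustrative, not the benchmark code). The push loop needs an atomic update, the pull loop does not:

/* Push: every source vertex adds its contribution to its neighbours.
   new_rank[] is assumed to be zero-initialised by the caller. */
void pagerank_push(const int *offsets, const int *degrees, const int *edges,
                   const float *rank, float *new_rank, int n_nodes) {
    #pragma omp parallel for
    for (int v = 0; v < n_nodes; v++) {
        if (degrees[v] == 0) continue;          /* nothing to distribute */
        float contrib = rank[v] / degrees[v];
        for (int e = offsets[v]; e < offsets[v] + degrees[v]; e++) {
            /* Several sources may share a neighbour, so the update must be atomic. */
            #pragma omp atomic
            new_rank[edges[e]] += contrib;
        }
    }
}

/* Pull: every destination vertex only reads the precomputed contributions. */
void pagerank_pull(const int *offsets, const int *degrees, const int *edges,
                   const float *distribute_rank, float *new_rank, int n_nodes) {
    #pragma omp parallel for
    for (int v = 0; v < n_nodes; v++) {
        float sum = 0.0f;
        for (int e = offsets[v]; e < offsets[v] + degrees[v]; e++)
            sum += distribute_rank[edges[e]];   /* read-only: no atomics needed */
        new_rank[v] = sum;
    }
}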

3.3

Storing Graphs

To support the actual algorithms for graph processing, designing a suitable data structure is mandatory. For graphs, a node structure using pointers to other nodes seems like a good start. However, this is difficult to make compatible with offloading, as the pointers cannot be directly translated to the memory of a different device. To support both offloading and graphs in a single data structure, a fixed-size array with all nodes is easier to use. However, as nodes can have a variable number of edges, using a single fixed-size array with all data is infeasible.

Instead, we propose to use the CSR format, which uses two arrays: one to store all the nodes and their data (e.g., node properties, or state), and a second one to contain all edges and their data (e.g., weights). The edges are sorted in the same order as the nodes, meaning all edges for node 0 come first in the list, continuing with every node in ascending order. All edges spawning from these nodes are sorted in ascending order as well. For this format, each vertex also has to store two additional values: its start offset in the edge array, and the number of edges it has. This allows every node to access all of its edges without requiring variable-length arrays.
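A minimal sketch of such a layout (field names are illustrative and may differ from the thesis code):

/* CSR-style graph: all nodes in one array, all edges in another. */
typedef struct {
    int  n_nodes;
    int  n_edges;
    int *offsets;   /* per node: start index into edges[]                 */
    int *degrees;   /* per node: number of edges spawning from it         */
    int *edges;     /* destination node of every edge, grouped per source */
} graph_t;

/* Iterate over all neighbours of node v. */
static void visit_neighbours(const graph_t *g, int v, void (*visit)(int)) {
    for (int e = g->offsets[v]; e < g->offsets[v] + g->degrees[v]; e++) {
        visit(g->edges[e]);
    }
}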


CHAPTER 4

Experimental Setup

All the results presented in this thesis are based on a unified experimental setup, which we describe in more detail in this chapter.

4.1

Hardware platforms

We use two different machines: one mainly for relative performance, the other for absolute performance. This distinction is made because some of the tests we want to do are mainly meant to find optimizations for our code. When looking for these optimization changes, the performance only matters relative to the base implementation. We need the absolute performance to be able to say how well our implementation performs compared to state-of-the-art implementations. The two platforms are highlighted in Table 4.1. Machine A is the local machine we will use for the relative tests. Machine B is located on the DAS-5 and will be used to get the absolute performance, so our performance experiments will run on the DAS-5 [2].

Machine   CPU model              Cores (threads)   Freq. [GHz]   GPU model   CUDA cores   Freq. [GHz]
A         Intel i7 7700          6 (12)            4.2           GTX 1070    2048         ≤ 1.6
B         Intel Xeon E5-2630v3   16 (32)           2.4           TitanX      3584         ≤ 1.5

Table 4.1: The two machines used for all experiments across this thesis.

Our compiler is GCC version 8.2.0. We used the following flags for optimization and offloading: -O3 -fopenmp -foffload="-lm" -DGPU. The same code can be compiled for the CPU using the flags -O3 -fopenmp. The script we used to set up the compiler to also support offloading is taken from Krister Walfridsson's blog.

4.2

Input Graphs

As input graphs we use two distinct groups: synthetic graphs and real graphs. They are presented in Tables 8.1 and 8.2.


The synthetic graphs are obtained from the generator of Graph500 [11, 19], and they are popular for graph processing benchmarking because of the control users have on their properties, including scale and topology.

The second set of graphs are real networks, as provided by online repositories. For example, roadNet-CA represents the road network of the state of California, where intersections are vertices, and roads are edges. Our selected real graphs are obtained from SNAP [16] and KONECT [14]. Compared to Graph500, there is a lot more diversity in the properties of these graphs, as illustrated in the table by the average degree and diameter. Thus, the behavior of our two algorithms (as well as their response to optimizations) will show larger variations than in the Graph500 family, potentially illustrating additional concerns/challenges in terms of performance portability.


CHAPTER 5

Using OpenMP: Parallelization,

Offloading, and Optimizations

In this chapter we discuss our approach to using OpenMP for parallelizing the two algorithms of interest on CPUs and GPUs (through offloading). We propose a semi-automated solution to auto-tune these implementations. We then apply further optimizations, driven by the nature of the graph processing algorithms, and demonstrate their performance impact. We will make further use of all these techniques and optimizations in the following chapter, where the core performance analysis is presented.

5.1

Performance Parameters

There are three parameters that are very influential on the performance of a parallel OpenMP graph-processing algorithm. First, an OpenMP graph-processing algorithm's performance can vary a lot with a small change in the source code, like altering a pragma [15]. Second, different inputs can have a large impact on the performance of an algorithm: different graph properties, including size and topology, require different approaches to parallelization. In other words, it is difficult to have an optimal algorithm for all graphs. Third, and last, the choice of compiler has also been shown to lead to significant variance in performance [15].

While predicting the impact of graph properties or improving compilers are beyond the scope of this work, the OpenMP pragmas are a parameter we can (auto-)tune: we can determine which pragmas provide the best possible performance for a given workload (i.e., algorithm and dataset). To be able to perform auto-tuning, a systematic exploration of the pragma-based parallelization space is necessary. Thus, we need the ability to test a large number of pragma combinations and select the best-performing one for our workload.

To manually define and test all these combinations would be a lengthy, error-prone process, given that we need to also take into account the graphs themselves.

Take, for example, a BFS on a "line" of connected nodes. This type of graph will not perform well in a parallelized implementation compared to the same algorithm in a sequential implementation. Thus, it is also necessary to test different workloads, covering multiple classes of graphs.


Therefore, we created an automated tool (implemented as a script) which generates the different values for the parameters, and tests each combination automatically, for each input graph. Running the script on multiple graphs is simple, as it only requires a change in a parameter of the program call.

However, the search becomes more challenging when altering pragmas. This is difficult because pragmas are interpreted by the C preprocessor. To be able to change pragmas, changes have to be made to the code directly, and then the code has to be compiled, executed, and benchmarked.

5.1.1

Pragma Switching

To be able to test multiple combinations of pragmas, we devised a way to easily switch different pragmas in the code. Our method combines two parts: pragma configuration and pragma placement.

To enable the generation of pragmas correct by construction, we defined two classes: inner pragmas and outer pragmas. This distinction is necessary because of the requirements we have for our offloading pragmas. In OpenMP, offloading requires specialized pragmas, which define what block of code will be distributed over the cores of a GPU, and what block of code will be parallelized on those cores. This initial distribution, called teams distribute, can only be done once - nested distribution of workload over cores makes no sense, given that there is no nested hardware parallelism on GPUs - i.e., there are no CUDA cores nested inside CUDA cores. It is however possible to parallelize inner loops, after distributing the workload over cores. Thus, outer pragmas refer to the core-level distribution, while inner pragmas focus on inner-loop parallelization.

To enable the generation of the inner and outer pragmas, we designed a Python script which adds #defines to source files. Pragmas are added with the built-in C99 operator _Pragma, which takes one argument: a string with the parameters for the pragma. This allows complex pragmas to be inserted into the code by simply inserting a string. To make sure the Python script does not alter original source files, the script prepends these #defines to a clone of the original file; these clones have an underscore prepended to their name, e.g. pagerank.c → _pagerank.c.

For example, the listing below presents a code snippet with offloading enabled; this code can be edited to allow our pragma auto-tuning script to run on it.

// original code with OpenMP offloading
#pragma omp teams distribute
for (int i = 0; i < n; i++) {
    int sum = 0;
    #pragma omp parallel for
    for (int j = 0; j < m; j++) {
        sum += A[j];
    }

    B[i] = sum;
}
...

// edited code for automatic pragma addition
PRAGMA_OUTER
for (int i = 0; i < n; i++) {
    int sum = 0;
    PRAGMA_INNER
    for (int j = 0; j < m; j++) {
        sum += A[j];
    }

    B[i] = sum;
}
...
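To make this concrete, the #defines prepended to the cloned file could look as follows (a hypothetical sketch; the exact pragma strings are generated per test run by the script):

/* Prepended to _pagerank.c by the auto-tuning script. */
#define PRAGMA_OUTER _Pragma("omp target teams distribute")
#define PRAGMA_INNER _Pragma("omp parallel for")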

Using this method for automatic addition of pragmas to our algorithms is straightforward. For our chosen sequential implementation of BFS (see the listing in Section 2.4.1), there are two loops: the outer loop traverses the nodes, and the inner loop traverses the neighbours of each of these nodes. For our sequential implementation of PageRank, the placement of pragmas is almost the same. For the rank calculation segment, there also exists a distinction between traversing all nodes and traversing the neighbours of these nodes. However, in PageRank, there is an additional loop which may be parallelized: the calculation of distribute_rank, on lines 9-11, may also be done in parallel. We chose to define this loop as an outer loop, giving it the same pragma as the node traversal loop on line 13.

5.1.2

Exceptions

Problems could arise when automatically generating code. For example, the code can become stuck in an infinite loop. We wanted the script to run without having to monitor the progress constantly, so we can run the program with large parameter sweeps, enabling us to spend more time optimizing the algorithms. To enable some form of fault-tolerant behavior, we implemented two types of exceptions. The first one is blacklisting certain combinations of pragmas. The second one is setting a time-out for the computation process.

Pragma blacklisting is done to prevent programs from getting stuck in infinite loops. For example, when a pragma does not specify that a certain variable has to be updated on a global scale, the program might never exit. Blacklisting also prevents unnecessary processing of certain pragma combinations: there are combinations of pragmas which cannot be compiled, and can thus always be skipped. We note that it is the responsibility of the user to define the blacklisted pragmas - while we could envision simple mechanisms to also automatically determine such pragmas (for example, in combination with the time-out watchdog), this remains outside the scope of this thesis.

The time-out watchdog is a more general solution, which ensures the process of testing a large parameter space does not stall. However, a drawback of this method (especially when using a fixed timer) is that it might also terminate processes which are still calculating. Our current limit is based on the performance of existing benchmarks on the largest graph in our set, multiplied by 20 so we do not unnecessarily break out of computations. Of course, such over-provisioning wastes a lot of time when a run on a small graph is stuck in an infinite loop. The more elegant solution is to set a dynamic upper bound, which scales with input graph size, but more research is required to find the parameters of such a scalability model.

5.1.3

Performance Data

Because of the large number of versions under test, a data management solution is needed for our performance data. In this work, we decided to use a database which stores all performance data, allowing us to quickly query all information we might need when selecting the best versions or reasoning about performance portability. Our current implementation uses an sqlite3 database [12], because it is quick to set up and works natively with python3.


For every run, we collect two different execution time intervals. The first one is the time it took for the data to be sent to the device and back - we call this memory time. The second value we store is the actual execution time of the kernel, i.e., the algorithm itself; this is sometimes called kernel time. We note that this separation allows us more detailed insights into the core performance characteristics of our implementations.
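A minimal sketch of how the two intervals can be measured with OpenMP's wall-clock timer (illustrative only, not the actual measurement harness):

#include <omp.h>
#include <stdio.h>

void timed_offload(float *a, int n) {
    double total_start = omp_get_wtime();
    /* The target data region transfers a[] to the device and back;
       keeping it outside the kernel timer isolates the kernel time. */
    #pragma omp target data map(tofrom: a[0:n])
    {
        double kernel_start = omp_get_wtime();
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; i++) {
            a[i] *= 2.0f;
        }
        double kernel_end = omp_get_wtime();
        printf("kernel time: %f s\n", kernel_end - kernel_start);
    }
    double total_end = omp_get_wtime();
    /* Total time minus kernel time approximates the memory (transfer) time. */
    printf("total time:  %f s\n", total_end - total_start);
}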

5.1.4

Results

We tested the automated pragma switching on both algorithms - BFS and PageRank - using both the GPU and the CPU. For BFS, we use only the top-down approach. For PageRank, we present both algorithms: (1) the initial implementation as specified in [22], and (2) the vertex pull implementation which is based on the implementation in the GAP benchmark [4]. We measured both memory transfer time and execution time. However, the graphs included here only show kernel execution time, as memory time is a constant overhead per graph (i.e., it is based on the graph size and does not change with changes of the pragmas).

All results in Figures 5.1-5.6 are obtained on machine A (see Table 4.1), i.e., not on DAS-5. Moreover, these figures use log scale, as the actual values here are less important than the relative performance to other pragma combinations.


Figure 5.1: Breadth First Search algorithm on CPU. Pragma-switching performance comparison on Graph500 graphs.

Figure 5.2: Breadth First Search algorithm on GPU. Pragma-switching performance comparison on Graph500 graphs.


Figure 5.3: PageRank algorithm on CPU. Using vertex push method to update rank. Pragma-switching performance comparison on Graph500 graphs.

Figure 5.4: PageRank algorithm on GPU. Using vertex push method to update rank. Pragma-switching performance comparison on Graph500 graphs.


Figure 5.5: PageRank algorithm on CPU. Using vertex pull method to update rank. Pragma-switching performance comparison on Graph500 graphs.

Figure 5.6: PageRank algorithm on GPU. Using vertex pull method to update rank. Pragma-switching performance comparison on Graph500 graphs.


Based on these results, we make the following observations:

1. The performance gets worse when parallelizing the inner loop. This holds for both the CPU and the GPU, and for both PageRank and BFS. Thus, for these workloads, the inner loop should not be parallelized.

2. For PageRank push, parallelizing on the GPU exceeds the performance of not parallelizing only on very large graphs. This is not the case for the pull version.

5.2

Removing Atomic Operations

After devising our mechanism to determine and select the best-performing pragma combination for offloading, we explored additional optimizations. The first such optimization targets the potential performance benefits of removing atomic operations.

Atomic operations are there to prevent multiple threads from interfering with each other while writing to memory. For example, incrementing a counter is not thread-safe, as the actual operation has three distinct parts: read the value from memory, update the value, and write the new value back to memory. Without atomic operations, multiple threads might execute these operations in an interleaved manner, thus producing incorrect results.

To prevent these non-deterministic results, atomic operations lock the memory region on which they read/update/write. This means the shared value will always be updated correctly, without interference from other threads. This prevents memory errors, but it reduces the performance of a parallel program.
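As a small illustration of the counter example above (a sketch, not code from the thesis):

int count_visited(const unsigned char *visited, int n) {
    int counter = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (visited[i]) {
            /* The increment is a read-modify-write; without the atomic
               (or a reduction clause) concurrent updates could be lost. */
            #pragma omp atomic
            counter++;
        }
    }
    return counter;
}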

While examining our code, we found an atomic operation in our BFS implementation to be unnecessary, and we could remove it without altering the algorithm. This atomic operation was updating a variable when a new node was processed, and had been added to prevent multiple threads from interfering with each other, as highlighted above. However, this turned out not to be necessary, as the atomic operation had no influence on the correctness of the algorithm: the variable was always set to the same value, so even if multiple threads interfered, the end result was the same.

To see the exact performance impact of this optimization, we compared the version before and after removing the unnecessary atomic. We report the execution time (averaged over three runs) on Graph500-22 in Figure 5.7. We notice an improvement of 1.2 times compared to the original version.


Figure 5.7: Relative speedup compared to the original BFS implementation

5.3

Different Scheduling Methods

Scheduling defines a method by which work is divided and allocated to threads for execution. For example, the auto scheduling method delegates the scheduling to the compiler. This seems to also be the default scheduling method.

Scheduling is a relevant optimization for graph processing because it provides a mechanism to counter the inherent load imbalance in graph processing workloads. As different threads process different nodes, it is intuitively clear that threads handling nodes with wildly different degrees will have to process unbalanced loads; a schedule where this imbalance is not countered will degrade performance, because all threads must wait idly for the slowest one to complete.

All scheduling methods are defined in more detail in the OpenMP documentation [21]. It is however up to the compiler how to interpret and implement these methods. In our case, we use GCC, and it is thus possible to look up its implementation of auto (because GCC is open source). For GCC, schedule(auto) is a direct mapping to schedule(static).

To determine whether altering the scheduling policy is indeed a viable optimization, we have experimented with different scheduling policies (note that varying the policy is a matter of changing a single pragma parameter) for our PageRank implementation. The results are presented in Figure 5.8.


Figure 5.8: Comparing different scheduling methods on Graph500-21 and 22. Block size is an arbitrary number.

We found dynamic scheduling to be the optimal scheduling method for PageRank; it increases the performance by roughly 40-50%. Dynamic scheduling hands a thread blocks of n iterations at a time, and thus reduces communication overhead with the parent thread. Though this does decrease overhead, it might not increase performance if the block size is too large. This is because, if the blocks are large, the thread that gets the last block might have to do a lot of computation while the other threads are waiting. This is a problem that guided scheduling should solve, as guided scheduling should reduce the block size dynamically. The problem is that the GCC implementation is naive, and does a poor job at this. This might be something to look into when experimenting with a different compiler. We will however stick with GCC, and thus will use schedule(dynamic, n).

Next, to determine optimal performance, we empirically tuned our algorithm to find the best possible block size. Specifically, we ran tests with different block sizes, on different sizes of graphs. Our results are presented in Figure 5.9.


Figure 5.9: Comparing dynamic scheduling block sizes on increasing graph sizes. The graphs used are from the Graph500 set, size 12 to 22.

We find that, for smaller graphs, smaller block sizes are better than larger block sizes, while for the larger graphs, the larger block sizes are better. We do observe that for the largest graphs, a block size of 256 is worse than a block size of 128 or 64. This variation indicates that tuning of block sizes is necessary for every type of workload when optimal performance is desired. However, for the remainder of our performance analysis, we have selected a block size of 64, which provides the best value, on average, across our set of graphs.


CHAPTER 6

Performance Portability Analysis

In this chapter we revisit performance portability. Specifically, we report application efficiency (i.e., our best performance compared against the best-known versions of the two proposed algorithms) for our set of two platforms, two algorithms, and two datasets. We aim to interpret these results from the perspective of our original question: is our OpenMP implementation performance portable?

One thing to note, however, is that we chose not to implement direction optimization. This leads to a comparison-related problem: to enable a fair comparison between our portable version and the optimized algorithms, we must choose one of only two options: optimizing our own algorithm, or altering the benchmarks to not use this optimization. We have, however, not managed to remove direction optimization from all benchmarks. This means the performance of our implementation is bound to be worse than the performance of those benchmarks.

6.1

Performance portability baselines

To be able to determine how well our portable algorithms perform, we use application efficiency. To compute it, we have to specify a baseline to act as the best-performing version of the application on each platform.

We select these baselines from well-established graph processing benchmarking suites and/or programming models. For CPUs, the original candidates are GraphMat [26], PowerGraph [10], ligra [24], and the GAP benchmark [4]. For GPUs, the original candidates are GunRock [28] and nvGraph [20].

However, GraphMat and nvGraph are both matrix-based graph processing benchmarks, using algorithms which are fundamentally different than our graph-traversal based algorithms [13]. Therefore, we will not compare to these benchmarks directly.

As GunRock, ligra, and GAP are all graph-based implementations, meaning they are similar enough to our implementations, we selected these three implementations to determine the best-known performance for our workloads: GunRock as the benchmark for GPUs, and ligra and GAP as benchmarks for CPUs.


6.2

Performance Results

We compare the benchmarks' performance against our own implementations for both the GPU and the CPU, and for both BFS and PageRank. The same core OpenMP implementation is used for the GPU and the CPU. However, a specialization step is required to adapt our OpenMP code to work on both platforms. Specifically, the pragmas which define the code segment to be offloaded to the GPU do not work for the CPU version. To cope with this mismatch, we embedded a specialization switch in the code, which can be targeted using compiler flags: passing the flag -DGPU changes all the relevant pragmas to GPU-specific pragmas, and the absence of this flag implies the code is CPU-specific. We have, however, not made any differences between the two specialized pragmas which alter the meaning of the parallelization. This means we exclusively removed GPU-specific keywords to make the code compatible with the CPU. An example of this can be seen in the legend of Figure 5.5 compared to Figure 5.6.
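A minimal sketch of such a specialization switch (illustrative only; the actual macros and pragma strings in the thesis code may differ):

void scale(float *a, int n, float factor) {
#ifdef GPU
    /* GPU build (-DGPU): offload the loop and map the data to the device. */
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
#else
    /* CPU build: plain thread-level parallelization of the same loop. */
    #pragma omp parallel for
#endif
    for (int i = 0; i < n; i++) {
        a[i] *= factor;
    }
}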

Not all of the benchmarks we tested had competitive results. For example, ligra PageRank [24] was approximately 30 times slower than other implementations. Because these benchmarks are far from competitive, we have not included all benchmarks in our results.

6.3

Results

We discuss the CPU and GPU results separately.

6.3.1

CPU results

Figures 6.1a and 6.1b present the PageRank performance comparison between our OpenMP version and the GAP suite version. Similarly, Figures 6.2a and 6.2b present the BFS performance comparison between our OpenMP version and the GAP suite version.

(a) The PageRank performance of our best CPU implementation vs. the GAP implementation on the Graph500 set.

(b) The PageRank performance relative to the GAP benchmark, in the form of slow-down. Lower values are better, and values lower than 1 indicate OpenMP outperforms GAP.


(a) The BFS performance of our best CPU implementation vs. the GAP version on the Graph500 set.

(b) The BFS performance relative to the GAP benchmark, in the form of slow-down. Lower values are better, and values lower than 1 indicate OpenMP outperforms GAP.

We make the following observations:

1. Our PageRank implementation becomes better relative to the reference implementation when the graph size increases. This implies unnecessary overhead from our implementation.

2. Our BFS implementation becomes relatively worse when the graph size increases; this can be attributed to the difference in algorithms. The GAP benchmark does not easily allow disabling direction optimization, so our implementation is most likely at a disadvantage in that regard.

3. The performance of our BFS implementation can possibly be improved to match the benchmark with direction optimization.

6.3.2

GPU results

Figures 6.3a and 6.3b present the PageRank performance comparison between our OpenMP version and the GunRock version; Figures 6.4a and 6.4b present the same comparison for BFS.

(a) The PageRank performance of our best GPU implementation vs. the GunRock version on the Graph500 set (Table 8.1).

(b) The PageRank performance relative to the GunRock benchmark, in the form of slow-down. Lower values are better, and values lower than 1 indicate OpenMP outperforms GunRock. The value at Graph500-23 is 8.


(a) The BFS performance of our best GPU implementation vs. the GunRock version on the Graph500 set (Table 8.1).

(b) The BFS performance relative to the GunRock benchmark, in the form of slow-down. Lower values are better, and values lower than 1 indicate OpenMP outperforms GunRock.

We have found some runs to have unexpected performance results in the GPU BFS implementation (Figure 6.4a). As the Graph500 graphs were all generated with the same parameters, they should all have approximately the same characteristics, except for being doubled in scale with each graph. This means the execution time should also increase exponentially with the increase in graph scale. However, looking at the non-direction-optimized data, the results seem to vary wildly per graph. More investigation is needed into why these performance variations occur.

Currently, these unexpected results are limiting our observations, and we are in the process of investigating them further. We expect that each increase in graph size would also increase execution time, but we see a lot of variance in the baseline results for smaller graphs. Based on the majority of data points, we assume the data for graphs 13-14 and 17-19 are incorrect. If this is indeed the case, we can have a more optimistic view of our implementation.

GPU Performance

Because the GPU performance of our implementation is poor compared to GunRock, we decided to further investigate why this may be the case. For this we used the NVIDIA Visual Profiler, which allows the analysis of CUDA code being executed on a GPU. It shows a timeline of all CUDA commands which were executed while the specified process is running.

Using the Visual Profiler on our program, we found there is a lot of overhead between iterations of PageRank. In our source code, the only code between iterations of PageRank is the swap of two buffers. However, in the timeline of the Visual Profiler, we found that between every iteration there are two additional cuAllocs and two cuFrees. These memory allocations and frees are not found in our source code, so they must have been added by the OpenMP runtime.

This extra overhead adds a fixed amount of time before and after every execution of a parallel section. This fixed overhead explains the huge difference in execution times for smaller graphs, and why the difference shrinks as the graphs grow, as seen in Figure 6.3a.
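To make the issue concrete, the sketch below shows how we would expect such an iteration loop to be structured so that both rank buffers stay resident on the device and only host pointers are swapped between iterations. This is a minimal sketch, not our actual source code: it assumes a pull-style CSR over incoming edges, and all names (offsets, edges, degree, rank_old, rank_new) are illustrative. In principle, with both buffers mapped for the lifetime of the loop by the enclosing target data region, no per-iteration allocations should be necessary; whether the observed allocations disappear in practice depends on how the compiler and runtime implement the offloaded regions.

/* Minimal sketch (illustrative names, not our source): keep both rank buffers
 * resident on the device for the entire run, so that the per-iteration target
 * regions should not need implicit allocations. The graph is assumed to be a
 * pull-style CSR over incoming edges; degree[] holds out-degrees. */
void pagerank_device_resident(int n, long m, const long *offsets,
                              const int *edges, const int *degree,
                              double *rank_old, double *rank_new,
                              int max_iters, double damping)
{
    #pragma omp target data map(to: offsets[0:n+1], edges[0:m], degree[0:n]) \
                            map(tofrom: rank_old[0:n]) map(from: rank_new[0:n])
    {
        for (int iter = 0; iter < max_iters; iter++) {
            #pragma omp target teams distribute parallel for
            for (int v = 0; v < n; v++) {
                double sum = 0.0;
                for (long e = offsets[v]; e < offsets[v + 1]; e++)
                    sum += rank_old[edges[e]] / degree[edges[e]];
                rank_new[v] = (1.0 - damping) / n + damping * sum;
            }
            /* Swap the host pointers only; both blocks stay mapped on the
             * device, so no allocation or free should be needed here. */
            double *tmp = rank_old; rank_old = rank_new; rank_new = tmp;
        }
    }
    /* Both buffers are copied back at the end of the data region; after the
     * loop, the (local) rank_old pointer indicates which host array holds the
     * final ranks. */
}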

Graph Diversity

We have also run our algorithms on the SNAP set, but found these results to be very similar to what we found for the Graph500 set. The unexpected results for smaller graphs are also present in these SNAP graphs, and the performance difference between our implementation and GunRock persists across all SNAP graphs. We have omitted the SNAP results from our results chapter for brevity; for completeness, they are included in Appendix B (Table 9.1).


CHAPTER 7

Conclusion

To achieve good performance for graph processing, moving to parallel architectures is necessary. The problem with this approach, however, is that we do not want different codebases for every architecture. For this reason, our goal in this work was to determine whether OpenMP provides code portability and performance portability for graph processing workloads. Specifically, we analyzed the performance of two algorithms implemented in OpenMP: BFS and PageRank.

7.1 Portability

We have observed that, in general, the core parallel versions of the algorithms are the same for the two platforms. However, some specialization is still necessary. Our automated pragma-switching tool was able to determine the best combination of pragmas for each platform, and these combinations differ: offloading to a GPU distinguishes between parallelization over multiple cores (teams) and parallelization over the threads within those cores, a distinction that CPU parallelization does not make. Consequently, to run GPU code on a CPU, all core-parallelization pragmas have to be converted to thread-parallelization pragmas. We further note that, due to the OpenMP language itself, additional specialization is needed when using the same codebase on both platforms. This specialization targets only a specific OpenMP construct, but it does mean that, ultimately, the code is slightly different. To solve this problem, the OpenMP standard should make core-parallelization pragmas compatible with CPU execution; this would allow exactly the same codebase to be used for multiple platforms.
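To illustrate the kind of specialization involved, the following sketch shows the same loop under the two pragma variants. The USE_GPU compile-time switch and process_vertex() are hypothetical placeholders, not part of our tool or codebase; our pragma-switching tool selects the combination automatically rather than via a macro.

/* Illustrative sketch of per-platform pragma specialization. USE_GPU is a
 * hypothetical compile-time switch, and process_vertex() stands in for the
 * real BFS/PageRank loop body. */
#pragma omp declare target
static void process_vertex(int v) { (void)v; /* real per-vertex work here */ }
#pragma omp end declare target

void process_all_vertices(int num_vertices)
{
#ifdef USE_GPU
    /* GPU: distribute the loop over teams (cores) and over threads within each team. */
    #pragma omp target teams distribute parallel for
#else
    /* CPU: only thread-level worksharing is available; a teams construct is not usable here. */
    #pragma omp parallel for
#endif
    for (int v = 0; v < num_vertices; v++)
        process_vertex(v);
}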

7.2 BFS performance

The difference in performance between our implementation and other implementations can be largely attributed to the difference in algorithms (Figure 6.2a). Our goal was to find out why the performance of OpenMP itself is better or worse; for that reason we deliberately did not implement the switching from top-down to bottom-up traversal. This switching is proven to have a positive impact on the performance of the algorithm [3], but performing it at runtime would make it harder to analyze where the choke-points of OpenMP itself lie.
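For reference, the runtime switching heuristic proposed by Beamer et al. [3] can be sketched as below. This is a sketch of their published heuristic, not of our code, and the parameter values (alpha = 14, beta = 24) are the ones suggested in that paper.

#include <stdbool.h>

/* Sketch of the direction-optimizing heuristic from Beamer et al. [3].
 * m_f: edges incident to the frontier, m_u: edges incident to still-unvisited
 * vertices, n_f: number of vertices in the frontier, n: total vertex count. */
static const int alpha = 14;
static const int beta  = 24;

/* Switch from top-down to bottom-up when the frontier touches a large
 * fraction of the unexplored edges. */
bool should_go_bottom_up(long m_f, long m_u) {
    return m_f > m_u / alpha;
}

/* Switch back to top-down once the frontier has become small again. */
bool should_go_top_down(long n_f, long n) {
    return n_f < n / beta;
}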

To preserve a fair comparison, we managed to remove the direction optimization from the GunRock benchmark. This new version of the GunRock benchmark (i.e., without direction optimization) shows scaling similar to our implementation (Figure 6.2a). However, there are some large unattributed differences between results, which are still under investigation. Overall, based on the results for the larger Graph500 graphs (indeed, the challenge is to provide high performance for large graphs), we can say that our implementation is approximately 2 to 3 times slower than the GunRock baseline.

7.3 PageRank performance

The performance of our PageRank implementation, when run on the CPU, is similar to that of GAP; the difference that remains is likely due to additional overhead. Our portable PageRank is therefore competitive with CPU benchmarks. However, a larger performance gap appears when comparing this portable algorithm to a specialized GPU algorithm written in CUDA. The difference between GunRock and our implementation is extremely large for smaller graphs and becomes less significant on larger graphs. We identified this difference as mainly resulting from additional CUDA operations inserted by OpenMP: between our source code and the CUDA code actually executed on the GPU, unnecessary memory operations are performed, slowing down our implementation significantly.

We have not managed to identify the purpose of these memory operations, and can thus not say whether they are necessary for a correct algorithm. However, when optimizing PageRank further, these extra memory operations should be the first thing to analyze.

7.4 Final Words

We managed to show that OpenMP code can be reasonably performant compared to specialized code. The performance difference is small on the CPU and larger on the GPU, and this holds across both the synthetic and the real graphs on which we tested our algorithms. While we chose not to focus on extracting the best possible performance from our algorithms, direction optimization can be implemented retroactively. Such an extension would likely improve our performance enough to match the CPU benchmarks, but it is left for future work.

The large gap in GPU performance can be due either to an inefficient implementation of offloading by the compiler, or to our version not being able to make use of all available GPU resources. Further analysis of the performance gap between OpenMP and native GPU versions is required. This analysis would likely require in-depth reverse-engineering of the reference code, and possibly an investigation of the compiler's code generation, to find out why our OpenMP offloading seems uncompetitive. We see this analysis, as well as the benchmarking of different compilers, as the highest priority for future work.


Bibliography

[1] Vikas Agarwal, MS Hrishikesh, Stephen W Keckler, and Doug Burger. Clock rate versus ipc: The end of the road for conventional microarchitectures. In ACM SIGARCH Computer Architecture News, volume 28, pages 248–259. ACM, 2000.

[2] H. Bal, D. Epema, C. de Laat, R. van Nieuwpoort, J. Romein, F. Seinstra, C. Snoek, and H. Wijshoff. A medium-scale distributed system for computer science research: Infrastructure for the long term. Computer, 49(05):54–63, May 2016.

[3] S. Beamer, K. Asanovic, and D. Patterson. Direction-optimizing breadth-first search. In SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–10, Nov 2012.

[4] Scott Beamer, Krste Asanovic, and David A. Patterson. The GAP benchmark suite. CoRR, abs/1508.03619, 2015.

[5] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC), pages 44–54, October 2009.

[6] Christopher Daley. Evaluation of openmp performance on gpus through micro-benchmarks. DOE Performance, Portability and Productivity Annual Meeting, April 2019.

[7] Henk Dreuning, Roel Heirman, and Ana Varbanescu. A Beginners Guide to Estimating and Improving Performance Portability. In ISC High Performance 2018 International Workshops, Frankfurt/Main, Germany, June 28, 2018, Revised Selected Papers, pages 724–742, June 2018.

[8] Peng Du, Rick Weber, Piotr Luszczek, Stanimire Tomov, Gregory Peterson, and Jack Dongarra. From cuda to opencl: Towards a performance-portable solution for multi-platform gpu programming. Parallel Computing, 38(8):391–407, 2012.

[9] Jianbin Fang, Ana Lucia Varbanescu, and Henk Sips. A comprehensive performance comparison of cuda and opencl. In 2011 International Conference on Parallel Processing, pages 216–225. IEEE, 2011.

[10] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In Presented as part of the 10th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 12), pages 17–30, 2012.

[11] Graph500 Committee. Graph500 large-scale benchmarks. https://graph500.org/.

[12] D. Richard Hipp. sqlite3. https://www.sqlite.org/.

[13] J. Kepner, P. Aaltonen, D. Bader, A. Buluç, F. Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenke, S. McMillan, C. Yang, J. D. Owens, M. Zalewski, T. Mattson, and J. Moreira. Mathematical foundations of the graphblas. In 2016 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–9, 2016.

[14] Jérôme Kunegis. Konect: the koblenz network collection. In Proceedings of the 22nd International Conference on World Wide Web, pages 1343–1350. ACM, 2013.

[15] Jeff Larkin. Portability of openmp offload directives. SC17, 2017.

[16] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.

[17] Andrew Lumsdaine, Douglas Gregor, Bruce Hendrickson, and Jonathan Berry. Challenges in parallel graph processing. Parallel Processing Letters, 17:5–20, 03 2007.

[18] Doug Matzke. Will physical scalability sabotage performance gains? Computer, 30(9):37–39, 1997.

[19] Richard C Murphy, Kyle B Wheeler, Brian W Barrett, and James A Ang. Introducing the graph 500. Cray Users Group (CUG), 19:45–74, 2010.

[20] NVIDIA Corporation. nvGraph Library User's Guide, May 2019.

[21] OpenMP Architecture Review Board. OpenMP application program interface version 4.5, November 2015.

[22] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.

[23] Simon J. Pennycook, Jason D. Sewall, and Victor W. Lee. A metric for performance portability. CoRR, abs/1611.07409, 2016.

[24] Julian Shun and Guy E Blelloch. Ligra: a lightweight graph processing framework for shared memory. In ACM Sigplan Notices, volume 48, pages 135–146. ACM, 2013.

[25] John E Stone, David Gohara, and Guochun Shi. Opencl: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering, 12(3):66, 2010.

[26] Narayanan Sundaram, Nadathur Satish, Md Mostofa Ali Patwary, Subramanya R. Dulloor, Michael J. Anderson, Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. Graphmat: High performance graph analytics made productive. Proc. VLDB Endow., 8(11):1214–1225, July 2015.

[27] Merijn Verstraaten, Ana Lucia Varbanescu, and Cees de Laat. Using graph properties to speed-up gpu-based graph traversal: A model-driven approach, 2017.

[28] Yangzihao Wang, Yuechao Pan, Andrew Davidson, Yuduo Wu, Carl Yang, Leyuan Wang, Muhammad Osama, Chenshan Yuan, Weitang Liu, Andy T. Riffel, and John D. Owens. Gunrock: Gpu graph analytics. ACM Trans. Parallel Comput., 4(1):3:1–3:49, August 2017.


CHAPTER 8

Appendix A

Graph500 Set    Node Count    Edge Count
Graph500-12     4096          48k
Graph500-13     8192          101k
Graph500-14     16k           213k
Graph500-15     32k           441k
Graph500-16     65k           909k
Graph500-17     131k          1.8m
Graph500-18     262k          3.8m
Graph500-19     524k          7.7m
Graph500-20     1.04m         15.7m
Graph500-21     2.09m         31.7m
Graph500-22     4.19m         64.1m
Graph500-23     8.38m         129.3m

Table 8.1: The collection of Graph500 graphs used for the experiments presented in this thesis.


SNAP Set             Node Count    Edge Count
roadNet-CA [src]     1.96m         2.76m
roadNet-TX [src]     1.37m         1.92m
email-EuAll [src]    265k          420k
web-BerkStan [src]   685k          7.60m
web-Google [src]     875k          5.10m

Table 8.2: The collection of SNAP graphs [16] used for the experiments presented in this thesis. All these graphs are taken from a repository that collects real graphs.


CHAPTER 9

Appendix B

SNAP Graph       GunRock PageRank (ms)   GunRock BFS (ms)   Ours PageRank (ms)   Ours BFS (ms)
as-skitter       3.6572                  7.8263             168.0666             29.45
email-EuAll      3.6936                  1.5593             196.5                110.8866
roadNet-CA       2.8929                  40.9339            215.0333             3300.1933
roadNet-PA       1.5921                  34.9721            156.8666             3164.2633
roadNet-TX       1.9844                  48.1300            180.3666             4225.2166
web-BerkStan     1.7931                  36.8276            869.0333             1246.5666
web-Google       3.8482                  5.618              306.7999             180.0233

Table 9.1: PageRank and BFS execution times (in ms) for GunRock and for our implementation on the SNAP graphs.
