(1)

Bachelor Informatica

Performance Portability Evaluation of OpenACC for Graph Processing

Maarten van Keulen

June 23, 2019

Supervisor(s): Dr. Ana Lucia Varbanescu

Informatica
Universiteit van Amsterdam


Abstract

Graph processing is a new, challenging HPC field, where traditional algorithms clash with modern architectures [1]. Devising different code for each such architecture is not only challenging as a design process, but also expensive from the perspective of development, deployment, and maintenance. Programming models, like OpenACC, could offer a solution. However, their adoption remains limited in HPC-like fields, due to the expectation that they are not competitive, performance-wise.

This work answers the question of whether OpenACC is a good candidate for providing performance portability for graph processing workloads.

This is done by the empirical analysis of the acceleration of two case-studies representative for the field of graph processing: Breadth First Search and PageRank. Performance is evaluated against state-of-the-art, platform-specific implementations taken from the (CPU-based) GAP and (GPU-based) Gunrock benchmark suites. The performance portability metric [2] is calculated for both applications across the entire set of examined platforms and graphs using the measured application properties.

Results show that the assumption of non-competitive performance holds for OpenACC in the domain of graph processing, with a performance portability metric of 6.8% and 8.6% for BFS and PR, respectively. These results indicate that OpenACC is not competitive in the performance-critical domain of graph processing.


Contents

1 Introduction
  1.1 Research question and approach
  1.2 Thesis structure
2 Theoretical background
  2.1 Performance Portability
  2.2 OpenACC
  2.3 Graph processing
3 Design
  3.1 Graph data layout model
    3.1.1 Adjacency List
    3.1.2 Adjacency Matrix
    3.1.3 Compressed Sparse Row
    3.1.4 Selected layout
    3.1.5 Preprocessing
  3.2 Breadth First Search
    3.2.1 Top-down
    3.2.2 Top-down implementation
    3.2.3 Bottom-up
    3.2.4 Direction based
  3.3 PageRank
    3.3.1 PageRank implementation
4 Experimental setup
  4.1 Platforms
  4.2 Graph Selection
  4.3 Reference Application Selection
  4.4 Measurement methodologies
5 Performance analysis
  5.1 Breadth first search
  5.2 PageRank
  5.3 Application Efficiency
    5.3.1 GPU offload cost
    5.3.2 Performance portability metric results
6 Related Work
7 Conclusion and Future Work
  7.1 Main Findings


CHAPTER 1

Introduction

Graph processing has been a topic of great interest for a very long time [3, 4]. Graph models are more popular than ever, with applicability in fields ranging from neuroscience, where graphs can model functional connections between brain areas, to the World Wide Web, where graphs can represent the structure of web pages [5]. The rapidly growing size of such real-world graphs calls for scalable graph processing algorithms. For example, Facebook has 2.4 billion active users (source: https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/), each with their own distinct set of inter-personal connections. Because of its importance, and its unique set of challenges, graph processing is seen as one of the thirteen dwarfs in parallel programming [6].

It is not only the scale of graphs that challenges current algorithms. In fact, the structural properties of graphs are shown to have a significant performance impact for many workloads. When performance and/or scalability are mandatory, it is common practice to use different approaches for different graph classes [7, 8].

To tackle both graph scale and diversity, researchers proposed many algorithmic approaches, embedding additional optimisations such as specialised graph data layouts for improved memory access patterns, and efficient workload mapping for effective use of parallelism [5]. However, the implementation of these solutions is often based on low-level programming models and platform-specific constructs. Given the diversity of modern platforms, as well as their increased complexity and parallelism [6], this level of specialisation is not sustainable: porting multiple algorithms and optimisations to multiple platforms becomes a major challenge, as it demands careful redesign and reengineering efforts to – once again – provide an algorithm that efficiently exploits the features of each new platform. Moreover, when each platform demands its specialised code base, the productivity of the developers working on maintaining these different versions becomes a significant challenge in itself. As such, extending the set of platforms to include the newest developments is prohibitively expensive and time-consuming. Consequently, many graph processing algorithms and frameworks remain platform-specific.

To tackle this explosion of platform-specific code bases, several portable programming models have been created. OpenMP, OpenCL, and OpenACC are only a few examples of different ages (20+ years, 10+ years, and 5+ years, respectively) [9, 10, 11]. Such programming models enable functional portability of a single code base across multiple platforms. Depending on the model, these platforms span less or more diverse platform classes (e.g., OpenMP was originally designed for shared-memory homogeneous platforms, but the standard has recently been extended – since version 4.5 – to support the offloading model specific to accelerators).

Portable programming models often come with their own limitations and performance penalties (see Chapter 6). They are, therefore, rarely encountered in performance-hungry domains like graph processing. This perception can change if, compared to more specialised solutions, these portable programming models deliver codes that (1) show limited performance penalties, and (2) provide performance portability, i.e., perform (equally) well across a set of platforms.

A relatively new portable programming model is OpenACC [11], which works by providing platform-agnostic directives to guide the parallelisation effort of the compiler. OpenACC is the highest-level portable programming model designed to support accelerators, and, as such, it is equally applicable to CPUs and GPUs. Moreover, to the best of my knowledge, the performance portability of OpenACC for graph processing has seen very limited research. For both these reasons, OpenACC is a promising candidate for a thorough performance portability evaluation within the broader scope of a feasibility analysis of performance portable programming models for graph processing.

1.1 Research question and approach

In this work, I focus on the following research question:

Is OpenACC a good candidate for providing performance portability for graph processing work-loads?

I postulate that a good candidate for performance portability is a programming model which (1) can deliver a platform-agnostic implementation performing above a given threshold of the state-of-the-art performance (obtained by platform-specific implementations), and (2) can remain platform-agnostic and pose no hidden threats to productivity through unintuitive constructs. I set the portability threshold at 90%, in line with previous work [12].

Starting from this definition, I propose an empirical research method. Specifically, I focus on the empirical analysis of two case-studies representative for the field of graph processing: Breadth First Search and PageRank. For both algorithms, I design and implement several OpenACC versions, and examine their performance over a diverse set of graphs with distinct structural properties (for example, degree distribution and diameter). To quantify performance portability, I assess the achieved performance against recognised specialised implementations, which act as references. Besides performance, in an effort to assess threats to productivity, I also record and report any unintuitive OpenACC behaviour, as well as any language-specific limitations I encounter.

1.2 Thesis structure

This thesis is structured as follows. Chapter 2 provides the necessary theoretical background and terminology used in the rest of this thesis. Chapter 3 outlines the design choices made in the selection of the graph data structure and in the design of the algorithms. Chapter 4 outlines the experimental setup. Chapter 5 presents a detailed analysis of the performance results, from the specific angle of performance portability. Chapter 6 positions this work in the broader context of performance portability studies. Finally, Chapter 7 outlines the main findings of this research, provides an answer to the research question, and discusses promising directions for future work.



CHAPTER 2

Theoretical background

This chapter provides theoretical background on the major topics of this thesis. Specifically, I discuss performance portability and its corresponding metric (Section 2.1). Next, I provide an overview of the OpenACC programming model, with a brief analysis of its advantages and limitations (Section 2.2). Finally, I introduce the terminology required to discuss graph processing, outline some of the varied structural properties occurring in natural graphs, and discuss their possible implications for graph processing algorithms (Section 2.3).

2.1 Performance Portability

The definition of performance portability has been a topic of debate in the scientific community. For example, definitions such as ‘the ability of the same source code to run productively on a variety of different architectures’ [13] are common. However, such definitions are often ambiguous and difficult to quantify. To address this situation, Pennycook et al. [2] put forth a more encompassing definition of performance portability: ‘a measurement of an application’s performance efficiency for a given problem that can be executed correctly on all platforms in a given set’. This definition comes with a metric, named the Performance Portability Metric (PPM). PPM is defined as evaluating the performance portability for a set of platforms H of an application α solving problem p on platform i:

$$\mathrm{PPM}(\alpha, p, H) = \begin{cases} \dfrac{|H|}{\sum_{i \in H} \dfrac{1}{e_i(\alpha, p)}} & \text{if } i \text{ is supported } \forall i \in H \\ 0 & \text{otherwise} \end{cases} \qquad (2.1)$$

The performance efficiency metric e_i(α, p) can be either application efficiency or architectural efficiency. Application efficiency measures the performance in relation to the best known implementation for the given platform, while architectural efficiency is relative to the peak performance of the platform itself.

PPM has since gained traction in the scientific community [14, 15], paving the way for fairer comparisons in future work. This is why I have opted to use PPM as the metric for assessing the performance portability of the implementations presented in this work.
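To make the metric concrete, the following small C sketch computes PPM as the harmonic mean of the per-platform efficiencies of Equation 2.1. It is an illustration only: the function name and the example efficiency values are not taken from the thesis measurements.

#include <stdio.h>

/* Per-platform efficiencies e_i(α, p); a non-positive value marks an
 * unsupported platform, for which the metric is defined to be 0. */
static double ppm(const double *efficiency, int nrPlatforms) {
    double sum = 0.0;
    for (int i = 0; i < nrPlatforms; i++) {
        if (efficiency[i] <= 0.0) {
            return 0.0;
        }
        sum += 1.0 / efficiency[i];
    }
    return (double) nrPlatforms / sum;  /* harmonic mean over the platform set H */
}

int main(void) {
    double e[] = { 0.90, 0.45 };  /* hypothetical application efficiencies for two platforms */
    printf("PPM = %.2f\n", ppm(e, 2));
    return 0;
}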

2.2 OpenACC

OpenACC emerged in 2011, developed by CAPS, Nvidia, PGI, and Cray, as a programming model that uses high-level compiler directives (i.e., pragmas) to expose parallelism in the code, and parallelising compilers to build the code for a variety of parallel accelerators.


Directives and clauses

OpenACC directives make use of the #pragma mechanism, providing the compiler with information on how to parallelise a block/region of code. The general form of an OpenACC pragma is:

#pragma acc directive-name [clause-list] new-line

There are three distinct types of directives:

• Compute directives, which mark a block of code that should be accelerated by exploiting data parallelism through distributing work to multiple threads.

• Data management directives come in two forms, structured and unstructured, each with a different lifetime. A structured data directive applies to the (lexical) scope that follows it, for which the compiler will both manage any movement to the device and free any data upon leaving the scope. Unstructured data has no directly corresponding scope, and is particularly useful when working with C++ classes, allowing the creation and deletion of device data at any point in the code.

• Synchronisation directives enable explicit synchronisation of (some of) the concurrent tasks.

Compute directives are provided in two formats with distinct philosophies: fully compiler-driven and user-assisted. The kernels directive is fully compiler-driven: the compiler will decide how to parallelise and accelerate code. User-assisted directives, like the parallel directive, place the focus on the user to properly spell out what to parallelise and how.

Like directives, clauses can be categorised into three types (more details in [16, 17]):

• Data handling clauses assign certain behaviour to variables, overriding the compiler analysis. Clauses such as private and copy fall into this category.

• Work distribution clauses control the values for work distribution among the generated threads. Examples are auto, gang, and vector_length.

• Control flow clauses steer the parallel execution during run time, marking execution as conditional (e.g. if or if_present) or specifying task parallelism (e.g. async).
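As a minimal illustration of how directives and clauses combine (a generic example, not code from this thesis), the user-assisted parallel loop below uses data-handling clauses to control the movement of the arrays and a work-distribution clause to fix the vector length:

/* Generic example: a compute directive combined with data-handling
 * (copyin, copy) and work-distribution (vector_length) clauses. */
void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n]) vector_length(128)
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}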

Execution model

OpenACC ensures portability on multiple computing architectures by working with an abstract model for accelerated computing. This model exposes multiple levels of parallelism on a processor, as well as a hierarchy of memories with varying speeds. OpenACC adopts an offloading model from a host device to an accelerator device. The devices may be the same, or completely different architectures, with shared or disjoint memory spaces. Figure 2.1 shows a high-level schematic of OpenACC’s abstract accelerator model.

Figure 2.1: OpenACC’s abstract accelerator model (figure taken from the 2015 OpenACC programming guide [17]).


From the programmer perspective, OpenACC works with one code and one set of variables. The standard emphasises that every variable should be thought of as a single object, as opposed to two distinct copies of the object located on the host and device, as is commonly done in CUDA or OpenCL. This single-object view is considered more suitable for portability: it avoids a CUDA-specific practice, and should prevent cases in which generated code is inefficient for devices which share a single memory.

Parallelism

OpenACC operates on three levels of parallelism – gang, worker, and vector:

• A vector has the finest granularity, with an individual instruction operating on multiple pieces of data.

• A worker falls between a vector and gang, operating on a group of vectors of a certain size.

• A gang is coarse-grained, consisting of one or more workers. Gangs work independently of one another, and may not synchronise.

Figure 2.2 provides a schematic representation of these layers of parallelism.

Shared memory is exposed within a gang, which in turn can be used by all workers and vectors part of that gang. It is up to the programmer whether this mapping is handled explicitly.

Figure 2.2: OpenACC levels of parallelism.

The abstraction of the levels of parallelism, in combination with the accelerator model outlined in Figure 2.1, makes OpenACC highly portable and allows the same OpenACC code to be mapped to many types of target devices. This abstraction can come at a cost: the mapping to platform-specific concepts such as CUDA thread blocks (gangs), warps (workers), and threads (vectors) is deliberately loose in order to enhance portability, and as such the compiler output needs to be inspected to validate the expected translation.

Atomics

OpenACC supports atomic operations such as write and read, but it is limited in the combination of such operations. For example, a common atomic operation is compare and swap (CAS), which compares a value at a certain address against an immediate value, and writes a new value to the address only if the compared values are equal. An operation like CAS is not supported by OpenACC because a comparison of two values is not guaranteed to be mutually atomic with a write operation.

The lack of such operations can definitely hinder the usability of OpenACC. Thus, a workaround is currently recommended: the standard suggests programmers make use of programming model interoperability (e.g. combining OpenACC with CUDA) and use native operations where OpenACC falls short. Of course, such solutions come at the cost of the simplicity and portability of the OpenACC programming model itself, and hinder portability of functionality and/or performance.
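For reference, the sketch below illustrates the kind of atomic operations OpenACC does provide (read, write, update, and capture); the variables are illustrative. Note that even the capture form only pairs a read with a write: any decision taken on the old value still happens outside the atomic region, which is why CAS-style patterns cannot be expressed.

int old;

/* Atomically increment a shared counter. */
#pragma acc atomic update
counter = counter + 1;

/* Atomically read the old value and write the new one (capture), e.g. for a
 * visited flag; a comparison based on old is not atomic with the write. */
#pragma acc atomic capture
{
    old = visited[v];
    visited[v] = 1;
}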

In summary, OpenACC is designed and built as a high-productivity, high-level programming model, where a single code base, decorated by the programmer with directives for parallelism and (efficient) data movement, is parallelised and executed on multi-/many-core accelerated systems. The key challenges in writing efficient, high-performance OpenACC code are identifying the correct placement for directives, knowing how to use (and when to avoid) lower-level constructs, optimising those directives and constructs, correctly interpreting compiler feedback, and assimilating that feedback into code changes in order to improve performance.

2.3 Graph processing

Graphs are mathematical structures which consist of a set of vertices (also referred to as nodes), of which certain pairs are connected by edges: G = (V, E). These structures are ubiquitous, able to represent relationships such as those between people (social networks), computers (World Wide Web) and roads (road networks) [5].

A graph is either undirected, where edge connections (u, v) are symmetric, i.e., they can be traversed from vertex u to v or from v to u, or directed, where a connection from vertex u to v does not imply that the reverse connection also exists. The size of a graph G = (V, E) is determined by |E|, while its order is determined by |V|. However, these properties alone are poor indicators for the structural properties of a graph.

Figure 2.3: Two similar yet distinct graphs

Figure 2.3 illustrates this: the number of vertices and edges is equal for graphs A and B; however, their topology is very different.

Graph processing is, broadly speaking, an algorithmic way to extract useful information from a graph – for example, calculating the shortest path from one vertex to another can be performed using a BFS algorithm. The approach to (parallel) graph processing is not only influenced by the desired information, but often also by the structural characteristics of the given graph. For example, workload balancing is often difficult because of the variance in vertex connectivity, and irregular memory access patterns (due to “jumping” from node to node) make it difficult to effectively use caching. Because of this, graph processing algorithms show large variability in performance, and therefore must always be evaluated on a diverse set of graphs (see Section 4.2). Diversity is achieved by selecting graphs with different structural properties (examples follow). In turn, this diversity exposes the impact these properties (may) have on the performance of the target graph processing application.

In an undirected graph, a vertex’s degree is the number of edges adjacent to that vertex. When dealing with directed graphs, it is common to specify a vertex’s out-degree and in-degree explicitly. If no such specification is made, degree refers to the number of outgoing edges. The degree of a graph is the average out-degree of the vertices in the graph.

The diameter of a graph refers to the longest shortest path between any two vertices in the graph. A connected component of a graph is a maximal subgraph in which every pair of vertices is connected by a path. When referring to multiple connected components of a graph, each vertex and edge belongs to exactly one component. If a graph is directed, connected components are further divided into weakly connected components (WCC) and strongly connected components (SCC). Weak refers to a directed connection from either vertex v to u or from u to v, and strong requires directed connections both from v to u and from u to v. The components are groups where the weak or strong connection holds for all pairs of vertices u and v.

Many real-world graphs, such as those taken from social networks, tend to be both small-world and scale-free. A small-world graph’s diameter grows proportionally to the logarithm of the number of vertices, which results in a low diameter [18]. This is caused by the existence of large groups of connected components, which inter-connect parts of the graph that would otherwise be distant. A scale-free graph’s degree distribution follows a power law, which in turn results in a few vertices having a very high degree [19].

Because of the skewed distributions in scale-free and small-world networks, reducing the properties of graphs to single numbers is often not useful for characterising graphs and, by extension, graph processing behaviour. For example, the average degree is not a good indicator for tuning load balance on a scale-free graph, because there will be vertices with degrees orders of magnitude higher than the average. Alternative properties such as the 90-percentile effective diameter (that is, the minimal distance such that 90% of vertex pairs are at most at that distance from each other) provide more insight, as these partly ignore the large disparities that a power-law distribution brings.

Note that there are different approaches to graph processing, with the most common being the vertices-and-edges approach and the matrix approach. While the former focuses on exploiting the connection between nodes (i.e., intuitively, it always attempts to traverse the graph), the latter focuses on processing graphs using algebraic operations on adjacency matrices. Although great strides have been made in recent years in the field of graph algebra [20, 21, 22, 23], deriving the matrix-based formulations for many graph algorithms remains a non-intuitive and cumbersome process. Thus, the scope of this work has been limited to reasoning on graphs from a perspective of vertices and edges.


CHAPTER 3

Design

In this chapter, I provide an overview of the design choices for the graph data layout, and for the multiple approaches to implement Breadth First Search and PageRank in OpenACC.

3.1 Graph data layout model

In order to select an appropriate graph data layout model, I evaluate three commonly used layouts. The main criteria for an appropriate data format are compactness and layout regularity. Compactness minimises PCIe bandwidth consumption, and regularity enables efficient memory accesses, maximising parallelism. A third, and final, criterion is the simplicity of operations, which often comes as a trade-off with the compactness of the data format.

Figure 3.1: A directed graph used to demonstrate varied data layouts. Edge weights of edge i are represented as wi.

3.1.1 Adjacency List

An adjacency list (see example in Figure 3.2) is a collection of lists, where each list represents the outgoing neighbours of a vertex. Tracking edge weights requires a single additional value per edge. This format is highly compact, and can easily be built dynamically using either linked lists or std::vector. However, the data is not contiguous, resulting in irregular memory access.


Figure 3.2: The adjacency list representation of the graph in Figure 3.1.
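A minimal C sketch of such a layout (illustrative, not the thesis code) could look as follows; each vertex owns a heap-allocated neighbour array, which is what makes the format easy to grow but non-contiguous:

struct adjList {
    int  nrNeighbours;  /* out-degree of this vertex                     */
    int *neighbours;    /* destination vertex ids, allocated per vertex  */
    int *weights;       /* optional edge weights, same length (or NULL)  */
};

struct adjList *vertices;  /* one entry per vertex; the lists are scattered in memory */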

3.1.2 Adjacency Matrix

An adjacency matrix M of a graph with N vertices is a square N×N matrix, where an edge from vertex Vi to vertex Vj is indicated by a non-zero value at index Mi,j. As can be seen for the example graph (see Figure 3.1 and Table 3.1), an adjacency matrix is typically sparse, and thus most of its values are zeroes. An adjacency matrix does have the advantage of being a contiguous data format; it further offers the possibility to benefit from highly specialised matrix operations. However, given the abundance of zero values, the data layout is far from compact.

      V0  V1  V2  V3
V0     0   1   1   0
V1     0   0   1   0
V2     0   0   0   0
V3     0   0   1   0

Table 3.1: The adjacency matrix representation of the graph in Figure 3.1.

3.1.3 Compressed Sparse Row

Compressed sparse row (CSR) is a compact format for storing sparse matrices with the added benefit of allowing for regular memory accesses. CSR makes use of three one-dimensional arrays: offset, edges, and weights.

Figure 3.3: The CSR representation of the graph in Figure 3.1.

The offset array is indexed by vertex-id and has as value the corresponding starting point (offset) in the edges array. The number of outgoing or incoming edges of a vertex n can be calculated as offset[n + 1] − offset[n]. To avoid the edge case with this pattern, the offset array is padded with an additional offset, indicated by - in Figure 3.3. Each value of the edges array corresponds to a destination vertex of an edge, which in turn functions as an index for the offset array. A third array, weights, is optionally used and follows the same relation as the offset and edges arrays, except that its values hold edge weights.
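As a small illustration of how the arrays interact (variable names assumed, not taken verbatim from the thesis code), iterating over the outgoing neighbourhood of a vertex v in plain CSR only touches a contiguous slice of the edges array:

/* offset has |V| + 1 entries (the padding mentioned above),
 * edges has |E| entries, and weights is optional. */
for (int i = offset[v]; i < offset[v + 1]; i++) {
    int neighbour = edges[i];               /* destination vertex id        */
    int weight = weights ? weights[i] : 1;  /* 1 if the graph is unweighted */
    /* process edge (v, neighbour) with the given weight */
}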

3.1.4 Selected layout

An important, but not yet addressed, aspect of graph data layouts is the ability to access both the outgoing and incoming neighbourhood of a vertex. In the example graph, each vertex has an incoming and/or an outgoing neighbourhood, e.g. vertex 0 has an outgoing neighbourhood of {1, 2} and vertex 2 has an incoming neighbourhood of {0, 1, 3}. Of the discussed data layouts, only the adjacency matrix offers access to both neighbourhoods in a timely fashion. Accessing the outgoing neighbourhood of vertex Vi follows Vi,0, Vi,1, ..., Vi,n−1 for each non-zero value, and the incoming neighbourhood is the inverted lookup, so for vertex Vj it follows V0,j, V1,j, ..., Vn−1,j. However, the adjacency matrix advantages do not compensate for its lack of compactness. With graphs reaching vertex counts in the millions, working with an n × n data structure is simply not feasible. Let me use the California Road Network graph (part of the selected graphs in Section 4.2) to illustrate. The graph has n = |V| = 1,965,206 vertices and m = |E| = 2,766,607 edges. A CSR layout using a b-byte index datatype requires (|V| + 2|E|) · b bytes. For the California Road Network, a CSR format would require 28.6 MB of storage for 32-bit integers, whereas an adjacency matrix would require 14.1 TB of storage.

CSR is clearly preferred for its compactness; therefore, enhancing CSR with a mechanism which makes both the incoming and outgoing neighbourhood available will provide a good space-time trade-off between access speed and storage space. Thus, I have opted to design my data layout as a CSR variant.

struct node {
    int edgeOffset;
    int nrEdges;
};

struct node *nodes;
int *edges;
int *weights;

Code 1: Node structure used in the CSR variant.

Figure 3.4: The example compressed sparse row variant representation of the graph in Figure 3.1.

The CSR variant shown in Figure 3.4 and Code 1 makes use of an expanded offset array consisting of node structures, each holding the regular offset value as well as the number of edges corresponding to the specific vertex. This approach trades the computation of the number of edges for each vertex (offset[n + 1] − offset[n]) for a slight decrease in the compactness of the data structure. Specifically, the new data structure results in a layout size of (2|V| + 2|E|) · b bytes, which amounts to a 26% increase, compared to the regular CSR, when applied to the California Road Network graph. When working with a directed graph, a CSR structure is created for both the outgoing and incoming connections. The latter is also referred to as the inverse graph or reverse lookup graph. If the graph processing application does not require either the outgoing or incoming structure (e.g., top-down BFS does not require an incoming neighbourhood), that structure is simply not instantiated, speeding up the graph creation process.

3.1.5 Preprocessing

The selected graphs are available as plain text files containing an edge list, with edges of the form (V_from, V_to[, V_weight]) separated by newlines. These files are preprocessed once, removing duplicate edges and self loops, and converting the graph to the selected CSR format (see Section 3.1.4). In the case of a directed graph, both an outgoing and an incoming CSR variant are created. In order not to repeat this process for each benchmark, the preprocessed CSRs are written to files in a convenient format for initialising the CSR layout. These files are given the extension .rod, a tribute to the Rodinia benchmark suite [24], where I first encountered this structure. The .rod files can swiftly be read into memory, speeding up the benchmarking process. The process of reading the preprocessed .rod files into memory (where they are stored in CSR format) is referred to as ingestion time, and is excluded from any reported benchmarks unless explicitly mentioned.

3.2 Breadth First Search

Breadth First Search (BFS) is one of the most important graph traversal algorithms, allowing, among other things, the computation of shortest paths from a single source (for unweighted graphs) or testing the connectivity of a graph. A BFS traversal starts from a selected vertex, referred to as the source of the traversal, and visits all vertices in the order of their relative depth from the selected source vertex. The action for each visited vertex varies, ranging from updating its traversal cost to setting a simple boolean flag indicating whether the vertex has been visited or not.

Figure 3.5: Example BFS traversal order for source vertex zero.

The traversal of each level, highlighted in Figure 3.5 by the dotted lines, can be seen as the expansion of a frontier indicated by the light-shaded vertices. At the start of the traversal, only the source vertex is part of the frontier. After the first iteration, the frontier “moves” to the unvisited neighbours of the current frontier. In the example graph, these are vertices 1 and 2. This process is repeated until there are no vertices left to add to the frontier. It is common practice to assign a negative value to vertices which are not connected to the source vertex and are thus not reachable in the BFS traversal.

There are two distinct algorithmic approaches to the expansion of the frontier: top-down or bottom-up. These are described in the following sections.

3.2.1 Top-down

In a top-down BFS, vertices in the active frontier search for an unvisited child. Initialisation starts with two collections, the current frontier and the frontier for the next iteration. A third array is used for the result of the algorithm, in this case parents, which stores the direct parent of each vertex in relation to the source vertex.

Listing 3.1: BFS initialisation
1  frontier ← {source}
2  nextFrontier ← {}
3  parents ← [-1, -1, ..., -1]
4
5  while frontier ≠ {}
6      topDownStep(frontier, nextFrontier, parents)
7      frontier ← nextFrontier
8      nextFrontier ← {}
9  end while


At the end of each iteration, the next frontier becomes the current frontier, and then the next frontier is cleared.

Listing 3.2: BFS top-down step
1  for v ∈ frontier do
2      for n ∈ neighbourhood(v) do
3          if parents[n] = -1 then
4              parents[n] ← v
5              nextFrontier ← nextFrontier ∪ {n}
6          end if
7      end for
8  end for

For each top-down step, as displayed in Listing 3.2, the workload is primarily determined by the number of vertices in the frontier, each checking for unvisited outgoing neighbours by iterating over all their outgoing edges. This approach lends itself to parallelism, apart from lines 4 and 5, where a vertex is selected for the expansion of the frontier. Locks or alternative structures are required to prevent duplicate writes to the next frontier and parents array.

3.2.2 Top-down implementation

NVIDIA suggests five steps for accelerating applications using OpenACC (under the marketing slogan “2x in 5 steps” [25], further described in a PGInsider article [26]): evaluation, setup code, compute regions, adding data regions and optimising data transfers, and loop-schedule tuning. In this section I go over each step in order to illustrate the decision process behind the design of the top-down BFS implementation.

Evaluation

The algorithmic evaluation of top-down BFS (Section 3.2.1) reveals that the primary point of parallelism would be the processing of each vertex part of the frontier (Listing 3.2 line 1). Another point would be to explore the outgoing neighbourhood of each vertex part of the frontier (Listing 3.2 line 2). A few further observations can be made:

• To produce a correct BFS tree, when evaluating vertices with multiple ancestors of the same depth, it does not matter which specific parent is selected.

Because of this, race conditions between setting parents of an equal level are not important, as each parent is a valid option. For example, it does not matter whether a vertex reachable from both vertex 1 and vertex 2 ends up with vertex 1 or vertex 2 as its parent.

• The expansion of the frontier can be done in parallel, provided the data structure has no race conditions such as duplicate writes to shared indices.

• The frontiers can be switched only after the frontier has been entirely processed.

These observations, in combination with the algorithmic analysis made in Section 3.2.1, are sufficient to move from pseudocode to code.


Setup code

For this implementation the parents array is replaced by a cost array, which stores the depth of each vertex in relation to the source vertex; this has no further impact and is simply another possible output of a BFS traversal.

A data structure is required for the frontier and nextFrontier. I have chosen to work with byte arrays of size |V|, containing a flag for each vertex indicating whether it is part of the corresponding frontier or not. Bitmaps would be a more compact alternative; however, these would require a compare-and-swap operation to set bits in an atomic fashion. Because OpenACC does not support such an operation, as explained in Section 2.2, bitmaps were dropped. Another alternative is a sliding queue, which has the advantage of iterating over exactly the current frontier, but this structure also requires atomic CAS and was therefore deemed unsuitable. For all following code, the pointer g points to the graph structure outlined in Section 3.1.4.

1 for (int i = 0; i < g->nrNodes; i++) {
2     frontier[i] = 0;
3     nextFrontier[i] = 0;
4     cost[i] = -1;
5 }
6
7 frontier[source] = 1;
8 cost[source] = 0;

Code 2: Initialisation of the top-down BFS implementation.

Initialisation is straightforward and follows Listing 3.1. Both frontiers are initialised with zeroes, and the default cost for each vertex is set to −1. The frontier starts with the source vertex, and the cost of the source vertex is 0 by definition. Let us move on to the core of the traversal.


1  int cnt;
2  do {
3      cnt = 0;
4
5      // Primary loop, process the frontier
6      for (int nID = 0; nID < g->nrNodes; nID++) {
7          if (frontier[nID]) {
8              int edgeOffsetEnd = g->outNodes[nID].edgeOffset
9                                  + g->outNodes[nID].nrEdges;
10             frontier[nID] = 0;
11             // Iterate over the vertices in the outgoing neighbourhood
12             for (int i = g->outNodes[nID].edgeOffset; i < edgeOffsetEnd; i++) {
13                 int outgoingNID = g->outEdges[i];
14                 if (cost[outgoingNID] == -1) {
15                     cost[outgoingNID] = cost[nID] + 1;
16                     nextFrontier[outgoingNID] = 1;
17                 }
18             }
19         }
20     }
21
22     // Secondary loop, clearing and swapping frontiers
23     for (int nID = 0; nID < g->nrNodes; nID++) {
24         if (nextFrontier[nID]) {
25             nextFrontier[nID] = 0;
26             frontier[nID] = 1;
27             // There are still vertices left in the frontier, so I continue
28             cnt = 1;
29         }
30     }
31 } while (cnt);

Code 3: Serial source code of the top-down BFS implementation.

Code 3 shows the plain source code without any OpenACC acceleration. The outer do-while loop (line 2) runs until there are no vertices left in the frontier. This is tracked by the variable cnt (continue), which is cleared at the start of each iteration and set whenever the next frontier is non-empty. For each iteration of the outer loop, the entire frontier is processed. The frontier is processed in the primary loop (line 6), iterating over each vertex in the frontier. For each of these vertices belonging to the frontier, a nested loop is run over the vertices in their outgoing neighbourhood (line 12). If an unvisited vertex is encountered, its cost is updated and its ID is added to the nextFrontier (lines 15 and 16).

The secondary loop on line 23 swaps the frontiers, clearing the next frontier for reuse in the process. If the next frontier contains any vertices, the cnt flag is set so that the new frontier is indeed processed. All in all, the code is very similar to the algorithmic outline given in Listing 3.2, with some added constructs to comply with the chosen data structures.

Compute Regions

Loops form a good starting point to accelerate the code, complying with the principle of making the common case fast. There is one loop I can immediately disqualify, which is the do-while loop, because between each depth of the BFS traversal the frontiers need to be synchronised, otherwise vertices of varied depth will be processed simultaneously. As such, there are four regions that qualify for acceleration:

1. The initialisation loop, see Code 2 line 1.
2. The primary loop, processing the frontier, see Code 3 line 6.
3. The loop over the outgoing neighbourhood, see Code 3 line 12.
4. The secondary loop, swapping the frontiers, see Code 3 line 23.

A logical starting point is using the directive #pragma acc kernels, which leaves full freedom to the compiler on how to approach acceleration. The more freedom given to the compiler, the more likely it is to make smart decisions on all platforms, which in theory should be ideal for portability. Sadly, this approach does not work, because the compiler cannot determine that the contents of the loops are independent and as such executes the regions sequentially to guarantee correct output. By replacing the directives with #pragma acc parallel loop, the programmer assures the compiler that the regions can, in fact, be executed in parallel. Any further logic is still handled by the compiler, such as how the work should be distributed in terms of gangs, workers and vectors.

One more issue remains: OpenACC does not support nesting pragmas in this fashion. As such, the directive on the loop over the outgoing neighbourhood is illegal, and is removed for now.

There is only need for a single cnt variable, but the secondary loop is processed in parallel. Therefore each thread has access to its own private instance of cnt, and these instances need to be reduced back to a single cnt value. A logical method is to reduce all instances of cnt using the boolean or operation, allowing for a short circuit on the first true instance of the variable. This translates to the clause reduction(||:cnt), following the pattern reduction(operator : variable).

This leaves us with the following configuration, which I will refer to as the base version:

1. #pragma acc parallel loop, Code 2 line 1.
2. #pragma acc parallel loop, Code 3 line 6.
3. pass – leave it to the compiler, Code 3 line 12.
4. #pragma acc parallel loop reduction(||:cnt), Code 3 line 23.

This configuration provides acceleration with a high degree of freedom for the compiler. The compiler determined that the outgoing neighbourhood loop (line 12) should be run sequentially. However, due to the high variation in degrees across the examined graphs, this could be a decent start.

Adding data regions and optimising data transfers

Next, it is important to ensure there is no unwarranted data movement from and to the host during the BFS traversal, as this can be costly. With the current configuration, data moves between host and device for each kernel launch. This is particularly troublesome as it is repeated twice for each depth of the BFS traversal, once for the primary loop and once for the secondary loop.

Previous exploration of the algorithm already revealed that BFS takes a graph as input and returns a BFS tree in some form, such as a cost or parent array, as output. I can guarantee this behaviour by wrapping the entirety of the code seen in Code 3 in a block accompanied by a data management directive. Because I am working with straightforward data structures in the form of arrays, a structured data management directive will suffice.


1 #pragma acc data \
2     copyin(g[0:1]) \
3     copyin(g->outNodes[0:g->nrNodes], g->outEdges[0:g->nrEdges]) \
4     create(frontier[0:g->nrNodes], nextFrontier[0:g->nrNodes]) \
5     copyout(cost[0:g->nrNodes])
6 {
7     BFS code
8 } /* End of OpenACC data region */

Code 4: Data directive wrapping the BFS algorithm.

Code 4 shows the structured data region wrapping the BFS code. A point of note is that it is important to copy over the graph structure itself before any of its members (line 2), since order is relevant. Copyin checks whether the variable is already present on the device and, if not, copies the data over, but it does not copy the variable back to the host. Similarly, create only allocates the variable on the device if it is not already present; no data movement between host and device is involved. Copyout allocates memory on the device and only copies the results back to the host. NVIDIA provides a performance profiling tool that gives developers feedback for optimising code [27]. I can use this tool to inspect whether the data movement is indeed as I expect it to be: transfers in for the graph structure, transfers out for the cost array.

Figure 3.6: NVIDIA profiler output for BFS compiled for the GPU using the graph g500-18 as input.

Figure 3.6 shows the profiler output in the form of a timeline. The two vertical bars indicate the slice of the timeline during which all the computations of the algorithm itself are performed. The highlighted horizontal bar, MemCpy (HtoD), shows all data movement from the host to the device (HtoD). On first inspection, the data movement from HtoD is as I expect, occurring only prior to the algorithm execution. However, further inspection shows that there is data movement from the device to the host (DtoH) during the algorithm, highlighted by the arrows. These transfers correspond to the result of the cnt reduction being sent back to the host. At first this seems undesirable; however, because OpenACC does not permit programmer-driven synchronisation between gangs [16], the way to implement a barrier is by launching separate kernels, which in this case means launching a kernel for the exploration of each frontier. Because the host is in control of launching kernels, the transfer of the cnt result back to the host is required to determine whether the frontier is non-empty and another kernel should be launched.

Figure 3.6 further shows a final transfer of the cnt result after the right-hand vertical bar, which must evaluate to false: the algorithm stops, no more kernels are launched, and a final DtoH transfer block is shown (indicated by the letter A), which corresponds to the transfer of the result array. All things considered, the data movement is close to optimal. It might be possible to asynchronously move data during some of the computation, such as moving some of the graph data during the initialisation loop. However, in order to ensure accurate measurements of the execution times, the data directive is wrapped with timers accompanied by acc_wait_all() calls, ensuring no asynchronous activities cross the timing boundaries.


The NVIDIA profiler proves to be a powerful analysis tool, showing more points of interest such as the time spent executing each kernel. Furthermore, it shows that for the chosen source vertex and graph, the majority of the time is spent on a single kernel. This kernel processes the third level of depth of the BFS tree, where the frontier is at its largest, which is characteristic for a small-world graph [7]. The profiler also reveals that the majority of time (95%+) is spent running the kernels corresponding to the primary loop. Therefore this loop should form the focus of most optimisation efforts.

Loop schedule tuning

By splitting the primary loop into two distinct levels of parallelism, gang and vector, I expect to improve the performance compared to the base version, where the nested loop over the outgoing neighbourhood is executed sequentially. This alteration looks as follows:

1  #pragma acc parallel num_gangs(NUM_GANGS) vector_length(VECTOR_LENGTH)
2  {
3      // Primary loop, process the frontier
4      #pragma acc loop gang
5      for (int nID = 0; nID < g->nrNodes; nID++) {
6          if (frontier[nID]) {
7              int edgeOffsetEnd = g->outNodes[nID].edgeOffset
8                                  + g->outNodes[nID].nrEdges;
9              frontier[nID] = 0;
10             // Iterate over the vertices in the outgoing neighbourhood
11             #pragma acc loop vector
12             for (int i = g->outNodes[nID].edgeOffset; i < edgeOffsetEnd; i++) {
13                 int outgoingNID = g->outEdges[i];
14                 if (cost[outgoingNID] == -1) {
15                     cost[outgoingNID] = cost[nID] + 1;
16                     nextFrontier[outgoingNID] = 1;
17                 }
18             }
19         }
20     }
21 }

Code 5: Primary loop of the BFS algorithm for two levels of parallelism, gang and vector.

The optimal configuration of the number of gangs and the vector length is dependent on the platform, the graph, and the characteristics of the kernel itself [1, 11]. Not specifying the variables NUM_GANGS and VECTOR_LENGTH allows the compiler to select intelligent defaults for the widest range of platforms [17].

Whereas this strategy is sensible from a portability perspective, I can possibly do better given the information at hand. A valid strategy for determining the best values for the number of gangs and the vector length is auto-tuning. Auto-tuning amounts to performing an intelligent or exhaustive search over certain variables, in this case the number of gangs and the vector length. Because the GPU platform is of NVIDIA architecture, an exhaustive search with steps of 32 for both gangs and threads makes sense: thread blocks are always created in warp-sized units, so there is little point in trying to create a thread block of a size that is not a multiple of 32 threads [28]. Making use of the auto-tuning framework provided by OptACC [29], I have sampled multiple graphs of the G500 dataset, performing an exhaustive grid search with a step size of 32 for the parameters NUM_GANGS and VECTOR_LENGTH. Whereas the results were extremely varied for the number of gangs, a vector length of 32 was selected frequently. This could be related to the average degree of around 30 for the G500 graphs. Because of this I have opted to leave the number of gangs to the compiler, but to set the vector length to 32.
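A minimal sketch of such an exhaustive search is shown below; timed_bfs_run() is a hypothetical helper (it is not part of the thesis code or of OptACC) that runs one traversal with the given vector length and returns its execution time in seconds:

/* Exhaustive grid search over warp-multiple vector lengths. */
int bestVectorLength = 32;
double bestTime = 1.0e30;  /* effectively infinity */

for (int vec = 32; vec <= 1024; vec += 32) {
    double t = timed_bfs_run(g, source, vec);  /* hypothetical helper */
    if (t < bestTime) {
        bestTime = t;
        bestVectorLength = vec;
    }
}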


This brings us to the following configuration, which I will refer to as the optimised version:

1. #pragma acc parallel loop, Code 2 line 1.
2. #pragma acc parallel vector_length(32) with #pragma acc loop gang on the primary loop, Code 3 line 6.
3. #pragma acc loop vector on the neighbourhood loop, Code 3 line 12.
4. #pragma acc parallel loop reduction(||:cnt), Code 3 line 23.

In order to illustrate the performance differences between the base version and the optimised version, I ran both versions for the G500 dataset.

Figure 3.7: Speedup of the optimised version over the base version (both described in Section 3.2.2) for the G500 graphs (graph500-12 through graph500-21), on both the CPU and the GPU.

Figure 3.7 shows the performance of the optimised version relative to the base version. A few observations can be made:

1. The CPU performance is not impacted by the optimised construction.

2. The GPU performance is positively impacted across all sizes of the examined G500 graphs, showing a speedup peaking at 6 times that of the base version for graph500-18.

All in all, it can be said that the optimised approach is indeed more performant than the base version.

3.2.3 Bottom-up

The second BFS algorithmic approach is not as intuitive as the first. Rather than working “down” from the vertices in the frontier, one works upward from all unvisited vertices and checks whether they have an incoming connection to one of the vertices in the frontier. Hence the name bottom-up.


Listing 3.3: BFS bottom-up step
1   for v ∈ vertices do
2       if parents[v] = -1 then
3           for n ∈ neighbourhood(v) do
4               if n ∈ frontier then
5                   parents[v] ← n
6                   nextFrontier ← nextFrontier ∪ {v}
7               end if
8           end for
9       end if
10  end for

Listing 3.3 shows such a bottom-up step. The bottom-up approach has the disadvantage of iterating over more vertices in the top level loop (line 1) when compared to iterating over the frontier, especially when the frontier is a small fraction of the total number of vertices. However, the inner loop (line 3) can be ended as soon as a vertex is found which is part of the frontier. Another advantage is that no atomic operations are required, because the only update is to the child itself (line 5), and it is performed without any contention (i.e., the node itself performs its own updates, sequentially). However, this still leaves vast parallelism between vertices.
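For completeness, a bottom-up step expressed against the CSR variant of Section 3.1.4 and the cost array of the top-down implementation could look roughly as follows. This is a sketch of how the step fits the existing data structures, not the exact code used for the measurements:

/* Bottom-up step (sketch): every unvisited vertex scans its incoming
 * neighbourhood and stops at the first parent found in the current frontier. */
for (int nID = 0; nID < g->nrNodes; nID++) {
    if (cost[nID] == -1) {
        int edgeOffsetEnd = g->inNodes[nID].edgeOffset + g->inNodes[nID].nrEdges;
        for (int i = g->inNodes[nID].edgeOffset; i < edgeOffsetEnd; i++) {
            int incomingNID = g->inEdges[i];
            if (frontier[incomingNID]) {
                cost[nID] = cost[incomingNID] + 1;  /* only this vertex writes its own entry */
                nextFrontier[nID] = 1;
                break;                              /* the first frontier parent suffices */
            }
        }
    }
}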

3.2.4 Direction based

The top-down and bottom-up approaches outperform each other at different points of the graph traversal. The work per algorithm step is roughly equivalent to the number of edges evaluated. The top-down approach evaluates each edge in the frontier, and the bottom-up approach evaluates – at most – every edge connected to an unvisited vertex, but likely fewer. Beamer et al. have shown that for graphs such as social networks, with a relatively low diameter, and where the degree between vertices can differ by several orders of magnitude, combining the two algorithms in order to switch, at runtime, between the bottom-up and top-down approaches can lead to a significant performance increase [7].

This is because in such scale-free graphs, the frontier size ramps up and down exponentially, due to the skewed degree distribution. For the iterations where the frontier is at its largest, the bottom-up approach is at its best, because many unvisited vertices have neighbourhoods containing vertices which are already part of the frontier, resulting in many successful visits. For these same iterations the top-down approach is at its worst, as many of the vertices in the frontier iterate over neighbourhoods containing vertices which have already been visited.

The runtime switching between top-down and bottom-up is based on heuristics over the number of edges in the frontier (e_f), the number of vertices in the frontier (n_f), and the number of edges to check from unvisited vertices (e_u). A top-down step will always check e_f edges, whereas a bottom-up step will check at most e_u edges. As such, it makes sense to move from top-down to bottom-up when e_f grows larger than e_u, and vice versa. Because e_u is an upper bound on the number of edges checked during a bottom-up step, a tuning factor α is used, resulting in the following transition condition from top-down to bottom-up:

$$C_{tb} : e_f > \frac{e_u}{\alpha}$$

Another tuning factor, β, is used to switch back from bottom-up to top-down when the frontier is too small, resulting in:

$$C_{bt} : n_f < \frac{n}{\beta}$$

Both α and β are heuristically determined based on the penalties involved in switching between the two approaches. This direction-based method is successfully applied in graph benchmarking suites such as GAP [30].
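A sketch of the resulting control logic is given below; the counter variables and the concrete values of α and β are purely illustrative, and maintaining the edge counts during the traversal is not shown:

/* Illustrative values and names: edgesInFrontier = e_f,
 * edgesFromUnvisited = e_u, verticesInFrontier = n_f, g->nrNodes = n. */
#define ALPHA 14
#define BETA  24

int useBottomUp = 0;

if (!useBottomUp && edgesInFrontier > edgesFromUnvisited / ALPHA) {
    useBottomUp = 1;   /* C_tb: the frontier is large, switch to bottom-up       */
} else if (useBottomUp && verticesInFrontier < g->nrNodes / BETA) {
    useBottomUp = 0;   /* C_bt: the frontier has shrunk, switch back to top-down */
}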


3.3 PageRank

PageRank is a key algorithm used in web mining [31], first proposed by Google and named after Larry Page, one of its founders. The purpose of PageRank is to assign a measure of importance to vertices in a network based on the number of outgoing and incoming connections of each vertex. PageRank was originally applied to indexing the importance of web pages by counting the number of outgoing and incoming URLs. Intuitively, a vertex is assigned a high PageRank score if it has many incoming connections from other vertices with a high PageRank score.

Figure 3.8: PageRank scores of an example graph.

The output of the PageRank algorithm is, per vertex, the probability that a random traversal of the graph arrives at that particular vertex. A damping factor α (often set at 0.85) [31] is used to simulate the probability of moving from one vertex to another during the traversal. For the PageRank algorithm, vertex self-loops are ignored, and multiple outgoing connections from a vertex to the same destination are counted as one. These properties are handled during the preprocessing of the graph (Section 3.1.5).

PageRank is generally calculated using an iterative approach, referred to as the power iteration method [32]. Given a graph G = (V, E) with a vertex set V and an edge set E, let r denote a PageRank array of size |V|. I further define the set of incoming neighbours of vertex v as I_v, and the set of outgoing neighbours as O_v. Then the PageRank of vertex v, denoted r_v, can be iteratively computed by:

$$r_v^{(k+1)} = \frac{1 - \alpha}{|V|} + \alpha \sum_{x \in I_v} \frac{r_x^{(k)}}{|O_x|} \qquad (3.1)$$

where k denotes the k-th iteration. The PageRank array r is initialised assuming an equal distribution among all vertices in the graph, r = [1, 1, ..., 1] · (1/|V|). As each iteration of the power method requires an evaluation of all vertices in the graph, the amount of computation per iteration is equal. The process of iterating stops after either a preset number of iterations has been performed, or the PageRank of the vertices converges between iterations; that is, the average PageRank difference between iterations k and k+1 is less than a preset threshold ε.


Listing 3.4: PR power method
1   Input: graph G = (V, E), α, iterations, ε
2   Output: PageRank r
3   for iterations do
4
5       for v ∈ V do
6           r_v^(k+1) = (1 − α)/|V| + α · Σ_{x ∈ I_v} r_x^(k) / |O_x|
7           δ_v = |r_v^(k+1) − r_v^(k)|
8       end for
9       if ‖δ‖ < ε then
10          break
11      end if
12  end for
13  r = r / ‖r‖

Listing 3.4 shows the PageRank power method algorithm, where line 13 is conditional on whether normalisation of the PageRank scores is desired.

3.3.1 PageRank implementation

For the OpenACC implementation of PageRank, I take a similar approach to the one used for BFS (Section 3.2.2), again following the five steps suggested by NVIDIA [25]: evaluation, setup code, compute regions, adding data regions and optimising data transfers, and loop-schedule tuning. I go over each step in turn, presenting the specific decisions taken for PageRank.

Evaluation

The algorithmic evaluation of PageRank (Section 3.3) reveals that the primary point of parallelism is the processing of each vertex. Because the entirety of the computation depends on the scores of the previous iteration (Listing 3.4, line 6), there can be no parallelism between iterations. The primary computation, seen in Listing 3.4 line 6, can be broken into two distinct parts: calculating the outgoing and incoming contributions, respectively.

1. The outgoing contribution of a vertex is its score (the sum of its incoming contributions) divided by its number of outgoing connections.

2. The incoming contribution of a vertex is the sum of the outgoing contributions of its incoming connections.

The PageRank algorithm specifies that all vertices begin with an initial score. Thus, it is possible to first calculate the outgoing contribution: divide the current score by the number of outgoing connections; next, the calculation of the incoming contribution is straightforward: sum the outgoing contributions of the incoming neighbourhood.

These observations, in combination with the algorithmic analysis made in Section 3.3, are sufficient to move from pseudocode to code.

Setup code

In addition to the graph, PageRank requires two more data structures to separate its computation into two parts (i.e., the outgoing and incoming calculations). These are a scores array, storing the PageRank scores in each iteration, and an outgoingContribution array, which stores the outgoing contribution of each vertex between iterations. In order to match the selected reference implementation [30], these values are stored in double precision. I use the pointer g to refer to the graph structure outlined in Section 3.1.4.


1 const double initScore = 1.0 / g->nrNodes;
2 const double baseScore = (1.0 - dampF) / g->nrNodes;
3
4 for (int i = 0; i < g->nrNodes; i++) {
5     scores[i] = initScore;
6 }

Code 6: Initialisation of the PageRank implementation.

Code 6 shows the straightforward initialisation of the data structures. In compliance with the algorithm shown in Listing 3.4, the initial score initScore and base score baseScore are calculated once, and the PageRank of all vertices is set to match the initial score.

 1 for (int iter = 0; iter < maxIters; iter++) {
 2     double error = 0;
 3
 4     // Secondary loop, calculate the outgoing contributions
 5     for (int nID = 0; nID < g->nrNodes; nID++) {
 6         outgoingContribution[nID] = scores[nID] / g->outNodes[nID].nrEdges;
 7     }
 8
 9     // Primary loop, calculate the incoming contributions
10     for (int nID = 0; nID < g->nrNodes; nID++) {
11         int edgeOffsetEnd = g->inNodes[nID].edgeOffset + g->inNodes[nID].nrEdges;
12         double incomingTotal = 0;
13
14         // Iterate over the incoming neighbourhood
15         for (int i = g->inNodes[nID].edgeOffset; i < edgeOffsetEnd; i++) {
16             int incomingNID = g->inEdges[i];
17             incomingTotal += outgoingContribution[incomingNID];
18         }
19
20         double oldScore = scores[nID];
21         scores[nID] = baseScore + dampF * incomingTotal;
22         error += fabs(scores[nID] - oldScore);
23     }
24     // Break if the PR scores have converged
25     if (error < epsilon) {
26         break;
27     }
28 }

Code 7: Serial source code of the PageRank implementation.

Code 7 shows the plain source code without any OpenACC acceleration. The outer for loop (line 1) runs until a preset number of iterations has been met or the PageRank scores have met a preset convergence threshold (line 25). Each iteration starts with a calculation of the outgoing contributions (line 5). This is followed by a nested loop which calculates the incoming contributions by iterating over the incoming neighbourhood (line 15), summing the outgoing contributions (line 17), and updating the scores array (line 21).

All in all, the code is similar to the algorithmic outline given in Listing 3.4, except the main computation is broken into two parts: calculating the incoming and outgoing contributions.


Compute regions

Given that typical compute regions to be accelerated are parallelisable loops (ideally, with high iteration counts), there are four regions which qualify for acceleration:

1. The initialisation loop, see Code 6 line 4.
2. The secondary loop, computing the outgoing contributions, see Code 7 line 5.
3. The primary loop, calculating the incoming contributions, see Code 7 line 10.
4. The loop over the incoming neighbourhood, see Code 7 line 15.

Because the loop over the incoming neighbourhood sums the outgoing contributions, a summation reduction on the variable incomingTotal is needed. This translates to the following directive: #pragma acc reduction(+ : incomingTotal)

I note that the nested loop over the incoming neighbourhood is similar to the one found in BFS, where an iteration is made over the outgoing neighbourhood of all vertices. Due to the similarities, it is possible to reuse the basic acceleration directives established in Section 3.2.2.

As such, I arrive at the following configuration of directives, which I will refer to as the base version:

1. #pragma acc parallel loop, Code 6 line 4.
2. #pragma acc parallel loop, Code 7 line 5.
3. #pragma acc parallel loop reduction(+ : incomingTotal), Code 7 line 10.
4. pass – leave it to the compiler, Code 7 line 15.
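As an illustration, the primary loop of Code 7 annotated with such directives could look as follows. This is a sketch rather than the exact benchmarked source: it attaches a reduction to the error accumulator, which the outer loop needs for a correct convergence value once it runs in parallel, while the inner summation is left to the compiler as in the base version.

/* Illustrative sketch only (not the exact benchmarked source): the primary
 * loop of Code 7 with a base-version directive attached. */
#pragma acc parallel loop reduction(+ : error)
for (int nID = 0; nID < g->nrNodes; nID++) {
    int edgeOffsetEnd = g->inNodes[nID].edgeOffset + g->inNodes[nID].nrEdges;
    double incomingTotal = 0;

    // Inner loop: schedule and sum reduction left to the compiler
    for (int i = g->inNodes[nID].edgeOffset; i < edgeOffsetEnd; i++) {
        incomingTotal += outgoingContribution[g->inEdges[i]];
    }

    double oldScore = scores[nID];
    scores[nID] = baseScore + dampF * incomingTotal;
    error += fabs(scores[nID] - oldScore);
}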

Adding data regions and optimising data transfers

In this step, it is important to reduce the data movement between host and device as much as possible, as this communication is always costly. In the case of PageRank, the minimum data transfers are: the graph is copied to the GPU (as input for the algorithm), and the scores are copied back to the host (as output of the algorithm). There is no need for data movement between host and device during the algorithm itself. Similar to BFS, this behaviour can be guaranteed by wrapping the entirety of the code seen in Code 7 in a block accompanied by a data management directive.

1 #pragma acc data \
2     copyin(g[0:1]) \
3     copyin(g->outNodes[0:g->nrNodes], g->outEdges[0:g->nrEdges]) \
4     copyin(g->inNodes[0:g->nrNodes], g->inEdges[0:g->nrEdges]) \
5     create(outgoingContribution[0:g->nrNodes]) \
6     copyout(scores[0:g->nrNodes])
7 {
8     PR code
9 } /* End of OpenACC data region */

Code 8: Data directive wrapping the PageRank algorithm.

Code 8 shows the structured data region wrapping the PageRank code. The graph is only copied from host to device, by the copyin directive. The PageRank scores are allocated on the device and only moved from the device to the host by the copyout directive. Finally, the outgoingContribution array is simply allocated on the device without further data movement.


Figure 3.9: NVIDIA profiler output for PR compiled for the GPU using the graph g500-18 as input.

I use the NVIDIA profiler to confirm that the data movement is as expected. Figure 3.9 shows the profiler output in the form of a timeline. The two vertical bars indicate the slice of the timeline during which all computations of the algorithm itself are performed. The highlighted horizontal bar, MemCpy (HtoD), shows all data movement from host to device. Inspection reveals there is no data movement during the execution of the algorithm, only prior to it (see marker B in Figure 3.9) and upon completion (see marker A in Figure 3.9). Further inspection reveals that each iteration of the PageRank algorithm is of similar duration (which makes sense, as the workload is invariant between iterations), and all nested kernels are launched 10 times (see 1, 2, ..., 10 in Figure 3.9), which matches the number of iterations.

Because the NVIDIA profiler revealed nothing out of the ordinary, it is safe to move on to loop schedule tuning.

Loop schedule tuning

By applying the same steps used during the tuning of BFS (Section 3.2.2), I arrived at the following directive configuration, which I will refer to as the optimised version:

1. #pragma acc parallel loop, for the initialisation loop (Code 6 line 4).
2. #pragma acc parallel loop, for the secondary loop (Code 7 line 5).
3. #pragma acc parallel vector_length(32) combined with #pragma acc loop gang, for the primary loop over the vertices (Code 7 line 10).
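A sketch of what such a tuned schedule can look like for the primary loop is given below. The exact annotated source is not listed here, so the precise placement of the gang, vector_length(32), and reduction clauses is illustrative only; the error accumulation is omitted for brevity.

/* Illustrative sketch of a tuned schedule (not the exact benchmarked source):
 * the parallel region fixes a short vector length, each gang processes one
 * vertex, and the edge loop is vectorised with an explicit sum reduction. */
#pragma acc parallel vector_length(32)
{
    #pragma acc loop gang
    for (int nID = 0; nID < g->nrNodes; nID++) {
        int edgeOffsetEnd = g->inNodes[nID].edgeOffset + g->inNodes[nID].nrEdges;
        double incomingTotal = 0;

        #pragma acc loop vector reduction(+ : incomingTotal)
        for (int i = g->inNodes[nID].edgeOffset; i < edgeOffsetEnd; i++) {
            incomingTotal += outgoingContribution[g->inEdges[i]];
        }

        scores[nID] = baseScore + dampF * incomingTotal;
    }
}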


[Figure 3.10: bar chart "PR optimisation speedup for G500" — speedup factor (0.0–2.0) per input graph, graph500-12 through graph500-21, for CPU and GPU.]

Figure 3.10: Speedup of the optimised PR version (see 3.3.1) vs the base version (see 3.3.1).

In order to illustrate the performance differences between the base version and the optimised version, I ran both versions on the G500 dataset. Figure 3.10 shows the performance of the optimised version relative to the base version. A few observations can be made:

1. The CPU performance is not impacted by the optimised construction.

2. The GPU performance is positively impacted across all sizes of the examined G500 graphs, showing a speedup peaking at 2,1 times that of the base version for graph 17.

3. The GPU speedup tails off for the largest graph size tested.

All in all, the optimised approach outperforms the base version by 54% on the GPU, and will be used as the final version for the performance analysis.


CHAPTER 4

Experimental setup

In this chapter I present the components of the experimental setup (including the hardware platforms and the benchmarking datasets). I further discuss the analysis methodology, which includes the selection of state-of-the-art, best-known versions of BFS and PageRank for comparison purposes.

4.1 Platforms

The application performance is measured on two platforms: one CPU (a dual-socket Intel Xeon E5-2630 v3) and one NVIDIA GPU (TitanX, Maxwell generation). The small selection size allows for in-depth analysis of each application, but still showcases the inter-device performance portability differences. The corresponding specifications can be found in Tables 4.1 and 4.2, respectively. All reported experiments are run on the DAS5 computing cluster, running Linux CentOS 7. Experiments are scheduled via the Slurm workload manager [33], guaranteeing the exclusivity of the compute resources for the duration of the experiment. All applications are compiled using pgcc/pgc++ 18.10, passing the -fast flag. This flag represents a combination of compiler optimisation instructions and is recommended for performance by the standard [16].

Table 4.1: CPU specifications

Platform Intel Xeon E5-2630-v3

Architecture Haswell

Sockets 2

Cores (total) 16

Threads (total) 32

Base clock frequency (GHz) 2,4

Memory bandwidth (GB/s) 48,9

SIMD support AVX2 (256 bit)

Table 4.2: GPU specifications

Platform NVIDIA TitanX

Architecture Maxwell

Cores 3072

Base clock frequency (MHz) 1000

Memory bandwidth (GB/s) 336,5


4.2 Graph Selection

Because the performance of the algorithms is expected to vary from graph to graph based on their different structural properties, a wide selection of natural graphs is required. The following graphs were selected from the Stanford Network Analysis Platform (SNAP) [34] and its corresponding dataset [35], as well as the Koblenz Network Collection (KONECT) [36, 37]. As can be seen in Table 4.3, the selected graphs encompass a varied range of graph properties.

Table 4.3: Specifications of the selected natural graphs taken from the SNAP and KONECT datasets.

Abbreviation       Graph                        Vertices    Edges       Diameter  Directed  References
as-skitter         Autonomous systems           1.696.415   11.095.298  25        No        [38]
roadNet-CA         California road network      1.965.206   2.766.607   849       No        [39]
roadNet-PA         Pennsylvania road network    1.088.092   1.541.898   786       No        [39]
roadNet-TX         Texas road network           1.379.917   1.921.660   1054      No        [39]
web-Google         Google web graph             875.713     5.105.039   21        Yes       [39]
web-BerkStan       Berkeley-Stanford web        685.230     7.600.595   514       Yes       [39]
wiki-Talk          Wikipedia Talk network       2.394.385   5.021.410   9         Yes       [40]
email-EuAll        EU email network             265.214     420.045     14        Yes       [41]
cfinder-google     Google.com internal          15.763      171.206     7         Yes       [42]
z-h-internallink   Hudong internal links        1.984.484   14.869.484  16        Yes       [43]
dbpedia-all        DBpedia                      3.966.924   13.820.853  67        Yes       [44]
web-NotreDame      Notre Dame hyperlinks        325.729     1.497.134   46        Yes       [45]

Another set of graphs used are synthetically generated Kronecker graphs [46], following the Graph500 [47] standard (A=0.57, B=C=0.19, D=0.05). Such graphs are structurally similar apart from their exponential increase in size, doubling the number of vertices at each step. These graphs follow the naming convention g500-(e), where e signifies 2^e vertices. The average degrees are similar, ranging from 23 to 31. Because of this predictable growth, the Graph500 graphs are well suited to benchmarking the impact of increasing graph size.
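For example, the G500 graphs used in Figure 3.10 range from g500-12 to g500-21, i.e. from roughly four thousand to roughly two million vertices:

|V| = 2^{e}: \qquad 2^{12} = 4\,096 \ (\text{g500-12}), \qquad 2^{21} = 2\,097\,152 \ (\text{g500-21})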

The combined collection of graphs is diverse in both topology and origin (synthetic versus natural graphs), encompassing a wide range of data which models real-world connections such as those between people, websites, and roads. Thanks to the SNAP, KONECT, and Graph500 [48] initiatives, a wide range of graph properties is readily available, and these graphs have been frequently used in research, providing continuity with prior work.

4.3 Reference Application Selection

In order to compare the performance of the applications outlined in Chapter 3 to the state of the art, a selection of appropriate reference applications has to be made. Because of their specialised nature, both a CPU-focused and a GPU-focused reference are required.

For the CPU-focused reference I have opted to work with the GAP Benchmark Suite [30]. GAP uses OpenMP for parallelism and was created in order to standardise graph processing research. The GAP implementations are representative of state-of-the-art performance, but are not front runners; they should nevertheless make for an excellent baseline. Furthermore, the level of graph preprocessing is similar, the implementations reason from a vertex and edge perspective, and the code is reasonably documented.

On the GPU side I have opted to work with the Gunrock GPU Graph Analytics framework [49]. Gunrock makes use of CUDA for parallelism and is featured by NVIDIA as an external library for GPU graph analytics. Like GAP, Gunrock reasons from a vertex and edge perspective. Furthermore, the Gunrock code base is still actively maintained.


4.4 Measurement methodologies

The reported algorithm execution times are each recorded individually and include every aspect of the algorithm's execution. The execution times do assume the graph data structure is already loaded into memory. Any time spent on data structures other than the graph used by the algorithm, including memory allocation for the solution, is included in the execution time.
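As an illustration of this measurement approach, the sketch below times a single BFS run with OpenMP's wall-clock timer; the function and type names are hypothetical, since the harness itself is not listed here.

#include <stdio.h>
#include <omp.h>

/* Hypothetical declarations -- the actual function and type names are not
 * specified in this thesis. */
typedef struct graph Graph;
extern Graph *load_graph(const char *path);     /* graph loading, excluded from timing */
extern int   *bfs_acc(const Graph *g, int src); /* allocates and returns the depth array */

int main(void) {
    Graph *g = load_graph("g500-18.el");        /* done before the timed region */

    double start = omp_get_wtime();
    int *depth = bfs_acc(g, 0);                 /* solution allocation counted in the timing */
    double elapsed = omp_get_wtime() - start;

    printf("BFS execution time: %.6f s\n", elapsed);
    (void)depth;
    return 0;
}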

The graph data structure selected in Section 3.1.4 is kept identical for all applications, as any modification has to be beneficial to all algorithms. For the BFS application, the first available source vertex with a non-zero degree is selected, as opposed to selecting a random source vertex. This is done in order to retain deterministic results, although a case could be made for randomisation in order to minimise the impact of the selected source vertex. The PR application is run for 10 iterations or until a convergence threshold of 10^-4 has been met. This configuration is selected to match the default criteria used in the reference GAP benchmark suite.

All experiments are repeated a minimum of 10 times; this is expected to be sufficient, as preliminary samples show consistent execution times. For all results, the standard deviation of the measurements is indicated by narrow black bars attached to the results.

Table 4.4: Kernel arguments and output.

Algorithm   Repetitions   Arguments                            Output
BFS         10            --source=0                           |N|-sized array of 32-bit integers (depth to source)
PR          10            --maxIterations=10 --error=0.0001    |N|-sized array of 32-bit floating-point numbers


CHAPTER 5

Performance analysis

In this chapter, I present an in-depth analysis of the results observed when running the OpenACC kernels on both CPUs and GPUs. As the main focus in this work is cross-platform portability, two types of results are included: comparative analysis of CPU vs GPU performance in terms of kernel execution time, and application efficiency, measured against the state-of-the-art high-performance versions (GAP and Gunrock – see Chapter 4, Section 4.3) acting as best-performing reference code.

5.1 Breadth first search

Figure 5.1 presents the performance results of the OpenACC implementation of BFS (top-down only) on the Graph500 synthetic datasets. I make the following observations:

1. As expected, the GPU outperforms the CPU for all datasets. However, the performance gap between the two platforms depends on the scale of the graph.

2. The performance on both platforms scales well with the size of the G500 graphs (especially for the larger graph sizes); this is expected, given the regular structure of this family of synthetic graphs.

3. I observe poor performance when running BFS on the CPU for small graphs. The results seem to indicate some form of constant overhead for the CPU.
