
Bachelor Informatica

Performance modeling for graph

processing algorithms using

machine learning

Younes Ouazref

June 8, 2018

Supervisor(s): Ana Lucia Varbanescu

Informatica
Universiteit van Amsterdam


Abstract

Graph processing is becoming increasingly important because of its many applications. It is, however, not possible to create an algorithm that is optimal for every type of graph, because the performance of a program depends on both the implementation and the input. In this thesis we propose an approach to this problem: not designing a new algorithm, but creating a model that predicts which implementation of a graph processing task is best suited for a specific graph. We use two existing triangle counting algorithms found on GitHub, both implemented in the parallel CUDA framework for GPUs. We collect the performance data of these algorithms over 21 graphs, together with the properties of those graphs. Three machine learning techniques are then trained on this data to create different models: Linear Regression, Decision Tree and Random Forest. We show that it is not only possible to create a model that predicts the right algorithm, but also that the Decision Tree and the Random Forest reach ∼95% accuracy and beat the Linear Regression, which only reaches ∼54%. We conclude that machine learning is a promising candidate for reducing the execution time of graph processing tasks by predicting the optimal algorithm.


Contents

1 Introduction
  1.1 Context
  1.2 Research question and approach
  1.3 Thesis structure
2 Related work
3 Theoretical background
  3.1 Graphs
    3.1.1 Representing graphs
  3.2 Parallel Programming
    3.2.1 GPUs
    3.2.2 CUDA
  3.3 DAS-5
4 Triangle Counting
  4.1 Definition and applications
    4.1.1 Transitivity
    4.1.2 Clustering coefficient
    4.1.3 Spam filtering
  4.2 Parallel Algorithms
    4.2.1 Polak's Algorithm
    4.2.2 Jain and Adtani's algorithm
5 Data Collection
  5.1 Input graphs
  5.2 GPU execution time
    5.2.1 CUDA events
  5.3 Size
  5.4 Structure
    5.4.1 Five-number summary
6 Model creation
  6.1 Machine learning
    6.1.1 Linear Regression
    6.1.2 Decision Tree
    6.1.3 Random Forest
  6.2 Scikit
    6.2.1 LinearRegression module
7 Experiments and Validation
  7.1 Experimental setup
    7.1.1 Data collection
    7.1.2 Desired result
  7.2 Results
  7.3 Validation
8 Conclusion and future research
  8.1 Conclusion


CHAPTER 1

Introduction

In this first chapter we introduce the subject by giving some context, and we explain our research question and the goal of the research. Furthermore, the thesis outline is given.

1.1

Context

Graph processing has become a frequent computing problem. Graphs and their applications, like the world wide web and its Webgraph [6], social media graphs such as Facebook [37], or road maps like the ones used by Google Maps, have become part of our daily life. While these graphs can store a lot of information, extracting the right information can be challenging, especially when the graphs are large. And in today's consumer society, fast data analysis is becoming more important than ever.

Because of the massive sizes that graphs can have, parallelism is needed to compute a result within a reasonable time. With the advance and wide use of GPUs, which are designed to execute large amounts of data-parallel work [7], and the excellent documentation of the CUDA library1, the 'power' and impact of parallelism have increased.

For many types of graph processing applications, picking one single parallel algorithm that would be optimal for all graphs is not possible. Graphs can be very diverse in nature and properties. And because the performance of different implementations depends not only on the hardware and software but also on the structural properties of the graph [38], it can be strenuous to find the optimal algorithm for a specific task.

It is therefore desirable to determine which algorithm should be chosen for a given graph processing task, i.e., an application and its dataset. Thus, instead of letting the programmer decide which implementation to pick, we aim to devise a model that helps him/her make this choice. In this work, we aim to use statistical modeling (eventually using techniques from machine learning) to build this model, based on data collected from previous runs. Such a model, if accurate, could reduce processing time and, in turn, be more efficient than just using one implementation for all graphs; such an efficiency improvement can save time and money.

1.2

Research question and approach

The research question of this thesis is:

Is it feasible to predict, through statistical performance modeling, which parallel algorithm suits a given graph processing task the best?


To answer this question, I focus on a specific graph processing task, namely triangle counting, and use two different algorithms and their CUDA implementations available online. I aim to build a model to predict which of these algorithms is better for counting the triangles of a given graph. To test the implementations and train the models, I use publicly available graphs from the DIMACS10 Graph Challenge2. Furthermore, to build the model, I test Linear Regression, Decision Trees, and Random Forest. My goal is two-fold: (1) to build an accurate model to predict which triangle counting algorithm is best for a given graph, and (2) to determine which modeling technique is most suitable to build this model. To determine the accuracy of the model, I provide an empirical analysis, reporting the model's correct prediction rate on a set of 21 graphs.

1.3

Thesis structure

This thesis is structured as follows. Chapter 2 discusses related work on the same subject. Chapter 3 presents background information on topics that are essential to gain a better understanding of the thesis. Chapter 4 explains the triangle counting task and discusses the implementations found. Chapter 5 examines the graphs that are used and describes the method of collecting the data. Chapter 6 describes the machine learning techniques used and goes into detail on how these techniques were implemented. Chapter 7 explains how the data was collected on the DAS-5 and goes through the results. Lastly, Chapter 8 contains my conclusions and recommendations for future research.


CHAPTER 2

Related work

Others have also tried to create models that predict performance for graph processing and GPU computations. Here we discuss their work and examine their results.

My thesis continues the work of Verstraaten et al., who have shown that machine-learning models can be used to select the right algorithm for a graph processing task, and who even took advantage of these predictions to switch between implementations at runtime [38]. Their work uses performance data collected from 15 different breadth-first search (BFS) implementations on a set of 248 graphs. Figure 2.1 shows that these 15 implementations do have different performance on different graphs. In some cases, these differences are of orders of magnitude, depending on the graph properties. This means that picking the wrong implementation could lead to a huge waste of time and energy.

Figure 2.1: Performance difference of the different implementations. Image from [38].

The 15 implementations were written by the authors themselves and consist of 5 base implementations with 3 variants each: 2 edge-centric implementations (edge-list and reverse edge-list), 2 vertex-centric implementations (vertex push and vertex pull), and 1 virtual warp-based implementation [38]. The 3 variants of these implementations differ in how the new frontier size (i.e., the number of vertices that have been assigned a new depth) is computed at the end of each level.

The training and test graphs were taken from KONECT1, a graph database created at the Universität Koblenz-Landau, Germany. The model used in [38] is based on Binary Decision Trees, using as features the graph size, frontier size, vertex count, and indicators for the degree distribution. The model predicts which implementation (out of the 15 they propose) would give the best performance at each level of the graph traversal. The model achieves ∼96% accuracy over the test data. Using this model-based technique, their approach reduces the overall execution time and manages to outperform two popular graph processing systems for GPUs, Gunrock and LonestarGPU. The authors therefore conclude that

"machine learning can be used to build a high-accuracy model which, taking into account graph and algorithm properties, can predict the optimal selection of BFS implementations for ∼56% of all BFS traversals and within 2× of optimal for ∼97% of all traversals." [38]

Other work on performance modeling on GPUs has been done by Wu et al. [41]. They created a high-level GPGPU performance and power model that uses performance counters to predict the performance and power consumption of kernels at various hardware configurations, see Figure 2.2. To accomplish this task, data from many hardware configurations was collected into a training set. This training set was then assembled into a collection of representative scaling behaviors, which were grouped using K-means clustering and then mapped by neural network classifiers.

Using this technique they could

"estimate performance with an average error of 15% across a frequency range of 3.3×, a bandwidth range of 2.9×, and an 8× difference in number of CUs. Our dynamic power estimation model has an average error of only 10% over the same range" [41].

Figure 2.2: Design principle of Wu et al. Image from [41].

In this thesis I aim to test whether using a similar technique as Verstraaten et al. is possible for a different algorithm, namely triangle counting. Thus, I also create a model using the collected performance data from two different algorithms for triangle counting, and determine if it is able to predict for an arbitrary graph which implementation will have the best performance.


CHAPTER 3

Theoretical background

This chapter provides background information that is needed to understand the rest of the thesis. First the basics of graphs are explained, then parallel programming and its concepts are covered, together with an introduction to GPU computing. Lastly, we provide some information on the cluster that was used throughout this research.

3.1

Graphs

A graph is a fundamental representation of relationships between objects [1]. A graph G consists of a set of vertices (or nodes) V and a set of edges E, which form an ordered pair (V, E). These sets of vertices and edges can form a vast number of different structures, which are used in many scientific and commercial fields.

Graphs can have different properties. A graph can be undirected, in which case edges have no orientation: if we have two nodes A and B, then (A → B) also implies (A ← B). In a directed graph, edges do have a direction, and (A → B) does not imply (A ← B). Besides being directed or undirected, a graph can be weighted. This means that every edge in the graph has a numerical value assigned to it, giving the edge extra information, for example representing a distance in a road graph.

In this thesis we will only look at unweighted and undirected graphs.

Degree

Even though the size of a graph is determined by its vertices and edges, it does not give a good indication of the structural properties of the graph.

Figure 3.1: Two graphs with different structures.

Figure 3.1 shows this clearly. Both graphs have the same number of vertices and edges but a completely different formation.

This is due to the fact that the degrees of the vertices are different. The degree of a vertex is the number of connections it has with other vertices. It can say something about the node and its relevance in a graph. For instance, a leaf node is a node with a degree of 1, and consequently has only one connection with the rest of the graph. Vertices with relatively high degrees are referred to as hubs and are interesting to study because they could be central to certain processes.

So the degree of a vertex says something about individual vertices, but to get a better picture of the entire graph, degree distributions are made.

Figure 3.2: The degree values of individual nodes.

The degree distribution is the probability distribution of the degrees over the whole graph. It is calculated by counting the occurrences of each degree value over all the nodes and then dividing these counts by the number of vertices. We then have, for every degree value, its relative frequency. Figures 3.3 and 3.4 illustrate this for the graphs from Figure 3.2.


Figure 3.4: The degree distribution of the right graph.
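The counting-and-normalising procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name and the edge-list input format are chosen here for clarity and do not appear in the thesis code.

```python
from collections import Counter

def degree_distribution(num_vertices, edges):
    """Relative frequency of each degree value in an undirected graph.

    `edges` is a list of (u, v) pairs, each undirected edge listed once.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Vertices that appear in no edge have degree 0.
    counts = Counter(degree[v] for v in range(num_vertices))
    return {d: c / num_vertices for d, c in counts.items()}

# A path graph 0-1-2-3: two leaf nodes (degree 1), two internal (degree 2).
print(degree_distribution(4, [(0, 1), (1, 2), (2, 3)]))
# → {1: 0.5, 2: 0.5}
```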

A graph has many more properties that shape its structure. Finding a good way to store graphs so that these properties are preserved is vital.

3.1.1

Representing graphs

There are different ways to represent graphs, with each technique having its advantages and disadvantages.

We will now go through three commonly used methods.

Edge array

An edge array stores the edges of a graph by saving every connected pair of vertices together in an array. The vertices are represented as numbers and, if the edges have a weight, the value of the weight is stored as a third element.

The advantage of an edge array is its simplicity. However, if we want to find a particular vertex pair, we would have to go through all the pairs to find the one we need.

Adjacency matrices

An adjacency matrix is a matrix consisting of zeros and ones. The edges are stored by putting a 1 whenever there is an edge between the vertices represented by the row and the column, and a 0 otherwise. Looking up a pair of nodes is relatively easy, because we only have to access the corresponding entry of the matrix. However, because of its high space complexity it is not well suited for very large graphs, see Table 3.1.

Adjacency list

Adjacency lists are a combination of edge arrays and adjacency matrices. For every vertex in the graph, all its neighbours are stored in an array. The length of the list is therefore |V|.


7 10
2 5 6
1 3 6
2 4 7
3 5 6
1 4 6
1 2 4 5
3

Figure 3.5: An example of a graph with its adjacency list (left). Image from [11].

In Figure 3.5 we see an example of how a graph can be represented using an adjacency list. The first line of the adjacency list contains the numbers 7 and 10. The 7 stands for the number of vertices in the graph and the 10 for the number of edges. From the second line onward, the adjacent vertices are given for all the nodes in numerical order. This means that the 2 5 6 on the second line represents the neighbours of the first node, the 1 3 6 on the third line represents the neighbours of the second node, etc.

In this thesis we will only be using graphs that use the adjacency list format of Figure 3.5.
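As a sketch, this format can be parsed with a short Python function. The function name is illustrative and not part of the thesis code; it assumes exactly the layout of Figure 3.5 (a header line, then one neighbour line per vertex).

```python
def parse_adjacency_list(text):
    """Parse the adjacency-list format of Figure 3.5.

    First line: <num_vertices> <num_edges>; line i+1 lists the
    (1-based) neighbours of vertex i. Returns (n, m, adjacency).
    """
    lines = text.strip().splitlines()
    n, m = map(int, lines[0].split())
    adjacency = [list(map(int, line.split())) for line in lines[1:n + 1]]
    return n, m, adjacency

example = """7 10
2 5 6
1 3 6
2 4 7
3 5 6
1 4 6
1 2 4 5
3"""
n, m, adj = parse_adjacency_list(example)
# Each undirected edge appears twice, so the neighbour lists hold 2*m entries.
assert sum(len(a) for a in adj) == 2 * m
```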

Representation method    Space complexity
Edge array               Θ(E)
Adjacency matrix         Θ(V²)
Adjacency list           Θ(V + E)

Table 3.1: Space complexity of different graph representations.

Lastly, there are real-world and synthetic graphs. A real-world graph is based on empirical data, while a synthetic graph is not. Examples of real-world graphs are social graphs, web graphs, transportation graphs, and citation graphs [1]. While some of these graphs can be relatively small, there are also graphs with millions of vertices and billions of edges. These sizes can make them difficult to interpret and analyze. To be able to process these types of graphs within a sensible timespan, parallelism is needed.

3.2

Parallel Programming

In the field of Information Technology (IT), speed has always been a key factor: the faster a program works, the better it is received. However, in recent years hardware speed has not increased as much as it did 30 years ago. Transistors have become so small that trying to make them even smaller results in current leaks [22]. Because of this, Moore's law can no longer be sustained, so to improve the speed of programs we need to look at something else.

If it becomes tough to make the hardware faster, the obvious alternative is to look at the software. Traditionally, software has always been written sequentially: every instruction is executed one after the other. However, some parts of a program consist of computations that are independent of each other and can therefore be computed simultaneously, or in other words in parallel.


Parallelism is used to compute problems by distributing the calculations or process executions over multiple units/elements so that they can be solved concurrently [3], see Figure 3.6. It is highly useful for programs that do the same type of task multiple times, like running the same computation on different chunks of data, also known as data parallelism or Single Instruction, Multiple Data (SIMD).

An example of parallelism would be finding the longest shortest path in a graph. Every processing element can take care of a set of node pairs (the data) and calculate their distances independently. After every element has calculated its share, the results are combined and the longest distance among them can be picked. This process decreases the time needed significantly and can result in speedups of orders of magnitude.
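The data-parallel pattern in this example can be sketched in Python: the source vertices are split into chunks, each chunk computes a partial maximum, and the partial results are combined. In this sketch the chunks are processed sequentially for clarity; in a real data-parallel setting each chunk would go to its own processing element. All names are illustrative.

```python
from collections import deque

def bfs_longest_distance(adjacency, source):
    """Longest shortest-path distance (eccentricity) from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def longest_shortest_path(adjacency, num_chunks=4):
    """Diameter of a connected graph via the chunk-and-combine pattern."""
    vertices = list(adjacency)
    chunks = [vertices[i::num_chunks] for i in range(num_chunks)]
    # Each chunk computes its own partial maximum (independently).
    partial = [max(bfs_longest_distance(adjacency, s) for s in chunk)
               for chunk in chunks if chunk]
    return max(partial)  # combine the per-chunk results

# A path graph 0-1-2-3 has diameter 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(longest_shortest_path(path))  # → 3
```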

Figure 3.6: Comparison between sequential and parallel computing. Image from [16].

Besides looking at the software, it is also important to choose the right architecture to run the parallel code on.

3.2.1

GPUs

Parallel computing can be done on both the central processing unit (CPU) and the graphics processing unit (GPU). The advantage of using a GPU is that it has many more cores than a CPU and therefore outperforms it for parallel workloads [29], see Figure 3.7.

The CPU, while having fewer cores, has more processing power per core and lots of cache memory, and is therefore designed for high sequential performance [7].


Figure 3.7: CPU cores vs GPU cores. Image from [17].

GPUs, as their name suggests, were originally used for graphics processing because of the many (heavy) computations it requires. But people quickly started to realize that their computing power could also be used for other purposes, a use known as general-purpose computation on GPUs (GPGPU). The availability of many cores is perfect for parallelism, as thousands of computations can run at the same time.

This has led to the increase of research on parallel computing on GPU architectures which resulted in the development of NVIDIA CUDA.

3.2.2

CUDA

Parallel computing on a GPU cannot be done without a special programming system; CUDA is one of those systems, and it has gained significant popularity over the last decade [30]. CUDA is a parallel computing platform and API designed and maintained by NVIDIA. Its first version was released on June 23, 2007 and, at the time of writing this thesis, its ninth version has already been released. CUDA only works on a GPU that is based on NVIDIA's G80 architecture or one of its successors [7]. As for the software, it can be used from the programming languages C, C++ and Fortran; third-party wrappers for other languages are also available.

CUDA's design principle is that it allows the GPU to be a helping processor next to the CPU. In CUDA, the CPU is called the host and the GPU the device. A kernel is a function that is called by the host but executed in parallel by many threads on the device. These threads carry out the same computation, but on different, independent chunks of data. A perk of a kernel launch is that it does not block the host, so the host can keep doing computations simultaneously with the device. The number of threads that execute a kernel is specified by the programmer through a grid of blocks containing threads.

To address a thread within a block, indexes are used, which can be 1-, 2- or 3-dimensional. The blocks within a grid can be addressed using 1- or 2-dimensional indexes. Threads are therefore distinct and easily identified by their thread index within the block and their block index within the grid [7], see Figure 3.8.


Figure 3.8: CUDA kernels are executed per thread. Image from [19].
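For the 1-dimensional case, the indexing scheme above boils down to simple arithmetic, sketched here in Python. Inside a CUDA kernel the same value is computed as `blockIdx.x * blockDim.x + threadIdx.x`; the Python function name is only illustrative.

```python
def global_thread_id(block_idx, block_dim, thread_idx):
    """Unique global index of a thread in a 1-D grid of 1-D blocks."""
    return block_idx * block_dim + thread_idx

# With 256 threads per block, thread 3 of block 2 gets global index 515.
print(global_thread_id(2, 256, 3))  # → 515
```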

The GPU architecture built by NVIDIA is designed around a scalable array of multithreaded Streaming Multiprocessors. When the host calls a kernel grid, the blocks of the grid are enumerated and distributed over the multiprocessors with unoccupied capacity. The threads of a block, and multiple blocks themselves, can execute concurrently on one multiprocessor. As blocks finish, new blocks are launched on the freed multiprocessors [28].

So a multiprocessor is designed to execute a large number of threads concurrently. To manage this, it employs a unique architecture called SIMT (Single-Instruction, Multiple-Thread), which creates, manages, schedules and executes threads in groups of 32 parallel threads called warps [28].

The maximum thread and block sizes that can be specified differ per version of CUDA and the hardware it runs on. For more information on those specifications, the CUDA manual can be consulted.1

Even though a CUDA program is now capable of performing a massive number of computations at the same time, it still needs to make sure every thread has access to the right chunk of data. It does this by giving different privileges to groups of threads, see Table 3.2.

Memory     Cached   Access       Who
Local      No       Read/Write   One thread
Shared     No       Read/Write   All threads in a block
Global     No       Read/Write   All threads + host
Constant   Yes      Read         All threads + host

Table 3.2: Memory privileges of threads.

Using CUDA, amazing speedups have been achieved [33], and with the demand for faster programs its relevance will only grow.


3.3

DAS-5

In this thesis the code and experiments are run on the DAS-5 (the Distributed ASCI Supercomputer 5). DAS-5 is a six-cluster wide-area distributed system spread throughout the Netherlands, designed by the Advanced School for Computing and Imaging2. The cluster used during this thesis is the VU cluster, see Figure 3.9.

Figure 3.9: The DAS-5 head node and compute node names.


CHAPTER 4

Triangle Counting

I have chosen to focus my research on the graph processing problem of triangle counting using the CUDA framework. This chapter explains what triangle counting means, how it is done, and why we would want to know the number of triangles in a graph. Furthermore, it discusses two different CUDA implementations of triangle counting, which can be found on GitHub1,2 and will be used in the rest of this thesis.

4.1

Definition and applications

One key technique to gain a better understanding of a graph's structure is finding and counting subgraphs. The simplest subgraph of interest is a triangle [39]. Triangles are cyclic paths of length 3 within graphs, and they occur in abundance throughout many types of networks. They can be used to measure many aspects of a graph, such as the transitivity ratio and the clustering coefficient, but can also be used to detect spam [2]. Figure 4.1 provides an example of a graph with 4 vertices and 2 triangles.

Figure 4.1: A graph and its two triangles. Image from [20].

4.1.1

Transitivity

A triple is a path of length two where, for instance, node A is connected with node B, and node B is connected with node C. If nodes A and C are also connected, then besides a triple they also form a triangle. Transitivity is proportional to the ratio of triangles to triples in the graph, and is closely related to the clustering coefficient. Transitivity aims to quantify the extent to which the connectivity relation between nodes in a graph is transitive: if A is connected to B and B is connected to C, what is the likelihood that A is connected to C as well? Transitivity is an important measure to predict the evolution of networks over time [2].
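The triangles-to-triples ratio can be computed directly on a small graph, sketched here in Python. This brute-force version (checking all vertex triples) is only meant to make the definition concrete; the function names and the set-based adjacency format are assumptions for this sketch.

```python
from itertools import combinations

def transitivity(adjacency):
    """3 * triangles / triples, where a triple is a path of length two.

    `adjacency` maps each vertex to a set of neighbours (undirected).
    """
    # Each vertex with degree k is the centre of k*(k-1)/2 triples.
    triples = sum(len(nbrs) * (len(nbrs) - 1) // 2
                  for nbrs in adjacency.values())
    triangles = sum(1 for u, v, w in combinations(adjacency, 3)
                    if v in adjacency[u] and w in adjacency[u]
                    and w in adjacency[v])
    return 3 * triangles / triples

# A complete graph on 3 vertices: every triple closes into a triangle.
k3 = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
print(transitivity(k3))  # → 1.0
```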

1 https://github.com/adampolak/triangles
2 https://vadtani.github.io/15618-finalproject/


4.1.2

Clustering coefficient

The clustering coefficient of a node is the degree to which the neighbours of the node tend to cluster together to form a clique, i.e., a complete (sub)graph [34]. This metric is also known as the local clustering coefficient; the global clustering coefficient is the average of all the local clustering coefficients in a graph, which is equal to the transitivity. The local clustering coefficient is used to identify subgroups within a graph, for example a group of friends in a social network.
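The local clustering coefficient of a single node can be sketched as the fraction of its neighbour pairs that are themselves connected. As before, the function name and set-based adjacency format are assumptions for this illustration.

```python
def local_clustering(adjacency, v):
    """Fraction of pairs of v's neighbours that are connected to each other.

    `adjacency` maps each vertex to a set of neighbours (undirected).
    """
    nbrs = list(adjacency[v])
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pairs to close
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adjacency[nbrs[i]])
    return links / (k * (k - 1) / 2)

# Vertex 0's neighbours {1, 2} are connected, so its coefficient is 1.0;
# only 1 of the 3 neighbour pairs of vertex 2 is connected.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(local_clustering(g, 0))  # → 1.0
print(local_clustering(g, 2))  # → 1/3
```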

4.1.3

Spam filtering

Triangle counting can also be used to detect spam emails on the internet. Becchetti et al. [5] showed that, by using the distribution of local triangle frequencies, they could differentiate between spam and non-spam hosts [2].

4.2

Parallel Algorithms

Numerous methods have been implemented to count triangles, both sequential [35] and parallel [39]. A sequential triangle counting algorithm has a time complexity of O(V³), but can be optimized to O(V^ω), with ω < 2.376 the fast matrix product exponent, by using the method of Latapy [26]. This optimization is a classical time-space trade-off, as the space complexity increases to Θ(V²). This complexity could be a hurdle, especially with large graphs having up to a billion vertices [8].

For the remainder of this thesis, we focus on parallel triangle counting. Specifically, we search for GPU-enabled algorithms, which can exploit the massive parallelism of GPUs.

4.2.1

Polak’s Algorithm

The first algorithm we discuss is Polak's triangle counting, designed and implemented in CUDA by Adam Polak [33]3. On the NVIDIA GeForce GTX 980 GPU, his solution achieves an execution time of 12 seconds on a graph with 180 million edges, for which 8.8 billion triangles are found, and it is capable of a speedup of 15 to 35 times over his CPU implementation [33]. His algorithm design is based on the forward algorithm, which was simplified by Latapy [26], and it is a good fit for GPU computing because both the preprocessing and counting phases are easily parallelizable. Furthermore, it rarely accesses memory in a random fashion [33].

Input

The algorithm can handle both edge arrays and adjacency lists as input. The algorithm itself works with edge arrays, so if an adjacency list is given it is converted to an edge array. Because adjacency lists are already sorted, this conversion is less costly than converting an edge array into an adjacency list. Furthermore, the algorithm only handles undirected graphs with no self-loops or multiple edges, with each undirected edge appearing exactly twice.

Algorithm

The parallel forward algorithm can be divided into two parts: the preprocessing phase and the counting phase.

In the preprocessing phase the edge array is copied to the GPU memory. On the GPU, the number of vertices is computed using the thrust::reduce and thrust::maximum routines [15], see Listing 4.1. Thrust is a CUDA library which allows the implementation of high-performance applications through a high-level interface [36]. Then thrust::sort is used to sort the edges according to their first vertex. After this, the edge array has become a sorted concatenated adjacency list (CAL) of vertices. The node array is then calculated, so that the i-th element of this array points to the first edge of vertex i in the CAL. Next, every edge that goes from a vertex with a higher degree to a vertex with a lower degree is removed with the thrust::remove_if routine. This turns the undirected edges into directed ones, so that edges won't be processed twice. The CAL is then unzipped, changing from an array of structures into a structure of arrays, and lastly the node array is recalculated.

int NumVerticesGPU(int m, int* edges) {
  thrust::device_ptr<int> ptr(edges);
  return 1 + thrust::reduce(ptr, ptr + 2 * m, 0, thrust::maximum<int>());
}

Listing 4.1: An example of thrust::reduce and thrust::maximum.
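The degree-based edge removal of the preprocessing phase can be sketched on the CPU in Python. This is a sketch under assumptions: each undirected edge is stored twice (once per direction), and ties between equal degrees are broken by vertex number, a detail the text above does not specify.

```python
from collections import Counter

def orient_edges(edges):
    """Keep, for each undirected edge stored twice, only the copy going
    from the lower-degree endpoint to the higher-degree one (ties broken
    by vertex number), mirroring the thrust::remove_if step."""
    degree = Counter(u for u, _ in edges)

    def keep(e):
        u, v = e
        return (degree[u], u) < (degree[v], v)

    return sorted(e for e in edges if keep(e))

# A triangle 0-1-2 plus a pendant edge 2-3; every edge appears twice.
undirected = [(0, 1), (1, 0), (0, 2), (2, 0),
              (1, 2), (2, 1), (2, 3), (3, 2)]
print(orient_edges(undirected))  # → [(0, 1), (0, 2), (1, 2), (3, 2)]
```

Each undirected edge now appears exactly once, pointing toward its higher-degree endpoint, which is what prevents the counting phase from processing an edge twice.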

After the preprocessing phase has ended, the triangles are calculated by assigning to each edge a thread, which computes the adjacency list intersections sequentially. This can be viewed in Listing 4.2. In the for loop, each edge assigned to the thread is processed, and the two-pointer merge algorithm is applied to it in the while loop. Afterwards, the triangle count is put in the results array.

__global__ void CalculateTriangles(int m, const int* __restrict__ edges,
                                   const int* __restrict__ nodes,
                                   uint64_t* results,
                                   int deviceCount = 1, int deviceIdx = 0) {
  int from = gridDim.x * blockDim.x * deviceIdx
           + blockDim.x * blockIdx.x + threadIdx.x;
  int step = deviceCount * gridDim.x * blockDim.x;
  uint64_t count = 0;
  for (int i = from; i < m; i += step) {
    int u = edges[i], v = edges[m + i];
    int u_it = nodes[u], u_end = nodes[u + 1];
    int v_it = nodes[v], v_end = nodes[v + 1];
    int a = edges[u_it], b = edges[v_it];
    while (u_it < u_end && v_it < v_end) {
      int d = a - b;
      if (d <= 0) a = edges[++u_it];
      if (d >= 0) b = edges[++v_it];
      if (d == 0) ++count;
    }
  }
  results[blockDim.x * blockIdx.x + threadIdx.x] = count;
}

Listing 4.2: The kernel that calculates the intersections of the edges.

After the kernel is done, the thrust::reduce routine is called to sum the results of all threads, which gives the total triangle count of the graph.
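The two-pointer intersection at the heart of the counting phase can be mirrored on the CPU. The sketch below is not Polak's code: it assumes an oriented adjacency-list dict instead of the kernel's flattened edge/node arrays, and the function names are illustrative.

```python
def merge_intersect(list1, list2):
    """Two-pointer merge: number of common elements of two sorted lists."""
    i = j = count = 0
    while i < len(list1) and j < len(list2):
        d = list1[i] - list2[j]
        if d <= 0:
            i += 1
        if d >= 0:
            j += 1
        if d == 0:
            count += 1
    return count

def count_triangles(oriented_adj):
    """For every directed edge (u, v), intersect the outgoing adjacency
    lists of u and v; each common neighbour closes one triangle.

    `oriented_adj` maps each vertex to a sorted list of neighbours after
    the preprocessing orientation, so each triangle is counted once."""
    return sum(merge_intersect(oriented_adj[u], oriented_adj[v])
               for u in oriented_adj for v in oriented_adj[u])

# Triangle 0-1-2 plus a pendant vertex 3 (edges oriented by vertex number).
adj = {0: [1, 2], 1: [2], 2: [3], 3: []}
print(count_triangles(adj))  # → 1
```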


4.2.2

Jain and Adtani’s algorithm

The second algorithm found on GitHub is that of Manish Jain and Vashishtha Adtani4. Their code is an optimized version of the code proposed by Wang et al. [39] and achieves an average speedup of 5× compared to the original. This code is also a CUDA implementation and has proven able to count 627,584,181 triangles for a graph with 3 million vertices and 230 million edges on the NVIDIA GTX 1080. Furthermore, their parallel algorithm was capable of a 1184.16× speedup compared to their CPU implementation [11].

In the rest of the thesis this algorithm will be referred to as Jain's algorithm for convenience.

Input

Contrary to Polak's algorithm, this implementation only accepts adjacency lists. This is because it heavily depends on their structure, as we will see in the next section. All graphs are interpreted as undirected, so every node connected to another node has both an outgoing and an incoming edge.

Algorithm

In the first part of the algorithm, the adjacency list is reduced to a 1D array called the NodeList. To know which subpart of this list contains the neighbours of a certain node, a ListLen array is filled with the degrees of the nodes. Then another array is created, which stores for every node the starting index of its neighbours in the NodeList. This is done by running an inclusive sum over the ListLen array in parallel. The algorithm then creates another list of all the distinct edges found in the NodeList. Because the graph is undirected, every edge occurs twice, so only the copy that comes out of the vertex with the lower numerical value is used. To perform this step in parallel, the list needs to be separated so there won't be any conflicts. This is done by calculating a unique id for every edge using the degrees of the vertices.
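The inclusive-sum step that locates each vertex's neighbour segment can be sketched in Python. The function name is illustrative; on the GPU this scan is performed in parallel, here it is sequential.

```python
from itertools import accumulate

def build_start_addresses(list_len):
    """Inclusive sum over the per-vertex degrees: element i gives the end
    of vertex i's neighbour segment in the flattened NodeList, so the
    segment of vertex i starts at the previous element."""
    return list(accumulate(list_len))

# Degrees 3, 2, 1: vertex 0's neighbours occupy NodeList[0:3],
# vertex 1's occupy NodeList[3:5], vertex 2's occupy NodeList[5:6].
print(build_start_addresses([3, 2, 1]))  # → [3, 5, 6]
```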

To calculate the triangles, we fetch, for every pair of neighbouring vertices, their neighbour lists and check whether they contain one or more common vertices. These values are then summed up to get the total number of triangles. This process can be seen in Listing 4.3: after fetching the neighbours of the nodes, the while loop searches for common vertices and counts them as triangles.

__global__ void countTriangleKernel(int *countArray,
                                    edgetuplet *compressedlist,
                                    int *startaddr) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= cuConstCounterParams.numEdges) {
        return;
    }
    if (i == 0) {
        countArray[i] = 0;
        return;
    }
    int j = 0, k = 0;
    uint64_t count = 0;
    int *nodelist = cuConstCounterParams.NodeList;
    int *listlen = cuConstCounterParams.ListLen;
    edgetuplet *edgeList = compressedlist;
    int u = edgeList[i].u;
    int v = edgeList[i].v;

    /* Fetching neighbour vertices from the node list */
    int *list1 = nodelist + startaddr[u - 1] + 1;
    int len1 = listlen[u];
    int *list2 = nodelist + startaddr[v - 1] + 1;
    int len2 = listlen[v];

    /*
     * Traversing both lists to find the common nodes. Each common node
     * will be counted as a triangle.
     */
    while (j < len1 && k < len2) {
        if (list1[j] == list2[k]) {
            count++;
            j++;
            k++;
        } else if (list1[j] < list2[k]) {
            j++;
        } else {
            k++;
        }
    }
    countArray[i] = count;
}

Listing 4.3: The kernel that counts the common neighbours of every edge.

4 https://vadtani.github.io/15618-finalproject/
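The while loop in the kernel is a classic two-pointer intersection over two sorted neighbour lists; a plain-Python equivalent (a sketch, not Jain's actual code) is:

```python
def count_common(list1, list2):
    """Two-pointer intersection count over two sorted neighbour lists,
    mirroring the while loop in countTriangleKernel."""
    j = k = count = 0
    while j < len(list1) and k < len(list2):
        if list1[j] == list2[k]:
            count += 1
            j += 1
            k += 1
        elif list1[j] < list2[k]:
            j += 1
        else:
            k += 1
    return count

# for an edge (u, v), every neighbour shared by u and v closes a triangle
print(count_common([1, 3, 5, 7], [2, 3, 4, 7]))  # 3 and 7 are common -> 2
```

The linear merge works only because both lists are sorted; that is why the preprocessing step keeps the neighbour lists ordered.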


CHAPTER 5

Data Collection

This chapter discusses the process of data collection, including details on the selected graphs, performance metrics, the collection itself, and a summary of the results.

5.1 Input graphs

Now that we have two different implementations of the triangle counting task, we can start to collect data on them.

Synthetic

Kronecker (Graph500):
kronecker simple 16, kronecker simple 17, kronecker simple 18,
kronecker simple 19, kronecker simple 20, kronecker simple 21

Random Geometric Graphs:
rgg n 2 15, rgg n 2 16, rgg n 2 17, rgg n 2 18, rgg n 2 19,
rgg n 2 20, rgg n 2 21, rgg n 2 22, rgg n 2 23, rgg n 2 24

Real world

Citation Network:
citationCiteseer, coAuthorsCiteseer, coAuthorsDBLP,
coPapersCiteseer, coPapersDBLP

Table 5.1: The graphs used to collect the data.

The graphs we used were from DIMACS [10]1, the Centre for Discrete Mathematics & Theoretical Computer Science. It contains a diverse collection of graphs that are used in challenges which address questions of determining realistic algorithm performance. From this database the graphs in Table 5.1 were chosen.

The Kronecker graphs are synthetic graphs generated using the Kronecker generator of the Graph500 benchmark [27]. The random geometric graphs have been generated from random points in the unit square, with edges connecting vertices whose Euclidean distance is below 0.55 × √(ln(n)/n). Lastly, the citation graphs are real-world graphs representing coauthors and their citations. All these graphs are undirected and in adjacency list format, see Chapter 3.

5.2 GPU execution time

First we need to ask ourselves what we can interpret as performance. Most performance tools take into account the execution time, memory usage and cache misses of a program. For this thesis we are only interested in the execution time, in milliseconds, that the implementations take.

We have different options to record the execution time. We could put a timer at the start of the program and, when the program is finished, find out how long it took as a whole. This, while being a good method to find out the speed of individual programs, is not suited for our situation. Because both implementations were created by different people, they contain overhead due to preprocessing and other tasks not affiliated with the computations.

So instead we will focus only on the execution time of the kernels on the GPU. Studying both programs made clear that the algorithmic implementation of the triangle counting does all the work on the GPU, so the time it takes for all the kernels to finish is a good indicator of the performance of the program.

Fortunately, CUDA comes with methods to keep track of GPU execution times.

5.2.1 CUDA events

There are two ways to keep track of the kernel executions in CUDA. The first one is by using CPU timers at the start and the end of a kernel while using the

cudaDeviceSynchronize()

command to wait until the kernel is finished. Even though this method works, it has the disadvantage that it stalls the GPU pipeline. Because of this drawback the developers of CUDA have made a relatively light-weight alternative to CPU timers via the CUDA event API. The CUDA event API can create two events which can be sent before and after a kernel call to compute the elapsed time in milliseconds between them. It does this by using CUDA streams. CUDA streams can be compared to queues for the GPU: every stream receives operations that need to be executed sequentially. Figure 5.1 shows this concept: first the start event goes to the GPU and gets a timestamp. The kernel follows it and gets computed. When it is finished, the stop event gets its timestamp and the elapsed time between the events is calculated.

Figure 5.1: A CUDA stream with two kernels and CUDA events to time their execution.

An event is created using the following snippet:

cudaEvent_t TCountStart, TCountStop;

cudaEventCreate(&TCountStart);
cudaEventCreate(&TCountStop);

And is sent to the GPU by calling the record function:

cudaEventRecord(TCountStart, 0);  // 0 is the stream number
CalculateTriangles<<<NUM_BLOCKS, NUM_THREADS>>>(m, dev_edges,
                                                dev_nodes, dev_results);
cudaEventRecord(TCountStop, 0);

After having done this we can easily calculate the elapsed time:

float TCountElapsedTime;

cudaEventElapsedTime(&TCountElapsedTime, TCountStart, TCountStop);
cout << "Triangle count took: " << TCountElapsedTime
     << " milliseconds" << endl;
cudaEventDestroy(TCountStart);
cudaEventDestroy(TCountStop);

Using CUDA events for every kernel call and taking the sum of all the elapsed times gives us the GPU execution time. For every algorithm and graph, this is done 5 times and the average time is taken. Now that we have the performance of the implementations, we need to collect data on the graphs.


5.3 Size

The first and most obvious information we need is the size of the graphs. This is determined by their vertices and edges, whose counts are located on the first line of every graph file. They can therefore be obtained almost effortlessly.
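Assuming a METIS-style header whose first non-comment line holds the vertex and edge counts (the function below is our own illustration, not the thesis code), reading the size could look like:

```python
def read_graph_size(lines):
    """Return (numVertices, numEdges) from an adjacency-list graph file,
    assuming its first non-comment line is '<numVertices> <numEdges> [fmt]'
    and comment lines start with '%' (METIS convention)."""
    for line in lines:
        if line.strip() and not line.lstrip().startswith('%'):
            num_vertices, num_edges = line.split()[:2]
            return int(num_vertices), int(num_edges)
    raise ValueError("no header line found")

# works the same on an open file object or a list of lines
print(read_graph_size(["% a comment", "268495 1156647"]))
```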

5.4 Structure

Knowing only the size of a graph is not enough to get a good understanding of the graph properties. As shown in Chapter 3, the degree distribution is needed to get a better picture of how it is structured.

However, degree distributions can become quite large, and using all that data in a model could be demanding. So instead of using all the degrees of a graph, we can use its five-number summary.

5.4.1 Five-number summary

A five-number summary is a descriptive statistic that provides information about a set of observations [43]. It gives the five most important sample percentiles: the median, which is the value that separates the higher half of the sample from the lower half; the upper quartile, which separates the highest 25% of the values from the lowest 75%; the lower quartile, which is its inverse and separates the lowest 25% from the highest 75%; and lastly the maximum and minimum values [40].

These five numbers give a good indication of how the set is structured. The median says something about the location of the data, the quartiles about its spread, and the minimum and maximum values indicate the range. Finally, the standard deviation of the degrees is saved to quantify the amount of dispersion the values show.
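As a sketch, this summary can be computed directly from a degree list with NumPy (the key names mirror the dataset columns; the thesis does not specify the exact percentile method, so this is an approximation):

```python
import numpy as np

def degree_summary(degrees):
    """Five-number summary plus standard deviation of a degree list,
    matching the features stored in the dataset."""
    d = np.asarray(degrees)
    return {
        "LowestValue":  int(d.min()),
        "LowerQ":       float(np.percentile(d, 25)),
        "Median":       float(np.median(d)),
        "UpperQ":       float(np.percentile(d, 75)),
        "HighestValue": int(d.max()),
        "StandardDev":  float(d.std()),
    }

print(degree_summary([1, 2, 2, 3, 5, 8, 13]))
```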

Now we have all the data we need to create a model. Table 5.2 shows the features of the citationCiteseer graph. The fastest algorithm is already computed and therefore also put in the dataset. In this case it is Jain & Adtani’s.

NumVertices    268495
NumEdges       1156647
Time           14.4882
Algo           Jain
Median         5.0
LowestValue    1
HighestValue   1318
LowerQ         2.0
UpperQ         10.0
StandardDev    16.282475314177852

Table 5.2: The features of the citationCiteseer graph.


CHAPTER 6

Model creation

This chapter will explain the concept of machine learning and why it is a powerful tool. It will then go into detail about the three machine learning methods used: Linear Regression, Decision Tree and Random Forest. Finally, it will show how they were implemented using the Python library scikit-learn.

6.1 Machine learning

To find correlations between the data we have collected we can use the power of machine learning to help us further.

Machine learning is an approach to problem solving that uses statistical techniques to give computers the ability to progressively improve their performance on a specific task (to learn) using collected data. This is done without being explicitly programmed [25]. Its basis lies in pattern recognition. Using machine learning, predictions are made for new situations based on previous events. The features of these previous experiences are used as a dataset on which a machine learning technique is trained. The idea is that the dataset contains a representative sample of the real-world situation and that machine learning is able to create a model that maps the relationships in the data to predict new cases. If the dataset is not diverse or complex enough compared to the new cases, the model is said to have underfit the data. On the contrary, if the model is trained too precisely on the training data, it will have a low accuracy for future predictions; this is known as overfitting.

In recent years machine learning has gotten a lot of attention [21], from self-driving cars [32] to voice recognition systems like Siri [4]. Its popularity has increased due to the availability of massive amounts of data and cheaper, more powerful hardware, which have made it more feasible [13]. Machine learning tasks are classified into two types: supervised learning and unsupervised learning. With supervised learning, the machine is given input features together with their output. The model that is being created then needs to find a pattern that maps the input to the output. With unsupervised learning, the machine only receives input features and has to find a way to differentiate between them and discover a structure.

Besides different machine learning tasks there are also different methods. These methods are divided by their desired output.

Regression is a supervised method that tries to estimate the relationships between input features. Its output is a formula that predicts a numerical value from those different independent features. This method is used a lot in finance [9]. Another supervised method is classification, where the input is divided into two or more classes. The model then tries to find a pattern in what differentiates these classes, so it can predict for new input in which category it belongs. An example would be classifying different flowers into their respective species [42].


Clustering is an unsupervised method which looks a lot like classification. The difference lies in the fact that with classification it is known beforehand to which class the input belongs, while with clustering this information is not present. Instead it tries to divide the input into clusters that are similar. This method is mostly used to better understand and analyze data.

These are the most well-known methods, though there are more. In this thesis we will use three different supervised models: Linear Regression, Decision Tree and Random Forest.

6.1.1 Linear Regression

Linear regression, like its name suggests, makes use of regression to try to find a relationship between one or multiple independent variables and one dependent (target) variable. If there are multiple independent variables it is also known as multiple linear regression, but for convenience we will refer to multiple linear regression as simply linear regression in this thesis [12]. With this relationship a formula is created which gives every independent variable (feature) a weight. The model is then able to predict, using new features, what the target variable would be, and it can find out which of the independent variables has the most significant influence over it.

An example of this can be seen in Figure 6.1: a scatter plot is shown with data points and the predicted line between them. The line is fit by looking at the error the line makes for every point it predicts wrong. If, for example, the data point for a certain variable is 50 but the line predicts 55, then the error for this point is 5. When the line is being fit, this comparison happens for every data point, and the position of the line which gives the least total error is chosen as the prediction. For a simple linear regression this line is represented as:

y = b × x + c

Here y is the dependent variable, b the predicted weight/slope, x the independent variable and c the constant coefficient (the intercept).

For multiple linear regression this formula becomes:

y = b1 × x1 + b2 × x2 + ... + bn × xn + c

where n is the number of features in the model. So for every feature (x1 to xn) a weight is computed (b1 to bn), which together with the coefficient can be used to predict new cases.
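As an illustration of how the weights b1..bn and the coefficient c are obtained, the following sketch fits synthetic data with known weights using a plain least-squares solve (not the scikit-learn code used later in this thesis):

```python
import numpy as np

# Minimal multiple-linear-regression fit via least squares: X holds one
# column per feature; appending a column of ones lets lstsq recover the
# intercept c alongside the weights b1..bn.
rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0        # known weights and intercept

A = np.hstack([X, np.ones((50, 1))])            # add the intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)   # recovers approximately [3.0, -2.0, 1.0]
```

Because the synthetic target is exactly linear in the features, the fitted weights match the ones used to generate it.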

Figure 6.1: The relationship between the data is being fit. The position of the line is based on the total error of the data points. Image from [18].

The advantage of using linear regression is that it is a simple model which gives good results when the variables are linearly related. However, when the data does not have a linear relationship, it performs very poorly. Because it isn't always known beforehand what type of relationship the features have, it should be used carefully on unknown datasets. Furthermore, it is limited in the sense that it can only produce numerical output.

6.1.2 Decision Tree

(Binary) Decision Trees are a type of model where a tree is constructed which predicts future cases based on its branching structure. It has different layers, represented as decision nodes, where a decision is made which results in two or more branches that the model can follow. After going through all the nodes, the tree eventually reaches a leaf node, which contains the target value.

This target value is usually of one of two types: discrete or continuous. Tree models that produce discrete values are called classification trees; the branches in those trees represent conjunctions of features that lead to the leaves, which contain the class labels. Tree models that produce continuous output values are called regression trees; here the branches compare a feature to see if it is above or below a certain value, and the leaf nodes output a continuous value.

The decisions in the decision nodes are made by making sure the entropy decreases as much as possible after every level in the tree. Entropy is the measure of disorder in a dataset, i.e. how random our set is. Splitting the data at every level should result in a lower entropy; this reduction is also known as information gain. The branches of the tree thus ensure that similar data is grouped together at a lower level.
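The entropy and information-gain computation described above can be sketched as follows (a minimal illustration; note that scikit-learn's DecisionTreeClassifier actually defaults to Gini impurity rather than entropy):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Drop in size-weighted entropy caused by splitting `parent`
    into the branches `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# a perfect split separates the two classes completely:
# the gain equals the full parent entropy of 1 bit
parent = ["Jain", "Jain", "Polak", "Polak"]
print(information_gain(parent, ["Jain", "Jain"], ["Polak", "Polak"]))  # 1.0
```

At each decision node the tree greedily picks the split with the highest gain.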

Example

A practical example of this concept is shown in Figures 6.2 and 6.3.

Figure 6.2 shows our dataset and its features. The decision tree must now make a decision at the root node that decreases the entropy the most, based on the features. In our case it decided that the condition 'color == yellow' results in the most information gain, so it becomes our first decision node. This process is repeated until the lowest entropy is reached.


Figure 6.3: An example of the decision tree that could be constructed using the dataset of figure 6.2. Image from [23].

The advantages of using a Decision Tree are that it is simple to understand, and the way decisions are made in the tree is similar to how humans tend to make them. There is also little effort needed to preprocess the data. Lastly, it can handle both numerical and categorical data, and it works with non-linear parameters.

A disadvantage of the Decision Tree is that it is prone to overfitting when the data contains outliers. It can also become unstable under even small variations in the data. So the main problem with this method is that it can work too well and create a model that fits the input data perfectly, with the result that the model does not generalize to new cases.

6.1.3 Random Forest

To solve the problem of overfitting by Decision Trees, a new modeling design was created: Random Forest. A Random Forest constructs multiple Decision Trees during the training phase. Every tree is trained on a different part of the original training set, and when new data is presented all of them output what they predict is the right value. Out of those values the Random Forest picks the most frequently occurring class (for classification) or the mean of all the values (for regression) [14].

Besides being more robust against overfitting, Random Forest is also highly accurate and can deal better with missing data, while preserving all the perks of Decision Trees.

The disadvantage of Random Forest is that it makes relatively slow predictions compared to the other models. Of course, this is the case because it has to create and run multiple Decision Trees and compare their results.
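The two ingredients of a Random Forest, bootstrap sampling and majority voting, can be sketched in a few lines (an illustration of the idea, not scikit-learn's implementation; all names are ours):

```python
import random
from collections import Counter

def bootstrap_sample(dataset, rng):
    """Draw a training set of the same size by sampling rows with
    replacement, so every tree sees a slightly different view of the data."""
    return [rng.choice(dataset) for _ in dataset]

def majority_vote(predictions):
    """Random Forest classification step: each tree votes, and the most
    frequent class wins."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
votes = ["Jain", "Polak", "Jain", "Jain", "Polak"]
print(majority_vote(votes))  # Jain
```

For regression the vote is replaced by the mean of the trees' outputs.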

6.2 Scikit

The models of the last section have been implemented using scikit-learn [31]. Scikit-learn is an open-source Python library which contains many programming tools to create machine learning models.

6.2.1 LinearRegression module

For Linear Regression, both the datasets of Polak and Jain are imported as csv files, together with the csv file containing the graph properties; see Table 6.1 for the features used. The data of the algorithms and graphs is then split into 70% training data and 30% test data.

Using the training data, two models are created using the LinearRegression module:

linreg1 = LinearRegression()  # Polak's model
linreg1.fit(X1_train, y1_train)

linreg2 = LinearRegression()  # Jain's model
linreg2.fit(X2_train, y2_train)

The reason we use two models and not one is that regression can only output a numerical value (the execution time, in our case), while we want to classify our problem; see Table 6.2. So we let the two models predict what the execution time would be for both algorithms given the input graphs, and pick from those predictions the lowest value.

After having done this, we test our predicted algorithms by putting our results for the test graphs in the compareList and comparing it with the list of algorithms obtained from the real values in the dataset.

polakTime = linreg1.predict(X1_test)
jainTime = linreg2.predict(X2_test)

# Here the predicted times of the two algorithms are being compared and
# the one with the lowest predicted execution time is the optimal one.
for c in range(0, len(polakTime)):
    if polakTime[c] < jainTime[c]:
        compareList.append(polak)
    else:
        compareList.append(jain)

# Here the fastest algo is taken from the real times they took.
for d in range(0, len(y1_test)):
    if y1_test.values[d] < y2_test.values[d]:
        testAlgo.append(polak)
    else:
        testAlgo.append(jain)

Listing 6.1: The prediction of the graphs and the list with real target values are being produced.

Features:

NumVertices: The number of vertices in the graph
NumEdges: The number of edges in the graph
Median: The median of all degrees
LowestValue: The lowest degree
HighestValue: The highest degree
LowerQ: The lower quartile of the degrees
UpperQ: The upper quartile of the degrees
StandardDev: The standard deviation of all degrees

Table 6.1: The features used to train the model.

Target value:

Algo: The algorithm implementation with the best performance

Table 6.2: The target value the models predict.


6.2.2 Decision Tree & Random Forest classifiers

The preprocessing phase of the Decision Tree and Random Forest classifiers is a bit different. Here we compare the execution times before we split the dataset, so only one model is trained, on the features and the optimal algorithm. This dataset is read in as a csv file; the features (Table 6.1) are put in the X variable and the target value (Table 6.2) in y. The data is then also split into a 70% training and 30% test set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

The difference between the two methods lies in the scikit module used.

dtree = DecisionTreeClassifier()  # For Decision Tree
dtree.fit(X_train, y_train)

clf = RandomForestClassifier()  # For Random Forest
clf.fit(X_train, y_train)

After this we can predict, given a certain set of features, which algorithmic implementation will perform better, with the lines:

dtree.predict(X_test)
clf.predict(X_test)

The code containing the model creation as well as the data collection can be found on: https://github.com/Younes617/Bachelor-thesis2018


CHAPTER 7

Experiments and Validation

This chapter explains how the experiments were conducted. We discuss the results of the predictions and attempt to explain why Polak's algorithm works better on some graphs and Jain's algorithm on others. Lastly, the models we created are validated and the best one is chosen.

7.1 Experimental setup

All the code was run on the DAS-5, see Chapter 3, on one NVIDIA K40 GPU using CUDA version 9.0. CUDA and prun, a module to run jobs on DAS-5, are enabled at the beginning of the experiments.

module load prun
module load cuda90

7.1.1 Data collection

After uploading the graphs and code to the cluster, a script was run for every algorithm to collect the performance data (i.e., the execution time) on each one of the graphs, using the following commands:

For Polak:

prun -np 1 -native '-C K40 --gres=gpu:1' ./main.e data/'graphname'

For Jain & Adtani:

prun -np 1 -native '-C K40 --gres=gpu:1' ./triangleCounter data/'graphname'

They were run 5 times for each graph and the average of those times was ultimately used. A special script was also run to get the degree distributions of the graphs.

7.1.2 Desired result

The desired result of the experiment is that both implementations are optimal for different types of graphs: for example, Polak's algorithm fast for small graphs but slow for larger ones, with Jain's implementation being the opposite. In that case it would be easy for the model to differentiate between them, resulting in a better accuracy for the models.

7.2 Results

Now that we have our performance data, we can plot it to better visualize whether the decision is as simple as we would like. Figure 7.1 shows the execution times of the two algorithms next to each other. Because the time to compute the Kronecker 21 graph is so much larger than all the others, we provide zoomed-in plots in Figures 7.2, 7.3, 7.4 and 7.5.

Figure 7.1: The difference in execution time of the algorithms.


Figure 7.3: The zoomed in execution time of Kronecker 16 to 19.


Figure 7.5: The zoomed in execution time of rgg 20 to rgg 24.

After studying the plots we can identify which algorithm performs better on which graphs; see Table 7.1.

Graph name                 Faster algorithm
citationCiteseer           Jain
coAuthorsCiteseer          Jain
coAuthorsDBLP              Jain
coPapersCiteseer           Polak
coPapersDBLP               Polak
kron g500-simple-logn16    Polak
kron g500-simple-logn17    Polak
kron g500-simple-logn18    Polak
kron g500-simple-logn19    Polak
kron g500-simple-logn20    Polak
kron g500-simple-logn21    Polak
rgg n 2 15 s0              Jain
rgg n 2 16 s0              Jain
rgg n 2 17 s0              Jain
rgg n 2 18 s0              Jain
rgg n 2 19 s0              Jain
rgg n 2 20 s0              Jain
rgg n 2 21 s0              Jain
rgg n 2 22 s0              Jain
rgg n 2 23 s0              Jain
rgg n 2 24 s0              Jain

Table 7.1: Optimal performing algorithm per graph.

To find out why the algorithms perform better on certain graphs, we have to investigate any correlation between graph properties and execution time; see Table 7.2.


NumVertices  NumEdges   Time     Algo   Median  LowestValue  HighestValue  LowerQ  UpperQ  StandardDev
268495       1156647    14.4882  Jain   5.0     1            1318          2.0     10.0    16.282475314177852
227320       814134     6.47712  Jain   4.0     1            1372          2.0     8.0     10.633715031194045
299067       977676     7.25242  Jain   4.0     1            336           2.0     7.0     9.815236642715357
434102       16036720   382.154  Polak  39.0    1            1188          15.0    92.0    101.27835054005104
540486       15245729   295.429  Polak  34.0    1            3299          13.0    74.0    66.236146112445326
65535        2456071    121.758  Polak  8.0     0            17997         1.0     31.0    312.45999947386696
131068       5113985    319.4    Polak  7.0     0            29935         1.0     37.0    378.07272554713268
262144       10582686   839.326  Polak  6.0     0            49162         1.0     25.0    453.70518302180687
524287       21780787   2227.26  Polak  5.0     0            80674         1.0     28.0    540.89289184097356
1048576      44619402   6182.98  Polak  4.0     0            131503        1.0     28.0    640.96572832072752
2097152      91040932   22232.1  Polak  4.0     0            213904        0.0     21.0    755.62879809067454
32768        160240     2.24902  Jain   10.0    0            24            8.0     12.0    3.1550893632472099
65536        342127     2.95126  Jain   10.0    0            27            8.0     13.0    3.2478090336519236
131072       728753     4.31974  Jain   11.0    0            28            9.0     13.0    3.3442200804535505
262144       1547283    7.35805  Jain   12.0    0            31            9.0     14.0    3.4485489368497926
524288       3269766    13.6054  Jain   12.0    0            30            10.0    15.0    3.5434506780709185
1048576      6891620    28.7178  Jain   13.0    0            36            11.0    15.0    3.6282560457134294
2097152      14487995   59.9472  Jain   14.0    0            37            11.0    16.0    3.7326398803644776
4194304      30359198   127.057  Jain   14.0    0            36            12.0    17.0    3.8123324216016479
8388608      63501393   187.863  Jain   15.0    0            40            12.0    18.0    3.8955145020947728
16777216     132557200  580.559  Jain   16.0    0            40            13.0    18.0    3.9800650357918763

Table 7.2: The training dataset.

From the data, we see that the sizes of the graphs do not make the difference (as we hoped for in Section 7.1.2), as both implementations can be optimal for some small and some large graphs. Thus, we look further at the node degree data. We immediately notice that the graphs where Polak performs better have a higher maximum degree, upper quartile and standard deviation. Based on this set of results, intuitively, we expect Polak's algorithm to work better for graphs with a higher average degree. We validate this intuition by inspecting Figure 7.6, which shows the average degree of the graphs: indeed, the graphs where Polak is optimal do have a higher average degree than the graphs where Jain performs better.
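The average degree follows directly from the two size features already in the dataset; a small sketch, assuming NumEdges counts each undirected edge once:

```python
# In an undirected graph every edge contributes to the degree of exactly
# two vertices, so the average degree is 2E/V (assuming NumEdges counts
# each undirected edge once).
def average_degree(num_vertices, num_edges):
    return 2 * num_edges / num_vertices

# citationCiteseer, using the size features from Table 7.2
print(round(average_degree(268495, 1156647), 2))  # 8.62
```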

Figure 7.6: The average degree of the graphs.

We note that this finding is still close to our desired result: we found a discriminative factor between the two implementations, only it is not the size of the graph but its average degree.


7.3 Validation

In Chapter 6, we saw how the models were created and that we trained them using 70% of the dataset. The remaining 30% of the data is used to test the created model and see how it performs. This is also known as cross-validation and is used extensively in machine learning [24]. We validated the models by 10-fold cross-validation, each time shuffling the dataset and splitting it 70-30. The accuracy results of the 10 runs are then averaged to get the scores in Table 7.3.
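This validation procedure, repeated shuffled 70/30 splits with the accuracies averaged, can be sketched like this (the function name and signature are ours, for illustration):

```python
import random

def repeated_holdout_accuracy(dataset, train_and_score, runs=10,
                              test_frac=0.3, seed=0):
    """Average the test accuracy over `runs` shuffled train/test splits.
    `train_and_score` is any callable (train_rows, test_rows) -> accuracy
    in [0, 1]; this interface is illustrative, not the thesis code."""
    rng = random.Random(seed)
    rows = list(dataset)
    cut = int(len(rows) * (1 - test_frac))   # 70% of the rows for training
    scores = []
    for _ in range(runs):
        rng.shuffle(rows)
        scores.append(train_and_score(rows[:cut], rows[cut:]))
    return sum(scores) / runs

# sanity check with a scorer that ignores its input
print(repeated_holdout_accuracy(list(range(21)), lambda train, test: 1.0))  # 1.0
```

With 21 graphs this yields 14 training rows and 7 test rows per run, matching the 70-30 split used in the thesis.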

Before looking at the accuracies, we first need to ask ourselves what a good accuracy for our model is: that accuracy becomes our threshold to see whether the model performs well enough. Looking at our dataset in Table 7.2, we see that the optimal algorithm is not distributed equally over the graphs: of the 21 graphs we are using, Jain's algorithm is optimal for 13, while Polak's is optimal for only 8. So a model that predicted Jain 100% of the time would already have a (21 - 8)/21 × 100 = 61.9% accuracy on our complete dataset. We are therefore looking for methods that have an accuracy higher than this number.
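This majority-class baseline can be checked with a small sketch:

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of a predictor that always outputs the majority class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# 13 of the 21 graphs favour Jain's algorithm
labels = ["Jain"] * 13 + ["Polak"] * 8
print(round(majority_baseline(labels) * 100, 1))  # 61.9
```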

If we now take a look at Table 7.3, we immediately see that the Decision Tree and Random Forest models both have an impressive accuracy of around 95%. This means that both methods are excellent candidates for predicting the performance of triangle counting algorithms on graphs.

Model              Average Accuracy  Viable
Linear Regression  54.28%            No
Decision Tree      94.29%            Yes
Random Forest      95.71%            Yes

Table 7.3: Accuracy of the three models.

A worse candidate is Linear Regression. It has an average accuracy of 54%, which is even lower than the threshold we set. The most straightforward explanation for why Linear Regression does so poorly is that the features (i.e., the considered graph properties) are not linearly correlated with the execution time.

Studying the scatterplots of the features against Polak's execution time in Figures 7.7, 7.8 and 7.9 indeed shows that the features are, overall, not linearly correlated with it. Fitting a line through the points results in large areas of uncertainty (the light blue parts).


Figure 7.7: The data of the vertices and edges compared to the execution time of Polak.


Figure 7.9: The lower quartile, higher quartile and standard deviation compared to the execution time of Polak.


CHAPTER 8

Conclusion and future research

This chapter presents the conclusions drawn from this work, and proposes several recommendations for future research.

8.1 Conclusion

One of the goals of this work was to determine whether it is possible to create an accurate model that predicts which triangle counting algorithm is best suited for a given graph. We can confidently confirm this is possible using a machine-learning model. Building this model is in fact possible even when using only a relatively small set of training graphs and features.

We accomplished this positive result by training a predictive model on the performance data collected by running the algorithms on different graphs. For modeling, we used different techniques: Linear Regression, Decision Tree and Random Forest. Validating these models showed us which technique is best suited to perform this prediction.

Linear Regression fell short because its accuracy on the test set was very low, at 54.28%; this poor accuracy means its concept is too simple to predict the best algorithm properly.

Decision Tree and Random Forest are, on the other hand, much better candidates to solve this problem. Both models reached accuracy scores of around 95%, with Random Forest performing slightly better than Decision Tree. We would nevertheless recommend using Decision Trees to create the model: even though it is around 1% less accurate than Random Forest, it is much simpler, less demanding, and most importantly faster than Random Forest.

In conclusion, we revisit the research question:

Is it feasible to predict, through statistical performance modeling, which parallel algorithm suits a given graph processing task the best?

The method and results presented in this thesis allow us not only to answer this question positively, but also to provide a promising technique for doing so in practice.

Looking back at the positive results of Verstraaten et al. [38] and Wu et al. [41], this work brings further evidence that research on graph processing tasks and GPU programming in general should focus on creating better models that help determine the right algorithm for the right type of task.

We note that a tool to test these models on user graphs is available on GitHub1. Using this tool, a user can select one or more of the models together with a new graph, and predict the right algorithm using the data collected in this research.


8.2 Future work

While our results are positive, there is always room for improvement. Our dataset consisted of 21 different graphs; it should be expanded to a larger, more diverse set. This would enable further analysis of how such diversity influences, positively or negatively, the precision of the models.

We would also recommend testing the models on types of graphs that were not used for training. In this thesis, the models were trained on graphs with three overall themes: Kronecker graphs, citation graphs and geometric graphs. Our test set included these same types, which could have positively influenced the measured precision. If other types of graphs are used for testing, a drop in precision would indicate that models need to be trained on the same type of graphs to retain such high accuracy.

Another interesting direction for future research is collecting features beyond the size and degree statistics of the graph, to examine which properties significantly influence performance. This thesis showed that the average degree was the determining factor for the classification of our graphs, but that does not necessarily mean there are no other important factors.
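As a starting point, such a feature-collection step could look like the sketch below: it computes the degree-based features used in this work, plus one candidate addition (maximum degree), from a toy edge list. The edge list and the exact feature set are illustrative only.

```python
import numpy as np

# Hypothetical edge list for a small undirected graph; real inputs would
# be the benchmark graphs used in the thesis.
n_vertices = 5
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]

# Build the degree of each vertex from the edge list.
degree = np.zeros(n_vertices, dtype=int)
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Features used in this work plus a candidate extra for future exploration.
features = {
    "vertices": n_vertices,
    "edges": len(edges),
    "avg_degree": degree.mean(),
    "std_degree": degree.std(),
    "q1_degree": np.percentile(degree, 25),
    "q3_degree": np.percentile(degree, 75),
    "max_degree": degree.max(),  # candidate additional feature
}
print(features)
```

Each graph then becomes one feature vector, and any newly added property simply becomes an extra column in the training data.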

A further challenge for our model would be the addition of more triangle counting implementations. The model would be even more interesting if it could predict the best implementation among more than two.

Finally, we recommend investigating whether these techniques for training a model also work for other types of graph processing tasks. Verstraaten et al. [38] succeeded in predicting the best BFS implementation at runtime, while this thesis showed it is possible to predict the optimal triangle counting implementation. Algorithms such as single-source or all-pairs shortest path, as well as the calculation of betweenness metrics, are candidates for further testing this approach.


Bibliography

[1] Junwhan Ahn et al. “A scalable processing-in-memory accelerator for parallel graph processing”. In: ACM SIGARCH Computer Architecture News 43.3 (2016), pp. 105–117.

[2] Mohammad Al Hasan and Vachik S Dave. “Triangle counting in large networks: a review”. In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8.2 (2018), e1226.

[3] George S Almasi and Allan Gottlieb. “Highly parallel computing”. In: (1988).

[4] Jacob Aron. How innovative is Apple’s new voice assistant, Siri? 2011.

[5] Luca Becchetti et al. “Efficient semi-streaming algorithms for local triangle counting in massive graphs”. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2008, pp. 16–24.

[6] Paolo Boldi and Sebastiano Vigna. “The webgraph framework I: compression techniques”. In: Proceedings of the 13th international conference on World Wide Web. ACM. 2004, pp. 595–602.

[7] Jens Breitbart. “Case studies on GPU usage and data structure design”. In: Master’s thesis, Universität Kassel (2008).

[8] Avery Ching et al. “One trillion edges: Graph processing at facebook-scale”. In: Proceedings of the VLDB Endowment 8.12 (2015), pp. 1804–1815.

[9] Douglas O Cook, Robert Kieschnick, and Bruce D McCullough. “Regression analysis of proportions in finance with self selection”. In: Journal of Empirical Finance 15.5 (2008), pp. 860–867.

[10] David A. Bader, Andrea Kappes, Henning Meyerhenke, Peter Sanders, Christian Schulz and Dorothea Wagner. Benchmarking for Graph Clustering and Partitioning. In: Encyclopedia of Social Network Analysis and Mining, pp. 73–82. https://www.cc.gatech.edu/dimacs10/downloads.shtml. Online; accessed 31 May 2018. 2014.

[11] Final Project Report of Jain & Adtani. https://vadtani.github.io/15618-finalproject/15-618-final_project_report.pdf. Accessed: 2018-06-08.

[12] David A Freedman. Statistical models: theory and practice. Cambridge University Press, 2009.

[13] Cornelia Hammer, Ms Diane C Kostroch, and Mr Gabriel Quiros. Big Data: Potential, Challenges and Statistical Implications. International Monetary Fund, 2017.

[14] Tin Kam Ho. “Random decision forests”. In: Document Analysis and Recognition, 1995. Proceedings of the Third International Conference on. Vol. 1. IEEE. 1995, pp. 278–282.

[15] Jared Hoberock and Nathan Bell. Thrust: A parallel template library. 2010.

[16] Image of the comparison between sequential and parallel processing. http://www.techdarting.com/2013/07/what-is-parallel-programming-why-do-you.html. Accessed: 2018-06-08.

[17] Image showing the difference in cores between CPUs and GPUs. https://www.quora.
