
Shrinking and Expanding Graph Datasets

Author:

Ahmed Musaafir

Supervisor:

Dr. Ana Lucia Varbanescu

Friday 11th August, 2017

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering


Acknowledgements

No words can fully express my sincere gratitude to Dr. Ana Lucia Varbanescu, who has guided and helped me tremendously throughout this research, and provided me with many amazing opportunities and experiences, including presenting at the International Conference on Computing Systems. I thank her for all the great guidance and effort that she has put into helping me write this thesis.

Furthermore, I would like to thank my parents for their continued love and support throughout all the years.


Contents

Abstract

1 Introduction
  1.1 Goals and solution
  1.2 Research question
  1.3 Contributions
  1.4 Outline

2 Background
  2.1 Graphs
    2.1.1 Properties
  2.2 Graph representation
  2.3 CUDA

3 Shrinking and expanding graphs
  3.1 Shrinking a graph
    3.1.1 Random Node Sampling
    3.1.2 Node Sampling with Neighbourhood
    3.1.3 Random Edge Sampling
    3.1.4 Total Induced Edge Sampling
    3.1.5 Random Walk
    3.1.6 Forest Fire
    3.1.7 Impact on property preservation
  3.2 Expanding a graph
    3.2.1 Synthetic graph generators
    3.2.2 Expanding graph datasets using sampling
    3.2.3 Variations
    3.2.4 Impact on property preservation
  3.3 Summary

4 Design and implementation
  4.1 Requirements
  4.2 Selected sampling algorithm
  4.3 Enabling GPU processing
    4.3.1 Obtaining a CSR representation
    4.3.2 Sampling
    4.3.3 Expanding
  4.4 Web-based integration
  4.5 Summary

5 Results and discussion
  5.1 Datasets
  5.2 Sampling results
  5.3 Expanding results
    5.3.1 Impact of the topology, bridge selection and multi-edge interconnections
    5.3.2 Impact of scaling factor and sample size
  5.4 Performance measurement
  5.5 Summary

6 Conclusion
  6.1 Threats to validity
  6.2 Future work

Appendices
  A Sampling results
  B Expanding results


Abstract

Graph processing has become a topic of interest in many different domains. Although new algorithms emerge very often, in-depth performance analysis and thorough benchmarking are challenging. One of the reasons for this is the lack of representative datasets. This study defines, models, and implements shrinking and expanding operations with the aim of tackling this lack of representative datasets.

We consider shrinking as a form of sampling where a smaller, representative graph is obtained from an original graph. Based on an analysis of six different sampling algorithms, we select Total Induced Edge Sampling as our shrinking algorithm.

Furthermore, we consider expanding as a way to obtain a larger, representative graph from an original graph, while providing a certain degree of control over the resulting graph features. Since most existing work does not satisfy our expansion requirement, we propose an idea of graph expansion based on combining graph sampling and preset interconnection topologies. We further provide graph-feature control by configuring the expansion parameters according to a mix of analytical and empirical models of the expanded graphs.

Both the shrinking and expanding algorithms are implemented in a high performance manner by enabling GPU processing. We further integrate them in a tool where a user can provide relevant input parameters and obtain a shrunk or expanded graph with a list of its properties.

We evaluate the shrinking and expanding algorithms by using different types of datasets. Our results for shrinking show that TIES obtains representative samples of different sizes where many local and topological properties are close to the original graph. As for expanding, our results show that the expansions closely match our guidelines and satisfy the expanding requirement. Furthermore, our results validate and often enhance the analytical guidelines.


Chapter 1

Introduction

Graphs are increasingly popular due to their inherent simplicity in describing entities and interconnections. As such, graph processing has become a topic of interest in many domains, ranging from biology to information retrieval, and from social networks to infrastructure networks in real or virtual worlds. Due to the rapid increase in data sizes and analysis complexity, graph processing research is investing a lot of effort in devising and optimizing algorithms for processing such data.

Although new algorithms and algorithm variants are proposed on a weekly basis, in-depth performance analysis and thorough benchmarking are missing. Two main reasons cause this superficial evaluation: the lack of benchmarking methodologies for graph processing, and the lack of (sufficient) representative datasets. While various attempts are being made to define benchmarking methodologies for graph processing [10, 11, 7, 14], the issue of representative datasets is only marginally covered. Most researchers use existing graph archives (data collections) [21, 19] or existing synthetic graph generators [9, 23] to obtain some datasets. Existing archives provide too few graphs, and even fewer classes of graphs. In addition, these archives easily become outdated and do not provide any updates (e.g., no newer versions of the graphs, no new graphs, no new types of graphs). Synthetic graph generators do provide new graphs on demand, but they are designed for specific types of networks and tuned for specific graph properties, lacking generality. Neither option provides enough generality and samples to fulfill the requirements of thorough benchmarking for graph processing.

1.1 Goals and solution

To tackle the lack of representative datasets, we propose a tool to allow users to replicate an existing real-world graph at different scales. In other words, users should be able to shrink a representative graph that may be too large for their use, or expand an existing small representative graph to a larger scale.


These operations should allow a user to perform better in-depth evaluation or performance analysis on certain graph datasets for their work. Therefore, an important goal for both operations is to preserve as many properties of the original graph as possible.

Little research exists on effectively designing algorithms for these operations. Most research focuses on sampling that reduces a graph or obtains a back-in-time (prior state) version of an original graph [17, 24, 20]. In addition, most research focuses on graph generation that obtains a new graph from scratch and cannot take templates from existing graphs [26, 23, 9].

In contrast, our research starts from a given graph and attempts to shrink or expand it. In our research, shrinking (section 3.1) overlaps with sampling in that it reduces the original, but enforces more restrictions on the properties. Moreover, in our research, expanding (section 3.2) deviates from graph generation in that it takes a template and aims at more control over the properties, thus allowing us to obtain better representative datasets.

Needless to say, both operations need to be performed in a high performance manner, in order to reduce waiting time. In essence, our final deliverables have to be a shrinking algorithm and an expanding algorithm, both integrated in a tool that a user can interact with.

1.2 Research question

Based on the current challenges and goals of shrinking and expanding graphs, we focus on designing efficient algorithms that can be integrated in a tool. Hence, the following research question has been formulated:

RQ: Can graph shrinking and expanding be defined, modelled, and implemented in a high performance manner?

To answer this research question, the following sub-questions have been formulated:

• SQ1: How can a graph be shrunk?

• SQ2: How can a graph be expanded?

• SQ3: How can a tool for shrinking and expanding be implemented in a high performance manner?

Approach

The core components of our research consist of a literature study, algorithmic design, modelling, and experimental analysis.


1.3 Contributions

Our contributions are threefold:

1. Analysis of sampling algorithms: we introduce the term shrinking as a form of sampling where a smaller, representative graph Gs can be obtained from an original graph G using a desired sample fraction. We provide a concise description of six existing sampling algorithms that fit these requirements and describe their general approach, complexity, and quality in terms of property preservation. Based on its properties, we select Total Induced Edge Sampling (a sampling algorithm) as our shrinking algorithm and empirically validate our choice.

2. Expanding graph datasets using sampling: we introduce our idea of the expanding algorithm, which obtains a larger, expanded graph Ge from an original graph G, based on a conceptual approach of combining graph sampling and preset interconnection topologies. In addition, we provide a tool that allows the user to experiment with the expanding algorithm and expand a graph while controlling most of its features.

3. Modelling of the expanded graphs: we provide guidelines based on analytical and empirical models that reason about and predict the properties of an expanded graph before conducting an expansion. Our experiments validate these models, as the expanded graphs meet these guidelines and the expansion requirements.

1.4 Outline

Chapter 2 provides a brief description of the concepts and terminology that are used throughout this study. Chapter 3 describes existing shrinking and expanding algorithms. The first part of this chapter presents a review of related work; furthermore, our own expanding algorithm is presented. Based on this analysis, the shrinking and expanding algorithms are selected in chapter 4; we further show how both operations are designed and implemented in a high performance manner. Chapter 5 presents the results of applying the shrinking and expanding algorithms on several datasets. We also measure the performance of the implementations and present our results as an indication of their high performance. Chapter 6 concludes this study and presents ideas for future work.


Chapter 2

Background

This chapter briefly introduces concepts and definitions regarding graphs and CUDA that are used throughout this research. For comprehensive material on this matter, we refer to [32] and [5].

2.1 Graphs

A graph G can be formally defined as G = ⟨V, E⟩, where V is the set of vertices (nodes) {u1, u2, ..., un} and E the set of edges {(ua, ub), ...}. In this study, the terms nodes and vertices are used interchangeably.

In a directed graph, the vertices in every edge (ua, ub) are ordered. In contrast, an undirected graph is a graph where the vertices in every edge are unordered. Two vertices are adjacent (or neighbours) if there is an edge connecting them. The neighbourhood of a vertex v, denoted as N(v), is the set of all vertices that are adjacent to v, including v. The set of edges incident to a vertex v is denoted by δ(v) = {(u, v) ∈ E | u ∈ N(v)}. The degree of a vertex v is the number of edges incident to v. An isolated vertex is a vertex that has a degree of 0. A path inside a graph is a sequence of vertices and edges that starts at one vertex and ends at another, such that each edge is incident to its predecessor and successor vertex.

2.1.1 Properties

A graph has various properties associated to it. Throughout this study, different (basic) properties of a graph are mentioned, most of which are described in this section.

The average degree of a graph measures how many edges in the graph exist in comparison to the number of vertices in the graph. This can be calculated by:

$$\text{average degree} = \frac{2 \times |E|}{|V|}$$


Similarly, the density of a graph indicates how many edges exist in the current graph in comparison to the maximum number of possible edges. The density for an undirected graph can be obtained by:

$$\text{density} = \frac{2 \times |E|}{|V| \times (|V| - 1)}$$

For a directed graph, this can be done using:

$$\text{density} = \frac{|E|}{|V| \times (|V| - 1)}$$

The higher this fraction, the denser the graph. In contrast, a graph that is not dense is called sparse.
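As a small worked example (ours, not from the original text): the input graph used for the visualizations in section 3.1 has |V| = 11 vertices and |E| = 11 edges, so these formulas give

$$\text{average degree} = \frac{2 \times 11}{11} = 2, \qquad \text{density} = \frac{2 \times 11}{11 \times 10} = 0.2$$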

For a directed graph, we consider both strongly connected components and weakly connected components. The number of strongly connected components is the number of components in which all vertices are mutually reachable. Between two components, there can be (directed) paths in one direction or the other, but not both. The number of weakly connected components is the number of subgraphs that are reachable when ignoring the direction of the paths between them.

For undirected graphs, we simply define the number of connected components as the number of components that are unreachable from each other (i.e., there is no link between them).

As explained by [5], the (local) clustering coefficient of a vertex u, denoted as Cl(u), describes how well the neighbourhood of a node is connected. It can be calculated by:

$$Cl(u) = \frac{2 \times K_u}{d_u \times (d_u - 1)}$$

where Ku is the number of links between the neighbours of u and du is the degree of u. A link exists if there is a transitive relation, e.g., if u has N(u) = {v, w} as neighbours and there exists an edge between v and w. The average clustering coefficient is the average (local) clustering coefficient of all the nodes in a graph.
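For instance (an illustrative computation of ours, not an example from [5]): for a vertex u with degree du = 4 whose neighbours share Ku = 3 links, the local clustering coefficient is

$$Cl(u) = \frac{2 \times 3}{4 \times 3} = 0.5$$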

The diameter is the longest shortest-path length between any two vertices of a graph. Furthermore, the average path length is the average distance over all pairs of vertices inside a graph.

The term local properties of a graph refers to properties of nodes in a graph, which can be captured by the degree and clustering distributions [1]. In contrast, topological properties characterize the connectivity of a graph, and can be captured by the (weakly/strongly) connected components, diameter, and shortest path length distribution [16].

2.2 Graph representation

Considering how a large graph can be represented is critical in order to maximize efficiency and memory performance. Two general representations are the adjacency matrix and the adjacency list [38].


An adjacency matrix is an n × n matrix (where n = |V|) in which an element indicates whether an edge is present in the graph. The adjacency matrix wastes more space than necessary when graphs are sparse and is thus more appropriate for dense graphs [38] [29].

Adjacency lists on the other hand provide a more compact storage for sparse graphs, since they only store edges that are present in the graph.

A similar way of storing the edges is by using the Coordinate format (COO). The COO format stores the edges from the adjacency matrix in two ordered arrays, rows and columns. An example of the coordinate format representation for graph G and its matrix A (figure 2.1) is shown in figure 2.2.

Figure 2.1: Example of a (directed) graph G and its matrix A.

Figure 2.2: Coordinate format representation of G.

Another approach that is similar to the adjacency list and COO format, yet more memory efficient, is the Compressed Sparse Row (CSR) format. Figure 2.3 shows an example of this format for the same graph specified in figure 2.1 [29].

Figure 2.3: Compressed sparse row representation of G.


The CSR format consists of two arrays: the row offsets (vertex array) and the column indices (edge array). Each position in the row offsets array corresponds to a vertex in the graph, and the element stored there is the starting index of that vertex's neighbour list inside the column indices array. E.g., for vertex v = 1, the neighbour list starts at offsets[1] = 2 and ends just before offsets[2] = 4; thus, the neighbours are indices[2] = 0 and indices[3] = 3. The last element of the row offsets array does not correspond to a vertex and is simply there to indicate where the neighbour list of the last vertex ends. An important note to keep in mind is that the CSR format is read-only and thus cannot be (cheaply) modified.
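To illustrate, the following sketch (ours, not the thesis implementation) builds the two CSR arrays from a COO edge list via a degree count and a prefix sum; the concrete edges are an assumption chosen to be consistent with the worked numbers above (vertex 1 has neighbours 0 and 3):

```python
import numpy as np

num_vertices = 4
# COO representation: one (source, destination) pair per edge, sorted by source.
rows = np.array([0, 0, 1, 1, 2, 3])
cols = np.array([1, 2, 0, 3, 3, 2])

# Row offsets (vertex array): |V| + 1 entries; offsets[v] marks where the
# neighbour list of vertex v starts inside the column indices (edge array).
offsets = np.zeros(num_vertices + 1, dtype=np.int64)
np.add.at(offsets, rows + 1, 1)   # count the out-degree of every vertex
offsets = np.cumsum(offsets)      # a prefix sum turns counts into offsets

v = 1
print(cols[offsets[v]:offsets[v + 1]])   # neighbours of vertex 1 -> [0 3]
```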

2.3 CUDA

The use of graphics processing units (GPUs) emerged with the growing demands and needs of 3D computing [31]. As this demand kept increasing, GPU manufacturers such as NVIDIA produced new architectures that allow programmers to push the limits of 3D computations. In 2006, NVIDIA introduced the Compute Unified Device Architecture (CUDA) computing platform and programming model. The CUDA architecture allows programmers to use (NVIDIA) GPUs for general purpose computing.¹ GPU computing is especially popular in the scientific computing community, as it allows programmers to obtain speedups at lower costs in comparison to traditional parallel hardware [31].

An application written using the CUDA parallel programming model consists of the host program, which is essentially the code that runs on the CPU (and its memory). The host is able to execute kernels, which are parallel programs that run on the device (the GPU and its memory). Kernels are executed by many threads that are spread over (thread) blocks.

CPUs support a small number of concurrent threads in comparison to GPUs. However, GPU threads are lightweight in comparison to CPU threads, and together they aim to maximize throughput.

In order to transfer data to the device for a particular computation, memory needs to be allocated and copied to it, as well as back to the host to obtain any results. The host can be seen as the orchestrator for these operations. Ideally, to maximize performance, operations on the GPU should be performed on large datasets where the same operation on a dataset can be performed millions of times simultaneously. In addition, minimizing data transfer between the host and device can increase the performance of the overall application.
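As a minimal illustration of this host/device pattern, here is a sketch using Numba's CUDA bindings (an assumption made for brevity; the thesis implementation uses native CUDA, not Numba):

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(values, factor):
    # Each lightweight GPU thread handles exactly one element.
    i = cuda.grid(1)
    if i < values.size:
        values[i] *= factor

host_data = np.arange(1_000_000, dtype=np.float32)
device_data = cuda.to_device(host_data)              # allocate + copy host -> device

threads_per_block = 256
blocks = (host_data.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](device_data, 2.0)   # kernel launched by the host

result = device_data.copy_to_host()                  # copy results device -> host
```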

¹ http://docs.nvidia.com/pdf/CUDA_C_Best_Practices_Guide.pdf


Chapter 3

Shrinking and expanding graphs

The goal of this chapter is to answer SQ1 (section 3.1) and SQ2 (section 3.2). In each section, a concise analysis is provided in order to answer these questions.

3.1 Shrinking a graph

First and foremost, it is important to echo our goals of shrinking a graph. Our goal of shrinking is to obtain a representative graph sample Gs of an original

graph G that preserves as many properties of the original graph as possible. There is already a wide variety of research available that analyses different graph sampling algorithms on different types of datasets [1,24,34,8,17]. While some focus on obtaining a ‘back-in-time’ graph that tries to capture how a graph has looked at an earlier stage during its evolution, others focus on obtaining a reduced, scaled-down representative sample. We consider our goals of shrinking to be similar to the latter objective, and thus consider shrinking to be a form of sampling.

Ideally, the original graph should be specified by the user of the algorithm along with a sampling fraction, denoted as ∅, that specifies the desired sample size based on the number of nodes or edges of the graph. A sampling algorithm should output a graph Gs = ⟨Vs, Es⟩ where Vs ⊆ V, Es ⊆ E, and Es ⊆ {(u, v) | u ∈ Vs, v ∈ Vs}.

Generally, sampling algorithms can be loosely split into three main categories [17]: (1) node based sampling, (2) edge based sampling, and (3) topological based sampling. In this section, we present a brief analysis of two sampling algorithms per category:

1. Node Based Sampling
   (a) Random Node Sampling
   (b) Node Sampling with Neighbourhood

2. Edge Based Sampling
   (a) Random Edge Sampling
   (b) Total Induced Edge Sampling

3. Topological Based Sampling
   (a) Random Walk
   (b) Forest Fire

The selection of the algorithms per category was primarily based on popularity and/or uniqueness. For example, Random Node Sampling and Random Edge Sampling are two very popular algorithms, while Total Induced Edge Sampling is a more distinctive one.

For each sampling algorithm in the list, the following is provided: (1) a description of the approach; (2) the properties that are preserved; (3) a measurement of the time complexity; and (4) a visualization of the result of the graph after sampling. Before getting into these points for each selected sampling algorithm, some general assumptions are provided below.

Assumptions for input graph visualization

The visualization for each sampling algorithm described in the next sections is considered to be a sampled output from the following original undirected input graph G:

Figure 3.1: Original graph that is used as an input for the visualization of the sampling algorithms in the following subsections.

Note that while this is a simplified graph, a sampling algorithm might support non-simplified graphs as well. If that is the case, we explicitly mention this in the approach section of the sampling algorithm.

Furthermore, by no means do the visualizations present any indications of property preservation. The visualization is only meant to help understand the sampling algorithm better.

Properties to preserve

A sampling algorithm should preferably preserve as many properties (chapter 2) of the original graph as possible. Ideally, we would like to preserve the local as well as the topological properties, including the average degree, diameter, average path length, (global/average) clustering coefficient, and density. These (as well as any additional) properties that are preserved are always mentioned in the property preservation section of each sampling algorithm.

Caution on complexity

Some sampling algorithms may incorporate randomized selection of data (e.g., nodes or edges). For example, if a sampling algorithm requires the selection of random unique nodes (without replacement) up until a desired number is reached, the running time could grow as the sampling fraction goes up. Therefore, when talking about complexity, we assume that, unless stated otherwise, this random selection is done in a single pass according to some distribution.

3.1.1 Random Node Sampling

Approach

Random Node Sampling (or vertex sampling) starts by selecting a subset of the vertices from the original graph independently and uniformly at random: Vs ⊆ V [17] [3]. The number of vertices selected in this subset depends on the desired sampling fraction. An edge (u, v) is selected if both end vertices of the edge exist in the sampled vertex set Vs. To put it formally: Es = {(a, b) ∈ E | a ∈ Vs ∧ b ∈ Vs}. While this approach might produce isolated nodes, it is possible to omit them, as is done in a study by Lee et al. [20].
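A minimal sketch of this procedure under the assumptions above (our illustration; the function and variable names are ours, not from the cited studies):

```python
import random

def random_node_sampling(vertices, edges, fraction, drop_isolated=False):
    k = round(fraction * len(vertices))
    # Select k vertices independently and uniformly, without replacement.
    sampled = set(random.sample(sorted(vertices), k))
    # Keep an edge only if both of its end vertices were sampled.
    sampled_edges = [(u, v) for (u, v) in edges if u in sampled and v in sampled]
    if drop_isolated:
        # Optionally omit zero-degree vertices, as done by Lee et al. [20].
        touched = {x for e in sampled_edges for x in e}
        sampled &= touched
    return sampled, sampled_edges
```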

Property preservation

As for the properties of the sampled output, different studies have shown or focused on different aspects of the results. For example, a study conducted by Blagus et al. [6] showed that this sampling method preserves global properties (such as the density, degree mixing, and triplets) best when the sampling fraction is large.

In addition, a study by Lee et al. [20] showed that, for the datasets that were used, the degree distribution was preserved, while the preservation of the clustering coefficient seemed to depend on the type of network. Triangles are less likely to be formed, since the nodes are sampled uniformly at random [28].

Furthermore, Ahmed et al. [1] showed in their study that node sampling underestimates the degree of nodes while preserving some of the clustering coefficient and path length within a small error margin. The path length distribution was also observed to be close to the original graph.

Complexity

The worst-case running time for node sampling is O(n²), where n = |V|. This holds if we assume that (1) the random node selection occurs without replacement in a single pass, and (2) obtaining the neighbours of a node occurs in constant time.

Visualization

Assuming that the sampling fraction ∅ = 0.36, we have |Vs| = 0.36 × 11 ≈ 4, so 4 vertices from the original graph would be selected. If that happened to be the set Vs = {4, 5, 7, 9}, every edge that exists between any of these vertices would be selected. In this case, the edges are Es = {(4, 5), (5, 7)} (figure 3.2).

Figure 3.2: Result of the sampled graph Gs using Random Node Sampling.

3.1.2 Node Sampling with Neighbourhood

Approach

Node Sampling with Neighbourhood (vertex sampling with neighbourhood) initially selects a subset Ṽs of vertices, similarly to Random Node Sampling. For each selected vertex v, the incident edges δ(v) are selected and added to the sampled edge set Es. The sampled vertex set then consists of the selected vertices together with the vertices adjacent to them, which avoids isolated vertices (zero-degree vertices) in the sampled vertex set [17]. To our knowledge, there is no explicit number of nodes to select for the initial subset of vertices. However, regardless of the number of initial nodes, to match the desired sample size, a node can be selected one at a time from the initial set along with its neighbours. This process can occur in an iterative manner, up until the desired number of nodes is reached.

Property preservation

A limited amount of information is available regarding the property preservation of this sampling algorithm. However, the study by Hu and Chau [17] formally analysed (without empirical evidence) that the average degree and degree distribution would likely be preserved, since this sampling algorithm obtains the neighbourhood of the sampled nodes. A study by Leskovec et al. [24] discussed a similar sampling algorithm, Random Node Neighbour, that also selects the neighbours of a (random) node. Their results have shown that, for a directed graph, the out-degree distribution is more likely to be preserved than the in-degree distribution, since only the (outgoing) neighbours are selected.

Complexity

As with Random Node Sampling, the worst-case running time would be O(n²), where n = |V|. This holds if (1) the random node selection occurs without replacement in a single pass, and (2) obtaining the neighbours of a node occurs in constant time (due to an adjacency list representation).

Visualization

The visualization in figure 3.3 shows the result of sampling the original graph using a fraction of 0.63 of the vertices, where 7 vertices are selected. In this case, the initially selected vertices are Ṽs = {3, 10}.


Figure 3.3: Result of the sampled graph Gs using Node Sampling with Neighbourhood.

3.1.3 Random Edge Sampling

Approach

In Random Edge Sampling (or edge sampling), a subset of edges from the edge set E is initially selected independently and uniformly at random: Es ⊆ E. The number of edges that are selected depends on the sampling fraction, although this can be determined based on the vertices as well. The end vertices of the randomly selected edges are the ones that are selected as the sampled vertices (i.e., Vs = {u, v | (u, v) ∈ Es}) [17] [24].

Property preservation

According to Leskovec et al. [24], a particular problem with this sampling method is that the sampled graphs will be sparser than with Random Node Sampling and will not respect the community structure. It is likely that the diameter will be larger than in the original graph and that the clustering coefficient will be smaller or not preserved at all [3] [28]. However, both Leskovec et al. [24] and Ahmed et al. [1] note that, intuitively, this sampling method has an upward bias towards high-degree nodes, since these nodes have more edges connected to them. While this bias seems to be in the right direction in terms of representing the original graph's degree distribution, it is still insufficient [1]. Furthermore, although edge-based sampling techniques perform better than node or topological based sampling algorithms for the weakly connected components distribution, it is still not captured accurately [24] [1].


Complexity

Assuming that random edges are selected without replacement in a single pass, the worst-case running time would be O(n), where n = |E|.

Visualization

Figure 3.4 shows the result of edge sampling using a fraction of 0.5 of the edges from the original graph. Since the original graph has 11 edges, the number of sampled edges was rounded to 6. In this case, the edges that were randomly selected are Es = {(1, 2), (2, 3), (4, 5), (4, 6), (6, 7), (6, 8)}.

Figure 3.4: Result of the sampled graph Gs using Random Edge Sampling.

3.1.4 Total Induced Edge Sampling

Approach

Total Induced Edge Sampling (TIES) consists of two steps: (1) the edge-based node sampling step and (2) the induction step [1] [2].

During the edge-based node sampling step, a random set of edges from the original graph G is selected along with their end vertices. The reason for not selecting nodes directly but using edge-based node selection is to compensate for the downward degree bias of node based sampling: edge based node selection naturally favours high-degree nodes. Up until this step, the algorithm is similar to Random Edge Sampling. However, in order to counter the sparseness of the Random Edge Sampling algorithm, the induction step is added, in which a traversal through all the edges in the graph is required to add every other edge that exists between the selected vertices. Unlike Random Edge Sampling, where the fraction of the sampled graph is determined based on the edges, the vertices have to be used in this algorithm. The reason is simple: it is likely that an unknown number of extra edges will be added during the induction step.
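To make the two steps concrete, here is a minimal sketch of TIES as described above (our illustration, not the code of [1]; the graph is assumed to be a vertex list plus an edge list):

```python
import random

def ties(vertices, edges, fraction):
    target = round(fraction * len(vertices))
    sampled_nodes = set()
    # Step 1: edge-based node sampling -- pick random edges and keep their
    # end vertices, which naturally favours high-degree nodes.
    while len(sampled_nodes) < target:
        u, v = random.choice(edges)
        sampled_nodes.update((u, v))
    # Step 2: induction -- a single pass over ALL original edges, adding
    # every edge whose end vertices were both sampled.
    sampled_edges = [(u, v) for (u, v) in edges
                     if u in sampled_nodes and v in sampled_nodes]
    return sampled_nodes, sampled_edges
```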


Property preservation

The authors of TIES claim that it preserves important local and topological properties of the original graph. While the authors tested this sampling algorithm on undirected, simplified datasets, they state that the same results should hold for unsimplified graphs [1].

Their results showed that for the majority of the datasets used, the sampled degree distribution is closer to the original in the case of TIES than in the case of other algorithms, like Random Node Sampling and Forest Fire. However, the authors mention that the sampled degrees of the nodes are high, which helps capture the degree distribution.

Furthermore, the path length distribution was also captured more accurately than by other sampling algorithms. However, due to the induction step, the TIES sampled path lengths were shorter than for other sampling algorithms.

The clustering coefficient was also captured well in comparison to other sampling algorithms. However, the authors mention that TIES could overestimate the clustering coefficient due to the induction step.

Lastly, although better preserved than by other sampling algorithms, the component sizes were not as well preserved as the above-mentioned properties.

Complexity

Assuming that the random edges are selected without replacement in a single pass, the running time is always O(n), where n = |E|, since the induction step traverses all the edges of the original graph.

Visualization

Figure 3.5 shows the result of sampling the original graph using a fraction of 0.63 of the nodes. Assume that initially, during the edge-based node sampling step, the following set of edges is selected: Es = {(1, 2), (4, 6), (5, 7), (7, 10)}. The sampled vertex set Vs contains the vertices that are incident on these edges. Afterwards, during the induction step, all remaining edges whose end vertices exist in Vs are added: {(2, 4), (4, 5), (6, 7)}.


Figure 3.5: Result of the sampled graph Gs using Total Induced Edge Sampling.

3.1.5 Random Walk

Approach

Random Walk starts by selecting a single starting vertex vi, where i = 0, from the input graph G [17] [24] [37]. This selection can, for instance, be done uniformly at random. The starting vertex is added to the sampled vertex set Vs. Afterwards, with a probability of 0.85, a single neighbouring vertex v(i+1) is chosen and added to Vs; in addition, the edge (vi, v(i+1)) is added to the sampled edge set Es. This procedure repeats itself for t steps by 'walking' with a probability of 0.85 from the newly selected vertex, until the preferred number of nodes or edges is reached. With a probability of 0.15, the initial starting vertex v0 (or any other random node) is selected instead. Adding this 0.15 probability of 'jumping' to another node may also be called Random Jump, a variation of Random Walk [12]. Note that these probabilities (0.85 and 0.15) can be changed, but these are the most commonly used values [37] [24]. However, as [24] noted, there is a possibility of getting stuck (e.g., no new node is selected or available in a small non-connected component of a graph). To avoid this problem, whenever the same set of nodes keeps being selected after a preset number of steps (i.e., a threshold), a new random starting point is chosen in the graph and the cycle repeats itself.

Property preservation

In [24], Random Walk is (empirically) shown to preserve the clustering coefficient and diameter of the original graph better than node and edge based sampling implementations. In another study [37], similar results were shown, where Random Walk samples have an average degree and clustering coefficient closer to the original graph than other sampling methods, including Random Node Sampling and Random Edge Sampling.


Complexity

The worst-case running time for Random Walk is O(n²), where n = |V|: a traversal through all the neighbours of each randomly selected node could occur in order to select the new (random) neighbouring node according to the specified probability.

Visualization

Figure 3.6 shows the result of sampling the original graph using a fraction of 0.54 of the nodes. Assume that v = 7 was selected as the starting point, followed by 5 → 4 → 2, each selected during a single step with probability 0.85. At vertex 2, through the 0.15 probability, the initial starting vertex was selected again, and the algorithm continued to select neighbours (10 → 8) until a total of 6 nodes were selected.

Figure 3.6: Result of the sampled graph Gs using Random Walk.

3.1.6 Forest Fire

Approach

Forest Fire starts by selecting a random node v (uniformly at random) from the original graph, which is put into Vs [37] [24]. Afterwards, the so-called 'burning' process begins, which first selects a random number k of neighbouring nodes that are put into Vs, along with the edges incident on v, which are put into the sampled edge set Es. While k can be a random number, it can also be all the neighbouring nodes. From the newly selected k neighbouring nodes, the 'burning' process repeats itself until the desired number of nodes is reached. Similarly to 'Random Jump', in the case where no new nodes are added to Vs, a new random node v should be selected, from which the neighbouring node selection process continues.


Property preservation

According to [24], this sampling method mainly preserves the clustering coefficient. Note that the authors of this study performed their experiments on the same datasets using the Random Walk algorithm. While the Random Walk algorithm seemed to preserve the diameter best, Forest Fire came second among the sampling algorithms. In contrast, [1] observed that Node Sampling performed better than Forest Fire for this property.

In addition, [37] showed that the average node degree was preserved best in comparison to other sampling algorithms, including Random Walk, when all the neighbours of a particular vertex were selected rather than a random number k. Furthermore, the study showed that the average (shortest) path length as well as the degree distribution were similar to the original graph.

Note that, similar to Random Walk, no explicit 'stopping' condition is defined. Although the algorithm can be implemented to stop when the preferred number of nodes is sampled, it can also be implemented to exceed the preferred number of nodes, depending on how many neighbours are picked.

Complexity

The worst-case running time for Forest Fire is O(n²), where n = |V|, because a traversal through all the neighbours of each selected node can occur.

Visualization

Figure 3.7 shows the result of sampling a fraction of 0.54 of the original graph using Forest Fire. Assume that the initial random node was 2 and that all neighbours are selected during each step. In this case, the neighbours would be Vs = {1, 3, 4}. By repeating this for every selected node - in this case only for u = 4, since it is the only vertex in the set with unselected neighbours - we satisfy our preferred sampling fraction.


Figure 3.7: Result of the sampled graph Gs using Forest Fire.

3.1.7 Impact on property preservation

For our goals, the quality of a sampling algorithm can be determined by how closely the properties of Gs match the properties of the original graph. The more properties match, the better the sampling algorithm. The authors of the aforementioned studies measured the quality of the selected sampling algorithms mostly based on empirical analysis. They experimented with different types of datasets and different sample sizes.

The selected list of sampling algorithms shows that the quality in terms of property preservation differs per algorithm. Each algorithm is biased towards specific properties of the graph. There is no one-size-fits-all algorithm that preserves all properties, and it is up to the user to decide which algorithm fits their needs best. Even when an algorithm targets particular properties, the quality of the sampled graph is still not guaranteed, as other factors may also play a role. For instance, when the sampling size (fraction) is too small, a loss of quality is often observed.

Furthermore, the dataset can also affect the sampling output. For instance, a road network may behave differently than a social network. While most authors tested their sampling algorithms on a selected set of datasets, the same results may not hold for different types of datasets (or perhaps even for the same type of dataset).

In addition, there may be trade-offs in terms of quality versus efficiency between the algorithms. For instance, Random Edge Sampling is likely to be faster than Total Induced Edge Sampling because it lacks the induction step, but it may suffer a loss in quality.

Needless to say, a combination of the parameters above may also yield different results. For instance, applying a small sampling fraction to a different type of dataset might yield a higher sampling quality than expected. In other words, the results for the same sampling algorithm are not deterministic across different parameters.

Nevertheless, table 3.1 summarizes the observations of the results from the aforementioned studies in terms of the quality for a selected set of properties per sampling algorithm. Note that, as discussed above, the scores in the table are not guaranteed.

Property                                   RN    RNN   RE    TIES   RW    FF
Degree distribution                        +     ++    +     ++     +     +
Diameter                                   +     ?     -     ?      +     +
(Weakly/strongly) connected components     −−    ?     -     -      −     −−
Avg. clustering coefficient                +     ?     -     ++     +     +
Avg. (shortest) path length                +     ?     -     ++     +     ++

Table 3.1: Summary of the expectations regarding the property preservation quality per sampling algorithm, represented by the likelihood of preserving the property from low (−−) to high (++). Unknown data is represented by a question mark.


3.2 Expanding a graph

As with the analysis of shrinking a graph (section 3.1), it is first and foremost important to echo our goals of expanding a graph. Our definition and goal of the expanding algorithm is to provide a bigger, representative graph Ge of an original graph G (a formal definition is provided and explained in section 3.2.2). Furthermore, as with the sampling algorithm, the goal of this research is to find an expanding algorithm that preserves as many properties of the original graph as possible. The original graph should be specified by the user along with a scaling factor s that specifies the desired expansion size based on the number of nodes or edges in the graph. Essentially, this expanding approach can be considered as obtaining a scaled-up version of the original graph; it is the opposite operation of the shrinking algorithm.

However, the expanding algorithm should allow the user to make certain choices for the graph properties - e.g., preserve the clustering coefficient or increase the diameter. Allowing such options in the expanding algorithm would better address our goal of producing a representative graph for in-depth evaluation or performance analysis. For this goal, existing studies use synthetic graph generators as a way to produce a graph from scratch or from a small original graph [36, 7, 13, 30, 35].

Therefore, before introducing our expanding algorithm, we briefly describe a few existing synthetic graph generator algorithms in the next section (3.2.1). Because synthetic graph generators do not fully comply with our requirements in terms of flexibility, we also introduce our own expanding algorithm in section 3.2.2.

3.2.1 Synthetic graph generators

This section briefly describes a few selected synthetic graph generating algorithms, which can be considered related work. A comprehensive survey of generating synthetic graphs is provided by [26].

R-MAT [9]

The recursive matrix (R-MAT) model is one particular solution to quickly generate realistic graphs from scratch following a power law degree distribution. Initially, R-MAT starts with an empty adjacency matrix (all entries are 0). It divides this matrix into four quadrants, each associated with a different selection probability (a, b, c, or d), where the sum of a, b, c, and d is equal to 1. One of the quadrants is then chosen and again split into four separate parts, hence the recursive operation of the algorithm. This process continues until a single element (cell in the matrix) is reached, and an edge is inserted there. The entire process repeats itself until the preferred graph is formed. According to the authors, the setting of the probabilities should match a predetermined order (i.e., a ≥ b, a ≥ c and a ≥ d) to allow the formation of graph communities. R-MAT can generate directed, undirected, and bipartite graphs [27].


Kronecker [23]

In the Kronecker graph generative model, a graph is generated recursively using the Kronecker product. A Kronecker product of two graphs is defined as the Kronecker product of their adjacency matrices. Each multiplication exponentially increases the size of the graph [22]. The Kronecker graph generative model makes a given network denser over time, while the diameter shrinks. Furthermore, Kronecker graphs seem to capture the power law degree distribution as well as the eigenvalue and eigenvector distributions of the graph [22].

The copying mechanism [18]

While this approach does not synthetically generate a graph from scratch, it is still interesting to mention, since it focuses on growing a network in a natural manner. This 'copying mechanism' has been introduced to produce sparse networks in which the average degree grows logarithmically with the network size while the network diameter remains equal to two. An existing network is grown by adding nodes one at a time. A newly added node randomly selects a target node and links to it. The ancestors of the selected random node are also linked to the newly added node (with the exception of the root node, to avoid a star topology).

Musketeer [15]

This relatively new algorithm takes an original graph as input, along with parameters (such as the node growth rate and node edit rate), and aims to reproduce all of its features [27]. The algorithm applies multiple coarsening operations to obtain the network's Laplacian. During the coarsening, different operations can be applied (such as editing nodes). It then applies uncoarsening operations to obtain the desired scale of the graph. In a sense, this can be seen as a randomizing process. While this algorithm shows promising results, the authors note that the existing implementation is not fast enough for large networks [33].

ReCoN [33]

To address the performance challenges of Musketeer, some of the authors of the Musketeer algorithm proposed ReCoN, an algorithm that should have similar qualities and better performance. ReCoN starts with a graph and a community structure. The general idea for scaling a graph is to create disjoint copies of it up until the preferred size. It then randomizes the edges inside the communities of the graph and between them, while keeping the node degrees. This randomization step should usually make the copies become connected with each other, resulting in the output graph.


3.2.2 Expanding graph datasets using sampling

Most of the algorithms described in the previous section focus on generating a new graph from scratch or from an original graph with little to no control over the features of the generated graphs, while others focus on the time-based growth/evolution of a graph. No existing algorithm targets our expanding goals directly.

Therefore, we describe here the algorithm we devised to expand an existing graph. We define graph expansion as the operation of increasing the size of a graph while controlling (most of) its local and topological features - e.g., diameter, average path length, clustering coefficient. Note that, as stated earlier in section 3.2, by controlling such properties we do not necessarily mean keeping them constant. Instead, we allow a choice among options like keeping these features constant or increasing them proportionally with the scaling factor s.

Conceptual approach

By our definition of 'graph expanding', we assume to have an original input graph G (directed or undirected), and a scaling factor s (i.e., the amount by which the graph size must be increased).

Our initial idea to expand a graph is a trivial one: replicate the graph s times, and interconnect these copies. To handle the case where s is not an integer number, i.e., s = ⌊s⌋ + f, we can further sample a fraction f from the original input graph using a sampling algorithm of choice (edge-, node-, or topology-based). After each copy (including possible sampling, if needed), we link the original graph to the newly copied graph by choosing two random nodes, one from each graph, and adding one edge.

Figure 3.8: Initial concept of the expanding result.

Note that this particular interconnection strategy leads to a topological star structure of the expanded graph. The example in figure 3.8 shows how this expanding approach works: assume we want to expand our input graph G 4.5 times (this includes the original graph itself). We first start by generating three copies of G. Each copy is linked back to the original graph by selecting two random vertices, one from each graph. The fraction 0.5 would be sampled from the original graph into Gs, with Es ⊂ E and Vs ⊂ V, and would be linked back to the original graph in a similar fashion. The result Ge = ⟨Ve, Ee⟩ can be seen as a superset of G, i.e., Ee ⊇ E and Ve ⊇ V. Note further that we expect |Ve| ≈ |V| × s and |Ee| ≈ |E| × s + ⌈s⌉.
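The following sketch (ours, not the thesis tool) illustrates this initial concept for the star topology: ⌊s⌋ relabelled full copies, one sampled component for the fractional remainder, and a single random bridge edge from each component back to the original graph, as in figure 3.8. The `sample` argument can be any shrinking function, e.g., the TIES sketch above.

```python
import math
import random

def expand_star(vertices, edges, s, sample):
    """Expand G = (vertices, edges) by a factor s using a star topology."""
    vertices, edges = list(vertices), list(edges)
    components = [(vertices, edges)] * math.floor(s)   # full copies, incl. G itself
    fraction = s - math.floor(s)
    if fraction > 0:                                   # fractional remainder
        sv, se = sample(vertices, edges, fraction)
        components.append((list(sv), list(se)))
    expanded_vertices, expanded_edges = set(), []
    for i, (vs, es) in enumerate(components):
        expanded_vertices |= {(i, v) for v in vs}      # relabel: keep copies disjoint
        expanded_edges += [((i, u), (i, v)) for (u, v) in es]
        if i > 0:                                      # one random bridge per component
            expanded_edges.append(((0, random.choice(components[0][0])),
                                   (i, random.choice(vs))))
    return expanded_vertices, expanded_edges

# e.g. expand_star(V, E, 4.5, ties): 4 full copies plus a 0.5 sample,
# each linked back to the original graph, as in figure 3.8.
```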

3.2.3 Variations

We discuss four possible variations: (1) using sampling for all graph 'copies', (2) using different interconnection topologies, (3) using denser interconnections, and (4) selecting nodes for interconnection. Section 3.2.4 discusses the implications of these variations on the actual expanded graph in more detail.

Sampling

Instead of using full copies of the graph, we can choose to sample the graph several times and use the samples instead of full graph copies. For example, if we want to expand the graph 4.5 times, we could sample the original graph 9 times with ∅ = 0.5. If our sampling algorithm is random, it generates a different sampled graph Gsi, i = 0..8, for each sampling operation we conduct. As a result, the graph is more diverse, but the entire operation is significantly more compute-intensive: we now need to run the sampling algorithm 9 times, roughly expecting a 9x slowdown compared with the basic approach. Figure 3.9 shows this variation. Note that for this particular variation, there is no assurance that every node from the input graph G exists in the final result: depending on the sampling algorithm used, we may lose some of the nodes. Hence, we would denote the expanded graph as Ge = ⟨Ve, Ee⟩ with Ve = Vs1 ∪ ... ∪ Vsn and Ee = Es1 ∪ ... ∪ Esn (together with the added bridge edges).

Figure 3.9: Expanding a graph using sampling.

Another parameter to be explored for this variation of the core concept is the selection of the number of samples versus the size of the samples and, implicitly, the choice between homogeneous and non-homogeneous samples. For example, in the previous case of s = 4.5, we could sample the original graph 9 times with ∅ = 0.5, 5 times with ∅ = 0.9, or 5 times with ∅ = 0.7 and add the full graph once.

Interconnection topologies

So far, our core approach focused on a basic star-like topology. However, other topology structures are easily available for interconnecting the components of the expanded graph. Some examples are presented in figure 3.10.

The interconnection topology will impact some of the properties of the resulting graph. For example, the diameter of Ge will be smaller for a star interconnection than for a chain interconnection. A similar behavior is expected for the average path length property.

Figure 3.10: Examples of topology structure.

Selecting bridge-vertices

When interconnecting the expanded graph components (i.e., copies of G or samples of G), we select a random vertex from each component and interconnect these bridge nodes to form Ge. However, it is also possible to select these bridge-vertices in a way that impacts property preservation for Ge. For example, we can select bridge-vertices among the high-degree nodes, the low-degree nodes, the average-degree nodes, the isolated nodes, etc., or even devise a policy of 'switching' between these options for each component.


Multi-edge interconnections

For all the variations we discussed so far, the components are interconnected using one edge between two bridge-vertices. As such, both the average degree and the density of the expanded graph are, with high probability, lower than those of the original graph. To alleviate this problem, we can enable multiple bridges between the Ge components. This imposes additional parameter tuning, as we need to decide how many bridges to use and how to select the bridge-vertices, while at the same time enabling additional control over the properties of the expanded graph.

3.2.4 Impact on property preservation

There are four parameters that impact the quality of our expanded graph Ge: the selection of the components, the interconnection topology, the bridge-vertex selection, and the number of interconnection edges. Moreover, in the case that sampling is used for (some of) the components, a fifth parameter of interest is the sampling algorithm. From this point forward, our goal is to implement the expanding algorithm that uses sampling, rather than replicating the graph as described in section 3.2.2.

Each parameter has a different impact on transferring the properties of G to Ge. For some parameters and choices - e.g., simple interconnection topologies or multiple interconnection edges - the impact can be quantified. For others - e.g., the sampling algorithm - the impact cannot be easily quantified, due to the many possible constraints (or features) of the sampling algorithm that might conflict with any quantifiable parameter.

To navigate this design space, the expanding algorithm should enable users to easily vary these parameters, choosing among several options for each one of them, and evaluate the different properties of the resulting Ge graph.

A brief analysis of (some of) the controllable properties and their values is presented in tables 3.2–3.5. Note that (1) for the diameter, we describe the maximum (upper bound), and (2) the impact of selecting the high-degree bridge-vertices parameter needs to be determined through empirical analysis (indicated by 'TBA' and further described in chapter 5). To guide users in their choices, we provide simple guidelines based on both the analytical and empirical impact models and some brief empirical observations. Based on the results in tables 3.2–3.5, an example of such a rule can be: 'in case you want the expanded graph to have a larger diameter, choose a chain topology with a single random bridge'. The quality of our expanding algorithm can be measured by how closely the properties of Ge match the expectations in terms of the models in tables 3.2–3.5.

All the models depend on the properties of the Gsi samples. Depending on

the selected sampling algorithm and its quality in terms of property preservation, the expanded graph’s properties can be affected. However, even if the sampling algorithm preserves a couple of the properties of the original graph, we still cannot guarantee that such properties are further preserved in Ge.


This immediately leads us to the sampling size. The sampling size is important because it affects the properties of a sampled graph. Naturally, the smaller the sample size, the more likely it is to lose properties of the original graph. In our expansion variation, we described that we can expand a graph 4.5 times by sampling the original graph 9 times with a fraction of 0.5 and linking the samples together. However, we can also sample the graph 5 times with a fraction of 0.9. The latter approach might be preferable if the selected sampling algorithm does not preserve the properties at a smaller scale (lower fraction). A particular trade-off here is that the graph might look visually the same. This also holds for our initial concept, which copies the original graph and samples any fraction that is left. However, since a copy has exactly the same properties as the source (original) graph, several properties of the expanded graph (such as the average degree) are likely to remain the same. In contrast, our proposed variation depends on the qualities of the sampling algorithm, and for some properties the result is likely to be similar to a single sampled version of the original graph.


Star topology

Parameter            Random, 1 bridge        Random, b bridges       High degree, 1 bridge   High degree, b bridges
|Ve|                 Σ1..n |Vsi|             Σ1..n |Vsi|             Σ1..n |Vsi|             Σ1..n |Vsi|
|Ee|                 Σ1..n |Esi| + (n−1)     Σ1..n |Esi| + (n−1)×b   Σ1..n |Esi| + (n−1)     Σ1..n |Esi| + (n−1)×b
Components           n × Gsi                 n × Gsi                 TBA                     TBA
Diameter (max)       Σ1..3 max D(Si) + 2     Σ1..3 max D(Si) + 2     Σ1..3 max D(Si) + 2     Σ1..3 max D(Si) + 2
Avg. clustering      Similar                 TBA                     TBA                     TBA
Avg. shortest path   ≈ 3 × AvgP(Si)          Decreases with b        TBA                     TBA
Avg. degree          Similar                 Increases with b        TBA                     TBA
Density              Decreases               Increases with b        Decreases               Increases with b

Table 3.2: Quantifying the impact of different parameter choices (bridge-vertex selection and number of bridges) for constructing Ge using the star topology. The TBA values will be obtained in section 5.3.
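As a quick illustration of reading table 3.2 (with hypothetical numbers of ours): for n = 9 random samples of fraction ∅ = 0.5 interconnected with single random bridges, the model predicts

$$|V_e| = \sum_{i=1}^{9} |V_{s_i}| \approx 4.5 \times |V|, \qquad |E_e| = \sum_{i=1}^{9} |E_{s_i}| + 8$$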

Chain topology

Parameter            Random, 1 bridge          Random, b bridges         High degree, 1 bridge     High degree, b bridges
|Ve|                 Σ1..n |Vsi|               Σ1..n |Vsi|               Σ1..n |Vsi|               Σ1..n |Vsi|
|Ee|                 Σ1..n |Esi| + (n−1)       Σ1..n |Esi| + (n−1)×b     Σ1..n |Esi| + (n−1)       Σ1..n |Esi| + (n−1)×b
Components           n × Gsi                   n × Gsi                   TBA                       TBA
Diameter (max)       Σ1..n max D(Si) + (n−1)   Σ1..n max D(Si) + (n−1)   Σ1..n max D(Si) + (n−1)   Σ1..n max D(Si) + (n−1)
Avg. clustering      Similar                   TBA                       TBA                       TBA
Avg. shortest path   ≈ (n−1) × AvgP(Si)        Decreases with b          TBA                       TBA
Avg. degree          Similar                   Increases with b          TBA                       TBA
Density              Decreases                 Increases with b          Decreases                 Increases with b

Table 3.3: Quantifying the impact of different parameter choices for constructing Ge using the chain topology. The TBA values will be obtained in section 5.3.

Ring topology

Parameter            Random, 1 bridge       Random, b bridges      High degree, 1 bridge   High degree, b bridges
|Ve|                 Σ1..n |Vsi|            Σ1..n |Vsi|            Σ1..n |Vsi|             Σ1..n |Vsi|
|Ee|                 Σ1..n |Esi| + n        Σ1..n |Esi| + n×b      Σ1..n |Esi| + n         Σ1..n |Esi| + n×b
Components           n × Gsi                n × Gsi                TBA                     TBA
Diameter (max)       Σ1..n max D(Si) + n    Σ1..n max D(Si) + n    Σ1..n max D(Si) + n     Σ1..n max D(Si) + n
Avg. clustering      Similar                TBA                    TBA                     TBA
Avg. shortest path   ≈ n × AvgP(Si)         Decreases with b       TBA                     TBA
Avg. degree          Similar                Increases with b       TBA                     TBA
Density              Decreases              Increases with b       Decreases               Increases with b

Table 3.4: Quantifying the impact of different parameter choices for constructing Ge using the ring topology. The TBA values will be obtained in section 5.3.

Fully connected topology

| Parameters         | Random, 1 bridge            | Random, b bridges                | High degree, 1 bridge       | High degree, b bridges           |
|--------------------|-----------------------------|----------------------------------|-----------------------------|----------------------------------|
| Bridges            | 1                           | b                                | 1                           | b                                |
| |Ve|               | Σ(1..n) |Vsi|               | Σ(1..n) |Vsi|                    | Σ(1..n) |Vsi|               | Σ(1..n) |Vsi|                    |
| |Ee|               | Σ(1..n) |Esi| + (n−1) × n   | Σ(1..n) |Esi| + ((n−1) × n) × b  | Σ(1..n) |Esi| + (n−1) × n   | Σ(1..n) |Esi| + ((n−1) × n) × b  |
| Components         | n × Gsi                     | n × Gsi                          | TBA                         | TBA                              |
| Diameter           | Σ(1..2) max D(Si) + 1       | Σ(1..2) max D(Si) + 1            | Σ(1..2) max D(Si) + 1       | Σ(1..2) max D(Si) + 1            |
| Avg. clustering    | Similar                     | TBA                              | TBA                         | TBA                              |
| Avg. shortest path | ≈ 2 × AvgP(Si)              | Decreases with b                 | TBA                         | TBA                              |
| Avg. degree        | Similar                     | Increases with b                 | TBA                         | TBA                              |
| Density            | Decreases                   | Increases with b                 | Decreases                   | Increases with b                 |

Table 3.5: Quantifying the impact of different parameter choices for constructing Ge using the fully connected topology. Notation as in Table 3.2.


3.3 Summary

In this chapter, we have analysed existing graph sampling and synthetic graph generation algorithms, and proposed our own graph expanding algorithm.

Generally, there are three main types of sampling algorithms: (1) node-based, (2) edge-based, and (3) topology-based sampling. The goal of most sampling algorithms from each category is to preserve as many properties as possible, so as to obtain a representative sample of the original graph. However, no sampling algorithm preserves all properties of the original graph. Each sampling algorithm discussed in this chapter focuses on preserving different kinds of properties for specific goals. It is up to the user to decide which properties to preserve and to choose the sampling algorithm that is most likely to preserve them. Nonetheless, no guarantees are offered for property preservation.

For the expanding algorithm, we briefly analysed a few existing synthetic graph generators. None of them met our goals, as they focus on generating a new graph from scratch, or from an existing graph with little property preservation or property control in mind. Therefore, we proposed our own expanding algorithm, which obtains an expanded version of an input graph using sampling and preset interconnection topologies. Given a sampling algorithm of choice, we repeatedly sample the original graph at a desired sampling size until the combined size of the collection of sampled graphs reaches the desired expanded graph size. Interconnecting the different sampled graphs introduces additional parameters, such as the topology structure, the number of interconnections, and the selection of bridge vertices. For these parameters, we provided guidelines that can help users select the suitable expansion settings for their experiments. Fine-grained control is, however, not guaranteed, as the quality and features of the sampling algorithm can affect the properties of the expanded graph.


Chapter 4

Design and implementation

Our goal is to provide a tool for shrinking and expanding graphs. Therefore, this chapter presents the design requirements for such a tool, as well as the main implementation challenges we had to tackle to enable both flexibility and high performance. This chapter answers SQ3 of this research.

4.1 Requirements

The requirements mentioned in this section are based on our goal of shrinking and expanding a graph using a flexible, high-performance tool. As mentioned in section 3.2.4, our goal for expanding is to implement the expanding algorithm, including the variation (allowing different parameters), by default. Additionally, we have formulated the following functional and non-functional requirements:

– (R1) Graph input format: The user should provide an input graph to the tool in an edge list format. The source and destination of an edge should be specified as integers (an example is shown after this list).

– (R2) Graph output format: The tool should output a shrunk or expanded graph in an edge list format.

– (R3) Sampling input parameters: To conduct a shrinking operation, the user should provide an input graph and a desired sampling fraction. The tool should automatically determine the number of vertices and edges, as well as whether the graph is directed or undirected.

– (R4) Expanding input parameters: To conduct an expanding operation, the user should provide an input graph, a scaling factor and a desired sampling size, along with the topology, the number of interconnections, and the way of selecting bridge vertices. Since our expanding algorithm adds bridge edges between the sampled graphs, we need to know whether these should be undirected or directed. While this depends on how the input graph is formatted, as discussed in section 4.2, we added an additional parameter that forces bridge edges to be added according to the specified type.

– (R5) Running time: Both shrinking and expanding operations should be performed in a reasonable amount of (total) time. While this requirement depends on different factors (e.g. sample size/scaling factor, dataset size, hardware), we assume that a reasonable time for an average personal machine is no more than 5 minutes for a small to medium-sized graph of 30 million edges, for any sample size and up to a three-fold expansion.

– (R6) Modern software engineering practices: This requirement is mainly targeted at the variations in our own expanding algorithm. The code base of the tool should be easy to maintain and modify. It should allow any new topology or new way of selecting bridge vertices to be incorporated with minimal effort.
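For illustration, a minimal undirected graph in the edge list format of R1 and R2 could look as follows (whitespace-separated integer vertex ids, one edge per line; the exact separator is an assumption for this sketch):

    0 1
    0 2
    1 2
    2 3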

4.2 Selected sampling algorithm

Based on our goals and requirements for the shrinking algorithm, we decided to implement the Total Induced Edge Sampling algorithm (TIES, section 3.1.4). The main reason for selecting this algorithm is that it preserves several local and topological properties of the original graph. Note that we use TIES for both shrinking and expanding. The authors of TIES showed that the properties of the sampled results match the properties of the original graph more closely than those of other sampling algorithms. Additionally, TIES is largely indifferent to whether the input graph is directed or undirected. For instance, assume that our undirected input graph is given as an edge list. After selecting random nodes from the input graph during the edge-based node sampling step, we simply collect all the edges between them during the induction step. In other words, if the input edge list contains an undirected edge stored both from source to destination and vice versa, both directions are selected during the induction step. Thus, this algorithm supports requirement R3 with less effort than, e.g., Random Edge Sampling.

The authors of TIES validated the algorithm on undirected datasets, the largest being around 214 thousand nodes and 1.2 million edges [1]. Thus, it is also interesting to test its functionality and performance on larger (directed and undirected) datasets.

4.3 Enabling GPU processing

In order to achieve high performance for both the shrinking and expanding algorithms (requirement R5), we aim to use a GPU. Therefore, we propose a new parallel version of these algorithms, and we use CUDA for their implementation. In the following subsections, we provide a brief description of the implementation, along with code snippets.

The Total Induced Edge Sampling algorithm essentially consists of two steps: (1) the edge-based node sampling step and (2) the induction step. As mentioned in section 3.1.4, the induction step is the extra step in which a traversal through all the edges of the original graph is required, to determine which edges to select based on the randomly sampled nodes. This step is the most time-consuming part of the algorithm (as it requires an iteration through all the edges of the graph) and is also what creates the speed-versus-quality trade-off in comparison with an algorithm like Random Edge Sampling, which excludes the induction step. To improve execution time, this step is the obvious candidate for parallelization, especially since, at first glance, it appears to be 'embarrassingly' parallelizable. Nonetheless, there are still some challenges that need to be addressed.

Since the expanding algorithm makes use of TIES, this step becomes partially parallelized too. The goal, however, is to run each sampling operation (required to obtain the different samples for expanding) concurrently rather than sequentially.
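As a sketch, such concurrency could be obtained with CUDA streams, launching the induction kernel of each sample in its own stream. The variable names below are illustrative, not the tool's exact code; note also that each sample would need its own vertex mask, output buffer, and edge counter (the single d_edge_count of Listing 1 would otherwise be shared across concurrent runs):

    // Hypothetical per-sample concurrency with CUDA streams.
    std::vector<cudaStream_t> streams(n);
    for (int i = 0; i < n; i++) {
        cudaStreamCreate(&streams[i]);
        // One induction pass per sample, each in its own stream.
        induction_step<<<grid_size, block_size, 0, streams[i]>>>(
            d_sampled_vertices[i], d_offsets, d_indices, d_edge_data[i]);
    }
    for (int i = 0; i < n; i++) {
        cudaStreamSynchronize(streams[i]);  // wait for sample i to finish
        cudaStreamDestroy(streams[i]);
    }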

4.3.1 Obtaining a CSR representation

Since requirement R1 specifies that the input graph is given in an edge list format, a conversion from the edge list to a CSR is required. To speed this process up, we make use of nvGraph (http://docs.nvidia.com/cuda/nvgraph/index.html), a data analytics library for CUDA provided by Nvidia. This library contains a function, nvgraphConvertTopology, that converts a Coordinate List format (COO, https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html) to a CSR. This means that the (input) edge list first needs to be converted to a COO. The current implementation of our tool does this (on the host) by traversing the edges and mapping each source and destination to its coordinate value.

4.3.2 Sampling

Edge-based node sampling step

The edge-based node sampling step is performed on the host by randomly selecting edges and their end vertices until the desired sample size (in terms of vertices) is reached. Any vertex of a random edge that was not selected before is collected. The vertices are stored in an array, sampled_vertices, of size |V|. Initially, all elements of the array are 0; each collected vertex is represented as a 1 at its respective index in the array, allowing quick lookups to determine whether a vertex is present.
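A minimal host-side sketch of this step is given below, again assuming the edge list is held as parallel src/dst arrays; the function name and the RNG choice are illustrative assumptions, not the tool's exact code:

    #include <random>
    #include <vector>

    // Edge-based node sampling (host): mark the end vertices of randomly
    // drawn edges until 'target' vertices have been collected.
    std::vector<int> edge_based_node_sampling(const std::vector<int>& src,
                                              const std::vector<int>& dst,
                                              int num_vertices, int target) {
        std::vector<int> sampled_vertices(num_vertices, 0); // 1 = in the sample
        std::mt19937 rng(std::random_device{}());
        std::uniform_int_distribution<size_t> pick(0, src.size() - 1);
        int collected = 0;
        while (collected < target) {
            size_t e = pick(rng);                           // draw a random edge
            int ends[2] = { src[e], dst[e] };
            for (int v : ends) {
                if (!sampled_vertices[v]) {                 // collect new vertices
                    sampled_vertices[v] = 1;
                    collected++;
                }
            }
        }
        return sampled_vertices;
    }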



The induction step (kernel)

After having obtained a list of sampled vertices from the random edges (i.e., the sampled_vertices array is filled), a traversal through the edges of the original graph is required. Every edge for which both end vertices exist in the sampled_vertices list is selected.

To speed this process up, we parallelize it on the GPU (Listing 1). The induction_step function accepts the sampled_vertices array, the relevant CSR arrays (offsets and indices), and an initially empty array, d_edge_data, that holds the edges to be collected (lines 19-20).

We launch one thread for every vertex of the graph. Each thread traverses the neighbours of its vertex, based on the information in the offsets array. The traversal only starts if (lines 23-24) the thread index (neighbor_index_start_offset) is below D_SIZE_VERTICES (the number of vertices of the input graph) and the source vertex of the edge exists in the sampled_vertices array. The first condition discards any thread whose index is larger than the number of vertices in the graph. The second condition is used because there is no need to traverse the neighbours of a source vertex that is not in the sampled_vertices array. If the source vertex does exist, the traversal through its neighbours starts, and for every neighbour that also exists in the sample, the edge is added using the push_edge function (line 6).

This approach can be seen as vertex-based. With an edge-based approach (e.g., if we directly copy the coordinate list to the kernel), we could map every thread to a single edge. While that approach is likely faster than the vertex-based one, it is certainly less memory efficient. A particular disadvantage of the vertex-based approach arises when the input graph has a skewed degree distribution: a thread whose vertex has a significantly higher degree will take longer to finish than threads whose vertices have few neighbours. This results in a higher computation time, since the shrinking operation has to wait for a single thread to finish, reducing the benefit of the parallelization effort. Listing 2 shows a sequential implementation of the induction step.

After having obtained the edges during this step, they are simply written to an output file (requirement R2).


1  __device__ int d_edge_count = 0;      // global cursor into the output array
2  __constant__ int D_SIZE_EDGES;
3  __constant__ int D_SIZE_VERTICES;
4
5  __device__
6  int push_edge(Edge &edge, Edge* edge_data) {
7      int edge_index = atomicAdd(&d_edge_count, 1);   // reserve an output slot
8      if (edge_index < D_SIZE_EDGES) {
9          edge_data[edge_index] = edge;
10         return edge_index;
11     }
12     else {
13         printf("Maximum edge size threshold reached: %d", D_SIZE_EDGES);
14         return -1;
15     }
16 }
17
18 __global__
19 void induction_step(int* sampled_vertices, int* offsets,
20                     int* indices, Edge* edge_data) {
21     int neighbor_index_start_offset = blockIdx.x * blockDim.x + threadIdx.x;
22
23     if (neighbor_index_start_offset < D_SIZE_VERTICES
24         && sampled_vertices[neighbor_index_start_offset]) {
25         int neighbor_index_end_offset = neighbor_index_start_offset + 1;
26
27         for (int n = offsets[neighbor_index_start_offset];
28              n < offsets[neighbor_index_end_offset]; n++) {
29             if (sampled_vertices[indices[n]]) {   // both endpoints sampled
30                 Edge edge;
31                 edge.source = neighbor_index_start_offset;
32                 edge.destination = indices[n];
33                 push_edge(edge, edge_data);
34             }
35         }
36     }
37 }

Listing 1: The induction step kernel.
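To complete the picture, a hypothetical host-side launch of this kernel could look as follows; the block size and host-side variable names are illustrative assumptions, not the tool's exact configuration:

    // Hypothetical launch: one thread per vertex (illustrative block size).
    int block_size = 256;
    int grid_size = (num_vertices + block_size - 1) / block_size;

    // Copy the graph sizes into the __constant__ symbols used by the kernel.
    cudaMemcpyToSymbol(D_SIZE_VERTICES, &num_vertices, sizeof(int));
    cudaMemcpyToSymbol(D_SIZE_EDGES, &max_edges, sizeof(int));

    induction_step<<<grid_size, block_size>>>(d_sampled_vertices, d_offsets,
                                              d_indices, d_edge_data);
    cudaDeviceSynchronize();   // wait for the induction step to finish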
