
Predicting the properties of a scaled graph

Kayleigh Schoorl
11022345
Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: Ana-Lucia Varbanescu
Informatics Institute, Faculty of Science
University of Amsterdam
Science Park 904, 1098 XH Amsterdam


Abstract

In the field of graph processing, representative data sets and thorough benchmarking remain challenging, hindering in-depth performance analysis. To solve this problem, a tool has been proposed for scaling a graph up or down, using graph sampling as a main underlying mechanism. However, the correlations that appear between the different input parameters of the scaling tool and the properties of the resulting graphs are not clear.

The aim of this project is to determine the impact of graph scaling on the properties of the resulting graph, with the ultimate goal of assessing whether scaled graphs can be used for performance analysis of real graph processing algorithms.

In the context of this work, the main case-study for measuring performance on the sampled graphs is the PageRank algorithm. The performance of PageRank is directly correlated to the number of edges in a graph. Therefore, we conduct an analysis of the number of edges in samples created using three different sampling methods: Random Node Sampling (RNS), Random Edge Sampling (RES), and Total Induced Edge Sampling (TIES). We find these three methods do provide quite some diversity: RNS shows consistent under-sampling of edges, samples created using RES usually have the expected number of edges, and TIES shows a trend towards over-sampling edges.

Next, we create a model to predict the number of edges in a sample when given the original graph and the sample size. We use a linear regression model on the generated samples, creating one model per sample size for each sampling method. These models show a low error rate for all sampling methods, especially for the bigger sample sizes. RNS and RES both show a very consistent number of edges for all sample sizes. Only TIES shows a large error for the smaller sample sizes, which gradually becomes lower for the bigger sample sizes.

After that, we analyze and predict the performance of PageRank on the sampled graphs. We propose a simple model for predicting the performance, where the ratio of edges for the sampled graph compared to the original graph directly corresponds to the performance ratio. We find that our model shows a large error for small sample sizes, but the error drops significantly for the bigger sample sizes. Moreover, graphs with a high number of edges show a lower error than graphs with a low number of edges, indicating that the performance of PageRank on larger graphs is more stable and, therefore, easier to predict. Additionally, the model performs better for samples generated using TIES than for the other sampling methods, making samples generated using TIES more suitable for graph processing analysis.

Finally, we combine our model for edge-prediction with our PageRank model in order to determine whether our edge-prediction model can be used as a correction factor over a simplistic prediction model only based on the sample size. The results show that our model has led to an improvement in the prediction of PageRank performance and that the predictions are close to the actual measured performance.


Contents

1 Introduction
  1.1 Research question
  1.2 Method and approach
  1.3 Thesis structure

2 Background
  2.1 Graphs
  2.2 Graph scaling
  2.3 Graph sampling
  2.4 Graph processing algorithms

3 Graph scaling analysis and prediction
  3.1 Relevant features for scaled graphs
  3.2 Methodology
  3.3 Random Node Sampling (RNS)
  3.4 Random Edge Sampling (RES)
  3.5 Total Induced Edge Sampling (TIES)
  3.6 Summary

4 Performance analysis and prediction
  4.1 Methodology
  4.2 Random Node Sampling (RNS)
  4.3 Random Edge Sampling (RES)
  4.4 Total Induced Edge Sampling (TIES)
  4.5 RNS vs. RES vs. TIES

5 Combining the models

6 Conclusion
  6.1 Main findings
  6.2 Limitations and future work

Bibliography

A Social network graphs

B Random Node Sampling
  B.1 Number of sampled edges per graph

C Random Edge Sampling
  C.1 Number of sampled edges per graph
  C.2 Regression model results per sample size

D Total Induced Edge Sampling
  D.1 Number of sampled edges per graph
  D.2 Regression model results per sample size


Chapter 1

Introduction

Graphs are becoming increasingly popular as a means to model all kinds of networks. However, representative data sets and thorough benchmarking are still lacking, causing the absence of in-depth performance analysis for graph processing (Musaafir 2017). Some attempts at defining benchmarking methodologies for graph analysis have been made (Iosup et al. 2016, Van Zalingen 2018), but the problem of the lack of representative data sets has still been scarcely covered. The performance of graph processing algorithms is usually tested on data sets from online repositories, or on graphs generated by one of the community-supported generators. These are instances of real-world graphs, with specific, fixed properties. These graphs are therefore isolated samples, making them insufficient for any in-depth performance analysis, scalability studies, or performance modeling attempts.

To tackle this problem, some methods for creating graphs from scratch have been proposed (Chakrabarti, Zhan, and Faloutsos 2004; Leskovec, Chakrabarti, et al. 2010). The parameters of the generated graphs are shown to match the properties of real graphs. The disadvantage of these methods, however, is that the tools are tuned for specific graph properties and therefore lack generality.

A graph scaling tool proposed by Zhang and Tay provides a simple method for scaling graphs (Zhang and Tay 2016). However, it only allows the user to set the number of nodes and edges in the resulting graph, and thus offers few configurable options. The correlations between its input parameters and the resulting graph properties are not shown in detail. This makes it hard to use this tool to create a graph with specific properties, which emphasizes the need for a tool with more configurable options.

Another graph scaling tool has been proposed by Musaafir (Musaafir 2017). Using this tool, a graph can be scaled up or down, using sampling as a main underlying mechanism (N. Ahmed, Neville, and Kompella 2011). The goal of scaling graphs is to provide a family of graphs of configurable sizes, but with similar properties as the original graph. The tool requires users to specify several input parameters that influence the graph scaling process: besides the actual size increase/decrease factor, users need to also set parameters related to sampling, interconnections, and topology. The goal of this project is to determine the correlations between the input parameters and the resulting graph properties.

The experiments for this project will be run on the Distributed ASCI Supercomputer (Bal et al. 2016).


1.1 Research question

Currently, the tool proposed by Musaafir (Musaafir 2017) requires users to specify several input parameters that influence the graph scaling process: besides the actual size increase/decrease factor, users need to also set parameters related to sampling, interconnections, and topology, which are far less straightforward to determine compared to size.

At the moment, the correlations that appear between the different input parameters and the properties of the resulting graphs are not clear. The goal of this project is to determine the correlations between the sample size and the properties of the resulting graph for different sampling methods. These correlations can be exploited to help users select the right configuration for the scaling tool, therefore reducing the effort in building families of scaled graphs with similar properties. In turn, this will further increase the usability and applicability of the graph scaling tool.

Therefore, the research question driving the work in this thesis is:

Is graph scaling a useful tool to generate input data for graph processing performance analysis?

We tackle this rather generic question by focusing on the following specific sub-questions:

• What are the features of interest for assessing graph scaling in the context of graph processing analysis?

• What is the impact of scaling on the properties of a scaled graph?

• What is the impact of the properties of a scaled graph on performance?

1.2 Method and approach

For a broad analysis, we focus on three different sampling algorithms: Random Node Sampling, Random Edge Sampling, and Total Induced Edge Sampling. The main case-study for measuring performance on the sampled graphs is the PageRank algorithm, a representative graph-processing workload with a reasonable balance between computation and memory accesses. First, we determine which properties of a graph are of interest in the context of PageRank performance.

The next step is to perform an initial analysis of a set of graphs. For this analysis, we generate N × M graphs, where N is the number of initial graphs and M is the number of scaling sizes. The properties of these generated graphs will be analyzed in detail by recording the difference between measured and expected data. Based on the results of this initial analysis, more data will be collected.

Using all collected data, a linear regression model is trained to predict the properties of the sampled graphs. We train a model for every sample size and for every sampling algorithm, resulting in 3 × 9 = 27 models. The goal of the models is to predict the properties of interest of a sample in the context of PageRank performance when given the properties of the original graph and the sample size.

The collected data will then also be used to analyze and predict the performance of PageRank on the sampled graphs, in order to identify the correlation between the performance and the graph properties. The goal is to predict the performance based on these properties.

1.3 Thesis structure

This thesis is structured as follows.

Chapter 2 presents the necessary terminology and background for understanding our work. Specifically, we discuss the basic notions of graphs, and their sampling, scaling, and basic processing.

Chapter 3 presents the methods we used for analyzing the sampled graphs. Additionally, we present our model for predicting the properties of scaled graphs and demonstrate it for three different sampling algorithms.

Chapter 4 presents the methods we used for measuring performance on the sampled graphs. We focus our analysis on the performance of PageRank. Next, we propose a simple model for predicting the performance on a sampled graph and validate this model using PageRank.


Chapter 2

Background

In this chapter, we briefly present the necessary terminology and background information required for understanding our work. Specifically, we discuss the basic notions of graphs, and their sampling, scaling, and basic processing.

2.1 Graphs

A graph $G$ can be formally defined as $G = \langle V, E \rangle$, where $V$ is a set of vertices or nodes $\{u_1, u_2, \dots, u_n\}$, and $E$ is a set of edges connecting the vertices, $\{(u_1, u_2), (u_a, u_b), \dots\}$ (Reinhard 2010). A graph can be directed or undirected. In a directed graph, the vertices of every edge $(u_a, u_b)$ are ordered, i.e., the edge from vertex $u_a$ to vertex $u_b$ is unidirectional; in an undirected graph, edges are bidirectional, i.e., $(u_a, u_b)$ means there is a direct connection from $u_a$ to $u_b$ and from $u_b$ to $u_a$.

A graph has various properties. The most basic ones are the order and the size, i.e., the number of vertices (|V |) and the number of edges (|E|), respectively. Next, the graph’s average degree measures the average number of neighbors a vertex has and is computed as the ratio between the number of edges and the number of vertices. The density of a graph shows how many edges exist in the current graph, compared to the maximum possible number of edges in a graph of that order. The maximum number of edges is that of a fully connected graph, where each vertex is directly connected, through an edge, to every other vertex. The average path length is the average shortest path length between all combinations of vertices inside a graph. The diameter is determined by taking the longest path of all shortest paths between any two vertices in a graph.

For undirected graphs, the number of connected components is defined as the number of components of the graph that are unreachable from each other. For directed graphs, we differentiate between strongly connected components and weakly connected components. Strongly connected components are components where the vertices inside each component are accessible to each other. Weakly connected components are components where vertices are reachable when ignoring the direction of the edges/path between them.
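To make these definitions concrete, the short sketch below computes the properties listed above with the networkx library. This is an illustrative sketch only: the thesis measures graph properties with the SNAP library, and the small random graph at the end is an arbitrary stand-in, not one of the data sets used in this work.

```python
import networkx as nx


def basic_properties(G: nx.Graph) -> dict:
    """Compute the graph properties described in Section 2.1 for an undirected graph."""
    n, m = G.number_of_nodes(), G.number_of_edges()
    props = {
        "order": n,                                        # |V|
        "size": m,                                         # |E|
        "avg_degree": sum(d for _, d in G.degree()) / n,   # average number of neighbors
        "density": nx.density(G),                          # edges relative to a fully connected graph
        "connected_components": nx.number_connected_components(G),
    }
    # Diameter and average shortest path length are only defined for connected graphs,
    # so we compute them on the largest connected component.
    largest_cc = G.subgraph(max(nx.connected_components(G), key=len))
    props["diameter"] = nx.diameter(largest_cc)
    props["avg_shortest_path"] = nx.average_shortest_path_length(largest_cc)
    return props


if __name__ == "__main__":
    demo = nx.erdos_renyi_graph(200, 0.05, seed=42)        # small stand-in graph
    print(basic_properties(demo))
```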


2.2 Graph scaling

In the field of graph processing, representative data sets and thorough benchmarking remain challenging, hindering in-depth performance analysis. To solve this problem, a tool has been proposed for scaling a graph up or down, using graph sampling as a main underlying mechanism (Musaafir 2017). Specifically, the main idea of this tool is to use the well-known process of graph sampling to create a "smaller" version of the original graph and combine multiple such graph samples into a graph as large as required.

To drive graph scaling, the tool requires users to provide two parameters: a sample size and a scaling factor. The former defines how large a chunk will be sampled from the original graph, while the latter indicates how large the new scaled graph should be. Note that in the case of scaling down, the parameters are considered equal.

When scaling up, the tool also requires the user to set the number of interconnections, which determines how many new edges should be used to connect the smaller samples in order to combine them into one large graph. This parameter may impact the average degree of the graph. Additionally, a Boolean argument called force directed bridges determines whether the interconnections between the samples need to be directed. The bridging type decides how the samples will be connected: when set to 'random', random vertices from the samples will be connected, and when set to 'high', only high degree vertices from the samples will be connected. This parameter may impact the actual degree distribution of the graph.

The topology determines the way the sampled graphs will be connected. There are four different options for this parameter: star, chain, fully connected, and ring. The star topology connects the graphs in a star-like shape, with one sample in the middle and the other samples being connected to this one. The chain topology connects the samples one by one, creating a chain of graphs. The ring topology is similar to the chain topology, but connects the first and last graph in the chain, creating a ring-like shape. The fully connected topology connects all samples with one another.
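As an illustration of the four topology options, the sketch below lists which pairs of samples would receive bridging edges for each option. This only mirrors the textual description above; the actual scaling tool (Musaafir 2017) additionally selects the bridge vertices and the number of interconnections, which are not modeled here.

```python
def topology_pairs(num_samples: int, topology: str) -> list:
    """Return the (i, j) pairs of samples that get connected by bridging edges."""
    if topology == "star":                 # one sample in the middle, the rest attached to it
        return [(0, j) for j in range(1, num_samples)]
    if topology == "chain":                # samples connected one by one
        return [(i, i + 1) for i in range(num_samples - 1)]
    if topology == "ring":                 # a chain whose last sample is connected to the first
        return [(i, i + 1) for i in range(num_samples - 1)] + [(num_samples - 1, 0)]
    if topology == "fully_connected":      # every sample connected to every other sample
        return [(i, j) for i in range(num_samples) for j in range(i + 1, num_samples)]
    raise ValueError(f"unknown topology: {topology}")


print(topology_pairs(4, "ring"))           # [(0, 1), (1, 2), (2, 3), (3, 0)]
```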

Lastly, the sampling method needs to be specified - we briefly detail the choices for sampling in the following section.

2.3 Graph sampling

The graph scaling tool currently supports three different sampling methods: Random Node Sampling (RNS), Random Edge Sampling (RES), and Total Induced Edge Sampling (TIES).

RNS randomly selects a subset of vertices from the original graph. The number of selected vertices depends on the desired sample size. Next, an edge is selected to be added to the sample if both of its vertices belong to the sampled set of vertices (Musaafir 2017).

RES is similar to RNS, but instead of vertices, it randomly selects a subset of edges from the original graph. The end vertices of the selected edges are then included in the sample (Musaafir 2017).

The TIES algorithm consists of two steps. First, a random set of edges from the original graph is selected, along with their end vertices, in the same way as for RES. The second step is to traverse all the edges in the original graph and add every remaining edge whose end vertices are both in the selected vertex set (graph induction). This method retains the properties of the original graph better than most other sampling algorithms (N. Ahmed, Neville, and Kompella 2011).
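The following sketch mirrors the three descriptions above in networkx. It is not the code of the scaling tool itself; in particular, it assumes the sample size can be interpreted as a fraction of vertices for RNS and as a fraction of edges for RES and for the first step of TIES.

```python
import random

import networkx as nx


def rns(G: nx.Graph, sample_size: float, seed=None) -> nx.Graph:
    """Random Node Sampling: pick vertices, keep only edges between picked vertices."""
    rng = random.Random(seed)
    nodes = rng.sample(list(G.nodes()), int(sample_size * G.number_of_nodes()))
    return G.subgraph(nodes).copy()


def res(G: nx.Graph, sample_size: float, seed=None) -> nx.Graph:
    """Random Edge Sampling: pick edges, include their end vertices."""
    rng = random.Random(seed)
    edges = rng.sample(list(G.edges()), int(sample_size * G.number_of_edges()))
    return nx.Graph(edges)


def ties(G: nx.Graph, sample_size: float, seed=None) -> nx.Graph:
    """Total Induced Edge Sampling: RES-style edge step, followed by graph induction."""
    edge_step = res(G, sample_size, seed)
    # Second step: add every original edge whose end vertices are both in the sampled set.
    return G.subgraph(edge_step.nodes()).copy()
```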

2.4 Graph processing algorithms

The scaling tool has been created to tackle the lack of representative data sets for the in-depth performance analysis of graph processing algorithms. To evaluate whether the graphs created by this tool are suitable for performance analysis, we measure and analyze the performance of one such algorithm - namely, the PageRank algorithm - on the generated graphs.

PageRank is the algorithm originally developed for Google Search to rank web pages. Pages are represented as vertices in a graph, and links from one web page to the other are represented as edges between vertices. PageRank assigns scores to web pages in order to determine the rank of each page. The most important factors for determining the score of a web page are the number of other web pages referring to that page and the scores of those web pages. The PageRank of a page is a value between 0 and 1 that expresses the probability that someone will arrive at that particular page when randomly clicking a link, and is one of the factors that determine the web pages that appear on Google's Search Engine results page. The time complexity of the PageRank algorithm is O(|V| + |E|), since in the worst case, it has to visit all the vertices and edges in the graph (Page et al. 1999).
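To illustrate why the cost of PageRank is dominated by touching every vertex and edge in each iteration, a minimal power-iteration version is sketched below. The thesis measures PageRank with the GAP Benchmark Suite; this pure-Python version, with an invented four-page example, is for illustration only.

```python
def pagerank(out_links, damping=0.85, iterations=20):
    """out_links maps each vertex to the list of vertices it links to."""
    vertices = list(out_links)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}               # start from a uniform distribution
    for _ in range(iterations):                         # m iterations ...
        new_rank = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:                              # ... each visiting every vertex ...
            targets = out_links[v]
            if not targets:                             # dangling vertex: spread its rank evenly
                for u in vertices:
                    new_rank[u] += damping * rank[v] / n
                continue
            share = damping * rank[v] / len(targets)
            for u in targets:                           # ... and every edge: O(|V| + |E|) per iteration
                new_rank[u] += share
        rank = new_rank
    return rank


web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}  # hypothetical link structure
print(pagerank(web))                                          # "c" receives the highest score
```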

There are many different ways to implement graph processing algorithms such as PageRank. This leads to a diversity of algorithms, each performing best on different machines and data sets. It is this diversity of algorithms that requires representative data sets of multiple sizes and shapes in order to perform in-depth performance analysis. For our implementation, we used the GAP Benchmark Suite to measure the performance of PageRank on sampled graphs (Beamer, Asanovic, and Patterson 2015). Other systems were also investigated and may be revisited in future work, but GAP proved most suitable for our needs.


Chapter 3

Graph scaling analysis and prediction

In this chapter, we discuss the methods we used for analyzing the sampled graphs. Additionally, we present our model for predicting the properties of scaled graphs and demonstrate it for three different sampling algorithms.

3.1 Relevant features for scaled graphs

Our main case-study is PageRank, the algorithm used by Google to assign a score to web pages based on the number of times the page is being linked to, and on the scores of the other pages linking to it. The score of a web page is, therefore, partly determined by the number of incoming hyperlinks. A full iterative computation based on the simplest implementation of the PageRank algorithm requires $O(m \times \|H\|)$ FLOPS, where $m$ is the number of iterations and $\|H\|$ is the total number of hyperlinks on the web (Bianchini, Gori, and Scarselli 2005). So, the number of FLOPS required for an iterative computation of PageRank is dependent on the number of hyperlinks connecting the web pages.

In graphs, web pages are represented as vertices, with the hyperlinks connecting them being represented as edges. Therefore, the performance of PageRank on a graph depends on the number of edges in the graph. Since the number of edges of the sampled graph is directly correlated to the actual performance of PageRank on that sampled graph, we focus our analysis on the understanding and prediction of the number of edges of sampled graphs when using different sampling algorithms.

Sampling a graph using the scaling tool requires only two input parameters: the sampling algorithm and the sample size (Musaafir 2017). Our goal is to understand how the properties of the resulting samples depend on these two parameters. We investigate three different sampling algorithms (RNS, RES, and TIES) and 9 different sample sizes (0.1 - 0.9).

We note that restricting the analysis to only sampling (i.e., scaling down) does not impose any restrictions on predicting these parameters for the expanded graphs. Given the simple heuristic we use to generate larger graphs, determining their number of vertices and edges is straightforward, as seen in the analytical model in Equation 3.1 and Equation 3.2.

$$\|E_f\| = \sum_{i=0}^{s} \|E_i\| + n \qquad (3.1)$$

$$\|V_f\| = \sum_{i=0}^{s} \|V_i\| \qquad (3.2)$$

Therefore, we will train a model to predict the number of edges in a sample given the original graph and the sample size.

3.2 Methodology

First, we selected a set of graphs and sampled them for 9 different sample sizes. For the initial analysis, we selected graphs from the SNAP data set collection (Leskovec and Krevl 2019); we selected a diverse set in terms of type (i.e., domain of origin) and average degree.

As an example, the characteristics of four of the graphs are shown in Table 3.1. The SNAP library has been used to measure the properties of the graphs. According to SNAP, Amazon is a network with ground-truth community, Email-Enron is a communication network, Facebook is a social network, and RoadNet-PA is a road network. Note that their average degree ranges between 2.8 and 43.7. Also note that they show a large diversity in terms of other properties, like diameter and density.

                        Amazon       Email-Enron   Facebook     RoadNet-PA
Vertices                334873       36720         4039         1088112
Edges                   925879       183847        88234        1541910
Avg. degree             6            10.0135       43.691       2.8341
Graph density           1.6513e-05   0.000272705   0.01082      2.60461e-06
Connected components    4            1077          1            214
Avg. CC                 0.396734     0.496604      0.605547     0.0464759
Diameter                42           11            8            784
Avg. shortest path      11.88        4.03          3.69         307.75

Table (3.1) Four graphs from varying classes.

We varied the sample size from 0.1 to 0.9. For each sample size, we created three samples and took the average number of edges of those three samples in order to account for the randomness factor of the sampling methods.

This measured number of edges per sample size is compared to the expected number of edges. In the ideal case, the number of edges in the sampled graph is proportional to the sample size - for example, if the sample size is 0.5, the number of edges should be half of the original number of edges. Therefore, the expected number of edges is calculated by multiplying the number of edges of the original graph by the sample size.
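A minimal sketch of this comparison is shown below. It assumes a sampler function such as the illustrative `ties` sketch from Section 2.3 and reports, per sample size, the expected edge count, the measured average over three samples, and the relative deviation; the thesis itself performs this analysis with the scaling tool and the SNAP library.

```python
SAMPLE_SIZES = tuple(round(0.1 * k, 1) for k in range(1, 10))   # 0.1 .. 0.9


def edge_deviation(G, sampler, sample_sizes=SAMPLE_SIZES, repeats=3):
    """Compare measured and expected edge counts per sample size for one graph."""
    results = {}
    for p in sample_sizes:
        measured = sum(sampler(G, p, seed=r).number_of_edges()
                       for r in range(repeats)) / repeats       # average over three samples
        expected = p * G.number_of_edges()                      # ideal: edges scale with sample size
        results[p] = (expected, measured, (measured - expected) / expected)
    return results
```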

For the four graphs included in Table 3.1, we present the two values - measured and expected numbers of edges - in Figure 3.1. The sampling algorithm we have used in this analysis is TIES. We observe that (1) there are differences between the expected and measured number of edges per graph, and (2) the actual difference between the two values varies considerably per graph. For example, the RoadNet-PA graph shows edge under-sampling (i.e., the measured number of edges is below the expected one), while the Facebook and Email-Enron graphs both show over-sampling (i.e., the measured number of edges is above the expected one).

Figure (3.1) Predicted and actual number of edges for graphs from varying classes

Because of these varying results, we expect a model trained across different classes of graphs to have difficulties predicting accurately. Thus, we decided to focus on a single class of graphs. As proof of concept, we selected social networks as our class of interest - this is the class to which the Facebook graph belongs. We selected 12 social network graphs, all undirected, from the SNAP (Leskovec and Krevl 2019) and Network Repository (Rossi and N.K. Ahmed 2015) collections. We selected these graphs based on their relatively low number of vertices and edges, choosing only graphs with fewer than 500,000 vertices, since larger graphs would take a long time to analyze. Four of these graphs are shown in Table 3.2. The SNAP library has been used to measure the properties of the graphs. The rest of the graphs can be found in Appendix section A.

We use these graphs to generate samples while varying the sample size and the sampling method. We further use the resulting samples to train a model to predict the number of edges in the samples based on the properties of the original graph and the sample size. Because of the small number of data sets, and because we only need the model to predict one property, we decided to build our model using linear regression.

To train the model, we used the Machine Learning library scikit-learn (Pedregosa et al. 2011). We created one model per sample size, for which we trained and validated the model on 8 and 4 social network graphs, respectively.


                        Academia     BlogCatalog   Catster      Livemocha
Vertices                200169       88790         149703       104109
Edges                   1022441      2093199       5240663      2193087
Avg. degree             10.2158      47.1494       70.0141      42.1306
Graph density           5.1036e-5    0.000531028   0.00046769   0.000404682
Connected components    2            3             28           3
Avg. CC                 0.222842     0.353311      0.259748     0.0544106
Diameter                14           8             6            6
Avg. shortest path      4.78         3.07          2.54         3.2

Table (3.2) Four graphs of the social network class.

Given the low observed error of the model, as well as the filtering of the graphs to belong to a single class, the training data seems to be sufficient for good accuracy. The training data set can be further increased (and the process repeated) by adding more samples (of the same size) for the chosen graphs, as well as by adding new undirected social network graphs. This extension could further increase the confidence in our model.
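A minimal sketch of this setup is given below: one LinearRegression model per sample size, trained on the four graph properties used in the tables of this chapter. The feature rows are rounded versions of graphs listed in Appendix A, while the target edge counts are invented placeholders, not the thesis data set; in practice one such model is fitted for each of the 27 sampling-method/sample-size combinations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One row per training graph: [nodes, edges, density, clustering coefficient].
X_train = np.array([
    [58228, 214078, 1.26e-4, 0.172],
    [149703, 5240663, 4.68e-4, 0.260],
    [104109, 2193087, 4.05e-4, 0.054],
])
# Target: measured number of edges in the 0.5 sample of each training graph (placeholders).
y_train = np.array([100000, 3100000, 1200000])

model = LinearRegression().fit(X_train, y_train)
print(model.intercept_, model.coef_)   # these play the role of the intercept and slope in Table 3.3

# Predicting the edge count of a held-out validation graph (placeholder properties).
X_val = np.array([[200169, 1022441, 5.10e-5, 0.223]])
print(model.predict(X_val))
```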

In the following sections, we discuss the results of our modeling approach for each graph sampling method.

3.3 Random Node Sampling (RNS)

Following our defined methodology, we created the graph samples using RNS and compared the number of predicted edges to the measured number of edges in the samples. Again, we created three samples per sample size and took the average number of edges of those three samples in order to account for the randomness factor of the sampling method. The results for four graphs can be seen in Figure 3.2. The rest of the results can be found in Appendix section B. The complete data set of all the sampled graphs, including all graph properties, can be found in the "Graph property analysis data" online drive (Schoorl 2019). It can be seen that the samples all show a considerable degree of under-sampling of edges. This behavior is expected, because RNS first selects a random subset of nodes according to the sample size, and then adds the edges that connect the selected nodes. Especially for lower sample sizes, there is a high chance of selecting nodes that are not connected to each other.

The results of a regression model on graphs of sample size 0.5 can be found in Table 3.3. The Mean Absolute Percentage Error (MAPE) is calculated by computing the relative difference between the predicted and the actual number of edges and then taking the average of this number over all the validation graphs. The model has been trained on the properties listed in the table and is defined by the intercept and slope. The differences between the predicted and the actual number of edges per graph can be found in Figure 3.3. The predicted and actual number of edges differ very little for every graph. The results for the other sample sizes can be found in Appendix B.
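For reference, the error measure described above can be written compactly as follows, where $n$ is the number of validation graphs, $E_i$ the measured number of edges of sample $i$, and $\hat{E}_i$ the number predicted by the model:

$$\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left|\frac{\hat{E}_i - E_i}{E_i}\right|$$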

To see how well a model could be trained for each sample size, we trained models for each sample size from 0.1 to 0.9, using the same training and validation graphs for each sample size. The MAPE per sample size is shown in Figure 3.4.


Figure (3.2) Predicted and actual number of edges for graphs from the social network class using the RNS method.

The models for the smaller sample sizes show a comparatively high error, and therefore cannot accurately predict the number of edges for those sample sizes. For the larger sample sizes, from 0.3 on, the model shows a very low MAPE, so the number of edges can be predicted with high accuracy for these sample sizes. The only clear outlier is sample size 0.8 - but even there, with an error of about 5%, the model is still reasonably accurate.

Training graphs: Brightkite, Catster, Datagen coo 1, Datagen coo 2, Dogster, Gowalla, Livemocha, TheMarker
Validation graphs: Academia, BlogCatalog, Buzznet, Datagen coo 3
Properties: Nodes, edges, density, clustering coefficient
Intercept: -1552.0347274097148
Slope: [-4.10399515e-02, 2.54555322e-01, -6.93467811e+04, -1.23770608e+04]
Mean Absolute Percentage Error: 2.22%

Table (3.3) Regression model for sample size 0.5, using RNS.


Figure (3.3) The results of the regression model on sample size 0.5 for RNS

Figure (3.4) Mean Absolute Percentage Error per sample size for RNS

3.4 Random Edge Sampling (RES)

Again, following our methodology, we created the graph samples using RES and compared the number of predicted edges to the actual number of edges in the samples. The results for four graphs can be seen in Figure 3.5. The rest of the results can be found in Appendix C. The complete data set of all the sampled graphs that includes all graph properties can be found in the "Graph property analysis data" online drive (Schoorl 2019).

The number of edges in the resulting samples is very close to the predicted number of edges. This is because RES selects a subset of edges based on the sample size, so the number of edges in the sample will always be proportional to the sample size. Creating a model in order to predict the number of edges is thus not really necessary - however, we still build a model for consistency.

The results of a regression model on graphs of sample size 0.5 can be found in Table 3.4. The differences between the predicted and the actual number of edges per graph can be found in Figure 3.6.


Figure (3.5) Predicted and actual number of edges for graphs from the social network class using the RES method.

The results for the other sample sizes can be found in Appendix C.

We present the MAPE of our regression-based model for RES in Figure 3.7. It can be seen that the model has a higher error on the smaller sample sizes. The MAPE gradually decreases as the sample size increases, with the error decreasing from about 9% to about 4%. While the error is lowest on the larger sample sizes, the error is quite low overall, and therefore the model can predict the number of edges for samples created using RES quite well.

Training graphs: Brightkite, Catster, Datagen coo 1, Datagen coo 2, Dogster, Gowalla, Livemocha, TheMarker
Validation graphs: Academia, BlogCatalog, Buzznet, Datagen coo 3
Properties: Nodes, edges, density, clustering coefficient
Intercept: -39004.52583417902
Slope: [-2.06439100e-01, 5.19740017e-01, -5.65857520e+06, 2.47278869e+05]
Mean Absolute Percentage Error: 7.00%

Table (3.4) Regression model for sample size 0.5, using RES.


Figure (3.6) The results of the regression model on sample size 0.5 for RES

Figure (3.7) Mean Absolute Percentage Error per sample size for RES

3.5 Total Induced Edge Sampling (TIES)

Our final sampling algorithm is TIES, for which we also created the graph samples and compared the number of predicted edges to the actual number of edges in the samples. The results for four graphs can be seen in Figure 3.8. The rest of the results can be found in Appendix D. The complete data set of all the sampled graphs that includes all graph properties can be found in the "Graph property analysis data" online drive (Schoorl 2019).

From these charts, it can be seen that all sampled graphs show over-sampling of edges, as expected, but they differ considerably in the degree of over-sampling. The over-sampling is expected, since the first step of TIES is the same as RES, and more edges are added to the sample in the second step. No direct correlation between the properties of the original graphs and the resulting degree of over-sampling of edges in the samples can be found by observation.

The results of a regression model on graphs of sample size 0.5 can be found in Table 3.5. The differences between the predicted and the actual number of edges per graph can be found in Figure 3.9.


Figure (3.8) Predicted and actual number of edges for graphs from the social network class using the TIES method.

The results vary considerably per graph. For example, the model predicts the number of edges for the Buzznet graph very well, while the predicted and actual number of edges for Datagen coo 3 show a substantial difference. The results for the other sample sizes can be found in Appendix section D.

The overall results for TIES can be found in Figure 3.10. It can be seen that the model has a high error on the smaller sample sizes and therefore cannot accurately predict the number of edges for those sample sizes. From sample size 0.6 on, the model shows a MAPE of lower than 10%, so the number of edges can be predicted with high accuracy. The sample sizes in between, from 0.3 to 0.6, show a higher MAPE but can still be predicted relatively well.

Training graphs: Brightkite, Catster, Datagen coo 1, Datagen coo 2, Dogster, Gowalla, Livemocha, TheMarker
Validation graphs: Academia, BlogCatalog, Buzznet, Datagen coo 3
Properties: Nodes, edges, density, clustering coefficient
Intercept: -34790.502646458335
Slope: [-8.60522731e-01, 8.35436886e-01, -1.38529690e+08, 7.44983337e+05]
Mean Absolute Percentage Error: 10.94%

Table (3.5) Regression model for sample size 0.5, using TIES.


Figure (3.9) The results of the regression model on sample size 0.5 for TIES

Figure (3.10) Mean Absolute Percentage Error per sample size for TIES

3.6 Summary

Our main case-study for analyzing graph processing performance is PageRank, where performance is correlated with the number of edges in a graph. Therefore, the goal of this analysis is to determine whether we can predict the number of edges in sampled graphs generated by different sampling algorithms. To this end, we conducted an analysis of the number of edges in samples created using three different sampling methods: RNS, RES, and TIES. The graphs used for generating samples are all from the social network class. For these graphs, RNS showed considerable under-sampling of edges, samples created using RES usually had the expected number of edges, and TIES showed over-sampling.

Next, to predict the number of edges for a sample, we trained linear regression models on the generated samples, creating one model per sample size for each sampling method. These models performed well for all sampling methods, especially on the bigger sample sizes. RNS and RES both showed a very consistent number of edges in the sampled graphs, making it easier to predict


the number of edges for new graphs. Only TIES showed a very large error for the smaller sample sizes, which decreased for the bigger sample sizes. This can be explained by the fact that the number of edges in samples generated using TIES is partially random, and therefore harder to predict than for RNS and RES, especially for the smaller sample sizes.


Chapter 4

Performance analysis and prediction

In this chapter, we discuss the methods we used for measuring performance on the sampled graphs. We focus our analysis on the performance of PageRank on the sampled graphs. We propose a simple model for predicting the algorithm performance, where the ratio of the number of edges in the sampled graph compared to the original graph corresponds to the ratio of the performance on the sampled graph compared to the original graph. We validate this model using real performance data, collected when executing PageRank.

Ultimately, our goal is to demonstrate the usefulness of graph sampling methods for generating data sets similar to an original graph. Therefore, we expect the PageRank performance behavior on the sampled graphs to be similar to that on the original graph. Our model, which we test and validate in this chapter, aims to capture this similar behavior. Non-similar behavior would be an indication that sampling destroys some of the graph properties relevant to PageRank.

4.1 Methodology

To predict the performance of PageRank on the sampled graphs, we propose a simple predictive model, starting from the idea that the most impactful performance factor is the number of edges. Specifically, we postulate that the ratio between the PageRank execution time on two similar graphs - i.e., the original and the sample - is the same as the ratio between their numbers of edges. Thus, if we determine the execution time of PageRank on the original graph, $T_O$, and we can calculate the ratio between the number of edges in the sampled graph, $e_S$, and in the original graph, $e_O$, we can determine the execution time on the sampled graph by multiplying the execution time of the original graph with this edge ratio: $T_S = T_O \times \frac{e_S}{e_O}$. For example, if the number of edges in the sample is half the number of edges in the original, the predicted execution time for processing the sampled graph will be half of the execution time for processing the original.
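A minimal sketch of this model is given below; the numbers in the example are placeholders, not measurements from this thesis.

```python
def predict_sample_time(t_original: float, edges_original: int, edges_sample: int) -> float:
    """Simple model: T_S = T_O * (e_S / e_O)."""
    return t_original * edges_sample / edges_original


def relative_error(predicted: float, measured: float) -> float:
    """Relative error (in percent) between predicted and measured execution time."""
    return abs(predicted - measured) / measured * 100.0


# Placeholder example: a sample with half the edges is predicted to take half the time.
t_pred = predict_sample_time(t_original=0.080, edges_original=5_000_000, edges_sample=2_500_000)
print(t_pred, relative_error(t_pred, measured=0.045))
```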

To determine the execution time on the different graphs, we must select one of the many different implementations of PageRank. We note that the selection itself is unlikely to bias our results: as long as we use the same algorithm for all graphs, we can reason about similar or non-similar behavior.

In this work, we have used the GAP Benchmark Suite for measuring the performance of PageRank on different graphs (Beamer, Asanovic, and Patterson 2015). All our experiments were executed on the Distributed ASCI Supercomputer (Bal et al. 2016).

In order to determine the effect of parallelism on the PageRank performance on all our graphs, we used two different versions of PageRank in GAP: a sequential one (i.e., using a single thread) and a 16-thread one. Both versions run on the same 16-core CPU.

In the following sections, we review the results for each sampling algorithm.

4.2 Random Node Sampling (RNS)

To validate our simple predictive model, we calculate the ratio of the number of edges in the sampled graph compared to the original graph and use this ratio to calculate the expected execution time. Then, we compare this expected execution time to the actual execution time, measured by taking the average of 10 trials, and calculate the average error. Note that the number of edges shown here is the number of edges as calculated by GAP, which might differ from the number of edges calculated by the SNAP library used in Chapter 3 - however, the resulting ratio is the same.

Table 4.1 shows the performance data for the single thread version of GAP run on the Catster graph sampled with RNS and Table 4.2 shows the data for the 16-thread version. The performance data for the other graphs can be found in the "Graph property analysis data" online drive (Schoorl 2019).

From the results, it is clear that the simple predictive model does not work very well for smaller sample sizes. For the larger sample sizes, 0.7 and 0.9, the performance can be predicted with relatively low error. The difference between the lower and higher sample sizes seems to indicate that execution time measured on bigger graphs is more stable and easier to predict. The difference in error between the single thread and the 16-thread versions of GAP is minimal, with the 16-thread version only performing significantly better for sample size 0.3.

             Edges     Edges ratio   Pred. avg. execution time   Actual avg. execution time   Relative error
Original     5240663   -             -                           0.077                        -
0.30         531447    0.101         0.008                       0.017                        53.23%
0.50         1297615   0.248         0.019                       0.023                        14.71%
0.70         2472053   0.472         0.037                       0.035                        5.48%
0.90         4498623   0.858         0.067                       0.068                        2.67%

Table (4.1) Performance analysis results for Catster, sampled using RNS, measured using a single thread version of GAP.


             Edges     Edges ratio   Pred. avg. execution time   Actual avg. execution time   Relative error
Original     5240663   -             -                           0.055                        -
0.30         531447    0.101         0.006                       0.008                        28.77%
0.50         1297615   0.248         0.014                       0.016                        14.57%
0.70         2472053   0.472         0.026                       0.027                        3.70%
0.90         4498623   0.858         0.047                       0.046                        1.43%

Table (4.2) Performance analysis results for Catster, sampled using RNS, measured using a 16-thread version of GAP.

4.3 Random Edge Sampling (RES)

Table 4.3 shows the performance data for the single thread version of GAP run on the Catster graph sampled with RES and Table 4.4 shows the data for the 16-thread version. The performance data for the other graphs can be found in the "Graph property analysis data" online drive (Schoorl 2019).

The results show some instability, with the execution time for the original graph being lower than the execution time on a sampled graph in some particular cases. The error measured on graphs sampled using RES is overall lower than the error for RNS. This can be explained by the higher number of edges for the samples, since RNS under-samples the number of edges, while samples created using RES have the expected number of edges (see Chapter 3). The 16-thread version of GAP shows lower error than the single thread version for most sample sizes.

             Edges     Edges ratio   Pred. avg. execution time   Actual avg. execution time   Relative error
Original     5240663   -             -                           0.078                        -
0.30         1634456   0.312         0.024                       0.034                        28.64%
0.50         2724095   0.520         0.040                       0.038                        6.87%
0.70         3813712   0.728         0.056                       0.053                        7.11%
0.90         4903386   0.936         0.073                       0.065                        10.89%

Table (4.3) Performance analysis results for Catster, sampled using RES, measured using a single thread version of GAP.

             Edges     Edges ratio   Pred. avg. execution time   Actual avg. execution time   Relative error
Original     5240663   -             -                           0.055                        -
0.30         1634456   0.312         0.017                       0.020                        12.82%
0.50         2724095   0.520         0.028                       0.034                        15.25%
0.70         3813712   0.728         0.040                       0.039                        1.20%
0.90         4903386   0.936         0.051                       0.049                        4.89%

Table (4.4) Performance analysis results for Catster, sampled using RES, measured using a 16-thread version of GAP.


4.4 Total Induced Edge Sampling (TIES)

Table 4.5 shows the performance data for the single thread version of GAP run on the Catster graph sampled with TIES and Table 4.6 shows the data for the 16-thread version. The performance data for the other graphs can be found in the "Graph property analysis data" online drive (Schoorl 2019).

TIES shows an overall lower error rate than both RNS and RES, especially for the lower sample sizes. The difference is smaller for the larger sample sizes. While the results vary per graph, the execution time can be predicted quite well for samples generated using TIES.

             Edges     Edges ratio   Pred. avg. execution time   Actual avg. execution time   Relative error
Original     5240663   -             -                           0.078                        -
0.30         3337850   0.637         0.049                       0.041                        21.90%
0.50         4309205   0.822         0.064                       0.054                        18.82%
0.70         5002281   0.955         0.074                       0.068                        9.50%
0.90         5403704   1.031         0.080                       0.077                        3.57%

Table (4.5) Performance analysis results for Catster, sampled using TIES, measured using a single thread version of GAP.

             Edges     Edges ratio   Pred. avg. execution time   Actual avg. execution time   Relative error
Original     5240663   -             -                           0.055                        -
0.30         3337850   0.637         0.035                       0.036                        4.11%
0.50         4309205   0.822         0.045                       0.044                        2.60%
0.70         5002281   0.955         0.052                       0.051                        2.34%
0.90         5403704   1.031         0.056                       0.055                        3.39%

Table (4.6) Performance analysis results for Catster, sampled using TIES, measured using a 16-thread version of GAP.

4.5 RNS vs. RES vs. TIES

In this chapter, we analyzed the performance of PageRank on the sampled graphs described in Chapter 3. Our analysis focused on the comparison between the performance predicted by our simple PageRank model and the actual measured performance. To this end, we summarize our results in the form of the relative error between the prediction and the measured performance. We report the data for every analyzed graph, with the average error for each sampling method, and using different sample sizes. Specifically, Table 4.7 shows the results for sample size 0.5, and Table 4.8 shows the results for sample size 0.9.

An interesting result is that while there seems to be a large difference between the single thread and the 16-thread versions of GAP when looking at the individual graphs, the average error is not significantly different between the two versions. For sample size 0.9, the error on the 16-thread version is slightly worse than for the single thread version.


For sample size 0.5, TIES shows the lowest average error, with RES performing slightly worse, and RNS considerably worse. As for sample size 0.9, RNS performs better than both RES and TIES, with TIES having a slightly lower error than RES. However, the difference between the sampling methods for sample size 0.9 is significantly smaller than the difference for sample size 0.5. Therefore, while the error for RNS is lower than for TIES for sample size 0.9, TIES shows a more consistently low average error.

                 RNS (1)   RES (1)   TIES (1)   RNS (16)   RES (16)   TIES (16)
Academia         64.11%    32.33%    21.71%     62.92%     8.35%      21.44%
Blogcatalog      42.23%    11.65%    16.64%     20.95%     11.56%     0.25%
Brightkite       57.31%    13.98%    3.05%      58.32%     8.17%      20.97%
Buzznet          47.17%    23.71%    2.55%      36.72%     1.29%      3.18%
Catster          14.71%    6.87%     18.82%     14.57%     15.25%     2.60%
Datagen coo 1    46.89%    23.92%    10.82%     48.21%     30.98%     3.61%
Datagen coo 2    57.60%    22.88%    15.33%     50.01%     32.60%     31.38%
Datagen coo 3    52.04%    20.39%    25.37%     49.63%     27.46%     17.17%
Dogster          0.60%     7.92%     2.09%      3.90%      16.19%     3.57%
Gowalla          60.94%    17.28%    30.10%     58.63%     35.68%     24.94%
Livemocha        40.16%    15.07%    7.92%      58.32%     19.37%     14.64%
Themarker        40.55%    36.81%    31.35%     54.98%     55.12%     5.85%
Average          43.69%    19.40%    15.48%     43.10%     21.84%     12.47%

Table (4.7) Relative error for sample size 0.5 for every graph.

                 RNS (1)   RES (1)   TIES (1)   RNS (16)   RES (16)   TIES (16)
Academia         9.05%     4.18%     1.98%      9.70%      23.90%     42.50%
Blogcatalog      6.84%     13.00%    16.94%     0.82%      18.10%     6.34%
Brightkite       2.13%     33.91%    44.80%     9.12%      30.23%     24.98%
Buzznet          2.02%     11.94%    1.71%      5.53%      20.52%     6.50%
Catster          2.67%     10.89%    3.57%      1.43%      4.89%      3.39%
Datagen coo 1    6.03%     2.41%     0.67%      7.69%      2.72%      10.23%
Datagen coo 2    14.82%    1.62%     1.14%      12.61%     11.44%     0.44%
Datagen coo 3    13.24%    3.76%     2.32%      18.07%     16.02%     29.23%
Dogster          1.35%     1.66%     2.58%      3.19%      2.53%      3.11%
Gowalla          3.80%     31.57%    4.90%      13.51%     28.95%     28.50%
Livemocha        5.92%     2.46%     13.05%     11.67%     10.93%     1.46%
Themarker        5.78%     28.98%    32.57%     0.89%      35.62%     12.65%
Average          6.14%     12.20%    10.52%     7.85%      17.15%     14.11%

Table (4.8) Relative error for sample size 0.9 for every graph.

The data indicates that the performance of PageRank can be predicted with reasonable accuracy using a simple predictive model, based on the number of edges in the original and sampled graph, and using the performance of the algorithm on the original graph. When considering all sample sizes, the performance on graphs sampled using TIES can be predicted better than on graphs sampled using RNS and RES, making TIES the most suitable sampling method to create graphs for graph processing analysis.


Despite this positive overview, the relative error is still high and differs considerably per graph. Therefore, the performance of PageRank on the sampled graphs is possibly influenced by more factors than just the number of edges, indicating that our predictive model is too simple.


Chapter 5

Combining the models

To determine how well our edge-prediction model works for the estimation of PageRank performance, we combine the data from the PageRank model (see Chapter 4) with that of the edge-prediction model (see Chapter 3).

For this analysis, we focus on the TIES algorithm, as we have already established it to be the best for preserving the behavior of PageRank across the family of sampled graphs. We report four different execution times: three predicted ones - using only the sample size ($T^p_{SS}$), using our predicted edge ratio based on the model in Chapter 3 ($T^p_{PER}$), and using the measured edge ratio ($T^p_{MER}$) - and a fourth, measured execution time ($T^m$). All the execution times are for the single thread version of PageRank.

We present data for all four test graphs used in Chapter 3, namely Academia, BlogCatalog, Buzznet, and Datagen coo 3. The data is presented in Figures 5.1, 5.2, 5.3, and 5.4, respectively.

Figure (5.1) Execution time - predicted and measured - for the Academia graph, using TIES as a sampling method.


Figure (5.2) Execution time - predicted and measured - for the BlogCatalog graph, using TIES as a sampling method.

Figure (5.3) Execution time - predicted and measured - for the Buzznet graph, using TIES as a sampling method.

Figure (5.4) Execution time - predicted and measured - for the Datagen coo 3 graph, using TIES as a sampling method.

Our analysis focuses on understanding the predictive power of our two combined models. Specifically, we envisioned our edge-prediction model to act as a "correction" factor over the simplistic performance estimate based only on the sample size. For example, if a graph's sample of size 0.5 contains 60% of the edges instead of the naively-assumed 50%, we expect that our simplistic PageRank performance model will benefit from this correction. Moreover, we expect this correction will reduce the error between the predicted and measured performance. In other words, we expect that our combined models enable a better prediction for PageRank's performance by (1) improving from $T^p_{SS}$ to $T^p_{PER}$, and

(2) closing the gap to the measured performance. If this assumption is validated by the data, we bring empirical evidence that our combined models are effective for roughly estimating the performance of PageRank on a scaled graph, even when only knowing the execution time on the original graph and the sample size.
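A minimal sketch of the three predictions, under the naming introduced above, is shown below; the numbers reuse the 0.5-sample example with 60% of the edges and are otherwise placeholders.

```python
def predicted_times(t_original, edges_original, sample_size, predicted_edges, measured_edges):
    """Return (T^p_SS, T^p_PER, T^p_MER) for one sample of the original graph."""
    t_ss = t_original * sample_size                          # naive: scale by the sample size only
    t_per = t_original * predicted_edges / edges_original    # corrected by the predicted edge ratio
    t_mer = t_original * measured_edges / edges_original     # reference: uses the measured edge ratio
    return t_ss, t_per, t_mer


# A 0.5 sample that actually contains 60% of the original edges (placeholder numbers).
print(predicted_times(t_original=0.080, edges_original=1_000_000,
                      sample_size=0.5, predicted_edges=590_000, measured_edges=600_000))
```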

In this context, we analyze the data presented in the figures and make the following three observations.

First, in most cases (i.e., with only one exception), our predicted edge ratio has led to an improvement in the prediction of PageRank performance. In other words, $T^p_{PER}$ is generally a better approximation of $T^m$ than $T^p_{SS}$. Thus, our edge-predictions can act as a correction factor for the performance prediction of PageRank.

Second, in all cases, our predicted edge ratio has delivered a very good estimation of $T^p_{MER}$ and $T^m$.

Third and finally, although we are combining two predictive models, and therefore also combining their errors, the predictions are surprisingly accurate.

These observations show that our edge-prediction model can be used as a correction factor when predicting the properties of a scaled graph, indicating that the assumption that the performance of PageRank directly correlates with the sample size is naive.


Chapter 6

Conclusion

In the field of graph processing, obtaining representative data sets for thorough benchmarking remains challenging, seriously limiting in-depth performance analysis. To solve this problem, a tool has been proposed for scaling graphs (up and down), using graph sampling as a main underlying mechanism. However, the correlations that appear between the different input parameters of the scaling tool and the properties of the resulting graphs are not clear. In this work, we attempted to gain insight into these correlations by means of an extensive case-study, using PageRank and a total of 15 graphs.

6.1 Main findings

The goal of this project was to determine the impact of graph scaling on the performance of PageRank for graph processing analysis.

First, in order to determine which graph properties would be the most important for predicting the performance of PageRank, we asked the following question:

What are the features of interest for assessing graph scaling in the context of graph processing analysis?

The PageRank of a certain web page is dependent on the number of hyperlinks connecting it to other web pages. In graphs, these hyperlinks are represented as edges. So, the performance of PageRank on a graph is directly correlated to the number of edges in the graph.

We used this result to answer the following question:

What is the impact of scaling on the properties of a scaled graph?

Because the number of edges directly influences the performance of the PageRank algorithm, we conducted an analysis of the number of edges in samples, created using three different sampling methods: Random Node Sampling (RNS), Random Edge Sampling (RES), and Total Induced Edge Sampling (TIES). The graphs we used were all from the social network class. From our analysis, we found that RNS showed under-sampling of edges, samples created using RES usually had the expected number of edges, and TIES showed over-sampling of edges.

These samples were used to train a linear regression model in order to predict the number of edges in the samples. We created one model per sample size. The models use the properties of the original graph in order to predict the number of edges in the sample.

These models showed a low error rate overall, especially for the bigger sample sizes. Since the samples generated using RNS and RES both showed a consistent number of edges for all sample sizes, the number of edges is fairly simple to predict, resulting in a low error. The samples generated using TIES showed more variety, which made the number of edges harder to predict. This is evident in the large error the model shows for the smaller sample sizes, which are more subject to the randomness factor of TIES. For larger sample sizes, the prediction error gradually decreased.

Next, we analyzed and predicted the performance of PageRank on the sampled graphs in order to answer the following question:

What is the impact of the properties of a scaled graph on graph processing performance?

We proposed a simple model for predicting the performance, where the ratio of edges for the sampled graph compared to the original graph directly corresponds to the performance ratio.

This model showed a large error for smaller sample sizes, but the error is acceptable for the bigger sample sizes. Graphs with a high number of edges generally showed a lower error than graphs with a low number of edges, indicating that the performance of PageRank on larger graphs is more stable and therefore easier to predict. The model performed better for samples generated using TIES than for the other sampling methods. The model for RNS showed better performance on the larger sample sizes, but TIES showed a lower error overall. Therefore, the performance of PageRank on samples created with TIES can be predicted better using our simple predictive model than for RNS and RES. TIES shows the most stable performance and proves to be the best sampling method for preserving the behavior of PageRank across different sample sizes.

Nevertheless, our model for predicting the performance of PageRank still shows a high error rate overall, with the results varying considerably per graph. This indicates that more properties than just the number of edges need to be taken into account in order to predict the performance, making our predictive model too simple.

Finally, we combined our edge-prediction model with our PageRank model. We used the ratio between the number of edges in the original graph and the number of predicted edges for the sample to predict the execution time, in order to understand the predictive power of our two combined models. Linking the results with our original intention of discussing the impact of scaling on graph properties, this analysis brings empirical evidence that our edge-prediction model should be used as a correction factor when predicting the properties of a scaled graph. In other words, assuming that the performance will behave in correlation with sample size is naive.

6.2 Limitations and future work

As proof of concept, we have only performed edge analysis on one class of graphs: social networks. Models trained on samples created from social network graphs performed well; however, it is not clear whether these results would hold up for different types of graphs, such as communication networks and road networks. A next step would be to expand these experiments to different classes of graphs, and investigate if it would be possible to train one model for all classes, or if it is necessary for graphs to be divided into classes. Another interesting extension would be the inclusion of directed graphs since we only focused on undirected graphs for the scope of this thesis.

For training the models, we have chosen to use a linear regression model because of the small number of graphs available to train on. However, there are many more different methods for creating predictive models, such as polynomial regression and neural networks. Neural networks require more data for training to achieve good results than we have used for our linear model - however, it would be interesting to see how the results of these models compare to our results.

As for the performance, we used a very simple predictive model to predict the performance on the sampled graphs. While this model gave quite good results on some graphs, especially for samples generated using TIES, it did not work consistently for all graphs. Our model was based only on the number of edges in the graphs, so further research is needed to determine which other factors influence the performance of PageRank on these graphs, in order to refine the proposed model and predict the performance with higher accuracy.


Bibliography

[ANK11] N. Ahmed, J. Neville, and R.R. Kompella. Network Sampling via Edge-based Node Selection with Graph Induction. Tech. rep. Purdue University, 2011.

[Bal+16] H. Bal et al. "A Medium-Scale Distributed System for Computer Science Research: Infrastructure for the Long Term". In: IEEE Computer 5 (2016), pp. 54–63.

[BAP15] S. Beamer, K. Asanovic, and D. Patterson. "The GAP Benchmark Suite". In: CoRR abs/1508.03619 (2015). url: http://arxiv.org/abs/1508.03619.

[BGS05] M. Bianchini, M. Gori, and F. Scarselli. "Inside PageRank". In: ACM Trans. Internet Technol. 5.1 (Feb. 2005), pp. 92–128. issn: 1533-5399. url: http://doi.acm.org/10.1145/1052934.1052938.

[CZF04] D. Chakrabarti, Y. Zhan, and C. Faloutsos. "R-MAT: A recursive model for graph mining". In: Proceedings of the 2004 SIAM International Conference on Data Mining (2004), pp. 442–446.

[Ios+16] A. Iosup et al. "LDBC Graphalytics: a benchmark for large-scale graph analysis on parallel and distributed platforms". In: Proceedings of the VLDB Endowment 9 (2016), pp. 1317–1328.

[Les+10] J. Leskovec, D. Chakrabarti, et al. "Kronecker graphs: An approach to modeling networks". In: Journal of Machine Learning Research 11 (2010), pp. 985–1042.

[LK19] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data. [Online; accessed April 2019]. 2019.

[Mus17] A.A.M. Musaafir. "Shrinking and Expanding Graph Datasets". MA thesis. University of Amsterdam, 2017.

[Pag+99] L. Page et al. The PageRank Citation Ranking: Bringing Order to the Web. Tech. rep. 1999.

[Ped+11] F. Pedregosa et al. "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[RA15] R.A. Rossi and N.K. Ahmed. "The Network Data Repository with Interactive Graph Analytics and Visualization". In: AAAI. 2015. url: http://networkrepository.com.

[Sch19] K. Schoorl. Graph property analysis data. [Data set]. 2019. url: https://drive.google.com/open?id=1liOa-YANAdILoxK4LPVgr8ubyFlWAdwy.

[Van18] T. Van Zalingen. "An analysis of the scale-invariance of graph algorithms: A case study". MA thesis. University of Amsterdam, 2018.

[ZT16] J.W. Zhang and Y.C. Tay. "GSCALER: Synthetically Scaling A Given Graph". In: EDBT. 2016.

Appendix A

Social network graphs

                        Academia      BlogCatalog   Brightkite    Buzznet
Vertices                200169        88790         58228         101169
Edges                   1022441       2093199       214078        2763070
Avg. degree             10.2158       47.1494       7.35309       54.6229
Graph density           0.000051036   0.000531028   0.000126283   0.000539922
Connected components    2             3             547           3
Avg. CC                 0.222842      0.353311      0.172326      0.232072
Diameter                14            8             15            4
Avg. shortest path      4.78          3.07          4.91          2.33

                        Catster       Datagen coo 1   Datagen coo 2   Datagen coo 3
Vertices                149703        32536           56992           84234
Edges                   5240663       954675          1908524         3066990
Avg. degree             70.0141       58.6842         66.9752         72.8207
Graph density           0.00046769    0.00180373      0.00117519      0.000864515
Connected components    28            1               1               1
Avg. CC                 0.259748      0.114999        0.107372        0.100651
Diameter                6             5               5               5
Avg. shortest path      2.54          2.81            2.87            2.9


                        Dogster        Gowalla        Livemocha     TheMarker
Vertices                426823         196591         104109        69413
Edges                   8413857        950327         2193087       1644843
Avg. degree             39.4255        9.66806        42.1306       47.3929
Graph density           0.0000923699   0.0000491788   0.000404682   0.000682777
Connected components    130            1              3             48
Avg. CC                 0.117376       0.236724       0.0544106     0.186378
Diameter                7              14             6             8
Avg. shortest path      3.11           4.61           3.2           3.05


Appendix B

Random Node Sampling


Appendix C

Random Edge Sampling


Appendix D

Total Induced Edge Sampling

