Utility-Based Summarization for Large Graphs


by

Jasbir Singh

B.Tech. (Computer Science), Guru Nanak Dev University, India, 2018

A Project submitted in partial fulfillment of the requirements for the Degree of

MASTER OF SCIENCE

in the Department of Computer Science

© Jasbir Singh, 2021

University of Victoria

All rights reserved. This project may not be reproduced in whole or in part, by photocopying or other means, without the permission of the author.


Utility-Based Summarization for Large Graphs

by

Jasbir Singh

B.Tech. (Computer Science), Guru Nanak Dev University, India, 2018

Supervisory Committee

Dr. Venkatesh Srinivasan, Supervisor (Department of Computer Science)

Dr. Alex Thomo, Co-Supervisor (Department of Computer Science)


ABSTRACT

A fundamental challenge in graph mining is the ever-increasing size of datasets. Graph summarization aims to find a compact representation, resulting in faster query algorithms and reduced storage needs. The flip side of graph summarization is often a loss of utility, which significantly diminishes its usability. The key question we address in this work is: how can we summarize a graph with some loss of utility while keeping the utility above a user-specified threshold?

Kumar and Efstathopoulos [14] proposed a method to address this question, but it has important limitations. In our work, we present a highly scalable algorithm that foregoes the expensive iterative process hampering their approach. Our algorithm achieves this by combining a memory-reduction technique with a novel binary-search approach.

In contrast to the competition, we are able to handle web-scale graphs on a single machine without performance degradation as the utility threshold (and hence the size of the summary) decreases. Previous works suffer from conceptual limitations and a lack of scalability.

Contents

Supervisory Committee
Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction
1.1 Motivation
1.2 Contribution
1.3 Organization

2 Related Work

3 Preliminaries
3.1 Graph and Structural Properties
3.1.1 Edge/Node Importance
3.2 Graph Summary and Notations
3.2.1 Reconstruction
3.2.2 Utility of Graph Summary
3.2.3 Lossy vs Lossless Summarization
3.3 Objective Function
3.4 UDS and its drawbacks
3.4.1 Complexity and Bottleneck

4 Algorithm and Analysis
4.1 Generating Candidate Pairs using MST
4.2 Two Crucial Theorems
4.2.1 MST of G_2-hop is Sufficient
4.2.2 Utility is a Monotone Function
4.3 Scalable Algorithm, T-BUDS
4.4 Complexity analysis

5 Experiments and Evaluation
5.1 Experimental Settings
5.2 Datasets
5.3 Performance of T-BUDS
5.4 Usefulness of Utility-Driven Framework

6 Conclusion

7 Future Work

List of Tables

3.1 Table of frequently used symbols
5.1 Summary of datasets
5.2 App-utility for top-k queries on T-BUDS summary

List of Figures

3.1 Different types of supernodes
3.2 Example of the utility-based framework
3.3 Computation overhead for UDS: (a) Astroph, (b) CAGRQC, (c) HepTh, (d) com-amazon
4.1 Two-hop graph vs two-hop MST edges
5.1 T-BUDS vs UDS comparison: (a) runtime, (b) memory usage
5.2 T-BUDS vs UDS for different utility thresholds: (a) CN, (b) H1
5.3 T-BUDS vs SWeG with respect to app-utility for top-20% queries on CN: (a) τ = 0.8, (b) τ = 0.6

ACKNOWLEDGEMENTS

I would like to thank:

Dr. Venkatesh Srinivasan, for providing me this opportunity, being supportive and providing expert guidance throughout.

Dr. Alex Thomo, for his expert guidance and support throughout.

Mahdi Hajiabadi, for his collaboration on this project and sharing his knowledge.

My friends and family, for their continuous support and motivation.


DEDICATION

Chapter 1

Introduction

Graphs are ubiquitous and are the most natural representation for many real-world data such as web graphs, social networks, communication networks, citation networks, transaction networks, ecological networks and epidemiological networks. Such graphs are growing at an unprecedented rate. For instance, the web graph consists of more than 3.5 billion web pages connected by more than 129 billion hyperlinks and social networks span across more than 300 billion connections [24]. Consequently, storing such graphs and answering queries, mining patterns, and visualizing them are becoming highly impractical [14, 31].

Graph summarization is a fundamental task of finding a compact representation of the original graph called the summary. It allows us to decrease the footprint of the graph and query more efficiently [6, 25, 31]. Graph summarization also makes possible effective visualization thus facilitating better insights on large-scale graphs [35, 5, 18, 12, 4]. Also crucial is the privacy that a graph summary can provide for privacy-aware graph analytics [15, 8].

The problem has been approached from different directions, such as compression techniques to reduce the number of required bits for describing graphs [29, 2, 3, 30], sparsification techniques to remove less important nodes/edges in order to make the graph more informative [32, 22], and grouping methods that merge nodes into supernodes based on some interestingness measure [15, 25, 31, 28, 20]. Grouping methods constitute the most popular summarization approach because they allow the user to logically relate the graph summary to the original graph.

1.1 Motivation

The flip side of summarization is loss of utility, measured in terms of edges of the original graph that are lost and spurious edges that are introduced in the summary. In this work, we focus on grouping-based utility-driven graph summarization. In terms of the state-of-the-art, [31] and [14] offer different ways of measuring loss of utility: the first computes the loss by treating all edges as unweighted and of equal importance, while the second incorporates edge centralities as weights in the loss computation. Also, the first uses a loss budget that is local to each node, while the second uses a global budget.

There are several limitations with the state-of-the-art [31, 14] on utility-driven graph summarization. By not considering edge importance, the SWeG algorithm of [31] produces (lossy) summaries which are inferior to those produced by UDS of [14], which uses edge importance in its process. Both SWeG and UDS are slow and impractical to run for large datasets on a single machine, thus hampering their utility. SWeG needs to utilize a cluster of machines to handle datasets that are large but can still fit easily in the memory of one machine. UDS is an $O(|V|^2)$-time algorithm; based on our experiments, it can only handle small to moderate datasets, requiring a large amount of time, often more than 100 hours. This raises the key question of our work: can we design a scalable algorithm that outperforms SWeG and UDS for utility-driven graph summarization?

1.2 Contribution

To address these challenges, we propose a scalable utility-driven algorithm, T-BUDS, which can handle large graphs efficiently on a single consumer-grade machine. It shares the utility-threshold-driven nature of UDS [14], allowing the user to calibrate the loss of utility according to their needs. However, T-BUDS forgoes the main expensive iterative process that severely hampers UDS. We achieve this by combining a memory-reduction technique based on the Minimum Spanning Tree (MST) with a novel binary-search approach. Not only is T-BUDS orders of magnitude faster than UDS, it also exhibits a useful characteristic: in contrast to UDS, T-BUDS is mostly computationally insensitive to lowering the utility threshold, which amounts to asking for a smaller summary. As such, a user can conveniently experiment with different utility thresholds without incurring a performance penalty.

In summary, our contributions are as follows.

• We propose a highly scalable, utility-driven algorithm, T-BUDS, for lossy summarization. The algorithm achieves high scalability by combining memory reduction using MST with a novel binary-search procedure. While preserving all the nice properties of the UDS summary, T-BUDS outperforms UDS by two orders of magnitude.

• We show that the summary produced by T-BUDS can be used to answer top-k central node queries based on various centrality measures with a high level of accuracy.

• Finally, we verify that the lossy summaries we generate are superior to those of SWeG, which does not consider edge importance in the utility computation. For different centrality queries, we show that the results on our summaries are significantly better than those on SWeG summaries.

1.3 Organization

This work is organized as follows:

Chapter 1 provides a brief description of the problem, the motivation for the work, and our contributions.

Chapter 2 provides a brief literature review of graph summarization methods. The focus is on the current state-of-the-art methods for utility-preserving summarization, with a brief description of the advantages and limitations of each approach.

Chapter 3 presents the background knowledge for understanding graph summaries and centrality scores. It also describes the objective function for our problem and provides a brief overview, including the limitations, of the state-of-the-art method.

Chapter 4 introduces our novel algorithm based on MST and binary search. It also contains the theoretical analysis of the correctness and complexity of our method.

Chapter 5 presents the experimental evaluation of our method against the existing state-of-the-art. It also contains an experimental discussion validating that the utility measure is applicable to real-life utility scenarios.

Chapter 6 concludes the work with a summary of our results.

Chapter 7 outlines directions for future work.

Chapter 2

Related Work

Graph summarization has been studied in different contexts, and we can classify the proposed methodologies into two general categories: grouping and non-grouping. The non-grouping category includes sparsification-based methods [35, 17, 22] and sampling-based methods [10, 16, 23, 1, 34, 19]. There are some other methods [15, 28] that store the summary as an expected adjacency matrix of the summary graph, but they fail to create summaries for large graphs. For a more detailed analysis of non-grouping methods, see the survey by Liu et al. [21].

The grouping category of methods is more commonly used for graph summarization and as such has received a lot of attention [14, 15, 6, 25, 31, 28, 7, 33, 11]. These techniques mainly group collections of graph nodes into supernodes and edges into superedges connecting supernodes. In this category, works such as [15, 28] can only produce lossy summarizations, optimizing different objectives. On the other hand, [25, 31] are able to generate both lossy and lossless summarizations. Among the works in the grouping category, we discuss [14, 25, 31, 20], which aim to preserve utility and as such are most closely related to our work.

Navlakha et al. [25] proposed a compact representation containing the summary along with a correction set. Their goal was to minimize the reconstruction error. Liu et al. [20] proposed a distributed solution to improve the scalability of the approach in [25]. Recently, Shin et al. [31] proposed SWeG, which builds on the work of [25]. They used a shingling- and MinHash-based approach to prune the search space for discovering promising candidate pairs, and weighted Jaccard similarity for merging candidate pairs. While SWeG summaries are queryable, the use of correction sets makes SWeG inefficient for answering queries such as top-k queries directly.

In the work of Kumar and Efstathopoulos [14], the UDS algorithm was proposed, which generates summaries that preserve utility above a user-specified threshold. UDS was able to beat existing techniques in terms of query answering and is considered the current state-of-the-art method. However, UDS is not scalable to moderate or large graphs due to its high runtime and memory requirements.


Chapter 3

Preliminaries

This chapter provides the background information and terminology that are important for understanding the following chapters. Section 3.1 covers basic structural properties of graphs and explains node and edge importance scores. Section 3.2 provides definitions for understanding graph summaries and explains the concept of the utility of a graph summary. The objective function for our problem is specified in Section 3.3. Finally, Section 3.4 provides a high-level overview and the drawbacks of the existing state-of-the-art method for the specified objective function.

3.1 Graph and Structural Properties

A graph is a natural representation for any set of objects connected by directed or undirected links. Mathematically, it is denoted by $G = (V, E)$, where V is the set of vertices and E is the set of edges connecting vertices in V. In this study, we consider simple undirected graphs unless specified otherwise.

3.1.1 Edge/Node Importance

Every node and edge in G has properties based on the structure of the graph, and can be assigned an importance score based on its connectivity. For instance, in a social network, a link between celebrities holds more importance than a link between two ordinary persons. There exist several centrality measures that help assign such scores.

Degree Centrality

For any vertex v ∈ V, degree centrality simply depends on the degree of node v in G. The intuition is that nodes with higher connectivity are more central or important. For normalization purposes, the score is divided by the maximum number of possible neighbors, i.e., |V| − 1.

$$\text{DegreeCentrality}(v) = \frac{d_v}{|V| - 1} \quad (3.1)$$
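As a concrete illustration, the following is a minimal Java sketch of Equation 3.1 for a graph stored as adjacency lists; the array-based representation is our own assumption, not from the text:

```java
public class DegreeCentralitySketch {
    // Equation 3.1: degree centrality d_v / (|V| - 1) for every node.
    // adj[v] holds the neighbour list of node v; nodes are numbered 0..n-1.
    static double[] degreeCentrality(int[][] adj) {
        int n = adj.length;
        double[] c = new double[n];
        for (int v = 0; v < n; v++)
            c[v] = adj[v].length / (double) (n - 1);
        return c;
    }

    public static void main(String[] args) {
        // Toy graph: triangle 0-1-2 with pendant node 3 attached to 2.
        int[][] adj = { {1, 2}, {0, 2}, {0, 1, 3}, {2} };
        System.out.println(java.util.Arrays.toString(degreeCentrality(adj)));
        // prints [0.666..., 0.666..., 1.0, 0.333...]
    }
}
```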

Betweenness Centrality

Betweenness quantifies the number of times a node v lies on the shortest paths between pairs of other nodes. More precisely, it is defined as

$$\text{Betweenness}(v) = \sum_{s \neq t \neq v \in V} \frac{\sigma_{st}(v)}{\sigma_{st}} \quad (3.2)$$

where $\sigma_{st}(v)$ is the number of shortest paths between s and t passing through v, and $\sigma_{st}$ is the total number of shortest paths between s and t. The intuition is that nodes with high betweenness are more likely to be involved in communication between other nodes.


PageRank Centrality

PageRank centrality is a numerical score assigned to a node depicting its relative importance within the network. The intuition behind the method is that a node is important if it is pointed to or connected by other important nodes, and vice versa. It is calculated iteratively: for an undirected graph G = (V, E), all nodes are initialized with the same PageRank value, i.e., $\forall u \in V,\ P_0(u) = 1$. $P_i(u)$ denotes the PageRank value of node u after the i-th iteration of the PageRank algorithm [26] and is calculated as:

$$P_i(u) \leftarrow \sum_{w \in N(u)} \frac{P_{i-1}(w)}{|N(w)|} \quad (3.3)$$
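To make the update rule concrete, here is a minimal Java sketch of the iteration in Equation 3.3; the adjacency-list representation and the fixed iteration count are our own assumptions for illustration:

```java
import java.util.Arrays;

public class PageRankSketch {
    // Repeated application of Equation 3.3:
    // P_i(u) = sum over w in N(u) of P_{i-1}(w) / |N(w)|.
    static double[] pageRank(int[][] adj, int iterations) {
        int n = adj.length;
        double[] p = new double[n];
        Arrays.fill(p, 1.0);                          // P_0(u) = 1 for all u, as in the text
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int u = 0; u < n; u++)
                for (int w : adj[u])
                    next[u] += p[w] / adj[w].length;  // w contributes P_{i-1}(w)/|N(w)|
            p = next;
        }
        return p;
    }

    public static void main(String[] args) {
        // Toy graph: triangle 0-1-2 plus pendant node 3 attached to 2.
        int[][] adj = { {1, 2}, {0, 2}, {0, 1, 3}, {2} };
        System.out.println(Arrays.toString(pageRank(adj, 20)));
    }
}
```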

3.2 Graph Summary and Notations

A summary graph is an aggregated graph structure denoted by $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is a set of supernodes and $\mathcal{E}$ a set of superedges.

Supernode. Each supernode corresponds to a disjoint set of nodes in G. More precisely, $\mathcal{V} = \{S_1, S_2, \ldots, S_k\}$ such that $k \le |V|$, $V = \bigcup_{i=1}^{k} S_i$, and $\forall i \neq j,\ S_i \cap S_j = \emptyset$.

Superedge. Superedges form a binary relation between pairs of supernodes $S_i$ and $S_j$. The existence of a superedge $(S_i, S_j)$ represents that every node in $S_i$ is connected to every node in $S_j$. If a superedge does not exist, no node in $S_i$ is connected to any node in $S_j$.

3.2.1 Reconstruction

Given a summary graph, we can reconstruct the original graph as follows. For each superedge $(S_i, S_j)$ we construct edges $(u, v)$ for each $u \in S_i$ and $v \in S_j$. For $i \neq j$, this amounts to building a complete bipartite graph with $S_i$ and $S_j$ as its parts. For $i = j$, this amounts to building a complete graph on the vertices of $S_i$. Figure 3.1 shows how the reconstructed graph is affected by different types of superedges: figures (a) and (c) show two different superedges, and figures (b) and (d) show their reconstructed versions.

Figure 3.1: (a, c) Two different types of superedges, which result in two different types of reconstructed graphs (b, d).
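The reconstruction rule is mechanical, so a short sketch may help; the list-based representation of supernodes and superedges below is our own, chosen only for illustration:

```java
import java.util.*;

public class ReconstructSketch {
    // Rebuild the edge list of the reconstructed graph from a summary.
    // supernodes.get(i) is the list of original nodes in supernode i;
    // each superedge is a pair {i, j}; i == j encodes a superloop.
    static List<int[]> reconstruct(List<List<Integer>> supernodes, List<int[]> superedges) {
        List<int[]> edges = new ArrayList<>();
        for (int[] se : superedges) {
            List<Integer> a = supernodes.get(se[0]);
            List<Integer> b = supernodes.get(se[1]);
            if (se[0] != se[1]) {
                // i != j: complete bipartite graph between the two parts
                for (int u : a) for (int v : b) edges.add(new int[]{u, v});
            } else {
                // i == j: complete graph on the vertices of the supernode
                for (int x = 0; x < a.size(); x++)
                    for (int y = x + 1; y < a.size(); y++)
                        edges.add(new int[]{a.get(x), a.get(y)});
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        List<List<Integer>> sn = Arrays.asList(Arrays.asList(0, 1), Arrays.asList(2));
        List<int[]> se = Arrays.asList(new int[]{0, 0}, new int[]{0, 1});
        for (int[] e : reconstruct(sn, se))
            System.out.println(e[0] + "-" + e[1]);   // prints 0-1, then 0-2 and 1-2
    }
}
```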

3.2.2 Utility of Graph Summary

As discussed in Section 3.1.1, each edge holds a different importance, and obviously, the more important edges we recover during reconstruction, the better. We denote the importance of an edge (u, v) in G by C(u, v). However, during reconstruction some spurious edges can be added, and they too should carry importance scores. So, we also introduce the notion of importance for spurious edges and denote it by $C_s(u, v)$.

Preserving the utility of a graph implies that the algorithm should work to preserve the total importance score of the edges, in contrast to merely the number of edges. Also, the algorithm should try to minimize the total importance score of the spurious edges it introduces. In fact, the utility value of a summary $\mathcal{G}$ is defined by adding up the importance of all the original edges preserved and subtracting the importance scores of all the spurious edges introduced:

$$u(\mathcal{G}) = \sum_{(S_i, S_j) \in \mathcal{E}} \Bigg( \sum_{\substack{(u,v) \in E \\ u \in S_i,\, v \in S_j}} C(u, v) \;-\; \sum_{\substack{(u,v) \notin E \\ u \in S_i,\, v \in S_j}} C_s(u, v) \Bigg) \quad (3.4)$$

The C(u, v) and $C_s(u, v)$ values are normalized so that their respective sums equal one. A similar utility model is used in [14], but without weights for spurious edges.

3.2.3 Lossy vs Lossless Summarization

Given any graph G, summarization is the process of reducing the input graph to its summary $\mathcal{G}$.

Lossless Summarization. A summary graph $\mathcal{G}$ is considered lossless if it is possible to regenerate the original graph G from it without any loss in utility, i.e., $u(\mathcal{G}) = 1$. The graph reconstructed from a lossless summary is exactly the same as the original graph. Though there is no loss in utility, such a summary often provides lower compression.

Lossy Summarization. As lossless summarization provides low compression, often some loss in utility is traded for compression, i.e., $u(\mathcal{G}) < 1$. Lossy summaries are practical, and most of the existing methods [14, 25] have worked in this context. Lossy summarization is the focus of our work as well.

3.3 Objective Function

In order to have a good summarization, the user defines a threshold τ and requests that the utility of the summary satisfy $u(\mathcal{G}) \ge \tau$. Having described the utility-based framework in Section 3.2.2, we define the optimization problem we study as follows.

G = (V, E) : Input graph with set V of nodes and set E of edges
𝒢 = (𝒱, ℰ) : Summary graph with set 𝒱 of supernodes and set ℰ of superedges
S_i : The i-th supernode
S(u) : Supernode to which node u belongs
C_u : Centrality of node u
C(u, v) : Centrality of edge (u, v)
d_u : Degree of node u in G
d_avg : Average degree in G
F = {(a, c) | (a, b) ∈ E, (b, c) ∈ E} : Set of two-hop node pairs
G_2-hop = (V, F) : 2-hop graph of G
L : Sorted list of F
H : Sorted list of the MST edges of G_2-hop
u(𝒢_t) : The utility value of the summary after iteration t
Sedge(S(u), S(v)) : Cost of adding a superedge between S(u) and S(v)
nSedge(S(u), S(v)) : Cost of not adding a superedge between S(u) and S(v)
N(u) : The neighborhood set of u in graph G
N(X) : The neighborhood set of supernode X in the summary graph
|X| : The number of nodes in supernode X
nodes(X) : The set of all nodes in supernode X

Table 3.1: Table of frequently used symbols

Given a graph G = (V, E) and a user-specified utility threshold τ, our objective is to

$$\text{minimize } |\mathcal{V}| \quad \text{subject to } u(\mathcal{G}) \ge \tau. \quad (3.5)$$

Example of Ideal Summarization. Figure 3.2 shows an example for this framework. There are 14 edges and 11 nodes. We assume that the weight of each actual edge is equal to $\frac{1}{|E|} = \frac{1}{14}$ and the weight of each spurious edge is equal to $\frac{1}{\binom{11}{2} - 14} = \frac{1}{41}$. That is, there are 41 possible spurious edges in total, and the weight of each is set in this example to 1/41. In part (a) the sets of nodes inside the circles merge together into new supernodes, and the utility still remains one because no information has been lost. In part (b) the circles show two merge cases. In the first case, the blue supernode merges with the red node, and in the second case, the green supernode merges with the blue node. In the first case, there is a utility loss of $\frac{1}{14}$ for missing one actual edge (see part (d) for the reconstructed graph). We chose not to add an edge from the new blue supernode to one of the neighbours of the red node because doing so would introduce three spurious edges for a cost of $\frac{3}{41}$, which is greater than $\frac{1}{14}$ (the cost of missing one actual edge). Similarly, in the second case, there is a utility loss of $\frac{2}{41}$ for introducing two spurious edges. Therefore, the utility after this step is $1 - \frac{1}{14} - \frac{2}{41} = \frac{505}{574}$. Part (c) shows the summary after all four merges, and part (d) shows the reconstructed graph of the summary in part (c).

Figure 3.2: Example of the utility-based framework. (a) shows the original graph with two candidate merges with no loss of utility. The result is shown in (b), along with two more candidate merges. The merge of the green supernode with the blue node introduces two spurious edges (see the relevant part of the reconstructed graph in (d)). The merge of the blue supernode with the red node loses an actual edge, as shown in (d). (d) shows the reconstructed graph starting from the summary graph in (c).

3.4 UDS and its drawbacks

Kumar and Efstathopoulos [14] proposed UDS, a lossy algorithm for Utility-Driven graph Summarization. We introduce UDS briefly because of its relevance to our work, and highlight some of its limitations.

UDS is a greedy iterative algorithm that starts with the original graph G = (V, E) and iteratively merges nodes until the utility of the graph drops below a user-specified threshold τ < 1. At a high level, UDS can be considered as a sequence of two steps: (1) creating the two-hop edge list and sorting it; (2) generating the summary by applying many iterations of the merge procedure described below, using the sorted two-hop edge list as candidate pairs. To decide the order of the merge operations, UDS considers the set of all two-hop-away node pairs as the candidate pairs; call this set F. The algorithm starts merging from the less central candidate pairs in F because they result in less damage to the utility. Towards this, UDS uses a centrality score (e.g., betweenness centrality) for each node in the graph to assign a weight to each candidate pair ⟨u, v⟩, e.g., $C_u + C_v$, and sorts the pairs in ascending order. UDS iterates over the sorted candidate pairs, and in each iteration it performs the following steps.

1. Pick the next pair of candidate nodes ⟨u, v⟩ from F, find their corresponding supernodes S(u), S(v), and merge them into a new supernode S if S(u) ≠ S(v).

2. Update the neighbors of S based on the neighbors of S(u) and S(v). In particular, add an edge from S to another supernode if the loss in utility is less than the loss if it is not added.

3. (Re)compute the utility of the summary built so far and stop if we reach the threshold.

3.4.1 Complexity and Bottleneck

There are two major bottlenecks in the UDS method:

1. Generating candidate pairs. As UDS uses all two-hop pairs as candidate pairs and the number of two-hop edges is O(|F|), it needs $O(|F| \lg |F|)$ time and O(|F|) space to compute and sort F.

2. Merge operation. For any pair ⟨u, v⟩ ∈ L with S(u) ≠ S(v), the number of operations required is equal to the number of superedges updated, i.e., $|N(S(u))| + |N(S(v))|$. The worst-case scenario is when one supernode keeps getting bigger, which can lead to O(|V|) operations for each merge. This scenario is observed experimentally as well; in Figure 3.3 we plot the number of superedges updated after each merge for different graphs. The observation is that in every case there is a big supernode, and whenever a merge includes that supernode, almost O(|V|) operations are required.

Thus, UDS needs $O(|F| \lg |F|)$ time and O(|F|) space to compute and sort F. Merging two supernodes takes O(|V|) time and there can be O(|V|) such merges, so the merge steps together require $O(|V|^2)$ time and O(|E|) space. Overall, the time complexity of UDS is $O(|V|^2 + |F| \cdot \lg |F|)$ and its space complexity is O(|F|).

Figure 3.3: Computation overhead for UDS: the number of superedges updated at the i-th merge operation. (a) Astroph, (b) CAGRQC, (c) HepTh, (d) com-amazon.


Chapter 4

Algorithm and Analysis

In this chapter, we introduce a new algorithm whose objective is to create a graph summary with the minimum number of supernodes while keeping its utility above a user-specified threshold τ. Though UDS [14] (briefly described in Section 3.4) provides the desired graph summary, it suffers from high runtime and memory overheads. Our algorithm provides the exact same summary as UDS but is two orders of magnitude faster and more memory-efficient. We call our approach T-BUDS.

It consists of two steps: finding a merge order of supernodes, and then performing a binary search on the list of MST edges to make utility-related decisions. In Section 4.1, we explain the process for finding the order of merges, defined by a sorted list of MST edges, which we also call candidate node pairs. Section 4.2 details two theorems that establish the foundation of our approach. We discuss the detailed working of our algorithms in Section 4.3. Finally, Section 4.4 provides an analysis of the algorithm in terms of runtime and memory usage.


4.1 Generating Candidate Pairs using MST

We use a sorted list of node pairs from the MST of the two-hop graph to decide the order of merging the supernodes. The idea is to assign a weight, e.g., $C_u + C_v$, to each candidate pair ⟨u, v⟩ using a centrality score (e.g., betweenness centrality) for each node in G, and then to iterate through the sorted candidate pairs ⟨u, v⟩, find their corresponding supernodes S(u), S(v), and merge them into a new supernode S if S(u) ≠ S(v). This approach is based on the UDS [14] method, briefly discussed in Section 3.4.

UDS considers the set of all two-hop-away node pairs as candidate pairs, processed from the less central to the more central pairs. However, not every candidate pair causes a merge, because the nodes in the pair can already be in the same supernode due to previous merges. Therefore, there are many useless pairs, which we eliminate with our MST technique below.

Figure 4.1: (a) Original graph G; (b) 2-hop graph of G with edge weights equal to the sum of node weights; (c) MST of the 2-hop graph.

We denote the two-hop graph by $G_{2\text{-}hop} = (V, F)$, where $F = \{(a, c) \mid (a, b) \in E \text{ and } (b, c) \in E\}$. We do not construct it explicitly as UDS does. We propose a method to reduce the number of candidate pairs from O(|F|) to O(|V|) by creating an MST of $G_{2\text{-}hop}$. In Theorem 1, we prove that using the sorted edge list of the MST of $G_{2\text{-}hop}$ as candidate pairs is sufficient to produce the same summary.

4.2 Two Crucial Theorems

4.2.1 MST of $G_{2\text{-}hop}$ is Sufficient

Here we present a sufficiency theorem, which says that using H instead of L as the list of candidates is sufficient. We denote by L the centrality-weight-sorted version of F, and by H the sorted list of edges of an MST of $G_{2\text{-}hop}$. The idea of the proof is that the candidate pairs leading to a merge when L is used exactly correspond to the edges of an MST.

Theorem 1 (MST Sufficiency Theorem). For utility threshold τ, using H as the list of candidate pairs will produce the same graph summary as using L. (There can be different sorted versions of L due to possible ties, albeit unlikely as the weights are real numbers; what the theorem shows is that the summary constructed based on the MST is the same as the summary constructed using some sorted version of L.)

Proof. Initially $\mathcal{G}$ is the same as G. Assume that at iteration i a new pair ⟨u, v⟩ ← L[i] is chosen and $S(u)_{i-1}$ and $S(v)_{i-1}$ are the corresponding supernodes. If $S(u)_{i-1} \neq S(v)_{i-1}$, then they should be merged together into a new supernode. The following two claims need to be proven to ensure the sufficiency of H as a candidate set.

1. If two pairs ⟨u1, v1⟩ and ⟨u2, v2⟩ are in H such that ⟨u1, v1⟩ appears before ⟨u2, v2⟩ in H, then ⟨u1, v1⟩ appears before ⟨u2, v2⟩ in L.

2. If u and v are not inside the same supernode, that is, $S(u)_{i-1} \neq S(v)_{i-1}$, then ⟨u, v⟩ must be in H.

Proof of (1): As both H and L are sorted based on the weights of the edges, the order in which ⟨u1, v1⟩ and ⟨u2, v2⟩ appear in H is the same as their order in L.

Proof of (2): $S(u)_{i-1} \neq S(v)_{i-1}$ implies that there does not exist any other pair ⟨u′, v′⟩ ← L[j] for any j < i such that $u' \in S(u)_{i-1}$ and $v' \in S(v)_{i-1}$. Otherwise, $S(u')_j$ would have been merged with $S(v')_j$ in the j-th iteration; thus u′ and v′ would belong to the same supernode, and $S(u)_{i-1}$ would be the same as $S(v)_{i-1}$. Hence, ⟨u, v⟩ is the smallest-weight edge in $G_{2\text{-}hop}$ connecting $S(u)_{i-1}$ and $S(v)_{i-1}$. We now want to show that ⟨u, v⟩ ∈ H, i.e., that it is part of the MST. To show this, we claim that, in fact, ⟨u, v⟩ is the smallest-weight edge in $G_{2\text{-}hop}$ connecting $S(u)_{i-1}$ and $V \setminus S(u)_{i-1}$. Suppose not. Consider the edges between $S(u)_{i-1}$ and $V \setminus S(u)_{i-1}$. Recall that a cut in a connected graph is a minimal set of edges whose removal disconnects the graph; therefore, the edges between $S(u)_{i-1}$ and $V \setminus S(u)_{i-1}$ form a cut in $G_{2\text{-}hop}$. A well-known property of MSTs, the cut property, states that the minimum-weight edge of any cut belongs to the MST [13]. Now let, if possible, a different edge ⟨u″, v″⟩ in $G_{2\text{-}hop}$ be the edge with the smallest weight connecting $S(u)_{i-1}$ and $V \setminus S(u)_{i-1}$. Then by the cut property, ⟨u″, v″⟩ belongs to H and would have been considered as a candidate pair for merging in an earlier iteration. In that case, u″ and v″ would belong to the same supernode, which is a contradiction.

4.2.2 Utility is a Monotone Function

We show in the following theorem that the utility is non-increasing as we merge candidate pairs of H in order.

Theorem 2 (Non-increasing utility theorem). Let $\mathcal{G}_0 = G$ and let $\mathcal{G}_t$ be the summary graph obtained by processing H in order from index 1 to t, where $1 \le t \le |H|$. Then $u(\mathcal{G}_{t-1}) \ge u(\mathcal{G}_t)$.

Proof. Suppose at iteration t we take a pair ⟨u, v⟩ ← H[t] and the two supernodes S(u) and S(v) are merged together to create a new supernode S. After the merge, all the superedges between S and S(w) ∈ N(S) should be updated, where N(S) is the set of supernodes S(w) such that there is an edge of E between some node of S and some node of S(w).


Let $Sedge(S_i, S_j)$ be the cost of adding a superedge between $S_i$ and $S_j$. As some spurious edges are introduced by adding a superedge, this cost includes the cost of all those spurious edges. Similarly, let $nSedge(S_i, S_j)$ be the cost of not adding a superedge between $S_i$ and $S_j$. As some actual edges are missed by not adding a superedge, this cost includes the cost of all those actual edges. We have

$$Sedge(S_i, S_j) = \sum_{\substack{(u,v) \notin E \\ u \in S_i,\, v \in S_j}} C_s(u, v) \quad (4.1)$$

$$nSedge(S_i, S_j) = \sum_{\substack{(u,v) \in E \\ u \in S_i,\, v \in S_j}} C(u, v) \quad (4.2)$$

Note that, at iteration t, when two supernodes S(u) and S(v) are merged together into S, the number of spurious edges introduced on adding a superedge between S and any neighbor S(w) is exactly equal to the sum of the spurious edges introduced on connecting S(u), S(w) and S(v), S(w). When $S(w) \neq S$, the cost of adding a superedge between S and S(w), $Sedge(S, S(w))$, and the cost of not adding one, $nSedge(S, S(w))$, can be calculated as follows:

$$Sedge(S, S(w)) = Sedge(S(u), S(w)) + Sedge(S(v), S(w)) \quad (4.3)$$

$$nSedge(S, S(w)) = nSedge(S(u), S(w)) + nSedge(S(v), S(w)) \quad (4.4)$$

Let loss(S(w)) denote the change in loss for a candidate super-neighbor S(w), where in each case the loss incurred is the smaller of the cost of adding and of not adding the corresponding superedge. Formally, it is defined by:

$$\begin{aligned} loss(S(w)) ={}& \min(Sedge(S, S(w)),\, nSedge(S, S(w))) \\ &- \min(Sedge(S(u), S(w)),\, nSedge(S(u), S(w))) \\ &- \min(Sedge(S(v), S(w)),\, nSedge(S(v), S(w))) \end{aligned} \quad (4.5)$$

Let us denote $Sedge(S(u), S(w))$ by a, $nSedge(S(u), S(w))$ by b, $Sedge(S(v), S(w))$ by c, and $nSedge(S(v), S(w))$ by d. Then Equation 4.5 is of the form $\min(a + c, b + d) - (\min(a, b) + \min(c, d))$. Since $\min(a, b) + \min(c, d)$ is at most $a + c$ and at most $b + d$, we have $\min(a + c, b + d) \ge \min(a, b) + \min(c, d)$, and hence $loss(S(w)) \ge 0$. So,

$$u(\mathcal{G}_{t-1}) - u(\mathcal{G}_t) = \sum_{S(w) \in N(S(u)) \cup N(S(v))} loss(S(w)) \ge 0.$$

We can follow a similar strategy for the case of a superloop, in which S(w) = S, and show that $u(\mathcal{G}_{t-1}) \ge u(\mathcal{G}_t)$.

4.3 Scalable Algorithm, T-BUDS

Theorems 1 and 2 form the basis of our new approach, T-BUDS, which uses binary search over the sorted list of MST edges, H, to find the largest index t for which $u(\mathcal{G}_t) \ge \tau$ (see Algorithm 2). This requires computing H (done using Algorithm 1), followed by $\lg |H|$ computations of utility; the latter is done using Algorithm 3.

Given a graph G = (V, E) and centrality scores C[u] for each node u ∈ V, T-BUDS first creates the sorted candidate pairs H by calling the Two-hop MST function (Algorithm 1). This function follows the structure of Prim's algorithm [27] for computing an MST. However, we do not want to build the $G_{2\text{-}hop}$ graph explicitly. As such, we start with an arbitrary node s and insert it into a priority queue Q with a key value of 0. All other nodes are initialized with a key value of ∞. Any node v deleted from Q with minimum key value is included in the MST, and the key values of its two-hop-away neighbours are updated when needed.


Algorithm 1 Two-hop MST
1: Input: G = (V, E), C    ▷ C is the array of centrality scores for nodes
2: key[s] ← 0, parent[s] ← Null, Q.insert(s, key[s])
3: for v ∈ V \ {s} do
4:     key[v] ← ∞, parent[v] ← Null, Q.insert(v, key[v])
5: while !isEmpty(Q) do
6:     (v, _) ← Q.delMin()
7:     for w ∈ N(N(v)) such that w ∈ Q and w ≠ v do
8:         if key[w] > C[v] + C[w] then
9:             Q.setKey(w, C[v] + C[w])
10:            parent[w] ← v
11: H ← {(v, parent[v]) : v ∈ V \ {s}}
12: return H sorted based on C
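For concreteness, the following Java sketch mirrors Algorithm 1. Java's PriorityQueue has no decrease-key operation, so this version uses lazy re-insertion and skips stale entries, a standard substitute; that detail, and the assumption that $G_{2\text{-}hop}$ is connected, are ours:

```java
import java.util.*;

public class TwoHopMstSketch {
    // Prim's algorithm over the implicit two-hop graph: two-hop neighbours
    // N(N(v)) are enumerated on the fly, so G_2hop is never materialized.
    // The weight of candidate pair (v, w) is C[v] + C[w], as in Algorithm 1.
    static List<int[]> twoHopMst(int[][] adj, double[] c) {
        int n = adj.length;
        double[] key = new double[n];
        int[] parent = new int[n];
        boolean[] done = new boolean[n];
        Arrays.fill(key, Double.POSITIVE_INFINITY);
        Arrays.fill(parent, -1);
        PriorityQueue<double[]> q =
            new PriorityQueue<>(Comparator.comparingDouble(e -> e[0]));
        key[0] = 0;
        q.add(new double[]{0, 0});                       // (key, node); start from node 0
        List<int[]> mst = new ArrayList<>();
        while (!q.isEmpty()) {
            int v = (int) q.poll()[1];
            if (done[v]) continue;                       // stale entry, v already in the MST
            done[v] = true;
            if (parent[v] >= 0) mst.add(new int[]{v, parent[v]});
            for (int b : adj[v])                         // enumerate two-hop neighbours w
                for (int w : adj[b])
                    if (!done[w] && w != v && key[w] > c[v] + c[w]) {
                        key[w] = c[v] + c[w];            // "decrease-key" by re-insertion
                        parent[w] = v;
                        q.add(new double[]{key[w], w});
                    }
        }
        mst.sort(Comparator.comparingDouble(e -> c[e[0]] + c[e[1]]));  // the sorted list H
        return mst;
    }

    public static void main(String[] args) {
        // Triangle 0-1-2 with pendant node 3 attached to 2.
        int[][] adj = { {1, 2}, {0, 2}, {0, 1, 3}, {2} };
        double[] c = { 0.3, 0.2, 0.4, 0.1 };
        for (int[] e : twoHopMst(adj, c))
            System.out.println(e[0] + " - " + e[1]);
    }
}
```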

After creating the two-hop MST and sorting its edges, T-BUDS uses a binary-search approach: in each probe, it performs merge operations from the first pair up to the middle pair of H (Algorithm 2). In each merge step, we pick a pair of nodes ⟨u, v⟩ from H, find their supernodes S(u) and S(v), and merge them into a new supernode S. This process continues until the algorithm reaches the middle index. $\mathcal{G}$ is the resulting summary after these operations, and we compute its utility in line 11. If this utility is ≥ τ, then we search for the index t in the second half; otherwise, we search for it in the first half. The algorithm finds the best summary in $\lg |H|$ iterations, and |H|, being the number of edges in the MST of $G_{2\text{-}hop}$, is just O(|V|).

Algorithm 3 is used to compute the utility of a specific summary $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. The algorithm iterates over all supernodes one at a time, and for a given supernode $S_i$ it creates two maps (count and sum) to hold the details of the superedges connected to $S_i$: count[$S_j$] stores the number of actual edges between supernodes $S_i$ and $S_j$, and sum[$S_j$] contains the sum of the weights of all the edges between $S_i$ and $S_j$. Lines 4 to 13 initialize these two structures. $Sedge(S_i, S_j)$ (the cost of drawing a superedge between $S_i$ and $S_j$) and $nSedge(S_i, S_j)$ (the cost of not drawing a superedge between $S_i$ and $S_j$) are then computed from these maps.

Algorithm 2 T-BUDS
1: Input: G = (V, E), C, τ
2: H ← TwoHopMST(G, C)
3: low ← 0, high ← |H| − 1
4: while low ≤ high do
5:     mid ← ⌊(low + high)/2⌋
6:     𝒱 ← V, i ← 0
7:     while i ≤ mid do
8:         ⟨u, v⟩ ← H[i], i ← i + 1
9:         S ← Merge(S(u), S(v))
10:        𝒱 ← (𝒱 \ {S(u), S(v)}) ∪ {S}
11:    u(𝒢) ← ComputeUtility(𝒱)
12:    if u(𝒢) ≥ τ then low ← mid + 1
13:    else high ← mid − 1
14: BuildSuperEdges(𝒱)
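The search loop itself is short; below is a hedged Java skeleton of Algorithm 2's binary search. The function utilityOfPrefix stands in for "merge H[0..mid] from scratch, then run Algorithm 3"; that abstraction and the names are ours, chosen so the sketch is self-contained:

```java
import java.util.function.IntToDoubleFunction;

public class BinarySearchSketch {
    // Largest index mid of H such that the summary after merging H[0..mid]
    // still has utility >= tau. By Theorem 2 the utility is non-increasing
    // in mid, which is exactly what makes binary search valid here.
    static int largestFeasiblePrefix(int size, double tau, IntToDoubleFunction utilityOfPrefix) {
        int low = 0, high = size - 1, best = -1;
        while (low <= high) {
            int mid = (low + high) / 2;
            if (utilityOfPrefix.applyAsDouble(mid) >= tau) {
                best = mid;          // feasible: look for a longer prefix
                low = mid + 1;
            } else {
                high = mid - 1;      // infeasible: shrink the prefix
            }
        }
        return best;                 // -1 if even the first merge drops below tau
    }

    public static void main(String[] args) {
        // Toy monotone utility: u(mid) = 1 - 0.01 * (mid + 1); threshold 0.95.
        int t = largestFeasiblePrefix(100, 0.95, mid -> 1.0 - 0.01 * (mid + 1));
        System.out.println(t);       // prints 4: five merges keep utility at 0.95
    }
}
```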

Algorithm 3 Compute Utility
1: Input: G = (V, E), utility ← 1, 𝒱    ▷ set of supernodes
2: for S_i ∈ 𝒱 do    ▷ for each supernode
3:     count ← {}, sum ← {}
4:     for u ∈ S_i do
5:         for v ∈ N(u) do
6:             S_j ← S(v)
7:             if (S_i ≠ S_j) ∨ (S_i = S_j ∧ u < v) then
8:                 if count[S_j] ≥ 1 then
9:                     count[S_j] ← count[S_j] + 1
10:                    sum[S_j] ← sum[S_j] + C(u, v)
11:                else
12:                    count[S_j] ← 1
13:                    sum[S_j] ← C(u, v)
14:    for S_j ∈ count.keys ∧ i ≤ j do
15:        nSedge(S_i, S_j) ← sum[S_j]
16:        if S_i ≠ S_j then Sedge(S_i, S_j) ← (|S_i||S_j| − count[S_j]) / (|V|(|V|−1)/2 − |E|)
17:        else Sedge(S_i, S_j) ← (|S_i|(|S_i|−1)/2 − count[S_j]) / (|V|(|V|−1)/2 − |E|)
18:        if Sedge(S_i, S_j) ≤ nSedge(S_i, S_j) then
19:            utility ← utility − Sedge(S_i, S_j)
20:        else utility ← utility − nSedge(S_i, S_j)
21: return utility

Since $nSedge(S_i, S_j)$ is the sum of the weights of the edges in G between nodes in $S_i$ and $S_j$, it is exactly equal to sum[$S_j$] (line 15). If $S_i \neq S_j$, the number of spurious edges is equal to $|S_i||S_j| - count[S_j]$, and since each spurious edge has cost $\frac{1}{\binom{|V|}{2} - |E|}$, we have $Sedge(S_i, S_j) = \frac{|S_i||S_j| - count[S_j]}{\binom{|V|}{2} - |E|}$ (line 16). Similarly, if $S_i = S_j$, the number of spurious edges is $\binom{|S_i|}{2} - count[S_j]$ and $Sedge(S_i, S_j) = \frac{\binom{|S_i|}{2} - count[S_j]}{\binom{|V|}{2} - |E|}$ (line 17). Finally, the utility loss is $\min(Sedge(S_i, S_j),\, nSedge(S_i, S_j))$, and the utility is decremented by this loss. Algorithm 3 returns the final utility of $\mathcal{G}$, which is used by Algorithm 2 for making decisions.

Building Superedges. Once the appropriate supernodes have been identified, a superedge is added between two supernodes $S_i$ and $S_j$ if and only if $Sedge(S_i, S_j) \le nSedge(S_i, S_j)$. This task can be completed in O(|E|) time: line 19 of Algorithm 3 can be replaced by the task of adding a superedge between $S_i$ and $S_j$.
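To complement the pseudocode, here is a compact Java sketch of the same utility computation under a simplified representation of our own (supernode ids per node, supernode sizes, and normalized edge weights); it applies lines 14-20 of Algorithm 3 pair by pair:

```java
import java.util.*;

public class UtilitySketch {
    // superOf[u]: supernode id of node u;  size[s]: #nodes in supernode s;
    // edges[e] = {u, v} with importance w[e], normalized so the w sum to 1.
    // Every spurious edge costs 1 / (C(|V|,2) - |E|), as in Section 3.3.
    static double utility(int[] superOf, long[] size, int[][] edges, double[] w) {
        long n = superOf.length, m = edges.length;
        double spuriousCost = 1.0 / ((double) n * (n - 1) / 2 - m);
        final long M = 1_000_000_007L;                 // base for encoding a supernode pair
        Map<Long, Integer> count = new TreeMap<>();    // actual-edge count per supernode pair
        Map<Long, Double> sum = new TreeMap<>();       // summed edge importance per pair
        for (int e = 0; e < edges.length; e++) {
            long a = superOf[edges[e][0]], b = superOf[edges[e][1]];
            long key = Math.min(a, b) * M + Math.max(a, b);
            count.merge(key, 1, Integer::sum);
            sum.merge(key, w[e], Double::sum);
        }
        double utility = 1.0;
        for (Map.Entry<Long, Integer> ent : count.entrySet()) {
            int si = (int) (ent.getKey() / M), sj = (int) (ent.getKey() % M);
            long possible = (si == sj) ? size[si] * (size[si] - 1) / 2  // superloop case
                                       : size[si] * size[sj];           // distinct supernodes
            double sedge = (possible - ent.getValue()) * spuriousCost;  // cost of the superedge
            double nsedge = sum.get(ent.getKey());                      // cost of dropping edges
            utility -= Math.min(sedge, nsedge);                         // lines 18-20
        }
        return utility;
    }

    public static void main(String[] args) {
        // Two supernodes {0,1} and {2}; edges 0-1 and 1-2, each of weight 1/2.
        int[] superOf = {0, 0, 1};
        long[] size = {2, 1};
        int[][] edges = { {0, 1}, {1, 2} };
        double[] w = { 0.5, 0.5 };
        System.out.println(utility(superOf, size, edges, w));  // prints 0.5
    }
}
```

Note that supernode pairs with no actual edge between them never enter the maps; for such pairs nSedge is zero, so no superedge is added and no loss is incurred, matching Algorithm 3.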

4.4 Complexity analysis

Let us begin by analysing the time complexity of Algorithm 1. As its structure follows that of Prim's algorithm [27], it requires $O(|F| \cdot \lg |V|)$ steps to compute the MST. As the number of edges in H is O(|V|), sorting it takes $O(|V| \lg |V|)$ time. Thus, the total time complexity of Algorithm 1 is $O((|F| + |V|) \cdot \lg |V|)$. The total space required by Algorithm 1 is O(|V|), as it stores the priority queue Q and the arrays key, parent, and H, all of size O(|V|).

Now let us analyse the time complexity of Algorithm 3. To compute the utility of $\mathcal{G}$, the algorithm iterates over all the edges of G, each edge exactly once, to identify pairs of supernodes $(S_i, S_j)$ that have at least one edge of G between them. This step, which includes the computation of count and sum for each supernode, takes O(|E|) time. Once this step is completed, it takes O(1) time to compute the Sedge and nSedge costs for a pair $(S_i, S_j)$. Therefore, the time complexity of Algorithm 3 is O(|E|). It requires O(|V|) space to store the count and sum structures.

Finally, let us analyse the time and space complexity of Algorithm 2. As discussed in Section 4.3, Algorithm 2 performs $\lg |H|$ iterations. In each iteration, merging supernodes requires O(|H|) operations and the utility estimation using Algorithm 3 requires O(|E|) time. Thus the time complexity of each iteration is O(|E| + |V|), and the time for a total of $\lg |H|$ iterations is $O((|E| + |V|) \cdot \lg |V|)$. The space requirement of Algorithm 2 is storing H and $\mathcal{V}$, which is O(|V|).

Summarizing all the above, we have

Theorem 3. The time complexity of T-BUDS is $O((|F| + |V|) \cdot \lg |V|)$. The space complexity of T-BUDS is O(|V|).

Data structures. We used the union-find data structure [9] for representing our supernodes. The union operation was used to implement the merge operation in line 9 of Algorithm 2, and the find operation was used to find the corresponding supernode of a specific node in line 9 of Algorithm 2 and line 6 of Algorithm 3. Using path compression with union-find reduces the amortized complexity of the union and find operations to $O(\lg^* |V|)$ (the iterated logarithm of |V|). As $\lg^* |V|$ is about 5 even when |V| is more than a billion, we treat it as a constant in our calculations. Union-find needs only two arrays of size |V|, and thus its working memory requirement is O(|V|).

Using union-find alone, it is not directly possible to iterate over all the nodes in a specific supernode, so we maintain a linked-list representation of the nodes inside each supernode. Also, count and sum in Algorithm 3 were implemented as tree maps because, for large networks, the performance of the hash map degrades; thus, the tree map was a better choice for large networks.
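For reference, a minimal union-find with path compression of the kind described above might look as follows; union by size and the exact array layout are our choices:

```java
// Disjoint-set forest over the original nodes; each root represents a supernode.
public class UnionFind {
    private final int[] parent;
    private final int[] size;

    public UnionFind(int n) {
        parent = new int[n];
        size = new int[n];
        for (int v = 0; v < n; v++) { parent[v] = v; size[v] = 1; }
    }

    // find(v): supernode id of v, compressing the path as we go (path halving).
    public int find(int v) {
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];
            v = parent[v];
        }
        return v;
    }

    // union: merge the supernodes of a and b; the smaller is absorbed into the larger.
    public boolean union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return false;          // already the same supernode
        if (size[ra] < size[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;
        size[ra] += size[rb];
        return true;
    }

    public static void main(String[] args) {
        UnionFind uf = new UnionFind(4);
        uf.union(0, 1);
        uf.union(1, 2);
        System.out.println(uf.find(2) == uf.find(0));  // true: same supernode
        System.out.println(uf.find(3) == uf.find(0));  // false
    }
}
```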


Chapter 5

Experiments and Evaluation

In this chapter, we discuss the experimental evaluation of our method, T-BUDS, against the existing state-of-the-art techniques for utility-driven summarization. In Section 5.1, we describe the hardware and resources used for our experiments. Section 5.2 lists the datasets used for experimentation. Section 5.3 describes the performance analysis of T-BUDS versus UDS [14] (the state-of-the-art in lossy utility-driven graph summarization) in terms of running time and memory consumption. Finally, we discuss the usefulness of the utility-driven graph summarization framework in Section 5.4.

5.1 Experimental Settings

We conducted all our experiments on an Intel Xeon machine with the following configuration:

• Processor: Intel(R) Xeon(R) CPU E5620 @ 2.40 GHz
• Memory: 128 GB RAM


All algorithms were implemented in Java 8. Even though our machine had 128 GB of RAM, we used no more than 32 GB for our method. Since UDS is a memoization-based algorithm, it requires more memory, and we assigned 64 GB of RAM to UDS, as suggested in the original paper [14].

Our implementations of UDS and SWeG followed the algorithms published in the corresponding papers; the original implementations were proprietary and not publicly available. In [14], UDS was run on higher-end AWS hardware for more than two weeks, whereas we enforce a time cutoff of 100 hours (about four days) on commodity hardware. For SWeG, we implemented the single-machine version in order to compare all algorithms under the same settings.

5.2 Datasets

We used seven web and social graphs from http://law.di.unimi.it/datasets.php, varying from moderate size to very large; we ignored edge directions and self-loops. Table 5.1 shows the statistics of these graphs.

Graph            Abbr   Nodes        Edges
cnr-2000         CN     325,557      5,565,380
hollywood-2009   H1     1,139,905    113,891,327
hollywood-2011   H2     2,180,759    228,985,632
indochina-2004   IC     7,414,866    304,472,122
uk-2002          U1     18,520,486   529,444,615
arabic-2005      AR     22,744,080   1,116,651,935
uk-2005          U2     39,459,925   1,581,073,454

Table 5.1: Summary of datasets

5.3 Performance of T-BUDS

In this section, the performance of T-BUDS is compared to that of UDS in terms of running time and memory usage (Figure 5.1). For this comparison, we set the utility threshold to 0.8. UDS is quite slow on our moderate and large datasets: it was not able to complete in reasonable time (100 h) on them. As such, we provide as input to UDS not the full list of 2-hop pairs as in [14], but the reduced list from the MST of $G_{2\text{-}hop}$. This way, we were able to handle the datasets CN, H1, and H2 with UDS. However, we still could not get UDS to complete on the rest of the datasets.

Figure 5.1 shows the running time (sec) and memory usage (MB) of T-BUDS and UDS. As the figure shows, T-BUDS outperforms UDS in both running time and memory usage by orders of magnitude. Moreover, T-BUDS can easily deal with the largest graph, U2, in less than 7 hours. In contrast, UDS takes more than 90 hours to produce results on a moderate graph such as H2.

In another experiment, we compare the performance of T-BUDS and UDS for varying utility thresholds. Figure 5.2 shows the runtime of the two algorithms on two different graphs, CN and H1, for varying utility thresholds. Having an algorithm that is computationally insensitive to changing the threshold is desirable because it allows the user to conveniently experiment with different threshold values. As shown in the figure, the runtime of T-BUDS remains almost unchanged across different utility thresholds. In contrast, UDS depends strongly on the utility threshold, and its runtime grows as the threshold decreases.

Figure 5.1: T-BUDS vs UDS in terms of (a) runtime in seconds and (b) memory usage in MB, with τ set to 0.8. T-BUDS is orders of magnitude faster than UDS. We provide our MST edge pairs as input to UDS; the original version of UDS could not complete within 100 h for any dataset but CN. Even with the MST as input, UDS still could not complete for IC, U1, AR, and U2.

Figure 5.2: T-BUDS vs UDS for different utility thresholds on (a) CN and (b) H1. T-BUDS is faster by orders of magnitude and, being based on binary search, is quite stable as τ varies.


5.4 Usefulness of Utility-Driven Framework

In this section, we study the performance of T-BUDS for top-k query answering. To do so, we compute the PageRank centrality of the nodes and assign a (normalized) importance score to each edge based on the sum of the importance scores of its two endpoints. We then compute the summary using T-BUDS.

Subsequently, we obtain the top t% of central nodes in G based on a centrality score (such as PageRank, Degree, Eigenvector, or Betweenness) and check whether the corresponding supernode of each such central node in the summary graph is small in size. This is desirable because the centrality of a supernode in the summary is divided evenly among the nodes inside it. Towards this, we use the notion of app-utility as defined in [14]. Namely, the app-utility value of a top-k query is

$$\text{app-utility} = \frac{\sum_{v \in V_t} \frac{1}{|S(v)|}}{|V_t|} \quad (5.1)$$

where $V_t$ is the set of top t% central nodes and $|V_t| = t\% \times |V|$. The app-utility value is between 0 and 1, and the higher the value, the better the summarization is at capturing the structure of the original graph: app-utility = 1 indicates that each central node is in a supernode of size 1, while app-utility < 1 indicates that at least one central node is in a supernode of size greater than one, i.e., "crowded" with other nodes.
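Equation 5.1 translates directly into code; in this small sketch, topNodes stands for the set $V_t$ and superSize[v] plays the role of |S(v)| (both assumed precomputed):

```java
public class AppUtilitySketch {
    // app-utility of a top-k query (Equation 5.1): the average of 1/|S(v)|
    // over the top central nodes.
    static double appUtility(int[] topNodes, int[] superSize) {
        double s = 0;
        for (int v : topNodes)
            s += 1.0 / superSize[v];   // a node alone in its supernode contributes 1
        return s / topNodes.length;
    }

    public static void main(String[] args) {
        int[] superSize = {1, 3, 1, 2};                              // |S(v)| for nodes 0..3
        System.out.println(appUtility(new int[]{0, 1}, superSize));  // (1 + 1/3)/2 = 0.666...
    }
}
```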

Table 5.2 shows the performance of T-BUDS with varying τ and top-t% central nodes on the two graphs CN and H1. The four columns after RN show the app-utility value of T-BUDS with respect to the top t% of central nodes on the CN graph, and the last four columns show the app-utility value of T-BUDS on the H1 graph. We use four types of top-t% queries: PageRank (P), Degree (D), Eigenvector (E), and Betweenness (B). In the table, the first five rows, labeled P, show the app-utility value of the T-BUDS summary with respect to the PageRank query; the next five rows, labeled D, show the app-utility value with respect to the degree query; and so on. As can be seen from Table 5.2, T-BUDS performs quite well on top-t% queries, especially for the PageRank and Betweenness centrality measures.

                       CN                       H1
Centrality  τ     RN    20%   30%   40%   50%    20%   30%   40%   50%
P           0.50  0.58  1.00  1.00  1.00  0.84   1.00  1.00  1.00  0.84
            0.60  0.53  1.00  1.00  1.00  0.94   1.00  1.00  1.00  0.94
            0.70  0.46  1.00  1.00  1.00  1.00   1.00  1.00  1.00  1.00
            0.80  0.38  1.00  1.00  1.00  1.00   1.00  1.00  1.00  1.00
            0.90  0.28  1.00  1.00  1.00  1.00   1.00  1.00  1.00  1.00
D           0.50  0.58  0.36  0.33  0.38  0.47   0.96  0.82  0.68  0.59
            0.60  0.53  0.40  0.37  0.42  0.51   0.99  0.92  0.81  0.71
            0.70  0.46  0.44  0.42  0.47  0.56   1.00  0.97  0.90  0.82
            0.80  0.38  0.53  0.53  0.58  0.65   1.00  0.99  0.96  0.90
            0.90  0.28  0.63  0.68  0.72  0.77   1.00  0.99  0.99  0.97
E           0.50  0.58  0.48  0.47  0.50  0.44   0.66  0.59  0.52  0.46
            0.60  0.53  0.53  0.52  0.55  0.49   0.72  0.66  0.60  0.55
            0.70  0.46  0.58  0.58  0.61  0.54   0.78  0.73  0.67  0.63
            0.80  0.38  0.64  0.65  0.69  0.63   0.83  0.80  0.76  0.72
            0.90  0.28  0.72  0.74  0.78  0.72   0.88  0.87  0.84  0.82
B           0.50  0.58  0.60  0.59  0.48  0.44   0.51  0.44  0.43  0.40
            0.60  0.53  0.68  0.65  0.53  0.48   0.56  0.51  0.50  0.48
            0.70  0.46  0.73  0.71  0.58  0.54   0.62  0.59  0.58  0.57
            0.80  0.38  0.80  0.80  0.66  0.62   0.70  0.68  0.67  0.66
            0.90  0.28  0.91  0.90  0.79  0.75   0.80  0.79  0.78  0.77

Table 5.2: T-BUDS: app-utility for top-k queries for PageRank (P), Degree (D), Eigenvector (E), and Betweenness (B) centralities.

We compare the performance of T-BUDS versus SWeG in terms of app-utility for different centrality measures, namely PageRank, Degree, Eigenvector, and Betweenness. In order to compare fairly against SWeG, which does not accept a utility threshold τ as a parameter, we fixed five τ values, 0.5, 0.6, 0.7, 0.8, and 0.9, and calculated the reduction in nodes (RN) achieved by T-BUDS for each value, where RN = 1 − |𝒱|/|V|. Then we ran SWeG (the lossy version) and stopped it when each RN value was reached. We computed the app-utility value of each summary for the top 20%, 30%, 40%, and 50% of central nodes.

In Figure 5.3, we show the relative improvement of T-BUDS over SWeG for two scenarios, τ = 0.8 and τ = 0.6, for t = 20%. We observe T-BUDS to be significantly better than SWeG; for instance, we obtain about 30% and 50% improvement in app-utility.

Figure 5.3: T-BUDS vs SWeG with respect to app-utility for top-20% queries on CN: (a) τ = 0.8, (b) τ = 0.6. The plots show the relative improvement (0%-50%) for the different queries (D, P, E, B). SWeG lossy was run until it reached the RN values corresponding to the values of τ; the RN values are given in Table 5.2, i.e., 0.58, 0.53, 0.46, 0.38, 0.28. Graph summaries of T-BUDS provide significantly better app-utility than those of SWeG, and the difference becomes more pronounced as τ is lowered.


Chapter 6

Conclusion

In this work, we studied the problem of creating a graph summary with the objective of minimizing storage while keeping its utility above a user-specified threshold. To date, only UDS [14] has targeted such a summary, and it was able to beat other types of summarization techniques in terms of answering queries using the summary directly. UDS has some limitations, though: specifically, it is a computation-heavy algorithm with high memory requirements.

In our work, we presented a highly scalable algorithm that foregoes the expensive iterative process that hampers the previous work. Our algorithm, T-BUDS, achieves this by combining a memory-reduction technique with a novel binary-search approach. Though more memory-efficient and faster, it provides the exact same summary as UDS. To achieve such scalability, we proposed using the sorted list of MST edges, rather than all two-hop edges (as done in UDS), as candidate pairs; in Theorem 1, we proved that doing so results in the exact same summary. Further, we proposed a binary-search approach based on Theorem 2 to make the runtime independent of the utility threshold τ.

Our experiments showed that T-BUDS outperforms UDS in both running time and memory usage by two orders of magnitude. For instance, T-BUDS can easily summarize the largest graph, U2, with more than a billion edges, in less than 7 hours. In contrast, UDS takes more than 90 hours on a moderate graph such as H2, with roughly two hundred million edges. Also, we showed that the runtime of UDS grows as the threshold decreases, while the runtime of T-BUDS is stable and independent of the threshold.

Further, in Section 5.4, we showed that the utility of a summary corresponds directly to application utilities such as top-k queries. Graph summaries produced by T-BUDS provide significantly better app-utility scores than the state-of-the-art technique, SWeG [31].


Chapter 7

Future Work

Here we list some of the potential directions that we will explore in the future.

Firstly, we have observed that T-BUDS does not provide good compression in the lossless case, i.e., τ = 1. This behaviour is inherited from UDS [14], and we would like to develop a scalable technique for lossless summarization. Further, we plan to integrate such a technique with T-BUDS to provide additional compression for a slight loss in utility.

Secondly, we would like to explore new techniques for deciding the order of the merges. Currently, T-BUDS uses the ordered edge list of the MST of 2-hop edges, which is a significant improvement over using all 2-hop edges as in UDS [14]. We hope that finding a better technique would further improve the quality of the summary and provide better compression while retaining the same level of utility.

Thirdly, we would like to extend our method to dynamic graphs. One of the challenges would be calculating utility on the fly, and new ideas are needed in this context.


Bibliography

[1] Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J. Smola. Distributed large-scale natural graph factorization. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, page 37–48, New York, NY, USA, 2013. Association for Computing Machinery.

[2] Alberto Apostolico and Guido Drovandi. Graph compression by BFS. Algorithms, 2(3):1031–1044, 2009.

[3] Paolo Boldi and Sebastiano Vigna. The WebGraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web, pages 595–602, 2004.

[4] Diane J. Cook and Lawrence B. Holder. Substructure discovery using minimum description length and background knowledge. J. Artif. Int. Res., 1(1):231–255, 1994.

[5] Cody Dunne and Ben Shneiderman. Motif simplification: Improving network visualization readability with fan, connector, and clique glyphs. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’13, page 3247–3256, New York, NY, USA, 2013. Association for Computing Machinery.


[6] Wenfei Fan, Jianzhong Li, Xin Wang, and Yinghui Wu. Query preserving graph compression. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, page 157–168, New York, NY, USA, 2012. Association for Computing Machinery.

[7] Xiangyang Gou, Lei Zou, Chenxingyu Zhao, and Tong Yang. Fast and accurate graph stream summarization. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1118–1129. IEEE, 2019.

[8] Michael Hay, Gerome Miklau, David Jensen, Don Towsley, and Philipp Weis. Resisting structural re-identification in anonymized social networks. Proc. VLDB Endow., 1(1):102–114, 2008.

[9] J. E. Hopcroft and J. D. Ullman. Set merging algorithms. SIAM Journal on Computing, 2(4):294–303, 1973.

[10] Christian Hübler, Hans-Peter Kriegel, Karsten Borgwardt, and Zoubin Ghahramani. Metropolis algorithms for representative subgraph sampling. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, page 283–292, USA, 2008. IEEE Computer Society.

[11] Kifayat Ullah Khan, Waqas Nawaz, and Young-Koo Lee. Set-based approximate approach for lossless graph summarization. Computing, 97(12):1185–1207, 2015.

[12] Danai Koutra, U Kang, Jilles Vreeken, and Christos Faloutsos. Summarizing and understanding large graphs. Stat. Anal. Data Min., 8(3):183–202, 2015.

[13] Joseph B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.

[14] K. Ashwin Kumar and Petros Efstathopoulos. Utility-driven graph summarization. Proceedings of the VLDB Endowment, 12(4):335–347, 2018.

[15] Kristen LeFevre and Evimaria Terzi. GraSS: Graph structure summarization. In SDM, 2010.

[16] Jure Leskovec and Christos Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, page 631–636, New York, NY, USA, 2006. Association for Computing Machinery.

[17] C. Li and S. Lin. Egocentric information abstraction for heterogeneous social networks. In 2009 International Conference on Advances in Social Network Analysis and Mining, pages 255–260, 2009.

[18] Chenhui Li, George Baciu, and Yunzhe Wang. Modulgraph: Modularity-based visualization of massive graphs. In SIGGRAPH Asia 2015 Visualization in High Performance Computing, SA ’15, New York, NY, USA, 2015. Association for Computing Machinery.

[19] Edo Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, page 581–588, New York, NY, USA, 2013. Association for Computing Machinery.

[20] Xingjie Liu, Yuanyuan Tian, Qi He, Wang-Chien Lee, and John McPherson. Distributed graph summarization. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM ’14, page 799–808, New York, NY, USA, 2014. Association for Computing Machinery.


[21] Yike Liu, Tara Safavi, Abhilash Dighe, and Danai Koutra. Graph summarization methods and applications: A survey. ACM Computing Surveys (CSUR), 51(3):1– 34, 2018.

[22] Antonio Maccioni and Daniel Abadi. Scalable pattern matching over compressed graphs via dedensification. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1755–1764, 2016.

[23] Arun S. Maiya and Tanya Y. Berger-Wolf. Sampling community structure. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, page 701–710, New York, NY, USA, 2010. Association for Computing Machinery.

[24] Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, and Christian Bizer. Graph structure in the web—revisited: a trick of the heavy tail. In Proceedings of the 23rd international conference on World Wide Web, pages 427–432, 2014.

[25] Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. Graph summarization with bounded error. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, page 419–432, New York, NY, USA, 2008. Association for Computing Machinery.

[26] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. In WWW, 1999.

[27] R. C. Prim. Shortest connection networks and some generalizations. The Bell System Technical Journal, 36(6):1389–1401, 1957.

[28] Matteo Riondato, David García-Soriano, and Francesco Bonchi. Graph summarization with quality guarantees. Data Mining and Knowledge Discovery, 31(2):314–349, 2017.


[29] Ryan A Rossi and Rong Zhou. Graphzip: a clique-based sparse graph compression method. Journal of Big Data, 5(1):10, 2018.

[30] Neil Shah, Danai Koutra, Lisa Jin, Tianmin Zou, Brian Gallagher, and Christos Faloutsos. On summarizing large-scale dynamic graphs. IEEE Data Eng. Bull., 40(3):75–88, 2017.

[31] Kijung Shin, Amol Ghoting, Myunghwan Kim, and Hema Raghavan. Sweg: Lossless and lossy summarization of web-scale graphs. In The World Wide Web Conference, WWW ’19, page 1679–1690, New York, NY, USA, 2019. Association for Computing Machinery.

[32] Daniel A Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.

[33] Yuanyuan Tian, Richard A. Hankins, and Jignesh M. Patel. Efficient aggregation for graph summarization. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, page 567–580, New York, NY, USA, 2008. Association for Computing Machinery.

[34] Ning Yan, Sona Hasani, Abolfazl Asudeh, and Chengkai Li. Generating preview tables for entity graphs. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, page 1797–1811, New York, NY, USA, 2016. Association for Computing Machinery.

[35] Zeqian Shen, Kwan-Liu Ma, and T. Eliassi-Rad. Visual analysis of large heterogeneous social networks by semantic and structural abstraction. IEEE Transactions on Visualization and Computer Graphics, 12(6):1427–1439, 2006.
