Influential Community Discovery in Massive Social Networks Using a Consumer-Grade Machine


by

Shu Chen

B.Eng., Hangzhou Dianzi University, 2014

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE

in the Department of Computer Science

© Shu Chen, 2017

University of Victoria

All rights reserved. This thesis may not be reproduced in whole or in part, by photocopying or other means, without the permission of the author.


Influential Community Discovery in Massive Social Networks Using a Consumer-Grade Machine

by

Shu Chen

B.Eng., Hangzhou Dianzi University, 2014

Supervisory Committee

Dr. Alex Thomo, Supervisor (Department of Computer Science)

Dr. Kui Wu, Departmental Member (Department of Computer Science)


Supervisory Committee

Dr. Alex Thomo, Supervisor (Department of Computer Science)

Dr. Kui Wu, Departmental Member (Department of Computer Science)

ABSTRACT

Graphs have become crucial as they can represent a wide variety of systems in different areas. One interesting structure in graphs, called a community, has attracted considerable attention from both academia and industry. Community detection is meaningful, but typically hard in arbitrary networks. A lot of research has been done based on structural information, but we would like to find communities which are not only cohesive but also influential or important. The k-influential community model based on k-core, proposed by Li, Qin, Yu, and Mao, is helpful for discovering such cohesive and important communities. They formulate the problem as finding the top-r most important communities in a given graph.

In this thesis, our goal is to detect the top-r most important communities using efficient and memory-saving algorithms running on a consumer-grade machine. We analyze two existing algorithms, then propose multiple new efficient algorithms for this problem. To test their performance, we conduct extensive experiments on real-world graph datasets. Experimental results show that our algorithms are able to compute the top-r most important communities within a very reasonable amount of time and space on a consumer-grade machine.


Contents

Supervisory Committee
Abstract
Table of Contents
List of Tables
List of Figures
List of Algorithms
Acknowledgements
Dedication

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Agenda

2 Related Work

3 Background
  3.1 Basics of k-core and k-influential community
    3.1.1 k-core and k-core decomposition
    3.1.2 k-influential community
  3.2 k-core decomposition algorithm
  3.3 WebGraph

4 Initial Algorithms
  4.1 Two problems
  4.2 DFS-based algorithms C0 and NC0

5 Algorithms for P1 and P2
  5.1 Algorithm C1 for Problem 1
  5.2 Algorithm C2 for Problem 1
  5.3 Algorithm NC1 for Problem 2
  5.4 Algorithm NC2 for Problem 2
  5.5 Conclusion

6 Experimental Results
  6.1 Datasets
  6.2 Equipment
  6.3 Testing initial algorithms
    6.3.1 Problem 1
    6.3.2 Problem 2
    6.3.3 Conclusion
  6.4 Testing for Problem 1
    6.4.1 Experiment 1 - fixed r
    6.4.2 Experiment 2 - fixed k
    6.4.3 Conclusion
  6.5 Testing for Problem 2
    6.5.1 Experiment 1 - fixed r
    6.5.2 Experiment 2 - fixed k
    6.5.3 Conclusion

7 Conclusions and Future Work

A Additional Information
  A.1 Algorithm C2
  A.2 Algorithm NC2


List of Tables

Table 6.1 Properties of datasets ordered by m
Table 6.2 Ranges of parameters k and r
Table 6.3 Sizes of subgraphs of AstroPh and Slashdot0811 in different k
Table 6.4 Sizes of subgraphs of arabic-2005 in different k
Table 6.5 Sizes of subgraphs when k = 16
Table 6.6 Details of NC1 running on uk-2002 when r = 40


List of Figures

Figure 1.1 Communities in a network of network scientists [28]
Figure 2.1 Clique example: (a) 1-vertex clique, (b) 2-vertex clique, (c) 3-vertex clique, (d) 4-vertex clique
Figure 3.1 k-core decomposition for a sample graph
Figure 3.2 BZ algorithm applied to a simple graph
Figure 3.3 Results of the BZ algorithm Core procedure: (a) after processing node 3, (b) after processing neighbor 1 of node 0, (c) after processing neighbor 2 of node 0, (d) after processing node 4, (e) after processing node 1, (f) after processing node 2
Figure 4.1 Ck,i and Hk,i, for k = 2 and i ∈ [1, 4]: (a) C2,1, (b) H2,1, (c) C2,2, (d) H2,2, (e) C2,3, (f) H2,3, (g) C2,4, (h) H2,4
Figure 5.1 Ck,i for k = 2 and i ∈ [1, 5]: (a) original graph, C2,1, (b) after deleting I(1), C2,2, (c) after deleting I(2), C2,3, (d) after deleting I(3), C2,4, (e) after deleting I(4), C2,5
Figure 6.1 Problem 1 - AstroPh (r=40)
Figure 6.2 Problem 1 - Slashdot0811 (r=40)
Figure 6.3 Problem 2 - AstroPh (r=40)
Figure 6.4 Problem 2 - Slashdot0811 (r=40)
Figure 6.5 Problem 1 - Pokec (r=40)
Figure 6.6 Problem 1 - LiveJournal1 (r=40)
Figure 6.7 Problem 1 - uk-2002 (r=40)
Figure 6.8 Problem 1 - arabic-2005 (r=40)
Figure 6.9 Problem 1 - Pokec (k=16)
Figure 6.10 Problem 1 - LiveJournal1 (k=16)
Figure 6.11 Problem 1 - uk-2002 (k=16)
Figure 6.12 Problem 1 - arabic-2005 (k=16)
Figure 6.13 Sizes of top 320 CCI communities
Figure 6.14 Problem 2 - Pokec (r=40)
Figure 6.15 Problem 2 - LiveJournal1 (r=40)
Figure 6.16 Problem 2 - uk-2002 (r=40)
Figure 6.17 Problem 2 - arabic-2005 (r=40)
Figure 6.18 Problem 2 - Pokec (k=16)
Figure 6.19 Problem 2 - LiveJournal1 (k=16)
Figure 6.20 Problem 2 - uk-2002 (k=16)
Figure 6.21 Problem 2 - arabic-2005 (k=16)


List of Algorithms

1 Batagelj-Zaversnik (BZ) algorithm
2 Top-r CCI communities (C0)
3 RDelete
4 Search Maximally Connected Component (MCC)
5 Top-r non-containing CCI communities (NC0)
6 Top-r CCI communities (C1)
7 Top-r CCI communities (C2)
8 RDelete2
9 MCC with alive array (MCC2)
10 Top-r non-containing CCI communities (NC1)
11 Top-r non-containing CCI communities (NC2)


ACKNOWLEDGEMENTS

I would like to thank:

My supervisor, Dr. Alex Thomo, for his support and mentoring throughout the past two years. I am deeply grateful to him for providing patient guidance and insightful comments on this thesis.

My teammates, Diana Popova and Ran Wei, for their ideas, help, and friendship. This thesis would not have been accomplished without their efforts.

My parents, for always being there, supporting me, believing in me, and loving me. I am eternally grateful.

My friends, Fei, Yunlong, Zheng, and others, for their sincere friendship and for accompanying me in the low moments. I am so fortunate to have them as friends.


DEDICATION

This thesis is dedicated to my mother. For being my best friend forever.

Chapter 1

Introduction

With the growth of the Internet and the computer revolution, graphs have become extremely important, as they can represent a wide variety of systems in different areas. Biology, physics, social sciences, computer science, and other disciplines frequently use graphs to represent entities and their connections. In many graph applications, finding dense, cohesive subgraphs (also known as communities) is of paramount importance.

1.1 Motivation

Discovering communities in real networks is one of the important tasks in analyzing graphs, because community structures are highly correlated with the functionality of the networks in most cases [14]. For example, in a graph of websites, communities may represent groups of pages related to similar topics; in a social network graph, communities are likely to group users who have large numbers of followers. Figure 1.1 displays a graph where each node represents a scientist and is colored according to its community membership [28]. Visually, groups of nodes have a high density of links within communities and share the same color.

k-core, a well-known concept in graph theory, has been applied to many community detection problems [40, 8, 1, 35, 13] because it can be computed in polynomial time and used as a subroutine for harder problems [21]. Consider the following scenario. In social networks, it is common that users are more likely to stay engaged if their friends are engaged. Similarly, if connected users drop out, then their friends might drop out too. Therefore, it is important for social network providers to calculate the least number of friends a user needs to stay in the network, and to determine


Figure 1.1: Communities in a network of network scientists [28]

the remaining active users after a series of disengagements [21]. This corresponds to the k-core of a graph G, which is the largest induced subgraph of G in which every vertex has degree at least k.

A lot of research about community detection has been done based on graph structure only. In this thesis, we would like to find communities which not only have a high density of links, but are also influential or important [9]. For example, in a network of graph-theory researchers, we would like to find well-connected communities which contain persons whose citation rates are high.

Li, Qin, Yu, and Mao introduced a new community model [24] based on k-core, called k-influential community, to capture well-connected and influential communities. Given an undirected graph G, they assign each node a weight, which is a numerical value that indicates the influence or importance of the node. The influence of a k-influential community is measured by the minimum weight of nodes in this community.


The k-influential communities are maximally connected, k-core subgraphs of G with the highest influence.

Community detection is meaningful, but typically hard in arbitrary networks [14]. The structure of networks varies a lot across graphs, and the number and sizes of communities are unknown. Given the constraints of the k-influential community, discovering such communities directly is not practical in large real networks. The massive amount of data in large networks significantly increases computation and space complexity.

Although there are many difficulties, Li, Qin, Yu, and Mao in [24] have provided several algorithms with varying performance and space requirements for k-influential community discovery. Based on these algorithms, we would like to know whether there is a faster and more efficient approach for computing k-influential communities. In summary, calculating k-influential communities using fewer resources is the topic we would like to research in this thesis.

1.2 Contributions

The goal of this thesis is to speed up the computation of k-influential communities in large-scale networks, and we would like to achieve this using only a consumer-grade machine.

Even though the algorithms in [24] have greatly improved the performance of discovering k-influential communities, we propose four more efficient algorithms which use space in the order of the compressed version of the graph. As a graph compression framework, we used WebGraph, a highly efficient Java package for reducing the footprint of very large graphs. More precisely, our contributions are as follows.

1. We provide two fast algorithms which require reasonable memory for computing top-r k-influential communities. Compared with the online algorithm in [24], our algorithms are faster by orders of magnitude.

2. We present two fast and memory-saving algorithms for computing top-r non-containing k-influential communities. One of them completely eliminates the computation of maximally connected components, which greatly decreases the running time.


The biggest graph we used has about 22 million nodes and 553 million edges. We test the computation of communities for varying k and r, and the results show that our algorithms are able to compute communities within very reasonable time and space on a consumer-grade machine.

1.3 Agenda

Chapter 1 introduces our motivations and objectives for community detection, and some basic concepts about the k-influential model.

Chapter 2 presents related work on community detection.

Chapter 3 describes background knowledge which supports our algorithms, including detailed concepts of k-core and k-core decomposition, precise definitions of the k-influential model, a fast in-memory k-core decomposition algorithm, and WebGraph, a graph compression framework.

Chapter 4 begins with precise definitions of the two problems that we solve in this thesis; then the two initial algorithms from [24] for computing k-influential communities and non-containing k-influential communities are introduced.

Chapter 5 proposes two algorithms for the first problem defined in Chapter 4, and another two algorithms for the second problem. Our ideas and the pseudocode of each algorithm are provided.

Chapter 6 contains experimental methodologies, results, and analyses.

Chapter 7 concludes the thesis and discusses future work.


Chapter 2

Related Work

This thesis is inspired by data-mining work on extracting a set of cohesive subgraphs as communities in graphs and networks. Discovering communities is a crucial task to understand physical functions implied by graphs [13, 35].

In recent years, community detection has drawn a large amount of attention and research. In [16], Gregori et al. developed algorithms to extract a set of k-dense communities from the Internet AS-level topology graph, which enables researchers to gain more insight into the structure of the Internet. Koch [22] presented a new method for finding all connected maximal common subgraphs, which can be regarded as communities, in two graphs. Ground-truth communities are defined and detected by Yang et al. in [39].

The clique, a classic dense subgraph structure, is one of the basic concepts in graph theory. In an undirected graph G = (V, E), a clique is a subset of the vertices such that every two distinct vertices are adjacent. Figure 2.1 shows some examples of cliques. Despite the fact that finding a clique of a given size in a graph is an NP-complete problem, a large number of works and algorithms have been developed.

In [11], Cheng et al. present an external-memory algorithm (EmMCE) for maximal clique enumeration in large graphs. EmMCE proves to be more efficient than the conventional in-memory algorithm when the input graph cannot fit in memory. Koch [22] transforms the problem of finding maximal common subgraphs in two graphs into the clique problem.

Nevertheless, the strict definition of the clique is not practical for various applications, as it is unlikely that every entity would have a link to every other entity within the subgraph. Some concepts with relaxed definitions have been proposed. The n-clique, a generalized concept of the clique, was introduced in 1950 [26]. Alba [2] proposed the


Figure 2.1: Clique example: (a) 1-vertex clique, (b) 2-vertex clique, (c) 3-vertex clique, (d) 4-vertex clique

concept of the n-club, a maximal subgraph of diameter n, in 1973. The k-plex, a structure defined as a graph with n vertices in which each vertex is connected by a path of length 1 to at least n − k of the other vertices, was presented in 1978 [32]. Seidman [31] introduced the concept of k-core in 1983. In a k-core, each vertex needs to have at least k neighbors.

Compared with most of the other concepts, k-core can be computed and maintained in polynomial time [25, 30], and k-core decomposition, the process of calculating the core number of each node in a graph, can be applied to very large graphs [21, 38]. In [10], Cheng et al. proposed the first external-memory algorithm (EMcore) for core decomposition in massive graphs which are too large to keep in main memory. Their experimental results show that EMcore is efficient for core decomposition in graphs with up to 52.9 million vertices and 1.65 billion edges.

In addition, k-core decomposition has been broadly used for community detection. In [40, 8, 1], the authors detect a new refined community, called a k-edge-connected subgraph, using k-core decomposition as a foundation. In [35], Sozio and Gionis studied the problem of finding a community in a graph given a set of query nodes, and they provided a global search strategy based on k-core decomposition. They also found that, in the community search problem, the minimum degree is a better measure than other measures, such as average degree and density. Cui et al. [13] investigate a similar problem of finding the best community containing a given query vertex in its neighborhood, and they proposed a local search method using the k-core concept.

However, none of the above works considers the influence of a community. In this thesis, we focus on the k-influential community, which is constructed based on k-core by Li et al. [24]. They provided an online search algorithm and an optimal index-based algorithm for k-influential community detection, and our work in this


thesis uses the online search algorithm of [24] as a foundation.

Based on k-core, a new structure called k-truss, which represents the "core" of a k-core, was proposed by Cohen [12]. The definition of k-truss is stricter than that of k-core. Namely, it is a subgraph of a k-core where every edge is contained in at least k − 2 triangles [37]. Community detection based on k-truss has been studied too. In [19], Huang et al. defined a novel k-truss community model based on the k-truss concept, and provided an algorithm which runs in linear time with respect to the community size. Given a graph and a set of query nodes, the closest truss community search problem is studied in [20].

In the next chapter, we comprehensively introduce the concept of k-core and the k-influential community model which we study in this thesis.


Chapter 3

Background

In this chapter, we present the necessary background information which we use in this thesis. Section 3.1 presents the concepts of k-core and k-influential community, as well as an example of k-core. In Section 3.2, an efficient k-core decomposition algorithm we used, the Batagelj-Zaversnik algorithm, will be introduced. In Section 3.3, we briefly go over a graph compression framework, WebGraph, which is used for decreasing space complexity.

3.1 Basics of k-core and k-influential community

3.1.1 k-core and k-core decomposition

We denote an undirected graph by G = (V, E), where V is the set of vertices and E is the set of edges. We set n = |V| and m = |E|. Given a vertex v, we denote by dG(v) the degree of v in G.

The notion of k-core was introduced by Seidman [31] in 1983. We adopt the following definitions from [27].

Definition 1. Given a subgraph C of G induced by a subset of nodes S, the degree of node v in C is denoted by dC(v). C is a k-core if and only if C is a maximal induced subgraph of G such that ∀v ∈ C, dC(v) ≥ k.

For a subgraph C of G which is a k-core, we denote it as Ck. We have the following definition of coreness.

Definition 2. A node v in G has coreness or core number k if and only if v belongs to Ck, but not Ck+1.


The process of calculating coreness or core number for each node in G is called k-core decomposition. Figure 3.1 shows an example of k-core decomposition for a sample graph.

Figure 3.1: k-core decomposition for a sample graph

As described in Figure 3.1, there are no nodes with coreness 0 in the graph because there is no node without any connection. The 1-core is the entire graph G, the 2-core is the subgraph obtained by removing the three yellow nodes from G, and the 3-core is a subgraph consisting of two separate connected components.

Based on the definitions and the example, we can conclude that there is at most one k-core in G for every k = 1, 2, .... In addition, k-cores are nested, e.g., 3-core ⊆ 2-core in Figure 3.1. Moreover, k-cores Ck(G) are not necessarily connected subgraphs; they can contain multiple maximally connected components.

3.1.2 k-influential community

k-influential community is a model described by Li, Qin, Yu, and Mao in [24] to capture well-connected and influential communities in graphs. It is built based on k-core.

Given a graph, each node can have a weight, a numerical value which indicates the influence or importance of the node. Such a value can be computed using PageRank,


or it can be some attribute of the social network user that the node represents, e.g. age, social status, number of citations, etc.

The influence of a k-influential community is measured by the minimum weight of its nodes. For an induced subgraph H (group of nodes in graph G), there are three constraints for it to be a k-influential community [24].

1. All nodes in H should be connected.

2. Each node in H has degree at least k.

3. H is not contained in any other induced subgraph which satisfies the first two constraints and has the same influence.

Based on the above constraints, a k-influential community obviously is a k-core cohesive subgraph.

Consider Figure 3.1 for an example. For simplicity of illustration, we set the weights of vertices to be their id’s. There are two connected components in the 3-core subgraph C3(G). We set the left connected component as C1 and the right one as C2. C1 is a 3-influential community with influence 9 because (a) the minimum-weight vertex in C1 is vertex 9, (b) it is a connected component where each node has degree at least 3, (c) it is not contained in any other connected subgraph where all nodes have degree 3 or more and influence 9 or more. Similarly, C2 is another 3-influential community with influence 10.

Given r and k, the k-influential community problem is to find the top-r (with the highest influences) k-influential communities in a graph.

We note that, in the top-r results, a k-influential community can contain another k-influential community of higher influence value. In order to avoid inclusion relationships in the top-r results, Li et al. [24] provide a constraint for checking non-containing k-influential communities. According to this constraint, a non-containing k-influential community H with influence w should not contain any other k-influential community whose influence value is larger than w.

To make the terminology more precise, in the following chapters we refer to k-influential communities as connected-cohesive-important (CCI) communities, and to non-containing k-influential communities as non-containing CCI communities.

Discovering CCI communities and non-containing CCI communities are the two problems we explore in this thesis. To capture such communities, we first need to calculate the coreness of each node in the graph. The next section describes


an efficient, O(m), algorithm for determining the core decomposition of a given graph [4].

3.2 k-core decomposition algorithm

In the k-core decomposition problem, the goal is to calculate the coreness or core number of each node in a graph. The core number of a node is the highest value of k such that the node belongs to a k-core. One property is useful when searching for a k-core [3]:

Given a graph G, if we recursively delete all vertices of degree less than k, together with their incident edges, the remainder of graph G is the k-core.
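This property can be turned directly into a simple peeling routine. The following is a minimal Java sketch under the assumption of a plain adjacency-list representation; the class and method names are ours, for illustration only.

import java.util.ArrayDeque;
import java.util.Arrays;

public class NaiveKCore {
    // adj[v] lists the neighbors of v; returns true for vertices that remain in the k-core
    public static boolean[] kCore(int[][] adj, int k) {
        int n = adj.length;
        int[] deg = new int[n];
        boolean[] alive = new boolean[n];
        Arrays.fill(alive, true);
        ArrayDeque<Integer> toDelete = new ArrayDeque<>();
        for (int v = 0; v < n; v++) {
            deg[v] = adj[v].length;
            if (deg[v] < k) toDelete.add(v);               // already below k
        }
        while (!toDelete.isEmpty()) {
            int v = toDelete.poll();
            if (!alive[v]) continue;                        // already removed
            alive[v] = false;                               // delete v and its incident edges
            for (int u : adj[v]) {
                if (alive[u] && --deg[u] < k) toDelete.add(u); // neighbor drops below k
            }
        }
        return alive;                                       // survivors form the k-core
    }
}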

How to rapidly find the nodes whose degrees are less than k at each iteration is a challenge in k-core decomposition. Sorting all the remaining nodes in the graph by their degrees is a solution. The Batagelj-Zaversnik (BZ) algorithm [4], a very efficient algorithm for core decomposition, recursively deletes nodes whose degrees are less than k until the degrees of all remaining nodes are at least k, and uses bin-sort to rapidly find the target nodes. The pseudocode of the BZ algorithm is shown in Algorithm 1.

Algorithm 1 Batagelj-Zaversnik (BZ) algorithm
Input: A graph G = (V, E), n = |V|, m = |E|
Output: An array deg containing the coreness of each node
1: deg[v] ← compute degree of each node v
2: bin[i] ← start index of the first node whose degree equals i in vert
3: vert[i] ← nodes sorted by degree in ascending order (bin-sort)
4: pos[v] ← position of node v in vert
5: for i = 0 upto n do
6:     v ← vert[i]
7:     for all u ∈ neighbors(v) do
8:         if deg[u] > deg[v] then
9:             Find the first node w whose degree equals deg[u] in vert
10:            if u ≠ w then
11:                Swap u and w in vert and pos
12:            bin[deg[u]] ← bin[deg[u]] + 1
13:            deg[u] ← deg[u] − 1


The input of the BZ algorithm is a whole graph, and the output is an array or table which contains the core number of each node. Given a graph G = (V, E), where V is the set of vertices and E is the set of edges, we set n = |V| and m = |E|. The algorithm requires four arrays, deg, vert, pos, and bin, for the coreness calculation. The first array, deg, is initialized to record the degrees of the corresponding nodes, and its size is n; e.g., deg[v] is the degree of node v, where 0 ≤ v < n. The sizes of the arrays vert and pos are both n. The vert array contains the nodes ordered by their degrees in ascending order, while the positions of the nodes in vert are stored in pos; e.g., if vert[i] = v, then pos[v] = i. The size of the array bin equals the maximum degree of the graph, which we denote by M. The bin array is initialized to contain the number of nodes which have the same degree; e.g., if there are 7 nodes with degree 0 and 5 nodes with degree 1, then bin[0] = 7 and bin[1] = 5. To avoid an additional array, as the algorithm proceeds, the bin array records the position of the first node of the corresponding degree in vert; e.g., bin[i] contains the index of the first node with degree i in vert. In practice, the vert array can be divided into multiple blocks of vertices; e.g., block i contains all nodes with degree i, where 0 ≤ i ≤ M. If the first 10 nodes in vert have degree 0, and the next 9 nodes have degree 1, then bin[0] = 0, bin[1] = 10, and bin[2] = 19.

Figure 3.2: BZ algorithm applied in a simple graph

The main process of the BZ algorithm, which we call the Core procedure, starts from the first node in the array vert and goes to the end. For the node v of smallest degree in vert, the algorithm decrements the degree of each neighbor u of v whose degree is larger than that of v. Then it moves node u from its current block to the block on its left, which


can be done in constant time by swapping u and the first node in the same block. Since positions in vert are changed, the algorithm also needs to swap the positions in pos. Finally, it increments the start index of the current block in bin, and decrements the degree of u in deg. After processing all nodes in vert, deg contains the coreness of each node. To illustrate the above, Figure 3.2 shows an example that applies the BZ algorithm to a simple toy graph.

Figure 3.3: Results of the BZ algorithm Core procedure: (a) after processing node 3, (b) after processing neighbor 1 of node 0, (c) after processing neighbor 2 of node 0, (d) after processing node 4, (e) after processing node 1, (f) after processing node 2

Figure 3.2 presents a simple toy graph with 5 nodes and the 4 arrays initialized by the BZ algorithm. The indices of all arrays start from 0. As described in the array vert, the nodes sorted in ascending order of degree are 3, 0, 4, 2, and 1. Array pos[i] records the position of node i in vert. The Core procedure starts from the first node in vert, node 3. A neighbor of 3 is 1, and the degree of 1 is larger than the degree of 3. The first node w which has the same degree as 1 in vert is still 1, so no swap between w and 1 is needed. But we still need to increment bin[deg[1]] and decrement deg[1]. The results after processing node 3 are shown in Figure 3.3a. The next node is 0, which has two neighbors, 1 and 2. For neighbor 1, the degree of 1 is larger than that of 0, and the first node that has the same degree as 1 in vert is 2, so we swap 1 and 2 in both vert and pos. Moreover, we increment


bin[deg[1]] and decrement deg[1]. For neighbor 2, whose degree is 3, the first node which has the same degree in vert is itself, so no swap is required. After updating bin[deg[2]] and deg[2], the results are displayed in Figure 3.3b and Figure 3.3c. When processing nodes 4, 1, and 2, all neighbors of these nodes have the same degrees as the nodes themselves, so no further operation is required. Figures 3.3d to 3.3f display the results after processing nodes 4, 1, and 2. According to the output of the BZ algorithm, the corenesses of nodes 0, 1, 2, 3, and 4 are 2, 2, 2, 1, and 2, respectively.

The total time complexity of the BZ algorithm is shown to be O(max(m, n)) [4]. Since m ≥ n − 1 in a connected graph, the time complexity of the BZ algorithm is effectively O(m), which makes it a very efficient algorithm for k-core decomposition. In this thesis, we use the BZ algorithm for the coreness computation, which is the basis of CCI community detection.
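For concreteness, the following Java sketch implements the bin-sort core decomposition described above over a plain adjacency-list array. The class name and representation are ours (our actual implementation works on top of WebGraph, introduced in the next section).

public class BZCoreDecomposition {
    // adj[v] lists the neighbors of v; returns deg, where deg[v] is the coreness of v
    public static int[] coreNumbers(int[][] adj) {
        int n = adj.length;
        int[] deg = new int[n];
        int maxDeg = 0;
        for (int v = 0; v < n; v++) {
            deg[v] = adj[v].length;
            if (deg[v] > maxDeg) maxDeg = deg[v];
        }
        // bin-sort the vertices by degree: bin[d] = start index of the degree-d block in vert
        int[] bin = new int[maxDeg + 1];
        for (int v = 0; v < n; v++) bin[deg[v]]++;
        int start = 0;
        for (int d = 0; d <= maxDeg; d++) { int count = bin[d]; bin[d] = start; start += count; }
        int[] vert = new int[n];
        int[] pos = new int[n];
        for (int v = 0; v < n; v++) { pos[v] = bin[deg[v]]; vert[pos[v]] = v; bin[deg[v]]++; }
        for (int d = maxDeg; d > 0; d--) bin[d] = bin[d - 1];   // restore block start indices
        bin[0] = 0;
        // Core procedure: process vertices in ascending order of current degree
        for (int i = 0; i < n; i++) {
            int v = vert[i];
            for (int u : adj[v]) {
                if (deg[u] > deg[v]) {
                    int du = deg[u], pu = pos[u];
                    int pw = bin[du], w = vert[pw];
                    if (u != w) {                                // swap u with the first node of its block
                        vert[pu] = w; vert[pw] = u;
                        pos[u] = pw; pos[w] = pu;
                    }
                    bin[du]++;                                   // shrink the degree-du block
                    deg[u]--;                                    // u now belongs to the degree-(du-1) block
                }
            }
        }
        return deg;                                              // deg[v] is the coreness of v
    }
}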

3.3 WebGraph

With the development of the World Wide Web, graphs representing web pages and links have become extremely large. In order to manipulate such very large graphs easily, Boldi and Vigna [6, 5] provide a highly efficient graph compression framework, WebGraph.

The WebGraph framework is a suite of codes, algorithms, and tools [6]. The codes, which are suitable for storing Web graphs, provide a high compression ratio. The algorithms use lazy techniques that delay decompression until it is actually necessary when accessing a compressed graph. The WebGraph framework is completely documented and packaged as a set of jar files. All the information and tools are freely available from the WebGraph home page (http://webgraph.di.unimi.it).

In this thesis, we implemented the BZ algorithm and our new algorithms using the WebGraph API for random access. Two classes of WebGraph are used in our programs:

1. it.unimi.dsi.webgraph.ImmutableGraph

2. it.unimi.dsi.webgraph.NodeIterator

Class ImmutableGraph is a simple class representing an immutable graph, i.e., a graph that is computed once and for all, then stored and accessed repeatedly. The function ImmutableGraph.load() is called to load graphs for random access. Then we call ImmutableGraph.successorArray(u) and ImmutableGraph.outdegree(v) to get the neighbors


of node u and the outdegree of node v, respectively. An object of class NodeIterator is created by ImmutableGraph.nodeIterator(), which returns a node iterator for scanning the graph sequentially, starting from the first node.
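To make the API usage concrete, here is a minimal sketch of loading a compressed graph and accessing it with these two classes. The basename "example-graph" and the surrounding class are placeholders, not code from the thesis.

import it.unimi.dsi.webgraph.ImmutableGraph;
import it.unimi.dsi.webgraph.NodeIterator;

public class WebGraphExample {
    public static void main(String[] args) throws Exception {
        // load the compressed graph once; it can then be accessed repeatedly
        ImmutableGraph g = ImmutableGraph.load("example-graph");

        // random access: neighbors and out-degree of node 0
        int v = 0;
        int[] neighbors = g.successorArray(v);   // only the first outdegree(v) entries are valid
        System.out.println("node " + v + " has out-degree " + g.outdegree(v));

        // sequential scan over all nodes
        NodeIterator it = g.nodeIterator();
        long degreeSum = 0;
        while (it.hasNext()) {
            it.nextInt();                         // advance to the next node
            degreeSum += it.outdegree();
        }
        System.out.println("sum of out-degrees: " + degreeSum);
    }
}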


Chapter 4

Initial Algorithms

In this chapter, the two problems (P1 and P2) that we set out to solve in this thesis are first precisely defined. Then we comprehensively describe two algorithms from [24], C0 and NC0, for computing CCI and non-containing CCI communities, respectively; i.e., C0 solves problem P1, and NC0 solves P2.

4.1 Two problems

According to the previous chapter, we define two top-r CCI community problems (P1 and P2):

1. Given an undirected graph G, and two positive integers k and r, how to compute the top-r CCI communities?

2. Given an undirected graph G, and two positive integers k and r, how to compute the top-r non-containing CCI communities?

4.2 DFS-based algorithms C0 and NC0

Depth-First Search (DFS) is an algorithm for traversing (or searching) tree or graph data structures. The most common use of the DFS algorithm is to find connected components in a graph. According to the discussion in Section 3.1.2, given k and r, the CCI community and non-containing CCI community problems are to discover the top-r (with the highest weights) maximally connected, k-core communities of a graph. In the following, we use the DFS-based algorithm to search for CCI communities in a simple graph.


Recall that we use undirected graphs to represent networks. Consider an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges; n and m are the numbers of vertices and edges, respectively. Furthermore, each node u in G has a weight wu indicating the influence or importance of u. We assume a strict total order on weights. In case of ties, we use the lexicographical order of vertex ids to break the ties. The influence or importance of a CCI community is defined to be the lower bound of its vertices' weights.

Since a CCI community is actually a k-core, or a connected component of a k-core, in a graph, we denote the maximal k-core of G by Ck(G), and the CCI communities of G by Hk(G). Similar to k-core decomposition, the DFS-based algorithm in [24] discovers CCI communities Hk(G) by "peeling off" the maximal k-core Ck(G). We call the DFS-based algorithm for Problem 1 C0 in this thesis, and the pseudocode of C0 is given in Algorithm 2.

Algorithm 2 Top-r CCI communities (C0)
Input: G, w, k, r
Output: Hk,1, . . . , Hk,r
1: Ck(G) ← compute the maximal k-core of G
2: cache ← ∅
3: i ← 1
4: Ck,i ← Ck(G)
5: while Ck,i ≠ ∅ do
6:     Let v be the minimum-weight vertex in Ck,i
7:     Hk,i ← MCC(Ck,i, v)
8:     if cache.size() = r then
9:         cache.deleteFirst()
10:    cache.addLast(Hk,i)
11:    Ck,i+1 ← RDelete(Ck,i, v)
12:    i ← i + 1
13:
14: Output cache


Algorithm 3 RDelete
1: procedure RDelete(C, v)
2:     for all u ∈ NC(v) do
3:         Delete edge (u, v) from C
4:         if dC(u) < k then
5:             RDelete(C, u)
6:
7:     Delete v from C

Algorithm 4 Search Maximally Connected Component (MCC)
1: procedure MCC(C, v)
2:     cc ← ∅
3:     MCC-DFS(C, v, cc)
4:     return cc
5:
6: procedure MCC-DFS(C, v, cc)
7:     cc.add(v)
8:     for all u ∈ NC(v) do
9:         if u ∉ cc then
10:            MCC-DFS(C, u, cc)

To illustrate algorithm C0, Figure 4.1 displays the process of finding CCI communities in a toy graph G. For simplicity, we set the weights of vertices to be their ids. Gray nodes and edges indicate that they are removed during the process shown in Figure 4.1. Since there are multiple iterations to peel off the whole graph, we denote by Ck,i the k-core subgraph before the ith iteration, and by Hk,i the CCI community found in Ck,i in the ith iteration.

Given k = 2 and r = 5, the goal is to find the top-5 (with the highest influences) CCI communities in the maximal 2-core subgraph of G. In C0, cache is a list of size r used to store the found CCI communities, and i denotes the iteration number.

We first compute the maximal 2-core C2(G) of G, then set C2,1 = C2(G) because all vertices have coreness at least 2. C2,1 is the 2-core subgraph before the first iteration, and v1 is the minimum-weight vertex in C2,1. Through the procedure Search Maximally Connected Component (MCC), which is given in


Figure 4.1: Ck,i and Hk,i, for k = 2 and i ∈ [1, 4]. The original graph in (a) is all in black; grayed-out vertices and edges are deleted. Weights are the vertex ids. Panels: (a) C2,1, (b) H2,1, (c) C2,2, (d) H2,2, (e) C2,3, (f) H2,3, (g) C2,4, (h) H2,4.

Algorithm 4, H2,1 is the MCC of C2,1 containing v1. Precisely, H2,1 is the first CCI community we find, with influence "1". Since cache is empty, H2,1 is saved directly. Next, we call procedure RDelete to remove v1, and recursively remove the neighbors of v1 whose degrees in C2,1 are lower than 2. In Figure 4.1c, we can see that vertices 2 and 8 are deleted when peeling off vertex 1. After recursively "peeling off" v1, we define C2,2 as the remainder of C2,1, and increment i for the next iteration. We repeat this process until C2,i becomes empty for some i. Finally, we get four CCI communities: H2,1 with influence "1" in Figure 4.1b, H2,2 with influence "3" in Figure 4.1d, H2,3 with influence "4" in Figure 4.1f, and H2,4 with influence "5" in Figure 4.1h. H2,4 is the top-1 CCI community with the highest importance.

Since the DFS procedure in RDelete recursively deletes all nodes whose degrees are less than k, it is clear that the CCI communities Hk,i, the MCCs of the Ck,i, are all k-cores.


In the example of Figure 4.1, there are four CCI communities: H2,1 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}, H2,2 = {3, 5, 6, 9, 10}, H2,3 = {4, 7, 11}, and H2,4 = {5, 6, 9, 10}. H2,3 and H2,4 do not contain any other communities, thus they are non-containing CCI communities. To find non-containing CCI communities, one step should be added after procedure RDelete to check whether all vertices in Hk,i have been deleted [24]. If yes, then it is a non-containing CCI community, and we store it in cache. We call this algorithm for Problem 2 NC0, and it is given in Algorithm 5.

Algorithm 5 Top-r non-containing CCI communities (NC0)
Input: G, w, k, r
Output: Hk,1, . . . , Hk,r
1: Ck(G) ← compute the maximal k-core of G
2: cache ← ∅
3: i ← 1
4: Ck,i ← Ck(G)
5: while Ck,i ≠ ∅ do
6:     Let v be the minimum-weight vertex in Ck,i
7:     flag ← false
8:     Hk,i ← MCC(Ck,i, v)
9:     Ck,i+1 ← RDelete(Ck,i, v)
10:    flag ← check whether vertices in Hk,i are all deleted in Ck,i+1
11:    if flag = true then
12:        if cache.size() = r then
13:            cache.deleteFirst()
14:        cache.addLast(Hk,i)
15:    i ← i + 1
16:
17: Output cache

In the next chapter, we discuss the bottlenecks of C0 and NC0, their time and space complexity, and then provide multiple new algorithms that solve the two problems (P1 and P2) more efficiently, using fewer resources, on a consumer-grade machine.


Chapter 5

Algorithms for P1 and P2

In the preceding chapter, algorithms C0 and NC0 were introduced for searching for top-r CCI communities and non-containing CCI communities, respectively. Since C0 and NC0 follow the same logic, they have the same time complexity and space complexity. We therefore only analyze algorithm C0 thoroughly below.

There are two bottlenecks in C0. The first one is the large number of MCC computations, which are performed for every minimum-weight vertex from the first to the last. The time complexity of computing one MCC is O(m), and computing it for each minimum-weight vertex costs O(m · n) time. The time complexity of procedure RDelete is O(m), which is absorbed by the MCC computations. Therefore, the time complexity of algorithm C0 is O(m · n), which is impractical for big graphs on a consumer-grade machine.

The second bottleneck is memory usage. Since algorithm C0 loads the whole graph into memory and stores the vertices of the top-r communities, the space complexity is O(m + n · r). When r is small, we assume that n · r can be absorbed by m. Nevertheless, when r is large, n · r becomes much bigger than m, which may exceed the available memory of a consumer-grade machine.

In order to break these bottlenecks, we propose two algorithms for Problem 1 and two algorithms for Problem 2 in this chapter. Our algorithms are inspired by the algorithms C0 and NC0 described in Section 4.2, as well as by the k-core decomposition algorithm, the BZ algorithm, shown in Section 3.2. They outperform C0 and NC0 by orders of magnitude.

In Section 5.1, we describe an algorithm C1 for Problem 1; then a faster algorithm C2 for the same problem is provided in Section 5.2. In Section 5.3, an algorithm NC1 for Problem 2 is introduced; then an improved algorithm NC2 for the same problem


is presented in Section 5.4.

5.1 Algorithm C1 for Problem 1

There are two reasons that the MCC computations are quite expensive. First, the RDelete procedure removes only a few vertices in most of the early iterations. We observed such situations in some real-world graph datasets. Second, because of the first reason, finding the MCC in the early Ck,i takes a lot of time, since these subgraphs are quite big.

However, RDelete, a recursive deletion procedure, is inexpensive to compute, especially in the early iterations. It deletes the minimum-weight vertex v from Ck,i, then recursively deletes all of v's neighbors whose degrees are less than k, until there are no more vertices with degree less than k. The total cost of RDelete is O(m), which is no more than traversing the whole graph. Thus, in order to reduce the MCC computations, we observe that we only need to compute MCCs in the last r iterations, because only those iterations produce the results we want, the top-r CCI communities.

How to obtain the total number of iterations beforehand is another question we pose. We notice that running the RDelete procedure twice solves this problem. In the first run, we calculate the total number of iterations; we then compute MCCs only in the second run. A new algorithm, C1, based on the above logic is described in Algorithm 6.

The input of C1 includes a graph G, a weight for each vertex, k, and r. We first call the BZ algorithm to compute the coreness of each node in graph G, then remove vertices whose coreness is less than k; the remainder of the graph forms the original Ck(G). C, a duplicate of Ck(G), represents Ck,i after RDelete in each iteration. Counter i counts how many iterations are required. In the first while loop (lines 4–7), the algorithm picks the minimum-weight vertex v from the current C, then recursively removes v and its neighbors by calling RDelete. Each iteration is recorded by incrementing counter i. After the first run, we restore C from the original Ck(G). In the second loop (lines 11–17), another counter j counts the iterations. Only when j > i − r, i.e., in the last r iterations, do we call the MCC procedure and output the results directly. Procedure RDelete is called a second time to recursively delete vertices from the graph. In total, we run MCC only r times.


Algorithm 6 Top-r CCI communities (C1)
Input: G, w, k, r
Output: Hk,1, . . . , Hk,r
1: Ck(G) ← compute the maximal k-core of G
2: i ← 1
3: C ← Ck(G)
4: while C ≠ ∅ do
5:     Let v be the minimum-weight vertex in C
6:     RDelete(C, v)
7:     i ← i + 1
8:
9: C ← Ck(G)
10: j ← 1
11: while C ≠ ∅ do
12:    Let v be the minimum-weight vertex in C
13:    if j > i − r then
14:        H ← MCC(C, v)
15:        Output H
16:    RDelete(C, v)
17:    j ← j + 1

Theorem 1. Algorithm C1 correctly computes all the top-r CCI communities of a given graph G.

The time complexity of C1 is O(m · r) because the MCC computation runs only r times, and the time of the two RDelete passes is absorbed by the MCC computations. In fact, since the last r MCC computations operate on quite small subgraphs, which are produced after i − r "peeling off" steps, O(m · r) is much smaller than O(m · n), the time complexity of C0.

For space complexity, we no longer store r communities, because only the last r communities are computed, and they are output directly. Therefore, the space complexity of C1 is O(m).

In the next section, we provide a faster algorithm which reduces the running time by a factor of (about) 2 compared with C1.

5.2 Algorithm C2 for Problem 1

Compared with C0, C1 greatly decreases the time complexity and space complexity. However, we still run two while loops on a whole graph in C1. How to avoid the


second while loop is the next question we want to figure out.

During our experiments, we found that the vertices removed in RDelete can be stored in a suitable data structure to avoid the second "peeling off" pass. The data structure we use here is a hash-based structure, which we call the iteration-delete-history and denote by I. In our Java code, we use the ArrayDeque data structure to represent I.

Each element I(i) of I is the list of vertices deleted in the corresponding iteration i. Take Figure 5.1 as an example for illustration.

Figure 5.1: Ck,i for k = 2 and i ∈ [1, 5]. The original graph in (a) is all in black; grayed-out vertices and edges are deleted. Panels: (a) original graph, C2,1, (b) after deleting I(1), C2,2, (c) after deleting I(2), C2,3, (d) after deleting I(3), C2,4, (e) after deleting I(4), C2,5.

Given k = 2 and weights equal to the nodes' ids, the original graph G is shown in Figure 5.1a. Figures 5.1b to 5.1e are the 2-core subgraphs of G after recursively deleting the minimum-weight node and its neighbors. It takes 4 iterations to peel off the whole graph; the nodes deleted in these iterations are: I(1) = {1, 2, 8}, I(2) = {3}, I(3) = {4, 7, 11}, and I(4) = {5, 6, 9, 10}.
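As a small illustration, the sketch below (our own toy code, not from [24]) stores these four lists in an ArrayDeque and consumes them from the most recent iteration backwards, which is the order in which C2 outputs communities.

import java.util.ArrayDeque;
import java.util.List;

public class IterationDeleteHistory {
    public static void main(String[] args) {
        // I(1)..I(4) from Figure 5.1 (k = 2, weights = vertex ids)
        ArrayDeque<List<Integer>> history = new ArrayDeque<>();
        history.addLast(List.of(1, 2, 8));
        history.addLast(List.of(3));
        history.addLast(List.of(4, 7, 11));
        history.addLast(List.of(5, 6, 9, 10));

        // consume the lists from the last iteration backwards, as C2 does
        int r = 2;
        for (int j = 0; j < r && !history.isEmpty(); j++) {
            List<Integer> deleted = history.pollLast();
            System.out.println("restore and expand from: " + deleted);
            // C2 would mark these vertices alive and run MCC2 from deleted.get(0)
        }
    }
}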

Algorithm C2, based on this hash-based structure, is described in Algorithm 7.


Algorithm 7 Top-r CCI communities (C2)
Input: G, w, k, r
Output: Hk,r, . . . , Hk,1
1: Ck(G) ← compute the maximal k-core of G
2: C ← Ck(G)
3: i ← 1
4: I ← ∅
5: while C ≠ ∅ do
6:     Let v be the minimum-weight vertex in C
7:     I(i) ← ∅
8:     RDelete2(C, v, I, i)
9:     i ← i + 1
10:
11: alive ← 0
12: for j = i downto i − r + 1 do
13:    for all v ∈ I(j) do
14:        alive[v] ← 1
15:    v ← I(j).first()
16:    H ← MCC2(Ck(G), v, alive)
17:    Output H

Algorithm 8 RDelete2
1: procedure RDelete2(C, v, I, i)
2:     for all u ∈ NC(v) do
3:         Delete edge (u, v) from C
4:         if dC(u) < k then
5:             RDelete2(C, u, I, i)
6:
7:     Delete v from C
8:     I(i).add(v)


Algorithm 9 MCC with alive array (MCC2)
1: procedure MCC2(Ck(G), v, alive)
2:     cc ← ∅
3:     MCC-DFS2(Ck(G), v, alive, cc)
4:     return cc
5:
6: procedure MCC-DFS2(Ck(G), v, alive, cc)
7:     cc.add(v)
8:     for all u ∈ NCk(G)(v) do
9:         if alive[u] = true & u ∉ cc then
10:            MCC-DFS2(Ck(G), u, alive, cc)

As with C1, the input of C2 includes a graph G, a weight for each vertex, k, and r. The BZ algorithm is used to compute the core number of each vertex in G, and Ck(G) denotes the maximal k-core subgraph of G. C is the subgraph remaining after the RDelete2 procedures, and counter i counts how many iterations are required to delete the whole graph. Furthermore, I, the data structure that saves the lists of deleted nodes, is initialized to empty. In the first while loop, we choose the minimum-weight vertex in C and initialize an empty list for the corresponding iteration in I, which we call I(i). Then RDelete2, a modified RDelete procedure that takes two extra parameters, I and i, is called. In RDelete2, we add all vertices deleted during the recursive deletion to the list I(i).

A flat array "alive" is used to represent the deletion state of each node in G; it is initialized to all 0, which means all nodes are deleted. A for loop goes down from the total iteration number i to i − r + 1, so there are r iterations in the for loop. In each iteration, we poll the last list of deleted nodes from I and set these vertices to 1 in "alive", which means they have not been deleted. Then MCC2 is called on the maximal k-core Ck(G) of G, consulting the array alive. Only vertices that have not been removed are considered when computing maximally connected components. Once we find an MCC, we output it directly.
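A minimal sketch of this output phase is given below, assuming an adjacency-list array for the maximal k-core and the deletion history built in the first phase; all names are ours, and the DFS is written iteratively rather than recursively.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

public class C2OutputPhase {
    // adj: adjacency lists of the maximal k-core Ck(G); history: the iteration-delete-history I
    public static void outputTopR(int[][] adj, ArrayDeque<List<Integer>> history, int r) {
        boolean[] alive = new boolean[adj.length];          // all false: everything deleted
        for (int j = 0; j < r && !history.isEmpty(); j++) {
            List<Integer> deleted = history.pollLast();     // last iteration first
            for (int v : deleted) alive[v] = true;          // restore this iteration's vertices
            List<Integer> community = mcc2(adj, deleted.get(0), alive);
            System.out.println("community " + (j + 1) + ": " + community);
        }
    }

    // MCC2: DFS restricted to vertices whose alive flag is set
    private static List<Integer> mcc2(int[][] adj, int start, boolean[] alive) {
        List<Integer> cc = new ArrayList<>();
        boolean[] visited = new boolean[adj.length];
        ArrayDeque<Integer> stack = new ArrayDeque<>();
        stack.push(start);
        visited[start] = true;
        while (!stack.isEmpty()) {
            int v = stack.pop();
            cc.add(v);
            for (int u : adj[v]) {
                if (alive[u] && !visited[u]) { visited[u] = true; stack.push(u); }
            }
        }
        return cc;
    }
}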

Compared with C1, results of C2 are in reverse order, i.e. Hk,r, ..., Hk,1. Based on previous discussions, we can state that

Theorem 2. Algorithm C2 correctly computes all the top-r CCI communities of a given graph G.


The time complexity of C2 is the same as that of C1, O(m · r). However, in terms of constant factors, C2 is almost twice as fast as C1, because we run the RDelete2 procedure only once in C2, and array operations, such as insertion and polling, take only constant time.

For space complexity, structure I takes O(n) space, and the graph takes O(m) space. Since n is always much smaller than m in undirected graphs, O(n) can be absorbed by O(m). Therefore, the space complexity of Algorithm C2 is O(m).

5.3 Algorithm NC1 for Problem 2

Similar to C0, NC0 has bottlenecks in time complexity and memory usage. The time complexity of NC0 is O(m · n), and the space complexity is O(m + n · r). It is clear that NC0 is also impractical for big graphs on a consumer-grade machine.

As described for NC0, the key to finding non-containing CCI communities is to check whether all vertices in the MCC of vertex v are deleted after recursively deleting vertex v and its neighbors in the RDelete procedure. This poses the question of how to easily obtain the lists of deleted nodes in advance. Based on our previous experience, we use the same iteration-delete-history structure as in C2 to store the nodes deleted in each iteration.

In addition, how to easily and accurately check whether the vertices in the MCC are equal to the vertices deleted in the RDelete procedure is another question. Since both RDelete and MCC use depth-first search starting from the same vertex, the minimum-weight vertex, to find connected components, their results should be the same. However, the RDelete procedure uses vertex degrees as a filter condition, so it may find a smaller connected component than the one found by procedure MCC. Thus, to save time, we can directly compare the number of vertices in the MCC with the number of deleted nodes, instead of comparing them node by node.

Based on the above discussion, algorithm NC1 is given in Algorithm 10.

The input of NC1 includes a graph G, a weight for each vertex, k, and r. The BZ algorithm is used to compute the core number of each vertex in G, and Ck(G) represents the maximal k-core subgraph of G. As in C1 and C2, C represents the subgraph remaining after the RDelete2 procedure, and a counter i counts the total number of iterations required. A hash-based structure I, which saves the lists of deleted nodes, is


Algorithm 10 Top-r non-containing CCI communities (NC1)
Input: G, w, k, r
Output: Hk,r, . . . , Hk,1
1: Ck(G) ← compute the maximal k-core of G
2: C ← Ck(G)
3: i ← 1
4: I ← ∅
5: while C ≠ ∅ do
6:     Let v be the minimum-weight vertex in C
7:     I(i) ← ∅
8:     RDelete2(C, v, I, i)
9:     i ← i + 1
10:
11: j ← 0
12: alive ← 0
13: while I ≠ ∅ & j < r do
14:    Ik ← I.poll()
15:    for all v ∈ Ik do
16:        alive[v] ← 1
17:    v ← Ik.first()
18:    H ← MCC2(Ck(G), v, alive)
19:    if H.size() = Ik.size() then
20:        Output H
21:        j ← j + 1

initialized to empty at first. In the first while loop, we choose the minimum-weight vertex in C and initialize an empty list for the corresponding iteration i in I, which we call I(i). Then, in RDelete2, we add all deleted vertices to the list I(i) during the recursion. Another counter, j, is used to count how many non-containing CCI communities have been found. As in C2, a flat array "alive", which represents the deletion states of the nodes, is initialized to all 0; e.g., alive[v] = 0 means node v is deleted. The second while loop takes I ≠ ∅ and j < r as conditions because it is possible that we cannot find r non-containing CCI communities even after going through the whole graph. In each iteration, we poll the last list of deleted nodes from I and assign it to Ik. The vertices in Ik are set to 1 in the array alive. Then MCC2 is called on the maximal k-core Ck(G) of graph G, consulting the array alive. Only those vertices that have not been deleted, i.e., alive[v] = 1, are considered when computing the maximally connected component. When the number of vertices in the MCC equals the number of vertices in Ik, all nodes in the current MCC of the minimum-weight vertex


v have been deleted by the RDelete2 procedure started from v. Then the current MCC is a non-containing CCI community, and we output it directly.

Since the MCC computation starts from the last Ck,i(G), the results of NC1 are in reverse order, i.e., Hk,r, ..., Hk,1. Based on the previous discussion, we can state that

Theorem 3. Algorithm NC1 correctly computes all the top-r non-containing CCI communities of a given graph G.

The time complexity of NC1 highly depends on the input graph. The number of non-containing CCI communities in the input graph influences how many MCC computations are run, because the second loop in NC1 keeps computing MCCs until it finds r non-containing CCI communities or the whole graph has been searched. Therefore, the best-case time complexity of NC1 is O(m · r), while the worst-case time complexity is O(m · n). We normally take the worst case, O(m · n), as the time complexity of NC1.

For space complexity, NC1 takes the same space as C2, O(m). Although the structure I and the graph G take O(n) and O(m) space respectively, O(n) can be absorbed by O(m) because n is much smaller than m in undirected graphs.

5.4 Algorithm NC2 for Problem 2

Although NC1 has lower space complexity than NC0, it still has the same worst-case time complexity as NC0, which is impractical for big graphs on a consumer-grade machine.

When we examine the deleted nodes in structure I, we observe that all the information needed for non-containing CCI communities is in fact already stored in I. Therefore, we propose to use I to find non-containing CCI communities instead of the MCC procedure.

The number of alive neighbors of a vertex v can be used to check whether there is a non-containing CCI community. Given a vertex v, we define its current degree as the number of alive neighbors of v. A flat array d is used to save the current degrees of the nodes. While a node v is alive, d(v) stores the current degree of v. When a node v is deleted, d(v) is no longer updated and keeps the degree at the time of deletion.

In some iteration i, when each node v in I(i) has current degree 0, i.e., d(v) = 0, all neighbors of vertex v have been removed. The nodes in this I(i) form the last standing community in a community containment chain; therefore, it is a non-containing CCI community.

Based on the previous discussion, we can state that if d(v) = 0 for each v ∈ I(i), then I(i) is a non-containing CCI community.

Take Figure 5.1 as an illustration. It takes 4 iterations to peel off the whole graph; the nodes deleted in these iterations are: I(1) = {1, 2, 8}, I(2) = {3}, I(3) = {4, 7, 11}, and I(4) = {5, 6, 9, 10}. Since nodes 4 and 9 still exist in C2,2, d(1) and d(8) are not equal to 0, so I(1) is not a non-containing CCI community. In C2,3, obtained after deleting I(2), d(3) = 3 because nodes 5, 6, and 9 still exist, so I(2) is not a non-containing CCI community either. In C2,4, nodes 4, 7, and 11 are deleted in I(3), and all of their neighbors are deleted as well, so I(3) is a non-containing CCI community. After deleting I(4), C2,5 becomes an empty graph. Since the current degrees of all nodes deleted in I(4) are 0, I(4) is also a non-containing CCI community.
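The check itself is a single pass over each deletion list. The fragment below is a minimal Java sketch of it (the names are ours), assuming d[v] holds the current degree recorded for v at the moment it was deleted.

import java.util.List;

public class NonContainingCheck {
    // I(i) is a non-containing CCI community exactly when every vertex in it
    // had current degree 0 when it was deleted
    static boolean isNonContaining(List<Integer> deletedInIteration, int[] d) {
        for (int v : deletedInIteration) {
            if (d[v] > 0) {
                // some neighbor of v survived this iteration, so the community found
                // here still contains a higher-influence community
                return false;
            }
        }
        return true;
    }
}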

Our algorithm NC2, which eliminates MCC computation, is given in Algorithm 11.

The input of NC2 includes a graph G, a weight for each vertex, k, and r. We first call the BZ algorithm to compute the coreness of each node in graph G, then remove vertices whose core numbers are less than k; the remainder of the graph forms the original maximal k-core Ck(G). C, a duplicate of Ck(G), represents Ck,i after the RDelete3 procedure in each iteration i.

For all vertices in C, we calculate the current degree of each node and store these values in an array d. In the while loop, the minimum-weight vertex is chosen and passed to the RDelete3 procedure. In RDelete3, the array d of current degrees is updated during the recursive deletion, and the deleted nodes are added to I(i). A flag isNC is used to check whether the current degrees of all nodes in I(i) are 0; the default value of isNC is true. If any node has current degree greater than 0, isNC is set to false. When isNC is true, we keep I(i) and increment i before jumping to the next iteration. When the flag isNC is false, which means the corresponding I(i) is not a non-containing CCI community, we set the current I(i) to empty, then jump to the next iteration without incrementing i.

After the while loop, all non-containing CCI communities have been saved in I, and a for loop is used to output the top-r results. Because the total number of non-containing CCI communities may be smaller than r, a conditional break is added to check whether I is empty. The for loop goes from 1 up to r; in each step we first check whether I is empty, then poll the last element of I and output it directly.


Algorithm 11 Top-r non-containing CCI communities (NC2)
Input: G, w, k, r
Output: Hk,r, . . . , Hk,1
1: Ck(G) ← compute the maximal k-core of G
2: C ← Ck(G)
3: for all vertices v of C do
4:     d[v] ← dC(v)
5:
6: i ← 1
7: I ← ∅
8: while C ≠ ∅ do
9:     Let v be the minimum-weight vertex in C
10:    I(i) ← ∅
11:    RDelete3(C, v, I, i)
12:    isNC ← true
13:    for all v ∈ I(i) do
14:        if d[v] > 0 then
15:            isNC ← false
16:    if isNC = false then
17:        I(i) ← ∅
18:        continue
19:    i ← i + 1
20:
21: for j = 1 upto r do
22:    if I is empty then
23:        break
24:    H ← I.pollLast()
25:    Output H

The results of NC2 are in reverse order, i.e., Hk,r, ..., Hk,1. Based on the above reasoning, we can state that

Theorem 4. Algorithm NC2 correctly computes all the top-r non-containing CCI communities of a given graph G.

Since r is much smaller than m, the time spent in the r iterations of the final for loop can be regarded as constant. The time complexity of NC2 is O(m) because it iterates over the graph only once and performs no MCC computation at all. Although NC2 uses the data structure I, which takes O(n) space, this is absorbed by the O(m) space taken by graph G. Therefore, the space complexity of NC2 is also O(m).


Algorithm 12 RDelete3
1: procedure RDelete3(C, v, I, i)
2:     Mark v
3:     for all u ∈ NC(v) do
4:         d[u] ← d[u] − 1
5:         if u is not marked and d[u] < k then
6:             RDelete3(C, u, I, i)
7:
8:     Delete v from C
9:     I(i).add(v)
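For concreteness, the following is a hedged Java sketch of RDelete3 over a plain adjacency-list representation; our actual implementation runs on WebGraph's compressed graphs, and the class and field names used here (adj, d, removed, marked) are assumptions of the sketch rather than names from the real code.

    import java.util.List;

    // Sketch of RDelete3: recursively peels v and every neighbor whose current
    // degree drops below k, recording all vertices deleted in this iteration.
    class RDelete3Sketch {
        int[][] adj;        // adjacency lists of the maximal k-core Ck(G)
        int[] d;            // current degrees inside C
        boolean[] removed;  // true once a vertex has been deleted from C
        boolean[] marked;   // prevents entering the recursion twice for a vertex
        int k;

        void rDelete3(int v, List<Integer> deletedInIteration) {
            marked[v] = true;
            for (int u : adj[v]) {
                if (removed[u]) continue;            // u is no longer in C
                d[u]--;                              // v is about to leave u's neighborhood
                if (!marked[u] && d[u] < k) {
                    rDelete3(u, deletedInIteration); // u cannot stay in the k-core
                }
            }
            removed[v] = true;                       // delete v from C
            deletedInIteration.add(v);               // v belongs to I(i)
        }
    }

The recursion depth can become large on big graphs, which is one reason a generous stack size is used in the experiments of Chapter 6.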

5.5 Conclusion

In this chapter, we introduced two new algorithms for Problem 1 and two new algorithms for Problem 2. Given an undirected graph G = (V, E) with n = |V| and m = |E|, k, and r, the algorithms are summarized as follows:

1. Algorithm C1 for Problem 1 reduces the number of maximally connected component (MCC) computations. The time complexity of C1 is O(m · r), and the space complexity is O(m).

2. Algorithm C2 for Problem 1 uses a hash-based structure to store deleted nodes. The time complexity of C2 is O(m · r), but in practice C2 is about twice as fast as C1. The space complexity of C2 is O(m).

3. Algorithm NC1 for Problem 2 uses the same hash-based structure as C2. The best-case time complexity of NC1 is O(m · r), while the worst-case time complexity is O(m · n). The space complexity of NC1 is O(m).

4. Algorithm NC2 for Problem 2 completely eliminates MCC computation. Both the time complexity and the space complexity of NC2 are O(m).


Chapter 6

Experimental Results

In this chapter, we performed an experimental analysis by evaluating the four algorithms introduced in the preceding chapter on several real-world graph datasets. The experiments have two parts: first, we compared the performance of the initial algorithms from [24], C0 and NC0, with our algorithms for Problems 1 and 2 on two small graph datasets; second, we conducted a broad evaluation of our four algorithms to find the best solutions for Problems 1 and 2 as proposed in Chapter 4.

Section 6.1 introduces the real-world graph datasets used in the tests. In Section 6.2, details of the equipment and implementation are presented. Section 6.3 presents the first part of the experiments, comparing the initial algorithms C0 and NC0 with our four new algorithms. Section 6.4 compares algorithms C1 and C2 for Problem 1. Finally, NC1 and NC2 for Problem 2 are compared in Section 6.5.

6.1 Datasets

We perform the experiments on the following 6 graph datasets:

• Astro Physics collaboration network (AstroPh)

• Slashdot social network obtained in November 2008 (Slashdot0811)

• Pokec online social network (Pokec)

• LiveJournal online social network (LiveJournal1)

• Results of a web crawl of the .uk domain in 2002 (uk-2002)

• Results of a web crawl of pages written in Arabic in 2005 (arabic-2005)

The first four datasets are available for download from the Stanford Network Analysis Platform (http://snap.stanford.edu/data/index.html). The last two datasets, uk-2002 and arabic-2005, can be obtained from the Laboratory of Web Algorithmics (http://law.di.unimi.it/datasets.php). Properties of the 6 datasets, namely the number of vertices and edges, the maximum degree and core number, and the average core number, are displayed in Table 6.1.

Dataset        n           m            dmax     kmax   kavg
AstroPh        18,771      198,050      504      56     13
Slashdot0811   77,360      469,180      2,539    54     6
Pokec          1,632,803   22,301,964   14,854   47     14
LiveJournal1   4,846,609   42,851,237   20,333   372    9
uk-2002        18,483,186  261,787,258  194,955  943    16
arabic-2005    22,743,881  553,903,073  575,628  3,247  28

Table 6.1: Properties of datasets ordered by m.

All datasets require preprocessing. First, directed graphs are converted to undirected graphs by removing self-loops (i.e., edges (v, v)), adding the inverse edge (u, v) for each edge (v, u) if it does not already exist, and deleting duplicate edges. Second, we compress the graphs from plain-text format to WebGraph format. Third, we assign random weights (Java double type), ranging from 0 to 100, to the vertices of each graph.
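To illustrate the last two steps, here is a minimal Java sketch based on the WebGraph API (Transform.filterArcs with Transform.NO_LOOPS, Transform.symmetrize, and BVGraph.store); the basename "example", the fixed random seed, and the exact layout of the weight file are assumptions of this sketch, not necessarily the pipeline used for our datasets.

    import it.unimi.dsi.webgraph.BVGraph;
    import it.unimi.dsi.webgraph.ImmutableGraph;
    import it.unimi.dsi.webgraph.Transform;

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.util.Random;

    public class Preprocess {
        public static void main(String[] args) throws Exception {
            // Load the raw (possibly directed) graph, drop self-loops, and symmetrize it.
            ImmutableGraph raw = ImmutableGraph.load("example-raw");
            ImmutableGraph undirected =
                    Transform.symmetrize(Transform.filterArcs(raw, Transform.NO_LOOPS));
            // Store the result in compressed BVGraph format (.graph/.offsets/.properties).
            BVGraph.store(undirected, "example");

            // Assign a random weight in [0, 100) to every vertex and write it to a side file.
            Random rnd = new Random(42);
            try (DataOutputStream out =
                    new DataOutputStream(new FileOutputStream("example.weight"))) {
                for (int v = 0; v < undirected.numNodes(); v++) {
                    out.writeDouble(rnd.nextDouble() * 100);
                }
            }
        }
    }

For the largest graphs, the offline variants of the Transform methods would likely be preferable, since they avoid materializing the transposed graph in memory.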

After preprocessing, each graph consists of four files, namely .graph, .offsets, .properties, and weight. Our programs take these four files and two parameters, k and r, as input; the timeout for each algorithm run was set to 1 hour. The ranges of k and r are given in Table 6.2.

Parameter   Range
k           2, 4, 8, 16, 32, 64, 128, 256, 512
r           10, 20, 40, 80, 160, 320

Table 6.2: Ranges of parameters k and r.


6.2 Equipment

All of our experiments are conducted on a consumer-grade laptop with a 2.40 GHz Intel Core i7 (4 cores) CPU and 16 GB of RAM, running Ubuntu 16.04 LTS.

We implemented all the algorithms in Java, using OpenJDK 1.8.0. In the experiments, we allow each Java program to use a maximum of 8 GB of heap space and 1 GB of stack space.
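As a small sanity check (our own addition, not part of the original setup description), the heap limit actually granted to the JVM can be printed from inside the program:

    // Print the maximum heap size visible to the running JVM (set via the -Xmx option).
    System.out.printf("max heap = %.1f GB%n",
            Runtime.getRuntime().maxMemory() / (1024.0 * 1024 * 1024));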

6.3 Testing initial algorithms

In the first part of our experiments, we compare the initial algorithm C0 with our counterparts C1 and C2 for Problem 1. For Problem 2, we compare the initial algorithm NC0 with the improved algorithms NC1 and NC2.

According to our analysis, C0 and NC0 are not practical for big graphs, so we use the two smallest datasets, AstroPh and Slashdot0811, for the first part of the experiments.

Parameter r is set to 40, and k ranges over 2, 4, 8, 16, and 32 when testing both problems.

6.3.1 Problem 1

Figures 6.1 and 6.2 show the results of the C0, C1, and C2 comparison.

Figure 6.1: Problem 1 - AstroPh (r=40)

Figure 6.2: Problem 1 - Slashdot0811 (r=40)

Sizes of subgraphs of two datasets when k is varied are displayed in Table 6.3.

k    AstroPh - n   Slashdot0811 - n
2    17,439        47,760
4    13,741        29,026
8    9,425         17,200
16   5,664         8,977
32   1,926         3,188

Table 6.3: Sizes of subgraphs of AstroPh and Slashdot0811 in different k

The charts clearly show that C1 and C2 outperform C0 by orders of magnitude on both datasets. Although the subgraphs are relatively small when k is large, C0 still takes much longer than C1 and C2, e.g. 16 seconds for C0, 0.275 seconds for C1, and 0.262 seconds for C2 when k = 32 on dataset Slashdot0811.

6.3.2 Problem 2

Figures 6.3 and 6.4 show the results of the NC0, NC1, and NC2 comparison.

Figure 6.3: Problem 2 - AstroPh (r=40)

Figure 6.4: Problem 2 - Slashdot0811 (r=40)

The charts show that NC2 outperforms NC0 by orders of magnitude, which is what we expected.

NC1 performs quite well for some values of k but has the same performance as NC0 most of the time. Based on our analysis in Section 5.3, the performance of NC1 is not stable; it highly depends on the structure of the input graph. When there are not enough non-containing CCI communities, which is common in small graphs, NC1 goes through the whole graph and computes an MCC for each minimum-weight vertex. In that case NC1 hits its worst-case time complexity of O(m · n), which is the same as NC0.

According to the results on the AstroPh dataset, we found only 4, 2, and 1, instead of 40, non-containing CCI communities when k equals 8, 16, and 32, respectively. This implies that NC1 goes through the whole graph but still cannot find enough non-containing CCI communities. The same holds for dataset Slashdot0811: there is only 1 non-containing CCI community when k equals 4, 8, 16, and 32, and there are 12 non-containing CCI communities when k = 2. Nevertheless, NC1 takes almost the same time as NC2 when k = 2 on AstroPh, which is expected because it found 40 non-containing CCI communities in the k = 2 subgraph.

6.3.3 Conclusion

Since C0 and NC0 are much slower than C1, C2, and NC2 even on small graphs, we exclude C0 and NC0 from further testing on large-scale graphs.

As for NC1, although its performance is not stable on small graphs with few non-containing CCI communities, it shows competitive results when graphs contain an adequate number of non-containing CCI communities, such as for k = 2, 4 on AstroPh. Based on our previous research, large-scale graphs commonly contain plenty of non-containing CCI communities, so we keep NC1 in the further testing.

6.4 Testing for Problem 1

As described in Chapter 5, algorithms C1 and C2 are designed for Problem 1. Next, we present test results and analysis on the datasets Pokec, LiveJournal1, uk-2002, and arabic-2005. Since AstroPh and Slashdot0811 are relatively small, we do not show their results.

Since there are two flexible parameters, k and r, our experiments are divided into two groups: first, we set r to a fixed value and choose k from the range given in Table 6.2; second, we fix k and pick r from the range given in Table 6.2.

6.4.1 Experiment 1 - fixed r

Figures 6.5, 6.6, 6.7, and 6.8 show the results for computing the top-r CCI communities when r = 40 and k is varied, on the four datasets.


Figure 6.5: Problem 1 - Pokec (r=40)

Figure 6.6: Problem 1 - LiveJournal1 (r=40)

Figure 6.7: Problem 1 - uk-2002 (r=40)

Figure 6.8: Problem 1 - arabic-2005 (r=40)

It is clear that C2 outperforms C1 on all datasets for most values of k, given r = 40. This is expected because, although the time complexities of C1 and C2 are both O(m · r), C2 makes only one pass over the graph while C1 makes two. However, when k is big, which implies relatively small subgraphs, C1 and C2 take almost the same time, e.g. 1.027 seconds for C1 and 0.912 seconds for C2 when k = 512 on dataset uk-2002. The reason is that when a subgraph is small, passing over it once or twice makes little difference, so we see similar running times for C1 and C2 on subgraphs with large k.

k     n
2     20,193,090
4     17,383,395
8     14,939,880
16    11,376,436
32    5,825,037
64    1,829,190
128   459,938
256   194,957
512   79,363

Table 6.4: Sizes of subgraphs of arabic-2005 in different k

The charts also show that the runtimes of C1 and C2 both decrease when k increases. The reason is that the subgraph Ck becomes smaller as k gets bigger, and iterations running on smaller graphs take less time. As an example, Table 6.4 shows the sizes of the subgraphs of dataset arabic-2005 for different k. Based on Figure 6.8 and Table 6.4, it is clear that the runtime of C1 and C2 is positively correlated with the size of the subgraph.

Similarly, when the dataset size increases from roughly 1.6 million nodes (Pokec) to roughly 23 million nodes (arabic-2005), the runtime of C2 increases from 5 seconds to almost 100 seconds when k = 2.

6.4.2 Experiment 2 - fixed k

Figures 6.9, 6.10, 6.11, and 6.12 show the results for computing the top-r CCI communities when k = 16 and r is varied, on the four datasets Pokec, LiveJournal1, uk-2002, and arabic-2005.

When k is fixed, the subgraph of a given dataset is the same for all values of r. Table 6.5 shows the size of each subgraph.

Dataset        n
Pokec          639,563
LiveJournal1   853,108
uk-2002        6,629,011
arabic-2005    11,376,436

Table 6.5: Sizes of subgraphs when k = 16

Figure 6.9: Problem 1 - Pokec (k=16)

Figure 6.10: Problem 1 - LiveJournal1 (k=16)

Figure 6.11: Problem 1 - uk-2002 (k=16)

Figure 6.12: Problem 1 - arabic-2005 (k=16)

The charts clearly show that C2 outperforms C1 for all values of r on all four datasets when k = 16. More precisely, we found that the runtime of C1 is generally almost double that of C2. This is expected according to our analysis, because C1 goes through the whole graph twice while C2 makes only one pass.

The behavior of C1 and C2 as r varies also depends on the graph structure. On datasets LiveJournal1, uk-2002, and arabic-2005, the runtimes of C1 and C2 are quite stable as r increases from 10 to 320. However, the runtime of both algorithms increases greatly on dataset Pokec as r becomes larger. To find the reason for this interesting result, we conducted a further analysis: we ran multiple tests to check the sizes of the top 320 CCI communities found in datasets Pokec, LiveJournal1, and uk-2002. The results are given in Figure 6.13.

Figure 6.13: Sizes of top 320 CCI communities

As shown in the chart, the sizes of the top 320 CCI communities found in LiveJournal1 and uk-2002 are quite stable, around 20 to 100 vertices per CCI community. In contrast, the sizes of the top 320 CCI communities found in Pokec grow greatly, from 70 to 5,000 vertices, which indicates that the time spent on MCC computation for Pokec should be much longer than on the other graphs. According to Table 6.1, the maximum core number of dataset Pokec is only 47 while the average core number is relatively high, 14. Compared with the other datasets, the Pokec network shows high cohesiveness, which generally produces large communities.


6.4.3 Conclusion

Based on the above experiments with varied k and r, we draw the following conclusions:

1. The runtime of C1 and C2 increases with the size of the dataset.

2. When the subgraph is relatively small (k is big), the performance of C1 and C2 is similar.

3. When the subgraph is relatively big (k is small), C2 outperforms C1 by almost a factor of two.

4. Generally, the runtimes of C1 and C2 are not sensitive to r on the same graph. However, when a graph has high cohesiveness, the runtime of C1 and C2 increases as r becomes larger.

6.5 Testing for Problem 2

For Problem 2, we compare algorithms NC1 and NC2 in this section. Below, we present test results and analysis on the datasets Slashdot0811, Pokec, LiveJournal1, uk-2002, and arabic-2005.

As in the testing for Problem 1, there are two flexible parameters, k and r. Therefore, our experiments are divided into two groups: first, we set r to a fixed value and choose k from the range given in Table 6.2; second, we fix k and pick r from the range given in Table 6.2.

6.5.1 Experiment 1 - fixed r

Figures 6.14, 6.15, 6.16, and 6.17 show the results for computing the top-r CCI communities when r = 40 and k ranges from 2 to 512, on the four datasets. Missing points in the charts mean that the runtime exceeded the time limit (1 hour).


Figure 6.14: Problem 2 - Pokec (r=40)

Figure 6.15: Problem 2 - LiveJournal1 (r=40)
