On graph sample clustering


Citation for published version (APA):

Zhang, J. (2018). On graph sample clustering. Technische Universiteit Eindhoven.

Document status and date: Published: 29/10/2018

Document Version:

Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)


On Graph Sample Clustering

DISSERTATION

submitted to obtain the degree of doctor at Eindhoven University of Technology (Technische Universiteit Eindhoven), on the authority of the Rector Magnificus, prof.dr. F.P.T. Baaijens, to be defended in public before a committee appointed by the Doctorate Board on Monday 29 October 2018 at 16:00

by

Jianpeng Zhang

The doctoral committee is composed as follows:

chair: prof.dr. J.J. Lukkien
1st promotor: prof.dr. M. Pechenizkiy
copromotor: dr. G.H.L. Fletcher
members: prof.dr. T.G.K. Calders (University of Antwerp)
dr. J. Gama (University of Porto)
prof.dr. P.M.E. De Bra
prof.dr. G.W.M. Rauterberg
dr. W. Duivesteijn

The research or design described in this dissertation has been carried out in accordance with the TU/e Code of Scientific Conduct.


A catalogue record is available from the Eindhoven University of Technology Library ISBN: 978-90-386-4595-7

The work in this thesis has been funded by The Netherlands Organisation for Scientific Research (NWO).

SIKS Dissertation Series No. 2018-17

The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems.


Summary

The graph is one of the most fundamental data structures for representing entities and their relationships in many fields, such as social networks, email-communication networks and biological protein networks. Graph clustering, a fundamental method for analyzing networks, examines the relations among nodes by grouping them into clusters with dense intra-cluster and sparse inter-cluster connectivity. Such analysis can provide valuable structural information about graphs and can serve as an important decision-support tool in many applications, such as target group identification. However, with the advent of “big” graphs in various fields, the major challenge is that in practice an increasing number of graphs are large-scale and continuously growing. Hence, modeling and analyzing such data in their entirety is becoming infeasible and impractical. These contemporary graphs may contain an overwhelming number of nodes and edges and have obvious “big-data” characteristics (e.g., complex inherent structure, massive size and/or arrival in a streaming fashion). How should one perform cluster analysis on these big graphs? One direction is to design scalable (possibly parallel) clustering algorithms. This requires either large working memory (e.g., most in-memory algorithms need to iterate over all nodes in memory) or powerful parallel processors (e.g., for parallel implementations). However, such computing resources are not easily accessible in all application scenarios. Another feasible solution, which has received more attention, is to take a small sample of the graph and perform the traditional analysis on the sampled subgraph. The inherent clustering structure of the original graph (i.e., the population) can thus be captured by analyzing a sampled counterpart that is much smaller than the original.

In this thesis, we focus on the latter approach due to its simplicity and efficiency. For brevity, we refer to the sampling-based methodology as a sample clustering process. Motivated by the need for a better understanding of the sample clustering process on the original graph, we study each ingredient (i.e., sampling, clustering and evaluation) of the overall analysis process. Our main contributions are as follows:

(1) We outline a general framework for evaluating the sample clustering process and propose several new quality metrics based on the set-matching methodology to quantitatively evaluate how well the clusters of the sampled graph correspond to the ground-truth clusters in the original one. To the best of our knowledge, it is the first framework for evaluating clustering quality that integrates sampling with clustering. It is capable of evaluating the sample clustering process in different scenarios. The two most common scenarios are considered in this evaluation framework: (i) static graphs, where nodes and edges can be accessed randomly, and (ii) streaming graphs, where nodes and edges are accessed sequentially.

(2) Regarding static graphs, we focus on designing each ingredient (i.e., sampling and clustering) of a sample clustering process on static graphs at scale.

• For the sampling part, we propose two novel top-leader sampling algorithms, which apply equal-size expansion (TLS-e) and internal-priority expansion (TLS-i), respectively. Both produce representative samples while retaining the clustering structure. Empirical studies show that the proposed sampling methods preserve the clustering structures of the original graphs well and outperform other sampling algorithms in retaining cluster-related properties.

• For the clustering part, we present a new two-stage clustering inference (TCI) method to infer nodes’ clustering affiliations in the original graph. TCI is composed of two stages: 1) initialization of clustering affiliations for unsampled nodes based on computed neighborhood affiliation information; 2) label propagation over the whole graph. It provides an effective way to infer the clustering affiliations of nodes in the population.

(3) Analogously, for streaming graphs, the streaming versions of the sampling and clustering algorithms are required to be computationally efficient (preferably single-pass) under limited memory. Therefore:

• For the sampling part, we put forward a Cluster-preserving Partially Induced Edge Sampling (CPIES) algorithm that dynamically maintains a representative sampled subgraph and retains the clustering structure of streaming graphs at any time. Empirical experiments demonstrate that the proposed CPIES algorithm preserves the inherent clustering structure while sampling from fully dynamic streaming graphs, and that it performs better than competing algorithms on most structural properties, especially in terms of clustering quality.

• For the clustering part, we design a bounded-size clustering (BSC) algorithm to handle edge additions and deletions while the data is streaming. It requires each cluster to satisfy a maximum cluster-size constraint. It also maintains the recency of edges in the temporal sequence and gives high priority to the recent edges in each cluster. Experimental results show that the proposed BSC algorithm outperforms conventional online algorithms and is able to keep track of the cluster evolution of graphs. Furthermore, it obtains almost one order of magnitude higher throughput than the state-of-the-art algorithms.
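The two stages named for TCI in the list above can be illustrated with a minimal Python sketch. It is not the thesis's TCI algorithm: the function name `two_stage_inference`, the dictionary-based graph representation, and the use of a plain majority vote as the "neighborhood affiliation information" are all assumptions made for illustration.

```python
from collections import Counter


def two_stage_inference(adj, sampled_labels, max_rounds=20):
    """Hypothetical sketch of a two-stage affiliation inference.

    adj: dict mapping each node of the original graph to its set of neighbours.
    sampled_labels: dict mapping sampled nodes to integer cluster ids.
    """
    labels = dict(sampled_labels)

    def majority(node):
        # Most frequent label among labelled neighbours (ties: smallest id).
        votes = Counter(labels[n] for n in adj[node] if n in labels)
        if not votes:
            return None, votes
        return min(votes, key=lambda c: (-votes[c], c)), votes

    # Stage 1 (assumed rule): initialise unsampled nodes from the majority
    # label of their labelled neighbours until nothing new can be labelled.
    changed = True
    while changed:
        changed = False
        for node in sorted(adj):
            if node not in labels:
                best, _ = majority(node)
                if best is not None:
                    labels[node] = best
                    changed = True

    # Stage 2: label propagation over the whole graph; a node switches only
    # if another label is strictly more frequent among its neighbours.
    for _ in range(max_rounds):
        updated = False
        for node in sorted(adj):
            best, votes = majority(node)
            if best is not None and votes.get(labels.get(node), 0) < votes[best]:
                labels[node] = best
                updated = True
        if not updated:
            break
    return labels


# Two triangles joined by the edge 2-3; nodes 2 and 3 were not sampled.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
inferred = two_stage_inference(adj, {0: 0, 1: 0, 4: 1, 5: 1})
print(inferred)
```

In this toy graph each unsampled node joins the triangle it belongs to, because two of its three neighbours carry that triangle's label.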

We conclude by discussing how well the sample clustering process works in these scenarios, and we validate the superiority of the proposed sampling and clustering algorithms. The insights obtained in these investigations are of value to the community as a basis for understanding the merits of the sample clustering process over “big” graphs with various characteristics.


Part I. Evaluation of the Sample Clustering Process

4. Evaluation Framework of the Sample Clustering Process
4.1. Motivation of the Evaluation Framework
4.2. Problem Statement
4.3. Main Issues When Evaluating the Clustering Quality on Graph Samples
4.3.1. Multiple Ground-truths in Graphs
4.3.2. Validity of Clustering Evaluation on Graph Samples
4.4. Quality Metrics based on the Set-matching Methodology
4.4.1. Cover Definition
4.4.2. δ-precision and δ-recall
4.4.3. δ-precision-ground-truth and δ-recall-ground-truth
4.4.4. Additional Statistical Metrics
4.4.5. Illustration for Quality Metrics
4.5. Evaluation Framework
4.6. Chapter Summary

5. Empirical Study of the Evaluation Framework
5.1. Motivation of the Empirical Study
5.2. Benchmark Graphs
5.3. Experimental Setup
5.3.1. Algorithm Selection
5.3.2. Metrics Selection
5.4. Experimental Study
5.4.1. Metric Competitiveness Analysis
5.4.2. Evaluation Scenario on Real Graphs with Single Ground-truth
5.4.3. Evaluation Scenario on Real Graphs with Multiple Ground-truths
5.4.4. Parameter Sensitivity Analysis

Part II. Static Graphs

6. Cluster-preserving Sampling from Static Graphs at Scale
6.1. Motivation of Sampling from Static Graphs
6.2. The State-of-the-Art
6.3. Definitions & Problem Statement
6.3.1. Local Definitions
6.3.2. Problem Statement
6.4. Proposed Sampling Method
6.4.1. Initialization of Top-leaders
6.4.2. Sampling Expansion Strategies
6.5. Time Complexity
6.6. Experimental Evaluation
6.6.1. Experimental Setup
6.6.2. Benchmark Graphs
6.6.3. Evaluation Methodology & Measurements
6.6.4. Overall Experimental Results
6.6.5. The Impact of Cluster Sizes
6.6.6. The Impact of Sampling Rates

7. Clustering Affiliation Inference of the Population
7.1. Motivation of Population Inference
7.3. The State-of-the-Art
7.4. Two-stage Clustering Inference (TCI)
7.5. Computational Complexity
7.6. Experimental Evaluation
7.6.1. Experimental Setup
7.6.2. Evaluation Methodology
7.6.3. Inferring Clustering Affiliations from Graph Samples
7.6.4. Comparison of Inference Methods

Part III. Streaming Graphs

8. Cluster-preserving Sampling from Streaming Graphs
8.1. Motivation of Sampling from Streaming Graphs
8.3. Proposed Sampling Method
8.3.1. The Basic PIES Algorithm
8.3.2. Cluster-preserving Node Replacement
8.3.3. Isolated Nodes Elimination
8.3.4. Edge-deleting Operation
8.3.5. Cluster-preserving Partially Induced Edge Sampling (CPIES)
8.4. Experimental Evaluation
8.4.1. Experimental Setup
8.4.2. Benchmark Graphs
8.4.3. Evaluation Measures
8.4.4. Experimental Results
8.5. Chapter Summary

9. Online Clustering on Streaming Graphs
9.1. Motivation of Online Clustering
9.3. Online Clustering
9.3.1. Streaming Clustering Model
9.3.2. New Component Description
9.3.3. Bounded-Size Clustering Algorithm
9.3.4. Complexity Analysis of BSC
9.4. Experiments
9.4.1. Experimental Setup
9.4.2. Benchmark Graphs
9.4.3. Quality Experiments
9.4.4. Throughput Experiments
9.4.5. Importance of Temporal Priority
9.4.6. Impact of Various Sampling Strategies

10. Research Summary & Future Work
10.1. Research Summary
10.2. Future Work
10.2.1. Evaluation Framework
10.2.2. Sampling
10.2.3. Clustering

Bibliography

Appendices
A. The Quality Metrics
A.1. Supervised Quality Metrics
A.2. Unsupervised Quality Metrics
B. Sampling Strategies
C. Clustering Algorithms
D. The Rest of Results on Real-world Graphs
D.1. Results of Cluster-match-rank-plots
D.2. Results of the Clustering Quality
E. The Rest of Results on LFR Synthetic Graphs

Acknowledgements

Curriculum Vitae


List of Figures

1.1. Real-world graphs from different domains.
1.2. Schematic diagram of the sample clustering process.

4.1. Problem setting.
4.2. Map for graphs with different ground-truth situations.
4.3. A bad case for the basic cover definitions.

5.1. Clustering quality of the selected clustering algorithms on the sampled LFR benchmarks (µ=0.40) using induced random vertex sampling (sample rate p=0.20).
5.2. Clustering quality of the selected clustering algorithms on the sampled LFR graphs (µ=0.40) using induced random edge sampling (sample rate p=0.20).
5.3. Overall results for the Football network using different combinations of sampling and clustering methods (sample rate p=0.50).
5.4. Overall results for the DBLP network using different combinations of sampling and clustering methods (sample rate p=0.15).
5.5. The impact of varying sample rate p on δ-precision and δ-recall on the DBLP networks using random walk sampling and Blondel clustering.
5.6. δ-precision, δ-recall for five clustering algorithms on Football and Polbooks networks at different values of the purity threshold δ.

6.1. The impact of cluster-size on quality metrics on the LFR benchmarks with various maximum cluster-sizes.
6.2. The impact of varying sample rate p on the quality metrics on the LiveJournal networks with Blondel clustering.
6.3. The impact of varying sample rate p on the quality metrics on the DBLP networks with Blondel clustering.

8.1. Quality comparison of different sampling algorithms on the Enron-employee network (p=0.2).
8.2. Quality comparison of different sampling algorithms on the E-mail-EU-core network (p=0.2).
8.3. The impact of varying sample rate p on the quality metrics on the Enron-employee network with Blondel clustering.
8.4. The impact of varying sample rate p on the quality metrics on the Reality network with Blondel clustering.
8.5. The impact of varying sample rate p on the quality metrics on the CollegeMsg network with Blondel clustering.
8.6. Distribution function at 20% sampling rate on the Enron-employee network.
8.7. Distribution function at 20% sampling rate on the E-mail-Eu-core network.
8.8. Distribution function at 20% sampling rate on the CollegeMsg network.
8.9. Distribution function at 20% sampling rate on the Facebook network.
8.10. Distribution function at 20% sampling rate on the Slash network.
8.11. Inferring clustering affiliation for real networks using the CPIES method at 20% sampling rate.

9.1. Main components of the streaming clustering model.
9.2. The streaming reservoir.
9.3. The up-tree union-find data structure in cluster manager.
9.4. Quality comparison of different clustering algorithms on synthetic graphs with different types of evolving events.
9.5. Average quality results for the four real-world networks by varying maximum cluster-size with a fixed window size of 5K.
9.6. Evaluation on the selected clustering algorithms on Enron-employee and Higgs-reply networks with a fixed window size of 5000 and the maximum cluster-size of 50.
9.7. The influence of temporal information on different clustering algorithms.
9.8. Clustering quality on the subgraph produced by various sampling methods on each real-world graph with the edge sample rate p_e=0.20.

S1. The cluster-match-rank-plot of various sampling algorithms (sample rate p=0.15) on the DBLP network when the clustering algorithm is fixed.
S2. The cluster-match-rank-plot of various sampling algorithms (sample rate
S3. The cluster-match-rank-plot of various sampling algorithms (sample rate p=0.15) on the LiveJournal network when the clustering algorithm is fixed.
S4. The cluster-match-rank-plot of various sampling algorithms (sample rate p=0.15) on the Friendster network when the clustering algorithm is fixed.
S5. The cluster-match-rank-plot of various sampling algorithms (sample rate p=0.15) on the Orkut network when the clustering algorithm is fixed.
S6. Overall results for the Football network using different combinations of sampling and clustering methods (sample rate p=0.50).
S7. Overall results for the Karate network using different combinations of sampling and clustering methods (sample rate p=0.50).
S8. Overall results for the Polblogs network using different combinations of sampling and clustering methods (sample rate p=0.50).
S9. Overall results for the Dolphins network using different combinations of sampling and clustering methods (sample rate p=0.50).
S10. Overall results for the Polbooks network using different combinations of sampling and clustering methods (sample rate p=0.50).
S11. Overall results for the DBLP network using different combinations of sampling and clustering methods (sample rate p=0.15).
S12. Overall results for the LiveJournal network using different combinations of sampling and clustering methods (sample rate p=0.15).
S13. Overall results for the Youtube network using different combinations of sampling and clustering methods (sample rate p=0.15).
S14. Overall results for the Friendster network using different combinations of sampling and clustering methods (sample rate p=0.15).
S15. Overall results for the Orkut network using different combinations of sampling and clustering methods (sample rate p=0.15).
S16. Clustering quality of the selected clustering algorithms on the sampled LFR benchmarks (µ=0.40) using random walk sampling (sample rate p=0.20).
S17. Clustering quality of the selected clustering algorithms on the sampled LFR benchmarks (µ=0.40) using metropolized random walk sampling (sample rate p=0.20).
S18. Clustering quality of the selected clustering algorithms on the sampled LFR benchmarks (µ=0.40) using metropolis subgraph sampling (sample rate p=0.20).
S19. Clustering quality of the selected clustering algorithms on the sampled LFR benchmarks (µ=0.50) using induced random vertex sampling (sample rate p=0.20).
S20. Clustering quality of the selected clustering algorithms on the sampled LFR benchmarks (µ=0.50) using induced random edge sampling (sample rate p=0.20).
S21. Clustering quality of the selected clustering algorithms on the sampled LFR benchmarks (µ=0.50) using random walk sampling (sample rate p=0.20).
S22. Clustering quality of the selected clustering algorithms on the sampled LFR benchmarks (µ=0.50) using metropolized random walk sampling (sample rate p=0.20).
S23. Clustering quality of the selected clustering algorithms on the sampled LFR benchmarks (µ=0.50) using metropolis subgraph sampling (sample rate p=0.20).


List of Tables

2.1. Common notation & definitions.
2.2. Summary of real-world graphs used in the experiments.
2.3. The statistics of streaming graphs.

4.1. The results of δ-precision (δ-P), δ-recall (δ-R), ANC and NLS for the tiny graph (δ=1.0).

5.1. The parameters of LFR benchmarks.
5.2. Average δ-precision (δ-P), δ-recall (δ-R) scores for selected FB100 dataset.

6.1. Default parameters of LFR benchmarks.
6.2. Statistics of synthetic graphs used in the experiments.
6.3. The sample qualities of various sampling strategies on small graphs (p=50%).
6.4. The sample qualities of various sampling strategies on large-scale graphs (p=15%).

7.1. Inferring clustering affiliation for LFR benchmarks using different sampling algorithms at 20% sampling rate.
7.2. Clustering metrics by each inference method on real-world graphs using the TLS-i algorithm.

8.1. Graph statistics.
8.2. The sample qualities of various sampling strategies on synthetic graphs (p=20%).
8.3. The sample qualities of various sampling strategies on real-world graphs (p=20%).

9.1. The complexity analysis of the BSC algorithm. Note that α(·) is the inverse Ackermann function, where α(n) < 5 for any practical value of n.
9.2. Performance experiments for different query/update ratios with a fixed window size of 5000 and a maximum cluster-size of 50.


1. Introduction

A graph, as a generic data structure, is a good representation of the complex relationships (namely, edges) among the entities (namely, nodes) of a network [FH16]. In a social network, for instance, nodes may represent people and links may represent relationships (e.g., friendships), interactions (e.g., e-mails transmitted, physical proximity), or homogeneity (e.g., similar books purchased). Similarly, in biological systems, nodes may represent neurons or proteins and links may represent neuronal connections or protein interactions. Figure 1.1 shows real-world graphs from various domains. Nowadays, “big” graphs are becoming ubiquitous in a wide range of applications. For example, in the Twitter network, the average number of monthly active users increased to 328 million in the first quarter of 2017, close to 9 million more than in the preceding quarter. Analyzing such big graphs has become of great importance in various applications (e.g., detecting ransomware attacks and abnormal connections in social media). As technologies for generating and storing data continue to improve and proliferate, the size of graphs will continue to grow explosively in the near future. To understand the structures of real-world graphs in various scenarios, effective and efficient graph mining algorithms (e.g., sampling, clustering and classification, among others) are in great demand.

Within the graph mining community, graph clustering is one of the most important research topics for exploring the inherent structures of real-world graphs. It can extract valuable structural information from graphs and provide important decision support in several applications. However, with the explosive growth of big graphs in various fields (e.g., online social networks, scientific collaboration networks and biological networks), different types of graphs are formed and they exhibit obvious “big-data” characteristics (e.g., complex inherent structure, massive size and/or arrival in a streaming fashion). These characteristics of contemporary graph-structured data collections make the relationships among the entities of graphs more complex and unpredictable. Classic graph mining approaches therefore face more challenges and are not capable of processing these big graphs. In particular, the major challenge is that an increasing number of graphs are large-scale and/or continuously growing. Hence, modeling and analyzing such data in their entirety is becoming infeasible and impractical, and there is a great need for new solutions to these problems.

Figure 1.1.: Real-world graphs from different domains. (a) E-mail communication network [JMBO01]. (b) Author coauthor network [BB11]. (c) Protein-protein interaction network [JMBO01]. (d) Human mobility network [TTG+10].

1.1. What is the Sample Clustering Process?

Since contemporary graphs may contain an overwhelming number of nodes and edges and have obvious “big-data” characteristics, processing these big graphs effectively is critical for data scientists. One direction is to design scalable (possibly parallel) clustering algorithms to analyze them. This requires either large working memory (e.g., most in-memory algorithms need to iterate over all nodes in memory) or powerful parallel processors (e.g., for parallel implementations). However, such computing resources are not easily accessible in all application scenarios.


Another feasible solution, which has received more attention, is to take a small sample of the graph and perform the traditional analysis on the sampled subgraph. By sampling a representative subgraph, clustering analysis can be performed on the sampled graph, instead of the original population, to detect the inherent clustering structure. In this thesis we focus on the latter approach due to its simplicity and efficiency. For the sake of brevity, we refer to the sampling-based methodology as a sample clustering process. All of these factors motivate the need for a more refined and complete understanding of the sample clustering process on the original graph (i.e., the population). The schematic diagram of the entire process is shown in Fig. 1.2. In this thesis, we first outline a general framework to evaluate the entire sample clustering process. Second, we study each ingredient (i.e., sampling and clustering) of the entire process in different scenarios (e.g., static graphs and streaming graphs) and construct the corresponding computational models. Specifically, we investigate methods of sampling and clustering across two computational models. One is the less constrained model of sampling and clustering from static graphs. The other is the more difficult and most constrained model of sampling and clustering from streaming graphs¹.

[Diagram labels: G, S, Π(S), Π(G); seeking & defining ground-truth(s) g; sampling strategy Λ; clustering process P; quantitative evaluation.]

Figure 1.2.: Schematic diagram of the sample clustering process. Let S be a sampled subgraph of a graph G and π(G) ∈ Π(G) be a valid ground-truth clustering of G (i.e., a potential clustering scheme of Π(G)). Given a clustering π(S) ∈ Π(S) of S induced by the clustering method P, our aim is to design each ingredient (i.e., sampling Λ and clustering P) and evaluate the overall performance of the entire process.
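The process in Figure 1.2 can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions: induced random vertex sampling stands in for the sampling strategy Λ, connected components stand in for the clustering process P, and plain purity stands in for the quantitative evaluation. None of these is the specific method or metric developed in the thesis, and all function names are hypothetical.

```python
import random
from collections import Counter


def induced_vertex_sample(adj, rate, rng):
    """Stand-in for Λ: keep a random fraction of nodes and the edges among them."""
    nodes = sorted(adj)
    k = max(1, round(rate * len(nodes)))
    kept = set(rng.sample(nodes, k))
    return {u: adj[u] & kept for u in kept}


def connected_components(adj):
    """Stand-in for P: every connected component of the sample is one cluster."""
    label, cluster = {}, 0
    for start in sorted(adj):
        if start in label:
            continue
        label[start] = cluster
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in label:
                    label[v] = cluster
                    stack.append(v)
        cluster += 1
    return label


def purity(found, truth):
    """Share of sampled nodes that carry the majority ground-truth label of their cluster."""
    members = {}
    for node, c in found.items():
        members.setdefault(c, []).append(truth[node])
    agree = sum(Counter(m).most_common(1)[0][1] for m in members.values())
    return agree / len(found)


# G: two disjoint 4-cliques; the ground truth assigns one cluster per clique.
adj = {u: {v for v in range(4) if v != u} for u in range(4)}
adj.update({u: {v for v in range(4, 8) if v != u} for u in range(4, 8)})
truth = {u: 0 if u < 4 else 1 for u in adj}

sample = induced_vertex_sample(adj, rate=0.5, rng=random.Random(0))  # S
found = connected_components(sample)                                 # clustering of S
print(purity(found, truth))  # 1.0 here, since the two cliques share no edges
```

On real graphs the sample loses edges as well as nodes, so the clusters found on S can split or merge relative to the ground truth; quantifying that effect is what the evaluation step is for.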


1.2. Research Questions

Classic graph mining techniques are mostly designed for graphs that fit into main memory. However, this assumption is not always realistic for real-world domains (e.g., online social networks). When a graph is too large, or grows too quickly, to fit in memory, it must either be stored on secondary storage devices (e.g., disks), where random accesses incur large I/O costs, or be kept in transit within a High Performance Computing Cluster. Many graph mining algorithms fail to process such big graphs appropriately, and many approaches have been investigated and implemented to address this need. The ultimate question is therefore:

Q0: Is it feasible to discover the clustering structure for “big” graphs in contemporary analysis scenarios (either full-access from the storage device or sequential-access in a streaming fashion)?

In this thesis, we use the sample clustering process to answer this broad question. We present our solutions by designing an evaluation framework for the sample clustering process and devising new sampling and clustering algorithms under two different computational models (the static and the streaming graph model). We hope that readers will find inspiration in this work. Specifically, in this thesis we would like to answer the following foundational questions.

Evaluation of the sample clustering process. The first underlying question is how to evaluate the clustering quality of the sample clustering process. Little attention has been paid to this question, although it is important not only for measuring the validity of sample clustering solutions, but also for gaining insight into the structure distribution in the sampled graph. Many questions remain open on how to design an appropriate evaluation framework that yields credible and comparable results. We are therefore strongly motivated to develop a standard evaluation (benchmarking) framework to measure the clustering quality on graph samples. This leads to our first research question:

Q1: How can we evaluate the clustering quality of the sample clustering process? In other words, how do we measure success?

In this thesis we present a general evaluation framework for clustering quality that integrates sampling with clustering. It is capable of evaluating the sample clustering process from different aspects. In Part I, we first present this evaluation framework and then empirically demonstrate its feasibility and effectiveness on benchmark graphs.

Graph Sampling. Graph sampling is an essential elementary procedure in the sample clustering process. When working with big graphs, the algorithm is only part of the story; the other part is the data. In some cases, the graph may be relatively small and easy to study in its entirety (i.e., without sampling). For instance, it is fairly easy to study the full set of graduate students in a particular academic department. However, in many situations, the population is so massive and/or fully dynamic that it is difficult or costly to access in its entirety (e.g., the complete set of Internet users). Hence, for reasons of efficiency, a sample should be collected, and the characteristics of the original graph can then be estimated from the sampled counterpart. Formally, assume the graph G = (V, E) is given (either with full access from the storage device or with sequential access in a streaming fashion). The goal is then to produce a representative subgraph Gs = (Vs, Es) from the original graph G, where Vs ⊆ V and Es ⊆ E.
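The formal definition above can be made concrete with a short Python sketch. It uses one common reading of induced random edge sampling (pick edges uniformly at random, keep their endpoints as Vs, then keep every original edge among those endpoints); this reading, the function name `induced_edge_sample`, and the set-of-tuples representation of E are assumptions for illustration, not the thesis's definitions.

```python
import random


def induced_edge_sample(V, E, num_edges, seed=0):
    """Return (Vs, Es) with Vs ⊆ V and Es ⊆ E.

    V: set of nodes; E: set of (u, v) tuples with u < v.
    """
    rng = random.Random(seed)
    picked = rng.sample(sorted(E), num_edges)           # uniform edge sample
    Vs = {node for edge in picked for node in edge}     # endpoints of picked edges
    Es = {(u, v) for (u, v) in E if u in Vs and v in Vs}  # induction step
    return Vs, Es


# Two triangles joined by the edge (2, 3).
V = set(range(6))
E = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)}
Vs, Es = induced_edge_sample(V, E, num_edges=2)

# The sample is a subgraph of G by construction.
print(Vs <= V, Es <= E)  # True True
```

The induction step is what separates this from plain edge sampling: it restores original edges between sampled nodes that were not drawn themselves, which tends to keep more of the local density that clustering relies on.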

Traditionally, graph sampling has been widely studied on static graphs using a static computational model [ANK14]. This model assumes that the graph fits in main memory and that any node and its neighbors can be randomly accessed in constant time. A sampled graph is representative if it preserves selected properties of the original graph. In existing studies, simple node-level properties, such as the distributions of node degree, clustering coefficient and closeness centrality, are usually of interest to data scientists. An essential perspective that has been ignored is how sampling affects the inherent clustering structure, which is a prevalent property of real-world graphs. Thus, there is a foundational research question we should address:

Q2: How can we design an effective sampling algorithm on static graphs, to make the sample representative from the perspective of clustering structure?

In Part II-Chapter 6, we propose a new set of sampling methods on static graphs and evaluate the proposed methods against the state-of-the-art approaches. We then show that our proposed methods preserve the clustering structure of various graphs more accurately.


Although studying static graphs is indeed important, the assumption that the graph fits in main memory is not always realistic in real-world domains (e.g., an Internet user network). The static model ignores intrinsic complexities such as the time-evolving nature of these real-world graphs, which are more naturally represented as streaming graphs, i.e., rapid, continuous and ever-increasing edge streams. Since such graphs (e.g., Twitter posts and e-mail communication networks) are prevalent nowadays, an increasing number of studies focus on devising sampling algorithms that address the complexities of the streaming model. Streaming graphs differ from static graphs in three main aspects: (i) the structure/characteristics cannot be completely measured (e.g., only sequential access in a single pass is supported); (ii) the size of the streaming graph is open-ended, so it cannot fit into limited memory; and (iii) accurate and efficient processing is of great importance [ANK14]. Naturally, this raises the question:

Q3: How can we sample a representative subgraph from streaming graphs to preserve the clustering structure in an online fashion?

To address this issue, in Part III-Chapter 8, we propose a new cluster-preserving partially-induced-edge sampling (CPIES) algorithm that dynamically keeps an up-to-date sampled subgraph in an online fashion and preserves the inherent clustering structure in streaming graphs. The experiments demonstrate that our proposed algorithm is capable of preserving the cluster-related properties better than the competing algorithms.

The discussion in this section follows a natural progression of computational models for sampling, from static graphs to streaming ones. This scope not only provides insight into the complexity of the computational models (i.e., static vs. streaming), but also reflects the complexity of the algorithms devised for each scenario. Therefore, our goal in graph sampling is to devise cluster-preserving sampling algorithms for both scenarios and to evaluate their performance under the guidance of the evaluation framework.

Graph Clustering. In this section, we discuss another core part of the sample clustering process, i.e., clustering on the representative sampled subgraph. With the help of such a representative subgraph, many downstream tasks (in our case, we focus on clustering, which is one of the most important research topics in graph mining) can be executed without probing the original graph. As a result, the performance of the entire sample clustering process is greatly enhanced.


The aim of clustering is to group the nodes into clusters with dense intra-cluster and sparse inter-cluster connectivity. Traditional analysis of social networks mainly focuses on clustering in a static setting, i.e., we can fully access each node/edge by multiple passes [Kar93] [KK98]. Static graph clustering is a well-studied problem and numerous algorithms have been proposed in various domains. The reader can refer to the survey [SK12] for more detail. However, for the sample clustering process of static graphs, one underlying challenging problem exists in our studies:

Q4: How can we infer the clustering affiliation of all the unsampled nodes in the population?

For this purpose, in Part II-Chapter 7, we propose a new two-stage clustering inference (TCI) method to label the unsampled nodes of the original graph. To this end, we need to assign clustering labels to all the nodes inside and outside the sampled subgraph. That is, using a subgraph $G_s$, we attempt to infer the clustering affiliation of all the unsampled nodes $v$, where $v \in V - V_s$. Empirical experiments demonstrate that the proposed TCI method is able to infer clustering affiliations of nodes in the population from the sampled graph effectively and efficiently.
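
To make the inference task in Q4 concrete, the following minimal sketch spreads cluster labels from sampled nodes to unsampled ones by a plain majority vote over neighbours. It is only an illustration of the task, not the TCI method itself; the function name `propagate_labels`, the adjacency-dictionary input, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def propagate_labels(adjacency, sampled_labels, max_rounds=10):
    """Assign a cluster label to unsampled nodes by majority vote.

    adjacency: dict mapping each node to the set of its neighbours (undirected).
    sampled_labels: dict mapping each sampled node to its cluster label.
    Labels of sampled nodes stay fixed; an unlabeled node adopts the most
    common label among its already-labeled neighbours (hypothetical rule).
    """
    labels = dict(sampled_labels)
    for _ in range(max_rounds):
        updates = {}
        for node, neighbours in adjacency.items():
            if node in labels:
                continue
            votes = Counter(labels[n] for n in neighbours if n in labels)
            if votes:
                best = max(votes.values())
                # Ties are broken by the smallest label so runs are reproducible.
                updates[node] = min(l for l, c in votes.items() if c == best)
        if not updates:
            break  # nothing new could be labeled in this round
        labels.update(updates)
    return labels


# Two triangles joined by the edge (2, 3); one node per triangle is sampled.
toy_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
inferred = propagate_labels(toy_graph, {0: "a", 5: "b"})
```

Nodes that are not reachable from any sampled node remain unlabeled in this sketch, which is one reason a practical inference method needs more than a single propagation stage.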

As we noted for the challenges of graph sampling, most contemporary graphs in practice are formed in a streaming fashion. These streaming graphs might contain millions or even billions of nodes/edges and be fully dynamic, but most existing algorithms focus on clustering in an offline setting. They are not suitable for streaming graphs that involve additions and deletions of nodes/edges, because they must re-cluster the whole graph for each update, which leads to expensive computational costs. Furthermore, the newly obtained clusters may differ substantially from the preceding ones, whereas they are expected to evolve smoothly from the previous clusters. As a result, static clustering cannot react in a continuous way to smooth changes in the graph. Therefore, the last, but not least, challenging question is:

Q5: How can we identify and infer the clustering structure from streaming graphs effectively and efficiently?

To address this problem, a streaming version of the clustering algorithm is preferred for this kind of graph. Therefore, in Part III-Chapter 9, we present a bounded-size clustering (BSC) algorithm to handle edge additions and deletions in streaming graphs. It requires each cluster to satisfy the maximum cluster-size constraint and gives high priority to the recent edges in each cluster. Extensive experiments demonstrate that the proposed BSC algorithm is capable of maintaining high-quality clusterings in a single pass and of capturing evolving events (i.e., it is evolution-aware) while edge updates are performed on the fly.

By answering Q1 to Q5, we also partly answer the open-ended question Q0. In brief, our research procedure is as follows. First, a good starting point is to design a credible evaluation framework for the sample clustering process. Such an evaluation framework will guide our understanding and study of improving sampling strategies and sample clustering solutions. Then, by using the sample clustering process, a large collection of existing graph sampling and clustering techniques can be integrated to cope with big graphs, and we can give concrete answers as to whether a given sample clustering solution performs well or not. Finally, we address the defects of existing methods and derive new sampling and clustering algorithms for both computational models (i.e., static graphs and streaming graphs, respectively).

1.3. Main Contributions

To sum up, the main contributions of this dissertation are as follows:

(1) We outline a general framework to evaluate the entire sample clustering process, and propose several new quality metrics based on set matching to quantitatively evaluate how differently the clusters of the sampled graph correspond to the ground-truth(s) in the original one in various aspects. To the best of our knowledge, it is the first framework to evaluate the clustering quality which integrates sampling with clustering simultaneously. It is capable of evaluating the sample clustering process in various scenarios. By using this evaluation framework, here we consider two important scenarios: static graph and streaming graph.

(2) Regarding static graphs, we focus on designing each ingredient (i.e., sampling and clustering) of the sample clustering process on static graphs at scale.

• For the sampling part, we propose two novel top-leader sampling algorithms (i.e., TLS-e and TLS-i) that produce representative samples and are capable of retaining the clustering structure well. The rationale is to select the top-leader nodes of each cluster into the sample, and then heuristically incorporate the peripheral nodes that satisfy the expansion criterion into the subgraph. Empirical experiments have shown that the proposed sampling methods are capable of preserving the clustering structures of the population well and provide effective solutions for sampling static graphs at scale.


• For the clustering part, we propose a two-stage clustering inference (TCI) method to infer clustering affiliations of nodes in the original graph. It takes advantage of the label propagation algorithm, which efficiently propagates the sample's labels to the unlabeled nodes. Experimental results have demonstrated that the TCI method, in conjunction with any cluster-preserving sampling strategy, is capable of inferring the clustering affiliation of the population well and performs better than the competing methods.

(3) In the case of streaming graphs, the streaming versions of the sampling and clustering algorithms are required to be computationally efficient (preferably in a single pass) under limited memory. Therefore:

• For the sampling part, we propose a cluster-preserving sampling (CPIES) algorithm that dynamically maintains representative samples and is capable of preserving the clustering structure in streaming graphs at any time. We empirically demonstrate that CPIES can represent the inherent clustering structure of streaming graphs in an online fashion, and it outperforms current online sampling algorithms in most structural properties, especially in terms of clustering quality.

• For the clustering part, we present a bounded-size clustering algorithm to handle edge additions and deletions while the data is streaming. It requires each cluster to satisfy the maximum cluster-size constraint. Also, it maintains the recency of edges in the temporal sequence and gives high priority to the recent edges in each cluster. The experimental results show that the proposed BSC algorithm outperforms competitive algorithms and is capable of keeping track of cluster evolutions of graphs. Furthermore, it obtains almost one order of magnitude higher throughput than the state-of-the-art algorithms.

1.4. Dissertation Overview and Organization

This PhD dissertation contributes to the research in graph mining through a series of theoretical and empirical studies. The studies that comprise different chapters of this dissertation have appeared (or will appear) in peer-reviewed conferences and journals.

In order to make the structure of the dissertation clear and easy to follow, we make every effort to ensure that each chapter is consistent in definitions, notations, and so forth. After the preliminary knowledge is given, the main body of this dissertation is structured in three parts. The first part, comprising Chapters 4 and 5, expounds a general evaluation framework to assess the sample clustering process. The second part, comprising Chapters 6 and 7, explores techniques to sample and cluster in the scenario of static graphs. The third part, comprising Chapters 8 and 9, investigates feasible solutions for the sample clustering process on streaming graphs. More concretely, the thesis is organized as follows:

Chapter 2. We provide definitions of graphs in different scenarios, and then give common notations and basic concepts referenced throughout this thesis. In addition, the benchmark graphs used in the thesis are described in detail. Finally, we give an overview of the framework of the sample clustering process.

Chapter 3. We present a comprehensive review of the existing literature related to the thesis. We first introduce conventional clustering evaluation schemes, and then outline the existing work on graph sampling and clustering in two scenarios, i.e., static graphs and streaming graphs, using different computational models.

Part I studies the problem of evaluating the sample clustering process for “big” graphs. This part is an extension of papers [ZPFP16] and [ZPFP18c]:

Jianpeng Zhang, Yulong Pei, George Fletcher, and Mykola Pechenizkiy. Structural measures of clustering quality on graph samples. 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 345–348, San Francisco, CA, USA, 2016. (Corresponding to Chapters 4 and 5)

Evaluation of graph sample clustering. Journal submission (Revised). (Corresponding to Chapters 4 and 5)

Both publications correspond to the work in Chapters 4 and 5. In particular, Chapter 4 mainly focuses on the design of the evaluation framework for the sample clustering process, and Chapter 5 provides a detailed empirical analysis of the entire process on multiple benchmark graphs.

Chapter 4. We introduce the necessity of evaluation of the sample clustering process

Based on our evaluation framework, we propose several effective metrics to evaluate the sample clustering results from various aspects.

Chapter 5. We conduct extensive empirical analysis on a variety of synthetic and real-world graphs to evaluate the sample clustering process. Experimental results demonstrate that the new quality metrics perform well in practice and give insightful structural information for a given sample clustering solution.

In Part II, under the guidance of the evaluation framework, we study the sample clustering process on static graphs at scale, and we attempt to give solutions on how to sample and cluster on such graphs. This part is an extension of papers [ZPPE16] and [ZZP+18]:

Jianpeng Zhang, Kaijie Zhu, Yulong Pei, George Fletcher, and Mykola Pechenizkiy. Cluster-preserving sampling algorithm for graphs at scale. Journal submission (Revised). (Corresponding to Chapter 6)

Jianpeng Zhang, Kaijie Zhu, Yulong Pei, George Fletcher, and Mykola Pechenizkiy. Clustering affiliation inference from graph samples. 14th Workshop on Mining and Learning with Graphs (MLG) at KDD'18, London, United Kingdom, 2018. (Corresponding to Chapter 7)

Chapter 6. We take a closer look at various sampling algorithms on static graphs. The following observations are noteworthy: (i) most sampling algorithms do not take the clustering structure into consideration, although the clustering structure is a prevalent property of real-world graphs; (ii) existing methods are not capable of sampling static graphs at scale in an efficient way. To solve these problems, we propose two novel top-leader sampling algorithms (i.e., TLS-e and TLS-i) that produce representative samples and are capable of retaining the clustering structure well.

Chapter 7. We investigate various clustering inference approaches and present a new two-stage clustering inference method to assign clustering labels to the unsampled nodes of the population. If samples are truly representative of the clustering structure of the larger population, the clustering structure on the sample should generalize well to the nodes that are not sampled. The two-stage inference provides an effective way to infer clustering affiliations in the original graph.


Part III carries the research on the sample clustering process over to a more realistic and restrictive model (i.e., the streaming graph model). Due to the growing presence of streaming graphs in many online applications, streaming graph sampling and clustering play important roles in analyzing their intrinsic characteristics. This part is an extension of papers [ZZP+17], [ZPFP18a] and [ZPFP18b]:

Pechenizkiy. Clustering-structure representative sampling from graph streams. In Complex Networks & Their Applications VI: Proceedings of Complex Networks 2017 (The Sixth International Conference on Complex Networks and Their Applications), 2017. (Corresponding to Chapter 8)

Pechenizkiy. Cluster-preserving sampling from fully-dynamic streaming graph. Journal submission (Revised). (Corresponding to Chapter 8)

A bounded-size clustering algorithm on fully-dynamic streaming graphs. In Intelligent Data Analysis, 2018. (Corresponding to Chapter 9)

Chapter 8. We first formulate the problem of sampling from streaming graphs, and then propose a cluster-preserving sampling algorithm that dynamically maintains representative samples and is capable of retaining the clustering structure in streaming graphs in an online fashion. The combination of sampling compression with preserving the clustering structure is a novel approach to sampling the streaming graph. The experimental results have shown that the proposed CPIES algorithm can produce cluster-preserving subgraphs and outperforms current online sampling algorithms.

Chapter 9. Designing an efficient online clustering algorithm is a core part of the sample clustering process on streaming graphs. However, existing clustering approaches are inappropriate for this specific task because: (i) static clustering approaches incur expensive computational costs to re-cluster the graph for each update; and (ii) existing streaming clustering approaches can neither fully support insertion/deletion of edges nor take temporal information into account. To tackle these issues, we present a bounded-size clustering (BSC) algorithm to handle the edge additions and deletions of streaming graphs. The experimental results show that the proposed BSC algorithm obtains good clustering quality and is capable of keeping track of the cluster evolution of graphs.


Chapter 10. We conclude the thesis with a discussion of future work based on the obtained results of our proposed sampling and clustering techniques. We believe that the methods in this thesis and the described results are of value to the community as a basis for understanding the merits of the approaches and for further research on sample clustering on graphs.


2. Definitions and Common Notation

2.1. Graph Definitions

In this thesis, we apply the sample clustering process in the two most common scenarios: static graphs, in which nodes/edges can be fully accessed at random, and streaming graphs, in which nodes/edges can only be accessed sequentially (i.e., graphs are available either with full access from memory or with sequential access in a streaming fashion). These scenarios require different computational models, whose formal definitions are given below.

2.1.1. Static Graph

Traditional graph analysis considers the graph-structured data as a static graph¹, which is either derived from aggregation of edges over all time or taken as a snapshot of edges at a particular time-stamp. This model makes the assumption that the graphs fit in the main memory and it costs constant time to randomly access any node and its neighbors. The formal definition is as follows:

Definition 2.1. (Static graph). A static graph $G$ is defined as $G = (V, E)$, where $V$ is a finite set of nodes and $E \subseteq V \times V \times \mathbb{R}^{+}$ is a set of weighted edges, such that $N = |V|$ is the number of nodes and $M = |E|$ is the number of weighted edges. $\mathbb{R}^{+}$ denotes the set of positive real numbers. Each weighted edge in $E$ is represented as $\langle u, v, w \rangle$, where $u$ and $v$ are the two incident nodes of the edge with $u \neq v$, and $w$ is the edge weight.

In a number of real-world graphs, edges may not all have the same weight. For instance, edges are often associated with weights that differentiate them in terms of strength, intensity, or capacity. Thus, if edges have different weights, we treat the graph as a weighted static graph. Otherwise, it is an unweighted static graph, which is essentially a special case of a weighted graph (i.e., $w = 1.0$). Furthermore, the pervasive use of social networks has resulted in a dramatic increase in large-scale static graphs, followed by a surge of interest in analyzing such graphs across various disciplines. Therefore, it is necessary to analyze such graphs, which provides important insights into their structures and formation mechanisms.

¹ In this thesis, the terms network and graph are used interchangeably, and for ease of analysis only undirected graphs are considered.

2.1.2. Streaming Graph

The majority of existing work has focused on static graphs, which is a less restrictive problem setup. In this work, we also investigate a more realistic scenario, i.e., fully-dynamic streaming graphs where edge insertions and deletions are occurring in an arbitrary order. This scenario is far more challenging because of the computational requirement and the inability to store the whole stream. Formally, we give the definition of a fully-dynamic streaming graph as follows:

Definition 2.2. (Edge update). An edge update $e$ is of the form $(\bullet, \langle u, v, w \rangle)$, where $u$ and $v$ are the two incident nodes of the edge and $w$ is the associated temporal weight². Here $\bullet \in \{+, -\}$ represents the edge insertion and deletion operations.

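Definition 2.3. (Graph update). Given a static graph $G = (V, E)$ and an edge update $e = (\bullet, \langle u, v, w \rangle)$, we define $\mathrm{graphUpdate}(G, e) = (\mathrm{nodeUpdate}(G, e), \mathrm{edgeUpdate}(G, e))$, where

\[
\mathrm{nodeUpdate}(G, e) = V \cup \{u, v\}, \quad \text{and}
\]

\[
\mathrm{edgeUpdate}(G, e) =
\begin{cases}
E \cup \{\langle u, v, w \rangle\} & \text{if } \bullet = +, \\
E \setminus \{\langle u, v, w \rangle\} & \text{if } \bullet = -.
\end{cases}
\]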

Definition 2.4. (Fully-dynamic streaming graph). Let $\mathcal{E} = [e_1, e_2, e_3, \ldots]$ be an infinite sequence of edge updates. A fully-dynamic streaming graph with respect to $\mathcal{E}$ is an infinite sequence $\mathcal{G} = (G_0, G_1, G_2, \ldots)$ of static graphs such that

- $G_0 = (\emptyset, \emptyset)$, and

- $G_t = \mathrm{graphUpdate}(G_{t-1}, e_t)$, for $t > 0$.

We refer to $G_t$ as the snapshot of $\mathcal{G}$ at time $t$. Furthermore, we let $V_t$ and $E_t$ denote the node set and edge set, respectively, of $G_t$.
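
Definitions 2.2 to 2.4 can be sketched in a few lines of Python. This is a minimal illustration rather than an implementation used in the thesis: the names `graph_update` and `snapshots`, the use of frozensets, and the encoding of the operation as the strings "+" and "-" are assumptions made for the example. Edges are stored exactly as the triples given in the updates, so a deletion must name the same triple as the earlier insertion.

```python
def graph_update(nodes, edges, update):
    """Apply one edge update (op, (u, v, w)) and return the next snapshot.

    nodes: frozenset of nodes V; edges: frozenset of (u, v, w) triples E.
    Following Definition 2.3, both endpoints are added to V for either
    operation, and the weighted edge is added to or removed from E.
    """
    op, (u, v, w) = update
    new_nodes = nodes | {u, v}
    if op == "+":
        new_edges = edges | {(u, v, w)}
    elif op == "-":
        new_edges = edges - {(u, v, w)}
    else:
        raise ValueError(f"unknown operation: {op!r}")
    return new_nodes, new_edges


def snapshots(updates):
    """Yield G_0, G_1, G_2, ... for a (possibly endless) iterable of updates."""
    nodes, edges = frozenset(), frozenset()
    yield nodes, edges  # G_0 is the empty graph
    for update in updates:
        nodes, edges = graph_update(nodes, edges, update)
        yield nodes, edges


# A short finite prefix of an edge stream, with the arrival order as weight.
stream = [("+", ("a", "b", 1)), ("+", ("b", "c", 2)), ("-", ("a", "b", 1))]
history = list(snapshots(stream))
```

Note that, as in Definition 2.3, deleting an edge does not remove its endpoints from the node set, so node "a" is still present in the last snapshot above.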

For simplicity, unless stated otherwise, we abbreviate fully-dynamic streaming graphs as streaming graphs in the remainder of this thesis. Note that the timestamp symbol $t$ on a variable denotes the variable's value at time $t$. However, to avoid notation clutter, we drop $t$ from the notation if it is clear from the context.

² Note that the temporal weight can be defined as the arrival sequence, the relative time, or even a fading time function, according to practical requirements.

2.2. Common Notation

In this section, we provide a set of common notations used throughout this thesis. Note that some additional symbols will be used locally; we give their definitions in context. The common notations used across this thesis are listed in Table 2.1, and some key notations are described in detail.

Table 2.1.: Common notation & definitions

| Symbol | Definition |
| --- | --- |
| Notations for Graph | |
| $G$ | The original graph |
| $V$ | The set of nodes within the graph |
| $E$ | The set of edges within the graph |
| $N$ | The number of nodes of $V$ |
| $M$ | The number of edges of $E$ |
| $t$ | The timestamp of the graph |
| Notations for Graph Sampling | |
| $\Lambda$ | The sampling strategy |
| $S$ | The sampled subgraph of $G$ |
| $V_s$ | The set of nodes within the sample |
| $E_s$ | The set of edges within the sample |
| $n_s$ | The number of sampled nodes |
| $m_s$ | The number of sampled edges |
| $p$ | The sample rate of nodes |
| $\eta(\cdot)$ | The topological property of the graph |
| Activated node | The node $v$ is an activated node if it has not been sampled |
| Deactivated node | The node $v$ is a deactivated node if it has been sampled |
| Notations for Graph Clustering | |
| $\mathcal{P}$ | The clustering algorithm |
| $k$ | The number of clusters |
| $b_i$ | The set of nodes in the $i$-th cluster in $\pi$ |
| $\pi$ | A clustering scheme of a graph, i.e., a finite non-empty set $\pi = \{b_1, b_2, \ldots, b_k\}$ where each $b_i$ ($i \in [1, k]$) is a cluster of the graph |
| Label | A clustering label set for all the nodes of the graph |
| $\mathrm{label}_i$ | Clustering label of the $i$-th node |
| Notations for Evaluation | |
| $\pi(G)$ | A clustering scheme in multiple clustering schemes $\Pi(G)$ |
| $\Pi(G)$ | The space of underlying ground-truth clusterings of $G$, i.e., a set of multiple clustering schemes for $G$ |
| $\pi(S)$ | A possible clustering scheme of $\Pi(S)$ induced by the clustering process $\mathcal{P}$ on the sample $S$ |
| $\Pi(S)$ | A set of possible clustering schemes for the sample $S$ |


2.2.1. Notation of Graph Sampling

A sample $S$ is a sampled subgraph of an original graph $G$, which means that the subgraph consists of the sampled node set $V_s$ and the edge set $E_s$ whose endpoints both lie in the set of sampled nodes. The aim of graph sampling is to generate representative samples, i.e., samples of good quality. Formally, $\eta(\cdot)$ denotes any topological property, which can be a single-valued statistic (e.g., average clustering coefficient) or a multiple-valued distribution (e.g., $k$-core distribution) [ADNK14]. The objective is to ensure that the sample $S = (V_s, E_s)$ is representative, i.e., that it matches many topological properties of $G$: $\eta(G) \approx \eta(S)$. Formally, key definitions are given as follows:

Definition 2.5. (Sampling algorithm). A sampling algorithm $\Lambda$ is a procedure to sample a subgraph $S$ from an original graph $G$.

Definition 2.6. (Sample). A sample $S = (V_s, E_s)$ is a sampled subgraph of an original graph $G = (V, E)$, where $V_s \subseteq V$, $E_s \subseteq E$, and $E_s \subseteq \{(u, v) \mid u \in V_s, v \in V_s\}$.

Definition 2.7. (Sampling rate). A sampling rate $p$ is the fraction of nodes $V_s$ sampled from $V$, i.e., $p = |V_s| / |V|$.

Definition 2.8. (Topological property). A topological property $\eta(\cdot)$ is a structural characteristic that needs to be preserved from the original graph.
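
Definitions 2.6 to 2.8 can be illustrated with a small sketch. Uniform node sampling with an induced edge set is used here purely as a stand-in for a sampling algorithm $\Lambda$ (it is not one of the methods proposed in this thesis), and the average local clustering coefficient serves as an example property $\eta(\cdot)$. The function names and the adjacency-dictionary representation are assumptions made for the example.

```python
import random
from itertools import combinations


def induced_sample(adjacency, rate, seed=0):
    """Uniformly sample a fraction `rate` of the nodes and keep every edge
    whose two endpoints were both sampled (one valid choice of E_s)."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    k = max(1, round(rate * len(nodes)))
    sampled = set(rng.sample(nodes, k))
    return {u: adjacency[u] & sampled for u in sampled}


def sampling_rate(sample, adjacency):
    """p = |V_s| / |V| as in Definition 2.7."""
    return len(sample) / len(adjacency)


def average_clustering(adjacency):
    """Example property eta: mean local clustering coefficient, counting
    nodes with fewer than two neighbours as 0."""
    if not adjacency:
        return 0.0
    total = 0.0
    for u, nbrs in adjacency.items():
        d = len(nbrs)
        if d >= 2:
            links = sum(1 for v, w in combinations(nbrs, 2) if w in adjacency[v])
            total += 2.0 * links / (d * (d - 1))
    return total / len(adjacency)


# Complete graph on four nodes, sampled at rate p = 0.5.
k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
sample = induced_sample(k4, 0.5, seed=1)
gap = abs(average_clustering(k4) - average_clustering(sample))
```

The value `gap` measures how far $\eta(S)$ is from $\eta(G)$ for this one property; on such a tiny graph the gap is large, which shows why a sample that satisfies Definition 2.6 is not automatically representative.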

Definition 2.9. (Connected component). A graph (or subgraph) is a connected component if and only if there is a path of edges (connecting a sequence of nodes) from $v$ to $w$ for every pair of nodes $v$ and $w$ in the graph.
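
As a brief illustration of Definition 2.9, the sketch below groups the nodes of an undirected graph into connected components with a breadth-first search. The function name `connected_components` and the adjacency-dictionary input are assumptions for this example; a graph is connected in the sense of the definition exactly when the returned list has a single entry.

```python
from collections import deque


def connected_components(adjacency):
    """Return a list of node sets, one per connected component.

    adjacency: dict mapping each node to the set of its neighbours
    (undirected, so every edge appears in both directions).
    """
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        components.append(component)
    return components


# One edge plus one isolated node: two components.
parts = connected_components({1: {2}, 2: {1}, 3: set()})
```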

2.2.2. Notation of Graph Clustering

A cluster in a graph is intuitively defined as a set of densely connected nodes that are sparsely linked to the nodes of other clusters. However, no exact mathematical definition of a cluster is widely accepted in the literature. Graph clustering has been proved to be an NP-hard problem [FH16]. Exact clustering algorithms are only applicable to small graphs and are too costly for large-scale graphs, so it is common to apply approximation algorithms with lower time complexity to this NP-hard problem. Since there is no universal objective function that graph clustering should follow, we give a general definition of clustering below:


Definition 2.10. (Clustering scheme). For a graph $G$, a clustering scheme $\pi(G)$ is a finite non-empty set $\pi(G) = \{b_1(G), \ldots, b_k(G)\}$, where each $b_i(G)$ ($i \in [1, k]$) is a subset of $V$, called a cluster of $G$, and it holds that $\bigcup_{b \in \pi(G)} b = V$. Note that the clusters can be either disjoint or overlapping with each other.

Definition 2.11. (Clustering algorithm). A clustering algorithm $\mathcal{P}$ is employed to produce possible clustering schemes for a given graph $G$.
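
The conditions of Definition 2.10 are easy to check mechanically. The sketch below is a minimal illustration with hypothetical function names: it tests that a family of clusters is non-empty, that every cluster is a subset of $V$, and that the clusters together cover $V$, and it reports separately whether the clusters are disjoint or overlapping.

```python
def is_clustering_scheme(nodes, clusters):
    """Check Definition 2.10: a non-empty family of subsets of V whose
    union is V. Overlapping clusters are allowed."""
    if not clusters:
        return False
    node_set = set(nodes)
    cluster_sets = [set(b) for b in clusters]
    if not all(b.issubset(node_set) for b in cluster_sets):
        return False
    return set().union(*cluster_sets) == node_set


def is_disjoint(clusters):
    """True if no node belongs to more than one cluster."""
    seen = set()
    for b in clusters:
        b = set(b)
        if seen & b:
            return False
        seen |= b
    return True


V = {1, 2, 3, 4}
disjoint_scheme = [{1, 2}, {3, 4}]
overlapping_scheme = [{1, 2, 3}, {3, 4}]
```

Both example schemes satisfy the definition; only the first one is a partition of $V$.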

Note that, in practice, most graph clustering algorithms do not require a precise definition of clusters. The clusters can be interpreted from multiple optimized objective functions. For instance, one can remove the edges with high betweenness centrality to form clusters, or optimize the modularity function to obtain the clusters. However, defining clusters beforehand is a useful starting point, which allows us to verify the credibility of clustering outcomes.

2.2.3. Notation of the Evaluation Framework

The goal of the evaluation framework is to find a better way to evaluate the entire sample clustering process which integrates graph sampling with clustering. We aim to evaluate the clustering quality of the sampled graph with respect to the valid ground-truth(s). Please note that, in practice, there may be more than one valid ground-truth and we may not have access to the authentic ground-truth, e.g., in the context of multiple social network views, a variety of credible clusterings exists in the networks. Several reasonable but uncorrelated clusterings might explain the network well from various perspectives. This fact could make the sample clustering process find one ground-truth while the available information in hand gives another ground-truth. More precisely, given

• a graph $G = (V, E)$ and a set of ground-truth clusterings $\Pi(G)$ of $G$;

• a sampling strategy $\Lambda$, which generates a sampled counterpart $S$ of $G$; and

• a clustering process $\mathcal{P}$, which produces a possible clustering $\pi(S) \in \Pi(S)$ on the sampled graph $S$,

we measure the clustering quality of the entire process under the condition of multiple ground-truths $\Pi(G)$. Such an evaluation framework will guide our understanding and study of improving sampling strategies and sample clustering solutions. Key definitions are:


Definition 2.12. (Multiple clustering schemes). Multiple clustering schemes for a graph $G$ are a set $\Pi(G)$ of clustering schemes of $G$.

Please note that for the original graph $G$, we can obtain multiple clustering schemes $\Pi(G)$ from the reliable meta-data of nodes³. We treat them as multiple ground-truth clusterings of the original graph $G$. Meanwhile, for the sample $S$, multiple clustering schemes $\Pi(S)$ are produced by the clustering process $\mathcal{P}$ on the sample $S$. Nevertheless, their essence is consistent (i.e., they both represent a set of clustering schemes) and more specific descriptions are given in Table 2.1. Our evaluation objective is to measure the quality of the clustering result $\pi(S) \in \Pi(S)$ with respect to multiple ground-truth clusterings $\Pi(G)$.

2.3. Datasets

We have considered diverse graphs derived from a variety of domains. The data collection is divided into two categories: static graphs and streaming graphs. It allows for a more comprehensive study and thorough evaluation of the performance of the sample clustering process in the face of graphs with various characteristics. We now provide brief descriptions of them separately.

2.3.1. Static Graph Collection

For static graphs, we present a number of experiments on the well-known Lancichinetti-Fortunato-Radicchi (LFR) [LFR08] synthetic graphs and on real-world graphs. Brief descriptions are given below.

Synthetic static graphs

In particular in the context of graph clustering, there is one reason why synthetic static graphs are widely used. Since there is no general consensus on how to define the objective function to evaluate the goodness of the clustering, the most feasible approach independent of any objective function is to compare with the ground-truth. However, the fact is that real-world graphs with a well-defined ground-truth are still rare. Hence, synthetic graphs incorporating a hidden ground-truth have gained popularity in the literature to evaluate the clustering process.

³ We have described the relation between the ground-truth and the meta-data in detail in Section 4.3.1 of Chapter 4.


The LFR benchmark proposed in [LFR08] has been widely used to generate artificial graphs. The advantage of LFR over other benchmarks is that it accounts for the heterogeneity in the distributions of node degrees and cluster sizes. It is based on the planted partition model which takes a given ground-truth clustering as an input and assumes that both the degree and the cluster size have power-law distributions with different exponents. Therefore, we employ the LFR generator as synthetic benchmarks. In LFR benchmarks, a set of optional parameters needs to be given as necessary. The mixing parameter µ is vital and reflects the average ratio of external degree to total degree for each node. The larger µ is, the less distinct the clustering structure of the benchmark is.
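
The mixing parameter can also be measured on a given graph and clustering. The sketch below follows the description above, averaging over all nodes the ratio of external degree to total degree; it assumes an unweighted graph with disjoint clusters, skips isolated nodes, and uses a hypothetical function name and input format.

```python
def mixing_parameter(adjacency, label):
    """Average over nodes of (edges leaving the node's cluster) / (degree).

    adjacency: dict mapping each node to the set of its neighbours.
    label: dict mapping each node to the identifier of its (single) cluster.
    Isolated nodes have no defined ratio and are skipped (an assumption).
    """
    ratios = []
    for u, nbrs in adjacency.items():
        if not nbrs:
            continue
        external = sum(1 for v in nbrs if label[v] != label[u])
        ratios.append(external / len(nbrs))
    return sum(ratios) / len(ratios) if ratios else 0.0


# Two triangles joined by the single edge (2, 3): a very distinct structure.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
mu = mixing_parameter(graph, clusters)
```

Here only nodes 2 and 3 have an external neighbour (ratio 1/3 each), so the measured value is 1/9, consistent with a clearly separated clustering structure.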

Real-world graphs

We have considered various graphs derived from real-world networks, which can roughly be classified into two groups: small and large graphs. The first five datasets⁴ (the Karate, Football, Dolphin, Polbooks and Polblogs networks) are relatively small [GN02] and are widely used to measure the accuracy of clustering structures. The other graphs are relatively large, and some of them are impossible (or very time-consuming) to process in their entirety without sampling. They all have corresponding meta-data that correlate with the clustering structure and serve as the underlying ground-truth. These meta-data groups are created based on specific topics, interests, hobbies, and geographical regions. For example, LiveJournal categorizes groups into the following types: culture, entertainment, expression, fandom, gaming, etc. We consider each such explicit interest-based group as ground-truth. Only the top-5000 ground-truth clusters from each graph are used for evaluation, since quality metrics degrade considerably after the top 5000. These graphs are obtained from the Stanford Large Network Dataset Collection⁵. All graphs used are undirected and unweighted, and the summary statistics for the real-world graphs are shown in Table 2.2. Furthermore, we also consider the Facebook university networks (FB100⁶), which have multiple ground-truths. The ground-truths are formed from meta-data in which the dorm and graduation year are highly correlated with the clustering structure of the network.

⁴ All the small datasets and ground-truth information can be found at http://www-personal.umich.edu/~mejn/netdata/.

⁵ The graphs and the corresponding meta-data are complete and publicly available at http://snap.stanford.edu/data.


Table 2.2.: Summary of real-world graphs used in the experiments. Abbreviations are described as follows: N: number of nodes; M: number of edges; C: number of clusters; # comps: number of components; D: the density of the graph; CC: the average clustering-coefficient for all nodes; “-”: the specific Facebook network needs to be specified.

| Dataset | Ground-truth(s) | N | M | C | # comps | D | CC |
|---|---|---|---|---|---|---|---|
| Karate | Single | 34 | 78 | 2 | 1 | 0.139 | 0.571 |
| Football | Single | 115 | 613 | 12 | 1 | 0.094 | 0.403 |
| Dolphin | Single | 62 | 159 | 2 | 2 | 0.084 | 0.259 |
| Polbooks | Single | 105 | 441 | 2 | 1 | 0.081 | 0.487 |
| Polblogs | Single | 1,224 | 16,718 | 3 | 1 | 0.022 | 0.320 |
| LiveJournal | Single | 84,438 | 1,521,988 | 5000 | 1464 | 0.006 | 0.730 |
| Friendster | Single | 220,015 | 4,031,793 | 5000 | 669 | 0.00017 | 0.442 |
| Orkut | Single | 731,514 | 21,992,510 | 5000 | 42 | 0.00008 | 0.247 |
| DBLP | Single | 93,432 | 335,520 | 5000 | 392 | 0.00009 | 0.708 |
| Youtube | Single | 39,841 | 224,235 | 5000 | 639 | 0.0003 | 0.198 |
| Web-Notre-Dame | Single | 325,729 | 1,117,563 | unknown | 1 | 0.00002 | 0.235 |
| Amazon | Single | 16,716 | 48,739 | 5000 | 1106 | 0.0105 | 0.421 |
| Facebook100 | Multiple | [762, 41,536] | [16,651, 1,465,654] | different traits | - | - | - |
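As an illustration of the D and CC columns, the sketch below computes the density of an undirected graph and the average local clustering coefficient. The density formula 2M / (N(N-1)) is an assumption, since the definition is not stated in this section, but it reproduces the Karate and Football entries of Table 2.2. The function names and the adjacency-set representation are hypothetical.

```python
from itertools import combinations


def density(n_nodes, n_edges):
    """Density of a simple undirected graph, assumed to be 2M / (N (N - 1))."""
    return 2.0 * n_edges / (n_nodes * (n_nodes - 1))


def average_clustering(adj):
    """Mean local clustering coefficient over all nodes.

    adj: dict mapping each node to the set of its neighbours
         (undirected, no self-loops). Nodes with fewer than two
         neighbours contribute a coefficient of 0.
    """
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue
        # Count the edges that exist among the neighbours of this node.
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj) if adj else 0.0


karate_density = round(density(34, 78), 3)      # N and M taken from Table 2.2
football_density = round(density(115, 613), 3)  # N and M taken from Table 2.2

# Toy graph (hypothetical): a triangle 0-1-2 with a pendant node 3 attached to 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
toy_cc = average_clustering(toy)
```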

2.3.2. Streaming Graph Collection

In the case of streaming graphs, both synthetic and real-world graphs are employed to verify the effectiveness of the proposed algorithms. A brief summary of these graphs is shown in Table 2.3.

Synthetic streaming graphs

We adapt the dynamic network benchmark [GDC10] to generate a set of graph snapshots with embedded clustering structures. The transitions between graph snapshots are driven by the injection of a user-specified number of evolving events, e.g., merging/splitting or birth/death. These graph snapshots share similar characteristics and have the ground-truth embedded in them. To produce the streaming edges, we record the generation order of the edges as their timestamps and rearrange the edges randomly in each trial. We ensure that all methods receive the edges in the same sequential order, so that they can be compared quantitatively as the edge updates arrive.
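One way to obtain a randomized yet shared edge order is to shuffle with a random generator seeded by the trial number, so that every method evaluated in that trial sees an identical stream. The sketch below is only an illustration under stated assumptions: it assumes that edges are grouped by the snapshot in which they were generated and that shuffling happens within each snapshot. The function `build_stream` and its parameters are hypothetical and are not part of [GDC10].

```python
import random


def build_stream(snapshot_edges, trial, base_seed=0):
    """Return a list of (time_step, edge) pairs for one trial.

    snapshot_edges: list of edge lists, one per time step, in generation order.
    trial: trial index; the same trial always yields the same stream.
    base_seed: hypothetical experiment-level seed.
    """
    rng = random.Random(1000 * base_seed + trial)  # deterministic per trial
    stream = []
    for step, edges in enumerate(snapshot_edges):
        batch = list(edges)  # copy so the caller's data is left untouched
        rng.shuffle(batch)   # random order within a time step (assumption)
        stream.extend((step, edge) for edge in batch)
    return stream


# Toy input (hypothetical): two time steps with a few edges each.
snapshots = [[(0, 1), (1, 2), (2, 0)], [(2, 3), (3, 4)]]
stream_for_method_a = build_stream(snapshots, trial=1)
stream_for_method_b = build_stream(snapshots, trial=1)
```

Because the generator is re-created from the trial number, each method can rebuild the stream independently instead of sharing a stored copy.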

In the implementation, we generate four synthetic graphs for four different event types (merging/splitting, birth/death, expansion/contraction and node switching) over five time steps. At each subsequent time step, a number of instances of the corresponding events occur. The edge updates are streamed into the system in chronological order, and all metrics are calculated at five specific snapshots for which the ground-truth is known. Following [GDC10], these graph snapshots share a number of parameters chosen to simulate real-world graphs: the number of nodes: N=1000,
