
A generalized weight-based fusion model for community detection in node-attributed social networks


Academic year: 2021


MSc Computational Science

Master Thesis

A generalized weight-based fusion model for community detection in node-attributed social networks

by

Timofei Gradov

13241028

June Supervisor/Examiner: Examiner:


ABSTRACT

The weight-based fusion model is among the simplest and most efficient models for modularity-driven community detection in node-attributed social networks, which contain both links between social actors (“structure”) and the actors’ feature vectors (“attributes”). Roughly speaking, the model first converts the attributes into an attributive network, so that one obtains two networks, structural and attributive, instead of the node-attributed network. The two networks are then fused into a composite one that is believed to contain the information about both the structure and the attributes and that can be fed directly to traditional modularity-driven graph community detection approaches. Although the weight-based fusion model is widely used, it has been understudied analytically and has rested only on heuristic grounds.

In this study, we disclose the mathematical machinery of a generalized weight-based fusion model by revealing the objective function of the corresponding community detection optimization process and establishing its connection with the traditional quality measures for the results of community detection in node-attributed networks. We also propose a non-manual parameter tuning scheme, the first of its kind to our knowledge, that provides the desired impact of the structure and the attributes on the community detection results within the weight-based fusion model. Based on the theoretical results obtained, we further present a well-tunable Leiden-based algorithm for community detection in node-attributed social networks that proves fast and accurate in our experiments with synthetic and real-world datasets.


TABLE OF CONTENTS

INTRODUCTION

1 Necessary facts on the generalized weight-based fusion model

1.1 Description of the weight-based fusion model, the corresponding community detection problem and the heuristics involved

1.2 Precise definitions and reformulation of the node-attributed social network community detection problem

1.3 Overview of the existing weight-based fusion models for node-attributed social network community detection

2 The objective function of modularity-driven community detection process within the weight-based fusion model

3 Parameter tuning scheme

4 Composite Modularity for almost directly proportional edge weights in the structural and attributive graphs

5 Well-tunable Leiden-based weight-based fusion model for node-attributed social network community detection

6 Experiments

6.1 Data, code and implementation settings

6.2 Synthetic node-attributed network datasets

6.2.1 Barabási-Albert and Erdős-Rényi graph models

6.2.2 Lancichinetti-Fortunato-Radicchi benchmark

6.2.3 The almost direct proportionality of the edge weights of the graphs $G_S$ and $G_A$

6.2.4 Counterexample to the node-attributed social network feature selection principle

6.3 Real-world node-attributed network datasets

6.4 Evaluation of the parameter tuning scheme in Algorithm in Table 5.1

6.5 Comparison of Entropy and Attributes-aware Modularity

6.6 Different Modularity optimization algorithms

7 DISCUSSION

7.1 Ground Truth partition as quality measure

7.2 High Differential Modularity values

7.3 Processing large networks

CONCLUSION


INTRODUCTION

Community detection in node-attributed social networks is an actively studied problem in social network analysis [1, 2, 3], driven by the need to explore huge amounts of real-world social network data that contain both links between social actors (known as structure) and the actors’ feature vectors (known as attributes) indicating the actors’ age, interests, etc. While classical community detection models deal either with the structure or with the attributes, community detection methods for node-attributed networks aim at the simultaneous use, or fusion, of both. The motivation is that such fusion may clarify and enrich the knowledge about communities in a node-attributed social network [1, 2]. This idea is usually justified by the well-known homophily principle, which states that like-minded social actors have a higher probability of being connected [4]. Furthermore, findings in social science (see e.g. [5, 6]) suggest that the attributes can reflect and affect the community configuration of a node-attributed social network.

A variety of models and methods for community detection in node-attributed social networks have been proposed so far [1, 2, 3]. Although they are widely used in applications, some of them are understudied analytically and rest only on heuristic grounds. In particular, it is observed in [2] that, from the optimization theory point of view, many community detection methods for node-attributed social networks lack a mathematically established connection between the objective function of the community detection optimization process and the measures used to evaluate community detection quality. Indeed, [2] contains many examples of models and methods where one function is optimized within the community detection process and another function, a “quality measure” that is not directly related to the former and is again usually chosen heuristically from the many possible ones [1, 7], is used for evaluating the optimization results. In our opinion, this lack may cause misinterpretation of the community detection results and difficulties with tuning the parameters of a community detection method for node-attributed social networks. Let us emphasize that, now and throughout the research, we mainly deal with structure- and attributes-aware quality measures (e.g. Modularity and Entropy) and only rarely with those estimating the agreement between the detected communities and the ground truth ones. The reason is that the latter seem to be less formalized and their “truthfulness” may be questionable in certain cases, as the results in [8, 9, 10] and our own results below suggest.

In this research we address a particular case of the above-mentioned general problem. Namely, our main objective is to disclose the mathematical machinery of a generalized weight-based fusion model, one of the simplest and most efficient models applied for modularity-driven community detection in node-attributed social networks. More than 20 papers are devoted to the family of weight-based fusion models, see [2]. Recall that, for a given node-attributed social network, a typical weight-based fusion model first converts the attributes into an attributive network, so that one obtains two networks, structural and attributive, instead of the node-attributed social network. The two networks are then fused into a composite one that is believed to contain the information about both the structure and the attributes and that can be fed directly to traditional modularity-driven graph community detection methods. The formal description of the weight-based fusion model under consideration, the corresponding community detection process and an overview of the existing weight-based fusion models are given in Section 1.

To meet the main objective, we aim at achieving the following goals:

– to reveal the objective function of the community detection optimization process within the weight-based fusion model;

– to establish the connection between the objective function and the related community detection quality measures;

– to propose a non-manual parameter tuning scheme that provides the desired impact of the structure and the attributes on community detection results within the weight-based fusion model;

– to propose a well-tunable algorithm for community detection in node-attributed social networks that is based on the weight-based fusion model and the well-known Leiden algorithm for Modularity maximization;

– to perform an experimental study of the algorithm on synthetic and real-world datasets.

The exposition in this thesis is partly based on the peer-reviewed papers [11] and [12], of which the thesis author is a co-author. Furthermore, this thesis extends [12] with the following main contributions:

– we obtain further theoretical results on the objective function of the community detection optimization process; in particular, we show how it behaves when the edge weights in the structural and attributive graphs are almost directly proportional;

– we generalize the previous parameter tuning scheme, which provided only the equal impact of the structure and the attributes on community detection results within the weight-based fusion model, to any desired impact;

– we propose, and show the efficiency of, a well-tunable Leiden-based algorithm for community detection in node-attributed social networks, replacing the direct illustrative experiments with the well-known Louvain algorithm for Modularity maximization.

In both the papers and this thesis the contribution of the thesis author is 50% to the theoretical results (the theorems) and 90% to the design and implementation of the algorithm and the experiments.

1 Necessary facts on the generalized weight-based fusion model

1.1 Description of the weight-based fusion model, the corresponding community detection problem and the heuristics involved

Below we first describe a general version of the weight-based fusion model and the related node-attributed social network community detection problem. Then we recount the community detection quality evaluation scheme within the weight-based fusion model and reveal the implicit heuristics usually used in it.

In this research we model a node-attributed social network as an undirected simple node-attributed graph $G = (V, E, W, A)$, where $V = \{v_i\}_{i=1}^{n}$ is the set of nodes (representing social actors), $E = \{e_{ij}\}$ is the set of all possible links (representing social connections) between nodes in $V$ (i.e. $(V, E)$ is a complete graph), $W$ is the set of non-negative edge weights (representing the strength of the actors’ social connections¹), and $A$ is the set of attribute vectors $A(v_i) = \{a_d(v_i)\}_{d=1}^{D}$, $v_i \in V$, with non-negative² elements (representing the social actors’ features). We suppose that $(V, E)$ is complete to simplify the further mathematical analysis. Recall that in what follows $(V, E, W)$ is called the structure and $(V, A)$ the attributes of $G$.

Within a generalized weight-based fusion model, $G$ is first converted by a certain rule into the following two undirected simple weighted complete graphs: the structural graph $G_S = (V, E, W_S)$ and the attributive graph $G_A = (V, E, W_A)$, where $W_S = \{w_S(e_{ij})\}$ and $W_A = \{w_A(e_{ij})\}$ are the sets of non-negative edge weights in each graph, see Figure 1.1. For convenience, we suppose that

$$\sum_{e_{ij}\in E} w_S(e_{ij}) = 1, \qquad \sum_{e_{ij}\in E} w_A(e_{ij}) = 1. \tag{1}$$

¹ An edge weight may be zero, which indicates that there is no social connection.

² For nominal or textual attributes, it is common to use one-hot encoding or embeddings to obtain their numerical

Figure 1.1 – An example of the weight-based fusion model scheme for $G$ with $n = 5$ and 2-dimensional binary attribute vectors: the node-attributed graph $G = (V, E, W, A)$ is split into the structural graph $G_S = (V, E, W_S)$ and the attributive graph $G_A = (V, E, W_A)$, which are fused into the composite graph $G_\alpha = (V, E, W_\alpha)$. The edge weights in the attributive graph $G_A$ are given by the normalized matching coefficient on the corresponding node attributes in $G$.

Furthermore, the two graphs are fused to obtain the undirected simple complete weighted graph $G_\alpha = (V, E, W_\alpha)$, where the elements of $W_\alpha = \{w_\alpha(e_{ij})\}$, with $e_{ij} \in E$, are as follows (see Figure 1.1):

$$w_\alpha(e_{ij}) = \alpha\, w_S(e_{ij}) + (1-\alpha)\, w_A(e_{ij}), \quad \alpha \in [0, 1], \qquad \sum_{e_{ij}\in E} w_\alpha(e_{ij}) = 1. \tag{2}$$

Here $\alpha$ is a fusion parameter that controls the impact of $G_S$ and $G_A$ on $G_\alpha$. Note that $G_1 = G_S$ and $G_0 = G_A$ by construction. The graph $G_\alpha$ (which we call composite) is believed to contain the information about both the structure and the attributes of $G$; this belief is one of the traditional heuristics behind the family of weight-based fusion models.
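The fusion rule (2) can be illustrated by a minimal Python sketch. The dictionary-based edge representation, the function name `fuse` and the toy weights are assumptions made for illustration only; they are not taken from the thesis code.

```python
from itertools import combinations


def fuse(w_s, w_a, alpha):
    """Composite weights (2): alpha * w_S + (1 - alpha) * w_A on every edge.

    w_s, w_a: dicts mapping an edge (i, j), i < j, to its normalized weight,
    so that each dict sums to 1 as required by (1).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {edge: alpha * w_s[edge] + (1.0 - alpha) * w_a[edge] for edge in w_s}


# Toy complete graph on 4 nodes; the weights below are made up for illustration.
edges = list(combinations(range(4), 2))
w_s = dict(zip(edges, [1 / 3, 0.0, 0.0, 1 / 3, 0.0, 1 / 3]))      # 0 means "no link"
w_a = dict(zip(edges, [0.25, 0.125, 0.0, 0.125, 0.25, 0.25]))

w_alpha = fuse(w_s, w_a, alpha=0.3)
# The composite weights again sum to 1 (up to rounding), as stated in (2).
print(abs(sum(w_alpha.values()) - 1.0) < 1e-12)
```

Setting `alpha=1.0` or `alpha=0.0` returns the structural or the attributive weights, respectively, which mirrors the identities $G_1 = G_S$ and $G_0 = G_A$.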

Recall that community detection in $G$ consists in the unsupervised division of $V$ into $K$ disjoint³ communities $C_k \subset V$, with $C = \{C_k\}_{k=1}^{K}$, such that $V = \bigcup_{k=1}^{K} C_k$ and a certain balance between the following properties is achieved [1, 2]:

– structural closeness, i.e. nodes in a community are more densely connected than nodes in different communities;

– attributive homogeneity, i.e. nodes in a community have more similar attributes than nodes in different communities.

³ Communities may be overlapping if necessary, but here we focus on disjoint ones.

In the context of the weight-based fusion model, the community detection problem consists in the unsupervised division of $G_\alpha$ into $K$ disjoint communities $C_{k,\alpha} \subset V$, with $C_\alpha = \{C_{k,\alpha}\}_{k=1}^{K}$, such that $V = \bigcup_{k=1}^{K} C_{k,\alpha}$ and the nodes in each community $C_{k,\alpha}$ are structurally close and homogeneous in terms of attributes.

Since one deals with the weighted graph $G_\alpha$ within the weight-based fusion model, classical graph community detection methods can be applied to find $C_\alpha$. For example, a popular choice [2] is the Louvain algorithm [13], which aims at maximizing Newman’s Modularity [14], a measure of the divisibility of a graph into communities. In the context of the weight-based fusion model, the maximization of the Modularity of $G_\alpha$ is implicitly thought to provide structural closeness and attributive homogeneity also in $G$ (such heuristics are applied e.g. in [15, 16, 17, 18]). Our concern is that the previous works on the weight-based fusion model did not discuss how the Modularity of $G_\alpha$ (i.e. the objective function of the community detection optimization process corresponding to the weight-based fusion model) is connected, say, with the Modularities of $G_S$ and $G_A$.

Following this implicit assumption, the partition $C_\alpha$ maximizing the Modularity of $G_\alpha$ is used for measuring structural closeness and attributive homogeneity in the initial $G$. For example, $C_\alpha$ can be used for calculating the corresponding Modularity of $G_S$ (a popular measure of structural closeness) and the Entropy of subsets of the corresponding attributes in $A$ (a popular measure of attributive homogeneity) [2]. Thus one objective function is optimized to detect communities, but the quality of the communities obtained is evaluated by measures not explicitly related to that objective function. From the optimization theory point of view, this seems to be a logical gap that may cause misinterpretation of community detection results and difficulties with tuning the parameter $\alpha$ in the weight-based fusion model. To fill the gap and reveal the true mathematical machinery of the weight-based fusion model, we study in Section 2 how the Modularity of $G_\alpha$ (provided by $C_\alpha$) and the $C_\alpha$-based quality measures on $G$ relate to each other.

1.2 Precise definitions and reformulation of the node-attributed social network community detection problem

For the purposes of further analysis, we define the community detection quality measures that are often used for evaluating node-attributed social network community detection results within the weight-based fusion model, namely Modularity and Entropy, see e.g. [1]. The former works with graphs and the latter with sets of vectors. The community detection quality measures are usually chosen heuristically from the many possible ones, see [1, 2, 7].

Let $G' = (V, E, W)$ be an undirected simple weighted graph and $C$ its disjoint partition. The graph $G'$ is the structure of a node-attributed graph $G$. The Modularity of $G'$ for $C$ is as follows:

$$Q(G', C) = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta_{ij} \in \left[-\tfrac{1}{2}, 1\right], \quad i, j = 1, \dots, n, \tag{3}$$

where

– $A_{ij}$ is the edge weight between nodes $v_i$ and $v_j$;

– $k_i$ and $k_j$ are the weighted degrees of $v_i$ and $v_j$, respectively, i.e.

$$k_h = \sum_{l} w(e_{hl}), \quad \text{where } h \in \{i, j\}; \tag{4}$$

– $m$ is the sum of all edge weights in $G'$, i.e.

$$m = \sum_{e_{ij}\in E} w(e_{ij}); \tag{5}$$

– if $c_i$ and $c_j$ are the community labels of nodes $v_i$ and $v_j$, correspondingly, then

$$\delta_{ij} = \begin{cases} 1, & c_i = c_j, \\ 0, & \text{otherwise.} \end{cases}$$

The Modularity of $G'$ is then defined as

$$Q(G') = \max_{C} Q(G', C). \tag{6}$$
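To make definition (3) concrete, the following minimal Python sketch evaluates $Q(G', C)$ for a fixed partition. The representation of the graph as a symmetric matrix with zero diagonal, the function name `modularity` and the toy graph (two triangles joined by one edge) are illustrative assumptions; the sketch does not perform the maximization in (6).

```python
def modularity(weights, labels):
    """Modularity (3) of a weighted undirected graph for a given partition.

    weights: symmetric n x n list of lists with zero diagonal (A_ij in (3)).
    labels:  labels[i] is the community label c_i of node v_i.
    """
    n = len(weights)
    degrees = [sum(row) for row in weights]      # weighted degrees k_i, see (4)
    m = sum(degrees) / 2.0                       # sum of all edge weights, see (5)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:           # delta_ij = 1
                total += weights[i][j] - degrees[i] * degrees[j] / (2.0 * m)
    return total / (2.0 * m)


# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
n = 6
W = [[0.0] * n for _ in range(n)]
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i][j] = W[j][i] = 1.0

print(modularity(W, [0, 0, 0, 1, 1, 1]))   # 5/14, about 0.357
print(modularity(W, [0] * n))              # a single community gives 0 (up to rounding)
```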

The Entropy $H$ defined for a pair $G'' = (V, A)$ measures the degree of disorder of the attribute vectors $A$ within communities in the disjoint partition $C$ of $G''$. The pair $G''$ represents the attributes of a node-attributed graph $G$. To unify notation, we define the Entropy of $G''$ for $C$ for the case of binary $D$-dimensional vectors as follows:

$$H(G'', C) = \sum_{C_k \in C} \frac{|C_k|}{|V|} H(C_k) \in [0, 1], \qquad H(C_k) = -\sum_{d=1}^{D} \frac{\varphi(p_{k,d})}{D \ln 2}, \tag{7}$$

where $\varphi(x) = x \ln x + (1-x) \ln(1-x)$ and $p_{k,d}$ is the proportion of nodes in the community $C_k$ with the same value on the $d$th attribute.
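A short Python sketch of the Entropy (7) for binary attribute vectors may help to see how the measure behaves. It assumes that $p_{k,d}$ is taken as the share of nodes in $C_k$ whose $d$th attribute equals 1 (by the symmetry of $\varphi$, the share of zeros gives the same value) and uses the convention $0 \ln 0 = 0$; the function names and the toy data are hypothetical.

```python
import math


def phi(x):
    """phi(x) = x ln x + (1 - x) ln(1 - x), with the convention 0 ln 0 = 0."""
    return sum(t * math.log(t) for t in (x, 1.0 - x) if t > 0.0)


def entropy(attributes, labels):
    """Entropy (7) of binary attribute vectors for a given partition."""
    n = len(attributes)
    dim = len(attributes[0])
    communities = {}
    for vector, label in zip(attributes, labels):
        communities.setdefault(label, []).append(vector)
    result = 0.0
    for members in communities.values():
        size = len(members)
        h_k = 0.0
        for d in range(dim):
            p = sum(vector[d] for vector in members) / size   # share of ones
            h_k -= phi(p) / (dim * math.log(2))
        result += size / n * h_k
    return result


attrs = [(1, 0), (1, 0), (0, 1), (0, 1)]
print(entropy(attrs, [0, 0, 1, 1]))   # homogeneous communities: entropy 0
print(entropy(attrs, [0, 1, 0, 1]))   # maximally mixed communities: entropy 1 (up to rounding)
```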

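To deal with the ground truth environment, we also recall the definition of Normalized Mutual Information (NMI), a ground truth-based measure that is widely used to compare the quality of node-attributed social network community detection methods when “the true partition” $C'$ is known [1, 2]. We use the variant from [19]. Suppose that

$$C = \{C_1, C_2, \dots, C_K\} \quad \text{and} \quad C' = \{C'_1, C'_2, \dots, C'_{K'}\}$$

are two partitions of a graph $(V, E, W)$ with $|V| = n$. Let $\eta_{ij}$ be the number of nodes shared by the community $C_i$ of the partition $C$ and the community $C'_j$ of the partition $C'$, then

$$\mathrm{NMI}(C, C') = \frac{-2 \sum_{i=1}^{K} \sum_{j=1}^{K'} \eta_{ij} \log \dfrac{\eta_{ij}\, n}{\eta_{i\cdot}\, \eta_{\cdot j}}}{\sum_{i=1}^{K} \eta_{i\cdot} \log \dfrac{\eta_{i\cdot}}{n} + \sum_{j=1}^{K'} \eta_{\cdot j} \log \dfrac{\eta_{\cdot j}}{n}}, \tag{8}$$

where $\eta_{i\cdot} = \sum_{j=1}^{K'} \eta_{ij}$ and $\eta_{\cdot j} = \sum_{i=1}^{K} \eta_{ij}$.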
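Formula (8) can be evaluated directly from two label vectors, as in the following minimal Python sketch. The function name `nmi` is hypothetical, empty cells $\eta_{ij} = 0$ are skipped (they contribute nothing to the sum), and returning 1 when both partitions consist of a single community is an assumed convention, since (8) becomes $0/0$ in that case.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized Mutual Information (8) of two partitions given as label lists."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))   # eta_ij for non-empty intersections
    rows = Counter(labels_a)                   # eta_i. (community sizes in C)
    cols = Counter(labels_b)                   # eta_.j (community sizes in C')
    numerator = -2.0 * sum(
        eta * math.log(eta * n / (rows[a] * cols[b]))
        for (a, b), eta in joint.items()
    )
    denominator = sum(c * math.log(c / n) for c in rows.values()) + sum(
        c * math.log(c / n) for c in cols.values()
    )
    if denominator == 0.0:
        # Assumed convention: two trivial one-community partitions coincide.
        return 1.0
    return numerator / denominator


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))   # same partition, relabelled: 1 (up to rounding)
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))   # independent partitions: 0 (up to rounding)
```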

Thus the solution to the node-attributed social network community detection problem within the weight-based fusion model, in the terms presented, is as follows:

– one finds $C_\alpha$ maximizing the Modularity of $G_\alpha$,

– $C_\alpha$ is used to calculate $Q(G', C_\alpha)$ and $H(G'', C_\alpha)$, where $G'$ and $G''$ are the structure and the attributes of the node-attributed graph $G$, correspondingly.

This scheme is usually applied implicitly, and there are no theoretical studies explaining why the jump from $Q(G_\alpha, C_\alpha)$ to $Q(G', C_\alpha)$ and $H(G'', C_\alpha)$ is reasonable. Another issue is that Entropy, which deals with vectors, seems unnatural given the intuition behind the weight-based fusion model, which aims at representing $G$ in a unified graph form. NMI based on the network’s ground truth is also often proposed in the literature as a quality measure for community detection. However, it is an external quality measure, which is not always available for algorithm evaluation and can also be misleading, since the ground truth may not depend on the network’s properties.

We will discuss these issues further in Sections 2 and 6.

1.3 Overview of the existing weight-based fusion models for node-attributed social network community detection

The family of weight-based fusion models that are ideologically close to (2), and the corresponding weight-based fusion model-based algorithms, have been widely tested on synthetic and real-world node-attributed social networks and have shown their advantages over many other node-attributed social network community detection models, see [11, 15, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], etc., and the overview in [2]. In particular, the weight-based fusion model-based algorithms SAC2 [18] and CODICIL [24] are among the most influential ones for node-attributed social network community detection. Moreover, multiple experiments show that weight-based fusion model-based algorithms can be superior in community detection quality (in various terms) and efficiency to other highly influential algorithms for node-attributed social network community detection, e.g. to

– the node-augmented graph-based algorithms SA-Cluster [31] and Inc-Cluster [32, 33], as shown in [24, 25, 26, 27, 28, 34];

– the non-negative matrix factorization-based algorithm SCI [35], as shown in [30, 36];

– the probabilistic model-based algorithms PCL-DC [37], BAGC-GBAGC [38, 39] and CESNA [40], as shown in [24, 26, 28, 30, 34, 36].

The common part of the above-mentioned weight-based fusion models is (2), but the corresponding community detection and quality evaluation processes can be performed by different means.

Furthermore, there are many particular versions of (2) under different choices of the weighting functions $w_S$, $w_A$ and the fusion parameter $\alpha$ [2], but it seems to us that the most balanced one is that from [11, 29] with the following normalization:

$$w_S(e_{ij}) = \frac{\mu(e_{ij})}{\sum_{e_{ij}\in E} \mu(e_{ij})}, \qquad w_A(e_{ij}) = \frac{\nu(e_{ij})}{\sum_{e_{ij}\in E} \nu(e_{ij})}, \tag{9}$$

where $\mu$ and $\nu$ are structural and attributive weight functions, correspondingly. Among other things, it is shown in [29] that (2) with (9) produces normalized versions of many existing weight-based fusion models for different $\mu$, $\nu$ and $\alpha$.

leads to the weight-based fusion model in [20], Jaccard and/or Cosine Similarity as $\nu$ makes (2) analogous to the weight-based fusion model in [24].
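The normalization (9) can be sketched in Python as follows. The particular choices below, a 0/1 link indicator as $\mu$ and the simple matching coefficient as $\nu$ (the latter is mentioned in the caption of Figure 1.1), as well as all function names and the toy data, are assumptions made for illustration rather than the settings of any specific model from the overview.

```python
from itertools import combinations


def matching_coefficient(a, b):
    """Share of attribute positions on which two vectors agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def normalized_weights(n, weight_fn):
    """Apply (9): evaluate weight_fn on every edge of the complete graph
    on n nodes and divide by the total so that the weights sum to 1."""
    raw = {(i, j): weight_fn(i, j) for i, j in combinations(range(n), 2)}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all edge weights are zero, so (9) is undefined")
    return {edge: w / total for edge, w in raw.items()}


links = {(0, 1), (1, 2), (2, 3)}                     # toy social connections
attrs = [(1, 0), (1, 0), (0, 1), (0, 1)]             # toy binary attributes

w_s = normalized_weights(4, lambda i, j: 1.0 if (i, j) in links else 0.0)
w_a = normalized_weights(4, lambda i, j: matching_coefficient(attrs[i], attrs[j]))
print(w_s[(0, 1)], w_a[(0, 1)])
```

Both dictionaries then satisfy the condition (1) and can be fused by (2) for any $\alpha \in [0, 1]$.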

By using different heuristic thresholds for the weights in (2), one can decrease the computational complexity of community detection processes within the weight-based fusion models and obtain other versions of the weight-based fusion model, see [2].

As for $\alpha$ in (2), its tuning, or $\alpha$-tuning as we call it below, is difficult in general [1, 2]. As far as we know, there are no general $\alpha$-tuning schemes that provide the desired impact of the structure and the attributes on community detection results within the weight-based fusion model. Indeed, $\alpha$ is usually chosen manually, and such a choice may not be fully justified. For example, the weight-based fusion models in [15, 20, 21, 28] use $\alpha = 0$, while the ones in [18, 24, 27] set $\alpha = 0.5$ in experiments, with the intention of achieving the equal impact of the structure and the attributes. The absence of an explicit understanding of how to tune $\alpha$ within the weight-based fusion model makes the problem of reasonable $\alpha$-tuning important in the field.

What is more, we are unaware of any research devoted to the study of the objective function in the modularity-driven community detection process corresponding to the weight-based fusion model, perhaps besides the survey [2], where the problem is only stated.

2 The objective function of modularity-driven community detection process within the weight-based fusion model

We first substitute Entropy (7) by $Q(G_A, C)$, i.e. the Modularity of the attributive graph $G_A$ for a partition $C$. In these terms, if $C$ maximizes $Q(G_A, C)$, then the links between nodes in each community in $C$ have high attributive weights (minus the expected ones), i.e. the node attributes therein are homogeneous by construction. Additionally, the proposed measure works with graphs (as opposed to Entropy, which works with vectors) and naturally appears in the weight-based fusion model, as is seen from the theoretical results below. Let us also mention that the Modularity of the attributive graph $G_A$ turns out to be more informative than the corresponding Entropy, as the experiments in Section 6.5 show.

Now we turn to the problem of the connection between the Modularity $Q(G_\alpha)$ and the community detection quality measures (in our case, $Q(G_S, C_\alpha)$ and $Q(G_A, C_\alpha)$). The solution to the problem is wholly provided by the following theoretical results for a fixed $G$, where we assume that

$$k_h^{\star} = \sum_{l} w_{\star}(e_{hl}), \quad h \in \{i, j\}, \quad \star \in \{S, A\}.$$

Theorem 1. For any partition $C$, it holds that

$$Q(G_\alpha, C) = \alpha Q(G_S, C) + (1-\alpha) Q(G_A, C) + \alpha(1-\alpha) \Delta(G_S, G_A, C), \tag{10}$$

where $\Delta$ counts the difference of within-community node degrees in $G_S$ and $G_A$:

$$\Delta(G_S, G_A, C) = \frac{1}{4} \sum_{k=1}^{K} \Bigl( \sum_{v_i \in C_k} (k_i^S - k_i^A) \Bigr)^2 \ge 0. \tag{11}$$

What is more, the following inequality holds:

$$Q(G_\alpha, C) \ge \alpha Q(G_S, C) + (1-\alpha) Q(G_A, C), \tag{12}$$

and it is sharp for $\alpha = 0$ and $\alpha = 1$.

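Proof. Fix a partition $C$. We first rewrite the ingredients of (3) in terms of (2):

$$A_{ij} = \alpha w_S(e_{ij}) + (1-\alpha) w_A(e_{ij}), \qquad m = \sum_{e_{ij}\in E} w_\alpha(e_{ij}) = 1, \qquad k_h = \sum_{l} \bigl( \alpha w_S(e_{hl}) + (1-\alpha) w_A(e_{hl}) \bigr), \quad h \in \{i, j\}. \tag{13}$$

Furthermore,

$$k_i k_j = \bigl( \alpha k_i^S + (1-\alpha) k_i^A \bigr) \bigl( \alpha k_j^S + (1-\alpha) k_j^A \bigr) = \alpha^2 k_i^S k_j^S + \alpha(1-\alpha) \bigl( k_i^S k_j^A + k_i^A k_j^S \bigr) + (1-\alpha)^2 k_i^A k_j^A. \tag{14}$$

If one takes (13) and (14) into account, (3) can be rewritten in the form

$$\begin{aligned} Q(G_\alpha, C) ={} & \alpha \cdot \frac{1}{2} \sum_{ij} \Bigl( w_S(e_{ij}) - \frac{1}{2} \alpha k_i^S k_j^S \Bigr) \delta_{ij} + (1-\alpha) \cdot \frac{1}{2} \sum_{ij} \Bigl( w_A(e_{ij}) - \frac{1}{2} (1-\alpha) k_i^A k_j^A \Bigr) \delta_{ij} \\ & - \alpha(1-\alpha) \cdot \frac{1}{2} \sum_{ij} \frac{1}{2} \bigl( k_i^S k_j^A + k_i^A k_j^S \bigr) \delta_{ij}. \end{aligned}$$

Extracting $Q(G_S, C)$ and $Q(G_A, C)$ from this by (1) and (3) yields (10), where

$$\Delta(G_S, G_A, C) = \frac{1}{4} \sum_{ij} (k_i^S - k_i^A)(k_j^S - k_j^A) \delta_{ij}.$$

Furthermore, it can be easily seen that

$$\Delta(G_S, G_A, C) = \frac{1}{4} \sum_{k=1}^{K} \Bigl( \sum_{v_i \in C_k} (k_i^S - k_i^A) \Bigr) \Bigl( \sum_{v_j \in C_k} (k_j^S - k_j^A) \Bigr) = \frac{1}{4} \sum_{k=1}^{K} \Bigl( \sum_{v_i \in C_k} (k_i^S - k_i^A) \Bigr)^2,$$

where the last expression is clearly non-negative. This gives (11). Finally, (12) follows from (10) by (11). The sharpness of (12) immediately follows from (10).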
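The identity (10) and the inequality (12) can also be checked numerically. The following Python sketch does so on random normalized weights and a random partition; it is only a sanity check under the normalization (1), and all function names, the random seed and the sizes are arbitrary illustrative choices.

```python
import random


def modularity(weights, labels):
    """Modularity (3) for a symmetric weight matrix with zero diagonal."""
    n = len(weights)
    degrees = [sum(row) for row in weights]
    m = sum(degrees) / 2.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                total += weights[i][j] - degrees[i] * degrees[j] / (2.0 * m)
    return total / (2.0 * m)


def delta(w_s, w_a, labels):
    """Delta from (11): a quarter of the sum over communities of the squared
    within-community sums of the degree differences k_i^S - k_i^A."""
    sums = {}
    for row_s, row_a, label in zip(w_s, w_a, labels):
        sums[label] = sums.get(label, 0.0) + sum(row_s) - sum(row_a)
    return 0.25 * sum(s * s for s in sums.values())


def random_normalized_graph(n, rng):
    """Random symmetric weights on the complete graph, edge weights summing to 1."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = rng.random()
    total = sum(w[i][j] for i in range(n) for j in range(i + 1, n))
    return [[x / total for x in row] for row in w]


rng = random.Random(0)
n, alpha = 8, 0.35
w_s, w_a = random_normalized_graph(n, rng), random_normalized_graph(n, rng)
w_alpha = [[alpha * s + (1 - alpha) * a for s, a in zip(rs, ra)]
           for rs, ra in zip(w_s, w_a)]
labels = [rng.randrange(3) for _ in range(n)]

q_s, q_a = modularity(w_s, labels), modularity(w_a, labels)
lhs = modularity(w_alpha, labels)
rhs = alpha * q_s + (1 - alpha) * q_a + alpha * (1 - alpha) * delta(w_s, w_a, labels)
print(abs(lhs - rhs) < 1e-12)                              # identity (10)
print(lhs >= alpha * q_s + (1 - alpha) * q_a - 1e-12)      # inequality (12)
```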

From a more general point of view, Theorem 1 connects the Modularities of two graphs with the Modularity of the graph whose weights are linear combinations of the weights of the two graphs. We consider it a key result for the analysis of modularity-driven models for node-attributed social network community detection in our future research, see Section 7.3.

We continue by introducing additional notation for the case where $C = C_\alpha$ in Theorem 1:

$$\begin{aligned} Q(G_\alpha) =: Q^{\alpha}_{com} &= Q^{\alpha}_{str} + Q^{\alpha}_{attr} + Q^{\alpha}_{dif}, \\ Q^{\alpha}_{str} &:= \alpha Q(G_S, C_\alpha), \\ Q^{\alpha}_{attr} &:= (1-\alpha) Q(G_A, C_\alpha), \\ Q^{\alpha}_{dif} &:= \alpha(1-\alpha) \Delta(G_S, G_A, C_\alpha), \end{aligned} \tag{15}$$

where we call

– $Q^{\alpha}_{com}$ Composite Modularity,

– $Q^{\alpha}_{str}$ Structural Modularity,

– $Q^{\alpha}_{attr}$ Attributive Modularity,

– $Q^{\alpha}_{dif}$ Differential Modularity.

Note that the Differential Modularity does not have the meaning of a Modularity itself, but we call it so for uniformity.

Thus, in the terms introduced, within the weight-based fusion model one maximizes the Composite Modularity $Q^{\alpha}_{com}$, which consists not only of the two components used for quality evaluation (the Structural Modularity $Q^{\alpha}_{str}$ and the Attributive Modularity $Q^{\alpha}_{attr}$) but also of the additional non-negative Differential Modularity $Q^{\alpha}_{dif}$, which counts the difference of within-community node degrees in $G_S$ and $G_A$.

Remark 1. According to our experimental results in Section 6 and the intuition behind the form of $Q^{\alpha}_{dif}$ in Theorem 1, the Differential Modularity vanishes for many synthetic and real-world node-attributed social networks for $\alpha \in [0, 1]$. For this reason, we assume in the next section that $Q^{\alpha}_{dif}$ is small enough with respect to the sum $Q^{\alpha}_{str} + Q^{\alpha}_{attr}$.

3 Parameter tuning scheme

We now propose a simple non-manual $\alpha$-tuning scheme such that the impact of structure and attributes on the community detection results is well-controlled within the weight-based fusion model. Since our terms are unified for both components (in the sense that we use Modularity for both), it is justified to define $\alpha = \alpha^{*}_{\rho}$ satisfying

$$\begin{aligned} Q^{\alpha}_{str} &= \rho\, (Q^{\alpha}_{str} + Q^{\alpha}_{attr}), \\ Q^{\alpha}_{attr} &= (1-\rho)\, (Q^{\alpha}_{str} + Q^{\alpha}_{attr}) \end{aligned} \qquad \text{for some } \rho \in [0, 1], \tag{16}$$

as that providing $100\rho\%$ of $Q^{\alpha}_{str}$ and $100(1-\rho)\%$ of $Q^{\alpha}_{attr}$ in the sum $Q^{\alpha}_{str} + Q^{\alpha}_{attr}$, which is the major part of $Q^{\alpha}_{com}$ according to Theorem 1 under the assumption that $Q^{\alpha}_{dif}$ is vanishing. Equivalently, this $\alpha = \alpha^{*}_{\rho}$ is a solution to the equation

$$(1-\rho)\, Q^{\alpha}_{str} = \rho\, Q^{\alpha}_{attr}. \tag{17}$$

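We first prove the following statement.

Theorem 2. Let $Q^{\alpha}_{str} > 0$ and $Q^{\alpha}_{attr} > 0$ for any $\alpha \in [0, 1]$. If it holds that

$$|Q(G_S, C_\alpha) - Q(G_S)| \le \epsilon\, Q(G_S), \qquad |Q(G_A, C_\alpha) - Q(G_A)| \le \epsilon\, Q(G_A), \tag{18}$$

for some $\epsilon$ such that $0 \le \epsilon \ll 1$, then $\alpha^{*}_{\rho}$ satisfies the inequalities

$$\frac{1-\epsilon}{1+\epsilon} \le \alpha^{*}_{\rho} \cdot \frac{(1-\rho) Q(G_S) + \rho\, Q(G_A)}{\rho\, Q(G_A)} \le \frac{1+\epsilon}{1-\epsilon}. \tag{19}$$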

Proof. We rewrite (17) by (15) as

$$\alpha = \frac{\rho\, Q(G_A, C_\alpha)}{(1-\rho)\, Q(G_S, C_\alpha) + \rho\, Q(G_A, C_\alpha)}.$$

This and the conditions (18) immediately imply that (19) holds for $\alpha$ instead of $\alpha^{*}_{\rho}$. Furthermore, the conditions (18) guarantee that $Q^{\alpha}_{str}$ and $Q^{\alpha}_{attr}$ are well-approximated uniformly for any $\alpha \in [0, 1]$ by $\alpha Q(G_S)$ and $(1-\alpha) Q(G_A)$, correspondingly. These facts imply that (19) holds in particular for $\alpha = \alpha^{*}_{\rho}$.

As a result, Theorem 2 yields that for a sufficiently small $\epsilon$ one can take

$$\tilde{\alpha}_{\rho} = \frac{\rho\, Q(G_A)}{(1-\rho)\, Q(G_S) + \rho\, Q(G_A)} \tag{20}$$

as a good approximation for $\alpha^{*}_{\rho}$ providing the above-mentioned impact of structure and attributes on the community detection results. What is more, our experiments in Section 6.4 suggest that this is indeed so. It is interesting that applying (20) requires only the Modularity values $Q(G_S)$ and $Q(G_A)$.

The proposed $\alpha$-tuning scheme is, to our knowledge, the first non-manual one providing the required impact of the components within the weight-based fusion model. A particular case of Theorem 2 for $\rho = \frac{1}{2}$, providing the equal impact of structure and attributes, was previously obtained in [12].
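Formula (20) is simple enough to be written as a one-line function. The Python sketch below is illustrative: the function name `alpha_tilde`, the input checks and the numerical values of $Q(G_S)$ and $Q(G_A)$ are assumptions, and in practice the two Modularity values would come from a Modularity maximization algorithm.

```python
def alpha_tilde(rho, q_s, q_a):
    """Approximate fusion parameter (20) for the desired structural share rho.

    rho: desired share of the Structural Modularity in Q_str + Q_attr.
    q_s: Modularity Q(G_S) of the structural graph.
    q_a: Modularity Q(G_A) of the attributive graph.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    if q_s <= 0.0 or q_a <= 0.0:
        # Assumed guard: the derivation of (20) relies on positive values.
        raise ValueError("formula (20) assumes positive Modularity values")
    return rho * q_a / ((1.0 - rho) * q_s + rho * q_a)


# Hypothetical values: the structure is three times "more modular" than the attributes.
q_s, q_a = 0.6, 0.2
alpha = alpha_tilde(0.5, q_s, q_a)
# With this alpha the two approximated components alpha * Q(G_S) and
# (1 - alpha) * Q(G_A) are balanced, as required by (17) for rho = 1/2.
print(alpha, alpha * q_s, (1 - alpha) * q_a)
```

Note that a larger $Q(G_S)$ relative to $Q(G_A)$ pushes $\tilde{\alpha}_{\rho}$ below $\rho$, i.e. the structure needs a smaller weight to reach the same share.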

Remark 2. It seems that (16) and (17) can be rewritten for an $\alpha$-weighted version of Entropy (7) instead of $Q^{\alpha}_{attr}$, so that one can obtain a similar $\alpha$-tuning scheme in these settings, too. However, in this case one in fact uses non-unified (and possibly non-comparable) terms for evaluating node-attributed social network community detection quality with respect to the structure and the attributes (Modularity and Entropy), and thus the corresponding results may be misleading. In Section 6.5, we will also perform several experiments showing the difference in the behaviour of the Attributive Modularity and the Entropy and the presence of certain problems related to the Entropy.

4 Composite Modularity for almost directly proportional edge weights in the structural and attributive graphs

Based on Theorem 1, we now study the case where the corresponding edge weights in the structural and attributive graphs $G_S = (V, E, W_S)$ and $G_A = (V, E, W_A)$ are almost directly proportional with a positive coefficient (which clearly implies a strong positive correlation of the edge weights in the sense of Pearson or Spearman). We use the notation introduced in Sections 1.1 and 2. Let us start with the following observation. Suppose that for some fixed $a > 0$ it holds that

$$w_A(e_{ij}) = a\, w_S(e_{ij}), \quad e_{ij} \in E. \tag{21}$$

Under the condition (1), we get

$$1 = \sum_{e_{ij}\in E} w_A(e_{ij}) = a \sum_{e_{ij}\in E} w_S(e_{ij}) = a,$$

i.e. necessarily $a = 1$ in (21). Taking this into account, we prove the following result.

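Theorem 3. Let $G_S$ and $G_A$ in Theorem 1 be such that

$$0 \le w_A(e_{ij}) = w_S(e_{ij}) + \epsilon_{ij} \quad \text{for all } e_{ij} \in E \text{ and some } \epsilon_{ij} \in [-1, 1]. \tag{22}$$

If $\sum_{ij} |\epsilon_{ij}| \le \frac{\epsilon}{2} \le 1$, then for any $\alpha \in [0, 1]$ and any $C$

$$|Q(G_\alpha, C) - Q(G_S, C)| \le \epsilon. \tag{23}$$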

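Proof. We first rewrite the ingredients of (3) in terms of (22):

$$\begin{aligned} A_{ij} &= \alpha w_S(e_{ij}) + (1-\alpha) w_A(e_{ij}) = \alpha w_S(e_{ij}) + w_S(e_{ij}) + \epsilon_{ij} - \alpha w_S(e_{ij}) - \alpha \epsilon_{ij} = w_S(e_{ij}) + (1-\alpha) \epsilon_{ij}; \\ k_h &= \sum_{l} \bigl( \alpha w_S(e_{hl}) + (1-\alpha) w_A(e_{hl}) \bigr) = \sum_{l} \bigl( w_S(e_{hl}) + (1-\alpha) \epsilon_{hl} \bigr) = k_h^S + (1-\alpha) \sum_{l} \epsilon_{hl}. \end{aligned}$$

Furthermore,

$$k_i k_j = \Bigl( k_i^S + (1-\alpha) \sum_{l} \epsilon_{il} \Bigr) \Bigl( k_j^S + (1-\alpha) \sum_{l} \epsilon_{jl} \Bigr) = k_i^S k_j^S + (1-\alpha) \Bigl( k_i^S \sum_{l} \epsilon_{jl} + k_j^S \sum_{l} \epsilon_{il} \Bigr) + (1-\alpha)^2 \sum_{l} \epsilon_{il} \sum_{l} \epsilon_{jl}.$$

In these terms, we deduce from (3) for a fixed $C$ that

$$Q(G_\alpha, C) = Q(G_S, C) + I(\alpha, \epsilon, C),$$

where

$$I(\alpha, \epsilon, C) := \frac{1}{2} (1-\alpha) \sum_{ij} \Bigl[ \epsilon_{ij} - \frac{1}{2} \Bigl( k_i^S \sum_{l} \epsilon_{jl} + k_j^S \sum_{l} \epsilon_{il} + (1-\alpha) \sum_{l} \epsilon_{il} \sum_{l} \epsilon_{jl} \Bigr) \Bigr] \delta_{ij}.$$

Now we estimate $|I(\alpha, \epsilon, C)|$. Firstly, by (1) and (4),

$$\Bigl| \sum_{ij} \Bigl( k_i^S \sum_{l} \epsilon_{jl} + k_j^S \sum_{l} \epsilon_{il} \Bigr) \delta_{ij} \Bigr| \le \sum_{i} k_i^S \sum_{jl} |\epsilon_{jl}| + \sum_{j} k_j^S \sum_{il} |\epsilon_{il}| = 4 \sum_{ij} |\epsilon_{ij}|.$$

Secondly,

$$\Bigl| \sum_{ij} \sum_{l} \epsilon_{il} \sum_{l} \epsilon_{jl}\, \delta_{ij} \Bigr| \le \sum_{il} |\epsilon_{il}| \sum_{jl} |\epsilon_{jl}| = \Bigl( \sum_{ij} |\epsilon_{ij}| \Bigr)^2.$$

Consequently,

$$|I(\alpha, \epsilon, C)| \le \frac{1}{2} (1-\alpha) \Bigl[ \sum_{ij} |\epsilon_{ij}| + \frac{1}{2} \Bigl( 4 \sum_{ij} |\epsilon_{ij}| + (1-\alpha) \Bigl( \sum_{ij} |\epsilon_{ij}| \Bigr)^2 \Bigr) \Bigr] \le \frac{1}{2} \sum_{ij} |\epsilon_{ij}| \Bigl( 3 + \sum_{ij} |\epsilon_{ij}| \Bigr).$$

This yields that the following implication holds true:

$$\sum_{ij} |\epsilon_{ij}| \le \frac{\epsilon}{2} \le 1 \quad \Longrightarrow \quad |I(\alpha, \epsilon, C)| \le \epsilon.$$

Consequently, it holds that

$$Q(G_\alpha, C) = Q(G_S, C) + I(\alpha, \epsilon, C), \qquad |I(\alpha, \epsilon, C)| \le \epsilon,$$

and this gives (23).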

Theorem 3 states that if the corresponding edge weights of GS and GA are

almost directly proportional with a positive coefficient, then the Composite Mod­

ularity Q(Gα, C) is almost independent of α. In this case, GS and GA almost du­

plicate each other so that it may be reasonable to use only one of the components for community detection, i.e. come back to the classical community detection set­ tings based either on the structure or the attributes. In this sense Theorem 3 partly explains a similar effect observed in [11] in experiments (although for the Entropy instead of the Attributive Modularity).

Remark 3. Note that the assumptions of Theorem 3 also ensure the good quality (in the sense of small $\epsilon$) of the $\alpha$-tuning scheme in Section 3 in terms of Theorem 2.

From a more general point of view, $G_\alpha$ in Theorem 3 can be considered as a small variation of $G_S$ for sufficiently small $\epsilon$. Thus, the Modularity of a graph changes only slightly under a small variation of its edge weights. This seems rather intuitive, but we are unaware of any corresponding quantitative estimates such as the one in Theorem 3.

In Section 6 we will also discuss the strong linear relationship between the structure and the attributes of $G$ in the following sense: in the weight-based fusion model settings, we will provide a counterexample to the well-known node-attributed social network feature selection principle, which presumes that only the attributes "correlated with" or "tightly related to" the structure should be used in the community detection process in order to obtain "better community detection results".

5 Well-tunable Leiden-based weight-based fusion model for node-attributed social network community detection

Based on our theoretical results, in particular on Theorems 1 and 2, we now propose a well-tunable Leiden-based algorithm implementing the weight-based fusion model (2) under the normalization (9), see Algorithm in Table 5.1.

The input of the algorithm is a node-attributed graph $G$ (see Section 1.1), the chosen structural and attributive weight functions $\mu$ and $\nu$ in (9), and the value of $\rho$ in (16) and (17), indicating the desired impact of the structure and the attributes on the community detection results within the weight-based fusion model. Within the algorithm, $G$ is first converted into⁴ $G_S$ and $G_A$ according to (9). After this, the Modularity values $Q(G_S)$ and $Q(G_A)$ are found by the Leiden algorithm [41]. For the chosen $\rho$, the values of $Q(G_S)$ and $Q(G_A)$ are then used to calculate $\tilde{\alpha}_\rho$ in (20) as an approximant for $\alpha^*_\rho$ in (16) and (17). The value $\tilde{\alpha}_\rho$ is used in (2) under the normalization (9) to obtain $G_{\alpha^*_\rho}$ with the edge weights (2). Finally, $C_{\alpha^*_\rho}$ and all the corresponding Modularity values in (15) are found by Leiden and output by the algorithm.

⁴ As before, G

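Table 5.1 – Leiden-based weight-based fusion model Algorithm for node-attributed social network community detection

Procedure: Weight-based fusion model ($G$, $\mu$, $\nu$, $\rho$)

$G$ is a node-attributed graph, $\mu$ is a structural weight function, $\nu$ is an attributive weight function, $\rho$ is a value indicating the desired impact of the structure and the attributes on community detection results

    $G_S \leftarrow \mu(G)$
    $G_A \leftarrow \nu(G)$
    $C_S \leftarrow \mathrm{Leiden}(G_S)$
    $C_A \leftarrow \mathrm{Leiden}(G_A)$
    $\alpha \leftarrow \dfrac{\rho\, Q(G_A, C_A)}{(1-\rho)\, Q(G_S, C_S) + \rho\, Q(G_A, C_A)}$
    $G_\alpha \leftarrow \mathrm{FuseGraphs}(G_S, G_A, \alpha)$
    $C_\alpha \leftarrow \mathrm{Leiden}(G_\alpha)$
    $Q^\alpha_{\mathrm{str}} \leftarrow \alpha\, Q(G_S, C_\alpha)$
    $Q^\alpha_{\mathrm{attr}} \leftarrow (1-\alpha)\, Q(G_A, C_\alpha)$
    $Q^\alpha_{\mathrm{diff}} \leftarrow \alpha (1-\alpha)\, \Delta(G_S, G_A, C_\alpha)$
    $Q^\alpha_{\mathrm{com}} \leftarrow Q^\alpha_{\mathrm{str}} + Q^\alpha_{\mathrm{attr}} + Q^\alpha_{\mathrm{diff}}$
    return $C_\alpha$, $Q^\alpha_{\mathrm{com}}$, $Q^\alpha_{\mathrm{str}}$, $Q^\alpha_{\mathrm{attr}}$, $Q^\alpha_{\mathrm{diff}}$

Procedure FuseGraphs($G_S$, $G_A$, $\alpha$)

$G_S$ is a structural graph with the edge weights $w_S(e_{ij})$, $G_A$ is an attributive graph with the edge weights $w_A(e_{ij})$, $\alpha$ is a fusion parameter

    $\Sigma_S \leftarrow \sum_{e_{ij} \in E} w_S(e_{ij})$
    $\Sigma_A \leftarrow \sum_{e_{ij} \in E} w_A(e_{ij})$
    foreach $e_{ij} \in E$ do
        $w_\alpha(e_{ij}) \leftarrow \alpha\, w_S(e_{ij}) / \Sigma_S + (1-\alpha)\, w_A(e_{ij}) / \Sigma_A$

$w_\alpha(e_{ij})$ are the edge weights of $G_\alpha$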
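The two non-Leiden steps of the procedure can be sketched in a few lines. This is an illustrative sketch only: the function names `alpha_tilde` and `fuse_graphs`, the dictionary representation of edge weights and the toy numbers are assumptions, and the Leiden calls that would supply the Modularity values are deliberately left out.

```python
def alpha_tilde(rho, q_s, q_a):
    """Fusion parameter from the Table 5.1 step:
    rho * Q(G_A, C_A) / ((1 - rho) * Q(G_S, C_S) + rho * Q(G_A, C_A))."""
    denom = (1 - rho) * q_s + rho * q_a
    if denom == 0:
        raise ValueError("both Modularity values are zero; alpha is undefined")
    return rho * q_a / denom

def fuse_graphs(w_s, w_a, alpha):
    """FuseGraphs step: convex combination of the sum-normalized edge weights.
    w_s and w_a map an edge (i, j) to its weight; missing edges count as 0."""
    sigma_s = sum(w_s.values())
    sigma_a = sum(w_a.values())
    edges = set(w_s) | set(w_a)
    return {e: alpha * w_s.get(e, 0.0) / sigma_s
               + (1 - alpha) * w_a.get(e, 0.0) / sigma_a
            for e in edges}

# Toy input (hypothetical values): Q(G_S) = 0.6 and Q(G_A) = 0.2, equal impact rho = 0.5.
alpha = alpha_tilde(0.5, 0.6, 0.2)
w_s = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 0.0}
w_a = {(0, 1): 0.9, (1, 2): 0.1, (0, 2): 0.5}
w_alpha = fuse_graphs(w_s, w_a, alpha)
print(alpha, sum(w_alpha.values()))
```

Because each component is divided by its own total weight, the fused edge weights sum to 1 for every $\alpha$, which is what makes the Modularity values of $G_S$, $G_A$ and $G_\alpha$ directly comparable.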


We use the novel Leiden algorithm to optimize Modularity in Algorithm in Table 5.1 because, according to [41], it outperforms the widely used Louvain algorithm [13] in both community detection quality and execution time. Recall that Louvain is a popular choice in modularity-driven node-attributed social network community detection methods [2] and is state-of-the-art for Modularity optimization [42]. In particular, it is guaranteed [41] that Leiden always yields communities that are connected (in contrast to Louvain) and all subsets of which are locally optimally assigned. (Note that we used Louvain in the conference proceedings paper [12], which contains results preliminary to those in the current study.)

The precise parameters and other details related to the versions of Leiden and Louvain used in our experiments will be indicated in Section 6.1.

As for the complexity of Algorithm in Table 5.1, the step where $G$ is converted into $G_S$ and $G_A$ (recall that $G_S$ and $G_A$ are complete in our settings) is $O(n^2)$, while the Leiden-based optimization step seems to be at most $O(n \log n)$. Here we recall the observation in [41] that Leiden is faster than Louvain, whose complexity is believed to be $O(n \log n)$. The complexity of weight-based fusion models can possibly be decreased by applying different thresholds for the weights in (2) and for the number of edges of $G$ involved, see the tables in [2].

6 Experiments

6.1 Data, code and implementation settings

Before describing our experimental results, we note that the publicly available datasets and the implementations used in our experiments are presented on GitHub⁵.

For synthetic graph generation and graph representation we use networkx 2.5⁶ with the parameters declared in the experiments in Section 6.2. The real-world datasets used and their sources are described in Section 6.3. For Modularity optimization we rely on the following descriptions and implementations:

– Leiden [41], leidenalg 0.8.3⁷ for weighted graphs with Modularity Vertex Partition as the quality function and the maximum possible number of iterations;

– Louvain [13], python-louvain 0.14⁸ for weighted graphs with the resolution parameter equal to 1.

In what follows, we use Leiden for Modularity optimization, with the only exception of the experiments in Section 6.6, where Louvain is used as a competitor to Leiden.

Since different runs of Leiden and Louvain may lead to different communities, we average the results over 5 runs and indicate the corresponding standard deviation in the plots.

6.2 Synthetic node-attributed network datasets

Now we experimentally study the behaviour of the Modularities in (15) and illustrate Theorems 1 and 3. Since the theorems are stated for two graphs (not for a node-attributed graph), in some experiments below we initially generate two graphs, $G_S$ and $G_A$, instead of generating a node-attributed graph $G$ and then converting it into $G_S$ and $G_A$.

⁵ https://github.com/TimaGradov/Leiden­weight­based fusion model
⁶ https://networkx.org/documentation/stable/index.html
⁷ https://leidenalg.readthedocs.io/en/stable/index.html
⁸ https://github.com/taynaud/python-louvain

Following Algorithm in Table 5.1, we use the weight-based fusion model (2) with the normalization (9) and detect communities in $G_\alpha$ by Leiden. However, for illustrative purposes our fusion parameter $\alpha$ runs from 0 to 1 with step 0.05 in the forthcoming experiments, while Algorithm in Table 5.1 works only with $\alpha = \tilde{\alpha}_\rho$.

Graphs $G_S$ and $G_A$ are complete and weighted according to the definitions in Section 1.1. Some edge weights may be zero, which should be understood as the absence of a structural or attributive link between the corresponding nodes. Below, if we say that a (complete) graph has $M$ edges with non-zero weights, then the edge weights in the graph are set to zero for all edges except for $M$ of them.

The precise parameters of the experiments necessary for reproducing the results are indicated in the figure captions.

6.2.1 Barabási-Albert and Erdős-Rényi graph models

Here we generate random graphs $G_S$ and $G_A$ by the well-known Erdős-Rényi (ER) and Barabási-Albert (BA) models. Since Theorem 1 does not impose any specific assumptions on $G_S$ and $G_A$, the chosen models are quite suitable for illustrative purposes (although they may not produce strong community structure). The models are chosen because they generate graphs with different node degree distributions (recall (11) in Theorem 1) and thus may influence the behaviour of the Differential Modularity $Q^\alpha_{\mathrm{diff}}$. What is more, we generate $G_S$ and $G_A$ in the pairs ER+ER, ER+BA and BA+BA. In each pair we consider the following two cases: (a) when $Q(G_S)$ and $Q(G_A)$ are almost equal and (b) when one of $Q(G_S)$ and $Q(G_A)$ is greater than the other. These proportions can be achieved by varying the number of edges with equal non-zero weights in $G_S$ and $G_A$.

The parameters of the BA model are $m_0 \ge 1$, the number of nodes in the initial graph, the final number of nodes $n$, and $m$, the number of edges that every new node creates; the parameters of the ER model are the number of nodes $n$ and the probability $p$ of edge creation between every pair of nodes. For more details, we refer the reader e.g. to [43]. For implementation, we use the networkx 2.5 Barabási-Albert graph⁹, where $m_0 = m$, and the networkx 2.5 $G_{n,p}$ random graph¹⁰.

It turns out that the results in each pair ER+ER, ER+BA and BA+BA for the chosen parameters are very similar qualitatively (and even quantitatively). For this reason we provide and analyze only the results for the pair ER+BA, see Figure 6.1. First note that in both cases $Q^\alpha_{\mathrm{diff}}$ vanishes for all $\alpha \in [0, 1]$. Thus even the difference in degree distributions does not make its values large. It can also be observed that the intersection point of the plots of the Structural and Attributive Modularities $Q^\alpha_{\mathrm{str}}$ and $Q^\alpha_{\mathrm{attr}}$ (corresponding to $\alpha^*_{0.5}$ that makes (17) valid for $\rho = 0.5$) is closer to $\alpha = 0$ when $Q(G_S) > Q(G_A)$, see Figure 6.1 (b). In the opposite case, it is close to $\alpha = 1$ (not shown), and it is close to $\alpha = 0.5$ when the Modularities are almost equal, see Figure 6.1 (a).

Note that the Modularity values in Figure 6.1 are rather small, indicating weak community structure. To address this, we now perform experiments with the Lancichinetti-Fortunato-Radicchi (LFR) graph model [44], which generates graphs with a controlled community structure and is often used to compare different community detection methods [26, 42].

⁹ https://networkx.org/documentation/stable/index.html
¹⁰ https://networkx.org/documentation/stable/index.html

[Figure 6.1: panels (a) and (b); horizontal axis from 0 to 1 (the fusion parameter $\alpha$), vertical axis Modularity; curves: Composite, Structural, Attributive, Differential]

Figure 6.1 – Modularities in the graph pair ER+BA: (a) $Q(G_S) \approx Q(G_A)$ (ER-based $G_S$ with 500 nodes and 6,210 edges with equal non-zero weights, $p = 0.05$; BA-based $G_A$ with 500 nodes and 6,331 edges with equal non-zero weights, $m = 13$), (b) $Q(G_S) > Q(G_A)$ (ER-based $G_S$ with 500 nodes and 24,886 edges with equal non-zero weights, $p = 0.2$; BA-based $G_A$ with 500 nodes and 6,331 edges with equal non-zero weights, $m = 13$)
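The edge counts quoted in the caption of Figure 6.1 can be cross-checked with elementary arithmetic. The sketch below assumes that the $G_{n,p}$ model draws each of the $n(n-1)/2$ node pairs independently with probability $p$, and that the BA generator attaches each of the $n - m$ added nodes with exactly $m$ edges; both are assumptions about the generators, not statements taken from the thesis code.

```python
from math import sqrt

n = 500
pairs = n * (n - 1) // 2                 # number of node pairs in a graph on n nodes

def er_expected_edges(p):
    """Mean and standard deviation of the edge count of a G(n, p) graph."""
    return p * pairs, sqrt(pairs * p * (1 - p))

mean_005, std_005 = er_expected_edges(0.05)
mean_02, std_02 = er_expected_edges(0.2)

m = 13
ba_edges = (n - m) * m                   # each added node brings m edges

print(f"ER, p = 0.05: expected {mean_005:.1f} +/- {std_005:.1f} edges (caption: 6,210)")
print(f"ER, p = 0.2:  expected {mean_02:.1f} +/- {std_02:.1f} edges (caption: 24,886)")
print(f"BA, m = 13:   {ba_edges} edges (caption: 6,331)")
```

Under these assumptions the BA count matches the caption exactly, and both reported ER counts lie within one standard deviation of their expected values.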

6.2.2 Lancichinetti-Fortunato-Radicchi benchmark

The LFR model generates graphs according to the chosen parameters controlling the heterogeneity in the distributions of node degree and community size [44]. Namely, the parameters are: the number of vertices $n$, the mixing parameter $\mu$, the average degree of vertices $\langle k \rangle$, the maximum degree of vertices $k_{\max}$, the minimum degree of vertices $k_{\min}$, the minimum community size $c_{\min}$, the maximum community size $c_{\max}$, and the exponents $\tau_1$ and $\tau_2$ of the power-law distributions of node degree and community size, respectively. The parameter $\mu \in [0, 1]$ controls the strength of community structure: the smaller $\mu$, the stronger the community structure.

For implementation, we use the networkx 2.5 LFR benchmark graph¹¹, where the following parameters should be specified: $n$, $\langle k \rangle$, $k_{\max}$, $\mu$, $\tau_1$, $\tau_2$. One does not need to specify $k_{\min}$ in this version if $\langle k \rangle$ is specified, and vice versa. Furthermore, one does not need to specify $c_{\min}$ and $c_{\max}$ there, as $c_{\min}$ is set equal to $k_{\min}$ and $c_{\max}$ to $n$.

We set $\tau_1 = 3$ and $\tau_2 = 2$ in all the experiments with LFR-based graphs below.

The purpose of the forthcoming four experiments is to show how differently the Modularities in (15) can behave.

First, we generate two different LFR-based graphs $G_S$ and $G_A$ with $n = 3,000$, $\langle k \rangle = 20$, $k_{\max} = 150$ and $\mu = 0.2$. The corresponding results are in Figure 6.2 (a). While $G_S$ and $G_A$ are different, we have $Q(G_S) \approx Q(G_A)$. The behaviour of the Modularities is similar to that in Figure 6.1 (a); however, the switch around $\alpha = 0.5$ is more dramatic, so that the Attributive Modularity $Q^\alpha_{\mathrm{attr}}$ for $\alpha > 0.5$ and the Structural Modularity $Q^\alpha_{\mathrm{str}}$ for $\alpha < 0.5$ vanish. This can be explained by the strong community structure of both $G_S$ and $G_A$, as $\mu = 0.2$. Note that the Differential Modularity $Q^\alpha_{\mathrm{diff}}$ also vanishes for all $\alpha \in [0, 1]$ here.

Now $G_S$ and $G_A$ are generated with the same set of parameters as before, with the only difference that $\mu = 0.8$ for $G_A$, i.e. the community structure of $G_A$ is weak. In this case (see Figure 6.2 (b)) one has $Q(G_S) > Q(G_A)$ as in the experiment in Figure 6.1 (b). In general, the qualitative behaviour of the Modularities in Figure 6.1 (b) and Figure 6.2 (b) is very similar. The difference is quantitative: the LFR model produces graphs with higher Modularity values than the ER and BA models. Again, $Q^\alpha_{\mathrm{diff}} \approx 0$ for all $\alpha \in [0, 1]$ in Figure 6.2 (b).

Now we want to experimentally study the maximum possible Composite Modularity value for $\alpha$ not equal to 0 or 1. For this purpose, we initially generate a graph $\tilde{G}$ with $n = 3,000$ nodes and 748,500 edges with equal non-zero weights, consisting of 6 equal fully connected components (with 500 nodes each). Then $G_S$ and $G_A$, each with $n = 3,000$ nodes and 374,250 edges with equal non-zero weights, are obtained from $\tilde{G}$ by randomly assigning one half of the edges of $\tilde{G}$ to $G_S$ and the other half to $G_A$. By construction, the maximum Composite Modularity value should be attained at $\alpha = 0.5$, as $G_\alpha \equiv \tilde{G}$ there. This is indeed so, see Figure 6.3 (c). Note that $Q^\alpha_{\mathrm{diff}}$ vanishes in this case for all $\alpha \in [0, 1]$, too. What is more, the maximum Composite Modularity value is $\approx 0.5\,(Q(G_S) + Q(G_A))$ here, as is prescribed by Theorem 1 and (15) for $\alpha = 0.5$ and $Q^{0.5}_{\mathrm{diff}} \approx 0$. For future reference, Pearson's correlation coefficient for the corresponding edge weights of $G_S$ and $G_A$ is $\approx -0.09$ here.
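The construction above can be replayed on a small scale. The sketch below is only illustrative: it uses 6 components of 5 nodes instead of 500, a fixed random seed, and a hypothetical `fuse` helper that applies the sum normalization before mixing. It checks that at $\alpha = 0.5$ the fused graph has equal weights on all edges of $\tilde{G}$, and it reproduces the edge counts stated in the text.

```python
import random
from itertools import combinations

def clique_union(num_components, size):
    """Edges of a graph made of disjoint fully connected components."""
    edges = []
    for c in range(num_components):
        edges.extend(combinations(range(c * size, (c + 1) * size), 2))
    return edges

def fuse(w_s, w_a, alpha):
    """Mix two sum-normalized edge-weight dictionaries with fusion parameter alpha."""
    sigma_s, sigma_a = sum(w_s.values()), sum(w_a.values())
    return {e: alpha * w_s.get(e, 0.0) / sigma_s + (1 - alpha) * w_a.get(e, 0.0) / sigma_a
            for e in set(w_s) | set(w_a)}

# Edge counts for the sizes used in the text (6 components with 500 nodes each).
full_edges = 6 * (500 * 499 // 2)
half_edges = full_edges // 2

# Small-scale replay: 6 components with 5 nodes each.
edges = clique_union(6, 5)
random.Random(1).shuffle(edges)
half = len(edges) // 2
w_s = {e: 1.0 for e in edges[:half]}      # one random half goes to G_S
w_a = {e: 1.0 for e in edges[half:]}      # the other half goes to G_A

w_half = fuse(w_s, w_a, 0.5)
print(full_edges, half_edges, len(w_half), set(round(w, 12) for w in w_half.values()))
```

Since $G_S$ and $G_A$ have the same number of edges, every edge of $\tilde{G}$ receives the same fused weight at $\alpha = 0.5$, which is the sense in which $G_\alpha \equiv \tilde{G}$ there.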

The experiments performed so far suggest that the values of $Q^\alpha_{\mathrm{diff}}$ always vanish. However, this is not true, as Figure 6.3 (d) shows. In this experiment $G_S$ is LFR-based ($n = 3,000$, $\langle k \rangle = 20$, $k_{\max} = 150$, $\mu = 0.4$) and $G_A$ is star graph-based (the base is the star graph with $n = 3,000$ nodes and 2,999 edges; the other edge weights are zero). The difference in node degree distributions (recall (11) in Theorem 1) is so notable that the maximal value of $Q^\alpha_{\mathrm{diff}}$ for $\alpha \in [0, 1]$ is clearly separated from zero. This result emphasizes how unexpectedly the weight-based fusion model may behave for unusual networks and that one should be aware of the Composite Modularity structure for a proper interpretation of the community detection results. The traditional heuristics (see Section 1.1) do not give any hint about the Differential Modularity component.
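A toy version of this effect can be computed by hand. The sketch below is not the thesis experiment: it assumes the standard weighted Modularity for (3), takes $G_S$ as two 4-node cliques and $G_A$ as a star on the same 8 nodes, and defines the Differential Modularity as the residual $Q^\alpha_{\mathrm{com}} - Q^\alpha_{\mathrm{str}} - Q^\alpha_{\mathrm{attr}}$. The closed form used for comparison is derived here from that assumed Modularity formula and is not quoted from (11).

```python
def modularity(A, labels):
    """Weighted Newman-Girvan Modularity of a symmetric matrix A (assumed form of (3))."""
    n = len(A)
    k = [sum(row) for row in A]
    two_m = sum(k)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m

def normalized_matrix(edges, n):
    """Symmetric matrix with equal edge weights summing to 1 over the edges."""
    A = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1.0 / len(edges)
    return A

n = 8
clique_edges = [(i, j) for block in (range(0, 4), range(4, 8))
                for i in block for j in block if i < j]
star_edges = [(0, j) for j in range(1, n)]        # star centred at node 0
A_s = normalized_matrix(clique_edges, n)
A_a = normalized_matrix(star_edges, n)
labels = [0, 0, 0, 0, 1, 1, 1, 1]                 # partition C: the two cliques

alpha = 0.5
A_alpha = [[alpha * A_s[i][j] + (1 - alpha) * A_a[i][j] for j in range(n)]
           for i in range(n)]
q_com = modularity(A_alpha, labels)
q_str = alpha * modularity(A_s, labels)
q_attr = (1 - alpha) * modularity(A_a, labels)
q_diff = q_com - q_str - q_attr                   # residual term

# Closed form derived under the same assumptions: it depends only on the
# per-community differences of the node degrees of G_S and G_A.
k_s = [sum(row) for row in A_s]
k_a = [sum(row) for row in A_a]
closed = alpha * (1 - alpha) / 4 * sum(
    sum(k_s[i] - k_a[i] for i in range(n) if labels[i] == c) ** 2
    for c in set(labels))

print(f"Q_com = {q_com:.4f}, Q_str = {q_str:.4f}, Q_attr = {q_attr:.4f}, Q_diff = {q_diff:.4f}")
```

In this toy case the residual is strictly positive because the star concentrates almost all attributive degree in one community, which mirrors the role of the degree distributions mentioned above.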

[Figure 6.2: panels (a) and (b); horizontal axis from 0 to 1 (the fusion parameter $\alpha$), vertical axis Modularity; curves: Composite, Structural, Attributive, Differential]

Figure 6.2 – The variety of Modularities behaviour for different LFR-based graph pairs: (a) $Q(G_S) \approx Q(G_A)$ (LFR-based $G_S$ of $n = 3,000$ nodes and 374,250 edges with equal non-zero weights, $\langle k \rangle = 20$, $k_{\max} = 150$, $\mu = 0.2$; LFR-based $G_A$ of $n = 3,000$ nodes and 37,199 edges with equal non-zero weights, $\langle k \rangle = 20$, $k_{\max} = 150$, $\mu = 0.2$); (b) $Q(G_S) > Q(G_A)$ (LFR-based $G_S$ of $n = 3,000$ nodes and 374,250 edges with equal non-zero weights, $\langle k \rangle = 20$, $k_{\max} = 150$, $\mu = 0.2$; LFR-based $G_A$ of $n = 3,000$ nodes and 38,091 edges with equal non-zero weights, $\langle k \rangle = 20$, $k_{\max} = 150$, $\mu = 0.8$)

[Figure 6.3: panels (c) and (d); horizontal axis from 0 to 1 (the fusion parameter $\alpha$), vertical axis Modularity; curves: Composite, Structural, Attributive, Differential]

Figure 6.3 – The variety of Modularities behaviour for different LFR-based graph pairs: (c) $G_S$ and $G_A$, each with $n = 3,000$ nodes and 374,250 edges, are disjoint random subgraphs of $\tilde{G}$ consisting of 6 fully connected components and $n = 3,000$ nodes and 748,500 edges; (d) $\max_\alpha Q^\alpha_{\mathrm{diff}} \gg 0$ (LFR-based $G_S$ of $n = 3,000$ nodes and 17,251 edges with equal non-zero weights, $\langle k \rangle = 20$, $k_{\max} = 150$, $\mu = 0.4$; Star graph $G_A$ with $n = 3,000$ nodes and 2,999 edges with equal non-zero weights)

6.2.3 The almost direct proportionality of the edge weights of the graphs $G_S$ and $G_A$

Now we experimentally illustrate the statement of Theorem 3. We generate $G_S$ as an LFR-based graph with $n = 3,000$, $\langle k \rangle = 25$, $k_{\max} = 150$ and $\mu = 0.5$; $G_{A'}$ is an LFR-based graph with $n = 3,000$, $\langle k \rangle = 10$, $k_{\max} = 150$ and $\mu = 0.7$; the set of its edge weights we denote by $\{w_{A'}(e_{ij})\}$. Now let $G_A$ be the graph with the following edge weights:

$$w_A(e_{ij}) = \gamma\, w_{A'}(e_{ij}) + (1-\gamma)\, w_S(e_{ij}), \quad e_{ij} \in E, \tag{24}$$

where $\gamma \in [0, 1]$ is fixed. Clearly, if $\gamma = 0$, then $G_A \equiv G_S$; if $\gamma = 1$, then $G_A \equiv G_{A'}$.

[Figure 6.4: horizontal axis from 0 to 1 (the fusion parameter $\alpha$), vertical axis Composite Modularity]

Figure 6.4 – Composite Modularity for $G_\alpha$, based on $G_S$ and $G_A$ with different $\gamma$ in (24), where $G_S$ is LFR with $n = 3,000$, $\langle k \rangle = 25$, $k_{\max} = 150$, $\mu = 0.5$; $G_{A'}$ is LFR with $n = 3,000$, $\langle k \rangle = 10$, $k_{\max} = 150$, $\mu = 0.7$

In terms of Theorem 3, if $\gamma$ is small enough (with respect to the chosen $\epsilon$), then the edge weights of $G_S$ and $G_A$ satisfy (22) and $\sum_{ij} |\epsilon_{ij}| \le \frac{\epsilon}{2} \le 1$ in Theorem 3. Consequently, one should expect according to (23) that for $\gamma \approx 0$ the Composite Modularity $Q^\alpha_{\mathrm{com}}$ is almost independent of $\alpha$. This is indeed the case, as Figure 6.4 shows. What is more, $Q^\alpha_{\mathrm{com}}$ starts to depend essentially on $\alpha$ as $\gamma$ grows. In particular, the behaviour of $Q^\alpha_{\mathrm{com}}$ changes dramatically after $\gamma$ passes 0.5, when the edge weights of $G_{A'}$ start dominating in (24).
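The interpolation (24) and its effect on the linear relationship between the edge weights can be sketched as follows. This is a toy illustration with random 0/1 weight vectors standing in for the edge weights of $G_S$ and $G_{A'}$; the helper names `mix` and `pearson`, the vector length and the seed are assumptions, and the printed values are not the ones reported in the thesis.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def mix(w_s, w_a_prime, gamma):
    """Edge weights of G_A as in (24): gamma * w_A' + (1 - gamma) * w_S."""
    return [gamma * a + (1 - gamma) * s for s, a in zip(w_s, w_a_prime)]

rng = random.Random(42)
num_pairs = 2000
w_s = [1.0 if rng.random() < 0.10 else 0.0 for _ in range(num_pairs)]
w_a_prime = [1.0 if rng.random() < 0.05 else 0.0 for _ in range(num_pairs)]

for gamma in (0.0, 0.1, 0.2, 0.45, 0.55, 1.0):
    r = pearson(w_s, mix(w_s, w_a_prime, gamma))
    print(f"gamma = {gamma:.2f}: Pearson = {r:.2f}")
```

At $\gamma = 0$ the two weight vectors coincide, so the coefficient equals 1; at $\gamma = 1$ it reduces to the correlation between the weights of $G_S$ and $G_{A'}$.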

The corresponding correlation rates for different $\gamma$ can be found in Table 6.1.

Theorem 3 and the experiments in this section are complementary to the heuristic conclusion in [11] that a strong positive correlation of the edge weights in $G_S$ and $G_A$ (in the sense of Pearson or Spearman) leads to a weak dependence of certain structural and attributive community detection quality measures on $\alpha$ within the weight-based fusion models. It is interesting that the statement "If the Composite Modularity is independent of $\alpha$ in (2) for all $\alpha \in [0, 1]$, then there is a strong positive correlation of the edge weights in $G_S$ and $G_A$" is not true in general, as the experiment in Figure 6.3 (c) shows. The Composite Modularity is almost constant there, but the general scheme by which $G_S$ and $G_A$ are constructed does not imply a strong positive correlation. Recall that Pearson's correlation coefficient there is $\approx -0.09$.

6.2.4 Counterexample to the node-attributed social network feature selection principle

In this subsection we present a particular but important experiment in our settings that is related to the well-known node-attributed social network feature selection principle, which in particular aims at choosing the subset of attributes that are "relevant to" or "tightly correlated with" the structure, see e.g. [45, 46]. This principle is usually based on the suggestions that the mismatch between the structure and the attributes "negatively affects community detection quality" [46] and that the existence of structure-attributes correlation "offers a unique opportunity to improve the learning performance of various graph mining tasks" [45].

This principle has already been found questionable in [2, 11] in the weight-based fusion model settings. In particular, it was observed there that in the case where all the attributes are in strong linear relationship to the structure (in terms of the edge weights of $G_S$ and $G_A$), one in fact has two sources of information that mainly duplicate each other. In this sense, it also seems unreasonable to expect a significant improvement of the community detection quality by adding the attributes to the structure. Note that Theorem 3 confirms this point analytically in our settings.

Table 6.1 – Correlation rates for different $\gamma$ in (24) in Section 6.2.3

$\gamma$    0.00    0.10    0.20    0.45    0.55
Pearson     1.00    0.99    0.97    0.78    0.64

We now present a counterexample to the principle in the weight-based fusion model settings. Namely, we show that using in the community detection process only the edge weights in $G_A$ that are in strong linear relationship to the corresponding ones in $G_S$ may lead to a degradation of the community detection quality, not only in the sense of the Modularity-based community detection quality measures as in (15) but even in that of ground truth-based measures.

In general, $\mathrm{NMI}(C, C') \in [0, 1]$; in particular, $\mathrm{NMI}(C, C') = 1$ if $C = C'$ and $\mathrm{NMI}(C, C') = 0$ if $C$ and $C'$ are totally independent [19].
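The two boundary cases can be reproduced with a short computation. The sketch below assumes one common definition of NMI, the mutual information normalized by the arithmetic mean of the two entropies; the exact normalization used in (8) is not restated here, so this choice and the function name `nmi` are assumptions.

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Normalized Mutual Information, 2 * I(A; B) / (H(A) + H(B)) (assumed variant)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    mi = sum(c / n * log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in joint.items())
    if h_a + h_b == 0:           # both partitions consist of a single community
        return 1.0
    return 2 * mi / (h_a + h_b)

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # identical partitions up to relabeling
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # knowing one says nothing about the other
print(same, independent)
```

Note that the measure is invariant under relabeling of communities, which is why the first pair counts as identical partitions.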

We perform experiments with LFR-based graphs below; therefore let us mention that the ground truth partition $C'$ in the LFR model is the one generated at the step when the generated nodes with the chosen degree distribution are randomly assigned to the generated communities with the chosen size distribution, and before the generation of links between the communities. Recall the parameters $\tau_1$ and $\tau_2$ of the distributions in Section 6.2.2. The partition $C$ in (8) is the above-mentioned one obtained within the weight-based fusion model, see Section 1.1.

For the experiment, we generate two LFR-based graphs:

– $G_S$ with $n = 1,000$, $\langle k \rangle = 150$, $k_{\max} = 450$ and $\mu = 0.4$;

– $G_A$ with $n = 1,000$, $\langle k \rangle = 100$, $k_{\max} = 450$ and $\mu = 0.1$.

The ground truth partition $C'$ is related to $G_A$ here. Clearly, $G_A$ has a stronger community structure than $G_S$.

[Figure 6.5: panels (a) and (b); horizontal axis from 0 to 1 (the fusion parameter $\alpha$), left vertical axis Modularity, right vertical axis NMI; curves: Composite, Structural, Attributive, Differential, NMI]

Figure 6.5 – Modularities and NMI for (a) $G_S$ (with $n = 1,000$, $\langle k \rangle = 150$, $k_{\max} = 450$ and $\mu = 0.4$) and $G_A$ (with $n = 1,000$, $\langle k \rangle = 100$, $k_{\max} = 450$ and $\mu = 0.1$), i.e. before the feature selection procedure, and (b) $G_S$ (with the same parameters) and $G_{A'}$ (obtained as described in Section 6.2.4), i.e. after the feature selection procedure

For convenience, we assume that each non-zero weight in $G_S$ and $G_A$ equals 1, and the required normalization (9) is done at the very end of the procedure described in the next paragraph.

Let $G_{A'} = G_A$. Let $S$ be the set of node pairs $(v, u)$ such that $v$ and $u$ are in the same community in $C'$ and the edge weight between $v$ and $u$ is 1 in $G_A$ and 0 in $G_S$. We then set to zero the edge weights between the pairs of nodes in $G_{A'}$ that are in $S$. Thus $G_A$, with respect to $G_{A'}$, has a stronger community structure that is also more similar to $C'$. Consequently, the Modularity and NMI values for $G_A$ are higher than those for $G_{A'}$. Furthermore, the linear relationship of the edge weights of $G_S$ and $G_{A'}$ is stronger than that of $G_S$ and $G_A$ by construction. The above-described procedure actually imitates the node-attributed social network feature selection principle in our settings.

In our particular experiment $G_S$, $G_A$ and $G_{A'}$ have 98,266, 59,258 and 19,388 edges with equal non-zero weights, respectively. The set $S$ contains 39,870 node pairs.
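One possible reading of this procedure is sketched below on a toy example. It assumes that $S$ collects the intra-community node pairs that are linked in $G_A$ but not in $G_S$ and that exactly these links are dropped to obtain $G_{A'}$; the function name, the edge-set representation and the toy graphs are illustrative assumptions. The last lines check that the edge counts reported in the text are consistent with this reading.

```python
def imitate_feature_selection(edges_s, edges_a, truth):
    """Return (edges of G_A', S) for unit-weight graphs given as sets of pairs (v, u), v < u.
    truth maps a node to its ground truth community."""
    s = {(v, u) for (v, u) in edges_a
         if (v, u) not in edges_s and truth[v] == truth[u]}
    return edges_a - s, s

# Toy data (hypothetical): two ground truth communities {0, 1, 2} and {3, 4, 5}.
truth = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
edges_s = {(0, 1), (1, 2), (2, 3), (3, 4)}
edges_a = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)}

edges_a_prime, s = imitate_feature_selection(edges_s, edges_a, truth)
print(sorted(s), sorted(edges_a_prime))

# Consistency of the counts reported in the text: |G_A| - |S| = |G_A'|.
reported_a, reported_s, reported_a_prime = 59258, 39870, 19388
print(reported_a - reported_s == reported_a_prime)
```

In the toy example the remaining attributive links all coincide with structural ones, so the two sets of edge weights become more strongly related while the community structure of the attributive graph is weakened.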

The results for $G_S$ and $G_A$ are in Figure 6.5 (a), "before the feature selection". The maximum of the Composite Modularity ($\approx 0.47$) and of NMI ($\approx 1.00$) is attained at $\alpha = 0$ here.

Once we exchange $G_A$ for $G_{A'}$, we get the results in Figure 6.5 (b), "after the feature selection". Comparing the plots in Figure 6.5, we see that the maximum of the Composite Modularity drops to $\approx 0.24$ and the NMI to $\approx 0.66$ for $\alpha = 0$. Note that Pearson's correlation coefficient for the edge weights of $G_S$ and $G_A$ is $\approx 0.00$ and that of $G_S$ and $G_{A'}$ is $\approx 0.20$.

These experiments show that, after applying a variant of the node-attributed social network feature selection procedure within the weight-based fusion model, one can get worse community detection results than for the initial graphs. This means that the procedure may be unsuitable in some cases, and one should take this into account when applying it.

6.3 Real-world node-attributed network datasets

Now we deal with the undirected versions of the following publicly available real-world node-attributed social network datasets:

– WebKB¹² (Cornell, Texas, Washington, and Wisconsin) is a collection of four networks with a total of 877 web pages (nodes) and 1,608 hyperlinks (edges) gathered from the websites of four different universities (each web page is associated with a binary vector whose elements indicate the presence of a word from the vocabulary on that web page; the vocabulary consists of 1,703 unique words; the ground truth partition is available);

– PolBlogs¹³ is a network of 1,490 weblogs (nodes) on US politics with 19,090 hyperlinks (edges) between these weblogs (each node has a binary attribute describing its political leaning as either liberal or conservative; the ground truth partition is not provided);

– Sinanet¹⁴ is a microblog user relationship network extracted from the Weibo website with 3,490 users (nodes) and 30,282 relationships (edges) (each node is attributed with a 10-dimensional positive numerical vector describing the interests of the user; the ground truth partition is available);

– Cora¹⁵ is a network of machine learning papers with 2,708 papers (nodes) and 5,429 citations (edges) (each node is attributed with a 1,433-dimensional binary vector indicating the absence/presence of words from the dictionary collected from the corpus of papers; the ground truth partition is available).

¹² https://linqs.soe.ucsc.edu/data

In the experiments below, the graphs $G_S$ and $G_A$ are constructed according to (2) and (9) with the following structural and attributive weight functions (the parameters of Algorithm in Table 5.1):

$$\mu(e_{ij}) = \begin{cases} 1, & \text{if } w(e_{ij}) = 1 \text{ in } (V, E), \\ 0, & \text{otherwise,} \end{cases} \qquad \nu(e_{ij}) = \frac{A(v_i) \cdot A(v_j)}{\|A(v_i)\|_2 \, \|A(v_j)\|_2}.$$

Note that $\nu$ is the well-known Cosine Similarity and $\nu(e_{ij}) \in [0, 1]$ in our case, as all the attributes are non-negative. It is also worth mentioning that the chosen $\mu$ and $\nu$ are among the most popular for this purpose [2] but, to be fair, the theorems proved above and Algorithm in Table 5.1 work for any reasonable non-negative weight function.

¹³ http://www-personal.umich.edu/~mejn/netdata/
¹⁴ https://github.com/smileyan448/Sinanet
¹⁵ https://linqs.soe.ucsc.edu/data

[Figure 6.6: panels (a) and (b); horizontal axis from 0 to 1 (the fusion parameter $\alpha$), left vertical axis Modularity, right vertical axis NMI; curves: Composite, Structural, Attributive, Differential, NMI]

Figure 6.6 – Modularities and NMI for the WebKB (Cornell, Texas) dataset: (a) Cornell, (b) Texas
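The two weight functions $\mu$ and $\nu$ can be written down directly. The following is a minimal sketch with hypothetical names (`mu`, `nu`) and toy data; returning 0 for an all-zero attribute vector is an added convention, not something stated in the text.

```python
from math import sqrt

def mu(i, j, edges):
    """Structural weight: 1 if nodes i and j are linked in (V, E), 0 otherwise."""
    return 1.0 if (i, j) in edges or (j, i) in edges else 0.0

def nu(attrs_i, attrs_j):
    """Attributive weight: Cosine Similarity of two attribute vectors."""
    dot = sum(a * b for a, b in zip(attrs_i, attrs_j))
    norm_i = sqrt(sum(a * a for a in attrs_i))
    norm_j = sqrt(sum(b * b for b in attrs_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0               # convention assumed here for all-zero vectors
    return dot / (norm_i * norm_j)

# Toy node-attributed graph: 3 nodes, 2 links, binary attribute vectors.
edges = {(0, 1), (1, 2)}
attrs = {0: [1, 0, 1], 1: [1, 0, 1], 2: [0, 1, 0]}

for i in range(3):
    for j in range(i + 1, 3):
        print((i, j), mu(i, j, edges), round(nu(attrs[i], attrs[j]), 3))
```

For non-negative attribute vectors the value of `nu` always lies in $[0, 1]$, in line with the remark above, and the resulting graphs are complete with some weights possibly equal to zero.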

We present results for $\alpha \in [0, 1]$ with a certain step for clarity, while Algorithm in Table 5.1 finds the partition and the Modularity values for the single value $\alpha = \tilde{\alpha}_\rho$. In the forthcoming experiments, we additionally calculate the NMI values by (8) for the real-world datasets where the ground truth partition $C'$ is available, in order to compare the Modularity-based community detection quality measures and the ground truth-based ones. It is worth mentioning, however, that the ground truth methodology for evaluating community detection quality, although widely used in the field, seems somewhat questionable, especially for real-world networks, as "the ground truth" may reflect only one point of view on the network communities (among many possible). This problem is discussed e.g. in [8, 9, 10].

[Figure 6.7: panels (a) and (b); horizontal axis from 0 to 1 (the fusion parameter $\alpha$), left vertical axis Modularity, right vertical axis NMI; curves: Composite, Structural, Attributive, Differential, NMI]

Figure 6.7 – Modularities and NMI for the WebKB (Washington, Wisconsin) dataset: (a) Washington, (b) Wisconsin

The results for the four networks of WebKB are presented in Figures 6.6 and 6.7. The behaviour of the Modularities is rather similar in each case. As in some experiments with synthetic graphs in Sections 6.2.1 and 6.2.2, one of $Q(G_S)$ and $Q(G_A)$ is greater than the other. However, $Q(G_S)$ is much greater than $Q(G_A)$ here, so the case is almost degenerate. It can be observed that $G_S$ has rather distinguishable communities, while $G_A$ does not. As a result, we have $\alpha^*_{0.5} \approx 0$ (the $\alpha$ providing the equal impact of the structure and the attributes on the community detection results in the sense described in Section 3). At the same time, as the behaviour of NMI shows, the ground truth communities are highly related to the attributes in the sense that the highest NMI values are reached around $\alpha = 0$ (actually exactly at $\alpha = 0$ in Figures 6.6 (b) and 6.7 (a) and (b)). It is interesting that the highest NMI value in Figure 6.6 (a) corresponds to $\alpha^*_{0.5}$. $Q^\alpha_{\mathrm{diff}}$ vanishes for all $\alpha \in [0, 1]$.

[Figure 6.8: horizontal axis from 0 to 1 (the fusion parameter $\alpha$), vertical axis Modularity; curves: Composite, Structural, Attributive, Differential]

Figure 6.8 – Modularities for the PolBlogs dataset

In contrast to WebKB, both $G_S$ and $G_A$ in PolBlogs have rather distinguishable communities, see Figure 6.8. As for the attributes, this is indeed reasonable, as they are one-dimensional and binary. $Q^\alpha_{\mathrm{diff}}$ vanishes for all $\alpha \in [0, 1]$ in this case, too. The truly distinctive feature of the PolBlogs results among those for the real-world networks under consideration is that the Modularities here change almost linearly with respect to $\alpha$, and thus the condition (18) in Theorem 2 is met with a small $\epsilon$. In particular, this guarantees that the $\alpha$-tuning scheme in Section 3 and
