A generalized weight-based fusion model for community detection in node-attributed social networks


MSc Computational Science

Master Thesis


by

Timofei

Gradov

13241028

June Supervisor/Examiner: Examiner:


ABSTRACT

The weight-based fusion model is among the simplest and most efficient models for modularity-driven community detection in node-attributed social networks, which contain both links between social actors (“structure”) and the actors’ feature vectors (“attributes”). Roughly speaking, the attributes are first converted into an attributive network, so that one obtains two networks (structural and attributive) instead of the node-attributed network. The two networks are then fused into a composite one that is believed to contain the information about both the structure and the attributes and that can be fed directly to traditional modularity-driven graph community detection approaches. Although the weight-based fusion model is widely used, it has been understudied analytically and has had only a heuristic foundation.

In this study, we disclose the mathematical machinery of a generalized weight-based fusion model by revealing the objective function of the corresponding community detection optimization process and establishing its connection with the traditional quality measures for community detection results in node-attributed networks. We also propose a non-manual parameter tuning scheme, to our knowledge the first of its kind, that provides the desired impact of the structure and the attributes on the community detection results within the weight-based fusion model. Based on the theoretical results obtained, we further present a well-tunable Leiden-based algorithm for community detection in node-attributed social networks that proves fast and accurate in our experiments with synthetic and real-world datasets.


INTRODUCTION

Community detection in node-attributed social networks is an actively studied problem in social network analysis [1, 2, 3] due to the need to explore a huge amount of real-world social network data containing both links between social actors (known as structure) and the actors’ feature vectors (known as attributes) indicating the actors’ age, interests, etc. While classical community detection models deal either with the structure or with the attributes, community detection methods for node-attributed networks aim at the simultaneous use, or fusion, of both. The motivation is that such fusion may clarify and enrich the knowledge about communities in a node-attributed social network [1, 2]. This idea is usually justified by the well-known homophily principle, which states that like-minded social actors have a higher probability of being connected [4]. Furthermore, social science findings (see e.g. [5, 6]) suggest that the attributes can reflect and affect the community configuration of a node-attributed social network.

A variety of models and methods for community detection in node-attributed social networks have been proposed so far [1, 2, 3]. Although they are widely used in applications, some of them are understudied analytically and have only a heuristic foundation. In particular, it is observed in [2] that, from the optimization theory point of view, many community detection methods for node-attributed social networks lack a mathematically established connection between the objective function of the community detection optimization process and the measures used for evaluating community detection quality. Indeed, there are many examples of community detection models and methods for node-attributed social networks in [2] where one function is optimized within the community detection process, while another function, a “quality measure” that is not directly related to the former and is itself usually chosen heuristically from the many possible ones [1, 7], is used for evaluating the optimization results. In our opinion, this lack may cause misinterpretation of the community detection results and difficulties with tuning the parameters of a community detection method for node-attributed social networks. Let us emphasize that, here and throughout the research, we mainly deal with structure- and attributes-aware quality measures (e.g. Modularity and Entropy) and only rarely with those estimating the agreement between the detected communities and the ground truth ones. The reason is that the latter seem to be less formalized and their “truthfulness” may be questionable in certain cases, as the results in [8, 9, 10] and our own results below suggest.

In this research we address a particular case of the above-mentioned general problem. Namely, our main objective is to disclose the mathematical machinery of a generalized weight-based fusion model, which is one of the simplest and most efficient models applied for modularity-driven community detection in node-attributed social networks. More than 20 papers are devoted to the family of weight-based fusion models, see [2]. Recall that, for a given node-attributed social network, a typical weight-based fusion model first converts the attributes into an attributive network, so that one obtains two networks (structural and attributive) instead of the node-attributed social network. The two networks are then fused into a composite one that is believed to contain the information about both the structure and the attributes and that can be fed directly to traditional modularity-driven graph community detection methods. We give the formal description of the weight-based fusion model under consideration, the corresponding community detection process and an overview of the existing weight-based fusion models in Section 1.

To meet the main objective, we aim at achieving the following goals:

– to reveal the objective function of the community detection optimization process within the weight-based fusion model;

– to establish the connection between the objective function and the related community detection quality measures;

– to propose a non-manual parameter tuning scheme that provides the desired impact of the structure and the attributes on community detection results within the weight-based fusion model;

– to propose a well-tunable algorithm for community detection in node-attributed social networks that is based on the weight-based fusion model and the well-known Leiden algorithm for Modularity maximization;

– to perform an experimental study of the algorithm on synthetic and real-world datasets.

The exposition in this thesis is partly based on the peer-reviewed papers [11] and [12], of which the thesis author is a co-author. Furthermore, this thesis extends [12] with the following main contributions:

– we obtain further theoretical results on the objective function of the community detection optimization process; in particular, we show how it behaves when the edge weights in the structural and attributive graphs are almost directly proportional;

– we generalize the previous parameter tuning scheme, which provided only the equal impact of the structure and the attributes on community detection results within the weight-based fusion model, to any desired impact;

– we propose, and show the efficiency of, the well-tunable Leiden-based algorithm for community detection in node-attributed social networks, which replaces the direct illustrative experiments with the well-known Louvain algorithm for Modularity maximization.

In both the papers and this thesis, the contribution of the thesis author is 50% to the theoretical results (the theorems) and 90% to the design and implementation of the algorithm and the experiments.


1 Necessary facts on the generalized weight-based fusion model

1.1 Description of the weight-based fusion model, the corresponding community detection problem and the heuristics involved

Below we first describe a general version of the weight-based fusion model and the related node-attributed social network community detection problem. Then we recount the community detection quality evaluation scheme within the weight-based fusion model and reveal the implicit heuristics usually used in it.

In this research we model a node-attributed social network as an undirected simple node-attributed graph $G = (V, E, W, A)$, where $V = \{v_i\}_{i=1}^{n}$ is the set of nodes (representing social actors), $E = \{e_{ij}\}$ is the set of all possible links (representing social connections) between nodes in $V$ (i.e. $(V, E)$ is a complete graph), $W$ is the set of non-negative edge weights (representing the strength of the actors’ social connections$^1$), and $A$ is the set of attribute vectors $A(v_i) = \{a_d(v_i)\}_{d=1}^{D}$, $v_i \in V$, with non-negative$^2$ elements (representing the social actors’ features). We suppose that $(V, E)$ is complete to simplify the further mathematical analysis. In what follows, $(V, E, W)$ is called the structure and $(V, A)$ the attributes of $G$.

Within a generalized weight-based fusion model, $G$ is first converted into the following two undirected simple weighted complete graphs by a certain rule: the structural graph $G_S = (V, E, W_S)$ and the attributive graph $G_A = (V, E, W_A)$, where $W_S = \{w_S(e_{ij})\}$ and $W_A = \{w_A(e_{ij})\}$ are the sets of non-negative edge weights in each graph, see Figure 1.1. For convenience, we suppose that

$$\sum_{e_{ij} \in E} w_S(e_{ij}) = 1, \qquad \sum_{e_{ij} \in E} w_A(e_{ij}) = 1. \tag{1}$$

$^1$ An edge weight may be zero; this indicates that there is no social connection.

$^2$ For nominal or textual attributes, it is common to use one-hot encoding or embeddings to obtain their numerical representation.

[Figure 1.1: node-attributed graph $G = (V, E, W, A)$ ⟹ structural graph $G_S = (V, E, W_S)$ and attributive graph $G_A = (V, E, W_A)$ ⟹ composite graph $G_\alpha = (V, E, W_\alpha)$]

Figure 1.1 – An example of the weight-based fusion model scheme for $G$ with $n = 5$ and 2-dimensional binary attribute vectors. The edge weights in the attributive graph $G_A$ are given by the normalized matching coefficient on the corresponding node attributes in $G$

Furthermore, the two graphs are fused to obtain the undirected simple complete weighted graph $G_\alpha = (V, E, W_\alpha)$, where the elements of $W_\alpha = \{w_\alpha(e_{ij})\}$, with $e_{ij} \in E$, are as follows (see Figure 1.1):

$$w_\alpha(e_{ij}) = \alpha\, w_S(e_{ij}) + (1-\alpha)\, w_A(e_{ij}), \quad \alpha \in [0, 1], \qquad \sum_{e_{ij} \in E} w_\alpha(e_{ij}) = 1. \tag{2}$$

Here $\alpha$ is a fusion parameter that controls the impact of $G_S$ and $G_A$ on $G_\alpha$. Note that $G_1 = G_S$ and $G_0 = G_A$ by construction. The graph $G_\alpha$ (which we call composite) is believed to contain the information about both the structure and the attributes of $G$; this belief is one of the traditional heuristics behind the family of weight-based fusion models.
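To make the fusion step (2) concrete, the following minimal Python sketch fuses two weighted graphs stored as symmetric weight matrices. The matrix representation, the helper names `normalize` and `fuse` and the 3-node input are illustrative assumptions and are not taken from the model description.

```python
def normalize(w):
    """Rescale a symmetric weight matrix so that the edge weights sum to 1, as in (1)."""
    total = sum(sum(row) for row in w) / 2.0  # each undirected edge is stored twice
    if total <= 0.0:
        raise ValueError("the graph must have positive total edge weight")
    return [[x / total for x in row] for row in w]


def fuse(w_s, w_a, alpha):
    """Edge weights of the composite graph G_alpha, following (2)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    w_s, w_a = normalize(w_s), normalize(w_a)
    n = len(w_s)
    return [[alpha * w_s[i][j] + (1.0 - alpha) * w_a[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical 3-node input: a structural path 0-1-2 and one attributive link 0-2.
w_s = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
w_a = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
w_half = fuse(w_s, w_a, 0.5)  # edge weights 0.25 (0-1), 0.25 (1-2), 0.5 (0-2)
```

Because both inputs are normalized first, the fused weights again sum to 1, and the boundary cases `alpha = 1` and `alpha = 0` return the normalized structural and attributive graphs, respectively.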

Recall that community detection in $G$ consists in the unsupervised division of $V$ into $K$ disjoint$^3$ communities $C_k \subset V$, with $C = \{C_k\}_{k=1}^{K}$, such that $V = \bigcup_{k=1}^{K} C_k$ and a certain balance between the following properties is achieved [1, 2]:

– structural closeness, i.e. nodes in a community are more densely connected than nodes in different communities;

– attributive homogeneity, i.e. nodes in a community have more similar attributes than nodes in different communities.

$^3$ Communities may be overlapping if necessary, but here we focus on disjoint ones.

In the context of the weight-based fusion model, the community detection problem consists in the unsupervised division of $G_\alpha$ into $K$ disjoint communities $C_{k,\alpha} \subset V$, with $C_\alpha = \{C_{k,\alpha}\}_{k=1}^{K}$, such that $V = \bigcup_{k=1}^{K} C_{k,\alpha}$ and the nodes in each community $C_{k,\alpha}$ are structurally close and homogeneous in terms of attributes.

Since one deals with the weighted graph $G_\alpha$ within the weight-based fusion model, classical graph community detection methods can be applied to find $C_\alpha$. For example, a popular choice [2] is the Louvain algorithm [13], which aims at maximizing Newman’s Modularity [14], a measure of the divisibility of a graph into communities. In the context of the weight-based fusion model, the maximization of the Modularity of $G_\alpha$ is implicitly thought to provide structural closeness and attributive homogeneity also in $G$ (such heuristics are applied e.g. in [15, 16, 17, 18]). Our concern is that the previous works on the weight-based fusion model did not discuss how the Modularity of $G_\alpha$ (i.e. the objective function of the community detection optimization process corresponding to the weight-based fusion model) is connected, say, with the Modularities of $G_S$ and $G_A$.

Following the above-mentioned implicit thought, the partition $C_\alpha$ maximizing the Modularity of $G_\alpha$ is also used for measuring structural closeness and attributive homogeneity in the initial $G$. For example, $C_\alpha$ can be used for calculating the corresponding Modularity of $G_S$ (a popular measure of structural closeness) and the Entropy of subsets of the corresponding attributes in $A$ (a popular measure of attributive homogeneity) [2]. Thus one objective function is optimized to detect communities, but the quality of the communities obtained is evaluated by measures not explicitly related to the objective function. From the optimization theory point of view, this seems to be a logical gap that may cause misinterpretation of community detection results and difficulties with tuning the parameter $\alpha$ in the weight-based fusion model. To fill the gap and reveal the true mathematical machinery of the weight-based fusion model, we study in Section 2 how the Modularity of $G_\alpha$ (provided by $C_\alpha$) and the $C_\alpha$-based quality measures on $G$ relate to each other.

1.2 Precise definitions and reformulation of the node-attributed social network community detection problem

For the purposes of further analysis, we define the community detection quality measures that are often used for evaluating node-attributed social network community detection results within the weight-based fusion model, namely Modularity and Entropy, see e.g. [1]. The former works with graphs and the latter with sets of vectors. The community detection quality measures are usually chosen heuristically from the many possible ones, see [1, 2, 7].

Let $G' = (V, E, W)$ be an undirected simple weighted graph and $C$ its disjoint partition. The graph $G'$ is the structure of a node-attributed graph $G$. The Modularity of $G'$ for $C$ is as follows:

$$Q(G', C) = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{1}{2m} k_i k_j \right) \delta_{ij} \in \left[-\tfrac{1}{2}, 1\right], \quad i, j = 1, \dots, n, \tag{3}$$

where

– $A_{ij}$ is the edge weight between nodes $v_i$ and $v_j$;

– $k_i$ and $k_j$ are the weighted degrees of $v_i$ and $v_j$, respectively, i.e.

$$k_h = \sum_{l} w(e_{hl}), \quad \text{where } h \in \{i, j\}; \tag{4}$$

– $m$ is the sum of all edge weights in $G'$, i.e.

$$m = \sum_{e_{ij} \in E} w(e_{ij}); \tag{5}$$

– if $c_i$ and $c_j$ are the community labels of nodes $v_i$ and $v_j$, respectively, then

$$\delta_{ij} = \begin{cases} 1, & c_i = c_j, \\ 0, & \text{otherwise}. \end{cases}$$

The Modularity of $G'$ is then defined as

$$Q(G') = \max_{C} Q(G', C). \tag{6}$$
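Definitions (3) to (5) translate directly into code. The sketch below is a plain quadratic-time evaluation of $Q(G', C)$ for a graph stored as a symmetric weight matrix; the function name, the matrix representation and the toy graph (two triangles joined by one edge) are illustrative assumptions.

```python
def modularity(w, labels):
    """Modularity (3) of a symmetric weight matrix w for the partition given by labels."""
    n = len(w)
    k = [sum(row) for row in w]   # weighted degrees, eq. (4)
    m = sum(k) / 2.0              # total edge weight, eq. (5)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:        # delta_ij = 1
                q += w[i][j] - k[i] * k[j] / (2.0 * m)
    return q / (2.0 * m)


# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
w = [[0.0] * 6 for _ in range(6)]
for i, j in edges:
    w[i][j] = w[j][i] = 1.0

q_triangles = modularity(w, [0, 0, 0, 1, 1, 1])  # 5/14, approximately 0.357
q_trivial = modularity(w, [0] * 6)               # 0 for the single-community partition
```

The maximization in (6) is not performed here; in practice it is delegated to a heuristic such as Louvain or Leiden.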

The Entropy $H$, defined for a pair $G'' = (V, A)$, measures the degree of disorder of the attribute vectors $A$ within the communities of a disjoint partition $C$ of $G''$. The pair $G''$ represents the attributes of a node-attributed graph $G$. To unify notation, we define the Entropy of $G''$ for $C$ in the case of binary $D$-dimensional vectors as follows:

$$H(G'', C) = \sum_{C_k \in C} \frac{|C_k|}{|V|} H(C_k) \in [0, 1], \qquad H(C_k) = -\sum_{d=1}^{D} \frac{\varphi(p_{k,d})}{D \ln 2}, \tag{7}$$

where $\varphi(x) = x \ln x + (1 - x) \ln(1 - x)$ and $p_{k,d}$ is the proportion of nodes in the community $C_k$ with the same value on the $d$-th attribute.
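A minimal sketch of (7) for binary attribute vectors is given below. It assumes that $p_{k,d}$ is read as the share of nodes in $C_k$ whose $d$-th attribute equals 1 (since $\varphi(x) = \varphi(1-x)$, counting the value 0 instead gives the same result) and that $\varphi$ takes its limit value 0 at $x = 0$ and $x = 1$. The function names and the toy data are hypothetical.

```python
import math


def phi(p):
    """x ln x + (1 - x) ln(1 - x), extended by its limit value 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * math.log(p) + (1.0 - p) * math.log(1.0 - p)


def entropy(attrs, labels):
    """Entropy (7) of binary attribute vectors for the partition given by labels."""
    n, dim = len(attrs), len(attrs[0])
    total = 0.0
    for c in set(labels):
        members = [attrs[i] for i in range(n) if labels[i] == c]
        h_c = 0.0
        for d in range(dim):
            p = sum(a[d] for a in members) / len(members)  # share of ones on attribute d
            h_c -= phi(p) / (dim * math.log(2.0))
        total += len(members) / n * h_c                    # weight |C_k| / |V|
    return total


# Hypothetical data: two pairs of nodes with identical attribute vectors.
attrs = [[1, 0], [1, 0], [0, 1], [0, 1]]
h_pure = entropy(attrs, [0, 0, 1, 1])   # homogeneous communities
h_mixed = entropy(attrs, [0, 1, 0, 1])  # every attribute is split half and half
```

On this input the homogeneous partition attains the lower bound 0 and the fully mixed one attains the upper bound 1 of the range stated in (7).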

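To deal with the ground truth setting, we also recall the definition of Normalized Mutual Information (NMI), a ground truth-based measure that is widely used to compare the quality of node-attributed social network community detection methods when “the true partition” $C'$ is known [1, 2]. We use the variant from [19]. Suppose that

$$C = \{C_1, C_2, \dots, C_K\} \quad \text{and} \quad C' = \{C'_1, C'_2, \dots, C'_{K'}\}$$

are two partitions of a graph $(V, E, W)$ with $|V| = n$. Let $\eta_{ij}$ be the number of nodes that belong both to the community $C_i$ of the partition $C$ and to the community $C'_j$ of the partition $C'$, then

$$\mathrm{NMI}(C, C') = \frac{-2 \sum_{i=1}^{K} \sum_{j=1}^{K'} \eta_{ij} \log \dfrac{\eta_{ij}\, n}{\eta_{i\cdot}\, \eta_{\cdot j}}}{\sum_{i=1}^{K} \eta_{i\cdot} \log \dfrac{\eta_{i\cdot}}{n} + \sum_{j=1}^{K'} \eta_{\cdot j} \log \dfrac{\eta_{\cdot j}}{n}}, \tag{8}$$

where $\eta_{i\cdot} = \sum_{j=1}^{K'} \eta_{ij}$ and $\eta_{\cdot j} = \sum_{i=1}^{K} \eta_{ij}$.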
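Formula (8) can be evaluated from the contingency counts $\eta_{ij}$ alone. The sketch below assumes the usual convention that terms with $\eta_{ij} = 0$ contribute nothing, and it returns 1 in the degenerate case where both partitions consist of a single community (the denominator of (8) is then zero); both conventions and the function name are assumptions made for illustration.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized Mutual Information (8) of two labelings of the same node set."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # eta_ij, only non-zero entries are stored
    row = Counter(labels_a)                   # eta_i.
    col = Counter(labels_b)                   # eta_.j
    numerator = -2.0 * sum(
        count * math.log(count * n / (row[i] * col[j]))
        for (i, j), count in joint.items()
    )
    denominator = (sum(c * math.log(c / n) for c in row.values())
                   + sum(c * math.log(c / n) for c in col.values()))
    if denominator == 0.0:  # both partitions are trivial (assumed convention)
        return 1.0
    return numerator / denominator


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # identical partitions up to relabeling
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # partitions that share no information
```

The measure depends only on the partitions and not on the label values, so relabeling the communities leaves the result unchanged.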

Thus the solution to the node-attributed social network community detection problem within the weight-based fusion model, in the terms presented, is as follows:

– one finds $C_\alpha$ maximizing the Modularity of $G_\alpha$;

– $C_\alpha$ is used to calculate $Q(G', C_\alpha)$ and $H(G'', C_\alpha)$, where $G'$ and $G''$ are the structure and the attributes of the node-attributed graph $G$, respectively.

This scheme is usually applied implicitly, and there are no theoretical studies of why this jump from $Q(G_\alpha, C_\alpha)$ to $Q(G', C_\alpha)$ and $H(G'', C_\alpha)$ is reasonable. Another issue is that Entropy, which deals with vectors, seems unnatural given the intuition behind the weight-based fusion model, which aims at representing $G$ in a unified graph form. Using NMI based on the network’s ground truth as a quality measure for community detection is also commonly proposed in the literature. However, NMI is an external quality measure: it is not always available for algorithm evaluation and can also be misleading, since the ground truth may not depend on the network’s properties.

We will discuss these issues further in Sections 2 and 6.

1.3 Overview of the existing weight-based fusion models for node-attributed social network community detection

The family of weight-based fusion models that are ideologically close to (2) and the corresponding weight-based fusion model-based algorithms have been widely tested on synthetic and real-world node-attributed social networks and have shown their advantages over many other node-attributed social network community detection models, see [11, 15, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], etc., and the overview in [2]. In particular, the weight-based fusion model-based algorithms SAC2 [18] and CODICIL [24] are among the most influential ones for node-attributed social network community detection. Moreover, multiple experiments show that weight-based fusion model-based algorithms can be superior in community detection quality (in various terms) and efficiency to other highly influential algorithms for node-attributed social network community detection, e.g. to

– the node-augmented graph-based algorithms SA-Cluster [31] and Inc-Cluster [32, 33], as shown in [24, 25, 26, 27, 28, 34];

– the non-negative matrix factorization-based algorithm SCI [35], as shown in [30, 36];

– the probabilistic model-based algorithms PCL-DC [37], BAGC-GBAGC [38, 39] and CESNA [40], as shown in [24, 26, 28, 30, 34, 36].

The common part of the above-mentioned weight-based fusion models is (2), but the corresponding community detection and quality evaluation processes can be performed by different means.

Furthermore, there are many particular versions of (2) under different choices of the weighting functions $w_S$, $w_A$ and the fusion parameter $\alpha$ [2], but it seems to us that the most balanced one is that from [11, 29] with the following normalization:

$$w_S(e_{ij}) = \frac{\mu(e_{ij})}{\sum_{e_{ij} \in E} \mu(e_{ij})}, \qquad w_A(e_{ij}) = \frac{\nu(e_{ij})}{\sum_{e_{ij} \in E} \nu(e_{ij})}, \tag{9}$$

where $\mu$ and $\nu$ are structural and attributive weight functions, respectively. Among other things, it is shown in [29] that (2) with (9) produces normalized versions of many existing weight-based fusion models for different $\mu$, $\nu$ and $\alpha$.
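As an illustration of the normalization (9), the sketch below builds the attributive weights for binary attribute vectors with the attributive weight function taken to be the matching coefficient, i.e. the share of coordinates on which two vectors agree (the choice named in the caption of Figure 1.1). The function names and the four attribute vectors are hypothetical; other similarity functions, such as Jaccard or Cosine Similarity, could be substituted in the same way.

```python
def matching_coefficient(a, b):
    """Share of positions in which two equally long binary vectors agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def attributive_weights(attrs):
    """Weights w_A from (9), with nu taken to be the matching coefficient (an assumption)."""
    n = len(attrs)
    nu = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            nu[i][j] = nu[j][i] = matching_coefficient(attrs[i], attrs[j])
    total = sum(sum(row) for row in nu) / 2.0  # sum of nu over all edges
    if total == 0.0:
        raise ValueError("all attributive weights are zero, so (9) is undefined")
    return [[x / total for x in row] for row in nu]


# Hypothetical 2-dimensional binary attributes of four nodes.
attrs = [[1, 0], [1, 0], [0, 1], [1, 1]]
w_a = attributive_weights(attrs)  # w_a[0][1] = 0.4, w_a[0][3] = 0.2, w_a[0][2] = 0.0
```

After the division by the total, the weights satisfy the convention (1), so the resulting matrix can be used directly in the fusion step (2).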

[...] leads to the weight-based fusion model in [20], Jaccard and/or Cosine Similarity as $\nu$ makes (2) analogous to the weight-based fusion model in [24].

By using different heuristic thresholds for the weights in (2), one can decrease the computational complexity of community detection processes within the weight-based fusion models and obtain other versions of the weight-based fusion model, see [2].

As for $\alpha$ in (2), its tuning, or $\alpha$-tuning as we call it below, is difficult in general [1, 2]. As far as we know, there are no general $\alpha$-tuning schemes that provide the desired impact of the structure and the attributes on community detection results within the weight-based fusion model. Indeed, $\alpha$ is usually chosen manually, and such a choice may not be fully justified. For example, the weight-based fusion models in [15, 20, 21, 28] use $\alpha = 0$, while the ones in [18, 24, 27] set $\alpha = 0.5$ in experiments, on the assumption that this achieves the equal impact of the structure and the attributes. The absence of an explicit understanding of how to tune $\alpha$ within the weight-based fusion model makes the problem of reasonable $\alpha$-tuning important in the field.

What is more, we are unaware of any research devoted to the study of the objective function in the modularity-driven community detection process corresponding to the weight-based fusion model, except perhaps the survey [2], where the problem is only stated.


2 The objective function of the modularity-driven community detection process within the weight-based fusion model

We first substitute Entropy (7) by $Q(G_A, C)$, i.e. the Modularity of the attributive graph $G_A$ for a partition $C$. In these terms, if $C$ maximizes $Q(G_A, C)$, then the links between nodes in each community in $C$ have high attributive weights (minus the expected ones), i.e. the node attributes therein are homogeneous by construction. Additionally, the proposed measure works with graphs (in contrast to the Entropy, which works with vectors) and appears naturally in the weight-based fusion model, as is seen from the theoretical results below. Let us also mention that the Modularity of the attributive graph $G_A$ turns out to be more informative than the corresponding Entropy, as the experiments in Section 6.5 show.

Now we turn to the connection between the Modularity $Q(G_\alpha)$ and the community detection quality measures (in our case, $Q(G_S, C_\alpha)$ and $Q(G_A, C_\alpha)$). The connection is fully described by the following theoretical results for a fixed $G$, where we use the notation

$$k_h^{\star} = \sum_{l} w_{\star}(e_{hl}), \quad h \in \{i, j\}, \quad \star \in \{S, A\}.$$

Theorem 1. For any partition $C$, it holds that

$$Q(G_\alpha, C) = \alpha Q(G_S, C) + (1-\alpha) Q(G_A, C) + \alpha(1-\alpha) \Delta(G_S, G_A, C), \tag{10}$$

where $\Delta$ measures the difference of within-community node degrees in $G_S$ and $G_A$:

$$\Delta(G_S, G_A, C) = \frac{1}{4} \sum_{k=1}^{K} \Bigl( \sum_{v_i \in C_k} (k_i^S - k_i^A) \Bigr)^2 \ge 0. \tag{11}$$

What is more, the following inequality holds:

$$Q(G_\alpha, C) \ge \alpha Q(G_S, C) + (1-\alpha) Q(G_A, C), \tag{12}$$

and it is sharp for $\alpha = 0$ and $\alpha = 1$.

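Proof. Fix a partition $C$. We first rewrite the ingredients of (3) in terms of (2):

$$A_{ij} = \alpha w_S(e_{ij}) + (1-\alpha) w_A(e_{ij}), \qquad m = \sum_{e_{ij} \in E} w_\alpha(e_{ij}) = 1, \qquad k_h = \sum_{l} \bigl( \alpha w_S(e_{hl}) + (1-\alpha) w_A(e_{hl}) \bigr), \quad h \in \{i, j\}. \tag{13}$$

Furthermore,

$$k_i k_j = \bigl( \alpha k_i^S + (1-\alpha) k_i^A \bigr) \bigl( \alpha k_j^S + (1-\alpha) k_j^A \bigr) = \alpha^2 k_i^S k_j^S + \alpha(1-\alpha) \bigl( k_i^S k_j^A + k_i^A k_j^S \bigr) + (1-\alpha)^2 k_i^A k_j^A. \tag{14}$$

If one takes (13) and (14) into account, (3) can be rewritten in the form

$$Q(G_\alpha, C) = \alpha \cdot \frac{1}{2} \sum_{ij} \Bigl( w_S(e_{ij}) - \frac{1}{2} \alpha k_i^S k_j^S \Bigr) \delta_{ij} + (1-\alpha) \cdot \frac{1}{2} \sum_{ij} \Bigl( w_A(e_{ij}) - \frac{1}{2} (1-\alpha) k_i^A k_j^A \Bigr) \delta_{ij} - \alpha(1-\alpha) \cdot \frac{1}{2} \sum_{ij} \frac{1}{2} \bigl( k_i^S k_j^A + k_i^A k_j^S \bigr) \delta_{ij}.$$

Extracting $Q(G_S, C)$ and $Q(G_A, C)$ from this by (1) and (3) yields (10), where

$$\Delta(G_S, G_A, C) = \frac{1}{4} \sum_{ij} (k_i^S - k_i^A)(k_j^S - k_j^A)\, \delta_{ij}.$$

Furthermore, it can be easily seen that

$$\Delta(G_S, G_A, C) = \frac{1}{4} \sum_{k=1}^{K} \sum_{v_i \in C_k} (k_i^S - k_i^A) \sum_{v_j \in C_k} (k_j^S - k_j^A) = \frac{1}{4} \sum_{k=1}^{K} \Bigl( \sum_{v_i \in C_k} (k_i^S - k_i^A) \Bigr)^2,$$

where the last expression is clearly non-negative. This gives (11). Finally, (12) follows from (10) by (11). The sharpness of (12) immediately follows from (10).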

From a more general point of view, Theorem 1 connects the Modularities of two graphs with the Modularity of the graph whose weights are linear combinations of the weights of the two graphs. We consider it a key result for the analysis of modularity-driven models for node-attributed social network community detection in our future research, see Section 7.3.
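The identity of Theorem 1 is easy to check numerically. The sketch below draws two random complete weighted graphs normalized as in (1) and a random partition, and evaluates both sides of (10); here $\Delta$ is computed as one quarter of the sum over communities of the squared total degree difference, as obtained at the end of the proof. The random instances, the seeds and all function names are illustrative assumptions.

```python
import random


def modularity(w, labels):
    """Modularity (3) of a symmetric weight matrix for a given labeling."""
    k = [sum(row) for row in w]
    two_m = sum(k)
    n = len(w)
    return sum(w[i][j] - k[i] * k[j] / two_m
               for i in range(n) for j in range(n)
               if labels[i] == labels[j]) / two_m


def random_graph(n, rng):
    """Random complete weighted graph normalized as in (1)."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = rng.random()
    total = sum(sum(row) for row in w) / 2.0
    return [[x / total for x in row] for row in w]


def delta(w_s, w_a, labels):
    """Delta from Theorem 1: a quarter of the squared per-community degree gaps."""
    gap = {}
    for i, c in enumerate(labels):
        gap[c] = gap.get(c, 0.0) + sum(w_s[i]) - sum(w_a[i])
    return 0.25 * sum(g * g for g in gap.values())


def both_sides(alpha, n=8, communities=3, seed=0):
    """Left- and right-hand sides of (10) for one random instance."""
    rng = random.Random(seed)
    w_s, w_a = random_graph(n, rng), random_graph(n, rng)
    labels = [rng.randrange(communities) for _ in range(n)]
    w_alpha = [[alpha * w_s[i][j] + (1 - alpha) * w_a[i][j] for j in range(n)]
               for i in range(n)]
    lhs = modularity(w_alpha, labels)
    rhs = (alpha * modularity(w_s, labels)
           + (1 - alpha) * modularity(w_a, labels)
           + alpha * (1 - alpha) * delta(w_s, w_a, labels))
    return lhs, rhs


lhs, rhs = both_sides(alpha=0.3)  # the two values agree up to rounding error
```

Such a check does not replace the proof, but it is a convenient guard against sign or normalization slips when the formulas are implemented.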

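We continue by introducing additional notation for the case where $C = C_\alpha$ in Theorem 1:

$$\begin{aligned} Q(G_\alpha) =: Q^{\alpha}_{\mathrm{com}} &= Q^{\alpha}_{\mathrm{str}} + Q^{\alpha}_{\mathrm{attr}} + Q^{\alpha}_{\mathrm{dif}}, \\ Q^{\alpha}_{\mathrm{str}} &:= \alpha\, Q(G_S, C_\alpha), \\ Q^{\alpha}_{\mathrm{attr}} &:= (1-\alpha)\, Q(G_A, C_\alpha), \\ Q^{\alpha}_{\mathrm{dif}} &:= \alpha(1-\alpha)\, \Delta(G_S, G_A, C_\alpha), \end{aligned} \tag{15}$$

where we call

– $Q^{\alpha}_{\mathrm{com}}$ Composite Modularity,

– $Q^{\alpha}_{\mathrm{str}}$ Structural Modularity,

– $Q^{\alpha}_{\mathrm{attr}}$ Attributive Modularity,

– $Q^{\alpha}_{\mathrm{dif}}$ Differential Modularity.

Note that the Differential Modularity does not have the meaning of the Modularity itself but we call it so for uniformity.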

Thus, in the terms introduced, within the weight-based fusion model one maximizes the Composite Modularity $Q^{\alpha}_{\mathrm{com}}$, which consists not only of the two components used for quality evaluation (the Structural Modularity $Q^{\alpha}_{\mathrm{str}}$ and the Attributive Modularity $Q^{\alpha}_{\mathrm{attr}}$) but also of the additional non-negative Differential Modularity $Q^{\alpha}_{\mathrm{dif}}$ that measures the difference of within-community node degrees in $G_S$ and $G_A$.

Remark 1. According to our experimental results in Section 6 and the intuition behind the form of $Q^{\alpha}_{\mathrm{dif}}$ in Theorem 1, the Differential Modularity vanishes for many synthetic and real-world node-attributed social networks for $\alpha \in [0, 1]$. For this reason, we assume in the next section that $Q^{\alpha}_{\mathrm{dif}}$ is small enough with respect to the sum $Q^{\alpha}_{\mathrm{str}} + Q^{\alpha}_{\mathrm{attr}}$.


3 Parameter tuning scheme

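We now propose a simple non-manual $\alpha$-tuning scheme such that the impact of the structure and the attributes on the community detection results is well-controlled within the weight-based fusion model. Since our terms are unified for both components (in the sense that we use Modularity for both), it is justified to define $\alpha = \alpha^{*}_{\rho}$ satisfying

$$Q^{\alpha}_{\mathrm{str}} = \rho \bigl( Q^{\alpha}_{\mathrm{str}} + Q^{\alpha}_{\mathrm{attr}} \bigr), \qquad Q^{\alpha}_{\mathrm{attr}} = (1-\rho) \bigl( Q^{\alpha}_{\mathrm{str}} + Q^{\alpha}_{\mathrm{attr}} \bigr) \quad \text{for some } \rho \in [0, 1], \tag{16}$$

as the value for which $Q^{\alpha}_{\mathrm{str}}$ makes up $100\rho\%$ and $Q^{\alpha}_{\mathrm{attr}}$ makes up $100(1-\rho)\%$ of the sum $Q^{\alpha}_{\mathrm{str}} + Q^{\alpha}_{\mathrm{attr}}$, which is the major part of $Q^{\alpha}_{\mathrm{com}}$ according to Theorem 1 under the assumption that $Q^{\alpha}_{\mathrm{dif}}$ is vanishing. Equivalently, this $\alpha = \alpha^{*}_{\rho}$ is a solution to the equation

$$(1-\rho)\, Q^{\alpha}_{\mathrm{str}} = \rho\, Q^{\alpha}_{\mathrm{attr}}. \tag{17}$$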

We first prove the following statement.

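Theorem 2. Let $Q^{\alpha}_{\mathrm{str}} > 0$ and $Q^{\alpha}_{\mathrm{attr}} > 0$ for any $\alpha \in [0, 1]$. If it holds that

$$|Q(G_S, C_\alpha) - Q(G_S)| \le \epsilon\, Q(G_S), \qquad |Q(G_A, C_\alpha) - Q(G_A)| \le \epsilon\, Q(G_A), \tag{18}$$

for some $\epsilon$ such that $0 \le \epsilon \ll 1$, then $\alpha^{*}_{\rho}$ satisfies the inequalities:

$$\frac{1-\epsilon}{1+\epsilon} \le \alpha^{*}_{\rho} \cdot \frac{(1-\rho)\, Q(G_S) + \rho\, Q(G_A)}{\rho\, Q(G_A)} \le \frac{1+\epsilon}{1-\epsilon}. \tag{19}$$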

Proof. We rewrite (17) by (15) as

$$\alpha = \frac{\rho\, Q(G_A, C_\alpha)}{(1-\rho)\, Q(G_S, C_\alpha) + \rho\, Q(G_A, C_\alpha)}.$$

This and the conditions (18) immediately imply that (19) holds for $\alpha$ instead of $\alpha^{*}_{\rho}$. Furthermore, the conditions (18) guarantee that $Q^{\alpha}_{\mathrm{str}}$ and $Q^{\alpha}_{\mathrm{attr}}$ are well-approximated uniformly for any $\alpha \in [0, 1]$ by $\alpha Q(G_S)$ and $(1-\alpha) Q(G_A)$, respectively. These facts imply that (19) holds in particular for $\alpha = \alpha^{*}_{\rho}$.
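For the reader's convenience, the algebra behind the two steps of the proof can be written out explicitly. The display below is only a sketch under the assumptions of Theorem 2 with $0 \le \epsilon < 1$ and $\rho > 0$; it uses nothing beyond (15), (17) and (18).

```latex
\begin{align*}
% Step 1: substitute (15) into (17) and solve for alpha.
(1-\rho)\,\alpha\, Q(G_S, C_\alpha) &= \rho\,(1-\alpha)\, Q(G_A, C_\alpha)
\;\Longleftrightarrow\;
\alpha = \frac{\rho\, Q(G_A, C_\alpha)}{(1-\rho)\, Q(G_S, C_\alpha) + \rho\, Q(G_A, C_\alpha)}, \\[4pt]
% Step 2: two-sided bounds from (18), for \star \in \{S, A\}.
(1-\epsilon)\, Q(G_\star) &\le Q(G_\star, C_\alpha) \le (1+\epsilon)\, Q(G_\star), \\[4pt]
% Step 3: combine the bounds on the numerator and on the denominator.
\alpha \cdot \frac{(1-\rho)\, Q(G_S) + \rho\, Q(G_A)}{\rho\, Q(G_A)}
&= \frac{Q(G_A, C_\alpha)}{Q(G_A)} \cdot
   \frac{(1-\rho)\, Q(G_S) + \rho\, Q(G_A)}{(1-\rho)\, Q(G_S, C_\alpha) + \rho\, Q(G_A, C_\alpha)}
\in \left[ \frac{1-\epsilon}{1+\epsilon},\; \frac{1+\epsilon}{1-\epsilon} \right].
\end{align*}
```

In the last line the first factor lies in $[1-\epsilon, 1+\epsilon]$ and the second in $[1/(1+\epsilon), 1/(1-\epsilon)]$, which gives the interval in (19).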

As a result, Theorem 2 yields that, for a sufficiently small $\epsilon$, one can take

$$\tilde{\alpha}_{\rho} = \frac{\rho\, Q(G_A)}{(1-\rho)\, Q(G_S) + \rho\, Q(G_A)} \tag{20}$$

as a good approximation for $\alpha^{*}_{\rho}$ providing the above-mentioned impact of the structure and the attributes on the community detection results. What is more, our experiments in Section 6.4 suggest that this is indeed so. Notably, applying (20) requires only the Modularity values $Q(G_S)$ and $Q(G_A)$.

The proposed $\alpha$-tuning scheme is the first non-manual one providing the required impact of the components within the weight-based fusion model. A particular case of Theorem 2 for $\rho = \tfrac{1}{2}$, providing the equal impact of the structure and the attributes, was previously obtained in [12].
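Formula (20) is a one-line computation once the two Modularity values are known. The sketch below is a direct transcription; the function name and the sample values $Q(G_S) = 0.6$ and $Q(G_A) = 0.2$ are hypothetical and serve only to show the direction of the effect.

```python
def alpha_tilde(rho, q_s, q_a):
    """Approximate fusion parameter (20) for the desired structural share rho."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    if q_s <= 0.0 or q_a <= 0.0:
        raise ValueError("both Modularity values are assumed to be positive")
    return rho * q_a / ((1.0 - rho) * q_s + rho * q_a)


# Hypothetical Modularity values: the structure is three times "more modular".
alpha_equal = alpha_tilde(0.5, q_s=0.6, q_a=0.2)  # 0.25
```

In this example an equal impact ($\rho = 0.5$) is obtained with $\alpha = 0.25$ rather than $\alpha = 0.5$: the component with the larger Modularity receives the smaller fusion weight, so that $\alpha\, Q(G_S) = (1-\alpha)\, Q(G_A) = 0.15$.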

Remark 2. It seems that (16) and (17) can be rewritten for an $\alpha$-weighted version of Entropy (7) instead of $Q^{\alpha}_{\mathrm{attr}}$, so that one can obtain a similar $\alpha$-tuning scheme in this setting, too. However, in this case one in fact uses non-unified (and possibly non-comparable) terms for evaluating node-attributed social network community detection quality with respect to the structure and the attributes (Modularity and Entropy), and thus the corresponding results may be misleading. In Section 6.5, we also perform several experiments showing the difference in the behaviour of the Attributive Modularity and the Entropy and the presence of certain problems related to the Entropy.


4 Composite Modularity for almost directly proportional edge weights in the structural and attributive graphs

Based on Theorem 1, we now study the case where the corresponding edge weights in the structural and attributive graphs $G_S = (V, E, W_S)$ and $G_A = (V, E, W_A)$ are almost directly proportional with a positive coefficient (which clearly implies a strong positive correlation of the edge weights in the sense of Pearson or Spearman). We use the notation introduced in Sections 1.1 and 2. Let us start with the following observation. Suppose that for some fixed $a > 0$ it holds that

$$w_A(e_{ij}) = a\, w_S(e_{ij}), \quad e_{ij} \in E. \tag{21}$$

Under the condition (1), we get

$$1 = \sum_{e_{ij} \in E} w_A(e_{ij}) = a \sum_{e_{ij} \in E} w_S(e_{ij}) = a,$$

i.e. necessarily $a = 1$ in (21). Taking this into account, we prove the following result.

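Theorem 3. Let $G_S$ and $G_A$ in Theorem 1 be such that

$$0 \le w_A(e_{ij}) = w_S(e_{ij}) + \epsilon_{ij} \quad \text{for all } e_{ij} \in E \text{ and some } \epsilon_{ij} \in [-1, 1]. \tag{22}$$

If $\sum_{ij} |\epsilon_{ij}| \le \tfrac{\epsilon}{2} \le 1$, then for any $\alpha \in [0, 1]$ and any $C$

$$|Q(G_\alpha, C) - Q(G_S, C)| \le \epsilon. \tag{23}$$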

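Proof. We first rewrite the ingredients of (3) in terms of (22):

$$A_{ij} = \alpha w_S(e_{ij}) + (1-\alpha) w_A(e_{ij}) = \alpha w_S(e_{ij}) + w_S(e_{ij}) + \epsilon_{ij} - \alpha w_S(e_{ij}) - \alpha \epsilon_{ij} = w_S(e_{ij}) + (1-\alpha)\epsilon_{ij};$$

$$k_h = \sum_{l} \bigl( \alpha w_S(e_{hl}) + (1-\alpha) w_A(e_{hl}) \bigr) = \sum_{l} \bigl( w_S(e_{hl}) + (1-\alpha)\epsilon_{hl} \bigr) = k_h^S + (1-\alpha) \sum_{l} \epsilon_{hl}.$$

Furthermore,

$$k_i k_j = \Bigl( k_i^S + (1-\alpha) \sum_{l} \epsilon_{il} \Bigr) \Bigl( k_j^S + (1-\alpha) \sum_{l} \epsilon_{jl} \Bigr) = k_i^S k_j^S + (1-\alpha) \Bigl( k_i^S \sum_{l} \epsilon_{jl} + k_j^S \sum_{l} \epsilon_{il} \Bigr) + (1-\alpha)^2 \sum_{l} \epsilon_{il} \sum_{l} \epsilon_{jl}.$$

In these terms, we deduce from (3) for a fixed $C$ that

$$Q(G_\alpha, C) = Q(G_S, C) + I(\alpha, \epsilon, C),$$

where

$$I(\alpha, \epsilon, C) := \frac{1}{2} (1-\alpha) \sum_{ij} \Bigl( \epsilon_{ij} - \frac{1}{2} \Bigl( k_i^S \sum_{l} \epsilon_{jl} + k_j^S \sum_{l} \epsilon_{il} + (1-\alpha) \sum_{l} \epsilon_{il} \sum_{l} \epsilon_{jl} \Bigr) \Bigr) \delta_{ij}.$$

Now we estimate $|I(\alpha, \epsilon, C)|$. Firstly, by (1) and (4),

$$\Bigl| \sum_{ij} \Bigl( k_i^S \sum_{l} \epsilon_{jl} + k_j^S \sum_{l} \epsilon_{il} \Bigr) \delta_{ij} \Bigr| \le \sum_{i} k_i^S \sum_{jl} |\epsilon_{jl}| + \sum_{j} k_j^S \sum_{il} |\epsilon_{il}| = 4 \sum_{ij} |\epsilon_{ij}|.$$

Secondly,

$$\Bigl| \sum_{ij} \sum_{l} \epsilon_{il} \sum_{l} \epsilon_{jl}\, \delta_{ij} \Bigr| \le \sum_{il} |\epsilon_{il}| \sum_{jl} |\epsilon_{jl}| = \Bigl( \sum_{ij} |\epsilon_{ij}| \Bigr)^2.$$

Consequently,

$$|I(\alpha, \epsilon, C)| \le \frac{1}{2} (1-\alpha) \Bigl( \sum_{ij} |\epsilon_{ij}| + \frac{1}{2} \Bigl( 4 \sum_{ij} |\epsilon_{ij}| + (1-\alpha) \Bigl( \sum_{ij} |\epsilon_{ij}| \Bigr)^2 \Bigr) \Bigr) \le \frac{1}{2} \sum_{ij} |\epsilon_{ij}| \Bigl( 3 + \sum_{ij} |\epsilon_{ij}| \Bigr).$$

This yields that the following implication holds true:

$$\sum_{ij} |\epsilon_{ij}| \le \frac{\epsilon}{2} \le 1 \implies |I(\alpha, \epsilon, C)| \le \epsilon.$$

Consequently, it holds that

$$Q(G_\alpha, C) = Q(G_S, C) + I(\alpha, \epsilon, C), \qquad |I(\alpha, \epsilon, C)| \le \epsilon,$$

and this gives (23).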
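The bound (23) can also be probed numerically. The sketch below perturbs a random normalized structural graph by moving a small weight $t$ from one edge to another (so that (1) still holds for the attributive graph), sets $\epsilon = 2 \sum_{ij} |\epsilon_{ij}|$ with the sum taken over ordered pairs as in (3), and records the largest observed value of $|Q(G_\alpha, C) - Q(G_S, C)|$ over several partitions and values of $\alpha$. The perturbation scheme, the reading of the sum over ordered pairs and all names are assumptions made for illustration.

```python
import random


def modularity(w, labels):
    """Modularity (3) of a symmetric weight matrix for a given labeling."""
    k = [sum(row) for row in w]
    two_m = sum(k)
    n = len(w)
    return sum(w[i][j] - k[i] * k[j] / two_m
               for i in range(n) for j in range(n)
               if labels[i] == labels[j]) / two_m


def max_gap(t=1e-3, n=6, seed=1):
    """Largest observed |Q(G_alpha, C) - Q(G_S, C)| together with epsilon."""
    rng = random.Random(seed)
    w_s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w_s[i][j] = w_s[j][i] = 0.5 + rng.random()  # strictly positive weights
    total = sum(sum(row) for row in w_s) / 2.0
    w_s = [[x / total for x in row] for row in w_s]

    # Attributive graph: move weight t from edge (2, 3) to edge (0, 1).
    w_a = [row[:] for row in w_s]
    w_a[0][1] += t
    w_a[1][0] += t
    w_a[2][3] -= t
    w_a[3][2] -= t

    eps = 2.0 * sum(abs(w_a[i][j] - w_s[i][j]) for i in range(n) for j in range(n))
    gap = 0.0
    for _ in range(20):
        labels = [rng.randrange(3) for _ in range(n)]
        for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
            w_alpha = [[alpha * w_s[i][j] + (1 - alpha) * w_a[i][j]
                        for j in range(n)] for i in range(n)]
            gap = max(gap, abs(modularity(w_alpha, labels) - modularity(w_s, labels)))
    return gap, eps


gap, eps = max_gap()
```

In such experiments the observed gap is typically well below $\epsilon$, which is consistent with the fact that the estimates in the proof are not meant to be tight.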

Theorem 3 states that if the corresponding edge weights of $G_S$ and $G_A$ are almost directly proportional with a positive coefficient, then the Composite Modularity $Q(G_\alpha, C)$ is almost independent of $\alpha$. In this case, $G_S$ and $G_A$ almost duplicate each other, so that it may be reasonable to use only one of the components for community detection, i.e. to return to the classical community detection setting based either on the structure or on the attributes. In this sense, Theorem 3 partly explains a similar effect observed in the experiments in [11] (although for the Entropy instead of the Attributive Modularity).

Remark 3. Note that the assumptions of Theorem 3 also ensure the good quality (in the sense of small $\epsilon$) of the $\alpha$-tuning scheme in Section 3 in terms of Theorem 2.


From a more general point of view, $G_\alpha$ in Theorem 3 can be considered as a small variation of $G_S$ for sufficiently small $\epsilon$. Thus the Modularity of a graph changes only slightly under a small variation of its edge weights. This seems rather intuitive, but we are unaware of any corresponding quantitative estimates such as that in Theorem 3.

In Section 6 we also discuss the strong linear relationship between the structure and the attributes of $G$ in the following sense: in the weight-based fusion model setting, we provide a counterexample to the well-known node-attributed social network feature selection principle, which presumes that only the attributes “correlated with” or “tightly related to” the structure should be used in the node-attributed social network community detection process in order to obtain “better community detection results”.


5 Well-tunable Leiden-based weight-based fusion model for node-attributed social network community detection

Based on our theoretical results, in particular on Theorems 1 and 2, we now propose a well-tunable Leiden-based algorithm implementing the weight-based fusion model (2) under the normalization (9), see the Algorithm in Table 5.1.

The input of the algorithm is a node-attributed graph $G$ (see Section 1.1), the chosen structural and attributive weight functions $\mu$ and $\nu$ in (9), and the value of $\rho$ in (16) and (17), which indicates the desired impact of the structure and the attributes on the community detection results within the weight-based fusion model. Within the algorithm, $G$ is first converted into$^4$ $G_S$ and $G_A$ according to (9). After this, the Modularity values $Q(G_S)$ and $Q(G_A)$ are found by the Leiden algorithm [41]. For the chosen $\rho$, the values of $Q(G_S)$ and $Q(G_A)$ are further used for calculating $\tilde{\alpha}_{\rho}$ in (20) as an approximation of $\alpha^{*}_{\rho}$ in (16) and (17). The value $\tilde{\alpha}_{\rho}$ is then used in (2) under the normalization (9) to obtain $G_{\alpha^{*}_{\rho}}$ with the edge weights (2). Finally, $C_{\alpha^{*}_{\rho}}$ and all the corresponding Modularity values in (15) are found by Leiden and output by the algorithm.

4_{As before, G}


Table 5.1 – Leiden-based weight-based fusion model Algorithm for node-attributed social network community detection

Procedure: Weight-based fusion model (G, µ, ν, ρ)

G is a node-attributed graph, µ is a structural weight function, ν is an attributive weight function, ρ is a value indicating the desired impact of the structure and the attributes on community detection results

    GS ← µ(G)
    GA ← ν(G)
    CS ← Leiden(GS)
    CA ← Leiden(GA)
    α ← ρQ(GA, CA) / ((1 − ρ)Q(GS, CS) + ρQ(GA, CA))
    Gα ← FuseGraphs(GS, GA, α)
    Cα ← Leiden(Gα)
    Qα_str ← αQ(GS, Cα)
    Qα_attr ← (1 − α)Q(GA, Cα)
    Qα_diff ← α(1 − α)∆(GS, GA, Cα)
    Qα_com ← Qα_str + Qα_attr + Qα_diff
    return Cα, Qα_com, Qα_str, Qα_attr, Qα_diff

Procedure: FuseGraphs(GS, GA, α)

GS is a structural graph with the edge weights wS(eij), GA is an attributive graph with the edge weights wA(eij), α is a fusion parameter

    ΣS ← Σ_{eij∈E} wS(eij)
    ΣA ← Σ_{eij∈E} wA(eij)
    foreach eij ∈ E do
        wα(eij) ← αwS(eij)/ΣS + (1 − α)wA(eij)/ΣA

wα(eij) are the edge weights of Gα
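The two steps of Table 5.1 that do not depend on a particular Modularity optimizer can be sketched in plain Python. This is a minimal illustration, not the thesis implementation: the dictionary-based edge representation and the function names `tuned_alpha` and `fuse_graphs` are assumptions made here, and the Modularity values passed to `tuned_alpha` are supposed to come from separate Leiden runs on GS and GA.

```python
def tuned_alpha(q_s, q_a, rho):
    """Fusion parameter from Table 5.1: rho*Q_A / ((1 - rho)*Q_S + rho*Q_A)."""
    denom = (1.0 - rho) * q_s + rho * q_a
    if denom <= 0:
        raise ValueError("the tuning formula needs a positive denominator")
    return rho * q_a / denom


def fuse_graphs(w_s, w_a, alpha):
    """Fuse two edge-weight dicts {(i, j): weight} as in FuseGraphs.

    Each graph is normalized by its total weight before mixing; an edge
    missing from a dict is treated as having zero weight. Both dicts are
    assumed to use the same orientation (i, j) for every edge.
    """
    sigma_s = sum(w_s.values())
    sigma_a = sum(w_a.values())
    edges = set(w_s) | set(w_a)
    return {
        e: alpha * w_s.get(e, 0.0) / sigma_s
        + (1.0 - alpha) * w_a.get(e, 0.0) / sigma_a
        for e in edges
    }


# Hypothetical Modularity values and toy graphs, used only for illustration.
alpha = tuned_alpha(q_s=0.6, q_a=0.2, rho=0.5)
w_alpha = fuse_graphs({(0, 1): 2.0, (1, 2): 2.0}, {(0, 1): 1.0, (0, 2): 3.0}, alpha)
```

With ρ = 0.5 the formula balances αQ(GS) against (1 − α)Q(GA): here α = 0.25, and 0.25 · 0.6 = 0.75 · 0.2. The fused weights sum to 1 because each graph is normalized before mixing.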


We use the novel Leiden algorithm to optimize Modularity in Algorithm in Table 5.1 as, according to [41], it outperforms the widely used Louvain algorithm [13] in both community detection quality and execution time. Recall that Louvain is a popular choice in modularity-driven node-attributed social network community detection methods [2] and is state-of-the-art for Modularity optimization [42]. In particular, it is guaranteed [41] that Leiden always yields communities that are connected (unlike Louvain) and all subsets of which are locally optimally assigned. (Note that we used Louvain in the conference proceedings paper [12] that contains the results preliminary to those in the current study.)

The precise parameters and other details related to the versions of Leiden and Louvain used in our experiments will be indicated in Section 6.1.

As for the complexity of Algorithm in Table 5.1, the step where G is converted into GS and GA (recall that GS and GA are complete in our settings) is O(n²), while the Leiden-based optimization step seems to be at most O(n log n). Here we recall the observation in [41] that Leiden is faster than Louvain, whose complexity is believed to be O(n log n). The complexity of weight-based fusion models can possibly be decreased by applying different thresholds for the weights in (2) and the number of edges of G involved, see the tables in [2].


6 Experiments

6.1 Data, code and implementation settings

Before describing our experimental results, it is worth saying that we present on GitHub⁵ the publicly available datasets and the implementations used in our experiments.

For synthetic graph generation and graph representation we use networkx 2.5⁶ with the parameters declared in the experiments in Section 6.2. The real-world datasets used and their sources are described in Section 6.3. For Modularity optimization we rely on the following descriptions and implementations:

– Leiden [41], leidenalg 0.8.3⁷ for weighted graphs with Modularity Vertex Partition as quality function and the maximum possible number of iterations;

– Louvain [13], python-louvain 0.14⁸ for weighted graphs with the resolution parameter equal to 1.

In what follows, we use Leiden for Modularity optimization with the only exception of the experiments in Section 6.6 where Louvain is used as a competitor to Leiden.

Since different runs of Leiden and Louvain may lead to different communities, we average the results over 5 runs and indicate the corresponding standard deviation in the plots.
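The averaging over runs can be sketched as follows. The helper name `average_over_runs`, the stand-in `noisy_run` and the use of the sample (rather than population) standard deviation are assumptions of this sketch; in the actual experiments each run would be one call of Leiden or Louvain.

```python
import random
import statistics


def average_over_runs(run_once, n_runs=5, seed=0):
    """Call run_once(rng) n_runs times; return (mean, sample std) of the scores."""
    rng = random.Random(seed)
    scores = [run_once(rng) for _ in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)


def noisy_run(rng):
    """Stand-in for one randomized optimizer run returning a Modularity value."""
    return 0.40 + rng.uniform(-0.01, 0.01)


mean_q, std_q = average_over_runs(noisy_run)
```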

6.2 Synthetic node-attributed network datasets

Now we experimentally study the behaviour of the Modularities in (15) and illustrate Theorems 1 and 3. Since the theorems are stated for two graphs (not a node-attributed graph), in some experiments below we initially generate two graphs – GS and GA – instead of generating a node-attributed graph G and further converting it into GS and GA.

⁵ https://github.com/TimaGradov/Leidenweightbased fusion model
⁶ https://networkx.org/documentation/stable/index.html
⁷ https://leidenalg.readthedocs.io/en/stable/index.html
⁸ https://github.com/taynaud/python-louvain

Following Algorithm in Table 5.1, we use the weight-based fusion model (2) with the normalization (9) and detect communities in Gα by Leiden. However, our fusion parameter α runs from 0 to 1 with step 0.05 for illustrative purposes in the forthcoming experiments, while Algorithm in Table 5.1 works only with α = ˜αρ.
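A small sketch of the α grid used for such sweeps is given below. The helper `alpha_grid` is hypothetical; it builds the values from integer indices so that repeated floating-point addition does not shift the end point away from 1.

```python
def alpha_grid(step=0.05):
    """Return alpha values 0, step, ..., 1 built from integer indices."""
    n = round(1.0 / step)
    return [round(i * step, 10) for i in range(n + 1)]


alphas = alpha_grid()
```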

Graphs GS and GA are complete and weighted according to the definitions in Section 1.1. Some edge weights may be zero and then one should think that there is no structural or attributive link between the corresponding nodes. Below, if we say that a (complete) graph has M edges with nonzero weights, then the edge weights in the graph are set to zero for all edges except for M of them.

The precise parameters of experiments necessary for reproducing the results are indicated in the figure captions.

6.2.1 Barabási-Albert and Erdős-Rényi graph models

Here we generate random graphs GS and GA by the well-known Erdős-Rényi (ER) and Barabási-Albert (BA) models. Since Theorem 1 does not have any specific assumptions on GS and GA, the chosen models are quite suitable for illustrative purposes (although they may not produce strong community structure). The models are chosen as they generate graphs with different node degree distributions (recall (11) in Theorem 1) and thus may influence the behaviour of the Differential Modularity Qα_diff. What is more, we generate GS and GA in the pairs ER+ER, ER+BA and BA+BA. In each pair we consider the following two cases: (a) when Q(GS) and Q(GA) are almost equal and (b) when one of Q(GS) and Q(GA) is greater than the other. These proportions can be achieved by varying the number of edges with equal nonzero weights in GS and GA.

The parameters of the BA model are m0 ≥ 1, the number of nodes in the initial graph, the total number of nodes n, and m, the number of edges that every new node creates; the parameters of the ER model are the number of nodes n and the probability p of edge creation between every pair of nodes. For more details, we refer the reader e.g. to [43]. For implementation, we use networkx 2.5 Barabási-Albert graph⁹, where m0 = m, and networkx 2.5 Gn,p random graph¹⁰.

It turns out that the results in each pair ER+ER, ER+BA and BA+BA for the chosen parameters are very similar qualitatively (and even quantitatively). For this reason we provide and analyze only the results for the pair ER+BA, see Figure 6.1. First note that in both cases Qα_diff vanishes for all α ∈ [0, 1]. Thus even the difference in degree distribution does not make its values large. It can also be observed that the intersection point of the plots of the Structural and Attributive Modularities Qα_str and Qα_attr (corresponding to α∗_0.5 that makes (17) valid for ρ = 0.5) is closer to α = 0 when Q(GS) > Q(GA), see Figure 6.1 (b). In the opposite case, it is close to α = 1 (not shown), and it is close to α = 0.5 when the Modularities are almost equal, see Figure 6.1 (a).
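The position of the intersection point can be read off the tuning formula of Algorithm in Table 5.1, under the assumptions that ˜α_0.5 is a good approximant for α∗_0.5 and that both Modularity values are positive. The numerical values in the comment are hypothetical and serve only as an illustration.

```latex
\tilde{\alpha}_{\rho} = \frac{\rho\, Q(G_A)}{(1-\rho)\, Q(G_S) + \rho\, Q(G_A)}
\quad\Longrightarrow\quad
\tilde{\alpha}_{0.5} = \frac{Q(G_A)}{Q(G_S) + Q(G_A)},
\qquad
\tilde{\alpha}_{0.5} < 0.5 \iff Q(G_S) > Q(G_A).
% Hypothetical values: Q(G_S) = 0.15 and Q(G_A) = 0.05 give
% \tilde{\alpha}_{0.5} = 0.05 / 0.20 = 0.25, i.e. a point closer to \alpha = 0.
```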

Note that the Modularity values in Figure 6.1 are rather small, indicating weak community structure. To bypass this, we now perform experiments with the Lancichinetti-Fortunato-Radicchi (LFR) graph model [44] that generates graphs with a controlled community structure and that is often used to compare different community detection methods [26, 42].

⁹ https://networkx.org/documentation/stable/index.html
¹⁰ https://networkx.org/documentation/stable/index.html


[Figure 6.1: two panels, (a) and (b); vertical axis: Modularity; curves: Composite, Structural, Attributive, Differential]

Figure 6.1 – Modularities in the graph pair ER+BA: (a) Q(GS) ≈ Q(GA) (ER-based GS with 500 nodes and 6,210 edges with equal nonzero weights, p = 0.05; BA-based GA with 500 nodes and 6,331 edges with equal nonzero weights, m = 13), (b) Q(GS) > Q(GA) (ER-based GS with 500 nodes and 24,886 edges with equal nonzero weights, p = 0.2; BA-based GA with 500 nodes and 6,331 edges with equal nonzero weights, m = 13)

6.2.2 Lancichinetti-Fortunato-Radicchi benchmark

The LFR model generates graphs according to the chosen parameters controlling the heterogeneity in the distribution of node degrees and community size [44]. Namely, the parameters are: the number of vertices n, the mixing parameter µ, the average degree of vertices ⟨k⟩, the maximum degree of vertices kmax, the minimum degree of vertices kmin, the minimum community size cmin, the maximum community size cmax, and the exponents τ1 and τ2 of the power-law distributions of node degree and community size, respectively. The parameter µ ∈ [0, 1] controls the strength of community structure so that the smaller µ, the stronger the community structure.

For implementation, we use networkx 2.5 LFR benchmark graph¹¹, where the following parameters should be specified: n, ⟨k⟩, kmax, µ, τ1, τ2. One does not need to specify kmin in this version if ⟨k⟩ is specified, and vice versa. Furthermore, one does not need to specify cmin and cmax there, as cmin is set to be equal to kmin, and cmax to n.

We set τ1 = 3 and τ2 = 2 in all the experiments with LFR-based graphs below.

The purpose of the forthcoming four experiments is to show how differently the Modularities in (15) can behave.

First, we generate two different LFR-based graphs GS and GA with n = 3,000, ⟨k⟩ = 20, kmax = 150 and µ = 0.2. The corresponding results are in Figure 6.2 (a). While GS and GA are different, we have Q(GS) ≈ Q(GA). The behaviour of the Modularities is similar to that in Figure 6.1 (a); however, the switch around α = 0.5 is more dramatic, so that the Attributive Modularity Qα_attr for α > 0.5 and the Structural Modularity Qα_str for α < 0.5 are vanishing. This can be explained by the strong community structure of both GS and GA as µ = 0.2. Note that the Differential Modularity Qα_diff is also vanishing for all α ∈ [0, 1] here.

Now GS and GA are generated with the same set of parameters as before, with the only difference that µ = 0.8 for GA, i.e. the community structure of GA is weak. In this case (see Figure 6.2 (b)) one has Q(GS) > Q(GA) as in the experiment in Figure 6.1 (b). In general, the qualitative behaviour of the Modularities in Figure 6.1 (b) and Figure 6.2 (b) is very similar. The difference is quantitative: the LFR model produces graphs with higher Modularity values than the ER and BA models. Again, Qα_diff ≈ 0 for all α ∈ [0, 1] in Figure 6.2 (b).

Now we want to experimentally study the maximum possible Composite Modularity value for α not equal to 0 or 1. For this purpose, we initially generate a graph ˜G with n = 3,000 nodes and 748,500 edges with equal nonzero weights consisting of 6 equal fully connected components (with 500 nodes each). Furthermore, GS and GA with n = 3,000 nodes and 374,250 edges with equal nonzero weights are obtained from ˜G by randomly assigning one half of the edges of ˜G to GS and the other half to GA. By construction, the maximum Composite Modularity value should be attained at α = 0.5 as Gα ≡ ˜G there. This is indeed so, see Figure 6.3 (c). Note that Qα_diff is vanishing in this case for all α ∈ [0, 1], too. What is more, the maximum Composite Modularity value is ≈ 0.5 (Q(GS) + Q(GA)) here, as is prescribed by Theorem 1 and (15) for α = 0.5 and Q0.5_diff ≈ 0. For future analysis, Pearson’s correlation coefficient for the corresponding edge weights of GS and GA is ≈ −0.09 here.
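The construction of ˜G, GS and GA in this experiment can be sketched as follows; the function name `six_cliques_split` and the use of Python's `random.shuffle` for the random assignment are assumptions of this sketch, not a description of the thesis code.

```python
import random
from itertools import combinations


def six_cliques_split(n_components=6, size=500, seed=0):
    """Build disjoint cliques, then split their edges evenly at random."""
    edges = []
    for c in range(n_components):
        nodes = range(c * size, (c + 1) * size)
        edges.extend(combinations(nodes, 2))  # all pairs inside one component
    rng = random.Random(seed)
    rng.shuffle(edges)
    half = len(edges) // 2
    return edges, set(edges[:half]), set(edges[half:])


all_edges, e_s, e_a = six_cliques_split()
```

The edge counts reproduce those quoted above: 6 · 500 · 499 / 2 = 748,500 edges in total and 374,250 in each half. Since both halves have the same total weight, fusing them with α = 0.5 gives every edge of ˜G the same weight, which is why Gα coincides with ˜G up to scaling.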

The experiments performed so far hint that values of Qα_diff are always vanishing. However, this is not true, as Figure 6.3 (d) shows. In this experiment GS is LFR-based (n = 3,000, ⟨k⟩ = 20, kmax = 150, µ = 0.4) and GA is star graph-based (the base is the star graph with n = 3,000 nodes and 2,999 edges, other edge weights are zero). The difference in node degree distributions (recall (11) in Theorem 1) is so notable that the maximal value of Qα_diff for α ∈ [0, 1] is rather separated from zero. This result emphasizes how interestingly the weight-based fusion model may work for bizarre networks and that one should be aware of the Composite Modularity structure for proper interpretation of the community detection results. The traditional heuristics (see Section 1.1) do not give any hint on the Differential Modularity component.
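A toy-scale sketch of why a star graph produces a non-vanishing differential component is given below. It assumes that the Composite Modularity is the Modularity of the fused graph on a fixed partition, so that the differential part can be computed as the residual in the relation Qα_com = Qα_str + Qα_attr + Qα_diff of Table 5.1. The function names and the six-node graphs are hypothetical, and edges are keyed by (i, j) with the same orientation in both graphs.

```python
from collections import defaultdict


def modularity(weights, community):
    """Weighted Modularity of an undirected graph given as {(i, j): w}."""
    total = sum(weights.values())
    inside = defaultdict(float)    # weight of edges inside each community
    strength = defaultdict(float)  # summed node strengths per community
    for (i, j), w in weights.items():
        strength[community[i]] += w
        strength[community[j]] += w
        if community[i] == community[j]:
            inside[community[i]] += w
    return sum(inside[c] / total - (strength[c] / (2 * total)) ** 2
               for c in strength)


def differential_modularity(w_s, w_a, community, alpha):
    """Residual Q(G_alpha) - alpha*Q(G_S) - (1 - alpha)*Q(G_A), fixed partition."""
    s_tot, a_tot = sum(w_s.values()), sum(w_a.values())
    fused = {e: alpha * w_s.get(e, 0.0) / s_tot
             + (1 - alpha) * w_a.get(e, 0.0) / a_tot
             for e in set(w_s) | set(w_a)}
    return (modularity(fused, community)
            - alpha * modularity(w_s, community)
            - (1 - alpha) * modularity(w_a, community))


# Toy pair: two triangles joined by one edge versus a star centred at node 0.
w_s = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (3, 4): 1, (3, 5): 1, (4, 5): 1, (2, 3): 1}
w_a = {(0, k): 1 for k in range(1, 6)}
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
q_diff = differential_modularity(w_s, w_a, part, alpha=0.5)
```

For graphs normalized to unit total weight, a short calculation shows that this residual equals α(1 − α) Σ_c ((s_c − t_c)/2)², where s_c and t_c are the summed node strengths of community c in the two normalized graphs. It therefore vanishes when the two graphs distribute strength over the communities in the same way and grows when they do not, as with the star graph here (q_diff = 0.02 in this toy case).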


[Figure 6.2: two panels, (a) and (b); vertical axis: Modularity]

Figure 6.2 – The variety of Modularities behaviour for different LFR-based graph pairs: (a) Q(GS) ≈ Q(GA) (LFR-based GS of n = 3,000 nodes and 374,250 edges with equal nonzero weights, ⟨k⟩ = 20, kmax = 150, µ = 0.2; LFR-based GA of n = 3,000 nodes and 37,199 edges with equal nonzero weights, ⟨k⟩ = 20, kmax = 150, µ = 0.2); (b) Q(GS) > Q(GA) (LFR-based GS of n = 3,000 nodes and 374,250 edges with equal nonzero weights, ⟨k⟩ = 20, kmax = 150, µ = 0.2; LFR-based GA of n = 3,000 nodes and 38,091 edges with equal nonzero weights, ⟨k⟩ = 20, kmax = 150, µ = 0.8)


[Figure 6.3: two panels, (c) and (d); vertical axis: Modularity]

Figure 6.3 – The variety of Modularities behaviour for different LFR-based graph pairs: (c) GS and GA, each with n = 3,000 nodes and 374,250 edges, are disjoint random subgraphs of ˜G consisting of 6 fully connected components and n = 3,000 nodes and 748,500 edges; (d) maxα Qα_diff ≫ 0 (LFR-based GS of n = 3,000 nodes and 17,251 edges with equal nonzero weights, ⟨k⟩ = 20, kmax = 150, µ = 0.4; Star graph GA with n = 3,000 nodes and 2,999 edges with equal nonzero weights)

6.2.3 The almost direct proportionality of the edge weights of the graphs GS and GA

Now we experimentally illustrate the statement of Theorem 3. We generate GS as an LFR-based graph with n = 3,000, ⟨k⟩ = 25, kmax = 150 and µ = 0.5, and GA′ as an LFR-based graph with n = 3,000, ⟨k⟩ = 10, kmax = 150 and µ = 0.7; the set of its edge weights we denote by {wA′(eij)}. Now let GA be the graph with the following edge weights:

wA(eij) = γwA′(eij) + (1 − γ)wS(eij), eij ∈ E, (24)

where γ ∈ [0, 1] is fixed. Clearly, if γ = 0, then GA ≡ GS; if γ = 1, then GA ≡ GA′.

[Figure 6.4: vertical axis: Composite Modularity]

Figure 6.4 – Composite Modularity for Gα, based on GS and GA with different γ in (24), where GS is LFR with n = 3,000, ⟨k⟩ = 25, kmax = 150, µ = 0.5; GA′ is LFR with n = 3,000, ⟨k⟩ = 10, kmax = 150, µ = 0.7
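The mixing rule (24) and the way the linear relationship between the edge weights weakens as γ grows can be sketched as follows. The random 0/1 weights below are toy data rather than the LFR graphs of the experiment, and the helper names are assumptions of this sketch.

```python
import math
import random


def mix_weights(w_s, w_a_prime, gamma):
    """Edge weights of G_A by (24): gamma*w_A' + (1 - gamma)*w_S."""
    return [gamma * a + (1.0 - gamma) * s for s, a in zip(w_s, w_a_prime)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
# Toy 0/1 weights over the same ordered list of node pairs.
w_s = [float(rng.random() < 0.3) for _ in range(2000)]
w_a_prime = [float(rng.random() < 0.3) for _ in range(2000)]
correlations = {g: pearson(w_s, mix_weights(w_s, w_a_prime, g))
                for g in (0.0, 0.1, 0.2, 0.45, 0.55)}
```

For γ = 0 the two weight lists coincide and the coefficient is 1; it then decreases as the weights of GA′ gain influence in (24).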

In terms of Theorem 3, if γ is small enough (with respect to the chosen ϵ), then the edge weights of GS and GA satisfy (22) and $\sum_{ij} |\epsilon_{ij}| \le \epsilon/2 \le 1$ in Theorem 3. Consequently, one should expect according to (23) that for γ ≈ 0 the Composite Modularity Qα_com should be almost independent of α. This is indeed the case, as Figure 6.4 shows. What is more, Qα_com starts to depend essentially on α as γ grows. In particular, the behaviour of Qα_com changes dramatically after γ passes 0.5, when the edge weights of GA′ start dominating in (24).


find them in Table 6.1.

Theorem 3 and the experiments in this section are complementary to the heuristic conclusion in [11] that the strong positive correlation of the edge weights in GS and GA (by means of Pearson or Spearman) leads to the weak dependence of certain structural and attributive community detection quality measures on α within the weight-based fusion models. It is interesting that the statement “If the Composite Modularity is independent of α in (2) for all α ∈ [0, 1], then there is the strong positive correlation of the edge weights in GS and GA” is not true in general, as the experiment in Figure 6.3 (c) shows. The Composite Modularity is almost constant there, but the general scheme by which GS and GA are constructed does not imply the strong positive correlation. Recall that Pearson’s correlation coefficient there is ≈ −0.09.

6.2.4 Counterexample to the node-attributed social network feature selection principle

In this subsection we present a partial but important experiment in our settings that is related to the well-known node-attributed social network feature selection principle, which aims in particular at choosing the subset of attributes that are “relevant to” or “tightly correlated with” the structure, see e.g. [45, 46]. This principle is usually based on the suggestions that the mismatch between the structure and the attributes “negatively affects community detection quality” [46] and that the existence of structure-attributes correlation “offers a unique opportunity to improve the learning performance of various graph mining tasks” [45].

Table 6.1 – Correlation rates for different γ in (24) in Section 6.2.3

γ         0.00   0.10   0.20   0.45   0.55
Pearson   1.00   0.99   0.97   0.78   0.64

This principle has been found questionable already in [2, 11] in the weight-based fusion model settings. In particular, it was observed there that in the case where all the attributes are in strong linear relationship to the structure (by means of the edge weights of GS and GA), one in fact has two sources of information that mainly duplicate each other. In this sense, it also seems unreasonable to expect a significant improvement of the community detection quality by adding the attributes to the structure. Note that Theorem 3 confirms this point analytically in our settings.

Now we present a counterexample to the principle in the weight-based fusion model settings. Namely, we show that using only the edge weights in GA that are in strong linear relationship to the corresponding ones in GS within the community detection process may lead to the downgrade of the community detection quality, and this is not only in the sense of Modularity-based community detection quality measures as in (15) but even in that of ground truth-based measures.

In general, NMI(C, C′) ∈ [0, 1]; in particular, NMI(C, C′) = 1 if C = C′ and NMI(C, C′) = 0 if C and C′ are totally independent [19].
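The two boundary cases can be checked with a short NMI sketch. The normalization by the arithmetic mean of the two entropies is an assumption made here (several normalizations are in use, and the definition (8) is not restated in this section); the function name `nmi` is likewise hypothetical.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """NMI with arithmetic-mean normalization: 2*I(A;B) / (H(A) + H(B))."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mutual = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in joint.items())
    if h_a + h_b == 0:
        return 1.0  # both partitions consist of a single community
    return 2.0 * mutual / (h_a + h_b)


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # identical up to relabeling
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent
```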

We perform experiments with LFR-based graphs below and therefore let us mention that the ground truth partition C′ in the LFR model is the one generated at the step when the generated nodes with the chosen degree distribution are randomly assigned to the generated communities with the chosen size distribution, and before the generation of links between the communities. Recall the parameters τ1 and τ2 of the distributions in Section 6.2.2. The partition C in (8) is the above-mentioned Cα within the weight-based fusion model, see Section 1.1.

For the experiment, we generate two LFRbased graphs:

– GS with n = 1,000, ⟨k⟩ = 150, kmax = 450 and µ = 0.4;

– GA with n = 1,000, ⟨k⟩ = 100, kmax = 450 and µ = 0.1.

The ground truth partition C′ is related to GA here. Clearly, GA has a stronger


[Figure 6.5: two panels, (a) and (b); vertical axes: Modularity and NMI; curves: Composite, Structural, Attributive, Differential, NMI]

Figure 6.5 – Modularities and NMI for (a) GS (with n = 1,000, ⟨k⟩ = 150, kmax = 450 and µ = 0.4) and GA (with n = 1,000, ⟨k⟩ = 100, kmax = 450 and µ = 0.1), i.e. before the feature selection procedure, and (b) GS (with the same parameters) and GA′ (obtained as described in Section 6.2.4), i.e. after the feature selection procedure

For convenience, we assume that each nonzero weight in GS and GA equals 1, and the required normalization (9) is done at the very end of the procedure described in the next paragraph.

Let GA′ = GA. Let S be the set of node pairs (v, u) such that v and u are in the same community in C′ and the edge weight between v and u is 1 in GA and 0 in GS. We then set to zero the edge weights between the pairs of nodes in GA′ that are in S. Thus GA with respect to GA′ has a stronger community structure that is also more similar to C′. Consequently, the Modularity and NMI values for GA are higher than those for GA′. Furthermore, the linear relationship of the edge weights of GS and GA′ is stronger than that of GS and GA by construction. The above-described procedure actually imitates the node-attributed social network feature selection principle in our settings.

In our particular experiment GS, GA and GA′ have 98,266, 59,258 and 19,388 edges with equal nonzero weights, respectively. The set S contains 39,870 node pairs.
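The procedure can be sketched on toy data as follows. The sketch assumes that S consists of the intra-community pairs that are linked in GA but not in GS and that these links are removed to obtain GA′, which is what the edge counts above suggest (59,258 − 39,870 = 19,388). The function names and the six-node example are hypothetical.

```python
def select_features(edges_s, edges_a, community):
    """Return (S, edges of G_A') for 0/1-weighted graphs given as edge sets.

    S holds intra-community pairs linked in G_A but not in G_S; removing
    them from G_A yields G_A'. Edges are frozensets {u, v}.
    """
    s = {e for e in edges_a
         if e not in edges_s and len({community[v] for v in e}) == 1}
    return s, edges_a - s


def edge(u, v):
    """Undirected edge as an order-independent key."""
    return frozenset((u, v))


# Toy data: nodes 0-2 form one ground truth community, nodes 3-5 another.
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
edges_s = {edge(0, 1), edge(2, 3), edge(3, 4)}
edges_a = {edge(0, 1), edge(1, 2), edge(4, 5), edge(2, 3), edge(0, 5)}
s, edges_a_prime = select_features(edges_s, edges_a, community)
```

In the toy case the removed pairs are {1, 2} and {4, 5}: exactly the attributive links that support the ground truth communities but have no structural counterpart.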

The results for GS and GA are in Figure 6.5 (a), “before the feature selection”. The maximum of the Composite Modularity (≈ 0.47) and NMI (≈ 1.00) is attained at α = 0 here.

Once we exchange GA for GA′, we get the results in Figure 6.5 (b), “after the feature selection”. Comparing the plots in Figure 6.5, one sees that the maximum of the Composite Modularity drops to ≈ 0.24 and the NMI to ≈ 0.66 for α = 0. Note that Pearson’s correlation coefficient for the edge weights of GS and GA is ≈ 0.00 and that of GS and GA′ is ≈ 0.20.

These experiments show that one can get worse community detection results after applying a variant of the node-attributed social network feature selection procedure within the weight-based fusion model than for the initial graphs. It means that the procedure may be unsuitable in some cases and one should take this into account while applying it.

6.3 Real-world node-attributed network datasets

Now we are going to deal with the undirected versions of the following publicly available real-world node-attributed social network datasets:

– WebKB¹² (Cornell, Texas, Washington, and Wisconsin) is a collection of four networks with a total of 877 web pages (nodes) and 1,608 hyperlinks (edges) gathered from the Web sites of four different universities (each web page is associated with a binary vector whose elements indicate the presence of a word from the vocabulary on that web page; the vocabulary consists of 1,703 unique words; the ground truth partition is available);

– PolBlogs¹³ is a network of 1,490 web-blogs (nodes) on US politics with 19,090 hyperlinks (edges) between these web-blogs (each node has a binary attribute describing its political leaning as either liberal or conservative; the ground truth partition is not presented);

– Sinanet¹⁴ is a microblog user relationship network extracted from the Weibo website with 3,490 users (nodes) and 30,282 relationships (edges) (each node is attributed with 10-dimensional positive numerical attributes describing the interests of the user; the ground truth partition is available);

– Cora¹⁵ is a network of machine learning papers with 2,708 papers (nodes) and 5,429 citations (edges) (each node is attributed with a 1,433-dimensional binary vector indicating the absence/presence of words from the dictionary of words collected from the corpus of papers; the ground truth partition is available).

¹² https://linqs.soe.ucsc.edu/data

In the experiments below, the graphs GS and GA are constructed according to (2) and (9) with the following structural and attributive weight functions (the parameters of Algorithm in Table 5.1):

$$\mu(e_{ij}) = \begin{cases} 1, & \text{if } w(e_{ij}) = 1 \text{ in } (V, E), \\ 0, & \text{otherwise}, \end{cases} \qquad \nu(e_{ij}) = \frac{A(v_i) \cdot A(v_j)}{\|A(v_i)\|_2 \, \|A(v_j)\|_2}.$$

Note that ν is the well-known Cosine Similarity and ν(eij) ∈ [0, 1] in our case as all the attributes are nonnegative. It is also worth mentioning that the chosen µ and ν are among the most popular for this purpose [2] but, to be fair, the theorems proved above and Algorithm in Table 5.1 work for any reasonable nonnegative weight function.

¹³ http://www-personal.umich.edu/~mejn/netdata/
¹⁴ https://github.com/smileyan448/Sinanet
¹⁵ https://linqs.soe.ucsc.edu/data

[Figure 6.6: two panels, (a) and (b); vertical axes: Modularity and NMI]

Figure 6.6 – Modularities and NMI for the WebKB (Cornell, Texas) dataset: (a) Cornell, (b) Texas
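The two weight functions can be sketched in a few lines. The function names, the representation of edges as frozensets and the convention of returning 0 for an all-zero attribute vector are assumptions of this sketch.

```python
import math


def structural_weight(i, j, edges):
    """mu(e_ij): 1 if nodes i and j are linked in (V, E), otherwise 0."""
    return 1.0 if frozenset((i, j)) in edges else 0.0


def cosine_weight(a_i, a_j):
    """nu(e_ij): Cosine Similarity of two attribute vectors."""
    dot = sum(x * y for x, y in zip(a_i, a_j))
    norm = math.sqrt(sum(x * x for x in a_i)) * math.sqrt(sum(y * y for y in a_j))
    return dot / norm if norm > 0 else 0.0  # zero vectors: an assumption made here


# Two hypothetical binary bag-of-words vectors sharing two of their three words.
similarity = cosine_weight([1, 0, 1, 1], [1, 1, 0, 1])
linked = structural_weight(0, 1, {frozenset((1, 0))})
```

For nonnegative attribute vectors the dot product is nonnegative, so the similarity indeed stays in [0, 1]; in the example it equals 2/3.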

We present results for α ∈ [0, 1] with a certain step for clarity, while Algorithm in Table 5.1 finds the partition and the Modularity values for a single value of α = ˜αρ. In the forthcoming experiments, we additionally calculate the NMI values by (8) for the real-world datasets where the ground truth partition C′


[Figure 6.7: two panels, (a) and (b); vertical axes: Modularity and NMI]

Figure 6.7 – Modularities and NMI for the WebKB (Washington, Wisconsin) dataset: (a) Washington, (b) Wisconsin

based community detection quality measures and the ground truth-based ones. It is worth mentioning, however, that the ground truth methodology for evaluating community detection quality, although widely used in the field, seems somewhat questionable, especially for real-world networks, as “the ground truth” may reflect only one point of view on the network communities (among many possible). This problem is discussed e.g. in [8, 9, 10].

The results for the four networks of WebKB are presented in Figures 6.6 and 6.7. The behaviour of the Modularities is rather similar in each case. As in some experiments with synthetic graphs in Sections 6.2.1 and 6.2.2, one of Q(GS) and Q(GA) is greater than the other. However, Q(GS) is much greater than Q(GA) here, so the case is almost degenerate. It can be observed that GS has rather distinguishable communities, while GA does not. As a result, we have α∗_0.5 ≈ 0 (the α providing the equal impact of the structure and the attributes on the community detection results in the sense described in Section 3). At the same time, as the behaviour of NMI shows, the ground truth communities are highly related to the attributes in the sense that the highest NMI values are reached around α = 0 (actually exactly at α = 0 in Figures 6.6 (b), and 6.7 (a) and (b)). It is interesting that the highest NMI value in Figure 6.6 (a) corresponds to α∗_0.5. Qα_diff vanishes for all α ∈ [0, 1].

[Figure 6.8: vertical axis: Modularity]

Figure 6.8 – Modularities for the PolBlogs dataset

In contrast to WebKB, both GS and GA in PolBlogs have rather distinguishable communities, see Figure 6.8. As for the attributes, this is indeed reasonable as they are one-dimensional and binary. Qα_diff vanishes for all α ∈ [0, 1] in this case, too. The true distinctive feature of the PolBlogs results among those for the real-world networks under consideration is that the Modularities here change almost linearly with respect to α and thus the condition (18) in Theorem 2 is met with a small ϵ. In particular, this guarantees that the α-tuning scheme in Section 3 and
