MSc Computational Science
Master Thesis
A generalized weight-based fusion model for community detection in node-attributed social networks
by Timofei Gradov
13241028
June
Supervisor/Examiner:
Examiner:
ABSTRACT
The weight-based fusion model is among the simplest and most efficient models for modularity-driven community detection in node-attributed social networks, which contain both links between social actors (“structure”) and the actors’ feature vectors (“attributes”). Roughly speaking, the model first converts the attributes into an attributive network, so that one obtains two networks (structural and attributive) instead of the node-attributed network. The two networks are then fused into a composite one that is believed to contain the information about both the structure and the attributes and that can be fed directly to traditional modularity-driven graph community detection approaches. While the weight-based fusion model is widely used, it has been understudied analytically and has so far had only a heuristic foundation.
In this study, we disclose the mathematical machinery of a generalized weight-based fusion model by revealing the objective function of the corresponding community detection optimization process and establishing its connection with the traditional quality measures for the results of community detection in node-attributed networks. We also propose a pioneering non-manual parameter tuning scheme that provides the desired impact of the structure and the attributes on the community detection results within the weight-based fusion model. Based on the theoretical results obtained, we further present a well-tunable Leiden-based algorithm for community detection in node-attributed social networks that proves fast and accurate in our multiple experiments with synthetic and real-world datasets.
TABLE OF CONTENTS
INTRODUCTION
1 Necessary facts on the generalized weight-based fusion model
1.1 Description of the weight-based fusion model, the corresponding community detection problem and the heuristics involved
1.2 Precise definitions and reformulation of the node-attributed social network community detection problem
1.3 Overview of the existing weight-based fusion models for node-attributed social network community detection
2 The objective function of modularity-driven community detection process within the weight-based fusion model
3 Parameter tuning scheme
4 Composite Modularity for almost directly proportional edge weights in the structural and attributive graphs
5 Well-tunable Leiden-based weight-based fusion model for node-attributed social network community detection
6 Experiments
6.1 Data, code and implementation settings
6.2 Synthetic node-attributed network datasets
6.2.1 Barabási-Albert and Erdős-Rényi graph models
6.2.2 Lancichinetti-Fortunato-Radicchi benchmark
6.2.3 The almost direct proportionality of the edge weights of the graphs $G_S$ and $G_A$
6.2.4 Counterexample to the node-attributed social network feature selection principle
6.3 Real-world node-attributed network datasets
6.4 Evaluation of the parameter tuning scheme in Algorithm in Table 5.1
6.5 Comparison of Entropy and Attributes-aware Modularity
6.6 Different Modularity optimization algorithms
7 DISCUSSION
7.1 Ground Truth partition as quality measure
7.2 High Differential Modularity values
7.3 Processing large networks
CONCLUSION
INTRODUCTION
Community detection in node-attributed social networks is an actively studied problem in social network analysis [1, 2, 3] due to the necessity to explore a huge amount of real-world social network data containing both links between social actors (known as structure) and the actors’ feature vectors (known as attributes) indicating actors’ age, interests, etc. While classical community detection models deal either with the structure or with the attributes, community detection methods for node-attributed networks aim at the simultaneous usage, or fusion, of both. The motivation is that such fusion may clarify and enrich the knowledge about communities in a node-attributed social network [1, 2]. This idea is usually justified by the well-known homophily principle stating that like-minded social actors have a higher probability of being connected [4]. Furthermore, social science findings (see e.g. [5, 6]) suggest that the attributes can reflect and affect the community configuration of a node-attributed social network.
A variety of models and methods for community detection in node-attributed social networks have been proposed so far [1, 2, 3]. Although they are widely used in applications, some of them are understudied analytically and have only a heuristic foundation. In particular, it is observed in [2] that, from the optimization theory point of view, multiple community detection methods for node-attributed social networks lack a mathematically established connection between the objective function of the community detection optimization process and the measures used for community detection quality evaluation. Indeed, there are many examples of community detection models and methods for node-attributed social networks in [2] where one function is optimized within the community detection process and another function, a “quality measure” (not directly related to the former and, again, usually chosen heuristically from the many possible [1, 7]), is used for evaluating the optimization results. In our opinion, this lack may cause misinterpretation of the community detection results and difficulties with tuning the parameters of a community detection method for node-attributed social networks. Let us emphasize that now and throughout the research we mainly deal with structure- and attributes-aware quality measures (e.g. Modularity and Entropy) and rarely with those estimating the agreement between the detected communities and the ground truth ones. This is because the latter seem to be less formalized and their “truthfulness” may be questionable in certain cases, as the results in [8, 9, 10] and our own results below hint.
In this research we address a particular case of the above-mentioned general problem. Namely, our main objective is to disclose the mathematical machinery of a generalized weight-based fusion model, one of the simplest and most efficient models applied for modularity-driven community detection in node-attributed social networks. More than 20 papers are devoted to the family of weight-based fusion models, see [2]. Recall that, for a given node-attributed social network, a typical weight-based fusion model first converts the attributes into an attributive network, so that one obtains two networks (structural and attributive) instead of the node-attributed social network. The two networks are then fused into a composite one that is believed to contain the information about both the structure and the attributes and that can be fed directly to traditional modularity-driven graph community detection methods. We give the formal description of the weight-based fusion model under consideration, the corresponding community detection process and an overview of the existing weight-based fusion models in Section 1.
To meet the main objective, we aim at achieving the following goals:
– to reveal the objective function of the community detection optimization process within the weight-based fusion model;
– to establish the connection between the objective function and the related community detection quality measures;
– to propose a non-manual parameter tuning scheme that provides the desired impact of the structure and the attributes on community detection results within the weight-based fusion model;
– to propose a well-tunable algorithm for community detection in node-attributed social networks that is based on the weight-based fusion model and the well-known Leiden algorithm for Modularity maximization;
– to perform an experimental study of the algorithm on synthetic and real-world datasets.
The exposition in this thesis is partly based on the peer-reviewed papers [11] and [12], where the thesis author is a co-author. Furthermore, this thesis extends [12] with the following main contributions:
– we obtain further theoretical results on the objective function of the community detection optimization process; in particular, we show how it behaves when the edge weights in the structural and attributive graphs are almost directly proportional;
– we generalize the previous parameter tuning scheme from one providing only the equal impact of the structure and the attributes on community detection results within the weight-based fusion model to one providing any desired impact;
– we propose and show the efficiency of the well-tunable Leiden-based algorithm for community detection in node-attributed social networks, instead of the direct illustrative experiments with the well-known Louvain algorithm for Modularity maximization.
In both the papers and this thesis, the contribution of the thesis author is 50% to the theoretical results (the theorems) and 90% to the design and implementation of the algorithm and the experiments.
1 Necessary facts on the generalized weight-based fusion model
1.1 Description of the weight-based fusion model, the corresponding community detection problem and the heuristics involved
Below we first describe a general version of the weight-based fusion model and the related node-attributed social network community detection problem. Then we recount the community detection quality evaluation scheme within the weight-based fusion model and reveal the implicit heuristics usually used in it.
In this research we model a node-attributed social network as an undirected simple node-attributed graph G = (V, E, W, A), where $V = \{v_i\}_{i=1}^{n}$ is the set of nodes (representing social actors), $E = \{e_{ij}\}$ is the set of all possible links (representing social connections) between nodes in V (i.e. (V, E) is a complete graph), W is the set of non-negative edge weights (representing the strength of the actors’ social connections¹), and A is the set of attribute vectors $A(v_i) = \{a_d(v_i)\}_{d=1}^{D}$, $v_i \in V$, with non-negative² elements (representing social actors’ features). We suppose that (V, E) is complete to simplify further mathematical analysis. In what follows, (V, E, W) is called the structure and (V, A) the attributes of G.
Within a generalized weight-based fusion model, G is first converted into the following two undirected simple weighted complete graphs by a certain rule: the structural graph $G_S = (V, E, W_S)$ and the attributive graph $G_A = (V, E, W_A)$, where $W_S = \{w_S(e_{ij})\}$ and $W_A = \{w_A(e_{ij})\}$ are the sets of non-negative edge weights in each graph, see Figure 1.1. For convenience, we suppose that
$$\sum_{e_{ij}\in E} w_S(e_{ij}) = 1, \qquad \sum_{e_{ij}\in E} w_A(e_{ij}) = 1. \tag{1}$$
¹ An edge weight may be zero and this indicates that there is no social connection.
² For nominal or textual attributes, it is common to use one-hot encoding or embeddings to obtain their numerical
[Figure 1.1: node-attributed graph $G = (V, E, W, A)$ ⟹ structural graph $G_S = (V, E, W_S)$ and attributive graph $G_A = (V, E, W_A)$ ⟹ composite graph $G_\alpha = (V, E, W_\alpha)$]
Figure 1.1 – An example of the weight-based fusion model scheme for G with n = 5 and 2-dimensional binary attribute vectors. The edge weights in the attributive graph $G_A$ are computed by the normalized matching coefficient on the corresponding node attributes in G
Furthermore, the two graphs are fused to obtain the undirected simple complete weighted graph $G_\alpha = (V, E, W_\alpha)$, where the elements of $W_\alpha = \{w_\alpha(e_{ij})\}$, with $e_{ij} \in E$, are as follows (see Figure 1.1):
$$w_\alpha(e_{ij}) = \alpha\, w_S(e_{ij}) + (1-\alpha)\, w_A(e_{ij}), \quad \alpha \in [0, 1], \qquad \sum_{e_{ij}\in E} w_\alpha(e_{ij}) = 1. \tag{2}$$
Here α is a fusion parameter that controls the impact of $G_S$ and $G_A$ on $G_\alpha$. Note that $G_1 = G_S$ and $G_0 = G_A$ by construction. The graph $G_\alpha$ (which we call composite) is believed to contain the information about both the structure and the attributes of G; this belief is one of the traditional heuristics behind the family of weight-based fusion models.
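As an illustration of the fusion rule (2), the following minimal Python sketch fuses two edge-weight maps that are already normalized as in (1). The function name `fuse_weights` and the dictionary representation of the graphs are assumptions made for this example only; they are not taken from the thesis code.

```python
def fuse_weights(w_s, w_a, alpha):
    """Fuse two normalized edge-weight maps following rule (2).

    w_s, w_a: dicts mapping an undirected edge (i, j) to a non-negative weight,
              each assumed to sum to 1 as required by (1).
    alpha:    fusion parameter in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    edges = set(w_s) | set(w_a)
    # Assumption of this sketch: an edge absent from a map has weight zero,
    # which matches the complete-graph convention of the text.
    return {e: alpha * w_s.get(e, 0.0) + (1.0 - alpha) * w_a.get(e, 0.0)
            for e in edges}


# Toy example on three nodes (hypothetical weights).
w_s = {(0, 1): 0.5, (1, 2): 0.5, (0, 2): 0.0}
w_a = {(0, 1): 0.25, (1, 2): 0.25, (0, 2): 0.5}
w_half = fuse_weights(w_s, w_a, 0.5)
```

Because both inputs sum to 1, the fused weights sum to 1 for every α, and α = 1 or α = 0 returns the structural or the attributive weights unchanged.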
Recall that community detection in G consists in the unsupervised division of V into K disjoint³ communities $C_k \subset V$, with $C = \{C_k\}_{k=1}^{K}$, such that $V = \bigcup_{k=1}^{K} C_k$ and a certain balance between the following properties is achieved [1, 2]:
– structural closeness, i.e. nodes in a community are more densely connected than nodes in different communities;
– attributive homogeneity, i.e. nodes in a community have more similar attributes than nodes in different communities.
³ Communities may be overlapping if necessary but here we focus on disjoint ones.
In the context of the weight-based fusion model, the community detection problem consists in the unsupervised division of $G_\alpha$ into K disjoint communities $C_{k,\alpha} \subset V$, with $C_\alpha = \{C_{k,\alpha}\}_{k=1}^{K}$, such that $V = \bigcup_{k=1}^{K} C_{k,\alpha}$ and the nodes in each community $C_{k,\alpha}$ are structurally close and homogeneous in terms of attributes.
Since one deals with the weighted graph $G_\alpha$ within the weight-based fusion model, classical graph community detection methods can be applied to find $C_\alpha$. For example, a popular choice [2] is the Louvain algorithm [13], which aims at maximizing Newman’s Modularity [14], a measure of the divisibility of a graph into communities. In the context of the weight-based fusion model, the maximization of the Modularity of $G_\alpha$ is implicitly thought to provide structural closeness and attributive homogeneity also in G (such heuristics are applied e.g. in [15, 16, 17, 18]). Our concern is that the previous works on the weight-based fusion model did not discuss how the Modularity of $G_\alpha$ (i.e. the objective function of the community detection optimization process corresponding to the weight-based fusion model) is connected, say, with the Modularities of $G_S$ and $G_A$.
Following the above-mentioned implicit thought, the partition $C_\alpha$ maximizing the Modularity of $G_\alpha$ is treated as the one for measuring structural closeness and attributive homogeneity in the initial G. For example, $C_\alpha$ can be used for calculating the corresponding Modularity of $G_S$ (a popular measure of structural closeness) and the Entropy of subsets of the corresponding attributes in A (a popular measure of attributive homogeneity) [2]. Thus one objective function is optimized to detect communities, but the quality of the communities obtained is evaluated by measures not explicitly related to the objective function. From the optimization theory point of view, this seems to be a logical gap that may cause misinterpretation of community detection results and difficulties with tuning the parameter α in the weight-based fusion model. To fill the gap and reveal the true mathematical machinery of the weight-based fusion model, we will study in Section 2 how the Modularity of $G_\alpha$ (provided by $C_\alpha$) and $C_\alpha$-based quality measures on G relate to each other.
1.2 Precise definitions and reformulation of the node-attributed social network community detection problem
For the purposes of further analysis, we define the community detection quality measures that are often used for evaluating node-attributed social network community detection results within the weight-based fusion model, namely Modularity and Entropy, see e.g. [1]. The former works with graphs and the latter with sets of vectors. The community detection quality measures are usually chosen heuristically from the many possible, see [1, 2, 7].
Let G′ = (V, E, W) be an undirected simple weighted graph and C its disjoint partition. The graph G′ is the structure of a node-attributed graph G. The Modularity of G′ for C is as follows:
$$Q(G', C) = \frac{1}{2m} \sum_{ij} \Big( A_{ij} - \frac{k_i k_j}{2m} \Big) \delta_{ij} \in \Big[-\frac{1}{2}, 1\Big], \quad i, j = 1, \dots, n, \tag{3}$$
where
– $A_{ij}$ is the edge weight between nodes $v_i$ and $v_j$;
– $k_i$ and $k_j$ are the weighted degrees of $v_i$ and $v_j$, respectively, i.e.
$$k_h = \sum_{l} w(e_{hl}), \quad \text{where } h \in \{i, j\}; \tag{4}$$
– m is the sum of all edge weights in G′, i.e.
$$m = \sum_{e_{ij}\in E} w(e_{ij}); \tag{5}$$
– if $c_i$ and $c_j$ are the community labels of nodes $v_i$ and $v_j$, respectively, then
$$\delta_{ij} = \begin{cases} 1, & c_i = c_j, \\ 0, & \text{otherwise}. \end{cases}$$
The Modularity of G′ is then defined as
$$Q(G') = \max_{C} Q(G', C). \tag{6}$$
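The definition (3) can be evaluated directly for a given partition. The sketch below is one possible pure-Python reading of (3)–(5); the dictionary-based graph representation and the function name `modularity` are assumptions of this example. It uses the fact that the double sum of $k_i k_j \delta_{ij}$ over ordered pairs equals the sum of squared total degrees of the communities.

```python
def modularity(weights, labels):
    """Weighted Modularity (3) of a given partition.

    weights: dict mapping an undirected edge (i, j), i != j, to its weight.
    labels:  dict mapping every node to its community label.
    """
    m = sum(weights.values())              # total edge weight, as in (5)
    degree = {v: 0.0 for v in labels}      # weighted degrees, as in (4)
    for (i, j), w in weights.items():
        degree[i] += w
        degree[j] += w
    # Sum of A_ij * delta_ij over ordered pairs: each inner edge counts twice.
    inside = sum(2.0 * w for (i, j), w in weights.items()
                 if labels[i] == labels[j])
    # Sum of k_i * k_j * delta_ij over ordered pairs (i == j included)
    # equals the squared total degree of each community.
    totals = {}
    for v, c in labels.items():
        totals[c] = totals.get(c, 0.0) + degree[v]
    expected = sum(t * t for t in totals.values()) / (2.0 * m)
    return (inside - expected) / (2.0 * m)


# Toy graph: two disconnected edges, each forming its own community.
toy_weights = {(0, 1): 1.0, (2, 3): 1.0}
q_split = modularity(toy_weights, {0: "a", 1: "a", 2: "b", 3: "b"})
q_single = modularity(toy_weights, {0: "a", 1: "a", 2: "a", 3: "a"})
```

The maximization in (6) is not performed here; it is the task of an optimizer such as Louvain or Leiden.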
Entropy H, defined for a pair G′′ = (V, A), measures the degree of disorder of the attribute vectors A within communities in the disjoint partition C of G′′. The pair G′′ represents the attributes of a node-attributed graph G. To unify notation, we define the Entropy of G′′ for C in the case of binary D-dimensional vectors as follows:
$$H(G'', C) = \sum_{C_k \in C} \frac{|C_k|}{|V|} H(C_k) \in [0, 1], \qquad H(C_k) = -\sum_{d=1}^{D} \frac{\varphi(p_{k,d})}{D \ln 2}, \tag{7}$$
where $\varphi(x) = x \ln x + (1 - x) \ln(1 - x)$ and $p_{k,d}$ is the proportion of nodes in the community $C_k$ with the same value on the d-th attribute.
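A minimal sketch of (7) for binary attribute vectors follows. It assumes that $p_{k,d}$ is taken as the share of nodes in $C_k$ whose d-th attribute equals 1 (by the symmetry φ(x) = φ(1 − x), taking the share of zeros gives the same value) and uses the usual convention 0 · ln 0 = 0. The names `phi` and `entropy` are illustrative.

```python
import math


def phi(x):
    # x ln x + (1 - x) ln(1 - x), with the convention 0 * ln 0 = 0
    return sum(t * math.log(t) for t in (x, 1.0 - x) if t > 0.0)


def entropy(attributes, labels):
    """Entropy (7) of binary attribute vectors for a given partition.

    attributes: dict mapping every node to a tuple of 0/1 values.
    labels:     dict mapping every node to its community label.
    """
    n = len(labels)
    dim = len(next(iter(attributes.values())))
    communities = {}
    for v, c in labels.items():
        communities.setdefault(c, []).append(v)
    total = 0.0
    for members in communities.values():
        h_k = 0.0
        for d in range(dim):
            # Assumption: p_{k,d} is the share of ones on attribute d.
            p = sum(attributes[v][d] for v in members) / len(members)
            h_k -= phi(p) / (dim * math.log(2))
        total += len(members) / n * h_k
    return total


toy_attrs = {0: (1, 0), 1: (1, 0), 2: (0, 1), 3: (0, 1)}
h_pure = entropy(toy_attrs, {0: "a", 1: "a", 2: "b", 3: "b"})
h_mixed = entropy(toy_attrs, {0: "a", 1: "a", 2: "a", 3: "a"})
```

Communities with identical attribute vectors give Entropy 0, while a community in which every attribute is split evenly gives the maximal value 1.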
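To deal with the ground truth setting, we also recall the definition of Normalized Mutual Information (NMI), a ground truth-based measure that is widely used to compare the quality of node-attributed social network community detection methods when “the true partition” C′ is known [1, 2]. We use the variant from [19]. Suppose that
$$C = \{C_1, C_2, \dots, C_K\} \quad \text{and} \quad C' = \{C'_1, C'_2, \dots, C'_{K'}\}$$
are two partitions of a graph (V, E, W) with |V| = n. Let $\eta_{ij}$ be the number of nodes shared by the i-th community of the partition C and the j-th community of the partition C′, then
$$\mathrm{NMI}(C, C') = \frac{-2 \sum_{i=1}^{K} \sum_{j=1}^{K'} \eta_{ij} \log \dfrac{\eta_{ij}\, n}{\eta_{i\cdot}\, \eta_{\cdot j}}}{\sum_{i=1}^{K} \eta_{i\cdot} \log \dfrac{\eta_{i\cdot}}{n} + \sum_{j=1}^{K'} \eta_{\cdot j} \log \dfrac{\eta_{\cdot j}}{n}}, \tag{8}$$
where $\eta_{i\cdot} = \sum_{j=1}^{K'} \eta_{ij}$ and $\eta_{\cdot j} = \sum_{i=1}^{K} \eta_{ij}$.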
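Formula (8) can be sketched in a few lines of Python. The contingency counts are built from two label maps; the function name `nmi` is illustrative, and returning 1.0 in the degenerate 0/0 case (both partitions consist of a single community) is an assumption of this sketch, not a convention taken from [19].

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized Mutual Information (8) between two partitions of the same nodes."""
    nodes = list(labels_a)
    n = len(nodes)
    joint = Counter((labels_a[v], labels_b[v]) for v in nodes)  # eta_ij
    row = Counter(labels_a[v] for v in nodes)                   # eta_i.
    col = Counter(labels_b[v] for v in nodes)                   # eta_.j
    # Terms with eta_ij = 0 contribute nothing and are simply absent from `joint`.
    numerator = -2.0 * sum(c * math.log(c * n / (row[i] * col[j]))
                           for (i, j), c in joint.items())
    denominator = (sum(c * math.log(c / n) for c in row.values())
                   + sum(c * math.log(c / n) for c in col.values()))
    # Assumption: the 0/0 case of two single-community partitions is reported as 1.0.
    return 1.0 if denominator == 0.0 else numerator / denominator


part = {0: 0, 1: 0, 2: 1, 3: 1}
crossed = {0: 0, 1: 1, 2: 0, 3: 1}
nmi_same = nmi(part, part)
nmi_crossed = nmi(part, crossed)
```

Identical partitions yield NMI = 1, while the two "crossed" partitions above share no information and yield NMI = 0.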
Thus the solution to the node-attributed social network community detection problem within the weight-based fusion model, in the terms presented, is as follows:
– one finds $C_\alpha$ maximizing the Modularity of $G_\alpha$,
– $C_\alpha$ is used to calculate $Q(G', C_\alpha)$ and $H(G'', C_\alpha)$, where G′ and G′′ are the structure and the attributes of the node-attributed graph G, respectively.
This scheme is usually applied implicitly, and there are no theoretical studies of why the jump from $Q(G_\alpha, C_\alpha)$ to $Q(G', C_\alpha)$ and $H(G'', C_\alpha)$ is reasonable. Another issue is that Entropy, which deals with vectors, seems unnatural given the intuition behind the weight-based fusion model, which aims at representing G in a unified graph form. NMI computed against a network’s Ground Truth is commonly proposed as a quality measure for community detection. However, it is an external quality measure that is not always available for algorithm evaluation and can also be misleading, since the Ground Truth may not depend on the network’s properties.
We will discuss these issues further in Sections 2 and 6.
1.3 Overview of the existing weight-based fusion models for node-attributed social network community detection
The family of weight-based fusion models that are ideologically close to (2), and the corresponding weight-based fusion model-based algorithms, have been widely tested on synthetic and real-world node-attributed social networks and have shown their advantages over many other node-attributed social network community detection models, see [11, 15, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], etc., and the overview in [2]. In particular, the weight-based fusion model-based algorithms SAC2 [18] and CODICIL [24] are among the most influential ones for node-attributed social network community detection. Moreover, multiple experiments show that weight-based fusion model-based algorithms can be superior in community detection quality (in various terms) and efficiency to other highly influential algorithms for node-attributed social network community detection, e.g. to
– the node-augmented graph-based algorithms SA-Cluster [31] and Inc-Cluster [32, 33], as shown in [24, 25, 26, 27, 28, 34];
– the non-negative matrix factorization-based algorithm SCI [35], as shown in [30, 36];
– the probabilistic model-based algorithms PCL-DC [37], BAGC/GBAGC [38, 39] and CESNA [40], as shown in [24, 26, 28, 30, 34, 36].
The common part of the above-mentioned weight-based fusion models is (2), but the corresponding community detection and quality evaluation processes can be performed by different means.
Furthermore, there are many particular versions of (2) under different choices of the weighting functions $w_S$, $w_A$ and the fusion parameter α [2], but it seems to us that the most balanced one is that from [11, 29] with the following normalization:
$$w_S(e_{ij}) = \frac{\mu(e_{ij})}{\sum_{e_{ij}\in E} \mu(e_{ij})}, \qquad w_A(e_{ij}) = \frac{\nu(e_{ij})}{\sum_{e_{ij}\in E} \nu(e_{ij})}, \tag{9}$$
where µ and ν are structural and attributive weight functions, respectively. Among other things, it is shown in [29] that (2) with (9) produces normalized versions of many existing weight-based fusion models for different µ, ν and α.
leads to the weight-based fusion model in [20], Jaccard and/or Cosine Similarity as ν makes (2) analogous to the weight-based fusion model in [24].
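The normalization (9) can be sketched as follows. As an illustrative choice of the attributive weight function ν, the example uses the matching coefficient mentioned in the caption of Figure 1.1 (the share of coordinates on which two attribute vectors agree); the helper names `normalize` and `matching_coefficient` and the toy attribute vectors are assumptions of this example.

```python
from itertools import combinations


def normalize(raw):
    """Divide every raw edge weight by the total, as in (9)."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("at least one edge weight must be positive")
    return {e: w / total for e, w in raw.items()}


def matching_coefficient(x, y):
    # Share of coordinates on which the two attribute vectors agree.
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Hypothetical binary attribute vectors of three nodes.
toy_attributes = {0: (1, 0), 1: (1, 0), 2: (1, 1)}
raw_attr = {(i, j): matching_coefficient(toy_attributes[i], toy_attributes[j])
            for i, j in combinations(sorted(toy_attributes), 2)}
w_attr = normalize(raw_attr)
```

After normalization the weights satisfy condition (1), so the result can be used directly as $w_A$ in the fusion rule (2).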
By using different heuristic thresholds for the weights in (2), one can decrease the computational complexity of community detection processes within the weight-based fusion models and obtain other versions of the weight-based fusion model, see [2].
As for α in (2), its tuning, or α-tuning as we call it below, is difficult in general [1, 2]. As far as we know, there are no general α-tuning schemes that provide the desired impact of the structure and the attributes on community detection results within the weight-based fusion model. Indeed, α is usually chosen manually and such a choice may not be fully justified. For example, the weight-based fusion models in [15, 20, 21, 28] use α = 0, while the ones in [18, 24, 27] set α = 0.5 in experiments, aiming to achieve the equal impact of the structure and the attributes. The absence of an explicit understanding of how to tune α within the weight-based fusion model makes the problem of reasonable α-tuning important in the field.
What is more, we are unaware of any research devoted to the study of the objective function in the modularity-driven community detection process corresponding to the weight-based fusion model, perhaps besides the survey [2], where the problem is only stated.
2 The objective function of modularity-driven community detection process within the weight-based fusion model
We first replace Entropy (7) by $Q(G_A, C)$, i.e. the Modularity of the attributive graph $G_A$ for a partition C. In these terms, if C maximizes $Q(G_A, C)$, then the links between nodes in each community in C have high attributive weights (minus the expected ones), i.e. the node attributes therein are homogeneous by construction. Additionally, the proposed measure works with graphs (as opposed to the Entropy, which works with vectors) and naturally appears in the weight-based fusion model, as is seen from the theoretical results below. Let us also mention that the Modularity of the attributive graph $G_A$ turns out to be more informative than the corresponding Entropy, as the experiments in Section 6.5 show.
Now we turn to the problem of the connection between the Modularity $Q(G_\alpha)$ and the community detection quality measures (in our case, $Q(G_S, C_\alpha)$ and $Q(G_A, C_\alpha)$). The solution to the problem is wholly provided by the following theoretical results for a fixed G, where we assume that
$$k_h^{\star} = \sum_{l} w_{\star}(e_{hl}), \quad h \in \{i, j\}, \quad \star \in \{S, A\}.$$
Theorem 1. For any partition C, it holds that
$$Q(G_\alpha, C) = \alpha\, Q(G_S, C) + (1-\alpha)\, Q(G_A, C) + \alpha(1-\alpha)\, \Delta(G_S, G_A, C), \tag{10}$$
where ∆ measures the difference of the within-community node degrees in $G_S$ and $G_A$:
$$\Delta(G_S, G_A, C) = \frac{1}{4} \sum_{k=1}^{K} \Big( \sum_{v_i \in C_k} (k_i^S - k_i^A) \Big)^2 \ge 0. \tag{11}$$
What is more, the following inequality holds:
$$Q(G_\alpha, C) \ge \alpha\, Q(G_S, C) + (1-\alpha)\, Q(G_A, C), \tag{12}$$
and it is sharp for α = 0 and α = 1.
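Proof. Fix a partition C. We first rewrite the ingredients of (3) in terms of (2):
$$A_{ij} = \alpha\, w_S(e_{ij}) + (1-\alpha)\, w_A(e_{ij}), \qquad m = \sum_{e_{ij}\in E} w_\alpha(e_{ij}) = 1, \qquad k_h = \sum_{l} \big( \alpha\, w_S(e_{hl}) + (1-\alpha)\, w_A(e_{hl}) \big), \quad h \in \{i, j\}. \tag{13}$$
Furthermore,
$$k_i k_j = \big( \alpha k_i^S + (1-\alpha) k_i^A \big) \big( \alpha k_j^S + (1-\alpha) k_j^A \big) = \alpha^2 k_i^S k_j^S + \alpha(1-\alpha) \big( k_i^S k_j^A + k_i^A k_j^S \big) + (1-\alpha)^2 k_i^A k_j^A. \tag{14}$$
If one takes (13) and (14) into account, (3) can be rewritten in the form
$$Q(G_\alpha, C) = \alpha \cdot \frac{1}{2} \sum_{ij} \Big( w_S(e_{ij}) - \frac{1}{2} \alpha\, k_i^S k_j^S \Big) \delta_{ij} + (1-\alpha) \cdot \frac{1}{2} \sum_{ij} \Big( w_A(e_{ij}) - \frac{1}{2} (1-\alpha)\, k_i^A k_j^A \Big) \delta_{ij} - \alpha(1-\alpha) \cdot \frac{1}{2} \sum_{ij} \frac{1}{2} \big( k_i^S k_j^A + k_i^A k_j^S \big) \delta_{ij}.$$
Extracting $Q(G_S, C)$ and $Q(G_A, C)$ from this by (1) and (3) yields (10), where
$$\Delta(G_S, G_A, C) = \frac{1}{4} \sum_{ij} (k_i^S - k_i^A)(k_j^S - k_j^A)\, \delta_{ij}.$$
Furthermore, it can be easily seen that
$$\Delta(G_S, G_A, C) = \frac{1}{4} \sum_{k=1}^{K} \Big( \sum_{v_i \in C_k} (k_i^S - k_i^A) \Big) \Big( \sum_{v_j \in C_k} (k_j^S - k_j^A) \Big) = \frac{1}{4} \sum_{k=1}^{K} \Big( \sum_{v_i \in C_k} (k_i^S - k_i^A) \Big)^2,$$
where the last expression is clearly non-negative. This gives (11). Finally, (12) follows from (10) by (11). The sharpness of (12) immediately follows from (10).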
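The identity (10) is easy to check numerically. The sketch below draws two random normalized weight sets on a complete graph with six nodes, fuses them with α = 0.3 and compares both sides of (10) for a fixed partition. Here ∆ is computed as one quarter of the sum over communities of the squared within-community sum of degree differences, which is what the double sum over i, j with $\delta_{ij}$ in the proof reduces to. All names, the random seed and the chosen partition are illustrative assumptions.

```python
import random
from itertools import combinations


def degrees(weights, nodes):
    k = {v: 0.0 for v in nodes}
    for (i, j), w in weights.items():
        k[i] += w
        k[j] += w
    return k


def modularity(weights, labels):
    # Modularity (3) for a fixed partition.
    m = sum(weights.values())
    k = degrees(weights, labels)
    inside = sum(2.0 * w for (i, j), w in weights.items()
                 if labels[i] == labels[j])
    totals = {}
    for v, c in labels.items():
        totals[c] = totals.get(c, 0.0) + k[v]
    return (inside - sum(t * t for t in totals.values()) / (2.0 * m)) / (2.0 * m)


def delta(w_s, w_a, labels):
    # One quarter of the squared within-community sums of degree differences.
    k_s, k_a = degrees(w_s, labels), degrees(w_a, labels)
    gaps = {}
    for v, c in labels.items():
        gaps[c] = gaps.get(c, 0.0) + (k_s[v] - k_a[v])
    return 0.25 * sum(g * g for g in gaps.values())


def random_normalized(edges, rng):
    raw = {e: rng.random() for e in edges}
    total = sum(raw.values())
    return {e: w / total for e, w in raw.items()}


rng = random.Random(0)                      # fixed seed for reproducibility
nodes = list(range(6))
edges = list(combinations(nodes, 2))
w_s, w_a = random_normalized(edges, rng), random_normalized(edges, rng)
labels = {v: v % 2 for v in nodes}          # an arbitrary two-community partition
alpha = 0.3
w_alpha = {e: alpha * w_s[e] + (1 - alpha) * w_a[e] for e in edges}

lhs = modularity(w_alpha, labels)
rhs = (alpha * modularity(w_s, labels) + (1 - alpha) * modularity(w_a, labels)
       + alpha * (1 - alpha) * delta(w_s, w_a, labels))
```

Up to floating-point rounding, `lhs` and `rhs` coincide, and the inequality (12) follows because `delta` is a sum of squares.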
From a more general point of view, Theorem 1 connects the Modularities of two graphs with the Modularity of the graph whose weights are linear combinations of the weights of the two graphs. We consider it a key result for the analysis of modularity-driven models for node-attributed social network community detection in our future research, see Section 7.3.
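We continue by introducing additional notation for the case where $C = C_\alpha$ in Theorem 1:
$$Q(G_\alpha) =: Q^{\alpha}_{com} = Q^{\alpha}_{str} + Q^{\alpha}_{attr} + Q^{\alpha}_{dif}, \qquad Q^{\alpha}_{str} := \alpha\, Q(G_S, C_\alpha), \qquad Q^{\alpha}_{attr} := (1-\alpha)\, Q(G_A, C_\alpha), \qquad Q^{\alpha}_{dif} := \alpha(1-\alpha)\, \Delta(G_S, G_A, C_\alpha), \tag{15}$$
where we call
– $Q^{\alpha}_{com}$ Composite Modularity,
– $Q^{\alpha}_{str}$ Structural Modularity,
– $Q^{\alpha}_{attr}$ Attributive Modularity,
– $Q^{\alpha}_{dif}$ Differential Modularity.
Note that the Differential Modularity does not have the meaning of a Modularity itself but we call it so for uniformity.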
Thus, in the terms introduced, within the weight-based fusion model one maximizes the Composite Modularity $Q^{\alpha}_{com}$, which consists not only of the two components used for quality evaluation (the Structural Modularity $Q^{\alpha}_{str}$ and the Attributive Modularity $Q^{\alpha}_{attr}$) but also of the additional non-negative Differential Modularity $Q^{\alpha}_{dif}$, which measures the difference of the within-community node degrees in $G_S$ and $G_A$.
Remark 1. According to our experimental results in Section 6 and the intuition behind the form of $Q^{\alpha}_{dif}$ in Theorem 1, the Differential Modularity vanishes for many synthetic and real-world node-attributed social networks for α ∈ [0, 1]. For this reason, we assume in the next section that $Q^{\alpha}_{dif}$ is small enough with respect to the sum $Q^{\alpha}_{str} + Q^{\alpha}_{attr}$.
3 Parameter tuning scheme
We now propose a simple non-manual α-tuning scheme such that the impact of the structure and the attributes on the community detection results is well controlled within the weight-based fusion model. Since our terms are unified for both components (in the sense that we use Modularity for both), it is justified to define $\alpha = \alpha^{*}_{\rho}$ satisfying
$$Q^{\alpha}_{str} = \rho \big( Q^{\alpha}_{str} + Q^{\alpha}_{attr} \big), \qquad Q^{\alpha}_{attr} = (1-\rho) \big( Q^{\alpha}_{str} + Q^{\alpha}_{attr} \big) \quad \text{for some } \rho \in [0, 1], \tag{16}$$
as the value providing 100ρ% of $Q^{\alpha}_{str}$ and 100(1 − ρ)% of $Q^{\alpha}_{attr}$ in the sum $Q^{\alpha}_{str} + Q^{\alpha}_{attr}$, which is the major part of $Q^{\alpha}_{com}$ according to Theorem 1 under the assumption that $Q^{\alpha}_{dif}$ is vanishing. Equivalently, this $\alpha = \alpha^{*}_{\rho}$ is a solution to the equation
$$(1-\rho)\, Q^{\alpha}_{str} = \rho\, Q^{\alpha}_{attr}. \tag{17}$$
We first prove the following statement.
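Theorem 2. Let $Q^{\alpha}_{str} > 0$ and $Q^{\alpha}_{attr} > 0$ for any α ∈ [0, 1]. If it holds that
$$|Q(G_S, C_\alpha) - Q(G_S)| \le \epsilon\, Q(G_S), \qquad |Q(G_A, C_\alpha) - Q(G_A)| \le \epsilon\, Q(G_A), \tag{18}$$
for some ϵ such that 0 ≤ ϵ ≪ 1, then $\alpha^{*}_{\rho}$ satisfies the inequalities:
$$\frac{1-\epsilon}{1+\epsilon} \le \alpha^{*}_{\rho} \cdot \frac{(1-\rho)\, Q(G_S) + \rho\, Q(G_A)}{\rho\, Q(G_A)} \le \frac{1+\epsilon}{1-\epsilon}. \tag{19}$$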
Proof. We rewrite (17) by (15) as
$$\alpha = \frac{\rho\, Q(G_A, C_\alpha)}{(1-\rho)\, Q(G_S, C_\alpha) + \rho\, Q(G_A, C_\alpha)}.$$
This and the conditions (18) immediately imply that (19) holds for α instead of $\alpha^{*}_{\rho}$. Furthermore, the conditions (18) guarantee that $Q^{\alpha}_{str}$ and $Q^{\alpha}_{attr}$ are well approximated uniformly for any α ∈ [0, 1] by $\alpha\, Q(G_S)$ and $(1-\alpha)\, Q(G_A)$, respectively. These facts imply that (19) holds in particular for $\alpha = \alpha^{*}_{\rho}$.
As a result, Theorem 2 yields that for a sufficiently small ϵ one can take
$$\tilde{\alpha}_{\rho} = \frac{\rho\, Q(G_A)}{(1-\rho)\, Q(G_S) + \rho\, Q(G_A)} \tag{20}$$
as a good approximation for $\alpha^{*}_{\rho}$ providing the above-mentioned impact of the structure and the attributes on the community detection results. What is more, our experiments in Section 6.4 suggest that this is indeed so. It is interesting that applying (20) requires only the Modularity values $Q(G_S)$ and $Q(G_A)$.
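Formula (20) amounts to a one-line computation once $Q(G_S)$ and $Q(G_A)$ are known. A minimal sketch, where the function name `alpha_tilde` and the input checks are assumptions of this example:

```python
def alpha_tilde(rho, q_s, q_a):
    """Approximate fusion parameter (20).

    rho: desired share of the Structural Modularity, in [0, 1].
    q_s: Modularity Q(G_S) of the structural graph (assumed positive).
    q_a: Modularity Q(G_A) of the attributive graph (assumed positive).
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    if q_s <= 0.0 or q_a <= 0.0:
        raise ValueError("both Modularity values are assumed to be positive")
    return rho * q_a / ((1.0 - rho) * q_s + rho * q_a)


# Hypothetical Modularity values, for illustration only.
a_equal = alpha_tilde(0.5, 0.4, 0.4)
a_skewed = alpha_tilde(0.7, 0.6, 0.3)
```

By construction, the returned value satisfies (17) exactly when $Q(G_S, C_\alpha)$ and $Q(G_A, C_\alpha)$ are replaced by $Q(G_S)$ and $Q(G_A)$. Note that a graph with lower Modularity receives a larger weight in the fusion to reach the same share ρ.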
The proposed α-tuning scheme is the first non-manual one providing the required impact of the components within the weight-based fusion model. A particular case of Theorem 2 for $\rho = \frac{1}{2}$, providing the equal impact of the structure and the attributes, was previously obtained in [12].
Remark 2. It seems that (16) and (17) can be rewritten for an α-weighted version of Entropy (7) instead of $Q^{\alpha}_{attr}$, so that one can obtain a similar α-tuning scheme in these settings, too. However, in this case one in fact uses non-unified (and possibly non-comparable) terms for evaluating node-attributed social network community detection quality with respect to the structure and the attributes (Modularity and Entropy), and thus the corresponding results may be misleading. In Section 6.5, we will also perform several experiments showing the difference in the behaviour of the Attributive Modularity and the Entropy and the presence of certain problems related to the Entropy.
4 Composite Modularity for almost directly proportional edge weights in the structural and attributive graphs
Based on Theorem 1, we now study the case where the corresponding edge weights in the structural and attributive graphs $G_S = (V, E, W_S)$ and $G_A = (V, E, W_A)$ are almost directly proportional with a positive coefficient (which clearly implies a strong positive correlation of the edge weights in the sense of Pearson or Spearman). We use the notation introduced in Sections 1.1 and 2. Let us start with the following observation. Suppose that for some fixed a > 0 it holds that
$$w_A(e_{ij}) = a\, w_S(e_{ij}), \quad e_{ij} \in E. \tag{21}$$
Under the condition (1), we get
$$1 = \sum_{e_{ij}\in E} w_A(e_{ij}) = a \sum_{e_{ij}\in E} w_S(e_{ij}) = a,$$
i.e. necessarily a = 1 in (21). Taking this into account, we prove the following result.
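Theorem 3. Let $G_S$ and $G_A$ in Theorem 1 be such that
$$0 \le w_A(e_{ij}) = w_S(e_{ij}) + \epsilon_{ij} \quad \text{for all } e_{ij} \in E \text{ and some } \epsilon_{ij} \in [-1, 1]. \tag{22}$$
If $\sum_{ij} |\epsilon_{ij}| \le \frac{\epsilon}{2} \le 1$, then for any α ∈ [0, 1] and any C
$$|Q(G_\alpha, C) - Q(G_S, C)| \le \epsilon. \tag{23}$$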
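Proof. We first rewrite the ingredients of (3) in terms of (22):
$$A_{ij} = \alpha\, w_S(e_{ij}) + (1-\alpha)\, w_A(e_{ij}) = \alpha\, w_S(e_{ij}) + w_S(e_{ij}) + \epsilon_{ij} - \alpha\, w_S(e_{ij}) - \alpha\, \epsilon_{ij} = w_S(e_{ij}) + (1-\alpha)\, \epsilon_{ij};$$
$$k_h = \sum_{l} \big( \alpha\, w_S(e_{hl}) + (1-\alpha)\, w_A(e_{hl}) \big) = \sum_{l} \big( w_S(e_{hl}) + (1-\alpha)\, \epsilon_{hl} \big) = k_h^S + (1-\alpha) \sum_{l} \epsilon_{hl}.$$
Furthermore,
$$k_i k_j = \Big( k_i^S + (1-\alpha) \sum_{l} \epsilon_{il} \Big) \Big( k_j^S + (1-\alpha) \sum_{l} \epsilon_{jl} \Big) = k_i^S k_j^S + (1-\alpha) \Big( k_i^S \sum_{l} \epsilon_{jl} + k_j^S \sum_{l} \epsilon_{il} \Big) + (1-\alpha)^2 \sum_{l} \epsilon_{il} \sum_{l} \epsilon_{jl}.$$
In these terms, we deduce from (3) for a fixed C that
$$Q(G_\alpha, C) = Q(G_S, C) + I(\alpha, \epsilon, C),$$
where
$$I(\alpha, \epsilon, C) := \frac{1}{2} (1-\alpha) \sum_{ij} \Big( \epsilon_{ij} - \frac{1}{2} \Big( k_i^S \sum_{l} \epsilon_{jl} + k_j^S \sum_{l} \epsilon_{il} + (1-\alpha) \sum_{l} \epsilon_{il} \sum_{l} \epsilon_{jl} \Big) \Big) \delta_{ij}.$$
Now we estimate $|I(\alpha, \epsilon, C)|$. Firstly, by (1) and (4),
$$\Big| \sum_{ij} \Big( k_i^S \sum_{l} \epsilon_{jl} + k_j^S \sum_{l} \epsilon_{il} \Big) \delta_{ij} \Big| \le \sum_{i} k_i^S \sum_{jl} |\epsilon_{jl}| + \sum_{j} k_j^S \sum_{il} |\epsilon_{il}| = 4 \sum_{ij} |\epsilon_{ij}|.$$
Secondly,
$$\Big| \sum_{ij} \sum_{l} \epsilon_{il} \sum_{l} \epsilon_{jl}\, \delta_{ij} \Big| \le \sum_{il} |\epsilon_{il}| \sum_{jl} |\epsilon_{jl}| = \Big( \sum_{ij} |\epsilon_{ij}| \Big)^2.$$
Consequently,
$$|I(\alpha, \epsilon, C)| \le \frac{1}{2} (1-\alpha) \Big( \sum_{ij} |\epsilon_{ij}| + \frac{1}{2} \Big( 4 \sum_{ij} |\epsilon_{ij}| + (1-\alpha) \Big( \sum_{ij} |\epsilon_{ij}| \Big)^2 \Big) \Big) \le \frac{1}{2} \sum_{ij} |\epsilon_{ij}| \Big( 3 + \sum_{ij} |\epsilon_{ij}| \Big).$$
This yields that the following implication holds true:
$$\sum_{ij} |\epsilon_{ij}| \le \frac{\epsilon}{2} \le 1 \implies |I(\alpha, \epsilon, C)| \le \epsilon.$$
Consequently, it holds that
$$Q(G_\alpha, C) = Q(G_S, C) + I(\alpha, \epsilon, C), \qquad |I(\alpha, \epsilon, C)| \le \epsilon,$$
and this gives (23).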
Theorem 3 states that if the corresponding edge weights of $G_S$ and $G_A$ are almost directly proportional with a positive coefficient, then the Composite Modularity $Q(G_\alpha, C)$ is almost independent of α. In this case, $G_S$ and $G_A$ almost duplicate each other, so that it may be reasonable to use only one of the components for community detection, i.e. to come back to the classical community detection settings based either on the structure or on the attributes. In this sense Theorem 3 partly explains a similar effect observed in experiments in [11] (although for the Entropy instead of the Attributive Modularity).
Remark 3. Note that the assumptions of Theorem 3 also ensure the good quality (in the sense of small ϵ) of the α-tuning scheme in Section 3 in terms of Theorem 2.
From a more general point of view, $G_\alpha$ in Theorem 3 can be considered as a small variation of $G_S$ for sufficiently small ϵ. Thus the Modularity of a graph changes slightly under a small variation of its edge weights. This seems rather intuitive, but we are unaware of any corresponding quantitative estimates such as the one in Theorem 3.
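The estimate (23) can be illustrated numerically. In the sketch below, the attributive weights are a small perturbation of the structural ones, and ϵ is set to twice the sum of $|\epsilon_{ij}|$, where the sum is taken over ordered pairs (i, j); this reading of the sum is an assumption of the example and is the more conservative choice. The toy weights, the partition and all names are illustrative.

```python
from itertools import combinations


def modularity(weights, labels):
    # Modularity (3) for a fixed partition.
    m = sum(weights.values())
    k = {v: 0.0 for v in labels}
    for (i, j), w in weights.items():
        k[i] += w
        k[j] += w
    inside = sum(2.0 * w for (i, j), w in weights.items()
                 if labels[i] == labels[j])
    totals = {}
    for v, c in labels.items():
        totals[c] = totals.get(c, 0.0) + k[v]
    return (inside - sum(t * t for t in totals.values()) / (2.0 * m)) / (2.0 * m)


nodes = list(range(5))
edges = list(combinations(nodes, 2))
raw = {e: 1.0 + (3 * e[0] + e[1]) % 4 for e in edges}   # arbitrary positive weights
total = sum(raw.values())
w_s = {e: w / total for e, w in raw.items()}

# Attributive weights: structural weights pulled slightly towards uniform ones,
# so that both graphs stay normalized as in (1) and w_a stays non-negative.
t = 0.05
w_a = {e: (1 - t) * w_s[e] + t / len(edges) for e in edges}

# Sum of |eps_ij| over ordered pairs, then eps chosen so that the sum equals eps / 2.
eps_sum = 2.0 * sum(abs(w_a[e] - w_s[e]) for e in edges)
eps = 2.0 * eps_sum

labels = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}
gaps = []
for step in range(11):
    alpha = step / 10
    w_alpha = {e: alpha * w_s[e] + (1 - alpha) * w_a[e] for e in edges}
    gaps.append(abs(modularity(w_alpha, labels) - modularity(w_s, labels)))
worst_gap = max(gaps)
```

For every α on the grid, the Modularity of the fused graph stays within ϵ of the Modularity of the structural graph, in line with (23).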
In Section 6 we will also discuss the strong linear relationship between the structure and the attributes of G in the following sense: in the weight-based fusion model settings, we will provide a counterexample to the well-known node-attributed social network feature selection principle, which prescribes using in the node-attributed social network community detection process only the attributes “correlated with” or “tightly related to” the structure in order to obtain “better community detection results”.
5 Well-tunable Leiden-based weight-based fusion model for node-attributed social network community detection
Based on our theoretical results, in particular on Theorems 1 and 2, we now propose a well-tunable Leiden-based algorithm implementing the weight-based fusion model (2) under the normalization (9), see the Algorithm in Table 5.1.
The input of the algorithm is a node-attributed graph G (see Section 1.1), the chosen structural and attributive weight functions µ and ν in (9), and the value of ρ in (16) and (17), which indicates the desired impact of the structure and the attributes on the community detection results within the weight-based fusion model. Within the algorithm, G is first converted into⁴ GS and GA according to (9). After this, the Modularity values Q(GS) and Q(GA) are found by the Leiden algorithm [41]. For the chosen ρ, the values of Q(GS) and Q(GA) are then used to calculate ˜αρ in (20) as an approximant for α∗ρ in (16) and (17). The value ˜αρ is used in (2) under the normalization (9) to obtain Gα∗ρ with the edge weights (2). Finally, Cα∗ρ and all the corresponding Modularity values in (15) are found by Leiden and output by the algorithm.
4As before, G
Table 5.1 – Leiden-based weight-based fusion model Algorithm for node-attributed social network community detection

Procedure: Weight-based fusion model (G, µ, ν, ρ)
    G is a node-attributed graph, µ is a structural weight function, ν is an attributive weight function, ρ is a value indicating the desired impact of the structure and the attributes on community detection results
    GS ← µ(G)
    GA ← ν(G)
    CS ← Leiden(GS)
    CA ← Leiden(GA)
    α ← ρQ(GA, CA) / ((1 − ρ)Q(GS, CS) + ρQ(GA, CA))
    Gα ← FuseGraphs(GS, GA, α)
    Cα ← Leiden(Gα)
    Qαstr ← αQ(GS, Cα)
    Qαattr ← (1 − α)Q(GA, Cα)
    Qαdiff ← α(1 − α)∆(GS, GA, Cα)
    Qαcom ← Qαstr + Qαattr + Qαdiff
    return Cα, Qαcom, Qαstr, Qαattr, Qαdiff

Procedure FuseGraphs(GS, GA, α)
    GS is a structural graph with the edge weights wS(eij), GA is an attributive graph with the edge weights wA(eij), α is a fusion parameter
    ΣS ← Σ_{eij ∈ E} wS(eij)
    ΣA ← Σ_{eij ∈ E} wA(eij)
    foreach eij ∈ E do
        wα(eij) ← αwS(eij)/ΣS + (1 − α)wA(eij)/ΣA
    wα(eij) are the edge weights of Gα
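The fusion and tuning steps of Table 5.1 can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the thesis: the dictionary-based graph representation, the function names and the toy data are assumptions, and a fixed partition stands in for the partitions that Leiden would return.

```python
def fuse_graphs(w_s, w_a, alpha):
    """Fuse two weighted graphs stored as {(i, j): weight} dicts with i < j.

    Each graph is first normalised by its total edge weight, mirroring the
    FuseGraphs procedure in Table 5.1.
    """
    sigma_s = sum(w_s.values())
    sigma_a = sum(w_a.values())
    edges = set(w_s) | set(w_a)
    return {
        e: alpha * w_s.get(e, 0.0) / sigma_s
        + (1 - alpha) * w_a.get(e, 0.0) / sigma_a
        for e in edges
    }


def tuned_alpha(q_s, q_a, rho):
    """Fusion parameter from the alpha-assignment line of Table 5.1."""
    return rho * q_a / ((1 - rho) * q_s + rho * q_a)


def modularity(weights, communities):
    """Weighted Newman modularity of a partition given as a list of node sets."""
    total = sum(weights.values())
    label = {v: c for c, nodes in enumerate(communities) for v in nodes}
    inside = [0.0] * len(communities)    # edge weight inside each community
    strength = [0.0] * len(communities)  # summed weighted degree per community
    for (i, j), w in weights.items():
        strength[label[i]] += w
        strength[label[j]] += w
        if label[i] == label[j]:
            inside[label[i]] += w
    return sum(a / total - (d / (2 * total)) ** 2 for a, d in zip(inside, strength))


# Toy data (hypothetical): two triangles joined by one edge, and an
# attributive graph with unequal weights on a different edge set.
w_s = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (2, 3): 1, (3, 4): 1, (3, 5): 1, (4, 5): 1}
w_a = {(0, 1): 2, (1, 2): 1, (3, 4): 2, (4, 5): 1, (0, 5): 1}
partition = [{0, 1, 2}, {3, 4, 5}]  # stands in for the Leiden output

alpha = tuned_alpha(modularity(w_s, partition), modularity(w_a, partition), rho=0.5)
w_alpha = fuse_graphs(w_s, w_a, alpha)
q_com = modularity(w_alpha, partition)
q_str = alpha * modularity(w_s, partition)
q_attr = (1 - alpha) * modularity(w_a, partition)
# Residual term; Table 5.1 writes it as alpha * (1 - alpha) * Delta.
q_diff = q_com - q_str - q_attr
```

Computing the Differential Modularity as a residual avoids restating ∆ from (11); it relies only on the decomposition of the Composite Modularity used in Table 5.1.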
We use the novel Leiden algorithm to optimize Modularity in the Algorithm in Table 5.1 because, according to [41], it outperforms the widely used Louvain algorithm [13] in both community detection quality and execution time. Recall that Louvain is a popular choice in modularity-driven node-attributed social network community detection methods [2] and is state-of-the-art for Modularity optimization [42]. In particular, it is guaranteed [41] that Leiden always yields communities that are connected (in contrast to Louvain) and all of whose subsets are locally optimally assigned. (Note that we used Louvain in the conference proceedings paper [12], which contains results preliminary to those in the current study.)
The precise parameters and other details related to the versions of Leiden and Louvain used in our experiments will be indicated in Section 6.1.
As for the complexity of the Algorithm in Table 5.1, the step where G is converted into GS and GA (recall that GS and GA are complete in our settings) is O(n²), while the Leiden-based optimization step seems to be at most O(n log n). Here we recall the observation in [41] that Leiden is faster than Louvain, whose complexity is believed to be O(n log n). The complexity of weight-based fusion models can possibly be decreased by applying different thresholds for the weights in (2) and for the number of edges of G involved, see the tables in [2].
6 Experiments
6.1 Data, code and implementation settings
Before describing our experimental results, we note that the datasets and the implementations used in our experiments are publicly available on GitHub⁵.
For synthetic graph generation and graph representation we use networkx 2.5⁶ with the parameters declared in the experiments in Section 6.2. The real-world datasets used and their sources are described in Section 6.3. For Modularity optimization we rely on the following descriptions and implementations:
– Leiden [41], leidenalg 0.8.3⁷ for weighted graphs with Modularity Vertex Partition as the quality function and the maximum possible number of iterations;
– Louvain [13], python-louvain 0.14⁸ for weighted graphs with the resolution parameter equal to 1.
In what follows, we use Leiden for Modularity optimization with the only exception of the experiments in Section 6.6 where Louvain is used as a competitor to Leiden.
Since different runs of Leiden and Louvain may lead to different communities, we average the results over 5 runs and indicate the corresponding standard deviation in the plots.
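The averaging over runs can be sketched as follows. The helper name, the seed-per-run convention and the use of the sample (rather than population) standard deviation are assumptions made for illustration; `noisy_modularity` is a hypothetical stand-in for one run of Leiden or Louvain.

```python
import random
import statistics


def average_over_runs(run_once, n_runs=5):
    """Call run_once(seed) for seeds 0..n_runs-1; return (mean, sample std)."""
    values = [run_once(seed) for seed in range(n_runs)]
    return statistics.mean(values), statistics.stdev(values)


def noisy_modularity(seed):
    """Hypothetical stand-in for one stochastic community detection run."""
    rng = random.Random(seed)
    return 0.40 + rng.uniform(-0.01, 0.01)


mean_q, std_q = average_over_runs(noisy_modularity)
```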
6.2 Synthetic node-attributed network datasets
Now we experimentally study the behaviour of the Modularities in (15) and illustrate Theorems 1 and 3. Since the theorems are stated for two graphs (not a node-attributed graph), in some experiments below we initially generate two graphs, GS and GA, instead of generating a node-attributed graph G and then converting it into GS and GA.
5 https://github.com/TimaGradov/Leidenweightbased fusion model
6 https://networkx.org/documentation/stable/index.html
7 https://leidenalg.readthedocs.io/en/stable/index.html
8 https://github.com/taynaud/pythonlouvain
Following the Algorithm in Table 5.1, we use the weight-based fusion model (2) with the normalization (9) and detect communities in Gα by Leiden. However, for illustrative purposes our fusion parameter α runs from 0 to 1 with step 0.05 in the forthcoming experiments, while the Algorithm in Table 5.1 works only with α = ˜αρ.
Graphs GS and GA are complete and weighted according to the definitions in Section 1.1. Some edge weights may be zero, which means that there is no structural or attributive link between the corresponding nodes. Below, if we say that a (complete) graph has M edges with nonzero weights, then the edge weights in the graph are set to zero for all edges except for M of them.
The precise parameters of experiments necessary for reproducing the results are indicated in the figure captions.
6.2.1 Barabási-Albert and Erdős-Rényi graph models
Here we generate random graphs GS and GA by the well-known Erdős-Rényi (ER) and Barabási-Albert (BA) models. Since Theorem 1 does not impose any specific assumptions on GS and GA, the chosen models are quite suitable for illustrative purposes (although they may not produce strong community structure). The models are chosen as they generate graphs with different node degree distributions (recall (11) in Theorem 1) and thus may influence the behaviour of the Differential Modularity Qαdiff. What is more, we generate GS and GA in the pairs ER+ER, ER+BA and BA+BA. In each pair we consider the following two cases: (a) when Q(GS) and Q(GA) are almost equal and (b) when one of Q(GS) and Q(GA) is greater than the other. These proportions can be achieved by varying the number of edges with equal nonzero weights in GS and GA.
The parameters of the BA model are m0 ≥ 1, the number of nodes in the initial graph, and m, the number of edges that every new node creates; the parameters of the ER model are the number of nodes n and the probability p of edge creation between every pair of nodes. For more details, we refer the reader e.g. to [43]. For implementation, we use the networkx 2.5 Barabási-Albert graph⁹, where m0 = m, and the networkx 2.5 Gn,p random graph¹⁰.
It turns out that the results in each pair ER+ER, ER+BA and BA+BA for the chosen parameters are very similar qualitatively (and even quantitatively). For this reason we provide and analyze only the results for the pair ER+BA, see Figure 6.1. First note that in both cases Qαdiff vanishes for all α ∈ [0, 1]. Thus even the difference in degree distributions does not make its values large. It can also be observed that the intersection point of the plots of the Structural and Attributive Modularities Qαstr and Qαattr (corresponding to α∗0.5 that makes (17) valid for ρ = 0.5) is closer to α = 0 when Q(GS) > Q(GA), see Figure 6.1 (b). In the opposite case, it is close to α = 1 (not shown), and it is close to α = 0.5 when the Modularities are almost equal, see Figure 6.1 (a).
Note that the Modularity values in Figure 6.1 are rather small, indicating weak community structure. To overcome this, we now perform experiments with the Lancichinetti-Fortunato-Radicchi (LFR) graph model [44], which generates graphs with a controlled community structure and is often used to compare different community detection methods [26, 42].
9 https://networkx.org/documentation/stable/index.html
10 https://networkx.org/documentation/stable/index.html
[Figure 6.1: two panels, (a) and (b); vertical axis: Modularity; curves: Composite, Structural, Attributive, Differential]
Figure 6.1 – Modularities in the graph pair ER+BA: (a) Q(GS) ≈ Q(GA) (ER-based GS with 500 nodes and 6,210 edges with equal nonzero weights, p = 0.05; BA-based GA with 500 nodes and 6,331 edges with equal nonzero weights, m = 13), (b) Q(GS) > Q(GA) (ER-based GS with 500 nodes and 24,886 edges with equal nonzero weights, p = 0.2; BA-based GA with 500 nodes and 6,331 edges with equal nonzero weights, m = 13)
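As a quick sanity check, the edge counts in the caption of Figure 6.1 agree with what the two models predict for n = 500. The ER values below are expectations, so the observed counts fluctuate around them. The BA count assumes that the generator starts from m initial nodes without edges and that each of the remaining n − m nodes adds exactly m edges, which is our reading of the setting m0 = m rather than something stated in the text.

```latex
\begin{align*}
\text{ER, } p = 0.05:&\quad \mathbb{E}|E| = p\binom{n}{2} = 0.05 \cdot \frac{500 \cdot 499}{2} = 6{,}237.5 \quad (\text{observed } 6{,}210),\\
\text{ER, } p = 0.2:&\quad \mathbb{E}|E| = 0.2 \cdot 124{,}750 = 24{,}950 \quad (\text{observed } 24{,}886),\\
\text{BA, } m = 13:&\quad |E| = m\,(n - m) = 13 \cdot 487 = 6{,}331 \quad (\text{observed } 6{,}331).
\end{align*}
```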
6.2.2 Lancichinetti-Fortunato-Radicchi benchmark
The LFR model generates graphs according to the chosen parameters controlling the heterogeneity in the distributions of node degree and community size [44]. Namely, the parameters are: the number of vertices n, the mixing parameter µ, the average degree ⟨k⟩, the maximum degree kmax, the minimum degree of vertices kmin, the minimum community size cmin, the maximum community size cmax, and the exponents τ1 and τ2 of the power-law distributions of node degree and community size, respectively. The parameter µ ∈ [0, 1] controls the strength of community structure: the smaller µ, the stronger the community structure.
For implementation, we use the networkx 2.5 LFR benchmark graph¹¹, where the following parameters should be specified: n, ⟨k⟩, kmax, µ, τ1, τ2. One does not need to specify kmin in this version if ⟨k⟩ is specified, and vice versa. Furthermore, one does not need to specify cmin and cmax there, as cmin is set equal to kmin and cmax to n.
We set τ1 = 3 and τ2 = 2 in all the experiments with LFR-based graphs below.
The purpose of the forthcoming four experiments is to show how differently the Modularities in (15) can behave.
First, we generate two different LFR-based graphs GS and GA with n = 3,000, ⟨k⟩ = 20, kmax = 150 and µ = 0.2. The corresponding results are in Figure 6.2 (a). While GS and GA are different, we have Q(GS) ≈ Q(GA). The behaviour of the Modularities is similar to that in Figure 6.1 (a); however, the switch around α = 0.5 is more dramatic, so that the Attributive Modularity Qαattr for α > 0.5 and the Structural Modularity Qαstr for α < 0.5 vanish. This can be explained by the strong community structure of both GS and GA, as µ = 0.2. Note that the Differential Modularity Qαdiff also vanishes for all α ∈ [0, 1] here.
Now GS and GA are generated with the same set of parameters as before, with the only difference that µ = 0.8 for GA, i.e. the community structure of GA is weak. In this case (see Figure 6.2 (b)) one has Q(GS) > Q(GA), as in the experiment in Figure 6.1 (b). In general, the qualitative behaviour of the Modularities in Figure 6.1 (b) and Figure 6.2 (b) is very similar. The difference is quantitative: the LFR model produces graphs with higher Modularity values than the ER and BA models. Again, Qαdiff ≈ 0 for all α ∈ [0, 1] in Figure 6.2 (b).
Now we want to experimentally study the case where the maximum possible Composite Modularity value is attained for an α not equal to 0 or 1. For this purpose, we initially generate a graph ˜G with n = 3,000 nodes and 748,500 edges with equal nonzero weights consisting of 6 equal fully connected components (with 500 nodes each). Furthermore, GS and GA with n = 3,000 nodes and 374,250 edges with equal nonzero weights are obtained from ˜G by randomly assigning one half of the edges of ˜G to GS and the other half to GA. By construction, the maximum Composite Modularity value should be attained for α = 0.5, as Gα ≡ ˜G there. This is indeed so, see Figure 6.3 (c). Note that Qαdiff vanishes in this case for all α ∈ [0, 1], too. What is more, the maximum Composite Modularity value is ≈ 0.5 (Q(GS) + Q(GA)) here, as is prescribed by Theorem 1 and (15) for α = 0.5 and Q0.5diff ≈ 0. For later reference, Pearson’s correlation coefficient for the corresponding edge weights of GS and GA is ≈ −0.09 here.
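The numbers in this construction can be reproduced by hand. The following back-of-the-envelope check assumes that Pearson's coefficient is taken over all N node pairs of the complete graph and that GS and GA are disjoint 0/1 edge indicators (up to the normalization (9), which does not affect the correlation), each covering a fraction p of the pairs.

```latex
\begin{align*}
|E(\tilde G)| &= 6 \cdot \frac{500 \cdot 499}{2} = 748{,}500, \qquad
N = \frac{3{,}000 \cdot 2{,}999}{2} = 4{,}498{,}500, \qquad
p = \frac{374{,}250}{N} = \frac{499}{5{,}998} \approx 0.0832,\\
r &= \frac{\mathbb{E}[w_S w_A] - p^2}{p(1-p)} = \frac{0 - p^2}{p(1-p)} = -\frac{p}{1-p} = -\frac{499}{5{,}499} \approx -0.09,\\
Q(\tilde G, C) &= \sum_{c=1}^{6}\Big(\frac{1}{6} - \Big(\frac{1}{6}\Big)^2\Big) = \frac{5}{6} \approx 0.83
\quad \text{for the partition } C \text{ into the six components.}
\end{align*}
```

The product term vanishes because no node pair carries a nonzero weight in both graphs, which is exactly why the correlation is slightly negative rather than zero.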
The experiments performed so far hint that the values of Qαdiff always vanish. However, this is not true, as Figure 6.3 (d) shows. In this experiment GS is LFR-based (n = 3,000, ⟨k⟩ = 20, kmax = 150, µ = 0.4) and GA is star-graph-based (the base is the star graph with n = 3,000 nodes and 2,999 edges, other edge weights are zero). The difference in node degree distributions (recall (11) in Theorem 1) is so notable that the maximal value of Qαdiff for α ∈ [0, 1] is clearly separated from zero. This result emphasizes how interestingly the weight-based fusion model may work for bizarre networks and that one should be aware of the Composite Modularity structure for proper interpretation of the community detection results. The traditional heuristics (see Section 1.1) do not give any hint about the Differential Modularity component.
[Figure 6.2: two panels, (a) and (b); vertical axis: Modularity; curves: Composite, Structural, Attributive, Differential]
Figure 6.2 – The variety of Modularities behaviour for different LFR-based graph pairs: (a) Q(GS) ≈ Q(GA) (LFR-based GS of n = 3,000 nodes and 374,250 edges with equal nonzero weights, ⟨k⟩ = 20, kmax = 150, µ = 0.2; LFR-based GA of n = 3,000 nodes and 37,199 edges with equal nonzero weights, ⟨k⟩ = 20, kmax = 150, µ = 0.2); (b) Q(GS) > Q(GA) (LFR-based GS of n = 3,000 nodes and 374,250 edges with equal nonzero weights, ⟨k⟩ = 20, kmax = 150, µ = 0.2; LFR-based GA of n = 3,000 nodes and 38,091 edges with equal nonzero weights, ⟨k⟩ = 20, kmax = 150, µ = 0.8)
[Figure 6.3: two panels, (c) and (d); vertical axis: Modularity; curves: Composite, Structural, Attributive, Differential]
Figure 6.3 – The variety of Modularities behaviour for different LFR-based graph pairs: (c) GS and GA, each with n = 3,000 nodes and 374,250 edges, are disjoint random subgraphs of ˜G consisting of 6 fully connected components and n = 3,000 nodes and 748,500 edges; (d) maxα Qαdiff ≫ 0 (LFR-based GS of n = 3,000 nodes and 17,251 edges with equal nonzero weights, ⟨k⟩ = 20, kmax = 150, µ = 0.4; Star graph GA with n = 3,000 nodes and 2,999 edges with equal nonzero weights)
6.2.3 The almost direct proportionality of the edge weights of the graphs GS and GA
Now we experimentally illustrate the statement of Theorem 3. We generate GS as an LFR-based graph with n = 3,000, ⟨k⟩ = 25, kmax = 150 and µ = 0.5, and GA′ as an LFR-based graph with n = 3,000, ⟨k⟩ = 10, kmax = 150 and µ = 0.7; we denote the set of edge weights of GA′ by {wA′(eij)}. Now let GA be the graph with the following edge weights:
wA(eij) = γwA′(eij) + (1 − γ)wS(eij), eij ∈ E, (24)
where γ ∈ [0, 1] is fixed. Clearly, if γ = 0, then GA ≡ GS; if γ = 1, then GA ≡ GA′.
[Figure 6.4: single panel; vertical axis: Composite Modularity]
Figure 6.4 – Composite Modularity for Gα, based on GS and GA with different γ in (24), where GS is LFR with n = 3,000, ⟨k⟩ = 25, kmax = 150, µ = 0.5; GA′ is LFR with n = 3,000, ⟨k⟩ = 10, kmax = 150, µ = 0.7
In terms of Theorem 3, if γ is small enough (with respect to the chosen ϵ), then the edge weights of GS and GA satisfy (22) and Σij |ϵij| ≤ ϵ/2 ≤ 1 in Theorem 3. Consequently, according to (23), one should expect that for γ ≈ 0 the Composite Modularity Qαcom is almost independent of α. This is indeed the case, as Figure 6.4 shows. What is more, Qαcom starts to depend essentially on α as γ grows. In particular, the behaviour of Qαcom changes dramatically after γ passes 0.5, when the edge weights of GA′ start dominating in (24).
find them in Table 6.1.
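How the interpolation (24) drives the correlation between the edge weights of GS and GA can be sketched on toy data. The random 0/1 weight vectors below are hypothetical stand-ins for the LFR-based graphs, so the resulting values are not meant to reproduce Table 6.1; the helper names are assumptions as well.

```python
import math
import random


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def normalise(w):
    """Scale weights so that they sum to one."""
    total = sum(w)
    return [v / total for v in w]


def mix_weights(w_s, w_a_prime, gamma):
    """Edge weights of GA according to the interpolation (24)."""
    return [gamma * a + (1 - gamma) * s for s, a in zip(w_s, w_a_prime)]


# Toy weight vectors over the same (hypothetical) list of node pairs.
rng = random.Random(0)
n_pairs = 2000
w_s = normalise([1.0 if rng.random() < 0.10 else 0.0 for _ in range(n_pairs)])
w_a_prime = normalise([1.0 if rng.random() < 0.05 else 0.0 for _ in range(n_pairs)])

correlations = {
    gamma: pearson(w_s, mix_weights(w_s, w_a_prime, gamma))
    for gamma in (0.0, 0.1, 0.2, 0.45, 0.55)
}
```

For γ = 0 the two weight vectors coincide and the coefficient equals 1; as γ grows, the contribution of the weights of GA′ pulls it down.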
Theorem 3 and the experiments in this section are complementary to the heuristic conclusion in [11] that the strong positive correlation of the edge weights in GS and GA (by means of Pearson or Spearman) leads to the weak dependence of certain structural and attributive community detection quality measures on α within the weight-based fusion models. It is interesting that the statement “If the Composite Modularity is independent of α in (2) for all α ∈ [0, 1], then there is the strong positive correlation of the edge weights in GS and GA” is not true in general, as the experiment in Figure 6.3 (c) shows. The Composite Modularity is almost constant there, but the general scheme by which GS and GA are constructed does not imply the strong positive correlation. Recall that Pearson’s correlation coefficient there is ≈ −0.09.
6.2.4 Counterexample to the node-attributed social network feature selection principle
In this subsection we present a partial but important experiment in our settings. It is related to the well-known node-attributed social network feature selection principle, which in particular aims at choosing the subset of attributes that are “relevant to” or “tightly correlated with” the structure, see e.g. [45, 46]. This principle is usually based on the suggestions that the mismatch between the structure and the attributes “negatively affects community detection quality” [46] and that the existence of structure-attributes correlation “offers a unique opportunity to improve the learning performance of various graph mining tasks” [45].
Table 6.1 – Correlation rates for different γ in (24) in Section 6.2.3
γ         0.00   0.10   0.20   0.45   0.55
Pearson   1.00   0.99   0.97   0.78   0.64

This principle has already been found questionable in [2, 11] in the weight-based fusion model settings. In particular, it was observed there that if all the attributes are in a strong linear relationship with the structure (by means of the edge weights of GS and GA), then one in fact has two sources of information that mainly duplicate each other. In this sense, it also seems unreasonable to expect a significant improvement of the community detection quality from adding the attributes to the structure. Note that Theorem 3 confirms this point analytically in our settings.
Now we present a counterexample to the principle in the weight-based fusion model settings. Namely, we show that using within the community detection process only the edge weights in GA that are in a strong linear relationship with the corresponding ones in GS may lead to a downgrade of the community detection quality, and this is so not only in the sense of Modularity-based community detection quality measures as in (15) but even in that of ground truth-based measures.
In general, NMI(C, C′) ∈ [0, 1]; in particular, NMI(C, C′) = 1 if C = C′
and NMI(C, C′) = 0 if C and C′ are totally independent [19].
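The two boundary cases of NMI can be checked with a short sketch. The exact normalization in (8) is not restated in this section, so the version below assumes the common form 2I(C, C′)/(H(C) + H(C′)); partitions are given as hypothetical lists of community labels, one per node.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information, assuming 2 * I / (H_a + H_b)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mutual = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    if h_a + h_b == 0:
        return 1.0  # both partitions are a single community
    return 2 * mutual / (h_a + h_b)


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # identical up to relabelling
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # labels carry no shared information
```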
We perform experiments with LFR-based graphs below. Let us therefore mention that the ground truth partition C′ in the LFR model is the one generated at the step when the generated nodes with the chosen degree distribution are randomly assigned to the generated communities with the chosen size distribution, and before the generation of links between the communities. Recall the parameters τ1 and τ2 of the distributions in Section 6.2.2. The partition C in (8) is the above-mentioned Cα within the weight-based fusion model, see Section 1.1.
For the experiment, we generate two LFR-based graphs:
– GS with n = 1,000, ⟨k⟩ = 150, kmax = 450 and µ = 0.4;
– GA with n = 1,000, ⟨k⟩ = 100, kmax = 450 and µ = 0.1.
The ground truth partition C′ is related to GA here. Clearly, GA has a stronger
[Figure 6.5: two panels, (a) and (b); vertical axes: Modularity and NMI; curves: Composite, Structural, Attributive, Differential, NMI]
Figure 6.5 – Modularities and NMI for (a) GS (with n = 1,000, ⟨k⟩ = 150, kmax = 450 and µ = 0.4) and GA (with n = 1,000, ⟨k⟩ = 100, kmax = 450 and µ = 0.1), i.e. before the feature selection procedure, and (b) GS (with the same parameters) and GA′ (obtained as described in Section 6.2.4), i.e. after the feature selection procedure
For convenience, we assume that each nonzero weight in GS and GA equals 1, and the required normalization (9) is done at the very end of the procedure described in the next paragraph.
Let GA′ = GA. Let S be the set of node pairs (v, u) such that v and u are in the same community in C′ and the edge weight between v and u is 1 in GA and 0 in GS. We then set to 0 the edge weights between the pairs of nodes in GA′ that are in S. Thus GA has, with respect to GA′, a stronger community structure that is also more similar to C′. Consequently, the Modularity and NMI values for GA are higher than those for GA′. Furthermore, the linear relationship of the edge weights of GS and GA′ is stronger than that of GS and GA by construction. The above-described procedure actually imitates the node-attributed social network feature selection principle in our settings.
In our particular experiment GS, GA and GA′ have 98,266, 59,258 and 19,388 edges with equal nonzero weights, respectively. The set S contains 39,870 node pairs.
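The construction of GA′ can be sketched with plain Python sets. The sketch assumes that S consists of the node pairs that lie in the same ground truth community, are linked in GA and are not linked in GS; the function name, the edge representation as (u, v) tuples with u < v and the toy data are hypothetical.

```python
def imitate_feature_selection(edges_s, edges_a, community):
    """Return (edges of GA', removed set S) for unweighted edge sets.

    A pair is removed from GA when both endpoints share a ground truth
    community and the pair is not an edge of GS.
    """
    removed = {
        (u, v)
        for (u, v) in edges_a
        if community[u] == community[v] and (u, v) not in edges_s
    }
    return edges_a - removed, removed


# Toy example: nodes 0-2 form one ground truth community, node 3 another.
community = {0: 0, 1: 0, 2: 0, 3: 1}
edges_s = {(0, 1)}
edges_a = {(0, 1), (1, 2), (2, 3)}
edges_a_prime, s_set = imitate_feature_selection(edges_s, edges_a, community)
```

The reported counts are consistent with such a removal: 59,258 − 39,870 = 19,388.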
The results for GS and GA are in Figure 6.5 (a), “before the feature selection”. The maximum of the Composite Modularity (≈ 0.47) and of the NMI (≈ 1.00) is attained for α = 0 here.
Once we replace GA with GA′, we get the results in Figure 6.5 (b), “after the feature selection”. Comparing the plots in Figure 6.5, one sees that the maximum of the Composite Modularity drops to ≈ 0.24 and the NMI to ≈ 0.66 for α = 0. Note that Pearson’s correlation coefficient for the edge weights of GS and GA is ≈ 0.00 and that of GS and GA′ is ≈ 0.20.
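The reported correlation of ≈ 0.20 can be roughly reproduced from the edge counts alone. This is only a plausibility check under two assumptions: the edge indicators of GS and GA are treated as exactly uncorrelated over all N node pairs, and every pair in S has zero weight in GS, so that removing S from GA leaves the overlap with GS unchanged.

```latex
\begin{align*}
N &= \frac{1{,}000 \cdot 999}{2} = 499{,}500, \qquad
p_S = \frac{98{,}266}{N} \approx 0.1967, \qquad
p_{A'} = \frac{19{,}388}{N} \approx 0.0388, \qquad
p_{\mathcal S} = \frac{39{,}870}{N} \approx 0.0798,\\
\operatorname{cov}(w_S, w_{A'}) &= p_S\,p_A - p_S\,p_{A'} = p_S\,(p_A - p_{A'}) = p_S\,p_{\mathcal S} \approx 0.0157,\\
r(w_S, w_{A'}) &= \frac{p_S\,p_{\mathcal S}}{\sqrt{p_S(1-p_S)\,p_{A'}(1-p_{A'})}}
\approx \frac{0.0157}{0.3975 \cdot 0.1932} \approx 0.20.
\end{align*}
```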
These experiments show that, within the weight-based fusion model, one can get worse community detection results after applying a variant of the node-attributed social network feature selection procedure than for the initial graphs. This means that the procedure may be unsuitable in some cases, and one should take this into account when applying it.
6.3 Real-world node-attributed network datasets
Now we deal with the undirected versions of the following publicly available real-world node-attributed social network datasets:
– WebKB¹² (Cornell, Texas, Washington, and Wisconsin) is a collection of four networks with a total of 877 web pages (nodes) and 1,608 hyperlinks (edges) gathered from the websites of four different universities (each web page is associated with a binary vector whose elements indicate the presence of a word from the vocabulary on that web page; the vocabulary consists of 1,703 unique words; the ground truth partition is available);
– PolBlogs¹³ is a network of 1,490 weblogs (nodes) on US politics with 19,090 hyperlinks (edges) between these weblogs (each node has a binary attribute describing its political leaning as either liberal or conservative; the ground truth partition is not presented);
– Sinanet¹⁴ is a microblog user relationship network extracted from the Weibo website with 3,490 users (nodes) and 30,282 relationships (edges) (each node is attributed with 10-dimensional positive numerical attributes describing the interests of the user; the ground truth partition is available);
– Cora¹⁵ is a network of machine learning papers with 2,708 papers (nodes) and 5,429 citations (edges) (each node is attributed with a 1,433-dimensional binary vector indicating the absence/presence of words from the dictionary of words collected from the corpus of papers; the ground truth partition is available).
12 https://linqs.soe.ucsc.edu/data
In the experiments below, the graphs GS and GA are constructed according to (2) and (9) with the following structural and attributive weight functions (the parameters of the Algorithm in Table 5.1):
\[
\mu(e_{ij}) =
\begin{cases}
1, & \text{if } w(e_{ij}) = 1 \text{ in } (V, E),\\
0, & \text{otherwise},
\end{cases}
\qquad
\nu(e_{ij}) = \frac{A(v_i)\cdot A(v_j)}{\|A(v_i)\|_2\,\|A(v_j)\|_2}.
\]
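The two weight functions can be sketched as follows. The function names and the convention of returning 0 for an all-zero attribute vector are assumptions made for illustration; attribute vectors are plain lists of nonnegative numbers and edges are (i, j) tuples with i < j.

```python
import math


def structural_weight(i, j, edges):
    """mu: 1 if the pair is linked in the original graph, 0 otherwise."""
    return 1.0 if (min(i, j), max(i, j)) in edges else 0.0


def attributive_weight(a_i, a_j):
    """nu: cosine similarity of two attribute vectors."""
    dot = sum(x * y for x, y in zip(a_i, a_j))
    norm = math.sqrt(sum(x * x for x in a_i)) * math.sqrt(sum(y * y for y in a_j))
    return dot / norm if norm else 0.0  # assumed convention for zero vectors


# Toy binary attribute vectors (hypothetical).
attributes = {0: [1, 0, 1], 1: [1, 0, 0], 2: [0, 1, 0]}
edges = {(0, 1)}
w_s_01 = structural_weight(1, 0, edges)
w_a_01 = attributive_weight(attributes[0], attributes[1])
w_a_02 = attributive_weight(attributes[0], attributes[2])
```

With nonnegative attributes, every value of the attributive weight lies between 0 and 1.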
Note that ν is the well-known Cosine Similarity and ν(eij) ∈ [0, 1] in our case, as all the attributes are nonnegative. It is also worth mentioning that the chosen µ and ν are among the most popular for this purpose [2] but, to be fair, the theorems proved above and the Algorithm in Table 5.1 work for any reasonable nonnegative weight function.
13 http://wwwpersonal.umich.edu/ mejn/netdata/
14 https://github.com/smileyan448/Sinanet
15 https://linqs.soe.ucsc.edu/data
[Figure 6.6: two panels, (a) and (b); vertical axes: Modularity and NMI; curves: Composite, Structural, Attributive, Differential, NMI]
Figure 6.6 – Modularities and NMI for the WebKB (Cornell, Texas) dataset: (a) Cornell, (b) Texas
We present results for α ∈ [0, 1] with a certain step for clarity, while the Algorithm in Table 5.1 finds the partition and the Modularity values for the single value α = ˜αρ. In the forthcoming experiments, we additionally calculate the NMI values by (8) for the real-world datasets where the ground truth partition C′
[Figure 6.7: two panels, (a) and (b); vertical axes: Modularity and NMI; curves: Composite, Structural, Attributive, Differential, NMI]
Figure 6.7 – Modularities and NMI for the WebKB (Washington, Wisconsin) dataset: (a) Washington, (b) Wisconsin
based community detection quality measures and the ground truth-based ones. It is worth mentioning, however, that the ground truth methodology for evaluating community detection quality, although widely used in the field, seems somewhat questionable, especially for real-world networks, as “the ground truth” may reflect only one point of view on the network communities (among many possible). This problem is discussed e.g. in [8, 9, 10].
The results for the four networks of WebKB are presented in Figures 6.6 and 6.7. The behaviour of the Modularities is rather similar in each case. As in some experiments with synthetic graphs in Sections 6.2.1 and 6.2.2, one of Q(GS) and Q(GA) is greater than the other. However, Q(GS) is much greater than Q(GA) here, so the case is almost degenerate. It can be observed that GS has rather distinguishable communities, while GA does not. As a result, we have α∗0.5 ≈ 0 (the α providing the equal impact of the structure and the attributes on the community detection results in the sense described in Section 3). At the same time, as the behaviour of NMI shows, the ground truth communities are highly related to the attributes in the sense that the highest NMI values are reached around α = 0 (actually exactly at α = 0 in Figures 6.6 (b) and 6.7 (a) and (b)). It is interesting that the highest NMI value in Figure 6.6 (a) corresponds to α∗0.5. Qαdiff vanishes for all α ∈ [0, 1].
[Figure 6.8: single panel; vertical axis: Modularity; curves: Composite, Structural, Attributive, Differential]
Figure 6.8 – Modularities for the PolBlogs dataset
In contrast to WebKB, both GS and GA in PolBlogs have rather distinguishable communities, see Figure 6.8. As for the attributes, this is indeed reasonable as they are one-dimensional and binary. Qαdiff vanishes for all α ∈ [0, 1] in this case, too. The truly distinctive feature of the PolBlogs results among those for the real-world networks under consideration is that the Modularities here change highly linearly with respect to α, and thus the condition (18) in Theorem 2 is met with a small ϵ. In particular, this guarantees that the α-tuning scheme in Section 3 and