Twitter-based Signed Network Clustering and Simulation

Master Computational Science

University of Amsterdam

Saint-Petersburg National Research University of Information Technologies, Mechanics and Optics

Aleksandra Klimkina

aleksandra.klimkina@student.uva.nl

June 9, 2014

Supervisors:

Rick Quax (UvA)

Abstract

To the best of our knowledge, no research has yet studied clustering effects in large-scale signed social networks, i.e., networks with both positive and negative links, whereas simple (unlabeled) networks have been studied widely and in depth. We take a recently developed model for generating unlabeled networks (Papadopoulos, Kitsak, Serrano, Boguñá, & Krioukov, 2012), which takes into account popularity and similarity effects, and extend it to signed networks. With this approach, we capture both properties of a discussion network: similar nodes tend to connect into positively connected clusters, while 'not-so-similar' nodes (similar enough to share an interest, but not similar enough to agree) tend to connect negatively, forming negative connections between clusters.

We test and validate the extended network model using a human-annotated dataset of approximately 7000 tweets. These tweets form a discussion network, where links are created not between friends or followers, but between people who explicitly mention each other in their posts. The sentiment of each tweet is categorized manually, so a pair of users can agree (positive link), disagree (negative link), or be neutral (which we then treat as positive). We find that users in the Twitter dataset indeed tend to form interest groups (clusters), with positive links inside a cluster and negative links between clusters, beyond what would be expected from a random assignment of sentiment.

Furthermore, we study how to fit the parameters of the extended network model to real data, in order to describe the dataset qualitatively. We also compare the fitted model to the standard Barabasi-Albert model (Barabasi, 1999) with randomly assigned negative links, and the clustering effects of the former tend to be stronger than those of the latter. Comparing the means of the clustering-coefficient distributions of the two models with those of the real Twitter data, we find that the extended model represents the real data more faithfully than the randomized model.

Acknowledgements

First of all, I would like to thank the coordinators of the dual Master program “Computational Science” from both universities, UvA and ITMO, among whom is Alexander Boukhanovsky. Enrolling in this program allowed me to live and study in Amsterdam for the second year of the program, gain the amazing experience of exploring the Netherlands, meet new people, improve my English, take part in an educational system different from the one I was used to at ITMO, and, finally, carry out this research.

Second, I thank my supervisor from UvA, Rick Quax, for suggesting this interesting topic and sharing his enthusiasm. On his advice, we submitted an extended abstract of the research to the ECCS’14 conference, and it was accepted for an Ignite Talk. Many thanks for his patience and help in carrying out the research and polishing the thesis.

Third, I would like to thank Willemijn van Dolen for providing the dataset, which played a significant role in my research.

Chapter I.

Introduction

Almost everyone nowadays is familiar with the term “social network”; it has become a part of our lives. We make new acquaintances and expand our personal networks, we watch TV shows whose actors are connected in a co-acting network, we take part in discussions and thereby create discussion networks, etc. However, the first association with this term is, of course, online social networks such as Facebook or Twitter, where we have a personal page and a list of friends. What all these networks have in common is that they consist of individuals, represented as nodes, and their relationships, represented as edges. Broadly speaking, a social network is a structure made up of social actors or agents tied together by links.

Online social networks are of particular interest to researchers, and understandably so. Such networks imitate human communication but at greater speed, and they allow people to communicate with others around the world whom they have never met in person. Using online social networks, people can share news and photos immediately, express their opinions and reply to other people’s posts with almost no effort or time. Users exploit these news-sharing services most intensively when a resonant event occurs. People share their thoughts and learn their friends’ opinions; the most significant news can initiate huge conversations in which people express their attitudes, arguing and agreeing with each other. Such conflict situations are of great interest for research: by studying people’s behavior in social networks, we can draw conclusions about their beliefs, preferences and thoughts. Moreover, such a study can be exceptionally helpful for companies looking for their target audience.

The underlying structure of social networks is a popular research topic. Scientists propose various models (scale-free networks, preferential attachment, popularity and similarity) and argue about which one represents real social networks most faithfully. However, the existing models allow only one type of relation between users: being friends. Such models are not suitable for studying a discussion network that arises from a conflict situation, where one needs to consider people agreeing and disagreeing with each other. To the best of my knowledge, no such research has been conducted yet, and this is the subject of this work. Models with positive and negative links provide much more information about the network structure: they allow us to uncover conflicting groups and hierarchical structure in the network. Negative links are an essential feature of biological and metabolic regulatory networks, but they have received much less attention in the field of social networks.

The goal of my research is to develop a model that faithfully represents a signed social network (a network with positive and negative links), since, to my knowledge, no such model exists in the literature. To develop it, I took an existing model for networks with one type of relationship (unlabeled networks) and extended it with negative links. Such a model should be verified against a real signed social network and, fortunately, I had such data.

Studied data

The data I studied is a set of posts related to the Triodos bank. This set was obtained from Twitter and contains people’s opinions about the performance and service quality of the bank. It is natural to expect these opinions to be both positive and negative, as not everyone is fully satisfied with the provided services. The collected tweets were annotated by humans for their sentiment: positive, negative or neutral. Later I treat neutral links as positive too, by analogy with friendship relations in social networks: not all people in one’s friend list (being-friends relations) are close friends (positive links); some of them are just acquaintances (neutral links).

A significant part of the collected data consists of retweets and reposts: messages of other users that people share on their own page, possibly with a comment. People can also mention other users in their posts, starting or continuing a conversation. I construct a discussion network based on these reposts and mentions: if a user reposted another user’s post or mentioned that user, I establish a positive or a negative link between them in the network, according to the sentiment of the post.
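As an illustration, the construction described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the thesis code: the record layout (author, target, sentiment), the function name `build_signed_edges`, the undirected treatment of links, and the rule for combining several tweets between the same pair of users (the sign of the summed sentiments, with ties counted as positive) are all hypothetical choices made for this example.

```python
# Hypothetical mapping from annotated sentiment to link sign.
# Neutral is treated as positive, as stated in the text.
SIGN = {"positive": 1, "neutral": 1, "negative": -1}


def build_signed_edges(records):
    """Build an undirected signed edge dictionary.

    records: iterable of (author, target, sentiment) triples, where
    `target` is a user who was mentioned or retweeted by `author`.
    Returns {(user_a, user_b): +1 or -1} with user_a < user_b.
    """
    totals = {}
    for author, target, sentiment in records:
        if author == target:
            continue  # ignore self-mentions
        key = tuple(sorted((author, target)))
        totals[key] = totals.get(key, 0) + SIGN[sentiment]
    # Assumption: the pair's sign is the sign of the summed sentiments;
    # a tie is counted as positive.
    return {key: (1 if total >= 0 else -1) for key, total in totals.items()}


records = [
    ("ann", "bob", "positive"),
    ("bob", "ann", "negative"),
    ("bob", "ann", "negative"),
    ("cat", "ann", "neutral"),
]
edges = build_signed_edges(records)
```

Other aggregation rules (for example, keeping only the most recent tweet between two users) would fit the description equally well; the text does not specify one.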

I expect the constructed network to form several clusters, densely connected by positive links inside each cluster and connected negatively to nodes outside the cluster. As I will show in the next chapter, it is commonly believed that clustering and homophily are very strongly tied together in a social network, which makes clustering an inherent feature of social networks. Consequently, there are several requirements for the network generation model I want to develop:

1. It should support two types of links: positive and negative.

2. The resulting network should show clustering effects, with positive links tending to occur inside a cluster and negative links between clusters. The strength of this effect can be measured using the frustration coefficient, described in Chapter V.
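The second requirement can be made concrete by counting the links that contradict a given division into clusters: negative links inside a cluster and positive links between clusters. The sketch below is only an illustration under assumptions. The function name and the edge and cluster representations are hypothetical, and the frustration coefficient used in the thesis is defined in Chapter V and may differ from this simple fraction.

```python
def frustrated_fraction(signed_edges, cluster_of):
    """Fraction of links that contradict a given clustering.

    signed_edges: {(u, v): +1 or -1}
    cluster_of:   {node: cluster label}
    A link is counted as frustrated if it is negative inside a cluster
    or positive between two different clusters.
    """
    if not signed_edges:
        return 0.0
    frustrated = 0
    for (u, v), sign in signed_edges.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        if (same_cluster and sign < 0) or (not same_cluster and sign > 0):
            frustrated += 1
    return frustrated / len(signed_edges)


example_edges = {("a", "b"): 1, ("c", "d"): 1, ("a", "c"): -1, ("b", "d"): 1}
example_clusters = {"a": 0, "b": 0, "c": 1, "d": 1}
fraction = frustrated_fraction(example_edges, example_clusters)
```

In this toy example only the positive link between `b` and `d` crosses the cluster boundary, so one of the four links contradicts the clustering.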

Outline

The rest of the thesis is organized as follows. Chapter II gives an overview of the literature on social networks that proved useful for my research. Chapter III describes the problem in more detail: the dataset, some statistics and estimated parameters of the data, the resulting network and some of its characteristics. Chapter IV discusses the details of the model developed for generating signed social networks, as well as the details of the original model for unlabeled networks on which it is based, with theoretical reasoning behind the choice of parameters. Chapter V introduces the notion of frustration and its emergence, and describes how the value of frustration is used to estimate clustering effects. Chapter VI presents the results: in particular, the validation of the model and the comparison of the two models. Chapter VII concludes the thesis with a brief overview, a discussion of the results and possible directions for future work.

Chapter II. Literature Study

The field of social networks is relatively new, and intensive research on online social networks started only two decades ago. Different authors have considered the structure and characteristics of networks [1, 2], their evolution [3], spreading processes [4], and algorithms for finding community structure [5, 6].

Social networks and real life

Social networks are strongly correlated with people’s real-life behavior: they first emerge naturally from people’s real lives and then influence them in turn. Studying social networks is very fruitful, as it allows us to observe a whole range of people’s real-life characteristics reflected in their specific roles in a social network. The articles listed in this section support this statement.

Centola studied the correlation [7, 8] of behavior in online social networks with real-life characteristics. He showed that homophily (the tendency of similar nodes to be linked) is a significant property of a network, and that it allows a behavior to spread more successfully across a heterogeneous population. On the one hand, homophily has an external origin: it emerges from the characteristics of the individuals represented as nodes in the network. On the other hand, homophily, although an external property of the nodes that is independent of the network, influences the behavior-spreading processes on the network.

Homophily may have different origins and aspects. McPherson et al. [9] considered various types and causes of homophily, including race and ethnicity, gender, age, education and occupation. For our study, the particular origin and form of homophily does not matter: we care only about the presence of homophily and, consequently, the clustering of positive and negative links in social networks.

Young studied the emergence of clustering in a network of school friends [10]. He found that individuals with similar characteristics (in his study, the level of self-control) tend to form groups and create clusters. Aral et al. [11] developed this thought, stating that it is impossible to distinguish which effect caused which: did similar individuals connect because of their similarity, or did they become similar because of their closeness in the network? I will use this in my work, assuming that if there is a strong correlation between clustering effects and general network characteristics, these effects can serve as a quantitative description of the network.

Fowler et al. [12] conducted an experiment in which they sent Facebook users different types of notifications, with or without their friends’ photos. The experiment revealed that even in an online social network people tend to react to notifications showing photos of their “close friends”, i.e. people whom they both know in real life and have as friends on Facebook.

Arentze et al. [13] studied the relation between the geographical distance between people and the probability that they know each other. De Domenico et al. [14] considered Twitter users’ reaction to a major scientific event (the detection of the Higgs boson).

McClurg et al. [15] introduced a method for detecting political preferences among Twitter users. Using the 2012 US presidential election, the authors inferred the political beliefs of Twitter users from the seed users (Congress members) that they follow. They then collected tweets on election day and tested their predictions, which proved to be reasonably accurate.

Despite all the complexity and variety of social networks and their apparent randomness, they share features that distinguish them from random networks. Watts and Strogatz [16] studied three networks: the collaboration graph of film actors, a power grid, and the neural network of the worm C. elegans. The authors showed that the average path length of such networks is almost as small as in a random network, while the clustering coefficient is much higher. They describe such networks as “small-world” networks. The “small-world” phenomenon is widely known as the “theory of six handshakes”, which states that any two people living on Earth at the same time are separated by no more than five intermediaries.
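The clustering coefficient mentioned above can be computed locally for each node as the fraction of pairs of its neighbours that are themselves linked. A minimal Python sketch, assuming the graph is stored as a dictionary of neighbour sets (the representation and the function name are choices made for this example only):

```python
from itertools import combinations


def local_clustering(adj, node):
    """Local clustering coefficient of `node` in an undirected graph.

    adj: {node: set of neighbouring nodes}
    Returns the number of links among the neighbours of `node`
    divided by the number of possible neighbour pairs.
    """
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pairs to close
    linked_pairs = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * linked_pairs / (k * (k - 1))


# A triangle a-b-c with one extra node d attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
```

Averaging this value over all nodes gives one common network-level clustering coefficient; a small-world network combines a high average with short path lengths.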

Based on all these works, I conclude that social networks are closely related to people’s private lives; the two interpenetrate and depend on each other. The next sections describe more specifically what research has been done on network models. I divided the articles related to my work into three categories: models considering one type of edge (“friends”), models with two types of relations (“friends and foes”), and models with multiple types of relations.

One type of relations

This part of social network research is by far the most developed. This is understandable, as most online social networks support only one type of relationship between users: being friends. The majority of the articles are dedicated to information spreading. This process is also called “social contagion”, as its dynamics resemble the spreading of infectious diseases: many agents become “contagious” in a very short time. Christakis and Fowler [17] studied contagion processes of smoking, obesity and happiness in social networks.

Bianconi and Barabasi [18] drew a parallel between the evolution of a social network and Bose-Einstein condensation in a gas. They claimed that both processes consist of three phases: 1. scale-free phase; 2. fit-get-richer phase; 3. Bose-Einstein condensate ('winner-takes-all' phenomenon).

Somewhat similar work was done by Castellano et al. [19]. They reviewed connections between problems of statistical physics and opinion dynamics, crowd behavior, hierarchy formation and social contagion. For instance, they showed that the Ising ferromagnet can serve as a simple model of opinion dynamics in social networks.

Many articles are dedicated to the analysis of the structure of social networks. Researchers introduce and calculate characteristics of social networks [20], detect communities and study their structure [21, 6], analyze the navigability and communication efficiency of networks [22, 23], and predict new links based on the present structure [24, 25].

Structural Analysis

The most widely known and accepted structural feature of social networks is the “scale-free” property. It was first introduced by Barabasi [26], who defined “scale-freeness” as having a power-law degree distribution. He explains this phenomenon by two mechanisms: growth of the network through the addition of new vertices, and preferential attachment. Both mechanisms were shown to be necessary; neither can be omitted. While the former is easy to understand (social networks naturally grow by adding new vertices), the latter needs some comment. Preferential attachment means that the probability that a vertex receives a connection from a new node depends on the number of connections that vertex already has (its degree). This trend is also known as the “rich-get-richer” effect.
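The two mechanisms, growth and preferential attachment, can be illustrated with a short Python sketch. It assumes the simplest form of the rule, in which the attachment probability is directly proportional to the degree; the function name, the seeding with `m` unconnected nodes and the parameter names are choices made for this example, not details taken from [26].

```python
import random


def preferential_attachment(n_nodes, m, seed=0):
    """Grow a network in which each new node links to m existing nodes
    chosen with probability proportional to their current degree.

    Requires 1 <= m < n_nodes. Returns a list of (new_node, target) edges.
    """
    rng = random.Random(seed)
    edges = []
    # Every node appears in `endpoints` once per incident edge, so a
    # uniform draw from this list is a degree-proportional draw.
    endpoints = []
    targets = list(range(m))  # the first new node links to all m seed nodes
    for new in range(m, n_nodes):
        for target in targets:
            edges.append((new, target))
            endpoints.extend((new, target))
        # Choose m distinct targets for the next node (rich get richer).
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        targets = list(chosen)
    return edges


edges = preferential_attachment(n_nodes=20, m=2)
```

Each of the `n_nodes - m` added nodes contributes exactly `m` links, and nodes that joined early tend to accumulate the highest degrees.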

Furthermore, many scientists have tried to explain the preferential attachment effect, and Barabasi surveyed the most reasonable explanations in his review paper [27].

Concepts

Researchers have extended the preferential attachment concept with a number of adjustments: popularity and similarity, influentiality and susceptibility, close friends, and homophily.

Papadopoulos et al. [2] introduced the notion of popularity and similarity. They claimed that nodes do not simply link to the most popular nodes, but also connect to similar nodes, even if these are not popular. To support their idea, the authors performed simulations and compared the results with existing networks. They showed that the popularity-similarity extension of preferential attachment describes a social network much more faithfully than preferential attachment alone. The authors verified their model on several snapshots of an evolving network and found that it predicts new links better than preferential attachment. That is why I took this research as the basis for my network model, which also turned out to be a faithful representation of a signed social network.

Aral and Walker [28] argued that the nodes of a network have both an influentiality and a susceptibility characteristic, with varying coefficients. They measured the influentiality and susceptibility of Facebook users depending on the personal information they share in their profiles. They concluded that susceptible users play a more important role in information spreading than influential ones, in the sense that their response to a notification influences other susceptible individuals, which drives the spreading process more strongly than notifications from the sparser influential individuals.

Fowler et al. [12] stated that individuals tend to react to their close friends’ notifications, regardless of their own susceptibility and their friends’ influentiality. Centola [8] concluded that homophily is an important addition to preferential attachment, as a homophily-based network is more effective in spreading behavior.

Generation

For simulation and generation of social networks, scientists use several approaches. The most popular one is using agent-based (or actor-based) models. Other approaches are game theory, machine learning, and numerical methods.

Mostly, the generation step is used, alongside theoretical analysis, as additional support for a proposed adjustment: a network generated with the adjustment is compared to a real dataset, and a conclusion is then drawn about the quality of the model and the validity of the adjustment.

Agent-based models are widely used in different kinds of study. Girvan and Newman [21] used them in detecting community structure in social and biological networks. Arentze et al. [13] created an agent-based model to prove that the regularities they found in the network are indeed reasonable and are not a result of overfitting.

Papadopoulos et al. [2] performed a simulation based on placing nodes in hyperbolic space. The radial coordinate represents popularity and the angular coordinate represents similarity. New nodes are placed on the edge of the hyperbolic disc with random angular coordinates. New links are established between the new node and the set of nodes nearest to it in terms of hyperbolic distance.
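The growth rule just described can be sketched in Python as follows. This is a simplified illustration rather than the implementation of [2]: the radial coordinate r = 2 ln t for the node arriving at time t is an assumption stated here for the example, existing nodes keep their coordinates fixed (no popularity fading), each new node links deterministically to its `m` nearest nodes, and all function and parameter names are hypothetical. The distance itself uses the standard hyperbolic law of cosines.

```python
import math
import random


def hyperbolic_distance(r1, theta1, r2, theta2):
    """Distance between two points of the hyperbolic plane (curvature -1)
    given in polar coordinates, from the hyperbolic law of cosines."""
    arg = (math.cosh(r1) * math.cosh(r2)
           - math.sinh(r1) * math.sinh(r2) * math.cos(theta1 - theta2))
    return math.acosh(max(arg, 1.0))  # guard against rounding below 1


def grow_network(n_nodes, m, seed=0):
    """Add nodes one by one; each links to its m hyperbolically nearest
    predecessors. Returns (coords, edges) with coords[i] = (r, theta)."""
    rng = random.Random(seed)
    coords = []
    edges = []
    for t in range(1, n_nodes + 1):
        r_t = 2.0 * math.log(t)                 # assumed radial coordinate: popularity
        theta_t = rng.uniform(0.0, 2.0 * math.pi)  # random angle: similarity
        new = len(coords)
        nearest = sorted(
            range(new),
            key=lambda s: hyperbolic_distance(r_t, theta_t, *coords[s]),
        )[:m]
        edges.extend((new, s) for s in nearest)
        coords.append((r_t, theta_t))
    return coords, edges


coords, edges = grow_network(n_nodes=50, m=2)
```

Old nodes sit close to the centre of the disc and are therefore near to many newcomers (popularity), while a small angular difference shortens the distance between nodes at any radius (similarity).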

Dynamics Analysis

Not only the structure of a social network but also the dynamics on it are worth studying. Researchers consider the processes of opinion formation, information spreading and social contagion.

Liu and Wang [29] study the process of opinion formation with regard to the similarity of the nodes. They propose a model in which agents select other agents to communicate with according not to their opinion but to their similarity, and the agents can still reach a consensus with such a selection. In social networks, we can interpret this selection as follows: in the process of opinion formation, users communicate not with users whose opinion is close to their own, but with their friends, to whom they are similar. This model seems more reasonable with regard to how people actually communicate. In contrast to their work, we already operate with an opinion network: links are created not between friends, but between people who mention each other in their posts or who repost, i.e. agree with, each other. What is similar to their work is that we consider not only close opinions, represented as positive links between users, but also opposing opinions, represented as negative links.

The dynamics of the networks themselves are also widely studied. This includes the processes of network emergence and evolution. The network generation algorithms described in the previous section represent network emergence, leaving aside the removal of nodes and edges.

Noel and Nyhan [30] extended Fowler’s model with “unfriending”, in which agents remove some of their connections over time, thus turning Fowler’s system into a dynamical one. The main disadvantage of all the articles reviewed so far is that they assume only one type of relation between users, being friends, and consequently that users are susceptible to all of their neighbors’ actions and to those only. This restriction leaves aside important cases in which users react to the actions of users who are not their friends, or ignore their friends’ activity.

Two types of relations

The possibilities that social networks provide are far broader than users just being friends. Users of a social network can communicate and discuss various topics, so two kinds of relations appear: users agreeing and disagreeing with each other, i.e. positive and negative relations respectively. Networks with these two types of relations (positive and negative) are called signed networks.

Structural Analysis

The field of signed social networks has not been broadly studied yet. The majority of the works are dedicated to signed graphs with a small number of nodes and no small-world effect.

Frank and Harary [31] were among the first scientists to study signed graphs. In their article, they analyze the types of triangles that make up a graph and draw conclusions about their stability. One of these conclusions, which is important for us, relates the stability of triangles to the balance of the graph. If we assign the weights +1 and -1 to the positive and negative edges of a triangle, the product of these weights is positive for stable triangles. In other words, a stable triangle has either all edges positive or one positive and two negative. Both cases represent a perfectly balanced graph: in the first case, one positively connected group of three nodes; in the second case, two groups of two nodes and one node, connected positively inside the groups and negatively between them.
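The product rule for triangles translates directly into code. The following Python sketch checks a single triangle and counts the balanced triangles in a small signed graph; the edge-dictionary representation and the function names are assumptions made for this example.

```python
from itertools import combinations


def triangle_is_balanced(s_ab, s_bc, s_ca):
    """A triangle is stable (balanced) when the product of its three
    edge signs (+1 or -1) is positive."""
    return s_ab * s_bc * s_ca > 0


def edge_sign(signs, u, v):
    """Sign of the undirected edge u-v, or None if the edge is absent."""
    return signs.get((u, v), signs.get((v, u)))


def count_balanced_triangles(signs):
    """Return (balanced, total) over all triangles of a signed graph
    given as {(u, v): +1 or -1}."""
    nodes = sorted({n for edge in signs for n in edge})
    balanced = total = 0
    for a, b, c in combinations(nodes, 3):
        triple = (edge_sign(signs, a, b), edge_sign(signs, b, c), edge_sign(signs, c, a))
        if None in triple:
            continue  # the three nodes do not form a triangle
        total += 1
        balanced += triangle_is_balanced(*triple)
    return balanced, total


signs = {("a", "b"): 1, ("b", "c"): -1, ("a", "c"): -1, ("c", "d"): 1, ("a", "d"): 1}
result = count_balanced_triangles(signs)
```

Here the triangle a-b-c has one positive and two negative edges and is balanced, while a-c-d has two positive edges and one negative edge and is not.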

Sampson [32] created the first dataset that represents a signed network. He interviewed monks living together in a monastery about their three most liked and three most disliked colleagues. Later, Bonacich and Lloyd [33] used this dataset as a basis for their research, studying its balance and the status (a popularity measure) of the monks. The results they obtained are remarkable: with respect to their positive and negative links, the monks turned out to be organized into an almost perfectly balanced graph with two groups and a certain type of hierarchy. Both groups had their own “leaders”: one or two monks with the largest number of positive links to others inside their cluster and negative links to the opposite cluster. As far as I know, this study was the first to obtain a signed social network, though not a complex one, and to study its balance. In my work, I go further and investigate the degree of balance of a signed social network together with its underlying model.

Some work has been done on link label prediction in signed graphs. Agrawal et al. [25] were the first to introduce this problem and proposed a method based on matrix factorization for solving it. Ye et al. [34] used a machine learning approach to the same problem. The problem emerges from online signed social networks (Wiki-Votes, Epinions), where the authors wanted to infer the nature of the relationship between two nodes (positive or negative), given information about some other link labels. This problem is to a certain degree close to ours: it explores the local construction of the network, while our goal is to create a model that reflects the global network structure.

Structural analysis alone will not give us a full understanding of the behavior of a signed network. Therefore, I also consider works dedicated to the dynamics of signed networks.

Dynamics Analysis

This field seems very promising, but it is almost unstudied. To the best of my knowledge, there is only one study dedicated to the dynamics of a signed network.

Hiller [35] proposed an agent-based model of network formation and operation. He allows three types of relations between agents: 1) positive; 2) negative (coercion); 3) double negative (conflict). No neutral links are allowed. Conflict has a cost; the other types of relations are free. Using these rules, agents organize themselves into a network that either has only positive links, or has both positive and negative links and forms an almost perfectly balanced clustering.

This research is very interesting, but it does not analyze the statistics of the resulting network and concentrates on the generation step, while our aim is to understand the behavior of an existing network.

Li et al. [36] studied dynamics on signed social networks, in particular the process of opinion formation in a signed Barabasi-Albert network. The researchers considered binary opinions, so nodes with the same opinion were connected by positive links and nodes with different opinions by negative links. The opinion formation process was considered from a game-theoretic point of view, and the authors found that opinion formation in a signed social network leads to a consensus, though it takes more time than in an unlabeled network. The final network that has reached a consensus shows several “opinion clusters”: nodes with similar opinions densely connected by positive links. This article supports my assumption that clustering should emerge in a signed social network.

In my opinion, this field has been undeservedly neglected, and with my research I try to improve this situation.

Many types of relations

The approach that allows three or more types of relations and different types of nodes is more exotic than the previous two and is less used in the field of social networks. It includes dividing networks into layers and further topological analysis.

Li et al. [37] reproduced the dynamics of a social network by dividing its nodes into two categories, items and users, and allowing different types of relations between them. The focus of their work is that the evolution of a network is governed by local dynamics, i.e. triadic closure. They concluded that, for a reasonable simulation, the topology of the network should coevolve with the dynamics.

De Domenico et al. [38] introduced a general mathematical formulation for multilayer networks that can be used to study the dynamics of such networks, as considered in [37]. In my opinion, the multirelational approach is not suitable for my study: the users differ only in their attitude to the news, and two types of attitude are enough. One could distinguish several kinds of users depending on their activity, but this can be handled by introducing an activity-level coefficient without creating several types of nodes.

Conclusion

In this chapter, I considered three kinds of articles: those discussing networks with one type of relation, with two types, and with three or more types. The first category is by far the most popular and explored; scientists have used this approach to analyze the structure of networks and the dynamics of and on them, and to simulate them. Scale-freeness is one of the main properties of such networks, and it has been explained and extended in various ways. The problem is that this approach cannot explain the whole variety of events occurring in social networks. In particular, it cannot explain the social reaction to a resonant event, with users expressing approval and rejection.

To overcome these restrictions, one needs two types of relations. Networks with positive and negative ties are called signed networks. The research performed in this field concerns structural aspects of such networks, and most articles focus on the graph structure, leaving aside defining properties of social networks such as the small-world effect, scale-freeness and homophily. Only one work has addressed simulating the structure and dynamics of a signed network. To the best of my knowledge, there is no research that tries to explain the dynamics of a real social network using the signed network approach.

Moreover, some approaches allow three or more types of relations and several types of nodes, but I leave these works aside, as they are not suitable for solving the problem of this thesis. Some scientists draw an analogy between social network dynamics and statistical physics; this approach might be useful for my future research, but it is not used in this thesis.

Chapter III. From Data to Network

The purpose of data collection

Triodos bank is one of the most famous and discussed banks in the Netherlands. It constantly monitors public opinion about its appearance, services and performance. This monitoring usually takes the form of collecting messages from news websites, social networks, blog services, etc.

The purpose of this monitoring is to find out in which direction the bank should develop further: emphasizing the services that received positive feedback, or instead addressing the issues that were mentioned negatively.

All collected messages were annotated by humans for their sentiment, which makes this dataset extremely valuable: the sentiment labels can be trusted, as people are still better than machines at detecting the sentiment of a phrase.

General characteristics

The data I received were organized in an .xls file with 13431 entries. Figure 1 shows a sample of the data:

Figure 1. Triodos data sample

Every row of the file contains the following information:

• Id – unique number of every post

• Title – name of the article or post

• Description – full text of the article or post

• URL – link of the article/post

• Source – social network or web-site, containing the publication

• Sentiment – sentiment of the publication

The whole project was implemented in Python using free libraries. To work with the data, I used the openpyxl library.

Not all data entries were used for the research. As I wanted to construct a discussion network, an article from a news website or a single isolated post in a social network could not be used for this purpose. Fortunately, the online social network Twitter lets users mention each other, thus forming a discussion network. A single post on Twitter is usually called a tweet, and I will use this term below.

A typical mention in Twitter looks as follows:

@username <tweet message here>

The user who was mentioned in someone’s tweet receives a notification from the system and thus has a chance to reply to the original tweet, starting or continuing a conversation. A particular case of mentioning is retweeting. In this case, a user simply copies another user’s message to his own page along with the name of the original author, prefixing the message with the letters RT:

RT @username <username’s original tweet>

For my research, I filtered from the whole file only the tweets containing mentions, including retweets. The resulting dataset contains 8382 unique tweets.

First, I analyze the dynamics of the discussion, i.e. burst sizes of different types of reaction. Figure 2 illustrates these findings.

Figure 2. Burst sizes of different types of reaction depending on the date. The number of tweets with each sentiment is shown in a different color.

At first sight, there is nothing special in this plot: messages of different sentiments seem to appear independently, with perhaps a small correlation. Another interesting question is whether the burst sizes of the different reactions show some periodicity. In order to tie the reaction sizes of different sentiments together, I introduced the following mood function, reflecting the overall attitude of Twitter users regarding some news on a certain day:

$$\mathrm{mood}(t) = \alpha N_{pos}(t) + \beta N_{neg}(t) + \gamma N_{neut}(t) \qquad (1)$$

We assume the coefficients to be the following:

$$\alpha = 1, \quad \beta = -1, \quad \gamma = 0.5$$


The problem of finding periodicity in this case reduces to calculating the autocorrelation coefficient for varying delay times τ:

$$\mathrm{Autocorr}(f, \tau) = \mathrm{corr}\big(f(t),\, f(t+\tau)\big) \qquad (2)$$

Figure 3. Autocorrelation of the mood function: mood(t) = N_pos(t) − N_neg(t) + 0.5·N_neut(t)

Figure 3 illustrates the obtained autocorrelation function. Here we can clearly see three large peaks at approximately 60, 120 and 180 days, which may indicate large-scale periodic behavior of the Twitter users. A number of small peaks appear at intervals of approximately 7 days, which may reflect a weekly pattern.
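The computation behind such an autocorrelation plot can be sketched as follows. This is only a minimal illustration, not the code used in the thesis: the function names and the toy daily counts are hypothetical, and the weights 1, -1 and 0.5 for positive, negative and neutral tweets are an assumption about the mood function.

```python
import math

def mood(n_pos, n_neg, n_neut, w_pos=1.0, w_neg=-1.0, w_neut=0.5):
    """Daily mood: weighted sum of tweet counts per sentiment (weights are assumed)."""
    return [w_pos * p + w_neg * n + w_neut * u
            for p, n, u in zip(n_pos, n_neg, n_neut)]

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def autocorr(f, tau):
    """Correlation between f(t) and f(t + tau) for a delay of tau days."""
    return pearson(f[:-tau], f[tau:]) if tau > 0 else 1.0

# Toy data with an exact 7-day period (hypothetical, not the Triodos data).
week = [5, 1, 0, 2, 8, 3, 1]
n_pos = week * 8
n_neg = [1] * len(n_pos)
n_neut = [2] * len(n_pos)
m = mood(n_pos, n_neg, n_neut)
lag7 = autocorr(m, 7)   # delay equal to the period
lag3 = autocorr(m, 3)   # delay that does not match the period
```

On strictly periodic toy data the autocorrelation at the period is maximal, which is the kind of peak one looks for in the real curve.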

This aspect of the data was not researched in more depth, and the rest of my thesis focuses on investigating the structure of the corresponding discussion network. Nevertheless, exploring the periodicity of the mood function can be a promising study, and I consider it possible future work. The results of such an investigation would bound the gap between two snapshots of the evolving network, which is another interpretation of the data. Chapter VII discusses this possibility in more detail.

Network construction

The previously obtained set of posts allowed me to construct a network where each user who writes a tweet or is mentioned in one is represented as a single node. An edge is created when a user mentions another user in a tweet or retweets another user. The main feature of the obtained graph is the presence of weights on its edges: according to the “sentiment” column of the data, I assign an edge weight of +1 for positive and neutral sentiment and -1 for negative sentiment.

The fact that positive and neutral sentiments are treated together as positive links deserves a brief comment. Studies of simple unlabeled networks usually assume that all friendship connections are neutral. Those connections serve as conductors of social contagions [4], taking part equally in the spreading process. Fowler et al. [12] showed in their study that users have a subset of friends, namely “close friends”, who affect a user’s behavior more strongly than all other (remote) friends. They presumed that these “close friends” are friends whom a user knows in person, not only from the online social network. In my study, I can draw a parallel between remote friends and “neutral” sentiments, and between close friends and “positive” sentiments. However, the purpose of my study is to discover the influence of adding negative links to a network, and I want to keep the structure of positive links simple, without going into the degree of “closeness” of a tie.

Network construction was performed using the igraph Python library. The resulting network has the following characteristics:

Table 1. Discussion network characteristics

Number of vertices 5887

Number of edges 7389

Average degree 2.51

Maximal degree 993

Average positive degree 2.29

Maximal positive degree 904

Average negative degree 0.22

Maximal negative degree 89

Percentage of negative links 8.58%

The numbers of vertices and edges do not coincide with the number of rows remaining in the data after filtering, which can be explained as follows. Every row of the data represents a single tweet by one user. Naturally, one user may post several messages, which is reflected in the average and maximal degree of the obtained network. Moreover, not every tweet in the dataset establishes a new connection in the network, as one user can mention another several times. The sentiments of different mentions may not coincide; in that case, as the resulting sentiment and, consequently, the weight of the edge, I simply take the sentiment that occurs most frequently. Typically, if one user mentions another several times, the corresponding sentiments are neutral and positive. As stated above, we assume that these sentiments result in one relationship, a positive link, and thus in this typical case we do not lose any information by taking the most frequent sentiment.
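The aggregation rule described above can be sketched in a few lines. This is an illustrative sketch rather than the thesis code: the function name, the tuple layout of a tweet and the toy rows are hypothetical, edges are assumed to be undirected, and the igraph step is left out so that the example only needs the standard library.

```python
from collections import Counter, defaultdict

def build_signed_edges(tweets):
    """Collapse (author, mentioned, sentiment) rows into one signed edge per user pair.

    The sentiment seen most often for a pair decides the edge weight
    (ties go to the sentiment seen first); neutral and positive map to +1,
    negative maps to -1.
    """
    votes = defaultdict(Counter)
    for author, mentioned, sentiment in tweets:
        pair = tuple(sorted((author, mentioned)))  # assumption: undirected edge
        votes[pair][sentiment] += 1
    edges = {}
    for pair, counts in votes.items():
        top_sentiment = counts.most_common(1)[0][0]
        edges[pair] = -1 if top_sentiment == "negative" else 1
    return edges

# Hypothetical toy rows, not taken from the Triodos data.
tweets = [
    ("a", "b", "neutral"),
    ("a", "b", "positive"),
    ("b", "a", "neutral"),
    ("a", "c", "negative"),
    ("c", "d", "positive"),
]
edges = build_signed_edges(tweets)
negative_share = sum(1 for w in edges.values() if w < 0) / len(edges)
```

Five tweets collapse into three edges here, which mirrors why the number of edges in Table 1 is smaller than the number of filtered tweets.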


Qualitative and quantitative characteristics of the data

Figure 4 shows the overall (positive+negative) degree distribution in a log-log scale.

Figure 4. Density function of the overall degree distribution, log-log scale. The green line is the theoretical distribution with the parameter estimated using the method of moments.

According to [26], we assume that the degree distribution in the obtained network follows a Pareto (power-law) distribution with the following probability density function:

$$f(x) = \begin{cases} \dfrac{k}{x^{k+1}}, & x \ge 1 \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

One can check that this function is indeed a probability density function by calculating the following integral:

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{1}^{\infty} \frac{k}{x^{k+1}}\,dx = \left. -\frac{1}{x^{k}} \right|_{1}^{\infty} = 1$$

In order to test whether the empirical distribution indeed follows the power law, I performed a chi-squared test with significance level α and number of bins m. Here we took α=5%, m=20. The chi-squared statistic of the data was calculated as follows:

$$\chi^2(f_{emp}, f, m) = \sum_{i=1}^{m} \frac{\big(f_{emp}(x_i) - f(x_i)\big)^2}{f(x_i)} = 32.41$$

This statistic was compared with the theoretical quantile of the chi-squared distribution:

$$\chi^2_{1-\alpha,\; m-1-1} = 32.85$$


The test showed that the value of the statistic is lower than the value of the chi-squared quantile, so the hypothesis that the empirical distribution follows the theoretical one is not rejected at the 5% significance level.

The only parameter of this distribution function is k, the value of the power-law exponent. To estimate k, I used the method of moments. The most appropriate moment is the average degree of the data, which should equal the theoretical expectation.

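Let us derive the theoretical formula for the expectation (for k > 1):

$$EX = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{1}^{\infty} x\,\frac{k}{x^{k+1}}\,dx = k \int_{1}^{\infty} x^{-k}\,dx = \left. \frac{k\,x^{1-k}}{1-k} \right|_{1}^{\infty} = \frac{k}{k-1} = \bar{X}$$

Thus, the formula for the estimated k is as follows:

$$\hat{k} = \frac{\bar{X}}{\bar{X} - 1} = 1.66 \qquad (4)$$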

Figure 5 illustrates that this estimate of k follows the data fairly closely, and later I will use this estimated k̂ for the validation of the model parameters.
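As a quick check of the estimate, the method-of-moments calculation can be reproduced directly. The sketch below assumes a Pareto law with minimum value 1, for which the expectation is k/(k-1); the function name is hypothetical, and the mean degree 2.51 is the value from Table 1.

```python
def pareto_k_from_mean(mean_degree):
    """Method-of-moments estimate: solve k / (k - 1) = mean for k (requires mean > 1)."""
    if mean_degree <= 1:
        raise ValueError("the mean must exceed 1 for a Pareto law with minimum value 1")
    return mean_degree / (mean_degree - 1)

k_hat = pareto_k_from_mean(2.51)  # average degree from Table 1
```

Rounded to two decimals, this gives the value of k̂ quoted in the text.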

Figure 5. Empirical cumulative probability function compared with the theoretical estimation, obtained using the moments method

The same plots can be obtained for positive and negative links separately; Figure 6 shows the corresponding empirical cumulative distribution functions. The power-law exponents for positive and negative links are not estimated separately here, as this is not necessary for the subsequent stages of the research, though it can be done using the same method of moments or maximum likelihood estimation.


Chapter IV. Model description and verification

Basis of the model

The best way to investigate the hidden structure of a network is to construct a minimal predictive model. As a good generating algorithm for unlabeled networks is already present in the literature, I decided to take it as a basis and extend it. The basis of my algorithm comes from a paper by Papadopoulos et al. [2]. We will assume the same principles for the appearance of new nodes and the establishment of new links, but extend the model with the possibility of negative links in a minimal way.

The generation starts from an empty network. At every timestep exactly one node appears. Let us denote the polar coordinates of the node that appeared at timestep s as (r_s, θ_s). Here r_s = ln(s) and θ_s is a random variable distributed uniformly from 0 to 2π: θ_s ~ U[0, 2π]. All nodes are situated on concentric circles, every node on its own circle with radius r_s.

A new node establishes new positive connections according to the following rule:

$$p\big(\mathrm{Edge}(s,t) = 1\big) = \frac{1}{1 + \exp\left(\dfrac{\rho(s,t) - d_1}{T}\right)} \stackrel{def}{=} R(s, t, d_1, T) \qquad (5)$$

where Edge(s,t) is a discrete random variable, taking value 0 if there is no connection between nodes s and t, and 1 otherwise;

ρ(s,t) is the distance between nodes s and t in hyperbolic space;

d1 is the characteristic distance;

T is the temperature parameter of the network.

For T=0.5, the plot of R depending on dst looks as follows:

Figure 7. Plot of probability of establishing a positive link between a pair of nodes depending on the hyperbolic distance between them, with temperature T=0.5. R(ρ(s,t), d1=5, T=0.5)
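The shape of this curve can be reproduced with a short sketch. It assumes that the positive-link probability has the logistic form 1 / (1 + exp((ρ - d1) / T)); the function name and the sample distances are hypothetical and only serve as an illustration.

```python
import math

def p_positive(dist, d1, T):
    """Logistic probability of a positive link between nodes at hyperbolic distance dist."""
    u = (dist - d1) / T
    if u > 700:          # exp would overflow; the probability is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(u))

# Same characteristic distance as in Figure 7, several temperatures.
at_threshold = p_positive(5.0, d1=5.0, T=0.5)   # exactly at d1
cold_near = p_positive(4.0, d1=5.0, T=0.04)     # low T, closer than d1
cold_far = p_positive(6.0, d1=5.0, T=0.04)      # low T, farther than d1
hot_far = p_positive(6.0, d1=5.0, T=50.0)       # high T, farther than d1
```

At low temperature the function behaves almost like a step at d1, while at high temperature the probability is close to 0.5 regardless of the distance.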


As the temperature decreases, the establishment of new edges becomes more deterministic, whereas large values of T lead to almost uniform link establishment. In the limit T → 0 the probability becomes a step function:

$$R(s, t, d_1, T) \xrightarrow[T \to 0]{} f(s, t, d_1) = \begin{cases} 1, & d_{st} < d_1 \\ 0, & d_{st} > d_1 \end{cases} \qquad (6)$$

The parameters of this model that should be estimated are d1 and T. In the next subsections I provide a theoretical analysis for their estimation, which differs from the original model by Papadopoulos et al. In [2], they introduced a model where these parameters were estimated locally, i.e. the set of parameters changed at every timestep. This choice is connected with the local clustering coefficient they used for comparing the model with real data; this coefficient equals the percentage of triangles among all node triples in the network. The estimation of the parameters in this work is based on global characteristics of the network, the average positive and negative degrees, and this choice follows from the global clustering coefficient that we use: the frustration, which equals the percentage of links that break the balance of the network.

Hyperbolic space

Figure 7 shows the plot of the edge formation probability function R(s,t,d1,T) depending on the distance between nodes s and t. Here the distance is not Euclidean but hyperbolic:

$$\rho(s,t) = r_s + r_t + \ln\frac{\theta_{st}}{2} = \ln\frac{s\,t\,\theta_{st}}{2} \qquad (7)$$

where θ_st is the minimal central angle between nodes s and t. As all nodes are located on circles with the same center, we can find this angle for every pair of nodes as:

$$\theta_{st} = \begin{cases} |\theta_s - \theta_t|, & \text{if } |\theta_s - \theta_t| \le \pi \\ 2\pi - |\theta_s - \theta_t|, & \text{otherwise} \end{cases} \qquad (8)$$

Having all this in mind, we can draw a sketch of one network generation step. It is shown in Figure 8, where a “hyperbolic circle”, i.e. a set of points at the same distance from a certain point, is shown in red.
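The distance used in the model can be written down in a few lines. The sketch below assumes the form r_s + r_t + ln(θ_st / 2) with θ_st the minimal central angle; the function names are hypothetical.

```python
import math

def min_angle(theta_s, theta_t):
    """Minimal central angle between two angular coordinates, in [0, pi]."""
    d = abs(theta_s - theta_t)
    return d if d <= math.pi else 2 * math.pi - d

def hyperbolic_distance(r_s, theta_s, r_t, theta_t):
    """Approximate hyperbolic distance r_s + r_t + ln(theta_st / 2)."""
    angle = min_angle(theta_s, theta_t)
    if angle == 0:
        return float("-inf")   # coinciding angles: the logarithm diverges
    return r_s + r_t + math.log(angle / 2)

# Nodes born at timesteps 3 and 4 (radii ln 3 and ln 4), separated by an angle of 2.
d = hyperbolic_distance(math.log(3), 0.0, math.log(4), 2.0)
```

With these inputs the distance equals ln(3 · 4 · 2 / 2) = ln 12, and swapping the two nodes gives the same value.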

Let us check some properties of the hyperbolic distance.

1. Symmetry

$$\rho(s,t) = \ln\frac{s\,t\,\theta_{st}}{2} = \rho(t,s)$$

2. Non-negativity

$$\rho(s,s) = \ln\frac{s^2\,\theta_{ss}}{2} = \ln 0 = -\infty \;\Rightarrow\; \rho(s,t) \in (-\infty; +\infty)$$


Figure 8. Illustration of network generation [2]. At every timestep t a new node appears with a radial coordinate equal to ln(t) and an angular coordinate drawn randomly from the uniform distribution U[0; 2π]. This node establishes positive and negative connections with other nodes according to equations (5) and (20) respectively. All existing nodes drift apart, changing their radial coordinates according to (14). Shown in red is a ‘hyperbolic circle’: an equidistant line centered at ‘New node’.

3. Triangle inequality

$$\rho(s,t) \le \rho(s,q) + \rho(q,t) \;\Leftrightarrow\; \ln\frac{s\,t\,\theta_{st}}{2} \le \ln\frac{s\,q\,\theta_{sq}}{2} + \ln\frac{q\,t\,\theta_{qt}}{2} = \ln\frac{s\,t\,q^2\,\theta_{sq}\theta_{qt}}{4} \;\Leftrightarrow\; 2\,\theta_{st} \le q^2\,\theta_{sq}\,\theta_{qt}, \qquad \theta_{st}, \theta_{sq}, \theta_{qt} \in [0; \pi]$$

Obviously, the last inequality does not hold for all values of the angles: for example, it fails when θ_sq is close to zero while θ_st is not. Thus, two out of three properties of a metric are not met, and we cannot consider our distance function a metric. However, it remains useful for our purposes: we can still set an upper bound on distances for filtering out distant nodes, even if distances between very close nodes can go to minus infinity. The violation of the triangle inequality can be explained by the underlying network principles: if one user is similar to two others, those two are not necessarily similar to each other, as similarity can be caused by different reasons.

Distribution of θst

As stated earlier, the angular coordinates θ of the nodes are distributed uniformly:

$$\theta_s, \theta_t \overset{iid}{\sim} U[0; 2\pi] \qquad (9)$$

The minimal angle between two nodes has the form (8). Let us denote θ_st = z and find the distribution function of z:

$$G(z_0) = \iint_{D_{z_0}} f(x) f(y)\,dx\,dy = \frac{1}{4\pi^2} \iint_{D_{z_0}} dx\,dy = \frac{1}{4\pi^2} S(D_{z_0}) \qquad (10)$$

Here D_{z0} is the region of the xOy plane where x and y are distributed as θ_s, θ_t and z(x,y) ≤ z0, and S(D_{z0}) is its area. Two cases are possible.

I. |x − y| ≤ π, so that z = |x − y|:

$$G_I(z) = \frac{1}{4\pi^2} S\big(D^{I}_{z}\big) = \frac{1}{4\pi^2}\Big(4\pi^2 - (2\pi - z)^2\Big) = \frac{4\pi z - z^2}{4\pi^2}$$

II. |x − y| > π, so that z = 2π − |x − y|:

$$G_{II}(z) = \frac{1}{4\pi^2} S\big(D^{II}_{z}\big) = \frac{z^2}{4\pi^2}$$

$$G(z) = G_I(z) + G_{II}(z) = \frac{4\pi z - z^2 + z^2}{4\pi^2} = \frac{z}{\pi} \qquad (11)$$

Thus, minimal central angles are distributed uniformly in [0; π].
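This result is easy to check empirically. The following Monte Carlo sketch draws pairs of independent angles from U[0, 2π] and inspects the minimal central angle; the sample size, the seed and the variable names are arbitrary choices made for illustration.

```python
import math
import random

def min_angle(theta_s, theta_t):
    """Minimal central angle between two angular coordinates, in [0, pi]."""
    d = abs(theta_s - theta_t)
    return d if d <= math.pi else 2 * math.pi - d

rng = random.Random(0)   # fixed seed so the run is reproducible
n = 100_000
samples = [
    min_angle(rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi))
    for _ in range(n)
]

mean_angle = sum(samples) / n                                  # pi / 2 for U[0, pi]
frac_below_half = sum(z <= math.pi / 2 for z in samples) / n   # G(pi / 2) = 0.5
```

For a uniform distribution on [0; π] one expects a mean close to π/2 and half of the samples below π/2, which is what the simulation should produce up to sampling noise.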

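Distribution of ρ(s,t)

$$F(\rho_0) = P\big(\rho(s,t) \le \rho_0\big) = P\left(\ln\frac{s\,t\,\theta_{st}}{2} \le \rho_0\right) = P\left(\theta_{st} \le \frac{2\exp(\rho_0)}{s\,t}\right) = G\left(\frac{2\exp(\rho_0)}{s\,t}\right) = \frac{2 e^{\rho_0}}{\pi s t} \qquad (12)$$

It means that ρ is distributed as follows:

$$f(\rho) = \begin{cases} \dfrac{2 e^{\rho}}{\pi s t}, & \rho \in \left(-\infty;\ \ln\dfrac{\pi s t}{2}\right] \\ 0, & \text{otherwise} \end{cases} \qquad (13)$$

Drifting parameter β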

At every timestep t a new node appears, but the other nodes do not stay in place. They change their radial coordinates according to the formula:

$$r_s(t) = \beta\, r_t + (1 - \beta)\, r_s(t-1) \qquad (14)$$

where r_s(s) = r_s = ln s, r_t = ln t, and β is the drifting parameter. When β=0, there is no drifting at all, and all nodes keep the radii they received at initialization. When β=1, all nodes change their radii to r_t, and the generation algorithm leads to a random graph on a circle.

β depends on the exponent γ of the degree distribution as:

$$\frac{1}{1-\beta} = \gamma - 1 = k \qquad (15)$$

where k is the parameter of the Pareto distribution of the node degrees. This fact is proven in [39].

Formula (14) is a recursive expression. To operate with radii, we need a direct non-recursive expression for them, as a non-recursive expression is easier to program for simulation purposes. Using mathematical induction, we will prove that

$$r_s(t) = (1-\beta)^{t-s} \ln s + \beta \sum_{j=1}^{t-s} (1-\beta)^{t-s-j} \ln(s+j) \qquad (16)$$

I. Basis

When t=s, we get

$$r_s(s) = \ln s$$

II. Inductive hypothesis

Let r_s(t−1) follow the expression (16), i.e. it is true that:

$$r_s(t-1) = (1-\beta)^{t-1-s} \ln s + \beta \sum_{j=1}^{t-1-s} (1-\beta)^{t-1-s-j} \ln(s+j)$$

III. Inductive step

$$\begin{aligned}
r_s(t) &= \beta\, r_t + (1-\beta)\, r_s(t-1) \\
&= \beta \ln t + (1-\beta)\left[(1-\beta)^{t-1-s} \ln s + \beta \sum_{j=1}^{t-1-s} (1-\beta)^{t-1-s-j} \ln(s+j)\right] \\
&= \beta \ln t + (1-\beta)^{t-s} \ln s + \beta \sum_{j=1}^{t-1-s} (1-\beta)^{t-s-j} \ln(s+j) \\
&= \Big[\ln(t) = \ln(s+j) \text{ and } (1-\beta)^{t-s-j} = 1 \text{ when } j = t-s\Big] \\
&= (1-\beta)^{t-s} \ln s + \beta \sum_{j=1}^{t-s} (1-\beta)^{t-s-j} \ln(s+j)
\end{aligned}$$

We obtained exactly the expression that we assumed; thus, formula (16) is proven.
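The equivalence of the recursive and the direct form can also be checked numerically. The sketch below assumes the recursion r_s(t) = β·ln(t) + (1 − β)·r_s(t − 1) with r_s(s) = ln(s), and the closed form with a sum over j = 1, ..., t − s; the function names and test values are hypothetical.

```python
import math

def radius_recursive(s, t, beta):
    """Radial coordinate of node s at time t, obtained by iterating the recursion."""
    r = math.log(s)                       # r_s(s) = ln s
    for step in range(s + 1, t + 1):
        r = beta * math.log(step) + (1 - beta) * r
    return r

def radius_closed_form(s, t, beta):
    """Same quantity from the non-recursive expression."""
    n = t - s
    total = (1 - beta) ** n * math.log(s)
    total += beta * sum(
        (1 - beta) ** (n - j) * math.log(s + j) for j in range(1, n + 1)
    )
    return total

r_rec = radius_recursive(3, 50, 0.4)
r_dir = radius_closed_form(3, 50, 0.4)
```

The two values agree up to rounding error, and the limiting cases behave as described in the text: β=0 leaves the radius at ln s, while β=1 moves it to ln t.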

It will also be useful to derive the expression for s(t) = exp(r_s(t)):

$$\begin{aligned}
s(t) = \exp\big(r_s(t)\big) &= \exp\left((1-\beta)^{t-s} \ln s + \beta \sum_{j=1}^{t-s} (1-\beta)^{t-s-j} \ln(s+j)\right) \\
&= \exp\left((1-\beta)^{t-s} \ln s\right) \prod_{j=1}^{t-s} \exp\left(\beta (1-\beta)^{t-s-j} \ln(s+j)\right)
\end{aligned}$$

$$s(t) = s^{(1-\beta)^{t-s}} \prod_{j=1}^{t-s} (s+j)^{\beta (1-\beta)^{t-s-j}} \qquad (17)$$

Later in our calculations we will write just s instead of s(t) for simplicity, keeping in mind that s is actually a function of t.

Average number of positive links (degree)

Pareto distribution

The network generated using the algorithm described above has the scale-free property, which was proven by Krioukov et al. [39]. It means that the degree distribution of the network follows the Pareto law. The moments estimator showed that k can be estimated as described in formula (4), depending only on the average degree of the real network.

When fitting our model, we want the resulting degree distribution to be the same as in the real network, i.e. we want the average degree to coincide with the real one. Next, we derive a theoretical formula for the average degree, using all the results obtained above.


Average degree

To find the average degree, we simply divide the overall number of edges by the number of nodes.

Every node establishes new outgoing connections only at the timestep at which it appears. It connects to every existing node s with probability R(dst, d1, T); in other words, node t establishes on average R(dst, d1, T) connections with node s. The possible distances between node t and the existing nodes vary in (−∞; d_max(s,t)], where d_max is achieved at θ_st = π:

$$d_{max}(s,t) = r_s(t) + r_t + \ln\frac{\pi}{2} = \ln\frac{\pi\, s(t)\, t}{2} \qquad (18)$$

To calculate the average degree, we need to integrate the function R(dst, d1, T), defined in (5), with respect to dst, taking into account its distribution (12), and sum these values over t ∈ [1, N], where N is the number of nodes:

$$D = \frac{1}{N} \sum_{t=1}^{N} \sum_{s=1}^{t-1} \int_{-\infty}^{d_{max}} R(\rho, d_1, T)\, f(\rho)\, d\rho \qquad (19)$$

First let us consider the underlying integral.

$$\begin{aligned}
I(s,t,d_1,T) &= \int_{-\infty}^{d_{max}} R(x, d_1, T)\, f(x)\, dx = \int_{-\infty}^{d_{max}} \frac{1}{1 + \exp\left(\frac{x - d_1}{T}\right)} \cdot \frac{2 e^{x}}{\pi s t}\, dx \\
&= \Big[ z = e^{x},\ dz = e^{x} dx,\ x \in (-\infty; d_{max}] \Rightarrow z \in (0; \exp(d_{max})] \Big] \\
&= \frac{2}{\pi s t} \int_{0}^{\exp(d_{max})} \frac{dz}{1 + e^{-d_1/T}\, z^{1/T}} = \frac{2}{\pi\, s(t)\, t} \int_{0}^{\pi s(t) t / 2} \frac{dz}{1 + e^{-d_1/T}\, z^{1/T}}
\end{aligned}$$

Unfortunately, this integral cannot be expressed in closed form without knowing the exact value of T, but we can compute it numerically in regions where it converges. The integrand is positive and bounded in the integration region, and the integral converges for all T > 0.
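A numerical evaluation of this integral can be sketched with a simple midpoint rule. The sketch assumes the substituted form (2 / (π·s·t)) · ∫ dz / (1 + exp((ln z − d1) / T)) over z from 0 to π·s·t/2, where s stands for the drifted value s(t); the function name, the number of subintervals and the sample arguments are hypothetical.

```python
import math

def positive_link_integral(s, t, d1, T, n=20000):
    """Midpoint-rule estimate of the expected number of positive links between s and t."""
    upper = math.pi * s * t / 2            # exp(d_max)
    h = upper / n
    total = 0.0
    for i in range(n):
        z = (i + 0.5) * h                  # midpoint of the i-th subinterval
        u = (math.log(z) - d1) / T         # same as (x - d1) / T with x = ln z
        total += 0.0 if u > 700 else 1.0 / (1.0 + math.exp(u))
    return 2.0 / (math.pi * s * t) * total * h

# d1 far beyond d_max: every pair is connected, so the value is 1.
certain = positive_link_integral(5, 10, d1=50.0, T=0.5)
# Low temperature: the value grows as the characteristic distance grows.
small_d1 = positive_link_integral(5, 10, d1=2.0, T=0.04)
large_d1 = positive_link_integral(5, 10, d1=3.0, T=0.04)
```

Summing such values over all pairs s < t and dividing by N gives the theoretical average degree, which can then be matched to the real one by adjusting d1.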

Average number of negative links

Analogously to positive links (5), we should choose a function for the probability of establishing a negative link between two nodes. According to our requirements for negative links, this function should satisfy several rules:

1. Have a finite domain.

Nodes connected with a negative link should be not very similar, but also not very dissimilar.

2. Become more uniform with the growth of the temperature.

As temperature increases, we expect both positive and negative links be distributed more uniformly.

Several families of functions satisfy these properties, among them the cosine function, normal-type functions, polynomial functions, etc. As we have no theoretical support for a particular type of function, for this research we chose the simplest integrable function: the quadratic function.

$$R_{neg}(x, d_2, T) = \begin{cases} 1 - \left(\dfrac{x - d_2}{T}\right)^2, & d_2 - T \le x \le d_2 + T \\ 0, & \text{otherwise} \end{cases} \qquad (20)$$

Negative links will concentrate between nodes whose distance lies in the range [d2-T, d2+T], with a peak at d2.

With three functions in hand (the probability of positive links, the probability of negative links and the density of distances), one can plot them in one picture to better understand how they relate to each other.

Figure 9. Links probability and distances distribution density function

To find the average degree in the simulated network, we summed the number of links created by every node (19). This number can be found by integrating the product of two functions: the probability of establishing a link and the probability density function of distances, which shows how many nodes are situated at a certain distance from the considered node. Figure 9 shows these functions in one plot for clarity.

One would expect that shifting d1 or d2 by a small value α would significantly increase the number of established links, because the number of nodes at a certain distance grows exponentially with the distance. To keep the average degree at the required level, the temperature should decrease, so that the probabilities of establishing new links become more sharply concentrated. Later we will see that exactly this effect takes place when estimating d1 and d2 depending on T.

Now we will focus on the integration procedure, which leads to theoretical expression for d2 estimation. Due to simplicity of the function, the integrals can be taken exactly.

$$\begin{aligned}
I_{neg}(s,t,d_2,T) &= \int R_{neg}(x, d_2, T)\, f(x)\, dx = \int_{d_2 - T}^{d_2 + T} \left(1 - \frac{(x - d_2)^2}{T^2}\right) \frac{2 \exp(x)}{\pi s t}\, dx \\
&= \frac{2}{\pi s t} \left[ \int_{d_2 - T}^{d_2 + T} \exp(x)\, dx - \frac{1}{T^2} \int_{d_2 - T}^{d_2 + T} (x - d_2)^2 \exp(x)\, dx \right] \\
&= \frac{2 \exp(d_2)}{\pi s t} \left[ \exp(T) - \exp(-T) - \frac{(T^2 - 2T + 2)\exp(T) - (T^2 + 2T + 2)\exp(-T)}{T^2} \right] \\
&= \frac{4 \exp(d_2)}{\pi s t} \left[ \left(\frac{1}{T} - \frac{1}{T^2}\right) \exp(T) + \left(\frac{1}{T} + \frac{1}{T^2}\right) \exp(-T) \right]
\end{aligned}$$

Apart from the factor 1/(s(t)·t), the integral does not depend on the nodes t and s, so the remaining factor can be taken outside the sum, giving the expression for the average negative degree:

$$D_{neg} = \frac{4 \exp(d_2)}{\pi N} \left[ \left(\frac{1}{T} - \frac{1}{T^2}\right) \exp(T) + \left(\frac{1}{T} + \frac{1}{T^2}\right) \exp(-T) \right] \sum_{t=1}^{N} \sum_{s=1}^{t-1} \frac{1}{s(t)\, t} \qquad (21)$$

The real value of the average negative degree D_neg can be taken from Table 1 with the characteristics of the network obtained from the real Triodos data. The solutions of equations (19) and (21), depending on T, are shown together in Figure 10. As I stated before, the distances should decrease as the temperature grows, and exactly this pattern is seen in the figure.
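The closed-form value of the negative-link integral can be cross-checked against direct numerical integration. The sketch below assumes the quadratic link function 1 − ((x − d2) / T)² on [d2 − T, d2 + T] and the distance density 2·exp(x) / (π·s·t); all function names and the sample arguments are hypothetical.

```python
import math

def temperature_factor(T):
    """Bracketed factor of the closed form; it depends on T only."""
    return (1 / T - 1 / T**2) * math.exp(T) + (1 / T + 1 / T**2) * math.exp(-T)

def neg_integral_closed(s, t, d2, T):
    """Closed-form expected number of negative links between nodes s and t."""
    return 4 * math.exp(d2) / (math.pi * s * t) * temperature_factor(T)

def neg_integral_numeric(s, t, d2, T, n=20000):
    """Midpoint-rule integration of R_neg(x) * f(x) over [d2 - T, d2 + T]."""
    h = 2 * T / n
    total = 0.0
    for i in range(n):
        x = d2 - T + (i + 0.5) * h
        r_neg = 1 - ((x - d2) / T) ** 2
        density = 2 * math.exp(x) / (math.pi * s * t)
        total += r_neg * density
    return total * h

closed = neg_integral_closed(5, 10, d2=3.0, T=0.5)
numeric = neg_integral_numeric(5, 10, d2=3.0, T=0.5)
```

For small temperatures the factor that depends on T is close to 2T/3, so at a fixed d2 the expected number of negative links shrinks roughly linearly with T.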

Initially, I stated that negative links should be established between not very similar nodes, as seen in Figure 9. This means that for network simulation d1 should be smaller than d2, which occurs when the temperature is small, in the left part of the plot. Computational analysis showed that the maximal T for which d1<d2 is T=0.04. This temperature will be used for the network simulations.

Figure 10. Estimated characteristic distances d1 and d2 for positive and negative links depending on the temperature, N=100

Parameters values

The computations described above should be performed for every network size N that we want to simulate, as equations (19) and (21) depend on N explicitly. Table 2 summarizes the estimates of d1 and d2 for different network sizes N. The temperature is the same for every simulation and equals T=0.04.

Table 2. Estimated characteristic distances d1 and d2 for different network sizes N, temperature T=0.04

N      d1      d2
100    5.04    5.18
200    5.52    5.61
300    5.83    5.94
400    6.05    6.11
500    6.25    6.28
600    6.37    6.45


Verification of the model

In order to verify that the simulated networks have a structure close to the real one, we need to estimate some characteristic of both the simulated and the real network which did not participate explicitly in fitting the parameters of the network generation model. The average positive and negative degrees were already used for calculating the distances, so they cannot serve as support for the simulation model. The percentage of negative links, on the other hand, has not been used yet, and it involves the numbers of both positive and negative links; that is why I chose this characteristic for the validation of the model.

We simulated 10 networks for each size listed in Table 2 and stored their percentages of negative links. These values form a distribution close to normal, as can be seen in the plot in Figure 11.

Figure 11. EDF of the percentage of negative links in the simulated data (blue) compared to the normal approximation (red) and the percentage of negative links in the real network (dashed green, real_p=0.0858).

Assuming that the distribution is indeed normal, the probability of obtaining a value at least as extreme as the real percentage of negative links, which can be taken from Table 1, is 0.33. In other words, the real value lies inside the central part of the simulated distribution rather than in its tails, which means that the mean of the distribution is close to the percentage of negative links in the real data. Consequently, at the 5% significance level the data give no reason to reject the hypothesis that the percentages of negative links in the real and in the simulated data follow distributions with the same mean.
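A probability of this kind can be computed from a normal fit to the simulated values. The sketch below is only an illustration under stated assumptions: it uses a two-sided definition of "more extreme" (the thesis does not say which definition was used), the function name is hypothetical, and the list of simulated percentages is invented toy data, not the simulation output.

```python
import math
from statistics import mean, stdev

def two_sided_p_value(samples, observed):
    """P(|X - mu| >= |observed - mu|) for X ~ Normal(mu, sigma) fitted to the samples."""
    mu, sigma = mean(samples), stdev(samples)
    z = abs(observed - mu) / sigma
    return math.erfc(z / math.sqrt(2))

# Hypothetical simulated percentages of negative links (toy values).
simulated = [0.081, 0.092, 0.078, 0.088, 0.095, 0.084, 0.090, 0.079]
p_value = two_sided_p_value(simulated, 0.0858)   # real percentage from Table 1
```

A large value means that the observed percentage is typical for the simulated networks, while a value below the chosen significance level would indicate a mismatch between the model and the data.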