Analyzing and visualizing news spread based on images in Social Media networks


Faculty of Science


Author: Fernando Flores García

UvA ID: 10408134

Degree: MSc in Artificial Intelligence

Supervisor: Marcel Worring


1 Abstract

We present a new approach to analyzing how news items spread in Social Media networks. Our goal is to gain insight into this phenomenon, where insight is defined, following the Merriam-Webster Dictionary, as “the capacity to discern the true nature of a situation” or “the act or outcome of grasping the inward or hidden nature of things or of perceiving in an intuitive manner” [1]. The best way to gain insight is a combination of analysis and visualization.

The data analysis relies on a model that takes the intrinsic characteristics of news items and returns their potential for success based on two features: their burstiness and their speed of propagation. To obtain these two metrics, we have defined a model that returns the joint success probability of a news item based on the two parameters.

This data analysis is supported by a visualization. Its main layout is based on a set of tweets that form a topological network, complemented by a series of statistics that describe how images spread throughout the network. The visualization is the means by which the outcome of the model is presented and measured.

The questions we want to answer are: How does news spread through Social Media networks? Is it possible to find any pattern in this process? Is it possible to foresee how future news items will be treated based on their similarity to past news items?

Based on the results obtained, we can affirm that news items spread fastest in the first part of their spread process, where their burstiness potential is high, and that this potential diminishes as the spread progresses. Speed shows a similar pattern. We have also found that the news items with the highest thriving potential are those with international scope and those whose topic is related to Entertainment or Sports. The latter category is also the one with the highest burstiness potential.


2 Introduction

During the last decade the distribution of news has changed substantially. In the last century information broadcasting was slow due to technological limitations. Since the boom of the Internet, supported by the emergence of smartphones, users can access real-time information wherever they are, with almost no limit. But this flow of real-time information is not unidirectional: users are now also able to spread whatever news items they consider significant. Social Media networks such as Twitter or Facebook have made users a key player in news spread, a player with the capacity to boost any news item. Today, Social Media and social networks coexist in the same web environment; that is, it is possible to collect massive amounts of online or offline Social Media data and at the same time capture the effects of Social Media as well as the influence exerted through the activity of social networks. This new relation allows experts to study the diffusion processes in the network.

A key aspect and main contribution of this new approach is the use of images as input instead of textual information. Images are easy to recall and understand; they can portray the content of a news item and convey what is happening at a glance, something crucial nowadays, when being the first to transmit a news item is often more important than the quality of the item itself. This is why we investigate how news spreads in Social Media networks based on the attached images and not on the textual information. By doing so we try to find patterns characteristic of this particular type of news and, if possible, more generic patterns relating the different types of news. We deal with images in both analysis and visualization by searching for any image attached to a tweet and checking whether this image has been retweeted, with the idea of constructing a network that makes it easy to follow the transit of the image around the world.

Our motivation is to gain insight into how news items spread in Social Media networks. This is interesting because these networks have become a main news source worldwide, and it seems this trend will continue in the coming years. That is why understanding how news is broadcast throughout networks is of vital importance for certain groups. To gain this insight we consider a combination of data analysis and visualization to be the best option. As a secondary target group, companies might be interested in how users interact with each other on Twitter as a basis for their campaigns. Among companies, those that might find it most useful to have the data displayed in such a visualization are marketing companies, as their business is partly based on understanding how people interact. Besides companies, two main groups might find this application useful. The first is journalists, who are more focused on the news itself: how it spreads and which types of news spread fastest or grow the most. The second is social scientists, who are focused on users and on how they interact depending on parameters such as age or gender. It is also useful for social scientists to know which types of news thrive in a certain demographic group.

So how does information flow in Social Media networks? This question has been studied from different standpoints. Cha et al. [2] define three major roles in how news about both major international and minor events spreads throughout Social Media networks: mass media sources, grassroots and evangelists. The first group is able to reach most of the users despite accounting for only 0.01% of them, whereas the grassroots are the standard users, quite passive but numerous, and the evangelists (politicians, celebrities, etc.) play a major role in spreading very specific information in either small or large circles. Users also behave differently in a social network than in real life. Wilcox and Stephen [3] argue that social networks enhance self-esteem in users who are focused on close friends, something that is reflected in their behavior, including when broadcasting information. A user focused on his friends will “take care” of them and will try to keep them as informed as possible by, for example, retweeting the news items he considers important.


The number of followers a user has is of vital importance for certain users such as companies or News Media agencies, because it can indicate their popularity and prestige. We pay special attention to News Media agencies and to how news spreads throughout the Twitter network and, in particular, to how the images attached to these news items propagate via Twitter. In certain cases, such as pictures, a piece of information might be shared many times: a user uploads the content and shares it with his group of friends, some of these friends share the image with their own friends, and so on, creating a sharing cascade [4].

Among all the Social Media networks that have emerged during the last decade, Twitter has become one of the most influential, mainly because it is easy to use and because its topological characteristics and its usability make it a perfect tool for broadcasting information [5]. Nowadays thousands of celebrities publish their personal and professional information on this social network, but this is only part of the appeal of this microblogging tool. It currently has more than 200 million users and is available in more than 30 languages.

The main strength of Twitter is its simplicity: users send text messages limited to 140 characters. Users may subscribe to others’ tweets, that is, follow them. Part of the success of Twitter stems from the appearance of smartphones, which allow users to update their statuses wherever they are. Another key feature is the so-called retweet, that is, the action of sharing a tweet. This is the way information spreads in Social Networks. Finally, another important feature is the hashtag, a way to define keywords that is useful for keeping track of a certain topic.

Twitter has become a popular channel for propagating information in recent years, and because it is an always-on tool that can be used anywhere, it allows news items to spread faster than in other media such as newspapers or radio. User interaction is also important on Twitter, as users are the ones who spread messages to their followers, creating a network whose characteristics determine the success of a news item. Moreover, society nowadays demands immediate access to news, and user opinion on platforms such as Twitter has become a significant way to measure the popular sentiment of a country. Twitter has thus become a major source for the study of society.

Bearing this in mind, we have defined a model whose data is supported by a visualization, motivated by the question of whether the differences that characterize news items can be expressed visually, using Twitter as the information source. Visualizations provide better insight in this particular task, and geographical ones are the most useful, as they express topologically the structure of the network referred to above. This topological structure is interesting for our purposes because it represents the network data structure while preserving geographical information. The analysis of the data is also important for gaining insight into Twitter data: the relevant data is expected to be sparse among the vast number of tweets, so data mining and subsequent analysis of the retrieved tweets are mandatory.

The chosen approach is to create a joint layout for these two groups, where both can find useful information at a glance. The visualization should provide a series of tools, such as a different color for each type of news, statistical panels with the values of the main parameters of a news item, and global statistics per type of news. As all this information is computed dynamically, end users can gain insight into how news items evolve during their spread.


The attempt to display Twitter information is not new; others have done it in the past from different points of view. Most of them used the idea of an underlying network of elements, but not all displayed this network in geographical terms, as many were only interested in the relations between Twitter users. In this first group we may include the user networks created by Blancs [6] and Arikan [7].

Other applications were created to take advantage of geotagged data retrieved from Twitter by means of a geographical visualization. Most of these visualizations are part of statistical and network monitoring tools such as tweetPing [8] or A world of tweets [9].

There is another group of applications that use some interesting methods, such as element clustering in Google Maps [10] or OpenStreetMap [11]. Other tools, such as Just landed [12] and Languages on Twitter [13], perform their own data analysis to retrieve structured information.

Finally, another group of visualizations uses Twitter for tasks other than locating tweets, and these are the most widespread. One example is the prediction of major events in the days before they occur, as in Kallus [14], who claimed that it was possible to foresee via Twitter the Egyptian people’s discontent with the Government in the days leading up to the 2013 coup d’état in Egypt. Others, like Rojas [15], assert that Social Media may predict the outcome of future elections, but this idea is still to be developed and proven.

The next section discusses related work on both the analysis and the visualization sides. The subsequent sections describe the data analysis and then define the model used. The last sections explain the methods used and report the results obtained.


3 Related work

Several solutions to the problems we address have been proposed in the past.

3.1 Data analysis

By data analysis we refer to the process of taking raw data from a data source and processing it to meet the requirements for use as input to the visualization. Others have carried out this task in the past by different means.

The use of networks to represent Twitter relations is not new. Early approaches such as that of Blancs [6] define a network of users whose depth depends on the number of levels between a user node and the central one (1 for friends, 2 for friends of friends, etc.). The main problem of this type of network is intrinsic: it does not scale well with a large number of friends, and has trouble showing relations of level 4 and higher. Arikan [7] defines a so-called Twitter social network that uses the idea of a network of tweets in which edges represent follow actions. Finally, the tool mentionMap [16] uses this idea of a network of tweets as the basis for its visualization but, once again, no geographical information is used; in that project the relations among users are the only source of positional information.

All the projects described so far, although they introduce the idea of a network of tweets, do not use geographically located data taken from the tweets themselves, something essential for a geovisualization. The Internet is by definition a position-free network, i.e., users are able to access all the sites in the network no matter where they are. This might seem to conflict with the idea of geotagging user positions, but News Media tools take advantage of this feature to retrieve topological information on where people access a site from and how they use it. In our approach we also use this geographical information, in the form of latitude and longitude values, but this is not the first time this has been done. There are visualizations and tools that use some sort of map layout to display geographical information; some of them are production applications and others are scientific visualizations whose code is not available on the Internet.

The possibility of tracking information in real time or pseudo real time has been considered in the past. One example is Twittervision [17], a tool created back in 2007 that remains popular today. Defined as a web mashup of Twitter and Google Maps, it keeps track of a Twitter user’s activity in real time. An implicit drawback of this application, due to its real-time nature, is that it does not keep a history of tweets. It does not define a network of tweets either: it uses an event-like approach instead, showing what is happening as a pop-up on the screen, and only for the tweets of the user and of the users he is following.

Twitter data is used for statistical purposes in some visualizations, such as A world of tweets [9], a real-time visualization of geolocated tweets around the world. It uses statistical and historical data on the whole Twitter network, based on a continuously collected set of statuses. The visualization tweetPing [8] also provides a good set of statistical data, monitoring Twitter activity in real time and showing a large amount of information that is refreshed every second. This information is not stored, however: the counters are reset every time the site is reloaded. Also the visualization Just landed [12] stores a


Another interesting data analysis technique used in this visualization is clustering, that is, the task of grouping elements with similar properties. The web tool tweepsMap [18] provides an interesting clustering method for tweets that returns the percentage of total followers in a certain area, larger or smaller depending on the zoom level, defining these percentages for countries, provinces and cities. As a drawback, no historical data is stored, so only real-time information can be obtained.

Regarding the model, others have studied how to measure the success of a news item in Social Media networks using different parameters. One example is speed, a metric that is easy to measure but not widely defined as such in related work. For instance, Xu and Liu [19] resort to speed to detect a series of implicit dynamics in Social Media networks, with the goal of detecting rumors spread on them. To do so they define a model that has the speed at which news spreads as one of its input variables.

The second outcome parameter that measures the success potential of a news item spread on a Social Media network is burstiness. This parameter is more difficult to measure than speed due to its nature. Kim et al. [20] propose a way to measure the burstiness potential of a keyword or topic on Twitter; their model is very robust and able to handle issues such as abbreviations or typing and spacing errors. Another approach to bursty words on Twitter is that of Mathioudakis and Koudas [21], implemented in the tool TwitterMonitor. This tool identifies emerging topics on Twitter in real time and, given certain criteria, can measure and rank the topics by their bursting potential.

3.2 Visualization

As with data analysis, the visualization of Social Media data, and especially Twitter data, has a long history. User relations have been visualized as a network in visualizations like [7], where the author created three visualizations (one per week) keeping track of how these links grow and create new interactions. In this visualization no geographical information is taken into consideration; only topics and followers matter. Networks have also been used to show how people interact on Twitter in the tool mentionMap [16], discussed above. This network exploration tool is clear and easy to use, yet very powerful, and retrieves all the relevant information needed to summarize the relations of a user on Twitter. An interesting feature of this project is its interactive nature, giving the user total freedom of movement throughout the network.

In [22], Ríos defines a neat and clear visualization of all the geotagged tweets from 2009 to 2013. The author of this visualization is part of the Twitter Developer team, so he had access to billions of tweets for this project, something not feasible for other teams. In this case, no further information from the tweets is used, only the latitude and longitude.

Just landed [12] extracts travel information from tweets and plots the journeys on a map. The map itself remains two-dimensional but the “flights” are visualized as three-dimensional curves. As an important feature, the chronological ordering of the tweets makes it possible to review a certain time period. This visualization is impressive and clear at the same time, but although it uses edges it is far from defining a Twitter network. It is shown in figure 1.


Figure 1: Layout of Just landed [12]

A world of tweets [9] also uses geographical information, showing where people have been tweeting from during the past hour via a heatmap in which the more tweets there are from a specific region, the “hotter” or redder it becomes. According to the authors of [9], the activity of Twitter users makes it possible to draw a new map of the world that evolves during the day according to the time zones and the spread of mobile technologies. tweetPing [8] also provides a robust and appealing visualization, showing a map on which every position where a tweet is posted is highlighted, yielding a heatmap of the areas where Twitter is most used. The main drawback of these two visualizations is that they do not show media information; instead they display statistical data. Finally, conceived as a tool for uploading geographical information, mapsData [23] is a good example of what can be done by merging Twitter and a geographical visualization. The output is a visualization of this data with interesting features such as heatmaps, clusters, markers or bubble maps. This tool does not contemplate tweet networks or image tracking.

Another tool with a powerful and useful visualization is trendsMap [24]. The idea behind it is that Twitter is a network of “trends”, so it shows all these trends, which can be words, users or tags. The trends are defined worldwide, but regional and local trends can also be shown by zooming and panning. As an outstanding feature, the tool stores up to 7 days of historical data but, as in the previous projects, media-related information such as images is not used.

Another interesting approach to showing information obtained from a Social Media network is that of Dou et al., with the method called LeadLine, defined in [25]. LeadLine is an interactive Visual Analytics tool that automatically identifies important events in news and Social Media networks. Besides a map, this visualization includes interesting features such as a streamgraph to represent the temporal evolution of a topic, a good way to detect temporal events related to that topic.

Regarding the layout of Twitter statistics, tweetPing [8] presents all the basic information returned by the application in an appealing and very clever way. This statistics layout is shown in figure 2.


Figure 2: Statistics layout of tweetPing [8]

Other visualizations have studied how images spread in Social Media networks. For instance, Itoh et al. [26] define a system to analyze social behavior by recognizing changes in trends in people’s ideas, experiences or interests, using as input both images and text obtained from different Social Media tools like Japanese newspapers or Twitter. The difference between our approach and theirs is that the authors of [26] created a three-dimensional visualization of histograms of stacked images on a timeline, where the third dimension represents the different topics. The authors also overlay line charts on the histograms to make them easier to compare. The layout of this visualization is shown in figure 3:

Figure 3: Visualization created by Masahiko Itoh et al. [26]

We have found some concepts missing in [26] that we want to improve on. The main one is the use of images as input: although others have used it, as in [26], it has not been widely established as a research task. A second concept, more common than the use of images but applied in different contexts, is the creation of a topological network of tweets based on users’ locations. Finally, we miss a way to define the success of a news item in Social Media networks.


4 Data analysis

In order to perform the data analysis, we first need to obtain information about the messages and images, as well as about the location of the messages. We use as much information as possible from Twitter itself. Because the geographical information is sometimes not stored in Twitter, we also use an external gazetteer resource called GeoNames [27], which provides the geographical longitude and latitude based on the location given by the user in Twitter.

4.1 Twitter

Due to a limitation imposed by the Twitter API (version 1.1 [28]), it is only possible to create graphs of at most two layers, where the first layer is the original message and the second consists of the retweets of this original message. If someone retweets a retweet, only the original message is stored, so the intermediate information is lost. As we are interested in the spatial location of both the tweet and the retweet, the Twitter graph of a news item spread might look like the one shown in figure 4:

Figure 4: Spatial appearance of a Twitter spread graph.
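The two-layer structure described above can be illustrated with a minimal Python sketch. The class and attribute names are hypothetical and chosen only for illustration; they are not taken from the implementation used in this work.

```python
from dataclasses import dataclass, field


@dataclass
class SpreadGraph:
    """Two-layer spread graph: one original tweet plus its direct retweets."""
    origin_id: str
    origin_coords: tuple  # (latitude, longitude) of the original tweet
    retweets: dict = field(default_factory=dict)  # retweet id -> (lat, lon)

    def add_retweet(self, retweet_id, coords):
        # Every retweet is attached to the origin, even when the user actually
        # retweeted another retweet: the intermediate hop is not available.
        self.retweets[retweet_id] = coords

    def edges(self):
        """Return one (origin coords, retweet coords) pair per retweet."""
        return [(self.origin_coords, c) for c in self.retweets.values()]


# Illustrative coordinates only.
graph = SpreadGraph("t1", (52.37, 4.90))
graph.add_retweet("r1", (40.42, -3.70))
graph.add_retweet("r2", (40.71, -74.01))
print(len(graph.edges()))
```

Because every edge starts at the origin, the graph is a star: its depth never exceeds two layers, which mirrors the API limitation.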

The Twitter API 1.1 defines four types of objects: Tweets, Users, Entities (which provide metadata and additional contextual information about content posted on Twitter) and, if the user has filled them out, Places (specific, named locations with corresponding geo coordinates; tweets associated with a place are not necessarily issued from that location). Of these four objects, the two we are most interested in for obtaining a good data mapping are Tweets and Places, which give us the metadata of the tweet, the information about the retweet and the location from which the tweet was sent.

As we are keeping track of how images spread on Twitter, we are crawling the application looking for tweets with an attached image. With this information in mind, for each tweet and its possible retweet, we use the set of attributes included in table 1.


Field name            Definition
Id                    Tweet unique id
Username              Username of the user that sends the tweet
Location              Location the user sends the tweet from
Timestamp             Time the tweet is sent
Country name          Country the tweet is sent from
Country population    Population of the country the tweet is sent from
City name             City the tweet is sent from
City population       Population of the city the tweet is sent from
Number of retweets    Number of times the tweet was retweeted
Text                  Actual text of the status update
Latitude              Geographical latitude from where the tweet was sent
Longitude             Geographical longitude from where the tweet was sent
Tags                  Hashtags mentioned in the text
User mentions         Usernames mentioned in the text

Table 1: Fields extracted from the Twitter API

Besides these attributes, which are stored for both the tweet and its retweet, we also store two more generic attributes that are shared by both tweets in the case of a retweet, or belong only to the tweet when it is the origin of a new spread. These are the URL of the attached image and the URL of the message.
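As an illustration, the stored attributes could be modelled as follows. This is a sketch only: the field names are a hypothetical Python rendering of table 1, and the split into two record types is an assumption made here for clarity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TweetRecord:
    """One row of table 1, stored for the tweet and for its retweet."""
    tweet_id: str
    username: str
    location: str           # free-text location given by the user
    timestamp: str
    text: str
    number_of_retweets: int = 0
    country_name: Optional[str] = None
    country_population: Optional[int] = None
    city_name: Optional[str] = None
    city_population: Optional[int] = None
    latitude: Optional[float] = None   # may be missing if not geotagged
    longitude: Optional[float] = None
    tags: List[str] = field(default_factory=list)
    user_mentions: List[str] = field(default_factory=list)


@dataclass
class SpreadRecord:
    """Generic attributes shared by a tweet and its retweet."""
    image_url: str
    message_url: str
    tweet: TweetRecord
    retweet: Optional[TweetRecord] = None  # None if the tweet starts a spread
```

Keeping the coordinates optional reflects the fact that they are not always available from Twitter and may have to be filled in from another source.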

Not all this information can be obtained directly from Twitter. Sometimes the user does not use the geotagging function of the Twitter smartphone application, or uses Twitter from a desktop PC or laptop. Under these circumstances the latitude and longitude are not obtainable from Twitter, so another resource must be used. We have used an external gazetteer resource called GeoNames [27]. GeoNames is able to accurately return the longitude and latitude of a city based only on the city or country name.

Unfortunately, the accuracy of GeoNames, although high, might not be sufficient, depending on the location the user has entered. First, the user may enter an invented or wrong location, which results in either an error or wrong coordinates. The user may also misspell the location. This is not a critical issue, as we use a fuzzy factor when calling GeoNames, so the resource itself is able to overcome a misspelling of one or two letters. Finally, the resource might return a wrong location because there may be more than one city with the same name.
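A lookup of this kind might be sketched as below. The sketch assumes the GeoNames `searchJSON` web service with its `q`, `maxRows`, `fuzzy` and `username` parameters and a response containing a `geonames` list with `lat` and `lng` fields; the fuzzy value of 0.8 and the function names are illustrative assumptions, not the settings used in this work. No network call is made here: one function builds the request URL and the other parses an already decoded response.

```python
from urllib.parse import urlencode

# Assumed endpoint of the GeoNames search web service.
GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"


def build_query(location, username, fuzzy=0.8):
    """Build a search URL for a free-text location (fuzzy value is assumed)."""
    params = {"q": location, "maxRows": 1, "fuzzy": fuzzy, "username": username}
    return GEONAMES_SEARCH + "?" + urlencode(params)


def parse_coordinates(response):
    """Return (latitude, longitude) from a decoded response, or None."""
    results = response.get("geonames", [])
    if not results:
        # Invented or unrecognisable location: nothing to place on the map.
        return None
    best = results[0]  # may still be the wrong city if the name is ambiguous
    return float(best["lat"]), float(best["lng"])
```

Taking only the first result is the simplest choice, and it is exactly where the ambiguity between cities with the same name enters.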

4.2 Images

As explained in the previous section, we intend to create a Twitter graph based on the images attached to the messages. To do this, we keep track of the field in the tweet metadata that contains the URL of the image. So, rather than checking the text that the user has written, we check the image that he has shared with his contacts. This process might lead to some errors, mainly because the image may not match the topic and text of the news item.
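The following sketch shows one way to group tweets by their attached image. It assumes tweets are available as dictionaries in the Twitter API 1.1 JSON layout, where attached photos appear under `entities.media` with a `media_url` field and a retweet embeds the original tweet under `retweeted_status`; the function names are hypothetical.

```python
def image_url(tweet):
    """Return the URL of the first attached photo, or None if there is none."""
    for media in tweet.get("entities", {}).get("media", []):
        if media.get("type") == "photo":
            return media.get("media_url")
    return None


def group_by_image(tweets):
    """Map each image URL to the ids of the tweets and retweets sharing it."""
    spreads = {}
    for tweet in tweets:
        # A retweet embeds the original tweet; the image is looked up there.
        source = tweet.get("retweeted_status", tweet)
        url = image_url(source)
        if url is not None:
            spreads.setdefault(url, []).append(tweet["id_str"])
    return spreads
```

Tweets without an attached photo are simply skipped, so the text of the message never influences the grouping.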

As in the case of the longitude and latitude, we cannot be sure that the attached image and the text match each other, as the two may have different contexts. It is likely, for instance, to find a quote on a celebrity in a tweet that is about a different topic, and as we are not using an image classifier we cannot discard this tweet even when it is clear that the image of the celebrity and the topic of the tweet have nothing to do with each other. To minimize these issues in an unsupervised setting, the best option is to rely only on trusted users and companies, or to consider only tweets that have been retweeted a large number of times. By doing this, the chance of a real match between the attached image and the text of the news item increases significantly.

In this case we assume that the more people retweet an image, the more reliable the original poster is. Moreover, we label the spreads by taking into account both the content of the news item and the attached image, avoiding possible mismatches between the two. The news item and its attached image are labeled according to different parameters, among them the topic of the news item. In this manner, the evaluation is more reliable.

Problems might arise when crawling Twitter without subsequent human supervision or image classification, since it will then not be possible to discard all the wrong matches. In that case some heuristics can be followed, such as relying only on trusted sources like News Media agencies or notable celebrities, though for the latter some errors are still possible.
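These heuristics can be written as a simple filter. In this sketch the whitelist entry and the retweet threshold are placeholder values chosen for illustration; the text does not specify which sources or which threshold were actually used.

```python
# Hypothetical whitelist and threshold, for illustration only.
TRUSTED_USERS = {"example_news_agency"}
MIN_RETWEETS = 100


def is_reliable(username, retweet_count,
                trusted=TRUSTED_USERS, min_retweets=MIN_RETWEETS):
    """Keep a tweet if it comes from a trusted source or was widely retweeted."""
    return username in trusted or retweet_count >= min_retweets
```

Either condition is sufficient, so a widely retweeted image from an unknown user is kept, in line with the assumption that many retweets indicate a reliable original poster.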

Although potentially serious, mismatches between images and topics are, in our experience, not very common because, even when the image seems unrelated, some elements of the image itself agree with the topic, such as a quote or a non-visual reference that is only detected by humans and not by classifiers. This is why a prospective mismatching error will not heavily affect the final outcomes of the model.


5 Model

To gain better insight into how news spreads on Twitter, we need to define a set of intrinsic characteristics of news items; based on these inputs, a set of parameters is obtained that defines the potential of a particular news item. This thriving potential is defined in terms of burstiness and spread speed, which are the outcomes returned to define the success of a particular news item.

5.1 Assumptions

In order to define our model, let us first state the assumptions that need to be taken into consideration:

• The more followers a user has, the more influential he is.

• Generally speaking, accounts of companies and celebrities will have a large number of followers.

• The population in the area where a news item occurs affects the speed and spread potential of the item itself.

• The place where a news item takes place affects its spread.

• The more time has passed since a news item occurred, the lower the chance of a big spread through the network.

The first assumption refers to the fact that when a user is followed by a large number of other users his influence is higher, and thus the probability that his tweets are retweeted more often and faster increases. Conversely, a user with a low number of followers has a lower chance of having his posts shared. The same applies to the accounts of companies and celebrities: their potential to reach more users is higher than that of so-called standard users. This is because they transfer their fame outside the network, in real life, to the media network, which leads to an expected increase in the number of followers of such users in the network.

Population and number of users are not always correlated, so the third and fourth assumptions should be seen as two interconnected halves. In the third assumption the population is linked to the success potential of a news item: the more people there are in a certain area, the higher the chance that a news item that happened there is uploaded to a Social Media network and spread by someone. But this is not always true, because not all regions of the planet have the same characteristics: standard of living, Internet access and literacy level depend on the region. A news item occurring in the New Delhi Metropolitan Area is not the same as one occurring in New York, even though both have approximately the same population.

The time elapsed since the event that originated the news item also heavily affects its spread in the network. We assume that a news item that does not spread fast has a high probability of not spreading widely.

5.2 Metrics and parameters

5.2.1 Metrics

Having stated the assumptions, we now define the parameters the model takes and the metrics it returns. Two output metrics indicate how successful a news spread on Twitter is:

• Speed of the news item.

• Burstiness of the news item.

Speed is the velocity at which a news item spreads through the network. This metric is crucial in relation to the second output metric, burstiness. Burstiness is less intuitive than speed, but a good definition is the one provided by Renaud Lambiotte and Lionel Tabourier: burstiness is the set of intermittent increases and decreases in activity or frequency of an event, and distributions of bursty processes or events are characterised by heavy, or fat, tails [29]. This broad definition fits Social Media networks well: the event is the origin of the spread, and the increases and decreases correspond to the temporal activity of the network in response to the event that generates the news item.

5.2.2 Parameters

The model takes two input parameters:

• Scope of the news item.

• Topic of the news item.

By scope we mean the geographical range of the news item, which we classify into four types, from smallest to largest: the personal or friend group, the local scope, the national scope and the international scope. By definition, every scope contains the previous ones, as illustrated in figure 5:

Depending on the scope, we assume that news items behave differently, that is, they have a different burstiness and speed potential. For the personal scope we consider the burstiness potential to be mid-low because of the limited number of individuals involved, but given the close bonds assumed within a group of friends or acquaintances the speed potential is high. For the local scope the burstiness potential is medium because it is partially limited in space, but the speed potential remains high due to the assumed proximity among the users. For national news the burstiness potential may be considered medium to high, and the speed potential ranges from medium to high depending on the topic. Finally, for international news both burstiness and speed are considered the highest of all scopes. Our hypotheses are shown in table 2.

Scope type      Burstiness potential   Speed potential
Personal        Mid-low                Mid-high
Local           Mid                    High
National        Mid-high               Mid-high
International   Mid-high to high       High

Table 2: Burstiness and speed potential based on the scope

The second parameter used to define the model is the topic, namely the type of news. We have separated news items into five main topic classes: Sports, Economics, Politics, "Science, Technology and Culture" and Entertainment. For Sports, burstiness and speed should be mid or mid-high, although this also depends on the sport itself. For Economics, burstiness and speed tend to be average, while Politics news should have a burstiness and speed above average. Science, Technology and Culture news does not usually spread widely, except for big events such as the release of a new smartphone, so its burstiness and speed tend to be below average. Finally, the behaviour of Entertainment news depends strongly on the item itself: an Entertainment news item rarely spreads fast, but it may burst strongly, as many trending topics on networks like Twitter are based on gossip or TV shows. The speed and burstiness potential associated with each type of news are summarized in table 3:

Topic type                        Burstiness potential       Speed potential
Sports                            Mid                        Mid-high
Economics                         Mid                        Mid
Politics                          Mid-high                   Mid-high
Science, Technology and Culture   Mid-low                    Mid-low
Entertainment                     Depends on the news item   Depends on the news item

5.3 Network diffusion

To model how information diffuses in a network, let us consider some methods that have been defined in the literature. Information can spread in a network and reach other nodes in two ways: through the connections established among the users, and through the influence of external sources, such as News Media agencies, relatives or friends. We elaborate on both below.

The first models of information diffusion in Social Media networks considered only the relations among users. In recent years, new models such as [30], which take both mechanisms into account, have arisen. According to this type of model, information "jumps" across the network, and the explanation offered by the authors is that there is an unobservable external influence on the network. Reportedly, about 71% of the information volume in Social Media networks such as Twitter can be attributed to network diffusion, and the remaining 29% is due to external events and factors outside the network.

Information spreads through a Social Media network in different ways, but a good approximation, and to some extent a generalization, is shown in figure 6. Here, we build a link from one node to another if the latter mentions the former in a tweet containing a topic that the former had talked about earlier. As networks such as Twitter have no explicit threading mechanism, this is the best available approximation of the path along which the original user's topic diffuses. Figure 6 also shows how a diffusion network is created. All posts that contain the topic, identified by a keyword or hashtag in Social Media networks like Twitter, are labeled with timestamps, and the diffusion links are created as explained above. The blue nodes are inside the network, while the yellow ones mentioned the topic but are not linked to any other message:

Figure 6: How a topic is diffused (left) and network structure (right) [31]
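The link-building rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in [31] or in this work: the post fields `user`, `time` and `mentions` and the function name `build_diffusion_links` are hypothetical, and all posts are assumed to have already been selected by the topic keyword or hashtag.

```python
def build_diffusion_links(posts):
    """Build diffusion links from topic posts.

    posts: iterable of dicts with the hypothetical keys
      'user'     -> author of the post
      'time'     -> timestamp (any sortable value)
      'mentions' -> list of users mentioned in the post
    Returns (links, isolated), where links is a list of
    (source_user, target_user, time) tuples and isolated lists the users
    who posted about the topic without being part of any link.
    """
    first_seen = {}  # user -> time of their first post about the topic
    links = []
    for post in sorted(posts, key=lambda p: p["time"]):
        for mentioned in post["mentions"]:
            # Link only if the mentioned user talked about the topic earlier.
            if mentioned in first_seen and mentioned != post["user"]:
                links.append((mentioned, post["user"], post["time"]))
        first_seen.setdefault(post["user"], post["time"])
    linked = {user for src, dst, _ in links for user in (src, dst)}
    isolated = sorted(set(first_seen) - linked)
    return links, isolated


example_posts = [
    {"user": "a", "time": 1, "mentions": []},
    {"user": "b", "time": 2, "mentions": ["a"]},
    {"user": "c", "time": 3, "mentions": []},
    {"user": "d", "time": 4, "mentions": ["e"]},
]
example_links, example_isolated = build_diffusion_links(example_posts)
```

In this toy input, user b mentions user a after a has posted about the topic, so one link from a to b is created, while c and d correspond to the unlinked (yellow) nodes of figure 6.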

With this in mind, it is possible to develop a model based on a series of local dynamics that measure the information spread along the following three dimensions:

• Speed: velocity of the spread.

• Scale: number of affected nodes in the network.

• Range: how far the diffusion continues.

In networks like Twitter, the range can measure different magnitudes. As will be explained later, in our case the range measures the geographical distance between the node that generates the information and the node that receives and rediffuses it. These three dynamics are illustrated in figure 7:

Figure 7: Local dynamics of a Social Media network [31].

External influence on the network is of vital importance for information diffusion. On Social Media networks such as Twitter, users often post links to external websites, mainly news articles, videos or images. In our case the external influence is particularly significant, as we are evaluating the effect of attached images, which capture the attention of the user. Therefore, to create a feasible model it is important to know the effect of attaching images to a node.

5.4 Burstiness and speed

Two characteristics are the most influential in defining the success of a sharing cascade or spread. The first is speed, which here describes how fast the spread is. The second is burstiness, used here to describe the spread potential in terms of frequency over the total population, speed and, in the case of geographically-based networks, area [32]. A spread is more bursty the more users in a certain area participate in it and the less time it takes to cover that area.

Is it really possible to generalize over how these networks are created and spread? It is hard to give a fully trustworthy answer to this question because extensive message networks in Social Media applications such as Twitter, with thousands or even millions of messages, are rare. Only for global events is it possible to obtain reliable information. Even then, it is not yet clear whether the behaviour of Social Media networks depends on the size of the network itself.

5.4.1 Speed

Speed is important in determining the success of a news item spread: we consider that the faster a news item spreads, the more thriving it is, so potentially successful news will be shared faster than less thriving news. This concept was developed in depth by Yang and Counts in [31], who defined a model based on hazard regression [33] to quantify the degree to which a number of features of both users and messages predict the speed of diffusion to the first-degree offspring. The same constraint applies to networks based on tweets because, so far, it is not possible to create networks with more than two levels (a message and its corresponding retweets).

Aspects such as the author, the author's activity and other inherent characteristics imply a faster or slower diffusion of messages. For instance, messages created by a big company or by an influential figure have a higher probability of spreading fast in a network. Note that the author need not be a famous figure: even in smaller circles such as friend groups there are always more and less influential individuals.

Other characteristics that may influence the success of a message are the inclusion of text or media links, text formatting (not available in all networks) and mentions of other users (a message with a high number of user mentions has a high probability of being shared quickly and widely).

Empirical experiments, such as the one in [31] on how long it takes for a message on Twitter to be shared depending on its topic, show that when the author is more active in posting and is mentioned more often, the current message is shared within a short period of time. Also, when the message is a mention, it has a higher chance of continuing its diffusion. Finally, in the case of events, messages created at an early stage of the event are more likely to be retweeted in a short time. On the other hand, the presence or absence of media links does not seem to affect the speed at which a message is shared.

All in all, speed is a determining factor for the success of content in a network. Aspects such as the owner of the information, the inclusion of media links or the previous activity of the poster will accelerate or slow down its spread.

5.4.2 Burstiness

The effect of burstiness in Social Media systems has not been widely studied since the boom of such networks. The burstiness of a post in a Social Media network may depend on a series of factors.

The first factor is the set of statistical properties that define the microscopic evolution of a social network. These properties tend to vary between networks, but according to the authors of [34] it is possible to generalize and extract a few properties, common to all Social Media networks, of how new nodes are added to them. The authors determined that most new edges span short distances, that is, they connect a person to close acquaintances or relatives.

Another influential factor of burstiness is the dynamics of edge creation inside the network. The fact that link creation is a bursty process, that is, not homogeneous in time, was demonstrated in [35]. For each user, the authors created an event time series in which an event is the creation of an edge. The assumption is that if the edge creation process is homogeneous [...]

The outcome of the experiment, with the averaged values for the different time series, appears in figure 8:

Figure 8: Schematic representation of speed in a news item spread [35]

Figure 8 shows a higher probability when the variable takes values close to 1. The distribution also has a long tail, which is related to the age of the node: the more time has passed since the node was created, the less bursty it will be.

Other factors that may affect burstiness, and that also influence speed as explained above, are those related to the user, such as popularity, number of related users or the content of the message. In particular Social Networks such as Twitter, the number of people who follow or unfollow the user who creates or shares the message also affects burstiness.

Some interesting findings about burstiness in Twitter networks were made by Seth A. Myers and Jure Leskovec in [36]. The first is that the users who follow another user during a burst tend to be more similar to that user than those who follow outside a burst, where similarity is based on the TF-IDF weighted word vectors of the two users' aggregated tweet documents. This suggests that users become more related during a burst. This does not necessarily hold for the followers of big News Media agencies with a wide range of users, as those followers may differ strongly from most of the messages of the agency.
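To make the similarity notion concrete, the sketch below computes TF-IDF weighted word vectors for a few aggregated tweet documents and compares them. It is only an illustration and not the procedure of [36]: the use of cosine similarity, the whitespace tokenization, the smoothed IDF formula and all function names are assumptions made here.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF (an assumption) keeps shared terms from vanishing.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy aggregated tweet documents for three hypothetical users.
user_docs = [
    "goal goal football",
    "goal football match",
    "stock market report",
]
vecs = tfidf_vectors(user_docs)
sim_sports_pair = cosine_similarity(vecs[0], vecs[1])
sim_unrelated_pair = cosine_similarity(vecs[0], vecs[2])
```

Under these assumptions, the two users who write about football obtain a positive similarity, while the pair with no shared vocabulary obtains a similarity of zero.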

Another fact is that not only the topic of the message affects burstiness, but even the usage of certain words might lead to a burst of follows or unfollows. Emotive words related to an event could lead to a burst of messages and follows and, on the other hand, keywords such as ”free”, ”sale” or ”download” increase the probability of an unfollow burst.

5.5 General equation

After defining the metrics and parameters involved in the model, it is time to explain how they interact, so that a general equation for the model can be defined. This general equation appears below in equation 1:

$$(Sp_{spread}, B_{spread}) \leftarrow \mathrm{Spread}(S, T) \tag{1}$$

where S is the scope, T the topic, Sp the speed and B the burstiness. As explained before, the model returns the speed and burstiness of the spread as outcomes.

5.5.1 General speed equation

In physics, speed is the magnitude of the velocity vector, which represents the displacement of an object per unit of time. In our case we use speed to measure how fast the news item spreads over an area, regardless of how big this area is. Given a set of related timestamps, we can order them and obtain a sequence of events. As we also know the location from which each tweet is sent, we can measure the physical distance between the position of the tweet and that of the retweet, and therefore obtain the speed for each time interval, as shown in figure 9:

Figure 9: Representation of speed in a news item spread

Measuring the shortest geographical distance between two points based on their longitude and latitude is not a trivial task. As longitude and latitude indicate positions on a sphere, the haversine distance is needed to compute the distance that separates them, together with the radius of the sphere, in this case the Earth. This is illustrated in figure 10.

Figure 10: Graphical representation of the haversine distance [37]

With this in mind, the haversine distance d between two points on the surface of the Earth can be computed by applying equation 2:

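$$
\begin{aligned}
d &= R \cdot c, \\
c &= 2 \cdot \operatorname{atan2}\left(\sqrt{a}, \sqrt{1 - a}\right), \\
a &= \sin^2(\Delta\varphi / 2) + \cos\varphi_1 \cdot \cos\varphi_2 \cdot \sin^2(\Delta\lambda / 2)
\end{aligned}
\tag{2}
$$

for point p1 = (φ1, λ1) and point p2 = (φ2, λ2), with φ latitude and λ longitude, and R being the radius of the Earth, 6,371 kilometers.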
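As an illustration, equation 2 translates almost directly into Python. The sketch below assumes that coordinates are given in decimal degrees and converts them to radians first; the function name `haversine_km` and the clamping of a to at most 1 (a guard against floating-point rounding) are choices made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # radius R used in equation 2


def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in kilometers between two points.

    Latitudes and longitudes are assumed to be in decimal degrees.
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)

    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    a = min(1.0, a)  # guard against rounding slightly above 1
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return radius_km * c


# Two antipodal points on the equator are half a circumference apart.
half_circumference_km = haversine_km(0.0, 0.0, 0.0, 180.0)
```

The distance is returned in kilometers; multiplying by 1000 gives meters when the speed is to be expressed in m/s.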

The last step is to return the Weighted Arithmetic Mean of the speed as the final outcome. We choose this metric because the speed of the spread changes over time, so we weigh each section according to its duration relative to the total timelapse. To do this we apply equation 3:

$$Sp = \frac{\sum_{i=1}^{n} w_i \, Sp_i}{\sum_{i=1}^{n} w_i} \tag{3}$$

where Sp_i are the speed values of each section of the spread and w_i is the weight, or relative duration, of the section with respect to the total time of the spread.
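The following sketch shows how equation 3 could be evaluated for one spread. It assumes that the timestamps are sorted and expressed in seconds, that the distance of each section has already been computed in meters, and that sections of zero duration are skipped; these choices and the function names are illustrative assumptions.

```python
def weighted_mean(values, weights):
    """Weighted Arithmetic Mean as in equation 3."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(w * v for w, v in zip(weights, values)) / total_weight


def mean_spread_speed(timestamps_s, distances_m):
    """Average speed (m/s) of a spread.

    timestamps_s: sorted message times in seconds (n + 1 values).
    distances_m:  distance in meters covered in each of the n sections.
    """
    if len(distances_m) != len(timestamps_s) - 1:
        raise ValueError("need exactly one distance per section")
    total_time = timestamps_s[-1] - timestamps_s[0]
    speeds, weights = [], []
    for i, distance in enumerate(distances_m):
        duration = timestamps_s[i + 1] - timestamps_s[i]
        if duration <= 0:
            continue  # assumption: sections without duration are ignored
        speeds.append(distance / duration)    # Sp_i
        weights.append(duration / total_time)  # w_i, relative duration
    return weighted_mean(speeds, weights)


# Two sections: 100 m in 10 s (10 m/s) and 100 m in 20 s (5 m/s).
example_speed = mean_spread_speed([0, 10, 30], [100.0, 100.0])
```

With weights proportional to the section durations and no section of zero duration, the weighted mean equals the total distance divided by the total time, here 200 m over 30 s.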

5.5.2 General burstiness equation

The other outcome of the model is burstiness, that is, the increases and decreases in the activity of an event over time. In the model, burstiness measures the potential number of people that have had access to the news item over time. Therefore, the magnitudes to relate are population and time, as shown in figure 11. The population often increases with time in a non-uniform way, but not always. Thus, to obtain the burstiness B_i in an interval i, that is, the increase of the population in that interval, we need an approximation to the derivative of the population p with respect to the time t, as appears in equation 4:

$$B_i = \frac{\partial p}{\partial t} \tag{4}$$

Figure 11: Representation of burstiness in a news item spread

Finally, as in the case of speed, we compute the Weighted Arithmetic Mean of the partial burstiness of all the intervals of the spread. To obtain the final burstiness B we apply equation 5:

$$B = \frac{\sum_{i=1}^{n} w_i \, B_i}{\sum_{i=1}^{n} w_i} \tag{5}$$

where B_i are the partial burstiness values of each interval and w_i is the weight of each interval, that is, the fraction of the total timelapse that the interval lasts.
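A possible way to evaluate the burstiness equations is to approximate the derivative of the population by a finite difference per interval and then average the results. The sketch below assumes that the cumulative potential population reached is known at every timestamp; the finite-difference approximation, the handling of zero-length intervals and the function names are assumptions made for this example.

```python
def partial_burstiness(timestamps_s, reached_population):
    """Finite-difference approximation of dp/dt for each interval.

    timestamps_s:       sorted message times in seconds.
    reached_population: cumulative potential population at each timestamp.
    Returns (values, durations) for the intervals with positive duration.
    """
    if len(timestamps_s) != len(reached_population):
        raise ValueError("need one population value per timestamp")
    values, durations = [], []
    for i in range(1, len(timestamps_s)):
        duration = timestamps_s[i] - timestamps_s[i - 1]
        if duration <= 0:
            continue  # assumption: intervals without duration are ignored
        increase = reached_population[i] - reached_population[i - 1]
        values.append(increase / duration)  # B_i, people per second
        durations.append(duration)
    return values, durations


def mean_burstiness(timestamps_s, reached_population):
    """Weighted Arithmetic Mean of the partial burstiness values."""
    values, durations = partial_burstiness(timestamps_s, reached_population)
    total_time = sum(durations)
    if total_time == 0:
        raise ValueError("no interval with positive duration")
    weights = [duration / total_time for duration in durations]  # w_i
    return sum(w * b for w, b in zip(weights, values)) / sum(weights)


# 1000 people reached in the first 10 s, 500 more in the next 20 s.
example_values, _ = partial_burstiness([0, 10, 30], [0, 1000, 1500])
example_burstiness = mean_burstiness([0, 10, 30], [0, 1000, 1500])
```

In this toy case the first interval is four times as bursty as the second one, and the weighted mean corresponds to 1500 people reached over 30 seconds.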

5.6 Measurements

To measure the two outcomes of the model, speed and burstiness, it is first necessary to define the units in which they are expressed. For the speed at which a news item spreads through the network we use meters per second (m/s), the unit defined by the International System of Units as distance in meters divided by time in seconds. For burstiness, we use the approximation of the derivative of the population with respect to time. For each step in the spread, that is, for each new message added, a new partial burstiness and speed are computed, together with an average value of both metrics obtained by taking the Weighted Arithmetic Mean over all the connections created up to that point. How these metrics are obtained is shown in figure 12:

6 Results

To evaluate which type of information propagates the fastest and bursts the most, we crawled Twitter for a total of 3 days, from August 4th to August 6th 2014. As we were not interested in any specific topic, we used the generic keyword "the" to make sure that the maximum number of tweets was processed. Within the constraints of Twitter crawl limits this resulted in a dataset of 176911 tweets. Our experiments require tweets with attached images along with some type of geographical information, such as the user location or the coordinates from which the message was sent. Out of the 176911 tweets, 27709 contain an image and 12955 have a valid geolocation. Applying both constraints leaves 1604 tweets, which are the starting point for our retweet analysis. Tweets without retweets all have a speed and burstiness of 0.0, so we are left with 1604 tweets that contain both a geolocation and an image and that have been retweeted at least once.
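The selection criteria above amount to a simple filter over the crawled tweets. The sketch below illustrates it on plain dictionaries; the field names `image_url`, `coordinates`, `user_location` and `retweet_count` are hypothetical and do not refer to the actual structure returned by the Twitter API or to the code used in this work.

```python
def select_spread_candidates(tweets):
    """Keep tweets with an image, a geolocation and at least one retweet."""
    selected = []
    for tweet in tweets:
        has_image = bool(tweet.get("image_url"))
        # Either exact coordinates or a user-provided location is accepted.
        has_geo = (tweet.get("coordinates") is not None
                   or bool(tweet.get("user_location")))
        was_retweeted = tweet.get("retweet_count", 0) >= 1
        if has_image and has_geo and was_retweeted:
            selected.append(tweet)
    return selected


# Made-up records, only to show which combinations pass the filter.
sample_tweets = [
    {"id": 1, "image_url": "img1", "coordinates": (52.37, 4.89), "retweet_count": 3},
    {"id": 2, "image_url": "img2", "user_location": "Amsterdam", "retweet_count": 0},
    {"id": 3, "coordinates": (40.71, -74.0), "retweet_count": 5},
    {"id": 4, "image_url": "img4", "user_location": "Madrid", "retweet_count": 1},
]
selected_ids = [tweet["id"] for tweet in select_spread_candidates(sample_tweets)]
```

Only the first and last records satisfy all three conditions: the second one was never retweeted and the third one has no image.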

The 1604 tweets comprise 36 independent spreads. For the analysis we need ground truth, so we manually labeled each spread according to two parameters. The first is the topic of the news item, which defines what it is about. There are five topic labels:

• Sports: all types of sport events.

• Economy: economic news items.

• Society: fashion, style and celebrity related information.

• Entertainment: jokes, graphic art and fun posts.

• Politics: news items related to political information.

The other input is scope, the geographical extent of the item. This parameter is labeled as follows:

• Friends: information shared among users that belong to a group of friends.

• Local: local news items.

• National: information with scope involving a nation.

• International: worldwide news items.

6.1 Speed

To support our reasoning, we show and comment on a series of plots with the outcomes divided as explained above, followed by a final summary. As the time each item needs to spread differs, time is normalized.
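The text does not detail how time is normalized; one common option, shown here purely as an assumption, is to rescale the timestamps of each spread linearly to the interval [0, 1], so that spreads of different duration can be plotted on the same axis. The function name `normalize_times` is hypothetical.

```python
def normalize_times(timestamps):
    """Rescale the timestamps of one spread linearly to [0, 1]."""
    start = min(timestamps)
    span = max(timestamps) - start
    if span == 0:
        # A spread whose messages share one timestamp collapses to 0.
        return [0.0 for _ in timestamps]
    return [(t - start) / span for t in timestamps]


normalized = normalize_times([10, 20, 30])
```

After this step the first message of every spread is at time 0 and the last one at time 1, regardless of how long the spread lasted.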

Figure 13 shows the speed of the different news items by topic of the image itself. The items with the highest speed are those related to Entertainment, with one in particular that stands out due to its success. Politics, Society and Sports related items show an average speed, whereas news items on Economy have the lowest speed.

Figure 13: Instantaneous speed per topic for Entertainment (a), Politics (b), Society (c), Sports (d) and Economy (e)

Figure 14 shows the spread graphs of all 36 spreads in the dataset. It shows that Entertainment news items have a high speed partly because one of them has a high number of retweets and worldwide coverage, which produces a high speed. It is also interesting to see how speed starts high and decreases over time. Sports and Society news items have a good start and progressively slow down, while Economy and Politics spreads are not fast at any point of the spread. The latter is because the dataset contains no big Politics event; for such an event it would be reasonable to expect a high speed, as is usual in global events:

We now comment on the results by scope of the news items, displayed in figure 15. The speed of news items whose scope is a group of friends is not especially high, but reasonable taking into account the number of people involved in this type of spread. This happens because nowadays it is normal to use Social Media networks to communicate with friends who live abroad, which produces this medium average speed. Locally scoped news items have an especially low speed compared with the other scopes. For national and international items, the bigger the scope, the higher both the speed and the number of messages per spread. International items are the fastest of all scopes, also because this type of news has the highest potential number of users all around the world, which affects the speed of the related spreads.

Figure 15: Instantaneous speed for friends list (a), local (b), national (c) and international (d) scope

Figure 16 shows how the news items behave according to their scope. International news items are clearly the ones that spread the fastest, but there are also some national events with a high speed. We observe a proportionality between scope and speed: the larger the extent of the scope, the higher the speed. The exception is the friends group, which may be explained by the fact that nowadays it is possible to have friends all around the world, making this type of news item fast in comparison with local ones, which are circumscribed to a small area:

6.2 Burstiness

Figure 17 displays the burstiness of the items in the dataset by topic. Entertainment related news items show a burstiness during their spread which is not very high. The same applies to news items related to Politics: their burstiness is not especially high either, but it peaks at the end of their spreads, possibly because they reached a highly populated area with a high number of potential viewers. Among the Society related news items, only one shows a medium burstiness, the rest being low. All the news items related to Sports have a high burstiness because of the nature of these events, which burst strongly at the moment they happen and end fast, such as a goal in a football match. Finally, Economy related items have especially low values, suggesting that this type of news is not very popular in Social Media networks:

Figure 17: Instantaneous burstiness per topic for Entertainment (a), Politics (b), Society (c), Sports (d) and Economy (e)

All the spreads in the dataset appear in figure 18. Here we observe how Sports related news items have a high burst at the beginning of their spread and keep it high throughout, while the remaining topics have a low burstiness during their whole spread:

Figure 18: Instantaneous burstiness per topic and spread

Finally, we comment on the results of burstiness by scope, displayed in figure 19. The burstiness of spreads within groups of friends is the lowest of all scopes because, by definition, the potential number of users in a friend group is the lowest. The burstiness of local news spreads is slightly higher than that of groups of friends, but significantly lower than that of national and international news items. Nationally scoped news items exhibit a higher burstiness than friends and local items, with one particular item that has an outstanding start and decreases over time. Internationally scoped items have the highest burstiness, again with one outstanding item that maintains a high burstiness during its whole spread process:

Figure 19: Instantaneous burstiness for friends list (a), local (b), national (c) and international (d) scope

Burstiness is measured per scope label in figure 20. As in the case of speed, burstiness is directly affected by the scope of the news item: international spreads have the highest burstiness potential, which decreases towards the local and friend group scopes. It is also interesting to observe that the burstiness of international news items remains high during the whole spread because, by definition, the potential of international news is higher than any other, as their potential population is the whole globe. We also see how friends group and local news behave with respect to burstiness. Friends group burstiness is the lowest, as the potential population is smaller than a city:

6.3 Comparison of speed and burstiness

Figure 21 compares the speed and burstiness of the labeled spreads by topic. Sports related spreads show a high burstiness due to the initial burst they all have:

Figure 21: Compared speed versus burstiness per topic

Figure 22 compares speed and burstiness for all the spreads. Sports news items have in general a high burstiness and a medium speed, while Entertainment related items show the highest speed:

The comparison of speed and burstiness per scope label appears in figure 23. News items whose scope is international have both the highest speed and the highest burstiness because, due to their nature, such news spreads worldwide and tends to be bursty. News items within groups of friends are also fast, as it is common to have friends all around the world who comment on or share the item. Finally, national news is bursty and fast to a lesser degree than international news, while local news is the least fast and bursty, as its geographical area is small:

Figure 23: Compared speed versus burstiness per scope

Figure 24 compares all the spreads in the dataset by scope. Although a few nationally scoped spreads show a high burstiness, the average of the label exhibits a lower burstiness, as shown in figure 23. On the other hand, both national and international news items have the highest speeds, but overall the international ones are the fastest and the most widely propagated:

As the last parameter to determine which type of news is the most successful, we measure the average number of items per spread. The results by topic are shown in table 4. The Entertainment related spreads stand out: users clearly tend to share this type of content the most:

Entertainment   Politics   Society   Economy   Sports
58              4          6         3         8

Table 4: Average number of elements in a spread based on topic

In table 5 the same comparison is done based on scope. International news items are on average the most shared, while the other scopes seem to have a similar sharing rate:

Friends   Local   National   International
4         7       5          72

Table 5: Average number of elements in a spread based on scope
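The averages in tables 4 and 5 can be obtained by grouping the spreads by label and dividing the total number of items by the number of spreads. The sketch below shows this computation on made-up values; the function name `average_spread_size` and the sample data are illustrative and do not reproduce the dataset.

```python
from collections import defaultdict


def average_spread_size(spreads):
    """Average number of items per spread for each label.

    spreads: iterable of (label, number_of_items) pairs, one per spread.
    """
    totals = defaultdict(lambda: [0, 0])  # label -> [item sum, spread count]
    for label, size in spreads:
        totals[label][0] += size
        totals[label][1] += 1
    return {label: item_sum / count for label, (item_sum, count) in totals.items()}


# Made-up spreads, not taken from the dataset.
toy_spreads = [("Local", 4), ("Local", 10), ("National", 5)]
toy_averages = average_spread_size(toy_spreads)
```

The same function can be applied to topic labels or to scope labels, since it only depends on the label attached to each spread.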

With all this information we can conclude that the news items that spread the fastest in Social Media networks are mainly related to entertainment, jokes and Internet memes. Sports news also seems to be very popular and highly bursty. We can also conclude that there is a direct relation between the scope and the success of a news item: the wider the scope, the more successful the item. To obtain more reliable results it would be useful to have access to a bigger dataset, as the one used here contains only 36 labeled spreads.

7 Visualization

7.1 Speed and burstiness in the visualization

The main layout of the visualization shows a world map and a graph representing the spread of the currently selected news item, along with other elements such as a slidebar, an image selector and several statistics panels. Figure 25 shows this main layout:

Figure 25: Main visualization layout

Speed and burstiness are displayed in the visualization as two line graphs, each showing the current value of the metric and its averaged value up to the selected timestamp. When hovering over one of these lines, a popup with the value of the metric is shown on the right of the line graph. These graphs are shown in figure 26:

Figure 26: Compared speed versus burstiness per scope and spread

A panel with the current and averaged speed and burstiness is also shown at the top of the layout. These values change when moving the slidebar placed at the bottom of the visualization. This statistics panel appears in figure 27:

7.2 Additional elements

Besides the elements that provide insight into how speed and burstiness are measured, the visualization contains some auxiliary visual components that provide a better understanding of the spread.

The main component of the layout is the graph that represents the spread process, in which a series of nodes share the information uploaded by an initial user, as shown in figure 25. The graph contains two types of nodes: those that contain only one message and those that contain more than one. When hovering over a node that contains one message, information related to that particular message is shown, as in figure 28:

Figure 28: Popup shown when hovering over a single node

On the other hand, when hovering over a node with more than one message, a popup with the number of messages currently clustered in the node appears, as shown in figure 29. After clicking on such a node, a list containing information about all the messages grouped in the selected node appears, also shown in figure 29:

Another important element of the visualization is an image selector that switches between the different spreads based on the shared image of each one. After clicking on "previous" or "next", a new image is loaded and the graph is reloaded with the spread of the new image. This image selector appears in figure 30:


8 Conclusion

With the outcomes obtained it is now possible to give reasonable answers to the questions proposed. Although there is no "success formula" in News Media networks, it seems reasonable to state that the contents with the highest chances of going viral are those based on entertainment and those that represent graphical jokes. It is also possible to determine that, apart from worldwide events, topics such as politics or economy tend to produce only a modest burst and do not seem to be especially popular among users of Social Media networks.

Regarding scope and speed, we have found a direct relation between the two: news items with international scope have the highest intrinsic speed and local ones the lowest. It is also especially interesting that spreads shared by a small group of users who form a group of friends tend to be faster than local news items. People tend to have friends (or followers) all around the world, whereas locally scoped news has a limited range of action, which makes it slow compared with the other groups.

Concerning burstiness, sports-related news items show a high burstiness, mainly due to their instantaneous nature, as most of the time they are based on a single, momentary event such as a goal. Entertainment-related items also have a high burstiness potential, in part due to the high average number of tweets for this particular type of news. Out of all the spreads analyzed, the ones related to politics and economy maintained a low burstiness potential, mainly because this type of information does not receive much attention from users except when a global event occurs.

Finally, we also found a relation between scope and burstiness: news items with international scope have the highest burstiness, and those scoped to a group of friends have the lowest burstiness potential.

We also measured the number of shares for each of the labels, and it was clear that both internationally scoped and entertainment-related news items have an overwhelming success based on their shares. This is partly due to the effect of "outliers", particularly thriving spreads with hundreds of shares. We considered removing them but eventually decided to keep them, because they also returned especially interesting information on how successful posts are shared in networks and how they may affect other spreads.

All in all, according to these results, we may conclude that the success of a news item depends on its scope and on its topic. The broader the scope, the more chances of success the item has. As for the topic, an item about entertainment has a high chance of becoming viral. The same holds in part for sports news items, although their burstiness potential is limited in time.

Three points could be improved in the future. The main one is that, due to the difficulty of crawling useful data, the dataset is small, so the appearance of outliers affects the outcome of the experiment. It would therefore be interesting to repeat the test with a higher number of news spreads. The second is an error we found in the results returned by the gazetteer, which sometimes does not return the right location because several places share the same name. The third is that it would be interesting to analyze the content of the images used in order to obtain additional information about the image itself, to analyze sets of spreads based on the content of the images shared, or even to automate the process of data labeling.

We consider that the information obtained may be used in the future by others to build on top of it, as the results retrieved may be extended in different ways. It provides some basic information that can also be used in further experiments, and it opens a new field on how to apply image data analysis to Social Media networks.


References

[1] http://www.merriam-webster.com/dictionary/insight.

[2] H. Haddadi K. Gummadi M. Cha, F. Benevenuto. The world of connections and information flow in twitter. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions, 2012.

[3] A.T. Stephen K. Wilcox. Are close friends the enemy? online social networks, self-esteem, and self-control. Journal of Consumer Research, Inc., 2012.

[4] P. A. Dow J. Cheng, L. A. Adamic. Can cascades be predicted? WWW ’14 Proceedings of the 23rd international conference on World wide web, 2014.

[5] C. Brock A. George M. Planck, I.L. Pollard. Initial indicators of topic success in twitter: Using topology entropy to predict the success of twitter hashtags. Network Science Workshop (NSW), 2013 IEEE 2nd, 2013.

[6] Y. Blanc. Yoan blanc’s twitter network visualization. http://yoan.dosimple.ch/blog/2007/05/17, 2007.

[7] B. Arikan. Growth of a twitter graph. http://burak-arikan.com/growth-of-a-twitter-graph, 2008.

[8] Tweetping. http://tweetping.net.

[9] A world of tweets. http://aworldoftweets.frogdesign.com.

[10] https://maps.google.com.

[11] http://www.openstreetmap.org.

[12] J. Thorp. Just landed. http://blog.blprnt.com/blog/blprnt/just-landed-processing-twitter-metacarta-hidden-data, 2009.

[13] https://www.mapbox.com/labs/twitter-gnip/languages.

[14] N. Kallus. Predicting crowd behavior with big public data. Proceedings of the companion publication of the 23rd international conference on World wide web companion, 2014.

[15] F. Rojas. How twitter can help predict an election. The Washington Post, 2013.

[16] Mentionmapp. http://mentionmapp.com.

[17] D. Troy. Twittervision. http://twittervision.com, 2007.

[18] Tweepsmap. http://tweepsmap.com.

[19] L. Lui B. Xu. Information diffusion through online social networks. Emergency Management and Management Sciences (ICEMMS), 2010 IEEE International Conference, 2010.

[20] E. Hwang D. Kim, S. Rho. Detecting trend and bursty keywords using characteristics of twitter stream data. International Journal of Smart Home, 2013.


[23] Mapsdata. http://mapsdata.co.uk.

[24] Trendsmap. http://trendsmap.com.

[25] D. Skau W. Ribarsky M.X. Zhou W. Dou, X. Wang. Leadline: Interactive visual analysis of text data through event identification and exploration. Visual Analytics Science and Technology (VAST), 2012 IEEE Conference, 2012.

[26] M. Kitsuregawa M. Itoh, M. Toyoda. Visualizing time-varying topics via images and texts for inter-media analysis. Information Visualisation (IV), 2013 17th International Conference, 2013.

[27] http://www.geonames.org.

[28] https://dev.twitter.com/docs/platform-objects/tweets.

[29] L. Tabourier R. Lambiotte. Burstiness and spreading on temporal networks. The European Physical Journal B, 2013.

[30] J. Leskovec S. Myers, C. Zhu. Information diffusion and external influence in networks. KDD ’12 Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, 2012.

[31] S. Counts J. Yang. Predicting the speed, scale, and range of information diffusion in twitter. Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, 2010.

[32] W. Feng S. Ayyorgun. A deterministic definition of burstiness for network traffic characterization. International Conference on Computer Communications and Networks, 2004.

[33] D. Oakes D.R. Cox. Analysis of Survival Data. Chapman Hall, 1984.

[34] R. Kumar A. Tomkins J. Leskovec, L. Backstrom. Microscopic evolution of social networks. KDD ’08 Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 2008.

[35] G. P. Rossi A. Sala X. Wang H. Zheng B. Y. Zhao S. Gaito, M. Zignani. On the bursty evolution of online social networks. HotSocial ’12 Proceedings of the First ACM International Workshop on Hot Topics on Interdisciplinary Social Networks Research, 2012.

[36] J. Leskovec S. Myers. The bursty dynamics of the twitter information network. WWW ’14 Proceedings of the 23rd international conference on World wide web, 2014.


Contents

1 Abstract
2 Introduction
3 Related work
3.1 Data analysis
3.2 Visualization
4 Data analysis
4.1 Twitter
4.2 Images
5 Model
5.1 Assumptions
5.2 Metrics and parameters
5.3 Network diffusion
5.4 Burstiness and speed
5.5 General equation
5.6 Measurements
6 Results
6.1 Speed
6.2 Burstiness
6.3 Comparison of speed and burstiness
7 Visualization
7.1 Speed and burstiness in the visualization
7.2 Additional elements
8 Conclusion
References