Exploring Entity Propagation Across Languages in Online News and Twitter


A thesis submitted in fulfillment of the requirements for the degree of

Master of Information Studies (track: Human Centered Multimedia)

at University of Amsterdam

by Selvi (Ragaselvi) Ratnasingam (Student Number: 10441069).


Supervisor: Manos Tsagkias

Second Reader: Tom Kenter


Selvi (Ragaselvi) Ratnasingam

Dept. Science, University of Amsterdam. Amsterdam, The Netherlands

ragaselvi@gmail.com

ABSTRACT

Even though the internet is multilingual and widely accessible, information is still fragmented due to differences in languages. A deep understanding of the relationship between languages can be useful in marketing and can enhance the insight into how language communities are connected with each other. Our paper describes a frequency-based model for analyzing entity (i.e. an organization or an individual) propagation over multiple languages in online news and Twitter. A novel aspect of our model is that it is capable of detecting the dynamics of information diffusion across languages without looking at semantic overlaps between news articles or tweets. Our in-depth analysis reveals that a popular foreign language is a strong constraint on the adoption of information from another language, both in online news and Twitter. Moreover, we detect the influence and interest one language community has on other language communities, such as the impact of Spanish on Italian and the interest of Dutch people towards the English language. Most of the findings presented in this paper are in accordance with the results of prior studies.

Categories and Subject Descriptors

[H.3.1 Information Storage and Retrieval]: Content Analysis and Indexing

General Terms

Experimentation, Human Factors, Measurement

Keywords

Information propagation, languages, online news, social media, Twitter

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Copyright ACM ...$10.00

1. INTRODUCTION

Even though the internet is multilingual and widely accessible, language serves as a barrier to information diffusion [11, 22]. Cross-lingual awareness may help us connect the segmented information spheres. Moreover, understanding the interlinking between languages allows us to propagate information efficiently to the target group. In other words, cross-lingual awareness can help stakeholders find languages that can affect people beyond their linguistic similarity. Unfortunately, only a few studies [13, 14, 23] have examined the impact that one language has on another. Therefore, we are interested in examining the cross-lingual influence in online news and Twitter and how the two media differ in spreading information, especially in spreading information to other language communities. The popularity of micro-blogging platforms, like Twitter [7], is very dynamic and reflects news-related events. The majority of social media messages are inspired by news articles [6], and even news articles are influenced by what is frequently discussed in social media. The reason for choosing these two media is that the first is derived from traditional print media and the second is a modern communication tool that is very popular among all categories of people, from professionals to ordinary citizens. By examining both media, we can obtain meaningful insight into the differences in information propagation between online news, which is shaped only by professionals, and Twitter, where citizens are also involved in producing information.

In this paper, we define and study the problem of analyzing information propagation of online news and Twitter across multiple languages. Our analysis of the data stream consists of studying how information about a particular entity (i.e. an organization or an individual) spreads through multiple languages and how the information flow differs between news media and Twitter. We have entities, e.g. music artists and car brands, that are being discussed in the news and in social media in different languages. Usually the reason they are being discussed is that an event is happening in which the particular entity is involved: for example, a band is on tour and is giving concerts around Europe, or a car manufacturer releases a new model. Every time an event occurs, it is likely to be picked up by the news and/or social media [1]. The parts of the world to which the event is most “relevant” are more likely to pick up the discussion of the event, in terms of published news articles or tweets.

This paper does not investigate the underlying topics associated with the information flow of every entity. For example, a news article in Italian may describe the release of a new blend by the entity “Johnnie Walker”, while at the same time a news article in English discusses the possible winner of the Johnnie Walker golf tournament. In both cases, the news articles are related to the entity “Johnnie Walker”, but they are on different topics. The main reason for not investigating the topics is that the available machine translation libraries (in the four examined languages) are not capable of translating the texts to a decent level. Without translating the texts, we cannot ensure that the items in different languages share a common topic. The section on methods describes how we wanted to examine the topics and explains the issues in mining topics in more detail.

Although we found that it is not feasible to investigate the topic of news articles or tweets in multiple languages, we are interested in exploring the feasibility of our model in detecting the relationship between the languages in information propagation, without looking at the semantic overlap between the news articles or tweets. In this paper, we propose a frequency-based model that can identify the pattern of diffusion by investigating the changes in entity popularity in different languages. One way of measuring an entity’s popularity is by considering the frequency of published news articles or tweets. For example, the popularity of a YouTube video can be calculated by counting the number of unique views. Likewise, we use the frequency of published news articles or tweets to measure the popularity of an entity.

In this study, we aim to answer the following research questions:

RQ1: How does the pattern of diffusion differ per language? Given a specific language, which language community is most influenced by its activity?

RQ2: Is there a relationship between the entity’s (i.e. an organization or an individual) origin and its pattern of diffusion?

RQ3: How does the persistence of the language vary by distinct entity categories? The term persistence in this context indicates the dominance one language has on a specific entity, in terms of the number of published news articles or tweets, compared to other languages.

The uniqueness of this study lies in the fact that we provide a flexible and extensible model for automatic analysis of cross-lingual influence, without looking at the semantic overlap between the news articles or tweets. The findings of this paper can reveal interesting associations and relationships between the chosen language communities. The new model described in this paper is demonstrated by investigating four languages (English, Dutch, Italian and Spanish), but can be extended to any number of languages with minimal or no effect on the model.

The remainder of this report is organized as follows. The next section provides details on academic research work in related areas. In Section 3, we define the terminology used in this study and describe our dataset. Section 4 discusses the research methodology. In Sections 5, 6 and 7, we report our results and findings related to our three research questions. In Section 8, we introduce a new topic (“characterizing the entities using time series”) based on our qualitative observations from the previous sections. The last section concludes the paper and outlines potential future directions of this research.

2. RELATED WORK

Our work is related to two main research areas: (i) examining information propagation in news media and social media and (ii) analyzing the geographical and linguistic contribution to information diffusion.

2.1 Online news and social media

A number of previous studies have examined the responses to a news article in social media [10, 16, 17, 18]. Each of these analyses examined various aspects of the interaction between news and tweets. We now discuss them one by one. The work by Lee et al. [16] examined the participation of individuals in news production and diffusion. Similarly, Sankaranarayanan et al. [18] demonstrated how to capture tweets that correspond to a specific news article. Bandari et al. [17] dealt with predicting online popularity based on early popularity. Tsagkias et al. [10] analyzed social media utterances that implicitly reference a news article and not the news event discussed in the article. In all these studies, the authors proposed solutions for dealing with noisy data and for tracking news and tweets about a specific topic. In our work, the data are already collected and classified into the selected topics, so tracking is not relevant for us. This study is predominantly concerned with information diffusion across multiple languages and tries to understand the interaction of news and tweets through multiple languages.

The work in Meme-tracking and the Dynamics of the News Cycle [4] aims to solve the problem of tracking memes as they evolve over time and spread across the web. It studies the mutational variants of the news cycle as they evolve over time in two different ways, namely by understanding the temporal variations as a whole, and by identifying recurring patterns in the growth and decay of a meme around its time period. In particular, this study provides a quantitative analysis of the global news cycle and the dynamics of information propagation between the news media and blogs. A main difference between our work and this study on meme tracking is that we devote our work to understanding how an entity evolves through multiple languages, instead of tracking memes that can change over time.

Other notable studies in the field of predicting the responses to a news article on social media are [4, 8, 9]. A main resemblance between these studies and our study is that all try to understand the interaction of the two media involved. In our study, we give more weight to investigating the influence that one language has on another than to studying the interaction between the two media.

2.2 Geographical and linguistic influences

Hong et al. [11] and Weerkamp et al. [12] found that users of various languages use Twitter for different purposes, namely for information sharing or for conversational purposes. Hong et al. [11] presented a study on the behavioral differences exhibited by users of different languages. They studied the differences across languages based on the usage of hashtags and URLs in tweets. Specifically, they showed for each pair of languages the number of hashtags and URLs which they have in common. This information, combined with the normal distribution of URL and hashtag usage, is used to explore the similarities and differences between the languages.

Early work by Putsis et al. [13] analyzed cross-country interaction by examining how the adoption of a new product in one country affects the adoption in other countries. The main similarity with our study is that they also focus on the pattern of interaction between communities. They analyzed the sales data of products that have wide consumer appeal, such as video cassette recorders and microwave ovens. They observed that some countries, for example Germany, France and Italy, have a high external contact rate as a percentage of all contacts. This allows them to exert strong external influence, e.g. on people from Denmark, the Netherlands and Sweden. A limitation of that study is that it left out the optimal (product) introduction time and the optimal advertising and distribution support across countries. The analysis reveals that external or foreign influences affect a high percentage of sales.

Another study in the area of the geographical distribution of users across different countries [14] examined the role of offline geography in online social networks and divided 91 countries into eleven distinct groups. The grouping relies on geographic, linguistic, political and cultural groupings of countries in the offline world. Mostly, the countries in the same group are geographical neighbors or countries that share the same language. Similarly, Jurgens (2013) [15] demonstrated that, in spite of globalization, neighboring countries are more strongly correlated in social networks. Kamath et al. [19] also argued that the physical distance between locations is a determining factor in information dissemination. Compared to this work, our study presents a considerably more detailed analysis of the interlinking between languages (instead of countries), because we are interested in identifying the language in which information can be introduced first so that it stimulates later adoption in other language communities.

Table 1: Examples of entities.

A.C. Milan        Chevrolet       Fox Broadcasting Company
BlackBerry        El hormiguero   General Motors Company, Inc.
Jennifer Lopez    Linux           Metro AG

In summary, geography and language have a crucial impact on information diffusion in the online world. The two variables language and geography are intertwined: studying one of the variables implies studying the other one as well. Both geographical proximity and linguistic similarity play a significant role in the information exchange across national boundaries. Only a few of the previous studies have analyzed the role of language in information dissemination, and they did so by analyzing the geographical proximity between countries. Our aim is to fill this gap.

3. DATA AND TERMINOLOGY

This section is divided as follows. First, we briefly introduce the dataset we aim to use in this experiment and then we describe how the data are collected. Second, we define the terminology used in this paper.

Our data collection spans from July 15, 2013 to October 17, 2013, a period of 94 days (3 months and 2 days). The data collection contains two datasets: online news and public tweets. In this paper, a named entity refers to an individual or an organization; entities are also grouped into broad categories, such as automotive, football and music. By analyzing entities from distinct categories, we are able to identify differences in information diffusion between categories. In total, the datasets consist of 176 entities from RepLab (CLEF) 2013 [3] (see a few examples of entities in Table 1). RepLab (CLEF) is a competitive evaluation campaign for Online Reputation Management (ORM) systems, which focuses on the reputation of companies from various domains, such as music, automotive and banking [3]. Both datasets are specifically designed to support analysis in tracking the chosen entities.

Table 2: Statistics of the online news and Twitter datasets.

           Total     English   Dutch    Italian   Spanish
News       0.68 M    41.3%     17.6%    20.3%     20.9%
Twitter    3.85 M    84.3%     0.8%     2.1%      12.8%

The datasets of online news and public tweets provide data related to the entities in four selected languages: English (abbreviated to en for convenience), Dutch (nl), Spanish (es) and Italian (it). The four languages are among the top 10 most popular languages on Twitter [5]. Because of this, they are expected to be sufficiently representative to provide data to analyze. The balance between the languages depends on the availability of data for each of the entities used.

A news article item typically features a title, lead, body, authors, quotations and named entities as structured elements. A tweet includes metadata such as geolocation coordinates, mentions and retweet count. A text matching technique is applied to retrieve the entities related to a news article or a tweet. For news articles, the related entities have already been gathered and are available as an extra feature of the article: named entities. For the tweets, we collect the corresponding entities using the same text matching approach during processing. This study uses only the following two attributes: created date-time and named entities. A news article or tweet will be referred to as an item.

Table 2 provides the total number of news articles and tweets, as well as the percentage of items for every language in each dataset. As can be expected, Twitter accounts for the larger number of published items, because of the large number of users involved. Additionally, we can observe that English dominates in both datasets, but to an extremely high level in Twitter. One reason for this can be that there are more active Twitter users in English than in other language communities. Another explanation can be that Twitter is mostly used for personal conversations by other language communities, instead of for information sharing, which is in line with the findings reported by Weerkamp et al. [12].

4. METHODS

Studying information diffusion over multiple languages is a new field of research, unlike the many studies of entity tracking or the evolution of memes in one particular language.

One way of investigating the information flow in multiple languages is through studying the frequency of published items over time. Frequency analysis is a widely used method for analyzing temporal characteristics and exploring information diffusion [1, 19, 20]. We use the frequency of items to measure the popularity of an entity, which is comparable to measuring the popularity of a YouTube video by its view count. Thus, an analysis of frequency in multiple languages provides insight into how the popularity of an entity changes over time in multiple languages. The basic idea of our approach is to align the items related to a particular entity from different languages based on the publication time, and to discover how the popularity of an entity travels from one language to another and how it changes over time.

In this study, we apply the same methods and analysis to both datasets. For each entity, we create a time series of the number of published items per time unit and, based on this time series, we obtain the so-called sub-flows of languages. In the rest of the paper, we use the term “sub-flow” for a pair of languages connected with an arrow (e.g. it → es), which represents the order in which the languages reach the highest frequency. The retrieval of sub-flows is split into three subtasks: (i) time series generation, (ii) frequency normalization and (iii) flow of languages. In addition to the analysis of sub-flows, we also examine peaks occurring in the time series. This section also describes the method used to detect the peaks in the time series. This is followed by a discussion of the methodological limitations of our study. Finally, this section describes a possible approach for mining the underlying topics.

Time series generation. We start by retrieving the items (news articles/tweets) related to the same entity. The whole period is divided into time units of equal length T. We choose T = 1 day, corresponding to a daily analysis, which our qualitative observations showed to be sufficient to cover the dynamics in the time series. Then, we compute the frequency of published items for every time unit. In the end, we obtain the frequency of published items for all four languages, for each time unit, ordered by time.

In other words, for a language ln and an entity e we generate a time series by counting how many items have been published within each time unit of 1 day. We plot the frequency per time unit, as illustrated in Figure 1 (top).

freq(e, t, ln) = total number of items published in language ln during time unit t that are related to entity e.
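The counting step can be sketched as follows. This is a minimal illustration and not the implementation used in this study: the item representation (a tuple of creation time, language code and set of entity names) and the function name daily_frequencies are assumptions made for the example.

```python
from collections import Counter
from datetime import date, datetime

def daily_frequencies(items, entity, start, end):
    """Count items per language and per day (T = 1 day) for one entity.

    `items` is an iterable of (created, language, entities) tuples; this
    layout is hypothetical and only serves the example.
    """
    n_days = (end - start).days + 1
    counts = Counter()
    for created, language, entities in items:
        day = created.date()
        if entity in entities and start <= day <= end:
            counts[(language, (day - start).days)] += 1
    languages = sorted({language for _, language, _ in items})
    # One list per language, ordered by time; days without items count as 0.
    return {lang: [counts[(lang, d)] for d in range(n_days)] for lang in languages}

items = [
    (datetime(2013, 7, 15, 9, 30), "en", {"Amazon.com"}),
    (datetime(2013, 7, 15, 18, 0), "en", {"Amazon.com", "Linux"}),
    (datetime(2013, 7, 16, 8, 0), "nl", {"Amazon.com"}),
    (datetime(2013, 7, 17, 12, 0), "es", {"Linux"}),
]
series = daily_frequencies(items, "Amazon.com", date(2013, 7, 15), date(2013, 7, 17))
print(series)
```

Each list in the result corresponds to one language and holds the value of the frequency for consecutive days.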

In the following, we introduce the normalization used in this study.

Frequency normalization. Note that some languages are more widely used than others. For example, some languages may have more native speakers, or may be more popular among speakers of other languages.


Figure 1: Frequency over time of entity “Amazon.com” before (top) and after (bottom) normalization (in online news). From the time series, we can see that Amazon.com is quite popular in all four languages for the whole time period, but slightly more popular in English. The bottom time series is normalized on the language variable, to reduce the popularity differences between the languages.

For instance, Poblete et al. [5] point out that among the tweets collected in 69 languages, English is the most popular with nearly 53% of the tweets. Our study is purely based on the popularity of an entity in various languages. Therefore, the end results are particularly vulnerable to a language popularity bias. To minimize the impact of popularity differences between languages, we need an appropriate normalization method that makes languages directly comparable by normalizing the frequency of published items. We test z-score normalization and normalization on the language variable. The z-score is a common method for data normalization; it represents by how many standard deviations a score differs from the mean, and in which direction. We compute the z-score normalization as shown below.

\[ \mathrm{NormFreq}(e, t, ln) = \frac{\mathrm{freq}(e, t, ln) - \mu(e, ln)}{\sigma(e, ln)} \]

where freq(e, t, ln) is the total number of published items in interval t and in language ln, μ(e, ln) is the mean frequency for the given entity e and language ln, and σ(e, ln) is the standard deviation for the entity e and language ln.
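As a small worked sketch of the z-score, the function below normalizes the time series of one entity in one language. The function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not specify which one is used.

```python
from statistics import mean, pstdev

def zscore_normalize(freqs):
    """Z-score of each daily frequency for one (entity, language) series."""
    mu = mean(freqs)
    sigma = pstdev(freqs)  # population standard deviation (assumption)
    if sigma == 0:
        # A constant series carries no deviation from its mean.
        return [0.0 for _ in freqs]
    return [(f - mu) / sigma for f in freqs]

zscores = zscore_normalize([2, 4, 4, 4, 5, 5, 7, 9])
print(zscores)
```

In this toy series the mean is 5 and the standard deviation is 2, so the day with 9 items lies two standard deviations above the mean.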

For the normalization on the language variable, we divide the frequency of each time unit by the total number of items in the particular language. A larger total number of items leads to a smaller frequency value after normalizing.

\[ \mathrm{NormFreq}(e, t, ln) = \frac{\mathrm{freq}(e, t, ln)}{\mathrm{totalLn}(ln)} \]

where freq(e, t, ln) is the total number of published items in interval t and in language ln, and totalLn(ln) is the total number of items in language ln for all entities. The normalized frequency indicates how popular an entity is relative to the other entities in that language.
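The effect of normalizing on the language variable can be illustrated with a short sketch. The numbers and the function name are invented for the example; only the division by the per-language total follows the definition above.

```python
def normalize_by_language(series, language_totals):
    """Divide every daily frequency by the total number of items in its language."""
    return {
        lang: [f / language_totals[lang] for f in freqs]
        for lang, freqs in series.items()
    }

# Hypothetical daily counts for one entity and hypothetical per-language totals.
series = {"en": [40, 80], "nl": [10, 30]}
language_totals = {"en": 4000, "nl": 500}
normalized = normalize_by_language(series, language_totals)
print(normalized)
```

In the raw counts English is ahead on both days, but after normalization Dutch has the higher value on both days, because the entity takes up a larger share of all Dutch items.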

We observed that the normalization on the language variable outperforms z-score normalization in reducing the impact of popular languages. At the end of this phase, for each entity we obtain the time series along with their associated normalized frequencies and the related language. For example, Figure 1 (bottom) plots the normalized score of all four languages as a function of t.

Sub-flow of languages. In order to understand the flow of information, we build what we call the language pattern of an entity. By comparing the normalized frequencies related to the same time unit, we choose for every time unit the language with the highest normalized frequency. The languages associated with consecutive time units are joined with arrows, so that we preserve the order of occurrence (e.g. nl → nl → en → ... → es). We refer to this as the “language pattern of an entity”. Note that the language pattern of an entity may contain one language more than once, and that its length depends on the number of time units.

A sliding window of size two is used to choose the sub-flows of languages (e.g. en → nl). By collecting every sub-flow within the language pattern of every entity, we see recurring sub-flows among languages.
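The construction of a language pattern and its sub-flows can be sketched as below. The function names and the toy values are assumptions; ties between languages are resolved by dictionary order here, which is also an assumption, since the text does not say how ties are handled.

```python
from collections import Counter

def language_pattern(normalized):
    """For every time unit, pick the language with the highest normalized frequency."""
    n_units = len(next(iter(normalized.values())))
    return [
        max(normalized, key=lambda lang: normalized[lang][t])
        for t in range(n_units)
    ]

def sub_flows(pattern):
    """Slide a window of size two over the pattern and count each (from, to) pair."""
    return Counter(zip(pattern, pattern[1:]))

normalized = {
    "en": [0.1, 0.2, 0.5, 0.1],
    "nl": [0.3, 0.4, 0.2, 0.1],
    "es": [0.0, 0.1, 0.1, 0.6],
}
pattern = language_pattern(normalized)
flows = sub_flows(pattern)
print(" → ".join(pattern))
print(flows)
```

For this toy entity the pattern is nl → nl → en → es, which yields the sub-flows nl → nl, nl → en and en → es.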


Figure 2: The pink dashed line represents the threshold which we use to detect the peaks in the time series. This plot illustrates the time series of entity “Canon Inc” in online news.

Figure 3: An example where we lose the shape of the time series after the normalization, by radically smoothing the language with the highest number of items (i.e. English in this example). This happens when one language represents more than 75% of the total population of the dataset. This time series illustrates the popularity of the entity “U2” in Twitter before and after normalizing.

The analysis of sub-flows allows us to identify how often language B reaches the highest frequency value after language A. In this case, we assume that language A has contributed to the information propagation to language B. Boczkowski [27] argued that news organizations mimic each other’s coverage. So, if a specific news item is very popular, there is a higher chance that it is imitated by news agents in other languages, depending on whether the entity associated with this news is relevant in their country. People using Twitter do not necessarily watch what has been tweeted in other languages. However, when a news agent publishes a news item copied from another language, it will also get attention on Twitter. Moreover, on Twitter, multilingual individuals can help to propagate information beyond linguistic similarity [24].

Peak Detection. In addition to the flow of patterns, we seek to capture the peaks in the time series. For detecting peaks, we follow the recommended method for identifying periodicities and bursts in time series [25]. The approach described in [25] relies on the computation of a threshold, which is then applied to find the peaks efficiently. The threshold is calculated as follows:

\[ \mathrm{threshold}(e) = \mu(e) + x \cdot \sigma(e) \]

where μ(e) is the mean of the normalized frequency for entity e and σ(e) is the corresponding standard deviation for that entity. A typical value for x is 1.5. Figure 2 illustrates the time series of entity “Canon Inc” and the threshold corresponding to this particular entity. Having calculated the threshold, we are interested in the region above it. By examining the region above the threshold, we find the prominent peaks occurring in a time series.
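A minimal sketch of this thresholding is given below. The function name is hypothetical, the population standard deviation is an assumption, and the sketch simply reports every time unit above the threshold; the method in [25] may treat bursts in more detail.

```python
from statistics import mean, pstdev

def detect_peaks(freqs, x=1.5):
    """Return the threshold mean + x * std and the time units that exceed it."""
    threshold = mean(freqs) + x * pstdev(freqs)  # population std (assumption)
    peaks = [t for t, f in enumerate(freqs) if f > threshold]
    return threshold, peaks

# Toy series: mean 5 and standard deviation 2, so the threshold is 5 + 1.5 * 2 = 8.
threshold, peaks = detect_peaks([2, 4, 4, 4, 5, 5, 7, 9])
print(threshold, peaks)
```

Only the last time unit (9 items) lies above the threshold of 8 and is therefore reported as a peak.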


For the analysis in the rest of this paper, we will only consider the five categories with the most entities: automotive, music, IT technology, finance, and university. The other categories have too few entities (fewer than nine) to provide meaningful insight.

Limitations. Our normalization method is used to smooth the differences in language popularity. However, if this difference is too large, i.e. if one language represents more than 75% of the total population so that it overshadows the other languages in the dataset, we radically smooth the popularity of the language with the highest number of items, meaning that we lose the shape of its time series. This has a negative effect on the results. We show an example in Figure 3.

English, the language with the highest number of tweets, by itself accounts for 84% of the total Twitter population in our dataset. Therefore, some of the results on the Twitter dataset presented in this paper may not accurately reflect the relationship between the four languages.

Mining underlying topics (in future). As noted earlier, we are not able to examine the underlying topics of the items related to a specific entity. This section explains how we wanted to examine the topics. Within the items (news articles/tweets) related to a specific entity, we search for correlated topics across multiple languages. This topic detection would allow us to analyze how a specific topic associated with a particular entity flows across different languages, which would help us to identify the interlinking between the languages more accurately.

First, we take a look at how we would like to examine these underlying topics. A possible approach for finding correlated topics across multiple languages is to first translate the items from multiple languages into one common language (e.g. into English), using a natural language library. Then, we need to find the topics occurring within the information stream of every entity, and group the items from multiple languages according to their topic overlaps. In the following, we describe the steps used to detect items related to a specific topic. We start by performing the following steps for each item related to a specific entity (using a natural language library in the specific language):

1. Tokenize the text
2. Remove stop words
3. Stemming
4. Calculate tf-idf vectors for words

Using term frequency-inverse document frequency (tf-idf) vectors, we can measure the cosine similarity [26] of two items, which indicates how similar the two items are. The cosine similarity will be higher if both items share more terms, especially if they share more uncommon terms. If the cosine similarity exceeds an empirically set threshold, the items are considered to share the same topic. At the end of this phase, we have a number of topics for each entity, and each topic has a number of published items associated with it. After this step, we continue with the same methodology as we use currently; the only difference is that the items are grouped per topic and per entity, instead of only per entity.
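The steps above can be sketched in a few lines of plain Python. This is an illustration under several assumptions and not a tested pipeline: the stop-word list is a toy list, stemming is omitted, the tokenizer only handles ASCII text, the smoothed idf variant and the threshold of 0.5 are arbitrary choices, and the example items are invented.

```python
import math
import re
from collections import Counter

# Toy stop-word list (assumption; a real list would be language-specific).
STOP_WORDS = {"a", "and", "in", "of", "the", "to"}

def tokenize(text):
    """Lower-case, keep ASCII alphanumeric tokens, drop stop words (no stemming)."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one sparse {term: weight} tf-idf vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Smoothed idf: log((1 + N) / (1 + df)) + 1, one common variant.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "Johnnie Walker releases a new blend of whisky",
    "New Johnnie Walker whisky blend released in Italy",
    "Favourite to win the Johnnie Walker golf tournament",
]
vectors = tfidf_vectors(docs)
same_topic = cosine_similarity(vectors[0], vectors[1])
other_topic = cosine_similarity(vectors[0], vectors[2])
THRESHOLD = 0.5  # empirically set in practice; arbitrary here
print(same_topic > THRESHOLD, other_topic > THRESHOLD)
```

The two items about the new blend share several uncommon terms and end up above the threshold, while the item about the golf tournament only shares the entity name and stays below it.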

Second, we describe the main problems in mining the underlying topics. There are only a few libraries available for machine translation, and the available libraries are not accurate enough to process natural language. The noisy text in tweets makes machine translation even more complex. So, to the best of our knowledge, none of the libraries can translate natural language to a decent level in all four languages. Besides the machine translation libraries, we also need a natural language library in all four languages that converts a string of text (news article/tweet) into a tf-idf vector. Only the natural language libraries for English are well developed; the others are at an early stage. Hence, the libraries for the other languages cannot provide very good quality. Another possible challenge in investigating the topics could be that too few correlated topics are detected across multiple languages to obtain meaningful insight.

5. RQ1: PATTERN DIFFUSION AMONG LANGUAGES

In order to gain insight into cross-lingual influence, we investigate the information diffusion among the languages. To do this, we count how many times a specific sub-flow (e.g. en → nl) appears in the language patterns of all entities. Table 3 provides the diffusion pattern of language pairs in both datasets. We can read the table in both directions: row-wise it shows the influence one language has on another, and column-wise it shows the dependence one language has on another. We concentrate on the rows in this section. The rows in Table 3 represent the influence percentage a language community has on other languages. We calculate the influence percentage per dataset. The highest influence percentage per row and per dataset is indicated in bold. The highest influence percentage suggests, for each language, which language community is most influenced by its activity.
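The counting can be sketched as follows. The sketch assumes that same-language transitions (e.g. nl → nl) are ignored and that each percentage is taken over all cross-language sub-flows in a dataset; this reading is an assumption, although it is consistent with the off-diagonal values of Table 3, which add up to roughly 100 per dataset. The function names and the toy patterns are hypothetical.

```python
from collections import Counter

def influence_percentages(patterns):
    """Share (in %) of every cross-language sub-flow over all entities."""
    counts = Counter()
    for pattern in patterns:
        for source, target in zip(pattern, pattern[1:]):
            if source != target:  # skip self-transitions (assumption)
                counts[(source, target)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: 100.0 * n / total for pair, n in counts.items()}

def most_influenced(shares, source):
    """Language that most often follows `source`, i.e. the row maximum."""
    row = {target: p for (src, target), p in shares.items() if src == source}
    return max(row, key=row.get) if row else None

# Toy language patterns of four entities.
patterns = [
    ["nl", "nl", "en", "es"],
    ["en", "nl", "it", "nl"],
    ["en", "nl", "nl", "en"],
    ["it", "nl"],
]
shares = influence_percentages(patterns)
print(shares)
print(most_influenced(shares, "en"))
```

In this toy data, en → nl accounts for 25% of all cross-language sub-flows, so Dutch is the community most influenced by English.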

It can be observed from Table 3 that English has the most impact on Dutch media (7.6%); Italian and Dutch have a give-and-take relationship (each has its highest influence percentage on the other: nl → it 10.9%, it → nl 11.3%); and Spanish news is mostly adopted by Italian news agents (8.7%). Compared to online news, the Twitter dataset partly differs in the languages with the highest


Table 3: Pattern diffusion among the languages, which represents the correlation between the language pairs (in %). The maximum values per row and per dataset are marked with an asterisk.

              Online news                            Twitter
From ↓ / To → English  Dutch    Italian  Spanish    English  Dutch    Italian  Spanish
English       -        7.6*     6.6      7.2        -        14.4*    4.6      10
Dutch         8.1      -        10.9*    8.5        13.8*    -        6.1      9.1
Italian       6.9      11.3*    -        8.4        4.7      6.2*     -        5.9
Spanish       7.5      8.2      8.7*     -          10.2*    8.8      6.3      -

Table 4: Summary of the language with most influence per entity category and per dataset.

           Online news                                             Twitter
Language   Automotive  Music  IT-technology  Finance  University   Automotive  Music  IT-technology  Finance  University
English    es          nl     nl             nl       nl           nl          es     nl             nl       nl
Dutch      en          it     en             es       it           es          en     en             en       en
Italian    en          nl     es             nl       nl           nl          es     nl             en       en
Spanish    en          it     it             nl       nl           nl          it     en             en       en

influence percentage. The most influenced language remains the same for English and Italian, but Dutch and Spanish tweets now have the most impact on tweets in English. This may be due to the fact that the Twitter population is unevenly distributed among the languages: English, the language with the highest number of items, by itself accounts for 84.3% of the total Twitter population in our dataset.

We compare the high influence percentages of language pairs with statistics on widely spoken foreign languages in different countries. We make this comparison primarily because multilingual individuals serve as bridges that link segmented information spheres by connecting different language communities [24]. For this comparison, we used the 2012 Eurobarometer report [21], which provides the three most widely used languages for each European country. From this report, we examine the countries within the European Union that have one of the four investigated languages as an official language. In [21], it is shown that Spanish is among the popular foreign languages spoken in Italy, but the reverse is not true. This finding suggests a reason for the high influence of Spanish on Italian and also for why Italian does not have a prominent impact on Spanish. Another notable finding in [21] is that people in the Netherlands are more fluent in English than in the other foreign languages spoken in the country (i.e. French and German): 57% of the total population can follow English news on the radio or TV, and 90% of Dutch people prefer to use English as a foreign language. This result suggests the ability and interest of Dutch people with respect to the English language, which is in line with our findings: English news and tweets highly affect the news (7.6%) and tweets (14.4%) in Dutch. The interlinking of Dutch and English is also consistent with the findings of Hong et al. [11], which reveal that these two languages have a high correlation between the counts of URLs and hashtags shared on Twitter. This shows the level of common interest shared by people using these two languages. To conclude, we observe that the high influence percentages of language pairs partly correlate with the widely spoken foreign languages.

Pattern of diffusion per category. To further explore the pattern of diffusion, we next investigate the diffusion pattern per category. For this analysis, we use only the five categories with the largest number of entities. The intuition here is that different categories may show different patterns, which can reveal insights about the impact of entity category on the pattern of diffusion. We again focus on the high influence percentages of language pairs per category (see Table 4).

By examining the most frequently occurring languages per category (looking at each column in Table 4), we discover the language that adopts the most news articles or tweets from other languages with respect to a particular category. For instance, we can observe that English news agents are highly interested in topics related to the category automotive, because automotive news is mostly adopted in English. In addition, it seems that news on finance and universities is mostly adopted by Dutch news agents. We make the same comparison for Twitter. Dutch users mostly adopt tweets related to the category automotive, and the categories finance and university are often adopted by tweeters using the English language.


Table 5: Correlation between the language of entity origin and the language with the maximum number of peaks in the time series. The second column shows the number of entities that originated in a particular language, meaning that the language is the official language in the country of origin of the entity.

                                                  % of matching origin and max. # peaks
Official language   # entities that originated    in Online news    in Twitter
English             113                           24.8              41.6
Spanish             22                            36.4              86.4

Table 6: Distribution of persistence for the categories with most entities (in both datasets). The persistence is measured in two ways: (i) by analyzing the sub-flows between the same languages and (ii) by counting the maximum number of peaks.

                Online news                                   Twitter
Category        sub-flow analysis   counting max. # peaks     sub-flow analysis   counting max. # peaks
Automotive      en                  en                        nl                  nl
Music           nl                  nl                        es                  es
IT technology   en                  nl                        nl                  nl
Finance         es                  en, nl, it                en                  es
University      nl                  nl                        en                  en

6. RQ2: RELATIONSHIP BETWEEN PATTERN OF DIFFUSION AND THE ORIGIN OF AN ENTITY

Our intuition is that the origin of an entity affects the dissemination of information about that entity. For instance, one might expect entities that originated in an Italian-speaking country to have a higher popularity in Italian than in other languages. Peaks are among the finest features that can reveal changes in the popularity of an entity.

To conduct this analysis, we compute the number of peaks occurring in each of the four languages for each entity. For detecting the peaks, we apply the peak detection method described in Section 4. Besides the peak distribution among the languages, we also need to identify the country of origin of the respective entity. Therefore, we manually collect the origin of all entities from Wikipedia, as the origin of an entity is often included in the wiki page of the organization or celebrity. We then assign one language to each country, according to the most widely used official language of that country.

Out of the 12 official languages, we find that only four languages have a reasonable number of entities to perform the analysis, while the remaining languages have too few entities (fewer than 12). However, for two of these four languages (i.e. German and Japanese), we do not have an information stream available. Thus, English and Spanish remain for the analysis, and Table 5 shows the correlation corresponding to these two languages.
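The percentages reported in Table 5 can be computed along the lines of the sketch below. It assumes that the peak counts per language are already available from the peak detection step, which is not reproduced here. The function name, the input layout and the handling of ties (an entity only counts as a match when its origin language is the single language with the most peaks) are hypothetical choices for this example.

```python
def origin_match_rate(entities):
    """entities: list of (origin_language, {language: number_of_peaks}).
    Returns, per origin language, the percentage of entities whose origin
    language is also the language with the maximum number of peaks."""
    totals, matches = {}, {}
    for origin, peaks in entities:
        totals[origin] = totals.get(origin, 0) + 1
        top = max(peaks.values())
        leaders = [lang for lang, n in peaks.items() if n == top]
        # Assumption: ties do not count as a match.
        if leaders == [origin]:
            matches[origin] = matches.get(origin, 0) + 1
    return {lang: 100.0 * matches.get(lang, 0) / totals[lang] for lang in totals}


# Toy peak counts (illustrative data, not taken from our datasets).
entities = [
    ("es", {"en": 1, "nl": 0, "it": 1, "es": 4}),
    ("es", {"en": 3, "nl": 0, "it": 1, "es": 2}),
    ("en", {"en": 5, "nl": 2, "it": 1, "es": 1}),
    ("en", {"en": 2, "nl": 2, "it": 0, "es": 0}),
]
rates = origin_match_rate(entities)
```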

We can observe a higher correlation rate between the language of the entity origin and the language with the maximum number of peaks for Spanish (36.4% in news, 86.4% in Twitter) than for English in both media. Out of the 113 entities corresponding to English, 86 originated in the United States of America. Wilkinson et al. [23] found that the most interesting topics in the US also tend to be interesting to other nations. This shows the interest of other countries in the US, suggesting that entities originating in the US are also popular among other language communities. This may be the reason for the lower correlation rate between an entity's origin and the English language.

The origin of an entity thus affects the pattern of diffusion. Moreover, the degree of impact depends on the popularity of the country of origin among other nations. A high popularity of the country of origin results in a lower impact of the entity origin on the information diffusion of the specific entity.

7. RQ3: PERSISTENCE OF THE LANGUAGES PER ENTITY CATEGORY

In this section, we investigate for each category the language with the highest persistence. Persistence can be measured in two different ways. One way is to examine the information flow within the same language (e.g. es → es), which represents how long the high popularity lasts in one language. A high value of this measure indicates a higher persistence of the language for the particular entity or entity category. High persistence suggests that the occurring events are local, meaning that the events are only interesting for a specific language community. Notice that we examined the sub-flows between different languages in Section 5, but here we look into the sub-flows within the same language (e.g. es → es). Another way of determining the persistence is by counting the language with the maximum number of peaks for each category.

Table 6 shows the distribution of persistence calculated in these two ways for the five categories with the largest number of entities. By comparing the results from the two methods, we make three observations: first, the persistence values from the two methods match for three out of the five categories. The differing results only relate to the categories IT-technology and finance. Second, we observe a considerably larger difference in the distribution of persistence between the two datasets. The large difference in persistence may be due to the normalization not working as intended.
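The two persistence measures can be sketched as follows. As before, a language pattern is assumed to be an ordered sequence of language codes per entity. For the second measure, the sketch sums the peaks per language over all entities of a category, which is only one possible reading of the description above. All function names are hypothetical, and both functions return a list so that ties (such as "en, nl, it" in Table 6) can be represented.

```python
from collections import Counter


def _leaders(counts):
    """Return all languages that share the highest count, sorted by code."""
    if not counts:
        return []
    top = max(counts.values())
    return sorted(lang for lang, n in counts.items() if n == top)


def persistence_by_subflow(patterns):
    """Language(s) with the most same-language sub-flows (e.g. es -> es)."""
    counts = Counter(a for p in patterns for a, b in zip(p, p[1:]) if a == b)
    return _leaders(counts)


def persistence_by_peaks(peak_counts):
    """Language(s) with the most peaks, summed over all entities of a category.
    peak_counts: list of {language: number_of_peaks}, one dict per entity."""
    counts = Counter()
    for peaks in peak_counts:
        counts.update(peaks)  # adds the per-language peak counts
    return _leaders(counts)


# Toy data for one category (illustrative, not taken from our datasets).
patterns = [["en", "en", "nl"], ["en", "en", "es", "es"], ["nl", "en", "en"]]
peaks = [{"en": 2, "nl": 3}, {"en": 1, "nl": 0}]
by_flow = persistence_by_subflow(patterns)
by_peaks = persistence_by_peaks(peaks)
```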

The persistence results for online news show that English has a higher persistence for the categories automotive and IT technology, and Dutch for the categories music and university. Interestingly, a number of big music festivals in the Netherlands were held during our investigation period, such as Lowlands (extremely popular among the well-educated younger generation) and Tomorrowland. The higher persistence value of the English language for the category IT technology is most likely due to the fact that a large number of global IT firms originated in the US.

8. CHARACTERIZING THE ENTITIES USING TIME SERIES

In addition to the three research questions, we here take a closer look at the entity level to identify the characteristics of these entities. Based on the qualitative observations in the previous sections, we found that four groups of entities appear in our datasets. The terminology used for grouping is inspired by the study on the adoption of hashtags and meme diffusion [19]. That study used these terms to categorize hashtags; in our case, we use them to group the entities broadly.

The time series of entities are highly dynamic, and the grouping helps us to identify these dynamics. For this clustering, we focus on how well the entity-related information is distributed across all examined languages and how well it is distributed over time. To this end, we divide the whole period of 94 days into eight segments and then calculate the total normalized score for each of these segments. By dividing the time series into multiple segments, we are able to detect whether the entity is popular during the entire time period, or only during a certain time. The choice of the number of segments is arbitrary. We also compute the total score for every language.

Table 7: Fraction of entities belonging to each entity group for both datasets.

Group                    % of items in Online News   % of items in Twitter
Worldwide phenomena      73.9%                       33.0%
Event-driven             6.3%                        3.4%
Local interest           6.8%                        51.7%
Local and event-driven   13.1%                       11.9%

• Worldwide phenomena: These entities are evenly spread in at least three languages and also evenly spread in at least five of the eight time segments.

• Event-driven: These entities are globally distributed like the worldwide phenomena, but their popularity depends on occurring events. That means the popularity is not uniformly spread over the whole period.

• Local interest: These entities are only popular in one or two languages, but they have a long time span.

• Local and event-driven: This group is the combination of event-driven and local interest.

We determine for each segment (of language or time duration) whether it is uniformly distributed: a segment is considered not uniformly distributed when its score is smaller than one-third of the score of the highest-scoring segment. Depending on the number of uniformly distributed segments, we classify the entities into the groups listed above. Table 7 shows the summary of the entity grouping statistics. We currently perform rule-based clustering. In future work, we expect to apply automatic clustering and compare the results of the two clustering approaches.
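The rule-based grouping can be sketched as follows. The sketch assumes that a segment counts as uniformly distributed when its score is at least one-third of the score of the highest-scoring segment, and that "long time span" means at least five of the eight time segments are uniformly distributed. The function names and the way the four language scores and eight time-segment scores are passed in are hypothetical.

```python
def uniform_count(scores, ratio=1 / 3):
    """Number of segments whose score is at least `ratio` times the highest score."""
    top = max(scores)
    if top <= 0:
        return 0
    return sum(1 for s in scores if s >= ratio * top)


def classify_entity(language_scores, time_scores, min_languages=3, min_time_segments=5):
    """Assign an entity to one of the four groups, given its total normalized
    score per language and per time segment."""
    global_spread = uniform_count(language_scores) >= min_languages
    long_lived = uniform_count(time_scores) >= min_time_segments
    if global_spread and long_lived:
        return "worldwide phenomena"
    if global_spread:
        return "event-driven"
    if long_lived:
        return "local interest"
    return "local and event-driven"


# Toy scores (illustrative, not taken from our datasets): four language
# scores and eight time-segment scores per entity.
g1 = classify_entity([10, 9, 8, 7], [5, 5, 4, 5, 6, 5, 4, 5])
g2 = classify_entity([10, 9, 8, 1], [9, 1, 1, 1, 1, 1, 1, 1])
g3 = classify_entity([10, 1, 1, 1], [5, 5, 5, 5, 5, 5, 5, 5])
g4 = classify_entity([10, 8, 1, 1], [9, 9, 1, 1, 1, 1, 1, 1])
```

Keeping the two thresholds as parameters makes it easy to check how sensitive the group sizes are to these choices.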

To gain more understanding of the information flow, we further explore the results of score distribution in both datasets.

Online news dataset. Results show that worldwide phenomena rank first in online news, followed by the groups local and event-driven, local interest, and event-driven (see Table 7). The high percentage of worldwide phenomena suggests that the majority of entities are highly popular among the news media in all examined languages and also have a long time span. Since this study focuses on cross-lingual relationships, it is not surprising that the data predominantly belong to the group of worldwide phenomena. Some examples of entities classified into the group of worldwide phenomena are AB Volvo, Amazon.com, Jennifer Lopez and Microsoft Corporation. Examples of entities classified into the other three groups are: event-driven (e.g. El hormiguero, Foxtel, Johnnie Walker and New York University), local interest (e.g. Muse, SEAT SA, U2 and McDonald’s) and local and event-driven (e.g. Lufthansa Airlines, Nivea, Jaguar Cars Ltd, Whitney Houston and Oracle).

Twitter dataset. There is a large difference between the distribution of groups in online news and Twitter (see Table 7). The main reason behind this is the uneven distribution of languages in the Twitter dataset (as shown in Table 7). This is further supported by the fact that around 52% of entities belong to the group local interest, suggesting that these entities are only popular in one or two languages. However, if we focus only on the entities in which the languages are evenly spread, we focus on entities in the worldwide phenomena or event-driven groups. We can see that music ranks first among the categories in which the languages are uniformly spread, followed by IT-technology and automotive. This indicates that Twitter users in Dutch, Italian and Spanish are commonly interested in tweeting about entities related to these three categories. Compared to other language communities, English-language tweeters have a broader range of category interests.

Our findings in this section reveal that Twitter conversations are more likely to focus on topics related to the categories music, IT-technology and automotive, especially in the case of tweets in Dutch, Italian and Spanish. However, the entities of the other categories are well covered in online news.

9. CONCLUSION

In this paper we have demonstrated a frequency-based model for analyzing information diffusion over multiple languages. We conducted a fine-grained analysis focusing on the relationship between the investigated languages, in terms of the impact one language has on another. Due to practical constraints, we did not look at the semantic overlap between the news articles or tweets in this paper. It is interesting that our model is still capable of detecting and distinguishing fine differences in the information diffusion. In particular, this study provided insights into the influence and interest one language community has on other language communities, such as the impact of Spanish on Italian and the interest of Dutch people in the English language. Furthermore, we identified that a country's popular foreign language is a strong constraint on the adoption of information from another language, both in online news and Twitter. Moreover, we found that only topics in some categories (e.g. automotive, music) are discussed on Twitter, despite the other categories being well covered in online news. A number of results of this study are also confirmed by prior studies in this field.

We believe that this study is just a first step towards identifying the role of languages in information diffusion, which will allow us to predict the flow of information popularity in the future. There are still some open issues that can be investigated further. The model presented in this study is purely based on the popularity of entities, in terms of the frequency of published news articles or tweets. To better understand the relationship between the languages, we need to investigate the underlying topics associated with news articles or tweets. The results from that experiment would help us to statistically strengthen (or weaken) our findings in this paper. Directions for future work include investigating alternative normalization approaches that might lead to more accurate estimates, even if one or two languages overshadow other languages in the dataset. Furthermore, we also plan to find out to what degree advertising agencies influence the information diffusion.

10. ACKNOWLEDGMENTS

We thank the EU project LiMoSINe (grant agreement no. 258191) for giving access to the resources that facilitated the research. The LiMoSINe project is funded by the European Community’s Seventh Framework Program (FP7/2007-2013). Also, special thanks to Manos Tsagkias for his guidance during this research.

11. REFERENCES

1. Yang, J., Leskovec, J. (2011, February). Patterns of temporal variation in online media. In Proceedings of the fourth ACM international conference on Web search and data mining (pp. 177-186). ACM.

2. Leskovec, J., Backstrom, L., Kleinberg, J. (2009, June). Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 497-506). ACM.

3. Amigó, E., de Albornoz, J. C., Chugur, I., Corujo, A., Gonzalo, J., Martín, T., ... Spina, D. (2013). Overview of RepLab 2013: Evaluating online reputation monitoring systems. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization (pp. 333-352). Springer Berlin Heidelberg.

4. Leskovec, J., Backstrom, L., Kleinberg, J. (2009, June). Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 497-506). ACM.

5. Poblete, B., Garcia, R., Mendoza, M., Jaimes, A. (2011, October). Do all birds tweet the same?: characterizing Twitter around the world. In Proceedings of the 20th ACM international conference on Information and knowledge management (pp. 1025-1030). ACM.

6. Kwak, H., Lee, C., Park, H., Moon, S. (2010, April). What is Twitter, a social network or a news media?. In Proceedings of the 19th international conference on World wide web (pp. 591-600). ACM.

7. Java, A., Song, X., Finin, T., Tseng, B. (2007, August). Why we twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis (pp. 56-65). ACM.

8. Becker, H., Naaman, M., Gravano, L. (2010, February). Learning similarity metrics for event identification in social media. In Proceedings of the third ACM international conference on Web search and data mining (pp. 291-300). ACM.

9. Tsagkias, M., Weerkamp, W., De Rijke, M. (2009, November). Predicting the volume of comments on online news stories. In Proceedings of the 18th ACM conference on Information and knowledge management (pp. 1765-1768). ACM.


10. Tsagkias, M., de Rijke, M., Weerkamp, W. (2011, February). Linking online news and social media. In Proceedings of the fourth ACM international conference on Web search and data mining (pp. 565-574). ACM.

11. Hong, L., Convertino, G., Chi, E. H. (2011, July). Language Matters In Twitter: A Large Scale Study. In ICWSM.

12. Weerkamp, W., Carter, S., Tsagkias, M. (2011). How people use twitter in different languages.

13. Putsis Jr, W. P., Balasubramanian, S., Kaplan, E. W., Sen, S. K. (1997). Mixing behavior in cross-country diffusion. Marketing Science, 16(4), 354-369.

14. Kulshrestha, J., Kooti, F., Nikravesh, A., Gummadi, P. K. (2012, June). Geographic Dissection of the Twitter Network. In ICWSM.

15. Jurgens, D. (2013, June). That’s what friends are for: Inferring location in online social media platforms based on social relationships. In Seventh International AAAI Conference on Weblogs and Social Media.

16. Lee, C. S., Ma, L. (2012). News sharing in social media: The effect of gratifications and prior experience. Computers in Human Behavior, 28(2), 331-339.

17. Bandari, R., Asur, S., Huberman, B. A. (2012, February). The Pulse of News in Social Media: Forecasting Popularity. In ICWSM.

18. Sankaranarayanan, J., Samet, H., Teitler, B. E., Lieberman, M. D., Sperling, J. (2009, November). Twitterstand: news in tweets. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (pp. 42-51). ACM.

19. Kamath, K. Y., Caverlee, J., Lee, K., Cheng, Z. (2013, May). Spatio-temporal dynamics of online memes: A study of geo-tagged tweets. In Proceedings of the 22nd international conference on World Wide Web (pp. 667-678). International World Wide Web Conferences Steering Committee.

20. Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A. (2004, May). Information diffusion through blogspace. In Proceedings of the 13th international conference on World Wide Web (pp. 491-501). ACM.

21. European Commission. (2012, June). Special Eurobarometer 386: Europeans and their languages (pp. 21). Retrieved from http://ec.europa.eu/public_opinion/archives/ebs/ebs_386_en.pdf

22. Hale, S. A. (2012). Net Increase? Cross-Lingual Linking in the Blogosphere. Journal of Computer-Mediated Communication, 17(2), 135-151.

23. Wilkinson, D., Thelwall, M. (2012). Trending Twitter topics in English: An international comparison. Journal of the American Society for Information Science and Technology, 63(8), 1631-1646.

24. Eleta, I., Golbeck, J. (2012). Bridging languages in social networks: How multilingual users of Twitter connect language communities?. Proceedings of the American Society for Information Science and Technology, 49(1), 1-4.

25. Vlachos, M., Meek, C., Vagena, Z., Gunopulos, D. (2004, June). Identifying similarities, periodicities and bursts for online search queries. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data (pp. 131-142). ACM.

26. Tata, S., Patel, J. M. (2007). Estimating the selectivity of tf-idf based cosine similarity predicates. ACM SIGMOD Record, 36(2), 7-12.

27. Boczkowski, P. J. (2010). News at work. Imitation in an Age of Information Abundance.