Twitter and Culture: Estimating IDV-index of countries based on tweets


Maurice Mbaseka Kerubino

Student number: 5831660 Email: maurice.kerubino@gmail.com

Thesis Master Information Science – Business Information Systems

University of Amsterdam

Faculty of Science and Faculty of Economics and Business

Department Business Studies

Section Information Management

Final version: 08-06-2014

Supervisor: dr. W.P.Weijland, Faculty of Science, University of Amsterdam

Examiner: drs. A.W. Abcouwer, Signature: __________________________________


M.Sc. THESIS


Abstract

While cultural variations across countries have been extensively studied using small-scale experiments and surveys, the global nature of Twitter offers an important source of data for research on cultural and social aspects of countries. One aspect that can readily be investigated by analyzing tweets is individualism vs. collectivism, part of Hofstede’s cultural dimensions theory. In this study, we try to estimate Hofstede’s IDV-index by analyzing tweets from 10 countries and calculating their Tweet-IDV score. Through a literature review, we devise criteria to identify individualistic tweets. We found that tweets in which the word “I” was preceded or followed by a verb or adverb are individualistic and tweets in which the word “we” was preceded or followed by a verb or adverb are collectivistic. After excluding individualistic tweets that were categorized as check-ins, we found significant evidence that Tweet-IDV scores were positively correlated with the IDV scores of the countries investigated. Based on these observations, we conclude that it is possible to estimate Hofstede’s IDV-index by analyzing tweets from different countries.


LIST OF TABLES

Table 1: Examples of individualistic and collectivistic tweets

Table 2: Number of tweets collected per country

Table 3: Top ten words from collectivistic (cdv) tweets and individualistic (idv) tweets

Table 4: Confusion matrix of classifier performance on training data

Table 5: Confusion matrix of classifier performance on test data

Table 7: Accuracy per token or token combination on a training set of 1000 tweets

Table 8: Tweet-IDV scores after excluding check-ins

Table 9: Interclass correlation coefficients per IDV category

LIST OF FIGURES

Figure 1: Datasift maps polygon selection tool

Figure 2: Hofstede's IDV vs Tweet-IDV

Figure 3: Hofstede's IDV vs Tweet-IDV after excluding check-ins


ACKNOWLEDGEMENTS

Thank you,

Your support, time, insights, advice and guidance are duly appreciated:

Dr. W.P. Weijland, academic supervisor, Informatics Institute, Faculty of Science, University of Amsterdam, Science Park 904, 1098 XH Amsterdam, the Netherlands

Dr. Wouter Weerkamp, former academic supervisor and postdoctoral researcher at Information and Language Processing Systems (ILPS), University of Amsterdam, for the support;

... and to all the teaching and support staff at the Informatics Institute of the University of Amsterdam.

My colleagues and friends, new, old and future, throughout the world.

And most importantly, for the love, support and warmth you bring me every day: my wife Cheryn and daughter Cherice.

Please read with pleasure, Maurice Kerubino. Peace on Earth!


EXECUTIVE SUMMARY

This document presents the results of the research for my Master's thesis, the final assignment of my study in Business Information Systems at the University of Amsterdam. I start by introducing the research topic, “Estimating IDV-index based on tweets”. I follow that with a short literature review on the conformity of views within countries that leads to cultural differences between countries, affecting opinions and therefore influencing what we say on Twitter; this is the area of interest that I am researching. I then describe my research methodology, in which I state the goals and relevance of this research and compare it with earlier research initiatives. The structure of my research is then elaborated, stating the research methods, research framework and research materials. The results of my research are then presented and explained. I conclude with deductions, questions, challenges and possible further improvements raised by this research. A list of the literature referenced in this research closes this report.

1 Introduction

1.1 Background

Much recent online activity takes place on Twitter¹, a popular microblogging website on which an estimated 250 million status updates are posted per day by users from all corners of the world. Originally developed as an online tool to help friends and family stay in contact and post textual status updates of at most 140 characters, popularly known as tweets, about what they were doing (Java et al., 2007), Twitter has evolved into a tool that is used for a variety of other purposes, such as spreading and sharing breaking news and spontaneous ideas (Zhao et al., 2011) and promoting products and services (Thongsuk et al., 2011). Through specialized software applications, Twitter users can also enrich their tweets by posting links to pictures, audio or video. These features have led to Twitter being used to propagate information in real time in many crisis situations, such as the aftermath of the Iran election in 2009, the tsunami in Samoa and the 2010 Haiti earthquake (Hong, 2010).

Tweets contain useful information that can be analyzed to learn about Twitter users and the environment in which they find themselves. This information has attracted the attention of researchers, who have recently used Twitter to, among other things, study real-time events (Sakaki et al., 2010; Vieweg et al., 2010), forecast box-office revenue for movies and popularity (Asur et al., 2010; Bandari et al., 2012) or model public mood (Bollen, 2011; Golder et al., 2011). Some of this research has even given rise to applications like SNTMNT², an online tool that gives daily insights into online consumer sentiment surrounding stock market funds by analyzing tweets. Data from increasingly popular online social media like Twitter allow social scientists to study individual behavior in real time in a way that is both fine-grained and massively global in scale, making it possible to obtain precise real-time measurements across large and diverse populations (Golder, 2011). These qualities make Twitter a suitable source of data for research on cultural and social variations across countries. Cultural variations across countries have been empirically studied using small-scale experiments and surveys in the real world (Gavilanes, 2013).

Hofstede (1980), one of the pioneering researchers in this area, developed the cultural dimensions theory as a framework for cross-cultural communication by analyzing data from an opinion survey of IBM employees from 70 countries in the 1960s and 1970s. The theory was one of the first that could be quantified and used to explain observed differences between cultures, but other researchers (Bond et al., 2004; Inglehart, 1997; Kitayama, 1997; Trompenaars, 1993; Schwartz, 2004; Smith et al., 2002) have since come up with similar theories supporting this notion. Hofstede found five main factors that explained most of the difference in cultures between countries, which he labeled cultural dimensions.

¹ http://www.Twitter.com

The first dimension is Power Distance, which reflects the extent to which people (especially those less powerful) expect and accept that power is distributed unequally. Countries with a low power distance exhibit power relations that are more democratic compared to countries with a high power distance. The second dimension is Individualism vs. Collectivism, which refers to the degree to which individuals are integrated into groups. Societies in which the interest of the individual prevails over the interest of the group are considered individualistic, while societies in which the interest of the group prevails over the interest of the individual are considered collectivistic. Social relationships in individualistic societies are loose, as opposed to relationships integrated in strong and cohesive groups in collectivistic societies. The third dimension is Masculinity vs. Femininity, which refers to the distribution of emotional roles between the genders. Masculine cultures value competitiveness, assertiveness, materialism, ambition and power, whereas feminine cultures place more value on relationships and quality of life. The fourth dimension is Uncertainty Avoidance, which refers to "a society's tolerance for uncertainty and ambiguity". It reflects the extent to which members of a society attempt to cope with anxiety by minimizing uncertainty. People in cultures with high uncertainty avoidance tend to be more emotional and try to minimize the occurrence of unknown and unusual circumstances by carefully planning changes.

The fifth dimension, which Hofstede later added to his cultural dimensions theory, is Long-Term Orientation vs. Short-Term Orientation, which describes the extent to which a society attaches importance to the future. Long-term oriented societies attach more importance to the future and foster pragmatic values oriented towards rewards, including persistence, saving and capacity for adaptation. In short-term oriented societies, the values promoted are related to the past and the present, including steadiness, respect for tradition, preservation of one's face, reciprocation and fulfilling social obligations.

Although Twitter represents just one of the media used by people to communicate, its global footprint offers possibilities to collect data from different cultures across the globe.

The act of tweeting can be seen as a social event that can collectively be defined, maintained and held in place, but that can vary considerably across cultures (Kitayama, 1997). The same analytic methods that Hofstede used can therefore be applied to tweets from different countries to try to analyze the values of different cultures.

Hofstede’s dimensions have been found to determine how people from different countries behave differently when subjected to the same conditions in real-world situations. In this study, we try to find out whether data collected from Twitter can be used to assess the extent to which such differences can also be detected. The main goal of the study is to estimate the individualism score of a country, as a cultural difference, by studying online interactions on Twitter. We analyze more than 2.2 million tweets from different countries by identifying and classifying certain elements in tweets to determine whether the tweets are individualistic or collectivistic. By considering a nation as a society with its own distinct culture, and identifying individualistic or collectivistic topics in tweets from that nation, we want to find evidence that the levels of individualism in different countries can be estimated by studying the content of tweets from those countries.

We apply Naïve Bayes classification and Part-of-Speech (POS) tagging techniques to help us discover and categorize individualistic and collectivistic tweets and to estimate levels of individualism in different countries from the tweets. Regression analysis of the number of tweets per country that are classified as individualistic against the IDV-index will help us find the relationship between the scores on the IDV-index (IDV-scores) and the level of


1.2 Relevance of this research

Recent Twitter access blockages are a subtle reminder that social software tools like Twitter can have adverse effects in some parts of the world. Twitter has a global reach and can therefore be used as a global sensor for changes in cultural aspects and to understand culture’s impact on social software adoption and usage. Changes in cultural aspects have in the past helped researchers investigate economic activity in different countries (Gorodnichenko and Roland, 2011) and the difference in marketing techniques used by individualistic and collectivistic marketers on social network sites (Tsai and Men, 2012). This research uses existing tools, theories and techniques to analyze the cultural aspect of individualism and estimate the IDV-score of different countries. This will help us better understand how people from different parts of the world use Twitter and may provide insights into how cultural and social aspects can be analyzed on Twitter. This research also extends the work of Gimpel et al. (2011) at Carnegie Mellon University (CMU), which applies natural language processing (NLP) techniques to analyze tweets through Part-of-Speech (POS) tagging. The tools provided by CMU’s ARK-tweet-nlp project³ (Gimpel, 2011) are modified and applied to answer some of our research questions.

1.3 Research objective

There could be a wide range of explanations for why tweets from different nations might differ in the amount of personal content they contain. The objective of this research is to find out whether it is possible to estimate the IDV-score based on the origin of Twitter messages, or tweets. Initially, tweets from countries with high levels of Twitter activity will be analyzed to establish whether it is possible to detect individualistic or collectivistic tweets. Subsequently, the number of individualistic tweets from a country should help us estimate the IDV-score of that country. Guided by conclusions from earlier research (Hofstede, 1980; 1983; Ji, 2010), we assume that a communication or tweet is individualistic when a user prioritizes himself or herself above the group from which he or she comes. Similarly, a communication or tweet is collectivistic when a user prioritizes the group from which he or she comes above himself or herself. By “the group from which one comes” we mean the country or nation that one is from.

By analyzing the percentage of individualistic tweets from different countries, we attempt to estimate the IDV-score of each country. By the “percentage of individualistic tweets” we mean the number of individualistic tweets as a fraction of the total number of individualistic and collectivistic tweets categorized from a particular country. The research will attempt to correlate the scores estimated through Twitter with the actual Hofstede IDV-scores for the countries analyzed.

1.4 Research questions

The main question that this research aims to answer is:

Can we estimate the IDV-index of countries based on tweets?

Answering this research question requires us to identify individualistic and collectivistic tweets and to find a relationship between the nature and quantity of individualistic or collectivistic tweets and the characteristics of the country from which they originate. Part-of-speech (POS) tagging, a basic syntactic analysis technique used in natural language processing (NLP), will be used to identify individualistic and collectivistic tweets. This raises the following four research sub-questions, each of which focuses on a subset of the research area:


1. Is the use of personal pronouns “I” and “we” on Twitter related to individualism?

2. Can we successfully apply POS tagging on tweets?

3. Can we automatically determine individualistic vs. collectivistic tweets?

4. Can we find a relationship between Twitter-based IDV and Hofstede's IDV indices?

The first three sub-questions focus on the possibility of using POS tagging as a technique for identifying individualistic and collectivistic tweets, while the fourth sub-question focuses on the correlation between the nature and quantity of individualistic or collectivistic tweets per country and the IDV-index.

1.5 Theoretical framework: Cultural background as a differentiating factor in tweets

People from the same country share a culture because they live in a national clique. Cliques tend to have more homogeneous views and opinions and to share many common traits (Granovetter, 1973), leading to conformity of views and opinions. The acts and behaviours of people from a particular country differ from those of people from another country. Culture is a challenging variable to research, in part because of the multiple divergent definitions and measures of culture (Leidner, 2006), but culture provides a common understanding transcending immediate individual experience, a social reality to guide our actions (Latane, 1996). Latane (1996) describes culture as “the entire set of socially transmitted beliefs, values, and practices that characterize a given society at a given time”. People from different parts of the world differ in their ability to influence each other in their spatial location, affecting each other in a dynamic interactive process of reciprocal and recursive influence, which results in different social structures for groups of people from different parts of the world. Latane (1996) goes on to explain that this is because people tend to be influenced by nearby rather than faraway people, resulting in local patterns of consensus in attitudes, values, practices, identities and meanings that can be interpreted as sub-cultures. As a result, those who share that common space come to mutual understandings of their world, and as these individual cultural elements become regionally differentiated, correlations emerge among them, leading to stereotypes, identities, nationalities and recognizable cultural patterns (Cullum, 2007).

These regionally differentiated cultural elements can be used to explain the differences in behavior and practices between people from different countries. Through nation-level factor analysis or other equivalent procedures, Hofstede (1980), a cultural anthropology researcher, proposed the cultural dimensions theory as a framework that can be used to identify dimensions of variation in national cultures. Other cultural anthropology researchers (Bond et al., 2004; Hall, 1981; Inglehart, 1997; Trompenaars, 1993; Schwartz, 2004; Smith et al., 2002) have come up with similar frameworks, but the most frequently used model in cross-cultural studies is Hofstede’s theory (Hall, 1981).


Hofstede’s (1980) “individualism versus collectivism” dimension refers to the level of prioritization between individual and group in the pursuit of profits. It refers to the tendency of someone in a group to prioritize himself or herself above the group, or the group above himself or herself (Ji, 2010). Individualism pertains to societies in which the ties between individuals are loose: everyone is expected to look after himself or herself and his or her immediate family. Collectivism, as its opposite, pertains to societies in which people from birth onward are integrated into strong, cohesive in-groups, which throughout people’s lifetimes continue to protect them in exchange for unquestioning loyalty (Hofstede, 2005). Hofstede (2005) goes on to state that individualism is practiced by individualist societies and collectivism by collectivistic societies. The IDV-index that was developed in Hofstede’s (1980) study is used to measure the individualism or collectivism of countries, where the IDV-score is the percentage of individualism that the country exhibits. Extreme individualism is close to 100 and extreme collectivism is close to 1 on the IDV-index scale. Not all countries are represented on the index, but the four most individualistic (least collectivistic) countries are the United States of America with a score of 91, Australia with a score of 90, and the United Kingdom and the Netherlands with scores of 89 and 80 respectively. The four most collectivistic (least individualistic) countries are Guatemala (6), Ecuador (8), Panama (11) and Venezuela (12). A full list can be found on Clearly Cultural’s website⁴.

The individualism versus collectivism dimension may tap into conceptions held by people in a culture (Kashima, 1998). By considering Twitter user posts (also known as tweets) as a differentiating cultural practice, one can make meaningful deductions about the society from which those posts come. This study assumes that the language in tweets may embed a particular conception of relationships among people. One way in which we can identify those conceptions is through the use of personal deixis (person-indexing pronouns such as I, my, me, mine, we, our, us, ours). Personal deixis in language may provide a window through which cultural practices can be investigated (Kashima, 1998). In the area of cultural psychology, priming studies have demonstrated that exposure to first-person singular pronouns (I, my, me, mine) induces an individual to adopt an individualistic self-view, whereas exposure to first-person plural pronouns (we, our, us, ours) leads an individual to adopt a collectivistic self-view, albeit temporarily (Brewer, 1996; Gardner et al., 1999; Kühnen, 2004). Kashima (1998), on the other hand, found that languages spoken in individualistic cultures tend to require speakers to use the “I” pronoun when referring to themselves, while languages spoken in collectivist cultures allow or prescribe dropping the “I” pronoun. Compared to their individualistic counterparts, people with a collectivistic orientation prefer to use first-person plural possessive pronouns (Na, 2009). Similar to these findings are Hofstede’s (2005) observations that collectivistic cultures avoid the use of the word “I” and consider themselves interdependent on society, while individualistic cultures encourage the use of the word “I” and consider themselves independent selves. The question of whether the use of first-person pronouns is a good indicator of individualism or collectivism in society is addressed by Twenge (2012) in research on changes in pronoun use in American books between 1960 and 2008. Twenge (2012) finds that the increasing use of first-person singular pronouns in American books was due to the increase in individualism in American society.

Twitter, with its global reach, provides a suitable environment for analyzing the language that people use in different countries and for gaining insight into how news propagates, how people communicate, and maybe how they influence each other (Poblete, 2011). People’s utterances on Twitter could help us find a relationship with their cultural dimensions, because tweets could give us insight into the social reality guiding people’s actions on Twitter. Tweets in this case would act as a lens through which people experience their immediate worlds. Through linguistic analysis of tweets, we attempt to draw conclusions on whether the nation from which the tweets originated is individualistic or collectivistic.

1.6 Individualism and Collectivism on Twitter: Tweet-IDV score

Based on this theoretical framework, individualism or collectivism can be recognized on Twitter through what people tweet about. Observations by Hofstede (2005), Kashima (1998) and Na (2009) that the use of first-person pronouns in phrases can be attributed to a culture being individualistic or collectivistic can help us recognize individualism and collectivism on Twitter. Individualistic cultures encourage the use of the word “I” and consider themselves independent selves, while collectivistic cultures avoid the use of the word “I” and consider themselves interdependent on society. People with a collectivistic orientation therefore prefer to use first-person plural pronouns such as “we”. Various studies (Brekke, 1999; Bybee, 1999; Hyland, 2001; Okamura, 2009) have shown that, in English collocations of “I” and “we”, the first-person pronoun is mostly immediately followed by a verb or adverb in non-inquisitive phrases, or preceded by a verb in inquisitive phrases. A collocation is an expression consisting of two or more words that corresponds to some conventional way of saying things (Manning, 1999). Therefore, tweets in which the word “I” (first-person singular nominative pronoun) is immediately followed by a verb or adverb, or preceded by a verb in a question, will be considered individualistic. Examples of individualistic tweets would be “I am happy!”, “I still love you” or “Did I check in?” Tweets in which the word “we” (first-person plural nominative pronoun) is immediately followed by a verb or adverb, or preceded by a verb in a question, will be considered collectivistic. Examples of collectivistic tweets would be “We have won!”, “we always win” or “Did we get a raise?” Other collocations of “I” and “we” are considered beyond the scope of this research. See Table 1 for more examples of individualistic and collectivistic tweets.

⁴ http://www.clearlycultural.com/geert-hofstede-cultural-dimensions/individualism/

Individualistic tweets | Collectivistic tweets

@name I am going to blow | We are gonna kick your a…

I’d like to tell you all that #UgandaIsNotSpain | @name Can we get an amen?

Can I kick it? Yes, I can … bwooooy! | We still want justice for Martin!!!

Table 1: Examples of individualistic and collectivistic tweets
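The classification rule described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tagger used in this study: the small `VERBS_AND_ADVERBS` lexicon and the function names `tokenize` and `classify_tweet` are hypothetical stand-ins for the output of a real part-of-speech tagger, and tokenisation is a naive regular-expression split.

```python
import re
from typing import List, Optional

# Hypothetical mini-lexicon standing in for a POS tagger's verb/adverb tags.
# A real pipeline would take these tags from the tagger instead.
VERBS_AND_ADVERBS = {
    "am", "are", "is", "have", "had", "do", "did", "can", "will", "would",
    "love", "like", "want", "get", "win", "check", "going", "still", "always",
    "'m", "'d", "'ll", "'ve", "'re",  # contracted auxiliaries
}


def tokenize(tweet: str) -> List[str]:
    """Lowercase the tweet and split it into words and contracted suffixes."""
    text = tweet.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return re.findall(r"[a-z]+|'[a-z]+", text)


def classify_tweet(tweet: str) -> Optional[str]:
    """Return 'individualistic', 'collectivistic', or None if no rule fires.

    The first "I" or "we" that has a verb/adverb directly before or after it
    decides the label. Unlike the rule in the text, this simplified check
    does not test whether a preceding verb occurs in a question.
    """
    tokens = tokenize(tweet)
    for pos, tok in enumerate(tokens):
        if tok not in ("i", "we"):
            continue
        prev_tok = tokens[pos - 1] if pos > 0 else ""
        next_tok = tokens[pos + 1] if pos + 1 < len(tokens) else ""
        if prev_tok in VERBS_AND_ADVERBS or next_tok in VERBS_AND_ADVERBS:
            return "individualistic" if tok == "i" else "collectivistic"
    return None


for example in ["I am happy!", "Did I check in?", "We have won!", "Good morning"]:
    print(example, "->", classify_tweet(example))
```

Using a fixed lexicon keeps the sketch self-contained; the quality of a real classification would depend on the accuracy of the POS tags supplied by the tagger.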

Tweet-IDV score

Since the scores on the IDV-index are measured as the percentage of individualism on the individualism-collectivism dimension, in this research I propose to measure individualism on Twitter as the percentage of individualistic tweets among all collectivistic and individualistic tweets posted from a country. I have labeled this percentage the Tweet-IDV score on the Tweet-IDV index.

The Tweet-IDV score per country is therefore the percentage of individualistic tweets among all collectivistic and individualistic tweets posted from a country.

Therefore,

$$\text{TweetIDVscore} = \frac{\sum \text{indTweets}}{\sum \text{indTweets} + \sum \text{colTweets}} \times 100$$

where

$\sum \text{indTweets}$ is the sum of all individualistic tweets from a country in which the token combination {first-person singular nominative pronoun + verb, e.g. I am, I like, Do I} is used, and

$\sum \text{colTweets}$ is the sum of all collectivistic tweets from a country in which the token combination {first-person plural nominative pronoun + verb, e.g. we are, we like, Do we..?} is used.

Using the Tweet-IDV scores calculated from the tweets collected, a reference Tweet-IDV index will be created in which the list of Tweet-IDV scores will be listed and categorized.
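The Tweet-IDV score defined in this section, the percentage of individualistic tweets among all individualistic and collectivistic tweets from a country, can be computed as in the following minimal sketch. The function name `tweet_idv_score` and the example counts are hypothetical and are not taken from the data collected for this study.

```python
def tweet_idv_score(ind_tweets: int, col_tweets: int) -> float:
    """Percentage of individualistic tweets among all classified tweets."""
    if ind_tweets < 0 or col_tweets < 0:
        raise ValueError("tweet counts must be non-negative")
    total = ind_tweets + col_tweets
    if total == 0:
        # No classified tweets: the score is undefined for this country.
        raise ValueError("no classified tweets for this country")
    return 100.0 * ind_tweets / total


# Hypothetical counts for one country: 750 individualistic, 250 collectivistic.
print(tweet_idv_score(750, 250))
```

Tweets that match neither rule are left out of both counts, so the score only reflects tweets that were classified.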

1.7 Hypothesis and expectations

As mentioned earlier in the objective of this research, we intend to find out whether the IDV-index can be estimated based on tweets from different countries. We therefore expect to find empirical evidence that there is a relationship between the percentage of individualistic tweets per country and the IDV-index, suggesting that the IDV-index can be estimated by analyzing tweets from various countries. Accordingly, a country with a high Tweet-IDV score is expected to have a comparatively high score on the IDV-index, and a country with a low Tweet-IDV score is expected to have a comparatively low score on the IDV-index.

We therefore hypothesize that there is a relationship between the number of individualistic tweets per country and Hofstede’s IDV score for that country. The null hypothesis states that there is no relationship between the number of individualistic tweets per country and Hofstede’s IDV score for that country.
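One way to examine such a relationship is to compute a correlation coefficient between the per-country Tweet-IDV scores and Hofstede’s IDV scores. The sketch below assumes Pearson’s r as the statistic, which this section does not itself prescribe; the function name `pearson_r` and the toy input vectors are illustrative only and are not study data.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Toy vectors only: a perfectly increasing and a perfectly decreasing pair.
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [3, 2, 1]))
```

A value of r close to +1 would be consistent with the hypothesis, while a value near 0 would be consistent with the null hypothesis; a significance test would still be needed before drawing conclusions from a small number of countries.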

1.8 Earlier research

Micro-blogging tools like Twitter are a relatively new form of technology, so there is limited prior research on the impact of culture on their use. The effects of cultural differences on the use of technology in general have always intrigued researchers, because culture at various levels, including national, organizational and group, can influence the successful implementation and use of information technology (Leidner & Kayworth, 2006). Ji’s (2010) study on the influence of cultural differences on the use of social software offers an insightful view of how culture affects social software usage, but the research by Gavilanes (2013) is probably the most relevant to this subject. Gavilanes analysed tweets from different countries and concluded that the higher a country’s Pace of Life, the more predictable its residents were, and that users from collectivistic countries interact more with each other than those in individualistic countries. Gavilanes (2013) thus implies that culture affects the way we use Twitter. Another detailed study, by Tsai and Men (2012), found significant differences between the types of messages that individualist marketers put on social media and those of collectivist marketers. Tsai and Men found that collectivist marketers on the Chinese social platform Renren are more likely to say that their product is very popular. Although Ji’s (2010) work found insufficient evidence that the divergent use of social network services was caused by cultural differences, the results showed that Korean and Chinese users form bridging and bonding social capital mainly through Expert Search and Connection functions, whereas American users mainly use the Communication function to form bonding social capital. A comparable study by Kayan (2006) on cultural differences in the use of instant messaging in Asia and North America found that multi-party chat, audio-video chat and emoticons were much more popular in Asia than in North America.

On the structure and usage of Twitter, Poblete (2011) found differences and similarities in activity, sentiment, use of languages and network structure on Twitter in different countries. The authors argued that using Twitter data in research could help us gain insight into how news propagates, how people communicate, and maybe how they influence each other. Perhaps more relevant to this research is work on the use of language on Twitter (Hong, 2011; Weerkamp, 2011), which discovered cross-language differences in the adoption of features such as URLs, hashtags, mentions, replies and retweets.

Recent research on topic discovery and extraction in tweets has also contributed to the methodology of this research. Zhao’s (2011) work in this area presented methods that are very effective for topical keyphrase extraction and topic modeling. The work of Hong (2010), Markman (2011) and Nishida (2011) on Twitter topic modeling and classification presents methodology that can be used to discover topic clusters in tweets. The Part-of-Speech tagging work by Gimpel et al. (2011) also presents useful tools that can help researchers use natural language processing techniques to understand what goes on on Twitter. Proximity of social website users also plays an important role in the way they influence each other’s activity. Research by Huberman (2008) on networks in Twitter found that a sparser and simpler network of actual friends proved to be a more influential network than the dense one made up of followers and followees in driving Twitter usage, since users with many actual friends tend to post more updates than users with few actual friends. Similar research by Leavitt et al. (2009) has shown that users of online social networking websites like Twitter influence each other’s behavior. The study by Leavitt et al. (2009) found that news outlets influence people to republish news to other users, while celebrities with higher follower totals foster more conversation than they provide re-tweetable content.

My research will try to provide empirical evidence that the differences in the amount of individualistic or collectivistic topics in tweets from different countries are influenced by their culture. A subsequent goal is to formulate a structured explanation of which aspects of culture influence the amount of individualistic or collectivistic topics in tweets from one country.

1.9 Study limitations

Analysis based on the content of tweets can be challenging because Twitter contains highly non-standard orthography (Puniyani, 2010). First of all, tweets are challenging to analyze because they are very brief, containing at most 140 characters (Kaufman, 2010; Carter et al., 2011), and contain several novel syntactic elements seen mainly in social media such as URLs and email addresses, emoticons, Twitter #topic hashtags and Twitter @user mentions, together with abbreviations, acronyms and initialisms that could lead to difficulties in recognizing and analysing useful data (Kaufman, 2010). It is also important to note that, on average, only 1% to 3% of all tweets are geolocation-tagged⁵ and can therefore be associated with a particular country. This figure varies per country, but there are also a number of technical issues (Hale et al., 2012) tied to the validity and scale of the geography associated with tweets, which can affect the validity and accuracy of the data collected.

Another limitation is that Twitter is used in multiple languages, which poses the challenge of how to efficiently analyze tweets posted in different languages and identify the embedded conception of relationships between people. Some languages, like Spanish for example, omit the first person singular (e.g. “I” in English) in a practice which Kashima (1998) has labelled “pronoun-drop”. Pronoun-drop may be understood as a way of symbolically de-emphasizing the significance of the actor, the subject of an action (Kashima, 1998). If a language requires its users to mention the subject explicitly under most circumstances, the individual actor (typically the subject of a sentence) may be made chronically a prominent focus of attention. By contrast, if a language allows its users to drop the personal pronoun, the prominence of the actor may be somewhat reduced and the context in which the person is acting is brought closer to discursive attention (Kashima, 2003). An explicit use of “I” signals that a person is highlighted as a figure against the speech context that constitutes the ground, while its absence reduces the prominence of a speaker’s person, thus reducing the figure-ground differentiation (Kashima, 1998). Pronoun-drop therefore provides a symbolic tool by which the conception of the person is contextualized, and it controls whether the actor is the focal point of a linguistic discourse or is symbolically represented as a being embedded in context. Limitations presented by pronoun-drop will be circumvented by considering only tweets in English, as a single language of analysis, and assuming that people from different nations use that language in the same way.

It is possible that English language tweets emanating from non-English speaking countries are posted by people that are not indigenous to that country such as tourists or expatriates but we’ll assume that these tweets are so few that they do not have a significant impact on the overall outcome of our research.

In the next chapter, we’ll describe the methods and tools that we used to collect and analyze data from Twitter. We’ll present the criteria used to select and categorize the collected data and how theory from literature guided our research. The results of this research will then be presented and explained, followed by conclusions, a discussion and the implications of this research, which close this report.

2 RESEARCH METHOD

2.1 Research methodology

Based on conclusions from the theoretical framework, criteria were formulated to select individualistic and collectivistic tweets from 10 different countries as data for a period of 1 week. A subset of the collected tweets was run through a classifier to predict whether the use of “I” or “We” in tweets is related to the individualism vs collectivism cultural dimension. The tweets were then subjected to linguistic analysis with part-of-speech tagging and analyzed by specially created software that categorized them as individualistic or collectivistic and tallied the percentage of individualistic tweets for each country from which tweets were gathered, which we will call Tweet-IDV. We used Pearson’s correlation coefficient to correlate each country’s Tweet-IDV with Hofstede’s individualism index score in order to see whether we can estimate the IDV-index of countries based on tweets.

5 http://www.floatingsheep.org/2012_07_01_archive.html

2.2 Data collection

A corpus of geolocation-tagged English tweets from 10 randomly selected countries was created for this research. These countries are the U.S., Australia, Great Britain, Canada, New Zealand, Ireland, India, Japan, Portugal and Indonesia, with scores of 91, 90, 89, 80, 79, 70, 48, 46, 27 and 14 respectively on the IDV-index. The geographic information associated with each tweet makes it possible to select tweets based on the country from which they were posted, but this limits the number of tweets that can be selected because geolocation-tagged tweets comprise less than 3% of all tweets. The geographic information is usually attached when a tweet is posted using a GPS-capable device and the user allows the geographic information to be attached to the tweet.

Using DataSift⁶, a paid platform that interfaces with Twitter’s Application Programming Interface (API), tweets were extracted. Only tweets posted in the English language were collected from these specific countries for a period of 1 week, between 5th of August 2012 and 12th of August 2012. Datasift automatically recognizes the language in which tweets were written⁷, making it easier to select only English tweets. English was chosen as a single language because it helped avoid the limitations of pronoun-drop and because it is the single most popular language on Twitter⁸ for populations from all over the world. Unlike Twitter’s official API⁹, Datasift provides better tools for selecting tweets from a particular location and is allowed a higher data rate limit by Twitter.

All forms of textual Twitter content were considered for this research. Therefore tweets, retweets, hashtags, mentions and links were all part of the corpus, but photographs and videos were not. For the purpose of this research the corpus was named the Twitter Thesis Corpus.

The Twitter Thesis Corpus consists of a selection of tweets collected over a period of one week starting 31-07-2012 until 06-08-2012 from the 10 countries mentioned earlier. In order to limit the number of tweets collected and stay within Twitter’s imposed data limits, the tweets collected were filtered on the occurrence of the words “I” or “we”. Special software was created to interface with Datasift’s API to collect, POS-tag and analyze the tweets.

2.3 Twitter data collection with Datasift

Using Datasift’s API, a software application written in the Java programming language was specially created to collect geolocation-tagged English tweets.

Datasift provides a host of tools that make it easier to select tweets based on the location from which they were posted. Tweets can be selected based on the country code, coordinates or timezone. Datasift’s maps polygon selection tool makes it possible to visualize the selection of tweets from within a certain location by mapping a polygon on a map (see Figure 1), which is a collection of coordinates that enclose the region selected.

6 www.datasift.com

7 http://dev.datasift.com/docs/targets/language-detection-how-it-works

8 http://semiocast.com/publications/2011_11_24_Arabic_highest_growth_on_Twitter

9 Twitter limits the rate and data that an average developer can access through the Twitter streaming API. https://dev.Twitter.com/docs/rate-limiting and https://dev.Twitter.com/docs/things-every-developer-should-know

Figure 1: Datasift maps polygon selection tool showing the process of selecting tweets originating from the Netherlands

Tweets were selected from the U.S., Australia, Great Britain, Canada, New Zealand, Ireland, India, Japan, Portugal and Indonesia by filtering the selection to tweets where the CountryCode matches the two-letter ISO country code of any of the named countries. For example, tweets from Canada were filtered on country code “CA” with the expression:

Twitter.place.country_code == "CA"

English tweets were selected by filtering tweets where the Datasift attached language tag indicated that the tweet was written in English, with the expression;

language.tag == "en"

Tweets selected were filtered on the occurrence of the words “I” and “we” using Datasift’s filter operators and programmable regular expressions (regex). For selecting tweets in which the words “I” and “we” are present, the following expression was used;

Twitter.text contains_any "I, we" or Twitter.text regex_partial \"(?i)[iw]\\'(?i)([dm]|ll)\" or Twitter.text regex_partial \"(?i)we\\'(?i)([dm]|ll)\"

The entire expression for selecting English language tweets from the above mentioned countries is;

((Twitter.text contains_any \"I, i, we, WE, We, wE, im, IM, ill, ILL\") or (Twitter.text regex_partial \"(?i)[iw]\\'(?i)([dm]|ll)\") or (Twitter.text regex_partial \"(?i)we\\'(?i)([dm]|ll)\")) and (Twitter.place.country_code in \"US, AU, GB, CA, NZ, IE, IN, JP, PT, ID\") and (language.tag == \"en\")

This expression also selects tweets containing contracted forms such as I’ll, we’ll or we’d, irrespective of the letter case in which they are written. The tweets were then saved in text files for further analysis for this thesis.
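The selection logic of the filter can be approximated locally for inspection. The following is a minimal Python sketch, not the Datasift filter itself: the function name `matches_filter`, the simple word tokenization and the combined contraction pattern are assumptions made for illustration, and Datasift's own matching rules may differ in detail.

```python
import re

# Assumption: a simple word split stands in for Datasift's contains_any
# matching; this is an illustration, not Datasift's implementation.
WORDS = {"i", "we", "im", "ill"}
# Contracted forms such as I'd, I'm, I'll, we'd, we'll (any letter case).
CONTRACTION = re.compile(r"\b(?:i|we)'(?:d|m|ll)\b", re.IGNORECASE)
COUNTRIES = {"US", "AU", "GB", "CA", "NZ", "IE", "IN", "JP", "PT", "ID"}


def matches_filter(text, country_code, language):
    """Return True if a tweet would pass the approximate filter."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    has_pronoun = any(t in WORDS for t in tokens) or bool(CONTRACTION.search(text))
    return has_pronoun and country_code in COUNTRIES and language == "en"
```

For example, `matches_filter("I'll be back", "CA", "en")` passes all three conditions, while the same text with a country code outside the ten selected countries is rejected.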

Practical concerns on collected data

The Twitter Thesis Corpus has only English tweets. English is the predominant language on Twitter and is either the native or official language of almost all countries investigated in this research, except Japan, Portugal and Indonesia. It is therefore possible that a significant number of the English language tweets collected in the non-English-speaking countries of Japan, Portugal and Indonesia were posted by non-indigenous residents of the country, such as expatriates or tourists, which poses a potential bias in the data collected. Another point of concern is the possibility that indigenous residents who post tweets in English cannot express themselves as fully as they would in their native languages, which could lead us to wrong conclusions.


Over two million tweets (2,278,927 tweets to be exact) were collected through Datasift. The majority of tweets, 1,569,169 or about 69%, came from the US, while the fewest tweets, 887 or about 0.04% of all tweets collected during this period, were from India (see Table 2: number of tweets collected per country). We’ll call this collection of tweets the Twitter Thesis Corpus in the rest of this document.

Country           No. of tweets per country
United States     1569169
Australia         1383
United Kingdom    278157
Canada            49816
New Zealand       940
Ireland           19504
India             887
Japan             87627
Portugal          7415
Indonesia         169829

Table 2: number of tweets collected per country

The Corpus was processed before it could be used for our research. During pre-processing, the majority of Twitter-specific tokens and non-words were filtered out of the Corpus by tokenizer software that we had written for this purpose. Token varieties seen mainly in social media, such as URLs and email addresses, emoticons, Twitter #topic hashtags and Twitter @user mentions, were therefore filtered out to reduce the number of unique words that would otherwise adversely affect the performance of our classifier.
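The pre-processing step can be illustrated with a minimal Python sketch. This is not the tokenizer software used in this research (which is not reproduced here); the patterns and the function name `strip_twitter_tokens` are assumptions chosen to show the idea of removing the token varieties listed above.

```python
import re

# Assumed, simplified patterns for the Twitter-specific token varieties.
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
EMOTICON = re.compile(r"(?<!\w)[:;=][-']?[)(DPp]+(?!\w)")


def strip_twitter_tokens(text):
    """Remove URLs, e-mail addresses, mentions, hashtags and simple emoticons."""
    # E-mail addresses are removed before mentions so that the part after
    # the "@" is not mistaken for a user mention.
    for pattern in (URL, EMAIL, MENTION, HASHTAG, EMOTICON):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the whitespace left behind
```

Applied to a check-in such as "I'm at PK's House http://t.co/g88IJxyz", the sketch keeps only the words "I'm at PK's House".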

2.4 Tweet classification

Although researchers have very often stated that language in which the word “I” is used is individualistic, and language in which the word “we” is used is collectivistic, we cannot automatically assume that the same applies to tweets. It is our task to show that tweets in which the words “I” and “we” are used are indeed individualistic and collectivistic respectively. Carrying out this task helps us answer our first research sub-question on whether the use of “I” and “we” is related to individualism.

Classifier setup

For this purpose, we constructed a multinomial Naive Bayes classifier to help classify the tweets into 2 categories, namely IDV for individualistic tweets and CDV for collectivistic tweets. Our classifier works by using a training set, which is a set of tweets or documents that are already associated with a category. Using this training set, the classifier determines, for each word in the training set, the probability that it makes a document belong to each of the considered categories. The probability that a document belongs to a category is computed by multiplying the individual probabilities of each of its words in this category. The category that is computed to have the highest probability is the one that the document is most likely to belong to. Multinomial naive Bayes considers a document as a bag-of-words. For each class $C$, $P(w|C)$, the probability of observing word $w$ given this class, is estimated from the training data, simply by computing the relative frequency of each word in the collection of training documents of that class (Bifet et al, 2010). The probability that a document $d$ belongs to class $C$ is

$$P(C|d) = \frac{P(C) \prod_{i=1}^{n} P(w_i|C)}{P(d)}$$

where $w_i$ occurs in document $d$ with $n$ words and $P(w_i|C)$ is the probability of observing word $w_i$ given category $C$. $P(C)$ is the prior probability that is required by our classifier and $P(d)$ is the normalization factor.

Therefore, for our 2 categories IDV and CDV, the document is classified as belonging to IDV if $P(C_{idv}|d) > P(C_{cdv}|d)$. The document will likewise be categorized as belonging to CDV if $P(C_{cdv}|d) > P(C_{idv}|d)$.
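The decision rule can be made concrete with a small, self-contained Python sketch of a multinomial Naive Bayes classifier. It is an illustration only and not the Mahout-based classifier used in this research. Two choices are assumptions that go beyond the description above: add-one (Laplace) smoothing, so that unseen words do not produce a zero probability, and summing log-probabilities instead of multiplying probabilities, to avoid numerical underflow. The normalization factor $P(d)$ is omitted because it is the same for both classes.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Toy multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # assumption: Laplace smoothing constant

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_posteriors(self, doc):
        """Return log P(C) + sum_i log P(w_i|C) for every class C."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c, n_c in self.class_counts.items():
            total_words = sum(self.word_counts[c].values())
            denom = total_words + self.alpha * len(self.vocab)
            score = math.log(n_c / total_docs)  # log prior
            for w in doc.lower().split():
                score += math.log((self.word_counts[c][w] + self.alpha) / denom)
            scores[c] = score
        return scores

    def predict(self, doc):
        scores = self.log_posteriors(doc)
        return max(scores, key=scores.get)


# Hypothetical toy training data, not taken from the Twitter Thesis Corpus.
model = MultinomialNB().fit(
    ["i am happy", "i like my day", "we are together", "we like our team"],
    ["idv", "idv", "cdv", "cdv"],
)
```

With this toy model, `model.predict("i am tired")` selects the class whose posterior score is larger, exactly as in the comparison of $P(C_{idv}|d)$ and $P(C_{cdv}|d)$ above.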

For our classifier we created a training set from the Twitter Thesis Corpus in which we categorized tweets in which the pronoun I appears next to a verb (e.g. I am.., I like.., Do I…?) or next to an adverb (e.g. I just tried.., I always.., I carefully…) as belonging to the category IDV. Likewise, tweets in which the pronoun We appears next to a verb or pro-verb (e.g. We do, We like, Do we..?) or next to an adverb (e.g. We just tried.., We always travelled.., We carefully put…) were categorized as belonging to the CDV category. Documents in which I or We do not appear next to a verb or adverb were ignored. These are tweets which might have been mistakenly collected because they contain the pronoun I or We. A quick scan of these tweets also reveals that they were not in English.

From this training set, the classifier created a model that was used to classify the training set.

Tweet classification process

We constructed a Naïve Bayes classifier based on the Apache Mahout¹⁰ machine learning library running on a single Hadoop¹¹ node. We then randomly selected a sample of 1072 tweets from our corpus, which we refer to as documents in the rest of this section. We created our training set by manually coding the sample documents, adding the category at the beginning of each tweet.

cdv 1366589832497 We're survivin'! Actually it dropped to 87 degrees today .
idv 1366589832607 Haha . If Twitter is dead and boring ... Im , sorry ... Ill be back
idv 1366589832142 I've had a packed day , but I still feel something is missing .
idv 1366589833850 Finally someone agrees with me ! Good God I can't take it
cdv 1366589833319 We tearin dis bitch down !

We then ran a Mahout job to convert the documents into the sequence file format, which is the default file format that Mahout understands, and used Mahout to transform the sequence file into a vector matrix of Tf–idf weights. Tf–idf¹² stands for term frequency–inverse document frequency, which is a numerical statistic that reflects how important a word is to a document in a collection or corpus.
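To illustrate what a Tf–idf weight expresses, the following minimal Python sketch computes the textbook variant, tf × log(N / df), for a list of documents. This is an assumption made for illustration: Mahout's own weighting formula may differ in its details, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dictionary per document.

    Textbook variant (an assumption, not Mahout's exact formula):
    weight = term frequency in the document * log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights
```

A term that occurs in every document, such as "like" in the two documents "we like we" and "i like it", receives a weight of zero, whereas a term that is frequent in one document and absent from the other receives a high weight.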

In order to test the accuracy of our classifier, we randomly split the set into a training set of 920 documents which we used to train the classifier and a test set of 152 documents that we used to test the accuracy of the classifier.

10 http://mahout.apache.org/

11 http://hadoop.apache.org/

Classifier performance

We inspected the dictionary that was created from the original set of 1072 documents to extract the top 10 words per class by term-frequency (tf) count.

Top 10 words for class cdv    Term frequency, tf for cdv    Top 10 words for class idv    Term frequency, tf for idv
we                            152.04781651496887            i                             1665.026211380895
can                           31.088478565216064            my                            444.6563186645508
all                           28.318191528320312            am                            685.3958888053893
have                          20.864829540252686            me                            283.476256608963
so                            18.10286283493042             just                          269.7886710166931
our                           16.738158702850342            have                          257.33289766311646
house                         16.47712469100952             get                           240.53799867630005
life                          16.014671802520752            so                            238.3365969657898
night                         15.80769395828247             like                          235.04448819160461
see                           14.933171272277832            will                          216.15789876543468

Table 3: Top ten words from collectivistic (cdv) tweets and individualistic (idv) tweets

The sets of top 10 words per class are mutually exclusive except for the terms “have” and “so”, which appear in both sets. This creates a fairly unique set of attributes that the model uses to predict the individual classes, thereby improving overall accuracy. The first-person plural pronoun “we” weighs heavily in the model when predicting the class cdv, while the first-person singular pronouns “I” and “my” have significant influence in the model when predicting the class idv. Three of the top 5 words for the class idv (“i”, “my” and “me”) are forms of the first-person singular pronoun “I”.

The classifier achieved 97% accuracy (see Table 4) when run on the training set.

              Predicted cdv    Predicted idv
Actual cdv    47               0
Actual idv    26               847

Table 4: Confusion matrix of Classifier performance on training data

Summary of classifier performance on Training data

Correctly Classified Instances: 894 (97.1739%)
Incorrectly Classified Instances: 26 (2.8261%)
Total Classified Instances: 920

On test data, the classifier achieved an accuracy of 85.5% (see Table 5).

              Predicted cdv    Predicted idv
Actual cdv    17               5
Actual idv    17               113

Table 5: Confusion matrix of Classifier performance on test data

Summary of classifier performance on Test data

Correctly Classified Instances: 130 (85.5263%)
Incorrectly Classified Instances: 22 (14.4737%)
Total Classified Instances: 152

Classification evaluation

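The accuracy $A$, which is the overall measure of the quality of the classification, is the total number of correctly classified instances over all classifications, or

$$A = \frac{\sum_{i} c_{i,i}}{\sum_{i,j} c_{i,j}} = \frac{17+113}{17+5+17+113} = 0.855263 = 85.5\%$$

where $c_{i,j}$ is the number of instances of actual class $i$ that were predicted as class $j$.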

For each class $\kappa$, the precision $\mathrm{Prec}$ is the fraction of cases classified as $\kappa$ that truly are $\kappa$, or

$$\mathrm{Prec}(j) = \frac{c_{j,j}}{\sum_{i} c_{i,j}}$$

Therefore precision for class CDV,

$$\mathrm{Prec}(cdv) = \frac{17}{17+17} = 0.50 = 50\%$$

And precision for class IDV,

$$\mathrm{Prec}(idv) = \frac{113}{5+113} = 0.96 = 96\%$$

The classifier predicts CDV documents poorly, with a precision of 50%, which is quite low, whereas the precision of 96% for IDV documents is quite high. Our classifier is therefore predicting IDV well enough.

The overall estimate of the precision of the classifier on the test data, which is the unweighted average of the precision for the CDV and IDV classes, is

$$\frac{50+96}{2} = 73\%$$

which indicates that the classifier predicts the two classes reasonably well on average, although the precision for CDV documents remains low.

Likewise, for class $\kappa$, the sensitivity or recall $\mathrm{Rec}$ is the fraction of cases that are truly $\kappa$ that were eventually classified as $\kappa$, or

$$\mathrm{Rec}(i) = \frac{c_{i,i}}{\sum_{j} c_{i,j}}$$

Therefore recall for class CDV,

$$\mathrm{Rec}(cdv) = \frac{17}{17+5} = 0.77 = 77\%$$

And recall for class IDV,

$$\mathrm{Rec}(idv) = \frac{113}{17+113} = 0.87 = 87\%$$

The overall estimate of the recall of the classifier on the test data, which is the average of the recall for the CDV and IDV classes, is

$$\frac{77+87}{2} = 82\%$$

which indicates that the classifier is highly sensitive to both classes, but more sensitive to IDV documents. The overall accuracy of 85.5%, precision of 73% and recall of 82% seem to suggest that our classifier is performing well and that the use of “I” and “we” in tweets is related to individualism.
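The evaluation measures can be recomputed directly from the confusion matrix of the test data. The following Python sketch is added for illustration; the function name `classification_metrics` is hypothetical, rows are taken to be actual classes and columns predicted classes, and the sketch assumes that every class was predicted at least once so that no division by zero occurs.

```python
def classification_metrics(confusion, labels):
    """Compute accuracy, per-class precision and per-class recall.

    confusion[i][j] = number of instances of actual class i predicted as j.
    """
    n = len(labels)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    precision, recall = {}, {}
    for k, label in enumerate(labels):
        predicted_as_k = sum(confusion[i][k] for i in range(n))  # column sum
        actually_k = sum(confusion[k])                           # row sum
        precision[label] = confusion[k][k] / predicted_as_k
        recall[label] = confusion[k][k] / actually_k
    return accuracy, precision, recall


# Confusion matrix of the test data: rows and columns ordered cdv, idv.
TEST_CONFUSION = [[17, 5], [17, 113]]
accuracy, precision, recall = classification_metrics(TEST_CONFUSION, ["cdv", "idv"])
```

Rounded to whole percentages, the resulting values correspond to the accuracy, precision and recall figures derived above.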

2.5 Identifying IDV-tokens in tweets

Based on earlier work by Gimpel et al. (2011), part-of-speech (POS) tagging, a basic form of syntactic language analysis used in natural language processing (NLP), will be performed on the Twitter Thesis Corpus to recognize individualistic or collectivistic tweets per country. The proportion of individualistic tweets discovered will then be correlated with scores on the IDV-index to investigate the relationship between individualistic and collectivistic tweets from a country and the IDV-index. As mentioned earlier, tweets in which the word “I” (first-person singular nominative pronoun) is preceded or followed by a verb or adverb are considered individualistic, while tweets in which the word “we” (first-person plural nominative pronoun) is preceded or followed by a verb or adverb are considered collectivistic. For this research we labeled the token combinations {first-person singular nominative pronoun + verb¹³, e.g. I am, I like, Do I} or {first-person singular nominative pronoun + adverb, e.g. I just like, I carefully placed} as IDV-tokens and {first-person plural nominative pronoun + verb or pro-verb, e.g. we are, we care, Are we} as CDV-tokens. Using the POS-tagger from the ARK research group at Carnegie Mellon University’s ARK-tweet-nlp project (Gimpel et al., 2011), we identified and tallied every occurrence of IDV- and CDV-tokens in tweets. The ARK-tweet POS-tagger uses the Tweet-nlp tagset, which was designed as a coarse tagset that captures standard parts of speech (noun, verb, etc.) as well as categories for token varieties seen mainly in social media, such as URLs and email addresses, emoticons, Twitter #topic hashtags and Twitter @user mentions, making it suitable for this research. Tools from the ARK-tweet-nlp project have been made available to the research community at http://www.ark.cs.cmu.edu/TweetNLP.

The ARK-tweet-nlp POS-tagger in its original state can identify occurrences of “I”, “we” and verbs but cannot automatically identify IDV-tokens or CDV-tokens. We modified the software to automatically discover IDV-tokens and CDV-tokens and identify tweets in which IDV-tokens and CDV-tokens appear. We’ll refer to the modified software as the IDV- and CDV-token discovery and analysis software from now on.
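The token discovery step can be sketched in Python as follows. This is not the modified ARK-tweet-nlp software itself, which was written in Java and is not reproduced here; it is a hedged illustration that works on tweets that have already been tagged. The input format (a list of (token, tag) pairs) and the function names are assumptions. The tags follow the Tweet-nlp tagset: "O" for pronouns, "V" for verbs, "R" for adverbs and "L" for nominal + verbal contractions such as "I'm".

```python
def pronoun_of(token, tag):
    """Return ("i" or "we", contains_verb) or (None, False) for other tokens."""
    word = token.lower().replace("\u2019", "'")  # normalise curly apostrophes
    if tag == "O" and word in ("i", "we"):
        return word, False
    if tag == "L":  # contraction such as I'm or we'll already contains a verb
        stem = word.split("'")[0]
        if stem in ("i", "we"):
            return stem, True
    return None, False


def classify_tagged_tweet(tagged):
    """Return the set of labels ("idv", "cdv") found in one tagged tweet."""
    found = set()
    for idx, (token, tag) in enumerate(tagged):
        pronoun, has_verb = pronoun_of(token, tag)
        if pronoun is None:
            continue
        # Tokens directly before and after the pronoun.
        neighbours = tagged[max(idx - 1, 0):idx] + tagged[idx + 1:idx + 2]
        if has_verb or any(t in ("V", "R") for _, t in neighbours):
            found.add("idv" if pronoun == "i" else "cdv")
    return found
```

For instance, the tagged tweet `[("Do", "V"), ("we", "O"), ("?", ",")]` yields `{"cdv"}` because the pronoun "we" is directly preceded by a verb.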

The software carried out the analysis in 3 phases;

First, a training set of 1000 randomly selected tweets from the Twitter thesis Corpus was used to test the accuracy of the POS-tagger and the software.

The entire Twitter Thesis Corpus was then loaded into the ARK-tweet-nlp POS-tagger, and the software tagged tweets in which IDV-tokens were present as individualistic and tweets in which CDV-tokens occurred as collectivistic.

Finally, each occurrence of an individualistic or collectivistic tweet per country was tallied and analysed to find its intrinsic characteristics, such as the frequency of I’s or We’s. The software also tallied the occurrence of other Twitter-specific token varieties such as #topic hashtags, Twitter @user mentions and check-ins. We also calculated the percentage of individualistic tweets, or Tweet-IDV score, in the entire corpus and the Tweet-IDV per country. The results of the analysis are summarized and presented in the next section.

3 RESULTS

After running the Twitter Thesis Corpus through the POS-tagger and the IDV- and CDV-token discovery and analysis software, we found that the Twitter Thesis Corpus contains over two million tweets (2,278,927 tweets to be exact). The majority of tweets, 1,569,169 or about 69% of all tweets, were from the US, while the fewest tweets, 887 or about 0.04% of all tweets collected during this period, were from India. Tweets in which the IDV-tokens occurred, the individualistic tweets, constituted 93% of all tweets, giving us an overall Tweet-IDV score of 93 for the set of countries investigated. In total, there were 2,813,425 “I” tokens in 2,028,471 individualistic tweets and 194,810 “we” tokens in 156,256 collectivistic tweets.

13 The lexeme of the verb was considered (which is the set of inflected forms taken by the verb, e.g. walk, walked, walking; all relating to the same lexeme; walk).

Country           No. of tweets per country    IDV score (Hofstede)    Tweet-IDV score
United States     1569169                      91                      93
Australia         1383                         90                      91
United Kingdom    278157                       89                      90
Canada            49816                        80                      91
New Zealand       940                          79                      91
Ireland           19504                        70                      90
India             887                          48                      86
Japan             87627                        46                      99
Portugal          7415                         27                      93
Indonesia         169829                       14                      96

Table 6: Hofstede's IDV scores and Tweet-IDV scores
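The overall Tweet-IDV score can be reproduced from the tweet counts reported in this section. The Python sketch below assumes that Tweet-IDV is the percentage of individualistic tweets among all individualistic and collectivistic tweets; the function name `tweet_idv` is hypothetical, and the assumption is supported by the fact that it yields the reported score of 93.

```python
def tweet_idv(ind_tweets, col_tweets):
    """Percentage of individualistic tweets among all classified tweets."""
    return 100.0 * ind_tweets / (ind_tweets + col_tweets)


# Counts reported for the whole Twitter Thesis Corpus.
overall = tweet_idv(2_028_471, 156_256)
overall_rounded = round(overall)
```

The same function can be applied per country once the numbers of individualistic and collectivistic tweets for that country have been tallied.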

All countries investigated except India have a Tweet-IDV score of 90 or above, with Japan having the highest Tweet-IDV score of 99 while India has the lowest score of 86. The Tweet-IDV score of every country investigated differs from its corresponding IDV score on Hofstede’s IDV-index, in most cases substantially. The analysis in the next section should shed more light on these results.

4 DATA ANALYSIS

The ARK-tweet POS-tagger was not altered before use in this research; therefore we assumed the overall accuracy of 89.37% that was achieved by Gimpel et al. (2011) at Carnegie Mellon on a training set of 1,000 examples (14,542 tokens). Although the unmodified ARK-tweet-nlp POS-tagger has an accuracy of 91% for tagging personal pronouns (I, we, you, me, it), it identified “I” and “we” tokens with 100% accuracy when tested on a training set of 1000 randomly selected tweets from the Twitter Thesis Corpus. We also assume the 97% accuracy that the unmodified ARK-tweet-nlp POS-tagger has for identifying verbs, as demonstrated by Gimpel et al. (2011). Our software achieved 100% accuracy in identifying IDV-tokens in tweets in a training set of 1000 randomly selected tweets from the Twitter Thesis Corpus, as shown in Table 7. All 143 IDV-token occurrences were correctly discovered.

Token/Token combination    Accuracy in %    Examples
I                          100              I, i
We                         100              we
Verb                       97               Go, want, come
I + verb                   100              I am.., I want.., Imma.., I’ve, Do I..? have I..?
We + verb                  100              We want.., we do.. Did we..? Should we..?
Check-in                   100              I’m at … http://t.col/DGXaBc

Table 7: Accuracy per token or token combination on a training set of 1000 tweets.

Preliminary analysis found that some Asian countries have very high Tweet-IDV scores, with Japan having the highest Tweet-IDV score of 99, followed by Indonesia with a Tweet-IDV score of 96, while India has the lowest Tweet-IDV score of 86 (see Table 6).

          indTweets colTweets 100 indTweets

(22)

That Japan and Indonesia have the highest Tweet-IDV scores is peculiar, given the fact that among the countries investigated in this research, Japan has a relatively low score of 46 on the IDV-index while Indonesia has the lowest score on the IDV-index, 14.

A correlation coefficient r of -0.4021 from regression analysis of the IDV score with the Tweet-IDV scores suggests that there is a negative fit between Tweet-IDV scores and IDV scores for the countries from which the tweets were collected. The negative fit suggests that Tweet-IDV scores tend to increase as Hofstede’s IDV scores decrease, as shown in Figure 2. Closer analysis of the tweets reveals that a significant number of “I” tokens are generated by users letting their followers know where they are by checking into a location with a social location service such as Four Square or Uber Social, the so-called check-ins.

4.1 Identifying Check-ins

We identified 3 categories of check-ins namely;

1. Check-ins that made up the user’s entire post and were generated by the location service, such as Four Square, that was used to post the tweet. These types of check-ins are usually characterized by tweets in which the token combination {I’m at} appears in the same tweet with a URL, for example:

• I'm at PK's House http://t.co/g88IJxyz or

• I'm at Tropix Tanning (Clinton Township, MI) http://t.co/DQ5jvaBc.

2. Tweets in which the user checked into a location that was registered as a place in Four Square. In these tweets, the token combination {I just ousted} or {I just became} and {@foursquare! http://t.co/} are present. Examples of these tweets are:

• I just ousted PoconG as the mayor of GOR Dimyati Kota Tangerang on @foursquare! http://t.co/GEtmDJGQ.

• I just ousted @funfann as the mayor of Shokudo PSBJ on @foursquare! http://t.co/C4Bcbh8B

• I just became the mayor of aunt home on @foursquare! http://t.co/WmYRPtU1

3. Tweets in which the check-in was part of the user’s voluntarily formulated post. In this category of tweets, the user’s location was included together with the user’s freely formed textual post. Examples of these tweets are:

• @All_About_Jesus: We know that [Jesus] really is the Savior of the world - John 4:42 http://t.co/GpbvaFsp

• Deketin lg lah, klo emang beda RT @ibnkahfi: I was stupid,ugly is'nt like your exlover to another http://t.co/bTbnHHJp

In categories 1 and 2, the language and choice of words in check-ins are decided by the functionality of Twitter or the location service, while the user has control of the choice of words in category 3. In this research, we therefore consider as check-ins only tweets that fall in categories 1 and 2, which only occur in individualistic tweets. We therefore decided to analyze Tweet-IDV when check-ins are ignored.

We calibrated our software, by using regular expressions in Java, to capture and tally check-ins with more than 98% accuracy on a training set of 1000 tweets. We found that check-ins constitute about 24.3% of all individualistic tweets and occur in varying intensity per country. No check-ins were found in collectivistic tweets. The percentage of check-ins among tweets from Japan was the highest, constituting about 91% of all individualistic tweets from Japan, while 73% of all individualistic tweets from Indonesia are check-ins, as shown in Table 8. Check-ins are less popular in Canada, the U.S. and the U.K., where respectively only 16%, 11% and 6% of all individualistic tweets are check-ins. There were, however, no check-ins registered in Australia, New Zealand or India in this period.
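The detection of category 1 and category 2 check-ins can be illustrated with a short Python sketch. The regular expressions actually used in our Java software are not reproduced here; the patterns below are assumptions derived from the token combinations described for the two categories, and the function name `checkin_category` is hypothetical.

```python
import re

URL = r"https?://\S+"
# Category 1: the whole post is "I'm at <place> <URL>".
CATEGORY_1 = re.compile(r"^I['\u2019]m at .+" + URL, re.IGNORECASE)
# Category 2: "I just ousted ..." or "I just became ..." on @foursquare.
CATEGORY_2 = re.compile(
    r"^I just (?:ousted|became) .+@foursquare! " + URL, re.IGNORECASE
)


def checkin_category(text):
    """Return 1 or 2 for a generated check-in, or None for other tweets."""
    text = text.strip()
    if CATEGORY_2.search(text):
        return 2
    if CATEGORY_1.search(text):
        return 1
    return None
```

Tweets for which the function returns 1 or 2 would be excluded before the Tweet-IDV score is recalculated; category 3 tweets are deliberately not matched.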

Table 8 shows Tweet-IDV after tweets with check-ins were excluded and the percentage of check-ins among individualistic tweets for different countries.

Country           Check-in % of individualistic tweets    Hofstede’s IDV score    Tweet-IDV
United States     11                                      91                      92
Australia         0                                       90                      91
United Kingdom    6                                       89                      90
Canada            16                                      80                      89
New Zealand       0                                       79                      91
Ireland           4.4                                     70                      90
India             0                                       48                      86
Japan             91                                      46                      93
Portugal          42                                      27                      88
Indonesia         73                                      14                      87

Table 8: Tweet-IDV scores after excluding check-ins

Excluding check-ins from all collectivistic and individualistic tweets posted from a country resulted in lower overall Tweet-IDV scores.

It is interesting to note that after excluding check-ins, the Tweet-IDV scores of the United States, Australia and the United Kingdom each exceed their corresponding scores on Hofstede’s IDV-index by exactly 1 point, while the Tweet-IDV scores of the other countries deviate significantly from their corresponding Hofstede IDV scores. All Tweet-IDV scores lie close to 90. Notably, 91% of all English tweets from Japan with the “I” token are check-ins. A correlation coefficient r of 0.560412 from regression analysis of Hofstede’s IDV score with the Tweet-IDV scores suggests that there is a moderate positive fit between the Tweet-IDV scores and Hofstede’s IDV scores for the countries from which the tweets were collected. Figure 3 shows a positive slope on a scatter plot, indicating a positive correlation between Tweet-IDV and Hofstede’s IDV and implying that excluding check-ins has a significant impact on Tweet-IDV.

Figure 2: Hofstede's IDV vs Tweet-IDV, r = -0.4021

Figure 3: Hofstede's IDV vs Tweet-IDV after excluding check-ins, r = 0.560412
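Both correlation coefficients can be recomputed from the country scores reported in Table 6 and Table 8. The following Python sketch implements Pearson's correlation coefficient directly; the function and variable names are chosen for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient of two sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Country order: US, AU, GB, CA, NZ, IE, IN, JP, PT, ID.
HOFSTEDE_IDV = [91, 90, 89, 80, 79, 70, 48, 46, 27, 14]
TWEET_IDV_ALL = [93, 91, 90, 91, 91, 90, 86, 99, 93, 96]          # Table 6
TWEET_IDV_NO_CHECKINS = [92, 91, 90, 89, 91, 90, 86, 93, 88, 87]  # Table 8

r_all = pearson_r(HOFSTEDE_IDV, TWEET_IDV_ALL)
r_no_checkins = pearson_r(HOFSTEDE_IDV, TWEET_IDV_NO_CHECKINS)
```

The sign of the coefficient changes from negative to positive once check-ins are excluded, which is the effect visible when comparing Figure 2 with Figure 3.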

The scatter plot exposes 2 outliers, namely Japan and India, which have Tweet-IDV scores of 93 and 86 respectively, compared to their corresponding IDV scores of 46 and 48. We can only speculate about the cause of this anomaly, since there is no empirical evidence for any explanation that we can put forward at this time. For Japan it is not surprising, because Japan has the highest percentage of check-ins among all the countries investigated.

This does not necessarily mean that there is always a relationship between Tweet-IDV scores per country and Hofstede’s IDV scores. An interval estimate of the correlation between the Tweet-IDV-index and the IDV-index shows that, at a 95% confidence level, the correlation coefficient r should lie between -0.106 and 0.879, while at a 99% confidence level r should lie between -0.327 and 0.922. The observed r of 0.560412 falls within both intervals, although both intervals also include zero.
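The thesis does not state how these intervals were obtained. As a hedged illustration, the standard Fisher z-transformation for n = 10 countries gives bounds that agree with the reported ones to within about 0.001, so it is a plausible (but assumed) method. The function name fisher_ci is hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_ci(r, n, confidence):
    """Confidence interval for a Pearson r via the Fisher z-transformation.

    Assumption: this is a textbook method, not necessarily the one the
    thesis used.
    """
    z = atanh(r)                # map r onto an approximately normal scale
    se = 1.0 / sqrt(n - 3)      # standard error of z for n paired observations
    # Two-sided critical value of the standard normal distribution.
    crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Transform the interval end points back to the correlation scale.
    return tanh(z - crit * se), tanh(z + crit * se)


r_observed, n_countries = 0.560412, 10
ci95 = fisher_ci(r_observed, n_countries, 0.95)
ci99 = fisher_ci(r_observed, n_countries, 0.99)
print("95%% CI: %.3f to %.3f" % ci95)
print("99%% CI: %.3f to %.3f" % ci99)
```

With only 10 countries the standard error on the z scale is large, which is why both intervals are wide.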

4.2 Tweet-IDV among countries with similar IDV scores

By grouping countries with similar Hofstede IDV scores, we try to understand how strongly countries in the same category resemble each other. We achieve this by ranking the countries on Hofstede’s IDV scores and taking the top 4 countries (U.S., Australia, United Kingdom and Canada) as the High-IDV countries, the next 3 countries (New Zealand, Ireland and India) as the Medium-IDV countries and the last 3 countries (Japan, Portugal and Indonesia) as the Low-IDV countries, and then measuring the intraclass correlation coefficient of the countries in each category. Intraclass correlation is normally used to quantify the extent to which members of the same group or class tend to act alike.

We found that the 4 High-IDV countries have an intraclass correlation coefficient of 0.8664, while the Medium-IDV countries have an intraclass correlation coefficient of 0.995402 and the Low-IDV countries have an intraclass correlation coefficient of 0.966496, as shown in Table 9.

IDV category   Countries                                           Intraclass correlation coefficient, r
High           United States, Australia, United Kingdom, Canada    0.8664
Medium         New Zealand, Ireland, India                         0.995402
Low            Japan, Portugal, Indonesia                          0.966496

Table 9: Intraclass correlation coefficients per IDV category

The intraclass correlation coefficients are all high, indicating that tweets from countries in the same category have a similar distribution of IDV tokens and CDV tokens. However, the categories that we investigated here contain very few countries and are not the same size, so these coefficients should be interpreted with caution; otherwise such high intraclass correlation coefficients would indicate that countries within the same category show similar behavior on Twitter.
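To make the notion of intraclass correlation concrete, the sketch below computes a one-way random-effects coefficient, often written ICC(1,1). The thesis does not say which ICC variant or which input data were used, so both the variant and the token shares below are assumptions made purely for illustration; the numbers are invented and are not expected to reproduce Table 9.

```python
def icc_oneway(ratings):
    """One-way random-effects intraclass correlation, ICC(1,1).

    ratings: list of rows, one row per target (here: a token class),
    each row holding one value per group member (here: a country).
    """
    n = len(ratings)          # number of targets
    k = len(ratings[0])       # number of members rating each target
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-target and within-target mean squares from one-way ANOVA.
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    msw = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical shares (in %) of IDV and CDV tokens for three countries
# in one category. These values are made up for the example.
hypothetical_shares = [
    [90, 91, 89],   # IDV token share per country
    [10, 9, 11],    # CDV token share per country
]
icc = icc_oneway(hypothetical_shares)
print(f"ICC(1,1) = {icc:.4f}")
```

A value close to 1 means that the countries in the group assign almost the same share to each token class, which is the sense in which members of a category "act alike".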

5 DISCUSSION

Although the Tweet-IDV scores of the 4 countries with the highest Hofstede IDV scores (US, Australia, UK and Canada) closely match those scores, the Tweet-IDV scores of the other 6 countries are much higher than their corresponding Hofstede IDV scores. In this research, it is not clear why the Tweet-IDV scores of the top 4 countries closely match their corresponding Hofstede IDV scores, but one explanation could be the language in which the tweets were collected. The Twitter Thesis Corpus has only English tweets, which is

