
Brexit on Reddit

Master Thesis Digital Humanities

Student: Michaela Smolikova (S3831809)
Supervisor: Dr. Marc Esteve del Valle
Second reader: Dr. Sabrina Sauer


Abstract

In the digital age, social networking sites have become an artifact of human culture and a data source for researchers studying the current state of society. The emergence of these platforms has spurred the development of new methods suited to the analysis of vast amounts of data. These methods draw on multidisciplinary expertise and leverage the potential of fields such as computational science, statistics, linguistics, and political and media studies. This thesis aims to demonstrate the potential of the social networking site Reddit for public opinion mining on a selected topic: Brexit.

The methodology was selected in line with previous research. The theme of the withdrawal of the United Kingdom from the European Union is analyzed with the help of three techniques: sentiment analysis, exploratory data analysis, and topic modeling. The goal of the study is to determine whether the attitude towards Brexit on Reddit resembles the attitude captured by traditional polls and to identify the main trends and drivers behind the Brexit discourse on Reddit.

The results reveal multiple valuable insights. The sentiment analysis indicates that the overall attitude towards Brexit varies over time and follows the same trendline as official polls. The exploratory data analysis and topic modeling then complement these outcomes and provide further insights, such as the most popular subreddits and topics in the Brexit discussion on Reddit. Overall, the study demonstrates the potential of Reddit as a valuable data source for text and opinion mining.


Table of contents

1 Introduction
2 Background
2.1 Reddit
2.2 Brexit
3 Theoretical Background
3.1 Public opinion monitoring
3.1.1 Polls as a traditional tool for public opinion monitoring
3.1.2 User-generated content as an alternative source for public opinion mining
3.2 Computational methods in textual analysis
3.2.1 Text mining and machine learning
3.2.2 Sentiment analysis
3.2.3 Topic modeling
4 Literature Review
5 Research question and hypotheses
6 Methodology and tools
6.1 Sentiment analysis with VADER
6.2 Exploratory data analysis
6.3 Topic modeling with Latent Dirichlet allocation and Sci-kit Learn
6.4 Data integration and analysis tools
6.4.1 Keboola Connection
6.4.2 SQL and Python
7 Data collection and preparation
7.1 Data collection
7.2 Basic data preparation and data structure
8 Analysis
8.1 Sentiment analysis with VADER
8.3 Topic modeling
9 Discussion
10 Conclusion
11 Bibliography
12 Appendices
12.1 Appendix A: VADER sentiment analysis (Jupyter Notebook)
12.2 Appendix B: Text preprocessing and EDA (Jupyter Notebook)


1 Introduction

The digital revolution is undoubtedly changing every aspect of human life. The internet and tools such as social networking sites are transforming the way people produce, consume, and store information. Most importantly, digital technology reforms the manner in which people interact with each other. While in the past debates took place in public spaces, nowadays social discourse is moving to the online environment, making it an artifact of social development. Individuals do not hesitate to share personal information with the online audience, and therefore large amounts of digitized data become available to anyone with decent digital literacy. From the academic perspective, digitalization also reshapes the way scholars conduct research. The spread of user-generated content provides researchers with an immense amount of data to analyze.

Social networking sites are one of the main sources of user-generated content, and in the past decade scholars have developed multiple methods for social media analysis. These techniques are usually more or less related to text mining. In the past, text analysis depended on the intense labor of researchers, “But the digital age has opened possibilities for the scholars to enhance their traditional workflows. [...] This shift from reading a single book “on paper” to the possibility of browsing many digital texts is one of the origins and principal pillars of the digital humanities domain, which helps to develop solutions to handle vast amounts of cultural heritage data – text being the main data type.” (Jänicke, Franzini, Cheema, & Scheuermann, 2015). The outcomes of such an analysis can be used in multiple ways. Companies analyze users’ data to improve their sales strategies, state organizations to improve public services, and political parties to monitor public opinion and influence voters’ preferences.

There has been a lot of research on social media and opinion mining (Boulianne & Theocharis, 2018; Gayo-Avello, 2013). While some platforms are covered with high intensity, others seem to be neglected. One of the most popular platforms for opinion mining is Twitter. Twitter is widely adopted by politicians and journalists, and therefore it consistently provides relevant content. By contrast, there are not many studies that focus on other social networks, such as Reddit or YouTube, in relation to text analysis (Roozenbeek & Salvador Palau, 2017; Severyn, Moschitti, Uryupina, Plank, & Filippova, 2016). This thesis therefore aims to fill this gap by exploring the potential of Reddit in the fields of sentiment analysis, opinion mining, and topic modeling.

The goal is to analyze Reddit users’ opinions towards the withdrawal of the United Kingdom from the European Union, the so-called Brexit, and to identify the main drivers behind the discourse on this social network site. In order to provide a consistent overview of this topic, the thesis will begin with a brief description of the general background and theoretical framework. It will then highlight some of the main findings of previous research and elaborate on the methodology and data collection. The main analysis will consist of three parts: sentiment analysis, exploratory data analysis, and topic modeling. Finally, the discussion chapter will interpret the results and link them to existing findings in the field.


2 Background

2.1 Reddit

In 2005, Steve Huffman, Aaron Swartz, and Alexis Ohanian co-founded Reddit with the idea of creating a website that aggregates the most interesting and trending information on the internet. "In the past few years, Reddit – a community-driven platform for submitting, commenting, and rating links and text posts – has grown exponentially, from a small community of users into one of the largest online communities on the Web" (Singer, Flöck, Meinhart, Zeitfogel, & Strohmaier, 2014). According to the latest report, ‘Digital in 2019’, produced by two private marketing and advertising companies, We Are Social and Hootsuite, Reddit has outnumbered Twitter in the number of active users. Based on the report, Reddit has 330 million users globally and ranks as the fifteenth most visited website and the seventh most popular social network. Even more significant is the time people spend on Reddit every day. According to Alexa.com, people spend almost 11 minutes and 40 seconds a day browsing Reddit, compared to six and a half minutes on Twitter and a little below ten on Facebook, which confirms that Reddit is more engaging than other leading social media platforms (“Global Digital Report 2019 - We Are Social,” n.d.-a).

The primary role Reddit had in the past, aggregating interesting news, influenced the structure of the website. Reddit, as a community-based social network, is organized by topic into independent pages called subreddits. Any user can create these pages, which are moderated by individuals or groups of moderators. Each user can then join multiple subreddits based on their preferences. Some subreddits have clearly stated guidelines on what is allowed to be posted. Every user can post so-called submissions, which can include self-generated content or a link to external content such as YouTube videos, Imgur pictures, or an online news article. Once a user posts a submission, other users can upvote or downvote it to influence its position in the subreddit's feed. Voting is the core logic of Reddit as a whole: users decide what content deserves popularity and what does not deserve attention. Besides upvoting and downvoting, users are also allowed to comment on each submission. Unlike other social media, e.g. Facebook or YouTube, Reddit has a more elaborate comment structure. While on Facebook or YouTube people can only reply to a post in one or two layers, Reddit offers a multi-layered format that can facilitate more complex discussion under a submission. To ensure better navigation through the website, users can use multiple filters and search tabs, and they can also send each other direct messages or use chat.

What is also different in comparison with other social network sites is that Reddit does not encourage users to share their identity, personal details, or any other type of intimate information. Most users use nicknames instead of their real names. For this reason, Reddit is considered a semi-anonymous network (Kilgo et al., 2016), where revealing one's individual identity is rather an exception. Therefore, one can argue that the content published by its users can be more controversial than on other platforms (Zhang & Kizilcec, n.d.). The anonymity on Reddit makes research more difficult because researchers are not able to extract basic demographic figures about Reddit's users. Hence, they need to rely on previous studies that have covered this topic. When conducting a study on Reddit where demographics could play a vital role, the anonymity of the users needs to be taken into account. The latest numbers from the portal Statista, which aggregates statistical reports on a variety of topics, show that the largest share of Reddit's traffic comes from the United States, above 38%, followed by the United Kingdom with almost 7.5% (“Reddit.com desktop traffic share 2019 | Statista,” n.d.). The age of the users is even more difficult to estimate. Until now, we only know that almost 60% of American Reddit users are between 18 and 29 years old (“6% of Online Adults are reddit Users | Pew Research Center,” n.d.).

On the other hand, analyzing data from Reddit has benefits as well. One of the biggest advantages of Reddit as a data source is the accessibility of its data. Unlike other social network sites that have suffered from scandals concerning personal data breaches and the unethical exploitation of users' personal information by third parties, Reddit still keeps its data widely open to the public. When it comes to data extraction, there are multiple ways to access Reddit's submissions, comments, and other data. People can scrape Reddit with a Python script, download publicly available monthly data dumps, or use an application programming interface (API) specifically developed for extracting data from Reddit based on several parameters (“GitHub - pushshift/api: Pushshift API,” n.d.).

2.2 Brexit

The 23rd of June 2016 became an important day in European history: the day when the British people chose to leave the European Union (EU). More than 30 million Britons decided not only about the future of their own country but also about the future of the whole EU. For a very long time, the idea of the United Kingdom leaving the EU seemed rather like science fiction, especially when most of the polls favored the Remain campaign over the one that called for Brexit. Even more shocking was the outcome of the referendum, which showed that the British Eurosceptics had won with a slight lead of 51.9% and said their final “yes” to leaving the EU. “In many ways, however, the outcome of the UK’s referendum on EU membership was not surprising” (Hobolt, 2016).

Most of the studies would agree that there are multiple reasons why the UK–EU relationship came to this point, amongst which we can include the historical development, both domestic and international politics, and, last but not least, socio-economic aspects. From the historical point of view, the UK had been an independent and powerful global player for centuries, and the revision of the geopolitical order after the Second World War was difficult to accept for many British people as well as for their authorities. Since the beginning of the 1950s, the UK refused to participate in the European integration process and did not become a part of the European Communities until 1973. Even after entering, the UK retained its unique position within the Communities and successfully negotiated multiple exceptions and opt-outs (Bulmer & Quaglia, 2018). Besides the struggles with the redefinition of the UK’s international position, “the Brexit referendum came about as the culmination of decades of internal division in the British Conservative Party on the issue of European integration” (Hobolt, 2016). The situation became even more serious when the refugee crisis hit Europe in 2015. While the EU urged a common solution to this challenge, the Eurosceptic United Kingdom Independence Party used this topic as one of the pillars of their general election campaign, resulting in a rise in the party’s preferences. To avoid a complete split of the conservative electorate and attract UKIP’s Eurosceptic voters before the general election in 2015, David Cameron offered a new settlement with the EU and a referendum on whether the UK should stay in the EU or not. Although most of the polls expected the Remain campaign to be more successful, the fear of global issues and of the dominance of the EU led the UK to the final decision to leave.


3 Theoretical Background

3.1 Public opinion monitoring

3.1.1 Polls as a traditional tool for public opinion monitoring

The outcome of the Brexit referendum undoubtedly represents a major failure of the pollsters, who were not able to accurately identify the mood in society. The majority of online and phone polls carried out between the announcement of the referendum in September 2015 and the day before the actual vote predicted a slight lead for the Remain campaign. However, a simple plot of the data provided by different polling agencies already gives us some important insights (“EU referendum poll tracker - BBC News,” n.d.).

Figure 1: The official poll results in 2015–2016

Figure 1 shows the poll results provided by multiple agencies and organizations over the period mentioned above. Although the trend line clearly shows that the Remainers had an advantage from the beginning of the campaign, their growth was not as fast as that of the Brexiters.

Note: The chart plots the monthly shares (0–50%) of Remain, Leave, and “Will not vote” responses from January 2015 to December 2016.

The data also shows that undecided voters were slowly choosing one camp or the other. According to the BBC’s political analysts, the clear shift towards Leave happened somewhere in mid-June 2016, when most of the polls favored Leave to win the vote. However, in the end, the pollsters were not able to provide a consistent picture of what was going to happen on the day of the Brexit referendum (“EU referendum poll tracker - BBC News,” n.d.).

YouGov, one of the global leaders in public opinion monitoring and market research, says that there is an important lesson to learn: the online polls were right, but they received too little attention in comparison with the more traditional phone polls. “Throughout the campaign the telephone polls showed Remain comfortably ahead, sometimes by as much as 18 points, narrowing to an average 2.7% Remain lead over the final four weeks. The online polls showed Leave ahead more often than Remain, with an average lead for Leave of 1.8% over the same period. In total over the campaign period, 78% of the telephone polls showed a Remain lead, whilst 63% of the online polls showed Leave ahead, including 57% of YouGov polls” (“The online polls were RIGHT, and other lessons from the referendum | YouGov,” n.d.). According to Freddie Sayers, the Editor-in-Chief at YouGov, the fact that online polls are more accurate makes perfect sense for multiple reasons. On the one hand, market research phone calls or face-to-face interviews encounter challenges such as the low number of people willing to take a survey proposed by a stranger, and they struggle to reach the less educated part of society, which skewed the results towards Remain. On the other hand, online surveys are undeniably more convenient because they are faster and cheaper. Furthermore, because respondents are often offered financial compensation for taking a survey, online polls attract more people from different backgrounds, which ensures more representative groups than phone or face-to-face polls (“UK poll results – What UK Thinks: EU,” n.d.).

3.1.2 User-generated content as an alternative source for public opinion mining

The internet and tools such as social network sites have transformed people’s habits related to information production and consumption, as well as the way people interact in day-to-day life. The increasing importance of digital technology and internet-based services causes a rapid increase in the amount of data produced by every internet user, depending on the type of service they use. Nowadays, there are almost 4.5 billion internet users worldwide, of which nearly 3.5 billion use social media.

Individuals use social networks to connect with others, companies to advertise their products, and political parties to communicate their ideology.

In relation to UGC and public opinion mining, social media play an essential role. “Microblogging services (e.g., Twitter) and social network sites (e.g., Facebook) are believed to have the potential for increasing political participation. While Twitter is an ideal platform for users to spread not only information in general but also political opinions publicly through their networks, political institutions (e.g., politicians, political parties, political foundations, etc.) have also begun to use Facebook pages or groups for the purpose of entering into direct dialogs with citizens and encouraging more political discussions” (Stieglitz & Dang-Xuan, 2013). Unlike other public opinion mining methods, the analysis of social networking sites requires neither intensive human labor nor a persuasive approach to gather people’s opinions. The fact that individuals are willing to share valuable personal data and information about their political preferences online allows researchers to use computational methods, analyze large amounts of data, and infer voters' preferences from users’ digital fingerprints.

The benefits of public opinion monitoring based on online data are not a brand-new topic. Since the invention of the world wide web and the adoption of digital technology by mainstream users, the amount of freely accessible data has grown exponentially and will keep growing in the future. According to the latest infographic, “Data Never Sleeps 7.0”, produced by DOMO, an American software and business intelligence company, internet users generate more than 55 thousand Instagram photos, 500 thousand tweets, and almost 4.5 million Google searches every minute (“Data Never Sleeps 7.0 Infographic | Domo,” n.d.). DOMO also stresses that big data will keep on growing, and therefore the ability to understand this data will become crucial for any private business or public organization. User-generated content (UGC) can then play a significant role in the field of political opinion mining. Traditional polling methods such as phone or face-to-face interviews are time-consuming, labor-intensive, and sometimes inaccurate. A data-driven approach leveraging freely accessible information from the internet seems to be a more convenient technique.

3.2 Computational methods in textual analysis

As mentioned above, potential voters produce large amounts of data that can be easily accessed and analyzed. These insights help political parties, as well as businesses and other organizations, to understand their audience better and eventually market their product effectively, no matter whether it is a new car or an opinion. For scholars, online data provides an eminent opportunity to study a variety of topics from a whole new perspective and to identify patterns and trends in human communication and behavior. There are many methods to study users' online behavior, extract their sentiment about a certain topic, or monitor the leading topics of online discussion. One of the most popular techniques of digital content analysis is text mining, which allows researchers to analyze large corpora in a short period with the help of modern technology.

3.2.1 Text mining and machine learning

Text mining, also often referred to as knowledge discovery from text, is “a truly interdisciplinary method drawing on information retrieval, machine learning, statistics, computational linguistics and especially data mining” (Hotho, Hotho, Nürnberger, & Paaß, 2005). The goal is to analyze unstructured or semi-structured textual data and extract unseen patterns and clues. Text mining utilizes methods similar to structured data analysis and tries to structure the data in a way that makes the extraction of insights possible.

The current approaches focus on classification, clustering, and information extraction methods. While the main task of classification is to label a text based on a pre-defined category (e.g., sentiment classification), clustering tries to identify patterns in texts based on their attributes without prior annotation (e.g., topic detection). Information extraction then “naturally decomposes into a series of processing steps, typically including tokenization, sentence segmentation, part-of-speech assignment, and the identification of named entities, i.e., person names, location names and names of organizations” (Hotho et al., 2005).

All of the methods mentioned above are more or less related to machine learning, which has become a very popular method in the humanities thanks to the accessibility of digitized content, including textual, visual, and audio records. “Machine learning, by its definition, is a field of computer science that evolved from studying pattern recognition and computational learning theory in artificial intelligence. It is the learning and building of algorithms that can learn from and make predictions on data sets” (Selvam, Simon, Singh Deo, & Babu, 2015).

Machine learning generally consists of two main sub-branches: supervised and unsupervised learning. Supervised machine learning is a technique that requires prior data annotation and labeling. Once a machine has examples of data with corresponding labels, it can be trained on a training data set and use the learned cues to label unlabeled data. Text classification is one of the exemplary tasks in supervised machine learning. Unlike supervised algorithms, unsupervised machine learning does not require prior annotation of the data. The goal of unsupervised learning is to find commonalities and hidden patterns in data without being trained on a labeled data set.
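To make the supervised case concrete, the short sketch below trains a tiny text classifier with scikit-learn; the labeled example sentences and the particular pipeline (TF-IDF features with logistic regression) are illustrative assumptions, not the method used later in this thesis.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled training examples (the "prior annotation" step).
texts = [
    "Brexit is a disaster for the economy",
    "Leaving the EU ruins everything",
    "Brexit finally gives us back control",
    "Great news, we are leaving the EU at last",
]
labels = ["negative", "negative", "positive", "positive"]

# Supervised learning: fit on labeled data, then predict labels for unseen text.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["Brexit will be great for trade"]))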

3.2.2 Sentiment analysis

Sentiment analysis, also known as opinion mining, is a research area standing on the foundations of computer science, text mining, and linguistics. It “is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes” (Liu, 2015). The main goal of sentiment analysis is to classify a given text, speech, or even image and assign a positive, negative, or neutral label to it. Although the history of sentiment analysis and opinion mining goes back to the early 20th century, the widespread application of this technique across fields such as politics, psychology, marketing, and finance has made sentiment analysis one of the fastest-growing research areas in computer science in recent years (Mäntylä, Graziotin, & Kuutila, 2018).

Text sentiment analysis offers three basic levels of analysis: document level, sentence level, and entity level. Document-level analysis identifies the sentiment of a whole document; however, the document must concern only a single entity. Sentence-level sentiment analysis goes further and extracts sentiment from each sentence of a document. However, when a sentence chains positive and negative sentiments, this practice is not enough to properly identify the overall opinion. For this reason, researchers have developed entity-level sentiment analysis. The main idea of entity-level sentiment identification is that each opinion also has its target, in other words, an entity, that needs to be studied separately.

Thus far, most of the automated software solutions for sentiment analysis work with so-called lexicons. A lexicon is a set of words with a certain association, which, in sentiment analysis, carries a positive or negative meaning. The algorithm then goes through the text, tries to find these words, and derives the sentiment label based on their frequency in the document. Multiple methods are used to create these standardized lists of words. Still, the most common approach to creating a lexicon is the dictionary-based approach, sometimes called a closed-vocabulary dictionary (Schwartz & Ungar, 2015).
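As a minimal sketch of the frequency-count idea described above (the tiny word lists below are invented placeholders, far smaller than any real lexicon):

# A tiny, hypothetical sentiment lexicon; real lexicons contain thousands of scored words.
POSITIVE = {"good", "great", "benefit", "hope"}
NEGATIVE = {"bad", "disaster", "chaos", "fear"}

def lexicon_sentiment(text: str) -> str:
    """Label a text by counting lexicon hits, as in a simple dictionary-based approach."""
    tokens = text.lower().split()
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(lexicon_sentiment("brexit is a disaster and causes chaos"))  # negative
print(lexicon_sentiment("there is hope this deal will be good"))   # positive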

The dictionary-based approach uses a manual selection of sentiment words, which can be complemented by an automated selection of synonyms or antonyms from publicly available online dictionaries (Liu, 2015). The second approach, called the corpus-based or open-vocabulary approach, uses automatic tokenization and language feature selection to derive the sentiment of words. A study conducted by researchers at the University of Pennsylvania shows that these data-driven techniques and dictionaries can perform better than traditional hand-crafted dictionaries (Schwartz & Ungar, 2015).

Although sentiment analysis is a popular theme amongst scholars, there are still multiple obstacles that need addressing. First of all, sentiment, by definition, is a subjective feeling that is usually expressed by words in a context. Sometimes the same words can be meant positively, while at other times, or by another person, the meaning of those words can be negative. The sentiment therefore highly depends on the context. A great example of this issue is the use of sarcasm, which is still very difficult for machines to recognize. Besides that, the fact that a sentence contains sentiment words does not necessarily mean that it expresses a relevant opinion, and conversely, there can be sentences expressing a sentiment without sentiment words. All of this implies that sentiment analysis is a complex topic highly related to natural language processing. Unlike plain textual analysis, the goal of natural language processing algorithms is to properly understand human communication, with an emphasis on context.

3.2.3 Topic modeling

Since the early 2000s, topic models have become a popular method for analyzing textual data. The reason for this popularity is the accessibility of digital content such as digitized books and online news and, eventually, the widespread use of social media and other platforms where users generate digital content. In the field of Digital Humanities, topic modeling is considered to be a pure representation of distant reading because its goal is to automatically detect hidden topics within a large corpus with the help of computational methods (Meeks, Weingart, Weingart, & Weingart, 2012). The applications of topic modeling range from analyzing historical documents and collections, through social media analysis and monitoring, to building sophisticated recommendation engines for websites or products.


In practical terms, “topic modeling is an unsupervised learning method that assumes each document consists of a mixture of topics, and each topic is a probability distribution over words. [...] The output of topic modeling is a set of word clusters. Each cluster forms a topic and is a probability distribution over words in the document collection” (Liu, 2015). In general, it is necessary to stress that topic models ignore language complexity and the semantic nature of the text and treat the corpus as a so-called “bag of words”. Bag-of-words modeling is a common technique in natural language processing based on the assumption that a document consists of a set of words with a particular distribution, while overlooking grammar rules and word order (Deepu, Pethuru, & Rajaraajeswari, 2016).
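As a minimal illustration of the bag-of-words idea (the two example sentences are made up), a document is reduced to unordered word counts:

from collections import Counter

documents = [
    "brexit vote divides the country",
    "the country votes on brexit and the vote is close",
]

# Bag-of-words: each document becomes a multiset of word counts, ignoring order and grammar.
for doc in documents:
    print(Counter(doc.split()))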

The most popular topic model is Latent Dirichlet Allocation (LDA), introduced in 2003 by the American professor of statistics and computer science David M. Blei (D. M. Blei, Ng, & Jordan, 2003). LDA represents each word of a corpus as a vector. “Through different vector similarity measures, distributional models are able to establish semantic relations between words according to the similarity of their contextual vectors. In this way, words with similar contextual vectors could be clustered together as semantically related words” (Navarro-Colorado, 2018). LDA is widely used in literary research as well as in linguistics, media studies, and marketing.

In general, topic models are an effective way to analyze vast amounts of textual data and extract the leading topics of a corpus. In the digital age, manual evaluation is practically impossible; therefore, topic models have huge application potential in the public as well as the business sector. Researchers use topic modeling to extract topics from large collections of literary texts or to identify cultural trends in online discussions, while businesses utilize topic models to optimize their products and provide a better customer experience through effective recommendations. On the other hand, it is also necessary to mention that topic models have several weaknesses. “One main issue is that it needs a large volume of data and a significant amount of tuning to achieve reasonable results” (Liu, 2015). The implementation of such a model therefore requires a certain level of technical knowledge, and scholars who do not possess at least basic programming skills will not be able to apply it. Besides that, topic models do not provide an explicit answer; they require additional human interpretation and thus tend to be subjective to a certain extent.

In general, a data-driven approach in the field of media studies and political science can lead towards two goals: predictions and insights. “Prediction focuses on automatic estimation or [...] exploratory language analyses to understand better what drives different behavioral patterns” (Schwartz & Ungar, 2015).

While research on some social network sites, such as Facebook or Twitter, uses well-established techniques, the methodology for opinion mining and sentiment analysis on other platforms, such as Reddit, is mostly derived and not yet well designed. Especially in political science, polling is often done using Twitter as a relevant source, and some tools, such as Python’s NLTK or Scikit-learn libraries, are frequently applied specifically to the analysis of Twitter posts. Unfortunately, this technology might not be as effective for other platforms, because the length of the text is usually much longer than 280 characters, which is the limit for a tweet.


4 Literature Review

Concerning public opinion mining, social network sites have a significant role, and many researchers widely study them. One of the earlier studies of social networks as a tool for opinion mining was conducted in 2009 at Carnegie Mellon University and suggested that a textual analysis of UGC has the potential to capture major trends in public opinion and eventually supplement traditional polling methods. The researchers applied a simple sentiment analysis technique, concretely counting instances of positive and negative words, to 1 million Twitter messages related to the American president Barack Obama and compared the results with regular polling data. Although the results vary across data sets, the correlation between the sentiment and official polls goes up to 80% (O'Connor, Balasubramanyan, Routledge, & Smith, n.d.). For future research, the authors encourage using a more suitable lexicon together with a better classification of Twitter messages, meaning a differentiation between tweets and replies.

Another study takes a similar approach and examines more than 100,000 tweets about the German federal election using the well-established text analysis software LIWC (Linguistic Inquiry and Word Count). The scholars’ findings state that “the mere number of messages mentioning a party reflects the election result. Moreover, joint mentions of two parties are in line with real-world political ties and coalitions. An analysis of the tweets’ political sentiment demonstrates close correspondence to the parties' and politicians’ political positions indicating that the content of Twitter messages plausibly reflects the offline political landscape.” (Tumasjan, Sprenger, Sandner, & Welpe, 2010). However, the researchers admit several limitations, such as the representativeness of the data sample in relation to demographics, missing data due to the unstructured nature of Twitter content, and software limitations.

On the contrary, multiple papers by Gayo-Avello, a Spanish expert on social media and political science, warn against using social media data and sentiment analysis for political predictions and identify several issues related to this methodology (Gayo-Avello, 2013). One of the main issues he identified is the fact that there is no established and reliable method of counting electoral votes based on Twitter data. Seeing sentiment analysis itself as a “black box” is another problem identified by Gayo-Avello. He highlights the fact that the software solutions used for sentiment analysis of social media content ignore patterns such as humor, sarcasm, or propaganda and are therefore not reliable in understanding complex human communication patterns. Another essential obstacle is the creation of biased, non-representative sample data as a result of neglecting demographic data. Several case studies support Gayo-Avello’s argument. These case studies of American Senate elections show that applying lexicon-based sentiment analysis to political communication on Twitter has poor accuracy and is thus not suitable for predictive analysis (Gayo-Avello, 2013).

Although most of the studies so far focus on analyzing Twitter content, some researchers pay attention to other data sources as well. Researchers at the University of Illinois developed a new methodology for analyzing UGC on a large scale, called Big Data Audience Analysis (BDAA). The “BDAA draws on two terms: big data and distant reading. We use big data as shorthand for large-scale data sets that are likely unable to be read closely by a single researcher and even by a small group of researchers. [...] Distant reading refers to a macroscale approach to information, often using graphs and visualizations to examine trends in that information” (Gallagher et al., 2019).

In comparison with the methods mentioned above, BDAA combines multiple approaches, such as sentiment analysis, exploratory data analysis, statistics, and geolocation analysis, all combined into a visual representation. The study itself examines approximately half a million user comments from the New York Times website. As an outcome, the authors give multiple recommendations to the website designers on how to improve the user interface and moderate the online discussion more efficiently (Gallagher et al., 2019). This study confirms the importance of text analysis as a relevant approach to determining the main clues and patterns in online communication.

Concerning social media’s influence on Brexit, there are multiple articles, mainly focused on analyzing UGC from Twitter. Researchers from the London School of Economics and Political Science “analyzed 7.5m tweets and found the predominance of Euroscepticism on social media mirrored its dominance in the press” (Hänska & Bauchowitz, 2017). They argue that the internet and social media have changed the way people consume news, and therefore it is necessary to analyze the role of platforms such as Twitter or YouTube to understand public opinion formation. Unfortunately, this study does not elaborate on its methods, and therefore the way the analysis was conducted is not transparent at all (Hänska & Bauchowitz, 2017).

Slovenian researchers at the Jozef Stefan Institute took a more detailed approach to analyzing Twitter data in relation to Brexit. Their task was to detect the stance on Brexit automatically and thereby identify Twitter users’ voting preferences. Unlike the previous study, the authors precisely describe the methods they use. They collected over 4.5 million tweets from almost one million Twitter users, manually annotated the stance of tweets, and developed and applied a custom classification model. To achieve accurate results, the authors adjust the outcome by the demographic distribution, which proved to be essential. As the authors mention: “A naive application of our stance model predicts the outcome of the referendum as Remain. However, there are large differences in several aspects of demography between Twitter users and eligible voters. We take into account just the age distribution and adjust the outcome predicted by the model. This shows the convincing win of the Leave supporters, even higher than the actual result. The conclusion from this experiment is the need for continuous monitoring of demographic distribution between the Twitter users, and careful adjustment of the predicted results” (Grčar, Cherepnalkoski, Mozetič, & Kralj Novak, 2017).

Besides analyzing sentiment to infer people’s opinions, there are more sophisticated methods, such as topic modeling, that help researchers analyze and identify patterns and trends in online communication. Since social media proved to be a resourceful means for public opinion mining, multiple studies have implemented topic models to determine the main drivers of online discourse. For instance, scholars from Carnegie Mellon University analyzed Twitter users’ messages to “predict which other microblogs a user is likely to follow, and to whom microbloggers will address messages” (Puniyani, Eisenstein, Cohen, & Xing, 2010). The results show that the application of topic models, in their case LDA, in combination with supervised techniques outperforms traditional social network analysis methods.

Another study, from the University of North Carolina, used LDA as a backbone to build a topic model customized for hashtag analysis. To understand hashtags in social media, the researchers propose a new Tag-Latent Dirichlet Allocation model that connects hashtags with corresponding topics. They analyzed over 16 million tweets, extracted a set of 140 topics, and finally assigned appropriate topics to each hashtag. The paper stresses that inferring context from hashtags alone is very difficult and often misleading, while the added list of topics indicates the subject of a tweet containing a particular hashtag. In the end, the researchers group hashtags into clusters based on similar topics that “support the development of a comprehensive understanding of events discussed on Twitter” (Ma, Dou, Wang, & Akella, 2013).

The literature overview shows that social media data can play an important role in monitoring long-term trends in society and thus serve as a complementary means to analyze and predict social behavior. The main data source for researchers is still Twitter. Compared to other social networks, Twitter is very popular and widely adopted by politicians and journalists, which makes it a great place for public discussion. This microblogging platform has more than 300 million active accounts (“Global Digital Report 2019 - We Are Social,” n.d.-b) and allows its users to create, consume, and share content in short messages of 280 characters. Another important aspect that made Twitter so attractive is that its data used to be open and easily accessible. Sadly, since scandals related to the unethical use of social media data, such as Cambridge Analytica, which shook privacy measures, Twitter has changed its data access policy. Nowadays, academics can only access data no older than ten days, unless they want to pay considerable amounts to private companies that hold Twitter data archives. Little attention is therefore paid to other social networks, such as Reddit, although its importance amongst social media platforms seems to be growing.

Although social media have proved to play a growing role in opinion and text mining research, the nature of these networks carries multiple challenges. First, the amount of data available for analysis is too vast to be assessed qualitatively, and therefore scholars rely on computational methods. Besides that, most of the content extracted from these platforms is stored in an unstructured format, which is still very difficult to process automatically. Although technological progress allows researchers to develop complex models and self-learning algorithms, human language remains a great challenge even for the best artificial intelligence applications. Last but not least, even if anonymity were not a common practice on social media platforms, the data would hardly provide us with a representative sample of the population.


5 Research question and hypotheses

The goal of this thesis is to examine Reddit users' attitudes in the Brexit debate and to see whether the overall sentiment of Reddit comments related to the Brexit debate follows the same trendline as regularly conducted polls. Besides that, the thesis aims to explore unseen patterns and clues and to identify the main drivers and topics of the Brexit debate on Reddit that might add to the explanation of voters' offline behavior.

The research questions of the study are:

RQ1: Does the users' attitude in the Brexit debate on Reddit follow the same trendline as regularly conducted polls?

RQ2: What are the main drivers behind the Brexit debate on Reddit? Where do the users talk about Brexit? How do the positive comments differ from the negative ones? What are the users talking about? What are the main topics?

The study aims to test the following research hypotheses:

H1: The overall sentiment in the Brexit debate on Reddit over time will follow the same trendline as the sentiment expressed in the official polls.

H2: The sentiment in the Brexit debate on Reddit will be negative.

H3: Comments with positive sentiment will mention different topics than comments with negative sentiment; therefore, the topics will differ based on the sentiment of the comments.


6 Methodology and tools

Since this thesis investigates the overall attitude of Reddit users in the Brexit debate and tries to find patterns and clues to make inferences about the users’ offline behavior, the most appropriate method resembles the BDAA concept (Gallagher et al., 2019) described in the previous section. Unlike the original study, this thesis has to leave out some elements, such as geolocation data, because this data is unfortunately not available on Reddit. However, the platform still provides detailed information about each comment that can be leveraged for insight extraction. The methodology is based on three main domains corresponding to the theoretical chapter: sentiment analysis, exploratory data analysis, and topic modeling.

6.1 Sentiment analysis with VADER

Before conducting the main analysis, it is necessary to obtain the sentiment for each comment. Since the sentiment is extracted from Reddit users’ comments, multiple specifics need to be taken into account. “Some of these [...] stem from the sheer rate and volume of user generated social content, combined with the contextual sparseness resulting from shortness of the text and a tendency to use abbreviated language conventions to express sentiments” (Hutto & Gilbert, 2014). To tackle these challenges, it was decided to execute the automated sentiment analysis with the help of VADER, “a simple rule-based model for general sentiment analysis” (Hutto & Gilbert, 2014) developed by researchers from the Georgia Institute of Technology in Atlanta, US.

VADER, i.e., the Valence Aware Dictionary for sEntiment Reasoning, “is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media” (“GitHub - cjhutto/vaderSentiment: VADER Sentiment Analysis.,” n.d.). In general, VADER combines a lexicon-based approach with a set of rules that address human speech conventions and sentiment intensity. Besides that, VADER has multiple advantages, such as no need for a training data set, speed even when running online, and open, easily accessible documentation.

VADER’s lexicon has been built based on well-established sentiment lexicons like LIWC, ANEW, and GI. However, it also incorporates features specific to social media language, such as emoticons, acronyms, and slang. The final lexicon has over 9 thousand lexical features. The next step was to grade the sentiment valence, i.e., intensity. For this task, the researchers used the crowdsourcing platform Amazon Mechanical Turk, trained independent raters, and let them annotate more than 7,500 lexical features with both “the sentiment polarity (positive/negative), and the sentiment intensity on a scale from –4 to +4. For example, the word “okay” has a positive valence of 0.9, “good” is 1.9, and “great” is 3.1, whereas “horrible” is –2.5” (Hutto & Gilbert, 2014). The same procedure was applied to emoticons, acronyms, and slang expressions as well. To improve the quality of VADER as a sentiment analysis tool, the researchers conducted a qualitative analysis and isolated five “generalizable heuristics based on grammatical and syntactical cues to convey changes to sentiment intensity. Importantly, these heuristics go beyond what would normally be captured in a typical bag-of-words model” (Hutto & Gilbert, 2014). This analysis was performed by two experts who manually annotated 800 tweets and classified their sentiment intensity. The final model considers textual and linguistic details as well: punctuation, capitalization, degree adverbs, contrastive conjunctions, and negation. To finally obtain the gold standard, the researchers recruited 20 independent raters, again through AMT, and asked them to rate four different corpora, which include tweets, movie reviews, technical product reviews, and opinion news articles.

The final results consist of a comparison between the findings and seven well-established sentiment analysis lexicons. VADER showed exceptional performance in analyzing social media text and outperformed human raters in correctly classifying the sentiment of tweets. Based on previous research and these results, VADER is a perfect fit for analyzing Reddit data.
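As a brief illustration (not the thesis's own notebook, which is listed in Appendix A), VADER can be applied to a single comment through the vaderSentiment package; the example comment is invented, and the ±0.05 compound-score cut-offs follow the convention given in VADER's documentation.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# A made-up Reddit-style comment.
comment = "Honestly, this Brexit deal is a complete mess :( but at least there's SOME clarity now."
scores = analyzer.polarity_scores(comment)
print(scores)  # 'neg', 'neu', 'pos' proportions plus a normalized 'compound' score in [-1, 1]

# Conventional thresholds: compound >= 0.05 is positive, <= -0.05 is negative, otherwise neutral.
if scores["compound"] >= 0.05:
    label = "positive"
elif scores["compound"] <= -0.05:
    label = "negative"
else:
    label = "neutral"
print(label)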

6.2 Exploratory data analysis

Exploratory data analysis is an essential part of any data-driven research project or study. To further develop a hypothesis and make any inferences or conclusions, it is necessary that a researcher clearly understands the given data set and knows what data he or she is dealing with. A common way to explore a data set and find the basic trends and clues is to conduct a so-called exploratory data analysis (EDA). The term itself was introduced by the statistician John Wilder Tukey in the early 1970s. Unlike traditional statistical methods with a well-defined execution process, EDA is, according to Tukey, “an attitude, a flexibility, and a reliance on display, NOT a bundle of techniques” (Tukey, 1980). In his work, he often emphasizes the role of EDA as a means to find the right research questions, as they are often more important than the actual answers. EDA suggests that a graphical representation of a data set reveals critical information, and it is therefore essential prior to the application of sophisticated statistical models and analysis techniques. For this reason, “Tukey's approach to data analysis is highly visual, and he has numerous suggestions for graphical displays. Graphs are used for many different purposes. They can be used to store quantitative data, to communicate conclusions, or to discover new information” (Tukey, 1980).

Since the main goal of EDA is to uncover hidden patterns and clues in data, it can be considered a relevant method to explore the Reddit data set and find the drivers of the Brexit debate on Reddit. The EDA part of this thesis is built on multiple Python libraries, such as NumPy and Pandas for data manipulation and statistics, and Matplotlib and Seaborn for visualizations (“Matplotlib: Python plotting — Matplotlib 3.1.1 documentation,” n.d.; “NumPy — NumPy,” n.d.; “Python Data Analysis Library — pandas: Python Data Analysis Library,” n.d.; “seaborn: statistical data visualization — seaborn 0.9.0 documentation,” n.d.). While the first part explains why EDA is an appropriate approach to explore the Reddit data set, the second focuses on a concise analysis of each variable and summarizes the main characteristics of the data.
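A minimal sketch of this kind of EDA is given below; the file name comments.csv and the column names (subreddit, created_utc, body) are assumptions for illustration, not necessarily the structure of the thesis's data set.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical input: one row per Reddit comment.
df = pd.read_csv("comments.csv")  # assumed columns: subreddit, created_utc, body

# Basic overview of the data set.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Distribution of comments across subreddits (top 10).
top_subreddits = df["subreddit"].value_counts().head(10)
sns.barplot(x=top_subreddits.values, y=top_subreddits.index)
plt.xlabel("Number of comments")
plt.title("Top 10 subreddits in the Brexit data set")
plt.tight_layout()
plt.show()

# Comments per month, based on the Unix timestamp Reddit supplies.
df["created"] = pd.to_datetime(df["created_utc"], unit="s")
df.set_index("created").resample("M")["body"].count().plot()
plt.ylabel("Comments per month")
plt.show()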

6.3 Topic modeling with Latent Dirichlet allocation and Sci-kit Learn

Latent Dirichlet Allocation (LDA) is a probabilistic topic model introduced in 2003 by the American professor of statistics and computer science David M. Blei. The LDA model aims to extract the hidden set of topics within a set of documents. It generally works with three basic entities: words, documents, and corpora. A word is the basic unit of a document represented by a vector, a document is a sequence of words, and a corpus is a collection of documents. The LDA model assumes that each document is a mixture of topics, i.e., a probabilistic distribution. Each topic, in turn, is a mixture of words, also in a probabilistic distribution. LDA also assumes that documents are created in a certain generative way: to create a new document, one first defines the number of words in the document, then selects a mixture of topics, and finally generates the words based on the topic distribution within the document. With these assumptions, LDA tries to reverse this process and infer the topics of each document based on word clusters belonging to a certain topic (D. Blei, 2012).
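To make the generative story concrete, the toy sketch below samples one short "document" from two hand-made topics; the vocabulary, the topic-word probabilities, and the Dirichlet parameter are all invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and two hand-made topics (each a probability distribution over words).
vocab = ["brexit", "vote", "trade", "border", "football", "goal"]
topics = np.array([
    [0.30, 0.30, 0.20, 0.18, 0.01, 0.01],  # a "politics" topic
    [0.02, 0.08, 0.02, 0.03, 0.45, 0.40],  # a "sports" topic
])

# 1. Draw the document's topic mixture from a Dirichlet prior.
theta = rng.dirichlet(alpha=[0.5, 0.5])

# 2. For each word slot, pick a topic from the mixture, then a word from that topic.
document = []
for _ in range(10):
    z = rng.choice(len(topics), p=theta)
    w = rng.choice(len(vocab), p=topics[z])
    document.append(vocab[w])

print(theta)     # the document's topic proportions
print(document)  # the generated "document"

Fitting LDA reverses this process: given only the observed words, it infers the topic proportions and the topic-word distributions.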


Figure 2: The intuitions behind latent Dirichlet allocation (D. Blei, Carin, & Dunson, 2010)

Note: The infographic above displays the main logic behind the LDA: each document has a probabilistic distribution of topics; each topic is represented by a probabilistic distribution of words.

There are multiple ways to implement the LDA model. For this thesis, the Python library scikit-learn was selected as the tool for the LDA implementation. It started in 2007 as a Google Summer of Code project and soon became popular in the machine learning community (“About us — scikit-learn 0.21.3 documentation,” n.d.). “Scikit-learn exposes a wide variety of machine learning algorithms, both supervised and unsupervised. [...] Since it relies on the scientific Python ecosystem, it integrates easily into applications outside the traditional range of statistical data analysis” (Pedregosa et al., 2011).
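A minimal sketch of such an implementation is shown below; the toy documents and parameter choices are illustrative assumptions, and get_feature_names_out assumes a reasonably recent scikit-learn release.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for Reddit comments.
docs = [
    "brexit deal vote parliament eu negotiation",
    "eu trade deal customs border ireland",
    "football match goal premier league",
    "league cup goal striker transfer",
]

# Bag-of-words representation, then LDA with two topics.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

# Print the top words of each inferred topic.
words = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = [words[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {', '.join(top)}")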

There is a set of best-practice rules that can be used to improve the results of the scikit-learn LDA model. Although the model can run on raw data, it has been proven that text pre-processing, such as lowercasing, removing unimportant information with no added value, removing whitespace, conflicting words, and stopwords, can significantly improve the results as well as the speed of the model. The figure below, displaying a piece of code, describes the particular steps.

Figure 3: Text preprocessing
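Since the code from Figure 3 is not reproduced here, the following is a minimal sketch of the kind of pre-processing described above (lowercasing, stripping links, punctuation, and extra whitespace, and removing stopwords); the small stopword set is a placeholder, not the list used in the thesis.

import re

# Placeholder stopword set; a full list (e.g. from NLTK) would be used in practice.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in", "it", "on"}

def preprocess(text: str) -> str:
    """Lowercase, drop URLs and punctuation, collapse whitespace, remove stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters only
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("The Brexit deal is a MESS!!! See https://example.com for details."))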

6.4 Data integration and analysis tools

Researchers have been experimenting with computational methods and tools in the humanities for more than 30 years (Underwood, 2017). However, the broad application of this approach is often related to the boom of social network sites and the production of UGC in the past decade. New interdisciplinary fields, such as digital humanities, offer brand-new, innovative ways of studying subjects such as literature, history, or political science, but the adoption of digital methods and tools also brings several challenges. For instance, while basically any modern laptop can handle a variety of software solutions, their use also requires a certain degree of digital literacy that comes along with experience in applied statistics, mathematics, and programming. According to some scholars, “just applying the tool or even “learning to code” alone” (Tenen, 2016) is insufficient to be able to interpret the results.

On the other hand, in the field of text analysis, computational power and modern software allow scholars to apply experimental approaches and analyze data on a large scale. The selection of the right tools can also significantly help researchers in project execution and improve the performance of the analysis itself. Thus, the right selection of a toolset is an important part of any research project. The next section briefly elaborates on the tools that have been used for this project and critically reflects on their application in research in general.

6.4.1 Keboola Connection

Since this project consists of a variety of tasks, it also requires a variety of software and tools. To keep track of each branch of the project and have all the tools in one place, it is necessary to find a tool that can serve as the main storage for all applications and files. For this reason, a tool named Keboola Connection is used to execute the analysis. In short, Keboola Connection is a cloud-based analytical environment containing modules for data extraction, storage, manipulation, and advanced data analysis. The platform’s architecture is quite complex and consists of many independent components. Figure 4 displays the basic workflow.

Figure 4: Keboola Connection workflow

The tool follows the standard extract, transform, and load (ETL) procedure (“What Is ETL? | SAS,” n.d.) and allows its users to systematically extract data from various sources, process it, prepare it for further analysis, and eventually load it into a visualization platform or other system, depending on the particular use case. For this project, mostly the analytical environment and the storage of the tool were leveraged. It allows its users to upload all the necessary files in CSV format, and once the data is stored in the cloud database called Snowflake (“Cloud Data Warehouse Software | About Snowflake,” n.d.), it can be transformed and analyzed.

Keboola Connection supports three main transformation languages and backends, namely Snowflake for basic tabular data preprocessing and database manipulation, Python for advanced data analysis and machine learning, and R for advanced statistical analysis. For each language, there is a dedicated environment for script preparation: Snowflake Worksheets for SQL, Jupyter Notebook for Python, and RStudio for R. Besides that, the platform also offers a simple way to implement other third-party services, such as the natural language processing tool Geneea, while keeping all data in one place. The final part of the ETL process is then executed by components called Writers, which pull the data out to the consumption layer – visualization platforms (“Keboola Connection Overview | Keboola Connection User Documentation,” n.d.). Although Keboola Connection offers connectors to multiple business intelligence and visualization tools, such as Tableau or Looker, the visualizations in this thesis were prepared in Python and Microsoft Excel.

6.4.2 SQL and Python

Although this thesis focuses on unstructured data and textual analysis, many important attributes of each Reddit comment are stored as simple structured tabular data. Even though Python as a programming language is widely used for data preprocessing and manipulation, there are also other convenient ways to work with tabular data. Nowadays, the most widely used data model for storing and manipulating structured data is the relational model, which is the cornerstone of relational databases. To retrieve data from such a database, a query language is needed. As Silberschatz, Korth, and Sudarshan (2011) state, “there are a number of database query languages, [...] SQL is the most influential commercially marketed relational query language” worldwide. Therefore, the tabular data and simple tasks will be handled by Keboola Connection’s transformation backend that runs on top of Snowflake. Although there are small differences in syntax compared to other platforms, Snowflake allows working with tabular data like any other relational database.

As mentioned above, SQL has its benefits when working with tabular data. However, when it comes to more complex tasks, such as natural language processing and comprehensive text analysis, it is necessary to utilize more powerful programming languages such as Python. According to a report based on data from GitHub, a software development platform, Python ranked as the third most popular programming language in the world (“Projects | The State of the Octoverse,” n.d.). In practical terms, Python is used as a high-level programming language for application development, machine learning, and advanced data analysis. In this thesis, Python is used for multiple tasks, specifically data preparation, analysis, visualization, and, indirectly, sentiment analysis as well.

To critically evaluate the selection of the tools mentioned above, one can state that they do not necessarily require profound knowledge of computational methods or programming. However, a certain level of digital literacy, basic coding and programming skills, specifically in SQL and Python, as well as theoretical fundamentals of database design concepts and natural language processing, are essential. It is also important to state that the scope of this project poses limitations on leveraging the full potential of the tools, and therefore, some features will be used in a rather illustrative way.


7 Data collection and preparation

According to a report from CrowdFlower (now known as Figure Eight, an artificial intelligence company), data scientists spend about 80 percent of their time on data collection and preprocessing; only the remaining 20 percent is spent on data analysis and interpretation (“CrowdFlower 2016 Data Science Report,” n.d.). The fact that data scientists spend most of their time on data retrieval and organization signals the importance of this task. This section of the thesis elaborates on the process of data extraction, cleansing, and transformation that prepared the data for the analysis.

7.1 Data collection

As mentioned in the previous section, there are multiple ways to extract data from Reddit, and they differ based on the goal of a particular analysis and the preferred tools and programming languages. Since this thesis focuses on the Brexit debate on Reddit over the past year, several search parameters for data collection were selected. Reddit comments were selected based on the keyword “Brexit,” regardless of the subreddit. The time frame was set from the 1st of August 2018 until the 31st of July 2019, downloading approximately one thousand comments per day. The final data set contains around 360 thousand comments in total.

If a researcher wants to analyze vast amounts of data from the whole of Reddit, the most convenient way is to download the full monthly data dumps that are publicly available at pushshift.io. “Pushshift.io offers a feature-rich API to search social media data, including Reddit. Pushshift also collects and disseminates Reddit comments and submissions every month. Over 40 academic papers have used Pushshift as one of the sources for their research” (“Fundraiser by Jason Baumgartner : Pushshift.io Fund,” n.d.). Considering the amount of data, downloading “the whole Reddit” is not a suitable method for this paper due to the file size and the processing power that would be needed to select and transform the initial data into a form suitable for analysis.

Fortunately, pushshift.io also offers another way to access Reddit data through the pushshift.io Reddit API, and the method is documented in detail on GitHub. “This RESTful API gives full functionality for searching Reddit data and also includes the capability of creating powerful data aggregations” (“GitHub - pushshift/api: Pushshift API,” n.d.). The search is divided by endpoint - comments or submissions - and it can be refined with multiple parameters such as search term, subreddit, id, time range, author, size, and many others. To search for 1000 comments per day within the selected time range, the following parameters can be used: q - the search term, after/before - the date specification, and size - the number of results per call.
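As an illustration, the snippet below is a minimal sketch of such a call in Python using the requests library; the endpoint and parameter names follow the Pushshift documentation, while the specific timestamps shown are only examples.

import requests

# Search the Pushshift comment endpoint for comments mentioning "Brexit"
# posted on a single day (timestamps are illustrative Unix epochs).
url = "https://api.pushshift.io/reddit/search/comment/"
params = {
    "q": "Brexit",         # search term
    "after": 1533081600,   # 1 August 2018, 00:00 UTC
    "before": 1533168000,  # 2 August 2018, 00:00 UTC
    "size": 1000,          # maximum number of comments per call
}

response = requests.get(url, params=params)
comments = response.json()["data"]

print(f"Retrieved {len(comments)} comments")
print(comments[0]["body"][:100])  # preview of the first comment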

This method is very easy to use; however, it requires a lot of manual labor. Besides that, there are some important limitations, including the maximum search size of 1,000 comments per API call. There are tools available to overcome these problems, such as the pushshift.io API wrapper called psaw, which is publicly available on GitHub as well (“GitHub - dmarx/psaw: Python Pushshift.io API Wrapper (for comment/submission search),” n.d.). Although the wrapper is simple to use, the API was not responding properly to requests due to the size of the data it was trying to retrieve. Most of the time, the wrapper got disconnected right at the beginning of the data collection, and therefore, it was impossible to use it for data extraction.

Finally, it was decided to test a ready-to-use Reddit comment extractor written by the data scientist Tomas Votava. The whole solution is built on top of another Reddit API wrapper named PRAW. To use the extractor, scholars only have to create a Reddit developer account, enter their credentials together with the search term in a JSON file, and run the script. The script then crawls Reddit and searches for the selected term based on multiple parameters that can be specified. In the end, a CSV file containing all the data is created. After a couple of days of testing and adjusting the main script, it became clear that it was unusable due to its unreliability. Although the script was powerful in pulling out a lot of historical data from subreddits with lower activity, it was not reliable in collecting all comments from subreddits with a higher number of submissions and comments; the final data set usually missed the majority of the data. The script is, unfortunately, no longer available on Votava’s GitHub profile.
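Because the original script is no longer available, the following is only a rough sketch of how a PRAW-based comment search of this kind generally works, not a reconstruction of Votava’s extractor; the credentials, limits, and output file name are placeholders.

import csv
import praw

# Placeholder credentials from a Reddit developer account (see reddit.com/prefs/apps).
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="brexit-comment-extractor (by u/your_username)",
)

with open("brexit_comments.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "author", "body", "created_utc", "score", "subreddit"])

    # Search all of Reddit for submissions mentioning "Brexit" and collect their comments.
    for submission in reddit.subreddit("all").search("Brexit", limit=100):
        submission.comments.replace_more(limit=0)  # skip "load more comments" placeholders
        for comment in submission.comments.list():
            writer.writerow([
                comment.id,
                str(comment.author),
                comment.body,
                comment.created_utc,
                comment.score,
                comment.subreddit.display_name,
            ])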

After developing, implementing, and testing various automated solutions, the decision was made to use manual extraction with the help of pushshift.io API calls. Following the method described above, approximately 1000 comments per day from the 1st of August 2018 until the 31st of July 2019 were collected and stored in separate JSON files. These files were then merged and preprocessed so that the analysis could be performed.


7.2 Basic data preparation and data structure

The data was spread across more than 350 separate JSON files, so the first task was to merge all of these files into one structured data set. Because the data extracted through the Reddit API was clean in terms of its structure, it was sufficient to simply append the files to one another. The structure of each JSON file is displayed in Figure 5.

Figure 5: Data structure of the JSON file pulled from Reddit’s API

Fortunately, Python offers a convenient solution for this task - the glob and csv modules (“glob — Unix style pathname pattern expansion — Python 3.7.4 documentation,” n.d.). With the use of these modules, it was possible to merge all JSON files into one simple CSV file with just a few lines of Python code. The final raw data set had 362,115 rows and 41 columns, in accordance with the JSON file structure above.
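A minimal sketch of this merging step is shown below; it assumes that the daily files are stored in a folder named json_files and that each file contains the comment objects under a top-level "data" key, as returned by the Pushshift API, so the exact paths and keys may differ from the original workflow.

import csv
import glob
import json

rows = []
# Collect every daily JSON file downloaded from the Pushshift API.
for path in sorted(glob.glob("json_files/*.json")):
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    rows.extend(payload["data"])  # list of comment dictionaries

# Use the union of all keys as the CSV header, since not every comment
# contains every field.
fieldnames = sorted({key for row in rows for key in row})

with open("reddit_comments_raw.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

print(f"Merged {len(rows)} comments into one CSV file")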

After the raw data set was ready, data cleansing started. Firstly, the raw data set was uploaded into Keboola Connection by the CSV extractor. Once the file was stored in the cloud storage, basic transformations were conducted in SQL in Snowflake Worksheets. In the beginning, only a few derived columns were added, such as the Unix timestamp converted to a regular date format, and the prefixes of the IDs were removed. To keep the data structure as simple as possible, the decision was made to keep only the columns listed below for further analysis.

Table 1: Reddit Data Structure

Column name Description

author Provides an instance of Redditor.

body The body of the comment.

created_utc Time the comment was created, represented in Unix Time.

id The ID of the comment.

score The number of upvotes for the comment.

subreddit Provides an instance of Subreddit. The subreddit that the comment belongs to.

subreddit_id The subreddit ID that the comment belongs to.

Note: The descriptions were taken from the official documentation of the PRAW library.
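The Snowflake SQL step itself is not reproduced here, but purely as an illustration, an equivalent transformation expressed in Python with pandas could look like the sketch below; it assumes the column names from Table 1, uses placeholder file names, and strips the "t5_" prefix that Reddit uses for subreddit fullnames.

import pandas as pd

# Load the merged raw data set.
df = pd.read_csv("reddit_comments_raw.csv")

# Convert the Unix timestamp to a regular date format.
df["created_date"] = pd.to_datetime(df["created_utc"], unit="s").dt.date

# Strip the "t5_" prefix from the subreddit IDs.
df["subreddit_id"] = df["subreddit_id"].str.replace("t5_", "", regex=False)

# Keep only the columns selected for further analysis.
columns = ["author", "body", "created_utc", "created_date",
           "id", "score", "subreddit", "subreddit_id"]
df = df[columns]

df.to_csv("reddit_comments_clean.csv", index=False)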

Besides the main data set with Reddit comments, the official poll results in CSV format were uploaded to Keboola Connection. This data is publicly available on the “What UK Thinks” website and aggregates the results of 77 polls from February 2012 until July 2019. The data structure is as follows:

Table 2: Official Polls Structure

Column name Description

Pollster Name of the agency that conducts the poll

Remain Percentage of “Remain” votes

Leave Percentage of “Leave” votes

"Don't know/undecided" Percentage of undecided voters

Note: The descriptions were downloaded from the What UK Thinks website.


8 Analysis

The analytical part of this thesis is divided into three main sections. The first focuses on sentiment analysis and tries to answer the first research question of whether users’ attitudes in the Brexit debate on Reddit follow the same trendline as regularly conducted polls. Then an exploratory data analysis is conducted to find basic patterns and clues in the data set. Lastly, a topic model is built to identify the main topics Reddit users talk about in the Brexit debate.

8.1 Sentiment analysis with VADER

As described in the previous section, VADER is a sentiment analysis tool specifically attuned to analyzing social media data and other informal human communication channels. Therefore, the decision was made to test this lexicon- and rule-based software to conduct the sentiment analysis on the Reddit data. Before the main analysis, it was necessary to install the VADER package and Python’s pandas library (“Python Data Analysis Library — pandas: Python Data Analysis Library,” n.d.). Pandas is a powerful library for handling data structures. The whole analysis was executed in the cloud with the use of Keboola Connection and Jupyter Notebook, which proved to be a good decision due to hardware requirements. During the testing period, a late-2017 MacBook Pro with 8 GB of memory was used. Although it can be considered a fairly powerful device, it struggled with the command execution, and it was clear that its processing power was not sufficient to complete this task promptly.

In order to improve the performance, especially in terms of processing time, it was necessary to work only with the relevant part of the data set - the body of the comments. After importing the necessary packages, the sentiment analysis was conducted. In detail, the script iterated over each comment separately, evaluating its sentiment and then writing the results into a CSV file together with the comment id, which was important for further processing and the data merge. The outcome had four basic features: positive, negative, neutral, and compound sentiment values. VADER’s documentation on GitHub clearly describes each metric: “The pos, neu, and neg scores are ratios for proportions of text that fall in each category (so these should all add up to be 1... or close to it with float operation). [...] The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). It is also useful for researchers who would like to set standardized thresholds for classifying sentences as either positive, neutral, or negative” (“GitHub - cjhutto/vaderSentiment: VADER Sentiment Analysis.,” n.d.). This scoring was implemented after merging the VADER sentiment data set with the full Reddit data set, in accordance with the thresholds defined on GitHub:

Table 4: VADER Sentiment Threshold Values

Sentiment Compound score

Positive >= 0.05

Neutral > -0.05 and < 0.05

Negative <= -0.05
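To illustrate this step, the following is a minimal sketch that assumes the cleaned comments are available in a CSV file with the id and body columns described in Table 1 (the file names are placeholders); it computes VADER’s scores for each comment and applies the thresholds above.

import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def classify(compound):
    """Map a compound score to a sentiment label using the thresholds above."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

df = pd.read_csv("reddit_comments_clean.csv")

results = []
# Iterate over each comment, evaluate its sentiment, and keep the comment id
# so the scores can later be merged back onto the full data set.
for _, row in df.iterrows():
    scores = analyzer.polarity_scores(str(row["body"]))
    results.append({
        "id": row["id"],
        "pos": scores["pos"],
        "neu": scores["neu"],
        "neg": scores["neg"],
        "compound": scores["compound"],
        "sentiment": classify(scores["compound"]),
    })

pd.DataFrame(results).to_csv("vader_sentiment.csv", index=False)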

Once the sentiment analysis was executed and the results were stored in a separate CSV file in Keboola Connection, the data could be merged in SQL based on a unique identifier - comment id. Besides joining the two files, the comments were classified as positive, negative, or neutral based on the threshold values determined above. The final step was to visualize the outcome of the sentiment analysis and compare it with the results of the official polls.

It is clear that the preferences of both Remain and Leave supporters are falling. On the other hand, more people seem to be undecided regarding an eventual second Brexit referendum.

Figure 6: VADER sentiment

The interpretation of the results can be tricky because the task was to determine the overall sentiment in the Brexit debate, not the users’ stance in the referendum. The comparison is therefore based on the assumption that if the sentiment of a comment is positive, the comment is rather in favor of Brexit, and if the sentiment is negative, it supports the Remain campaign. Considering this logic, the results of the sentiment analysis do not entirely reflect the outcomes of the polls. However, the trend lines, together with the associated linear equations, confirm that both the positive and the negative sentiment curves follow the same trend as the official polls. In contrast, while the official polls suggest that the share of undecided voters is slightly decreasing, the share of neutral comments grows.

8.2 Exploratory data analysis

In order to better understand the data set, it is necessary to conduct basic descriptive statistics. The data set contains both numerical and categorical data. Python’s most popular data science library
