
Faculty of Economics and Business

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently the thesis is divided up into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis, for a theoretical thesis have a look at a relevant paper from the literature):

(a) Front page (requirements see below)

(b) Statement of originality (compulsory, separate page)
(c) Introduction
(d) Theoretical background
(e) Model
(f) Data
(g) Empirical Analysis
(h) Conclusions

(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the heading of the sections. You have a free choice how to list your references but be consistent. References in the text should contain the names of the authors and the year of publication. E.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and year of publication in case of the first reference and use the first name and et al. and year of publication for the other references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number
(d) Date of submission final version

(e) MSc in Econometrics

(f) Your track of the MSc in Econometrics

Master’s Thesis

Modeling political opinion using Twitter

Author:

Dominique M. van der Vlist
Student ID: 6366554

Supervisor: N.P.A. van Giersbergen
Second reader: J.C.M. van Ophem

MSc Econometrics: Big data business analytics


This document is written by student Dominique van der Vlist, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Abstract

Modeling political opinion using Twitter by Dominique M. van der Vlist

Student ID: 6366554

Efforts have been made to exploit social media to track public opinion in several domains. In the political domain, it may be an instrument to supplement or even substitute traditional election polls, as these remain relatively expensive and time-consuming. The objective of this research is to create a learning algorithm that models Dutch election polls using Twitter data to approximate traditional polls and eventually predict vote intentions. Twitter data is collected for a period of one and a half months prior to the Dutch general elections of March 15th, 2017. Sentiment and topic features are extracted from the tweets and aggregated into a daily time series that is modeled using traditional polls as a ground-truth. Predictions made by the model reflect the polls and perform comparatively well when predicting the election outcome.


Contents

1 Introduction
2 Theoretical background
  2.1 Poll aggregation
  2.2 Related literature
    2.2.1 Preliminary work and critique
    2.2.2 More advanced methodologies
    2.2.3 Implementations of machine learning algorithms in recent work
  2.3 Sentiment analysis
3 Methodology
  3.1 Collecting Twitter data
  3.2 Variables
  3.3 Feature extraction from tweets
    3.3.1 Joint topic and sentiment model
    3.3.2 Preprocessing of tweets
    3.3.3 Application of sentiment-LDA
  3.4 Model
    3.4.1 Dataset
    3.4.2 Neural network
4 Data
  4.1 Descriptive analysis
  4.2 Topics and sentiment
5 Results
  5.1 Feature subset selection
  5.2 Model output
  5.3 Predicting election outcome
  5.4 Sensitivity to feature selection and alternative models
6 Discussion
Bibliography
A R code
  A.1 Scraping Twitter - By keyword
B Python code
  B.1 Scraping Twitter - By user
  B.2 Scraping Twitter - By keyword
  B.3 Scraping TweetGenie
C sentiment-LDA
  C.1 sentimentLDA.py
D RMSE and MAE of OLS models based on Equation 5.1


1. Introduction

Social media has been an attractive instrument for researchers to quantify public opinion. The social media platform Twitter in particular is popular for people to broadcast their opinions, thoughts and ideas because of its accessibility and anonymity. As this results in an extensive, publicly available real-time information stream, the question arises whether we can exploit this data to quantify the opinion of the masses. Several applications have become popular topics of research in this field, for example measuring brand popularity or consumer confidence, predicting stock values, but also measuring political opinions.

This research examines an approach for the application of social media analytics in the political domain. The traditional method of polling to determine the vox populi has received severe criticism after recent elections. One cause of this is Donald Trump's victory in the U.S. general elections of November 2016, despite him being behind Hillary Clinton in nearly all opinion polls. Also in the more recent general elections of the United Kingdom on June 8th this year, the pollsters predicted Theresa May to receive a majority, which was not realized. These recent failures in prediction have eroded the confidence people have in pollsters.

Social media has often been examined as an alternative to traditional election polls to predict election outcomes. A motivation for this is that traditional opinion polling is comparatively expensive and time consuming, as it is typically carried out by questionnaires, surveys or telephone calls. An advantage is that the sample being questioned is set up to be a representative sample of the population. To minimize any sample selection bias, large panels are recruited consisting of the right numbers of the educated, the young and so on. To balance between effort and good results, the samples are often fairly small, which may be disadvantageous to the results. Another difficulty is that the number of people willing to participate in questionnaires is plummeting. To emphasize: in 1980, 72% of Americans responded to a phone call that attempted to infer their political preferences. That share had fallen to 8% by 2012, and kept falling to less than 1% last year (The Economist, 2017, June 17th).

Thus, alternatives for opinion mining such as Twitter-based election polling become an attractive subject to explore. The main objective of this research is to examine how we can exploit Twitter to approximate traditional polls, in order to supplement or even substitute these to measure real-time vote intentions. The question that arises is twofold: (1) how can we model political opinions using Twitter data, and (2) can we accurately predict the election result using this model?


Most existing studies call for methodological adjustments and improvements, as predictions based on Twitter data often still lag behind traditional polls in accuracy. A recent publication attempted to predict the Dutch general elections of 2012 using merely the relative numbers of party mentions. Their prediction was worse than that of traditional polls (Sanders and Van Den Bosch, 2013). This research proposes an improvement to their approach, applied to the Dutch general elections of 2017, taking into account suggestions made in previous research.

The data was collected for a period of one and a half months prior to the Dutch general elections on March 15th, 2017. Twitter was scraped for this period by searching for a small set of manually defined election-related keywords. Additional user demographics were obtained by consulting a publicly available prediction algorithm. The data is then preprocessed to reduce noise in the dataset before extracting features that can be used as input variables for the final models. The poll data used to train the models originates from a Dutch poll aggregation service and is obtained as a daily time series.

Extracting public opinions from the tweets provides a challenge for both econometrics and computational linguistics. Twitter messages consist of only 140 characters at maximum and are mostly unstructured, informal and contain a lot of unusable features such as hashtags and URLs. This limits the applicability of predefined lexicons and/or classification models for sentiment analysis. An extension of the LDA topic model that accounts for the dependency between sentiment and discussed topics is applied for the purpose of simultaneous sentiment detection and topic modeling. This information is aggregated into a daily time series of Twitter-based features per party. These features can serve as input variables for any learning algorithm, such as a neural network. The network is trained using the polls as a ground-truth and evaluated by its predictive performance on a test set. Finally, the model is used to predict the election outcome.

The remainder of this thesis is organized as follows. First, Chapter 2 discusses preceding literature on the topic as well as a theoretical background on sentiment analysis. Chapter 3 then elaborates on the methodology applied in this research. In Chapter 4, the data is presented accompanied by some descriptive statistics. The resulting models are presented in Chapter 5, and Chapter 6 interprets the results in the form of a conclusion and provides suggestions for future research.


2. Theoretical background

This chapter first elaborates on the aggregation technique employed to obtain the poll data used in this research. Then previous literature on the topic is comprehensively discussed. The chapter ends with a discussion of previous approaches to sentiment analysis.

2.1 Poll aggregation

Opinion polls have been used for several years to obtain preliminary insights into election outcomes worldwide. Many people and institutions try to quantify political opinions of the population through surveys and questionnaires. Aside from these independent initiatives, polling aggregators emerged that combine election polls published by others, with the objective to provide a better prediction of the election outcomes. An example of such a poll aggregator for Dutch election polls is Peilingwijzer, which is the source of the poll data used for this research. This website, founded by Tom Louwerse, aggregates polls of six well-known polling institutes: the Politieke Barometer by Ipsos, De stemming by EenVandaag, Kantar Public by TNS Nipo, Peil.nl by Maurice de Hond, I&O Research and the LISS-panel. These polls all use questionnaires and publish on a weekly to monthly basis, yet more frequently close to the election day. The sample sizes these pollsters use vary between N = 1.000 and N = 4.000. The polls are combined using a methodology that accounts for uncertainty in multiple ways, elaborated on in this section.

Peilingwijzer uses a Markov Chain Monte Carlo (MCMC) method to obtain estimates of the percentage of the population that votes for each party separately. The model is based on the following distributional assumptions for all polls i:

P_i ∼ Unif(Y_i − z, Y_i + z),
Y_i ∼ N(M_t, F_i D),
M_t = A_t + H_{b_i},
A_t ∼ N(A_{t−1}, τ)        (2.1)

In the first equation, P_i represents the outcome of poll i. This is a percentage drawn from the uniform distribution with an interval width of 2z, which is the margin that arises by converting percentages to a number of seats. As there are 150 seats in the Dutch House of Representatives, the underlying percentage may be a third of a percentage point higher or lower than the reported value P_i. The underlying percentage is represented by Y_i and is assumed to be drawn from a normal distribution, as stated in the second equation. The standard deviation is assumed to be the product of a predetermined error margin and a design effect, which scales this margin and is estimated by the model. The error margin depends on the sample size and the mean percentage of votes the parties get. Its mean M_t is denoted in the third equation and comprises the sum of two elements. The first element is a percentage A_t drawn from a distribution that is dependent on past realizations (assumed to be the realized percentage of the population voting for that party), denoted in the last assumption. The standard deviation τ is included so that the model acts like a random walk between two consecutive polling dates. The second element is the so-called house effect H_{b_i} of the pollster b that published poll i. House effects denote the structural bias some pollsters have towards a certain outcome, as some of the methods used in these polls have shown to result in systematically different estimates. These differences are assumed to sum to zero for every party over the six different polling sources and do not have to be constant over all polls of that source, as methodologies might have changed over time. Peilingwijzer uses all historic poll data to determine the house effects.
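To make the structure of Equation (2.1) concrete, the sketch below forward-simulates the assumed data-generating process in Python. The sample size of 1.500 respondents, the design effect of 1.2 and the value of τ are illustrative assumptions, and the actual Peilingwijzer estimation inverts this model with MCMC rather than simulating from it.

```python
import numpy as np

rng = np.random.default_rng(0)

T, n_polls, n_pollsters = 60, 40, 6
tau = 0.002                                        # assumed day-to-day random-walk standard deviation
house = rng.normal(0.0, 0.005, size=n_pollsters)
house -= house.mean()                              # identifying restriction: house effects sum to zero

# Latent support A_t follows a random walk between polling dates
A = np.empty(T)
A[0] = 0.20
for t in range(1, T):
    A[t] = A[t - 1] + rng.normal(0.0, tau)

day = rng.integers(0, T, size=n_polls)             # day each poll i is fielded
pollster = rng.integers(0, n_pollsters, size=n_polls)
design_effect = 1.2                                # D, scales the predetermined error margin
margin = np.sqrt(A[day] * (1 - A[day]) / 1500)     # F_i for an assumed sample of 1.500 respondents

M = A[day] + house[pollster]                       # M_t = A_t + H_{b_i}
Y = rng.normal(M, design_effect * margin)          # Y_i ~ N(M_t, F_i * D)
z = 1 / 300                                        # a third of a percentage point, from rounding to whole seats
P = rng.uniform(Y - z, Y + z)                      # reported poll outcomes P_i
```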

This methodology for aggregating polls gives rise to some criticism. First, the strong assumption the model imposes that the polls are random samples is questionable. As most pollsters only use small sample sizes, it is uncertain whether the samples are completely random. This is partly accounted for by the design effect. Another assumption that can cause bias is that the house effects need to sum to zero. This is not very realistic in practice: if one pollster underestimates the number of votes for party x, another pollster does not necessarily overestimate the number of votes for this party by the same amount.

2.2 Related literature

Predicting opinions based on social media usage has been a popular topic in research and widely discussed in the political context in recent years. The existing literature traces back to 2010. This section forms a guide through this body of literature.

2.2.1 Preliminary work and critique

Some of the first to publish on Twitter analysis in the political domain with promising results are Tumasjan, Sprenger, Sandner, and Welpe (2010). Their work is inspired by previous research that analyses other social media platforms like Facebook and blogposts, or even traditional press. They claim that the relative number of party mentions in German tweets reflects election results in Germany with an accuracy close to the traditional election polls. They also find joint mentions of parties correspond to political ties and coalitions. Around that same period, O'Connor, Balasubramanyan, Routledge, and Smith (2010) published similar results, yet also introduced sentiment analysis in this context using a lexicon of words to classify positive and negative tweets. Sentiment ratios are calculated by counting the number of positive relative to the number of negative tweets. As the day-to-day sentiment ratios turn out to be highly volatile, much more than the 'off-line' polls, the authors generate a moving average over k days to derive a more consistent signal. They did not attempt to predict the actual results, but report no significant correlation between public opinion measures from election polls and the moving averages of sentiment of contemporaneous tweets.

These publications were then critiqued by Gayo-Avello, Metaxas, and Mustafaraj (2011). Opposing the results found by O'Connor et al. (2010) and Tumasjan et al. (2010), these authors argue Twitter data does not predict results of the congressional U.S. elections better than chance. They reason this by saying Twitter users are non-representative, while traditional polling techniques enable random sampling from the population. Furthermore, they mention that simple lexicon-based sentiment analysis methods as performed by O'Connor et al. (2010), applied to political conversations, need to be further researched and fine-tuned. This suggests researchers need to explore alternative methods for sentiment analysis as well as correct for the demographics of Twitter users in some way. These two suggestions form a basis of following publications on the subject.

2.2.2 More advanced methodologies

The first to use more advanced sentiment analysis methods are Bermingham and Smeaton (2011). They introduce supervised machine learning methods to improve sentiment analysis in their model for the Irish General Elections and achieve an MAE of 5.58% against actual electoral results, which is still uncompetitive with traditional polls that achieved an MAE of 1.08%. A second extension to sentiment analysis is the distinction they make between inter- and intra-party sentiment. After the inter-party polar sentiment is evaluated as the number of positive (negative) tweets relative to the total number of positive (negative) tweets each day, they determine intra-party sentiment as well by calculating a ratio of positive against negative sentiment for each party. They then combine these measures with the proportion of party mentions by fitting a linear regression with poll data as the dependent variable. Afterwards, they evaluate the Twitter-based prediction against polls as well as the final election outcomes.

Sang and Bos (2012) combine sentiment scores with the number of party mentions for the Dutch Senate elections of 2011. Until this time, none of the research methodologies included data cleansing prior to analysis. The authors describe removing all ambiguous tweets (i.e. tweets that contain more than one party name) and considering only the first tweet which mentioned a single political party for each user. All negative tweets are removed as well. Party-specific sentiment scores, determined by a classifier relying on a corpus of manually labeled tweets, are then used to weight the number of party mentions. Additionally, the differences between the predicted number of seats using Twitter and as reported by the polls are used to re-weight iteratively, in an attempt to de-bias the data according to political leaning. These weighting choices improve their prediction of the election outcome. They also argue for weights based on user demographics to account for dissimilarities in the demographic distribution of Twitter users compared to the demographics of Dutch voters. An exemplification of this is that user studies revealed senior citizens are underrepresented on the Internet, yet this group has a relatively large turnout in elections (Fox, 2010). Demographic weights are not included in their analysis as they claim to be unable to do so, since age and gender are not reported on Twitter user profiles. The authors report no evaluation metric for predictive performance, but Gayo-Avello (2013) computed the MAE as 1.33% in his meta-analysis. A comment is added that this score is comparable to traditional polls, yet this baseline is missing in the article.

To address the issue of Twitter users not necessarily being a representative sample of the population that is eligible to vote, both Dwi Prasetyo and Hauff (2015) and Sanders, De Gier, and Van Den Bosch (2016) extend the methodology of combining counts and sentiment measures by correcting for the demographic distribution. Dwi Prasetyo and Hauff (2015) reduce bias by assigning higher weights to tweets from females (who have a lower prevalence on Twitter) and from underrepresented locations. They find correcting for location improves prediction accuracy (MAE reduces from 3.3% to 1.99%), yet gender-based de-biasing does not yield much improvement (MAE reduces slightly from 3.3% to 3.21%). Similar to this, Sanders, De Gier, and Van Den Bosch (2016) find a slightly increased correlation between their Twitter-based prediction and the election outcome (0.84 to 0.86) after correcting for gender and age. In both papers, the counts are weighted based on the 'true' population distribution, obtained from voter panel surveys executed by external institutions.

To recapitulate, authors applied more advanced methods for sentiment detection, comprising automatic polarity classifiers trained with supervised machine learning algorithms, and incorporated sentiment in the predictions in various ways. Some publications corrected for the non-representativeness of the Twitter sample. Predictions are still only based on descriptive measures such as the number of party mentions, or on linear regression.

2.2.3 Implementations of machine learning algorithms in recent work

In recent years, machine learning techniques have grown in popularity in many fields of research because of their wide applicability in predictive analytics. Research that utilizes this to model Twitter data for the purpose of predicting election outcomes is discussed below.

Tsakalidis, Papadopoulos, Cristea, and Kompatsiaris (2015) incorporate daily Twitter-based features such as sentiment shares and party mentions into time series and combine them with opinion poll data for predicting the 2014 EU elections in Germany, the Netherlands and Greece. Their Twitter-based features comprise various sentiment scores and the volume of party mentions, which means demographics are not considered. The authors use a seven-day moving average filter for the features to smooth the values, as performed in the past (O'Connor et al., 2010). They argue there is no complete polling aggregation service available, which is why they manually combined different polls and linearly interpolated these. This resulted in a daily time series of one poll-based feature (the polling results) and eleven Twitter-based features they use as input for their machine learning algorithms that are trained to predict the election outcome. The authors employed four different models, on each political party separately. These are linear regression, Gaussian process, sequential minimal optimization and support vector regression. The latter was disregarded due to poor performance, while the predictions of the other algorithms are averaged for every party, resulting in the final estimates. They report estimates of several models that are restricted versions of the model that uses all Twitter-based and poll-based features. The lowest MAE achieved is 1.80, whereas the traditional polls reach 1.63.

The most recent work on the topic is by Beauchamp (2017). He proposes training a model with historical polls as a function of Twitter features: the top 10.000 unigrams extracted from the tweets on a daily basis. To increase the sample size for out-of-sample prediction, he proposes to train the models over a moving window of m days prior to the day t one wishes to predict. The author makes use of multiple machine learning algorithms: random forests, support vector machines and elastic nets. However, none of these are especially suited for temporal data structures, which is why he proposes a new linear regularization feature-selection model that combines aspects of the aforementioned models, incorporating time effects and state fixed effects. The results show the best predictive performance for the last model when time effects and state fixed effects are included (MAE of 1.27%).

In summary, existing literature shows promising results on using Twitter as a source of information to predict election outcomes. In its simplest form, the social media platform can be used by counting instances of party name mentions. Incorporating sentiment obtained with machine learning techniques by weighting the counts, and correcting for demographics, seems to result in some improvement. A more complex model trained with sentiment measures and additional features using a machine learning algorithm can improve results. Accounting for the time series nature of the data and incorporating fixed effects seems to be beneficial.

2.3 Sentiment analysis

To go into further detail on sentiment detection, this section summarizes some applications of sentiment analysis on tweets.

Preliminary literature employs mostly simple lexicon-based techniques to determine the sentiment polarity of tweets (O'Connor et al., 2010). Multiple publications on the application of supervised machine learning algorithms for tweet sentiment analysis then arise. Most of these rely on manually labeled tweets as training sets, not specifically related to the political domain. Examples of classifying algorithms are Naive Bayes, Support Vector Machines and maximum entropy classification, reaching different levels of accuracy (Pang, Lee, and Vaithyanathan, 2002; Durant and Smith, 2006; Barbosa and Feng, 2010; Neethu and Rajasree, 2013). Gayo-Avello (2013) argues applying sentiment analysis methods with a specific focus on the political domain can improve predictive performance remarkably.

A first publication in the political domain is by Bermingham and Smeaton (2011). Their Naive Bayes classifier is trained using a manually annotated sentiment training set to retrieve the aforementioned sentiment scores. Jahanbakhsh and Moon (2014) follow a similar approach, finding that removing certain 'noise' from the tweets, defined as URLs, "@somebody", "RT", hashtags, numbers, stopwords, punctuation and capitalization, significantly improves accuracy. Another focus in their research is the extraction of discussed topics over certain time periods using the unsupervised Latent Dirichlet Allocation (LDA) model with Gibbs sampling. They motivate this by stating that discovering popular political topics discussed on social media can be beneficial to political campaign leaders. Results show that popular off-line topics are in correspondence with topics extracted from tweets using the LDA algorithm, which suggests topic modeling can be used to find subjects that reflect what is important to the public.

Dwi Prasetyo and Hauff (2015) propose training their Naive Bayes sentiment classifier using tweets that contain positive/negative emoticons to determine the sentiment label. Ibrahim, Abdillah, Wicaksono, and Adriani (2015) follow this method, yet the difference is that they analyze sentiment polarity at the sub-tweet level. They partition a tweet into several sub-tweets based on pre-specified delimiters and determine the sentiment towards the concerning candidate's name for each individual sub-tweet. This may imply that determining sentiment towards a topic in a tweet, as opposed to the sentiment of a tweet as a whole, results in better predictions.

It is thus debated in past work that sentiment analysis using classification is domain-dependent and that training a classifier on a labeled corpus of non-political tweets does not yield optimal results (Sanders and Van Den Bosch, 2013; Tsakalidis et al., 2015), causing researchers to manually label a corpus of political tweets, which can be time consuming and somewhat tedious. For this reason, authors argue for lexicon-based approaches (Tsakalidis et al., 2015), even though it is debated that lexicon-based sentiment analysis only performs slightly better than a random classifier (Gayo-Avello, Metaxas, and Mustafaraj, 2011). These lexicons have the advantage that they can easily be translated to other languages for the purpose of training a classifier.

In summary, sentiment classifiers are trained using a corpus of labeled (political) tweets, either manually labeled or classified according to the use of emoticons, or alternatively a lexicon of words is used. Preprocessing tweets by removing noise from the data improves classification accuracy. It is found that topics that reflect off-line discussions can be extracted from tweets. It can be interesting to combine this finding with determining sentiment at the sub-tweet level, by determining sentiment at the topic level.


3. Methodology

This chapter elaborates on the methods applied to perform the analysis. The first section covers how the raw dataset is configured, followed by a description of the algorithm used to obtain the desired sentiment and topic features. The last section proposes how the variables and features are transformed into a time series and how these data are modeled to predict election results.

3.1 Collecting Twitter data

The data used for this research are gathered for a time period of one and a half months prior to the Dutch election day on the 15th of March, 2017. The polling data are retrieved from Peilingwijzer, which has been discussed in Chapter 2. Part of the Twitter data is scraped directly from the website, and the other part is obtained via Twitter's API.

Retrieving a large dataset from Twitter is tedious, as one cannot simply consult Twitter's API and search for specified keywords an unlimited number of days back in the past. Therefore, as a first step, an online database that stores IDs of tweets written in the Dutch language is consulted. This database, named TwiNL, stores approximately 40% of all Dutch tweet IDs since December 16, 2010 for education and research purposes. The database assembles these IDs by exploiting Twitter's API in two ways. The first is to continuously search for common Dutch keywords which are not frequently used in other languages. For the exact list of words we refer to Tjong Kim Sang and Bosch (2013). The second approach is to gather all messages from a monthly updated ranked list of 5.000 users who posted in Dutch most frequently in the previous month, based on the tweets collected using the first approach. These methods seem to result in a random sample of the Dutch tweets, and thus there is no reason to believe we have to deal with selection bias or find another way to obtain the tweets.

The queries used to search the database are reported in Table 3.1. All queries are case insensitive. The parties included in the queries are the 15 top-ranked parties based on the share of votes they received on election day. The parties' abbreviations are believed to be more popular on Twitter, but complete party names and some common variations are also searched for. In Dutch politics, every party has a top candidate (in Dutch: lijsttrekker) that is almost always the party's political leader. This person attracts most media attention prior to the elections.



Table 3.1: Search queries used to search the Twitter database

Party | Query | Number of tweets
VVD | "vvd" OR "volkspartij voor vrijheid en democratie" | 212.255
 | "mark rutte" OR "markrutte" OR "rutte" | 242.808
PvdA | "pvda" OR "partij van de arbeid" OR "partij vd arbeid" | 124.411
 | "lodewijk asscher" OR "lodewijkasscher" OR "asscher" | 52.381
PVV | "pvv" OR "partij voor de vrijheid" | 341.706
 | "geert wilders" OR "wilders" | 296.033
SP | "sp" OR "socialistische partij" | 75.638
 | "emile roemer" OR "roemer" | 27.231
CDA | "cda" OR "christen democratisch appel" | 112.720
 | "sybrand buma" OR "buma" | 55.159
D66 | "d66" OR "democraten 66" | 113.279
 | "alexander pechtold" OR "pechtold" | 44.227
CU | "cu" OR "christenunie" | 24.134
 | "gertjan segers" OR "segers" OR "gert jan segers" | 6.339
GroenLinks | "gl" OR "groenlinks" OR "groen links" | 86.221
 | "jesse klaver" OR "klaver" | 65.826
SGP | "sgp" OR "staatkundig gereformeerde partij" | 25.266
 | "kees van der staaij" OR "van der staaij" OR "vd staaij" | 4.973
PvdD | "pvdd" OR "partij vd dieren" OR "partij voor de dieren" OR "partij van de dieren" OR "partijvandedieren" OR "partijvoordedieren" | 36.776
 | "marianne thieme" OR "thieme" | 14.716
50plus | "50plus" OR "50 plus" OR "vijftigplus" | 22.173
 | "henk krol" OR "krol" | 24.820
VNL | "voor nederland" OR "vnl" OR "voor nl" OR "voornederland" | 33.541
 | "jan roos" | 15.219
DENK | "DENK" | 36.781
 | "tunahan kuzu" OR "kuzu" | 27.347
FvD | "fvd" OR "forumvoordemocratie" OR "forum voor democratie" | 49.416
 | "thierry baudet" OR "baudet" | 24.789
PP | "pp" OR "piratenpartij" OR "piraten partij" | 12.472
 | "ancilla van de leest" OR "van de leest" OR "vd leest" | 1.370
Total | | 2.160.611

Therefore, the top candidates' names of each party are also used as keywords. This introduced some problems, as the search option did not allow for symbols such as hyphens (as in "gert-jan segers") or mathematical signs (a common abbreviation of "50 plus" is "50+"). Another consideration is the ambiguity in some of the names: the last name of "Jan Roos", who is the top candidate of "VNL", is the Dutch word for rose, and party name "DENK" is also the Dutch imperative of to think, which is why this query is made case sensitive. No effort was made to find misspelled names as the obtained sample is already fairly large.

After collecting all tweet IDs using these queries on the time period from February 1st until March 15th, they were used to scrape Twitter to collect a set of variables discussed in the next section. Three scripts were created for scraping Twitter. The R code created for this purpose can be consulted in Appendix A. To possibly gain efficiency, a Python script was developed and executed in parallel. The Python scripts can be found in Appendix B. The last script in Appendix B is written to retrieve tweets of specific users, which is used to scrape the parties' tweets and those of the top candidates from their own Twitter accounts. For the Python scripts, Twitter's API is used, whereas the R program uses regular webscraping.² Scraping resulted in a dataset of 1.414.932 tweets in total, as some users have deleted their posts or accounts in the meantime.

² The reason for this is that it is only possible to consult Twitter's API up to 7 days in the past when searching for specific keywords. As the initial approach was to scrape Twitter directly without obtaining tweet IDs from TwiNL, this caused a problem. It turned out to be troublesome to scrape Twitter directly, which is why the tweet IDs from TwiNL were used instead.

3.2 Variables

For the outcome variable, the day-to-day poll result as determined by Peilingwijzer is assumed to be the ground-truth on which the model is trained. This variable is documented as a daily time series of the expected share of votes for each party individually and is denoted as Poll_{it} for party i at day t.

The set of variables saved from Twitter can be sub-categorized into tweet-specific and user-specific variables. The following variables, considered to be the tweet-specific variables, were saved from each Twitter page that was consulted via the corresponding tweet IDs. The first variable, Party, denotes the party the content of the tweet is about. This variable is assigned according to the keywords used to find the tweet, which allows tweets containing multiple party mentions to occur more than once in the dataset. When the keyword is a top candidate, the party he or she represents is reported. This variable is used for identification purposes of the tweets. Second, the timestamp of the tweet is stored, containing both the date and time the tweet was posted. The date is used as a time index to construct a time series from the Twitter information. Third is the actual tweet, from which sentiment and topic features are extracted as discussed in the next section. From these tweets, hashtags and URLs are extracted and stored separately as well.

A set of user-specific variables corresponding to the user posting the tweet was stored simultaneously. This set includes username, fullname, biography, location and followerscount. The first is used for identification purposes. The others are disregarded for the analysis. Full name and biography are not informative on people's voting behavior, and location is too often not (accurately) provided by the user. A user's followers count could perhaps be used to weight the tweets. However, this is deliberately ignored as every vote counts equally during elections, reflected by every user's tweets weighting equally. Additional demographic variables such as age and gender are not that straightforward to obtain from Twitter. However, these are retrieved from TweetGenie, an external source (Nguyen, Trieschnigg, and Meder, 2014). TweetGenie is an online machine learning algorithm that automatically assesses the age and gender of a user. The algorithm uses the 200 most recent tweets of a user to predict age and gender by their use of language, based on a Dutch training set of users with known ages and gender. One should be aware of the measurement bias this method introduces to the demographic variables. Demographic variables are exploited as it is believed that Twitter users are a non-representative sample of the population. To account for this, we have to determine these variables, following the literature (Sang and Bos, 2012; Dwi Prasetyo and Hauff, 2015).

Aside from these variables obtained using the tweet IDs from the TwiNL database, the number of likes and retweets from the verified accounts of the parties and top candidates is also gathered. However, it should be noted that these numbers do not necessarily represent the support a tweet or user has gotten that day, but the support since that day, causing an upward bias in this variable as there is always a probability users will like or retweet later in time. We deliberately look at the number of likes instead of the number of followers, as it is likely many people also follow politicians they do not necessarily support, just to stay updated in general. It is much less likely that users like tweets they do not support and thus, indirectly, of people they do not support.

3.3 Feature extraction from tweets

The following step is to extract information from the textual data that can be used as input variables for the election poll model. A common feature to extract is a sentiment polarity score. Additionally, topic modeling can be applied to obtain trending topics, which constitute a less common feature to extract in the Twitter context, yet it has potential predictive power for modeling election polls as mentioned in Chapter 2. The procedure to obtain features from the tweets is discussed in this section.

3.3.1 Joint topic and sentiment model

In conclusion of the literature review in Chapter 2, lexicon-based sentiment analysis may be preferable to supervised learning methods with manually labeled tweets. Another suggestion from previous research was to incorporate topics in sentiment analysis, which implies the use of a simultaneous topic and sentiment analysis may be desirable. Li, Huang, and Zhu (2010) propose a joint topic and sentiment model based on the LDA topic model, with a given sentiment lexicon as prior sentiment knowledge. This subsection is dedicated to a brief theoretical explanation of this model. For a more complete explanation, we refer to Li, Huang, and Zhu (2010). Intuitively, simultaneous analysis is appropriate as one word can have different sentiment polarities in different domains or towards different topics. An intuitive example of this can be given with the word 'complex': a book review, for example, can contain the sentence 'the storyline is complex and unpredictable' for which the sentiment is positive, whereas the sentence 'it is hard to use such a complex camera' can imply negative sentiment when used in a camera review. As such, these simultaneous models have shown promising results for sentiment classification, with a simple lexicon-based approach as a baseline (Li, Huang, and Zhu, 2010; Lin and He, 2009). The alternative would be to determine sentiment of the entire document and extract topics independently, whereas these methods analyze sentiment at the topic or domain level of a document. Independence over the sentiment polarity of the words in a document is still assumed; another extension of this model is to allow dependence on the local context (Li, Huang, and Zhu, 2010). However, this is beyond the scope of this research.

[Figure 3.1: Graphical models of LDA (l) and sentiment-LDA (r).]

The joint model applied here is the sentiment-LDA model as proposed by Li, Huang, and Zhu (2010). This model is an expansion of the Latent Dirichlet Allocation (LDA) model by adding a sentiment layer. This sentiment layer is associated with the topic layer, and words are associated with both sentiment labels and topics, as visualized in the graphical models in Figure 3.1. This way, sentiment-LDA can classify both the overall sentiment of a document and the sentiment polarity for each topic.

The generative process of sentiment-LDA is defined in a similar way as that of regular LDA, only including the aforementioned sentiment layer. The model assumes a corpus consists of M documents,

C = {d_1, d_2, ..., d_M},

each of which consists of a set of N_d words,

d = {w_1, w_2, ..., w_{N_d}}.

The generative process of a word w in document d comprises three steps. First, a topic z is chosen from K topics, with a distribution θ_d that is specific for that document. The topic distribution has a Dirichlet prior with hyper-parameter α (the topic proportion of a document; a high α implies it is likely that the document contains a mixture of most of the topics). Similar to this, a sentiment label l is chosen from a topic- and document-specific sentiment distribution π_{d,z} with a Dir(γ) prior distribution. In the last step, the actual word is chosen from the topic- and sentiment-specific distribution φ_{z,l} with a Dirichlet prior with hyper-parameter β (Figure 3.1).

The authors propose a sentiment lexicon as prior sentiment knowledge for the model (Li, Huang, and Zhu,2010). This prior knowledge is used for Bayesian inference during initialization and also in following iterations: each word token in the corpus is classified according to the sentiment lexicon. If a word is not present in the lexicon, its sentiment label is randomly sampled.

The inference algorithm adopted is the Gibbs sampler, a Markov Chain Monte Carlo (MCMC) algorithm. A Markov chain is defined whose stationary distribution is the posterior of interest. Independent samples are then collected from the stationary distribution, which are used to approximate the posterior distributions. The model parameters θ, ϕ and π can now be estimated, which can be used to extract the desired features.

First, the K most probable words for each sentiment label and each topic can be determined by calculating φ_{z,l}, the word distribution specific for topic z and sentiment label l. This provides us with a probability of each word w occurring:

φ^{(w)}_{z,l} = Pr(w | z, l) = (n^{(w)}_{z,l} + β) / (n_{z,l} + V·β),        (3.1)

with n^{(w)}_{z,l} the number of times word w is assigned to topic z and sentiment label l, and V the number of distinct terms in the vocabulary. To obtain the top K words, we simply order the words according to their probability of occurring for all sentiment labels and each topic.

Similarly, we can obtain probabilities for the topics expressed in the documents:

θ^{(z)}_{d} = Pr(z | d) = (n^{(z)}_{d} + α) / (n_{d} + K·α),        (3.2)

with n^{(z)}_{d} the number of times words in document d are assigned to topic z, n_{d} the total number of words in the document, and K the number of topics. Ordering the topics in descending probabilities gives us the most likely topics in the document.

And thus the probabilities of the sentiment polarity of the topics can be calculated as

π^{(l)}_{z,d} = Pr(l | z, d) = (n^{(l)}_{z,d} + γ_l) / (n_{z,d} + Σ_{l=1}^{S} γ_l),        (3.3)

with n^{(l)}_{z,d} the number of times words in document d are assigned to topic z and sentiment label l, and S the number of sentiment labels. The sentiment polarity of a topic is determined by the label that has the highest probability.

Finally, we can determine the overall sentiment polarity of a document by summing, over all topics, the probabilities of sentiment label l occurring in those topics:

Pr(l | d) = Σ_{z=1}^{K} Pr(l | z, d) · Pr(z | d) = Σ_{z=1}^{K} π^{(l)}_{z,d} · θ^{(z)}_{d}.        (3.4)

The sentiment label with the highest probability given the document d represents the polarity assigned to that document.

3.3.2 Preprocessing of tweets

Before running the algorithm, it is common practice to preprocess the textual data. As our research concerns tweets that contain 140 characters at maximum, often including hashtags and URLs, preprocessing is of huge importance as we want to be able to extract as much information as possible from the short messages. This is achieved with several consecutive steps, inspired by previous research mentioned in Chapter 2. First, it is decided to keep all retweets, as these can be interpreted as people sharing similar thoughts and opinions they would also send from their own accounts. However, as retweets contain the signal 'RT' at the beginning of the tweet, this is removed. Other signaling words that are removed from the actual text are @somebody when the tweet is a reply, and URLs when present.

To reduce bias, tweets that are posted by party members are deleted entirely. The reason for this is that they always advertise for their own party, plausibly with the intention to steer social media opinions or even statistics (Sanders and Van Den Bosch, 2013). A selection is made from the list of Twitter-verified accounts in the dataset. A list of candidates published for the general elections of 2017 is used to filter out party members.³ Another set that is removed consists of tweets containing characters that do not appear in the Dutch alphabet, as these should not have been included in the dataset to begin with. For this purpose, an observation is dropped when it contains characters that are not in one of the following Unicode character sets: Basic Latin, Latin-1 Supplement, Latin Extended A and B, IPA Extensions or Spacing Modifier Letters (evaluated after removing punctuation).

Another possible modification is to remove all ambiguous tweets (i.e. tweets that mention more than one party), as proposed by Sang and Bos (2012). However, in this research the decision is made to maintain these tweets. This results in duplicate tweets in the dataset, with a different label for which party the tweet concerns. The intuition behind keeping these tweets is that one can simply express sentiment towards both these parties simultaneously, or address a topic that concerns multiple parties in a single tweet.

³ https://www.kiesraad.nl/adviezen-en-publicaties/rapporten/2017/2/definitieve-kandidatenlijsten-tk-2017/definitieve-kandidatenlijsten-tk-2017


Lastly, we remove tweets from users that are predicted to be below the age of 18, as these users are not eligible to vote and thus their opinions should be disregarded. We did, however, include these tweets when training the sLDA model, to keep the number of observations as high as possible; this is not believed to harm the accuracy of the algorithm. The remainder of the dataset comprises 823.875 tweets.

In addition to the removal steps presented above, some typical preprocessing steps in textual analysis are performed as well. These include removing capitalization, punctuation and common Dutch stop words, and stemming of words. As words that occur in large proportions of the documents (i.e. corpus-specific stop words) do not provide much information, words that occur with a frequency higher than a threshold of 80 percent are disregarded by the algorithm. Also the keywords that are used to search the Dutch Twitter database (i.e. the party and top candidate names) are removed from the documents. Without removal of these words, the top words of each topic would consist of a (subset of) this set of words. Words that occur in only a very small proportion of the documents are automatically excluded, as only the V most frequent words are included in the vocabulary. The choice for V is explicated in the next subsection.
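A minimal Python sketch of the cleaning steps described above, assuming NLTK's Dutch stop-word list and Snowball stemmer (the thesis's own implementation is in the appendices). The keyword set is an illustrative excerpt of Table 3.1, and the corpus-specific stop-word threshold and the vocabulary cap V are omitted here.

```python
import re
import string
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Illustrative excerpt of the search keywords that are removed from the documents
PARTY_KEYWORDS = {"vvd", "rutte", "pvda", "asscher", "pvv", "wilders"}

stemmer = SnowballStemmer("dutch")
dutch_stopwords = set(stopwords.words("dutch"))   # requires nltk.download("stopwords")

def preprocess(tweet):
    """Clean a single tweet and return its stemmed tokens."""
    text = re.sub(r"^RT\s+", "", tweet)           # drop the retweet marker
    text = re.sub(r"@\w+", "", text)              # drop @somebody mentions in replies
    text = re.sub(r"https?://\S+", "", text)      # drop URLs
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split()
              if t not in dutch_stopwords and t not in PARTY_KEYWORDS]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("RT @kiezer: Mooi debat van Rutte vanavond! https://t.co/xyz"))
```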

Subsequent to these preprocessing steps, the sentiment-LDA algorithm can be executed with the tweets as input. The next section elaborates on the implementation of the model for this research.

3.3.3 Application of sentiment-LDA

The corpus for this application contains the daily tweets, which individually represent the documents. A second option would be for all tweets to compose the corpus, and for the daily tweets combined to form a document. The first approach is adopted here, so sentiment and topics can be determined for each tweet individually, providing more information to use for modeling than when sentiment and topics are determined from the aggregated tweets. However, it could be possible that tweets contain too little information on topics and sentiment, such that the sentiment-LDA model does not result in any useful features. Ideally, the model is able to extract sentiment with high accuracy, and the topics that are discussed in the tweets show some recurring patterns over time.

We now emphasize the assumptions on the model parameters. First, sentiment polarity is assumed to be binary: either 0 (negative) or 1 (positive). Second, the optimal number of topics has to be chosen, often a debatable topic. With supervised machine learning, we could calculate accuracy for different values of this parameter and use this to approximate the optimal number of topics. The original review dataset experimented with in the publication of the sentiment-LDA model reported K = 50 as the optimal value (Li, Huang, and Zhu, 2010). As we cannot measure accuracy with the Twitter dataset, alternative methods have to be considered.

A method to approximate an appropriate number of topics is the elbow method. The documents in a document-term matrix are clustered with k-means using the Euclidean distance measure, where k is varied between 10 and 120. The within-cluster sum of squares is plotted as a function of the number of clusters, and k is then chosen as the number of clusters such that adding another cluster does not add much information. The optimal number of clusters is determined by eyeballing the graph visualized in Figure 3.2.

[Figure 3.2: Within-cluster variance as a function of the number of clusters.]

The graph is based on the document-term matrix of a random subset of the total documents, as the computations for the whole dataset would require excessive computing time. This subset contains 10.000 documents and a maximum vocabulary size of 3.000 words. The graph shows that the marginal decrease in the within-cluster sum of squares does not clearly drop at any specific point, as there is no angle visible in the graph. As K = 50 is a common choice for the number of topics in LDA topic modeling, the algorithm is run for K = 50. It should be noted, however, that this may be a poor choice for the number of topics, which can harm the analysis.
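A sketch of the elbow computation under stated assumptions: the document-term matrix below is a random stand-in for the 10.000-document subsample, and scikit-learn's KMeans (which uses the Euclidean distance) reports the within-cluster sum of squares as its inertia_ attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
dtm = rng.poisson(0.05, size=(2000, 500))   # stand-in for the subsampled document-term matrix

inertia = []
ks = range(10, 121, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(dtm)
    inertia.append(km.inertia_)             # within-cluster sum of squares, plotted against k
```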

For the choice of the Dirichlet priors' hyperparameters α, β and γ, we consider the properties of the Dirichlet distribution. The parameters can be thought of as concentration parameters for this distribution; they determine how concentrated the probability mass of a sample from the distribution is likely to be. Considering α: a value much less than 1 implies it is more likely that a document, meaning a tweet in the context of this research, consists of a mixture of just a few topics, whereas a value much greater than 1 implies the probabilities of topics occurring are more evenly spread. The parameter is typically set inversely proportional to the number of topics as 50/K. However, as tweets are only short messages that probably consist of only a few of the topics, it may be more suitable to choose a smaller value for α. For this reason, α is set to 0.5.


As we cannot say anything specific on the per-sentiment per-topic word proportions in this application, the concentration parameter β is set to its default value of 200/V, with V the number of words in the vocabulary. This number has to be limited from above because of memory capacity constraints that arise when storing a d × V document-term matrix. This required the algorithm to be executed as a divide-and-conquer algorithm that saves subsets of the document-term matrix on disk and later reads the matrices iteratively into memory again. The vocabulary size is now set to 4000 words, which implies β = 0.08. Finally, the value of γ is chosen to be 1, as there is no default known and 1 is the most neutral choice.

Executing the iterations occupies excessive runtime, forcing us to limit the number of iterations of the Gibbs sampler to 30, even though the default is set to 50.

The implementation of the sentiment-LDA algorithm used as a basis for this research is obtained from GitHub (ayushjain91, 2016). The code is written to run the model for reviews written in English, so some adjustments have to be made to make the code suitable for Dutch tweets. The sentiment prior has to be modified, but also the code for stemming and removing stopwords has to be adjusted.

The union of two sentiment lexicons is used as the prior sentiment knowledge. First, a Twitter-specific Dutch sentiment lexicon is deployed that is enriched with cyber-slang words and emoticons (Sang, 2014).⁴ When a word does not occur in this lexicon, the well-known Dutch version of the natural language processing API⁵ is consulted. This lexicon is consulted as a second choice, as the sentiment classifier is trained with book reviews.
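As an illustration of how such a lexicon prior can be queried, the sketch below combines a small, hypothetical excerpt of the Twitter-specific lexicon with the sentiment() function of the pattern.nl package as a fallback; whether pattern.nl is the API actually consulted is an assumption made here.

```python
from pattern.nl import sentiment as pattern_sentiment

# Hypothetical excerpt of the Twitter-specific Dutch lexicon (word -> prior polarity)
twitter_lexicon = {"geweldig": 1, "ramp": 0, ":)": 1, ":(": 0}

def prior_label(word):
    """Prior sentiment for initializing the Gibbs sampler: 1 = positive, 0 = negative, None = unknown."""
    if word in twitter_lexicon:
        return twitter_lexicon[word]
    polarity = pattern_sentiment(word)[0]   # fall back to the second lexicon
    if polarity == 0.0:
        return None                         # unknown words get a randomly sampled label instead
    return int(polarity > 0)
```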

Aside from these (fairly small) adjustments to the algorithm, a few functions have to be added, for this script only provides a method to obtain the top words for each topic and every sentiment label, as can be obtained using Equation (3.1). The added methods are threefold:

1. getTopSentiment returns the sentiment with the highest probability of occurring for each document and each topic within the document, represented by Equation (3.3);

2. getTopics returns the topic distribution of each document, sorted by descending probabilities as in Equation (3.2);

3. getOverallSentiment returns the overall sentiment of each document, i.e. the sentiment label that has the highest probability of occurring as calculated in Equation (3.4).

The complete modified code can be consulted in Appendix C.

⁴ The lexicon is obtained through contact with the author.


3.4 Model

The sLDA model returned topic distributions, sentiment labels towards each topic and overall sentiment labels for every individual tweet. This information has to be converted into features with which to train a model for the election polls. This section clarifies how we arrive at the final dataset and proposes a model for the features and variables.

3.4.1 Dataset

The dataset containing all tweets is aggregated into a daily time series for every party. This time series dataset includes the poll data. Also, the number of likes and retweets each party received on their official profiles that day are summed. Then the total number of tweets, and the number of tweets concerning each party, are counted, both weighted with post-stratified weights for user demographics and unweighted (weighting is discussed in Chapter 4).

To obtain the sentiment-related features, we follow previous literature (Tsakalidis et al., 2015; Bermingham and Smeaton, 2011). It should be kept in mind that positive sentiment is denoted as s = 1 and negative sentiment as s = 0. First, a sentiment average score for party p on day t is calculated as

sent_total_{pt} = ( Σ_{i=1}^{N_{pt}} w_i · s_i ) / ( Σ_{i=1}^{N_{pt}} w_i ),

with post-stratification weight w_i for tweet i, s_i the overall sentiment of tweet i, and N_{pt} the total number of tweets on day t concerning party p. Second, sentiment scores are calculated relative to only positive and negative tweets. This results in the following shares:

sshare^{+}_{pt} = ( Σ_{i=1}^{N_{pt}} w_i · s_i ) / ( Σ_{i=1}^{N_t} w_i · s_i ),
sshare^{−}_{pt} = ( Σ_{i=1}^{N_{pt}} w_i · (1 − s_i) ) / ( Σ_{i=1}^{N_t} w_i · (1 − s_i) ),

where N_t denotes the total number of tweets on day t. These features are inter-party sentiment scores. Bermingham and Smeaton (2011) also mention intra-party sentiment, which is calculated as

sent_intra_{pt} = log_{10} ( ( Σ_{i=1}^{N_{pt}} w_i · s_i + 1 ) / ( Σ_{i=1}^{N_{pt}} w_i · (1 − s_i) + 1 ) ).

These variables are constructed from all tweets in the dataset, and analogously for distinct demographic age and gender groups. This results in specific inter-party sentiment scores for males and females, as well as for young/middle/old aged users. The reason for creating discriminative demographic groups is the difference in voting behavior between these groups (NOS, 2017), which increases the probability that sentiment and topics can contribute differently to voting shares for different demographic groups (e.g. women that are positive towards GroenLinks can affect voting shares differently than when men are positive).
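A minimal pandas sketch of the daily aggregation of the sentiment features defined above. The column names and the three example rows are hypothetical, and the demographic-group variants follow by repeating the same computation on the corresponding subsets of tweets.

```python
import numpy as np
import pandas as pd

# Hypothetical tweet-level frame: one row per (tweet, party) pair after preprocessing
tweets = pd.DataFrame({
    "date":  pd.to_datetime(["2017-02-01", "2017-02-01", "2017-02-01"]),
    "party": ["VVD", "VVD", "PvdA"],
    "s":     [1, 0, 1],           # overall sentiment label from sentiment-LDA
    "w":     [1.2, 0.8, 1.0],     # post-stratification weight
})

tweets["pos"] = tweets.w * tweets.s
tweets["neg"] = tweets.w * (1 - tweets.s)
daily = tweets.groupby(["date", "party"])[["pos", "neg", "w"]].sum()

daily["sent_total"] = daily.pos / daily.w                   # weighted average sentiment per party/day
day_total = daily.groupby("date")[["pos", "neg"]].transform("sum")
daily["sshare_pos"] = daily.pos / day_total.pos             # inter-party positive share
daily["sshare_neg"] = daily.neg / day_total.neg             # inter-party negative share
daily["sent_intra"] = np.log10((daily.pos + 1) / (daily.neg + 1))
```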

Furthermore, the five most probable topics in each tweet for party p and day t are summed and sorted according to the number of occurrences per day. The ten most frequent topics are then stored together with their corresponding sentiment scores, calculated as the weighted average tweet and topic specific sentiment.

The numbers of retweets and likes of party p are determined relative to the other parties each day. The frequently used count measure of (weighted) party mentions is also included in the feature set as a potential input variable, both in relative and in absolute form. The 7-day moving average of relative party mentions, weighted with user demographics, is constructed as well.

These steps result in a large set of features. To reduce overfitting, only a subset of these variables is included as input variables in the model. To determine which features to include, a model is trained on each subset of features and tested on a hold-out test set, and the performance of the models trained on the different subsets is compared. Feature subsets are chosen by reasoning as well as by applying simple OLS on some of the features, as inspired by Beauchamp (2017).

3.4.2 Neural network

As an alternative to the descriptive measures used to predict election polls in previous literature, the approach applied in this research consists of training a (regression) model that fits the obtained features and variables to the poll data.

The model deployed to fit past data and predict future polls is one that has not previously been applied in this literature: a neural network. Implementation of this model can be challenging because of the large number of possible features relative to the small number of observations in the final dataset. Including too many features or a poor choice of model parameters allows the algorithm to find a set of weights and biases for which the computed outputs are exactly equal to the target values in the training set. This generally results in very low predictive performance when the model is fed a test dataset for out-of-sample testing. To avoid such overfitting, various subsets of features and choices of model parameters are evaluated using cross-validation. This generally results in a subset of features that are deemed truly informative.

Cross-validation in the context of time series requires partitioning the dataset in an alternative way, as traditional methods such as leave-p-out or k-fold cross-validation ignore the temporal ordering in the dataset. The training data is still divided into k subsets (folds), usually chosen to be 10, only now in the form of forward chaining. This implies increasing training set sizes and a constant size of the test set, to allow comparison of predictive performance between folds. Thus the history on which the model is trained becomes larger, whereas the test set stays constant in size. Aside from this adjustment to the usual construction of folds for cross-sectional data, we deploy walk-forward validation. This means we retrain the model as new data becomes available; after a prediction is made a day ahead in time, that observation is added to the training set when forecasting the next day. This can be applied using an expanding or a sliding window; the first approach is chosen in this research because of the already limited number of observations in our (training) dataset. Walk-forward validation gives the model the best opportunity to make good forecasts at each time step, but requires much computation time. This is believed not to be an issue, however, as we only have fairly few data points.

More specifically, as we observed 42 days prior to the elections, we can split the data into a training and a test set that comprise 35 and 7 days, respectively. As the validation set should also contain a sufficient number of observations, the training set is in turn split into a validation set of 7 observations and a training set for cross-validation containing the remainder. As we want 10 folds for cross-validation, this results in a training set for cross-validation of a minimum of 18 days (the window expands when forecasting) and a validation set of 7 days for the first fold, and 28 days of training versus 7 days of validation in the 10th fold. After cross-validation, the test set is used to evaluate the obtained model as a whole.
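The following is a minimal sketch, under the split described above, of how such expanding-window folds could be generated; the function and its defaults are illustrative rather than the actual implementation:

def walk_forward_folds(n_obs=42, n_test=7, n_val=7, n_folds=10, min_train=18):
    """Yield (train_idx, val_idx) index pairs with an expanding training window."""
    n_avail = n_obs - n_test            # days available for cross-validation (35)
    max_train = n_avail - n_val         # training days in the last fold (28)
    folds = []
    for k in range(n_folds):
        # training window grows from min_train in the first fold to max_train in the last
        train_end = min_train + round(k * (max_train - min_train) / (n_folds - 1))
        folds.append((list(range(train_end)),
                      list(range(train_end, train_end + n_val))))
    return folds

for train_idx, val_idx in walk_forward_folds():
    print(f"train on first {len(train_idx)} days, validate on days {val_idx[0] + 1}-{val_idx[-1] + 1}")

Within each fold, the model would then be retrained day by day over the validation window (walk-forward), as in the pseudocode of Figure 3.3.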

A feed-forward neural network with one hidden layer is constructed, trained with backpropagation without weight backtracking. Determining the optimal network architecture is crucial because of its impact on the predictive performance of the model. The process involves the daunting task of constructing a multitude of neural network topologies with different structures and parameter values before arriving at an acceptable model.

The number of nodes in the hidden layer is varied, and different activation functions are explored (hyperbolic tangent, logistic). Additional hidden layers are believed to be unnecessary, as they would merely invite overfitting with such a small dataset. The learning rate is varied between 0.01 and 0.21 with increments of 0.02.

The aforementioned procedure is summarized in pseudocode in Figure 3.3.

The optimal model that results from this procedure is trained and, using the test set, an out-of-sample prediction of the polls and eventually of the election result is made. From this, the Root Mean Squared Error (RMSE) can be computed and used as an evaluation metric.
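Purely as an illustration of this metric, a minimal numpy sketch (the arrays hold placeholder values, not actual predictions or polls):

import numpy as np

def rmse(predicted, observed):
    """Root Mean Squared Error between predicted and observed poll shares."""
    predicted, observed = np.asarray(predicted), np.asarray(observed)
    return np.sqrt(np.mean((predicted - observed) ** 2))

print(rmse([21.0, 20.4, 19.8], [21.3, 20.0, 19.5]))  # toy numbers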

The final dataset comprises only 42 observations (the 42 days prior to election day, which itself is the 43rd day). Tweets of yesterday are used to predict polls for today, resulting in the following error function for each individual party:

\[
E(w) = \frac{1}{2} \sum_{t=2}^{T} \left\lVert y_t(x_{t-1}, w) - \text{Poll}_t \right\rVert^2,
\]

with $T = 43$, $x_{t-1}$ the vector of covariates and $w$ the weights. With one hidden layer and a linear activation function for the output variables, which is generally the case for networks with the purpose of regression, this results in the following network function


Split dataset into training and test set
For each subset of features:
    For each value of hyperparameter:
        For each fold:
            Split training set into training set and validation set
            For each day in the validation set:
                Train a model with the features, hyperparameter values and training set
                Use model to predict 1 day ahead
                Remove first day of validation set and add to training set
            Store mean squared validation error
        Store mean validation error over all folds
Store mean validation error over all folds over all features
Determine which combination of features and hyperparameters returns the model with the smallest error
Run the entire training set through this model and determine accuracy with test set

Figure 3.3: Pseudocode for implementation of cross-validation.

for every output variable:

\[
y_t(x_{t-1}, w) = \sum_{j=1}^{M} w^{(2)}_{kj}\, h\!\left( \sum_{i=1}^{D} w^{(1)}_{ji}\, x_{i,t-1} + w^{(1)}_{j0} \right) + w^{(2)}_{k0}, \qquad t \in T,
\]

where $h(\cdot)$ denotes the transformation by the activation function (either logistic or hyperbolic tangent), $x_{1,t-1}, \dots, x_{D,t-1}$ the input variables (observed one day prior to the poll date), $M$ the number of nodes in the hidden layer, and $w^{(1)}_{j0}$ and $w^{(2)}_{k0}$ the bias parameters. The superscripts indicate the layer of the network.
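To make the notation concrete, a minimal numpy sketch of this forward pass for a single day is given below; the dimensions, random weights and function names are purely illustrative and do not correspond to the trained model:

import numpy as np

def forward(x_prev, W1, b1, W2, b2, h=np.tanh):
    """One-hidden-layer forward pass: y_t = W2 · h(W1 · x_{t-1} + b1) + b2.

    x_prev : features observed the day before the poll (length D)
    W1, b1 : hidden-layer weights (M x D) and biases (M)
    W2, b2 : output-layer weights (K x M) and biases (K); linear output for regression
    """
    hidden = h(W1 @ x_prev + b1)
    return W2 @ hidden + b2

# Toy dimensions: D = 4 input features, M = 3 hidden nodes, K = 1 output (one party's poll share)
rng = np.random.default_rng(0)
D, M, K = 4, 3, 1
y = forward(rng.normal(size=D),
            rng.normal(size=(M, D)), rng.normal(size=M),
            rng.normal(size=(K, M)), rng.normal(size=K))
print(y)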


4. Data

This chapter presents the dataset. The purpose of this chapter is to provide preliminary insights into the variables and to present the features that result from the sLDA algorithm.

4.1 Descriptive analysis

We first attempt to provide preliminary insights into the retrieved dataset and into how this data relates to offline political opinions and trends.

The daily political tweet frequency distribution of the full dataset is illustrated in Figure 4.1. It can be deduced from the distribution that Twitter is used as a platform to discuss politics, as there is a visible match between electoral events (i.e. television/radio debates) and the popularity of politics on Twitter that same day.

To provide insights into which parties are popular in the Netherlands, some characteristic numbers are reported in Table 4.1. These include the results of the general elections of 2012 and 2017. The proportion of tweets concerning each party in the total dataset is also reported, as inspired by Tumasjan et al. (2010), who conclude that the total number of tweets mentioning a political party can be considered a reflection of vote shares. As the table shows, ranking by tweet volume and ranking by election results do not even yield the same ordering of parties, so tweet volume does not reflect the results very accurately for this data.

The daily party mentions are also considered. Figure 4.2 presents the polls and the proportions of party mentions (including mentions of candidates of the corresponding party) on Twitter in one graph. Combining these into one figure may help us detect correlation at first glance.

The first thing that attracts attention is that the Twitter measures are a lot more volatile than the polls, which can probably be attributed to discussions on current events as reported by the media. For this reason, a seven-day moving average of the party mentions is plotted in the same figure as well. Following O'Connor et al. (2010), this value is calculated for every party p as
\[
MA_t(p) = \frac{1}{7} \sum_{k=t-6}^{t} \frac{\#\text{tweets}_k(p)}{\#\text{tweets}_k(\text{total})}.
\]
This provides a more consistent signal from the tweet proportions.
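As an illustration, the smoothed mention shares could be computed along the following lines, assuming a daily table with one column of mention counts per party (counts and column names are hypothetical):

import pandas as pd

# Hypothetical daily mention counts per party (one row per day)
mentions = pd.DataFrame(
    {"VVD": [120, 90, 150, 80, 60, 200, 110, 95],
     "PVV": [300, 250, 280, 310, 220, 400, 260, 240]},
    index=pd.date_range("2017-02-01", periods=8, freq="D"))

shares = mentions.div(mentions.sum(axis=1), axis=0)   # daily proportion of mentions per party
ma7 = shares.rolling(window=7).mean()                 # MA_t(p): mean share over days t-6 ... t
print(ma7.round(3))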

Only the graphs that show the most noteworthy trajectories are reported; the remaining parties show similar patterns. First, the proportion of daily mentions of the PVV is structurally higher than its vote share as estimated by the polls, which also appeared in Figure 4.1. This pattern is not unexpected, as this controversial party is known to attract a lot of media attention. This implies the PVV is often a popular topic of discussion, even though its share of votes may not be estimated to be as high.


Figure 4.1: Distribution of daily political tweet frequencies. Days on which electoral debates were held are highlighted in blue.

Table 4.1: Characteristic numbers of Dutch political parties

Party         Seats 2017   Seats 2012   Difference   Vote share 2017 (%, elections)   Mention share (%, Twitter)
VVD               33           41           -8                  21.3                            18.4
PVV               20           15            5                  13.1                            29.3
CDA               19           13            6                  12.4                             9.7
D66               19           12            7                  12.2                             7.4
GroenLinks        14            4           10                   9.1                             6.7
SP                14           15           -1                   9.1                             4.6
PvdA               9           38          -29                   5.7                             8.5
CU                 5            5            0                   3.4                             1.3
PvdD               5            2            3                   3.2                             0.9
50PLUS             4            2            2                   3.1                             2.1
SGP                3            3            0                   2.1                             1.3
DENK               3            0            3                   2.1                             3.7
FvD                2            0            2                   1.8                             3.4
VNL                -            -            -                   0.4                             2.2
PP                 -            -            -                   0.3                             0.6

Election data downloaded from https://www.parlement.com/id/vk1wljxti6u9/tweede_kamerverkiezingen_2017


Figure 4.2: Time plots of the share of votes as reported by the polls, and as derived from the proportion of party mentions on Twitter

Table 4.2: Correlations between $MA_{t-1}$ and $\text{polls}_t$

Party      Correlation
VVD          -0.079
PvdA          0.331
PVV           0.212
SP            0.653
CDA           0.835
D66           0.332
CU            0.011
GL            0.591
SGP           0.595
PvdD          0.725
50plus       -0.431
VNL           0.605
DENK         -0.034
FvD          -0.278
PP            0.444

This pattern also holds for some of the smaller parties (i.e. DENK, FvD, VNL and PP). Alternatively, the proportion of party mentions on Twitter can be structurally lower than the offline polls. This holds for SP, CU, CDA and PvdD, which are all relatively stable parties when comparing outcomes with the former general elections. Thirdly, the party mentions of some parties seem to fluctuate around the polls, albeit much more volatile, e.g. for VVD and PvdA. The overall trend of the smoothed seven-day moving average seems to be roughly in accordance with the pattern of the polls. The only party for which a true contradiction seems to emerge is 50plus: when the polls estimate a decrease in its proportion of votes, the party actually becomes a more popular discussion topic on Twitter relative to other parties. None of the other parties clearly displays such a contradiction.

To quantify the information illustrated in Figure 4.2, Pearson's correlation coefficients between the party mentions and the polls are calculated. Table 4.2 depicts the correlations between the moving average of party mentions at day t − 1 and the polls at day t. The correlations are measured with a one-day delay, as tweets are believed to measure real-time opinions whereas polls include some delay.


The correlations seem to be in line with what is visible in the graphs. Extremely volatile patterns such as that of the VVD correspond to correlation coefficients that are low in absolute value, and some correlations are negative, especially the coefficient of 50plus. The highest correlation found is 0.835 for CDA, which exhibits a relatively stable pattern in both the polls and the party mentions on Twitter. In summary, the correlations vary a lot between parties, even reaching fairly large negative values. This causes us to question previous research proposing that relative party mentions can be used as an alternative to election polls.
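A short sketch of how such lagged correlations could be computed from the smoothed mention shares and the poll series (both daily, one column per party; all values below are made up):

import pandas as pd

def lagged_correlations(ma, polls, lag=1):
    """Pearson correlation between MA_{t-lag} and polls_t for every party (column)."""
    return {party: ma[party].shift(lag).corr(polls[party]) for party in polls.columns}

# Toy example with two parties over ten days
idx = pd.date_range("2017-02-01", periods=10, freq="D")
ma    = pd.DataFrame({"CDA": range(10), "VVD": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]}, index=idx, dtype=float)
polls = pd.DataFrame({"CDA": [x + 1 for x in range(10)], "VVD": [2, 2, 3, 3, 4, 4, 5, 5, 6, 6]},
                     index=idx, dtype=float)
print(lagged_correlations(ma, polls))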

Following the literature, the dataset may require post-stratification if the demographic distribution of the sample does not match that of the population. To investigate this, demographic data of the population is acquired from the CBS, a well-known Dutch governmental institution that collects and provides statistical information about the Netherlands (CBS, 2017). The population demographics are compared to the Twitter demographics obtained with TweetGenie for 102,868 of 122,441 unique users. We were unable to retrieve this information for the entire set of users, as some users removed their accounts or have profiles that are not publicly accessible, causing TweetGenie to throw an error when estimating gender and age. The resulting age distribution in the Twitter dataset is visualized in Figure 4.3.

The histogram shows there are users younger than 18 (the minimum age to be able to vote in the Netherlands). The share of tweets in the sample by these users accumulates to around 5% of the dataset. The choice is made to remove tweets posted by this age group, as their opinions are not relevant for election outcomes and can bias the results of prediction. We are aware this may cause some loss of information, as ages could have been underpredicted by TweetGenie, or because younger users can reflect their parents' voting intentions. However, as this group only comprises a small share of the dataset and we believe it mostly causes bias, the tweets are removed anyway. This leaves 820,247 tweets in the dataset.

Furthermore, three peaks can be observed in the histogram. The first two are in the tails, which shows that TweetGenie estimates ages in the range from 10 to 70. These threshold values can therefore be interpreted as cumulative frequencies of users below the age of 10 and above the age of 70, respectively. The other peak is for users estimated to be in their early twenties, which may imply that this age group is overrepresented on Twitter (as expected) when compared to the population demographics. This supports the belief that Twitter users may not be a representative sample of the population and that we have to account for this.

We first generate three age groups in our dataset, interpretable as young, middle and old ages. We aim to create groups of roughly equal size, as this is beneficial when a distinction is made between ages and gender in the explanatory variables. The groups and their proportions in the dataset are reported in Table 4.3, next to the distribution of the population.

As expected, the oldest age group is underrepresented on Twitter, whereas the youngest (and middle) groups are overrepresented. The share of male users is also a lot
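On the basis of such comparisons, post-stratification weights can be derived per demographic cell. A minimal sketch, with made-up proportions that are not the actual CBS or Twitter figures:

# Hypothetical proportions per age group (young / middle / old); not the actual figures
population = {"young": 0.28, "middle": 0.33, "old": 0.39}   # population shares (CBS-style)
sample     = {"young": 0.45, "middle": 0.35, "old": 0.20}   # shares among the Twitter users

# A tweet by a user in cell g receives weight population_share / sample_share,
# so overrepresented groups are weighted down and underrepresented groups up.
weights = {g: population[g] / sample[g] for g in population}
print(weights)   # e.g. old users get a weight > 1, young users a weight < 1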
