Machine learning and election forecasts in The Netherlands


Master thesis

Econometrics and Operations Research January 4, 2021

Abstract

In this master thesis, I forecast the result of the 2017 Dutch House of Representatives election, using a classical political economy model as the basis for the predictive variables. I use three different types of machine learning models and compare my results with the forecasts made by a classical political economy model and with the opinion polls, which are the leading information source for election forecasting in the Netherlands. A parsimonious random forest model and a gradient boosting model seem to produce the most accurate results. The machine learning techniques outperform the opinion polls months before the election, but lose their edge as the election draws nearer. Furthermore, I extended the parsimonious model by including a demographic variable. This did not improve the forecast.


1 Introduction

Election forecasting plays a prominent role in most western democracies. Gaining more insight into future election results can be useful for a multitude of reasons. For the campaign managers of a political party, there can be strategic value in knowing the party’s current position with the voters: it tells them whether the last weeks before the election should be filled with campaign events, or whether winning is so certain (or so impossible) that spending any more campaign funds would be a waste of money. Having some insight into the election results beforehand can also be of journalistic value. Knowing which parties are going to be responsible for the main clashes makes it easier for a journalist to know what types of stories to prepare and which stories might be of interest to the reader. Furthermore, election forecasts have a certain entertainment value. Television channels like CNN gave weekly (sometimes daily) updates in the months before the 2020 Trump vs Biden presidential election. While CNN might be an extreme example of the use of election forecasts, Dutch television channels also provide election forecasts weeks before the actual election, possibly to draw more attention to the narrative of the election and thereby to create a larger viewer base for their election coverage, election debates and broadcast of the election night results. Lastly, but maybe most importantly, election forecasts in the form of opinion polls can be a way for citizens to express their current opinion on the government’s performance. It is for a reason that approval ratings of governments and prime ministers are shown frequently on the news during the COVID-19 pandemic. If a government is doing an insufficient job in the eyes of its citizens, its approval rating drops, and in most functioning democracies low approval ratings can lead to a change in policy.

Next to the opinion polls, there are also model-based election forecasts. The roles of opinion polls and model-based forecasting differ. While both try to predict the election result, opinion polls might be more useful for capturing everyday changes in the political landscape caused by the mood of the day, while model-based forecasting is more useful for capturing the effect of long-term influences on a country, like economic growth, education or immigration.

Especially in the Netherlands, most of the early information on election results comes from opinion polls. Election forecasting within the science of political economy has a more academic background and is most often done by creating a voting function of some sort. These functions almost always contain an economic and a political variable, giving the following voting function

Vote = f (economics, politics). (1)


Lewis-Beck and Mongrain was quite simple and only predicted the voter share for the incumbent party. Hence there might be some room for improvement. The data used in this paper come from the Dutch Central Bureau of Statistics (CBS). Income growth data per municipality from 2010 until 2017 are used together with a variable containing the lagged votes and a political variable. While the CBS does not provide enough income data to forecast the next Dutch election, providing new accurate forecasting methods is still very relevant. If these techniques prove to be an improvement on the existing model, more data might become available. In the last section, I extend the model described by Dassonneville, Lewis-Beck and Mongrain by including an immigration variable in the hope of better explaining the rise of the right-wing party PVV.

During the last couple of years, the use of machine learning techniques has risen substantially. This paper applies some of these techniques and tests whether they can help in predicting election results. It discusses random forest, a powerful but easy-to-implement machine learning tool; elastic net regression, which combines variable selection with regularization; and lastly gradient boosting, a machine learning technique of growing popularity that makes a stepwise prediction based on the errors of a previous or initial prediction.

2 Literature review

The literature on forecasting elections started with the insight that an election is a referendum on the government’s performance (Tufte, 1978). This presumes that voters will reward a capable government and reject a failing one. If this is indeed the case, one can predict how individuals will vote if one knows how well the government has performed. For the “referendum” theory to hold, one must assume that voters vote retrospectively. This assumption is supported by a large body of literature (Fiorina, 1981). How to measure government performance is of course up for debate. It is often argued that the economic state of a country plays a big part in this (Nadeau, Lewis-Beck and Bélanger, 2012), but the candidates’ popularity (Lewis-Beck and Rice, 1992) and non-economic issue handling are also of importance (Graefe, 2013). A cross-national study of 39 elections concluded that the importance of the government’s economic capability rises during economically fragile times. However, the economy becomes less important when other crises exist (Singer, 2011), such as a terrorist attack or, for instance, the current COVID-19 pandemic. According to a Gallup poll, voters in the United States named a non-economic issue as the most important issue in 69% of the years. Furthermore, the sitting president’s popularity has been identified by a study as the best predictor for forecasting presidential elections (Lewis-Beck and Rice, 1992).


the difference between the approval and disapproval rates, the annualized growth rate of real GDP during the first two quarters of the year and a dummy variable indicating whether or not the incumbent’s party has been in office for more than one term (Abramowitz, 2004). The review study by Lewis-Beck was later criticized in another study because the quality score used to review models would give too much importance to lead time (how long before an election a forecast can be made). This study does not point to another model as being the best; it does, however, suggest predicting national elections at the state level (Holbrook, 2010).

An individual-level voting intention study found an odds ratio of 2.8 when the economic perception moved from “worse” to “better”. It also found a significant and positive odds ratio for social class and religion (Nadeau, Lewis-Beck and Bélanger, 2012). These demographic variables are often not used in studies of national election results, as they are less suitable for predicting voter shares at the national level; they are probably more suitable predictors for lower-level data (i.e. individual, municipal, state). The possibility of including such demographic variables, together with Holbrook’s suggestion to forecast elections at the state level, might make it worthwhile to look at election forecasts from a municipal or provincial viewpoint when forecasting the election results in the Netherlands.

While the structural models discussed above have been formulated for almost every democracy in Western Europe, no such models existed for the Dutch case until 2017 (Dassonneville, Lewis-Beck and Mongrain, 2017). Predicting the Dutch case has an advantage over the US case: contrary to the US, whoever wins the popular vote in the Netherlands always wins the election. Furthermore, the number of seats a party is awarded in parliament is proportional to its share of the votes. However, the highly fragmented party system and the recent surge in electoral volatility in the Netherlands (Mair, 2008) render the Netherlands a hard case for predicting elections by means of a structural model. It is nevertheless improbable that the fundamentals of predicting elections, which tend to obey very similar rules from one country to the next, do not apply in the Netherlands (Lewis-Beck and Bélanger, 2012: 768). Using information from national elections, it was shown that structural models containing lagged voter share, GDP growth and months in office outperform public opinion polls in predicting the voter share of the incumbent party in the Netherlands. As this was the first and only structural model to predict Dutch elections, it is the reference point on which my model will hopefully improve.


I will only include the same (or at least approximately the same) kind of predictive variables.

Machine learning techniques have not been used before to forecast Dutch elections. There is, however, a study that uses machine learning to predict elections (Awais, Hassan and Ahmed, 2019). It combines information from three different sources and applies Bayesian learning to produce quite accurate results, outperforming 85 teams and 450 participants in a forecasting contest. This might seem promising for my study, but it is important to note that this paper uses different data and different techniques, and predicts elections in different countries.

Machine learning often requires more data points than ordinary regression. Results from national elections alone will therefore not be enough, as they provide only a limited number of data points. To mitigate this, I use data from the national elections at the municipal level. In the Netherlands, there are approximately 400 municipalities (this number changes every year due to reclassifications). This not only gives more data points, it also makes it more logical to include variables that are used in individual voter intention studies (i.e. demographic variables). To give a fair comparison with the method used by Dassonneville, Lewis-Beck and Mongrain, I will only use the predictive variables that they used, but the municipal data do make it possible to include demographic variables in a follow-up study to improve on the model used in this paper. This makes this study not only innovative in the techniques it uses, but also a starting point for further improvement of the models provided.

Lastly, it is important to note that every election forecasting model has some kind of error component ingrained in it. To counter and reduce this random component, one could combine multiple election forecasting models. It has been shown that combining the forecasts of several models leads to the most accurate forecast (Graefe et al., 2013; Awais, Hassan and Ahmed, 2019). So even if this new technique is not a significant improvement on the existing model, it might still contribute to better election forecasts overall, as it can be part of a combined model with other forecasting models to reduce error and improve accuracy.

3 Data description


3.1 Dependent variable

The dependent variable that this paper tries to predict is the number of votes per political party. This is done for every political party that has managed to win a seat in a national election since 2010. The election I forecast is the House of Representatives (Tweede Kamer) election of 2017. The main reason for selecting this election lies in the availability of the data, which is discussed further in the economic variable section. The metric used to evaluate the different models is the sum over parties of the squared difference between actual and predicted seats. In mathematical terms, the evaluation metric for each model j can be described as follows

$$\text{Prediction error}_j = \sum_{i=1}^{10} \left(\text{House of Representatives seats}_i - \text{predicted seats}_i\right)^2, \tag{2}$$

where subscript i represents a political party. One might notice that I include only 10 different parties, while there are currently 13 parties. There are two reasons I decreased the number of parties. Firstly, I removed the results of ‘DENK’ and ‘Forum voor Democratie’, as this was the first election in which they participated and I thus had no tools for predicting their outcome. This is a shortcoming of the models used, but as new parties entering the House of Representatives is not very common, it is in my opinion not a major problem for now, although it is definitely an interesting topic for further research. Removing these two parties from the election result means that the number of awarded seats has to be recalculated. I have done this according to the same procedure that is normally used to find the number of awarded seats. This procedure is as follows:

1. Add up the number of votes obtained by the parties present in the House of Representatives to get the total number of votes.

2. Divide the total number of votes by the number of available seats in the House of Representatives, which is 150. The outcome is called the ‘kiesdeler’.

3. Divide the number of votes per party by the ‘kiesdeler’. Round this number down to get the number of full seats (‘volle zetels’). After this is done for every party, several seats (‘restzetels’) will remain.

4. The remaining seats are distributed among the parties through the system of the largest average. This means that the number of votes per party is divided by the number of obtained seats + 1. The outcome of this calculation is called the average.

5. The party with the highest average is awarded the first remaining seat. After this, the procedure is repeated until there are no remaining seats left. If a party is awarded a remaining seat, its new average is found by dividing the total number of votes for that party by the number of full seats + the remaining seats + 1.
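The five steps above can be sketched in a few lines of Python. This is only an illustration of the listed procedure, not the code used for the thesis (which works in R): the function name `allocate_seats`, the party labels and the vote counts are made up, and any further legal details of the Dutch seat allocation are ignored.

```python
def allocate_seats(votes, total_seats=150):
    """Allocate seats: full seats first, then remaining seats by largest average."""
    total_votes = sum(votes.values())
    # Steps 2-3: floor(votes / kiesdeler), where kiesdeler = total_votes / total_seats.
    # Written with integers only, so no rounding error can occur.
    seats = {party: v * total_seats // total_votes for party, v in votes.items()}
    # Steps 4-5: hand out the remaining seats one at a time to the party with
    # the highest average, votes / (seats obtained so far + 1).
    for _ in range(total_seats - sum(seats.values())):
        winner = max(votes, key=lambda party: votes[party] / (seats[party] + 1))
        seats[winner] += 1
    return seats


# Hypothetical vote counts for three made-up parties in a 10-seat chamber.
example_votes = {"A": 53000, "B": 31000, "C": 16000}
example_seats = allocate_seats(example_votes, total_seats=10)
```

In this toy example the ‘kiesdeler’ is 10,000 votes, so nine full seats are awarded and one seat remains, which goes to the party with the highest average.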


Table 1: Seats after readjusting for DENK and FvD

Party VVD CDA PVV PVDA D66 GL CU SGP SP PVDD 50+

Actual seats 34 20 21 9 20 14 8 14 5 5

would have been as is shown in table 1. The above method will also be used to find the predicted number of seats for each model that is used after the predicted number of votes has been produced by the different machine learning methods.

3.2 Independent economic variable

The paper by Dassonneville, Lewis-Beck and Mongrain uses GDP growth as the economic variable to predict the number of votes. The average GDP growth per municipality is not provided for longer periods by the Central Bureau of Statistics (CBS). Instead, this paper uses the growth in the average standardized income per household in a municipality in the year prior to the election. Ideally one would use the growth in income/GDP in the 12 months before the election (just as Dassonneville, Lewis-Beck and Mongrain did). However, these data are also not available. So if, for example, an election took place in 2017, the income growth used is the growth of income between 2015 and 2016. The main limitation in the availability of the data lies within this variable. The CBS only has income data per municipality available between 2010 and 2017. This means that for now I can only use the following elections in the dataset: the House of Representatives elections of 2012 and 2017, the provincial elections of 2015 and the European elections of 2014.

3.3 Independent political variable

The political variable in this paper is the same as in the paper by Dassonneville, Lewis-Beck and Mongrain. The number of votes in the previous election is used as an independent variable to determine the number of votes in the upcoming election. However, as I am trying to predict the number of votes for every political party, rather than just for the largest political party, I can also use the votes for the other political parties in the last election as predictors. There might be some multicollinearity between the votes for political parties, as there is a limited number of votes available and a vote cast for one party cannot be cast for another. Correlation between the independent variables is discussed later on in the method description.

3.4 Independent “in office” variable


indicating whether the party was a governing party in the period before the election. When an observation is not from a House of Representatives election, I look at which parties are the governing parties at the moment of the election. Furthermore, there is often mention of a ‘prime minister bonus’ when the prime minister is also the party leader in the next election. This would mean that the prime minister’s party would get more votes, if the economy did well, than the other governing parties. To see whether this effect exists, I also included a factor variable which can take the values “governing”, “co-governing” and “opposition”. In the results section, I discuss the effects of a binary “in office” variable compared to a factor “in office” variable. Tables 2 and 3 give an overview of the “in office” variable and of which parties were assigned which values in which periods.

Table 2: Binary “in office” variable

Party/election 2012 2014 2015 2017

VVD 1 1 1 1

PVDA 0 1 1 1

CDA 1 0 0 0

PVV 1 0 0 0

Table 3: Factor “in office” variable

Party/election 2012 2014 2015 2017

VVD 1 1 1 1

PVDA 0 2 2 2

CDA 2 0 0 0

PVV 2 0 0 0

3.5 Municipal reclassification


where $P_{i,n-2}$ is the population in municipality $i$ in period $n-2$, $I_{i,n-2}$ is the income in municipality $i$ in period $n-2$ and $TP_{n-2} = \sum_{i=1}^{3} P_{i,n-2}$ is the total population of the three municipalities. If the reclassification took place at the end of year $n-1$, I handled it slightly differently. In this case I have income data for municipalities 1, 2 and 3 from years $n-2$ and $n-1$, but the election results from year $n$. The income growth for municipality 4 is then calculated as

$$\text{Income growth} = \frac{\sum_{i=1}^{3} \left( \frac{P_{i,n-1}}{TP_{n-1}} I_{i,n-1} - \frac{P_{i,n-2}}{TP_{n-2}} I_{i,n-2} \right)}{\sum_{i=1}^{3} \frac{P_{i,n-2}}{TP_{n-2}} I_{i,n-2}} \tag{4}$$
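As a small illustration of equation 4, the sketch below computes the income growth of a merged municipality from the population-weighted incomes of its three predecessors. The function names and all numbers are hypothetical; only the formula follows the text.

```python
def weighted_income(populations, incomes):
    """Population-weighted average income: sum_i (P_i / TP) * I_i."""
    total_population = sum(populations)
    return sum(p / total_population * inc for p, inc in zip(populations, incomes))


def merged_income_growth(pop_prev, inc_prev, pop_curr, inc_curr):
    """Equation 4: relative change of the weighted income from year n-2 to n-1."""
    base = weighted_income(pop_prev, inc_prev)
    return (weighted_income(pop_curr, inc_curr) - base) / base


# Hypothetical data for municipalities 1, 2 and 3 (income in thousands of euros).
pop_n2, inc_n2 = [10000, 20000, 30000], [20.0, 22.0, 24.0]
pop_n1, inc_n1 = [10000, 20000, 30000], [21.0, 23.1, 25.2]
growth = merged_income_growth(pop_n2, inc_n2, pop_n1, inc_n1)
```

In this made-up case every municipality sees its income rise by 5% while the populations stay the same, so the merged municipality's growth is also 5%.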

Furthermore, I removed the islands Bonaire, Saba and Sint Eustatius from the election results, as their income was not recorded by the CBS. Lastly, it is important to note that in 2013 the municipality of Boornsterhem was split up and divided between Leeuwarden, Heerenveen, De Friese Meren and Súdwest-Fryslân. As it is almost impossible to know exactly which part of the income growth in Boornsterhem was transferred to which municipality, I removed this observation when calculating the income growth in 2013.

3.6 Descriptive statistics

If economic prosperity has an influence on the number of votes the incumbent parties get, one might already be able to see this in the data. The elections for which I have data were all preceded by a governing period of the VVD. The portion of VVD votes per municipality against the income growth can be seen in figures 1a and 1b. Looking at figure 1a, it seems as if there is indeed some kind of connection between the number of votes for the VVD and the income growth. When one makes a distinction between the different elections, as in figure 1b, it becomes clear that the connection between income growth and

Figure 1: Portion of VVD votes per municipality against income growth. (a) No distinction between elections (b) Distinction between elections


(when solely looking at figure 1b) that the effect of income growth on the number of votes is present and that some other non-economic issue has led to the decrease in the number of votes for the VVD between 2012 and 2014. The same figure can be made for the governing partner of the VVD, the PVDA. This has been done in figures 2a and 2b. The PVDA was in government from 2012 until 2017. Upon inspection of figure 2a, it seems as if there is an unexpected negative correlation between the income growth and the number of votes for the PVDA. However, when one makes a distinction between the different elections, as is done in figure 2b, one can see that this negative correlation is mainly due to the 2012 election. The PVDA was not in office before this election, so one can ignore it when looking for a connection between income growth and the number of votes. Nonetheless, there still seems to be no increasing relation between the number of votes and income growth. This might be due to the so-called prime minister bonus, which is discussed in section 3.4, or due to some other non-economic issue.

Figure 2: Portion of PVDA votes per municipality against income growth. (a) No distinction between elections (b) Distinction between elections

4 Random Forest

The first machine learning method that I apply is the random forest method. It is one of the most used algorithms, mainly because it is easy to implement, but also because of its versatility: it can be used for both regression and classification tasks, although in this paper the classification option is not relevant.

4.1 Method description

To predict election results I make use of a machine learning technique called random forest. A random forest makes it possible for a prediction to draw information from similar observations that are close to one another. In this setting, this means that if, for example, the municipality of Alphen aan den Rijn in 2017 has similar values for the predictive variables as Emmen in 2012, then their election results will probably also be quite similar.


Using a decision tree one can divide data into multiple subsets based on pre-specified or random split values of the predictive variables. For example, consider a set of Dutch municipalities for which one knows the number of votes for the CDA. One could imagine multiple factors playing a role in how a municipality votes. The first factor that could determine the number of votes for the CDA is the number of votes for the CDA in the previous election in the same municipality. This could be the first split. The second split could be at a certain income growth the municipality has experienced over the last year. The third split could be related to whether the CDA was a governing party in the last election cycle. With these three binary splits, the original dataset is divided into 8 different subsets in which municipalities have similar characteristics and also (depending on how good the splits are) quite similar voting levels.

There is now one decision tree, but it is also possible to make a new decision tree with different split points. One could take different lagged voting levels, different income growths or completely new predictive variables, but also more or fewer split points. Once hundreds of decision trees have been generated, one of course gets a forest. Since the split points and the variables used are random for each tree, one gets a random forest.

Once I have a random forest, I am interested in how to use it as a forecasting tool. Suppose I have one new municipality for which only the predictive variables are known. Based on the first decision tree I can put this observation in one of the eight aforementioned subgroups. If one repeats this process for all of the decision trees and averages the outcomes, one gets a prediction for the number of votes for the CDA in the new municipality. The whole process is summarized in figure 3.

Figure 3: Random forest structure
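The idea described above can be sketched in plain Python. This is a deliberately simplified illustration and not the R implementation used in this thesis: each tree is grown on a bootstrap sample and splits on a randomly chosen variable at a randomly chosen threshold (a real random forest searches for good split points), and the function names and the six toy observations (lagged vote share, income growth, “in office” flag) are made up.

```python
import random
from statistics import mean


def build_tree(rows, targets, depth, rng):
    """Split on a random variable at a random threshold; a leaf stores the mean target."""
    if depth == 0 or len(rows) < 2:
        return mean(targets)
    feature = rng.randrange(len(rows[0]))
    values = [r[feature] for r in rows]
    if min(values) == max(values):
        return mean(targets)
    threshold = rng.uniform(min(values), max(values))
    left = [i for i, r in enumerate(rows) if r[feature] <= threshold]
    right = [i for i, r in enumerate(rows) if r[feature] > threshold]
    if not left or not right:
        return mean(targets)
    return (feature, threshold,
            build_tree([rows[i] for i in left], [targets[i] for i in left], depth - 1, rng),
            build_tree([rows[i] for i in right], [targets[i] for i in right], depth - 1, rng))


def predict_tree(tree, row):
    """Follow the splits until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if row[feature] <= threshold else right
    return tree


def build_forest(rows, targets, n_trees=200, depth=3, seed=0):
    """Grow n_trees trees, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [targets[i] for i in idx], depth, rng))
    return forest


def predict_forest(forest, row):
    """The forest prediction is the average of the individual tree predictions."""
    return mean(predict_tree(tree, row) for tree in forest)


# Hypothetical observations: (lagged vote share, income growth, in office).
rows = [(0.30, 0.02, 1), (0.28, 0.01, 1), (0.10, 0.03, 0),
        (0.12, 0.00, 0), (0.25, 0.02, 1), (0.08, 0.01, 0)]
targets = [0.32, 0.27, 0.11, 0.12, 0.26, 0.09]
forest = build_forest(rows, targets)
prediction = predict_forest(forest, (0.27, 0.015, 1))
```

With a depth of 3, each tree has at most 8 leaves, matching the eight subgroups in the example above. Because every leaf holds an average of observed outcomes, the forest prediction always lies within the range of the outcomes it was trained on.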

The tricky part is now that this method is only used to predict one dependent variable.


In the election forecast, one would ideally be able to predict the expected votes for all participating parties. One could just make n different random forests for the n participating political parties, but one can also make use of a multivariate random forest as described by Segal and Xiao (2011): “To construct multivariate random forests, that accommodate multivariate outcomes, we simply generate an ensemble of multivariate random trees via bootstrap resampling and predictors subsampling as for univariate random forests.” There is a package in R called ‘MultivariateRandomForest’, which I use to perform the random forest analysis described above.

4.2 Results

Firstly, I tested a random forest model on the data and predicted the election results of 2017. I began by running the univariate random forests. These models are univariate in the dependent variable.

4.2.1 Univariate random forest

I ran multiple univariate random forests, which differ in the number of predictive variables. I begin with the most parsimonious model and extend it from there to see what the effect is. The most parsimonious model only uses the number of lagged votes for the same party, an “in office” variable and a growth variable. The growth variable is only included if the party was a governing party in the last election cycle. Note that I included the “in office” variable to capture the so-called prime minister bonus. As said in the data description, this variable takes the values “governing”, “co-governing” and “opposition”, which R encodes as 2, 1 and 0. Applying the random forest algorithm with the number of trees set to 1000 and a minimum node size of 2 gives the outcome shown in table 4. The main error of the prediction lies within

Table 4: Predicted seats using parsimonious random forest with factor “in office” variable

Actual seats 34 20 21 9 20 14 8 14 5 5

Predicted seats 29 24 17 14 20 7 11 18 5 5


Table 5: Predicted seats using parsimonious random forest with binary “in office” variable

Actual seats 34 20 21 9 20 14 8 14 5 5

Predicted seats 29 24 17 14 20 7 11 18 5 5

the evaluation metric given in equation 2 is 156, which looks quite promising.

There might be some extra information to be gained by including the number of votes for other parties in previous elections. This means the model now includes the lagged votes of all the political parties and the “in office” variables of all the parties. I have included the “in office” variable in binary form, as the difference is minimal and it is easier to code. Using again a random forest with 1000 trees and a minimum node size of 2 gives the prediction for the 2017 election results that is shown in table 6. Including the other political parties leads to a less precise result. The result in table 6 has an error of 224 according to equation 2. So including the votes of other parties does not improve the prediction. It is also worth noting that including all the parties might lead to multicollinearity, as one vote cast for party A cannot be cast for party B.

Table 6: Predicted seats using random forest including all political parties

Actual seats 34 20 21 9 20 14 8 14 5 5

Predicted seats 25 24 18 16 19 8 12 18 5 5
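The errors quoted so far can be checked directly against equation 2. The short sketch below uses the seat numbers from tables 4 and 6; the function and variable names are only illustrative.

```python
def prediction_error(actual, predicted):
    """Equation 2: sum over parties of the squared difference in seats."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


actual_seats = [34, 20, 21, 9, 20, 14, 8, 14, 5, 5]
parsimonious = [29, 24, 17, 14, 20, 7, 11, 18, 5, 5]   # table 4
all_parties = [25, 24, 18, 16, 19, 8, 12, 18, 5, 5]    # table 6

error_parsimonious = prediction_error(actual_seats, parsimonious)  # 156
error_all_parties = prediction_error(actual_seats, all_parties)    # 224
```

Because the differences are squared, a few large misses (for example nine seats too few for the first party in table 6) dominate the metric.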

4.2.2 Multivariate random forest

As said in the method description, it is also possible to predict multiple parties at the same time. The same predictive variables are used as for the results shown in table 6. This leads to the prediction of the 2017 election that is shown in table 7. Again one sees the same problem arising as in table 6: the number of seats for the PVDA is too large and the number of seats for the VVD is too low. This might again be caused by the multicollinearity of the parties’ past votes. The error for this model has increased to 242.

Table 7: Predicted seats using multivariate random forest

Actual seats 34 20 21 9 20 14 8 14 5 5

Predicted seats 23 23 19 17 21 9 11 17 5 5


seems that the most parsimonious model has the lowest error. It also seems that introducing more independent variables leads to more error, but this could also be caused by the multicollinearity. To tackle this issue I use elastic net regression, as this makes it possible to set certain coefficients to zero if they introduce multicollinearity.

Table 8: Multiple random forest predictions

Party VVD CDA PVV PVDA D66 GL CU SGP SP PVDD 50+ Error

Actual seats 34 20 21 9 20 14 8 14 5 5 0

Predicted seats parsimonious factor 29 24 17 14 20 7 11 18 5 5 156

Predicted seats parsimonious binary 29 24 17 14 20 7 11 18 5 5 156

Predicted seats all parties 25 24 18 16 19 8 12 18 5 5 224

Predicted seats multivariate model 23 23 19 17 21 9 11 17 5 5 242

As a random forest is mostly a black box algorithm, it is hard to pinpoint exactly where the error in these different models comes from. Looking at the results in table 8, one can see that mainly the results for the PVDA and the VVD are off in the multivariate model. Within R there is an option to find the variable importance, which is shown in figure 4. The variable importance is calculated as the increase in node purity weighted by the probability of reaching that node. Here the node purity can be related to how high up in the decision tree a split variable is found: splits higher up in the tree tend to produce the largest increase in purity. The probability of reaching a node is calculated as the number of samples that reach the node, divided by the total number of samples.

As can be seen in figure 4, the most important variable in determining the number of votes for the PVDA and the VVD is the lagged number of votes for GroenLinks. This is strange and I do not have an explanation for it. It is also worth noting that the error rate seems to be significantly lower for approximately 30 trees. However, changing the number of trees from 1000 to 30 does not seem to affect the results.


5 Elastic net

To tackle the possible multicollinearity introduced by adding other political parties to the estimation methods I use the elastic net procedure. The elastic net procedure does this by finding the optimal combination between two types of regression (lasso and ridge regressions, which are to be explained later).

5.1 Method description

The advantage of finding values for these two types of regressions lies within the bias-variance trade-off. When estimating a model, both variance and bias are desired to be low. Ordinary least squares (OLS) has the desired property of unbiasedness. However, it can have a huge variance. This happens when there is a high correlation between the predictor variables or when there are many predictors. One can imagine that the correlation between the number of votes for different parties can be quite high, as there is a limited number of votes and a vote cast for one party cannot be cast for another. When using the outcome for all parties to predict future votes, there are also quite a lot of predictor variables. Hence a huge variance is to be expected when using OLS to predict the number of votes. A general solution to this is called regularization. This approach reduces the variance at the cost of creating a slight bias. This trade-off is best captured by figure 5.

Figure 5: Variance bias trade off


5.1.1 Ridge regression

The difference between OLS and ridge regression (or L2 regularization) is the way the errors are minimized. For OLS this is done by

$$\arg\min_{\hat\beta} \sum_{i=1}^{n} (y_i - x_i'\hat\beta)^2. \tag{5}$$

The minimization of the squared errors for ridge regression is not much different from OLS. The only difference is the addition of a regularization penalty. The loss function for ridge regression is given by

$$\arg\min_{\hat\beta} \sum_{i=1}^{n} (y_i - x_i'\hat\beta)^2 + \lambda \sum_{j=1}^{m} \hat\beta_j^2. \tag{6}$$

To see what the effect of adding this term is on the bias and the variance, one must solve equation 6. This gives $\hat\beta_{ridge} = (X'X + \lambda I)^{-1}X'Y$. This leads to the following bias and variance.

• $\text{Bias}(\hat\beta_{ridge}) = -\lambda(X'X + \lambda I)^{-1}\beta$.

• $\text{Var}(\hat\beta_{ridge}) = \sigma^2(X'X + \lambda I)^{-1}X'X(X'X + \lambda I)^{-1}$.

One can see that increasing λ increases the bias, but reduces the variance. The question remains of course: What is the optimal value of λ? This is where elastic net is useful as it determines among others the optimal value for λ.
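The bias and variance given above follow in two short steps. The derivation below is a sketch that assumes the standard linear model $Y = X\beta + \varepsilon$ with fixed $X$, $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I$; these assumptions are not spelled out in the text and are added here only for the derivation.

```latex
\begin{align*}
E[\hat\beta_{ridge}]
  &= (X'X + \lambda I)^{-1} X' E[Y]
   = (X'X + \lambda I)^{-1} X'X \beta \\
  &= (X'X + \lambda I)^{-1} (X'X + \lambda I - \lambda I) \beta
   = \beta - \lambda (X'X + \lambda I)^{-1} \beta, \\
\mathrm{Var}(\hat\beta_{ridge})
  &= (X'X + \lambda I)^{-1} X' \, \mathrm{Var}(Y) \, X (X'X + \lambda I)^{-1} \\
  &= \sigma^2 (X'X + \lambda I)^{-1} X'X (X'X + \lambda I)^{-1}.
\end{align*}
```

Setting λ = 0 recovers the unbiased OLS estimator with variance $\sigma^2 (X'X)^{-1}$, which makes the trade-off explicit: every λ > 0 adds bias but shrinks the variance.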

5.1.2 Lasso regression

Lasso regression (or L1 regularization) is conceptually quite similar to ridge regression. It differs mainly in the way the penalty term is specified. The loss function under lasso regression is specified as

$$\arg\min_{\hat\beta} \sum_{i=1}^{n} (y_i - x_i'\hat\beta)^2 + \lambda \sum_{j=1}^{m} |\hat\beta_j|. \tag{7}$$

Specifying the loss this way means one can use lasso regression for variable selection, as it can set coefficients to exactly zero, whereas ridge regression can only make coefficients very small. Comparing the two further, ridge regression works well when there are many large parameters of approximately the same magnitude, while lasso regression works well when there are few significant parameters and the rest are close to zero.
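The difference between the two penalties is easiest to see in an idealized case. The sketch below assumes orthonormal predictors ($X'X = I$), which is not the situation in the thesis data and serves only as an illustration. Under that assumption the minimizers of equations 6 and 7 have closed forms in terms of the OLS coefficient z: ridge rescales it to z / (1 + λ), while lasso shifts it towards zero by λ/2 and cuts it off at zero. The function names and the example coefficients are hypothetical.

```python
def ridge_shrink(z, lam):
    """Ridge solution for one coefficient when X'X = I (assumed)."""
    return z / (1.0 + lam)


def lasso_shrink(z, lam):
    """Lasso solution for one coefficient when X'X = I (assumed).

    Minimizes (b - z)^2 + lam * |b|, which gives soft-thresholding
    with threshold lam / 2.
    """
    magnitude = max(abs(z) - lam / 2.0, 0.0)
    return magnitude if z >= 0 else -magnitude


# Hypothetical OLS coefficients: two large ones, two small ones.
ols = [3.0, -1.5, 0.4, -0.2]
lam = 1.0

ridge = [ridge_shrink(z, lam) for z in ols]
lasso = [lasso_shrink(z, lam) for z in ols]

for z, r, l in zip(ols, ridge, lasso):
    print(f"OLS {z:5.2f}  ridge {r:5.2f}  lasso {l:5.2f}")
```

In this toy case ridge keeps all four coefficients (only smaller), whereas lasso removes the two small ones entirely, which is the variable-selection behaviour described above.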

5.1.3 Elastic net

Elastic Net is used as a combination of lasso regression and ridge regression to get the best of both worlds. It does this by minimizing the following loss function


Here α is the mixing parameter that decides between lasso regression (α = 1) and ridge regression (α = 0). The glmnet package automatically finds the optimal value for α and λ for us, such that equation 8 is minimized.
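To make the role of α and λ concrete, the following is a minimal pure-Python sketch of elastic net fitted by coordinate descent. It assumes the objective in the form documented for glmnet, $\frac{1}{2n}\sum_i (y_i - x_i'\beta)^2 + \lambda[\alpha \sum_j |\beta_j| + \frac{1-\alpha}{2} \sum_j \beta_j^2]$, without an intercept; the scaling may differ from the one used in equation 8. It is not the glmnet implementation, and the data are made up.

```python
def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma; return 0 inside [-gamma, gamma]."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net(X, y, lam, alpha, sweeps=200):
    """Coordinate descent for the (assumed) glmnet-style objective."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of column j with the residual that ignores beta[j].
            rho = 0.0
            for i in range(n):
                fitted_without_j = sum(
                    X[i][k] * beta[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - fitted_without_j)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            # The lasso part thresholds, the ridge part rescales.
            beta[j] = soft_threshold(rho, lam * alpha) / (
                scale + lam * (1.0 - alpha)
            )
    return beta


# Hypothetical data generated as y = 2 * x1 + 0 * x2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, 0.0, 2.0, 4.0]

beta_ols = elastic_net(X, y, lam=0.0, alpha=0.5)    # no penalty
beta_ridge = elastic_net(X, y, lam=1.0, alpha=0.0)  # pure ridge
beta_lasso = elastic_net(X, y, lam=10.0, alpha=1.0) # heavy lasso
print(beta_ols, beta_ridge, beta_lasso)
```

With λ = 0 the sketch returns the OLS fit, with α = 0 it only shrinks the coefficients, and with α = 1 and a large λ it sets both coefficients to zero.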

5.2 Results

Using elastic net makes it possible to include a lot of variables, even if they introduce multicollinearity, as the elastic net approach can set their coefficients to zero. I again use the same predictive variables as in the random forest model, for which the results can be found in table 6. So in estimating the predicted number of votes for party i, the predictive variables are the past number of votes of all the parties, the economic growth if this party was in office in the last period and the binary "in office" variable. Since I am now using regression equations, I can express the model in an equation, which is done in equation 9.

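$$\text{Party}_{i,j,n} = \beta_1 + \sum_{i=1}^{10} \beta_{i+1} \cdot \text{Party}_{i,j,n-1} + \sum_{i=1}^{10} \beta_{i+11} \cdot \text{In office}_{i,n-1} + \beta_{22,j} \cdot 1\{\text{In office}_{i,n-1} = 1\} \ast \text{Growth}_{j,n-1} + \varepsilon_{i,j,n}. \tag{9}$$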

The elastic net regression produced the results shown in table 9.

Table 9: Prediction of seats using elastic net

Party            VVD  CDA  PVV  PVDA  D66  GL  CU SGP  SP  PVDD  50+
Actual seats      34   20   21     9   20  14       8  14     5    5
Predicted seats   21   25   19    11   20   9      12  21     6    6

Since there are 10 different election outcomes to be estimated, there are also 10 different sets of values for α and λ. These values can be found in table 10. The error of the elastic net approach is 294, which is more than for any of the random forest methods. A big contribution to the large error is again the underestimation of the VVD.

Table 10: Different values for α and λ

Party     VVD     CDA     PVV    PVDA     D66      GL  CU SGP      SP    PVDD     50+
α       0.981   0.585   0.717   0.259   0.438   0.618   0.881  0.0415   0.456   0.591
λ       7.345   7.173   0.001   0.069   0.054   1.126   5.505   0.059   0.002   3.794

As it might be more appropriate to use the same value for α and λ, I also did a multivariate elastic net regression in which all the predicted seats are estimated at the same time. This leads to the outcome presented in table 11. Clearly this has not improved the prediction results, as the error is now 394.

Table 11: Prediction of seats using multivariate elastic net

Party            VVD  CDA  PVV  PVDA  D66  GL  CU SGP  SP  PVDD  50+
Actual seats      34   20   21     9   20  14       8  14     5    5
Predicted seats   31   18   18    26   17   6      10  16     4    4

Especially the prediction for the PVDA is way off, and it accounts for the majority of the error. A cause of this could be that the multivariate elastic net gives too much importance to the number of lagged votes. Another cause could be that there is some correlation between the variables that one cannot observe and that comes into play when estimating the number of votes for each party simultaneously. This could also explain the rise in the error when predicting the number of seats using the multivariate random forest.

6 Gradient boosting

Another popular machine learning technique is called gradient boosting. This method uses a step-wise reduction of the errors to improve the model's forecasting capabilities.

6.1 Method description

The method uses a three-step algorithm to produce results. As input one needs data $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i$ is the number of votes for a certain party in municipality i and $x_i$ is a 2 × 1 vector containing the number of votes for the same party in municipality i in the previous election and the "in office" variable. For governing parties $x_i$ is a 3 × 1 vector that also contains the income growth in municipality i. Furthermore, one needs a differentiable loss function $L(y_i, F(x_i))$. The loss function used here is

$$L(y_i, F(x_i)) = \frac{1}{2}(y_i - \hat{y}_i)^2,$$

where $\hat{y}_i = F(x_i)$ is the predicted value. The algorithm is as follows.

1. Initialize the model with a constant starting value: $F_0(x_i) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$. To obtain a starting value one must find the γ that minimizes the total value of the loss function $L(y_i, \gamma) = \frac{1}{2}(y_i - \gamma)^2$. This is done by taking the average value of $y_i$ for γ.

2. For m = 1 to M repeat the following steps:

   (a) Compute the pseudo-residuals $r_{i,m} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x_i) = F_{m-1}(x_i)}$ for i = 1, ..., n.

   (b) Fit a regression tree to the pseudo-residuals $r_{i,m}$ using $x_i$ and create terminal regions $R_{j,m}$ for j = 1, ..., $J_m$. The terminal regions are the end leaves of the regression tree.

   (c) Compute $\gamma_{j,m} = \arg\min_{\gamma} \sum_{x_i \in R_{j,m}} L(y_i, F_{m-1}(x_i) + \gamma)$ for j = 1, ..., $J_m$.

   (d) Update $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{j,m} 1(x \in R_{j,m})$.

   Here ν is the learning rate of the algorithm. A smaller ν generally makes the model more accurate, but it requires more iterations M and therefore a longer running time, and running too many iterations carries a risk of overfitting.

3. Output $F_M(x)$.
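The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used for the results: it assumes a single predictor, uses one-split regression trees (stumps) as base learners, and runs on made-up vote counts. With the squared loss used here the pseudo-residuals are simply $y_i - F_{m-1}(x_i)$ and the optimal $\gamma_{j,m}$ is the mean residual in each leaf. All function and variable names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Steps 2b and 2c: best single split and the mean residual per leaf."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def gradient_boost(xs, ys, n_rounds=100, learning_rate=0.1):
    """Return a prediction function F_M for squared loss."""
    f0 = sum(ys) / len(ys)  # step 1: the mean minimizes the squared loss
    fitted = [f0] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - f for y, f in zip(ys, fitted)]  # step 2a
        threshold, left_gamma, right_gamma = fit_stump(xs, residuals)
        stumps.append((threshold, left_gamma, right_gamma))
        fitted = [  # step 2d: damped update with learning rate nu
            f + learning_rate * (left_gamma if x <= threshold else right_gamma)
            for x, f in zip(xs, fitted)
        ]

    def predict(x):  # step 3: F_M(x)
        value = f0
        for threshold, left_gamma, right_gamma in stumps:
            value += learning_rate * (
                left_gamma if x <= threshold else right_gamma
            )
        return value

    return predict


# Hypothetical lagged votes and current votes for six municipalities.
lagged_votes = [100, 200, 300, 400, 500, 600]
votes = [110, 190, 320, 390, 520, 580]

model = gradient_boost(lagged_votes, votes, n_rounds=100, learning_rate=0.1)
baseline = sum(votes) / len(votes)
mse_baseline = sum((v - baseline) ** 2 for v in votes) / len(votes)
mse_boosted = sum(
    (v - model(x)) ** 2 for x, v in zip(lagged_votes, votes)
) / len(votes)
print(mse_baseline, mse_boosted)
```

On the training data each round can only lower the squared error, which is why the error on held-out data, rather than the training error, is the quantity to watch when choosing M.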

6.2 Results

Finally, I use gradient boosting to predict the election results. The results are shown in table 12. The gradient boosting model used is similar to the parsimonious random forest model, as only the lagged votes, the binary "in office" variable and the growth variable are used. I did not include the lagged votes of other political parties here, as doing so seems to only introduce more error. The prediction seems to be quite accurate: the evaluation metric gives an error of only 212.

Table 12: Prediction of seats using gradient boosting

Party            VVD  CDA  PVV  PVDA  D66  GL  CU SGP  SP  PVDD  50+
Actual seats      34   20   21     9   20  14       8  14     5    5
Predicted seats   28   24   16    16   21   6      12  16     6    5

Figure 6: RMSE per iteration of gradient boosting

However, the results for the PVDA and the CU SGP seem quite off. I might have encountered some overfitting for these parties. When plotting the root mean squared error (RMSE) per iteration for the training data and the test data, as seen in figure 6, one indeed notices that the RMSE increases for the PVDA and CU SGP after approximately 100 iterations. Reducing the number of iterations to 100 leads to the results in table 13. These results are more accurate than the results in table 12. The error according to equation 2 is only 144, which is the lowest so far.

Table 13: Prediction of seats using gradient boosting after adjusting number of learning rounds

Party            VVD  CDA  PVV  PVDA  D66  GL  CU SGP  SP  PVDD  50+
Actual seats      34   20   21     9   20  14       8  14     5    5
Predicted seats   31   26   17    11   22   6       8  17     6    6

Furthermore, the variable importance can again be found in table 14. The lagged votes are the most important determinant for predicting the number of future votes. This is probably also why the number of seats for GroenLinks is predicted lower than was actually the case, as they only had 4 seats in the 2012 election. The effect of income growth and the "in office" variable is small for most parties, but for the PVDA and the VVD these variables account for more than 20 percent of the variable importance.
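The choice of the number of boosting rounds described above can be sketched as picking the round with the lowest test RMSE. The RMSE values below are hypothetical and only mimic the shape of a curve that starts rising after about 100 iterations; they are not the values behind figure 6, and the helper name is an assumption.

```python
def best_iteration(test_rmse):
    """Return the 1-based position of the lowest test RMSE."""
    best_index = min(range(len(test_rmse)), key=lambda k: test_rmse[k])
    return best_index + 1


# Hypothetical test RMSE measured after 25, 50, ..., 200 rounds.
rounds = [25, 50, 75, 100, 125, 150, 175, 200]
test_rmse = [310.0, 240.0, 205.0, 190.0, 196.0, 204.0, 213.0, 221.0]

chosen_rounds = rounds[best_iteration(test_rmse) - 1]
print(chosen_rounds)
```

Stopping at the minimum of the test curve keeps the rounds that still improve out-of-sample accuracy and drops the ones that only fit noise in the training data.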

Table 14: Variable importance for gradient boosting

Variable              VVD   PVDA    CDA    PVV    D66  GROENLINKS  CU SGP     SP   PVDD  PLUS50
Lagged votes        0.771  0.659  0.916  0.995      1           1       1      1      1       1
In office variable      0  0.257  0.084  0.005      0           0       0      0      0       0

7 Comparing results

All the forecasts and their errors can be found in table 15. It is now of interest how the models performed against the model described by Dassonneville, Lewis-Beck and Mongrain and against the opinion polls. First I discuss the opinion polls, as their outcome is easier to compare to my results.

Table 15: Forecast results of all models

Party                       VVD  CDA  PVV  PVDA  D66  GL  CU SGP  SP  PVDD  50+  Error
Actual seats                 34   20   21     9   20  14       8  14     5    5   N.A.
Random forest factor         29   24   17    14   20   7      11  18     5    5    156
Random forest binary         29   24   17    14   20   7      11  18     5    5    156
Random forest all parties    25   24   18    16   19   8      12  18     5    5    224
Random forest multivariate   23   23   19    17   21   9      11  17     5    5    242
Elastic net                  21   25   19    11   20   9      12  21     6    6    294
Elastic net multivariate     31   18   18    26   17   6      10  16     4    4    394
Gradient boosting            28   24   16    16   21   6      12  16     6    5    212
Gradient boosting refined    31   26   17    11   22   6       8  17     6    6    144

There are multiple versions of opinion polls available in the Netherlands. An overview can be found on Wikipedia², where polls from different institutions can be found from 5 months until 1 day in advance of the election. I took an excerpt from this page and show the results and their accuracy according to equation 2 in table 16. It is quite remarkable that the initial forecast of the opinion polls was quite precise, after which the polls quickly deteriorated and only became more accurate again right before the election. In the comparison with my models, the opinion poll forecasts 2 months ahead deserve a bit of extra attention. If one assumes that the income data of the previous year become available as soon as the year is over, then this election forecast could have been made two months before the election (the election was on the 15th of March). This is of course a debatable assumption, but for the sake of comparison let us assume it is true. In this case my method would have done a better job than the opinion polls. Compared to the 5 months and 1 month ahead forecasts, my models produce fairly similar errors. When election night comes closer, the opinion polls clearly have the upper hand.

Table 16: Forecast results opinion polls

Opinion poll          VVD  CDA  PVV  PVDA  D66  GL  CU SGP  SP  PVDD  50+  Error
Actual seats           34   20   21     9   20  14       8  14     5    5   N.A.
Peil.nl, 5 months      27   17   28    10   15  15       9  13     4    9    153
EenVandaag, 1 month    23   18   26    12   16  15      10  13     6   10    207
I&O Research, 1 week   24   16   20    14   20  17       9  14     6    5    153
Kantar Public, 1 day   27   20   23    11   18  14       9  15     4    6     65

Comparing my forecast with the model-based forecast from Dassonneville is a bit harder, as they only created a model to estimate the vote share of the largest party. They have 20 out-of-sample forecasts with an average difference of 4.38 percentage points between the forecast and the actual vote share. This would translate to approximately 7 seats in the House of Representatives, or an error score of 49 according to equation 2. It is hard to determine what the total error score would be if they used the model to estimate all the parties. It is unlikely that an error score of 49 per party would be the average error score (a total error of 490); however, I do not expect them to beat the error score produced by the parsimonious random forest and gradient boosting models. When solely observing the largest party, my models also produce a more accurate seat prediction than Dassonneville (although this is only a forecast of the 2017 election compared to 20 out-of-sample forecasts).
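The error scores quoted in this section can be checked with a few lines of Python. The sketch assumes that equation 2 is the sum of squared differences between predicted and actual seats; that form is an assumption on my part, chosen because it reproduces the scores reported in tables 15 and 16. The seat counts are copied from those tables, and the 150 seats used for the Dassonneville conversion is the size of the House of Representatives.

```python
# Party order: VVD, CDA, PVV, PVDA, D66, GL, CU SGP, SP, PVDD, 50+.
ACTUAL = [34, 20, 21, 9, 20, 14, 8, 14, 5, 5]

FORECASTS = {
    "random forest factor": [29, 24, 17, 14, 20, 7, 11, 18, 5, 5],
    "elastic net": [21, 25, 19, 11, 20, 9, 12, 21, 6, 6],
    "gradient boosting refined": [31, 26, 17, 11, 22, 6, 8, 17, 6, 6],
    "Kantar Public, 1 day": [27, 20, 23, 11, 18, 14, 9, 15, 4, 6],
}


def error_score(predicted, actual):
    """Sum of squared seat differences (assumed form of equation 2)."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))


scores = {name: error_score(seats, ACTUAL) for name, seats in FORECASTS.items()}

# Dassonneville comparison: 4.38 percentage points of 150 seats.
TOTAL_SEATS = 150
seat_gap = round(0.0438 * TOTAL_SEATS)
largest_party_score = seat_gap ** 2

for name, score in scores.items():
    print(f"{name}: {score}")
print(f"largest party only: {seat_gap} seats, score {largest_party_score}")
```

Because the differences are squared, a single large miss (such as the 13-seat underestimate of the VVD by the elastic net) dominates the score, which is worth keeping in mind when comparing models.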

8 The extended model

As discussed earlier in this paper, the way the data is structured (at the municipal level) makes it possible to include demographic variables. One could include a variable containing the number of immigrants in a municipality in the hope of better explaining the rise of the populist party PVV. A growing number of studies have found an impact of the inflow of immigrants on voting results (Bellucci, Conzo and Zotti, 2019). While there are studies that link the influx of immigrants to the rise of right-wing parties across Europe when looking at national effects (Vasilakis, 2018; Pardos-Prado, Lancee and Sagarzazu, 2014), there is also a Danish study that looks at the effect of refugees on voting results at the municipal level. Its authors conclude that the effect on voting results depends on whether a city belongs to the 5th percentile of most populous areas in Denmark. They find that more refugees in a more rural area lead to more right-wing voting, while the effect in urban areas is exactly the opposite (Dustmann, Vasiljeva and Damm, 2019). Furthermore, a Dutch study by Van der Paauw and Flache (2012) suggests a positive relation between the percentage of non-western immigrants and PVV votes, controlling for characteristics of the age distribution, the distribution of educational levels, socio-economic status, the income distribution, the number of residents, the distribution of religious groups and average perceptions of unsafety. This study was also done at the municipal level.

While the approach of the Danish study is interesting, I will first look at the effect of including solely the percentage of non-western immigrants per municipality on the model's capability of predicting election results. I include the number of non-western immigrants as a percentage of the total population of the municipality in the year of the election. In figure 7 one can observe the relation between the number of valid votes and the percentage of non-western immigrants. The percentage of non-western immigrants is higher in more urban areas. This might be an explanation for the more welcoming attitude towards immigrants in urban areas, as was found by Dustmann, Vasiljeva and Damm. I chose the percentage of non-western immigrants, as opposed to all immigrants or Islamic immigrants, because this was also done in the study of Van der Paauw and Flache.


Figure 7: Relation between immigrants and number of valid votes

Table 17: Gradient boosting model with percentage of non-western immigrants

Party            VVD  CDA  PVV  PVDA  D66  GL  CU SGP  SP  PVDD  50+
Actual seats      34   20   21     9   20  14       8  14     5    5
Predicted seats   28   25   16    17   21   6       8  18     6    5

The error score is 232. So including the percentage of non-western immigrants has decreased the accuracy of the model. One also observes some unexpected results when looking at table 18. One would expect the importance of immigrants to be much higher for the PVV. Clearly, including only the percentage of non-western immigrants does not lead to a better model. I will now distinguish between rural and urban areas and their attitudes towards immigrants.

Table 18: Variable importance for gradient boosting with the extended model

Variable               VVD    PVDA     CDA     PVV     D66      GL  CU SGP      SP    PVDD  PLUS50
Lagged votes         0.499   0.616   0.888   0.849   1.000   0.860   1.000   0.796   0.983   0.362
Percentage           0.300   0.061   0.024   0.147  0.0004   0.140  0.0004   0.204   0.017   0.638
In office variable       0   0.235   0.089   0.004       0       0       0       0       0       0
Income growth        0.201   0.088    N.A.    N.A.    N.A.    N.A.    N.A.    N.A.    N.A.    N.A.

might be another percentile that is more appropriate, as the Netherlands is a more populous country than Denmark, and thus municipalities that are relatively less populous might still be more populous than what is considered urban in Denmark. Therefore I run the model twice, once with the split at the 5th percentile and once with the split at the 10th percentile. The results of the forecast can be found in table 19. These models have an error score of 222 and 186 respectively.

Table 19: Gradient boosting model with percentage of non-western immigrants and urban binary

Party                            VVD  CDA  PVV  PVDA  D66  GL  CU SGP  SP  PVDD  50+
Actual seats                      34   20   21     9   20  14       8  14     5    5
Predicted seats 5th percentile    29   25   16    17   21   6       7  18     6    5
Predicted seats 10th percentile   28   25   17    16   21   7       8  17     6    5

Although this is an improvement on the results shown in table 17, the error score is still higher than that of the parsimonious gradient boosting model. A cause for the increase of the error score could be given by the paper of Bellucci, Conzo and Zotti. They state that the voting tendencies that are explained by immigrants are caused by media coverage of these immigrants rather than by just their presence in a country, as most papers describe. Another reason could be that I used the percentage of non-western immigrants, which also includes immigrants from eastern Asia, South America and Africa. The PVV campaigns against Islam and Muslim immigrants, so the inclusion of other immigrants might make the forecast less precise.
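One way to construct the urban binary used for the two runs is sketched below. The exact definition of the split is not fully spelled out here, so this is only an assumption: a municipality is flagged as urban when it belongs to the most populous 5 or 10 percent of municipalities. The population figures and the function name are hypothetical.

```python
def urban_flags(populations, top_share):
    """Flag the most populous `top_share` of municipalities as urban (1)."""
    n_urban = max(1, round(len(populations) * top_share))
    cutoff = sorted(populations, reverse=True)[n_urban - 1]
    return [1 if p >= cutoff else 0 for p in populations]


# Hypothetical populations (in thousands) for 20 municipalities.
populations = [
    12, 18, 25, 31, 40, 44, 52, 58, 63, 71,
    80, 95, 110, 130, 160, 210, 240, 350, 640, 870,
]

urban_5 = urban_flags(populations, 0.05)
urban_10 = urban_flags(populations, 0.10)
print(sum(urban_5), sum(urban_10))
```

The resulting 0/1 column can then be added to the predictors next to the percentage of non-western immigrants, so that the trees can split on urban versus rural municipalities.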

9 Discussion

The models used in this research have had different levels of success. However, some critical remarks have to be made when interpreting the results. To begin with, there are some comments about the data that one has to take into account. Data availability is an issue when forecasting the election results in the Netherlands at the municipal level, due to the shortage of economic variables at that level. This is unfortunate for the Dutch case, but it does not have to be an overall downside for the use of machine learning in election forecasting, as one can imagine that data in other countries (or in the Netherlands over time) is more widely available. Furthermore, it is worth noting that ideally one would use an economic predictor that measures the economic well-being in the period right before the election, instead of approximately 1 year before the election took place, as is the case in this paper. Lastly, due to municipal reclassifications, the income data of multiple municipalities were combined to obtain the income growth for the newly formed municipality. While it is nearly impossible to get the income growth completely correct with the numbers available, the current approach certainly does not lead to any irregularities in the income growth data.


in other areas and the theory of an economic and political variable predicting the number of votes may be sound, there is always a chance that this prediction got it right while the model is still wrong. To be sure of the forecasting ability of these models, one should test them on future elections (or on other past elections if more data becomes available) to see whether the error of these forecasts stays more or less the same.

Building on this, it is also very important to note that election forecasts will never be 100 percent accurate. There are no physics involved or natural laws to abide by, as there are when forecasting the weather (and even that cannot be forecast with 100 percent accuracy). There will always be variables influencing the election outcome that were not included in the model. It is therefore important that, when interpreting these forecasting results (or any election forecasts for that matter), one does not expect the forecast to exactly equal the outcome. It should however be the most likely outcome, and the actual outcome should fall within a reasonable margin of error. As said above, one can only really judge the capability of an election forecast model when one can compare multiple forecasts to actual outcomes.

It is important to note that I use the results of municipal elections to increase the number of data points. Furthermore, I use election results from three different kinds of elections (parliamentary, provincial and European). I tried to control for the fact that individuals might vote differently when the legislative body they vote for is different by including a binary variable for each type of election. As there are only four elections in the dataset, this did not change the result, but it might be wise to control for the type of election when more data is available to see if this has any effect.

Another shortcoming of the current models is that they are unable to predict the vote count of parties that have not participated in an election before. This is where opinion polls clearly have the upper hand over the models discussed in this paper. This does not mean that machine learning will never be able to predict the number of votes for new parties. There might be other predictive variables that can explain the votes for new political parties, but this is a topic for further research.

Lastly, I have extended the model by including the percentage of non-western immigrants to find disappointing results. This does not mean that it is not worthwhile to investigate whether the inclusion of other independent variables might lead to a better model. Further research must point out whether including extra predictive variables will reduce the error metric of the model.

10 Conclusion


learning techniques to forecast the Dutch House of Representatives election of 2017. Applying a parsimonious random forest and gradient boosting yields the best results and beats the opinion polls when the lead time is the same. Comparing those two models to the existing model-based forecast by Dassonneville, overall my models produce more accurate results. However, this comparison has to be approached with caution as the results are described a bit differently. Lastly, I constructed an extended model by including the percentage of non-western immigrants as an independent variable. This only made the forecast less accurate.

References

[1] Michael S. Lewis-Beck, Election Forecasting: Principles and Practice, British Journal of Politics and International Relations Vol. 7, 145–164, 2005.

[2] Ruth Dassonneville, Michael S. Lewis-Beck and Philippe Mongrain, Forecasting Dutch elections: An initial model from the March 2017 legislative contests, Research and Politics, July-September 2017: 1–7

[3] Edward Tufte, Political Control of the Economy, Princeton University Press, 1978.

[4] Morris Fiorina, Retrospective voting in American National Elections, New Haven: Yale University Press, 1981.

[5] Richard Nadeau, Michael S. Lewis-Beck and Éric Bélanger, Economics and Elections Revisited, Comparative Political Studies 46(5) 551–573, 2012.

[6] Michael S. Lewis-Beck and Tom Rice, Forecasting Elections, Washington, DC: Congressional Quarterly Press, 1992.

[7] Andreas Graefe, Issue and Leader Voting in U.S. Presidential Elections, Electoral Studies, 2013.

[8] Matthew M. Singer, Who Says “It’s the Economy”? Cross-National and Cross-Individual Variation in the Salience of Economic Performance, Comparative Political Studies, 44, 284-312, 2011.

[9] Alan Abramowitz, When good forecasts go bad: the time-for-change model and the 2004 presidential election, PS: Political Science and Politics, 37:4, 745–746, 2004.

[10] Thomas M. Holbrook, Forecasting US presidential elections, The Oxford Handbook of American Elections and Political Behavior, 346–371. Oxford: Oxford University Press, 2010.

[11] Peter Mair, Electoral volatility and the Dutch party system: A comparative perspective, Acta Politica 43(2–3): 235–253, 2008.


[13] Cas Mudde, The populist Zeitgeist, Government and Opposition 39, 541–563, 2004.

[14] Hans Georg Betz, Radical Right-wing Populism in Western Europe, Macmillan, Houndsmill, Basingstoke, 1994.

[15] Gijs Schumacher and Matthijs Rooduijn, Sympathy for the ‘devil’ ? Voting for populists in the 2006 and 2010 Dutch general elections, Electoral Studies 32(1): 124–133, 2012.

[16] Mark Segal and Yuanyuan Xiao, Multivariate random forests, WIREs Data Mining and Knowledge Discovery, Volume 1, January/February 2011.

[17] Muhammad Awais, Saeed-Ul Hassan, Ali Ahmed, Leveraging big data for politics: predicting general election of Pakistan using a novel rigged model, Journal of Ambient Intelligence and Humanized Computing, July 2019.

[18] Davide Bellucci, Pierluigi Conzo, Roberto Zotti, Perceived immigration and voting behaviour, Carlo Alberto Notebooks, no 588, June 2019.

[19] Chrysovalantis Vasilakis, Massive Migration and Elections: Evidence from the Refugee Crisis in Greece, International Migration Vol. 56 (3), 2018.

[20] Sergi Pardos-Prado, Bram Lancee, Iñaki Sagarzazu, Immigration and Electoral Change in Mainstream Political Space, Political Behavior, 36: 847–875, 2014.

[21] Christian Dustmann, Kristine Vasiljeva, Anna Piil Damm, Refugee Migration and Electoral Outcomes, Review of Economic Studies, 86, 2035–2091, 2019.