Modelling the Eurovision voting behaviour

Maurice Schaasberg (11810866)
Bachelor thesis, 18 EC
Bachelor Kunstmatige Intelligentie

University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: dhr. dr. J.A. Burgoyne
Capaciteitsgroep Muziekwetenschap, Faculty of Humanities, University of Amsterdam
Roetersstraat 11, 1018 WB Amsterdam

June 26th, 2020


Abstract

The legitimacy of the Eurovision voting has been questioned throughout the years. Attempts have already been made to find evidence for collusion, but the matter remains contentious. This paper attempts to find further evidence by training machine learning algorithms that simulate the voting behaviour of the contest. Both classifiers and regression models were trained on a combination of musical and non-musical data. The two types of data proved similarly effective, and the overall performance of the models was modest. Despite that, evidence for certain biases was found.


Contents

Abstract
Introduction
Method
    Data
    Models
    Evaluation
Results
    Musical regression
    Non-musical regression
    Combined regression
    Backward stepwise regression
    Log transform
    Classification
    Model interpretation
Discussion
Conclusions
References


Introduction

There have been countless attempts to find signs of collusion or bias in the voting for the Eurovision Song Contest. The popular contest, running since 1956, is often criticized for politically motivated voting: countries are thought to vote for each other because of mutual agreements rather than musical merit.

In fact, numerous studies have shown that certain groups of countries are more likely to vote for each other, suggesting a possible bias. The research done by Yair et al. in 1995 is believed to be the first to address this issue. They introduced the concept of voting blocs: a group of countries that shows favouritism towards its own members.

It is uncertain whether these countries do this consciously or whether it is the result of cultural proximity, as these groups share similarities in language, history, and so on. This paper tries to tackle the problem from a different angle, namely by trying to predict the voting behaviour with machine learning algorithms. The main focus is to train models and examine which features they find most important. If the models prefer features that are unrelated to the quality of the music over musicological ones, this could suggest that some form of collusion may have occurred. On the other hand, if the musicological data generates the best predictors, this can assuage collusion concerns.

It is not expected that any type of machine learning algorithm will be able to accurately predict the voting behaviour of the Eurovision Contest. Too many factors make this difficult: jury members are constantly changing, musical quality cannot be measured empirically, music tastes change over time, and so on. Bets on who will win the contest are still placed every year, which is a strong indication that perfect accuracy should be unattainable.

The models are likely not going to strictly prefer either musical data or non-musical data, and will probably show preference for a mixture of both.


Method

Data

The dataset includes all performers and votes from the finals between 1975 and 2019. The various voting systems used before 1975 were all too different from the later ones, so there is no way to convert those votes to a comparable scale. The system introduced in 1975 stayed the same until 2016, when a minor change was implemented: televotes and jury votes now produce individual scores that are added up, rather than being combined into a single score. To accommodate that change, the total points of performers between 2016 and 2019 were divided by 2. After filtering for finals between 1975 and 2019, 1028 performers and 30053 votes remained.

For the set of musical features, an audio analysis tool called 'Essentia' was used[1]. This tool measures a wide variety of low-level, tonal and rhythm descriptors such as melband energy, harmonic pitch class profiles, and bpm. In total it computed 4385 features for every song. Some of them were not numeric and had to be removed; these were typically features describing information regarding keys.

Such a large dataset is unsuitable for most machine learning algorithms, so some form of dimensionality reduction had to be applied; Principal Component Analysis (PCA) was used for this. It is common practice to standardize the features before performing PCA, so that was done first. During this standardization it was found that a large number of features had a standard deviation of zero. These had to be excluded since they result in division by zero, and variables that have the same value for every observation are useless descriptors in any case. After standardization and removal of non-numeric features, the set ended up with 4121 variables. PCA was then applied on this reduced set, with the first 71 principal components explaining 80% of the variance.
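A minimal sketch of this reduction step is given below, assuming the Essentia descriptors have been loaded into a pandas DataFrame; the variable and function names are illustrative, not taken from the thesis code.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def reduce_musical_features(essentia_df: pd.DataFrame, variance: float = 0.80) -> pd.DataFrame:
    # Keep numeric columns only (key/scale descriptors are strings).
    numeric = essentia_df.select_dtypes(include=[np.number])
    # Drop constant columns: they carry no information and break standardization.
    numeric = numeric.loc[:, numeric.std() > 0]
    # Standardize, then keep as many components as needed for the target variance.
    scaled = StandardScaler().fit_transform(numeric)
    pca = PCA(n_components=variance)
    components = pca.fit_transform(scaled)
    cols = [f"PC {i + 1}" for i in range(components.shape[1])]
    return pd.DataFrame(components, columns=cols, index=essentia_df.index)
```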

As for the non-musical features, these were computed manually and were thus limited to ones that are easy to compute. The features were mostly inspired by previous research. A study by Ginsburgh and Noury found that a performer benefits from representing the host country[3]. The same study also found that a song being sung in either English or French has an effect as well. For language detection the python package 'Langdetect' was used. Plenty of other packages are available, but this one is said to work well on large plain text; since all of the texts are full song lyrics, this seemed like an appropriate choice. With this package two binary columns could be computed, indicating whether a song was sung in English or in French. Langdetect provides a confidence score alongside the detected language, which is the estimated probability that the detected language is correct. For most songs this probability was 99.99%, but there were a few special cases where it was much lower. This occurs when a song is sung in multiple languages, or when the lyrics contain many notes like 'oooh' and 'na'. Since the language probability does say something about the song, it was added to the set of features.
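A small sketch of how such language features might be derived with Langdetect is shown below; the function and dictionary keys are illustrative.

```python
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # langdetect is non-deterministic without a fixed seed

def language_features(lyrics: str) -> dict:
    best = detect_langs(lyrics)[0]  # most probable language and its probability
    return {
        "sung in English": int(best.lang == "en"),
        "sung in French": int(best.lang == "fr"),
        "language probability": best.prob,
    }
```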

The order in which performers perform their songs is assigned randomly, but many people believe it has an effect on the voting. Research by Haan, Dijkstra and Dijkstra indicated that the first performer does better than average, and the same goes for the later performers[4]. Using the full ordering as a feature is not practical because the number of contestants varies each year: being the 20th performer in 1980 is not the same as being the 20th performer in 2010. Instead, the ordering of each year was used to assemble two binary features: whether the contestant was the first performer, and whether they were the last.
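The running-order features could be derived along the lines of the sketch below, assuming a DataFrame with `year` and `running_order` columns (both assumed names):

```python
import pandas as pd

def order_features(performers: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=performers.index)
    # First/last position is relative to each year's line-up, so compare within years.
    first = performers.groupby("year")["running_order"].transform("min")
    last = performers.groupby("year")["running_order"].transform("max")
    out["first in order"] = (performers["running_order"] == first).astype(int)
    out["last in order"] = (performers["running_order"] == last).astype(int)
    return out
```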

Finally, voting blocs were added to the set as well, as 15 dummy variables, one for each voting bloc. The voting blocs used are the ones listed on Wikipedia:

Bloc  Countries
A     France, Luxembourg, Monaco
B     Greece, Cyprus
C     Turkey, Azerbaijan
D     Australia, Malta, Ireland, United Kingdom
E     Austria, Germany, Switzerland
F     Netherlands, Belgium
G     Andorra, Portugal, Spain
H     Albania, Italy
I     Italy, San Marino
J     Sweden, Norway, Finland, Denmark, Iceland
K     Estonia, Latvia, Lithuania
L     Romania, Moldova
M     North Macedonia, Albania
N     Serbia, Bosnia Herzegovina, Slovenia, Montenegro, North Macedonia, Croatia
O     Belarus, Ukraine, Russia, Azerbaijan, Armenia, Georgia, Latvia, Lithuania, Estonia

Table 1: Voting blocs as listed on Wikipedia[2].


There are no official conventions regarding voting blocs, as it is quite a contentious subject. The definitions in table 1 were inspired by a study on Eurovision voting blocs by Derek Gatherer[5]. Many other ways of separating countries into voting blocs exist, but those are very similar to the ones shown here and the results would likely be the same if they were used instead.
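The bloc dummies might be built along the lines of the sketch below; only the first two blocs are written out, and the `country` column name is an assumption.

```python
import pandas as pd

BLOCS = {
    "A": {"France", "Luxembourg", "Monaco"},
    "B": {"Greece", "Cyprus"},
    # the remaining blocs C-O would follow here, as listed in Table 1
}

def bloc_dummies(performers: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=performers.index)
    for name, members in BLOCS.items():
        # A country may belong to several blocs (e.g. Albania), so the dummies
        # are not mutually exclusive.
        out[f"Voting bloc {name}"] = performers["country"].isin(members).astype(int)
    return out
```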

All of these combined resulted in a set of 21 non-musical features:

1. from host country
2. sung in English
3. sung in French
4. language probability
5. first in order
6. last in order
7-21. voting blocs

Models

Interpretability is a very important attribute for the models. How well a model works is not the only factor; knowing how much each feature contributes to a model is even more relevant. Linear regression is a natural fit because it is very easy to interpret: every feature ends up with its own weight, which indicates exactly how important it is for the model. Other regression models will be trained as well, mainly for comparison; should any of them perform significantly better than linear regression, those will have to be interpreted too. The other regression algorithms are random forest regression, partial least squares regression and AdaBoost. Polynomial regression is also an option, but with 71 principal components it would produce far too many features. Instead, linear regression will be expanded with just the quadratic terms.
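The sketch below lists the candidate regressors in scikit-learn form; the hyperparameters shown are library defaults, not values taken from the thesis.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# "Quadratic terms" here means appending each feature's square, without
# interaction terms, so the feature count only doubles.
add_squares = FunctionTransformer(lambda X: np.hstack([np.asarray(X), np.asarray(X) ** 2]))

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(),
    "partial least squares": PLSRegression(),
    "AdaBoost": AdaBoostRegressor(),
    "linear + quadratic terms": make_pipeline(add_squares, LinearRegression()),
}
```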

All of the regression models will have average points as the target variable. Using the total points that a performer got as the target is not suitable: every year a different number of countries participates and thus votes, and when more countries are voting a performer naturally ends up with more points. Weighing the total points by how many countries got to vote resolves that issue.
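A hedged sketch of the target computation, with assumed argument names, might look like this:

```python
def average_points(total_points: float, n_voting_countries: int, year: int) -> float:
    # Post-2016 totals are halved so they stay comparable with the pre-2016
    # system, since jury and televote scores are added together from 2016 on.
    if year >= 2016:
        total_points = total_points / 2
    # Weigh by how many countries got to vote that year.
    return total_points / n_voting_countries
```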


Regression is not the only suitable way to model the voting behaviour: binary and multiclass classification can also work. This can be done by changing the observations from performers to raw votes. The target variable then becomes how many points country X gave to performer Y. This captures a very important aspect of the Eurovision voting, namely which specific country is voting. In the real contest it often happens that certain songs are loved by one group of countries and disliked by another. That dynamic is lost when only the average points of a song are considered.

Due to how the voting system works, multiclass classification will likely produce very poor results. When a country can only assign points to 10 performers, all the other performers automatically get 0 points from that country. The result is a very unbalanced set of classes: the classes for 1-8, 10 and 12 points are all equally common, each making up 4.4% of the votes, while the class for zero points covers 55.4% of all votes. Classification algorithms usually do not perform well on such an unbalanced dataset, which is why binary classification is likely the better option. Instead of considering how many points a performer got from a country, one can consider whether they got any points at all. After converting all votes to a binary column, the balance is reasonable: the 0 class still occurs 55.4% of the time, but all the other classes are now combined into one class occurring 44.6% of the time.
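Converting the raw votes to this binary target could look like the following one-liner (the `points` column name is an assumption):

```python
import pandas as pd

def binary_target(votes: pd.DataFrame) -> pd.Series:
    # 1 = the performer received any points from this country, 0 = none at all.
    return (votes["points"] > 0).astype(int)
```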

As mentioned previously, the voting system changed in 2016. That change could be worked around so that the years after 2015 could still be included in the dataset for the regression models. Unfortunately that is not the case for the classifiers. Including just the jury votes (or televotes) would solve this problem, but that was not possible with the dataset used in this study: the voting data only listed how many total points a country awarded to a performer, not how much of that total came from the jury or the televote. The classifiers were therefore trained only on data between 1975 and 2015. Removing the last 4 years reduced the total number of votes from 30053 to 24761.


Evaluation

Model performance will be gauged with standard metrics. For the regression models that metric is the mean squared error (MSE), calculated as follows:

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \tilde{y}_i)^2

where y_i is the predicted value, \tilde{y}_i is the true value and n is the total number of predictions.

The metric for the classifiers is the classification accuracy:

\mathrm{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}

Neither of these scores is very informative on its own; both require a baseline for comparison. A suitable baseline model for regression is one that always returns the mean as the predicted value. The average MSE of this model was roughly 4.5. For the classifiers, the baseline model returns the most common class, which is the aforementioned 0 class. That class occurs 55.4% of the time, so the baseline's accuracy is 55.4%.
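One way to implement these baselines is with scikit-learn's dummy estimators, as sketched below; whether the thesis used these classes is not stated.

```python
from sklearn.dummy import DummyClassifier, DummyRegressor

# Always predict the training-set mean (regression baseline).
regression_baseline = DummyRegressor(strategy="mean")
# Always predict the most common class (classification baseline).
classification_baseline = DummyClassifier(strategy="most_frequent")
```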

Results

Musical regression

Initially, the number of principal components was simply set to however many were needed to explain 80% of the variance. To find a more optimal number of components, linear regression models were trained for each amount. The performance was very inconsistent and dependent on the train/test split. To eliminate this randomness, 1000 different splits were made and the average MSE was calculated. How the MSE changes as a function of the number of components is shown here:


Figure 1: Average MSE for each number of principal components

Figure 1 shows that the first 14 components are the optimal number; beyond 14 the MSE steadily gets worse. That 14 is optimal for linear regression does not necessarily mean it will be optimal for the other algorithms as well.
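The averaging procedure behind Figure 1 might look like the sketch below; `X_pcs` and `y` are assumed placeholders for the principal components and the average points per performer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def mean_mse_per_n_components(X_pcs, y, max_components=71, n_splits=1000):
    X_pcs = np.asarray(X_pcs)
    results = []
    for n in range(1, max_components + 1):
        mses = []
        for seed in range(n_splits):
            # A fresh random split per iteration averages out the split-to-split noise.
            X_tr, X_te, y_tr, y_te = train_test_split(X_pcs[:, :n], y, random_state=seed)
            model = LinearRegression().fit(X_tr, y_tr)
            mses.append(mean_squared_error(y_te, model.predict(X_te)))
        results.append(np.mean(mses))
    return results
```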

How all the algorithms compare to each other and how each algorithm changes as the number of components increases is shown in the following figure:

Figure 2: Comparison between regression algorithms

The random forest, as one would expect, seemingly gets better as it gets more predictors. This increase stagnates at about 10 components. AdaBoost and linear regression with quadratic terms both seem to perform worse with more components. Partial least squares regression is a horizontal line because that algorithm trains on the full set of musical features instead of the principal components. Something to note about the quadratic terms is that the x axis indicates the number of components plus each of those components squared.

In terms of performance, none of the models do much better than the baseline. Linear regression is the best performing algorithm, but even so it only performs roughly 2% better than the baseline at best. All of the other algorithms do significantly worse than the baseline model. AdaBoost in particular performs very poorly.

Non-musical regression

Model               Average MSE
linear regression   4.406
baseline            4.500
AdaBoost            4.633
random forest       5.220

Table 2: Average MSE on the non-musical dataset

Linear regression is once again the best performing algorithm. The random forest performs much worse when trained on the non-musical data instead of the musical data. AdaBoost and linear regression on the other hand do better with this non-musical set. The overall result here is that the non-musical models perform slightly better than the musical ones.

Partial least squares regression and quadratic term models were not trained using this set. The set mostly consists of binary variables which are unaffected by taking the square, so it does not make sense to include the quadratic terms.

Combined regression

Naturally, it was worth looking into combining the two datasets and training models on the result. Again, the principal components were added one by one to see how the performance fluctuates. The models in figure 3 were trained using the full non-musical set plus the first n components.


Figure 3: Performance using both datasets

Both the random forest and linear regression improved by a fair bit. The best performing model is now linear regression using all the musical and non-musical features, with an MSE of 4.29. That is about 4.5% better than the baseline model. AdaBoost's performance now lies between its results on the musical-only and the non-musical-only sets.

Backward stepwise regression

Figures 2 and 3 show a trend where the addition of certain principal components improves the performance, whilst others decrease it. This suggests that backward stepwise feature selection is likely to improve the linear regression model.

Implementing backward stepwise feature selection is not as straightforward as it normally would be, because the results are so inconsistent. To account for the randomness factor, the algorithm was implemented with these steps:

1. Make a train/test split and train a model using the full set of features.
2. Measure how the performance changes by removing each feature.
3. Remove the feature that causes the biggest increase in performance.
4. Repeat steps 2 and 3 until only one feature remains.
5. Determine what the best performing set of features was.
6. Repeat steps 1-5 a thousand times and count how frequently each feature ends up in the best set.
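A sketch of this repeated selection loop is given below. The `evaluate` helper, which is assumed to train linear regression on the given columns for one train/test split and return the test MSE, is an illustrative abstraction rather than code from the thesis.

```python
from collections import Counter

def backward_stepwise_frequencies(all_features, evaluate, n_repeats=1000):
    frequency = Counter()
    for seed in range(n_repeats):
        remaining = list(all_features)
        best_set, best_mse = list(remaining), evaluate(remaining, seed)
        while len(remaining) > 1:
            # Try dropping each feature and keep the removal that helps most.
            scores = {f: evaluate([g for g in remaining if g != f], seed) for f in remaining}
            drop = min(scores, key=scores.get)
            remaining.remove(drop)
            if scores[drop] < best_mse:
                best_set, best_mse = list(remaining), scores[drop]
        frequency.update(best_set)  # count how often each feature survives
    return frequency
```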

This process was applied to the combined set with linear regression, and the obtained frequencies are shown in this table:


Feature               Frequency
PC 2                  851
PC 11                 796
Sung in English       790
PC 14                 786
PC 12                 754
Voting bloc O         747
From host country     743
Voting bloc K         716
Sung in French        713
Voting bloc M         712
PC 5                  706
Voting bloc G         686
PC 1                  664
Voting bloc D         648
PC 6                  634
PC 4                  626
PC 10                 614
Last in order         585
PC 13                 584
PC 7                  558
Voting bloc A         554
Voting bloc C         532
Voting bloc I         518
Voting bloc L         518
Voting bloc N         501
Voting bloc H         486
First in order        475
Voting bloc B         396
Voting bloc J         387
PC 3                  374
PC 8                  314
Language probability  312
Voting bloc E         308
Voting bloc F         254
PC 9                  119

Table 3: Feature frequencies


The columns of the combined set were then sorted by the obtained frequencies in descending order. Iterating through this set, as done previously with the principal components, yielded these results:

Figure 4: backward stepwise regression performance

The performance has increased again, with the best MSE now being about 4.19. What is very peculiar in this graph is the performance jump when the 23rd feature is added. It seems like taking the first 15 features and manually adding the 23rd one could make the model even better.

Figure 5: backward stepwise regression compared to the baseline

Figure 5 shows that adding the 23rd feature to the first 15 did indeed make the model better. In the figure the same graph is plotted, except that the addition of the 16th feature was replaced by the 23rd. Doing this made the MSE drop further still.

The set of features obtained by applying backward stepwise selection consists of principal components 1, 2, 5, 6, 11, 12 and 14, voting blocs D, G, I, K, M and O, 'sung in English', 'sung in French' and 'from host country'. (Note: voting bloc I was added manually.)

Log transform

The distribution of average points looks like this:

Figure 6: average points distribution

Regression models generally tend to perform poorly when their target variable has a distribution like this: something that looks like a bell curve is more suitable than a downward curve. Since the average points data has a right skew, applying a log transform should give it a better shape.
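A minimal sketch of the transform is shown below; the use of log1p is an assumption made here to handle performers that received (close to) zero points, since a plain log is undefined at zero.

```python
import numpy as np

def log_transform(average_points):
    # log1p computes log(1 + x), keeping zero-point performers well-defined.
    return np.log1p(np.asarray(average_points))
```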

Figure 7: average points distribution after applying log transform

The point distribution is still not a perfect bell curve, but it has gotten much closer. In theory this should improve the models. Since the target variable now lies on a different scale, the new MSEs cannot be compared to the previous ones. It makes more sense in this case to express the performance as a percentage increase relative to the baseline.

Figure 8: Performance relative to baseline before/after log transform

The log transform did not have much of an effect on most of the performances. AdaBoost is the only algorithm that was substantially improved. The best performing one, backward stepwise linear regression, did not improve: before the transform it performed roughly 7.5% better than the baseline, after the transform about 6%.

Average R squared scores were calculated for the two best performing models as well.

Model               R squared
no log transform    0.0651
log transform       0.0534

Table 4: R squared scores

The R squared score is higher without the log transform, but since the target variables are not equal these cannot be compared directly.


Classification

                        Multiclass   Binary
Random forest           0.553        0.568
baseline                0.554        0.554
AdaBoost                0.542        0.536
Decision tree           0.445        0.526
K-nearest neighbours    0.494        0.513

Table 5: Average accuracy of classifiers

As expected, the multiclass classifiers did not perform well: not one algorithm managed to beat the baseline accuracy of 0.554. The binary classifiers appear to be a bit better, with the random forest exceeding the baseline by a modest 0.014. Relative to the baseline, this increase is approximately 2.5%.

Model interpretation

The best performing model was the backward stepwise regression. Since this is a linear regression model, it is very easy to interpret. The weight that a feature gets is not consistent and depends on the train/test split, so the average weight was measured instead. Before doing that, the principal components were standardized so that all the features have a similar scale. This standardization did not affect performance.
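The weight-averaging step might look like this sketch, with `X` and `y` as assumed placeholders for the standardized selected features and the target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def average_weights(X, y, feature_names, n_splits=1000):
    coefs = []
    for seed in range(n_splits):
        # Refit the final linear model on many splits and collect its coefficients.
        X_tr, _, y_tr, _ = train_test_split(X, y, random_state=seed)
        coefs.append(LinearRegression().fit(X_tr, y_tr).coef_)
    return dict(zip(feature_names, np.mean(coefs, axis=0)))
```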


Feature             Average weight
Voting bloc I        1.319
From host country    0.980
Voting bloc K       -0.874
Voting bloc O        0.804
Sung in English      0.754
Sung in French       0.636
Voting bloc D        0.628
Voting bloc M       -0.552
Voting bloc G       -0.410
PC 2                -0.326
PC 11               -0.200
PC 12               -0.195
PC 14               -0.184
PC 6                 0.160
PC 1                -0.140
PC 5                 0.131

Table 6: Feature weights

The weights shown in table 6 are given in descending absolute order, but that does not mean the top feature is the most important: the principal components lie on a slightly different scale. That scale (after standardization) is roughly between -4 and 4, whereas the non-musical features are all binary variables.

Discussion

It was already known that modelling the Eurovision's voting behaviour would be a difficult task; the results found here demonstrate just how difficult it actually is. Despite trying various approaches and improvements, the best performing model only beats the simple baseline by 7.5%. The poor results are largely due to the difficult nature of the task, but there are certainly shortcomings in the approach used in this study. For starters, the non-musical feature set is very limited. Many more characteristics would have been included had they been accessible, but there is little publicly available data regarding the contestants, so the set was limited to things that are easy to find or generate. Had this set of features been better, the performance might have been different.

An aspect that plays a big role in the real voting, but was absent from this study, is the 'competition' within each contest. It does not matter how good a song is: if there are 10 even better songs it will get 0 points despite being good. The strength of the competition in a given year would therefore be valuable information for a model that has to predict how well a song will do. This aspect is very difficult to implement and no methods of doing so were found in this study.

Lastly, it might be the case that each country has its own unique voting behaviour. It is inconceivable that every one of these could be captured by a single model. Attempts were made to train models for each country, but the results were incredibly poor and have been omitted from this paper.

Not only were the results poor, there also does not seem to be a substantial difference in the effectiveness of musical versus non-musical data. The models trained on either one of the two had very similar performances. Using both datasets instead of just one also yielded better results, meaning that both types of data are useful. Even the backward stepwise selection did not show a preference for either type of data: the best set of features contained 7 musical and 9 non-musical features.

This does have some interesting implications for the purity of the Contest's voting. A truly unbiased judge would only base their vote on musical merit. That is clearly not the case, as the non-musical features would have been useless to the models if the judges were unbiased. What these findings imply about the possibility of collusion is uncertain: there is no way to tell whether judges are showing a bias subconsciously or deliberately.

Something that is still unclear is whether the log transform improved the performance or not. When compared to their respective baselines the increase was smaller, but that might be an unfair comparison. Although the same model was used for both baselines, the strength of that baseline depends on the distribution: always predicting the mean is more effective on a bell curve distribution than on a right-skewed one. The R squared scores also cannot be compared directly since the target variables are different.

Interpreting the best performing model turned out to be a lot more difficult than expected. The average weight of each feature has been measured, but that is not enough to properly interpret the model. Since the model was trained using principal components, one would have to know what those components mean. Normally that can be done by looking at the factor loadings of each component. This becomes a problem, however, when the original dataset contains over 4000 highly specific features. Top loadings such as 'mean spectral energyband' are notoriously difficult to interpret and would require an expert in the field.

Despite that, the 9 non-musical features can be interpreted. The positive weights of voting blocs I, O and D suggest that contestants from those blocs generally do better on average; voting bloc I in particular has a high positive weight. Voting blocs K, M and G having a negative weight suggests that countries from those blocs get fewer points than average. The model also shows a preference for songs sung in either English or French. This makes sense considering the contest is international: citizens of any country are usually more familiar with those two languages than with any other language, excluding their native one. Being from the host country also ended up being an important predictor.

A different approach that could potentially work better would be to use a different set of audio features. The ones used here are low-level features, and it might be better to use high-level features instead. The set of 4000 audio features as a whole gives a very detailed description of a song, but each individual feature on its own does not. A set of highly descriptive features, such as mood, could result in models that more accurately simulate how judges vote.

Conclusions

Various regression models have been trained to simulate the voting behaviour of the Eurovision judges. Backward stepwise linear regression was the best performing one, although none of them performed well. Both musical and non-musical data turned out to be useful for this task, neither of them more so than the other. Feature importance for the best model has been determined, and interpreted where possible. No obvious signs of collusion were found but evidence for certain biases was. Further improvements can be made and other approaches are still worth trying.


References

[1] Essentia audio analysis library. https://essentia.upf.edu/

[2] "Voting at the Eurovision Song Contest", Wikipedia. https://en.wikipedia.org/wiki/Voting_at_the_Eurovision_Song_Contest

[3] V.A. Ginsburgh, A. Noury, 2005, "Cultural voting: The Eurovision Song Contest"

[4] M. Haan, G. Dijkstra, P. Dijkstra, 2003, "Expert Judgment versus Public Opinion - Evidence from the Eurovision Song Contest"

[5] D. Gatherer, 2006, "Comparison of Eurovision Song Contest Simulation with Actual Results Reveals Shifting Patterns of Collusive Voting Alliances"
