Predicting The World Cup Soccer 2014


Bachelor Thesis

July 10, 2014

Student: Dennis Steenhuis

Primary supervisor: Prof. Dr. E.C.Wit


Abstract

Every four years there is a lot of speculation about who will win the soccer World Cup. This thesis approaches the World Cup in a statistical way and states the winning chances of each team based on historical data. For these predictions we have used three different models. The first is the Bradley-Terry model with the modification of Davidson. The second is a Generalized Linear Model (GLM), and the third is a small modification of it, the so-called Generalized Linear Mixed Model (GLMM). With these models we have tried to predict the outcome of the World Cup, but football remains a very unpredictable game.


List of Figures

4.1 Histogram of number of goals scored by Algeria per match
4.2 Histogram of number of goals scored by Spain per match
5.1 Barplot win chances
5.2 Barplot win chances
5.3 Barplot win chances
B.1 Most likely advancing through knock-out stage Bradley-Terry
B.2 Most likely knock-out results GLM
B.3 Most likely knock-out results GLMM


List of Tables

4.1 Data of FIFA World Cup 2014 participants
4.2 Results of Bradley-Terry estimates
4.3 Group B match probabilities Bradley-Terry
4.4 Group B probabilities Bradley-Terry
4.5 Win chance Bradley-Terry
4.6 Deviance and AIC-values GLM
4.7 Summary of GLM 9
4.8 Summary of GLM 10
4.9 Group B match probabilities GLM
4.10 Group B probabilities GLM
4.11 Win chance GLM
4.12 Most likely match results group B GLM
4.13 Most likely results group B GLM
4.14 Summary of the GLMM
4.15 Group B match probabilities GLMM
4.16 Group B probabilities GLMM
4.17 Win chance GLMM
A.2 Summary of GLM 9
A.3 Summary of the GLMM
B.1 Group stage match probabilities BT
B.2 Group standing probabilities BT
B.3 Advance probability Bradley-Terry
B.4 Most likely group outcomes Bradley-Terry
B.5 Group stage match probabilities GLM
B.6 Group standing probabilities GLM
B.7 Advance probability GLM
B.8 Most likely group match results GLM
B.9 Most likely group results GLM
B.10 Group stage match probabilities GLMM
B.11 Group standing probabilities GLMM
B.12 Advance probability GLMM
B.13 Most likely group match results GLMM
B.14 Most likely group results GLMM


Chapter 1 Introduction

Football is one of the most popular sports in the world. Every two years, when the national team has qualified for the European Cup or the World Cup, people in the Netherlands begin to speculate about which team is going to win and whether 'we' have a chance to win the cup. In this thesis we approach the World Cup in a statistical way.

We start with the theory on which our models are built. Different models are explained, for example the model of Bradley and Terry [2] and Generalized Linear Models. In addition, the hypothesis tests and selection criteria we use are covered, as are the Poisson distribution and the related Skellam distribution [11].

Chapter 3 is a short chapter about the structure of the FIFA World Cup Soccer 2014. In chapter 4 we start predicting. For these predictions we have used a data set of 2646 football matches, which includes at least the last 100 games played by each participant of the World Cup as of March 14, 2014.


Chapter 2 Theory

2.1 Generalized Linear Models

2.1.1 Linear Models

Linear models are used to model the relation between a response variable $y$ and explanatory variables $X$. Suppose we are given a data set $\{y_i, x_{i1}, x_{i2}, \dots, x_{ip}\}$, $i = 1, \dots, n$, in which we want to model $y_i$ as depending on the covariates $x_{i1}, x_{i2}, \dots, x_{ip}$. With linear regression we assume that $y_i$ is normally distributed and that:

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i = x_i^T \beta + \varepsilon_i, \quad i = 1, \dots, n,$$

where $x_i = (1, x_{i1}, \dots, x_{ip})^T$ and $\varepsilon_i$ is an error term. The error terms are independent and identically normally distributed with mean zero and variance $\sigma^2$, $\varepsilon_i \sim N(0, \sigma^2)$. Often all $n$ equations are combined into a single equation in matrix form:

$$y = X\beta + \varepsilon,$$

where

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_p \end{pmatrix} \quad \text{and} \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$

As shown in chapter 6 of Dobson and Barnett [4], both the Least Squares Estimator (LSE) and the Maximum Likelihood Estimator (MLE) of $\beta$ are equal to:

$$\hat{\beta}_{LSE} = \hat{\beta}_{MLE} = (X^T X)^{-1} X^T y$$
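As a small illustration of this estimator, the following Python sketch solves the normal equations for the simplest case of one covariate plus an intercept. The function name `least_squares_line` and the data are hypothetical and only serve as an example; the thesis itself works in R.

```python
def least_squares_line(xs, ys):
    """Solve (X^T X) beta = X^T y for the model y = b0 + b1 * x."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy]; invert the 2x2 matrix.
    det = n * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

# Hypothetical data lying exactly on the line y = 1 + 2x.
b0, b1 = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
```

For data that lie exactly on a line, the estimator recovers the intercept and slope of that line.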

2.1.2 Generalized Linear Model

With linear models we are restricted to the assumption that $y_i$ is normally distributed. In some cases, however, we want to model discrete data, for which we need, for example, a Poisson or binomial distribution. Or a plot may show that the response variable follows, for example, a gamma distribution. In these cases we can use a Generalized Linear Model (GLM), which, as the name suggests, is a generalization of the linear model.

For a GLM we have to define three things:

• The distribution of the response variable $Y_i$, which has to be a member of the exponential family. This means that the density function of $Y_i$ can be written in the form $f(y; \theta) = \exp\{a(y)b(\theta) + c(\theta) + d(y)\}$.

• The linear predictor $\eta_i$, which is often stated as $\eta_i = x_i^T \beta$.

• The link function $g$, with $g(E[Y_i]) = \eta_i$, which links the linear predictor to the expected value of $Y_i$.

In R the maximum likelihood estimates of $\beta$ in a GLM can be computed with the function glm.
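To make the three ingredients concrete, the following Python sketch fits a Poisson GLM with log link and a single covariate by Newton-Raphson iteration, which for this link coincides with Fisher scoring. It is an illustration under simplifying assumptions, not the thesis's R code; the function name `fit_poisson_glm` and the data are hypothetical.

```python
import math

def fit_poisson_glm(xs, ys, max_iter=100, tol=1e-12):
    """Fit log(mu_i) = b0 + b1 * x_i for Poisson counts by Newton-Raphson."""
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0  # start at the intercept-only fit
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = math.exp(b0 + b1 * x)
            g0 += y - mu          # score with respect to b0
            g1 += (y - mu) * x    # score with respect to b1
            h00 += mu             # entries of the information matrix
            h01 += mu * x
            h11 += mu * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical goal counts: two matches with x = 0 and two matches with x = 1.
b0, b1 = fit_poisson_glm([0, 0, 1, 1], [1, 3, 4, 8])
```

With a binary covariate the fit reproduces the group means: `exp(b0)` equals the mean count for x = 0 and `exp(b0 + b1)` the mean count for x = 1.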

2.1.3 Generalized Linear Mixed Model

When we use Generalized Linear Mixed Models (GLMMs) we still need to define the distribution of the response variable and the link function. The difference lies in the linear predictor, to which we add a random effect. So instead of $\eta_i = x_i^T \beta$ we assume that:

$$\eta_i = x_i^T \beta + z_i^T \gamma.$$

Here the term $x_i^T \beta$ is called the fixed effects and $z_i^T \gamma$ the random effects. We could simply estimate $\gamma$ as a fixed parameter, but then no randomness would be involved. Therefore we assume that $\gamma \sim N(0, G)$ and estimate the covariance matrix $G$. For simplicity it is often assumed that all random effects are independent.

The estimates for $\beta$ and $G$ can be found using the function glmer from the package lme4 in R.

2.2 Bradley-Terry Model

The Bradley-Terry model was introduced by Bradley and Terry [2] to compare treatments. Later, Davidson [3] modified this model in order to allow ties between different treatments.

2.2.1 The standard Bradley-Terry Model

To start we assume there are $t$ different treatments. According to the Bradley-Terry model, the probability of preferring treatment $i$ over treatment $j$ is equal to:

$$\Pr(i \text{ preferred over } j) = \frac{\pi_i}{\pi_i + \pi_j}, \quad i \neq j.$$

This implies that either $i$ is preferred over $j$ or $j$ is preferred over $i$, because

$$\Pr(i \text{ preferred over } j) + \Pr(j \text{ preferred over } i) = \frac{\pi_i}{\pi_i + \pi_j} + \frac{\pi_j}{\pi_i + \pi_j} = 1.$$

Hence $\Pr(i \text{ is equal to } j) = 0$, so no ties are allowed. For some applications this is a reasonable assumption. When modelling football matches it is not, because a match can end in a draw.
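The standard Bradley-Terry probability can be sketched in a few lines of Python. The function name and the strength values are hypothetical and chosen only for illustration.

```python
def bt_probability(pi_i, pi_j):
    """Probability that i is preferred over j in the standard Bradley-Terry model."""
    return pi_i / (pi_i + pi_j)

# Hypothetical strengths: i is twice as strong as j.
p_ij = bt_probability(2.0, 1.0)
p_ji = bt_probability(1.0, 2.0)
```

The two probabilities always sum to one, which is exactly why the standard model leaves no room for a draw.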

2.2.2 Allowing ties in the Bradley-Terry model

To allow ties in the Bradley-Terry model we use the modification made by Davidson. It rests on the following condition: the probability of a tie is proportional to the geometric mean of the probabilities of preference for the teams being compared. This modification yields the following model:

$$\Pr(\text{a tie between } i \text{ and } j) = \frac{\nu\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + \nu\sqrt{\pi_i \pi_j}} = p(0|i, j)$$

$$\Pr(i \text{ beats } j) = \frac{\pi_i}{\pi_i + \pi_j + \nu\sqrt{\pi_i \pi_j}} = p(i|i, j) \tag{2.1}$$

$$\Pr(j \text{ beats } i) = \frac{\pi_j}{\pi_i + \pi_j + \nu\sqrt{\pi_i \pi_j}} = p(j|i, j)$$

From this we can derive the following density function:

$$f(w_{ij}, w_{ji}, t_{ij}; \pi_i, \pi_j, \nu) = \frac{\pi_i^{w_{ij}} \cdot \pi_j^{w_{ji}} \cdot \left(\nu\sqrt{\pi_i \pi_j}\right)^{t_{ij}}}{\left(\pi_i + \pi_j + \nu\sqrt{\pi_i \pi_j}\right)^{w_{ij} + w_{ji} + t_{ij}}},$$

where $w_{ij}$ is the number of times $i$ beats $j$ and $t_{ij}$ the number of ties between $i$ and $j$. To estimate the $\pi_i$'s we use the method of maximum likelihood.
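The three outcome probabilities of the model with ties can be sketched as follows. The function name `davidson_probabilities` and the parameter values are hypothetical; with equal strengths and a draw parameter of 1 all three outcomes are equally likely.

```python
import math

def davidson_probabilities(pi_i, pi_j, nu):
    """Return (Pr(i beats j), Pr(tie), Pr(j beats i)) for the model with ties."""
    tie_weight = nu * math.sqrt(pi_i * pi_j)
    total = pi_i + pi_j + tie_weight
    return pi_i / total, tie_weight / total, pi_j / total

# Hypothetical values: two equally strong teams and nu = 1.
win_i, tie, win_j = davidson_probabilities(0.5, 0.5, 1.0)
```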

Multiplying the density function over all pairs $(i, j)$ with $i < j$ and taking the logarithm, we end up with the following log-likelihood:

$$\ell(\pi, \nu) = \frac{1}{2} \sum_{i=1}^{t} s_i \log \pi_i + T \log \nu - \sum\sum_{i<j} n_{ij} \log\left(\pi_i + \pi_j + \nu\sqrt{\pi_i \pi_j}\right)$$

Here $s_i = 2w_i + t_i$ is two times the total number of wins of $i$ plus the total number of ties of $i$, $n_{ij}$ is the number of games played between $i$ and $j$, and $T$ is the total number of ties over all matches. Since $\ell(a\pi, \nu) = \ell(\pi, \nu)$ for any $a > 0$, the log-likelihood is scale invariant [7]. It is therefore convenient to maximize the likelihood under the restriction $\sum_{k=1}^{t} \pi_k = 1$.

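To maximize the likelihood over $\pi$ and $\nu$ we have to solve the following system of equations, obtained by setting the partial derivatives of $\ell$ to zero (with $n_{ii} = 0$):

$$\frac{s_i}{\hat{\pi}_i} - \sum_{j=1}^{t} \frac{n_{ij}\left(2 + \hat{\nu}\sqrt{\hat{\pi}_j / \hat{\pi}_i}\right)}{\hat{\pi}_i + \hat{\pi}_j + \hat{\nu}\sqrt{\hat{\pi}_i \hat{\pi}_j}} = 0, \quad i = 1, \dots, t$$

$$\frac{T}{\hat{\nu}} - \sum\sum_{i<j} \frac{n_{ij}\sqrt{\hat{\pi}_i \hat{\pi}_j}}{\hat{\pi}_i + \hat{\pi}_j + \hat{\nu}\sqrt{\hat{\pi}_i \hat{\pi}_j}} = 0$$

When $t > 2$ the solution $(\hat{\pi}, \hat{\nu})$ has no closed form and has to be found using an iterative procedure. This iterative procedure converges under the following assumption, as stated in Davidson [3].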

Assumption 1. In every possible partition of the objects into two non-empty subsets, some object in the second set has been preferred at least once to some object in the first set.

Algorithm 1. Start with the initial values $\hat{\pi}_i^{(0)} = \frac{1}{t}$, $i = 1, \dots, t$, and $\hat{\nu}^{(0)} = \frac{2T}{N - T}$, where $N$ is the total number of matches played.

Then for $M = 1, 2, \dots$ until convergence do the following two steps:

1. For $i = 1, 2, \dots, t$ compute

$$g_i^{(M)} = \sum_{j=1}^{i-1} \frac{n_{ij}\left(2 + \hat{\nu}^{(M-1)}\sqrt{\hat{\pi}_j^{(M)} / \hat{\pi}_i^{(M-1)}}\right)}{\hat{\pi}_i^{(M-1)} + \hat{\pi}_j^{(M)} + \hat{\nu}^{(M-1)}\sqrt{\hat{\pi}_i^{(M-1)} \hat{\pi}_j^{(M)}}} + \sum_{j=i}^{t} \frac{n_{ij}\left(2 + \hat{\nu}^{(M-1)}\sqrt{\hat{\pi}_j^{(M-1)} / \hat{\pi}_i^{(M-1)}}\right)}{\hat{\pi}_i^{(M-1)} + \hat{\pi}_j^{(M-1)} + \hat{\nu}^{(M-1)}\sqrt{\hat{\pi}_i^{(M-1)} \hat{\pi}_j^{(M-1)}}}$$

$$\hat{\pi}_i^{(M)} = \frac{s_i}{g_i^{(M)}}$$

Then normalize such that $\sum_k \hat{\pi}_k^{(M)} = 1$.

2. Compute

$$h^{(M)} = \sum\sum_{i<j} \frac{n_{ij}\sqrt{\hat{\pi}_i^{(M)} \hat{\pi}_j^{(M)}}}{\hat{\pi}_i^{(M)} + \hat{\pi}_j^{(M)} + \hat{\nu}^{(M-1)}\sqrt{\hat{\pi}_i^{(M)} \hat{\pi}_j^{(M)}}}$$

$$\hat{\nu}^{(M)} = \frac{T}{h^{(M)}}$$
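The two steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the code used for the thesis: the function name `fit_davidson`, the stopping rule and the three-team data are assumptions. The data satisfy Assumption 1 because every team has beaten every other team at least once.

```python
import math

def fit_davidson(wins, ties, tol=1e-12, max_iter=10000):
    """Iterative maximum likelihood for the Bradley-Terry model with ties.

    wins[i][j] is the number of times i beat j; ties[i][j] == ties[j][i]
    is the number of ties between i and j. Diagonals are zero.
    """
    t = len(wins)
    n = [[wins[i][j] + wins[j][i] + ties[i][j] for j in range(t)] for i in range(t)]
    s = [2 * sum(wins[i]) + sum(ties[i]) for i in range(t)]
    pairs = [(i, j) for i in range(t) for j in range(i + 1, t)]
    N = sum(n[i][j] for i, j in pairs)
    T = sum(ties[i][j] for i, j in pairs)

    pi = [1.0 / t] * t
    nu = 2.0 * T / (N - T)
    for _ in range(max_iter):
        old_pi, old_nu = pi[:], nu
        # Step 1: update each pi_i in turn, reusing the already updated pi_j for j < i.
        for i in range(t):
            g = 0.0
            for j in range(t):
                if j != i:
                    root = math.sqrt(pi[i] * pi[j])
                    g += (n[i][j] * (2 + nu * math.sqrt(pi[j] / pi[i]))
                          / (pi[i] + pi[j] + nu * root))
            pi[i] = s[i] / g
        total = sum(pi)
        pi = [p / total for p in pi]
        # Step 2: update the draw parameter nu.
        h = sum(n[i][j] * math.sqrt(pi[i] * pi[j])
                / (pi[i] + pi[j] + nu * math.sqrt(pi[i] * pi[j])) for i, j in pairs)
        nu = T / h
        change = abs(nu - old_nu) + sum(abs(a - b) for a, b in zip(pi, old_pi))
        if change < tol:
            break
    return pi, nu

# Hypothetical results for three teams.
wins = [[0, 3, 4], [1, 0, 2], [1, 2, 0]]
ties = [[0, 1, 1], [1, 0, 2], [1, 2, 0]]
pi_hat, nu_hat = fit_davidson(wins, ties)
```

At the solution the likelihood equations imply that the fitted expected score of each team, counting two per win and one per tie, equals its observed score $s_i$, which gives a convenient check on the output.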

2.3 Hypothesis Testing

2.3.1 Single parameter hypothesis in GLMs

The maximum likelihood estimates $\hat{\beta}_{MLE}$ converge in distribution to a normal distribution:

$$\hat{\beta}_{MLE} \xrightarrow{d} N(\beta, I^{-1}),$$

where $\beta$ is the true value and $I^{-1}$ is the inverse of the information matrix, which is also estimated when using the function glm or glmer. Now consider the null hypothesis that $\beta_j$ is zero versus the alternative hypothesis that $\beta_j$ is not zero. This is equivalent to testing that the covariate $x_j$ has no effect on the response variable versus $x_j$ having an effect on the response variable.

$$H_0: \beta_j = 0 \qquad H_a: \beta_j \neq 0$$

Under the null hypothesis we have approximately that $\hat{\beta}_j \sim N(0, \sigma^2_{\hat{\beta}_j})$, thus $\hat{\beta}_j / \sigma_{\hat{\beta}_j} \sim N(0, 1)$, where $\sigma^2_{\hat{\beta}_j}$ is the $j$-th element on the diagonal of $I^{-1}$. Following the standard procedure we can calculate a p-value for the null hypothesis:

$$p_c = \Pr\left(|Z| \geq \frac{|\hat{\beta}_j|}{\sigma_{\hat{\beta}_j}}\right), \quad Z \sim N(0, 1)$$

In R the $p_c$'s for all estimated coefficients are given in the summary when using the function glm or glmer.
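The p-value of this test can be sketched in Python with the complementary error function, using $\Pr(|Z| \geq z) = \mathrm{erfc}(z/\sqrt{2})$. The function name and the numbers are hypothetical.

```python
import math

def wald_p_value(beta_hat, std_error):
    """Two-sided p-value Pr(|Z| >= |beta_hat| / std_error) for Z ~ N(0, 1)."""
    z = abs(beta_hat) / std_error
    return math.erfc(z / math.sqrt(2.0))

# Hypothetical estimate and standard error, giving z = 1.96.
p_value = wald_p_value(0.196, 0.1)
```

A ratio of 1.96 corresponds to a p-value of about 0.05, the significance level used in this thesis.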

2.3.2 Deviance

Suppose we have $n$ observations $y_1, \dots, y_n$; then we can estimate a maximum of $n$ parameters. A model with the maximum number of parameters is called a saturated model. The likelihood of the saturated model is at least as large as that of any other model for the same observations.

Now let $\hat{\beta}_{max}$ denote the parameter estimates for the saturated model and $\hat{\beta}$ the parameter estimates for our model of interest. Then $\ell(\hat{\beta}_{max}; y)$ is the log-likelihood of the saturated model and $\ell(\hat{\beta}; y)$ the log-likelihood of our model. The deviance is defined as follows:

$$D = 2\left[\ell(\hat{\beta}_{max}; y) - \ell(\hat{\beta}; y)\right]$$

The deviance converges in distribution to a chi-squared distribution:

$$D \xrightarrow{d} \chi^2(m - p, r),$$

where $m$ is the number of parameters of the saturated model, often the number of observations $n$, $p$ is the number of parameters of our model of interest, and $r$ is the non-centrality parameter, which is approximately zero and is therefore often omitted. In fact the deviance equals $2 \log \Lambda$, where $\Lambda$ is the likelihood ratio statistic [8]. The deviance is also called the "Goodness of Fit".
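For a Poisson model the definition of the deviance can be evaluated directly, because the saturated model simply sets each fitted mean equal to the observation. The following Python sketch does this; the function names and the data are hypothetical.

```python
import math

def poisson_loglik(ys, mus):
    """Poisson log-likelihood of counts ys under fitted means mus (all mus > 0)."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y, mu in zip(ys, mus))

def poisson_deviance(ys, mus):
    """D = 2 * (log-likelihood of the saturated model - log-likelihood of the model)."""
    # Saturated model: mu_i = y_i, with the convention y * log(y) = 0 for y = 0.
    saturated = sum((y * math.log(y) if y > 0 else 0.0) - y - math.lgamma(y + 1)
                    for y in ys)
    return 2.0 * (saturated - poisson_loglik(ys, mus))

# Hypothetical counts and fitted means.
counts = [1, 3, 4, 8]
fitted = [2.0, 2.0, 6.0, 6.0]
deviance = poisson_deviance(counts, fitted)
```

Working out the difference term by term gives the familiar closed form $D = 2\sum_i [y_i \log(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i)]$.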


2.3.3 Hypothesis testing for nested GLMs

Suppose we have two nested models, the model $M_0$ and a more general model $M_a$. Nested means that they have the same probability distribution and the same link function, but the linear predictor of $M_0$ is a special case of that of $M_a$. We compare these models by comparing their deviances. We define the null hypothesis and alternative hypothesis as follows:

$$H_0: \beta = \beta_0 = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_q \end{pmatrix} \qquad H_a: \beta = \beta_a = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}$$

Here $H_0$ corresponds to the model $M_0$ and $H_a$ to $M_a$, such that $q < p < n$, with $n$ the number of observations. We can then test $H_0$ versus $H_a$ using the difference of the deviance statistics:

$$D_0 - D_a = 2\left[\ell(\hat{\beta}_{max}; y) - \ell(\hat{\beta}_0; y)\right] - 2\left[\ell(\hat{\beta}_{max}; y) - \ell(\hat{\beta}_a; y)\right] = 2\left[\ell(\hat{\beta}_a; y) - \ell(\hat{\beta}_0; y)\right] \xrightarrow{d} \chi^2(p - q)$$

The corresponding p-value is then:

$$p_c = \Pr(X \geq D_0 - D_a), \quad X \sim \chi^2(p - q)$$

In R we use the function anova to compare nested models using the difference of deviance statistic.
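The p-value of the deviance difference needs the upper tail of a chi-squared distribution. The following Python sketch computes it from the power series of the regularized incomplete gamma function. The function name `chi2_sf` and the deviances are hypothetical, and the series is only suitable for moderate values of the statistic.

```python
import math

def chi2_sf(x, df):
    """Pr(X >= x) for X ~ chi-squared(df), via the lower incomplete gamma series.

    Intended for moderate x; very large x would overflow the series.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-17 * total:
        term *= z / (a + k)   # term_k = z^k / (a (a+1) ... (a+k))
        total += term
        k += 1
    lower = math.exp(a * math.log(z) - z - math.lgamma(a)) * total
    return max(0.0, 1.0 - lower)

# Hypothetical deviances D_0 = 120 and D_a = 112 with p - q = 2 extra parameters.
p_value = chi2_sf(120.0 - 112.0, 2)
```

For two degrees of freedom the tail probability reduces to $e^{-x/2}$, so the example yields $e^{-4} \approx 0.018$ and the smaller model would be rejected at the 0.05 level.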

2.3.4 Test of equal preferences in the modified BT model

Define the following hypotheses:

$$H_0: \pi_k = \frac{1}{t}, \quad k = 1, \dots, t, \quad \nu \text{ unspecified}$$

$$H_a: \pi_k \neq \frac{1}{t}, \quad \text{for some } k \in \{1, \dots, t\}$$

Under $H_0$ the likelihood is maximized by $\hat{\nu}_0 = \frac{2T}{N - T}$. Using the likelihood ratio statistic $\Lambda$ [8], we obtain the following statistic:

$$S = -2 \log \Lambda = 2\left[\ell(\hat{\pi}, \hat{\nu}) - (N - T)\log(N - T) - T\log(2T) + N\log(2N)\right] \xrightarrow{d} \chi^2(t - 1)$$

The p-value is then:

$$p_c = \Pr(X \geq S), \quad X \sim \chi^2(t - 1)$$
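The constant terms in $S$ are the maximized log-likelihood under $H_0$. The Python sketch below checks this numerically by evaluating the log-likelihood at $\pi_k = 1/t$ and $\nu_0 = 2T/(N - T)$, where it simplifies to $T \log \nu_0 - N \log(2 + \nu_0)$. The function names are hypothetical. The values $N = 2646$ and $T = 626$ follow from the "All Games" row of Table 4.1, where the 1252 draws are counted once per team.

```python
import math

def null_loglik_direct(N, T):
    """Log-likelihood at pi_k = 1/t and nu_0 = 2T / (N - T); it does not depend on t."""
    nu0 = 2.0 * T / (N - T)
    return T * math.log(nu0) - N * math.log(2.0 + nu0)

def null_loglik_closed(N, T):
    """Closed form appearing in the statistic S."""
    return (N - T) * math.log(N - T) + T * math.log(2 * T) - N * math.log(2 * N)

def equal_preference_statistic(loglik_hat, N, T):
    """S = 2 * (maximized log-likelihood - maximized log-likelihood under H_0)."""
    return 2.0 * (loglik_hat - null_loglik_closed(N, T))

N, T = 2646, 626
nu0 = 2.0 * T / (N - T)
```

For these counts the draw parameter under equal preferences is about 0.62.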

2.3.5 Appropriateness of the Bradley-Terry model

Instead of assuming the Bradley-Terry model we can start from the multinomial distribution. This requires only that the probabilities $q(i|i, j)$, $q(j|i, j)$ and $q(0|i, j)$ sum to one for each pair $(i, j)$; otherwise they are unspecified. The maximum likelihood estimates of these probabilities are given by their relative frequencies. Thus $\hat{q}(i|i, j) = w_{ij}/n_{ij}$, $\hat{q}(j|i, j) = w_{ji}/n_{ij}$ and $\hat{q}(0|i, j) = t_{ij}/n_{ij}$. To check how appropriate model (2.1) is, we can state the following hypotheses:

$$H_0: q(k|i, j) = p(k|i, j), \quad \text{for } k = 0, i, j \text{ and all } (i, j)$$

$$H_a: q(k|i, j) \neq p(k|i, j), \quad \text{for some } k \in \{0, i, j\} \text{ and some } (i, j)$$

Using the likelihood ratio statistic, we test the appropriateness of the model with the statistic:

$$U = -2 \log \Lambda = 2 \sum\sum_{i<j} \left[ w_{ij} \log\frac{w_{ij}}{n_{ij}\,\hat{p}(i|i, j)} + w_{ji} \log\frac{w_{ji}}{n_{ij}\,\hat{p}(j|i, j)} + t_{ij} \log\frac{t_{ij}}{n_{ij}\,\hat{p}(0|i, j)} \right] \xrightarrow{d} \chi^2(t^2 - 2t),$$

where $\hat{p}(k|i, j)$ are the maximum likelihood estimates under model (2.1) for $k = 0, i, j$. The p-value is thus:

$$p_c = \Pr(X \geq U), \quad X \sim \chi^2(t^2 - 2t)$$

All hypotheses tested in this thesis are tested at a significance level of 0.05.

2.4 Criteria for Model Selection

2.4.1 Goodness of fit

The deviance or goodness of fit, discussed in section 2.3.2, can be used as a criterion to select the best model. The lower the deviance of a model, the better it fits. The deviance can be extracted from models in R when the functions glm or glmer are used. Besides the deviance for GLMs, the test for appropriateness of the Bradley-Terry model is also called the goodness of fit.

Thus we can use the statistic $U$ to check the goodness of fit of a Bradley-Terry model. For this statistic it also holds that the lower the statistic $U$, the better the model fits.

2.4.2 Akaike Information Criterion

The Akaike Information Criterion (AIC) is a widely known criterion for selecting models. The AIC is defined as follows:

$$AIC = 2p - 2\ell,$$

where $p$ is the number of parameters estimated in the model and $\ell$ is the log-likelihood of the model. The AIC values are also given when using the standard functions glm or glmer in R.

2.5 Poisson distribution

The Poisson distribution is a well-known discrete distribution. It can be used in many cases, such as modelling the number of accidents on a highway, the number of customers that come into a shop or the number of goals scored in a football match.

2.5.1 Properties

The probability mass function of the Poisson distribution has the following form:

$$f(x; \lambda) = \frac{\lambda^x}{x!} e^{-\lambda}, \quad x = 0, 1, 2, \dots$$

It can be derived from a Binomial($n$, $\theta$) distribution by taking the limit $n \to \infty$ and $\theta \to 0$ while holding $n\theta$ constant at $\lambda$.

Furthermore, when $X \sim Pois(\lambda)$ we have:

$$E[X] = \lambda, \quad Var[X] = \lambda, \quad Mode[X] = \lfloor \lambda \rfloor.$$

Also, the Poisson distribution is a member of the exponential family, so it can be used for a GLM. In that case the natural link function, the natural logarithm, $g(E[X]) = \log E[X] = \log \lambda$, is often chosen as the link function.
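The properties above can be checked numerically. The following Python sketch evaluates the Poisson probabilities and compares them with a binomial distribution with large $n$ and small $\theta$. The function names are hypothetical; the rate 1.32 is the mean number of goals per team per match over all games in Table 4.1.

```python
import math

def poisson_pmf(x, lam):
    """Pr(X = x) for X ~ Pois(lam), computed on the log scale."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

def binomial_pmf(x, n, theta):
    """Pr(X = x) for X ~ Binomial(n, theta)."""
    return math.comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)

lam = 1.32  # mean goals per team per match over all games (Table 4.1)
poisson_probs = [poisson_pmf(x, lam) for x in range(11)]
# Binomial with n * theta = lam and large n approximates the Poisson distribution.
binomial_probs = [binomial_pmf(x, 100000, lam / 100000) for x in range(11)]
mode = max(range(11), key=lambda x: poisson_probs[x])
```

With this rate the most likely number of goals is 1, in line with $\lfloor 1.32 \rfloor = 1$.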

2.5.2 Skellam distribution

If we have two independent Poisson distributed random variables, $X_1 \sim Pois(\lambda_1)$ and $X_2 \sim Pois(\lambda_2)$, then their difference $Y = X_1 - X_2$ follows a Skellam distribution [11], $Y \sim Skellam(\lambda_1, \lambda_2)$, which has the following probability mass function:

$$f(y; \lambda_1, \lambda_2) = e^{-\lambda_1 - \lambda_2} \left(\frac{\lambda_1}{\lambda_2}\right)^{y/2} I_{|y|}\left(2\sqrt{\lambda_1 \lambda_2}\right), \quad y \in \mathbb{Z},$$

where $I_{|y|}(2\sqrt{\lambda_1 \lambda_2})$ is the modified Bessel function of the first kind of order $|y|$. Under some forms of dependence between $X_1$ and $X_2$ the difference can also follow a Skellam distribution. Furthermore, the expectation and variance of the Skellam distribution are:

$$E[Y] = \lambda_1 - \lambda_2, \quad Var[Y] = \lambda_1 + \lambda_2.$$

With the Skellam distribution we can calculate the probabilities that $X_1 > X_2$, $X_1 = X_2$ or $X_1 < X_2$, which are as follows:

$$\Pr(X_1 > X_2) = \Pr(Y > 0) = \sum_{k=1}^{\infty} e^{-\lambda_1 - \lambda_2} \left(\frac{\lambda_1}{\lambda_2}\right)^{k/2} I_k\left(2\sqrt{\lambda_1 \lambda_2}\right),$$

$$\Pr(X_1 = X_2) = \Pr(Y = 0) = e^{-\lambda_1 - \lambda_2} I_0\left(2\sqrt{\lambda_1 \lambda_2}\right), \tag{2.2}$$

$$\Pr(X_1 < X_2) = \Pr(Y < 0) = \sum_{k=-\infty}^{-1} e^{-\lambda_1 - \lambda_2} \left(\frac{\lambda_1}{\lambda_2}\right)^{k/2} I_{|k|}\left(2\sqrt{\lambda_1 \lambda_2}\right).$$
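The win, draw and loss probabilities in (2.2) can be sketched in Python by evaluating the modified Bessel function through its power series and truncating the infinite sums. The function names, the truncation at a goal difference of 60 and the scoring rates are assumptions made for illustration.

```python
import math

def bessel_i(order, x, terms=200):
    """Modified Bessel function of the first kind I_order(x), via its power series."""
    term = (x / 2.0) ** order / math.factorial(order)
    total = term
    for m in range(1, terms):
        term *= (x / 2.0) ** 2 / (m * (m + order))
        total += term
    return total

def skellam_pmf(y, lam1, lam2):
    """Pr(X1 - X2 = y) for independent X1 ~ Pois(lam1) and X2 ~ Pois(lam2)."""
    return (math.exp(-lam1 - lam2) * (lam1 / lam2) ** (y / 2.0)
            * bessel_i(abs(y), 2.0 * math.sqrt(lam1 * lam2)))

def outcome_probabilities(lam1, lam2, max_diff=60):
    """Return (Pr(X1 > X2), Pr(X1 = X2), Pr(X1 < X2)), truncating the sums at max_diff."""
    win = sum(skellam_pmf(k, lam1, lam2) for k in range(1, max_diff + 1))
    draw = skellam_pmf(0, lam1, lam2)
    loss = sum(skellam_pmf(-k, lam1, lam2) for k in range(1, max_diff + 1))
    return win, draw, loss

# Hypothetical scoring rates for the two teams.
win, draw, loss = outcome_probabilities(1.5, 1.0)
```

The draw probability can be cross-checked against the direct sum $\sum_k \Pr(X_1 = k)\Pr(X_2 = k)$, which must give the same value.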


2.5.3 Mixed Poisson distribution

Now suppose we have two random variables $X_1$ and $X_2$ such that:

$$X_1 \sim Pois(Y\lambda_1), \quad X_2 \sim Pois(Y\lambda_2), \quad Y \sim \log N(\mu, \sigma^2) \text{ (log-normal distribution)}$$

Then $X_1$ and $X_2$ are no longer independent. This yields the following joint probability:

$$\Pr(X_1 = x_1, X_2 = x_2) = \int_0^\infty \Pr(X_1 = x_1, X_2 = x_2 | Y = y) f_Y(y)\,dy$$

$$= \int_0^\infty \frac{(\lambda_1 y)^{x_1}}{x_1!} \frac{(\lambda_2 y)^{x_2}}{x_2!} e^{-\lambda_1 y - \lambda_2 y} f_Y(y)\,dy$$

$$= \frac{\lambda_1^{x_1} \lambda_2^{x_2}}{x_1!\,x_2!} \int_0^\infty \frac{\exp\left(-y(\lambda_1 + \lambda_2) - (\log y - \mu)^2 / 2\sigma^2\right) y^{x_1 + x_2 - 1}}{\sigma\sqrt{2\pi}}\,dy$$

Here we used that $X_1$ and $X_2$ are independent conditional on $Y$. Therefore we also have that:

$$X_1 - X_2 \,|\, Y = y \sim Skellam(y\lambda_1, y\lambda_2)$$

With this we can calculate the probability that $X_1 - X_2 = n$ for any $n \in \mathbb{Z}$:

$$\Pr(X_1 - X_2 = n) = \int_0^\infty \Pr(X_1 - X_2 = n | Y = y) f_Y(y)\,dy \tag{2.3}$$

$$= \int_0^\infty \left(\frac{\lambda_1}{\lambda_2}\right)^{n/2} I_{|n|}\left(2y\sqrt{\lambda_1 \lambda_2}\right) \frac{\exp\left(-y(\lambda_1 + \lambda_2) - (\log y - \mu)^2 / 2\sigma^2\right)}{y\sigma\sqrt{2\pi}}\,dy$$

With these probabilities we can calculate $\Pr(X_1 > X_2)$, $\Pr(X_1 = X_2)$ and $\Pr(X_1 < X_2)$ in the same way as in equation (2.2). The integral does not have a closed form and can therefore only be approximated numerically. Because of this, the probabilities computed in R for different $\lambda_1$, $\lambda_2$, $\mu$ and $\sigma^2$ have an error of approximately 3e-4.

Besides this, we can calculate the expectation and variance of $X_1$ and $X_2$ and the covariance between the two.

$$E[X_i] = E[E[X_i|Y]] = E[Y\lambda_i] = \lambda_i E[Y] = \lambda_i \exp(\mu + \sigma^2/2) \tag{2.4}$$

$$Var[X_i] = Var[E[X_i|Y]] + E[Var[X_i|Y]]$$
$$= Var[Y\lambda_i] + E[Y\lambda_i] = \lambda_i^2 Var[Y] + E[X_i]$$
$$= \lambda_i^2 (\exp(\sigma^2) - 1)\exp(2\mu + \sigma^2) + E[X_i]$$
$$= E[X_i]\,\lambda_i (\exp(\sigma^2) - 1)\exp(\mu + \sigma^2/2) + E[X_i]$$
$$= E[X_i]\left(\lambda_i (\exp(\sigma^2) - 1)\exp(\mu + \sigma^2/2) + 1\right) \tag{2.5}$$

$$Cov(X_1, X_2) = Cov(E[X_1|Y], E[X_2|Y]) + E[Cov(X_1, X_2|Y)]$$
$$= Cov(Y\lambda_1, Y\lambda_2)$$
$$= \lambda_1 \lambda_2 Var[Y]$$
$$= \lambda_1 \lambda_2 (\exp(\sigma^2) - 1)\exp(2\mu + \sigma^2) \tag{2.6}$$

$$Mode[X_i] = \lfloor \lambda_i \exp(\mu - \sigma^2) \rfloor \tag{2.7}$$
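Formulas (2.4) to (2.6) can be sketched in Python and checked against a numerical integration of the log-normal moments. The function names, the integration grid and the parameter values are assumptions chosen for illustration.

```python
import math

def mixed_poisson_moments(lam1, lam2, mu, sigma2):
    """Means, variances and covariance of X1, X2 following (2.4) to (2.6)."""
    mean_y = math.exp(mu + sigma2 / 2.0)
    var_y = (math.exp(sigma2) - 1.0) * math.exp(2.0 * mu + sigma2)
    mean1, mean2 = lam1 * mean_y, lam2 * mean_y
    var1 = lam1 ** 2 * var_y + mean1
    var2 = lam2 ** 2 * var_y + mean2
    cov = lam1 * lam2 * var_y
    return mean1, mean2, var1, var2, cov

def lognormal_moment(k, mu, sigma2, steps=20000, width=12.0):
    """E[Y^k] for log-normal Y, by the trapezoid rule over z ~ N(0, 1), Y = exp(mu + sigma z)."""
    sigma = math.sqrt(sigma2)
    h = 2.0 * width / steps
    total = 0.0
    for s in range(steps + 1):
        z = -width + s * h
        weight = 0.5 if s in (0, steps) else 1.0
        total += weight * math.exp(k * (mu + sigma * z) - z * z / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)

# Hypothetical parameter values.
lam1, lam2, mu, sigma2 = 1.5, 1.0, 0.0, 0.25
mean1, mean2, var1, var2, cov = mixed_poisson_moments(lam1, lam2, mu, sigma2)
```

Because the shared factor $Y$ adds variance, each $X_i$ is overdispersed relative to a Poisson variable with the same mean, and the covariance between the two scores is positive.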


Chapter 3 Structure of the FIFA World Cup Soccer 2014

There were 203 countries that tried to qualify for the World Cup of 2014.

Of those 203 teams, 31 made it to the final tournament. Together with Brazil, the host of the World Cup (WC), there are 32 teams that will compete against each other from June 12th till July 13th to become the new world champion for the next four years. This team will take the place of the reigning world champion, Spain.

3.1 Group Stage

The 32 teams that have qualified for the WC are divided into eight groups of four teams each.

In the group stage each team plays three games, one against every other team in the same group. So a total of 48 matches are played in the group stage.

Each team receives three points for a win, one point for a draw and zero points for a loss. In three games a team can thus collect any number of points from 0 to 9, except 8. Since there are six games in a group, each with 3 possible outcomes, there are $3^6 = 729$ possible combinations of results, which determine the final ranking in each group. For more detailed information about the group stage and the possible outcomes see Wiener [12].
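Both counts can be verified by brute-force enumeration. The following Python sketch lists the point totals a team can reach in three games and counts the result combinations of the six games in a group; the function names are hypothetical.

```python
from itertools import product

def possible_point_totals(games=3):
    """All point totals a team can reach with 3, 1 or 0 points per game."""
    return sorted({sum(results) for results in product((0, 1, 3), repeat=games)})

def count_group_outcomes(matches=6):
    """Number of result combinations when every match has three possible outcomes."""
    return sum(1 for _ in product(("win1", "draw", "win2"), repeat=matches))

totals = possible_point_totals()
outcomes = count_group_outcomes()
```

The enumeration confirms that 8 is the only total between 0 and 9 that cannot occur.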


Group A: Brazil, Croatia, Mexico, Cameroon
Group B: Spain, Netherlands, Chile, Australia
Group C: Colombia, Greece, Cote d'Ivoire, Japan
Group D: Uruguay, Costa Rica, England, Italy
Group E: Switzerland, Ecuador, France, Honduras
Group F: Argentina, Bosnia and Herzegovina, Iran, Nigeria
Group G: Germany, Portugal, Ghana, USA
Group H: Belgium, Algeria, Russia, Korea Republic

3.2 Knockout Stage

After the group stage the best two teams of each group will advance to the knockout stage.

The knockout bracket is as follows:

Round of 16:
E1: 1. Group A vs 2. Group B
E2: 1. Group C vs 2. Group D
E3: 1. Group E vs 2. Group F
E4: 1. Group G vs 2. Group H
E5: 1. Group B vs 2. Group A
E6: 1. Group D vs 2. Group C
E7: 1. Group F vs 2. Group E
E8: 1. Group H vs 2. Group G

Quarter-finals:
Q1: Winner E1 vs Winner E2
Q2: Winner E3 vs Winner E4
Q3: Winner E5 vs Winner E6
Q4: Winner E7 vs Winner E8

Semi-finals:
S1: Winner Q1 vs Winner Q2
S2: Winner Q3 vs Winner Q4

Final: Winner S1 vs Winner S2

In the knockout stage there has to be a winner, so a draw is not allowed.

When a game is tied after ninety minutes, extra time or a penalty shoot-out determines the winner. Since there has to be a winner, there are $2^8 \cdot 2^4 \cdot 2^2 \cdot 2 = 2^{15} = 32768$ different outcomes of the knockout stage once the teams that have qualified for the knockout stage are known.

Two teams can meet each other at most twice during the World Cup. The first match is then played in the group stage, while the second match has to be either the final or the match for third place.

Chapter 4 Analysis

4.1 Data

The data consist of 2646 international football matches, comprising at least the 100 most recently played games, as of March 14, 2014, of each participant of the World Cup Soccer. The data have been downloaded from the FIFA [5] website. All considered matches can be found in appendix C. The raw data have seven columns:

• Team1: The team that played virtually at home.

• Score1: The score of the home team.

• Score2: The score of the away team.

• Team2: The team that played virtually away.

• Date: The date on which the game was played.

• Place: The location where the game was played.

• Type: The type of the game (Continental Final, Continental Qualifier, Friendly, FIFA Confederations Cup, FIFA World Cup Final, FIFA World Cup Qualifier).

For further analysis it is assumed that every friendly and qualifying match is played at home for Team1. For finals and the Confederations Cup it is assumed that neither team plays at home, except for the organising team.

In addition, the position of each team on the FIFA World Ranking is added. For games played in the years 2008 until 2013 the rank at the end of the year is added. Games played before 2008 are assigned the ranking of 2008, and games played in 2014 are assigned the rank on the FIFA World Ranking at the beginning of April 2014.

4.2 Exploratory Data Analysis

In table 4.1 we find the data of the participants of the FIFA World Cup 2014.

We can see that about 85% of the games that resulted in a win were won by a participant of the World Cup, whereas participants account for only 63% of the total games played. Thus, participants of the World Cup are highly likely to win their games.

Furthermore, we see that participants score 0.32 goals per match more than the mean of all teams. The range is quite large, however: the mean of Algeria, 1.18, is below the overall average, while that of Germany, 2.41, is almost two times the average. Also, participants concede on average 0.35 fewer goals per match than the average of all teams, again with a large range.

We can also look at the number of goals scored per team per game. Figures 4.1 and 4.2 show two histograms: the first of the number of goals scored per game by Algeria, the second of the same for Spain. These histograms suggest that the number of goals scored per game is Poisson distributed.

Figure 4.1: Histogram of number of goals scored by Algeria per match (x-axis: goals scored, 0 to 10; y-axis: density)

Figure 4.2: Histogram of number of goals scored by Spain per match (x-axis: goals scored, 0 to 10; y-axis: density)

Country PM W L D GS GA MGS MGA

Algeria 100 42 33 25 118 107 1.18 1.07

Argentina 104 60 20 24 190 102 1.83 0.98

Australia 101 51 26 24 160 100 1.58 0.99

Belgium 100 40 34 26 145 131 1.45 1.31

Bosnia and Herzegovina 100 42 39 19 158 136 1.58 1.36

Brazil 110 75 14 21 237 78 2.15 0.71

Cameroon 101 47 27 27 149 102 1.48 1.01

Chile 105 50 33 22 152 120 1.45 1.14

Colombia 100 48 29 23 135 92 1.35 0.92

Costa Rica 104 43 34 27 137 111 1.32 1.07

Cote d’Ivoire 103 57 19 27 200 98 1.94 0.95

Croatia 101 57 19 25 176 96 1.74 0.95

Ecuador 101 34 41 26 131 133 1.30 1.32

England 100 61 17 22 199 83 1.99 0.83

France 105 54 23 28 144 77 1.37 0.73

Germany 106 73 15 18 255 101 2.41 0.95

Ghana 101 52 27 22 149 90 1.48 0.89

Greece 101 51 26 24 127 95 1.26 0.94

Honduras 104 46 36 22 138 115 1.33 1.11

Iran 104 57 17 30 180 80 1.73 0.77

Italy 104 47 21 36 153 106 1.47 1.02

Japan 111 59 23 29 191 103 1.72 0.93

Korea Republic 107 52 26 29 156 103 1.46 0.96

Mexico 115 57 32 26 192 112 1.67 0.97

Netherlands 105 67 16 22 214 79 2.04 0.75

Nigeria 104 54 20 30 165 86 1.59 0.83

Portugal 104 57 19 28 195 91 1.88 0.88

Russia 100 52 21 27 160 87 1.60 0.87

Spain 111 89 8 14 247 69 2.23 0.62

Switzerland 100 47 23 30 149 95 1.49 0.95

Uruguay 102 47 25 30 178 116 1.75 1.14

USA 108 57 34 17 175 131 1.62 1.21

Total 3322 1725 797 800 5455 3225 1.64 0.97

All Games 5292 2020 2020 1252 6960 6960 1.32 1.32

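Table 4.1: Data of FIFA World Cup 2014 participants. PM: matches played, W: wins, L: losses, D: draws, GS: goals scored, GA: goals against, MGS: mean goals scored per match, MGA: mean goals against per match.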

4.3 Main Analysis

4.3.1 Bradley-Terry Model

We start predicting with the Bradley-Terry model as explained in section 2.2. We compare four different Bradley-Terry models. The first model was estimated using all teams in the data set, a total of 163 teams. The estimate for each country is calculated according to the iterative scheme:

$$\hat{\pi}_i^{(M)} = \frac{2(\text{Number of wins}) + \text{Number of ties}}{g_i^{(M)}}$$

Therefore some of the estimates are 0, because these countries have lost all their games in our data set. As a result the goodness of fit statistic of this model is infinite. Note that these estimates are found using a small modification of Algorithm 1, since the data do not satisfy Assumption 1.

In the second model we solved this problem by removing all matches against countries which have lost all their games.

For comparison we added two other models: model 3, in which we only considered the matches of countries that have won at least one game, and model 4, in which we only considered matches played between participants of the FIFA World Cup. The results for the draw parameter $\nu$, the AIC, the Goodness of Fit and the test of equal preferences for all four models can be found in table 4.2. All the estimated parameters for the models can be found in appendix A.1.

                   Model 1    Model 2    Model 3    Model 4
ν                  0.74233    0.74205    0.73816    0.72559
AIC                43231      41028      35865      7981
Goodness of Fit    ∞          2329       2153       652
S                  37210      35319      30754      6448
Pr(Chi > S)        < 2e-16    < 2e-16    < 2e-16    < 2e-16

Table 4.2: Results of Bradley-Terry estimates

In each of the four models we reject the hypothesis of equal preferences, as expected. Furthermore, both the AIC and the Goodness of Fit for model 4 are the lowest of all four models. This is not a very surprising result, because in model 4 we only needed to estimate 33 parameters, while in the other three models we had to estimate more than 100 parameters. Therefore we take model 4 as our best Bradley-Terry model.

Using this model we can calculate the probabilities of the outcome of the matches played in the group stage.

Group B   Team1         Team2         Team1Win   Draw     Team2Win
1         Chile         Spain         0.2332     0.2515   0.5153
2         Chile         Netherlands   0.3123     0.2640   0.4238
3         Chile         Australia     0.3809     0.2661   0.3530
4         Spain         Netherlands   0.4581     0.2605   0.2814
5         Spain         Australia     0.5293     0.2487   0.2220
6         Netherlands   Australia     0.4381     0.2627   0.2992

Table 4.3: Probabilities of match outcomes in group B using Bradley-Terry model 4

From these we can calculate the probability that a team will end up first, second, third or last in the group:

Group B       Prob.1st.place   Prob.2nd.place   Prob.3rd.place   Prob.4th.place
Chile         0.1714           0.2331           0.2793           0.3162
Spain         0.4239           0.2725           0.1856           0.1180
Netherlands   0.2494           0.2728           0.2555           0.2223
Australia     0.1553           0.2216           0.2797           0.3435

Table 4.4: Probabilities of a team ending 1st, 2nd, 3rd or 4th in group B using Bradley-Terry model 4
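The step from match probabilities to standing probabilities can be sketched in Python by enumerating all 729 result combinations of a group. The match probabilities are those of Table 4.3. How teams with equal points are separated is not stated here, so this sketch assumes that such ties are broken uniformly at random; because of that assumption it is not guaranteed to reproduce Table 4.4 exactly. The function name is hypothetical.

```python
from itertools import permutations, product

def standing_probabilities(teams, match_probs):
    """Probability of each final position per team.

    match_probs maps (team1, team2) to (Pr(team1 wins), Pr(draw), Pr(team2 wins)).
    Assumption: teams level on points are ordered uniformly at random.
    """
    matches = list(match_probs)
    result = {team: [0.0] * len(teams) for team in teams}
    for outcome in product(range(3), repeat=len(matches)):
        prob = 1.0
        points = dict.fromkeys(teams, 0)
        for (a, b), o in zip(matches, outcome):
            prob *= match_probs[(a, b)][o]
            if o == 0:
                points[a] += 3
            elif o == 1:
                points[a] += 1
                points[b] += 1
            else:
                points[b] += 3
        # All rankings that are consistent with the points share the probability equally.
        orders = [p for p in permutations(teams)
                  if all(points[p[k]] >= points[p[k + 1]] for k in range(len(p) - 1))]
        for order in orders:
            for place, team in enumerate(order):
                result[team][place] += prob / len(orders)
    return result

teams = ["Chile", "Spain", "Netherlands", "Australia"]
match_probs = {  # values from Table 4.3
    ("Chile", "Spain"): (0.2332, 0.2515, 0.5153),
    ("Chile", "Netherlands"): (0.3123, 0.2640, 0.4238),
    ("Chile", "Australia"): (0.3809, 0.2661, 0.3530),
    ("Spain", "Netherlands"): (0.4581, 0.2605, 0.2814),
    ("Spain", "Australia"): (0.5293, 0.2487, 0.2220),
    ("Netherlands", "Australia"): (0.4381, 0.2627, 0.2992),
}
standings = standing_probabilities(teams, match_probs)
```

Each team's four position probabilities sum to one up to the rounding of the table entries, and so do the four probabilities for each position.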

In the knockout stage, where there has to be a winner, we use our estimates from the modified Bradley-Terry model in the standard Bradley-Terry model:

$$\Pr(i \text{ beats } j) = \frac{\pi_i}{\pi_i + \pi_j}, \quad \Pr(j \text{ beats } i) = \frac{\pi_j}{\pi_i + \pi_j},$$

where the $\pi_k$'s are the estimates from the modified Bradley-Terry model (2.1).

By calculating, for each possible combination of matches in the knockout stage, the probability that a certain team advances to the next round, we can calculate the probability of each team winning the WC. These probabilities for model 4 can be found in table 4.5.
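For a fixed bracket this calculation can be organised round by round: the probability that a team survives a round is the probability that it reached the round, times the sum over its possible opponents of the probability that the opponent reached the round and loses. The Python sketch below does this with the standard Bradley-Terry probabilities. The function name and the four strengths are hypothetical, and the group stage is left out.

```python
def bracket_win_probabilities(strengths):
    """Probability of winning a knockout bracket for each team.

    strengths lists the Bradley-Terry strengths in bracket order; its length
    must be a power of two. Neighbouring entries meet in the first round.
    """
    n = len(strengths)
    reach = [1.0] * n   # probability that each team is still in the tournament
    size = 1            # number of teams in each half of the current matches
    while size < n:
        survive = [0.0] * n
        for i in range(n):
            opponent_block = (i // size) ^ 1   # neighbouring sub-bracket
            for j in range(opponent_block * size, (opponent_block + 1) * size):
                beat = strengths[i] / (strengths[i] + strengths[j])
                survive[i] += reach[i] * reach[j] * beat
        reach = survive
        size *= 2
    return reach

# Hypothetical strengths for a four-team bracket (two semi-finals and a final).
win_probs = bracket_win_probabilities([0.4, 0.3, 0.2, 0.1])
```

The resulting probabilities sum to one, and with equal strengths every team gets the same chance.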


Team Win Chance Team Win Chance

1 Brazil 14.36% 17 USA 1.61%

2 Germany 11.80% 18 Ecuador 1.54%

3 Spain 11.64% 19 Korea Republic 1.50%

4 Argentina 11.16% 20 Australia 1.32%

5 England 5.46% 21 Croatia 1.32%

6 France 4.54% 22 Russia 1.03%

7 Portugal 4.45% 23 Belgium 0.79%

8 Netherlands 3.95% 24 Honduras 0.62%

9 Colombia 3.86% 25 Nigeria 0.52%

10 Uruguay 3.83% 26 Costa Rica 0.28%

11 Iran 3.18% 27 Cote d'Ivoire 0.27%

12 Japan 2.60% 28 Ghana 0.22%

13 Italy 2.13% 29 Greece 0.22%

14 Switzerland 1.97% 30 Bosnia and Herzegovina 0.15%

15 Mexico 1.78% 31 Cameroon 0.12%

16 Chile 1.67% 32 Algeria 0.10%

Total 100%

Table 4.5: Chance of winning the WC for each team using Bradley-Terry model 4.

4.3.2 Poisson GLM

The prediction in the previous section is based only on wins, draws and losses. With a Generalized Linear Model we now predict the number of goals a certain team will score against another team. Figures 4.1 and 4.2 suggest that the number of goals scored by a team follows a Poisson distribution. For the GLM we use the natural link function, the logarithm. This settles two of the three components of a GLM; we only need to define the linear predictor, so we need to know on which factors the score depends. The first is the attacking quality of the attacking team. For each team we define the attacking quality as the mean number of goals scored per game in the data set. The second is the defending quality of the defending team. We define the defensive quality as the mean number of games played per conceded goal, multiplied by -1. Besides the defending and attacking qualities we check for dependence on whether a team plays home or away, on the type of the game (qualifying, final, friendly) and on the difference between the ranks of the teams on the FIFA World Ranking. These factors are considered in models 1 to 11.

In models 12 to 19 we take a different approach. We estimate the attacking quality of each team, $\alpha_i$, and the defending quality of the opponent, $\delta_j$. Finally, in models 16 to 19, we estimate a single strength per team by taking the attacking and defending qualities to be equal. In these models we also check for dependence on the type of game and on home advantage. Our GLM therefore has the following form, where $Y_{ij}$ is the number of goals team i scores against team j.

$$Y_{ij} \sim \mathrm{Pois}(\mu_{ij}), \qquad \eta_{ij} = \log(\mu_{ij})$$

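The 19 different linear predictors:

- model 1: $\eta_{ij} = \beta_0 + \beta_1\,\mathrm{att}_i + \beta_2\,\mathrm{def}_j$
- model 2: $\eta_{ij} = \beta_0 + \beta_1\,\mathrm{att}_i + \beta_2\,\mathrm{def}_j + \beta_3\,\mathrm{home}_i$
- model 3: $\eta_{ij} = \beta_0 + \beta_1\,\mathrm{att}_i + \beta_2\,\mathrm{def}_j + \beta_{\mathrm{type}}$
- model 4: $\eta_{ij} = \beta_0 + \beta_1\,\mathrm{att}_i + \beta_2\,\mathrm{def}_j + \beta_4\,(\mathrm{rank}_j - \mathrm{rank}_i)$
- model 5: $\eta_{ij} = \beta_0 + \beta_1\,\mathrm{att}_i + \beta_2\,\mathrm{def}_j + \beta_3\,\mathrm{home}_i + \beta_{\mathrm{type}}$
- model 6: $\eta_{ij} = \beta_0 + \beta_1\,\mathrm{att}_i + \beta_2\,\mathrm{def}_j + \beta_3\,\mathrm{home}_i + \beta_4\,(\mathrm{rank}_j - \mathrm{rank}_i)$
- model 7: $\eta_{ij} = \beta_1\,\mathrm{att}_i + \beta_2\,\mathrm{def}_j + \beta_3\,\mathrm{home}_i + \beta_4\,(\mathrm{rank}_j - \mathrm{rank}_i)$
- model 8: $\eta_{ij} = \beta_0 + \beta_1\,\mathrm{att}_i + \beta_2\,\mathrm{def}_j + \beta_{\mathrm{type}} + \beta_4\,(\mathrm{rank}_j - \mathrm{rank}_i)$
- model 9: $\eta_{ij} = \beta_0 + \beta_1\,\mathrm{att}_i + \beta_2\,\mathrm{def}_j + \beta_3\,\mathrm{home}_i + \beta_{\mathrm{type}} + \beta_4\,(\mathrm{rank}_j - \mathrm{rank}_i)$
- model 10: $\eta_{ij} = \beta_1\,\mathrm{att}_i + \beta_2\,\mathrm{def}_j + \beta_3\,\mathrm{home}_i + \beta_{\mathrm{type}} + \beta_4\,(\mathrm{rank}_j - \mathrm{rank}_i)$
- model 11: $\eta_{ij} = \beta_1\,\mathrm{att}_i + \beta_2\,\mathrm{def}_j + \beta_3\,\mathrm{home}_i + \beta_{\mathrm{type}} + \beta_4\,(\mathrm{rank}_j - \mathrm{rank}_i) + \beta_5\,\mathrm{att}_i\,\mathrm{def}_j$
- model 12: $\eta_{ij} = \alpha_i - \delta_j$
- model 13: $\eta_{ij} = \beta_0 + \alpha_i - \delta_j$
- model 14: $\eta_{ij} = \beta_0 + \alpha_i - \delta_j + \beta_3\,\mathrm{home}_i$
- model 15: $\eta_{ij} = \beta_0 + \alpha_i - \delta_j + \beta_3\,\mathrm{home}_i + \beta_{\mathrm{type}}$
- model 16: $\eta_{ij} = \alpha_i - \alpha_j$
- model 17: $\eta_{ij} = \beta_0 + \alpha_i - \alpha_j$
- model 18: $\eta_{ij} = \beta_0 + \alpha_i - \alpha_j + \beta_3\,\mathrm{home}_i$
- model 19: $\eta_{ij} = \beta_0 + \alpha_i - \alpha_j + \beta_3\,\mathrm{home}_i + \beta_{\mathrm{type}}$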

Fitting these models gave the deviance and AIC values in table 4.6.


Model  Deviance  AIC       Model  Deviance  AIC
1      6034.32   14807.02  12     5663.08   15079.78
2      5891.66   14666.36  13     5663.08   15079.78
3      6022.83   14805.54  14     5502.21   14920.91
4      5894.58   14669.29  15     5489.91   14918.61
5      5867.50   14652.21  16     6005.94   15096.65
6      5758.67   14535.38  17     5901.95   14994.66
7      5762.66   14537.37  18     5751.29   14845.99
8      5878.82   14663.53  19     5724.63   14829.34
9      5728.40   14515.11
10     5728.40   14515.11
11     5726.28   14514.99

Table 4.6: Deviance and AIC-values of the 19 different models
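For reference, the two criteria in table 4.6 can be computed from observed goals and fitted means as sketched below. This is a generic illustration of the Poisson deviance and of AIC = 2k - 2 log L, not the thesis code; the function names and the toy data are hypothetical.

```python
import math


def poisson_loglik(y, mu):
    """Poisson log-likelihood of counts y given fitted means mu."""
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(y, mu))


def poisson_deviance(y, mu):
    """Residual deviance: 2 * sum(y*log(y/mu) - (y - mu)), with 0*log(0) = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += term - (yi - mi)
    return 2.0 * total


def aic(loglik, n_params):
    """Akaike information criterion for a model with n_params parameters."""
    return 2.0 * n_params - 2.0 * loglik


# Toy data, for illustration only.
goals = [1, 2, 1, 3]
fitted = [0.9, 1.6, 1.2, 2.3]
print(poisson_deviance(goals, fitted), aic(poisson_loglik(goals, fitted), 2))
```

Because the AIC adds a penalty of 2 per parameter, an extra parameter lowers the AIC only if it lowers the deviance by more than 2.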

Based on the AIC values it seems wise to choose model 9, 10 or 11. The deviance and AIC of models 9 and 10 are equal, and those of model 11 are only slightly smaller, so we decided to do the test of deviances between models 9 and 11. Note that R parametrizes the factor for the type of game in such a way that models 9 and 10 have the same number of estimated parameters: dropping the intercept simply adds a coefficient for the reference type. Therefore the test of deviances gives exactly the same result for models 9 and 11 as for models 10 and 11.

Model  Resid. Df  Resid. Dev  Df  Deviance  Pr(>Chi)
9      5282       5728.40
11     5281       5726.28     1   2.13      0.1449
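The p-value of this test of deviances can be checked by hand, since for one degree of freedom the upper tail of the chi-square distribution reduces to the complementary error function. The sketch below uses the rounded deviance difference 2.13 from the output above, so it only reproduces the reported 0.1449 up to rounding; the function name is hypothetical.

```python
import math


def chi2_sf_1df(x):
    """P(X > x) for X chi-square distributed with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


deviance_drop = 2.13  # deviance of model 9 minus deviance of model 11 (rounded)
p_value = chi2_sf_1df(deviance_drop)
print(round(p_value, 3))
```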

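Since the p-value of 0.1449 exceeds our chosen significance level of 0.05, the interaction term of model 11 does not improve the fit significantly, and it seems wise to choose the smaller model 9 or 10. To check whether there is any difference between those two models we compared their summaries, which are in tables 4.7 and 4.8.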


Model 9  Estimate  Std. Error  z value  Pr(>|z|)    95%-confint
β₀        0.0440   0.0772       0.57    5.6888e-01  (−0.11, 0.20)
β₁        0.4210   0.0327      12.86    7.4040e-38  (0.36, 0.48)
β₂        0.5414   0.0613       8.84    9.7806e-19  (0.42, 0.66)
β₃        0.3135   0.0257      12.21    2.8938e-34  (0.26, 0.37)
β_CQ     -0.2163   0.0519      -4.17    3.0286e-05  (−0.32, −0.11)
β_F      -0.1326   0.0442      -3.00    2.7128e-03  (−0.21, −0.05)
β_FCC     0.2281   0.1028       2.22    2.6572e-02  (0.22, 0.43)
β_FWCF   -0.1200   0.0813      -1.48    1.4001e-01  (−0.28, 0.04)
β_FWCQ   -0.1655   0.0460      -3.60    3.2152e-04  (−0.26, −0.07)
β₄        0.0043   0.0004      11.84    2.3742e-32  (0.0036, 0.0050)

Table 4.7: Summary of GLM 9

Model 10  Estimate  Std. Error  z value  Pr(>|z|)    95%-confint
β₁         0.4210   0.0327      12.86    7.4040e-38  (0.36, 0.48)
β₂         0.5414   0.0613       8.84    9.7806e-19  (0.42, 0.66)
β₃         0.3135   0.0257      12.21    2.8938e-34  (0.26, 0.37)
β_CF       0.0440   0.0772       0.57    5.6888e-01  (−0.11, 0.20)
β_CQ      -0.1723   0.0719      -2.40    1.6594e-02  (−0.32, −0.03)
β_F       -0.0886   0.0716      -1.24    2.1565e-01  (−0.23, 0.05)
β_FCC      0.2721   0.1225       2.22    2.6390e-02  (−0.03, 0.51)
β_FWCF    -0.0760   0.1055      -0.72    4.7132e-01  (−0.28, 0.13)
β_FWCQ    -0.1215   0.0679      -1.79    7.3688e-02  (−0.25, 0.01)
β₄         0.0043   0.0004      11.84    2.3742e-32  (0.0036, 0.0050)

Table 4.8: Summary of GLM 10

Testing the individual parameters, we set β₀ and β_FWCF to zero in model 9, since their p-values are greater than our significance level of 0.05. For the same reason we set β_CF, β_F, β_FWCF and β_FWCQ to zero in model 10. Since we are only interested in games of the type FWCF, both models reduce to the same model after setting those coefficients to zero. Our final GLM to predict the World Cup Soccer therefore has the following form:

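$$Y_{ij} \sim \mathrm{Pois}(\mu_{ij})$$

$$\begin{aligned}
\log \mu_{ij} &= \beta_0 + \beta_1 \cdot \mathrm{att}_i + \beta_2 \cdot \mathrm{def}_j + \beta_3 \cdot \mathrm{home} + \beta_{FWCF} + \beta_4 \cdot (\mathrm{rank}_j - \mathrm{rank}_i) \\
&= 0.4210 \cdot \mathrm{att}_i + 0.5414 \cdot \mathrm{def}_j + 0.3135 \cdot \mathrm{home} + 0.0043 \cdot (\mathrm{rank}_j - \mathrm{rank}_i)
\end{aligned}$$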
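As an illustration of how this fitted model turns covariates into an expected number of goals, the sketch below evaluates the linear predictor and applies the inverse of the log link. The coefficients are the ones in the equation above; the function name and the covariate values (attack, defence and ranking numbers) are hypothetical and not taken from the data set.

```python
import math

# Coefficients of the final GLM as stated in the text.
B_ATT, B_DEF, B_HOME, B_RANK = 0.4210, 0.5414, 0.3135, 0.0043


def expected_goals(att_i, def_j, home_i, rank_i, rank_j):
    """Mean number of goals of team i against team j under the final GLM."""
    eta = (B_ATT * att_i + B_DEF * def_j + B_HOME * home_i
           + B_RANK * (rank_j - rank_i))
    return math.exp(eta)


# Hypothetical covariates: att = goals per game, def = -(games per conceded goal).
mu_home = expected_goals(att_i=1.8, def_j=-1.2, home_i=1, rank_i=5, rank_j=20)
mu_neutral = expected_goals(att_i=1.8, def_j=-1.2, home_i=0, rank_i=5, rank_j=20)
print(mu_home, mu_neutral, mu_home / mu_neutral)
```

The ratio of the two means equals exp(0.3135) regardless of the other covariates, which is the multiplicative home effect implied by the log link.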


With this model we can calculate, for each match and for both teams, the mean µ_ij: the mean number of goals scored by i against j in the long run. For now we assume that the goals scored by i and by j are independent. The difference of two independent Poisson variables follows a Skellam distribution, which gives us the probabilities that i wins, that j wins, or that the match is tied. This gives the following probabilities for group B.

   Team1        Team2        Team1Win  Draw    Team2Win
1  Chile        Spain        0.1575    0.2376  0.6049
2  Chile        Netherlands  0.2219    0.2621  0.5160
3  Chile        Australia    0.4396    0.2793  0.2811
4  Spain        Netherlands  0.4462    0.2822  0.2716
5  Spain        Australia    0.6845    0.2039  0.1116
6  Netherlands  Australia    0.6007    0.2361  0.1632

Table 4.9: Probabilities of match outcomes in group B using the GLM
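These match probabilities can be reproduced without a dedicated Skellam routine by summing the joint probabilities of two independent Poisson scores. The sketch below does this for Chile against Spain, using the means 0.7249 and 1.6878 reported for that match in table 4.12; the function names and the truncation at 25 goals are choices made for this illustration.

```python
import math


def poisson_pmf(k, mu):
    """Probability of exactly k goals when the mean is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def match_probs(mu_ij, mu_ji, max_goals=25):
    """Return (P(i wins), P(draw), P(j wins)) for independent Poisson scores."""
    win = draw = loss = 0.0
    for a in range(max_goals + 1):
        pa = poisson_pmf(a, mu_ij)
        for b in range(max_goals + 1):
            p = pa * poisson_pmf(b, mu_ji)
            if a > b:
                win += p
            elif a == b:
                draw += p
            else:
                loss += p
    return win, draw, loss


win, draw, loss = match_probs(0.7249, 1.6878)  # Chile vs Spain
print(round(win, 4), round(draw, 4), round(loss, 4))
```

The result agrees with the first row of table 4.9 to the precision shown there.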

Teams Prob.1st.place Prob.2nd.place Prob.3rd.place Prob.4th.place

Chile 0.1136 0.2197 0.3443 0.3225

Spain 0.5127 0.2889 0.1413 0.0571

Netherlands 0.3144 0.3462 0.2222 0.1172

Australia 0.0592 0.1452 0.2922 0.5033

Table 4.10: Probabilities of a team finishing 1st, 2nd, 3rd or 4th in group B using the GLM

For the knock-out stage we used the following probabilities for a win or a loss:

$$\Pr(i \text{ beats } j) = \frac{\Pr(Y_{ij} > Y_{ji})}{\Pr(Y_{ij} > Y_{ji}) + \Pr(Y_{ij} < Y_{ji})}, \qquad \Pr(j \text{ beats } i) = \frac{\Pr(Y_{ij} < Y_{ji})}{\Pr(Y_{ij} > Y_{ji}) + \Pr(Y_{ij} < Y_{ji})}$$
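In other words, the draw probability is removed and the remaining two probabilities are rescaled to sum to one. A minimal sketch of this step, applied to the Chile against Spain probabilities from table 4.9 (the function name is hypothetical; Chile and Spain did not necessarily meet in the knock-out stage, the pair only serves as a numerical example):

```python
def knockout_probs(win, loss):
    """Rescale win/loss probabilities so that a draw is not possible."""
    decided = win + loss
    return win / decided, loss / decided


# Group-stage probabilities for Chile vs Spain from table 4.9.
p_chile, p_spain = knockout_probs(0.1575, 0.6049)
print(round(p_chile, 4), round(p_spain, 4))
```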

Using this we obtain the following probabilities of winning the World Cup for each team.


Team Win Chance Team Win Chance

1 Brazil 24.49% 17 Mexico 1.24%

2 Spain 17.31% 18 USA 1.04%

3 Germany 10.75% 19 Greece 0.90%

4 Netherlands 6.80% 20 Belgium 0.84%

5 Portugal 5.37% 21 Nigeria 0.76%

6 England 4.99% 22 Ghana 0.72%

7 Argentina 4.11% 23 Chile 0.60%

8 France 2.53% 24 Bosnia and Herzegovina 0.58%

9 Switzerland 2.21% 25 Japan 0.51%

10 Russia 2.13% 26 Algeria 0.40%

11 Cote d’Ivoire 2.09% 27 Ecuador 0.33%

12 Uruguay 1.92% 28 Honduras 0.30%

13 Croatia 1.83% 29 Costa Rica 0.26%

14 Iran 1.59% 30 Korea Republic 0.19%

15 Colombia 1.58% 31 Cameroon 0.17%

16 Italy 1.36% 32 Australia 0.11%

Total 100%

Table 4.11: Chance of winning the WC for each team using the GLM.

With the GLM, Brazil is again the favourite to win the WC, but its win probability is much higher (24.49% against 14.36%). The reason is that this model takes the home advantage of Brazil into account. Germany and Spain complete the top three, just as in the Bradley-Terry model, but the chances of Spain have increased a lot, while those of Germany are almost unchanged.

Besides using the Skellam distribution to obtain these probabilities, we can also predict the score of each match. We used the mode of $Y_{ij}$ to predict each score.

Team1 Goals1 Goals2 Team2 µ₁₂ µ₂₁ Chance

1 Chile 0 1 Spain 0.7249 1.6878 15.1%

2 Chile 0 1 Netherlands 0.8805 1.4943 13.9%

3 Chile 1 0 Australia 1.3090 0.9869 13.2%

4 Spain 1 0 Netherlands 1.2975 0.9457 13.8%

5 Spain 1 0 Australia 1.9288 0.6246 15.0%

6 Netherlands 1 0 Australia 1.7076 0.7586 14.5%

Table 4.12: Most likely results of the matches in group B using the GLM.
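The entries of table 4.12 can be checked with a short sketch. For a non-integer mean the mode of a Poisson distribution is the mean rounded down, and under the independence assumption the probability of the predicted score is the product of the two individual probabilities. The function names are hypothetical; the means are those of the first row of the table.

```python
import math


def poisson_pmf(k, mu):
    """Probability of exactly k goals when the mean is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def most_likely_score(mu_ij, mu_ji):
    """Modal score of both teams and its probability (independent scores)."""
    g_i, g_j = math.floor(mu_ij), math.floor(mu_ji)
    prob = poisson_pmf(g_i, mu_ij) * poisson_pmf(g_j, mu_ji)
    return g_i, g_j, prob


g_chile, g_spain, prob = most_likely_score(0.7249, 1.6878)
print(g_chile, g_spain, round(prob, 3))
```

Even the most likely score has a probability of only about 15%, which illustrates how uncertain any single predicted result is.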


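This yields the following standings in the group, where GD is the goal difference and µ is the sum of the expected goals of a team over its three matches:

Team Points GS GA GD µ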

1 Spain 9 3 0 3 4.91

2 Netherlands 6 2 1 1 4.15

3 Chile 3 1 2 -1 2.91

4 Australia 0 0 3 -3 2.37

Table 4.13: Most likely results in group B using the GLM
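The standings follow mechanically from the six predicted scores in table 4.12. The sketch below rebuilds them with three points for a win and one for a draw; the ordering by points, goal difference and goals scored is an assumption made for this illustration and may differ from the tie-breaking used in the thesis.

```python
# Predicted scores from table 4.12: (team 1, goals 1, goals 2, team 2).
results = [
    ("Chile", 0, 1, "Spain"),
    ("Chile", 0, 1, "Netherlands"),
    ("Chile", 1, 0, "Australia"),
    ("Spain", 1, 0, "Netherlands"),
    ("Spain", 1, 0, "Australia"),
    ("Netherlands", 1, 0, "Australia"),
]


def standings(matches):
    """Return the teams sorted by points, goal difference and goals scored."""
    table = {}
    for t1, g1, g2, t2 in matches:
        for team in (t1, t2):
            table.setdefault(team, {"points": 0, "GS": 0, "GA": 0})
        table[t1]["GS"] += g1
        table[t1]["GA"] += g2
        table[t2]["GS"] += g2
        table[t2]["GA"] += g1
        if g1 > g2:
            table[t1]["points"] += 3
        elif g1 < g2:
            table[t2]["points"] += 3
        else:
            table[t1]["points"] += 1
            table[t2]["points"] += 1
    return sorted(
        table.items(),
        key=lambda kv: (kv[1]["points"], kv[1]["GS"] - kv[1]["GA"], kv[1]["GS"]),
        reverse=True,
    )


ranked = standings(results)
for team, row in ranked:
    print(team, row["points"], row["GS"], row["GA"], row["GS"] - row["GA"])
```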

For the knock-out stage we again used the floored estimate of µ_ij to predict the score, and used the estimates themselves to choose the winner when the predicted score is a tie. The most likely results of the group stage and knock-out stage can be found in appendix B.2.

4.3.3 Poisson Generalized Linear Mixed Model

The next modification is to use a GLMM instead of a GLM. The reason is that a football match can depend on random effects that affect the score of both teams. We use the same fixed effects as in our GLM, so we have the following assumptions:

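$$Y_{ij} \sim \mathrm{Pois}(\mu_{ij}), \qquad Y_{ji} \sim \mathrm{Pois}(\mu_{ji})$$

$$\begin{aligned}
\log \mu_{ij} &= \beta_0 + \beta_1 \cdot \mathrm{att}_i + \beta_2 \cdot \mathrm{def}_j + \beta_3 \cdot \mathrm{home} + \beta_{\mathrm{type}} + \beta_4 \cdot (\mathrm{rank}_j - \mathrm{rank}_i) + \gamma_{ij} \\
\log \mu_{ji} &= \beta_0 + \beta_1 \cdot \mathrm{att}_j + \beta_2 \cdot \mathrm{def}_i + \beta_3 \cdot \mathrm{home} + \beta_{\mathrm{type}} + \beta_4 \cdot (\mathrm{rank}_i - \mathrm{rank}_j) + \gamma_{ji} \\
\gamma_{ij} &= \gamma_{ji} \sim N(0, \sigma^2)
\end{aligned}$$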
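To make the role of the shared random effect concrete, the sketch below simulates matches under these assumptions: one normal draw per match scales the means of both teams, after which the two scores are drawn as Poisson numbers. It is an illustration only. The fixed-effects means 1.5 and 1.0 are hypothetical, σ = 0.15572 is the standard deviation reported in table 4.14, and the sampler is Knuth's simple multiplication method because the Python standard library has no Poisson generator.

```python
import math
import random


def sample_poisson(mu, rng):
    """Draw one Poisson number with mean mu (Knuth's method, small means)."""
    limit = math.exp(-mu)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_match(base_ij, base_ji, sigma, rng):
    """Simulate one score; both means share the same random match effect."""
    scale = math.exp(rng.gauss(0.0, sigma))  # exp(gamma), gamma ~ N(0, sigma^2)
    return sample_poisson(base_ij * scale, rng), sample_poisson(base_ji * scale, rng)


rng = random.Random(2014)
sigma = 0.15572
scores = [simulate_match(1.5, 1.0, sigma, rng) for _ in range(20000)]
mean_i = sum(s[0] for s in scores) / len(scores)
print(mean_i, 1.5 * math.exp(sigma ** 2 / 2))
```

Because the same effect multiplies both means, the two scores of a match are positively correlated, which is exactly what distinguishes the GLMM from the GLM with independent scores.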

Using the function glmer we got the output in table 4.14.

When we compare this with table 4.7 or 4.8 we see that the coefficients are not significantly different from those of the models without random effects. The main difference is that the AIC value of the mixed model (5745.78) is about 2.5 times lower than that of the best models without random effects (about 14515).

Just as in the GLM with the same parameters, we set β₀ and β_FWCF to zero, which yields the following model for predicting the WC:

$$Y_{ij} \sim \mathrm{Pois}(\mu_{ij})$$


GLMM  Deviance  AIC      σ²        σ        Pr(Chi < Deviance)
      4732.23   5745.78  0.024248  0.15572  < 2e−16

         Estimate  Std. Error  z value  Pr(>|z|)    95%-confint
β₀        0.0230   0.0797       0.2882  7.7316e-01  (−0.13, 0.18)
β₁        0.4188   0.0336      12.4707  1.0791e-35  (0.35, 0.48)
β₂        0.5281   0.0625       8.4557  2.7745e-17  (0.41, 0.65)
β₃        0.3146   0.0260      12.0773  1.3920e-33  (0.26, 0.37)
β_CQ     -0.2172   0.0537      -4.0433  5.2714e-05  (−0.32, −0.11)
β_F      -0.1334   0.0458      -2.9131  3.5785e-03  (−0.22, −0.05)
β_FCC     0.2241   0.1077       2.0812  3.7414e-02  (0.01, 0.44)
β_FWCF   -0.1215   0.0839      -1.4470  1.4790e-01  (−0.29, 0.04)
β_FWCQ   -0.1654   0.0476      -3.4739  5.1296e-04  (−0.26, −0.07)
β₄        0.0044   0.0004      11.7957  4.1079e-32  (0.0036, 0.0051)

Table 4.14: Summary of the GLMM

$$\log \mu_{ij} = 0.4188 \cdot \mathrm{att}_i + 0.5281 \cdot \mathrm{def}_j + 0.3146 \cdot \mathrm{home} + 0.0044 \cdot (\mathrm{rank}_j - \mathrm{rank}_i) + \gamma_{ij}, \qquad \gamma_{ij} \sim N(0, 0.024248)$$

Using the probabilities derived in equation 2.3 we can again calculate the probability that a team wins or loses a game, the position at which it finishes in the group stage and its chance of winning the World Cup. Note that these probabilities are approximate because of limited numerical precision in the calculation.

   Team1        Team2        Team1Win  Draw    Team2Win
1  Chile        Spain        0.1582    0.2380  0.6038
2  Chile        Netherlands  0.2226    0.2623  0.5151
3  Chile        Australia    0.4429    0.2788  0.2783
4  Spain        Netherlands  0.4469    0.2812  0.2719
5  Spain        Australia    0.6840    0.2047  0.1113
6  Netherlands  Australia    0.6029    0.2354  0.1617

Table 4.15: Probabilities of match outcomes in group B using the GLMM

All these probabilities are not very different from those in the corresponding tables 4.9, 4.10 and 4.11. This is because the expected value of $Y_{ij}$ under the GLMM is approximately the expected value under the GLM multiplied by $e^{\sigma^2/2} = e^{0.01212} \approx 1.0122$.
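The factor can be made explicit with a short derivation, sketched here under the assumption that the fixed-effects parts of the two models are nearly the same. Write $m_{ij}$ (a symbol introduced only for this sketch) for the exponentiated fixed-effects part, so that $\mu_{ij} = m_{ij}\, e^{\gamma_{ij}}$ with $\gamma_{ij} \sim N(0, \sigma^2)$ and $\sigma^2 = 0.024248$ from table 4.14. Using the mean of a log-normal variable:

$$E[Y_{ij}] = E\big[E[Y_{ij} \mid \gamma_{ij}]\big] = m_{ij}\, E\big[e^{\gamma_{ij}}\big] = m_{ij}\, e^{\sigma^2/2} = m_{ij}\, e^{0.012124} \approx 1.0122\, m_{ij}$$

With such a small variance the random effect raises the expected number of goals by only about 1.2%, which is why the GLM and GLMM tables are so similar.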


Teams Prob.1st.place Prob.2nd.place Prob.3rd.place Prob.4th.place

1 Chile 0.1146 0.2208 0.3450 0.3196

2 Spain 0.5121 0.2891 0.1417 0.0571

3 Netherlands 0.3148 0.3462 0.2223 0.1167

4 Australia 0.0585 0.1439 0.2910 0.5066

Table 4.16: Probabilities of a team finishing 1st, 2nd, 3rd or 4th in group B using the GLMM

There are also some minor differences in the estimated coefficients.

Besides the almost equal expected values, the modes are also very close: they differ by a factor of approximately exp(−0.02424) ≈ 0.976. Therefore, the predicted results of the matches are almost the same for both models. They differ in two matches of the group stage, Greece vs Cote d’Ivoire and Argentina vs Iran, and in three matches of the knock-out stage, because these two different results changed the rankings in the groups.

The probabilities for all groups and for all three models (Bradley-Terry, GLM and GLMM) can be found in appendix B, where we have also included the most likely outcomes for each model.


Team Win Chance Team Win Chance

1 Brazil 24.23% 17 Mexico 1.26%

2 Spain 16.95% 18 USA 1.08%

3 Germany 10.86% 19 Greece 0.93%

4 Netherlands 6.74% 20 Belgium 0.87%

5 Portugal 5.42% 21 Nigeria 0.75%

6 England 4.99% 22 Ghana 0.72%

7 Argentina 4.19% 23 Chile 0.63%

8 France 2.53% 24 Bosnia and Herzegovina 0.61%

9 Switzerland 2.24% 25 Japan 0.51%

10 Russia 2.15% 26 Algeria 0.42%

11 Cote d’Ivoire 2.10% 27 Ecuador 0.35%

12 Uruguay 1.98% 28 Honduras 0.31%

13 Croatia 1.86% 29 Costa Rica 0.27%

14 Colombia 1.62% 30 Korea Republic 0.19%

15 Iran 1.56% 31 Cameroon 0.17%

16 Italy 1.39% 32 Australia 0.12%

Total 100%

Table 4.17: Chance of winning the WC for each team using the GLMM.


Chapter 5 Conclusion and Discussion

5.1 Conclusion

Using three different models of course leads to three different predictions. In figure 5.1 we have plotted the win chances of nine teams as estimated by the three models next to each other. As said before, the predictions of the GLM and GLMM do not differ much, which can also be seen in the plot.

Besides that, all three models agree that Brazil has the highest chance of winning the WC, but the difference between the Bradley-Terry model and the two Poisson models is about ten percentage points. Spain, Germany and Argentina have about equal probabilities of winning the WC according to the Bradley-Terry model, while the GLM and GLMM put those three teams in a clear order. The three models agree the most on the chances of Germany, England, Belgium and the lower-ranked teams. All the probabilities of winning and advancing to the next round can be found in appendix B, tables B.3, B.7 and B.12 for the Bradley-Terry model, the GLM and the GLMM respectively.

We consider the GLMM to be our best model. The reason is that it has by far the lowest AIC value of the three models, except for Bradley-Terry model 4, which however takes considerably less data into account. The GLMM also takes the strength of the teams and home advantage into consideration, whereas the Bradley-Terry model only uses past wins, draws and losses. Besides that, it has the advantage over the GLM that it accounts for random effects that may occur during a football match. The most likely results using the GLMM can be found in appendix B.3. The most likely results using the GLM are also included; these can be found in appendix B.2.

[Figure 5.1: bar plot of the win chance in percentages (axis from 0 to 25) for Bra, Spa, Ger, Ned, Por, Eng, Arg, Fra and Bel; series: Bradley-Terry, GLM, GLMM.]

Figure 5.1: Win chances of nine teams estimated by the three different models.

In our GLMM we have not taken the current form of the teams into account. We tried to assign weights to matches, so that matches further in the past are less important than recent matches, but using the option weights in glm gave almost the same parameter estimates as using equal weights for each game.
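The text does not state which weighting scheme was tried. One common choice, shown here purely as an assumption, is an exponential decay in the age of the match; the function name and the half-life of two years are hypothetical.

```python
def match_weight(age_years, half_life_years=2.0):
    """Weight of a match played age_years ago; halves every half_life_years."""
    return 0.5 ** (age_years / half_life_years)


# Weights for matches played 0, 1, 2, 4 and 6 years before the tournament.
weights = [match_weight(age) for age in (0, 1, 2, 4, 6)]
print([round(w, 3) for w in weights])
```

Weights of this kind could be passed per observation to a weighted fit, so that recent matches contribute more to the likelihood than older ones.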

As can be seen in the results, no team is ever predicted to score more than two goals, and the only team predicted to score two goals in a match is Brazil. This is explained by the home advantage: according to both the GLM and the GLMM, teams that play at home score about $e^{0.315} \approx 1.370$ times as many goals as teams that do not play at home. Most of the other matches are expected to end in a 1-0 victory or a 1-1 draw. Only one match is expected to end without goals: the match between Switzerland and France in group E.

These expected results are based on the most likely number of goals a team will score against another team. The probabilities of those most likely results are also included. Using the GLMM we can also simulate random World Cups. In appendix B.4 we give three random World Cup outcomes, where every score is a draw from a mixed Poisson distribution as in 2.5.3. This shows the uncertainty in our model: when just the right numbers are drawn from the Poisson distribution, it is possible for a team such as Cameroon to become World Champion. Note, however, that such outcomes are unlikely; the probabilities of advancing to the next round and winning the World Cup can be found in appendix B.3.

5.2 Discussion

5.2.1 Goldman Sachs

[Figure 5.2: bar plot of the win chance in percentages (axis from 0 to 50) for Bra, Spa, Ger, Ned, Por, Eng, Arg, Fra and Bel; series: Bradley-Terry, GLMM, Goldman Sachs.]

Figure 5.2: Win chances of nine teams estimated by the three different models.

Goldman Sachs, a leading global investment banking, securities and investment management firm, has made a similar prediction [6]. The differences between our model and that of Goldman Sachs are that they used the so-called ELO rating where we used the FIFA ranking, that they used only the goals scored in the last ten matches instead of all games, and that they added a variable not only for home advantage but also for whether the match is played on the home continent.

In figure 5.2 we have plotted the win chances of nine teams according to Goldman Sachs and two of our models. One thing that stands out immediately is the enormous chance, almost fifty percent, for Brazil to win according to the model of Goldman Sachs. Argentina also has a much higher chance than in our GLMM. This suggests that the effect of playing on the home continent is quite large in their model.

5.2.2 Bookmakers

Besides our predictions and that of Goldman Sachs, we can also compare our results with those of different bookmakers. The difference between the bookmakers and the GLM and GLMM is that bookmakers also take the strength of the individual players of each team into account. Therefore Belgium, but also Argentina, have a much higher win chance according to the bookmakers than according to our model. This can be seen in figure 5.3, which shows the win chances of nine teams according to two bookmakers and the GLMM.

[Figure 5.3: bar plot of the win chance in percentages (axis from 0 to 50) for Bra, Spa, Ger, Ned, Por, Eng, Arg, Fra and Bel; series: GLMM, bet365, William Hill.]

Figure 5.3: Win chances of nine teams estimated by the three different models.


5.2.3 Possible improvements

When modelling sport, the current form of the teams can be a very important factor. In this case one could say that recently played games are more important than games played six years ago. One way to model this is to use only the mean number of goals scored by a team in, for example, its last ten matches, as is done in Hatzius et al. [6]. Another way is to add a new variable for the form, in which the form of a team in the current match depends on its form in its last match and on the form of its opponent in that match.
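The first option amounts to replacing the overall mean by a mean over a moving window. A minimal sketch, with a hypothetical function name and made-up goal counts, assuming the goals of a team are stored in chronological order:

```python
def recent_attack(goals, window=10):
    """Mean goals per game over the last `window` matches (oldest first)."""
    if not goals:
        raise ValueError("at least one match is needed")
    recent = goals[-window:]
    return sum(recent) / len(recent)


# Made-up history: a weak early period followed by ten stronger matches.
history = [0, 0, 1, 0, 0] + [2, 1, 3, 2, 2, 1, 3, 2, 2, 2]
print(recent_attack(history), sum(history) / len(history))
```

The windowed mean reacts to the recent improvement, whereas the overall mean is pulled down by the older matches.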

To implement the form in one of these ways, however, we need more data than only the matches played by the participants of the World Cup, because we also need the form of all the opponents against which the teams have played.
