Modelling the probability of a football goal using time intervals


Master’s Thesis

Modelling the probability of a football goal using time intervals

Kasper van Vliet

Student number: 6159982

Date of final version: April 20, 2016

Master’s programme: Econometrics

Supervisor: Dr. J. C. M. van Ophem

Second reader: Dr. M. J. G. Bun


Contents

1 Introduction
2 The Model
2.1 Model Set-up
2.2 Maximum Likelihood
2.3 Time Intervals
2.4 Model specifications
2.4.1 Model I, the basics
2.4.2 Model II, time effect
2.4.3 Model III, introducing interval-specific current score effects
2.4.4 Model IV, the effect of a red card
2.4.5 Model V, comparing countries
3 Data
4 Results
4.1 Model I, the basics
4.2 Model II, time effect
4.3 Model III, introducing interval-specific current score effects
4.4 Model IV, the effect of a red card
4.5 Model V, comparing countries
4.6 5-minute vs. 15-minute models
5 Conclusion
6 Discussion
Bibliography


Chapter 1

Introduction

Statistics are becoming increasingly important in the generally conservative world of football. While models and statistics have only recently become important to coaches and clubs themselves, the use of models to predict match outcomes started in the early 1980s. Three approaches are mainly used to model the probabilities of match outcomes, based on the final score, the margin of victory or the match result (win, draw, loss). The first approach forms the basis of most of the literature and was used by Maher (1982), who assumed that the numbers of goals scored by the two teams are independently Poisson distributed. The means of these distributions then depend on the past performances of the teams. Various researchers have tried to improve the model in different ways. For example, Karlis and Ntzoufras (2003) allow for dependence between the number of home and away goals. Furthermore, Dixon and Coles (1997) incorporate the dynamic nature of teams' performances by assuming that recent information is of more value than historic information in determining current parameter estimates. However, both Dixon and Coles (1997) and Karlis and Ntzoufras (2003) only look at the goal distribution over an entire match and not at the probability of a goal within a particular phase of a match. This probability of scoring a goal, or the goal scoring rate in continuous time, can change during a match. From experience we know that factors like playing style, current score, team strength and decisions made by the referee are likely to influence the actual probability of scoring and conceding goals.

Dixon and Robinson (1998) are the first to present a model in which the scoring intensity during a match is considered. They find inhomogeneity in the frequency of scored goals during a match. To investigate this inhomogeneity, they use a framework in which the home team and away team scoring processes have intensities that are allowed to vary over time and with the current status of the process. They draw three main conclusions. First, the scoring rates for both teams generally increase as the match progresses. The authors suggest that this is caused by tiredness of the players. Secondly, the scoring rates of home and away teams depend on the current score in a match. The authors find that at level scores, the rates do not differ from the rates at a score of (0,0). When the home team is leading, the goal scoring rate of the home team decreases and the rate of the away team increases. They argue that this could be caused by the home team defending a lead and the away team trying to restore equality. If the away team is leading, the scoring rates of both teams tend to increase. Finally, the authors look at the effect of a first goal being scored by either the home or away team. They find evidence that after a first goal has been scored, the scoring rates of both teams increase slightly. Their explanation for this effect is that teams play more openly once 'the deadlock has been broken'.

Van Ours and van Tuijl (2011) investigate whether there are country-specific effects in goal scoring in the 'dying seconds' of national team matches. They provide anecdotal evidence that the ability to score goals in the dying seconds of a match is not randomly distributed across countries, and find evidence that the goal scoring and goal conceding intensities of Belgium, Brazil and Italy do not increase significantly during the final stage of a match. However, for England, the Netherlands and Germany, the goal scoring intensity during the final minutes is significantly higher than before. Their main conclusion is that national teams differ in the evolution of goal scoring rates over the course of a match, which indicates that besides skills, national identity seems to matter.

As discussed, Dixon and Robinson (1998) find that the scoring rates generally increase after a first goal has been scored by either the home or away team. Nevo and Ritov (2013) investigate this phenomenon in more detail by looking at the relation between the first two random goal scoring times in football matches, not distinguishing between home and away goals. The authors’ main conclusion is that the occurrence of a first goal can increase or decrease the probability of the next goal, depending on the time of the first goal. They find that if a goal is scored before (after) the 52nd minute of a match, it decreases (increases) the probability of a second goal. While limited by only investigating the relationship between the first and second goal in a match, this research strengthens the idea that the current state of a match is of importance when looking at goal scoring rates.

So far we have only discussed within-match parameters that are determined by the actual scoring process. However, decisions of the referee can clearly affect the teams and their goal scoring probabilities as well. In football, a red card is given to a player who commits serious misconduct, excluding him or her from the remaining part of the match and leaving the team with one player fewer than the opponent. Vecer et al. (2009) investigate the size of the effect of a red card using data from the FIFA World Cup 2006 and Euro 2008, both tournaments in which only national teams compete. The authors find that when one of the teams receives a red card, its scoring intensity decreases to around 2/3 of its original intensity. For the opposing team, the scoring rate increases to around 5/4 of the original intensity. Furthermore, they find that the expected total number of goals decreases when the stronger team receives a red card, while the expected total number of goals stays the same or increases when the weaker team receives one.

Finally, we would like to note the home advantage that plays a role in the scoring probability. Home advantage is a widely known phenomenon that occurs in most sports and refers to the finding that home teams consistently win more than 50% of the matches played in a balanced home and away schedule. This means that the scoring rates of home teams are in general higher than those of away teams. Many factors can play a role in causing this home advantage. For example, the presence of supporters can influence referee decisions, give the home team a boost and discourage the away team. We refer to Courneya and Carron (1992), who extensively describe home advantage and the possible factors that play a role. They conclude that home advantage exists in major team sports, that the magnitude of home advantage within each sport is consistent and relatively stable over time, and that the magnitude of the home advantage differs from sport to sport. Furthermore, they list a few hypotheses about the home advantage: subjective decisions favour the home team, competitors perceive their psychological states to be different when playing home or away, and the absolute crowd size is not an important factor.

In short, previous research suggests three significant within-match parameters when investigating the probability of a goal: time, current score and red cards. In this thesis we develop a framework that, as far as we know, has not yet been applied in the existing literature and in which matches are divided into time intervals. For each time interval, the probabilities of a home goal and an away goal scoring event are estimated separately. The model is flexible with respect to all the described within-match parameters and, since home and away goals are modelled separately, it can also capture the home advantage. In chapter 2 we present this model, followed by chapter 3, which describes the datasets we apply our model to. In chapter 4 we present the estimation results and chapter 5 concludes. Chapter 6 gives a short discussion of the results and conclusions.


Chapter 2

The Model

In this chapter we present the model we will use to estimate the probability of a goal scoring event within a certain time interval. In the Model Set-up and Maximum Likelihood sections we present the model in a general form, without specifying the number and length of the time intervals. Thereafter we present the exact specification we have chosen, which will be used in the remaining part of this thesis. Finally, we describe the specific models that we estimate.

2.1 Model Set-up

We have matches $i = 1, \ldots, n$, for which the outcome is recorded as a product of scoring 'probabilities', one for each of the time intervals $j = 1, \ldots, k$. We do not consider matches in which one of the teams scores more than one goal in an interval, which leaves four possibilities for each interval: both teams score, neither team scores, only the home team scores, or only the away team scores. For each match $i$, the match outcome probability is then recorded as a product of these possibilities, depending on the actual match data. For example, this product can look like

$$\text{Outcome}_i = p_{i1}(1 - q_{i1}) \cdot (1 - p_{i2})(1 - q_{i2}) \cdots (1 - p_{ik})(1 - q_{ik}),$$

where $p_{ij}$ ($q_{ij}$) is the probability that the home (away) team scores in interval $j$ of match $i$ and $1 - p_{ij}$ ($1 - q_{ij}$) is the probability that the home (away) team does not score in interval $j$ of match $i$. These probabilities can be written as a function of covariates that are likely to influence the probability of a goal scoring event. In this thesis, we choose a logistic specification in order to obtain actual probabilities: values between zero and one. This leads to the specification

$$p_{ij} = \frac{\exp\left(\alpha_j + \beta(Q_{H,i} - Q_{A,i}) + \theta_1 \mathrm{ahead}_{ij} + \theta_2 \mathrm{behind}_{ij}\right)}{1 + \exp\left(\alpha_j + \beta(Q_{H,i} - Q_{A,i}) + \theta_1 \mathrm{ahead}_{ij} + \theta_2 \mathrm{behind}_{ij}\right)}, \tag{2.1}$$

where $\alpha_j$ represents a baseline scoring intensity, $\mathrm{ahead}_{ij}$ is a dummy variable that takes the value one if the home team is ahead in the score at the start of interval $j$ and zero otherwise, and $\mathrm{behind}_{ij}$ is a dummy variable that takes the value one if the home team is behind in the score at the start of interval $j$ and zero otherwise. Moreover, we have the variables $Q_{H,i}$ and $Q_{A,i}$, which represent the home and away team strength, respectively, and stay constant during a match. The probability of the away team scoring in interval $j$ of match $i$ can then be written as

$$q_{ij} = \frac{\exp\left(\delta_j + \phi(Q_{A,i} - Q_{H,i}) + \kappa_1 \mathrm{ahead}_{ij} + \kappa_2 \mathrm{behind}_{ij}\right)}{1 + \exp\left(\delta_j + \phi(Q_{A,i} - Q_{H,i}) + \kappa_1 \mathrm{ahead}_{ij} + \kappa_2 \mathrm{behind}_{ij}\right)}, \tag{2.2}$$

where $\delta_j$ is the equivalent of $\alpha_j$, and $Q_{H,i}$ and $Q_{A,i}$ again represent the teams' strength. In contrast to the specification of $p_{ij}$, the dummy variables $\mathrm{ahead}_{ij}$ and $\mathrm{behind}_{ij}$ now take the value one if the away team is ahead or behind, respectively, at the start of interval $j$ of match $i$ and zero otherwise. When constructing these probabilities, we can assume some effects to be constant over all intervals and allow others to vary from interval to interval. In the above specification, the effect of the strength of the teams is not allowed to vary between intervals, hence the parameters $\beta$ and $\phi$ carry no interval index. However, the baseline scoring probability is allowed to vary from interval to interval, hence the interval-specific parameters $\alpha_j$ and $\delta_j$. Different assumptions lead to different model specifications, which we describe more extensively in the following sections.
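As a minimal sketch of how the logistic specifications (2.1) and (2.2) turn a linear index into a probability, the Python functions below evaluate both equations for a single interval. The function names and the parameter values in the usage lines are illustrative assumptions and are not taken from the thesis.

```python
import math


def logistic(x: float) -> float:
    """Map a linear index to a probability in (0, 1); equals exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + math.exp(-x))


def home_goal_prob(alpha_j, beta, q_home, q_away, theta1, theta2, ahead, behind):
    """Equation (2.1): probability that the home team scores in interval j.

    ahead / behind are 0-1 dummies for the home team's score state at the
    start of the interval.
    """
    index = alpha_j + beta * (q_home - q_away) + theta1 * ahead + theta2 * behind
    return logistic(index)


def away_goal_prob(delta_j, phi, q_home, q_away, kappa1, kappa2, ahead, behind):
    """Equation (2.2): same form, with the dummies defined for the away team."""
    index = delta_j + phi * (q_away - q_home) + kappa1 * ahead + kappa2 * behind
    return logistic(index)


# Made-up parameter values for illustration only (not estimates from the thesis).
p = home_goal_prob(alpha_j=-2.5, beta=0.4, q_home=1.0, q_away=0.8,
                   theta1=0.1, theta2=0.1, ahead=1, behind=0)
q = away_goal_prob(delta_j=-3.0, phi=0.4, q_home=1.0, q_away=0.8,
                   kappa1=0.1, kappa2=0.1, ahead=0, behind=1)
print(round(p, 4), round(q, 4))
```

Whatever values the covariates take, the logistic transform keeps both outputs strictly between zero and one, which is the reason the thesis gives for choosing this specification.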

2.2 Maximum Likelihood

To estimate the parameters in our model we use the method of maximum likelihood. Therefore, we need to construct a likelihood function. Each match contributes to the likelihood function through a particular product, with one term for each interval, representing one of the four possibilities that can occur in that interval, as in the example product of section 2.1. The total likelihood can therefore be written as

$$L(\alpha_j, \delta_j, \beta, \phi, \theta_1, \theta_2, \kappa_1, \kappa_2;\ j = 1, \ldots, k) = \prod_{i=1}^{n} \prod_{j=1}^{k} p_{ij}^{h_{ij}} (1 - p_{ij})^{1 - h_{ij}} \, q_{ij}^{a_{ij}} (1 - q_{ij})^{1 - a_{ij}}, \tag{2.3}$$

where $h_{ij}$ ($a_{ij}$) is one if the home (away) team scores in interval $j$ of match $i$ and zero otherwise. Taking the logarithm of this likelihood gives the log-likelihood, which is the function that we actually maximize:

$$\ell(\alpha_j, \delta_j, \beta, \phi, \theta_1, \theta_2, \kappa_1, \kappa_2;\ j = 1, \ldots, k) = \sum_{i=1}^{n} \sum_{j=1}^{k} \left[ h_{ij} \log(p_{ij}) + (1 - h_{ij}) \log(1 - p_{ij}) + a_{ij} \log(q_{ij}) + (1 - a_{ij}) \log(1 - q_{ij}) \right]. \tag{2.4}$$

The maximization of this loglikelihood function is done by running an optimization procedure provided by the Optimization Toolbox of MATLAB 2014b software.
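The log-likelihood (2.4) can be sketched in a few lines of Python. The thesis itself uses MATLAB, so this is only an illustration; the match record layout (the keys `dq`, `h`, `a`, `home_ahead`, `home_behind`) is a hypothetical choice, and the sketch only evaluates the function, leaving the maximization to a numerical optimizer.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_likelihood(matches, alpha, delta, beta, phi, theta, kappa):
    """Evaluate the log-likelihood (2.4) for k intervals.

    matches: list of dicts (hypothetical layout) with keys
        'dq'                        : Q_H - Q_A for the match
        'h', 'a'                    : 0/1 goal indicators per interval
        'home_ahead', 'home_behind' : 0/1 home score state at interval start
    alpha, delta: lists of k baseline parameters
    theta, kappa: pairs (ahead effect, behind effect)
    """
    total = 0.0
    for m in matches:
        for j in range(len(alpha)):
            ha, hb = m["home_ahead"][j], m["home_behind"][j]
            p = logistic(alpha[j] + beta * m["dq"] + theta[0] * ha + theta[1] * hb)
            # The away team is ahead exactly when the home team is behind.
            q = logistic(delta[j] - phi * m["dq"] + kappa[0] * hb + kappa[1] * ha)
            h, a = m["h"][j], m["a"][j]
            total += h * math.log(p) + (1 - h) * math.log(1 - p)
            total += a * math.log(q) + (1 - a) * math.log(1 - q)
    return total


# Toy match with two intervals: the home team scores in the first interval.
toy = {"dq": 0.2, "h": [1, 0], "a": [0, 0],
       "home_ahead": [0, 1], "home_behind": [0, 0]}
print(log_likelihood([toy], [-2.5, -2.5], [-3.0, -3.0], 0.4, 0.4, (0.1, 0.1), (0.1, 0.1)))
```

Setting all entries of `alpha` or `delta` equal gives the constant baseline of Model I, while separate entries per interval give Model II.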

2.3 Time Intervals

One might consider different ways to specify the length and number of the time intervals. The shorter each interval, the more realistic the model. However, decreasing the length of each interval makes the estimation procedure more complex, since the number of parameters that needs to be estimated increases. We are therefore dealing with a trade-off between complexity and realism. We choose to estimate the models for intervals with lengths of five and fifteen minutes. In the 5-minute model we are limited in the number of parameters we can include, but the intervals are short. The 15-minute model is less realistic in the sense that the time intervals are relatively long, but we do not need to impose restrictions on the number of parameters. Estimating both models allows us to compare the results and draw conclusions based on the outcomes of both models.

In our application, we divide a match into the following eight time intervals when using the 15-minute model: 1-15, 16-30, 31-45, 45+, 46-60, 61-75, 76-90, 90+. The first three intervals of each half are exactly 15 minutes long, whereas the last interval of the first half (45+) and of the second half (90+) contains the stoppage time, which is added if the match has been paused due to, for example, injuries or substitutions. The last interval of each half therefore has a length that varies between matches. When using the 5-minute model we have twenty intervals: 1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40, 41-45, 45+, 46-50, 51-55, 56-60, 61-65, 66-70, 71-75, 76-80, 81-85, 86-90, 90+. Again, the last interval of the first and second half represents the stoppage time with varying length. In chapter 3, we describe the characteristics of this stoppage time in more detail, as well as the way we measure the strength of teams.
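The interval scheme above can be sketched as a small mapping from an event time to an interval index. This is a hedged illustration: the input convention (a regular-time minute between 1 and 90 plus a flag for stoppage time) is an assumption, since the thesis does not describe how event times are stored.

```python
def interval_labels(width):
    """Interval labels for one match, e.g. width=15 gives 1-15, ..., 45+, 46-60, ..., 90+."""
    labels = []
    for half_start, half_end in ((1, 45), (46, 90)):
        for start in range(half_start, half_end + 1, width):
            labels.append(f"{start}-{start + width - 1}")
        labels.append(f"{half_end}+")  # stoppage-time interval of this half
    return labels


def interval_index(minute, stoppage, width):
    """Zero-based interval index of an event.

    minute:   regular-time minute, 1..90 (45 or 90 for stoppage-time events)
    stoppage: True if the event fell in the stoppage time of its half
    width:    interval length in minutes (5 or 15)
    """
    per_half = 45 // width + 1  # regular intervals plus one stoppage interval
    half = 0 if minute <= 45 else 1
    if stoppage:
        return half * per_half + per_half - 1
    minute_in_half = minute - 45 * half
    return half * per_half + (minute_in_half - 1) // width


labels_15 = interval_labels(15)
labels_5 = interval_labels(5)
print(labels_15[interval_index(18, False, 15)])  # interval containing minute 18
print(labels_5[interval_index(29, False, 5)])    # interval containing minute 29
```

With a width of 15 the function produces eight labels and with a width of 5 it produces twenty, matching the two lists in the text.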

2.4 Model specifications

2.4.1 Model I, the basics

We begin the analysis with a model specification in which all parameters are assumed to have a constant effect from interval to interval. In this case, the probabilities for home and away team scoring events look like

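$$p_{ij} = \frac{\exp\left(\alpha + \beta(Q_{H,i} - Q_{A,i}) + \theta_1 \mathrm{ahead}_{ij} + \theta_2 \mathrm{behind}_{ij}\right)}{1 + \exp\left(\alpha + \beta(Q_{H,i} - Q_{A,i}) + \theta_1 \mathrm{ahead}_{ij} + \theta_2 \mathrm{behind}_{ij}\right)} \tag{2.5}$$

and

$$q_{ij} = \frac{\exp\left(\delta + \phi(Q_{A,i} - Q_{H,i}) + \kappa_1 \mathrm{ahead}_{ij} + \kappa_2 \mathrm{behind}_{ij}\right)}{1 + \exp\left(\delta + \phi(Q_{A,i} - Q_{H,i}) + \kappa_1 \mathrm{ahead}_{ij} + \kappa_2 \mathrm{behind}_{ij}\right)}. \tag{2.6}$$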

The underlying assumptions of this model are that the baseline scoring probability does not change from interval to interval and that the effects of being ahead or behind in the score are constant during the whole match. This model is therefore not able to incorporate the time effect, but it gives us the opportunity to compare the 5-minute and 15-minute models on a basic level.

2.4.2 Model II, time effect

We extend model I by allowing the baseline scoring probability to vary from interval to interval. This leads to the specification of the probabilities as in 2.1 and 2.2. Since the baseline scoring probability can change during the match, the time effect can be measured with this model.


2.4.3 Model III, introducing interval-specific current score effects

By estimating this model, we want to investigate whether the effect of being ahead or behind in the score is constant during a match. Therefore we extend model II by allowing the ahead and behind effects to vary from interval to interval. This leads to the following specification of the probabilities:

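$$p_{ij} = \frac{\exp\left(\alpha_j + \beta(Q_{H,i} - Q_{A,i}) + \theta_{1,j} \mathrm{ahead}_{ij} + \theta_{2,j} \mathrm{behind}_{ij}\right)}{1 + \exp\left(\alpha_j + \beta(Q_{H,i} - Q_{A,i}) + \theta_{1,j} \mathrm{ahead}_{ij} + \theta_{2,j} \mathrm{behind}_{ij}\right)} \tag{2.7}$$

and

$$q_{ij} = \frac{\exp\left(\delta_j + \phi(Q_{A,i} - Q_{H,i}) + \kappa_{1,j} \mathrm{ahead}_{ij} + \kappa_{2,j} \mathrm{behind}_{ij}\right)}{1 + \exp\left(\delta_j + \phi(Q_{A,i} - Q_{H,i}) + \kappa_{1,j} \mathrm{ahead}_{ij} + \kappa_{2,j} \mathrm{behind}_{ij}\right)}. \tag{2.8}$$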

2.4.4 Model IV, the effect of a red card

We extend model II by introducing the effect of red cards. Playing with ten instead of eleven players has a big influence on the way teams play and therefore their probabilities to score. By estimating this model we look into the size of this influence and whether there is a difference in the size of the effect between home and away teams. This leads to probabilities

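$$p_{ij} = \frac{\exp\left(\alpha_j + \beta(Q_{H,i} - Q_{A,i}) + \theta_1 \mathrm{ahead}_{ij} + \theta_2 \mathrm{behind}_{ij} + \rho_1 \mathrm{rch}_{ij} + \rho_2 \mathrm{rca}_{ij}\right)}{1 + \exp\left(\alpha_j + \beta(Q_{H,i} - Q_{A,i}) + \theta_1 \mathrm{ahead}_{ij} + \theta_2 \mathrm{behind}_{ij} + \rho_1 \mathrm{rch}_{ij} + \rho_2 \mathrm{rca}_{ij}\right)} \tag{2.9}$$

and

$$q_{ij} = \frac{\exp\left(\delta_j + \phi(Q_{A,i} - Q_{H,i}) + \kappa_1 \mathrm{ahead}_{ij} + \kappa_2 \mathrm{behind}_{ij} + \gamma_1 \mathrm{rch}_{ij} + \gamma_2 \mathrm{rca}_{ij}\right)}{1 + \exp\left(\delta_j + \phi(Q_{A,i} - Q_{H,i}) + \kappa_1 \mathrm{ahead}_{ij} + \kappa_2 \mathrm{behind}_{ij} + \gamma_1 \mathrm{rch}_{ij} + \gamma_2 \mathrm{rca}_{ij}\right)}, \tag{2.10}$$

where $\mathrm{rch}_{ij}$ ($\mathrm{rca}_{ij}$) is a dummy variable which takes the value one if the home (away) team has received a red card in one of the intervals before interval $j$ in match $i$.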
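The score-state and red-card dummies are all defined relative to the start of an interval, which can be made concrete with a short Python sketch. The function name and the input format (lists of zero-based interval indices per event type) are assumptions made for illustration.

```python
def build_covariates(home_goals, away_goals, home_reds, away_reds, k):
    """Per-interval dummies for one match.

    Each of the first four arguments is a list of zero-based interval indices
    at which the event occurred; k is the number of intervals.
    """
    rows = []
    for j in range(k):
        # Score at the start of interval j: only goals in earlier intervals count.
        h_before = sum(1 for g in home_goals if g < j)
        a_before = sum(1 for g in away_goals if g < j)
        rows.append({
            "home_ahead": int(h_before > a_before),
            "home_behind": int(h_before < a_before),
            # Red-card dummies switch on from the interval after the card.
            "rch": int(any(r < j for r in home_reds)),
            "rca": int(any(r < j for r in away_reds)),
            "h": int(j in home_goals),
            "a": int(j in away_goals),
        })
    return rows


# Home goal in interval 1, away red card in interval 2, four intervals in total.
rows = build_covariates(home_goals=[1], away_goals=[], home_reds=[], away_reds=[2], k=4)
for j, row in enumerate(rows):
    print(j, row)
```

A goal or red card therefore never affects the covariates of the interval in which it occurs, only those of later intervals.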

2.4.5 Model V, comparing countries

Our datasets, which we describe extensively in the next chapter, contain matches of football clubs from six large European footballing countries. We would like to investigate whether the scoring probabilities differ between these countries. Certain countries are famous for their way of playing. The Italians are known for their strong defensive skills, while the Dutch are known for always trying to play attacking, attractive football. By extending model IV with country dummies, we can measure whether these differences are significant in the club football of these particular countries. We take Spain as the reference country and introduce dummies for the Netherlands, England, Germany, Italy and France. These dummies take the value one if a match is played in that particular country and zero otherwise.


Chapter 3

Data

We have collected data from football matches played in the first and second highest divisions in Spain, France, the Netherlands, England, Germany and Italy in the seasons 2013/2014 and 2014/2015. This dataset contains 9,402 matches. In our analysis, we only take into account the matches played in the regular season, which means that play-off matches (12 in Spain, 12 in the Netherlands, 20 in Italy and 10 in England) are deleted from the dataset. In addition, matches that contain intervals in which one of the two teams scores more than one goal are deleted from the data. This happened in 1,829 matches when using 15-minute intervals and in 456 matches when using 5-minute intervals. Finally, matches with an unusual amount of stoppage time (more than 10 minutes in the first half, or more than 15 minutes in the second half) and matches that were ended before the full 90 minutes were played, 232 matches in total, are omitted from the data. Deleting all these matches leads to datasets consisting of 7,335 (15-minute) and 8,673 (5-minute) matches played in two different seasons, in which each match consists of 90 minutes of playing time, divided into two halves, plus a few additional minutes (stoppage time) at the end of each half. For these matches we know the scoring time of each goal, accurate to the minute. Table 3.1 displays the number of matches from each league in our dataset per season.
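The exclusion rules described above can be sketched as a simple filter. This is an illustration under assumptions: the record keys (`playoff`, `abandoned`, `stoppage_first`, `stoppage_second`, `home_goal_intervals`, `away_goal_intervals`) are hypothetical names, as the thesis does not describe the layout of its raw data.

```python
from collections import Counter


def has_multiple_goals(goal_intervals):
    """True if one team scored more than once in the same interval."""
    return any(count > 1 for count in Counter(goal_intervals).values())


def keep_match(match):
    """Apply the exclusion rules of this chapter to one match record."""
    if match["playoff"] or match["abandoned"]:
        return False
    # Unusual stoppage time: more than 10 (first half) or 15 (second half) minutes.
    if match["stoppage_first"] > 10 or match["stoppage_second"] > 15:
        return False
    if has_multiple_goals(match["home_goal_intervals"]):
        return False
    if has_multiple_goals(match["away_goal_intervals"]):
        return False
    return True


# Hypothetical records: the second one has two home goals in interval 3.
sample = [
    {"playoff": False, "abandoned": False, "stoppage_first": 2, "stoppage_second": 5,
     "home_goal_intervals": [1, 6], "away_goal_intervals": [4]},
    {"playoff": False, "abandoned": False, "stoppage_first": 1, "stoppage_second": 4,
     "home_goal_intervals": [3, 3], "away_goal_intervals": []},
]
kept = [m for m in sample if keep_match(m)]
print(len(kept))
```

Because the goal intervals depend on the interval length, the same match can pass the filter with 5-minute intervals and fail it with 15-minute intervals, which is why the two datasets differ in size.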

Furthermore, we use the Euro Club Index¹ as a measure of the playing strength of the teams. Since the ECI value of each team is known at the start of each match, we know the strength of the teams at the moment they actually competed. Table 3.2 shows descriptive statistics of the home team ECI values in the different leagues, per season. From Table 3.2 we can see that in both seasons, Spain has the strongest league in terms of average home ECI values, just ahead of the Premier League and the 1. Bundesliga. The Jupiler League, the second highest league in the Netherlands, is clearly the weakest in our dataset. Furthermore, the table shows that almost all leagues have slightly improved over the two-year period, except for the Eredivisie in the Netherlands, the Serie B in Italy and the Premier League in England.

¹ The Euro Club Index (ECI) values are published on http://www.euroclubindex.com and represent the relative playing strength of all European football teams playing in the highest divisions. The index is developed by Hypercube Business Innovation and Infostrada Sports. The methodology page on the above website describes how the values are constructed.


Table 3.1: Number of matches per season

5-minute intervals
Country       League             ’13/14   ’14/15   Total
England       Premier League        356      362     718
England       Championship          526      513    1039
France        Ligue 1               368      363     731
France        Ligue 2               348      358     706
Germany       1. Bundesliga         286      287     573
Germany       2. Bundesliga         290      291     581
Italy         Serie A               366      367     733
Italy         Serie B               439      425     864
Netherlands   Eredivisie            285      286     571
Netherlands   Jupiler League        323      323     646
Spain         Primera Division      359      363     722
Spain         Liga Adelante         375      414     789
Total         -                    4321     4352    8673

15-minute intervals
Country       League             ’13/14   ’14/15   Total
England       Premier League        292      318     610
England       Championship          436      439     875
France        Ligue 1               326      313     639
France        Ligue 2               303      313     616
Germany       1. Bundesliga         228      237     465
Germany       2. Bundesliga         245      248     493
Italy         Serie A               304      306     610
Italy         Serie B               394      382     776
Netherlands   Eredivisie            226      231     457
Netherlands   Jupiler League        257      257     514
Spain         Primera Division      285      303     588
Spain         Liga Adelante         331      361     692
Total         -                    3627     3708    7335

Team strength is the only exogenous factor we take into account in our model. All the other variables depend on the actual course of the match. Figure 3.1 shows the distribution of the goals over the different time intervals. We see that for both home and away goals, the number of goals per interval increases as the match progresses. Moreover, we see a drop in the number of scored goals in the last interval of the first and second half. These intervals correspond to the injury time, as discussed before. Finally, the figures show that the number of away goals is smaller than the number of home goals in each interval.

[Figure 3.1: Distribution of goals over the time intervals. (a) 5-minute intervals, (b) 15-minute intervals]

Table 3.2: ECI values of home playing teams per season

2013/2014
Country       League             mean    std    min     max
England       Premier League     2,635   578    1,690   3,757
England       Championship       1,722   229    1,236   2,277
France        Ligue 1            2,268   395    1,605   3,611
France        Ligue 2            1,559   250      921   2,085
Germany       1. Bundesliga      2,664   458    1,756   4,210
Germany       2. Bundesliga      1,679   232    1,122   2,234
Italy         Serie A            2,469   420    1,692   3,595
Italy         Serie B            1,477   261      973   2,181
Netherlands   Eredivisie         2,030   374    1,338   2,960
Netherlands   Jupiler League     1,020   281      583   1,542
Spain         Liga BBVA          2,697   587    2,090   4,467
Spain         Liga Adelante      1,814   230    1,281   2,391

2014/2015
Country       League             mean    std    min     max
England       Premier League     2,628   559    1,789   3,754
England       Championship       1,727   205    1,160   2,163
France        Ligue 1            2,301   438    1,548   3,627
France        Ligue 2            1,562   220      957   1,996
Germany       1. Bundesliga      2,699   455    1,880   4,170
Germany       2. Bundesliga      1,709   217    1,263   2,282
Italy         Serie A            2,484   446    1,678   3,744
Italy         Serie B            1,458   230      986   2,152
Netherlands   Eredivisie         1,947   406    1,201   2,821
Netherlands   Jupiler League     1,058   314      522   1,946
Spain         Primera Division   2,726   676    1,772   4,540
Spain         Liga Adelante      1,839   270    1,283   2,450

Clearly, the number of goals scored during the injury time intervals is smaller than during the regular intervals. In the 15-minute model this difference is substantially larger, since the difference in length between the intervals is much larger. Table 3.3 shows some basic statistics on the length of the stoppage time intervals. We see that the average length of stoppage time at the end of the second half is almost 5 minutes, while the average stoppage time at the end of the first half is just over 2 minutes. Stoppage time is added for each substitution that takes place. Since substitutions occur far more often in the second half, the second half stoppage time is considerably longer.

The distribution of goals over all intervals eventually determines match results, as presented in Table 3.4. We see that almost 75 percent of all matches result in a final score in which neither the home nor the away team scores more than twice. If we considered all matches played in the highest leagues (before omitting some of the matches, as described in the first part of this chapter), this percentage would still be above 70. This tells us that in fewer than 30 percent of all matches played, one of the teams scores three or more goals.

Table 3.3: Stoppage time in minutes

5-minute intervals
Interval      mean    std     min   max
First half    2.277   1.439   0     10
Second half   4.719   1.861   0     15

15-minute intervals
Interval      mean    std     min   max
First half    2.282   1.438   0     10
Second half   4.808   1.818   0     15

Table 3.5 shows the number and percentage of home wins, draws and away wins per country. The 15-minute dataset contains relatively more draws and fewer home or away wins than the 5-minute dataset. This is due to the fact that we exclude matches in which one of the teams scores more than one goal in one of the time intervals. The table shows that over 40% of the matches end in a home win, while a draw and an away win are more or less equally likely at around 30% each. Furthermore, we see that among these six countries, away wins occur most often in England. In France, away wins occur less often than the average over the six countries, by more than three percentage points.

Finally, Figure 3.2 shows the distribution of red cards over the intervals in our two datasets. The figure shows that away teams receive far more red cards than home teams. Furthermore, the frequency of red cards increases from interval to interval. In the 15-minute dataset, with 7,335 matches, home (away) teams are given 752 (1,040) red cards in total. In the 5-minute dataset, with 8,673 matches, these numbers are 898 red cards for the home team and 1,240 for the away team.

[Figure 3.2: Distribution of red cards over the time intervals. (a) 5-minute dataset, (b) 15-minute dataset]


Table 3.4: Match result frequencies (5-minute dataset)

Result   Frequency   Percentage   Cum. percentage
1-1      1110        12.80        12.80
1-0       927        10.69        23.49
0-0       745         8.59        32.08
2-1       741         8.54        40.62
0-1       727         8.38        49.00
2-0       712         8.21        57.21
1-2       568         6.55        63.76
2-2       467         5.38        69.14
0-2       379         4.37        73.51
3-0       378         4.36        77.87
3-1       357         4.12        81.99
1-3       207         2.39        84.38
3-2       197         2.27        86.65
0-3       173         1.99        88.64
2-3       137         1.58        90.22
4-0       133         1.53        91.75
4-1       125         1.44        93.19
3-3        93         1.07        94.26
1-4        78         0.90        95.16
4-2        68         0.78        95.94
2-4        53         0.61        96.55
0-4        51         0.59        97.14
5-1        37         0.43        97.57
5-0        36         0.42        97.99
4-3        30         0.36        98.35
1-5        20         0.23        98.58
5-2        18         0.21        98.79
0-5        14         0.16        98.95
3-4        12         0.14        99.09
6-0        12         0.14        99.23
6-1        11         0.13        99.36
2-5        10         0.12        99.48
5-3         6         0.07        99.55
6-2         6         0.07        99.62
7-0         5         0.06        99.68
4-4         4         0.05        99.73
4-5         4         0.05        99.78
1-6         3         0.03        99.81
2-6         3         0.03        99.84
3-5         3         0.03        99.87
0-6         2         0.02        99.89
0-7         2         0.02        99.91
6-3         2         0.02        99.93
0-8         1         0.01        99.94
1-8         1         0.01        99.95
2-7         1         0.01        99.96
2-8         1         0.01        99.97
3-6         1         0.01        99.98
6-6         1         0.01        99.99
7-1         1         0.01       100.00


Table 3.5: Match results

5-minute
Country       Home win   Draw   Away win   Home win %   Draw %   Away win %
England       753        463    541        42.9%        26.4%    30.7%
France        645        432    360        44.9%        30.1%    25.0%
Germany       502        322    330        43.5%        27.9%    28.6%
Italy         685        493    419        42.9%        30.9%    26.2%
Netherlands   549        303    365        45.1%        24.9%    30.0%
Spain         668        407    436        44.2%        26.9%    28.9%
Total         3802       2420   2451       43.8%        27.9%    28.3%

15-minute
Country       Home win   Draw   Away win   Home win %   Draw %   Away win %
England       609        435    441        41.0%        29.3%    29.7%
France        550        410    295        43.8%        32.7%    23.5%
Germany       395        301    262        41.2%        31.4%    27.3%
Italy         564        467    355        40.7%        33.7%    25.6%
Netherlands   410        275    286        42.2%        28.3%    29.5%
Spain         526        394    360        41.1%        30.8%    28.1%
Total         3054       2282   1999       41.6%        31.1%    27.3%


Chapter 4

Results

4.1 Model I, the basics

Table 4.1 shows the estimation results of the most basic model, in which all parameters are assumed to have a constant effect during the entire match. The left column shows the parameters, the middle column the estimated effects of these parameters for the home team and the right column the estimated effects for the away team. This means that for the parameter ahead, the middle column shows the estimated effect of the home team being ahead in the score on the probability of a home goal and the right column the estimated effect of the away team being ahead in the score on the probability of an away goal.

Table 4.1: Model I estimation results

5-minute
Parameter           Home coefficient    Away coefficient
Constant            -2.636* (0.014)     -2.961* (0.016)
QH−QA / QA−QH        0.413* (0.017)      0.410* (0.019)
Ahead                0.089* (0.021)      0.178* (0.027)
Behind               0.139* (0.025)      0.207* (0.025)

15-minute
Parameter           Home coefficient    Away coefficient
Constant            -1.691* (0.016)     -2.019* (0.017)
QH−QA / QA−QH        0.412* (0.021)      0.381* (0.023)
Ahead               -0.054* (0.027)     -0.013  (0.034)
Behind               0.002  (0.031)      0.089* (0.030)

*Significant at 5%; standard errors in parentheses.


The results show that for both the 5-minute and the 15-minute model the constant for the home team is higher than for the away team. This indicates that in each interval, the baseline scoring probability of the home team is higher than that of the away team. The most common explanation for this is home advantage, as discussed in the introduction. To formally test whether the home and away team constants differ significantly we use a likelihood ratio test, as described in Cameron and Trivedi (2005). To be able to perform this test, we need to estimate the model again, but now with the restriction that the home team constant is equal to the away team constant. The likelihood ratio test tells us whether the model fit decreases significantly when imposing this restriction. The results in Table 4.2 confirm that for both the 5-minute and the 15-minute model, the model fit decreases significantly when imposing the restriction. This means that the baseline scoring probability of the home team differs significantly from the baseline scoring probability of the away team.

Table 4.2: Likelihood ratio test results

Model       Test statistic   DoF   Critical value*   p-value
5-minute    251.15           1     3.841             0.000
15-minute   198.92           1     3.841             0.000

*Critical value based on a 5% significance level.
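The mechanics of the likelihood ratio test with one restriction can be sketched as follows. The statistic is twice the difference between the unrestricted and restricted log-likelihoods and is compared with a chi-square distribution with one degree of freedom. The two log-likelihood values in the usage line are hypothetical, chosen only so that the statistic equals the reported 251.15; the thesis reports the statistic but not the underlying log-likelihoods.

```python
import math


def chi2_sf_1dof(x):
    """Upper-tail probability of a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def likelihood_ratio_test(loglik_unrestricted, loglik_restricted, critical=3.841):
    """Return the LR statistic, its p-value and the 5% decision for one restriction."""
    stat = 2.0 * (loglik_unrestricted - loglik_restricted)
    return stat, chi2_sf_1dof(stat), stat > critical


# Hypothetical log-likelihoods (not from the thesis) that give a statistic of 251.15.
stat, p_value, reject = likelihood_ratio_test(-50000.0, -50125.575)
print(round(stat, 2), p_value, reject)
```

A statistic this far above the critical value of 3.841 gives a p-value that rounds to 0.000, as in Table 4.2.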

Furthermore, the results show that the larger the difference between the home and away team strengths, if positive, the higher the probability of a home goal, and vice versa for an away goal. Translated to actual probabilities, this basic model predicts an increase of 0.3 percentage points in the home goal probability for each interval if $Q_H - Q_A$ increases by 100 points.

Furthermore, the 5-minute model shows that the effects of being ahead and behind on the probability of a goal are small, but positive. This holds for both the home and the away team. However, the 15-minute model shows different results. Firstly, not all parameters are significant. The home team being behind in the score or the away team being ahead in the score does not lead to different scoring probabilities for the teams. Secondly, the effect of being ahead as a home team has a negative coefficient in the 15-minute model, while the coefficient is positive in the 5-minute model.

Two factors play a role in explaining these differences between the 5-minute and 15-minute model. Firstly, the number of intervals in which one of the teams leads is much smaller in the 15-minute model. Imagine the home team scoring in the 18th minute and the away team scoring in the 29th. With the 5-minute model, the effect of the home team leading can then be measured in the intervals 21-25 and 26-30, while in the 15-minute model this is not possible. This makes it harder for the 15-minute model to identify the effects of the current score parameters. Secondly, the 5-minute dataset contains over a thousand matches more than the 15-minute dataset. This could lead to different results. We further investigate this second factor by estimating the 5-minute model on exactly the same 7,335 matches that occur in the 15-minute dataset. The results of this estimation are in line with the results of the 15-minute model, since they now show a significant negative effect of being ahead in the score on the scoring probability of the home team. Apparently, the matches in the 5-minute dataset which we excluded from the 15-minute dataset cause the effect of being ahead in the score to differ.
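The example with goals in the 18th and 29th minute can be traced with a few lines of code. This is only a sketch under simplifying assumptions: the score state of an interval is taken to be the score at the start of that interval, injury time intervals are ignored, and the function name `leading_intervals` is hypothetical.

```python
def leading_intervals(goals, interval_length, match_length=90):
    """Return (first_minute, last_minute) of intervals that start with the home team ahead.

    goals: list of (minute, team) pairs, with team "H" (home) or "A" (away).
    """
    result = []
    for start in range(1, match_length + 1, interval_length):
        # Score at the start of the interval: goals scored strictly before `start`.
        home = sum(1 for minute, team in goals if minute < start and team == "H")
        away = sum(1 for minute, team in goals if minute < start and team == "A")
        if home > away:
            result.append((start, start + interval_length - 1))
    return result

goals = [(18, "H"), (29, "A")]            # home scores in minute 18, away in minute 29
five_minute = leading_intervals(goals, 5)
fifteen_minute = leading_intervals(goals, 15)
```

With 5-minute intervals the home lead is visible at the start of two intervals, whereas with 15-minute intervals both goals fall in the interval 16-30 and the lead is never observed at the start of an interval.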

So far we only described the estimated coefficients. These coefficients translate into actual probabilities by plugging the estimation results into the logistic specification. If we consider a match between Arsenal FC (ECI=3340) and Manchester United (ECI=3090), with Arsenal playing at home, we obtain the probabilities of scoring in each interval shown in table 4.3.

Table 4.3: Model I probabilities

Current score        P(Home goal)   P(Away goal)

5-minute model
Equal                0.074          0.044
Home team leading    0.080          0.054
Away team leading    0.084          0.052

15-minute model
Equal                0.170          0.108
Home team leading    0.162          0.116
Away team leading    0.170          0.108

As mentioned, this first model is very basic. We assume that the baseline scoring intensity does not change from interval to interval, so the model is not suitable for measuring the time effect, which is an important within-match factor according to previous research. However, it helps us to get a feel for the numbers and probabilities. Moreover, the models give us an indication of the importance of the current score effects in a match. In the coming sections we go into more detail on these time and current score effects.

4.2 Model II, time effect

This model extends model I by allowing the baseline scoring probability of the teams to vary between intervals. This means that we make the constant from model I interval specific as in equations 2.1 and 2.2. This model is therefore able to measure an increase or decrease in the scoring probability of the home and away team over time. The effects of leading or trailing the score are still assumed to be constant over the match.

Table 4.4 shows the estimation results of the 15-minute model. Here, we see that as the match progresses the baseline scoring probability for both the home and away team increases. For the home team, the probability to score between minutes one and fifteen is 19 percent, increasing to 26 percent in the interval 76-90. For the away team, the scoring probability increases from 15 percent in the first interval to 20 percent in the 7th interval. These percentages are calculated from the estimation results at equal score, again considering a match between a home team with 3340 ECI points and an away team with 3090 ECI points. Furthermore, the 15-minute model does not find evidence for significant effects of being ahead in the score for either the home or away team. Being behind in the score leads to a slight increase in the probability to score a goal.

Table 4.4: Model II 15-minutes: estimation results

Variable             Interval   Home               Away
Constant1            1-15       -1.670* (0.032)    -1.989* (0.036)
Constant2            16-30      -1.521* (0.031)    -1.858* (0.035)
Constant3            31-45      -1.461* (0.032)    -1.835* (0.036)
Constant4            45+        -3.808* (0.080)    -4.292* (0.098)
Constant5            46-60      -1.471* (0.033)    -1.777* (0.036)
Constant6            61-75      -1.372* (0.030)    -1.681* (0.036)
Constant7            76-90      -1.322* (0.033)    -1.713* (0.037)
Constant8            90+        -2.652* (0.050)    -2.890* (0.054)
QH − QA / QA − QH    -           0.424* (0.022)     0.393* (0.024)
Ahead                -           0.020  (0.030)     0.051  (0.037)
Behind               -           0.069* (0.034)     0.162* (0.033)

*significant at 5%, standard errors between parentheses.
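To illustrate how the interval-specific constants translate into probabilities, the sketch below applies a logistic transformation to the home team constants of table 4.4. It assumes a binary logit form, p = 1 / (1 + exp(-eta)), which may differ in detail from equations 2.1 and 2.2, and it sets the team strength difference to zero and the score to level, because the scaling of the strength variable is not restated in this chapter. The resulting values are therefore lower than the 19 and 26 percent quoted above, which include the strength effect.

```python
import math

def logistic(eta):
    """Map a linear predictor to a probability (binary logit assumption)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Home team constants for the regular 15-minute intervals (table 4.4).
home_constants = {
    "1-15": -1.670,
    "16-30": -1.521,
    "31-45": -1.461,
    "46-60": -1.471,
    "61-75": -1.372,
    "76-90": -1.322,
}

# Baseline probability of a home goal per interval for equally strong teams at level score.
baseline_home = {interval: logistic(c) for interval, c in home_constants.items()}

for interval, p in baseline_home.items():
    print(f"{interval}: {p:.3f}")
```

The upward trend in the constants carries over directly to the probabilities, since the logistic function is increasing in its argument.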

The 5-minute model shows similar results in the sense that the baseline scoring probability increases as a match progresses. However, since each interval now only contains five minutes of playing time, it can reveal much more detail in this increasing probability. Figure 4.1 shows how the scoring probabilities evolve from interval to interval. The 10th and 20th intervals are deliberately left out since they represent the injury time intervals. This figure gives good insight into how the estimated probabilities increase as the match progresses. This increase is slightly stronger and more gradual for the home team than for the away team. Moreover, the scoring probability for the away team seems to decrease slightly between intervals 14 and 18 (minute 65 to minute 85). The scoring probability in the last interval for the away team is higher than in all other intervals. Furthermore, the scoring probability at the end of the first half is higher than at the start of the second half.

Figure 4.1: Model II, 5-minutes: scoring probability per interval

Table 4.5: Model II 5-minutes: estimation results

Variable             Home               Away
QH − QA / QA − QH    0.418* (0.017)     0.409* (0.019)
Ahead                0.015  (0.023)     0.123* (0.029)
Behind               0.070* (0.026)     0.150* (0.027)

*significant at 5%, standard errors between parentheses.

Figure 4.1 shows how the time-varying constants translate into scoring probabilities per interval. The effects of the other parameters estimated by the 5-minute model are shown in table 4.5. The estimated coefficients of being behind in the score are very similar to the 15-minute model. However, where the 15-minute model estimates do not show significant effects of being ahead in the score, the 5-minute model does. We discussed the possible reasons for this phenomenon in the model I section. The effect of the home team leading on the probability of another home goal is not significant when we allow the baseline scoring probability to vary. Furthermore, the effects of these current score parameters on the probability of scoring are smaller than in model I. It seems that the current score effects are overestimated in model I because of the increasing baseline scoring probability. Since the dummy variables ahead and behind take the value of one more often in the latter part of matches, it could be that they absorbed some of the time effect. The extension of model I to model II therefore seems essential in order to obtain reliable probability estimates.

Up to this point, we considered a match in which the home team (Arsenal FC) is stronger than the away team (Manchester United). Let us consider what happens if the same teams play in Manchester instead of London. Figure 4.2 shows the scoring probabilities of the home and away team when considering Manchester United (3090) playing Arsenal FC (3340) at home. Now Manchester United has a higher probability to score during the match, despite having a lower ECI value, because of the home advantage. Our results suggest that the away team should have 400 ECI points more than the home team in order to have comparable scoring probabilities.

Figure 4.2: Model II, 5-minutes: scoring probability per interval

We use the likelihood ratio test to formally test whether model II is an improvement in terms of model fit on model I. As shown in table 4.6, for both the 5-minute and 15-minute model, model II fits the data better than model I which tells us that allowing the baseline scoring probability to change from interval to interval leads to a better model fit.

Table 4.6: Likelihood ratio test results

Model       Test statistic   DoF   Critical value*   p-value
5-minute    1,075            16    26.296            0.000
15-minute   4,453            16    26.296            0.000

*Critical value based on a 5% significance level.

4.3 Model III, introducing interval specific current score effects

In this model we allow the baseline scoring probability and the current score effects to be interval specific, which leads to a lot of parameters to be estimated. Unfortunately, due to computational limitations, we cannot obtain estimates for the 5-minute model. For the 15-minute model, with only eight intervals per match, it is possible to come up with estimates which we will discuss in this section.

Table 4.7: Model III, 15-minutes: estimation results

Variable             Interval   Home               Away
Constant1            1-15       -1.670* (0.032)    -1.988* (0.036)
Constant2            16-30      -1.512* (0.035)    -1.848* (0.039)
Constant3            31-45      -1.427* (0.039)    -1.799* (0.044)
Constant4            45+        -3.722* (0.111)    -4.324* (0.147)
Constant5            46-60      -1.527* (0.045)    -1.791* (0.049)
Constant6            61-75      -1.350* (0.046)    -1.700* (0.051)
Constant7            76-90      -1.307* (0.049)    -1.706* (0.055)
Constant8            90+        -2.838* (0.092)    -2.998* (0.97)
QH − QA / QA − QH    -           0.424* (0.022)     0.392* (0.024)
Ahead2               16-30      -0.018  (0.086)    -0.061  (0.113)
Ahead3               31-45      -0.079  (0.072)    -0.120  (0.093)
Ahead4               45+        -0.261  (0.187)     0.186  (0.244)
Ahead5               46-60       0.155* (0.068)     0.077  (0.084)
Ahead6               61-75       0.009  (0.066)     0.096  (0.080)
Ahead7               76-90       0.001  (0.067)     0.018  (0.082)
Ahead8               90+         0.182  (0.116)     0.373* (0.130)
Behind2              16-30      -0.014  (0.102)     0.174  (0.094)
Behind3              31-45       0.016  (0.082)     0.137  (0.079)
Behind4              45+         0.061  (0.198)     0.163  (0.227)
Behind5              46-60       0.121  (0.078)     0.190* (0.074)
Behind6              61-75      -0.005  (0.075)     0.182* (0.073)
Behind7              76-90       0.044  (0.074)     0.167* (0.074)
Behind8              90+         0.470* (0.123)     0.184  (0.125)

*significant at 5%, standard errors between parentheses.

Table 4.7 contains the estimation results of this model, which differs from model II in the way the current score parameters are included. By comparing tables 4.7 and 4.4 we can conclude that the effects of the strengths of the teams are comparable. Furthermore, in model III the increasing baseline scoring intensity is present as expected. When estimating model II with 15-minute intervals under the assumption that current score effects are constant over the whole match, we only find evidence for the effect of being behind in the score. We estimate model III to investigate whether there are interval specific current score effects and we see some interesting results.

Firstly, if the home team is leading the score, it only affects the probability of scoring another goal in the first interval after the break (minutes 46-60), since only this coefficient (0.155) is significant. A possible explanation for this phenomenon could be the positive vibe within a team after reaching the break while being ahead in the score, which allows them to start very positively after the break, while the opponent is still thinking about what went wrong in the first half. When looking at the effect of the away team leading the score we have to look at the last column of table 4.7, and we see that only the coefficient representing the injury time of the second half is significant. Apparently, for the away team this possible 'positivity' reason does not hold up. It could be that the away team attaches more importance to holding the lead than to extending it.

Secondly, the positive significant effect of leading the score on the probability of scoring another goal as an away team during the last interval of the match can be explained by the fact that the home team is doing everything it can to level the score. This can be expected since a home loss is never a satisfying result. This attacking playing style of the home team leads to a weaker defence and gives the away team the opportunity to counter attack. Still focussing on this last interval of the match, we see that the effect of the home team being behind in the score at the start of the interval is positive and significant. These results therefore indicate that if the home team is behind in the score at the start of the last interval, the probabilities of both the home and away team scoring in this last interval are higher. If the home team is leading, however, we do not find evidence for an increase in the probabilities.

Lastly, the results show that if the away team is behind in the score in the second half, this leads to an increased probability to score. Figure 4.3 shows how the numbers translate into actual scoring probabilities per interval considering the current score effects.

Figure 4.3: Current score effects on scoring probability per interval, (a) home team and (b) away team

Of course we would like to know whether this extension of model II formally improves the fit to the data. Again, we use a likelihood ratio test to assess this. Table 4.8 shows the results of this test. With a p-value of 0.0628, the likelihood ratio test does not find evidence that the extension leads to a better model fit, using a 5% significance level.

Table 4.8: Likelihood ratio test results

Model       Test statistic   DoF   Critical value*   p-value
15-minute   35.39            24    36.42             0.0628

*Critical value based on a 5% significance level.

4.4 Model IV, the effect of a red card

In this model we investigate the effect of red cards on the home and away teams’ scoring probabilities. We assume that the effect of a red card is constant over all intervals for two reasons. First, if we allowed the effects to be interval specific, the amount of data on red cards per interval would be very small, as shown in figure 3.2. Secondly, since it would increase the number of parameters dramatically, we would not be able to obtain estimates for the 5-minute model. As discussed in chapter 2, model IV has interval specific baseline scoring probabilities and the other effects are assumed to be constant over the intervals. We exclude these interval specific constants from the tables, both because they are very similar to the estimates of model II and to improve the readability of the tables.

Table 4.9: Model IV estimation results

Variable             Home               Away

5-minute model
QH − QA / QA − QH     0.422* (0.017)     0.413* (0.019)
Ahead                -0.004  (0.023)     0.099* (0.029)
Behind                0.088* (0.026)     0.170* (0.027)
Redcard team         -0.434* (0.064)    -0.514* (0.063)
Redcard opponent      0.427* (0.039)     0.523* (0.049)

15-minute model
QH − QA / QA − QH     0.427* (0.022)     0.397* (0.024)
Ahead                 0.005  (0.030)     0.030  (0.037)
Behind                0.085* (0.034)     0.180* (0.033)
Redcard team         -0.363* (0.081)    -0.524* (0.081)
Redcard opponent      0.398* (0.056)     0.500* (0.067)

*significant at 5%, standard errors between parentheses.

Table 4.9 shows the estimation results for both the 5-minute and 15-minute model. The estimates of the effects of teams’ strengths and the current score parameters are very similar to the estimates from model II, as expected. Furthermore, the directions of the effects of a red card are in line with what we experience in practice and what is found in previous research: receiving a red card decreases the probability to score a goal and increases the probability to concede a goal. The 5-minute model predicts a decrease of 2 to 3 percentage points (relatively around 33%) in the probability of a home goal and an increase of 2 to 3.5 percentage points (relatively around 62%) in the probability of an away goal when the home team receives a red card. Conversely, it predicts an increase of 3 to 4 percentage points (relatively around 46%) in the probability of a home goal and a decrease of 1 to 2 percentage points (relatively around 39%) in the probability of an away goal when the away team receives a red card. Lastly, the results show some remarkable similarity between home and away team coefficients for team strength and the red card parameters. One could consider estimating the model with the restriction that these coefficients are equal, to reduce the number of parameters. Note that since the transformation to actual probabilities is not linear, these similar coefficients do not lead to similar relative changes in probabilities, as described above.
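The non-linearity mentioned above can be made concrete with a small calculation. The sketch below assumes a binary logit form, so that a coefficient shifts the log-odds of scoring; the baseline probabilities are illustrative values (0.074 is the model I home probability at equal score from table 4.3, 0.20 is a hypothetical higher baseline), and the function names are not taken from the thesis.

```python
import math

def shift_probability(p, coefficient):
    """Apply a logit coefficient to a baseline probability p."""
    odds = p / (1.0 - p) * math.exp(coefficient)
    return odds / (1.0 + odds)

def relative_change(p, coefficient):
    """Relative change in probability caused by the coefficient."""
    return shift_probability(p, coefficient) / p - 1.0

red_card_own_team = -0.434  # home team coefficient, 5-minute model (table 4.9)

# The same coefficient applied to two different baseline probabilities.
change_low_baseline = relative_change(0.074, red_card_own_team)
change_high_baseline = relative_change(0.20, red_card_own_team)

print(f"baseline 0.074: {change_low_baseline:+.1%}")
print(f"baseline 0.200: {change_high_baseline:+.1%}")
```

The same coefficient produces a relative decrease of roughly a third at the low baseline and a somewhat smaller relative decrease at the higher baseline, which is why equal coefficients do not imply equal relative effects.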

Figures 4.4 and 4.5 show how the scoring probabilities of the teams change when a red card is given to either the home or away team in interval 12. These figures are again based on a match between teams with ECI values of 3340 and 3090 and show that red cards have a very big impact on the ability to score goals. A red card given to the home team causes the probability of scoring to drop to a level below the probability of conceding, making it much more difficult to reach a satisfying match result. Our model suggests that when a red card is given to the home team in interval 12, the expected number of home goals in the remaining part of the match decreases from 0.7 to 0.45. At the same time the expected number of away goals increases from 0.4 to 0.65 for the remaining minutes. These numbers are based on a current score in which the teams are level and would be slightly different if we considered a situation in which one of the teams is leading.

Figure 4.4: Model IV, 5-minutes: scoring probability per interval (home team red card in interval 12)
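Because each team scores at most one goal per interval in the dataset, the expected number of goals over the remaining intervals is simply the sum of the per-interval scoring probabilities. The sketch below illustrates this with hypothetical baseline probabilities for intervals 13 to 20 (they are not the estimates behind figure 4.4), a binary logit assumption, and a score that stays level, so the output only approximates the order of magnitude of the figures quoted above.

```python
import math

def shift_probability(p, coefficient):
    """Apply a logit coefficient to a baseline probability p."""
    odds = p / (1.0 - p) * math.exp(coefficient)
    return odds / (1.0 + odds)

# Hypothetical home scoring probabilities for intervals 13-19 and injury time (interval 20).
remaining_home = [0.08] * 7 + [0.04]

red_card_own_team = -0.434  # home team coefficient, 5-minute model (table 4.9)

# Expected goals = sum of per-interval probabilities (at most one goal per interval).
expected_before = sum(remaining_home)
expected_after = sum(shift_probability(p, red_card_own_team) for p in remaining_home)

print(f"expected home goals without red card: {expected_before:.2f}")
print(f"expected home goals with red card:    {expected_after:.2f}")
```

Under these assumed inputs the expected number of home goals falls by about a third, similar in proportion to the decrease from 0.7 to 0.45 reported by the model.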

Just like model III, model IV can be seen as an extension on model II. We again use the likelihood ratio test to formally test whether this extension improves the model fit. The results in table 4.10 show that the introduction of the red card effects clearly improves the model fit and therefore forms a useful extension.

Figure 4.5: Model IV, 5-minutes: scoring probability per interval (away team red card in interval 12)

Table 4.10: Likelihood ratio test results

Model       Test statistic   DoF   Critical value*   p-value
5-minute    319.04           4     9.488             0.000
15-minute   157.79           4     9.488             0.000

*Critical value based on a 5% significance level.

So far, we have assumed the effects of a red card to be constant over the match. We estimate the 15-minute model with interval-specific red card effects to investigate whether the size and significance of these effects change from interval to interval. Since the data are very sparse, the 15-minute model has difficulties identifying the effects in the first half. However, since the second half parameter estimates are significant, the model can tell us something about the size of the effect in different intervals. The results produce some evidence for the hypothesis that the effect of a red card becomes stronger as the match progresses. This holds both for a red card for the own team and for a red card for the opponent: the negative coefficients become more negative and the positive coefficients more positive. Intuitively, one could argue that during the later intervals of a match, the team playing with one player more puts more pressure on its opponent. Furthermore, playing with a player less is exhausting for the players still on the field and this fatigue can lead to a weaker defence, making it more likely to concede a goal. The estimation results of this 15-minute model with interval specific red card effects are shown in tables 4.11 (for the home team scoring probability) and 4.12 (for the away team scoring probability).

Table 4.11: Parameter estimates home team

Parameter    Interval   Estimate   Standard error
QH − QA      -           0.427*    0.022
Constant1    1-15       -1.671*    0.032
Constant2    16-30      -1.513*    0.035
Constant3    31-45      -1.434*    0.039
Constant4    45+        -3.746*    0.113
Constant5    46-60      -1.528*    0.045
Constant6    61-75      -1.343*    0.046
Constant7    76-90      -1.337*    0.050
Constant8    90+        -2.848*    0.094
Ahead2       16-30       0.016     0.087
Ahead3       31-45      -0.093     0.072
Ahead4       45+        -0.296     0.188
Ahead5       46-60       0.148*    0.068
Ahead6       61-75       0.003     0.066
Ahead7       76-90      -0.024     0.067
Ahead8       90+         0.146     0.116
Behind2      16-30      -0.015     0.102
Behind3      31-45       0.023     0.082
Behind4      45+         0.083     0.199
Behind5      46-60       0.127     0.078
Behind6      61-75       0.0088    0.076
Behind7      76-90       0.069     0.075
Behind8      90+         0.506*    0.124
Homered2     16-30       0.182     0.509
Homered3     31-45      -0.432     0.356
Homered4     45+        -1.330     1.005
Homered5     46-60      -0.289*    0.226
Homered6     61-75      -0.428*    0.178
Homered7     76-90      -0.474*    0.130
Homered8     90+        -1.041     0.186
Awayred2     16-30       0.257     0.466
Awayred3     31-45       0.856*    0.228
Awayred4     45+         1.071     0.315
Awayred5     46-60       0.253*    0.166
Awayred6     61-75       0.115*    0.131
Awayred7     76-90       0.541*    0.092
Awayred8     90+         0.349*    0.118

*significant at 5%.

4.5 Model V, comparing countries

In this section we compare the scoring probabilities within the, arguably, six biggest European football leagues. We therefore introduce dummy variables for England, France, the Netherlands, Germany and Italy and use Spain as a reference country. Table 4.14 shows the estimation results of this particular model. Again, the estimates for the interval-specific constants, which represent the baseline scoring probabilities, are excluded from the table to improve readability. Moreover, these estimates are similar to the estimates in models II and IV. The same holds for the estimates of the effects of red cards and current score parameters. In this section we describe some possible reasons for the results, focussing on the country dummy variables.

Table 4.12: Parameter estimates away team

Parameter    Interval   Estimate   Standard error
QA − QH      -           0.396*    0.024
Constant1    1-15       -1.989*    0.036
Constant2    16-30      -1.850*    0.039
Constant3    31-45      -1.804*    0.044
Constant4    45+        -4.326*    0.148
Constant5    46-60      -1.792*    0.049
Constant6    61-75      -1.694*    0.051
Constant7    76-90      -1.703*    0.056
Constant8    90+        -3.018*    0.100
Ahead2       16-30      -0.079     0.113
Ahead3       31-45      -0.131     0.039
Ahead4       45+         0.177     0.244
Ahead5       46-60       0.065     0.084
Ahead6       61-75       0.081     0.080
Ahead7       76-90      -0.019     0.082
Ahead8       90+         0.336*    0.130
Behind2      16-30       0.178     0.095
Behind3      31-45       0.148     0.079
Behind4      45+         0.172     0.227
Behind5      46-60       0.202*    0.075
Behind6      61-75       0.197*    0.073
Behind7      76-90       0.198*    0.075
Behind8      90+         0.223     0.126
Homered2     16-30       0.861     0.478
Homered3     31-45       0.823*    0.275
Homered4     45+         0.383     0.583
Homered5     46-60       0.469*    0.196
Homered6     61-75       0.314*    0.153
Homered7     76-90       0.556*    0.115
Homered8     90+         0.473*    0.135
Awayred2     16-30      -0.319     0.615
Awayred3     31-45      -0.693     0.395
Awayred4     45+        -0.376     0.715
Awayred5     46-60      -0.481*    0.227
Awayred6     61-75      -0.486*    0.170
Awayred7     76-90      -0.652*    0.142
Awayred8     90+        -0.343*    0.166

*significant at 5%.

Table 4.13: Likelihood ratio test results

Model       Test statistic   DoF   Critical value*   p-value
5-minute    159.95           10    18.307            0.000
15-minute   107.47           10    18.307            0.000

*Critical value based on a 5% significance level.

Table 4.14: Model V estimation results

Variable             Home               Away

5-minute model
QH − QA / QA − QH     0.425* (0.017)     0.417* (0.019)
Ahead                -0.009  (0.023)     0.090* (0.029)
Behind                0.082* (0.026)     0.165* (0.027)
Redcard team         -0.433* (0.064)    -0.505* (0.063)
Redcard opponent      0.433* (0.039)     0.533* (0.049)
France               -0.011  (0.033)    -0.030  (0.038)
Germany               0.064  (0.039)     0.154* (0.038)
England               0.029  (0.031)     0.108* (0.035)
Italy                 0.015  (0.032)     0.049  (0.036)
Netherlands           0.223* (0.033)     0.276* (0.037)

15-minute model
QH − QA / QA − QH     0.431* (0.022)     0.398* (0.024)
Ahead                 0.002  (0.030)     0.022  (0.037)
Behind                0.081* (0.034)     0.178* (0.033)
Redcard team         -0.326* (0.081)    -0.511* (0.081)
Redcard opponent      0.407* (0.056)     0.512* (0.068)
France                0.028  (0.040)    -0.051  (0.045)
Germany               0.080  (0.043)     0.120* (0.047)
England               0.044  (0.039)     0.102* (0.042)
Italy                 0.041  (0.039)     0.046  (0.043)
Netherlands           0.247* (0.042)     0.286* (0.046)

*significant at 5%, standard errors between parentheses.

Firstly, table 4.14 shows that for the home goal scoring probability, only one country behaves differently from the pack: the Netherlands. With a coefficient of 0.223 (estimated by the 5-minute model), the probability of scoring a goal as a home team in the Netherlands is, per interval, 1.3 to 1.8 percentage points higher than in the other countries. The same holds for away goals: compared to the reference country, the clubs in the Netherlands score more away goals. This is not surprising since Dutch people attach great value to attractive football with an attacking playing style. It seems that this playing style is adopted by the teams: the Dutch clubs score and concede more goals than the clubs in the other countries in our datasets.

Furthermore, the results show that in England and Germany the probability of an away goal is higher than in the reference country (Spain), while the probability of a home goal is not. It is more difficult to come up with an explanation for this phenomenon. Apparently, compared to the reference country, the clubs in Germany and England do not play differently in their home games, but do play differently in away games, leading to an increased probability of scoring during away games.

The introduction of the country dummies gives some nice insights into the way the clubs from different countries play. However, we would like to know if this formally improves the model fit. Again, we use the likelihood ratio test for this purpose. The results in table 4.13 show that the introduction of the dummies indeed improves the model fit. This shows that it is not only nice to know whether the clubs play differently, but that it is an important factor for the model fit as well.

4.6 5-minute vs. 15-minute models

In chapter 2 we already mentioned the trade-off between complexity and realism when choosing the length of the time intervals. In this section we formally test for models II, IV and V, whether the 5-minute model is able to fit the data better than the 15-minute model. We do not need to perform these tests for models I and III. In model I we do not include interval-specific parameters, so we do not impose more restrictions in the 15-minute model than in the 5-minute model. Moreover, we are only able to estimate model III in the 15-minute case, making it impossible to compare it to the 5-minute case.

We again use likelihood ratio tests to assess the model fit. To correctly perform these tests, we need to estimate the 15-minute models on the 5-minute dataset and impose the right restrictions. In practice this means that the constants of intervals one, two and three need to be equal, as well as the constants of intervals four, five and six, etc. The results of the likelihood ratio tests that we perform after estimating in the manner described above are shown in table 4.15.

The results of these tests confirm that the 5-minute model is able to fit the data better than the 15-minute model, which is in line with our expectation. Furthermore, the results of the performed tests are very similar since we impose the same restrictions in these three cases.

Table 4.15: Likelihood ratio test results

Model       Test statistic   DoF   Critical value*   p-value
Model II    78.947           24    36.415            0.000
Model IV    78.211           24    36.415            0.000
Model V     78.508           24    36.415            0.000

*Critical value based on a 5% significance level.
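The 24 degrees of freedom in table 4.15 can be reproduced by counting the equality restrictions that turn the 5-minute constants into 15-minute constants. The sketch below assumes that the 10th and 20th 5-minute intervals are the injury time intervals and that each maps to its own 15-minute injury time constant; the helper `block_of` is a hypothetical name used only for this illustration.

```python
from collections import defaultdict

def block_of(interval):
    """Map a 5-minute interval index (1..20) to its 15-minute block (1..8)."""
    half, position = divmod(interval - 1, 10)  # half: 0 or 1, position: 0..9 within the half
    if position == 9:
        # Injury time interval keeps its own constant (blocks 4 and 8).
        return 4 * half + 4
    return 4 * half + position // 3 + 1

# Group the twenty 5-minute intervals by the 15-minute block they belong to.
groups = defaultdict(list)
for interval in range(1, 21):
    groups[block_of(interval)].append(interval)

# A group of g constants forced to be equal imposes g - 1 restrictions.
restrictions_per_team = sum(len(members) - 1 for members in groups.values())
degrees_of_freedom = 2 * restrictions_per_team  # home and away constants

print(dict(groups))
print(degrees_of_freedom)
```

Six groups of three intervals give twelve restrictions per team, and since the restrictions apply to both the home and the away constants this yields the 24 degrees of freedom used in all three tests.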

Chapter 5

Conclusion

In this thesis we have estimated the probability to score a goal in football and quantified the effects of three important factors that influence this probability to score a goal: time, current score and red cards. Furthermore, we controlled for team strength and compared the scoring probabilities in different countries. Our approach is based on a discrete set-up in which we divided matches into time intervals. For each interval, we estimated the probability of a home goal and away goal separately and we were therefore able to capture home advantage as well. In this chapter we present the conclusions that follow from our estimation results.

Firstly, previous literature suggests that the probability to score gradually increases when a match progresses. The most common explanation for this is that the players get tired near the end of the match. Our results are in line with this research since the baseline scoring probability generally increases from interval to interval. Moreover, our results have shown that this increase is slightly stronger for the home team scoring probability than for the away team scoring probability. Furthermore, we have found that the scoring probability at the end of the first half is slightly higher than at the beginning of the second half for both teams.

Secondly, we conclude that the effect of the current score on the scoring probabilities is not consistent over all our models. However, we can conclude with certainty that being behind in the score leads to an increased probability to score. This holds for both the home and away team and is confirmed by all our models. A loss is never a satisfying result and teams are therefore adapting their playing style in this situation, leading to an increased scoring probability. Furthermore, we have found some evidence for the hypothesis that when the away team is leading, its probability to score another goal increases, which can be explained by the fact that the home team is likely to be taking risks in trying to level the score. By taking these risks, their defence becomes vulnerable, which leads to this increased probability to concede a goal.

Thirdly, both from experience and previous research it is clear that red cards have a big effect on the scoring probabilities. Our model predicts a relative decrease of around 30% in the probability of a home goal and a relative increase of around 60% in the probability of an away goal when the home team receives a red card. If the away team receives a red card, the probability of an away goal decreases by around 40% and the probability of a home goal increases by around 45%.

The last model we have estimated in this thesis investigates whether the clubs from different countries have different baseline scoring probabilities, using Spain as a reference country. The first conclusion we can draw is that in the Netherlands, clubs have a higher probability to score, both home and away. This can be explained by the fact that the Dutch people greatly value an attractive, attacking playing style. Furthermore, we can conclude that in England and Germany, the probability of an away goal is higher than in the reference country, while the probability of a home goal is not significantly different. Apparently, clubs from Germany and England play differently in their away games, but not in their home games, a phenomenon for which we do not have a clear explanation at the moment.

Finally, we would like to emphasize that the results produced by the 5-minute and the 15-minute model are, in general, very similar. Especially for the red card effects and the team strength effects, both models produce very similar results. This gives us confidence that our model set-up is able to correctly capture the effects that influence the scoring probabilities.

Chapter 6

Discussion

This final chapter gives us the opportunity to shed our light on the assumptions and limitations that have played a role developing our model.

Firstly, we like to address the issue of the length of the time intervals, as mentioned in section 2.3. As described, in an ideal world, the length of the time intervals should be taken as small as possible in order to come as close to reality as possible. However, we are not able to come up with estimates when using intervals shorter than five minutes and this forms a limitation of the model. A continuous approach using survival techniques could provide a solution. However, since goal scoring events can hardly be seen as independent events, it would add very much to complexity when one would still be interested in all goal scoring events within a match. Our approach in this manner is the use of both a 5-minute and 15-minute model, allowing us to asses the robustness of the estimation results. In this way, we are still able to use a simple, easy to explain, discrete model set-up.

Secondly, there are a few assumptions on the chosen parameters that could be investigated further. In our set-up the current score is either equal, behind or ahead, without differentiating by the number of goals a team is ahead or behind. Teams that are leading or trailing 4-0 are likely to play differently from teams that are leading or trailing 1-0, leading to different scoring probabilities. It would be interesting to see whether our model set-up can distinguish different effects if the current score parameters were split into more detailed variables.
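One way such a refinement could be coded is sketched below in Python. The three-category function mirrors the equal, behind and ahead states described above; the detailed version is a hypothetical extension, with an assumed cap of two goals beyond which all margins are pooled, and is not part of the models estimated in this thesis.

```python
def coarse_state(goal_difference):
    """Current score state as used in the set-up: equal, behind or ahead."""
    if goal_difference > 0:
        return "ahead"
    if goal_difference < 0:
        return "behind"
    return "equal"


def detailed_state(goal_difference, cap=2):
    """Hypothetical finer coding that also records the size of the margin.

    Margins of `cap` goals or more are pooled into one category so that
    rare scorelines do not each get their own sparsely observed dummy.
    """
    if goal_difference == 0:
        return "equal"
    side = "ahead" if goal_difference > 0 else "behind"
    margin = abs(goal_difference)
    if margin >= cap:
        return f"{side}_{cap}+"
    return f"{side}_{margin}"


# Goal difference from the perspective of one team (own goals minus opponent goals).
for diff in (-4, -1, 0, 1, 4):
    print(diff, coarse_state(diff), detailed_state(diff))
```

Each detailed category would then enter the model as its own dummy variable; the choice of the cap trades off detail against the number of observations per category.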

Moreover, we use the Euro Club Index as a measure of team strength. This index is carefully designed to represent a team's sporting qualities and we believe that it does so very well. However, it is limited in the sense that it rates a team's overall quality, without distinguishing between attacking and defensive abilities. It would be interesting to investigate the goal scoring probabilities if teams had not only an overall rating, but also separate ratings for attack and defence.

Lastly, we would like to come back to the fact that we exclude all matches in which two or more goals are scored by one team within a single time interval. We did this to keep things simple and to be able to track the score using only dummy variables. However, by doing this we introduce a slight selection bias in our data, since matches with many goals have a higher probability of containing two goals by one team within a time interval. This results in a dataset with a lower average number of goals per match than in reality (a decrease from 2.6 to 2.5). Table 3.5 shows that match results with up to twelve goals occur in our dataset, which indicates that this selection bias is not severe. While the expected number of goals predicted by the model will therefore be slightly lower than is realistic, we emphasize that this effect is small and does not interfere with the effects we estimate in this thesis.
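The exclusion rule and its effect on the average number of goals can be illustrated with a short Python sketch. The function names and the three example matches are invented for illustration and are not taken from our dataset; minutes are again assumed to run from 1 to 90, with later minutes assigned to the last interval.

```python
from collections import Counter


def has_multi_goal_interval(goal_minutes, length=5, match_minutes=90):
    """True if one team scores two or more goals within a single interval."""
    per_interval = Counter(
        (min(minute, match_minutes) - 1) // length for minute in goal_minutes
    )
    return any(count >= 2 for count in per_interval.values())


def keep_match(home_minutes, away_minutes, length=5):
    """A match is kept only if neither team has a multi-goal interval."""
    return not (
        has_multi_goal_interval(home_minutes, length)
        or has_multi_goal_interval(away_minutes, length)
    )


def mean_goals(matches):
    """Average total number of goals per match."""
    return sum(len(home) + len(away) for home, away in matches) / len(matches)


# Invented example: (home goal minutes, away goal minutes) per match.
matches = [
    ([10, 55], [80]),    # kept: no team scores twice in one 5-minute interval
    ([12, 14, 70], []),  # dropped: home goals in minutes 12 and 14 share an interval
    ([], []),            # kept: goalless match
]
kept = [m for m in matches if keep_match(*m)]
print(mean_goals(matches), mean_goals(kept))
```

Because matches with more goals are more likely to be dropped, the mean over the kept matches is lower than the mean over all matches, which is the direction of the bias described above.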

In short, our model is based on a few assumptions that we would prefer not to make, and it has its limitations. However, the simplicity of the model is also its strength: it can easily be explained and applied if the right data are at hand. In addition, it produces estimates that are in line with both the literature and our intuition. Last but not least, with the right modifications the model can easily be applied to many other (team) sports.


Bibliography

Cameron, A. C. and Trivedi, P. K. (2005). Microeconometrics: methods and applications. Cambridge University Press.

Courneya, K. S. and Carron, A. V. (1992). The home advantage in sport competitions: a literature review. Journal of Sport & Exercise Psychology, 14(1).

Dixon, M. and Robinson, M. (1998). A birth process model for association football matches. Journal of the Royal Statistical Society: Series D (The Statistician), 47(3):523–538.

Dixon, M. J. and Coles, S. G. (1997). Modelling association football scores and inefficiencies in the football betting market. Journal of the Royal Statistical Society: Series C (Applied Statistics), 46(2):265–280.

Karlis, D. and Ntzoufras, I. (2003). Analysis of sports data by using bivariate Poisson models. Journal of the Royal Statistical Society: Series D (The Statistician), 52(3):381–393.

Maher, M. J. (1982). Modelling association football scores. Statistica Neerlandica, 36(3):109–118.

Nevo, D. and Ritov, Y. (2013). Around the goal: examining the effect of the first goal on the second goal in soccer using survival analysis methods. Journal of Quantitative Analysis in Sports, 9(2):165–177.

van Ours, J. C. and van Tuijl, M. A. (2011). Country-specific goal scoring in the ‘dying seconds’ of international football matches. International Journal of Sport Finance, 6(2):138.

Vecer, J., Kopriva, F., and Ichiba, T. (2009). Estimating the effect of the red card in soccer: When to commit an offense in exchange for preventing a goal opportunity. Journal of Quantitative Analysis in Sports, 5(1).
