
A multinomial logistic regression for analysing and forecasting match results in the Dutch Eredivisie League


Academic year: 2021


Faculty of Economics and Business

A Multinomial Logistic Regression for Analysing and Forecasting Match Results in the Dutch Eredivisie League

23 April 2013

Name: Martijn Spindelaar
Student number: 10108998
Specialization: Economics and Finance
Thesis supervisor: Dr. P.H.F.M. van Casteren
Date & Place: 23 April 2013, Beverwijk
Thesis: Bachelor thesis


Abstract

This research develops a model intended to create positive returns over the bookmaker's odds. The model is a multinomial logistic regression in which the categorical dependent variable takes one of three values: a home win, a draw or an away win. The parameters of the model are estimated on the fifteen Eredivisie seasons from 1996/1997 to 2010/2011. The model is then used for out-of-sample forecasting: the forecasts are used to bet on football matches during the Eredivisie seasons 2011/2012 and 2012/2013 in order to make a profit. The second stage develops a sophisticated staging plan that explains how the amount bet on a match is determined. Each bet is treated as a financial asset with its own risk and expected return. Bundling these bets in a portfolio creates diversification benefits; the staging plan as a whole is based on Modern Portfolio Theory. At the end of the research, the sophisticated staging plan is tested against a simple staging plan by comparing the development of the bankroll under the two plans.


Table of Contents

Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction
2 Literature Review
3 Methodology
4 The Model
4.1 A brief introduction
4.2 The Model
4.2.1 The Last Result -Variable
4.2.2 The strength of the opponent -Variable
4.2.3 The effect of the interaction between strength and result
4.2.4 Homeground -Variable
4.2.5 Artificial Field -Variable
4.2.6 Final Model
4.3 Results
5 Staging
5.1 Introduction to staging
5.2 Relationships
5.2.1 Striking Rate vs Expected Probability
5.2.2 Odds vs Realized Return
5.2.3 Expected Return vs Realized Return
5.3 Return Maximization


List of Figures

1 Frequency boxplot
2 Illustration Effect St-i and Xt-i
3 The relationship between the calculated probabilities and the striking rate of 477 matches played during the Eredivisie seasons 2011/2012 and 2012/2013.
4 The relationship between the odds set by the bookmakers and the Realized Return based on the 334 value bets which occurred during the Eredivisie seasons 2011/2012 and 2012/2013.
5 Relationship between τ and the realized return
6 Number of quality-bets for different values of τ
7 A simple probability tree diagram which represents the cash flows that arise from placing a bet.
8 Portfolio expected return as a function of the standard deviation
9 Development Bankroll with and without a staging plan for 277 quality-bets


List of Tables

1 Results of Regressions of different sets of variables
2 An overview of three random matches played during the Eredivisie seasons 2011/2012 and 2012/2013.
3 The calculated probabilities and the average odds set by nine different bookmakers of the three randomly selected matches played during the Eredivisie seasons 2011/2012 and 2012/2013.
4 The expected return and the preferred odds given the maximum expected return of the three randomly selected matches played during the Eredivisie seasons 2011/2012 and 2012/2013.
5 A summary of the three randomly selected matches played during the Eredivisie seasons 2011/2012 and 2012/2013.
6 The relationship between the calculated probabilities and the striking rate of 477 matches played during the Eredivisie seasons 2011/2012 and 2012/2013.
7 The relationship between the odds set by the bookmakers and the Realized Return based on the 334 value bets which occurred during the Eredivisie seasons 2011/2012 and 2012/2013.
8 I. A summary of a subportfolio which is virtually composed on August 12, 2011.
9 II. A summary of a subportfolio which is virtually composed on August 12, 2011.


1 Introduction

Predicting the outcome of a football game is hard, and beating the bookmaker in the long run is an even more challenging task. Nowadays it is possible to place all kinds of bets: on the result of a game, the number of corner kicks, over/under 2.5 goals, which team will score first, and so on. The odds of the outcomes of these events are determined by the belief of the bookmaker: the higher the chance that an outcome occurs, the lower the odds. Besides setting odds as accurately as possible, bookmakers also secure their profits by lowering all odds by a small margin; this is known as the bookmaker's profit margin, or the overround.

Even when a bettor succeeds in finding a so-called value bet, a bet with a positive expected return, another factor needs to be taken into account: the way in which the bets are staged is essential for making a profit. Placing several bets is treated as holding a portfolio of financial assets in which each asset has its own expected return and volatility. To maximize profits, the portfolio will be optimized.

This research focuses on predicting the result of a match: home win, draw or away win. The Dutch league, the Eredivisie, is analysed. Data for the seasons 1996/1997 to 2012/2013 are gathered; the first fifteen seasons are used for parameter estimation and the last two seasons for out-of-sample forecasting.

The research starts by discussing related studies and the way in which this thesis contributes relative to them. Secondly, the methodology is explained: the research strategy, the way in which data are collected and the theoretical framework. Thirdly, there is an introduction to football betting and the determination of odds, followed by an outline of the model and its variables. In section four the data are collected and summarized, and the parameters of the model are estimated. Section five explains some principles of finance, puts forward the reasons for using a staging plan and presents the framework of the staging plan. Section six presents the results and contains the general conclusion, the limitations of the study and possible further research.


2 Literature Review

Is the betting market inefficient, and are positive returns possible in the long run? Quite a number of studies have tried to answer these questions. Pope and Peel (1990) examine the efficiency of the betting market for the outcomes of association football matches in the United Kingdom. For the definition of an efficient market I refer to the Journal of Finance article by Fama (1970), who defines an efficient market as a market in which prices always fully reflect all available information.

The motivation of Pope and Peel (1990) for analysing the efficiency of the betting market comes from two phenomena that are not consistent with the characteristics of an efficient market. The first is that the odds offered by different bookmakers for the same outcome of the same event vary. The odds can even differ so much that, by combining the odds of different bookmakers, a bet with a guaranteed arbitrage return of 12 per cent could be placed (Pope and Peel, 1990). The second follows from the a priori reasonable assumption that bookmakers post odds that reflect their own expectations of the various outcomes; consequently, odds are not jointly determined by buyers and sellers.

At the end of their paper, Pope and Peel (1990) express their belief that the bookmaker's expectation is based on data that are easily accessible to the public. On the basis of weak-form tests, in which the information set consists of historical, public prices only, they conclude that there is evidence that certain types of bets are more favourably priced by some firms than by others. However, they did not find a betting strategy that generates post-tax profits and, consequently, the odds do not seem to meet the axioms of rational expectations. The implication of this result is that bettors can reduce their losses but cannot make a profit.

Although Pope and Peel (1990) did not find a profitable betting strategy, they found some evidence that the betting market is not efficient. To exploit this inefficiency, many researchers have tried to find a profitable betting strategy for several sports. Koopman and Lit (2012) contributed to the literature by introducing a bivariate Poisson model with stochastic time-varying attack and defence strengths. This model will be discussed only briefly here. However, they verified the out-of-sample performance of their model for betting on an outcome of a football game. Despite the fact that they succeed in


pass the criterion “less risky”, and therefore meet some requirements. The first requirement is that the expected value of a bet must be higher than the value τ. The second requirement limits so-called longshots: bets with high odds and consequently a high probability of losing. The rationale behind these requirements might be to create a buffer that protects the portfolio against errors in the model. In theory, however, these restrictions cannot be justified unless there is a legitimate argument proving that bets with a value lower than τ end in non-positive returns. The result of these two requirements is that bets with a positive expected return are excluded on the basis of sometimes arbitrarily chosen values that cannot be justified by theory.

The relationship between the research of Koopman and Lit (2012) and this research is quite clear: both start by developing a model and then use the model to identify value bets. The research and the objective are broadly the same, but the theses differ strongly. My model is based on a logistic regression and not on a Poisson distribution. Furthermore, I try to develop a more sophisticated staging plan, based on the principles of finance.

Another study with some similarities is that of Boulier and Stekler (2003), who tried to forecast the outcomes of NFL football games. They used power scores, which were published weekly in the New York Times, league rankings and predictions by the editor of the New York Times as independent variables in a probit regression. Their model is based on a similar statistical technique; however, I use different variables and examine a different sport. Besides these differences, my research goes one stage further by using the model to place bets.


3 Methodology

This research focuses on the matches played in the Dutch Eredivisie. The objective of this research is to find a relationship between a set of variables and the result of a match. Data on the results and the bookmakers' odds are collected from the website http://www.football-data.co.uk/, and data on club statistics such as the number of spectators are collected from the website http://www.elfvoetbal.nl/.

The variables used to predict the result of a match are derived from the season results. Microsoft Excel® is used to calculate these variables; the spreadsheet framework that produces the necessary information was built entirely by myself. Microsoft Access® is used to bundle the data. The statistical operations are carried out in the statistics program IBM SPSS®. The graphs shown are created with Microsoft Excel® and IBM SPSS®. The thesis is written entirely with the typesetting system TeX in combination with the macro package LaTeX.


4 The Model

4.1 A brief introduction

Before defining the model, a further introduction to the system behind the odds is necessary. As already pointed out, in the bookmaking business the price of placing a bet is not charged as something like a brokerage fee (Kuyper, 2000). The bookmaker's take is a fixed percentage of the total amount bet and can be estimated by calculating the over-roundness. This is best illustrated by an example. When the decimal odds¹ set by the bookmaker for a game between team 1 and team 2 are equal to 1.5, 3.4 and 6.5 for a home win, a draw and an away win, respectively, the over-roundness can be calculated as the sum of the inverses of the odds set by the bookmaker minus one:

$$
\begin{aligned}
\text{Home win:} \quad & 1/1.5 = 0.667 \\
\text{Draw:} \quad & 1/3.4 = 0.294 \\
\text{Away win:} \quad & 1/6.5 = 0.154 \\
\text{Sum:} \quad & 1.115
\end{aligned}
$$

The over-roundness is in this case 11.5%. When the book is balanced, i.e. the stakes are divided in the proportions 0.667, 0.294 and 0.154, the bookmaker's return is 0.115/1.115, or 10.3% of the total stake, irrespective of the outcome. The average over-roundness of the major bookmakers is notably quite constant at 11.5% (Kuyper, 2000). How the over-roundness is set is not investigated further. It is assumed that it is the result of competition and self-regulation, and thus that the service side of the betting market is efficient.

By using the odds and the over-roundness, the probability of an outcome according to the bookmaker can be determined:

$$
\text{Probability} = \frac{1}{1.115 \times \text{Decimal odds}}
$$

It is important to note that this formula is valid only under the assumption that the over-roundness is equally distributed over the three outcomes. If this is not the case, the profits will be biased towards one of the three outcomes and the calculations need to be revised. Since there is no evidence that this is not the case, the assumption is taken to be valid and no further research on this topic is done.

1. Decimal odds quote the total that will be paid out to the bettor, should he win, relative to his stake. For example, suppose team 1 is quoted at 3. If a €1 bet were successful, the customer would collect €3 (€2 profit + €1 return of stake).
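The over-roundness calculation and the implied probabilities can be checked with a few lines of Python. This is only a sketch of the arithmetic described above: the function names are illustrative, the odds are those of the worked example, and the over-roundness is assumed to be spread equally over the three outcomes.

```python
def overround(decimal_odds):
    """Sum of the inverse decimal odds minus one."""
    return sum(1.0 / odd for odd in decimal_odds) - 1.0


def implied_probabilities(decimal_odds):
    """Bookmaker probabilities, assuming the over-roundness is
    distributed equally over all outcomes."""
    book = 1.0 + overround(decimal_odds)
    return [1.0 / (book * odd) for odd in decimal_odds]


# Odds from the worked example: home win, draw, away win
odds = [1.5, 3.4, 6.5]
margin = overround(odds)          # roughly 0.115
take = margin / (1.0 + margin)    # bookmaker's share of a balanced book
probs = implied_probabilities(odds)

print(f"over-roundness: {margin:.3f}")
print(f"bookmaker's take: {take:.3f}")
print("implied probabilities:", [round(p, 3) for p in probs])
```

Dividing by the book total rescales the inverse odds so that the three probabilities add up to one.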

Fama (1970) defines an efficient market as a market in which prices always “fully reflect” all available information. This rules out systems with expected profits in excess of equilibrium expected returns. As already pointed out, the equilibrium expected profit for the bettor is -10.3%, indicating a loss. But since Pope and Peel (1990) concluded that the betting market is not efficient, it is in theory possible to increase the expected return and perhaps achieve what Pope and Peel (1990) did not: increasing the expected return beyond the null-frontier, i.e. above zero.

4.2 The Model

The model is based on a multinomial logistic regression. The extension from a logistic regression with two categories to a logistic regression with M (> 2) categories requires choosing a so-called reference category (Menard, 2001). After choosing this category, the probability of membership in each other category is compared to the probability of membership in the reference category. This requires M-1 equations, one for each category relative to the reference category, to describe the relationship between the dependent variable and the independent variables. The probability that Y is equal to any value h other than the reference category h0 is

$$
P(Y_i = h \mid X_1, X_2, \dots, X_k) = \frac{e^{\alpha_h + \beta_{h1}X_1 + \beta_{h2}X_2 + \dots + \beta_{hk}X_k}}{1 + \sum_{j=1}^{M-1} e^{\alpha_j + \beta_{j1}X_1 + \beta_{j2}X_2 + \dots + \beta_{jk}X_k}}, \qquad h = 1, 2, \dots, M-1 \tag{1}
$$

and for the reference category

$$
P(Y_i = h_0 \mid X_1, X_2, \dots, X_k) = \frac{1}{1 + \sum_{j=1}^{M-1} e^{\alpha_j + \beta_{j1}X_1 + \beta_{j2}X_2 + \dots + \beta_{jk}X_k}} \tag{2}
$$

where the subscript k refers to specific independent variables X and the subscript h refers to specific values of the dependent variable Y. The categorical variable predicted in this research is the outcome of a match from the point of view of the home team, with the following possibilities: a win, a loss and a draw, denoted by 1, 2 and 3 respectively. The reference category is the draw, denoted by the value 3.
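As an illustration, equations (1) and (2) can be written out for the three outcomes used here, with the draw (3) as reference category. The shorthand $\eta_h$ for the linear predictor is introduced only for this sketch and is not part of the original notation.

$$
\eta_h = \alpha_h + \beta_{h1}X_1 + \dots + \beta_{hk}X_k, \qquad h = 1, 2
$$

$$
P(Y_i = 1) = \frac{e^{\eta_1}}{1 + e^{\eta_1} + e^{\eta_2}}, \qquad
P(Y_i = 2) = \frac{e^{\eta_2}}{1 + e^{\eta_1} + e^{\eta_2}}, \qquad
P(Y_i = 3) = \frac{1}{1 + e^{\eta_1} + e^{\eta_2}}
$$

The three probabilities sum to one, and dividing the first two by the third gives $\ln\left(P(Y_i = h)/P(Y_i = 3)\right) = \eta_h$, so each set of coefficients describes the log-odds of a win or a loss relative to a draw.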

The dependent variable is now defined and the remainder of this section will be used to explain and define independent variables.

4.2.1 The Last Result -Variable

The first variable to be defined is the last result variable. The expectation is that a team with good season results will perform well in the upcoming match. The last results can be measured in several ways. The first is counting points for every win, draw and loss. The main shortcoming of this measure is that a team gets no credit for winning by a relatively large goal difference, so it is not completely satisfactory. Another way of measuring is taking the goal balance. The number of points achieved is in some way a function of the goal balance, and the goal balance will therefore be strongly correlated with the total points achieved. Because the goal balance carries this extra information and has no shortcomings relative to the points method, the unit of measure for the last results variable is the goal balance. Let $X_{t-i}$ denote the goal balance of the home team i matches ago and $X^*_{t-i}$ the goal balance of the away team i matches ago.

With the first variable defined, the next thing to determine is the number of lags, i.e. the number of past matches taken into account. This is done by first choosing a number arbitrarily and, once the parameters are estimated, deciding which betas are significant. Six lags are chosen; furthermore, it is assumed that the lag weight i matches ago for the home team is equal to the lag weight i matches ago for the away team. The regression can then be set up:

$$
\begin{aligned}
y_t = \beta_0 &+ \beta_1 (X_{t-1} - X^*_{t-1}) + \beta_2 (X_{t-2} - X^*_{t-2}) + \beta_3 (X_{t-3} - X^*_{t-3}) \\
&+ \beta_4 (X_{t-4} - X^*_{t-4}) + \beta_5 (X_{t-5} - X^*_{t-5}) + \beta_6 (X_{t-6} - X^*_{t-6}) + U_t
\end{aligned} \tag{3}
$$


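Without restrictions the regression has some shortcomings: it has a high degree of collinearity (Hill et al., 1997), and some betas can take illogical values, i.e. values that do not agree with our expectations. For example, the 6th lag could have a higher value than the 5th lag. Therefore, as Hill et al. (1997) outline, a restriction is imposed. It is assumed that the lag weights follow a pattern that can be represented by a second-order polynomial:

$$
\frac{\partial y_t}{\partial (x_{t-i} - x^*_{t-i})} = \beta_i = \gamma_0 + \gamma_1 (i-1) + \gamma_2 (i-1)^2
$$

where

$$
\frac{\partial \beta_i}{\partial i} = \gamma_1 + 2\gamma_2 (i-1) \le 0, \qquad \beta_i \ge 0
$$

As already stated, there are six lags, so the following relations can be derived:

$$
\begin{aligned}
\beta_1 &= \gamma_0 \\
\beta_2 &= \gamma_0 + \gamma_1 + \gamma_2 \\
\beta_3 &= \gamma_0 + 2\gamma_1 + 4\gamma_2 \\
\beta_4 &= \gamma_0 + 3\gamma_1 + 9\gamma_2 \\
\beta_5 &= \gamma_0 + 4\gamma_1 + 16\gamma_2 \\
\beta_6 &= \gamma_0 + 5\gamma_1 + 25\gamma_2
\end{aligned}
$$

Substitute these values into equation (3):

$$
\begin{aligned}
y_t = \beta_0 &+ \gamma_0 (X_{t-1} - X^*_{t-1}) + (\gamma_0 + \gamma_1 + \gamma_2)(X_{t-2} - X^*_{t-2}) \\
&+ (\gamma_0 + 2\gamma_1 + 4\gamma_2)(X_{t-3} - X^*_{t-3}) + (\gamma_0 + 3\gamma_1 + 9\gamma_2)(X_{t-4} - X^*_{t-4}) \\
&+ (\gamma_0 + 4\gamma_1 + 16\gamma_2)(X_{t-5} - X^*_{t-5}) + (\gamma_0 + 5\gamma_1 + 25\gamma_2)(X_{t-6} - X^*_{t-6}) + U_t
\end{aligned} \tag{4}
$$

Define:

$$
\begin{aligned}
V_{t0} &= (X_{t-1} - X^*_{t-1}) + (X_{t-2} - X^*_{t-2}) + (X_{t-3} - X^*_{t-3}) + (X_{t-4} - X^*_{t-4}) + (X_{t-5} - X^*_{t-5}) + (X_{t-6} - X^*_{t-6}) \\
V_{t1} &= (X_{t-2} - X^*_{t-2}) + 2(X_{t-3} - X^*_{t-3}) + 3(X_{t-4} - X^*_{t-4}) + 4(X_{t-5} - X^*_{t-5}) + 5(X_{t-6} - X^*_{t-6}) \\
V_{t2} &= (X_{t-2} - X^*_{t-2}) + 4(X_{t-3} - X^*_{t-3}) + 9(X_{t-4} - X^*_{t-4}) + 16(X_{t-5} - X^*_{t-5}) + 25(X_{t-6} - X^*_{t-6})
\end{aligned} \tag{5}
$$

Substitute these newly defined variables in equation (4):

$$
y_t = \beta_0 + \gamma_0 V_{t0} + \gamma_1 V_{t1} + \gamma_2 V_{t2} + U_t \tag{6}
$$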
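The construction of the three polynomial lag variables can be sketched in Python. The function names, the goal-balance differences and the gamma values below are hypothetical and serve only to show that the restricted form (6) reproduces the weighted sum in (3) once the betas follow the second-order polynomial.

```python
def almon_variables(diffs, degree=2):
    """Build V_t0 .. V_t,degree from diffs, where
    diffs[i - 1] = X_{t-i} - X*_{t-i} for lags i = 1..len(diffs)."""
    # enumerate() starts at 0, which is exactly (i - 1) for lag i
    return [sum((k ** p) * d for k, d in enumerate(diffs))
            for p in range(degree + 1)]


def lag_weights(gammas, n_lags):
    """beta_i = gamma_0 + gamma_1 (i - 1) + gamma_2 (i - 1)^2 for i = 1..n_lags."""
    return [sum(g * (k ** p) for p, g in enumerate(gammas))
            for k in range(n_lags)]


# Hypothetical goal-balance differences for the last six matches
diffs = [2, -1, 0, 3, 1, -2]
# Hypothetical polynomial coefficients (gamma_0, gamma_1, gamma_2)
gammas = (0.5, -0.1, 0.005)

V = almon_variables(diffs)
betas = lag_weights(gammas, len(diffs))

restricted = sum(g * v for g, v in zip(gammas, V))
unrestricted = sum(b * d for b, d in zip(betas, diffs))
print(V, round(restricted, 6), round(unrestricted, 6))
```

Only three parameters are estimated instead of six, which is what reduces the collinearity problem mentioned above.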

4.2.2 The strength of the opponent -Variable

Using only the last result variable would lead to misleading results: we expect a high degree of interaction with the strength of the recent opponents. To account for this, a good numerical estimate of the strength of the teams is necessary.

A good indicator of the strength of a team is obtained by calculating its average goal balance in the current season and taking the deviation from the average of the average goal balances of all teams. The problem with this measure is that its distribution changes over the course of the season: the lowest value, the highest value and the quartiles all change. These changes are best illustrated by a boxplot. Figure 1 shows three boxplots based on the data of the season 2010/2011: the first represents the distribution of the variable after one match played, the second the distribution after 17 matches played and the last the distribution after 33 matches played.


Figure 1: Frequency boxplot

As can be seen, the spread of the distribution becomes smaller as the season progresses. To overcome this problem, the standardized average goal balance is used in the regression instead of the raw average goal balance. It is calculated by subtracting the average of the average goal balances from the team's average goal balance and dividing by the standard deviation:

$$
\text{Strength of team} = \frac{P_{t-i} - \bar{P}_{t-i}}{\sigma_{P_{t-i}}} = S_{t-i} \tag{7}
$$

where $\sigma_{P_{t-i}}$, $\bar{P}_{t-i}$ and $P_{t-i}$ denote the standard deviation of the average goal balance, the average of the average goal balance of all teams and the team's average goal balance, respectively. Not only the strength of the opponents of the last matches enters the regression; the strength of the two teams in the match itself is also used. These variables are denoted as $S_t$ and $S^*_t$.
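The standardization of the average goal balance can be sketched as follows. The team names and values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variant is used.

```python
from statistics import mean, pstdev


def relative_strengths(avg_goal_balance):
    """Map each team to its standardized average goal balance."""
    values = list(avg_goal_balance.values())
    centre = mean(values)
    # Assumption: population standard deviation over all teams;
    # this is zero (and the division fails) if all teams are equal
    spread = pstdev(values)
    return {team: (p - centre) / spread
            for team, p in avg_goal_balance.items()}


# Hypothetical average goal balances at some point in the season
example = {"Team A": 1.2, "Team B": 0.3, "Team C": -0.5, "Team D": -1.0}
strengths = relative_strengths(example)

for team, s in strengths.items():
    print(f"{team}: {s:+.4f}")
```

After standardization the values have mean zero and unit spread on every matchday, which makes strengths from early and late in the season comparable.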


4.2.3 The effect of the interaction between strength and result

The intended effect of these two variables might be confusing and at times counterintuitive; therefore Figure 2 is included to illustrate it.

Figure 2: Illustration Effect St-i and Xt-i

Figure 2 has a horizontal axis showing the goal balance of i matches ago and a vertical axis showing the effect on the chance of winning the next game. The figure contains three functions. For the first function, indicated by the green line, the opponent is better than average and consequently has a positive relative strength. When a team wins this match, the intended effect on its chances of winning the next match is positive. Even a draw or a narrow loss contributes to the team's chances of winning the next game. For the other two functions, which represent matches against weaker teams, the team is credited less for the same result, which is precisely the effect that theory justifies. To achieve the picture illustrated by Figure 2, the following function can be set up:

$$
y_t = \beta_0 + \beta_i X_{t-i} + \beta_{i+6} S_{t-i} + \beta_{i+12} X_{t-i}\,|S_{t-i}| \tag{8}
$$

The first two variable terms have already been discussed; the third is an interaction between the goal balance and the absolute value of the opponent's strength. The absolute value is used because the direction of the strength is not important, only its magnitude. Without the absolute value, the equation can give illogical outcomes.


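corresponding parameters are restricted by a polynomial of degree two. The same restrictions apply to the strength variable and the interaction variable, so following the same procedure, the following parameters are defined:

$$
\begin{aligned}
\beta_7 &= \delta_0 & \beta_{13} &= \zeta_0 \\
\beta_8 &= \delta_0 + \delta_1 + \delta_2 & \beta_{14} &= \zeta_0 + \zeta_1 + \zeta_2 \\
\beta_9 &= \delta_0 + 2\delta_1 + 4\delta_2 & \beta_{15} &= \zeta_0 + 2\zeta_1 + 4\zeta_2 \\
\beta_{10} &= \delta_0 + 3\delta_1 + 9\delta_2 & \beta_{16} &= \zeta_0 + 3\zeta_1 + 9\zeta_2 \\
\beta_{11} &= \delta_0 + 4\delta_1 + 16\delta_2 & \beta_{17} &= \zeta_0 + 4\zeta_1 + 16\zeta_2 \\
\beta_{12} &= \delta_0 + 5\delta_1 + 25\delta_2 & \beta_{18} &= \zeta_0 + 5\zeta_1 + 25\zeta_2
\end{aligned}
$$

And:

$$
\begin{aligned}
W_{t0} &= (S_{t-1} - S^*_{t-1}) + (S_{t-2} - S^*_{t-2}) + (S_{t-3} - S^*_{t-3}) + (S_{t-4} - S^*_{t-4}) + (S_{t-5} - S^*_{t-5}) + (S_{t-6} - S^*_{t-6}) \\
W_{t1} &= (S_{t-2} - S^*_{t-2}) + 2(S_{t-3} - S^*_{t-3}) + 3(S_{t-4} - S^*_{t-4}) + 4(S_{t-5} - S^*_{t-5}) + 5(S_{t-6} - S^*_{t-6}) \\
W_{t2} &= (S_{t-2} - S^*_{t-2}) + 4(S_{t-3} - S^*_{t-3}) + 9(S_{t-4} - S^*_{t-4}) + 16(S_{t-5} - S^*_{t-5}) + 25(S_{t-6} - S^*_{t-6}) \\
Z_{t0} &= (X_{t-1}|S_{t-1}| - X^*_{t-1}|S^*_{t-1}|) + (X_{t-2}|S_{t-2}| - X^*_{t-2}|S^*_{t-2}|) + \dots + (X_{t-6}|S_{t-6}| - X^*_{t-6}|S^*_{t-6}|) \\
Z_{t1} &= (X_{t-2}|S_{t-2}| - X^*_{t-2}|S^*_{t-2}|) + \dots + 5(X_{t-6}|S_{t-6}| - X^*_{t-6}|S^*_{t-6}|) \\
Z_{t2} &= (X_{t-2}|S_{t-2}| - X^*_{t-2}|S^*_{t-2}|) + \dots + 25(X_{t-6}|S_{t-6}| - X^*_{t-6}|S^*_{t-6}|)
\end{aligned}
$$

Substitute these values into the regression:

$$
y_t = \beta_0 + \gamma_0 V_{t0} + \gamma_1 V_{t1} + \gamma_2 V_{t2} + \delta_0 W_{t0} + \delta_1 W_{t1} + \delta_2 W_{t2} + \zeta_0 Z_{t0} + \zeta_1 Z_{t1} + \zeta_2 Z_{t2} + \beta_{19} S_t + \beta_{20} S^*_t + U_t \tag{9}
$$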


4.2.4 Homeground -Variable

One of the most discussed topics within football is the advantage of playing at home. As Nevill and Holder (1999) state, various reasons are put forward; the two most commonly heard are the absence of travelling and the support of the home crowd. The total attendance of the home team is used as the unit of measurement for the home ground variable, which is consequently a continuous variable. This variable is denoted as:

$$
H_t = \text{number of spectators present in the stadium, measured in thousands} \tag{10}
$$

Since no database was found that provides the number of spectators for all 5,202 matches, a proxy for the number of spectators is used. From the website http://www.elfvoetbal.nl/ the average number of spectators per match is collected for all the relevant seasons. Using this variable assumes that the variation during a season is quite low and that the season average is a good estimate of the exact number of spectators present in the stadium at a given match.

4.2.5 Artificial Field -Variable

Teams that are not used to playing on artificial fields will on average underperform on these fields (Carmichael and Thomas, 2005). This variable is implemented so that it only has an effect when the match fulfils two conditions: first, the match is played on an artificial surface, and second, the away team is not used to an artificial surface. To achieve this, the following mathematical “trick” is used:

$$
D = D_h (1 - D_a) \tag{11}
$$

where $D_h$ and $D_a$ denote the surface preference of the home and away team respectively, and are equal to 1 when the team's preferred surface is an artificial field.
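The behaviour of the dummy in equation (11) can be verified by enumerating the four possible combinations. This is a minimal sketch; the function name is illustrative.

```python
def artificial_field_dummy(home_artificial, away_artificial):
    """D = D_h * (1 - D_a): equals 1 only when the home team plays on an
    artificial field and the away team is not used to one."""
    return home_artificial * (1 - away_artificial)


# Enumerate all combinations of (D_h, D_a)
table = {(dh, da): artificial_field_dummy(dh, da)
         for dh in (0, 1) for da in (0, 1)}

for (dh, da), d in table.items():
    print(f"D_h={dh}, D_a={da} -> D={d}")
```

Only the combination of an artificial home field and an away team used to natural grass switches the variable on.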


4.2.6 Final Model

When adding variables (10)-(11) to equation (9), the model is complete:

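$$
\begin{aligned}
y_t = \beta_0 &+ \gamma_0 V_{t0} + \gamma_1 V_{t1} + \gamma_2 V_{t2} + \delta_0 W_{t0} + \delta_1 W_{t1} + \delta_2 W_{t2} \\
&+ \zeta_0 Z_{t0} + \zeta_1 Z_{t1} + \zeta_2 Z_{t2} + \beta_{19} S_t + \beta_{20} S^*_t + \beta_{21} H_t + \beta_{22} D_t + U_t
\end{aligned} \tag{12}
$$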

4.3 Results

As already stated, the first fifteen seasons are used for parameter estimation. These seasons give 4,590 observations in total. The first seven matchdays cannot be used because of missing data, so after removing the unusable observations 3,631 observations remain available for the regression. Different combinations of variables are used to determine the final model. The numbers in parentheses are the standard errors of the parameters.


Table 1: Results of Regressions of different sets of variables

Result³        1           2           3           4           5

Pseudo R-Square
Cox and Snell  0.110       0.121       0.122       0.171       0.166
Nagelkerke     0.125       0.139       0.139       0.196       0.190
McFadden       0.056       0.062       0.062       0.090       0.087

Parameters, outcome 1
Intercept      0.7385      0.7405      0.7403      0.4823      0.4897
               (0.0436)    (0.0438)    (0.0438)    (0.0961)    (0.0950)
Vt0            0.0663**    0.0745**    0.0713*     0.0249      (-)
               (0.0136)    (0.0144)    (0.0228)    (0.0239)
Vt1            -0.0017     -0.0090     0.0004      -0.0032     (-)
               (0.0134)    (0.0143)    (0.0219)    (0.0221)
Vt2            -0.0005     0.0007      -0.0010     -0.0004     (-)
               (0.0026)    (0.0028)    (0.0042)    (0.0043)
Wt0            (-)         0.0600*     0.0622*     0.0384      (-)
                           (0.0288)    (0.0308)    (0.0312)
Wt1            (-)         -0.0369     -0.0427     -0.0459     (-)
                           (0.0257)    (0.0275)    (0.0277)
Wt2            (-)         0.0062      0.0072      0.0078      (-)
                           (0.0049)    (0.0052)    (0.0053)
Zt0            (-)         (-)         0.0039      0.0081      (-)
                                       (0.0212)    (0.0214)
Zt1            (-)         (-)         -0.0113     -0.0099     (-)
                                       (0.0199)    (0.0200)
Zt2            (-)         (-)         0.0020      -0.0099     (-)
                                       (0.0038)    (0.0200)
St             (-)         (-)         (-)         0.2857**    0.3487**
                                                   (0.0751)    (0.0600)
St*            (-)         (-)         (-)         -0.3740**   -0.4355**
                                                   (0.0632)    (0.0461)
Ht             (-)         (-)         (-)         0.0134**    0.0125*
                                                   (0.0052)    (0.0052)
Dt             (-)         (-)         (-)         0.2047      (-)
                                                   (0.2576)

Parameters, outcome 2
Intercept      0.0928      0.0718      0.0716      0.2951      0.2933
               (0.0505)    (0.0512)    (0.0512)    (0.1134)    (0.1123)
Vt0            -0.0247     -0.0282     0.0389      0.0005      (-)
               (0.0149)    (0.0157)    (0.0244)    (0.0257)
Vt1            -0.0135     -0.0147     -0.0031     -0.0007     (-)
               (0.0147)    (0.0157)    (0.0237)    (0.0238)
Vt2            0.0023      0.0027      0.0008      -0.0001     (-)
               (0.0028)    (0.0030)    (0.0046)    (0.0046)
Wt0            (-)         -0.0665*    -0.0596     -0.0377     (-)
                           (0.0317)    (0.0337)    (0.0341)
Wt1            (-)         -0.0082     -0.0153     -0.0168     (-)
                           (0.0283)    (0.0302)    (0.0304)
Wt2            (-)         0.0017      0.0028      0.0033      (-)
                           (0.0054)    (0.0058)    (0.0058)
Zt0            (-)         (-)         0.0132      0.0088      (-)
                                       (0.0228)    (0.0228)
Zt1            (-)         (-)         -0.0141     -0.0163     (-)
                                       (0.0216)    (0.0216)
Zt2            (-)         (-)         0.0022      0.0027      (-)
                                       (0.0042)    (0.0042)
St             (-)         (-)         (-)         -0.1884*    -0.1982**
                                                   (0.0840)    (0.0675)
St*            (-)         (-)         (-)         0.3363**    0.3547**
                                                   (0.0676)    (0.0479)
Ht             (-)         (-)         (-)         -0.0172**   -0.0173**
                                                   (0.0066)    (0.0065)
Dt             (-)         (-)         (-)         -0.1760     (-)
                                                   (0.2715)

3. The reference category is: 3.
* Significant at a 5% level
** Significant at a 1% level

Most of the variables are insignificant at a 5% level; omitting these variables improves the regression and makes the estimated parameters more accurate. Following this procedure, the only variables used in the model are the relative strengths of the two teams and the home ground variable. With the parameters estimated, the final equations for a home win, an away win and a draw respectively are:

$$
P(Y_i = 1 \mid S_t, S^*_t, H_t) = \frac{e^{0.4897 + 0.3487 S_t - 0.4355 S^*_t + 0.0125 H_t}}{1 + e^{0.4897 + 0.3487 S_t - 0.4355 S^*_t + 0.0125 H_t} + e^{0.2933 - 0.1982 S_t + 0.3547 S^*_t - 0.0173 H_t}} \tag{13}
$$

$$
P(Y_i = 2 \mid S_t, S^*_t, H_t) = \frac{e^{0.2933 - 0.1982 S_t + 0.3547 S^*_t - 0.0173 H_t}}{1 + e^{0.4897 + 0.3487 S_t - 0.4355 S^*_t + 0.0125 H_t} + e^{0.2933 - 0.1982 S_t + 0.3547 S^*_t - 0.0173 H_t}} \tag{14}
$$

$$
P(Y_i = 3 \mid S_t, S^*_t, H_t) = \frac{1}{1 + e^{0.4897 + 0.3487 S_t - 0.4355 S^*_t + 0.0125 H_t} + e^{0.2933 - 0.1982 S_t + 0.3547 S^*_t - 0.0173 H_t}} \tag{15}
$$

5 Staging

5.1 Introduction to staging

As already stated, one of the most underestimated factors in football betting is the way in which the bets are staged. A staging plan is introduced that determines the amount bet on every game so as to obtain the optimal balance between expected value and risk. Before introducing the staging plan, the relationships that exist between the characteristics of a bet are discussed.

5.2 Relationships

To show why this section starts by explaining the relationships between different characteristics of a bet, some example bets are given. The odds used are the average odds of nine different bookmakers, collected from the website http://www.football-data.co.uk/. Over the Eredivisie seasons 2011/2012 and 2012/2013 there are in total 477 matches available for football betting. The following table shows information on three random matches:

Table 2: An overview of three random matches played during the Eredivisie seasons 2011/2012 and 2012/2013.

MatchID  Date        Home Team  Away Team     Home Goals  Away Goals  HTRS     ATRS     Spec.
4609     19-08-2011  Roda JC    RKC Waalwijk  0           2           -0.6396  -0.3198  12,600
4691     05-11-2011  Feyenoord  NEC           0           1           0.4458   -0.6242  42,000

HTRS and ATRS: relative strength of the home and away team; Spec.: number of spectators.


MatchID has no further meaning and is used only for administrative purposes. The match with MatchID 4609 has a home team with a relative strength of -0.6396 and an away team with a relative strength of -0.3198, and approximately 12,600 spectators attended this match. The probabilities of the possible results can be computed from these data and equations (13), (14) and (15). The results of these equations and the average odds of the bookmakers are shown in the following table:

Table 3: The calculated probabilities and the average odds set by nine different bookmakers of the three randomly selected matches played during the Eredivsie seasons 2011/2012 and 2012/2013.

MatchID  Probabilities              Average Odds
         1       3       2          1      3      2
4609     45.67%  25.98%  28.35%     1.476  4.081  6.025
4691     74.15%  17.53%   8.33%     1.393  4.434  6.950
5075     72.83%  17.32%   9.85%     1.313  4.775  9.457
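
The probabilities in Table 3 can be approximated directly from the printed coefficients. The following is a minimal sketch, not the original implementation. It assumes that the probability of outcome 1 has the first exponential of the shared denominator of equations 14 and 15 as its numerator, and that the number of spectators enters the model in thousands; both assumptions are inferred from the fact that they reproduce Table 3 up to rounding of the coefficients. The function name is hypothetical.

```python
import math

# Coefficients (constant, S_t, S_t*, H_t) as printed in equations 14 and 15.
# COEF_1 is the first exponential in the denominator (assumed to belong to
# outcome 1), COEF_2 the second (outcome 2); outcome 3 is the reference.
COEF_1 = (0.4897, 0.3487, -0.4355, 0.0125)
COEF_2 = (0.2933, -0.1982, 0.3547, -0.0173)


def outcome_probabilities(home_strength, away_strength, spectators):
    """Return {outcome code: probability} for one match."""
    # Assumption: H_t is measured in thousands of spectators.
    x = (1.0, home_strength, away_strength, spectators / 1000.0)
    exp_1 = math.exp(sum(c * v for c, v in zip(COEF_1, x)))
    exp_2 = math.exp(sum(c * v for c, v in zip(COEF_2, x)))
    denominator = 1.0 + exp_1 + exp_2
    return {1: exp_1 / denominator, 2: exp_2 / denominator, 3: 1.0 / denominator}


# Match 4609: HTRS = -0.6396, ATRS = -0.3198, 12,600 spectators.
probabilities = outcome_probabilities(-0.6396, -0.3198, 12600)
for outcome in (1, 3, 2):
    print(outcome, f"{probabilities[outcome]:.2%}")
```

Small differences from Table 3 in the second decimal are to be expected, because the coefficients are printed with only four decimals.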

Using the calculated probabilities and the odds set by the bookmakers, the expected returns can now be calculated. The expected return of each outcome is simply the product of the probability and the corresponding odds, minus one. The following table shows the expected returns and the preferred outcome, that is, the outcome with the maximum expected return:

Table 4: The expected return and the preferred outcome given the maximum expected return of the three randomly selected matches played during the Eredivisie seasons 2011/2012 and 2012/2013.

MatchID  Expected Return               Preferred Result
         1        3        2
4609     -0.3259   0.0602   0.7081     2
4691      0.0329  -0.2229  -0.4214     1
5075     -0.0438  -0.1728  -0.0686     1
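
The calculation behind Table 4 can be sketched as follows. This is an illustrative sketch with hypothetical function names; the inputs are the probabilities and average odds of match 4609 from Table 3, keyed by the outcome codes used in the tables. Treating a bet as a value-bet when its best expected return is positive is an assumption, consistent with the discussion of match 5075.

```python
def expected_returns(probabilities, odds):
    """Expected return per unit staked for each outcome: p * odds - 1."""
    return {k: probabilities[k] * odds[k] - 1.0 for k in probabilities}


def preferred_outcome(probabilities, odds):
    """Return (outcome code, expected return) of the best outcome."""
    returns = expected_returns(probabilities, odds)
    best = max(returns, key=returns.get)
    return best, returns[best]


# Match 4609, values copied from Table 3.
probabilities = {1: 0.4567, 3: 0.2598, 2: 0.2835}
odds = {1: 1.476, 3: 4.081, 2: 6.025}

outcome, value = preferred_outcome(probabilities, odds)
is_value_bet = value > 0  # assumption: positive expected return
print(outcome, round(value, 4), is_value_bet)
```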

When the outcome with the maximum expected return is chosen for each match, the first bet has an expected return of 70.81%, the second bet an expected return of 3.29% and the last bet an expected return of -4.38%. The fact that the best possible expected return for the third match is negative indicates that the bookmakers' expectations and the expectations calculated by the model are substantially similar, so that it is not possible to overcome the bookmakers' overround. Of the 477 matches, 334 qualify for a value-bet.

The characteristics of the three example bets are summarized in the following table:

Table 5: A summary of the three randomly selected matches played during the Eredivisie seasons 2011/2012 and 2012/2013.

MatchID  Preferred  Calculated   Corresponding  Corresponding    Actual  Realized
         Outcome    Probability  Average Odds   Expected Return  Result  Return
4609     2          28.35%       6.025           0.7081          2       502.50%
4691     1          74.15%       1.393           0.0329          2       -100.00%
5075     1          72.83%       1.313          -0.0438          1        31.30%

As can be seen, a bet has several characteristics. To make a profit in the long term it is useful to examine how these characteristics relate to each other. The following section shows the relationship between the percentage of correct bets and the calculated probabilities.

5.2.1 Striking Rate vs Expected Probability

For a credible model, it is crucial that there is a clear relationship between the calculated probabilities and the percentage of correct bets (from now on called the striking rate). To test this, one unit is bet on every preferred outcome. The bets are subdivided into categories based on the calculated probabilities, and the striking rate of each category is calculated; see the following table and figure:

Table 6: The relationship between the calculated probabilities and the striking rate of 477 matches played during the Eredivisie seasons 2011/2012 and 2012/2013.

Category  Min Calculated  Max Calculated  Number of  Number of     Striking
          Probability     Probability     Bets       Bets Correct  Rate
1           0%             10%            14          0             0.00%
2          10%             20%            84         12            14.29%
3          20%             30%            98         25            25.51%
4          30%             40%            67         21            31.34%
5          40%             50%            64         29            45.31%
6          50%             60%            59         22            37.29%
7          60%             70%            45         27            60.00%
8          70%             80%            25         17            68.00%
9          80%             90%            21         18            85.71%
10         90%            100%             0          -             -
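
The grouping used for Table 6 can be sketched as follows. This is a minimal illustration with a hypothetical function name; the input used here consists of the three example bets from Table 5 rather than the full set of 477 bets, which is not reproduced in this section. How the original analysis treats probabilities that fall exactly on a category boundary is not stated, so the lower-inclusive convention below is an assumption.

```python
from collections import defaultdict


def striking_rates(bets, n_bins=10):
    """Group bets by calculated probability and compute the striking rate.

    bets: iterable of (calculated_probability, won) pairs.
    Returns {bin index: (number of bets, number correct, striking rate)},
    where bin index k covers probabilities from k/n_bins up to (k+1)/n_bins.
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [bets, correct]
    for probability, won in bets:
        # Assumption: lower bound inclusive; a probability of 1.0 goes to the last bin.
        index = min(int(probability * n_bins), n_bins - 1)
        counts[index][0] += 1
        counts[index][1] += int(won)
    return {index: (n, k, k / n) for index, (n, k) in sorted(counts.items())}


# The three example bets: (calculated probability, bet won?).
example_bets = [(0.2835, True), (0.7415, False), (0.7283, True)]
for index, (n, k, rate) in striking_rates(example_bets).items():
    print(f"{index * 10}%-{(index + 1) * 10}%: {k}/{n} correct, striking rate {rate:.2%}")
```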

Figure 3: The relationship between the calculated probabilities and the striking rate of 477 matches played during the Eredivisie seasons 2011/2012 and 2012/2013.

Despite the lack of a perfect relationship, the model can be considered credible: there is a clear relationship between the calculated probabilities and the striking rate.

5.2.2 Odds vs Realized Return

Our model estimates, as accurately as possible, the true probabilities of the result of a match given a set of variables. Based on these probabilities the expected return can be calculated by multiplying the probabilities by the corresponding odds and subtracting one. Of course, the calculated probabilities will deviate from the true probabilities, but in general these deviations do not necessarily put profits out of reach. When the odds are higher, however, the deviation between the calculated expected return and the true expected return is amplified. With this in mind, the relationship between the odds and the realized return is examined to see whether it has a turning point beyond which the realized return becomes negative. To examine this, the bets are again subdivided into several groups, now on the basis of the odds:

Table 7: The relationship between the odds set by the bookmakers and the realized return based on the 334 value-bets which occurred during the Eredivisie seasons 2011/2012 and 2012/2013.

Category  Min Odds  Max Odds  Number of Bets  Realized Return
1          1.00      2.00     57                -9.40%
2          2.00      3.00     67                -8.65%
3          3.00      4.00     48               -18.41%
4          4.00      5.00     49                42.75%
5          5.00      6.00     31                38.23%
6          6.00      7.00     25                52.65%
7          7.00      8.00     11              -100.00%
8          8.00      9.00      8              -100.00%
9          9.00     10.00     12                53.96%
10        10.00     11.00      6              -100.00%
11        11.00     12.00      6              -100.00%
12        12.00     ∞         14               -12.70%

Figure 4: The relationship between the odds set by the bookmakers and the realized return based on the 334 value-bets which occurred during the Eredivisie seasons 2011/2012 and 2012/2013.

Despite the anomaly in category 9, it can be concluded that betting at odds higher than or equal to 7 amplifies errors to such a degree that these bets lead to losses. Therefore, bets with odds higher than or equal to 7 are excluded. Value-bets with odds lower than 7 are called quality-bets.
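
The effect of the cut-off at odds of 7 can be illustrated by pooling the categories of Table 7 on either side of it. The sketch below assumes that every value-bet is staked with one unit, so that the realized return of a group of categories is the average of the category returns weighted by the number of bets; the function name is hypothetical.

```python
# (min odds, number of bets, realized return) per category, copied from Table 7.
TABLE_7 = [
    (1.0, 57, -0.0940), (2.0, 67, -0.0865), (3.0, 48, -0.1841),
    (4.0, 49, 0.4275), (5.0, 31, 0.3823), (6.0, 25, 0.5265),
    (7.0, 11, -1.0), (8.0, 8, -1.0), (9.0, 12, 0.5396),
    (10.0, 6, -1.0), (11.0, 6, -1.0), (12.0, 14, -0.1270),
]
CUTOFF = 7.0


def pooled_return(rows):
    """Return (number of bets, realized return per unit staked) for a group of categories."""
    total_bets = sum(n for _, n, _ in rows)
    profit = sum(n * r for _, n, r in rows)  # assumes unit stakes
    return total_bets, profit / total_bets


quality = pooled_return([row for row in TABLE_7 if row[0] < CUTOFF])
excluded = pooled_return([row for row in TABLE_7 if row[0] >= CUTOFF])
print(f"odds below 7: {quality[0]} bets, realized return {quality[1]:.2%}")
print(f"odds of 7 or higher: {excluded[0]} bets, realized return {excluded[1]:.2%}")
```

Under this assumption the pooled return of the bets with odds below 7 is positive, while the pooled return of the excluded bets is clearly negative.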

5.2.3 Expected Return vs Realized Return

The last relationship to be examined is that between the expected return and the realized return. This relationship is the most interesting part of the betting stage: a clear relationship makes the expected return a strong indicator of the realized return. Let τ denote the minimum accepted expected return. With τ on the horizontal axis and the realized return on the vertical axis, figure 5 can be drawn. The 90% confidence interval is represented by the dotted lines. The number of quality-bets for different values of τ is shown in figure 6.

Figure 5: Relationship between τ and the realized return

Although the realized return is not a smooth function of τ, there is some kind of relationship between them. The realized return for τ equal to 0.2 is 14.95%, indicating a return of 14.95 units for every 100 units bet. For τ equal to 0.4 the realized return almost triples to 44.16%, indicating a return of 44.16 units for every 100 units bet. But as figure 6 shows, a higher value of τ results in a lower number of quality-bets; for τ equal to 0.4 only 38 quality-bets remain. Because the number of quality-bets falls as τ increases, the 90% confidence interval continues to widen. Therefore, given the 90% confidence interval, losses can never be totally excluded, even for a high value of τ.
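
The selection by τ can be sketched as a simple filter. The function below is a hypothetical illustration, assuming unit stakes so that the realized return of a selection is the plain average of the individual realized returns. The input is limited to the three example bets of this chapter, so the printed values do not correspond to figure 5.

```python
def realized_return_at(bets, tau):
    """Average realized return per unit staked over bets with expected return >= tau.

    bets: iterable of (expected_return, realized_return) pairs for unit stakes.
    Returns (number of bets, average realized return); the average is None
    when no bet meets the threshold.
    """
    selected = [realized for expected, realized in bets if expected >= tau]
    if not selected:
        return 0, None
    return len(selected), sum(selected) / len(selected)


# Example bets 4609, 4691 and 5075: (expected return, realized return).
# The realized return of a winning bet is taken as its odds minus one.
example_bets = [(0.7081, 5.025), (0.0329, -1.0), (-0.0438, 0.313)]

for tau in (0.0, 0.2, 1.0):
    print(tau, realized_return_at(example_bets, tau))
```

Raising τ leaves fewer bets in the selection, which is the reason the confidence interval in figure 5 widens.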

5.3 Return Maximization

When several bets are placed, diversification benefits arise. As in finance, holding several bets can be seen as holding a portfolio in which each asset has its own expected return and risk. The amount bet on every match is determined by maximizing the so-called reward-to-volatility ratio. The expected returns have already been discussed; this section starts by showing how to calculate the risk of a bet.

As Bodie et al. (2011) put forward, the standard deviation of the return is a measure of risk. It is defined as the square root of the variance, which is in turn the expected value of the squared deviation from the expected return. The higher the volatility, i.e. the standard deviation, the higher the average value of these squared deviations. Therefore, the standard deviation provides one measure of the uncertainty of outcomes. Before the formula is given, a simple probability tree diagram shows the cash flows that arise from placing a bet:

Figure 7: A simple probability tree diagram which represents the cash flows that arise from placing a bet.

[Probability tree: starting from the amount bet, the bet pays Amount Bet * Odds with probability P(success) and 0 with probability 1 - P(success).]

As figure 7 shows, there are two possible scenarios: a scenario in which the preferred result equals the actual result, denoted a success, and a scenario in which the preferred result does not equal the actual result, denoted a failure. For a value-bet it can generally be stated that the lower the probability of success, the higher the average value of the squared deviations and thus the risk. Now that risk has been defined and the scenarios have been described, the formula for the standard deviation is as follows:

\[
\delta = \sqrt{\sum_{i=1}^{N} P_i \left(X_i - \mu\right)^2}, \quad \text{where } \mu = \text{the expected return} = \sum_{i=1}^{N} P_i X_i \tag{16}
\]

Using formula 16 to calculate the standard deviation of the three example bets given above:

\[
\begin{aligned}
\delta_{4609} &= \sqrt{0.2835\,(5.025 - 0.7081)^2 + 0.7165\,(-1 - 0.7081)^2} = 2.715 \\
\delta_{4691} &= \sqrt{0.7415\,(0.393 - 0.0329)^2 + 0.2585\,(-1 - 0.0329)^2} = 0.610 \\
\delta_{5075} &= \sqrt{0.7283\,(0.313 - (-0.0438))^2 + 0.2717\,(-1 - (-0.0438))^2} = 0.584
\end{aligned}
\]
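
The three standard deviations can be checked numerically. The sketch below applies formula 16 to the two scenarios of figure 7, with a net return of odds minus one in case of success and minus one in case of failure; the function name is hypothetical. For such a two-outcome bet the result also equals odds * sqrt(p * (1 - p)), which follows from the variance of a Bernoulli variable and is used here only as a cross-check.

```python
import math


def bet_std(probability, odds):
    """Standard deviation of the net return of a one-unit bet (formula 16)."""
    win, lose = odds - 1.0, -1.0
    mu = probability * win + (1.0 - probability) * lose  # expected return
    variance = (probability * (win - mu) ** 2
                + (1.0 - probability) * (lose - mu) ** 2)
    return math.sqrt(variance)


# (MatchID, calculated probability, average odds) of the three example bets.
for match_id, p, odds in [(4609, 0.2835, 6.025), (4691, 0.7415, 1.393), (5075, 0.7283, 1.313)]:
    closed_form = odds * math.sqrt(p * (1.0 - p))
    print(match_id, round(bet_std(p, odds), 3), round(closed_form, 3))
```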

The expected return and the standard deviation of a single bet are now known, but how can the expected return and standard deviation of a portfolio comprising several bets be calculated? The expected return of a portfolio is simply the weighted average of the expected returns of the bets, with the portfolio proportions as weights:

\[
E(r_p) = \sum_{i=1}^{n} w_i E(r_i) \tag{17}
\]

And the standard deviation of a portfolio is:

\[
\delta_p = \sqrt{\sum_{i} w_i^2 \delta_i^2 + \sum_{i} \sum_{j \neq i} w_i w_j \delta_i \delta_j \rho_{ij}}, \tag{18}
\]

where \rho_{ij} is the correlation coefficient between the returns of bets i and j.

When the returns of different bets are uncorrelated, \rho_{ij} is equal to zero and formula 18 can be rewritten as:

\[
\delta_p = \sqrt{\sum_{i} w_i^2 \delta_i^2} \tag{19}
\]

Because there is now a measure of both the risk and the expected return, it can be examined how the two can be used to determine the amount bet on a match. It is assumed that the bettor, despite being involved in a betting game, is risk-averse (see footnote 4). Because the bettor is risk-averse, he will continuously seek the highest feasible reward-to-volatility ratio. The reward-to-volatility ratio is called the Sharpe ratio (Bodie et al., 2011) and can be calculated as follows:

\[
\text{Sharpe Ratio} = \frac{E(r_p) - r_f}{\delta_p}, \quad \text{where } r_f = \text{the risk-free rate} \tag{20}
\]

Because the returns of a bet are realized over a period of two days on average, the effect of the risk-free rate is negligible. Mathematically, the compounded risk-free rate over two days is equal to:

\[
(1 + r_f)^{2/360} - 1 \approx 0 \quad \text{for any reasonable risk-free rate.}
\]

Therefore formula 20 will be rewritten as:

\[
\text{Sharpe Ratio} = \frac{E(r_p)}{\delta_p} \tag{21}
\]
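
As a rough check of this approximation, take an annual risk-free rate of 2% (an assumed value, chosen purely for illustration and not taken from the data):

\[
(1 + 0.02)^{2/360} - 1 = e^{\ln(1.02)/180} - 1 \approx 0.00011 = 0.011\%
\]

This is negligible compared with the expected returns of the individual bets, which are of the order of several percent.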

The 477 matches that were analysed were played over a period of 1.5 years. Because of this, it is not possible to bet on all of them simultaneously, and hence it is not possible to combine all these bets in one portfolio. Therefore, bets are placed every week, which results in a total of 53 maximized "subportfolios", each consisting of 9 matches on average. The following table gives a summary of the first subportfolio:

4. A risk-averse investor "penalizes" the expected return of a risky portfolio by a certain percentage to account for the risk involved.

Table 8: I. A summary of a subportfolio which is virtually composed on August 12, 2011.

MatchID  Preferred  Corresponding    Corresponding       Corresponding
         Outcome    Expected Return  Standard Deviation  Average Odds
4600     1           16.41%           81.74%              1.738
4601     2          284.14%          612.26%             13.600
4602     2           36.16%          195.38%              4.165
4603     2           29.43%          233.45%              5.505
4604     1           -1.46%           47.36%              1.213
4605     2           31.54%          207.95%              4.603
4606     1           -6.31%           82.82%              1.669
4607     1           16.00%           63.08%              1.503
4608     1            2.30%           67.91%              1.449

Note that the bet with MatchID 4601 is excluded because its average odds exceed the specified value of 7. The following figure shows the efficient frontier, which is a graph of the lowest possible variance that can be attained for a given portfolio expected return. Given the expected returns and the variances as input, the minimum possible variance for any targeted expected return is calculated (see footnote 5).

Figure 8: Portfolio expected return as a function of the standard deviation

As can be seen, all the individual bets lie inside the frontier. This implies that risky portfolios comprising only a single bet are inefficient. Diversifying investments leads to portfolios with higher expected returns and lower standard deviations.

Besides the efficient frontier, the Capital Allocation Line (CAL) is also drawn, indicated by the red line. This line shows all possible combinations of the risky portfolio and the risk-free asset (whose rate is approximately equal to zero for such a short time interval). The steeper the line, the higher the expected return earned for any level of volatility. Because the investor is risk-averse, he will seek the highest Sharpe ratio, that is, the steepest slope of the CAL. The point where the CAL is as steep as possible is marked by the letter P; at this point the CAL is tangent to the efficient frontier. Every bettor will invest in portfolio P regardless of his preferences; the preferences only determine how much he invests in the risk-free asset and how much in the risky portfolio. The weights attached to every bet are calculated with a spreadsheet calculator and are shown in the following table:

Table 9: II. A summary of a subportfolio which is virtually composed on August 12, 2011.

MatchID  Corresponding    Corresponding       Weights  Portfolio        Portfolio
         Expected Return  Standard Deviation           Expected Return  Volatility
4600      16.41%           81.74%             28.25%   20.45%           48.50%
4601     284.14%          612.26%             -
4602      36.16%          195.38%             10.90%
4603      29.43%          233.45%              6.21%
4604      -1.46%           47.36%              0.00%
4605      31.54%          207.95%              8.39%
4606      -6.31%           82.82%              0.00%
4607      16.00%           63.08%             46.25%
4608       2.30%           67.91%              0.00%
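
The portfolio figures in Table 9 can be checked from the listed weights with formulas 17, 19 and 21. The sketch below only evaluates the portfolio for the given weights; it does not reproduce the optimization that produced them, which was done with a spreadsheet calculator. The function name is hypothetical, and the returns of the bets are taken to be uncorrelated, as in formula 19.

```python
import math

# MatchID -> (expected return, standard deviation, weight), copied from Table 9
# for the bets with a non-zero weight.
SUBPORTFOLIO = {
    4600: (0.1641, 0.8174, 0.2825),
    4602: (0.3616, 1.9538, 0.1090),
    4603: (0.2943, 2.3345, 0.0621),
    4605: (0.3154, 2.0795, 0.0839),
    4607: (0.1600, 0.6308, 0.4625),
}


def portfolio_stats(bets):
    """Return (expected return, volatility, Sharpe ratio) of a portfolio of bets."""
    expected = sum(w * mu for mu, _, w in bets.values())              # formula 17
    volatility = math.sqrt(sum((w * sd) ** 2 for _, sd, w in bets.values()))  # formula 19
    return expected, volatility, expected / volatility                # formula 21


expected, volatility, sharpe = portfolio_stats(SUBPORTFOLIO)
print(f"expected return {expected:.2%}, volatility {volatility:.2%}, Sharpe ratio {sharpe:.3f}")
```

The portfolio volatility is lower than the standard deviation of every bet in the portfolio except 4604, which has zero weight, while the expected return stays above that of the two lowest-risk bets with a positive weight.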

Now that the staging method has been explained, the following figure shows how a bankroll develops over time with and without a staging plan. Under the strategy without a staging plan, an equal amount is bet on every quality-bet, equal to 1/277 unit. Under the strategy with a staging plan, every subportfolio is assigned an equal amount, despite the fact that the Sharpe ratios of the subportfolios differ. So, for every subportfolio 1/53 unit is available to distribute.

Figure 9: Development Bankroll with and without a staging plan for 277 quality-bets

As can be seen, both bankrolls develop through relatively large shocks, implying that bets are relatively risky assets. Looking at the level of the bankrolls, it can be concluded that the overall level of the bankroll using a staging plan is higher, indicating a higher return. However, the two bankrolls develop in a very similar way and show a high correlation. The higher overall level of the bankroll using a staging plan is mainly caused by a single relatively large win occurring around quality-bet #110, not by a consistently higher growth rate. Therefore, on the basis of the wide confidence interval shown in figure 5 and the development shown in figure 9, it cannot be concluded, despite the higher return, that using a staging plan is superior to a betting strategy with equal amounts. This requires further study and needs to be tested on a larger number of bets.

6 Conclusion

A multinomial logistic regression model is presented for analysing and forecasting football matches. During the research different sets of variables were tested; the combination of the relative goal balance of the home and away team and the number of spectators proved to be significant, and this combination is used to forecast the result of a match. After the parameters of the model were estimated on the basis of the Eredivisie seasons 1996/1997 to 2010/2011, the model was used to forecast matches of the seasons 2011/2012 and 2012/2013.

The forecasts of the model are exploited by using them in football bets. Two different staging plans are used: a simple staging plan in which an equal amount is bet on each quality-bet, and a sophisticated staging plan in which the amount bet is determined by the ratio of risk and expected return. Both strategies result in a profit. The difference between the two strategies is quite small, implying that the sophisticated staging plan is not superior to the simple staging plan.

Although the achieved results are quite promising, I believe that the model could be further improved. First, the current model does not take the fatigue of players into account. Secondly, at the beginning of the season the two strength variables are not representative of the true strength of the teams, which results in incorrect estimations. Besides improvements to the model, the staging plan needs further examination; the minor differences between the two staging plans could be caused by the relatively low number of bets. Therefore it would be interesting to test whether the achieved results hold when the number of bets increases substantially. This could be achieved by testing the model in foreign competitions, which offer the option to bet on more matches.

7 References

Bodie, Kane, and Marcus. Investments and Portfolio Management. McGraw-Hill Education, ninth edition, 2011.

Boulier and Stekler. Predicting the outcomes of National Football League games. International Journal of Forecasting, pages 257–270, 2003.

Carmichael and Thomas. Home-Field Effect and Team Performance: Evidence from English Premiership Football. Journal of Sports Economics, 2005.

Fama. Efficient capital markets: A review of theory and empirical work. The Journal of Finance, pages 383–417, 1970.

Hill, Griffiths, and Judge. Undergraduate Econometrics, chapter 15 Distributed Lag Models. Wiley, 1997.

Koopman and Lit. A Dynamic Bivariate Poisson Model for Analysing and Forecasting Match Results in the English Premier League. PhD thesis, Tinbergen Institute, 2012.

Kuyper. Information and efficiency: an empirical study of a fixed odds betting market. Applied Economics, pages 1353–1363, 2000.

Menard. Applied Logistic Regression Analysis, pages 91–92. Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-106, second edition, 2001.

Nevill and Holder. Home Advantage in Sport: an Overview of Studies on the Advantage of Playing at Home. Adis International, 1999.

Pope and Peel. Information prices and efficiency in a fixed-odds betting market. Economica, pages 323–341, 1990.
