
An investigation into the forecast efficiency of UK bookmakers’ betting-odds : the Barclays Premier League


Academic year: 2021


efficiency of UK bookmakers’ betting-odds

The Barclays Premier League

Owen Keating University of Amsterdam Faculty of Economics and Business

21st January 2015 Research Supervisor Rutger Teulings University of Amsterdam r.m.teulings@uva.nl

This paper presents an ordered probit regression model designed to forecast the results of football matches from the Barclays Premier League. The model is used to test the efficiency of using bookmakers’ odds as forecasts of match results. A number of economic tests are also employed to further test the efficiency of bookmakers’ odds. The results show that bookmakers’ odds underperform the odds of the model presented in the study. The results also indicate that betting according to the model’s probabilities can yield high levels of positive returns.


Statement of Originality

This document is written by Student Owen Keating, who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


1. Introduction
2. The implications of an inefficient betting market: a new perspective
3. What is the Efficient Market Hypothesis?
4. Literature Review: Modelling and forecasting football match results
5. Forecasting Match Outcomes: Empirical Models
5.1 Model 1 – Naïve model
5.2 Model 2 – Intricate model
6. Data
7. Results
7.1 Significant estimates from the Naïve model and the Intricate model
7.2 Naïve model (small) vs Intricate model
7.3 Bookmakers’ implicit probabilities – inferences
7.4 Correlations between bookmakers
8. Testing the efficiency of bookmakers’ odds
8.1 Empirical tests
8.2 Economic efficiency tests
9. Conclusion
10. References
11. Appendix


1. Introduction

According to a report by Deloitte (2013), the economic impact of the betting industry in the UK extends to £5 billion gross value added and 100,300 full time equivalent jobs when also including the direct, indirect and induced contributions of the industry. Yet the betting industry has been drawing increasing attention from researchers for reasons other than its economic implications. Firstly, it has enabled researchers to feasibly conduct empirical investigations with regard to the efficiency of information markets. Fixed odds betting markets have been particularly convenient to use as the odds are set several days prior to an event and they do not alter in response to the betting activities prior to the event (Kuypers, 2000). Secondly, there exists a consensus with regard to the parallels between wagering in the betting markets and trading in the financial markets (Graham & Stott, 2008; Bruce & Marginson, 2014). Both settings comprise investors who have access to rich, and widely available, information sets and who interact in a zero-sum game, whilst aiming to profit through trading (or betting), as uncertainty is resolved over time (Levitt, 2004; Williams, 2009). Studying the efficiencies in the betting markets has enabled researchers to gain a useful perspective for interpreting the behaviours of participants in the financial markets, whilst also giving insights into the operation of such markets (Williams, 2009). Thirdly, the betting industry has also captured the attention of those curious about the possibility of profiting from poorly set odds (Graham & Stott, 2008).

Deloitte’s report (2013) on the economic impact of the betting industry offers an additional, and important, reason for studying the efficiency of the betting markets, which has not previously been factored into similar studies. Poorly set odds can leave bookmakers exposed to tremendous amounts of risk. Any bettors who are able to identify and exploit mispriced odds can inflict substantial losses on bookmakers. Consequently, a significant number of jobs could be placed in jeopardy.

This paper will first offer a new perspective for studying the efficiency of betting markets. Subsequently, it will analyse the efficiency of bookmakers’ odds as forecasts for football match outcomes. Put differently, this paper will study the Efficient Market Hypothesis (EMH) – which states that a market is efficient with respect to a particular information set if it is not possible to make excess returns using that information set (Fama, 1970). In doing so it will utilise a combination of explanatory variables that has not been pooled together in a similar manner by previous studies. Furthermore, this paper is the first to focus on predicting the results of matches specifically played by high performing teams from the Barclays Premier League. The paper will also offer an update to the research so far, by focusing on data from 2010-11 onwards. The overall approach in this paper is formulated in line with the findings of two previous studies – one is by Professor John Goddard and Dr. Ioannis Asimakopoulos, 2004, and another is by a student named Jasmine Xu, 2011.


In their study, Professor Goddard and Dr. Asimakopoulos (2004) presented an ordered probit model that encapsulated a wide range of explanatory variables, besides past match results, and with this model they concluded that the bookmakers’ odds are inefficient. They also directly tested the efficiency of bookmakers’ prices by using various betting strategies, which led them to the same conclusion. They were unclear as to whether the inefficiencies they discovered were systematic, random, or both. Lastly, they found some evidence that the inefficiencies in bookmakers’ odds had diminished over time.

In her study, Xu (2011) attempted to forecast match outcomes using a different approach. She employed a binary model, where the two event outcomes were a home win and not a home win, the latter comprising draw and away win outcomes. She also used a very different set of explanatory variables. Her study appears to show some weak evidence for information inefficiency in bookmakers’ odds. However, her model failed to outperform the bookmakers’ odds when she tested the EMH directly using different betting strategies.

In their studies, Goddard and Asimakopoulos (2004) and Xu (2011) employ varying methods when conducting both empirical and economic efficiency tests. This paper combines the methods of both studies to offer a more up-to-date insight into the subject and, consequently, provides a contribution to the existing literature on the efficiency of information markets – a matter often studied through empirical and economic tests of the EMH.

To analyse the efficiency of bookmakers’ odds as forecasts for football match outcomes, this paper first presents an ordered probit model, which is designed to predict the results of football matches from the Barclays Premier League. Next, a binary probit model is set up to show that bookmakers’ odds are significant predictors of match outcomes. Then, a second binary probit model is set up. This model incorporates both the normalised bookmakers’ odds and an additional term which models the difference between the information incorporated in the normalised bookmakers’ odds and the information incorporated in the probabilities generated by the ordered probit model. If the EMH holds and bookmakers’ odds efficiently incorporate publicly available information, the second binary model will not produce any additional relevant information; in other words, the additional term in the second binary model will be insignificant. This makes it possible to test the efficient market hypothesis and, thus, the efficiency of bookmakers’ odds as forecasts for football match outcomes.
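As an illustration of what "normalised bookmakers’ odds" can mean, the sketch below converts a set of quoted odds into probabilities that sum to one. It assumes decimal odds and a simple proportional normalisation (dividing each inverse odd by their sum); the paper may use a different odds format or normalisation, and the function name and example odds are hypothetical.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Turn decimal odds for the three outcomes into normalised probabilities.

    Assumption: proportional normalisation, i.e. each inverse odd is divided
    by the sum of the inverse odds so that the bookmaker's margin is removed.
    """
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    overround = sum(raw)  # exceeds 1 by the bookmaker's margin
    return [p / overround for p in raw]


# Hypothetical odds for one match: home win, draw, away win.
probs = implied_probabilities(1.8, 3.6, 4.5)
print(probs)
```

The difference between these probabilities and the ordered probit model’s probabilities is the kind of quantity the additional term in the second binary model is described as capturing.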

Finally, this paper also presents several economic tests that directly assess the EMH. These tests are based on the ex post returns that could be generated by placing bets using different strategies, which include passive betting such as placing a bet on every possible outcome for each match and more selective strategies such as betting with the bookmaker that offers the best price, amongst others. First, the betting strategies are applied according to the probabilities presented by the bookmakers’ odds. Subsequently, similar betting strategies are applied to the match outcome with the most favourable return, according to the probabilities generated by the ordered probit model.
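A minimal sketch of how an ex post return could be computed for such strategies is given below. It assumes decimal odds and unit stakes; the function names, bookmaker labels and odds are hypothetical and are not taken from the paper.

```python
def ex_post_return(bets):
    """Return (payout - total stake) / total stake for a list of settled bets.

    bets: list of (stake, decimal_odds, won) tuples, where won is True if
    the outcome that was backed actually occurred.
    """
    staked = sum(stake for stake, _, _ in bets)
    payout = sum(stake * odds for stake, odds, won in bets if won)
    return (payout - staked) / staked


def best_price(quotes):
    """Pick the bookmaker offering the highest decimal odds for one outcome.

    quotes: dict mapping a (hypothetical) bookmaker label to decimal odds.
    """
    bookmaker = max(quotes, key=quotes.get)
    return bookmaker, quotes[bookmaker]


# Passive strategy: one unit on each outcome of a match that ended in a home win.
passive = [(1.0, 1.8, True), (1.0, 3.6, False), (1.0, 4.5, False)]
print(ex_post_return(passive))

# Selective strategy: back the home win at the best available price.
print(best_price({"bookmaker_a": 1.80, "bookmaker_b": 1.85}))
```

In this example the passive strategy stakes three units and receives 1.8 back, a return of about -40%, which shows how the bookmaker’s margin weighs on indiscriminate betting.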

The remainder of this paper is structured in the following manner – section 2 offers a reinforcement of the aforementioned new perspective for studying the efficiency of the betting market; section 3 elaborates on the EMH; section 4 reviews the existing research on the modelling as well as forecasting of football match results; section 5 describes two forecasting models for predicting match outcomes; section 6 describes the data used in this study; section 7 is devoted to reporting results; section 8 presents empirical and economic efficiency tests; and section 9 provides a conclusion and suggestions for future research.

2. The implications of an inefficient betting market: a new perspective

Most studies thus far have focused on analysing information efficiencies in the betting market as a tool for understanding the operations and the participant behaviours in the financial markets. However, given that the betting industry in the UK alone is able to support over 100,000 full time equivalent jobs, the potential for inefficiencies in this market merits further analysis. Keynes’ equation for gross domestic product (GDP) can be used to further illustrate this. Keynes (1936) denoted GDP by Y, and presented it as the sum of four categories – consumption (C), investment (I), government purchases (G), and net exports (NX), which can equivalently be written as:

Y – C – G = I + NX (1)

Equation (1) illustrates that national savings, on the left hand side, are equal to the sum of investments and net exports, on the right hand side. National savings can further be categorized as private and public savings in the following manner, where T represents taxes:

[ Y – T – C ] + [ T – G ] = I + NX (2)

Equation (2) can be utilised to study the impact of changes in household spending, business spending and government spending. According to Keynes’ General Theory, household, business and government spending plans determine an economy’s total income (1936). If people spend more, firms will be able to sell more of their goods and services. If firms are able to sell more, they will produce more and will, thus, hire more workers, thereby enabling an economy to prosper and grow.

For continued growth, the betting industry first requires consumers who are willing to place bets. It also requires bookmakers to be able to aggregate information much more efficiently than bettors. If bookmakers’ prices do not reflect all available information then the betting market would be inefficient. This would make way for bettors to devise highly profitable betting strategies. As a result, the bookmakers will have to endure extensive losses. Continued losses will eventually result in the shutting down of land-based betting venues, and redundancies of those employed. As the industry offers opportunities to unqualified and unskilled labour (Association of British Bookmakers Ltd., 2013), it is likely that those made redundant will add to unemployment figures.

Logically, those unemployed would experience a decrease in their personal income, which means that less tax on personal income can be collected. A reduction in personal income also means that households would reduce their household spending. This would mean that households would reduce their betting tendencies. The reduced spending by consumers will lead to a loss of business for bookmakers, which will feed back into a reduced gross profit and a reduction in tax collected from these profits. Furthermore, the rise in unemployment is likely to result in increased costs for Her Majesty’s Revenue and Customs, through an increased burden on national schemes such as the Housing Benefit, Jobseeker’s Allowance, Council Tax Reduction, Income Support, amongst other low-income benefits. Eventually, all these forces will combine to have a negative impact on the society’s economy. Thus, it is important that bookmakers aggregate available information with greater efficiency than bettors.

3. What is the Efficient Market Hypothesis?

The efficient market hypothesis (EMH) is a concept developed in the late 1960s and early 1970s, by Eugene Fama. It states that a market is efficient so long as all available information is fully reflected in prices at any point in time (Fama, 1970, 1976). Fama offered three versions of the EMH – weak efficiency, semi-strong efficiency and strong efficiency (Holton, 2006). The three versions differ in their definition of what constitutes an information-set (Fama, 1976).

Fama reported that a market is weak form efficient if prices reflect any information contained in historical prices and, as such, a technical analysis cannot be used to outperform the market. Therefore, if the weak-form EMH holds, it should not be possible for bettors to outperform the football match result predictions of the bookmakers using historical prices only.

In semi-strong form efficient markets all publicly available information is reflected in prices (Fama, 1976). Thus, if the semi-strong form EMH holds, when football match outcome is regressed on a function of a bookmaker’s odds as well as other predictors, implicitly available as public information, those predictors should be found to be insignificant. If the other predictors are found to be statistically significant then it would appear that the bookmakers’ odds are not efficient at incorporating relevant publicly available information. Lastly, Fama (1976) defines strong form efficient markets as those where publicly and privately available information is fully reflected in prices. This paper does not test the strong form EMH, as privately available information was not accessible.


4. Literature Review: Modelling and forecasting football match results

Past studies on the betting industry have analysed the efficiency of the betting markets in a range of sports. The majority of the focus has been on horse race betting, though this has been shifting towards English and other European football betting. This is due to the fixed-odds feature of football bets. With fixed-odds bets, a bettor is aware at the time of placing the bet what his/her earnings would be. With horse race betting, where odds are not fixed, a bettor might receive a considerably lower return on their bet than what was expected at the time of placing the bet. The less dynamic structure of football bets, therefore, makes them far more attractive for studying betting market efficiency.

Modelling football match results evolved from studies dating back to the 1950s. One of the first key contributions is that by Moroney (1956). The author showed that the Poisson distribution offered an adequate fit to match outcomes. An alternative method was proposed in 1971 (Reep, Pollard & Benjamin), which entailed the utilisation of a negative binomial distribution. Subsequently, a study demonstrated that football matches had a predictable element to them (Hill, 1974). Almost a decade later another noteworthy contribution was made. It offered a forecasting model that accounted for the differences in the skills of the playing teams (Maher, 1982).

The early 1990s saw further contributions to the field and research relating to football match result forecasting began to increase rapidly around the late 1990s and early 2000s. Dixon and Coles developed a match-outcome forecasting model that could generate ex ante probabilities for both match scores and results (1997). The authors made improvements to a time-independent Poisson regression model by inventing an ad hoc adjustment mechanism for low-scoring matches. Rue and Salvesen built upon the authors’ framework in 2000 to introduce a novel, time-dependent rating method through the use of the Markov Chain model. In 2002, Crowder, Dixon, Ledford and Robinson developed a computationally less demanding method for updating team-strength parameters.

In 2000, Forrest and Simmons employed an ordered logit regression model to study how effective newspaper tipsters were at forecasting football match results. Later on, Dixon and Pope (2004) utilised the Dixon-Coles model to obtain probabilistic forecasts, which were then compared to the probabilities inferred by bookmakers’ prices. Goddard and Asimakopoulos (2004), as well as Forrest, Goddard and Simmons (2005) made further contributions to the field by employing an ordered probit regression model to study match results directly. The ordered probit models in both studies utilised a range of explanatory variables besides past match results. These included variables such as each team’s involvement in the FA Cup competition, the geographical distance between the teams’ grounds, the teams’ average attendance, amongst others.


In their study, Goddard and Asimakopoulos (2004) found evidence for semi-strong form inefficiency of bookmakers’ odds. They also found several betting strategies that yielded positive returns. The authors also conducted an economic test to account for a particular, potential, source of inefficiency – normally, bookmakers’ odds are compiled and published five days prior to the day when the match in question is played, so information about results for matches played within that time-frame, by either team, would not be impounded in the odds for the particular match in question. It was not clear from their study whether the trend they obtained for this test was systematic, random, or partly both. Nevertheless, they found some evidence that the inefficiencies in bookmakers’ odds diminished over time.

More recently, Xu (2011) employed a binary probit model to forecast football match outcomes. Though her study offered weak evidence for information inefficiency in bookmakers’ odds and her model did not outperform bookmakers’ odds during the economic testing conducted, Xu presented an interesting betting strategy, which she referred to as the Trim-the-Tails strategy. If a match outcome has a probability of less than 0.333 it is not too likely to occur. Betting on such an outcome would be referred to as risk-loving behaviour. If it has a probability of more than 0.667, then it is quite likely to occur and betting on such an outcome would be referred to as risk-averse behaviour. The Trim-the-Tails strategy allows for the elimination of this risk-loving and risk-averse behaviour amongst bettors (Xu, 2011).
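The filtering step of the Trim-the-Tails strategy can be sketched as follows. This is only an illustration of the thresholds quoted above (0.333 and 0.667); treating the boundaries as inclusive is an assumption, and the function name and probabilities are hypothetical rather than Xu’s own implementation.

```python
LOWER, UPPER = 0.333, 0.667  # thresholds quoted in the text


def trim_the_tails(outcome_probs):
    """Keep only outcomes whose probability lies between the two thresholds.

    outcome_probs: dict mapping an outcome label to its probability.
    Outcomes below LOWER (risk-loving bets) and above UPPER (risk-averse
    bets) are dropped. Inclusive boundaries are an assumption.
    """
    return {o: p for o, p in outcome_probs.items() if LOWER <= p <= UPPER}


# Hypothetical probabilities for one match.
print(trim_the_tails({"home": 0.55, "draw": 0.25, "away": 0.20}))
```

For this example only the home win remains eligible for a bet, since the draw and the away win both fall in the lower tail.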

This study follows in the footsteps of Goddard and Asimakopoulos (2004), to develop an empirical model based on publicly available historical information, whilst also adding additional, possibly significant, explanatory variables; the motivation for which arose from the study conducted by Xu (2011). The empirical model presented in this study will be used as a benchmark against which the efficiency of the bookmakers’ utilisation of publicly available information will be analysed. Subsequently, a range of economic tests, inspired by both the study conducted by Goddard and Asimakopoulos (2004) and by Xu (2011), are employed to compare the yields that can be generated using the bookmakers’ odds against those that can be generated using the empirical model.

5. Forecasting Match Outcomes: Empirical Models

A discrete choice model suitable for predicting football match results, where the dependent variable takes one of three possible non-numerical values – ‘home win’, ‘draw’, ‘away win’ – is the ordered probit model. A match outcome depends upon the unobserved (latent) variable $y^*_{i,j,z}$ in the following manner:

$$
\begin{aligned}
\text{Home win:} \quad & y_{i,j,z} = 1 && \text{if } \gamma_2 < y^*_{i,j,z} \\
\text{Draw:} \quad & y_{i,j,z} = 0.5 && \text{if } \gamma_1 < y^*_{i,j,z} \le \gamma_2 \\
\text{Away win:} \quad & y_{i,j,z} = 0 && \text{if } y^*_{i,j,z} \le \gamma_1
\end{aligned}
\tag{3}
$$

where $\gamma_1$ and $\gamma_2$ denote the cut-off points for the adjacent levels of the dependent variable and $y^*_{i,j,z}$ takes the following form:

$$
y^*_{i,j,z} = \sum_{k=1}^{K} \beta_k X_{k,i,j} + u_{i,j}, \qquad u_{i,j} \sim N(0,1), \qquad i,j = 1,\dots,n, \quad z = 1,\dots,Z
\tag{4}
$$

where $X_{k,i,j}$, $k = 1,\dots,K$, is a 1 by K matrix of regressors, $\beta$ is a K by 1 vector of parameters to be estimated, $y^*_{i,j,z}$ is the latent outcome of a match between the home team i and away team j, for the match in question, represented by the sub-index z, and $u_{i,j}$ is the error term. The error term is assumed to be normal as well as independent and identically distributed (i.i.d.). In other words, it is assumed that the unsystematic component in a match result does not vary directly, or inversely, with the outcome’s uncertainty. Two empirical models will be presented in this section, thus K will differ in each case. It is also assumed that there is no serial correlation.
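To make the link between the latent variable and the three outcomes concrete, the sketch below computes outcome probabilities from a given value of the systematic part of the latent variable and the two cut-off points, using the standard normal distribution of the error term. It assumes that an away win corresponds to the latent variable falling at or below the lower cut-off; the function names and numerical values are hypothetical and are not estimates from this paper.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def outcome_probabilities(index, gamma1, gamma2):
    """Ordered probit probabilities for away win, draw and home win.

    index:  systematic part of the latent variable (sum of beta_k * X_k).
    gamma1: lower cut-off (away win if latent value <= gamma1, an assumption).
    gamma2: upper cut-off (home win if latent value > gamma2).
    """
    p_away = normal_cdf(gamma1 - index)
    p_home = 1.0 - normal_cdf(gamma2 - index)
    p_draw = 1.0 - p_away - p_home
    return {"home": p_home, "draw": p_draw, "away": p_away}


# Hypothetical index and cut-off values, for illustration only.
print(outcome_probabilities(index=0.3, gamma1=-0.5, gamma2=0.5))
```

A larger index shifts probability mass from the away win towards the home win, which is why regressors expected to favour the home team are expected to carry positive coefficients.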

Unlike a typical regression model, the model used in this paper does not contain an intercept, $\beta_0$. This is because this investigation deals with a three-outcome ordered probit model that has two threshold points, which would be collinear with a $\beta_0$ intercept. It could be said that equating $\beta_0$ to 0 corresponds to choosing the first alternative; however, the two cut-off points in equation (3) sufficiently distinguish between which alternative will be chosen and, thus, leave no room for the intercept $\beta_0$ to play any role within the model.

5.1 Model 1 - Naïve model

First, a simple and rather naïve model is set up to account for the possible influences of a number of events that took place during the last game played by each team, regardless of whom the teams played against in their respective previous games. The model includes the number of shots by the home team in the previous match played by that team, $HS_{i,z-1}$, and the away team’s counterpart, $AS_{j,z-1}$. The sub-indices i and j distinguish which variable corresponds to the home team i, and which variable corresponds to the away team j. Additionally, since the sub-index z represents the match in question, the sub-index z-1 denotes that the variables included are from one match prior to the match in question.

The model also includes the number of shots on target by the home team, which is the number of times the team would have scored if an opposition team player had not saved the ball, $HST_{i,z-1}$, as well as the away team’s counterpart, $AST_{j,z-1}$. The model includes variables that capture a team’s ability to create scoring chances during a game – these include the number of times the home team hit the woodwork, which refers to the number of times the team hit the frame of the goal and came close to scoring, $HHW_{i,z-1}$, and its counterpart for the away team, $AHW_{j,z-1}$.

Both the attack and the defence of a team are modelled through the inclusion of three pairs of variables – the number of home team tackles, $HTac_{i,z-1}$, and the number of away team tackles, $ATac_{j,z-1}$, to model the ability of a team to dispossess an opponent; the number of home dribbles, $HD_{i,z-1}$, and the number of away dribbles, $AD_{j,z-1}$, to model a team’s ability to take on an opponent and successfully make it past them while holding possession of the ball; and the number of shots blocked, $HB_{i,z-1}$ and $AB_{j,z-1}$, to model a team’s ability to prevent an opponent’s shot from reaching the goal. Furthermore, the number of home corners, $HC_{i,z-1}$, and the number of away corners, $AC_{j,z-1}$, as well as the number of home throw ins, $HThrins_{i,z-1}$, and away throw ins, $AThrins_{j,z-1}$, during each team’s previous performance are included. The importance of these factors lies in that they enable a team to create more chances for scoring a goal during a match. As such, it is expected that the home team counterparts of these variables will have a positive impact on the probability of a home win, thus resulting in a higher value of $y^*_{i,j,z}$, whilst the away team counterparts are expected to have a negative impact on the probability of a home win.

The illegal manoeuvres of a player, fouls and offsides, are penalised: in case of a foul, a free kick or a penalty is given to the opposition, and in case of an offside, a free kick is given to the opposition. The higher the number of fouls or offsides committed by a team, the more chances are given to the opposition to score a goal. Thus, these factors are modelled through the inclusion of the number of home team fouls, $HF_{i,z-1}$, and home offsides, $HO_{i,z-1}$, as well as the away team counterparts.

It is expected that the coefficient of $HO_{i,z-1}$ will carry a negative sign, representing that a rise in home team offsides will impede the likelihood of a home win outcome. Intuitively this would make sense because an offside is a result of a failed attack. The number of offsides does not provide information about how often a team attacks; it provides information about how efficiently they attack when they do. A high number of offsides reflects negatively on the efficiency of the team’s attacking competence. Additionally, a high number of offsides results in reduced possession for the team that committed the illegal manoeuvre, which intuitively will have a negative impact on the probability of scoring and thus winning. It is also expected that the coefficient of $AO_{j,z-1}$ will carry the opposite sign, representing that a rise in the away team offsides will increase the likelihood of a home win outcome.

The coefficient on the variable $HF_{i,z-1}$ is expected to have a negative sign. Since a foul grants possession or a goal scoring opportunity to the opposition, a high number of fouls should reduce the likelihood of a home win. It is also expected that the coefficient of $AF_{j,z-1}$ will carry a positive sign, representing that a rise in the away team fouls will increase the likelihood of a home win outcome.

Lastly, yellow and red cards are also utilised to indicate warnings and to signify when a player is sent off, modelled as home yellow cards, $HY_{i,z-1}$, away yellow cards, $AY_{j,z-1}$, home red cards, $HR_{i,z-1}$, and away red cards, $AR_{j,z-1}$. Both of these factors are also expected to impede a team’s probability of winning.


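As a result, the naïve model, the parameters of which are listed in Appendix 1, is as follows:

$$
\begin{aligned}
y^*_{i,j,z} ={} & \beta_1 HS_{i,z-1} + \beta_2 AS_{j,z-1} + \beta_3 HST_{i,z-1} + \beta_4 AST_{j,z-1} + \beta_5 HHW_{i,z-1} + \beta_6 AHW_{j,z-1} \\
& + \beta_7 HTac_{i,z-1} + \beta_8 ATac_{j,z-1} + \beta_9 HD_{i,z-1} + \beta_{10} AD_{j,z-1} + \beta_{11} HB_{i,z-1} + \beta_{12} AB_{j,z-1} \\
& + \beta_{13} HC_{i,z-1} + \beta_{14} AC_{j,z-1} + \beta_{15} HThrins_{i,z-1} + \beta_{16} AThrins_{j,z-1} + \beta_{17} HF_{i,z-1} + \beta_{18} AF_{j,z-1} \\
& + \beta_{19} HO_{i,z-1} + \beta_{20} AO_{j,z-1} + \beta_{21} HY_{i,z-1} + \beta_{22} AY_{j,z-1} + \beta_{23} HR_{i,z-1} + \beta_{24} AR_{j,z-1} + u_{i,j}
\end{aligned}
\tag{5}
$$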

5.2 Model 2 - Intricate model

Next, a more intricate model is set up. This model is much more intricate than the model presented in section 5.1 as it not only includes those variables from the naïve model that are found to be statistically significant, at the 0.05 level, in determining the outcome of a match, but it also includes a set of additional variables that act as much better indicators of ‘team quality’ and ‘recent performance’. These variables account for each team’s performance over a longer duration of time prior to the match outcome in question, as opposed to the variables in the naïve model, which only look at each team’s performance in the previous match. Thus, the variables in the intricate model are more likely to better explain the overall match outcome. Moreover, the model includes several other explanatory variables that account for factors such as: the influence of each team’s involvement in the FA Cup on their performance during the matches from the Premier League; the advantage of playing on the home ground (home-advantage); ‘big team’ effect, amongst others. These variables are included in the model to prevent omitted variable bias and are described in greater detail in the subsequent sub-sections.

5.2.1. Win ratios over the past 24 months

The model includes both teams’ win ratios over the previous 24 months, $P^d_{i,t,s,z}$ and $P^d_{j,t,s,z}$. These win ratios pose as the main ‘team quality’ indicators in this model. They are computed using the following conversions: win = 1, draw = 0.5 and loss = 0. The win ratio variables are partitioned based on the time frame that they belong to, with respect to the ‘current observation match’. The sub-index t indicates whether the win ratio variables are computed from matches, played by the respective team, within 12 months of the ‘current observation match’, t=0, or within 12 to 24 months of the ‘current observation match’, t=1. The sub-index s indicates whether the win ratio variables are computed from matches, played by the respective team, within the current season, s=0, one season ago, s=1, or two seasons ago, s=2. As before, the sub-indices i and j distinguish between the win ratios with respect to the home team i and the away team j. Lastly, the sub-index z refers to the match observation in question; this sub-index is utilised on several occasions throughout the remaining sub-sections of section 5 and carries the same meaning.


To summarise, the partitioning of the win ratio variables involves the computation of the win ratios according to whether the match data is from the current season (t=0, s=0); one season ago but within 12 months of the current observed match (t=0, s=1); one season ago but more than 12 months before the current observed match (t=1, s=1); or two seasons ago and within 12 to 24 months of the current observed match (t=1, s=2) (Goddard & Asimakopoulos, 2004).
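The partitioning by t and s can be sketched as below. This is a minimal illustration under stated assumptions: 12 months is approximated as 365 days, matches older than 730 days are ignored, partitions without matches are simply absent from the result, and the division index is left out. The function name, the input layout and the example data are hypothetical.

```python
from collections import defaultdict

POINTS = {"W": 1.0, "D": 0.5, "L": 0.0}  # win = 1, draw = 0.5, loss = 0


def partitioned_win_ratios(matches):
    """Compute a win ratio for each (t, s) partition of a team's past matches.

    matches: iterable of (days_before, seasons_ago, result) tuples, where
    days_before is the number of days before the current observed match and
    result is "W", "D" or "L" from the team's point of view.
    """
    buckets = defaultdict(list)
    for days_before, seasons_ago, result in matches:
        if days_before > 730:  # outside the 24-month window
            continue
        t = 0 if days_before <= 365 else 1
        buckets[(t, seasons_ago)].append(POINTS[result])
    return {key: sum(values) / len(values) for key, values in buckets.items()}


# Hypothetical match history for one team.
history = [(30, 0, "W"), (60, 0, "D"), (300, 1, "L"), (400, 1, "W"), (800, 2, "W")]
print(partitioned_win_ratios(history))
```

How partitions with no matches should enter the regression is a modelling decision that this sketch does not settle.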

Two of the teams in the sample of this study had been relegated at the end of season 2008-09. To account for the difference in divisions the win ratios have been further partitioned according to division, where d=0 implies that the win ratio is computed based on data from the current division and d=-1 implies a computation based on data from one division below (Goddard & Asimakopoulos, 2004). Accordingly, there are seven win ratio variables for each respective team, per match: (t=0 s=0 d=0), (t=0 s=1 d=0), (t=1 s=1 d=0), (t=1 s=2 d=0), (t=0 s=1 d=-1), (t=1 s=1 d=-1), (t=1 s=2 d=-1). The eighth possible combination, (t=0 s=0 d=-1), is not included because, as mentioned in the introduction, this paper focuses on data from the Barclays Premier League, season 2010-11 onwards. Also, as mentioned above, the relegation of the two teams from the study sample occurred at the end of season 2008-09 but they had been promoted back into the Premier League at the end of season 2009-10. Therefore, from season 2010-11, all teams in the study sample played in the Premier League. As such, there would be no observations for the variable (t=0 s=0 d=-1) for any of the sample matches. So, the variable (t=0 s=0 d=-1) can be excluded from the model. For the win ratio variables that are included in the model, in each case, three of the seven variables have no observations; however, as the combination of these variables differs in each case, it would not be appropriate to simply exclude the variables with no observable data.

According to experimentation conducted by Goddard and Asimakopoulos (2004), win ratio coefficients for matches occurring 24 to 36 months preceding the match in question are not significant. Therefore, only win ratios over the previous 24 months are considered.

5.2.2. Recent match results

Each team’s results for their most recent matches played on home ground, $R^H_{i,z-m}$ and $R^H_{j,z-n}$, and their most recent matches played on away ground, $R^A_{i,z-n}$ and $R^A_{j,z-m}$, are also included, with the full time result for each match being converted to a numerical value according to win=1, draw=0.5 and loss=0 with respect to the team in question. The superscript H indicates that the result is for a match played on the home ground, and the superscript A indicates that the result is for a match played on the away ground. Again, the sub-index z represents the match in question. The variables are incorporated for z-m, with m ≤ 9, and z-n, with n ≤ 4. Both m and n represent the number of most recent matches that are being taken into consideration. Also, z-m and z-n represent which specific matches from the sample are being taken into consideration.

The home results window is longer for team i because team i is playing at home, so a longer run of its results on the home ground is more relevant. For team j, the away team, the variable RAj,z-m is computed for m ≤ 9 because the team is playing away, and the variable RHj,z-n is computed for n ≤ 4. To gauge how team j might perform away from home, a longer run of its results on away grounds is more relevant. Information about team i’s performance on away grounds is less relevant to how it will perform on the home ground, just as information about team j’s performance on the home ground is less relevant when it plays as the away team.

In their study, Goddard and Asimakopoulos (2004) experimented with the coefficients on the recent match result variables and found the coefficients for m=10,11,12 and n=5,6 not to be significant. Thus, this paper confines itself to the recent match result variables for m ≤ 9 and n ≤ 4.
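The construction of the recent match result variables can be sketched as follows. This is only an illustration under stated assumptions: the input layout (a chronological list of venue and result pairs) and the function names are hypothetical and are not taken from the data sources used in this paper.

```python
# Numeric coding of full time results, as described in the text.
RESULT_VALUE = {"W": 1.0, "D": 0.5, "L": 0.0}


def recent_results(history, venue, k):
    """Return up to k most recent results at a venue, most recent first.

    history: chronological list of (venue, result) tuples, with venue in
    {"H", "A"} and result in {"W", "D", "L"} from the team's own point of
    view. This layout is a hypothetical choice made for the sketch.
    """
    at_venue = [RESULT_VALUE[res] for ven, res in history if ven == venue]
    return at_venue[::-1][:k]


def home_team_covariates(history):
    # Home team i: last 9 home results (m <= 9), last 4 away results (n <= 4).
    return recent_results(history, "H", 9), recent_results(history, "A", 4)


def away_team_covariates(history):
    # Away team j: last 4 home results (n <= 4), last 9 away results (m <= 9).
    return recent_results(history, "H", 4), recent_results(history, "A", 9)


# Hypothetical match history for one team, oldest match first.
history = [("H", "W"), ("A", "L"), ("H", "D"), ("A", "W"), ("H", "L")]
rh, ra = home_team_covariates(history)
```

Here `rh` holds the team's home results and `ra` its away results, each ordered from the most recent match backwards.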

5.2.3. Significance of the match in terms of relegation

The model incorporates two dummy variables that model whether the match has relegation significance for either team, SIGHi,z and SIGAj,z. The possibility to be relegated can create a difference in the incentives faced by two teams and influence the outcome of a match. These dummy variables have been set up such that each variable is 1 if the match in question had relegation significance for the team in question and 0 if it did not. A match is deemed significant if before the start of the match it is possible for the team in question to be relegated, assuming that all other teams currently in contention for relegation take one point on average from each of their remaining games.
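One possible reading of this rule is sketched below. Several details are assumptions made for the illustration and are not stated in the text: the team in question is taken to gain no points from its remaining games, the one point per game projection is applied to every other team in the table rather than only to those in contention, three teams are relegated, and ties on points are treated as unsafe. The function name and input layout are hypothetical.

```python
def relegation_significant(team, points, games_left, relegation_places=3):
    """Return 1 if the team could still be relegated under the rule, else 0.

    points, games_left: dicts keyed by team name, read from the league
    table before kick-off (hypothetical layout).
    """
    # Worst case for the team in question: no further points.
    worst_case = points[team]
    # Every other team is projected to take one point per remaining game.
    projected = [points[t] + games_left[t] for t in points if t != team]
    # The team is safe only if enough rivals finish strictly below it.
    strictly_below = sum(1 for p in projected if p < worst_case)
    return int(strictly_below < relegation_places)


# Hypothetical five-team table with two games left for every team.
points = {"A": 40, "B": 30, "C": 28, "D": 27, "E": 45}
games_left = {t: 2 for t in points}
sig_a = relegation_significant("A", points, games_left)
sig_b = relegation_significant("B", points, games_left)
```

In this toy table team A is already safe, so its dummy is 0, whereas team B can still be caught and its dummy is 1.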

The significance of a match was not considered in terms of promotion significance for the team since the Premier League is the highest league that a team can be in. Additionally, the significance of a match was not considered in terms of whether it could have meaningful impact on a team’s chances of becoming champion of the Premier League. As mentioned earlier, to determine whether a match has relegation significance, it is assumed that all other contenders for relegation will earn one point on average during the rest of the season. However, such an assumption cannot be applied to determine the significance of a match for winning the Premier League. Nor can it be assumed that all other contenders for the top position would win each of their pending games. The way strong competitors perform towards the end of a season can hugely influence whether a particular match has


greatly increase the complexity of the model. As such, matches are only considered in terms of their relegation significance.

5.2.4. FA Cup involvement

Early elimination from the FA Cup has the potential to influence match results in a positive as well as a negative manner. For instance, elimination from the FA Cup could raise a team’s probability of winning, as the team is able to focus its efforts on the league alone. However, it could also result in players losing confidence in their abilities, thereby reducing the likelihood of winning a match. The dummy variables CUPHi,z and CUPAj,z are employed to model whether either team has been eliminated from the FA Cup. These variables are 1 when the team in question is still a participant in the FA Cup at the time of the match in question, and 0 when the team is no longer a participant in the FA Cup.

5.2.5. Geographical distance between the team’s grounds

Previous studies have found geographical distance to have a significant influence on match outcomes (Clarke and Norman, 1995; Goddard and Asimakopoulos, 2004). According to Goddard and Asimakopoulos (2004), the advantage of playing on home ground may be partially offset in local derbies, where both teams are closely located, due to the greater intensity of competition. They also suggest that the home advantage may be increased in matches between teams that are located much further apart, due to the psychological as well as practical difficulties of long-distance travel faced by teams as well as spectators. Hence, the model in this paper incorporates the variable DISTi,j to represent the distance between the grounds of the two teams that are playing in the match in question.

5.2.6 Attendance, relative to a team’s league position

In their study, Goddard and Asimakopoulos (2004) found the ‘big team’ effect to be a highly significant determinant of match outcomes. Intuitively this makes sense, as there can be both psychological and material advantages of having a large support. For example, the crowd can directly influence the match result, and a large support can mean a larger revenue base, which allows for more resources to be spent on players.

An attendance variable can aid in modelling the ‘big team’ effect. However, research on what determines match attendance suggests that attendance is related to league position within a league (O’Connor, n.d.). The study found that league position affects attendance, which suggests that fans exhibit glory-seeking rather than loyal behaviour. It has also been found that fans alter their attendance patterns following the promotion or relegation of teams. Thus, when incorporating an attendance variable it is necessary to control for league position.


This paper follows the method employed by Goddard and Asimakopoulos (2004), as well as Forrest et al. (2005), for modelling the ‘big team’ effect by utilising the residuals of the following cross-sectional regressions:

Season k=1: Ln(ATT) = α1 + β1POS1 + RESID1 (6)

Season k=2: Ln(ATT) = α2 + β2POS2 + RESID2 (7)

where, for the team in question, Ln(ATT) denotes the natural logarithm of the average attendance for the team, POS denotes the end of league position of the team and RESID denotes the residual value.

The average attendance of a team is partly explained by the team’s end of league position k seasons prior to the current season, for k=1,2. This approach is employed for all three seasons, therefore the end of league positions during the seasons 2008-09 as well as 2009-10 have to be utilised. However, as stated in section 5.2.1, two of the teams from the study sample were in a lower division for the duration of the season 2009-10. This was accounted for by ranking each team’s final league position on a scale of 92 to 1, as there are 92 football teams in the League, which includes Levels 1 – 4 of the English football league system: the Premier League, Football League Championship, Football League One, and Football League Two. This particular scale is utilised because, for instance, at the end of the 2009 – 10 season, one season prior to the 2010 – 11 season, Newcastle finished at the top of the Championship. In other words, their league position fell outside of the Premier League. Therefore, the teams from the Championship had to be taken into account in the ranking scale. However, since the Championship is one of the three divisions that fall in the Football League, the remaining divisions were also counted in the ranking scale. The rank 92, therefore, represents the team at the top of the League, the winner of the Premier League, and 1 represents the team at the bottom of the Football League Two.

Consequently, the cross-sectional regressions are conducted for k=1 and k=2. The covariate ATTPOSi,1 corresponds to RESID1, for k=1, which is denoted by the sub-index 1, and the covariate ATTPOSi,2 corresponds to RESID2, for k=2, which is denoted by the sub-index 2, for team i. Team j’s counterparts, ATTPOSj,1 and ATTPOSj,2, are computed as well. As values over successive seasons have the tendency to be highly correlated, it is assumed that the residual values will be serially correlated. Therefore, ATTPOSi,1 is replaced by ΔATTPOSi,1 = ATTPOSi,1 – ATTPOSi,2 (Goddard & Asimakopoulos, 2004; Forrest et al.,2005). Similarly, ATTPOSj,1 is replaced by ΔATTPOSj,1 = ATTPOSj,1 – ATTPOSj,2.
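The residual-based attendance covariates can be illustrated with a short sketch. The attendance figures and league ranks below are invented purely for illustration, and the helper name `ols_residuals` is hypothetical; the sketch only shows the mechanics of regressions (6) and (7) and of the differencing step, assuming a simple least squares fit with one regressor.

```python
import math


def ols_residuals(x, y):
    """Residuals from the simple regression y = a + b*x + e."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (
        sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        / sum((xi - mean_x) ** 2 for xi in x)
    )
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data for five teams: average attendance and league rank
# (92 = top of the Premier League) one and two seasons earlier.
attendance = [75000, 60000, 42000, 36000, 25000]
pos_1 = [92, 90, 85, 80, 74]
pos_2 = [91, 92, 83, 70, 78]

log_att = [math.log(a) for a in attendance]
attpos_1 = ols_residuals(pos_1, log_att)  # RESID1, season k=1
attpos_2 = ols_residuals(pos_2, log_att)  # RESID2, season k=2

# Differencing step: ATTPOS_1 is replaced by ATTPOS_1 - ATTPOS_2.
delta_attpos_1 = [r1 - r2 for r1, r2 in zip(attpos_1, attpos_2)]
```

A positive residual indicates a team whose attendance is higher than its league position alone would suggest, which is the sense in which the residual captures the ‘big team’ effect.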

5.2.7 Overall Intricate model

As a result the intricate model, the parameters of which can be found listed in Appendix 2, is as follows:

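$$
\begin{aligned}
y^{*}_{i,j,z} ={}& (\beta_{1}:\beta_{7})\,P^{d}_{i,t,s,z} + (\beta_{8}:\beta_{14})\,P^{d}_{j,t,s,z} + (\beta_{15}:\beta_{23})\,R^{H}_{i,z-m} + (\beta_{24}:\beta_{27})\,R^{A}_{i,z-n} \\
&+ (\beta_{28}:\beta_{31})\,R^{H}_{j,z-n} + (\beta_{32}:\beta_{40})\,R^{A}_{j,z-m} + \beta_{41}\,SIG^{H}_{i,z} + \beta_{42}\,SIG^{A}_{j,z} \\
&+ \beta_{43}\,CUP^{H}_{i,z} + \beta_{44}\,CUP^{A}_{j,z} + \beta_{45}\,DIST_{i,j} + \beta_{46}\,ATTPOS_{i,2} + \beta_{47}\,\Delta ATTPOS_{i,1} \\
&+ \beta_{48}\,ATTPOS_{j,2} + \beta_{49}\,\Delta ATTPOS_{j,1} + \beta_{50}\,AST_{j,z-1} + \beta_{51}\,AD_{j,z-1} + \beta_{52}\,AHW_{j,z-1} + u_{i,j}
\end{aligned}
\tag{8}
$$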
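To illustrate how a fitted value of the latent variable translates into outcome probabilities, the sketch below applies the standard ordered probit formulas with two cut-off points. It assumes the usual ordering in which larger values of the latent variable favour the home team (away win below the first cut-off, home win above the second); the function names and the fitted value of 1.0 are hypothetical, while the cut-offs 0.465 and 1.338 are the estimates of γ1 and γ2 reported for the intricate model.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def outcome_probabilities(y_star_hat, gamma1, gamma2):
    """Home win, draw and away win probabilities from an ordered probit.

    Assumes y* + u < gamma1 is an away win, gamma1 <= y* + u < gamma2 is a
    draw, and y* + u >= gamma2 is a home win, with u standard normal.
    """
    p_away = norm_cdf(gamma1 - y_star_hat)
    p_draw = norm_cdf(gamma2 - y_star_hat) - p_away
    p_home = 1.0 - norm_cdf(gamma2 - y_star_hat)
    return p_home, p_draw, p_away


# Hypothetical fitted latent value for one match.
p_home, p_draw, p_away = outcome_probabilities(1.0, 0.465, 1.338)
```

By construction the three probabilities sum to one, and a higher fitted value shifts probability mass from an away win towards a home win.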

6. Data

The Barclays Premier League was selected for the analysis in this study as it is a prominent and popular football tournament that is held annually, and thus the required data could be collected more feasibly. Since this study investigates the efficiency in predicting results of matches played by high performing teams, the ordered probit model in this paper is estimated using data for the teams that have consistently retained a position in the Premier League during three consecutive seasons: 2010-11, 2011-12 and 2012-13 (inclusive). Consequently, data for 14 teams were utilised. These teams are, in alphabetical order: Arsenal, Aston Villa, Chelsea, Everton, Fulham, Liverpool, Manchester City, Manchester United, Newcastle United, Stoke City, Sunderland, Tottenham, West Bromwich Albion and Wigan Athletic. Only matches held between teams from the listed collection were included in the estimation of the ordered probit model. Overall, 535 matches were used.

Subsequently, the ordered probit model is used to estimate ex ante probabilities for 155 matches for the 2013-14 season, which are all the matches played between the teams used to generate the model. The model is not applied to the entire set of 380 matches for that season because applying the model to a match played by one team that was included when computing the model and one that was not, or to a match where neither team was included in the computation of the model, would introduce a bias in the results.

The data for the variables home shots (HSi,z-1), home shots on target (HSTi,z-1), home fouls (HFi,z-1), home corners (HCi,z-1), home yellow cards (HYi,z-1), home red cards (HRi,z-1), and the away team counterparts were obtained from an online football statistics archive that is monitored and regulated by Joseph Buchdahl, a sports betting analyst and author of a highly regarded book on fixed odds sports betting (Buchdahl, 2001). The data for the variables home team hit woodwork (HHWi,z-1), home team tackles (HTaci,z-1), number of home dribbles (HDi,z-1), number of home throw-ins (HThrinsi,z-1), home offsides (HOi,z-1), number of shots blocked by the home team (HBi,z-1), and the respective away team counterparts were obtained from the football website WhoScored.com. The site is run from Central London by a team of football analysts and software developers, who also have backgrounds in the sector (WhoScored.com, n.d.).

The variable DISTi,j, which represents the distance between the grounds of the two teams that are playing in the match in question, is calculated using the distances data available on Sportmapworld.com (2007). Distances are measured in kilometres. In order to set up the dummy variables CUPHi,z and CUPAj,z, information about participation in the FA Cup is obtained from the official site for The FA Cup and FA competitions (The FA, n.d.). The two dummy variables SIGHi,z and SIGAj,z are computed using the total points data available in the league tables on the official website of the Barclays Premier League (Barclays Premier League, 1992).

The match attendance data required for the computation of the variables ΔATTPOSi,1 and ATTPOSi,2 , for team i, as well as ΔATTPOSj,1 and ATTPOSj,2, for team j, is obtained from SoccerPunter Pte. Ltd. (2002). The teams’ final league positions data is obtained from the end of league table on the official website of the Barclays Premier League (Barclays Premier League, 1992).

The win ratio variables of teams i and j, Pdi,t,s,z and Pdj,t,s,z, were computed using data from Joseph Buchdahl’s online football statistics archive (Buchdahl, 2001). The variables RHi,z-m, RAi,z-n, RHj,z-n and RAj,z-m are computed using the full time match results data from Joseph Buchdahl’s online football statistics archive (Buchdahl, 2001).

The study utilises the data for football match odds from seven different bookmakers – Bet 365 (B365), Bet & Win (BW), Interwetten (IW), Ladbrokes (LB), William Hill (WH), Stan James (SJ) and VC Bet (VC). The odds data for these bookmakers are obtained from Joseph Buchdahl’s betting odds archive, available on his football-data site (Buchdahl, 2001). The most basic odds offered by bookmakers are odds for a home win, odds for a draw and odds for an away win. Some bookmakers offer decimal odds, often known as European or continental odds, others offer fractional odds, which are also known as British odds, whilst some even offer moneyline odds, which tend to be known as American odds. This paper uses published continental odds, i.e. decimal odds. Decimal odds include both the stake and the winnings that a bettor could receive. For instance, for the Fulham – Arsenal match on August 24th, 2013, the bookmaker Bet 365 quoted home win odds of 3.8, draw odds of 3.5 and away win odds of 2.1. This means that if a bettor were to place a £1.00 stake on Fulham winning the match, he/she would receive a payoff of £3.80 if Fulham won the game. In other words, the bettor would win £2.80 plus their original stake of £1.00.

It is possible to convert bookmakers’ odds into probabilities that enable one to identify how likely the bookmakers believe that a particular match outcome will be. Applying Stumbelj and Robnik’s (2010) method for converting bookmakers’ odds into probabilities to the example of the Fulham – Arsenal match results in an outcome of 0.2632 (=1/3.8), the probability of a home win, θHi,j,b,z, 0.2857 (=1/3.5), the probability of a draw, θDi,j,b,z, and 0.4762 (=1/2.1), the probability of an away win, θAi,j,b,z, where the sub-index b indicates the bookmaker whose odds are being considered given the match in question, which is represented by the sub-index z. One can easily identify that bookmakers favour the match outcome for which they have set the lowest odds.


The sum of the probabilities, however, invariably exceeds 1. In the case of the above example, the sum of the three probabilities is 1.0251. The additional 2.51% is the bookmaker’s margin. The purpose of this margin is to cover the bookmaker’s costs and profits, regardless of the match outcome. Bet 365 and VC Bet have the lowest margins, at 2.7% and 2.8% respectively, on average for the sample of matches studied from season 2013-14. The remaining bookmakers had margins between 5.6% and 8.1%, with standard deviations between 0.1% and 0.9% (see column ‘Bookmaker’s Margin’ in Table 7).

Implicit probabilities that add up to one can be generated from the bookmakers’ probabilities as follows:

$$
\begin{aligned}
\phi^{H}_{i,j,b,z} &= \frac{\theta^{H}_{i,j,b,z}}{\theta^{H}_{i,j,b,z} + \theta^{D}_{i,j,b,z} + \theta^{A}_{i,j,b,z}} && \text{for a home win,} \\
\phi^{D}_{i,j,b,z} &= \frac{\theta^{D}_{i,j,b,z}}{\theta^{H}_{i,j,b,z} + \theta^{D}_{i,j,b,z} + \theta^{A}_{i,j,b,z}} && \text{for a draw, and} \\
\phi^{A}_{i,j,b,z} &= \frac{\theta^{A}_{i,j,b,z}}{\theta^{H}_{i,j,b,z} + \theta^{D}_{i,j,b,z} + \theta^{A}_{i,j,b,z}} && \text{for an away win.}
\end{aligned}
$$

Applying this to the Fulham – Arsenal example gives normalised probabilities of 0.2568 for a home win, 0.2787 for a draw, and 0.4645 for an away win.
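The conversion from decimal odds to raw probabilities, bookmaker’s margin and normalised probabilities can be sketched as follows, using the Bet 365 odds for the Fulham – Arsenal match quoted above. The function name is hypothetical and the sketch is only an illustration of the arithmetic, not the code used in this study.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds into raw and normalised outcome probabilities.

    Returns (raw, margin, normalised), where raw holds the inverse odds
    (theta), margin is the excess of their sum over one, and normalised
    holds the probabilities rescaled to sum to one (phi).
    """
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    total = sum(raw)
    margin = total - 1.0
    normalised = [p / total for p in raw]
    return raw, margin, normalised


# Bet 365 odds for Fulham - Arsenal, 24 August 2013 (home, draw, away).
raw, margin, normalised = implied_probabilities(3.8, 3.5, 2.1)
```

Up to rounding in the fourth decimal place, the values in `raw`, `margin` and `normalised` correspond to the figures reported in the text for this match.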

7. Results

7.1 Significant estimates from the Naïve model and the Intricate model

Upon acquiring the specified possible explanatory variables for the naïve model, in section 5.1, and conducting an ordered probit regression, three of the 24 variables – away team shots on target (ASTj,z-1), away team dribbles (AD j,z-1) and away team hit woodwork (AHWj,z-1) – are found to be jointly statistically significant at the 0.05 level (see Appendices 3 and 4). Intuitively, using both economic knowledge and knowledge of the sport, football, there seems to be no apparent reason as to why only these three variables would be significant. Subsequently, the statistically insignificant variables are eliminated and the ordered probit regression is repeated until all variables are found to be jointly statistically significant.

As stated in section 5.2, the three variables that are found to be statistically significant at the 0.05 level in the naïve model are utilised in the construction of the intricate model. This is done after the execution of a Wald test, which is used to test:

H0: (β1, …,β24)=0

H1: At least one of the estimated coefficients is not equal to zero,

as well as a z-test on each of the individual coefficients (see Appendices 3 and 4). Both tests are conducted first with respect to the full set of 24 estimated coefficients of the full naïve model and then with respect to the three significant estimates of the small naïve model.

The results of the Wald test are reported in Table 1 below. The p-value, in both cases, is less than 0.01 so the null hypothesis can be rejected at the 0.01 significance level. This indicates that in both cases the coefficients are not simultaneously equal to zero, at the 0.01 significance level. The individual z-tests for each coefficient, see Appendices 3 and 4, further support these results. In the first regression, the full naïve model, three estimates are found to be statistically significant at the 0.05 level. In the subsequent regression, the small naïve model, which includes only the three significant estimates, all estimated coefficients are found to be statistically significant at the 0.05 level (see Appendices 3 and 4).

Table 1: Wald tests – full naïve model and model with significant estimates only.

Model               χ²      Df   p-value
Full Naïve Model    46.20   24   0.0042

In the intricate model, the coefficients for the recent match results, from section 5.2.2, appear to be erratic in terms of significance. This is similar to the results reported by Goddard and Asimakopoulos (2004). The estimated coefficient for SIGAj,z is negative, as expected, possibly due to the potential additional effort exerted by the opposition to avoid relegation. However, the coefficient for SIGHi,z is also negative, which suggests that perhaps a home team facing possible relegation experiences detrimental effects such as loss of confidence. In section 5.2.4, it was proposed that early elimination from the FA Cup could influence match results positively, by enabling the team in question to focus their efforts on one league, or negatively, for instance through players’ loss of confidence. The latter effect seems to dominate, according to the coefficient estimates for CUPHi,z and CUPAj,z. Furthermore, in this study, the estimated coefficient on DISTi,j is found to be statistically significant at the 0.05 level, similar to what was discovered by Goddard and Asimakopoulos (2004).

7.2 Naïve model (small) vs Intricate model

A likelihood ratio test is conducted to test the fit of the two models – the small naïve model, comprising three statistically significant parameters, and the intricate model, comprising 52 parameters. These models are referred to as Model 1 and Model 2, respectively, while reporting the results in panel 1 of Table 2, below, where for each model respectively: column Ll(null) reports the log likelihood for the empty model; column Ll(model) reports the log likelihood for the fitted model; column Df reports the degrees of freedom, including the two cut-off points; column AIC reports the Akaike information criterion; and column BIC reports the Bayesian information criterion.

The reported values for Ll(model), for both Model 1 and Model 2, from panel 1 of Table 2 are used to obtain the likelihood ratio test statistic of 105.1216, distributed chi-squared with 49 degrees of freedom, which is reported in panel 2 of Table 2. D denotes the likelihood ratio test statistic in panel 2 of Table 2, whereas Df denotes the degrees of freedom. The test statistic has an associated p-value of 0.0001, which is p<0.05, thereby indicating that the intricate model, Model 2, fits significantly better than the naïve model, Model 1.

Table 2: Information criteria and likelihood ratio test – Model 1 vs Model 2.

According to the Akaike information criterion (AIC), the model with the best balance between goodness of fit and complexity also appears to be the intricate model, Model 2. The BIC values indicate otherwise; however, this is to be expected, since BIC penalises additional parameters more heavily than AIC whenever the sample size exceeds 7. Since the likelihood ratio statistic and the Akaike information criterion both favour the intricate model, it is the model that will be used for further analysis in this paper. The estimated coefficients of Model 2 are reported in Table 3, together with their respective standard deviations. Also reported are the cut-off parameters from equation (3), γ1 and γ2.

1. Information criteria for Model 1 and Model 2

Model     Obs   Ll(null)    Ll(model)   Df   AIC        BIC
Model 1   535   -575.2232   -566.0804   5    1142.161   1163.572
Model 2   535   -575.2232   -513.5196   54   1135.039   1366.282

2. Likelihood ratio test: D = -2Ll(model, Model 1) + 2Ll(model, Model 2)

D          Df   p-value
105.1216   49   0.0001
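The likelihood ratio statistic and the information criteria in Table 2 follow directly from the reported log likelihoods, as the short sketch below illustrates. The function names are hypothetical; the formulas are the standard definitions AIC = -2Ll + 2k and BIC = -2Ll + k ln(n), with k the number of estimated parameters (the Df column) and n the number of observations.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion."""
    return -2.0 * log_lik + 2.0 * k


def bic(log_lik, k, n):
    """Bayesian information criterion."""
    return -2.0 * log_lik + k * math.log(n)


def lr_statistic(log_lik_restricted, log_lik_unrestricted):
    """Likelihood ratio statistic D = -2Ll(restricted) + 2Ll(unrestricted)."""
    return -2.0 * log_lik_restricted + 2.0 * log_lik_unrestricted


n_obs = 535
ll_1, k_1 = -566.0804, 5    # Model 1: small naive model
ll_2, k_2 = -513.5196, 54   # Model 2: intricate model

d_stat = lr_statistic(ll_1, ll_2)
d_df = k_2 - k_1
aic_1, aic_2 = aic(ll_1, k_1), aic(ll_2, k_2)
bic_1, bic_2 = bic(ll_1, k_1, n_obs), bic(ll_2, k_2, n_obs)
```

The sketch also makes the disagreement between the two criteria visible: the per-parameter penalty is 2 under AIC but ln(535), roughly 6.28, under BIC, so the 49 additional parameters of Model 2 are penalised far more heavily by BIC.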

1. WIN RATIOS OVER PRECEDING 24 MONTHS

Team                      Months before match   Matches played             Current division (d=0)   One division lower (d=-1)
Home team (i), Pdi,t,s    0-12 months (t=0)     Current Season (s=0)       -0.481 (0.397)
Home team (i), Pdi,t,s    0-12 months (t=0)     Previous Season (s=1)      1.308*** (0.363)         -0.623 (1.287)
Home team (i), Pdi,t,s    12-24 months (t=1)    Previous Season (s=1)      0.399 (0.375)            1.955 (1.374)
Home team (i), Pdi,t,s    12-24 months (t=1)    Two Seasons Ago (s=2)      0.677** (0.321)          0.046 (0.430)
Away team (j), Pdj,t,s    0-12 months (t=0)     Current Season (s=0)       0.370 (0.395)
Away team (j), Pdj,t,s    0-12 months (t=0)     Previous Season (s=1)      -0.420 (0.368)           1.657 (1.114)
Away team (j), Pdj,t,s    12-24 months (t=1)    Previous Season (s=1)      -0.179 (0.368)           -1.721 (1.172)
Away team (j), Pdj,t,s    12-24 months (t=1)    Two Seasons Ago (s=2)      0.104 (0.323)            0.115 (0.409)

2. MOST RECENT MATCH RESULTS

(m,n) matches ago   Team i’s Home matches (RHi,m)   Team j’s Away matches (RAj,m)   Team i’s Away matches (RAi,n)   Team j’s Home matches (RHj,n)
1                   0.205 (0.142)                   0.116 (0.130)                   0.250** (0.125)                 -0.298** (0.149)
2                   -0.024 (0.136)                  -0.058 (0.128)                  0.080 (0.134)                   -0.033 (0.144)
3                   0.344** (0.138)                 0.082 (0.128)                   0.194 (0.129)                   -0.176 (0.138)
4                   0.258* (0.138)                  -0.027 (0.126)                  0.101 (0.131)                   -0.237* (0.137)
5                   0.273* (0.141)                  0.047 (0.129)
6                   0.119 (0.138)                   0.046 (0.134)
7                   0.185 (0.133)                   -0.349*** (0.129)
8                   -0.105 (0.143)                  -0.046 (0.128)
9                   -0.192 (0.136)                  -0.070 (0.131)

3. OTHER EXPLANATORY VARIABLES, CUT-OFF PARAMETERS

Variable      Estimate
CUPHi         -0.182 (0.164)
CUPAj         0.173 (0.172)
DISTi,j       0.099** (0.044)
SIGHi         -0.168 (0.166)
SIGAj         -0.237 (0.175)
ASTj          -0.0278* (0.016)
ADj           -0.029** (0.015)
AHWj          -0.104 (0.097)
ATTPOSi,2     0.507 (0.514)
ΔATTPOSi,1    0.275 (0.770)
ATTPOSj,2     -1.05** (0.531)
ΔATTPOSj,1    -0.424 (0.782)
γ1            0.465 (0.574)
γ2            1.338 (0.576)

The implicit bookmaker probabilities used to report the descriptive statistics in Table 4 are computed by applying the method described in Section 6 to the arithmetic mean of the odds quoted by the seven bookmakers for every outcome. The table also reports the match outcome probabilities estimated by the empirical model, equation (8), as well as the actual match outcomes. In both tables, data for the implicit bookmaker probability, ϕri,j, and the model estimated probability, Pri,j, where r = H, D, A, are mean values, with standard deviations reported in brackets. The computation for the model estimated probability, Pri,j, is described in section 8.1. In Table 4, H(%), D(%) and A(%) report the actual proportions of each outcome.

Table 4: Summary descriptive statistics: bookmakers’ implicit probabilities, and forecast probabilities for matches in season 2013-14.

           BOOKMAKERS                    MODEL                         ACTUAL
           ϕHi,j     ϕDi,j     ϕAi,j     PHi,j     PDi,j     PAi,j     H(%)     D(%)     A(%)
All        0.450     0.246     0.303     0.461     0.269     0.271     0.4903   0.1807   0.329
           (0.203)   (0.046)   (0.177)   (0.234)   (0.063)   (0.202)
AUG-OCT    0.436     0.252     0.312     0.427     0.274     0.299     0.5000   0.2105   0.2895
           (0.193)   (0.040)   (0.169)   (0.238)   (0.068)   (0.205)
NOV-DEC    0.423     0.253     0.324     0.368     0.280     0.351     0.4500   0.2250   0.3250
           (0.188)   (0.035)   (0.183)   (0.216)   (0.050)   (0.223)
JAN-FEB    0.445     0.245     0.310     0.510     0.270     0.220     0.4412   0.2353   0.3235
           (0.210)   (0.047)   (0.186)   (0.210)   (0.062)   (0.163)
MAR-MAY    0.492     0.236     0.272     0.537     0.251     0.211     0.5581   0.0698   0.3721
           (0.218)   (0.059)   (0.173)   (0.235)   (0.068)   (0.184)

Disaggregation of the probabilities by month shows that the probabilities generated by the model move in the same direction as the implicit bookmaker probabilities, although the model’s estimated probabilities fluctuate by a greater amount. This is more clearly illustrated in Table 5. Almost the same pattern is identified for the standard deviations in Table 4, which in general tend to rise as the season progresses. The model appears to effectively replicate the main features of the bookmakers’ probabilities.

Table 5: Differences between: bookmakers’ implicit probabilities and actual results, forecast probabilities and actual results.

           Differences: BOOKMAKERS                        Differences: MODEL
           H(%) - ϕHi,j   D(%) - ϕDi,j   A(%) - ϕAi,j     H(%) - PHi,j   D(%) - PDi,j   A(%) - PAi,j
All        0.040          -0.065         0.026            0.029          -0.088         0.058
AUG-OCT    0.064          -0.042         -0.023           0.073          -0.064         -0.010
NOV-DEC    0.027          -0.028         0.001            0.082          -0.055         -0.026
JAN-FEB    -0.004         -0.010         0.014            -0.069         -0.035         0.104
MAR-MAY    0.066          -0.166         0.100            0.021          -0.181         0.161

Correlations among the match outcomes, the implicit bookmaker probabilities and the model estimated probabilities are presented for all three possible results in panels 1, 2 and 3 of Table 6, respectively. From the results, it appears that the probabilities generated by the model are slightly more correlated with the actual match results than the implicit bookmaker probabilities are for both a home win outcome and an away win outcome. In the case of a draw outcome, the probabilities generated by the model are substantially more correlated with the actual match results than the implicit bookmaker probabilities are.

Table 6: Correlation tables (Obs=155).

1. Home Win
         H       ϕHi,j   PHi,j
H        1.000
ϕHi,j    0.367   1.000
PHi,j    0.373   0.679   1.000

2. Draw
         D       ϕDi,j   PDi,j
D        1.000
ϕDi,j    0.069   1.000
PDi,j    0.133   0.475   1.000

3. Away Win
         A       ϕAi,j   PAi,j
A        1.000
ϕAi,j    0.362   1.000
PAi,j    0.365   0.669   1.000
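Each entry in Table 6 is a correlation between two series across matches, for example between a 0/1 indicator for a home win and the corresponding probability of a home win. A minimal sketch of that calculation is given below, assuming the Pearson correlation coefficient is the measure used; the six observations are invented for illustration and are not taken from the sample.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical data: 1 if the match ended in a home win, 0 otherwise,
# alongside an implicit bookmaker probability of a home win.
home_win = [1, 0, 1, 0, 0, 1]
phi_home = [0.62, 0.31, 0.48, 0.40, 0.25, 0.55]

corr = pearson(home_win, phi_home)
```

The same function applied to the draw and away win indicators, and to the model probabilities in place of the bookmaker probabilities, would fill in the remaining off-diagonal entries of each panel.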

The large difference in the correlation results for the draw outcome prompted further investigation into the performance of the model, versus the implicit bookmakers’ probabilities, in predicting draw outcomes. In the selected sample of 155 matches, 28 matches resulted in a draw. For 22 of these 28 matches, the probability of a draw generated by the empirical model exceeded the draw probability implied by the bookmakers’ odds. The model also correctly predicted one draw outcome, whereas the bookmakers failed to correctly predict any of the draw outcomes. Additionally, for each of the 28 matches that resulted in a draw, the implicit bookmakers’ probabilities for a draw outcome were under 30%. The empirical model, however, indicated that 13 of these 28 matches would result in a draw outcome with a probability of more than 30%. These results suggest that the model may be slightly better at forecasting match outcomes than the bookmakers’ odds.

The average home-win, draw and away-win implicit probabilities for each bookmaker are reported in Table 7. Data for ‘Average sum of non-normalised probabilities’ are the mean values of the sum of the home win, draw and away win ‘probabilities’, for 155 observations. ‘Bookmakers’ Margin’, expressed as a percentage, is the net profit generated. As in the case of Table 4 and Table 5, the implicit bookmaker probabilities, ϕri,j, where r = H, D, A, are mean values and their standard deviations are reported in brackets.

At a glance, it can be noted that the standard deviations of the bookmakers’ margins, reported in Table 7, are highest for Stan James, at 0.9%, followed by Bet 365, Ladbrokes and William Hill, at 0.8%, and lowest for Interwetten, at 0.1%. Interestingly, the margins themselves show the opposite pattern, with Bet 365 having the lowest margin and Interwetten the highest, and with the margins of Stan James, Ladbrokes and William Hill falling in between. This could perhaps be a result of Bet 365 employing a low margin, high volume business strategy that translates into better odds for their customers, which encourages high levels of participation from bettors. Regardless of whether that is the case or not, the disparity between the bookmakers raises the question of how much the implicit probabilities actually vary between bookmakers.

Table 7: Bookmakers’ margins and implicit probabilities.

Bookmaker   Average sum of non-normalised probabilities   Bookmaker’s Margin   ϕHi,j           ϕDi,j           ϕAi,j
B365        1.027 (0.008)                                 2.7% (0.8%)          0.455 (0.210)   0.242 (0.050)   0.303 (0.184)
BW          1.067 (0.004)                                 6.7% (0.4%)          0.451 (0.203)   0.242 (0.047)   0.306 (0.178)
IW          1.081 (0.001)                                 8.1% (0.1%)          0.447 (0.194)   0.247 (0.043)   0.306 (0.170)
LB          1.060 (0.008)                                 6.0% (0.8%)          0.450 (0.199)   0.245 (0.045)   0.305 (0.174)
WH          1.062 (0.008)                                 6.2% (0.8%)          0.441 (0.201)   0.263 (0.048)   0.297 (0.173)
SJ          1.056 (0.009)                                 5.6% (0.9%)          0.452 (0.204)   0.243 (0.044)   0.305 (0.179)
VC          1.028 (0.007)                                 2.8% (0.7%)          0.455 (0.207)   0.242 (0.048)   0.303 (0.181)

7.4 Correlations between bookmakers

For a more detailed look, a correlation table is set up to display the correlations between bookmakers’ normalised probabilistic assessments of each outcome. The correlations are computed for all seven bookmakers. The results are reported in Table 8, with one panel for each outcome. The correlations between all seven bookmakers exceed 0.99 for a home win outcome, as reported in panel 1 of Table 8. The correlations for the away win outcome, in panel 3 of Table 8, follow a similar pattern in that all bookmakers’ implicit probabilities are almost perfectly correlated. The correlations for the draw outcome vary more, see panel 2 of Table 8, which is not surprising as draw outcomes are much more difficult to predict.

Table 8: Correlations for seven bookmakers (Obs 534).

1. Home Win
        B365H   BWH     IWH     LBH     WHH     SJH     VCH
B365H   1
BWH     0.9981  1
IWH     0.9920  0.9924  1
LBH     0.9966  0.9966  0.9939  1
WHH     0.9974  0.9966  0.9915  0.9962  1
SJH     0.9971  0.9970  0.9934  0.9971  0.9962  1
VCH     0.9986  0.9977  0.9915  0.9966  0.9974  0.9969  1

2. Draw
        B365D   BWD     IWD     LBD     WHD     SJD     VCD
B365D   1
BWD     0.9785  1
IWD     0.9648  0.9610  1
LBD     0.9664  0.9616  0.9648  1
WHD     0.9439  0.9431  0.9314  0.9322  1
SJD     0.9757  0.9682  0.9620  0.9656  0.9304  1
VCD     0.9845  0.9739  0.9578  0.9616  0.9403  0.9730  1

3. Away Win
        B365A   BWA     IWA     LBA     WHA     SJA     VCA
B365A   1
BWA     0.9978  1
IWA     0.9912  0.9913  1
LBA     0.9964  0.9962  0.9931  1
WHA     0.9971  0.9960  0.9908  0.9955  1
SJA     0.9966  0.9965  0.9928  0.9966  0.9952  1
VCA     0.9984  0.9976  0.9906  0.9964  0.9970  0.9965  1

Xu (2011) studied the correlations between the same set of bookmakers, as well as two others, for the home win outcome. Using the odds for the 2006-07 season, she found that all bookmakers were very highly correlated, with correlations greater than 0.99.

8. Testing the efficiency of bookmakers’ odds

Empirical testing of the efficiency of fixed odds has generally been confined to two approaches (Jakobsson & Karlsson, 2007). The first is a regression-based method, in which an event’s outcome is regressed on a function of the odds as well as other predictors that are available as public information. If the odds are efficient, all other predictors should be statistically insignificant; if other predictors are found to be statistically significant, it can be deduced that the odds do not efficiently incorporate all publicly available information (Goddard & Asimakopoulos, 2004; Xu, 2011). The second approach is to conduct economic efficiency tests (Goddard & Asimakopoulos, 2004; Jakobsson & Karlsson, 2007; Xu, 2011). In such tests the ex post returns from different betting strategies, conditional on a forecasting model, are computed. If bookmakers’ odds are efficient, it should not be possible to employ a technical analysis to achieve an abnormal return, that is, a return greater than the bookmakers’ margin (Fama, 1970).
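To make the idea of an economic efficiency test concrete, the sketch below computes the ex post return of one simple betting rule. The rule itself (stake one unit on any outcome whose model probability multiplied by the decimal odds exceeds a threshold), the function name and the example matches are all assumptions made for illustration; they are not necessarily the strategies evaluated in this study.

```python
def ex_post_return(matches, threshold=1.0):
    """Average net return per unit staked for a simple value-betting rule.

    Each match is a dict with:
      "odds":   decimal odds keyed by outcome "H", "D", "A"
      "model":  model probabilities keyed the same way
      "result": the realised outcome, one of "H", "D", "A"
    A one-unit bet is placed on every outcome whose expected gross payoff
    (model probability times decimal odds) exceeds `threshold`.
    Returns (average net return per bet, number of bets), or (0.0, 0)
    when no bet is placed.
    """
    profit = 0.0
    bets = 0
    for m in matches:
        for outcome in ("H", "D", "A"):
            if m["model"][outcome] * m["odds"][outcome] > threshold:
                bets += 1
                if m["result"] == outcome:
                    profit += m["odds"][outcome] - 1.0  # winnings net of stake
                else:
                    profit -= 1.0  # stake lost
    return (profit / bets if bets else 0.0), bets

# Two hypothetical matches (illustrative values only).
matches = [
    {"odds": {"H": 2.10, "D": 3.40, "A": 3.60},
     "model": {"H": 0.55, "D": 0.25, "A": 0.20}, "result": "H"},
    {"odds": {"H": 1.50, "D": 4.20, "A": 6.50},
     "model": {"H": 0.60, "D": 0.22, "A": 0.18}, "result": "D"},
]
avg_return, n_bets = ex_post_return(matches)
```

Under efficient odds, a rule of this kind should not earn a systematically positive average return over a large sample of matches.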

8.1 Empirical tests

The conversion of bookmakers’ odds to implicit probabilities, the method for which is explained in section 6, makes it possible to conduct a regression-based analysis of whether or not bookmakers’ odds efficiently utilise all publicly available information. Since the bookmakers’ odds are almost perfectly correlated, the estimates obtained from one bookmaker should be representative of the remaining bookmakers. The analysis is therefore conducted with Bet365’s normalised probabilistic assessments for each possible match outcome.
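For reference, a common way of turning odds into normalised probabilities is sketched below. It assumes decimal odds and the standard proportional normalisation (inverse odds divided by their sum, the overround); the method in section 6 is not reproduced here, so this should be read as an assumption rather than as the exact procedure of this study. The function name and the example odds are hypothetical.

```python
def implied_probabilities(odds_home, odds_draw, odds_away):
    """Convert decimal odds to normalised probabilities.

    The inverse odds sum to more than one (the overround); dividing each
    inverse by that sum rescales them so the three probabilities sum to one.
    Returns (p_home, p_draw, p_away, overround).
    """
    inverse = [1.0 / odds_home, 1.0 / odds_draw, 1.0 / odds_away]
    overround = sum(inverse)
    p_home, p_draw, p_away = (q / overround for q in inverse)
    return p_home, p_draw, p_away, overround

# Hypothetical decimal odds for one match (not taken from the data set).
p_h, p_d, p_a, k = implied_probabilities(2.10, 3.40, 3.60)
```

Here `k - 1` is the bookmaker's margin on the match, and the three probabilities can be used directly as regressors.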

To test the efficiency of the bookmakers’ odds, this paper employs the following binary probit model:

$$r_{i,j} = \alpha_r + \beta_r \phi^r_{i,j} + u_{i,j} \qquad (9)$$

where $r_{i,j}$ represents the full-time result of the match between team i and team j. In the first instance, the binary probit model is employed such that $r_{i,j}$ is equal to 1 if the match results in a home win, and 0 otherwise. Then, the analysis is repeated with $r_{i,j}$ equal to 1 if the match results in a draw, and 0 otherwise. Lastly, the analysis is also conducted with $r_{i,j}$ equal to 1 if the match results in an away win, and 0 otherwise. The estimation results are reported in panel 1 of Table 9. In each case, the following two hypotheses are tested:

H0: $\alpha_r = 0$ vs H1: $\alpha_r \neq 0$

H0: $\beta_r = 0$ vs H1: $\beta_r \neq 0$

in order to establish whether the bookmakers’ odds are statistically significant in explaining the match outcomes.
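The estimation and testing step for equation (9) can be sketched as follows. This is a from-scratch illustration of a binary probit with a single regressor, fitted by Fisher scoring, with z-statistics taken from the inverse information matrix. The function names and the small data set are hypothetical, and in practice a statistics package would normally be used instead.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def score_and_information(y, x, alpha, beta):
    """Score vector and expected information matrix of the probit likelihood."""
    g0 = g1 = 0.0
    i00 = i01 = i11 = 0.0
    for yi, xi in zip(y, x):
        eta = alpha + beta * xi
        # Clip the fitted probability away from 0 and 1 for numerical safety.
        p = min(max(norm_cdf(eta), 1e-12), 1.0 - 1e-12)
        d = norm_pdf(eta)
        w = d * (yi - p) / (p * (1.0 - p))
        h = d * d / (p * (1.0 - p))
        g0 += w
        g1 += w * xi
        i00 += h
        i01 += h * xi
        i11 += h * xi * xi
    return (g0, g1), (i00, i01, i11)

def fit_probit(y, x, max_iter=100, tol=1e-10):
    """Fit P(y = 1) = Phi(alpha + beta * x) by Fisher scoring.

    Returns ((alpha, beta), (z_alpha, z_beta)), where each z-statistic is
    the estimate divided by its asymptotic standard error.
    """
    alpha = beta = 0.0
    for _ in range(max_iter):
        (g0, g1), (i00, i01, i11) = score_and_information(y, x, alpha, beta)
        det = i00 * i11 - i01 * i01
        step_a = (i11 * g0 - i01 * g1) / det
        step_b = (i00 * g1 - i01 * g0) / det
        alpha += step_a
        beta += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    # Standard errors from the inverse information at the final estimates.
    _, (i00, i01, i11) = score_and_information(y, x, alpha, beta)
    det = i00 * i11 - i01 * i01
    se_alpha = math.sqrt(i11 / det)
    se_beta = math.sqrt(i00 / det)
    return (alpha, beta), (alpha / se_alpha, beta / se_beta)

# Hypothetical data: bookmaker probability phi and a 0/1 home-win indicator,
# ten matches at each probability level (illustrative only).
x, y = [], []
for phi, wins in [(0.2, 2), (0.4, 4), (0.6, 6), (0.8, 8)]:
    x += [phi] * 10
    y += [1] * wins + [0] * (10 - wins)

(alpha, beta), (z_alpha, z_beta) = fit_probit(y, x)
```

A z-statistic larger than about 1.96 in absolute value leads to rejection of the corresponding null hypothesis at the 0.05 level in a two-tailed test.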

Since the sample used for this part of the study is relatively large (n > 30), the coefficient estimates can be treated as approximately normally distributed, and as such z-statistics can be used to test whether each coefficient is individually statistically different from zero. According to the z-values and two-tailed p-values reported in panel 1 of Table 9, the null hypothesis can be rejected for both $\alpha_r$ and $\beta_r$, for r = H, A, at the 0.01 significance level. Furthermore, the null hypothesis can also be rejected for $\alpha_r$ with r = D at the 0.05 significance level. However, the null hypothesis cannot be rejected for $\beta_r$ with r = D.

The next task, after establishing that the bookmakers’ odds are statistically significant, is to test whether the odds incorporate all publicly available information. To do so, the following binary probit model is constructed:

$$r_{i,j} = \alpha_r + \beta_r \phi^r_{i,j} + \lambda_r (P^r_{i,j} - \phi^r_{i,j}) + u_{i,j} \qquad (10)$$

where, again, $r_{i,j}$ denotes the full-time match outcome, $P^r_{i,j}$ are the ex ante probabilities for the 155 matches from the 2013-14 season, and $\phi^r_{i,j}$ are the bookmakers’ implicit probabilities. The ex ante probabilities $P^r_{i,j}$ are obtained as follows. First, the full set of covariate values for both the home team and the away team, for every football match, is substituted into the ordered probit model, equation (8), which was estimated using the data for the seasons 2010-11 to 2012-13, inclusive, as reported in Table 1. Doing so generates a fitted value for the match between teams i and j, denoted by $\hat{y}^*_{i,j}$. This value is then used to estimate the probabilities of a home win, a draw and an away win in the following manner:

Home win: $P^H_{i,j} = 1 - \Phi(\gamma_2 - \hat{y}^*_{i,j})$

Draw: $P^D_{i,j} = \Phi(\gamma_2 - \hat{y}^*_{i,j}) - \Phi(\gamma_1 - \hat{y}^*_{i,j})$ (11)

Away win: $P^A_{i,j} = \Phi(\gamma_1 - \hat{y}^*_{i,j})$,

where $\Phi$ is the standard normal distribution function, and $\gamma_1$ and $\gamma_2$ are the cut-off points for the adjacent levels of the dependent variable as presented in section 5.
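The mapping in equation (11) from a fitted value to the three outcome probabilities can be sketched in a few lines. The function name and the numerical values of the fitted value and cut-off points below are hypothetical; they are not the estimates reported in this study.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def outcome_probabilities(y_hat, gamma1, gamma2):
    """Home, draw and away probabilities from an ordered probit fitted value.

    `gamma1` and `gamma2` are the cut-off points, with gamma1 < gamma2.
    """
    p_away = norm_cdf(gamma1 - y_hat)
    p_draw = norm_cdf(gamma2 - y_hat) - norm_cdf(gamma1 - y_hat)
    p_home = 1.0 - norm_cdf(gamma2 - y_hat)
    return p_home, p_draw, p_away

# Hypothetical fitted value and cut-off points (illustrative only).
p_home, p_draw, p_away = outcome_probabilities(y_hat=0.3, gamma1=-0.4, gamma2=0.35)
```

By construction the three probabilities sum to one, and a larger fitted value shifts probability mass from the away win towards the home win.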

The second binary model, equation (10), differs from the first, equation (9), in that it includes the additional variable $(P^r_{i,j} - \phi^r_{i,j})$, which is used to test whether the intricate model from section 5.2 captures any publicly available information that has not been incorporated in the bookmakers’ odds. This approach is adopted because the sample is limited to the 155 matches from the 2013-14 season, which means that including the 52 covariates from the intricate model, presented in section 5.2, would not be appropriate. The same mechanism was used by Goddard and Asimakopoulos (2004) and makes intuitive sense: if the bookmakers’ odds incorporate all publicly available information, then the additional variable $(P^r_{i,j} - \phi^r_{i,j})$ would not be a statistically significant explanatory variable. The test using the second binary model, equation (10), is also repeated for three cases: first, $r_{i,j}$ is equal to 1 if the match between teams i and j results in a home win, and 0 otherwise; second, $r_{i,j}$ is equal to 1 if the match results in a draw, and 0 otherwise; third, $r_{i,j}$ is equal to 1 if it results in an away win, and 0 otherwise. Next, the following null hypotheses are tested, separately, in order to find out whether each of the estimates in the new binary model is statistically significant:

H0: $\alpha_r = 0$ vs H1: $\alpha_r \neq 0$

H0: $\beta_r = 0$ vs H1: $\beta_r \neq 0$
