Composing the optimal football squad: an ordered probit approach on changing the world of football


UNIVERSITY OF AMSTERDAM MASTER THESIS ECONOMETRICS

Composing the Optimal Football Squad

An ordered probit approach on changing the world of football

Thesis presented for the degree of Master of Science in Econometrics

Author: Gijs Kruikemeier

Supervisor: Dr. J. C. M. van Ophem

Student number: 10750754

Second Reader: Dr. M. J. van der Leij

Track: Date:


Abstract

Every football club in the world is remembered by its heroic victories. The manager and sporting director of the club can have great influence when it comes to winning prizes. With data-driven analyses likely being the future of the football landscape, models that help clubs manage their teams become more and more relevant. Therefore, this thesis presents a management tool that can optimize a team’s chances of fulfilling its sportive ambitions by adjusting its squad. Over 3,700 Premier League matches from the past ten years are used to estimate an ordered probit model on match outcome. The differences in footballing abilities between the two opposing teams, as measured by the Euro Player Index (EPI), are used as main explanatory variables. It is found that only the differences in EPI between the central midfielders, the left midfielders, and the substitutes have a significant influence on match outcome. Additionally, a model for player market value, with EPI as explanatory variable, is presented. The results of the ordered probit model for match outcome and the model for player market value are then combined to create the management tool. Subject to a budget constraint, the tool maximises the probabilities of winning points. Given the squads of the opponents in the coming season, the most efficient budget distribution across the team can be obtained. With that, the club can optimise the probability of ending the season at the desired place in the league table.


Statement of Originality

This document is written by Gijs Kruikemeier who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


1 Introduction

“European football is unquestionably the world’s most popular sport.” (Matheson, 2003). While this may seem like a rather bold claim in the book of Matheson, it can be supported by numbers and facts. The 2017 final of the most prestigious football club tournament, the Champions League, had an estimated global TV audience of 350 million people (Bentley, 2017). In comparison, the estimated number of viewers for the Super Bowl was not even half of that. The European competition for countries in 2016 even attracted an estimated two billion television viewers (USA Today, 2016). A sport this big naturally has a lot of money involved. Football clubs across the globe spent an estimated 6.37 billion dollars on buying players in 2017 (Eurosport, 2018). With clubs spending that much money on transfer fees, let alone the salaries and bonuses they have to pay, it can be questioned whether football clubs are still profitable. Frick (2007) indeed states that most clubs try to maximize sporting success instead of business success (2007, p. 426). He claims that the revenues from ticket sales, merchandise and the sale of television broadcasting rights are directly spent in order to achieve more on the pitch (Frick, 2007, p. 426). It is therefore of great importance that, when a club takes maximizing utility, or sporting success, as its main goal, this optimizing process is thoroughly investigated.

Consequently, there has been a great deal of research in the field of optimizing team performance: the papers of Carmichael, Thomas & Ward (2000), Oberstone (2011), Dobson & Goddard (2003) and Kern & Süssmuth (2005), just to name a few. However, team performance is not the sole means when it comes to achieving sporting ambitions. For a football team to perform, it needs the right people in the right spots. The team’s manager must be able to lead all the individual players and forge them into a solid squad. Moreover, a striker is expected to score goals, a midfielder to give key passes and a keeper to keep clean sheets. Therefore, individual manager and player performance are both important aspects for a club to address. As for manager performance, Audas, Dobson & Goddard (2002), Kern & Süssmuth (2005) and Koning (2003) have all written articles that address that matter. Furthermore, player performance is, amongst others, analysed in McHale, Scarf & Folker (2012), McHale & Szczepański (2014) and Schultze & Wellbrock (2018). However, despite the findings in these papers, it can be questioned whether all professional football clubs have managed to maximize their sporting success given their budget. Sporting directors still tend to rely on their own intuition or that of a scout when deciding how and with whom to improve their first team’s squad. The reason for this may be that relying on the old scouting ways works well enough for them. However, another explanation is that a model that brings team performance, individual player performance and managerial decisions together is not yet present in the current literature.

Therefore, this thesis aims to create a model that can optimize a team’s chances of fulfilling its sportive ambitions by adjusting its squad. Furthermore, it tries to answer the question at what point the squad is vulnerable and needs improvement. In this process, the question of how, and how much, achieving sporting ambitions depends on the squad of a football club is central. Consequently, investigating which positions in the field are vital and how player quality influences a team’s result become relevant concepts. The approach on this matter will be that of a three-alternative ordered probit model [2]. The starting point is a single latent variable that determines whether the observed match outcome is a “win”, a “draw” or a “loss”. For 3734 matches, divided over ten seasons, in the Premier League [3], the outcome of the match is explained by the difference in Euro Player Index [4] (EPI) for every position, and the difference in average squad age. The Euro Player Index assigns to every player in the dataset a value that represents his footballing abilities. The index stands for the quality of a football player regardless of his position, making comparison between players at different positions possible.

[2] The three-alternative ordered probit model is explained more extensively in chapter 4.
[3] The highest-level football competition in England.
[4] The EPI is developed by Hypercube and owned by Remiqz. Both are football data analytics companies, located in Utrecht and Amsterdam respectively. This is further explained in chapter 3.

In the model developed in this thesis, the EPI of the left back of one team is compared to the EPI of the right forward of the other. This analysis is done from the point of view that having ascendancy at the most positions, or at the most critical positions, in the field is key in claiming match victory. Another perspective is that the difference in EPI of players at the same positions of both teams is key in winning the match. That is, the EPI of the left back of one team is compared to the EPI of the left back of the other. To this end, a model in which the EPIs of the left backs of both teams, the central backs, the right backs, etcetera are compared is also estimated. The model corrects for team-specific home advantage and for the opponent that is faced. In order to be able to correct for home advantage, the outcome of the match is modelled for half of the matches in the dataset where teams played at home, and the other half of the matches where teams played away.

Additionally, next to a model that describes the data, this thesis also presents a managerial tool for clubs to make transfer decisions. This tool is created using the estimated parameters from the probit model for match outcome. In general, the sportive goal of a football club is to attain a certain position (or finish within a certain range of positions) in the league table. For example, clubs like Manchester United, Manchester City and Chelsea may aim to become league winners, whereas clubs like Swansea City, Stoke City and West Bromwich Albion can have avoiding relegation from the Premier League as their sportive ambition.
In the manager tool, given the squads of the opponents in the coming season, the expected number of points that a team will collect in this coming season is optimized over the players in its own squad, given the budget restriction of the club. By observing what number of points resulted in which position in the league table over the past decade, it can be estimated what number of points results in which position in the coming season. By estimating the number of points it will collect, a club can answer the question whether or not it is going to fulfil its sportive ambitions in the coming season. An important side note is that, because the manager tool takes the squads of the other teams in the league as given, the estimated number of points that results from the optimisation is more relative to the other teams than absolute. Consequently, the tool can only be used for one club per season.
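
The step from outcome probabilities to an expected number of points can be illustrated with a minimal sketch. It assumes the usual league scoring of three points for a win and one point for a draw, and it takes the ordered probit index of each fixture as given. The function names, the cut-off values and the fixture indices below are hypothetical and are not the estimates of this thesis; the optimisation over the squad under the budget restriction is left out.

```python
from math import erf, sqrt


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def outcome_probabilities(index: float, cut_low: float, cut_high: float):
    """Ordered probit probabilities (loss, draw, win) for one match.

    `index` is the linear predictor of the match; `cut_low` < `cut_high`
    are the two thresholds. All three are illustrative inputs.
    """
    p_loss = normal_cdf(cut_low - index)
    p_draw = normal_cdf(cut_high - index) - p_loss
    p_win = 1.0 - normal_cdf(cut_high - index)
    return p_loss, p_draw, p_win


def expected_points(indices, cut_low: float, cut_high: float) -> float:
    """Expected season points: 3 per win, 1 per draw, summed over fixtures."""
    total = 0.0
    for index in indices:
        _, p_draw, p_win = outcome_probabilities(index, cut_low, cut_high)
        total += 3.0 * p_win + 1.0 * p_draw
    return total


# Hypothetical indices for three fixtures and hypothetical thresholds.
fixture_indices = [0.4, -0.2, 0.0]
print(round(expected_points(fixture_indices, -0.3, 0.3), 2))
```

In such a setup, a tool of the kind described above would search over affordable squads, recompute the fixture indices for each candidate squad, and keep the squad with the highest expected number of points.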

The outline of this thesis is as follows. Chapter 2 gives an overview of the literature that is available on football team performance, individual player performance and manager performance. Chapter 3 explains the dataset and its variables, and presents relevant descriptive statistics. Chapter 4 describes the model and estimation procedure. Chapter 5 presents and analyses the results. In chapter 6, a summary of the thesis is presented. Additionally, a few points of discussion are addressed, and the implications of the findings are given.

2 Literature Review

As stated in the introduction, extensive research has been done on sporting achievements in general. The influence of player performance, team performance and manager performance on the sporting ambitions of a football club has been thoroughly investigated in the past decades. This chapter gives an overview of the relevant papers regarding these subjects. Furthermore, the methods and findings of the most important papers are explained. Chapter 2 is organized as follows. It starts by addressing the sporting ambitions of football clubs in section 2.1. The general management of a football club and its revenues are discussed in section 2.2. In section 2.3, the individual wages of players and coaches are regarded. The relevant articles on football player transfers are discussed in section 2.4. Finally, chapter 2 is summarized and discussed in section 2.5.


2.1 Sporting ambitions

There has been a great deal of research regarding the performance of a sports team. In other sports like baseball, the performance, and with that the statistics, of teams have been investigated extensively. As early as 1982, James (1982) introduced Sabermetrics: the mathematical and statistical analysis of baseball. Thereafter, a lot of statistics-based books on baseball analysis have been written. Dewan (2006), Albert & Bennet (2003), Keri (2006) and Lewis (2003) are just a few examples. The latter is a book based on the true and fascinating story of the Oakland Athletics and their road to victory. Based on Sabermetrics, their manager composed a team that broke the record of most consecutive wins in American League history. Anderson & Sally (2013) investigate football statistics. Questions like “How valuable are corners?” and “Which goal matters most?” are considered. As for sporting ambitions in football, to achieve success, the ambitions of a club must match the abilities of the squad. Furthermore, as a football squad cannot perform without a capable manager, the manager’s abilities must also be in line with the club’s ambitions. This section about sporting ambitions is divided into section 2.1.1, which contains literature on manager performance, section 2.1.2, which elaborates on squad performance, and section 2.1.3, which regards individual player performance.

2.1.1 Manager performance

The ways of a manager are of course of influence on the performance of the team. Eventually, the eleven players on the pitch and the substitutes have to finish the job. However, the manager is always responsible for the result. Koning (2003) evaluates the effect of firing a coach on team performance. Based on data from the Eredivisie from 1993 to 1998, he presents a model in which he controls for the difference in quality of opponents faced by the old and the new coach. The researcher finds that the performance of a team does not always improve when a coach is fired (Koning, 2003, p. 561). Another paper that investigates the influence of discharging a manager is that of Audas et al. (2002). They estimate a model based on match-level data. The researchers find that a manager change within the season results in worse short-term performances by the squad. Audas et al. (2002) go on to explain that this might have something to do with the fact that players have to adapt to the playing style of the new manager. They state that it may take up to sixteen matches (approximately three months) for a team to unlock its full potential after a within-season manager change (Audas et al., 2002, p. 644).

2.1.2 Squad performance

As stated in the previous section, in the end, the team has to do it. The performance of the squad is decisive for the result. On this matter, a wide variety of research has been done. McHale & Davies (2007) find that simply taking the FIFA world rankings [6] to predict the match outcome of international games does not work well. These rankings do not adjust fast enough to reflect a team’s current performance. Audas et al. (2002) estimate the outcome of a football match by an ordered probit model. In their model, the latent variable is explained by the home team’s average win ratio over the recent seasons, the results of the recent home matches played by the home team and the results of the recent away matches played by the home team. Furthermore, the model corrects, amongst other variables, for being the home or the away team, and for the geographical distance between the two clubs. The researchers find that most of the explanatory variables are significant. For one, the home team’s average win ratios of the past two seasons are significant. The match results of both teams up to three home and three away matches back (i.e. approximately six matches in total) are also of significant influence on the estimated match outcome. Additionally, the variables match significance and geographical distance are also significant. Next to Audas et al. (2002), Koning (2000) also creates an ordered probit model to predict the result of a game. In his model, match outcome is determined by home advantage and the difference in quality between the two opposing teams (Koning, 2000). Another article that uses ordered probit to model match result is that of Kuypers (2000).

[6] This ranking system was introduced in 1992 and aims to determine which footballing country is the best in the world based on recent results.
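
For reference, the general form of a three-alternative ordered probit model, as used in the papers above, can be written down as follows. This is the textbook formulation; the symbols (the regressor vector $x_i$, the coefficients $\beta$ and the thresholds $\mu_1 < \mu_2$) are notation assumed here for illustration and need not match the notation of the cited papers or of chapter 4.

```latex
\begin{align*}
y_i^{*} &= x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0,1), \\
y_i &=
\begin{cases}
\text{loss} & \text{if } y_i^{*} \le \mu_1, \\
\text{draw} & \text{if } \mu_1 < y_i^{*} \le \mu_2, \\
\text{win}  & \text{if } y_i^{*} > \mu_2,
\end{cases} \\
P(y_i = \text{loss}) &= \Phi(\mu_1 - x_i'\beta), \\
P(y_i = \text{draw}) &= \Phi(\mu_2 - x_i'\beta) - \Phi(\mu_1 - x_i'\beta), \\
P(y_i = \text{win})  &= 1 - \Phi(\mu_2 - x_i'\beta).
\end{align*}
```

Here $\Phi$ denotes the standard normal distribution function. A larger index $x_i'\beta$ shifts probability mass from a loss towards a win, while the width of the interval between $\mu_1$ and $\mu_2$ governs how likely a draw is.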

Papers that handle football match results in another way are present in the literature as well. Oberstone (2009) investigates team performance in the Premier League to distinguish the top clubs from the rest. McHale & Scarf (2007) create a bivariate model for home and away team shots on target, finding a negative correlation between the two. They find that “playing the beautiful game” [7] is an effective strategy (2007, p. 444). Successful teams are characterized by their tendency to play beautiful football. Pollard (2006) presents, based on data of competitions between 1997 and 2003, a measure of home advantage per football league in the world. His main finding is that in the main leagues in Europe [8], home advantage does exist. In these leagues, the home advantage is between 60% and 65%, that is, 60% to 65% of the points won between 1997 and 2003 were won at home. For the Balkan countries, where up to 78% of the points were won playing at home, home advantage is even more of an issue. On the other hand, countries like San Marino and Andorra, which have small football competitions and very small stadiums, do not seem to display any advantage of playing in the home stadium. As playing at home is a variable that can describe data as well as predict it, because it is known beforehand, home advantage is accounted for in the model of this thesis. Another paper that compares leagues from different countries is that of Oberstone (2011). In his article, the researcher presents the main differences between La Liga [9], Serie A [10] and the Premier League. The findings that are most relevant for this thesis are the following. The Premier League has a significantly lower percentage of shots on target than the Serie A and La Liga. This may be a consequence of the tighter man marking in the Premier League (Oberstone, 2011, p. 11). Furthermore, players in the Serie A have a passing accuracy that is significantly better than that of players in the other two leagues. Additionally, the Serie A has the highest percentage of successful tackles, and both the Serie A and the Premier League have a higher average number of tackles per game than La Liga. Lastly, of the three leagues, the Premier League has the lowest number of fouls, yellow cards and red cards (Oberstone, 2011, p. 11). The results of Pollard (2006) and Oberstone (2011) give reason to develop different models for different countries. In this thesis, only a model for the Premier League is considered. Consequently, as implied by Pollard (2006) and Oberstone (2011), the results in this thesis need not hold for other football competitions.

[7] “The beautiful game” is characterized by a lot of passing and crossing. This type of football is considered to be joyful to watch.
[8] The main leagues that are considered here are those of Spain, France, Germany, Italy, England and The Netherlands.
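
The measure of home advantage used by Pollard (2006), the share of all points that is won at home, can be made concrete with a small sketch. It assumes three points for a win and one for a draw; the function names and the example scores are hypothetical and serve only as illustration.

```python
def points(goals_for: int, goals_against: int) -> int:
    """League points for one team in one match (assumed 3-1-0 scoring)."""
    if goals_for > goals_against:
        return 3
    if goals_for == goals_against:
        return 1
    return 0


def home_advantage(results) -> float:
    """Share of all points won by home teams.

    `results` is a non-empty list of (home_goals, away_goals) tuples.
    A value of 0.5 means no home advantage at all.
    """
    home_points = sum(points(h, a) for h, a in results)
    away_points = sum(points(a, h) for h, a in results)
    return home_points / (home_points + away_points)


# Hypothetical scores: one home win, one draw, one away win.
example_results = [(2, 0), (1, 1), (0, 1)]
print(home_advantage(example_results))
```

A value between 0.60 and 0.65 on this measure corresponds to the range reported above for the main European leagues.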

Dobson & Goddard (2003) investigate whether persistence in sequences of football match results is an issue in the Premier League. They conclude that there is no persistence for sequences of consecutive matches without a loss, and sequences of consecutive losses. However, the researchers find negative persistence for sequences of consecutive wins and sequences of consecutive matches without a win. Furthermore, Dobson & Goddard (2003) state that their results reveal little to nothing about the true existence of a persistence effect. This is due to a selection effect that is present in their model. Teams that have long sequences of not winning are generally the weaker clubs. Thus, their chances of not winning again are based not only on their current dry spell, but also on the fact that their team is weak. The model that this thesis aims to create is a predictive one. It does not focus solely on the next upcoming game; it predicts the outcome probabilities for all the matches in the coming season. Therefore, if the mood of a football team (i.e. whether a team is experiencing a sequence of consecutive wins or losses) is included in the model, then, as the season progresses, the prediction for the next match depends on the outcome of the past matches. If these past matches were not forecast correctly, the estimated mood after a few matches is not right either. The estimated outcome probabilities of the next match after this wrongly predicted sequence could become severely distorted. As the season progresses, this problem could get worse. Therefore, the mood of a football team is not regarded in the model of this thesis.

[9] The highest-level football competition in Spain.
[10] The highest-level football competition in Italy.

The last paper that is discussed in this section is that of Carmichael et al. (2000). These researchers present a model for team performance per match. Their dependent variable is the observed team’s goals minus those of the opponent. The variables that this match outcome is regressed upon are, amongst many others, shots hitting the woodwork, clearances, blocks and interceptions, tackles, percentage of successful passes, playing at home, and red cards. Also, a fixed effect per opposing team is added to the model. Except for the dummy variable for home advantage and the fixed effect dummies, all variables are in differences between the opposing teams. The variables mentioned here are all found to have a significant effect on match outcome. Also, as expected, all variables except for red cards have a positive effect on match outcome.

2.1.3 Player performance

With any professional sports team consisting of individual athletes, it is important to investigate sports performance at the individual level. The performance of the individual athlete has been investigated extensively. However, most of the papers about this subject regard athletes that compete in an individual sport, such as tennis or golf. McHale & Forrest (2005) create a model to predict professional golf tournaments, whereas McHale & Morton (2011) investigate tennis match outcomes. The contributions of a player to the result of a match are much clearer if the sport played is an individual one. Still, research on individual player performance in team sports is available. The first index that rated players in a team sport regardless of their position was the EA Sports player performance index. McHale et al. (2012) analyse the construction of this index. They explain that it is a weighted average of match contributions, winning performance, match appearances, goals scored, assists and clean sheets. Next to McHale et al. (2012), another article on individual player performance is that of Lewis (2005). In his paper, he presents a measure of player performance in cricket. Further, books about this subject in baseball have been written by Goldman & Kahrl (2010) and James & Henzler (2002). Sill (2010) wrote a book about an adjusted plus/minus (APM) metric in the NBA [11]. This APM metric starts by assuming that what matters most is a player’s contribution to the victories of the team. Additionally, it corrects for the teammates and opponents that are on the field while the player is playing. This APM system is also used in hockey, as described by Macdonald (2011), and, of course, in football. For example, Schultze & Wellbrock (2018) used data from the 2012/2013 Bundesliga [12] season and created an individual player performance index that is built up as follows. Whenever a team scores, its players on the field are awarded points. When the team concedes, points are subtracted. This model is then corrected for the strength of the opponent and the timing of the goal. If a goal is scored in crunch time (important goals), it is much more valuable than if it is scored in garbage time (when the outcome of the game is already decided). The identifying assumption in Schultze & Wellbrock (2018) is that all players on the field contribute equally to the result of the match, corrected for the minutes they played (2018, p. 122). Their plus/minus metric assigns a value or index to every player in the system based on their contributions to the results in the 2012/2013 season. The researchers state that “This shows that the plus/minus metric has a dual nature, as it can be used both as an evaluation tool for one team and as a scouting tool for another” (Schultze & Wellbrock, 2018, p. 125). A system that is similar to the plus/minus metric, but more advanced, is the Euro Player Index. The EPI is used in the model of this thesis as the most important explanatory variable and it is explained in the next chapter.

[11] National Basketball Association, the highest basketball league in North America.
[12] The highest-level football competition in Germany.
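
The basic mechanics of a plus/minus metric as described above can be sketched in a few lines. This is a simplified illustration, not the metric of Schultze & Wellbrock (2018): the single `weight` per goal is a hypothetical stand-in for their corrections for opponent strength and timing of the goal, and all names and numbers are made up.

```python
from collections import defaultdict


def plus_minus_per_90(goal_events, minutes_played):
    """Naive plus/minus score per 90 minutes for every player.

    goal_events: list of (scoring_players, conceding_players, weight),
        where both player collections hold those on the pitch at the goal.
    minutes_played: dict mapping player -> total minutes on the pitch.
    """
    totals = defaultdict(float)
    for scoring_players, conceding_players, weight in goal_events:
        for player in scoring_players:
            totals[player] += weight   # reward everyone on the scoring side
        for player in conceding_players:
            totals[player] -= weight   # penalise everyone on the conceding side
    # Correct for playing time, as all players are assumed to contribute equally.
    return {
        player: 90.0 * totals[player] / minutes
        for player, minutes in minutes_played.items()
        if minutes > 0
    }


# Hypothetical events: a full-weight goal and a less important goal.
events = [({"A", "B"}, {"C"}, 1.0), ({"C"}, {"A"}, 0.5)]
minutes = {"A": 90, "B": 45, "C": 90}
print(plus_minus_per_90(events, minutes))
```

Because every player on the pitch receives the same reward or penalty, the score only separates players through differences in when, and for how long, they were on the field.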

2.2 Club management

From the point of view of managing any organization, it is important to establish the relationship between the inputs used in production and their relative contributions to output (Carmichael et al., 2000, p. 31). In this approach, a football club is no more than an organization with inputs and outputs. Kern & Süssmuth (2005) state that “Of course most clubs still consider success on the pitch and the glory of victory as their main business objective” (2005, p. 486). Every football club in history is remembered by its great victories, not by its net profit. Consequently, if a football club is considered as an organization, it can boldly be claimed that the input is money, and the output is sporting success. Furthermore, Kern & Süssmuth (2005) state that “Clubs invest in players, coaches and management in order to succeed in the several competitions in which they take part and thereby increase revenue from the gate, broadcasting rights, merchandising and sponsoring” (2005, pp. 485-486). Managing a football club can be seen as a continuing cycle of increasing revenues and investing those revenues in the improvement of the first team squad [13]. A better team will then (expectedly) perform better and increase profits, which can then again be invested to improve the squad.

[13] Not all revenues are directly invested in the first team squad. A football club often has a youth academy in which investments also need to be made. Naturally, these investments in the youth academy are also, albeit indirectly, aimed at eventually improving the first team’s results. However, in this thesis, the focus will be on directly improving the first team by means of buying and selling players.

Stene (2016) develops in his paper a strategic management tool for managers in European professional football. He states that clubs nowadays thrive on a point-maximizing mentality instead of a profit-maximizing mentality. While Stene (2016) does believe that his tool provides managers with a better understanding of the problems they are facing, he concludes by stating that modelling a football club as a business in total is a complicated challenge. Kern & Süssmuth (2005) examine the economic output of a football club. The researchers use clubs from the Bundesliga to execute a pooled regression using the data of two seasons: 1999/2000 and 2000/2001. A Cobb-Douglas type production function is estimated with the log of the club’s adjusted total revenues, ln(REV), as output. They find that participation in the Champions League has a positive influence on revenues. Of course, in turn, increased revenues possibly have a positive influence on the probability of entering the Champions League, which may create endogeneity issues. Moreover, if a club has a big fanbase, it generally has a higher income. Kern & Süssmuth (2005) present results in which the logs of the ex-ante estimates of the wage bills of the players and of the coaches have a significant influence on ln(REV). In their final estimate, a 1% increase in player wages results in a 0.52% increase in revenues. If the wage of the coach increases by 1%, the club’s revenues will, according to the model, increase by 0.27%. However, the researchers also estimate a model in which sporting performance is the dependent variable. For every team, this is a weighted aggregated point index based on up to four competitions in which the team can compete. The weights are determined using a difference in importance between the competitions (i.e. the Champions League is more important than the national cup). Here, they find that the wages of the players as well as that of the coach do not have a significant influence on athletic output. In short, Kern & Süssmuth (2005) find that player and manager wages significantly influence economic output, but not sporting output. The relation between the input in the football industry, money, and the output, sporting success, is complex. Which factors influence individual player salaries may give insight into this complex relation.
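
The interpretation of the two wage coefficients follows from the log-log form of a Cobb-Douglas type function, in which coefficients are elasticities. The sketch below uses hypothetical symbols ($W_P$ for the player wage bill, $W_C$ for the coach wage bill, and dots for the remaining regressors, which are not specified here); only the values 0.52 and 0.27 are taken from the text above.

```latex
\begin{align*}
\ln(REV) &= \alpha + \beta_P \ln(W_P) + \beta_C \ln(W_C) + \dots + \varepsilon, \\
\frac{\partial \ln(REV)}{\partial \ln(W_P)} &= \beta_P = 0.52, \qquad
\frac{\partial \ln(REV)}{\partial \ln(W_C)} = \beta_C = 0.27, \\
\frac{REV_{\text{new}}}{REV_{\text{old}}} &= 1.01^{0.52} \approx 1.0052
  \quad \text{(player wages } +1\%\text{)}, \\
\frac{REV_{\text{new}}}{REV_{\text{old}}} &= 1.01^{0.27} \approx 1.0027
  \quad \text{(coach wages } +1\%\text{)}.
\end{align*}
```

For small changes the elasticity can thus be read directly as a percentage effect. For larger changes the exact ratio deviates somewhat: a 10% increase in player wages would correspond to $1.10^{0.52} \approx 1.051$, that is, roughly 5.1% rather than 5.2% more revenue.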


2.3 Individual wages

For the management of a football club, it is important to investigate the wages of coaches and players, and what variables influence these wages. Battré et al. (2008) present a model for football player wage. Their main objective is to find whether and to what extent performance influences salary. In their paper, performance is measured in terms of, amongst others, career games and goals, number of games played and goals in the last season, and a dummy for team captain. For the period of 1995 to 2007, the researchers estimate an equation for 1993 different players where ln(wage) is the dependent variable. They find that the variable that has the biggest influence is age. The positive influence of age on salary is also found in Lehmann & Schulze (2005), Feess, Frick & Muehlheusser (2004), Lucifora & Simmons (2003) and Huebl & Swieter (2002). Another finding is that players from South America and Western Europe receive a considerable pay premium in comparison to players from the rest of the world. However, this may be a consequence of their way of modelling. It is indeed more logical that a player gets paid based on his abilities and not his country of origin. Furthermore, a player’s position has a big influence on his wage. Forwards earn the most, then midfielders, then defenders, and goalkeepers earn the least of the squad. Because Battré et al. (2008) also control for the goals scored in past years, goal scoring cannot be the reason that forwards have the highest salary, followed by midfielders. As for the influence of goals scored on player remuneration, the goals that are scored in the last season have a far greater influence than career goals, that is, recent performance is far more important than past performance. This also holds for the variable games played. The positive effect of player performance on salary is also found by Lucifora & Simmons (2003). They use a cross section from the Serie A and find that the number of games played and the number of goals scored have a significant positive effect on player wages. Lastly, Battré et al. (2008) find that a so-called “superstar effect” is present. This is the effect that causes the wage of a player to increase because spectators come to the stadium or watch television just to see him play. Battré et al. (2008) conclude by stating that their models explain player salary quite well from the various performance measures.

2.4 Transfers

An important aspect for any sporting director is the transfer value of a desired player. In his negotiation efforts, he will always try to keep the price as low as possible. The selling party will do the opposite. The question becomes, however, what determines transfer value. What aspects of a player make him more valuable than his colleague? A paper that investigated these questions is that of Eschweiler & Vieth (2004). They investigated 254 transfers in the Bundesliga from 1997 to 2003. The researchers find that factors that positively influence a transfer fee are, amongst others, age, not being a goalkeeper, the FIFA-coefficient of the country of origin and the number of international caps. Eschweiler & Vieth (2004) also find that age squared and international caps squared negatively influence transfer fee. This indicates that as a player ages, the positive effect of age on transfer fee diminishes. Carmichael, Forrest & Simmons (1999) use a Tobit model, with transfer fee as dependent variable, for the estimation of the transfer fee of football players. These researchers find that variables that positively influence transfer fee are age, number of appearances for former and current clubs and number of goals (1999, p. 143). They also find that age squared negatively influences transfer fee. A more recent paper on football transfers is that of Ruijg & Ophem (2015). The researchers create a model that corrects for the selectivity problem that not all transfer fees are observed and thus that the sample used may not be random. In their estimates, they find that the most important variables that influence the transfer value are age, average minutes played and not being a goalkeeper (Ruijg & Ophem, 2015, p. 19). In conclusion, the important variables that are found by all papers to positively influence transfer fee are age, playing matches and not being a goalkeeper.
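
The combination of a positive coefficient on age and a negative coefficient on age squared implies an inverted-U relation between age and transfer fee. A short derivation, with $\beta_1$ and $\beta_2$ as assumed symbols for these two coefficients and dots for all other regressors, makes this explicit. Whether the cited papers specify the fee in levels or in logs is not stated here; the turning point is the same in both cases.

```latex
\begin{align*}
\text{fee} &= \dots + \beta_1\,\text{age} + \beta_2\,\text{age}^2 + \dots,
  \qquad \beta_1 > 0,\ \beta_2 < 0, \\
\frac{\partial\,\text{fee}}{\partial\,\text{age}} &= \beta_1 + 2\beta_2\,\text{age}, \\
\text{age}^{*} &= -\frac{\beta_1}{2\beta_2} > 0.
\end{align*}
```

Below $\text{age}^{*}$ an additional year raises the predicted fee, but by less and less; beyond $\text{age}^{*}$ an additional year lowers it.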

2.5 Summary and implications

Chapter 2 gave an overview of the literature regarding football team performance, individual player performance and manager performance. Additionally, club management, individual wages and transfers were discussed. In summary, the main implications of the literature for the model of this thesis are the following. In section 2.1.2, it is found that an ordered probit model works well in predicting match outcome. An ordered probit approach to forecasting match result is also the approach of this thesis. However, the difference is that the researchers in section 2.1.2 explain match result based on the difference between teams as a whole, whereas this thesis explains match result based on the differences between the individual players of opposing teams. Furthermore, Carmichael et al. (2000) create a model in which they explain team performance from various variables. Almost all their explanatory variables are expressed as differences between the two teams. Combined with the assumption that player performance is very important in determining match result, this thesis uses differences in EPIs per position as main explanatory variables for match outcome. Additionally, this thesis presents different ways of explaining match outcome based on differences in Euro Player Index. A model in which direct opponents (left back against right winger) are compared is reviewed against a model in which players with the same positions are compared (left back against left back). Furthermore, a model in which just the EPIs of the players of the evaluated team are used as explanatory variables is created. Pollard (2006) finds that home advantage is not the same in the big European leagues but is definitely something that influences match outcome. Therefore, the model of this thesis corrects for home advantage. The technical specification of the model is explained in detail in chapter 4. First, in chapter 3, the datasets that are used in this thesis are described.

3 Data and Variables

In this thesis an ordered probit model for match outcome is presented. Additionally, a model for player market value with EPI as main explanatory variable is estimated. These two models are then combined in the creation of a manager tool. For the estimation of the ordered probit model on match outcome, and for the estimation of the relation between EPI and market value, two different datasets are used. In this chapter, the variables in the two datasets are explained. First, in section 3.1, the most important variable in both of the datasets, the Euro Player Index, is explained. Then, in section 3.2, the dataset that is used for the estimation of the ordered probit model is regarded. Section 3.3 elaborates on the different team formations that were used in the different matches in the dataset. Finally, in section 3.4, the dataset that is used for the estimation of the relation between EPI and individual player market value is described.

3.1 The Euro Player Index

In this thesis, the variable that is key in explaining and forecasting match outcome is the Euro Player Index. This index is developed by Hypercube¹⁵ and used in their football analytical models. The construction of the EPI is, based on information from Hypercube and Remiqz¹⁶, globally explained in this section. A precise description of EPI on model basis cannot be given as this information is confidential. To start off this explanation, a short definition of the Euro Club Index (ECI), also developed by Hypercube, is given.

The Euro Club Index is a single value given to each club based on their recent performance. After each match the club has played, the index is adjusted based on the result.

¹⁵ Hypercube is, as mentioned before, a football data analytics company located in Utrecht.

¹⁶ Remiqz is, as mentioned before, a football data analytics company located in Amsterdam that works closely with Hypercube.

This adjustment accounts for the ECI of the opponent. If for example AFC Ajax, a club with a relatively high ECI, were to play against NAC Breda, a club with a considerably lower ECI, the index of AFC Ajax will not increase a lot if they win the match, as this was already expected based on the ECI of the two teams. On the contrary, if Ajax loses, their index will drop a great deal, because they then lost against a weaker club. Furthermore, the system also corrects for the competition a team is in. If a Premier League team loses against a team from La Liga, all the teams in the Premier League will get a small negative correction because their competition is now of lower level than before, in comparison with other competitions. Also, all teams in La Liga get a small positive correction. In July 2007, the Euro Player Index system started. Back then, because the ECI system was already operational, the EPI started by giving every player the starting value of the ECI of their club. After about a year, the indices of all the players were calibrated, and the EPI system was up and running.

The Euro Player Index aims to assign a single value to each player in the system based on their footballing abilities. It is an incremental system that updates the EPI of each player after each game they played, and it works as follows. Before every game, the Euro Selection Index (ESI) of both teams is determined. The ESI is the average of the EPIs of the eighteen best players from that club at that particular time. Where the ECI is an index that represents the performance of the club in the past years, the ESI represents the ability of the current squad. Thus, for every game, the expected match outcome is determined based on this ESI. Match outcome is one if the home team wins, zero for a draw, and minus one if the away team wins. Given the historical results in matches between teams with similar ESIs, corrected for home advantage, an expected result is established. This expected result is a value between minus one and one. As stated by Pollard (2006), the measure of home advantage is not the same for all competitions. Consequently, this correction for home advantage takes into account which competition the match is played in.
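The ESI definition above can be illustrated with a minimal sketch. The function name and the example squad are hypothetical, and averaging over fewer than eighteen players when a squad is smaller is an assumption; the actual implementation by Hypercube is confidential and may differ.

```python
def euro_selection_index(player_epis, squad_size=18):
    """Average EPI of the `squad_size` highest-rated players of one club."""
    if not player_epis:
        raise ValueError("at least one EPI value is required")
    # Keep only the best players; smaller squads are averaged in full (assumption).
    best = sorted(player_epis, reverse=True)[:squad_size]
    return sum(best) / len(best)


# Hypothetical squad of 20 players: the two lowest EPIs do not count.
squad = [2000.0] * 18 + [1000.0, 900.0]
esi = euro_selection_index(squad)
```

In this example the two weakest players have no influence on the ESI, which is in line with the idea that the ESI represents the ability of the current first-team squad.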

Then, based on the predicted match outcome, the personal EPI, and the ratio of the personal EPI against that of his teammates, an expected value of each individual player is determined. This individual expected value is, just as the expected match outcome, a number between minus one and one. While the match is played, the expected value of the match result changes depending on whether or not goals are being scored. If, for example, the expected value of the result was 0.3 before the match and no goals are scored, the expected value will go linearly to zero (a draw) as the minutes pass. However, if at 0-0 in the 85th minute the home team scores, the expected match result will jump upwards and will then go linearly to one. The system that assigns an individual expected value to every player is implemented to obtain different EPI changes for players in the same team. With this system, the EPI of the best players will rise slower and decline faster. For the EPI of the weakest players, the contrary holds. The change in EPI per player per match then depends on the change in his individual expected value between the times he stepped on and off the pitch. Furthermore, goals and their timing (how important was the goal?) are taken into account when determining the change in EPI. Additionally, assists and yellow and red cards are also taken into account. Say, for example, that Matthijs de Ligt started a match for AFC Ajax. While he was on the field, the score changed from 0-0 to 2-0 in favour of Ajax. If he is then substituted off in the 60th minute and Ajax loses the game 2-3, this is not the fault of De Ligt. While he was on the field, the chances of Ajax winning the game went up. Therefore, his personal contribution to the match outcome is positive and his EPI goes up. With EPI, a system is created in which all players are assigned an index that is, regardless of their positions, comparable across players. So, the footballing ability of a central back can, based on the EPI, be compared to that of a left forward. To summarize, the change in Euro Player Index for a particular player in a particular match depends on his contribution in trying to favourably change the outcome of the game. As stated before, since EPI is owned by Remiqz, and with that confidential, the precise explanation of how the construction of the EPI works in terms of models cannot be given here.
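The linear movement of the expected match result can be sketched as follows. This is only an illustration of the description above: the function name, the match length of 90 minutes and the use of the current score as target (zero for a level score, one for a home lead and, by symmetry, minus one for an away lead) are assumptions. The size of the jump after a goal is not specified in the text and is therefore not modelled.

```python
def drift_expected_result(value, target, minute_from, minute_now, minute_end=90.0):
    """Move `value` linearly towards `target`, reaching it at `minute_end`."""
    if not minute_from <= minute_now <= minute_end:
        raise ValueError("need minute_from <= minute_now <= minute_end")
    if minute_from == minute_end:
        return target
    # Share of the remaining playing time that has passed since `minute_from`.
    share_elapsed = (minute_now - minute_from) / (minute_end - minute_from)
    return value + (target - value) * share_elapsed


# Pre-match expectation of 0.3 and a goalless first half: halfway to a draw.
halfway = drift_expected_result(0.3, target=0.0, minute_from=0, minute_now=45)
```

Under these assumptions, the change in a player's individual expected value between the minute he enters and the minute he leaves the pitch is simply the difference between two such values.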

3.2 Players in the ordered probit dataset

The dataset that is used in this thesis for the estimation of the ordered probit model is gathered by Gracenote¹⁷. It was then delivered to Remiqz via Hypercube. The dataset contains information about 3734 matches in the Premier League. The matches were played in the seasons 2008/2009, 2009/2010, … , 2017/2018. In each of these ten seasons, 380 matches are scheduled. In the last season, 2017/2018, the last 66 fixtures were not yet played when this dataset was created. Consequently, these matches are not in it. After deleting matches that were not fully documented, 3729 matches remained. These matches were played by 36 unique teams, consisting of 1867 unique players. Of every match, the two opposing teams, the stadium, and the final score are known. Additionally, for every match, the players that played in that match are known. That is, there is data only about the players who were on the pitch at some point in the match. Nothing is known about bench players that did not make an appearance. Of the players that were in the starting eleven or were substituted on, their age, their position and the number of minutes they played are known. The players that were substituted on have the position label “SUB”. Thus, of those players it is not clear which position they played. While it is possible to just give the substitute the position label of the player they replaced, it can be argued that this implicitly assumes that a substitution is always done because the player in the field performs badly. It can be questioned whether this is always the case. A manager can choose to make a tactical change in his team, for example when, in the last phase of the match, he wants to favourably change the score by replacing a defender by an attacker. Labelling the player that is substituted on as a defender is then wrong. Furthermore, of all the players, EPIs before and after the match are known. Since, as explained in section 3.1, the EPI system is incremental, the EPI of a player is never the same before and after the match.

Table 1 contains the relevant descriptive statistics of the ordered probit dataset. The first thing that stands out is the minimum EPI of -274.29. As can be deduced from the average EPI and its standard deviation, this is extremely low for a player in the Premier League. However, it is not likely to be a representative value for this player’s qualities at that time. For players that are not yet in the system, it takes some time to calibrate their EPIs. The starting value of this particular player (-274.29) is, amongst other things, based on the ECI of the club he came from. Thus, if he transferred from a very weak club to a club in the Premier League, he probably was one of the better players in his former team. Nevertheless, his starting EPI is based on his former club, and therefore very low. However, after one match, his EPI was 1026.53. This value, as adjusted after one match, probably already reflects his footballing abilities better. The next thing that stands out is that defenders have a somewhat lower average EPI than midfielders and attackers, although the standard deviation of their EPI is also lower. Furthermore, most players in the dataset play in central midfield. With the highest average EPI of all the positions in the defence and midfield, the central midfielders seem to have an important role in most teams. The lowest average EPIs are those of the right and the left backs.

3.3 Matches in the ordered probit dataset

In the 3729 matches in the dataset, teams played in seven different formations. The formation that is most used (2936 times) is 4-5-1. This means that the team plays with four defenders, five midfielders and one striker. The next most used one (2753 times) is 4-4-2, with four defenders, four midfielders and two strikers. With 1221 and 262 times, the third and fourth most used formations are, respectively, 4-3-3 and 3-4-3. The three least played formations are 3-5-2 (197 times), 5-3-2 (56 times) and 5-4-1 (33 times). It is hard to realistically compare players per position of two teams that play very different formations. As such a comparison is rather arbitrary, few sensible things can be said about which players are direct opponents of each other. To make realistic comparisons possible between two teams on the pitch, matches in which one of the teams used one of the three least used formations are dropped. This leaves a dataset of 3459 matches with only 4-5-1, 4-4-2, 4-3-3 and 3-4-3 as used formations. These formations are respectively clarified in Figure 1, Figure 2, Figure 3 and Figure 4. Subsequently, the formations 4-3-3 and 3-4-3, which are considerably less frequently used than 4-5-1 and 4-4-2, are written as if they were 4-5-1. This is done to make comparison between teams with different formations less complicated. For the 4-3-3 formation in Figure 3, the Left Midfielder (LM) and the Right Midfielder (RM) are transformed into Central Midfielders (CM). Also, the Left Forward (LF) and the Right Forward (RF) are transformed into LM and RM, respectively. Lastly, the Central Forward (CF) becomes the Striker (ST). This former 4-3-3 formation is now transformed into a 4-5-1 formation. For the 3-4-3 formation in Figure 4, the Central Back (CB) becomes a CM, and the Left Back (LB) and the Right Back (RB) become CB. Furthermore, the LM and RM become LB and RB, respectively. The LF and RF respectively become LM and RM while the CF is again transformed into an ST, resulting in a 4-5-1 formation. After these transformations, the dataset consists of 4241 starting formations 4-5-1 and 2677 starting formations 4-4-2. Naturally, the decisions made in these transformation processes are somewhat arbitrary. It can be questioned whether writing formations as other, comparable formations is the right way to represent the actual match events.
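The relabelling rules described above can be written down as a small lookup table. The sketch below is illustrative only: the names are hypothetical, and the example line-ups are assumptions about Figure 3 and Figure 4, which are not reproduced here.

```python
from collections import Counter

# Relabelling rules as described in the text (original label -> 4-5-1 label).
RELABEL_TO_451 = {
    "4-3-3": {"LM": "CM", "RM": "CM", "LF": "LM", "RF": "RM", "CF": "ST"},
    "3-4-3": {"CB": "CM", "LB": "CB", "RB": "CB", "LM": "LB", "RM": "RB",
              "LF": "LM", "RF": "RM", "CF": "ST"},
}


def rewrite_as_451(formation, labels):
    """Relabel a starting eleven; 4-5-1 and 4-4-2 line-ups pass through unchanged."""
    rules = RELABEL_TO_451.get(formation, {})
    # Every label is looked up once, so chained rules such as LF -> LM and
    # LM -> CM are never applied twice to the same player.
    return [rules.get(label, label) for label in labels]


# Assumed line-ups for the 4-3-3 and 3-4-3 formations.
lineup_433 = ["GK", "LB", "CB", "CB", "RB", "LM", "CM", "RM", "LF", "CF", "RF"]
lineup_343 = ["GK", "LB", "CB", "RB", "LM", "CM", "CM", "RM", "LF", "CF", "RF"]
counts_433 = Counter(rewrite_as_451("4-3-3", lineup_433))
counts_343 = Counter(rewrite_as_451("3-4-3", lineup_343))
```

Both rewritten line-ups end up with one goalkeeper, four defenders, five midfielders (of which three central) and one striker, which is the 4-5-1 shape used in the remainder of the thesis.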

3.4 Relation between EPI and market value dataset

The dataset that is used in this thesis to estimate the relation between EPI and market value of football players is gathered by Hypercube, from whom Remiqz received it. The dataset consists of 5259 market value estimates at different points in time of 1769 unique players from 35 different Premier League clubs. Of these players, the age and EPI at the time of the estimation of the market value are known. Furthermore, the club for which they played at that moment is also known. These estimates cover a period from 2008 until 2017. Table 2 contains the relevant descriptive statistics of the EPI and market value dataset. The estimated market values range from 25 thousand to 80 million. With an average of 6.47 million and a standard deviation of 8.36 million, there is great variation in estimated market values. Figure 5 gives a visual representation of the relation between EPI and the logarithm of the estimated market values. The red line is a best-fit third-degree polynomial. This best fit is, while being a third-degree polynomial, almost a straight line, indicating that the relation between the logarithm of market value and EPI is close to linear. With that, the relation between market value and EPI is likely to be exponential.

4 Model Specification

The specifications of the model for match outcome and the model for the market value of football players are explained in this chapter. Additionally, the manager tool is described. Match outcome is estimated with an ordered probit model. Three versions of the ordered probit model, all with different explanatory variables, are described in this chapter. The outline of this chapter is as follows. Section 4.1 clarifies the ordered probit model for match outcome. The three different versions of explanatory variables are explained in section 4.2. Then, in section 4.3, the marginal effects for the match outcome model and the Wald specification tests are elaborated on. Lastly, the manager tool that this thesis aims to provide and the model for market value are described in section 4.4.

4.1 Ordered probit model for match outcome

In this section, the basic model for match outcome is described. In this model, match outcome for the evaluated team is the dependent variable. In the three-alternative ordered probit model, this observed outcome is either “win”, “draw” or “loss”, and it is driven by a continuous latent variable. The starting point for this latent variable is the following.

$$ y_i^* = \mathbf{x}_i' \boldsymbol{\beta} + u_i, \qquad \text{for } i = 1, \dots, N \text{ football matches} $$

The three-alternative ordered probit model is then created.

$$ y_i = j \quad \text{if} \quad \alpha_{j-1} < y_i^* \le \alpha_j, \qquad \text{with } j = 1, 2 \text{ or } 3 $$

Here, j = 1 stands for a loss, j = 2 for a draw and j = 3 for a win. Then, with $\alpha_0 = -\infty$ and $\alpha_3 = \infty$, the probability that the evaluated team in match i gets match outcome j is determined as follows.

$$ p_{ij} := P[y_i = j] = P[\alpha_{j-1} < y_i^* \le \alpha_j] = F(\alpha_j - \mathbf{x}_i' \boldsymbol{\beta}) - F(\alpha_{j-1} - \mathbf{x}_i' \boldsymbol{\beta}) $$

Where F is the CDF of $u_i$, the standard normal CDF. Furthermore, three binary variables, for each observation in y, are introduced.

$$ y_{ij} = \begin{cases} 1 & \text{if } y_i = j \\ 0 & \text{if } y_i \ne j \end{cases} $$

Finally, the parameters $\alpha_1$, $\alpha_2$ and $\boldsymbol{\beta}$ are estimated by maximizing the following log-likelihood.

$$ \ln(L_N) = \sum_{i=1}^{N} \sum_{j=1}^{3} y_{ij} \ln(p_{ij}) $$
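As an illustration of the probabilities and the log-likelihood defined above, a minimal sketch using only the Python standard library is given below. The function names and the threshold values in the example are hypothetical; the sketch takes the linear index (the inner product of the explanatory variables and the coefficients) as given and is not the estimation routine used in this thesis.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def outcome_probabilities(index, alpha1, alpha2):
    """Return (P[loss], P[draw], P[win]) for one match with linear index x'beta."""
    cuts = [-math.inf, alpha1, alpha2, math.inf]  # alpha_0, ..., alpha_3
    return tuple(normal_cdf(cuts[j] - index) - normal_cdf(cuts[j - 1] - index)
                 for j in (1, 2, 3))


def log_likelihood(outcomes, indices, alpha1, alpha2):
    """Sum of ln(p_ij) over matches, with outcomes coded 1 (loss), 2 (draw), 3 (win)."""
    total = 0.0
    for outcome, index in zip(outcomes, indices):
        probabilities = outcome_probabilities(index, alpha1, alpha2)
        # Only the probability of the observed outcome contributes (y_ij = 1).
        total += math.log(probabilities[outcome - 1])
    return total


# Hypothetical thresholds and an evenly matched fixture (index equal to zero).
p_loss, p_draw, p_win = outcome_probabilities(0.0, alpha1=-0.5, alpha2=0.5)
```

With symmetric thresholds and a linear index of zero, the probabilities of a win and a loss coincide, and a larger index shifts probability mass from losing towards winning.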

For the 3459 matches in the dataset, the evaluated team is alternately chosen to be the home or the away team. So, for half of the matches (1730), the outcome of the match is evaluated from the home team’s perspective, and for the other half (1729), the outcome is evaluated from the away team’s perspective. The outcome of the match is then explained by variables ($\mathbf{x}_i$) that are different for every version of the model. However, some control variables are used in every model. It can be reasoned that, for a team that is battling against relegation, a match against a top team is mentally very different from a match against another relegation candidate. In most cases, it is not unusual for a small club to lose against a top club, whereas the players of the small club are expected to at least draw against another small club. For this reason, the opponent of the evaluated team is controlled for: for every team in the dataset, a dummy variable is included that takes the value one if the opponent of the evaluated team is that particular team, and zero otherwise. Moreover, as mentioned in section 2.1.2, Pollard (2006) finds that home advantage in the Premier League is present and has to be accounted for in predicting match outcome. It can be argued that home advantage is not the same for Manchester United as for AFC Bournemouth. The first club plays its matches at Old Trafford, which has more than 75,000 seats, while the latter plays at Dean Court and can have a maximum support of around 11,500 fans in the stadium. Consequently, the three versions of the model all control for a playing-at-home dummy per team. This is a dummy variable that is equal to one when that team is the home playing team in that match.

4.2 Three versions of the model

In this section, the three versions of the ordered probit model for match outcome are described. They differ in their view on match events and, with that, in the explanatory variables. The first model that is regarded is the “Difference in EPI per position” model. Here, the explanatory variables are constructed as follows. This model takes the point of view that the difference in EPI per position is key in explaining match outcome. Thus, it matters which team has the best left back or the best striker. To this end, every player in the evaluated team is compared to the player of the opponent that plays in the same position. For example, the left back of the one team is compared to the left back of the other team. Note however that in the formations in Figure 1 and Figure 2, some players have the same label. The data cannot distinguish between the two central backs, the two or three central midfielders and the two strikers. Thus, for example, it is not clear which one of the central backs directly takes on the striker and which one supports the entire defensive line. Consequently, it is not correct to simply compare a random central back of the evaluated team to a random central back of the opponent. Hence, the EPIs of players with the same label in the dataset are pooled together into one average. For both the 4-5-1 and the 4-4-2 formation, this results in eight pooled EPI values per team, namely goalkeeper, left back, central backs, right back, left midfielder, central midfielders, right midfielder and striker(s). The first explanatory variable for the first model is then the EPI of the goalkeeper of the evaluated team minus the EPI of the goalkeeper of the opponent. The second is the EPI of the left back of the evaluated team minus the EPI of the left back of the opponent. This is repeated for every position to obtain eight explanatory variables. Additionally, most teams bring in at least one substitute in every match. Because not all substitutes play the same number of minutes, a weighted average of their EPIs is taken. As mentioned before, of players with the label “SUB”, it is not clear which position they played in the field. So, the explanatory variable that is added to the model is simply the weighted substitute EPI of the evaluated team minus the weighted substitute EPI of the opponent. For the 43 matches in the dataset where at least one of the two teams did not bring in a substitute, this “SUB-SUB” variable is set equal to zero.
Furthermore, of both teams in every match, the average age is determined. The last explanatory variable that is added to the model is the average age of the evaluated team minus the average age of the opponent. Also, the control variables, as explained in section 4.1, are added to the model.
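The construction of the explanatory variables of the “Difference in EPI per position” model can be sketched as follows. The data layout (a list of label, EPI and minutes per player) and all names are hypothetical, and weighting the substitutes by minutes played is an assumption, since the text does not state which weights are used. The age difference and the control variables are left out for brevity.

```python
from statistics import mean

POSITIONS = ["GK", "LB", "CB", "RB", "LM", "CM", "RM", "ST"]


def pooled_epi(lineup):
    """Average EPI per position label; `lineup` holds (label, epi, minutes)."""
    return {pos: mean(epi for label, epi, _ in lineup if label == pos)
            for pos in POSITIONS}


def weighted_sub_epi(lineup):
    """Minutes-weighted EPI of the substitutes, or None without substitutes."""
    subs = [(epi, minutes) for label, epi, minutes in lineup if label == "SUB"]
    total_minutes = sum(minutes for _, minutes in subs)
    if total_minutes == 0:
        return None
    return sum(epi * minutes for epi, minutes in subs) / total_minutes


def per_position_differences(team, opponent):
    """Eight position differences plus SUB-SUB (zero if a team made no change)."""
    own, opp = pooled_epi(team), pooled_epi(opponent)
    features = {f"{pos}-{pos}": own[pos] - opp[pos] for pos in POSITIONS}
    own_sub, opp_sub = weighted_sub_epi(team), weighted_sub_epi(opponent)
    if own_sub is None or opp_sub is None:
        features["SUB-SUB"] = 0.0
    else:
        features["SUB-SUB"] = own_sub - opp_sub
    return features


# Hypothetical, shortened line-ups: one player per label plus a second CB.
team = [(pos, 1500.0, 90) for pos in POSITIONS] + [
    ("CB", 1700.0, 90), ("SUB", 1000.0, 30), ("SUB", 2000.0, 10)]
opponent = [(pos, 1400.0, 90) for pos in POSITIONS]
features = per_position_differences(team, opponent)
```

In the example, the opponent makes no substitution, so the “SUB-SUB” variable is set to zero, as is done for the 43 matches mentioned above.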

The second model is the “Difference in EPI direct opponents” model. Here, the angle is taken that the difference in EPI between direct opponents on the field is crucial. So, if every direct battle on the field is won by one team, that team is likely to win the match. Thus, players of both teams that play in the same area of the field, and will therefore have a lot of direct confrontations, are compared. The three combinations of formations that can play against each other are 4-4-2 against 4-4-2, 4-4-2 against 4-5-1, and 4-5-1 against 4-5-1. For all these combinations, players on the pitch are compared as follows. Since keepers do not have direct battles with a particular player of the opponent, a good comparison with a field player will likely not exist. Therefore, the first explanatory variable is the EPI of the keeper of the evaluated team minus the EPI of the opposing keeper. For the next variable, a direct comparison with an opposing field player is possible. This variable is the EPI of the left back of the evaluated team minus the EPI of the right midfielder of the opposing team. For the other side of the defence, the EPI of the right back of the evaluated team minus the EPI of the left midfielder of the opponent is taken as explanatory variable. The other explanatory variables are the EPI of the central backs of the evaluated team minus the EPI of the striker(s) of the opposing team, the EPI of the left midfielder of the evaluated team minus the EPI of the right back of the opposing team, the EPI of the central midfielders of the evaluated team minus the EPI of the central midfielders of the opposing team, the EPI of the right midfielder of the evaluated team minus the EPI of the left back of the opposing team and, finally, the EPI of the striker(s) of the evaluated team minus the EPI of the central backs of the opposing team. Additionally, the EPI difference of the substitutes and the difference in average age are added to the model in the same way as in the “Difference in EPI per position” model. Also, the control variables are added as explained in section 4.1.
The explanatory variables in this version of the model are only logical because, as stated in section 3.1, EPIs of players are, regardless of their positions, comparable to each other. Because the explanatory variables in these first two models are expressed as differences between the two teams, the models are restrictive in their parameters. However, the idea of these specifications is to see whether the difference in EPI between players with the same position in the field, or between opposing players, significantly influences match outcome. Because the ordered probit model is not linear, it can be questioned whether it is possible to simply use the EPIs of all players on the field as sole explanatory variables, take the difference in the estimated coefficients and then test for simultaneous significance. The model in which only the EPIs of the evaluated team are used as explanatory variables is considered next.
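The pairing of direct opponents in the “Difference in EPI direct opponents” model can be summarised in a lookup table. The sketch below is illustrative: the names are hypothetical, and it assumes that the pooled EPI per position label of both teams is already available as a dictionary.

```python
# Position of the evaluated team -> position of the direct opponent.
DIRECT_OPPONENT = {
    "GK": "GK", "LB": "RM", "RB": "LM", "CB": "ST",
    "LM": "RB", "CM": "CM", "RM": "LB", "ST": "CB",
}


def direct_opponent_differences(own_pooled, opp_pooled):
    """EPI of each own position minus the EPI of its direct opponent."""
    return {f"{pos}-{rival}": own_pooled[pos] - opp_pooled[rival]
            for pos, rival in DIRECT_OPPONENT.items()}


# Hypothetical pooled EPIs: a strong own striker against strong central backs.
own = {pos: 1500.0 for pos in DIRECT_OPPONENT}
own["ST"] = 1800.0
opp = {pos: 1400.0 for pos in DIRECT_OPPONENT}
opp["CB"] = 1600.0
differences = direct_opponent_differences(own, opp)
```

The table is symmetric: if position A faces position B, then position B faces position A, so the same pairing applies regardless of which team is the evaluated team.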

The third and last model is the “EPI evaluated team” model. The perspective that only the players of the evaluated team make the difference between winning and losing is taken in this model. The explanatory variables in this model are simply the EPIs of every position and the weighted EPI of the substitutions. The EPIs of the players of the opponent are not considered in this version of the model. Only the dummy control variable for the opponent is used here. Moreover, average age of the evaluated team and the control variables, as explained in section 4.1, are also added.

4.3 Specification tests

In this section, the tests and further analyses that are executed on the models from section 4.2, are explained. First, the three versions of the ordered probit model on match outcome are compared to each other based on their pseudo R-squared value. The R-squared that can be used to express model performance for logistic regressions is McFadden’s pseudo R-squared (McFadden, 1974). This pseudo R-squared depends on the log-likelihoods of the model without any covariates, and the model as estimated. It is defined as follows.

$$ R^2_{McF} = 1 - \frac{\ln(L_M)}{\ln(L_0)} $$

Here, $L_M$ is the likelihood of the model that is estimated and $L_0$ that of the model with no predictors. Clearly, this R-squared is not the same as that of an OLS regression. Where an R-squared with OLS says something about the proportion of variance that is explained by the covariates, McFadden’s R-squared is not to be interpreted in the same way. Consequently, great care is advised with the interpretation of this R-squared. The models in this thesis are compared to each other based on McFadden’s R-squared, but no conclusions are drawn based on it as to the explanatory power of these models. On the model that performs best, the following analyses are executed. Wald tests for simultaneous insignificance of some variables are executed. If the tested variables appear to be simultaneously insignificant, they are removed from the model. Then, the model is estimated again with only the variables that have not been removed after the Wald tests. Of the variables in this final model, the marginal effects are determined and analysed.
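McFadden’s pseudo R-squared is straightforward to compute once both log-likelihoods are known. The sketch below is an illustration with hypothetical names and numbers. It uses the fact that an ordered probit model with only the two thresholds and three outcomes reproduces the observed outcome shares, so the log-likelihood of the model without covariates follows directly from the outcome counts.

```python
import math


def null_log_likelihood(outcome_counts):
    """Log-likelihood of a thresholds-only model, which fits the sample shares."""
    total = sum(outcome_counts)
    return sum(n * math.log(n / total) for n in outcome_counts if n > 0)


def mcfadden_r2(loglik_model, loglik_null):
    """One minus the ratio of the estimated and the null log-likelihood."""
    if loglik_null >= 0:
        raise ValueError("the null log-likelihood must be negative")
    return 1.0 - loglik_model / loglik_null


# Hypothetical counts of losses, draws and wins, and a hypothetical fitted model.
ll_null = null_log_likelihood([300, 250, 450])
r2 = mcfadden_r2(-950.0, ll_null)
```

A model that does not improve on the thresholds-only model yields a value of zero, and the value increases as the fitted log-likelihood moves closer to zero.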

4.4 Manager tool

The ultimate goal of this thesis is to create a manager tool. This tool can be used by sporting directors and football managers to determine how to divide their budget more efficiently across their squad. In this sense, efficient means that with a certain budget, the probability for a club to attain their sporting ambitions, is maximised. For this manager tool, the final model as described in section 4.3, is used. The construction of the tool is addressed in this section.

After the executed tests and analyses described in section 4.3, a final model is obtained. In this final model, the thresholds and coefficients of the explanatory variables are estimated. Given the squads of the evaluated team and the opponent, the probability for each match outcome can then be determined as follows.

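$$ q_{ij}(\mathbf{x}_i) := P[y_i = j] = P[\hat{\alpha}_{j-1} < y_i^* \le \hat{\alpha}_j] = F(\hat{\alpha}_j - \mathbf{x}_i' \hat{\boldsymbol{\beta}}) - F(\hat{\alpha}_{j-1} - \mathbf{x}_i' \hat{\boldsymbol{\beta}}) $$

Here, the thresholds ($\hat{\alpha}_j$, for j = 1 or 2) and coefficients ($\hat{\boldsymbol{\beta}}$) are known from the estimation of the final model. Using these probabilities, the basic manager tool looks as follows.

$$ \mathcal{L}(q_{ij}(\mathbf{x}_i), \lambda) = \sum_{i=1}^{N} [1 \cdot q_{i2}(\mathbf{x}_i) + 3 \cdot q_{i3}(\mathbf{x}_i)] - \lambda \left( \mathrm{Avg}(EPI_{needed}) - \mathrm{Avg}(EPI_{now}) \right) $$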

In this manager tool, the aggregated weighted probabilities of collecting points during the coming season are optimised over the EPIs of the players in the evaluated squad, and lambda.
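The objective of the manager tool, the expected number of points over the coming matches, can be sketched as follows. The function names and the threshold values are hypothetical, and the sketch takes the linear index of every match as given. It shows only the sum of weighted probabilities, not the constrained optimisation itself.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_points(indices, alpha1_hat, alpha2_hat):
    """Sum over matches of 1 * P[draw] + 3 * P[win], given each linear index."""
    total = 0.0
    for index in indices:
        p_draw = normal_cdf(alpha2_hat - index) - normal_cdf(alpha1_hat - index)
        p_win = 1.0 - normal_cdf(alpha2_hat - index)
        total += 1.0 * p_draw + 3.0 * p_win
    return total


# Hypothetical thresholds; a higher linear index yields more expected points.
points_even = expected_points([0.0], alpha1_hat=-0.5, alpha2_hat=0.5)
points_better = expected_points([1.0], alpha1_hat=-0.5, alpha2_hat=0.5)
```

Because the expected number of points increases in the linear index of every match, raising the EPIs at positions with a positive coefficient always helps, which is why a budget restriction is needed to obtain a meaningful optimum.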

In the tool, $\mathbf{x}_i$ is a vector that is a function of the EPIs of all the positions in the evaluated squad and the average age of the squad. These EPIs and the average age are variables that are not yet known because they will result from the optimisation of the Lagrange function. Furthermore, $\mathbf{x}_i$ also depends on the EPIs of the players of the opponent in match i, and their average age. The precise format of $\mathbf{x}_i$ is not yet determined in this chapter. It depends on which model specification performs best, and which positions are significant in determining match outcome. The results in chapter 5 will lead to a definitive form of $\mathbf{x}_i$. The variable $\mathrm{Avg}(EPI_{needed})$ is not yet known at the start of the optimisation, as it depends on variables over which the Lagrangian is optimised. However, the variable $\mathrm{Avg}(EPI_{now})$ is known at the start of the optimisation, as it is simply the average of the EPIs of the starting eleven in the current squad of the evaluated team. The reason that the probability of winning is multiplied by three is the assumption that, since winning a match yields three points and drawing yields one, any team values a win three times as much as a draw. The constraint of this basic model, regarding the average EPIs of the current and the needed squad, is added so that the club’s budget is in a way accounted for. Given that a football club does not get a sudden cash flow impulse, it can only distribute its financial resources better across the squad. However, the model above assumes that using the average EPI of the current squad as upper bound for the average EPI of the needed squad represents a relevant budget restriction. It may be more reasonable to estimate the market value of the players in the current squad from their EPIs and use that as upper bound for the market value of the players in the needed squad. Then, that upper bound can be used as budget constraint. With that, the evaluated club can distribute its actual budget better across the squad, rather than merely redistributing EPI points.

Figure 5 indicates that the relation between player market value and EPI is likely to be exponential. Therefore, an OLS model with the logarithm of player market value as dependent variable is estimated. In Figure 5, a third-degree polynomial is fitted through the data. As this visually seems like a good specification, the starting explanatory variables of the logarithm of player market value are $EPI$, $EPI^2$ and $EPI^3$. After a model for market value as explained by EPI is obtained, the restriction in the Lagrange function is adapted according to the estimated relation. The improved version of the manager tool then looks as follows.

$$ \mathcal{L}(q_{ij}(\mathbf{x}_i), \lambda) = \sum_{i=1}^{N} [1 \cdot q_{i2}(\mathbf{x}_i) + 3 \cdot q_{i3}(\mathbf{x}_i)] - \lambda \left( MVS_{needed} - MVS_{now} \right) $$

Here, the market value that is needed for the new squad ($MVS_{needed}$) depends on the EPIs over which the entire Lagrangian is optimised. It is the aggregated value of the estimated market value of every player that is needed for the squad of the coming year. The aggregated market value of the players in the current squad ($MVS_{now}$) depends on their EPIs, which are known at the time of the optimisation.
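The market value restriction can be illustrated with a short sketch. The coefficients below are purely hypothetical placeholders and not estimates from this thesis, and the function names are assumptions as well. The market value of a player is obtained by exponentiating a third-degree polynomial in EPI, and the squad value is the sum over players.

```python
import math


def estimated_market_value(epi, coefs):
    """exp(b0 + b1*EPI + b2*EPI^2 + b3*EPI^3) for one player."""
    b0, b1, b2, b3 = coefs
    return math.exp(b0 + b1 * epi + b2 * epi ** 2 + b3 * epi ** 3)


def squad_market_value(epis, coefs):
    """Aggregated estimated market value of all players in a squad."""
    return sum(estimated_market_value(epi, coefs) for epi in epis)


def within_budget(needed_epis, current_epis, coefs):
    """True if the needed squad is not worth more than the current squad."""
    return squad_market_value(needed_epis, coefs) <= squad_market_value(current_epis, coefs)


# Hypothetical coefficients (NOT estimates): log market value linear in EPI.
coefs = (10.0, 0.002, 0.0, 0.0)
balanced, uneven = [1500.0, 1500.0], [1400.0, 1600.0]
balanced_is_affordable = within_budget(balanced, uneven, coefs)
uneven_is_affordable = within_budget(uneven, balanced, coefs)
```

Both example squads have the same average EPI, yet under an exponential relation the uneven squad is worth more than the balanced one. This shows why a restriction on market value is not equivalent to a restriction on average EPI.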

5 Results and Analysis

In this chapter, the results for the models explained in chapter 4 are presented and analysed. The results for the three different ordered probit models are given, and the best model in terms of pseudo R-squared is selected. Then, different Wald tests for simultaneous insignificance are performed on this selected model. With the variables that remain significant after the Wald tests, a final model is estimated and the marginal effects are computed. Additionally, this chapter presents the results of the OLS estimation of market value using EPI as explanatory variable. Lastly, the final version of the manager tool is presented. The outline of this chapter is as follows. In section 5.1, the results for the “Difference in EPI per position” model are presented and analysed. The results for the “Difference in EPI direct opponents” model are presented and analysed in section 5.2. Then, in section 5.3, the results for the last model, the “EPI evaluated team” model, are considered. Section 5.4 elaborates on the model selection and the Wald specification tests. In section 5.5, the final specification of the ordered probit model is presented, and the marginal effects of the variables in this model are given and analysed. Finally, section 5.6 presents the results of the OLS estimation for player market value and the final manager tool, and gives a summary of chapter 5.

5.1 “Difference in EPI per position” results

In this section, the results of the first version of the model, the “Difference in EPI per position” model, are described and analysed. Table 3 presents the results of this model. It is an ordered probit model with the match outcome of the evaluated team as dependent variable, where match outcome is either “win”, “draw” or “loss”. The explanatory variables are the differences in player quality between players of the two opposing teams who occupy the same position in the formation. For example, the variable GK-GK is the EPI of the goalkeeper of the evaluated team minus the EPI of the goalkeeper of the opponent. The same holds for CB-CB, LB-LB, etc. The estimated coefficients and standard errors in the first and second column, respectively, result from an estimation that includes neither controls for the opponent nor a team-specific home advantage dummy. The estimates in the third and fourth column result from a model that does control for these factors.
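
To illustrate how such difference variables could be constructed, the sketch below computes them from two line-ups. The EPI values are made up, only positions mentioned in this section are used, and averaging the EPIs of players who share a position is an assumption of this sketch rather than necessarily the aggregation used in this thesis.

```python
from statistics import mean


def position_differences(team, opponent):
    """Return variables such as 'GK-GK' for positions present in both line-ups.

    Each line-up maps a position label to the list of EPIs of the players
    on that position. Averaging within a position is an assumption.
    """
    differences = {}
    for position, epis in team.items():
        if position in opponent:
            label = f"{position}-{position}"
            differences[label] = mean(epis) - mean(opponent[position])
    return differences


# Made-up EPI values on an arbitrary scale.
evaluated_team = {"GK": [2.0], "CB": [3.0, 2.0], "CM": [3.0, 4.0]}
opposing_team = {"GK": [1.5], "CB": [2.0, 2.0], "CM": [4.0, 4.0]}

for name, value in position_differences(evaluated_team, opposing_team).items():
    print(name, value)
```

A positive value means that the evaluated team has the stronger player or players on that position.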

Since the ordered probit model is not linear, the magnitudes of the estimated coefficients are not straightforwardly interpretable. Therefore, only the significance of the estimates is considered here. In both the model that controls for home advantage and the opponent and the model that does not, the only significant estimated coefficients are those of CM-CM, LM-LM and SUB-SUB. These results imply that in most Premier League matches, the outcome is determined by the quality of the midfielders and the substitutes. The team that wins the “midfield battle” is likely to come out on top. In the dataset, every formation is coded as either 4-4-2 or 4-5-1. In both formations, the midfield is heavily occupied compared to the third most used formation, 4-3-3. With two central midfielders in a 4-4-2 and three in a 4-5-1, it is plausible that the central midfield is important in determining match outcome.
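
The following sketch illustrates why the coefficients cannot be read directly as effects on the outcome probabilities. In an ordered probit model, the effect of a regressor on each probability equals its coefficient scaled by a normal density that depends on the value of the linear index. The coefficient, the cut-offs and the index values below are hypothetical.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def marginal_effects(index, beta_k, cut_low, cut_high):
    """Effects of regressor k on P(loss), P(draw) and P(win) at a given index."""
    d_loss = -normal_pdf(cut_low - index) * beta_k
    d_win = normal_pdf(cut_high - index) * beta_k
    d_draw = -(d_loss + d_win)  # the three probabilities always sum to one
    return d_loss, d_draw, d_win


# The same hypothetical coefficient gives different effects at different indices.
BETA_K, CUT_LOW, CUT_HIGH = 0.5, -0.4, 0.4
for index in (0.0, 1.5):
    effects = marginal_effects(index, BETA_K, CUT_LOW, CUT_HIGH)
    print(index, [round(effect, 3) for effect in effects])
```

The sign of a coefficient does determine the direction of the effect on the probabilities of a loss and of a win, but the effect on the probability of a draw can have either sign.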

Furthermore, the fact that left midfielders have a significant estimated coefficient can be a bit misleading. From chapter 3, it is clear that the LFs in the 4-3-3 and 3-4-3 formations become LMs when these formations are transformed into 4-5-1. Additionally, in the 4-5-1 and the 4-4-2 formations, it can be argued that left midfielders play a role in the team’s attacks. As they do not have a left forward in front of them, they are likely to give more than
