Football player quality related to physical indicators : a panel data approach

Academic year: 2021

Master’s Thesis

Football player quality

related to physical indicators

A panel data approach

Jorrit S. Visser

Student number:

11887508

Date of final version:

July 14, 2018

Master’s programme:

Econometrics

Specialisation track:

Econometrics

Supervisor:

dr. J. C. M. van Ophem

Second reader:

dr. K. Lasak

Faculty of Economics and Business

Amsterdam School of Economics

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently the thesis is divided up into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis, for a theoretical thesis have a look at a relevant paper from the literature):

(a) Front page (requirements see below)

(b) Statement of originality (compulsory, separate page)

(c) Introduction

(d) Theoretical background

(e) Model

(f) Data

(g) Empirical Analysis

(h) Conclusions

(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the heading of the sections. You have a free choice how to list your references but be consistent. References in the text should contain the names of the authors and the year of publication. E.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and year of publication in case of the first reference and use the first name and et al and year of publication for the other references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number

(d) Date of submission final version

(e) MSc in Econometrics

(f) Your track of the MSc in Econometrics

Abstract

This thesis investigates the relationship between the quality of a football player and his physical performance in matches. Using a fixed effects estimation technique on unbalanced panel data of players from the Dutch leagues over three consecutive seasons, it is established which indicators are related to the quality of a player. Furthermore, it is investigated whether the determinants of player quality differ across positions. It is found that certain aspects of physical performance are good indicators of player quality. The sprinting performance of players in particular indicates player quality, although its effect is found to vary across positions. Significant positive effects of an additional sprint are found for centre backs, full backs and midfielders. For strikers, a higher maximum speed generally indicates higher player quality.

Keywords: Football, positions, physical performance, player quality, unbalanced panel data, fixed effects, two-way clustered standard errors, composite effects.

Contents

1 Introduction
2 Literature Review
3 Methodology
  3.1 Panel data
  3.2 Fixed Effects approach
  3.3 Two-way clustered standard errors
  3.4 Position specific estimation
4 Data
  4.1 The Euro Player Index
    4.1.1 Calculation
    4.1.2 EPI description of the sample
  4.2 Data description
  4.3 Positions
    4.3.1 Additional midfielder statistics
5 Empirical Analysis
  5.1 General results
  5.2 Position specific results
    5.2.1 Defenders
    5.2.2 Midfielders
    5.2.3 Wingers and strikers
  5.3 Classification of Midfielders
  5.4 Statistical differences between positions
    5.4.1 Chow tests
    5.4.2 Composite effects
  5.5 Robustness
6 Conclusions
References
Appendix A Additional descriptive statistics
Appendix B K-means clustering
Appendix C Chow test
Appendix D Delta Method

Statement of originality

This document is written by Jorrit Visser who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

Chapter 1

Introduction

Since the beginning of the current century, professional football has become an industry in which large amounts of money are involved. Per season around €5.8 billion is paid for the broadcast rights of the FA Premier League, which is about eight times the amount that was paid at the beginning of the century (BBC, 2015). Rohde and Breuer (2016) report that for the top-30 clubs in Europe the average revenue per year was €164.5 million over the years 2004 to 2013. The budgets that clubs can spend during a year have grown rapidly. For example, Paris Saint-Germain (France) and Manchester City (England) saw their revenues grow on average by 18.0% and 16.9% per season, respectively, from 2004 to 2013 (Rohde and Breuer, 2016). As the decisions made by the management of a football club involve more and more money, the demand for quantification of on-pitch performance grows within the football industry, so that the risk of financial pitfalls can be minimized. The exposure of a football club is determined in particular by the performances of the first team. Given a certain available budget, club managements want to maximize their sporting success and outperform the competing teams. It is therefore of great interest to investigate the indicators of a high-quality player, as these allow the composition of the first team to be optimized and sporting success to be maximized. Finding physical indicators of player quality also makes it possible to optimize training schedules so that players improve their physical abilities and consequently their quality. This thesis attempts to find relations between player quality and on-pitch physical performance indicators (e.g. number of sprints per match). Furthermore, it establishes which indicators are important for the different player positions within a football team and evaluates whether these differences are significant.
When it comes to professional sports in general, the use of data has become widespread. For example, in sports such as rugby and baseball the use of data in measuring and optimizing on-pitch performance has been common for quite some time. In basketball it appeared that the expected score of a three-point shot was higher than that of a two-point shot taken closer to the basket. Even though a three-point shot carries a higher risk of missing, it was found that shooting from longer distance yielded more points. This finding resulted in basketball teams increasing their number of three-point shots by 50% (Kopf, 2017). The analysis of sports is also widespread in the academic literature. James et al. (2005) tried to find position specific characteristics of rugby players whereas O’Connor et al. (2016) performed a similar analysis for youth rugby players in Australia. When it comes to football, recent studies tried to identify player characteristics of high-level youth football players. Deprez et al. (2015) performed a
position specific analysis on data of youth players of two Belgian professional football clubs. Considering professional elite football, Bloomfield et al. (2007) set an example in finding the physical demands per position within a football team. They used specific movement analysis on 55 players in the FA Premier League and found significant differences between positions. However, Bloomfield et al. (2007) did not relate the physical performance to the quality of a player.

This thesis focuses on Dutch professional football. More specifically, match data on the physical performance of players of eleven different football clubs over the seasons 2015-2016, 2016-2017 and 2017-2018 are used. These eleven clubs are active in either the Dutch first division (Eredivisie) or the second division (Jupiler League). Considering the tremendous amounts of money involved in football, as described by Rohde and Breuer (2016) for instance, it is important to be able to explain and improve the product of a football club: the on-pitch performance. Furthermore, it is relevant for coaches to know which physical indicators explain high player quality for the different positions within the football team. Where previously mentioned research only characterized certain abilities of football players, this thesis takes the analysis of football players a step further. Player quality is directly related to the on-pitch physical performance such that differences between high- and low-quality players can be identified. Furthermore, the use of detailed individual physical match data enables the investigation of position specific characteristics. For instance, the physical abilities of a high-quality winger may differ from the abilities that characterize a high-quality centre back. Practically, this is of interest for managements and coaches of football clubs as it enables the development of player specific training schedules such that a player’s physical abilities can be improved. Increasing the quality of players has a positive impact on sporting success, and additional value is created as other clubs are willing to pay higher transfer fees for better players. Theoretically, the main contribution of this research is the use of a measure of player quality that has not been available in this form in the past.

The quality of a football player is hard to define and measure. For instance, using goals scored as a measure has the drawback that strikers cannot be compared fairly with midfielders, defenders or goalkeepers. This thesis uses the Euro Player Index (EPI), developed by the firms Hypercube and Remiqz, as a measure of player quality. The EPI enables comparisons of players across teams, positions and leagues. A higher EPI indicates higher player quality. An explanation of the EPI is deferred to Section 4.1. By making use of estimation methods for panel data, it is investigated which physical variables are significant indicators of player quality. Furthermore, interest lies in the differences between these indicators for the specific positions within a football team. It was found that certain aspects of the physical performance of a player are significant indicators of player quality. For example, the sprinting characteristics of a player were indicators of player quality. Additionally, it was found that the efficiency with which a player moves on the pitch is a significant indicator. In this context, efficiency is determined by the total activity a player displayed during a match and the distance the player covered in that match. Covering the same distance with a lower total activity is interpreted as more efficient. Comparing the different positions within the team, it became clear that sprinting behaviour is especially important for full backs and centre backs. As there are various types of midfielders, each requiring different physical abilities, midfielders are classified into groups with the same characteristics. It was found that for one of these groups the physical performance is indeed an indicator of player quality. In particular, distance covered at sprint pace was a positive indicator. For strikers the maximum speed turned out to be a positive significant indicator. For wingers, the distance covered at various speeds
appeared to be a negative indicator of quality. The results appear to be robust to changes in the model and sample specification. For example, excluding players that participated in fewer than five matches did not change the results drastically. On the other hand, it appeared that the panel data structure should not be ignored.

The thesis is organized as follows. In Chapter 2 an elaborate discussion of the existing literature is provided. Chapter 3 describes the estimation methods used to analyze the data. Next, Chapter 4 gives an overview of the data used in the analysis. The empirical analysis is expounded in Chapter 5. Conclusions and discussions are contained in Chapter 6. When relevant, references are made to the appendices of this thesis.

Chapter 2

Literature Review

As already pointed out in Chapter 1, a lot of research has been conducted on the physical capabilities of professional athletes in general and of football players more specifically. This chapter provides an extensive overview of the published research to indicate which findings are relevant for this research and to what extent this thesis contributes to the existing literature. This thesis focuses on football, a widely practiced sport and one of the most well-known sports in the world (FIFA, 2010). The name football is common in most parts of the world, especially in Europe. In other parts of the world, like the United States and Australia, the game is known as soccer. Throughout this thesis, the sport is referred to as football rather than soccer. Once every four years the FIFA World Cup is organized, which is the sporting event that attracts the most spectators in the world (FIFA, 2014), implying that the public interest in the sport is enormous.

Applying scientific analyses to sports in general has been of great interest to researchers for decades. As already indicated in Chapter 1, basketball was one of the early adopters of a scientific approach. However, the approach has also been applied to other sports. Kamst et al. (2010) investigated whether the 500m in speed skating should take place in two heats in which each skater starts once in the inner lane and once in the outer lane. Kamst et al. (2010) reach the conclusion that this approach yields unfair results and that the procedure should be changed. For football, there are also many examples in the academic literature. Koning et al. (2003) constructed a simulation model that can be applied to football tournaments in order to predict their outcomes. A different econometric approach was taken by Müller et al. (2017), who built a model to explain the market value of a football player. Next to traditional performance indicators on the pitch, Müller et al. (2017) took a player’s social value into account. For example, they included the numbers of YouTube videos and views in their modelling. Being able to understand, to a certain extent, the outcome of sporting events has drawn attention from researchers, professional athletes, sports policy makers and also from the crowd, the latter becoming more and more keen to understand their favorite sports. Real cases have shown that the application of statistical analyses to sports pays off. For instance, one may recall the 2002 Oakland Athletics season in which the technical director of the baseball team, Billy Beane, achieved great sporting success. With a limited budget, Beane composed a baseball team based on statistics, resulting in twenty consecutive won games. The movie Moneyball is based on the story and sporting success of the Oakland Athletics. Following the example of Beane, FC Midtjylland managed to become Danish champions twice
by a dramatic change in their recruiting process of players. Technical director Rasmus Ankersen managed to implement a scouting algorithm that found good and cheap players for the club based on statistical analyses. This contributed to their first ever championship title in the 2014-2015 season and in the season 2017-2018 they managed to become champions once again.

Scientific research in the area of football has been conducted in various ways and from various points of view. As mentioned earlier, Müller et al. (2017) built a model to explain a player’s market value and Koning et al. (2003) published a model by which the outcome of a football tournament could be simulated. Other studies investigated the presence of home advantage in football. Clarke and Norman (1995) conducted research on home advantage in the English FA Premier League. Later, Koning (2000) incorporated home advantage in a study on competitive balance in the Dutch Eredivisie. In short, home advantage is found to play an important role in the game of football. It has a significant effect on the outcome of games and should therefore be taken into account when performing statistical analyses on football.

The role of physical abilities in football has been studied intensively as well. In Chapter 1, references were made to Deprez et al. (2015) and Bloomfield et al. (2007), which are only two of the many published studies on physical characteristics of professional football players. Deprez et al. (2015) investigated 744 high-level youth soccer players between the ages of 8 and 18. It was found that goalkeepers and defenders were significantly taller than midfielders and attackers in all age categories. For the higher age classes, around the age of 18, it was found that attackers were more explosive, faster and more agile than the players in other positions. A similar study was performed by Cullen et al. (2013), who investigated the differences in physical characteristics and fitness levels of players younger than 18. However, contrary to Deprez et al. (2015), Cullen et al. (2013) found only minimal differences across positions. Concerning professional elite football, Bloomfield et al. (2007) investigated the physical demands across positions for players in the FA Premier League. Bloomfield et al. (2007) investigated a total of 55 players, who were segmented into groups of defenders, midfielders and strikers. It was found that the position has a lot of influence on the movements of a player and Bloomfield et al. (2007) concluded that this indicated different physical demands for the positions in a football team. Other studies focused particularly on the analysis of sprinting behaviour of players. For example, Di Salvo et al. (2010) analyzed the sprinting activities of players in the UEFA Champions League and UEFA Cup and found that wide midfielders sprint significantly more than players in other positions. Centre backs were found to sprint the least. Di Salvo et al. (2010) concluded that the sprinting characteristics of a player are heavily influenced by their position on the pitch. Similar conclusions were drawn by Bradley et al. (2009) and Ade et al. (2016). Boone et al. (2012) investigated the different physical demands across positions in the Belgian Jupiler Pro League. In total, 289 players of 6 different teams were divided into groups based on their position on the pitch. Every player participated in physical tests and it turned out that players in different positions on the pitch have different physical characteristics. An additional result of comparing players across positions was found by Carling et al. (2012), who studied the high-intensity movements of players in 80 matches of the French Ligue 1. It was found that central midfielders performed more actions at the high-intensity level and that their average speed was significantly higher than for the other positions. Gonçalves et al. (2014) considered the activity of defenders, midfielders and attackers separately and found that attackers showed significantly less physical activity than the players in other positions. Another argument that confirms the differences in physical performance across
positions was proposed by Schuth et al. (2016) who studied positional interchanges during matches and found that such changes have a significant influence on the physical activity of players. Bush et al. (2015) took a different point of view and investigated whether the physical demands of players have changed over time. The English Premier League seasons of 2006-2007 up to the season of 2012-2013 were analyzed and it was found that the physical demands changed significantly over this period of time. As the results of this thesis are based on data of only three consecutive seasons of the Dutch leagues, one might assume that such a physical evolution did not occur in this time span.

It might be argued that the physical performance of a player depends heavily on the intensity of fixtures in the schedule of a football season. A schedule that contains a lot of congested fixtures might lead to players suffering from fatigue. However, the literature does not confirm this hypothesis. Various researchers investigated different football leagues and tournaments and found that the physical performance did not change when fixtures were congested. Meister et al. (2013) investigated this for the German leagues and concluded that a period of three weeks in which the players were exposed to a lot of matches did not significantly change the physical performance variables. Later, Folgado et al. (2015) studied matches in a congested period in the FA Premier League and found no evidence that high match exposure changed the physical performance of the players. Varley et al. (2018) performed a similar study on a tournament for players younger than 23. In such tournaments teams play a lot of matches in a short time span. Again, the general conclusion was that playing successive matches in a short time span did not significantly change the physical performance of football players. Vigne et al. (2013) investigated the physical performance of a team in the Italian Serie A over three consecutive seasons, distinguishing between defenders, midfielders and attackers. The main interest of the research was whether the physical performance of the players changed as the team performed better in the league. Vigne et al. (2013) found that this was the case, which could be interpreted as an indication that physical performance varies with player quality. The latter is the main subject of interest of this thesis.

Although the quantity of related research is large, an interesting point of view that has never been taken in academic research is the relation between the quality of a football player and his physical abilities. A probable reason for this is that player quality is subjective and very hard to define. Of course it is straightforward to elect the top scorer as the best player. But it is not fair to compare strikers, midfielders and defenders on the number of goals scored. As, for example, a high-quality defender does not need to have the ability to score a large number of goals, this method of defining quality would drastically favor strikers over defenders. Next to that, the result of a match does not immediately indicate player quality as this result depends heavily on the quality of the opponent that is faced. For example, stating that the top scorer of the Dutch Eredivisie scoring 28 goals is better than the top scorer of the English Premier League who scored only 24 goals is not fair, since the opponents that a player in the Premier League has to face are of a higher quality than the opponents that are faced in the Eredivisie. This thesis uses an objective player quality measure that allows the comparison of players across teams and leagues such that the relationships between player quality and physical abilities can be investigated. An elaborate explanation of this measure, the Euro Player Index, can be found in Section 4.1. By relating the EPI to physical performance indicators, unique insights can be obtained into the relation between physical abilities and player quality. As research in the past has shown (e.g. Bloomfield et al. (2007)), the relevant physical indicators are likely to vary across the different positions within the football team. This will also be investigated in this thesis.

Chapter 3

Methodology

This chapter describes the theoretical background of this thesis. As stated earlier, the dependent variable throughout the modelling is the Euro Player Index (EPI), representing the quality of the player. For the moment it is sufficient to state that the EPI is measured at the individual level, dimension $i$, and that the EPI of a player can vary over matches, dimension $t$. Note that the $t$ dimension in this case does not represent time as such but rather matches ordered in time. In mathematical notation, the EPI of player $i$ starting in match $t$ is then written as $EPI_{it}$. After the match, the EPI is updated taking into account the expected outcome before the match and the realized outcome of the match. A detailed explanation of this can be found in Section 4.1.

3.1 Panel data

As the dependent variable is subject to variations in multiple dimensions, $i$ and $t$, the data are characterized as panel data. In general, such a setting requires a different estimation technique than an ordinary regression. Pooled OLS ignores the unobserved player-specific effects; when these are correlated with the regressors, this leads to endogeneity and consequently to biased estimates of the parameters of interest. There are several estimation techniques that can solve these problems. In this thesis the fixed effects technique, also called Within estimation, is employed. One complication of the data is that the number of observations is not equal for all players. This results in unbalanced panel data and consequently the elimination of the fixed effects might not be straightforward. One of the earlier discussions on this subject was published by Wansbeek and Kapteyn (1989). Baltagi and Song (2006) discuss the complications that can occur when panel data are unbalanced and provide a solution to estimate the parameters of interest consistently. Balazsi et al. (2018) emphasize that it depends on the assumed model whether a Within transformation can fully remove the fixed effects in the case of unbalanced panel data. For example, a model with both individual fixed effects and time fixed effects on unbalanced data requires a different approach than Within estimation as it is mathematically impossible to remove the fixed effects with the Within transformation. As this thesis only includes player fixed effects in the model of interest, the fixed effect can be removed via a Within transformation (Balazsi et al., 2018). Furthermore, season and club dummies are included in the model to control for season and club effects.

This thesis employs Within estimation rather than First Difference estimation because the data contain gaps: the time span between two observations of a player varies, which makes First Difference estimation unsuitable. Baltagi and Song (2006) also discuss the importance of whether observations are missing at random or not. If observations are not missing at random, sample selection is present and should be taken into account. Baltagi and Song (2006) report estimation procedures to overcome this sample selection. In this thesis it is assumed that sample selection is absent.

3.2 Fixed Effects approach

The model that is estimated in this thesis is given in (3.1). The EPI of player $i$ starting in match $t$ is explained by an individual fixed effect $\alpha_i$ for player $i$, dummy variables $d_{ict}$ indicating for which club $c$ player $i$ played in match $t$, match-varying variables represented by $z_{it}$ and an error term $\varepsilon_{it}$. The club dummy variables are included to take into account the quality of a player’s teammates and club effects. To avoid perfect multicollinearity one club dummy variable has to be removed. A caveat of including the club dummy variables is that the corresponding coefficients are only identified when a club has one or more players that also represent another club in the sample in some other season. Dummy variables are therefore only included for the clubs that have one or more such players. In this way the strength of the club a player plays for is taken into account. There is one club in the sample, Roda JC Kerkrade, that had no players representing other clubs in other seasons. Consequently, the dummy variable for this club is removed from the regression variables; stated differently, the club effect of Roda JC Kerkrade is absorbed by the individual player intercepts. For ease of notation and later derivations, the model is rewritten as in (3.2).

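\begin{align}
EPI_{it} &= \alpha_i + \sum_{c=1}^{C-1} \delta_c d_{ict} + \gamma' z_{it} + \varepsilon_{it} \tag{3.1} \\
         &= \alpha_i + \beta' x_{it} + \varepsilon_{it}, \qquad i = 1, \ldots, N, \quad t = 1, \ldots, T. \tag{3.2}
\end{align}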
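The identification caveat for the club dummies can be illustrated with a small sketch. The toy panel below is entirely hypothetical (the player and club labels are not from the thesis data): player 3 never changes club, so club "C" has no player who also appears at another club. After subtracting player means, the dummy for "C" has no variation left, which is what it means for its effect to be absorbed by the player intercepts.

```python
import pandas as pd

# Hypothetical toy panel: players 1 and 2 change clubs, player 3 never does,
# so club "C" has no player who also represents another club.
df = pd.DataFrame({
    "player": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "club":   ["A", "A", "B", "B", "A", "B", "B", "C", "C", "C"],
})

# One dummy per club, then subtract each player's own mean
# (the Within transformation applied to the dummies).
dummies = pd.get_dummies(df["club"], dtype=float)
demeaned = dummies - dummies.groupby(df["player"]).transform("mean")

# Sum of squares left in each club dummy after removing player means.
remaining = (demeaned ** 2).sum()
print(remaining)
```

A dummy with zero remaining variation cannot be given its own coefficient in a player fixed effects regression, so it has to be dropped.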

As mentioned before, the number of observations is not the same for each player, since it depends on whether a player participated in a match or not. Therefore, the indicator variable $s_{it}$ is introduced, which is defined as

\begin{equation}
s_{it} =
\begin{cases}
1 & \text{if player } i \text{ participated in match } t, \\
0 & \text{otherwise.}
\end{cases}
\tag{3.3}
\end{equation}

Let $S_i$ denote the total number of matches in which player $i$ participated. That is,

\begin{equation}
S_i = \sum_{t=1}^{T} s_{it}, \qquad i = 1, \ldots, N. \tag{3.4}
\end{equation}

Using the total number of matches played by player $i$, the dependent variable and the explanatory variables are demeaned such that $\widetilde{EPI}_{it}$ and $\tilde{x}_{it}$ are obtained. Mathematically, this is written down in (3.5) and (3.6).

\begin{align}
\widetilde{EPI}_{it} &= EPI_{it} - S_i^{-1} \sum_{u=1}^{T} s_{iu} EPI_{iu} \tag{3.5} \\
\tilde{x}_{it} &= x_{it} - S_i^{-1} \sum_{u=1}^{T} s_{iu} x_{iu} \tag{3.6}
\end{align}
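As a minimal sketch of this demeaning step on unbalanced data, assume a long-format table with one row per player and match in which the player actually took part. Rows with $s_{it} = 0$ are then simply absent, and $S_i$ is the number of rows per player. The column names and values below are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical long-format data: only matches a player took part in are stored,
# so player 1 has three observations (with a gap at match 3) and player 2 has two.
df = pd.DataFrame({
    "player":  [1, 1, 1, 2, 2],
    "match":   [1, 2, 4, 2, 3],
    "epi":     [10.0, 12.0, 14.0, 20.0, 30.0],
    "sprints": [5.0, 7.0, 9.0, 3.0, 5.0],
})

cols = ["epi", "sprints"]

# S_i from (3.4), repeated on every row of player i.
S = df.groupby("player")["match"].transform("count")

# Player means over the observed matches, then the deviations of (3.5) and (3.6).
player_means = df.groupby("player")[cols].transform("mean")
demeaned = df[cols] - player_means
print(demeaned)
```

Because the mean is taken over observed rows only, no special handling of the gaps is needed; each player's deviations sum to zero by construction.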

As $\alpha_i$ is constant per player, the fixed effect drops out when the equation is demeaned. The equation to be estimated reduces to (3.7), which can then be estimated by Ordinary Least Squares (OLS).

\begin{equation}
\widetilde{EPI}_{it} = \beta' \tilde{x}_{it} + \tilde{\varepsilon}_{it} \tag{3.7}
\end{equation}

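The OLS estimator resulting from this, $\hat{\beta}$, is then calculated as

\begin{equation}
\hat{\beta} = \left( \sum_{i=1}^{N} \sum_{t=1}^{T} s_{it} \tilde{x}_{it} \tilde{x}_{it}' \right)^{-1} \left( \sum_{i=1}^{N} \sum_{t=1}^{T} s_{it} \tilde{x}_{it} \widetilde{EPI}_{it} \right). \tag{3.8}
\end{equation}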
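The estimator in (3.8) can be sketched in a few lines of NumPy. The data below are simulated under assumed parameter values (the group sizes, coefficients and noise level are illustrative choices, not thesis results). As a cross-check, the sketch also runs OLS with one dummy per player, which should give the same slope estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unbalanced panel: player i has S_i observed matches.
S = [3, 5, 2, 6]
player = np.repeat(np.arange(len(S)), S)
n = player.size

beta_true = np.array([0.5, -1.0])                    # assumed for the simulation
alpha = rng.normal(size=len(S))                      # player fixed effects
X = rng.normal(size=(n, 2)) + alpha[player, None]    # regressors correlated with alpha_i
y = alpha[player] + X @ beta_true + 0.1 * rng.normal(size=n)

def demean_by_group(a, g):
    """Subtract from each row the mean of its group, using observed rows only."""
    a = a.reshape(len(a), -1)
    sums = np.zeros((g.max() + 1, a.shape[1]))
    np.add.at(sums, g, a)                            # per-group column sums
    counts = np.bincount(g)[:, None]                 # S_i
    return a - (sums / counts)[g]

X_t = demean_by_group(X, player)
y_t = demean_by_group(y, player).ravel()

# (3.8): solve (sum x~ x~') beta = sum x~ y~ over the observed rows.
beta_within = np.linalg.solve(X_t.T @ X_t, X_t.T @ y_t)

# Cross-check: OLS on the raw data with one dummy per player.
D = (player[:, None] == np.arange(len(S))).astype(float)
beta_lsdv = np.linalg.lstsq(np.hstack([X, D]), y, rcond=None)[0][:2]

print(beta_within, beta_lsdv)
```

Since only observed rows enter the arrays, the indicator $s_{it}$ never appears explicitly: summing over stored rows is the same as summing $s_{it}$-weighted terms over all $t$.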

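To identify the necessary assumptions for the estimator to be consistent, (3.8) can be written out as follows

\begin{align}
\hat{\beta} &= \left( \sum_{i=1}^{N} \sum_{t=1}^{T} s_{it} \tilde{x}_{it} \tilde{x}_{it}' \right)^{-1} \left( \sum_{i=1}^{N} \sum_{t=1}^{T} s_{it} \tilde{x}_{it} \widetilde{EPI}_{it} \right) \tag{3.9} \\
&= \beta + \left( \sum_{i=1}^{N} \sum_{t=1}^{T} s_{it} \tilde{x}_{it} \tilde{x}_{it}' \right)^{-1} \left( \sum_{i=1}^{N} \sum_{t=1}^{T} s_{it} \tilde{x}_{it} \tilde{\varepsilon}_{it} \right). \tag{3.10}
\end{align}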
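The step from (3.9) to (3.10) is not spelled out in the text. It can be sketched by substituting the demeaned model (3.7) into the second factor of (3.9), using that the scalar $\beta' \tilde{x}_{it}$ equals $\tilde{x}_{it}' \beta$:

```latex
\begin{align*}
\sum_{i=1}^{N} \sum_{t=1}^{T} s_{it} \tilde{x}_{it} \widetilde{EPI}_{it}
  &= \sum_{i=1}^{N} \sum_{t=1}^{T} s_{it} \tilde{x}_{it}
     \left( \tilde{x}_{it}' \beta + \tilde{\varepsilon}_{it} \right) \\
  &= \left( \sum_{i=1}^{N} \sum_{t=1}^{T} s_{it} \tilde{x}_{it} \tilde{x}_{it}' \right) \beta
     + \sum_{i=1}^{N} \sum_{t=1}^{T} s_{it} \tilde{x}_{it} \tilde{\varepsilon}_{it}
\end{align*}
```

Premultiplying by the inverse of $\sum_{i} \sum_{t} s_{it} \tilde{x}_{it} \tilde{x}_{it}'$ leaves $\beta$ from the first term and the sampling error term of (3.10) from the second.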

For $\hat{\beta}$ to be a consistent estimator, two conditions should be satisfied. First of all,

\begin{equation}
E \left( \sum_{t=1}^{T} s_{it} \tilde{x}_{it} \tilde{x}_{it}' \right) \tag{3.11}
\end{equation}

should be of full rank such that the inverse of

\begin{equation}
N^{-1} \sum_{i=1}^{N} \sum_{t=1}^{T} s_{it} \tilde{x}_{it} \tilde{x}_{it}' \tag{3.12}
\end{equation}

exists. By the Law of Large Numbers and by Slutsky’s Theorem the inverse of (3.12) then converges to a finite matrix. Next to that, an exogeneity assumption should be made such that the probability limit of the second part of (3.10) becomes zero. A sufficient condition such that

\begin{equation}
E \left( s_{it} \tilde{x}_{it} \tilde{\varepsilon}_{it} \right) = 0 \tag{3.13}
\end{equation}

holds is an extension of the strict exogeneity assumption that is common in the balanced case, in which all players have the same number of observations. Assuming that $E(\varepsilon_{it} \mid X_i, s_i, \alpha_i)$ is zero is sufficient. In this expectation, $X_i$ represents a matrix containing the stacked vectors $x_{it}$ over $t$ for player $i$. Likewise, $s_i$ is a $T \times 1$ vector consisting of $s_{it}$ ($t = 1, \ldots, T$) for player $i$. The assumption of strict exogeneity is necessary as the estimation procedure relies on the Within estimation technique, which uses all available observations of player $i$ to demean each individual observation of player $i$. Implicitly it is thus assumed that the error term and the selection indicator $s_{it}$ are uncorrelated and consequently the absence of sample selection is assumed.


3.3 Two-way clustered standard errors

The usual standard errors of the least squares solution in (3.8) are misleading. First, due to the panel structure of the data, the observations of one player are correlated. Second, a player does not participate in a match on his own, so his performance is likely to be correlated with the performance of his teammates. Consequently, the standard errors of the estimated coefficients are clustered on two levels.

For the sake of clarity, it is first explained how to calculate a one-way clustered covariance matrix. The calculation of a two-way clustered covariance matrix follows from this. A general estimate of a one-way clustered covariance matrix is obtained as follows (Cameron et al., 2011). Let $\hat{V}_1(\hat{\beta})$ denote the estimate of the covariance matrix clustered on some variable and let $G$ denote the total number of clusters in the sample arising from clustering on the considered variable. Then, the vector $\hat{u}_g$ contains the residuals of cluster $g$ and $X_g$ contains the corresponding rows of $X$. The associated $\hat{V}_1$ is calculated as

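$$\hat{V}_1(\hat{\beta}) = (X'X)^{-1}\, \hat{B}_1(\hat{\beta})\, (X'X)^{-1} \tag{3.14}$$

where

$$\hat{B}_1(\hat{\beta}) = \sum_{g=1}^{G} X_g'\, \hat{u}_g(\hat{\beta})\, \hat{u}_g(\hat{\beta})'\, X_g. \tag{3.15}$$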

The calculation of a one-way clustered covariance matrix is easily extended to a two-way clustered covariance matrix. In this thesis, the standard errors are clustered per player and per team in a season. Cameron et al. (2011) provide a thorough explanation of how to implement these standard errors. First, the two one-way clustered covariance matrices that result from two separate regressions have to be obtained. Both regressions are identical except for the variable on which the standard errors are clustered. In this case, one regression has standard errors clustered per player and the other regression has standard errors clustered per team in a season. This results in the covariance matrices $V_1$ and $V_2$. Then, a new variable is created which combines the two variables on which the separate regressions were clustered. This variable is used to cluster the standard errors in another regression, which results in the one-way clustered covariance matrix $V_{1 \cap 2}$. The two-way clustered standard errors are then obtained by taking the square root of the diagonal of the matrix

$$V_{\text{2way}}(\beta) = V_1(\beta) + V_2(\beta) - V_{1 \cap 2}(\beta). \tag{3.16}$$

Note that deducting $V_{1 \cap 2}(\beta)$ twice in (3.16) is not necessary due to the symmetry of a covariance matrix. Estimates of the components of $V_{\text{2way}}(\beta)$ are all obtained in similar ways. According to (3.14) and (3.15), $\hat{V}_1$, $\hat{V}_2$ and $\hat{V}_{1 \cap 2}$, all clustered on different variables, can be calculated. The estimate of the two-way clustered covariance matrix can be obtained by plugging the estimates of the components into the formula of (3.16).
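The combination of the three one-way clustered matrices can be sketched as follows. The sketch is a simplification under stated assumptions: it handles a single (already demeaned) regressor, so that the matrices in (3.14) and (3.15) become scalars, and it applies no finite-sample correction. The function names and the toy numbers are hypothetical.

```python
from collections import defaultdict


def cluster_var(x, resid, clusters):
    """One-way clustered variance of a single slope: scalar (3.14)-(3.15)."""
    xx = sum(xi * xi for xi in x)  # X'X
    score = defaultdict(float)
    for xi, ui, g in zip(x, resid, clusters):
        score[g] += xi * ui  # X_g' u_g, summed within cluster g
    meat = sum(s * s for s in score.values())  # B_1
    return meat / (xx * xx)


def two_way_cluster_var(x, resid, players, team_seasons):
    """Two-way clustered variance as in (3.16)."""
    v1 = cluster_var(x, resid, players)
    v2 = cluster_var(x, resid, team_seasons)
    # Intersection: one cluster per (player, team-season) combination.
    v12 = cluster_var(x, resid, list(zip(players, team_seasons)))
    return v1 + v2 - v12


# Hypothetical toy data: two players observed in two team-seasons each.
x = [1.0, 2.0, 1.0, 2.0]
resid = [1.0, -1.0, 2.0, -2.0]
players = ["p", "p", "q", "q"]
team_seasons = ["A", "B", "A", "B"]

v_2way = two_way_cluster_var(x, resid, players, team_seasons)
print(v_2way, v_2way ** 0.5)
```

In this toy example the per-player, per-team-season and intersection variances are 0.05, 0.45 and 0.25, so the two-way clustered variance is 0.25 and the standard error is 0.5.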


3.4 Position specific estimation

First, the sample will be analyzed without taking the positions of the players into account. With this approach it is investigated whether the model improves when the on-pitch variables are added to the model. This is done by evaluating the relevant statistical tests and model evaluation measures. In the end, this results in a general model that is most preferred. After this, the most preferred model will be applied to the data that is available for the different positions. In this way, relationships can be identified between the EPI and the on-pitch variables for the specific positions. Although this approach yields regression results per position, it does not tell whether the coefficients of the different regressions are significantly different. Therefore, a Chow test is performed, allowing a direct comparison between two positions (Chow, 1960). The Chow test is used in this case to test whether the subset of coefficients of the physical indicators is significantly different across positions. The test as proposed by Chow (1960), however, assumes that the error terms of the two separate regressions have the same variance. To relax this assumption, Toyoda (1974) proposed a slight adaptation of the Chow test that takes into account that the equal variance assumption might not hold. An explanation of the Chow test and the adaptation by Toyoda (1974) is contained in Appendix C. By performing this Chow test, it can be established whether there exist significant differences between the coefficients of the physical indicators across positions. Additionally, it can be tested per physical indicator whether the effect on the EPI differs between positions. This can be done by the Chow test as well. In this case, only one coefficient is forced to be the same for the two positions that are compared. The coefficients of all other variables are allowed to differ across the positions.

It should also be noted that several variables included in the models have a composite effect on the EPI. The estimated composite effect can be calculated using the estimated regression coefficients, keeping other factors constant. However, a standard error should be reported as well, as these composite effects are random variables. This standard error is calculated by the Delta Method, which allows the calculation of standard errors of transformations of estimated regression coefficients. Hence, the two-way clustered variance matrix is used to obtain the standard errors of these composite effects. A full explanation of how the standard errors are calculated by the Delta Method is contained in Appendix D. In this thesis, a composite effect of interest is for example the effect of a sprint of average distance on the EPI, ceteris paribus. This effect involves the estimated coefficients of the number of sprints and the sprint distance.
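As an illustration of the Delta Method for such a composite effect, the sketch below assumes that the effect of one additional sprint of average distance equals the sprint coefficient plus the average sprint distance times the sprint-distance coefficient. This functional form, the function name and all numbers are assumptions made for illustration; the exact procedure used in this thesis is the one described in Appendix D.

```python
import math


def composite_effect_se(b_sprints, b_distance, avg_distance, cov):
    """Composite effect and its Delta Method standard error.

    cov is the 2x2 covariance matrix of (b_sprints, b_distance), for
    example the relevant block of a two-way clustered covariance matrix.
    """
    effect = b_sprints + avg_distance * b_distance
    # Gradient of the effect with respect to (b_sprints, b_distance).
    grad = (1.0, avg_distance)
    var = sum(
        grad[i] * cov[i][j] * grad[j] for i in range(2) for j in range(2)
    )
    return effect, math.sqrt(var)


# Hypothetical coefficients, average sprint distance and covariance block.
effect, se = composite_effect_se(
    b_sprints=0.5,
    b_distance=-0.02,
    avg_distance=10.0,
    cov=[[0.04, -0.001], [-0.001, 0.0001]],
)
print(effect, se)
```

Because the assumed effect is linear in the coefficients, the Delta Method variance $g' V g$ is exact here rather than a first-order approximation.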


Chapter 4

Data

4.1 The Euro Player Index

4.1.1 Calculation

The Euro Player Index (EPI) is a measure that quantifies player quality. It has existed since July 2007 and was developed by the firms Hypercube and Remiqz. The index allows players to be compared across teams and leagues and is available for players in 35 leagues. The system is based on incremental calculations and the indices are adjusted after every match. The system was initiated in July 2007 and, according to Hypercube, it was calibrated and yielded representative results after one year, around July 2008. The remainder of Section 4.1 gives a global explanation of the calculation of the EPI to motivate the validity of its use in this thesis. By design, the EPIs of two players can only be compared in levels and not in terms of percentages. Furthermore, the lower bound of the EPI is not defined; there exist, for example, players with an EPI below zero.

Expected outcome

The increment of a player's EPI is based on several factors. The expected outcome of the match is one of these factors. A match has three possible outcomes: the team playing at home wins (+1), the match ends in a draw (0) or the team that plays away wins (-1). The expected outcome of the match is determined from the outcomes of all historical matches that were played between teams with the same quality difference as in the match that is considered. In this case, the quality of a team is represented by the average EPI of the best 18 players of the club, later referred to as the ECIP. Hence, the ECIP is calculated as

$$ECIP = \frac{1}{18} \sum_{i=1}^{18} EPI_{(i)}, \tag{4.1}$$

where $EPI_{(i)}$ denotes the ordered indices of the club's players. The difference in ECIP between the home and away club is then used to determine the expected outcome of the game, ranging between +1 and -1. Home advantage is also taken into account in determining the expected outcome. Earlier research has shown that home advantage is present in professional football. Clarke and Norman (1995) showed its presence in English professional football. Later, Koning (2000) emphasized its importance in Dutch professional football. This home advantage is found to differ across leagues and is therefore corrected for differently per league. The probabilities of a home win, a draw and an away win are calculated according to the cumulative standard normal distribution. The exact calculation procedure of these probabilities cannot be disclosed due to the property rights of Hypercube and Remiqz. With the probabilities of the outcome, the expected outcome is calculated as

$$\text{Expected outcome} = 1 \times P[\text{Home win}] + 0 \times P[\text{Draw}] - 1 \times P[\text{Away win}]. \tag{4.2}$$

For the individual players participating in the match, an individual expected outcome is determined depending on their current EPI, the EPI of the teammates and the strength of the opponent in ECIP.

The difference between these expected outcomes within a team depends on the player's EPI and the range of the EPIs of his teammates. For instance, let FC Barcelona have an expected outcome of +0.4 for the upcoming match that they play at home. Then the best player in the line-up (Lionel Messi) has a different expectation than their worst player (Sergi Roberto), such that worse players are allowed to improve faster when winning and better players are punished harder when losing. The individual expected outcome is calculated similarly to the expected outcome of the match. However, the ECIP of the team to which player $i$ belongs is then replaced by a weighted value of the player's EPI and the ECIP of his team. The weight $\alpha$ is set such that the predictive power of the system is optimized historically. The weighted value of the player's EPI and the ECIP then becomes

$$\alpha \times EPI_i + (1 - \alpha) \times ECIP, \tag{4.3}$$

where the exact calculation of α cannot be disclosed. The probabilities of a home win, a draw and an away win are then calculated as before yielding an individual expected outcome for each player i. The strength of the opponent is still represented by the ECIP of the club.
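Formulas (4.1) to (4.3) can be illustrated with a short sketch. Since the calculation of the outcome probabilities and of $\alpha$ is not disclosed, the probabilities, the value of `alpha` and the squad used below are purely hypothetical, as are the function names.

```python
def ecip(player_epis, top_n=18):
    """Average EPI of the best top_n players of a club, as in (4.1)."""
    if len(player_epis) < top_n:
        raise ValueError("need at least top_n players")
    best = sorted(player_epis, reverse=True)[:top_n]
    return sum(best) / top_n


def expected_outcome(p_home, p_draw, p_away):
    """Expected outcome on the scale from -1 to +1, as in (4.2)."""
    return 1 * p_home + 0 * p_draw - 1 * p_away


def individual_strength(player_epi, club_ecip, alpha):
    """Weighted value of the player's EPI and the club's ECIP, as in (4.3)."""
    return alpha * player_epi + (1 - alpha) * club_ecip


# Hypothetical squad of 20 players with EPIs 1000, 1050, ..., 1950.
squad = [1000 + 50 * i for i in range(20)]
club_ecip = ecip(squad)

# Hypothetical probabilities and weight; the real values are not disclosed.
match_expectation = expected_outcome(p_home=0.6, p_draw=0.2, p_away=0.2)
best_player_strength = individual_strength(max(squad), club_ecip, alpha=0.5)
print(club_ecip, match_expectation, best_player_strength)
```

With these inputs the two weakest players are excluded from the ECIP, and the best player's weighted value lies halfway between his own EPI and the ECIP of his club.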

During a match the expected outcome changes. For example, a match in which the score stays zero to zero for 80 minutes has an expected outcome that gets closer and closer to zero, i.e. a draw. For each player participating in the match, the expected outcome at the minute the player enters the match and the expected outcome at the minute the player leaves the match are compared. For instance, a player who played the whole match, for which the expected outcome at the beginning was +0.4 and which ended in a home win (+1), contributed +0.6 to the outcome. However, a substitute entering the match in the 70th minute, when the expected outcome was already +0.75, contributed only +0.25 to the final result. Other factors that influence the expected outcome during a match are of course goals, red and yellow cards and substitutions. Figure 4.1 shows an illustrative example of how the expected outcome changes during a match. At the start of the match, the expected outcome indicates that the home team is more likely to win. As long as the score stays zero to zero, the expected outcome moves linearly towards zero, a draw. At (1) the home team scores and the expected outcome jumps up. The expected outcome then moves linearly towards +1, a home win. However, at (2) the home team receives a red card, causing the expected outcome to jump down. Although the home team received a red card, they are still in front, such that the expected outcome moves towards +1. At (3) the away team scores such that the score becomes 1-1. After this, no match events occur and the match ends in a draw.


[Figure: expected outcome (vertical axis, from -1 to +1) against minutes played (horizontal axis, 0 to 90), with marked events (1) goal home team, (2) red card home team and (3) goal away team.]

Figure 4.1: Example of the development of the expected outcome during a match

EPI increment

After the match has ended, the EPIs are updated. The incremental values are calculated using the expected outcome, the realized outcome and the so-called κ-factor. An example of the incremental calculations is listed in Table 4.1. The κ-factor varies across leagues and is determined to optimize the predictive power of the statistical model. The intuition of the κ-factor is based on the Elo rating (Elo, 1978) that is used in the analysis of chess games.

Table 4.1: Example of incremental calculations of the EPI.

| Realized result | Expected outcome | κ-factor | EPI increment |
|---|---|---|---|
| 1 | 0.63 | 20 | 20 × (1 − 0.63) = 7.4 |
| 0 | 0.63 | 20 | 20 × (0 − 0.63) = −12.6 |
| −1 | 0.63 | 20 | 20 × (−1 − 0.63) = −32.6 |
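The incremental calculation of Table 4.1 amounts to multiplying the κ-factor by the difference between the realized and the expected outcome. A minimal sketch that reproduces the three rows of the table is given below; the function name is hypothetical.

```python
def epi_increment(realized, expected, kappa):
    """EPI increment: kappa times (realized outcome - expected outcome)."""
    return kappa * (realized - expected)


# The three rows of Table 4.1: expected outcome 0.63 and kappa-factor 20.
increments = [round(epi_increment(r, 0.63, 20), 1) for r in (1, 0, -1)]
print(increments)
```

The same difference between realized and expected outcome underlies the contribution of +0.6 for the player who played the whole match in the earlier example.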

New players

New players can enter the calculations in two ways: either only the player is new, or the entire team is new. The latter happens when a team is promoted from a lower division that is not in the system to a higher division that is. The former occurs when a club that is in the system signs a player from its youth academy or from a club outside the system. In this case the starting EPI of the player is based on the ECIP of the club and the number of contracted players of the club. The ECIP has a positive effect on the starting value and the number of contracted players has a negative effect, implying a trade-off between these two factors. High-quality teams with a high ECIP generally also have many contracted players. To make sure that new


contracted players the club already has. In the case that the team is entirely new to the EPI calculations, all players get a starting EPI that equals the Euro Club Index (ECI) of the club. It will again take some time until the indices of the players of this club are calibrated.

4.1.2 EPI description of the sample

The EPI distribution of all players in the Dutch Eredivisie and Jupiler League for the season 2017-2018 is depicted in Figure 4.2. The EPI in this histogram of a certain player is calculated as the average EPI over the season.

Figure 4.2: EPI histogram of all players in the Eredivisie and Jupiler League in the season 2017-2018

As already mentioned briefly in Chapter 1, the physical on-pitch variables are available for a subset of all the teams in the Eredivisie and the Jupiler League. Table 4.2 lists the clubs for which the on-pitch variables are available in the data. The sample comprises players from teams that played in the first division consistently, teams that were promoted to the first division or relegated to the second division, and teams that played in the second division over the three seasons. The question arises whether the clubs in the sample constitute a representative subset of all the clubs in the Eredivisie and the Jupiler League. Figure 4.3 depicts the average EPI over the season 2017-2018 for all the players that are contained in the sample, i.e. for which the on-pitch variables are available. Comparing Figure 4.2 to Figure 4.3, it appears that the top players are not contained in the sample, as there are only a couple of players that have an EPI greater than 2000. This can be explained by the fact that the sample does not contain any of the top clubs in the Netherlands (e.g. Ajax, Feyenoord or PSV). Furthermore, there seems to be a cut-off just below zero in the histogram. Players of low quality are apparently not observed in the sample. There are two possible explanations for this. First, the sample does not contain low-rated clubs. Second, players with such a low EPI are not observed as they are not put in the line-up of the team by the coach. After all, the on-pitch variables are only observed for players that are actually in the team that is composed by the coach. To conclude, the shapes of the histograms are roughly equal except for the tails of the distribution of all players in the Dutch leagues, implying that the available sample reflects the Dutch leagues well in general.

Table 4.2: Clubs for which the on-pitch variables are available and their playing level over the seasons

| Club | 2015-2016 | 2016-2017 | 2017-2018 |
|---|---|---|---|
| NAC Breda | Jupiler League | Jupiler League | Eredivisie |
| NEC Nijmegen | Eredivisie | Eredivisie | Jupiler League |
| Roda JC Kerkrade | Eredivisie | Eredivisie | Eredivisie |
| FC Volendam | Jupiler League | Jupiler League | Jupiler League |
| SC Cambuur | Eredivisie | Jupiler League | Jupiler League |
| PEC Zwolle | Eredivisie | Eredivisie | Eredivisie |
| FC Eindhoven | Jupiler League | Jupiler League | Jupiler League |
| Helmond Sport | Jupiler League | Jupiler League | Jupiler League |
| ADO Den Haag | Eredivisie | Eredivisie | Eredivisie |
| SBV Excelsior | Eredivisie | Eredivisie | Eredivisie |
| De Graafschap | Eredivisie | Jupiler League | Jupiler League |

Figure 4.3: EPI histogram of all players in the Eredivisie and Jupiler League in the season 2017-2018 available in the sample

4.2 Data description

An overview of all available on-pitch physical variables is given in Table 4.3. These data were obtained via JOHAN Sports, a company providing performance analytics systems for sports teams using GPS tracking devices for players during matches. To be able to compare variables across players and matches, variables for which this is relevant are scaled to a per 90 minutes level. For example, if a player participated for 45 minutes in a match and covered a total distance of four kilometres, the total distance is doubled. In this way, it can be compared to the total distance covered by a player that played 90 minutes. The variable efficiency is a measure of how efficiently a player moves on the pitch. The measure is defined as the total distance covered at jogging pace or higher divided by the total playerload. The higher the playerload, the more movements the player needed to make to cover his total distance. For example, for two players covering an equal distance at jogging pace or higher during a match, the player that needed the lower playerload made more efficient movements on the pitch. The physical efficiency with which a player moves across the pitch might be an indicator of player quality. It is expected that a high-quality player moves more efficiently across the pitch than a player of lower quality.
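The scaling to a per 90 minutes level and the efficiency measure can be sketched as follows. The function names and the playerload value are hypothetical; the unit of the playerload is not specified in this section.

```python
def per_90(value, minutes_played):
    """Scale a match total to a per 90 minutes level."""
    return value * 90.0 / minutes_played


def efficiency(distance_jog_or_faster, playerload):
    """Distance covered at jogging pace or higher divided by playerload."""
    return distance_jog_or_faster / playerload


# Example from the text: four kilometres in 45 minutes is doubled.
scaled_distance = per_90(4000.0, 45)

# Hypothetical player: 7000 meters at jogging pace or higher, playerload 560.
player_efficiency = efficiency(7000.0, 560.0)
print(scaled_distance, player_efficiency)
```

A player who covers the same distance with a lower playerload obtains a higher value of the efficiency measure.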

Table 4.3: Descriptions of the on-pitch variables

| Variable | Description |
|---|---|
| Total distance | Total distance covered in meters |
| Walking distance | Distance covered in meters between 0 and 7 km/h |
| Jog distance | Distance covered in meters between 7 and 14 km/h |
| Running distance | Distance covered in meters between 14 and 20 km/h |
| Sprint distance | Distance covered in meters above 20 km/h |
| High-intensity sprint distance | Distance covered in meters above 25 km/h |
| Sprints | Number of times reached a speed above 20 km/h |
| High-intensity sprints | Number of times reached a speed above 25 km/h |
| Repeated sprints | Number of times two or more sprints within 20 seconds |
| Speed | Average speed during the match |
| Maximum speed | Maximum speed attained during the match |
| Efficiency | Measure of efficiency of the movements of a player |

Table 4.4 contains descriptive statistics of the variables that are used in the models of the research. As mentioned before, the variables for which it is relevant are scaled to 90 minutes. The unbalanced data capture 519 unique matches and 288 unique players. The total number of observations is 3716. On average, the EPI of a player is 1133. The variable Maximum speed contained records that were infeasible. For instance, a football player reaching a maximum speed of 50 km/h is impossible, as the maximum speed record of Usain Bolt was 44.72 km/h during the 100 meters of the World Championships in 2009. Therefore, in case the difference between a record and the individual average was more than two standard deviations, the record was replaced by the sample mean of the individual player. Other unreasonable records were deleted. For instance, an observation in which a player covered only 50 meters in 90 minutes was attributed to malfunctioning of the GPS tracking equipment.
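The replacement rule for infeasible maximum speed records can be sketched as follows. The sketch assumes that the mean and the standard deviation are computed per player over all of his records, including the suspect record, and that the sample standard deviation is used; these details are not stated in the text. The function name and the toy records are hypothetical.

```python
from statistics import mean, stdev


def clean_max_speed(speeds, n_sd=2.0):
    """Replace records further than n_sd standard deviations from the
    player's own mean by that mean. Requires at least two records."""
    player_mean = mean(speeds)
    player_sd = stdev(speeds)  # sample standard deviation (assumption)
    return [
        player_mean if abs(v - player_mean) > n_sd * player_sd else v
        for v in speeds
    ]


# Hypothetical maximum speeds (km/h) of one player over eight matches;
# the last record of 50 km/h is infeasible.
speeds = [30.0, 31.0, 29.0, 30.0, 31.0, 30.0, 29.0, 50.0]
cleaned = clean_max_speed(speeds)
print(cleaned)
```

Under these assumptions a single extreme record can only be flagged when the player has at least six records, because the outlier itself inflates the player's standard deviation.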


Table 4.4: Descriptive statistics of variables

| Statistic | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|
| EPI | 1133 | 461.57 | −464 | 2454 |
| Age | 24.60 | 3.46 | 17.30 | 35.20 |
| High-intensity sprints | 12.38 | 7.20 | 1 | 56 |
| Sprints | 53.27 | 17.18 | 2 | 127 |
| Repeated sprints | 44.29 | 19.17 | 0 | 124 |
| Total distance | 11466.41 | 1368.29 | 334.59 | 19855.28 |
| Walking distance | 4309.29 | 771.15 | 143.37 | 13702.80 |
| Jog distance | 4628.42 | 847.13 | 136.85 | 7965.42 |
| Running distance | 1917.32 | 556.06 | 31.12 | 3946.08 |
| Sprint distance | 611.37 | 258.41 | 23.25 | 2051.73 |
| High-intensity sprint distance | 140.17 | 102.62 | 0.64 | 698.59 |
| Maximum speed | 30.47 | 1.96 | 25.02 | 38.99 |
| Speed | 6.74 | 0.79 | 1.81 | 9.65 |
| Efficiency | 12.81 | 4.66 | 2.41 | 175.31 |
| Season 2015-2016 | 0.12 | 0.33 | 0 | 1 |
| Season 2016-2017 | 0.39 | 0.49 | 0 | 1 |
| Season 2017-2018 | 0.49 | 0.50 | 0 | 1 |
| Eredivisie | 0.41 | 0.49 | 0 | 1 |
| KNVB Beker | 0.04 | 0.19 | 0 | 1 |
| Jupiler League | 0.53 | 0.50 | 0 | 1 |
| NAC Breda | 0.08 | 0.28 | 0 | 1 |
| NEC Nijmegen | 0.12 | 0.33 | 0 | 1 |
| Roda JC Kerkrade | 0.01 | 0.12 | 0 | 1 |
| FC Volendam | 0.13 | 0.34 | 0 | 1 |
| SC Cambuur | 0.11 | 0.31 | 0 | 1 |
| PEC Zwolle | 0.10 | 0.30 | 0 | 1 |
| FC Eindhoven | 0.11 | 0.31 | 0 | 1 |
| Helmond Sport | 0.03 | 0.17 | 0 | 1 |
| ADO Den Haag | 0.09 | 0.28 | 0 | 1 |
| SBV Excelsior | 0.09 | 0.29 | 0 | 1 |
| De Graafschap | 0.12 | 0.32 | 0 | 1 |

Observations: 3716. Unique matches: 519. Unique players: 288. Unique teams: 11.


4.3 Positions

Of particular interest, as already pointed out in Chapter 1, are the differences that can be observed within the formation of the team. For example, what kind of behaviour characterizes a high-quality striker compared to a striker of lower quality? It is therefore important to define the different positions that are distinguished in this thesis. Figure 4.4 contains these positions. A football team consists of eleven players who are, to a certain extent, bound to a position on the pitch. Managers set the tactics of the team and therefore also determine which playing formation is used. Figure 4.4 contains a 1-4-3-3 formation, i.e. one goalkeeper, four defenders, three midfielders and three attackers. Defenders are either centre backs or full backs. The three attackers consist of wingers and strikers. Managers may also decide to let their team play in a 1-4-4-2, 1-3-4-3 or any other system with which they think they can outperform the opposition. The positions indicated in Figure 4.4 are all positions that are identified in the sample, regardless of the formation the team plays. Furthermore, as 1-4-3-3 is the formation that is used most frequently in the Netherlands, the influence of the formation on the physical performance of a player is not considered. Inspecting all the matches in the Eredivisie over the last 10 seasons shows that 1-4-3-3 was the most used formation: it was used in 74.1% of the line-ups. Table 4.5 shows the distribution of the different formations over the last 10 seasons in the Eredivisie. These numbers were obtained from the database of Hypercube and Remiqz.

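[Figure: pitch diagram of a 1-4-3-3 formation with one goalkeeper (GK), two full backs (FB), two centre backs (CB), three midfielders (M), two wingers (W) and one striker (ST).]

Figure 4.4: A positional composition of a football team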

As goalkeepers (GK) have a position that is very dissimilar to the other positions within the team, they are not included in the research. Besides, goalkeepers never play matches in which they are equipped with the tracking device measuring their on-pitch data. Left and right backs are denoted as full backs (FB) and are treated as the same position. Next to that, the centre backs are marked as CB. All midfielders are labeled M since a further specification of the position of midfielders is not available. The forward positions are separated into strikers (ST) and wingers (W). Strikers are centre forwards while the wingers play either on the left or on the right side of the pitch. Lastly, the positions of substitutes are not identified. The replacement of a centre back by a substitute does not necessarily imply that the substitute is a centre back as well. For instance, the coach may decide to alter the formation by replacing the centre back by a midfielder or a striker. Therefore, players who entered the match as substitutes are removed from the sample. Table 4.6 displays the descriptive statistics of the variables per position.

Table 4.5: Different formations used in matches in the Eredivisie over the last 10 seasons

| Formation | Frequency | Percentage |
|---|---|---|
| 1-4-3-3 | 4546 | 74.10% |
| 1-4-4-2 | 943 | 15.40% |
| 1-4-5-1 | 482 | 7.90% |
| 1-5-3-2 | 69 | 1.10% |
| 1-3-4-3 | 68 | 1.10% |
| 1-3-5-2 | 20 | 0.30% |
| 1-5-4-1 | 8 | 0.10% |

Comparing the means across positions, it appears that the average EPI is the highest for centre backs and wingers, even though the average EPI for the other positions does not differ that much. The spread in EPI seems to be comparable across positions. Also, players across positions are roughly of similar age: for all positions, the average age lies between 24 and 25. Turning to the on-pitch variables, some differences can be identified between the positions. Centre backs sprint the least of all positions on average. All three variables, high-intensity sprints, sprints and repeated sprints, have the lowest average number for centre backs. On the other hand, midfielders make the most sprints on average. The average total distance is above 10 kilometers for all positions. Wingers tend to jog and run more than players on the other positions. Likewise, midfielders cover more distance than the other positions at sprinting and high-intensity pace. Surprisingly, wingers attain the lowest maximum speed on average compared to the other positions. The efficiency measure, indicating the activity a player needed to cover a certain distance, is on average roughly equal across the five positions, although the standard deviation for wingers and strikers is somewhat higher than for centre backs, full backs and midfielders. Appendix A contains some additional descriptive statistics that are not included in Table 4.6 as they are not of main interest.


Table 4.6: Descriptive statistics of variables per position. Each cell reports the mean with the standard deviation in parentheses.

| Position | CB | FB | M | W | ST |
|---|---|---|---|---|---|
| EPI | 1188 (424.27) | 1116 (466.34) | 1095 (485.18) | 1158 (476.54) | 1022 (425.99) |
| Age | 24.95 (3.31) | 24.09 (2.98) | 24.16 (3.51) | 24.77 (3.76) | 24.96 (3.54) |
| High-intensity sprints | 8.70 (4.75) | 13.92 (6.18) | 18.77 (8.63) | 10.20 (6.09) | 15.88 (7.03) |
| Sprints | 40.00 (11.88) | 53.40 (14.55) | 65.01 (18.87) | 54.84 (16.09) | 61.91 (14.56) |
| Repeated sprints | 29.79 (13.14) | 44.50 (16.05) | 56.77 (21.51) | 46.00 (18.08) | 53.94 (16.85) |
| Total distance | 10616.51 (1039.79) | 11199.58 (1305.68) | 11595.53 (1408.46) | 12229.90 (1220.93) | 11305.03 (1193.68) |
| Walking distance | 4394.68 (632.36) | 4302.99 (819.95) | 4478.00 (864.08) | 4134.73 (747.39) | 4477.10 (775.00) |
| Jog distance | 4290.08 (661.98) | 4457.10 (716.61) | 4375.71 (857.08) | 5194.79 (764.81) | 4259.89 (790.89) |
| Running distance | 1504.76 (373.22) | 1788.25 (439.86) | 1929.14 (521.76) | 2303.27 (529.21) | 1853.14 (435.20) |
| Sprint distance | 426.98 (169.5) | 651.23 (225.26) | 812.66 (310.11) | 597.10 (231.36) | 714.87 (224.17) |
| High-intensity sprint distance | 92.91 (65.63) | 167.74 (93.71) | 230.20 (129.76) | 107.23 (83.22) | 173.43 (100.80) |
| Maximum speed | 30.10 (1.76) | 30.97 (1.62) | 31.85 (1.91) | 29.65 (1.86) | 31.05 (1.96) |
| Speed | 6.33 (0.59) | 6.66 (0.68) | 6.64 (0.93) | 7.17 (0.72) | 6.55 (0.72) |
| Efficiency | 12.40 (2.69) | 12.33 (2.46) | 12.32 (3.01) | 13.51 (5.42) | 13.18 (8.71) |
| Observations | 823 | 825 | 1205 | 478 | 385 |
| Unique players | 84 | 92 | 132 | 87 | 68 |


4.3.1 Additional midfielder statistics

A potential problem when estimating the model for midfielders is that it is unknown which role midfielders take on the pitch. This role might be crucial when identifying whether physical variables are significant indicators of player quality. A midfielder can for example be a box-to-box player, which would usually require covering long distances. Another role a midfielder can take is that of a playmaker, relying more on technical and passing skills than on physical skills. To take this potential problem into account, the midfielders will be classified into groups based on their football abilities on the pitch. This classification will be accomplished by K-means clustering, of which an explanation based on Bishop (2006) is contained in Appendix B. This section describes the variables that will be used to perform the segmentation. The results of this segmentation are deferred to Section 5.3. Table 4.7 contains the definitions of the skill variables that will be used for the classification of the midfielders. It should be noted that the variables are all scaled per 90 minutes and are averaged over all the matches in which the player was lined up as a midfielder. These statistics were obtained via Wyscout, a scouting platform reporting all kinds of player statistics in football. It is believed that some of the variables are distinctive for defensive midfielders, such as defensive duels and interceptions. Other variables are more important to attacking midfielders, such as the number of shots or the number of dribbles. The descriptive statistics of the additional variables for midfielders are given in Table 4.8. The definition of a pass to the final third is depicted in Figure 4.5. When the results of the classification are analyzed, the characteristics of the different groups of midfielders based on these additional statistics will also be reported.
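The idea behind K-means clustering can be illustrated with a minimal sketch of the standard algorithm, which alternates between assigning each player to the nearest cluster centre and recomputing the centres. The sketch is not the implementation used in this thesis (see Appendix B): the random initialization, the absence of any standardization of the variables, the function names and the two-variable toy data are all assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Basic K-means: returns a cluster label per point and the centres."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centres: k random points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        # Update step: each centre becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's centre
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers


# Hypothetical midfielders: (defensive duels, shots), both per 90 minutes.
midfielders = [
    (7.0, 1.0), (8.0, 1.2), (7.5, 0.8),  # more defensive profiles
    (3.0, 3.0), (3.5, 2.8), (2.8, 3.3),  # more attacking profiles
]
labels, centers = kmeans(midfielders, k=2)
print(labels)
```

In this toy example the first three and the last three midfielders end up in separate groups; which group receives label 0 depends on the initialization.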

Table 4.7: Descriptions of the skill variables for midfielders

| Variable | Description |
|---|---|
| Defensive duels | Number of times the player gets involved in duels with an opponent defensively |
| Interceptions | Number of times the player intercepts passes from opponents |
| Recoveries | Number of times the player recovers the ball |
| Shots | Number of shots the player aimed at the goal of the opponent |
| Shot assists | Number of times the player provided a teammate an opportunity to score |
| Crosses | Number of times the player puts the ball in front of the opponent's goal from either side of the pitch |
| Dribbles | Number of times the player takes the ball forward with touches of the feet now and then |
| Progressive runs | Number of times the player penetrates the opponent's half at higher pace with or without the ball |
| Passes | Number of passes the player made |
| Through passes | Number of times the player made a pass that went through the opponent's defense |


Table 4.8: Descriptive statistics of the classification variables of midfielders

| Statistic | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|
| Defensive duels | 7.18 | 2.02 | 2.70 | 12.53 |
| Interceptions | 3.96 | 1.55 | 1.24 | 7.40 |
| Recoveries | 8.05 | 2.84 | 2.18 | 14.26 |
| Shots | 1.34 | 0.76 | 0.00 | 3.35 |
| Shot assists | 0.82 | 0.43 | 0.00 | 2.05 |
| Crosses | 1.59 | 1.34 | 0.00 | 5.74 |
| Dribbles | 3.14 | 2.38 | 0.33 | 13.03 |
| Progressive runs | 0.94 | 0.70 | 0.00 | 4.62 |
| Passes | 35.53 | 9.33 | 13.58 | 65.88 |
| Through passes | 0.89 | 0.52 | 0.18 | 3.17 |
| Passes final third | 5.96 | 2.73 | 0.87 | 15.75 |


Chapter 5

Empirical Analysis

This chapter contains the empirical results that are obtained from the models of Chapter 3. Section 5.1 lists the results that were obtained without the distinction between the positions. In Section 5.2, the empirical results taking the different positions into account are discussed. Section 5.3 contains the classification of midfielders into groups with similar characteristics. Next, Section 5.4 lists the results when testing differences between positions. Furthermore, the composite effects are calculated and compared across positions. To strengthen the results, robustness checks are contained in Section 5.5.

5.1 General results

Table 5.1 displays the results of the regression that includes only the control variables. As explained in Chapter 3, the set of control variables consists of season dummy variables, dummy variables for clubs and leagues, and the age of the football player. The dummy variables for De Graafschap and Roda JC Kerkrade are left out of the regression to ensure that the matrix of explanatory variables is of full column rank. Table 5.1 shows that the estimated coefficients for the season dummies are significant at the five percent level. Hence, player quality in the 2016-2017 and 2017-2018 seasons is significantly lower than in the 2015-2016 season. For the league dummy variables, players in the Eredivisie appear to be significantly better than players in the Jupiler League, the reference level, although only at the ten percent level. The club effects are reported for clarity but are not of main interest. In later regressions it will only be indicated that club effects are controlled for. Lastly, a remarkable relationship between the EPI and the age of a player is observed. A non-linear relationship fits the data best. The signs of the coefficients imply a concave parabola for the relationship between age and the EPI. From the magnitude of the age coefficients, the peak age of the EPI, keeping all other factors constant, can be identified according to

$$
\text{Peak age} = \frac{-\beta_{\text{age}}}{2 \times \beta_{\text{age}^2}}. \tag{5.1}
$$

This definition of the peak age follows directly from maximizing the EPI equation with respect to the age of a player. Filling in the estimated coefficients implies a peak age of 34.30, which is a rather high age for a football player to be at the top of his game. One explanation for this observation might be that the physical abilities of players change as they get older, such that the assumption that all other factors are constant when determining the peak age is not valid. Another explanation might be that good players tend to keep playing to a higher age than players whose performance deteriorates quickly as they get older. This would imply that the old players in the sample kept playing because they are still of value to their team, while their peers of comparable age are not in the sample because they decided to quit. Having observed the relevance of these control variables, the next step is to extend the model with the variables of interest, the physical on-pitch variables, and to test whether this extended model is preferred over the model of Table 5.1.
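As a quick check of Equation (5.1), the short Python sketch below sets the derivative of the quadratic age profile to zero and evaluates the result with the rounded age coefficients reported in Table 5.1. The function name `peak_age` is introduced here for illustration only. With the rounded coefficients the result is roughly 34.28, marginally below the reported 34.30; the small gap is presumably due to rounding of the displayed coefficients.

```python
def peak_age(beta_age, beta_age_sq):
    """Age that maximizes beta_age * age + beta_age_sq * age**2.

    Setting the derivative beta_age + 2 * beta_age_sq * age to zero gives
    age = -beta_age / (2 * beta_age_sq). This is a maximum only if the
    quadratic coefficient is negative (concave parabola).
    """
    if beta_age_sq >= 0:
        raise ValueError("a peak requires a negative coefficient on age squared")
    return -beta_age / (2 * beta_age_sq)


# Rounded coefficients for Age and Age squared as displayed in Table 5.1.
peak = peak_age(754.95, -11.01)
print(round(peak, 2))  # about 34.28
```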

Table 5.1: Fixed effects regression with control variables

(1) EPI
Control variables - Fixed Effects

                       Estimate       Std. Error
Season 2016-2017       −261.76**      101.81
Season 2017-2018       −367.34**      143.99
Eredivisie             90.49*         49.57
KNVB Beker             27.01          35.63
NAC Breda              −212.54*       119.47
NEC Nijmegen           259.09***      79.70
FC Volendam            −598.10***     139.45
SC Cambuur             6.28           115.43
PEC Zwolle             250.06***      92.57
FC Eindhoven           −143.42        169.18
Helmond Sport          −553.49***     128.09
ADO Den Haag           −539.66***     124.75
SBV Excelsior          140.48         108.26
Age                    754.95***      174.71
Age²                   −11.01***      2.99

R-squared (Within)     0.18
Observations           3716
Players                288

Significance levels: *** p < 0.01, ** p < 0.05, * p < 0.1. The standard errors are clustered in two dimensions: per player and per team in a season.
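The idea behind a fixed effects (within) regression such as the one in Table 5.1 can be illustrated with a minimal sketch. The Python code below subtracts player-specific means from a single regressor and from the outcome, and then computes the least-squares slope on the demeaned data. The function `within_estimator` and the toy panel are hypothetical; the actual model contains many regressors and two-way clustered standard errors, neither of which is reproduced here.

```python
from collections import defaultdict


def within_estimator(player_ids, x, y):
    """Slope of y on x after removing player-specific means (one regressor)."""
    # Accumulate the sum of x, the sum of y and the observation count per player.
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for pid, xi, yi in zip(player_ids, x, y):
        totals[pid][0] += xi
        totals[pid][1] += yi
        totals[pid][2] += 1

    numerator = 0.0
    denominator = 0.0
    for pid, xi, yi in zip(player_ids, x, y):
        sum_x, sum_y, n = totals[pid]
        x_demeaned = xi - sum_x / n
        y_demeaned = yi - sum_y / n
        numerator += x_demeaned * y_demeaned
        denominator += x_demeaned * x_demeaned
    return numerator / denominator


# Hypothetical toy panel: player A has intercept 10, player B has intercept -20,
# and both share a true slope of 2.
player_ids = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [12.0, 14.0, 16.0, -12.0, -10.0, -8.0]

slope = within_estimator(player_ids, x, y)
print(slope)
```

Because the player-specific intercepts drop out after demeaning, the sketch recovers the common slope even though the two players differ strongly in level.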

The model of Table 5.1 can be extended by including the various distance variables. In column 1 of Table 5.2 the constituents of the total distance covered per player are added to the model, namely walking, jog, running, sprint and high-intensity sprint distance. Only walking distance appears to be significant at the five percent level. The corresponding coefficient is negative, such that on average a player who covers an additional 100 meters walking scores 2 EPI points lower. The remaining distance variables are insignificant indicators of playing quality. Adding the distance variables changes the coefficients of the season dummies, the Eredivisie dummy and age only slightly. Comparing the performance of this model with that of Table 5.1, the Within R-squared is slightly higher. An F-test is included in Table 5.2 to test whether the model is statistically better than the model containing only the control variables. As the resulting F-statistic exceeds the critical value of 2.22, the model of the first column is significantly different from the model of Table 5.1.
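The logic of comparing a restricted and an extended model with an F-test can be sketched as follows. The Python code uses the classical F-statistic based on residual sums of squares, which is an assumption: with clustered standard errors, the test reported in Table 5.2 may well be a robust Wald-type variant. All numbers in the example are hypothetical, except the five restrictions (the five added distance variables) and the critical value of 2.22 quoted in the text.

```python
def nested_f_statistic(rss_restricted, rss_full, n_restrictions, df_resid_full):
    """Classical F-statistic for testing a restricted model against a full model.

    rss_restricted : residual sum of squares of the restricted model
    rss_full       : residual sum of squares of the full model
    n_restrictions : number of coefficients set to zero in the restricted model
    df_resid_full  : residual degrees of freedom of the full model
    """
    explained_per_restriction = (rss_restricted - rss_full) / n_restrictions
    residual_variance = rss_full / df_resid_full
    return explained_per_restriction / residual_variance


# Hypothetical residual sums of squares and degrees of freedom.
f_stat = nested_f_statistic(
    rss_restricted=110.0, rss_full=100.0, n_restrictions=5, df_resid_full=1000
)

# Critical value quoted in the text for the five added distance variables.
critical_value = 2.22
reject_restricted_model = f_stat > critical_value
print(f_stat, reject_restricted_model)
```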

Extending the model further with the remaining on-pitch variables yields the estimation results in the second column of Table 5.2. Remarkably, walking, jog, running and sprint distance are now insignificant, while the coefficient of high-intensity sprint distance is significant at the five percent level. It should be noted, however, that the effect of high-intensity sprint distance interacts with the effect of the number of high-intensity sprints. Whereas high-intensity sprint distance has a negative estimated coefficient, the number of high-intensity sprints is a positive and significant indicator of the EPI. Combined, this would imply that players with a relatively high EPI make many high-intensity sprints that are, on average, short. Other variables that are significant in column 2 are the number of repeated sprints and the number of sprints. The efficiency variable turns out to be a positive indicator of the EPI, although only at the ten percent level. Regarding the control variables, no remarkable differences are observed between column 1 and column 2 of Table 5.2. The peak age of the EPI is now estimated at 33.93 and 33.20 for columns 1 and 2 respectively, which is lower than the estimate based on Table 5.1. The Within R-squared of the model in column 2 of Table 5.2 is slightly higher than that of the model in column 1. Again, the F-test reported in column 2 of Table 5.2 indicates that the models of column 1 and column 2 are significantly different from each other. In this test, the model in column 1 of Table 5.2 is the restricted model, whereas the model in column 2 is the full model. The full model is the preferred model and will be used in the upcoming sections to evaluate the relationships between the EPI and the on-pitch physical variables.
