A Paired Comparison Lasso Model for Determining the Information Used by Tennis Betting Data


Anke van den Beukel


Master’s Thesis Econometrics, Operations Research and Actuarial Studies

Supervisor: Prof. Dr. R. H. Koning


December 29, 2018

Abstract


1 Introduction

A smashed racket and a series of accusations directed at chair umpire Carlos Ramos led to Serena Williams being handed a penalty game in the final of the 2018 US Open. The fierce debate that followed reached no consensus on whether the umpire was within his rights to use this uncommon penalty. Some argued that Williams indeed crossed the line with her loss of integrity and increasingly disrespectful behavior towards Ramos and her opponent Naomi Osaka, who ended up winning the match. Others cited examples of players, male players in particular, who have got away with worse, which again turned the discussion towards the topic of sexism in tennis. Would a male player who exhibited identical behavior have been docked a game as well? The publication of a cartoon in the Australian newspaper Herald Sun, depicting Williams as a toddler having a tantrum, only added fuel to the fire (Devic, 2018). But it also showed that tennis is still a sport that moves people. Old stars like Martina Navratilova and John McEnroe have been replaced by the likes of Caroline Wozniacki and Roger Federer, and millions of people are eager to watch tennis whenever another Grand Slam tournament is airing.

It is not just millions of people who are watching, but also millions of dollars that are being invested by bettors who are eager to make money by gambling on tennis matches. Some gamblers may enjoy the thrill of possibly winning some extra money, but many others see their bets as serious business. Death threats directed at players who were on the verge of winning a match but ended up losing are not uncommon (Addley, 2015; BBC Sport, 2016). Apart from the match winner, many other outcomes can be bet on. Examples are the tournament winner, the match score in sets, or whether a match will finish within a certain number of games. Furthermore, in-play betting is a very popular part of the tennis betting market. In-play betting, or “live” betting, allows bets to be placed as the match progresses. Tennis is an ideal sport for in-play betting, as points are scored frequently and each point shifts the odds of the match (Klaassen and Magnus, 2014).


world of match-fixing. During match-fixing, tennis players are bribed in order to let a specific outcome happen. This happens especially in the lower levels of tennis, such as the Futures and Challengers circuits, where players’ costs are often higher than their revenues from tournaments (Independent Review Panel, 2018, e.g. p.2). This year in June, thirteen people in Belgium were detained after suspected match-fixing (Stonestreet, 2018). The criminal organization allegedly fixed matches of the Futures and Challengers. However, even the higher level tournaments may not be clean: a doubles match at Wimbledon 2018 “[had] been flagged for suspicious betting behavior, a possible sign of match-fixing” (Rothenberg, 2018).


other factors become less relevant over time? Online betting has become increasingly popular, so it could therefore be expected that betting agencies are more competitive. Are betting odds more predictive now than they were five years ago?

To answer these questions, this thesis will use a Bradley-Terry model with betting odds data known at the beginning of the match. The Bradley-Terry model is a type of model for paired comparison data, i.e. data that can be used to predict the probability that one object is preferred over the other. Furthermore, a lasso-type procedure enables the selection of covariates by penalizing the player-specific coefficients. As a result, players with similar covariate effects will form clusters, and irrelevant covariates will be eliminated from the model. Thus, if the betting odds truly include all information from, for example, ranking difference, we might expect that a variable representing the ranking difference would be eliminated from the model if betting odds are also included.

Another method of testing whether betting odds include all available information is by comparing a model with betting odds and a model with the other factors. Again, if betting odds have incorporated the information from these factors, we would expect the model with betting odds to make at least equally accurate predictions on the winner of the match.

This thesis is structured as follows: section 2 gives an overview of the literature regarding betting in tennis. The third section describes the data used in this thesis and the fourth section explains the methodology. Section 5 presents the results of the estimated models. Section 6 is a discussion of the findings of this thesis and section 7 concludes.

2 Literature Review

2.1 Forecasting in tennis


(and their squares), whether a player was a former top ten player, the difference in rounds reached during last year’s tournament (named the “individual tournament effect”), and left- and right-handedness of the players. Only the ranking difference was significant in all models considered for both men and women. For men, the individual tournament effect was significant, implying that men’s skills are more surface-based than women’s. The other variables had ambiguous effects.

McHale and Morton (2011) also used a Bradley-Terry type model and note the importance of surface: players’ rankings were notably different when only hard court or only clay matches were considered. However, only men’s data was used, so that the effect of surface on women’s rankings is unknown.

With regard to ranking, there is some discussion as to whether ranking positions or ranking points should be used as predictors. Lisi and Zanella (2017) use ranking points in their forecasting model, arguing that points also reflect a measure of distance that captures the quality differences between two players more accurately. On the other hand, Klaassen and Magnus (2014) prefer ranking positions, as “ranking points are artificial creations, not directly related to the players’ true qualities”. In addition, they write that the method of calculating ranking points changes often and differs between men and women. Given that this thesis aims to compare men’s and women’s matches, the players’ ranking positions are preferred over ranking points. Furthermore, the official ranking positions and points are typically updated every Monday. Live rankings, which are updated after every match, are also available on the internet. Although they might contain a small piece of extra information on the current strength of a player, the data used in this thesis only contains the official ATP and WTA rankings. As the difference between daily and weekly updates is expected to be rather small, the official rankings will be used for the analyses.


2.2 Betting in tennis

In tennis, betting odds give an indication of the expected probability of the outcome of a match, as set by a bookmaker. If decimal odds are used, as is done in this thesis, odds of 1.9 mean that $1.90 is paid back for every dollar staked if the bet is correct. Higher odds thus have a higher return when the bet is correct, but there is typically a lower probability of this happening. Given this definition, can we then expect that betting odds are perfect predictors of the outcome? One reason why they might not be is given by the favorite-longshot bias. The favorite-longshot bias is a result of the tendency of gamblers to overvalue the underdog, i.e. the “longshot”, and undervalue the favorite (Levitt, 2004). Bookmakers can take advantage of this by lowering the odds of the underdog more than those of the favorite. A bookmaker will always add a margin to its betting odds in order to make a profit, such as 1.9/1.9 odds in a match with equally strong players, instead of 2.0/2.0. In a match where one player is considered the favorite, and would, say, have a 74% chance of winning, the fair odds, which would reflect this probability, would be 1/0.74 = 1.35 for the favorite and 1/0.26 = 3.85 for the other player. If the bookmaker applies a 5% margin, the resulting odds would be 1.28 and 3.66. But given the fact that bettors (unrealistically) favor the underdog, a bookmaker might alter the odds to 1.30 and 3.47 instead. The result is market inefficiency and lower expected returns on longshots than on favorites. The favorite-longshot bias has been found in horse racing (Snowberg and Wolfers, 2010) and football (Cain and Peel, 2000), although other studies found unbiased odds in football (Forrest, Goddard, and Simmons, 2005) or even a reverse effect in baseball (Woodland and Woodland, 1994). These differing results could be a consequence of the spread of the betting odds in these sports, i.e. in horse racing there is more often a clear favorite, whereas baseball tends to be more competitively balanced (Cain, Law, and Peel, 2003). Tennis might offer a compromise in that regard: during the early stages of a tournament, higher-ranked players often face lower-ranked players, resulting in matches with clear favorites. In later rounds, opponents are of similar strength and matches are more evenly balanced.
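The effect of the margin on expected returns in the worked example can be checked with a short sketch. The function names are illustrative, and the rule that a margin scales the fair odds by (1 − margin) is an assumption chosen because it approximately reproduces the 1.28 and 3.66 quoted above.

```python
def fair_odds(p):
    """Decimal odds that exactly reflect a win probability p."""
    return 1.0 / p


def odds_with_margin(p, margin):
    """Fair odds scaled down by a proportional margin (assumed rule)."""
    return fair_odds(p) * (1.0 - margin)


def expected_return(p, odds):
    """Expected net return per unit staked when the true win probability is p."""
    return p * odds - 1.0


# Uniform 5% margin: both bets lose 5% in expectation.
uniform_fav = expected_return(0.74, odds_with_margin(0.74, 0.05))
uniform_dog = expected_return(0.26, odds_with_margin(0.26, 0.05))

# Odds shaded against the longshot, as in the example (1.30 and 3.47).
biased_fav = expected_return(0.74, 1.30)
biased_dog = expected_return(0.26, 3.47)
```

With the shaded odds, the expected loss on the longshot is larger than on the favorite, which is the pattern the favorite-longshot bias describes.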


McHale (2007). By using men’s ATP betting data from bet365, they found a positive bias: betting on underdogs yielded heavy losses, whereas the expected returns on bets on favorites were close to zero. The authors argue that another possible explanation for the favorite-longshot bias is that bookmakers use it as a defense mechanism against private information from well-informed bettors: bookmakers face potentially large losses when an underdog wins if they have set inaccurate odds. Forrest and McHale therefore use sub-samples of Grand Slam matches and non-Grand Slam matches, reasoning that in Grand Slam tournaments, incentives are higher and players are more motivated to win. As a result, the better player will win more often (also because Grand Slam matches are played best of five) and insider information becomes less relevant. However, they found no smaller bias in Grand Slam matches when compared to non-Grand Slam matches. Lahvička (2014), on the other hand, finds a larger bias for high-profile tournaments, which include Grand Slams, ATP World Tour Finals and WTA Tour Championships. It is interesting to note that Lahvička also used betting data from bet365, although he uses men’s and women’s data, and over a longer time span than Forrest and McHale. He also finds a stronger bias in later-round matches (defined as matches that were not in the first round) and between lower-ranked players. Lahvička gives two explanations for these seemingly contradictory results. The stronger bias for lower-ranked players could be explained by the fact that their matches are harder to predict, such that private information plays a larger role. For later-round and high-profile tournaments, the author argues that “in such matches the bookmaker faces a different kind of risk; the general public could react faster than the bookmaker to newly available information”. In both cases the bookmaker thus uses the defense mechanism approach, although for different reasons.


2.3 Effect of gender

As mentioned in the introduction, women’s tennis is less competitive than men’s tennis. For example, Klaassen and Magnus (2014) find that there are more upsets, defined as top-sixteen seeds not reaching the final sixteen, in men’s singles, although women’s tennis is becoming more competitive. Furthermore, Du Bois and Heyndels (2007) find that men’s and women’s matches are equally competitively balanced when considering match-specific and seasonal uncertainty, but that men’s tennis is more competitive when using inter-seasonal and long-term uncertainty. The term “uncertainty” is used for the unpredictability of an outcome, where a high uncertainty reflects high unpredictability and thus that players are competitively balanced. At the match-specific level, for example, the authors show that tie-breaks are almost equally likely to occur in men’s and women’s matches (14.89% and 14.16%, respectively). In this case, a tie-break is used as a measure that two players are equally strong. In that regard the authors find no difference in the predictability of match outcomes between genders. The fact that individual matches are equally unpredictable could imply that betting odds might be just as predictive for men and women. When considering inter-seasonal uncertainty, however, there were more new players in the top ten of male players, and the number one position changed more frequently.

As was already pointed out in section 2.1, male players tend to have a stronger preference for a certain surface than female players. This is exemplified by players like Rafael Nadal, who has won the clay-court French Open a record eleven times, or the otherwise relatively unknown grass specialist Nicolas Mahut. Furthermore, Wimbledon’s seeding system takes into account previous grass performances for men, whereas for women, it only uses the top 32 of the WTA ranking (Official Website Wimbledon, 2018). We could therefore expect court surface to be more of a decisive factor when placing bets on male players than on female players.

2.4 Effect of surface


example, in the length of the rallies: according to O’Donoghue and Ingram (2001), an average rally lasts 4.3 strokes at Wimbledon, but 7.7 at the French Open. Faster surfaces are advantageous to players with a powerful serve, whereas a defensive baseliner is better off on clay (Fernandez, Mendez-Villanueva, and Pluim, 2006). Given that grass courts are quicker than other courts, we might expect more upsets and thus a higher match unpredictability for this surface. Del Corral and Prieto-Rodríguez (2010), however, find that this is only the case for women. This could be a result of the fact that men tend to prefer a certain surface more than women. We thus might expect that other factors are less relevant for betting odds on quicker courts for women, since this type of surface plays a larger role in the predictability of the match. Bookmakers might set more extreme odds if the difference in ranks is large for a match on clay, whereas they might be more careful if the match is played on grass.

2.5 Effect of time


3 Data

3.1 Description of the data

All data has been obtained from http://www.tennis-data.co.uk/alldata.php. The men’s dataset includes 48,621 matches, played over a period from January 1, 2001 to August 5, 2018. For women, there are 28,837 observations, starting from December 31, 2006 up to August 5, 2018. Apart from details of the match, such as court surface and type of tournament, the dataset also provides information on the players’ rankings and betting odds. A full description of every variable in the final dataset is given by Table A.1. The table shows the names of the variables in the men’s and women’s datasets (first and second column) and a description (third column). Empty cells indicate that the variable is not included in a dataset. For example, the women’s dataset contains no data on the number of games won in the fourth and fifth set.

3.2 Data editing

Several changes have been made to the original data to make it more suitable for the analyses in this thesis.

3.2.1 New variables

PlayerA and PlayerB The original dataset includes variables Player 1 and Player 2, where Player 1 is by default the winner. New variables were created, named PlayerA and PlayerB, where the winner is either A or B, derived from the variables Player 1 and Player 2. Whether Player 1 became Player A or Player B was randomly determined. Similarly, the corresponding player statistics, such as WRank and LRank (the rank of the winner and loser), are now ARank and BRank.
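As a minimal sketch of this relabelling step, the snippet below randomly assigns the winner to label A or B and carries the rank along. The dictionary layout, the function name, the seed and the example values are illustrative assumptions, not the thesis's actual code; the `Winner` indicator follows the coding used later in the analyses (1 if player A won).

```python
import random


def relabel_match(match, rng):
    """Randomly assign the winner (Player 1) to label A or B.

    `match` is a dict with the keys "Player 1", "Player 2", "WRank" and
    "LRank" (an assumed layout); Player 1 is the winner in the source data.
    """
    winner_is_a = rng.random() < 0.5
    if winner_is_a:
        return {"PlayerA": match["Player 1"], "PlayerB": match["Player 2"],
                "ARank": match["WRank"], "BRank": match["LRank"], "Winner": 1}
    return {"PlayerA": match["Player 2"], "PlayerB": match["Player 1"],
            "ARank": match["LRank"], "BRank": match["WRank"], "Winner": 0}


rng = random.Random(2018)  # fixed seed so the relabelling is reproducible
# Made-up example record, for illustration only.
example = {"Player 1": "Player X", "Player 2": "Player Y", "WRank": 5, "LRank": 12}
relabelled = [relabel_match(example, rng) for _ in range(1000)]
```

Without this randomization, the winner would always carry the same label and the outcome variable would be constant.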


prob 365A prob 365A gives the winning probability of player A as implied by the betting odds of bet365. This implied probability is calculated by

\[
\text{prob 365A} = \frac{\dfrac{1}{\text{B365A}}}{\dfrac{1}{\text{B365A}} + \dfrac{1}{\text{B365B}}}. \tag{1}
\]

We divide by the sum of the inverses of B365A and B365B, the betting odds of bet365 for Player A and B, due to the margin added by bookmakers. As such, we have the linear relationship prob 365B = 1 - prob 365A. Due to this perfect correlation, only prob 365A is included in the analyses.
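Equation (1) can be illustrated with a small sketch. The function names are illustrative; the example odds 1.9/1.9 and 1.30/3.47 are the ones used in Section 2.2.

```python
def implied_probability(odds_a, odds_b):
    """Normalized implied probability that player A wins, as in equation (1)."""
    inv_a = 1.0 / odds_a
    inv_b = 1.0 / odds_b
    return inv_a / (inv_a + inv_b)


def inverse_odds_sum(odds_a, odds_b):
    """Sum of the inverse odds; it exceeds 1 when the bookmaker adds a margin."""
    return 1.0 / odds_a + 1.0 / odds_b


p_even = implied_probability(1.9, 1.9)    # equally strong players
p_fav = implied_probability(1.30, 3.47)   # favorite
p_dog = implied_probability(3.47, 1.30)   # underdog, equals 1 - p_fav
```

Dividing by the sum of the inverse odds is what makes the two implied probabilities add up to one even though the raw inverse odds do not.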

diffrank This variable represents the difference in ranking positions of player A and player B, given by diffrank = ARank − BRank.

high profile Equals 1 if the match is “high-profile” and 0 otherwise. Following Lahvička (2014), high-profile matches include Grand Slams, ATP World Tour Finals and WTA Tour Championships.

first round Equals 1 if the match is in the first round of the tournament and 0 otherwise.

3.2.2 Missing and incorrect data

Unique names In some cases, unique players were included under different names in the data file. For example, Del Potro J. and Del Potro J. M. represent the same player. Furthermore, sometimes the name contained a typo (such as Kohlschreiber P.. instead of Kohlschreiber P.). This poses a problem since the same person will be regarded as two or more different people in the analysis. I manually went through the data to correct for these errors by giving unique players the same name.


are most likely typos. I changed the value of LBB to 1.610, since other betting odds for Player B were 1.610 (B365B), 1.60 (EXB) and 1.650 (PSB). I changed the value of EXA to NA, since I was unable to derive the correct value (other betting odds were 1.85 (CBA), 1.800 (B365A) and 1.885 (PSA)). The same holds for SBA. Furthermore, several matches in 2002 had B365A = B365B = 1. These were set to NA.

Duplicates A total of 1480 matches were included twice in the dataset. All of them were matches played in 2001. All duplicates in the data were removed.

Missing dates Two matches did not have dates. The correct dates of these matches were looked up online and added.

GreenSet A total of 31 matches were played on GreenSet, which is a certain type of hardcourt surface. All of these matches were from the 2007 Sunfeast Open in Kolkata, a WTA Tier III indoor tournament (Tier III tournaments are now called International Tournaments). The tournament ceased to exist in 2008, such that the 2007 Sunfeast Open is the only Sunfeast Open in the dataset. Other tournaments, such as the Swiss Indoors, also use GreenSet, but for these tournaments the surface is described as hard court. The surface “GreenSet” was therefore replaced with “Hard” for these matches in Kolkata.

Completed matches Only matches that had been completed are selected, accounting for 96.3% of the total number of matches.

3.3 Data analysis

3.3.1 Analysis match characteristics


Table 1: Percentages of match characteristics

(a) Men
                                  n       %
Surface         Clay          15980    34.2
                Grass          5488    11.8
                Hard Court    25207    54.0
high profile    0             39133    80.5
                1              9120    19.5
first round     0             25964    53.8
                1             22289    46.2

(b) Women
                                  n       %
Surface         Clay           8079    29.3
                Grass          3038    11.0
                Hard Court    16508    59.8
high profile    0             21734    78.1
                1              6052    21.9
first round     0             14565    52.4
                1             13221    47.6

that only about one fifth of the matches are a high-profile match for both genders. Lastly, there are slightly fewer first round matches than non-first round matches for both men and women.

3.3.2 Analysis betting data


Figure 1: Correlations of the four betting agencies over time

LBA, respectively), the analyses will be done with these betting odds.

Table A.3 presents the correlations between the betting variables. Not all correlations can be calculated, since some betting variables do not have overlapping data. As expected, the vast majority of correlations are higher than 0.9, indicating that the betting companies do not differ much between their betting odds. The correlations between the four variables B365A, EXA, PSA and LBA are also around this order of magnitude. That being said, there is still a certain discrepancy between agencies.

Table A.4 shows correlations of the variables B365A, EXA, PSA and LBA per year, in order to find out if correlations have increased over time. As indicated by Forrest et al. (2005), we might expect that they do. The correlations of some variables have increased, such as those between B365A and PSA. Figure 1 visualizes the correlations. There appears to be an increasing trend, and correlations seem to be less spread out for later years.

3.3.3 Betting data per surface


grass are more spread out than those on hard court or clay. Both the interquartile range and the range between the 10th and 90th percentiles are wider. Thus, bookmakers actually set more extreme odds on grass than on clay and hard court for men. For women, the interquartile range is also the widest for grass and narrowest for clay, although the difference is smaller than for men.

Furthermore, a number of implied probabilities are exactly the same, especially for the 10th and 90th percentiles. This is due to the fact that there are only 147 unique odds for the 68,872 matches with betting odds data from bet365. As such, some quantiles have the exact same probabilities.

Table 2: Descriptive statistics of probabilities implied by bet365

                  Min     0.10    0.25    Mean    Median  0.75    0.90    Max
Men      Hard     0.015   0.205   0.325   0.499   0.500   0.664   0.795   0.990
         Clay     0.019   0.205   0.336   0.501   0.500   0.664   0.795   0.981
         Grass    0.029   0.172   0.312   0.491   0.500   0.675   0.814   0.971
Women    Hard     0.029   0.205   0.336   0.501   0.500   0.682   0.795   0.976
         Clay     0.029   0.205   0.337   0.502   0.500   0.675   0.795   0.971
         Grass    0.019   0.207   0.320   0.499   0.500   0.682   0.795   0.967

4 The model

4.1 The Bradley-Terry model


More formally, let the random variable $Y_{(r,s)}$ denote the outcome of a match, and define it as

\[
Y_{(r,s)} =
\begin{cases}
1 & \text{if player } a_r \text{ beats player } a_s, \\
0 & \text{if player } a_s \text{ beats player } a_r.
\end{cases}
\tag{2}
\]

Here, $a_r$ and $a_s$ are members of the set of objects $\{a_1, \ldots, a_m\}$, where $m$ is the number of objects. The Bradley-Terry model is given by

\[
\Pr(a_r \succ a_s) = \Pr(Y_{(r,s)} = 1) = \frac{\exp(\gamma_r - \gamma_s)}{1 + \exp(\gamma_r - \gamma_s)}, \tag{3}
\]

where $\gamma_j$ denotes the strength of player $a_j$. We can see that if both players are equally strong, i.e. if $\gamma_r - \gamma_s = 0$, then $\Pr(Y_{(r,s)} = 1) = 0.5$. In this model, we restrict $\sum_{r=1}^{m} \gamma_r = 0$ to ensure identifiability. To see this, note that only the differences in strengths are used in (3). In the case of two players, we might find $\hat{\gamma}_r - \hat{\gamma}_s = 2$, which could lead to many different estimates for $\hat{\gamma}_r$ and $\hat{\gamma}_s$ without the restriction.
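The identifiability argument can be made concrete with a short sketch of equation (3). The function names and the example strengths are illustrative assumptions.

```python
import math


def bt_probability(gamma_r, gamma_s):
    """Probability that player r beats player s under the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-(gamma_r - gamma_s)))


def center(strengths):
    """Shift the strengths so that they sum to zero (identifiability restriction)."""
    mean = sum(strengths) / len(strengths)
    return [g - mean for g in strengths]


# Two parametrizations with the same difference of 2 give the same probability.
p_first = bt_probability(1.0, -1.0)
p_second = bt_probability(3.0, 1.0)

# Centering picks one representative without changing any difference.
centered = center([3.0, 1.0])
```

Because only the difference enters the model, (1, −1) and (3, 1) are indistinguishable from the data; the sum-to-zero restriction selects (1, −1).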

An extension with ordered response categories is also possible, which would give more information on the difference in strength between two objects, or allow for a draw. For tennis, an ordered response could be used on the basis of the difference in the number of sets won. However, we would have different ordered responses for men and women, as some men’s matches are played as best of five. Since we would like to compare the men’s and women’s models, this thesis will focus on the model with the binary outcome.

4.2 Covariates


The difference in ranking between two players, however, is match-specific. This thesis will focus on this type of covariate.

We can model the inclusion of match-specific covariates as

\[
\gamma_{ir} = \beta_{r0} + x_i^{T} \beta_r, \tag{4}
\]

where the vector $x_i = (x_{i1}, \ldots, x_{ip})$ contains the match-specific covariates, and $p$ is the number of subject-specific covariates. We allow the effect of these covariates to be object-specific, as we have $\beta_r = (\beta_{r1}, \ldots, \beta_{rp})$. $\beta_{r0}$ can be seen as an intercept or as “leftover” strength, and in the case where no covariates are added we have $\gamma_{ir} = \beta_{r0}$. For similar reasons as in Section 4.1, we need $\sum_{r=1}^{m} \beta_{rj} = 0$, $j = 0, \ldots, p$, to have identifiability.

As an example, suppose we have a model where we include the implied probabilities from bet365, prob 365A, as a subject-specific covariate. We would then estimate the probability that player A beats player B in match $i$ as

\[
\Pr(Y_{(A,B)} = 1)_i = \frac{\exp\bigl(\beta_{A,0} - \beta_{B,0} + (\beta_{A,\text{prob 365A}} - \beta_{B,\text{prob 365A}})\,\text{prob 365A}_i\bigr)}{1 + \exp\bigl(\beta_{A,0} - \beta_{B,0} + (\beta_{A,\text{prob 365A}} - \beta_{B,\text{prob 365A}})\,\text{prob 365A}_i\bigr)}. \tag{5}
\]
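A minimal sketch of the covariate model follows: each player's strength in match i is an intercept plus player-specific slopes times the match covariates, and the win probability depends on the difference in strengths. The function names and all coefficient values are made up for illustration.

```python
import math


def strength(intercept, slopes, x):
    """Match-specific strength: intercept plus covariate effects."""
    return intercept + sum(b * v for b, v in zip(slopes, x))


def win_probability(beta_a, beta_b, x):
    """Probability that player A beats player B given covariates x.

    beta_a and beta_b are (intercept, slopes) pairs; x lists the covariates.
    """
    diff = strength(beta_a[0], beta_a[1], x) - strength(beta_b[0], beta_b[1], x)
    return 1.0 / (1.0 + math.exp(-diff))


# Illustrative coefficients with a single covariate (an implied probability).
beta_a = (0.4, [1.5])
beta_b = (-0.4, [-1.5])
p_low = win_probability(beta_a, beta_b, [0.30])
p_high = win_probability(beta_a, beta_b, [0.70])
```

If two players share the same intercept and slopes, the differences vanish and the probability is 0.5 regardless of the covariate value, which is why clustered coefficients drop out of the model.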

4.3 Penalty terms

One can imagine that the number of estimated parameters gets large if many covariates are added. Adding five covariates for 30 players already results in (5 + 1) · (30 − 1) = 174 estimated parameters. To counteract this, we can introduce penalties to create clusters of similar covariate effects. This reduces the number of estimated parameters, as well as the complexity of the model.

In order to do so, we introduce the penalized log-likelihood:

\[
l_p(\beta) = l(\beta) - \lambda J(\beta), \tag{6}
\]


4.3.1 Penalizing the intercept parameter

Since we use a lasso model, we penalize the absolute differences between the coefficients. More specifically, the penalty for the intercept term is

\[
P_1(\beta_{10}, \ldots, \beta_{m0}) = \sum_{r<s} |\beta_{r0} - \beta_{s0}|. \tag{7}
\]

As $\lambda$ grows larger, more and more clusters of similar intercepts are formed. In the case that no covariates have been added, this is identical to joining players with similar strengths. For very large $\lambda$ there will be one cluster with $\beta_{10} = \ldots = \beta_{m0}$. As we need $\sum_{r=1}^{m} \beta_{r0} = 0$, every $\beta_{r0}$ would equal zero and the intercepts would be eliminated from the model. For $\lambda = 0$ we have the original model again, where every player has their own intercept.

4.3.2 Penalizing the subject-specific covariates

For subject-specific variables with object-specific effects we have the penalty

\[
P_2(\beta_1, \ldots, \beta_m) = \sum_{j=1}^{p} \sum_{r<s} |\beta_{rj} - \beta_{sj}|. \tag{8}
\]

Similar to P1, P2 clusters objects which share a similar effect. As we might have more than one subject-specific covariate, it could be the case that for a finite λ we are able to identify several clusters for one covariate, and one large cluster for another. If the coefficients of one covariate form one cluster indeed, then, due to the identifiability restriction, the coefficients will all be estimated to be zero. In that case, the covariate can be eliminated from the model.
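Both penalties are sums of pairwise absolute differences, which a few lines of code can make explicit. This is a sketch of the penalty values only, not of the estimation algorithm; the function names and the example coefficients are illustrative.

```python
from itertools import combinations


def pairwise_l1(values):
    """Sum of |v_r - v_s| over all pairs r < s."""
    return sum(abs(a - b) for a, b in combinations(values, 2))


def intercept_penalty(intercepts):
    """P1: pairwise absolute differences between the player intercepts."""
    return pairwise_l1(intercepts)


def covariate_penalty(slopes):
    """P2: pairwise differences per covariate, summed over covariates.

    slopes[r][j] is the effect of covariate j for player r.
    """
    n_covariates = len(slopes[0])
    return sum(pairwise_l1([row[j] for row in slopes]) for j in range(n_covariates))


# Three players: two share an intercept, so only two pairs contribute.
p1 = intercept_penalty([0.5, 0.5, -1.0])
# Second covariate forms one cluster at zero and adds nothing to the penalty.
p2 = covariate_penalty([[1.0, 0.0], [1.0, 0.0], [-2.0, 0.0]])
```

Coefficients that coincide contribute nothing, so the penalty rewards clusters; a covariate whose coefficients all coincide must be zero for every player under the sum-to-zero restriction.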

4.3.3 Combining penalties

The penalties can be combined by

\[
J(\beta) = \sum_{l=1}^{2} w_l P_l = w_1 P_1 + w_2 P_2, \tag{9}
\]


depend on the number of penalties and the number of free coefficients related to each penalty. In order to combine the penalties, we need to rescale the covariates so that each covariate has a variance of one.

4.4 Model estimation

4.4.1 Cross-validation

To estimate the optimal value for λ, we apply k-fold cross-validation. In k-fold cross-validation, the data is randomly divided into k groups. For every value of λ, each group is used once as the test set, and the remaining k − 1 groups are used as the training set. The model is fitted to the training set, and the fitted model is then used to make predictions on the test set. We can measure the predictive performance on each test set, and summarize these measurements for each value of λ. The λ with the best predictive performance is then chosen as the optimal λ.
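The procedure can be sketched generically as follows. The `fit` and `loss` callables stand in for the model-fitting routine and the performance measure; they are placeholders, and the function names are illustrative rather than taken from any particular package.

```python
import random


def k_fold_indices(n, k, rng):
    """Randomly split the indices 0..n-1 into k groups of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_lambda(data, lambdas, k, fit, loss, rng):
    """Return the lambda with the lowest total loss over the k test sets.

    fit(train, lam) returns a fitted model; loss(model, test) returns a
    number where lower is better.
    """
    folds = k_fold_indices(len(data), k, rng)
    best_lambda, best_loss = None, float("inf")
    for lam in lambdas:
        total = 0.0
        for test_idx in folds:
            held_out = set(test_idx)
            train = [d for i, d in enumerate(data) if i not in held_out]
            test = [data[i] for i in test_idx]
            total += loss(fit(train, lam), test)
        if total < best_loss:
            best_lambda, best_loss = lam, total
    return best_lambda
```

The same folds are reused for every value of λ, so that differences in the summed loss reflect the penalty strength rather than the random split.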

The predictive performance can be measured by using the ranked probability score (RPS) (Gneiting and Raftery, 2007):

\[
\text{RPS} = \sum_{i=1}^{n} \sum_{k=1}^{K} \bigl( \Pr(Y_{(r,s)} \leq k)_i - \mathbb{1}(Y_{(r,s)} \leq k)_i \bigr)^2, \tag{10}
\]

where $Y_{(r,s)} \in \{1, \ldots, K\}$ and $K$ is the number of response categories. Furthermore, $\mathbb{1}(Y_{(r,s)} \leq k)$ is the indicator function, which equals 1 if $Y_{(r,s)} \leq k$, and 0 otherwise. In the binary case, this expression reduces to the Brier score (Brier, 1950):

\[
\text{BS} = \sum_{i=1}^{n} \bigl( \Pr(Y_{(r,s)} = 1)_i - Y_{(r,s),i} \bigr)^2. \tag{11}
\]

Essentially, it is the sum of the squared differences between the predicted probability and the actual outcome. A lower score therefore indicates better predictions.
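The two scores and their relationship in the binary case can be checked with a small sketch. The function names and the two example predictions are illustrative; in the RPS call, category 1 is taken to mean "player A wins", which is an assumption about the coding needed to line the two scores up.

```python
def brier_score(probs, outcomes):
    """Sum of squared differences between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes))


def ranked_probability_score(cumulative_probs, outcomes):
    """RPS for ordered categories 1..K.

    cumulative_probs[i][k - 1] is Pr(Y <= k) for observation i and
    outcomes[i] is the observed category.
    """
    total = 0.0
    for cdf, y in zip(cumulative_probs, outcomes):
        for k, c in enumerate(cdf, start=1):
            total += (c - (1.0 if y <= k else 0.0)) ** 2
    return total


# Two matches: A wins the first (predicted 0.7) and loses the second (predicted 0.4).
bs = brier_score([0.7, 0.4], [1, 0])
# Same matches with K = 2: category 1 = "A wins", category 2 = "A loses".
rps = ranked_probability_score([[0.7, 1.0], [0.4, 1.0]], [1, 2])
```

The last cumulative probability is always 1, so with two categories only the first term survives and the RPS coincides with the Brier score.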

4.4.2 Confidence intervals


Figure 2: Implied winning probabilities against actual probabilities (n = 68,912)

parameters. More specifically, we sample with replacement from the original data, and estimate λ and the parameters on each sample. The sample size equals the number of observations in the original data. We perform this procedure B times, such that we obtain B bootstrap estimates of the parameters. Given a significance level α, we take the α/2 and 1 − α/2 quantiles of the B bootstrap estimates to obtain the 1 − α confidence intervals. Since this bootstrapping procedure is computationally expensive, the optimal value of λ from the cross-validation can also be reused. In that case, λ does not have to be estimated again for every iteration, which significantly decreases computation time.
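A percentile bootstrap of this kind can be sketched as below. The `estimator` callable is a placeholder for the full model fit, the quantile rule (linear interpolation between order statistics) is one common choice and an assumption here, and the toy data are made up.

```python
import random


def quantile(sorted_values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1.0 - frac) + sorted_values[upper] * frac


def bootstrap_ci(data, estimator, n_boot, alpha, rng):
    """Percentile bootstrap interval for a scalar estimator."""
    n = len(data)
    estimates = sorted(
        estimator([data[rng.randrange(n)] for _ in range(n)])  # resample with replacement
        for _ in range(n_boot)
    )
    return quantile(estimates, alpha / 2), quantile(estimates, 1 - alpha / 2)


# Toy example: interval for the share of matches won by player A.
outcomes = [0, 1] * 50
lower, upper = bootstrap_ci(outcomes, lambda s: sum(s) / len(s), 200, 0.05, random.Random(0))
```

In the setting of this thesis the estimator would return one coefficient of the penalized model, fitted either with a re-estimated λ or with the λ fixed at its cross-validated value.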

5 Results

5.1 Implied probabilities


be able to include as many matches as possible that had data from bet365. There are two interesting findings: firstly, we can see in the figure that for small implied probabilities, the actual probability is often even lower. On the other hand, in the fourth quantile of implied probabilities we can see that actual winning probabilities are higher. It seems that the betting odds underestimate the probability of a favorite winning, whereas they overestimate the chances that the underdog wins.

Secondly, there are two visible peaks at 0.48 and 0.63. This is due to the fact that there are only two matches with a rounded implied probability of each of these values. Since player A lost twice when the implied probability was 0.48, we see an actual probability of 0.00. The reverse is true for 0.63. The same holds for 0.52, but since player A lost one of these matches and won the other, the actual probability is exactly 0.50. Table A.5 shows the rest of the implied and actual probability data. It is interesting to see that there are multiple implied probabilities for which there are only a few matches. The differences are so large that it seems hard to believe that they are due to chance, i.e. that there just happened to be few matches where the implied probability was, for example, 0.37. Perhaps the bookmakers shift the odds on purpose, such as to exactly 0.50. Bettors might find it more interesting to bet on a 0.50 match, since in their mind the two players might not be exactly equal.


Table 3: Data of implied probabilities (in 0.05 intervals) and actual probabilities

Interval     Actual        n
0.00-0.05    0.03        144
0.05-0.10    0.05      1,336
0.10-0.15    0.10      2,574
0.15-0.20    0.16      2,571
0.20-0.25    0.20      3,414
0.25-0.30    0.25      5,023
0.30-0.35    0.33      6,259
0.35-0.40    0.39      3,510
0.40-0.45    0.43      5,334
0.45-0.50    0.48      2,908
0.50-0.55    0.51      4,397
0.55-0.60    0.58      5,491
0.60-0.65    0.61      3,598
0.65-0.70    0.68      6,362
0.70-0.75    0.73      3,479
0.75-0.80    0.80      4,049
0.80-0.85    0.85      3,476
0.85-0.90    0.91      1,978
0.90-0.95    0.96      1,665
0.95-1.00    0.98        134
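A table of this form can be built by grouping matches on their implied probability and computing the observed share of wins per group. The sketch below is illustrative: the function name is hypothetical, and the treatment of values lying exactly on an interval boundary (assigned to the upper interval, up to floating-point rounding) is an assumption, since the thesis does not state its rule.

```python
def calibration_table(implied, won, width=0.05):
    """Group matches by implied probability and compute actual win rates.

    Returns (bin_index, actual_win_rate, n) for every non-empty interval,
    where interval b covers [b * width, (b + 1) * width).
    """
    n_bins = round(1.0 / width)
    counts = [0] * n_bins
    wins = [0] * n_bins
    for p, y in zip(implied, won):
        b = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes into the last interval
        counts[b] += 1
        wins[b] += y
    return [(b, wins[b] / counts[b], counts[b]) for b in range(n_bins) if counts[b] > 0]


# Made-up toy data: implied probabilities for player A and whether A won.
rows = calibration_table([0.12, 0.13, 0.52, 0.97], [0, 1, 1, 1])
```

Comparing the actual win rate with the midpoint of each interval gives a quick check of how well calibrated the implied probabilities are.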

5.2 glm analyses


Table 4: Percentages of match characteristics of the top 30

(a) Men
                                  n       %
Surface         Clay            505    28.9
                Grass           171     9.8
                Hard Court     1071    61.3
high profile    0              1283    73.4
                1               464    26.6
first round     0              1509    86.4
                1               238    13.6

(b) Women
                                  n       %
Surface         Clay            472    23.3
                Grass           171     8.4
                Hard Court     1383    68.3
high profile    0              1550    76.5
                1               476    23.5
first round     0              1645    81.2
                1               381    18.8

are more matches between two higher-ranked players. For women, this is indeed the case: higher-ranked players win 65.6% of their matches in the original dataset, but 62.4% in the subset. For men, however, these percentages are 65.9% and 69.7%, in the same order. This could be due to the domination of the Big Four, who constitute a larger share in the subset than in the entire data.

Furthermore, Table 4 shows the new match characteristics for the subset used in the analyses. The most important difference between the subset and the original set, as shown in Table 1, is that the share of first-round matches has lowered to 14% for men and 19% for women. The share of clay and grass matches has decreased as well. Due to these differences, results from this section are not necessarily applicable to the original data.


of bet365 as predictor of the outcome, the second the ranking difference and the third includes both variables. The fourth and fifth model are similar to the first and third, but use log(B365A - 1) as predictor instead of the implied probabilities. The outcome variable Winner equals 1 if the winner is player A and 0 if the winner is player B. Firstly, we can see that the (significant) variables have the correct signs: a higher value of prob 365A is indeed associated with a higher probability of winning, whereas a higher value of diffrank, indicating that the ranking of player A becomes worse compared to player B’s, results in a lower probability of winning. The same holds for log(B365A - 1): increasing odds lower the probability of winning. In the table we can also see that the effect of diffrank becomes insignificant in the third and fifth model, indicating that prob 365A might use the information from diffrank. The prediction accuracy of the five models for men is, respectively, 72.5%, 69.5%, 72.8%, 72.2%, and 72.2%. As a baseline prediction, we already saw in Section 5.2 that the higher-ranked player wins 69.7% of the time. The second model does not always predict the higher-ranked player to win, since its prediction accuracy differs from this baseline. Furthermore, there is no large difference between using log(B365A - 1) and the implied probabilities.
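The similar performance of the two betting predictors is perhaps unsurprising. As a sketch, and under the simplifying assumption (not made in the thesis) that the bookmaker's margin is ignored so that $p = 1/\text{B365A}$, the log predictor is the negative log-odds of the implied probability:

```latex
\[
\log(\text{B365A} - 1)
  = \log\!\left(\frac{1}{p} - 1\right)
  = \log\frac{1 - p}{p}
  = -\log\frac{p}{1 - p}.
\]
```

Under that assumption, a logistic regression on log(B365A - 1) with a coefficient close to −1 would map the odds back to p itself, which is consistent with the negative sign found for this predictor.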

Taking a closer look at the first and third models, there were twelve out of 747 men’s matches where these two models predicted a different winner. Of these matches, model 1 made four correct predictions, and model 3 the remaining eight. Furthermore, ten out of these twelve matches had betting odds (slightly) favoring the lower-ranked player, and the remaining two had equal odds. An example of one of these matches is a 2015 match between Tommy Haas and Andreas Seppi. Haas, coming back from a shoulder injury, had a ranking of 849, but bet365 set his probability of winning at 58.9%. He lost nonetheless, as was predicted by the third model, but not the first. The difference in prediction accuracy thus seems to arise mainly in matches where the betting odds favor the lower-ranked player.


match. Between model 1 and 3 there are thirteen different predictions, of which seven are made correctly by the first model. Of these thirteen matches, eight had estimated winning probabilities of exactly 50% for both players. Matches with estimated equally strong players thus appear to be matches where predictions differ for women.

A total of 14.7% of matches for men and 19.4% of matches for women have an implied winning probability of 50% or higher for the lower-ranked player. These are thus the type of matches where the bookmakers presumably have extra information, other than the ranking difference. The first model, with only the betting odds, correctly predicts 60.3% and 63.4% of these matches for men and women, respectively. For the model that only includes ranking difference, these percentages are 46.7% and 36.6%. It seems that bookmakers do use this extra information well, in particular for women.
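The selection of these matches and the comparison of the two predictors can be sketched as follows. This is a minimal illustration and not the code used in this thesis: the `Match` class, the toy data and the two naive decision rules (pick the odds favorite, pick the higher-ranked player) are hypothetical stand-ins for the fitted models, and diffrank is assumed to be the rank of player A minus the rank of player B, with implied probabilities that sum to one.

```python
from dataclasses import dataclass


@dataclass
class Match:
    prob_a: float   # implied winning probability of player A (assumed normalized)
    diffrank: int   # assumed: rank of A minus rank of B (positive: A is ranked worse)
    a_won: bool     # True if player A won the match


def odds_pick_a(match):
    """Naive odds rule: predict A if the odds favor A."""
    return match.prob_a >= 0.5


def rank_pick_a(match):
    """Naive ranking rule: predict A if A is the higher-ranked player."""
    return match.diffrank < 0


def lower_ranked_favoured(match):
    """True if the lower-ranked player has an implied probability of at least 50%."""
    if match.diffrank > 0:               # A is the lower-ranked player
        return match.prob_a >= 0.5
    return (1.0 - match.prob_a) >= 0.5   # B is the lower-ranked player


def accuracy(matches, pick):
    """Share of matches for which the rule predicts the actual winner."""
    return sum(pick(m) == m.a_won for m in matches) / len(matches)


# Hypothetical toy data, for illustration only.
matches = [
    Match(0.60, 50, True),
    Match(0.55, 30, False),
    Match(0.30, -40, True),
    Match(0.80, -20, True),
    Match(0.35, -10, False),
]

subset = [m for m in matches if lower_ranked_favoured(m)]
share = len(subset) / len(matches)
odds_accuracy = accuracy(subset, odds_pick_a)
rank_accuracy = accuracy(subset, rank_pick_a)
```

On real data, `share` would correspond to the 14.7% and 19.4% reported above, and the two accuracies to the percentages of the odds-only and ranking-only models on this subset.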

5.3 Model results

5.3.1 Gender

Figure A.1 and Figure A.1 (Cont.) show the coefficient paths for the men’s and women’s models. From top row to bottom row, the y-axis shows the estimates of βr0, βr,prob 365A, βr,diffrank, βr,high profile, and βr,first round as λ moves to zero. The left column gives the estimates of the men’s model and the right column those of the women’s model. The same data is used as in Section 5.2. For the optimal value of λ, as indicated by the red vertical line, it can be seen that none of the variables are excluded from the model, for both men and women. It thus seems that diffrank, high profile and first round are still relevant, despite the fact that the betting odds data is included. There are also many different clusters of the intercepts, more for men than for women, indicating that there is some unexplained variation in the outcome left that is not captured by the covariates. In addition, the intercepts are larger in absolute value than the coefficients of the four covariates. The betting odds, in contrast, have rather small effect sizes. Furthermore, the top cluster of the women’s intercept paths belongs to Serena Williams, who clearly stands out. Even at the highest value of λ, Williams forms a separate cluster.


variable. Players like Simona Halep and Agnieszka Radwańska are in the bottom cluster, indicating that they perform worse in the first round of a tournament compared to other players. From the data we can calculate that Halep and Radwańska won 44% and 40% of their first-round matches, respectively. This is indeed lower than, for example, Serena Williams (100%) or Victoria Azarenka (85%). It should be noted, however, that there are only four first-round matches of Williams in the data, compared to 25 of Halep. For men, there are more clusters of the first round covariate. The estimated coefficients range from −0.382 (Tommy Haas) to 0.443 (Andy Murray), compared to the two coefficients of −0.108 and 0.017 for women. The first round covariate thus seems more important for men than for women. For prob 365A, diffrank, and high profile there do not seem to be large differences between the sexes.

Concerning the confidence intervals, both procedures from Section 4.4.2 were applied. The first procedure involves estimating λ for every iteration, whereas the second takes the optimal λ from the cross-validation. For B = 50, no large differences between the two procedures were found. Since the computation time was considerably shorter for the second method, this procedure was applied with B set to 500. Figures A.2 to A.6 show the 95% confidence intervals for the men’s model and Figures A.7 to A.11 for the women’s model. It is clear that for many players and for many covariates the confidence intervals include zero. For both men and women, the confidence intervals are centered around zero for prob 365A and diffrank. This also holds for high profile and first round for women. Due to the identifiability constraint that $\sum_{r=1}^{m} \beta_{rj} = 0$, it could also be the case that the effect of prob 365A and diffrank is the same for all players. We do see that Djokovic, Nadal, Federer and Murray have positive intercepts, just like Serena Williams, Sharapova, and Azarenka. Furthermore, a number of players have significant negative intercept coefficients. We can also see that Wickmayer performs well in high-profile tournaments, similar to Wawrinka, Berdych, Nadal, Federer, and Djokovic. Murray, Federer and Djokovic also have positive coefficients for first round, whereas Wawrinka’s and Haas’ coefficients are negative.


subset. We can check that except for Venus Williams, these are also the players with the highest estimated intercepts.

Table 5: Top five players with the highest winning percentages

(a) Men
     Name           %      n
  1  Djokovic N.    0.81   255
  2  Federer R.     0.75   220
  3  Nadal R.       0.75   220
  4  Murray A.      0.67   195
  5  Ferrer D.      0.57   191

(b) Women
     Name           %      n
  1  Williams S.    0.87   164
  2  Sharapova M.   0.69   150
  3  Azarenka V.    0.68   158
  4  Kvitova P.     0.63   169
  5  Williams V.    0.63   107

A men’s model with only prob 365A and the intercepts predicts the correct winner 71.8% of the time, whereas a model with diffrank, high profile, first round and the intercepts has a percentage of 73.8% (Table 6). Furthermore, a model with only the intercepts predicts the winner of the match correctly 72.1% of the time. Thus, adding prob 365A actually decreases the predictive performance of the intercept-only model. For women, the predictions are somewhat worse. In the same order, the models have a percentage of 66.2%, 66.9%, 66.1%. Adding covariates therefore does not seem to make a large difference compared to the intercept-only model.


Table 6: Prediction percentages for men and women

Sex     Intercepts   prob 365A        diffrank, high profile,          n
                     and intercepts   first round and intercepts
Men     72.1         71.8             73.8                             1747
Women   66.1         66.2             66.9                             2026

5.3.2 Surface

The models of the previous section were estimated separately for matches played on hard court, clay and grass. The coefficient paths of first round could not be estimated, as, in particular for grass, there were too few first-round matches. For men, the same holds for high profile. In addition, a model that included both prob 365A and diffrank was not estimable for men either. Figure A.12 shows the women’s coefficient paths of the intercepts on the left, and those of prob 365A on the right. The hard court results are in the top row, those of clay in the middle row and those of grass in the bottom row. In a similar fashion, Figure A.12 (Cont.) shows the paths of diffrank on the left and high profile on the right.

For the intercepts, Serena Williams forms her own cluster on hard court, with $\hat{\beta}_0 = 0.402$, whereas the other players have $\hat{\beta}_0 = -0.014$. There are four intercept clusters for clay (again with Serena Williams forming her own cluster with the highest estimated coefficient), whereas the intercepts are all estimated to be zero for grass. For prob 365A, there are many different clusters for all three surfaces. The diffrank variable is close to zero for hard court and grass, but not for clay. Lastly, high profile is included for all three surfaces. There are thus some differences between the surfaces: the intercepts are eliminated for the grass model, but not for the clay and hard court model. diffrank also seems to be more relevant for the clay surface.


grass the lowest. Although the difference between hard court and clay is small, the results of grass are seven to ten percentage points lower than those of hard court. However, there were only 171 grass matches analyzed, such that the results could be subject to higher variation than hard court or, to some extent, clay.

For women, the predictions of hard court are very similar to the results of the general model and do not show much variation. For clay, a model with prob B365A and the intercepts performs slightly better than the other two models. For matches on grass we see the opposite again: the betting odds model performs worse than the other two models.

Table 7: Prediction percentages for different surfaces

Surface        Intercepts   prob 365A        diffrank          n
                            and intercepts   and intercepts¹
Men
  Hard court   73.4         73.6             74.0              1071
  Clay         75.6         75.8             77.6              505
  Grass        66.1         63.2             64.9              171
Women
  Hard court   66.7         66.9             67.8              1383
  Clay         68.4         71.4             66.5              472
  Grass        74.3         69.0             70.8              171

¹ The women’s models also contain high profile.

5.3.3 Time


Table 8: Prediction percentages for different time periods

Time period    Intercepts   prob 365A        diffrank          n
                            and intercepts   and intercepts¹
Men
  2010-2013    69.7         69.7             70.0              1049
  2014-2017    75.8         76.3             76.0              658
Women
  2010-2013    68.5         69.2             71.7              1138
  2014-2017    67.1         68.0             68.4              823

¹ The women’s models also contain high profile.

own cluster in both time periods; her $\hat{\beta}_0$ is the highest in both models. For the later time period, however, the other players’ intercepts form one cluster and are close to zero. For diffrank and high profile we can also see that there are more clusters in 2010-2013 than in 2014-2017. Their effect sizes are also more widespread in the earlier period: for high profile, they range from −0.749 to 0.565 in 2010-2013, compared to −0.014 to 0.225 in 2014-2017. For diffrank, this is −0.485 to 0.338 compared to −0.047 to 0.085. The range of the coefficients of prob 365A also became slightly smaller, however: from −0.549 to 0.352 in 2010-2013 to −0.336 to 0.381 in 2014-2017.

The prediction accuracy of the betting odds slightly decreased for the years 2014-2017 for women (Table 8). We can also see that for both time periods, the model with betting odds is slightly better than the intercept-only model, but also slightly worse than the model with diffrank, high profile and first round. The difference between this model and the model with betting odds does become smaller over time: there is a difference of 2.5 percentage points in 2010-2013, whereas for 2014-2017 this is 0.4 percentage points.
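The gaps quoted here follow directly from Table 8. As a small check, the sketch below recomputes them from the table values; the dictionary layout and the helper name `gap` are only illustrative, and the gap is defined as the accuracy of the model with diffrank minus that of the model with betting odds.

```python
# Prediction percentages copied from Table 8:
# (intercepts only, prob 365A and intercepts, diffrank and intercepts)
table8 = {
    ("Men", "2010-2013"): (69.7, 69.7, 70.0),
    ("Men", "2014-2017"): (75.8, 76.3, 76.0),
    ("Women", "2010-2013"): (68.5, 69.2, 71.7),
    ("Women", "2014-2017"): (67.1, 68.0, 68.4),
}


def gap(row):
    """Accuracy of the diffrank model minus accuracy of the betting odds model."""
    _, odds, diffrank = row
    return round(diffrank - odds, 1)   # rounded to one decimal, like the table


gaps = {key: gap(row) for key, row in table8.items()}

for (sex, period), value in gaps.items():
    print(f"{sex:6s} {period}: {value:+.1f} percentage points")
```

With this sign convention the women's gap shrinks from 2.5 to 0.4 percentage points, while the men's gap changes sign between the two periods.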


order to have equal time periods. As the data from 2018 does not go further than August 5, it seemed best to compare 2010-2013 and 2014-2017 instead of 2011-2014 and 2015-2018.

6 Discussion

6.1 Discussion of the results

Models that included the implied winning probability of the betting odds often did not even outperform an intercept-only model. Only for the women’s model using matches played on clay was there a small increase in prediction accuracy. At the same time, adding diffrank, high profile and sometimes first round did not improve the accuracy of an intercept-only model either. This finding is rather interesting, as a model with only player-specific intercepts assumes that a player’s strength is constant over time. A model with prob 365A could allow for varying strengths by adding information on a player’s winning probability for every match. It is unclear why the betting odds do not improve the predictions; even if the intercepts contain more useful information than the betting odds, it would be expected that the coefficients of the betting odds would be estimated to be zero and eliminated from the model. However, this was not the case either.

Interestingly enough, there did seem to be a difference in prediction accuracy between the general men’s and women’s models. Although from the literature review we might expect that men’s tennis is slightly more competitive, their match outcomes appeared to be slightly more predictable. Perhaps this is again due to the fact that only the 30 players who played the most matches were included, which included the four top players Murray, Federer, Djokovic and Nadal. Table 5 showed that Djokovic, Federer and Nadal have won 75% or more of their matches. Furthermore, the four players together have played almost 40% of the total of matches in the subset. Hence, it need not be the case that the betting odds include more information and become better predictors for men, but that predicting a win for only Murray, Federer, Djokovic and Nadal already increases the prediction accuracy significantly.


the most dominant player. In the models this was reflected by the fact that she often has her own intercept cluster, where her coefficient estimate is always the highest. This could be interpreted as “leftover” strength after including, for example, the implied probabilities of the betting odds. But why did these implied probabilities not estimate a higher probability of winning for Serena Williams? Maybe we do see an example of the favorite-longshot bias here. Since Williams might have been nearly always the favorite, it could be the case that the betting odds nearly always underestimate her. From Figure 2 we saw that there might be some evidence of the favorite-longshot bias of bet365’s odds. Similar reasoning could be applied to the Big Four, who also have positive intercept coefficients. Perhaps the betting odds also consistently underestimate their probability of winning.

Besides the intercepts, we saw that the high profile and first round covariates also hold valuable information for specific players. Wickmayer, Wawrinka, Berdych, Nadal, Federer, and Djokovic all did well in high-profile tournaments, and Murray, Federer and Djokovic also in first-round matches. This result could be beneficial for bettors and bookmakers to incorporate, as it yields extra information on the winning probability of certain players.

Regarding surface, we saw that for women on clay the model with betting odds performed better than the model with diffrank and high profile. This may be a small piece of evidence that the betting odds include information on the type of surface, such as the fact that some players are better on clay. However, for hard court and grass this was not the case. The suggestion from the literature review that clay matches are more predictable than grass matches is not in line with the prediction accuracies of the women’s models: the intercept-only model and the model with diffrank and high profile on grass outperform those on clay. For men, however, we did see that grass matches were the least predictable and clay matches the most. The differences between the models with the betting odds and with diffrank were similar across different surfaces, so that there is no evidence that betting odds use more information on a specific surface.


factors. Except for Serena Williams, the intercepts were close to zero in 2014-2017 and the coefficient ranges of the other covariates also grew smaller. This might also be evidence that the tennis players are now more alike, such that they have similar (closer to zero) coefficients. On the other hand, the coefficient range of the betting odds did not show this effect as much. The range grew slightly smaller, but not by much. We also saw that the predictions of the betting odds model grew closer to the predictions of the model with diffrank and high profile, although both predictions decreased. The differences are rather small, so that the difference could also simply be a result of chance. For men, the predictions also improved over the years. The difference between the model with betting odds and with diffrank changed from −0.3 percentage points to 0.3 points. Again, it is not necessarily the case that betting odds now use more available information from diffrank. The differences are so small that they could also have arisen due to chance.

6.2 Limitations

A limitation of the results is certainly that only a subset of the data could be used. Only the 30 players who played the most matches over 2010-2018 were selected, which were often higher-ranked players. Including more players would have resulted in adding players who played fewer matches, of whom the player-specific coefficients could often not be estimated. For example, the number of first-round matches would be very limited for these players, in particular when selecting matches on a certain surface or in a certain time period. As we saw in Section 5.2 and Table 4, there are a number of differences between the subset and the original dataset. As a consequence, the results of this thesis only apply to the subset, and not to the original dataset.


the other players). Having had different values for the log-likelihood would have been preferred, as they provide another measure of model fit.

Furthermore, in this thesis only the betting odds of bet365 have been used. Since most matches had betting data of this bookmaker, bet365 was the most logical option. However, considerable data is also present for Expekt, Pinnacles Sports and Ladbrokes, as could be seen in Table A.2. From Table A.3 and Figure 1 we know that correlations between these variables are high, but they are not perfectly correlated either. It would be interesting to see if other betting agencies perform better or worse than bet365 in terms of the information used from ranking difference and other variables.

7 Conclusion

The aim of this thesis was to investigate whether betting odds of tennis matches include all available information to set accurate odds. From the results of this thesis it appears that betting odds do not include all information from rank differences, or from the fact that a match is a high-profile match or a first-round match. Using lasso models, we could see that these three covariates are often still included in the model, suggesting that they are still relevant in determining the outcome of a match. This was also reflected by the prediction results; across sex, surface and different time periods, models with betting odds did not outperform models with the three match characteristics. Surprisingly enough, a model with only (player-specific) intercepts did just as well.

However, there was some evidence that for later years, the covariates diffrank, high profile and first round are less important for women. For grass matches we could also see that the intercepts were eliminated from the model, as opposed to clay and hard court matches. In addition, the first round variable is a more important factor for men than for women. Hence, the information used by betting odds might depend on the subset used.


in a high-profile tournament or in a first-round match. This is interesting information for the improvement of the accuracy of betting odds. Furthermore, several players appeared to have a high intercept coefficient, suggesting that the implied winning probability of the betting odds could sometimes be too low for these players.

References

Addley, Esther (2015). Heather Watson defiantly slams ‘cowardly’ social media abuse. The Guardian. Retrieved from: https://www.theguardian.com/sport/2015/jul/01/heather-watson-defiantly-slams-cowardly-social-media-abuse. Accessed: 13-11-2018.

BBC Sport (2016). Wimbledon: Kevin Anderson angry at ‘death threats’ after first-round loss. BBC Sport. Retrieved from: https://www.bbc.com/sport/tennis/36653337. Accessed: 13-11-2018.

Bondell, Howard D. and Brian J. Reich (2009). Simultaneous factor selection and collapsing levels in ANOVA. Biometrics 65 (1), 169–177.

Brier, Glenn W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review 78 (1), 1–3.

Cain, Michael, David Law, and David Peel (2000). The favourite-longshot bias and market efficiency in UK football betting. Scottish Journal of Political Economy 47 (1), 25–36.

Cain, Michael, David Law, and David Peel (2003). The favourite-longshot bias, bookmaker margins and insider trading in a variety of betting markets. Bulletin of Economic Research 55 (3), 263–273.


Devic, Aleks (2018). Herald Sun backs Mark Knight’s cartoon on Serena Williams. Herald Sun. Retrieved from: https://www.heraldsun.com.au/news/victoria/herald-sun-backs-mark-knights-cartoon-on-serena-williams/news-story/30c877e3937a510d64609d89ac521d9f. Accessed: 13-11-2018.

Du Bois, Cind and Bruno Heyndels (2007). It’s a different game you go to watch: competitive balance in men’s and women’s tennis. European Sport Management Quarterly 7 (2), 167–185.

Fernandez, Jaime, A. Mendez-Villanueva, and B.M. Pluim (2006). Intensity of tennis match play. British Journal of Sports Medicine 40 (5), 387–391.

Forrest, David, John Goddard, and Robert Simmons (2005). Odds-setters as forecasters: The case of English football. International Journal of Forecasting 21 (3), 551–564.

Forrest, David and Ian McHale (2007). Anyone for tennis (betting)? The European Journal of Finance 13 (8), 751–768.

Gneiting, Tilmann and Adrian E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 (477), 359–378.

Independent Review Panel (2018). Independent review of integrity in tennis. Retrieved from: http://www.tennisintegrityunit.com/storage/app/media/Independent%20Reviews/IRP-2018/Interim%20Report.pdf. Accessed: 25-10-2018.

Klaassen, Franc and Jan R. Magnus (2014). Analyzing Wimbledon: The power of statistics. Oxford University Press, USA.

Lahvička, Jiří (2014). What causes the favourite-longshot bias? Further evidence from tennis. Applied Economics Letters 21 (2), 90–92.


Lisi, Francesco and Germano Zanella (2017). Tennis betting: can statistics beat bookmakers? Electronic Journal of Applied Statistical Analysis 10 (3), 790–808.

McHale, Ian and Alex Morton (2011). A Bradley-Terry type model for forecasting tennis match results. International Journal of Forecasting 27 (2), 619–630.

O’Donoghue, Peter and Billy Ingram (2001). A notational analysis of elite tennis strategy. Journal of Sports Sciences 19 (2), 107–115.

Oelker, Margret-Ruth and Gerhard Tutz (2017). A uniform framework for the combination of penalties in generalized structured models. Advances in Data Analysis and Classification 11 (1), 97–120.

Official Website Wimbledon (2018). Seeds: Information about seeds and seeding formulas for Wimbledon. Retrieved from: https://www.wimbledon.com/en_GB/atoz/seeds.html. Accessed: 19-11-2018.

Rothenberg, Ben (2018). Signs of possible match-fixing in Wimbledon men’s doubles. The New York Times. Retrieved from: https://www.nytimes.com/2018/07/11/sports/tennis/match-fixing-wimbledon-mens-doubles.html. Accessed: 25-10-2018.

Schauberger, Gunther and Gerhard Tutz (2017a). BTLLasso - A common framework and software package for the inclusion and selection of covariates in Bradley-Terry models. University of Munich, Department of Statistics: Technical Reports, No. 202.

Schauberger, Gunther and Gerhard Tutz (2017b). Subject-specific modelling of paired comparison data: A lasso-type penalty approach. Statistical Modelling 17 (3), 223–243.


Stonestreet, John (2018). Belgium detains 13 in tennis match-fixing probe. Reuters. Retrieved from: https://www.reuters.com/article/us-sport-tennis-matchfixing-belgium/belgium-detains-13-in-tennis-match-fixing-probe-idUSKCN1J10VC. Accessed: 25-10-2018.

The Economist (2017). How data changed gambling. The Economist. Retrieved from: https://www.economist.com/the-economist-explains/2017/07/19/how-data-changed-gambling. Accessed: 30-11-2018.


Appendix

A The BTLLasso package

The BTLLasso package was written by Gunther Schauberger and Gerhard Tutz from the Ludwig Maximilian University of Munich. The package allows for the modeling of heterogeneity in paired comparison data by using the framework of Sections 4.1-4.4. More information on the package can be found in Schauberger and Tutz (2017a).

A.1 Model specification

The response object The response object, named Y, needs to be specified by the response.BTLLasso() function. This function incorporates the following arguments:

response.BTLLasso(response, first.object = NULL, second.object = NULL, subject = NULL).

The first argument, response, is a vector with the outcomes of the matches. The second and third, first.object and second.object, are vectors with the names of players A and B, respectively. The last argument is a vector with the subjects, for which match id is used.

The covariates The subject-specific covariates are specified with X in the cv.BTLLasso function. X thus contains the xi of (4) and is n × p.

Specifying the penalty terms Choosing which covariates are penalized is done by the ctrl.BTLLasso() function:


.diffs = TRUE, penalize.order.effect.absolute = TRUE, penalize.order.effect.diffs = FALSE).

Of main interest are the specification of the intercept and subject-specific penalty terms in this command:

- penalize.intercepts: if set to TRUE, P1 is activated and intercepts are penalized as in (7)

- penalize.X: if set to TRUE, P2 is activated according to (8)

scale is TRUE by default, such that the covariates are automatically scaled to have a variance of 1. The coefficients can be rescaled to the original units of the covariates when, for example, plotting the coefficient paths. Although rescaling has the advantage that the estimated coefficients can be interpreted directly, it does not allow for a direct comparison of the effect sizes. The results in this thesis are therefore shown with scaled covariates.
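The relation between coefficients on the scaled and on the original scale can be illustrated with a small sketch. This is not the package's implementation: the function names and the toy values are hypothetical, only division by the sample standard deviation is shown, and any centering the package may additionally apply is left out.

```python
from statistics import stdev


def scale_to_unit_variance(values):
    """Divide a covariate by its sample standard deviation (variance becomes 1)."""
    sd = stdev(values)
    return [v / sd for v in values], sd


def coefficient_on_original_scale(beta_scaled, sd):
    """Convert a coefficient of the scaled covariate back to the original units.

    The linear predictor must not change: beta_scaled * (x / sd) = (beta_scaled / sd) * x.
    """
    return beta_scaled / sd


# Hypothetical ranking differences and a hypothetical coefficient estimate.
diffrank = [-120.0, -35.0, -4.0, 10.0, 58.0, 300.0]
scaled, sd = scale_to_unit_variance(diffrank)

beta_scaled = -0.25
beta_original = coefficient_on_original_scale(beta_scaled, sd)
```

Because every covariate has the same variance after scaling, `beta_scaled` values can be compared across covariates, whereas `beta_original` is expressed per unit of the covariate (here, per ranking position).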

A.2 Model estimation

The main function of the package is cv.BTLLasso. This function estimates the specified model and uses cross-validation to estimate the tuning parameter. It uses the following arguments:

cv.BTLLasso(Y, X = NULL, Z1 = NULL, Z2 = NULL, folds = 10, lambda = NULL, control = ctrl.BTLLasso(), cores = folds, trace = TRUE, trace.cv = TRUE, cv.crit = c("RPS", "Deviance"))

The arguments Y and X are specified as in A.1. The specification of the penalties can be incorporated with the control argument. For lambda, we can specify a range of values for which the fitting procedure finds the optimal λ. Furthermore, the number of folds k is set to 10, but can be specified manually.

A.3 Confidence intervals


boot.BTLLasso(model, B = 500, lambda = NULL, cores = 1, trace = TRUE, trace.cv = TRUE, control = BTLLasso.ctrl(), with.cv = TRUE)

to obtain the bootstrap estimates for the confidence intervals. The number of iterations B can be changed, but is set to 500 by default. Since by default a new value of λ is estimated for every iteration, we can provide a smaller set of values for lambda in order to decrease the computation time. Alternatively, with.cv can be set to FALSE. If so, the optimal value of λ from the cross-validation analysis is used.
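The idea behind these intervals is the percentile bootstrap: resample the data with replacement, re-estimate the quantity of interest on every resample, and take the 2.5% and 97.5% quantiles of the B estimates. The sketch below is a generic illustration under strong simplifications and does not reproduce the internals of boot.BTLLasso; the statistic is a simple win share standing in for a penalized coefficient estimate, and the data, function names and seed are hypothetical.

```python
import math
import random


def percentile_interval(estimates, quantiles=(0.025, 0.975)):
    """Nearest-rank quantiles of the bootstrap estimates."""
    ordered = sorted(estimates)
    n = len(ordered)

    def nearest_rank(q):
        index = min(n - 1, max(0, math.ceil(q * n) - 1))
        return ordered[index]

    return nearest_rank(quantiles[0]), nearest_rank(quantiles[1])


def bootstrap_interval(data, statistic, B=500, seed=2018):
    """Percentile bootstrap: resample with replacement B times and re-estimate."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    estimates = [statistic(rng.choices(data, k=len(data))) for _ in range(B)]
    return percentile_interval(estimates)


def win_share(outcomes):
    """Stand-in statistic: share of matches won (1 = win, 0 = loss)."""
    return sum(outcomes) / len(outcomes)


# Hypothetical match outcomes of one player: 70 wins and 30 losses.
outcomes = [1] * 70 + [0] * 30
lower, upper = bootstrap_interval(outcomes, win_share)
```

In the setting of this thesis, the statistic would be the vector of estimated coefficients, obtained either with a newly cross-validated λ in every iteration or with λ held fixed at its cross-validated optimum.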

A.4 Visualizations

The function

own_plot_function(x, labelling = FALSE, ylimits = range(coefs), plots_per_page = 1, ask_new = TRUE, rescale = FALSE, which = "all", equal.ranges = FALSE, x.axis = c("loglambda", "lambda"), rows = NULL, subs.X = NULL, subs.Z1 = NULL, main.Z2 = "Obj-spec. Covariates", ...)

plots the coefficient paths for different values of λ. Here, x is a cv.BTLLasso object. This function has been slightly altered from the plotting function in the BTLLasso package. The labelling argument has been added to be able to remove the labels, which were the names of the players to which a certain coefficient path belonged. This was done in order to save space in the appendix. Furthermore, ylimits can now be manually specified, such that different models can be plotted on the same scale, allowing for a more direct comparison. If it is not specified, the range of the y-axis is the range of the coefficients.

For plotting the confidence intervals, we can use

own_boot_plot(x, xlimits = range(0, range(gamma.ci[, index:(index + m - 1)])), quantiles = c(0.025, 0.975), plots_per_page = 1, ask_new = TRUE, rescale = FALSE, which = "all", include.zero = TRUE, rows = NULL, subs.X = NULL, subs.Z1 = NULL, main.Z2 = "Obj-spec. Covariates", ...),


B Tables and Figures

Table A.1: Description of the variables

Men | Women | Description
ATP | WTA | Tournament number (e.g. “1” indicates the first tournament of the year, “2” the second tournament, etc.)
Location | Location | Location of the tournament
Tournament | Tournament | Name of the tournament
Date | Date | Date of the match (before 2003: start date of the tournament)
Series | Tier | Category of the tournament (e.g. “ATP 250” for men or “Premier” for women)
Court | Court | Type of court (“Outdoor” or “Indoor”)
Surface | Surface | Court surface
Round | Round | Round of the match
Best of | Best of | Maximum number of sets to be played
PlayerA (PlayerB) | PlayerA (PlayerB) | Name of player A (B)
ARank (BRank) | ARank (BRank) | Ranking position of player A (B) at the beginning of the tournament
APts (BPts) | APts (BPts) | Number of ranking points of player A (B) at the beginning of the tournament
A1 (B1) | A1 (B1) | Number of games won by player A (B) during the first set
A2 (B2) | A2 (B2) | Number of games won by player A (B) during the second set
A3 (B3) | A3 (B3) | Number of games won by player A (B) during the third set
A4 (B4) | Number of games won by player A (B) during the fourth set
A5 (B5) | Number of games won by player A (B) during the fifth set
ASets (BSets) | ASets (BSets) | Number of sets won by player A (B)
Comment | Comment | Comment on how the match ended, e.g. “Walkover”
B365A (B365B) | B365A (B365B) | bet365 odds of player A (B)
B&WA (B&WB) | Bet&Win odds of player A (B)
CBA (CBB) | CBA (CBB) | Centrebet odds of player A (B)
EXA (EXB) | EXA (EXB) | Expekt odds of player A (B)
GBA (GBB) | Gamebookers odds of player A (B)
IWA (IWB) | Interwetten odds of player A (B)
LBA (LBB) | LBA (LBB) | Ladbrokes odds of player A (B)
PSA (PSB) | PSA (PSB) | Pinnacles Sports odds of player A (B)
SBA (SBB) | Sportingbet odds of player A (B)
SJA (SJB) | SJA (SJB) | Stan James odds of player A (B)
UBA (UBB) | UBA (UBB) | Unibet odds of player A (B)
MaxA (MaxB) | MaxA (MaxB) | Maximum odds of player A (B)
AvgA (AvgB) | AvgA (AvgB) | Average odds of player A (B)
match id | match id | Unique match id for every match, given by Date PlayerA PlayerB
prob 365A | prob 365A | Winning probability of player A as implied by B365A and B365B
diffrank | diffrank | Difference in ranking positions between player A and B


Table A.4: Variation over time of correlations between B365A, EXA, PSA and LBA

Year B365.EX B365.PS B365.LB EX.PS EX.LB PS.LB


Table A.6: Model results of the five benchmark models for men

Dependent variable: Winner

                    (1)         (2)          (3)         (4)         (5)
prob B365A          5.067***                 4.884***
                    (0.257)                  (0.286)
diffrank                        −0.024***    −0.002                  −0.002
                                (0.002)      (0.001)                 (0.001)
log(B365A - 1)                                           −0.954***   −0.919***
                                                         (0.051)     (0.056)
Constant            −2.593***   −0.041       −2.500***   −0.228***   −0.222***
                    (0.142)     (0.051)      (0.155)     (0.057)     (0.057)
Observations        1,747       1,747        1,747       1,747       1,747
Log Likelihood      −946.896    −1,104.610   −945.776    −940.103    −938.950
Akaike Inf. Crit.   1,897.792   2,213.220    1,897.552   1,884.205   1,883.899


Contents

1 Introduction
2 Literature Review
2.1 Forecasting in tennis
2.2 Betting in tennis
2.3 Effect of gender
2.4 Effect of surface
2.5 Effect of time
3 Data
3.1 Description of the data
3.2 Data editing
3.3 Data analysis
4 The model
4.1 The Bradley-Terry model
4.2 Covariates
4.3 Penalty terms
4.4 Model estimation
5 Results
5.1 Implied probabilities
5.2 glm analyses
5.3 Model results
6 Discussion
6.1 Discussion of the results
6.2 Limitations
7 Conclusion
References
Appendix
A The BTLLasso package
A.1 Model specification
A.2 Model estimation
A.3 Confidence intervals
A.4 Visualizations
B Tables and Figures