
Predicting Goal-Scoring Opportunities in Soccer by Using Deep Convolutional Neural Networks

Martijn Wagenaar
16 November 2016

Master’s Thesis

Department of Artificial Intelligence, University of Groningen,

The Netherlands

Internal supervisor

Dr. M.A. Wiering,

Artificial Intelligence & Cognitive Engineering,

University of Groningen

External supervisor

Dr. W.G.P. Frencken,

Football Club Groningen;

Center of Human Movement Sciences, University of Groningen


Abstract

Over the past decades, soccer has encountered an enormous increase in professionalism. Clubs keep track of their players’ mental condition, physical condition, performances on the pitch, and so on. New technology allows for the automation of some of these processes. For instance, position data captured from soccer training and matches can be used to keep track of player fitness. There are opportunities for the use of position data in artificial intelligence as well. In the current research, position data gathered from matches played by a German Bundesliga team has been used to predict goal-scoring opportunities. The problem was approached as one of classification: given a snapshot of position data, is it more likely that a goal-scoring opportunity will be created or that ball possession will be lost? Snapshots of position data were taken and transformed to 256 × 256 images, which were used as input for machine learning algorithms. The performance of two deep convolutional neural networks was compared: an instance of GoogLeNet and a less complex 3-layered net.

GoogLeNet came out as the best performing network with an average accuracy of 67%. Although the final performance was not spectacular, there are some promising indicators for future research and possible practical uses.


Contents

1 Introduction
    1.1 Research question
    1.2 Thesis structure

2 Capturing the dynamics of soccer
    2.1 Machine learning: object trajectories and player formations
    2.2 Team centroids and surface area
    2.3 Clustering algorithms

3 Neural networks
    3.1 Perceptron
    3.2 Multi-layered perceptron and backpropagation
    3.3 Guiding the learning process
        3.3.1 Weight decay
        3.3.2 Momentum
        3.3.3 Dropout
    3.4 Convolutional Neural Networks
        3.4.1 Typical elements
        3.4.2 Training a deep convolutional neural network
    3.5 State of the art
        3.5.1 AlexNet
        3.5.2 GoogLeNet
        3.5.3 ResNet

4 Dataset definitions and exploration
    4.1 Dataset contents
    4.2 Preprocessing
        4.2.1 Data filtering
        4.2.2 Data cleansing
        4.2.3 Addition of derived variables and quantities
    4.3 Definition: goal-scoring opportunities
    4.4 Definition: ball possession
    4.5 Dataset exploration
        4.5.1 Player positions
        4.5.2 Team centroids
        4.5.3 Ball possession
        4.5.4 Backline crossings
        4.5.5 Passing

5 Methods, experiments and results
    5.1 Classification problem
    5.2 Constructing images for learning
        5.2.1 Negative examples
        5.2.2 The images
        5.2.3 Enhanced images
    5.3 Experiments
        5.3.1 Experiment 1: GoogLeNet
        5.3.2 Experiment 2: 3-layered CNN
        5.3.3 Experiment 3: KNN baseline
    5.4 Results

6 Discussion, conclusion and future research
    6.1 Conclusion
    6.2 Future research
        6.2.1 More data
        6.2.2 Higher number of classes
        6.2.3 Using a more specialized neural network
        6.2.4 An ensemble of classifiers

Appendices

A GoogLeNet


1

Introduction

Over the past decades, soccer has encountered an enormous increase in professionalism. The major clubs spend millions on transfers and salaries. Performances on the pitch are not only reflected in the standings, but also have substantial financial consequences. When a team does not do well for a number of matches, the coach is often quickly replaced.

The increase in professionalism includes extensive monitoring of players. Clubs keep track of their players’ mental condition, physical condition, performances on the pitch, and so on. New technology allows clubs to partially automate some of these processes. For example, position data of players can help in analyzing the physical condition of players, as it can be used to calculate the distance a player has covered during a match, what their average speed was, how fast the player accelerated, etcetera.

The large amount of available data offers opportunities in the field of computer science and artificial intelligence. When every position of every object on the field at any time is known, one could try to use these data in order to predict certain match events right before they occur. Particularly interesting is the occurrence and prediction of goal-scoring opportunities. In the end, scoring goals is what clubs, coaches and players want to achieve during a soccer match.

The prediction of goal-scoring opportunities can be approached as a classification problem. The occurrence of a goal-scoring opportunity then takes the role of a positive example, while the occurrence of the opposite (the non-occurrence of the aforementioned event) fulfills the role of a negative example. When labels are present, supervised machine learning methods can be applied in order to classify snapshots of a soccer match as either possible goal-scoring opportunities or as less promising states.

In the current research, goal-scoring opportunities were detected by analyzing soccer match position data. Abstract image representations of the intervals around goal-scoring opportunities (and non-opportunities) were used as input to convolutional neural networks [37]. Convolutional neural networks are particularly interesting because they might be able to detect higher-order tactical patterns in soccer by repeated convolutions. The convolutional neural networks could then be used to classify soccer snapshots into a class indicating that a goal-scoring opportunity might arise and a class which indicates the opposite scenario.

1.1 Research question

The main research question of this research is as follows:

Research question Can convolutional neural networks be used to classify and predict goal-scoring opportunities when presented with position data from soccer matches?

1.2 Thesis structure

This thesis mainly focuses on machine learning methods. After all, these are the methods that were used to predict goal-scoring opportunities. In chapter 2, however, some background about soccer game dynamics from a sports science perspective will be given. Only a small subset of all research on this topic used machine learning methods on position data.

Chapter 3 is about neural networks and machine learning. Basic descriptions of several types of neural networks are given. At the end of the chapter, an outline of convolutional neural networks will be given. The chapter concludes with describing several state of the art convolutional neural networks, which have shown spectacular performance in the Imagenet challenge [9].

Chapter 4 is all about the dataset. The chapter zooms in on several characteristics of the data set and how they were dealt with. Definitions of goal-scoring opportunities and ball possession are given, because of their importance for extracting training, validation and test data for machine learning. The last section of the chapter is about data exploration: some traditional and less traditional visualizations are applied to the data set in order to examine its contents.

Chapter 5 lists the experiments that were run and the results. While chapter 4 zooms in on initial preprocessing of the data, this chapter describes the process of taking processed data and transforming it to image datasets for machine learning. A total of three experiments with sub-experiments are described whose results are presented at the end of the chapter.


In the final chapter of this thesis, chapter 6, the results of the experiments are discussed and a conclusion is formulated. Finally, suggestions for future research are given.


2

Capturing the dynamics of soccer

The current standard to assess tactical performance is by game observation. Human observers rely on their knowledge about the game to extract certain tactical performance indicators, often guided by a set of rules, or handbook, which highlight factors of attention [27]. As a consequence, this type of analysis is subjective and slow. An automated system for tactical performance assessment would eliminate these major downsides of game observation by humans.

The main problem with building an automated system is to determine which indicators of performance it should take into account. A traditional method to capture soccer match dynamics is to keep track of the frequency of occurrences of specified events. These events can range from events which are relatively easy to detect automatically, such as the total number of passes, the percentage of successful passes and the number of shots on goal, to more explicitly defined events such as the number of key passes of a specific player. Frequencies of event occurrences have successfully been used to predict the outcome of soccer matches [38] and have been shown to be able to discriminate between successful and unsuccessful teams [7].

Frequencies of occurrences of specified events do not tell the whole story. One could miss out on ‘complex series of interrelationships between a wide variety of performance variables’ when not looking beyond frequencies [4]. Among the factors that are missed are higher-order tactical patterns which emerge during play. The following sections go into further detail on attempts that have been made to extract these higher-order tactical patterns from soccer position data.

2.1 Machine learning: object trajectories and player formations

Some research on soccer position data has focused on trajectories of objects and player formations using machine learning techniques.

Knauf et al. proposed a class of spatio-temporal convolution kernels to capture similarities between trajectories of objects [22]. The clustering method has been applied successfully to soccer data. Knauf et al. distinguished two different trajectory sequences: game initiations and scoring opportunities. Game initiations began with a pass from the goal keeper and ended when possession was lost or the ball was carried to the attacking third of the field (or the start of a new game initiation). Scoring opportunities marked the event of carrying the ball to a ‘predefined zone of danger’ in the attacking area of the field. After clustering on a particular set of trajectory sequences, the clusters represented tactical patterns which could ultimately be used for classification.

A similar approach involving a variant of self-organizing maps [23] focused on detecting formations rather than trajectories [13]. Two interactions between players were investigated: the interaction between the four attacking players of the French team, the four defending players of the Italian team and the ball during FIFA World Cup 2006, and vice versa. The tactical patterns that were fed to the network were short game initiations and long game initiations. Both tactics started when the ball was won from the opposing team by either the goalkeeper or the defense players.

When the pass following this gain of ball possession exceeded 30m, the tactical pattern was classified as a long game initiation. Otherwise, the tactical pattern was called a short game initiation. After training and manual expert labeling of the self-organizing map’s output layer, the self-organizing map successfully detected 84% of all game initiations.

Memmert et al. outline other opportunities in the field of self-organizing maps [28].

A special self-organizing map was trained with examples of defensive and offensive patterns from the UEFA Champions League quarterfinal of FC Bayern Munich against FC Barcelona from the 2008/2009 season. The research illustrates the use of self-organizing maps in evaluating tactical formations. The neural network that they used to train on formations found that the most frequent defensive pattern of FC Bayern Munich (formation 3) led to obtaining ball possession for 40% of the total occurrences against the most frequently used offensive formation of FC Barcelona (formation 2, see figure 2.1).

2.2 Team centroids and surface area

The previous section focused on raw position data, where algorithms were used to extract the relevant information. Sometimes it is beneficial to guide an algorithm in the right direction, by pointing out which derived variables from position data could possibly be of use. Two of these derived variables are team centroids (the mean positions of the team’s players) and the surface area of teams.


Fig. 2.1.: Most frequently used offensive formation from FC Barcelona (formation 2) and most frequently used defensive formation from FC Bayern Munich (formation 3) during the UEFA Champions League quarterfinal during the 2008/2009 season.

Image copied from [28].

Frencken et al. found that the centroid position of a soccer team can provide valuable information about the ‘coordinated flow of attacking and defending’ in 5 versus 5 soccer matches [12]. A positive linear relation was found between the two teams’ centroids: when the centroid of a team moved in a specific direction, the other team’s centroid did as well. This positive linear relation was present for both the y-direction (length) and x-direction (width). More interestingly, deviations from the above described pattern occurred during the build-up of goals. For 10 out of 19 scored goals, a crossing of the centroids occurred, deviating from the positive linear relation.

While these are interesting findings, they do not directly extrapolate to 11 versus 11 soccer matches. For 11 versus 11 matches, the team centroids showed low variability. The interaction between each player and his position-specific centroid, however, has shown more potential to capture player movement behavior [36]. Two methods have been particularly successful in capturing player movement dynamics with respect to their team's defending, midfield or forward centroids. An approximate entropy method evaluated the time series of individual player distances with respect to their accompanying centroids, and classified players into classes of predictability (low, medium and high predictability) [36, 34, 33]. A second method examined the relative phase of position-specific centroids, by analyzing two centroids as oscillators which could either be in-phase or not [25, 10].

Frencken et al. also examined possible linear relations between the surface areas of opposing teams, but none were found for 5 versus 5 matches. Other research has shown that the surface area of a team depends on the strength of the opposition [6].

When attacking, the surface area of a team was larger against weaker teams, while when defending, the surface area was larger against stronger opponents.


2.3 Clustering algorithms

Voronoi analysis of electronic soccer games has given some insights into soccer game dynamics [21]. Because it was, at the time of publication, hard to get data from real-world games, the video game FIFA Soccer 2003 (developed by EA Sports) was used to obtain a data set. It was argued that FIFA Soccer 2003 resembles soccer well and is similar to the actual game. Kim [21] used the positions of the players on the field as the point set to construct Voronoi diagrams. Kim argues that, when the total area of Voronoi segments of a team is larger than the area of the opponent’s segments, the first team dominates the latter. Fonseca et al. found a similar result by analyzing games of futsal with Voronoi diagrams [11]. It was found that the area of the Voronoi segments - the total area of the ‘dominant regions’ of the players - was greater for the attacking team and smaller for the defending team.


3

Neural networks

This chapter gives a theoretical background of several types of neural networks.

These types of neural networks will later be used to predict goal-scoring opportunities in soccer by using position data as the neural network’s input.

3.1 Perceptron

The smallest building block of modern-day neural networks is the perceptron [35]. A perceptron maps its input vector $\mathbf{x}$ to a binary output $f(\mathbf{x})$ by computing the dot product between the input and a vector of weights $\mathbf{w}$ (see equation 3.1). If the outcome of this calculation is above a certain threshold value, the output of the perceptron is 1; otherwise, the output is 0. In modern artificial neural networks, the threshold value is incorporated into the network by a bias term $b$. Figure 3.1 shows a schematic example of a perceptron.

$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases} \tag{3.1}$$

Fig. 3.1.: A perceptron with two inputs.

A single-layer perceptron is a linear classifier: the parameters of the network (weight vector and bias term) can be set to approximate a linear function. The parameters of a perceptron are determined by a learning process. In order to train a perceptron, a training set is required. In the training example below, $D = \{(\mathbf{x}_1, d_1), \ldots, (\mathbf{x}_s, d_s)\}$ is the training set, where $s$ is the number of training examples, $\mathbf{x}_j$ is the input vector of example $j$ and $d_j$ is the label (1 or 0) of example $j$.


A simple learning algorithm to train a perceptron is as follows:

1. Initialize the weight vector $\mathbf{w}$ to either zeros or random small values;

2. For each example $j$ in the training set $D$:

a) Calculate the actual output:

$$y_j(t) = f[\mathbf{w}(t) \cdot \mathbf{x}_j] = f[w_0(t)x_{j,0} + w_1(t)x_{j,1} + \cdots + w_n(t)x_{j,n}] \tag{3.2}$$

b) Update the weights:

$$w_i(t+1) = w_i(t) + (d_j - y_j(t))\, x_{j,i} \tag{3.3}$$

3. Repeat step 2 until

a) the iteration error at time $t$ is less than a user-specified threshold; or

b) the algorithm has run for a predetermined maximum number of iterations.
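A compact Python rendering of this training loop is sketched below; the toy dataset (a logical AND problem), the learning rate and the stopping criteria are illustrative assumptions, not values used in this thesis.

    import numpy as np

    # Toy training set: logical AND on two binary inputs (illustrative only).
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    d = np.array([0, 0, 0, 1], dtype=float)

    w = np.zeros(X.shape[1])     # step 1: initialize the weight vector
    b = 0.0                      # bias term, see equation 3.1
    eta = 0.1                    # learning rate (added here; eq. 3.3 uses a unit step)

    for epoch in range(100):     # step 3b: maximum number of iterations
        errors = 0
        for x_j, d_j in zip(X, d):                  # step 2: loop over training examples
            y_j = 1.0 if w @ x_j + b > 0 else 0.0   # step 2a: actual output (eq. 3.2)
            w += eta * (d_j - y_j) * x_j            # step 2b: weight update (eq. 3.3)
            b += eta * (d_j - y_j)
            errors += int(y_j != d_j)
        if errors == 0:          # step 3a: stop when the iteration error reaches zero
            break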

3.2 Multi-layered perceptron and backpropagation

An extension of the perceptron is the multi-layered perceptron (MLP). An MLP consists of multiple neurons, whereas the perceptron only contained a single neuron, and can distinguish data that are not linearly separable. It has been shown that MLPs are capable of approximating any measurable function to any desired degree of accuracy [17].

MLPs consist of at least three layers: an input layer, one or more hidden layers and an output layer. An MLP is fully-connected: every neuron of layer k is connected to every neuron in layer k + 1 by corresponding weights. To guarantee that an MLP is able to learn non-linear functions, non-linear activation functions are applied to the summed output of the hidden neurons. The outcome of the activation functions is then passed through to the neurons in the next layer. For MLPs, sigmoid functions are often used as activation functions, such as the hyperbolic tangent:

$$f(v_i) = \tanh(v_i) \tag{3.4}$$


or the logistic function:

$$f(v_i) = (1 + e^{-v_i})^{-1} \tag{3.5}$$

where $v_i$ is the weighted sum of the inputs connected to the $i$th neuron in a certain layer, and $f(v_i)$ is the output of that neuron.

Training an MLP differs a lot from training a perceptron, especially because of added non-linearities and an additional hidden layer. First, we would have to define a cost function which we would like to minimize in order to decrease the error. A frequently used cost function is the mean squared error:

$$E(n) = \frac{1}{2} \sum_j e_j^2(n) \tag{3.6}$$

where $e_j(n) = d_j(n) - y_j(n)$ is the error of output neuron $j$ for the $n$th training example, $d_j(n)$ is the target value and $y_j(n)$ is the value computed by the MLP. The main problem is now to determine to what extent specific neurons have contributed to the error value. This is also known as the credit assignment problem in artificial neural networks: how much should a particular weight be adapted in order to minimize the error [14]?

The most common way is to propagate the error value through the network, starting from the output layer. This method is called backpropagation of the error [43]. Using gradient descent, we find the following general update rule for a weight $w_{ij}(n)$, where $i$ reflects the node in the layer which is closest to the input:

$$\Delta w_{ij}(n) = -\eta \frac{\partial E(n)}{\partial v_j(n)}\, y_i(n) \tag{3.7}$$

where $\eta$ is the learning rate and $y_i$ the output of the previous neuron. Note that the weights are adapted in the direction of the negative gradient, hence the term gradient descent. The derivative term in equation 3.7 varies for hidden nodes and output nodes. For output neurons:

$$\frac{\partial E(n)}{\partial v_j(n)} = e_j(n)\, f'(v_j(n)) \tag{3.8}$$

and for hidden neurons:


$$\frac{\partial E(n)}{\partial v_j(n)} = f'(v_j(n)) \sum_k \frac{\partial E(n)}{\partial v_k(n)}\, w_{jk}(n) \tag{3.9}$$

where $f'$ is the derivative of the activation function used in the neural network. The sum term in equation 3.9, which sums over all output neurons $k$ to which hidden neuron $j$ is connected, shows that the hidden weight updates rely on computing the cost function derivative with respect to the output weights first: the error is backpropagated. Although the above example is for an MLP with only one hidden layer, deeper neural networks with a much more complex structure can still be trained by backpropagation. The weight update for a neuron then relies on all neurons that are located between this particular neuron and the neural network’s output.
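As an illustration of equations 3.6–3.9, the NumPy sketch below performs one gradient-descent step for a single-hidden-layer MLP with tanh hidden units, linear outputs and a mean squared error loss; the layer sizes, learning rate and toy data are arbitrary choices for the example, not values used in this thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy network: 4 inputs, 8 tanh hidden units, 2 linear outputs (arbitrary sizes).
    W1, b1 = rng.normal(0.0, 0.1, (8, 4)), np.zeros(8)
    W2, b2 = rng.normal(0.0, 0.1, (2, 8)), np.zeros(2)
    eta = 0.05                                  # learning rate

    x = rng.normal(size=4)                      # one training input
    d = np.array([1.0, 0.0])                    # its target

    # Forward pass.
    v1 = W1 @ x + b1
    y1 = np.tanh(v1)                            # hidden activations (eq. 3.4)
    y2 = W2 @ y1 + b2                           # linear output neurons
    e = d - y2                                  # per-output error, as in eq. 3.6

    # Backward pass: dE/dv_j for the output neurons (eq. 3.8, with f'(v) = 1 here).
    delta2 = -e
    # dE/dv_j for the hidden neurons (eq. 3.9); tanh'(v) = 1 - tanh(v)^2.
    delta1 = (1.0 - y1 ** 2) * (W2.T @ delta2)

    # Gradient-descent weight updates (eq. 3.7).
    W2 -= eta * np.outer(delta2, y1)
    b2 -= eta * delta2
    W1 -= eta * np.outer(delta1, x)
    b1 -= eta * delta1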

3.3 Guiding the learning process

Neural networks can contain thousands or millions of parameters which all have to be set to appropriate values to approximate a certain function. It is not hard to imagine that the process of learning is not straightforward. Over the past decades several methods have been developed that guide the learning process by making changes to the weight update equations for the neural network’s parameters. The following sections will be about some of these methods, particularly the ones that are often used in deep neural networks.

3.3.1 Weight decay

It has been shown that penalizing large weights in a neural network can aid generalization [29]. The easiest way to achieve this is by adding a penalty term for large weights to the neural network’s cost function:

$$E(\mathbf{w}) = E_0(\mathbf{w}) + \frac{1}{2} \lambda \sum_i w_i^2 \tag{3.10}$$

where $E_0$ is the original error function (such as the sum of squared errors as in equation 3.6), $\mathbf{w}$ the weight vector containing all free parameters in the network and $\lambda$ a hyper-parameter which determines to what extent large weights are penalized.

When using gradient descent, the weight update function becomes as follows:


$$w_i(t) = w_i(t-1) - \eta \frac{\partial E}{\partial w_i(t-1)} - \lambda \eta\, w_i(t-1) \tag{3.11}$$

By penalizing large weights, learning is pushed in a direction where all connections between neurons participate in producing the neural network’s outcome. When weights are not penalized for becoming larger and larger, there is a possibility that only a few of the network’s parameters are of importance and the majority is neglected. In such a scenario, a much smaller network would be able to achieve similar performance: the bigger network does not live up to its full potential.

3.3.2 Momentum

When adapting the weights in the direction of the negative gradient based on single examples (or single batches of examples), the weight adaptations tend to fluctuate. The weights are changed too much based on single examples, which blocks the path for the weights to find their optimal values. A simple method to counter this issue is to introduce some sort of momentum to the network’s weight update function [41, 19]. When certain weights are changed in the same direction for consecutive iterations, their momentum grows and the weights tend to be adapted in the same direction for the next iteration. Momentum can be implemented by altering the weight update function as follows:

$$w_i(t) = w_i(t-1) - \eta \frac{\partial E}{\partial w_i(t-1)} + \alpha\, \Delta w_i(t-1) \tag{3.12}$$

where α is the momentum parameter which decides how much the previous weight adaptation weighs.
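The two update rules of equations 3.11 and 3.12 can be combined into a single stochastic gradient descent step, sketched below in Python; the gradient callback and the hyperparameter values are placeholders, not settings used in this thesis.

    def sgd_step(w, grad, prev_delta, eta=0.01, lam=5e-4, alpha=0.9):
        """One SGD update with weight decay (eq. 3.11) and momentum (eq. 3.12).

        `w` holds the weights, `grad` is dE/dw for the current (mini-)batch and
        `prev_delta` is the previous weight change from equation 3.12.
        """
        delta = -eta * grad - lam * eta * w + alpha * prev_delta
        return w + delta, delta

    # Usage sketch: w, delta = sgd_step(w, compute_gradient(w), delta)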

3.3.3 Dropout

A recently developed technique which helps to prevent over-fitting in neural networks is dropout [39, 40]. When training data is limited, noise in the input data can cause the network to train on noise as if it were features of the input patterns. This ultimately leads to worse generalization and a lower test accuracy. Dropout tackles this issue by disabling neurons with a certain, pre-specified probability during the training phase. During one training iteration, each neuron is dropped with probability p. When p = 0.5, roughly half of the neurons are used for the forward and backward pass. Because fewer neurons are given the task to generate the desired output, they have to be more flexible and learn more general features. While this generally results in a higher training error, the test error is often greatly reduced, as has been shown with different sets of data [40].

Dropout can also be considered a form of model combination. With neural networks, it is often desirable to combine the output of multiple independently trained neural networks to obtain a better prediction [5, 2]. With very deep neural networks, for instance the convolutional neural networks used in the Imagenet competition, it is practically impossible to train multiple neural networks. The training time is simply too long. With dropout, for every iteration a unique model is used, due to the unique combination of available neurons.
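A minimal sketch of dropout applied to a layer's activations is given below; the inverted scaling (dividing by 1 - p so that no rescaling is needed at test time) is a common implementation choice and an assumption here, not a detail taken from this thesis.

    import numpy as np

    def dropout(activations, p=0.5, training=True):
        """Disable neurons with probability p during training (inverted dropout)."""
        if not training:
            return activations                            # all neurons are used at test time
        keep = np.random.rand(*activations.shape) >= p    # True for neurons that stay active
        return activations * keep / (1.0 - p)             # rescale the surviving activations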

3.4 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a type of neural network that makes extensive use of mathematical convolution to learn a function which is non-linear in nature. CNNs are particularly suited for image data because of convolution kernels which are shifted over an image. A fully-connected multi-layered perceptron would have too many connections and therefore focus too much on single pixels instead of patterns spanning multiple pixels.

The first CNN appeared in the literature in 1990. LeCun et al. used a small neural network incorporating two convolutional layers to recognize handwritten digits [26].

Over the last few years, CNNs and deep learning have experienced an enormous boost in popularity. This can mainly be ascribed to their recent successes in the Imagenet competition, which will be described in more detail in section 3.5.

The typical elements of CNNs are listed in the following subsections. For training deep neural networks, the training algorithm, or solver, differs slightly from the ones described before. The last part of this section lists the main differences between the training processes.

From now on, it is assumed that the convolutional neural networks are used for classification. The outputs of these networks are class-conditional probabilities: the networks output k numbers which indicate the probability that a presented input matrix belongs to a specific class.


3.4.1 Typical elements

A typical CNN consists of multiple layers. The first few layers of a CNN are often convolutional layers or pooling layers. The convolutional layers typically contain added nonlinearities (ReLUs) which process the output activations of the neurons.

After the convolutional layers, usually one or more fully-connected feedforward layers are used to obtain activation scores for every class. A softmax on the final output layer yields the final class-conditional probabilities.

Convolutions

Theoretically, convolutional layers in a CNN can be described as constrained fully-connected layers (where the constraints are so severe that the layers are no longer fully connected). The nature of the constraints, such as shared weights amongst subsets of neurons (the neurons belonging to a specific filter or kernel) and sparse connectivity (not every hidden or output neuron is connected to every input neuron), means that we can consider convolutional layers in a more practical way by addressing the principle of mathematical convolution. This section will primarily make use of the latter description.

The first convolutional layer in a CNN convolves an input representation, often in the form of multiple arrays, with a set of filters. Every filter, or kernel, in a convolutional neural network detects distinct features. A filter has its own set of weights which can be trained to extract specific features. For example, the weights of a 5 × 5 filter can be set to 1 around the edges and to 0 in the center.

When an input is convolved with this filter, the output activity is high for areas containing 5 × 5 rectangles, and low for other regions.

Filters in a CNN operate through the full depth of the input representation. In case of an RGB image, the input is represented as three stacked 2D arrays: one array for every color channel. The kernels thus operate on all three layers, taking every color into account. The convolution operations, however, are still two-dimensional in nature: convolutions are applied to independent two-dimensional slices (e.g. the red-colored channel of a kernel acts on the red input channel).

The number of filters in a convolutional layer determines the depth of the output volume of that particular layer. Let us consider an arbitrary activation at position (x, y, z) in an output volume of a convolutional layer. x and y tell us something about the spatial location where the filter was applied to the input. The value of z, however, tells us the filter number, or depth slice. For every layer, the number of filters, or depth, can be set manually: it functions as a hyperparameter for the CNN.

There are a couple of other hyperparameters which can be set for every layer.

The stride of a kernel determines how much a filter slides when it is applied repeatedly. It determines the spacing between applied filters. With a stride of 1, a filter slides 1 ‘pixel’ (when dealing with an input image, real image pixels are meant; in higher layers the term ‘pixel’ refers to ‘convolutional pixels’) to the side after applying the filter. When the stride is lower than the dimension of the kernels, there is a certain overlap between the filters. Another hyperparameter is zero-padding. When the dimensions of the output volume need to be controlled, one can add zeros to the sides of the input volume.

A single convolutional layer is not very effective in detecting higher-order patterns that are suitable for image classification. The strength of CNNs lies in stacking convolutional layers. The first layers will detect primary image features, such as edges. The latter layers, which take the output of the former layers as their input, will be able to detect higher-order features. In other words, the latter layers will detect patterns of patterns. When using kernels that are large enough and a sufficient amount of convolutional layers, the higher-order patterns can be very global when compared to the input image.
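The sketch below shows the basic operation for a single 2D filter with a configurable stride and optional zero-padding; it illustrates the mechanics only (strictly speaking it computes a cross-correlation, as deep learning libraries usually do) and ignores input depth, multiple filters and bias terms.

    import numpy as np

    def conv2d_single(image, kernel, stride=1, padding=0):
        """Slide one 2D filter over a 2D input and return the output activations."""
        if padding > 0:
            image = np.pad(image, padding)                 # zero-padding around the input
        kh, kw = kernel.shape
        out_h = (image.shape[0] - kh) // stride + 1
        out_w = (image.shape[1] - kw) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
                out[i, j] = np.sum(patch * kernel)         # one output 'pixel'
        return out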

Rectified Linear Units

In order for a CNN to learn a non-linear function, nonlinear activation functions have to be added to the net. Traditional neural networks often use a hyperbolic tangent $f(x) = \tanh(x)$ or a logistic sigmoid function $f(x) = (1 + e^{-x})^{-1}$ for added nonlinearities (see section 3.2).

Most modern, deep CNNs, however, use Rectified Linear Units (ReLUs) to implement nonlinearities. A ReLU outputs its input if it is above zero, and zero otherwise:

$$f(x) = \max(0, x) \tag{3.13}$$

One of the most important advantages of ReLUs is that neural networks with ReLUs have been shown to converge faster than networks with traditional activation functions [30].

Pooling

Pooling layers effectively subsample an input volume. Pooling layers shift a small filter, often 2 × 2, over an input volume, every time selecting the maximum value of the 4 numbers. Other forms of pooling exist, such as average and fractional pooling. In this thesis both max pooling and average pooling were used for the GoogLeNet architecture, and only max pooling for a self-constructed 3-layered convolutional neural network.

$$\begin{pmatrix} x_1 & x_2 \\ x_3 & x_4 \end{pmatrix} \Rightarrow \max(x_1, x_2, x_3, x_4)$$

The hyperparameter stride functions the same as for kernels: it determines how much the pooling filter is shifted every time it is applied. For pooling, the stride is often equal to the filter dimension (e.g. a stride of 2 for 2 × 2 filters), although there are exceptions such as 3 × 3 filters with stride 2 (e.g. AlexNet [24]).

Besides reducing the amount of computation in a network, pooling layers also prevent overfitting and therefore aid generalization. Because for every 2 × 2 region the maximum value is selected, a kind of translation invariance is introduced. For example, when an object is moved one pixel up on an input image, the pooling operation ensures that the final outcome is still similar to the non-translated version of the image.
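Below is a minimal max pooling sketch over non-overlapping 2 × 2 regions (stride equal to the filter size) for a single 2D feature map; a trailing odd row or column is simply trimmed in this illustration.

    import numpy as np

    def max_pool_2x2(feature_map):
        """2 × 2 max pooling with stride 2 on a single 2D feature map."""
        h, w = feature_map.shape
        blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
        return blocks.max(axis=(1, 3))     # maximum of every 2 × 2 region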

Fully-connected layers

In a typical CNN, multiple convolutional and pooling layers are followed by one or more fully-connected layers. These fully-connected layers take the output of the last convolutional layer and yield output activities for every single class.

Softmax regression

The output of the last fully-connected layer is k-dimensional, where k represents the number of classes in the training data. As a final step, the output activations have to be transformed to k numbers which reflect the probability that a certain input belongs to a certain class. These probabilities will then be used for calculating the loss or error for training and test examples. For the transformation to class-conditional probabilities, a softmax function is often used. Softmax regression is a form of logistic regression and expands it to a multi-class scenario:

$$h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)} = 1 \mid x^{(i)}; \theta) \\ p(y^{(i)} = 2 \mid x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)} = k \mid x^{(i)}; \theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} \begin{bmatrix} e^{\theta_1^T x^{(i)}} \\ e^{\theta_2^T x^{(i)}} \\ \vdots \\ e^{\theta_k^T x^{(i)}} \end{bmatrix} \tag{3.14}$$

where $x^{(i)}$ is a presented input, $p(y^{(i)} = n \mid x^{(i)}; \theta)$ the probability that the model output $y^{(i)} = n$ given $x^{(i)}$ and all model parameters $\theta$, and $\theta_n^T x^{(i)}$ is the intermediate output for class $n$, obtained by passing input $x^{(i)}$ through the parameters related to class $n$ in the model. The softmax function is then computed over all output activations.
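A numerically stable NumPy sketch of this transformation is given below; subtracting the maximum activation before exponentiating is a standard stability trick and not something described in this thesis.

    import numpy as np

    def softmax(activations):
        """Map k output activations to class-conditional probabilities (eq. 3.14)."""
        shifted = activations - np.max(activations)   # avoid overflow in exp for large activations
        exps = np.exp(shifted)
        return exps / np.sum(exps)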

3.4.2 Training a deep convolutional neural network

The most common way to train a deep convolutional neural network is by applying variants of gradient descent. The solver method used in this research is Nesterov’s accelerated gradient, which uses gradient descent as a basis but also includes momentum in its gradient computation:

$$V_{t+1} = \mu V_t - \eta \nabla L(W_t + \mu V_t) \tag{3.15}$$

$$W_{t+1} = W_t + V_{t+1} \tag{3.16}$$

where $\mu$ is the momentum value, $\eta$ the learning rate and $\nabla L$ the gradient.
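Equations 3.15 and 3.16 translate directly into an update step, sketched below; the gradient callback and the hyperparameter values are placeholders, not the settings used in this research.

    def nesterov_step(W, V, grad_fn, eta=0.01, mu=0.9):
        """One update with Nesterov's accelerated gradient (eqs. 3.15 and 3.16).

        `grad_fn(weights)` should return the loss gradient evaluated at the
        look-ahead point W + mu * V.
        """
        V_new = mu * V - eta * grad_fn(W + mu * V)
        return W + V_new, V_new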

Deep neural networks are often not trained in an online fashion where the input matrices are presented sequentially. Instead, the input matrices are presented in mini-batches. For every mini-batch, containing a pre-defined number of examples (typical values are 8-64, depending on the computational power of the available GPUs), dot products with the model’s parameters are computed for the full mini-batch to generate the network’s output. Usually, the larger the size of the mini-batch, the better the generalization, as the weight adaptations of the model depend less on single examples.

Another characteristic of deep neural networks is that, when dealing with a lot of model parameters, the models tend to overfit quite easily when not enough training examples are presented. To overcome this issue, forms of data augmentation are used to artificially enlarge the training data (e.g. [24]). Methods include extracting patches from the original images and altering color channel intensities.


3.5 State of the art

The Imagenet Large Scale Visual Recognition Challenge (ILSVRC) is a yearly recurring competition for object detection and object localization algorithms [9]. The Imagenet challenge consists of multiple sub-challenges, of which the image classification task has probably received most attention. For the image classification task, the classes (of a total of 1000 classes) of objects depicted on images have to be predicted by an image classification algorithm. The decisive performance measure is the top-5 error rate which has to be as low as possible. A top-5 error occurs when the actual class of an image is not among the top-5 predicted classes by the algorithm.

During the last couple of years, the image classification challenge has been dominated by deep convolutional neural networks [24, 42, 16]. Even among convolutional neural networks the accuracy has drastically improved over the past years. From 2012 to 2015 alone, the top-5 error decreased from 15.3% (AlexNet) to 4.49% (ResNet). The increase in accuracy is partially an effect of new types of deep convolutional architectures, but can be ascribed to the availability of more computational power as well.

The next sections describe three ILSVRC winners from the past years: AlexNet (2012), GoogLeNet (2014) and ResNet (2015). These convolutional neural networks all have a very different structure, which makes it interesting to list all of them and not only the best performing one.

3.5.1 AlexNet

AlexNet is a deep convolutional neural network which was entered in the ILSVRC2012 competition by Alex Krizhevsky et al. [24]. AlexNet achieved top-1 and top-5 error rates of 40.7 and 18.2 when using a single CNN, and error rates of 36.7 and 15.3 when averaging predictions of 7 similar CNNs.

The overall structure of AlexNet is depicted in figure 3.2. The first five layers of AlexNet are convolutional in nature. Convolutions with 96 kernels of size 11 × 11 × 3 and stride 4 act on the 3-dimensional input data. The second convolutional layer has 256 kernels of size 5 × 5 × 48 which act on the normalized and max-pooled output of the first convolutional layer. Note that the depth of the kernels in the second convolutional layer is half of the number of kernels in the previous layer: this is due to parallel processing on 2 GPUs. The third convolutional layer contains 384 kernels of size 3 × 3 × 256 which are connected to the normalized, max-pooled output of the second convolutional layer. The last two convolutional layers also use 3 × 3 convolutions: 384 kernels of size 3 × 3 × 192 and 256 kernels of size 3 × 3 × 192 for the fourth and fifth layer, respectively.

The five convolutional layers are followed by two fully-connected layers. These layers both contain 4096 neurons. The output of the last fully-connected layer is used for 1000-way softmax classification. Rectified Linear Units (ReLUs) were used as the activation function throughout the network, for fast computation and convergence.

AlexNet is a large CNN with around 60 million parameters. Krizhevsky et al. use several methods to improve the training process and reduce overfitting. Dropout [39, 40] (see section 3.3.3) was used in the fully-connected layers to prevent the network from overfitting. Two forms of data augmentation are used with the same purpose in mind. From the original ImageNet data, which contained 256 × 256 images, patches of 224 × 224 were extracted as well as mirrored versions, effectively increasing the amount of data by a factor 2048. These patches were used during training.¹ The other form of data augmentation encompassed altering the RGB channels in training images by adding a small quantity to every pixel related to the principal components of the image.

A small amount of weight decay (0.0005) and momentum (0.9) was used to facilitate learning.

Fig. 3.2.: The full AlexNet architecture. Parallel computations are done on two GPUs, which is illustrated in the figure by two separate pathways. Image copied from [24].

¹ During testing, five 224 × 224 patches (and mirrored versions) were extracted from an original 256 × 256 image: the four corner patches and a center patch. The final classification was then the result of averaging softmax predictions of ten distinct patches.

3.5.2 GoogLeNet

GoogLeNet [42], the winning convolutional neural network in the ILSVRC2014 classification challenge, is very different compared to the previously described AlexNet. Where AlexNet used relatively few convolution kernels which acted on big volumes of data, GoogLeNet introduced so-called Inception modules, where convolutions with differently sized kernels are applied in parallel. The outputs of the multiple convolutional layers within a module were concatenated and passed to the next layers. Figure 3.3 shows an illustration of a single Inception module. In the full CNN, Inception modules were stacked on top of each other, where the output of the previous module functioned as the input for the next.

Deep convolutional neural networks have the undesired property that the volumes of data, due to repeated convolutions, quickly become too large to be handled by current computer hardware. Some networks attempt to tackle this issue by using subsampling methods such as average or maximum pooling. In GoogLeNet, every time the computational requirements would increase too much to be handled by the hardware, the dimension of the convolutional volumes is reduced. This is achieved by using max pooling or average pooling and by applying 1 × 1 convolutions. This is clearly visible in figure 3.3: before the 3 × 3 and 5 × 5 convolutions, the input is convolved with small 1 × 1 kernels. 1 × 1 filters can be used to reduce the dimension of convolutional volumes by using fewer filters than the depth of the input volume: the depth of the convolutional volumes is reduced.

Fig. 3.3.: A single Inception module. Image based on [42].
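As a sketch of such a module (figure 3.3): the implementation below uses PyTorch, which is not the framework used in this thesis, and the channel counts are arbitrary arguments rather than GoogLeNet's actual values.

    import torch
    import torch.nn as nn

    class InceptionModule(nn.Module):
        """Four parallel branches whose outputs are concatenated along the channel axis."""

        def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
            super().__init__()
            self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)            # 1x1 conv
            self.branch3 = nn.Sequential(                                 # 1x1 reduction, then 3x3
                nn.Conv2d(in_ch, c3_red, kernel_size=1), nn.ReLU(inplace=True),
                nn.Conv2d(c3_red, c3, kernel_size=3, padding=1))
            self.branch5 = nn.Sequential(                                 # 1x1 reduction, then 5x5
                nn.Conv2d(in_ch, c5_red, kernel_size=1), nn.ReLU(inplace=True),
                nn.Conv2d(c5_red, c5, kernel_size=5, padding=2))
            self.branch_pool = nn.Sequential(                             # 3x3 max pool, then 1x1
                nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                nn.Conv2d(in_ch, pool_proj, kernel_size=1))

        def forward(self, x):
            # All branches preserve the spatial size, so their outputs can be concatenated.
            return torch.cat([self.branch1(x), self.branch3(x),
                              self.branch5(x), self.branch_pool(x)], dim=1)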

Because GoogLeNet is a very deep network with 22 layers with parameters (excluding pooling layers, which do not have parameters/weights), it can be hard to correctly adapt the weights using backpropagation. There is a problem of vanishing gradients: the error vanishes when it is propagated back into the network, leading to insufficient weight changes in the neurons near the input [3]. GoogLeNet deals with this problem by adding two auxiliary classifiers to the network, which are connected to intermediate layers. The output of these layers was taken into account for backpropagation during training: the error of the auxiliary classifiers was weighted with a factor 0.3 (as opposed to 1.0 for the final, ‘third’ output). In this way, the error did not vanish as much as it would have had there been only one output, as the intermediate classifiers were closer to the input than the final classifier. The auxiliary classifiers were not taken into account during test and validation time.

The network was trained on the Imagenet dataset by using a momentum of 0.9 and by decreasing the learning rate by 4% every 8 epochs. Dropout was used only in the fully-connected layers, with a value of 0.7 for the branches used for intermediate classification and with a value of 0.4 in the main classification branch. The designers of GoogLeNet [42] do not give any further details on the training process, and mention that it is hard to give definitive guidance on the most effective way to train the network.

For testing, seven independently trained versions of the same GoogLeNet model were used. These models were used for ensemble prediction. The only differences between the training processes of these models were in the sampling methodologies and the randomized input image order. Aggressive cropping was applied to the test data, leading to 144 crops per image. The softmax probabilities were then averaged over multiple crops for all individually trained networks, leading to a final classification.

The performance of GoogLeNet in terms of top-5 error was very good: a top-5 error of 6.67% was achieved, which is significantly better than the error rate of AlexNet (15.3%) and ILSVRC2013 winner Clarifai (11.2%) [44].

The full layout of GoogLeNet can be found in appendix A.

3.5.3 ResNet

Convolutional neural networks with a very high number of layers have the potential to learn more complex functions than networks which are shallower. Deeper networks do, however, not always perform better in terms of reducing the training error [15]. This is called the degradation problem. Note that this cannot be ascribed to overfitting, because then the test error would be higher and not the training error.

The winners of the ILSVRC2015 challenge, MSRA, think that the phenomenon occurs due to an inability of the deeper network to learn identity mappings when necessary [16]. The additional layers of the deeper network fail to map the identity function (note that when all additional layers map the identity function, performance would be identical to the more shallow network).

MSRA introduced a deep residual framework that deals with this issue by letting one or more successive convolutional layers learn residual functions rather than full mappings. Let $H(x)$ be the desired mapping of a few stacked convolutional layers. Normally, this set of layers would learn the direct mapping $F(x) = H(x)$. Instead, the input $x$ is not only passed to the convolutional layers, but also passed through the identity function, after which it is added to the convolutional output $F(x)$: $F(x) + x$. Instead of the direct mapping $H(x)$, now the residual mapping $F(x) := H(x) - x$ is learned. An illustration of a building block for deep residual learning can be found in figure 3.4.

Fig. 3.4.: Residual learning building block. Image based on [16].

The best performing full residual network consisted of building blocks which are a bit more complex than depicted in figure 3.4. Instead of two-layer blocks of 3 × 3 convolutions, three convolutional layers were present within a block. The first convolutional layer, with 64 kernels of dimensions 1 × 1, was presented an input with depth 256. After ReLUs were applied to the output of the first layer, a second convolutional layer (3 × 3, 64 kernels) processed the data. ReLUs were applied to the intermediate data volume, followed by the third and last convolutional layer (1 × 1, 256 kernels). The convolved result was added to the input, after which the data was passed through another ReLU.
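A sketch of this bottleneck block, again in PyTorch (not the framework used in this thesis) and simplified by omitting batch normalization; the channel counts follow the description above.

    import torch.nn as nn

    class BottleneckBlock(nn.Module):
        """Residual bottleneck block: the stacked layers learn F(x), the block outputs F(x) + x."""

        def __init__(self, channels=256, bottleneck=64):
            super().__init__()
            self.residual = nn.Sequential(
                nn.Conv2d(channels, bottleneck, kernel_size=1),               # 1x1, 64 kernels
                nn.ReLU(inplace=True),
                nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1),  # 3x3, 64 kernels
                nn.ReLU(inplace=True),
                nn.Conv2d(bottleneck, channels, kernel_size=1))               # 1x1, 256 kernels
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.residual(x) + x)   # add the identity shortcut, then apply a ReLU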

The best accuracy on the Imagenet dataset was reported with a network consisting of a total of 152 convolutional layers. The full network consisted solely of residual learning building blocks, apart from initial 7 × 7 convolutions and two pooling operations at the beginning and at the end of the net. With some momentum and weight decay but without dropout, training went smoothly and the degradation problem that deep networks often show was not present. The final top-5 error of a 152-layered ResNet on the Imagenet validation set was as low as 4.49%: significantly better than previous year entries in the ILSVRC competition.


4

Dataset definitions and exploration

This chapter is about the dataset that was used for the experiments. Pre-processing methods are described, important definitions are given and the dataset is explored with visualization methods.

4.1 Dataset contents

The dataset consists of two-dimensional position data of a selection of matches played by a German Bundesliga team during the 2008/2009 and 2009/2010 seasons. The positions for every player on the pitch were captured by the Amiscor multiple-camera system. The Amiscor system consists of multiple cameras placed around the stadium and tracks all moving players on the soccer field at a sampling frequency of 25 Hz [1]. Computer vision techniques are used to track objects and estimate their positions.

Because the ball position was originally not tracked by the system, it was manually added to the data. Therefore, the position of the ball is not as precise as the player movement. When the ball was passed or shot, only its start position and end position were marked. As a result, the ball always moved in straight lines, even in cases of curved shots or passes. When a player had ball possession and dribbled with the ball, the x- and y-coordinates from the player were copied and used as ball position.

As a follow-up step, the data were imported into soccer analysis software developed by Inmotio [18]. Subsequently, the raw export function of the Inmotio software was used to export match data at a downsampled frequency of 10 Hz.

4.2 Preprocessing


4.2.1 Data filtering

Three matches were deleted from the original dataset. The position data of these matches were incomplete: in all three cases data from the opposing team’s players were missing. The actual dataset therefore included a total number of 29 full-length matches. Only one of these matches was an away game.

4.2.2 Data cleansing

When dealing with continuous data measured in a dynamic environment, it is no surprise that parts of the data contain errors. In the used dataset, there existed time intervals for which the coordinates of a certain player were not measured correctly.

In the dataset, such measurements were indicated by the coordinates adopting very high values (in the order of $1.0 \cdot 10^5$ meters).

When inspecting the data, it was apparent that there were two main reasons which caused the errors. The first reason is players not being within the lines of the soccer field. For this category a distinction can be made between players who were not between the lines for only a limited amount of time, and players who had left the pitch indefinitely. For the first category, coordinates were linearly interpolated for the interval that contained erroneous data. For the second category, erroneous data rows were deleted. Note that this did not negatively impact the continuity of the dataset: the players had left the soccer field indefinitely, and did not return to the pitch.

The second cause of erroneous data is the computer vision algorithm not being able to correctly capture the position of a player. This effect seemed to be present most during corner kicks, during which the players were standing very near to each other, making it harder to extract individual players. In these cases the player position was linearly interpolated as well.

Figure 4.1 shows the distribution of the duration of noisy intervals: time intervals during which the coordinates were not measured correctly for at least one player.

The figure shows that the erroneous intervals do not make up a large part of the data. This indicates that linear interpolation would not affect the data too much in a negative way.

Fig. 4.1.: Duration of noisy intervals: (a) all intervals; (b) intervals longer than 5 s only.
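Linear interpolation of the flagged samples can be done per coordinate, for example as in the NumPy sketch below; the error threshold is derived from the description above (erroneous coordinates on the order of $10^5$ meters) and the function name is illustrative.

    import numpy as np

    def interpolate_noisy(coords, error_threshold=1e4):
        """Linearly interpolate samples whose coordinate magnitude indicates a measurement error."""
        coords = np.asarray(coords, dtype=float)
        bad = np.abs(coords) > error_threshold            # flag erroneous samples
        idx = np.arange(len(coords))
        coords[bad] = np.interp(idx[bad], idx[~bad], coords[~bad])
        return coords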

4.2.3 Addition of derived variables and quantities

The raw dataset lacked information in certain aspects, as it only contained a timestamp, two-dimensional coordinates, object velocity, object id, object name and shirt number for all players and the ball. Missing essential parts of the data were added manually. Every player had to be assigned to a team, which was either the Bundesliga team or the opposing team. This was done manually. The playing direction of the teams was extracted by looking at the position of players at the beginning of the matches. When most players belonging to one of the teams were located on a particular side of the field, their playing direction was set to the other side, and vice versa. Finally, a variable representing the direction of movement for players and ball was added. Coordinates of two adjacent samples were taken into account for determining the direction of movement.
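The direction of movement can, for instance, be derived as the angle of the displacement vector between two adjacent samples; a possible NumPy sketch (the function name and the padding of the first sample are assumptions):

    import numpy as np

    def movement_direction(x, y):
        """Direction of movement (degrees) per sample, from consecutive coordinates."""
        dx, dy = np.diff(x), np.diff(y)
        angles = np.degrees(np.arctan2(dy, dx))       # angle of the displacement vector
        return np.concatenate(([angles[0]], angles))  # first sample copies the second one's direction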

4.3 Definition: goal-scoring opportunities

There are many ways to define a goal-scoring opportunity in soccer. One could state that possession of the ball in a certain area of the soccer field, for instance the penalty area, is a goal-scoring opportunity. Frequency statistics about soccer matches often include the total number of shots and the number of shots on goal, which could both also be considered goal-scoring opportunities.

Due to the two-dimensional nature of the data, it is impossible to distinguish between shots which were on goal and shots which went over the bar. Taking these limitations into account, goal-scoring opportunities have been defined as shots which (almost) crossed the end line near the goal. A shot which is a little wide would still be classified as a goal-scoring opportunity, as will a shot which crosses the goal on the upper side. A movement of the ball was considered a shot when:

1. the ball had moved in a more or less straight line towards the goal;

2. the velocity of the ball was above a certain threshold all the time;

3. before the velocity of the ball passed this threshold, a player belonging to the attacking team was near the ball.

Algorithm 4.3 shows pseudocode of the algorithm which was used for finding goal- scoring opportunities. Start index, end index and the side of the field where the goal-scoring opportunity occurred were returned for each opportunity. The algorithm relies on several constants that affect the threshold settings for detecting goal-scoring opportunities, which are listed below. Between brackets are the actual values that were used for extracting opportunities.

min-velocity [set to: 20 km/h]

The minimum velocity for the shot. Velocity had to be above this threshold for a movement of the ball to be considered a shot.

max-p-distance [set to: 1.5 m]

Maximum distance from attacking player to ball at the start of a goal-scoring opportunity.

max-angle-change [set to: 20 degrees]

Maximum change of direction of the ball during a shot. In section 4.1 it was stated that, due to the later addition of the ball to the dataset, the ball always moved in a straight line when it was shot. This variable may seem unnecessary at first sight, but imagine the case of a shot which is touched by another player halfway through. It would not be appropriate to consider this as a single shot.

max-dist-to-goal [set to: 1.0 m]

When the ball was directed towards the goal in the previous sample, but is not anymore in the current sample, this parameter determines the maximum distance from the ball to the end line for the shot to be considered a goal-scoring opportunity. By setting this threshold to a value higher than zero, shots which do not directly pass the end line can still be detected as opportunities. A shot stopped by the goal keeper before the goal line will then still be classified as an opportunity. Another advantage of assigning a slightly positive value to the threshold is that it helps to tackle noisy measurements.

min-shot-time [set to: 0.5 s]

Minimum duration for a shot.

opp-margin [set to: 5.0 m]

Goal-scoring opportunity margin on both sides of goals. This parameter deter- mines how far a shot can go wide while still being classified as a goal-scoring opportunity. Note: for clarity, this parameter is not listed in algorithm 4.3.

Algorithm 4.3 Algorithm to find goal-scoring opportunities

function FINDOPPORTUNITIES(data)
    opportunities ← empty list
    for each sample ∈ data do
        if sample.ball-velocity < min-velocity then
            continue
        if sample.ball-direction is towards one of the goals then
            near ← list of players whose distance to ball ≤ max-p-distance
            team-att ← team attacking the goal to which the ball is directed
            if there is a player ∈ near belonging to team-att then
                for end-sample ∈ [sample + 1, sample + 2, ..., sample + n] do
                    dir-diff ← difference in direction between end-sample and end-sample − 1
                    if dir-diff > max-angle-change then
                        break
                    if ball at end-sample is no longer directed towards goal then
                        if distance at end-sample to goal ≤ max-dist-to-goal then
                            if time between sample and end-sample ≥ min-shot-time then
                                add (sample − 1, end-sample) to opportunities
                                outer for loop: jump to sample end-sample + 1
                            else
                                break
                        else
                            break
    return opportunities
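A simplified Python sketch of this detection loop is given below. The per-sample data structure (a dictionary with ball velocity, ball direction, ball position and a list of players), the geometry helpers passed as arguments, and the handling of the 10 Hz sampling rate are assumptions for illustration; this is not the code used in the thesis.

    import math

    # Threshold values from section 4.3; the constant names are illustrative.
    MIN_VELOCITY = 20.0        # km/h
    MAX_P_DISTANCE = 1.5       # m
    MAX_ANGLE_CHANGE = 20.0    # degrees
    MAX_DIST_TO_GOAL = 1.0     # m
    MIN_SHOT_TIME = 0.5        # s
    SAMPLE_DT = 0.1            # s (10 Hz data)

    def find_opportunities(samples, is_towards_goal, dist_to_end_line, attacking_team):
        """Return (start index, end index) pairs of detected goal-scoring opportunities."""
        opportunities = []
        i = 0
        while i < len(samples):
            s = samples[i]
            if s['ball_velocity'] < MIN_VELOCITY or not is_towards_goal(s):
                i += 1
                continue
            team_att = attacking_team(s)  # team attacking the goal the ball is directed at
            near = [p for p in s['players']
                    if math.dist(p['pos'], s['ball_pos']) <= MAX_P_DISTANCE]
            if not any(p['team'] == team_att for p in near):
                i += 1
                continue
            j = i + 1
            while j < len(samples):
                dir_diff = abs(samples[j]['ball_direction'] - samples[j - 1]['ball_direction'])
                if dir_diff > MAX_ANGLE_CHANGE:
                    break
                if not is_towards_goal(samples[j]):       # the shot is no longer aimed at the goal
                    if (dist_to_end_line(samples[j]) <= MAX_DIST_TO_GOAL
                            and (j - i) * SAMPLE_DT >= MIN_SHOT_TIME):
                        opportunities.append((i - 1, j))  # start and end index of the opportunity
                        i = j                             # jump past this shot
                    break
                j += 1
            i += 1
        return opportunities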

4.4 Definition: ball possession

The ball possession was assigned to the team whose player was closest to the ball. Some extra parameters were added to prevent the algorithm from switching ball possession when the ball passed a player closely, but did not undergo a change in direction or a significant decrease in velocity.

Algorithm 4.4 shows pseudocode of the algorithm used for ball possession detection.

For every sample in the data, an integer indicating possession for the German
