
MSc Artificial Intelligence

Master Thesis

Ball-I3D: Localizing Footballs from Player Coordinates

by

Tijmen van Dijk

11336404

October 2, 2019

36 ECTS, 7 January 2019 - 7 July 2019

Supervisor:

Prof. C.G.M. Snoek

Daily Supervisor:

Dr.ir. A.G. van Opbroek

Assessor:

Dr. P.S.M. Mettes


Abstract

This thesis investigates using player coordinates to localize the ball in football. We propose the Ball-I3D method, which represents the player coordinates over time as a sequence of two-dimensional coordinate histograms. This representation is used as input to the state-of-the-art I3D video encoder from action recognition. We investigate several design choices and settings to optimize this method, concluding that data augmentation and a two-stream approach that adds flow information appear to improve results. In a comparative experiment with an object detector-based approach, we investigate the relative advantages of the two approaches and conclude that Ball-I3D can be an interesting alternative for some applications. We also compare another video encoder, TSN, to the I3D video encoder and conclude that I3D performs better for ball localization. A comparison of Ball-I3D to human performance demonstrates that Ball-I3D is at least on par with humans in localizing the ball. In the light of these findings, we conclude that Ball-I3D shows potential in its ability to encode the spatio-temporal information contained in the player coordinates for ball localization, and we believe the approach merits consideration and further research in the football domain.


Contents

1 Introduction

2 Related Work
  2.1 Ball Detection in Video Frames
  2.2 Representing spatio-temporal information

3 Methodology
  3.1 Representation
  3.2 Video Encoder

4 Experimental Setup
  4.1 Dataset
    4.1.1 Video Data
    4.1.2 Ball Annotations
    4.1.3 Player Coordinates
    4.1.4 Input Representation Details
  4.2 Training Details
    4.2.1 Evaluation Measure
    4.2.2 Training Splits
    4.2.3 Training Procedure

5 Experiments and Results
  5.1 Optimization
    5.1.1 How to train Ball-I3D?
    5.1.2 How stable is the training?
    5.1.3 What is the influence of flow?
    5.1.4 What is the influence of data?
    5.1.5 What is the influence of the temporal dimension?
    5.1.6 Conclusion
  5.2 Ball-I3D versus Ball-RetinaNet
    5.2.1 Evaluating Ball-RetinaNet
  5.3 Ball-I3D versus Ball-TSN
  5.4 Ball-I3D versus Humans

6 Discussion
  6.1 Shortcomings of the dataset
  6.2 Difficulties inherent to the task
  6.3 Shortcomings of Ball-I3D
  6.4 Shortcomings of Ball-RetinaNet
  6.5 Relative advantages and disadvantages of Ball-I3D and Ball-RetinaNet
  6.6 Other potential applications of Ball-I3D

7 Conclusion

A Ball-RetinaNet Implementation Details
  A.1 Dataset
  A.2 Adapting RetinaNet


Chapter 1

Introduction

Football broadcasts are consistently rated among the most-viewed television events, which creates plenty of reason to make the viewing experience as enjoyable as possible. Football broadcasts are large productions, employing many people and many cameras to supply the viewer with a variety of interesting camera shots and salient statistics. In amateur football, the interest in footage is much more limited, and as such the resources available to provide interesting footage are limited as well. Nevertheless, fans or family of players may want to watch amateur matches as live streams at home, or coaches may want to analyze match footage or have access to statistics to improve coaching. There is a gap here between interest and feasibility that could potentially be bridged using relatively simple static camera set-ups combined with artificial intelligence. If one were able to automatically direct an overview camera to zoom in and follow the action, either physically or digitally, the static footage could be made much more interesting, potentially in systems that can be deployed at any amateur club at a fraction of the cost of hiring a camera crew. In this context, a basic piece of information for a model to understand is the location of the ball, for example, to allow the camera to keep the ball in frame or to generate ball possession statistics.

Finding the ball in camera footage can be considered a special case of object detection. Through the usage of deep convolutional networks, object detectors have been making large strides towards rivalling humans in their capacity to find objects in images [5]. New architectures have brought frequent improvements on several large-scale benchmarks over the past years, most of which contain some sort of ball category. However, there is a consistent pattern that performance worsens as objects get smaller. Recently, the most important object detection benchmark has been the COCO dataset [6]. Table 1.1 shows the performance scores of various

Table 1.1: The COCO Average Precision measure (AP) scores on the leaderboard test set for current and previous state of the art architectures, showing the total AP as well as separate AP scores for the small and large object categories [1]. To indicate the discrepancy in performance between small and large objects, we denote the ratio between the scores for each method, leading to the conclusion that performance is worse for small objects.

Method                          Backbone                AP     APsmall   APlarge   APlarge / APsmall
SSD513 [2] (2016)               ResNet-101              31.2   10.2      49.8      4.9
Faster R-CNN w FPN [3] (2016)   ResNet-101-FPN          36.2   18.2      48.2      2.6
RetinaNet [4] (2017)            ResNet-101-FPN          39.1   21.8      50.2      2.3
TridentNet [1] (2019)           ResNet-101-Deformable   48.4   31.8      60.3      1.9


Figure 1.1: A typical shot of a football game from an overview perspective. The first panel on the bottom shows a 4x zoom of a 68 by 68-pixel ball and panel two shows a 15x zoom of a ball at 10 by 10 pixels in the hands of the thrower, both taken from the overview shot. Panel three illustrates the challenge posed by motion blur. Panel four shows a typical example of occlusion by the legs of a player, which happens frequently.

object detectors on the COCO benchmark. Clearly, there is a consistent trend of reduced performance on smaller objects. Considerable effort has been expended on addressing this, with some success. For example, SSD [2], an early single-shot approach scores almost a factor of five lower on the Average Precision (AP) measure on small objects relative to large ones. RetinaNet [4], a successor to SSD that uses Feature Pyramid Networks and a loss function that focuses on difficult samples, has reduced this factor significantly. Despite further efforts, this factor remains approximately two for TridentNet [1], the current state of the art, justifying the conclusion that small objects are harder to detect for object detectors.

To illustrate the challenges posed by object size in detecting a ball from overview footage, Figure 1.1 shows what a typical shot with a static camera from an overview perspective may look like. While this is a 4K resolution image, the ball at the far end of the field comprises only 10 by 10 pixels and is hardly discernible at all when enlarged. When the ball is close to the camera, the ball is 68 by 68 pixels. When we express these sizes as a percentage of image size, 10 and 68 pixels amount to 0.3% and 1.8% of the largest side of the 3840 x 2160 image respectively. Figure 1.2 shows the distribution of object instances in terms of the object size relative to image size, for common datasets including COCO. The ball with percentages ranging from 0.3% to 1.8% of


Figure 1.2: The distribution of instance sizes for the MS COCO, ImageNet Detection, PASCAL VOC and SUN datasets [6], showing that the ball ranging from 0.3 to 1.8 percent of image size is decisively small relative to benchmark datasets.

image size is located at the very low end of the 4% bin. While there exist many objects in COCO that fall inside this bin, it is safe to assume that the average ball is far smaller than the average COCO object and that this will affect ball detection performance with an object detector, since performance worsens as objects get smaller.

Importantly, small object size is not the only conceivable issue in detecting the ball. There is the issue of motion blur. Quick passes and shots frequently turn the ball from a distinctive sphere into a rectangular blur that is harder to recognize as a ball, as illustrated by the third bottom panel of Figure 1.1. Another issue is frequent occlusion, illustrated in the fourth bottom panel of Figure 1.1. In many situations, the ball is partly blocked by the legs of whichever player is in possession, or even completely blocked when players are positioned in between the camera and the ball. White socks and other clothing, white field markings, and bright reflection glare on the field can also reduce the distinctiveness of the often white ball or provide image regions that are plausible false positives.

Despite all these difficulties, humans watching football usually have no trouble knowing where the ball is at any given time. Apart from detecting the ball directly, we are able to use many sources of auxiliary information to understand the flow of the game. We can use our memory of the preceding ball locations, the commentary, and the camera movement, but the positions of the players and their movement are also good indicators because they respond directly to the ball. The players provide twenty-two potential sources of information on where the ball is. It is conceivable that aggregating this information intelligently can provide a reasonable ball location prediction. Considering the difficulties of detecting the ball directly and the fact that players are easier to detect than the ball due to being larger and more distinctive, this provides an interesting avenue of research. As such, we seek to leverage the information contained in the spatial distribution of players on the field and how this spatial distribution changes over time to localize the ball. The hypothesis is that the spatio-temporal patterns in player movements allow locating the ball with reasonable accuracy. This approach is independent of whether the ball itself is actually visible. Therefore, we consider this ball localization instead of strictly ball detection, to avoid confusion. In this context we formulate the following research question:

Research Question


To answer this question, we investigate the existing literature on ball detection and review methods of aggregating spatio-temporal information that inspire our method in Chapter 2. In Chapter 3 we present the Ball-I3D method, consisting of a pipeline that uses a new representation of player coordinates and processes it with the I3D video encoder [7] to obtain a ball location prediction. Following the conceptual overview of the method, details concerning the implementation and training of the method are described in Chapter 4. Chapter 5 describes the experiments and their outcomes, separable into four sections: Section 5.1 covers experiments with various settings and modifications to Ball-I3D with the purpose of optimizing the method. In Section 5.2 we experiment with the RetinaNet object detector to obtain some insight into comparative strengths and weaknesses of Ball-I3D relative to an object detector approach. Section 5.3 compares using another video encoder to using the I3D encoder to confirm that I3D is indeed a good choice as an encoder. Section 5.4 presents a human benchmark, in which human test subjects compete with the model to provide some comparative perspective on the performance of Ball-I3D. We discuss the implications of our results in Chapter 6 and conclude in Chapter 7.


Chapter 2

Related Work

In the context of the research goal, we can divide the related work into two sections. First, we review the existing work on ball detection using images or video frames as input. Second, in the context of using player positions as inputs instead of the raw frames, there are a number of methods that have used interesting input representations and architectures to encode spatio-temporal data, from which we can draw inspiration for ball localization.

2.1 Ball Detection in Video Frames

There has been a considerable amount of research on detecting balls in video with a variety of pre-deep learning methods. Many methods use hand-crafted appearance-based features, like the distinct colour, size, and shape of the ball [8, 9, 10]. Other features are motion- or velocity-based [11, 12]. Generally, multiple of these ball features are used to obtain a sequence of ball candidates. Kalman Filters can be applied to turn this sequence into a robust estimation of a trajectory, removing spurious candidates and estimating locations for intermediate frames in which no candidates were found [13, 14]. In the presence of strong non-linearities such as acceleration and spin, a more robust and flexible ball-tracker can be engineered using a Particle Filter [15]. A recent survey on ball tracking by Kamble et al. [16] states that there is no standardized benchmark to evaluate ball detection and tracking approaches, with authors generally using a variety of very small datasets. Some authors use complicated custom camera setups in combination with techniques like frame differencing and background modelling, which are unlikely to generalize to other setups. Authors operate on many different small datasets, with most being very different from football footage from an overview perspective, consisting of long balls or close-up shots exclusively [16]. The immense variety in datasets and setups makes it hard to ascertain the relative usefulness of these hand-crafted feature methods.

Many computer vision tasks have benefited immensely from shifting from hand-crafted features to learned feature representations. Renò et al. [17] present a deep-learning-based approach to ball detection for tennis, classifying whether pre-selected image patches contain a ball or not. This approach forms an exception in the deep learning era in the sense that it is ball specific. When using deep learning to detect objects, many low-level features are shared across a wide range of objects. Therefore, the research interest in object-specific categories has dissipated in the past few years, with the general object detection task having moved to the forefront. The task of general object detection aims at providing bounding box localizations for all objects annotated in a dataset. Object detection methods can be divided into two categories: two-stage detectors and one-stage detectors [5]. Two-stage detectors first compute a set of proposals for regions in the image that are likely to form a coherent object, initially using the Selective Search algorithm and later using a trained Region Proposal Network [18, 19]. The second stage consists of extracting features and predicting a class for


each of the proposed regions. An example is the RCNN [20] detector and its subsequent improvements like Faster-RCNN [18]. On the other hand, one-stage detectors attempt to localize and classify objects in one pass, dividing an image into fixed anchor regions and regressing bounding box coordinates and predicting a class for each region. Notable one-stage detectors are YOLO [21] and SSD [2] and their subsequent improvements. Generally, one-stage detectors have been much faster than two-stage detectors but cannot compete when it comes to detection accuracy. Lin et al. [4] recognize that detectors see a much larger quantity of easily classifiable background regions during training than difficult objects. They devise a focal loss function that emphasizes the difficult misclassified samples to account for this class imbalance. Their method trained with this focal loss is called RetinaNet, and it does achieve object detection accuracy that is comparable to two-stage detectors. The potential issues surrounding using an object detector to detect a ball from an overview perspective have been discussed in the introduction as these issues are inseparable from the motivation for this thesis. To illustrate these issues and provide a comparison with the Ball-I3D method, we will experiment with using RetinaNet to detect the ball. We choose RetinaNet over the aforementioned TridentNet, which is the state of the art, because TridentNet does not yet have an implementation available that is as well-supported and easy to adapt as RetinaNet.

As we have noted, player positions may contain important information on where the ball is. We are not the first to recognize this. Wang et al. [22] present a method that uses player detections and trajectories to predict ball possession. This ball possession prediction is combined with image ball detection using hand-crafted features in a Conditional Random Field model. The player positions serve to assist detecting the ball directly to obtain a more dependable trajectory despite occlusion. Their work supports the idea that the player context is a usable feature for finding the ball, which, combined with the challenges surrounding using an object detector, motivates us to investigate using player positions as inputs.

2.2 Representing spatio-temporal information

When using player positions instead of image inputs, the input can be considered a spatial distribution of players over the football field. The problem also has a strong temporal dimension in the way that the spatial distribution varies over time. While it is conceivable that certain football situations like a corner kick or a kick-off can be recognized from the spatial distribution of players alone, the trajectories formed by sequences of coordinates are likely to be far more informative for finding the ball. There are a number of papers that have presented interesting input representations and architectures to learn an encoding of spatio-temporal data.

Social-LSTM [23] is an approach for modelling the interaction of pedestrians in crowds and predicting their future trajectories. Each pedestrian track has its own LSTM. To model the interactions with nearby pedestrians, their hidden states are shared between their respective LSTMs and aggregated through a method the authors call social pooling. The players in football can also be considered a crowd of moving pedestrians, but a shortcoming of social pooling is that it only aggregates spatial information into each individual trajectory. It does not encode the dynamics of the crowd as a whole, which is possibly relevant for predicting a ball location based on player positions.

Traj-GRU (Trajectory-GRU) [24] is a model for precipitation nowcasting, which is the task of predicting short term local rainfall based on a sequence of local radar maps. Their approach is an extension of ConvGRU, which is a Gated Recurrent Unit (GRU) with convolutional input-state transitions and state-state transitions. These state-state transitions are location-invariant. The extension of Traj-GRU is that it models these transitions to be location-variant. This allows it to capture the natural motion of clouds, for example translations and rotations, more accurately. Periodic-CRN (Convolutional Recurrent Network) [25] also applies a variant of ConvGRU on sequences of traffic density measurements at various locations in a city. From these inputs the authors predict the traffic density in the city a few timesteps into the future. The traffic density measurements are represented as a top down traffic density heat map of the city. What is interesting about these two approaches is that they both represent a spatial distribution as an image. The traffic density maps of Periodic-CRN can be understood


as a 2D histogram of the spatial distribution of traffic in a city, and the same holds for the radar maps of Traj-GRU, but then of clouds in a region. As noted by the authors of Traj-GRU, a sequence of these 2D histograms can be understood as an artificial video representation of spatio-temporal data. Our method draws inspiration from these two papers and will represent the player positions as a video consisting of 2D histograms. Both Traj-GRU and Periodic-CRN process their inputs using a ConvGRU-based encoder to learn a representation of their artificial videos, but ConvGRU can no longer be considered state of the art for encoding spatio-temporal information. ConvGRU-based encoders have also been employed in the field of action recognition, and at this point they have been significantly outperformed by 3D convolutional encoders [26, 27]. Furthermore, recurrent architectures are difficult to train and due to their sequential processing, they are also slow compared to attention-based or convolutional models, which process their inputs hierarchically [27]. Slow training and inference would be acceptable if it results in better predictions, but recent works have shown recurrent architectures being outperformed by convolutional architectures on a variety of tasks with a strong sequential or temporal dimension. Convolutional architectures such as WaveNet [28] and ByteNet [29] have surpassed the state of the art in audio generation and machine translation respectively, both tasks with a strong sequential component requiring long-term memory, which are classically considered the domain of LSTM-based approaches. In an analysis of temporal video-based tasks, Ghodrati et al. [30] show that a simple time-adjusted convolutional architecture outperforms LSTMs when it comes to understanding the arrow of time. Bai et al. [27] propose a general convolutional architecture called TCN (Temporal Convolutional Network) that is able to outperform LSTM and GRU architectures on a wide range of standard sequence modelling benchmarks. TCN demonstrates a longer effective memory, the length of which is controllable through the stride of dilated convolutions and network depth. In the light of these findings, we look at fully convolutional encoders to model our spatio-temporal data.

When it comes to learning to understand data in video form, the field of action recognition is a key field with many researchers experimenting with convolution-based video encoders. The obtained encoding is subsequently used to predict an action for a short video sequence. An early attempt at applying the concept of 3D convolution in the field of action recognition is called C3D [31]. The convolution is 3D in the sense that the operation is applied in two spatial dimensions as well as in the temporal dimension. This way the model learns spatio-temporal features by aggregating space and time simultaneously, instead of sequentially like many preceding approaches. C3D outperforms LSTM-based approaches and approaches using hand-crafted features, but the latter by a much slimmer margin than seen in some other computer vision domains. This is because C3D cannot use pretrained weights as is the standard for 2D-convolutional networks, since no pretrained weights exist for 3D convolution, and common action recognition datasets are too small to comprehensively train the model from scratch. Essentially, C3D cannot be trained to its full potential.

Another approach in action recognition is Temporal Segment Networks (TSN) [32]. It is a convolutional approach that divides a video into segments and predicts an action for a snippet in each segment. This samples the video more sparsely, which increases efficiency since subsequent video frames are usually highly correlated. Each snippet can be analyzed using any 2D ConvNet. The predictions of each snippet are averaged to obtain a video level prediction. This temporally averaged prediction is used to compute the loss and update the model parameters, allowing learning long range temporal structure over the entire video. This means TSN does not actually learn spatio-temporal encodings directly, but aggregates predictions on spatial encodings over time. While this may seem conceptually weaker than learning features through convolution in the temporal dimension like C3D does, it is nonetheless able to outperform C3D on action recognition benchmarks.

The current state of the art in action recognition is I3D [7]. I3D is a model that draws strongly on the concept of C3D and addresses its pertinent issues by utilizing weights pretrained on 2D images to initialize its 3D kernels. They do this by stacking the 2D weights in the third (temporal) dimension. This is a simple adjustment, but these inflated weights are nevertheless a very valuable initialization for action recognition. Additionally, they gather a large new dataset consisting of YouTube videos, called Kinetics, to provide further pre-training. The inflated weights plus the extra pre-training provide a much better initialization that allows a deeper network


than C3D, while still being trainable with small datasets. This approach being the state of the art demonstrates that, when properly trained, the power of 3D convolutions for encoding video information is considerable. This validates using the encoder of I3D as a starting point for our method, but instead of applying the encoder to natural RGB videos, we will apply it to a video consisting of 2D histograms that represent the player positions.


Chapter 3

Methodology

In this section we define the Ball-I3D method, which uses player positions as inputs for the task of ball localization. We define ball localization to be the prediction of the coordinates of the ball relative to the field. These field coordinates take the shape of right-handed 2D Cartesian coordinates (x, y) in meters, with the center spot in the field defined as coordinate (0, 0). The Ball-I3D method can be applied in any situation in which player positions can be defined in field coordinates. Conceivably, these player coordinates could be annotated by hand or obtained through GPS trackers, but in a context of cheap deployment, the most practical source seems to be running a person detector on overview video footage to obtain bounding box detections, and converting the base of this bounding box to field coordinates using the camera parameters of the specific setup. Figure 3.1 shows the camera setup used to obtain overview footage for our dataset as an example. The scope of the research question covers the process from player coordinates to a ball location; we consider the process used to obtain the player coordinates as a given starting point. Therefore, further details of the video footage and person detector are explained in Chapter 4. In this section we cover the two basic building blocks of our method: the representation of the player coordinates used as input, and the video encoder used to encode these inputs.

Figure 3.1: Camera setup with two 4K cameras, each covering one half of the field with a small section of overlap, providing footage from an overview perspective.

3.1 Representation

As we have seen in Section 2.2, spatial distributions like traffic and clouds have been successfully represented using images. In a similar way, we represent the positions of the players on the football field using a grid that represents the football field from a top down perspective. Figure 3.2 shows the conversion from detections to a 73 x 111 grid, where each cell corresponds to a 1 x 1 bin in meters and is initialized at zero. The player


Figure 3.2: First row shows the footage of the camera setup with the player detections on the left, and the corresponding histogram on the right. Each pixel of the histogram corresponds to a square meter in the football field, with each white pixel indicating the presence of a player. The second row shows only a left camera view containing most players and the corresponding histogram. Similarly, the third row shows a right camera view.

coordinates are then matched to whichever bin is closest to their coordinates in the field and this bin is set to value 1. This grid then forms a top down view of the spatial distribution of the players across the field. The grid resolution is a design choice. A higher resolution grid allows representing the player coordinates more accurately, at the cost of increased computational load. However, the benefits of increasing the resolution are limited by the precision of the player coordinates. From now on, these images will be referred to as coordinate histograms, as they are a discretized representation of player positions across two continuous coordinates. A sequence of these histograms at consistent time intervals may be understood as a video representation of player coordinates and as such it can be used as an input for a video encoder. This transition from player positions to an artificial video is the first step of the Ball-I3D method.
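To make this conversion concrete, the sketch below builds a single coordinate histogram from a set of player field coordinates. The function name, the mapping of x to the long axis of the grid, and the handling of out-of-range detections are our own assumptions; the thesis only fixes the 73 x 111 grid of 1 x 1 meter bins with binary player indicators.

```python
import numpy as np

# 73 x 111 grid of 1 x 1 m bins covering the field; mapping x to the long
# axis (111 bins) and y to the short axis (73 bins) is an assumption.
FIELD_H, FIELD_W = 73, 111

def coords_to_histogram(player_xy):
    """player_xy: iterable of (x, y) field coordinates in meters,
    with (0, 0) at the centre spot."""
    hist = np.zeros((FIELD_H, FIELD_W), dtype=np.float32)
    for x, y in player_xy:
        # Shift the origin from the centre spot to the corner of the grid
        # and snap the coordinate to the nearest 1 x 1 m bin.
        col = int(round(x + FIELD_W / 2))
        row = int(round(y + FIELD_H / 2))
        if 0 <= row < FIELD_H and 0 <= col < FIELD_W:
            hist[row, col] = 1.0  # presence of a player in this bin
    return hist
```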

3.2 Video Encoder

A video encoder should be capable of learning to encode the spatio-temporal data contained in the sequence of coordinate histograms in a way that is conducive to predicting the ball location. The video encoder of choice


Figure 3.3: A schematic representing the I3D encoder [7], with each Inc. (Inception) block consisting of the module on the right. This represents the encoder as we will use it, with the input video consisting of a sequence of coordinate histograms, and the final 1x1x1 convolution having two filters to predict the two coordinates of the ball.

is the aforementioned I3D encoder. The I3D architecture is shown in Figure 3.3. The backbone of the video encoder is Inception-V1, pretrained on ImageNet with the weights inflated as described in Section 2.2. The video encoder has subsequently received further training for the task of action recognition using the Kinetics dataset [7]. The I3D model for action recognition is employed in a two-stream format, with one stream taking the RGB video as input, and the other taking optical flow externally derived from the video. The results on action recognition benchmarks generally benefit from having externally computed optical flow as extra input. Each stream consists of the architecture shown in Figure 3.3. The streams are trained independently. At test time the logits are summed and fed into a softmax function; doing so results in an aggregate performance that is better than each of the independent streams. The I3D architecture is implemented in an easily modifiable Keras format on Github: https://github.com/dlpbc/keras-kinetics-i3d.

To be able to use the I3D video encoder to map player coordinates to the location of the ball, we need to adjust the output predictions. The I3D video encoder outputs a 2048 length vector encoding, on top of which we add a 1x1x1 convolution layer with two filters that regresses the encoding into two output coordinates. As a regression model, Ball-I3D does not have a softmax with logits as the final layer. This prevents applying the two-stream concept in directly the same way as in I3D for action recognition, since taking the average of logits pre-activation is not conceptually the same as taking the average of regressed coordinates. How to add explicit motion information in this case is not trivial and is the subject of an experiment in Section 5.1.3. Therefore, the basic implementation of the model consists of a single stream using just the coordinate histograms. The combination of this input representation and encoder forms the basic Ball-I3D method. A schematic overview of the method is shown in Figure 3.4.
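As an illustration of this regression head, the sketch below assumes a tf.keras implementation and an encoder output of shape (batch, time, 1, 1, 2048); the layer names and exact tensor shapes are our assumptions rather than the implementation used in the thesis.

```python
import tensorflow as tf
from tensorflow.keras import layers

def add_regression_head(encoder_output):
    """encoder_output: pooled I3D features, assumed shape (batch, t, 1, 1, 2048)."""
    # 1x1x1 convolution with two filters regresses the encoding into
    # per-time-step (x, y) field coordinates.
    coords = layers.Conv3D(filters=2, kernel_size=(1, 1, 1),
                           activation=None, name="ball_xy")(encoder_output)
    # Collapse the singleton spatial dimensions and average over time to
    # obtain a single (x, y) prediction for the input clip.
    coords = layers.Reshape((-1, 2))(coords)
    return layers.GlobalAveragePooling1D()(coords)
```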


Figure 3.4: Schematic overview of the Ball-I3D method. Football overview footage is converted into player positions in field coordinates, these form the starting point of the Ball-I3D method. Player coordinates are converted into a sequence of coordinate histograms. This sequence is used as input to the I3D video encoder to predict the location of the ball.


Chapter 4

Experimental Setup

This chapter describes details of the dataset and experimental setup as it pertains in general to all experiments, leaving more specific experimental setup to the respective sections in the subsequent chapter.

4.1 Dataset

The starting point of the Ball-I3D method is player positions expressed in field coordinates; the target is a ball location in field coordinates. This section describes the details of how the ball annotations and the player coordinates are obtained from a simple static camera setup to form a usable dataset.

4.1.1 Video Data

TNO is conducting research into automation of the coverage of amateur sports. For this purpose, they have set up cameras that provide an overview of the field at several sports clubs. The setup consists of two 4K resolution cameras, each covering half of the football field with a small overlap, as has been illustrated in Figure 3.1. The cameras are static, and the camera parameters are known. This setup automatically records football matches played at set times, filming at 25 frames per second.

4.1.2 Ball Annotations

The output of the model is the ball location at the last frame of an input video sequence, with the preceding frames merely serving to provide more information. We have manually annotated the ball location in the video data to serve as ground truth for training the model. Since the camera setup is static and the camera parameters are known, we can convert annotations in pixel coordinates to field coordinates through a perspective projection. For this transformation to result in accurate field coordinates, the ball is only annotated when it is on the ground, or sufficiently close so that its position relative to the field can be deduced. The annotation is placed centrally at the bottom of the ball. This is required because annotating the ball in the air or at the top of the ball will lead to field coordinates that are inaccurate after the conversion from pixel space. Annotation is done by moving through the video frames at adjustable time intervals, taking care to annotate regularly and to annotate any important directional shifts, while applying linear interpolation to obtain a dense ball trajectory over the intermediate unannotated frames. Practically this means that the annotation interval is generally around 0.25 seconds, because moving much faster results in inaccurate interpolation. Annotation at this rate is a tedious and time-consuming process, limiting the size of the dataset. As such, we have partially annotated five different matches of a single club for a total of 81 annotated minutes, starting at different portions of the match to


maximize variation. Table 4.1 shows the number of matches used for annotation, and the respective amount of time annotated.

Table 4.1: The annotated time per match and in total.

Match   Annotated time (min)
1       22
2       15
3       15
4       13
5       16
Total   81

There are some decisions in the annotating process that merit clarification. The ball is only annotated when the game is active so to speak, defined to be whenever the players are actively responding to the ball or moving in preparation for some static situation. These static situations include set pieces like goal kicks, throw-ins, free kicks, and the time between the ball going out of bounds and the ball being brought back into active play. Some of these situations involve the players moving while the ball is not and are likely difficult to predict. Nevertheless, for any practical purposes one would want to have a prediction for these situations, because they are important to the game and they take up a considerable portion of game time. What is not considered active play is for example the treatment of an injury, which can have a ball visible in the frame as well as players moving, but there does not need to be any relation between the two. Another decision we made is to annotate headers and throw-ins, even though the ball is not strictly on the ground. However, its position in field coordinates can be deduced and thus annotated. It is important to do so because they can be important directional shifts. For throw-ins the position of the ball is marked to be at the feet of the thrower, while for headers we envision a line between the positions of the feet before and after a jump and mark the position on the line orthogonal to the ball at the moment of contact.

4.1.3 Player Coordinates

To obtain player coordinates, the players are detected in the raw frames of the video data using a pipeline devised by TNO. Detection is done on single frames using a pedestrian detector called the ACF (Aggregated Channel Features) Detector [33]. This is a light pre-deep-learning method that uses gradient histograms to obtain bounding box detections of players, which was state of the art in 2014. This detector was chosen because this pipeline was designed to be able to run “real-time” on a powerful CPU. Since football players are not lightning fast, the detector operates on every 4th frame, so real-time is understood to be 6.25 fps. A tracking algorithm links the single frame detections together to form rough tracks [34]. The central pixel coordinates at the bottom of the bounding box can be projected to field coordinates to represent the player position. Apart from the coordinates, the tracking algorithm also calculates movement information for each player, provided it is detected across multiple frames. This movement information consists of a speed in m/s and an orientation angle in radians. These two quantities can be interpreted as flow information that is potentially usable as a second stream.
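As an illustration of the projection from pixel space to field coordinates used for both the ball annotations and the bottom-center points of the player bounding boxes, the sketch below assumes that the known camera parameters have been reduced to a 3x3 ground-plane homography; the identity matrix shown is a placeholder, not the actual calibration of this setup.

```python
import numpy as np
import cv2

# Placeholder ground-plane homography (pixel -> field coordinates in meters);
# in practice this would be derived from the known camera parameters.
H = np.eye(3, dtype=np.float64)

def pixel_to_field(u, v):
    """Project a pixel (e.g. the bottom centre of a bounding box, or the
    bottom of the annotated ball) to (x, y) field coordinates in meters."""
    point = np.array([[[u, v]]], dtype=np.float32)
    x, y = cv2.perspectiveTransform(point, H)[0, 0]
    return float(x), float(y)
```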

There are some crucial shortcomings to the player coordinates obtained in the process as described above, which introduce significant noise to the input data that the model must contend with. Firstly, there are no ground truth annotations to evaluate the performance of the detector with; we can only claim that the detector appears capable of detecting players fairly reliably based on manual verification. The detections in Figure 3.2 are the detections generated by ACF. Secondly, the detector operates on each camera separately. This leads to players


being detected in both cameras simultaneously when they are in the area of overlap between the cameras. The first row of Figure 3.2 contains some instances of these doubled detections. This issue is solved to some extent in the conversion to histograms, where detections that are close enough to be put in a single bin will be represented as a single player. Incidentally this will also cause two players that are very close to be represented as a single player; however, players that stand close together will often not be properly detected as two persons anyway. A third shortcoming is that a small discrepancy in pixel coordinates leads to a much larger discrepancy in field coordinates, an issue that worsens at long distances as the viewing angle relative to the field lowers. This leads to player positions jittering back and forth in field space as their bounding boxes vary slightly between subsequent frames. Fourthly, while the detections are limited to the dimensions of the field plus a few meters margin, this still leaves enough room for the detection of the occasional coach, a substitute during his warm-up, and the referee and assistant referees. This means the number of detections may exceed twenty-two and can vary even between subsequent frames. Luckily this is not an issue for the proposed representation, as it allows changing the number of player pixels per frame. However, this does obstruct experimenting with certain vector approaches or using an LSTM per player similar to Social-LSTM, since these require a constant number of inputs. Finally, the tracking algorithm has a number of shortcomings of its own. It also operates on each camera separately, but there is no mechanism in place to link tracks across cameras. This means that every track moving across the midline is broken up and restarted as a new track in the other camera. There is also no matching of multiple tracks to a single player identity. Each track is independent, and a single match can have hundreds of tracks. There is no information on which player is which, and no team information. There is an algorithm in place that tries to cluster the players in two teams based on the colour histogram within their bounding boxes, but manual verification of the clustering led to the conclusion that it is too unreliable to utilize. We essentially cannot obtain any additional information from the tracks provided by the tracking algorithm, except for the speed and orientation information, which has enough data to be at least worth experimenting with.

Overall, there are significant imperfections and sources of noise in the data. The model should be able to learn to deal with some of these sources of noise, but it is likely that performance will be adversely affected relative to having perfect player coordinates. Some of these shortcomings could be improved upon, but engineering a better starting point for the proposed method is outside the scope of this thesis.

4.1.4 Input Representation Details

Some implementation details for the input representation of Ball-I3D depend on the dataset. Firstly, the appropriate length of the input video sequences of Ball-I3D depends on the frequency of detections. The detections are generated at 6.25 fps. At this rate 25 frames cover four seconds of playtime, which we deem a reasonable setting based on the speed of football, but we will conduct an experiment with other settings. Secondly, since the detections contain considerable jitter and noise as described previously, the precision of the coordinates does not merit a grid resolution any higher than 1 x 1 meter. In fact, this resolution will likely serve to remove duplicate detections in the section of overlap and small jitters in the coordinates. For the dimensions of the field of the amateur club under consideration this comes down to a grid of 73 x 111.
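Putting these details together, one input sample can be assembled as in the sketch below: 25 consecutive coordinate histograms (four seconds at 6.25 fps) stacked into a single-channel clip. The helper coords_to_histogram refers to the sketch in Section 3.1, and the tensor layout is an assumption.

```python
import numpy as np

def build_clip(per_frame_coords, t=25):
    """per_frame_coords: list of per-frame arrays of (x, y) player coordinates;
    the last frame is the one whose ball location is the prediction target."""
    frames = [coords_to_histogram(c) for c in per_frame_coords[-t:]]
    # Stack into (t, 73, 111, 1): four seconds of play at 6.25 fps.
    return np.stack(frames, axis=0)[..., np.newaxis]
```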

4.2 Training Details

4.2.1 Evaluation Measure

The ball localization task will be evaluated using the mean squared Euclidean distance, which is also directly usable as a loss function during training. As the squared variation of the Euclidean distance, it will punish large errors more severely, pushing towards more stable predicted locations.


For a vector of predictions x and a vector of targets y of equal size n, denoting coordinates in meters:

\[
\text{Mean squared Euclidean distance (m}^2\text{)} = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - y_i \rVert^2
\]

Mathematically this is an equivalent evaluation measure to the Mean Squared Error (MSE), which is the standard loss function for regression models. However, we will continue to call it the mean squared Euclidean Distance to maintain the intuitive connection with the Euclidean distance. In general, we will report the non-squared mean Euclidean distance alongside the evaluation measure, solely because people have an intuitive understanding of what a distance in meters means.
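A sketch of this measure as a training loss, assuming predictions and targets are (x, y) field coordinates in meters with shape (batch, 2); the function names are ours:

```python
import tensorflow as tf

def mean_squared_euclidean_distance(y_true, y_pred):
    # Squared Euclidean distance per sample, averaged over the batch (m^2).
    return tf.reduce_mean(tf.reduce_sum(tf.square(y_true - y_pred), axis=-1))

def mean_euclidean_distance(y_true, y_pred):
    # Reported alongside the loss for interpretability (meters).
    return tf.reduce_mean(tf.norm(y_true - y_pred, axis=-1))
```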

4.2.2 Training Splits

The total annotated dataset consists of eighty-one minutes of football play, which at 6.25 fps amounts to 30,000 unique samples. Of this total, ten continuous minutes (3,750 samples) are used as a validation set, and another set of ten continuous minutes as a testing set. These two splits are continuous because subsequent samples are highly correlated, so validating or testing on randomly drawn samples introduces a heavy bias from having trained on the correlated neighbouring samples. For each experiment we train on the remaining sixty-one minutes (23,000 samples).

4.2.3 Training Procedure

The I3D model for action recognition features very limited regularization. Dropout is applied to the final encoding before the prediction layer. Similarly, we apply dropout in Ball-I3D to the final feature vector before the two output coordinates are computed using a 1x1x1 convolution. The model is trained for thirty epochs, using the Adam optimizer with a learning rate that is initialized at 1e-3. The learning rate is dynamically reduced by a factor of 0.2 when the validation loss stops decreasing, down to a minimum learning rate of 1e-5. After each reduction there is a cooldown of five epochs before it can be reduced again. Since the number of parameters is very large relative to the number of samples (14 million parameters trained with 23,000 samples) we expect overfitting to occur. That is why we save the model at every epoch and select the model which performs best on the validation set as the final result of a training run, as a form of retroactive early stopping. In Chapter 5 we will conduct an experiment with various augmentation methods, but throughout all experiments we utilize a flipping augmentation. This augmentation consists of flipping a sequence of coordinate histograms horizontally, vertically or both with equal probability. Due to the symmetries in the game of football, this augmentation has no apparent disadvantages, so we include it in the base method.
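The sketch below illustrates this training schedule and the flipping augmentation under our own assumptions: a tf.keras optimizer and callback configuration matching the stated hyperparameters, and an interpretation in which each flip direction is applied independently with probability 0.5.

```python
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint

# Adam with an initial learning rate of 1e-3, reduced by a factor of 0.2 when
# the validation loss plateaus, with a five-epoch cooldown and a 1e-5 floor.
optimizer = Adam(learning_rate=1e-3)
callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.2, cooldown=5, min_lr=1e-5),
    # Save every epoch; the best validation model is selected afterwards
    # (retroactive early stopping).
    ModelCheckpoint("ball_i3d_epoch{epoch:02d}.h5", save_best_only=False),
]

def flip_augment(clip, target, rng=np.random):
    """clip: (t, 73, 111, c) histograms; target: (x, y) in field meters."""
    x, y = target
    if rng.rand() < 0.5:            # flip along the long axis of the field
        clip = clip[:, :, ::-1, :]
        x = -x
    if rng.rand() < 0.5:            # flip along the short axis of the field
        clip = clip[:, ::-1, :, :]
        y = -y
    return clip, (x, y)
```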


Chapter 5

Experiments and Results

5.1 Optimization

In this section we evaluate the implementation of the Ball-I3D method as described in Chapter 3. As this is a largely new approach to a new problem, there are various settings and possible improvements with which we experiment in an effort to optimize the method. These experiments will be evaluated on the validation set. At the end of this section we select the optimizations based on whether they improved performance on the validation set or not and present a final result of the optimized Ball-I3D method on the test set.

5.1.1 How to train Ball-I3D?

Generally, when adapting an encoder to a new task, it is good practice to leave the weights of the encoder fixed while training the newly initialized top layer from scratch. Adjusting encoder and top layer simultaneously may unnecessarily disrupt carefully trained filters in the encoder due to errors introduced by the initially random top layer. The I3D encoder offers three possibilities for initialization: the trained weights used for the RGB stream, the trained weights used for the two-channel optical flow stream, and randomly initialized weights. The coordinate histograms from our method do not clearly resemble either natural RGB video or optical flow video, so to assess the usefulness of these initializations relative to random initialization, we train the Ball-I3D top layer with a fixed encoder initialized with each of these three weight sets. To make the coordinate histograms compatible with the dimensionality of the pre-trained weights, the coordinate histograms are copied three times in the channel dimension to match RGB dimensions, and two times to match flow dimensions.
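A minimal sketch of this channel matching, with array shapes assumed:

```python
import numpy as np

clip = np.zeros((25, 73, 111, 1), dtype=np.float32)  # single-channel histograms

# Replicate the channel to match the pretrained input dimensionalities.
rgb_input = np.repeat(clip, 3, axis=-1)   # (25, 73, 111, 3) for RGB weights
flow_input = np.repeat(clip, 2, axis=-1)  # (25, 73, 111, 2) for flow weights
```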

Since there is a large discrepancy between the inputs of Ball-I3D relative to the natural RGB video and optical flow the I3D for action recognition has been pre-trained with, we expect that training only the top layer will not yield good results. Therefore, with the top layer trained, we subsequently fine-tune the model in its entirety, adjusting both encoder and top layer weights.

Table 5.1 shows the results of training Ball-I3D with a fixed encoder for the three initializations, as well as fine-tuning each set of encoder weights. The fixed random weight experiment serves as a sanity check as the model is expected to learn next to nothing with that initialization. A random set of weights outputs an encoding that consists of random noise, so the regression top layer cannot discern between inputs and is only able to learn to shift the predictions to the right output range. Both fixed RGB and fixed flow weights do contain distinct features and are an improvement over random weights, but neither gets very close to the ball on average. Fixed flow weights perform better than RGB by a small margin, which could possibly be attributed to the sparsity of flow images resembling the sparsity of coordinate histograms. Flow images only have non-zero values where


Table 5.1: Results of training with a fixed encoder and fine-tuning the encoder, using three available weight initializations. These results indicate that fine-tuning is important to achieve a good performance. The results are similar for all three fine-tuned weight sets, indicating that the pre-training of I3D does not offer much of an advantage.

                              Mean squared Euclidean distance (m2)   Mean Euclidean distance (m)
RGB weights (fixed)           458                                     19.06
Flow weights (fixed)          366                                     17.34
Random weights (fixed)        1109                                    30.05
RGB weights (fine-tuned)      180                                     11.00
Flow weights (fine-tuned)     188                                     11.24
Random weights (fine-tuned)   196                                     10.97

there is movement and as such, they are generally quite sparse. The coordinate histograms are also sparse due to only having non-zero values at the locations of the players.

All three of the fine-tuned sets are an improvement over training just a top layer, achieving a Euclidean distance of approximately 11 meters. Using the RGB and flow weights as a starting point for fine-tuning appears scarcely better than random. This demonstrates that the model benefits little from pre-training on the action-recognition task, and that even training from scratch is feasible. Strictly speaking, the fine-tuned RGB weights perform the best; as such, we will consider this the basic implementation of the Ball-I3D method. In the following sections we will experiment with various possible improvements, for which we will refer to this result as the baseline.

5.1.2 How stable is the training?

Since optimizing deep learning models is an inherently stochastic process, the local minimum that is obtained varies across training runs. Ideally, each model would be trained dozens of times using different random seeds so that we can apply statistical hypothesis testing to verify the statistical significance of observed performance differences. With the available hardware, time, and the large number of experiments this is regrettably unfeasible. To obtain some information on the degree of stochasticity in the results, we carry out 10 runs of the fine-tuning process for Ball-I3D using the pre-trained RGB weights.

Table 5.2 shows the results of these runs. We use the mean of 184 and its standard deviation of 8 to provide an indication whether the results of other experiments are merely caused by stochastic variation or actually constitute an improvement over the baseline. We will use a two standard deviation range from this mean as a rough approximation of a 95% confidence interval. This requires two assumptions. Firstly, we assume that the degree of stochasticity of this fine-tuning process is a reasonable estimation for the variation of the results of the other experiments. Secondly, we assume the variation of results to be normally distributed. Strictly speaking, this is false since the evaluation measure is non-negative, but the results are many standard deviations from zero, so it is a reasonable approximation. If an experiment shows an improvement that falls within the two standard deviation range, we will judge it to be the result of the stochasticity of the training process. If the improvement exceeds two standard deviations, we will deem it likely that it constitutes an actual improvement over the baseline.


Table 5.2: Results of 10 training runs of the Ball-I3D baseline and the aggregated mean and sample standard deviation for each measure. From these results we conclude there is significant variation in the obtained minimum for a single training procedure. We will use a two standard deviation range relative to the baseline as a cut-off for a modification to be deemed an improvement.

                     Mean squared Euclidean distance (m2)   Mean Euclidean distance (m)
Run 1                180                                     11.00
Run 2                183                                     11.15
Run 3                175                                     11.04
Run 4                205                                     11.69
Run 5                180                                     11.00
Run 6                177                                     10.98
Run 7                186                                     11.19
Run 8                184                                     10.84
Run 9                179                                     11.12
Run 10               192                                     11.32
Mean                 184                                     11.13
Standard deviation   8                                       0.24

5.1.3 What is the influence of flow?

I3D and a number of other models in action recognition use optical flow that is externally computed from the video frames as input to a second stream of the model. The aggregated predictions of the two streams generally perform better than either the raw video or the optical flow stream individually. This is an indication that the model itself is not perfectly capable of extracting motion information as a feature, since the optical flow is extracted from the video frames and thus contains no information that was not already present in the video. For the player coordinates generated as described in Section 4.1.3, the speed and orientation can be understood as optical flow information for players. These values are derived directly from the detections in a track, possibly providing some extra information that is lost in the conversion of detections to coordinate histograms. For example, in a situation where multiple players cross paths there is no way to reliably discern the players from one another in the coordinate histograms, but with speed and orientation values their trajectories are explicitly provided. Therefore, in this case it is actually possible that constructing flow inputs from this speed and orientation adds information that is not contained in the other stream.

The inputs for the flow stream are constructed from the speed and orientation values as follows: The tracking algorithm outputs a speed in meters per second, and an orientation in radians based on the right-handed axes. Because our data contains some suspect high speeds which are most likely due to linking erroneous or jittering detections, we clip the speed value at 8 m/s. The speed and orientation can be considered to define a vector in polar coordinates. We convert the vector to Cartesian coordinates so that the two values will occupy the same range. These two values per player detection represent horizontal and vertical displacement. These are used to construct a flow histogram, which is very similar to a coordinate histogram, except that the flow histogram is two-channel. Instead of setting the grid cell which the player occupies to 1, we set it to its horizontal displacement in the first channel, and its vertical displacement in the second channel. The result of this conversion and its corresponding coordinate histogram is visualized in Figure 5.1.
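The sketch below illustrates this construction; the bin-indexing convention follows the coordinate-histogram sketch in Section 3.1 and the variable names are ours.

```python
import numpy as np

FIELD_H, FIELD_W = 73, 111

def flow_histogram(player_xy, speeds, orientations, max_speed=8.0):
    """speeds in m/s (clipped at 8), orientations in radians (right-handed axes)."""
    hist = np.zeros((FIELD_H, FIELD_W, 2), dtype=np.float32)
    for (x, y), s, phi in zip(player_xy, speeds, orientations):
        s = min(s, max_speed)
        dx, dy = s * np.cos(phi), s * np.sin(phi)  # polar -> Cartesian velocity
        col = int(round(x + FIELD_W / 2))
        row = int(round(y + FIELD_H / 2))
        if 0 <= row < FIELD_H and 0 <= col < FIELD_W:
            hist[row, col, 0] = dx  # horizontal displacement channel
            hist[row, col, 1] = dy  # vertical displacement channel
    return hist
```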


Figure 5.1: Coordinate histogram on the left and the corresponding flow histogram on the right. An extra channel of ones has been added to the two-channel flow image for visualization in RGB, with the colours stemming from the two flow values plus an always activated green channel.

Table 5.3: Results of using flow histograms as inputs, indicating that flow histograms on their own lack some information relative to the coordinate histograms. In this table and in any subsequent tables, the ± indicates the two standard deviation range of the baseline.

                                   Mean squared Euclidean distance (m2)   Mean Euclidean distance (m)
Coordinate histograms (baseline)   184 ± 16                                11.13 ± 0.48
Flow histograms                    213                                     12.17

In I3D for the action recognition task, the flow stream on its own performs nearly as well as the raw video stream, so it is interesting to assess the performance of a flow stream on its own in the ball localization case. Table 5.3 shows the result of Ball-I3D taking only flow histograms as inputs relative to the baseline. The flow histograms on their own appear to perform worse than the coordinate histograms. There are two plausible causes for this. Firstly, players that are standing still are not recorded in the flow histograms as their horizontal and vertical displacement is zero, while their position can be relevant to localizing the ball. Secondly, the speed and orientation values are only computed when there are a number of subsequent detections available, and the computation is not done retroactively. This means that the first few detections in a track are always without speed and orientation. Due to the shortcomings of the detection and tracking algorithms, new tracks are started often and therefore many player locations receive zero values in the flow histogram, while they are non-zero in the coordinate histogram.

While the flow stream does not appear to be better than the coordinate stream, it is still possible that the model can combine information from both in a way that improves the predictions. There are many possible ways of aggregating the coordinate and flow streams; we experiment with three methods, which are described below and are schematically represented in Figure 5.2.

1. Prediction level: Averaging Predictions

In the two-stream I3D model, the two streams are trained independently. At test time, the authors sum the per-class logits before applying the softmax activation to obtain the class prediction scores [7]. This simple form of aggregation is effective for the action recognition task, but its effectiveness is not likely to carry over to ball localization. Since Ball-I3D regresses two coordinates, there is no softmax activation. We can average the coordinate predictions of the two streams, as this is the closest mirror of the original approach, but the only way this could lead to a better aggregated result is if the streams have a consistent pattern of being wrong on opposing sides of the target. There is no apparent reason to expect this to be the case. Instead, the aggregated result is likely to be worse than the best individual stream and better than the worst individual stream.

2. Input level: Concatenating Channels

Ideally, the model should have both streams available during training, since it may be able to learn when to use which information from which stream. One way to do this is to merge the two types of input at the start of the pipeline. We concatenate the two-channel flow histograms with the single-channel coordinate histograms in the channel dimension to obtain three-channel inputs. This mirrors how convolutional models receive and learn to use colour information. This method makes the model effectively one-stream, but it can theoretically access all the information and learn features based on both types of input.

3. Encoding level: End-to-end Trained Two-Stream

Instead of aggregating the two streams at the prediction level like in the first method, we join the two streams at the final feature vector, which gives the model the opportunity to learn what features to use for its predictions. To do so we concatenate the final feature vectors from each stream before the final 1x1x1 convolution that outputs the coordinates. The two streams are trained concurrently as a merged model and the total model size is doubled, which makes this an end-to-end two-stream approach.
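As a rough illustration of the third method, the sketch below joins two encoders at the feature level before a single regression head. This is a minimal sketch under stated assumptions: the I3D backbones are not reproduced, the feature dimension is assumed to be 1024, and a linear layer over globally pooled features stands in for the final 1x1x1 convolution that outputs the coordinates.

```python
import torch
import torch.nn as nn

class TwoStreamBallI3D(nn.Module):
    """End-to-end two-stream model: concatenate the final feature vectors of
    the coordinate and flow encoders before regressing the ball coordinates.

    `coord_encoder` and `flow_encoder` are assumed to map an input of shape
    (batch, channels, time, height, width) to a feature vector of shape
    (batch, feat_dim); feat_dim=1024 is an assumption of this sketch.
    """
    def __init__(self, coord_encoder, flow_encoder, feat_dim=1024):
        super().__init__()
        self.coord_encoder = coord_encoder
        self.flow_encoder = flow_encoder
        # Linear layer over pooled features, playing the role of the final
        # 1x1x1 convolution that outputs the (x, y) coordinates.
        self.regressor = nn.Linear(2 * feat_dim, 2)

    def forward(self, coord_hist, flow_hist):
        f_coord = self.coord_encoder(coord_hist)   # (batch, feat_dim)
        f_flow = self.flow_encoder(flow_hist)      # (batch, feat_dim)
        joint = torch.cat([f_coord, f_flow], dim=1)
        return self.regressor(joint)               # predicted (x, y)
```

Because the regression loss is backpropagated through the concatenated features into both encoders, the two streams are optimized jointly, which is what distinguishes this method from averaging independently trained streams.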

Table 5.4 shows the results of these three approaches relative to the single streams. Averaging at the prediction level appears no worse than the best of the two individual streams, and may actually perform better than concatenating the channels. Concatenating the flow information as extra channels appears to offer no improvement, and in fact scores over two standard deviations worse than the baseline. This poor performance may stem from the model merging the channels early, in the first convolutional layer. Each filter takes a weighted sum of the channels to compute its feature map, so it can in theory learn based on the channels, but it stands to reason that keeping flow and coordinate information separated through multiple layers allows learning more distinct features that complement each other. This is what the end-to-end trained two-stream gets the opportunity to do, and it outperforms the coordinate stream by a margin that exceeds two standard deviations. Therefore, flow information does appear to help the model if aggregated in this way, so we deem the trained two-stream method a useful modification to the baseline.

Table 5.4: Results of the three methods of aggregating coordinate and flow information. Training the model end-to-end with concatenation at the feature level performs better than any of the other methods.

Method                              Mean squared Euclidean distance (m²)   Mean Euclidean distance (m)
Coordinate stream (baseline)        184 ± 16                               11.13 ± 0.48
Flow stream                         213                                    12.17
Averaging Predictions               180                                    11.16
Concatenating Channels              205                                    11.91
End-to-End Trained Two-Stream       159                                    10.36


Figure 5.2: Schematic overview of the three ways of aggregating the two streams. Method 1 occurs at the prediction level, method 2 at the input level, and method 3 at the encoding level.


5.1.4 What is the influence of data?

Considering that the pre-trained weights are of very limited use, the model is very large for the number of training samples available in the dataset. The sixty-one minutes used as a training set comprise twenty-three thousand highly correlated samples, while a single stream of the I3D model has 14 million trainable parameters. This discrepancy is problematic in that it makes the model likely to overfit, but it also raises the question of whether the model can actually be trained to its full potential with this number of samples, an issue that plagued C3D [31]. Additionally, sixty-one minutes is not even a full match and may not contain many important football situations or dynamics that are necessary to generalize properly to unseen data. It could very well be the case that the best way of optimizing performance on the validation set and other unseen data is simply to increase the number of training samples. To assess this, we trained the model on increasing proportions of the dataset, to see whether each added portion boosts performance significantly or whether the returns show a diminishing trend.
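A minimal sketch of how such subsets can be constructed is shown below, assuming a PyTorch-style dataset of training clips; taking a contiguous prefix rather than a random subset is an assumption of this sketch, meant to mimic annotating fewer minutes of play.

```python
from torch.utils.data import DataLoader, Subset

def fraction_loader(train_set, fraction, batch_size=16):
    """DataLoader over the first `fraction` of the training clips."""
    n_samples = int(len(train_set) * fraction)
    subset = Subset(train_set, range(n_samples))
    return DataLoader(subset, batch_size=batch_size, shuffle=True)

# e.g. loaders for the five fractions evaluated below:
# loaders = {f: fraction_loader(train_set, f) for f in (0.2, 0.4, 0.6, 0.8, 1.0)}
```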

Table 5.5: Results of training with various fractions of the entire dataset. The largest three tested fractions all perform within the two standard deviation range, demonstrating that increasing the amount of data improves results, but only up to a point that is most likely exceeded by the size of the dataset.

Training data                       Mean squared Euclidean distance (m²)   Mean Euclidean distance (m)
20% data                            254                                    13.49
40% data                            200                                    11.84
60% data                            196                                    11.43
80% data                            171                                    10.76
100% data (baseline)                184 ± 16                               11.13 ± 0.48

Table 5.5 shows the result of training with five subsets of the full dataset of increasing size. While the 20% data model performs noticeably worse, even the 40% data model already lands on the edge of the two standard deviation range of the baseline. This indicates that the model is trainable with a low number of samples, and that simply annotating another thirty minutes of game time will not decrease the prediction error significantly. However, these results do not rule out the possibility that the model would benefit from specifically annotating difficult or unseen situations.

A frequently used technique in deep learning is data augmentation. It can alleviate the issues of training large models with few samples as well as increase the ability of a model to generalize. While we have concluded that the training set is of sufficient size, increased generalization is enough motivation on its own to justify experimenting with augmentation techniques. The histogram representation precludes many of the classic image augmentation techniques: there are no differences between cameras that would benefit from Gaussian noise, colour or contrast shifts, and cropping, shearing and scaling all cause disruptions that are not expected to occur in unseen data produced with the preprocessing pipeline. Nevertheless, some potentially helpful augmentation techniques are conceivable within the limitations imposed by the representation, and we experiment with three of them, described below and sketched in code after the list.

1. Flip

As mentioned before, the flipping augmentation is already included in the baseline. The ideal augmentation technique performs a transformation that results in a significant difference from the original sample, while still being completely representative of the real dataset. In this regard, flipping the histograms comes very close to the ideal. The game of football can be considered reflectionally symmetric both horizontally and vertically: any situation that can happen on the left side of the field can also happen on the right side, and the same holds for the vertical case. While it is possible to think of reasons why this could be false, like most players being right-footed, or sunshine or spectators having some asymmetric effect, at the abstraction level of the input representation these symmetries can be assumed to hold. For that reason, we flip horizontally, vertically or both, each with a 0.25 probability, so that over the epochs each of the four versions is seen equally often on average.

2. Translation

Translating the player coordinates is a reliable way of keeping the inter-player spatial structure intact while moving it to a different location on the field. Samples augmented this way can increase the variation in the training set without introducing samples that are too atypical of real football. We translate with a probability of 0.5, so that the real training samples are encountered more often. If a sample is translated, we first select whether the translation is horizontal, vertical or both, with uniform probability over the three translation types. Subsequently, we uniformly select a distance in pixels for each active translation dimension, limited to 3 pixels in either direction.

3. Rotation

Rotating the player coordinates also keeps the inter-player spatial structure mostly intact, but changes the locations in a different way than translation does. The player locations are rotated around the center spot. The degree of rotation has to be small, because the displacement of a player grows with its distance from the axis of rotation, and a goalkeeper that is no longer in front of the goal is clearly unnatural. We limit the rotation to 5 degrees in either direction. We rotate with a probability of 0.5, so that the real training samples are encountered more often. If a sample is rotated, the amount of rotation is selected uniformly. When used concurrently with the translation augmentation, different combinations of rotations and translations reinforce the effect of the individual augmentations, further increasing variation in the dataset.
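The sketch below implements the flip and translation augmentations (the two that prove useful, with rotation omitted). It is a minimal sketch: the array layout (channels, time, height, width), the axis orientation, and the helper names are assumptions, and the ball-coordinate target must of course receive the same transformation, which is not shown here.

```python
import numpy as np

def shift2d(x, dy, dx):
    """Shift the last two axes by (dy, dx) cells, filling vacated cells with zeros."""
    out = np.zeros_like(x)
    h, w = x.shape[-2], x.shape[-1]
    out[..., max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        x[..., max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def augment(sample, rng):
    """Flip/translate a histogram sequence of shape (channels, time, height, width)."""
    # Flip horizontally, vertically or both, each with probability 0.25.
    r = rng.random()
    if r < 0.25:
        sample = sample[..., ::-1]             # horizontal flip
    elif r < 0.50:
        sample = sample[..., ::-1, :]          # vertical flip
    elif r < 0.75:
        sample = sample[..., ::-1, ::-1]       # both flips
    # With probability 0.5, translate up to 3 cells along the chosen axes.
    if rng.random() < 0.5:
        direction = rng.integers(3)            # 0: horizontal, 1: vertical, 2: both
        dx = int(rng.integers(-3, 4)) if direction in (0, 2) else 0
        dy = int(rng.integers(-3, 4)) if direction in (1, 2) else 0
        sample = shift2d(sample, dy, dx)
    # NOTE: the ball-coordinate target must be flipped/translated identically.
    return np.ascontiguousarray(sample)

# usage: augmented = augment(histogram_sequence, np.random.default_rng())
```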

Table 5.6: Results when using the three types of augmentations, indicating that the flip and translation augmentations improve results.

Augmentation                        Mean squared Euclidean distance (m²)   Mean Euclidean distance (m)
Non-augmented                       236                                    12.82
Flip (baseline)                     184 ± 16                               11.13 ± 0.48
Rotation                            242                                    12.99
Translation                         188                                    11.66
All augmentations                   161                                    10.63
Flip + translation                  142                                    10.00

Table 5.6 shows the results of applying the augmentation techniques. As expected, flipping proves to be a solid augmentation technique, with the non-augmented run performing notably worse than the baseline, which includes flipping. The translation technique on its own achieves a much lower error than the non-augmented run and also appears to be a solid augmentation. Rotation provides no apparent improvement. Conceptually, this might be because rotating around the center spot does not fit well with the dynamics of football. Football has many situations that can be understood as being symmetric across the midline. For instance, if the team in possession moves to play the ball close to one sideline, then the defenders will also move to that sideline to cover and intercept. Rotating around the center spot relies on rotational symmetry, which seems less common in football and moves players that are on opposite sides of the midline towards opposite sidelines. Enabling all augmentations simultaneously performs well, but enabling just flipping and translating achieves an even better result, providing further evidence that rotation is not a good augmentation. To conclude, out of these three simple augmentation techniques, flipping and translation appear to increase the generalization potential of the model. As a whole, augmentation is helpful for the Ball-I3D method.

5.1.5 What is the influence of the temporal dimension?

The design decision to use a video encoder to model player coordinates is based on the assumption that the temporal dimension is important to the task. We conduct various experiments to shed some light on how the temporal dimension influences results. In the initial implementation of the Ball-I3D method the temporal length of the input sequence has been set to 25. This choice is motivated solely by four seconds (25 frames at 6.25 fps) sounding like a reasonable time span for the game of football. We conduct an experiment with sequence lengths of 8, 15, 40 and 80 to find out whether the model can utilize longer temporal patterns, or whether shorter lengths are already sufficient. Eight frames is the minimum temporal length that the I3D encoder is designed to handle, due to its temporal convolution kernel sizes and padding settings. To be able to judge the performance of the model with spatial information only, we also provide the encoder with only the histogram at the target time step, copying this histogram along the temporal dimension to obtain the minimum sequence length. Table 5.7 shows the results for different amounts of temporal information. Using spatial information only performs far worse than using any amount of temporal information, confirming the hypothesis that the temporal dimension is important and indicating that Ball-I3D is able to make use of it. The 8 and 40 frame results fall within the two standard deviation range of the baseline, while the 15 and 80 frame results appear worse than using 25 frames. Considering these equivocal results, it is not possible to draw a clear conclusion that shorter or longer sequence lengths improve results.

Table 5.7: Results when using various sequence lengths. Using a sequence of histograms appears better than using a single histogram, indicating that the model does learn from the temporal dimension. The 80-length sequence appears to be slightly worse, but otherwise the sequence lengths perform similarly.

Sequence length (frames)            Mean squared Euclidean distance (m²)   Mean Euclidean distance (m)
Spatial-only                        298                                    14.61
8                                   196                                    10.85
15                                  203                                    11.28
25 (baseline)                       184 ± 16                               11.13 ± 0.48
40                                  190                                    11.20
80                                  211                                    11.76

One potential issue with using player coordinates to localize the ball precisely is that player movements are not always temporally synchronized with ball movement. In some cases, players move to where they predict the ball is going to be, while it is also commonplace that sudden directional shifts cause player movements to lag behind what the ball is doing. Therefore, it is an interesting experiment to see whether predicting using input frames that lie beyond the target frame can increase accuracy. Knowing the player movements both before and after the target frame might help the model to time its predictions better. Using frames from the future is cheating in the sense that it is impossible to do so in strictly real-time deployment. However, in the context of live streaming it is generally acceptable to have a small delay between recording and broadcasting, making this technique potentially viable even outside of a research setting. Practically, using a temporal sequence length of twenty-five with five future frames means that the target is the ball location in the twentieth frame.
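The indexing below sketches how the input window relates to the target frame, including the optional future frames. It is a minimal sketch: the function and variable names are illustrative, and `histograms` is assumed to be a per-frame sequence of coordinate histograms at 6.25 fps.

```python
def input_window(histograms, target_idx, seq_len=25, future=0):
    """Select the input sequence used to predict the ball at frame `target_idx`.

    With future=0 the window ends at the target frame; with future=5 the last
    five frames lie beyond the target, so the target is the 20th frame of a
    25-frame window.
    """
    end = target_idx + future + 1      # exclusive end of the window
    start = end - seq_len
    if start < 0:
        raise ValueError("not enough preceding frames for this target")
    return histograms[start:end]
```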

Table 5.8: Results when using a number of frames from the future, indicating that using some future frames can help, but the improvement does not generalize to ten future frames.

Future frames                       Mean squared Euclidean distance (m²)   Mean Euclidean distance (m)
0 (baseline)                        184 ± 16                               11.13 ± 0.48
5                                   158                                    10.34
10                                  200                                    11.42

Table 5.8 shows the result of using future frames. Utilizing five future frames appears to be an improvement over using no future information, reducing the error by more than two standard deviations. Conceptually, this might be because the future frames provide information that helps the model decide whether the players are anticipating the ball or reacting to it. Surprisingly, this improvement appears absent when the number of future frames is increased to ten. Even though using five future frames scores below the two standard deviation range of the baseline, we do not consider this modification part of the optimized Ball-I3D method, as it is a form of "temporal cheating" that may not be applicable in every setting.

5.1.6 Conclusion

In the experiments presented and discussed above, two modifications outperformed the baseline by more than two standard deviations on the validation set: training a two-stream model end-to-end and augmenting with both flips and small translations. Using both modifications, we can evaluate Ball-I3D in its optimized form. Table 5.9 shows the results of the optimized Ball-I3D versus the baseline on the validation set. Ball-I3D outperforms the baseline by the greatest margin yet, indicating that the two optimizations are also useful in unison.

Table 5.9: Results of the baseline and optimized Ball-I3D on the validation set. The optimized Ball-I3D obtains the best performance, supporting the conclusion that the two modifications are useful.

Model                               Mean squared Euclidean distance (m²)   Mean Euclidean distance (m)
Baseline                            184 ± 16                               11.13 ± 0.48
Ball-I3D (optimized)                139                                    9.75

Table 5.10 shows the same results on the test set. There is a notable difference in the absolute error values between the validation and test sets, with both the baseline and the optimized Ball-I3D scoring over 80 m² higher on the test set. It is plausible that this stems partly from the bias introduced by the optimization decisions taken based on validation results; however, the fact that we are predicting on only ten minutes of football means there can also be vast differences in how difficult the game situations are to predict. With the available data this is the most representative test of deploying the method that we can do, and differences in difficulty between any set of ten minutes are to be expected. Crucially, the relative difference between the baseline and optimized Ball-I3D that we see on the validation set largely generalizes to the test set, indicating that the optimizations also improve results on unseen data.
