Tracking a ball live on a mobile device

Joost Vledder (11895144)
Bachelor thesis, 18 EC
Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: J. Groot Kormelink, JOGO
Stroombaan 4, 1181 VX Amstelveen

29 January 2021


Abstract

Following an object in a video stream in real-time on a mobile device can be done using an object detector. However, due to the relatively low computational power of mobile devices, it is challenging to keep this real-time. This thesis explores the use of an object tracker to reduce the time needed to track a ball in a video stream captured on a mobile device. It was found that combining an object detector with either a Kernelized Correlation Filter (KCF) or a Discriminative Correlation Filter Tracker with Channel and Spatial Reliability (CSRT) keeps tracking feasible in real-time whilst still locating the ball accurately.

Contents

1 Introduction
2 Literature Background
  2.1 CSRT
  2.2 KCF
  2.3 GOTURN
3 Method and Approach
  3.1 MediaPipe
  3.2 OpenCV
  3.3 Ground Truth
  3.4 The application
  3.5 Configurations
  3.6 Exploration trackers
    3.6.1 Parameter Tuning
  3.7 Intersection over Union
  3.8 Data files
4 Results
  4.1 Time
    4.1.1 Proof of Concept
    4.1.2 Exploration trackers
  4.2 Misses
    4.2.1 Proof of concept
    4.2.2 Exploration trackers
  4.3 IoU
    4.3.1 Proof of concept
    4.3.2 Exploration trackers
  4.4 KCF optimisation
    4.4.1 Parameter tuning
    4.4.2 KCF configurations
5 Analysis
  5.1 Not usable algorithms
  5.2 Potential solutions
6 Conclusion
7 Discussion

1 Introduction

Recent years have shown an increase in interest in Computer Vision (CV). Within the field of CV, two main tasks are locating an object and following an object. The first task can be realised using object detection. The second task can be realised in two ways: by running object detection on each frame, or by implementing an object tracker. The choice between the two depends on the platform the video or frames are processed on and on whether the object should be followed in real-time. If the camera of a smartphone is to be used to follow an object in real-time, the relatively low computational power of the smartphone makes running an object detection algorithm on each frame problematic.

This is also the problem for JOGO (JOGO, 2020). JOGO is a start-up which aims to create a platform for young football players on the one hand and football trainers and clubs on the other. On this platform users can do exercises of which the app automatically records the performance. As this creates objective data, players and trainers can use numbers to improve themselves or their players respectively. An example of an exercise is juggling the ball. JOGO wants to automatically count the number of times a player juggles the ball, and in order to do this they need to be able to follow the ball. Currently they do this by running object detection on each frame, and in practice this suffices, i.e. they are able to count the number of juggling repetitions. However, according to JOGO the algorithm takes too much time, leading to interpolation between frames, so they lose track of the ball in some frames. The second problem is that running their object detector on each frame is very power demanding, rapidly draining the smartphone's battery.

A possible solution to these problems is object tracking. Object tracking is the process of tracking an object over a period of time by locating its position in each video frame (Almohaimeed and Prince, 2019). There are many different object tracking solutions. For this project a tracker is defined as follows: a detection-based, single-object, online-learning tracker. Detection-based means that after every specified number of frames the object detector is run again. This is chosen to make sure that the tracker can run for a long time without losing the ball, while simultaneously tackling the problem of having to run the object detector on each frame. As the exercise used for this project is juggling a ball, the tracker only has to focus on one ball, so a single-object tracker is best suited. Lastly, the tracker should perform live; therefore it only uses the information from the initialisation frame to learn.

The use of a lightweight tracker, which is capable of tracking a moving object for a certain amount of frames, could be the solution for JOGO. The research question and the sub questions for this project are:

1. To what extent is it possible to create an object tracker for a mobile device which tracks footballs in real-time and is thus a better solution than running an object detector on every frame?

1.1. What is needed to make tracking an object on mobile device in real-time pos-sible?

1.2. What is the impact of the interval at which the object detector is rerun on the time taken and the accuracy of the tracking?

1.3. When dropping the mobile device requirement, what are optimisations or track-ers that could be further explored?

First of all, in order to explore the possibility of creating an object tracker on a mobile device which performs in real-time, some sort of mobile platform or library is needed. If neither can be found, the challenge would be too extensive for this project.

The impact of the frame of initialisation is expected to be extensive. As the trackers will run on a device with low computational power and the ball will move quickly, it is expected that the trackers will lose the ball easily. Resetting the tracker often can therefore keep the tracker more accurate. It will impact the time as well: when the slower object detector is run more often, the average time to track will go up.

Lastly, it is expected that conducting experiments in a desktop application will be more efficient, because a modification to a mobile application entails installing a new version of that application.

For the main question, it is expected that an object tracker that runs in real-time on a mobile device can be created. However, the accuracy is expected to decrease strongly, because the ball moves rapidly. Due to this, the object tracker is expected to have to look at a large part of the frame, which will demand a tracker too heavy to run on a mobile device.

It is expected that an object tracker can be implemented, since it has been done before. For example, research has been done into the possibilities of tracking an emergency exit sign recorded on a mobile device (Mohammed and Morris, 2014). Nevertheless, this project differs in two ways. First of all, in the mentioned paper the camera moves and the object is fixed, whereas in this project the camera is fixed and the object moves. Furthermore, the main focus of the related work is the color of the object. For this research, the color of the ball will not be the main focus, as balls exist in many different colors.

The answers to these questions will be found in two ways. First of all, a proof of concept will be built using OpenCV's 3.2 implementation of the Discriminative Correlation Filter Tracker with Channel and Spatial Reliability (CSRT), which will run in a smartphone application. This tracker is discussed in section 2.1. The proof of concept mainly shows the decrease in time needed when running an object tracker compared to running an object detector. However, it also shows that the accuracy does not drop, especially when the object detector is rerun every specified number of frames. The proof of concept will answer sub questions 1.1 and 1.2. Next to the proof of concept, an exploratory part will be set up, which further explores the possibilities of OpenCV. In this part other trackers and optimisation possibilities will be investigated. The trackers which will be looked into further are a Kernelized Correlation Filter (KCF), discussed in section 2.2, and a Generic Object Tracking Using Regression Networks (GOTURN) tracker, discussed in section 2.3. These are not implemented in an application, because they have limitations which made running them on a mobile device infeasible for this project. These exploration trackers allow a better comparison between the trackers provided by OpenCV. This exploratory part of the project serves to form an answer to sub question 1.3.

2 Literature Background

As discussed in section 1, the main goal of this project is to look into the potential of using a tracker for following an object in a video stream in real-time. The CSRT is used in the proof of concept and also for the exploration of OpenCV's potential. As the KCF and GOTURN trackers are only used for the exploration part, these will also be referred to as the exploration trackers.

2.1 CSRT

First of all, the tracker that will be used in the proof of concept is OpenCV's implementation of the Discriminative Correlation Filter tracker with Channel and Spatial Reliability (Lukežič et al., 2018).

This tracker is based on the discriminative correlation filter tracker (DCF). These types of trackers learn a filter with a pre-defined response on the training image. This is done by selecting a region slightly bigger than the object, so that some background is taken in as well. Normally a DCF uses circular correlation to make sure learning can be done efficiently with the Fast Fourier Transform (FFT). Because of the circularity, the filter is trained on many circularly-shifted variants of the target. Furthermore, the FFT requires the filter and the search region to be of equal size, which limits the detection range. Lastly, DCFs assume targets to be axis-aligned rectangular objects; for non-rectangular or hollow targets the filter learns parts of the background as well.
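To make the FFT step concrete, the standard single-channel correlation filter formulation from the wider literature (not spelled out in this thesis, so stated here as background) learns a filter $h$ whose correlation with the training patch $f$ matches a pre-defined response $g$, with a well-known closed-form solution in the Fourier domain:

$$\min_{h}\; \lVert f \star h - g \rVert^{2} + \lambda \lVert h \rVert^{2} \quad\Longrightarrow\quad \hat{h}^{*} = \frac{\hat{g} \odot \hat{f}^{*}}{\hat{f} \odot \hat{f}^{*} + \lambda},$$

where hats denote the FFT, $^{*}$ complex conjugation, $\odot$ element-wise multiplication and $\lambda$ a regularisation weight. This element-wise division is why learning and detection cost only a few FFTs per frame.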

To overcome these issues, CSR-DCF, the discriminative correlation filter with channel and spatial reliability, was introduced. The spatial reliability map constrains the filter to the object within the region of interest, thereby overcoming the problems of the circular shift and the rectangular object assumption. Since the circular shift problem is overcome, arbitrary search regions can be chosen. This leads to including more background samples, which helps the discriminative power of the algorithm. The channel reliability is used to determine the importance of each of the channels provided by the input frame. The paper uses two types of reliability measures: channel learning reliability, which is calculated in the filter learning stage, and channel detection reliability, which is calculated in the target localisation stage. By doing this the authors can use simple features like HoG and Colornames while still competing in accuracy with algorithms that use deep features, and outperforming those same algorithms on speed. Because of this the algorithm was claimed to run at almost real-time, which is why this tracker was considered for this thesis.

This was the only reference given by OpenCV to explain their algorithm, so it is assumed that modifications were made to this algorithm.

2.2 KCF

OpenCV's KCF tracker was implemented to tackle the problem of not being able to incorporate all the negative samples present in a single frame (Henriques et al., 2014). This problem exists because using all negative samples would be too computationally expensive for modern trackers. However, the paper considers not using all negative samples to be the factor inhibiting tracking performance in general. This is why the authors introduce circulant matrices, analytical tools that enable the use of Kernelised Ridge Regression without suffering from "the curse of kernelisation". This creates a kernelised variant of a linear correlation filter, allowing the use of multiple feature channels whilst keeping the computational complexity, and thus the computation time, very low. This is the greatest advantage of this tracker: being able to run at over 100 fps, although this figure was measured in a desktop application.
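As background (again not spelled out in the thesis): because circulant matrices are diagonalised by the FFT, the kernelised ridge regression in Henriques et al. (2014) has a closed-form dual solution computed element-wise in the Fourier domain,

$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda},$$

where $y$ is the desired response, $k^{xx}$ the kernel autocorrelation of the training sample $x$, and $\lambda$ the regularisation weight. This is what keeps the per-frame cost low enough for the 100 fps figure.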

2.3 GOTURN

Lastly, OpenCV's implementation of GOTURN will be discussed. As mentioned before, GOTURN stands for Generic Object Tracking Using Regression Networks. Generic object tracking means that it does not learn to track one specific class of objects. This tracker is not based on correlation filters, but uses neural networks instead (Held, Thrun, and Savarese, 2016). According to the authors, correlation filter trackers are traditionally trained entirely online and suffer in performance because of it: they do not use the large number of available videos to improve their performance by learning from them. GOTURN does use these available videos and trains the algorithm offline. This offline learning is the first reason that the algorithm can run fast, despite being based on neural networks. The second reason is the use of one-pass regression.

The algorithm is claimed to run at over 100 fps; however, this is on a high-end desktop. On a desktop with just a CPU the algorithm is claimed to run at 2.7 fps.

3 Method and Approach

The tracker to be implemented had to meet one requirement in particular: it should run in real-time on a smartphone. This initially led to the MediaPipe platform (Lugaresi et al., 2019) and the OpenCV library (Bradski, 2000).

3.1 MediaPipe

MediaPipe is a platform created by Google specifically aimed at audio, video and sensory machine learning applications. It is written in C++ but can be deployed on a variety of platforms, one of them being smartphones. However, at the time of this research MediaPipe is at version 0.8.2, an alpha version, so limited documentation is available. As the time for this thesis is limited, diving into MediaPipe seemed out of scope for this project. Furthermore, as MediaPipe is still in alpha, Google has already stated that many core features may be subject to change. Nevertheless, MediaPipe could be a great solution for JOGO, especially once it is no longer in alpha.

3.2 OpenCV

OpenCV has implementations for a variety of platforms, one of which is mobile devices. However, the tracking functionality is not part of the supported core of the library; it lives in a separate repository containing all of the extra modules, including the tracking module. Depending on the platform this can mean manually building the OpenCV configuration. After building OpenCV, the first step was to look at the possibilities of this library without implementing it in a smartphone application. This was decided in order to keep the problem low-level at first, before dealing with the complications of smartphone applications. As this first step was exploratory, it was decided not to implement JOGO's object detector yet, but to use a simple face detector and track the face. This was all done in Java, which meant that parts of this step could be implemented in the smartphone application directly. It was also decided that for this step no results would be gathered, since the idea was only to determine whether the OpenCV trackers were implementable and tracked the face. The CSRT, KCF and GOTURN trackers were all able to track the face, so the OpenCV trackers could be researched further.

As addressed in section 1, the research will be done in a smartphone and a desktop application. For the smartphone application, version 4.3.0 of OpenCV was used. In this version it is not possible to tune tracker parameters from Java; in fact, newer versions do not appear to offer this possibility for Java either. For the exploratory part, version 4.4.0 of OpenCV was used in Python, because for Python a method to tune the tracker parameters is already available.
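As an illustration of the API involved, the sketch below shows the basic create/init/update cycle that all three OpenCV trackers share. This is a minimal sketch, assuming the Python bindings of OpenCV 4.4 with the contrib tracking module; the video file name and the initial bounding box are illustrative, not taken from the project.

import cv2

cap = cv2.VideoCapture("video1.mp4")        # illustrative file name
ok, frame = cap.read()

tracker = cv2.TrackerCSRT_create()          # or TrackerKCF_create / TrackerGOTURN_create
bbox = (431, 246, 62, 48)                   # (x, y, w, h) from some detector; illustrative
tracker.init(frame, bbox)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    found, bbox = tracker.update(frame)     # found is False when the tracker loses the object
    if found:
        x, y, w, h = map(int, bbox)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)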

3.3 Ground Truth

After this conclusion, a ground truth was needed in order to evaluate the trackers. This was provided by JOGO, in the form of the frames of three videos and a JSON file with bounding box locations of the football in each frame. The bounding boxes were found using YOLO's object detector, which was not implementable on a smartphone but was considered accurate enough to function as ground truth. Because the bounding boxes were not annotated manually, not every frame has a ground truth bounding box. Furthermore, it is not recorded whether the ball is in or out of frame. In the second and third video the ball does leave the camera frame, so in those videos a missing bounding box can be correct. In the first video this is not the case; nevertheless, in just over 25% of the frames the ball is not recognised, see table 1. This ratio was considered good enough, since it leaves 651 frames for that video to compare with.

          # Total frames   # Detected   # Not detected   Ratio detected
Video 1   877              651          226              0.742
Video 2   1356             524          832              0.386
Video 3   755              470          285              0.623
Total     2988             1645         1343             0.551

Table 1: The distribution of ground truth bounding boxes over the three videos.

Since the ground truth bounding boxes were not found in each frame, it was important to inspect these bounding boxes to see whether the object detector was accurate. Figure 1 shows an example of a ground truth bounding box. A run through all frames, projecting the bounding boxes onto them, showed that the bounding boxes were consistently accurate.

Figure 1: Example of a ground truth bounding box.

3.4 The application

To test the feasibility of a real-time tracker for tracking a football on a device with low computational power, a tracker was implemented in a smartphone application. This application had two other requirements to meet in order to be considered a proof of concept for solving JOGO's problems. First of all, the object detector and the tracker had to be run on the videos provided by JOGO. Secondly, JOGO's object detector had to be implemented; JOGO also provided code for this. The object detector is a TensorFlow Lite model, so it can be implemented with little effort in an application using Android Studio.

For the tracker to run in a smartphone application, the OpenCV tracking modules had to be imported into the application. Quickbirdstudios is an app development company which has created a dependency for using OpenCV in applications. However, this dependency is currently not up-to-date, leading to the earlier mentioned limitation that the trackers cannot be tuned. The older version of OpenCV does, however, provide a sufficient tracker for a proof of concept.

To enable evaluation of the tracker, the application had to run the tracker on the videos provided by JOGO. To achieve this, the frames were all added to the application. For visual evaluation it was important to be able to see where the bounding boxes were drawn. In Android applications this has to be realised by loading all the frames and detecting and tracking the object on a thread other than the main thread. Once this was set up, the bounding box information, along with the time it took to detect or track, had to be saved.

3.5 Configurations

The next step was to define the different configurations for the proof of concept. A configuration is defined by tracking until frame n is reached, where n is a predefined number. JOGO have said that they do not want to run their object detector on every frame because of the earlier mentioned problems, but that they will run their object detector every 500 ms. This corresponds to running the object detector every 15 frames, assuming smartphone cameras record at 30 frames per second. So the first configuration is to run the object detector every 15 frames (CSRT15). As a baseline, another configuration is to run only the object detector. The next two are to run the object detector every 5 (CSRT5) and every 30 (CSRT30) frames. To see the effect of running the tracker on its own without interference, the last configuration runs the object detector on the first frame only, after which the tracker keeps running until the end of the video stream (CSRT). For these configurations it was decided not to run the object detector when the tracker did not find the football, so that the effect of running the object detector every n frames could be analysed more cleanly. When the object detector itself was not able to find the ball, it was run again, because the initialisation of the tracker needs to be done with an accurate bounding box. A sketch of this detect-every-n scheme is given below.
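The following is a minimal sketch of such a configuration, under stated assumptions: detect() stands in for JOGO's TensorFlow Lite detector (its real interface is not described in this thesis), the helper name run_configuration is hypothetical, and a tracker miss deliberately does not trigger re-detection, matching the design choice above.

import cv2

def run_configuration(frames, detect, n=15, make_tracker=cv2.TrackerCSRT_create):
    """Rerun the detector every n frames, otherwise update the tracker."""
    tracker, results = None, []
    for i, frame in enumerate(frames):
        if tracker is None or i % n == 0:
            bbox = detect(frame)                 # hypothetical detector call
            if bbox is None:
                tracker = None                   # detector missed: retry detection next frame
                results.append(None)
                continue
            tracker = make_tracker()
            tracker.init(frame, bbox)            # initialisation adds time on this frame
            results.append(bbox)
        else:
            found, bbox = tracker.update(frame)
            # per the design above, a tracker miss does not trigger re-detection
            results.append(tuple(map(int, bbox)) if found else None)
    return results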

3.6 Exploration trackers

The KCF and GOTURN trackers described earlier were not usable in the application, for different reasons. The KCF tracker in its original form performed poorly; however, parameter tuning was considered an option to improve its performance enough to make it viable. The process of parameter tuning is discussed further in section 3.6.1.

The GOTURN tracker was not implemented in the application because it needs additional files to run. This was not accomplished for the application; however, the GOTURN tracker sounded promising and was recommended by JOGO as well. So it was decided to examine this tracker in a desktop application, to determine its performance and be able to compare it with the CSRT. Running this tracker in a desktop application was achievable, since placing the additional files in the same folder as the program was sufficient. This tracker has no parameters to tune, so its performance was measured by the time it needed to track and initialise and by the bounding boxes it found. Neither of these trackers was combined with the object detector, because that would unnecessarily complicate the process. Furthermore, as these trackers are not run on a mobile device, the times taken cannot be compared to the proof of concept, so the time taken by the object detector adds no value to the analysis. Moreover, as the bounding boxes given by the object detector were already available, these bounding boxes are used to reset the tracker every n frames. The proof of concept was used for analysing the different configurations; for this part a standard configuration was chosen in which the tracker was reset every 15 frames.

Lastly, the individual frames were turned into a video. This was done to avoid having to read the individual frames from memory one by one, and was purely an implementation convenience.

3.6.1 Parameter Tuning

As mentioned, the KCF tracker was not usable in its original form; however, the newer OpenCV module has an option to tune parameters. This is done by first saving the default parameters to a file, for which the XML format was chosen in this project. The following is the content of that file for the default KCF tracker.

<myobject>
  <detect_thresh>5.0000000000000000e-01</detect_thresh>
  <sigma>2.0000000298023224e-01</sigma>
  <lambda>9.9999997473787516e-05</lambda>
  <interp_factor>7.5000002980232239e-02</interp_factor>
  <output_sigma_factor>6.2500000000000000e-02</output_sigma_factor>
  <resize>1</resize>
  <max_patch_size>6400</max_patch_size>
  <split_coeff>1</split_coeff>
  <wrap_kernel>0</wrap_kernel>
  <desc_npca>1</desc_npca>
  <desc_pca>2</desc_pca>
  <compress_feature>1</compress_feature>
  <compressed_size>2</compressed_size>
  <pca_learning_rate>1.5000000596046448e-01</pca_learning_rate>
</myobject>
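For concreteness, a hypothetical sketch of this save/edit/load round trip is shown below. The file names are illustrative, and the availability of save and read on the tracker object in the Python bindings is an assumption based on the C++ API, not something verified in this thesis.

import cv2

tracker = cv2.TrackerKCF_create()
tracker.save("kcf_default.xml")                  # dump the default parameters

# ... edit the XML by hand, e.g. detect_thresh 0.5 -> 0.31 ...

fs = cv2.FileStorage("kcf_tuned.xml", cv2.FILE_STORAGE_READ)
root = fs.getFirstTopLevelNode()
print("detect_thresh:", root.getNode("detect_thresh").real())
tracker.read(root)                               # assumed binding, see note above
fs.release()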

Unfortunately these parameters are poorly described by OpenCV; most descriptions are a single sentence long. Additionally, the descriptions do not explain the parameter values, for example what the minimal and maximal allowed values are. The types of the parameters are provided by OpenCV. The only boolean parameters are compress_feature, resize, split_coeff and wrap_kernel. The integer parameters are compressed_size, desc_npca, desc_pca and max_patch_size. All other parameters are floats.

The detection threshold parameter (detect_thresh) was expected; however, it was assumed to be higher than the default value. This was one of the parameters expected to have a high impact on performance. However, as this threshold was already low, at only 0.5, the impact of tuning it was predicted to be smaller than expected at first. Even so, for this exploration it was decided to set the detection threshold to 0.31. This means more frames detected, which means more data and thus more research possibilities.

The other parameters needed some more investigation. Due to the lack of documentation it was hard to determine the parameters' influence beforehand. Thus the initial approach was to tune the parameters manually to see what impact that had. This was done to find parameters with the potential to increase the algorithm's performance, as little could be found about this tuning in related work. Moreover, the number of possible combinations of parameter values was too large to loop over methodically. For this tuning, better performance is measured as tracking more frames.

3.7 Intersection over Union

Intersection over Union (IoU) was chosen as the evaluation method. To calculate the IoU, the area of overlap is divided by the area of union. The area of union is always bigger than or equal to the area of overlap, so the IoU is always a number between 0 and 1. For this method the minimal and maximal coordinates of a bounding box are needed: the coordinates of the top left and bottom right corners. Before the ground truth bounding boxes could be used for the IoU, they had to be translated. The ground truth bounding box coordinates correspond to an image rotated ninety degrees clockwise, whereas the found coordinates correspond to the upright, real-world representation. To align them, the ground truth coordinates were translated as follows:

x' = y
y' = W − x

where y is the y coordinate before rotation, W is the width before rotation and x is the x coordinate before rotation. After this translation, however, the two translated corners represent the bottom left and top right of the bounding box. So to get the ground truth coordinates of the top left corner, the box height had to be subtracted from the y coordinate of the bottom left corner, and to find the bottom right corner, the box height had to be added to the y coordinate of the top right corner. A code sketch of this translation follows.
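A minimal sketch, with hypothetical names; the min/max re-sorting at the end is equivalent to the box-height subtraction/addition described above.

def translate_box(x_min, y_min, x_max, y_max, W):
    # apply (x, y) -> (y, W - x) to both corners of the ground truth box
    ax, ay = y_min, W - x_min        # becomes the bottom left corner
    bx, by = y_max, W - x_max        # becomes the top right corner
    # re-sort into top left and bottom right corners
    top_left = (min(ax, bx), min(ay, by))
    bottom_right = (max(ax, bx), max(ay, by))
    return top_left, bottom_right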


After the translation the IoU was calculated in the following manner:

// find the coordinates which make up the intersection rectangle
interXMin = max(goldMin.x, trackMin.x)
interYMin = max(goldMin.y, trackMin.y)
interXMax = min(goldMax.x, trackMax.x)
interYMax = min(goldMax.y, trackMax.y)

// compute the area of intersection (zero if the boxes do not overlap)
intersection = max(0, interXMax - interXMin + 1) * max(0, interYMax - interYMin + 1)

// calculate the areas of the original bounding boxes
goldArea = (goldMax.x - goldMin.x + 1) * (goldMax.y - goldMin.y + 1)
trackArea = (trackMax.x - trackMin.x + 1) * (trackMax.y - trackMin.y + 1)

// calculate the union
union = goldArea + trackArea - intersection

// special values encode the reason when no IoU can be computed
if goldArea == 0:
    iou = -1
else if trackArea == 0:
    iou = -2
else:
    iou = intersection / union

As can be seen, the IoU can be -1 or -2. This was done to differentiate between the different reasons for an IoU of zero. An IoU of -1 means there was no gold standard to compare to for that frame. An IoU of -2 is returned if the tracker or object detector was not able to find the object. After the IoU of each frame was calculated, they were saved as well.

3.8 Data files

After finding the bounding boxes, the times taken, the frames on which the object detector was run and the IoU scores, a file was created for each of the configurations and trackers, as displayed in table 2. The x and y coordinates are normalised by dividing by the frame width and height. The width and height are measured in pixels. The time is in milliseconds. Finally, the object detector column indicates whether on that frame the object detector was run successfully (1), unsuccessfully (-1), or whether the tracker was run on that frame (0).

(16)

           x      y      width  height  time   Object detector  IoU
f0_1.jpg   0.443  0.543  0.149  0.0867  125    1                -1.00
f1_1.jpg   0.450  0.545  0.144  0.0850  29.9   0                0.818
f2_1.jpg   0.460  0.550  0.138  0.0817  28.9   0                0.742
f3_1.jpg   0.471  0.561  0.138  0.0817  23.5   0                0.515
f4_1.jpg   0.475  0.567  0.144  0.0850  22.0   0                0.698

Table 2: Example of the CSV file used for evaluation.
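As an illustration of how such a file feeds the results in section 4, the sketch below computes the three summary statistics used there. The file and column names are assumptions based on table 2, not the project's actual code.

import pandas as pd

cols = ["frame", "x", "y", "width", "height", "time", "object_detector", "iou"]
df = pd.read_csv("csrt15_video1.csv", names=cols)    # illustrative file name

print("average time (ms):", df["time"].mean())
print("misses:", (df["iou"] == -2.0).sum())          # frames where nothing was found
valid = df[df["iou"] >= 0]                           # frames with a ground truth box
print("average IoU:", valid["iou"].mean())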

4 Results

The goal for JOGO is to be able to track the ball in every frame. This was not possible with JOGO's object detector, since it took too much time. So the first result to look at is the time it takes to use OpenCV's object trackers. Nevertheless, the speed of the algorithm should not cause its accuracy to drop dramatically. The second result to be examined is the number of times the tracker or detector did not find the ball. Thirdly, the IoU scores for the trackers and detectors will be analysed. Lastly, the results of optimising the KCF tracker are displayed.

Each subsection of the results is divided in two parts, the first part being the results of the proof of concept and the second part being the results for the exploration trackers. For the exploration trackers a configuration was chosen in which the tracker was reset every 15 frames. The object detector was not used to analyse the exploration trackers, as mentioned in subsection 3.6. Furthermore, the exploration trackers were only run on video 1. This was done to be able to compare the trackers on one video, without having to consider that some of the frames in which no ball was detected might be correct misses.

4.1 Time

4.1.1 Proof of Concept

First of all, the time it takes to either detect or track was analysed. For each of the videos, the average time of each of the algorithms to detect or track is displayed in table 3. As displayed, the more often the object detector is run, the longer the algorithm takes per frame on average. This is expected, because the object detector consistently takes over 65 milliseconds to detect the ball; if the object detector is run more often in a configuration, the average goes up. However, this is not the only thing that impacts the average time taken. For the tracker configurations, the time taken to initialise the tracker has to be added to the frame in which the object is detected as well. This initialisation time appeared to be in the same range as a tracker update. So the time needed for a tracker initialisation frame is the time to find the object, 65 milliseconds, plus the initialisation time, at least 18 milliseconds, which means that at least 83 milliseconds are needed for such a frame.

                   Average time (ms)
Configuration      Video 1   Video 2   Video 3
CSRT               25.9      20.4      24.1
CSRT30             28.3      31.9      35.2
CSRT15             33.2      37.3      38.2
CSRT5              40.9      46.2      45.7
Object Detector    66.1      65.7      66.3

Table 3: Overview of the average time to run each algorithm, per video and per configuration.

To get more insight into the distribution of the times to detect and track, the following box plots were made.

Figure 2: Box plot of the time taken for detecting and tracking of video 1.

Figure 3: Box plot of the time taken for detecting and tracking of video 2.

Figure 4: Box plot of the time taken for detecting and tracking of video 3.

From these plots it can be seen that, for all tracker configurations, the lower 75% of the measured times all lie below the average for the object detector. Furthermore, the impact of the object detector is clearly visible as well: the more often the object detector is run, the more the times deviate. Especially CSRT5 shows this; its interquartile range goes up to 40 milliseconds for video 1, which for CSRT would be considered an outlier. Lastly, the visible outliers, especially for CSRT30 and CSRT15, are the earlier mentioned initialisation frames of the tracker.

4.1.2 Exploration trackers

As for the proof of concept, the time taken is the most important aspect of the analysis of the KCF and GOTURN trackers. First, the average time for each of the trackers to run was calculated; this is displayed in table 4. The table demonstrates the claimed speed of KCF, averaging 2.13 ms to track the ball. The GOTURN tracker, however, is on average over three times slower than the CSRT.

Configuration   Average time on video 1 (ms)
CSRT            20.8
KCF             2.13
GOTURN          64.4

Table 4: Overview of the average time per frame for each tracker on video 1.

For these trackers it was also considered useful to create box plots to visualise the distribution of the times taken. The box plots are split over two figures for clarity. Figure 5 displays the complete box plot for the GOTURN tracker. This box plot shows outliers at over 400 milliseconds; these are due to the fact that initialisation takes that long for this tracker.

Figure 5: Box plot of the time taken for tracking of video 1 by the GOTURN tracker.

Figure 6: Box plot of the time taken for tracking on video 1 by the CSRT, KCF and GOTURN trackers; for clarity the y-axis is cut off at 100 milliseconds. The values above 100 milliseconds for the GOTURN tracker can be seen in figure 5.

Figure 6 demonstrates the performance of the different trackers. Note that the GOTURN outliers above 100 milliseconds fall outside the plot; the CSRT and KCF trackers did not have outliers over 100 milliseconds. The KCF shows no outliers that take longer than the fastest runs of the CSRT. The GOTURN tracker performed slower than all of CSRT's runs except for CSRT's outliers.

4.2 Misses

The main goal of this application is to always follow the ball, so it was important to look at the number of frames missed. As mentioned, the ground truth object detector missed the ball in some frames as well, even though the ball was present in a number of frames in which it was not found. The object detector or the tracker did, however, sometimes succeed in finding the ball in cases where the ground truth did not. This can lead to fewer misses than the ground truth object detector in the following results. Nevertheless, this does not mean that all found bounding boxes are correct, since the object detector or the tracker can also find the ball in the wrong place.


4.2.1 Proof of concept

The CSRT configurations performed as displayed in table 5. For all videos the main trend is that the more often the object detector is run, the more misses are recorded. In videos 1 and 3 the tracker never fails to find the ball; note that in video 3 the ball is not present in all frames. Lastly, all the configurations recorded fewer misses than the ground truth. However, this says nothing about the quality of the found bounding boxes.

                   Times object not found
                   Video 1               Video 2               Video 3
Configuration      Detection  Tracking   Detection  Tracking   Detection  Tracking
CSRT               0          0          1          3          1          0
CSRT30             2          0          78         30         122        0
CSRT15             2          0          141        9          152        0
CSRT5              18         0          251        0          197        0
Object Detection   45         0          433        0          269        0

Table 5: Times the object was not found, per video and per configuration, split by whether detection or tracking failed.

4.2.2 Exploration trackers

The number of missed frames per tracker while running on video 1 is shown in table 6. As can be seen, the CSRT and GOTURN trackers do not lose the ball once. The KCF tracker, on the other hand, loses track of the ball in over half of the total frames. Since in video 1 the ball is always in the frame, every time the ball is not found is incorrect.

         Total missed   Object detector missed   Tracker missed
CSRT     2              2                        0
KCF      547            2                        545
GOTURN   2              2                        0

Table 6: Times the object was not found per tracker on video 1.

4.3 IoU

As mentioned in section 3.7, IoU was the method used to evaluate the quality of the configurations and trackers. The IoU indicates how accurate the found bounding boxes are compared to the ground truth; 1 − IoU then represents the inaccuracy of the tracker. This was used to determine the drift within a cycle, where a cycle is finding the object successfully with the object detector and then tracking it until the next detection attempt. The drift is the change of the IoU during a cycle.
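The thesis does not give a formula for this, but one reading consistent with the plots that follow is, stated here as an assumption: for cycles $c \in C$ with start frames $t_c$, the drift curves show the average IoU at each offset $k$ into a cycle,

$$\overline{\mathrm{IoU}}(k) = \frac{1}{|C|} \sum_{c \in C} \mathrm{IoU}(t_c + k),$$

with the drift being the decline of this curve from $k = 0$ onwards.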

4.3.1 Proof of concept

First of all, figure 7 presents the IoU drift for the CSRT configuration. As can be seen, after 300 frames the tracker loses the ball. Nevertheless, as shown earlier in table 5, this configuration never fails to track the ball according to the tracker itself. It does manage to find the ball once more after this; this can happen when the ball coincidentally enters the area which the tracker has determined as containing the ball.

Figure 7: IoU drift for CSRT on video 1.

Secondly, the drifts of the CSRT30, CSRT15, CSRT5 and the object detector are presented in figure 8. These drifts are the average drifts of their respective cycles. For the object detector a cycle of 30 frames was chosen.

The object detector shows no drift, and its average IoU score does not surpass 0.77.

The three other CSRT configurations all show an increasing drift throughout their cycles. Where CSRT5 on average does not perform worse than a 0.68 IoU score, CSRT15 on average stays above a 0.6 IoU score, and CSRT30 does not go below an IoU score of 0.54. Finally, figure 8 shows that the drift initially increases strongly, but after a few frames it increases less quickly.


4.3.2 Exploration trackers

For the exploration trackers the overall performance was considered interesting to look at as well, since these trackers all share the same configuration. However, as shown in table 6, the KCF tracker failed to track the ball in a lot of frames. Therefore, figure 9 shows the IoU scores of each of the trackers for the frames which were found by the KCF tracker. As displayed, the trackers perform comparably on average, with average IoU scores of over 0.7. Nevertheless, each tracker has its own peaks and valleys, showing that the trackers track in different ways.

Figure 9: IoU scores for different trackers on video 1.

For the different trackers the drift was also used to visualise the performance over time. Because the different trackers have the same cycles starting from the same frame every time, the graphs of these trackers, displayed in figure 10, show the drift of the trackers for the exact same cycles.

The drift of the KCF tracker shows a rapid decrease in IoU score: after the fifth frame the IoU score is on average near 0.2. The KCF tracker also has a lot of missed frames, and in these frames the IoU score is 0. The figure shows that the KCF drift starts low and then increases rapidly, meaning that on average the tracker tracks the ball better in the first frames of a cycle than later on. The CSRT and the GOTURN tracker perform comparably. The GOTURN tracker does end a bit higher in the graph and thus has a bigger IoU drift; it has an IoU score of around 0.5 after the tenth frame. The CSRT does not drop below an IoU score of 0.63 on average.


Figure 10: Average IoU drift for different trackers.

4.4 KCF optimisation

It was possible to run the KCF tracker on a mobile device, but it was not able to track the ball correctly. This is why it was decided to look into tuning its parameters in a desktop application, which was possible with the newer version of OpenCV, as mentioned in section 3.2. The results of the parameter tuning are found in section 4.4.1. Next to the parameter tuning, the KCF tracker was run with different configurations to see what impact that would have; this is discussed in section 4.4.2.

4.4.1 Parameter tuning

The parameters of the KCF are tuned by reading an XML file. The manual tuning showed that changing most parameters did not lead to better performance, where better performance is defined as tracking the ball on more frames. Without modifying the default parameters, the tracker managed to find the ball on 319 frames.

First of all, changing the boolean parameters mentioned in section 3.6.1 individually did not lead to tracking more or fewer frames.

Secondly, of the integer parameters only max_patch_size showed better performance when changed: after lowering its value from 6400 to 64, the tracker managed to find 3 extra frames. Tuning the other integer parameters led to the conclusion that they do have minimal and maximal values, but changing them did not contribute to more frames in which the ball was tracked.

Lastly, the float parameters were tuned. The lambda, sigma, interpolation factor and PCA learning rate did not change the number of frames tracked. Changing the detection threshold and the output sigma factor, however, did each result in 11 additional frames in which the ball was tracked: changing the detection threshold from 0.5 to 0.31 found these 11 extra frames, and when the output sigma factor was modified from 0.0625 to 0.143, the same maximum of eleven extra frames was reached.

After finding the parameters which influenced performance, the improving parameters were combined. This did not result in additional frames being found; thus changing one of the float parameters on its own gave the best parameter composition.

4.4.2 KCF configurations

The configuration for all the trackers on desktop was to run the object detector every fifteen frames. Nevertheless, figure 10 demonstrated that in the first few frames the KCF tracker performed better on average. To investigate this further, all configurations from running the object detector every frame up to the initially chosen every fifteen frames were run. Figure 11 shows the number of frames tracked for every configuration. When the object detector is run every fifteen frames, clearly the fewest frames are tracked, and when the object detector is run more often, the number of tracked frames increases. This leads to more frames being tracked when the object detector is run every other frame (KCF2) than when the detector is run on every frame: KCF2 fails to track in 31 frames, compared to 45 frames in which the object detector alone does not find the ball.

Figure 11: Number of frames tracked per configuration.

KCF2 manages to track on more frames than the object detector. However, it is important to know the quality of the tracked frames as well. Figure 12 shows the IoU scores of KCF2 throughout video 1. It clearly shows that the KCF tracker performs accurately, with an average IoU score of 0.74.
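In terms of the hypothetical run_configuration sketch from section 3.5 (an assumption, not the project's actual code), the KCF2 configuration is simply a shorter reset interval with a KCF factory:

results = run_configuration(frames, detect, n=2, make_tracker=cv2.TrackerKCF_create)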

5 Analysis

In this section the results will be analysed further. For JOGO the main problem is to be able to track the ball in every frame. In order to do this, the time for the algorithm to detect or track in each frame should be low enough to run on every frame. Furthermore, the found bounding box should at least be reasonably accurate. Accuracy is not the main focus, since the movement of the ball is more important than its exact location in each frame. First of all, the configurations that are not usable will be discussed; after that, the possible options for JOGO will be addressed.

5.1 Not usable algorithms

To start off, the CSRT configuration clearly showed itself to be unusable for this problem, as it loses the ball even though the ball is in all the frames. Additionally, when the ball is out of the frame or occluded, the tracker does not indicate that it cannot find the ball. Secondly, the CSRT30 configuration was also classified as not a solution. This configuration represents detecting the object every second, assuming a 30 fps video stream. This is considered too long, as the tracker does not fail if the ball goes out of the frame; if the tracker were to indicate this appropriately, the object detector could be run every time the ball leaves the frame.

Lastly, the GOTURN tracker was considered to be no option. To start with, this tracker is neither faster nor more accurate than the CSRT on desktop. But above all, it takes half a second to initialise this tracker on desktop, which means the tracker would miss 15 frames every time it is initialised.

5.2 Potential solutions

First of all, the two remaining CSRT configurations on the mobile device will be discussed: CSRT5 and CSRT15. Both share the same problem: detecting the object and initialising the tracker combined takes too long to run in real-time. In fact, detecting the object alone takes too long to do so. As shown in table 3, the object detector took over 60 milliseconds on average. However, JOGO claimed that they were able to run the object detector on a mobile device at 40 milliseconds per frame on average. If that is the case, an initialisation frame takes 60 milliseconds, given the measured time to initialise the tracker. This means that, assuming 30 milliseconds for the tracker to run and 75 milliseconds for the initialisation frames, after 5 frames using the tracker is faster than running the object detector for 5 frames. However, this is only possible if the video stream is kept in memory. For this the CSRT15 configuration might be better suited, because CSRT5 would constantly be running from memory. The fact that JOGO could run the object detector faster could mean that the CSRT could be run faster as well, but this is purely speculative. If that were the case, CSRT5 might be the better solution, because it is reset more often and thus slightly more accurate.

Finally, the KCF tracker showed itself to be a solution. Because of its speed, it allows the object detector to be run every other frame. The average time taken to track is around 2 milliseconds and the time to initialise is even lower. This means that two frames can be detected and tracked in under 45 milliseconds. This is a great solution to the interpolation problem, whilst keeping the detections as accurate as running the object detector. This KCF configuration achieves an average IoU score of 0.74, while little drift is possible, since the tracker is reset every other frame. Furthermore, as the heavy object detector has to be run half as often, the battery drainage problem could be halved as well. This tracker was only run on a desktop, so the tracker would not necessarily be this fast on a mobile device. The CSRT showed itself to be somewhat faster on desktop than on the mobile device, but not by such a large factor that the KCF tracker would be useless. Lastly, the KCF tracker did not successfully find the ball on any frame on the mobile device, whereas on the desktop it managed to find the ball in frames without any parameter tuning. This must be analysed further when implementing it on a mobile device.

6 Conclusion

The goal of this project was to investigate the possibility of tracking a football on a mobile device. Furthermore, the tracking should be possible in real-time and be at least reasonably accurate. It was expected that this would be feasible, since modules exist which are meant to solve this problem. However, due to the lower computational power, it was expected that the real-time aspect would challenge the different trackers. OpenCV turned out to be the best module to use for this project, confirming the hypothesis of sub question 1.1 that a library was needed. This library showed that it is possible to run multiple trackers on a mobile device. Nevertheless, the tracking part of the module is currently not supported by OpenCV. Because of this there is probably little to find about other OpenCV mobile object tracking applications, and the documentation is also limited.

The three trackers implemented by OpenCV considered for this project were the CSRT, KCF and GOTURN trackers. The CSRT was used for the proof of concept and thus implemented on a mobile device. The proof of concept was used for investigating the impact of running the object detector every 5, 15 and 30 frames and thus answers sub question 1.2. The impact is determined in two ways: the impact on the time and on the accuracy. It was expected that the impact would be extensive on the accuracy and less so on the time; this turned out to be the other way around. If the object detector is run less often, the drift of the tracker is larger, but it does not lose the ball. The time, however, does not only increase on average, it also becomes less consistent, and a longer time is needed more often.

The exploration on the desktop was done with the CSRT, the KCF tracker and the GOTURN tracker, in order to find an answer to sub question 1.3. As the CSRT was also used in this part, it was possible to compare the trackers with each other and get an idea of how they would perform on a mobile device. Again the time needed and the accuracy of the trackers were analysed. As the GOTURN tracker was slower, due to a long initialisation time of half a second, and less accurate, it was quickly considered to be no solution. The KCF tracker strongly outperformed the CSRT on time needed, being almost 10 times faster on average; additionally, its initialisation time is even over 10 times shorter on average. The KCF also managed to track accurately when it managed to find the ball, but for the initial configuration, in which the object detector was run every 15 frames, the KCF tracker failed to find the ball in over 500 frames. Therefore the optimisation possibilities for this tracker were investigated. First of all, the parameters of the KCF tracker were tuned. This led to little to no improvement; in fact, changing the parameters had little effect overall. However, the tracker could be improved by running the object detector and tracker alternately, a configuration which outperformed the object detector. So optimisation for this tracker specifically was an option: not by tuning its parameters, as that had little to no effect, but by changing the configuration.

These two approaches yielded three potential solutions. To start off, the CSRT configurations in which the object detector is run every 5 or every 15 frames are possible solutions. Both are accurate enough to be used and they track faster than needed for real-time tracking. However, on the initialisation frame, in which the object is detected and the tracker is initialised, the algorithm needs over twice the time available for real-time applications. Nevertheless, JOGO claimed that they could run the object detector on each frame in 40 milliseconds, compared to this project's 65 milliseconds. This could mean that they are capable of running the CSRT configurations faster, in which case these configurations will be accurate and fast solutions.

Secondly, the KCF2 configuration is a possible solution. The KCF tracker is so efficient that it runs in under 3 milliseconds on average, with the highest outliers at 10 milliseconds. Additionally, the initialisation of the tracker averages around 1 millisecond. By running it every other frame, the problem of interpolation could be solved, because every second frame leaves enough time to track even if the object detection takes too long, assuming the 40 milliseconds claimed by JOGO. Furthermore, this configuration actually found the ball on more frames than the object detector provided by JOGO could. This solution would also mean having to run the object detector a lot less, almost half as often as the current algorithm, if the ball stays in the frame. This could also have a great impact on the battery drainage problem mentioned by JOGO. However, this tracker was only run on a desktop, so the time taken on a mobile device might differ from the times found during this research. To conclude, this project has shown that it is possible to create an object tracker on a mobile device which is able to track footballs in real-time. It also outperforms running an object detector on every frame, by being faster, thus eliminating the need for interpolation, while not losing the ball.

7 Discussion

Although the research gave satisfactory results, some remarks can be made on this thesis. The first and most important one is the focus on just one module. This was done to be able to explore the full potential of the OpenCV module. However, this exploration also led to the discovery of some drawbacks of this module. To start off, the tracking module of the library is not supported, which means limited support and limited documentation. Furthermore, the module has already shown differences between the different languages; unfortunately, Java misses out on features which are implemented for Python already.

Another shortcoming is the limited information given to and received from the OpenCV trackers. In the newer versions of OpenCV it is possible to change the parameters of the trackers, as discussed in subsection 3.6.1. That is, however, the only modification that can be made. The input for these trackers has to be a rectangular bounding box, while the object in this project is a ball; object segmentation is thus not possible with the current version of OpenCV, even though it could improve the tracking accuracy, since in the ideal situation none of the background would be considered part of the object. Moreover, the output of the trackers is also nothing but a rectangular bounding box: no further information about the movement, the region which was looked at, or the reason for selecting the output bounding box is available. Such information could be very useful for analysis or statistics for JOGO. Nevertheless, as the bounding boxes are the output, these can be used by JOGO for analysis and statistics.

Secondly, other modules are available which may outperform OpenCV. As mentioned in subsection 3.1, MediaPipe was considered as a platform for this research. MediaPipe's slogan, "MediaPipe offers open source cross-platform, customizable ML solutions for live and streaming media.", fits the goal of this project seamlessly. Furthermore, this platform is created by Google, which is considered a market leader in the field of artificial intelligence. This is also a reason why this platform could be a solution for JOGO. However, when this research was done, MediaPipe was still in alpha, which means that solutions created now might have to undergo radical changes before MediaPipe reaches a stable version. Additionally, the structure for solutions is very efficient, but very specific as well; MediaPipe is therefore considered a platform and not a module. This means that learning how to use MediaPipe requires time and research. Due to the time constraints of this project, it was decided to spend this time researching working trackers rather than a way to implement a tracker.

Despite OpenCV's drawbacks and the existence of other modules, OpenCV offers a single module which contains implementations of multiple state-of-the-art trackers. This allowed this research to be exploratory across different trackers and not only about using a tracker in general.

During this project, the choice was made to explore the KCF and GOTURN trackers in a desktop application, even though one of the main requirements was to be able to track the ball on a mobile device. Nevertheless, as mobile devices improve, these traditionally low-powered devices will eventually be able to run much more powerful algorithms. In fact, in this case the mobile device did not perform remarkably slower or less accurately for the CSRT. This also shows that JOGO could consider computationally heavier algorithms.

Another limitation of this project was the fact that a mobile application needed to be built for the proof of concept. The time used to build this was not available for further parameter tuning when exploring OpenCV's trackers. For example, other trackers which showed poor performance with their default parameters could be tuned to potentially improve their performance. Furthermore, the CSRT configurations could be researched further as well: if data for more configurations were available, the impact of running the object detector would be clearer. So this is an aspect of the research which could be elaborated on in future work.

Lastly, as mentioned in subsection 3.7, only video 1 was used for the IoU calculation, and the exploration part of the research was done entirely on video 1. This was decided because the frames without the ball in them were not labelled as such. The ball was not in all frames in videos 2 and 3, which means that for these videos it is not clear whether a miss by the object detector or tracker is correct or incorrect. In video 1 the ball is always in the frame; therefore, if the detector or tracker does not find the ball, the IoU score should be 0. For the exploration part it was decided to use video 1 alone, because the performance of the trackers had to be compared with themselves or each other. Furthermore, as these trackers were run on a desktop, gathering results in different situations was not considered useful, since the trackers were expected to perform differently on a mobile device. However, videos 2 and 3 could be used for other analyses. To start off, video 2 is recorded in the dark, which could give insights into the performance of the trackers in artificial light. Additionally, if it were clear in which frames the ball was not visible, the effect of frames without a ball on the tracker could be analysed. This could also be useful for choosing the configuration of the tracker.


References

Almohaimeed, Norah and Master Prince (Feb. 2019). "A Comparative Study of different Object Tracking Methods in a Video". In: International Journal of Computer Applications 181, pp. 1–8. doi: 10.5120/ijca2019918470.

Bradski, G. (2000). "The OpenCV Library". In: Dr. Dobb's Journal of Software Tools.

Held, David, Sebastian Thrun, and Silvio Savarese (Oct. 2016). "Learning to Track at 100 FPS with Deep Regression Networks". In: vol. 9905, pp. 749–765. isbn: 978-3-319-46447-3. doi: 10.1007/978-3-319-46448-0_45.

Henriques, João et al. (Apr. 2014). "High-Speed Tracking with Kernelized Correlation Filters". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 37. doi: 10.1109/TPAMI.2014.2345390.

JOGO (2020). JOGO // Tracks your technical development. url: https://jogo.ai/.

Lugaresi, Camillo et al. (June 2019). MediaPipe: A Framework for Building Perception Pipelines.

Lukežič, Alan et al. (July 2018). "Discriminative Correlation Filter with Channel and Spatial Reliability". In: International Journal of Computer Vision 126. doi: 10.1007/s11263-017-1061-3.

Mohammed, Abdulmalik and Tim Morris (Apr. 2014). "A Robust Visual Object Tracking Approach on a Mobile Device". In: pp. 190–198. isbn: 978-3-642-55031-7. doi: 10.1007/978-3-642-55032-4_19.
