
Evaluating Visual Object Trackers on Hard Occlusions


Layout: typeset by the author using LaTeX.


Thijs Kuipers (11070234)
Bachelor thesis, 18 EC
Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: dr. Deepak Gupta
Intelligent Sensory Information Systems
Informatics Institute
University of Amsterdam
Science Park 904, 1098 XH Amsterdam


Acknowledgement

I would like to express my sincerest gratitude towards my supervisor, dr. Deepak Gupta, for guiding me through this thesis project, for helping me understand the concepts that are crucial to it, and for helping me think critically about previously unexplored subjects within its field of research.


Abstract

Visual object tracking is a challenging problem in computer vision, as visual object trackers have to deal with many circumstances such as illumination changes, fast motion, and occlusion. In the current thesis, a set of state-of-the-art trackers is evaluated on the specific challenge of hard occlusions. While existing benchmarks do contain examples of occlusion, these are often not very challenging and do not represent the occurrence of occlusion in the wild. Therefore, a small-scale dataset containing different categories of hard occlusions is created, on which the selected trackers are evaluated and compared against the challenging LaSOT benchmark. Results show that hard occlusion remains a very challenging problem for current state-of-the-art trackers. Furthermore, it is shown that tracker performance varies wildly between different categories of hard occlusions, where a top-performing tracker on one category performs significantly worse on a different category. The varying nature of tracker performance based on specific categories suggests that the common tracker rankings using single performance scores such as AuC or F-score are not adequate to gauge tracker performance in real-world scenarios.


Contents

1 Introduction
2 Background
  2.1 Challenges in Tracking
  2.2 Types of Trackers
    2.2.1 Discriminative Correlation Filters in Visual Object Tracking
    2.2.2 Siamese Networks in Visual Object Tracking
    2.2.3 Accurate Tracking by Overlap Maximisation
    2.2.4 Discriminative Model Prediction
3 Occlusion in Tracking
  3.1 Previous Work on Tracking Datasets
  3.2 Remaining Challenges in Occlusions
  3.3 Hard Occlusions
  3.4 Evaluation Metrics for Occlusions
    3.4.1 Precision
    3.4.2 Success Rate
    3.4.3 Longest Subsequence Metric
    3.4.4 F-Score
4 Method
  4.1 Tracking Frameworks
    4.1.1 PySOT
    4.1.2 PyTracking
    4.1.3 SiamFC
  4.2 Hard Occlusion Benchmark Dataset
    4.2.1 Annotations
    4.2.2 Attributes
  4.3 Evaluation Methodology
    4.3.1 Evaluation on LaSOT
    4.3.2 Evaluation Metrics
5 Experiments and Evaluation
  5.1 Overall Performance
  5.2 Attribute Evaluation


Chapter 1

Introduction

Visual object tracking remains a challenging problem, even though it has been studied for several decades [27]. A visual object tracker has to account for many different and varying circumstances. For instance, changes in illumination may alter the appearance of the object to be tracked. The object could also blend in with the background environment, or the object might get occluded, resulting in the object, or part of the object, being obstructed from view. Because of all the possible circumstances visual object trackers have to account for, visual tracking is considered a hard problem [25].

One type of visual object tracker is referred to as an online object tracker. In online tracking, only the information in the current and previous frames can be used at any given instance, as opposed to offline tracking, which allows for utilisation of both previous and future frames [27, 25]. One class of online visual object trackers are single object trackers, where a bounding box is provided only in the first frame. According to A. W. M. Smeulders et al., these kinds of trackers can be decomposed into five distinct components [25]: the target region, the representation of the object to be tracked, the representation of the motion of the object, the core method of tracking, and the model updating. For this thesis project, solely online deep trackers where the target object is identified only by a rectangular bounding box in the first frame are considered.

Most current state-of-the-art visual object trackers do perform well on benchmarks such as the Object Tracking Benchmark (OTB) that are designed for benchmarking general tracking performance [26, 27]. These datasets, however, are often rather small and therefore limited in scope. Because of this, they do not represent the challenges trackers are faced with in the real world [23, 27]. In order to more accurately benchmark the different object trackers, alternative benchmarks have been presented. Notable ones include TrackingNet [23], GOT-10k [12], and LaSOT [10].


While many of the challenges visual object trackers face have mostly been addressed, this has not been the case for the occurrence of occlusion [25]. Occlusion can encompass a variety of different scenarios. For instance, an object can be occluded when it moves partially or fully out of frame. Occlusion also occurs when another object fully or partially blocks the target object. When the target object is partially blocked, either certain specific features of the target object disappear, or part of the overall target object appearance disappears.

Some methods for handling occlusion do exist. However, they are often focused on very specific aspects of tracking, such as solely tracking pedestrians [24]. A major problem in evaluating visual object trackers on occlusion is the lack of hard-occlusion data in current benchmarks. While the above-mentioned benchmarks do contain samples of occlusion, these often only occur for short amounts of time, or the occluded objects lack motion [28, 23, 10]. Therefore, the available benchmarks might not accurately evaluate tracker performance on occlusion.

The current thesis project aims to evaluate a set of current state-of-the-art visual object trackers on the occurrence of occlusion. Currently available benchmarks that contain occurrences of occlusion have shown decreased performance compared to the baseline [10, 12]. Therefore, it is hypothesised that the evaluated trackers will show significantly lower performance on the hard occlusion dataset. Furthermore, in current popular tracking benchmarks, trackers are often ranked by single metrics such as area-under-curve or F-score [28, 23, 15, 14]. Top-performing trackers can vary between different challenging scenarios such as lighting or occlusion [10, 12]. Because of this, it is hypothesised that within the larger category of occlusion, performance could differ drastically between trackers per sub-category.

In order to perform this evaluation, first the subject of occlusion will be categorised, as different kinds of occlusion can occur and could have different effects on visual object tracker performance. Next, a moderately sized dataset will be constructed that contains the different kinds of occlusions. The image sequences need to be of considerable length, in order to accommodate the need for long durations of target object occlusion. Finally, using this dataset, a detailed evaluation of the performance of each of the trackers will be given.


Chapter 2

Background

Figure 2.1: General representation of visual object trackers showing the five distinct components: the target region, representation of appearance, representation of motion, core method, and model updating. Image taken from [25].

In this thesis project, visual object tracking refers to the problem of tracking any object in a sequence of frames. The object to be tracked, the target object, is identified in the first frame of this sequence. Because visual object trackers should be capable of tracking any arbitrary object, it is not possible to construct specific models offline for detecting the target object [29]. While many different solutions to arbitrary object tracking have been proposed, they can be decomposed into five distinct components [25]: the target region, representation of appearance, representation of motion, core method, and model updating, which are illustrated in Figure 2.1.

The target region contains the target object, and can be represented in various ways. Often, the target object is represented by a target bounding box [25]. Other methods include differently shaped bounds, such as ellipses, or target contours that attempt to omit the background information, although it has been suggested that including background information could aid in constructing stronger appearance models [25].

The representation of the appearance of the target object is handled by the appearance model and can be represented in a multitude of ways [19]. Appearance models can consist of just raw image data, or make use of Histograms of Gradients [25]. More recently, discriminative models have grown in popularity [8, 2]. Siamese networks have also been adopted in tracking, using a similarity function to create the appearance model [29, 17, 18].

Motion can be represented in multiple ways [25]. A simple approach involves the assumption that both the location and scale of the target object do not change much between frames during tracking. Thus, the target object is found by searching around the previously predicted location [17]. Often, predictions that are further away from the previous location or that are of a wildly differing scale are weighted as less probable candidates for the target object [7, 2].

The core method of tracking describes the method used to locate the target object in the new frame, such as similarity matching [29] or discriminative correlation filters [8].

To account for changes in appearance of the target object, trackers use varying strategies of updating their appearance model [25]. Some trackers opt to update their model every few frames or when distractor peaks are detected [8, 2, 7]. Others choose to keep their appearance model static during tracking [29, 17, 18].

2.1 Challenges in Tracking

Visual object trackers are subjected to many different and varying circumstances. As such, visual object tracking is considered a hard problem. Because these circumstances are often unpredictable, they can pose serious challenges when tracking objects in the wild. It is not rare for the visual appearance of the target object to change during tracking. Since generic object trackers have to construct their target model representation solely from the first frame in a sequence, drastic change in target object appearance poses a serious challenge [29].

The visual appearance of the target object can be altered during tracking in many different ways. Different lighting conditions can drastically alter the appearance of the target object by changing its colour or making features of the object disappear, for instance when the target object is drowned in dark shadows. The target object can be non-rigid and change its shape while it is being tracked. Objects can also change their orientation relative to the camera, thereby altering their appearance. Other challenges concerning the visual appearance of the target object include camera motion blur, or a background that is visually extremely similar to the target object.


Of course, visual object trackers are not just challenged by changes in the visual appearance of the target object during tracking. The motion of the target object could be fast, or the camera itself could be moving or shaking, resulting in the object appearing outside the expected target location. Motion prediction can help track fast-moving targets [25]. However, the target object could be moving in very unpredictable ways, which makes motion prediction a difficult task.

Finally, another major challenge for visual object trackers is the occurrence of occlusion. Occlusion poses an interesting challenge for object tracking that might at first glance seem similar to challenges regarding the change of appearance of the target object. However, occlusion is vastly different. If the appearance of the target object changes, the target object can still be fully observed. This is not the case for occlusion. Although what is visually perceived does indeed change, it is not the target object itself that has changed. Rather, it is something else in the scene that causes a perceived change of appearance.

2.2 Types of Trackers

In order for single-target visual object trackers to be able to search for the target object, the tracker needs to be able to distinguish between pixels belonging to the target object and pixels belonging to the background. Furthermore, a robust procedure for creating the target object model is needed, as only the first frame in a sequence can be used to construct this model. In order to address this problem, different methods have been applied to visual object tracking. Notable methods that current state-of-the-art trackers make use of are Discriminative Correlation Filters such as ECO [8], Siamese networks such as SiamFC [29] and SiamRPN++ [18], discriminative model prediction as used in DiMP [2], and overlap maximisation, introduced in ATOM [7]. In the following sections, the above-mentioned methods and trackers will be discussed.

2.2.1 Discriminative Correlation Filters in Visual Object Tracking

Discriminative Correlation Filters (DCF) are a supervised method for learning linear classifiers or regressors, and have been successfully applied to visual object tracking [6]. DCF utilise circular correlation, which offers two major advantages for visual object tracking. Recall how generic object trackers are required to learn an appearance model of the target object based on minimal supervision. First, when using circular correlation, all shifted variants of input samples are implicitly included, which makes it possible to construct an appearance model with very limited data [6]. This makes DCF well suited for generic visual object tracking. Furthermore, calculations are performed in the Fourier domain by utilising the Fast Fourier Transform, resulting in more efficient computations [3].

The goal in DCF is to learn a multi-channel convolution filter $f$ from a set of training examples $\{(x_k, y_k)\}_{k=1}^{t}$, where $x_k$ denotes a training sample and $y_k$ denotes the desired convolution response [6]. Each sample $x_k$ consists of a $d$-dimensional feature map, extracted from the target region of an image. The filter $f$ consists of one convolution filter $f^l$ per layer in the $d$-dimensional feature map. The convolution response of the filter $f$ on a training sample $x_k$ is then defined as

$$S_f(x_k) = \sum_{l=1}^{d} x_k^l * f^l. \quad (2.1)$$

Here, $*$ denotes the circular convolution operator and $l$ denotes the layer in the feature map. In order to obtain the filter $f$, the least-squares error between the convolution response and the label $y_k$ is minimised.
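As an illustration, the sketch below evaluates the response of Equation 2.1 in the Fourier domain, which is how DCF trackers keep the circular convolution cheap. The function name and array layout are illustrative assumptions, not taken from any of the discussed trackers.

```python
import numpy as np

def dcf_response(x, f):
    """Response S_f(x) = sum_l x^l * f^l from Equation 2.1 (sketch).

    x, f: arrays of shape (d, H, W); '*' is circular convolution,
    evaluated per feature layer in the Fourier domain and summed over
    the d layers. Practical DCF trackers typically use circular
    correlation (the conjugate filter spectrum), but the structure of
    the computation is the same.
    """
    X = np.fft.fft2(x, axes=(-2, -1))
    F = np.fft.fft2(f, axes=(-2, -1))
    return np.fft.ifft2((X * F).sum(axis=0)).real
```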

Approaches using Discriminative Correlation Filters have been successfully applied to tracking. The MOSSE tracker, the first DCF tracker, which used solely grayscale samples to train the filter, achieved state-of-the-art performance [4, 27]. Many of the then state-of-the-art trackers adopted the DCF method [20]. Improvements were made, such as learning multi-channel filters and replacing grayscale samples with more feature-rich samples such as colour spaces or histograms of gradients [6]. Even features learned from deep convolutional networks have been used to increase tracking accuracy [20]. The drawback of these deep features is significantly reduced tracking speed, making them less applicable to real-world scenarios [8].

As improvements in DCF tracking performance are mostly attributed to employing more powerful features, models have grown substantially larger, including many hundreds of thousands of trainable parameters [8]. Such large models introduce the risk of overfitting, as training data during tracking is scarce. In order to address the issue of overfitting, [8] introduce Efficient Convolution Operators (ECO) for tracking. ECO aims to reduce the number of parameters in the DCF model by introducing a factorised convolution operator.

ECO also employs a different strategy for updating the sample space model. Usually, object trackers using DCF add a training sample x each frame, until the maximum number of training samples that can be stored is reached [8]. Once this maximum has been reached, the sample with the lowest weight is discarded. The issue with this approach is that it requires a large number of training samples in order to form a good representation of the target object [8]. Instead, ECO opts to model the training data as a set of Gaussian components. Each component represents a different aspect of the target object appearance. This does not only lead to a more representative training set, it also vastly reduces the number of training samples that need to be stored. ECO updates the sample space model every frame.

Another key area of improvement is the model update strategy. In DCF-based tracking approaches, the model is often updated each frame during tracking [8]. Instead of updating the model in every frame, ECO applies a sparser updating scheme. Ideally, the model should only be updated if the target object appearance has changed sufficiently. However, it is noted that finding such conditions can be extremely complex [8]. Therefore, ECO avoids detecting change in the target object appearance and instead opts for a periodic update scheme, updating the model every n-th frame. This does not affect the sample space model update. Periodically performing the model update does not only increase tracking speed; it was also found that a value of n ≈ 5 generally led to improved tracking results, which is attributed to the reduced occurrence of overfitting when the model update is applied periodically [8].

2.2.2 Siamese Networks in Visual Object Tracking

Deep convolutional neural networks have been successfully applied to many computer vision tasks [29]. However, these models are often trained to detect specific object classes. These models also require large amounts of training data and time in order to reach state-of-the-art performance. For these reasons, such models are not very well suited for generic object tracking [29]. One solution to the above-mentioned problems is the use of similarity learning with a Siamese network, as is shown by the Fully Convolutional Siamese tracker (SiamFC) [29].

A Siamese network consists of two separate branches that share the same weights [5]. Each branch is its own sub-network and receives its own separate input. One is called the template branch, which receives the exemplar image, while the other is called the detection branch, which receives the candidate image [17]. The sub-networks are used to extract features from the given input samples. The two resulting feature vectors are subsequently compared by a distance metric such as cosine similarity [5]. The distance between the two features represents the similarity between the two input patterns. Thus, the goal of the Siamese network is to learn a function f(z, x) offline that compares an exemplar image z to a candidate image x. The function should output a high value if the two images depict the same object, and a low value if they do not. In the Siamese network, both branches apply the same function φ to the exemplar and candidate image. Their representations are then combined using another function g such that f(z, x) = g(φ(z), φ(x)). The idea is that instead of learning to recognise specific features belonging to certain classes, the network learns only what makes two images depict the same instance of an arbitrary class. Therefore, Siamese networks extend well to generic object detection [29].


Figure 2.2: The fully convolutional Siamese architecture used by SiamFC, where the blue and red pixels in the score map show the similarities for the blue and red sub-windows respectively. Image taken from [29].

SiamFC proposes a Siamese architecture that is fully convolutional [29]. Because the network is fully convolutional, candidate images do not have to be of the same size as the exemplar image. This means the network can be provided with a larger candidate image, effectively increasing the search region. The similarity at all translated sub-windows can be calculated in a single evaluation, using a convolutional embedding function φ. The features resulting from φ are then combined into a score map using a cross correlation layer:

$$f(z, x) = \phi(z) * \phi(x) + B. \quad (2.2)$$

Here, B is used to model the offset of the similarity value [18]. The output of the similarity function f is not a single value; instead, it is a score map, where locations with high values indicate regions that are similar to the target object. The fully convolutional architecture used in SiamFC is depicted in Figure 2.2.
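To make Equation 2.2 concrete, the following sketch computes the score map by sliding the exemplar embedding over a larger search embedding, which is what the cross correlation layer does. The tensor shapes and function name are assumptions for illustration, not the actual SiamFC code.

```python
import torch
import torch.nn.functional as F

def siamese_score_map(phi_z, phi_x, b=0.0):
    """Score map f(z, x) = phi(z) * phi(x) + B from Equation 2.2 (sketch).

    phi_z: exemplar embedding of shape (C, Hz, Wz).
    phi_x: larger candidate embedding of shape (C, Hx, Wx).
    The exemplar embedding is used as a correlation kernel, so every
    translated sub-window of the candidate gets one similarity value.
    """
    score = F.conv2d(phi_x.unsqueeze(0), phi_z.unsqueeze(0))
    return score.squeeze() + b  # shape (Hx - Hz + 1, Wx - Wz + 1)
```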

During tracking, SiamFC calculates the feature map φ(z) for the exemplar image, that is, the initial appearance of the target object given by the bounding box in the first frame of a sequence; this feature map remains fixed during tracking. The feature map is then compared to the sub-windows of subsequent frames using the similarity function f. The resulting score map is upscaled by bicubic interpolation, and scale variations are handled by searching for the object over five different scales. The online phase is kept very basic, as no model update is applied during tracking. This is done to show the capabilities of the fully convolutional Siamese network architecture to their full extent [29].

By employing similarity learning using a fully convolutional Siamese network architecture, a rich source of features can be used for online object trackers, as these networks have been shown to use the available data very efficiently [29]. This prompted the development of more advanced visual object tracker implementations that make use of the Siamese network architecture, such as SiamRPN [17].

SiamRPN consists of a Siamese network that is extended with a region proposal network. The Siamese network is used for feature extraction. The extracted features are then fed to the region proposal network, which is used for proposal generation and contains two branches: a classification branch and a detection branch. The classification branch is used for foreground-background classification and the detection branch is used for proposal refinement [17]. Furthermore, where the cross correlation module of SiamFC produces a single-channel score map, SiamRPN utilises up-channel cross correlation, which outputs a multi-channel correlation map. The score map is up-scaled by adding a large convolutional layer, which yields a richer level of information [17].

Similar to SiamFC, SiamRPN pre-computes the feature map on the initial frame, and it remains fixed during the entire tracking period. During tracking, in each frame one forward pass on the detection branch is performed in order to obtain the region proposals. Once the top n region proposals are generated, different strategies are used to select proposals. First, bounding boxes that are far from the center of the feature map are discarded, as it is assumed that change of location of the target object between subsequent frames is fairly minimal [17]. After discarding outliers, the proposals are re-ranked by adding penalties to large changes in bounding box size and bounding box ratio, and the top candidate is chosen as the final prediction.
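The proposal selection described above can be sketched as follows. The distance cutoff, penalty constant, and box parameterisation are illustrative assumptions; SiamRPN's actual implementation uses its own anchors and penalty formulation.

```python
import numpy as np

def select_proposal(boxes, scores, prev_box, max_dist=64.0, k=0.05):
    """Pick a final prediction from region proposals (illustrative sketch).

    boxes: (N, 4) proposals as [cx, cy, w, h]; scores: (N,) classification
    scores. prev_box: previous prediction [cx, cy, w, h]. Proposals far
    from the previous centre are discarded, and large changes in size or
    aspect ratio are penalised before the top candidate is chosen.
    """
    dist = np.hypot(boxes[:, 0] - prev_box[0], boxes[:, 1] - prev_box[1])
    keep = dist < max_dist
    if not np.any(keep):                 # fall back to the nearest proposal
        keep = dist == dist.min()
    boxes, scores = boxes[keep], scores[keep]
    size_change = np.maximum(boxes[:, 2] / prev_box[2], prev_box[2] / boxes[:, 2])
    prev_ratio = prev_box[2] / prev_box[3]
    ratio = boxes[:, 2] / boxes[:, 3]
    ratio_change = np.maximum(ratio / prev_ratio, prev_ratio / ratio)
    penalty = np.exp(-k * (size_change * ratio_change - 1.0))
    return boxes[np.argmax(scores * penalty)]
```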

The Siamese trackers discussed above were able to reach respectable performance. However, compared to the then state-of-the-art trackers, a notable performance gap still existed [18, 8]. Since the location of the target object can be at any position within a frame, it is necessary to learn a translation-invariant feature representation of the target object. The architectures of the mentioned Siamese trackers use a modified version of AlexNet [29, 17]. This version of AlexNet [16] is found to be the only architecture that satisfies the translation invariance requirement [29]. This means that these Siamese trackers were not able to use features from more sophisticated networks such as ResNet [18, 11].

SiamRPN++ is an evolution of SiamRPN that aims to break the restriction of strict translation invariance [18]. The strict translation invariance restriction follows from the similarity function f that is used in the Siamese networks (see Equation 2.2). More concretely, f(z, x[T]) = f(z, x)[T] [18]. Here, [T] is the translation operation. The reason the strict translation invariance restriction is not met in deep networks such as ResNet is that these networks use padding [18]. In order to solve the problem of translation invariance, a spatially aware sampling strategy is used: during training, candidate images are shifted away from the centre, resulting in no spatial bias [18].

Now able to use deep networks, SiamRPN++ successfully implements a Siamese network architecture using ResNet as the backbone network [18]. Using such a deep network brings the added benefit of layer aggregation, as different deep layers contain different levels of representation, which results in richer appearance models [18].

Besides the implementation of a deep backbone architecture and the utilisation of layer aggregation, SiamRPN++ also improves upon the cross correlation module used to embed the two branches of the Siamese network. SiamRPN uses up-channel cross correlation, which yields a richer multi-channel score map compared to the single-channel score map used in SiamFC [17]. However, this comes at the cost of a large parameter imbalance, which makes the learning process less stable and more difficult to converge [18]. Instead, SiamRPN++ implements a depth-wise cross correlation, which greatly reduces the number of parameters while keeping performance on the same level as the up-channel cross correlation. An added benefit of this approach is a significant reduction in memory usage. Furthermore, SiamRPN++ employs the same strategy as SiamRPN during tracking.
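A minimal sketch of depth-wise cross correlation is given below: each exemplar channel correlates only with the matching search-region channel, which keeps the output multi-channel without the parameter blow-up of up-channel correlation. Shapes and the helper name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(phi_z, phi_x):
    """Depth-wise cross correlation between exemplar and search features.

    phi_z: exemplar features of shape (C, Hz, Wz).
    phi_x: search-region features of shape (C, Hx, Wx).
    Using groups=C means channel c of phi_z is correlated only with
    channel c of phi_x, producing a C-channel correlation map.
    """
    c = phi_z.size(0)
    out = F.conv2d(phi_x.unsqueeze(0), phi_z.unsqueeze(1), groups=c)
    return out.squeeze(0)  # shape (C, Hx - Hz + 1, Wx - Wz + 1)
```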

2.2.3 Accurate Tracking by Overlap Maximisation

Thanks to extensive training, SiamRPN and SiamRPN++ are capable of accurately estimating bounding box location and scale. However, such Siamese trackers can still have difficulties with target classification, as these trackers do not explicitly account for background distractors due to the lack of online learning [7].

The Accurate Tracking by Overlap Maximisation tracker (ATOM) aims to address this problem with a tracking approach that consists of a target estimation module that is learned offline and a target classification module that is learned online, using the ResNet-18 [11] model as the backbone for both the estimation and classification tasks [7]. The goal of the target estimation module is to determine the bounding box of a candidate target object given a rough initial bounding box estimate by utilising overlap maximisation. Here, the module is trained offline in order to predict the intersection over union (IoU) between a target object and a bounding box candidate. Next, by maximising the IoU prediction, the bounding box can be estimated. In order to embed the exemplar image features with the candidate image features, ATOM makes use of a modulation-based network, as Siamese networks resulted in sub-optimal performance [7].

Similar to a Siamese network, the modulation network consists of two branches: a reference branch and a test branch. The reference branch is fed the exemplar image z together with the corresponding bounding box B_z, and returns a modulation vector c(z, B_z). The test branch is fed the candidate image x for which the bounding box should be predicted. The goal of the test branch is to extract general features from the candidate image in order to perform the IoU predictions. As this task is seen as more complex, additional layers are added to the test branch compared to the reference branch [7]. The resulting feature h(x, B_x) from the test branch is modulated by the modulation vector from the reference branch via channel-wise multiplication. This results in an IoU representation, which is fed to the IoU predictor.

As the estimation module is unable to discriminate between the actual target object and possible background distractors, the estimation module is complemented with a classification module which consists of a two-layer fully convolutional network [7]. The goal of the classification module is to provide a rough 2D location of the target object. The classification module is trained entirely online. Therefore, data augmentation is performed on the first frame, resulting in 30 training samples, which are used to train the classification module. During tracking, only the second layer of the classification model is optimised.

In order to find the target during tracking, ATOM first extracts features from the current frame. These features, combined with the previous bounding box, are then used to generate ten initial bounding box estimates for the current frame. Next, the estimation module is used to find the corresponding IoU predictions, which are then maximised using gradient ascent [7]. The final bounding box is constructed by taking the mean of the bounding boxes corresponding to the three highest IoU scores. Furthermore, if the classification score of the target object falls below 0.25, the target is deemed lost.
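The refinement step can be sketched as follows: the box coordinates are treated as free parameters and updated by gradient ascent on the predicted IoU. The iou_net argument stands in for ATOM's offline-trained IoU predictor; the step size and number of iterations are illustrative assumptions.

```python
import torch

def refine_box(iou_net, feat, box_init, steps=5, lr=1.0):
    """Refine a bounding box by maximising a predicted IoU (sketch).

    iou_net: callable mapping (frame features, box) to a predicted IoU.
    box_init: initial estimate as a tensor [x, y, w, h].
    Gradient ascent is run on the box coordinates themselves, following
    the overlap-maximisation idea described above.
    """
    box = box_init.clone().requires_grad_(True)
    for _ in range(steps):
        iou = iou_net(feat, box.unsqueeze(0)).sum()
        iou.backward()
        with torch.no_grad():
            box += lr * box.grad   # ascent step on the IoU prediction
            box.grad.zero_()
    return box.detach()
```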

2.2.4 Discriminative Model Prediction

The Discriminative Model Prediction (DiMP) tracker introduces an alternative approach to solve some of the shortcomings of the Siamese network trackers [2]. Since Siamese trackers only utilise the target object appearance during the online phase, crucial background information is ignored. Naturally, this background information is needed in order to discriminate between the target object and background distractors. Moreover, Siamese networks can be prone to poor generalisation for objects that did not occur during offline training, and they often have a poor model update strategy [2].

In order to alleviate these problems, DiMP consists of a discriminative model prediction architecture derived from a discriminative learning loss for visual tracking, which can fully exploit background information [2]. Since the model predictor contains few parameters, overfitting to certain classes during training is avoided, resulting in an architecture that can generalise to unseen objects. Similar to ATOM, DiMP consists of two branches: a target classification branch and a target estimation branch, used for distinguishing the target object from background distractors and for estimating the bounding box, respectively.

In the classification branch, a ResNet-18 [11] is used as the backbone network in order to extract features from a set of training images with corresponding target centre coordinates. These features are then fed to the model predictor module, which contains a model initialiser module and a model optimiser module. The goal of the predictor module is to find a target object model. During tracking, a set of 15 samples of the target object is generated. In the subsequent frames, whenever a target is predicted with sufficient confidence, it can be added to the set of training samples in order to update the target model. The target model is updated every 20 frames, or whenever a distractor peak is detected. For the bounding box estimation branch, DiMP uses the same IoU maximisation based architecture as used in ATOM [7].
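A minimal sketch of this update scheme is given below. The confidence threshold and memory size are illustrative assumptions; only the 20-frame interval and the distractor-triggered update follow the description above.

```python
def update_memory_and_model(memory, sample, confidence, frame_idx,
                            conf_thresh=0.5, interval=20,
                            distractor_detected=False, max_mem=50):
    """Decide when to grow the training memory and refresh the target model.

    Confident predictions are appended to the training memory (bounded by
    max_mem); the target model is re-optimised every `interval` frames or
    whenever a distractor peak is detected.
    """
    if confidence > conf_thresh:
        memory.append(sample)
        del memory[:-max_mem]          # keep only the newest samples
    run_model_update = distractor_detected or (frame_idx % interval == 0)
    return run_model_update
```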


Chapter 3

Occlusion in Tracking

In section 2.1, different challenging scenarios trackers may face in the wild were outlined. One of these challenges that remains a problem for visual object tracking to this day is occlusion [28, 29]. Although recent tracking benchmarks do include a multitude of different challenging scenarios, including occlusion, these might not be adequate to accurately measure tracking performance [10, 23, 27, 25]. In the case of occlusion, often only general cases are considered, such as partial or full occlusion, and the occlusion might only occur for very short amounts of time. In the following sections, well-established tracking benchmarks will be discussed. Next, examples are shown of why these benchmarks might not be adequate to evaluate tracker performance on occlusion. Finally, different categories of hard occlusions will be outlined, and an overview of different metrics to evaluate tracker performance on will be presented.

3.1 Previous Work on Tracking Datasets

In order to evaluate and determine the robustness of a tracker implementation, the tracker needs to be evaluated on a range of different challenging scenarios. In [25], the ALOV++ benchmark is proposed, consisting of 315 sequences from the Amsterdam Library of Ordinary Videos with an average length of 9.2 seconds. It is argued that short sequences are particularly challenging for visual object trackers, as the trackers do not have a lot of time to adapt to the different scenarios portrayed in the sequence [25]. Another argument for including many short sequences is to have ample material to represent the different challenging tracking scenarios.

ALOV++ does include attributes for occlusion; however, it does not differentiate between different kinds of occlusion. During the evaluation of tracker performance on occlusion in ALOV++, the amount of occlusion that occurs during the sequences is considered in combination with the motion of the target object. When no motion of the target object relative to the camera is present, the evaluated trackers show strong capabilities in handling occlusion [25]. Therefore, it is concluded that short partial and even full occlusions are not a problem when no relative motion of the target object occurs. Likewise, when partial occlusion occurs in combination with motion, the top trackers are still able to perform very well. Tracker performance suffers when motion is combined with full occlusions, where especially recovering after target loss seems to be a big issue [25].

Another well-known tracking benchmark is the Object Tracking Benchmark (OTB) [28]. OTB contains 100 sequences of varying object classes. A subset of OTB, OTB-50, contains the 50 most challenging sequences in order to perform detailed evaluation of tracker performance [28]. Similar to ALOV++, the sequences in OTB are annotated with attributes, including occlusion, which covers both full and partial occlusion, and out-of-view, when some portion of the target leaves the camera field of view. The trackers evaluated on the OTB occlusion and out-of-view subsets do not seem to perform significantly better or worse compared to other challenging scenarios or the overall performance [28].

Since the rise of deep trackers, demand for large-scale datasets containing many training examples has increased. In order to accommodate this demand, TrackingNet [23] was introduced. With a total of 30,643 videos, TrackingNet offers an impressive collection of sequences of varying length, framerate, resolution, and object class. TrackingNet offers a total of 15 different attributes per sequence, with separate tags for partial and full occlusion, as well as partial out-of-view occlusion. The evaluations on TrackingNet show that full occlusion and out-of-view occlusion are the most difficult challenges, although top-performing trackers on these attributes are still able to reach respectable performance compared to other attributes and the overall results [23].

Visual object tracker benchmarks such as the ones mentioned above mostly contain relatively short sequences. When tracking in the wild, tracking often occurs for longer periods of time. Therefore, evaluating tracker performance on short sequences might not accurately measure tracking performance. In order to solve some of these issues, the Large-scale Single Object Tracking benchmark (LaSOT) was introduced, which contains many longer-duration sequences [10]. LaSOT is focused on long-term tracking, and contains many sequences where occlusion occurs. Like the above-mentioned datasets, LaSOT includes a set of attributes for each sequence. LaSOT is a challenging benchmark, showing lower tracking performance compared to, for instance, TrackingNet or OTB [10, 28, 23], and shows considerably lower performance on occlusion and out-of-frame occlusion [10].


3.2 Remaining Challenges in Occlusions

The widely used visual object tracking benchmarks in the above section all contain samples of occlusion. In addition, they offer different attributes per video, including occlusion. Yet, these benchmarks might not accurately evaluate tracker performance on occlusions. First of all, only general occlusion is categorised; solely evaluating based on whether partial or full occlusion occurs does not consider all the challenges that exist within the overall occlusion category. Only evaluating on general occlusion would therefore not represent accurate tracker performance in the wild.

Besides the lack of categorisation of occlusion, many of the sequences that contain occlusion do so in ways that do not reflect how occlusion might occur in real-life scenarios. For instance, when occlusions occur, the target often has little to no motion relative to the camera. Figure 3.1 shows examples of occlusion in OTB. Here, the occluded face stays mostly stationary during the sequence.

Figure 3.1: Sequences FaceOcc1 and FaceOcc2 present two examples of occlusion from OTB [28] showing barely any motion of the target object.

Apart from slow-moving target objects during occurrences of occlusion, occlusions often occur for very short amounts of time. For instance, in the sequences from figure 3.1, the occlusions only last for a couple of seconds at most. As some visual object tracker implementations can use temporal data in order to perform model updates, such as DiMP [2] or ATOM [7], such short occurrences of occlusion do not properly evaluate tracker performance when occlusions occur for a longer duration. LaSOT does offer longer sequences, but even in occlusion-heavy sequences, occlusions often only occur for shorter amounts of time, and when they do occur for longer periods of time, the object has low motion relative to the camera [10].

Besides the above-mentioned problems, many of the sequences containing occlusion only contain small partial occlusions, such as shown in figure 3.2. As mentioned before, these benchmarks lack specific sub-categories within occlusion. Most sequences also only contain occlusion caused by objects that are substantially different from the target object, although exceptions exist, such as the monkey-4 sequence from LaSOT.

Figure 3.2: Partial occlusions from LaSOT [10] sequences microphone-16, deer-10, and monkey-4.

LaSOT does provide sequences where objects fully leave the camera field of view for prolonged amounts of time. However, the area where the target object leaves the camera field of view is often the same location where the target object re-enters the field of view after the occlusion period. Although full occlusion has already proven to be a challenge in its own right [10], the described situation poses less of a challenge: a tracker might get stuck at the location where it lost the target object, and when the target object reappears in that same location, the tracker does not have to re-locate it.

3.3 Hard Occlusions

The above section illustrates the reasons why current tracking benchmarks do not properly evaluate visual object trackers on occlusion. Occlusions usually occur for short amounts of time, and are often coupled with mostly stationary target objects. Therefore, hard occlusions are introduced. Hard occlusions describe a set of more challenging and specific occlusion scenarios that are not addressed in current tracking benchmarks.

During hard occlusions, the target object is being occluded for greater amounts of time, resulting in the occurrence of occlusion for the majority of frames in a given sequence. In order to evaluate the actual tracking ability of the visual object trackers, during hard occlusions the target object moves around relative to the camera. In addition to long continuous occurrences of occlusion that are combined with target object movement, hard occlusions combine these challenges with specific occlusion scenarios.


Naturally, hard occlusions contain partial and full occlusion of the target object. During partial occlusion, some part of the target object must still be visible. During full occlusion, the target object is not visible; however, it is still within the camera field of view, and should ideally continue to be tracked. Examples of these occlusions can be found in figure 3.3 (a). Hard occlusions also include partial and full out-of-frame occlusion. These situations differ from regular occlusion as the target object leaves the field of view of the camera. As the target object leaves the field of view, in order to perform successful tracking, the tracker has to be able to detect the absence of the target object. Next, when the target object re-enters the field of view at a different location compared to where it left the field of view, successful tracking requires extensive re-localisation capabilities. Examples are shown in figure 3.3 (b).

Figure 3.3: (a) Partial occlusion sequence (left) and full occlusion sequence (right). (b) Partial out of frame occlusion (left) and full out of frame occlusion sequence (right).

Hard occlusions also include more specific cases of occlusion, which are combined with the occlusion scenarios mentioned above in order to challenge and evaluate the distinct components of visual object trackers. Occlusion by similar objects occurs when the target object is occluded by a similar-looking object. For successfully tracking the target object in this scenario, object trackers need a strong discriminative model, able to distinguish the target object from the similar-looking background distractors. Furthermore, the appearance models of the trackers are challenged by occlusion by transparent objects and feature occlusion. During the latter, a specific feature of the target object is omitted from view. See figure 3.4 for examples of occlusion by similar objects, occlusion by transparent objects, and feature occlusion.

Figure 3.4: Left: Occlusion by similar object. Mid: Occlusion by transparent object. Right: Feature occlusion, the bright blue patch on the target object disappears.

3.4 Evaluation Metrics for Occlusions

In order to evaluate visual object tracker performance, different methods have been proposed. Methods using precision or bounding box overlap are widely used [28, 23, 15]. These methods provide a good overall evaluation of tracker performance. However, in situations like occlusion, where the target object might be obstructed from view, these methods do not take target absence predictions and re-localisation of the target object into account. In the following sections, the commonly used metrics such as precision and success rate, as well as alternative metrics that do attempt to take the above-mentioned issues into account, will be discussed.

3.4.1 Precision

A rather straightforward metric that is widely used is precision [23]. To compute precision, the centre localisation error is calculated. The centre localisation error is defined as the Euclidean distance d between the centre of the ground-truth bounding box gt and the centre of the prediction bounding box pr. A frame is deemed successfully tracked if this distance falls within a certain threshold t. In order to evaluate the performance on a specific sequence, the average precision over all annotated frames is calculated, defining the average precision $p_s$ per sequence s as

$$p_s = \frac{1}{N_s^p} \sum_i b(gt_i, pr_i), \quad \text{where } b = \begin{cases} 0 & \text{if } d(gt_i, pr_i) > t \\ 1 & \text{otherwise} \end{cases} \quad (3.1)$$

Here, $d(gt_i, pr_i)$ denotes the Euclidean distance between the ground-truth centre $gt_i$ and the prediction centre $pr_i$, and $N_s^p$ denotes the number of annotated frames that contain a prediction for a given sequence $s$. Note that $N_s^p \le N_s^g$.

Using the average precision, precision plots can be constructed by calculating the precision for different thresholds t. However, the relative simplicity of the precision metric does come with some drawbacks. For instance, by only calculating the centre error distance, the size of the predicted bounding box is not taken into account. Similarly, when the tracker loses the target object, the predicted bounding boxes might become random, which can result in inaccurate measuring of tracking performance when using the precision metric [1].
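A minimal sketch of the per-sequence precision from Equation 3.1 is given below; the array layout and the 20-pixel default are assumptions for illustration (the 20-pixel threshold matches the one used later in this thesis).

```python
import numpy as np

def precision(gt_centres, pred_centres, threshold=20.0):
    """Average precision p_s for one sequence (Equation 3.1, sketch).

    gt_centres, pred_centres: (N, 2) arrays with the ground-truth and
    predicted bounding-box centres of the N annotated frames that have
    a prediction. A frame counts as correct when the centre localisation
    error is within `threshold` pixels.
    """
    d = np.linalg.norm(gt_centres - pred_centres, axis=1)
    return float(np.mean(d <= threshold))
```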


3.4.2 Success Rate

An evaluation metric that does take both the size and position of the predicted bounding box into account is the success rate [27, 23]. The success rate makes use of the average bounding box overlap, which measures performance by calculating the overlap between ground-truth bounding boxes and prediction bounding boxes. Bounding box overlap is calculated using the intersection over union (IoU) score, defined as

$$\mathrm{IoU}(b_g, b_p) = \frac{|b_g \cap b_p|}{|b_g \cup b_p|}. \quad (3.2)$$

Here, $b_g$ denotes a ground-truth bounding box, $b_p$ denotes a prediction bounding box, and $|\cdot|$ denotes the area of the given region.

To calculate the success rate on a given sequence $s$ using the bounding box overlap, the percentage of successfully tracked frames is calculated. Here, a frame is considered successfully tracked when the intersection over union score meets a certain threshold $t$. The success rate $SR_s$ can therefore be defined as

$$SR_s = \frac{1}{N_s} \sum_i b(b_{g,i}, b_{p,i}), \quad \text{where } b(b_g, b_p) = \begin{cases} 0 & \text{if } \mathrm{IoU}(b_g, b_p) < t \\ 1 & \text{otherwise} \end{cases} \quad (3.3)$$

Often the average success rate is calculated by setting $t = 0.5$; however, this can be considered unfair, as not meeting that requirement does not necessarily indicate that a frame is unsuccessfully tracked [13]. Similarly to the precision plot, the success rate can be used to create a success plot by calculating the success rate for a range of thresholds $t$ with $0 \le t \le 1$. In order to denote tracking performance as a single score instead of a graph, the area under the curve of the success plot can be calculated, which is a more accurate representation of tracker performance than using a single threshold [27].
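The following sketch computes the overlap of Equation 3.2, the success rate of Equation 3.3, and the area under the success plot. The box layout ([x, y, w, h]) and the number of sampled thresholds are assumptions for illustration.

```python
import numpy as np

def iou(bg, bp):
    """Intersection over union of two boxes given as [x, y, w, h] (Eq. 3.2)."""
    x1 = max(bg[0], bp[0])
    y1 = max(bg[1], bp[1])
    x2 = min(bg[0] + bg[2], bp[0] + bp[2])
    y2 = min(bg[1] + bg[3], bp[1] + bp[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = bg[2] * bg[3] + bp[2] * bp[3] - inter
    return inter / union

def success_rate(ious, t):
    """Fraction of frames whose overlap meets the threshold t (Eq. 3.3)."""
    return float(np.mean(np.asarray(ious) >= t))

def auc(ious, num_thresholds=21):
    """Area under the success plot: mean success rate over thresholds in [0, 1]."""
    return float(np.mean([success_rate(ious, t)
                          for t in np.linspace(0.0, 1.0, num_thresholds)]))
```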

3.4.3 Longest Subsequence Metric

As has been mentioned, the above metrics might not always accurately report tracking performance in situations where objects could be occluded from view. In order to account for these issues, other methods have been proposed. For instance, the success rate can be altered by always counting a frame as successfully tracked when a tracker correctly predicts the absence of the target object [14]. Another proposed method is the Longest Subsequence Metric (LSM) [22].

The LSM metric quantifies long term tracking behaviour by computing the ratio between the length of the longest continuously successfully tracked sequence of frames and the full length of the sequence. A sequence of frames is deemed as successfully tracked if a certain percentage p of frames within this sequence have a bounding box overlap of at least 0.5 [22].


The goal of LSM is to take into account the issue of trackers frequently losing track of the target object for short durations, which existing metrics fail to address [22]. When a tracker loses the target object, it might re-locate it if the target object happens to move across the point where the target was initially lost. By only taking the longest successfully tracked subsequence into account, such scenarios result in a lower LSM score.
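A brute-force sketch of the LSM definition above is given below; the thresholds default to the values used later in this thesis (p = 0.95, IoU >= 0.5), and the quadratic scan is purely for clarity.

```python
import numpy as np

def lsm(ious, p=0.95, iou_thresh=0.5):
    """Longest Subsequence Metric (sketch).

    ious: per-frame overlap scores of one sequence. A contiguous run of
    frames counts as successfully tracked when at least a fraction p of
    its frames have IoU >= iou_thresh; the metric is the length of the
    longest such run divided by the sequence length.
    """
    correct = np.asarray(ious) >= iou_thresh
    n = len(correct)
    longest = 0
    for start in range(n):
        for end in range(start + 1, n + 1):
            if correct[start:end].mean() >= p:
                longest = max(longest, end - start)
    return longest / n
```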

3.4.4 F-Score

Another evaluation metric that is widely used is the F-score [9]. The F-score is defined as

$$F(s) = \frac{2 \cdot p_s \cdot r_s}{p_s + r_s}. \quad (3.4)$$

Here, $p_s$, as defined in Equation 3.1, denotes the precision and $r_s$ is the recall for a given sequence $s$. Note that when calculating precision for the F-score, the average bounding box overlap is often used instead of the Euclidean distance [13]. Similarly, the recall is defined as

$$r_s = \frac{1}{N_s^g} \sum_i b(gt_i, pr_i), \quad \text{where } b = \begin{cases} 0 & \text{if } \mathrm{IoU} < t \\ 1 & \text{otherwise} \end{cases} \quad (3.5)$$

Here, $N_s^g$ denotes the total number of annotated frames.
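The overlap-based F-score can be sketched as below. It assumes the frames with a prediction have already been paired with their ground truths; the function name and the default threshold are illustrative.

```python
import numpy as np

def f_score(ious, n_pred, n_gt, t=0.5):
    """Overlap-based F-score from Equations 3.4 and 3.5 (sketch).

    ious: overlap scores of the annotated frames that have a prediction.
    n_pred: number of annotated frames with a prediction (N_s^p).
    n_gt: total number of annotated frames (N_s^g).
    Precision and recall use the same per-frame test (IoU >= t) but
    normalise by n_pred and n_gt respectively.
    """
    correct = int(np.sum(np.asarray(ious) >= t))
    p = correct / n_pred
    r = correct / n_gt
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```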

An alternative approach to the F-score is proposed by [21], introducing the tracking precision, tracking recall, and tracking F-score. Instead of using the IoU score to calculate precision and recall, the prediction certainty scores of the visual object trackers are used. As a result, the tracking F-score can take target absence prediction and target re-localisation into account [22]. Utilising different thresholds t allows for the creation of an F-score plot. The tracking F-score metric was used in the 2018 and 2019 Visual Object Tracking challenges in order to evaluate long-term trackers [15, 14]. The final tracker performance measure is determined by taking the highest tracking F-score from the F-score plot.


Chapter 4

Method

To evaluate the visual object trackers on hard occlusions, a Hard Occlusion Benchmark dataset (HOB) has been constructed. Each of the visual object trackers used in this thesis project will be benchmarked on HOB. In this thesis project, the experiments and evaluation are performed on online single object trackers, which are initialised by a rectangular bounding box in the first frame of a sequence. Furthermore, only trackers with code available online are used. The following trackers are evaluated: SiamFC [29], SiamRPN++ (AlexNet) [18], SiamRPN++ (ResNet) [18], SiamRPN++ (Long Term) [18], ECO [8], ATOM [7], and DiMP [2].

4.1 Tracking Frameworks

Two tracking frameworks were used in this thesis project. All trackers used default available models, and were thus evaluated on their default configurations as they appear in their respective papers. Each of the trackers used in this thesis project was subjected to a subset of the OTB50 Benchmark [28] in order to ensure the trackers were working properly.

4.1.1 PySOT

With PySOT, the SenseTime Video Intelligence Research team offers implementations of several state-of-the-art single object tracking algorithms. The PySOT codebase and installation instructions can be found at https://github.com/STVIR/pysot. The trackers provided by PySOT that are used during this thesis project are SiamRPN++ using AlexNet [16] as a backbone, SiamRPN++ using ResNet50 [11] as a backbone, and SiamRPN++ Long Term [18]. The latter is an adjusted version of SiamRPN++ with the ResNet50 backbone which is specifically designed for long-term tracking, and follows the re-initialisation strategy described in [15]. The trackers are implemented in Python using the PyTorch deep learning framework version 0.4.1 and CUDA version 9.0.

4.1.2 PyTracking

Like PySOT, PyTracking offers a framework containing several state-of-the-art object tracker implementations. PyTracking offers modules for both training and evaluating trackers. The trackers provided by the PyTracking framework that are used during this thesis project are ECO [8], ATOM [7], and DiMP [2]. These trackers are implemented in Python using PyTorch as the deep learning framework and CUDA version 10.0. The PyTracking codebase and installation instructions can be found at https://github.com/

4.1.3 SiamFC

While PySOT does offer an implementation of SiamFC [29], it does not offer a pre-trained model. Therefore, an alternative implementation of SiamFC is used, which can be found at https://github.com/huanglianghua/siamfc-pytorch. This version of SiamFC is implemented in Python using PyTorch and CUDA version 9.0. SiamFC was evaluated on the same Windows machine as the PySOT trackers.

4.2 Hard Occlusion Benchmark Dataset

In this section, the construction of the Hard Occlusion Benchmark dataset (HOB) is described. HOB contains 20 sequences covering different hard occlusion scenarios. In total, HOB contains 55,388 frames, with an average duration of 92 seconds per sequence. As HOB is specifically constructed to evaluate visual object trackers on hard occlusions, consistency between sequences is important. Variations in scenarios that may challenge visual object trackers, apart from hard occlusions, are kept to a minimum. To guarantee this consistency, properties such as lighting conditions are kept mostly the same between sequences. Apart from keeping the sequences consistent in terms of different scenarios, resolution and frame rate are also kept consistent. The first frame of each sequence with its corresponding ground truth is shown in figure 4.1.


Figure 4.1: First frame of every sequence in HOB with target object bounding box.

4.2.1 Annotations

Each of the sequences in HOB is annotated by hand with two bounding boxes. The first bounding box surrounds the entire target object. When the target object is occluded, the bounding box still attempts to describe the entirety of the target object, including the occluded section. The second bounding box is used to annotate only the section that occludes the target object. When the target object moves out of frame, no target object bounding box is present; likewise, when no occlusion occurs, no occlusion bounding box is present. As the bounding boxes are rectangles, it is not probable that the occlusion bounding box accurately describes the occluded area. In order to get a better approximation of the amount of area each occlusion bounding box covers, each occlusion annotation is supplied with an occlusion ratio. Because trackers use the first frame to build the appearance model, each object is fully visible in the first frame of each sequence. Figure 4.1 shows the first frame of each of the 20 sequences with the corresponding target object bounding box.

All annotations were performed by hand. Because each sequence requires three different types of annotations, the target object bounding box, the occlusion bounding box, and the occlusion ratio, it was not feasible for the current thesis project to supply these annotations for each individual frame. Partially annotating the sequences and thereafter filling in the missing annotations by employing state-of-the-art trackers can be used to speed up the annotation process, as was done during the construction of TrackingNet [23]. However, because of the large amount of occlusion in the particular sequences of HOB, this strategy did not prove to be useful, as most predictions would require extensive manual corrections during occlusions. Therefore, it was decided to supply annotations at a rate of 2 frames per second, which still offers plenty of annotated frames due to the long duration of the sequences and the long periods of occlusion. Each sequence contains on average 185 annotated frames.

4.2.2 Attributes

Besides the annotations, each sequence in HOB is categorised with the different hard occlusion scenarios introduced in section 3.3. The attributes are defined in Table 4.1.

Table 4.1: Description of each of the 7 attributes that categorise the HOB sequences.

Attribute  Description
POC        Partial Occlusion
FOC        Full Occlusion
POF        Partial Out of Frame Occlusion
FOF        Full Out of Frame Occlusion
OCF        Feature Occlusion
OCSO       Occlusion by Similar Object
OCTO       Occlusion by Transparent Object

4.3 Evaluation Methodology

As has been illustrated in section 3.2, current visual object tracking benchmarks contain samples of occlusion that do not accurately resemble occlusion as it appears in the wild. Therefore, the selected visual object trackers will be evaluated on both HOB and a subset of the LaSOT [10] test set. In this thesis project, the visual object trackers are evaluated on HOB using the precision and success rate metrics, as well as the area-under-curve (AuC) and LSM.

4.3.1 Evaluation on LaSOT

Since LaSOT is a large benchmark focused on long-term tracking, it includes many sequences containing occlusions and out-of-frame occlusions, which makes it one of the more difficult tracking benchmarks [10]. Evaluating the selected visual object trackers on LaSOT therefore offers a useful baseline for comparison with the evaluation on HOB. HOB is a relatively small dataset containing only 20 sequences, so to keep the comparison between HOB and LaSOT fair, the 20 most occlusion-heavy sequences are selected from LaSOT. Furthermore, while LaSOT offers per-frame ground-truth annotations, HOB contains a ground-truth annotation every 15 frames. Therefore, only every 15th frame of LaSOT is used during the evaluation procedure.
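Matching the two annotation rates then comes down to simple subsampling, as in the minimal sketch below; the assumption is that predictions and ground-truth boxes are stored as parallel per-frame lists.

def subsample(per_frame_list, step=15):
    """Keep every `step`-th entry so LaSOT matches the HOB annotation rate."""
    return per_frame_list[::step]

# Predictions and ground truth must be subsampled with the same indices,
# otherwise the evaluation would compare misaligned frames:
# preds_eval = subsample(preds)
# gts_eval = subsample(ground_truths)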

4.3.2 Evaluation Metrics

Evaluations of the visual object trackers on HOB are performed using precision, success rate, AuC, and LSM. While precision does not take the size of the predicted bounding box into account, it does measure how close the predicted position is to the ground truth, which is not always the case for the success rate, as only the overlap between prediction and ground truth is considered. In the case of occlusion, where the target object is not entirely visible, precision indicates to what extent the tracker manages to correctly predict the location of the occluded target object. The final precision score for each tracker is evaluated at a threshold of 20 pixels, as in [27].
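A minimal sketch of the precision metric is given below, assuming axis-aligned (x, y, w, h) boxes in pixels; the function names are illustrative and not taken from any of the evaluated frameworks.

import numpy as np

def center_error(pred, gt):
    """Euclidean distance between the centres of two (x, y, w, h) boxes."""
    pred_centre = np.array([pred[0] + pred[2] / 2.0, pred[1] + pred[3] / 2.0])
    gt_centre = np.array([gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0])
    return float(np.linalg.norm(pred_centre - gt_centre))

def precision_curve(preds, gts, thresholds=np.arange(0, 51)):
    """Fraction of frames whose centre error is within each pixel threshold."""
    errors = np.array([center_error(p, g) for p, g in zip(preds, gts)])
    return np.array([(errors <= t).mean() for t in thresholds])

# Representative score: the precision value at the 20-pixel threshold.
# curve = precision_curve(preds, gts); precision_at_20 = curve[20]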

Besides accurate predictions of the position of the target object, visual object trackers should also accurately predict the bounding box size and aspect ratio. Ideally, even when the target object is occluded from view, the tracker should still predict a bounding box that encompasses the entire object, not just the unoccluded part. This is measured by the success rate. To form a single representative score from the success rates of each tracker, the AuC metric is used.
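A sketch of the success rate and its AuC summary, under the same (x, y, w, h) box convention as the precision sketch above, is shown below; again, the helper names are illustrative only.

import numpy as np

def iou(pred, gt):
    """Intersection over union of two (x, y, w, h) boxes."""
    x1 = max(pred[0], gt[0])
    y1 = max(pred[1], gt[1])
    x2 = min(pred[0] + pred[2], gt[0] + gt[2])
    y2 = min(pred[1] + pred[3], gt[1] + gt[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = pred[2] * pred[3] + gt[2] * gt[3] - intersection
    return intersection / union if union > 0 else 0.0

def success_curve(overlaps, thresholds=np.linspace(0, 1, 21)):
    """Fraction of frames whose IoU exceeds each overlap threshold."""
    overlaps = np.asarray(overlaps)
    return np.array([(overlaps > t).mean() for t in thresholds])

def area_under_curve(curve):
    """AuC summary: the mean success rate over the overlap thresholds."""
    return float(np.mean(curve))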

Finally, it is important that the visual object trackers do not lose the target object. To evaluate how well the trackers can continuously track the target object, the LSM metric is used. The representative LSM score is calculated at a threshold of 95%, as in [22]; in other words, a successfully tracked continuous sequence of frames contains at least 95 percent correctly tracked frames. A frame is considered correctly tracked when its IoU is greater than 0.5. Because LSM computes the ratio between the longest continuously tracked subsequence and the length of the entire sequence, it can introduce a bias towards extremely long and short sequences. However, all sequences used in this thesis project are of similar length, so this is not an issue for accurate evaluation.
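A straightforward sketch of this computation, following the description above, is given below. It is quadratic in the number of annotated frames, which is negligible at HOB's annotation rate, and is meant as an illustration rather than the reference implementation of [22].

import numpy as np

def lsm(overlaps, iou_threshold=0.5, ratio=0.95):
    """Longest subsequence measure as described above: the length of the longest
    contiguous window in which at least `ratio` of the frames have an IoU above
    `iou_threshold`, normalised by the sequence length."""
    tracked = (np.asarray(overlaps, dtype=float) > iou_threshold).astype(int)
    n = len(tracked)
    if n == 0:
        return 0.0
    prefix = np.concatenate(([0], np.cumsum(tracked)))  # prefix[j] = tracked frames in [0, j)
    best = 0
    for start in range(n):
        for end in range(start + 1, n + 1):
            length = end - start
            if prefix[end] - prefix[start] >= ratio * length:
                best = max(best, length)
    return best / n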

For the metrics that use IoU (success rate, AuC, and LSM), the case of a tracker successfully predicting the absence of the target object has to be considered. Therefore, if a tracker correctly predicts the absence of the target object, the IoU for that frame is set to 1. The tracking F-score is not used in this thesis project, as it requires the confidence scores of the trackers, which were not available, and the F-score has been shown to be as representative as simply calculating the overlap between prediction and ground-truth bounding boxes [25].
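The absence rule can be folded into the overlap computation as sketched below; representing absence as None, and returning 0 in the mismatching cases, are assumptions of this sketch rather than a prescribed protocol.

def overlap_with_absence(pred, gt, iou_fn):
    """IoU that rewards a correct absence prediction, as used for the IoU-based
    metrics. `pred` and `gt` are (x, y, w, h) boxes, or None when the target is
    reported or annotated as absent (sketch convention)."""
    if gt is None:
        # Target absent: full credit only when the tracker also reports absence.
        return 1.0 if pred is None else 0.0
    if pred is None:
        # Tracker reports absence while the target is visible (assumed to score 0).
        return 0.0
    return iou_fn(pred, gt)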


Chapter 5

Experiments and Evaluation

In this thesis project, a set of current state-of-the-art visual object trackers is benchmarked and evaluated. The selected set covers a variety of tracking algorithms. ECO [8] is based on discriminative correlation filters. SiamFC [29] uses a Siamese network to learn a similarity function for general object detection during tracking. SiamRPN++ [18] extends SiamFC with a region proposal network for more accurate localisation of the target object and implements a more efficient cross-correlation function. Three versions of SiamRPN++ are used in the current thesis project: the first two use the standard implementation with ResNet50 [11] and AlexNet [16] as a backbone respectively, while the third uses ResNet50 and follows the update protocol described in [15], where the tracker is re-initialised after a target loss. ATOM [7] uses overlap maximisation for better target classification. DiMP [2] utilises discriminative model prediction to address the shortcomings of Siamese networks. Below, the overall performance evaluation is presented. The mentioned trackers are benchmarked on both HOB and LaSOT [10]. Thereafter, a more detailed analysis of four occlusion-specific challenges is given: full out-of-frame occlusion, occlusion by similar object, occlusion by transparent object, and feature occlusion. These are chosen for their contrasting results compared to the overall performance. The trackers in this thesis project are evaluated on a Windows machine for the PySOT framework and SiamFC, and an Ubuntu machine for the PyTracking framework, running an Intel Core i7 8700K CPU and an Nvidia GTX 1070 GPU.

5.1 Overall Performance

Figure 5.1: Overall results on HOB (top) and LaSOT (bottom) for the precision, success rate, and LSM metrics (left, middle, and right respectively).

Figure 5.1 shows the precision, success rate, and LSM of each individual tracker on both HOB and LaSOT. Table 5.1 lists the representative scores for each of the metrics. Overall, the results show that average performance is worse on HOB than on LaSOT. On HOB, SiamRPN++ with the ResNet50 (r50) [11] backbone architecture is the top performing tracker in terms of precision, success rate, and LSM. In the success plots, SiamRPN++ (r50) outperforms the long-term (lt) variant by a small margin, with an AuC score of 0.359 compared to 0.343, which is interesting, as SiamRPN++ (lt) is specifically designed to handle long-term tracking, including the disappearance of the target object through occlusion. This is in contrast to the performance on LaSOT. On LaSOT, DiMP is the top performing tracker on the AuC score, with SiamRPN++ (lt) a close second, at 0.391 compared to 0.383. It is interesting to note that DiMP consistently underperforms compared to both SiamRPN++ (r50) and SiamRPN++ (lt) on HOB.

The same pattern appears in the precision scores: on HOB, SiamRPN++ (r50) and SiamRPN++ (lt) are the top performing trackers with scores of 0.195 and 0.192 respectively, with DiMP performing slightly worse at 0.173. On LaSOT, SiamRPN++ (lt) and DiMP are the top performing trackers on precision, with scores of 0.437 and 0.421 respectively (see Table 5.1). It is not surprising that SiamRPN++ (lt) is a top performer on both HOB and LaSOT, as it can re-initialise itself when it loses the target object. The difference in DiMP's performance between HOB and LaSOT can possibly be attributed to its ability to discriminate background distractors.


Table 5.1: Representative scores for precision (t > 20), area under the curve (AuC), and LSM (x > 0.95) for each of the evaluated trackers on HOB and LaSOT. Note that sRPN++ stands for SiamRPN++.

                 Precision         AuC               LSM
Dataset          HOB     LaSOT     HOB     LaSOT     HOB     LaSOT
ATOM             0.142   0.342     0.243   0.317     0.122   0.228
DiMP             0.173   0.421     0.324   0.391     0.126   0.292
ECO              0.070   0.191     0.149   0.199     0.098   0.166
SiamFC           0.093   0.225     0.205   0.191     0.090   0.135
sRPN++ (alex)    0.154   0.320     0.300   0.300     0.125   0.185
sRPN++ (lt)      0.192   0.437     0.343   0.383     0.127   0.248
sRPN++ (r50)     0.195   0.318     0.359   0.278     0.133   0.245

In the case of hard occlusions such as those in HOB, DiMP might be more likely to predict bounding boxes solely around the unoccluded part of the target object, while SiamRPN++ (r50) and SiamRPN++ (lt) include more of the occluded part. This would explain DiMP's lower precision and AuC scores on HOB compared to LaSOT.

The worst performing tracker in both precision and AuC on HOB is ECO, with a precision of 0.07 and an AuC score of 0.149, with SiamFC the second worst performing tracker. Likewise, on LaSOT, ECO and SiamFC are the worst performing trackers, with ECO having a slightly higher AuC score of 0.199 compared to 0.191 for SiamFC. It seems that the discriminative correlation filter approach utilised in ECO is not very well suited to occlusions and long-term tracking in general, as it is likely less able to generalise than Siamese-based trackers. In the case of SiamFC, its lack of accurate target classification and localisation seems to hamper performance during occlusions and long-term tracking. This becomes more apparent in the LSM score, where SiamFC and ECO perform the worst on both HOB and LaSOT, although ECO performs better than SiamFC on LaSOT, while on HOB, ECO is the worst performing tracker on the LSM metric, suggesting that SiamFC is more capable of handling occlusions while struggling with general long-term tracking compared to ECO.

ATOM performs consistently worse on HOB than the three SiamRPN++ variants. On LaSOT, ATOM performs very similarly to SiamRPN++ (r50) and SiamRPN++ (alex), generally outperforming them by a slight margin. Like DiMP, ATOM has improved discriminative capabilities, resulting in more accurate target object localisation. However, in the case of occlusion, this might again result in the tracker tending to find only the unoccluded regions of the target object, leading to lower precision and AuC scores on HOB compared to LaSOT. The discriminative power of DiMP and ATOM does result in these trackers performing almost on par with SiamRPN++ (r50) and SiamRPN++ (lt) on HOB, and outperforming SiamRPN++ (alex) on the LSM metric. On LaSOT, DiMP has the highest LSM score, suggesting that stronger discriminative abilities do help the tracker keep track of the target object more accurately in long-term tracking.

5.2 Attribute Evaluation

Figure 5.2: Success plots for full out-of-frame occlusion (top left), occlusion by similar object (top right), occlusion by transparent object (bottom left), and feature occlusion (bottom right).

The tracker performance on the specific attributes is evaluated using success plots. The success plots for full out-of-frame occlusion, occlusion by similar object, occlusion by transparent object, and feature occlusion are shown in figure 5.2. Full out-of-frame occlusion appears to be a very challenging problem for the visual object trackers. SiamRPN++ (lt) is clearly the top performing tracker in this category, which is most likely attributable to its re-initialisation strategy when it detects target object loss. The second best performing tracker on full out-of-frame occlusion is SiamRPN++ (r50), performing considerably better than SiamRPN++ (alex), SiamFC, ATOM, DiMP, and ECO. Having access to rich features at different scales seems to be of great importance for full out-of-frame occlusions; SiamRPN++ (alex) performs significantly worse than SiamRPN++ (r50), likely because of the less rich features produced by AlexNet [16]. DiMP performs considerably worse during full out-of-frame occlusions, as its performance is on par with ATOM and ECO, and slightly below SiamRPN++ (alex). Overall, the SiamRPN++ variants outperform the other evaluated trackers. ATOM, DiMP, and ECO update their appearance model during tracking; in the case of full out-of-frame occlusions, this can result in the appearance model being updated on samples that do not contain the target object, leading to a corrupted appearance model. This is not the case for the SiamRPN++ trackers, as their appearance model remains fixed during tracking.

In the case of occlusion by similar object, SiamRPN++ (alex) has the highest overall performance, while SiamRPN++ (r50) performs the worst of the SiamRPN++ trackers. Interestingly, using the shallow AlexNet as a backbone results in better performance than using the deep ResNet, even outperforming the long-term SiamRPN++ variant. Re-initialising on target loss thus offers no clear advantage in the case of occlusion by similar objects, as the performance of SiamRPN++ (lt) is very similar to that of SiamRPN++ (alex). DiMP is the second best performing tracker, with ATOM, SiamFC, and ECO being the lowest performing trackers. ECO performs considerably worse than the other trackers, suggesting that the discriminative correlation filter approach has a difficult time dealing with occlusions by similar objects.

Similarly, ECO performs the worst on the occlusion by transparent object category by a wide margin. DiMP is the best performing tracker in this category, with SiamRPN++ (alex) once again a very close second. Both DiMP and SiamRPN++ (alex) perform considerably better than the other SiamRPN++ variants, similarly to the occlusion by similar object category. Likewise, SiamRPN++ (r50) and SiamRPN++ (lt) show very similar performance. In the cases of occlusion by similar objects and occlusion by transparent objects, the ability to generalise to objects not seen during training could play an important role. The appearance model predictor implemented in DiMP contains few parameters, leading to better generalisation as less overfitting to observed classes occurs during the offline training phase [2]. Likewise, SiamRPN++ (alex), using the shallow AlexNet, contains fewer parameters than SiamRPN++ (r50) [18]. The performance of ATOM is considerably lower than that of DiMP, even though both use the same IoU maximisation based architecture for updating the appearance model. Thus, the appearance model update seems to be of less importance in these cases, as the appearance model of SiamRPN++ (alex) remains fixed during tracking.

In the case of feature occlusion, SiamRPN++ (r50) and SiamRPN++ (lt) are the top performing trackers, with near identical performance. As objects tend to stay at least partially visible in this category, the re-initialisation strategy of SiamRPN++ (lt) does not offer much benefit in tracking performance. SiamRPN++ (lt) and SiamRPN++ (r50) are closely followed by SiamRPN++ (alex) and DiMP. Once again, ECO is the worst performing tracker, with SiamFC and ATOM showing slightly better performance than ECO. The results for the feature occlusion category are very similar to the overall performance shown in figure 5.1, although on average the trackers seem to perform slightly worse on feature occlusion, specifically at higher thresholds.


Chapter 6

Conclusions and Discussion

The experiments show that hard occlusions are still a challenging problem for the evaluated visual object trackers. Compared to tracking benchmarks such as OTB [28], TrackingNet [23], and LaSOT [10], the evaluated trackers show significantly reduced performance when facing hard occlusions. On HOB, the top performing tracker on precision, AuC, and LSM is SiamRPN++ (r50) [18], closely followed by SiamRPN++ (lt) [18]. Both outperform DiMP, which has been shown to outperform SiamRPN++ (r50) on general tracking benchmarks including LaSOT and the Visual Object Tracking Challenge (VOT) [2, 14]. The improved discrimination of background distractors offered by DiMP compared to the SiamRPN++ variants could make DiMP more prone to locating only the unoccluded part of the target object. Furthermore, DiMP updates its target appearance model during tracking, while the SiamRPN++ variants do not. This could cause DiMP's target appearance model to gradually decay when the target object is occluded for prolonged amounts of time, resulting in less accurate tracking.

ECO is the worst performing tracker by a large margin, suggesting that discriminative correlation filter based trackers are not able to handle hard occlusions very well, even though ECO rivals Siamese-based trackers on benchmarks such as OTB [8, 17, 18]. Similarly to DiMP, ECO updates its appearance model during tracking, and ATOM, which also performs worse than the SiamRPN++ variants, updates its appearance model as well. Although DiMP shows improved performance over SiamRPN++ on general tracking benchmarks, during hard occlusions the appearance model update strategy seems to decrease performance [2]. Therefore, future work could focus on more robust appearance model update strategies that only update on high prediction confidence scores, in order to minimise appearance model decay.
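As a rough illustration of such a strategy, the sketch below gates the appearance model update on the tracker's confidence; the threshold value and the update interface are hypothetical and do not correspond to any of the evaluated frameworks.

CONFIDENCE_THRESHOLD = 0.7  # illustrative value; would need to be tuned per tracker

def gated_model_update(appearance_model, training_sample, confidence,
                       threshold=CONFIDENCE_THRESHOLD):
    """Update the appearance model only when the prediction confidence is high,
    so prolonged occlusions are less likely to corrupt the model (sketch only)."""
    if confidence >= threshold:
        appearance_model.update(training_sample)  # hypothetical update method
    # Otherwise the current appearance model is kept unchanged.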

Furthermore, when examining specific sub-categories within occlusion, tracker
