
Domain Adaptation for Single Object Tracking

Jorrit Ypenga (11331550)
Bachelor thesis, 18 EC
Bachelor Opleiding Kunstmatige Intelligentie
Supervisor: A. Moskalev MSc, UvA-Bosch DELTA Lab

Faculty of Science, University of Amsterdam
Science Park 904, 1098 XH Amsterdam


Contents

Abstract
1 Introduction
2 Background and Related Work
  2.1 Tracking-by-detection
    2.1.1 History of DCF-based Tracking
    2.1.2 DCF-based Tracking
    2.1.3 Siamese Networks
  2.2 Regularization and Robustness
    2.2.1 Regularization as Domain Adaptation
    2.2.2 Robustness of Siamese Trackers
  2.3 Benchmarking
3 Method
  3.1 Deeply-Supervised Siamese Network
    3.1.1 Training
    3.1.2 Inference
    3.1.3 Implementation
  3.2 Regularized Siamese Embeddings
    3.2.1 Gram Regularization
4 Experiments
  4.1 Goal
  4.2 Baseline Comparison
    4.2.1 Training Setup
    4.2.2 Inference
    4.2.3 Ablation Study
  4.3 Results and Discussion
5 Future Work


Abstract

Visual object tracking has quickly become one of the major challenges within the domain of computer vision. However, the limited availability of annotated training data introduces a bias towards existing datasets in tracker development. In this work, we introduce a novel training method and regularization procedure for Siamese trackers that tackles this overfitting problem through offline deep supervision and divergence-based domain adaptation. The fully offline nature of our method harbors a significant advantage over previous external regularizers since it retains the original tracking speed of the underlying model. Our experiments on two renowned tracking benchmarks show promising results. When applied to a different target domain, the baseline showed signs of severe overfitting on the source. We show that our deeply supervised training formulation reduces overfitting and uses data more efficiently compared to the baseline. Our method achieved a relative gain of ∼9% over the baseline on the most common accuracy and robustness metrics when using the same amount of training data. Additional divergence-based regularization yields an even greater improvement in tracking robustness when object diversity is high, at the cost of localization.

Keywords: Object tracking, Siamese networks, Deep learning, Computer vision, Object detection, Regularization, Domain adaptation

Figure 1: A comparison of our formulation, Deeply Supervised SiamFC (DSSiam), with the original SiamFC [1] and the ground-truth (GT) on three OTB2015 [2] sequences, including several object tracking sub-challenges, such as scale change, motion blur, background clutter, fast motion, occlusion, and rotations.


1 Introduction

Visual Object Tracking (VOT) is a principal challenge within the field of computer vision. The objective of VOT is to identify and track regions of interest within a sequence of video frames. Visual object trackers are extensively used and have many applications, such as surveillance [3], self-driving cars [4], flow analysis [5], medical imaging [6], and human-computer interaction [7].

A typical VOT methodology can be divided into four sequential stages: target initialization, appearance modeling, motion prediction, and target positioning [8]. The target initialization stage comprises the annotation of a region of interest or target in the first frame. Appearance modeling is concerned with extracting features from the resulting region of interest for a mathematically applicable representation. Motion prediction aims to estimate the location of the region of interest in future frames. In target positioning, prediction and search strategies are employed to pinpoint the exact location of the region of interest.

While often diverse in these four stages, state-of-the-art VOT methodologies can still be generalized under different broader categorizations. (1) Single vs. multiple object tracking, distinguishing trackers concerned with single object tracking from those concerned with multiple object tracking. (2) Short- vs long-term tracking, a division originally made by the VOT challenge committee1 [9], separates trackers specialized in short sequences from trackers more catered towards long sequences. Short-term trackers are not required to re-initialize when the target is lost, whereas long-term trackers are. (3) Online vs. offline tracking is a generalization based on the amount of available data to a tracker during inference. Online trackers are applied to a live video stream, hence future frames are not known during inference. In contrast, offline trackers use pre-recorded videos during inference, which allows for the use of all the frames within the sequence for predictions.

The main difficulties of VOT are twofold. A lack of extensively annotated datasets for training commonly leads to overfitting on the source domain. Additionally, the complicated nature of object tracking results in a diverse array of sub-challenges that interfere with object identification, such as occlusion, illumination variation, background clutter, scale change, motion blur, and deformation [10, 2]. Over the past two decades, visual object trackers have shown significant improvement in the latter, mainly due to the increasing use of deep-learning approaches based on Convolutional Neural Networks (CNN) and, more recently, Siamese networks [1]. However, limited research has been devoted to overfitting and cross-domain performance.

To address this problem, we present domain adaptation for VOT, allowing training data to be used more efficiently and in a way that generalizes to different visual domains through deeply supervised training of Siamese models and additional regularization. This work aims to investigate whether deep supervision during training prevents domain-specific overfitting for Siamese models. Additionally, this research seeks to determine the utility of supplementary regularization in achieving generalized embeddings. We validate this by applying deep supervision and regularization to the iconic SiamFC tracker [1] and contrasting the resulting scores with the original SiamFC results. While overfitting poses a problem for any current tracker, our


research specifically focuses on the implications of domain adaptation for single-camera, single-target, model-free2 trackers intended for short-term tracking.

In this work, the theoretical background and related work are discussed in Section 2. Section 3 introduces and describes our method in detail. The full evaluation and baseline comparison, along with the complementary discussion, is presented in Section 4. Finally, Sections 5 and 6 discuss potential future research and denote our conclusion.

2Model-free, meaning that the only externally provided information during training are the ground-truth bounding box annotations.


2 Background and Related Work

This section introduces and discusses the theoretical and practical background of this research using a variety of past studies. First, we discuss the tracking-by-detection paradigm and past and current techniques responsible for its implementation. After this, we address regularization efforts to increase the robustness of Siamese trackers. In closing, we examine popular benchmarking methods for VOT.

2.1 Tracking-by-detection

VOT harbors multiple perspectives on tracking problems. One of these perspectives is that distinguishing the target object from the background suffices in successfully tracking that single object. The tracking methodology derived from this perspective is called tracking-by-detection. The paradigm conventionally consists of two sub-tasks: object detection and data association [11], as shown in Figure 2. First, the locations of the target objects in the sequence are acquired and second, the locations of the targets are matched to obtain full object tracks. However, when concerned with tracking a single object, data association becomes redundant, which leaves only the task of object detection.

Object detection in single-target trackers is accomplished by modeling a procedure that can discern target pixels from background pixels in a scene. A supervised technique for learning such a foreground-background discriminator is Discriminative Correlation Filters (DCF) [12]. DCF forms the basis of the training stage for several state-of-the-art online single object trackers [1, 13, 14, 15]. Alongside its use in VOT, DCF is also actively applied in closely related computer vision tasks, such as face verification and action recognition [16, 17].

2.1.1 History of DCF-based Tracking

DCFs were originally adopted by VOT researchers because of their ability to track complex objects through numerous appearance variations and their significant increase in speed compared to state-of-the-art trackers at that time. First instances of such DCF-based Trackers (DCFT) relied on simple template matching and would generally fail when applied to tracking tasks due to hard labeling constraints during training. Advancements in DCFT can be attributed to two main contributions: the introduction of new features and conceptual improvements in filter learning [18].

DCFs quickly inspired many improvements, such as Average of Synthetic Exact Filters (ASEF) [19], which introduced adjustable filters for specific tasks. ASEF yields variability by modeling an average filter from multiple exact filters computed from training images of the same object, resulting in a more generalizing correlation filter. In addition, contrary to previous methods, ASEF specifies the complete correlation output per training image, adding additional variability. While successful at more static computer vision tasks, ASEF is unsuitable for online visual tracking due to its requirement for extensive training.

Bolme et al. proposed the Minimum Output Sum of Squared Error (MOSSE) [20] filter to compensate for the recurring need for large quantities of training data. The MOSSE-filter is an improved version of ASEF that is trained offline for the purpose of object detection. The objective during training is to minimize the sum of squared error between the desired output and the output of the response filter,

resulting in better filters using fewer training images. MOSSE forms the basis of a large part of current state-of-the-art DCFTs [21].

Figure 2: A schematic outline of the two sub-tasks (object detection and data association) of tracking-by-detection as used in multiple object tracking, performed by an independent detector and tracker [11].

2.1.2 DCF-based Tracking

As a supervised method, DCF comprises a training and a testing stage. The goal of general DCF inference is to return a response map containing low values for background pixels and high values for target pixels [22]. This is accomplished by learning a filter h from a set of training samples $\{x^t, y^t\}_{t=1}^{T}$. Each training sample $x^t$ is a feature map extracted from the region of interest in a training image. The feature map consists of $N_c$ channels and can be denoted as $x^t = \{x^t_d\}_{d=1}^{N_c}$, where $x^t_d \in \mathbb{R}^{w \times h}$. Here, w and h correspond to the width and height of the region of interest. In practice, it is often the case that w = h for the dimensions of the search region. The labels $y^t$ are defined as a maximum response in the center of the provided ground-truth bounding box, accomplished by either hard one-zero labeling [1] or soft labeling using a 2-dimensional Gaussian [12, 23]. The response of the filter h on a feature embedding extracted from an n × n patch can be defined as

$$R_h(x^t) = \sum_{d=1}^{N_c} x^t_d \star h_d, \qquad (1)$$

where $\star$ denotes convolution. The object position is estimated as the location of the maximum of $R_h(x^t)$. More efficient tracking is achieved by assuming circular convolution; according to the convolution theorem, circular convolution in the spatial domain is equal to element-wise multiplication in the Fourier domain:

$$R_h(x^t) = \mathcal{F}^{-1}\!\left( \sum_{d=1}^{N_c} \hat{x}^t_d \cdot \hat{h}_d \right), \qquad (2)$$

where $\mathcal{F}$ represents the Discrete Fourier Transform (DFT) operation, $\cdot$ denotes element-wise multiplication and $\hat{x}^t = \mathcal{F}(x^t)$ represents the DFT of $x^t$. In practice, the DFT of a vector is computed using the more efficient Fast Fourier Transform (FFT) algorithm. Computation in the Fourier domain is significantly faster than direct circular convolution; circular convolution has a complexity of $O(n^4)$, while the FFT has a complexity of only $O(n^2 \log n)$ for patches of size n × n.

Figure 3: Schematic overview of the workflow of a typical DCFT [24].

The objective of DCF training is to learn filters h that minimize the error between the correlation responses $R_h(x^t)$ and the labels $y^t$. Mathematically, the DCF objective is defined as

$$\min_h \sum_{t=1}^{T} \alpha_t L(R_h(x^t), y^t) + \lambda L_{reg}, \qquad (3)$$

where L is a loss function, $L_{reg}$ is a regularizer, $\alpha_t \geq 0$ are the sample weights and $\lambda \geq 0$ is a parameter determining the influence of the regularizer. An overview of the complete DCF formulation is displayed in Figure 3.
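To make Equation (2) concrete, the sketch below evaluates the response of a fixed filter in the Fourier domain with NumPy. It is a minimal illustration under assumed array shapes, not the training procedure of any particular tracker, and it follows the plain element-wise product of Equation (2) (practical DCF code often uses the complex conjugate to express correlation).

```python
import numpy as np

def dcf_response(x, h):
    """Response map of a multi-channel filter, computed per Eq. (2).

    x : (Nc, n, n) feature map of the search region
    h : (Nc, n, n) learned filter, one slice per feature channel
    Returns an (n, n) real-valued response map; the target position is
    estimated at the location of its maximum.
    """
    x_hat = np.fft.fft2(x, axes=(-2, -1))        # per-channel DFT of the features
    h_hat = np.fft.fft2(h, axes=(-2, -1))        # per-channel DFT of the filter
    r_hat = (x_hat * h_hat).sum(axis=0)          # element-wise product, summed over channels
    return np.real(np.fft.ifft2(r_hat))          # back to the spatial domain

# Hypothetical usage: locate the peak of the response.
x = np.random.rand(32, 64, 64)
h = np.random.rand(32, 64, 64)
r = dcf_response(x, h)
peak = np.unravel_index(np.argmax(r), r.shape)
```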

2.1.3 Siamese Networks

Another way to model a supervised procedure for computer vision tasks is by employing a Neural Network (NN), where the network is trained such that it returns the probabilities of a sample belonging to any of the predefined classes. However, this approach is very impractical when the problem at hand calls for the introduction of yet unobserved classes, in which case the network would need to be retrained with additional samples of such a newly observed class. To this end, Bromley et al. [25] developed a structure that is impervious to such class changes: Siamese networks.

Siamese networks are a subclass of artificial NNs. A Siamese network denotes a network structure of two sub-networks that share the same weights while working in tandem on two different inputs to compute comparable outputs. The network learns by comparing or contrasting the outputs using a specific similarity or loss function, effectively learning a similarity function between the two inputs. Figure 4 denotes a typical Siamese network structure.

This similarity learning problem is insusceptible to changing classes and is able to learn information about categories from very few samples; therefore, it is also commonly used as a one-shot learning method [26]. Siamese networks lend themselves especially well to VOT, due to their unique exploitability as a template-matching framework.

Figure 4: The general Siamese network structure for tracking or image comparison. The network takes two inputs x1 and x2 leading into two subnetworks f1 and f2 with shared weights. A convolution is applied to obtain a response map. Loss is commonly obtained by computing contrastive loss between the resulting response map and labels that express the desired activation. Example input images are from the MNIST database [27].

One of the first trackers to adopt Siamese networks as such was the Fully Convolutional Siamese network (SiamFC), proposed by Bertinetto et al. [1]. SiamFC aims to learn the function f(x, z) that compares an exemplar patch z to a candidate patch x and returns high scores when x is similar to z and low scores when x is not similar to z. SiamFC closely follows the tracking-by-detection paradigm by combining a DCF with a prior learned by a Siamese network. More specifically, SiamFC trains AlexNet [28] as feature extractor φ to generate two appearance models that are then correlated, resulting in:

$$f(x, z) = \varphi(z) \star \varphi(x) + b\mathbb{1}, \qquad (4)$$

where $b\mathbb{1}$ is a signal that takes the value b ∈ ℝ in every location, i.e. an additional bias. The output of the correlation is the score map h, with the object's predicted position at the coordinates of the maximum activation. The similarity function f [1] is learned offline on randomly selected image pairs. This approach allows for training on datasets that were not originally intended for tracking but for object recognition and classification, which are not as sparse, such as ILSVRC [29]. The inference is formulated as a one-shot detection task, where the embedding of the exemplar z is only computed once and correlated with re-scaled instances of x. The resulting bounding boxes are estimated based on the maximum activation of the score map and the corresponding scale of the re-scaled x. SiamFC shows that learning a similarity metric alone is enough for strong results, comparable to state-of-the-art trackers of that time [30], which were orders of magnitude slower. The success of SiamFC paved the way for more advanced Siamese trackers.
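For illustration, the following PyTorch sketch implements the cross-correlation of Equation (4) as a grouped convolution, a common way to realize the ⋆ operator for a batch of exemplar/instance pairs. The embedding shapes are hypothetical and the backbone φ itself is omitted; this is a sketch of the operation, not the reference SiamFC code.

```python
import torch
import torch.nn.functional as F

def siamese_score_map(phi_z, phi_x, bias=0.0):
    """Cross-correlate exemplar and instance embeddings (Eq. 4).

    phi_z : (B, C, Hz, Wz) exemplar embeddings, used as convolution kernels
    phi_x : (B, C, Hx, Wx) instance embeddings
    Returns a (B, 1, Hx-Hz+1, Wx-Wz+1) score map per pair.
    """
    b, c, hz, wz = phi_z.shape
    # Fold the batch into the channel dimension so each exemplar only
    # correlates with its own instance (grouped convolution).
    x = phi_x.reshape(1, b * c, phi_x.shape[-2], phi_x.shape[-1])
    score = F.conv2d(x, phi_z, groups=b)        # (1, B, H', W')
    return score.permute(1, 0, 2, 3) + bias     # (B, 1, H', W')

# Hypothetical usage with SiamFC-like embedding sizes.
phi_z = torch.randn(8, 128, 6, 6)
phi_x = torch.randn(8, 128, 22, 22)
print(siamese_score_map(phi_z, phi_x).shape)    # torch.Size([8, 1, 17, 17])
```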

The Siamese Regional Proposal Network (SiamRPN) [13] extends a Siamese network with a Region Proposal Network (RPN) containing two branches for both inputs: a classification branch and a regression branch. These branches are then used for foreground-background classification and proposal refinement respectively. SiamRPN takes a template and a detection frame as input and feeds both frames through both branches during training. The inference is formulated as a one-shot detection task, similar to SiamFC [1], where the branch for the template is pruned. In other words, the embeddings used for classification and regression are only extracted from the template once and fed directly into the classification and regression branches corresponding to the detection frame. SiamRPN achieved leading performance in the VOT2015, VOT2016 and VOT2017 real-time challenges [30, 31] while performing at significantly faster speeds than its competitors.

Another Siamese tracking formulation, Accurate Tracking by Overlap Maximization (ATOM) [15], uses the two branches of the Siamese network structure for target estimation and target classification. Target estimation is performed by an Intersection-Over-Union (IOU) predictor that predicts IOU scores using backbone features from both branches, the ground-truth bounding box and proposal bounding boxes from the current frame. Classification is performed by another network head with the sole purpose of discriminating between target and background using the backbone features. The underlying Siamese backbone consists of two instances of ResNet-18 [32] pre-trained on ImageNet [33]. ATOM achieved leading performance in terms of Expected Average Overlap (EAO) and accuracy on VOT2018, outranking both SiamRPN and Distractor-Aware SiamRPN (DaSiamRPN).

Siamese trackers are commonly trained on randomly sampled instance and exem-plar pairs. This training procedure does not mimic real online tracking conditions, where images appear in an ordered sequence. This leads to an inconsistency between training and inference.

2.2 Regularization and Robustness

Within the overarching field of machine learning, there are two ways of training models in spite of limited amounts of generalizing training data. The first comprises a transfer learning approach, meaning that specific layers of a network that is pre-trained on a more general, related dataset are retrained using a smaller, more specific dataset. However, this approach still assumes that the training data is similar to the underlying distribution, conceivably yielding poor performance when the domain of the input data changes. The second approach, domain adaptation, is a discipline that deals with scenarios where a model trained on a source domain is used in the context of a different but related target domain. We make the observation that regularization is a special case of domain adaptation within VOT, often responsible for an increase in tracking robustness.

2.2.1 Regularization as Domain Adaptation

In recent years, numerous domain adaptation methods have been proposed. Researchers make a distinction between supervised and unsupervised domain adaptation methods [34]. In supervised domain adaptation, some labeled data is available from the target domain, albeit a quantity too small to train a full network. In unsupervised domain adaptation, no labeled data is available for the target domain.

Wilson et al. [35] further classify unsupervised methods into approaches that (1) align source and target domain distributions, (2) map between domains, (3) separate normalization statistics, (4) design ensemble-based methods, and (5) focus on the discriminability of the target. This research limits its scope to the first of the aforementioned methods, also referred to as domain-invariant feature learning methods.

Domain-invariant feature learning aims to align source and target domain distributions through invariant features by modeling a feature extractor, i.e. an NN or a CNN. The resulting feature representation is domain-invariant if the features follow the same distribution regardless of whether the input data originates from the source or target domain [36]. Thus, if a supervised model performs well on the source domain using a domain-invariant feature representation, it might generalize well to the target domain, since the features of the target data resemble those in the source domain. However, this method implicitly assumes that such a feature embedding exists, which is not necessarily the case.

We notice that, when denoting individual images as source and target domains, regularization in VOT is a case of divergence-based domain-invariant feature learning [35]. The goal of such regularization is to learn a feature representation that generalizes well when a tracker is confronted with yet unobserved objects by minimizing a divergence that measures the distance between the source and target distributions, i.e. the distance between representations learned from observed and unobserved images.

2.2.2 Robustness of Siamese Trackers

Many efforts have been made to improve the accuracy and robustness of existing tracking models. These efforts could be viewed as external regularizers applied to existing models to increase their performance. However, it appears that tracking accuracy and robustness are weakly correlated [37], meaning that external regularizers have to settle for a satisfactory trade-off between the two.

Bhat et al. [38] analyzed this trade-off with respect to deep and shallow features. They found that training deep features to get higher accuracy might lead to a sub-optimal model, due to the implicit invariance property of deep features. Conversely, deep models benefit significantly from specific training to obtain higher robustness, such as data augmentation, a softer labeling function or generalizing regularization. Their analysis of shallow features revealed that those are more suited to yield high accuracy because they capture low-level features with higher discriminability.

In order to capture the complementary benefits of shallow and deep features, Bhat et al. developed an external regularizer that combines both: Unveiling the Power of Deep Tracking (UPDT) [38], an extension of the ECO [23] tracker. UPDT obtains a fused activation map as the weighted combination of the resulting shallow and deep scores of ECO. The aim of UPDT is to jointly estimate the weights for this combination and the underlying model that maximize a confidence measure that promotes both accuracy and robustness. The confidence measure is developed so that a large resulting confidence ensures no distractors and a distinctive response, i.e. high accuracy and robustness.

UPDT's fusion approach improved the underlying baseline tracker ECO and outperformed it on all tested benchmarks. However, it does this at the expense of tracking speed. Nevertheless, Bhat et al. showed that the development of modules that potentially increase the robustness of existing deep trackers is worth pursuing.

Figure 5: An overview of the THOR tracking paradigm. The Encoder is a Siamese network that returns embeddings of the input and template images. The templates stored in the LTM and STM are convolved with the input, resulting in two sets of activation maps. Subsequently, a modulation module normalizes the activation maps before corresponding bounding boxes are computed. The bounding boxes corresponding to the maps with maximum activation in each set are used as candidates for the ST-LT Switch, which determines the final prediction based on their overlap [39].

One such module is Tracking Holistic Object Representations (THOR) [39], a recently proposed regularizer for Siamese trackers. THOR introduces a Short-Term Memory (STM) and a Long-Term Memory (LTM) to the inference pipeline of an arbitrary Siamese architecture. These modules try to store additional diverse object representations during the tracking process, which are then used in the final bounding box prediction. To determine which representations are diverse enough to be stored, THOR uses a Gram matrix. This matrix, denoted as G, is the matrix of the inner products of all feature embeddings f1, . . . , fn stored in the pipeline:

$$G(f_1, \ldots, f_n) = \begin{pmatrix} f_1 \star f_1 & f_1 \star f_2 & \cdots & f_1 \star f_n \\ \vdots & \vdots & \ddots & \vdots \\ f_n \star f_1 & f_n \star f_2 & \cdots & f_n \star f_n \end{pmatrix} \qquad (5)$$

The value of the Gram determinant |G| describes the squared volume of the parallelotope defined by the feature vectors f1, . . . , fn. The volume provides an indication of the diversity of the stored representations: the higher it is, the more diverse the stored representations are. The LTM objective is thus to collect features that maximize the Gram determinant:

$$\max_{f_1, f_2, \ldots, f_n} \left| G(f_1, f_2, \ldots, f_n) \right| \qquad (6)$$

A representation is thus only included in the LTM when it increases |G|. However, as the nuclear norm fluctuates too much to use as a measure for the STM, another diversity measure γ is proposed:

$$\gamma = 1 - \frac{2}{N(N+1)\, G_{st,max}} \sum_{i<j}^{N} G_{st,ij} \qquad (7)$$

This measure sums the upper triangle of G and normalizes the sum by the maximum value of G, resulting in a more robust indication of template diversity. Figure 5 illustrates the complete THOR pipeline. Plugging THOR on top of state-of-the-art Siamese trackers showed an increase in performance on both the VOT2018 and OTB2015 benchmarks [2, 40], but a slight decrease in tracking speed due to the additional overhead.
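As a small illustration, the diversity measure of Equation (7) can be computed directly from a Gram matrix of stored templates; the sketch below assumes the matrix has already been built and only shows the normalization.

```python
import torch

def st_diversity(gram):
    """THOR's short-term diversity measure gamma (Eq. 7).

    gram : (N, N) Gram matrix of the short-term memory templates.
    Sums the strict upper triangle (i < j) and normalizes by the largest
    entry; higher values indicate more diverse stored templates.
    """
    n = gram.shape[0]
    upper_sum = torch.triu(gram, diagonal=1).sum()
    return 1.0 - 2.0 / (n * (n + 1) * gram.max()) * upper_sum
```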

2.3 Benchmarking

A multitude of tracking benchmarks exists, with the most widely used and accessible being the Visual Object Tracking challenge (VOT3) [41, 21, 42, 30, 31, 40, 9], the online Object Tracking Benchmark (OTB) [10, 2] and, more recently, the generic object tracking benchmark (GOT-10k) [43]. The benchmarks generally vary in their adopted video sequences and evaluation metrics but usually exhibit some overlap between both. All mentioned benchmarks contain many video sequences, manually annotated with ground-truth bounding boxes, intended for single object tracking. A tracker is then applied to a benchmark to obtain its predictions, after which several evaluation metrics can be computed using the ground-truth boxes and the predicted boxes.

There are three important evaluation criteria relevant for a tracker: accuracy, robustness, and speed. Accuracy measures how accurately a tracker manages to localize its target, robustness measures how often the target is successfully localized [38], and speed is simply how fast the tracker is able to predict, measured in Frames Per Second (FPS). Multiple evaluation metrics exist to give an indication of either tracking accuracy, robustness or both. Below, we discuss the most prominent of these metrics.

Center Error: The center error is the oldest measurement of performance for trackers. It measures the distance between the predicted center and the ground-truth center and is therefore purely a measure of precision. Its popularity originates from the minimal required annotation effort: a single point per frame. The individual error is generally expressed using the Euclidean distance

$$d(c^P, c^G) = \sqrt{\sum_{i=1}^{2} \left( c^P_i - c^G_i \right)^2}, \qquad (8)$$

where $c^P$ and $c^G$ denote the predicted and ground-truth centers respectively. The full results are usually shown as a precision curve [2] or average error [43]. A large drawback of this measure is that it does not incorporate target size, resulting in arbitrarily large errors. This renders the center error unfit for predicting tracking failure and tracking drift.

Region Overlap: Region overlap measurements, such as Average Overlap (AO), incorporate both target size and position, giving a more robust indication of tracking accuracy. AO requires region annotations and measures the overlap between predicted regions and annotated regions. Regions are typically expressed as rectangles containing the target, i.e. bounding boxes. The individual overlap is computed as the Intersection Over Union (IOU):

$$\mathrm{IOU}(i) = \frac{\mathrm{Area}(B^G_i \cap B^P_i)}{\mathrm{Area}(B^G_i \cup B^P_i)} = \frac{\mathrm{Area}(TP)}{\mathrm{Area}(TP + FN + FP)}, \qquad (9)$$

where $B^P$ and $B^G$ denote the predicted and annotated bounding boxes respectively. The AO is defined as the average value of all individual overlaps in the sequence:

$$AO = \frac{1}{N} \sum_i \mathrm{IOU}(i) \qquad (10)$$

Reporting the AO over all sequences in a full benchmark set is the most popular indicator of overall tracking performance [2, 43, 9].
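For concreteness, a minimal sketch of the IOU and AO computations in Equations (9) and (10). The (x, y, w, h) box convention is an assumption; benchmarks store their annotations in differing formats.

```python
def iou(box_pred, box_gt):
    """Intersection over union of two axis-aligned boxes (Eq. 9).

    Boxes are (x, y, w, h) with (x, y) the top-left corner.
    """
    xa = max(box_pred[0], box_gt[0])
    ya = max(box_pred[1], box_gt[1])
    xb = min(box_pred[0] + box_pred[2], box_gt[0] + box_gt[2])
    yb = min(box_pred[1] + box_pred[3], box_gt[1] + box_gt[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    union = box_pred[2] * box_pred[3] + box_gt[2] * box_gt[3] - inter
    return inter / union if union > 0 else 0.0

def average_overlap(pred_boxes, gt_boxes):
    """Average overlap over a sequence (Eq. 10)."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(overlaps) / len(overlaps)
```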

Success Rate: Success rate is a region overlap measurement originating from the object detection community [44], where an object is correctly detected when region overlap is above a certain threshold. With the surge of the tracking-by-detection paradigm, the success rate has also become a popular measurement for object tracking. It is typically used to report tracking robustness. The success rate is defined as the average value of all individual overlaps higher than a predefined threshold τ :

$$SR_\tau = \frac{1}{N} \sum_i f(i), \quad \text{where } f(i) = \begin{cases} 0, & \text{if } \mathrm{IOU}(i) < \tau \\ \mathrm{IOU}(i), & \text{otherwise} \end{cases} \qquad (11)$$

SRτ reflects the percentage of correctly tracked frames. The threshold τ is usually set to τ = 0.5, but any value that satisfies 0 ≤ τ ≤ 1 can be used, e.g. GOT-10k [43] reports success rate using both τ = 0.5 and τ = 0.75 on its online leaderboard4.

Success Curves: A common way to compare overall tracking performance is through success curves. A success curve denotes success rate over a range of different thresholds τ, i.e. the threshold τ is varied on the x-axis and the corresponding SRτ is plotted on the y-axis. Since it is so common to use performance curves as indications of tracker performance, curves of different trackers are often compared in a single plot, which commonly results in an unreadable graph. To this end, the Area Under Curve (AUC) per tracker is reported as a complementary measure. The AUC can give an indication of accuracy and robustness, dependent on which curve it is applied to. The AUC is computed by approximating the definite integral using the trapezoidal rule:

$$\int_a^b f(x)\, dx \approx (b - a) \cdot \frac{f(a) + f(b)}{2},$$

where a and b represent the range on the x-axis over which to compute the area and f denotes the approximated function. In practice, the area of the trapezoid is usually computed over small sub-intervals a → b for a more accurate approximation.

Failure Rate: A measure originally intended to reflect performance in correlation with tracking length. The classic failure rate computation casts the tracking problem as a supervised system in which an operator reinitializes the tracker when the AO drops below a certain threshold τ, commonly chosen as τ = 0.1 [45]. The number of interventions is summed and used as a comparative score. We deviate from the classic computation and define failure rate as the percentage of incorrectly tracked sequences. More specifically, we count the number of sequences with AO < τ, where τ = 0.5, consistent with the SR, and divide this by the total number of sequences in the benchmark, yielding a more global indication of tracking failure on the benchmark.
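The sketch below implements the success rate exactly as written in Equation (11), a trapezoidal AUC over a success curve, and the failure-rate definition used in this work. The threshold grid and input formats are illustrative assumptions.

```python
import numpy as np

def success_rate(overlaps, tau=0.5):
    """Mean of the per-frame overlaps, zeroing those below tau (Eq. 11)."""
    overlaps = np.asarray(overlaps, dtype=float)
    return np.where(overlaps >= tau, overlaps, 0.0).mean()

def success_auc(overlaps, thresholds=np.linspace(0.0, 1.0, 101)):
    """Area under the success curve, approximated with the trapezoidal rule."""
    curve = [success_rate(overlaps, t) for t in thresholds]
    return np.trapz(curve, thresholds)

def failure_rate(per_sequence_ao, tau=0.5):
    """Fraction of sequences whose average overlap falls below tau."""
    per_sequence_ao = np.asarray(per_sequence_ao, dtype=float)
    return float((per_sequence_ao < tau).mean())
```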


3 Method

In this work, we propose a novel supervised deep learning method for Siamese trackers: the Deeply Supervised Siamese network (DSSiam). The goal of this method is to prevent overfitting on the training data. To prevent overfitting, we aim to learn a domain-invariant feature extractor through deep supervision. For this purpose, we introduce the concept of chaining, a way of stacking Siamese models during offline training. This allows for a loss computation based on a consecutive sequence of exemplar frames, as opposed to only a single exemplar frame, mimicking inference conditions better than previous Siamese methods. Consequently, the introduction of multiple inputs creates the opportunity to impose additional regularization on the extracted feature embeddings of the consecutive sequence. We make the natural assumption that embeddings should not drastically vary between consecutive frames. The subsections below describe the method in further detail.

3.1 Deeply-Supervised Siamese Network

3.1.1 Training

In essence, chaining stacks Siamese networks in a front-to-back fashion during the training phase, i.e. the output of the first network is fed into the input of the next network and so on. The resulting framework consists of multiple individual Siamese networks, which we will now refer to as links in the chain. We have chosen to adopt instances of SiamFC as our links, by virtue of its relatively simple inference and backbone structure compared to other state-of-the-art Siamese trackers.

As mentioned in Section 2.1.3, the SiamFC tracking scheme consists only of a backbone network φ and a correlation layer ⋆, resulting in

$$s_{i,j} = f(x, z)_{i,j} = \varphi(z) \star \varphi(x) + b\mathbb{1}. \qquad (12)$$

Equation 12 returns a single score map defined on a finite grid $s \subset \mathbb{Z}^2$, where (i, j) ∈ s denotes a spatial coordinate reflecting the similarity between exemplar image z and instance image x at that location. During inference, SiamFC employs a uniform motion model coupled with multi-scale testing to find the location

$$p = \arg\max_{(i,j)} (s) \qquad (13)$$

of peak activation within differently scaled variations of s. This 2-dimensional coordinate p is used to center the predicted bounding box on the target.

Our method takes a single exemplar patch z and a sequence of consecutive instance patches $\mathcal{X} = \{x_i\}_{i=1}^{n}$ as input, where $n \in \mathbb{N}_{>0}$ denotes the chaining factor that specifies the number of links, i.e. the number of nested Siamese networks within the DSSiam structure, and xi ∈ X represents a single instance patch of dimension w × h × 3 cropped from the full input image and centered on the target. The number of instance images in the input is thus directly proportional to the chaining factor. As a result, our method returns multiple score maps S = {f(xi, z) : xi ∈ X} and feature embeddings F = {φ(xi) : xi ∈ X}.

Figure 6: A schematic example of the DSSiam framework applied to SiamFC during training using a chaining factor n = 3. DSSiam-n3 takes one exemplar image z and a set of 3 instance images X as inputs. It outputs 3 activation maps S and 3 feature embeddings F. Here, φ'xi = φ(α(xi, pi−1)) and φxi = φ(xi), i.e. the feature extractor of SiamFC applied to xi, which is either re-centered using the previous position pi−1 or not. Each ⋆ represents a correlation layer. Every correlation layer is followed by a batchNorm layer and a softArgmax layer. The feature embeddings of z and x1 ∈ X are computed in the first link. The embeddings are then fed into the correlation layer, which returns an activation map s1 ∈ S. Subsequently, softArgmax(s1) returns the position that is used to center the next frame x2 ∈ X. This process is incrementally repeated over X until all s ∈ S and φ(x) ∈ F are obtained.

In order to incorporate previous instances in the current prediction, we center every next xi (i > 1) based on the resulting prediction for the previous instance. For this purpose, we define the base affine matrix A that, when applied to the coordinate grid of xi, centers every instance xi (i > 1) based on the previous position p:

$$A = \begin{pmatrix} 1 & 0 & -\frac{2 t_0}{w} \\ 0 & 1 & -\frac{2 t_1}{h} \end{pmatrix}, \quad \text{where } t = -(p^* - p), \qquad (14)$$

where p∗is the center of xi, which corresponds to the center of the target. The first instance in the sequence, x1, is not required to be re-centered based on a previous prediction since it is already centered.

The arg max function in Equation 13 is non-differentiable. In order to make the chain trainable end-to-end, we instead adopt the differentiable soft-arg max [46, 47, 48] function:

$$p = \text{soft-arg\,max}(h) = \sum_{i,j} \mathrm{softmax}(h)_{i,j} \cdot (i, j) = \sum_{i,j} \frac{e^{h_{i,j}}}{\sum_{i',j'} e^{h_{i',j'}}} \cdot (i, j), \qquad (15)$$

where (i, j) iterate over the activation map pixels. Soft-arg max estimates the location of the maximum activation p as a weighted average of all pixel coordinates (i, j), where the weights are given by softmax(h) [48]. Now, using Equations 12, 14 and 15, it is possible to define the full activation output of DSSiam. Let α be the function that applies A to the coordinate grid of xi and returns a re-centered instance xi based on pi−1; then:

$$S_\theta = \{\, f(\alpha(x_i, p_{i-1}), z; \theta) : x_i \in \mathcal{X} \,\}, \quad \text{where } p_{i-1} = \begin{cases} p^*, & \text{if } i = 1 \\ p_{i-1}, & \text{otherwise} \end{cases} \qquad (16)$$

and θ are the weights of the backbone net φ. All resulting score maps Sθ are normalized with respect to their link and batch. A schematic example of the framework is shown in Figure 6.
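A minimal PyTorch sketch of the two ingredients above: the soft-arg max of Equation (15) and an affine re-centering in the spirit of Equation (14), written here with affine_grid and grid_sample. Tensor shapes, the (row, column) coordinate convention, and the sign handling of the shift are assumptions for illustration, not the exact layers of our implementation.

```python
import torch
import torch.nn.functional as F

def soft_argmax(score_map):
    """Differentiable peak location of an activation map (Eq. 15).

    score_map : (B, H, W). Returns (B, 2) expected (row, col) coordinates
    under the softmax distribution over all pixels.
    """
    b, h, w = score_map.shape
    probs = F.softmax(score_map.reshape(b, -1), dim=-1).reshape(b, h, w)
    rows = torch.arange(h, dtype=probs.dtype, device=probs.device)
    cols = torch.arange(w, dtype=probs.dtype, device=probs.device)
    i = (probs.sum(dim=2) * rows).sum(dim=1)   # expected row index
    j = (probs.sum(dim=1) * cols).sum(dim=1)   # expected column index
    return torch.stack([i, j], dim=1)

def recenter(instance, peak, center):
    """Shift an instance patch so the previous prediction sits at the crop center.

    instance : (B, 3, H, W) image patch, peak / center : (B, 2) pixel coordinates.
    The shift is expressed in the normalized coordinates used by affine_grid.
    """
    b, _, h, w = instance.shape
    half = torch.tensor([h / 2.0, w / 2.0], device=instance.device)
    t = (peak.float() - center.float()) / half        # normalized (row, col) shift
    theta = torch.zeros(b, 2, 3, device=instance.device)
    theta[:, 0, 0] = 1.0
    theta[:, 1, 1] = 1.0
    theta[:, 0, 2] = t[:, 1]                          # horizontal component
    theta[:, 1, 2] = t[:, 0]                          # vertical component
    grid = F.affine_grid(theta, instance.shape, align_corners=False)
    return F.grid_sample(instance, grid, align_corners=False)
```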

The classification loss of the correlation filter is defined as the mean weighted binary cross-entropy loss per score map s ∈ Sθ, per pixel value (i, j) ∈ s [1]:

$$L(t, S_\theta) = -\frac{1}{n} \sum_{s \in S_\theta} \sum_{(i,j) \in s} \big[ w_1 t_{ij} \log(\sigma(s_{ij})) + w_2 (1 - t_{ij}) \log(1 - \sigma(s_{ij})) \big], \qquad (17)$$

where tij ∈ {0, 1} is the ground-truth label at index (i, j) of the ground-truth score map t of the same dimension as s, σ is the sigmoid activation function and (w1, w2) denote the weights used to adjust for class imbalance. The elements tij of the ground-truth score map are considered positive if they lie within radius R/k of the center of t, where k is the total stride of the backbone φ, and negative otherwise. The parameters θ of the backbone φ are the values that minimize Equation 17, found through Stochastic Gradient Descent (SGD). The resulting θ can then be used as weights for the feature extractor during inference.
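A sketch of the chained, class-balanced BCE loss in Equation (17) using PyTorch's binary_cross_entropy_with_logits. The tensor layout and the way the positive weight is derived are illustrative choices, not the exact code of our implementation.

```python
import torch
import torch.nn.functional as F

def chained_bce_loss(score_maps, labels, pos_weight=None):
    """Mean class-balanced BCE over the n chained score maps (Eq. 17).

    score_maps : (n, B, 1, H, W) raw correlation outputs, one per link
    labels     : (B, 1, H, W) ground-truth map, 1 within radius R/k of the
                 center and 0 elsewhere
    pos_weight : scalar tensor balancing positive against negative pixels
    """
    losses = [
        F.binary_cross_entropy_with_logits(s, labels, pos_weight=pos_weight)
        for s in score_maps
    ]
    return torch.stack(losses).mean()

# Hypothetical usage: weigh positives by the negative/positive pixel ratio.
labels = torch.zeros(8, 1, 17, 17)
labels[:, :, 7:10, 7:10] = 1.0
pos_weight = (labels == 0).sum() / (labels == 1).sum()
score_maps = torch.randn(3, 8, 1, 17, 17)   # chaining factor n = 3
print(chained_bce_loss(score_maps, labels, pos_weight))
```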

3.1.2 Inference

During inference, our model adopts the inference strategy of the Siamese tracker it is applied to, in this case SiamFC. This harbors a significant advantage over other regularization approaches [38, 39] since our method introduces no additional overhead during inference and thus performs at the same speed as the original tracker. However, in practice, there is a small necessary change, as the obtained parameters θ are not of the same dimension as those fit for the underlying tracker, since they are learned in a chained fashion. In other words, the newly learned θ cannot be loaded in the original backbone. Instead, we replace the original backbone with an instance of DSSiam with a chaining factor of n = 1, which results in essentially the same function as Equation 12 with an affine centering layer:

$$f(x, z) = \varphi(z) \star \varphi(\alpha(x, p^*)) = \varphi(z) \star \varphi(x) \qquad (18)$$

Note that we choose to remove the additional bias to not propagate the error through the sequential network [49].

The first frame of the sequence f0 is presumed to be accompanied by a bounding box containing the target. DSSiam crops an exemplar patch z (127 × 127 × 3) and several differently scaled instance patches x (255 × 255 × 3) from the next frame f1 based on the initial bounding box. The different scales are determined by a predefined set of scaling factors.

Figure 7: An example of typical DSSiam inference. Exemplar z is cropped from f0 and instance patches x are cropped from frame f1 based on the ground truth bounding box (cyan) in f0. Instances x are subjected to multi-scale testing, yielding multiple responses. The response with maximum activation is used to center and scale the new bounding box (green) in f1 using the point of maximum activation and the corresponding scaling factor. New instances are cropped from the next frame f2 using the dimensions of the new bounding box. This cycle is repeated over all frames in the sequence.

Comparable to SiamFC, DSSiam applies multi-scale testing to the instances to estimate the position and scale of the target in the next frame. The exemplar z is

correlated with all scaled versions of x to obtain different score maps. The score maps are penalized by multiplication with a scaling penalty. The score map with the highest activation is chosen to reflect the target's condition in f1 best. Subsequently, the map is Hann smoothed [50] and the corresponding point of maximum activation and the original scale factor are used to estimate the center and dimensions of the resulting bounding box. An overview of the process is illustrated in Figure 7. The process is incrementally repeated over the full sequence by correlating the initial exemplar embedding of z with instances from subsequent frames. The search regions for subsequent instances are based on the predicted bounding box in the previous frame.
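The following sketch outlines a single step of the multi-scale inference loop described above. The helpers crop_fn, embed_fn, and correlate_fn are hypothetical placeholders, and the response up-sampling and Hann smoothing are omitted for brevity; it is a sketch of the control flow, not the full tracker.

```python
import numpy as np

def track_step(exemplar_emb, prev_box, crop_fn, embed_fn, correlate_fn,
               scale_factors=(1.0375 ** -1, 1.0, 1.0375), scale_penalty=0.9745):
    """One frame of SiamFC-style multi-scale inference.

    crop_fn(box, scale)   -> instance patch around the previous box at that scale
    embed_fn(patch)       -> feature embedding of a patch
    correlate_fn(z, x)    -> 2-D score map (NumPy array)
    Returns the index of the winning scale and the peak location in its score map.
    """
    best = None
    for k, s in enumerate(scale_factors):
        response = correlate_fn(exemplar_emb, embed_fn(crop_fn(prev_box, s)))
        if s != 1.0:
            response = response * scale_penalty        # penalize scale changes
        peak = np.unravel_index(np.argmax(response), response.shape)
        if best is None or response[peak] > best[0]:
            best = (response[peak], k, peak)
    _, scale_idx, peak = best
    return scale_idx, peak
```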

3.1.3 Implementation

Our framework is written in the Python programming language using the PyTorch [51] library for deep learning. We built upon the existing SiamFC PyTorch implementation by L. Huang5. We adopt SiamFC as the backbone for this implementation because of its relative simplicity as a tracking framework. However, the DSSiam framework is created with the intent of adaptability and extensibility. It is supposed to be a general Siamese training paradigm. In theory, it should therefore be possible to apply DSSiam to many of the current state-of-the-art Siamese trackers, such as ATOM [15], SiamRPN [13], DASiamRPN [14], and SiamRCNN [52].

3.2 Regularized Siamese Embeddings

Chaining also introduces the opportunity to impose supplementary regularization on the set of extracted feature embeddings F during training. The extra regularization aims to force the feature extractor to generate similar features for the same object in slightly different conditions, resulting in a more general appearance model. When applying such a divergence minimization strategy, we assume that feature embeddings should not change much between consecutive frames, since the object's condition should not vary drastically either. In practice, regularization is enforced by adding a term to the loss function in Equation 17. While a myriad of regularization strategies is possible, we propose Gram regularization, similar to the THOR formulation described in Section 2.2.2.

3.2.1 Gram Regularization

DSSiam returns the raw feature embeddings of the sequence; therefore, it is possible to adopt regularization strategies that minimize feature divergence using a similarity measure. In practice, this is best accomplished by constructing a Gram matrix using the aforementioned similarity measure and then applying to the Gram matrix a function that retains its diversity information. This ensures that all embeddings influence the resulting regularization term equally. THOR adopts the inner product as such a similarity measure, but other measures can be used or even learned. We refer to this regularization approach as Gram regularization.

The first step of Gram regularization is to generate a Gram matrix G from the extracted feature embeddings F. Like THOR, we define the similarity between two embeddings as the inner product φ(xi) ⋆ φ(xj), resulting in:

$$G(F) = \begin{pmatrix} f_1 \star f_1 & \cdots & f_1 \star f_n \\ \vdots & \ddots & \vdots \\ f_n \star f_1 & \cdots & f_n \star f_n \end{pmatrix}, \qquad (19)$$

where f1, . . . , fn are the feature embeddings φ(x1), . . . , φ(xn) ∈ F. The resulting G is a matrix of dimension n × n, where n is the chaining factor. The Gram determinant can then be estimated as

$$|G| = \prod_{i=1}^{\mathrm{rank}(G)} \lambda_i, \qquad (20)$$

where λi represent the eigenvalues of G. The Gram determinant equals the squared volume of the parallelotope formed by f1, . . . , fn. Recall that the goal of THOR was to maximize |G| to obtain a diverse array of object templates. In contrast, the goal of our regularization is to generate general embeddings that remain constant over slightly different object appearances, hence we aim to minimize |G| over a sequence of embeddings:

$$\min_F |G(F)| \qquad (21)$$

However, |G| is expensive to compute and non-differentiable, which renders it unfit as additional regularization for the loss function. Instead, we use the nuclear norm of the Gram matrix:

$$\|G\|_* = \sum_{i=1}^{\mathrm{rank}(G)} \sigma_i, \qquad (22)$$

where σi denote the singular values on the diagonal of Σ in the singular value decomposition UΣVᵀ = G. The nuclear norm ‖G‖∗ is proportional to |G|, since λi = σi, and is frequently used in optimization problems to relax low-rank constraints [53]. Since ‖G‖∗ is differentiable, it can be directly incorporated in the loss function as a regularizer. As a result, the training objective of DSSiam becomes finding parameters θ that minimize the loss function

$$L(t, S_\theta) = L_0(t, S_\theta) + \lambda \|G\|_*, \qquad (23)$$

where L0 is the classification loss of Equation 17 and λ ≥ 0 is the regularization parameter controlling the influence of the regularization term.
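A compact sketch of the Gram regularizer of Equations (19), (22) and (23) in PyTorch, assuming the chained embeddings are stacked along a leading dimension. Flattening each embedding into a vector before taking inner products is an implementation choice of this sketch.

```python
import torch

def gram_nuclear_norm(embeddings):
    """Nuclear norm of the Gram matrix built from a sequence of embeddings.

    embeddings : (n, B, C, H, W) feature maps, one per link in the chain.
    Each embedding is flattened per sample, pairwise inner products form G,
    and the sum of G's singular values is returned, averaged over the batch.
    """
    n, b = embeddings.shape[:2]
    f = embeddings.reshape(n, b, -1).permute(1, 0, 2)     # (B, n, D)
    gram = torch.bmm(f, f.transpose(1, 2))                # (B, n, n)
    return torch.linalg.matrix_norm(gram, ord='nuc').mean()

# Hypothetical use inside the training step (Eq. 23):
# loss = classification_loss + lam * gram_nuclear_norm(feature_embeddings)
```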


4 Experiments

In this section, we perform an evaluation of the proposed framework. First, we reiterate the goal of our experiments and discuss the experiment outline. Hereafter, we provide technical details of the experimental setup for training and testing concerning the baseline comparison. Finally, we list and discuss the results of our framework and the baseline.

4.1 Goal

We build an experiment with the goal of validating if deep supervision reduces overfitting on the source or training domain. For this purpose, we conduct a baseline comparison. We do so by training our models on one specific dataset and testing on two benchmarks: a subset of this dataset used for training and a tracking dataset corresponding to a different domain of visual appearances, which we refer to as an unrelated6 target domain.

If our framework yields a significant improvement over the baseline on the unrelated target domain, we can conclude that domain adaptation was successful. To determine whether overfitting occurs and to ensure that chaining alone is responsible for an increase in performance, we perform an ablative study (Section 4.2.3). If the ablative model exhibits worse performance than the baseline on the unrelated target domain, it will be indicative of severe overfit on the source and suggest that our formulation prevents overfitting. We also verify additional Gram regularization by contrasting its performance with that of the non-regularized models and the baseline.

4.2 Baseline Comparison

We evaluate six variants of the DSSiam framework: vanilla DSSiam with varying chaining factor n ∈ {2, 3, 4} and Gram-regularized DSSiam with varying n ∈ {2, 3, 4}. We denote the variants of DSSiam as DSSiam-n{n}, e.g. DSSiam with a chaining factor of 2 becomes DSSiam-n2. Additionally, we evaluate two variants of SiamFC: the classic formulation SiamFC-3s (3 scale-factors) [1] and SiamFC-3s trained on the same amount of data as DSSiam-n2 as an ablative study (Section 4.2.3). Both instances of SiamFC will comprise our baselines. All models are trained on GOT-10k-train and tested on two different object tracking benchmarks: GOT-10k-eval and OTB2015 (Table 1). We choose to test on two different benchmarks to emphasize the difference between the source and target domains. GOT-10k-eval is a subset of GOT-10k-train and is thus more aligned with the source domain than OTB2015. We report two measurements indicating tracking accuracy: the Average Overlap (AO) and Center Error (CE), and two measurements indicative of tracking robustness: the Success Rate (SR) and Failure Rate (FR). We choose not to report AUC, since it is found directly proportional to AO in practice. A detailed explanation of the computation of the performance measurements can be found in Section 2.3. Results are presented as success curves and graphs to visualize trends.

6Unrelated in quotation marks: the domains are very much related, but the target domain is not used during training.


Table 1: Relevant benchmark specs.

Benchmark Set Videos Frame Rate (FPS)

OTB2015 100 30

GOT-10k-eval 180 10

GOT-10k-train 93184 10


Figure 8: Success curves on GOT-10k-eval (a) and OTB2015 (b) for the baselines and our best model with and without additional Gram regularization.

Numerical results are provided in Table 2. We discuss our results per benchmark and then denote our overall findings.

4.2.1 Training Setup

The weights θ are found by minimizing the models' respective loss functions with SGD, using AlexNet [28] as the backbone. The weights are initially set according to Kaiming initialization [54]. All models are fitted on the GOT-10k [43] training set over 50 epochs using 93,184·n instances (255 × 255) and 93,184 exemplars (127 × 127) per epoch, where n denotes the chaining factor of the respective model. The exemplar and the n instances are randomly sampled from the same sequence, limited to 100 exemplars per unique sequence. The original SiamFC applies a series of random transforms to the images as a form of data augmentation; we choose to omit this for our models since it interferes with sequential prediction. The radius R for positive ground-truth labels is set to R = 16. The score maps are up-sampled from 17 × 17 to 272 × 272 using bi-cubic interpolation before approximating the peak location in DSSiam models using soft-arg max. Gradients are estimated per iteration using batches of 64·n instances and 64 exemplars. For the regularized models, we choose λ = 5 × 10−6, such that L(t, Sθ) ∼ λ‖G‖∗ at the very start of training. The initial learning rate is set to 0.01 with a decay of ≈ 0.869 per epoch, weight decay equals 5 × 10−4 and momentum is set to 0.9. All training is performed on a 16-core CPU node with a single NVIDIA GTX-TitanX GPU on the Distributed ASCI Supercomputer 4 (DAS-4)7 [55].
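For reference, the hyper-parameters above can be collected in a single configuration sketch; the key names are illustrative and do not correspond to the actual keys of our implementation.

```python
# Illustrative summary of the training setup described in Section 4.2.1.
train_config = {
    "backbone": "AlexNet",          # Kaiming-initialized
    "epochs": 50,
    "batch_size": 64,               # times n instances per iteration
    "exemplar_size": 127,
    "instance_size": 255,
    "positive_radius": 16,          # R, divided by the backbone stride k
    "response_upsample": 272,       # bi-cubic, from 17x17
    "initial_lr": 1e-2,
    "lr_decay_per_epoch": 0.869,
    "weight_decay": 5e-4,
    "momentum": 0.9,
    "gram_lambda": 5e-6,            # regularized models only
}
```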


Table 2: Performance of all models on both benchmarks expressed in Average Overlap (AO), Center Error (CE), Success Rate (SR), and Failure Rate (FR). The best result per metric is highlighted.

                     GOT-10k-eval                          OTB2015
                 Accuracy           Robustness         Accuracy           Robustness
Model            AO      CE         SR      FR         AO      CE         SR      FR
DSSiam-n2        0.5515  116.426    0.6155  0.4111     0.6173  30.4005    0.7727  0.22
DSSiam-n3        0.5482  116.7105   0.6177  0.4222     0.5972  32.6427    0.7346  0.26
DSSiam-n4        0.5186  129.9404   0.5766  0.4667     0.5691  36.9377    0.6991  0.28
DSSiam-n2 + GR   0.5504  117.1422   0.6263  0.3944     0.5849  34.3242    0.7218  0.28
DSSiam-n3 + GR   0.5418  123.2039   0.6137  0.4000     0.5835  36.4991    0.7122  0.30
DSSiam-n4 + GR   0.5252  121.8283   0.5857  0.4333     0.5765  37.3226    0.7085  0.25
SiamFC-3s        0.5242  127.7197   0.5985  0.4056     0.5807  31.452     0.7306  0.26
SiamFC-3s*       0.5365  112.9348   0.6072  0.4278     0.5674  36.2191    0.7051  0.28

4.2.2 Inference

During inference, the exemplar embedding is only computed in the first frame and correlated with the embeddings of the search regions of the subsequent frames in the video sequence. The initial exemplar and instance patches are of the same dimension as they were during training: 127 × 127 and 255 × 255 respectively. For multi-scale testing, three scale factors 1.0375^{−1, 0, 1} are used. Scales are penalized using a scaling penalty of 0.9745. The resulting score map is up-sampled in a similar fashion to training for more accurate localization. The corresponding scale is weighted before multiplication with the bounding box using a scale learning rate of 0.59. All testing is performed on a 16-core CPU node with a single NVIDIA GTX-TitanX GPU on DAS-4. The training and tracking setup in full detail can be found in our PyTorch implementation8.

4.2.3 Ablation Study

The DSSiam framework uses up to four times as much training data as the original SiamFC formulation. To ensure that deep supervision alone is responsible for an increase in performance, we conduct an additional ablation study using the same amount of data as our best performing model, DSSiam-n2. In practice, this is accomplished by training SiamFC-3s using two consecutive instances, yielding two responses. Similar to our models, we do not use additional data augmentation. Subsequently, the training loss can be computed as the average loss over the two responses, similar to Equation 17. Further training parameters are the same as described in Section 4.2.1. We denote the ablative model as SiamFC* in Figures 8, 9, and 10.

4.3 Results and Discussion

Figure 9: Performance of all models on the GOT-10k-eval benchmark measured in Average Overlap (a), Center Error (b), Success Rate (c), and Failure Rate (d). Note that the graphs do not start at 0.

GOT-10k. The success curves of our best model with and without regularization displayed in Figure 8a show a slight improvement in SR over the baseline on all thresholds. Furthermore, both models retain a slightly higher success rate at higher

overlap thresholds than the ablative model, indicating more robust tracking using the same amount of data.

A more precise analysis of the accuracy metrics shown in Figure 9 reveals that our models with n ≤ 3 slightly outperform the baseline and ablative model, except for the CE, shown in Figure 9b, where we observe that the ablative study outperforms all models. This can be explained by our object localization procedure, which uses the more unstable soft-arg max function during training. This instability might cause shifted learning, where a network learns features that produce an increased area of maximum activation within the score map, which results in less accurate position prediction during inference. All deeply supervised models show a relatively steep drop in accuracy for n > 3, implying that deep supervision does not necessarily prevent overfitting when training on larger sequences. Initially, the performance of the regularized models is slightly lower than that of the vanilla models. However, regularized performance on both metrics drops less steeply, even yielding better performance for n > 3, suggesting that regularized models are less prone to overfitting.

The analysis of tracking robustness in Figure 9c and 9d reveals the same patterns. However, the regularized models now steadily outperform the deeply supervised models, except for the SR of regularized DSSiam-n3 shown in Figure 9c. This unexpected SR score highlights two key limitations of our work: a sub-optimal choice of λ and the unknown standard deviation of all model performances. It might be the case that adjusting λ or re-running the experiments results in the expected behavior.

These initial findings show that deep supervision on sequences of length n ≤ 3 yields increases in both accuracy and robustness scores over the baseline and ablative models. This suggests that DSSiam utilizes data more efficiently than SiamFC. However, deep supervision is still prone to overfitting on sequences of length n > 3. The more consistent performance of regularized models as the sequence length increases shows that Gram regularization prevents overfitting. Furthermore, regularization yields an additional increase in tracking robustness over deep supervision, but at the cost of localization, and thus accuracy.

Because of the limitations we mention, it is unjustified to support the other improvements we observe, since they are of the same scale as the outlier. Therefore, we treat the slight improvements of our models on GOT-10k-eval only as weak evidence for our conclusions.

Figure 10: Performance of all models on the OTB2015 benchmark measured in Average Overlap (a), Center Error (b), Success Rate (c), and Failure Rate (d). Note that the graphs do not start at 0.

OTB2015. The success curves of our best model with and without regularization, the baseline, and the ablative model on OTB2015, shown in Figure 8b, exhibit more significant micro and macro differences compared to GOT-10k. DSSiam-n2 yields a significant performance increase over all other models shown. Regularized DSSiam-n2 retains a higher success rate at higher overlap thresholds than the baseline and ablative model, again suggesting a more robust overall tracking performance. Furthermore, the contour of all curves shows that all models retain a higher success rate at higher overlap thresholds on OTB2015 than on GOT-10k-eval. This indicates that all models achieve higher robustness on OTB2015 compared to GOT-10k-eval.

The analysis of tracking accuracy of the ablative model in Figure 10a and 10b indicates a severe overfit on the source domain, even resulting in worse scores than regular SiamFC. Moreover, the tracking accuracy of models with n > 2 drops more quickly than on GOT-10k-eval. This suggests that overfitting on the source domain even occurs when training on sequences of length n > 2, which is sooner than previously thought. Despite the steep decrease, DSSiam-n2 yields a significant improvement in AO of ∼9% over the ablative model and a ∼6% increase over the vanilla baseline, suggesting that deep supervision prevents overfitting and provides better generalization for the wide range of object appearances when using similar amounts of data from the source domain. This is further supported by the CE scores. The accuracy of regularized models is generally less affected by varying n, indicative of being less prone to overfitting.

The robustness scores of all models, displayed in Figure 10c and 10d, exhibit a very similar pattern to the AO scores. The performance of the ablative model indicates a severe overfit, while DSSiam-n2 yields a significant increase over both instances of SiamFC. Regularized models, however, seem to exhibit more random behavior as n varies.

Overall, the regularized models seem to perform significantly worse on both accuracy and robustness metrics than the vanilla models and the baseline on this benchmark. Furthermore, regularized performance does not necessarily correlate with varying n. We speculate that the reason for this is twofold: there is a major FPS difference between the training and testing data, as shown in Table 1, and a major difference in video resolution. Both differences severely lower the frame-by-frame diversity of a sequence. The regularized models expect a higher diversity, since the network has learned to generate general embeddings that comply with a 10 FPS sequence. This inconsistency in diversity results in embeddings that are forced to be too general to identify the subtle differences in object appearance of a 30 FPS sequence, leading to worse performance. Lower image resolution has a similar effect on frame-by-frame diversity and results in the same behavior. This difference in FPS and resolution also explains the generally superior performance of all tested models compared to GOT-10k-eval: tracking is easier when frame-by-frame diversity is low.

Overall. From the benchmark-specific analyses, the following key findings emerge. Both benchmarks show that deep supervision without regularization yields improvement over the baseline on almost all metrics. Combined with the significant performance increase on the unrelated target domain, OTB2015, this indicates that deep supervision does indeed prevent overfitting on the source. Furthermore, the more consistent results for n > 3 and the superior performance on GOT-10k-eval of regularized models indicate that Gram regularization increases tracking robustness at the cost of accuracy, but suffers from low diversity, or Gram volume, caused by smaller frame-by-frame differences.


5 Future Work

The concept of chaining describes sequential training on consecutive frames. However, sequential training is substantially affected by the FPS of the video stream, camera motion, and object motion, all of which influence the frame-by-frame diversity of a consecutive sequence. In such cases, non-consecutive sequential training, where every n-th frame is sampled instead, might be beneficial. This may constitute the object of future studies.
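A minimal sketch of such a non-consecutive sampler is given below; the function name and parameters are hypothetical and only illustrate the idea of sampling every n-th frame rather than consecutive frames.

```python
import random

def sample_subsequence(num_frames: int, length: int, stride: int = 1) -> list:
    """Pick `length` frame indices with a fixed `stride` between them.
    stride=1 corresponds to the consecutive chaining used in this work;
    stride>1 skips frames, which should raise frame-by-frame diversity."""
    span = stride * (length - 1)
    if span >= num_frames:
        raise ValueError("sequence too short for the requested subsequence")
    start = random.randint(0, num_frames - span - 1)
    return [start + i * stride for i in range(length)]
```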

During training, we choose the initial regularization parameter λ for all models such that the initial BCE-loss equals the initial nuclear norm of the Gram matrix. Our trials on varying λ are very limited due to the computationally expensive offline stage of our model. Future research could further examine the effects of varying λ or investigate the impact of adjusting λ for individual models.
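One reading of this choice is λ = L_BCE / ||G||_*, which balances the two loss terms at initialization. The sketch below follows that reading, assuming the Gram matrix is built from flattened frame embeddings available as PyTorch tensors; the exact shapes and names are our assumptions.

```python
import torch

def init_lambda(bce_loss: torch.Tensor, embeddings: torch.Tensor) -> float:
    """Pick lambda so that lambda * ||G||_* matches the initial BCE loss,
    balancing both loss terms at the start of training.
    `embeddings` is an (n, d) matrix of flattened frame embeddings (assumed)."""
    gram = embeddings @ embeddings.t()                    # (n, n) Gram matrix
    nuclear = torch.linalg.matrix_norm(gram, ord='nuc')   # nuclear norm of G
    return (bce_loss / nuclear).item()
```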

In addition to varying λ, different forms of sequential feature regularization should be studied further. Additional research could explore the effects of Gram regularization using the THOR STM diversity measurement shown in Equation 7, or conduct experiments using a different measure of feature diversity, such as cosine similarity.
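As one possible instantiation, a cosine-similarity-based diversity measure over a set of frame embeddings could look like the sketch below; this is our own illustrative formulation, not the THOR STM measure or an implementation from this work.

```python
import torch
import torch.nn.functional as F

def cosine_diversity(embeddings: torch.Tensor) -> torch.Tensor:
    """One minus the mean pairwise cosine similarity of (n, d) embeddings,
    n >= 2. Higher values indicate more diverse features; its negative could
    replace the Gram nuclear-norm term as a regularizer."""
    z = F.normalize(embeddings, dim=1)   # unit-length embeddings
    sim = z @ z.t()                      # pairwise cosine similarities
    n = sim.shape[0]
    mean_off_diag = (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))
    return 1.0 - mean_off_diag
```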

Studying the effects of different regularizers would be a fruitful addition. However, it would be more beneficial to first study the effects of sequential feature regularization without chaining models, i.e., without propagating the location prediction. This could be done by extracting features from a sequence, similar to the DSSiam paradigm, but using the extra features only for supplementary regularization.

As mentioned in Section 3.1.3, we designed our framework with the intent of adopting more complex models. Even though our model successfully improves SiamFC performance, the relatively simple nature of the online stage of SiamFC could limit the impact of both sequential training and domain-invariant features. We suggest that future research be devoted to the integration of more complex models within DSSiam.


6 Conclusion

In this work, we propose a novel training formulation for Siamese trackers that aims to use training data more efficiently and to reduce overfitting by means of deep supervision and additional regularization. The developed model is trained fully offline on small sub-sequences of arbitrary length. The online stage is identical to that of the underlying Siamese tracker, yielding the same tracking speed. To verify our models, we train the baseline and the deeply supervised models, with and without regularization, on the GOT-10k benchmark as the source domain. We perform experiments on two tracking benchmarks to highlight cross-domain performance: a smaller subset of GOT-10k and OTB2015, a benchmark comprising a domain of visual appearances different from the source. We analyze performance in terms of both accuracy and robustness. We find that our deeply supervised approach on sequences of length two generalizes better when trained on the source and yields superior performance over the baseline in tracking accuracy and robustness on both benchmarks. Additional Gram regularization shows promising results with respect to tracking robustness when frame-by-frame diversity is high, but at the cost of accuracy. Our results lead us to believe that both techniques can prove very useful for future tracking efforts.


