
Multifaceted Memory Augmentation in Discriminative Object Tracking

Alvise Sembenico

MSc Artificial Intelligence Master Thesis

Multifaceted Memory Augmentation in Discriminative Object Tracking

by Alvise Sembenico (12380288)
July 9, 2020

Supervisor: Dr. Deepak Gupta
Assessor: Dr. Efstratios Gavves

Alvise Sembenico

Multifaceted Memory Augmentation in Discriminative Object Tracking

July 9, 2020

Examiner: Dr. Efstratios Gavves Supervisor: Dr. Deepak Gupta

University of Amsterdam QUVA Lab

Informatics Institute Science Park 904


Abstract

Modern object trackers perform online learning in order to remain malleable and adapt to visual changes of the object. To prevent catastrophic forgetting, meta-annotated frames are stored in a memory component. Current Siamese-based discriminative trackers take into account only the training images stored in this memory, relying purely on discriminative power. We argue that this approach leads to poor performance in situations where the target object cannot be distinguished from others. Our proposed method takes the overall trajectory into account in a recurrent fashion. In addition, we propose several memory control policies that lead to higher-quality stored images and prevent model drift. Moreover, we provide a broad study of memory manipulation and its effect on the final performance, yielding crucial insights into the memory's working mechanism.


Acknowledgement

It has been a long, extremely challenging and interesting marathon that taught me a lot. First and foremost I want to thank my supervisor Deepak, who supported and guided me throughout this rollercoaster journey and pushed me to do more, especially during hard moments.

I also want to thank the person who helped me the most during these two years and whom I probably never thanked enough. Celia Pastor, thank you for all the trips, dinners and weekends in Science Park, as well as your motivational phrases.


Contents

1 Introduction
1.1 Motivation and Problem Statement
1.2 Thesis outline
1.3 Human memory
2 Background
2.1 Continual learning
2.2 Meta Learning
2.3 Early methods
2.3.1 Tracking with matching
2.3.2 Discriminative classification with kernels
2.3.3 Correlation filter
2.3.4 Evaluation metrics
2.4 Siamese trackers
2.4.1 SiamFC
2.4.2 ATOM
2.4.3 DiMP
3 Related works
3.1 Memory types
3.1.1 Sensory register
3.1.2 Short-term store
3.1.3 Long-term store
3.2 THOR
3.3 Model decay
4 Proposed Method
4.1 Overview
4.2 Multifaceted Memory Augmentation (MMA)
4.3 Schema
4.4 Model control
4.4.1 Delta control
4.4.2 Backward control
4.5 Training
4.6 Training bootstrap
5 Experiments and results
5.1 Memory size
5.2 Backward control
5.3 MMA
5.3.1 Backward control
5.3.2 Delta control
5.4 Overall analysis
6 Conclusion and future direction
Bibliography

1 Introduction

Computers in the future may weigh no more than 1.5 tons.
Popular Mechanics, 1949

1.1 Motivation and Problem Statement

Object tracking, like most areas of artificial intelligence, has shown an exponential performance increase, especially after the deep learning revolution. In standard machine learning terminology, object tracking can be framed in several ways, such as one-shot learning, meta-learning, or continual learning; no single term completely captures all the challenges within the task, trivial for humans, of following an object across a time-dependent series of frames. Even considering a single frame, one can find several hidden (or not so hidden) caveats that keep state-of-the-art performance noticeably far from the current human level, whereas this is certainly not the case for other computer vision tasks, some of which are even considered solved (if only for a subset of applications), such as object classification.

As a video unfolds, a considerable number of transformations can occur, and co-occur, to the object under examination. Moreover, due to the two-dimensionality of the image, only a specific view of the object is available at bootstrapping. Modern trackers have to handle variations, even substantial ones, correctly identify the object, and store the new information so as not to forget previous frames, building a more complete latent representation of the object. To do so, a memory component is required, since encoding all the information directly into the model's weights is computationally intensive and easily leads to catastrophic forgetting [continual-learning]. As described in the background section, a set of images called the training set is generally stored, and higher-order gradient descent is performed only on a convolutional layer [9]. In this way, the preservation of the relevant new information throughout the entire duration of the video is guaranteed.

Adopting the latter approach, one faces new issues, namely the limited size of the memory and the uncertainty of its contents. Even though modern computer memories are able to store hours of video, even in high resolution, only a small subset of all this information can feasibly be taken into account when updating the weights online while keeping at least 20 frames per second, the rate required for real-time object tracking, which is crucial for most applications nowadays.

In this work, we focus our attention on the memory component, adopting a general approach that makes it possible to port our contribution to virtually any kind of discriminative tracker. Our contributions try to fill the gap between modern state-of-the-art trackers and neuroscience, inasmuch as trackers' performance is orders of magnitude lower than the human counterpart. We therefore strongly believe that learning and adopting ideas and concepts from the neuroscience literature can only be beneficial. After all, the human brain has had millions of years to evolve and perfect itself, slightly more than what academia has had to come up with the best object tracker. In addition, our study analyzes the effect of manipulating the memory, opening up the black box and shedding light on the correlation between the memory component and the overall performance. This could help future studies focus on the most relevant aspects.

1.2 Thesis outline

The complexity of the object tracking task requires a substantial background section, which introduces the important concepts of continuous (a.k.a. continual) learning and meta-learning. These concepts are presented there and will be a thread running throughout the entire work. Next, the early methods are outlined and explained; by early methods I mean the ones developed before the disruption of deep learning, or more generally methods that do not heavily depend on convolutional neural networks. All the recent state-of-the-art trackers are not only based on neural networks, but are so-called Siamese networks. The naming follows from their intrinsic structure: the network is built to take two images as input, which are processed by the same network, called the backbone, generally with weight sharing. Next, the latent representations are further processed by so-called heads, which differ for the reference and test frame. The underlying principle of this choice, common to all Siamese trackers, is training the network to match pairs of images that are then fed to it at test time.

In the related work we first analyze the human memory types outlined in [2]; next we review the literature, specifically an approach called THOR [46], which incorporates some ideas from human memory.

Finally, in section 4 our contributions are presented, starting from the memory components, as well as additional control policies that provide more insight into the memory, checking whether a new sample that is about to be added is consistent with the rest of the training images. Since our method builds upon existing ones, the rationale is not to retrain the whole network from scratch but to fine-tune only the additional components; we describe possible training tricks and shortcuts to make the upgrade as seamless as possible.

1.3 Human memory

Human visual memory has been a widely studied and debated field since the mid-20th century. It describes the inner mechanisms such as encoding, storage, and retrieval and their relationships within the brain cortex. In relation to computer vision, we refer to visual memory, namely the memory dedicated to the visual stimuli streaming from the eyes.

Studying human memory is extremely complicated, as is the case for the majority of the brain's components, even though the brain structure is intrinsically hierarchical and tends to form homogeneous, clustered anatomical areas [36]. Memory, however, is not a self-standing component; it is encoded in different parts of the brain according to their scope. This complexity naturally leads to different theories such as [4] and [2], even though there is a certain degree of agreement about the subdivision of memory into three main components: long-term, short-term and sensory, also known as visual long-term memory, visual short-term memory and iconic memory [5]. Later reviews incorporate the sensory component into the short-term one [1].

In this work we adopt the tripartite memory subdivision expressed in [2]; the method we propose is inspired by the Atkinson-Shiffrin memory model [2].

A comprehensive account of the theoretical benefits of implementing such a memory model is provided in section 3.1. Moreover, we believe this theoretical framework is crucial in those situations where, due to overlap, the object leaving the view, or partial occlusion, purely discriminative trackers are not sufficient and inevitably lead to model drift.


2 Background

There is no reason anyone would want a computer in their home.
Ken Olsen, 1977

2.1 Continual learning

Artificial intelligence systems embedded in real life experience a continual stream of input data. Dynamic systems are required to process and extract information from the raw data they collect and exploit it in order to remain malleable, follow the changes of the surrounding environment, and learn from past experience. This ability to adapt and learn from new experiences is called continual learning, and it is crucial to modern systems such as autonomous driving, reinforcement learning agents and, obviously, object tracking.

A dynamic approach is crucial in situations where the volatility and range of changes of a system are markedly high; however, it comes with several issues that must be addressed in order to prevent unexpected behaviors. Training the neural network in a standard manner, simply using the new data, generally leads to a phenomenon called catastrophic forgetting: the new information irreversibly overwrites the previously learned information, causing a dramatic worsening of the performance on the original training data. One obvious solution to this issue would be to retrain the whole model from scratch including the new data. Even if theoretically possible, this approach is technically unfeasible in basically every situation due to the resource and time demands of training. The so-called stability-plasticity dilemma [continual-learning] defines the tradeoff between learning new information and forgetting old knowledge.

Not surprisingly, there is a strong correlation between continual learning and biological systems [continual-learning]; the tasks of the brain can be divided into learning and memorizing. The former consists of extracting relevant information from the surrounding events, structuring it and linking it to the already existing knowledge; the latter is performed by storing those events and making them available for later queries. Seen from a holistic perspective, the brain is therefore capable of generalizing over a collection of episodes and retaining the relevant ones.

In object tracking, however, catastrophic forgetting is not a big concern, since we can simply reset the state of the tracker at the end of the video and start over with untouched network weights. Nevertheless, tracking an object can still present a challenge, since we want the model to remember every part of the object that has been encountered so far. While in short and medium-length videos this issue barely affects the overall performance, in long-term video tracking it is a remarkable problem that needs to be addressed. In section 3.2 a paper that goes exactly in this direction is discussed.

2.2 Meta Learning

We can divide the approaches to tackling object tracking into two varieties: online and offline learning. The offline approach is analogous to most common machine learning techniques: the learning process is carried out only beforehand, and the weights of the whole model are then kept fixed throughout the actual tracking. This is for instance the case for object detection or digit recognition, where once the network has been trained it is not modified at all.

Online learning, on the other hand, includes a model update during tracking: a subnetwork's weights are updated according to different criteria. Object tracking can be associated with different types of learning, such as online learning, meta-learning and self-supervised learning, making the task particularly tricky while at the same time opening it up to countless, even radically different, solutions. For simplicity, we group it under the label of one-shot learning, where the model is supposed to learn a representation of an entity using a single sample, in the case of object tracking the first bounding box, and be able to correctly predict new occurrences of the same element in consecutive frames.

Following a divide-and-conquer strategy, we split one-shot learning into submodules: the meta-learning part and the actual online learning part. Meta-learning can be rephrased as learning to learn: learning offline how to learn online. A great deal of work has been invested in this learning setting [38] [45] [29] [39]; even though many interesting approaches have been proposed, this work will only indirectly cover the ones used by object trackers.

Early methods (sec. 2.3) are considered online learning algorithms, since the inner state of the system changes during tracking. On the other hand, it took some years of research to augment the groundbreaking deep-learning-based Siamese networks (sec. 2.4) and create an online mechanism on top of them. Generally speaking, it is not straightforward to define an online process for a neural network, since it is highly task-dependent. Moreover, model updates must be handled carefully, since the traditional plain gradient descent method is not suitable for this situation: it requires several iterations and can lead to overfitting if only a few data points are used. In section 5 we analyze one of the possible approaches to online learning in object tracking.

2.3 Early methods

Before the outbreak of deep learning in artificial intelligence, and more specifically in computer vision, probabilistic and filter-based methods were used. We can look at the task of object tracking as the estimation of some properties of a certain object that change over time; moreover, this information is provided in the form of images, which can therefore be considered a 2D time-dependent discrete digital signal. It is straightforward to see why techniques from digital signal processing and dynamical systems were employed to tackle this broad task. In the following subsections I briefly explain these early techniques, how research evolved towards fully convolutional neural networks, and why the latter are preferable over the former.

2.3.1 Tracking with matching

The underlying idea behind this mature family of trackers is that it is possible to match pairs of pixels in consecutive frames. Matching a sufficient number of pixels in the next frame makes it possible to estimate the pose and the bounding box of the target object. One of the works using this idea is "Robust visual tracking using an adaptive coupled-layer visual model" [13]. However, before diving into the analysis of this paper, the Kalman filter is briefly explained.

Definition 1: Kalman filter

Kalman filtering is an algorithm that uses partial, discretized and noisy measurements to produce a joint probability distribution over the state at every timestamp. After an estimate has been produced and the measurement is observed, the estimate is updated through a weighted average.

The underlying theory behind the Kalman filter is based on linear dynamical systems in the discrete time domain. The filter is modeled as a Markov chain (more in chapter 2.4.3), with linear operators and errors drawn from a Gaussian family. The linearity of the operators means that only linear operations are applied to each state to generate the next one, with the addition of Gaussian noise. Every state $x_t$ at time $t$ is represented by a vector of real numbers. The parameters of the model are:

• $F_t$: the state-transition model
• $H_t$: the observation model
• $Q_t$: the covariance of the process noise
• $R_t$: the covariance of the observation noise
• $B_t$: the control-input model, applied to the control vector $u_t$
• $w_t$: the process noise

The state update equation is the following:

$$x_t = F_t x_{t-1} + B_t u_t + w_t \quad (2.1)$$

There is a strong connection between the Kalman filter and Markov chains; in fact, for the discrete case (which is the case for object trackers), we can represent the process as a Markov chain, as depicted in the figure below. The state matrices $F, B, Q, R, H$ can even remain constant throughout the whole episode rollout, in which case they are shared between consecutive nodes. The update function is arbitrary and is therefore an object of study.
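To make the predict/update cycle concrete, here is a minimal NumPy sketch of a linear Kalman filter under the assumptions above (constant matrices, Gaussian noise); the function and variable names are illustrative and not taken from any tracker implementation.

```python
import numpy as np

def kalman_predict(x, P, F, B, u, Q):
    """Propagate the state estimate x and covariance P one step forward."""
    x_pred = F @ x + B @ u           # eq. 2.1 without the noise term
    P_pred = F @ P @ F.T + Q         # uncertainty grows by the process noise
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, H, R):
    """Correct the prediction with the observed measurement z (weighted average)."""
    S = H @ P_pred @ H.T + R                    # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain (the averaging weight)
    x_new = x_pred + K @ (z - H @ x_pred)       # blend prediction and measurement
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new
```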

The tracker is composed of two coupled layers, called the "global visual model" and the "local visual model"; the latter is a geometric constellation of patches of the original image that describes visual properties of the target object such as motion, color and shape. Since the target object can be occluded or warped, some of the patches present in the local layer are removed over time.

Fig. 2.1: Model underlying the Kalman filter. The gray circles represent matrices, unenclosed values are vectors. The white circles represent the joint distribution we are interested in; specifically, it is a multivariate Gaussian where $x$ is the mean and $P$ is the covariance matrix.

The global visual model, on the other hand, maintains a probabilistic model of the target's visual properties. Its adaptation is constrained to focus only on the patches present in the local layer. Information flows in both directions between the two layers, which is indeed the reason for the term "coupled layers".

The center of the target is estimated using a Kalman filter, where the center is identified by the weighted average of the patch positions present in the local layer, according to the following formula

$$\hat{x}_t^{(i)} = x_{t-1}^{(i)} + \hat{v}_t^{(i)} \quad (2.2)$$

where $\hat{v}_t$ is the predicted target velocity. Finally, different cues, such as the HSV histograms of the target and background, motion, and the difference in velocity between a point and the tracker, are combined into a single likelihood for each pixel of belonging to the target.

2.3.2 Discriminative classification with kernels

Early works on discriminative classification are numerous [37], [21], [48], [23]; explaining the working mechanisms of all of them is out of scope for this work. Therefore, in order to provide an understanding of kernel methods, we describe only "Struck: Structured Output Tracking with Kernels".


A visual object tracker operates on a list of images $(I_1, I_2, \dots)$ and can be represented by a function $f: I_t \to p_t$, where $p_t$ parametrizes the target object in the image $I_t$; in the simplest case it represents the bounding box surrounding the target, but it can also include other parameters such as scale, rotation or shape. The Struck tracker learns the transformation function $f: T \to Y$ between consecutive frames, namely $p_t \to p_{t+1}$. The learning procedure is performed by training a Structured SVM (S-SVM), in order to overcome the limitations of binary labels as in [21]. The image features are of three different types:

• Raw features: the image scaled to 16 × 16 and normalized to gray scale
• Haar features [54]: 2 scales on a 4 × 4 grid, normalized to the range [−1, 1]
• Histogram features: 16-bin intensity histograms from a spatial pyramid of 4 levels.

At inference time, $\arg\max_{y \in Y} g(t, y)$ is used. Since this algorithm trains an S-SVM with augmented data from a single frame, it also includes an adaptation mechanism based on LaRank [12]. The SVM is then updated at every frame, making this algorithm online.

2.3.3 Correlation filter

An image can be seen as an ordered sequence of pixels, where each pixel is constituted by a tuple of three integer values in the range 0-255, or floating point values in the range 0-1, depending on its encoding. On top of this representation, the mathematical framework of digital signal processing allows us to perform interesting and useful operations. In the case of object tracking we are interested in finding which part of the image (or signal) is most similar to a template. A basic operation producing a similarity value is filtering, or convolution, represented by the following operator.

Definition 2: Convolution operator

$$(f * g)(t) \triangleq \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \quad (2.3)$$

Throughout this work we use the notation $x$ for the image and $f$ for the filter. In order to speed up the process, the convolution is performed in the Fourier domain using the Fast Fourier Transform (FFT); both the filter $f$ and the image $x$ are therefore transformed into $F = \mathcal{F}(f)$ and $X = \mathcal{F}(x)$, with $\mathcal{F}$ being the Fourier transform. In this domain, thanks to the Convolution Theorem, the convolution operation is equivalent to the pointwise product of the Fourier transforms; formally:

Definition 3: Convolution operator in the Fourier domain

$$f * x = \mathcal{F}^{-1}\{\mathcal{F}\{f\} \cdot \mathcal{F}\{x\}\} \quad (2.4)$$

The position estimate then corresponds to the maximal activation of the correlation response

$$G = F \odot H^* \quad (2.5)$$

where $H^*$ represents the complex conjugate of the Fourier transform of the filter $f$.

The next question we need to address is how to find the correct filter. Several approaches to this problem are present in the literature, such as ASEF [10] and UMACE [35]. In [11], the so-called MOSSE algorithm finds the filter $H$ that minimizes the squared error between the ground truth and the actual convolution output:

$$\min_H \sum_i \left| F_i \odot H^* - G_i \right|^2 \quad (2.6)$$

Setting the derivative with respect to $H^*$ to zero and solving, the following solution is derived.

Definition 4: Solution of the discriminative filter

$$H^* = \frac{\sum_i G_i \odot F_i^*}{\sum_i F_i \odot F_i^*} \quad (2.7)$$
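As a rough illustration of eqs. 2.5-2.7, the following NumPy sketch builds a MOSSE-style filter in the Fourier domain from a few training patches and their desired Gaussian response maps, and applies it to a new patch. It is a simplified reading of the formulas above (single channel, no windowing or regularization), not the reference implementation of [11].

```python
import numpy as np

def train_mosse_filter(patches, targets):
    """patches, targets: lists of equally sized 2D arrays (image crops and desired
    Gaussian response maps). Returns the filter H* of eq. 2.7 in the Fourier domain."""
    num = np.zeros_like(np.fft.fft2(patches[0]))
    den = np.zeros_like(num)
    for x, g in zip(patches, targets):
        F = np.fft.fft2(x)
        G = np.fft.fft2(g)
        num += G * np.conj(F)        # sum_i G_i (.) F_i*
        den += F * np.conj(F)        # sum_i F_i (.) F_i*
    return num / (den + 1e-5)        # small epsilon to avoid division by zero

def correlate(patch, H_star):
    """Response map of eq. 2.5: the peak location is the estimated target position."""
    F = np.fft.fft2(patch)
    return np.real(np.fft.ifft2(F * H_star))
```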

2.3.4 Evaluation metrics

The high dimensionality of the video tracking ground truth makes the definition of a single evaluation metric hard. Several measures have been used to evaluate the effectiveness of an object tracker, such as F-score [31], PBM [24], OTP [26], OTA [25], and Deviation [3].

Moreover, effort has been invested in improving these evaluation metrics [55] [58]. Furthermore, there are additional metrics, such as robustness, which takes into account the number of tracking failures. Even though the proposed metrics are numerous, a high correlation between them has been shown in [49], as depicted in figure 2.2.


Tab. 2.1: Overview of the different evaluation metrics (table taken from [49]). Each entry gives the metric, its equation, its aim, and what it measures.

• F-score [31]: $2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ (Accuracy; thresholded precision and recall)
• F1-score [25]: $\frac{1}{N_{frames}} \sum_i \frac{2\, p_i r_i}{p_i + r_i}$ (Accuracy; precision and recall)
• OTA [7]: $1 - \frac{\sum_i (n^i_{fn} + n^i_{fp})}{\sum_i g_i}$ (Accuracy; false positives and false negatives)
• OTP [26]: $\frac{1}{|M_s|} \sum_{i \in M_s} \frac{|T_i \cap GT_i|}{|T_i \cup GT_i|}$ (Accuracy; average overlap over matched frames)
• OTA [25]: $\frac{1}{N_{frames}} \sum_i \frac{|T_i \cap GT_i|}{|T_i \cup GT_i|}$ (Accuracy; average overlap)
• Deviation [3]: $1 - \frac{\sum_{i \in M_s} d(T_i, GT_i)}{|M_s|}$ (Location; normalized centroid distance)
• PBM [25]: $\frac{1}{N_{frames}} \sum_i \big[ 1 - \frac{\text{Distance}(i)}{\text{Th}(i)} \big]$ (Location; centroid L1-distance)

In practice, the main measurements used to prove the soundness of a tracker consist of the Area Under the Curve (AUC), the survival curve (or success plot), and precision. In order to evaluate whether an object has been successfully identified, we define the overlap criterion, an operation on the bounding box generated by the tracker and the ground truth. In this framing, we can reduce the metric to binary classification, namely whether the object has been correctly identified or not. Given the tracked bounding box $T_i$ in frame $i$ and the corresponding ground truth $GT_i$, the overlap criterion is:

$$\frac{|T_i \cap GT_i|}{|T_i \cup GT_i|} \geq 0.5 \quad (2.8)$$

The advantage of being in a binary setting is the possibility of exploiting the many measures already available. The F-score combines precision and recall. Denoting by $n_{tp}$, $n_{fp}$, $n_{fn}$ the number of true positives, false positives and false negatives in a video, precision is defined as $p = \frac{n_{tp}}{n_{tp} + n_{fp}}$ and recall as $r = \frac{n_{tp}}{n_{tp} + n_{fn}}$. The F-score formula then follows:

$$F = \frac{2 \cdot p \cdot r}{p + r} \quad (2.9)$$

Bringing the F-score to the tracking setting, we redefine $p_i = |T_i \cap GT_i| / |T_i|$ and $r_i = |T_i \cap GT_i| / |GT_i|$ and normalize by the length of the video, namely:

Definition 5: F-score for object tracking

$$F1 = \frac{1}{N_{frames}} \sum_i \frac{2 \cdot p_i \cdot r_i}{p_i + r_i} \quad (2.10)$$

Another useful metric is the Intersection over Union (IoU), with the following formula:

Definition 6: Intersection over Union

$$\text{IoU} = \frac{1}{|M_s|} \sum_{i \in M_s} \frac{|T_i \cap GT_i|}{|T_i \cup GT_i|} \quad (2.11)$$

It is the average ratio between the area of the intersection of the ground truth and the predicted bounding box and the area of their union.
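The overlap-based quantities above are simple to compute; the sketch below evaluates the IoU of eq. 2.11 and the per-frame F-score of eq. 2.10 for axis-aligned boxes in a hypothetical (x, y, w, h) convention, independent of any benchmark toolkit.

```python
def intersection_area(box_a, box_b):
    """Overlap area of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    return iw * ih

def iou(box_a, box_b):
    """|A ∩ B| / |A ∪ B| as used in eqs. 2.8 and 2.11."""
    inter = intersection_area(box_a, box_b)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def f1_tracking(pred_boxes, gt_boxes):
    """Average per-frame F-score of eq. 2.10, with p_i = |T∩GT|/|T|, r_i = |T∩GT|/|GT|."""
    scores = []
    for T, GT in zip(pred_boxes, gt_boxes):
        inter = intersection_area(T, GT)
        p = inter / (T[2] * T[3]) if T[2] * T[3] > 0 else 0.0
        r = inter / (GT[2] * GT[3]) if GT[2] * GT[3] > 0 else 0.0
        scores.append(2 * p * r / (p + r) if p + r > 0 else 0.0)
    return sum(scores) / len(scores)
```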

From the F-score alone it is possible to create the survival curve, namely the average F-score plotted against the number of videos. In fig. 2.3 it is possible to see such a plot. Clearly, the trackers whose curves lie further towards the right-hand side of the graph achieve the best performance.

These are the metrics generally adopted in order to prove the superiority of a certain tracker over the rest. However, to get deeper insight into the behavior of a tracker we may analyze specific classes of videos, such as occlusion or deformation. Dividing the overall performance into subsets, we are able to gain a better understanding of the kinds of situations in which a tracker thrives or suffers. In recent years, since the breakthrough of deep learning in computer vision in general, and consequently in object tracking, the complexity of the underlying architectures has increased exponentially; moreover, one still unsolved intrinsic issue of deep learning is its explainability, making the debugging and improvement process almost impossible without a wide and deep analysis of experimental results. In the following chapters we take a close look at our improvement proposal, adopting DiMP [9] as the practical implementation, followed by an analysis of experimental results, concluding with explanations of the different behaviors.

2.4 Siamese trackers

Siamese neural networks are a special architecture composed of two branches, generally a set of layers with weight sharing, meaning both branches have the same weights. A Siamese network takes two samples as input, in our case two images: reference and test. The outputs of the two branches are then combined by a so-called head, which takes the features, combines them, and performs a dimensionality reduction via fully connected layers. At the end of the head the prediction, such as a regression or classification, is produced.

This specific architectural choice is widely used in object tracking since it fits perfectly the task of matching two inputs, namely finding the location of the object of input 1 inside input 2. In the next subsection one of the first papers using Siamese networks in object tracking is briefly explained.


Fig. 2.2: Scatterplots of video evaluations showing the high correlation between metrics, as shown in [58].

2.4.1 SiamFC

A state-of-the-art object tracker consists of a template matcher that seeks the object most similar to the given ground truth. This allows the implementation to be fully convolutional, without using any matching heuristic such as SIFT. Even if this approach achieves reasonable accuracy, it comes with a number of limitations that are analyzed later.

Even though countless papers use similar approaches, SiamFC [8] was among the first to adopt this one for object tracking; therefore, without loss of generality, its approach is explained.

The initial setup consists of a pair of images $(x, z)$, where $x$ is the training image and $z$ the target. Features are then extracted from both images by a CNN $\varphi$, generally called the backbone. Finally, the results are combined using a cross-correlation layer

$$f(z, x) = \varphi(z) * \varphi(x) + b \cdot \mathbb{1} \quad (2.12)$$

One can notice similarities between this algorithm and the one presented in the previous section regarding the correlation filter; however, the main difference is that the Fourier transform is implicitly performed by the CNN. Moreover, only meaningful, semantic information is extracted, leading to better performance. Finally, the model is trained offline using a discriminative approach, namely


Fig. 2.3: Survival curves of the trackers as a function of the number of videos. The gray areas in the top right and bottom of the plot mark regions where no data point is present, i.e., no tracker achieves such performance. This plot is shown in [58].

training the network on both positive and negative sample pairs, leading to the logistic loss

$$\ell(y, v) = \log(1 + \exp(-y v)) \quad (2.13)$$

with $v$ representing the real-valued score after the correlation layer and $y \in \{+1, -1\}$ indicating whether the sample pair is positive or negative. Stochastic gradient descent is then used to minimize the expected loss:

Definition 7: Expected loss for SiamFC

$$\arg\min_\theta \, \mathbb{E}_{(z,x,y)}\, L(y, f(z, x; \theta)) = \arg\min_\theta \, \mathbb{E}_{(z,x,y)} \frac{1}{|D|} \sum_{u \in D} \ell(y[u], v[u]) \quad (2.14)$$
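The following PyTorch fragment sketches the core of eqs. 2.12-2.14 under simplifying assumptions (a toy two-layer backbone standing in for ϕ, batch size one, dummy labels); it illustrates the mechanism of using the template embedding as a correlation kernel, and is not the SiamFC training pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(               # stand-in for phi (e.g. an AlexNet/ResNet trunk)
    nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
)

def siamfc_score(z, x, bias):
    """f(z, x) = phi(z) * phi(x) + b·1 (eq. 2.12): the template embedding is slid
    over the search-region embedding as a cross-correlation kernel."""
    kz = backbone(z)                    # template features, shape (1, C, h, w)
    kx = backbone(x)                    # search features,  shape (1, C, H, W)
    return F.conv2d(kx, kz) + bias      # dense score map

def logistic_loss(score, labels):
    """Mean of log(1 + exp(-y v)) over all score-map positions (eqs. 2.13-2.14)."""
    return torch.log1p(torch.exp(-labels * score)).mean()

# usage: z is a 127x127 template crop, x a 255x255 search crop, labels in {+1, -1}
z = torch.randn(1, 3, 127, 127)
x = torch.randn(1, 3, 255, 255)
score = siamfc_score(z, x, bias=torch.zeros(1))
labels = torch.where(torch.rand_like(score) > 0.8,
                     torch.ones_like(score), -torch.ones_like(score))
loss = logistic_loss(score, labels)
```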


Fig. 2.5: Network architecture of SiamRPN++ tracker

One direct advantage of these methods is the ease of training: for each pass, only a pair of frames is required, one as template, the other as target. Generally, pretrained networks such as ResNet-18 or ResNet-50 are used as backbone to shorten training time; fig. 2.4 shows the architecture of the SiamFC network.

After [8], more advanced models have been proposed: SiamRPN [32] and SiamRPN++ [33] make use of a Region Proposal Network in order to gain both accuracy and speed. In addition, several architectures such as [52], ECO [16], ATOM [15] and DiMP [9] have been proposed in order to overcome the limits of purely offline trackers. In the next subsection we take a closer look at the intrinsic limitations of this approach.


Fig. 2.6: List of the trackers participating in the 2018 Visual Object Tracking (VOT2018) challenge

2.4.2 ATOM

In ATOM: Accurate Tracking by Overlap Maximization [15], the two main tasks, namely target estimation and target classification, are split and performed by two different heads of the network. Essentially, there is a common network called the backbone that ingests both the reference and the test image; subsequently, two different subnetworks unfold the backbone's output into two different outputs, one for each of the aforementioned tasks. Since a more modern target classification module is explained in detail in the following section, here I focus on target estimation, the part that is shared with the DiMP [9] work and therefore worth discussing in order to draw a complete picture of current state-of-the-art trackers.

The target estimation is performed by overlap maximization and is built upon IoU-Net [22], which was proposed for object detection rather than tracking. Even if the difference might sound subtle, the main point is that in object detection the classes are known, i.e., the network knows which kind of object it is looking for. Conversely, in object tracking, we want the system to be able to track any kind of object, even never-seen ones.

The IoU-Net is trained to predict the IoU score between the reference and test frame of a given object; the bounding box with the highest predicted score is then chosen as the candidate. In the basic IoU-Net model [22], given the latent representation $x \in \mathbb{R}^{W \times H \times D}$ of an image and a bounding box $B \in \mathbb{R}^4$, the network is trained to predict the IoU score. A submodule called Precise ROI Pooling (PrPool) [22], a special variant of average pooling that allows differentiation with respect to the bounding box coordinates (otherwise discrete and therefore non-differentiable), makes the model trainable with standard gradient-based algorithms such as ADAM [28]. Due to the complexity of the task and the unknown nature of the target, one-shot learning of the IoU-Net is simply not feasible: it would lead to poor generalization and most likely to overfitting. Instead, the authors of the paper train it completely offline.

Fig. 2.7

I will now explain the information flow within the network, more specifically the two branches represented in figure 2.7. The ResNet-18 subnetwork acts as backbone, extracting meaningful information from the raw picture and intuitively discarding the information related to the background. In the reference branch, after applying some convolutional layers, a PrPool layer and fully connected layers follow.

A similar flow occurs in the test branch, however with more layers and a higher pooling resolution (5 × 5 vs 3 × 3) due to the higher complexity of the task. The information from the reference branch and the test branch is then combined using a channel-wise multiplication. Finally, the output of the network, namely the predicted IoU between the submitted bounding box and the actual one, is given by the following formula:

Definition 8: IoU formula in ATOM

$$\text{IoU}(B) = g\big(c(x_0, B_0) \cdot z(x, B)\big)$$

Given the working mechanism of the IoU-Net, the training phase requires some care, since the only data we have are the actual bounding boxes around the object. Clearly we cannot feed only these to the network, otherwise the output would always be 1. Random bounding boxes have to be generated in order to provide a loss to backpropagate. Moreover, this is also the procedure at test time, namely the bounding box with the highest predicted IoU is picked as output.
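To illustrate how a differentiable IoU predictor can be exploited at test time, the sketch below performs a few gradient-ascent steps on candidate box coordinates so as to increase the predicted IoU, then keeps the best box. Here `iou_net` is a hypothetical callable standing in for the trained IoU branch, and the step size and iteration count are arbitrary choices.

```python
import torch

def refine_boxes(iou_net, test_feat, modulation, boxes, steps=5, lr=1.0):
    """Gradient-based refinement: nudge candidate boxes (N, 4) in the direction
    that increases the predicted IoU, then keep the best one."""
    boxes = boxes.clone().requires_grad_(True)
    for _ in range(steps):
        iou_pred = iou_net(test_feat, modulation, boxes)   # (N,) predicted IoU scores
        iou_pred.sum().backward()
        with torch.no_grad():
            boxes += lr * boxes.grad                        # ascend the IoU prediction
            boxes.grad.zero_()
    with torch.no_grad():
        final_iou = iou_net(test_feat, modulation, boxes)
    best = final_iou.argmax()
    return boxes[best].detach(), final_iou[best].item()
```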


Fig. 2.8: Given the meta-annotated training image set $T_N$ and the test frame, features are extracted by passing through the backbone and convolutional layers (Cls Feat). The features coming from the training set then go to the model initializer and optimizer, which produce the final layer's weights. Finally, these weights are used for a final convolution over the test frame features, giving a discriminative heatmap.

2.4.3 DiMP

Countless approaches are present in the literature, but in this work I focus on the specific method of "Learning Discriminative Model Prediction for Tracking" [9], DiMP for short. In this paper, the authors describe a method built on a training set $S_{train} = \{(x_j, c_j)\}_{j=1}^n$, with $x_j \in \mathcal{X}$ being the feature map extracted by a backbone (ResNet-50) and $c_j \in \mathbb{R}^2$ the center coordinates of the target. After the activation map has been extracted from the test frame, it is convolved with the final filter $f$, creating the score prediction $p \in \mathbb{R}^{19 \times 19}$; the point with the highest activation is then chosen as the center of the target object. Finally, the IoU-Net [22] is employed to compute the best bounding box surrounding the object. The IoU-Net is the one explained in the section above; the remaining part of DiMP is shown in fig. 2.8.

Given the training set, we want the network to minimize the classification loss on the already seen samples; the loss is the following.

Definition 9: Loss formula for DiMP

$$L(f) = \frac{1}{|S_{train}|} \sum_{(x,c) \in S_{train}} \left\| r(x * f, c) \right\|^2 + \left\| \lambda f \right\|^2 \quad (2.15)$$

where $*$ stands for the convolution operator and the latter part of the formula, $\|\lambda f\|^2$, is the regularizer. Looking at the former part, $r(s, c)$ can be an arbitrary residual function; if we set it to $r(s, c) = s - y_c$, with $y_c$ a Gaussian function centered at $c$, the whole formula reduces to a regularized mean squared error. However, in the case of negative samples, the model is then forced to push all the scores as close as possible to zero, requiring high expressivity and distracting it from the purely discriminative task. The authors therefore opt for a combination of regression and hinge loss ($\max(0, s)$), leading to the following residual function.


Definition 10: Hinge loss

$$r(s, c) = v_c \cdot \big( m_c s + (1 - m_c)\max(0, s) - y_c \big) \quad (2.16)$$

where $m_c \in [0, 1]$ defines the target region mask, with $c$ the center coordinates of the target; $v_c$ is a spatial weight and $\lambda$ the regularization factor. In this way the network focuses more on the object than on the background. However, this approach would not work if standard gradient descent

$$f^{(i+1)} = f^{(i)} - \alpha \nabla L\big(f^{(i)}\big) \quad (2.17)$$

were used: each update would take several iterations, making a minimum of 30 frames per second, and hence online tracking, not feasible. To overcome the slow convergence of gradient descent, the authors propose a higher-order optimization, where the step length $\alpha$ is computed by the model itself. Steepest descent methods have been well known in the literature [conjugate] [numerical] for a while; they go beyond the scope of this work, though. The first step is to approximate the loss 2.15 with a quadratic form, with $f^{(i)}$ being the current estimate:

$$L(f) \approx \tilde{L}(f) = \frac{1}{2} \big(f - f^{(i)}\big)^T Q^{(i)} \big(f - f^{(i)}\big) + \big(f - f^{(i)}\big)^T \nabla L\big(f^{(i)}\big) + L\big(f^{(i)}\big) \quad (2.18)$$

where $Q^{(i)}$ is a positive definite square matrix. The next step is to solve the approximate loss $\arg\min_f \tilde{L}(f)$ of 2.18.

Definition 11: Steepest gradient descent in DiMP

$$\frac{\partial}{\partial \alpha} \tilde{L}\big(f^{(i)} - \alpha \nabla L(f^{(i)})\big) = 0 \quad (2.19)$$

$$\alpha = \frac{\nabla L(f^{(i)})^T \, \nabla L(f^{(i)})}{\nabla L(f^{(i)})^T \, Q^{(i)} \, \nabla L(f^{(i)})} \quad (2.20)$$

The element still missing from the analysis is $Q^{(i)}$; we can consider a few choices for its value. The most trivial one is $Q^{(i)} = \frac{1}{\beta} I$, which reduces formula 2.20 to $\alpha = \beta$. As mentioned before, we can introduce higher-order information, such as the second-order derivative, so as to follow the gradient in the steepest possible way, setting $Q^{(i)} = \frac{\partial^2 L}{\partial f^2}\big|_{f^{(i)}}$, where $L$ is the loss 2.15. However, computing the Hessian matrix is unfeasible for a high number of parameters and has a high computational cost anyway. Therefore, Bhat et al. propose to use the Gauss-Newton algorithm, since it minimizes squared functions without requiring the second-order derivative to be computed, and is therefore tailored to our purpose. The variable $Q^{(i)}$ therefore takes the value

$$Q^{(i)} = \big(J^{(i)}\big)^T J^{(i)} \quad (2.21)$$

with $J^{(i)}$ being the Jacobian of the residuals at $f^{(i)}$, which is computed through a set of neural network operations and is implementable with any autograd framework. The whole procedure is given in Algorithm 1.

Algorithm 1: Target model prediction
1: Input: samples $S_{train} = \{(x_j, c_j)\}_{j=1}^n$, iterations $N_{iter}$
2: $f^{(0)} \leftarrow \text{ModelInit}(S_{train})$    // filter initialization
3: for $i = 0, \dots, N_{iter} - 1$ do
4:   $\nabla L(f^{(i)}) \leftarrow \text{FiltGrad}(f^{(i)}, S_{train})$    // with $L$ as in eq. 2.15
5:   $h \leftarrow J^{(i)} \nabla L(f^{(i)})$    // apply the Jacobian
6:   $\alpha \leftarrow \|\nabla L(f^{(i)})\|^2 / \|h\|^2$    // compute the step length
7:   $f^{(i+1)} \leftarrow f^{(i)} - \alpha \nabla L(f^{(i)})$    // filter update
8: end for

Having a good filter initialization is crucial for the success of the whole tracking pipeline. Starting from totally random weights and applying Algorithm 1 would be too expensive in terms of time; therefore a small network is exploited to perform few-shot learning, where the samples in $S_{train}$ are augmentations of the initial frame. It is composed of a convolutional layer followed by a precise ROI pooling layer [22]. This last layer extracts features and warps them to the same dimensions as the filter $f$. This operation is repeated for each image in the training set, and the first initial filter guess is their average. Finally, the optimizer is run for a final refinement of the weights using Algorithm 1.
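A condensed PyTorch sketch of one iteration of Algorithm 1 is given below. It assumes a hypothetical helper `residual(f, samples)` implementing the residuals of eq. 2.16, stacks the regularization residual of eq. 2.15 onto it, and obtains the Gauss-Newton step length of eq. 2.20 through a Jacobian-vector product; it is a schematic reading of the algorithm, not the pytracking implementation.

```python
import torch
from torch.autograd.functional import jvp

def steepest_descent_step(f, samples, residual, reg=0.05):
    """One iteration of a DiMP-style filter optimizer (a sketch)."""
    def rho(w):
        # data residuals plus the regularization residual lambda*f (eq. 2.15)
        return torch.cat([residual(w, samples).flatten(), (reg * w).flatten()])

    f = f.detach().requires_grad_(True)
    loss = (rho(f) ** 2).sum()                    # L(f) of eq. 2.15, up to constant factors
    (grad,) = torch.autograd.grad(loss, f)        # nabla L(f)

    # Gauss-Newton step length (eq. 2.20): alpha = |g|^2 / |J g|^2,
    # where J g is a Jacobian-vector product of the residuals in direction g.
    _, Jg = jvp(rho, (f,), (grad,))
    alpha = (grad * grad).sum() / (Jg * Jg).sum().clamp_min(1e-8)
    return (f - alpha * grad).detach()            # filter update (line 7 of Algorithm 1)
```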


Memory. The set $S_{train}$ constitutes the dynamic memory of the tracker; every frame is added to the memory, with its predicted bounding box treated as ground truth, if the maximum of the activation map is higher than a certain threshold $\beta$, which serves as a rough uncertainty estimate. At every frame the final filter $f$ is then recomputed, so that it can also capture the most recent views of the object. Furthermore, in the current official implementation, available at https://github.com/visionml/pytracking, a "fading system" is in place, namely a weighted sum in eq. 2.15 with weights $w_s$, starting from 1 and updated as

Definition 12: Weight update in DiMP

$$w_s^{t+1} = \frac{w_s^t \, (1 - lr)^{-1}}{\sum_{s \in S_{train}} w_s} \quad (2.22)$$

at each step. Clearly the memory size is limited, 50 in the current implementation; consequently, each new frame substitutes the one with the minimum $w_s$, corresponding to the oldest in the memory. This is a rather naive method; in fact, it is one of the main components of our proposed method and a subject of this work.
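The bookkeeping described above can be summarized by the following sketch of a fixed-capacity memory with fading weights. The decay rule here is a simplified exponential fading rather than the exact normalized update of eq. 2.22, and all names and constants are illustrative, not taken from pytracking.

```python
class SampleMemory:
    """Fixed-size training-set memory with fading weights (schematic)."""

    def __init__(self, capacity=50, learning_rate=0.01):
        self.capacity = capacity
        self.lr = learning_rate
        self.samples, self.weights = [], []

    def maybe_add(self, feat, box, score_map, threshold=0.25):
        # only store confidently tracked frames: the peak score acts as a crude
        # uncertainty estimate (the beta threshold in the text)
        if score_map.max() < threshold:
            return
        # fade the existing weights so that older samples matter less
        self.weights = [w * (1.0 - self.lr) for w in self.weights]
        if len(self.samples) < self.capacity:
            self.samples.append((feat, box))
            self.weights.append(1.0)
        else:
            # replace the entry with the minimum weight, i.e. the oldest one
            idx = min(range(len(self.weights)), key=self.weights.__getitem__)
            self.samples[idx] = (feat, box)
            self.weights[idx] = 1.0
```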

Limitations. Let us now take a look at the limitations and drawbacks of Siamese object trackers. First of all, since the template frame remains constant throughout the whole video, every new image is always compared to the same initial frame. This behavior has two implications: the tracker is less prone to model drift, but its robustness is compromised. Model drift occurs when the model starts adapting to an object that looks broadly similar to the target but clearly is not the object we are interested in, causing a total drop in accuracy. After model drift occurs, it is generally hard for any model to recover, since at every frame the drift is reinforced. The main limitation of purely offline trackers is the expressiveness constraint, namely the inability to capture abrupt, or slow but substantial, changes of the subject. If the template $z$ in the upper branch of figures 2.4 and 2.5 is too different from the search branch, the model is unable to successfully perform the match, leading to poor performance.

In order to overcome this intrinsic limitation, dynamic trackers have been introduced, where dynamic means that the weights change over time. We can formalize offline and online trackers using the Bayesian networks in figure 2.9, where the measured variable is the prediction of the target in frame $i$, represented by $pred_i$. In figure 2.9a, the network shows how every prediction is independent of the previous ones; therefore, the probability of model drift is proportional to the intrinsic difficulty of each frame and independent of the previous ones.


Fig. 2.9: Bayesian networks representing the variables involved and their relationships within a tracker: (a) offline learning tracker, (b) online learning tracker. Init stands for the initial frame initialization. σ is the expressiveness of the model, meaning how much the model is able to capture different aspects or views of the same object as a unique entity. ρ is the intrinsic difficulty of the image, namely the number and degree of the challenges present in the frame.

Online learning object trackers have been proposed in recent years [17] to fix these issues, but they lead to model decay in long-term video tracking [20], worsening the performance in case the object returns to its original view. In order to fix this problem, [9] proposes storing a set of images and training online a template module that predicts the position in the new frame given the past frames. An issue arising from this architecture is the lack of object position inference in case the object undergoes an abrupt deformation. In that case, the simple template-target approach does not work, since the relationship is broken; therefore, an inference submodel is needed in order to predict the possible next position based only on the last frames.


3 Related works

About 80% of statistics are false.
Anonymous

3.1 Memory types

Before diving into the actual proposed method, I want to give a general overview of the work of Atkinson et al. [2], presenting the different memory types that have been hypothesized, their role, and their relevance for our work and for object tracking in general.

The Atkinson-Shiffrin memory model consists of a tripartite memory system:

• Sensory register

• Short-term store

• Long-term store

The interesting aspect of this research is that most of the experiments were visual or audio-visual; therefore the results are likely to be more relevant and accurate in the context of visual object tracking. Even though the Atkinson-Shiffrin model [2] was presented in 1968, and later research has criticized some aspects of it or improved upon them, these are facets that do not affect the general structure that inspired us.

3.1.1 Sensory register

The sensory register is the entry point for any kind of environmental stimulus into human memory. Every sense has its own register, which is not, however, directly processed. In fact, the sensory register, or "sensory memory", does not perform any kind of operation directly on the stimuli; it collects them and stores them temporarily, hence its denomination as a register or "buffer". Next, it serves the requested stimulus to the short-term memory only if attention is paid to it; otherwise the gathered information decays in a matter of fractions of a second. Regarding the purely visual system, the information present in the so-called iconic memory decays in about 0.5-1 second [50].

3.1.2 Short-term store

The chunks of information that receive attention are moved from the sensory memory to the short-term store, where they can remain for about 20 seconds, up to 30 if active rehearsal takes place. However, the capacity of this kind of memory is quite limited; in fact only about 7 ± 2 chunks can be held. The transfer into the long-term store is considered to be automatic in case of rehearsal or when links with pre-existing knowledge are created.

3.1.3 Long-term store

The last layer of the memory stack is the long-term store, where memories are kept for an undefined amount of time. However, similarly to the mechanism of a Von Neumann architecture, the information has to be transferred back to the short-term store in case the subject wants to manipulate it.

3.2 THOR

The objective of the long-term memory is to collect images that are as diverse as possible, creating a general information set about the object. Different properties of the object, such as light variations or deformations, need to be captured in order to perform a better prediction for the subsequent frames. Instead of working directly with the images, we deal with a better representation, namely the features extracted by the backbone and, in DiMP, by an additional convolutional block (Cls Feat).

Given a list of feature vectors $\{f_i\}_i^n$ extracted from the images $\{T_i\}_i^n$, we are interested in the Gram matrix of template similarities. Since we are in feature space, where, according to [8], the convolution operator acts as a similarity measure, we can construct the following matrix:

$$G(f_1, \cdots, f_n) = \begin{pmatrix} f_1 \star f_1 & f_1 \star f_2 & \cdots & f_1 \star f_n \\ \vdots & \vdots & \ddots & \vdots \\ f_n \star f_1 & f_n \star f_2 & \cdots & f_n \star f_n \end{pmatrix} \quad (3.1)$$

Given that $G \in \mathbb{R}^{n \times n}$ is a Gram matrix, its determinant corresponds to the square of the volume of the parallelotope spanned by the templates in the hyperspace; this volume acts as a proxy for the diversity among the templates in the memory. Since the goal of the long-term component is to store templates that are as different from each other as possible, we can reformulate the problem as maximizing the diversity via the determinant of the matrix just described.

$$\max_{f_1, f_2, \ldots, f_n} \Gamma(f_1, \ldots, f_n) \propto \max_{f_1, f_2, \ldots, f_n} \left| G(f_1, f_2, \ldots, f_n) \right| \quad (3.2)$$

Therefore, each new template is added to the memory if and only if it increases the determinant of the Gram matrix. However, for a frame in which the prediction is slightly off, the aforementioned condition is still satisfied, since the distance to all the existing templates is dramatically high. Because it is unfeasible to keep track of the whole history and, at every point in time, select the templates with the highest diversity, only the current state is considered. Specifically, a new image $f_c$ is added to the memory if and only if it increases the diversity, i.e. increases the value of $|G|$.
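A compact sketch of this determinant-based admission test is shown below; `similarity(a, b)` is a hypothetical helper returning the scalar similarity $f_a \star f_b$, and the fixed-size variant tries the candidate in each memory slot and keeps the most diverse configuration.

```python
import numpy as np

def gram_matrix(feats, similarity):
    """Dense matrix of pairwise template similarities (eq. 3.1)."""
    n = len(feats)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = similarity(feats[i], feats[j])
    return G

def best_replacement(memory_feats, candidate, similarity):
    """Try replacing each stored template with the candidate and keep the
    configuration with the largest |det G|, i.e. the most diverse memory (eq. 3.2).
    Returns the slot index to replace, or None if no replacement increases diversity."""
    best_idx = None
    best_det = abs(np.linalg.det(gram_matrix(memory_feats, similarity)))
    for i in range(len(memory_feats)):
        trial = list(memory_feats)
        trial[i] = candidate
        d = abs(np.linalg.det(gram_matrix(trial, similarity)))
        if d > best_det:
            best_idx, best_det = i, d
    return best_idx
```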

As a control, we want to limit the value of $|G|$; an excessively large value would mean that noisy images have been added, causing model drift. The naive solution would be to set an upper bound; however, such a bound is not straightforward to find, and it would prevent new incoming templates from being accepted for a long time whenever they are far from the existing ones. The authors instead propose a lower bound on the similarity between the new template $T_c$ and the base template $T_1$, in two variants called dynamic and ensemble.

The former consists of subtracting a diversity measure $\gamma$, derived from the short-term memory, from the self-similarity of the ground truth, $G_{11}$. The condition is then the following.

Definition 13: Dynamic long-term condition

$$f_c \star f_1 > \ell \cdot G_{11} - \gamma \quad (3.3)$$

The ensemble condition, on the other hand, consists of keeping the bound static but computing it with respect to all the templates currently in the long-term memory; the formula is the following.

Definition 14: Ensemble long-term condition


For the short-term memory, the authors suggest using the Gram matrix and calculating the diversity measure using only its upper triangle, as follows.

Definition 15: Short-term condition

$$\gamma = 1 - \frac{2}{N(N+1)\, G_{st,\max}} \sum_{i<j}^{N} G_{st,ij} \quad (3.5)$$

However, in our work there are already several diversity measures in place that reject a sample if it is too far from the rest, as will be explained later; therefore, adding one more layer of complexity would result in lower performance, harder interpretation of results and performance instability.

3.3 Model decay

To understand how a pronounced distance between the prediction and the ground truth can affect the rest of the video, let us take a closer look at its formalization presented by Gavves et al. [20]. Given any frame $i$, its prediction can be considered a noisy measurement of the ground truth:

$$\hat{y}_i = y_i + \delta_i, \qquad \delta_i \sim \mathcal{N}\big(0, \sigma_i^2\big) \quad (3.6)$$

with $\hat{y}_i$ being the outcome of the model, which depends on the weights $\phi$:

$$\hat{y}_{i+1} = f(x_{i+1}; \phi_{i+1}) \quad (3.7)$$

Given the update step of the filter $f$, we can write the weight change as follows; a standard gradient descent step is shown here, but even though higher-order optimizations are used, as in [9], the equation does not lose generality.

$$\phi_{i+1} = \phi_i - \eta \nabla_\phi L_i \quad (3.8)$$

We can now generalize this step for the whole process up to time $t$, taking the expectation of the gradient of the loss. For ease of notation, $f_{i,t}$ represents the output of the network for the input image $x_i$, parametrized by the weights $\phi_t$.

$$\nabla_\phi L_t = \nabla_\phi \mathbb{E}\big[(\hat{y}_i - f_{i,t})^2\big] \quad (3.9)$$
$$= \mathbb{E}\big[\nabla_\phi \hat{y}_i^2\big] - 2\,\mathbb{E}\big[\hat{y}_i \nabla_\phi f_{i,t}\big] + \nabla_\phi \mathbb{E}\big[f_{i,t}^2\big] \quad (3.10)$$
$$= 2\,\mathbb{E}\big[f_{i,t} \nabla_\phi f_{i,t}\big] - 2\,\mathbb{E}\big[\hat{y}_i \nabla_\phi f_{i,t}\big] \quad (3.11)$$

In going from equation 3.10 to eq. 3.11 we use the fact that, once the prediction $\hat{y}_i$ has been produced and stored as a label, it is treated as a constant with respect to $\phi$, so its gradient vanishes. Now that we have a formula for the gradient of the loss, we can plug it into the update equation 3.8:

$$\phi_{t+1} - \phi_t = -2\eta \big[ \mathbb{E}[f_{i,t} \nabla_\phi f_{i,t}] - \mathbb{E}[\hat{y}_i \nabla_\phi f_{i,t}] \big] \quad (3.12)$$
$$= -2\eta \big[ \mathbb{E}[f_{i,t} \nabla_\phi f_{i,t}] - \mathbb{E}[(y_i + \delta_i) \nabla_\phi f_{i,t}] \big] \quad (3.13)$$
$$= \underbrace{-2\eta\, \mathbb{E}\big[(f_{i,t} - y_i) \cdot \nabla_\phi f_{i,t}\big]}_{\text{perfect parameter update}} + \underbrace{2\eta\, \mathbb{E}\big[\delta_i \cdot \nabla_\phi f_{i,t}\big]}_{\text{parameter bias}} \quad (3.14)$$

In the next few sections we present different methods to estimate the parameter bias, estimating both $\delta_i$ and a proxy function of $\nabla_\phi f_{i,t}$.


4 Proposed Method

The question of whether computers can think is like the question of whether submarines can swim.
Edsger W. Dijkstra

4.1 Overview

While the discriminative power of the method presented above achieves state-of-the-art results, we believe that some limitations of previous models have not been addressed, and they leak into specific cases considered "very hard" for any kind of tracker. We take inspiration from the work of the neuroscientists Atkinson and Shiffrin [2], who proposed a model for human memory, aiming to solve the limitations present in DiMP, but also in any other memory-based online tracker. As we shall see, the memory model presented is not only supported by scientific data, but is also quite intuitive. Moreover, taking inspiration from the best living tracker known so far, the human being, can only help the research ecosystem, stopping it from being constrained to current methods and encouraging it to think multidisciplinarily and creatively.

4.2 Multifaceted Memory Augmentation (MMA)

With the Atkinson-Shiffrin memory model in mind, we decided to augment the basic memory consisting of a collection of tuples $\{(x_i, c_i, w_i)\}_i^N$, where $x$ represents the features, $c$ the predicted bounding box and $w$ the weight of the given sample. Unlike the memory model presented above, new samples enter directly into the long-term and short-term memories, without any information passing between the two memory types, but they must satisfy different requirements. For the memory construction we adopt the methods proposed in THOR [46], which are recapped below.

The rationale behind the work "Tracking Holistic Object Representations", as the name suggests, is to create a representation of the whole object, from all its sides and views, as seen so far during the tracking process. This is done via frame selection, namely deciding according to some criteria whether a frame is added to the memory or not. Intuitively, we can think of it as a way to assess whether a new view of an object differs from the ones already encountered; if it is too similar, it is discarded, as it does not add value to the collection. In the work of Sauer et al., two different metrics are proposed, respectively for the long-term and short-term components.

Given an image $T_i$, its corresponding features $f_i$ are obtained, as explained above, by the forward pass of the backbone and the feature extractor. Even though the feature dimensionality is constant throughout the duration of the video, the quantitative difference between two templates can diverge significantly. The difference in scale is a problem, since convolution as a similarity measure is not scale invariant. Moreover, when working with a large Gram matrix, the determinant can explode in the case of a non-normalized matrix, or vanish in the case of a normalized one. We therefore present a modification of the method presented in [46] that addresses these issues. The algorithm to compute the Gram matrix is presented in Algorithm 2; it builds a normalized Gram matrix, where the similarity measure equals 1 when two templates are identical and tends to 0 as the templates become more different. To address the problem of the determinant, we adopt a normalization, namely we set the determinant of the Gram matrix to 1, scaling the proposed matrices by a factor $c$ and taking the absolute value, since we are interested in the size of the parallelotope and not its orientation. The value $c$ is computed as follows; let us call $p$ the value of the determinant:

$$\det(A) = p \quad (4.1)$$
$$\det(cA) = c^n p \quad (4.2)$$
$$c = p^{-1/n} \quad (4.3)$$

In this way, the determinants of the matrices lie around 1, giving numerical stability to the process. Even in the case of overflow of $p$, its approximation is good enough to reduce the spread of the other matrices' determinants.

Algorithm 2: Normalized Gram matrix
1: t ← templates ⋆ templates
2: normalized ← t / diagonal(t)    // row-wise division by the self-similarity
3: tri1 ← triu(normalized)    // keep the upper triangle, set the rest to 0
4: tri2 ← triu(normalized^T)    // keep the upper triangle, set the rest to 0
5: G ← [tri1, tri2]    // stack on an additional dimension
6: min ← min(G, axis=0)    // element-wise minimum of the two estimates
7: return min + min^T − I
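One possible NumPy rendering of Algorithm 2 and of the determinant scaling of eqs. 4.1-4.3 is sketched below, under the assumption that the pairwise similarities have already been collected into a dense matrix `t`; it is one reading of the pseudocode above, not verified against the original implementation.

```python
import numpy as np

def normalized_gram(t):
    """t: (n, n) matrix of pairwise similarities f_i * f_j (Algorithm 2).
    Returns a symmetric matrix with ones on the diagonal and values in (0, 1]."""
    n = t.shape[0]
    normalized = t / np.diag(t)[:, None]   # row-wise division by the self-similarity
    tri1 = np.triu(normalized)             # upper triangle of the row-normalized matrix
    tri2 = np.triu(normalized.T)           # upper triangle of the transposed version
    m = np.minimum(tri1, tri2)             # element-wise minimum of the two estimates
    return m + m.T - np.eye(n)             # symmetrize; the diagonal is counted once

def det_scale(G_reference):
    """Factor c of eq. 4.3: scaling Gram matrices by c keeps their determinants
    around 1, avoiding overflow/underflow when comparing candidate memories."""
    p = abs(np.linalg.det(G_reference))
    return p ** (-1.0 / G_reference.shape[0]) if p > 0 else 1.0
```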


Fig. 4.1: The long-term and short-term memories are passed through the backbone and the online optimizer; the resulting activation scores are then combined with the sensory memory score. At the end, the merger outputs the final prediction. The IoU-Net component is not shown here for the sake of illustration.

4.3 Schema

In this section we analyze how to combine the insights gained from Atkinson et al. and the valuable work by Sauer et al. in order to create a unique framework. Moreover, even though this work has been tailored around the current state-of-the-art tracker “Learning Discriminative Model Prediction for Tracking” [9], it is meant to be plugged into any kind of tracker equipped with a memory. The latent space dimensions will therefore be related to the aforementioned method. In addition, since the objective is not to find the best architecture for this purpose, we adopt a relatively shallow network that is nevertheless expressive enough to obtain reliable measurements of the impact of the proposed model.

The Long and Short memories contain, respectively, the tuples of images and predicted bounding boxes $\{(x_i, t_i)\}^{LT}_{i=1,\dots,n_{LT}}$ and $\{(x_i, t_i)\}^{ST}_{i=1,\dots,n_{ST}}$. In order to create different predictions for each of the two modules, following the architecture depicted in Figure 2.8, two filters $f_{ST}$ and $f_{LT}$ are created using the model optimizer described in Section 5. The sensory memory, on the other hand, is not tied to any filter: it is an LSTM-based model that takes the last score prediction $p$ as input and outputs the prediction for the next frame. Intuitively, it considers the past trajectory of the object in order to compute its next possible position, similarly to Kalman filter [6] based trackers [44]. This aspect is one of the most important of the whole work, since it makes the tracker able to handle situations such as occlusion or the object being temporarily hidden. In those specific situations, the discriminative power is not enough, or it can even be misleading, since other objects might be recognized as the target one.
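As an illustration of this component, the sketch below realizes the sensory memory as an LSTM over flattened score maps; the module name, the hidden size, and the choice of feeding the 19×19 score map directly are assumptions made for exposition, not details taken from the actual model.

import torch
import torch.nn as nn

class SensoryMemory(nn.Module):
    # Illustrative sensory-memory module: an LSTM that consumes the score map
    # of the previous frame and predicts a score map for the next one,
    # modelling the object's trajectory rather than its appearance.
    def __init__(self, score_size: int = 19, hidden_size: int = 256):
        super().__init__()
        self.score_size = score_size
        self.lstm = nn.LSTM(input_size=score_size * score_size,
                            hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, score_size * score_size)

    def forward(self, prev_scores: torch.Tensor, state=None):
        # prev_scores: (B, 19, 19) score map of the previous frame.
        b = prev_scores.shape[0]
        x = prev_scores.flatten(1).unsqueeze(1)   # (B, 1, 19*19)
        out, state = self.lstm(x, state)          # recurrent state carries the trajectory
        pred = self.head(out[:, -1])              # (B, 19*19)
        return pred.view(b, self.score_size, self.score_size), state

In use, the recurrent state would be carried from frame to frame, so the module accumulates the trajectory rather than reacting to a single score map.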


After obtaining a prediction of the location of the object in the test frame for each memory component, these predictions have to be unified into a single one. The three activation maps are merged using a set of fully connected layers, with ReLU as activation function and Dropout [51].
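A minimal sketch of such a merger is shown below; the layer widths, the dropout rate, and the extra scalar output (anticipating the uncertainty estimator of Section 4.4.1) are illustrative choices of ours rather than the exact architecture.

import torch
import torch.nn as nn

class Merger(nn.Module):
    # Fuses the long-term, short-term and sensory score maps (each 19x19)
    # into one final 19x19 score map plus a scalar confidence sigma.
    def __init__(self, score_size: int = 19, hidden: int = 512, p_drop: float = 0.3):
        super().__init__()
        d = score_size * score_size
        self.net = nn.Sequential(
            nn.Linear(3 * d, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, d + 1),          # 19*19 scores + 1 extra sigma output
        )
        self.score_size = score_size

    def forward(self, s_lt, s_st, s_sens):
        x = torch.cat([s_lt.flatten(1), s_st.flatten(1), s_sens.flatten(1)], dim=1)
        out = self.net(x)
        scores = out[:, :-1].view(-1, self.score_size, self.score_size)
        sigma = torch.sigmoid(out[:, -1])      # one way to keep sigma in (0, 1), consistent with the text
        return scores, sigma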

4.4 Model control

In the next few subsections, we present some contributions that prevent a misprediction in a single frame from polluting the memory or causing the model to shift towards another, similar object.

4.4.1 Delta control

For the estimation of the $\delta_i$ of the function in Section 5.3.2, we adopt an approach similar to Bayesian Neural Networks.

However, one of the drawbacks of MC Dropout is the fact that, at inference time, multiple forward passes have to be performed, increasing the overhead by a large factor and making it unsuitable for online object tracking, as it would lower the achievable frame rate. In our work, we adopted an approach similar to [27]: the output of the Merger network has size $19 \times 19 + 1$. The additional output $\sigma$ is the uncertainty estimator; the loss function of the merger is therefore modified as follows:

Definition 16: Modified loss formula for delta control

$L_{merger} = \sum_i \left( \sigma_i\, r(x_i^{LT}, x_i^{ST}; y_i) + \lambda \log \frac{1}{\sigma_i} \right)$  (4.4)

The function $r$ can be a simple mean squared error, but we adopted the one provided in [9], based on a hinge loss, keeping the training consistent with the rest of the network; it nevertheless remains a free choice.
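A hedged sketch of Eq. 4.4 is given below. The residual $r$ is replaced here by a plain mean squared error for brevity, whereas the thesis uses the hinge-based residual of [9]; the tensor shapes and the default value of $\lambda$ are assumptions.

import torch

def merger_loss(pred_scores, target_scores, sigma, lam: float = 0.1):
    # pred_scores, target_scores: (B, 19, 19); sigma: (B,) confidence in (0, 1).
    # r_i is a placeholder mean-squared residual standing in for the
    # hinge-based residual of [9].
    r = ((pred_scores - target_scores) ** 2).flatten(1).mean(dim=1)
    # sigma weights the residual, while lam * log(1 / sigma) prevents the
    # network from collapsing sigma to zero.
    return (sigma * r + lam * torch.log(1.0 / sigma)).sum()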


Note 2: Bayesian Neural Network

In a standard Neural Network, the optimization process focuses on finding the weights that minimize a certain error function, generally the least squared error:

$\theta = \arg\max_{\theta} p(D|\theta)$  (4.5)

$\theta = \arg\min_{\theta} \sum_{i=1}^{N} (y_i - f(x_i|\theta))^2$  (4.6)

Since the equation cannot be solved analytically, Stochastic Gradient Descent is used as an approximation in order to perform the minimization; the data $D$ are generally fed in batches. This corresponds to finding the maximum likelihood of the data $p(D|\theta)$; therefore, at inference time, we consider only the outcome with the highest probability, with no knowledge about the uncertainty of the model.

In Bayesian Neural Networks, on the other hand, we don't consider a single answer but the whole distribution over the output:

$p(y|x, D) = \int p(y|x, w)\, p(w|D)\, dw$  (4.7)

However, this integral is not analytically solvable, therefore approximations are needed to handle the above formula. The first family of methods that comes to mind for approximating an integral is Monte Carlo. In a few words, Monte Carlo methods sample a sufficient number of times (depending on the dimensionality, this number can vary considerably) from an unknown distribution, which is then approximated using the statistics extracted from this set of points. However, it is impossible to sample the outputs of a Neural Network, since the layers are generally deterministic. The only stochastic layers are Dropout layers; in fact, MC Dropout [19] exploits exactly this. In MC Dropout, neurons are randomly set to 0 and the forward pass is executed several times, so that we can now define an output distribution.
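For completeness, a minimal sketch of the MC Dropout recipe follows: dropout is kept active at inference time and several stochastic forward passes are aggregated. The helper name and the number of passes are our choices.

import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_passes: int = 30):
    # Keep dropout active at inference time and run several stochastic forward
    # passes; the spread of the outputs approximates the predictive uncertainty.
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()                  # re-enable only the dropout layers
    outputs = torch.stack([model(x) for _ in range(n_passes)], dim=0)
    return outputs.mean(dim=0), outputs.std(dim=0)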

MC Dropout [19] is nonetheless slow, because it requires multiple passes. An inexpensive method to overcome this issue is to use an additional output of the network, generally denoted by $\sigma$, which acts as the uncertainty estimator, together with the loss $L = \frac{(y - \hat{y})^2}{2\sigma^2} + \log \sigma$. By adding $\sigma$ to the loss, the network intuitively learns to output a large $\sigma$ when the aleatoric uncertainty is high. In the opposite case, when the model is relatively confident, the $\sigma$ value is pushed down to decrease the loss.

Intuitively, the σ gives an estimate of the certainty of the prediction: one can imagine that if the short- and long-term components are aligned, in the sense that both give a high prediction in the same spot, it is highly likely that the object is actually in that position, and therefore the σ value should be high. On the other hand, if the two memory components diverge in their predictions, or their confidence scores are low, there is a higher uncertainty and σ should take a low value. Moreover, we can expect the σ value to be a reliable proxy for the training of the merger. At the beginning of training, σ starts at random values with an average of 0.5; since the merger is still far from fully trained, it is more convenient for the network to decrease the values of σ as well. Finally, when the merger is closer to convergence, σ shall increase, since the predictions are close to the ground truth. We can therefore expect the σ loss to behave as depicted in Figure 4.2. The actual results are plotted in Figure 4.3 and indeed follow the predicted behaviour: as the merger gets better, it pushes the confidence up, reducing the σ loss. Careful attention must be paid to the value of λ in Eq. 4.4, since an unbalanced value would push σ either to 0 or to 1. Clearly, the σ loss will not reach 0, since it is a combined measure of both aleatoric and epistemic uncertainty, which are never 0.

Fig. 4.2: Expected behaviour of the σ loss, with $L = \sum_i \log \frac{1}{\sigma_i}$, during the first part of the merger training.

We exploit the information carried by the uncertainty estimator $\sigma$ during tracking by updating the loss of Eq. 2.15 as follows:

$L(f) = \frac{1}{|S_{train}|} \sum_{(x,c) \in S_{train}} \dots$


Fig. 4.3: Full track of merger loss during training.

4.4.2 Backward control

As explained in the chapters above, a big problem for the online learning module is model shifting or, more generally, model decay caused by the addition to the memory of labelled frames far from the ground truth. It is worth mentioning that, even though object tracking is considered a semi-supervised task, out of the hundreds of frames only the first one is truly annotated; the rest of the data is annotated by the model itself. Even a small inaccuracy can cause the model to diverge in an exponential manner, and it is almost impossible to recover from such an event.

We present a method to measure the contribution of each new memory sample to the final filter $f$. Frames with a remarkably high contribution are discarded, since they are likely to be off predictions.

Given the template set $S_{train}$ and a new entry $s_{new}$, the latter will be used by the optimization process to reduce the loss of Eq. 4.8. Intuitively, if it is relatively similar to the templates already present within $S_{train}$, the contribution of $s_{new}$ to the loss will be low, since the filter has already been optimized to discriminate the same object. On the other hand, if the new sample is very different from the previous ones, its contribution to the loss will be higher. Measuring the contribution of a new image $s_{new}$ to the loss $L$ (Eq. 4.8) therefore provides a good proxy for the shifting caused by the new template. Based on this intuition, we propose two methods to measure the contribution of a new template, discarding it in case the contribution is above a certain threshold $\beta$.

Gradient norm. A first method to calculate the contribution of an image embedding
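Purely to illustrate the gradient-norm idea introduced above, namely measuring the candidate sample's contribution to the online loss and rejecting it when the contribution exceeds the threshold $\beta$, the sketch below computes the norm of the loss gradient evaluated on the new sample alone. The loss function, the default threshold, and the helper names are assumptions rather than the exact procedure used here.

import torch
import torch.nn as nn

def accept_new_sample(filter_f: nn.Module, loss_fn, x_new, c_new, beta: float = 1.0) -> bool:
    # The contribution of the candidate sample is taken as the norm of the
    # gradient of the online loss, evaluated on that sample alone, with respect
    # to the current filter weights.
    loss = loss_fn(filter_f, x_new, c_new)
    grads = torch.autograd.grad(loss, list(filter_f.parameters()))
    contribution = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    # Samples whose contribution exceeds beta are likely mispredicted frames
    # and are rejected instead of being written into the memory.
    return contribution.item() <= beta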
