
Bachelor Informatica

Tracking dolphins in videos with the use of convolutional neural networks

Jesse van Remmerden

June 21, 2019

Supervisor(s): dr. T.R. Walstra; dr. F. Visser; dr. ir. R. van den Boomgaard

Informatica
Universiteit van Amsterdam


Abstract

In my thesis I researched the possibility of tracking multiple dolphins in video. For this I researched convolutional neural networks and their current architectures. I discuss how ResNet [13], which revolutionised the way convolutional neural networks are designed, and Faster R-CNN [32], which made it possible to locate multiple objects within an acceptable time frame, operate; both are standards in the field of CNNs. For my experiments I tested two current implementations: DeepLabCut [25] and Mask R-CNN [15]. In my experiments I found that DeepLabCut is only sufficient when a single dolphin is present in the image. The results of Mask R-CNN are significantly better when more than one dolphin is present in the image. This architecture had a high enough accuracy to answer my research question of whether it is possible to track multiple dolphins in a video.


Acknowledgements

This thesis would not be at this level without the following people: dr. T.R. Walstra, who was my thesis supervisor, and dr. ir. R. van den Boomgaard, who gave advice on which direction I should take with my thesis. I also want to thank dr. F. Visser, who came up with this research question and helped me with retrieving the data. Lastly I want to thank L. Smalbil, who was always there when I wanted to discuss problems and helped me by checking the language and structure of my thesis.


Contents

1 Introduction 7
1.1 Previous Work . . . 7
2 Theoretical background 9
2.1 Artificial Neural Networks . . . 9
2.1.1 Activation function . . . 9
2.1.2 Weights and Biases . . . 11
2.1.3 Forward Propagation . . . 11
2.1.4 Backpropagation . . . 11
2.2 Convolutional Neural Networks . . . 13
2.2.1 Convolution . . . 13
2.2.2 Convolutional Layer . . . 15
2.2.3 Fully Connected Layer . . . 15
2.2.4 Activation Function . . . 15
2.2.5 Pooling Layer . . . 15
2.2.6 Stochastic Gradient Descent in a Convolution Neural Network . . . 16
2.3 Dilated Convolution . . . 16
2.4 ResNet . . . 17
2.5 Transfer Learning . . . 18
2.6 Fully Convolutional Networks . . . 18
2.7 R-CNN . . . 19
2.7.1 R-CNN . . . 19
2.7.2 Fast R-CNN . . . 20
2.7.3 Faster R-CNN . . . 20
3 DeepLabCut 23
3.1 Convolutional Neural Network . . . 23
3.2 Pose Estimation . . . 24
3.3 Example Network . . . 25
3.4 Configurations . . . 25
3.4.1 Creating a project . . . 25
4 Mask R-CNN 27
4.1 Theory . . . 27
4.2 Network . . . 28
4.3 Data . . . 29
5 Experiments 31
5.1 Hardware . . . 31
5.2 DeepLabCut . . . 31
5.2.1 Subsets of parts . . . 31
5.2.2 Single dolphin . . . 35
5.2.3 Multiple dolphins . . . 35
5.2.4 Single and multiple dolphins . . . 36
5.3 Mask R-CNN . . . 36
5.3.1 Single Dolphin . . . 36
5.3.2 Multiple Dolphins . . . 37
5.3.3 Complete set . . . 37
6 Discussion 39
6.1 Results . . . 39
6.1.1 DeepLabCut . . . 39
6.1.2 Mask R-CNN . . . 41
6.1.3 Combined . . . 42
6.2 Future Work . . . 42
6.3 Conclusion . . . 43
A DeepLabCut configuration 49
A.1 Config.yaml . . . 49
A.2 pose-cfg.yaml - training . . . 51
A.3 pose-cfg.yaml - test . . . 52
B Mask R-CNN configuration 53
C Results 55
C.1 DeepLabCut . . . 55
C.1.1 Heatmaps Labels . . . 55
C.1.2 Subsets of Parts . . . 55
D Surfsara Lisa 59
D.1 Slurm system . . . 59
D.1.1 Example Slurm Script . . . 60
D.2 Important commands . . . 60


CHAPTER 1

Introduction

In the research of dolphins, one of the most important aspects is to track their movement in the water. Certain ways of swimming might indicate whether they are hunting, playing or migrating to a different place. In order to classify dolphin behaviour, overhead drone video images are used. These drones film from many metres above the water level. The video is then analysed by a researcher, whose first task it is to mark the path that each individual dolphin takes. Due to the erratic movement of the dolphins, this can take multiple days – time which might be better spent on other work.

In the last couple of years a lot of progress has been made in the field of object detection and tracking algorithms. This progress was made possible in part by the increasing availability of data and the increase in computational resources. This opens up the possibility to automate certain processes, one of which is tracking dolphins, from start to finish. The ultimate ideal is giving a video as input and having as output all the information needed to track the dolphins. In this thesis I will explore this possibility as well as other methods of tracking, to analyse whether it is possible to build a model which can automate the tracking of dolphins. In order to do so, I will examine multiple object detection and tracking algorithms and then use existing techniques to develop a new model. These models will be tested to see which model performs the best when tracking dolphins.

This thesis is split into multiple sections. First I will delve into the theory of how a convolutional neural network (CNN) is able to detect objects in images. This section is split into different parts, in which I explain, step by step, how a CNN is built. The first part explains how artificial neural networks, on which a CNN is based, are able to predict. From there I will explain what a CNN is and how it functions. This is necessary to explain how each part of the implementations I tested operates. In these sections, current architectures and techniques are explained to give a broader view of the current research in computer vision.

In separate chapters, I will describe both tested implementations in detail and how the implementations are configured. The configurations are used for the experiments. These experiments will be explained in detail in the discussion.

1.1 Previous Work

In previous years a lot of progress has been made in the field of object detection. Arguably the most important contribution is ResNet [13]. In this paper the authors present an improvement on the existing sequential CNN architecture by adding residual shortcuts. The residual shortcuts were such an improvement that this model became the basis for subsequent CNN architectures. One of the biggest improvements of ResNet is ResNeXt [43], which improved ResNet by redesigning the convolutional layers.

In the ResNet architecture, each block is built to be of a single width and sequential, but for ResNeXt this width is changed to 32 in their tests. The reason for this change lies in the fact that they show an increase in accuracy, with ResNet-50 having a 23.9% top-1 error and ResNeXt-50 having 22.2%. For ResNet-101 this error was 22.0%, with ResNeXt-101 having 21.2%. Although the difference between the two models is small, it is still an improvement, especially when considering that the number of computations is the same.

Another improvement of ResNet are Feature Pyramid Networks (FPN) [20], which use ResNet as a basis and expand on it by combining the results from ResNet with feature extracting techniques. This combination resulted in a small increase in average recall (AR), of 8.0% to 56.3%, and for smaller objects it increased by 12.9%. This was tested with Detectron [12], which is a CNN framework in which one can use a wide spectrum of CNN architectures and models.

A more specialised model is DeepLabCut [25], which itself is an implementation of DeeperCut [16]. DeepLabCut was made specifically to detect the movements of animals. In their experiments, [16] used data of mice movements and trained on the mice's movement features. They started with a ResNet-50, which was pre-trained on ImageNet [5]. This enabled them [25] to reach a high accuracy, with an average error of only 5 pixels.


CHAPTER 2

Theoretical background

A convolutional neural network (CNN) is a computer system which was designed to classify objects within an image. The design of a CNN is based on how we humans process vision and locate and classify objects within this vision. This is the same design as an artificial neural network (ANN), which has in essence the same structure as a CNN and therefore forms the basis of a CNN. This chapter explains how a CNN is designed and how each part operates. To understand it, I will first explain the functionality of an ANN: how an ANN is built, its activation functions, how it predicts and how it learns. The next step is the explanation of how a CNN operates; this is the next step because a CNN is in essence an ANN. This explanation focuses on how a CNN differs from an ANN, what convolutions are and how they are used. The last sections explain current CNN implementations and design options which are used in the models I used for my experiments.

2.1 Artificial Neural Networks

An artificial neural network (ANN) is a widely used machine learning paradigm. Because of its adaptive use and its ability to model non-linear relationships, it is a very powerful tool. An ANN is based on the biological neural network. Just as a biological neural network, an ANN contains connections which are all linked. Moreover, neural networks also contain artificial neurons that determine which parts of the network (or layers) are activated. An artificial neural network has three main components: (1) the input layer, (2) the hidden layer, and (3) the output layer. The input layer is used to handle the input of the data. If the data has five variables, the input layer will have five nodes. A neural network can have multiple hidden layers, which each can have a different number of nodes or different configurations. The output layer is the final layer of the network and deals with the actual output of the network; the number of nodes of the output layer should therefore be based on the input and the desired output of the network.

2.1.1 Activation function

Each neuron in an ANN has an activation function, which determines when the neuron activates or 'fires'. The input to an activation function either comes from the output of the previous layer or, if it is the input layer, from the user. A variety of activation functions is available; in the following subsections I will discuss the most important ones.

Sigmoid

The Sigmoid function is a logistic function, ranging from 0 to 1. Because of this limited range a high input value will never skew the results too much. The Sigmoid function is usually used for binary classification (i.e. YES/NO answers).

\sigma(x) = \frac{e^x}{1 + e^x} \quad (2.1)

Tanh

The Tanh function is also a logistic function, but with an increased range of -1 to 1. This increased range decreases the chance that a network gets stuck during training.

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \quad (2.2)

ReLU

In the current field the rectified linear unit (ReLU) [27] is the most used activation function for hidden layers [29]. ReLU has the advantage that it only activates positive nodes, but it can cause problems if too many nodes return 0.

\text{ReLU}(x) = \max(0, x) \quad (2.3)

Two proposed solutions for this are LeakyReLU [24] and PReLU [14]. LeakyReLU will, if the input is negative, not set the value to zero, but will set it to a fraction. PReLU does exactly the same, except that the fraction is not a constant, but a parameter that can be learned by the network. Equation 2.4 shows how this works mathematically; with LeakyReLU, a_i would be a constant number between zero and one set by the user.

It was shown [44] how these functions can lead to an increase in precision. Moreover, they also have the advantage that fewer machine instructions are needed compared to Sigmoid and Tanh.

\text{PReLU}(x_i) = \begin{cases} x_i & x_i > 0 \\ a_i x_i & x_i \le 0 \end{cases} \quad (2.4)

Softmax

Softmax is an activation function mostly suited for the output layer in multiclass classification. It takes n values from the previous layer and normalises them. These normalised values always sum to 1. This is useful if one wants to display the chance (in percentages or decimals) of a given prediction.
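To make the activation functions above concrete, the following sketch implements them with NumPy. This is an illustration only: it follows equations 2.1 to 2.4 and the Softmax description, and the parameter a in prelu is an assumed constant (with a fixed a it behaves like LeakyReLU).

import numpy as np

def sigmoid(x):
    # Logistic function, equation 2.1: output between 0 and 1
    return np.exp(x) / (1.0 + np.exp(x))

def tanh(x):
    # Equation 2.2: output between -1 and 1
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    # Equation 2.3: only positive inputs pass through
    return np.maximum(0.0, x)

def prelu(x, a=0.01):
    # Equation 2.4: negative inputs are scaled by a (assumed constant here)
    return np.where(x > 0, x, a * x)

def softmax(x):
    # Normalises n input values so that they sum to 1
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # e.g. [0.09, 0.24, 0.67]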

2.1.2 Weights and Biases

An artificial neural network learns by using weights and biases (w). In an ANN, each node - except for the input layer - has a weight assigned to the output of the previous layer and (if one chooses to include one) a bias. Each weight acts as a scalar on the output of the corresponding node, indicating the importance of that node. The biases are an addition to the total of each output and weight combined, as can be seen below:

p_j(t) = \sum_i o_i(t) w_{ij} + w_{0j}

Figure 2.1: p_j(t) is the function to calculate the total of a node and o_i is the output of the previous layer, which has i nodes

2.1.3 Forward Propagation

In the previous section, I explained how a single activation function works. This section focuses on how data is passed to the next layer – this is called forward propagation. When a node gets the output of the nodes of the previous layer X, it will first multiply these values with the corresponding weights W. These outcomes are then summed. This sum is given to the activation function and its output is in turn passed to the next layer.

x_m^1 = A(x_0^0 w_0 + x_1^0 w_1 + \ldots + x_n^0 w_n)

Figure 2.2: In this equation A is the activation function

The procedure is to be repeated until it reaches the output layer, where it will go through the last activation function, and passes the predictions to the user.
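A minimal sketch of forward propagation through one hidden layer, assuming NumPy; the layer sizes and random weights are purely illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.random(5)              # input layer with five nodes
W1 = rng.random((5, 3))        # weights from the input to a hidden layer of three nodes
W2 = rng.random((3, 1))        # weights from the hidden layer to a single output node

hidden = sigmoid(x @ W1)       # weighted sum per hidden node, then the activation
output = sigmoid(hidden @ W2)  # repeated until the output layer is reached
print(output)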

2.1.4 Backpropagation

Instead of just using forward propagation, most ANNs also use backpropagation. Backpropagation allows for better adjustment of the weights (W) and therefore a better approximation of the results of the training data. This is the method an ANN uses to learn and adjust the weights. In this section I will describe how it evaluates the training model and how it, with this evaluation, updates the weights to create better predictions.

Loss Function

The first step is to evaluate the model; for this a loss function is used. A loss function takes the predicted data and the ground truth value, which is the value on which the model is trained, as input and shows how close the prediction is to the ground truth data. A wide range of loss functions is available; which loss function is chosen depends on the prediction data format and how it should predict. In this section I will describe the most used loss functions and use them as examples to show how a loss function evaluates a model.

One of the most often used loss functions is the Mean Squared Error (MSE). This calculates the average error and is always strictly positive. The formula (2.3) takes two inputs: the prediction of the model (y) and the ground truth (ẏ). From here you see the distance between the prediction and the ground truth. This should be as low as possible, which should be the case for every loss function.

MSE = \frac{1}{2N} \sum_{i=1}^{N} \left( y^{(i)} - \dot{y}^{(i)} \right)^2

Figure 2.3: The Mean Squared Error formula

For binary classification, the cross-entropy (CE) loss function proves a better fit. CE calculates the divergence between the predicted label and the original label. Again this formula (2.4) takes the prediction of the model (y) and the ground truth (ẏ) as input.

CE = -\frac{1}{N} \sum_{i=0}^{N} \left[ y^{(i)} \log(\dot{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \dot{y}^{(i)}) \right]

Figure 2.4: Cross entropy

When the labelling is non-binary (for instance in multiclass prediction), multiclass cross-entropy (MCCE) can be used as the loss function. In essence, this is similar to CE, the only difference being that it calculates the entropy of each class and then takes the sum over all classes. These examples are just a small part of the total number of loss functions, but with this explanation I wanted to show how loss functions are used to evaluate a model. This evaluation is then used to update the weights, with the use of gradient descent.
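The MSE and CE losses can be written directly from Figures 2.3 and 2.4; the sketch below assumes NumPy and follows the thesis' notation, with y the prediction and y_dot the ground truth.

import numpy as np

def mse(y, y_dot):
    # Figure 2.3: mean squared error, always non-negative
    return np.mean((y - y_dot) ** 2) / 2.0

def cross_entropy(y, y_dot, eps=1e-12):
    # Figure 2.4 as written in the thesis; eps avoids log(0)
    y_dot = np.clip(y_dot, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_dot) + (1.0 - y) * np.log(1.0 - y_dot))

y = np.array([0.9, 0.2, 0.7])      # predictions of the model
y_dot = np.array([1.0, 0.0, 1.0])  # ground truth labels
print(mse(y, y_dot), cross_entropy(y, y_dot))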

Gradient Descent

With Gradient descent a loss function is used to adjust the model. Effectively an optimisation technique, it uses the partial derivative to find a local minimum so as to update an individual weight.

\theta_{t+1} = \theta_t - \alpha \frac{\partial E(X, \theta_t)}{\partial \theta}

Figure 2.5: Gradient descent

In the equation E is the error function, for which we will use the MSE as an example with θ being the collection of all the weights and biases and α being the learning rate (i.e. a variable which is set by the user to manage the effect of each iteration of the gradient descent). In the equation below, it is shown how the partial derivative of a single individual weight is calculated.

\frac{\partial E(X, \theta)}{\partial w_{ij}^k} = \frac{1}{N} \sum_{d=1}^{N} \frac{\partial}{\partial w_{ij}^k} \left( \frac{1}{2} \left( y^{(d)} - \dot{y}^{(d)} \right)^2 \right) = \frac{1}{N} \sum_{d=1}^{N} \frac{\partial E(y^{(d)}, \dot{y}^{(d)})}{\partial w_{ij}^k}

Figure 2.6: Gradient descent of a single weight

In this figure w is the weight, k is the layer index, i is the index of the incoming node and j is the index of the node in layer k. To calculate this part, one has to take the sum of the partial derivative of the error function with respect to that weight, for each input-output pair of this node.

Using this computation, one can calculate the gradient descent of the entire network. Let me explain this by going through each of the steps:

1. A set of training data is forwarded through the network, in which the outcome (\dot{y}^{(i)}) is stored;
2. Start from the output layer and calculate the gradient descent for each node;
3. After that, calculate \frac{\partial E(X, \theta)}{\partial w_{ij}^k} for each node;
4. Apply this to the weights by w_{ij}^k = w_{ij}^k - \alpha \frac{\partial E(X, \theta)}{\partial w_{ij}^k};
5. Repeat each step.

Stochastic Gradient Descent

One way to simplify the gradient descent procedure is stochastic gradient descent (SGD) [19]. Because of its relevance for my thesis and its prominent usage in DeeperCut [16], I will describe this procedure in a bit more detail.

The way SGD works is quite similar to regular gradient descent, but it differs in a couple of ways. The most important difference is that not every single result of the loss function is used for the iteration, but only a single one which is picked randomly. After it has picked one, it uses the same procedure as regular gradient descent. SGD works best when the training set is large and training time is a bottleneck [2].
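A sketch of stochastic gradient descent for a single linear neuron, assuming NumPy: one randomly picked training example drives each weight update, as described above. The learning rate and data are illustrative.

import numpy as np

rng = np.random.default_rng(1)
X = rng.random((100, 3))            # training inputs
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true                      # ground truth of a linear model

w = np.zeros(3)                     # weights to learn
alpha = 0.1                         # learning rate

for step in range(1000):
    d = rng.integers(len(X))        # pick a single random training example
    y_hat = X[d] @ w                # forward pass for that example
    grad = (y_hat - y[d]) * X[d]    # gradient of 0.5 * (y_hat - y)^2 w.r.t. w
    w -= alpha * grad               # gradient descent update

print(w)  # approaches w_true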

2.2 Convolutional Neural Networks

In the previous section, I explained how an artificial neural network operates and can learn a specific task. In this section I will explain how a convolutional neural network (CNN) operates, by describing the differences between a CNN and an ANN. First I will discuss what a convolution is and how it is used in visual object detection. After that, I will explain how the various types of layers function and how they are connected.

2.2.1 Convolution

A convolution is a mathematical operation which takes two input functions (A and B) and outputs a third function. This third function indicates how A is affected by B. To give a simple example: when two rocks are thrown into the water at two different places, they both produce waves. Where the waves meet and affect each other is what a convolution calculates, by taking the first wave (A) and the second wave (B) and seeing how the second wave affects the first.

(f \ast g)(t) = \int_{-\infty}^{\infty} f(\tau) g(t - \tau) \, d\tau \quad (2.5)

(f \ast g)(t) = \sum_{j=-t}^{t} f(n - j) g(j) \quad (2.6)

A convolution is normally a continuous function (2.5) in which t is the time step or the index of the function. The other function (2.6) is the discrete function and is the one that will be the default version used in most CNN implementations.

When working with images, the default discrete convolution function is extended to function with two dimensional arrays (2.7).

(f[i, j] \ast g[i, j]) = \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} f[m, n] \times g[i - m, j - n] \quad (2.7)


In a normal convolution it is standard to use zero padding. This is used when an array input is needed, but the given index of the array is out of bounds. In that situation, 0 is always used for the calculation. Within a CNN this would cause problems, so to solve this the calculation of the convolution starts at the first index at which zero padding does not happen and stops at the last index where this also does not happen.

Figure 2.7: The filter is the convolution and the input is the image. Image from [33]

Having discussed the theory, what can we do with a convolution? One of the usages is trying to find an edge in an image. For this an edge detection kernel (a kernel is the name of the matrix used for the convolution) can be used.

E = \begin{pmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{pmatrix} \quad (2.8)

When applied to a picture you get the following results.

Figure 2.8: Normal image. Image from [26]

Figure 2.9: After the filter is applied

This simple convolution helps us to find the outline of the dolphins in the first image, which results in the second picture. In the second picture the output array of the image has the highest peaks at the detected edges, which is useful for a CNN to determine what is in the image.
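The edge-detection example can be reproduced with SciPy; the sketch below applies the kernel E from (2.8) to a small synthetic grayscale array standing in for the dolphin image.

import numpy as np
from scipy.signal import convolve2d

# Edge detection kernel E from equation 2.8
E = np.array([[-1, -1, -1],
              [-1,  8, -1],
              [-1, -1, -1]])

# A small synthetic grayscale "image": a bright square on a dark background
image = np.zeros((64, 64))
image[20:40, 20:40] = 1.0

# mode="valid" starts and stops where no zero padding is needed, as described above
edges = convolve2d(image, E, mode="valid")
print(edges.shape, edges.max())  # the highest peaks appear at the detected edges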

Colour convolutions

When working with colour images, the dimensions of the image change. A grayscale image always has the dimensions (H × W × 1), but with colour images this can increase to (H × W × 3) or (H × W × 4). In this example, I will use RGB. The first three dimensions of a colour picture will always be red, green and blue, where each of their corresponding values expresses how strongly each of these colours is represented. Sometimes a fourth dimension is used for the alpha value, which is used for the opacity within a picture.


When processing colour images with a convolution, there are three possibilities. The first option is to change the image to a grayscale image before feeding it to the CNN. Computationally, this is the simplest option, but it will result in a loss of information – filtering out colour can decrease the accuracy of the network. The second option is to convolve each colour separately, where each dimension has its own kernel or one shared kernel. The downside is that with the next convolution these dimensions also need to be calculated, which can slow down the network. The third option is a 3D convolution of the image. This is quite similar to the previous method, but each outcome of the kernel is added up to a 2D output.

2.2.2 Convolutional Layer

In the previous section I explained how convolutions are calculated and how they can be used to find attributes in an image. In this section I will explain how a convolution is used within a convolutional layer.

Within a layer, multiple convolutions happen on the same input. For instance, a layer with a kernel size of (3 × 3), stride 2 and 8 convolutions, applied to an input image of (9 × 9 × 1), will result in an output of (4 × 4 × 8). These 8 dimensions are each the output of a different convolution. This output is then propagated further in the network to the next layer.
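The shape arithmetic in the example above can be checked with a small helper, assuming no zero padding as described earlier.

def conv_output_size(input_size, kernel_size, stride):
    # Number of positions a kernel of kernel_size fits into input_size with the given stride
    return (input_size - kernel_size) // stride + 1

# A (9 x 9 x 1) input, a (3 x 3) kernel, stride 2 and 8 convolutions
side = conv_output_size(9, 3, 2)
print((side, side, 8))  # (4, 4, 8)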

Learning Convolution Layer

In the section about artificial neural networks, I explained how such a network functions and learns. One of the important parts was the weights and how they function within a network. A CNN works similarly in the convolutional layer, but each variable in a kernel is treated as a weight. For instance, a (3 × 3) kernel is basically a matrix filled with weights.

In the section about ANNs I explained how one node retrieves the output of every node in the previous layer. In a CNN these inputs, at the first layer, are the image. The convolution is, in essence, a helper function which collects the inputs and applies the weights to them. Just like in an ANN these weights are learned with backpropagation in a similar way, but expanded from a 1D space to a 2D space.

2.2.3 Fully Connected Layer

In most image classification CNNs, the last layers are fully connected layers. These layers are almost identical to a normal artificial neural network. When a fully connected layer is connected to a pooling or convolutional layer, each node will be connected to each node of the previous layer.

2.2.4 Activation Function

Just like an ANN, a CNN makes use of activation functions. When used, these are executed for every output of the previous layer. This is the same as with an ANN, but again expanded from a 1D model to a 2D model.

This means that for an output of (32 × 32 × 7), 7168 activation functions fire. This is one of the reasons why ReLU is one of the most popular activation functions within most CNN architectures [29]: its execution time is low.

2.2.5 Pooling Layer

One way to decrease the computation time of a network is to reduce the number of inputs. This can be done by a pooling layer. This not only decreases the computational resources required, but it also decreases the chance of overfitting. This is the case because fewer inputs are used and, depending on the kind of pooling layer, the inputs are averaged. A pooling layer does not change anything about the depth of the layer, only the width and the height. For instance, an input can be changed from (120, 120, 6) to (60, 60, 6), but never to (60, 60, 3).

To calculate a pooling output, just as with the convolutional layers, a window is slid over the input. The inputs of this window are used to calculate a single value used for the output. The calculation done depends on the sort of pooling. Here I will describe the two most used forms of pooling.

Max Pooling

With max pooling the highest number within the window is used as the output for the pooling. In figure 2.10 you see how from the first window 20 is chosen, which is the highest number in that pooling window. It was shown [34] how this method gives better results compared to other methods. This makes max pooling sensitive to patterns within a window. It also helps to find specific features in an image.

Figure 2.10: Max and average pooling. Image from [38]

Average pooling

With average pooling, not the highest value of the window is used, but the average of all values in the window. Average pooling is less useful for finding a specific feature, but more useful for capturing the average information of the image.
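A sketch of max and average pooling with NumPy over non-overlapping (2 x 2) windows; note that the depth stays untouched while the width and height shrink.

import numpy as np

def pool2d(x, size=2, mode="max"):
    # x has shape (H, W, C); H and W must be divisible by size
    h, w, c = x.shape
    windows = x.reshape(h // size, size, w // size, size, c)
    if mode == "max":
        return windows.max(axis=(1, 3))   # highest value per window
    return windows.mean(axis=(1, 3))      # average value per window

x = np.random.default_rng(0).random((120, 120, 6))
print(pool2d(x, 2, "max").shape)   # (60, 60, 6)
print(pool2d(x, 2, "mean").shape)  # (60, 60, 6)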

2.2.6 Stochastic Gradient Descent in a Convolution Neural Network

In a CNN, stochastic gradient descent is done basically the same as in an ANN. Both of the implementations I tested, DeepLabCut [25] and Mask R-CNN [15], use it to train a model.

2.3 Dilated Convolution

Under normal circumstances, the output of a convolution is always the result of a single point and its nearest neighbours. In a dilated convolution this is slightly modified: not the nearest neighbours are used. Figure 2.12 shows how a dilated convolution takes input further away from the centre.

(f[i, j] \ast g[i, j])(t) = \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} f[m, n] \times g[i - m \times t, j - n \times t] \quad (2.9)


In a dilated convolution (2.11), t is used to determine how dilated the convolution will be. If t = 1, it behaves like a normal convolution, but when t is increased, the receptive field takes into account a larger portion of the image, albeit with less detail. This can be seen in the picture below.

Figure 2.12: Input field of a dilated convolution. Image from [30]

Dilated convolutions were first introduced in a CNN in [3]. In their paper it is shown that this increases the accuracy for image segmentation without increasing the stride of the convolution or the kernel size. The approach decreased the number of computations without lowering the accuracy of the result.
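One way to sketch a dilated convolution is to spread the kernel entries apart with zeros before convolving, which matches equation (2.9) for a dilation factor t; a NumPy/SciPy illustration, not the implementation used in [3].

import numpy as np
from scipy.signal import convolve2d

def dilate_kernel(kernel, t):
    # Insert t-1 zeros between kernel entries, so neighbours further away are sampled
    k = kernel.shape[0]
    dilated = np.zeros(((k - 1) * t + 1, (k - 1) * t + 1))
    dilated[::t, ::t] = kernel
    return dilated

kernel = np.ones((3, 3))
image = np.random.default_rng(0).random((32, 32))

normal = convolve2d(image, kernel, mode="valid")                      # t = 1
dilated = convolve2d(image, dilate_kernel(kernel, 2), mode="valid")   # t = 2, wider receptive field
print(normal.shape, dilated.shape)  # (30, 30) (28, 28)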

2.4 ResNet

One of the biggest shifts in convolutional neural network research was ResNet, or Residual Learning, which was presented by He et al. [13]. ResNet is different from previous networks because it can learn to skip layers.

Figure 2.13: How the ResNet shortcut looks in the network. Image from [13]

With the new mathematical structure, a block - a group of convolutional and/or pooling layers - became y = F(x) + x, instead of the normal y = F(x). The importance of this is that the output F(x), which is a function representing multiple convolutional layers, can become zero; because of the addition of x, which is the output of the previous block, called the identity, this block will in essence be skipped when predicting images.

A ResNet network can learn which blocks are useful for the designated task and which are not. The final mathematical equation for a block is y = F(x, {W_i}) + x, in which W_i represents the weights of the network. F(x, {W_i}) can represent multiple layers in the network, each with their own weights.

In a convolutional neural network, the input and output of a block can have different dimensions. They [13] solve this by having two versions of a block. The previous version, which I discussed above, is used when the input and the output of a block are of the same dimensions. If the dimensions of the input and output do not match, y = F(x, {W_i}) + W_s x is used. In this formula W_s is not a weight, but a linear projection that changes the dimensions of the identity to match those of the output.

The layers in the network use ReLU (see [27]) as their activation function. The ReLU is applied after the identity is added, which means the ReLU output is handed to the next block. He et al. [13] tested two versions of the block architecture: a version in which each block had two convolutional layers and one in which it had three convolutional layers, after which the identity is added. In their tests, He et al. showed that the two-layer block version worked best with their 18 and 34 layer models (ResNet-18 and ResNet-34) and the three-layer version worked best with their 50, 101 and 152 layer models (ResNet-50, ResNet-101 and ResNet-152).

Thus He et al. showed the effectiveness of this architecture. In the end this was such an improvement with respect to the previous architectures that right now almost all CNN applications use a version of ResNet or an improved, modified version of ResNet. This is also true for the models I tested; keep this in mind when it is not clear which architecture is used.
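A minimal sketch of the two-layer residual block described above, assuming PyTorch: the identity x is added to F(x) before the final ReLU, and a 1x1 projection (the W_s above) is used only when the dimensions differ. An illustration, not the exact blocks of [13].

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # F(x): two convolutional layers with batch normalisation
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # W_s: linear projection of the identity when the dimensions do not match
        self.project = None
        if stride != 1 or in_channels != out_channels:
            self.project = nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)

    def forward(self, x):
        identity = x if self.project is None else self.project(x)
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + identity)   # y = F(x) + x, ReLU applied after the addition

block = ResidualBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 128, 16, 16])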

2.5 Transfer Learning

One of the problems I faced while researching my thesis was the complete lack of annotated data. I had two solutions: spend a lot of time labelling a huge set of images, or use a smaller annotated set in combination with transfer learning to enhance the accuracy. I chose the second option, which I will describe in this section.

With transfer learning the weights are not randomised at the start, but initialised with a pre-trained set of weights. A survey [45] on how transferable trained features are shows how pre-trained weights can improve the results when implemented in the correct way. The reason this works is that especially the first layers are general in their use; further into the network the layers specialise on the specific classes.

When choosing pre-trained weights it is important that the weights are trained for general cases. For example, when using pre-trained weights which were trained on detecting cars and fine-tuning them on detecting dolphins, the results will be lower than when more generally trained weights are used. If these weights, which are trained on cars, are fine-tuned on detecting trucks, the results would be better than when using more generalised weights.

In my models, I used pre-trained weights which were trained on either the ImageNet [5] or the COCO dataset [21]. Both of them are trained on a general set of classes. The COCO weights were trained on a set of 40 classes and the ImageNet weights were trained on 200 classes. For each model I will describe which pre-trained weights were used.
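A sketch of how transfer learning typically looks in code, assuming torchvision (older versions expose the pretrained flag used here, newer ones a weights argument): the general early layers are frozen and only the final layer is replaced for the new classes. The number of classes is an illustrative assumption.

import torch.nn as nn
from torchvision import models

# Start from weights pre-trained on ImageNet instead of random initialisation
model = models.resnet50(pretrained=True)

# Freeze the general early layers so only the task-specific part is fine-tuned
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for the new task (e.g. dolphin / background)
model.fc = nn.Linear(model.fc.in_features, 2)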

2.6 Fully Convolutional Networks

One of the components of DeeperCut [16] and DeepLabCut [25] is a Fully Convolutional Network (FCN) [22]. In this section I will give a short description of FCN and I will describe in a later section how it is used within DeeperCut [16].

An FCN is not used for image classification, but for object detection. To do so, it uses semantic segmentation. The first part of an FCN is almost identical to a normal CNN, in that it uses the same convolutional layers. For the second part, Long et al. [22] replaced the fully connected layers with more convolutions. The output layer consists of n convolutions, in which n depends on the number of classes the network is tested on. From here the convolutions are up-sampled a couple of times and, in the end, merged so that the result is a picture.


Figure 2.14: A Fully Convolutional Network. Image from [22]

Long et al. [22] used a pixel-wise loss function to train the network. With this function each predicted pixel is tested against the ground truth. For the gradient descent, they used SGD (which was already explained in a previous section).

L(y, \dot{y}) = -\sum_{i \in C} y^{(i)} \log(\dot{y}^{(i)})

Figure 2.15: Pixel-wise loss function

2.7 R-CNN

One of the more problematic challenges for a CNN is to locate individual items in an image. It is possible to slide windows of different sizes over an image, but this would lead to too many computations. One solution to this challenge is R-CNN [11] and its improved versions Fast R-CNN [10] and Faster R-CNN [32]. In Mask R-CNN [15], Faster R-CNN is one of the significant parts of the architecture. In this section it will be described to give a better understanding of how Mask R-CNN operates.

2.7.1 R-CNN

Figure 2.16: R-CNN structure. Image from [39]

Ross B. Girshick et al. [11] describe a way to localise those parts of an image which can be important. This is not done by the previous methods of SIFT [23] or HOG [4], but by a new region proposal function. In the paper, the authors discuss that the approach is agnostic as to which algorithm is used; in their tests, however, Selective Search [41] is used. For each image, 2000 regions are detected, which are each propagated through the same CNN. One of the problems of this is that the CNN needs images of the same size as input. This problem is solved by cropping each of the found regions to the same size before it is entered into the CNN. For the classification itself, a Support Vector Machine (SVM) is used. Moreover, a regression on the bounding boxes is used as well to make them more accurate.


R-CNN scored best in every category of the PASCAL VOC 2010 dataset [8]. Still, this architecture has some fundamental problems regarding its execution time. Because 2000 regions are created from each image, it takes a while to process them: around 47 seconds per image [10]. Training is also slow; it would take more than two days on a GPU.

2.7.2 Fast R-CNN

Figure 2.17: Fast R-CNN structure. Image from [36]

Ross B. Girshick et al. [10] improved R-CNN by changing when and how the regions of interest are found. In R-CNN this was done before the image was passed through the CNN; in Fast R-CNN it is done after. In this architecture, the whole image is processed by the CNN. This creates a convolutional feature map, from which an RoI pooling layer is used for each feature found. This feature image is then warped so that each found region has the same dimensions. Each feature is processed by a sequence of fully connected layers which produce a feature vector. The feature vector is in turn processed by two different branches of the network – one for the classification and one for the bounding box. For the classification, softmax is used in place of an SVM [11]. For the bounding box a regressor is used to pinpoint a better position for the bounding box.

L(p_i, p_i^*, t_i, t_i^*) = L_{cls}(p_i, p_i^*) + \lambda \left[ p_i^* \ge 1 \right] L_{reg}(t_i, t_i^*) \quad (2.10)

For training, a joint loss function (2.10) was used which combines both the loss function of the classification and the loss function of the regression. For the classification, p_i is the probability of the classification and p_i^* is a binary ground truth, which is 1 if it is the classified class, and 0 if it is not. p_i can take a value between 0 and 1, with 0 meaning it is not the classified object and 1 that it is the classified object. t_i = (t_x, t_y, t_w, t_h) is the set of predicted coordinates of the bounding box and t_i^* is the ground truth of the bounding box.

2.7.3 Faster R-CNN

Figure 2.18: Faster R-CNN structure. Image from [37]

The bottleneck of Fast R-CNN is the use of Selective Search [41]. Shaoqing Ren et al. [32] removed the dependency on Selective Search by making the network learn the RoIs. Faster R-CNN uses a Region Proposal Network (RPN) for this, which is combined with the architecture of Fast R-CNN.

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \quad (2.11)

This RPN is trained separately from the normal R-CNN, with its own loss functions. The loss function (2.11) of the RPN is a combination of multiple loss functions (2.12). The first function (2.12a) is used for classification and is a log-loss function. The second loss function (2.12b) is used for the bounding box. This function is the summation of another loss function (2.12c), which takes as its input the difference between each coordinate of the predicted bounding box and the ground truth bounding box. This function (2.12c) is a smooth L1 function, which for large differences returns a lower value than a squared loss would.

L_{cls}(p_i, p_i^*) = -\left( p_i^* \log(p_i) + (1 - p_i^*) \log(1 - p_i) \right) \quad (2.12a)

L_{reg}(t_i, t_i^*) = \sum_{j \in \{x, y, w, h\}} L1_{smooth}\left( (t_i)_j, (t_i^*)_j \right) \quad (2.12b)

L1_{smooth}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \quad (2.12c)

For each image, a number of bounding boxes are used (In the paper this value was 9). Three scales are used: 128, 256 and 512, with each trying the following aspect ratios: 1:1, 1:2, 2:1. Each proposed region is then used as input for the Fast R-CNN [32] part of the architecture, which normally would have been the output of the Selective Search [41]. These bounding boxes are fine tuned during the learning process.

The development of R-CNN and its improved versions is one of the biggest improvements in locating multiple objects within an image. Currently Faster R-CNN is the most popular CNN for object localisation, because of its accuracy as well as its speed.


CHAPTER 3

DeepLabCut

While researching tracking methods for dolphins, I came across the framework DeepLabCut [25]. With DeepLabCut one can track multiple points on a person or animal. It is a complete research framework, covering every step from labelling to refining a trained model. For this they built a Python package around a slightly modified version of DeeperCut [16]. DeeperCut is a pose estimation CNN, in which multiple joints of a person are located with the help of a CNN and refinement of the found locations.

In this chapter I will explain how DeeperCut [16] finds those points and refines them. I will also discuss how DeepLabCut [25] is used and how it modified DeeperCut [16].

3.1 Convolutional Neural Network

In the introduction of this chapter, I discussed how DeepLabCut [25] uses DeeperCut [16] as its backbone for finding each label. This section will explain how DeeperCut designed the CNN part of its pose estimation network.

DeeperCut itself is an improved version of DeepCut [31]. DeeperCut improved on DeepCut by replacing the old CNN architecture with ResNet [13], which they modified to find each joint. In their paper, Eldar Insafutdinov et al. [16] first discussed two previous implementations used for image segmentation: FCN [22] and Liang-Chieh Chen et al. [3]. FCN I discussed already in the previous chapter. The architecture of Liang-Chieh Chen et al. [3] I discussed partly in the dilated convolution section, in which I described that they used dilated convolutions to get better results for image segmentation. Eldar Insafutdinov et al. [16] tested both implementations to see which gives better results.

Eldar Insafutdinov et al. [16] got the best results with the architecture of Liang-Chieh Chen et al. [3], but because of GPU memory issues they chose a hybrid architecture, with parts of both Liang-Chieh Chen et al. [3] and Jonathan Long et al. [22]. Their approach started with an image classification ResNet-101 [13], which was pre-trained on the ImageNet dataset [5]. They removed the last pooling layer, the fully connected layers and the softmax layer. Moreover, the stride of the first convolution was lowered from 2 to 1 to prevent information loss, and all (3 × 3) convolutions in the fifth bank were changed to dilated convolutions. In the same manner as with the FCN, they upsampled this output, which results in a heatmap for each label, from which each label is located.

While learning, they used a ground truth region around the marked pixel. Anything within range k of the marked pixel would return 1 (in the paper they used k = 15), otherwise it would return 0. This was combined with a cross-entropy loss function, which was jointly trained with the pose estimation (discussed in the next section).

Intermediate supervision

One of the adjustments they made was to use intermediate supervision. Before ResNet [13], a lot of networks had the problem of vanishing gradients, and used, for instance, score maps as inputs for subsequent layers [42]. Eldar Insafutdinov et al. [16] argue that this is no longer necessary because of ResNet (according to the authors, ResNet solved this problem). Shih-En Wei et al. [42] also showed that it helps to find the spatial relation between parts in an image. To still get this effect, Eldar Insafutdinov et al. [16] added part loss layers in the fourth convolution bank. In their results, it was shown how this addition increased the accuracy of the network.

3.2 Pose Estimation

After each point is found, DeeperCut [16] applies pose estimation for each found point in relation to the other points, which have a different label. This is done in two parts. First, Eldar Insafutdinov et al. [16] regress between two points that were found. Second, they use the regression to calculate each pair's cost and use this within a logistic model.

After a point k is found, which is a tuple containing the x and y coordinates (k = (x_k, y_k)), k is related to a joint class c, which comes from the set of joint classes (c \in C). For each other joint class c' \in C \setminus c, the relative distance from c is calculated, so you get t^k_{cc'} = (x_{c'} - x_k, y_{c'} - y_k). Eldar Insafutdinov et al. [16] used an extra layer, which predicts the location o^k_{cc'}. The true location t^k_{cc'} and the predicted location are compared with each other in a smooth L1 cost function, which was described in the section about Faster R-CNN (2.12c).

After the locations are predicted, they [16] are used to find the pairwise cost. Let c and c' be the classes and l_d and l_{d'} the found locations of the detections d and d', which are used to find the relative position. Using the previous method, we can find o^d_{cc'}, which is the relative offset from c to c', and o^{d'}_{c'c}, which goes from c' to c. To check whether the found locations l_d and l_{d'} belong to the same person, we need to calculate the offset between the two: \hat{o}_{dd'} = l_d - l_{d'}. Using the previously found o^d_{cc'} and o^{d'}_{c'c}, we need to calculate how they stand in relation to \hat{o}_{dd'}. This is done as follows. First the distance between each offset is calculated: \Delta_f = \| \hat{o}_{dd'} - o^d_{cc'} \| and \Delta_b = \| \hat{o}_{dd'} - o^{d'}_{c'c} \|. Furthermore, an angle is needed for the logistic model, which is calculated by \theta_f = \angle(\hat{o}_{dd'}, o^d_{cc'}) and \theta_b = \angle(\hat{o}_{dd'}, o^{d'}_{c'c}). With these values we can derive the following feature vector: f_{cc'dd'} = (\Delta_f, \Delta_b, \theta_f, \theta_b, \exp(-\Delta_f), \exp(-\Delta_b), \exp(-\theta_f), \exp(-\theta_b)). This is used in the following logistic model, in which \omega_{cc'} are the learned weights of the model.

P(z_{cc'dd'} = 1 \mid f_{cc'dd'}, \omega_{cc'}) = \frac{1}{1 + \exp(- \langle f_{cc'dd'}, \omega_{cc'} \rangle)} \quad (3.1)

Figure 3.1: Logistic model

This method is used to strengthen each predicted location and to give a more accurate prediction than when only a CNN is used. It helps to locate each label of an individual person by checking how it relates to the other joints.
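The pairwise scoring can be sketched directly from the definitions above with NumPy; the detections, predicted offsets and weights below are made up for illustration.

import numpy as np

def angle_between(a, b):
    # Angle between two offset vectors
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def pairwise_probability(l_d, l_d2, o_f, o_b, w):
    o_hat = l_d - l_d2                       # offset between the two detections
    delta_f = np.linalg.norm(o_hat - o_f)    # distance to the forward predicted offset
    delta_b = np.linalg.norm(o_hat - o_b)    # distance to the backward predicted offset
    theta_f = angle_between(o_hat, o_f)
    theta_b = angle_between(o_hat, o_b)
    f = np.array([delta_f, delta_b, theta_f, theta_b,
                  np.exp(-delta_f), np.exp(-delta_b),
                  np.exp(-theta_f), np.exp(-theta_b)])
    return 1.0 / (1.0 + np.exp(-np.dot(f, w)))   # logistic model, equation 3.1

# Illustrative detections, predicted offsets and learned weights
l_d, l_d2 = np.array([10.0, 12.0]), np.array([4.0, 3.0])
o_f, o_b = np.array([6.0, 8.0]), np.array([-6.0, -9.0])
w = np.ones(8) * 0.1
print(pairwise_probability(l_d, l_d2, o_f, o_b, w))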


3.3 Example Network

In this section I will combine the explanations of how both the CNN and the pose estimation work, and how they are trained in combination. Moreover, I shall explain how they predict a pose. Let us first start with a discussion of DeeperCut. DeeperCut [16] will always try to find the location of each individual joint. This is done by the CNN part, which will return a score map for each label. In these score maps the joints are highlighted, and DeeperCut tries to find each local maximum in the picture.

From each of these score maps, the locations of the other labels are predicted (which was already explained in the first part of the section about the pose estimation). This is especially important when more than one person is present in an image. The regression makes the search space for the pairwise connection smaller, so as to not connect one person to another person. When two predicted classes are found they will be compared to each other using the pose estimation algorithm. This combination is like a double check: the regression checks the predicted location of the other class in relation to the other label, and the prediction itself checks whether the regression is correct or not.

In most cases, a model is trained with three different loss functions. The first is the cross-entropy location detection, in which it is checked whether the prediction is within a certain radius of the ground truth. The second function is the regression from each label to the other labels, which is trained with the smooth L1 loss function. The last part is the pairwise class regression (which was already described in the second section of the pose estimation); these logistic weights are trained from the ground truth of where the pixels are located.

3.4 Configurations

In the previous sections, I discussed the inner workings of DeeperCut [16], the foundation of DeepLabCut [25]. This section shows how I made use of DeepLabCut, the configuration of the projects and the edits I made in the framework in my own project.

3.4.1 Creating a project

When creating a project, you start by naming it and supplying the name of the annotator. DeepLabCut [25] will create a folder structure for the user. The most important file here is config.yaml. This file contains the configuration of the current project. In appendix A you can find the example config.yaml structure I used. In this section I will only discuss the attributes which are the most important at this stage.

Once the project is created, the next step is to extract frames for the annotation step. For this you have three options: manual, uniform and k-means. When selecting manual, a GUI is opened in which you can select each frame manually. This is the best way to extract frames, but it can be time-consuming and not worth the effort. When selecting uniform, n frames are chosen for you, sampled uniformly at random from the video. K-means will try to extract the frames that display the most change in the video. The latter feature is useful if you want to automatically find the frames in which the behaviour is most diverse. Within the context of my thesis project, I found uniform to work best for extracting frames. This was because the behaviour of dolphins in the video was uniform and extracting frames manually would have led to the selection of the same frames. By default 20 frames will be extracted, but in config.yaml you can adjust the variable numframes2extract to your needs.
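In code, the project creation and frame extraction above look roughly as follows; a sketch assuming the DeepLabCut 2.x Python API, with a hypothetical project name and a placeholder video path. Exact argument names may differ between versions.

import deeplabcut

# Create the project folder structure and the central config.yaml
config_path = deeplabcut.create_new_project(
    "dolphin-tracking",            # hypothetical project name
    "annotator",                   # name of the annotator
    ["/path/to/drone_video.mp4"],  # placeholder video path
    copy_videos=False)

# Extract frames for annotation; 'uniform' worked best for the dolphin videos
deeplabcut.extract_frames(config_path, mode="automatic", algo="uniform", userfeedback=False)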

The next step is to annotate the extracted data. Before starting, it is important to specify which parts you will annotate in the config.yaml file. While annotating, one must be aware that it is only possible to annotate each part once per frame. To reiterate, one has to make sure that only one instance of each annotated part is visible within the frame. After the annotation it is possible to check whether every video is annotated, as a validation before you start the training. The next step is to train the model. Before training the model, one needs to make sure that there is a split of test and training data. In config.yaml one can set the fraction of frames that will be training data and the fraction that will be test data. By default this is 95% training data and 5% test data. When creating the training data the user can set the iteration index. This is important for training different models that use the same data. Before starting the training cycle it is important to note which ResNet [13] one will be using. By default this will be set to ResNet-50, but by editing only config.yaml, one can set this to ResNet-101. If the user wants to use their own weights, or restart training, one can set the init weights within pose_cfg.yaml, which can be found in the train folder within the dlc-models folder.

When starting the training, one can set three variables in the function. The first is shuffle, which is the index of the created training set. When it is important to make checkpoints, it is possible to set saveiters=n, which will save at every nth step. The last is displayiters=n, which will show the current iteration, current loss and current learning rate at every nth step.
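Starting the training with these three variables then looks roughly like this; again a sketch assuming the DeepLabCut Python API, with illustrative iteration counts and a placeholder config path.

import deeplabcut

config_path = "/path/to/dolphin-tracking/config.yaml"   # placeholder

# A training dataset must have been created first, e.g. with the default 95/5 split
deeplabcut.create_training_dataset(config_path, num_shuffles=1)

deeplabcut.train_network(
    config_path,
    shuffle=1,          # index of the created training set
    saveiters=10000,    # store a checkpoint every 10000th step
    displayiters=1000)  # report iteration, loss and learning rate every 1000th step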

One of the most important features of the framework is evaluating the trained model. This is done with the evaluate_network function of DeepLabCut. This function will try to predict each label on each image of both the test and train dataset. It returns the average pixel error (how far the prediction was off) of the train and test set separately, and also the error when the worst 1 percent is removed. Without any custom settings, this average is over all the labels combined. One can also test each label separately, by setting comparisonbodyparts to the body part one would like to test [28]. Moreover, if the user wants a plot of the data, the only thing that is needed is to set plotting to True.
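Evaluation reduces to a single call; a sketch assuming the same API, with the optional arguments mentioned in the text and a hypothetical body part name.

import deeplabcut

config_path = "/path/to/dolphin-tracking/config.yaml"   # placeholder

# Reports the average pixel error on the train and test set, with and without the worst 1%
deeplabcut.evaluate_network(
    config_path,
    plotting=True,                       # also plot the predictions
    comparisonbodyparts=["dorsal_fin"])  # hypothetical label name, to test one part separately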


CHAPTER 4

Mask R-CNN

In DeeperCut [16] the authors looked into whether it would be beneficial to implement a version with Faster R-CNN [32]. It was shown that this decreased the accuracy of the network, but with a significantly lower execution time. The authors [16] argued that this did not outweigh the accuracy loss, which is why it was not used in the implementation of DeeperCut.

After DeeperCut [16] was published, a new architecture appeared: Mask R-CNN [15], which combined the previously discussed techniques of FCN [22], which is heavily used within DeeperCut, and Faster R-CNN [32].

In this chapter I will delve deeper into how Mask R-CNN is designed. I will go into more detail on how it modified R-CNN to not only perform the classification and apply a bounding box, but also apply a mask around the object. This chapter also contains an explanation of the setup of my Mask R-CNN implementation, in which I explain the use of Matterport Mask R-CNN [1]. Matterport Mask R-CNN is the framework I used to train and evaluate my models.

4.1 Theory

The goal of Mask R-CNN [15] is to combine object detection and semantic segmentation. We have seen how object detection is performed by Faster R-CNN [32], by creating bounding boxes around the objects it tries to learn during training. Using semantic segmentation, it moreover tries to locate an individual object at the pixel level with an FCN [22]. In their paper, they [15] describe this combination as instance segmentation. In this section I will describe how FCN and Faster R-CNN are combined and where exactly the modifications were made in order to improve the results. In the theoretical background chapter I already described the workings and conceptual underpinnings of R-CNN and FCN; I will therefore not go in depth on how they work here and only focus on the implemented modifications.

One of the modifications [15] made to Faster R-CNN is an adjustment to the Region of Interest pooling. This was done because too much information was lost in the pooling to accomplish semantic segmentation. Instead they used Region of Interest Align (RoIAlign). Normally with RoI pooling the stride is calculated by dividing the input dimension size by the pooling window size. This causes issues when an RoI input is (15 × 15) and the pooling layer output should be (7 × 7), because 15/7 ≈ 2.143, which, after being floored, becomes 2. When normal RoI pooling is used, it will only look at the first 14 inputs, losing information needed for semantic segmentation. RoIAlign, by contrast, solves this by keeping the stride at 2.143, but uses bilinear interpolation to calculate the input values of the pooling. In [15] it is shown how this increases the accuracy of the network with respect to normal RoI pooling.


Moreover, they used a ResNet [13] and the normal structure of Faster R-CNN, in which a classification and bounding box regression is performed. This is done with the output of RoIAlign. This output is also used for the FCN part of the network, as it creates the mask needed for semantic segmentation. Figure 4.1 shows how Mask R-CNN extends a normal Faster R-CNN architecture by adding the mask as an extra output, not influenced by the classification and the bounding box regression.

Figure 4.1: Overview of a Mask R-CNN. Image from [17]

Training is quite similar to Faster R-CNN [32]. The main difference is that, besides the normal loss functions for the bounding box (L_{BB}) and the classification (L_C), a third loss function is added for the mask (L_{Mask}). This third loss function is combined in the loss function for each RoI: L = L_{Mask} + L_{BB} + L_C.

This combined loss function symbolises how Mask R-CNN combines the previous architectures of Faster R-CNN [32] and FCN [22], resulting in an improved version of both.

4.2 Network

For the implementation of Mask R-CNN [15], I used Matterport Mask R-CNN [1]. This implementation was chosen in favour of Detectron [12] because of its ease of use. For more in-depth testing and research I would recommend Detectron, because of its wider range of uses and its options to edit more aspects of the network. In this section I will describe the usage of Matterport Mask R-CNN and how one can configure the network.

In appendix B the reader can find an example of the configuration used within Matterport Mask R-CNN. In this section I will discuss the most important configurations and will reference appendix B. The most important setting is BACKBONE. This setting determines the underlying CNN architecture of the network; ResNet-50 and ResNet-101 are supported [13]. Depending on the size of the pictures and the labels one wants to detect, it is recommended to adjust RPN_ANCHOR_SCALES. This configuration takes a number of values, each determining the size - in pixels - of the RoIs it tries to find. In case the images have a high pixel size and the labels one tries to find take up a lot of space in the picture, I would recommend setting these to high values. If the pixel size is low and/or the labels one wants to find in the picture are small, I recommend setting them low instead. Depending on the image size, one might also want to change IMAGE_MAX_DIM and IMAGE_MIN_DIM. These configurations determine to what size the image will be cropped or upscaled. In my research, I found that setting this to the original size of the dolphin images would result in an "Out of Memory" warning because of the images' original size of (4096 × 2160).
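A configuration with these settings could look roughly as follows; a sketch assuming the Matterport implementation's mrcnn.config.Config class, with illustrative values rather than the exact thesis settings (those are in appendix B).

from mrcnn.config import Config

class DolphinConfig(Config):
    NAME = "dolphin"                    # hypothetical project name
    NUM_CLASSES = 1 + 1                 # background + dolphin
    BACKBONE = "resnet101"              # or "resnet50"
    RPN_ANCHOR_SCALES = (32, 64, 128, 256, 512)  # RoI sizes in pixels
    IMAGE_MIN_DIM = 1024                # the original 4096 x 2160 frames are scaled down
    IMAGE_MAX_DIM = 1024
    IMAGES_PER_GPU = 1
    GPU_COUNT = 1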

Furthermore, the most important setting is not set using these configurations. When training it is possible to determine which layers should be retrained, and for this two options are available. The first is setting that only the heads should be trained. With this, the last layers of the pre-trained weights will be updated, but the weights of the other layers are frozen. Here frozen means they will not be updated while training and will keep the same values. The second option is to update all the weights in all the layers. This is done by selecting all when calling the function to train the network. In my experiments, I will use three configurations with these two options. The first two are models trained for 40 epochs on either only heads or only all. The third option is a combination of the two: in this configuration, for the first 20 epochs only the heads will be trained, after which the model will be trained for 40 epochs on all. In the description of my experiments, I will discuss these configurations in more detail.
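The heads-then-all schedule described above looks roughly like this with the Matterport implementation; a sketch with placeholder paths, in which config, the datasets and the augmentation object (Figure 4.2) are assumed to have been prepared elsewhere.

import mrcnn.model as modellib

# config is an instance of the Config subclass shown earlier;
# dataset_train / dataset_val are prepared mrcnn Dataset objects (not shown here)
model = modellib.MaskRCNN(mode="training", config=config, model_dir="/path/to/logs")
model.load_weights("/path/to/mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc", "mrcnn_bbox", "mrcnn_mask"])

# First: only the head layers, the pre-trained backbone stays frozen
model.train(dataset_train, dataset_val, learning_rate=config.LEARNING_RATE,
            epochs=20, layers="heads", augmentation=augmentation)

# Then: continue training while updating all layers
model.train(dataset_train, dataset_val, learning_rate=config.LEARNING_RATE,
            epochs=40, layers="all", augmentation=augmentation)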

4.3 Data

In this subsection, I will explain the training procedure. I started with pre-trained weights, which were trained on a large dataset. This makes it possible to train a model on a small number of annotated images. I used two different sets of pre-trained weights. The first set was trained on the COCO dataset [21], which covers 40 different classes. The second set of pre-trained weights was obtained from the Imagenet dataset [5]; these weights were trained on a set of 200 classes, significantly more than the COCO dataset. In my experiments I trained models with both sets of weights in order to see how they affected my results.
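A minimal sketch of how these pre-trained weights can be loaded in the Matterport implementation is shown below; the file path is a placeholder and the excluded layers follow the examples shipped with the package.

import mrcnn.model as modellib

model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs/")
# COCO weights: skip the head layers, whose shapes depend on the number of classes.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])
# Alternatively, the Imagenet weights bundled with the package:
# model.load_weights(model.get_imagenet_weights(), by_name=True)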

For annotating the data I used the VGG Image Annotator tool [6]¹. With this tool I annotated 110 images containing only one dolphin and 120 images containing multiple dolphins.

Augmentation

For training and testing I used a relatively small set of 200 images for training and 30 for testing. This is usually not enough to reach a satisfactory accuracy. One way to solve this is to annotate more images, but a simpler option is to augment the data. When augmenting the data, one carries out transformations on the images, changes their colours or blurs them. For the augmentation I used Imgaug [18], a Python package developed for augmenting images that also edits the annotations so they correspond to the augmented image.

import imgaug.augmenters as iaa

# Apply between 0 and 2 of the listed augmenters to each image.
augmentation = iaa.SomeOf((0, 2), [
    iaa.Fliplr(0.5),                      # flip left-right with probability 0.5
    iaa.Flipud(0.5),                      # flip up-down with probability 0.5
    iaa.OneOf([iaa.Affine(rotate=90),     # rotate by exactly one of these angles
               iaa.Affine(rotate=180),
               iaa.Affine(rotate=270)]),
    iaa.Multiply((0.8, 1.5)),             # scale colour values by a random factor
    iaa.GaussianBlur(sigma=(0.0, 5.0))    # blur with a random sigma
])

Figure 4.2: Augmentation settings

With Imgaug [18] I applied the augmentations shown in figure 4.2 to the data. With SomeOf one fixes how many of the listed options are applied to an image; with these settings (figure 4.2) between 0 and 2 options are used. Fliplr and Flipud both flip the image, Fliplr along the vertical axis and Flipud along the horizontal axis. The next option picks one of the rotations (only one is chosen, so the image cannot end up rotated back to its original orientation). Multiply affects the colour values of an image by multiplying them by a factor between 0.8 and 1.5. The last option adds a Gaussian blur to the image.

¹ Development and maintenance of VGG Image Annotator (VIA) is supported by EPSRC programme grant


In my experiments, I trained each model setup both with and without augmentations, in order to see whether augmenting the data has a positive effect on the accuracy of the model.


CHAPTER 5

Experiments

I chose to split my experiments into two sections, each dedicated to one of the implementations described in the previous chapter. The first is therefore about DeepLabCut [25], including some experiments specifically tailored to this framework. Something similar is done for Mask R-CNN [15], for which I used the Matterport Mask R-CNN implementation [1]. In each section I will first describe the corresponding experiment. For clarity, I will present the results in separate subsections below.

5.1 Hardware

My experiments ran on the Lisa GPU cluster [40]. A single node contains four GeForce GTX 1080 Ti GPUs with 11 GB of GDDR5X memory. For each experiment I will indicate how many GPUs were used.

5.2 DeepLabCut

With DeepLabCut [25], I will investigate two properties of the network. The first is how the placement of a label depends on the other labels (because of the previously explained pose estimation). In these experiments the average pixel error is calculated, which is how far, in pixels, the predicted label lies from the ground truth.
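The average pixel error can be computed as the mean Euclidean distance between predicted and ground-truth coordinates; the short sketch below illustrates this, with the array layout as an assumption.

import numpy as np

def mean_pixel_error(pred, gt):
    # pred and gt: arrays of shape (n_labels, 2) holding (x, y) coordinates
    return np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=1).mean()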

The second set of experiments focuses on how DeepLabCut holds up when multiple dolphins are present in the image. These experiments are done to see whether DeepLabCut can locate all the dolphins without producing a significant number of false positives. They use the annotated dataset of the Mask R-CNN models, in which each dolphin is demarcated. By counting the number of demarcated areas, we know the number of dolphins, and by checking whether a found point's coordinates lie within a demarcated area, we can see if it is a true positive or a false positive. A dolphin counts as found if at least one point is located in its marked area. The total number of found dolphins is divided by the number of demarcated areas to get the final result. To calculate the percentage of true positives, I divided the number of true positives by the total number of found points.
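The check whether a found point falls inside a demarcated area can be implemented as a point-in-polygon test; the sketch below is one way to do this, assuming each annotated dolphin is available as a list of (x, y) polygon vertices.

from matplotlib.path import Path

def is_true_positive(point, dolphin_polygons):
    # point: (x, y); dolphin_polygons: list of vertex lists, one per dolphin
    return any(Path(polygon).contains_point(point) for polygon in dolphin_polygons)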

5.2.1 Subsets of parts

In this experiment I run an evaluation on the power set of the set of labels, excluding the empty set. I use the following labels: Head, Blowhole, BackFin, RightFlipper, LeftFlipper, Tail.
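The non-empty subsets of the label set can be enumerated as shown below, giving 2^6 - 1 = 63 part combinations to evaluate.

from itertools import chain, combinations

labels = ["Head", "Blowhole", "BackFin", "RightFlipper", "LeftFlipper", "Tail"]
subsets = list(chain.from_iterable(
    combinations(labels, r) for r in range(1, len(labels) + 1)))
print(len(subsets))  # 63 non-empty subsets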

I decided to present the results using six separate tables: train error, test error, train error and test error with a p-cutoff of 0.1 (excluding low-confidence predictions), the average of all four values, and the standard deviation. I will present the ten best and ten worst part combinations for each. Full results can be found in the appendix (C.1.2).

Results

Table 5.1: Best ten test error

Parts                                      Test error (px)
Blowhole                                   23.32
Blowhole RightFlipper                      27.94
Blowhole Tail                              30.08
Blowhole RightFlipper Tail                 31.12
Blowhole LeftFlipper                       31.63
Blowhole RightFlipper LeftFlipper          32.20
RightFlipper                               33.40
Blowhole RightFlipper LeftFlipper Tail     33.63
Blowhole LeftFlipper Tail                  33.71
Blowhole RightFlipper BackFin              34.62

Table 5.2: Worst ten test error

Parts                              Test error (px)
Head RightFlipper LeftFlipper      54.45
Head RightFlipper BackFin          55.60
Head LeftFlipper Tail              55.92
Head Tail BackFin                  56.99
Head LeftFlipper BackFin           58.73
Head RightFlipper                  59.26
Head Tail                          61.31
Head LeftFlipper                   64.10
Head BackFin                       65.00
Head                               79.58

Table 5.3: Best ten train error

Parts                                Train error (px)
BackFin                              52.50
Tail BackFin                         56.42
Head BackFin                         56.77
Head Tail BackFin                    57.97
Tail                                 60.47
Head Tail                            60.72
Head                                 60.96
Head RightFlipper Tail BackFin       68.41
Head LeftFlipper Tail BackFin        70.18
Head RightFlipper BackFin            71.29


Table 5.4: Worst ten train error

Parts                                        Train error (px)
Blowhole RightFlipper LeftFlipper BackFin    96.77
Head Blowhole RightFlipper LeftFlipper       99.03
Blowhole RightFlipper LeftFlipper Tail       99.48
Blowhole                                     111.78
Blowhole RightFlipper                        113.73
Blowhole LeftFlipper                         114.56
Blowhole RightFlipper LeftFlipper            115.16
RightFlipper                                 116.78
RightFlipper LeftFlipper                     117.57
LeftFlipper                                  118.25

Table 5.5: Best ten test error, with 0.1 p-cutoff

Parts                                      Test error with p-cutoff (px)
RightFlipper                               6.54
RightFlipper Tail                          10.38
Tail                                       11.53
Blowhole RightFlipper Tail                 16.85
Blowhole Tail                              18.20
Blowhole RightFlipper                      20.17
RightFlipper LeftFlipper Tail              23.19
Blowhole RightFlipper LeftFlipper Tail     23.24
Blowhole                                   23.32
Blowhole LeftFlipper Tail                  24.81

Table 5.6: Worst ten test error, with 0.1 p-cutoff

Parts                                      Test error with p-cutoff (px)
Head Tail                                  51.23
Head Blowhole                              52.50
Head RightFlipper LeftFlipper BackFin      54.86
Head RightFlipper LeftFlipper              58.04
Head RightFlipper BackFin                  58.95
Head LeftFlipper BackFin                   59.00
Head LeftFlipper                           64.76
Head BackFin                               65.00
Head RightFlipper                          66.69


Table 5.7: Best ten train error, with 0.1 p-cutoff

Parts                                Train error with p-cutoff (px)
RightFlipper                         48.30
RightFlipper BackFin                 51.37
BackFin                              52.48
RightFlipper Tail BackFin            53.11
RightFlipper Tail                    53.59
Tail BackFin                         54.00
Head RightFlipper BackFin            55.49
Head RightFlipper Tail BackFin       55.52
Tail                                 55.59
Head Tail BackFin                    56.39

Table 5.8: Worst ten train error, with 0.1 p-cutoff

Parts                                     Train error with p-cutoff (px)
Head Blowhole RightFlipper LeftFlipper    77.47
Blowhole LeftFlipper BackFin              78.42
Head Blowhole                             78.73
Blowhole LeftFlipper Tail                 80.09
Head Blowhole LeftFlipper                 81.50
Blowhole RightFlipper                     84.27
Blowhole RightFlipper LeftFlipper         86.17
LeftFlipper                               90.04
Blowhole LeftFlipper                      94.78
Blowhole                                  97.97

Table 5.9: Best ten average pixel error

Parts                                     Average (px)    SD
Tail                                      41.41           19.16
RightFlipper Tail                         45.67           26.46
Tail BackFin                              46.27           9.91
RightFlipper Tail BackFin                 48.22           15.94
BackFin                                   50.23           2.26
RightFlipper                              51.25           40.68
Blowhole Tail BackFin                     51.99           20.09
RightFlipper BackFin                      52.44           15.27
Blowhole RightFlipper Tail BackFin        52.70           22.62


Table 5.10: Worst ten average pixel error

Parts                                     Average (px)    SD
RightFlipper LeftFlipper                  65.79           34.04
Head RightFlipper                         66.54           9.86
Head Blowhole RightFlipper LeftFlipper    67.25           22.34
Head Blowhole                             67.47           15.20
Blowhole LeftFlipper                      67.95           37.38
Head RightFlipper LeftFlipper             68.53           15.37
Head Blowhole LeftFlipper                 69.00           19.79
Head                                      70.27           9.31
Head LeftFlipper                          71.60           8.57
LeftFlipper                               73.11           32.60

5.2.2 Single dolphin

To test whether the labels are always placed within a dolphin, I run tests on the dataset used for Mask R-CNN [15], using the annotated areas to check whether the coordinates of a predicted label lie within the animal. Points within the animal are marked as 1 and those outside as 0, and the average of these values over all labels in an image is taken as the accuracy for that image.

The train and test datasets described here are not the ones used for DeepLabCut, but the ones used for the Mask R-CNN implementation.

Results

Name                        AP points    SD AP points    AP found    SD AP found
Single Dolphin Train Set    0.708        0.209           0.969       0.172
Single Dolphin Val Set      0.795        0.162           1           0

5.2.3 Multiple dolphins

At the moment DeepLabCut [25] does not support the tracking of multiple animals. To obtain results anyhow, I take each found point, obtained by taking the local maxima of the score maps produced by the CNN, and check whether it lies within an annotated area.
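A simple way to extract these candidate points from a score map is peak detection above a threshold; the sketch below illustrates this idea and is not part of the DeepLabCut API.

import numpy as np
from scipy.ndimage import maximum_filter

def find_peaks(scoremap, threshold=0.9, size=5):
    # A pixel is a peak if it equals the maximum in its neighbourhood
    # and its score exceeds the threshold.
    is_local_max = maximum_filter(scoremap, size=size) == scoremap
    return np.argwhere(is_local_max & (scoremap > threshold))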

These experiments were conducted in the same manner as with a single dolphin, but with pictures that contain multiple animals. Each experiment was also conducted with different thresholds in order to see how those affect the scores.

Results

Name     Threshold    AP points    SD AP points    AP found    SD AP found
Train    0.9          0.430        0.294           0.677       0.305
Val      0.9          0.401        0.227           0.584       0.312
Train    0.8          0.420        0.285           0.723       0.295
Val      0.8          0.371        0.223           0.625       0.327
Train    0.7          0.408        0.289           0.736       0.283
Val      0.7          0.358        0.224           0.664       0.325
Train    0.6          0.340        0.282           0.756       0.271
Val      0.6          0.384        0.215           0.701       0.308
Train    0.5          0.390        0.275           0.768       0.267
Val      0.5          0.364        0.212           0.701       0.308


5.2.4 Single and multiple dolphins

In these experiments, the combined set of the previous two experiments was used to extract the results.

Results

Name     Threshold    AP points    SD AP points    AP found    SD AP found
Train    0.9          0.571        0.290           0.826       0.286
Val      0.9          0.560        0.281           0.753       0.315
Train    0.8          0.566        0.287           0.848       0.270
Val      0.8          0.543        0.289           0.777       0.312
Train    0.7          0.560        0.293           0.858       0.261
Val      0.7          0.535        0.294           0.800       0.300
Train    0.6          0.556        0.291           0.864       0.250
Val      0.6          0.551        0.281           0.823       0.279
Train    0.5          0.551        0.291           0.870       0.246
Val      0.5          0.539        0.287           0.823       0.279

5.3 Mask R-CNN

Each of the experiments is executed with the Matterport Mask R-CNN implementation [1]. The experiments are done with three datasets: the first is a set in which the images contain only one dolphin, the second contains only images with multiple dolphins, and the last is a combination of these two sets.

In these experiments I try to find the training setup that results in the highest average precision. To this end I trained multiple models in a number of different configurations to see which one gives the best average precision. The average precision was calculated according to the Pascal VOC standard [9]: a prediction is counted as a true positive if the predicted area and the ground truth overlap by at least 50%.
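The 50% overlap criterion corresponds to an intersection over union (IoU) of at least 0.5 between the predicted mask and the ground truth; a minimal sketch of this computation, assuming boolean masks of equal shape, is shown below.

import numpy as np

def mask_iou(pred_mask, gt_mask):
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return intersection / union if union > 0 else 0.0

# A prediction counts as a true positive when mask_iou(pred, gt) >= 0.5.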

For these experiments multiple models are trained: a model in which only the last layers are trained (heads) and a model in which every layer is trained (all), both for 40 epochs, and a third model that is trained for 20 epochs on only the last layers and afterwards for 40 epochs on all layers. Each epoch consists of 1000 iterations. These three model setups are trained with two different sets of pre-trained weights: one from the COCO dataset [21] and one from the Imagenet dataset [5]. These models are again split into two more categories: models trained with the normal training set and with an augmented training set. In total 12 models are trained for these experiments.

5.3.1 Single Dolphin

To test how Mask R-CNN [15] behaves with a single dolphin, I use a test set that only contains images with a single animal. This dataset is the same as the one used for the DeepLabCut [25] implementation.

Results

Table 5.11: With a single dolphin and without augmentations to the dataset.

Epochs              COCO     Imagenet
20 heads, 40 all    0.775    0.772
40 all              0.691    0.694


Table 5.12: With a single dolphin and with augmentations to the dataset.

Epochs              COCO     Imagenet
20 heads, 40 all    0.846    0.770
40 all              0.769    0.845
40 heads            0.769    0.692

5.3.2 Multiple Dolphins

The next step is to see how Mask R-CNN [15] behaves with multiple animals. The train set is the same as in the single dolphin experiments. The test set is the same as the one used with the DeepLabCut [25] implementation.

Results

Table 5.13: With multiple dolphins and without augmentations to the dataset.

Epochs              COCO     Imagenet
20 heads, 40 all    0.890    0.892
40 all              0.850    0.865
40 heads            0.815    0.795

Table 5.14: With multiple dolphins and with augmentations to the dataset.

Epochs              COCO     Imagenet
20 heads, 40 all    0.891    0.874
40 all              0.898    0.872
40 heads            0.864    0.815

5.3.3 Complete set

Results

Table 5.15: With the complete set and without augmentations to the dataset.

Epochs              COCO     Imagenet
20 heads, 40 all    0.841    0.842
40 all              0.785    0.795
40 heads            0.765    0.691

Table 5.16: With the complete set and with augmentations to the dataset.

Epochs              COCO     Imagenet
20 heads, 40 all    0.841    0.832
40 all              0.846    0.862

