
Self-improving apple detection by active learning


Layout: typeset by the author using LaTeX.


Self-improving apple detection by active learning

Frank Brongers
11873914
Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Dr. E. Bruni
Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 907
1098 XG Amsterdam


Abstract

Within deep learning, the need for large amounts of labelled data to increase the performance of neural networks is high. The cost of labelling this data, however, can be very high as well. To reduce labelling expenses, whilst keeping performance as high as possible, active learning is introduced. The goal of active learning is to find a subset of the unlabelled set such that labelling this subset would yield results similar to labelling all data. This thesis explores the application of active learning to the RetinaNet object detector, with a focus on detecting apples. Three active learning methods were implemented. First, the sum method, which aggregates the uncertainty scores of the bounding boxes found by the RetinaNet and picks a subset based on the highest scores. Secondly, the core-set method, which picks a subset based on how well it represents the whole set. Finally, the Variational Adversarial Active Learning method, which encodes the unlabelled set through a Generative Adversarial Network setup with a variational auto-encoder, to find a subset of images whose features are not yet well represented in the labelled set. None of the methods showed a significant improvement over a random approach, but this was more likely due to the dataset than to the performance of the active learning methods.


Contents

1 Introduction
  1.1 Overarching project
2 Theoretical foundation
  2.1 Deep learning for object detection
  2.2 RetinaNet
    2.2.1 ResNet
    2.2.2 Feature pyramid network
    2.2.3 Bounding box classifier and regressor
  2.3 Active learning
  2.4 Variational auto-encoders
3 Method and approach
  3.1 Sum
  3.2 Core-set
  3.3 Variational Adversarial Active Learning
4 Experimental setup
  4.1 Nvidia RetinaNet
    4.1.1 Docker and NGC
  4.2 Data
  4.3 Experiment parameters
  4.4 Evaluation
5 Results
6 Discussion
  6.1 Conclusion
  6.2 Future work
A Experiment details
B Additional results


Acknowledgements

First, I would like to thank my supervisors Elia Bruni and Klaus Ondrag for setting me up for a smooth start with the thesis, for their continued support throughout, and for the freedom to explore the methods I wanted to explore.

Secondly, I would like to thank Tobias Teule for his good insights and comments whilst working alongside me during the thesis.

Finally, I would like to thank Leon Eshuijs for his general support during the thesis and being a good sparring partner.


1 Introduction

Within deep learning, the need for large amounts of labelled data to increase the performance of neural networks is high. The cost of labelling this data, however, can be very high as well. To reduce labelling expenses, whilst keeping performance as high as possible, active learning (AL) is introduced.

AL is the act of finding the subset of the unlabelled data which yields the highest performance gain when labelled. A multitude of methods has been introduced in recent years to push the boundaries of active learning further and to extend it to the field of computer vision.

Within this thesis, multiple AL methods are compared to discover where the RetinaNet, a one-stage object detector, fails to correctly detect apples, and which method finds the best subset for the RetinaNet to be trained on. The research question is thus as follows: which AL method combines best with the RetinaNet object detector?

1.1 Overarching project

This thesis is part of a larger project in which a robot is to pick apples in an apple orchard. The robot should move autonomously and improve over time. To improve over time, it needs more data, which it collects during runs through the orchard. However, because the robot works in real time and continuously collects data as it tries to detect apples, labelling all of this new data becomes unfeasible. An AL method is thus needed to reduce the amount of data to be labelled, whilst bringing the RetinaNet as close as possible to the performance obtained when all data is labelled, and, on the other hand, to find the weaknesses of the object detector implemented on the robot.

These implications extend the bounds of the research question: the labelling cost of the selected images and the efficiency of the AL method should be kept in mind as well, not only the scores the AL method obtains on the test data used within this thesis.


2 Theoretical foundation

2.1 Deep learning for object detection

This thesis focuses on reducing the labelling cost of deep learning for object detection, specifically using the RetinaNet object detector. Deep learning for object detection builds upon deep learning for image classification: not only does the right class need to be assigned to an image, but also the location of the object within the image. The networks can be split into two groups: one-stage and two-stage detectors.

With two-stage detectors like R-CNN and Faster R-CNN [5, 18], a region proposal method like selective search [23] searches the image for possible objects, returning a set of candidate areas for the classifier to classify. With this process, the set of candidate areas is significantly reduced, and the returned bounding boxes describe the object's location within the image more accurately.

One-stage detectors like YOLO, SSD and the RetinaNet [13, 17, 12], however, sample a larger set of candidate object locations regularly across the image. These cover different scales, aspect ratios and spatial positions with so-called anchor boxes: bounding boxes with a fixed size and aspect ratio. This speeds up the bounding box search, making one-stage detectors faster than two-stage detectors, yet it also results in many redundant candidate locations. As a result, one-stage object detectors generally have a lower accuracy than two-stage detectors.

2.2 RetinaNet

The object detector used for this thesis is the RetinaNet, a state-of-the-art one-stage object detector. The network is a byproduct of the introduction of a new loss function, called focal loss, for one-stage detectors [12].

The RetinaNet improves the state-of-the-art performance for one-stage detectors using this new loss function, which focuses the network more on hard examples. The focal loss builds on the standard cross-entropy (CE) loss by adding a (1 − p_t)^γ factor, see equations 1 and 2. This decreases the relative loss for examples that are classified more easily. By focusing more on hard examples during training, the accuracy of the detector's classifier is greatly improved.

CE(p_t) = -\log(p_t) \quad (1)

\mathrm{Focal}(p_t) = -(1 - p_t)^{\gamma} \log(p_t) \quad (2)
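As an illustration of equations 1 and 2, the following minimal PyTorch sketch (hypothetical, not the ODTK implementation) shows how the (1 − p_t)^γ factor down-weights easy examples:

```python
import torch

def cross_entropy(p_t: torch.Tensor) -> torch.Tensor:
    # Equation 1: standard cross-entropy on the probability of the true class.
    return -torch.log(p_t)

def focal_loss(p_t: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    # Equation 2: the (1 - p_t)^gamma factor shrinks the loss of easy
    # examples (p_t close to 1) far more than that of hard ones.
    return -((1.0 - p_t) ** gamma) * torch.log(p_t)

p_t = torch.tensor([0.9, 0.5, 0.1])   # an easy, an uncertain and a hard example
print(cross_entropy(p_t))             # approx. [0.105, 0.693, 2.303]
print(focal_loss(p_t))                # approx. [0.001, 0.173, 1.865]
```

The easy example's loss drops by two orders of magnitude, whilst the hard example's loss barely changes, which is exactly the "focus on hard examples" behaviour described above.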

The network itself consists of a Feature Pyramid Network (FPN), built on top of a feedforward ResNet architecture, which connects to a bounding box regressor and a classifier. As specified in the paper, the network is relatively simple, but still achieves high performance: it outperforms both state-of-the-art one-stage detectors, like YOLOv2 and SSD, and two-stage detectors, like Faster R-CNN, in accuracy and speed [12].

2.2.1 ResNet

The ResNet is based on the findings of He et al. [8]. Previously, deeper neural networks were harder to train, as accuracy degrades with depth, and it was found that this was not due to overfitting. He et al. found that letting the output of one layer skip several layers and be added to the input of a later layer, through so-called residual shortcut connections or residual blocks (see figure 2), overcomes this degradation when more layers are added. Instead of learning the underlying mapping directly, the model learns the residual mapping through these residual blocks.

Figure 1: A visualization of the RetinaNet [12].

Within the RetinaNet, the ResNet functions as a deep image preprocessor that aids the FPN in providing a rich feature pyramid at multiple scales, from which anchor boxes are sampled.

Figure 2: A residual block [8].

2.2.2 Feature pyramid network

Feature pyramids are a basic building block for detecting objects at multiple scales [1], which is one of the main challenges of object detection. Before the FPN, however, object detectors tended to avoid feature pyramids, because processing images at multiple scales is expensive. The FPN combats this by leveraging the feature hierarchy inherent to CNNs [11].

The output from the final layer of the CNN, in this case the ResNet, is upsampled again through a top-down pathway in the FPN. Through lateral 1×1 convolutional connections with the ResNet at multiple layers within the network, feature maps are created. These lateral connections are added to their corresponding upsampled layers within the FPN, creating feature maps that are more accurate and semantically valuable even at higher resolutions, see figure 3.


Figure 3: The addition of the lateral connection to the upsampled layer [11].

2.2.3 Bounding box classifier and regressor

Finally, the anchor boxes used for each layer of the FPN are forwarded to two parallel networks. Both are relatively small, fully convolutional networks [14].

The purpose of the classification network is to predict the probability of an object being within the corresponding anchor box, and the class to which this object belongs. The network consists of several convolutional layers with ReLU activations, followed by a sigmoid activation as the output function.

The box regression network regresses each anchor box to the ground-truth box, if one is nearby. Instead of giving a classification output, this network outputs the relative offset between the anchor box and the ground-truth box. For the RetinaNet from [12], both the classification and box regression networks share the same architecture, except for their output layers.

2.3 Active learning

To reduce the labelling cost, this thesis employs active learning (AL). As mentioned in section 1, AL is implemented to find the subset of the unlabelled data which yields the highest performance increase. As data storage and collection become more readily available, the bottleneck for machine learning shifts toward labelling. Although unsupervised learning, a type of machine learning for which labels are unnecessary, avoids the labelling cost issue, many machine learning techniques still rely on labelled data. This data has to be labelled by a human, which becomes expensive when there is a lot of data to label. AL aims to decrease these expenses with minimal impact on performance. Another area in which AL can help, which is relevant to the overarching project, is showing the weaknesses of a network on the current dataset, so that data containing more such examples can be collected [3].

AL can be divided into three forms of application: membership query synthesis, pool-based sampling and stream-based selective sampling [20]. In the first case, the AL method generates its own examples to be labelled, e.g. a specific part of an image. In the second case, the AL method selects a pool (or subset) of data to be labelled; the size of this pool is referred to as the budget. In the third case, the AL method selects examples to be labelled one by one, based on their estimated informativeness. The second case aligns with the experiments done in this thesis, as the other cases have not been found effective for convolutional neural networks [19].

Pool-based sampling can be split into three categories: representation-based sampling [19, 21], uncertainty-based sampling [2] and a combination of both [24]. Uncertainty-based sampling relies on the output of the network to select the subset. Representation-based sampling uses the images on which the network is trained to find a subset best suited for labelling.

2.4 Variational auto-encoders

The Variational Adversarial Active Learning method, see section 3.3, employs a variational auto-encoder. A variational auto-encoder (VAE) is a type of auto-encoder that improves upon the generative properties of auto-encoders for the creation of new data, see figure 4. Auto-encoders compress (encode) and decompress (decode) data, like compression schemes such as Huffman coding or Lempel-Ziv do, to reduce, for example, the bandwidth needed to send the data or the memory needed to store it. In the case of auto-encoders, however, the compression and decompression schemes are learned and performed by neural networks [7].

(a) Auto-encoder

(b) Variational auto-encoder

Figure 4: A simplified representation of an auto-encoder (a) and a VAE (b), both with input data x and output data x̂ reconstructed from z. For (a), z corresponds to x passed through the bottleneck; for (b), z is a distribution similar to x, encoded as a Gaussian distribution N(µ_x, σ_x).

An auto-encoder can be separated into two networks: the encoder, which encodes the data, and the decoder, which reconstructs the data from the encoding. Between these networks lies the so-called bottleneck, which ensures that only the most important part of the information can pass through. Together, the encoder and decoder learn the optimal compression scheme for the data given the size of the bottleneck. This reconstruction is learned with the reconstruction loss, which can be computed from the difference between the input and output images.

For generation, samples can be taken from the latent space and reconstructed by the decoder into new data. However, due to the tendency of auto-encoders to severely overfit, this newly generated data has a high chance of being meaningless.

With VAEs, training is regularised to create a latent space with good generative properties [9, 10]. This is done by encoding the input data as a distribution over the latent space, instead of as a single point. The encoded distributions are usually chosen to be Gaussian, so the encoder can be trained to return the mean and variance that describe them. The reasoning behind using a distribution instead of a single point is that it makes it possible to naturally regularise the latent space, which is key to generating new data.

Added to the reconstruction loss is a regularisation term, which regularises the latent space by forcing the encoder to return distributions close to a standard Gaussian. This is done by comparing the encoded distribution with a Gaussian through the Kullback-Leibler divergence, or relative entropy, which indicates the difference between two distributions [15]. To properly compute gradients for the parameters of the VAE despite the random sampling procedure, the reparameterization trick is used [10], which allows for backpropagation.
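As an illustration, a minimal sketch of the VAE objective with the reparameterization trick might look as follows; the tensor shapes and function names are assumptions for this sketch, not the VAAL code:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # z = mu + sigma * eps, with eps ~ N(0, I). The random sampling is moved
    # into eps, so gradients can flow through mu and logvar.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def vae_loss(x: torch.Tensor, x_hat: torch.Tensor,
             mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # Reconstruction term: difference between input and output image.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL divergence between N(mu, sigma^2) and the standard
    # Gaussian N(0, I); this is the regularisation term described above.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```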

3 Method and approach

Multiple AL methods were implemented. These approaches were chosen for their relevance and for their difference in focus with respect to the learning process.

I opted to use the sum method [2], an uncertainty-based method, for the efficiency of both its implementation and its sampling process. Two representation-based methods have been implemented as well: the core-set method [19], for its non-random selection and efficient sampling, and the Variational Adversarial Active Learning method [21], for its generative properties and its ability to find features which are not well represented within the labelled set.

Random sampling baseline To get an accurate picture of the performance of the implemented AL methods, a random sampling baseline (RSB) is used. A random sample is taken from the training data, the network is trained on this sample and then tested. Afterwards, another sample is taken from the remaining training data and added to the previous sample, after which the network is trained and tested again. This continues until all specified sampling percentages have been covered. When an AL method performs better than the RSB, it chooses data more effectively than random picking and is thus beneficial for selecting the subset to be labelled.

3.1 Sum

The sum method is based on the findings in [2] and aggregates the uncertainty scores of the bounding boxes found by the RetinaNet. A pre-trained RetinaNet does an inference run on the training data. From this run, the uncertainty scores are retrieved and summed based on the 1vs2 score, which gives a higher score to a prediction if the uncertainty between the two most likely classes is higher. The highest score is thus assigned to the case of both classes receiving a fifty percent certainty score.

\mathrm{Sum}(x) = \sum_{i \in D} \mathrm{1vs2}(x_i)

in which x is the image, D the set of bounding boxes found and x_i the prediction score of bounding box i in image x.

However, since the dataset used for this thesis contains only one class, the 1vs2 score is not applicable. Instead, the prediction for the single class is treated as two classes: the object and the absence of the object. This way, the uncertainty score is highest if the RetinaNet scores a bounding box, with a prediction of an apple being in it, with a certainty of 1/2.

\mathrm{Sum}(x) = \sum_{i \in D} \left( \frac{1}{2} - \left| x_i - \frac{1}{2} \right| \right) \quad (3)

This method, however, is biased: it only focuses on the output of the RetinaNet, and it tends to assign higher value to images with a larger number of bounding boxes, as mentioned by Brust et al. [2]. The first bias can put the RetinaNet in a vicious circle, in which it is only improved on those objects it can already detect, leaving out the hard examples which might actually improve the RetinaNet the most. During testing of the AL method this is not an issue, as the ground-truth labels are available. The issue becomes more prevalent during deployment, when completely new data is sought.

The second bias counteracts the goal of AL, which is to reduce labelling cost. Because each image within the dataset can have multiple bounding boxes, labelling cost is based not only on the number of images, but also on the number of bounding boxes within each image. If the AL method is biased towards images with more bounding boxes, it does not reduce the data to be labelled as effectively as possible. An image with many bounding boxes with relatively high certainties might be given a higher score than an image with one very hard example with a very low certainty score, whilst the latter image might both reduce labelling cost and improve the object detector more.

Cost reduction As the first bias is inherent to this type of AL, it can only be avoided by using a different AL method. The second bias, however, can be combated more easily, by modifying the scoring of the bounding boxes. In [2], two similar methods are explored: average and max. Both were outperformed by the sum method and are thus less relevant in terms of performance. For cost reduction, however, the average function (AVG), see equation 4, is more relevant, as it is not biased towards the number of bounding boxes.

\mathrm{Avg}(x) = \frac{1}{|D|} \sum_{i \in D} \left( \frac{1}{2} - \left| x_i - \frac{1}{2} \right| \right) \quad (4)

Focus on hard examples Another way to combat the second bias is to raise the scores to a power γ. This way, a hard example is scored exponentially higher than an easy example, which should both reduce the focus on the number of bounding boxes and increase the number of hard examples being used, which should be more informative. This addition is inspired by [12], in which it was found that a loss which focuses more on hard examples significantly improves the performance of the object detector; the methods are thus named FSum and FAvg, with F standing for focal, like the focal loss from [12].

\mathrm{FSum}(x) = \sum_{i \in D} \left( 2 \cdot \left( \frac{1}{2} - \left| x_i - \frac{1}{2} \right| \right) \right)^{\gamma} \quad (5)

\mathrm{FAvg}(x) = \frac{1}{|D|} \sum_{i \in D} \left( 2 \cdot \left( \frac{1}{2} - \left| x_i - \frac{1}{2} \right| \right) \right)^{\gamma} \quad (6)

The γ variable indicates the importance of hard examples. In the results, section 5, the γ used is noted after the method name. Notice that both functions double the score, relative to the functions on which they are based (3, 4), before raising it to the power γ. If this were not done, the scores would decrease rapidly with higher γ, even for maximally uncertain predictions. The doubling was not applied to functions 3 and 4, as it does not affect the ranking of estimated informativeness for those functions.

Implementation Functions 3, 4, 5 and 6 have been implemented as shown in algorithm 1, in which F corresponds to the function implemented. RetinaNet R does an inference run on the unlabelled set s^0 and the function assigns a score to each image x. A set s of size b, corresponding to the allowed size of the subset to be labelled, is then returned, containing the images with the highest scores.

Algorithm 1: Uncertainty-based method

input: unlabelled pool s^0, budget b, RetinaNet R and function F
Initialize s = ∅, t = ∅
s_i = R.infer(s^0)
for x ∈ s_i do
    u = F(x)
    t = t ∪ {u}
end
while |s| ≠ b do
    v = arg max(t)
    t = t \ {t_v}
    s = s ∪ {s^0_v}
end
return s
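As a concrete illustration of algorithm 1, the following minimal Python sketch scores per-image detection outputs and returns the budget highest-scoring images; the data layout and function names are hypothetical and do not correspond to the ODTK API:

```python
import numpy as np

def sum_score(scores: np.ndarray) -> float:
    # Equation 3: uncertainty peaks at a prediction score of 0.5.
    return float(np.sum(0.5 - np.abs(scores - 0.5)))

def avg_score(scores: np.ndarray) -> float:
    # Equation 4: normalise by the number of bounding boxes.
    return sum_score(scores) / max(len(scores), 1)

def focal_sum_score(scores: np.ndarray, gamma: float = 2.0) -> float:
    # Equation 5: the doubling keeps a maximally uncertain box at score 1.
    return float(np.sum((2.0 * (0.5 - np.abs(scores - 0.5))) ** gamma))

def select_subset(detections: dict[str, np.ndarray], budget: int,
                  score_fn=sum_score) -> list[str]:
    # Score every unlabelled image and return the `budget` highest-scoring ones.
    ranked = sorted(detections, key=lambda k: score_fn(detections[k]), reverse=True)
    return ranked[:budget]

# Usage with made-up detection scores for three images:
dets = {"img1.png": np.array([0.51, 0.48]),    # two very uncertain boxes
        "img2.png": np.array([0.95, 0.97, 0.9]),
        "img3.png": np.array([0.5])}           # one maximally uncertain box
print(select_subset(dets, budget=2))           # ['img1.png', 'img3.png']
```

Note how img1, with two uncertain boxes, outranks img3 under the sum score; this is exactly the bounding-box-count bias discussed above, which avg_score avoids.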

3.2 Core-set

The core-set approach [19] selects a subset based on how well it covers the feature space of the complete dataset. This is done by calculating the distances between the vectorized images and choosing the set which covers the largest part of the complete set, whilst minimizing the distance between each data point and its closest chosen data point, see figure 5.

This problem is similar to the k-Center problem, which has the same structure: find b data points, which function as centers, such that the distance from each data point to its closest center is minimal. The problem to be solved is given in equation 7: a set s^1 of at most budget size b is to be found, which will be the chosen subset of the total dataset s^1 ∪ s^0, such that the largest distance from any data point x_i to its nearest center x_j in s^1 ∪ s^0 is minimized.

\min_{s^1 : |s^1| \le b} \; \max_{i} \; \min_{j \in s^1 \cup s^0} \Delta(x_i, x_j) \quad (7)

Figure 5: The coverage principle of [19], in which S is the selected subset with range δ_S to encapsulate all points. The goal of the core-set approach is to minimize this range, to get the best representation of the complete dataset with the subset.

Implementation In [19], the authors mention that this problem is NP-hard, so an optimal solution is not always feasible; a greedy variant, however, is efficiently implementable, see algorithm 2. To create a more robust solution than the greedy version, the authors of [19] used the Gurobi framework; due to licensing issues, only the greedy variant was implemented within this thesis. Raw image features were used, whereas in the paper the activations of a classifier network were used; due to time constraints, this was not possible for this thesis. The distance metric used for ∆ in algorithm 2 is the l2 distance, see equation 8, where p and q correspond to the two image vectors.

d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \quad (8)

Algorithm 2: k-Center-Greedy [19]

input: data x_i, existing pool s^0 and budget b
Initialize s = s^0
while |s| ≠ b + |s^0| do
    u = arg max_{i ∈ [n] \ s} min_{j ∈ s} ∆(x_i, x_j)
    s = s ∪ {u}
end
return s \ s^0
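As an illustration of algorithm 2, a minimal sketch of k-Center-Greedy over raw feature vectors could look as follows; the helper and data here are hypothetical, not the thesis code:

```python
import numpy as np

def k_center_greedy(features: np.ndarray, labelled_idx: list[int],
                    budget: int) -> list[int]:
    """Greedy k-Center selection (algorithm 2) over row-vector features."""
    n = features.shape[0]
    selected: list[int] = []
    # Distance from every point to its nearest already-chosen center.
    min_dist = np.full(n, np.inf)
    for j in labelled_idx:
        d = np.linalg.norm(features - features[j], axis=1)  # l2 distance, eq. 8
        min_dist = np.minimum(min_dist, d)
    for _ in range(budget):
        u = int(np.argmax(min_dist))  # farthest point from all current centers
        selected.append(u)
        d = np.linalg.norm(features - features[u], axis=1)
        min_dist = np.minimum(min_dist, d)
    return selected

# Usage: 100 points in a toy 2D feature space, 5 already labelled, pick 10 more.
rng = np.random.default_rng(0)
X = rng.random((100, 2))
print(k_center_greedy(X, labelled_idx=[0, 1, 2, 3, 4], budget=10))
```

Tracking the running minimum distance avoids recomputing all pairwise distances at every step, which is what makes the greedy variant efficient in practice.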

3.3 Variational Adversarial Active Learning

The structure of the Variational Adversarial Active Learning method (VAAL) [21] is similar to that of a Generative Adversarial Network (GAN). GANs are typically used to improve a generator and a discriminator model in parallel [6]. The generator generates new data based on the training data it receives, and the discriminator predicts whether data is from the training set or generated. The generator improves by trying to deceive the discriminator into predicting the generated data as training data, whilst the discriminator improves by learning to distinguish between the two.


This way, a strong generator can be produced, which returns data similar to the training data, as well as a strong discriminator, which can discriminate between these two types of data.

Within VAAL, the generator is replaced by the sampling procedure from the latent space of the VAE. The VAE is tasked with learning an encoding such that it can trick the discriminator into predicting both unlabelled and labelled data as labelled when encoded, see figure 6. After training, the discriminator predictions indicate how representative an image is of the complete set. If the discriminator predicts an image as unlabelled, it is likely that a feature within this image is not yet well represented within the labelled dataset, as it was not encoded similarly.

For VAAL, instead of a standard VAE, a Wasserstein auto-encoder (WAE) is used, which minimizes the penalized divergence between the distribution of the auto-encoder and the target distribution [22]. This improves upon the generative qualities of a VAE. Although the standard WAE uses the Wasserstein distance, the Kullback-Leibler divergence is used instead for the VAAL method. The standard WAE loss is also extended to take the prediction of the discriminator into account, so as to enable the WAE to trick the discriminator.

Figure 6: A visualization of the VAAL [21]. The human expert, or oracle, in this case provides the labels for the data which the discriminator predicts as not part of the labelled pool.

Implementation The VAAL implementation for this thesis is shown in algorithm 3. Before the selection process, both the discriminator and the WAE are trained in an adversarial setup on the labelled set. The WAE then encodes the unlabelled images and the discriminator scores each image on how representative it is of the labelled set. The images with the lowest scores are chosen to be labelled. (The original code by the authors of [21] can be found at https://github.com/sinhasam/vaal; the dataset was changed to the one used for this thesis, see section 4.2, and the initial size of the labelled set, the budget and the number of increments were changed, see section 4.)

Algorithm 3: VAAL

input: unlabelled pool s^0, budget b, WAE W and discriminator D
Initialize s = ∅
s_enc = W.encode(s^0)
s_i = D.infer(s_enc)
while |s| ≠ b do
    v = arg min(s_i)
    s_i = s_i \ {(s_i)_v}
    s = s ∪ {s^0_v}
end
return s
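A minimal sketch of this selection step is given below; the `wae` and `discriminator` interfaces are assumptions for this sketch and do not correspond to the code of [21]:

```python
import torch

@torch.no_grad()
def vaal_select(wae: torch.nn.Module, discriminator: torch.nn.Module,
                unlabelled: torch.Tensor, budget: int) -> list[int]:
    # Encode the unlabelled images into the latent space; the encoder is
    # assumed to return (mu, logvar), of which the mean is used here.
    mu, _ = wae.encode(unlabelled)
    # The discriminator is assumed to output, with shape (N, 1), the
    # probability that a latent code belongs to the labelled pool; low scores
    # mean "looks unlabelled", i.e. under-represented in the labelled set.
    scores = discriminator(mu).squeeze(1)
    # Pick the `budget` lowest-scoring images for labelling.
    return torch.argsort(scores)[:budget].tolist()
```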

4 Experimental setup

Within this section, the implementation and evaluation choices are described: the implementation of the RetinaNet and the AL methods, as well as the evaluation metrics and the dataset used for this thesis.

4.1 Nvidia RetinaNet

For the implementation of the RetinaNet, I used Nvidia's Object Detection Toolkit (ODTK); see https://github.com/NVIDIA/retinanet-examples for more information and the code used. The ODTK allows for easy use of prebuilt RetinaNets with differently sized ResNet backbones and corresponding FPNs, optimized for GPU processing. It supports the CocoDetection format (see http://cocodataset.org/#format-data) and comes with the CocoAPI for processing results.

4.1.1 Docker and NGC

Docker (https://www.docker.com/) allows for the instant deployment of environments separate from the user's own desktop environment. This ensures that the application being run or developed has the right tools available from the start, without the need to go through the complete installation process, and without risking conflicts with installs in the user's desktop environment. This can be done by either building a container or installing an existing container image.

To deploy the Nvidia RetinaNet, I utilized the NVIDIA GPU Cloud (NGC). The NGC contains several container images for Docker. Using these images, running PyTorch models on a GPU is instantly possible, provided an Nvidia GPU is installed in the user's machine. For the Nvidia RetinaNet, the PyTorch NGC container was recommended and thus used. This is a Docker container image with out-of-the-box support for running PyTorch on a GPU. The AL methods, see section 3, were also deployed within this container.

4.2 Data

As the dataset to train and test the Nvidia RetinaNet on, I used the KFuji RGB-DS dataset [4]. This dataset covers Fuji apples hanging from their trees and consists of 967 images, each containing multiple apples with corresponding bounding boxes. In addition, each image has three different modalities: RGB, depth and range-corrected infrared intensity. For this thesis, I only employed the high-resolution RGB part of the dataset. The dataset corresponds to the dataset of the overarching project, as the main subject is apples with depth information, which is a core part of both datasets (another sub-project of the overarching project focuses on depth prediction from RGB images, hence the importance of depth information), see figure 7.

Important differences to note are in lighting (the individual trees are more illuminated in the KFuji dataset), distance (the trees are closer in the KFuji dataset) and resolution (the KFuji dataset has a lower resolution). Although these differences might be significant, no better alternative was available during the thesis.

(a) KFuji RGB (b) Actual dataset

Figure 7: An image from the KFuji RGB dataset and one from the dataset that will actually be used.

The data conforms to the CocoDetection format, in which the bounding boxes are stored as separate values in a json file apart from the image files. This is the case for the output of the Nvidia RetinaNet as well, see figure 8.

4.3 Experiment parameters

For the initial size of the training set I chose 10 percent, and for the budget size 10 percent as well. Image selection by the AL methods was done incrementally, with the size of the labelled set going up to 60 percent of the complete set. This means that 10 percent is picked from the remaining data, labelled and tested, after which another 10 percent is picked from the remaining data; this continues until 60 percent of the data has been used. I opted to go up only to 60 percent, as the first increments are the most informative about the performance of an AL method, and all methods converge in performance near 100 percent data usage, as they then cover exactly the same data. This also allowed for testing more hyperparameters of the Nvidia RetinaNet and the AL methods, as experiments do not take as long as they would when going up to 100 percent.

Figure 8: A visualization of the ground-truth bounding boxes (blue) and the predicted bounding boxes (red) with their corresponding scores.

To start on an even basis, I used the same pre-trained Nvidia RetinaNet for all methods, trained on the first 10 percent chosen randomly, corresponding to the seed of the current run. Each method was run for five different seeds with respect to the base model, to account for variability between runs. Two types of experiments were done in this incremental manner: continuous and reset. The continuous experiment involved training the base model for several epochs, increasing the labelled set with the subset chosen by the AL method, and training the resulting model for several epochs again, until 60 percent of the data was used. For the reset experiment, at each increment of the data, the base model was used for training and testing, instead of the previously trained model.
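The experiment loop described above, in both its continuous and reset variants, can be summarized in the following sketch; `train`, `evaluate` and `select` are hypothetical stand-ins for the ODTK and AL method calls:

```python
from typing import Callable, List

def run_experiment(train: Callable, evaluate: Callable, select: Callable,
                   pool: List[str], test_set: List[str],
                   reset: bool, budget: int, max_used: int) -> List[float]:
    # Initial labelled set: a random 10 percent, here assumed to be the
    # first `budget` items of an already shuffled pool.
    labelled, unlabelled = pool[:budget], pool[budget:]
    base_model = train(None, labelled)        # base model shared by all methods
    model, scores = base_model, []
    while len(labelled) < max_used:
        chosen = select(model, unlabelled, budget)    # AL picks the next 10%
        labelled = labelled + chosen
        unlabelled = [x for x in unlabelled if x not in chosen]
        start = base_model if reset else model        # reset vs continuous
        model = train(start, labelled)
        scores.append(evaluate(model, test_set))
    return scores
```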

The pre-trained Nvidia RetinaNets were either trained solely on the initial randomly selected 10 percent, or first on the COCO dataset (see http://cocodataset.org/ for more information) and then on the initial randomly selected 10 percent. I used both a RetinaNet with the ResNet18FPN backbone, as it is the smallest and fastest available Nvidia RetinaNet, and one with the ResNet50FPN backbone, as this allows for higher performance, with backbone sizes of 18 and 50 layers respectively.

4.4 Evaluation

To measure the performance of the Nvidia RetinaNet, I used the average precision (AP), the average recall (AR) and their combined F1 score, using the Intersection over Union (IoU); see equations 9, 10, 11 and 12. The IoU score is based on the intersection and the union of the ground-truth bounding box and the predicted bounding box. However, as this can be applied to any two boxes, the IoU is thresholded at 0.5:0.95, meaning thresholds from .5 up to .95 in incremental steps of .05 are used. For each AL method, at each percentage of data used, the Nvidia RetinaNet is measured with the metrics described above.

\mathrm{AP} = \frac{\text{Number of true positives}}{\text{Number of true positives} + \text{Number of false positives}} \quad (9)

\mathrm{AR} = \frac{\text{Number of true positives}}{\text{Number of true positives} + \text{Number of false negatives}} \quad (10)

F_1 = \frac{2 \cdot \mathrm{AP} \cdot \mathrm{AR}}{\mathrm{AP} + \mathrm{AR}} \quad (11)

\mathrm{IoU} = \frac{\text{Area of overlap}}{\text{Area of union}} \quad (12)
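As an illustration of equation 12, a minimal IoU computation for two axis-aligned boxes in (x1, y1, x2, y2) format (a generic sketch, not the CocoAPI implementation):

```python
def iou(box_a: tuple, box_b: tuple) -> float:
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction shifted by 10 pixels against a 100x100 ground-truth box:
print(iou((0, 0, 100, 100), (10, 10, 110, 110)))  # approx. 0.68, above the .5 threshold
```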

5 Results

For figures 9, 10, 11, 12, 13 and 14, subfigure (a) shows the representation-based methods Variational Adversarial Active Learning (VAAL) and k-center greedy (KCG), together with the random sampling baseline (RSB) and the sum (SUM) method, whilst (b) shows only the uncertainty-based methods sum and average (AVG), with F denoting the focal variant and the decimal after it denoting the γ used; if none is specified, γ equals 2. For more details on the hyperparameters and base models used, see appendix A; for the separate AR, AP and AP at an IoU of .5, see appendix B.

Although slight differences can be seen between the methods, for both the representation- and uncertainty-based methods, none has a significantly higher performance than any other method at any point, nor does any significantly beat the RSB. The KCG method, however, is less performant in the first few increments than the other methods in figure 11, but more performant in the later increments. This does not seem to be the case for the other experiments.

The continuous experiments, figures 9, 10 and 13, show a general increase in performance as the percentage of data used grows, with the exception of figure 13b and the RSB in figure 10a. This is not generally the case for the reset experiments. All experiments, except experiment 5 (figure 13), reach an F1 score of around 0.6.


(a)

(b)

Figure 9: Experiment 1, a pre-trained RetinaNet with a ResNet18FPN backbone trained continuously for 200 epochs.

(a)

(b)

Figure 10: Experiment 2, a pre-trained RetinaNet with a ResNet18FPN backbone trained continuously for 500 epochs.


(a)

(b)

Figure 11: Experiment 3, a pre-trained RetinaNet with a ResNet18FPN backbone trained for 500 epochs and reset after each increment.

(a)

(b)

Figure 12: Experiment 4, a non-pre-trained RetinaNet with a ResNet18FPN backbone trained for 1000 epochs and reset after each increment.


(a)

(b)

Figure 13: Experiment 5, a pre-trained RetinaNet with a ResNet50FPN backbone trained continuously for 500 epochs.

(a)

(b)

Figure 14: Experiment 6, a pre-trained RetinaNet with a ResNet50FPN backbone trained for 500 epochs and reset after each increment.


6 Discussion

First of all, the fact that the continuous experiments generally improve over time, whilst the reset experiments do not, can be explained by the training for the continuous experiments being cumulative. At each increment, the Nvidia RetinaNet model builds upon the previous one and is thus likely to be better than its predecessor; this is not the case for the reset experiments, as each run builds upon the same base model.

Following this, it can be concluded that the models do not improve when supplied with more data. In all reset experiments (figures 11, 12 and 14), the improvement with more data is non-existent. This could be because the models were not near convergence to their optimal performance. This is highly unlikely, however: experiments 1 and 2 (figures 9 and 10) use the same base model, the main difference being that for experiment 2 the models are trained for 300 extra epochs at every increment, yet experiment 2 does not perform better; its performance is actually generally lower than that of experiment 1, which might be due to overfitting.

Secondly, none of the methods definitively beats another method or the random sampling baseline. This holds for both the representation- and uncertainty-based methods. For the KCG method, this is explained by [21], which notes that this method suffers from the curse of dimensionality. Because the KCG method uses raw features, the dimensionality is very high, which hinders performance. This should, however, translate to it performing relatively worse than VAAL, which is the case in [21] but not in the results found within this thesis. The conclusion that none of the AL methods beats any other also follows from the fact that no model improves with more data: the models would not improve either if supplied with a specific subset of this data chosen by an AL method.

The result that none of the AL methods performs meaningfully better than the random sampling baseline also conforms with the findings in [16]. This, however, does not explain why none of the models improves when using more data, which they do in [16]. Seeing as the AL methods could not be properly measured against each other, the problem of the results not increasing with more data is also unlikely to be resolved by implementing more AL methods. This leaves the size of the RetinaNet or the dataset as plausible causes, and thus also as possible starting points for solutions.

The size of the RetinaNet could cause issues, as its performance might saturate relatively fast. This would be consistent with none of the models improving with more data. Even so, two backbones were used, the ResNet18FPN and the ResNet50FPN, of which the second is more than twice as big as the first. Still, both backbones reach similar performance. The anomaly here is experiment 4 (figure 12), with a non-pre-trained model, in which the models reach significantly lower performance than in the other experiments. Yet even for this model there is no improvement with more data.

This leaves the dataset as a cause for the absence of increasing performance. Both the size and the variability of the dataset can make discriminating between subsets difficult. First, the size of the dataset is relatively small, at 967 images. Compared to the CocoDetection dataset, which contains more than 120000 images, the KFuji dataset seems small. However, the apple class in the CocoDetection dataset has around 1600 instances, whilst each image in the KFuji dataset contains more than 4, which would suggest the KFuji dataset is better suited for training on apple detection. Secondly, the variance between images within the dataset might not be very high, so each image has about the same informativeness. Each image looks relatively similar to the images shown in figures 7a and 8, highly illuminated trees in front and darker ones in the back, whilst the images in the CocoDetection dataset differ a lot from each other. Seeing as the images might be very similar across the whole dataset, the Nvidia RetinaNet might not need much data to reach optimal performance on the given task. Were this the case, though, the performance of the models in experiment 5 (figure 13) would more likely have been closer to that of the other experiments. It might be, however, that the models in this experiment should have been trained for longer to reach the same performance.

6.1 Conclusion

Even though the results are not conclusive about any of the implemented AL methods, and thus leave the research question unanswered, they do not exclude the potential benefits that might be reaped when used on the actual dataset. Although [16] mentions that, in general, state-of-the-art AL methods are not as performant as their creators suggest and do not improve much upon the random sampling baseline, the intuition still holds that some data is more informative than other data, and that exploring AL can be fruitful.

However, as all methods perform relatively similarly, if one method is to be recommended over the random sampling baseline, it would be either the AVG or the FAvg method. These methods only pick an image if most of the bounding boxes within it have a high uncertainty score, and are thus biased towards images with a lower bounding box count. This should decrease labelling cost the most, whilst performance stays similar to the other methods, as found in section 5.

6.2 Future work

To improve upon the results found in this thesis, the AL methods need to be tested on the actual dataset used for the overarching project. The dataset appears to have a lot of influence on the performance of the AL methods. The dataset used for the overarching project might become larger or have a higher variety of images, which could make using AL more beneficial. Although not all possible hyperparameters of the Nvidia RetinaNet were tested, as this was outside the scope of this thesis, from the results it can be concluded that testing more hyperparameters or other AL methods should not be the first focus of future work.

References

[1] Edward H. Adelson et al. "Pyramid methods in image processing". RCA Engineer 29.6 (1984), pp. 33–41.

[2] Clemens-Alexander Brust, Christoph Käding, and Joachim Denzler. Active Learning for Deep Object Detection. 2018. arXiv: 1809.09875 [cs.CV].

[3] Brendan Collins et al. "Towards Scalable Dataset Construction: An Active Learning Approach". Computer Vision – ECCV 2008. Ed. by David Forsyth, Philip Torr, and Andrew Zisserman. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 86–98. isbn: 978-3-540-88682-2.

[4] Jordi Gené-Mola et al. "KFuji RGB-DS database: Fuji apple multi-modal images for fruit detection with color, depth and range-corrected IR data". Data in Brief (July 2019). doi: 10.1016/j.dib.2019.104289.

[5] Ross B. Girshick et al. "Rich feature hierarchies for accurate object detection and semantic segmentation". CoRR abs/1311.2524 (2013). arXiv: 1311.2524. url: http://arxiv.org/abs/1311.2524.

[6] Ian J. Goodfellow et al. Generative Adversarial Networks. 2014. arXiv: 1406.2661 [stat.ML].

[7] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016, pp. 499–523.

[8] Kaiming He et al. "Deep Residual Learning for Image Recognition". CoRR abs/1512.03385 (2015). arXiv: 1512.03385. url: http://arxiv.org/abs/1512.03385.

[9] Diederik P. Kingma and Max Welling. "An Introduction to Variational Autoencoders". CoRR abs/1906.02691 (2019). arXiv: 1906.02691. url: http://arxiv.org/abs/1906.02691.

[10] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. 2013. arXiv: 1312.6114 [stat.ML].

[11] Tsung-Yi Lin et al. "Feature Pyramid Networks for Object Detection". CoRR abs/1612.03144 (2016). arXiv: 1612.03144. url: http://arxiv.org/abs/1612.03144.

[12] Tsung-Yi Lin et al. Focal Loss for Dense Object Detection. 2017. arXiv: 1708.02002 [cs.CV].

[13] Wei Liu et al. "SSD: Single Shot MultiBox Detector". CoRR abs/1512.02325 (2015). arXiv: 1512.02325. url: http://arxiv.org/abs/1512.02325.

[14] Jonathan Long, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation". CoRR abs/1411.4038 (2014). arXiv: 1411.4038. url: http://arxiv.org/abs/1411.4038.

[15] David J.C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003, p. 34.

[16] Prateek Munjal et al. Towards Robust and Reproducible Active Learning Using Neural Networks. 2020. arXiv: 2002.09564 [cs.LG].

[17] Joseph Redmon et al. "You Only Look Once: Unified, Real-Time Object Detection". The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2016.

[18] Shaoqing Ren et al. "Faster R-CNN: Towards real-time object detection with region proposal networks". Advances in Neural Information Processing Systems. 2015, pp. 91–99.

[19] Ozan Sener and Silvio Savarese. Active Learning for Convolutional Neural Networks: A Core-Set Approach. 2017. arXiv: 1708.00489 [stat.ML].

[20] Burr Settles. Active Learning Literature Survey. Tech. rep. University of Wisconsin-Madison, Department of Computer Sciences, 2009.

[21] Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational Adversarial Active Learning. 2019. arXiv: 1904.00370 [cs.LG].

[22] Ilya Tolstikhin et al. Wasserstein Auto-Encoders. 2017. arXiv: 1711.01558 [stat.ML].

[23] Jasper R.R. Uijlings et al. "Selective search for object recognition". International Journal of Computer Vision 104.2 (2013), pp. 154–171.

[24] Lin Yang et al. "Suggestive Annotation: A Deep Active Learning Framework for Biomedical Image Segmentation". CoRR abs/1706.04737 (2017). arXiv: 1706.04737. url: http://arxiv.org/abs/1706.04737.


A Experiment details

Base models Multiple base models have been set up. These models are either a pre-trained Nvidia RetinaNet or a randomly initialized Nvidia RetinaNet. The base models are, after initialization and pre-training, trained on the first 10 percent of the dataset, chosen randomly. Using these base models for the data newly chosen by the AL methods ensures that each method has a fair start.

Model 1 Pre-trained Nvidia RetinaNet with a ResNet18FPN backbone, trained for 500 epochs and 10 validation iterations with a learning rate of 0.01.

Model 2 Randomly initialized Nvidia RetinaNet with a ResNet18FPN backbone, trained for 1000 epochs and 10 validation iterations with a learning rate of 0.001.

Model 3 Pre-trained Nvidia RetinaNet with a ResNet50FPN backbone, trained for 500 epochs and 10 validation iterations with a learning rate of 0.01.

Experiments Building upon the base models, several experiments have been done. These are either reset at each increment of data, meaning that at each increase of the data used, the model is trained for the set number of epochs starting from the base model; or continuously trained, meaning that the model trained and tested for the previous percentage is used for the next increment, with the initial model being the base model.

Experiment 1 Extends model 1, continuously trained for each increment of data for 200 epochs and 4 validation iterations with a learning rate of 0.01.

Experiment 2 Extends model 1, continuously trained for each increment of data for 500 epochs and 10 validation iterations with a learning rate of 0.01.

Experiment 3 Extends model 1, retrained for each increment of data for 500 epochs and 10 validation iterations with a learning rate of 0.01.

Experiment 4 Extends model 2, retrained for each increment of data for 1000 epochs and 10 validation iterations with a learning rate of 0.001.

Experiment 5 Extends model 3, continuously trained for each increment of data for 200 epochs and 4 validation iterations with a learning rate of 0.01.

Experiment 6 Extends model 3, retrained for each increment of data for 500 epochs and 10 validation iterations with a learning rate of 0.01.

B Additional results

For figures 15, 16, 17, 18, 19 and 20, subfigures (a) and (d) correspond to the AP at an IoU of 0.5:0.95, (b) and (e) correspond to the AR at an IoU of 0.5:0.95, and (c) and (f) correspond to the AP at an IoU greater than or equal to 0.5. The results generally adhere to the findings of section 5 and thus also support the statements made in section 6.

Figure 15: Experiment 1

Figure 16: Experiment 2

Figure 17: Experiment 3

Figure 18: Experiment 4

Figure 19: Experiment 5

Figure 20: Experiment 6
