
Layout: typeset by the author using LaTeX.


Driving on Data

Object Detection in Urban Driving Scenes

Daniël C. Hamerslag 11323795

Bachelor thesis Credits: 18 EC

Bachelor Artificial Intelligence (Kunstmatige Intelligentie)

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Miltiadis Kofinas, MSc
Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 907
1098 XG Amsterdam


Abstract

Autonomous vehicles are slowly becoming a reality and bring the potential to create a safer driving environment, reducing the unprecedented number of traffic casualties. Using the new NuImages dataset, this paper evaluates modern object detectors, which are a fundamental component of autonomous driving systems. An object detector was trained on this new dataset; although it does not exceed the performance of existing object detectors, it gives insight into how these technologies work, as well as into the usefulness of the NuImages dataset. Finally, suggestions are made for future research on this subject.


Contents

1 Introduction
   1.1 History
   1.2 Present
   1.3 Object Detection
   1.4 Goal

2 Related Work
   2.1 Deep learning
   2.2 Convolutional neural networks
   2.3 Backpropagation in CNNs
   2.4 Activation functions
   2.5 Pooling
   2.6 Region Proposal Networks
   2.7 Faster R-CNN
   2.8 Anchors
   2.9 Mask R-CNN

3 Method
   3.1 NuImages
   3.2 Mask R-CNN
   3.3 Setup
   3.4 Training and Hardware
   3.5 Evaluation

4 Results and analysis
   4.1 Quantitative results
   4.2 Qualitative results

5 Conclusion and future work
   5.1 Conclusion
   5.2 Future work

Chapter 1

Introduction

1.1 History

Ever since humans started roaming our planet there has been a need for transportation, for humans as well as for goods and animals. The fundamental discovery of the wheel dates back to around 3500 BC and was a giant leap forward that propelled humankind toward progress and prosperity. Wagons with primitive wheels spread throughout most of Europe and Asia over the course of a few centuries, but it took until the nineteenth century before the horse was finally replaced by what was called the "automobile", a combination of the Greek "auto" ("self") and Latin "mobile" ("moving"). About 100 years later, 50 years after Turing's invention of the computer, a new technology made its entrance that is evolving automobiles into truly self-moving machines. Over the past millennia humans have literally been holding the reins, first those of horse-drawn carriages and later of engine-driven cars.

1.2 Present

Now, however, a new age is at hand in which control is about to be relinquished to AI technology. Companies such as Tesla, which command innovative strength and sympathy from a broad audience by focusing not only on electric propulsion and autonomous driving but also on ingenious design, have brought this challenge within reach. It now depends on AI researchers and private enterprises whether or not safe 'hands-free' driving becomes available on a large scale, and whether or not it will contribute to a safer driving environment. When the first autonomous vehicles appeared on the streets they were followed by a wave of warnings about the dangers they would present, just as people were wary and in awe of the first trains. People needed time to get used to and adapt to this new technology. Understandably so: if we compare the time between the invention of the wheel and the appearance of the first car to the quick evolution from car to autonomous Tesla, new adjustment problems are to be expected.

Arguably the best way for autonomous vehicles to win people's trust is by using their full potential to make driving safer. Evaluations of autonomous vehicle safety (Papadoulis, Quddus, & Imprialou, 2019) have shown that a network of connected autonomous vehicles will not only reduce traffic conflicts but also ensure efficient traffic flow. To make these autonomous vehicles a reality, a multitude of different systems need to work in tandem to ensure their driving capabilities match or exceed those of humans. The fundamental building block of such a driving system is a reliable and accurate object detector, which is necessary for an autonomous driving system to see what is happening around it, as well as to predict driving situations in advance.

1.3 Object Detection

Object detection as a field within AI existed even before deep learning became prevalent, as early object detectors were mostly based on handcrafted features such as lines, shapes or even entire objects. These features were compared with parts of an image, and if the feature response was strong enough in a certain place, the object would be detected. Other early object detection methods work by calculating a histogram of oriented gradients in parts of an image (Dalal & Triggs, 2005). These techniques were later enhanced by region proposal networks that calculate the most likely locations and scales of objects so they can be more easily compared with features. Then, in 2012, the convolutional neural network AlexNet (Krizhevsky, Sutskever, & Hinton, 2017) arguably marked the start of the deep learning era in object detection and image recognition by significantly improving image classification compared to its predecessors.

These deep learning models are the foundation of object detectors in autonomous vehicles, and their inner workings will be discussed in the next chapter. One aspect all object detectors share is their need for large, annotated datasets. Luckily, more and more of these have been created and made public in recent years. One such dataset is NuImages (Caesar et al., 2019), an annotated dataset of over 90,000 images of urban driving scenes, all annotated with bounding boxes as well as segmentation and instance masks. In order to evaluate which object detection systems are most effective in urban driving scenes, this dataset will be used to train and evaluate possible models.

1.4 Goal

This project evaluates which object detection systems work best in urban driving scenes, while also being able to detect approaching objects from a large distance, something traditional object detectors struggle with. In order to do this, object detectors will be trained and evaluated using the new NuImages dataset.

Figure 1.1: Examples of the instance mask and bounding box annotations of random samples in the NuImages dataset

Chapter 2

Related Work

Object detection is deeply entwined with deep learning and convolutional neural networks, so to get an overview of how object detectors work, a brief review of both of these subjects is in order.

2.1 Deep learning

The first concept of deep models can be traced back to 1947 with Pitts and McCulloch (1947), whose intention was to simulate the human brain in order to allow computers to complete complex tasks that were previously only possible for humans. Although popular during the 1980s and 1990s, deep learning subsequently lost momentum, largely because of a lack of computing power and training data. Over the past decade or so, it reemerged thanks to the appearance of large annotated datasets, the parallel processing power of GPUs and significant advances in training techniques.

2.2 Convolutional neural networks

Convolutional neural networks (CNNs) enhance the feature learning ability of deep learning networks. In the context of CNNs, a convolution is a filter that is applied to an input. Most regular neural networks applied to images receive the input image as a flattened array, so a 16x16 image would become a 256x1 input. With CNNs, however, the original shape of the image is kept in order to preserve spatial dependencies in the input image, and convolutions are applied as filters. The contents of these filters are the learnable parameters of the model, similar to the weights in a regular neural network. The results of the convolutions are called feature maps, and they serve as the input for the next layer of filters; these stacked filters form the layers of a CNN. As the name suggests, the feature maps extract features: certain filters might learn to detect certain shapes or edges. Typically, the first layer of filters detects low-level features such as colors or shapes in certain regions, while deeper layers detect more sophisticated features that may decide what a shape actually is, such as a hand or a wheel. Finally, a classic fully-connected layer learns the linear combinations of these high-level features to predict the actual object classes. This is shown in figure 2.1.

Figure 2.1: Overview of a basic CNN, source: (Jan et al., 2017).
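To make the pipeline above concrete, the following is a minimal PyTorch sketch of such a network; the layer sizes and class count are illustrative and not taken from the thesis.

```python
import torch
import torch.nn as nn

# Minimal CNN sketch: convolutional filters produce feature maps, pooling
# reduces their size, and a fully-connected layer combines the high-level
# features into class scores. All sizes are illustrative.
class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level features
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # higher-level features
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 8x8 -> 4x4
        )
        self.classifier = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, x):
        x = self.features(x)      # feature maps keep their spatial layout
        x = torch.flatten(x, 1)   # only flattened at the very end
        return self.classifier(x)

scores = TinyCNN()(torch.randn(1, 3, 16, 16))  # e.g. a single 16x16 RGB image
```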

2.3 Backpropagation in CNNs

As in any other type of neural network, backpropagation (Rumelhart, Durbin, Golden, & Chauvin, 1995) is what drives the learning capabilities of a convolutional neural network. The role of backpropagation is the same as in simple neural networks; however, the backward pass works slightly differently in convolutional layers. A convolution can be seen as a simple computational graph that uses X and F to compute output O, with X being the input values and F being the filter. The forward pass is F ∗ X = O. In the backward pass, the gradient of the loss L with respect to the output, δL/δO, is received from the next layer, and the gradients of X and F need to be calculated. Just like in a regular backward pass this can be done using the chain rule, which requires the local gradients of X and F, δO/δX and δO/δF, together with δL/δO:

δL/δX = δL/δO ∗ δO/δX   (2.1)

δL/δF = δL/δO ∗ δO/δF   (2.2)

These equations are important because δL/δX becomes the loss gradient for the previous layer, which is used for the rest of the backward pass. Meanwhile, δL/δF is used to update the filter using F_new = F − α · δL/δF, with learning rate α, which is how the convolutional layer learns.
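As a small sanity check of these relations (not part of the thesis), one can let PyTorch's autograd compute the same gradients for a single-channel convolution; conv2d stands in for the F ∗ X operation above.

```python
import torch
import torch.nn.functional as F

# Forward pass O = F * X (a single-channel convolution), then backpropagate
# an upstream gradient dL/dO and read off dL/dX and dL/dF via autograd.
X = torch.randn(1, 1, 5, 5, requires_grad=True)   # input
W = torch.randn(1, 1, 3, 3, requires_grad=True)   # filter F
O = F.conv2d(X, W)                                # output feature map

dL_dO = torch.ones_like(O)      # stand-in loss gradient from the next layer
O.backward(dL_dO)

alpha = 0.01                    # learning rate
with torch.no_grad():
    W_new = W - alpha * W.grad  # F_new = F - alpha * dL/dF
print(X.grad.shape, W.grad.shape)  # dL/dX is passed on to the previous layer
```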

2.4 Activation functions

Activation functions are the core of any neural network: they control whether or not a neuron is actually activated, as well as how strong its output is. In their simplest form, activation functions are non-linear functions that take the weighted value of a neuron as input and produce the actual value of the neuron as output. As shown in the formulas below, the weighted value of a neuron is a linear calculation, so its derivative is always a constant scalar. For backpropagation to work, however, the derivative needs a relation to the inputs x_i; a derivative consisting of a single scalar does not contain x_i and therefore has no relation to it. This means neural networks cannot successfully complete a backward pass for backpropagation without activation functions. Additionally, because every layer would be a linear function of the layer before it, the final layer would always be a linear combination of the first layer, effectively reducing the neural network to a single layer. This makes activation functions an essential part of any neural network.

y_w = w_1 x_1 + w_2 x_2 + b   (2.3)

y = σ(y_w)   (2.4)

where σ is an activation function.

Since any non-linear function can technically be used as an activation function, many different activation functions are in use. The most common ones, however, are sigmoid, tanh and (leaky) ReLU (Krizhevsky et al., 2017). Each of these has its own advantages and disadvantages, which will now be discussed.

Sigmoid

σ(x) = 1 / (1 + e^(−x))   (2.5)

σ′(x) = σ(x) · (1 − σ(x))   (2.6)

The sigmoid function squeezes the output between 0 and 1, while also bringing values closer to either 0 or 1, which makes predictions clearer. The fact that all values lie between 0 and 1 is the reason that sigmoid functions are often used in models where the final result is a probability. However, this function almost completely disregards extreme values: the gradient is almost zero for extremely high and low inputs, which can cause learning problems.

Tanh

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))   (2.7)

tanh′(x) = 1 − tanh(x)^2   (2.8)

This activation function is very similar to sigmoid, except that it is centered around 0 instead of 0.5. This has certain advantages, such as zero inputs actually being mapped to zero, and it makes it easier to map extremely negative inputs. However, because of the similarities it suffers from the same disadvantages as the sigmoid function.

ReLU

R(x) = x if x > 0, 0 if x ≤ 0   (2.9)

R′(x) = 1 if x > 0, 0 if x < 0   (2.10)

ReLU is the most commonly used activation function, largely because it is very computationally efficient: it uses only trivial operations. It is about as close to linear as a non-linear function can get. However, this does have certain disadvantages. For example, every negative input is reduced to zero, which means the derivative there is also zero, which in turn causes backpropagation to become ineffective. This is referred to as the dying ReLU problem, and it can cause specific neurons to become completely passive, effectively putting a part of the neural network to sleep. Additionally, the range of this function is [0, ∞), which can result in extremely high activations.

Leaky ReLU

R(x) = x if x > 0, αx if x ≤ 0   (2.11)

R′(x) = 1 if x > 0, α if x < 0   (2.12)

Leaky ReLU is an attempt to solve the dying ReLU problem by giving negative inputs a small derivative α, usually set to 0.01.
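The activation functions above and their derivatives can be written down in a few lines; a small NumPy sketch, with input values chosen only to illustrate the saturation and dying-ReLU behaviour:

```python
import numpy as np

# The activation functions discussed above, with their derivatives.
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)

def d_tanh(x):
    return 1 - np.tanh(x) ** 2

def relu(x):
    return np.where(x > 0, x, 0.0)

def d_relu(x):
    return np.where(x > 0, 1.0, 0.0)        # zero gradient for x <= 0: the dying ReLU problem

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def d_leaky_relu(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)      # small slope keeps gradients alive for negative inputs

x = np.array([-10.0, -0.5, 0.5, 10.0])
print(d_sigmoid(x))      # nearly zero at the extremes: the saturation problem
print(d_leaky_relu(x))   # never exactly zero
```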

2.5 Pooling

As can be seen in figure 2.1, there are sampling layers in between the convolutional layers. These are often called pooling layers and serve both to decrease the computational complexity of the model and to extract features on a different scale. A pooling layer can be applied to the feature maps from convolutional layers, and multiple pooling layers can exist in a single model. Pooling operations are essentially a filter being applied to a feature map, except that the filter is an operation built into the model rather than learned. Pooling filters are usually 2 × 2 and applied with a stride of two, so they reduce the size of the feature map by a factor of two in every dimension. The two most common types of pooling functions are:

• Max Pooling: Returning the maximum value in the filter area.

• Average Pooling: Returning the average of all values in the filter area.

Feature maps retain the position of the objects in the original image; an object in the top-left of the original image is also expressed in the top-left of the feature map. Unfortunately, this means that slight variations in the position of an object have an unwanted impact on the feature map. In a pooled feature map, an object that is slightly moved or distorted will still be expressed in the same position, which makes object detection more precise.

Figure 2.2: The difference between max pooling and average pooling. Image retrieved from https://www.quora.com/What-is-max-pooling-in-convolutional-neural-networks
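A minimal illustration of the two pooling operations from figure 2.2, using PyTorch; the 4 × 4 feature map is made up for the example.

```python
import torch
import torch.nn.functional as F

# A 4x4 feature map pooled with a 2x2 window and stride 2:
# both operations halve each spatial dimension, as described above.
fmap = torch.arange(16.0).reshape(1, 1, 4, 4)        # batch, channel, height, width
print(F.max_pool2d(fmap, kernel_size=2, stride=2))   # keeps the strongest response per window
print(F.avg_pool2d(fmap, kernel_size=2, stride=2))   # keeps the average response per window
```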

2.6 Region Proposal Networks

When detecting objects in a crowded image, where the exact location of the objects to be detected is unknown and any number of objects might be visible, a region proposal network (RPN) becomes necessary for the object detection system to function. These RPNs split the image into multiple regions and estimate the likelihood of each of these regions containing an object. A multitude of region proposal methods have been developed over the years. However, more computationally efficient detection networks such as Fast R-CNN (Girshick, 2015) and SPPnet (He, Zhang, Ren, & Sun, 2015) have exposed these region proposal computations as a bottleneck when it comes to decreasing the running time of these algorithms. This makes sense considering that, in the original R-CNN, every single region proposal had to be passed through the CNN separately to classify which object is present in that region; since there are usually around 2000 region proposals in an image, training such a network is not very fast at all.

2.7 Faster R-CNN

Many solutions have been proposed to decrease the amount of computation necessary for region proposals while maintaining accuracy. One such technique is Faster R-CNN (Ren, He, Girshick, & Sun, 2015); although slightly dated, it is still widely considered a state-of-the-art model for object detection. As its name suggests, it is based on Fast R-CNN but faster because of its region proposal network. Faster R-CNN introduces an RPN which is combined with the detection network, which allows the region proposals to be nearly cost-free. Essentially, this region proposal network is a convolutional neural network that simultaneously predicts object bounds and objectness scores. Therefore, it is no longer necessary to attach a CNN to every single region proposal. Instead, a single CNN is applied to the entire image, resulting in a convolutional feature map, on which a separate network is trained to predict the region proposals. These region proposals are then shaped into a new layer, called the Region of Interest (RoI) pooling layer, which is used to predict the object within the region of interest, along with its bounding box.

2.8 Anchors

A recurring problem in machine learning, and especially in object detection, is the varying size of the input image as well as the varying number of bounding boxes. Faster R-CNN solves this problem by using anchors: boxes of a fixed size that are placed evenly throughout the input image at fixed intervals. For every anchor the model decides whether or not this anchor contains an object and, if it does, how it can be adjusted to better fit that object. Although the anchors are placed on the convolutional feature map, they reference locations in the original image; therefore the number of anchors depends on the width W and height H of the image, as well as on the sub-sampling ratio r. An image of size W × H will have a feature map of size W/r × H/r, so an anchor on every data point in the feature map will create anchors r pixels apart in the original image.

Because objects can appear in many different sizes and are not always square (cars, for example, tend to be shaped like horizontal rectangles), a single anchor center is home to multiple anchor boxes. These anchor boxes come in three different sizes and three different ratios between their width and height, so for each of the three preset sizes, anchor boxes of ratios 1:1, 2:1 and 1:2 are created.

Figure 2.3: Left: Locations of anchor centers. Right: anchor boxes of a single random anchor center.

This creates a large number of anchor boxes that need to be processed by the RPN. The RPN's task is to predict the probability that the content of each anchor box is part of the background or of an object in the foreground, as well as to fine-tune the anchor box. It is important to note that this is a binary classification where the class of the object does not matter, only whether or not there is one. This probability is called the 'objectness score' and is computed alongside a bounding box regression for adjusting the anchor box. This adjustment consists of four values, ∆x, ∆y, ∆width and ∆height, which indicate how the original anchor box needs to be adjusted to better fit the object.
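The anchor construction described above can be sketched in a few lines; the anchor sizes below are illustrative rather than the ones used by Faster R-CNN or in this project, and a 1600 × 900 image with r = 16 is assumed only for the example.

```python
import numpy as np

def make_anchors(img_w, img_h, r=16, sizes=(64, 128, 256), ratios=(1.0, 2.0, 0.5)):
    """Return (N, 4) anchor boxes (x1, y1, x2, y2) in image coordinates."""
    anchors = []
    # One anchor center per feature-map cell, r pixels apart in the image.
    for cy in range(r // 2, img_h, r):
        for cx in range(r // 2, img_w, r):
            for s in sizes:
                for ratio in ratios:            # ratio = width / height
                    w = s * np.sqrt(ratio)
                    h = s / np.sqrt(ratio)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

boxes = make_anchors(1600, 900)
print(boxes.shape)  # roughly (1600/16) * (900/16) centers, times 9 boxes each, about 50k anchors
```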

2.9 Mask R-CNN

While Faster R-CNN only detects objects as bounding boxes, Mask R-CNN (He, Gkioxari, Dollár, & Girshick, 2017) augments this by also detecting the precise location of every pixel of an object, called a segmentation mask. Mask R-CNN does this by adding a new branch specifically for predicting an object mask, which works in parallel with the bounding box recognition branch from Faster R-CNN. This branch is another CNN that takes a convolutional feature map as its input and creates a matrix that indicates which pixels in the image are part of the object, called a binary mask. The process Mask R-CNN uses is very similar to what was discussed in the previous section: a region proposal network selects the most promising regions of interest, each of which contains at most a single class. The semantic model is then trained like a binary classifier, where every pixel that contains an object is classified as 1 and every pixel that does not is classified as 0. Then, for every bounding box, these pixels can be assigned individual colors based on their class to produce results such as in the image below.

Figure 2.4: An image from nuImages with and without its segmentation mask.

Although more modern and possibly more effective object detectors exist, Mask R-CNN is still widely considered a state-of-the-art model; therefore, different implementations of Mask R-CNN will be the focal point of this project.

Chapter 3

Method

3.1 NuImages

The nuImages dataset consists of a total of 93k annotated images, containing a grand total of 800k foreground objects with bounding boxes and 100k semantic segmentation masks. These images were taken in an urban environment in both Boston and Singapore and were annotated by humans following annotation instructions created by the team behind nuImages. Every annotated image also comes with six past and six future images, taken half a second apart. Unfortunately these images are not annotated and are therefore not used for training. The annotated objects are divided into a total of 25 different classes, such as cars, adult pedestrians, trucks and traffic cones. As with any dataset, the distribution of these classes is not uniform, with some classes appearing thousands of times and others only a few dozen times, as can be seen in table 3.1. The implications of this imbalance in class frequency will be discussed in the final chapter.

3.2 Mask R-CNN

Class | No. of annotations | Percentage of total
Car | 250,088 | 36.05
Adult Pedestrian | 149,921 | 21.61
Barrier | 88,545 | 12.76
Traffic Cone | 87,603 | 12.63
Truck | 36,314 | 5.23
Bicycle | 17,060 | 2.46
Motorcycle | 16,779 | 2.42
Construction Worker | 13,582 | 1.96
Rigid Bus | 8,361 | 1.21
Construction Vehicle | 6,071 | 0.88
Trailer | 3,771 | 0.54
Push/Pullable | 3,675 | 0.53
Debris | 3,171 | 0.46
Bicycle Rack | 3,064 | 0.44
Personal Mobility | 2,281 | 0.33
Child Pedestrian | 1,934 | 0.28
Police Officer | 464 | 0.07
Bendy Bus | 265 | 0.04
Stroller | 363 | 0.05
Animal | 255 | 0.04
Police Vehicle | 139 | 0.02
Ambulance | 42 | 0.01
Wheelchair | 35 | 0.01

Table 3.1: Class distribution of the NuImages dataset

As indicated in the previous chapter, a Mask R-CNN model was trained using the nuImages dataset. Mask R-CNN is described by its creators as "a conceptually simple, flexible, and general framework for object instance segmentation". It is as fast as Faster R-CNN, despite adding an entire branch for instance segmentation, which gives predictions that are far more precise than bounding boxes. Just like Faster R-CNN, Mask R-CNN works in two stages: first extracting regions of interest (RoIs) with an RPN, then extracting features from these regions and classifying them. The difference is that during this second stage, Mask R-CNN also extracts a binary mask for every RoI. Note that the object classification and mask segmentation do not impact each other because they happen in parallel. The total loss for every RoI is calculated by adding the bounding-box loss, the classification loss and the mask loss. The classification and bounding-box losses are both calculated the exact same way as in Fast R-CNN (Girshick, 2015). Since a mask has to be created for every object in every RoI, and the binary mask functions on a per-pixel basis, the output of the mask branch has size K × M, with K being the number of classes and M being the resolution of the masks. The mask loss is then defined as the average binary cross-entropy loss. This allows the network to separate the class prediction and the mask prediction so there is no competition between the classes.
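A sketch of the per-RoI mask loss just described; the class count, mask resolution and the dummy tensors are illustrative rather than values from the actual implementation.

```python
import torch
import torch.nn.functional as F

# The mask head predicts one M x M mask per class for each RoI, but only the
# mask belonging to the RoI's ground-truth class contributes to the loss,
# so the classes do not compete with each other.
K, M = 25, 28                                   # number of classes, mask resolution (illustrative)
mask_logits = torch.randn(K, M, M)              # mask-branch output for a single RoI
gt_class = 3                                    # ground-truth class of this RoI
gt_mask = torch.randint(0, 2, (M, M)).float()   # binary ground-truth mask for this RoI

mask_loss = F.binary_cross_entropy_with_logits(mask_logits[gt_class], gt_mask)
```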

Since every pixel in the masks is related to a pixel in the original image, the feature maps that the RPN presents as regions of interest need to be aligned to the original image. Therefore, Mask R-CNN uses a so-called RoIAlign layer. This creates a fixed-size feature map on which the mask branch uses binary classification to predict the masks, while the bounding box and classification tasks are done using fully connected layers.

3.3 Setup

For this experiment, a Mask R-CNN ResNet-50 FPN model was trained on the nuImages training set, containing a total of 62k images. This model uses a Feature Pyramid Network (FPN) (T.-Y. Lin et al., 2017) with a ResNet-50 backbone and is pretrained on the Common Objects in Context (COCO) dataset (T. Lin et al., 2014), a dataset of over 200k labeled images with objects from 80 different classes. Because this model was already pretrained, only the head of the model had to be changed to fit the 25 classes of the nuImages dataset. This was done by loading the model from TorchVision (Paszke et al., 2019) and replacing the box and mask predictors with a Fast R-CNN and Mask R-CNN predictor, respectively.
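The head replacement described above likely follows the standard TorchVision fine-tuning recipe; the sketch below assumes that recipe, and the hidden-layer size and the extra background class are assumptions rather than details stated in the thesis.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

# Mask R-CNN with a ResNet-50 FPN backbone, pretrained on COCO.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

num_classes = 26  # 25 nuImages classes + background (assumption)

# Replace the box predictor head with one sized for nuImages.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask predictor head as well.
in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
```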

The model was fine-tuned with a stochastic gradient descent (Robbins, 2007) optimizer with the learning rate set to 0.005. The learning rate was chosen based on preliminary experiments on the mini version of nuImages, a subset of the dataset containing 50 annotated images. Stochastic gradient descent was chosen as the optimizer because it approximates the gradient without processing the entire dataset on every update, which helps when processing such a large dataset with limited GPU processing power. Because stochastic gradient descent sometimes overshoots loss minima, a learning rate scheduler was also added to halve the learning rate every three epochs.
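Continuing the sketch above, the optimizer and scheduler settings stated in the text translate roughly to the following; the momentum value and the training loop are assumptions.

```python
import torch

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9)  # momentum is an assumption
# Halve the learning rate every three epochs.
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

# for epoch in range(5):
#     train_one_epoch(model, optimizer, data_loader)  # hypothetical training loop
#     lr_scheduler.step()
```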

As for retrieving the samples from the dataset: not all objects have both a bounding box and a mask annotation; some objects only have a bounding box. These objects were not used for training or evaluation and were simply ignored.

3.4 Training and Hardware

The nuImages dataset consists of a total of 92k images, 68k of which are part of the training set, whereas the rest are part of the test and validation sets. Training a model on a dataset consisting of 68k high-resolution images requires a lot of GPU processing power. Therefore, training was done on a Google Cloud machine equipped with 32GB of RAM and a Tesla P100 16GB GPU. Despite this being one of the best performing GPUs on the market, training took around 16 hours per epoch. Because of the limited amount of time and resources available, training on the full dataset was limited to only 5 epochs, taking about three days. Furthermore, Detectron2 (Wu, Kirillov, Massa, Lo, & Girshick, 2019) was used, which allowed for easy experimentation with different models from Detectron2's model zoo. These models were trained and evaluated on the mini version of the dataset in order to save time and resources, and the results were then compared to the AP scores presented in the model zoo, which are shown in table 3.2. Because the test results on the mini version of nuImages were essentially the same as the results presented by Detectron2, these preexisting baselines were used to pick the model for training on the entire dataset.

Model | Inference time (s/img) | Memory (GB) | box AP
R50-C4 | 0.111 | 5.2 | 39.8
R50-DC5 | 0.076 | 6.5 | 40.0
R50-FPN | 0.043 | 3.4 | 41.0
R101-C4 | 0.145 | 6.3 | 42.6
R101-DC5 | 0.092 | 7.6 | 41.9
R101-FPN | 0.056 | 4.6 | 42.9
X101-FPN | 0.103 | 7.2 | 44.3

Table 3.2: Model performance from the Detectron2 model zoo, trained and evaluated on the COCO dataset

Based on these results, the R50-FPN model has the lowest inference time and memory usage. Both of these metrics are important because this project requires a model that can classify images quickly, and the hardware available has a limited amount of RAM. Note that the images in COCO have a lower resolution than the images in nuImages, therefore the memory usage will be higher for our model. Despite performing well in the aforementioned metrics, R50-FPN maintains a decent bounding-box average precision. As a result, this model was chosen for further training.

3.5 Evaluation

The model was evaluated by calculating the mean Average Precision (mAP) on the nuImages validation set, containing 16k images. As shown in equation 3.1, this metric calculates the average precision for every class in the validation set and averages these values to produce a single number. In this equation, K is the number of classes and AveP(k) is the average precision (AP) of a single class.

mAP = ( Σ_{k=1}^{K} AveP(k) ) / K   (3.1)

In addition, the average precision of every class was computed and plotted separately. The average precision is based on the intersection over union (IoU), defined in equation 3.2, where the areas of overlap and union are calculated by comparing the bounding boxes of the prediction and the ground truth.

Intersection over Union = Area of Overlap / Area of Union   (3.2)

Any prediction with IoU ≥ 0.5 is considered a true positive, while any prediction with IoU < 0.5, any prediction made where no object is present, and any prediction with the wrong classification is considered a false positive. These values can then be used to compute the average precision for every class. Furthermore, any ground-truth object that is not matched by a prediction counts as a false negative.

The average precision is calculated by ranking every prediction for every class based on their confidence scores, which are given by the object detector, and computing the precision and recall for every single prediction. For example, if there are two objects of a specific class and the model creates five different predictions which are as follows, ranked by their confidence scores: true, false, true, false, false. Computing the precision and recall per prediction would lead to table 3.3. The average precision for this class would then be the average of all precision values in this table.

True or false | Precision | Recall
True | 1 | 0.5
False | 0.5 | 0.5
True | 0.66 | 1.0
False | 0.5 | 1.0
False | 0.4 | 1.0

Table 3.3: Precision and recall per prediction for a hypothetical class with only two occurrences.
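The worked example above can be reproduced with a short script; average_precision below follows the simplified averaging used in the text (not the interpolated AP used by COCO), and the iou helper mirrors equation 3.2.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes, as in equation 3.2."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    overlap = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return overlap / (area_a + area_b - overlap)

def average_precision(hits, num_gt):
    """hits: predictions ranked by confidence (True = TP, False = FP)."""
    tp, rows = 0, []
    for i, hit in enumerate(hits, start=1):
        tp += hit
        rows.append((hit, tp / i, tp / num_gt))   # (prediction, precision, recall)
    ap = sum(precision for _, precision, _ in rows) / len(rows)
    return ap, rows

ap, rows = average_precision([True, False, True, False, False], num_gt=2)
print(rows)  # reproduces the precision/recall columns of table 3.3
print(ap)    # (1 + 0.5 + 0.66... + 0.5 + 0.4) / 5, roughly 0.61
```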

Chapter 4

Results and analysis

4.1 Quantitative results

Figure 4.1 shows the number of objects that were evaluated for every class in the nuImages v1.0-val dataset. There are large differences between the number of occurrences of each class, with the most common one appearing 52,269 times and the least common one only twice. Note that this figure only shows 23 classes, while there are 25 classes in the full dataset; two classes did not appear in the evaluation set at all. The figure also shows that the four most frequent classes, cars, pedestrians, traffic cones and barriers, make up 83% of the total number of objects, with all other classes combined making up only 17%. Because the images in the dataset were all taken in either Boston or Singapore, this distribution of object frequency might be entirely different in other urban environments; a city like Amsterdam, for example, will be home to a much larger number of bicycles. Finally, it is important to note that this distribution does not exactly represent the distribution of objects in the nuImages dataset, because all objects without a segmentation mask were removed by the data retrieval system.


Figure 4.1: Class distribution of the objects in the validation set

Figure 4.2 shows the AP50 for all 23 classes that were evaluated. The AP ranges from 0% for the bottom six classes to 86% for the best performing class. Note that this figure shows the average precision where any prediction with IoU ≥ 0.5 is considered a correct prediction. The mAP for IoU thresholds ranging from 0.5 to 0.95, which is not shown in this figure, is 18.6%, which is considerably lower.


Figure 4.3: False and true positives for every class

Figure 4.3 shows the number of true and false positives for every class. Note that only classes with at least one prediction are shown; classes without any true or false positives are omitted, resulting in 19 classes in total. This graph clearly shows that for every class, a very large proportion of the predictions are wrong, resulting in a high number of false positives. For example, cars, the best performing class in terms of AP, has 179,279 false positives and 49,120 true positives. This means that 179k / (179k + 49k) × 100 ≈ 78% of all instances where the model predicted a car were incorrect. Note that the mean average precision for some of these classes is relatively high despite the number of false positives eclipsing the number of true positives. This is unexpected, but it might be explained by the high amount of overlap between the bounding boxes, as seen in the next section.


Table 4.1: Precision-recall curves for every class where AP > 15% (adult pedestrian, construction worker, personal mobility vehicle, stroller, barrier, motorcycle, push/pullable, traffic cone, bicycle rack, bicycle, rigid bus, car, construction vehicle, trailer, truck).

The graphs in table 4.1 have the precision on the y-axis and the recall on the x-axis, both ranging from 0 to 1. These values are obtained by ranking every prediction based on its confidence and calculating the cumulative precision and recall for the class at every prediction. High-performing classes such as 'car' and 'adult pedestrian' show a high precision for most recall values, except when the recall approaches 1.

4.2 Qualitative results

Figure 4.4: Upper image: random nuImages sample with predictions. Bottom image: random nuImages sample with ground truth

Figure 4.4 demonstrates that the model predicts some objects multiple times, leading to a lot of false positives. The clearest example is the car on the left, which is correctly classified as a car with a near-perfect bounding box, but which also contains another, overlapping bounding box with a car classification. The same goes for the traffic cones on the right. It is also interesting to note that the model detected the reflections of cars in the windows of the building on the opposite side of the road, one correctly and another incorrectly. Although this prediction looks chaotic because of the large number of overlapping bounding boxes, it can easily be post-processed into a clearly readable image by applying non-maximum suppression. This is a technique where, among overlapping bounding boxes of the same class, only the highest-scoring box is kept; this removes boxes such as the extra bounding box in the car on the left side of the image. The resulting image is shown in figure 4.5.

Figure 4.5: Predictions after applying non-maximum suppression.
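A sketch of the non-maximum suppression step described above, using the NMS operation built into torchvision; the boxes, scores and threshold below are made up purely for illustration.

```python
import torch
from torchvision.ops import batched_nms

# Among heavily overlapping boxes of the same class, keep only the highest-scoring one.
boxes = torch.tensor([[100., 100., 200., 200.],
                      [105., 102., 198., 205.],   # near-duplicate of the first box
                      [400., 150., 480., 260.]])
scores = torch.tensor([0.95, 0.60, 0.80])
labels = torch.tensor([1, 1, 3])                  # class indices; suppression is applied per class

keep = batched_nms(boxes, scores, labels, iou_threshold=0.5)
print(keep)  # indices of the surviving boxes, here tensor([0, 2])
```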

Aside from some small errors, this post-processed image is very similar to the ground truth, and is an example of the model working as intended. That being said, this is a relatively simple image compared to the rest of the dataset: the car is in the foreground and only slightly obstructed, while the rest of the objects are traffic cones, which are in theory one of the easiest classes to detect since they all have a similar shape and color.


Figure 4.6: Upper image: random nuImages sample with predictions. Bottom image: random nuImages sample with ground truth

Figure 4.6 is an example of the model performing relatively poorly at first sight: although the cars, pedestrians and motorcycle are all detected correctly, a lot of objects are detected where there are none in the ground truth. The graffito on the wall on the right is classified as a dozen bicycles, while the center of the image is cluttered with random predictions. However, a partly obscured car in the background on the far-left side of the image, as well as an almost completely obscured bus in the center, are detected despite not even being labeled in the dataset. Unfortunately the bounding boxes indicating this car and bus all overlap and are not very accurate. As such, this image is interesting because it shows an instance where the model is too sensitive to objects in the background, causing it to detect all objects in the ground truth along with those that are not even labelled.

Figure 4.7: Upper image: random nuImages sample with predictions. Bottom image: random nuImages sample with ground truth

Finally, figure 4.7 shows the predictor correctly detecting a relatively obscure class, bicycle rack, although once again the detection consists of a lot of overlapping bounding boxes. The high number of overlapping bounding boxes in this image, as well as in figures 4.4 and 4.6, might explain why the mAP is so low when the IoU threshold is raised towards 0.95: in that case most of the bounding boxes on the bicycle rack are considered incorrect, as opposed to AP50, where most of them are considered good predictions.

Chapter 5

Conclusion and future work

5.1 Conclusion

In the end, the performance of the model trained on NuImages is not revolutionary: it does not achieve the same accuracy as existing ResNet-50 FPN models trained and evaluated on the COCO dataset. According to the TorchVision documentation, the pretrained model has an mAP of 37.9% for IoU thresholds ranging from 0.5 to 0.95 on the COCO dataset, while the model trained on NuImages only reached an mAP of 18.6% using this metric, and only reached a comparable mAP of 37.40% when the IoU threshold was set to 0.5 during evaluation. This means that the bounding boxes of our model are relatively inaccurate. However, this is not surprising considering that the model was only trained for five epochs on NuImages, and that 83% of the evaluation set was made up of only four classes. This made it difficult for the model to learn the less frequent classes, which clearly shows in figure 4.2: six classes having an average precision of 0% greatly decreases the mAP. In contrast, the six best performing and most frequent classes showed an average precision of more than 70%, which also showed in the qualitative analysis in section 4.2. However, some of these results can be attributed to the pretrained nature of the model: although COCO, the dataset used for this pretraining, contains very different samples, it shares many classes with NuImages. Still, most of COCO's images are not of urban driving environments, and the fine-tuning on NuImages showed that the model still detects these classes very accurately. It is unfortunate that one of the features that sets NuImages apart, namely its less common classes, is so infrequent that the model is hardly able to detect these classes after training for such a relatively short time.

5.2 Future work

Fortunately, creating a perfect detection system that exceeds state-of-the-art models was not the primary goal of this project; the goal was to evaluate which object detection systems work best in urban driving scenes. It turned out that performing this evaluation thoroughly requires more time and computing resources than were available. Trying and evaluating different types of models on the mini version of NuImages did not produce satisfactory results because that subset is too small. The only model that was actually trained and evaluated on the entire dataset was the most promising model according to the results presented by Detectron2. Although it showed promising results, more training is necessary in order to determine whether this model is the one most suitable for urban driving scenes. For future research, it is recommended to use a larger array of candidate models and enough computing resources to train each of them for a long period of time.

Finally, it must be noted that the data retrieval system used for retrieving the samples from NuImages can be improved upon. A lot of objects in the dataset are annotated with only a bounding box and lack a segmentation mask. Instead of completely ignoring these objects, it might be better to actually use them for training and evaluation and skip the segmentation branch for these specific samples. This should be possible since Mask R-CNN does not necessarily require segmentation masks for bounding box classification. Preserving these partial annotations will increase the number of objects in the images and might reduce the scarcity of uncommon classes. In conclusion, to really discover the most suitable model for object detection in urban driving scenes, a lot more time and processing power is required.

Ultimately, more evaluation and experimentation is needed in order to further improve the reliability of autonomous vehicles. It is important to remember the true goal of research projects in this field; this is not science for the sake of science, but science for the sake of safety. Creating a safer driving environment using a network of autonomous connected vehicles is a chance to reduce the number of traffic accidents and the misery they bring. As such, further developing these technologies and exploring their possibilities is a worthwhile endeavor for future research.


References

Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., . . . Beijbom, O. (2019). nuscenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027.

Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) (Vol. 1, pp. 886–893).

Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1440–1448).

He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2961–2969).

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916.

Jan, B., Farman, H., Khan, M., Imran, M., Islam, I., Ahmad, A., . . . Jeon, G. (2017). Deep learning in big data analytics: A comparative study. Computers & Electrical Engineering, 75. doi: 10.1016/j.compeleceng.2017.12.009

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.

Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., . . . Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312.

Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection.

Papadoulis, A., Quddus, M., & Imprialou, M. (2019). Evaluating the safety impact of connected and autonomous vehicles on motorways. Accident Analysis & Prevention, 124 , 12–22.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., . . . Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems 32 (pp. 8024– 8035). Curran Associates, Inc.


Pitts, W., & McCulloch, W. S. (1947). How we know universals: the perception of auditory and visual forms. The Bulletin of Mathematical Biophysics, 9(3), 127–147.

Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91–99).

Robbins, H. (2007). A stochastic approximation method. Annals of Mathematical Statistics, 22 , 400-407.

Rumelhart, D. E., Durbin, R., Golden, R., & Chauvin, Y. (1995). Backpropagation: The basic theory. Backpropagation: Theory, architectures and applications, 1–34.

Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., & Girshick, R. (2019). Detectron2. https://github.com/facebookresearch/detectron2.
