
Real-time Vehicle Detection

with Limited Hardware

using Single Shot Multibox Detection and MobileNetV2

Layout: typeset by the author using LaTeX.


Real-time Vehicle Detection

with Limited Hardware

using Single Shot Multibox Detection and MobileNetV2

Bob L. Leijnse 11872888

Bachelor thesis Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors

Charalampos Tamvakis
Research and Development
Sightcorp
Science Park 400
1098 XH Amsterdam

Dhr. dr. S. van Splunter
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Abstract

The detection of vehicles has become a popular research subject. This raises the potential for new applications, such as easily deployable traffic monitoring. To enable wide-range deployment, the cost of hardware needs to be limited. Currently, research in vehicle detection primarily investigates the opportunities for self-driving cars and focuses on obtaining a high accuracy, since this can mean the difference between having an accident or not. However, entirely focusing on accuracy takes its toll: most methods require top-of-the-line hardware to be able to run in real-time.

In contrast, this project investigates the real-time opportunities of vehicle detection using limited hardware, in this case one core of an Intel i7 CPU. The goal is to run at a speed of approximately 20 frames per second and at the same time achieve high accuracy on the KITTI benchmark. A Single Shot Multibox Detector (SSD) is used to achieve this goal. In order to obtain a speedup, a MobileNetV2 network replaces the backbone of the original SSD implementation. Five experiments were performed to find a good speed/accuracy trade-off. The researched hyperparameters were (1) image sizes, (2) first feature map attachments, (3) pre-trained weights, (4) optimizers, and (5) width multipliers.

Experimental results provide two models with a good speed/accuracy trade-off. The first proposed model achieves an average precision on the KITTI easy car class of 80.9% and runs at 21 FPS. The second model achieves 83.3% and runs at 18 FPS. These two models show that it is possible to accurately detect vehicles in real-time with limited hardware.


Contents

1 Introduction 5

2 Theoretical foundation 7

2.1 Image classification with Convolutional Neural Networks . . . 7

2.2 MobileNets for image classification . . . 8

2.3 Single-stage and two-stage detectors . . . 8

2.3.1 Faster Regional-CNN . . . 9

2.3.2 You Only Look Once (YOLO) networks . . . 9

2.3.3 Single Shot Multibox Detection . . . 10

3 Method 11

3.1 Single-Shot Multibox Detector . . . 11

3.1.1 Multibox detection . . . 11

3.1.2 Properties of default boxes . . . 12

3.2 SSD using a MobileNetV2 architecture . . . 13

3.2.1 Network architecture . . . 13

3.2.2 Training . . . 16

3.3 Training and evaluation using the KITTI Vision Benchmark Suite . . . 17

3.3.1 Dataset insights . . . 18

3.3.2 Official KITTI evaluation . . . 19

3.3.3 Scaling and localization evaluation . . . 21

4 Experiments 22

4.1 Varying image sizes . . . 23

4.2 Varying the first feature map attachment . . . 23

4.3 Using pre-trained weights . . . 24

4.4 Comparing optimizers . . . 24


5 Results 26

5.1 Original KITTI evaluation . . . 26

5.2 Scaling and localization evaluation . . . 28

5.3 Qualitative results . . . 30

6 Discussion 31

6.1 Discussion of the five varied hyperparameters . . . 31

6.2 Discussion of the scaling and localization performance . . . 33

7 Conclusion and further research 34

7.1 Conclusion . . . 34

7.2 Future research . . . 35


Acknowledgment

I would like to thank Sightcorp for providing me with this exciting topic and computational power.

In particular, I want to thank Charalampos Tamvakis. He was always available for any help and provided guidance for my research.

Furthermore, I would like to thank Sander van Splunter for providing feedback while writing my report.

Finally, I want to express gratitude to my friends and family, who supported me through all the encountered obstacles.


Chapter 1

Introduction

Vehicle detection has become a popular research subject in the last decade. This raises the potential for new applications, such as easily deployable traffic monitoring. The cost of hardware needs to be limited to enable wide-range deployment. Vehicle detection is a particularly difficult challenge because of the following four factors. First, variations of light alter the look of a car: a car at night looks different than the same car on a clear day, reflecting much sunlight. Second, in dense traffic, it is likely that other road users occlude parts of a vehicle. It is essential that partly occluded cars are still detected, because these cars are still part of the traffic situation. Third, the look of cars varies with rotation: the front, the back, the top and both sides all have to be recognized. The last difficulty is the large variety of scale: the pixel size of cars close to the camera is much bigger than that of cars in the background.

Understandably, accuracy has been the main priority for research on vehicle detection. Obtaining an extremely high accuracy can mean the difference between having an accident or performing a manoeuvre in time. However, entirely focusing on accuracy takes its toll: most methods require top-of-the-line hardware to be able to run in real-time. Less attention has gone to the efficiency side of vehicle detection research. Sightcorp, a University of Amsterdam spin-off specialized in object analysis, pointed out this gap and proposed to investigate it as a bachelor thesis.

This project researches the state-of-the-art techniques for detecting vehicles, but now applied to limited hardware instead of the best hardware available. Naturally, accuracy is still an important part of the project. However, requirements were set for the hardware that could be used. This limitation results in research to acquire a vehicle detection model with a good speed/accuracy trade-off. Altogether, this project aims to answer the following research question:

“To what extent is it possible to accurately detect vehicles in real-time with limited hardware?”

The following three subquestions first have to be addressed to give a substantiated answer to the research question:

• What is considered as limited hardware?

In contrast with most academic research, this project explores a trade-off between speed and accuracy for devices with limited hardware. This results in the limitation that the vehicle detection has to run on average central processing unit (CPU) devices, instead of high-end graphics processing unit (GPU) devices.

• What are the real-time implications?

In this case, the goal is to run at approximately 20 frames per second (FPS) or more. The reason for this is that a tracking algorithm can be placed on top of the vehicle detection model. A tracking algorithm will slow down the system, but 20 FPS is assumed to provide enough overhead to cover this slowdown. Using a tracking algorithm enables tasks such as counting the vehicles passing a particular spot. If the number of FPS is low, a car can, for example, move from the left to the right side of the image between two frames, which makes it difficult for the tracking algorithm to keep track of the car. Therefore, the detection algorithm has to be fast enough.

• What are feasible accuracy metrics?

The project aims to obtain a high average precision for specific circumstances. Out of the four main factors which make it hard to detect vehicles (variations of light, occlusion, rotation, and scaling), occlusion and scaling will be evaluated extensively. The evaluation of light variation and rotation is proposed as future work.

The remainder of this thesis is structured as follows: first, the theoretical foundation is discussed. Hereafter, the method and experiments to find useful hyperparameters are addressed. Finally, the results are presented and discussed, followed by a conclusion and proposal for further research.


Chapter 2

Theoretical foundation

Several ways to detect objects are possible. Current state-of-the-art multi-object detectors consist of two main parts. The first part of a multi-object detector is the backbone, a Convolutional Neural Network. This part is the actual image classifier. The second part is the detector, which decides which parts of an image are objects.

The next step was to determine how to fill in the two parts of the multi-object detector. After performing a literature review, the two parts used in this project were a MobileNetV2 backbone and a Single Shot Multibox Detector (SSD). In the remainder of this chapter, the first section briefly shows how Convolutional Neural Networks can be used to classify images. Hereafter, sections 2.2 and 2.3 show the considerations for choosing the MobileNetV2 backbone and the SSD detector over other possible methods. The actual design of how MobileNetV2 and SSD were implemented as a method to detect vehicles is shown in chapter 3.

2.1 Image classification with Convolutional Neural Networks

Traditional neural networks are good at detecting patterns. However, these networks do not handle localization information well, which is very important for detecting objects in pictures. This is important because objects are built up out of smaller parts which are connected to each other. To solve this problem, Convolutional Neural Networks use convolutional layers to detect parts of an object. For example, the first layers detect edges and corners, the middle layers detect parts of objects like wheels and mirrors, and the last layers detect full objects in different shapes. Convolutional Neural Networks are trained using labelled images. While training, in a forward pass, the input image passes through the network completely. After this, in a backward pass, the network weights are updated to learn information about the label corresponding to the input image. When this process is repeated for numerous iterations over multiple images, the network learns which label corresponds to an image. When an unknown image is shown to the network, the network predicts how well the image corresponds to the possible labels.
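To make this training loop concrete, the following is a minimal sketch of one forward and backward pass in PyTorch, the framework used later in this thesis. The model, data loader and hyperparameters are illustrative placeholders, not the setup used in this project.

```python
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.mobilenet_v2(num_classes=10)   # any image classifier works here
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_one_epoch(model, train_loader):
    model.train()
    for images, labels in train_loader:
        logits = model(images)        # forward pass: the image passes through the network
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()               # backward pass: gradients with respect to the label
        optimizer.step()              # weight update
```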

2.2 MobileNets for image classification

Figure 2.1: An overview adapted from the MobileNetV2 paper of different sizes of MobileNetV2 compared to MobileNet (V1), NasNet and ShuffleNet.

MobileNets are Convolutional Neural Networks designed to have a small model size and low complexity, so that they also run on limited hardware such as mobile devices. In this project, a newer version, MobileNetV2 [1], was used because its performance is slightly better than that of other lightweight models such as NasNet [2] and ShuffleNet [3], as shown in figure 2.1. Another reason for choosing MobileNetV2 was that pre-trained weights are already available for several sizes of the model to build upon.

2.3 Single-stage and two-stage detectors

The second part of a multi-object detector is the detector, the part that determines which parts of an image are objects. This part is needed because only a particular part of an image can be a vehicle and not necessarily the whole image. Currently, state-of-the-art object detectors are one of the following two types: single-stage or two-stage. Single-stage detectors such as YOLO [4] and SSD [5] treat detection as a simple regression problem. This type is called ‘single-stage’ because all the computations are done in a single network. For an input image, the detector learns the class probabilities and the bounding box coordinates.

Two-stage detectors, such as Faster R-CNN [6], are the other type. This type is called ‘two-stage’ because it uses an extra network that generates regions of interest using a Region Proposal Network (RPN). Because of the extra network, this type of detector achieves the highest accuracy but is noticeably slower than single-stage detectors.

The next sections briefly discuss current state-of-the-art detectors and the consideration of why SSD was chosen for this project.

2.3.1 Faster Regional-CNN

Faster R-CNN is a two-stage Convolutional Neural Network that uses a Region Proposal Network which shares convolutional features with the detection network. This results in nearly cost-free region proposals. The Region Proposal Network predicts object bounds and objectness scores at each position. In this way, high-quality region proposals are generated.

Although this method is accurate, the two stages take a high toll: it runs at five FPS on a strong GPU. Therefore, Faster R-CNN can serve as an inspiration but not as a basis for a fast and efficient framework.

2.3.2 You Only Look Once (YOLO) networks

YOLO is a simple and straightforward system that resizes an input image, runs a single Convolutional Neural Network on the image and thresholds the resulting detected objects. The complete detection pipeline is a single-stage network, which makes it fast. During training and testing, YOLO scans the entire image so that the network encodes contextual information.

The system divides an input image into a grid. Each cell of that grid predicts a specified number of bounding boxes and a confidence score for each box to contain an object. Each cell of the grid also predicts a conditional class probability. Eventually, the conditional class probabilities and the individual box confidences are multiplied to make a prediction.

YOLOv3 [7] is currently the most used version. Compared to earlier versions, the method now predicts an objectness score for each bounding box using logistic regression. For the class prediction, YOLOv3 uses binary cross-entropy, because softmax would assume that each box has exactly one class, which is often not true. YOLOv3 now predicts bounding boxes at three different scales and extracts features from them using a concept similar to feature pyramid networks. Finally, it uses a different network with 53 convolutional layers.

Concluding, YOLOv3 is fast and still quite accurate. In research, many people are tweaking the detector to become more accurate at specific tasks. Hence, YOLOv3 is deemed a potential candidate for the approach of this project.

2.3.3 Single Shot Multibox Detection

Another single-stage object detector is Single Shot Multibox Detection (SSD). This method eliminates bounding box proposals and the subsequent pixel or feature resampling stage, which results in a fundamental improvement in speed. Experimental results show that SSD has a competitive accuracy and is much faster than methods that use an additional object proposal step.

This method is not the first to eliminate the bounding box proposals and the subsequent resampling, but it adds several improvements to increase the accuracy. SSD uses “a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales.” This makes it possible to achieve a high accuracy using relatively low-resolution input, which reduces the computation time even more.

In this project, SSD was used because it is faster and more accurate on the Pascal VOC2007 [8] dataset than Faster R-CNN and YOLOv3.


Chapter 3

Method

The method to detect vehicles consisted of three main components: the Single Shot Multibox Detector; the Convolutional Neural Network, MobileNetV2, used as the backbone of the SSD; and the KITTI dataset used to train, evaluate, and test the experiments. This chapter addresses how these three components were implemented. Chapter 4 addresses the experiments that were done using the implemented model.

3.1 Single-Shot Multibox Detector

The Single Shot Multibox Detector implementation is based on the original paper. However, the backbone of the SSD was changed from the original, slower VGG-16 [9] to MobileNetV2 [1] to obtain an improvement in speed. In addition to changing the backbone, several experiments were performed to optimize the speed/accuracy trade-off of the Single Shot Multibox Detector for vehicle detection.

3.1.1 Multibox detection

The first component of SSD is the multibox detector. The detector divides an input image into a grid. For each cell in the grid, six default boxes were proposed. All these boxes have the following properties: x-centre location, y-centre location, width, height, and confidences for the different classes. Hereafter, these box properties were trained to adapt to the different classes the models were trained on. The idea is that the model learns concepts about objects in a convolutional way. For example, a horizontal rectangle is more likely to fit a truck, and a square box is more likely to fit a car.

The division of an input image into a grid was done for six feature maps of different scales. A feature map is the output of a filter applied to the previous layer. This makes it possible to train on and detect objects at a large variety of scales. In this project, the feature map sizes were scaled to the image size. Section 4.1 looks further into the different configurations of the feature maps.

Figure 3.1: Illustration from the original SSD article of four default bounding boxes per cell for the ground truth at two different feature maps.

Figure 3.1 shows an example of how four boxes per grid cell can detect a small cat with the 8x8 feature map and the bigger dog with the 4x4 feature map. Notice that the 8x8 feature map cannot detect the dog in this image.

3.1.2 Properties of default boxes

The bounding boxes had four default location values to initialize: x-centre location (cx), y-centre location (cy), width (w), and height (h). The x and y centre locations are the middle point of each cell in a feature map grid, computed using equation 3.1, where i and j are the indices of the grid cell and f_k is the size of the k-th feature map.

\[
c_x(i, f_k) = \frac{i + 0.5}{f_k}, \qquad c_y(j, f_k) = \frac{j + 0.5}{f_k} \tag{3.1}
\]

The scale factors of the width and height of the bounding boxes were computed using equation 3.2, where s_min is 0.2 and s_max is 0.88, just as in the original implementation, and m is the number of feature maps.

\[
s_k^{\min} = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 2}(k - 2), \qquad k \in [2, m] \tag{3.2}
\]

Note that the equation skips the first feature map; its scale factor was set to 0.1. Afterwards, the resulting scale factors s_k^min were multiplied by the image size. For example, the default bounding boxes of six feature maps for an image width of 512 pixels have the following widths: 51.2, 102.4, 189.44, 276.48, 363.52 and 450.56 pixels.

The default scale factor was used as the minimum size of the bounding boxes. From this follows that an upper bound was needed as well. The maximum bounding box size was computed by equation 3.3, which is basically the minimum bounding box size of the next feature map.

\[
s_k^{\max} = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 2}(k - 1), \qquad k \in [1, m] \tag{3.3}
\]

Now, all the default bounding boxes have the same ratio. However, vehicles can have different aspect ratios, as illustrated in figure 3.1. Therefore, six different aspect ratios were used per grid cell per feature map: two square boxes of different sizes and four boxes with other aspect ratios. Table 3.1 shows how the six default aspect-ratio scales were computed using the minimum and maximum bounding-box scale. Again, the actual pixel size can be computed by multiplying the obtained scales by the image width or height.

Ratio     1          2                        3               4               5               6
Width     s_k^min    √(s_k^min · s_k^max)     s_k^min · √2    s_k^min / √2    s_k^min · √3    s_k^min / √3
Height    s_k^min    √(s_k^min · s_k^max)     s_k^min / √2    s_k^min · √2    s_k^min / √3    s_k^min · √3

Table 3.1: Overview of the six box aspect-ratio scales used per grid cell per feature map, where s_k^min and s_k^max are the results of equations 3.2 and 3.3 respectively.
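To make the scale computations of equations 3.2 and 3.3 and table 3.1 concrete, the sketch below recomputes them in plain Python. It is an illustrative reimplementation based on the values stated above (s_min = 0.2, s_max = 0.88, m = 6 feature maps, first scale fixed at 0.1), not the project's actual code.

```python
import math

def default_box_scales(s_min=0.2, s_max=0.88, m=6, first_scale=0.1):
    """Minimum and maximum scale factor per feature map (equations 3.2 and 3.3)."""
    step = (s_max - s_min) / (m - 2)
    mins = [first_scale] + [s_min + step * (k - 2) for k in range(2, m + 1)]
    # The maximum scale of feature map k is the minimum scale of feature map k + 1.
    maxs = [s_min + step * (k - 1) for k in range(1, m + 1)]
    return mins, maxs

def box_shapes(s_min_k, s_max_k):
    """The six (width, height) scale pairs per grid cell of table 3.1."""
    square = math.sqrt(s_min_k * s_max_k)
    return [
        (s_min_k, s_min_k),                                  # ratio 1: small square
        (square, square),                                    # ratio 2: larger square
        (s_min_k * math.sqrt(2), s_min_k / math.sqrt(2)),    # ratio 3
        (s_min_k / math.sqrt(2), s_min_k * math.sqrt(2)),    # ratio 4
        (s_min_k * math.sqrt(3), s_min_k / math.sqrt(3)),    # ratio 5
        (s_min_k / math.sqrt(3), s_min_k * math.sqrt(3)),    # ratio 6
    ]

mins, _ = default_box_scales()
print([round(s * 512, 2) for s in mins])
# [51.2, 102.4, 189.44, 276.48, 363.52, 450.56] -- matches the example in the text.
```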

3.2 SSD using a MobileNetV2 architecture

This section discusses how the SSD network architecture was changed to a customized MobileNetV2 architecture. First, the adjustments to the network architecture are discussed. Second, the method to train this architecture is discussed.

3.2.1 Network architecture

The original paper about Single Shot Multibox Detection used the VGG-16 [9] Convolutional Neural Network as a backbone. The backbone produces a fixed number of bounding boxes, as discussed in the previous section, followed by non-maximum suppression to predict objects. In this project, the same approach was followed, but now using a faster MobileNetV2 backbone. Figure 3.2 shows schematically the differences between the original SSD and the customized version used in this project.

Figure 3.2: The SSD with a VGG-16 backbone of the original paper (top) compared to the customized SSD of this project using a MobileNetV2 backbone (bottom).

A MobileNetV2 backbone is composed of building blocks called ‘bottleneck residual’ blocks. These blocks consist of three layers: a 1 × 1 pointwise convolution with ReLU6 as activation function and an expansion factor t; a depthwise convolution with a stride s and ReLU6; and finally a 1 × 1 pointwise linear convolution. After each convolution, the results are normalized using batch normalization [10] before the ReLU6 activation function. Table 3.2 shows the effect of a bottleneck residual block on its input. All the bottlenecks could be adjusted by a width multiplier to perform experiments with different widths.

Input             Operator                                               Output
h × w × k         1 × 1 pointwise conv2d, BatchNorm2d, ReLU6             h × w × (tk)
h × w × tk        3 × 3 depthwise conv2d (stride s), BatchNorm2d, ReLU6  h/s × w/s × (tk)
h/s × w/s × tk    1 × 1 linear conv2d, BatchNorm2d                       h/s × w/s × k′

Table 3.2: Bottleneck residual block transforming an input of width w × height h from k to k′ channels, with stride s and expansion factor t.
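For illustration, the bottleneck residual block of table 3.2 can be written in PyTorch as follows. This is a sketch following the table's layer order, with the residual connection that MobileNetV2 adds when the stride is 1 and the input and output channels match; it is not the exact module used in this project.

```python
import torch
import torch.nn as nn

class BottleneckResidual(nn.Module):
    """Bottleneck residual block as in table 3.2: k -> k_out channels, stride s, expansion t."""

    def __init__(self, k, k_out, stride, t):
        super().__init__()
        hidden = int(round(k * t))
        self.use_residual = stride == 1 and k == k_out
        self.block = nn.Sequential(
            # 1x1 pointwise expansion, BatchNorm, ReLU6
            nn.Conv2d(k, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution with stride s, BatchNorm, ReLU6
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=stride, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 pointwise linear projection, BatchNorm (no activation)
            nn.Conv2d(hidden, k_out, kernel_size=1, bias=False),
            nn.BatchNorm2d(k_out),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```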

The first part was the original MobileNetV2 implementation, represented by the dashed cube in figure 3.2. This dashed cube was made of several sequences of bottleneck residual blocks. These sequences are shown per line in table 3.3. After 14 layers, the 96-channel output of the fifth bottleneck was attached to the detector using a 3 × 3 convolution, producing class confidence values and the x, y, w and h of the six different bounding boxes. This was the attachment for the first feature map. After five more sequences followed by a 1 × 1 2D convolution, the second feature map was attached to the detector and the output was sent to the second part of the backbone.

Input        Operator        t    c      n   s   Attachment no.
512² × 3     conv2d          -    32     1   2   -
256² × 32    bottleneck 1    1    16     1   1   -
256² × 16    bottleneck 2    6    24     2   2   -
128² × 24    bottleneck 3    6    32     3   2   -
64² × 32     bottleneck 4    6    64     4   2   -
32² × 64     bottleneck 5    6    96     3   1   1
32² × 96     bottleneck 6    6    160    3   2   -
16² × 160    bottleneck 7    6    320    1   1   -
16² × 320    1 × 1 conv2d    -    1280   1   1   2

Table 3.3: MobileNetV2 configuration for an image size of 512 × 512. The bottleneck layers are in the form of table 3.2. Each line describes a sequence of n layers and has a stride s and expansion factor t. All layers in the same sequence have the same number of output channels c.

The second part of the customized SSD consists of extra blocks. Four additional bottleneck residual blocks were used to come to a total of six feature map attachments. These four blocks are shown in figure 3.2 as the four rightmost blocks. The exact configurations used are shown in table 3.4. The six attachments were followed by a non-maximum suppression step to produce the final bounding boxes. Non-maximum suppression is a technique used to transform a group of bounding boxes with a Jaccard overlap bigger than a threshold parameter into one bounding box, as visualized in figure 3.3.

Figure 3.3: Illustration of non-maximum suppression.
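As an illustration of this step, the sketch below runs non-maximum suppression with torchvision's built-in implementation on a few made-up detections; the 0.35 overlap threshold is the value later used for the evaluation in chapter 5.

```python
import torch
from torchvision.ops import nms

# Made-up detections: boxes in (x1, y1, x2, y2) format and their confidences.
boxes = torch.tensor([[100., 100., 200., 180.],
                      [105., 102., 205., 185.],   # heavily overlaps the first box
                      [400., 120., 480., 170.]])
scores = torch.tensor([0.90, 0.75, 0.60])

# Keep the highest-scoring box of every group whose Jaccard overlap (IoU)
# exceeds the threshold.
keep = nms(boxes, scores, iou_threshold=0.35)
print(keep)   # tensor([0, 2]): the second box is suppressed by the first
```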

Input         Operator         t       c     n   s   Attachment no.
16² × 1280    bottleneck 8     0.2     512   1   2   3
8² × 512      bottleneck 9     0.25    256   1   2   4
4² × 256      bottleneck 10    0.5     256   1   2   5
2² × 256      bottleneck 11    0.25    64    1   2   6

Table 3.4: Bottleneck configurations for the last four feature map attachments.

3.2.2 Training

To train the model, the loss function needs to be optimized. The loss function computes the differences between the predicted bounding boxes and the actual ground truth boxes provided by a training dataset. The predicted boxes were a match with the ground truth boxes when the Jaccard overlap was higher than 0.5. Hereafter, the total loss L_total was computed as a combination of class confidence loss and localization loss. Equation 3.4 shows how the total loss was computed for N matches between a prediction l and ground truth g for a certain location x and class confidences c.

\[
L_{total}(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + L_{loc}(x, l, g)\right) \tag{3.4}
\]

The class confidence loss L_conf(x, c) was the softmax loss over multiple class confidences c.

\[
L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}), \quad \text{where} \quad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})} \tag{3.5}
\]

The localization loss L_loc(x, l, g) of equation 3.4 was computed by the Smooth L1 loss [11] between the predicted box l and the ground truth box g. The width w, height h, x-centre cx and y-centre cy of the default bounding box d regress to offsets as in Faster R-CNN [6].

\[
L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, \mathrm{smooth}_{L1}(l_i^{m} - \hat{g}_j^{m})
\]
\[
\text{where} \quad \hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \quad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}, \quad \hat{g}_j^{w} = \log\!\left(\frac{g_j^{w}}{d_i^{w}}\right), \quad \hat{g}_j^{h} = \log\!\left(\frac{g_j^{h}}{d_i^{h}}\right) \tag{3.6}
\]

After determining the loss, the weights of the model were fine-tuned using several optimizers, as discussed in section 4.4.
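The sketch below shows how the confidence and localization losses of equations 3.4-3.6 can be combined in PyTorch. It is a simplified illustration: the default boxes are assumed to be already matched and encoded as offsets, and the hard-negative mining used in the original SSD implementation is omitted.

```python
import torch
import torch.nn.functional as F

def ssd_loss(conf_logits, pred_offsets, target_labels, target_offsets):
    """Simplified total loss of equation 3.4 for one image.

    conf_logits:    (num_boxes, num_classes) class scores per default box
    pred_offsets:   (num_boxes, 4) predicted (cx, cy, w, h) offsets
    target_labels:  (num_boxes,) ground-truth class per box, 0 = background
    target_offsets: (num_boxes, 4) encoded ground-truth offsets (equation 3.6)
    """
    pos = target_labels > 0                      # matched (positive) default boxes
    n = pos.sum().clamp(min=1).float()           # N in equation 3.4

    # Softmax confidence loss over all boxes (equation 3.5); a full implementation
    # would additionally apply hard-negative mining to the background boxes.
    conf_loss = F.cross_entropy(conf_logits, target_labels, reduction='sum')

    # Smooth L1 localization loss over the positive boxes only (equation 3.6).
    loc_loss = F.smooth_l1_loss(pred_offsets[pos], target_offsets[pos], reduction='sum')

    return (conf_loss + loc_loss) / n
```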

3.3 Training and evaluation using the KITTI Vision Benchmark Suite

The implemented models were trained and evaluated using the KITTI dataset [12]. This is a dataset created for autonomous driving research by the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago. They recorded data with several cameras and provided the data with 2D annotations. The KITTI dataset has become a popular benchmark for vehicle detection algorithms. The dataset provides the precision, recall, runtime and machine specifics obtained by other teams. Therefore, it is a suitable benchmark to evaluate and compare the models trained in this project. The following sections first give an insight into the dataset. Hereafter, the evaluation criteria are discussed. Finally, an additional evaluation method is introduced to evaluate the scaling and localization performance.

3.3.1 Dataset insights

The KITTI object detection and object orientation estimation benchmark consists of stereo images with a resolution of 1242 × 375 pixels. Only the images of the left camera were used. The dataset consists of 7481 training images and 7518 test images, comprising a total of 80256 labelled objects. The training images, containing 39596 annotated objects, are publicly available to train and validate models. The test annotations are not publicly available, so that the Karlsruhe Institute of Technology can perform a fair test of the models and ensure that none of the teams trained on the test dataset. In the dataset, each dynamic object within the camera's view is annotated with one of the following categories: car, van, truck, pedestrian, person, cyclist, tram and miscellaneous. In this context, miscellaneous objects are dynamic objects that do not occur often enough in the dataset to be worthwhile to annotate, such as trailers and segways.

Figure 3.4: Number of objects in the KITTI training dataset per merged category.

For this project, not all the categories the KITTI dataset offers were useful. Therefore, the 39596 annotated objects were merged into three groups: ‘small vehicles’, ‘big vehicles’ and ‘do not care’. It is still possible to fairly compare the car class of this project to other research, because in the official KITTI benchmark a van detected as a car does not count as a false positive for the car class. This is a result of the fact that the official benchmark only takes cars, pedestrians, and cyclists into account. Therefore, the car class can be extracted from the ‘small vehicles’ class afterwards. Figure 3.4 shows the distribution of these groups in numbers. The figure shows that the dataset primarily consists of cars. In this project, a random sample of 80.0%, 5984 images, was used as training data. The other 20.0%, 1488 images, were used as validation data.

3.3.2 Official KITTI evaluation

The official KITTI evaluation measures object detection performance using the PASCAL VOC Challenge criteria [8]. Because the classes were merged as described in the previous section, only the ‘car’ class was evaluated following the official KITTI criteria. According to the PASCAL VOC Challenge, a prediction is correct if the correct class is predicted according to the ground truth, and if its Intersection over Union (IoU) with the ground truth is above an IoU threshold of 50%. Figure 3.5 shows the definition and an illustration of how to compute the IoU. The difference between the PASCAL VOC Challenge and the KITTI evaluation is that the KITTI evaluation is even harder: an IoU of 70% is required for the car class, while for pedestrians and cyclists an IoU of 50% is required. This means KITTI requires a better object localization than the PASCAL VOC Challenge.
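A minimal sketch of the IoU computation illustrated in figure 3.5, for axis-aligned boxes given as (x1, y1, x2, y2); the example boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as correct for the KITTI car class when the IoU is at least 0.7.
print(iou((100, 100, 200, 180), (110, 105, 210, 185)) >= 0.7)   # True (IoU ~ 0.73)
```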

Figure 3.5: Definition and illustration of intersection over union (IoU).

The KITTI dataset separates objects into three difficulties: Easy, Moderate and Hard. These difficulties are based on three criteria. First, the minimum bounding box height, which is the minimum height the ground-truth box of an object has to have to belong to a difficulty. Second, the maximum occlusion level, which is a status assigned by hand; the possible statuses are Fully visible, Partly visible, and Difficult to see. Finally, the maximum truncation percentage, which measures what part of the object leaves the image boundaries. The criteria for the difficulties are defined as shown in table 3.5. Figure 3.6 gives a visual example of cars with varying classification difficulties.

Difficulty    Min. bounding box height    Max. occlusion level    Max. truncation
Easy          40 pixels                   Fully visible           15%
Moderate      25 pixels                   Partly visible          30%
Hard          25 pixels                   Difficult to see        50%

Table 3.5: The definition of the three different difficulties of the KITTI 2D benchmark.

With all these criteria, the trained models were evaluated using the 1488 validation images. Per image, the predictions made by the single-shot detector were compared to the ground truth. If, for a detection, the IoU is higher than the IoU threshold, the detected category matches the ground-truth category, and the ground truth belongs to one of the difficulties, then the detection is counted as a true positive. Detections of ground-truth objects which are smaller than the minimum size do not count as false positives. Missed objects that do not meet the difficulty criteria do not count as false negatives either. The precision was computed from the true and false positives using equation 3.7. Equation 3.8 was used to compute the recall.

Figure 3.6: Illustration of the three difficulties. A green bounding box indicates the Easy difficulty; orange and red indicate Moderate and Hard, respectively. Also, the pixel height (px) and the actual distance in meters (m) are shown.

\[
\mathrm{Precision} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Positives}} \tag{3.7}
\]
\[
\mathrm{Recall} = \frac{\mathrm{True\ Positives}}{\mathrm{True\ Positives} + \mathrm{False\ Negatives}} \tag{3.8}
\]

The resulting precision and recall were used to compute the average precision AP per category using equation 3.9. Precision and recall pairs were computed by altering the recall threshold from 0 to 1 in 40 steps. Then, the precision value for recall threshold r was replaced by the maximum precision for any recall r̃ bigger than or equal to the recall threshold.

\[
AP = \frac{1}{40} \sum_{r \in \{0.0,\, \ldots,\, 1.0\}} AP_r \quad \text{where} \quad AP_r = \max_{\tilde{r} \ge r} p(\tilde{r}) \tag{3.9}
\]
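To make equation 3.9 concrete, the sketch below computes the 40-point interpolated average precision from precision/recall pairs. It is an illustrative reimplementation of the formula above, not the official KITTI evaluation code.

```python
import numpy as np

def average_precision(recalls, precisions, num_points=40):
    """40-point interpolated AP of equation 3.9.

    recalls, precisions: matching recall/precision pairs (both in [0, 1]) obtained
    by sweeping the confidence threshold of the detector.
    """
    recalls = np.asarray(recalls)
    precisions = np.asarray(precisions)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, num_points):
        # Interpolated precision: the maximum precision at any recall >= r.
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / num_points
```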

3.3.3 Scaling and localization evaluation

While the official KITTI benchmark is useful for comparing the used method to other teams and getting an insight into how well our method copes with occlusion, the benchmark does not give much feedback about localization and scaling. For example, a model that performs average for cars both close to and far away from the camera can be more useful than a model that performs extremely well only on cars close to the camera.

Therefore, the trained models that fit the project requirements (having a good speed/accuracy trade-off and running at approximately 20 FPS) were evaluated for multiple minimum object pixel heights, to get an estimate of the distance range in which the method performs well. In addition, the models were evaluated with a less strict IoU of 50%, as used in other object detection benchmarks such as the PASCAL VOC challenge [8]. The advantage of this is that it gives an insight into whether the localization or the scaling performance is a limiting factor for the models.


Chapter 4

Experiments

Figure 4.1: A schematic overview of the five investigated hyperparameters of the customized SSD MobileNetV2 architecture.

In order to obtain a good balance between accuracy and speed, five experiments were conducted. The researched hyperparameters were (1) image sizes, (2) feature map attachments, (3) pre-trained weights, (4) optimizers, and (5) width multipliers. Figure 4.1 shows a schematic overview of the five adjusted hyperparameters and their placement in the customized architecture. The next sections address the hypotheses of the experiments and what had to be done to adjust the five hyperparameters.

Per experiment, one hyperparameter was varied while the others were kept unaltered. All experiments were trained and evaluated using PyTorch 1.5 [13] in Python 3.6.9 [14]. The models were trained using a GeForce GTX 1080 Ti graphics processing unit (GPU). All models were trained for 120000 iterations with a multi-step learning rate decay for the optimizer. For the first 80000 iterations, the learning rate was 1e-3; hereafter, 1e-4 until iteration 100000, and 1e-5 until the last iteration. Also, a learning-rate warm-up was used, going from 1e-33 at iteration 0 to 1e-3 at iteration 500. For training and evaluation, all images were resized to a square format.
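The sketch below approximates the described schedule with PyTorch's LambdaLR: a linear warm-up over the first 500 iterations followed by the multi-step decay at 80000 and 100000 iterations. The model and optimizer are placeholders, and the warm-up is approximated as a linear ramp starting near zero.

```python
import torch

# Placeholder model/optimizer; the schedule values come from the text above.
model = torch.nn.Conv2d(3, 8, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def lr_lambda(iteration):
    # Linear warm-up over the first 500 iterations, then multi-step decay:
    # 1e-3 until 80k, 1e-4 until 100k, 1e-5 until 120k (factors of the base lr).
    if iteration < 500:
        return iteration / 500
    if iteration < 80_000:
        return 1.0
    if iteration < 100_000:
        return 0.1
    return 0.01

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for iteration in range(120_000):
    # ... forward pass, loss, backward pass and optimizer.step() on a batch ...
    scheduler.step()
```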

4.1 Varying image sizes

Hypothesis experiment 1: The size of the input image has an impact on the accuracy and speed of the models. On the one hand, because there is a higher number of pixels, and therefore a higher number of computations to be made. On the other hand, the size of the feature maps depends on the input image size. Therefore, the hypothesis is that a bigger input leads to higher accuracy and a lower speed.

Table 3.3 showed that the first feature map attachment is made after 4 strides, resulting in a 512/2⁴ = 32, i.e. a 32² feature map. For a 320² input this is 20², which results in fewer detection boxes. Table 4.1 shows the implications of altering the image size for the feature map sizes. For this hyperparameter, the trained input sizes were 320², 512², and 640². All models were trained using a batch size of 32 and the same pre-trained weights published by PyTorch [13].

Attachment no.    Feature maps 320²    Feature maps 512²    Feature maps 640²
1                 20²                  32²                  40²
2                 10²                  16²                  20²
3                 5²                   8²                   10²
4                 3²                   4²                   5²
5                 2²                   2²                   3²
6                 1²                   1²                   2²

Table 4.1: The implications of varying the image size for the feature map sizes.

4.2 Varying the first feature map attachment

Hypothesis experiment 2: When the first feature map is attached to an earlier bottleneck, the feature maps become bigger and have more anchor boxes, and therefore the accuracy should become higher. However, this should make the model slower at the same time.

As described in section 3.2, by default the first feature map is attached to the fifth bottleneck, as shown in table 3.3. In addition, models were trained with the first attachment at the third and the fourth bottleneck. Table 4.2 shows the implications of changing the bottleneck attachment for the feature map sizes. All models were trained using a batch size of 32 and the same pre-trained weights [15].

Attachment no.    First attachment: Bottleneck 3    First attachment: Bottleneck 4    First attachment: Bottleneck 5
1                 80²                               40²                               20²
2                 40²                               20²                               10²
3                 20²                               10²                               5²
4                 10²                               5²                                3²
5                 5²                                3²                                2²
6                 3²                                2²                                1²

Table 4.2: The implications of varying the first bottleneck attachment for the feature map sizes.

4.3 Using pre-trained weights

Hypothesis experiment 3: This experiment is a direct comparison between pre-trained weight initializations. Using pre-trained weights could improve the results because information from the pre-trained weights can be useful for the currently trained model.

Three variants of initial weights were used for the original MobileNetV2 backbone, represented as the dashed cube in figure 4.1. The first variant was training a model without pre-trained weights. The second variant was pre-trained on ImageNet [16] and published by PyTorch [13]. The last variant was pre-trained on ImageNet as well, but its author, Duo Li [17], claims to obtain a higher average precision. In order to make a fair comparison, all the pre-trained models were fine-tuned using a batch size of 32 and an image size of 512².

4.4 Comparing optimizers

Hypothesis experiment 4: The goal of this experiment is a direct comparison between optimizers. Sashank J. Reddi et al. [18] suggest that Adam's convergence issues were fixed by Adam AMSgrad. This experiment verifies their suggestion.


Three optimizers were evaluated: stochastic gradient descent (SGD) with momentum, Adam AMSgrad, and AdamW AMSgrad. The original PyTorch [13] implementations of these optimizers were used. The models were trained using a width multiplier of 0.5 and a batch size of 32.

4.5 Varying width-multipliers

Hypothesis experiment 5: Using a smaller width-multiplier results in fewer parameters for the backbone to learn on. Therefore, a lower width-multiplier should make the model faster, but less accurate at the same time.

In order to change the width of the model, the output channels of the backbone were altered using a width multiplier w. The used multipliers were 0.25, 0.5 and 1.0. Pre-trained weights for these width multipliers were made publicly available [17]. The implications of varying the width for the backbone are shown in table 4.3. The models were trained using the Adam AMSgrad optimizer and a batch size of 48.

Input         Operator        t    c       n   s
512² × 3      conv2d          -    32w     1   2
256² × 32w    bottleneck      1    16w     1   1
256² × 16w    bottleneck      6    24w     2   2
128² × 24w    bottleneck      6    32w     3   2
64² × 32w     bottleneck      6    64w     4   2
32² × 64w     bottleneck      6    96w     3   1
32² × 96w     bottleneck      6    160w    3   2
16² × 160w    bottleneck      6    320w    1   1
16² × 320w    1 × 1 conv2d    -    1280    1   1

Table 4.3: The implications of varying the width multiplier w for a backbone with an image size of 512².

Chapter 5

Results

This chapter shows the results of the trained models. The first section shows the results of the official KITTI benchmark for cars. The second section shows the results for the scaling and localization evaluation. The last section shows qualitative results. The models were tested using a single core of an Intel Core i7-7700K central processing unit (CPU) that runs at a maximum of 4.20 GHz.

5.1 Original KITTI evaluation

As discussed in chapter 4, five experiments were performed, all evaluating a different hyperparameter. All the models were evaluated using a non-maximum-suppression threshold of 0.35 and a confidence threshold of 0.01. All figures show the average precision for the easy, moderate, and hard car class of the KITTI benchmark and the number of frames per second (FPS). The number of FPS was computed by measuring the inference time.
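As a sketch of how such an FPS number can be obtained, the snippet below times inference on a single CPU core; the MobileNetV2 classifier is only a placeholder for the trained detector, and the input size and run counts are illustrative.

```python
import time
import torch
import torchvision

torch.set_num_threads(1)                            # restrict inference to a single CPU thread
model = torchvision.models.mobilenet_v2().eval()    # placeholder for the trained detector
dummy = torch.randn(1, 3, 512, 512)

with torch.no_grad():
    for _ in range(10):                             # warm-up runs
        model(dummy)
    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        model(dummy)
    fps = runs / (time.perf_counter() - start)

print(f"{fps:.1f} FPS")
```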

The results for the varying image sizes are shown in figure 5.1a. It is visible that when the image size increases, the number of FPS becomes lower. The results of varying the first feature map attachment are shown in figure 5.1b. When the first feature map is attached to an earlier bottleneck, the model becomes slower, while the average precision changes inconsistently. The impact of using pre-trained weights is shown in figure 5.1c. It is visible that using pre-trained weights improves the results and that the two sets of pre-trained weights give different results. The comparison between the three optimizers is shown in figure 5.1d. The Adam AMSgrad optimizer achieves the highest average precision. Finally, figure 5.1e shows the impact of varying the width of the pre-trained model. The average precision increases when a bigger width is used. However, the number of FPS corresponds inconsistently to the accuracy.

Figure 5.1: (a) Exp. 1: varying image sizes; (b) Exp. 2: varying the first attachment; (c) Exp. 3: using pre-trained weights; (d) Exp. 4: comparing optimizers; (e) Exp. 5: varying pre-trained width-multipliers.

5.2 Scaling and localization evaluation

The previous section provides two models that fit the project requirements of running at approximately 20 FPS and having a good speed/accuracy trade-off. The first model, shown in figure 5.1e, is trained using an image size of 512² pixels, a pre-trained width of 0.5, and uses Adam AMSgrad as an optimizer. The second model, shown in figure 5.1b, is trained with an image size of 320² pixels, a full width, an attachment at the fourth bottleneck, and uses SGD momentum as an optimizer. This section evaluates the scaling and localization performance of these two models.

The scaling and localization performance of the first model is shown in figures 5.2a and 5.2b. The figures show that lowering the IoU threshold from 0.7 to 0.5, i.e. making the localization evaluation less strict, results in a higher average precision. At the same time, the figures give an insight into the performance for different minimum object heights. It is visible that the average precision is higher when cars are bigger in the image. The differences are particularly large when the minimum object height is low.

The scaling and localization performance of the second model is shown in figures 5.3a and 5.3b. For this model, again, the average precision becomes higher when the objects are bigger in the image. However, the average precision slightly decreases from a minimum object height of 80 pixels onwards.

Figure 5.2: Evaluation of the first model for varying minimum object heights and two IoU thresholds: (a) IoU of 0.7, (b) IoU of 0.5.

Figure 5.3: Evaluation of the second model for varying minimum object heights and two IoU thresholds: (a) IoU of 0.7, (b) IoU of 0.5.

5.3 Qualitative results

This section shows qualitative results on the KITTI test set for the two models selected in the previous section. Figure 5.4 shows results of the first model; figure 5.5 shows results of the second model. The green bounding boxes indicate that a car is predicted to be in that box. The bottom-left corner of each green box indicates the confidence of the highest-scoring class for that bounding box.

Figure 5.4: Qualitative results of the first proposed model.

Figure 5.5: Qualitative results of the second proposed model.

Chapter 6

Discussion

This chapter discusses the obtained results. The most important findings are discussed, and an explanation is given of whether or not the results are in line with the hypotheses posed alongside each experiment. First, the results for the five varied hyperparameters are discussed and compared to attempts by other detection algorithms. Second, the scaling and localization performance results are discussed.

6.1 Discussion of the five varied hyperparameters

The results of the first experiment, varying the image size hyperparameter, are in line with the hypothesis: when the image size increases, the accuracy increases but the model becomes slower. In other words, the number of FPS decreases. However, figure 5.1a shows that the difference between an image size of 320² and 512² is bigger than the difference between 512² and 640². Therefore, it does not seem like a linear increase. This can be caused by the fact that the training images were resized. For example, when the original image format of 1242 × 375 pixels is resized to 512² pixels, the height of 375 pixels is stretched to 512 pixels, but that does not provide extra information. However, in that case, the feature maps are bigger as well, which results in more detections. Therefore, there is a high possibility that the small size of the network is the limiting factor in this problem.

The hypothesis of the second experiment, varying the feature map attachments, was that an attachment to an earlier bottleneck results in a slower model and a higher average precision. The hypothesis for the speed turns out to be correct. In contrast, the average precision of attaching to the third bottleneck was lower than expected. A possible explanation is that at the third bottleneck, not enough bottlenecks have been passed to have enough weights to distinguish cars correctly.

In line with the hypothesis of the third experiment, using pre-trained weights results in a higher performance, as shown in figure 5.1c. Using the pre-trained weights by Duo Li provided an increase in average precision of 18.5% for an image size of 512². Because these weights were trained on ImageNet, it is conceivable that the training of this project builds upon the car class trained on ImageNet.

The results of the fourth experiment were in line with the hypothesis, but the variance in the results was more significant than expected. The impact of varying the optimizers is shown in figure 5.1d. Changing from AdamW AMSgrad to SGD momentum makes the average precision of the easy car class increase by 0.38%. Changing from SGD momentum to Adam AMSgrad results in an additional increase of 10.4%.

The results of the last experiment show that, as expected in the hypothesis, the width of the MobileNetV2 backbone influences the results, as illustrated in figure 5.1e. Using half of the width results in a small decrease of average precision, while the speed of the model almost doubles.

Model               Easy AP    Moderate AP    Hard AP
YOLOv3              0.922      0.776          0.657
Faster R-CNN        0.890      0.832          0.726
YOLOv2              0.882      0.787          0.695
Proposed model 2    0.834      0.795          0.716
Proposed model 1    0.809      0.696          0.629

Table 6.1: The two models proposed by this project in comparison to a YOLOv2, YOLOv3 and Faster R-CNN implementation.

The research question was whether it is possible to run at approximately 20 FPS on limited hardware and obtain a good average precision at the same time. According to the results of the experiments, two models meet the requirements. The first is a model trained with an image size of 512² pixels, a pre-trained width of 0.5, and uses Adam AMSgrad as an optimizer. The second proposed model is trained with an image size of 320² pixels, a full width, an attachment at the fourth bottleneck, and uses SGD momentum as an optimizer. The two proposed models are compared in table 6.1 to a YOLOv2 [19], YOLOv3 [20], and Faster R-CNN [21] implementation. While the other attempts only obtain a slightly higher average precision, the models proposed by this project can run in real-time on limited hardware.

6.2 Discussion of the scaling and localization performance

The second part of the evaluation was getting an insight into the localization and scaling performance of the trained models. In this section, the performance of the two best models is discussed.

For the first proposed model, figure 5.2 showed that the model performs better when the IoU is 0.5 in comparison to 0.7, which means that the localization evaluation is less strict. In particular, when the minimum object pixel height is lower than 50 pixels, the average precision becomes better. Altogether, the first model performs slightly better when the minimum car height is bigger. The localization performance becomes a lot better when the car height is bigger.

For the second proposed model, figure 5.3 showed the scaling and localization performance. In comparison to the first model, the results for an IoU of 0.7 are better for small minimum object pixel heights. When the IoU is changed to 0.5, the average precision rises across the whole plot. This means the localization performance is evenly spread across the minimum object pixel heights.

In conclusion, model one performs well for cars with a bigger pixel size, especially in terms of localization, while model two has a better distribution of performance. Its localization is notably better for cars with a small object height.

Chapter 7

Conclusion and further research

7.1 Conclusion

This work investigated the research question: “To what extent is it possible to accurately detect vehicles in real-time with limited hardware?” The goal was to be able to run at approximately 20 FPS on average CPU devices and still obtain a high average precision. In order to do so, the method of this project replaces the backbone of the original Single Shot Multibox Detection paper with a MobileNetV2 backbone. The method was trained and evaluated on the KITTI benchmark. Five experiments were performed to obtain a good speed/accuracy trade-off. First, increasing the image size resulted in higher accuracy, but a decrease in speed. The second experiment showed that re-attaching the feature maps can lead to higher accuracy, but slows down the method as well. The third experiment showed that using the right pre-trained weights improves the accuracy without slowing down the method. The fourth experiment showed that using Adam AMSgrad results in a higher accuracy without a decrease in speed. The last experiment showed that using a smaller backbone width results in a higher speed, but less accuracy at the same time.

As a result of these five experiments, two models with a good speed/accuracy trade-off are proposed. The first is a model trained with an image size of 512² pixels, a pre-trained width of 0.5, and uses Adam AMSgrad as an optimizer. This model achieves 80.9% on the KITTI easy car class and runs at 21 FPS. It is especially useful for detecting cars close to the camera.

The second proposed model is trained with an image size of 320² pixels, a full width, an attachment at the fourth bottleneck, and uses SGD momentum as an optimizer. For this model, the result on the KITTI easy car class is 83.3% at 18 FPS. Compared to the first model, this model is better at detecting cars far away from the camera, and worse for cars close to the camera.

While the other methods such as YOLOv2, YOLOv3 and Faster R-CNN only obtain a slightly higher average precision, the models of this project run in real-time on limited hardware. Therefore, the two models provided by this project show that it is possible to accurately detect vehicles in real-time with limited hardware.

7.2 Future research

This section discusses possible opportunities for future research. While the KITTI dataset is useful to compare the method of this project to other approaches, it only provides daytime images in sunny weather conditions. Therefore, it is conceivable that the method of this project will not perform well at nighttime or in other weather conditions. Using other datasets could solve this problem.

Also, several parameters that were fixed in this project could be examined as well. For example, the minimum and maximum sizes of the default bounding boxes were fixed, as was the NMS threshold.

Finally, adding a Feature Pyramid Network [22] could improve the accuracy of the models. The method takes an input of arbitrary size and returns resized feature maps at multiple levels, in a fully convolutional fashion. The method uses a bottom-up pathway to compute a feature hierarchy consisting of feature maps at several scales, a top-down pathway to hallucinate higher-resolution features by up-sampling feature maps from higher pyramid levels, and lateral connections to merge feature maps of the same spatial size from both pathways. Using a feature pyramid network will slow down the method and can therefore only be used on models that are currently faster than 20 FPS.

Bibliography

[1] Mark Sandler et al. “Mobilenetv2: Inverted residuals and linear bottlenecks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 4510–4520.

[2] Barret Zoph et al. “Learning transferable architectures for scalable image recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 8697–8710.

[3] Xiangyu Zhang et al. “Shufflenet: An extremely efficient convolutional neural network for mobile devices”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 6848–6856.

[4] Joseph Redmon et al. “You only look once: Unified, real-time object detection”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 779–788.

[5] Wei Liu et al. “Ssd: Single shot multibox detector”. In: European conference on computer vision. Springer. 2016, pp. 21–37.

[6] Shaoqing Ren et al. “Faster r-cnn: Towards real-time object detection with region proposal networks”. In: Advances in neural information processing systems. 2015, pp. 91–99.

[7] Joseph Redmon and Ali Farhadi. “Yolov3: An incremental improvement”. In: arXiv (2018).

[8] Mark Everingham et al. “The pascal visual object classes (voc) challenge”. In: International journal of computer vision 88.2 (2010), pp. 303–338.

[9] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition”. In: arXiv:1409.1556v6 (2014).

[10] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: arXiv:1502.03167v3 (2015).

[11] Ross Girshick. “Fast r-cnn”. In: Proceedings of the IEEE international conference on computer vision. 2015, pp. 1440–1448.


[12] Andreas Geiger et al. “Vision meets robotics: The kitti dataset”. In: The International Journal of Robotics Research 32.11 (2013), pp. 1231–1237.

[13] Adam Paszke et al. “PyTorch: An imperative style, high-performance deep learning library”. In: Advances in Neural Information Processing Systems. 2019, pp. 8024–8035.

[14] Guido Van Rossum and Fred L Drake. Python library reference. 1995.

[15] Congcong Li. High quality, fast, modular reference implementation of SSD in PyTorch. https://github.com/lufficc/SSD. 2018.

[16] Jia Deng et al. “Imagenet: A large-scale hierarchical image database”. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE. 2009, pp. 248–255.

[17] Duo Li. PyTorch Implementation of MobileNet V2. https://github.com/d-li14/mobilenetv2.pytorch. Apr. 2020.

[18] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. “On the convergence of adam and beyond”. In: arXiv preprint arXiv:1904.09237 (2019).

[19] Yizhou Wang. Train YOLOv2 with KITTI dataset. http://yizhouwang.net/blog/2018/07/29/train-yolov2-kitti/. 2018.

[20] Jizhi Zhang. yolov3 warp. http://www.cvlibs.net/datasets/kitti/eval_object_detail.php?&result=326e3ea44e5b02885a14389b26b79d4d85d9033c. 2019.

[21] Yizhou Wang. Train YOLOv2 with KITTI dataset. http://yizhouwang.net/blog/2018/12/20/object-detection-kitti/. 2018.

[22] Tsung-Yi Lin et al. “Feature pyramid networks for object detection”. In: Pro-ceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 2117–2125.
