
Speed Estimation from Traffic Video Data


Layout: typeset by the author using LaTeX.


Speed Estimation from Traffic Video Data

Lotte Philippus (11291168)
Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Drs. T.R. Walstra
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

29 January 2021


Abstract

Analysis of traffic video data can greatly help traffic management with problems such as congestion and accidents. This thesis conducts video data analysis using a Mask Regional Convolutional Neural Network (Mask R-CNN) for vehicle detection and simple online and real-time tracking (SORT) for vehicle tracking. The speed of the vehicles is calculated by converting the measurements from the pixel domain to the real world. The results show that correctly estimating vehicle speed involves many challenges, such as obtaining suitable datasets and accounting for camera perspective and vehicle acceleration.


Contents

1 Introduction
2 Theoretical Background
  2.1 Convolutional Neural Networks
    2.1.1 Convolutional layer
    2.1.2 Pooling layer
    2.1.3 Fully-connected layer
    2.1.4 Mask R-CNN
  2.2 Multiple object tracking
    2.2.1 SORT
  2.3 Related work
3 Approach
  3.1 Data
    3.1.1 COCO dataset
    3.1.2 Ko-PER dataset
    3.1.3 Urban Tracker dataset
  3.2 Implementation
4 Results
  4.1 Mask R-CNN
  4.2 SORT
  4.3 Speed conversion
5 Discussion
6 Conclusion
  6.1 Future work
A Ko-PER Results


Chapter 1

Introduction

Over the past years, the number of on-road vehicles has been increasing. This puts pressure on infrastructure and road capacity, leading to problems such as congestion, traffic accidents and air pollution. Efficient traffic management is therefore increasingly important, and one way to improve it is through video surveillance systems such as closed-circuit television (CCTV). These systems generate a massive amount of data, which can be processed and analyzed for traffic flow prediction. This is important for traffic control systems and can make traffic management easier and more efficient by predicting road congestion, or help road planning by estimating traffic demand [8]. This thesis focuses on the analysis of traffic video data, which can be done in a great variety of ways but generally consists of three parts: object detection, object tracking and speed estimation.

The object detection will be performed using Mask R-CNN, an instance segmentation technique that creates a mask for each object in addition to a bounding box and detects objects more accurately than its predecessor Faster R-CNN [6]. It was chosen because it is the most recent and best-performing R-CNN variant. Vehicle tracking will be done with a tracking algorithm called simple online and real-time tracking (SORT), a method for multiple object tracking that runs in real time with high accuracy, making it well suited for quickly processing traffic footage. Finally, the speed will be estimated by first measuring it in pixels per frame. A reference is needed to convert pixels to meters, such as the standard width of a road lane. This, together with the frame rate of the footage, is used to convert the speed to meters per second.

So the question this thesis aims to answer is:

How can vehicles in videos be detected and tracked, and their speed be estimated?


In the following chapter, the theoretical background of neural networks and multiple object tracking will be explained. After this, the method of this thesis and its results are presented. Finally, the results are discussed and a conclusion is drawn.


Chapter 2

Theoretical Background

This chapter explains the theoretical background knowledge for this thesis. First, the structure of a basic CNN is explained and the theory behind Mask R-CNN is described. Second, an explanation is given of multiple object tracking, and specifically the SORT algorithm. Finally, related work is described. The explanation of the basics of CNNs is based on the online Stanford course [17] and the book Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville [5].

2.1 Convolutional Neural Networks

CNNs are a type of neural network typically used for analyzing images. They are made up of neurons that have learnable weights and biases. CNNs consist of an input and an output layer, with multiple hidden layers in between. The layers are usually one of three types: convolutional, pooling or fully-connected. An example of a simple convolutional neural network can be seen in figure 2.1.

2.1.1 Convolutional layer

A convolutional layer convolves the input from the previous layer and passes its results to the next. It does this through a mathematical operation called convolution. A convolution operation is typically denoted as f ∗ g, where the first argument f is the input and the second argument g is the kernel or filter. In machine learning, the input is usually a multidimensional array of data and the filter a multidimensional array of parameters that are adapted by the network. The filter has a smaller size than the input, and the operation between the filter-sized patch of the input and the filter is a dot product. Because the filter has a smaller size than the input, it is multiplied with the input at different points. So a convolution between an image I at location (i, j) and an m by n filter K would look like:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)    (2.1)

Figure 2.1: A simple schematic of a convolutional neural network. [13]

Each filter will activate on a different pattern, so the result of a convolutional layer is a feature map of all the features that are recognized in the input.
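To make equation (2.1) concrete, the following minimal NumPy sketch computes the convolution of a single-channel image with one filter. It is only an illustration of the operation, assuming no padding and a stride of 1, and is not the implementation used inside the network.

import numpy as np

def convolve2d(image, kernel):
    # Naive 2D convolution following equation (2.1); "valid" output, no padding, stride 1.
    m, n = kernel.shape
    h, w = image.shape
    # Flipping the kernel turns the sum over I(i - m, j - n) K(m, n)
    # into a sliding dot product between the filter and each image patch.
    flipped = kernel[::-1, ::-1]
    out = np.zeros((h - m + 1, w - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + m, j:j + n]
            out[i, j] = np.sum(patch * flipped)
    return out

# Example: a 7x7 input and a 3x3 filter give a 5x5 feature map.
feature_map = convolve2d(np.random.rand(7, 7), np.random.rand(3, 3))
print(feature_map.shape)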

There are three parameters that control the size of the output: depth, stride and padding. The depth of the output is the number of filters that are used. The stride specifies how the filter slides across the image. If the stride is 1, the filter moves 1 pixel at a time. If the stride is 2, the filter moves 2 pixels at a time. A higher stride thus produces a smaller output, see figure 2.2. Padding, or more specifically zero-padding, means padding the input layer with zeros around the border. This ensures that the input and output will have the same dimensions and no information from the input is lost.
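As a worked example of how these parameters interact (using the standard formula from [17], which is not stated explicitly above): for an input of width W, a filter of size F, padding P and stride S, the output width is (W − F + 2P)/S + 1. For the situation in figure 2.2, with W = 7, F = 3 and P = 0, this gives (7 − 3)/1 + 1 = 5 for a stride of 1 and (7 − 3)/2 + 1 = 3 for a stride of 2.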

Convolutional layers are usually followed by a rectified linear unit (ReLU), which is a non-linear activation function defined as:

R(z) = max(0, z)    (2.2)

This activation function transforms the feature map into an activation map. Negative values in the feature maps can cancel out positive values in subsequent layers, which can degrade the accuracy of the network. ReLUs are therefore a crucial component of CNNs.


Figure 2.2: Illustration of the difference in stride. The stride is 1 on the left side and 2 on the right side, resulting in an output size of 5 and 3 respectively. [17]

2.1.2 Pooling layer

Pooling layers are typically used after a convolutional layer and activation function. This layer reduces the size of the activation maps by combining the outputs of a cluster of neurons in one layer into a single neuron in the next. This helps to make the representation invariant to small translations in the input and prevents overfitting by reducing the number of parameters and computations. The most common form is a 2x2 filter with a stride of 2, meaning that the number of activations is reduced by 75%. Filters larger than 3x3 are rarely used since they would discard too much information.

The most common pooling method is the max operation, which takes the maximum of a 2x2 neighbourhood. Alternatively, average pooling can also be used. This method computes the average of the neighbourhood. Both methods can be seen in figure 2.3.
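The following short sketch, an illustration only assuming a single 2D activation map, shows both pooling variants with a 2x2 window and a stride of 2.

import numpy as np

def pool2d(activation, size=2, stride=2, mode="max"):
    # 2x2 pooling with stride 2 keeps one value per 2x2 neighbourhood,
    # reducing the number of activations by 75%.
    h, w = activation.shape
    out = np.zeros(((h - size) // stride + 1, (w - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = activation[i * stride:i * stride + size,
                               j * stride:j * stride + size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

a = np.arange(16.0).reshape(4, 4)
print(pool2d(a, mode="max"))      # maximum of each 2x2 neighbourhood
print(pool2d(a, mode="average"))  # average of each 2x2 neighbourhood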

2.1.3 Fully-connected layer

This is a layer in which all neurons from the previous layer are connected to all neurons in the current layer. These layers are typically used at the end of a CNN and use the results of the convolutional and pooling layers to classify an image into a label. The output of the convolutional and pooling layers is flattened into a vector where each value is the probability that a certain feature belongs to a label.

2.1.4 Mask R-CNN

A Mask Region based Convolutional Neural Network (Mask R-CNN) is a model for instance segmentation, which is a method for image analysis where the input image is split into segments that each represent a distinct object or part of an object. The model returns for each detected object a bounding box containing that object, a class label, a confidence score for that class label, and a mask.


Figure 2.3: The difference between max and average pooling in a 2x2 neighbourhood. [3]

The model works in two stages. First, it generates region proposals, which are regions within an image that could contain an object. Second, it extracts features from the region proposals to predict the class of the object and generate a bounding box and mask for it.

The first stage consists of a standard CNN backbone network and a Region Proposal Network (RPN). The backbone network extracts feature maps from its convolutional layers, and the RPN scans these feature maps, see figure 2.4. The feature maps contain Regions of Interest (RoIs), which are regions in the image that could contain an object.

Anchors, which are a set of reference boxes with predefined locations and scales relative to images, are used to bind features to the image location. Anchors with different scales bind to different levels of the feature map. The RPN uses these anchors to detect where the object is and what the size of its bounding box is.

To properly align the bounding boxes to the objects in the image, an RoIAlign layer is used, see figure 2.5. This layer extracts a small feature map (the dashed lines) from each RoI (the solid lines). The dots denote the sampling points in each RoI. The value of each sampling point is calculated using bilinear interpolation.
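As an illustration of the bilinear interpolation used for each sampling point (a minimal sketch, not the actual RoIAlign implementation), the value at a real-valued position (y, x) of a feature map can be computed as follows.

import numpy as np

def bilinear_sample(feature_map, y, x):
    # Interpolate a 2D feature map at a non-integer sampling point (y, x),
    # as RoIAlign does for each of the dots in figure 2.5.
    h, w = feature_map.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bottom

print(bilinear_sample(np.arange(16.0).reshape(4, 4), 1.5, 2.25))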

In the second stage the mask, classification and bounding box are produced. The features from the RoIAlign layer are fed into fully connected layers to make the classification and into a regression model to further refine the bounding box prediction. The mask is generated by a mask classifier, consisting of two CNNs that output a binary mask for each RoI, where 1 represents an object and 0 the background. The pixels labeled 1 are then coloured to obtain the visual results.


Figure 2.4: Region Proposal Network [14]

2.2 Multiple object tracking

Multiple object tracking (MOT) aims to analyze videos to identify and track objects without any prior knowledge about their appearance or the number of objects present [4]. To do this, a detection step is necessary to identify the different objects. Most MOT algorithms therefore follow the tracking-by-detection paradigm. This produces a set of detections, such as bounding boxes, that are extracted from the video frames. These detections are then used to guide the tracking process by associating them together, so the same ID can be assigned to bounding boxes that contain the same object.

The next step is feature extraction and motion prediction. A feature extraction algorithm analyzes the detections to extract appearance, motion or interaction features. A motion predictor can also be used to predict the next position of an object. These predictions are used to compute a similarity score between two detections. The detections that belong to the same object are given the same ID. This process is shown in figure 2.6.

Figure 2.5: Implementation of RoIAlign. [6]

MOT algorithms can also be divided into batch and online algorithms. Batch tracking algorithms can use information from future frames to determine object identities. Online tracking algorithms can only use current and past information to make predictions about the current frame. This makes them suitable for real-time applications such as autonomous driving, but also causes them to perform worse than batch tracking algorithms.

2.2.1 SORT

Simple online and real-time tracking (SORT) is a tracking algorithm that focuses on associating objects efficiently for online and real-time applications [2]. The algorithm has four components: detection, propagating object states into future frames, associating current detections with existing objects and managing the lifespan of tracked objects.

The detection component is performed by a CNN. Propagating object states into future frames requires an object model. The state of each target is modelled as:

x = [u, v, s, r, u̇, v̇, ṡ]^T    (2.3)

where u and v are the horizontal and vertical pixel location of the target's center, s is the scale and r is the aspect ratio of the target's bounding box. When a detection is associated to a target, the bounding box is used to update the target state. The velocity components are solved using a Kalman filter, an algorithm that can estimate the future state of a dynamic system. The algorithm has two phases: predict, where an estimate of the current state along with its uncertainty is produced, and update, where the prediction is corrected with the new measurement and its uncertainty is reduced. Kalman filters are recursive, so the filter only requires the last estimate, and the estimate improves at every time step. If no detection is associated to the target, its state is predicted without correction using the linear velocity model.

Figure 2.6: MOT algorithm. (1) An object detector is run to obtain the bounding boxes of the objects (2). For every detected object, different features are computed (3). An affinity computation step calculates the probability of two objects belonging to the same target (4), and an association step assigns a numerical ID to each object (5). [4]
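The predict and update phases can be sketched as follows for the state of equation (2.3). This is a minimal illustration assuming the filterpy library; the initial values and measurements are placeholders, not the settings used by SORT itself.

import numpy as np
from filterpy.kalman import KalmanFilter

kf = KalmanFilter(dim_x=7, dim_z=4)
kf.F = np.eye(7)                                   # constant-velocity motion model:
kf.F[0, 4] = kf.F[1, 5] = kf.F[2, 6] = 1.0         # u += u_dot, v += v_dot, s += s_dot
kf.H = np.hstack([np.eye(4), np.zeros((4, 3))])    # only [u, v, s, r] are measured
kf.x[:4] = np.array([[320.0], [240.0], [1500.0], [1.5]])   # initialise from a first detection

kf.predict()                                       # predict: propagate state and uncertainty
kf.update(np.array([325.0, 238.0, 1510.0, 1.5]))   # update: correct with the associated detection
print(kf.x[:2].ravel())                            # estimated center after the update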

The next stage is to assign detections to existing targets. For this, each target's bounding box geometry is estimated by predicting its new location in the current frame. The intersection-over-union (IOU) distance between each detection and all predicted bounding boxes is calculated. This is put into a cost matrix, which is then solved using the Hungarian algorithm. The solution tells us whether an object in the current frame is the same as an object in the previous frame, and is used to attribute a unique ID to each object for association. The Hungarian algorithm is a combinatorial algorithm for solving the assignment problem, which consists of finding, in a weighted bipartite graph, a matching of a given size in which the sum of the weights of the edges is minimal. It operates on the idea that if a number is added to or subtracted from all the entries in a row or column of a cost matrix, the optimal assignment for the resulting cost matrix is also the optimal assignment for the original matrix. Solving the algorithm for an n by n matrix consists of four parts, see figure 2.7. First, subtract the smallest entry in each row from all other entries in that row. Next, do the same for the smallest entry in each column. Then cover all zeros in the resulting matrix using a minimum number of horizontal and vertical lines. If n lines are required, an optimal assignment exists among the zeros and the algorithm stops. If fewer than n lines are required, find the smallest element that is not covered by a line in the previous step, subtract it from all uncovered elements, and add it to all elements that are covered twice. These last two steps repeat until an optimal assignment is found.

Figure 2.7: Solving the Hungarian algorithm in six steps. [7]
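A compact sketch of this association step is given below. It uses SciPy's linear_sum_assignment, which solves the same assignment problem as the Hungarian algorithm, instead of a hand-written solver; the boxes are hypothetical [x1, y1, x2, y2] coordinates.

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

detections = [[100, 200, 180, 260], [400, 220, 470, 270]]   # boxes in the current frame
predictions = [[402, 218, 468, 272], [98, 205, 182, 258]]   # Kalman-predicted target boxes
cost = np.array([[-iou(d, p) for p in predictions] for d in detections])  # negated IOU: solver minimises
det_idx, pred_idx = linear_sum_assignment(cost)             # optimal detection-to-target assignment
print(list(zip(det_idx, pred_idx)))                         # [(0, 1), (1, 0)]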

The final stage is the creation and deletion of track identities. Identities need to be created or destroyed when objects enter or leave the frame. Any detection that overlaps all existing targets by less than the minimum IOU is considered a new, untracked object. This object is initialized with the geometry of its bounding box, a velocity of zero and a large uncertainty. Track identities are deleted when they are not detected for a certain number of frames. This way, if a track identity is deleted early, its tracking will resume under a new identity.

2.3 Related work

Many methods have been developed for traffic flow analysis and speed estimation. One such method is by Zhang et al. [18]. Here, Mask R-CNN is applied to detect and track vehicles and pedestrians in footage from unmanned aerial vehicles (UAV). Vehicles are identified with a detection line, the coordinates of the bounding box, a class number and an object ID. The vehicles are counted by calculating the overlap ratio between the object's bounding box and the detection line. To measure the speed of the vehicles, the distance of the object center from one frame to the next is calculated using the object's bounding box, and then averaged over four consecutive frames. The speed is converted from pixel-based to meter-based measurements using affine transformations. The geometrical information of the road, the video resolution and the UAV location are used to calculate the transformation.

Another method, by Hua, Kapoor, and Anastasiu [8], uses the two best performing models from the 2017 AI City Challenge to detect vehicles in a traffic video. The vehicles are then tracked by computing the optical flow for a set of corners. Potential corners are found by calculating the eigenvalues of the frame derivatives. For each corner point associated with a tracked object, the detected point locations in at most t past frames are stored; these are called tracklets. The algorithm estimates the vehicle speed as a function of the local movement and the maximum speed, where the local movement is a function of the maximum historical corner point movement within the tracklets associated with the vehicle. A projection matrix is then approximated to normalize object movement.

Huang [9] developed a method that uses Faster R-CNN to detect vehicles. After the initial detection, a class filter, confidence score filter, duplicated detection filter and outlier filter are applied to reduce false positives. The vehicles are then tracked using a histogram-based tracker. This tracker connects the same object across frames by finding the minimum Chi-squared distance among a group of candidates. The candidates are selected by taking all bounding boxes with 80% of their area covered by the neighbourhood region in the next frame. The speed conversion from pixels per second to miles per hour is done by warping the videos using a linear perspective transformation. The results are smoothed by a moving average over five frames.

Shahbazi et al. [15] use the object detector You Only Look Once (YOLOv3) to detect vehicles. Tracking is done using deep appearance features for vehicle re-identification, and motion estimation is done with Kalman filters. A 3D model of the road is made, and the 3D position of each car within that model is estimated using ray-tracing. Once a ray hits one of the faces in the triangular mesh of the road model, the hit position defines the location of the car. The changes in the position of the vehicle and the frame rate of the camera are then used to calculate the speed of the vehicle.


Chapter 3

Approach

This chapter gives an overview of the datasets and the implementation details of the network, tracking algorithm and speed calculation.

3.1 Data

For this thesis, three different datasets are used: one for training the Mask R-CNN and two for the speed estimation. The COCO dataset was chosen for the Mask R-CNN because it is a very large dataset that includes segmentations for the images. It also comes with a set of pre-trained weights, so the network does not have to be trained on the dataset from scratch. The Ko-PER dataset was chosen for the speed estimation because it contains about 10 minutes of vehicle footage, split into traffic going straight ahead, traffic making a right-hand turn and a mix of different crossings. This makes it easy to focus on calculating the vehicle speeds for one type of crossing. The Urban Tracker dataset was chosen because it provides a side view of a crossroad. This makes it easier to estimate the speed of the vehicles, since it minimizes the effects of perspective.

3.1.1 COCO dataset

The Mask R-CNN is trained on the Common Objects in Context (COCO) dataset [12]. This is a large-scale object detection, segmentation and captioning dataset containing 80 object categories. The relevant categories for this thesis are car, truck, bus, motorcycle and bicycle, but it also contains categories such as person, handbag and cow.

Each image is annotated with the object categories in that image, all instances of those objects and a segmentation mask for each instance. The dataset is split into 118,000 training images, 5,000 validation images and 41,000 test images.


3.1.2 Ko-PER dataset

The first dataset used for speed estimation is the Ko-PER Intersection Laserscanner and Video dataset [16]. This dataset consists of three sequences of video data. Sequence 1 is a 6:26 minute video of an intersection, in which several hundred road users cross the intersection. Sequences 2 and 3 contain vehicles making a right-hand turn and a straight-ahead manoeuvre, respectively. The dataset comprises raw laserscanner data, undistorted camera images, reference data of selected vehicles and object labels. One of the camera images from sequence 3 can be seen in figure 3.1.

Figure 3.1: A frame from the Ko-PER dataset.

The laserscanner data contains, among other data, [x, y, z] coordinates for all objects. A transformation matrix is also provided to transform the coordinates from the laserscanner system to a common coordinate system. The object labels are only available for sequence 1, since that is the only video with multiple objects, and consist of car, truck, pedestrian and bike. An overview can be seen in table 3.1. The reference data contains the velocity in the x, y and z direction, among other things such as the [x, y, z] coordinates and the length of the vehicle.

Sequence  Cars  Trucks  Pedestrians  Bikes  Duration (s)
1a        63    1       10           0      96
1b        63    3       13           3      96
1c        81    5       7            3      96
1d        83    3       8            4      97

Table 3.1: The object labels in the four parts of sequence 1 of the Ko-PER dataset.


3.1.3 Urban Tracker dataset

The second dataset used for speed estimation is the Urban Tracker dataset [10]. It contains five different videos of multiple types of road users, such as pedestrians, cars and bikes. For this thesis, only the Sherbrooke video was used, because its camera is positioned only a few meters above the road, giving a side view of the crossroad. This makes it easier to calculate the speed of the cars going straight ahead, since only the speed in either the x or y direction has to be taken into account. One of the frames of the Sherbrooke video can be seen in figure 3.2.

Figure 3.2: A frame from the Sherbrooke video of the Urban Tracker dataset.

3.2 Implementation

The object detection part is done using the Mask R-CNN described in chapter 2.1.4. It is implemented using TensorFlow 2.3.0 and Python 3.8.5 and uses ResNet-101 as the CNN backbone. ResNet-101 stands for a Residual Network with 101 layers, a type of CNN that is able to skip layers. This ability results in better accuracy for networks with many layers: during backpropagation the gradient can become very small, which in a network with many layers can prevent the weights from being updated. The ability to skip layers avoids this problem, since there are fewer layers to propagate through. The Mask R-CNN also has a Feature Pyramid Network (FPN) that works with the RPN to detect the objects in the images. The FPN is the part that extracts the feature maps from the convolutional layers and consists of a bottom-up and a top-down pathway, see figure 3.3. The bottom-up pathway is the backbone CNN, in this case ResNet-101. In the top-down pathway, the spatially coarser but semantically stronger feature maps from the top of the bottom-up pathway are upsampled again. Between the bottom-up and the top-down pathways are lateral connections that merge feature maps of the same spatial size. After this, a convolution is applied to each merged feature map, resulting in the predictions. The source code can be found on GitHub [1].

Figure 3.3: Feature Pyramid Network (FPN) [11]

The network is pre-trained on the COCO dataset mentioned in chapter 3.1.1. It takes individual frames from a video as input and outputs a text file containing, for each detection, the frame it was detected in, the x and y location of the bottom left corner of the bounding box, the width and height of the bounding box, and the confidence score.
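As an illustration of this step, the sketch below runs a COCO-pre-trained Mask R-CNN on one frame and appends its detections to a text file. It assumes the Matterport Mask_RCNN package [1]; the configuration values, file names and exact output format are assumptions made for illustration, not necessarily those used in this thesis.

import skimage.io
import mrcnn.model as modellib
from mrcnn.config import Config

class InferenceConfig(Config):
    NAME = "traffic_inference"    # hypothetical configuration name
    NUM_CLASSES = 1 + 80          # COCO: background + 80 categories
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1            # detect one frame at a time

model = modellib.MaskRCNN(mode="inference", config=InferenceConfig(), model_dir="logs")
model.load_weights("mask_rcnn_coco.h5", by_name=True)   # the COCO pre-trained weights

frame_id = 1
frame = skimage.io.imread("frame_000001.png")           # hypothetical file name of one video frame
result = model.detect([frame], verbose=0)[0]
with open("detections.txt", "a") as out:
    for (y1, x1, y2, x2), score in zip(result["rois"], result["scores"]):
        # frame, corner x, corner y, width, height, confidence score
        out.write(f"{frame_id},{x1},{y1},{x2 - x1},{y2 - y1},{score}\n")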

Vehicle tracking is done using the SORT algorithm described in chapter 2.2.1, implemented in Python 3.8.5 [2]. It uses the text file from the Mask R-CNN as input and outputs a new text file containing, for each detection, the frame, the object ID, the x and y location of the bottom left corner of the bounding box, and the width and height of the bounding box.
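A sketch of this tracking step is shown below. It assumes the reference SORT implementation [2], whose Sort class takes an N x 5 array of [x1, y1, x2, y2, score] detections per frame and returns [x1, y1, x2, y2, object ID] rows; the parameter values and boxes are illustrative only.

import numpy as np
from sort import Sort

tracker = Sort(max_age=5, min_hits=1)                        # frames to keep a lost track / before confirming one
detections = np.array([[100.0, 200.0, 180.0, 260.0, 0.97],   # hypothetical boxes from the detection step
                       [400.0, 220.0, 470.0, 270.0, 0.91]])
tracked = tracker.update(detections)                         # call once per frame
for x1, y1, x2, y2, obj_id in tracked:
    print(int(obj_id), x1, y1, x2 - x1, y2 - y1)             # object ID, x, y, width, height for the text file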

Finally, the speed of the vehicles is calculated. For simplicity, only the speed of the vehicles going straight ahead is calculated. First, the text file from the SORT algorithm is read and sorted by object ID and then by frame number. To remove outliers, all objects that are detected in fewer than five frames are removed. Since all vehicles are going straight ahead, the first and last detection of every vehicle are used. The x and y coordinates of the centers of the bounding boxes are calculated, and the travelled distance is obtained by subtracting the coordinates of the first location from those of the last location and taking the absolute value. Again, outliers are removed by discarding all objects that travelled less than ten pixels. The number of seconds it took to travel that distance is determined by subtracting the frame number of the first detection from that of the last detection and dividing the difference by the frame rate of the camera. To convert the number of pixels travelled to meters, a reference point in the image whose real-world size is known is needed. The standard width of a car lane is used, which is 3.5 meters. With a basic image editor, the width of the car lane in pixels can be measured. The conversion ratio from pixels to meters is then calculated by dividing the width of the lane in meters by its width in pixels. This ratio is multiplied with the travelled distance of the object to get the number of meters travelled. The speed of the vehicle is determined by dividing the number of meters travelled by the number of seconds it took.
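The calculation described above can be summarised in the following sketch. The column order follows the SORT output described earlier; the comma separator, the frame rate and the measured lane width in pixels are assumptions that would have to be adjusted per video.

from collections import defaultdict

FPS = 25.0                        # frame rate of the footage (assumed)
METERS_PER_PIXEL = 3.5 / 70.0     # lane width of 3.5 m divided by its measured width in pixels (70 px assumed)

def estimate_speeds(track_file):
    # Group the SORT output (frame, object ID, x, y, width, height per line) by object ID.
    tracks = defaultdict(list)
    with open(track_file) as f:
        for line in f:
            frame, obj_id, x, y, w, h = map(float, line.split(",")[:6])
            tracks[int(obj_id)].append((frame, x + w / 2.0, y + h / 2.0))   # bounding box center

    speeds = {}
    for obj_id, dets in tracks.items():
        dets.sort()                               # sort detections by frame number
        if len(dets) < 5:                         # outlier filter: detected in fewer than five frames
            continue
        (f0, x0, y0), (f1, x1, y1) = dets[0], dets[-1]
        dx, dy = abs(x1 - x0), abs(y1 - y0)
        if dx < 10 and dy < 10:                   # outlier filter: travelled less than ten pixels
            continue
        seconds = (f1 - f0) / FPS
        speeds[obj_id] = (dx * METERS_PER_PIXEL / seconds,   # speed in the x direction (m/s)
                          dy * METERS_PER_PIXEL / seconds)   # speed in the y direction (m/s)
    return speeds

print(estimate_speeds("tracks.txt"))              # hypothetical SORT output file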


Chapter 4

Results

This chapter describes the results of the speed estimation. First, the results of the Mask R-CNN are shown, then the results of the SORT algorithm, and finally the results of the speed calculation. Since the speed is only calculated for vehicles driving straight ahead, the footage from sequence 3 of the Ko-PER dataset is trimmed to only include these vehicles. The used footage starts at image 'KAB_SK_4_undist_1384779884800018.bmp'. The footage from the Sherbrooke video of the Urban Tracker dataset was also trimmed, to only include vehicles driving either left-to-right or right-to-left. The used footage consists of frames 970-1190, 1235-1300, 1520-1660, 1700-1785 and 2010-2095.

4.1 Mask R-CNN

A few results from the Mask R-CNN on the Ko-PER dataset can be seen in figure 4.1. The images shown here were chosen at random from the frames in sequence 3 and were obtained by inputting the images into the network with the display flag on. The different detections are shown, as well as their class label, confidence score, bounding box and segmentation mask. In the top pictures and the bottom right picture, most vehicles are correctly identified. In the bottom left picture, two of the vehicles in the bottom middle of the frame could not be identified.


Figure 4.1: Four frames from the Ko-PER dataset with the detections from the Mask R-CNN.

The results from the Urban Tracker dataset can be seen in figure 4.2. Again, the images shown were chosen at random from the frames in the Sherbrooke video and were obtained the same way as the frames from the Ko-PER dataset. Most vehicles are correctly identified, only some cars in the distance could not be detected. Some additional objects are detected as well, such as pedestrians and traffic lights. In both images, one of the traffic lights is misidentified as a potted plant, but this does not influence the detection or tracking of the vehicles.


Figure 4.2: Two frames from the Urban Tracker dataset with the detections from the Mask R-CNN.

4.2 SORT

The results from the Ko-PER dataset and the SORT tracking algorithm can be found in figure 4.3 and 4.4. The images shown are from the frames in sequence 3 and were obtained by inputting the images into the algorithm with the display flag on.

The top two images in figure 4.3 show images that are a few frames apart. All objects shown in both frames are correctly re-identified, and there are also new cars that are given a new object ID. The bottom two images in figure 4.3 show two more images that are a few frames apart. Here, the black car in the middle of the left frame is not tracked, probably because it was not correctly detected by the Mask R-CNN. The white truck in the left frame is not re-identified in the right frame. This could also be because the Mask R-CNN could not identify the vehicle in the right frame.


Figure 4.3: Four frames from the Ko-PER dataset with the detections from the SORT tracking algorithm.

Figure 4.4 shows a wrong re-identification of a vehicle. In the top of the screen, a vehicle is detected and given a lilac bounding box. After a few frames, the detection is lost and after a few more frames it is detected again. Because there were too many frames between the identifications, the car is given a new object ID and is now shown with a pink bounding box.


Figure 4.4: Three frames from the Ko-PER dataset with the detections from the SORT tracking algorithm, showing a car being identified, lost and then re-identified.

The results from the Urban Tracker dataset can be found in figure 4.5. The images shown are from the frames in the Sherbrooke video and were obtained by inputting the images into the algorithm with the display flag on. Figure 4.5 shows two images that are a few frames apart. All cars are correctly identified and tracked, and overall this dataset resulted in only one or two cases where a vehicle could not be identified or properly tracked.


Figure 4.5: Two frames from the Urban Tracker dataset with the detections from the SORT tracking algorithm.

4.3 Speed conversion

Table 4.1 shows the results of calculating the speed of the first ten objects in sequence 3 of the Ko-PER dataset. The full table of results can be seen in Appendix A. The vehicles seem to mostly drive between 2 and 10 m/s, with the lowest speed calculated being 0.37 m/s and the highest 13.71 m/s. The average speed is 5.64 m/s. The slow speeds could be explained by vehicles that accelerate from standstill.

Object ID  Speed in x-direction (m/s)  Speed in y-direction (m/s)
7          3.96                        6.10
11         9.69                        5.62
12         7.27                        5.19
15         3.93                        4.39
20         1.50                        1.92
23         4.03                        6.18
24         8.89                        5.21
28         7.33                        5.14
30         1.76                        2.59
34         3.12                        3.63

Table 4.1: An excerpt from the results of calculating the speed of the vehicles in sequence 3 of the Ko-PER dataset.

Table 4.2 shows the results of calculating the speed of the objects in the Sherbrooke video of the Urban Tracker dataset. If we discount the vehicles that are barely moving - such as the vehicles that have a velocity below 1 m/s in both the x and y direction - the vehicles seem to mostly drive between 10 m/s and 20 m/s, with the lowest speed calculated being 10.90 m/s and the highest 20.29 m/s. The average speed in the x direction is 5.85 m/s - or 14.74 m/s if we discount the vehicles with a velocity below 1 m/s. The slow speeds could be explained by vehicles that are driving from the top to the bottom of the frame. These vehicles are approaching a red light, and are thus driving very slowly. They also seem to move slower due to the perspective of the camera.

Object ID  Speed in x-direction (m/s)  Speed in y-direction (m/s)
2          0.10                        0.33
3          0.14                        0.20
6          0.08                        0.33
16         11.38                       0.41
74         10.90                       0.57
110        0.51                        0.94
113        0.22                        0.30
119        20.29                       0.75
135        0.09                        0.20
148        0.00                        0.14
154        0.31                        0.30
157        11.89                       0.67
189        13.25                       0.76
217        0.01                        0.25
218        0.26                        0.19
222        0.43                        0.90
223        15.35                       0.67
280        20.15                       1.01

Table 4.2: The results of calculating the speed of the vehicles in the Sherbrooke video of the Urban Tracker dataset.


Chapter 5

Discussion

The data from chapter 4 revealed some interesting things. The Mask R-CNN performs reasonably well on the Ko-PER dataset, except for the lower left part of the image. Many vehicles that pass through that section do not get detected. It is not immediately clear why this part of the image has fewer detections. A possible explanation could be the perspective of the camera: the further a car gets to the bottom of the screen, the more it is seen from a top-down view. It could be that the COCO dataset does not contain as many top-down views of vehicles, resulting in fewer detections. The Mask R-CNN performs much better on the Urban Tracker dataset, with only one instance where a large truck could not be identified until it was already almost out of frame. This could be because the truck was very dark and thus unclear, or because the COCO dataset does not contain enough pictures of trucks. This seems to support the theory that the Mask R-CNN has more trouble identifying vehicles from a top-down view than from a side view.

The SORT algorithm performs very well. Again, it has trouble tracking the vehicles in the lower left corner of the images from the Ko-PER dataset, but this is likely because of the lack of detections by the Mask R-CNN. There are a few instances where the cars are not properly re-identified and get new object IDs, as can be seen in figure 4.4. Most of these detections should be filtered out during the speed calculation, but it could be that some of them are the cause of the low speeds in Appendix A.

The speed calculation seems to work fine. If we convert the lowest and highest speed calculated from the Ko-PER dataset to kilometers per hour, the speeds are in a range from 1.33 km/h to 49.36 km/h, with an average speed of 20.3 km/h. These speeds seem normal for a road in a city, albeit somewhat low. This could be due to the fact that this method of speed calculation does not take acceleration into account. The speeds from the Urban Tracker dataset are in the range from 39.24 km/h to 73.04 km/h, with an average speed of 53.1 km/h. This is somewhat fast, but still normal for a city road.

The main problem seems to be finding a good dataset. The Ko-PER dataset has reference data for a single car in sequence 3, but the footage contains dozens of cars. The reference data shows that this car has a speed of 8 m/s in the x-direction and 10 m/s in the y-direction. The speeds calculated are on average lower than that. Discrepancies between the results and the reference data could be explained by the way the speed is calculated: this method disregards the perspective of the camera and assumes that the road is straight and flat, which could introduce errors into the results. The Urban Tracker dataset has no reference data, but is less affected by camera perspective since the cars move horizontally instead of diagonally. The camera is less stable than the camera from the Ko-PER dataset, however, and has noticeable distortions that could be a source of inaccuracies in the results.


Chapter 6

Conclusion

The aim of this research is to calculate the speed of vehicles in traffic footage. The research question is therefore: how can vehicles in videos be detected and tracked, and their speed be estimated? This is done in three parts: vehicle detection, vehicle tracking and speed calculation. Vehicle detection is done using a Mask R-CNN and vehicle tracking is done using the SORT algorithm. The speed of the vehicles is then calculated by taking the first and last detection of an object, calculating the distance the object travelled and how long it took, and converting that from pixels per frame to meters per second. The results seem plausible. Unfortunately, due to a lack of datasets with good reference data for the vehicle speeds, it is not possible to verify how accurate they really are.

6.1 Future work

As mentioned before, the results of the experiments could not be properly evaluated due to a lack of reference data for the vehicle speeds. Many datasets either contain reference data or traffic footage but rarely both, and the datasets that do contain both are not publicly available. This research could be greatly improved by a dataset that contains the speed for every vehicle in the footage.

Another improvement could be made in the way the speed is calculated. This research assumes that the roads are straight and flat, and disregards the influence of the camera’s perspective. More research could be done to take these influences into account. The method could also be adapted to take acceleration and deceleration into account.
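One way such a perspective correction could be implemented (a sketch only, not part of this thesis) is with a homography estimated from four image points whose real-world positions on the road plane are known, for example lane markings. The point coordinates below are made up for illustration.

import numpy as np
import cv2

image_pts = np.float32([[420, 310], [860, 305], [1010, 650], [250, 660]])   # pixel coordinates
world_pts = np.float32([[0, 0], [3.5, 0], [3.5, 20], [0, 20]])              # road-plane coordinates in meters
H = cv2.getPerspectiveTransform(image_pts, world_pts)                       # homography: pixels -> meters

def to_road_plane(px, py):
    # Map a pixel coordinate to road-plane meters, so that distances no longer
    # depend on where in the frame the vehicle is.
    return cv2.perspectiveTransform(np.float32([[[px, py]]]), H)[0, 0]

p0, p1 = to_road_plane(500, 400), to_road_plane(520, 480)
print(np.linalg.norm(p1 - p0), "meters between the two detections")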


Bibliography

[1] Waleed Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN. 2017.

[2] Alex Bewley et al. “Simple online and realtime tracking”. In: 2016 IEEE International Conference on Image Processing (ICIP). 2016.

[3] Cameron Buckner. “Deep learning: A philosophical introduction”. In: Philosophy Compass 14.10 (2019).

[4] Gioele Ciaparrone et al. “Deep learning in video multi-object tracking: A survey”. In: Neurocomputing 381 (2020).

[5] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.

[6] Kaiming He et al. “Mask R-CNN”. In: IEEE transactions on pattern analysis and machine intelligence. IEEE, 2017.

[7] Magnus Hieronymus and J. Nycander. “Finding the Minimum Potential Energy State by Adiabatic Parcel Rearrangements with a Nonlinear Equation of State: An Exact Solution in Polynomial Time”. In: Journal of Physical Oceanography 45 (Apr. 2015).

[8] Shuai Hua, Manika Kapoor, and David C Anastasiu. “Vehicle Tracking and Speed Estimation from Traffic Videos”. eng. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2018.

[9] Tingting Huang. “Traffic Speed Estimation from Surveillance Video Data: For the 2nd NVIDIA AI City Challenge Track 1”. eng. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2018.

[10] J. Jodoin, G. Bilodeau, and N. Saunier. “Urban Tracker: Multiple object tracking in urban mixed traffic”. In: IEEE Winter Conference on Applications of Computer Vision. 2014.


[11] Tsung-Yi Lin et al. Feature Pyramid Networks for Object Detection. 2017. arXiv: 1612.03144 [cs.CV].

[12] Tsung-Yi Lin et al. “Microsoft coco: Common objects in context”. In: European conference on computer vision. Springer. 2014.

[13] Phung and Rhee. “A High-Accuracy Model Average Ensemble of Convolutional Neural Networks for Classification of Cloud Image Patches on Small Datasets”. In: Applied Sciences 9 (Oct. 2019).

[14] Shaoqing Ren et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. eng. In: IEEE transactions on pattern analysis and machine intelligence 39.6 (2017).

[15] M. Shahbazi et al. “Vehicle Tracking and Speed Estimation from Unmanned Aerial Videos”. In: ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLIII-B2-2020 (2020).

[16] E. Strigel et al. “The Ko-PER intersection laserscanner and video dataset”. In: 17th International IEEE Conference on Intelligent Transportation Systems (ITSC). 2014.

[17] Stanford University. CS231n Convolutional Neural Networks for Visual Recognition. url: https://cs231n.github.io/ (visited on 01/15/2021).

[18] Huaizhong Zhang et al. “Real-Time Traffic Analysis using Deep Learning Techniques and UAV based Video”. eng. In: 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2019.


Appendix A

Ko-PER Results

Object ID  Speed in x-direction (m/s)  Speed in y-direction (m/s)
7          3.96                        6.10
11         9.69                        5.62
12         7.27                        5.19
15         3.93                        4.39
20         1.50                        1.92
23         4.03                        6.18
24         8.89                        5.21
28         7.33                        5.14
30         1.76                        2.59
34         3.12                        3.63
35         2.57                        4.16
42         3.34                        3.53
47         7.09                        4.10
48         7.22                        4.93
49         0.81                        0.62
51         4.45                        2.46
57         9.20                        6.57
61         0.37                        4.82
65         7.98                        4.31
68         2.49                        2.00
77         10.58                       6.50
78         10.71                       7.95
81         2.06                        3.13
84         10.15                       10.11
90         6.36                        6.54
91         4.56                        7.47
92         8.44                        6.08
98         9.42                        5.19
104        3.83                        6.27
113        9.23                        10.05
115        3.88                        1.60
116        12.22                       9.17
120        3.89                        5.89
121        10.92                       6.41
122        0.67                        0.91
125        5.30                        5.54
128        11.47                       8.35
135        4.49                        2.89
138        11.62                       7.29
140        5.57                        1.92
144        2.59                        3.85
147        13.71                       11.87
148        3.65                        3.89

Table A.1: The results of calculating the speed of the vehicles in sequence 3 of the Ko-PER dataset.
