
MSc Artificial Intelligence

Master Thesis

Velocity Estimation From a Monocular Video Using a 3D Scene Representation

by

Marlous Kottenhagen

11036184

May 21, 2021

36 EC, January - May
Supervisor: Dr Sezer Karaoglu
Daily Supervisor: Nedko Savov MSc
Assessor: Prof. Dr Theo Gevers

University of Amsterdam
3D Universum


Abstract

In this thesis, we perform the task of velocity estimation using monocular video. Our proposed pipeline maintains a 3D representation of the scene and calculates velocity from the object locations. In this way, velocity can be estimated well regardless of the object's motion direction. To achieve this, we break the underexplored problem of monocular velocity estimation down into instance segmentation, 3D tracking and camera motion (ego-motion) estimation modules, for which existing research is available. We show that our components perform reasonably, and we find that the biggest sources of error in the proposed pipeline are still the tracking and ego-motion modules, whose small inaccuracies can be magnified in the final velocity. We further find that the 3D tracking module is affected by distance, occlusion and large object motion, while ego-motion is affected by translation of the camera. Our modules are designed to be replaceable, so our pipeline could easily benefit from further improvements in 3D tracking and ego-motion estimation in the aforementioned aspects.


Contents

List of Figures
List of Tables

1 Introduction

2 Background
  2.1 Velocity Estimation
  2.2 Object Tracking
      2.2.1 Multi 3D-object Tracking
  2.3 Instance Segmentation
  2.4 Ego-motion Estimation

3 Methodology
  3.1 Instance Segmentation
  3.2 Ego-motion Estimation
  3.3 3D Tracking
  3.4 Velocity Estimation
  3.5 Creating a Ground Truth

4 Experiments
  4.1 Velocity Estimation
      4.1.1 Datasets
      4.1.2 Velocity Error
      4.1.3 Looking Into Different Scenarios
  4.2 Model Evaluation
      4.2.1 Datasets
      4.2.2 Ego-motion Evaluation
      4.2.3 3D center Estimation Evaluation
  4.3 Case Study
      4.3.1 Effect of Distance to 3D Center Object
      4.3.2 Effect of Occlusion
      4.3.3 Effect of Object Movement

5 Conclusion
  5.1 Future Work


List of Figures

1.1 The 6 SAE levels of autonomous driving. Source: https://www.nhtsa.gov/technology-innovation/automated-vehicles-safety#topic-road-self-driving
3.1 Overview of the pipeline used in this thesis. The input to the network is a scene consisting of multiple frames. This first goes through the top half of the pipeline to estimate the ego-motion. Then the scene and ego-motion are used as input to the 3D tracking module. The output of the 3D tracking module, together with the ego-motion, is used to compute the final velocity.
3.2 An example of instance segmentation by Mask R-CNN on the KITTI dataset
3.3 The structure-from-motion pipeline used by Colmap. Source: (1)
3.4 The 3D tracking pipeline used by mono3DT. Source: (2)
4.1 Estimated path of our ego-motion module and the ground truth path.
4.2 Bar plot of the MAE for each bin of object distances. Each bin consists of all velocity errors obtained for objects within the given range. This plot was created using the estimated 3D center and ground truth ego-motion.
4.3 Plot of the MAE for each bin of object distances. Each bin consists of all velocity errors obtained for objects within the given range. This plot was created using estimated ego-motion and ground truth 3D center.
4.4 Plot to show the effect of occlusion on the velocity error. All errors are divided into different bins based on the object's distance to the camera. For each bin, the MAE is shown. This plot was created using the estimated 3D center and ground truth ego-motion.
4.5 Different movements used in the experiments for ego-motion estimation. The arrow shows movement and the dot means that the object or camera is standing still. The green dot represents the object and the red dot the camera.
4.6 Different movements of the camera and object used in the experiment for 3D center estimation. The arrow shows the movement of the object and camera compared to one another. A single dot means that the object is standing still. Green represents the object and red represents the camera.
4.7 Plot to show the effect of different object movements, compared to the camera, on the velocity error. These results are all gained from scenes where the camera was static and the object was moving. All errors are divided into different bins based on the object's distance to the camera. For each bin the MAE is shown. For these results we used the estimated 3D center and ground truth ego-motion.
4.8 Plot to show the effect of different object movements, compared to the camera, on the velocity error. These results are all gained from scenes where both the camera and object are moving. All errors are divided into different bins based on the object's distance to the camera. For each bin the MAE is shown. For these results we used the estimated 3D center and ground truth ego-motion.
4.9 Plot to show the effect of different camera movements on the velocity error. These results are all gained from scenes where the object is static and the camera is moving. All errors are divided into different bins based on the object's distance to the camera. For each bin the MAE is shown. For these results we used ground truth 3D center and estimated ego-motion.


List of Tables

4.1 The mean absolute velocity error taken over all objects in the dataset.
4.2 The velocity error for the four different scenarios using only ground truth information, ground truth ego-motion with the estimated 3D center and ground truth 3D center with estimated ego-motion.
4.3 Comparison of our ego-motion module with state-of-the-art monocular ego-motion methods.
4.4 Comparison of the mono3DT 3D center estimation to the baseline created with DORN.


1 Introduction

In this day and age, autonomous driving technologies are already being used in cars. According to SAE (3), autonomous driving systems can be classified into one of 6 levels based on how independent they are. These levels are described in Figure 1.1. Currently, most self-driving vehicles fall within level 2 of the SAE scale. An example of this is Tesla's Autopilot, which is able to steer, accelerate and brake within its lane and can change lanes. For all these functionalities driver supervision is still needed, and in the case of changing lanes the driver needs to activate the turn indicator to signal to the car that it is safe to change lanes. More recently, Honda released the first car with SAE level 3 autonomy. This technology can only be used during traffic jams.

Even though there are already autonomous vehicles on the road, higher automation and independence from the driver are yet to be achieved. To expand the capabilities of autonomous vehicles further and move towards fully autonomous vehicles, it is important that they have a good understanding of their surroundings and know how other road users behave. One element that could help with this is knowing the velocity of the road users around the autonomous vehicle. Knowing the change of speed of the car in front, an autonomous vehicle can regulate its own speed accordingly. Also, knowing the speed of a vehicle crossing our path can be informative about the course of action to take to avoid a collision.

Figure 1.1: The 6 SAE levels of autonomous driving. Source: https://www.nhtsa.gov/technology-innovation/automated-vehicles-safety#topic-road-self-driving

The goal of velocity estimation is to correctly estimate the velocity of other road users around the autonomous vehicle based on video and sensor data. The task of velocity estimation has previously been researched using an RGB-D camera (4). RGB-D cameras with a good depth range are quite expensive. In this work, we instead explore velocity estimation using a single monocular camera. Monocular cameras are a cheap sensor option and widely available compared to RGB-D, lidar and radar sensors. Monocular velocity estimation is an underexplored topic. To the best of our knowledge, there is a single work on monocular object velocity estimation (5), limited to highway use. In their work, velocity is estimated by a neural network incorporating depth, tracks and optical flow. Instead, we attempt to explicitly calculate the velocity from accurate estimates of the camera and object locations, which we show to be sufficient. By being more explicit, we are able to break the problem down into a few well-studied ones: 3D object tracking and ego-motion estimation. Therefore, our approach is less reliant on a complex black-box neural network solution expected to solve the velocity estimation problem in one step. Additionally, with our approach we can estimate the velocity of a vehicle moving in any direction, unlike (5) where the direction is limited to the motion direction of the camera. Furthermore, we make use of a full 3D representation of the scene, estimated from the monocular data. We believe that addressing the 3D nature of the problem, without requiring 3D sensory data, is a benefit of our pipeline.

The main goal of this thesis is to create a monocular velocity estimation pipeline and to evaluate its strengths and weaknesses. In this thesis we will answer the following research question:

• How can a pipeline be designed for estimating velocity given only monocular video data?

The following contributions are made in this thesis:

• A novel pipeline that is able to estimate the velocity of cars in an urban setting, regardless of object motion direction, using a 3D scene representation internally while only making use of monocular data as input.

• Research on precisely which components contribute the most to the final velocity error.


• Presenting insights on how the model works under variations of relevant conditions.

In this thesis, we will start by giving the needed theoretical background in Chapter 2. This will be followed by describing our velocity estimation pipeline in Chapter 3. In this chapter we will describe how each part of our final pipeline is constructed. In Chapter 4 we will show the performance of our velocity estimation pipeline and we will evaluate how different components of the pipeline affect this performance. Further, we will show how our pipeline performs under different relevant conditions. Lastly, in Chapter 5 we will state the conclusions we obtained from our experiments and give some possibilities for future work.


2 Background

2.1 Velocity Estimation

In our research, we are going to create a pipeline that will be able to do velocity estimation given only monocular data. Monocular velocity estimation has not been well explored in research.

One field in which velocity estimation is used is traffic surveillance systems (6)(7). These papers make use of a single stationary camera, meaning that for velocity estimation they only need to take the movement of the object into account. In our research, the camera itself is also moving, which adds an extra level of complexity to the task.

Previous research in the field of velocity estimation using a moving camera is the paper by Henein et al. (4), which makes use of an object-aware dynamic SLAM algorithm. Their pipeline uses RGB-D video and data gathered by a proprioceptive sensor, for example an IMU, as input. The proprioceptive sensor data is used to predict the motion of the vehicle. In this work, we only use monocular video data, which leads to extra challenges for depth and vehicle motion prediction. When performing velocity estimation, this paper excludes objects that are far away and objects that are entering and exiting the camera's field of view. In our thesis, we do not exclude any of this data, and we perform experiments on it to gain insights into how the model performs in these difficult situations.

As far as we know, there is one monocular velocity estimation paper (5). It uses a monocular RGB scene as input to its network. This pipeline extracts information from the input using depth, tracking and optical flow networks. The outputs of these networks are then used by an MLP to estimate the final velocity. A disadvantage of this model is that it does not directly incorporate the motion of the camera itself. When camera motion is neglected, we can only identify how the object moved relative to the camera. In our thesis, we use camera motion to construct the 3D motion of the object independently from the camera, which can then be used for accurate velocity estimation. Another difference between this paper and our research is the kind of dataset used. Their dataset consists of scenes captured on a freeway, which leads to very specific scenarios with little change between scenes, where most of the objects move in the same direction as the camera. The camera also mostly moves in a straight line. Our dataset mostly contains urban scenes where both the objects and the camera can move in many different directions compared to one another.

2.2 Object Tracking

Object tracking is a task within computer vision whose goal is to track the motion and location of an object through sequential sensory data. It has been studied extensively, as surveys show (8)(9). It is widely used within autonomous vehicles (10) to gather information about the objects surrounding the vehicle. In this thesis, we use object tracking to determine the travelled distance of a car in the scene. We will use this distance to obtain our velocity prediction. In autonomous vehicles, it is important that multiple objects can be tracked at once. This is known as the multiple object tracking (MOT) task (11)(12)(13). For the MOT task, the targets to track and the number of targets are not known beforehand.

2.2.1 Multi 3D-object Tracking

Multi 3D-object tracking (MOT3D) extends the task of regular object tracking by detecting and tracking 3D bounding boxes instead of 2D ones. 3D tracking tends to outperform 2D tracking, as seen on the KITTI object tracking leaderboard. 3D tracking will be an important component of our final velocity estimation pipeline, since it allows harvesting 3D information from the monocular input images and therefore allows for more relevant velocity predictions.

Most state-of-the-art 3D tracking methods make use of lidar data (14)(15)(16). In (14) the network takes two consecutive lidar point clouds as input. This network passes the point clouds to an encoder-decoder structure that predicts the rigid body motion of the 3D objects. A different approach is used by (16). Here the lidar point cloud is used as input to a 3D detection module that detects 3D objects. Then a 3D Kalman filter is used to predict the state of the previously detected objects in the current frame. The predicted states are matched with object detections in the new frame using the Hungarian algorithm. In contrast, our goal is to only use monocular data.

MOTSFusion (17) is a 3D tracking method that can be used with monocular, stereo and lidar data. This network takes as input stereo flow, 2D bounding boxes, monocular video images, depth maps and camera ego-motion. If we wanted to use this network in our pipeline, we would need to make sure that all of these inputs are generated using monocular methods. Another paper of interest is mono3DT (2). The mono3DT pipeline takes a monocular video stream and GPS data as input and outputs 3D object tracks. Due to the use of GPS data within the pipeline, this method is not fully monocular. The GPS data is used to model the motion of the camera. To make this a monocular method, we can replace the GPS input with a monocular ego-motion estimation network.


Mono3DT outperforms other monocular methods on the KITTI tracking benchmark. Additionally, it matches our needs for a tracking module well, apart from the GPS data requirement, which we can replace with estimated ego-motion data. Therefore, we choose to use mono3DT in our tracking module.

2.3 Instance Segmentation

Segmentation is the task of finding all the pixels belonging to predetermined object classes in an image. The output of image segmentation is a mask for each object, describing the locations of all of the object's pixels. Instance segmentation is the problem of masking individual objects within an image rather than all pixels of a given object class. This is relevant for velocity estimation, since it helps us isolate the moving object from the background. One method that can be used for instance segmentation is Mask R-CNN (18). This is a two-stage instance segmentation method that builds upon Faster R-CNN (19), a 2D object detection and region proposal method. In the first stage, regions of interest are detected. Then, in the second stage, two networks perform image classification and mask prediction in parallel.

Another instance segmentation method is YOLACT (20). YOLACT breaks the instance segmentation task up into two simpler parallel tasks. The first task uses a neural network to produce prototype masks. The second task is to predict a vector of mask coefficients. For each detected instance, the prototype masks and the mask coefficients are linearly combined to produce the final object mask.

YOLACT outperforms Mask R-CNN in speed, but Mask R-CNN has the higher precision of the two. Since our goal is to create a velocity estimation pipeline and investigate whether it can perform accurate velocity estimation, precision is more important than speed. This is why we use Mask R-CNN in this thesis.

2.4 Ego-motion Estimation

Ego-motion estimation, also known as odometry, is the task of estimating the rotational and translational movement of the camera within an environment.

Ego-motion estimation methods use a wide variety of data, like lidar (21)(22) and stereo vision (23)(24). In this work, we will use monocular data as input.

One method of ego-motion estimation is to combine pose and depth estimation (25). This method uses a single frame to estimate the depth using a depth CNN and uses the input frame and its direct neighbours to estimate the pose using a pose CNN. The outputs of both networks are then combined to inverse-warp the original input frame and reconstruct the next frame. The reconstruction loss is then used to train both CNNs, which lets the two networks learn from one another. The version of Bian et al. (26) is able to do pose estimation using 2 consecutive frames instead of 3. Further, they extend the network by estimating the depth of both input frames instead of only the original frame. Another variation of the network is the work of Chen et al. (27). They introduce a third network to the pipeline, namely an optical flow network that operates in parallel with the depth and pose networks. All three networks together then contribute to the loss function used for training. The use of only two consecutive frames to estimate the pose could be useful to create a pipeline that performs velocity estimation in an online fashion, but the errors of these kinds of methods are not yet up to par with the state of the art. Since the main focus of this thesis is to evaluate our pipeline and see if it could be used for the velocity estimation task, we trade online estimation for better performing methods.

Another method that can be used for ego-motion estimation is simultaneous localization and mapping (SLAM). The task of SLAM is to simultaneously construct a map of the environment and keep track of the robot's location in it. In our case, this translates to an autonomous vehicle that tries to build a map of the scene it drives through while keeping track of its own location within this scene. ORB-SLAM2 (28) is a SLAM algorithm that can be used with monocular, stereo and RGB-D cameras and builds upon ORB-SLAM (29). There is also research on using SLAM in dynamic environments (30)(31).

Another way ego-motion can be estimated is by using structure from motion (SfM). In SfM methods, a 3D reconstruction of an environment is created based on the movement of the camera. The first step of SfM is keypoint detection, to detect points of interest within a frame. These points of interest are then matched between frames. Using these matches, the relative orientation of the cameras can be found. This can be done by estimating how to move one camera on top of the other in such a way that the matched points overlap. When the correct movement is found, we know how the cameras are oriented compared to one another.

There exist tools that can be used for SfM estimation; examples are Agisoft Metashape and Colmap (1). In this thesis we will use Colmap because of its solid accuracy and because it provides sequential feature matching. Instead of performing feature matching between all available frames, sequential feature matching only looks at a specified number of neighbouring frames within the sequence. This is especially suitable for an autonomous vehicle, since the camera is always moving and mistaken matches between remote frames are avoided.


3 Methodology

Our proposed pipeline for velocity estimation is shown in Fig. 3.1. The only input to our network will be a sequence of RGB frames depicting the scene. In our pipeline, we will separate the motion of the camera and the motion of the objects by estimating them using two separate modules. Then these two modules will be combined at the end to reconstruct the 3D world movement of the objects and estimate their velocity.

Given an input RGB frame, our first goal is to separate the pixels of the object whose velocity we want to estimate from the pixels of the background. In this way, we can achieve the desired separate motion estimation of the camera. We achieve this by passing the input images to our instance segmentation module, which returns a mask of the object. Using this mask we remove the moving objects, and the updated frames are passed to an ego-motion module, which estimates the motion of the camera. This module estimates the camera motion by jointly reconstructing the background scene and estimating the camera movement in it. By removing the moving objects, we ensure a good quality reconstruction from the SfM method in the ego-motion module, since moving objects other than the camera disturb the results. A 3D tracking module is used to estimate and track the locations of the objects in 3D, given two frames and the camera motion between them, obtained from the ego-motion module. The camera motion is necessary so the 3D tracking module can take it into account when predicting the locations of objects in the frame following the motion. For each object, the module returns the 3D track and an object center: the location of the midpoint of the object in camera space. In the final velocity estimation module, we map this location from camera space to world space using the ego-motion transformation. Then the direction of motion and the velocity of the object can be estimated by considering the distance travelled between frames.

3.1 Instance Segmentation

The first step in our pipeline, instance segmentation, is used to detect the pixels of the moving objects in the scene. These segmentations are then used to remove the moving objects from the frame and retrieve the background, which is used as input to our ego-motion module. For our instance segmentation, we will use Mask R-CNN (18), a state-of-the-art instance segmentation framework.

Figure 3.1: Overview of the pipeline used in this thesis. The input to the network is a scene consisting of multiple frames. This first goes through the top half of the pipeline to estimate the ego-motion. Then the scene and ego-motion are used as input to the 3D tracking module. The output of the 3D tracking module, together with the ego-motion, is used to compute the final velocity.

The Mask R-CNN network operates in two stages. In the first stage, a region proposal network (RPN) proposes 2D candidate bounding boxes. The second stage consists of two networks that work in parallel. The first network uses ROIAlign, a method to extract features from the proposed 2D bounding boxes, and simultaneously performs object classification and bounding box regression. The second network predicts a binary mask for each 2D bounding box, giving all pixels belonging to the object contained within the bounding box.

In this thesis, we will use the Detectron2 (32) implementation. We will use the pre-trained X101-FPN network, since this network has the highest reported average precision. This is a ResNeXt network with a depth of 101 layers and a Feature Pyramid Network (FPN) backbone. The model is trained on the COCO dataset (33).

Each frame in the scene passes through the instance segmentation module, which outputs the segmentation masks. Each segmentation mask gives us the pixels belonging to an object and the category this object belongs to. We use these masks to remove from the frame the pixels belonging to objects that are likely to move.
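To make the masking step concrete, below is a minimal sketch of how it could look with the Detectron2 predictor API. The model zoo configuration corresponds to the X101-FPN Mask R-CNN mentioned above; the score threshold and the set of COCO categories treated as "likely to move" are assumptions made for this illustration, not choices taken from the thesis.

```python
# Minimal sketch (not the thesis implementation): blank out likely-moving objects
# using Detectron2's COCO-pretrained Mask R-CNN X101-FPN.
import cv2
import numpy as np
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

CONFIG = "COCO-InstanceSegmentation/mask_rcnn_X_101_32x8d_FPN_3x.yaml"
MOVABLE_COCO_IDS = {0, 1, 2, 3, 5, 7}  # person, bicycle, car, motorcycle, bus, truck (assumed set)

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(CONFIG))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(CONFIG)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # assumed confidence threshold
predictor = DefaultPredictor(cfg)

def remove_movable_objects(frame_bgr: np.ndarray) -> np.ndarray:
    """Return a copy of the frame with the pixels of likely-moving objects set to zero."""
    instances = predictor(frame_bgr)["instances"].to("cpu")
    movable = np.zeros(frame_bgr.shape[:2], dtype=bool)
    for mask, cls in zip(instances.pred_masks.numpy(), instances.pred_classes.numpy()):
        if int(cls) in MOVABLE_COCO_IDS:
            movable |= mask
    background = frame_bgr.copy()
    background[movable] = 0
    return background

# background_frame = remove_movable_objects(cv2.imread("frame_000000.png"))
```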

3.2 Ego-motion Estimation

The next part of our pipeline is ego-motion estimation. The input to this network is the set of static background frames produced by our instance segmentation module. The ego-motion module is used to estimate the motion of the camera. The motion is then used in the 3D tracking module to estimate future object locations, and in the velocity estimation module to map the locations of the objects to world space.


Figure 3.2: An example of instance segmentation by Mask R-CNN on the KITTI dataset

We decided to use Colmap to estimate camera motion because it is a highly accurate algorithm that is able to obtain good pose estimates from monocular images. Colmap also outperforms online pose estimation methods. Since the main goal of this thesis is to evaluate whether our pipeline can be used for accurate velocity estimation, we want to use the best possible algorithm, even though it cannot be used for real-time estimation. Our modules are designed to be easily replaceable, making it easy to plug in an online method in the future.

Colmap (1)(34) is a general-purpose Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipeline. In our thesis, we will make use of the SfM functionality of Colmap. The pipeline used for the SfM is shown in Figure 3.3 and can be broken down into different segments. The first step of the SfM is to extract features from each frame. These features are then used in the matching step to detect potentially overlapping features between frames. In our thesis we will use sequential matching; this feature matching approach only matches features between n neighbouring frames. To verify that the matches truly point to the same scene point, a geometric verification step is used. In this stage, the SfM tries to estimate a transformation between two frames that maps as many features as possible. If enough features are mapped, the pair is geometrically verified. The output of this step is the set of verified image pairs.

When verified image pairs are found, the reconstruction step can begin. The first step of reconstruction is the initialization of the model by choosing a suitable pair of frames. Once an initial pair is chosen, new frames need to be added to the current model. This is done in the image registration step by solving the Perspective-n-Point (PnP) problem: estimating the relative pose of the frame with respect to the current scene. The next step is to extend the scene by adding new points to it, which is done through triangulation. After these steps, bundle adjustment is used to refine the relative pose parameters and the points added in the previous steps. The reconstruction steps are repeated until convergence is reached.

While creating the 3D reconstruction of the background, Colmap estimates the rotation ($R$) and translation ($\vec{t}$) of each frame. This rotation and translation are given relative to a reference point initialised somewhere in world space.


Figure 3.3: The structure-from-motion pipeline used by Colmap. Source: (1)
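As an illustration of how this SfM stage can be driven, the sketch below runs Colmap's feature extraction, sequential matching and incremental mapping through its command-line interface. The paths are placeholders and the exact option names (in particular the sequential matching overlap) should be checked against the installed Colmap version.

```python
# Illustrative sketch: invoke Colmap's sparse SfM with sequential matching.
# Paths are placeholders; verify option names against your Colmap version.
import subprocess

DB = "scene/colmap.db"
IMAGES = "scene/masked_frames"  # background frames with moving objects removed
SPARSE = "scene/sparse"

# 1. Feature extraction for every frame.
subprocess.run(["colmap", "feature_extractor",
                "--database_path", DB, "--image_path", IMAGES], check=True)

# 2. Sequential matching: only match each frame against a window of neighbouring frames.
subprocess.run(["colmap", "sequential_matcher",
                "--database_path", DB,
                "--SequentialMatching.overlap", "10"], check=True)

# 3. Incremental reconstruction (registration, triangulation, bundle adjustment),
#    producing per-frame rotations and translations, i.e. the ego-motion we need.
subprocess.run(["colmap", "mapper",
                "--database_path", DB, "--image_path", IMAGES,
                "--output_path", SPARSE], check=True)
```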

To make sure the ego-motion can be used as input to our 3D tracking module, we need to process the data. The first processing step is to express the rotation and translation of each frame in the coordinate system of the first frame of the scene. This is achieved with Equation 3.1 for translation and Equation 3.2 for rotation. Here $\vec{t}_1$ and $\vec{t}_n$ are the translations of the first and the $n$-th frame respectively, $R_{p1}$ is the rotation from the reference point to the first frame and $R_{pn}$ the rotation from the reference point to the $n$-th frame.

$$\vec{t}_n = \vec{t}_n - \vec{t}_1 \quad (3.1)$$

$$R_{1n} = R_{p1}^{-1} R_{pn} \quad (3.2)$$

The second processing step is to correctly scale the data. The output from Colmap uses an arbitrary unit of length that we need to convert to meters. This can be done by finding the correct scaling factor for each scene, as shown in Equation 3.3. Here $d(\cdot,\cdot)$ is the Euclidean distance, $gt$ stands for ground truth, $est$ for estimation and $n$ is the number of frames in the scene. The new translation vector is then computed as shown in Equation 3.4.

$$s = \frac{1}{n} \sum_{i=1}^{n} \frac{d(\vec{t}_{gt_1}, \vec{t}_{gt_i})}{d(\vec{t}_{est_1}, \vec{t}_{est_i})} \quad (3.3)$$

$$\vec{t}_n = \vec{t}_n \cdot s \quad (3.4)$$
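A small sketch of these two post-processing steps, expressed with NumPy, is shown below. The array shapes and names are illustrative; the scale factor skips the first frame to avoid the zero displacement of frame 1 with itself.

```python
# Sketch of the post-processing of Colmap's output (Eqs. 3.1-3.4):
# re-express every pose relative to the first frame and rescale to metres.
import numpy as np

def align_to_first_frame(R_p: np.ndarray, t: np.ndarray):
    """R_p: (N, 3, 3) rotations, t: (N, 3) translations w.r.t. Colmap's reference point."""
    R_1n = np.einsum("ij,njk->nik", np.linalg.inv(R_p[0]), R_p)  # Eq. 3.2
    t_1n = t - t[0]                                              # Eq. 3.1
    return R_1n, t_1n

def scale_factor(t_gt: np.ndarray, t_est: np.ndarray) -> float:
    """Average ratio of ground-truth to estimated displacement from frame 1 (Eq. 3.3)."""
    d_gt = np.linalg.norm(t_gt[1:] - t_gt[0], axis=1)
    d_est = np.linalg.norm(t_est[1:] - t_est[0], axis=1)
    return float(np.mean(d_gt / d_est))

# t_metric = t_1n * scale_factor(t_gt, t_est)   # Eq. 3.4
```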

3.3 3D Tracking

The next step of our pipeline is the 3D tracking module. The purpose of the 3D tracking module is to estimate the 3D centers of the objects and track these over time. A 3D object center represents a static point in an object that can be used to estimate the movement of the object in between two frames, which can then be used by our velocity module to compute the velocity of the object.

To perform 3D tracking we will use the mono3DT network (2), one of the few monocular 3D tracking methods. The GPS data expected by the network can easily be replaced by camera motion data from our monocular ego-motion estimation network. This 3D tracking network is an online method, meaning that only past data is used for its estimations. Therefore, our pipeline could easily be updated for online estimation by switching out the ego-motion module.

Figure 3.4: The 3D tracking pipeline used by mono3DT. Source: (2)

Our chosen 3D tracking model is a network that originally takes a scene consisting of RGB frames and GPS data as input. To create a monocular pipeline using this network, we swap the GPS data for the pose estimation output generated by our ego-motion module. The pipeline of the mono3DT network is shown in Figure 3.4.

For each RGB frame, the network uses Faster R-CNN to estimate regions of interest (ROIs) in the form of 2D bounding boxes. The next step is to estimate the 3D box information using an ROI feature vector obtained from a 34-layer DLA-up network (35) using ROIAlign, an operation used to extract features from ROIs. The ROI features are passed through 3-layer 3x3 convolution sub-networks to obtain depth, orientation, dimensions and the projection of the 3D center. From the depth and the projected 3D center, a 3D location is computed. Based on the detected ROIs and the 3D information, tracking can start. For each frame, the goal is to match existing tracks to detected objects, create new tracks, or end an existing track. An existing track is matched to a new detection based on an affinity computed from the overlap between 2D bounding boxes, the overlap between 3D bounding boxes and the similarity between appearance features of existing and new object detections. This is done by first projecting the currently existing tracks forward in time using the camera motion and the object velocity. The object velocity is estimated using two LSTMs with 128-dimensional hidden states. First, a prediction LSTM estimates the velocity based on the previously predicted location and previously updated velocities. The updating LSTM is used to update the 3D location and velocity based on the current and previously predicted location.

The updated velocity used by the LSTM is represented in the network by the difference between the previous and current object location. This velocity is thus not comparable with the velocity estimation done by our final pipeline.

For our pipeline, we will use the lstmoccdeep model, which is the model described and shown to produce the best results in (2). This model is pretrained on the KITTI tracking dataset (36)(17).


3.4 Velocity Estimation

To compute velocity, we first obtain a point within the object to denote its location. This needs to be a point that remains static in object space and that can be tracked across separate frames. We choose this to be the 3D center of the object (treated as the object center), output by the 3D tracking module. It is a point expected to remain static within a small enough margin of variability. The object center is estimated per frame by the 3D tracking module and is therefore in camera space. To be able to estimate the velocity, we need to convert the object center from camera space to world space using Equation 3.5. Once we have the location of the object in two frames in world space, the velocity can be estimated from the distance between those two points and the time between two frames (given by the video frame rate $f_r$), as shown in Equation 3.6.

$$p_w = R^{-1} p_c + T \quad (3.5)$$

$$v = \frac{d(p_{w_1}, p_{w_2})}{1/f_r} \quad (3.6)$$
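A minimal sketch of this final computation, assuming the pose convention of Equation 3.5 (rotation $R$ and translation $T$ of the frame, object center $p_c$ in camera space), could look as follows.

```python
# Sketch of the velocity computation (Eqs. 3.5 and 3.6). Conventions follow the
# equations above; names are illustrative.
import numpy as np

def camera_to_world(p_c: np.ndarray, R: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Eq. 3.5: map a camera-space point to world space for a frame with pose (R, T)."""
    return np.linalg.inv(R) @ p_c + T

def object_velocity(p_c1, R1, T1, p_c2, R2, T2, frame_rate: float) -> float:
    """Eq. 3.6: speed in m/s from the object centers of two consecutive frames."""
    p_w1 = camera_to_world(p_c1, R1, T1)
    p_w2 = camera_to_world(p_c2, R2, T2)
    return float(np.linalg.norm(p_w2 - p_w1) * frame_rate)  # distance / (1 / f_r)

# Example: a 0.5 m displacement between frames at 10 Hz corresponds to 5 m/s (cf. Eq. 4.2).
```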

3.5 Creating a Ground Truth

Currently, there are no datasets that give us ground truth velocities for objects. This means that we needed to compute our own ground truth before doing the experiments. In the KITTI dataset, we have 3D bounding box locations in camera space for each object per frame. We also have the translation and rotation of each camera with respect to the first frame. With this information, we can compute the location of the object in world space using Equation 3.5. The tracking dataset also gives us the tracks of each object in the scene. Using the tracks combined with the 3D locations in world space, the velocity per frame can be computed using Equation 3.6.


4 Experiments

4.1 Velocity Estimation

4.1.1 Datasets

For the evaluation of our velocity estimation method, we used two different datasets. The first dataset is the KITTI vision multi-object tracking dataset (36). All experiments are done using the training set, which is the dataset the 3D tracking model is trained on. We used this dataset for the general evaluation of the model, to discover the weaknesses of our velocity estimation model. To confirm our results we use a second dataset, nuScenes (37). NuScenes is a large-scale autonomous driving dataset consisting of 1000 scenes, each 20 seconds long. In our thesis we focus on velocity estimation in urban areas, and all nuScenes scenes are collected in cities with a lot of traffic. For our experiments, we chose to use the nuScenes mini split.

4.1.2 Velocity Error

We will evaluate the estimated velocity using the mean absolute error (MAE), as shown in Equation 4.1. The MAE gives us the mean error between estimated velocity $x_i$ and ground truth velocity $y_i$, averaged over all observations $n$. This metric shows us how accurate our velocity estimation currently is. For both KITTI and nuScenes we compute the MAE over all objects in the dataset.

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n} \quad (4.1)$$

Dataset     MAE (m/s)
KITTI       3.5176
nuScenes    5.4298

Table 4.1: The mean absolute velocity error taken over all objects in the dataset.


Table 4.1 shows the mean absolute velocity error for both datasets. We see that KITTI performs better than nuScenes. This is probably due to the fact that the 3D tracking model was trained on the KITTI dataset.

The other available monocular velocity estimation work (5) has an average error of 1.12 m/s, which is significantly lower than our reported error. However, this score is not comparable to ours for two reasons. Firstly, the dataset used by (5) is not publicly available and consists of scenes captured on a highway, while our datasets are captured in urban areas. Secondly, our pipeline handles estimation of motion regardless of the object direction, which is reflected in our more complex evaluation set. Their work only handles motion parallel to the camera and evaluates on a dataset with simpler examples falling into this condition.

Despite the higher velocity errors, we still believe that using a full 3D representation of the scene with ego-motion, together with being able to estimate the velocity of a vehicle moving in any direction, is a strong advantage of this pipeline. In the following experiments, we evaluate the pipeline to identify the problems. This gives further insights into which directions could be taken to reach a better score.

4.1.3 Looking Into Different Scenarios

To better understand where the velocity error mainly comes from, we break the velocity estimation up into multiple scenarios. The first scenario we will look at is where both the camera and the object are standing still. This scenario was chosen since it is the easiest scenario, where we know what the outcome should be. Based on these results we can look into the different modules of the pipeline and see where the biggest mistakes are made. Following this experiment, we will evaluate three different scenarios involving the movement of the camera and object. It is insightful to see how these affect the velocity error and which scenario generates the biggest error. This helps us further understand what kind of data generates bigger errors.

We will refer to the different scenarios as follows:

1. Both the object and camera are standing still
2. The object is standing still while the camera is moving
3. The object is moving while the camera is standing still
4. Both the object and camera are moving

In addition to the different scenarios, we will look into the 3D center, obtained by the 3D tracking module, and the ego-motion module. In this experiment, 3D center refers to the z-axis of the 3D object center. The experiment is done to evaluate how each of these modules affects the final velocity error. To show this effect we create a baseline where we compute the velocity error using ground truth data as input for these modules. We use the MAE as stated in Equation 4.1, and we make use only of the dynamic scenes in the nuScenes dataset, as the static scenes in nuScenes were limited exclusively to static frames, making it impossible to estimate the camera trajectories.


Scenario                           KITTI      KITTI       KITTI        nuScenes   nuScenes    nuScenes
                                   baseline   3D center   ego-motion   baseline   3D center   ego-motion
1: static camera, static object    0.2945     2.2786      0.4693       0.9001     4.1549      -
2: moving camera, static object    0.7930     4.0339      2.0180       3.5082     5.3842      2.5425
3: static camera, moving object    0.5354     3.5898      0.6617       0.9141     5.1813      -
4: moving camera, moving object    0.2220     3.6348      0.9863       2.0478     5.7788      2.2245

All values are the MAE in m/s.

Table 4.2: The velocity error for the four different scenarios using only ground truth information, ground truth ego-motion with the estimated 3D center, and ground truth 3D center with estimated ego-motion.

Table 4.2 shows the velocity error for each of the described scenarios. First, we look at the results obtained using ground truth ego-motion and the estimated 3D center. Scenario 1 gives us the lowest velocity error, as expected: when both the camera and object are standing still, this gives us an easy situation with almost no variation in the input. That brings us to scenario 3, another scenario where the camera is still but the object is moving. We see that when movement is involved, the velocity error gets worse. It could be that the movement makes the 3D center estimation harder. The first problem that could affect the 3D center estimation is that when the object is moving away from or towards the camera, the size of the object, and thus of the bounding box, varies. Since the bounding box is given as input to the 3D center estimation network, it could be that when the object is farther away the amount of information contained within the bounding box is smaller. A smaller bounding box contains less data, which could make it harder for the 3D center network to make a correct estimation. A bigger error in the 3D center estimation translates into a bigger velocity error. This is due to the nature of how the velocity is computed (Equation 3.6).

$$f_r = 10\,\mathrm{Hz}, \quad d(p_{w_1}, p_{w_2}) = 0.5\,\mathrm{m}, \quad v_{err} = \frac{0.5}{1/10} = 5\,\mathrm{m/s} \quad (4.2)$$

Take for example the computation in Equation 4.2: an error of 0.5 m in the estimated displacement is magnified into a 5 m/s velocity error.

Scenario 4 gives almost the same error as scenario 3. This is logical because we use ground truth ego-motion data, so the movement of the car is still what mostly affects the velocity error. But something unexpected happens in scenario 2. Since only the camera is moving and the object is still, we would expect the velocity error to be a bit better, or at least around the error of scenarios 3 and 4; instead, it performs the worst out of all scenarios. One possibility is that occlusion is a problem in this scenario. Most objects that fit into this category are objects standing still on the side of the road. For most of the track these objects are partly occluded by other objects. This occlusion can lead to worse 3D center detection and thus again to worse velocity estimation. Occlusion also sometimes happens in other scenarios, but not as much as in scenario 2. To see the effect of occlusion and of the distance of the car on the velocity error, we perform further experiments in Sections 4.3.2 and 4.3.1.

Secondly, we look into the performance of the velocity estimation with ground truth 3D center and estimated ego-motion. As expected, the velocity errors for scenarios 1 and 3 are almost the same as when we only use ground truth information. This is because when the camera is standing still there is no ego-motion, and any mistake made by our ego-motion module would probably be small in this case. In scenarios 2 and 4 we see larger errors, which is logical since this is where the camera is actually moving and we can see the effect of the ego-motion. Again we see a bigger error in scenario 2 than in scenario 4. This could still be due to occlusion.

For the ground truth 3D center we only swapped the value for the z-axis, which represents depth. The x and y axes of the 3D center are still estimated by the 3D tracking module. It could be difficult for the module to correctly estimate these values when the object is partly occluded.

4.2 Model Evaluation

We discovered that the main error of our velocity estimation pipeline comes from two modules: 3D center estimation and ego-motion. To show that the components of our velocity estimation pipeline have reasonable performance, we compare these modules against the current state of the art. With this we show that our current pipeline is limited by the modules that are currently available.

4.2.1 Datasets

Both components will be evaluated on the KITTI vision dataset. KITTI provides data for a variety of different autonomous driving tasks and is used by many state-of-the-art papers to train and evaluate their models. This is why we also make use of KITTI.

For ego-motion evaluation, we will use the odometry dataset (36) that is part of KITTI. This dataset consists of 22 scenes, of which the first 11 are used for training and the others for testing. The goal of this dataset is to estimate the path of the autonomous vehicle by determining the rotation and translation between two frames, which can also be described as ego-motion. To compare our ego-motion module to the state of the art on KITTI, we need to choose a scene that is used in most papers for evaluation. This is why we evaluate our ego-motion module on scene 09 from the KITTI training set.

The evaluation of the 3D center component is done on the training set of the 3D object detection dataset (38), a subset of KITTI. The KITTI tracking dataset is avoided, as the 3D center estimation module has been trained on it.


Method           Translation (%)   Rotation (deg/m)
Ours             1.10              0.0019
D3VO (39)        0.78              -
PMO (40)         1.31              0.0031
Zou et al. (41)  3.49              0.010

Table 4.3: Comparison of our ego-motion module with state-of-the-art monocular ego-motion methods.

4.2.2 Ego-motion Evaluation

For the ego-motion estimation we ran the scene through our ego-motion module (34)(1) using sequential matching. Our ego-motion module first extracts keypoints from each frame. These keypoints are then matched between frames. In sequential data there is only a very small chance that two frames far apart in the sequence have matching keypoints, while frames close to one another are very likely to contain matching keypoints. This means that for a good result we only need to match the frames close to one another.

After running the data through our ego-motion module, we needed to do some corrections to get the data in the right format for evaluation. The first step of this data processing is the same as for our pipeline, as explained in Section 3.2.

Additionally, we need to rotate the coordinate system to make sure the coordinate systems are oriented in the same way. This needs to be done since the ground truth directions are based on GPS data; Colmap is not given this information and thus creates its own orientation of the coordinate system. To correctly rotate the coordinate system we use the Kabsch algorithm, which calculates the optimal rotation matrix between two paired sets of points.
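A minimal sketch of this alignment step is given below; it assumes paired arrays of estimated and ground truth camera positions and returns the rotation that best maps the estimated trajectory onto the ground truth one.

```python
# Sketch of the Kabsch alignment between the Colmap trajectory and the
# GPS-based ground truth coordinate system. Inputs are paired (N, 3) position arrays.
import numpy as np

def kabsch_rotation(P: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Rotation R minimising sum ||R @ p_i - q_i||^2 over centred, paired points."""
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    H = P0.T @ Q0                              # covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

# est_aligned = (kabsch_rotation(est_positions, gt_positions) @ est_positions.T).T
```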

After processing the data we can evaluate the outcome of our ego-motion estimation. The KITTI evaluation metrics are used, as shown in Equations 4.3 and 4.4, where $F$ is a set of frames, $\hat{p}$ and $p$ are the estimated and ground truth camera poses respectively, $\ominus$ is the inverse compositional operator and $\angle[\cdot]$ is the rotation angle. For comparison, we looked at top monocular methods on the KITTI leaderboard: D3VO (39), PMO (40), and the work of Zou et al. (41).

$$E_{rot}(F) = \frac{1}{|F|} \sum_{(i,j) \in F} \angle\left[(\hat{p}_j \ominus \hat{p}_i) \ominus (p_j \ominus p_i)\right] \quad (4.3)$$

$$E_{trans}(F) = \frac{1}{|F|} \sum_{(i,j) \in F} \left\|(\hat{p}_j \ominus \hat{p}_i) \ominus (p_j \ominus p_i)\right\|_2 \quad (4.4)$$

Table 4.3 shows that our ego-motion module outperforms most of these methods, which shows that it produces reasonable results. Figure 4.1 shows the estimated path of our ego-motion module versus the ground truth path. The figure shows that the errors between time steps are fairly small, but over time they accumulate into a bigger error. For our implementation, the accumulated error is not as important, since we only use the difference in motion between two frames. Thus the accumulated error will not lead to bigger velocity errors at the end of the path.

Figure 4.1: Estimated path of our ego-motion module and the ground truth path.
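For illustration, a simplified version of the relative-pose metrics in Equations 4.3 and 4.4 over consecutive frame pairs is sketched below. Note that the official KITTI devkit evaluates relative errors over subsequences of fixed lengths and normalises by path length (giving the % and deg/m units in Table 4.3), so this sketch only approximates that protocol.

```python
# Simplified sketch of Eqs. 4.3 and 4.4 over consecutive frame pairs.
# P_est, P_gt: sequences of 4x4 homogeneous camera poses.
import numpy as np

def relative_pose_errors(P_est, P_gt):
    """Mean rotation error (rad) and mean translation error (m) per frame pair."""
    rot_errs, trans_errs = [], []
    for i in range(len(P_gt) - 1):
        rel_est = np.linalg.inv(P_est[i]) @ P_est[i + 1]   # estimated relative motion
        rel_gt = np.linalg.inv(P_gt[i]) @ P_gt[i + 1]      # ground truth relative motion
        err = np.linalg.inv(rel_gt) @ rel_est              # error between the relative motions
        trans_errs.append(np.linalg.norm(err[:3, 3]))
        cos_angle = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        rot_errs.append(np.arccos(cos_angle))
    return float(np.mean(rot_errs)), float(np.mean(trans_errs))
```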

4.2.3 3D center Estimation Evaluation

In this experiment, we show that our 3D center estimation is currently the best possible result we can get. 3D center estimation is mostly used within 3D object detection methods. The papers focusing on 3D object detection do not report results on their individual components, only on their final 3D bounding box estimation. This does not allow us to directly compare our 3D center estimation to other methods. Therefore, we design an indirect evaluation method.

We compare the accuracy of the estimated object center between our model and DORN (42), a state-of-the-art depth estimation model. To obtain the object center from depth, we first use instance segmentation to get all pixels corresponding to one object; the average depth of all these pixels combined is our 3D center estimate. Having the object centers, we can estimate their accuracy and directly compare it with ours. To perform instance segmentation we use Mask R-CNN (18), a state-of-the-art instance segmentation module. For the 3D center estimation, we only look at the z-axis of the 3D center. The 3D object center is computed as shown in Equation 4.5. Here $\vec{c}_{3D}$ is the 3D center, $c_x, c_y$ are the projected 3D center point locations in the 2D frame, $K$ is the camera intrinsic matrix and $c_z$ is the estimated depth of the 3D center. To obtain the final 3D center, $c_x$ and $c_y$ are scaled by $c_z$. This shows that it is important in our pipeline to have an accurate estimate of $c_z$, since this increases the accuracy of the final 3D center computation.

$$\vec{c}_{3D} = K^{-1} \begin{pmatrix} c_x \\ c_y \\ 1 \end{pmatrix} c_z \quad (4.5)$$

Method    MAE (m)
mono3DT   3.013
DORN      3.757

Table 4.4: Comparison of the mono3DT 3D center estimation to the baseline created with DORN.

Table 4.4 shows that the 3D center estimation currently used in our 3D tracking module outperforms the baseline. It could be that the mono3DT 3D center method performs better when we look at objects from only one side. Take for example a car that we only see from the back in a frame. When using our baseline method, all pixels of the car will give roughly the same depth, which will be the depth of the back of the car and not of the 3D center. This will also be reflected in the final 3D center estimate. On the other hand, mono3DT is trained to compute the 3D center of objects in different situations, which includes seeing the object from only one side. This specific training helps our 3D center module be more accurate in these cases. More accurate 3D center estimates lead to a better final velocity estimation.

Our evaluation shows that our method already produces better object centers than a state-of-the-art depth estimation method, leading us to conclude that the quality of our object centers is at the high end of what is currently available. Therefore, to improve the accuracy of this method and subsequently reduce the velocity error, advancements in this field are necessary.
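To make the object center computation of Equation 4.5 concrete, a small back-projection sketch is given below. The intrinsic matrix values are placeholders, roughly in the range of a KITTI camera, and are not taken from the thesis.

```python
# Sketch of the back-projection in Eq. 4.5: recover the 3D object center (camera space)
# from its projected 2D location (c_x, c_y), its depth c_z and the intrinsics K.
import numpy as np

def backproject_center(c_x: float, c_y: float, c_z: float, K: np.ndarray) -> np.ndarray:
    """3D center in camera coordinates (metres)."""
    pixel_homog = np.array([c_x, c_y, 1.0])
    return c_z * (np.linalg.inv(K) @ pixel_homog)

# Placeholder intrinsics, roughly KITTI-like; replace with the calibration of the scene.
K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
# center_cam = backproject_center(650.0, 180.0, c_z=12.3, K=K)
```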

4.3 Case Study

4.3.1 Effect of Distance to 3D Center Object

In this experiment, we want to evaluate the influence of the distance from the camera to the object 3D center on the velocity error. Our hypothesis is that the further away the object is, the larger the velocity error will be on average. To test this hypothesis we will divide the velocity errors into bins based on their distance to the camera. We chose to use bins with a size of 10 meters each. For each bin, the MAE will be computed.
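The binning used throughout this case study can be summarised with the short sketch below; array names are illustrative.

```python
# Sketch of the per-distance-bin MAE used in this case study (10 m bins).
import numpy as np

def mae_per_distance_bin(distances_m: np.ndarray, v_est: np.ndarray,
                         v_gt: np.ndarray, bin_size: float = 10.0) -> dict:
    """Return {(low, high): MAE} over all observations falling in each distance bin."""
    abs_err = np.abs(v_est - v_gt)
    bin_idx = (distances_m // bin_size).astype(int)
    result = {}
    for b in np.unique(bin_idx):
        low, high = b * bin_size, (b + 1) * bin_size
        result[(low, high)] = float(abs_err[bin_idx == b].mean())
    return result
```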

Figure 4.2 shows us that, when looking into 3D center estimation, the MAE on average increases as the distance increases. This could be related to how the 3D tracking module works. The 3D tracking module takes a bounding box of the object as input. When the distance from the camera to the object is larger, this creates a smaller bounding box. A smaller bounding box contains less data for the module to work with, which could cause the network to make more errors in the 3D center estimation.

Figure 4.2: Bar plot of the MAE for each bin of object distances. Each bin consists of all velocity errors obtained for objects within the given range. This plot was created using the estimated 3D center and ground truth ego-motion.

Figure 4.3: Plot of the MAE for each bin of object distances. Each bin consists of all velocity errors obtained for objects within the given range. This plot was created using estimated ego-motion and ground truth 3D center.

On the other hand, Figure 4.3 shows us that when looking into ego-motion estimation, the average error does not show the same consistently growing trend. One explanation could be that the error due to ego-motion is mainly caused by the translation error. The translation error has the same effect on the object no matter the distance to the camera. The effect of rotation and translation will be further evaluated in Section 4.3.3. Both figures also show an increase in error when the object has a distance smaller than 10 meters. This could be due to the fact that in almost all these cases the objects are partly outside the frame. The 3D center module then only gets part of the car as input, which can make it difficult for the network to understand where the 3D center of the object is located.

While ego-motion remains largely unaffected, we have shown that distance significantly affects 3D object tracking, which in turn affects the velocity estimation quality.

4.3.2 Effect of Occlusion

In this experiment, we explore the effects of occlusion on our velocity estimation model. To show this effect we split the data into two groups, occluded and not occluded. Each group will be divided into different bins based on their distance from the camera. Our hypothesis is that when an object is occluded this will affect the velocity error negatively. We will only perform this experiment for the 3D center module since objects in the image do not affect the ego-motion estimation.

As shown in Figure 4.4, we see that on average the velocity error of occluded objects is higher than that of their non-occluded counterparts. Since we have established the influence of distance, we show occlusion for each distance group, to eliminate the possibility of any imbalance based on distance when comparing the occlusion data. When an object is partly occluded, it can be harder to identify the correct bounding box of the object. Since these bounding boxes are used as input to the 3D center module, it can be harder to make correct estimations. Further, the data given to the network will partly consist of a different object. This extra data that is not part of the actual object could confuse the network and lead to worse 3D center estimations.

Figure 4.4: Plot to show the effect of occlusion on the velocity error. All errors are divided into different bins based on the object's distance to the camera. For each bin, the MAE is shown. This plot was created using the estimated 3D center and ground truth ego-motion.

4.3.3 Effect of Object Movement

Lastly, we want to evaluate how different movements of the camera and object affect the final velocity error. The different movements we experiment on are shown in Figures 4.5 and 4.6. For ego-motion, we look only into the movement of the camera while the object is standing still. For this experiment, we are interested to see how rotational and translational errors in the ego-motion can affect the final estimation. For the 3D center estimation, we look at two categories, one where the camera is standing still and one where the camera moves in a straight line. Then we look into different movements of the object compared to the camera.

Since our research focuses on urban areas, these different kinds of movement appear often. For this evaluation, we take the MAE of these different movements compared against the distance.

In Figure 4.7 we show the MAE scores for different movements of the object, while the camera is still. In this figure we see that when the object is turning the errors are smaller than when the object moves in a straight line compared to the camera.

For the final velocity estimation, it is more important that the errors of the two estimated 3D centers are close together than that each 3D center is estimated correctly on its own. Take the example of both the camera and the object standing still: if both 3D center estimates are identical, even though each is individually wrong, the estimated displacement is still 0. This could explain the differences in errors. When the object is turning it is often moving slowly, so consecutive frames provide very similar inputs to the network, and the network is then more likely to make roughly the same 3D center error in both frames. When an object moves straight away from the camera, on the other hand, the inputs differ more and the 3D center errors can be further apart.
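A small numerical sketch, with made-up numbers chosen purely for illustration, makes this cancellation explicit: if the two consecutive 3D center estimates share the same error, that error cancels in the displacement and the speed is still recovered, whereas independent errors of the same magnitude generally do not cancel.

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 0.1                                      # assumed time between frames (s)
true_centers = np.array([[0.0, 0.0, 20.0],    # object 20 m ahead ...
                         [0.0, 0.0, 20.5]])   # ... moving away at 5 m/s (static camera)

shared_error = rng.normal(scale=0.5, size=3)            # same error in both frames
correlated = true_centers + shared_error
independent = true_centers + rng.normal(scale=0.5, size=(2, 3))

def speed(centers):
    return np.linalg.norm(centers[1] - centers[0]) / dt

print(speed(true_centers))   # 5.0 m/s
print(speed(correlated))     # still 5.0 m/s: the shared error cancels exactly
print(speed(independent))    # can deviate substantially from 5.0 m/s
```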

For horizontal movement of the car, the data is concentrated in the 20-40 m distance range, since this type of motion keeps the distance to the camera fairly constant. For the groups that are available, we observe more accurate velocity estimates than for vertical motion, likely because horizontal motion is much better represented in the image plane.



In Figure 4.8 we show the MAE results for when the camera is moving in a straight line. We see the same differences in error as before when looking at the different object movements. What is new in this figure is that we distinguish between the object moving toward the camera and moving in the same direction as the camera. When the object moves in the same direction as the camera, the change in object location between frames is smaller than the change in camera location. This further supports our point that bigger changes in input between two frames lead to larger velocity errors.

Lastly, we look at different movements of the camera while the object is standing still. These results are shown in Figure 4.9. When the camera makes a turn, the resulting errors are smaller than when it moves in a straight line. When the camera moves in a straight line there is almost no change in rotation between two frames, whereas when the camera is turning the rotation change is larger but the change in translation is smaller. This indicates that the translational error has a bigger impact on the final velocity estimate than the rotational error.
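To give a feel for how each type of pose error propagates, the helper below computes the spurious speed that a change in the ego-motion error between two frames would induce on a perfectly static object. The function and the example magnitudes (an object 20 m ahead, a 5 cm translation error and a 0.1° yaw error at 10 Hz) are illustrative assumptions, not values measured in our evaluation; plugging in the error characteristics of a given odometry method shows which term dominates at a given range.

```python
import numpy as np

def spurious_speed(p_cam, d_yaw_deg, d_trans, dt=0.1):
    """Spurious speed (m/s) induced on a static object at camera-frame position
    `p_cam` when the pose error changes between frames by a yaw rotation of
    `d_yaw_deg` degrees and a translation `d_trans` (3-vector, metres)."""
    theta = np.deg2rad(d_yaw_deg)
    R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                  [0.0,           1.0, 0.0],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
    displaced = R @ np.asarray(p_cam) + np.asarray(d_trans)
    return np.linalg.norm(displaced - np.asarray(p_cam)) / dt

p = np.array([0.0, 0.0, 20.0])                                     # static object 20 m ahead
print(spurious_speed(p, d_yaw_deg=0.0, d_trans=[0.05, 0.0, 0.0]))  # translation-only error
print(spurious_speed(p, d_yaw_deg=0.1, d_trans=[0.0, 0.0, 0.0]))   # rotation-only error
```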


Figure 4.5: Different movements used in the experiments for ego-motion estimation. The arrow shows movement and the dot means that the object or camera is standing still. The green dot represents the object and the red dot the camera.




Figure 4.6: Different movements of the camera and object used in the experiment for 3D center estimation. The arrow shows the movement of the object and camera compared to one another. A single dot means that the object is standing still. Green represents the object and red represents the camera.

Figure 4.7: Plot showing the effect of different object movements, relative to the camera, on the velocity error. These results come from scenes where the camera is static and the object is moving. All errors are divided into bins based on the object's distance to the camera. For each bin, the MAE is shown. For these results we used the estimated 3D center and ground truth ego-motion.



Figure 4.8: Plot showing the effect of different object movements, relative to the camera, on the velocity error. These results come from scenes where both the camera and the object are moving. All errors are divided into bins based on the object's distance to the camera. For each bin, the MAE is shown. For these results we used the estimated 3D center and ground truth ego-motion.

Figure 4.9: Plot showing the effect of different camera movements on the velocity error. These results come from scenes where the object is static and the camera is moving. All errors are divided into bins based on the object's distance to the camera. For each bin, the MAE is shown. For these results we used the ground truth 3D center and estimated ego-motion.


5 Conclusion

In this thesis, we propose a novel pipeline for monocular velocity estimation that explicitly computes velocity from estimated camera and object locations and effectively uses 3D information to make those estimations more accurate. We break the problem of velocity estimation down into instance segmentation for isolating the static background from moving objects, ego-motion estimation using an SfM-based camera pose estimation method, and 3D object tracking to obtain 3D information from our monocular input. With our reconstruction approach we are able to handle any object motion direction. The pipeline achieves an MAE of 3.5 on KITTI and 5.4 on nuScenes. We set out to find the largest sources of error. We evaluated the 3D center and ego-motion estimation and found that their performance is comparable to the best methods currently available; however, small inaccuracies in these components are magnified into large velocity errors, making them the primary sources of error. Furthermore, we evaluated our pipeline in various conditions and showed that the 3D tracking module is affected by distance, occlusion and large object motion, while the velocity estimation is also affected by the translation estimated by our ego-motion module. Advancements in these components in the aforementioned aspects can further improve our pipeline's velocity accuracy.

5.1 Future Work

In urban scenarios there are also many cyclists and pedestrians on the road. It would be interesting to extend the pipeline to predict the velocity of these objects as well. This brings new challenges for using the 3D center; for instance, pedestrians are not rigid bodies, which means their 3D center can change over time.

In this thesis, we scaled the ego-motion to obtain the translation in meters. Currently, the ground truth ego-motion is used to compute this scale. To remove this dependence on ground truth data, other solutions to this problem need to be explored; one possibility is to use a state-of-the-art depth estimation network.
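As an illustration of how such a network could be used, the sketch below recovers a metric scale factor by comparing the (up-to-scale) depths of reconstructed SfM points with the metric depths predicted by a monocular depth network at the same pixels. The inputs and the median-ratio strategy are hypothetical placeholders for one possible design, not part of our current pipeline.

```python
import numpy as np

def estimate_metric_scale(sfm_depths, predicted_depths):
    """Factor that converts up-to-scale SfM depths (and translations) to metres.

    `sfm_depths`: depths of matched 3D points in the arbitrary SfM scale.
    `predicted_depths`: metric depths of the same pixels from a monocular
    depth network. Both are 1D arrays over matched points (assumed inputs).
    The median of the per-point ratios keeps the estimate robust to outliers.
    """
    sfm_depths = np.asarray(sfm_depths, dtype=float)
    predicted_depths = np.asarray(predicted_depths, dtype=float)
    valid = (sfm_depths > 0) & (predicted_depths > 0)
    return float(np.median(predicted_depths[valid] / sfm_depths[valid]))

# Usage sketch: scale the estimated (up-to-scale) camera translation to metres.
# scale = estimate_metric_scale(sfm_point_depths, network_depths)
# t_metric = scale * t_up_to_scale
```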


