
Monocular Depth Prediction

Layout: typeset by the author using LaTeX.

Monocular Depth Prediction

Single view depth map prediction using RetinaNet backbones

Sander Kohnstamm
10715363

Bachelor thesis
Credits: 18 EC
Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Dr. Elia Bruni
Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 907
1098 XG Amsterdam

Abstract

In this paper, a model is proposed for the prediction of a depth map from a monocular image input for agricultural purposes. The proposed model utilises backbones similar to RetinaNet, which is used for object detection, with a custom depth prediction output module. These backbones consist of Residual Neural Networks and a Feature Pyramid Network (FPN). The model is specifically designed to interfere as little as possible with the object detection performed by the same model. The evaluation is achieved by finding a baseline network and comparing it visually and numerically to the proposed network. By incorporating the different output layers of the FPN, a promising amount of detail is preserved while producing results almost as good as those of the top-performing baseline network.

Acknowledgements

I would like to thank Klaus Ondrag and Dr. Elia Bruni for providing the framework of this project and for answering all my questions, and Hidde Jessen for providing some key insights regarding the implementation.


Contents

1 Introduction
2 Background
   2.1 Problem Setup
      2.1.1 Hardware
      2.1.2 Data sets
   2.2 Object Detection
      2.2.1 RetinaNet
      2.2.2 ResNets
      2.2.3 Feature Pyramid Networks
      2.2.4 Software implementation
3 Related Work
   3.1 Deep Convolutional Neural Networks
   3.2 Data
   3.3 Loss functions
4 Method
   4.1 Loss and data implementation
   4.2 Baseline algorithm search
      4.2.1 Search and evaluation
   4.3 Proposed network
5 Experiments
   5.1 Hyperparameters
   5.2 Baseline model suggestions
      5.2.1 Segmentation Models
      5.2.2 Object detection Models
   5.3 Proposed Model
6 Results
   6.1 Baseline algorithm search
   6.2 Final results
7 Conclusion
   7.1 Discussion
      7.1.1 Future work


Chapter 1

Introduction

Similar to many fields, the agriculture industry is adopting different modern techniques to improve its performance. In this industry, the main goal of this adaptation is the improvement of its yields [3]. The automation of harvesting crops is a very lucrative adaptation given the difficulty and high labour costs of picking individual crops. This applies especially to high-value crops. These are crops which require harvesting techniques that are more labour intensive than average because they require more precision and care to maintain the structural integrity of the produce. Among others, these include fruits, vegetables and ornamentals.

This paper concerns the robotisation of the picking of apples or removal of apple blossoms. The system used for the project under which this paper is written includes an RGB-D camera and a picking mechanism. As the weather conditions on an apple orchard vary, so do the lighting conditions for the camera. Because the computer vision system is expected to handle all the different lighting conditions, it has to be very consistent. One way to achieve this is to utilise the depth sensor on the camera. An additional depth channel is known to improve different computer vision tasks [19]. However, the depth sensor on the camera used for this project can be inconsistent itself. That is why the focus of this paper is on the enhancement of the accuracy and the consistency of the depth sensor within the camera, by predicting a second depth channel using only the RGB data.

Depth estimation or prediction from image data is an important structural problem in the field of computer vision. This problem is often separated into the categories of depth estimation on either stereoscopic images, multiple frames (video) or a single still image. The latter, Monocular Depth Prediction (MDP), is a very active field of research but with slow development [7]. Several studies have proposed different approaches to this category. Early papers utilised Markov and Conditional Random Fields to regulate the depth maps and handcrafted features, matching input images to images in a data set and mapping their depth channel to the input. Most recent attempts address the MDP problem with deep convolutional neural networks originally designed for image classification and object detection. However, these solutions incorporate different convolution and pooling steps, reducing the output resolution and the precision of the edges of objects. On the other hand, different network architectures such as feature pyramids often utilise a decoder after detecting objects in a picture. This decoder combines the low-resolution but dense information output of the object detection with localised information in the output of previous layers.

The first part of the data set used for this paper is an external data set from the University of Lleida, Spain. This is a data set created for the picking of apples. The second part of the data set is obtained with the camera used for the detection and picking of apple flowers. The data sets have different lighting and situational conditions, but both contain a depth channel next to their three red, green and blue channels. This depth channel is taken as ground truth and the RGB images as inputs for the model. These are the data sets used for the training and testing of our models.

The current state of the art of the project utilises a RetinaNet [22] model implementation for its object detection. This is a model based on a ResNet and a Feature Pyramid Network backbone by researchers at FAIR. It utilises a focal loss function to drastically reduce the number of regions of interest in an image and has achieved very respectable results while keeping computational costs to a minimum.

It would significantly reduce computational costs to have an image only pass a substantial part of the network once for both the depth prediction and the object detection. This poses the question: ’Are the backbones of the RetinaNet suited for monocular depth prediction?’ To evaluate this question, the prediction of a depth channel from a still RGB image is approached as a machine learning problem similar to object detection.

Different types of models with varying network architectures are proposed, evaluated and compared to obtain the highest-scoring depth prediction as a baseline. These models include networks based on U-Net and ResNet backbones, covering a biomedical image segmentation model, a semantic segmentation model and an object detection model with upsampling. The models are evaluated both with pre-trained weights and trained from scratch on our data sets.

Once a baseline model is selected, a model is proposed based on the same architecture as RetinaNet. This is the same model that is used for object detection, and it utilises the same backbone with the same pre-trained weight states. Only a custom head module separates the object detection from the depth prediction. This RetinaNet depth prediction model provides promising results by nearly matching those of the selected baseline model.


Chapter 2

Background

Before the implementation of the actual network model of this paper, an overview of the agricultural project, its different components and the current state of the art of its object detection is provided in this section.

2.1 Problem Setup

2.1.1 Hardware

The project for which this paper is written is situated on an agricultural site for the production of apples and flowers. On this site, an autonomously steering robot on wheels drives through the orchard to pick the apples and remove the flowers. As its computer vision input system, this robot is fitted with an Intel RealSense D415. This is a small, 90x20x23 mm, consumer camera which produces RGB channels at a resolution of 1920 x 1080 pixels and contains a depth sensor producing a fourth channel covering the same view at a lower resolution of 1280 x 720 pixels. After the classification of the objects of interest utilising the RGB channels, the depth channel is overlaid to produce a 3D map of the apples or flowers for the robot to correctly position its picking mechanism.

The depth sensor on this camera operates by producing very small laser beams and observing their reflection. This results in the relatively low resolution of 1280 x 720 pixels and an accuracy of around 2 per cent. With a lower resolution and accuracy, difficulties arise when differentiating individual objects of interest if they reside further away or in clusters; the latter is often the case with both flowers and apples.

2.1.2 Data sets

To train a machine learning model to predict a depth channel, images taken with the RealSense camera and images from an external source from a different orchard are used. Next to their red, green and blue channels, both sets of images contain a fourth depth channel with a lower resolution, which is upscaled so that it has the same height and width as the other channels. This depth channel is viewed as the ground truth for the depth prediction. Two data sets are provided for this task: one featuring apples and one featuring apple flowers.

Figure 2.1: First data set example image.

Figure 2.2: First data set example ground truth.

Figure 2.3: Second data set example image.

Figure 2.4: Second data set example ground truth.

The images taken at the apple orchard are from the University of Lleida, Spain and their KFuji RGB-DS database [10, 9]. This data set is available at [14]. The images have been produced with a Microsoft Kinect V2 in a dark environment under the same specific lighting conditions for each image, resulting in as little variance as possible. This data set contains 967 multi-modal images with their respective depth channels, split into a train set of 619 images, a validation set of 155 images and a test set of 193 images. Example images are shown in Figure 2.1 with their ground truth in Figure 2.2.

The second data set is produced on the orchard of the project itself. These images were taken in daylight with more varying lighting and weather conditions, which is desirable for producing a more condition-invariant model. This data set contains 267 pictures. These images were taken with the D415 camera and are shown in Figure 2.3 with their ground truth in Figure 2.4.

2.2 Object Detection

The object detection task in the current state of the art of this project is performed by neural networks. Many different neural networks have been optimised to become suitable for object detection tasks. These networks began with the first backpropagation algorithms in the 1990s [26] and became more widely adopted when large-scale data sets, such as ImageNet in 2009 [6], and more progressive computing power became available. Together with more research into network architecture also came Deep Neural Networks (DNNs), which utilise multiple hidden layers for more complex learning. One of the most notable DNNs is the Convolutional Neural Network (CNN).

Instead of a more traditional feature extractor paired with a classification algorithm, the layers of CNNs, also called feature maps, combine multiple values from the previous layer, or from the input if it is the first layer. The connections between these layers allow for different kinds of transformations of the data, such as pooling or filtering, before resulting in a final layer with a certain activation function to obtain the desired output [30]. This way the operation of the visual system of a brain is simulated to obtain a higher level of complexity, allowing the model to recognise patterns and complex hierarchical features.

Figure 2.5: A region-based CNN (R-CNN) showing a two-step process of region extraction and object detection [30].

To find an object in an image, a detector has a few options. These options include:

• Have a sliding window with a feature extractor and classification algorithm move over an image. This was one of the earliest versions of object detectors.


• Have a large set of pre-selected regions of various scales, sizes and orientations. This way only one pass over the image is needed, hence the name One-Stage detector.

• Have a region proposal network propose a set of regions of interest, producing a relatively sparse subset of regions. This process is named a Two-Stage detector.

Either of the last two processes can be combined with a CNN to perform region-based classification. This way an object can be found, classified and boxed in an image, as in the two-stage detector shown in Figure 2.5. Two-stage detectors have historically been the dominant object detectors when it comes to accuracy and speed, with the latter only recently being matched by one-stage detectors fine-tuned to focus on speed, such as YOLO and SSD [22]; these only pass over a number of regions in the order of hundreds, instead of millions, vastly degrading their accuracy. This is a well-known problem with one-stage detectors: they either require a large set of candidate regions, most of which are background regions of no interest, impacting their speed, or a small set of regions, resulting in lower accuracy.

2.2.1 RetinaNet

One of the latest and most notable networks using a one-stage detector is RetinaNet from the researchers at Facebook AI Research (FAIR). This model aims to regard a large number of possible regions of interest while maintaining the speed of the mentioned one-stage detectors by vastly reducing the class imbalance between foreground and background regions in an image [22]. This imbalance is a natural consequence of the low number of objects in an image. Most of the regions in an image do not contain any part of an object and therefore contribute a negligible amount to the learning process. To minimise the amount that non-informative regions contribute to the learning process, they utilise Focal Loss.

CE(p_t) = -\log(p_t)    (2.1)

FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t)    (2.2)

Focal Loss is their adaptation of Cross-Entropy Loss with a variable γ, implemented as shown in equation 2.2, in contrast to the regular loss function in equation 2.1. This way, the influence of non-informative regions on the model can be limited while the loss of the more important regions is maintained. The network's backbone consists of two parts. Firstly, it utilises a (pre-trained) ResNet for deep feature extraction and, secondly, a Feature Pyramid Network to provide a multi-scale feature pyramid. The following sections explain the workings of these parts.
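Before moving on to these parts, a minimal sketch of the focal loss of equation 2.2 is given below, written in PyTorch for a binary foreground/background target. The function name, the value of γ and the tensor layout are illustrative assumptions and are not taken from the RetinaNet implementation used in this project.

import torch

def focal_loss(probs, targets, gamma=2.0, eps=1e-6):
    # probs: predicted probabilities in [0, 1]; targets: binary labels of the same shape.
    probs = probs.clamp(eps, 1.0 - eps)
    # p_t is the probability the model assigns to the true class.
    p_t = torch.where(targets == 1, probs, 1.0 - probs)
    # The (1 - p_t)^gamma factor of equation 2.2 down-weights easy, well-classified regions.
    return (-(1.0 - p_t) ** gamma * torch.log(p_t)).mean()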

2.2.2 ResNets

It has been shown that complex continuous functions can be approximated by expanding a single hidden layer, according to the universal approximation theorem [5]. When operating on machine learning tasks, however, single-layer systems soon lead to problems such as overfitting. Networks can be expanded to incorporate more hidden layers to combat this problem. Adding more layers to networks has shown great benefits [18], but the performance of deeper networks is difficult to maintain. When back-propagating through a very deep network, the gradient can become vanishingly small, saturating the network's accuracy and reducing its ability to learn, as shown by the error rates in Figure 2.6.

Figure 2.6: Training (left) and testing (right) error rates for deep neural networks [11].

Figure 2.7: Building block of a ResNet with an identity bypass [11].

Deep Residual Neural Networks, or ResNets [11], have been an important development in computer vision since their research paper came out in 2015. This architecture allows for the exploration of networks with much deeper structures without the downside of losing the effect of their gradient functions. The researchers achieved this by building the network out of blocks of layers and activation functions, shown in Figure 2.7, each of which can also be skipped by the gradient function through an identity bypass. When stacking these building blocks, the gradient can in theory never be reduced to zero. This maintains the learning capabilities when more layers are stacked onto the network.
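As an illustration of the building block in Figure 2.7, a minimal residual block could be written in PyTorch as below; the channel count and kernel sizes are assumptions, and the real ResNet blocks also come in downsampling variants.

import torch.nn as nn

class ResidualBlock(nn.Module):
    # Two 3x3 convolutions whose output is added to the identity bypass.
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                       # the bypass carries the signal (and gradient) unchanged
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)   # skip connection of Figure 2.7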

2.2.3 Feature Pyramid Networks

As objects can occur at different locations and in different sizes in images, it is an essential objective to have the object detection network be scale-invariant. Historically, attempts were made to solve this problem by processing multiple versions of an image with varying scales all at once. These networks were called feature image pyramids and were excessively computationally expensive [1].

A core part of the RetinaNet is the Feature Pyramid Network (FPN) [21]. This is a network also proposed by researchers at FAIR. In that paper, attempts are made to solve the problem of the excessive computational demand of a network while preserving its scale invariance. Other architectures that have been proposed for feature extraction often neglect one of these important qualities. These architectures include a single feature map of convolutional layers, which suffers the loss of a large part of the scale invariance. A Pyramidal Feature Hierarchy has also been proposed. This is a network which does provide predictions at different scales, but only once after each layer. Therefore, there is no mechanism to recover the original resolution in later predictions.

Figure 2.8: Feature Pyramid Network representation [21].

The FPN from FAIR works by composing bottom-up and top-down structures with lateral connections between the two [21]. The first, bottom-up, part usually consists of a few convolutional layers from the ResNet architecture. After these convolutional layers, the width and height dimensions have been compressed into the third dimension. 1x1 convolutional and upsampling layers make up the top-down pathway. These layers incorporate the lateral inputs from the bottom-up part to recover the spatial dimensions and the resolution. This process results in multiple different prediction layers, each with its own preferred scale. This is shown in Figure 2.8.
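A single merge step of the top-down pathway with a lateral connection, as described above, could be sketched as follows; the channel numbers are assumptions and the actual FPN repeats this step over several pyramid levels.

import torch.nn as nn
import torch.nn.functional as F

class FPNLevel(nn.Module):
    # Merge one bottom-up feature map into the top-down pathway.
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, bottom_up, top_down):
        # Upsample the coarser top-down map to the resolution of the lateral input.
        top_down = F.interpolate(top_down, size=bottom_up.shape[-2:], mode="nearest")
        # 1x1 lateral connection followed by element-wise addition.
        merged = self.lateral(bottom_up) + top_down
        return self.smooth(merged)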

2.2.4 Software implementation

The current state of the art of the object detection for the apples and flowers is implemented as a git fork of a PyTorch implementation of RetinaNet. This is most easily accessible by utilising a Python notebook, for instance in Google Colab. Google Colab has free-to-use GPU accessibility, which pairs well with PyTorch network implementations for faster training.

The PyTorch machine learning package utilises dataset modules for its data generation for batch training. Within these modules, one has the means to modify the data before training. These modifications include transformations, such as normalisation, and simple if statements.
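A dataset module of the kind described here could be sketched as follows; the in-memory layout of the images and the class and argument names are assumptions for illustration, not the actual implementation used in this project.

import torch
from torch.utils.data import Dataset

class RGBDepthDataset(Dataset):
    # Pairs an RGB image with its depth channel as the training target.
    def __init__(self, rgb_images, depth_maps, transform=None):
        self.rgb_images = rgb_images      # list of HxWx3 arrays (hypothetical layout)
        self.depth_maps = depth_maps      # list of HxW arrays, upscaled to the RGB resolution
        self.transform = transform

    def __len__(self):
        return len(self.rgb_images)

    def __getitem__(self, idx):
        rgb = torch.as_tensor(self.rgb_images[idx]).permute(2, 0, 1).float() / 255.0
        depth = torch.as_tensor(self.depth_maps[idx]).unsqueeze(0).float()
        if self.transform is not None:
            rgb = self.transform(rgb)     # e.g. normalisation
        return rgb, depth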

Transfer Learning

Since DNNs became the standard for recognising patterns and objects in images, vast amounts of different data sets have been trained and worked with [30]. The training on these data sets always results in a certain combination of weights in a neural network. These weights essentially form an efficient information repository on all the different patterns and objects in the images of that data set. If the data set was varied enough, this is almost always useful information for new projects, regardless of the type of objects that are to be detected. One can take advantage of this with transfer learning.

Transfer learning is the practice of importing pre-trained weights for the backbones of the model to save time, expand the network's ability to detect patterns and make it more robust to variance in input scenarios. These weights have often been trained on very large data sets, such as ImageNet [6], and therefore contain a lot of information on many different objects. This paper utilises weights from previous training on ImageNet for the different models. There are two options when transfer learning is applied. One can either import the weights, attach a fully connected head and only train the head while freezing the rest of the model; we will call this process partial training. Or the entire model can be slightly trained further for the specific purpose of the project; we will call this process full training.
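The difference between partial and full training can be expressed by freezing parameters; a minimal sketch, assuming a torchvision ResNet backbone and a hypothetical single-output head rather than the heads actually used in this project:

import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50(pretrained=True)     # weights obtained from ImageNet training
head = nn.Linear(backbone.fc.in_features, 1)    # hypothetical task-specific head

# Partial training: freeze the imported weights and optimise only the new head.
for param in backbone.parameters():
    param.requires_grad = False
backbone.fc = head                              # the new head keeps requires_grad = True

# Full training: leave requires_grad = True for all parameters and fine-tune
# the entire model with a small learning rate on the project data sets.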


Chapter 3

Related Work

The problem of depth prediction has historically been split by the various available forms of input. These forms can be multiple images, stereo images or monocular views [28]. While most of the work has been done on stereoscopic depth prediction [17, 2], the prediction of a depth channel from a single RGB image has also been attempted in various ways. However, it remains a very difficult problem to solve. Early research incorporated geometric and visual cues in the image, depending heavily on colour, texture or handcrafted features and therefore only suitable for specific environments [13, 15]. These solutions often incorporated some form of machine learning, starting with Markov Random Fields [27].

3.1 Deep Convolutional Neural Networks

Recently, however, deep convolutional neural networks (CNNs) have gained a lot of attention in the field of computer vision in general, including for MDP [23]. Even though most solutions share the utilisation of some sort of ground truth depth map and of CNNs, their respective implementations vary on many levels. Some research proposes a fully convolutional network [19]. This was achieved by replacing the fully connected layer of the ResNet50 model with a novel up-sampling block, yielding a network which is light to use with very acceptable accuracy. Others adapt the input of the model by embedding the focal length setting of the camera [12]. This way, additional information in the image, such as fuzziness, in combination with the parameters intrinsic to the camera, can be utilised to improve depth prediction.

There are also solutions which utilise CNNs on sparsely labelled ground truths. This can be for the completion of the sparse depth map, proposing novel convolutional modules to improve performance on sparsely sampled depth maps [29]. This results in a network which is resistant to multiple levels of sparsity in the ground truth. It has also been shown that a very sparse depth map can vastly improve the accuracy of a regular MDP network consisting of a ResNet50 backbone, where 50 represents the number of residual blocks as shown in Figure 2.7, and fully convolutional upsampling layers [24].

3.2 Data

As MDP is not as active a field as stereoscopic depth prediction, data sparsity can be an issue. Some papers attempt to solve this problem by shifting the focus more towards the accumulation of training data. This can be done by utilising the internet as a source, using multi-view stereo and structure-from-motion methods to complete a depth map, and post-processing to remove noise and outliers [20]. Others combine existing and new data with their own captured images with depth mapping to provide a new state-of-the-art benchmarking suite [8].

3.3 Loss functions

A notable attempt at depth prediction was made with a different loss function [7]. By casting the depth estimation as an ordinal regression problem and by using an ordinal loss, the strong ordinal correlation between depth values could be incorporated.

MSE = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2    (3.1)

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - x_i| = \frac{1}{n} \sum_{i=1}^{n} |e_i|    (3.2)

In general, however, the Mean Squared Error (MSE) loss between the ground truth and the predicted image constitutes a large part of the loss functions utilised for back-propagation in papers concerning MDP. MSE is shown in equation 3.1. Otherwise, for algorithms predicting feature vector outputs, a mean absolute error (MAE) can also be utilised. This loss function is shown in equation 3.2 and will be referred to as L1Loss due to its PyTorch implementation. In both equations, x_i and y_i are the true values and the predictions respectively.
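In PyTorch both losses are available as built-in modules; a minimal usage sketch with hypothetical tensor shapes:

import torch
import torch.nn as nn

prediction = torch.rand(1, 1, 256, 256)   # hypothetical predicted depth map
target = torch.rand(1, 1, 256, 256)       # hypothetical ground-truth depth map

mse = nn.MSELoss()(prediction, target)    # equation 3.1
mae = nn.L1Loss()(prediction, target)     # equation 3.2, "L1Loss" in PyTorch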


Chapter 4

Method

A neural network will be proposed for the prediction of a depth channel from a single monocular image. In this section, the search for a baseline algorithm to compare it to and the subsequent evaluation are explained, as well as the proposed model itself.

4.1 Loss and data implementation

Depth sensors on camera’s have a finite range. Anything above this range can not be measured accurately anymore. Therefore the solutions most camera’s imple-ment is to set all the distances above a certain threshold to zero. This is also the case for the two sets of depth data from this paper. This data is not beneficial towards the learning process and that is why a adaptation of the loss function is implemented. This adaptation sets all the losses on the same locations of zero marked depth, to zero. This way the learning process is not affected by any irrel-evant locations.

Figure 4.1: Second data set example image.

Figure 4.2: Second data set example ground truth after modification.

The D415 camera from the second data set can detect depth up to a much further range than the other data set provides. This is also much further than the objects of interest lie, which is close to 1 metre away from the point of view, as an orchard is positioned in rows. As the values further away are much higher, they would also impact the training of the models by far the most, even though it is non-usable information. Therefore, a modification to the data set is made. This modification sets all values in the ground truth higher than 2000 to zero. The number 2000 was chosen by checking visually whether all objects of interest were incorporated and unnecessary background information was cancelled out. Setting the value to zero is an adequate solution because of the adaptation explained in the previous paragraph: the loss for regions with a depth of zero will not be accounted for. This process resulted in the second data set as shown in Figure 4.2.
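This modification of the ground truth can be expressed as a small transform applied inside the dataset module; a sketch, assuming the depth map is a PyTorch tensor and using the threshold of 2000 mentioned above:

def clip_far_depth(depth, threshold=2000):
    # Values beyond the threshold are set to zero so that the masked loss ignores them.
    depth = depth.clone()
    depth[depth > threshold] = 0
    return depth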

4.2 Baseline algorithm search

For the proposal of a neural network for MDP, a network will first be implemented to serve as a baseline. Afterwards, the performance of the proposed network will be judged against this baseline. These networks will be searched for online within some limits of the current project. These limits include that the network must:

• Be written in Python with the PyTorch machine learning package for easy implementation.

• Accept a 3-channel (RGB) input and provide a 1-channel output with the same width and height.

• Not reduce the resolution of the output by an unacceptable amount. Some resolution loss is expected.

During the search, different adaptations of a Google Colab notebook will be used to implement, test and evaluate the different networks. These notebooks all contain the structural components for the evaluation of the networks. These components mainly include a PyTorch data set and data loader, the model or an import of the model, some adjustments to the model if necessary, visualisation and monitoring functions, transformation functions, a train function and a test function. An implementation of Google Tensorboard is utilised for the visualisation of interim training and validation results.


4.2.1 Search and evaluation

Both the baseline and the proposed network will be evaluated on different levels. First and foremost, a visual inspection will be performed to see whether the output of the network comes close to the intended labelling. This part is necessary because these networks are often modified networks that were originally intended for different purposes. A visual inspection quickly confirms the network's output resolution and whether it is improving during training. These visual inspections will also be performed on a set of our own images taken inside the house. These do not have a depth channel label attached, but the output will be visually interpretable by a person to verify whether the network is suitable for multiple conditions.

The visual aspect of the output is an important component of its validity; however, to truly judge whether one output is more desirable than another, numeric values are preferable. This will be done by comparing the outputs of the loss functions of the different networks to each other within TensorBoard. Different loss functions were considered and implemented, but as most other papers utilise MSE, the choice was made to compare MSE outcomes for the algorithm search. This way, a top baseline model is selected for the comparison to the proposed model.

Models that we’re available with pre-trained parts are also tested using trans-fer learning. This is the practice of importing pre-trained weight states for the backbones of the model to save time, expand the networks ability to detect pat-terns and make it more robust to variance in input scenarios. Afterwards only a few layers as mentioned in section 2.2.4 are modified or added. In this case, the last layers mainly consist of convolutional layers, batch normalisation layers and ReLU activation functions in order to achieve the correct dimensions for the output. Using transfer learning then either only these latter layers or all the layers are optimised with the current data sets.

4.3 Proposed network

Implementation

The implementation of the proposed network is also done within a Python notebook. To reduce the computational cost of running the depth prediction side by side with the object detection, it was important to retain as much of the original structure of RetinaNet as possible. Therefore, the entire structure of RetinaNet was copied over to the notebook, and different positions within the model were evaluated for inserting a custom depth prediction module (DepthModule). This DepthModule is shown in Figure 4.3.

Figure 4.3: Depth Module representation.

As is shown in Figure 4.3, the module takes as input the different output layers from the FPN, concatenates them into one large feature vector of 1280 channels and performs an upsampling to the required output dimensions. Afterwards, it reduces the channels with 3 deconvolutional layers with batch normalisation and activation layers in between. With the implementation of the DepthModule, transfer learning is also applied.
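A rough sketch of such a module is given below. The concatenation to 1280 channels, the upsampling and the three deconvolutional layers follow the description above, but the specific kernel sizes, channel reductions and the exact ordering of the upsampling are assumptions, not the project implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthModule(nn.Module):
    # Combine the FPN output layers into a single one-channel depth map.
    def __init__(self, in_channels=1280, out_size=(256, 256)):
        super().__init__()
        self.out_size = out_size
        # Reduce the concatenated FPN channels down to one depth channel.
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, fpn_outputs):
        # Upsample every pyramid level to the output size and concatenate along the channels.
        upsampled = [F.interpolate(f, size=self.out_size, mode="bilinear",
                                   align_corners=False) for f in fpn_outputs]
        features = torch.cat(upsampled, dim=1)   # e.g. 5 levels x 256 channels = 1280
        return self.deconv(features)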

Evaluation

After the network with the best interim evaluation results is found, its implementation is finalised in a final Google Colab notebook together with the proposed algorithm. This notebook has the same basic structural components as the previous ones. Here, both algorithms will be evaluated for the final results. These will again be compared to each other visually and numerically. However, for a more general overview of their performance, the baseline algorithm and the proposed network will also be trained and evaluated with the L1Loss mentioned in section 3.3. From this comparison, it will become clearer whether the proposed network is capable of depth prediction and of the eventual detection of apples or flowers.

Chapter 5

Experiments

All the models that were tested had universal and individual complications that had to be overcome to implement and evaluate them. In this section, the process for each model and its limitations is briefly explained. This summary includes the proposed model.

5.1 Hyperparameters

For all the examined models, different combinations of hyperparameters were tested. The learning rate, optimiser, step size and gamma had similar optimal settings. Therefore, to increase consistency, these were set to 0.0001, Adam, 7 and 0.1 respectively. These values were eventually implemented in both the baseline search and the proposed model testing.
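With these settings, the optimiser and learning-rate schedule can be set up as follows; the stand-in model is only there to make the sketch self-contained and does not represent any of the networks under test.

import torch.nn as nn
import torch.optim as optim

model = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # stand-in for the network under test

optimizer = optim.Adam(model.parameters(), lr=0.0001)                     # learning rate 0.0001
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)  # step size 7, gamma 0.1

for epoch in range(17):      # full training sessions were set to 17 epochs (section 5.1)
    # one epoch of training and validation would run here
    scheduler.step()         # decay the learning rate by a factor of 0.1 every 7 epochs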

Figure 5.1: Training sessions for the baseline algorithm search with loss on the y-axis and epochs on the x-axis.

Figure 5.2: Training sessions for the proposed model with loss on the y-axis and epochs on the x-axis.

Some of the training sessions for the baseline algorithm and the proposed model are shown in Figures 5.1 and 5.2, where the loss and the epochs are on the y-axis and the x-axis respectively. These sessions will be explained further in the next paragraphs, but as is visible from these figures, most losses converged between 10 and 13 epochs. Therefore, most of the full training sessions were set to 17 epochs. During training, the best model at the best-scoring epoch was kept and returned at the end of training.

5.2 Baseline model suggestions

As the depth prediction problem requires a feature vector as output, most basic object detection algorithms were not fit for use. They often output the classification of different objects and their respective locations in the input image. On the other hand, there are image segmentation models. These often output a one-hot vector with the same height and width as the input image, with segments of objects or regions of interest being represented by a one and the rest by a zero.

5.2.1 Segmentation Models

The examined segmentation models include DeepLabV3 [4], Fully Convolutional DenseNet [16] and U-Net [25]. These are two semantic and one medical image segmentation model respectively. DenseNet had to be imported from GitHub, but the other two were available as a download from the PyTorch hub, both with pre-trained and non-pre-trained versions. As the outputs of the DeepLabV3 and U-Net implementations are either zero or one, a small head module was constructed to achieve more variation in the output. This module consists of two convolutional layers with ReLU activation functions and a batch normalisation layer in between. The PyTorch DenseNet model used for this paper was already adapted for depth prediction and did not need further adaptation.
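Such a head module might look roughly as follows; only the layer types follow the description above, while the channel counts are assumptions for illustration.

import torch.nn as nn

# Head module appended to the segmentation networks: two convolutions with ReLU
# activations and a batch normalisation layer in between, producing a single
# continuous-valued depth channel. The input of 21 class channels is an assumption.
depth_head = nn.Sequential(
    nn.Conv2d(21, 32, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.BatchNorm2d(32),
    nn.Conv2d(32, 1, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)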

After the implementation in the respective notebooks, both the pre-trained and non-pre-trained versions of the models were trained, where available. This was done on both data sets, utilising the different variations on hyperparameters mentioned in section 5.1. Where pre-trained models were available, the same transfer learning procedures as mentioned in section 2.2.4 were attempted. For each of these models, the top-performing combination was carried forward.

5.2.2 Object detection Models

As for models with image classification backbones, 2 different implementations of ResNets with decoders were examined. The output of these models is, instead of a feature vector of the same height and width as the input image (for instance 256x256), often a feature vector of 8x8 or 16x16 with a very large third dimension. Therefore, some deconvolutional layers with activation and batch normalisation layers were used to reach the desired output dimensions. These models were also implemented in the respective notebooks and trained on the data sets. Only pre-trained ResNets with transfer learning were utilised, and the same variations on hyperparameters were examined. These models were not as extensively examined as the segmentation models because they served as a proof of concept for the proposed model.

5.3 Proposed Model

As was mentioned, the proposed model consists of a ResNet and FPN backbone with a custom DepthModule as an output, again consisting of multiple convolutional, activation and batch normalisation layers. To simplify the process of constructing the DepthModule and to verify whether this setup results in useful output, the entire model was copied to a notebook with the same structural components as before. Within this notebook, various depth module heads were examined, all different combinations of the same layers. The final selection incorporated accuracy as well as speed, because a computationally lighter model will benefit the final implementation.

Chapter 6

Results

The results of the search for a baseline algorithm, as well as the final results of the comparison of this algorithm to the proposed network, are given below.

6.1 Baseline algorithm search

During the search for a baseline algorithm, many different MDP solutions presented themselves. Unfortunately, however, many were also withdrawn from consideration because of the constraints mentioned in section 4.2. The networks that were left had various qualities and issues. The reasons for choosing the baseline model are explained below.

The main problem that arose with the segmentation models was the output format. As information is condensed into one-hot feature vectors, individual variations between values are lost. To achieve varied output, as is necessary for depth prediction, the head module mentioned in section 5.2 was added. This process worked better with DeepLabV3 than with the other two segmentation models.

Figure 6.1: DenseNet output (right) with its respective ground truth (left).

Figure 6.2: U-Net output (right) with its respective ground truth (left).

Figure 6.3: ResNet output (right) with its respective ground truth (left).

Figure 6.4: DeepLabV3 output (right) with its respective ground truth (left).

As is shown in Figure 6.1, the DenseNet algorithm flattened the output while preserving the edges of the object. This resulted in an eventual MSE loss of 1.8179, shown next to the other loss results in Table 6.1. The U-Net model preserved most of its segmented output, as shown in Figure 6.2, resulting in an MSE loss of 2.3552. One of the ResNet implementations had the problem that it reduced the output resolution to such an extent that it became an inadequate solution, even though the model achieved a loss of 1.4594. This is visualised in Figure 6.3. The other was not able to achieve any viable output.

                 DenseNet   U-Net    ResNet   DeepLabV3
MSE Loss (Avg)   1.8179     2.3552   1.4590   0.7590

Table 6.1: Loss function results on both data sets for the different candidates for baseline network selection.

The DeepLabV3 network, however, showed promising outputs for the prediction of depth from both the first and the second data set. The network resulted in an MSE loss of 0.7590. This is shown in Figure 6.4. Even though some resolution was lost, the decision was made to continue the evaluation of the proposed model against the DeepLabV3 network as a baseline.

6.2 Final results

The proposed model also showed very promising results, matching or outperforming those of the baseline model in terms of resolution, loss and visual aspects. With MSE loss implemented, both the partially and the fully trained transfer learning setups were able to perform very adequately. However, the fully trained RetinaNet performed the most accurately across the two data sets, as shown in the table below.

MSE Loss        DeepLabV3   RetinaNet Partial   RetinaNet Full
Data Set 1      0.76        1.02                0.73
Data Set 2      0.82        0.92                0.62

The training with MAE loss resulted in better performance for DeepLabV3, which scored the best of the three. Again, the partial training of the RetinaNet adaptation performed the least well, but still very adequately relative to the other two solutions. These results are presented in the table below.

L1 (MAE) Loss   DeepLabV3   RetinaNet Partial   RetinaNet Full
Data Set 1      0.66        0.63                0.54
Data Set 2      0.81        0.85                0.79

As for the visual aspect of the evaluation, the models performed similarly to their respective loss results. The fully trained RetinaNet outperformed the baseline model and the partially trained version of RetinaNet, the latter of which performed about as well as the baseline model. This is all shown in Figures 6.5 and 6.6. These pictures all show a fair amount of reduction in resolution, but to a reasonable extent. Also, most of the main regions of interest have been correctly identified. This result is similar for both losses. When combining the visual aspect with the modest loss scores, it becomes clear that the most important regions received a correct prediction.

Figure 6.5: Data set 1 images with their respective ground truth depth (left) and prediction (right). From top to bottom, the first 3 models are DeepLabV3, RetinaNet Full and RetinaNet Partial trained with MSE, and the next 3 are DeepLabV3, RetinaNet Full and RetinaNet Partial trained with MAE.

Figure 6.6: Data set 2 images with their respective ground truth depth (left) and prediction (right). From top to bottom, the first 3 models are DeepLabV3, RetinaNet Full and RetinaNet Partial trained with MSE, and the next 3 are DeepLabV3, RetinaNet Full and RetinaNet Partial trained with MAE.

Chapter 7

Conclusion

This paper serves as a proof of concept for the researchers occupied with implementing object detection in this section of the agricultural field. It has shown that it is very much feasible to perform adequate depth prediction with the RetinaNet backbones currently in use for the object detection process. Even though some resolution loss is incurred, the partially trained network, which was solely initialised with the same pre-trained weights as will be used for object detection, can make out what is foreground and background and reduce its loss to nearly match that of a top-performing baseline algorithm.

The fully trained RetinaNet also provided very respectable results, but it would be more computationally expensive to run next to the object detection, because there would then essentially be two different models running side by side.

7.1 Discussion

Even though the results look promising, some minor changes could have been made to the process to improve either the reliability or the speed, and therefore the amount, of the results. For instance, working in an online Google Colab environment proved to be a bottleneck for most of the project. If access had been available to a local GPU or to the paid subscription Google Colab offers, which is currently not available in the EU, certain setbacks could have been avoided. The usage limits of the free version often interrupted the learning process. The extent to which a dual-model setup would impact performance has also not been tested. It could be the case that computing the entire depth prediction and the object detection separately is well within the capabilities of the on-board computer of the harvesting robot. In that case, an even more well-suited solution could be implemented, regardless of computational cost.

Furthermore, providing two loss function evaluations and a visual representation is not an adequate measure of performance. Different evaluation techniques are needed to truly judge whether this implementation has a useful effect on the eventual depth prediction, object detection and object localisation, the latter of which is the end goal.

7.1.1 Future work

That is also why some suggestions for further research remain. Firstly, it would be interesting to consider more evaluation options, such as the KITTI benchmarking suite or the solutions of the DORN loss paper. These options would present a view of the performance that is more comparable to other available papers. Evaluations could also be set up to verify whether the depth prediction in combination with the provided depth image is a more robust representation of the truth. This could be done by either combining the outputs or implementing a function that calculates and removes outliers in the output from the camera.

Secondly, the actual support this depth prediction offers to the eventual object detection is also still to be evaluated. After this, one could also test whether this process has a positive impact on the eventual harvesting of crops.

Lastly, progress could be made by incorporating more frames into the depth prediction process. This has proved to be a very lucrative step and is documented in several papers. This could improve performance as well as reduce computational cost. As the camera gives a video output, it would be wasteful to start every prediction process over for every frame.

Bibliography

[1] Edward H Adelson et al. “Pyramid methods in image processing”. In: RCA engineer 29.6 (1984), pp. 33–41.

[2] Asra Aslam and Samar Ansari. Depth-Map Generation using Pixel Matching in Stereoscopic Pair of Images. Feb. 2019.

[3] C Wouter Bac et al. “Harvesting robots for high-value crops: State-of-the-art review and challenges ahead”. In: Journal of Field Robotics 31.6 (2014), pp. 888–911.

[4] Liang-Chieh Chen et al. “Rethinking Atrous Convolution for Semantic Image Segmentation”. In: CoRR abs/1706.05587 (2017). arXiv: 1706.05587. url: http://arxiv.org/abs/1706.05587.

[5] Balázs Csanád Csáji et al. "Approximation with artificial neural networks". In: Faculty of Sciences, Eötvös Loránd University, Hungary 24.48 (2001), p. 7.

[6] J. Deng et al. "ImageNet: A large-scale hierarchical image database". In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255.

[7] Huan Fu et al. "Deep Ordinal Regression Network for Monocular Depth Estimation". In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (June 2018). doi: 10.1109/cvpr.2018.00214. url: http://dx.doi.org/10.1109/CVPR.2018.00214.

[8] Andreas Geiger, Philip Lenz, and Raquel Urtasun. "Are we ready for autonomous driving? The KITTI vision benchmark suite". In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE. 2012, pp. 3354–3361.

[9] Jordi Gené-Mola et al. “KFuji RGB-DS database: Fuji apple multi-modal images for fruit detection with color, depth and range-corrected IR data”. In: Data in brief 25 (2019), p. 104289.


[10] Jordi Gené-Mola et al. “Multi-modal deep learning for Fuji apple detection using RGB-D cameras and their radiometric capabilities”. In: Computers and Electronics in Agriculture 162 (2019), pp. 689–698.

[11] Kaiming He et al. "Deep Residual Learning for Image Recognition". In: CoRR abs/1512.03385 (2015). arXiv: 1512.03385. url: http://arxiv.org/abs/1512.03385.

[12] Lei He, Guanghui Wang, and Zhanyi Hu. “Learning depth from single images with deep neural network embedding focal length”. In: IEEE Transactions on Image Processing 27.9 (2018), pp. 4676–4689.

[13] D. Hoiem, A. A. Efros, and M. Hebert. "Geometric context from a single image". In: Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1. Vol. 1. 2005, pp. 654–661.

[14] http://www.grap.udl.cat/en/publications/KFujiRGBDSdatabase.html.

[15] Zheng Hu, Min Sun, and Kerui Xia. "Sparse depth map reconstruction from image sequences of the buildings". In: Proc SPIE (Oct. 2009). doi: 10.1117/12.833932.

[16] Simon Jégou et al. "The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation". In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2017, pp. 11–19.

[17] Patrik Kamencay et al. "Improved Depth Map Estimation from Stereo Images Based on Hybrid Method". In: Radioengineering 21 (Apr. 2012).

[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. "Imagenet classification with deep convolutional neural networks". In: Advances in neural information processing systems. 2012, pp. 1097–1105.

[19] Iro Laina et al. "Deeper depth prediction with fully convolutional residual networks". In: 2016 Fourth international conference on 3D vision (3DV). IEEE. 2016, pp. 239–248.

[20] Zhengqi Li and Noah Snavely. “MegaDepth: Learning Single-View Depth Prediction From Internet Photos”. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2018.

[21] Tsung-Yi Lin et al. "Feature Pyramid Networks for Object Detection". In: CoRR abs/1612.03144 (2016). arXiv: 1612.03144. url: http://arxiv.org/abs/1612.03144.

[22] Tsung-Yi Lin et al. “Focal Loss for Dense Object Detection”. In: The IEEE International Conference on Computer Vision (ICCV). Oct. 2017.


[23] Fayao Liu, Chunhua Shen, and Guosheng Lin. “Deep Convolutional Neural Fields for Depth Estimation From a Single Image”. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2015.

[24] Fangchang Ma and Sertac Karaman. "Sparse-to-Dense: Depth Prediction from Sparse Depth Samples and a Single Image". In: CoRR abs/1709.07492 (2017). arXiv: 1709.07492. url: http://arxiv.org/abs/1709.07492.

[25] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation". In: CoRR abs/1505.04597 (2015). arXiv: 1505.04597. url: http://arxiv.org/abs/1505.04597.

[26] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. "Learning representations by back-propagating errors". In: Nature 323.6088 (1986), pp. 533–536.

[27] Ashutosh Saxena, Sung H Chung, and Andrew Y Ng. "Learning depth from single monocular images". In: Advances in neural information processing systems. 2006, pp. 1161–1168.

[28] M. Shao, T. Simchony, and R. Chellappa. "New algorithms for reconstruction of a 3-D depth map from one or more images". In: Proceedings CVPR '88: The Computer Society Conference on Computer Vision and Pattern Recognition. 1988, pp. 530–535.

[29] Jonas Uhrig et al. "Sparsity Invariant CNNs". In: CoRR abs/1708.06500 (2017). arXiv: 1708.06500. url: http://arxiv.org/abs/1708.06500.

[30] Z. Zhao et al. "Object Detection With Deep Learning: A Review". In: IEEE Transactions on Neural Networks and Learning Systems 30.11 (2019), pp. 3212–3232.
