Investigating the Effect of Illumination Conditions on Flower Detection in Apple Orchards

(1)

Master Thesis

Investigating the Effect of Illumination Conditions

on Flower Detection in Apple Orchards

by

Klaus Josef Ondrag

12265306

November 27, 2020

36EC January 2020 – November 2020 Supervisor: Dr. Elia Bruni Assessor: Qi Bi, M.Sc. (PhD Candidate)

Institute for Logic, Language and Computation Faculty of Science

(2)

I am deeply indebted to my supervisor, Dr. Elia Bruni, who has guided me through the formative process of this thesis. His advice and focus were enormously valuable, and I especially appreciate how much time he invested in this thesis. Besides contributing with guidance and experience, he also contributed to the labeling process despite his tight schedule.

Additionally, I would like to thank Daniele for his work in the field on image collec-tion. To the labelers of apple flower images, I am extending my heartfelt appreciation for their kind helping hands. Their indispensable work made this project possible as a large number of labels are required for this thesis.

I would like to dedicate this dissertation to Sirong Wu, the love of my life and my biggest supporter through the ebb and flow of life. In this project, she was invaluable to me with her support and comfort. She contributed with linguistic feedback and served as the most valuable labeler. Not only did she support me through this thesis; Her calm soul and her words of encouragement lifted my spirits since we first met.

My sincerest appreciation goes to Valentin Zieglmeier, my best friend, for his aca-demic and moral support. His insightful feedback on this dissertation was meticulous and helped me raise the standard of this thesis to the level I desire.

Special thanks also go to my friends, Andreas Panteli and Leonardo Romor, for the intellectual and fun times together in Amsterdam.

I am immensely grateful for the unconditional support of my family through all times. Without them, I could never have undertaken the life-changing adventures through my international academic trajectory. With regard to this thesis, I am particularly thankful for my younger brother, Martin Ondrag, who also helped labeling when he could, despite doing a full-time internship. Finally, I would like to deeply thank my uncle, Dr. Wolf Dieter Ondrag, for his constant guidance and inspiration for decades.

(3)

In agriculture, superfluous apple flowers are commonly removed to realize quality produce. Conventional approaches suffer from drawbacks such as low precision, environment pollution or intensive human involvement. This thesis aims to tackle these problems by working towards a hypothetical robot that detects the flowers using computer vision. Computer vision could enable the robot to precisely remove the excess flowers through other means, such as a water stream or by cutting them off. One naturally arising concern is if computer vision is capable of working in any illumination conditions, such as sunshine from any angle, or clouds. In this work, we investigate the effect of illuminations on flower detection in apple orchards with deep learning models by creating a dataset focused on illumination conditions. The dataset contains 712 images, which contain both color and depth information, making this an RGBD dataset. Each image has labels about its illumination conditions. We report the performance for the object detection models RetinaNet and Faster R-CNN per condition and observe differences depending on the amount of clouds in the image, but not between the two models. While we did not manage to identify the exact cause of these differences, it seems unlikely that the number of flowers is correlated to it, as our experiments show. Our experiments further demonstrate that there is no significant difference between models that use RGB vs. RGBD. Additionally, we see that pre-trained models with a randomly initialized first layer are not able to achieve the same performance as pre-trained models with a pre-trained first layer. This indicates that the models can transfer knowledge from other datasets that our dataset by itself does not provide. Overall, we find that recent object detection models such as RetinaNet are a usable and fast way to perform object detection in orchards.

(4)

Acknowledgments II

Abstract III

1 Introduction 1

2 Background 4

2.1 Convolutional Neural Network . . . 4

2.1.1 Kernel . . . 5

2.1.2 Stride . . . 5

2.1.3 Padding . . . 6

2.1.4 Output Shape . . . 6

2.1.5 Multiple Input Channels . . . 6

2.1.6 Multiple Output Channels . . . 7

2.2 Computer Vision Tasks . . . 7

2.3 Object Detection with Neural Networks . . . 8

2.3.1 Two-Stage Detectors . . . 8

2.3.2 One-Stage Detectors . . . 9

2.4 Metrics . . . 9

2.4.1 Intersection over Union . . . 10

2.4.2 Precision and Recall . . . 11

2.4.3 Precision-Recall Curve . . . 11

2.5 Depth . . . 13

3 Related Work 15 3.1 Neural Networks in Robotic Farming . . . 15

3.2 Depth Cameras and Illumination Conditions . . . 15

3.3 Comparison to Apple Datasets . . . 17

4 Dataset 19 4.1 Image Collection . . . 19

4.2 Image Annotation . . . 20

(5)

5 Experiments 26

5.1 Models . . . 26

5.2 Metric . . . 26

5.3 Training . . . 27

5.4 Data . . . 27

6 Results and Discussion 29 6.1 RQ1: Is the performance of object detection in orchards dependent on illumination conditions? . . . 29

6.1.1 Results per Category . . . 29

6.1.2 Results per Number of Flowers . . . 29

6.2 RQ2: Does adding depth as an input channel to the object detection model improve object detection performance? . . . 36

6.2.1 Depth Processing . . . 36 6.2.2 Weights Initialization . . . 36 6.2.3 Results . . . 36 6.2.4 Training Duration . . . 39 7 Future Work 40 7.1 More Data . . . 40 7.2 Infrared Channels . . . 40 7.3 Lux Meter . . . 41 7.4 Apples . . . 41 7.5 Depth Hardware . . . 41 7.6 Depth Model . . . 42 7.7 Depth Processing . . . 42 7.8 COVID-19 . . . 42 8 Conclusion 43 References 44 List of Figures 51 List of Tables 52

(6)

The improvement of fruit agriculture is an ongoing research effort. Research explores various directions to achieve this. For example, nutrition management aims to provide the optimal resources for the plants. [1–3] Research on nutrition deficiencies help to understand the roles of chemicals within the biological processes and also help to diagnose malnutrition early in the plant life cycle [4].

In the case of apple trees, which this work is focused on, a wide range of research is available of this topic and encompasses the whole life cycle of apples. Research areas include the flowering process [5–9], storage [2, 10], yield [11, 12], taste [13], and quality [3, 11, 12, 14] of apples.

More specifically, our work is focused on the topic of flower thinning. We train neural network models to detect flowers so that a hypothetical robot could perform flower thinning. As this should work reliably, we investigate the effect of illumination conditions on the performances of the models. For this purpose, we create an RGBD dataset with illumination conditions for each image. We analyze the effect of illumina-tion condiillumina-tions as well as the effect of regular color images vs. adding depth data to the model.

Fruit thinning is important because when the apple tree produces flowers, pollinated flowers develop into apple fruits. This development means spending resources, i.e. energy and molecules. If the tree happens to have fewer pollinated flowers, it will spend its resources on fewer apples, meaning an individual apple receives more. Research shows that reducing the number of apples grown improves their quality and size. As such, we aim to do just that. One way of achieving this reduction is fruit thinning. [14] Approaches for fruit thinning this can be grouped into mechanical and chemical approaches. [6, 14] Mechanical approaches physically remove flower buds. This could be done by cutting them off, using water spray guns or limb shakers. The human effort required but also the precision is decreasing in the listed order. Chemical approaches can work on various biological levels. Common methods are caustic sprays, plant bioregulators or insecticidal carbamates. These are typically applied by spraying generously over the plants, for exmple with a tractor, making it fast to do.

In an effort to reduce the usage of chemicals in our food chain while simultaneously keeping the required human labor low, we are researching automating automated processes. Potentially, a robot could either pluck superfluous flowers or eliminating

(7)

them with a precise high pressure water spray. Besides the engineering challenges to replace human labor, a first requirement for such a robot is the detection of flowers.

In the last decade, computer vision has made great progress. [15] In 2012, a model named AlexNet, based on neural networks, significantly outperformed its competitors which used classical machine learning approaches in the ImageNet competition. [16] These neural networks are inspired by human synapses. [17, 18] They have been shown to be able to approximate any continuous function. [19, 20] This theoretical capability is limited in practice by being able to "learn" this approximation. By learn, we mean finding the constants to a function that minimizes an error function. One of the contributions with AlexNet by Krizhevsky et al. was training a reasonably big neural network efficiently. Since then, various improvements have been made. Models such as ResNet [21] extended the capabilities further by changing the architecture of the model. The training process was also further improved through techniques such as stochastic gradient descent [22] or the Adam optimizer [23].

Neural networks and deep learning are not just successful in image classification challenges such as ImageNet. Other computer vision tasks such as object detection have also received a lot of attention as it is especially important in the areas of robotics and self driving cars. [24, 25] For these fields, an understanding of the environment is paramount.

As mentioned before, we would like a computer to be able to detect apple flowers in orchard, for the overall goal of performing fruit thinning with a robot. In this work, we incorporate latest computer vision models and apply them to this situation. More specifically, we investigate the effect of illumination conditions on object detection. This is important because such a robot should be able to perform its task reliably. It would not be considered a success if it fails to detect apple flowers if the sun is at the wrong angle for the robot, or if it does not work when there are clouds. For this, we create our own dataset. We take pictures in an apple orchard and annotate the images. Then, we train object detection models and evaluate their performance. Additionally, we not only take color images but also capture depth information. We augment the models and add depth to their input with the goal of increasing their robustness with regards to illumination and to increase the object detection performance.

Overall, two research questions crystallize:

Research Question 1 (RQ1) Is the performance of object detection in orchards

(8)

Research Question 2 (RQ2) Does adding depth as an input channel to the object detection model improve object detection performance?

Before we can answer those questions, we need to establish the necessary knowledge to understand this work. In the next chapter, we describe what is expected of the reader to know and briefly cover some key topics.

(9)

In this chapter, we review topics that are assumed to be known in later chapters. Sections with high familiarity can be skipped. We assume that neural networks and backpropagation are familiar. [26] In case a refresher on these topics is needed, we redirect you to the respective papers, or one of the excellent works of Bishop [27] Schmidhuber [17], LeCun et al. [18], or Li et al. [20]. Due to the high importance Convolutional Neural Networks for this work, we review this topic specifically in the following section.

2.1 Convolutional Neural Network

In this section, we review the central elements of Convolutional Neural Networks (CNNs) [28, 29], namely kernel, stride, and padding. We also describe how CNNs can have an arbitrary number of input and output channels. The following section is based on Li et al. [20].

One of the problems with using a linear layer, i.e., one matrix multiplication and one vector addition, is that its weights are dependent on the position. If we have two white images where one has an apple on the left and one on the right, in a linear layer, the color values of the apple will be multiplied with different weights. In order to detect objects reliably anywhere in the image, the training data would have to include objects in every possible position to ensure it can be detected everywhere. This is unattractive from both a practical as well as a mathematical point of view. For humans, an apple is easily detectable as an apple regardless of its position in an image. Hence, we are interested in location-independent, also called translation invariant, detection of objects. CNNs solve this problem efficiently and are the backbone of the current success of neural networks in computer vision.

For now, consider an image as a two-dimensional matrix. This would mean that the image is gray-scale instead of color. Later, we will generalize to three-dimensional matrices.

(10)

(a) p=0, s=1 First step (b) p=0, s=1 Second step (c) p=1, s=2 First step (d) p=1, s=2 Second step

Figure 2.1: Example of a convolution operation. The image is displayed in blue, the output displayed in green. The shaded area corresponds to the area to which the kernel is applied to. The values of the kernel are displayed as smaller numbers in the shaded area. The kernel size k is 3 in all examples. p and s indicate the padding and stride, respectively. Source images from Dumoulin et al. [31].

2.1.1 Kernel

Instead of multiplying the whole image with a similarly sized matrix and adding a bias vector, like a linear layer would do, we take a significantly smaller matrix and a scalar. This smaller matrix is called filter or kernel. The scalar is called bias. Commonly, the kernel is a square matrix, i.e., the same number of rows and columns. Some well-known architectures, such as VGG [30] and ResNet [21], use widths of 3 and 7. This is significantly smaller than the images in our dataset, which have a size of 1920 columns and 1020 rows.

In Figure 2.1, the process of applying the kernel is shown. We will explain the padding p and stride s in the following subsections. The kernel is multiplied element-wise with a patch of an image. The result is summed up, and a bias is added. Conceptually, it makes sense to shift the perspective a bit. Instead of viewing the top-left element of a kernel as the basis, consider focusing on the center element. Then we can think of applying the kernel as a weighted sum of a pixel and its surrounding pixels. This perspective explains why the previously mentioned sizes for kernels were all odd. If we want to consider a pixel plus its x neighbors on one side then we end up with a kernel of size k=1+2x, which is odd.

2.1.2 Stride

After applying the kernel at a given element, we shift the underlying image and apply the kernel at a different element. How far we move is called stride s. The effect of s=1 is visible when comparing the location of the shaded area between Figure 2.1a and Figure 2.1b. The effect of s=2 is visible when comparing the location of the shaded area between Figure 2.1c and Figure 2.1d.

(11)

For s=1, we can calculate the output O at row i and column j for an input image I with the following formula:

Oi,j = b+ k

∑

a=0 k

∑

b=0 Ii+a−n,j+b−nKa,b (2.1)

where b is the bias scalar, k is the kernel size, n= bk_/2cand K is the kernel.

2.1.3 Padding

As we subtractbk_/₂_c_{from the indices, we have to take special care of the edges. There}

are multiple ways to treat them. We can ignore it, which means that the size of the output will be smaller than the input image. We can also apply a padding p around each edge. In the easiest and most common case, the padding consists of zeroes. A zero-padding with p=1 is shown in Figure 2.1c and Figure 2.1d.

2.1.4 Output Shape

Applying stride and padding changes the size of the output. A bigger stride causes rows and columns to be skipped and hence reduces the size of the output. On the other hand, a bigger padding effectively makes the input image larger and hence increases the size of the output. In Figure 2.1, we can see that in both versions the output is a 3×3 matrix, as the increased padding and stride cancel each other out. If we wish to calculate the size of the output nout, we can calculate it with following formula:

nout=

nin−k+2p

s

+1 (2.2)

If p= bk_/₂_c_{and we want to keep the dimensions of the output the same, we set s}₌_1.

To halve them with the same value of p, we set s=2.

2.1.5 Multiple Input Channels

At the beginning of this section, we simplified an image as a two-dimensional matrix. In practice, color images consist of three 2D matrices: one for red, one for green and one for blue. To accommodate for this, we have a kernel for each color channel. The bias scalar remains a scalar. We can adapt Equation 2.1 to additionally sum over the input channels: Oi,j = b+ k

∑

a=0 k

∑

b=0 n_cin

∑

c=0 I_i+a−n,j+b−nKa,b,c (2.3)

(12)

where ncin is the number of input channels and Ka,b,c is the c-th kernel element at row a

and column b.

2.1.6 Multiple Output Channels

During training, kernels learn to detect features through backpropagation. This means that (approximations of) the best values of the kernel are found. Best is defined as the values that minimize a loss function. For example, one kernel might learn to detect horizontal lines while another kernel detects corners. Since we are interested in detecting complex objects, we need many kernels to ensure that the neural network can learn complex tasks. This means that we do not just have one kernel per color channel, but multiple. This becomes another hyperparameter that we choose when defining our neural network architecture. We can adapt Equation 2.3 to additionally incorporate multiple kernels:

Oi,j,cout =bcout +

k

∑

a=0 k

∑

b=0 n_cin

∑

cin=0 Ii+a−n,j+b−nKa,b,cin,cout (2.4)

where bcout is the cout-th bias, ncin is the number of input channels and Ka,b,cin,cout is the

element at row a and column b of the (cin, cout)-th kernel.

Note that now our output is also three-dimensional. This means we can apply another convolutional layer afterward, which makes it possible to stack multiple convolutions with different hyperparameters.

With this refreshed view on convolutional neural networks in mind, we will now have a look at their applications in computer vision.

2.2 Computer Vision Tasks

There are many tasks computers do in the field of computer vision. Three common ones are shown in Figure 2.2. This work is focused on object detection, which is visualized in Figure 2.2a. Object detection means finding rectangles that contain objects of interest. These rectangles are called bounding boxes. In this case, the objects of interest are groups of flowers of an apple tree, and the bounding boxes are displayed in red. For disambiguation, Figure 2.2 shows two other tasks, image classification and image segmentation. For image classification, the goal is to predict one or more categories for every image. In Figure 2.2b, we predict the image to contain apple trees. On the other hand, in Figure 2.2c, each pixel gets assigned to a class. The output is typically displayed as a colorful picture, in which each pixel is color-coded according to its class.

(13)

(a) Object Detection (b) Image Classification

(Class: Tree)

(c) Image Segmentation

(Blue: Sky, Green: Tree, Red: Ground)

Figure 2.2: Visual comparison of common computer vision tasks. Source image from our dataset.

As this work is focused on object detection, we will now have a closer look at how that is implemented with neural networks.

2.3 Object Detection with Neural Networks

One difficulty is how to predict and classify the bounding boxes efficiently. A naive way to implement object detection could be to run a classifier for each possible bounding box. It is easy to see that there are too many possible bounding boxes as they can have varying sizes.

Better approaches can be grouped into two-stage [32–34] and one-stage [35–39] detectors. The former first proposes regions of interest and then classifies them. The latter predicts bounding boxes and class probabilities directly from the intermediate representations. One-stage detectors were developed from two-stage detectors to improve speed. Potential drawbacks are a loss in detection performance. The workflows of two representative models are shown in Figure 2.3. Next, we will describe them in more detail.

2.3.1 Two-Stage Detectors

One representative group of models is the Region-based CNN (R-CNN) family [32], which was further developed in Girshick [33] and Ren et al. [34]. The original paper and name giver is Region-based CNN (R-CNN) [33]. This model’s workflow is shown in Figure 2.3a. To perform object detection, it does the following [32]: First, 2000 regions are proposed. They are then re-scaled into a fixed size. Then, a CNN generates a vector representation for every region. Finally, a support vector machine (SVM) outputs the probabilities of each class for each region.

(14)

(a) Region-based CNN (R-CNN). Image from Girshick

et al. [32].

(b) You Only Look Once (YOLO). Image from Redmon

et al. [40].

Figure 2.3: Workflows of two object detection models

this process further. One of the changes is performing the work in a single network. Still, the two stages are visible within the networks. One of the downsides of two-stage detectors is that these consecutive stages require time.

2.3.2 One-Stage Detectors

To speed up detection, one-stage detectors were developed. Examples of such detectors are You Only Look Once (YOLO) [40] and its improvements [35–37], Single Shot MultiBox Detector (SSD) [38] and RetinaNet [39]. All of these models have a different internal architecture and use different losses for training so we will highlight how one-stage detectors work with YOLO as an example. Its workflow is shown in Figure 2.3b. YOLO does the following [40]: The image is split into a smaller grid. For each of those cells, the model predicts the probability of each class being visible in this cell. At the same time, for each of those cells, B bounding boxes are predicted. These predictions are structured as (x, y, width, height, and confidence score). A cell is responsible for a bounding box, in terms of the loss, if the center of the ground truth bounding box is within this cell.

2.4 Metrics

Now that we know how object detection with neural networks works, we still need to evaluate models in order to decide which one performs best. We explain a standard evaluation technique which is used by Everingham et al. [41] and Lin et al. [42]. More details can be found in Padilla et al. [43], which is the the basis for this section.

A model outputs bounding boxes and confidence scores which are values between zero and one. We first order the bounding boxes by decreasing scores. Then, we

(15)

calcu-late the intersection over union value, which will be explained in the next subsection, for each pair of bounding boxes between the predictions and the ground truth. We run through the predictions and greedily assign the ground truth bounding box with the highest intersection over union value. Then, we create a precision-recall curve, which summarizes the precision and recall values we can achieve for a given confidence score. The terms precision, recall and precision-recall curve will also be explained in the next subsections. Finally, we summarize a precision-recall curve into a single value called mean average precision. This number is then used to compare different models. The higher the mean average precision is, the better.

2.4.1 Intersection over Union

In order to quantify the quality of prediction in form of a bounding box, we use the Intersection over Union (IoU) [41]. It is defined as follows:

IoU(BBgt, BBp) =

Area(BBgt∩BBp)

Area(BBgt∪BBp)

(2.5) where Area(·)is a function that returns the area of a polygon, BBgtis the bounding

box of the ground truth, BBpis the bounding box of the prediction,∩is the intersection,

and∪is the union.

This implies that the range of the IoU-function is (0, 1]. A perfect prediction will have a value of 1. Predictions with no overlap will be closer to 0, depending on the size of the bounding box.

(a) Intersection (BBgt ∩ BBp), shown in blue (b) Union (BBgt∪ BBp), shown in blue (c) IoU(BBgt, BBp) =0.44 (d) IoU(BBgt, BBp) =0.33 (e) IoU(BBgt, BBp) =0.21

Figure 2.4: Details and examples of Intersection over Union (IoU). Figure 2.4a and Figure 2.4b visualize the intersection and union components of IoU, respectively. Figure 2.4c, Figure 2.4d, and Figure 2.4e show three different examples and the corresponding IoU values. The bounding box of the ground truth (BBgt) is shown with a solid green border. The bounding box of the prediction (BBp) is shown with a dashed red border.

The intersection and the union of two predictions are visualized in Figure 2.4a and Figure 2.4b, respectively.

(16)

To gain a better understanding of which IoU values two bounding boxes result in, have a closer look at Figure 2.4c, Figure 2.4d, and Figure 2.4e. A regular prediction is shown in Figure 2.4c. The IoU value is 0.44 which is caused by a shift in location of the prediction. The size is equal to the ground truth. However, the IoU punishes this because of the area of the union, and hence the denominator, grows larger, and as a result the IoU value becomes smaller. Figure 2.4d shows the need to divide by the area of the union. If we just measured the intersection, the prediction shown in Figure 2.4d would have a value of 1.0. Indeed, models would simply predict bounding boxes with the dimensions of the full image to maximize the intersection. Finally, Figure 2.4e shows a prediction that is too small and fully contained within the ground truth.

2.4.2 Precision and Recall

Now we know how to quantify how well a predicted bounding box fits the ground truth bounding box. Next, we want to distinguish two cases: In one, we predict half of the bounding boxes perfectly with an IoU of one but the other half has an IoU of zero. In another case, every predicted bounding box results in an IoU of 0.5. If we simply averaged the IoU values, we could not distinguish those cases.

Instead, we pick a threshold t with 0≤t ≤1. A prediction will be considered correct when its IoU value is at least t. A correct prediction is also called True Positive (TP). A false prediction with an IoU value of less than t is called a False Positive (FP). A ground truth without a corresponding prediction is called a False Negative (FN). For completeness, we want to mention True Negatives (TN). A TN would be a non-existing bounding box as a ground truth which has not been detected. This does not apply to our situation as it is not useful for training a neural network.

With these definitions, we can calculate the percentage of predictions which were correct. This is called precision. We can also calculate the percentage of ground truths which were detected, which is called recall. Mathematically, precision is defined as

TP

TP+FP and recall as TP

TP+FN. [44]

Naturally, having a precision and a recall of one would be best. In reality, we have to make trade-offs by choosing a confidence value threshold. The lower the threshold, the more bounding boxes will be predicted. This typically increases recall. The higher the value of t, the more confident the predictions will be which typically corresponds to an increased precision. To visualize this trade-off, we can use precision-recall curves.

2.4.3 Precision-Recall Curve

At this point, we have a list of predicted bounding boxes, sorted by decreasing confi-dence, and for each of those bounding boxes we know if it is a true positive or a true

(17)

Image Bounding Box Confidence TP FP Precision Recall

2 B 0.96 1 0 1.00 0.25

1 A 0.72 0 1 0.50 0.25

2 C 0.45 1 0 0.66 0.33

Table 2.1: Example of a Precision-Recall table. There are two images in the dataset and each image has two ground truth bounding boxes. The model predicted three bounding boxes. Precision and recall are calculated by only considering the bounding boxes above a row. Example calculation for bounding box A: Precision = ₁₊₁1 = 0.5 because only one out of the two predicted bounding boxes was correct. Recall = 1₄ because only one of the total 4 bounding boxes has been correctly predicted.

Figure 2.5: Example of a precision-recall curve. Values are not from Table 2.1. Image from Padilla et al. [43].

negative, for a given value of t.

We can calculate the precision and recall for each confidence value by only consid-ering bounding boxes with at least that value. An example calculation is shown in Table 2.1. Note, since recall is based on false negatives this column will only increase. Hence, we can plot a precision-recall curve, where the recall value is on the x axis and the precision value on the y axis. The line connecting these dots corresponds to the confidence threshold we could pick. For our final application, we need to choose an threshold which dismisses bounding boxes with lower confidence scores. This means that we can choose our model’s performance to be anywhere on this line.

An example of a precision-recall curve is shown in Figure 2.5. This figure also shows the interpolated curve, which the maximum precision of recall values is higher than the current recall of the current position. This interpolated curve is used because we can always choose the better threshold with better recall and the same precision.

(18)

How this Precision-Recall curve is used to evaluate our model will be explained in chapter 5.

2.5 Depth

The dataset collected in this work not only includes color images, it also stores depth information. This is achieved with a stereo depth camera. For each pixel, the camera calculates how far away the corresponding object is and stores this in an additional channel. The camera itself contains two cameras, l and r for left and right, respectively, facing the same direction. Next, we describe how cameras can calculate the distance to a point. This explanation is based on Fidler [45].

The location of the left camera is O_l ∈ _R3_{, i.e. a three-dimensional vector. The}

location of the right camera is Or=Ol+T 0 0 T

, where T is the difference in the x axis between the two cameras.

P(X, Y, Z)

Ol Or

Pl (xl, yl) Pr (xr, yr)

Left image plane Right image plane

Left camera center Right camera center

(a) 3D view

P(X, Y, Z)

Ol T Or

Pl (xl, yl) Pr (xr, yr)

Left image plane Right image plane

Left camera center Right camera center

Z

f

(b) Bird’s-eye view

Figure 2.6: Overview of the two cameras and a point

The camera setup is visualized in Figure 2.6. We want to find the distance Z of an arbitrary point P. The points O_l, Or and P span a plane on which Pl and Pr lie. The

two image planes also lie on the same plane as they both are at distance f from each camera. This distance f is called the focal point. Because of the previously mentioned planes, we can conclude that the lines O_lOr and PlPr are parallel. This implies that

yr=yl.

Now, take a bird’s-eye view, which is shown in Figure 2.6b. The coordinate Z of point P is marked in the diagram. We can use the fact that we have two similar triangles,

(19)

(Ol, P, Or)and(Pl, P, Pr), to calculate Z: T Z = T+xr−xl Z− f (2.6) ⇐⇒ TZ−T f = TZ+ZXr−ZXl (2.7) ⇐⇒ −T f = Z(Xr−Xl) (2.8) ⇐⇒ Z= f ·T X_l−Xr (2.9) This process only works when a point is visible for both cameras. When this is not the case, a placeholder value will indicate that no distance can be estimated. This depth estimation can be done for the whole image. The camera we utilized performs similar calculations in its firmware and returns the color image and the depth channel directly.

This depth channel can also be visualized. In Figure 2.7, we show an image and its depth channel. For illustration purposes, a gray scale has been applied to the depth channel.

(a) Color image (b) Gray-scaled depth channel. Completely black pixels

indicate pixels for which depth could not be calculated. For the remaining pixels: the brighter the closer to the camera.

Figure 2.7: Example color image with a gray-scaled depth channel

After this section, you should now have an intuition about the calculation of depth from images and its limitations.

This chapter contains important background knowledge. Next, we will dive into the state of the art by exploring the related work.

(20)

With all the necessary knowledge equipped, we now offer an overview of related papers and show how we contribute to this space. First, we will talk about robots in orchards, briefly mention their tasks and which plants are commonly investigated. Next, we will discuss the role of depth cameras and illumination conditions in an orchard environment. Then we will discuss specific papers that are more closely related to our work.

3.1 Neural Networks in Robotic Farming

Robotic farming in general is an active research topic. Besides apples, commonly researched plants include broccoli [46], sweet peppers [42, 47, 48], passion fruits [49, 50], almonds [51], and mangoes [51]. These papers share their interest object detection of fruits with a focus on harvesting them. Other papers in this category aim to provide additional insights into the whole farming process. For example, Ismail et al. [52] use machine learning to detect which picked apples are good and bad. Akbar et al. [53] focus on the automated pruning of plants instead.

The work [54] of Arad et al. explores building a harvesting robot for sweet peppers. They solved many engineering problems and delivered real world results, which is impressive. Their focus differs in several ways from ours. We focus on apple flowers outside, while they focus on sweet peppers in a glasshouse. Also, a significant portion of their work is about the challenges of the robot, where as we investigate the illumination conditions and their effects.

Common tasks besides object detection are fruit size estimation [55], fruit count-ing [56–58], mappcount-ing to a 3D environment [59], and segmentation [60–62].

3.2 Depth Cameras and Illumination Conditions

Some works researching illumination invariance focus on image processing. Maddern et al. [63] proposes new color space based on the peak spectral responses of each sensor channel in the camera. Similarly, Kim et al. [64] introduce a preprocessing step to arrive at an image that captures important features of faces. Both would process the image

(21)

before passing it through the network. While interesting, this shift in input distributions would most likely require training from scratch, as conventional pre-trained models are trained on regular RGB images. Additionally, our work focuses on investigating regular neural network models on RGB or RGBD. A comparison of image processing techniques and how well they transfer to orchards would be interesting but are suited better for another work.

We are not the first work to incorporate depth data into object detection with neural networks. For example, Fan et al. [65] present an RGBD dataset for object detection and instance segmentation. However, their objects to detect are always humans and the pictures typically contain only a few foreground objects. Another interesting dataset is from Jamil et al. [66]. They introduce a dataset of close-up ear images. The task for the neural network is to identify a person based on the image. This is different to us as we are not interested in uniquely identifying each flower. We only want to be sure about the presence or absence of it. Still, their work includes images for various illumination conditions, quantified by lux values. This is relevant to us as this is an objective way to quantify illumination conditions.

On more closely related papers to our work that include depth cameras and orchards, notable papers include a work [50] of Tu et al. and its predecessor [49]. They are detecting passion fruits. The commercial growing structure of passion fruits compared to apples differs as the plants grow along a raised platforms where fruits hang down. Apple flowers in our orchard, on the other hand, hang on the similar height as the camera. Hence, in our situation, a mis-identification due to flowers in another row are more likely and the background varies more. In the case of passion fruit, the sky can provide a nice contrast that can aid detection. Regardless of plant differences, Tu et al. also trains a Faster R-CNN model, like us. Unlike us, they train one model on RGB and one on depth data, where we train one on RGB and one on RGBD. They found that the model trained on RGB performs better than the one trained on solely depth data. Our research question 1 is not covered by them and the goal of our research question 2 is to close the gap between the performance of their models.

More generally, most papers targeting fruits utilize RGB cameras. Among the papers that used depth cameras, the Kinect is most common [46, 48, 60], where as we use the Intel RealSense D415.

Many papers also circumvented the need for illumination invariant models by controlling the illumination [55, 67, 68]. These works typically create a box with black fabric around their system and install artificial lights. This setup is easier to achieve for plants on the ground such as broccoli [46], but is more difficult for trees as they vary in size and height. Thus, circumventing the illumination issue is not suitable for our situation. Rather, we should build an illumination invariant object detector.

(22)

3.3 Comparison to Apple Datasets

Next, we will evaluate selected papers targeting object detection of apples. These papers share that they are focused on apple fruits instead of apple flowers and that their primary focus is on increasing object detection performance, not on investigating the effects of illumination conditions.

Fruit Detection and Segmentation for Apple Harvesting Using Visual Sensor in

Orchards: This paper [69] creates a dataset that contains 800 images, a number

close to ours (712). Kang et al. train DaSNet, which is based on ResNet to detect apples and branches. Their final model has a Average Precision at a threshold of 0.50 for Intersection over Union (AP50) of 0.836 with an inference time of 72ms on a GTX1080Ti. This means that the model can detect apples reasonably well and reasonably fast. YOLOv3 [33] and Faster R-CNN [48] are also evaluated but performed worse. Unfortunately, the dataset is not publicly available. Our dataset, on the other hand, contains labels about illumination conditions and has depth data.

MinneApple: Häni et al. [61] publish a dataset with 1000 images for apple detection

as a challenge. An online leader board is available1 . Segmentation and counting are two other tasks in this challenge. This dataset captures a lot of variety. Some pictures are taken up close, some further away. There are multiple types of fruits at "different stages of the ripening cycle". Similar to us, the authors "spread out data capture over multiple days to get more varied illumination conditions" [61]. The images in this dataset are taken with a smartphone, the Samsung Galaxy S4. Although we also try to capture a variety of flowers by taking pictures across the orchard, we are more strict with the setup of the camera. The distance and angle to the trees are more fixed in our work. Also, we only take pictures of apple flowers and do so with a dedicated camera, the Intel RealSense D415.

KFuji RGB-DS: While MinneApple had a strong focus on the creation of a varied

dataset, Gené-Mola et al. focused on depth information [67, 68]. Their dataset includes 967 images and provide RGB images, a depth channel, and a range-corrected IR intensity channel. The authors state that the performance of depth cameras is sensitive towards direct sun exposure and hence the images were taken at night with artificial lighting. Again the number of images is similar to ours (712). A key difference to our work is that we specifically take pictures during the day to investigate the effect of illumination conditions, while they work around this.

(23)

Deep Fruit Detection in Orchards: This dataset [51] contains apple, mango, and almond trees. In total, there are 3704 images, which makes this dataset one of the biggest ones available. Bargoti et al. evaluates Faster R-CNN and performs an ablation study with respect to the number of training samples as well as data augmentation. The focus of "Deep Fruit Detection in Orchards" is hence more on the training aspects of applying deep learning models to fruit detection.

(24)

In this chapter, we will describe the dataset we have created for this work. First, we will describe the image collection process and how we annotate the images. Then, we will explain the category classification for our images. At last, we will list statistics such as number of images per category.

4.1 Image Collection

As we described in the earlier chapters, our goal is to research the effects of illumination on object detection with neural networks. For this, we took pictures in an apple orchard located in Auer, Italy. The images were taken between 17.04.2020 and 20.04.2020 with an Intel RealSense D415. At this time of the year, apple flowers were in the condition just before the orchard owner would apply the classic mechanical or chemical fruit thinning techniques. By taking pictures at this condition, we ensure that the conditions correspond to the working environment of a potential robot.

To make the conditions even more realistic, we mount the camera on a pole. This allows for easier handling and greater stability while taking pictures. Additionally, this allowed us to compare the performance of different cameras before deciding on the final one. The pictures were taken approximately 1 meter away from the trees.

The complete setup is as follows: The camera is connected through a non-standard USB cable to a laptop, which is carried in the operator’s backpack. The laptop is connected to a Wi-Fi hotspot of the operator’s smartphone. An internet connection is not needed. The operator holds the phone with the app open in one hand and carries the pole with the other. When the operator wishes to take a picture, they press a button in the app. This causes the app to send a message to a custom web server running on the laptop. The web server then triggers the camera to take a picture and sends this picture back to the app. The operator then knows that the process is completed and can manually verify the quality of the image.

This process resulted in consistent quality of our pictures. Since the camera uses a non-standard USB cable, pictures could only be taken from laptops, not smartphones. The app is needed because it is highly challenging to operate a laptop and a camera simultaneously while walking in non-flat terrain. It is critical that this process works

(25)

Figure 4.1: Example images from our dataset.

without internet connection because orchards can potentially be without cell connection due being remotely located.

After collection, the images are manually filtered. Duplicate or blurry images are removed. Then, the images are ready to be annotated.

To better illustrate the environment we are in and the output of the process we just described, we show some example pictures in Figure 4.1.

4.2 Image Annotation

We use the service labelbox1to label our images. In each image, groups of flowers have to be marked as bounding boxes. Labeling each flower individually would increase the cost of labeling a lot. The downside is that labeling becomes more subjective. But a potential target application is trimming of flowers so that the few apples that grow become better. In the case of removing excess flowers with a water stream, a rough area to target might be sufficient. This also means we can focus on the row of trees closest to the camera. Only these flowers would be considered to be removed. As this project’s focus is on the illumination conditions and not on the specific object detection performance on individual flowers, we decided to annotate the bounding boxes for groups of five flowers, where visible and possible, and only for the closest row of trees. Flowers that were only partially visible are also labeled as a group/in a bounding box. This approach also matches the time and resource constraints of this project better.

We label ourselves and received additional support from friends and family. The labelers receive initial guidance in the form of documentation and a personal conver-sation. Afterward, they label a few pictures and, after a review, received personal feedback on their work.

(26)

Due to the limited resources, each image is labeled exactly once, not more. However, we manually verify all labels through visual inspection to ensure a high quality. Besides the bounding boxes, the labelers are asked to selected various categories for each image.

We track the following categories:

1. Sun on Flower (Yes / No): Yes if there is sun or shade visible on flowers, otherwise no.

2. Sun Artifacts (Yes / No): Yes if there are sun artifacts such as sun rays visible in the picture. This would happen more likely if the sun is visible or close to being visible in the picture.

3. Color Temperature (Cool / Warm): Pictures with a yellow taint, which typically happens for pictures around sunset, are labeled as warm. Pictures are labeled cool if it is not clearly warm. This is slightly subjective.

4. White Sky Proportion (0-25%, 25-50%, 50-75%, 75-100%): The percentage of sky that is bright white. This can be either the Sun or highly illuminated clouds. 5. Cloud Proportion (0-25%, 25-50%, 50-75%, 75-100%): The percentage of

non-bright-white clouds in the sky. This is not a superset of white sky proportion. The general reasoning for these categories is that they are fairly objective, with the exception of the color temperature perhaps. This helps to establish consistent labels and thus reliable results. Three of the categories only have two values, whereas two categories can take on four values. Having four percentages could help us find correlations easier. Other criteria such as the angle of the sun relative to the camera are hard to capture reliably. The sun could be covered by clouds or not be in the picture at all.

A sample of images from our dataset can be found in Figure 4.2. The images cover multiple categories and should give you an understanding of how the categories look visually. You can also see the annotated bounding boxes and how we group multiple flowers together. It is also clear to see that only the front row is labeled. At the same time, we show the captured depth images. We can see that the rows of trees are easily distinguishable. Also the ground and sky can clearly be identified. If we take a look at the closest tree and its flowers, things are not ideal. Individual flower pedals are not always identifiable. Thin branches are sometimes not captured. Also, shadows next to close objects is noticeable. This is likely due to the fact that this is a stereo depth camera.

Now that we have described the dataset creation and labeling process, we will move on to aggregated information about the whole dataset.

(27)

(i) RGB (ii) Depth

(a) Sun on Flower: No, Color Temperature: Cool, Clouds Proportion:

0-25%, White Sky Proportion: 50-755%, Sun Artifacts: No

(b) Sun on Flower: No, Color Temperature: Cool, Clouds Proportion:

(c) Sun on Flower: Yes, Color Temperature: Cool, Clouds Proportion:

0-25%, White Sky Proportion: 50-75%, Sun Artifacts: Yes

(d) Sun on Flower: Yes, Color Temperature: Warm, Clouds Proportion:

(e) Sun on Flower: Yes, Color Temperature: Cold, Clouds Proportion:

(f) Sun on Flower: Yes, Color Temperature: Cool, Clouds Proportion:

Figure 4.2: Example images with annotations from our dataset. Bounding boxes are shown in red in the RGB images. Black is far away or represents missing information, white is close in the depth images.

4.3 Statistics

In this section, we analyze our own dataset and plot relevant statistics. This helps us gaining an understanding of all images without having to look at each individually.

We start with the categories and their occurrence. The number of images that fulfill one or two categories can be found in Figure 4.3. Except for apparent paradoxes, almost every combination of two attributes exists (non-zero cells). This table is very useful to ensure that no obvious errors while labeling happened.

(28)

Sun on Flo_we r - Y_es Sun_on Flo_we r - N_o Sun Art_ifac ts - _Yes Sun Art_ifac ts - _No Colo r Te_mp era ture_{- W} arm Colo r Te_mp era ture_{- C} ool Wh ite S_{ky P} rop ortio n - 0_-25 % Wh ite S ky P_rop ortio n - 2_5-5 0% Wh ite S ky P_rop ortio n - 5_0-7 5% Wh ite S ky P_rop ortio n - 7_5-1 00% Clo uds Pro por_tion - 0 -25_% Clo uds_Pro por_tion - 2_5-5 0% Clo uds_Pro por_tion - 5_0-7 5% Clo_uds Pro por_tion - 7_5-1 00% Clouds Proportion - 75-100% Clouds Proportion - 50-75% Clouds Proportion - 25-50% Clouds Proportion - 0-25% White Sky Proportion - 75-100% White Sky Proportion - 50-75% White Sky Proportion - 25-50% White Sky Proportion - 0-25% Color Temperature - Cool Color Temperature - Warm Sun Artifacts - No Sun Artifacts - Yes Sun on Flower - No

Sun on Flower - Yes 489 0 154 335 167 322 266 66 97 60 305 61 28 95

- 223 19 204 79 144 91 17 32 83 120 43 38 22 - - 173 0 31 142 35 34 53 51 144 25 2 2 - - - 539 215 324 322 49 76 92 281 79 64 115 - - - - 246 0 192 24 22 8 50 42 52 102 - - - 466 165 59 107 135 375 62 14 15 - - - 357 0 0 0 149 40 51 117 - - - 83 0 0 48 20 15 0 - - - 129 0 85 44 0 0 - - - 143 143 0 0 0 - - - 425 0 0 0 - - - 104 0 0 - - - 66 0 - - - 117

Figure 4.3: Overview of the distribution of categories for the whole dataset. The dataset contains 712 images. The value in each cell stands for the number of images that fulfill both the row and column descriptions. For example, 485 images have sunshine on the flower, whereas 227 do not. As another example, 34 images have no sun artifacts and a white sky proportion of 25-50%. The lower diagonal has been omitted as this table is symmetric.

(30%). With the relatively high percentage of test images, we want to ensure that enough images are in each category of the test dataset. This way, a later evaluation focusing on individual categories still produces reliable results. The distribution of categories for the test dataset can be found in Figure 4.4.

Next, we will look at the bounding boxes. We show aggregated information about them in Figure 4.5. A histogram of all the sizes of the bounding boxes can be seen in Figure 4.5a. We can see that the average bounding box has a size of 9240 pixels2, which corresponds to a square with a length of approximately 96 pixels. As we can see, a few outliers with a size of over 50,000 pixels2exist, which corresponds to a square with a length of approximately 224 pixels. These outliers are not biologically different, they are a result of being close to the camera.

In Figure 4.5b, we can see a histogram of the number of bounding boxes per image. The maximum is 160. We can see that compared to Figure 4.5a, there is a lot more variance.

The previous figures helped describe the dataset numerically. Next, we explore the annotated dataset with a visualization to verify our work and to gain more intuition

(29)

Sun on Flo_we r - Y_es Sun_on Flo_we r - N_o Sun Art_ifac ts - _Yes Sun Art_ifac ts - _No Colo r Te_mp era ture_{- W} arm Colo r Te_mp era ture_{- C} ool Wh ite S_{ky P} rop ortio n - 0_-25 % Wh ite S ky P_rop ortio n - 2_5-5 0% Wh ite S ky P_rop ortio n - 5_0-7 5% Wh ite S ky P_rop ortio n - 7_5-1 00% Clo uds Pro por_tion - 0 -25_% Clo uds_Pro por_tion - 2_5-5 0% Clo uds_Pro por_tion - 5_0-7 5% Clo_uds Pro por_tion - 7_5-1 00% Clouds Proportion - 75-100% Clouds Proportion - 50-75% Clouds Proportion - 25-50% Clouds Proportion - 0-25% White Sky Proportion - 75-100% White Sky Proportion - 50-75% White Sky Proportion - 25-50% White Sky Proportion - 0-25% Color Temperature - Cool Color Temperature - Warm Sun Artifacts - No Sun Artifacts - Yes Sun on Flower - No

Sun on Flower - Yes 155 0 50 105 54 101 83 24 25 23 95 20 9 31

- 69 5 64 25 44 29 4 9 27 39 12 9 9 - - 55 0 13 42 12 11 14 18 44 10 0 1 - - - 169 66 103 100 17 20 32 90 22 18 39 - - - - 79 0 61 10 4 4 17 13 14 35 - - - 145 51 18 30 46 117 19 4 5 - - - 112 0 0 0 47 12 13 40 - - - 28 0 0 14 9 5 0 - - - 34 0 23 11 0 0 - - - 50 50 0 0 0 - - - 134 0 0 0 - - - 32 0 0 - - - 18 0 - - - 40

Figure 4.4: Overview of the distribution of categories for the test dataset. The test dataset contains 224 images. 0 50k 100k 150k 200k 0 200 400 600 800 1000 Value Co un t

(a) Histogram of the sizes of the bounding boxes of the

images. n = 23,050, Mean = 9240 pixels2, Standard deviation = 9134 pixels2 0 20 40 60 80 100 120 140 160 0 10 20 30 40 50 60 70 80 Value Co un t

(b) Histogram of the number of bounding boxes per

im-age. n = 712, Mean = 32 bounding boxes, Standard deviation = 24 bounding boxes

Figure 4.5: Statistics about the bounding boxes

about it. We count the number of bounding boxes per pixel and show the result in Figure 4.6. Besides the image’s visual appeal, we can see that this roughly represents an image of our dataset. The brighter spots indicate where many bounding boxes are, summed up over all images. We can see that the bottom of the image is very dark, which indicates that our labeling process is correct. At the bottom of our images, there

(30)

is usually soil with grass and the trees are growing. Some close-up pictures contain bounding boxes at the bottom of the image though, so we still see some bounding boxes being found there. A similar reasoning with the sky explains why the top of the image contains fewer bounding boxes. The middle of the image contains most bounding boxes, as we would expect after having a look at Figure 4.2. There is no strong bias towards the left or right noticeable, which is expected as most pictures are taking in the middle of orchard rows. These pictures then naturally contain trees at the left, center, and right, and hence flowers should be roughly equally distributed along the x axis. Overall, although Figure 4.6 cannot prove that our labeling process is flawless, it still indicates that the process is sound.

20 40 60 80 100 120 140 160 180 200

Figure 4.6: Heatmap of the bounding boxes. The brighter a cell, the more often this cell is part of a bounding box.

(31)

In this chapter, we will describe the experimental setup that we use to answer our research questions. We further describe the models, the training process, the data, and the metric we use to compare models.

5.1 Models

In order to research how neural networks can handle different illumination situations, we first have to train a neural network. As explained in section 2.3, object detection models can be grouped into two-stage and one-stage detectors. The former might have higher accuracy but the latter are faster. We train a model of each category. As representative of the two-stage detectors, we use Faster R-CNN [34]. For the one-stage detectors, we use RetinaNet [39]. We set both models to use a ResNet-50 [21] as a backbone to have a fair comparison. This means that first the image is fed through the backbone. Then, the output of the layers of this backbone are fed into the object detection parts of the models. For the Faster R-CNN, being a two-stage detector, a region proposal network is also part of the architecture.

5.2 Metric

After we train our models, we need a way to compare them. Similarly to Everingham et al. [41] and Lin et al. [42], we use mean Average Precision (mAP) as our metric to evaluate our models. In section 2.4, we explain how we can compute a precision-recall curve from the predictions of a model. The predictions in this case are the bounding boxes with confidence scores for each image in the test dataset. Based on the precision-recall curve, we calculate the Average Precision (AP) for an Intersection over Union (IoU) threshold t (APt) by calculating the area underneath the interpolated

curve. [41] This value summarizes the performance of a model. However, it is still depended on t. To remove this dependency, we calculate APt for multiple values of

t and then average those. This value is called mAP. We follow Lin et al. [42] and use t ∈ [0.50, 0.55, 0.60, . . . , 0.95].

(32)

5.3 Training

For development, we use Python and a framework called Detectron21, which is based on the deep learning library PyTorch2. It offers functionality to train object detection models and allows to switch models easily. It also uses best practices used by researchers at Facebook. This means we get to focus on our contributions and have a low risk of implementation errors. This also enables us to use pre-trained models easily. The configurations we used for RetinaNet3 and Faster R-CNN4 are available online.

We start by utilizing pre-trained models provided by Facebook. They train a ResNet-50 on the ImageNet [70] dataset. Next, they train RetinaNet or Faster R-CNN on the COCO [42] dataset. We then use these pre-trained models and train on our training dataset. We train until the performance on the validation dataset has peaked. In our regular setting, this happens after 8000 images. We also train some models from scratch. This means that we don’t use pre-trained weights and instead randomly initialize them. In that case, we train for 40,000 images. In both cases and for both models, we use stochastic gradient descent [22] as an optimizer with a learning rate of 1e-3, which we found to perform best. Like the original training settings, we apply random resizing and random flipping as training data augmentations.

5.4 Data

When we train the models on our own dataset, depending on the experiment, they receive either a three-dimensional (RGB) or four-dimensional (RGB + depth) as input. The models expect uint8 as inputs, meaning the values range from 0 to 255. The RGB data is already in this format so no additional processing is necessary. However, the depth data type is uint16, which means the values range from 0 to 65,535. Each step corresponds to a distance of 0.001 meter (0.1 centimeter). A naive version of passing the depth data might be to divide each value of the depth by 256 and cast the result to uint8. In other words, this would squeeze the whole depth image into a range from 0 to 255. This changes the depth resolution from 0.001 meter to 0.256 meter (25.6 centimeter). As the objects of interests are flowers with sizes in this range, this change of resolution would remove important information. Since we want to improve the performance of object detection with the depth data, such a coarse scale is unlikely to help.

1_{https://github.com/facebookresearch/detectron2} 2_{https://pytorch.org/} 3_{https://github.com/facebookresearch/detectron2/blob/006a23980541fc876095f3560adda15129a7b568/} configs/COCO-Detection/retinanet_R_50_FPN_3x.yaml 4_{https://github.com/facebookresearch/detectron2/blob/006a23980541fc876095f3560adda15129a7b568/} configs/COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml

(33)

We can use the specifics of our dataset to keep more relevant information. Since the labeled flowers are exclusively in the front row of the trees, it makes sense to discard information about elements far away. The distance of the camera to the trees varies slightly and also flowers can grow towards or away from the camera. This means special care must be taken not to be too strict with the limits. We perform experiments with three different kinds of processing which are visualized in Figure 5.1.

(a) Range covered: 0 – 2 meter, Pixels

further away than the max range get the value 0.

(b) Range covered: 0 – 2 meter,

Pix-els further away than the max range get the value 255.

(c) Range covered: 0.5 – 1.5 meter,

Pixels further away than the max range get the value 0.

(34)

In this chapter, we discuss the results of our experiments, organized by our two overarching research questions.

6.1 RQ1: Is the performance of object detection in orchards

dependent on illumination conditions?

6.1.1 Results per Category

To answer RQ1, we train the two models, RetinaNet and Faster R-CNN on the RGB data. We then evaluate generate predictions for the test dataset. Afterwards, we evaluate these predictions while only using image belonging to a specific category, as we define them in section 4.2.

In Table 6.1, we show the results of the models trained on RGB data for the various categories. Firstly, both models show similar performance as the bars in each category do not differ significantly from each other. Secondly, we can see a a clear difference across different categories. The mAP for the category "Cloud Proportion - 0-25%" is 0.244, whereas it is 0.361 for the category "Cloud Proportion - 75-100%".

We can see from the scale of the y-axis that the mAP is not very high. However, metrics are a mathematical concept. They might not always capture what we humans want to express. Hence, it is important to inspect the results visually and to verify that the metric captures our intent. We show one of the best and one of the worst images per category in Figure 6.6. We do this only for the model RetinaNet trained on RGB for an easier comparison. Additionally, both models, RetinaNet and Faster R-CNN, have similar performance so not many insights could be gained.

A visual inspection of the results in Figure 6.6 suggests that pictures with more flowers lead to worse results.

6.1.2 Results per Number of Flowers

We investigate this effect by calculating the mAP per image and plotting it against the number of boxes. This is shown in Figure 6.1. The trendline is the result of a linear regression that minimizes the mean squared error. The fit is quite bad as the coefficient

(35)

Category RetinaNet RGB Faster R-CNN RGB

Sun on Flower - No 0.291 0.281

Sun on Flower - Yes 0.256 0.256

Sun Artifacts - No 0.267 0.263

Sun Artifacts - Yes 0.265 0.267

Color Temperature - Cool 0.253 0.251

Color Temperature - Warm 0.306 0.304

White Sky Proportion - 0-25% 0.264 0.264

Clouds Proportion - 0-25% 0.244 0.243

Table 6.1: Performance per Category for RetinaNet and Faster R-CNN, both trained on RGB data.

of determination (R2) is approximately 0.01. This means that the trendline can account for only 1% of the observed variance. This can also be seen visually, as there is a lot of variance for a given number of bounding boxes.

A pearson correlation test resulted in a p-value of 0.117. As this value is greater than 0.05 it is not statistically significant. As such, we cannot draw conclusions about our hypothesis that pictures with more flowers result in worse performance. More information could be gained if we had more data.

0 20 40 60 80 100 120 0 0.2 0.4 0.6 0.8 1 N boxes mAP

(36)

(i) Bad (ii) Good (a) Sun on Flower - False

(i) Bad (ii) Good

(b) Sun on Flower - True

(37)

(i) Bad (ii) Good (a) Sun Artifacts - True

(b) Color Temperature - Cool

(c) Color Temperature - Warm

(38)

(i) Bad (ii) Good (a) White Sky Proportion - 0-25%

(b) White Sky Proportion - 25-50%

(39)

(i) Bad (ii) Good (a) White Sky Proportion - 75-100%

(b) Clouds Proportion - 0-25%

(c) Clouds Proportion - 25-50%

(40)

(i) Bad (ii) Good (a) Clouds Proportion - 50-75%

(b) Clouds Proportion - 75-100%

Figure 6.6: Good and bad results per category for RetinaNet trained on RGB data. Only predictions with a score greater than 0.5 are shown. Ground truths are displayed in cyan, predictions in orange. (Part 5)

(41)

6.2 RQ2: Does adding depth as an input channel to the object

detection model improve object detection performance?

6.2.1 Depth Processing

As we have discussed in section 5.4, there are multiple ways of passing depth data to the model. In our experiments, all three methods result in similar performance with no significant difference. The reason for this will be explained later. For the remainder of the thesis, we report the models based on passed depth data like in Figure 5.1a.

6.2.2 Weights Initialization

As a difference to training on RGB data, we have to deal with initialization of the first layer of models. For RQ1, we use pre-trained weights as we found they performed the best. For RQ2, we have to explore. Because we are changing the dimensions of the input data, the shape of the weights for the first convolutional layer change as well. We experiment with three different setups:

(a) The weights for all four channels are randomly initialized. This is shown in Figure 6.7a.

(b) The weights for the first three channels, which correspond to RGB, are copied from a pre-trained model. The weights for the fourth channel, i.e. depth, is randomly initialized. This is shown in Figure 6.7b.

(c) Same as (b), except the weights of for the forth channel are initialized with one of the channels for RGB. This is shown in Figure 6.7c.

6.2.3 Results

We train models with these initialization strategies on RGBD data. For a better compar-ison, we also include two runs with RGB data, one with RetinaNet and one with Faster R-CNN.

The mAP of the validation set while training various models is shown in Figure 6.8. Several interesting effects are visible.

First, the two best performing models, shown in orange and black, are trained on RGB data, which is surprising. Since with RGBD compared to RGB data, we only add information and not remove any, the performance should at least stay the same. This could hint at implementation errors rather than conceptual problems. To inspect why this is the case, we plot the weights of the first convolutional layer for RetinaNet RGBD

(42)

(a) Before training: Channels 1-4 randomly initialized (b) Before training: Channels 1-3 copied from pre-trained,

Channel 4 randomly initialized

(c) Before training: Channels 1-4 copied from pre-trained (d) Channels 1-3 copied from pre-trained, Channel 4

ran-domly initialized: After training

Figure 6.7: Histograms of the weights of the first convolutional layer before training for different initialization strategies and after training

after training in Figure 6.7d. We can see that the weights corresponding to the depth channel now occupy values close to zero. This means that the depth input is ignored by the model. That could have various reasons, like implementation errors, architecture problems, or that the data actually does not add meaningful information.

To further investigate this, we look at the results for the different initialization strategies. Faster R-CNN trained on RGBD data with a randomly initialized first layer, shown in blue, performs significantly worse. It loses approximately 9 percentage points. Note that these models still use pre-trained weights for the remaining layers, except for the classification head which is always trained from scratch since we are training for a novel, i.e. not pre-trained for, object. It is not surprising that the performance starts at a lower position. After all, a first layer with random weights will initially give bad results. However, we would expect the mAP to converge to similar numbers since the architecture and the perceived data are the same. Considering the fact that the code changes to pass RGBD data are minimal, we can single out the factor of initialization.

(43)

Figure 6.8: Validation metrics for different representative models. One training step corresponds to a batch which means the model sees 4 images. Note that the training for the three models reaching 6k actually was continued until 10k, but no improvement is visible and hence the graph is cut off.

This observation is also being confirmed by comparing RetinaNet RGB trained from Scratch, shown in purple, and RetinaNet RGBD control trained from Scratch, shown in brown. Control here means that the weights for the depth channel are initialized to zero. This means that the model only considers the RGB data. If there were implementation errors that affect the model or the RGB data, we would expect to see a difference in performance between those two models. Although we stop the control model early, it is clearly visible that they have similar performance.

Another interesting observation is that the performances of RetinaNet and Faster R-CNN seem to be matching closely in similar settings. This is in alignment with our observations for RQ1.

Perhaps most interesting is the performance drop of Faster R-CNN RGBD with a randomly initialized first layer, shown in blue. As mentioned before, it loses quite some performance compared to the comparable settings shown in gray and olive. As explained in the previous paragraph, the performance of both architectures was consistently matching throughout our experiments, even if this single graph might not be fully providing proof of it. Because of this, we have to conclude that the first layer is crucial for providing a high performance. Randomly initialized models are not able to catch up and converge at a much lower mAP. This might be due to the fact that this first layer is from the backbone, a ResNet-50. The backbone was first trained on ImageNet and then the whole model was trained on COCO, as we explained in section 5.3. This means that it has been trained considerably and it seems to be able to transfer a lot of knowledge over to our setting. Knowledge our dataset does not provide on its own. Hence, pre-training appears to be a crucial ingredient to achieve high performance,