
Radboud University Nijmegen
Faculty of Social Sciences

Self-Supervised Learning in Horticulture

Reducing the Need for Annotated Data in Neural Network Training

Thesis MSc Artificial Intelligence

Author: Bram van Asseldonk (s4476883)
External Supervisor: Dr. Ruud Barth
Internal Supervisor: Dr. Serge Thill
Second Reader: Dr. Umut Güçlü

June 2020


Contents

0 Abstract
1 Introduction
2 Related Work
  2.1 Large-Scale Horticultural Datasets
  2.2 Data Synthesis
  2.3 Self-Supervised Learning
3 Methods
  3.1 Original Method
  3.2 Revised Method
4 Results
  4.1 Preliminary Experiments
  4.2 Main Experiment
5 Discussion
  5.1 Mask R-CNN
  5.2 Preliminary Experiments
  5.3 Main Experiment
6 Conclusions
  6.1 Mask R-CNN
  6.2 Preliminary Experiments
  6.3 Main Experiment


Abstract

While many opportunities for the use of technology driven by artificial neural networks exist in horticulture, the adoption rate of such technology remains low in the domain. Part of this is due to the large amount of time and effort required to gather annotated data for the training of neural networks. In this thesis we explore self-supervised learning as a means to reduce the amount of annotated data required for training in an attempt to make neural networks more accessible. Specifically, we test whether pre-training a neural network on an image rotation estimation task can benefit the network when it is trained on a different task afterwards. The performance of the network is compared to the performance of a network without pre-training and a network pre-trained on data from a publicly available dataset. We find that the combination of the data and the pre-training task used in this thesis negatively affects any further training on a different task. Salience maps reveal that this is due to the network learning to pay attention to pixels in the background rather than pixels in the foreground during pre-training.


Chapter 1

Introduction

The introduction of convolutional neural networks (CNNs) has led to rapid advances in the field of computer vision. In turn, these advances have enabled the development of new technologies in many fields. But not in every field. Although there is great potential for the use of convolutional neural networks in horticulture, the adoption of this technology lags behind in comparison to other domains.

The slow adoption of new technology is not inherent to the sector. Mechanisation and robotisation have led to a significant increase in food production and quality over the past few decades. Nor is the slow adoption due to a lack of possible applications. Several areas where neural networks could greatly improve performance have already been identified, such as:

In yield estimation neural networks can be used to automatically detect and segment leaves in footage of plants. The leaf area of a plant is a good estimate of the yield for many greenhouse crops [1].

In disease/pest detection neural networks can be used to either detect specific diseases or pests in crops, or they can be used for anomaly detection to identify unhealthy plants. In both cases early detection can help the farmer combat the problem before it can cause widespread damage [2], [3].

In phenotyping neural networks can offer the ability to automatically measure the extent to which certain genes express themselves under various conditions. This can be done for both quantitative and qualitative features [4].

In harvest automation neural networks can be used to locate and identify harvest targets such as ripe fruit, vegetables or leaves. The use of neural networks might also enable custom order picking where desired features such as size or ripeness can vary per order [5]–[7].

Instead, the bottleneck lies in a lack of annotated data. Most state-of-the-art CNNs are trained in a supervised manner and require a large set of annotated training samples. While large-scale annotated datasets exist for general purpose computer vision problems, e.g. Pascal VOC[8], MS COCO[9], ImageNet[10], etc., almost no such datasets exist for the horticultural domain. Manually collecting and annotating large-scale datasets is always a time consuming process, but even more so in the horticultural domain. Computer vision problems in the horticultural domain often require highly detailed annotations. The subjects of those annotations are often complex and organically shaped, which requires more time to capture correctly. Additionally, it takes time to properly capture the variation needed to train a neural network that can generalise to anything


it might encounter in practice. The appearance of a crop changes as a result of the time of day, the weather, season-bound changes, the use of various artificial light sources and different growing methods. And that is just to create a dataset for a single crop. Collecting a custom dataset for multiple crops is not just impractical, but in most cases infeasible due to the large amount of work required.

Self-supervised learning approaches can provide a solution to this problem, as they do not require human-annotated labels to learn from data. They can be employed to pre-train a neural network so it only requires a small set of human-annotated data to learn how to solve a downstream task. By reducing the amount of data that needs to be manually annotated, it becomes more feasible to collect a custom dataset and therefore to use state-of-the-art convolutional neural networks. Or in other words:

“Through self-supervised learning, state-of-the-art convolutional neural networks can achieve similar performance with less manually annotated data.”

In addition, self-supervised learning tasks (SSLTs) enable access to domain specific data, allowing a network to learn more relevant detectors at deeper levels in the network. This would mean that networks pre-trained with SSLTs could be closer to the final network state than networks pre-trained on data from different domains, and may thus require less annotated data to reach the same level of performance. Or in other words:

“Through self-supervised learning it becomes possible to pre-train state-of-the-art convolutional neural networks on domain-specific data, which in turn may reduce the amount

of data required to train the neural network.”

We unify these hypotheses and ask:

“Can self-supervised learning be used to solve the data scarcity problem in the horticultural domain?”

To investigate these hypotheses, we explore the use of a self-supervised learning task called rotation estimation in the training of U-Net[11] on a semantic segmentation task. Before doing so, we delve deeper into data availability in the horticultural domain. Chapter 2 explores datasets that are available and shows why these do not satisfy the data requirements of many computer vision problems in the horticultural domain. In addition it provides a summary of work done on data synthesis and self-supervised representation learning. Chapter 3 explains the methods and experimental setup used to validate our hypotheses. The results of these experiments can be found in chapter 4. We provide an in-depth discussion of the results in chapter 5. Our findings are reiterated in chapter 6.


Chapter 2

Related Work

The performance of (deep) convolutional neural networks is strongly related to the size of the network and the amount of available training data. With sophisticated architectures and large-scale datasets, CNNs keep improving state-of-the-art performance for many computer vision problems. However, due to the requirement of large-scale datasets, new CNN structures become increasingly inaccessible to domains where data is hard to come by.

The horticulture domain suffers from this problem. Computer vision problems in this area often deal with high variation in object classes, changing environment conditions, growth dynamics and other influences. Gathering data that generalises well to a broad range of application settings is no easy feat. In addition, these problems often require pixel level annotations. Obtaining annotations at such a detailed level can take an average of 30 minutes per image [12]. Due to the time and effort required, manual construction of a dataset is infeasible for most problems.

One approach commonly used to reduce the amount of data required to train state-of-the-art convolutional neural networks is to pre-train the network on a publicly available large-scale dataset and then fine-tune the network on a smaller custom dataset.

Another approach is not to gather, but to synthesise data. Game engines like Unreal Engine and Unity can be used to generate photo-realistic images and annotations from simulated scenes in a relatively short time and at low cost.

A third approach is to reduce the need for annotated data altogether. The field of self-supervised representation learning focuses on training useful feature extractors from unannotated data. These approaches are discussed in sections 2.1, 2.2 and 2.3 respectively.

2.1 Large-Scale Horticultural Datasets

State-of-the-art convolutional neural networks for general purpose computer vision problems are trained on up to millions of images. When data is not available on such a large scale, it can help to first train a neural network on a different dataset with more data.

Neural networks are function minimisers. They try to find the solution with the lowest average cost given a cost function (also called loss function). The learning process of neural networks is powered by gradient descent and is much like a ball rolling down a hill. Neural networks initialised with random weights are unlikely to consistently produce good solutions to a problem.


Starting from such a random initialisation, reaching a good minimum takes a while. If the network is trained on only a few samples, the network is more likely to get stuck in a local minimum. This would be the equivalent of a ball getting stuck in a hole somewhere on the side of the hill.

By pre-training the neural network on different data, it is more likely to consistently produce solutions that are somewhat useful. Instead of starting high on top of a hill, the network starts at a position closer to a minimum. The more relevant the data and task used for pre-training are to the target domain, the closer the starting position of the network is to a minimum, or the lower down the hill the ball starts its journey.

Figure 2.1: Visualisation of one-dimensional gradient descent with various starting positions. This is a simplified version of the real problem, as gradient descent for neural networks is a high-dimensional problem. A represents no pre-training, B represents pre-training with weakly relevant data and C represents pre-training with highly relevant data.

Pre-training a network changes its starting position for gradient descent. This can cause two things to happen:

1. The neural network simply starts closer to a minimum but ends up in the same minimum as a network that is not pre-trained. In this case training the neural network will be faster, but the performance will not increase in comparison to a network that was not pre-trained.

2. The neural network starts closer to a minimum and ends up in a different minimum than a network that is not pre-trained. In this case training the neural network is likely to not only be faster, but also reach a better performance in comparison to a network that was not pre-trained.

This is only the case if the data and task used for pre-training are relevant to the target task.

If a network is pre-trained on a task and/or data that is very different from the target task and data, it can negatively influence the training process as the network first has to unlearn what it has learned.

Pre-training for detection and classification is typically done on datasets like MS COCO[9] and ImageNet[10]. For segmentation tasks datasets like Pascal VOC[8] or CityScapes[13] are often used. These large-scale datasets are publicly available, contain many categories and have a large amount of samples for each category. They are perfect for general-purpose computer vision problems but their applicability to other domains can vary.

Datasets like Pascal VOC, MS COCO and ImageNet depict everyday objects such as furniture, food and vehicles and CityScapes contains traffic scenes from various German cities. From the data in these datasets, only a small portion relates to horticulture. To illustrate, for Pascal VOC only two out of twenty available classes contain plants or crops. For CityScapes, MS COCO and ImageNet this is only 1 out of 30, 15 out of 171 and 28 out of 1,000 respectively. Moreover, relevant classes are depicted in settings that are significantly different from horticultural settings which makes these datasets unsuitable for pre-training neural networks for horticultural use. Instead, ideally one would use datasets with more relevant data.


Large-scale datasets that offer a more tailored fit to the horticultural domain do exist:

• Flowers 102[14] contains 1,020 images of 102 different flowers naturally occurring in the United Kingdom.

• LeafSnap[15] contains over 30,000 images covering leaves of all tree species that naturally occur in the Northeastern United States.

• PlantVillage[16] provides nearly 20,000 images of healthy and diseased leaves of bell pepper, potato and tomato plants.

• VegFru[17] comes with over 150,000 images of almost 300 different types of vegetables and fruits found in everyday life settings.

Unfortunately these datasets come with annotations for higher order classification only (e.g. the image depicts a tree of species A) and lack the precise positional data (e.g. bounding boxes or pixel level annotations) needed for detection or segmentation tasks.

Other datasets that do contain annotations for detection or segmentation tasks each come with their own limitations.

• The Urban Trees dataset[18] contains roughly 14,500 images but covers only 18 different species of trees growing in Pasadena. While the dataset technically is suited for detection tasks, it provides annotations in the form of geographic locations rather than the commonly used bounding boxes.

• iNat2017[19] has over 850,000 images of more than 5,000 different species of plants, animals and fungi. Of these images roughly 200,000 contain plants or crops. Unfortunately, the dataset does not include bounding box annotations for plant classes, only for animals.

• CropDeep[20] provides over 30,000 images of 25 different crops and includes multiple growth stages for two of them. However, all images were collected in laboratorial greenhouses. This potentially limits its applicability to greenhouse settings only.

• Capsicum Annuum[12] contains 10,500 synthetic images of bell pepper plants in a greenhouse setting. While the dataset generalises well to real world applications, the set only covers a single crop.

To the best of our knowledge, no horticultural datasets exist that are publicly available and do not suffer from one of the problems mentioned above.

2.2 Data Synthesis

Virtual environments are already a key component in reinforcement learning and some areas of robotics. By modelling agents and placing them in a digital reality, they can learn to interact with their surroundings. Virtual agents offer several advantages over real world agents. When a virtual agent gets damaged, it can be repaired for free. Virtual agents can also be placed in environments that would be hard to create in the real world, e.g. zero gravity or immense pressure. Most importantly, large numbers of virtual agents can be run at the same time to significantly speed up learning. This knowledge can then be transferred to a real world agent. If the model resembles the real world closely enough, the real world agent is able to perform the same task. This is how OpenAI trained their dexterous hand[21].


Table 2.1: Overview of popular general-purpose and horticultural datasets.

Dataset              | Images     | Classes | Class. | Det. | Segm.
Pascal VOC[8]        | 11,530     | 20      | ✓      | ✓    | ✓
CityScapes[13]       | 20,000     | 30      | ✗      | ✗    | ✓
MS COCO[9]           | 330,000    | 171     | ✓      | ✓    | ✓
ImageNet[10]         | 14,197,122 | 1,000   | ✓      | ✓    | ✗
Flowers 102[14]      | 1,020      | 102     | ✓      | ✗    | ✗
LeafSnap[15]         | 30,866     | 185     | ✓      | ✗    | ✗
PlantVillage[16]     | 19,298     | 38      | ✓      | ✗    | ✗
VegFru[17]           | 160,731    | 70      | ✓      | ✗    | ✗
Urban Trees[18]      | 14,572     | 18      | ✓      | ✓¹   | ✗
iNat2017[19]         | 858,184    | 5,089   | ✓      | ✓    | ✗
CropDeep[20]         | 31,147     | 31      | ✓      | ✓    | ✗
Capsicum Annuum[12]  | 10,500     | 8       | ✗      | ✗    | ✓

¹ The Urban Trees dataset is suitable for detection tasks, but contains geographic location annotations instead of bounding boxes.

First, virtual environments can be used to simultaneously generate images and annotations. The rendering pipeline takes information from objects in the virtual environment and translates it to pixel values. If the rendering pipeline considers object material information, the result will be a regular image. But the rendering pipeline can also consider other information like object size and position to create bounding boxes, the distance from the camera to generate depth maps, object class to generate segmentation maps or object id to create panoptic maps. Once a virtual environment has been modelled, all this information is freely available and can be used as annotations for machine learning applications.

Second, virtual environments make it easy to add variety to a dataset. By assigning new materials to objects or even replacing entire objects in a scene it becomes possible to very quickly gather new and original images. OpenAI used this technique to randomise the visual appearance of the dexterous hand. Doing so helped their vision model bridge the gap between simulation and reality. This concept can be taken even further. By procedurally modelling (part of) the environment it theoretically becomes possible to generate an infinite amount of original images. If the virtual environment resembles the real world closely enough, the synthetic data can be used to train neural networks. Training only on synthetic data results in a performance gap when the model is transferred to real-world data, so in practice a small amount of empirical data remains necessary to fine-tune the model. Work in various fields shows that the use of synthetic images to supplement empirical images can lead to improved performance of neural networks compared to networks that were trained on empirical data alone [22]–[25].

This approach does require a realistic model of the subject. Various plant modelling methods have already been explored. The open source library OpenAlea offers functionality to generate anatomical and functional plant models from which synthetic images can be collected [26]. Benoit et al. [27] built a system that simulates and generates synthetic images of the growth of roots. Cicco et al. [28] generate synthetic images to train a crop and weed detection system. And Barth [12] has modelled the common bell pepper to generate the Capsicum Annuum dataset, containing 10,500 synthetic images of bell pepper plants in a greenhouse setting. Despite the success of these approaches, creating a model can still be a time-consuming process.


2.3 Self-Supervised Learning

Most neural networks are trained in a supervised manner. This means that for every data entry X_i in a dataset D there is a corresponding human-annotated label Y_i. The network is trained by

minimising loss using the following function:

loss(D) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} loss(X_i, Y_i)    (2.1)

where N is the number of data entries in D. Generally speaking supervised learning achieves the best performance, given that the human-annotated labels are accurate. But as shown in the previous sections, gathering data and annotations is a time consuming process.

Self-supervised learning is a form of unsupervised learning where neural networks are trained using automatically generated labels instead of human-annotated labels. For every data entry X_i in a dataset D a pseudo label P_i is automatically generated without human involvement. The network is trained by minimising loss using the following function:

loss(D) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} loss(X_i, P_i)    (2.2)

where N is the number of data entries in D. Typically self-supervised learning methods are used in combination with supervised learning to solve a problem. First a network is trained on a pretext task using a self-supervised learning task. This allows the network to learn visual features that can be of use in the second task. Next the network is trained on what is known as the downstream task. The downstream task often uses supervised learning on a small dataset with human-annotated labels. Networks trained in this manner typically have a lower performance than networks that were trained using only supervised learning, but they also require much less human-annotated data.
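To make this two-stage recipe concrete, the sketch below shows the general pattern in PyTorch. The helper names (encoder, pretext_head, downstream_head, pseudo_label_fn) are hypothetical placeholders rather than parts of any specific method; the first stage minimises equation 2.2 on pseudo labels, the second stage minimises equation 2.1 on human annotations.

```python
import torch

def pretrain_then_finetune(encoder, pretext_head, downstream_head,
                           unlabelled_loader, labelled_loader,
                           pseudo_label_fn, epochs=10):
    """Generic two-stage training: self-supervised pre-training on pseudo labels
    (eq. 2.2) followed by supervised fine-tuning on human-annotated labels (eq. 2.1)."""
    loss_fn = torch.nn.CrossEntropyLoss()

    # Stage 1: pretext task; labels are generated from the data itself.
    optimiser = torch.optim.Adam(list(encoder.parameters()) + list(pretext_head.parameters()))
    for _ in range(epochs):
        for x in unlabelled_loader:
            x_t, pseudo = pseudo_label_fn(x)          # e.g. rotate images, label = rotation index
            loss = loss_fn(pretext_head(encoder(x_t)), pseudo)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()

    # Stage 2: downstream task; fine-tune on a small human-annotated set.
    optimiser = torch.optim.Adam(list(encoder.parameters()) + list(downstream_head.parameters()))
    for _ in range(epochs):
        for x, y in labelled_loader:
            loss = loss_fn(downstream_head(encoder(x)), y)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
```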

Over the past few years, many different self-supervised learning tasks have been proposed. The most notable of these are listed below. A comparison of these tasks can be found under the section Performance Comparison. For a more complete overview of the current state of self-supervised learning we refer to [29].


Exemplar Learning

Proposed by Dosovitskiy et al. [30], exemplar learning is a classification task where a network is trained to classify variations on patches sampled from the training data. First N patches are sampled from a dataset. Next a family of transformations is defined. For each patch X_i, a number of random compositions of transformations are constructed. By applying these compositions to X_i, a set of its transformed versions S_{X_i} is created. Each set is then assigned a surrogate class label and a convolutional neural network is trained to predict the surrogate class of transformed patches. Dosovitskiy et al. make use of translation, scaling, rotation, contrast and colour transformations in their family of transformations, but other transformations can be added freely. To avoid patches that contain no objects or object parts, patches are sampled with a probability proportional to the mean squared gradient magnitude within the patch.

Figure 2.2: In exemplar learning, patches are sampled from a dataset. From each patch a set of transformed patches is generated. All transformed patches with the same initial patch are assigned the same surrogate label. The network is trained to classify the transformed patches according to their surrogate labels. Images of the cat and dog are taken from [31].

Context Prediction

Context prediction was proposed as a self-supervised learning task by Doersch et al. [32]. In context prediction, a central patch is sampled from the input image. Next, a second patch is sampled from one of eight possible relative positions (top left, top, top right, left, right, bottom left, bottom or bottom right). Both patches are fed as input to a convolutional neural network and the network is trained to predict the position of the second patch in relation to the first patch. To prevent trivial solutions, Doersch et al. introduced a gap between the relative positions of up to half the size of a patch. In addition they also jitter the relative positions by introducing an offset of up to 7 pixels.

Figure 2.3: In context prediction, a central patch is sampled from the input image. A second patch is sampled from one of eight possible relative positions. Both patches are shown to the neural network. The network is trained to predict from which relative position the second patch was sampled. Image of the cat is taken from [31].


Jigsaw Puzzles

Noroozi et al. [33] proposed jigsaw puzzles, a self-supervised learning task similar to context prediction. In the jigsaw puzzle task, nine patches forming a 3×3 grid are sampled from the input image. The patches are reordered using a randomly chosen permutation from a set of predefined permutations. A (siamese) neural network is then trained to predict the index of the permutation used to randomise the patch order. To prevent trivial solutions the authors follow the example of Doersch et al. and leave a gap between the patches. They also feed multiple jigsaws of the same image to the network to ensure that it cannot learn to associate certain features with absolute positions.

Noroozi et al. [34] propose an extension to this task. Taking inspiration from representation learning in the text domain, they corrupt up to two of the patches by swapping them with patches from another image. This makes the task remarkably more complex as the network now needs to detect which patches are corrupted and solve the puzzle with only the remaining patches. Noroozi et al. call their extended task jigsaw++.

Kim et al. [35] take this concept even further and propose damaged jigsaw puzzles. Damaged jigsaw puzzles combine the conventional jigsaw puzzle with inpainting and colourisation. By dropping one of the nine patches in its entirety and the ab channels of the other patches, the network is forced to solve three problems simultaneously.

Figure 2.4: In jigsaw puzzles, nine patches forming a 3×3 grid are sampled from the input image. The patches are reordered using a permutation from a predefined set of permutations. The network is trained to predict the index of the permutation. Image of the cat is taken from [31].
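As an illustration, the following sketch generates jigsaw pseudo labels for a single image tensor. The original work builds a permutation set whose entries are maximally different from each other and leaves gaps between patches; for brevity this sketch samples the permutation set at random and cuts adjacent patches, so it should be read as a simplified, hypothetical variant.

```python
import itertools
import random
import torch

# Simplified permutation set: 100 random permutations of the 9 grid positions.
PERMUTATIONS = random.sample(list(itertools.permutations(range(9))), 100)

def jigsaw_sample(image: torch.Tensor, patch_size: int):
    """Cut a 3x3 grid of patches from `image` (C, H, W), shuffle them with a random
    permutation from the predefined set, and return the shuffled patches together
    with the permutation index as pseudo label."""
    assert image.shape[1] >= 3 * patch_size and image.shape[2] >= 3 * patch_size
    tiles = [image[:, i * patch_size:(i + 1) * patch_size, j * patch_size:(j + 1) * patch_size]
             for i in range(3) for j in range(3)]
    label = random.randrange(len(PERMUTATIONS))               # pseudo label = permutation index
    shuffled = [tiles[p] for p in PERMUTATIONS[label]]
    return torch.stack(shuffled), label
```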

Image Colourisation

Image colourisation is a research field on its own, but can also be used as a self-supervised learning task. To do so, input images are transformed to the CIE Lab colourspace, where L represents the lightness level, a represents the green-red component, and b represents the blue-yellow component. The L channel of the image is then used as input and the neural network is trained to predict the a and b channels such that when the three are combined, they form an image with plausible colourisation. Literature shows there are multiple approaches to this problem. We highlight work that explores colourisation in the context of self-supervised representation learning.

Zhang et al. [36] approach the task as a classification problem where the ab output space is quantised in bins. They show that pixels in natural images have a bias towards low ab values. To combat this, they reweight the loss of each pixel based on the pixel’s colour rarity.


Iizuka et al. [37] take a different approach and handle the task as a regression problem. They normalise the ab components globally so their values lie in the [0, 1] range and can be predicted using a Sigmoid function.

Larsson et al. [38], [39] take an approach similar to Zhang et al. but use a hypercolumn that combines the activation from several layers to predict the colour values instead of upsampling to reach the original image resolution. In addition they show that using the Hue Chroma Lightness (HCL) colourspace instead of the Lab colourspace can result in slightly better performance, but only if chromatic fading is applied.

Zhang et al. [40] extend the problem by learning the opposite conversion, colour channels to greyscale image, as well. To achieve this they employ a split-brain autoencoder network architecture. In addition, Zhang et al. show that their method is not restrained to images in the Lab colourspace, but can also work with RGB and depth image pairs.

Figure 2.5: In colourisation, the input image is split into a greyscale image and colour layers. The network receives the greyscale image as input and is trained to predict the colour channels. When the prediction is combined with the greyscale image, it should result in a realistically coloured image. Image of the cat is taken from [31].

Inpainting

Like colourisation, image inpainting is a research field on its own. In image inpainting, part of the input is missing and the network is tasked with generating a plausible reconstruction of the missing region(s) based on its surroundings. Pathak et al. [41] argue that in order to successfully do this, a network needs to develop an understanding of the contents of an image and can therefore be used as a self-supervised learning task.

In their work, Pathak et al. explore different ways to occlude parts of an image. They find that features learned from inpainting where the centre block is occluded do not generalise as well as features learned from inpainting where random blocks or shapes are occluded.

Figure 2.6: In inpainting, part of the input image is occluded. The network is trained to reconstruct the occluded area(s) from the context of the image. Image of the cat is taken from [31].


Counting

Noroozi et al. [42] propose counting as a self-supervised learning task. In counting, an input image is subdivided into a 2×2 grid to create four tiles. The entire image is also scaled down, such that the scaled version of the image matches the size of a tile. The four tiles and the scaled down version of the image are all passed through the same convolutional neural network. The authors use the fact that the sum of visual primitives in the tiles must match the number of visual primitives in the scaled down version of the image to train the network. To avoid the trivial solution where the network outputs only zeros, regardless of input, the authors use a contrastive loss. A second image is randomly chosen and scaled down. The contrastive loss tries to minimise the difference between the first scaled down image and the sum of the tiles while maximising the difference between the second scaled down image and the sum of the tiles.

Figure 2.7: In counting, the features of a scaled version of the input are equated to the sum of the features of the tiled image. To avoid trivial solutions, the sum of the features of the tiled image must be different from the features of a scaled version of a different image. Image of the cat is taken from [31].

Rotation Estimation

Proposed by Gidaris et al. [43], rotation estimation is a self-supervised classification task where a network tries to classify the rotation of an image. Despite its simplicity, the task requires a network to localise salient objects in an image, recognise their orientation and object type, and then infer the image orientation. Gidaris et al. find that this task works best when images are rotated by a multiple of 90°.

Figure 2.8: In rotation estimation, the input image is squared and rotated by a multiple of 90°. The network is trained to predict the rotation of the image. Image of the cat is taken from [31].
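Generating the pseudo labels for this task is straightforward; a minimal sketch (assuming square image tensors of shape (C, H, W)) could look as follows, with the rotation index k serving directly as the class label.

```python
import random
import torch

def rotation_pretext_sample(image: torch.Tensor):
    """Rotate an unannotated image by a random multiple of 90 degrees and return the
    rotated image together with the rotation index k as pseudo label."""
    k = random.randint(0, 3)                       # 0, 1, 2 or 3 quarter turns
    rotated = torch.rot90(image, k, dims=(1, 2))   # rotate in the spatial plane
    return rotated, k
```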


Deep Clustering

Caron et al. [44] propose a task where the pseudo labels are not generated from the input image, but rather from the network output. In deep clustering, images are translated to features by passing them through a convolutional neural network. These features are clustered using any clustering algorithm. The cluster assignments are used as pseudo labels to train the network. After each epoch, the generated features are reclustered and the new cluster assignments are used as pseudo labels for the next epoch.

This task is prone to trivial solutions. One solution to the clustering problem is to assign all features to the same cluster and leave the other clusters empty. To avoid this, clusters that become empty are given a new centroid that consists of the centroid of a non-empty cluster with a small perturbation. Another possible trivial solution has to do with classification. If one cluster contains significantly more samples than the other clusters, the network can learn to only output the class index of the largest cluster. To avoid class imbalance problems, images can be sampled based on a uniform distribution over classes. Another solution is to weigh the contribution of each image to the loss by the inverse of the size of its assigned cluster.

Figure 2.9: In deep clustering, input images are translated to features using a neural network. At the start of each epoch features are clustered. The cluster assignments are used as pseudo labels to train the network. Images of the cat and dog are taken from [31].
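The sketch below shows one epoch of this procedure, assuming an `encoder` that maps image batches to feature vectors, a `classifier` over k cluster indices and a data loader that iterates the images in a fixed order; all of these names are illustrative assumptions. The safeguards against trivial solutions described above (reassigning empty clusters, uniform sampling over clusters) are omitted for brevity.

```python
import torch
from sklearn.cluster import KMeans

def deep_clustering_epoch(encoder, classifier, loader, optimiser, k=100, device="cpu"):
    """One deep-clustering epoch: cluster the current features, then train on the
    cluster assignments as pseudo labels."""
    # 1) Compute features for the whole dataset with the current network.
    encoder.eval()
    with torch.no_grad():
        features = torch.cat([encoder(batch.to(device)).cpu() for batch in loader])
    # 2) Cluster the features; the assignments become this epoch's pseudo labels.
    pseudo_labels = torch.as_tensor(KMeans(n_clusters=k).fit_predict(features.numpy())).long()
    # 3) Train encoder and classifier to predict the pseudo labels.
    encoder.train()
    loss_fn = torch.nn.CrossEntropyLoss()
    start = 0
    for batch in loader:
        targets = pseudo_labels[start:start + len(batch)].to(device)
        start += len(batch)
        loss = loss_fn(classifier(encoder(batch.to(device))), targets)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
```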

Performance Comparison

To draw a fair comparison between the quality of the features learned through different self-supervised learning tasks, the performance on a downstream task can be used. In literature, the Pascal VOC [8] classification, detection and semantic segmentation tasks and the ImageNet [10] classification task are often used for this.

For evaluation on the Pascal VOC tasks AlexNet [45] is initialised with weights obtained from the pretext task and is then fine-tuned on data from the Pascal VOC dataset. Classification, detection and segmentation performances are all reported.

For evaluation on ImageNet, AlexNet is initialised with weights obtained from the pretext task. The weights from conv1 up to the displayed layer are frozen and a linear classifier consisting of a single linear layer followed by a Softmax is trained to classify ImageNet images from the output of the displayed layer. To account for differences in feature map size, each feature map is resized through bilinear interpolation such that all flattened feature maps have approximately the same dimensions. The top-1 accuracy of the classifier is reported.


Table 2.2 contains an overview of the reported performance of each self-supervised learning task on the Pascal VOC and ImageNet tasks. These performances can be improved in several ways. Work by Doersch and Zisserman [46] and by Kim et al. [35] shows that combining multiple self-supervised learning tasks almost always results in a better performance on the downstream task. Results from [34], [44], [46]–[48] show that the use of deeper networks such as VGG [49] and ResNet [50] can further boost performance. Noroozi et al. [44] show that it is also possible to distil knowledge from deeper networks back into smaller networks such as AlexNet, resulting in an additional performance boost.

Table 2.2: Evaluation of SSLTs on the classification, detection and semantic segmentation tasks for Pascal VOC and the ImageNet classification task. Results with * are not from the original paper, but were achieved with modified weight initialisation according to [51]. Results from [30], [37] could not be included because these works do not report performance on the Pascal VOC or ImageNet dataset. In addition, [30] uses a different network architecture making a direct comparison unfair.

                     |          Pascal VOC           |                    ImageNet
Method               | Ref  | Class. | Det.  | Segm. | Ref  | Conv1 | Conv2 | Conv3 | Conv4 | Conv5
Supervised           | [45] | 79.9   | 59.1  | 48.0  | [40] | 19.3  | 36.3  | 44.2  | 48.3  | 50.5
Random               | [41] | 53.3   | 43.4  | 19.8  | [40] | 11.6  | 17.1  | 16.9  | 16.3  | 14.1
Inpainting[41]       | [41] | 56.5   | 44.5  | 29.7  | [40] | 14.1  | 20.7  | 21.0  | 19.8  | 15.5
Colourisation[36]    | [36] | 65.9   | 46.9  | 35.6  | [40] | 12.5  | 24.5  | 30.4  | 31.5  | 30.3
Split-Brain[40]      | [40] | 67.1   | 46.7  | 36.0  | [40] | 17.7  | 29.3  | 35.4  | 35.2  | 32.8
Context[32]          | [51] | 65.3*  | 51.1* | -     | [40] | 16.2  | 23.3  | 30.2  | 31.7  | 29.6
Counting[42]         | [42] | 67.7   | 51.4  | 36.6  | [42] | 18.0  | 30.6  | 34.3  | 32.5  | 25.7
Jigsaw[33]           | [33] | 67.7   | 53.2  | 37.6  | [35] | 18.2  | 28.8  | 34.0  | 33.9  | 27.1
Damaged Jigsaw[35]   | [35] | 69.2   | 52.4  | 39.3  | [35] | 14.5  | 27.2  | 32.8  | 34.3  | 32.9
Jigsaw++[34]         | [34] | 69.8   | 55.5  | 38.1  | [34] | 18.2  | 28.7  | 34.1  | 33.2  | 28.0
Rotation[43]         | [43] | 73.0   | 54.4  | 39.1  | [43] | 18.8  | 31.7  | 38.7  | 38.2  | 36.5
DeepCluster[44]      | [44] | 73.7   | 55.4  | 45.1  | [44] | 12.9  | 29.2  | 38.8  | 39.8  | 36.1


Chapter 3

Methods

The aim of this thesis is to reduce the amount of annotated data required to train a state-of-the-art convolutional neural network for horticultural computer vision problems. Our suggested approach employs self-supervised learning to overcome the lack of available annotated data. We train two CNNs to perform a downstream task on an existing dataset. The first network is trained traditionally using supervised learning. The second network is pre-trained using a self-supervised learning task before it is fine-tuned on the downstream task. The training behaviour and performance of the two networks are compared to draw a conclusion about the usability of this approach.

In this chapter we describe two methods to execute our approach: the original method, which we were unable to execute because we could not gain access to the materials needed, and a revised method that we did execute. We still describe the original method as preliminary work on this approach has led to an interesting insight worth sharing. Moreover, we believe the original method can lead to additional insights that the revised method cannot provide.

3.1 Original Method

Data

Initially we planned on working with the CropDeep dataset. The CropDeep dataset contains over 30,000 images, close to 50,000 individual bounding box annotations and covers 25 different types of crop. What makes this dataset particularly interesting is that it contains multiple growth stages for the two largest greenhouse crops: tomato and cucumber. Unfortunately we were unable to get in touch with the authors of the CropDeep dataset and subsequently were unable to gain access to the dataset.

Self-Supervised Learning Task

To examine whether self-supervised pre-training is a viable alternative to supervised pre-training we compare a network pre-trained using a self-supervised learning task to an identical network pre-trained using supervised learning. We choose to use rotation prediction as our self-supervised learning task for its simplicity combined with its high performance. To obtain the best possible results, it would be better to opt for the deep clustering task. However, this task requires additional computational power in order to perform clustering at the start of each epoch. To keep the computational power required to test our method within the limits of our hardware, we settle for the slightly lower performance of rotation prediction.


In the original paper on rotation estimation [43], the authors indicate that they achieved slightly better performance when they showed the network all rotations of an image in the same batch. We deviate from this practice and do not provide all four rotations of an image in the same mini-batch, mainly because we are constrained by the hardware we use and cannot use a batch size of four without severely reducing the resolution of the images. Instead we generate the four rotations of each image when loading the dataset for the first time and randomly shuffle the order of all samples at the start of each epoch.
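A minimal sketch of this loading strategy is given below; the class name and tensor shapes are assumptions for illustration, not the exact implementation used in this thesis. Each image is expanded into its four rotations once, and the DataLoader reshuffles all samples at the start of every epoch.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RotationDataset(Dataset):
    """Expands every (square) image tensor of shape (C, H, W) into its four rotations
    when the dataset is constructed; the rotation index serves as pseudo label."""
    def __init__(self, images):
        self.samples = [(torch.rot90(img, k, dims=(1, 2)), k)
                        for img in images for k in range(4)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

# shuffle=True re-shuffles the rotated samples at the start of every epoch, e.g.:
# loader = DataLoader(RotationDataset(images), batch_size=2, shuffle=True)
```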

Downstream Task

Originally we decided to focus our attention on detection tasks rather than segmentation tasks. Choosing a detection task as our downstream task offers three advantages:

1. Many computer vision problems in the horticultural domain can be framed as object detection tasks. We only evaluate our approach on one problem, but in theory our approach should be applicable to all these problems.

2. Detection tasks tend to be simpler than segmentation tasks. Segmentation tasks require more detailed annotations and often use neural networks with more trainable parameters. We try to avoid networks with large numbers of trainable parameters to cater to the hardware available to us.

3. We specifically choose crop detection as our downstream task so we can make use of the CropDeep dataset. This saves us the time and effort of collecting and annotating a custom dataset.

In crop detection a neural network is tasked with locating all instances of crops in an image. This is done by generating a bounding box for every crop in the image. Bounding boxes should be just large enough to encompass the entire crop.

Model

We were planning on using a Faster R-CNN [52] network architecture for the neural network. The Faster R-CNN architecture shows good performance on various benchmarks and offers a solid basis for our experiments. The performance of the network can be improved by transforming the backbone network into a Feature Pyramid Network [53]. Feature pyramid networks offer access to high-level feature maps at multiple scales. This improves the network's capacity to handle objects of a variety of sizes. The Faster R-CNN architecture can also be extended with additional network heads to perform other tasks such as instance segmentation and pose estimation [54]. Especially instance segmentation can be particularly interesting in the context of tasks like yield estimation where it is not only important to locate leaves, but also to compute the area they cover.

Many implementations of the Faster R-CNN network architecture are freely available along with weights for the network pre-trained on ImageNet. We opt to use a Pytorch implementation of the network provided in Facebook AI Research’s Detectron2 [55] (shown in fig 3.1). This implementation uses a ResNet-50+FPN backbone [50]. Literature shows that ResNet is well-suited for adaptations with self-supervised learning tasks [46]–[48].

It is not possible to pre-train the entire network architecture at once using the rotation estimation task, as any layers beyond the backbone either require positional information or do not learn anything meaningful from the rotation estimation task. Instead only the ResNet-50 backbone is pre-trained. To facilitate this a linear classifier is added behind the final layer of the ResNet-50


Figure 3.1: Architecture of the generalised Feature Pyramid Network (FPN) R-CNN network in Detectron2. The model can be extended for other tasks by adding additional heads to the standard ROI heads. The image is inspired by an image from [56].

backbone. After the backbone has been pre-trained, the full network is fine-tuned in a supervised manner.

In our preliminary research we also explored a Mask R-CNN implementation provided in Facebook AI Research's Detectron2 and found that this network can reach surprising performance levels using very little data. We collected a custom dataset containing images of cucumber plants growing in a greenhouse and manually gathered instance mask annotations of the most visible leaves for fifty of those images. On average five to ten annotations were gathered per image. Collecting these annotations took us roughly two and a half days with an average annotation time of 24 minutes per image. A network was then trained on a single image containing five annotations and evaluated on thirty images using COCO metrics. The network reached an mAP_{IoU=0.5} score of roughly 0.5 and an AR_{max=100} score of just over 0.5. When the training set was increased to forty images and the validation set was reduced to ten images, the network was able to reach an mAP_{IoU=0.5} of 0.783 and an AR_{max=100} of 0.647. We discuss the implications of these results in chapter 5.


Figure 3.2: Results after training Mask R-CNN with just forty images. a) Ground truth annotations. b) Predictions made by the network.

Experimental Setup

We bring these elements together in two experiments designed to validate our hypotheses from chapter 1. As mentioned before, we were unable to gain access to the CropDeep dataset and as a result were unable to perform these experiments. We still share our experimental setup to provide a complete overview and because we feel that the second experiment in particular offers greater insight than the experiments we perform instead.

Experiment 1

The first experiment is aimed at our first hypothesis:

“Through self-supervised learning, state-of-the-art convolutional neural networks can achieve similar performance with less manually annotated data.”

To validate this statement the CropDeep dataset is split in three different subsets. A training set containing 80% of the data or 24,918 images, a test set containing 10% of the data or 3,114 images and a validation set containing the remaining 10% or 3,114 images. The training set is divided once more. Part of the training data is used for self-supervised pre-training of the ResNet-50 backbone. The other part of the training data is used to fine-tune the Faster R-CNN network in a supervised manner. The Faster R-CNN network is trained five times:

I. using 0% of the training data for self-supervised pre-training of the ResNet-50 backbone and 100% of the data for supervised fine-tuning of the network. This provides a baseline of the network’s performance.

II. using 0% of the training data for self-supervised pre-training of the ResNet-50 backbone and 67% of the data for supervised fine-tuning of the network. This is done to quantify the performance drop caused by training with less data.

III. using 33% of the training data for self-supervised pre-training of the ResNet-50 backbone and 67% of the data for supervised fine-tuning of the network. This offers insight into the added benefit of pre-training using self-supervised learning and a small dataset.

IV. using 0% of the training data for self-supervised pre-training of the ResNet-50 backbone and 33% of the data for supervised fine-tuning of the network. This is done to quantify the performance drop caused by training with much less data.

V. using 67% of the training data for self-supervised pre-training of the ResNet-50 backbone and 33% of the data for supervised fine-tuning of the network.

All networks are trained until they reach convergence. The validation set is used to prevent overfitting on the training data. The performance of the networks is measured on the test set using the COCO metrics mean Average Precision (mAP) and Average Recall (AR).

If the hypothesis is correct, we expect to see networks III and V outperform networks II and IV respectively. Ideally we would see network III or V reach a performance similar to that of network I; however, it is unlikely that the self-supervised learning task would provide a learning signal as strong as that of the supervised learning task. The self-supervised learning task might well require several times more pre-training samples than the number of annotated samples removed from supervised training in order to reach the same level of performance.


Experiment 2

The second experiment is aimed at our second hypothesis:

“Through self-supervised learning it becomes possible to pre-train state-of-the-art convolutional neural networks on domain-specific data, which in turn may reduce the amount

of data required to train the neural network.”

The use of the CropDeep dataset allows us to draw a comparison between networks pre-trained on data with various levels of domain relevance. We narrow the downstream detection task down to tomato detection only. We split the data of the tomato class in the CropDeep dataset in four subsets: a training set for fine-tuning containing 250 samples, a training set for pre-training containing 500 samples, a test set containing 100 samples and a validation set also containing 100 samples. We train the Faster R-CNN network five times:

I. using no data for self-supervised pre-training of the ResNet-50 backbone. This provides a baseline of the network performance.

II. using data from a different domain for self-supervised pre-training of the ResNet-50 backbone. Instead of pre-training this backbone ourselves, we use the readily available network weights trained on ImageNet. We take care to only load these weights into the backbone network and initialise the rest of the network the same way as the others (a minimal sketch of this follows the list).

III. using data from a narrowly related domain for self-supervised pre-training of the ResNet-50 backbone. We use 500 samples from the cucumber class in the CropDeep dataset, as cucumbers are grown in a similar fashion to tomatoes and both classes contain images in a greenhouse setting.

IV. using data from a broadly related domain for self-supervised pre-training of the ResNet-50 backbone. We use 100 samples each from the cucumber, lemon, chili pepper, watermelon and pumpkin classes in the CropDeep dataset, resulting in a total of 500 samples. Using multiple crops leads to a bigger variety in the dataset and can help the network overcome crop-specific differences such as colour or shape.

V. using data from the target domain for self-supervised pre-training of the ResNet-50 backbone. We use 500 samples from the tomato class in the CropDeep dataset.
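The sketch below illustrates how the backbone-only initialisation of setting II can be realised. It uses a plain torchvision ResNet-50 as a stand-in for the ResNet-50+FPN backbone in Detectron2, and the `Detector` class with its single linear head is a hypothetical placeholder for the full Faster R-CNN network.

```python
import torch
import torchvision

class Detector(torch.nn.Module):
    """Stand-in for the detection network: a ResNet-50 backbone plus a task head."""
    def __init__(self):
        super().__init__()
        self.backbone = torchvision.models.resnet50(weights=None)   # randomly initialised
        self.head = torch.nn.Linear(1000, 4)                        # placeholder head

    def forward(self, x):
        return self.head(self.backbone(x))

model = Detector()
# Setting II: load ImageNet weights into the backbone only; the head keeps its random
# initialisation, as do the remaining detector components in the real network.
imagenet = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1).state_dict()
model.backbone.load_state_dict(imagenet)
```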

All networks are fine-tuned in a supervised manner using the 250 samples of the tomato class reserved for fine-tuning. All networks are trained until they reach convergence or begin to overfit. The validation set is used to monitor overfitting on the training data. The performance of the networks is measured on a test set using the COCO metrics mean Average Precision (mAP) and Average Recall (AR).

If our hypothesis is correct, we expect to see each network outperform the previous networks. We expect samples from the ImageNet dataset to be less relevant than domain data and thus to result in less relevant feature extractors. However, the ImageNet dataset contains orders of magnitude more samples than the datasets we use for pre-training. It is not inconceivable that this offers an advantage that cannot be overcome by a relatively small amount of domain-related data. We choose not to address this imbalance as the weights for networks fully trained on ImageNet are readily available and are commonly used as a starting point for neural network training. If pre-training on ImageNet outperforms training on domain-specific data, then that is an important result in and of itself.


3.2 Revised Method

After it became apparent that it would not be possible to gain access to the CropDeep dataset we reconsidered some of our decisions and revised our method. The experiments described in this section are the experiments we executed and of which the results are reported in chapter 4.

Data

Instead of the CropDeep dataset we use the Capsicum Annuum dataset. This dataset contains 10,500 synthetic images of bell pepper plants in a greenhouse setting. To ensure that the synthetic images closely represent real-world images, images are rendered using a regular structural modular plant model with a multi-scale representation for side shoots. This enables detailed plant part modelling based on empirical data. Figure 3.3 shows how closely the synthetic data resembles real-world examples. For a more in-depth comparison between the synthetic and empirical data we refer to the work of R. Barth [12]. The Capsicum Annuum dataset contains both depth map and segmentation map annotations.

In addition we make use of the CityScapes dataset. The CityScapes dataset contains 20,000 images of traffic situations in fifty different cities in Germany. It comes with detailed semantic segmentation annotations of thirty different classes for 5,000 samples and coarse annotations for the rest. The dataset is often used as a benchmark to test and compare the performance of new network architectures for semantic segmentation tasks.

Self-Supervised Learning Task

Despite the revision of our approach and the use of a different dataset, we see no reason to revise our choice for the rotation estimation task. There are self-supervised learning tasks that generate output more similar to segmentation maps, e.g. colourisation or inpainting. These tasks might lend themselves better to the type of network architecture typically used for segmentation problems as they can target all layers of the network. However these learning tasks also lead to a lower performance than rotation estimation.

Figure 3.3: Examples from the Capsicum Annuum dataset. a) Empirical images of bell pepper plants in a greenhouse setting. b) Synthetic images generated for the Capsicum Annuum dataset. c) Segmentation labels of the images in row b. Class labels: background, leaves, peppers, peduncles, stems, shoots and leaf stems, wires and cuts.


To measure the performance of a network on the rotation estimation task, we measure the class accuracy (eq. 3.2) and average accuracy (eq. 3.1) of the model. These are defined as:

acc = \frac{1}{|C|} \sum_{c \in C} acc(c),    (3.1)

acc(c) = \frac{|\{d \in D \mid L^{d}_{ps} = c \wedge L^{d}_{gt} = c\}|}{|\{d \in D \mid L^{d}_{gt} = c\}|},    (3.2)

where C is the set of all classes, D is a dataset, L^{d}_{ps} is the predicted class label for sample d and L^{d}_{gt} is the ground truth label for sample d.
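A small sketch of how these accuracies can be computed from predicted and ground truth rotation labels over a dataset (a hypothetical helper, assuming 1-D tensors of class indices):

```python
import torch

def rotation_accuracies(predictions: torch.Tensor, targets: torch.Tensor, num_classes: int = 4):
    """Class accuracy (eq. 3.2) for every class and the average accuracy (eq. 3.1)."""
    class_acc = {}
    for c in range(num_classes):
        mask = targets == c                                   # samples with ground truth label c
        if mask.sum() == 0:
            continue                                          # class absent from the evaluation set
        class_acc[c] = (predictions[mask] == c).float().mean().item()
    avg_acc = sum(class_acc.values()) / len(class_acc)        # eq. 3.1: mean over classes
    return class_acc, avg_acc
```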

Downstream Task

Instead of training and evaluating the neural networks on a crop detection task, we opt to use a semantic segmentation task. This choice is mainly motivated by the fact that we were able to gain access to the Capsicum Annuum dataset. However, segmentation tasks are still highly relevant for horticultural computer vision problems (e.g. crop handling, phenotyping and disease detection). The Capsicum Annuum dataset comes with semantic segmentation maps, so we do not have to worry about manually gathering annotations. We compensate for the memory requirements of a larger network by lowering the resolution of our input and output.

Specifically, we use a plant part segmentation task. In plant part segmentation a network is tasked with assigning each pixel of an image to a plant part class. We use the Capsicum Annuum dataset where every pixel belongs to one of the following: background, leaf, pepper, peduncle, stem, shoot or leaf stem, wire or cut.

To measure the performance of a network on the plant part segmentation task, we use the Jaccard Index, also known as the Intersection over Union (IoU). We measure both the class IoU scores (eq. 3.4) and the average IoU score (eq. 3.3). These are defined as:

IoU = \frac{1}{|C|} \sum_{c \in C} IoU(c),    (3.3)

IoU(c) = \frac{1}{|D|} \sum_{I \in D} \frac{|\{p \in I \mid L_{ps}(p) = c\} \cap \{p \in I \mid L_{gt}(p) = c\}|}{|\{p \in I \mid L_{ps}(p) = c\} \cup \{p \in I \mid L_{gt}(p) = c\}|},    (3.4)

where C is the set of all classes, I is an image in dataset D, L_{ps}(p) is the predicted class label for pixel p and L_{gt}(p) is the ground truth label for pixel p.

Model

Because semantic segmentation is fundamentally different from object detection we switch to the U-Net[11] architecture (fig. 3.4). We choose this architecture for its relative simplicity and popularity. In essence U-Net is an auto-encoder network with skip connections between layers with the same dimensions. While the U-Net network architecture was originally developed for biomedical image segmentation, it has since found use in other applications such as pansharpening [57], volumetric segmentation [58] and general image segmentation [59].

Many implementations of the U-Net architecture are freely available, including a Tensorflow implementation from the authors themselves. However, our preference goes to Pytorch as we have already done some preliminary work that can be reused if the model is implemented in Pytorch. We use an implementation by J. van Vugt [60]. Unfortunately this implementation deviates slightly from the original architecture.


Figure 3.4: Architecture of U-Net. Our implementation differs slightly from the original architecture in [11]. We explore two modifications to the original network: padding in the convolutional layers and batch normalisation after every activation function.

U-Net is designed for semantic segmentation tasks. We modify the network to work with the rotation estimation task by cutting the network in half at its narrowest point and adding a linear classifier to the output of the first half. Since U-Net is an auto-encoder network this is the equivalent of replacing the decoder by a linear classifier. After the encoder is trained with the rotation estimation task we load the network weights into the original architecture.
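A minimal sketch of this modification is shown below. The `encoder` module is assumed to return the bottleneck feature map; the global average pooling in front of the linear classifier is an implementation choice of this sketch rather than a detail prescribed by the thesis.

```python
import torch.nn as nn

class RotationPretrainNet(nn.Module):
    """First half of U-Net (the encoder) followed by a linear classifier over the
    four rotation classes, used only during self-supervised pre-training."""
    def __init__(self, encoder, bottleneck_channels=512, num_rotations=4):
        super().__init__()
        self.encoder = encoder
        self.pool = nn.AdaptiveAvgPool2d(1)                    # collapse the spatial dimensions
        self.classifier = nn.Linear(bottleneck_channels, num_rotations)

    def forward(self, x):
        features = self.encoder(x)                             # (N, C, H, W) bottleneck features
        return self.classifier(self.pool(features).flatten(1))

# After pre-training, the encoder weights are copied back into the full U-Net, e.g.:
# unet.encoder.load_state_dict(pretrain_net.encoder.state_dict())
```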

Preliminary Experiments

Before we bring these elements together in an experiment, we first consider some minor changes to the original U-Net network architecture and training procedure. To this end we perform three preliminary experiments.

Padding

The original U-Net network architecture does not use padding in the convolutional layers. This means that the width and height of feature maps decrease after every convolution and that the input and output have different dimensions. This can be solved by padding the input to each convolutional layer with zeros. However, padding can lead to a border effect as the outer pixels of the output are based on less input than other pixels. To test whether this border effect is a problem for us, we perform a small preliminary experiment in which we train U-Net with and without padding in the convolutional layers. We manually inspect the quality of the output of each network to see if we notice any border artefacts. Based on the results we decide to use padding in all future experiments.

Batch Normalisation

We can further improve the network performance by using batch normalisation[61]. Batch normalisation normalises the output of an activation function and can help greatly speed up the training process. To see how batch normalisation can affect our network performance, a


second preliminary experiment is performed in which U-Net is trained with and without batch normalisation. In the network trained with batch normalisation, we apply batch normalisation after every convolutional layer except the last. We compare both training behaviour and network performance to quantify the effects of batch normalisation. Based on the results we decide to use batch normalisation in all future experiments.
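Both modifications boil down to small changes in the double-convolution block of U-Net. The sketch below shows one possible block with the padding and batch normalisation options under test; whether normalisation is placed before or after the activation is an implementation detail that this sketch does not take from the thesis.

```python
import torch.nn as nn

def conv_block(in_channels, out_channels, padded=True, batch_norm=True):
    """One U-Net double-convolution block with optional zero padding (keeps the
    feature map size) and optional batch normalisation."""
    layers = []
    for i in range(2):
        layers.append(nn.Conv2d(in_channels if i == 0 else out_channels, out_channels,
                                kernel_size=3, padding=1 if padded else 0))
        layers.append(nn.ReLU(inplace=True))
        if batch_norm:
            layers.append(nn.BatchNorm2d(out_channels))
    return nn.Sequential(*layers)
```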

Weighted Loss

Figure 3.5: The class imbalance problem inherent to plant part segmentation. Figure taken from [12].

Our data suffers from a class imbalance problem. This problem is inherent to the domain of our data and the plant part segmentation task. The largest area of a plant consists of leaves and only a small part belongs to other parts like cuts or peduncles. This is directly reflected in the pixel labels in a segmentation map (see fig 3.5). Class imbalance can prevent the network from learning to output the less represented classes. As mentioned in [44] this problem can be solved by weighing each pixel's contribution to the loss by the inverse of its label occurrence. We notice that this signal can be too strong due to differences in label distributions between individual images and can lead to over-representation of smaller classes in some cases. To better balance the effect of the weighted loss, we take the inverse of the nth root of the label occurrence. This brings the weights closer to each other.

To avoid integer overflow problems in larger datasets, we first compute the label occurrence as a fraction of the total number of pixels per image and then average over all images. This results in equations 3.5 and 3.6:

$$\mathrm{weight}(l) = \frac{1}{\sqrt[n]{\mathrm{occur}(l)}}, \quad \text{where} \qquad (3.5)$$

$$\mathrm{occur}(l) = \frac{1}{|D|} \sum_{I \in D} \frac{|\{p \in I \mid L_I(p) = l\}|}{|I|}, \qquad (3.6)$$

where $l$ is the class label, $I$ is an image in dataset $D$ and $L_I(p)$ is the class label of pixel $p$ in image $I$. We perform a preliminary experiment in which we compare networks where the loss is weighted with the following values for $n$: 1, 2, 3, 4 and 8, and a network with a regular non-weighted loss function. The training behaviour is evaluated in order to determine the best value for $n$. Based on the results we settle on $n = 3$ for future experiments.
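A minimal sketch of how these class weights could be computed in PyTorch is shown below; the function name and the assumption that the dataset yields (image, label map) pairs with integer per-pixel class indices are ours, not part of the actual implementation.

```python
# Minimal sketch of the class-weight computation in equations 3.5 and 3.6.
# Assumption (not from our actual code): `dataset` yields (image, label_map)
# pairs where label_map is a LongTensor of per-pixel class indices.
import torch

def class_weights(dataset, num_classes, n=3):
    occurrence = torch.zeros(num_classes)
    for _, labels in dataset:
        counts = torch.bincount(labels.flatten(), minlength=num_classes).float()
        occurrence += counts / labels.numel()            # per-image label fraction
    occurrence /= len(dataset)                           # average over all images (eq. 3.6)
    return 1.0 / occurrence.clamp(min=1e-8) ** (1.0 / n)  # inverse nth root (eq. 3.5)

# Hypothetical usage with 8 plant part classes and n = 3:
# weights = class_weights(train_set, num_classes=8, n=3)
# criterion = torch.nn.CrossEntropyLoss(weight=weights)
```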

Training

All networks in the preliminary experiments are trained and evaluated on the plant part segmentation task using 300 training samples and 50 evaluation samples from the Capsicuum Annuum dataset. To prevent memory errors, we resize the samples to a resolution of 480 × 640 pixels. For the non-padded network in our first preliminary experiment samples are resized to a resolution of 460 × 572 pixels. The networks are trained using a cross entropy loss function. We use a batch size of 2 and train all networks for 25 epochs on an Nvidia GeForce RTX 2080 Ti using the Adam[62] optimiser with a learning rate of 0.001, β₁ of 0.9 and β₂ of 0.999.
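For reference, the sketch below shows the optimiser and loss configuration described above, applied to a dummy stand-in model and batch; it is illustrative only and not our actual training script.

```python
# Illustrative single training step with the hyperparameters used in the
# preliminary experiments. The model and data are placeholders: in the real
# experiments the model is U-Net and batches of 2 Capsicuum Annuum samples
# (480x640, 8 classes) are drawn from a data loader for 25 epochs.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=1)        # stand-in for U-Net
images = torch.randn(2, 3, 480, 640)          # one dummy batch
labels = torch.randint(0, 8, (2, 480, 640))   # dummy per-pixel class labels

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
criterion = nn.CrossEntropyLoss()

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```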


Experimental Setup

We combine all previous elements in a single experiment designed to validate our hypotheses from chapter 1. Since we have access to only one type of domain relevant data, we no longer execute separate experiments for both hypotheses. Instead we perform one experiment which can validate both hypotheses.

To do this we reframe the problem we originally tried to solve. In the expected results for experiment 1 of our original method, we point out that the learning signals of the pre-training task and the downstream task might not be equally strong. If a network is trained on 33% less annotated data and that 33% is instead used to pre-train the network in a self-supervised manner, this does not have to result in the same performance as when the network is trained in a supervised manner using 100% of the data. Finding the exact number of samples required to make up for a specific decrease in annotated training samples is a tedious process that involves a lot of trial and error. Instead, we take any increase in performance resulting from additional self-supervised pre-training on unannotated data as proof that self-supervised learning can help reduce the amount of annotated data required to train a convolutional neural network. After all, achieving the same performance increase in a strictly supervised manner would require the use of additional annotated data.

Unfortunately it is no longer possible to draw a comparison between networks pre-trained on data with various levels of domain relevance. The Capsicuum Annuum dataset only contains images of bell pepper plants and no other crops. This means the Capsicuum Annuum dataset can only be used to pre-train on data from the target domain. It is still possible to draw a comparison with a network pre-trained on data from a different domain by using data from the CityScapes dataset. This is what we end up doing in our revised experiment.

In order to test the applicability of our approach to the horticultural domain where annotated data is scarce, we only use a small number of samples in our training sets. From personal experience with gathering data in a greenhouse we estimate that it is possible to collect roughly 700 images in a day's work. The annotation process takes much longer. We managed an average of 24 minutes per image when annotating instance masks. Gathering semantic segmentation maps can cost up to an average of 30 minutes per image. This means that it would take a team of four people almost two full weeks to annotate all 700 images. Alternatively there are services that offer to annotate data for you. Using a service like Labelbox it would cost roughly €1800 to annotate all 700 images. We consider the gathering of annotated data the biggest bottleneck for commercially targeted applications.

We settle on a dataset size of 700 samples selected from the Capsicuum Annuum dataset. These 700 samples are split in four subsets: a training set for the plant part segmentation task containing 300 samples, a training set for the rotation estimation task containing 300 samples, a test set containing 50 samples and a validation set also containing 50 samples. We also take 300 samples from the CityScapes dataset to use for supervised pre-training of the network, 50 samples for testing purposes and 50 samples for validation. With this data we train U-Net three times:

I. using no pre-training before fine-tuning the network on the plant part segmentation task. This provides a baseline of the network's performance.

II. using the 300 samples from the CityScapes dataset to pre-train the network before fine-tuning the network on the plant part segmentation task. This offers insight into the performance achievable when the network is trained on publicly available but not domain specific data.


III. using the 300 samples from the Capsicuum Annuum dataset to pre-train the network before fine-tuning the network on the plant part segmentation task. This quantifies the benefit of pre-training on domain specific data.

To prevent memory errors, we resize all samples to a resolution of 480 × 640 pixels. We use a batch size of 2 for both the self-supervised learning task and the plant part segmentation task. We use a cross entropy loss function for all tasks. All networks are trained for 25 epochs on an Nvidia GeForce RTX 2080 Ti using the Adam[62] optimiser with a learning rate of 0.001, β₁ of 0.9 and β₂ of 0.999. The network performance is evaluated on the test set every 100 batches. Pre-training on the rotation estimation task is stopped early if the accuracy of the network does not change more than 0.0075 for three consecutive evaluations. Training on the semantic segmentation pre-training task or the plant part segmentation task is stopped early if the average IoU changes less than 0.0075 for five consecutive evaluations.
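One possible reading of this stopping rule, sketched as a small helper function that is not taken from our actual code: stop once each of the last few evaluation-to-evaluation changes in the monitored score stays within the threshold.

```python
# Hypothetical early-stopping helper illustrating the rule described above.
def should_stop(history, patience=3, threshold=0.0075):
    """history: list of evaluation scores, most recent last."""
    if len(history) < patience + 1:
        return False
    recent = history[-(patience + 1):]
    return all(abs(b - a) <= threshold for a, b in zip(recent, recent[1:]))

# For the rotation estimation task:  should_stop(accuracies, patience=3)
# For the segmentation tasks:        should_stop(mean_ious, patience=5)
```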


Chapter 4

Results

4.1 Preliminary Experiments

In this section we provide an overview of the results of our preliminary experiments. In addition we provide some examples of the output produced by the network after our proposed modifications. A summary of the results can be found in table 4.1.

Padding

While performing the preliminary experiment we noticed that the network without padding struggles to output anything other than the most occurring pixel label. We manually interrupted training, as the IoU and loss metrics made it unlikely that the network would overcome this problem. Table 4.1 contains IoU scores obtained on the validation set after 13 and 25 epochs for the original network and the network with padding respectively. Figure 4.1 shows the class IoU scores during the training of both networks. Figure 4.4 shows the output produced by both networks. It also illustrates the discrepancy between the input and output resolutions if the network does not use padding. We do not detect any border artefacts in the output produced by the network with padding.

Figure 4.1: Class IoU scores of U-Net with padding on the training data during training of the networks.


Batch Normalisation

Figure 4.2 shows the class IoU scores during the training with and without batch normalisation. IoU scores obtained on the validation set after 25 epochs of training with batch normalisation can be found in table 4.1. An example of the output produced by the network with batch normalisation can be found in the third column of figure 4.4. We find that the use of batch normalisation leads to an increase of the IoU score by as much as 143% for the least represented class and an overall increase of 24%.

Figure 4.2: Class IoU scores of U-Net with padding and batch normalisation on the training data during the training of the networks.

Weighted Loss

Figure 4.3 shows the IoU scores during training with several variants of the weighted loss function. IoU scores obtained on the validation set after 25 epochs of training by the best performing network can be found in table 4.1. An example of the output produced by the best performing network can be found in the last column of figure 4.4. We find that a weighted loss where class weights are computed as the direct inverse of the class occurrence (n = 1) negatively influences the learning process and results in worse performance than a non-weighted loss. The best results are obtained with a weighted loss where class weights are computed as the inverse of the third root of the class occurrence (n = 3). This leads to an additional increase of the IoU score by 15.6% for the least represented class and an overall increase of 3.6%.


Figure 4.3: IoU scores of U-Net with padding, batch normalisation and a weighted loss on the training data during the training of the networks. Equal signifies that all class weights were equal. Root 1 signifies the first root of the class occurrence which is the same as directly taking the class occurrence.


Table 4.1: IoU scores obtained on the validation set. P indicates the use of padding in the convolutional layers. BN indicates the use of batch normalisation. WL indicates the use of a weighted loss where the nth root is used to compute class weights.

Network         Background  Leaves  Bell Pepper  Peduncle  Stem   Shoot or Leaf Stem  Wire   Cut    Avg.
Original        0.000       0.533   0.039        0.000     0.000  0.000               0.000  0.000  0.072
P               0.865       0.879   0.828        0.195     0.541  0.469               0.411  0.337  0.566
P+BN            0.921       0.926   0.880        0.473     0.679  0.606               0.522  0.592  0.700
P+BN+WL (n=3)   0.919       0.926   0.900        0.547     0.719  0.623               0.546  0.622  0.725

Figure 4.4 column labels: Original, Padding, Padding + BatchNorm, Padding + BatchNorm + Weighted Loss.

Figure 4.4: Results of training U-Net with modifications for thirteen epochs. In- and output of the original network architecture are scaled relative to the in- and output of the networks with padding. a) Input images. b) Ground truth annotations. c) Output produced by the network.


4.2 Main Experiment

In this section we provide an overview of the results of our main experiment. We report the network performance on both the pre-training tasks and the downstream task.

Supervised Pre-Training

Figure 4.5: IoU scores during supervised pre-training.

Figure 4.5 shows the IoU score on both the training and validation set during training on data from the CityScapes dataset. Figure 4.7 shows the output produced by the network after training for 25 epochs. The network reaches a final average IoU score of 0.512 on the CityScapes validation set. This is much lower than the performance of state-of-the-art networks which can reach scores up to 0.827. However, those networks are trained on the full dataset rather than a subset and are trained for far longer.

Self-Supervised Pre-Training

Figure 4.6: Accuracy scores during self-supervised pre-training.

Figure 4.6 shows the average accuracy score on both the training and validation set during training on the rotation estimation task. Training meets the early stopping criteria after just 2,400 samples. The network reaches a final average accuracy of 0.995. Remarkably the network consistently performs better on the validation set than on the training set.

Figure 4.7 (columns: Input, Ground Truth, Output).


Downstream Task

Figure 4.8 shows the IoU scores on the training and validation set during training. The network pre-trained with CityScapes data meets the early stopping criteria after 3.6k samples. At this point its performance is almost identical to that of the network with no pre-training. Pre-training on the CityScapes dataset results in an average IoU of 0.6022 vs. 0.6006 with no pre-training on the training set, and 0.6595 vs. 0.6633 on the validation set after 3.6k samples. Pre-training with the rotation estimation task leads to lower performance, at 0.5783 on the training set and 0.6501 on the validation set. The network pre-trained with rotation estimation continues to underperform for the rest of the training duration in comparison to the network with no pre-training.


Chapter 5

Discussion

5.1 Mask R-CNN

While exploring models for our original approach we trained a Mask R-CNN network on custom data and found that it can reach surprisingly high performance levels with very little data. We showed that training on just forty images can lead to an mAP of 0.783 at an IoU threshold of 0.5 and an AR of 0.647 at a maximum of 100 detections. Manual inspection of the results leads us to believe that a leaf detection system for commercial applications can be trained with 100 images or less.

It is unclear whether the success of the network translates to other detection problems as well. Leaf detection is a relatively simple problem since there is only one class to detect and leaves often cover a large area in an image. Other detection tasks can be more complex in nature, e.g. crop detection might require the detection of multiple classes or additional ripeness prediction. Additionally we only use data gathered at a single location. We cannot validate the generalisability to data collected at different locations, different times during the day, different times in the year or at different growth stages of the plant, all of which may vary greatly in commercial applications.

Our findings suggest that neural networks might not require the large amounts of data previously thought necessary for training. However, our results are obtained on a very simple task with little variation in the data. Additional investigation is needed to draw any meaningful conclusions about the minimal amount of data required to successfully train a neural network.

5.2 Preliminary Experiments

Padding

In our first preliminary experiment we explore the use of padding in convolutional layers of the neural network. In the original implementation of U-Net the convolutional layers do not use padding. This means that the output images produced by the network are smaller than the input images. Moreover, input images have to be of specific resolutions to prevent scaling errors. During the execution of our experiment, we find that the original network architecture struggles to learn anything beyond the most occurring class(es), whereas the padded network seems to have no problem learning complex representations of multiple classes. Since the only difference between the two networks is the padding in the convolutional layers, we can only think of one explanation. We suspect that the input image is compressed too much to reconstruct any
