On transfer learning from RGB to multispectral remote sensing images in a classification task


MSc Artificial Intelligence

Master Thesis


by

Nicole Louise Ferreira Silverio

10521933

April 20, 2020

36 EC 26th of August 2019 - 20th of April 2020

Supervisors:

Dr. M. Chong

Y. Chen MSc

P. Mettes MSc

Assessor:

Prof. dr. T. Gevers


Acknowledgements

First of all, I want to thank my supervisors for always being there to assist me when necessary and being extremely understanding. Without your help I would not have been able to complete this manuscript. Second of all, I would like to thank my fellow students (and friends) Oscar Ligthart and Arend van Dormalen for always being there during our studies and always having my back. I couldn’t have done all this without my other two Musketeers. Next, I would like to thank my parents for their infinite love and for being my biggest cheerleaders, and Kevin for letting me take over his computer. Last but not least, a big thank you to everyone else who supported me.


Abstract

Satellite data can be widely used to monitor the Earth’s surface. One application of this type of imagery is land use and land cover classification, where patches of land are assigned to a specific class. In order to apply machine learning to such a task, large labelled datasets are necessary. However, large labelled remote sensing datasets are scarce. When large labelled datasets are not available, transfer learning is often used so that machine learning can be applied nonetheless.

Regular images often consist of three color channels, whereas remote sensing images can consist of many more. Pretrained models are usually trained on regular, three-channel images, so to use a pretrained model on images with more than three channels, either the excess image channels have to be discarded or the pretrained model or the data has to be modified. The most effortless approach is to discard the excess image channels, yet these extra channels could contain information that is valuable for the task at hand. Considering that large labelled satellite image datasets are scarce and that creating labelled datasets is time-consuming and labor-intensive, it is valuable to research the effects of transfer learning, without discarding the extra image channels, on accuracy and on the quantity of data needed for proper performance.

In this manuscript a comparative study of several multi-channel transfer learning approaches is presented. The approaches can be divided into two categories: dimensionality reduction of the data, and extending the weight matrix of the pretrained model so that it accepts input with more than three image channels. All of the approaches were applied to a smaller model architecture and compared to each other and to an RGB-only counterpart on accuracy, quantity of data needed and training time. The most viable approaches were then applied to a larger model, ResNet-50, to see whether the benefits would still hold when model size was increased. It was found that all approaches applied to the smaller model outperformed their RGB counterpart in terms of accuracy achieved on the test data, with the weight extension approaches outperforming the dimensionality reduction approaches. The weight extension approaches also outperformed their RGB counterpart considerably in terms of quantity of data needed to match RGB-only accuracy, achieving a similar result with 50% less data. However, when applied to a larger model architecture, no improvement was achieved over the RGB baseline. Thus, with some effort to extend a (smaller) pretrained CNN to take more than three channels as input, the model outperforms a counterpart fine-tuned solely on the RGB channels in terms of quantity of data needed and accuracy achieved on the test data.


1 Introduction

Nowadays, it is becoming more common to have continuous access to satellite imagery from all over the world. The European Space Agency (ESA) and the National Aeronautics and Space Administration (NASA) have made the data from their Copernicus and Landsat programs free to use, both commercially and non-commercially (Helber et al., 2019). Free access means that this type of data can be widely used, for instance for disaster monitoring. The ESA’s Sentinel-3 satellites were used to monitor the bushfires in Australia (ESA, 2019a) and the Amazon fires (ESA, 2019b) this year and to compare these fires to those of last year. Remote sensing images from the Landsat satellites were also used to track the effects of fires on the landscape, such as looking for pieces of unburned land in areas struck by fires (Meddens et al., 2016) or looking at how the Earth’s surface changed in areas where there have been fires (McCarley et al., 2017). Aside from disaster monitoring, these images can also be used to research environmental issues, such as the impact of climate change on the Earth’s surface. Using the Sentinel-2 satellites, a remote glacier in Antarctica, which cannot easily be examined in person because of its location, can now be monitored constantly (ESA, 2019c).

Another application of this type of satellite imagery is land use and land cover classification, where patches of land are classified into categories, preferably automatically. Land use and land cover classification could make it easier to keep track of changes in the Earth’s surface and thus of current environmental issues, such as (illegal) deforestation in the Amazon or the effects of climate change on the Arctic ice. Aside from this, land use and land cover classification could also be used for disaster recovery. To perform land use and land cover classification with (deep) machine learning, a large quantity of labelled data is necessary for proper performance. However, there are few large labelled remote sensing datasets (Helber et al., 2019; Li et al., 2018; Cheng et al., 2017). For comparison, a commonly used dataset for image classification, the ImageNet Large Scale Visual Recognition Challenge of 2012 (ILSVRC2012) (Deng et al., 2009; Russakovsky et al., 2015), consists of over 1.2 million training samples. Some of the larger satellite image datasets consist of tens of thousands of images (Helber et al., 2019; Cheng et al., 2017), which is one to two orders of magnitude smaller. Creating such (large) labelled datasets is time consuming (Pan and Yang, 2010).

When there is insufficient data to train a model fully, it is common to take a pretrained network and fine-tune it on the task-specific dataset. In computer vision these models are often pretrained on the ILSVRC2012 dataset (Goodfellow et al., 2016; Oquab et al., 2014; Russakovsky et al., 2015). The images in this dataset consist of three color channels. Due to the architecture of convolutional neural networks (CNNs), the input-image channel dimension is fixed. However, images with more than three input channels are becoming more common, such as RGB-D images (Alexandre, 2016; Schwarz et al., 2015; Wang et al., 2019; Hoffman et al., 2016; Han et al., 2018), medical images (Wollmann et al., 2018; Tai et al., 2017) and remote sensing images (Helber et al., 2019). To reap the benefits of transfer learning on a land use and land cover classification task, Helber et al. (2019) removed the excess channels from their data and only kept the RGB channels to meet the three-channel input requirement of pretrained models. This means that 10 of the 13 channels, or in other words over 75% of the channels of each image, were discarded. The information in these channels could be valuable nonetheless.

Hence, it would be beneficial to research various methods of performing transfer learning on multi-channel (> 3) images and whether this provides advantages over traditional transfer learning. Some research has been done on transfer learning to images with more than three image channels (Alexandre, 2016; Schwarz et al., 2015; Wang et al., 2019; Hoffman et al., 2016; Han et al., 2018; Wollmann et al., 2018; Tai et al., 2017). However, not all of these methods were compared to RGB transfer performance, and the methods have never been compared to one another. Moreover, it has not been researched whether using all of the available image channels affects the quantity of data needed. Finally, the model architectures used in these studies were small. It would be interesting to research the effects of multi-channel transfer learning on a bigger model.

The contributions made in this research complement the aforementioned literature. First, the researched approaches were compared to one another to determine what type of approach works best in a multi-channel transfer scenario. Second, it was researched how using all image channels during training affects the quantity of data needed and the training time. Third, the most viable approaches were applied to a larger model to see whether any advantages over RGB transfer would still hold when model size was increased. Lastly, the multi-channel transfer approaches in this research were applied to remote sensing images.

In this research, several methods of performing multi-channel transfer learning in a land use and land cover classification task were researched and compared to each other and to a baseline model that was fine-tuned solely on the RGB channels of the satellite data. The approaches researched can be divided into two categories: dimensionality reduction of the image-channel dimension, and modifying the weight matrix of the input layer of the model so that it can take more than three channels as input. These approaches were implemented for a smaller model architecture and performance was evaluated on accuracy, quantity of data needed to match the accuracy of the baseline, and training time. The most viable approaches were then implemented for a larger model, ResNet-50, to investigate whether the improvements hold when model size is increased. It was found that all methods outperformed the baseline, the model fine-tuned solely on the RGB channels of the satellite images, in terms of accuracy, with the approaches where the input layer was modified to take all channels as input performing best. The approaches where the input layer was modified also needed 50% less data to achieve the same accuracy as the baseline. Preliminary results show that, when applied to the larger ResNet-50 model, the methods did not improve over the baseline, a similar model fine-tuned on only the RGB channels of the data, on any of the evaluated metrics. All in all, when presented with a transfer learning scenario where the target data has more channels than the source data, extending the weight matrix of the first convolutional layer so that the model can take more channels as input is favorable over reducing the dimensionality of the input-channel dimension of the data and over using only the RGB information, regardless of how the extra weights are initialized.

1.1 Paper outline

In the following chapters more in-depth information about the techniques used, the problem space and the research performed is presented. Chapter 2 gives an in-depth review of the literature about the task at hand, covering classification on remote sensing images and multi-channel transfer learning. Next, the materials and methods applied in this research, including the experimental set-up, are described in chapter 3. In chapter 4 the results of the experiments conducted are presented, followed by an in-depth analysis of the results in chapter 5. Lastly, chapter 6 describes what can be concluded from the results and the possible next steps for further research on this subject.


2 Related Work

In the following sections a more in-depth review of the literature about image classification on remote sensing images and transfer learning on multi-channel (> 3) images is given. First, techniques and methods to perform image classification on remote sensing images are described, including some of the problems that occur for this task and how these are conventionally handled. Furthermore, since this research performs non-conventional transfer learning, specifically from source data with three image channels to target data with 13 channels, a review of the literature on multi-channel transfer learning is presented.

2.1 Land use and land cover classification on satellite data

In the literature, land use and land cover classification can refer to two different tasks: in some cases the classification is performed per pixel (Carranza-García et al., 2019; Li et al., 2018), whereas in other cases a label is assigned to the complete image (Helber et al., 2019; Cheng et al., 2017; Li et al., 2018; Castelluccio et al., 2015). Even though there is a slight difference in how the classification is performed in these two tasks (pixel-wise versus image-wise), the underlying goal is the same: assign a label to how the land is used, or in other words to what is covering the Earth’s surface. Assigning a label to an image is also called scene classification (Li et al., 2018). In this research we use land use and land cover classification to refer to image classification on remote sensing images. The literature discussed below concerns this type of classification as well.

Older methods to perform land use and land cover classification rely on manually crafted features, but these features are hard to improve and often highly specific to the data used (Li et al., 2018; Cheng et al., 2017). With the emergence of deep learning, where the models learn feature representations from the data fed to them, state-of-the-art results on classification of remote sensing images have been achieved using deep neural networks, specifically CNNs (Helber et al., 2019; Li et al., 2018; Cheng et al., 2017; Castelluccio et al., 2015). There are, however, some challenges when using deep networks to classify remote sensing images. For instance, large neural networks need a large quantity of labelled data, often millions of data samples, to perform well. This creates a problem for land use and land cover classification, considering that large labelled remote sensing image datasets are scarce (Helber et al., 2019; Li et al., 2018; Cheng et al., 2017).

When using CNNs to classify remote sensing images, there are several approaches that can be used depending on the amount of data available (Li et al., 2018): using a pretrained network as a feature extractor without fine-tuning on the data (Castelluccio et al., 2015; Cheng et al., 2017), using a pretrained network and fine-tuning it on the target dataset to adjust the parameters to the data (Helber et al., 2019; Cheng et al., 2017; Castelluccio et al., 2015) or, when there is sufficient data, fully training a model without pretraining (Castelluccio et al., 2015; Helber et al., 2019). Fully training a network has an advantage over using a pretrained network in that the network learns to extract features specifically for the data it is trained on. However, a large quantity of labelled data (often millions of images) is necessary for a model to converge and to prevent overfitting. This is often not a given in remote sensing datasets. One way to prevent overfitting without pretraining is to use a smaller model architecture, but smaller networks often cannot generalize as well as their larger counterparts. Castelluccio et al. (2015) found that fine-tuning a pretrained network achieved better results than training from scratch on the UC-Merced dataset.

Helber et al. (2019) also found that fine-tuning a larger pretrained model outperformed training a smaller model from scratch on their newly created dataset, EuroSAT, which is used in this research as well. This dataset consists of patches of land cropped from images taken by the ESA’s Sentinel-2A satellite; the images are multi-spectral, with 13 spectral bands covering the visible and part of the non-visible spectrum. Benchmarks are provided for classification on this dataset and it was found that a pretrained ResNet-50, fine-tuned only on the RGB bands, performs best. Fine-tuning on several other combinations of three spectral bands, such as short-wave infrared and color infrared, was performed as well, resulting in respectable performance but not outperforming fine-tuning on the RGB bands. Aside from fine-tuning on a subset of three spectral bands, the overall classification accuracy for each spectral band was also examined using single-band images. Again, the RGB bands performed best overall, but several of the other spectral bands also performed quite well. This could indicate that there is valuable information present in the non-RGB spectral bands of the EuroSAT dataset. However, Helber et al. (2019) did not research the effect of training and/or fine-tuning on all 13 spectral bands at once.

2.2 Transfer learning

Nowadays, when provided with enough data, computers can perform almost as well as humans on many tasks with the use of deep learning models in supervised learning problems. It is important to note that the amount of data needed for models trained from scratch to achieve human-like performance is immense (Goodfellow et al., 2016). With deep CNNs it is difficult to learn all of the network parameters if only a small amount of data is available, where small can already mean a couple of thousand data samples (Oquab et al., 2014), which is likely to cause the models to overfit. However, it is common not to have such large amounts of data available, and annotating more data is often time and labor intensive (Pan and Yang, 2010). Hence, when limited data is available, it is common to make use of transfer learning.

Transfer learning is based on the notion that people use knowledge from previous experiences when confronted with new situations in order to handle these new situations more efficiently (Pan and Yang, 2010; Torrey and Shavlik, 2010). In machine learning this entails taking knowledge learned from a certain source task, such as classifying generic images from a large labelled dataset, and applying it to a target task, for instance classifying images in a small medical dataset, to improve the model, whether by reducing the number of data samples needed to prevent overfitting (Yosinski et al., 2014; Oquab et al., 2014), by reducing training time or by improving overall performance (Torrey and Shavlik, 2010). In computer vision, networks are often pretrained on the large ILSVRC2012 dataset (Goodfellow et al., 2016; Oquab et al., 2014; Russakovsky et al., 2015; Deng et al., 2009).

When dealing with visual input, there are many different types of images, such as regular photographs, medical images or satellite images. Nevertheless, Yosinski et al. (2014) report that the learned lower-level features always look the same when training a CNN, no matter what image data the network is trained on. This means that the first layer of these networks learns general features, whereas the deeper layers become more specific to the dataset and the task the network is trained on. However, Yosinski et al. (2014) also found that initializing the deeper, more specific layers with pretrained weights outperformed random initialization of these layers when fine-tuning on a new dataset.

2.2.1 Transfer learning on multi-channel (> 3) images

Most applications of transfer learning in CNNs focus on transfer from RGB input to RGB input, where (some of) the pretrained weights of a neural network are copied into an architecture identical to that of the pretrained model (Alexandre, 2016). However, it is not possible to use an identical model architecture if the input images have a different channel dimension. Since visual data consisting of more than just the three color channels is becoming more and more available, Alexandre (2016) proposed several methods for classification of images with more than three channels, considering both training from scratch and transfer learning. The data used in their experiment consisted of RGB-D images: images comprised of RGB channels and an additional depth channel.

In order to perform transfer learning from RGB to RGB-D input images, Alexandre (2016) proposed an ensemble model with a separate CNN for each channel of the input image. Single-channel CNNs were trained on each of the image channels and the weights of the model with the best performance were used as initialization for the three other CNNs in the ensemble model. Transferring the weights of the best-performing channel resulted in an increase in performance over the baseline ensemble model, in which the four single-channel CNNs were trained from scratch. It was also found that transferring the weights of the best-performing single-channel CNN outperformed a single CNN that took four-channel input and was trained from scratch, both in terms of training time and performance.

However, an ensemble of four single-channel CNNs requires more memory because of the number of parameters in such a model. It has many more learnable parameters than a single CNN that takes four-channel input: the single CNN has more parameters in its first layer, but the rest of its parameters are shared, which is not the case when there is a separate CNN for each channel. The ensemble model therefore has almost four times as many learnable parameters as a single CNN. Memory-wise, experimenting with multi-channel transfer learning with (at least some) shared parameters could be beneficial, especially considering that lower-level features in a CNN are very general (Yosinski et al., 2014).
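To make the parameter argument concrete, the count can be sketched for a small, hypothetical CNN. The layer sizes below (two 3x3 convolutions followed by one fully connected layer on 16x16 feature maps) are assumptions chosen purely for illustration and are not the architecture of Alexandre (2016); any layer that combines the ensemble outputs is ignored.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a square 2D convolution."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def cnn_params(in_channels):
    """Parameter count of a hypothetical small CNN (illustrative sizes only)."""
    return (
        conv_params(in_channels, 32, 3)    # only this layer depends on the input channels
        + conv_params(32, 64, 3)
        + dense_params(64 * 16 * 16, 10)
    )


single = cnn_params(4)        # one CNN taking four-channel input
ensemble = 4 * cnn_params(1)  # four separate single-channel CNNs
ratio = ensemble / single

print(single, ensemble, f"{ratio:.2f}")  # 183530 730664 3.98
```

Only the first convolution grows with the number of input channels, so in this sketch the ratio stays just below four.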

Schwarz et al. (2015), Wang et al. (2019), Hoffman et al. (2016) and Han et al. (2018) made use of a partially shared network when performing object classification, object detection, saliency detection and pose estimation on RGB-D images. In these experiments the networks were partially parallel, with a separate feature extractor for the RGB image and for the extra information (and in the case of Hoffman et al. (2016) even separate first fully connected layers), of which the outputs were combined before being fed to the rest of the network. The models were pretrained on ImageNet data and fine-tuned on their RGB-D data. Some of the studies initialized their depth network with the weights pretrained on ImageNet data (Schwarz et al., 2015; Wang et al., 2019), while others initialized their depth network with the weights fine-tuned on their own RGB data (Hoffman et al., 2016; Han et al., 2018). Schwarz et al. (2015) did not compare their method to transfer learning on RGB-only data; the other experiments showed that using the extra channel information outperformed RGB-only methods on the task at hand.

Hoffman et al. (2016) researched when to merge the output of the networks, but confined this experimentation to merging right before and between the fully connected layers. Wollmann et al. (2018) went a step further in sharing parameters: they proposed a CNN architecture for nuclei segmentation in cell tissue with different coloring, which takes all image channels as input and where, similar to Alexandre (2016), the extra channel weights were initialized with the weights of a CNN pretrained on single-channel data. In contrast to the research of Alexandre (2016), Schwarz et al. (2015), Wang et al. (2019) and Hoffman et al. (2016), however, only the weights of the first convolutional layer were transferred onto the extra three channels present in their data, instead of the weights of the complete feature extractor as in Schwarz et al. (2015), Wang et al. (2019) and Hoffman et al. (2016), or the weights of the complete model as in Alexandre (2016). The weight matrix of the first layer was extended and the pretrained weights of the single-channel model’s first convolutional layer were copied onto all four channels in the extended weight matrix. Aside from using the pretrained weights of a single channel as initialization for all channels, another method was researched in which the pretrained weights were only used as initialization for the channel of the same type as the data the model was pretrained on; the weight matrix was extended for the extra channels with weights initialized according to the method of He et al. (2015), and these new weights were trained from scratch. Contrary to Alexandre (2016), Wollmann et al. (2018) found that transferring the weights of the single pretrained channel to the other three channels did not outperform initializing the three extra channels randomly and training them from scratch. This, however, could be due to the small number of data samples the model was trained on in this experiment. Also, due to the structure of their data and experiment, they could not compare the four-channel approaches to an RGB-only approach.
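The idea of extending the first-layer weight matrix can be sketched in NumPy for the RGB-to-13-channel case considered in this thesis. This is a minimal illustration under stated assumptions, not the implementation of Wollmann et al. (2018): the function name `extend_first_conv`, the choice to fill extra channels with the mean of the RGB kernels in the "copy" variant, and the positions of the RGB bands in the input are all hypothetical.

```python
import numpy as np


def extend_first_conv(weights_rgb, n_channels, rgb_idx, init="copy", seed=0):
    """Extend a pretrained first-layer kernel from 3 to n_channels input channels.

    weights_rgb: array of shape (out_channels, 3, k, k), pretrained on RGB images.
    rgb_idx:     positions of the R, G and B bands in the multi-channel input
                 (an assumption about the data layout).
    init:        "copy" fills every extra channel with the mean RGB kernel
                 (a hypothetical choice); "he" draws the extra channels from a
                 He-normal distribution (He et al., 2015).
    """
    out_c, in_c, k, _ = weights_rgb.shape
    if in_c != 3 or len(rgb_idx) != 3:
        raise ValueError("expected a three-channel kernel and three RGB positions")

    if init == "he":
        rng = np.random.default_rng(seed)
        fan_in = n_channels * k * k
        std = np.sqrt(2.0 / fan_in)
        extended = rng.normal(0.0, std, size=(out_c, n_channels, k, k))
    elif init == "copy":
        mean_kernel = weights_rgb.mean(axis=1, keepdims=True)
        extended = np.repeat(mean_kernel, n_channels, axis=1)
    else:
        raise ValueError(f"unknown init: {init}")

    # The pretrained RGB kernels are always kept for the RGB bands.
    extended[:, rgb_idx] = weights_rgb
    return extended


# Example: a stand-in "pretrained" kernel with 8 filters of size 3x3.
pretrained = np.random.default_rng(1).normal(size=(8, 3, 3, 3))
extended = extend_first_conv(pretrained, n_channels=13, rgb_idx=[3, 2, 1], init="he")
```

In both variants the rest of the network is untouched, so every layer after the first keeps its pretrained weights.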

Besides modifying the model architecture, it is also possible to adapt the data so that it fits the three-channel requirement of a model pretrained on RGB data. This is what Tai et al. (2017) did to make use of a pretrained model for semantic segmentation on fMRI data, which consisted of 31 image channels. They performed PCA to reduce the images to three input channels and fed the three-channel images to a pretrained CNN. However, again because of the structure of the data, it was not possible to compare their PCA-aided approach to an RGB-only counterpart.
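A channel-wise PCA of this kind can be sketched as follows. This is an illustrative NumPy version, assuming channels-last images and treating every pixel as one observation; it is not the procedure of Tai et al. (2017), and the function name `pca_reduce_channels` is hypothetical. In practice the components would be fitted on the training set only and then applied to the validation and test data.

```python
import numpy as np


def pca_reduce_channels(images, n_components=3):
    """Project the channel axis of images with shape (n, h, w, c) onto its
    first n_components principal components."""
    n, h, w, c = images.shape
    pixels = images.reshape(-1, c).astype(np.float64)  # one row per pixel

    mean = pixels.mean(axis=0)
    centered = pixels - mean
    cov = centered.T @ centered / (centered.shape[0] - 1)  # (c, c) covariance

    # eigh returns eigenvalues in ascending order, so sort them descending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]  # (c, n_components)

    reduced = centered @ components
    return reduced.reshape(n, h, w, n_components), components, mean


# Example with random stand-in data: four 8x8 images with 13 channels.
images = np.random.default_rng(0).normal(size=(4, 8, 8, 13))
reduced, components, mean = pca_reduce_channels(images)
```

The reduced images have three channels and can be fed to an unmodified pretrained model, at the cost of discarding the variance in the remaining components.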

Aside from increasing performance in terms of accuracy, none of the approaches mentioned above studied the other advantages of transfer learning, such as needing less data for training. One of the many reasons to use transfer learning is a lack of sufficient data to train a model from scratch. It would therefore be beneficial to research whether transferring pretrained weights onto extra channels in the data reduces the amount of data necessary for decent accuracy; for instance, whether less data is needed when using the extra channels to match the performance of fine-tuning on RGB-only data. Most of the reviewed studies also did not examine whether the models converged faster or slower when fine-tuned on the data with extra information compared to their RGB counterpart, which is why, in our research, it was opted to create model architectures that share as many parameters as possible by fusing the feature maps in the first convolutional layer, as in Wollmann et al. (2018). In addition, the previously mentioned experiments in which a traditional CNN was used relied on smaller network architectures, for instance AlexNet (Krizhevsky et al., 2012). It would be useful to research whether the approaches for multi-channel transfer scale to a larger network, such as ResNet-50.


3 Material & Methods

In the following sections the material and methods used are defined in more detail. Section 3.1 introduces the datasets used for the tasks at hand and for model pretraining, including how the data is preprocessed. Furthermore, the model architectures are explained. In section 3.2 transfer learning from RGB to RGB satellite images is introduced for the purpose of defining performance baselines. Section 3.3 is the core of the chapter, where the approaches to perform transfer learning from RGB to multispectral images are described. Lastly, in sections 3.4 and 3.5 the metrics used to compare model performance and the experimental details are introduced.

3.1 Preparations

In the following sections the datasets, including how this data was pre-processed, and model architectures used in this research are described. First, the satellite data and the datasets used for pretraining the models are described, followed by the steps taken to pre-process this data. Lastly, the two model architectures used in this research are defined in more detail.

3.1.1 Data

Multiple datasets were used in this research: some were used for pretraining the models and another was used to train and fine-tune pretrained models on the task at hand. Models were pretrained on the ILSVRC2012 (Russakovsky et al., 2015) and Tiny ImageNet datasets, and for land use and land cover classification the EuroSAT satellite dataset was used. The following sections provide detailed information on all the datasets used.

Figure 1: The amount of data samples per class in the complete data set (train, validation and test samples)

EuroSAT satellite data The satellite data used for this research was the EuroSAT dataset created by Helber et al. (2019); a dataset consisting of cropped pictures, 64 by 64 pixels in size, from the ESA’s Sentinel-2A satellite images. This data set consists of 27000 images, divided over 10 classes which are quite evenly balanced in the number of data points per class, as can be seen in figure 1. The ten classes represented in the data set are Highway, Forest, Annual Crop, Sea & Lake, Industrial, Permanent Crop, Herbaceous Vegetation, River, Residential and Pasture. Some example images per class are presented in figure 3.

The EuroSAT images can be downloaded as RGB-only images or as multi-spectral images, consisting of 13 spectral bands containing the visible spectrum, near-infrared, shortwave infrared and more. See figure 2 for all spectral bands and the Appendix for a visualisation of the separate bands. Most spectral bands in these images are used to identify and monitor land use, except for the bands Aerosols, Cirrus and Water Vapor, which are used to correct atmospheric effects. As can be seen in figure 2, the spatial resolution of these spectral bands ranges from 10 meters per pixel to 60 meters per pixel. Before taking the 64x64 pixel crops, the spectral bands which did not have a spatial resolution of 10 meters per pixel were resampled to this spatial resolution by the creators of the dataset.

The cropped images were taken over 34 European countries at multiple times of the year to cover as many different instances of each class as possible, since there is high variance within each class depending on geographical location and seasonal changes. Different types of each class are represented as well, as not all instances of the same land cover look identical. For instance, as Helber et al. (2019) report, the characteristics of rivers can differ greatly, which is also reflected in the data set. No atmospheric correction was applied to the images. This high intra-class variance makes land use and land cover classification more challenging, since the instances of a class can be very different from each other. Aside from the intra-class variance making the classification more challenging, there are also some classes which share similarities between them (Helber et al., 2019).

Figure 3: Example images from the classes present in the EuroSAT dataset. Image taken from Helber et al. (2019). (a) Industrial buildings. (b) Residential buildings. (c) Annual crop. (d) Permanent crop. (e) River. (f) Sea and lake. (g) Herbaceous vegetation. (h) Highway. (i) Pasture. (j) Forest

Figure 2: The spectral bands including their spatial resolution and central wavelength. Table from Helber et al. (2019)

The data was split into a train, validation and test set before any pre-processing was performed on the data. The training set consists of 80% of the data, which translates to 21600 samples, and the validation and test set both consist of 10% of the data, which equates to 2700 samples each.
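The split sizes can be reproduced with a short sketch. The shuffling strategy and the seed below are assumptions made for illustration; the text does not state whether the actual split was random or stratified per class.

```python
import random


def split_indices(n_samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle sample indices and cut them into train, validation and test parts.

    The test set receives whatever remains after the train and validation cuts.
    The random (non-stratified) shuffle and the seed are illustrative assumptions.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_train = int(round(n_samples * train_frac))
    n_val = int(round(n_samples * val_frac))

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(27000)
print(len(train), len(val), len(test))  # 21600 2700 2700
```

Splitting on indices before any pre-processing keeps statistics that are later computed on the training set from leaking into the validation and test sets.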

Tiny ImageNet & ILSVRC2012 The models used for transfer learning were pre-trained on two data sets: the Tiny ImageNet data set and the ILSVRC2012 (ImageNet Large Scale Visual Recognition Challenge) data set (Russakovsky et al., 2015). Both datasets are based on ImageNet, which is a hierarchical image database with object classes based on concepts from WordNet, containing both common object categories and rarer ones (Deng et al., 2009). The hierarchical part of ImageNet means that there are classes, for instance 'mammal', which in turn have their corresponding subclasses, in this case 'types of mammals'.

First, the Tiny ImageNet data set was considered because training a (custom) model on the regular ImageNet data set is very time-consuming and computationally expensive, owing to the sheer size of the data set and of its images. Tiny ImageNet is a data set created by Stanford University for their CS231N course. It consists of 200 ImageNet classes with 500 train, 50 validation and 50 test samples per class. Aside from the reduced number of classes and of data points per class, the Tiny ImageNet images are resized to be smaller than the regular ImageNet images: they are 64 by 64 pixels (instead of 256 by 256 pixels), which is the same size as the satellite images used in this research. Because of this, the satellite images did not have to be increased in size to fit the model. The Tiny ImageNet dataset provides labels for its train and validation sets.

Aside from the custom model pretrained on the Tiny ImageNet dataset, a larger model was used that had already been pretrained on ILSVRC2012 when it was obtained. The ILSVRC2012 dataset consists of a little over 1.4 million images, with over 1.2 million images in the training set, divided over 1000 classes (Russakovsky et al., 2015). These classes are a subset of all 21841 ImageNet classes.

3.1.2 Pre-processing

Often before feeding data to neural networks, the raw data is pre-processed. The following sections describe what type of pre-processing was performed on the raw data of the datasets used in this research.

EuroSAT satellite data The multi-spectral images in the EuroSAT dataset are in .tif format. To make it easier to perform operations on the images, the .tif files were converted into numpy arrays. To save time when loading the data, this was done once and the converted numpy arrays were saved, so that the files did not have to be converted every time the data was loaded for the models.

On the website of the ESA and in the literature it is noted that the multi-spectral images have a 12-bit radiometric resolution, which translates to values between 0 and 4095 (Drusch et al., 2012). However, when the multi-spectral data was analyzed, the maximum values of the channels were found to deviate from this.

Figure 4: The distribution of pixel values per spectral band

As can be seen in figure 4, for most spectral bands the maximum pixel values are far above the mentioned radiometric resolution, ranging up to over 25000 for some bands. It was therefore decided to clip the channel values to 4095, since that is the supposed upper bound of the values the images can take on. For the same reason, and to keep the RGB-only and 13-band multi-spectral data as similar as possible, we created our own RGB-only data set by extracting the channels for red, green and blue from the clipped multi-spectral data. This was done after splitting the 13-channel multi-spectral data into a train, validation and test set, to ensure that the same data samples were used for training on RGB-only and 13-channel data.
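The clipping and channel extraction described above can be sketched in a few lines of NumPy. This is only an illustration: the array layout (N, 13, H, W) and the positions of the red, green and blue bands (indices 3, 2 and 1) are assumptions, and the function name is hypothetical.

```python
import numpy as np

# Assumption: blue, green and red sit at channel indices 1, 2 and 3.
# Check the band order of your own converted arrays before relying on this.
RED_IDX, GREEN_IDX, BLUE_IDX = 3, 2, 1
MAX_VALUE = 4095  # upper bound of a 12-bit value range


def clip_and_extract_rgb(images):
    """Clip raw values to [0, 4095] and pull out an RGB copy.

    images: integer array of shape (N, 13, H, W) (assumed layout).
    Returns the clipped 13-channel array and the 3-channel RGB array.
    """
    clipped = np.clip(images, 0, MAX_VALUE)
    rgb = clipped[:, [RED_IDX, GREEN_IDX, BLUE_IDX], :, :]
    return clipped, rgb


# Synthetic stand-in for raw data with values far above 4095.
rng = np.random.default_rng(0)
raw = rng.integers(0, 26000, size=(4, 13, 64, 64))
clipped, rgb = clip_and_extract_rgb(raw)
```

Because the RGB array is taken from the clipped 13-channel array, both data sets share exactly the same samples and the same clipped values.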

After our own RGB data set was created, the data was further pre-processed: the pixel values were scaled to the range [0, 1], since the pretrained models in PyTorch are trained on images whose pixel values are scaled to this range before normalization. Again, this was done to keep everything that was not being compared as similar as possible across all experiments; when using a pretrained model for transfer learning, the new data fed into the model should receive the same pre-processing as the original data the model was trained on. Thus, since different types of pre-processing were not compared to each other, the pixel values were scaled to this range both when the models were trained from scratch and when transfer learning was performed. After scaling, the mean and standard deviation per channel were computed on the scaled train set and used to normalize the images with the zero-mean unit-variance method: per channel, the mean was subtracted from each pixel value and the result was divided by the channel's standard deviation.
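As a minimal sketch of the scaling and normalization steps, assuming clipped data of shape (N, C, H, W) and division by 4095 as the scaling operation (the exact scaling call is not stated in the text), the statistics are computed on the train set only and reused for the other splits. All function names are hypothetical.

```python
import numpy as np


def scale(images, max_value=4095.0):
    """Scale clipped pixel values to the range [0, 1]."""
    return images.astype(np.float32) / max_value


def channel_stats(train_scaled):
    """Per-channel mean and standard deviation over samples, height and width."""
    mean = train_scaled.mean(axis=(0, 2, 3), keepdims=True)
    std = train_scaled.std(axis=(0, 2, 3), keepdims=True)
    return mean, std


def normalize(images_scaled, mean, std):
    """Zero-mean unit-variance normalization with given statistics."""
    return (images_scaled - mean) / std


# Synthetic stand-ins for the clipped train and validation sets.
rng = np.random.default_rng(0)
train = rng.integers(0, 4096, size=(8, 13, 64, 64))
val = rng.integers(0, 4096, size=(2, 13, 64, 64))

train_s, val_s = scale(train), scale(val)
mean, std = channel_stats(train_s)       # train statistics only
train_n = normalize(train_s, mean, std)
val_n = normalize(val_s, mean, std)      # reuse the train statistics
```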

When using the larger ResNet-50 architecture, the images were enlarged to 224x224 pixels, to match the size of the images ResNet-50 was pretrained on, using PyTorch's functional interpolate function with default parameters. The images were then scaled to the range [0, 1] as well, before being normalized with the zero-mean unit-variance method.
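For illustration, the upsampling step could look as follows. Passing the target size through the size argument is an assumption; with default parameters, torch.nn.functional.interpolate uses nearest-neighbour interpolation.

```python
import torch
import torch.nn.functional as F

# Synthetic batch standing in for 13-channel 64x64 satellite images.
batch = torch.rand(2, 13, 64, 64)

# Default mode is 'nearest'; only the target size is specified here.
resized = F.interpolate(batch, size=(224, 224))
```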

Tiny ImageNet & ILSVRC2012 The pixel values of the images in the Tiny ImageNet data set were scaled to a range of [0, 1] and normalized with the zero-mean unit-variance method, using the standard ImageNet means and standard deviations that are used when pretraining PyTorch models on the ILSVRC2012 ImageNet data set. Aside from scaling and normalizing, the train data was also augmented to prevent overfitting: torchvision.transforms RandomResizedCrop was used to crop the input image to a random size and aspect ratio, with default PyTorch parameters, and then resize the crop to a 64 by 64 pixel image, and RandomHorizontalFlip was used to flip the images horizontally with a probability of 50%. These augmentations were not performed on the validation images.

The pre-processing steps performed on the ILSVRC2012 data for PyTorch's pretrained models were resizing the images to 256x256 pixels, taking a center crop of 224x224 pixels, scaling the images to the range [0, 1] and, finally, normalizing the images using the zero-mean unit-variance method.

3.1.3 Models

In this research two different convolutional neural network architectures were used; a smaller, custom made model and a larger model, ResNet-50. The smaller, custom model was both fully trained on the EuroSAT data (both for RGB-only and 13-channel data) and pretrained on the Tiny ImageNet dataset to be used for transfer learning. The larger ResNet-50 model was only used pretrained for transfer learning purposes. Both model architectures will be described in more detail in the following sections.

Custom model For the smaller model architecture, it was decided to use an architecture known to be able to process 64 by 64 pixel images, since the EuroSAT images are of that size. The architecture that was implemented comes from Farag et al. (2016), in which the authors propose an architecture similar to AlexNet (Krizhevsky et al., 2012), with five convolutional layers and three fully connected layers. A more detailed view of the model architecture is presented in table 1. Fully connected (FC) layer one and FC layer two are each followed by a ReLU activation function and then a Dropout layer with a dropout rate of 50%. Each convolutional layer is immediately followed by a ReLU activation as well. The modifications made to the network of Farag et al. (2016) include adding BatchNorm layers after each convolutional layer plus activation, unless the convolutional layer plus activation is followed by a pooling layer (MaxPool); in that case, the BatchNorm layer is placed after the pooling layer. Other modifications were changing the input size of the first convolutional layer to three, instead of a one- or two-channel image input, and changing the number of output nodes of the output layer from two to 10, since there are 10 classes to distinguish between in our dataset. No regularization (Dropout/BatchNorm) or activation function (ReLU) was applied after the output layer.

The ReLU activations were implemented from torch.nn.ReLU with inplace set to True, to save memory since this way no extra memory has to be allocated for the output of the ReLU activation. The remaining parameters of the layers not mentioned in the table were the default PyTorch values for the torch.nn package’s Conv2d, MaxPool2d, BatchNorm2d, Linear and Dropout modules.

Layer     Input size         Output size   Kernel size   Stride
Conv1     # image channels   64            5             1
MaxPool   -                  -             2             2
Conv2     64                 192           3             1
MaxPool   -                  -             2             2
Conv3     192                384           3             1
Conv4     384                256           3             1
Conv5     256                256           3             1
MaxPool   -                  -             2             2
FC1       4096               4096          -             -
FC2       4096               4096          -             -
Output    4096               # classes     -             -

Table 1: Model architecture of the custom model. Not present in the table are the activation functions and the regularization layers, such as Batch Norm and Dropout.

This model architecture was used both for fully training the model and for transfer learning. This model was trained from scratch on the RGB-only EuroSAT data and on the 13-channel EuroSAT data. Aside from this, this model was also pretrained on the Tiny ImageNet dataset for transfer learning purposes. When the model was trained on RGB data (either fully trained on RGB EuroSAT images or on the Tiny ImageNet dataset), the input size of the first convolutional layer was three. When the model was trained on 13-channel EuroSAT data, the input size of the first convolutional layer was 13. For training on the EuroSAT data, the output size of the output layer was 10 and for the Tiny ImageNet data, the output size was 200.
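The architecture of table 1, together with the activation, BatchNorm and Dropout placement described in the text, can be sketched in PyTorch as follows. This is a reconstruction under assumptions, not the original code: the class name is hypothetical, and all unspecified parameters (such as padding) are left at their PyTorch defaults, which yields the 256 x 4 x 4 = 4096 features expected by FC1 for 64x64 inputs.

```python
import torch
import torch.nn as nn


class CustomNet(nn.Module):
    """Sketch of the custom model; layer sizes follow table 1."""

    def __init__(self, in_channels=3, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Conv + ReLU followed by pooling: BatchNorm comes after the pool.
            nn.Conv2d(in_channels, 64, kernel_size=5, stride=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.BatchNorm2d(64),
            nn.Conv2d(64, 192, kernel_size=3, stride=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.BatchNorm2d(192),
            # Conv + ReLU without pooling: BatchNorm directly after the ReLU.
            nn.Conv2d(192, 384, kernel_size=3, stride=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(384),
            nn.Conv2d(384, 256, kernel_size=3, stride=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(256),
            nn.Conv2d(256, 256, kernel_size=3, stride=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.BatchNorm2d(256),
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 4 * 4, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            # Output layer: no activation or regularization afterwards.
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)


# 13-channel EuroSAT variant; use in_channels=3 and num_classes=200
# for the Tiny ImageNet pretraining configuration.
model = CustomNet(in_channels=13, num_classes=10).eval()
with torch.no_grad():
    x = torch.rand(2, 13, 64, 64)
    feature_maps = model.features(x)
    logits = model(x)
```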

ResNet-50 Since Helber et al. (2019) acquired the best performance using a pretrained ResNet-50 model, it was decided to implement this model as well, as it was also of interest to see whether any viable approaches to transfer learning onto more than three channels would still hold when increasing the size of the model architecture.

ResNet models have very deep architectures, which is made possible by so-called residual blocks. A residual block is made up of two convolutional layers, with a ReLU activation between them. After the second convolutional layer, the original input that was fed to the first convolutional layer in the residual block is added to the output of the second convolutional layer, before passing through another ReLU activation. Instead of creating a completely new representation of the input, adding the block's input to its output creates a slight alteration of the input (He et al., 2016). This model was only used for transfer learning purposes and was imported pretrained.

Figure 5: Architecture of a residual block. Taken from He et al. (2016).

3.2 Transfer learning on RGB satellite images

Before performing transfer learning on image input with 13 input channels, the pretrained models were also fine-tuned on RGB-only satellite data as a baseline. The following sections explain the method used for performing transfer learning on the smaller and larger models when dealing with RGB-only satellite data.

3.2.1 Custom pretrained model

The custom pretrained model was trained on the Tiny ImageNet dataset, which means that the input image size was the same for both data sets. After pretraining the model, the final output layer was replaced with a new output layer to be able to classify the 10 classes in the EuroSAT data. Next, the model was fine-tuned on the EuroSAT RGB-only data.

3.2.2 Pretrained ResNet-50

The ResNet model used was pretrained on the ILSVRC2012 dataset. The images in this dataset are larger than the satellite images provided in the EuroSAT dataset. As mentioned in section 3.1.2, the EuroSAT images were increased in size. The pretrained ResNet state dictionary was then loaded in and the final output layer was replaced to have an output size of 10, to be able to distinguish between the 10 classes present in the EuroSAT dataset. Finally, the model was fine-tuned on the RGB-only EuroSAT images.

3.3 Transfer learning on multi-channel (> 3) satellite images

Out-of-the-box pretrained computer vision models in PyTorch are trained on RGB images, which means that transferring the knowledge of these models onto images with more than three channels is not as straightforward as transferring their knowledge onto RGB images. Without modifying the model's architecture or weights, the input dimension is fixed and the new input should have the same input dimension as the data the model was pretrained on. Considering that the 13 spectral bands of the images contain the bands for red, green and blue, it would be possible to use only these three bands. However, that would mean discarding the information present in the extra 10 channels. Since the objective of this research is to study the effects of fine-tuning a pretrained model on data with more input channels than the data the model was pretrained on, it is necessary either to modify the data to fit the three input channel requirement or to modify the model so that it can take images with more than three channels as input.

The transfer learning approaches on the 13-band multi-spectral images researched in this experiment can be divided into two categories; dimensionality reduction and weight modifications.

3.3.1 Dimensionality reduction

For dimensionality reduction two different approaches were researched: a convolutional layer was added to reduce the number of input channels to three in one approach and in the other approach PCA was performed to reduce the number of input channels to three.

Adding a convolutional layer Firstly, a single convolutional layer was added before the existing model architecture (including pretrained weights) of the pretrained model. A new model class was created which takes a pretrained model as input and creates a single convolutional layer with a 13-channel input dimension and a 3-channel output dimension, with a 1 by 1 filter and stride 1 to keep the width and height of the picture the same. In the forward pass of the model, the input is fed firstly through the single convolutional layer to reduce the channel dimension to three. The output of the first convolutional layer is then fed through the pretrained model, as if doing regular RGB transfer learning.
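A wrapper of the kind described above could be sketched as follows. The class name and the small stand-in backbone are hypothetical; in practice the wrapped module would be the pretrained custom model or ResNet-50.

```python
import torch
import torch.nn as nn


class ChannelReducer(nn.Module):
    """Prepends a 1x1 convolution that maps 13 channels to 3 channels."""

    def __init__(self, pretrained, in_channels=13, out_channels=3):
        super().__init__()
        # 1x1 filter with stride 1 keeps the width and height unchanged.
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1)
        self.pretrained = pretrained

    def forward(self, x):
        # Reduce to three channels, then proceed as in regular RGB transfer.
        return self.pretrained(self.reduce(x))


# Hypothetical stand-in for a pretrained RGB model with 10 outputs.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)

model = ChannelReducer(backbone)
x = torch.rand(2, 13, 64, 64)
with torch.no_grad():
    reduced = model.reduce(x)
    logits = model(x)
```

The 1x1 convolution is a learned linear combination of the 13 bands per pixel, so the mapping to three channels is trained jointly with the rest of the network during fine-tuning.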

PCA Secondly, PCA was used to reduce the number of input channels to three. The PCA was fitted incrementally using the scikit-learn IncrementalPCA (IPCA) class, due to the size of the dataset. The images were flattened to create a two-dimensional array of size (image width x image height) by image channels, in which each pixel is a sample described by 13 features. These flattened images were fed in batches to the IPCA for fitting. After fitting on all batches, the IPCA was used to transform the 13-channel data to 3-channel data before feeding it to the pretrained model(s).

Figure 6: Example of how the images were flattened before performing PCA to reduce the number of input channels.
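The flattening and incremental fitting can be illustrated with scikit-learn's IncrementalPCA. The helper names, the batch size and the (N, C, H, W) array layout are assumptions made for this sketch, which uses small synthetic images instead of the real data.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA


def flatten_pixels(images):
    """(N, C, H, W) -> (N*H*W, C): every pixel becomes a sample with C features."""
    n, c, h, w = images.shape
    return images.transpose(0, 2, 3, 1).reshape(-1, c)


def unflatten_pixels(flat, n, h, w):
    """(N*H*W, K) -> (N, K, H, W): restore the image layout with K channels."""
    return flat.reshape(n, h, w, -1).transpose(0, 3, 1, 2)


# Synthetic stand-in for 13-channel training images.
rng = np.random.default_rng(0)
train = rng.random((8, 13, 16, 16)).astype(np.float32)

ipca = IncrementalPCA(n_components=3)
batch_size = 4  # illustrative value
for start in range(0, len(train), batch_size):
    ipca.partial_fit(flatten_pixels(train[start:start + batch_size]))

# Transform the 13-channel data to 3-channel data.
reduced = unflatten_pixels(ipca.transform(flatten_pixels(train)), 8, 16, 16)
restored = unflatten_pixels(flatten_pixels(train), 8, 16, 16)
```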

For both dimensionality reduction methods, similar to performing transfer learning on the RGB-only dataset, the final output layer of the pretrained model was replaced with a new output layer with 10 output nodes.

3.3.2 Weight modifications

Aside from reducing the dimensionality of the data, it is also possible to extract the pretrained weights from a model and modify them so that they fit the new, 13-channel data by expanding the weight matrix. These modifications should be made to the weights of the first convolutional layer of the pretrained model, since that is the layer dealing with the fixed, so to speak, 3-channel input dimension.

First of all, the weight matrix of the first convolutional layer is of size output dimension x input dimension x filter size x filter size. This means that this weight matrix should be expanded along the input dimension from three to 13. For this expansion to work properly, the model architecture had to be slightly modified so that separate learning rates could be set for the RGB channels and the 10 extra channels. Since the weights from the pretrained model are tuned for RGB input, the 10 extra channels might need a slightly higher learning rate, to give the model a little push in fine-tuning the weights for these extra channels. To make this possible, the first convolutional layer of the model was split into two parallel convolutional layers: one for handling the RGB channels of the 13-channel input and one for handling the extra 10 channels. The input of the model was split into a 3-channel image (RGB) and a 10-channel image (extra channels) before going through the forward pass, where the split images were fed to their respective convolutional layers. The outputs of the parallel first convolutional layers were then added together before being fed to the rest of the (pretrained) model. See figure 12 for a more detailed view of the new architecture. For the 10-channel convolutional layer, bias was set to false, since we do not want to add the bias twice. The 3-channel first convolutional layer was initialized with the weights of the first convolutional layer of the pretrained model. There are multiple ways to initialize the weights for the extra 10 channels. The first two approaches described are based on the research done by Alexandre (2016) and Wollmann et al. (2018), mentioned in section Related Works.

Figure 7: Model architecture for transfer learning onto 13-channel input image using our custom model architecture. Below each convolutional layer, the output size is noted in bold. Between the convolutional layers, the activations, pooling operations and regularization are noted below the arrows.
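The split first layer can be sketched as below. The class name is hypothetical, and the sketch assumes that the input has been reordered so that the RGB bands occupy the first three channels. The final lines check numerically that the sum of the two parallel convolutions equals a single 13-channel convolution whose weight matrix is the concatenation of both weight matrices, which is why splitting the layer is a valid way of expanding the weight matrix.

```python
import torch
import torch.nn as nn


class SplitFirstConv(nn.Module):
    """Two parallel first convolutions: pretrained RGB part plus extra channels."""

    def __init__(self, pretrained_conv, n_extra=10):
        super().__init__()
        shared = dict(
            kernel_size=pretrained_conv.kernel_size,
            stride=pretrained_conv.stride,
            padding=pretrained_conv.padding,
        )
        out_channels = pretrained_conv.out_channels
        self.rgb_conv = nn.Conv2d(
            3, out_channels, bias=pretrained_conv.bias is not None, **shared
        )
        # Copy the pretrained RGB weights (and bias) into the 3-channel branch.
        self.rgb_conv.load_state_dict(pretrained_conv.state_dict())
        # No bias here, so the bias is not added twice.
        self.extra_conv = nn.Conv2d(n_extra, out_channels, bias=False, **shared)

    def forward(self, x):
        # Assumption: channels 0-2 are RGB, channels 3-12 are the extra bands.
        rgb, extra = x[:, :3], x[:, 3:]
        return self.rgb_conv(rgb) + self.extra_conv(extra)


# Stand-in for the pretrained first layer of the custom model.
pretrained_conv = nn.Conv2d(3, 64, kernel_size=5)
first = SplitFirstConv(pretrained_conv)

x = torch.rand(2, 13, 64, 64)
with torch.no_grad():
    out = first(x)
    # Equivalent single convolution with a 64 x 13 x 5 x 5 weight matrix.
    full = nn.Conv2d(13, 64, kernel_size=5)
    full.weight.copy_(torch.cat([first.rgb_conv.weight, first.extra_conv.weight], dim=1))
    full.bias.copy_(first.rgb_conv.bias)
    out_full = full(x)
```

Keeping the two branches as separate modules makes it straightforward to place their parameters in different optimizer parameter groups with different learning rates.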

Default initialization First, the effect of using the default initialization for the extra channels was examined. The RGB channels were initialized with the pretrained RGB weights.

Weights of a single color-channel Secondly, the weights of a single color channel were used to initialize the extra channels. This was examined for all color channels. An empty 64 x 10 x 5 x 5 weight matrix was created to store the multi-spectral weight matrix. Then, the weights of a single color channel were extracted from the original weight matrix and copied to all 10 (new) channels.


Average color-channel weights Another approach researched was to use the average of the RGB filter weights to initialize the extra channel weights. For each element in the weight matrix the weight-values for the red, green and blue channel were added up and then divided by three. These average weights were then used to initialize the extra channels.
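Both the single color-channel and the average color-channel initialization amount to simple tensor operations on the pretrained first-layer weights. The following sketch uses a random 64 x 3 x 5 x 5 tensor as a stand-in for those weights; the function names are hypothetical, and treating index 0 as the red channel is an assumption about the channel order.

```python
import torch


def init_from_single_channel(rgb_weight, channel, n_extra=10):
    """Copy the filters of one color channel to all extra channels.

    rgb_weight: tensor of shape (out, 3, k, k); returns (out, n_extra, k, k).
    """
    return rgb_weight[:, channel:channel + 1].repeat(1, n_extra, 1, 1)


def init_from_average(rgb_weight, n_extra=10):
    """Use the element-wise mean of the three color channels for all extra channels."""
    return rgb_weight.mean(dim=1, keepdim=True).repeat(1, n_extra, 1, 1)


# Random stand-in for the pretrained first-layer weights.
rgb_weight = torch.randn(64, 3, 5, 5)

w_red = init_from_single_channel(rgb_weight, channel=0)  # assumed: 0 = red
w_avg = init_from_average(rgb_weight)
```

The resulting tensors have the shape of the weight matrix of the 10-channel convolutional layer and could be copied into it before fine-tuning.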

Most similar color-channel weights Lastly, for each extra channel it was determined which color channel was the most similar in terms of the distribution of pixel values, and the weights of the extra channel were initialized with the weights of that color channel. Similarity was assessed with a Kolmogorov-Smirnov test: each extra channel was compared to the three color channels, and the weights of the color channel that returned the smallest KS-statistic were used as initialization.

The final output layer of the pretrained model was replaced with a new output layer with 10 output nodes.
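The channel matching can be illustrated with a small NumPy implementation of the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical cumulative distribution functions. Library routines such as scipy.stats.ks_2samp compute the same statistic. The function names and the synthetic pixel samples are assumptions made for this sketch.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample KS statistic: maximum distance between the empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def most_similar_color(extra_pixels, color_pixels):
    """Index of the color channel with the smallest KS statistic, plus all statistics."""
    stats = [ks_statistic(extra_pixels, c) for c in color_pixels]
    return int(np.argmin(stats)), stats


# Synthetic pixel samples: three "color" channels and one "extra" channel.
rng = np.random.default_rng(0)
colors = [rng.normal(m, 0.1, 5000) for m in (0.2, 0.5, 0.8)]
extra = rng.normal(0.5, 0.1, 5000)

best, stats = most_similar_color(extra, colors)
```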

3.4 Metrics

The performance of the models and transfer learning approaches was measured on multiple metrics: accuracy on the test data, quantity of data needed to achieve RGB-accuracy and training time. It was decided to research multiple metrics since improvement on accuracy alone is not the only goal. Considering large remote sensing datasets are scarce, it would be beneficial to research whether using all of the available image channels in remote sensing images would decrease the amount of data necessary to achieve the same accuracy performance as using only the RGB channels. Aside from this, training time was assessed to research whether using all of the image channels would slow down training greatly. These metrics will be elaborated on in the following subsections.

3.4.1 Accuracy

Accuracy of the models was measured on the test set of the data, as the percentage of correctly predicted images. The models were trained on the training set and evaluated on the validation set at the end of each epoch. If the validation accuracy improved on the highest accuracy reached up to that point, the model's weights were saved together with the optimizer state, overwriting any previously saved weights and optimizer state. After training was completed, the saved model thus contained the weights from the epoch at which the model reached peak performance on the validation set. This saved model was then used to calculate accuracy on the test set.
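The checkpointing rule can be summarized in a framework-independent sketch. The callables train_one_epoch, evaluate and get_state are hypothetical placeholders for the actual training, validation and state-extraction code, and the state is kept in memory here rather than written to disk. The validation accuracies in the demo are made-up numbers that only serve to exercise the loop.

```python
import copy


def train_with_best_checkpoint(train_one_epoch, evaluate, get_state, n_epochs):
    """Keep the state from the epoch with the highest validation accuracy."""
    best_acc, best_state = float("-inf"), None
    for _ in range(n_epochs):
        train_one_epoch()
        acc = evaluate()
        if acc > best_acc:
            # Overwrite the previously kept state with the improved one.
            best_acc = acc
            best_state = copy.deepcopy(get_state())
    return best_acc, best_state


# Demo with made-up validation accuracies (not results from this research).
fake_val_acc = [0.50, 0.70, 0.60, 0.90, 0.80]
state = {"epoch": -1}


def train_one_epoch():
    state["epoch"] += 1


def evaluate():
    return fake_val_acc[state["epoch"]]


best_acc, best_state = train_with_best_checkpoint(
    train_one_epoch, evaluate, lambda: state, n_epochs=5
)
```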

3.4.2 Amount of data samples needed

Aside from accuracy, it was also researched how many data samples the 13-channel multi-spectral models trained from scratch and the transfer approaches needed to reach the same accuracy as their RGB counterparts, the baselines. To illustrate: the 13-band multi-spectral model trained from scratch was measured on the amount of data samples needed to reach the same accuracy as the RGB model trained from scratch, and the amount of data needed for the 13-band multi-spectral transfer approaches was measured in comparison to the accuracy reached when doing transfer learning on the RGB-only data.

This was researched by training the 13-channel models on reduced amounts of training data, aside from training the models on the complete training set. The percentages of train data used to train the model range from 10% to 100% of the data in steps of 10%.
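One possible way to draw the reduced training sets is sketched below. How the subsets were sampled (randomly, stratified per class, or nested) is not stated, so random sampling with a fixed seed is an assumption, as is the function name. With a shared seed, each smaller subset is contained in the larger ones.

```python
import numpy as np


def subset_indices(n_samples, fraction, seed=0):
    """Random subset of sample indices covering the given fraction of the data."""
    rng = np.random.default_rng(seed)
    n_keep = int(round(n_samples * fraction))
    return rng.permutation(n_samples)[:n_keep]


n_train = 21600  # size of the EuroSAT training set
fractions = [round(0.1 * k, 1) for k in range(1, 11)]  # 10% to 100%
subsets = [subset_indices(n_train, f) for f in fractions]
sizes = [len(s) for s in subsets]
```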

3.4.3 Training time

Training time was measured visually based on the validation curves and in seconds per epoch. Since all models were trained (or fine-tuned) for 100 epochs, these curves were compared on how quickly the models converged and how long one epoch took in seconds.

Training time for the models while pretraining was not taken into account here, since we are only interested in the amount of time the models needed to perform well on the satellite data when performing transfer learning.

3.5 Experiments

The following sections describe in what manner the experiments were performed: the hardware and software used, how the hyperparameter settings were chosen and how the investigated approaches were evaluated in more detail.


3.5.1 Hardware & Software

All experiments were run on an Nvidia GeForce RTX 2080 GPU and made use of Python 3.6.3 (Anaconda), PyTorch version 1.3.0, Torchvision version 0.4.1 and CUDA toolkit version 10.1.168.

3.5.2 Fully trained models

Before experimenting with any approaches to perform transfer learning on multi-channel images, it is useful to establish that using the extra channels in computer vision tasks on satellite images, such as image classification, enhances performance, either by reaching a better accuracy, reducing training time and/or needing less data to reach a similar accuracy as RGB. Therefore, the baseline RGB and baseline multi-spectral models were compared to each other on the metrics mentioned in the previous section while trained from scratch on the EuroSAT data.

For the RGB baseline model, a gridsearch was performed over the following hyper parameters:

• Batch size: 32, 64, 128, 256, 512

• Learning rate: 0.001, 0.0001, 0.00001

• Weight decay: 0, 0.0001, 0.00001, 0.000001

The optimizer used was Adam (Kingma and Ba, 2015), the loss function used was Cross Entropy loss and the initialization of the model weights and biases was the PyTorch default, Kaiming/He initialization (He et al., 2015). The model was evaluated on the validation set after each epoch of training on the complete train dataset. If the validation accuracy improved with regard to the best accuracy reached so far, the model was saved. This was done, instead of implementing early stopping, to avoid keeping a model that had started to overfit. The best performing configuration of hyper parameters on the validation set (batch size 128, learning rate 0.001 and weight decay 0.00001) was evaluated on the test set. Note that it could be possible to find a better configuration of hyper parameters with a bigger gridsearch over more parameters and values. However, since optimizing the models was not the goal of this research, it was decided to limit the size of the gridsearch to the above-mentioned hyper parameters.
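The gridsearch over the RGB baseline hyper parameters covers 5 x 3 x 4 = 60 configurations and can be sketched with itertools.product. The function train_and_validate is a hypothetical placeholder for a full training run that returns the peak validation accuracy; the toy scoring function used in the demo is not a real result and only serves to make the sketch runnable.

```python
from itertools import product

# Grid for the RGB baseline model, as listed in the text.
grid = {
    "batch_size": [32, 64, 128, 256, 512],
    "lr": [1e-3, 1e-4, 1e-5],
    "weight_decay": [0, 1e-4, 1e-5, 1e-6],
}


def grid_search(grid, train_and_validate):
    """Return the configuration with the highest validation accuracy."""
    best_cfg, best_acc = None, float("-inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        acc = train_and_validate(**cfg)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc


def toy_validation_accuracy(batch_size, lr, weight_decay):
    # Placeholder score, NOT a real result: stands in for
    # "train the model, then return its peak validation accuracy".
    return -(abs(batch_size - 128) / 512 + abs(lr - 1e-3) + abs(weight_decay - 1e-5))


n_configs = len(list(product(*grid.values())))
best_cfg, best_score = grid_search(grid, toy_validation_accuracy)
```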

For the 13-channel baseline model, the best performing hyper parameters for the RGB model were used. This was, again, done to keep the models trained on RGB-only data and on all 13 spectral bands as similar as possible. Training of the 13-channel baseline model was done in exactly the same manner as training the RGB-only model: the model was trained for 100 epochs and evaluated on the validation set after each epoch. If validation accuracy improved, the model was saved. After training, the 13-channel baseline model was evaluated on the test set. Aside from training on the complete train set, it was also researched what the effect would be of training the 13-channel model on a decreased amount of training samples. The model was trained in exactly the same manner for 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of the available training data and, again, these models were evaluated on the test set.

3.5.3 Pretraining

Custom model The custom model was manually pretrained on the Tiny ImageNet dataset. A small gridsearch was performed, since training the model on the Tiny ImageNet data was time consuming and optimizing this model was not one of the goals of this research. The gridsearch was performed over the following hyper parameters:

• Batch size: 32, 256

• Learning rate: 0.001, 0.0001, 0.00001

• Weight decay: 0, 0.00001

The optimizer, loss function and initialization were the same as the ones used for the fully trained models; Adam, Cross Entropy Loss and default PyTorch initialization for the model weights and biases. The model was trained for 200 epochs and, similar to the baseline model training, the model was saved if validation accuracy improved after training for another epoch. The best performing configuration of hyper parameter values was a batch size of 32, a learning rate of 0.0001 and a weight decay value of 0.00001.

The peak validation accuracy of the model was not very high: 51.62 %. Since optimizing this pretrained model was not the objective of this research, the final resulting model was considered satisfactory since it is performing unquestionably better than chance (which would be 0.5% accuracy for 200 classes).

ResNet-50 ResNet-50 was imported from PyTorch already pretrained on the ILSVRC2012 dataset, so there was no manual pretraining performed on this model.


3.5.4 Transfer learning

Custom model After the custom model was pretrained on the Tiny ImageNet dataset, a model was fine-tuned on the RGB-only EuroSAT data and a model was fine-tuned on the 13-channel data.

Firstly, the final output layer of the model was replaced, since the pretrained model was made to distinguish between 200 classes and the EuroSAT dataset only contains images from 10 classes. This new output layer was not pretrained. Next, a gridsearch was performed over the following hyper parameters for fine tuning on the RGB-only satellite data:

• Batch size: 32, 64, 128, 256, 512

• Base learning rate (for the pretrained layers): 0.001, 0.0001, 0.00001

• Output layer learning rate: 0.01, 0.001, 0.0001

• Weight decay: 0, 0.0001, 0.00001, 0.000001

Again, the optimizer used was Adam, the loss function Cross Entropy Loss and the replaced output layer initialization was PyTorch default. The model was trained for 100 epochs and saved in the same manner as the other models; only if validation accuracy improved. The best performing configuration of hyper parameters (batch size 512, base learning rate 0.0001, output layer learning rate 0.0001 and weight decay 0.0001) was evaluated on the test set.

For all 13-channel transfer approaches, the best performing hyper parameters on RGB-only transfer were used. The output learning rate was, in the case of 13-channel transfer for both the dimensionality reduction and extending the weight matrix approaches, used for all new layers. So in the case of extending the weight matrix, for the first convolutional layer receiving the 10 channel input the same learning rate was used as the output layer and in the case of adding a convolutional layer at the beginning of the model, this layer also used the output layer learning rate. The 13-channel transfer approaches were trained for 100 epochs as well and only saved when validation accuracy improved. For the 13-channel transfer approaches it was also researched what performance was like using the reduced amounts of data. After training the models, accuracy was evaluated on the test set.
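Separate learning rates for pretrained and new layers can be set through optimizer parameter groups in PyTorch. The sketch below uses a small hypothetical stand-in model, and the learning rate and weight decay values are illustrative picks from the searched grids rather than a statement of the final configuration.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a model pretrained on 200 classes.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 200),
)

# Replace the output layer with a new, untrained layer for 10 classes.
model[4] = nn.Linear(8, 10)

new_params = list(model[4].parameters())
new_ids = {id(p) for p in new_params}
base_params = [p for p in model.parameters() if id(p) not in new_ids]

base_lr, new_lr = 1e-4, 1e-3  # illustrative values from the searched grids
optimizer = torch.optim.Adam(
    [
        {"params": base_params, "lr": base_lr},  # pretrained layers
        {"params": new_params, "lr": new_lr},    # all newly added layers
    ],
    weight_decay=1e-4,
)

with torch.no_grad():
    logits = model(torch.rand(2, 3, 16, 16))
```

For the 13-channel approaches, the parameters of an added 1x1 convolution or of the 10-channel first convolutional layer would simply be appended to the second parameter group.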

ResNet-50 For the most viable 13-channel transfer approaches on the pretrained custom model it was researched whether these approaches scale to a larger model: ResNet-50. A pretrained ResNet-50 model was fine-tuned on both the RGB and 13-channel EuroSAT data. The output layer of the model was replaced to fit the 10 classes in the EuroSAT data, considering that the dataset ResNet-50 was pretrained on consisted of 1000 classes to distinguish between. Subsequently, a small gridsearch was performed while fine-tuning on the RGB-only satellite data, to find a good hyperparameter configuration over the following hyperparameters and values:

• Batch size: 32, 128, 256, 512

• Base learning rate (for the pretrained layers): 0.001, 0.0001, 0.00001

• Output layer learning rate: 0.01, 0.001, 0.0001

• Weight decay: 0.0001, 0.00001

As with the previous models, the optimizer used was Adam and the loss function was the Cross Entropy Loss. The model was fine-tuned for 100 epochs and was saved only if validation accuracy improved. The model with the best performing configuration of hyperparameters (batch size 512, base learning rate 0.0001, output layer learning rate 0.001 and weight decay 0.0001) regarding validation accuracy was then evaluated on the test set.

For the 13-channel transfer approaches tested on ResNet-50, the best performing hyperparameters on RGB-only fine-tuning were used. The output layer learning rate was in this case used for all new layers of the model. The model was then fine-tuned for 100 epochs as well, with the model saved when validation accuracy improved. For the 13-channel transfer approaches it was also researched what performance was like using reduced amounts of data. After training the models, accuracy was evaluated on the test set.

3.5.5 Evaluation

The different types of models were not compared with each other. Instead the RGB models were compared to their 13-channel counterparts on test accuracy, amount of data samples needed and training time; RGB training from scratch was compared to 13-channel training from scratch, doing transfer learning from the custom model pretrained on Tiny ImageNet onto RGB satellite data was compared to all approaches of doing transfer learning from the custom model onto 13-channel satellite data and the same holds for ResNet-50.


4 Results

The results of the experiments conducted in this research (sections 3.5.2 and 3.5.4) will be presented in the following sections; starting with the results for the model fully trained, which are followed by the results of multi-channel transfer learning. First, the results on the smaller custom model will be elaborated. Lastly, the results obtained on the ResNet-50 model are described.

4.1 Training from scratch

The first experiment entailed training the custom model architecture from scratch on RGB-only data and on (reduced amounts of) 13-channel data. In figure 8a the average test accuracy is reported for the model trained on (reduced amounts of) the 13-channel train data. The grey dashed line is the average test accuracy obtained by the model trained on the full RGB train dataset. This figure illustrates that, when trained on the 13-channel data, the model outperforms its counterpart trained on RGB-only data in terms of accuracy and quantity of data needed when tasked with land use and land cover classification. When all 13 channels of the data were used, the model trained on 40% of the available train data starts to match the performance of the model trained on the RGB-only data with the same hyper parameter setting. When 60% or more of the data was used, the model trained on all 13 channels outperforms the RGB baseline, with a peak test-set accuracy improvement of 2.24%.

Figure 8: Performance of the RGB-only and 13-channel models, trained from scratch. (a) Test accuracy of the 13-channel model trained on incremental amounts of train data. The grey line shows the test accuracy of the model trained on the complete RGB train set. The turquoise line corresponds to the test accuracy achieved by the model trained on the different percentages of train data. Shown are averages over three runs; error bars represent the standard deviation over the three runs. (b) Validation accuracy curves of one of the models trained for 100 epochs, for both the 13-channel and the RGB-only data.

Figure 8b shows the validation accuracy during training on the full training set for RGB and 13-channel data. When trained on 13-channel data, the model appears to converge faster than its RGB counterpart, and it also reaches the peak RGB validation accuracy sooner. Note, however, that this faster convergence is measured in epochs. Each epoch takes slightly, but not considerably, longer because the input has more channels: the 13-channel model takes 10 seconds per epoch, while the RGB model takes about 9 seconds per epoch. For the models trained on 40% of the data, which achieve an accuracy similar to training on the full RGB dataset, the time per epoch is 4 seconds. This is about twice as fast as training on the full RGB dataset.
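The timing comparison can be made explicit with a small calculation using the per-epoch times reported above (10, 9 and 4 seconds) and the 100 training epochs. The helper names are hypothetical, and the totals assume that every run completes all 100 epochs at a constant time per epoch.

```python
def total_training_seconds(seconds_per_epoch: float, epochs: int = 100) -> float:
    """Wall-clock training time, assuming a constant time per epoch."""
    return seconds_per_epoch * epochs


def speedup(baseline_seconds_per_epoch: float, candidate_seconds_per_epoch: float) -> float:
    """How many times faster the candidate is than the baseline, per epoch."""
    return baseline_seconds_per_epoch / candidate_seconds_per_epoch


rgb_full = total_training_seconds(9)           # 900 s for full RGB data
multispectral_full = total_training_seconds(10)  # 1000 s for full 13-channel data
multispectral_40 = total_training_seconds(4)     # 400 s for 40% of the 13-channel data
ratio = speedup(9, 4)                          # 2.25, i.e. "about twice as fast"
```

Under these assumptions, training on 40% of the 13-channel data takes 400 seconds in total against 900 seconds for the full RGB dataset.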

Together, figure 8a and figure 8b show that the custom model architecture trained from scratch on 13-channel data outperforms its RGB counterpart on all three metrics evaluated: accuracy reached on the test data, number of data samples needed to achieve RGB performance, and training time (in number of epochs until convergence).

4.2 Transfer learning - custom model

In the previous section it was shown that the custom model trained on 13-channel data outperforms its RGB counterpart. This implies that it is beneficial to make use of these extra channels, at least when training from scratch. The next step in the research was to experiment with multiple approaches to transfer learning when the target dataset has more image channels than the source data. The results of these approaches are described in the following sections.

4.2.1 Dimensionality reduction

The first transfer learning approaches researched were the dimensionality reduction methods: one model was trained on data where PCA was used to reduce the channel dimension from 13 to three, and another model was trained with an extra convolutional layer at the beginning of the model to reduce the channel dimension to three (section 3.3.1). Figure 9 displays the accuracy achieved on the test set for both PCA (orange line) and convolutional dimensionality reduction (yellow line) on incremental amounts of training data. The grey dashed line is the accuracy on the test data obtained by the model trained on RGB data.

Figure 9: Test accuracy of the models fine-tuned on incremental amounts of data with reduced input channels. In orange, the performance of the model that received three-channel input obtained by performing PCA on the 13-channel data; in yellow, the performance of the model that used an extra convolutional layer to reduce the 13-channel data to three input channels. In grey, the performance of the model fine-tuned on the complete RGB train data. Shown are the averages over three runs. Error bars represent the standard deviation over the three runs.

PCA

From figure 9 it can be inferred that the model fine-tuned on data reduced from 13 channels to three by a PCA transformation fitted on the training data does not considerably outperform the model fine-tuned on RGB data: the increase is 0.39% when fine-tuned on the complete train dataset. Besides offering no notable accuracy improvement for the same amount of data, this method also seems to require about the same amount of data as its RGB counterpart, as figure 9 shows as well. With PCA applied to the 13-channel data, the model starts to perform similarly to its RGB counterpart once 80% of the available train data is used.
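To illustrate what reducing 13 channels to three with a PCA fitted on the training data can look like, a minimal NumPy sketch follows. It is not the implementation used in this research. It assumes that every pixel is treated as one 13-dimensional sample, that images are stored as (N, H, W, C) arrays, and that the projection is fitted on the train set only and then reused for validation and test data; the function names are hypothetical.

```python
import numpy as np


def fit_channel_pca(train_images: np.ndarray, n_components: int = 3):
    """Fit PCA over the channel axis of images shaped (N, H, W, C).

    Returns the per-channel mean (C,) and the components (n_components, C).
    """
    channels = train_images.shape[-1]
    # Assumption: each pixel is one sample with C features.
    pixels = train_images.reshape(-1, channels).astype(np.float64)
    mean = pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False)            # (C, C) channel covariance
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order].T              # rows sorted by decreasing variance
    return mean, components


def apply_channel_pca(images: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project (N, H, W, C) images onto the components, giving (N, H, W, n_components)."""
    return (images - mean) @ components.T
```

Fitting on the train set and applying the same `mean` and `components` to held-out data keeps information from the test set out of the transformation. Working from the C x C covariance matrix keeps the cost of the decomposition independent of the number of pixels.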

Figure 10 displays the validation accuracy during fine-tuning on RGB data (grey curve) and on the PCA-transformed data (orange curve). With the PCA method, the model appears to converge slightly faster than when fine-tuning on RGB-only data. However, the difference between the curves is too small to indicate a considerable difference in convergence time. Per epoch, the model fine-tuned on RGB data takes about 8 seconds, while the model trained on all PCA-transformed data takes about 32 seconds.

Figure 10: Validation accuracy curves of the models during fine-tuning for 100 epochs. In orange, the curve of the model fine-tuned on the full train dataset whose number of input channels was reduced (from 13 to three) by performing PCA. In grey, the validation accuracy curve of the model fine-tuned on the full RGB train data.

Adding a convolutional layer

Referring back to figure 9, it can be seen that the model in which an extra convolutional layer was added to decrease the dimensionality of the data does not outperform the model fine-tuned on RGB-only data. At best, the model achieves a test accuracy improvement of 0.19% over the model fine-tuned on RGB when the full dataset was used.

Figure 11 shows the validation accuracy curves for the model fine-tuned on RGB data (grey) and the model fine-tuned with an extra convolutional layer that performs dimensionality reduction of the 13-channel data (yellow). In terms of number of epochs, there is no difference in time until convergence between the two methods. Time in seconds per epoch is slightly higher for the model with the extra convolutional layer: 11 seconds per epoch instead of 8 for fine-tuning on RGB. As can be seen in figure 9, this method starts to match RGB accuracy when 90-100% of the 13-channel data was used. Both figure 9 and figure 11 show that the model with an added convolutional layer to reduce the input channel dimension to three does not outperform the model fine-tuned on RGB images.
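As an illustration of how a convolutional layer can map 13 input channels to three, the sketch below implements a 1x1 convolution in NumPy. The kernel size, the (N, C, H, W) layout and the function name are assumptions made for this example; the configuration actually used is the one described in section 3.3.1. In practice the weights would be learned during fine-tuning rather than set by hand.

```python
import numpy as np


def pointwise_conv(images: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Apply a 1x1 convolution to images shaped (N, C_in, H, W).

    weight has shape (C_out, C_in) and bias has shape (C_out,).
    Each output pixel is a linear combination of the input channels
    at the same spatial location.
    """
    out = np.einsum("oc,nchw->nohw", weight, images)
    return out + bias[None, :, None, None]


# Hypothetical example: a batch of two 13-channel patches reduced to three channels.
rng = np.random.default_rng(0)
patches = rng.normal(size=(2, 13, 8, 8))
weight = rng.normal(size=(3, 13))   # would be trainable parameters in a real model
bias = np.zeros(3)
reduced = pointwise_conv(patches, weight, bias)  # shape (2, 3, 8, 8)
```

Unlike PCA, such a layer is trained jointly with the rest of the network, so the three output channels are optimized for the classification task rather than for explained variance.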

Figure 11: Validation accuracy curves of the models during fine-tuning for 100 epochs. In yellow, the curve of the model fine-tuned on the full train dataset whose number of input channels was reduced (from 13 to three) by adding an extra convolutional layer at the beginning of the model. In grey, the validation accuracy curve of the model fine-tuned on the full RGB train data.
