
Artificial staining of raw microscopy images using CycleGANs

Georgios Doulgeris

s3742237


Master’s thesis June 2020

Student: Georgios Doulgeris

First supervisor: prof. Dr. L.R.B. Schomaker

Second supervisor: prof. Dr. I.J. van der Klei


Contents

1 Introduction
  1.1 Project Goal

2 Theoretical Background
  2.1 Color spaces
    2.1.1 RGB
    2.1.2 HSV
    2.1.3 YUV
    2.1.4 YCbCr
  2.2 Color Histograms
  2.3 Image Colorization
  2.4 Convolutional Neural Networks
    2.4.1 Receptive Field
    2.4.2 Transpose Convolution
  2.5 U-Net
    2.5.1 Attention U-Net
  2.6 Generative Adversarial Networks
  2.7 CycleGAN
    2.7.1 Architecture
    2.7.2 Training procedure
  2.8 Normalization
    2.8.1 Instance normalization
    2.8.2 Spectral Normalization

3 Experimental Setup
  3.1 Data set
    3.1.1 Data augmentation
  3.2 Pre-processing
  3.3 Model Architecture
  3.4 Experiments
    3.4.1 Color spaces
    3.4.2 Spectral Normalization
    3.4.3 Attention
    3.4.4 Losses
    3.4.5 Receptive fields
    3.4.6 Two Time-Scale Update Rule (TTUR) and Asynchronous Updates
  3.5 Experiment conditions
  3.6 Evaluation Metrics

4 Results
  4.1 Image Normalisation and Activation Function
  4.2 Colorspaces
  4.3 Spectral Normalization
  4.4 Attention
  4.5 Receptive Field Size
  4.6 Losses
  4.7 TTUR and asynchronous updating
  4.8 Pix2Pix
  4.9 Cycle Consistency

5 Discussion
  5.1 Discussion on the results of the Experiments
    5.1.1 Image Normalisation and Activation function
    5.1.2 Color spaces
    5.1.3 Spectral Normalization
    5.1.4 Attention
    5.1.5 Receptive Field Size
    5.1.6 Losses
    5.1.7 TTUR and asynchronous updates
    5.1.8 Cycle Consistency
  5.2 Challenges and future work
    5.2.1 Decoupling of stains
    5.2.2 Evaluation
  5.3 Research Questions
    5.3.1 Can a CycleGAN learn to digitally stain grayscale single-cell images?
    5.3.2 What are the main components that enable better staining quality?
    5.3.3 How does the CycleGAN approach compare to a supervised learning based one?
  5.4 Conclusion


List of Figures

2.1 The convolution operation in the context of CNNs. An input image (left side) is padded with zeros (middle). A convolutional filter (the blue matrix) slides through the padded input getting the dot product in each location. In this example the stride is 2 and the convolution process is shown by the different locations of the filter. The first location is the matrix with the blue outline, the second is the one with the red outline. The matrix with the green outline is a subsequent location. Figure taken and adapted from [31]

2.2 In this setting, a CNN is trained for semantic segmentation of scenes. The two squares on the left side represent two different receptive fields of the network used to predict the label of the pixel on the right picture. The receptive field represented by the green square is too small to recognise that the target pixel is part of a larger object thus resulting in wrong classification, the receptive field represented by orange is big enough to identify the whole object resulting in correct segmentation

2.3 Example of a transpose convolution by emulating a convolution. A direct convolution of a 4 × 4 input (green blue) with a 3 × 3 kernel (gray) produces a 2 × 2 output (light blue). The transpose convolution of that is equivalent with convolving that 2 × 2 feature map padded with zeros as illustrated with a kernel of 3 × 3 to produce a feature map with the original 4 × 4 input. Strides in this case are set to 1. Figure taken from [14]

2.4 Architecture of original U-net taken from [52]

2.5 Attention gate design as suggested in [42]

2.6 Standard GAN architecture. A random noise vector is passed as input to the generator which is transformed to a fake image. A (real, fake) pair of images is then presented to the discriminator who produces scores for each of the images. The pair error is back propagated to the discriminator and the fake image score is propagated back to the generator. Figure taken from [55]

2.7 Example of a training step. In (a) the relations between the GANs and the domains are established. In (b) we see an example of a forward cycle: An image x from domain X is translated to domain Y by generator G. The translated image Ŷ is presented to the discriminator DY and also fed to the generator F so that it can be translated back to domain X. The backward cycle is seen in (c) and it is the same procedure in the opposite direction. Figure taken from [71]

3.1 Instances of the training data. 3.1a shows stained cells, 3.1b shows unstained images

3.2 Example of the image augmentation. Starting from the top left: Original image, horizontal flipping with random cropping, vertical flipping with random shear. Bottom left: Random scaling with randomly adjusted brightness, random rotation with augmented contrast, random cropping with random scaling

3.3 Baseline discriminator architecture. The number above the block denotes the number of convolutional filters and the number inside denotes the stride.

3.4 Baseline Generator architecture. The number above the block denotes the number of convolutional filters and the number inside denotes the stride

4.1 First column shows the grayscale images presented as input to the generator. Second column shows the images that were normalised in the [−1, 1] range with the hyperbolic tangent used as the activation function in the last layer of the generator. Similarly in the third column images were normalised into the [0, 1] range using the relu activation function. The last column shows the actual stained microscopy images.

4.2 Heatmaps of the activations of the output layer for each of the networks. Each subfigure consists of three images representing the three color channels of the image. Top left heatmap shows how strongly the blue color is predicted, top right shows the prediction of green and in the bottom the prediction of red is shown. Blue represents lack of prediction of a respective color whereas red pixels indicate that said color was confidently predicted by the network. This figure shows how the colors are predicted when an image is colorized by the generator.

4.3 First column is the raw grayscale input. Each column represents a CycleGAN using images in a specific colorspace, with the last column being the ground truth images. We can see that the RGB samples are the ones resembling the most the ground truth cells, with the HSV color space one completely missing the mark.

4.4 Second column shows results from a network without SN, third column shows results of a network using SN. The results from the network using SN have more vivid colors and slightly more accurate prediction of the blue stain.

4.5 When SN is applied to the CycleGAN (third column), the resulting samples are more carefully stained especially when we look at the green stain. The images produced by the CycleGAN without SN (second column) have more blended colours and its components are less distinguishable with respect to the ground truth (fourth column). The experiment was performed in the YCbCr colorspace

4.6 Generated samples for models that utilize the Attention U-Net as the generator. The third row shows the results of a model that was trained for 60 epochs and the results on the fourth column has a batch size of 6. It is clear that the samples generated by the Attention CycleGAN with SN are the closer ones to the ground truth.

4.7 Effect of SN to the training stability of the CycleGAN that uses attention. Generator B is the generator responsible for translating images from grayscale to color. DisA is the discriminator that deals with stained images. The application of SN reduces oscillation of the loss and stabilizes training. A batch size larger than 1 (in this case 6) needs fewer iterations for the same amount of epochs.

4.8 Left: feature maps from the last attention layer. Right: Heatmaps of the activations from the output layer showing how each of the primary colors is predicted just before the final image reconstruction.

4.9 A CycleGAN with a 23 × 23 PatchGAN as a discriminator produces darker samples with the blue stains not being recognised accurately. The samples pictured in the third column have a good balance of correct brightness and stain prediction both compared to the model in the second column and the one that has a discriminator with a receptive field with a size of 47 pixels. The latter, while being closer to the ground truth with respect to the hue of the colors, misses some structural elements when staining the raw microscopy cells.

4.10 The U-net generator here has a receptive field on the downsampling path equal to 190 pixels. In the second column, we can see that the network produces dark samples, with more bland colors and missing the correct localization of the peroxisome stain, whereas it is more successful in staining the vacuoles. The results of the CycleGAN that has a discriminator with a receptive field of 39 pixels are heavily pixelated, not looking natural, however the color performance has improved with more bright colors and more correct staining both for the blue and green. In the fourth column, the network with the larger receptive field produces samples that are more accurately stained and with structures that resemble the ones from the ground truth images.

4.11 All networks manage to include most structural elements during the translation. The results differ in the actual color that is chosen to stain the pixels. The SSIM model in the second column underpredicts the vacuole stain and also misses the peroxisome (blue) stain. In the third column we have an overprediction of the green stain. Adding the L1 component balances the two, achieving a more precise prediction of the green stain, and decent localization of the peroxisome stain for the most part as seen in the fourth column. The model that uses the histogram based loss (5th column) is conservative in its predictions with more balanced images color-wise, but lacking the prediction of the blue stain.

4.12 Predicted samples of a model using a MS-SSIM based loss that forgets any knowledge it has acquired during training. These predictions were made at training time and each pair corresponds to an epoch. The network exhibits a progress in which it progressively learns to stain the images better with each epoch. However, in the last epoch it completely forgets everything and produces an almost untranslatable sample.

4.13 All models generate samples that maintain the general structure of the grayscale input. The data instance in the first row is translated very convincingly by all models. Model 4's predictions in the 5th row seem to be the closer ones to the ground truth.

4.14 The generated samples from the Pix2Pix (Columns 2 and 3) suffer from bright colors and lack of prediction of the blue stain, a phenomenon we came across in our previous CycleGAN experiment. Our tuned CycleGAN generates samples that resemble slightly more the ones from the ground truth.

4.15 Colorized images from the CycleGAN used as input to test the cyclical consistency of the network.

4.16 Stained images from the dataset used as input to test the inverse mapping.

5.1 The Laplacian Kernel

5.2 Mean image for each of the three channels and the overall mean dataset image. Top left shows the mean for the red channel, top right for the green channel, bottom left for the blue and bottom right the overall mean image of the dataset.


List of Tables

2.1 Values of some colors in the YUV and YCbCr color spaces respectively.

3.1 Different discriminator layouts that are used in the experiments.

3.2 Different generator layouts that are used in the experiments

4.1 The results for the first round of experiments. As explained in Section 3.6, high Structural Similarity Index (SSIM) and Multi-scale SSIM (MS-SSIM) scores are indications of many similarities between images. Inversely, for the ∆E94, the lower the score, the less color differences between images.

4.2 The network using images in the RGB color space produces more structurally similar results as shown by the SSIM and MS-SSIM scores. The same network appears to stain images with fewer color differences than the models acting in different colorspaces.

4.3 The application of SN improves performance and helps especially in staining cells with more realistic colors as shown by the ∆E94 metric.

4.4 The SN network has higher structural similarity scores and marginally better color performance than the non SN network.

4.5 The CycleGAN equipped with just the Attention module plus the SN achieves the best color performance. With respect to the structural similarity metrics, the network with the Attention and SN scores the highest, proving once again the importance of SN in generating structurally similar samples.

4.6 All three networks have similar structural similarity scores. However, the CycleGAN whose discriminators have a receptive field with a size of 47 performs better with respect to the color fidelity, achieving a ∆E94 score of 33.781.

4.7 As in Table 4.6, the models with the generator of a receptive field of 190 pixels follow the same trend, with all three CycleGANs performing similarly. The network with the largest receptive field in the discriminator performs better in the ∆E94.

4.8 All networks perform similarly in the structural similarity scores. The third column suggests that the MS-SSIM network achieves better color performance as measured by the ∆E94 metric.

4.9 Evaluation scores for the models specified in Section 3.4.6. All models achieve high structural similarity scores and good color performance. Model 4, with equal learning rates and 4 times more updates for the discriminator, edges out the rest of the models with its performance on both metrics.

4.10 The application of SN improves performance in the case of a supervised learning method too, on all metrics.

4.11 SSIM scores of the second generator


Abstract

Raw images from a microscope usually need processing before they can be inspected. Researchers use chemical agents to stain the samples and highlight the relevant structures that need to be examined; however, this is often a laborious, costly and risky procedure. In this thesis, a CycleGAN is used with the aim of automating the chemical staining of raw microscopy cell images. CycleGAN's unsupervised learning approach is ideal, addressing the lack of available data sets in the process, so that grayscale cell images can be colorized digitally, avoiding the difficulties of chemical staining. We experiment with different techniques and examine whether they can lead to the best possible results. We find that the addition of spectral normalization stabilizes the training of the CycleGAN without adding computational cost. Moreover, we observe that by utilizing two different learning rates for the generators and discriminators, as well as asynchronous updates for the GAN components, we can reach a better convergence point and generate the best staining results, with our best experiment achieving a structural similarity index measure (SSIM) of 0.881 and a color difference score, given by the ∆E94 metric, of 12.73. The cycle-consistent nature of CycleGAN can be used to produce paired samples for use in other endeavors. Overall, the tuned version of CycleGAN was capable of generating realistic samples; however, there were obvious problems with the localization of the peroxisome stain even in our best set of results. Among the limitations were the lack of adequate computational resources and the fact that the data suffered from 'color bleeding', a phenomenon that causes stains to leak out of their target structures, making the learning of structures, and by extension the staining, more challenging. Therefore, more data are needed, preferably with a reduced bleeding effect, in order to obtain more faithful results.


Chapter 1

Introduction

Due to their tiny size, cell samples are inspected through the lens of a device, e.g., a microscope. However, examination of such samples without some kind of pre-processing is extremely difficult [35]. Structures and differences between components in the cell are difficult to distinguish without the aid of external visual enhancement methods. The procedure most commonly used for better imaging of biological data is chemical staining [1], where a chemical agent interacts with the cell and, as a result, certain parts of it are highlighted. The contrast between distinct parts of the cell is enhanced, making for easier recognition and localization of subcellular components. Thus, a certain quality is expected when it comes to the correct staining of these samples in order to avoid corruption of the data and of the subsequent conclusions that would stem from their study [48].

Chemical staining comes with certain caveats. One disadvantage of staining is that the chemical agents used to dye the samples can damage the cell or, in some cases, destroy it. Additionally, chemical staining is not an instant process. Its effects are evident only after a certain time has elapsed, sometimes as much as 72 hours. This delay can cost money or hinder the timely provision of results [10]. Finally, since staining is not an automated procedure, there is a window of error due to the human factor. Biological data are sensitive and can be damaged easily. In general, chemical staining, while very beneficial and essential, can be a laborious, time-consuming and risky operation. Making chemical staining an automated process would save resources for researchers and reduce the possibility of a corrupted sample. With the surge of deep-learning methods and their application in medical-image processing tasks [37], researchers have been employing such techniques in order to automate the staining process. Since raw microscopy (brightfield) images can be considered grayscale images, staining them can be framed as an image colorization problem from the point of view of deep-learning research.

The automatic colorization of grayscale images is an inherently ill-posed problem. Solving it requires finding a transformation from one dimension (a pixel intensity value) to three dimensions (the pixel's RGB values, for the RGB color space), which has no objective solution unless constraints are given [21].

An object in a grayscale image can be colorized with different colors, thus prior knowledge is required for accurate colorization of its parts [11]. Machine learning methods have been introduced to solve that problem, with [57] using feature extraction and a Support Vector Machine classifier to colorize grayscale images of home facades. With the emergence of deep learning and the Convolutional Neural Network (CNN) as the leading deep neural network architecture [28] for image-specific tasks, many methods and techniques were developed to tackle the problem. Models that are trained directly on the raw input and desired target data, also called end-to-end architectures, which can take advantage of the learning capacity of the CNN, have achieved satisfying results. In [23], Iizuka et al. use a CNN trained on the Places data set [69], a data set with 2.5 million photographs of 205 categories such as airport, bedroom etc., to separate low-level and global features and minimize a regression loss, while also using image labels to learn the context of the image and colorize it, given the grayscale version, in a smarter way. The method of [66] is more straightforward, since just one pass through the CNN is enough for the model to output a weighted, probabilistic colormap in the CIELAB color space that is subsequently used to colorize the input. However, one problem is that these approaches choose a loss function to minimize which might not be suited to biological microscopy data, resulting in unsatisfying performance. When Goodfellow et al. introduced Generative Adversarial Networks [16], they opened the way to new approaches. [25] is a popular method that utilizes the GAN architecture for image translation tasks, one of them being image colorization.

For the problem of digitally staining microscopy images, both approaches (deep CNNs and GANs) have been used with relative success. [7] developed a pixel-to-pixel approach (2-layer classifier) and an area-to-pixel approach (CNN) to stain multi-modal data into a target data set. An ensemble of U-Nets [52] was used by [43] to predict multiple structures. The outputs were combined to create a fully fluorescent image with every structure highlighted. [50] achieved satisfying results in digitally staining biological tissue samples using a GAN architecture with U-Net as the generator.

These methods used paired data, i.e. an input image with a corresponding target image. In the case of medical data, data sets were created from scratch by scanning images using suitable microscopes and chemically processing samples to create the targets. In addition, in most cases extra work needs to be done to exactly match the pre- and post-staining images. Factoring in the fact that during that time the sample is altered by the conditions of the environment, it is evident that creating samples is risky.

CycleGANs [71] use unsupervised learning to translate images from one domain to another and allow for efficient image translation without using paired data. Using that concept, [39] took near-infrared images and produced satisfying colorized results. CycleGANs have also been used to generate CT scan images from cone beam computed tomography (CBCT) scans [32]. Their results in transforming images from one domain to the other and vice versa are promising. Such a problem would have been considered ill-posed until very recently, but the cyclic translation between CT and CBCT images, two types of images with many differences and exclusive particularities, proves that a CycleGAN can also be used for translation between domains relevant to biology or medicine. Recently, [5] used a CycleGAN to transform stained tissue samples to another stain in order to improve segmentation results, something we want to investigate further.

1.1 Project Goal

Colorization of cell images is hard because the different components are not easily distinguishable. Pixels with high intensity values do not necessarily correspond to a specific chemical label, and the contrast between different parts of the cell is low. To address the scarcity and limited availability of paired medical data sets, we use a CycleGAN and exploit its unsupervised learning nature. Another motivation behind the choice of this specific architecture is to explore the possibility of creating paired data sets out of images from two different domains, so that they can be used for future work. The biology department has provided us with their own data. The data have the form of single-cell images taken from a bright-field microscope, using two fluorescent markers to stain vacuole membranes and peroxisomes in the cells. Most approaches have attempted to stain the samples according to one type of stain, or in a multi-modal scenario [7]. In this thesis, we attempt to stain cell images using the two mentioned types of fluorescent marker simultaneously.

In short, this thesis will try to answer the question:

Can a CycleGAN learn to digitally stain grayscale single-cell images?

This question is followed by the following sub questions:

• What are the main components that enable better staining quality?

• How does the CycleGAN approach compare to a supervised learning based one?

Since a CycleGAN is a relatively complicated model and GAN training is a notoriously difficult task [12], this thesis will examine how different approaches and improvements proposed in the literature impact the quality of the results and the stability of the training process. Various factors are explored, such as the depth of the generator, the receptive field of the discriminator, various loss functions, as well as the choice of a suitable color space.

This thesis is structured as follows: Chapter 2 will present the theoretical background for all the methods used. Chapter 3 will discuss the formalization of the experiments performed and Chapter 4 will present the results of those experiments. Chapter 5 will contain a discussion about the results as well as the conclusion of the thesis.


Chapter 2

Theoretical Background

This chapter introduces several theoretical concepts relevant to the research.

2.1 Color spaces

Color is the ability of the Human Visual System (HVS) to visually categorize a range of wavelengths of light.

What we call 'red' is light that has a wavelength that ranges from 625 to 745 nanometers. Representing color using wavelengths is not intuitive enough, therefore mathematical models that help us formalize representation of color have been established. These mathematical models are called color spaces. Color spaces formalize color representation through a system of coordinates. A color space has dimensions. In grayscale for example, each color is represented by a single value which corresponds to a luminance value. The RGB color space is a 3-dimensional space with each color represented by a triple (r, g, b). Each color space uses different features of color to represent it and some are more suitable than others depending on the task at hand.

2.1.1 RGB

The RGB (red, green, blue) color space was inspired by how humans perceive color. The cones in the retina of the eye are sensitive to the wavelengths of red, blue and green colors. The RGB color space uses the same principle and represents every color in the spectrum as a combination of these three values ranging from 0 to 255. RGB is the most widely used color space in image processing tasks mostly because of its simplicity and its resemblance to how the HVS understands color.

However, one disadvantage of the RGB space is the high correlation between its components, something that might cause problems in tasks such as color processing [15]. This motivates exploring different color spaces for the given task.

2.1.2 HSV

The letters in the HSV acronym stand for Hue, Saturation, Value. The hue component carries chromatic information, saturation denotes the amount of gray in a color or how diluted it is in white light. Value measures the brightness of the color. HSV is closer to the human perception of color compared to RGB as it is more intuitive.

The HSV color space has the shape of a cone. Each point, or color, in the color space is denoted by three coordinates. Hue is expressed as an angle, saturation as the distance from the center, and value as the position along the height of the cone.

2.1.3 YUV

The YUV color space has three channels to encode color information. The first one is denoted as Y and describes the intensity of light of the color. The other two components carry color information. To convert from RGB to YUV we have the following linear transformation:

$$\begin{pmatrix} y \\ u \\ v \end{pmatrix} = \begin{pmatrix} 0.299 & 0.587 & 0.114 \\ -0.147 & -0.289 & 0.436 \\ 0.615 & -0.515 & -0.1 \end{pmatrix} \begin{pmatrix} r \\ g \\ b \end{pmatrix} \qquad (2.1)$$

Since we use a linear transformation to convert between the two color spaces, some correlation between components still exists, however this correlation is not as high as in the RGB space [15].

2.1.4 YCbCr

The YCbCr color space is primarily used in encoding the image signal for television. It is defined in a similar way as YUV, one channel, the Y, holding the information about luminance and the other two are called the chrominance components [15]. The YCbCr colorspace is used widely in image processing. The difference between the YCbCr and YUV spaces is in the way color is encoded in the chrominance components (U,V for YUV and Cb, Cr for the YCbCr).

Color   YUV                       YCbCr
Black   (0, 0, 0)                 (0, 0, 0)
White   (1, 0, 0)                 (1, -1, 0)
Red     (0.299, -0.147, 0.615)    (0.299, -0.169, 0.5)
Green   (0.587, -0.289, 0.1)      (0.587, -0.331, -0.419)
Blue    (0.114, 0.436, -0.1)      (0.114, -0.5, -0.081)

Table 2.1: Values of some colors in the YUV and YCbCr color spaces respectively.

Table 2.1 shows how the basic colors are encoded in the two color spaces. The values for the Y channel are the same in both. The channels that contain information about chroma differ slightly.

A color is converted from the RGB color space to the YCbCr with the following linear transformation:

$$\begin{pmatrix} y \\ c_b \\ c_r \end{pmatrix} = \begin{pmatrix} 0.299 & 0.587 & 0.114 \\ -0.169 & -0.331 & 0.5 \\ 0.5 & -0.419 & -0.081 \end{pmatrix} \begin{pmatrix} r \\ g \\ b \end{pmatrix} \qquad (2.2)$$

2.2 Color Histograms

Color is one of the most important visual features of an image. The color information can be extracted and graphically represented through a color histogram. A color histogram illustrates the distribution of the color values of the image and can be used to infer several qualities about the nature of the colors in the image, e.g. whether the colors are bright or dark, whether they are more on the pale or the colorful side, etc. When plotted, the x-axis holds the respective color values, e.g. for an image in the RGB color space the x-axis contains values from 0 to 255, whereas the y-axis shows the number of pixels with the corresponding color value. In most applications, pixel values are grouped in small ranges, resulting in histograms with a number of bins on the x-axis.

Despite the simplicity of the idea behind color histograms, it is a feature that has been utilized to aid the solution of a variety of tasks. [58] use image histograms in order to classify objects pictured in an image, as well as to identify their location. Image retrieval problems frequently use color histograms [53] to match the color distribution of the input to a suitable match in a database, as in [33, 56, 64]. Color histograms are also used in the context of deep learning: [29] use a CNN to predict per-pixel color histograms for the colorization of grayscale images, whereas in [67] color histograms are computed from human input in real time to provide multiple possible colourization schemes for a grayscale image. Color histograms can also be used with Generative Adversarial Networks in image translation problems, as in [4], where histograms are presented as a reference to the generator so that an image that follows the characteristics of the reference is generated, while a histogram-based loss is also used for the network; in other words, the color histogram of the generated samples should follow a similar distribution to that of the target images.
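As an illustration of the idea, the following minimal numpy sketch computes the kind of per-channel, binned color histogram described above; the bin count of 32 and the 64 × 64 image size are arbitrary values for the example, not settings taken from any of the cited works.

```python
import numpy as np

def color_histogram(image, bins=32):
    """Per-channel color histogram of an RGB image.

    image: uint8 array of shape (H, W, 3) with values in [0, 255].
    Returns an array of shape (3, bins) holding pixel counts per bin."""
    hist = np.zeros((3, bins), dtype=np.int64)
    for c in range(3):
        hist[c], _ = np.histogram(image[..., c], bins=bins, range=(0, 256))
    return hist

# Example on a random image; a histogram-based loss would compare such
# distributions between generated and target images.
img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(color_histogram(img).shape)  # (3, 32)
```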


2.3 Image Colorization

Let I be a grayscale image and p a pixel intensity value of image I. Then the problem of colorization of I in the RGB color space can be expressed as finding a mapping C that matches p to a triple (r, g, b), where r, g, b are the red, green and blue values of the corresponding pixel in the colorized image G, which has the same number of pixels as I.

C : I → G, where for p ∈ I, C(p) = (r, g, b) (2.3)

2.4 Convolutional Neural Networks

A Convolutional Neural Network (CNN) is a type of neural network that is typically associated with image processing tasks due to early work by [30] and the success of Alexnet [28] in image recognition tasks. It consists of an input layer, an output layer and several hidden convolutional layers between them. The convolutional layers are responsible for extracting features from the image, using convolutional operations.

A kernel, or filter, is a small square matrix that is used to compute a dot product between the kernel and a corresponding pixel area of the input image of the same size. This process is repeated as the kernel moves across the image, and the final output is a matrix called a feature map.

The kernel can move pixel columns horizontally or pixel rows vertically in an iterative way. According to the defined stride s, the kernel moves s pixels each time. The size of the stride in each layer is considered a hyperparameter and plays a crucial role in the quality of the results. Strided convolution results in an output with reduced dimensionality; however, a stride that is too large can cause the kernel to skip pixel areas that contain important information about the image, whereas a kernel that moves one pixel each time increases the number of computations in the network and by extension the computational time. An illustration of the convolution process can be seen in Figure 2.1.

Figure 2.1: The convolution operation in the context of CNNs. An input image (left side) is padded with zeros (middle). A convolutional filter (the blue matrix) slides through the padded input getting the dot product in each location. In this example the stride is 2 and the convolution process is shown by the different locations of the filter. The first location is the matrix with the blue outline, the second is the one with the red outline. The matrix with the green outline is a subsequent location. Figure taken and adapted from [31]

In the context of deep-learning, we use the term 2D convolution despite most tasks dealing with images with three channels. In that case, the convolution is performed on each of the channels separately. The final activation map is a matrix with 3-tuples as pixel values, each of the components being the output of the convolution on its respective channel.
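To make the sliding-window computation concrete, here is a minimal single-channel numpy sketch of the padded, strided convolution (strictly, a cross-correlation) described above; the 6 × 6 input, 3 × 3 averaging kernel, stride of 2 and padding of 1 are illustrative values only.

```python
import numpy as np

def conv2d(image, kernel, stride=2, pad=1):
    """Single-channel strided convolution with zero padding (cf. Figure 2.1)."""
    image = np.pad(image, pad, mode="constant")          # zero padding
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)           # dot product at this location
    return out

img = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0
print(conv2d(img, kernel).shape)  # (3, 3): the strided output is smaller than the input
```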

2.4.1 Receptive Field

While in fully connected networks, the neurons of a layer are connected to every other neuron of the previous layer, thus receiving information from the whole input, in convolutional neural networks the signal that a neuron receives depends only on an area of the input. The size of this region is determined by the kernel size, the size of the strides and the shape of the input, and it is called the receptive field of the layer [36].

The receptive field of a CNN can be computed with the following equation [2]:

$$r_0 = \sum_{l=1}^{L} \left( (k_l - 1) \prod_{i=1}^{l-1} s_i \right) + 1 \qquad (2.4)$$

where L is the total number of layers, l is the current layer, k_l is the size of the convolutional kernel of layer l (e.g. k_l = 3 for a 3 × 3 kernel) and s_i is the stride of layer i. Having a well-defined and adequately sized receptive field is essential when using CNNs in order to ensure that each feature is predicted taking the largest relevant area into account. The size of the receptive field can be increased by adding more depth to the network [36], by increasing the size of the kernels or by increasing the stride of a layer. In general, the closer pixels are to the center of the receptive field, the more influential they are in the calculation of the output feature [36]. Therefore, features are not merely the result of a region in the image; they are more strongly correlated with the center of that region.

The pixels in the receptive field that actually influence the calculation of output features are called the effective receptive field.

Figure 2.2: In this setting, a CNN is trained for semantic segmentation of scenes. The two squares on the left side represent two different receptive fields of the network used to predict the label of the pixel on the right picture. The receptive field represented by the green square is too small to recognise that the target pixel is part of a larger object, thus resulting in a wrong classification, whereas the receptive field represented by orange is big enough to identify the whole object, resulting in correct segmentation (image taken from https://developer.nvidia.com/blog/image-segmentation-using-digits-5/).
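Equation 2.4 is straightforward to evaluate programmatically. The sketch below is a direct translation of the formula; the example layer layout (five 4 × 4 kernels with strides 2, 2, 2, 1, 1) is the commonly quoted configuration of the 70 × 70 PatchGAN mentioned later in Section 2.7.1, stated here as an assumption rather than a layout taken from this thesis.

```python
def receptive_field(kernel_sizes, strides):
    """Receptive field of a stack of convolutional layers, following Equation 2.4:
    sum over layers of (k_l - 1) times the product of the preceding strides, plus 1."""
    r, stride_product = 1, 1
    for k, s in zip(kernel_sizes, strides):
        r += (k - 1) * stride_product
        stride_product *= s
    return r

# Example: five 4x4 kernels with strides 2, 2, 2, 1, 1 (assumed PatchGAN-style layout).
print(receptive_field([4, 4, 4, 4, 4], [2, 2, 2, 1, 1]))  # 70
```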

2.4.2 Transpose Convolution

While convolutional operations reduce the dimensionality of the input and aim at establishing abstract representations of it, there are instances where the input needs to be restored back to its original dimensions, or even to higher ones in tasks such as enhancing the resolution of images [44], while also maintaining spatial consistency. This is achieved through a process called transpose convolution. Layers that utilize transpose convolutions are used extensively in the decoding part of an autoencoder [14].

The difference between standard convolution and transpose convolution lies in the convolution matrix used for multiplication with the kernel during the forward and backward passes. In the case of direct convolution, the kernel k is multiplied with the convolution matrix C during the forward pass of the network, and the gradients of the kernel weights are computed by multiplying k with the transposed convolution matrix C^T. In the case of a transposed convolution, the forward pass takes the dot product between k and C^T, and during back propagation the matrix multiplication is performed with k and (C^T)^T. A more intuitive way to think of a transpose convolution is by emulating a direct convolution: padding a feature map suitably with zeros and convolving it with the kernel produces a feature map of the original dimensions. However, this method involves more parameters than simply reversing the computation of the dot products for the forward and backward passes, so the latter method might be preferred in implementations for reasons of computational efficiency.

In order for the upsampling to faithfully restore the input, and not just its original shape, the optimal kernel weights must be learned during the training process.

Figure 2.3: Example of a transpose convolution by emulating a convolution. A direct convolution of a 4 × 4 input (green blue) with a 3 × 3 kernel (gray) produces a 2 × 2 output (light blue). The transpose convolution of that is equivalent to convolving that 2 × 2 feature map, padded with zeros as illustrated, with a 3 × 3 kernel to produce a feature map with the original 4 × 4 dimensions. Strides in this case are set to 1. Figure taken from [14]
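The shape arithmetic of Figure 2.3 can be reproduced with a single Keras layer. The sketch below upsamples a 2 × 2 feature map back to 4 × 4 with a 3 × 3 kernel and stride 1; it only demonstrates the shapes, since the kernel weights here are random rather than learned.

```python
import numpy as np
from tensorflow import keras

# A 2x2 single-channel feature map, as produced by the direct convolution in Figure 2.3.
x = np.random.rand(1, 2, 2, 1).astype("float32")

# Transpose convolution with a 3x3 kernel and stride 1 restores the 4x4 spatial size.
upsample = keras.layers.Conv2DTranspose(filters=1, kernel_size=3, strides=1, padding="valid")
print(upsample(x).shape)  # (1, 4, 4, 1)
```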

2.5 U-Net

Motivated by the problem of semantic segmentation in medical images and inspired by the architecture of autoencoders and deep convolutional networks, the U-Net utilizes a symmetrical layout of layers and is widely used for various tasks: segmentation [52], image translation [25] and image colorization [6], among others.

U-Net uses exclusively convolutional layers. There are distinct stages in the architecture, as illustrated in Figure 2.4. On the left is the contracting path, consisting of standard convolutions where the dimensionality of the input is gradually reduced. The expansive path on the right features upsampling convolutions while halving the number of feature channels with every layer. During the upsampling, the input from the previous layer is concatenated with the feature map of the symmetrical layer through skip connections. This lets the network combine the increased feature information from the standard convolutional layers, so that the reconstruction of the image in the final layer can benefit from the localization information of the earlier layers.

In addition to the high accuracy of U-net, it is robust when it comes to augmented data. This quality makes it ideal to use when adequately large data sets are not available, especially when working with biological data [52].


Figure 2.4: Architecture of original U-net taken from [52]


2.5.1 Attention U-Net

While results with convolutional U-Nets show that they are capable of performing well across various tasks, they often fail to account for long-range dependencies in images and rely too much on convolution operations to capture information about local features. One possible reason for this is that the receptive fields of the convolutional filters are not large enough. The usual way of remedying this problem is to increase the depth of the network, thereby trading off computational efficiency for capacity [65].

In [42], a U-Net based architecture was proposed with attention gates integrated in the upsampling path.

The attention gates act as filters that reduce the significance of irrelevant background regions. They take two inputs: the output of the previous layer and the output of the skip-connected one. Both are subjected to a 1 × 1 × 1 convolution so that the number of trainable parameters is reduced for faster computation. The output feature maps of the convolutions are added and passed through a ReLU activation function. After another 1 × 1 × 1 convolution, the output is passed through a sigmoid activation. Finally, the output of the sigmoid, called the attention coefficients, is multiplied by the input feature map, giving the final attention output. A visualization of an attention gate can be seen in Figure 2.5.

Figure 2.5: Attention gate design as suggested in [42]

Attention gates can help, for instance, with the localization of human-body organs in segmentation tasks with medical imaging data.
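A sketch of the gate described above, written for 2-D images (so 1 × 1 convolutions instead of the 1 × 1 × 1 ones of the original 3-D formulation). It assumes, for simplicity, that the gating signal has already been brought to the same spatial size as the skip-connection features; the layer and channel choices are illustrative and not the configuration used in the experiments of this thesis.

```python
from tensorflow import keras
from tensorflow.keras import layers

def attention_gate(x, g, inter_channels):
    """Additive attention gate in the spirit of Figure 2.5 (2-D variant).

    x: skip-connection feature map, g: gating signal from the coarser layer,
    both assumed to share the same spatial dimensions."""
    theta_x = layers.Conv2D(inter_channels, 1)(x)          # 1x1 conv on the skip features
    phi_g = layers.Conv2D(inter_channels, 1)(g)            # 1x1 conv on the gating signal
    f = layers.Activation("relu")(layers.Add()([theta_x, phi_g]))
    alpha = layers.Conv2D(1, 1, activation="sigmoid")(f)   # attention coefficients in [0, 1]
    return layers.Multiply()([x, alpha])                   # reweight the skip features

# Example wiring with symbolic tensors of matching spatial size (hypothetical shapes).
x_in = keras.Input(shape=(32, 32, 64))    # skip-connection features
g_in = keras.Input(shape=(32, 32, 128))   # gating signal, already resampled to 32x32
gate = keras.Model([x_in, g_in], attention_gate(x_in, g_in, inter_channels=32))
```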

2.6 Generative Adversarial Networks

A generative adversarial network (GAN) is a type of deep-learning architecture that utilizes two neural networks. One network has the role of the generator (G) and the other has the role of the discriminator (D).

These two networks are adversaries in the sense that G generates fake data instances and D tries to classify them as real or fake.


Figure 2.6: Standard GAN architecture. A random noise vector is passed as input to the generator which is transformed to a fake image. A (real, fake) pair of images is then presented to the discriminator who produces scores for each of the images. The pair error is back propagated to the discriminator and the fake image score is propagated back to the generator. Figure taken from [55]

The generator is responsible for producing an artificial sample. G samples a noise vector from a latent space pz and transforms it so that it is indistinguishable from an actual instance of the data set. In other words, the output G(z) of the generator G where z is the noise vector, is aimed to follow the distribution of the target data set. The discriminator on the other hand, acts as a critic of the results of G. D is tasked to classify the input samples as real or fake. In its standard version, D outputs a single scalar that denotes the probability that the input is fake or real [16].

The generator and discriminator are employed in a minimax game where the generator is trained to minimize log(1 − D(G(z))), where D(·) is the output of the discriminator, while the discriminator tries to maximize its discriminative quality, in other words the probability that the correct label has been assigned to each of the input samples. The objective function V(D, G) that characterizes the training of a GAN is:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (2.5)$$

where x is an instance from the data set, p_data is the distribution of the data set, and z is the noise vector that was sampled from the latent space p_z. Figure 2.6 shows the standard architecture of a GAN. Training of both components occurs simultaneously. There is no convergence in the classical sense, since both networks are trained in a zero-sum game, but theoretically a perfect GAN is one that has reached a Nash equilibrium, a state where no further action can improve the score of either the generator or the discriminator. In this scenario, the generator has captured the data distribution perfectly and the discriminator outputs 0.5 for each sample, expressing uncertainty about the authenticity of the sample. However, when the generator and discriminator are neural networks, they are trained using gradient descent, and such schemes do not guarantee convergence to a Nash equilibrium but rather to local minima [46].

Another issue in GAN training is the mode collapse problem. The generator can overfit at the expense of the discriminator. The artificial samples that are generated manage to fool the discriminator, but they do not encapsulate the data distribution. The quality of the training is corrupted, since the generator now only outputs samples that the discriminator is not able to classify as fake, thus stopping the learning process.

2.7 CycleGAN

The goal of a GAN is to learn a mapping from a latent space X to a target data distribution Y so that the generator G can transform a sample from X to be indistinguishable from an actual sample from distribution Y . This mapping is highly under-constrained. CycleGANs address this issue by coupling the learning of the mapping G : X → Y with that of an inverse mapping F : Y → X [71].

CycleGANs are based on the idea that transformations between domains should be cycle consistent. Using the mapping G and its inverse F we can define the composite functions G(F(y)) and F(G(x)). By definition F(y) ∈ X and G(x) ∈ Y. Should F be the inverse mapping of G, then G(F(y)) ≈ y and F(G(x)) ≈ x, ∀x ∈ X and y ∈ Y. This cannot be enforced with the standard adversarial losses alone; besides, adversarial losses are hard to optimize for convergence (see Section 2.6). CycleGANs improve on the adversarial losses by adding a cycle consistency loss, defined as follows:

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\lVert F(G(x)) - x \rVert_1] + \mathbb{E}_{y \sim p_{data}(y)}[\lVert G(F(y)) - y \rVert_1] \qquad (2.6)$$

This cycle consistency loss is based on the L1 norm, rather than the binary cross entropy used in the adversarial loss of a typical GAN. For tasks that depend on preserving color consistency between domains, an additional loss is added. This loss, called the identity loss, further regularizes the generators so that they produce an untranslated image when provided with input from the opposite domain. The equation that encapsulates this is:

$$\mathcal{L}_{id}(G, F) = \mathbb{E}_{y \sim p_{data}(y)}[\lVert G(y) - y \rVert_1] + \mathbb{E}_{x \sim p_{data}(x)}[\lVert F(x) - x \rVert_1] \qquad (2.7)$$

When the model accounts for the identity loss, it has been found to maintain the color qualities of the input, resulting in better qualitative results. The full objective has three different components: the adversarial loss, the cycle consistency loss and the identity loss, which are weighted by a factor λ. The authors of [71] have empirically set λ to 10. The adversarial loss is weighted normally, with a weight of 1. The forward and backward cycle losses are weighted by λ, with the influence of the identity loss being reduced by halving the value of λ.
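A minimal sketch of the loss terms just described, assuming TensorFlow tensors for the real, cycled and identity-mapped images; the weighting follows the λ = 10 and λ/2 scheme described above.

```python
import tensorflow as tf

LAMBDA = 10.0  # cycle-consistency weight, as set empirically in [71]

def cycle_consistency_loss(real_x, cycled_x, real_y, cycled_y):
    """L1 cycle-consistency loss of Equation 2.6."""
    return (tf.reduce_mean(tf.abs(cycled_x - real_x)) +
            tf.reduce_mean(tf.abs(cycled_y - real_y)))

def identity_loss(real_x, same_x, real_y, same_y):
    """L1 identity loss of Equation 2.7: F(x) should stay close to x, G(y) to y."""
    return (tf.reduce_mean(tf.abs(same_x - real_x)) +
            tf.reduce_mean(tf.abs(same_y - real_y)))

def generator_objective(adversarial, cycle, identity):
    """Adversarial term weighted by 1, cycle term by lambda, identity term by lambda/2."""
    return adversarial + LAMBDA * cycle + 0.5 * LAMBDA * identity
```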

2.7.1 Architecture

For each translation between domains we define a GAN. In [71] the generators follow an autoencoder setup, with convolutional layers that downsample the input and fractionally strided convolutional layers that reconstruct the input to its original size by upsampling. It is worth noting the presence of residual blocks [18] that are responsible for carrying information from a layer further down the architecture. In general, different kinds of generators can be utilized depending on the task at hand.

The discriminators follow the structure of the PatchGAN introduced in [25]. A PatchGAN is a fully convolutional neural network that slides an N × N patch (N is the number of pixels) over the image, classifying it as real or fake and averaging the results for the final output. In this case a 70 × 70 PatchGAN is used.

2.7.2 Training procedure

For domain X we define the generator G and discriminator D_X; similarly, for domain Y we have F and D_Y as generator and discriminator respectively. First, the generators translate an image from their respective domain to the other, producing fake samples that are placed in a pool of generated images that the discriminators use for their updates. This technique stabilizes the training, reducing the oscillation of the loss function. The two GANs are trained simultaneously. The direction X → Y is described here. G takes an image from domain X and translates it to Y. That output of G is then translated back to X by F. This is called the forward cycle. In the backward cycle, a real image from the target domain is translated by F to X, followed by feeding the translated image to G so that it can be translated back to Y. The third step is the translation of the identity element when the identity loss is utilized. An example of a training step can be seen in Figure 2.7.


Figure 2.7: Example of a training step. In (a) the relations between the GANs and the domains are established. In (b) we see an example of a forward cycle: an image x from domain X is translated to domain Y by generator G. The translated image Ŷ is presented to the discriminator D_Y and also fed to the generator F so that it can be translated back to domain X. The backward cycle is seen in (c) and it is the same procedure in the opposite direction. Figure taken from [71]


2.8 Normalization

2.8.1 Instance normalization

Batch normalization (BN) is one of the most popular regularization methods, used widely in deep-learning tasks. It normalizes the output of a layer before feeding it to the next, and its application improves the performance of a model and reduces training times significantly [24]. However, BN does not come without problems, as shown in [63]: for models that use small batches, the error of BN increases.

Instance normalization (IN) [61] was introduced as a solution to an image translation problem where the contrast of the input image was influencing the translated output. CycleGANs use IN instead of BN not only because IN forces the network to stay agnostic to the contrast information of the input, but also because it does not depend on batch statistics. Additionally, CycleGANs use small training batches, exploiting IN's insensitivity to batch features. Experiments in [61] showed that applying IN in the generator improves the quality of the results.

Let $x \in \mathbb{R}^{T \times C \times W \times H}$ be an input tensor that denotes a batch, where T is the number of examples in the minibatch and H, W, C are the height, width and number of channels of the image respectively. Then, if $x_{tijk}$ is an element of the batch, IN can be formalized as follows:

$$\mu_{ti} = \frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} x_{tilm} \qquad (2.8)$$

$$\sigma_{ti}^2 = \frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} (x_{tilm} - \mu_{ti})^2 \qquad (2.9)$$

$$y_{tijk} = \frac{x_{tijk} - \mu_{ti}}{\sqrt{\sigma_{ti}^2 + \epsilon}} \qquad (2.10)$$

In Equation 2.8, the mean of the instance is calculated for one channel, as is the standard deviation in Equation 2.9. Finally, the normalized channel is given by Equation 2.10. The process is performed for all image channels, as opposed to BN, which calculates the statistics across all the elements in the batch.
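A numpy sketch of Equations 2.8-2.10, written for a channels-last batch of shape (T, H, W, C) as used by Keras (the equations above use a (T, C, W, H) ordering); the ε value is an arbitrary small constant for numerical stability.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization: per-example, per-channel statistics over the spatial axes."""
    mu = x.mean(axis=(1, 2), keepdims=True)                   # Equation 2.8
    var = ((x - mu) ** 2).mean(axis=(1, 2), keepdims=True)    # Equation 2.9
    return (x - mu) / np.sqrt(var + eps)                      # Equation 2.10

x = np.random.rand(4, 64, 64, 3)
y = instance_norm(x)
print(np.abs(y.mean(axis=(1, 2))).max())  # per-instance, per-channel means are ~0
```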

2.8.2 Spectral Normalization

A typical issue in GAN training is instability, especially in high-dimensional spaces. Problems that can arise include mode collapse, as described above, and exploding gradients, where large gradients are accumulated during the calculation of the error, inhibiting the training or completely corrupting it. Another issue that might cause problems is that of a perfect discriminator [3], where the discriminator classifies real/fake samples perfectly, so that no useful error is propagated back to the generator to facilitate further learning.


These problems can be alleviated by bounding the gradients and restricting the function space from which we choose the discriminators.

Assuming the discriminator is a function D : I → ℝ, where I is the image space, we essentially want the derivative of this function to be bounded. If D is a Lipschitz continuous function, then for x, y ∈ I with x ≠ y the following inequality is valid: |D(x) − D(y)| ≤ K|x − y|, where K is a real number. We can then form the following inequality:

$$\frac{|D(x) - D(y)|}{|x - y|} \leq K \qquad (2.11)$$

When we take the limits of both parts of the inequality, the left part forms the first derivative of the discriminator D or, in other words, its gradient. Therefore choosing a Lipschitz continuous function would bound the derivative and protect the network from suffering from exploding gradients. Ultimately, the problem lies in finding a suitable value for K.

Spectral normalization, as introduced by [40], is based on this idea of bounding the gradient of the discriminator. This is achieved by replacing each weight matrix W with the normalized matrix W/σ(W), where σ(W) is the spectral norm of W. The spectral norm σ(W) is the largest singular value of W, i.e. the square root of the largest eigenvalue of the matrix W^T W. Spectral normalization uses a method called power iteration to quickly approximate the spectral norms of all the weight matrices in the discriminator, and it can be efficiently added to any existing GAN. Its application not only remedies the GAN issues already mentioned, but can also help the network generate more diverse samples.
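The idea can be sketched in a few lines of numpy: power iteration estimates the largest singular value σ(W), and the weight matrix is divided by it. Practical implementations keep the vector u between training steps and reshape convolution kernels to 2-D first; this stand-alone version is only an illustration.

```python
import numpy as np

def spectral_normalize(W, n_iters=5):
    """Return W / sigma(W), estimating the spectral norm with power iteration."""
    u = np.random.randn(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ W @ v                       # estimate of the largest singular value
    return W / sigma

W = np.random.randn(128, 64)
print(np.linalg.svd(spectral_normalize(W), compute_uv=False)[0])  # close to 1.0
```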


Chapter 3

Experimental Setup

In this chapter we describe the design of the experiments. First the data set is described followed by the augmentation methods. Next, we present the architectures of the generator and discriminator. After that, we establish the different experiment configurations and the environment in which the experiments took place. Lastly, we give an overview of the evaluation metrics used in this thesis.

3.1 Data set

The data set consisted of 394 raw single-cell images in grayscale and 394 stained ones in the RGB color space. The cells in the images were stained with two chemical agents. FM4-64 is a dye that stains the vacuoles by travelling from the plasma membrane to said organelle via endocytosis. To stain peroxisomes in the cells, green fluorescent protein was used. This protein highlights the peroxisomes in the cell by fusing to the peroxisomal membrane protein Pex3 [41]. In our data, vacuole staining is illustrated by the green color and peroxisome staining by blue, for illustration purposes. The images are paired: for every raw grayscale image we have the corresponding stained one. However, since we wanted to address the lack of available paired data sets, we augment the two domains separately, in order to decouple the grayscale brightfield images from the stained ones.

Figure 3.1 shows a sample of data from the two domains. The images were segmented from larger samples containing multiple cells. The data set itself contained images depicting only noise, which were filtered out.

The colors in the stained images denote a specific part. Blue pigments highlight the peroxisomes, green the dyed vacuoles, and red the rest of the cell interior.

Figure 3.1: Instances of the training data. 3.1a shows stained cells, 3.1b shows unstained images.


3.1.1 Data augmentation

In [71] the data set size approaches 20000 images. A usual practice for enhancing the amount of data, as well as introducing a degree of variability, is data augmentation. Data augmentation was used extensively in [5] to enhance their small data set. To increase the size of our data set without running the risk of creating multiple duplicate images, we used imagemorph (https://github.com/GrHound/imagemorph.c), a tool based on [9] that applies random elastic deformations to the images, creating independent and identically distributed samples. We followed that by applying some standard types of augmentation, including horizontal and vertical flipping, rotation, scaling and shearing, which were combined with color augmentation techniques such as randomly adjusting brightness and contrast. We did not go further with color augmentation techniques such as adding random Gaussian noise, since we wanted to avoid corrupting the data set. Each image was augmented by two randomly selected types of augmentation. An example of the augmented data can be seen in Figure 3.2.
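A sketch of a few of the standard augmentations listed above using tf.image; elastic deformation (imagemorph), shearing, rotation and scaling are not covered by these helpers, and the brightness/contrast ranges are illustrative values rather than the ones used in the experiments.

```python
import tensorflow as tf

def augment(image):
    """Random flips plus brightness/contrast jitter on a single [0, 1] float image."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return tf.clip_by_value(image, 0.0, 1.0)
```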

3.2 Pre-processing

The images were resized to a (64, 64, 3) shape. Subsequently, they were normalised using the affine transformation

$$y = (x - \min(x))\,\frac{b - a}{\mathrm{range}(x)} + a \qquad (3.1)$$

where x is the image to be normalised, min(x) is the minimum value of x, [a, b] is the target range and range(x) is the difference between the maximum and the minimum of x. The target domains were determined at training time since we wanted to check which range of values make for smoother training. These domains are [0, 1] and [−1, 1] as per [71].

Next, the normalised images were converted to a target color space if one was specified. We experimented with the YUV, YCbCr and HSV color spaces. When not specified, images were left in the RGB color space.

Figure 3.2: Example of the image augmentation. Starting from the top left: original image, horizontal flipping with random cropping, vertical flipping with random shear. Bottom left: random scaling with randomly adjusted brightness, random rotation with augmented contrast, random cropping with random scaling

3.3 Model Architecture

For all the experiments we use a CycleGAN. The models were adapted from [71] so that they better fit smaller-sized data. We do not use a 70 × 70 PatchGAN, since the inference patch is too large for the size of the images in our data set. Taking into account that [71] use 256 × 256 or 128 × 128 sized input images, we use a 23 × 23 PatchGAN as the discriminator, to better match the input size of our data.

More specifically, the discriminator consists of 5 blocks, where each block is a convolutional layer followed by instance normalization and a leaky ReLU with α = 0.2. Instance normalization is not used in the first layer. Figure 3.3 shows the full discriminator architecture.

Figure 3.3: Baseline discriminator architecture. The number above each block denotes the number of convolutional filters and the number inside denotes the stride
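For concreteness, a sketch of such a PatchGAN discriminator in Keras is given below. The filter counts are placeholder assumptions (the actual counts are those shown in Figure 3.3), and instance normalization is taken from TensorFlow Addons; with 3 × 3 kernels, strides 2-1-1-1-1 and the final convolution, each output unit has a 23 × 23 receptive field.

```python
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_addons as tfa  # provides InstanceNormalization

def build_patch_discriminator(input_shape=(64, 64, 3)):
    """PatchGAN with 3x3 kernels; strides 2-1-1-1-1 give a 23x23 receptive field.
    Filter counts below are illustrative, not taken from Figure 3.3."""
    inp = layers.Input(shape=input_shape)
    x = inp
    filters = [64, 128, 256, 512, 512]   # assumed progression
    strides = [2, 1, 1, 1, 1]
    for i, (f, s) in enumerate(zip(filters, strides)):
        x = layers.Conv2D(f, kernel_size=3, strides=s, padding="same")(x)
        if i > 0:                        # no instance normalization in the first block
            x = tfa.layers.InstanceNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
    # Final layer: one-channel map of per-patch real/fake scores.
    out = layers.Conv2D(1, kernel_size=3, strides=1, padding="same")(x)
    return tf.keras.Model(inp, out, name="patch_discriminator")
```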

Figure 3.4: Baseline generator architecture. The number above each block denotes the number of convolutional filters and the number inside denotes the stride

The generators are based on the U-Net, as it has been shown to work well [17] [26] with medical data [5] [51] [49]. As can be seen in Figure 3.4, the generators have 7 blocks on each path plus the bottleneck layer. The input is downsampled through a series of convolution operations and upsampled by gradually applying transpose convolutions. The stride size for all layers of this baseline generator is 1. Instance normalization is used here as well, instead of batch normalization.

The networks were programmed using Keras [13].

3.4 Experiments

In this section we explain the idea behind every experiment. Each subsection describes a different idea that was explored in this research. The experiments were performed in a grid-search-like procedure, where different setups were used and evaluated in a comparative manner.

3.4.1 Color spaces

Images are represented by computers as numerical arrays. A color space can be viewed as a coordinate system; therefore, the same image has a different representation in each color space. To the network, an image in RGB is entirely different from the same image in the HSV color space.

For our experiments, we tested images in the RGB color space, as well as in YUV, YCbCr and HSV. These color spaces were chosen because they can be computed directly from the (linear) RGB space: YUV and YCbCr through linear transformations, and HSV through a simple nonlinear mapping.
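As an illustration of such a conversion, one common definition of the RGB-to-YUV transform (the analog BT.601 coefficients; exact constants vary between standards and implementations) is:

\begin{pmatrix} Y \\ U \\ V \end{pmatrix} =
\begin{pmatrix}
 0.299 &  0.587 &  0.114 \\
-0.147 & -0.289 &  0.436 \\
 0.615 & -0.515 & -0.100
\end{pmatrix}
\begin{pmatrix} R \\ G \\ B \end{pmatrix}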

3.4.2 Spectral Normalization

We examine the effect that spectral normalization has both on the training and on the final results. Spectral normalization has been used extensively in GANs, since it stabilizes training and improves the quality of the output. It was initially used to regularize the discriminator [40]; however, it can be applied to the generator as well, and a generator with spectral normalization has been shown to further improve results [26, 65]. We investigate its effect on the stability of training and on the quality of the results.
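In practice, spectral normalization can be applied by wrapping a layer so that its weight matrix is divided by an estimate of its largest singular value, obtained via power iteration. A minimal sketch, assuming the TensorFlow Addons wrapper is available:

```python
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_addons as tfa

def sn_conv(filters, kernel_size=3, strides=1, use_sn=True):
    """Conv2D optionally wrapped with spectral normalization (power iteration)."""
    conv = layers.Conv2D(filters, kernel_size, strides=strides, padding="same")
    return tfa.layers.SpectralNormalization(conv) if use_sn else conv

# Example: a spectrally normalized discriminator block.
x = layers.Input(shape=(64, 64, 3))
h = sn_conv(64, strides=2)(x)
h = layers.LeakyReLU(0.2)(h)
```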

3.4.3 Attention

We integrate the attention gate mechanism as described in Section 2.5.1. Oktay et al. used this module for a semantic segmentation problem in the context of medical data. The module's ability to capture fine details and to distinguish between different structural elements in images where said structures are not obvious is the motivation behind its use in our experiments. In their work, the attention gate is applied before every convolutional layer in the upsampling path, a setup that we follow in our research. We also incorporate instance normalization inside the attention gate.
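A sketch of such an additive attention gate in Keras is shown below. It assumes the gating signal has already been brought to the same spatial resolution as the skip-connection features (in the original formulation the gating signal is coarser and an internal resampling step is used), the number of intermediate channels is a free hyperparameter, and the placement of instance normalization inside the gate is one possible choice.

```python
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_addons as tfa

def attention_gate(x, g, inter_channels):
    """Additive attention gate in the spirit of Oktay et al.

    x: skip-connection features from the downsampling path
    g: gating signal from the decoder (same spatial size as x in this sketch)
    Returns x re-weighted by a learned attention map.
    """
    theta_x = layers.Conv2D(inter_channels, 1, padding="same")(x)
    phi_g = layers.Conv2D(inter_channels, 1, padding="same")(g)
    f = layers.Activation("relu")(layers.Add()([theta_x, phi_g]))
    f = tfa.layers.InstanceNormalization()(f)            # instance norm inside the gate
    psi = layers.Conv2D(1, 1, padding="same", activation="sigmoid")(f)
    return layers.Multiply()([x, psi])
```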

3.4.4 Losses

One of the objectives of our research is to produce visually compelling results. Assessing the image quality of these results requires some form of quantification, so that a distinction between low- and high-quality images can be made. That purpose is served by indexes such as the Structural Similarity Index (SSIM), its multi-scale counterpart MS-SSIM, and the Peak Signal-to-Noise Ratio (PSNR), which aim to assign a numerical value to how well an image correlates with the human visual system (HVS). [68] explore the effect of different losses in the context of deep-learning-based image processing. Following that work, we experiment with four different loss functions: the Structural Similarity Index (SSIM) loss, the multi-scale SSIM (MS-SSIM) loss, and a fusion loss that combines the MS-SSIM and L1 losses. Finally, we experiment with a complementary loss function that utilizes the histograms of images; this histogram-based loss is used in combination with the standard MAE loss in a weighted manner.


SSIM

SSIM [70] is an index that is mostly used to compare the difference in visual quality between two images.

Considering two images x and y, the SSIM for pixel p is defined as:

SSIM(p) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \cdot \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} \qquad (3.2)

where \mu_i and \sigma_i are the mean pixel value of image i and its standard deviation, respectively, both calculated with a Gaussian filter with standard deviation \sigma_G. The C_1, C_2 parameters are stabilization constants for the case of division by a weak denominator.

Since SSIM is perceptually motivated, it is reasonable to use it as a loss function in an image translation task. The loss used in the generator is:

L_{SSIM} = 1 - SSIM(p) \qquad (3.3)

Since SSIM's values range between [−1, 1], with 1 meaning absolute structural similarity and −1 no similarity at all, we can minimize the loss by maximizing the SSIM. This means that the produced images are structurally similar to the real ones. In addition, SSIM is differentiable, which is required of a loss function used with backpropagation.
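A minimal sketch of this loss in TensorFlow, assuming images normalised to [0, 1] (for the [−1, 1] range, max_val would be 2.0):

```python
import tensorflow as tf

def ssim_loss(y_true, y_pred):
    """L_SSIM = 1 - SSIM (Equation 3.3), averaged over the batch."""
    return 1.0 - tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=1.0))
```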

MS-SSIM

For the calculation of SSIM, the choice of the Gaussian filter plays a major role. A small standard deviation causes the network not to preserve local structure in the image [68], while a large standard deviation might lead to noise being preserved near the edges of image components. To avoid the risk of choosing an inappropriate standard deviation, the MS-SSIM was introduced [62], defined as follows:

MS\text{-}SSIM(p) = l_M^{\alpha}(p) \cdot \prod_{j=1}^{M} cs_j^{\beta_j}(p) \qquad (3.4)

where l_M and cs_j are the first and second factors of Equation 3.2, respectively. We can define the MS-SSIM loss as:

L_{MS\text{-}SSIM} = 1 - MS\text{-}SSIM(p) \qquad (3.5)

The difference between MS-SSIM and SSIM is that, in the former case, we use differently chosen Gaussian windows G_{\sigma_g} corresponding to different scales and patches of different resolutions. Each G_{\sigma_g} is applied to an image downsampled by a factor of 2. For our experiments, we used a 7 × 7 Gaussian window applied at 3 scales.

MS-SSIM with L1

L1 loss is not sensitive to outliers. It helps preserve colors and luminance; however, it has trouble replicating contrast, a quality that MS-SSIM possesses. Therefore, [68] propose a loss that combines the L1 with the MS-SSIM loss. The equation defining this combined loss is the following:

L_{comb} = \alpha \cdot L_{MS\text{-}SSIM} + (1 - \alpha) \cdot G_{\sigma_G^M} \cdot L_{L1} \qquad (3.6)

In our experiments, we divide the L1 contribution by 3 as we found that the total loss was too heavily dominated by it. The α parameter is set empirically to 0.84 by [68].
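The sketch below approximates this loss in TensorFlow. It uses the 7 × 7 window and 3 scales described above, but the per-scale weights (power_factors) are placeholder values, the Gaussian weighting G_{\sigma_G^M} of the L1 term is replaced by a plain pixel-wise L1, and images are assumed to lie in [0, 1].

```python
import tensorflow as tf

ALPHA = 0.84                         # weighting from [68]
POWER_FACTORS = (0.33, 0.33, 0.34)   # 3 scales, roughly equal weights (assumed)

def ms_ssim_l1_loss(y_true, y_pred):
    """Mix of MS-SSIM and L1 (Equation 3.6), with the Gaussian weighting of the
    L1 term omitted; the L1 contribution is divided by 3 as described above."""
    ms_ssim = tf.image.ssim_multiscale(
        y_true, y_pred, max_val=1.0,
        power_factors=POWER_FACTORS, filter_size=7)
    l_ms_ssim = 1.0 - tf.reduce_mean(ms_ssim)
    l_l1 = tf.reduce_mean(tf.abs(y_true - y_pred)) / 3.0
    return ALPHA * l_ms_ssim + (1.0 - ALPHA) * l_l1
```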

Histogram based loss

Color histograms are a convenient representation of images because they are invariant to translation and rotation. We implement a loss that uses the histograms of the images so that the results capture the color distribution of the target data set. The loss is inspired by [60], in which a similar loss is used for a network that assigns a quality score to images. The histograms are compared using the squared Earth Mover's Distance. We treat the colorization problem as a classification one, in which each pixel can take one value out of 256 possible classes representing a value in the RGB color space. In such settings, a cross-entropy loss function is typically employed. However, [22] observe that such a loss does not take inter-class relationships into account, where some classification errors should be penalized more heavily than others. For example, in a colorization problem it is preferable for a white pixel to be assigned grey rather than black, whereas in an image recognition task a bird classified as a house is no further from the truth than a bird classified as a car. Instead, we use the Earth Mover's Distance (EMD) to construct our loss function, which [22] show to perform better for ordered-class classification. In the context of histograms, the EMD is defined as the minimum cost of matching two histograms (distributions). Given two histograms p and p̂, the EMD between them is formalised as:

EMD(p, \hat{p}) = \left( \frac{1}{N} \sum_{k=1}^{N} \left| CDF_p(k) - CDF_{\hat{p}}(k) \right|^r \right)^{1/r} \qquad (3.7)

where N is the number of classes (i.e. the bins of the histograms) and CDF_p(k) is the cumulative distribution function of p. The exponent r is set to 2, following [60], for easier optimization.

In our implementation, the histograms are normalized so that their elements sum to 1, resembling a probability distribution, and the number of bins is set to 16. Equation 3.7 is used in the construction of the overall loss function, which also includes the standard L1 loss of the vanilla version of the CycleGAN. The final form of the loss is the following:

L_{hist} = 0.2 \cdot L_{EMD} + 0.8 \cdot L_{L1} \qquad (3.8)

where L_{EMD} is given by Equation 3.7. The weights were decided empirically.
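A simplified TensorFlow sketch of this loss is given below. It assumes images in [0, 1] and 16-bin histograms computed over all channels; note that the hard histogram used here is not differentiable with respect to the generated image, so in practice a soft (e.g. kernel-based) histogram would be needed, which we omit for brevity.

```python
import tensorflow as tf

N_BINS = 16

def squared_emd(hist_p, hist_q):
    """Squared EMD (Equation 3.7 with r = 2) between two histograms that are
    already normalized to sum to 1: the mean of squared CDF differences."""
    cdf_p = tf.cumsum(hist_p, axis=-1)
    cdf_q = tf.cumsum(hist_q, axis=-1)
    return tf.reduce_mean(tf.square(cdf_p - cdf_q))

def histogram_loss(y_true, y_pred):
    """Weighted combination of Equation 3.8."""
    def norm_hist(img):
        h = tf.histogram_fixed_width(img, [0.0, 1.0], nbins=N_BINS)
        h = tf.cast(h, tf.float32)
        return h / tf.reduce_sum(h)
    l_emd = squared_emd(norm_hist(y_true), norm_hist(y_pred))
    l_l1 = tf.reduce_mean(tf.abs(y_true - y_pred))
    return 0.2 * l_emd + 0.8 * l_l1
```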

3.4.5 Receptive fields

We experiment with different sizes for the receptive fields both in the generator and the discriminator.

Finding the correct size for the receptive field of a CNN is essential for capturing vital information about the images. We adjust the receptive fields either by increasing the depth of the network or by increasing the size of the filters and the strides.

For the discriminator we choose the following architectures.

Discriminator Architectures

Number of layers        | 5         | 5         | 6
Filter sizes            | 3         | 3         | 3
Strides                 | 2-1-1-1-1 | 2-2-1-1-1 | 2-2-1-1-1-1
Size of receptive field | 23        | 39        | 47

Table 3.1: Different discriminator layouts that are used in the experiments

The last layer of the discriminator is omitted from the table, since it always remains constant. For the generator, we experiment with one additional architecture.

Generator Architectures

Number of layers | 7 | 5
Filter sizes     | 3 | 4
Strides          | 1 | 2

Table 3.2: Different generator layouts that are used in the experiments

In Table 3.2, the number of layers denotes the number of layers in one path of the U-Net. Because of the symmetry of the architecture, the other path follows the same structure.
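The receptive field sizes in Table 3.1 can be verified with the standard recurrence RF = 1 + \sum_i (k_i - 1) \cdot \prod_{j<i} s_j. The sketch below reproduces the values 23, 39 and 47 under the assumption that the constant final layer is a 3 × 3 convolution with stride 1:

```python
def receptive_field(kernel_sizes, strides):
    """Receptive field of a stack of conv layers:
    RF = 1 + sum_i (k_i - 1) * prod_{j<i} s_j."""
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# Discriminator configurations of Table 3.1, each followed by the (assumed)
# 3x3, stride-1 final layer that is omitted from the table.
configs = {
    "5 layers, strides 2-1-1-1-1":   ([3] * 6, [2, 1, 1, 1, 1, 1]),
    "5 layers, strides 2-2-1-1-1":   ([3] * 6, [2, 2, 1, 1, 1, 1]),
    "6 layers, strides 2-2-1-1-1-1": ([3] * 7, [2, 2, 1, 1, 1, 1, 1]),
}
for name, (ks, ss) in configs.items():
    print(name, "->", receptive_field(ks, ss))   # prints 23, 39, 47
```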
