
Extracting Biomarkers from Hematoxylin-Eosin Stained Histopathological Images of Lung Cancer using Deep Learning

Master’s thesis in Artificial Intelligence

Emiel Stoelinga

s4837584

Internal supervisor:

Dr. Y. Güçlütürk

Radboud University

External supervisors:

Dr. F. Ciompi

Radboudumc

Dr. Z. Swiderska-Chadaj

Radboudumc

September 10, 2019


Abstract

In this thesis the technique of deep learning was applied to the field of digital pathology, more specifically lung cancer, to extract several different biomarkers. Tertiary lymphoid structures (TLS) have been found to indicate a positive patient prognosis, especially in combination with germinal centers (GC). Therefore, a VGG16-like network was trained to detect TLS and GC in histopathological slides of lung squamous cell carcinoma with F1 scores on the pixel level of 0.922 and 0.802 respectively. Performance on a different held-out test set on the object level was 0.640 and 0.500 for TLS and GC respectively.

Treatment differs per growth pattern of lung adenocarcinoma, and variability between pathologists in the assessment of lung adenocarcinoma exists. Therefore, a similar VGG16-like network was trained to segment growth patterns of adenocarcinoma in slides of lung tissue with F1 scores on the pixel level of 0.891, 0.524, 0.812 and 0.954 for solid adenocarcinoma, acinar adenocarcinoma, micropapillary adenocarcinoma and non-tumor tissue respectively.

Because the previous system was trained only on sparsely annotated data and consequently did not encounter neighbouring growth patterns of lung adenocarcinoma, a method with generative adversarial networks to generate fake but realistic-looking, densely annotated image patches from sparsely annotated data was examined, and a comparison between three types of models was made.


Contents

1 Introduction
  1.1 Digital pathology
  1.2 Deep learning
  1.3 Deep learning in digital pathology
  1.4 Research questions

2 Quantification of tertiary lymphoid structures and germinal centers in lung tissue
  2.1 Introduction
  2.2 Methods
    2.2.1 Data
    2.2.2 Augmentation
    2.2.3 Deep learning
    2.2.4 Post-processing
    2.2.5 Evaluation
  2.3 Results
    2.3.1 Pixel level
    2.3.2 Object level
  2.4 Discussion

3 Segmentation of growth patterns of adenocarcinoma in lung tissue
  3.1 Introduction
  3.2 Methods
    3.2.1 Data
    3.2.2 Augmentation
    3.2.3 Deep learning
    3.2.4 Evaluation
  3.3 Results
    3.3.1 Pixel level
    3.3.2 Image level
  3.4 Discussion

4 Synthesizing densely annotated data of lung adenocarcinoma with GANs
  4.1 Introduction
    4.1.1 GP-GAN
    4.1.2 CycleGAN
    4.1.3 EdgeConnect
  4.2 Methods
    4.2.1 Data
    4.2.2 GP-GAN
    4.2.3 CycleGAN
    4.2.4 EdgeConnect
  4.3 Results
    4.3.1 GP-GAN
    4.3.2 CycleGAN
    4.3.3 EdgeConnect
  4.4 Discussion

5 Summary and conclusions


Chapter 1

Introduction

1.1 Digital pathology

Pathology

The medical field of pathology is concerned with the study of the causes and effects of disease. Depending on the methods that are utilized, the field of pathology can be divided into multiple subfields. The main subfield regarded in this thesis is histopathology, the study of disease at the microscopic level. Data is generally obtained from biopsies or resected organs, which are treated such that they can be examined by a pathologist. The tissue is usually fixated so that it does not decay. Afterwards the tissue is cut and prepared on a glass slide before it can be stained and analyzed.

Staining

After the tissue has been fixated, cut and prepared on a glass slide, it is so thin that it is essentially colorless. In order to recognize structures in the tissue, the slide is stained with a dye. A commonly used staining is hematoxylin and eosin (H&E). It is relatively inexpensive and, together, the two components make nuclei stand out in blue and cytoplasm and extracellular matrix in pink (Fischer et al., 2008). An example of an image of H&E stained tissue is presented in Figure 1.1. Other types of staining, such as immunohistochemical stainings, can be used to highlight certain cellular components by applying particular antibodies. However, in this thesis only the widely used H&E staining will be regarded.


Digital pathology

Traditional histopathology demands the physical presence of a glass slide to be examined by a pathologist. Digital pathology has introduced the possibility to evaluate a slide from any computer by digitizing slides at high resolution with scanning devices. Images of tissue are typically stored in the whole slide image format in a pyramid structure, in which downsampled versions of the original image are stored in the higher levels of the structure. The pyramid structure is presented in Figure 1.2. This structure enables an examination of the tissue at different levels, either with more or with less context.

Figure 1.2: Outline of the pyramid structure of a whole slide image (Wang et al., 2012)
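To make the pyramid structure concrete, the sketch below opens a whole slide image with the openslide-python library and reads a region at a higher (more downsampled) level; the file name and coordinates are placeholders, not data from this thesis:

```python
import openslide

# Open a whole slide image; the file name is a placeholder.
slide = openslide.OpenSlide("lung_case_001.tif")

# Every level of the pyramid stores a further downsampled version of level 0.
for level in range(slide.level_count):
    print(level, slide.level_dimensions[level], slide.level_downsamples[level])

# Read a 512x512 region at level 2 (less detail, more context). The (x, y)
# location is always expressed in level-0 coordinates.
region = slide.read_region(location=(10000, 20000), level=2, size=(512, 512))
region = region.convert("RGB")  # read_region returns an RGBA PIL image
```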

1.2 Deep learning

The field of deep learning is a subfield of the broader field of machine learning. In deep learning, multiple layers of neurons with weighted connections are typically combined to form an artificial neural network (ANN). An example of a feed-forward artificial neural network is presented in Figure 1.3. During training of a feed-forward model with supervised learning, the architecture is presented with numerous labeled examples. These examples are processed by units in the input layer of the model, which in turn transmit signals to the units in the succeeding layer. Eventually the signal reaches the output layer of the model, which typically produces a classification. In this way, a model can for example be trained to classify images of handwritten digits or of objects, but also other kinds of data such as sound (LeCun et al., 1990; Krizhevsky et al., 2012; Piczak, 2015). An artificial neural network typically learns via the process of back-propagation, in which the weights in the model are updated according to the loss of the model, i.e. according to the difference between the predicted value and the actual label. Over the course of multiple epochs, the weights in the ANN are updated in order to arrive at a model that best predicts a label for a given example. Because of their architecture, ANNs can solve both linear and non-linear problems.
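As a minimal illustration of such a supervised training loop, the Keras sketch below trains a small feed-forward network on random placeholder data; the shapes, layer sizes and hyperparameters are illustrative only:

```python
import numpy as np
import tensorflow as tf

# Placeholder data: 1000 labeled examples with 64 features and 10 classes.
x = np.random.rand(1000, 64).astype("float32")
y = np.random.randint(0, 10, size=(1000,))

# A small feed-forward network: input -> hidden layer -> output layer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# compile() fixes the loss whose gradients back-propagation computes;
# fit() then updates the weights over multiple epochs.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=5, batch_size=32)
```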


In the field of semantic segmentation, every pixel of an image is provided with a label instead of a single label per example. The goal of a model is to create a correct prediction of the segments in an image. An architecture capable of such semantic segmentation is the fully convolutional neural network (FCNN), an outline of which is presented in Figure 1.4 (Long et al., 2015). The fully convolutional model consists of convolutional layers, pooling layers and upsampling layers, and because it contains no fully-connected layers, the network can be applied to input of any size, producing output of the same size. Essentially, the image shrinks in resolution as it travels through the network and is upsampled using deconvolutional layers in the last part of the model in order to arrive at a pixelwise prediction. Such networks can for example be used for object detection, crowd segmentation or iris segmentation (Kang and Wang, 2014; Long et al., 2015; Liu et al., 2016).

Figure 1.4: Outline of the FCNN architecture (Long et al., 2015).
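The sketch below shows the idea in Keras: the network contains only convolutional, pooling and transposed-convolution layers, so it accepts input of any size and produces a pixelwise prediction of the same size. It is a toy illustration, not the architecture of Long et al. (2015):

```python
import tensorflow as tf
from tensorflow.keras import layers

# No fully-connected layers, so the spatial input size can stay unspecified.
inputs = tf.keras.Input(shape=(None, None, 3))
x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(2)(x)   # the image shrinks while traveling through ...
x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
# ... and is upsampled again with learned (transposed) convolutions.
x = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
# One softmax prediction per pixel, here for 4 hypothetical classes.
outputs = layers.Conv2D(4, 1, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```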

Another interesting field of machine learning is that of generative adversarial networks (GANs), first proposed by Goodfellow et al. (2014). A GAN typically consists of a generator network and a discriminator network which together train the generator to generate new data with the same statistics as the training set. An outline[1] of the architecture is presented in Figure 1.5. During training, the generator and the discriminator essentially play a minimax game. The goal of the generator is to fool the discriminator by generating data that looks realistic, while the goal of the discriminator is to successfully distinguish fake data produced by the generator from real data. The architecture has been used to produce fascinating results such as the creation of fake images of bedrooms, the generation of fake datasets of people's faces, in particular those of celebrities, and the transfer of the style of one set of images to another (Radford et al., 2015; Karras et al., 2017; Zhu et al., 2017).

[1] https://www.researchgate.net/figure/Generative-Adversarial-Network-GAN_fig1_317061929

Figure 1.5: Outline of the GAN architecture.
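The minimax game can be made concrete with the classic alternating Keras training loop sketched below; the tiny fully-connected generator and discriminator and the random placeholder data are illustrative only, not a model used in this thesis:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 32

# Generator: noise vector -> fake sample (here a flat 64-dimensional "image").
generator = tf.keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(64, activation="sigmoid"),
])

# Discriminator: sample -> probability that the sample is real.
discriminator = tf.keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(64,)),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model used to train the generator: the discriminator is frozen
# here (its own compiled state above still allows it to be trained directly).
discriminator.trainable = False
gan = tf.keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

real_data = np.random.rand(256, 64).astype("float32")  # placeholder data

for step in range(100):
    noise = np.random.normal(size=(32, latent_dim)).astype("float32")
    fake = generator.predict(noise, verbose=0)
    real = real_data[np.random.randint(0, len(real_data), 32)]
    # Discriminator step: real samples labeled 1, fake samples labeled 0.
    discriminator.train_on_batch(real, np.ones((32, 1)))
    discriminator.train_on_batch(fake, np.zeros((32, 1)))
    # Generator step: try to make the discriminator output 1 on fakes.
    gan.train_on_batch(noise, np.ones((32, 1)))
```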

1.3 Deep learning in digital pathology

In the past, many studies have shown the potential of deep learning in healthcare. The technique has for example been used to identify melanoma, predict radiological scores in MRI images or predict clinical events for a patient from electronic health records (Esteva et al., 2017; Jamaludin et al., 2016; Choi et al., 2016). Also in the field of digital pathology, deep learning has proven to be a useful technique. Deep learning has for example been used for the segmentation of neuronal structures in electron microscopic stacks, to detect mitosis in breast histology images, to segment epithelial and stromal regions and to identify breast cancer (Cireşan et al., 2013; Ronneberger et al., 2015; Xu et al., 2016; Wang et al., 2016). In addition to discriminative deep learning models, the GAN architecture has been successfully used in healthcare. GANs have for example been trained to identify anomalies in imaging data, to segment organs in chest X-ray images, or to generate datasets of realistic looking magnetic resonance images of slices of the human brain (Schlegl et al., 2017; Dai et al., 2018; Calimeri et al., 2017). The goal of such studies is to contribute to the model of personalized medicine, in which a personalized treatment is decided for every patient. In order to come to such a treatment, multiple models each predict a prognosis based on biomarkers, i.e. biological parameters (Group et al., 2001). Together, the models can help to reduce the workload for pathologists while increasing the objectivity of diagnoses (Litjens et al., 2016).

1.4 Research questions

In the current study, two biomarkers were examined to generate a prediction of a prognosis in the field of lung cancer. First, a model was designed to predict the number of tertiary lymphoid structures (TLS) and germinal centers (GC) in lung tissue. TLS in combination with GC has been found to correlate with patient survival, and currently there exists no system that is capable of detecting these structures in tissue. Therefore, the first research question was:

What is the performance of an ANN trained to detect TLS and GC in lung tissue?

Second, in this study a model was trained to segment types of growth patterns of adenocarcinoma in cancerous lung tissue. The type of treatment differs between the different growth patterns. In addition, variability between the assessments of pathologists exists. It is therefore beneficial to oncology and the workflow of digital pathology to develop a system that can accurately predict the prevalent types of growth patterns of lung adenocarcinoma in tissue, and therefore the second research question was:

What is the performance of an ANN trained to segment different growth patterns of adenocarcinoma in lung tissue?

Last, a method was examined to synthesize new densely annotated data from existing sparsely annotated data of growth patterns of lung adenocarcinoma. The new densely annotated data can potentially extend the current dataset by including information about borders between different growth patterns and is therefore expected to enhance performance of the system to segment growth patterns of lung adenocarcinoma. The third research question was:

Is it possible to synthesize realistic looking fake data of different growth patterns of lung adenocarcinoma with generative adversarial networks?

This thesis continues with three chapters, one per research question; the final chapter provides a summary and conclusions.


Chapter 2

Quantification of tertiary lymphoid structures and germinal centers in lung tissue

2.1 Introduction

Tertiary lymphoid structures (TLS) are ectopic lymphoid formations often found in sites of chronic inflammation and cancer, and they form in response to a particular set of pathogenic events (Dieu-Nosjean et al., 2014, 2016; Neyt et al., 2012). Tumor-associated TLS are generally formed at the invasive front of the tumor and often indicate a positive patient prognosis (Dieu-Nosjean et al., 2008; Hiraoka et al., 2016). Similarly, Silina et al. (2018) found the density of TLS in untreated patients to be the most significant prognostic marker. Furthermore, a larger presence of germinal centers has been shown to enhance TLS function, thereby enhancing patient survival. Currently, the quantification of TLS and GC is not common in routine pathology. As the frequency of TLS and GC can be a significant biomarker, there exists a desire for a system that can automatically quantify TLS and active TLS, i.e. TLS with a germinal center. To the best of our knowledge, no such system exists yet, and an examination of the problem therefore has clinical relevance. The research question of this study is what the performance of an ANN trained to detect TLS and GC in lung tissue would be. Deep learning techniques have shown that they are able to recognize patterns in medical data in general, of which some significant examples have been presented by Litjens et al. (2016). The hypothesis therefore is that it will very well be possible to train a model that can correctly quantify the number of TLS and GC in whole slide images.

2.2 Methods

2.2.1 Data

Cohort 1

The first set of data that was used in this project consisted of 38 sparsely annotated whole slide images that were scanned in 4X. An example is presented in Figure 2.1a. The tissue was hematoxylin-eosin stained and the data were provided by the Institute of Experimental Immunology, Zürich. The WSIs were annotated by a molecular biologist and the annotations included:

• 850 annotations of TLS
• 159 annotations of GC
• 1506 annotations of other lung tissue
• 143 annotations of empty tissue

The data was distributed such that 30 images were allocated to the joint training set and 8 images were allocated to the test set. For every experiment, 7 images from the training set were randomly allocated to a separate validation set. The allocation of images into a training set and a test set resulted in a distribution of the annotations as presented in Table 2.1.

               TLS   GC   other lung   empty
Training set   660  140         1102      96
Test set       190   19          404      47

Table 2.1: Distribution of annotations in a training set and a test set for cohort 1

Cohort 2

The second set of data that was used in the project consisted of 28 sparsely annotated whole slide images of hematoxylin-eosin stained lung squamous cell carcinoma that were scanned in either 20X or 40X and were provided by The Cancer Genome Atlas (TCGA) (TCGA, 2012). An example is presented in Figure 2.1b. The WSIs were annotated by a molecular biologist in order to come to the following set of annotations:

• 256 annotations of TLS
• 57 annotations of GC
• 321 annotations of stroma
• 182 annotations of tumor
• 299 annotations of other lung tissue
• 7 annotations of empty tissue

Of the total set of images, 15 images were used in the training set, 5 images were used in the validation set and 8 images were used as a test set for evaluation on the pixel level. The allocation of images in different sets resulted in a distribution of annotations as presented in Table 2.2.

                 TLS   GC   other lung   empty   stroma   tumor
Training set     159   34          136       3      203     106
Validation set    51    9           83       2       59      23
Test set          66   14           80       2       59      53

Table 2.2: Distribution of annotations in a training set, a validation set and a test set for cohort 2

In order to perform an evaluation on the object level, dense annotations were obtained for the test set. Square regions were annotated such that every pixel in the region was assigned to a single class, resulting in the following set of annotations:

• 52 annotations of TLS
• 13 annotations of GC
• 91 annotations of stroma
• 40 annotations of tumor
• 36 annotations of epithelium and glands
• 74 annotations of necrosis
• 191 annotations of inflammation
• 36 annotations of other lung tissue

(a) Example sparse annotation of cohort 1. Blue is TLS, green is GC.

(b) Example sparse annotation of cohort 2. Blue is TLS, green is GC, orange is stroma and yellow is other lung tissue.

Figure 2.1: Examples of sparse annotations in cohort 1 and 2.

Patch sampling

During training of the models, patches were randomly extracted from annotated regions of the whole slide images. For every patch, a pixel with the desired label was first located in the annotated data. A patch of the specified patch size was then extracted around that center pixel, and the whole patch received the label of the center pixel. This label was ultimately compared to the label that the model provided as output in order to calculate the loss. Every batch of patches contained an even balance in the distribution of labels.
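A simplified version of this sampling procedure is sketched below; the array conventions (an integer label mask with 0 for unannotated pixels), the helper names and the boundary handling are hypothetical simplifications of the pipeline described above:

```python
import numpy as np

def sample_patch(image, mask, label, patch_size=256):
    # Pick a random annotated pixel of the requested class as the center;
    # this assumes the label occurs somewhere in the mask.
    ys, xs = np.where(mask == label)
    i = np.random.randint(len(ys))
    half = patch_size // 2
    # Clamp the center so the patch fits inside the image.
    cy = int(np.clip(ys[i], half, image.shape[0] - half))
    cx = int(np.clip(xs[i], half, image.shape[1] - half))
    patch = image[cy - half:cy + half, cx - half:cx + half]
    # The whole patch inherits the label of its center pixel.
    return patch, label

def sample_balanced_batch(image, mask, labels, patches_per_class=2):
    # Every batch contains the same number of patches per class.
    pairs = [sample_patch(image, mask, lbl)
             for lbl in labels for _ in range(patches_per_class)]
    patches, targets = zip(*pairs)
    return np.stack(patches), np.array(targets)
```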

2.2.2 Augmentation

In order to train a more robust model, extensive augmentation of the data took place during training. Spatial augmentation included flipping, rotation and elastic deformation, and with some probability Gaussian noise and blur were added to the images as well. As for color augmentation, images were augmented in the HSV space and in the hematoxylin-eosin-DAB (HED) space. In later experiments, the previously used color augmentation was replaced by a method based on Tellez et al. (2019): images were augmented using either the HSV-strong or the HED-light settings, resulting in augmented patches as can be seen in Figure 2.2 and Figure 2.3.
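As an illustration, a simplified HSV color augmentation is sketched below; the jitter range is illustrative and does not reproduce the exact HSV-strong or HED-light settings of Tellez et al. (2019):

```python
import numpy as np
from skimage.color import rgb2hsv, hsv2rgb

def augment_hsv(patch, max_shift=0.1):
    # Jitter hue, saturation and value of an RGB patch with values in [0, 1].
    hsv = rgb2hsv(patch)
    for channel in range(3):
        hsv[..., channel] += np.random.uniform(-max_shift, max_shift)
    # Clipping is a simplification; hue would normally wrap around.
    hsv = np.clip(hsv, 0.0, 1.0)
    return hsv2rgb(hsv)

# Example on a random dummy patch.
augmented = augment_hsv(np.random.rand(256, 256, 3))
```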

2.2.3 Deep learning

Basic architecture

The models that were trained in the current research were built using Keras with TensorFlow as backend and were based on the VGG16 network (Simonyan and Zisserman, 2014; Chollet et al., 2015; Abadi et al., 2015). They contained multiple hyperparameters that determined the architecture.


Figure 2.2: Images augmented using the settings for HSV strong

Figure 2.3: Images augmented using the settings for HED light

Important hyperparameters were the depth δ, which indicated the number of convolutional blocks, where every convolutional block consisted of two convolutional layers and one pooling layer; the branching factor β, which determined the number of filters in the first convolutional layers; and a flag indicating whether the model should include batch normalization. In the basic architecture these values were set to δ = 4, β = 4 and the batch-normalization flag to true. The resulting architecture is presented in Table 2.3. Because batch normalization was enabled, every convolutional layer was followed by batch normalization. The activation function of all convolutional layers was the rectified linear unit. In order to counter overfitting, L2 regularization was used with λ = 0.00001. An Adam optimizer was used to minimize the loss on the softmax output. For every model the batch size was set to twice the number of classes it was trained on and the learning rate was initially set to 0.0005. In case the accuracy on the validation set did not improve for more than 5 epochs, the learning rate was divided by 2. The weights of the network were initialized using the method suggested by He et al. (2015).
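A sketch of how such a parameterized VGG16-like model could be built in Keras is given below. This is a reconstruction from the description above (with filter counts following Table 2.3 for β = 4 and a fixed 256x256 input), not the exact code used in this thesis:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_model(num_classes, depth=4, branching=4, batch_norm=True,
                l2_lambda=1e-5, learning_rate=5e-4):
    # VGG16-like fully convolutional classifier; delta=depth, beta=branching.
    reg = regularizers.l2(l2_lambda)
    inputs = tf.keras.Input(shape=(256, 256, 3))
    x = inputs
    for block in range(depth):
        filters = 2 ** (branching + block)   # 16, 32, 64, 128 for beta=4
        kernel = 5 if block == 0 else 3      # the first block uses 5x5 kernels
        for _ in range(2):                   # two convolutional layers per block
            x = layers.Conv2D(filters, kernel, padding="same",
                              kernel_initializer="he_normal",
                              kernel_regularizer=reg)(x)
            if batch_norm:
                x = layers.BatchNormalization()(x)
            x = layers.Activation("relu")(x)
        # Every block but the last pools with max pooling; the last averages.
        Pool = layers.AveragePooling2D if block == depth - 1 else layers.MaxPooling2D
        x = Pool(pool_size=2, strides=2)(x)
    # Convolutional classification head ("convX": X = 16 for a 256x256 input).
    x = layers.Conv2D(512, 16, kernel_initializer="he_normal", activation="relu")(x)
    x = layers.Conv2D(256, 1, kernel_initializer="he_normal", activation="relu")(x)
    x = layers.Conv2D(num_classes, 1, activation="softmax")(x)
    outputs = layers.Flatten()(x)            # (1, 1, C) -> (C,)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Halve the learning rate when validation accuracy stalls for 5 epochs,
# matching the schedule described above.
lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_accuracy", factor=0.5, patience=5)
```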

Dilation

In later experiments, dilation was introduced in the models. Dilation can be used to exponentially expand the receptive field without losing resolution (Yu and Koltun, 2015). The expectation was to obtain denser predictions and segmentations. To this end, dilation with an initial rate of 2 was introduced to every second convolutional layer with 3x3 kernels. For every convolutional block that was added to the models, the dilation rate γ was multiplied by 2.


conv5-16 → conv5-16 → max-pool → conv3-32 → conv3-32 → max-pool → conv3-64 → conv3-64 → max-pool → conv3-128 → conv3-128 → avg-pool → convX-512 → conv1-256 → conv1-C → soft-max

Table 2.3: Architecture of the basic model that was used in this project. Each convolutional layer convA-B contained kernels of AxA and B filters. Pooling layers contained windows of 2x2 with stride=2. C is the number of output classes. X is the width and height of the output of the previous average pooling layer, which was dependent on the size of the input.

Hyperparameter        Explanation
Patch size            The patch size of the input images
Pixel spacing         The pixel spacing of the level at which patches are extracted
Learning rate         The learning rate of the model
Depth                 The number of convolutional blocks in the model
Branching factor      The number of filters for the first convolutional layer
Dilation              Whether every second convolutional layer includes dilation
Augmentation          The form of data augmentation that is used
L2 regularization λ   Lambda for the L2 loss for L2 regularization

Table 2.4: The hyperparameters that are altered over the different experiments.

Hyperparameters

During the experiments, multiple hyperparameters were changed to come to different results. A list of the hyperparameters that were altered is provided in Table 2.4.

2.2.4 Post-processing

Post-processing was done solely on the TLS and GC classes and was based on the following rules:

1. Remove TLS and GC predictions that are smaller than the smallest annotation for the specific class in the training set. Size was measured by counting the number of pixels that the smallest annotation consisted of.

2. Remove GC predictions whose surroundings are not at least 50% TLS prediction. GCs are typically surrounded by TLS, so predictions should be as well; because prediction errors might occur, a threshold of 50% was chosen.

Pixels that were part of predictions removed during post-processing were assigned to the other lung tissue class.
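A sketch of these two rules using connected-component analysis is given below; the label indices and the min_sizes mapping are hypothetical, and the 50%-border check is one possible reading of rule 2:

```python
import numpy as np
from scipy import ndimage

TLS, GC, OTHER_LUNG = 1, 2, 3  # hypothetical label indices

def postprocess(pred, min_sizes):
    # `min_sizes` maps class label -> pixel count of the smallest training
    # annotation of that class.
    out = pred.copy()
    # Rule 1: remove TLS and GC predictions smaller than the smallest
    # training annotation of that class.
    for cls in (TLS, GC):
        components, n = ndimage.label(out == cls)
        for i in range(1, n + 1):
            region = components == i
            if region.sum() < min_sizes[cls]:
                out[region] = OTHER_LUNG
    # Rule 2: remove GC predictions whose border is not at least 50% TLS.
    components, n = ndimage.label(out == GC)
    for i in range(1, n + 1):
        region = components == i
        border = ndimage.binary_dilation(region) & ~region
        if (out[border] == TLS).mean() < 0.5:
            out[region] = OTHER_LUNG
    return out
```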

2.2.5 Evaluation

Evaluation at the pixel level

In order to evaluate the models, F1 scores on the pixel level were computed. Because initially no dense annotations of entire whole slide images were available for the test set, F1 scores were calculated only on the basis of the parts that were annotated. This means that for every annotated pixel it was checked whether the model predicted the correct label. An explanation of the computation of the number of true positives, false positives and false negatives is given in Table 2.5. F1 scores were computed over the total test set by accumulating the numbers of true positives, false positives and false negatives and calculating the F1 score afterwards.


Reference pixel label   Template pixel label   Value
Target label            Target label           True positive
Target label            Not target label       False negative
Not target label        Target label           False positive
No label                Any label              Not counted

Table 2.5: Explanation of true positives, false positives and false negatives in pixelwise evaluation.
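The pixelwise evaluation can be sketched as follows, where unannotated pixels (label 0, a hypothetical convention) are excluded from the counts, as in Table 2.5:

```python
import numpy as np

def count_pixel_errors(reference, prediction, target_label):
    # Unannotated pixels (label 0) are not counted.
    annotated = reference != 0
    ref = (reference == target_label) & annotated
    pred = (prediction == target_label) & annotated
    tp = int(np.sum(ref & pred))
    fn = int(np.sum(ref & ~pred))
    fp = int(np.sum(~ref & pred))
    return tp, fp, fn

def f1_over_test_set(pairs, target_label):
    # Accumulate counts over all (reference, prediction) pairs first,
    # then compute a single F1 score for the whole test set.
    tp = fp = fn = 0
    for reference, prediction in pairs:
        t, p, n = count_pixel_errors(reference, prediction, target_label)
        tp, fp, fn = tp + t, fp + p, fn + n
    return 2 * tp / max(2 * tp + fp + fn, 1)
```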

Evaluation at the object level

In order to perform a more general evaluation, an evaluation at the object level was performed. A held-out test set as described in Section 2.2.1 was used that contained densely annotated areas, i.e. square areas in the whole slide images in which every pixel was annotated. For every prediction of a TLS and every prediction of a GC, it was determined whether the prediction was a true positive or a false positive. In case a prediction partly overlapped with an annotation of the predicted class, the prediction was counted as a true positive. If the prediction did not overlap at all with an annotation of the predicted class, it was counted as a false positive. Every annotation of a TLS and of a GC that was not detected by a model was counted as a false negative. Evaluation on this level was performed only for the TLS and GC classes, because those are the classes of interest to the research question.

2.3 Results

2.3.1 Pixel level

Cohort 1

First a model was trained on the data including only the TLS, other lung tissue and empty tissue classes. The model was evaluated on a held-out test set and the F1 scores on the three classes were 0.824, 0.951 and 0.983 respectively. A normalized confusion matrix is presented in Figure 2.4a and an example of a prediction of the model is presented in Figure 2.5a. Afterwards a similar model was trained with the GC class included. The F1 scores for the TLS class, the GC class, the other lung tissue class and the empty class for that model were 0.895, 0.281, 0.962 and 0.908 respectively. A normalized confusion matrix is presented in Figure 2.4b and an example of a prediction of the model is presented in Figure 2.5b. What can be seen is that the introduction of the annotations of a fourth class introduced more uncertainty to the model. The GC class was often mistaken for the TLS class, which resulted in a low F1 score.

(a) Confusion matrix without the GC class

(b) Confusion matrix with the GC class


(a) Example prediction of a model trained on the TLS, other lung tissue and empty tissue classes. Blue = TLS, green = other lung tissue, yellow = empty tissue.

(b) Example prediction of a model trained on the TLS, GC, other lung tissue and empty tissue classes. Blue = TLS, green = GC, yellow = other lung tissue, orange = empty tissue.

Figure 2.5: Example predictions of models trained on cohort 1. Red annotations = TLS, green annotations = other lung tissue, blue annotations = GC.

After an inspection of the results and a closer look at the data, the suspicion arose that, due to the poor resolution (4X), the data did not contain enough information for a model to distinguish between the different classes. A second cohort that included images scanned at higher resolutions (20X and 40X) was therefore introduced.

Cohort 2

In order to allow the models to encode more information, the models for this cohort were expanded with an extra convolutional block. This means that the depth δ for the models trained on this cohort was 5.

Patch size and pixel spacing The first experiments that were conducted involved altering the input patch size of the models and the pixel spacing at which patches were extracted. Three different models were trained on the TLS, GC and empty classes. Furthermore, because the goal of the project was initially to segment TLS and GC, the stroma, tumor and other lung tissue classes were concatenated to form one rest class which was used during training as well. Results are presented in Table 2.6 as model A1, model A2 and model A3. As can be seen, a lower pixel spacing with the same patch size for model A2 results in lower F1 scores for all classes. An increase of the patch size with the same pixel spacing for model A3 enhances performance; however, it does not result in F1 scores as high as for model A1. The input patch size and pixel spacing that were therefore chosen for the rest of the models are 256x256 and 1 µm respectively.

L2 regularization A new model under the name of model A4 was trained, which was based on model A1. In this model the L2 regularization term was multiplied by a factor of 10. As can be seen, the F1 score for the TLS class increased but slightly decreased for the GC class. The F1 score for the rest class stayed the same, but for the empty tissue class the F1 score rose.

Augmentation By using the method from Tellez et al. (2019), models were trained with HSV-strong and HED-light augmentation settings, resulting in models A5 and A6 respectively. For model A5 the F1 score on TLS increased slightly compared to the score for model A4, but the F1 score for the GC class decreased dramatically. For model A6, the F1 score for the TLS class decreased slightly while the F1 score for the GC class increased considerably. The values for the rest and the empty tissue classes also increased slightly.

Classes As the stroma and tumor classes do hold clinical value, in further experiments these classes were regarded separately, resulting in 6 separate classes for training: TLS, GC, stroma, tumor, other lung tissue and empty tissue. The model with the highest F1 score so far (model A6) was retrained using all six classes, resulting in model B1. The resulting F1 scores are presented in Table 2.7. The F1 score on the TLS class increased slightly, but the performance on the GC class dropped quite a bit. The model had difficulties predicting stroma, but the performance on tumor tissue was close to the performance on empty tissue, which in turn increased for this model as well.

L2 regularization Even though previous experiments demonstrated that a lower L2 regularization λ resulted in lower F1 scores, two more models were trained with different values for the L2 regularization, namely model B2 and model B3. Surprisingly, both models resulted in higher F1 scores for the GC class as compared to model B1. The results on the other classes are comparable.

Branching factor For model B4 the branching factor was increased from 4 to 5. This resulted in a larger number of filters in the convolutional layers. The F1 score for the TLS class remained the same, while the F1 scores for the GC class and the tumor and empty classes increased quite a bit.

Learning rate The learning rate was adjusted which resulted in model B5 and model B6. Interestingly, for both models the F1 scores decreased for almost all classes.

L2 regularization and branching factor Model B2 appeared to result in a higher F1 score for the GC class as compared to model B1 because of a lower value for the L2 regularization λ. The same goes for model B4 because of a higher branching factor, which is why a model was trained in which the L2 regularization λ was altered as well as the branching factor. The resulting model B7 had higher F1 scores on all classes except the empty tissue class.

Dilation As explained in Section 2.2.3, a model that included dilation was introduced as model B8. The model resulted in the highest F1 scores on the TLS class, the GC class and the stroma class. For the other classes, performance was similar except for the empty tissue class for which performance was surprisingly much lower.

Figure 2.6: Confusion matrix of model B8 when applied to the left-out test set where 1=TLS, 2=GC, 3=other lung tissue, 4=stroma, 5=tumor and 6=empty tissue.

F1 scores of the best performing model B8 on the TLS class, the GC class, the other lung tissue class, the stroma class, the tumor class and the empty class were 0.922, 0.802, 0.788, 0.630, 0.782 and 0.681 respectively. A confusion matrix is presented in Figure 2.6. As can be seen, the F1 score for the GC class is mainly lowered by GC being predicted as TLS. An example of GC being partly predicted as TLS is presented in Figure 2.7a. Such false predictions lower the F1 score of GC. Other lung tissue is often predicted as stroma and vice versa. Also, stroma is frequently predicted as tumor tissue and vice versa. An example of the latter is presented in Figure 2.7b. Last, empty tissue is sometimes predicted as tumor tissue or stroma.


(a) Near correct prediction of TLS with GC

(b) False prediction of stroma

Figure 2.7: Example output of model B8. Of the predictions, blue=TLS, green=GC, or-ange=stroma, pink=tumor, yellow=other lung tissue.

model#   Input patch size   Pixel spacing   L2 regularization   TLS     GC      Rest    Empty tissue
A1       256x256            1 µm            0.00001             0.902   0.662   0.977   0.729
A2       256x256            0.5 µm          0.00001             0.875   0.508   0.958   0.704
A3       512x512            0.5 µm          0.00001             0.903   0.556   0.965   0.595
A4       256x256            1 µm            0.0001              0.911   0.653   0.977   0.787
A5       256x256            1 µm            0.0001              0.927   0.468   0.976   0.787
A6       256x256            1 µm            0.0001              0.904   0.720   0.980   0.798

Table 2.6: Results of models trained on the TLS class, GC class, empty tissue class and rest class. Values are F1 scores.


model#   Input patch size   Pixel spacing   L2 regularization   Dropout   Branching factor   Learning rate   TLS     GC      Other lung tissue   Stroma   Tumor   Empty tissue
B1       256x256            1 µm            0.0001              1         4                  0.0005          0.911   0.641   0.786               0.613    0.796   0.850
B2       256x256            1 µm            0.00001             1         4                  0.0005          0.902   0.720   0.750               0.550    0.803   0.834
B3       256x256            1 µm            0.001               1         4                  0.0005          0.901   0.732   0.802               0.578    0.819   0.855
B4       256x256            1 µm            0.0001              1         5                  0.0005          0.911   0.751   0.799               0.589    0.847   0.904
B5       256x256            1 µm            0.0001              1         4                  0.001           0.900   0.650   0.703               0.567    0.785   0.886
B6       256x256            1 µm            0.0001              1         4                  0.0001          0.896   0.573   0.755               0.508    0.733   0.825
B7       256x256            1 µm            0.00001             1         5                  0.0005          0.913   0.696   0.796               0.629    0.833   0.834
B8       256x256            1 µm            0.0001              1         4                  0.0005          0.922   0.802   0.788               0.630    0.782   0.681

Table 2.7: Results of models trained on the TLS class, GC class, stroma class, tumor class, other lung tissue class and empty tissue class. Values are F1 scores.


Effects of post-processing

The effect of post-processing differed per model. The model that resulted in the highest performance, model B8, included dilation in its convolutional layers. The result was a denser segmentation, but the output of the model also contained the well-known salt-and-pepper effect due to the per-pixel classification (Blaschke et al., 2000). In those cases, post-processing attempted to remove the small predictions, and post-processing improved the performance of the model, as can be seen in Table 2.8. However, for some other models such as model B1, post-processing resulted in lower performance on the test set. An example of the result of post-processing is presented in Figure 2.8.

model#                    TLS     GC      Other lung tissue   Stroma   Tumor   Empty tissue
B1 no post-processing     0.911   0.688   0.793               0.613    0.769   0.850
B1 with post-processing   0.911   0.641   0.786               0.613    0.796   0.850
B8 no post-processing     0.921   0.776   0.791               0.630    0.782   0.679
B8 with post-processing   0.922   0.802   0.788               0.630    0.782   0.679

Table 2.8: Results of post-processing

(a) Prediction before post-processing (b) Prediction after post-processing

Figure 2.8: Predictions before and after post-processing. Green is GC class, yellow is other lung tissue class and orange is stroma.

2.3.2 Object level

The best performing model B8 was evaluated on a held-out test set that included dense annotations, as described in Section 2.2.5. The scores for precision, recall and F1 are presented in Table 2.9. As can be seen, the values for the precision are lower than the values for the recall for both labels, due to the number of predicted false positives.

The training set that the model was trained on contained the classes TLS, GC, stroma, tumor, other lung tissue and empty tissue, while the dense annotations in this test set also included the class inflammation. As TLS is often found at sites of chronic inflammation and the class of inflammation was not included in the set of training classes, another evaluation of the model was computed in which predictions of TLS where the tissue was annotated as inflammation were not counted as false positives, but as true positives. Consequently, annotations of inflammation that were not predicted as TLS were counted as false negatives. The scores for the precision, recall and F1 are presented in Table 2.10. Precision for the TLS class clearly increased as the number of false positives decreased. However, the number of false negatives increased as well, as areas that were annotated as inflammation were not predicted as being TLS. Figure 2.9 presents the distribution of predictions of pixels that were annotated as inflammation. As can be seen, inflammation was mainly predicted to be stroma or other lung tissue. This resulted in a low recall.

label   precision   recall   F1 score
TLS     0.505       0.873    0.640
GC      0.370       0.769    0.500

Table 2.9: Results of model B8 on the test set with inflammation counted as false positive for TLS

label   precision   recall   F1 score
TLS     0.903       0.289    0.438
GC      0.370       0.769    0.500

Table 2.10: Results of model B8 on the test set with inflammation counted as true positive for TLS

Figure 2.9: Distribution of predictions on the pixel level for inflammation annotations

2.4 Discussion

The research question of this study was what the performance would be of an ANN trained to detect TLS and GC in lung tissue. In order to answer this question, multiple evaluations have been performed.

On the pixel level, quite some confusion between the stroma, tumor and other lung tissue classes existed. Less confusion appeared in the TLS and GC classes which resulted in F1 scores of 0.922 and 0.802 respectively. Regarding the evaluation on the pixel level, we can therefore conclude that the best performing model can segment TLS and GC with decent performance.

The evaluation on the object level was performed only for the TLS and the GC classes. In the evaluation where TLS and GC were only counted as true positives when there was overlap with those classes in the annotations, the precision was quite low due to large numbers of false positives. False positives of both classes mainly arose from annotations of the inflammation class being predicted as TLS or GC. In an attempt to correct for this, another evaluation was performed in which the TLS class and the inflammation class were taken together. This resulted in a higher precision for the TLS class, but a lower recall, as annotations of inflammation were also predicted as being something other than inflammation or TLS.

The current study concerns research in lung cancer and it is therefore crucial for the system to have a high recall, such that no TLS and GC are missed. The current prospect is that a system that is


discussed in this study will initially only be used in combination with the expertise of a pathologist. Therefore possible false positives can be filtered out by the expert user during inference.

Nevertheless, an improvement to the system would be to include annotations of the inflammation class in the training set of the model. The model would then learn to segment this tissue as a separate class. A new evaluation of the model, in which inflammation being predicted as TLS or GC would be regarded as a false positive, would then be more legitimate, as the model would have been trained to recognize all three classes.

Regarding post-processing, some improvements could be applied as well. First, post-processing was only applied to predictions of the TLS and GC classes. Predictions of those classes that were smaller than the smallest annotation of the class in the training set were removed and set to the other lung tissue class. Smaller structures of those types can exist, so an improvement could be to only remove predictions that are, for example, smaller than the smallest annotation minus the standard deviation. Furthermore, the predictions of the other classes could be post-processed as well in order to come to a more unified segmentation. Prediction maps currently contain many scattered predictions, the salt-and-pepper effect. An improvement would be to use erosion to remove small predictions. Both for the TLS and GC classes and for the other classes, removed predictions could be filled in with the closest class using a nearest neighbour approach.

All in all the answer to the research question given the data that we currently have would be that an ANN trained to detect TLS and GC in lung tissue would result in an F1 score on the pixel level of 0.922 and 0.802 for the TLS and GC classes respectively and of 0.640 and 0.500 on the object level. The developed system is able to detect the two target classes with modest performance, but for future work some improvements both in the post-processing and in the evaluation could be implemented in order to come to a possible higher performance and a more legitimate evaluation.


Chapter 3

Segmentation of growth patterns of adenocarcinoma in lung tissue

3.1 Introduction

Currently, lung cancer is the second most common cancer for men and women and the most fatal type of cancer worldwide (Siegel et al., 2019). Furthermore, in most countries adenocarcinoma is the most prevalent histologic subtype of lung cancer (Devesa et al., 2005). According to the 2015 WHO lung adenocarcinoma classification, growth patterns of invasive non-small cell lung adenocarcinomas can typically be classified into solid, lepidic, acinar, micropapillary and papillary adenocarcinoma (Travis et al., 2015). After extensive histological examination, tumors are generally classified according to the most prevalent type of growth pattern. Based on that classification a prognosis is built and a specific treatment is chosen. Because of the availability of subtype-specific therapies, it is key that tumors are classified into the correct category. Furthermore, previous studies have found that interobserver variability among pathologists exists when it comes to the classification of growth patterns of lung adenocarcinomas (Warth et al., 2012; Thunnissen et al., 2012). Especially the categorization of micropapillary and papillary growth patterns resulted in low kappa scores, while the evaluation of solid and lepidic tumor growth was homogeneous.

With this study, we strive to develop a system that can automatically segment growth patterns of lung adenocarcinoma in whole slide images. Such a system can potentially decrease the interobserver variability and speed up a pathologist's workflow by suggesting the type of growth pattern that is present in the tissue. It is therefore of relevance to the field of digital pathology. Furthermore, as the prevalent type of growth pattern says something about the prognosis of a patient, it is also of relevance to the field of oncology. Previously, Gertych et al. (2019) established a system that is able to distinguish five subtypes of lung adenocarcinoma with impressive performance. Their study consisted of the training of three different architectures, namely GoogLeNet, ResNet-50 and AlexNet. In this study, we will try to extend those results by examining the VGG architecture in combination with similar data, but from fewer sources. The research question of the study is what the performance of an ANN trained to segment different growth patterns of adenocarcinoma in lung tissue is. As Gertych et al. (2019) have shown before that it is very well possible to segment subtypes of lung adenocarcinoma in digital slides with architectures like AlexNet, the hypothesis is that similar results might be obtained using a VGG-like network.


3.2 Methods

3.2.1 Data

Cohort 1

The data that was used in the project consisted of 29 whole slide images of hematoxylin-eosin stained tissue that were scanned in either 20X or 40X and were provided by TCGA (TCGA, 2014). The WSIs included lung adenocarcinoma and were annotated by a molecular biologist in order to come to the following set of annotations:

• 91 annotations of solid adenocarcinoma
• 22 annotations of acinar adenocarcinoma
• 23 annotations of micropapillary adenocarcinoma
• 66 annotations of non-tumor tissue

Of the total set of images, 15 images were used in the training set, 5 images were used in the validation set and 9 images were used as a test set. The allocation of images in different sets resulted in a distribution of annotations as presented in Table 3.1.

                 Solid   Acinar   Micropapillary   Non-tumor
Training set     52      10       12               42
Validation set   10       7        4               12
Test set         29       5        7               12

Table 3.1: Distribution of data in a training set, a validation set and a test set for the detection of growth patterns in lung adenocarcinoma

Cohort 2

As a second test set for evaluation of the models, a private set of data from the Radboudumc was used. The set included 26 whole slide images of hematoxylin-eosin stained lung adenocarcinoma tissue which were scanned in 40X. For each image, the growth patterns of adenocarcinoma present in the image were annotated by a trained pathologist on the image level. The resulting annotations could contain multiple growth patterns per image. In total, the set contained 11 annotations of solid adenocarcinoma, 21 annotations of acinar adenocarcinoma and 5 annotations of micropapillary adenocarcinoma.

Patch sampling

The size of the patches was set to 256x256 pixels at 10X magnification. The values were chosen according to the shape of the patches used by Gertych et al. (2019), who used a patch size of 600x600 pixels at 20X magnification. In order to speed up patch extraction while retaining enough context in the patches, a lower resolution was chosen in combination with a smaller patch size: a 256x256 patch at 10X (roughly 1 µm per pixel) covers about 256x256 µm of tissue, comparable to the roughly 300x300 µm covered by a 600x600 patch at 20X (roughly 0.5 µm per pixel). The resulting patches thus contained a similar amount of tissue.

3.2.2 Augmentation

The same method of augmentation as explained in Section 2.2.2 was applied to the data during training. This means that the default augmentation included general augmentation such as flipping, rotation and elastic deformation, as well as augmentation in the color space. In later experiments the color augmentation was replaced by the color augmentation proposed by Tellez et al. (2019), i.e. specific augmentations in the HSV and HED spaces.


3.2.3 Deep learning

Model

As for the model in Section 2.2.3, the architecture that was used was based on the VGG16 network. The depth δ of the architecture was expanded to δ = 5 to be able to store enough information in the models and the batch size was set to 8. The basic architecture is presented in Table 3.2.

conv5-16 → conv5-16 → max-pool → conv3-32 → conv3-32 → max-pool → conv3-64 → conv3-64 → max-pool → conv3-128 → conv3-128 → max-pool → conv3-256 → conv3-256 → avg-pool → convX-1024 → conv1-512 → conv1-C → soft-max

Table 3.2: Architecture of the model that was used in this project. Each convolutional layer convA-B contained kernels of AxA and B filters. C is the number of output classes. X is the width and height of the output of the previous average pooling layer, which was dependent on the size of the input.

Hyperparameters

The hyperparameters that were examined during this study were the pixel spacing of the input images, the value for the L2 regularization and the augmentation type, as these parameters proved able to significantly change the performance of the model on the test set in Section 2.3.1.

3.2.4 Evaluation

Evaluation on the pixel level

Evaluation on the pixel level was performed in the same way as is explained in Section 2.2.5. F1 scores were calculated by accumulating the amount of true positive, false positive and false negative pixel predictions over the total test set. Metrics were only calculated on the basis of the areas of images that were annotated.

Evaluation on the image level

In order to do an evaluation on the image level, a second cohort was used. For every image in the dataset, the number of pixels predicted as each class was divided by the total number of pixels predicted as adenocarcinoma, i.e. every class except non-tumor. A prediction on the image level was obtained by taking the top-1 prediction. The predictions were compared to the annotations that were available for the test set. If the prediction was among the image-level annotations of an image, it was counted as a hit; otherwise it was counted as a miss.
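A sketch of this image-level aggregation is given below; the label indices and the dictionary of growth patterns are hypothetical:

```python
import numpy as np

NON_TUMOR = 0                                     # hypothetical label index
PATTERNS = {1: "solid", 2: "acinar", 3: "micropapillary"}

def image_level_prediction(pred):
    # Fraction of each growth pattern among all pixels predicted as
    # adenocarcinoma (i.e. everything except non-tumor); top-1 wins.
    tumor_pixels = np.sum(pred != NON_TUMOR)      # assumed to be non-zero
    fractions = {name: np.sum(pred == lbl) / tumor_pixels
                 for lbl, name in PATTERNS.items()}
    return max(fractions, key=fractions.get)

def is_hit(pred, slide_annotations):
    # A hit if the top-1 prediction appears among the image-level
    # annotations, which may list several growth patterns per slide.
    return image_level_prediction(pred) in slide_annotations
```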

3.3 Results

3.3.1 Pixel level

Two models, model C1 and model C2, were trained with different values for the L2 regularization λ, being 0.001 and 0.0001 respectively. F1 scores on a held-out test set are presented in Table 3.3. As can be seen, the results are quite similar, except for the micropapillary class, for which the model with λ = 0.0001 resulted in a lower F1 score.


Pixel spacing The pixel spacing used so far was based on Gertych et al. (2019). Two new models, model C3 and model C4, were trained using a different pixel spacing, namely 0.5 µm, corresponding to a magnification of 20X, while retaining the same patch size of 256x256. Furthermore, the L2 regularization λ differed between model C3 and model C4, being 0.0001 and 0.001 respectively. The performance of both models was lower for the solid and acinar adenocarcinoma classes compared to the performance of model C1. For the other two classes performance was similar or slightly better. These results indicate that the model needs more context to accurately determine the subtype of adenocarcinoma present in a patch, therefore needing a higher pixel spacing.

Augmentation The default augmentation technique was partly replaced by the technique introduced by Tellez et al. (2019). Two models with the same characteristics as model C1 were trained, one with the HED-light and one with the HSV-strong setting, resulting in models C5 and C6 respectively. For model C5 performance increased for every class; the HED-light augmentation setting was therefore beneficial.

Model#   Pixel spacing   L2 regularization   Augmentation   Solid   Acinar   Micropapillary   Non-tumor
C1       1 µm            0.001               Default        0.856   0.399    0.756            0.916
C2       1 µm            0.0001              Default        0.853   0.400    0.531            0.919
C3       0.5 µm          0.0001              Default        0.804   0.265    0.718            0.937
C4       0.5 µm          0.001               Default        0.790   0.273    0.748            0.945
C5       1 µm            0.001               HED-light      0.891   0.524    0.812            0.954
C6       1 µm            0.001               HSV-strong     0.800   0.438    0.845            0.925

Table 3.3: Results of models trained on the solid, acinar, micropapillary and non-tumor classes. Values are F1 scores.

A confusion matrix of the best performing model C5 is presented in Figure 3.1. What is clear is that the model often mistakes acinar adenocarcinoma for micropapillary adenocarcinoma. An example of such a wrong prediction is presented in Figure 3.2a. The F1 score on the solid adenocarcinoma class is decent. An example of a prediction of a whole slide image that includes solid adenocarcinoma is presented in Figure 3.2b.

Figure 3.1: Confusion matrix of model C5 when applied to the left-out test set where 1=solid, 2=acinar, 3=micropapillary and 4=non-tumor.


(a) A prediction of model C5 on an annotation of acinar adenocarcinoma.

(b) A prediction of model C5 of a whole slide image. Annotations in the image are solid adenocarcinoma.

Figure 3.2: Predictions of model C5 where green=acinar, yellow=micropapillary, blue=solid and orange=non-tumor.

3.3.2 Image level

In order to come to predictions on the image level, the best performing model C5 was applied to the second cohort. Of the total set of 26 images, 17 were classified correctly, i.e. the top-1 prediction of the model was among the annotations of those images, resulting in an agreement percentage of 65.4%. For 9 images the image-level prediction was incorrect: 5 times the prediction was micropapillary while the annotation was acinar, 3 times the prediction was micropapillary while the annotation included acinar and solid, and once the prediction was solid while the annotation was acinar. These results support the results on the pixel level, in that acinar adenocarcinoma was often falsely predicted as micropapillary.

3.4 Discussion

The research question in this study was what the performance of an ANN trained to segment different growth patterns of adenocarcinoma in lung tissue would be. An ANN based on the VGG architecture was trained which resulted in F1 scores on the pixel level of 0.89 for solid adenocarcinoma, 0.81 for micropapillary adenocarcinoma, 0.52 for acinar adenocarcinoma and 0.95 for non-tumor tissue. Previously, Gertych et al. (2019) have reported F1 scores of 0.91, 0.76, 0.74 and 0.96 for the same classes. Except for the micropapillary class, our model did not surpass performance of the previous model. Especially on the acinar adenocarcinoma class, our model resulted in a poorer F1 score.

The model that was trained in this study was trained to recognize patterns in 15 whole slide images from TCGA only. The models trained by Gertych et al. (2019) were trained to segment patterns in 33 slides from the Cedars-Sinai Medical Center (CSMC) in Los Angeles and 45 slides from the Military Institute of Medicine (MIMW) in Warsaw, Poland. Furthermore, in their paper Gertych et al. (2019) report that the tissue preservation of the slides from the CSMC and MIMW cohorts was of much better quality than the preservation in the TCGA dataset. Our


training set therefore contained fewer images and fewer annotations, and the images were of poorer quality than those used by Gertych et al. (2019). These factors could have resulted in worse performance; better-quality training data could therefore result in better generalization of our models.

A question for future work that remained in the study by Gertych et al. (2019) was what a patch size of 256x256 with a magnification of 20X would do to the performance of the model. In our study, two models were trained with such settings, models C3 and C4, each with a different value for the L2 regularization. Both models resulted in lower performance than our best performing model C5. Our assumption therefore is that models do need more context than patches of 256x256 scanned at 20X contain. We suggest using patches of 256x256 with a pixel spacing of 1 µm.

A recent similar study reported an agreement percentage of their model with a pathologist of 66.6%, while our model had an agreement of 65.4% (Wei et al., 2019). However, a difference in the evaluation of the models is present, as their evaluation is based on the agreement between the assessments of the predominant growth pattern in the data. The evaluation performed in this study was based on the predominant pattern as predicted by the model and the accumulated predominant and minor growth patterns as defined by the annotators, which could potentially bias the agreement percentage towards a higher score. A better comparison between the models of the two studies can be performed as future work, when data in which the predominant and the minor growth patterns are recorded separately becomes available.

A large part of the discussion of this study is built upon a comparison to the work of Gertych et al. (2019). However, as their models were trained on a larger set of data, evaluated on a different set of data and also included the cribriform class, the comparison might not be sound. All in all, the result of this study is a model that can classify growth patterns of lung adenocarcinoma with decent performance.

In future work more annotations of other growth patterns of lung adenocarcinoma should be included, such as papillary and lepidic, in order to come to a more general model. Furthermore, in order to come to denser segmentations, dilation might be included in the models as well, as was done in the model explained in Section 2.2.3. Also, no post-processing was included in this study; including such a method might enhance the segmentations, resulting in more accurate predictions and therefore possibly higher F1 scores. Last, the amount of data that the models were trained on could be extended in order to come to better generalization of the models. The training set that was used in this study contained only sparse annotations. Therefore our models did not encounter any neighbouring growth patterns of adenocarcinoma. A further enhancement of the performance could potentially be achieved by including dense segmentations in the annotations, especially of neighbouring types of growth patterns, and training semantic segmentation models on those new data. Currently, no such data was available for this study, which is why the automatic synthesis of such data from sparsely annotated data will be examined in the following chapter.


Chapter 4

Synthesizing densely annotated data of lung adenocarcinoma with GANs

4.1 Introduction

In the previous chapter a method was examined to segment different growth patterns of adenocarcinoma in cancerous lung tissue. The data that the models were trained on consisted of wholeslide images of lung adenocarcinoma which were annotated sparsely. As a result, the image patches that the models were presented with during training carried only a single label, namely the label of the center pixel, which was assigned to the whole image patch. As a consequence, no model was confronted with borders between different types of growth pattern. The models therefore did not learn to segment these regions and to produce a truly dense segmentation of the growth patterns. A possible improvement, presented in Section 3.4, was to include dense annotations in the training data and train semantic segmentation models to segment the growth patterns in the new data. The hypothesis was that such models would produce denser segmentations and therefore achieve higher performance.

In the current project, the synthesis of fake but realistic densely annotated data on the basis of sparsely annotated data using generative adversarial networks is examined. Three different architectures are explored to generate the fake data, namely GP-GAN by Wu et al. (2017), CycleGAN by Zhu et al. (2017) and EdgeConnect by Nazeri et al. (2019).

4.1.1 GP-GAN

In their research, Wu et al. (2017) proposed a method that can realistically blend two high resolution images. To do so, a network with an architecture based on the model proposed by Pathak et al. (2016) is trained to generate a well-blended low resolution image, the so-called color constraint. High resolution details are captured by introducing a gradient which is used as the Poisson constraint, and the two constraints are combined into a high resolution blended image using a Gaussian-Poisson equation (Burt and Adelson, 1983). An overview of the architecture is presented in Figure 4.1. As the current study focuses on realistically blending two different images, in this case two different growth patterns of lung adenocarcinoma, the GP-GAN architecture is directly relevant.
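Schematically, and in simplified notation of our own rather than that of Wu et al. (2017), the final blended image can be viewed as the minimizer of a weighted combination of the two constraints:

\[
x^{*} = \arg\min_{x} \; \lambda \left\lVert \mathcal{G}(x) - x_{\mathrm{low}} \right\rVert^{2} + \left\lVert \nabla x - \nabla x_{\mathrm{copy}} \right\rVert^{2}
\]

where $x_{\mathrm{low}}$ is the low resolution color constraint produced by the Blending GAN, $\nabla x_{\mathrm{copy}}$ is the gradient of the copypasted composite serving as the Poisson constraint, and $\mathcal{G}$ denotes a Gaussian low-pass operator.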

Figure 4.1: Overview of the GP-GAN architecture (Wu et al., 2017)

4.1.2 CycleGAN

Results from the study by Zhu et al. (2017) demonstrated the potential of the CycleGAN architecture to map images between different domains. Numerous example domains were provided in their study, in which for example images of horses were mapped to the domain of zebras and paintings by one painter were mapped to the style of another. The key concept of the CycleGAN architecture is that it contains two generator networks and two discriminator networks, and images cycle from one domain to the other and back. The performance of the system is measured with multiple adversarial loss functions and cycle-consistency loss functions. An outline is presented in Figure 4.2. In the current study, the architecture was used to map fake images containing two types of growth patterns of lung adenocarcinoma to the domain of real images of lung adenocarcinoma.

Figure 4.2: Overview of the CycleGAN architecture and the cycle-consistency loss (Zhu et al., 2017)
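For reference, the cycle-consistency term as defined by Zhu et al. (2017) has the form

\[
\mathcal{L}_{\mathrm{cyc}}(G, F) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\lVert F(G(x)) - x \rVert_{1}\right] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\!\left[\lVert G(F(y)) - y \rVert_{1}\right]
\]

where $G : A \rightarrow B$ and $F : B \rightarrow A$ are the two generators; it is added to the two adversarial losses with a weight $\lambda_{cyc}$.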

4.1.3 EdgeConnect

A wholly different approach to the problem faced in the current project is to cast it as an inpainting problem: an area around the border between the two types of growth patterns can be left out, and a model can be trained to fill in the missing part of the image. The current state-of-the-art inpainting architecture is that of Nazeri et al. (2019). The system consists of two models, of which the first reconstructs the edges of the source image and the second inpaints the blank region of the source image given the edge map and the input image. An outline of the architecture is presented in Figure 4.3. This approach gives the model a larger degree of freedom, as a larger part of the image has to be predicted to arrive at the final output.


4.2 Methods

4.2.1 Data

Patches

The data that was used in this project was based on the data referred to in Section 3.2.1. From the first cohort of 29 wholeslide images, 8,000 patches were extracted, each containing solely pixels that were annotated as one of the four classes solid adenocarcinoma, acinar adenocarcinoma, micropapillary adenocarcinoma and non-tumor tissue. This was done by eroding the annotation masks just enough that any patch drawn around any annotated center pixel would include only annotated pixels.
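A minimal sketch of this erosion-based sampling step is given below. It assumes binary per-class annotation masks, and the function names are illustrative rather than taken from our codebase:

import numpy as np
from scipy.ndimage import binary_erosion

def valid_center_mask(annotation_mask, patch_size=256):
    """Erode a binary annotation mask so that every surviving pixel can act
    as the center of a patch_size x patch_size patch lying entirely inside
    the annotated region (up to one pixel of slack for even patch sizes)."""
    structure = np.ones((patch_size, patch_size), dtype=bool)
    return binary_erosion(annotation_mask.astype(bool), structure=structure)

def sample_patch(image, annotation_mask, patch_size=256, rng=np.random):
    """Draw one patch around a randomly chosen valid center pixel."""
    centers = np.argwhere(valid_center_mask(annotation_mask, patch_size))
    y, x = centers[rng.randint(len(centers))]
    half = patch_size // 2
    return image[y - half:y + half, x - half:x + half]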

Filters

In order to generate fake patches, each containing two types of growth pattern of adenocarcinoma, a set of 8 filters was produced. Example filters are presented in Figure 4.4. Using the filters, 8,000 new images were generated. In every iteration during the generation of the images, two random images, each with a different growth pattern, were picked together with a random filter. The filter was then augmented by rotating it by 0, 90, 180 or 270 degrees and applied to one of the images. The inverse of the filter was applied to the other image and both results were combined to form the new copypasted image. An example is presented in Figure 4.5.
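The combination step amounts to simple alpha compositing with the rotated filter. A sketch under these assumptions, with illustrative names:

import numpy as np

def make_copypasted_patch(img_a, img_b, filt, rng=np.random):
    """Combine two single-pattern patches into one fake two-pattern patch.

    img_a, img_b -- HxWx3 uint8 patches with different growth patterns
    filt         -- HxW mask in [0, 1] selecting the region kept from img_a
    """
    # Augment the filter with a random rotation of 0, 90, 180 or 270 degrees.
    filt = np.rot90(filt, k=rng.randint(4))
    mask = filt[..., None].astype(np.float32)  # broadcast over RGB channels
    # Apply the filter to one image and its inverse to the other, then merge.
    blended = mask * img_a + (1.0 - mask) * img_b
    return blended.astype(np.uint8)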

4.2.2 GP-GAN

Default parameters

Code from this study was publicly available on GitHub1. The repository featured an implementation in Chainer and included extensive documentation about setting up one's own experiment (Tokui et al., 2015).

1 https://github.com/wuhuikai/GP-GAN

In this experiment, the parameters were left at their default values from the original study. This means that the filtered patches described in Section 4.2.1 were not used and only a single filter was used for the generation of fake images, namely a filter with a centered square as presented in the original study. Although only one filter was used during training, the authors determined that the architecture is still able to generate well-blended images for inputs with arbitrary masks (Wu et al., 2017).

L2 loss

The loss in the GP-GAN architecture is calculated as

\[
\mathcal{L}(x, x_g) = \lambda_{L2}\,\mathcal{L}_{l2}(x, x_g) + (1 - \lambda_{L2})\,\mathcal{L}_{adv}(x, x_g)
\]

in which $x$ is the generated image, $x_g$ is the ground truth image, $\mathcal{L}_{l2}$ is the L2 loss, $\mathcal{L}_{adv}$ is the adversarial loss and $\lambda_{L2}$ is a parameter that defines the importance of the L2 loss in the loss function.

Figure 4.5: From left to right: input image A, input image B, the filter, the result image after filtering (copypasted image)

The L2 loss is by default calculated by computing the difference between the output image of the system and one of the two source images. In the original study, the data which the model was trained on consisted of combined images taken from the same point of view but at different points in time. This essentially means that the only difference between the two input images was caused by a difference in lighting, while the contents of the two images were broadly the same. Given that the contents of two images are to a great extent the same and there only exists a difference in colour, it does not matter whether the L2 loss in the GP-GAN architecture is based on the difference between the output image and input image A or between the output image and the other input image B. The images used in this study do not share that characteristic. Therefore, experiments were conducted in which the L2 loss was calculated based on the difference between the output image and the copypasted image. Multiple experiments were performed with different values for the weight of the L2 loss in the loss function, $\lambda_{L2}$.
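As an illustration of this modification, a PyTorch-style sketch is given below. The original implementation is in Chainer, and the exact adversarial formulation and all names here are our assumptions:

import torch
import torch.nn.functional as F

def generator_loss(output, copypasted, disc_logits_fake, lam_l2=0.999):
    """GP-GAN-style generator loss with our modification: the L2 term is
    computed against the copypasted image instead of a ground-truth image."""
    l2 = F.mse_loss(output, copypasted)
    # Adversarial term: push the discriminator to classify the output as real.
    adv = F.binary_cross_entropy_with_logits(
        disc_logits_fake, torch.ones_like(disc_logits_fake))
    return lam_l2 * l2 + (1.0 - lam_l2) * adv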

Exclude Poisson constraint

As with the calculation of the L2 loss, the fact that the input images do not share the same gradient raises a problem when the color constraint and the Poisson constraint are merged to form the output of the architecture. As the Poisson constraint forces the gradient of the copypasted image to be fully present in the output image, edges close to the border between the two types of growth patterns are also forced into the output. The result is that the architecture has no freedom to transform the image such that the area around the border looks realistic, by for example completing a structure. Experiments were therefore conducted in which the Poisson constraint was left out of the architecture. The input and output shape of the Blending GAN generator were set to 256x256 pixels and the number of units in the bottleneck layer was decreased such that the experiments did not fail due to memory issues. Because L2 loss is known to produce blurry results, and blurry output images were to be expected once the Poisson constraint was left out (Pathak et al., 2016), the value of $\lambda_{L2}$ was decreased over different experiments such that the importance of the adversarial loss in the loss function grew.

4.2.3 CycleGAN

Default parameters

The code from the original paper has been published on GitHub2. The code features an implementation of the study in PyTorch (Paszke et al., 2017).

The first experiment that was run with CycleGAN used the default architecture with the default values for the parameters. As input for domain A, generated fake images as explained in Section 4.2.1 were used. In contrast, real patches extracted from the dataset were used as input for domain B. The objective of the model then was to map fake images from domain A to domain B and vice versa.


Figure 4.6: A filter (a) and its associated wide (b) and narrow (c) weight maps for the cycle-consistency loss

Loss and weight maps

The problem regarded in this study is the mapping from fake images in domain A to the domain of real images in domain B. A correct transformation of real images in domain B to fake images in domain A was not crucial. Furthermore, a correct mapping of a cycled fake image, i.e. a fake image that was transformed to domain B and then transformed back to domain A (see Figure 4.2), was not as important as the mapping of the fake image to domain B itself. Therefore, the weight of the cycle-consistency loss was decreased. This idea was tested in multiple experiments in which the weight of the cycle-consistency loss in the loss function for domain B was set to 0 and the corresponding weight for domain A was varied.

In addition, a weight map that was multiplied with the cycle-consistency loss was introduced. This was based on the idea that the cycle-consistency loss is less important close to the border between the two growth patterns, so that the model has more freedom to transform the image at that specific spot. Further away from the border, the cycle-consistency loss should be higher, as there the model should leave the images as they are. An example of such a weight map in combination with the filter is presented in Figure 4.6. In each experiment either wide or narrow weight maps were used.

The same concept applies to the adversarial loss for images that are mapped from domain A to domain B. The inverse of the weight map presented in Figure 4.6 was multiplied with the adversarial loss for domain A, such that the adversarial loss close to the border increased in importance and the generator was pushed towards making the output look real close to the border. Different experiments were conducted in which the weight maps were multiplied with the cycle-consistency loss and the adversarial loss, and the importance of the cycle-consistency loss in the total loss function was altered using the $\lambda_{cyc}$ parameter.
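A sketch of how such weight maps could enter the two loss terms is shown below. It assumes L1 cycle-consistency and least-squares adversarial losses as in the original CycleGAN, and a discriminator output map resized to the weight map's shape; all names are illustrative:

import torch

def weighted_cycle_loss(real_a, cycled_a, weight_map, lam_cyc=10.0):
    """Cycle-consistency (L1) loss down-weighted near the pattern border:
    weight_map is ~0 near the border and ~1 far away from it."""
    per_pixel = torch.abs(real_a - cycled_a)
    return lam_cyc * (weight_map * per_pixel).mean()

def weighted_adv_loss(disc_out, weight_map):
    """LSGAN-style generator loss for A->B, multiplied with the inverse
    weight map so that realism is emphasized close to the border."""
    per_pixel = (disc_out - torch.ones_like(disc_out)) ** 2
    return ((1.0 - weight_map) * per_pixel).mean()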

Standardization

Because the TCGA data originated from different labs, differences in staining between the slides were present. In order for the model to make the output look real, it had to focus both on transforming the colours over the whole image to a single consistent staining and on transforming the image around the border to make the border look realistic. To make the model focus primarily on transforming the image around the border, models were trained on standardized data. The standardization technique that was used was the one from Bejnordi et al. (2016). As a template image, an image from LabPON was used, which is presented in Figure 4.7. During training, the value of $\lambda_{cyc}$ for domain A was lowered from 10 to 1 and for domain B from


Figure 4.7: Template image that was used for the standardization of data
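The full standardization method of Bejnordi et al. (2016) is too involved for a short example. As a rough illustration of template-based colour standardization, a simpler Reinhard-style colour transfer is sketched below; this is a swapped-in stand-in, not the method actually used:

import numpy as np
from skimage import color

def template_normalize(image, template):
    """Match the per-channel mean and std of an RGB image to a template in
    LAB space. A rough stand-in, not the Bejnordi et al. (2016) method."""
    img = color.rgb2lab(image)
    tpl = color.rgb2lab(template)
    for c in range(3):
        mu_i, sd_i = img[..., c].mean(), img[..., c].std()
        mu_t, sd_t = tpl[..., c].mean(), tpl[..., c].std()
        img[..., c] = (img[..., c] - mu_i) / (sd_i + 1e-8) * sd_t + mu_t
    rgb = color.lab2rgb(img)  # float in [0, 1]
    return (np.clip(rgb, 0.0, 1.0) * 255).astype(np.uint8)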

4.2.4 EdgeConnect

Default parameters

Code3 from the original study was published online and was freely available for research purposes.

Both the edge model and the inpainting model were trained using narrow filters, like the narrow weight maps presented in Figure 4.6c. At the locations where the weight maps had values < 1, the image was made blank. This resulted in input images as presented in Figure 4.8.

Figure 4.8: Example input of the EdgeConnect architecture.
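The masking step itself is straightforward. A sketch, assuming the blank region is filled with white as it appears in Figure 4.8 (the fill value and names are our assumptions):

import numpy as np

def blank_border_region(image, weight_map):
    """Blank the region around the pattern border so that the inpainting
    model has to fill it in; weight_map < 1 marks that region."""
    missing = weight_map < 1.0
    masked = image.copy()
    masked[missing] = 255  # white-out the border region
    return masked, missing  # EdgeConnect also consumes the binary mask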


4.3 Results

4.3.1 GP-GAN

Default parameters

Learning curves of the L2 loss and the adversarial loss of the system are presented in Figure 4.10. As can be seen, the L2 loss clearly decreases as the number of epochs grows. The adversarial loss, however, remains roughly the same throughout training, which illustrates that the discriminator is seemingly not learning. Example output of the system is presented in Figure 4.9. In the images, a clear border between the two types of tissue is visible. Furthermore, there is a difference in colour between the two types of growth patterns, which makes the border even more distinct.

Figure 4.9: Example output of the GP-GAN architecture trained with default parameters

Figure 4.10: Learning curves of the generator of the GP-GAN architecture trained with default parameters: (a) L2 loss, (b) adversarial loss

Modifications

L2 loss The modification of basing the L2 loss on the difference between the copypasted image and the output image was tested using different values for $\lambda_{L2}$, namely 0.999 (default), 0.99, 0.8, 0.6, 0.4 and 0.2. Examples of output of the system using the same input images are presented in Figure 4.11. As $\lambda_{L2}$ decreases, the architecture seems to compensate for the differences in staining between the two source images, as an effect of the stronger pressure for the image to look realistic. A negative effect is that certain brighter spots are introduced in the images, which are in turn unrealistic.


Figure 4.11: Results of GP-GAN with the L2 loss based on the difference between the copypasted image and the output image, for (a) $\lambda_{L2}=0.999$, (b) $\lambda_{L2}=0.99$, (c) $\lambda_{L2}=0.8$, (d) $\lambda_{L2}=0.6$, (e) $\lambda_{L2}=0.4$, (f) $\lambda_{L2}=0.2$

Exclude Poisson constraint Experiments were conducted in which the L2 loss was calculated based on the difference between the copypasted images and the output images and the Poisson constraint was removed from the system. Accordingly, the input and output size of the generator model were set to 256x256 pixels and the number of hidden units in the bottleneck was lowered from 4000 to 2000. Multiple experiments with different values for $\lambda_{L2}$ were then conducted, of which example results are presented in Figure 4.12. In the example from the experiment with $\lambda_{L2} = 0.999$ a clear rectangle is visible. Furthermore, the image is very blurry, which is a logical result of using L2 loss (Pathak et al., 2016). A lower value for $\lambda_{L2}$ resulted in images that were somewhat less blurry, due to the greater role of the adversarial loss in the total loss function. A further decrease of $\lambda_{L2}$ to 0.6 and lower resulted in very blurry images, as can be seen in Figure 4.12c.

Figure 4.12: Results of GP-GAN with the L2 loss based on the difference between the copypasted image and the output image and without the Poisson constraint in the system, for (a) $\lambda_{L2}=0.999$, (b) $\lambda_{L2}=0.99$, (c) $\lambda_{L2}=0.6$

4.3.2 CycleGAN

Default parameters

As a first experiment, a CycleGAN model was trained using the default parameters as published. Two examples of output of the system are presented in Figure 4.13. The network seems to standardize the staining in the images, but does not transform the structures in the images, which is why the location of the border in the image is obvious. Furthermore, the network seems to over-generalize by coloring white parts of the source images pink.

Modifications

Loss and weight maps In further experiments, wide and narrow weight maps were introduced that were multiplied with the cycle-consistency loss and the adversarial loss. Furthermore the
