
Master’s Thesis

Radboud University Nijmegen

AI-assisted PD-L1 Scoring in Non-Small-Cell Lung Cancer

Author:

Tristan Payer

s4493109

Supervisor:

Dr. Francesco Ciompi

Second reader:

Dr. Tim Kietzmann

August 17, 2020


I would like to thank my supervisor Francesco Ciompi. I would also like to thank all the staff and PhD students at the Radboud UMC computational pathology group of the pathology department. A special thanks goes to Witali Aswolinskiy who was a great help in developing the new strategy for the U-net.


Contents

1 Introduction

2 Background
2.1 Medical background
2.1.1 PD-1/PD-L1 pathway
2.1.2 Tumor Proportion Score
2.2 Technical Background
2.2.1 Artificial Neural Networks
2.2.2 Convolutional Neural Networks
2.2.3 Fully convolutional networks
2.2.4 U-net
2.2.5 Localization-based Counting FCN
2.3 Related work

3 Data

4 Experiments
4.1 Data augmentation
4.2 FCNs
4.3 U-net
4.4 LCFCN

5 Evaluation
5.1 NatureNet
5.2 DenseNet
5.3 U-net

6 Discussion
6.1 Difficulties
6.2 Demonstrator
6.3 Future Research
6.4 Summary

Appendices
A NatureNet configuration
B DenseNet configuration
C Original image


1 Introduction

With 1.6 million deaths out of 1.82 million cases in 2012, lung cancer is the leading cause of cancer-related death worldwide [31]. Lung cancer can be divided into two main sub-types: small cell lung cancer (SCLC) and non-small-cell lung cancer (NSCLC), with NSCLC making up about 85% of all cases. Smoking, air pollution, lack of exercise and genetic makeup are among the main risk factors for lung cancer [21].

For a long time chemotherapy has been the only line of treatment for NSCLC patients. Immunotherapy is a novel range of treatments that aims to strengthen the body's own immune system so that it can successfully fight the cancer. Immune checkpoint inhibitors are one way of doing this. Immune checkpoints form a natural part of the immune system: they regulate the immune response and prevent parts of the immune system from attacking the body's own tissue. These immunotherapy drugs target the pathways tumors use to escape detection by the immune system; they work by preventing the deactivation of T cells [8]. Currently there are at least three checkpoint inhibitors (pembrolizumab, nivolumab and atezolizumab) that are approved for selected treatment in patients [4].

A more detailed explanation of this can be found in the medical background section.

Checkpoint inhibitory immunotherapy has shown great results in some patients. Unfortunately, not all patients respond to this treatment, so accurately predicting which patients will respond well to the therapy is a vital task. The therapy is used in late-stage cancer patients that would otherwise not have long to live. It is therefore very important to also identify the patients that would not benefit from immunotherapy, so that they can receive other treatment options that are better suited to their case. Treatment options may include the combination of different chemotherapy drugs, possibly with the addition of radiation therapy. Targeted therapies have also shown promising results and have been approved by the European Medicines Agency (EMA) and/or the U.S. Food and Drug Administration (FDA). Targeted therapy drugs have been developed to target specific genetic mutations in the tumor cells and are thus also patient specific. Targeted therapy is often used in combination with chemotherapy drugs. Another promising option is the combination of immunotherapy with chemotherapy [20]. Patient selection for immunotherapy will play an important role in the future, and PD-L1 expression will be an important factor in this decision [37].

As the therapy is quite severe it can come with multiple different side effects, and it is important to minimize the suffering of these patients. The drugs can induce immune-related adverse events (irAEs), which can affect the skin, gastrointestinal tract, liver and other organs. There have been reports of severe and even life-threatening cases. Because side effects can start asymptomatically or with very few symptoms, intensive monitoring of the patient's wellbeing is important. Severe side effects have been found in about 5-10% of patients treated with either pembrolizumab or nivolumab, and in 5% of cases the treatment has to be stopped. Side effects can also occur late after the therapy has started and in some cases even after the therapy has been finished [12].

Another reason for the need of a good efficacy prediction is the high cost of the therapy. Costs can amount to more than €100,000 per patient per year [7]. This puts an enormous burden on the healthcare system and the insurance companies. It is estimated that in Europe alone 199 billion euros were spent on cancer treatment in the year 2018. This again shows the need for this money to be spent resourcefully. Good assessment of biomarkers, like PD-L1 expression, can help to achieve this.

Furthermore, there is also the burden put on the patients, who must come to the hospital for therapy and the checks that go along with it. This in turn also means that a lot of different medical professionals are involved in the treatment process. They only have limited time and resources to spend, which makes it important that those are spent on therapies that are likely to have a positive effect on the wellbeing of the patient.

At the moment PD-L1 expression is the only biomarker that is used to estimate the efficacy of treatment with pembrolizumab. Examples of what PD-L1 expression might look like can be found in the “Data” section in figures 6 and 7. The amount of PD-L1 expression is measured with the Tumor Proportion Score (TPS). In short this is the ratio between cells that show membrane staining and cells that do not. Before treatment with pembrolizumab it is required that the TPS is estimated using the Dako 22C3 pharmDx assay [4] or a similar assay that uses different antibodies. Estimating the TPS is a difficult and time-consuming task for pathologists, which of course adds to the overall cost of the therapy. Furthermore, there is some variability in the estimations that pathologists report [5]. This and the reasons named above lead to the need for a reliable, robust automated method to estimate the TPS. In this work I present three conceptually different neural networks that aim to automatically classify and locate PD-L1 positive and negative tumor and immune cells.

This work was also part of the Radboud AI for Health lab, which is an Innovation Center for Artificial Intelligence (ICAI) lab. The ICAI national network aims to bridge the gap between academia and industry for new technological development in the field of artificial intelligence (AI). Currently ICAI has 23 partners in the Netherlands. The AI for Health lab offers new project opportunities for students as well as new PhD positions. This project is a first pilot project for AI-assisted PD-L1 scoring in lung cancer. As part of this project the tested models have been integrated in a web interface. This allows people with no prior knowledge of AI to apply the models to new data [36, 35].


2 Background

The background of this work is divided into two parts. The first part covers the medical details; in this part I will briefly explain how immunotherapy works and why it is important to recognize PD-L1 positive cells. The second part of the background is the technical part, where I will explain how the neural networks that are used work.

2.1 Medical background

2.1.1 PD-1/PD-L1 pathway

Evasion of the host immune response is one of the six hallmarks of cancer [11]. T cells are a type of lymphocyte and play an important role in cellular immune responses. Like other lymphocytes, T cells are created in the bone marrow. They then migrate to the thymus where their T cell receptor is developed. This allows cytotoxic T cells to detect and kill cells that can harm the organism, including virus-infected cells and cancer cells [3].

Programmed cell death 1 (PD-1) is a transmembrane receptor found on active T cells. Programmed cell death ligand 1 (PD-L1) is the ligand of the PD-1 receptor. It can often be found on the surface of tumor cells, including in non-small-cell lung cancer (NSCLC). When the PD-L1 ligand of a tumor cell interacts with the PD-1 receptor on a T cell, the T cell becomes inactivated. This means that it is no longer capable of fighting tumor cells. Immunotherapy aims to block this PD-1/PD-L1 interaction using monoclonal antibodies. This interaction can also be seen in figure 1. In the left panel the tumor cell's PD-L1 ligand binds to the T cell's PD-1 receptor and deactivates it. In the right panel this interaction is blocked by an immune checkpoint inhibitor (anti-PD-L1 or anti-PD-1). The T cell is not deactivated and can kill the tumor cell [23].

In about 61% of patients with advanced NSCLC more than 1% of tumor cells show PD-L1 expression. This is defined as being PD-L1 positive. About 23% of patients show PD-L1 expression in more than 50% of tumor cells, meaning that they are strongly PD-L1 positive [5].

Multiple studies have shown evidence that a high number of PD-L1 stained cells correlates with improved efficacy of immunotherapy [10, 6, 30].

2.1.2 Tumor Proportion Score

The examination of individual cells or tissue samples is done by pathologists on histopathology slides. To determine which cells are PD-L1 positive or negative a tumor sample is needed. This is taken from a tumor via a biopsy. The sample is then embedded in a paraffin block. This block is cut into very thin slices with a thickness of 4–5 µm. The sample is then stained; the staining is needed to increase the contrast of the cells or make certain structures visible [2]. To increase the robustness of these slices they are put on a glass slide. From there they can be examined by a pathologist under a light microscope or scanned with a fast high-resolution scanner. Examples of slides and tissue can be seen in the “Data” section in figures 5, 6 and 7.

In a clinical setting the amount of PD-L1 expression needs to be quantified. This is done with the so-called Tumor Proportion Score (TPS). Different values of this score can then be used as cutoff levels that determine the further treatment a patient will receive.


Figure 1: T cell interacts with tumor cell

TPS = \frac{\#\,\text{PD-L1 positive tumor cells}}{\#\,\text{PD-L1 positive tumor cells} + \#\,\text{PD-L1 negative tumor cells}} \times 100\%

To determine this TPS the pathologist first looks at the biopsy slide at low magnification and determines tumor areas that have at least 100 stained tumor cells. Because not all cells are stained equally, staining is rated on a scale from 0 to 3+. A rating of 0 means that no staining is shown, while a rating of 3+ means that the staining shows strong intensity. All cells that have a staining higher than 1+ are counted for the evaluation. After areas with at least 100 stained tumor cells have been identified, the pathologist zooms in on those areas for closer inspection. The cells are then inspected closely and counted to determine the TPS.

The course of further treatment is determined by the previous treatments the patient received and the value of the TPS. There are three cutoff ranges for TPS values. The cutoff values and the following procedure vary between different staining agents and the actual product that is used for immunotherapy. For KEYTRUDA (pembrolizumab) the cutoff points are at TPS < 1%, TPS 1–49% and TPS ≥ 50%. Patients that show a TPS smaller than 1% are not indicated for treatment. If the patient shows a TPS between 1% and 49% they are indicated for treatment, depending on the previous treatment they have received. Patients that show a TPS of 50% or more are indicated for treatment even if they have not received previous treatment [1]. Of course all decisions are made by an oncologist that knows the patient very well.
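As a plain-code illustration, the scoring and cutoff logic described above can be summarized as follows. This is a minimal sketch, not a clinical rule: the function names and the exact handling of the 1–49% range are my own simplification.

```python
def tps(n_positive_tumor_cells: int, n_negative_tumor_cells: int) -> float:
    """Tumor Proportion Score as a percentage of all counted tumor cells."""
    total = n_positive_tumor_cells + n_negative_tumor_cells
    if total < 100:
        raise ValueError("TPS assessment requires at least 100 tumor cells")
    return 100.0 * n_positive_tumor_cells / total

def pembrolizumab_indication(tps_percent: float, previously_treated: bool) -> str:
    """Map a TPS value to the KEYTRUDA cutoff ranges described above."""
    if tps_percent < 1.0:
        return "not indicated"
    if tps_percent < 50.0:
        # 1-49%: indication depends on the previous treatment received
        return "indicated" if previously_treated else "discuss alternatives"
    return "indicated, even without previous treatment"

print(pembrolizumab_indication(tps(64, 58), previously_treated=True))
```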

At the moment PD-L1 expression in tumor cells is the only biomarker used to assess the efficacy of immunotherapy. While the tumor proportion score (TPS) is an important biomarker for the efficacy of the therapy, it is far from perfect. Not all patients that have a high TPS respond well to the therapy, and on the other hand not all patients that have a low TPS have a negative response to the therapy. This could indicate multiple things. One is that TPS, and therefore PD-L1 expression, is not as good a predictor as it is currently thought to be. Not all immune checkpoint inhibitory drugs require PD-L1 testing. Another possibility is that the scoring by hand is too subjective to give a reliable enough TPS. Previous studies have indicated that the PD-L1 assessment of pathologists varies [5, 32]. Especially around the cutoff values this is a big problem, as it determines the further treatment of patients. Differences in pathologists' judgements could lead to non-approval for treatment with pembrolizumab in certain patients [33]. Furthermore, the use of different assays for the tissue staining can lead to different outcomes in pathologists' judgements. This again could lead to the approval or non-approval of patients, which might mean that they will not receive the ideal treatment [32]. A third possibility is that the location of the cells with respect to the tumor and other structures plays an important role that is currently not investigated enough [8]. In addition to the medical needs for accurate PD-L1 judgement there is also some financial motivation. Pembrolizumab requires PD-L1 testing before a possible administration. Nivolumab, a similar immunotherapy drug by a competing pharmaceutical company, does not require previous PD-L1 assessment. The use of one over the other drug is currently debated. However, the availability of PD-L1 expression testing often influences which of the two is used [33]. Therefore, robust and reliable PD-1/PD-L1 detection methods are needed.

2.2 Technical Background

2.2.1 Artificial Neural Networks

Artificial neural networks are a type of machine learning classifier inspired by biological neural networks. They usually consist of multiple layers that are made up of multiple neurons. In a fully connected network, each neuron in a layer is connected to each neuron of the previous layer. Each neuron receives input values x and each of the connections has a weight w. The output of a single neuron can be described by:

y = \sigma\left(\sum_{i=0}^{m} w_i x_i + b\right)

where m is the number of neurons in the previous layer, b is a bias term and σ is an activation function.
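As a minimal sketch (not from the thesis), the formula above can be written in NumPy, with a sigmoid chosen here as an example activation function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b):
    """y = sigma(sum_i w_i * x_i + b) for a single fully connected neuron."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # activations of the previous layer
w = np.array([0.1, 0.4, -0.2])   # one weight per incoming connection
print(neuron_output(x, w, b=0.05))
```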

Usually the first layer in a network is the input layer that directly receives its activations from the data. The last layer is the output layer that gives the result of the classification. The layers in between the input and output layer are called hidden layers. The number of hidden layers can vary. If many hidden layers are used the network is also called a deep network, hence the term deep learning.


2.2.2 Convolutional Neural Networks

Figure 2: Visualization of a convolution filter. [22]

Neural networks have also shown great results in the domain of image analysis and image classification. Probably the most important building block behind this improvement is the convolutional layer. To understand convolutions in the domain of image analysis we first have to look at how an image is represented. We can think of an image as a two-dimensional grid where each pixel corresponds to one location in the grid. This image is convolved with a small kernel. One can think of this process as sliding the kernel over each pixel in the image. The output of this operation is a weighted sum for each location the kernel has been slid over. Similar techniques are used in image editing, for example to blur an image or detect edges. By learning the right weights for the kernel the network can detect features in the input image. Because the kernel is slid over the whole image the feature detection becomes location invariant. This means that the network will respond to learned features regardless of where they are located in the input image [24].
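A naive NumPy sketch of this sliding-window operation (strictly speaking cross-correlation, as commonly implemented in CNN libraries; the edge-detection kernel is only an illustrative choice, not from the thesis):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image; each output pixel is the weighted
    sum of the covered region ('valid' mode, i.e. no padding)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1., 0., -1.]] * 3)  # simple vertical edge detector
print(conv2d_valid(image, edge_kernel))
```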

Another important building block of convolutional networks is the pooling layer. This layer combines values of neighboring pixels. This combination increases the size of the receptive field for the next convolution layer. Usually a CNN consists of two parts. The first one is the convolutional part that extracts important features from the image. The second part is a standard feed-forward network. This part of the network receives the extracted features from the convolutional part as input and uses them to perform the actual classification task [18].

Furthermore, the convolution operation can be efficiently implemented as matrix multiplication. Because matrix multiplications consist of many simple multiplications that can be done in parallel, the computations can be performed efficiently on graphics cards (GPUs). The output of a convolution can be described as follows [24]:

y = \sigma(K \ast x)

where K is the convolution kernel. As a small toy example this would look something like this:

K \ast x = \begin{pmatrix} k_1 & k_2 \\ k_3 & k_4 \end{pmatrix} \ast \begin{pmatrix} x_1 & x_2 & x_3 \\ x_4 & x_5 & x_6 \\ x_7 & x_8 & x_9 \end{pmatrix}

This operation can be reshaped to form the matrix multiplication [26]:

K \ast x = \begin{pmatrix} k_1 & k_2 & 0 & k_3 & k_4 & 0 & 0 & 0 & 0 \\ 0 & k_1 & k_2 & 0 & k_3 & k_4 & 0 & 0 & 0 \\ 0 & 0 & 0 & k_1 & k_2 & 0 & k_3 & k_4 & 0 \\ 0 & 0 & 0 & 0 & k_1 & k_2 & 0 & k_3 & k_4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \\ x_9 \end{pmatrix}
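A small NumPy check (my own illustration, not from the thesis) that the matrix form above produces the same result as sliding the 2x2 kernel directly over the 3x3 image:

```python
import numpy as np

k = np.array([[1., 2.],
              [3., 4.]])               # k1..k4
x = np.arange(1., 10.).reshape(3, 3)   # x1..x9

# Build the 4x9 matrix from the text: each row computes one output pixel.
M = np.zeros((4, 9))
for i in range(2):          # output row
    for j in range(2):      # output column
        window = np.zeros((3, 3))
        window[i:i+2, j:j+2] = k
        M[i * 2 + j] = window.ravel()

as_matmul = (M @ x.ravel()).reshape(2, 2)

# Same result by sliding the kernel directly:
direct = np.array([[np.sum(k * x[i:i+2, j:j+2]) for j in range(2)]
                   for i in range(2)])
assert np.allclose(as_matmul, direct)
print(as_matmul)
```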


2.2.3 Fully convolutional networks

As mentioned before, "standard" convolutional networks are good for image classification. But for some image analysis tasks it is important to get a semantic segmentation of the image. This means that not only the whole image is labeled, but every pixel belonging to the desired object is labeled. Fully convolutional networks (FCN or FCNN) are a computationally efficient way to achieve this. Semantic segmentation not only solves the problem of what is shown in an image but also where an object is located in a given input image and what shape and size it has.

As mentioned before, a classical convolutional network usually consists of a convolutional part and a fully connected part. In fully convolutional networks the fully connected part is replaced by convolutional layers. This can be done because a fully connected layer can also be seen as a convolutional layer whose kernel covers the complete input. This also makes the network independent of the input size. To do this the fully connected layers are replaced by convolutional layers, where the first convolutional layer has a kernel size equal to the feature map size (output) of the previous layer. These layers are then followed by a 1x1 convolution layer. The number of channels of this convolution layer is equal to the number of classes the network will be able to predict [19].
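A PyTorch sketch of this conversion under assumed layer sizes (the thesis does not specify its framework or exact dimensions). Note how the same network yields a single prediction for a training-sized input and a spatial map for a larger one:

```python
import torch
import torch.nn as nn

n_classes = 5  # four cell classes plus "Rest", matching this thesis' dataset

# Hypothetical feature extractor: 64 channels at 8x8 for a 32x32 input.
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

# Instead of Flatten + Linear, use a conv whose kernel covers the whole
# 8x8 feature map, followed by a 1x1 conv with one channel per class.
head = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=8), nn.ReLU(),
    nn.Conv2d(128, n_classes, kernel_size=1),
)

print(head(features(torch.randn(1, 3, 32, 32))).shape)  # -> (1, 5, 1, 1)
print(head(features(torch.randn(1, 3, 64, 64))).shape)  # -> (1, 5, 9, 9)
```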

2.2.4 U-net

U-net is a different fully convolutional network architecture that has achieved great results in biomedical image segmentation tasks. The network consists of two pathways: data first flows through the contracting pathway and then through the expanding pathway. The contracting pathway is very similar to a vanilla FCN. It consists of various blocks, each built up of two 3x3 convolutional layers followed by a pooling layer. Usually the number of feature channels is doubled for every block in the network. The expanding pathway consists of blocks that are built up of upsampling and up-convolution layers. Here the number of feature channels is halved again for each block. Usually in FCNs there is a tradeoff between localization accuracy and the use of more context. Giving the network more context means giving the network bigger patches, and therefore more pooling layers are needed. This however reduces the localization accuracy. If smaller patches and fewer pooling layers are used, the receptive field of the network loses context. To solve this problem, features that have been extracted in “earlier” layers in the contracting pathway are fed to “later” layers in the expanding pathway. This allows the network to combine high-level features with a bigger context.
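A minimal PyTorch sketch (my own, with assumed channel counts) of one expanding-pathway step, showing the skip connection that concatenates features from the contracting pathway:

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One step of the expanding pathway: upsample, halve the channels,
    then concatenate the matching contracting-pathway feature map."""
    def __init__(self, in_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        # After concatenation the channel count is back at in_ch.
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, in_ch // 2, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch // 2, in_ch // 2, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # e.g. (N, 128, 64, 64)
        x = torch.cat([skip, x], dim=1)  # skip from the contracting path
        return self.conv(x)

block = UpBlock(in_ch=256)
deep = torch.randn(1, 256, 32, 32)       # output of the deeper layer
skip = torch.randn(1, 128, 64, 64)       # features from the contracting path
print(block(deep, skip).shape)           # -> (1, 128, 64, 64)
```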

An overview of the architecture can be seen in figure 3. One difference between the U-net architecture and the classical FCN architecture is that the U-net needs segmentations as training labels, while the FCN only needs labeled image patches without segmentation [27].

Figure 3: U-net architecture

2.2.5 Localization-based Counting FCN

Another way to look at the problem is to see it as an object detection problem. The cells could then be counted simply based on the number of instances that are detected per class. A novel method for this is the Localization-based Counting FCN (LCFCN). It achieved state-of-the-art results on many object detection benchmark datasets like Pascal VOC and the Penguins dataset. The authors argue that for object counting, segmentation or detection models often perform worse than counting models based on regression. A downside of these regression-based counting models is that they directly output the number of instances in a scene; object localization is not possible with these kinds of models. The authors say that the segmentation and object detection models perform worse because they have to learn the exact shape and size of the objects that should be detected. Therefore, they introduced a novel loss function that allows an FCN to predict the rough location of an object but not its exact shape and size. The output of the network is a “blob”. This blob is not an exact segmentation of the object but it is at the right location.

Figure 4: Results of the LCFCN on the penguins dataset

What is meant by this can best be seen in figure 4. The penguins are not exactly segmented but the blobs clearly show the location of each penguin. To count them one can simply count the number of blobs. To train the network, point-level annotations are needed. This means that in the training data set each target object is annotated with a single dot. The blob is then shaped around this dot.

The loss function that is introduced here consists of four parts. The image-level loss makes sure that for each target annotation at least one “blob pixel” is predicted. The point-level loss makes sure that at least the annotated pixel is correctly predicted. The split-level loss makes sure that each predicted blob only contains a single annotation. The false-positive loss makes sure that no blobs without an annotation are predicted. The final loss function is the sum of those four parts. For the network backbone any FCN can be used. For the actual blob predictions an upsampling part is added [17].
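The sketch below (my own simplification, not the authors' published code) illustrates how such a composite loss can be assembled: the split-level loss is omitted and the false-positive loss is approximated via connected components.

```python
import torch
import torch.nn.functional as F
from scipy import ndimage

def lcfcn_loss_sketch(logits: torch.Tensor, points: torch.Tensor):
    """logits: (C, H, W) class scores, class 0 = background.
    points: (H, W) integer map, 0 = no annotation, c > 0 = dot of class c.
    Assumes at least one dot per patch (as in this thesis' patches)."""
    probs = logits.softmax(dim=0)
    loss = torch.zeros(())
    present = points.unique()
    # 1) Image-level loss: every annotated class must fire somewhere,
    #    classes without any dot must fire nowhere.
    for c in range(1, logits.shape[0]):
        max_p = probs[c].max()
        loss = loss - (torch.log(max_p) if c in present else torch.log(1 - max_p))
    # 2) Point-level loss: cross-entropy at the annotated pixels only.
    ys, xs = (points > 0).nonzero(as_tuple=True)
    loss = loss + F.cross_entropy(logits[:, ys, xs].T, points[ys, xs].long())
    # 3) False-positive loss: predicted blobs containing no dot are pushed
    #    towards background (blobs found via connected components).
    pred = probs.argmax(dim=0)
    blobs, n_blobs = ndimage.label((pred > 0).numpy())
    for b in range(1, n_blobs + 1):
        mask = torch.from_numpy(blobs == b)
        if points[mask].max() == 0:   # blob without any annotation
            loss = loss - torch.log(probs[0][mask]).mean()
    # 4) The split-level loss, which separates blobs containing more than
    #    one dot, is omitted in this sketch.
    return loss
```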

This model seems well suited to the given kind of data set. The output of the network is sufficient for the localization and counting of the different cells.


2.3 Related work

While there is a substantial amount of work on digital image analysis for pathology, relatively little research has been done on the detection of PD-L1 positive cells.

[29] compared the automated Optra image analysis with the ratings of three pathologists from different institutes. The pathologists were instructed to label tumor cells as well as immune cells according to the amount of membrane staining they show.

For the machine learning algorithm a subset of 30 WSIs was selected. The algorithm was trained on fields of view (FOV), where each slide consists of 30 to 800 FOVs. The algorithm works in an iterative manner. First, viable cells are identified based on intensity and morphology-based features. Positive cells are then identified based on intensity values in the red channel. Finally, cells are classified into tumor and immune cells based on characteristics like shape, size, intensity and other texture-based features. The features were further refined by pathologists in an iterative manner after evaluation on a validation set.

The algorithm achieved an area under the receiver operating characteristic curve of 0.8 for detecting PD-L1 positivity in tumor cells and 0.7 in immune cells. Unfortunately, because this is a commercial product there is no explanation of the actual algorithm that was used to achieve those results.

[16] used the HALO™ multiplex IHC version 1.1 base algorithm for PD-L1 assessment in melanoma. The algorithm is not further described, but it is mentioned that it used color deconvolution. Their algorithm achieved very high concordance with the ratings of two pathologists. However, one has to note that those results are only for melanoma cells. No differentiation between tumor and immune cells has been made. Also, this is not for a specific cutoff for tumor cell positivity as in the study by [29].

One of the first methods that use deep learning to detect PD-L1 positive cells was developed by [15]. They used an Auxiliary Classifier Generative Adversarial Network (AC-GAN) for this task. In a GAN two networks are trained to outperform each other. One network is the generator that creates a fake image. The other network is the discriminator that learns to discriminate between real and generated images. The output of the discriminator is then used to improve the generator until the discriminator can no longer discriminate between real and generated images. Traditionally this architecture can only be used to generate images. In the AC-GAN however additional information is used in order to also create a classifier. First, the generator also receives a one-hot encoded vector containing a class label. This means that the generator will learn to not only create images but create images belonging to a certain class. The discriminator on the other hand also contains an auxiliary classifier that tries to predict the label of the generated image. This means that the discriminator learns to discriminate between real and fake images while at the same time also training a classifier that can predict the different classes.

In order to make the tissue samples visible they have to be stained with a staining agent. Cytokeratin (CK) is a commonly used staining agent. This staining is often used to identify tumor regions in the samples. However, samples stained in the CK domain do not show whether there are PD-L1 proteins in the cell membrane. In order to make those easily visible another staining agent is needed. Because it is hard for a pathologist to detect tumor regions in the PD-L1 stained domain, obtaining labeled samples is hard and costly. Unfortunately, machine learning algorithms still need quite a large number of samples. In order to overcome the problem of not having enough training images, [14] used a generative adversarial network (GAN) to create unpaired image-to-image translations from the CK to the PD-L1 domain while at the same time training a classifier. Their Domain Adaptation and Segmentation GAN (DASGAN) allows them to train end-to-end a single network that performs the image translation from CK to PD-L1 while at the same time learning to segment the PD-L1 positive regions. The DASGAN architecture is based on the architecture of the CycleGAN. In this architecture two generators are trained: one, G_AB, creates image translations from domain A (CK) to domain B (PD-L1), and the other, G_BA, creates translations from B to A. For each generator there is also a discriminator that learns to differentiate between real and created images. Because the CycleGAN works with unpaired images there are no matching image pairs that can be used to compute the loss. The cycle loss that is used here is based on the invertibility of the created images: the images that have been created by G_AB are translated back to their original domain by the G_BA generator and vice versa. Similar to the auxiliary classifier GAN (AC-GAN), the discriminators are not only trained to differentiate between the real and created images but are additionally trained to perform a classification task. In this case the classification task is not just classification but image segmentation.
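A generic sketch of such a cycle-consistency term (a common CycleGAN formulation, not necessarily the DASGAN's exact loss; G_AB and G_BA stand for the two generators):

```python
import torch

def cycle_loss(G_AB, G_BA, real_A, real_B):
    """Cycle consistency: translating to the other domain and back should
    reproduce the input image (L1 distance on both directions)."""
    rec_A = G_BA(G_AB(real_A))   # A -> B -> A
    rec_B = G_AB(G_BA(real_B))   # B -> A -> B
    return (rec_A - real_A).abs().mean() + (rec_B - real_B).abs().mean()
```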

With a low number of manual annotations the DASGAN performed better than other networks. This holds both for networks that have only been trained on real or synthetic images and for two-step models.


3 Data

The digital analysis of pathology samples has been made possible by the emergence of gigapixel whole-slide images (WSI). In recent years the pathology field has shifted more and more towards using these digital WSIs. They are used for diagnostic as well as research applications. The process of generating the digital images aims to digitize light microscopy. Tissue samples on regular glass slides are scanned by a digital high-resolution scanner. In the scanner the glass slide with the tissue sample is automatically moved so that a camera with a microscopic lens can take images of all parts of the sample. Those images are later combined into a single big image and saved in the lossless .tiff image format. Because of the high resolution a single image needs a lot of memory to be stored; in this case a single image can be between three and six gigabytes in size. In order to keep the images viewable by a human observer on a modern computer they are saved as a multi-resolution image pyramid. This means that the image file consists of multiple images for different zoom levels, which also resembles the zoom levels of a traditional light microscope. When the image is loaded in a special viewing software, the software only has to load the zoom level and image region that is currently viewed instead of the whole file at once. This allows for smooth zooming and panning without the need for extremely powerful computational resources [9]. In my case the highest resolution provides a 20 times optical magnification. The annotations for the images are saved in a separate XML (Extensible Markup Language) file.
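For illustration, this is how such an image pyramid can be accessed with the OpenSlide library (the thesis does not name the library it used; the file name and coordinates below are placeholders):

```python
import openslide

slide = openslide.OpenSlide("slide.tiff")   # hypothetical file name
print(slide.level_count)        # number of zoom levels in the pyramid
print(slide.level_dimensions)   # (width, height) per level
print(slide.level_downsamples)  # downsample factor per level

# Read a 512x512 region at the highest resolution (level 0). The location
# is given in level-0 coordinates; only the requested tile is loaded from
# disk, not the whole multi-gigabyte image.
region = slide.read_region(location=(30000, 42000), level=0, size=(512, 512))
region = region.convert("RGB")  # read_region returns an RGBA PIL image
```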

Figure 5 shows two samples of a WSI at a low magnification so that the whole tissue sample can be seen. The size of the tissue sample is about two centimeters from top to bottom and slightly less from left to right. The green and red dots are the annotations. As can be seen, only a very small portion of a slide is annotated. At this zoom level single cells cannot be identified. Figure 6 shows examples at a high-resolution zoom at which individual cells can be identified.

Figure 5: Two WSI samples

All the samples in the data set have been stained with the PD-L1 IHC 22C3 pharmDx kit [1]. This gives the cell cores a blue color. If the cell is PD-L1 positive the cell membrane is stained in a light to dark brown color. Figure 6 shows samples of the different cells that are labeled in the data set. All samples have been taken from level 0 in the image pyramid, the level with the highest (20 times) resolution. The top row shows samples of PD-L1 negative cells while the bottom row shows samples of PD-L1 positive cells. The yellow boxes that can be seen in some of the images are regions of interest (ROI) that are exhaustively labeled. The images clearly show the brown membrane staining for the positive cells. The cell core is stained in blue. The negative cells do not show the membrane staining. Not all cells have such a distinct membrane staining as the ones that are shown in these pictures. The pictures in figure 6 show only homogeneous regions where only one sort of cells is present. This is not representative for the whole data set. Other samples show regions that contain different cell types next to each other. Examples of such regions can be seen in figure 7. The area in figure 7b is zoomed out more to show more cells. As mentioned before, differentiating between the different cell types is no easy task. Data labeling has to be done by a trained person or a pathologist. That is why obtaining pixel-precise segmentations was not a feasible option for this project, as the cost and time involved in creating such a dataset would be much too high. A faster way to get annotations is to work with point annotations. This means that each labeled cell contains a single dot that marks the cell as belonging to a certain class. For this work the annotations were created by a medical student and later checked by a trained pathologist.

The dataset used in this project only contains labels for the cells the network should learn to detect. However, a WSI does not only consist of those cells. If the network never saw anything other than the labeled cells, it would predict everything it sees as one of those cells. Because in a real-world setting the network would also see other tissue, background labels are needed. To obtain such background samples, annotations for regions that do not contain cells have been created in each WSI. These annotations have not been created by the medical student that created the annotations for the cells.

Cells have been annotated in 29 different WSIs. 19 of those WSIs have been used for training, 5 have been used as a validation set and the other 5 have been used as a test set. The test set was only used for the final evaluation; the networks have never seen cells from the test set before this. The splits have been chosen randomly. Not all WSIs contain samples of all labels. This was taken into account when creating the splits, so that each split would contain samples of each label. In total 10,066 cells have been annotated. The distribution of labels and train splits can be seen in table 1. The labels contain PD-L1 positive tumor and immune cells as well as PD-L1 negative tumor and immune cells. In addition to this there is a “Rest” class for background structures that are not cells.

The FCNs need patches as input images. They were created by cropping a fixed-size image from the highest zoom level in the WSI with the annotation in the center.
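A sketch of this extraction step (the XML tag and attribute names follow a common annotation schema and are assumptions; the file names and the patch size are placeholders, as the thesis does not state its patch size here):

```python
import xml.etree.ElementTree as ET
import openslide

def load_points(xml_path):
    """Parse point annotations; tag/attribute names here are hypothetical,
    the real XML schema of the dataset is not documented in the thesis."""
    root = ET.parse(xml_path).getroot()
    for ann in root.iter("Annotation"):
        label = ann.get("PartOfGroup")
        for coord in ann.iter("Coordinate"):
            yield label, float(coord.get("X")), float(coord.get("Y"))

def centered_patch(slide, x, y, size=128):
    """Crop a fixed-size patch at level 0 with the annotation in the center."""
    half = size // 2
    patch = slide.read_region((int(x - half), int(y - half)), 0, (size, size))
    return patch.convert("RGB")

slide = openslide.OpenSlide("slide.tiff")           # hypothetical paths
patches = [centered_patch(slide, x, y)
           for label, x, y in load_points("slide.xml")]
```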


(a) PD-L1 negative immune cells

(b) PD-L1 negative tumor cells

(c) PD-L1 positive immune cells (d) PD-L1 positive tumor cells

Figure 6: Samples of the different cells in the dataset

              train   val   test
Tumor pos.      464   191    228
Tumor neg.     2047   545    478
Immune pos.    1607   122    163
Immune neg.     951    77    234
Rest           2336   259    364
Sum            7405  1194   1467

Table 1: This table shows the distribution of cells per label and training split

4 Experiments

This work mainly compares the segmentation performance of three different neural networks: two different architectures for “standard” fully convolutional networks and a U-net implementation. As mentioned in the background section, technically the U-net is also a fully convolutional network. However, the architecture differs a lot from the other two networks. One could argue that they are also conceptually different, and the sort of input data is different for the two kinds of neural networks. The FCNs can be trained on labeled patches. This means that the training data consists of a set of labeled image patches. The U-net needs segmented training images. This means that instead of a single label a segmentation map is needed. The segmentation map has the same size as the input image. Each pixel in the segmentation map denotes the class of the corresponding pixel in the input image. Because the different architectures work on different input data, the training and evaluation process is also different.

(a) Positive and negative immune cells (b) Positive and negative tumor cells

Figure 7: Examples of different cell types close together.

4.1 Data augmentation

As mentioned before, the WSIs have a very high resolution of more than 100,000 by 100,000 pixels. Because of this a neural network cannot be trained on a whole WSI at once. To get trainable images, smaller patches with a size that fits the network must be cropped from the WSI. Furthermore, the WSI only contains a few regions of interest with labeled data. This means that the vast majority of a WSI would only be “background” when training the network, so sampling the patches also helps to prevent extreme over-fitting on the background.

Neural networks generally perform better when more data is available. Unfortunately, the number of images that was available for this project is very limited. A small number of images often leads to the model over-fitting. This means that the model does not generalize well and will perform badly on images it has never seen before. There are some techniques, like normalization or dropout layers, that can be used in the model to reduce over-fitting. Image augmentation is another technique that has been shown to improve generalization performance. It works by artificially increasing the number of training images [34].

Classical image augmentation techniques like geometric and color augmentations have been used. With a probability of 60%, an image was augmented after being sampled from the training set. The augmentation technique was randomly chosen from a list of possible augmentations. For the geometric augmentations, horizontal and vertical flips as well as random rotations by up to 25 degrees were possible. For the color augmentations, gamma correction, logarithmic correction and intensity corrections were the possibilities to choose from. All color augmentations have been implemented using the popular scikit-image python library [25].
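A sketch of such a pipeline with scikit-image (the parameter ranges for the color corrections are my assumptions; the thesis states only the augmentation types, the 60% probability and the 25-degree rotation limit; patches are assumed to be float RGB arrays in [0, 1]):

```python
import random
from skimage import exposure, transform

def augment(patch):
    """With 60% probability apply one randomly chosen augmentation."""
    if random.random() >= 0.6:
        return patch
    choice = random.choice(["hflip", "vflip", "rotate", "gamma", "log"])
    if choice == "hflip":
        return patch[:, ::-1]                       # horizontal flip
    if choice == "vflip":
        return patch[::-1, :]                       # vertical flip
    if choice == "rotate":
        angle = random.uniform(-25, 25)             # up to 25 degrees
        return transform.rotate(patch, angle, mode="reflect")
    if choice == "gamma":
        return exposure.adjust_gamma(patch, gamma=random.uniform(0.7, 1.3))
    return exposure.adjust_log(patch, gain=random.uniform(0.8, 1.2))
```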


4.2 FCNs

This section explains the experimental setup for the two fully convolutional networks. The first one, “NatureNet”, is based on a simple convolutional architecture without too many fancy additions. The network consists of four convolution blocks. Each of those blocks consists of two 2D convolution layers, each followed by a batch normalization layer. The convolution layers in one block have the same hyper-parameter settings. After each convolution block a max pooling layer with a kernel of 2 and a stride of 2 is used. This effectively halves the image size. After each pooling layer the number of filters used by the convolution block is doubled. The kernel size is always kept at 3.
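In PyTorch such a block could look as follows. This is a sketch: the thesis does not state its framework, activation function or filter counts, so the ReLU and the channel numbers here are assumptions.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by batch normalization,
    then 2x2 max pooling, as described above."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),  # halves the image size
    )

# Four blocks, doubling the number of filters after each pooling layer.
blocks = nn.Sequential(*[conv_block(c_in, c_out) for c_in, c_out in
                         [(3, 32), (32, 64), (64, 128), (128, 256)]])
```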

As can be seen in table 1, the class distribution of the dataset is very uneven. In early experiments this led to over-fitting on the most prominent class in the dataset. To combat this problem a balanced sampling strategy was implemented: the batch generator was built to sample the same number of patches for each class. After the patches have been sampled, the image augmentation process described above is applied.
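A minimal sketch of such a balanced batch generator (my own illustration of the strategy, not the thesis' code):

```python
import random
from collections import defaultdict
import numpy as np

def balanced_batches(patches, labels, batch_size, n_classes=5):
    """Yield batches containing the same number of patches per class,
    regardless of how uneven the class distribution in the dataset is."""
    by_class = defaultdict(list)
    for patch, label in zip(patches, labels):
        by_class[label].append(patch)
    per_class = batch_size // n_classes
    while True:
        batch, batch_labels = [], []
        for label, pool in by_class.items():
            picks = random.choices(pool, k=per_class)  # sample with replacement
            batch.extend(picks)
            batch_labels.extend([label] * per_class)
        yield np.stack(batch), np.array(batch_labels)
```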

In order to train the best possible version of the NatureNet, experiments with many different hyper-parameter settings have been conducted. To keep track of the results and different settings the “weights and biases” (wandb) python library was used. This library has great integration with the most common python libraries used for the training of neural networks, but it can also be used framework-independently. Furthermore, it can help find hyper-parameters using Bayesian optimization. The configurations for this hyper-parameter search can be found in the appendix. As mentioned before, early in the process it became clear that the unbalanced dataset led to over-fitting on the most common class. Another way to combat this is to use class weights, so that samples of different classes get different weights in the loss function. Different settings for those class weights have been explored. Furthermore, different values for the depth of the network, the use of batch normalization layers in the network architecture, the number of filters created by the convolution layers, different probabilities for dropout layers, values for l2 normalization and the use of image augmentation or not have been tested. The ranges of values the parameters could take have been determined based on earlier experimentation with the network.

The Dense Convolutional Network (DenseNet) was trained on the same patches. For the hyper-parameter optimization of this network the use of a batch normalization layer, different numbers of filters created by the convolution layers, class weights, dropout probabilities and the use of image augmentation have been explored. In addition to that there are some architecture-specific parameters that can be explored: the use of initial or final pooling layers, the number of convolution layers per convolution block and the use of padding for the convolution layers.

The DenseNet also consists of blocks of convolution layers. A typical problem that occurs in deeper networks is the vanishing gradient problem. This means that the gradients get smaller and smaller the further they are passed on. This hinders learning because the parameter optimization is based on these gradients. One solution to this problem was the introduction of residual connections. The signal is passed to the next layer via an identity connection, so information from the original input can flow unchanged to deeper layers in the network. The DenseNet builds on this idea. But instead of only having a residual connection to the next layer, the DenseNet passes the information on to all the following layers in the block. An additional difference is that features that are passed on via the residual connection are not combined with the features from the convolution layer by addition. Instead, the features are concatenated [13].
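The key difference to a residual connection can be shown in a few lines of PyTorch (a sketch with an assumed growth rate, not the exact DenseNet configuration used in the thesis):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """In a dense block each layer receives the concatenated feature maps
    of all previous layers, instead of an additive residual connection."""
    def __init__(self, in_ch, growth=12):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(),
            nn.Conv2d(in_ch, growth, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.cat([x, self.conv(x)], dim=1)  # concatenate, don't add

x = torch.randn(1, 24, 32, 32)
for _ in range(3):                 # channels grow: 24 -> 36 -> 48 -> 60
    x = DenseLayer(x.shape[1])(x)
print(x.shape)
```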

4.3 U-net

As mentioned multiple times before, the U-net is a segmentation architecture and therefore needs a segmentation map for training. Because the data set only consists of point annotations, this segmentation map had to be created automatically. To do this a very simple approach was chosen. First, random patches of size 256x256 have been extracted from ROIs in the WSI. Each patch that has been extracted contains at least one labeled cell. To create the segmentation mask a 2d array with the same size as the input patches has been created. All values in the array were set to 0, later representing the background of the patch. The location of the point annotation was determined and a fixed-size circle around it was created. This circle should represent the segmentation of the cell. For each cell class a different pixel value was chosen, and at the corresponding locations in the 2d array the zeros have been replaced with those values. The size of the circle has been determined empirically so that it would on average cover most of the cell. A radius of 15 pixels works best for this. All segmentations are based on the highest resolution level of the WSI. Experiments were run with different hyper-parameters and different loss functions, but the network did not seem to be able to segment the different cells. It was made sure that the network was able to over-fit on a small data set with only three patches and no difference between train and validation set. The output after different numbers of epochs was examined. It turned out that the longer the network was trained, the more it predicted everything to be background.
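A sketch of this mask generation with scikit-image (the tuple layout of the point annotations is my own choice; the 256-pixel patch size and 15-pixel radius are from the thesis):

```python
import numpy as np
from skimage.draw import disk

def point_mask(points, size=256, radius=15):
    """Turn point annotations into a crude segmentation mask: one filled
    circle of fixed radius per cell; 0 = background, class values > 0.
    `points` is a list of (row, col, class_value) tuples."""
    mask = np.zeros((size, size), dtype=np.uint8)
    for row, col, value in points:
        rr, cc = disk((row, col), radius, shape=mask.shape)
        mask[rr, cc] = value
    return mask

mask = point_mask([(64, 64, 1), (120, 180, 3)])  # hypothetical annotations
```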

Because the segmentation masks described above did not work very well, new segmentation masks have been created. This time the masks are again circles, but with an additional border or rim around them. The idea was inspired by [28]. This has two benefits. First, it should help the network to put more emphasis on the difference between the cells and the background. It should also help to segment cells that are close together. Additionally, the network should now learn to put more “focus” on the membrane of the cells, which is an important feature for the classification.

After some epochs one could see that the network started to segment some of the cells, however with very low accuracy. After some more epochs the background again started to take over. Figure 8 shows the results of training a U-net architecture in an early state. The image on the left shows the input patch. The image on the right shows the segmentation map. The image in the middle shows the output of the network. As can be seen, the network could roughly detect some cells. However, it detects a lot of cells that are not labeled. The additional rim that was created around the segmentations of the cells seems not to be very helpful.

It is important to note that a precise segmentation of the cells is not the goal here. It would also not be very reasonable to expect this with the given ground truth. For the task of computing the TPS it would be sufficient to only get rough segmentations that could be counted. Also for obtaining the location of the cells it would be sufficient to only have rough segmentations.

Figure 8: Example of predictions made by the U-net with additional rims in the segmentation mask

Still the network was over-fitting on the background. Evaluating the epochs showed that the network was able to segment some cells, but after more epochs it started to predict only background. For the cells that had been segmented in early epochs the classification rate was not good either: the network could segment the cells but was not able to reliably distinguish between the classes. Tumor cells were often labeled as immune cells and vice versa. The network also often segmented only the cell core, which is stained blue in all cells. Because all the measures that had been taken to prevent the over-fitting on the background did not work, it was time to look at the dataset again.

As mentioned before, the dataset contained regions of interest in which the cells have been labeled. No cells outside these regions have been labeled. All the patches that were used for training have only been taken from these areas, to prevent having unlabeled cells in the training set. However, the ROIs were not labeled exhaustively, meaning that not every cell in a ROI has been labeled. Because this was not taken into account at first, the patches that were used for training possibly contained a lot of cells that belong to one of the desired classes but have not been labeled and were therefore treated as background. This would also explain why the network always over-fitted on the background. The network was able to learn some features that are useful to identify and segment cells. As the training goes on, the network segments cells that are treated as background. The loss function however treats this as a misclassification. The network then basically unlearns the features that have been used to classify these unlabeled cells. Because the background class is bigger than the other classes, a higher accuracy could be achieved if the background is predicted more frequently.

Figure 9: Improved segmentation masks

To solve this problem a new form of segmentation masks has been used. For these segmentation masks another circle is introduced. This circle goes around the circle that is used as the segmentation of the cell. Only pixels that lie within this second, bigger circle are used as background. Everything that is outside such a circle is ignored. For cells that are close together this still means that there are some unlabeled cells in the background, but overall many fewer unlabeled cells are now treated as background. Figure 9 shows an example of such a segmentation mask. The white and gray circles are the labels for the cell classes. The circles in darker grey show what is used as background. The black area is ignored. In the previous masks the black area together with the dark grey area would have been used as background labels.
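Extending the earlier mask sketch, the ignore region can be encoded with a sentinel value that the loss function skips; the outer radius and the sentinel value 255 are my assumptions:

```python
import numpy as np
from skimage.draw import disk

IGNORE = 255  # sentinel value excluded from the loss (an assumption)

def point_mask_with_ignore(points, size=256, radius=15, bg_radius=30):
    """Only an annulus around each cell is used as background; every pixel
    outside all outer circles is marked IGNORE and excluded from the loss."""
    mask = np.full((size, size), IGNORE, dtype=np.uint8)
    for row, col, _ in points:            # background rings first
        rr, cc = disk((row, col), bg_radius, shape=mask.shape)
        mask[rr, cc] = 0
    for row, col, value in points:        # cell circles on top
        rr, cc = disk((row, col), radius, shape=mask.shape)
        mask[rr, cc] = value
    return mask

# In PyTorch, for example, such pixels can be skipped with
# nn.CrossEntropyLoss(ignore_index=255).
```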

4.4 LCFCN

An additional practical advantage of the LCFCN is that it works with point annotations. Therefore no additional creation of segmentation masks is needed. Because a single WSI would be too big to train the network on, it was again trained on and applied to smaller image patches. The authors of the LCFCN paper published their code on GitHub. After fixing some issues so that the code would be compatible with the data set used for this project, network training started. First results were pretty bad: the network was only able to predict large blobs at the border of a patch. As a sanity check, and to make the model easier to debug, it was over-fitted on a very small data set of only three images. This technique is common when training neural networks. If the network is not able to over-fit on a small data set, it will also not be able to extract useful features from a bigger data set. Over-fitting on a small data set worked well. The results can be seen in figure 10.

Figure 10: Results of the LCFCN over-fitted on a small data set

The model predicted blobs around the annotations. In the figure the annotations can be seen in red; the blobs have a yellow border around them. The color of the blob does not matter and, again, the shape of the blob is not important. Only the location, and the fact that the point annotation lies within the blob, are important.

Even after a long time of experimentation I was not able to achieve similar results when using a bigger dataset. The network was only able to predict big blobs at the borders of the patch. I expect that this is for the same reasons the U-net was not able to learn good segmentations: not all cells in the patch are labeled. This means that the network will unlearn previously learned features when it predicts blobs for unannotated cells.


5 Evaluation

In an ideal situation the dataset would consist of pixel-precise annotations. Then the segmentations produced by the neural networks could be compared with the original annotations on a pixel level with a measure like the DICE score. Because the dataset only consists of point annotations this is not possible. An interesting property of fully convolutional networks is that the output size depends on the input size. This means that when an image patch bigger than the one used for training is given, the output can be seen as a segmentation. When an input patch has the same size as during the training, the output is of size one. This means that the output can be seen as a classification (for the central pixel of the image). For the evaluation of the FCNs this property was used. Image patches have been created from the test set the same way they have been created for the training and validation set. The networks were then evaluated on these patches, just like the evaluation would have been done for a classification task. For the evaluation the F1 score, Cohen's kappa (κ) and confusion matrices are used. The F1 score is the harmonic mean between precision and recall. It can be computed with the following formula:

F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
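All three measures are available in scikit-learn, for example. The averaging mode for the multi-class F1 score is not stated in the thesis; "weighted" is one common choice, and the label values below are made up:

```python
from sklearn.metrics import f1_score, cohen_kappa_score, confusion_matrix

# y_true / y_pred: one entry per test patch (hypothetical example values)
y_true = [0, 1, 2, 3, 4, 1, 1, 2]
y_pred = [0, 1, 2, 4, 4, 1, 2, 2]

print(f1_score(y_true, y_pred, average="weighted"))
print(cohen_kappa_score(y_true, y_pred))

# Row-normalized confusion matrix: entries are fractions per true label
print(confusion_matrix(y_true, y_pred, normalize="true"))
```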

5.1 NatureNet

All the parameters and results have been logged on wandb. The network with the best performance on the validation set has then been evaluated on the test set. The best result for the NatureNet architecture was F1 = 0.755 and κ = 0.650 on the validation set and F1 = 0.676 and κ = 0.585 on the test set. The corresponding confusion matrix can be seen in figure 11.

On the y-axis the true label is depicted, while on the x-axis the predicted label is depicted. On the diagonal the percentage of predictions where the predicted label is the true label can be seen. Cells off the diagonal show misclassified predictions. A perfect classifier would only have ones on the diagonal and zeros everywhere else in the matrix. All confusion matrices presented here have been normalized so that the values show percentages instead of absolute counts.

We can see that most of the classifier's predictions are correct. We can also see that the classifier can distinguish between positive and negative cells. However, it has problems differentiating between tumor and immune cells. The red boxes have been added to further highlight this finding: the top left square contains all the PD-L1 negative cells while the bottom left square contains all the PD-L1 positive cells. Furthermore, we can also see that the classifier has problems with the “Rest” class. 50% of this class is identified correctly, while the rest are other cells that have mistakenly been identified as “Rest”.

Figure 11: Confusion matrix for the NatureNet architecture with the best hyper-parameter setting

5.2 DenseNet

The DenseNet achieved an F1 = 0.763 and a κ = 0.604 on the validation set. On the test set it achieved an F1 = 0.582 and a κ = 0.371. The confusion matrix for the DenseNet can be seen in figure 12. The labels are the same as in the previous confusion matrix. As can be seen, this network has a problem detecting the PD-L1 negative immune cells; most of those cells are predicted as PD-L1 negative tumor cells. Like the NatureNet, this network is able to reliably differentiate between the positive and negative cells but has problems within the positive or negative class. The rest class is correctly classified in 49% of the cases.

5.3 U-net

The U-net architecture is a segmentation architecture. This means that it cannot be tested as a classification network like the FCN architectures have been. Still, it is very important to keep the evaluation as similar to the FCN evaluation process as possible. To achieve this the U-net was also tested on the same patches. The U-net outputs a segmentation map for the whole patch. Because the patch is centered around the created segmentation mask, which is centered around the original point annotation, the center pixel is part of the cell that should be detected. Therefore the center pixel of the segmentation map can be used as the classification for this cell. In early testing it became clear that sometimes a segmentation is created that covers most of the cell, but because the point annotation has not been centered on the cell the segmentation does not cover the exact center of the patch. In the evaluation method this would lead to a misclassification of the cell. I argue that, because the segmentation still covers most of the actual cell, this would not be a correct evaluation. To solve this problem the center of mass of the segmentation is computed. Then a circle around this center is created; if this circle overlaps with the circle that is used as the segmentation label, it is counted as a correct classification. If the classes of the prediction and the label do not match it is of course not counted as a correct segmentation, even if the circles overlap.
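A sketch of this overlap test (my own illustration; the radius matches the 15-pixel label circles, and the overlap criterion reduces to a distance check between the two circle centers):

```python
import numpy as np
from scipy.ndimage import center_of_mass

def overlap_match(pred_mask, label_center, radius=15):
    """Compute the center of mass of the predicted segmentation and check
    whether a circle around it overlaps the ground-truth circle, i.e. the
    two centers are less than 2 * radius apart."""
    if pred_mask.sum() == 0:
        return False          # nothing was segmented for this patch
    cy, cx = center_of_mass(pred_mask)
    ly, lx = label_center
    return np.hypot(cy - ly, cx - lx) < 2 * radius
```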


Figure 12: Confusion matrix for the DenseNet architecture with the best hyper-parameter setting

The confusion matrix for the best results achieved with the U-net can be seen in figure 13. The labeling is the same as for the previous confusion matrices.

It can be seen that, like the other architectures, this network can reliably differentiate between the PD-L1 positive and negative cells. But also like the other networks, it has problems differentiating between PD-L1 positive tumor and immune cells. Compared to the other two models it has fewer problems differentiating between the negative tumor and immune cells. 66% of the rest class is identified correctly.


Figure 13: Confusion matrix for the U-net architecture with the best hyper-parameter setting

6 Discussion

Comparing the results of the different networks, we can see that the simple NatureNet architecture achieved the highest F1 score with F1 = 0.676, compared to the DenseNet with F1 = 0.582 and the U-net with the lowest value of F1 = 0.526. However, the F1 score is not everything. The confusion matrices also give valuable insights into the actual performance of the different networks. The confusion matrix of the U-net looks the best. For each of the individual cell classes it has the most correct predictions, being correct in 69% of the cases in its worst class (PD-L1 positive immune) and in 79% of the cases in its best class (PD-L1 positive tumor). This is much better than the DenseNet, which is only correct 63% of the time in its best class (also PD-L1 positive tumor) while it almost never correctly detects the PD-L1 negative immune cells. Even though the U-net had the worst F1 score of the three, it has a better-looking confusion matrix than the DenseNet.

An important factor that has been left out of the evaluation until now is the quality of the segmentations themselves. After all, the networks are initially designed as segmentation networks, so one could also evaluate the performance of the actual segmentations. Because of the way the data set is annotated, the evaluation was done as described above. A common way to evaluate segmentation networks is on a pixel basis. This was not possible, and also not desirable, for this work for two reasons. First, because the annotations only consist of point annotations, there is no pixel-precise ground truth with which a prediction could be compared; the automatically generated segmentation masks would not be precise enough to get a reliable evaluation. The second reason is that the data set is not exhaustively labeled. This means that there are cells that might belong to one of the target classes but are without an annotation. The network would also segment those cells, which would negatively influence the evaluation, as those cells would be counted as background. Furthermore, a pixel-precise evaluation does not necessarily reflect the goal of the work. The goal was to detect the different cell types so that they could be used to compute the TPS. Because the TPS alone is not as reliable in predicting the outcome of an immunotherapy as one would hope, we were also interested in finding the location of the different cells. This knowledge could then be used in follow-up studies as an additional factor; one could examine, for example, whether there is a relationship between the location of tumor cells relative to other structures such as blood vessels. An exact segmentation of cells is therefore not necessary, as long as the number of cells and their locations can be reliably detected.
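To make the link to the score explicit: once the per-class cell counts are available, the TPS follows from its standard definition, the percentage of viable tumor cells showing PD-L1 staining. A minimal sketch with illustrative names:

    def tumor_proportion_score(n_pos_tumor, n_neg_tumor):
        # Standard TPS definition: PD-L1 positive tumor cells as a
        # percentage of all viable tumor cells. Immune cells are
        # excluded from both numerator and denominator.
        total = n_pos_tumor + n_neg_tumor
        return 100.0 * n_pos_tumor / total if total > 0 else 0.0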

It is still important to look at the segmentation performance to get at least a rough estimate of it. Because this is a somewhat subjective measurement, it was not included in the results section. The results are a bit surprising to me.

There is not much difference between the segmentations created by the NatureNet and the DenseNet architectures. Both are not really able to clearly segment individual cells, and the segmentations tend to overlap. The networks only seem to be able to roughly segment wider regions that contain a certain type of cell. One could assume that with some post-processing those regions could be delineated more clearly, but that would be a task for follow-up research. For clinical use this kind of segmentation does not seem very useful.

The U-net architecture on the other hand was able to create nice segmentations of individual cells. Overlaps between individual cells appear only in a few cases.

Figure 14 shows the segmentation mask created by the DenseNet overlaid on a WSI. The patch of the WSI without the overlay can be found in the appendix (figure 16). Unfortunately, the color scheme used in the mask is not the same as in the original annotations; the legend on the lower left shows the labels for the segmentation mask. As described above, the network was not able to create pixel-precise segmentations for the cells, and the segmentations overlap.

Figure 15 shows the output created by the U-net. In this case the color map is the same as for the original labels, although because the overlay is semi-transparent the colors still look slightly different. As can be seen, the U-net output contains much cleaner segmentations of the individual cells. This kind of output is much better suited for counting and localizing individual cells.

It can also be seen that the network often predicts some output for cells that are not labeled. Because the data set is not exhaustively labeled, there is no guarantee that these are indeed errors.

6.1 Difficulties

One difficulty of this project was the data set. Because this work was a first pilot for the detection and localization of PD-L1 cells at the Radboud University Medical Center, the data set was very limited. The project was done with a relatively small number of annotated cells, because it is a time-consuming task for a trained pathologist (or medical student) to label a high number of cells. Segmentation networks like U-net could profit from large, exhaustively annotated ROIs; as mentioned in the experiments, the data set was not annotated in this way. This means that simply extracting patches for the U-net did not work, and a method that was able to cope with the related problems was needed. Later, the student who made the annotations created new ROIs that have been fully annotated, but most of these ROIs were too small to sample patches of sufficient size. Sampling bigger patches from the ROIs would also allow for the use of deeper networks, which could in turn further improve the results.

Figure 14: Segmentation mask created by the DenseNet architecture

One property of valid convolutions is that an image passed through the network becomes smaller after each convolution. One workaround for this problem is to use padded convolutions, where the output of each convolution is padded so that it has the same dimensions as the input. A drawback is that each padding operation introduces artefacts into the image.
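The size bookkeeping behind this is easy to verify; the snippet below is an illustrative check in PyTorch (the framework choice here is mine, not necessarily the one used in the experiments):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 128, 128)  # dummy 128x128 RGB patch

    # Valid convolution: no padding, so a 3x3 kernel shrinks each
    # spatial dimension by 2 pixels per layer.
    valid = nn.Conv2d(3, 16, kernel_size=3, padding=0)
    print(valid(x).shape)  # torch.Size([1, 16, 126, 126])

    # Padded ("same") convolution: the border is zero-padded so the
    # output keeps the input resolution, at the cost of artefacts
    # introduced along the border.
    same = nn.Conv2d(3, 16, kernel_size=3, padding=1)
    print(same(x).shape)   # torch.Size([1, 16, 128, 128])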

Here the annotations have been created by a single medical student and have then been checked by trained pathologists. Usually, for these kinds of tasks, the training data is created by multiple specially trained pathologists, and the results of the trained network are then compared with the annotations of a number of pathologists. This is done because labeling the cells is not an easy task and there is never 100% concordance between expert annotators. This does not mean that the quality of this data set is expected to be low, and this work is only a first pilot. Still, it might be something to keep in mind, especially when the data set is used for future work that shows promising results.

6.2 Demonstrator

Because this project was part of the Radboud AI for Health project, a demonstrator was created that is hosted under the algorithms section on grand-challenge.org. Grand-Challenge is an online platform hosted by the Radboud UMC. It hosts contests in bio-medical imaging, provides tools to annotate data sets and allows for the demonstration of different algorithms. A user can upload a WSI to the platform; the image is then segmented by the U-net model, and the user interface allows the user to explore the segmentation of the slide. When a better working model is created, it can simply be swapped in.

6.3 Future Research

There are multiple ways to further extend this work. One would be to explore different models. Even though all the models tested here work conceptually in a different way, they are all fully convolutional segmentation models. One way to improve the results could be to simply try newer segmentation models. Another would be to explore object detection models, such as newer implementations of the YOLO model or state-of-the-art models like EfficientDet. Object detection models work by predicting a bounding box around each object they should detect in an input image. At first it was decided not to use an object detection model because it would need bounding box annotations for training and only point annotations were available. Later in the project it was decided to test the performance of U-net on the data set; this network needed segmentation maps, which were created automatically. In a similar way it could have been decided to also create bounding boxes in a fully automated way. Because of time constraints and because I already had experience with using U-net, it was decided to use U-net instead of an object detection network. It might be interesting to test a state-of-the-art object detection network on this problem, as it might be a good solution for finding the actual location of cells. To find the location of a cell, the center of its bounding box could be used, which is computationally more efficient than finding the center of mass of a segmentation.
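As a small illustration of that last point (with an illustrative function name): the location falls directly out of the predicted box coordinates, whereas a mask requires iterating over all of its pixels.

    def bbox_center(x_min, y_min, x_max, y_max):
        # Midpoint of a predicted bounding box: O(1) per cell,
        # unlike a center-of-mass computation over a full mask.
        return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)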


As can be seen in figure 9, when two or more cells are close together, the segmentations sometimes overlap. This makes it difficult to obtain the actual number of predicted cells in post-processing. An object detection network could predict overlapping bounding boxes for each of the cells, which would make it easy to get the actual number and location of cells.
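A small sketch of the counting problem, assuming connected-component labeling is used in post-processing (an assumption on my part):

    import numpy as np
    from scipy import ndimage

    # Two hypothetical cells whose masks touch.
    mask = np.zeros((8, 8), dtype=bool)
    mask[2:5, 2:5] = True  # cell A
    mask[4:7, 4:7] = True  # cell B, overlapping A at (4, 4)

    _, n_cells = ndimage.label(mask)
    print(n_cells)  # 1 -- the two cells are counted as one object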

All the segmentations that have been created automatically are circles with a fixed radius around the point annotation. This is a very generic and simple way to create segmentations. A more advanced method could be to use a region growing algorithm: the point annotation is used as a seed point, and neighboring pixels are then added to the segmentation incrementally, based on their pixel values. Different kinds of algorithms and rules for adding pixels could be explored. This might lead to more precise segmentations than the circles, and the more precise annotations could in turn lead to better segmentation performance of the network, because only (or at least more of) the pixels that actually belong to the cell would be treated as segmentation for that cell. The circles often contain pixels that actually belong to the background, or fail to contain all pixels of the cell. The network is therefore trained on contradictory signals: background pixels that are labeled as "cell pixels" because the circle does not exactly fit the shape of the cell, and cell pixels that are labeled as background. Predictions that follow the true cell boundary rather than the circle are then punished by the loss function, which makes it more difficult for the network to learn the correct segmentations.
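A minimal region-growing sketch along these lines, assuming a single-channel image and a fixed grey-value tolerance; the 4-connectivity and the acceptance rule are illustrative choices, not a tested design:

    from collections import deque
    import numpy as np

    def region_grow(image, seed, tol=20.0):
        # Grow a segmentation from the point annotation `seed`,
        # accepting 4-connected neighbours whose grey value lies
        # within `tol` of the seed pixel.
        h, w = image.shape
        seed_val = float(image[seed])
        mask = np.zeros((h, w), dtype=bool)
        mask[seed] = True
        queue = deque([seed])
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc]:
                    if abs(float(image[nr, nc]) - seed_val) <= tol:
                        mask[nr, nc] = True
                        queue.append((nr, nc))
        return mask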

A possible problem with using a region growing algorithm for creating the segmentations is the difference between the cell core and the cell membrane. As can be seen in figure 7, the cell core is stained in a blueish color while the cell membrane is stained brown for PD-L1 positive cells; for PD-L1 negative cells the cell membrane is not as easily visible. Because of the strong color difference, a region growing algorithm would typically only segment the cell core and would not be able to also capture the cell membrane. However, the cell membrane is the most important factor in determining whether a cell is PD-L1 positive or not.

6.4 Summary

In this project I compared the performance of three different kinds of neural networks for the scoring of PD-L1 expression, which is currently the only biomarker used to estimate the efficacy of treatment with pembrolizumab, an immunotherapy drug. The performance of the models that I trained and evaluated is not good enough for use in a clinical research setting; I did not succeed in providing a model that is accurate and reliable enough to detect and locate PD-L1 positive and negative cells. Even though there is still much work left to do, I hope that this project can be seen as a valuable first pilot and that future students can use it as a starting point to tackle the problem further.


References

[1] Dako Agilent. PD-L1 IHC 22C3 pharmDx interpretation manual. 2018.

[2] Hani A Alturkistani, Faris M Tashkandi, and Zuhair M Mohammedsaleh. "Histological stains: a literature review and case study." In: Global Journal of Health Science 8.3 (2016), p. 72.

[3] Mads Hald Andersen et al. "Cytotoxic T cells." In: Journal of Investigative Dermatology 126.1 (2006), pp. 32–41.

[4] Julie R Brahmer et al. "The Society for Immunotherapy of Cancer consensus statement on immunotherapy for the treatment of non-small cell lung cancer (NSCLC)." In: Journal for ImmunoTherapy of Cancer 6.1 (2018), p. 75.

[5] Wendy A Cooper et al. "Intra- and interobserver reproducibility assessment of PD-L1 biomarker in non–small cell lung cancer." In: Clinical Cancer Research 23.16 (2017), pp. 4569–4577.

[6] Wendy A Cooper et al. "PD-L1 expression is a favorable prognostic factor in early stage non-small cell carcinoma." In: Lung Cancer 89.2 (2015), pp. 181–188.

[7] Jennifer Couzin-Frankel. Cancer immunotherapy. 2013.

[8] Pramod Darvin et al. "Immune checkpoint inhibitors: recent progress and potential biomarkers." In: Experimental & Molecular Medicine 50.12 (2018), pp. 1–11.

[9] Navid Farahani, Anil V Parwani, and Liron Pantanowitz. "Whole slide imaging in pathology: advantages, limitations, and emerging perspectives." In: Pathology and Laboratory Medicine International 7 (2015), pp. 23–33.

[10] Edward B Garon et al. "Pembrolizumab for the treatment of non–small-cell lung cancer." In: New England Journal of Medicine 372.21 (2015), pp. 2018–2028.

[11] Douglas Hanahan and Robert A Weinberg. "Hallmarks of cancer: the next generation." In: Cell 144.5 (2011), pp. 646–674.

[12] Lars Hofmann et al. "Cutaneous, gastrointestinal, hepatic, endocrine, and renal side-effects of anti-PD-1 therapy." In: European Journal of Cancer 60 (2016), pp. 190–209.

[13] Gao Huang et al. "Densely connected convolutional networks." In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4700–4708.

[14] Ansh Kapil et al. "DASGAN – Joint domain adaptation and segmentation for the analysis of epithelial regions in histopathology PD-L1 images." In: arXiv preprint arXiv:1906.11118 (2019).

[15] Ansh Kapil et al. "Deep semi supervised generative learning for automated tumor proportion scoring on NSCLC tissue needle biopsies." In: Scientific Reports 8.1 (2018), p. 17343.

[16] Viktor H Koelzer et al. "Digital image analysis improves precision of PD-L1 scoring in cutaneous melanoma." In: Histopathology 73.3 (2018), pp. 397–406.

[17] Issam H Laradji et al. "Where are the blobs: Counting by localization with point supervision." In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 547–562.

[18] Geert Litjens et al. "A survey on deep learning in medical image analysis." In: Medical Image Analysis 42 (2017), pp. 60–88.

[19] Jonathan Long, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2015.

[20] Christian Manegold et al. "The potential of combined immunotherapy and antiangiogenesis for the synergistic treatment of advanced NSCLC." In: Journal of Thoracic Oncology 12.2 (2017), pp. 194–207.

[21] Julian R Molina et al. "Non-small cell lung cancer: epidemiology, risk factors, treatment, and survivorship." In: Mayo Clinic Proceedings. Vol. 83. 5. Elsevier, 2008, pp. 584–594.

[22] Lukas Mosser, Olivier Dubrule, and Martin Blunt. "Stochastic reconstruction of an oolitic limestone by generative adversarial networks." In: Transport in Porous Media (Dec. 2017). doi: 10.1007/s11242-018-1039-9.

[23] National Cancer Institute at the National Institutes of Health. Immune Checkpoint Inhibitors. 2019. url: https://www.cancer.gov/about-cancer/treatment/types/immunotherapy/checkpoint-inhibitors (visited on 08/09/2020).

[24] Christopher Olah. Understanding Convolutions. 2014. url: https://colah.github.io/posts/2014-07-Understanding-Convolutions/ (visited on 06/25/2020).

[25] F. Pedregosa et al. "Scikit-learn: Machine learning in Python." In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[26] Jacob Reinhold. Dropout on convolutional layers is weird. 2019. url: https://towardsdatascience.com/dropout-on-convolutional-layers-is-weird-5c6ab14f19b2 (visited on 07/20/2020).

[27] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.

[28] Zaneta Swiderska-Chadaj et al. "Convolutional neural networks for lymphocyte detection in immunohistochemically stained whole-slide images." In: (2018).

[29] Clive R Taylor et al. "A multi-institutional study to evaluate automated whole slide scoring of immunohistochemistry for assessment of programmed death-ligand 1 (PD-L1) expression in non–small cell lung cancer." In: Applied Immunohistochemistry & Molecular Morphology 27.4 (2019), pp. 263–269.

[30] Suzanne L Topalian et al. "Safety, activity, and immune correlates of anti–PD-1 antibody in cancer." In: New England Journal of Medicine 366.26 (2012), pp. 2443–2454.

[31] Lindsey A Torre, Rebecca L Siegel, and Ahmedin Jemal. "Lung cancer statistics." In: Lung Cancer and Personalized Medicine. Springer, 2016, pp. 1–19.

[32] Giancarlo Troncone and Cesare Gridelli. "The reproducibility of PD-L1 scoring in lung cancer: can the pathologists do better?" In: Translational Lung Cancer Research 6.Suppl 1 (2017), S74.

[33] Vivek Verma and Joe Y Chang. "Quantification of PD-L1 expression in non-small cell lung cancer." In: Translational Cancer Research 6 (2017), S402–S404.

[34] Jason Wang and Luis Perez. "The effectiveness of data augmentation in image classification using deep learning." In: Convolutional Neural Networks Vis. Recognit (2017), p. 11.

[35] ICAI website. About ICAI. 2019. url: https://icai.ai/ (visited on 06/25/2020).

[36] Radboud AI for Health website. Radboud AI for Health website - About. 2019. url: https://www.ai-for-health.nl/about/ (visited on 06/25/2020).

[37] Ming Yi et al. "Biomarkers for predicting efficacy of PD-1/PD-L1 inhibitors." In: Molecular Cancer 17.1 (2018), pp. 1–14.
