Soft margin loss for unlabelled regions: detecting tumour infiltrating lymphocytes

Universiteit van Amsterdam

MSc Artificial Intelligence

Master Thesis

Soft margin loss for unlabelled regions:

detecting tumour infiltrating lymphocytes

by

Andreas Panteli

12139742

July 10, 2020

48 ECTS November 2019 - July 2020

Supervisors:

Dr E. Gavves

Dr J. Teuwen

Dr H. Horlings

Assessor:

Dr P. Mettes


Abstract

Lack of labels in object detection often leads to ambiguous or even incorrect information assessment, where un-annotated areas may contain an important cell, such as a lymphocyte. Moreover, H&E slides provide a challenging platform for cell detection due to the morphological similarity of different cells in the tissue. Training neural networks under these conditions with traditional loss functions leads to unstable predictions with an increased number of false positive cases.

In this work, we introduce the soft margin loss (SML) for non-exhaustively annotated H&E slides, alongside a novel neural architecture named YELL (You Efficiently Look at Lymphocytes). SML restructures the basic loss function to disassociate background (weak negative annotations) from probable unlabelled TILs in the otherwise-considered-background area, and trains using high-fidelity information alone. YELL is able to reuse the same base network both for multiple smaller tiles of an input image, as an intermediate classification task, and for the whole image for bounding box prediction, without introducing two independent networks. This smaller model compacts a multi-task approach, combining local neighbourhood information (classification) with the global image view (detection) under the same network, excluding only three single layers.

SML bypasses the restrictions imposed by mistaking unlabelled TILs for background and is able to train with these noisy annotations, while YELL outperforms state-of-the-art related architectures on H&E images from two datasets.


Acknowledgements

I would like to thank Efstratios Gavves for his reliable and relentless assistance, guidance and supervision during the whole process. This work would not have been possible without him. I would like to thank Jonas Teuwen for his constant support, Hugo Horlings for his expert advice and pleasant attitude towards new challenges, and Siamak Hajizadeh for his calm solutions. I cannot forget, of course, the Dream Team whose pleasant lets-grab-a-beer response has been the inspiration of some of the insights of this thesis; and for that I thank them. And of course, my family and friends have been a reliable supporting force which made this work possible.


Contents

1 Motivation
2 Introduction
2.1 Medical applicability
2.2 Relevance of finding TILs
2.3 How pathologists decide
2.4 Medical data considerations
2.5 Related work
2.6 Contributions
3 Methodology
3.1 Soft margin loss
3.2 YELL: You Efficiently Look at Lymphocytes
4 Experimental Setup
4.1 Data
4.1.1 NKI dataset
4.1.2 TCGA dataset
4.1.3 Preprocessing
4.2 Model properties
4.2.1 Metrics
4.3 Non maximal suppression
5 Results
5.1 Quantitative results
5.2 Qualitative results
6 Conclusion
6.1 Discussion
6.2 Conclusions
References


CHAPTER 1

Motivation

According to the National Cancer Institute (NCI) of the U.S. Department of Health and Human Services, cancer is amongst the leading causes of death [1]. Despite a steadily decreasing cancer death rate, 23.6 million new cases per year are expected to be diagnosed by 2030 [1, 2]. The World Health Organization's International Agency for Research on Cancer reports a total of 9,555,027 deaths in 2018, with lung, colorectal, stomach, liver, and breast cancer types being responsible for half of all mortality cases [3, 4]. With limited options for treatment and the lack of a definite cure for many cancer types, early diagnosis is an imperative first step towards impeding tumour metastasis or irreversible growth [5]. Unfortunately, the availability of oncologists and pathologists is limited and expensive, which results in fewer specialist inspections for patients without immediately evident or life-threatening symptoms [6, 7], and hence in more sparsely obtained early-cancer prognoses. However, imaging techniques such as magnetic resonance imaging (MRI) or computed tomography (CT) scans, needed for the examination, can be performed quickly by almost any licensed physician for a significantly smaller amount [8–10]. Given the necessary background information and specialised knowledge, the task of expert clinicians can often be a mundane condition inspection: if, for example, the pre-defined structures that constitute cancer cells exist in the image, then the diagnosis is indisputable even by an automated system [11]. Hence, any algorithmic procedure that is as well trained as professional pathologists can automate the detection and diagnosis of salient features in medical images, increasing the number of patients examined and potentially the number of early diagnoses [12–14]. Using a confidence metric while scanning multiple patients at a time, an automated system can notify physicians of uncertain decisions and direct to treatment everybody who indisputably needs it [15].

Detecting biomarkers in medical data is not a trivial task, however. Not all cancer types or mutations behave the same way, nor always in similar patterns [13]. It is often the case that the features which constitute severe and aggressive behaviour for one type of cancer do not correlate with other types of tumours, especially in different areas of the human body [11, 13]. The inconsistency and variation persistent in cancer diagnosis is one of the many reasons that computer-aided technology (CAT) has been struggling with prediction robustness and generality [16, 17]. Recent advances in machine learning, and more specifically in deep learning, have shown how neural networks (NNs) are able to approximate any data-driven function and have successfully solved many problems that were once thought impossible [18–23]. Due to their neural architecture, inspired by human brain synapses, NNs have been able to "learn" the most salient and important characteristics in images [24–27], language [28–32] and many other domains and tasks, often with near (or even higher than) human performance [6, 33–36]. With their more recent success in medical applications, NNs have become an essential tool in modern medical practice and research [26, 37–43].

A novel application of NNs in medical images has recently been the detection of tumour infiltrating lymphocytes (TILs), whose presence in breast tissue has been able to predict the response of a patient to immunotherapy treatment [8, 19, 44]. More specifically, in the case of a positive diagnosis of breast cancer, an increased number of TILs within the tumour region, at least 5% presence in the stroma near the cancerous tissue, has been shown to be a positive indicator that immunotherapy will benefit the patient [8]. Immunotherapy, however, can cost up to €800,000 over a lengthy period of time, and only a limited number of patients may benefit from it. Therefore, creating an automated CAT system using a NN model able to detect the lymphocytic cells in breast cancer patients can not only save money but also target immunotherapy recommendations to the people who will benefit the most from it, whilst referring some patients to more appropriate treatments sooner.


CHAPTER 2

Introduction

In the fight against cancer, immunotherapy is a novel treatment which, for certain cancer types, enhances the immune system in order to destroy tumour cells. It remains, however, an active research area, and so far only a small group of patients, e.g. 5-20% of metastatic breast cancer patients, benefit from it. It is expensive, costing upwards of €80,000 per year, and physicians are tasked with deciding which patients should receive it and which should receive a different treatment. Applying the right treatment at the right time is imperative, since the life of a patient may be at risk. To that effect, recent studies have attempted to investigate early signs in a patient which might indicate that immunotherapy is the appropriate treatment for them [8, 12, 45]. It has been shown that cells called lymphocytes, produced by the immune system, are an early indicator and biomarker able to predict whether a patient will respond positively to immunotherapy [8, 46]. More precisely, pathologists look for lymphocytes in the stroma region near the cancerous area, i.e. Tumour Infiltrating Lymphocytes (TILs). The percentage of TILs in the stroma region relates to the ability of the immune system to fight the tumour cells, and if the TILs-stroma percentage is above a certain value then pathologists will recommend the immunotherapy treatment [44]. The work presented in this report aims to tackle this exact problem by automating the detection of TILs, in order to assist in predicting whether patients will respond to immunotherapy.

Lymphocytic cells, like any other cell type such as tumour cells or red blood cells, cannot be inspected readily; the doctor needs to assess the patient's condition first. When performing an examination for cancer, due to the nature of tumours, the patient information can be acquired by either invasive or non-invasive procedures. Invasive procedures may involve cutting a tissue sample from the patient, while non-invasive procedures include imaging scans by magnetic resonance imaging (MRI) or computed tomography (CT). While non-invasive methods are safe, easy and convenient, they cannot provide detailed information about the morphological nature of tumours and their surroundings on the cellular level [13, 47]. Diagnosing a condition for areas at risk, and proposing treatments, require hard facts which can often only be gathered by analysing the tissue of a patient under a microscope [11, 20].

Histopathology refers to the diagnosis of a disease through microscopic inspection of the tissue, which is able to reveal cells such as lymphocytes (T-, B-, A-cells), tumour nuclei, fibroblasts, ducts, fat or muscle tissue and others. These cells and tissue structures are important for the pathological examination and can be seen in whole slide images (WSIs), which are thin slices of tissue from a cross-section region to be examined, stained with specific chemicals and reviewed by doctors. An example of a WSI can be seen in Figure 2.1 alongside magnified areas, where cells are visible.

Figure 2.1: PatientID 60464 of the NKI dataset (Section 4.1). The H&E WSI, from the broad view of the whole tissue, is magnified by a factor of 2^5 for the middle image and by a factor of 2^7 for the one on the right.

Upon receiving the tissue, it cannot immediately be used, because the inner structures are not naturally visible. A staining procedure must first be performed, which colours the tissue in a way that makes certain aspects of it clearly visible. "Certain" is a key word, because some staining methods are able to display different tissue properties compared to others [48]. This step is crucial, because any later processing and analysis depends highly on what type of staining is used. When inspecting cells like tumour nuclei and surrounding immune cells (like lymphocytes), there are two common staining methods: Haematoxylin and Eosin (H&E) staining, and CD3 (or CD8) antibody Immunohistochemistry (IHC) staining, examples of which are shown in Figure 2.2. The CD3 staining contains a very specific molecule which recognises only T immune cells in the tissue and ignores other tissue components [49]. H&E staining, on the other hand, displays the general layout and placement of cells, allowing the whole tissue structure to be clearly visible, by colouring cell nuclei blue (haematoxylin) and the extracellular matrix pink (eosin) [50]. Depending on the desired task and outcome, pathologists will use one of the two or, for some critical areas, both.

The most common staining procedure is H&E, because it shows the tissue structures as they are, without distorting the individual components. Pathologists use H&E slides to review the cancerous area, assess the progression of the disease, identify important regions and, finally, detect all the different cells, including TILs. Furthermore, H&E staining is faster, less expensive and easier than the CD3 method. The disadvantage of H&E staining is that it does not distinguish any cell type more than another, whilst CD3 does exactly that. When attempting to locate tumour infiltrating lymphocytes, CD3 will highlight all T-cell lymphocytes, which makes them very easy to detect [52]. However, CD3 staining ignores other types of lymphocytes, as well as other cells and structures, and using a CD3-stained slide alone restrains the possibility of finding tumour infiltrating lymphocytes, because the tumour cells are not clearly visible. That is why H&E slides are always used, so that the pathologist can identify the cancerous regions and find the tumour infiltrating lymphocytes, which are visible but not highlighted.

An example of tumour infiltrating lymphocytes in a local area of an H&E WSI can be seen in Figures 2.3 and 2.4. The lymphocytes are the dark black circular spots. Tumour cells are the less dark, more "diffused" or spread-out bigger cells. The black elongated cells are fibroblasts, which are part of the stroma and synthesise the extracellular matrix and collagen. Biologically, lymphocytes are distinguished from tumour cells and fibroblasts (the two other most common cell types appearing near TILs) by their small circular shape. Imaging effects, staining properties and tissue variations may display lymphocytes differently, either with a diffused boundary or as light blue spots, but compared to other cells in their vicinity, lymphocytes are structurally small spheres in the tissue. This is an important attribute when trying to disambiguate areas with closely neighbouring cells.
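The "small circular" morphology described above can be made quantitative with a standard shape descriptor. As an illustrative sketch (not the method used in this thesis), the isoperimetric quotient 4πA/P² scores a perfect circle as 1.0 and elongated, fibroblast-like shapes much lower; the thresholds below are hypothetical values chosen only for the example.

```python
import math

def circularity(area: float, perimeter: float) -> float:
    """Isoperimetric quotient: 1.0 for a perfect circle, near 0 for elongated shapes."""
    return 4.0 * math.pi * area / (perimeter ** 2)

# A circle of radius 5 versus a 2 x 40 rectangle (fibroblast-like elongation).
circle = circularity(math.pi * 5 ** 2, 2 * math.pi * 5)   # exactly 1.0
elongated = circularity(2 * 40, 2 * (2 + 40))             # roughly 0.14

def looks_like_lymphocyte(area, perimeter, min_circ=0.8, max_area=400):
    # Small AND round: both conditions from the morphology described above.
    # min_circ and max_area are illustrative, not calibrated values.
    return circularity(area, perimeter) >= min_circ and area <= max_area
```

In practice, segmentation noise blurs these boundaries, which is exactly why morphology alone is insufficient, as the text goes on to discuss.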

2.1 Medical applicability

Figure 2.3: Tumour infiltrating lymphocyte examples in H&E images. (a) Raw H&E area from patient 60470 (Section 4.1). (b) Annotated H&E area from patient 60470. (c) Raw H&E area from patient 60475 (Section 4.1). (d) Annotated H&E area from patient 60475. Lymphocytes are often small circular black cells, tumour cells are bigger with a more diffused boundary and lighter colour, while fibroblasts are thin, dark, elongated cells. Blue circles are lymphocytes (TILs), green circles are tumour cells and yellow circles are fibroblasts.

Figure 2.4: Example of an H&E slide from The Cancer Genome Atlas (TCGA), of the Pan-Cancer Atlas, dataset from the Genomic Data Commons. Left: raw H&E image, number 11 from paad_batch_1 of the TCGA dataset (Section 4.1). Right: annotated H&E image of the same area. Lymphocytes are the small dark cells with blue circles.

Figure 2.5: Tissue samples from the same patient, at a magnification factor of 2^4. Left: area without tumours (healthy tissue). Right: cancerous area.

Figure 2.6: H&E WSI tissue sample from patient 60465 of the NKI dataset, localised around a cancerous area with positive lymphocytic behaviour (TILs grouping around the tumour cells). The dark purple circles indicate endothelial cells and the black circles indicate fat cells.

It is an important task to recognise the different cells in tissue images, because the medical diagnosis relies on these small structures. Using Figures 2.3b and 2.3d as reference, the green circles indicate tumour cells, the blue circles indicate lymphocytes and the yellow circles indicate fibroblasts. All these smaller cells in the image are inside the stroma regions, which are parts of a tissue or an organ with a structural or connective role. Similarly, in Figure 2.4, the black and darker spots constitute the TILs and tumour population on the reddish-pink background which makes up the stroma. The lighter pink areas, which would otherwise appear as white, are regions from ducts. To a doctor, recognising these morphological arrangements is a trivial but important task in the analysis of the tissue. In Figure 2.5, the difference between healthy and tumorous tissue is clearly visible from the crowding of the tumour cells (darker large spots). The change of colour and composition is in this case an important biomarker for cancer. In the case of lymphocytes, as in Figure 2.6, the dark spots indicate the immune response to cancer, in the form of tumour infiltrating lymphocytes (TILs), i.e. lymphocytes concentrated near tumour cells. Detecting these lymphocytes and estimating an approximate lymphocyte-stroma coverage, the percentage of the stroma occupied by lymphocytes, produces a TILs coverage score which is used by pathologists to recommend immunotherapy treatment.

The example in Figure 2.4 shows that finding these lymphocytes is a simple task for the human eye to perform; but this image shows only a very small region of the tissue at a very high magnification, with favourable imaging conditions and a good visual example. In practice, a doctor needs to evaluate the whole tissue slide, a whole slide image (WSI), which may contain thousands or even millions of small lymphocytic cells, as can be seen in Figure 2.7. It is reasonable, therefore, for the doctor to localise predictions in some areas only and make a lymphocyte percentage estimate for the WSI of the patient, which will be used to decide the treatment. It is therefore natural to seek an automated way to evaluate slides, using the time of medical experts as efficiently as possible while providing predictions for all the tissue, which would otherwise be too laborious for humans.



Figure 2.7: WSI with 7 annotated regions of interest from patient 60465 of the NKI dataset, indicating the amount of TILs that can be contained in a single slide. The labelled areas contain in total around 2,000 (non-exhaustive) lymphocyte annotations, while occupying less than 5% of the total WSI.

2.2 Relevance of finding TILs

When cancer is detected and the immune system starts producing lymphocytes to combat the "foreign" intrusion, lymphocytes start appearing and flooding the stroma region. What makes a lymphocyte a tumour infiltrating lymphocyte (TIL) is its location relative to the tumour cells. When lymphocytes are within the cancerous region, they are named TILs, because they are strategically positioned around and actively fighting against the cancer. Other cells, like lipocytes, bear a very close resemblance to lymphocytes in the stroma. However, lipocytes are not lymphocytes; they are fat cells, as Figure 2.8 indicates. Since these cells look very similar to lymphocytes, a detection algorithm will assume that they are lymphocytes if no additional information is provided. That may not be a wrong conclusion based on the shape and colour of the cells, but it is an incorrect prediction based on the location of the cells. Hence, if detected outside stroma areas, they should not be taken into account when calculating the TILs-stroma percentage, since they would not contribute to the ability of the immune system to respond positively to immunotherapy.
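The location criterion above can be sketched as a post-processing step: keep a detection only if its centre falls inside a stroma mask. This is an illustrative sketch, not the pipeline of this thesis; `filter_tils_by_location` and the coordinate convention are hypothetical.

```python
import numpy as np

def filter_tils_by_location(detections, stroma_mask):
    """Keep lymphocyte detections only if their centre lies in stroma.

    detections: list of (x, y) centre coordinates in mask pixel space.
    stroma_mask: 2D boolean array, True where the tissue is stroma near tumour.
    """
    kept = []
    for x, y in detections:
        r, c = int(round(y)), int(round(x))
        in_bounds = 0 <= r < stroma_mask.shape[0] and 0 <= c < stroma_mask.shape[1]
        if in_bounds and stroma_mask[r, c]:
            kept.append((x, y))
    return kept

# Lipocyte-like detections outside the stroma region are discarded.
mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 40:60] = True               # stroma near the tumour (toy example)
dets = [(50, 50), (10, 10)]             # one TIL candidate, one fat-region hit
assert filter_tils_by_location(dets, mask) == [(50, 50)]
```

Obtaining the stroma mask itself is of course a separate (and harder) segmentation problem.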

Figure 2.8: Fat cell area from patient 60470, of the NKI dataset, with lipocyte locations marked with blue squares, indicating predictions but not true labels.

Figure 2.9: H&E WSI tissue sample from patient 60470 of the NKI dataset, with a sparse concentration of TILs near the tumour - a poor immune system response.

Figure 2.10: H&E WSI tissue sample from patient 60467 of the NKI dataset, with TILs concentrated near a cancerous area as a group (which indicates a good localisation effect from the immune system).

It is important to mention that the natural group response of lymphocytes is an important factor in determining whether the immune system is actively engaging the cancer or not. When a small number of lymphocytes is observed near tumour cells, that indicates that there is little response from the immune system, as shown in Figure 2.9. When there is an increased amount of lymphocytes, it can be assumed that the immune system is reasonably "fighting" the cancer. However, the most desirable response, which is also sought out during a diagnosis, is that a large number of lymphocytes are grouped and bundled together around the tumorous area, as in Figures 2.6 and 2.10. Subsequently, a system detecting TILs should be able to handle all three response events and recognise lymphocytes irrespective of their arrangement.

2.3 How pathologists decide

When the condition of a patient needs to be assessed, a pathologist will review an H&E WSI by first acquiring an overview of the whole slide and determining in broad terms where the tumour lies. Second, the doctor will concentrate on, and zoom into, the cancerous area and estimate the tumour-stroma percentage. This step is important in diagnosing the progression of the cancer. Third, the doctor will make localised predictions for TILs in the stroma region and calculate a TILs-stroma percentage for the patient. If the percentage of TILs to stroma is more than 5%, then the pathologist will recommend immunotherapy for the patient [53].
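The final step of this workflow reduces to a simple threshold rule. A minimal sketch, using the 5% threshold cited above; the function names and the area-based inputs are illustrative assumptions, not the clinical protocol itself.

```python
def tils_stroma_percentage(tils_area: float, stroma_area: float) -> float:
    """Percentage of the stroma region covered by detected TILs."""
    if stroma_area <= 0:
        raise ValueError("stroma area must be positive")
    return 100.0 * tils_area / stroma_area

def recommend_immunotherapy(tils_area, stroma_area, threshold_pct=5.0):
    # Threshold taken from the guideline above: recommend if TILs-stroma > 5%.
    return tils_stroma_percentage(tils_area, stroma_area) > threshold_pct

# 6% coverage -> recommend; 3% coverage -> do not recommend.
assert recommend_immunotherapy(tils_area=600.0, stroma_area=10_000.0)
assert not recommend_immunotherapy(tils_area=300.0, stroma_area=10_000.0)
```

The hard part automated in this thesis is producing reliable TILs detections in the first place; the thresholding itself is trivial once those exist.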

The localisation procedure, from a broad overview to specific cell predictions, is very important in order not to take into account lymphocyte-looking cells which are not TILs. As Figure 2.11 shows, such cells are numerous and may often trick the untrained eye. The important information in these cases is the overall location of the patch. Necrotic tumour cells appear like TILs, and the only thing that sets them apart from lymphocytes is their position relative to the stroma area. However, TILs may still be present in the tumour area, but the human eye, i.e. doctors, cannot always distinguish them. In order to say whether a cell is a TIL or a necrotic tumour cell, a different staining is needed. The same applies to ductal carcinoma, which appears near ductal regions.

Figure 2.11: Cancerous area from patient 60477, of the NKI dataset, with some necrotic tumour cells. In the right image, the green area indicates the tumour and the light blue indicates the stroma. Although some lymphocyte-looking cells might appear in the tumour area (marked with green squares to indicate predictions on tumours), it would be wrong to classify them as TILs, because TILs do not usually appear there. In contrast, the blue squares indicate the two possible TILs predictions that can actually be true. However, in areas such as these, even doctors cannot be sure using the aforementioned location intuition alone; to make sure, additional staining would be required.

It is worth mentioning that computational pathology (CPATH) has been an increasingly useful tool for doctors. CPATH involves the use of computer-aided systems for medical applications, which prioritises personalised treatment, limits manual labour and shortens lengthy and repetitive procedures. Under the auspices of CPATH, automated systems working on medical data have shown significant promise in many areas, including assisting assessments of WSIs. It remains, however, a challenging task. Any system that mistakes necrotic tumour cells for lymphocytes makes the same error as trained pathologists; similarly for predicting a TIL where lipocytes are. Therefore, being mindful of the location of the TILs predictions with respect to the stroma in the cancerous area is of the utmost importance, and automated systems need to learn to take that into consideration.

2.4 Medical data considerations

Unlike other types of data, medical information, such as images, scans or patient reports (blood sugar levels, hormones and others), is very sensitive. Medical information is not shared easily, due to privacy considerations, and hence there are special guidelines for acquiring and using it. To that effect, research applied to medical information takes considerably longer to obtain a licence or permission to share, and hence not much medical data is available. Compared to other applications, medical data is often limited for research, and any new solution proposed is tested on a small range of available data. Moreover, in order to acquire, for example, cell annotations, i.e. where a cell is located in an image, skilled professionals are required. Because doctors and physicians are often very busy, this lengthens the period of time needed to create a new medical dataset for research. Considering the high cost of medical information, both in terms of time and money, new approaches attempting to tackle a problem need to face issues such as generality of application and over-optimising to dataset-specific properties. When considering the detection of TILs, novel automated methods need to be aware of the following properties of medical imaging.

1. Not all dark spots are TILs, and TILs are not always dark spots, as Figures 2.11 and 2.8 show. TILs can be confused with fibroblasts, tumour cells, necrotic tumour cells or pink background with a black-spot artefact. It is important to differentiate between the different cells and avoid incorrect background predictions. However, no rule depending on the morphology of TILs holds universally. Some TILs do appear as dark black circles, but others may appear as more elongated ellipses, as very small spots resembling imaging artefacts, or even very blurred and diffused into the background. Due to the lack of a constant shape, position and appearance, it is increasingly difficult to detect TILs in any environment. That is why the important aspects in which TILs differ from other cells should be taken into consideration.

2. Scans may contain blurring or other artefacts, as illustrated in Figure 2.12. In some cases blurring can appear in an image because the laser scanning device failed to zoom properly into the area. Sometimes, spots covering the tissue occur due to a mistake in the cutting of the tissue, or in its placement in the scanner. Normally, such mistakes would not be considered when developing a new method, but because of the limited amount of medical data available, discarding them cannot always be afforded.



Figure 2.12: H&E WSI tissue sample from patient 60460, of the NKI dataset. Left: TILs located at the edge of the slide, where the scanning laser was unable to focus correctly. Right: area where a smudge on the lens appears as a large dark area on the tissue.

Figure 2.13: H&E WSI tissue samples from patient 60460, of the NKI dataset, magnified by a factor of 2^9, showing some mistakes in the annotations. Left: one annotation of a lymphocytic cell not being centred around the cell itself. Middle: multiple labels and an incorrect overlapping annotation (for the fibroblast circle label) caused by technicalities during the annotation process. Right: confusing rectangular and circular annotations for the same cell.

3. Human annotations are often inconsistent and naturally error-prone. When marking the cells in an image, it is very cumbersome to position an ellipsoid annotation at the exact centre of the cell, as can be seen in Figure 2.13. Due to the enhanced magnification level, a slight trembling of the hand might incur an erroneous overlapping annotation or a displacement of the original ellipse. These are minor technicalities in the program that collects the labels made by doctors. But due to the large number of cells to be identified, and the speed at which the human annotator goes through every cell, it is not reasonable to restrain the process further by requiring a manual review and correction for every single annotation. Therefore, some bounding box labels which are not centred are imperfect indicators of TILs. Moreover, because bounding box annotations do not border lymphocyte ellipses precisely, all annotations contain considerable background information as part of the label for every TIL.

4. Annotations for cells are not exhaustive. Because there are hundreds of thousands of cells in every WSI, it is practically impossible for a human pathologist to annotate every single cell. Therefore, as Figure 2.14 indicates, some areas are left un-annotated. This complicates the creation of a new method, which should not confuse the lack of annotations with a lack of lymphocytes.

When processing whole slides of H&E-stained patient images, technical considerations pose significant challenges given the amount of information each scan produces. In order to acquire sufficient resolution for a doctor to be able to magnify an area in the tissue, so that individual cells are clearly visible and discernible, high levels of magnification are required. Sometimes, the highest zoom level can require a magnification factor of more than 2^9. This depth of field, used to store information, can create a WSI which occupies about one gigabyte (GB) of memory space and is often more than 100,000 pixels in width/height. The large amount of information per single patient slide poses some unique challenges when processing medical images, unlike other images. With the current state of hardware, it is practically impossible to load and process a single whole slide at a time, in terms of memory limits, data loading time and processing bottlenecks alike.
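A back-of-the-envelope calculation makes the memory problem concrete. Assuming 8-bit RGB pixels (an assumption; the on-disk figure of about 1 GB quoted above reflects compressed storage):

```python
# Uncompressed memory footprint of one WSI at full magnification,
# using the dimensions mentioned in the text.
width = height = 100_000                 # pixels per side
bytes_per_pixel = 3                      # 8-bit RGB, uncompressed
raw_gb = width * height * bytes_per_pixel / 1e9   # 30.0 GB

stored_gb = 1.0                          # typical on-disk size (see text)
compression_ratio = raw_gb / stored_gb   # ratio implied by the two figures, ~30:1
```

So even though the file is about 1 GB, decoding the full slide would need roughly 30 GB of RAM, well beyond what a GPU (or most workstations) can hold alongside a model, hence the tiling described next.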



Figure 2.14: H&E WSI tissue sample from patient 60460, of the NKI dataset, magnified by a factor of 2^6, displaying a partially annotated area.

Figure 2.15: H&E WSI tissue sample from patient 60470, of the NKI dataset. The hand-drawn shapes with white boundaries on the left are drawn by the pathologist to indicate the cancerous area. On the right, a magnified location is shown, where the doctor has indicated two square regions with prevalent tumour presence (indicated in green) and one square region where there exists lymphocytic activity in the stroma (indicated in blue).

In order to deal with the large images, it is typical to split them into smaller patches of around 256 by 256 or 512 by 512 pixels each. In some medical tasks, a broad whole-slide perspective is an important step before diagnosis, such as finding the cancerous area in the slide. This is also done by pathologists at the whole-slide level before zooming in on more regional areas. After the aforementioned split of WSIs into 256 × 256 image tiles, acquiring such a broad perspective is no longer possible. Hence, a method may perform some analysis on a lower-resolution WSI at a reasonable size, for example 1000 × 1000 pixels, and then proceed to process the smaller tiles.
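The tiling step can be sketched in a few lines. This is a minimal illustration of the idea, not the preprocessing used in this thesis; real pipelines usually pad or overlap the edges rather than drop them, as this sketch does.

```python
import numpy as np

def tile_image(image: np.ndarray, tile: int = 256):
    """Split an H x W x C image into non-overlapping tile x tile patches.

    Edge regions smaller than the tile size are dropped in this sketch.
    Returns the patches and their top-left (row, col) coordinates.
    """
    h, w = image.shape[:2]
    patches, coords = [], []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            patches.append(image[y:y + tile, x:x + tile])
            coords.append((y, x))
    return patches, coords

# A 1024 x 768 region splits into 4 x 3 = 12 tiles of 256 x 256.
wsi_region = np.zeros((1024, 768, 3), dtype=np.uint8)
patches, coords = tile_image(wsi_region, tile=256)
assert len(patches) == 12
```

Keeping the coordinates allows predictions made per tile to be stitched back into whole-slide space afterwards.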

Many approaches that attempt to detect lymphocytes limit the scope of new automated methods to classifying small, manually defined patches of tissue as lymphocytic or not. Some other methods perform manual pre-processing of an image before applying their detection or classification method. Usually, localisation, i.e. defining where the cancerous area is and where to apply the new approach, is done manually, using the annotation of a doctor who delineates where the cancer is, as shown in Figure 2.15.

Another important factor to consider when processing H&E scans is the magnification levels themselves. Most magnifications are too small to be used for cell detection, since the individual cells are not easily seen, but there can be more than one candidate level that can reasonably be used for detecting TILs. Although the choice can be arbitrary for one slide, it is not for the whole dataset. Different slides have different magnification level associations due to technical variations during the scanning process. Therefore, it might be the case that two slides have the same magnification level, i.e. the same resolution, but the individual cells, of the same cell type and similar location, are of a different scale. It is important to emphasise the comparison of the same cell types and similar locations. This is imperative because lymphocytes can vary significantly in scale within a single slide, depending on their location and lifespan. Therefore, at different scales, cells like fibroblasts from one slide might be more similar to lymphocytes from another slide. To avoid this problem, magnification levels producing cells of the same scale should be carefully chosen.

2.5 Related work

The intent of automating the detection of cells and/or lymphocytes has been, and will remain for the near future, an active research field, with the goal of a method being as good at detecting TILs as doctors, without their intervention. The first methods of differentiating TILs used traditional image filtering, with darker colour being a decisive metric in the detection process [54]. Such approaches are very fast and simple to perform but cannot be perfect. Due to similarities and artefacts in the images, many dark spots would be falsely predicted as TILs in the whole slide. This means that such approaches require the active participation of pathologists during and after the methods are run, in order to filter out incorrect predictions. Albeit a good step in the right direction, due to their simple nature these first attempts mostly worked as assisting tools for doctors during prognosis [54, 55]. Moreover, every slide may have a differently stained colour range, with some slides shifting more to the red and some to the blue spectrum. In addition, some WSIs may contain random influences from the scanning device and illumination. These properties of WSIs render a single universal threshold value unable to perform well across many different slides.
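A minimal sketch of such an early thresholding detector makes the weakness concrete: every sufficiently dark connected blob is reported as a candidate, artefact or not. The grey-level cut-off of 80 is illustrative, as is the toy image.

```python
import numpy as np
from scipy import ndimage

def dark_spot_candidates(grey, threshold=80):
    """Naive dark-spot detector: every connected dark blob is a TIL candidate.

    `threshold` is an illustrative grey-level cut-off; in practice each slide
    would need its own value, which is exactly the weakness discussed above.
    """
    mask = grey < threshold                      # dark pixels
    blobs, n = ndimage.label(mask)               # connected dark regions
    centroids = ndimage.center_of_mass(mask, blobs, range(1, n + 1))
    return n, centroids

# Two equally dark blobs: a "lymphocyte" and a staining artefact. The
# threshold rule reports both -- it has no way to tell them apart.
img = np.full((64, 64), 200, dtype=np.uint8)
img[10:14, 10:14] = 30
img[40:44, 50:54] = 30
n, centroids = dark_spot_candidates(img)
```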

Various techniques have built on this method using more intricate decision criteria and combinations, in order to identify more indicative properties in the images which distinguish TILs more easily and generalise better across different slides [55, 56]. These additional modifications made the thresholding approaches more capable and useful to pathologists, but as discussed in 2.4 the colour of cells is not always consistent, even amongst TILs. Thus, meticulous review by doctors is still an essential requirement for the use of colour-inspired rule-based methods.

Further progress introduced cell shape and boundary considerations when segmenting cells, such as watershed deconvolution [56]. This method involves first choosing “sources” of different origin and then “flooding” regions with the colour of the source, separating items at the boundaries, given a set threshold. Cell template matching is combined with colour filtering of the image to identify local minima points. These points act as “basins”/sources for the region-boundary segmentation method [56]. When applied to TILs, this approach first requires a pre-processing step, such as the colour thresholding mentioned above, followed by the application of watershed, with manually tuned properties for H&E slides containing lymphocytes and other biological cells [57, 58]. This more advanced concept is still considered a simple approach to cell detection and can be somewhat effective; the need for inspection by pathologists is limited to a review of the results. However, because few parameters and adjustments are involved, the method makes plenty of assumptions in order to work. Therefore, manual tuning is needed for a good balance between what is considered a TIL and what is taken to be background. Because of this threshold, there is always a trade-off between how many TILs are missed and how many background areas are incorrectly considered lymphocytes. As such, these approaches remain complementary overview methods, with noisy results, after which pathologists still need to perform the diagnosis [59].
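The flooding idea can be illustrated with SciPy's `watershed_ift` (an image-foresting-transform watershed, standing in here for the exact variant of [56]): seeds are flooded outward in order of grey value, and touching regions are separated where the floods meet. The toy image, seed positions and marker values below are all illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

# Toy "stain intensity" image: two dark basins (candidate nuclei) separated
# by a bright ridge.
img = np.full((7, 11), 200, dtype=np.uint8)
img[2:5, 1:5] = 40     # basin 1
img[2:5, 6:10] = 40    # basin 2

markers = np.zeros(img.shape, dtype=np.int16)
markers[3, 2] = 1      # seed ("source") inside basin 1
markers[3, 8] = 2      # seed inside basin 2
markers[0, 0] = -1     # negative marker = background seed

# Flood outward from the seeds; each pixel ends up labelled with the marker
# whose flood reaches it first, so the two basins receive distinct labels.
labels = ndimage.watershed_ift(img, markers)
```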

Attempting to limit the intervention of doctors, by increasing the generality of methods and restricting manually defined criteria, more elaborate approaches have been introduced lately, namely neural networks (NNs). Neural networks model the functions and connections of the human brain, i.e. the activation of neurons, and are able to accumulate knowledge by processing large amounts of medical data. NNs fall under the category of machine learning approaches and in recent years they have outperformed almost all other approaches in many fields of research, including image classification, cell detection and many others [24, 26, 44, 60]. Previous machine learning methods include principal component analysis (PCA) [61, 62], linear discriminant analysis (LDA) [63, 64], and independent component analysis (ICA) [65], amongst others. They work by reducing the dimensionality of the data, or in other words, by compacting the information of an image into a few metrics [66]. However, like the earlier filtering approaches, these too require defining threshold values and rule-based systems in order to operate, resulting in the same drawbacks of generality and applicability mentioned above. NNs, on the other hand, are universal approximators of any latent function, are able to learn very complex image representations and are the de facto dominant methods used today [67–70]. Recent advances in object detection have propelled the convolutional neural network (CNN) architecture [70] to being the most popular NN architecture for object recognition, image classification and segmentation, cell detection and many others [6, 10, 12, 34–36, 71–73]. Neural networks have become the new frontier of biomedical information analysis in digital pathology, advancing the field of computational pathology (CPATH) with medical Artificial Intelligence (AI) [11, 20, 21]. Visual prognosis assistance, image evaluation, sequence indexing, pattern recognition and tumour detection have all seen new NN toolkits to assist doctors [11, 22]. Some practical applications involving NNs range from in vivo tumour detection [7, 17, 74, 75] to DNA sequence search [76, 77].

Focusing on the medical domain, image segmentation models such as the U-Net, introduced in the work of Ronneberger et al. 2015 [26], have been the state-of-the-art neural architectures used as baselines for many methods. However, due to the variability in how cancer cells appear, and due to their constant change in morphology and behaviour, it is often challenging for NNs to detect the right cancer nuclei and the surrounding micro-environment of cells and tissue [21]. Montalto and Edwards 2019 [15] highlight that a fully convolutional network can perform comparably to trained pathologists when there is little variation across new test cases, and recommend using NNs as an augmenting tool to improve diagnosis for aggressive tumour prognosis. This example emphasises one major drawback in medical images: NNs can become good at differentiating the heterogeneity of tumour cells but may not be able to detect the true biomarkers of cancer which pathologists look out for [6, 36]. To tackle the generality problem evident in unpredictable cell properties, state-of-the-art methods combine several different approaches, some of which entail manually defining threshold values and designing morphological features [23]. This methodology leads to highly optimised results for competitions such as cell segmentation and tracking, but fails to be applicable across the different scenarios of similar-looking images which H&E tumour areas exhibit across cancer types and locations [14].

It is important to note that when handling medical information every prediction should be reliable and, if possible, explainable [78, 79]. This is important reasoning before practically deploying a neural model for medical applications, due to the severity of the mistakes it can make and the accountability for errors in treatment [9]. Nonetheless, the labels of real data used to train these NNs can be error-prone due to noisy annotations or human fatigue, and NNs will “learn” the mistakes alongside the other information during training [14, 80]. In low-contrast data with low signal-to-noise ratio images, discerning cell boundaries becomes difficult even for human experts, as illustrated in Figure 2.16. Medical cohorts are also susceptible to a lack of sufficient training data, due to the sensitive nature of the images and the cost, in both time and money, of acquiring the annotations [16, 81]. All of the above increase the difficulty of reliably detecting biological patterns, and new NN approaches need to be able to overcome noisy labels and/or few training samples [82]. Such methods include the work of Hermsen et al. 2019 [83], who used a CNN model for histologic analysis in a multiple-tissue-class segmentation task for cortical regions of both healthy and pathologic renal tissue biopsies. They used a U-Net [26] architecture for image segmentation, combined with image augmentation of spatial and colour variations, to achieve state-of-the-art results.

In the work of Saltz et al. 2018 [8], computational staining is used with two separate convolutional neural networks (CNNs) to identify lymphocyte-infiltrated regions in digitised H&E stained tissue specimens. The first NN, called the lymphocyte CNN, is tasked with lymphocyte infiltration classification, and the other, the necrosis CNN, performs necrosis segmentation. More specifically, the lymphocyte CNN identifies image patches containing lymphocyte infiltration, trained by semi-supervision

Figure 2.16: Simulated nuclei representation of HL60 cells stained with Hoechst, from the Fluo-N2DH-SIM+ dataset listed in the ISBI cell tracking challenge. The rightmost image indicates the correct segmentation of the cells at the pixel level [23].



of labels generated from an unsupervised convolutional auto-encoder (CAE). The lymphocyte CNN is itself built from the encoder of the trained CAE, with a few additional encoding layers, and produces a probability that the patch is lymphocyte-infiltrated. The label of the patch is decided by simple thresholding: if the probability value is above a predefined threshold, the patch is classified as lymphocyte-infiltrated. The necrosis CNN is a DeconvNet model [84], which segments higher-resolution image patches into pixel-wise predictions of necrotic regions and is especially useful for eliminating false positives from the lymphocyte CNN in necrotic regions.

In the work of Swiderska-Chadaj et al. 2019 [44], a U-Net was used to segment cells of different cancer types in WSIs, with the addition of a spatial dropout layer to combat overfitting. The resulting pixel predictions were post-processed by smoothing with a Gaussian filter and detecting local maxima as the lymphocyte centroids, using a manually set neighbourhood size threshold. Of all the networks the group tried, U-Net performed best in lymphocyte detection. In the work of Shi et al. 2019 [85], a multi-task methodology is applied to strengthen the loss term by using weakly supervised auxiliary, or semi-derived, loss components. In more detail, network branching was designed to re-purpose the use of point annotations in a multi-task approach for both density maps and image segmentation. Shi et al. 2019 [85] tackle object counting by separating the gradient propagation to different neural neighbourhoods that are supplementary to the initial base network used for segmentation.
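The Gaussian-smoothing and local-maxima post-processing step can be sketched as follows; `sigma`, `min_distance` and `threshold` are illustrative stand-ins for the manually tuned values mentioned above.

```python
import numpy as np
from scipy import ndimage

def detect_centroids(prob_map, sigma=2.0, min_distance=5, threshold=0.5):
    """Centroid extraction from a pixel-wise probability map.

    Smooth with a Gaussian, then keep pixels that are maximal within a
    (2*min_distance + 1)-sized neighbourhood and above `threshold`; the
    neighbourhood size stands in for the manually tuned value.
    """
    smoothed = ndimage.gaussian_filter(prob_map, sigma=sigma)
    size = 2 * min_distance + 1
    local_max = ndimage.maximum_filter(smoothed, size=size) == smoothed
    peaks = local_max & (smoothed > threshold)
    return np.argwhere(peaks)

# A synthetic probability map with two confident detections.
prob = np.zeros((50, 50))
prob[10, 10] = 1.0
prob[30, 40] = 1.0
centroids = detect_centroids(prob, sigma=1.0, threshold=0.05)
```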

While segmentation approaches produce a pixel-wise classification able to delineate exact cell boundaries, they face two important caveats. First, every pixel becomes a prediction output, which accumulates over a large image [86]. The larger the image, the broader the overview of the network's reasoning, and the more the relative positioning of the cells is taken into consideration. This is important, for example, to differentiate TIL-looking cells in necrotic tumour areas from TILs in the stroma. The smaller the image, the more the network concentrates on the individual properties of a cell. However, for detecting tumour infiltrating lymphocytes, the relative position, and the broader perspective, is important. If the image to be segmented becomes too large, then computationally it poses an increasing challenge with diminishing returns [43, 87]. The second caveat of segmentation approaches is that per-pixel annotations are not always available and are more difficult and costly to acquire. Therefore, in the absence of comprehensive segmentation maps, approaches that classify an image at the pixel level cannot be used. Object detection, using predictions for the location and size of bounding boxes, tackles both of the above problems. A notable approach is the R-CNN network, which introduced region proposals for CNNs [88]. Using selective search [89] as image filtering in terms of colour, shape and texture, 2000 region proposals per image were generated as feature extraction. These features were then fed to a pre-trained AlexNet CNN architecture [90], combined with (1) a support vector machine (SVM) for object classification and (2) bounding box predictions for localisation, in order to predict different objects in the images (car, airplane, bird, etc.) [88]. Combining feature extraction as region proposals and then training a CNN to reject or displace probable object regions proved successful over other methods, achieving state-of-the-art results.
Nevertheless, due to the large number of region proposals to be processed, R-CNN takes a considerable amount of time to train. Moreover, the feature extraction method, the selective search algorithm, is fixed and does not adapt to new data. Therefore, the generality of this application is limited by the region proposal capability on specific datasets.

The subsequent development of Fast R-CNN by the same group, Girshick 2015 [91], constructs feature maps directly from the CNN itself and only then produces region proposals from the selective search algorithm [91]. Fast R-CNN trains faster than its predecessor by 89.5% (with an inference time improvement of 95.3%) by not feeding 2000 region proposals through the network. In the work of Ren et al. 2015 [92], the need for the selective search algorithm is eliminated altogether, alleviating the time and performance bottleneck it introduces [92]. Ren et al. 2015 [92] introduce Faster R-CNN, which represents a first step towards an end-to-end neural object detection architecture. The input image is fed through a CNN which produces feature maps, which are then fed into a second network to predict region proposals and bounding boxes [92]. Faster R-CNN improved inference time by about 91.3% over Fast R-CNN, which is fast enough to be used in real-time applications.

One disadvantage of the R-CNN variants is that regions are crucial for the object detection: the bounding boxes rely on having good region proposals, without being able to consider information from the complete image. This is a big problem when local and global image objects are equally important in object detection, as when detecting TILs. The You Only Look Once (YOLO) network/detection algorithm altered the dynamics of bounding box predictions using a single neural architecture, from raw image to object boxes, while at the same time considering information from all parts of the image [18]. This is accomplished by splitting the image where objects lay into a square grid of size S. Each grid cell outputs a number of possible bounding boxes, m boxes, centred within its area. Each bounding box output, out of the m boxes, contains two predictions for its location, x and y, the local coordinates of the centre within the grid cell, and two predictions for the global shape of the box, width and height. In addition to the four predictions about the location and shape, each box has one output prediction for its confidence, i.e. how certain the network is that there exists a box centred at this cell. Moreover, each bounding box output also contains class probability predictions for the box described by the x, y, width, height properties. If there are only two classes in the dataset, with the same dimensions per object, then the output of the CNN would be an image of shape S × S × 7, where S is assigned depending on the application, typically 32. The single-network approach to object detection makes the YOLO network even faster than Faster R-CNN, but it does introduce some assumptions which limit its function. The first assumption is that objects cannot share the same shape and centre. The way YOLO works with overlapping objects is by assuming different anchor boxes, which represent boxes of different scales of width-to-height dimensions. An example illustration can be found in Figure 2.17. For each possible scale, only one box is predicted per grid cell. That means that if there exist two distinct but overlapping objects with the same centre and similar size, only one of them can be detected. This can become a serious problem if the S × S grid is too small, making large grid cells which might contain many centres of objects. Increasing the grid size reduces the effect of such occurrences but makes training more difficult, due to multiple equally probable centres of objects. This is a spatial constraint of the algorithm which becomes a trade-off between detecting too small or too large objects in the image. In addition, the constraint of anchor boxes limits the predictive capabilities of the algorithm to the sizes manually provided to it, and if there exist objects of different scales within the same class, this might confuse the method. The final generalised prediction of the network, given anchor boxes B and number of classes C, is S × S × (B ∗ 5 + C).

Figure 2.17: YOLO network image processing example using a cell grid, on an H&E sample image [18].
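The S × S × (B ∗ 5 + C) output layout can be made concrete with a small decoding sketch. The grid size, image size and channel ordering below are illustrative assumptions, not the exact YOLO configuration.

```python
import numpy as np

S, B, C = 7, 2, 1     # grid size, anchor boxes, classes -- illustrative values
IMG = 448             # input image side in pixels, also illustrative

def decode_cell(output, row, col, box):
    """Turn one grid cell's box prediction into absolute pixel coordinates.

    output has shape S x S x (B*5 + C); each box stores (x, y, w, h, conf),
    with x, y local to the grid cell and w, h relative to the whole image.
    """
    x, y, w, h, conf = output[row, col, box * 5: box * 5 + 5]
    cell = IMG / S
    cx = (col + x) * cell          # shift the local offset by the cell origin
    cy = (row + y) * cell
    return cx, cy, w * IMG, h * IMG, conf

out = np.zeros((S, S, B * 5 + C))
out[3, 4, 0:5] = [0.5, 0.5, 0.1, 0.2, 0.9]   # one confident box in cell (3, 4)
cx, cy, w, h, conf = decode_cell(out, 3, 4, 0)
```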

In the work of van Rijthoven et al. 2018 [19], YOLLO (You Only Look at Lymphocytes Once) is introduced, which applies the principle of YOLO, introduced by Redmon et al. 2016 [18], and adapts it for medical images (CD3 stained WSIs) and quick lymphocyte detection, through an end-to-end neural architecture avoiding the split into separate classification and detection algorithms. YOLLO comprises 7 convolutional layers, combined with batch normalisation, rectified linear units (ReLU) and average pooling, without any fully connected layers. Similarly to the YOLO(v1-3) architectures, the final layer contains five channels, corresponding to the predictions for the x, y, width and height displacement within a grid cell, and the confidence of the prediction. The difference from the YOLO network is that YOLLO only predicts lymphocytes and does not require the class prediction output. YOLLO exhibited high-performance cell detection, with simple convolutional layers, on WSIs stained with immunohistochemistry. Admittedly, this staining highlights lymphocytes and distinguishes them from the rest of the tissue with a distinct yellow colour. However, YOLLO indicates that TILs can be reliably detected by a simple network architecture, and this report took some inspiration from the work of van Rijthoven et al. 2018 [19] as well as from the work of Redmon et al. 2016 [18].

2.6 Contributions

The intention of the work presented in this report is to detect tumour infiltrating lymphocytes. However, the goal is twofold:

• Apply lymphocytic cell detection in H&E slides.

• Automate processing of WSIs with little to no intervention from pathologists.

Previous approaches to object detection have used images of cars, dogs, airplanes etc. to recognise different objects in everyday environments. Most of these objects are distinct from each other, like a car and a dog, and only a small number of classes bear high resemblance, e.g. a dog and a cat. Since it is almost immediately obvious to the human eye what dogs and cats look like, object detection has been able to perform at near-human performance. In medical data, however, it is not always the case that a cell is clearly distinct to the human eye. In previous approaches, as in the work of van Rijthoven et al. 2018 [19], slides stained with immunohistochemistry, using CD3 and CD8 antibodies, make the desired cells, lymphocytes, stand out in their environment, which makes them easy for pathologists to detect. In the case of H&E slides, that is not always so. While some lymphocytes stand out in some regions, most often they do not, and doctors need a broader inspection of the surroundings to be able to assert the cell type. For some regions this is entirely impossible, such as for necrotic tumours near stroma, for which different staining, such as CD3-based immunohistochemistry, is needed to help the diagnosis. Applying an object detection algorithm to H&E slides is therefore not trivial. Previous approaches using data from the TCGA (The Cancer Genome Atlas) dataset, which contains H&E slides, were constrained to classifying a patch of tissue as lymphocytic (containing lymphocytes) or not. This classification approach simplifies the task and differs significantly from finding the precise location of each TIL [8].

Automating the inspection of WSIs without the need for human correction is a necessary condition for any medical tool of this scope to be useful. Medical professionals have a demanding time schedule and being able to provide fast and meaningful information about a patient can save them effort and time which can in turn be spent on the treatment of the patient.

With these goals in mind, H&E slides from 16 patients were collected by the Antoni van Leeuwenhoek hospital / Netherlands Cancer Institute (Nederlands Kanker Instituut, NKI) and analysed in this work. These data exhibit all the properties discussed in 2.4. The aspect this work focuses on is the lack of exhaustive labels for the annotated regions that are used. As discussed earlier, this is a unique problem for H&E images and a most common occurrence in practical applications. Consequently, the research questions that are investigated are as follows:

• In the case of low fidelity information, can we restrict the loss propagation of false negative predictions for unlabelled areas?

• Is it possible to compact the features of classification and detection networks into a smaller, more efficient and faster architecture?

Investigating different network architectures to solve the problem, relaxing some constraints of previous methods, and implementing new restrictions on the loss function amass our contributions, summarised as follows:

1. Introduction of a novel network design that is smaller and faster than related models, with increased performance.

2. TIL detection on H&E slides outperforming state-of-the-art models by 5.6%.

3. Introduction of a soft margin loss which allows NNs to be trained successfully on non-exhaustively annotated regions with improved results over traditional loss functions.


CHAPTER 3

Methodology

3.1 Soft margin loss

The base loss function for object detection following the end-to-end neural architecture approach, used by this work as a starting point, is the loss introduced by Redmon et al. 2016 [18] for the YOLO model. This loss function operates with a grid of bounding box predictions and is expressed as

$$
\begin{aligned}
\mathcal{L}_{yolo} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left[ p_i(c) - \hat{p}_i(c) \right]^2
\end{aligned}
\tag{3.1}
$$

where λ_coord and λ_noobj are the weighting factors for the loss terms for the objects (denoted by “obj”) and background (no objects, denoted by “noobj”) in the image, respectively [18]. S is the width of the square grid of cells for predictions, B is the number of anchor boxes and x, y, w, h are the location predictions of the bounding boxes per cell. C is the confidence of the box prediction and p(c) is the class probability. An important assumption in the use of this loss function is that any area in which no object exists, according to the labels, is considered background (no object) and penalises the confidence, if it is above zero, for that cell. The loss function, however, operates with any definition of object and no-object areas, and this is an important step towards dealing with low fidelity information for unlabelled areas in the image.
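As a sketch, the single-class form of this loss (the YOLLO simplification with C = 1, where the class term drops out) can be written directly in NumPy; `obj_mask` plays the role of the indicator 1_ij^obj, and the default λ values follow Redmon et al. 2016. This is an assumed minimal implementation, not the exact training code of this work.

```python
import numpy as np

def yolo_loss(pred, target, obj_mask, lam_coord=5.0, lam_noobj=0.5):
    """NumPy sketch of eq. (3.1) for a single class (class term dropped).

    pred, target: (S, S, B, 5) arrays holding (x, y, w, h, confidence),
    with w, h assumed non-negative; obj_mask: boolean (S, S, B) encoding
    the indicator 1_ij^obj.
    """
    obj, noobj = obj_mask, ~obj_mask
    xy = ((pred[..., 0] - target[..., 0]) ** 2
          + (pred[..., 1] - target[..., 1]) ** 2)
    wh = ((np.sqrt(pred[..., 2]) - np.sqrt(target[..., 2])) ** 2
          + (np.sqrt(pred[..., 3]) - np.sqrt(target[..., 3])) ** 2)
    conf = (pred[..., 4] - target[..., 4]) ** 2
    return (lam_coord * (xy[obj].sum() + wh[obj].sum())
            + conf[obj].sum()
            + lam_noobj * conf[noobj].sum())

# A perfect prediction incurs zero loss.
target = np.zeros((4, 4, 2, 5))
target[1, 1, 0] = [0.5, 0.5, 0.1, 0.1, 1.0]
obj_mask = np.zeros((4, 4, 2), dtype=bool)
obj_mask[1, 1, 0] = True
perfect = yolo_loss(target.copy(), target, obj_mask)
```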

The approach implemented by van Rijthoven et al. 2018 [19] for the YOLLO network for lymphocytes simplifies equation (3.1) by setting C = 1, thus ignoring the last loss term for the class probabilities. This change is also retained for the following implementations.

When dealing with lymphocytes in medical images annotated by physicians, the biggest obstacle is that information about the background is not guaranteed. As Figure 2.15 indicates, the labelled cells are assumed to be ground truth and a true representation of their labels, but the absence of labels for the rest of the image cannot always be considered background for use as no-object areas in the existing loss function. If the assumed “background” (unlabelled areas) is always treated as such but weighed less in the loss function, by decreasing the λ_noobj factor to indicate lower-priority information compared to the other parts of the equation, the problem is still not solved. What this change causes is an over-prediction of TILs in the background area, with many false positive mistakes, even in pink background, as shown in Figure 3.1. Increasing the lambda factor to find the trade-off peak point instead introduces more incorrect back-propagation information for the network, and fewer true positives are predicted. In medical applications, mistakes of over-predicting or under-predicting crucial information are severe, since they lead to an incorrect diagnosis and to either performing or withholding an important, expensive and lengthy treatment. It is essential to mention that, although exhaustive annotations are not always provided, there is considerable information about other important structures in the tissue, such as labels for tumour cells,



Figure 3.1: Initial prediction results, with the YOLO loss, on the WSIs of patients 60460 (left) and 21813 (right). On the left, the problem of over-prediction is illustrated, where some false positive mistakes are made. This example shows the results after tuning the loss parameters and training the YOLLO network; it represents one of the best responses possible at this stage. However, when inspecting more difficult regions (right), there are prohibitively many false positive predictions. Note: areas that contain no predictions (bottom left, bottom right and top left parts of the left image) were not fed to the network, because regions of interest were defined by snapping the provided annotations to the nearest 256 × 256 tiles.

Figure 3.2: Initial prediction results, with the YOLO loss, on the WSI of patient 60460, with the same loss parameters as in the work of Redmon et al. 2016 [18]. This example indicates how severe noisy labels can be when the weight factor for the non-lymphocyte annotated areas is too high. Note: areas that contain no predictions (bottom left, bottom right and top left parts of the image) were not fed to the network, because regions of interest were defined by snapping the provided annotations to the nearest 256 × 256 tiles.

red blood cells, fibroblasts and many others. When detecting TILs, these labels are not taken into account and are considered as no-object areas during training.

To prevent the constant association of lack of labels with no-object information, the following definitions are considered:

• Lymphocyte annotations are considered hard positive labels and are assigned to the set “objects”.

• Annotations for non-lymphocytic cells and structures are considered hard negative labels and are assigned to the set “no objects”.

• Areas not containing labels are considered weak negative labels and are added to the set “background”, but are not used for training as they are. Three steps of threshold conditions and a marginal loss investigation filter out background information from possible unlabelled lymphocytes, and only the background areas are back-propagated.



In order to limit the propagation of unlabelled lymphocytes, in the “background” set, as background patches, a restriction is set to only propagate gradients of bounding boxes whose positive class prediction is lower than a condition rule. Formally, let there be K classes and class probability scores {p(i = 1), p(i = 2), p(i = 3), . . . , p(i = K)}, ∀i ∈ S², where p(i = c), for c ∈ K, indicates the probability score for class c of grid cell i. Under high fidelity information, the loss propagates with a non-zero value when c ≠ c_label. In the case of weak annotations with low fidelity information, the propagation criterion can be made stricter by the addition of the following condition for propagation:

$$
\left[ p(i = c_{max}) - p(i = c_{max-1}) \right]^{t_{temp}} < 1 - \epsilon
\tag{3.2}
$$

where p(i = c_max) is the probability score of the maximum-scored class (argmax), p(i = c_max−1) is the second highest class probability score, t_temp is the temperature parameter (increasing over time) and ε is the margin parameter. This definition propagates fewer certain TIL predictions and more background and other cell annotations in the “background” set.

In the case of one class of interest (lymphocytes), all other class probabilities indicate not just other cells (annotated or not) but also background (stroma: pink tissue area). Therefore, further simplification allows the output prediction confidence to be considered as a score metric for lymphocytes. Under this assumption, given the estimated TIL probability, the condition of propagating only in areas of low TIL probability translates to not propagating the loss for highly certain TIL scores. This means that the condition expressed in (3.2) can be simplified to:

$$
p(i = c)^{t_{temp}} < 1 - \epsilon
\tag{3.3}
$$

The “background” set is thus processed as follows:

1. Negative sampling is applied, with random coefficient a = 0.5, from a uniform distribution for the “background” set of locations without labels, forming a “negative samples” set. This step ensures that lymphocytes are not always considered as negative instances and helps to disassociate the unlabelled lymphocytes from background.

2. Confidence values are used as a general reference for the certainty of the model that a cell contains a lymphocyte. Under the assumption that high confidence scores indicate lymphocytes, and low scores indicate their absence (either background or other unannotated cells excluding TILs), a threshold operation with $t_{neg} = 1 - \epsilon$, following condition (3.3), is applied to the “negative samples” set based on the predicted confidence scores. The remaining areas after this step form the “background no objects” set. This step is very important because it excludes confident lymphocyte predictions from being propagated as negative areas, which would train the network with the wrong information.

3. Because of the nature of the weak negative labels, some areas containing lymphocytes that do not resemble a typical lymphocytic cell, as described in 2.4, may still remain in the “background no objects” set. To further improve the back-propagation step, the focal loss [93] is applied per sample to the confidence scores of predictions in the “background no objects” set, with focal hyper-parameters $\gamma_f$ and $\alpha_t$, $\forall t \in B_{NO}$, where $B_{NO}$ denotes the “background no objects” set.
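The three steps above can be sketched as follows. This is an illustrative implementation only: the function and variable names are hypothetical, and the focal-loss form assumes the standard $-\alpha_t\, p^{\gamma_f} \log(1 - p)$ weighting for negative targets, which may differ in detail from the thesis' exact formulation.

```python
import numpy as np

def process_background(confidences, eps=0.1, a=0.5, gamma_f=2.0, alpha_t=0.25,
                       rng=np.random.default_rng(0)):
    """Sketch of the three background-processing steps.

    confidences: predicted lymphocyte confidence per unlabelled grid location.
    Returns the per-location focal losses for the "background no objects" set.
    """
    # Step 1: negative sampling -- keep each unlabelled location with
    # probability a, so unlabelled lymphocytes are not always negatives.
    keep = rng.random(confidences.shape) < a
    negatives = confidences[keep]

    # Step 2: drop confident lymphocyte predictions (score above 1 - eps),
    # mirroring condition (3.3); the rest form "background no objects".
    bno = negatives[negatives <= 1.0 - eps]

    # Step 3: focal loss on the remaining weak negatives (target conf. 0),
    # down-weighting easy negatives so residual atypical TILs hurt less.
    focal = -alpha_t * bno ** gamma_f * np.log(np.clip(1.0 - bno, 1e-8, None))
    return focal

losses = process_background(np.array([0.95, 0.6, 0.2, 0.05, 0.98, 0.3]))
```

Locations with confidence above the threshold (e.g. 0.95, 0.98 here) never contribute a negative loss, regardless of the sampling outcome.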

Together, the above considerations lead to the design of a new loss function, the soft margin loss (SML), given as

$$sml(p_i(c), \hat{p}_i(c)) = \begin{cases} 0 & \text{if } p_i(c)^{t_{temp}} > 1 - \epsilon \\ fl\left[p_i(c) - \hat{p}_i(c)\right]^2 & \text{otherwise} \end{cases} \tag{3.4}$$

where $p_i(c)$ is the class probability score of the prediction as in (3.1), $fl$ is the focal loss transformation, and $\epsilon$ is the margin hyper-parameter to be tuned.
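A minimal sketch of (3.4) in Python follows. It assumes the focal transformation is the standard focal down-weighting $\alpha_t\, p^{\gamma_f}$ applied to the squared confidence error; this is an interpretation of $fl$, and the parameter values are illustrative rather than the thesis' tuned settings.

```python
def sml(p, p_hat, eps=0.1, t_temp=1.0, gamma_f=2.0, alpha_t=0.25):
    """Soft margin loss (3.4) for a single unlabelled location.

    p: predicted TIL confidence; p_hat: target confidence (0 for weak
    negatives). The focal weighting here is an assumed form of `fl`.
    """
    # Confident TIL predictions are excluded from the loss entirely,
    # so probable unlabelled TILs are never trained as background.
    if p ** t_temp > 1.0 - eps:
        return 0.0
    # Otherwise apply a focal-weighted squared error on the confidence.
    return alpha_t * (p ** gamma_f) * (p - p_hat) ** 2

print(sml(0.95, 0.0))      # 0.0: confident prediction, not penalised as background
print(sml(0.30, 0.0) > 0)  # True: uncertain prediction contributes a small loss
```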

All the aforementioned changes to the original loss function in equation (3.1) deal specifically with the absence of labels in object detection on medical data with non-exhaustive annotations of regions of interest. It is important to emphasise that treating the confidence as an indication of certainty for TILs in a cell is only valid under the assumption that the image contains a single class. Hence, the modifications above can also work in other domains with noisy labels of a similar type, but for one class. In Section 3.2 this assumption is relaxed, making the final function applicable to any number of classes. The resulting loss function is expressed as


$$
\begin{aligned}
\mathcal{L}_{sml_1} ={}& \lambda_{xy} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{wh} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \lambda_{obj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} (C_i - \hat{C}_i)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{noobj}_{ij} (C_i - \hat{C}_i)^2 \\
&+ \lambda_{BNO} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{BNO}_{ij} \, sml(C_i, \hat{C}_i)
\end{aligned}
\tag{3.5}
$$

where $sml$ indicates the soft margin loss (SML) adaptation, as described above, and $\lambda_{xy}, \lambda_{wh}, \lambda_{obj}, \lambda_{noobj}, \lambda_{BNO}$ are weighing hyper-parameters to be tuned. Confidence scores, $C$, were used as probability score estimates, as discussed above. Under the SML loss, network training becomes more precise in defining hard positive, hard negative and weak negative labels compared to previous work. This provides a clear advantage when tuning the method for the task at hand.

Compared to the original YOLO loss, $\mathcal{L}_{sml_1}$ (3.5) has the same restrictions regarding spatial constraints for objects of varying size [18]. However, to simplify the loss function further for lymphocyte detection specifically, an additional assumption is considered:

• Given the same magnification level of the WSIs, the sizes of lymphocytes are similar.

In this case, a manually defined width and height is used for the intersection-over-union calculation [18], since the size of TILs can be computed analytically. In addition, the $\lambda_{wh}$ parameter is set to zero and the number of anchor boxes needed for detection is set to one (Section 2.5). The output prediction layer of the network can therefore be reduced to $S \times S \times 3$, for only the $x$, $y$ and $C$ predictions, compared to $S \times S \times (B \cdot 5 + C)$ as before.

One very important consideration for the constant cell size adaptation is that labels for lymphocytes, as shown in Figure 2.13, are seldom precise to the cell boundary. This means more background is included in a TIL annotation than in typical object detection settings. By limiting the predictions to the locations of the lymphocytes alone, and allowing a larger constant window size, the ambiguous or contradictory information between lymphocytes in the tissue and the label annotations has an even smaller negative effect on training. This provides more robustness and invariance to the noisy labels naturally assigned by pathologists, and allows the approach to be used in more practical, real-world applications. The concluding SML loss function is thus described as

$$
\begin{aligned}
\mathcal{L}_{sml_2} ={}& \lambda_{xy} \sum_{i=0}^{S^2} \mathbb{1}^{obj}_{i} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{obj} \sum_{i=0}^{S^2} \mathbb{1}^{obj}_{i} (C_i - \hat{C}_i)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \mathbb{1}^{noobj}_{i} (C_i - \hat{C}_i)^2 + \lambda_{BNO} \sum_{i=0}^{S^2} \mathbb{1}^{BNO}_{i} \, sml(C_i, \hat{C}_i)
\end{aligned}
\tag{3.6}
$$

At this stage, the SML loss targets single-class objects (TILs) of the same size, specifically under noisy background labels. Although it is a less general definition, $\mathcal{L}_{sml_2}$ (3.6) has considerable performance advantages over $\mathcal{L}_{sml_1}$ (3.5), due to the lower number of possible bounding box predictions, without introducing negative effects.
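A sketch of how $\mathcal{L}_{sml_2}$ (3.6) could be evaluated over a prediction grid is given below. The function signature, mask layout and the focal form inside the SML term are illustrative assumptions, not the thesis' implementation:

```python
import numpy as np

def l_sml2(pred, target, masks, lambdas, eps=0.1, gamma_f=2.0, alpha_t=0.25):
    """Sketch of L_sml2 (3.6).

    pred, target: arrays of shape (S*S, 3) holding (x, y, C) per grid cell.
    masks: dict of boolean (S*S,) arrays for the 'obj', 'noobj' and 'BNO' sets.
    lambdas: dict of weighing hyper-parameters keyed 'xy', 'obj', 'noobj', 'BNO'.
    """
    x, y, c = pred[:, 0], pred[:, 1], pred[:, 2]
    xt, yt, ct = target[:, 0], target[:, 1], target[:, 2]

    # Localisation and confidence terms over cells containing an object.
    loss = lambdas["xy"] * np.sum(masks["obj"] * ((x - xt) ** 2 + (y - yt) ** 2))
    loss += lambdas["obj"] * np.sum(masks["obj"] * (c - ct) ** 2)
    # Confidence term over annotated empty cells.
    loss += lambdas["noobj"] * np.sum(masks["noobj"] * (c - ct) ** 2)
    # Soft margin term: confident TIL predictions contribute zero,
    # the rest get a focal-weighted squared confidence error.
    below_margin = c <= 1.0 - eps                  # condition in (3.4)
    sml_term = alpha_t * c ** gamma_f * (c - ct) ** 2
    loss += lambdas["BNO"] * np.sum(masks["BNO"] * below_margin * sml_term)
    return loss
```

Because the $B_{NO}$ term is masked by the margin condition, a grid cell with confidence above $1 - \epsilon$ adds nothing to the loss even when it falls inside the “background no objects” set.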

3.2 YELL: You Efficiently Look at Lymphocytes

Traditional object detection algorithms follow one of two general ideologies:

• Proposal of regions (first network) and classification of the acquired selections (second network) [88, 91, 92]. These types of approaches require two independent large networks, trained either end-to-end or one at a time. Training is lengthy, with the disadvantage that the second stage, which classifies the regions and produces bounding box predictions, cannot consider the full image (limited field of view). The advantage of these methods, such as Faster R-CNN [92], is the ability to focus on two separate tasks independently and improve at each task separately.
