
MSc Artificial Intelligence

Master Thesis

Deep Siamese Real-Time Tumor Tracking

for Lung Cancer Patients

by

Dragos-Ioan Grama

12282693

August 12, 2020

48 ECTS 11/2019-07/2020

Supervisor:

Dr. Deepak K. Gupta

External Supervisor:

Dr. Wilko Verbakel

Ward van Rooij (MSc)

Examiner:

Prof. Dr. Cees G.M. Snoek

University of Amsterdam


Abstract

Accurate markerless tumor tracking is an essential tool for efficient radiotherapy treatment delivery because it preserves healthy tissue from radiation damage. Current tracking methods used in clinical practice are prone to erroneous predictions for small tumors due to low tumor contrast. In this work, we investigate the effectiveness of deep learning methods and propose a deep siamese model for real-time markerless tumor tracking. Since there is a discrepancy between the training and validation datasets, we also propose a novel method to reduce the gap between the two data distributions. Our siamese model outperformed the chosen baselines, and we also observed an improvement when using the proposed data translation strategy. Furthermore, we offer an analysis of different model versions and identify a negative aspect of using padding in deep siamese networks. Finally, our proposed method was shown to be suitable for markerless tumor tracking in lung cancer patients.


Contents

1 Introduction
  1.1 Contributions
  1.2 Thesis Structure

2 Related Work
  2.1 Background
    2.1.1 Object Tracking in Computer Vision
    2.1.2 Deep Learning for Object Tracking
    2.1.3 Deep Siamese trackers
  2.2 Markerless Tumor Tracking Methods

3 Methodology and Datasets
  3.1 Data acquisition
  3.2 Dataset Generation
    3.2.1 Planning CT registration and deformation
    3.2.2 DRR generation
  3.3 DRR to X-ray transformation
    3.3.1 Super Resolution GAN
    3.3.2 Cycle GAN
  3.4 Proposed Tracking Method
    3.4.1 Input image shapes
    3.4.2 Separated class/bbox/mask depth-wise cross correlation
    3.4.3 Anchor-free siamese tracker
    3.4.4 Loss functions
    3.4.5 Architecture
    3.4.6 Training
    3.4.7 Inference

4 Experiments
  4.1 Research Questions
  4.2 Tracking model setup
    4.2.1 Baselines
  4.3 Feature Extractor Network (Backbone)
  4.4 Padding in Siamese networks
  4.6 Implementation Details

5 Results and Discussion
  5.1 Siamese model vs Baselines
    5.1.1 Digitally Reconstructed Radiographs (DRRs) analysis
    5.1.2 Kilovoltage (kV) images analysis
  5.2 Comparing different model versions
    5.2.1 (Un)Suitable Backbones
    5.2.2 Anchor-free / anchor-based / anchor-free + mask
  5.3 Does padding affect siamese training?
  5.4 Effectiveness of GAN-based data translation method
  5.5 Extra observations

6 Conclusion
  6.1 Limitations
  6.2 Future Work

7 Appendix
  7.1 Siamese structural details
    7.1.1 Network backbones
    7.1.2 Network heads
  7.2 GAN architectures


Chapter 1

Introduction

Lung cancer is the most common cause of cancer deaths worldwide, causing more deaths than breast, prostate, colorectal, and brain cancers combined [1]. Scientists are constantly working on discovering new methods and improving existing ones in order to get the best outcomes from treatment plans. Surgery would be the most desirable treatment option, but unfortunately about 80 percent of lung cancer patients are inoperable, in which case the primary treatment method is radiotherapy (RT), often combined with chemotherapy [2]. Radiotherapy is involved in the treatment of about 50 percent of all patients, making it a very important treatment modality.

The most common type of RT is external beam radiotherapy, which consists of delivering the radiation to the tumor from outside the body [3]. Irradiated cells will eventually die from genetic damage unless the damage can be repaired. Although healthy cells, which are less sensitive to radiation, are more likely to recover, some of them will inevitably die, leading to other possible health issues.

RT has experienced massive progress over the years in both technological and biological aspects, and as radiation delivery becomes more and more precise, it is extremely important to have a means of accurately localizing the tumor mass and organs-at-risk in order to optimize the radiation delivery to the tumor and avoid the surrounding healthy tissue. In the treatment planning stage, the challenge of accurate tumor localization was overcome by the integration of 3D imaging such as CT, MRI and PET scans, which are able to provide an accurate visualization of the tumor volume [4]. Unfortunately, these methods are too time consuming and therefore unsuitable for the treatment delivery phase. For this reason, clinicians commonly use kilovoltage X-ray imaging (kV X-rays) for real-time image acquisition [5]. This method offers high resolution images but may suffer from insufficient soft-tissue contrast, making accurate tumor localization difficult, especially due to bone occlusion (e.g. ribs).

One novel treatment delivery method is volumetric modulated arc therapy (VMAT), where the gantry continuously rotates around the patient during radiation delivery. The method currently most used by clinicians for tumor tracking (determining the position in 3D) consists of template matching of individual kV images with templates generated from previously acquired 3D data (CT/MRI/PET) by means of normalised cross correlation, which gives a 2D position, followed by triangulation with previously obtained 2D positions in order to obtain a 3D location. For some patients this works well, but if the tumor is small and the absorption of all other tissue in the beamline is high, there is hardly any contrast of the tumor in the kV image and template matching may result in an incorrect match [6]. A possible solution involves implanting metallic fiducial markers in, or near, the tumor mass, which will appear in the kV image as distinctive white markers. However, this method comes with several downsides: the implantation can be harmful for the patient [5], and fiducials can generate metal artifacts which may cause target occlusion as well as treatment-planning errors [7]. For these reasons, we need a better method to detect the tumor in kV images and allow tracking during gantry rotation. We emphasize that if there is little to no contrast in an individual image, the algorithm should be able to use the information from previous matches in order to make a prediction.

Deep learning is a promising solution for object detection and tracking in natural images, and its application in the medical domain could therefore be an interesting alternative to template matching. Due to the large variety of tumor sizes, locations and shapes amongst patients, training a model which generalizes over multiple subjects is a difficult task; instead, we adopt a patient-specific training strategy. Another challenge of training a deep model for tracking on X-ray images is the lack of annotated datasets, which makes supervised learning infeasible. Most works involving deep learning [8, 9, 10, 11, 12] have used the patient's treatment planning 4DCT scans in order to generate Digitally Reconstructed Radiographs (DRRs) for training and testing. The 4DCT scans had also undergone prior augmentation steps (translations, rotations, deformations) for simulating different inter- and intrafractional tumor variations. While these methods generate enough data for a patient-specific training strategy, there still remains a gap between DRRs and X-Rays in terms of image resolution, contrast and noise, which we try to overcome using the method described in Section 3.3.

1.1 Contributions

In this work we analyse a new approach for markerless tumor tracking based on deep siamese neural networks [13, 14]. Unlike most existing works, which use a tracking-by-detection approach, our method employs a tracking-by-similarity-score strategy and a more localized search space. We also experiment with different network versions and investigate the challenges that our proposed siamese model faces when applied to the tumor tracking task. Finally, we also propose a new method for reducing the gap between DRRs and X-Rays. Our main contributions are the following:

• A different approach for real-time markerless tumor tracking that follows the strategy of novel tracking methods based on deep siamese networks.

• The integration of an anchor-free technique for bounding box regression into our proposed siamese model.

• An analysis of the negative impact padding has on deep siamese networks for the tumor tracking task.

• A novel method based on generative neural networks for narrowing the gap between training (DRRs) and inference (X-Rays) datasets.

Our results suggest the proposed siamese method is suitable for the tumor tracking task, obtaining overall better results than our chosen baselines [9, 11]. We have also seen an improvement of the anchor-free approach over the standard anchor-based strategy. Based on the results obtained in our ablation experiment described in Sec. 4.4, we raise a concern about the use of padding in deep siamese networks for tasks where the contrast between the object being tracked (foreground) and the surrounding area (background) is very low. Finally, our proposed GAN-based data translation method indicates an improvement for tracking on kV images; however, its analysis is quite limited and further research is needed.

1.2 Thesis Structure

This thesis is structured as follows: in Chapter 2 we present an overview of the most common methods for object tracking in computer vision, where we mostly focus on describing the category of siamese trackers, as well as some of the deep learning tumor tracking methods most relevant to our work; in Chapter 3 we present our dataset generation pipeline, which includes the proposed DRR to X-Ray translation method, as well as an in-depth description of the siamese model. We formulate our main research questions in Chapter 4 and also outline the experiment design choices meant to answer them. Chapter 5 presents the results obtained from conducting the experiments described in Chapter 4, along with our interpretation of the obtained outcomes. In Chapter 6 we present an overview of the results interpretation, outline the limitations of our work and suggest future directions of research. Lastly, Chapter 7 gives further details about the implemented network architectures and some visual representations of the model's performance.


Chapter 2

Related Work

At a very intuitive level, the challenge we are facing can be viewed as trying to find the center coordinates of a single tumor in each frame of a medical video, without having access to any similar dataset with ground-truth labels, as previously explained in Section 1. Therefore, we place our research in the field of single object tracking and in this section we will briefly describe some of the most relevant work conducted in this domain.

In Sec. 2.1 we first discuss the general approach to object tracking in computer vision (in 2.1.1), then move towards modern deep learning based algorithms (in 2.1.2) and finally describe the class of deep siamese object trackers (in 2.1.3), which is the core of our research.

In Sec. 2.2 we present some of the existing work conducted for markerless tumor tracking (mainly using deep learning), focusing mostly on the methods that are similar to ours w.r.t. data generation pipeline and the type of networks used.

2.1 Background

2.1.1 Object Tracking in Computer Vision

Object detection and tracking is an intensively researched area in computer vision, with many practical applications such as traffic detection, self-driving cars, face detection, video surveillance, etc. The tasks that object tracking methods are trying to solve can be categorized according to different factors: according to the number of objects, the field is divided into single- and multiple-target tasks; according to the amount of accessible information, the methods can be divided into (semi-)supervised and unsupervised learning solutions; based on the background, tasks can be divided into static and dynamic backgrounds [15]. Object tracking (OT) algorithms can also be separated into batch and online methods [16]. While batch algorithms have access to information from future frames, online methods use only present and past information. The tumor tracking challenge we are trying to solve falls under the category of real-time online single-object tracking with a dynamic background, for which we adopt a (semi-)supervised learning solution.

Typically, there are 3 steps for tracking an object: object detection, object classification and object tracking, all of which can be addressed using several strategies. Object detection refers to identifying the ROIs that contain moving objects (foreground). Depending on the task, this information might be known a priori, such as in the case of single object tracking, or it can be obtained by applying object detection methods. The extracted regions of interest may contain objects from multiple classes such as dogs, cats, cars, birds, humans etc. In most cases, it is desirable to know the class of each tracked object, which is achieved by classification methods, but in our challenge we have only one class (tumor), so this step is skipped. The final tracking step consists of approximating the path of a target by finding its position in every frame of the video.

2.1.2 Deep Learning for Object Tracking

Recent years have seen great interest in the integration of neural networks into object detection and tracking due to their ability to extract complex features, making deep learning models the current state of the art in this area of research. In particular, convolutional neural networks (CNNs) are extensively used for learning meaningful spatial representations while offering real-time performance.

In most cases, the tracking task of a single object starts with knowing the target's ground-truth coordinates in the first frame of a video, usually by manually drawing a bounding box around the object to be tracked. This information might be inaccurate (due to human errors) or even unavailable, as in our situation where the precise tumor location is unknown in all frames (including the first one). For these reasons, some tracking algorithms integrate an object detection step to improve the quality of the input detections or to provide them when this information is not available. Most of these methods employ a bounding box extraction strategy based on Faster R-CNN [17] or similar variants, such as Mask R-CNN [18], which extends Faster R-CNN with an extra branch for predicting the target's segmentation mask.

Other architectures worth mentioning are YOLO [19], for its unified approach and real-time performance, and SSD [20], a single-shot object detector, which outperforms Faster R-CNN in terms of accuracy, and YOLO in terms of speed. SSD’s main contribution consists in the generation of multi-scale feature maps, allowing for predictions at different scales.

After the object detection and feature extraction steps, most deep tracking algorithms will compute a similarity score between the previously tracked object and its potential current location, either as a distance measure over extracted features, or directly [16].

Siamese networks represent a particularly interesting category in which the loss is computed using the information from different images in order to learn a similarity measure between objects. Our network is greatly inspired by SiamMask [13] and SiamRPN++ [14] architectures.

2.1.3 Deep Siamese trackers

Most previous works that tried to handle the challenge of arbitrary object tracking followed the strategy of training an online discriminative classifier only on the dataset obtained from the video to which the tracking algorithm is then applied. This approach works well only for simple models that are not able to learn a rich representation of the target, while more complex models (such as deep convolutional networks) lose the real-time property when updated online [21].

SiamFC [21] proposes a different approach, where a fully-convolutional neural network is trained completely offline and then simply evaluated at test time with no further updates. The network receives 2 input images, a smaller region containing the target to be tracked (z) and a larger search region cropped around the last predicted location of the target (x), and compares z with x at different locations on x using the cross-correlation operation. The model outputs a similarity score map where higher values should correspond to the target’s location on x.


SiamRPN [22] extends SiamFC with a region proposal network (RPN) [17] for predicting both classification scores and bounding box coordinates, thus increasing the performance of SiamFC.

SiamRPN++ [14] extends SiamRPN by using a deeper convolutional neural network as feature extractor and also allowing the model to learn richer representations with the introduction of depth-wise cross-correlation operation.

SiamMask [13] is a novel deep architecture for semi-supervised real-time object tracking and segmentation, which extends SiamRPN [22] by adding an extra branch for learning the segmentation mask and a mask loss function. Like SiamRPN++, it uses a deeper feature extractor and the richer depth-wise cross-correlation layer. The resulting cross-correlated feature maps returned by the backbone network are then extended by adding classification (s_φ), bounding box (b_σ) and segmentation mask (h_φ) heads, which are trained using the L_class, L_box and L_mask loss functions, defined as cross-entropy, smooth L1 and binary logistic regression loss, respectively. SiamMask-E [23] is the current state-of-the-art tracking method on the VOT2017/18 benchmark.

Due to its state-of-the-art performance and the integration of the segmentation mask branch, we are designing our proposed siamese model based on the SiamMask architecture.

2.2

Markerless Tumor Tracking Methods

In the medical domain, specifically for real-time markerless tumor tracking in fluoroscopic images, many studies have been reported, but they are often not used in clinical practice. Xu et al. [24] presented an optical flow-based method, using a two-dimensional velocity vector field to quantify the target motion in sequential frames. Their method performed well for small tumor displacements, but failed for larger ones and was not robust to elastic tumor deformations. We also consider that an optical flow approach might not be best suited for the task, due to the occlusion challenge combined with the gantry arc rotation, which might cause tracking drift issues.

Shieh et al. [25] proposed a Bayesian approach for 3D tumor tracking which, compared to other methods, is able to deliver the 3D tumor position as well as the uncertainty of the model, which is of great importance in medical applications, but it does not achieve real-time performance. In order to compute the model uncertainty, we suggest a Bayesian model adaptation as a future work direction in Sec. 6.2.

Encoder-decoder type architectures are used in the works of [8] and [10], which are trained and validated on synthetically generated images from the planning 4D-CT scans. The former proposes a novel data augmentation method for soft-tissue to bone decorrelation, while the latter trains an autoencoder on specific image patches (ROIs) in order to output a target probability map (TPM) instead of a segmentation mask. We integrate the soft-tissue/bone decorrelation idea proposed by Terunuma et al. [8] in our dataset generation pipeline as it has been proven to be successful and it helps generating more training data. We have noticed that our implementation of the proposed siamese model indirectly computes a similar form of the target probability map (TPM) proposed by Hirai et al. [10], which we use to refine the model’s predictions during inference time.

Other methods, like Deep Bi-LSTM [26], aim to predict the patient respiration motion using the respiratory motion signal as temporal information. They show good results for respiration trajectory prediction but unfortunately, in most cases there is a weak correlation between the tumor motion and the respiration motion signal. For this reason, we choose not to follow a respiratory motion prediction approach, but a tumor motion one.

Zhao et al. [9] used Faster R-CNN for accurate markerless pancreatic tumor detection. They also introduce an efficient data augmentation pipeline for simulating inter- and intrafractional tumor variations, from which we draw inspiration in our work. Specifically, they used the patient planning 3D-CT scans to generate Digitally Reconstructed Radiographs (DRRs) for the anterior-posterior (AP), left-right (LR) and oblique (135°) directions, after which the CT scans were augmented with tumor deformations, translations and rotations. The VGG-based Faster R-CNN model was trained and validated on the DRRs separately for the AP, LR and oblique directions, only when the patient was in the exhale respiratory phase. Unlike [9], we train and evaluate our model for all angle perspectives and for all respiratory phases. We also visually inspect the results on kV images.

Takahashi et al. [11] made use of an encoder-decoder network for predicting the tumor mask segmentation, where the decoder is replaced by a pixel shuffle layer [27] in order to achieve real-time performance. The tumor's coordinates are then computed as the density center weighted by the pixel-wise mask score. This approach should achieve better results than object detection methods, such as Faster R-CNN, because a few misclassified pixels will have little impact on the final detection. They also propose a simple solution for reducing the gap between DRRs and X-Rays, by training on DRRs with random contrast (gamma) transformations and noise addition. The model was trained and evaluated on phantom data with spherical tumors of sizes 1, 2 and 3 cm, using two-view X-Ray fluoroscopy images, which mitigates the tumor tracking challenge due to the easier scenario (phantom data) and higher quality images. Due to its simplicity and data augmentation properties, we use the random contrast (gamma) transformation method proposed by [11] to reduce the gap between DRRs and X-Rays, besides our proposed data translation strategy aiming to solve the same issue.

Most proposed deep learning methods show promising results in terms of accuracy and speed but not all of them ([8, 9]) presented the model’s performance on kV images (due to lack of ground-truth labels). Also, only some methods ([10]) were evaluated on a full gantry arc rotation (180◦ or 360◦). Finally, the only methods that used information from previous frame(s) to locate the tumor in the current frame are the optical flow approach of [24] and the Bayesian approach of [25], which are not deep learning-based frameworks.

Unlike existing deep learning methods, we propose a different approach for markerless tumor localization based on tracking by similarity score which uses both spatial information, from the current frame, and a form of weak temporal information from previous frame(s). We also propose a novel strategy for DRR to X-Ray data translation, based on generative neural networks.


Chapter 3

Methodology and Datasets

Before diving into the implementation details for the siamese model, we would like to give a comprehensive description of the dataset generation pipeline. In Sec. 3.1 we firstly present the raw data used in our experiments, followed by the data augmentation strategies and the generation of Digitally Reconstructed Radiographs (DRRs) in Sec. 3.2. Finally, we describe the proposed method for reducing the gap between DRRs and X-Rays in Sec. 3.3.

In the second part of this chapter, specifically in Sec. 3.4, we present our proposed siamese model, focusing mostly on the implementation of the anchor-free approach and the key differences to SiamMask [13].

3.1 Data acquisition

As previously mentioned in Section 1, we train a patient-specific deep learning model for tumor localization in lung cancer patients, similar to the approach of Zhao et al. [9]. This strategy reduces the need for a large number of patients for training the model because, for each new patient, a different network learns from the subject's training dataset and the performance is evaluated on the same subject, hence there is no need for generalizability. In order to provide enough samples to train a single model (for a single patient), we employ different data augmentation methods described in Sections 3.2.1 and 3.2.2, consisting of diffeomorphic registrations, translations and various angle perspectives. The sizes of the training and testing datasets are 28960 and 7240 images, respectively, which is enough to train and validate the performance of a deep learning model.

The datasets involved in our research come from 2 patients with small lung tumors (<2 cm), used for training and validation of the proposed tracking model, and a phantom model with 3 lung tumors, used for training and validation of the proposed data augmentation step intended to reduce the gap between digitally reconstructed radiographs (DRRs) and X-Rays. Even though the number of patients is small, we have chosen particularly difficult tracking scenarios because of the small size of the tumors and their position, which is predisposed to occlusion (by ribs). Therefore, we believe that a model performing well on these cases should not have any problems in other situations (for other patients).

For the patient data, 4DCT scans with 2.5 mm slice thickness are acquired using a Siemens SOMATOM go.Open Pro scanner, during the treatment planning phase, under free-breathing conditions. The generated 4DCT scans are divided into 10 3DCT scans corresponding to 10 respiratory phases: T00, T10, ..., T90 (T00: peak exhalation, T50: peak inhalation). In Fig. 3.1 we show an overview of the pCT generation process. Each 3DCT scan is a 512 × 512 × 144 array with values ranging between -1000 (air) and 30000+ (bone). The gross tumor volume (GTV) was manually contoured by a certified radiologist on the T00 and T50 CT scans and we manually translated it to the other phases in Slicer3D, by shifting the T00's GTV to match the tumor's appearance in each individual phase.

The patient's kV images (X-Rays) are acquired by an On-Board Imager (OBI) attached orthogonal to the gantry arm (as in Fig. 3.2). The OBI settings are 120 kV, 1500 mm X-Ray source to detector distance, 1000 mm source to body isocenter distance, 0.388 mm pixel size and 768 × 1024 array size. The X-Ray array values range between 0 and 15000+. For patient 1 we have a 180° arc rotation consisting of 222 kV images, while for patient 2 we have a 135° arc rotation consisting of 276 kV images.

A 512 × 512 × 346 3DCT scan with 0.6 mm slice thickness was generated together with the GTV of 3 tumors (in the middle and cranial part of the lung) for the phantom data, with the phantom in a static position. Since respiratory motion cannot be simulated on the phantom, 4DCT generation is not feasible. For each of the 3 tumors, we have a full 360◦ gantry arc rotation consisting of 422 kV images, taken under the same OBI settings used for the patient data. The phantom X-Ray pixel array values range between 0 and 90000+. In order to simulate tumor motion, the couch (on which the phantom is situated) follows a regular (sinusoidal) motion.

Figure 3.1: 4DCT generation taken from [28].

3.2 Dataset Generation

As mentioned in Section 1, in order to train a deep learning model for target localization and tracking in a supervised way, a considerable amount of annotated data instances is needed. Unfortunately, there are no available datasets with annotated lung tumor kV images suitable for training such a model; therefore, we will generate DRRs from the planning CT (pCT) data, which will be used for training and validation.

Figure 3.2: On-Board Imager (OBI), producing the green beam, mounted on the rotating gantry arm, producing the yellow beam.

X-Ray images contain a large variety of tumor translations, rotations and deformations in-between adjacent frames, due to the patient’s respiration motion and gantry rotation, while DRRs are quite limited from this point of view. To mitigate this problem, we incorporate a pCT registration step in our dataset generation pipeline, which precedes DRR acquisition. Finally, we aim to reduce the gap between DRRs and X-Rays in terms of resolution, contrast and noise, by proposing a novel method based on generative adversarial networks (GANs) [29].

Our proposed dataset generation pipeline, summarized in Fig. 3.3, is composed of the 3 main steps mentioned earlier: pCT registration and deformation, DRR generation and DRR to X-ray transformation. We will describe each individual step in the following subsections.

Figure 3.3: The proposed dataset generation pipeline consisting of probabilistic diffeomorphic registration (red), DRR acquisition (green) and DRR to X-Ray translation (blue).


3.2.1 Planning CT registration and deformation

Zhao et al. [9] took on the challenge of detecting tumors in pancreatic cancer patients during the exhale respiration phase. In order to simulate inter- and intrafractional variations, they register the exhale pCT scan (T00) with the other phases, obtaining 9 motion vector fields (MVFs), and generate 80 new MVFs by the following formula:

MVF′ = r ∗ MVF_i + (1 − r) ∗ MVF_{i+1},   i = 1, ..., 9   (3.1)

where MVF_i is one of the 9 MVFs, r is a random number from the open interval (0, 1) and i is randomly selected every time. Finally, they apply the generated MVFs to deform the exhale pCT and obtain 80 3DCT scans, which are used for training and validation.

We adapt this approach to fit our problem, where we have to detect the tumor in all respiration phases: instead of registering one phase with the other 9, we register each phase with the most dissimilar one (e.g. the peak exhale to the peak inhale) and generate new MVFs as follows:

MVF_i = Reg(pCT_i, pCT_{(i+4) % 10}),   i = 0, ..., 9
MVF′ = r ∗ MVF_i,   i = 0, ..., 9   (3.2)

where pCT_i is one of the 10 planning 3DCT scans, Reg is the registration method used and r is a random uniformly distributed number from the open interval (0, 1). In this way, the tumor can end up in any altered state between the peak inhalation and peak exhalation while following the patient's respiration cycle. For each pCT we randomly generate 3 MVFs, leading to a total of 30 new CT scans plus the 10 original ones.
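To make the sampling step concrete, the following is a minimal Python sketch of eq. 3.2; the registration backend (register) and the list of planning CT volumes (pcts) are placeholders for whatever implementation is used.

```python
import numpy as np

def sample_deformations(pcts, register, n_samples=3):
    """Randomly generate motion vector fields (MVFs) following eq. 3.2.

    pcts     : list of the 10 planning 3DCT volumes (one per respiratory phase)
    register : callable returning the MVF that registers pcts[i] to pcts[j]
    """
    new_mvfs = []
    for i in range(10):
        # register each phase with the most dissimilar one, 4 phases away
        mvf_i = register(pcts[i], pcts[(i + 4) % 10])
        for _ in range(n_samples):
            r = np.random.uniform(0.0, 1.0)   # r ~ U(0, 1)
            new_mvfs.append(r * mvf_i)        # MVF' = r * MVF_i
    return new_mvfs                           # 10 phases x 3 samples = 30 MVFs
```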

There are different registration techniques (Reg in 3.2) used to generate motion vector fields, such as rigid, affine and b-spline. In our work, we use the probabilistic diffeomorphic registration method proposed by Dalca et al. [30], in which case the MVFs are represented by diffeomorphic deformation fields (DDFs). This registration method (Fig. 3.4) is implemented as a convolutional neural network (CNN) where a 3D U-Net-type architecture [31] is used as feature extractor. The network receives the fixed and moving images as input and outputs an approximate posterior probability of the velocity field in the form of the mean µ_{z|x,y} and variance Σ_{z|x,y} of a multivariate normal distribution. A velocity field z is then sampled from the learnt µ and Σ and transformed into a DDF, which is applied to the moving image to obtain the fixed one. The network is trained end-to-end in an unsupervised fashion using a loss function with 2 components:

L(ψ; x, y) = 1/(2σ²K) Σ_k ||x − y ◦ φ_{z_k}||² + KL[q_ψ(z | x; y) || p(z)]   (3.3)

where x and y are the input fixed and moving images, ψ represents the network's learnable parameters, z is the sampled velocity field and p and q_ψ are the prior and posterior distributions of z. K represents the number of samples used (1 in this case) and σ² is a fixed hyper-parameter. The first term in 3.3 brings the deformed moving image close to the fixed one, while the second term ensures closeness between the posterior and the prior p(z) = N(z; 0, Σ_z), which constrains the registration to be topology-preserving (diffeomorphic).

We choose to implement this type of image registration for 2 reasons:


2. due to the probabilistic nature of the model, each sampled DDF will be different, thus enhancing the variation in the newly generated CT scans.


Figure 3.4: Diffeomorphic registration method proposed by Dalca et al. [30]. L_MSE and L_KL denote the first and second term of equation 3.3, respectively.

For the 3D U-Net model, we use the pretrained weights published by [30] and we further train it for each of our patients for 10 epochs with a batch size of 1 (due to memory limitations), using stochastic gradient descent (SGD) with 0.8 momentum and a learning rate of 10^-4 which decreases linearly to 10^-5. No batch normalization layers are used in the 3D U-Net network because the batch size is very small (= 1) and that might cause problems in learning the global mean and variance of the training dataset.

3.2.2 DRR generation

DRR generation consists of 4 major steps: material decomposition, analytic primary computation, scatter estimation and noise injection. The 3D-CT scan is segmented into air, soft tissue and bone regions, each of which will have a contribution to the projected DRR. Traditionally, this segmentation is done via Hounsfield Unit (HU) thresholding, which works well for large HU differences, e.g. air ([-1000] HU) and bone ([200, 3000] HU), but may fail for overlapping intervals like soft-tissue ([-150, 300] HU) and bone.

Unberath et al. [32] propose a solution to this problem by automatically segmenting the CT volume using an encoder-decoder CNN, called DeepDRR. Their results show that deep models trained on DRRs generated by DeepDRR generalized better on real data (X-Rays) than the ones trained on conventional DRRs. They also use a CNN-based scatter estimation which outperforms traditional kernel based methods [33], but in most clinical products scatter reduction is applied as pre-processing, so this step is skipped in our implementation. With no further training, we directly use the pretrained DeepDRR to generate our training and validation datasets. We show in Fig. 3.5 an overview of the analytic primary computation used for generating the DRRs.


Figure 3.5: DRR generation process taken from [34].

During respiration, the patient's bone structure does not change while the soft tissue (tumor) deforms and moves, showing little to no correlation between the tumor location and the surrounding bones. In most cases, the tumor will be in close proximity to bone tissue or even partially or completely occluded by it. Since bone has a higher density than soft tissue, it will appear as a high-contrast feature, which the model might learn to correlate with the tumor location. Terunuma et al. [8] address this issue by separately generating DRRs for soft tissue and bone, respectively, and randomly overlapping them as in Fig. 3.6. Based on their results, they show that such a strategy prevents mistracking caused by bone tissue; therefore, we integrate it in our DRR generation step with a small modification in the overlay process, defined as follows:

I(x, y) = I_soft(x + δ_x, y + δ_y) + I_bone(x, y)   (3.4)

where I_soft and I_bone are the soft-tissue and bone DRRs, and δ_x and δ_y are random integers in the (-32, 32) interval. The change consists in shifting the soft-tissue DRR, instead of the bone one as in [8], in order to introduce variation in tumor locations.
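As an illustration, a minimal sketch of the overlay in eq. 3.4, assuming the soft-tissue and bone DRRs are given as equally sized NumPy arrays:

```python
import numpy as np

def overlay_drrs(i_soft, i_bone, max_shift=32):
    """Randomly shift the soft-tissue DRR and add the bone DRR (eq. 3.4)."""
    dx = np.random.randint(-max_shift + 1, max_shift)  # delta_x in (-32, 32)
    dy = np.random.randint(-max_shift + 1, max_shift)  # delta_y in (-32, 32)
    # shift the soft-tissue image; np.roll wraps around at the borders, which
    # is one possible choice -- zero padding would be an equally valid one
    shifted = np.roll(i_soft, shift=(dx, dy), axis=(0, 1))
    return shifted + i_bone
```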

Figure 3.6: Overall workflow of the proposed DRR generation taken from [8].

Using DeepDRR's projection algorithm, DRRs are produced along 181 angle directions, from 0° (left-right) to 180° (right-left), corresponding to half of a complete gantry arc rotation. The X-ray source to detector and isocenter distances are set to 1500 mm and 1000 mm, respectively, and the detector shape and pixel size are set to 768 × 1024 pixels and 0.388 mm, respectively. This setting is selected in order to match the one used for the kV image acquisition.


Our final dataset is composed of (10 original + 30 deformed) 3DCT scans × 5 random soft-tissue/bone overlaps × 181 angle perspectives = 36200 DRRs, divided into 28960 (80%) for training and 7240 (20%) for testing.

3.3 DRR to X-ray transformation

Due to the nature of kV and DRR images, the probability distributions of these 2 data types are quite different; therefore, training a model on one dataset does not guarantee good performance on the other. We hypothesize that if we could define a function G : R −→ X, mapping the elements of the DRR space R to the X-Ray space X such that the distribution of G(R) is indistinguishable from the distribution of X, then learning a model on G(R) will have the same effect as learning directly on X.

We propose a strategy for learning such a function G, using the idea of adversarial training introduced by Goodfellow et al. [29]. This idea involves simultaneously training a generator G, whose task is to learn an approximate distribution of the data x, and a discriminator D, whose task is to learn to distinguish generated data (sampled from the learnt distribution) from the real data. The learning process is equivalent to a minimax game between G and D with the following value function:

min_G max_D L(G, D) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]   (3.5)

where p_z(z) is the prior input noise distribution to the generator.

In our approach we train 2 generators, one for increasing the resolution of DRRs (Sec. 3.3.1) and the other for contrast transformation and noise generation (Sec. 3.3.2). As specified in 3.1, we use the phantom data for training and validation of our proposed generative strategy. The reasons behind this decision are the following:

1. in order to train a network which increases the DRR’s resolution, we need access to high resolution DRRs; patient’s pCT scans have 2.5 mm slice thickness, compared to 0.6 mm of the phantom ones, hence phantom DRRs have higher quality

2. in patient data there is no one-to-one mapping between DRRs and X-Rays due to the respiration motion; such a mapping helps training our 2nd generative model, so we use the phantom data (which is static)

3. to validate our approach, we need the ground-truth labels (tumor location) on kV images, since the purpose of this method is to increase the tracking accuracy on X-Rays; we have this information available only for the phantom data in the form of tumor trajectories during arc rotation.

From the phantom's 3DCT scan, 360 DRRs were generated corresponding to a full gantry arc (360°) using the settings mentioned in 3.2.2. A log transformation was applied to linearize the pixel values and then each DRR was normalized to [-1, 1] (eq. 3.6), in order to generate the dataset used to train both generative models.

c = 1 / log(1 + max(x))
x′ = c ∗ log(1 + x)
x′′ = (x′ − min(x′)) / (max(x′) − min(x′)) ∗ 2 − 1   (3.6)
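The normalization of eq. 3.6 can be sketched as a straightforward NumPy translation of the formula:

```python
import numpy as np

def log_normalize(drr):
    """Log-transform a DRR and rescale it to [-1, 1] (eq. 3.6)."""
    c = 1.0 / np.log(1.0 + drr.max())
    x_log = c * np.log(1.0 + drr)                      # x' = c * log(1 + x)
    x_min, x_max = x_log.min(), x_log.max()
    return (x_log - x_min) / (x_max - x_min) * 2 - 1   # x'' in [-1, 1]
```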


3.3.1 Super Resolution GAN

The first generative model aims to increase the image quality of DRRs that usually have a low resolution due to pCT’s high slice thickness (∼ 2.5 mm). We follow the work of Ledig et al. [35] and train a super resolution generative adversarial network (SRGAN) for reducing the resolution gap between DRRs and X-rays (as in Fig. 3.7).

In their approach, a generator G receives a low resolution (LR) image and outputs the same image 4x upscaled, and a discriminator D learns to distinguish between the original high resolution (HR) image and the upscaled one. The LR images are obtained by downsampling HR images using a bicubic kernel with a downsampling factor of 4. Because our DRRs and X-rays have the same spatial dimensions (768 x 1024), the LR and HR images should have the same size. Thus, in order to generate LR images we first downsample the HR images by a factor of 4, using bilinear interpolation, and then we upsample them by a factor of 4 using the same method.

The generator network G used in our implementation has the same structure as the one from [35], with the exception that we remove the last 2 sub-pixel convolution layers used for upsampling. For the discriminator D we use a PatchGAN model introduced by Isola et al. [36], which is composed of four 4x4 convolutional layers with a stride of 2 and 64, 128, 256 and 512 channels, respectively, and a final 1x1 convolutional layer with a stride of 1 and 1 output channel. Each layer, except the last one, is followed by an instance normalization layer [37] and a LeakyReLU (α = 0.2) activation. Such a discriminator has fewer parameters than a full-image one and aims to classify multiple patches extracted from the original image.
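A minimal PyTorch sketch of such a PatchGAN-style discriminator is given below; the layer sizes follow the description above, while other details (padding, number of input channels) are assumptions.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 4x4 convolution with stride 2, followed by instance norm and LeakyReLU(0.2)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class PatchDiscriminator(nn.Module):
    """PatchGAN discriminator: outputs a realness score per image patch."""

    def __init__(self, in_channels=1):
        super().__init__()
        self.model = nn.Sequential(
            conv_block(in_channels, 64),
            conv_block(64, 128),
            conv_block(128, 256),
            conv_block(256, 512),
            nn.Conv2d(512, 1, kernel_size=1, stride=1),  # per-patch score map
        )

    def forward(self, x):
        return self.model(x)
```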

As in [35], we use an objective loss function with 3 components: a pixel-wise loss (3.7), a VGG loss (3.8) and an adversarial loss (3.9), defined as:

L_pixel(G, x, y) = E_{x,y}[||y − G(x)||_1]   (3.7)

L_VGG_i(G, x, y) = E_{x,y}[||VGG_i(y) − VGG_i(G(x))||_2]   (3.8)

L^G_adv(G, D, x) = E_x[(1 − D(G(x)))²]   (3.9)
L^D_adv(G, D, x, y) = E_y[(1 − D(y))²] + E_x[D(G(x))²]

where VGG_i represents the feature maps of the last convolution before the i-th max pooling layer of a pretrained VGG [38] network. Unlike [35], we employ an L1 pixel-wise loss function instead of L2 because it encourages less blurring [36], and for the adversarial loss we use the least-squares loss instead of the negative log likelihood (3.5), as it is more stable during training and generates higher quality images according to Mao et al. [39]. Finally, our generator and discriminator objective functions are defined by equations (3.10) and (3.11).

L_G = λ_1 ∗ L^G_adv(G, D, x) + λ_2 ∗ L_pixel(G, x, y) + λ_3 ∗ Σ_i L_VGG_i(G, x, y),   i = 1, ..., 4   (3.10)
L_D = L^D_adv(G, D, x, y)   (3.11)

where λ_1, λ_2 and λ_3 are hyperparameters set to 0.01, 1 and 2 ∗ 10^-6, respectively.
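For clarity, a sketch of how the generator objective (3.10) could be assembled is shown below; vgg_features(img, i) is a hypothetical helper returning the VGG feature maps of stage i, and the least-squares adversarial term follows (3.9).

```python
import torch
import torch.nn.functional as F

def srgan_generator_loss(gen, disc, vgg_features, lr_img, hr_img,
                         lambdas=(0.01, 1.0, 2e-6)):
    """Combined SRGAN generator objective (eq. 3.10)."""
    l_adv, l_pix, l_vgg = lambdas
    sr_img = gen(lr_img)

    # least-squares adversarial loss (eq. 3.9)
    loss_adv = torch.mean((1.0 - disc(sr_img)) ** 2)

    # L1 pixel-wise loss (eq. 3.7)
    loss_pixel = F.l1_loss(sr_img, hr_img)

    # VGG feature losses (eq. 3.8), summed over stages 1..4
    loss_vgg = sum(F.mse_loss(vgg_features(sr_img, i), vgg_features(hr_img, i))
                   for i in range(1, 5))

    return l_adv * loss_adv + l_pix * loss_pixel + l_vgg * loss_vgg
```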

The generator is initially trained alone to optimize only the L_pixel loss for 50 epochs, with a batch size of 2, using the Adam optimizer (beta1 = 0.5) with an initial learning rate of 10^-4. After that, both the G and D networks are trained alternately for another 200 epochs with the same settings.


Figure 3.7: Learning process of a super resolution GAN [35] for increasing DRR’s resolution.

3.3.2 Cycle GAN

Even on phantom data, despite its static nature, the one-to-one matching between DRRs and X-Rays is not perfect. Therefore, learning an image-to-image transformation for reducing the contrast and noise gap becomes learning to map one domain (DRRs) to the other (X-Rays), as specified in the beginning of this section by the function G : R −→ X.

Zhu et al. [40] present a method to relate 2 data domains X and Y with the help of generative neural networks. Their goal is to learn 2 mapping functions G : X −→ Y and F : Y −→ X, using 2 generative and 2 discriminative networks. Because there are several possible mappings between X and Y, they argue for the necessity of reducing the search space by constraining the functions G and F to be cycle-consistent: F(G(x)) ≈ x and G(F(y)) ≈ y. This constraint is put in practice by defining a new loss function used for training the generators, as follows:

L_cycle(F, G, x, y) = E_x[||F(G(x)) − x||_1] + E_y[||G(F(y)) − y||_1]   (3.12)

Our solution implements the cycle GAN [40] idea where the 2 data domains are represented by DRRs and X-Rays. Unlike the original paper, where samples from X and Y are not paired, there is a corresponding X-Ray for each DRR; thus we restrict the search space even more by employing extra pixel-wise (3.7) and VGG (3.8) loss functions.

Let G1 and G2 be our 2 generator networks, where G1 maps an input DRR to the X-Ray data space and G2 does the reverse operation by translating an X-Ray to the DRR space. Similarly, we define D1 and D2 discriminator networks, where D1 learns to classify real X-Rays from the fake ones generated by G1 and D2 learns to distinguish real DRRs from the ones generated by G2. Fig. 3.8 summarizes the learning process of the proposed method, where the 4 objective functions used to train G1, G2, D1 and D2 are defined as follows:


L_G1 = λ_1 ∗ L^G_adv(G1, D1, x) + λ_2 ∗ L_cycle(G1, G2, x, y) + λ_3 ∗ L_pixel(G1, x, y) + λ_4 ∗ Σ_i L_VGG_i(G1, x, y),   i = 1, ..., 4   (3.13)

L_G2 = λ_1 ∗ L^G_adv(G2, D2, y) + λ_2 ∗ L_cycle(G1, G2, x, y) + λ_3 ∗ L_pixel(G2, y, x) + λ_4 ∗ Σ_i L_VGG_i(G2, y, x),   i = 1, ..., 4   (3.14)

L_D1 = L^D_adv(G1, D1, x, y)   (3.15)
L_D2 = L^D_adv(G2, D2, x, y)   (3.16)

where λ_1, λ_2, λ_3 and λ_4 are hyperparameters set to 1, 10, 1 and 10^-6, respectively.
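An illustrative sketch of the G1 objective (3.13) is given below, reusing the same hypothetical vgg_features helper as before; the remaining objectives (3.14)-(3.16) are built analogously.

```python
import torch
import torch.nn.functional as F

def cycle_generator_loss(g1, g2, d1, vgg_features, drr, xray,
                         lambdas=(1.0, 10.0, 1.0, 1e-6)):
    """Objective of the DRR -> X-Ray generator G1 (eq. 3.13)."""
    l_adv, l_cyc, l_pix, l_vgg = lambdas
    fake_xray = g1(drr)

    # least-squares adversarial term against discriminator D1
    loss_adv = torch.mean((1.0 - d1(fake_xray)) ** 2)

    # cycle-consistency in both directions (eq. 3.12)
    loss_cycle = F.l1_loss(g2(fake_xray), drr) + F.l1_loss(g1(g2(xray)), xray)

    # paired supervision: pixel-wise and VGG losses against the real X-Ray
    loss_pixel = F.l1_loss(fake_xray, xray)
    loss_vgg = sum(F.mse_loss(vgg_features(fake_xray, i), vgg_features(xray, i))
                   for i in range(1, 5))

    return (l_adv * loss_adv + l_cyc * loss_cycle +
            l_pix * loss_pixel + l_vgg * loss_vgg)
```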

Figure 3.8: Overview of the proposed CycleGAN [40] method for DRR to X-Ray translation. The colors brown and blue correspond to the DRR and X-Ray data domains, respectively.

For the generators we use the same encoder-decoder architecture with skip connections as specified in the work of Armanious et al. [41], and we replace the batch normalization layers with instance normalization. For the discriminators we use the same PatchGAN [36] architecture as previously described in 3.3.1.

All 4 networks are trained for 250 epochs with a batch size of 4, using the Adam optimizer (beta1 = 0.5) with an initial learning rate of 10^-4. Because the job of the discriminators is easier than that of the generators, we update D1 and D2 every 2 steps; we also use the first 50 epochs for the initialization of G1 and G2, after which we randomly reinitialize the weights of D1 and D2 and train all networks for the remaining 200 epochs. After epoch 150 we decrease the learning rate by a factor of 2 every 10 epochs.


3.4 Proposed Tracking Method

We design our proposed tracking algorithm starting from the hypothesis that siamese-based tracking by similarity score is better suited for the markerless tumor tracking task than existing tracking-by-detection methods, due to its more localized search space and the use of previous frame(s) (temporal information). Due to their popularity and performance, we decided to implement a deep siamese network for our proposed tracking method, and its structure is greatly inspired by Wang et al. [13].

We propose a modified version of SiamMask as our main tracking algorithm, specifically designed for accurate markerless tumor tracking. All additional changes employed in our implementation aim to improve the quality of the results and reduce the network's complexity. An overview of the proposed method is illustrated in Fig. 3.9.


Figure 3.9: Schematic illustration of our siamese tracking model. The backbone is represented by a ResNet50 [42] but it can be any other suitable CNN. d⋆ denotes the depth-wise cross correlation operation applied to the extracted feature maps.

3.4.1 Input image shapes

To simplify the implementation of our following design choices, we modify the input target and search image shapes from 127 × 127 / 255 × 255 to 128 × 128 / 256 × 256, due to the convenience of working with powers of 2. The resulting spatial sizes of the backbone's feature maps will therefore be 16 × 16 and 32 × 32 for the target and search regions, respectively. Following [14], we crop the 8 × 8 center region from the target's feature maps to reduce the computation time of the cross correlation operation.

3.4.2 Separated class/bbox/mask depth-wise cross correlation

In order to separate the classification, bounding box and mask domains, SiamMask employs 3 unshared 1 x 1 conv2d-bn-ReLU blocks on top of the cross correlated feature maps. We choose to increase this separation by implementing 3 distinct depth-wise cross correlations, for classification, regression and segmentation, respectively. The feature maps obtained from the backbone network are adjusted with 3 unshared 1 x 1 conv2d-bn blocks for each of the 3 network heads, followed by the corresponding depth-wise cross correlations.
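The depth-wise cross correlation itself is commonly implemented as a grouped convolution, as in SiamRPN++; a minimal PyTorch sketch:

```python
import torch.nn.functional as F

def depthwise_xcorr(search, kernel):
    """Depth-wise cross correlation between search features and a target kernel.

    search : (B, C, Hs, Ws) adjusted search-region feature maps
    kernel : (B, C, Hk, Wk) adjusted target feature maps, used as the kernel
    returns: (B, C, Hs-Hk+1, Ws-Wk+1) per-channel correlation maps
    """
    b, c, hs, ws = search.shape
    search = search.reshape(1, b * c, hs, ws)
    kernel = kernel.reshape(b * c, 1, kernel.size(2), kernel.size(3))
    out = F.conv2d(search, kernel, groups=b * c)
    return out.reshape(b, c, out.size(2), out.size(3))
```

In our model this operation is applied three times, once for each adjusted (classification, box, mask) pair of feature maps.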

3.4.3 Anchor-free siamese tracker

Most recent fully convolutional siamese trackers [22, 14, 13] implement a region proposal network (RPN) [17] in order to improve their performance. The RPN runs in a sliding window fashion at each location on the cross correlated feature maps and predicts a foreground/background classification score and a bounding box refinement in the form of (t_x, t_y, t_w, t_h). This strategy requires a set of predefined anchors (bounding boxes) with different scales (32, 64, 128) and aspect ratios (0.5, 1, 2), which are centered at the locations on the search image that correspond to the ones in the feature maps, as in Fig. 3.10.

The final bounding box predictions are computed by applying the refinement module (bounding box branch) to the corresponding anchors that achieved high classification scores as follows:

x = t_x ∗ w_a + x_a
y = t_y ∗ h_a + y_a
w = exp(t_w) ∗ w_a   (3.17)
h = exp(t_h) ∗ h_a

where x, y, w and h represent the predicted bounding box's center coordinates, width and height, and x_a, y_a, w_a and h_a define the same properties for the corresponding anchor.
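For reference, the anchor decoding of eq. 3.17 amounts to a few vectorized operations; a small NumPy sketch:

```python
import numpy as np

def decode_anchors(deltas, anchors):
    """Apply predicted refinements (tx, ty, tw, th) to anchors (eq. 3.17).

    deltas  : (N, 4) predicted refinements
    anchors : (N, 4) anchors as (xa, ya, wa, ha)
    """
    tx, ty, tw, th = deltas.T
    xa, ya, wa, ha = anchors.T
    x = tx * wa + xa          # refined center x
    y = ty * ha + ya          # refined center y
    w = np.exp(tw) * wa       # refined width
    h = np.exp(th) * ha       # refined height
    return np.stack([x, y, w, h], axis=1)
```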

The use of anchor boxes in RPN-based methods increases the model's complexity by adding a set of hyperparameters that require fine-tuning for optimal performance. In the case of SiamMask, we need to decide what anchor scales and aspect ratios to use, as well as which intersection over union (IoU) with the ground truth box to set as thresholds for positive (> 0.6) and negative (< 0.3) samples. In order to remove the constraint of having to make such decisions, we design our network using an anchor-free approach.


Figure 3.10: Example of predefined anchors on the input image (left), their corresponding locations on the feature maps (center) and the final bounding box predictions (right).

(24)

Tian et al. [43] reformulate the object detection task as a per-pixel classification problem and propose an anchor-free object detection method that achieves ”state-of-the-art results among one-stage detectors” in 2019. We follow their approach and modify our siamese tracker to run in an anchor-free fashion.

Let F ∈ R^{H×W×C} and B_gt = (x*_1, y*_1, x*_2, y*_2) be the cross-correlated feature maps and the ground-truth bounding box, with (x*_1, y*_1) and (x*_2, y*_2) the top-left and bottom-right box coordinates, respectively. Each feature vector F_ij will be responsible for generating 4 predictions: P_class(i, j) ∈ R², P_box(i, j) ∈ R⁴, P_center(i, j) ∈ R and P_mask(i, j) ∈ R^{H_t/2 × W_t/2}, with i = 0, ..., H − 1 and j = 0, ..., W − 1. The corresponding ground truths G_class(i*, j*), G_box(i*, j*), G_center(i*, j*) and G_mask(i*, j*) are generated based on the ground-truth bounding box (x*_1, y*_1, x*_2, y*_2) and location (i, j) as follows:

(i*, j*) = (⌊H_t/4⌋ + i ∗ s, ⌊W_t/4⌋ + j ∗ s)   (3.18)

l*_j = j* − y*_1
t*_i = i* − x*_1
r*_j = y*_2 − j*   (3.19)
b*_i = x*_2 − i*

G_class(i*, j*) = 1 if x*_1 ≤ i* ≤ x*_2 and y*_1 ≤ j* ≤ y*_2, and 0 otherwise   (3.20)

G_box(i*, j*) = (l*_j, t*_i, r*_j, b*_i)   (3.21)

G_center(i*, j*) = sqrt( min(l*_j, r*_j)/max(l*_j, r*_j) × min(t*_i, b*_i)/max(t*_i, b*_i) )   (3.22)

G_mask(i*, j*) = C_mask(i*, j*, ⌊H_t/2⌋, ⌊W_t/2⌋)   (3.23)

where s represents the backbone's total stride (= 8), H_t and W_t are the target's height (= 128) and width (= 128), and C_mask(x, y, H, W) is a function applied to the ground-truth segmentation mask which crops an H × W region centered at location (x, y). In the next paragraph we explain the logic behind the above equations and we visually show an overview of the anchor-free learning strategy in Fig. 3.11.

Intuitively, a location (i, j) on the cross-correlated feature maps F can be mapped back to the input search image at location (i*, j*) using eq. 3.18, which is near the center of its receptive field. If (i*, j*) falls inside B_gt (3.20), then we consider it part of the foreground (tumor) and F_ij should generate a positive classification score (∼1); otherwise it should predict a negative one (∼0).

For each positive location (i*, j*), its associated bounding box is defined as the distance to the left (l*_j), top (t*_i), right (r*_j) and bottom (b*_i) edges of B_gt (computed using eq. 3.19) and, instead of predicting an anchor refinement as in anchor-based methods, F_ij directly generates a box as (l_j, t_i, r_j, b_i), centered at (i*, j*).

Tian et al. [43] observed that locations far from the center of detected objects produce lower quality bounding boxes. To solve this issue, they propose adding an extra output head for learning a distance score (G_center) between each location (i*, j*) and its corresponding object's center, defined as in eq. 3.22. During inference, P_center will be used to downweight the classification score (P_class) based on the distance to the target's center.

Finally, we also want our model to learn a segmentation mask, as we hypothesise that it could improve the tracking accuracy. Therefore, each feature vector F_ij will be responsible for generating a (H_t/2) × (W_t/2) segmentation (P_mask(i, j)) which corresponds to the ground-truth mask (G_mask(i*, j*)) of its receptive field, as defined by eq. 3.23. Thus, G_mask(i*, j*) is nothing more than a (H_t/2) × (W_t/2) region cropped from the input search image segmentation, centered at (i*, j*).


Figure 3.11: Overview of the anchor-free learning process with 2 positive (red and yellow) and 3 negative (violet) examples. The final loss L is a combination of the L_class, L_box, L_center and L_mask losses. The box, center and mask ground truth (G) and prediction (P) scores are not computed for the negative examples, as they do not contribute to the final loss. In table G, BG/FG are the background/foreground ground-truth classification scores.
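A simplified sketch of how the per-location classification, box and centerness targets of eqs. 3.18-3.22 could be generated is shown below (mask targets are omitted for brevity, and the loops are kept explicit for readability rather than speed):

```python
import numpy as np

def make_targets(box_gt, feat_h, feat_w, stride=8, target_size=128):
    """Per-location ground truths for the anchor-free heads (eqs. 3.18-3.22).

    box_gt : ground-truth box (x1, y1, x2, y2) in search-image coordinates
    """
    x1, y1, x2, y2 = box_gt
    g_class = np.zeros((feat_h, feat_w))
    g_box = np.zeros((feat_h, feat_w, 4))
    g_center = np.zeros((feat_h, feat_w))
    offset = target_size // 4                        # eq. 3.18
    for i in range(feat_h):
        for j in range(feat_w):
            ii, jj = offset + i * stride, offset + j * stride
            if x1 <= ii <= x2 and y1 <= jj <= y2:    # eq. 3.20
                l, t = jj - y1, ii - x1              # eq. 3.19
                r, b = y2 - jj, x2 - ii
                g_class[i, j] = 1.0
                g_box[i, j] = (l, t, r, b)           # eq. 3.21
                g_center[i, j] = np.sqrt(            # eq. 3.22
                    (min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
    return g_class, g_box, g_center
```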

3.4.4 Loss functions

In this subsection we explain each of the loss functions employed for training the proposed siamese model. Each loss corresponds to a specific output head and the total loss function is computed as the sum of all losses: L = L_class + L_box + L_center + L_mask. It is important to mention that L_mask is used only for the network version that uses the segmentation mask branch.

Classification Loss. Lin et al. [44] investigated the performance of one-stage and two-stage (based on R-CNN [45]) detectors and reached the conclusion that the latter achieve better results, mainly because one-stage detectors suffer from foreground/background class imbalance; as most training candidate locations are easy negatives, their contribution is not useful and can lead to degenerate models. They propose a solution to mitigate this problem in the form of a new classification loss function, called focal loss, defined as:

L_focal(x, y) = −α ∗ (1 − x)^γ ∗ log(x),   if y = 1
L_focal(x, y) = −α ∗ x^γ ∗ log(1 − x),   otherwise   (3.24)

where y ∈ {±1} is the ground truth class, x ∈ [0, 1] is the returned probability, α is a hyperparameter that balances the contribution of positive/negative samples to the loss and γ ≥ 0 is the focusing parameter that controls how much to downweight easy negatives.

Anchor-based siamese trackers, like SiamMask and SiamRPN, limit the number of positive samples to 16 out of a total of 64 training samples in order to avoid the class imbalance; instead, we follow the anchor-free strategy of [43] and use focal loss for classification, with α = 0.5 and γ = 2 as in [44].
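A minimal PyTorch sketch of the focal loss in eq. 3.24, applied to per-location foreground probabilities (the mean reduction is an assumption):

```python
import torch

def focal_loss(pred, target, alpha=0.5, gamma=2.0, eps=1e-7):
    """Binary focal loss (eq. 3.24).

    pred   : predicted foreground probabilities in [0, 1]
    target : ground-truth labels in {0, 1}
    """
    pred = pred.clamp(eps, 1.0 - eps)
    pos = -alpha * (1.0 - pred) ** gamma * torch.log(pred)   # y = 1
    neg = -alpha * pred ** gamma * torch.log(1.0 - pred)     # y = 0
    return torch.where(target > 0.5, pos, neg).mean()
```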

Bounding box Loss. Most anchor-based detection/tracking methods use the smooth L1 loss (Huber loss) (3.25) for regressing the bounding box coordinates (t_x, t_y, t_w, t_h). Such a loss function (L1, L2) was shown to be unsuitable for box regression in the form of (l, t, r, b) by Yu et al. [46], because it destroys the assumption that the bounds of an object are highly correlated (by treating them individually).

L1_smooth(x, y) = (1/2)(x − y)²,   if |x − y| ≤ δ
L1_smooth(x, y) = δ(|x − y| − δ/2),   otherwise   (3.25)

Since the final goal of the predicted bounding box is to maximize the intersection over union (IoU) with the ground-truth box, it makes sense to directly optimize for that, so an IoU loss function [46] is employed for box regression instead.

When tracking a tumor for accurate radiation delivery, it is of most importance to locate its center coordinates. While maximizing IoU indirectly reduces the distance between predicted and ground-truth box centers, there is no explicit formulation of a loss term that directly optimizes it, and therefore it could be that 2 predicted boxes have the same IoU but different center distances to the ground truth. Zheng et al. [47] updated the original IoU loss with a distance penalty and named it Distance-IOU, which is defined as follows:

$$L_{IoU}(p, g) = -\ln \frac{|p \cap g|}{|p \cup g|}, \ \text{as in [46]} \qquad (3.26)$$
$$L_{DIoU}(p, g) = L_{IoU}(p, g) + \frac{\rho^2(p, g)}{c^2}, \ \text{as in [47]} \qquad (3.27)$$

where p and g are the predicted and ground truth bounding boxes, respectively, ρ(·) is the Euclidean distance between their center coordinates and c is the diagonal length of the smallest rectangle enclosing p and g. In the proposed model, the $L_{DIoU}$ loss is used for anchor-free box regression and the distance term (the $\rho^2/c^2$ term in eq. 3.27) is multiplied by a factor of 10 in order to focus more on accurately predicting the tumor's center.
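A sketch of the resulting box loss is shown below; it assumes boxes are given in (x1, y1, x2, y2) corner format and uses the factor of 10 on the distance term mentioned above. Names and the exact tensor layout are illustrative.

import torch

def diou_loss(pred, gt, distance_weight=10.0, eps=1e-7):
    # pred, gt: (..., 4) boxes as (x1, y1, x2, y2)
    ix1 = torch.max(pred[..., 0], gt[..., 0])
    iy1 = torch.max(pred[..., 1], gt[..., 1])
    ix2 = torch.min(pred[..., 2], gt[..., 2])
    iy2 = torch.min(pred[..., 3], gt[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_g = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    iou_loss = -torch.log(inter / (area_p + area_g - inter) + eps)          # eq. 3.26
    # squared distance between the two box centers
    rho2 = ((pred[..., 0] + pred[..., 2] - gt[..., 0] - gt[..., 2]) ** 2
            + (pred[..., 1] + pred[..., 3] - gt[..., 1] - gt[..., 3]) ** 2) / 4
    # squared diagonal of the smallest rectangle enclosing both boxes
    c2 = ((torch.max(pred[..., 2], gt[..., 2]) - torch.min(pred[..., 0], gt[..., 0])) ** 2
          + (torch.max(pred[..., 3], gt[..., 3]) - torch.min(pred[..., 1], gt[..., 1])) ** 2 + eps)
    return (iou_loss + distance_weight * rho2 / c2).mean()                  # eq. 3.27 (weighted)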

Centerness Loss. An extra centerness score is computed for locations (i, j) on the cross correlated feature maps that lie inside the ground-truth bounding box, in order to downweight


predictions that are far away from the center of the tumor. The centerness head is trained using mean-squared error (MSE) loss function, which is added to the total loss of the model. During inference, the classification score is adjusted using the equation:

$$P_{class} = P_{class} \cdot e^{-(P_{center})^{k}} \qquad (3.28)$$

where k is a hyperparameter which controls the influence of the centerness score on the final prediction. In our experiments we use k = 1.5.

Intuitively, learning a distance score between the center of the target and surrounding locations on the search image is similar to the approach of Hirai et al. [10] of predicting a per-pixel target probability map (TPM).
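For illustration, the inference-time adjustment of eq. 3.28 amounts to a one-line operation (a sketch assuming the centerness head outputs a non-negative distance score; names are illustrative):

import torch

def adjust_with_centerness(p_class, p_center, k=1.5):
    # p_center is assumed to be a non-negative distance-to-center score
    return p_class * torch.exp(-(p_center ** k))                # eq. 3.28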

Mask Loss. Similar to [13], a binary logistic regression loss (eq. 3.29) over all positive locations (i, j) is computed in order to classify each pixel of the generated segmentation mask. Let $p_{ij}$, $g_{ij}$ be the $w \times h$ predicted and ground truth segmentation masks corresponding to location (i, j); each pixel $p_{ij}^{kt}$ at position (k, t) is then classified according to whether it belongs to the target or not.

$$L_{mask}(p, g) = \frac{1}{wh} \sum_{k,t} \log\left(1 + e^{-g^{kt} \cdot p^{kt}}\right), \quad k = 1 \dots h, \ t = 1 \dots w \qquad (3.29)$$
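A minimal sketch of eq. 3.29 is given below, assuming the mask head outputs raw logits and the ground truth is encoded with values in {−1, +1}; names are illustrative.

import torch.nn.functional as F

def mask_loss(pred_logits, gt_mask):
    # pred_logits: raw mask scores of shape (h, w)
    # gt_mask:     ground-truth mask of shape (h, w) with values in {-1, +1}
    # softplus(-g * p) = log(1 + exp(-g * p)), computed in a numerically stable way
    return F.softplus(-gt_mask * pred_logits).mean()             # eq. 3.29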

3.4.5 Architecture

The final network consists of 3 main components: a backbone acting as feature extractor, adjustment layers and cross correlation, and 3 network heads for classification, bounding box regression and mask segmentation, respectively. The backbone network is represented by a ResNet50 [42] with the 5-th stage removed; in order to obtain a higher resolution and an increased receptive field in the feature maps generated by the backbone, the first convolutional layer of the 4-th stage is modified with stride 1 and a dilation rate of 2 [48], resulting in a total stride of 8. The 3 adjustment layers described in Sec. 3.4.2 reduce the number of channels to 256 and aim to separate the data domains of the network heads. The three resulting target and search feature maps are then depth-wise cross correlated and fed to the classification, regression and segmentation heads, respectively. A more detailed description of the network's structure is presented in the supplementary material (Sec. 7.1).
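As an illustration of the depth-wise cross-correlation between the adjusted target and search feature maps, a common grouped-convolution formulation is sketched below; this is a sketch assuming 4-D batched tensors, with illustrative names, not the exact implementation used here.

import torch.nn.functional as F

def depthwise_xcorr(search_feat, target_feat):
    # search_feat: (B, C, Hs, Ws) search features; target_feat: (B, C, Ht, Wt) target features
    b, c, h, w = search_feat.shape
    x = search_feat.reshape(1, b * c, h, w)                        # fold the batch into channels
    kernel = target_feat.reshape(b * c, 1, *target_feat.shape[2:])
    out = F.conv2d(x, kernel, groups=b * c)                        # one kernel per channel
    return out.reshape(b, c, out.shape[2], out.shape[3])           # (B, C, Hs-Ht+1, Ws-Wt+1)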

3.4.6 Training

The training dataset of a patient consists of 160 arcs of 181 angle degrees each. For each input (target, search) pair we randomly choose 2 arcs i, j, an angle k and a random integer r in the [1, 5] interval; the search image is generated from the frame corresponding to the i-th arc and k-th angle, and the target image is generated from the j-th arc and (k ± r)-th angle. In this manner, the target and search images are at most 5 frames (angle degrees) apart from each other.

Like [22], we crop the target and search patches based on the ground truth bounding box of the tumor. If the tumor's bounding box has shape (w, h), then the target patch is cropped to be of A × A size, with $A = \sqrt{(w + p)(h + p)}$ and $p = (w + h)/2$. The search patch is cropped at double the size of the target, 2A × 2A, and the target and search regions are resized to 128 × 128 and 256 × 256, respectively. Similar to [13], we use random translations of ±8 pixels for the target image and random rescaling of $2^{\pm q}$ and $2^{\pm 2q}$, with $q \in [\frac{1}{6}, \frac{1}{10}]$, for the target and search images, respectively.
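The crop sizes described above can be computed as in the following sketch (function and variable names are illustrative):

import math

def crop_sizes(w, h):
    # side lengths of the target (A x A) and search (2A x 2A) crops around a (w, h) tumor box
    p = (w + h) / 2.0
    a = math.sqrt((w + p) * (h + p))
    return a, 2.0 * a

# e.g. a 40 x 30 pixel tumor box; the resulting regions are then
# resized to 128 x 128 (target) and 256 x 256 (search)
target_side, search_side = crop_sizes(40, 30)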


Li et al. [14] showed that the padding employed by deeper backbones (e.g. ResNet50) breaks the strict translation invariance of Siamese networks, $f(z, x[\Delta s]) = f(z, x)[\Delta s]$, and the network learns a positional bias towards the center of the cross-correlated feature maps. Their solution involves applying random translations of up to ±32 pixels to the search image, which we also adopt during training.

3.4.7 Inference

During inference, most siamese trackers initialize the target image (z) on the first frame and keep it fixed for the duration of the tracking process. The feature maps computed once by the backbone network on input z are used as convolutional kernels for the depth-wise cross-correlation operation with each new frame's search image (x).

In the problem of tumor tracking on kV images, the 360° arc rotation means that the target's visual appearance changes with the viewing angle. Such variations result in large differences in the tumor's appearance between frames that are far apart, so instead of keeping z fixed, we update it continuously: for frame i, the target's feature maps are computed from the information in frame i − 1.

To obtain the final prediction from the outputs generated by the classification, regression and segmentation heads, we employ the proposal strategies used by Li et al. [22], which are designed to make siamese models suitable for the tracking task. We apply each proposal strategy to the classification output map as follows:

1. The classification scores are firstly adjusted using the centerness predictions according to eq. 3.28, in order to downweight predictions farther away from the target’s center.

2. Since the tumor is always situated in the center region of the search image, a cosine window is applied on the classification score map and only the center h × w subregion is kept. In this way, predictions are re-ranked based on the score map locations relative to the center. We define $h = \lfloor \frac{K}{2} \rfloor$ and $w = \lfloor \frac{K}{4} \rfloor$, as in Fig. 3.12, based on the score map's size (K × K) and the fact that the tumor moves mostly in the superior-inferior (SI) direction.

3. Finally, a scale penalty is applied to the kept predictions to downweight the bounding boxes that differ in terms of size and ratio with respect to the previous frame’s predicted box. The penalty is calculated using the same formula from [22].

The final prediction is selected as the bounding box/segmentation mask corresponding to the location with the maximum classification score from the adjusted classification map (using the above-mentioned proposal strategies).
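A sketch of these inference-time proposal strategies is given below; the multiplicative use of the cosine window and the exact form of the penalty map are simplifying assumptions, and all names are illustrative.

import numpy as np

def select_location(class_map, center_map, scale_penalty, k=1.5):
    # class_map, center_map, scale_penalty: (K, K) maps from the network heads
    K = class_map.shape[0]
    scores = class_map * np.exp(-(center_map ** k))               # step 1: eq. 3.28
    scores = scores * np.outer(np.hanning(K), np.hanning(K))      # step 2: cosine window
    h, w = K // 2, K // 4                                          # keep only the central h x w region
    keep = np.zeros_like(scores)
    top, left = (K - h) // 2, (K - w) // 2
    keep[top:top + h, left:left + w] = 1.0
    scores = scores * keep
    scores = scores * scale_penalty                                # step 3: size/ratio penalty
    return np.unravel_index(np.argmax(scores), scores.shape)       # location of the final prediction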


Figure 3.12: Cropping strategy applied to the classification score map for keeping the predictions corresponding to the center h × w subregion (red and green squares).


Chapter 4

Experiments

Previous methods involving deep neural networks for markerless real-time tumor tracking either use spatial information to detect the target's location in each individual frame [8, 9, 10, 11, 12], or use temporal information to predict the tumor's trajectory, when ground truth is available for kV images [49], or the respiration motion [26]. To our knowledge, no works attempt to take advantage of both types of information, and we would like to investigate the performance of deep siamese trackers on this particular task, as well as the challenges that come with applying such models to medical data.

Since we are constrained to train our models on synthetic data (DRRs), rather than directly on kV images, due to the lack of ground truth information, we also investigate the effectiveness of the proposed DRR to X-Ray translation strategy based on generative neural networks.

4.1 Research Questions

We design our experiments to answer the following research questions that aim to provide a deeper insight about the challenge and our direction towards solving it:

• Q1: Siamese Models. Does a model that uses both spatial and (weak-)temporal information in a siamese fashion outperform a model that detects the target's location based only on spatial information? What are the challenges that come with training a deep siamese model on medical data?

• Q2: Different model versions. Is ResNet50 a suitable backbone network for our siamese model? Is predicting the target's bounding box in an anchor-free approach suitable for siamese trackers? Does learning to predict the target's segmentation mask improve the results?

• Q3: DRR to X-Ray translation. Can generative adversarial networks (GANs) reduce the gap between DRRs and X-Rays in order to improve the model's results on real data?

For the most part, our research is focused on answering question Q1, to which several of our experiments are dedicated.


The second research question Q2 aims to improve the performance of our siamese model by analysing different model versions, including different backbones, the employed anchor-free approach and the use of a segmentation mask head. While Q1 and Q2 focus on the tracking model itself, Q3 switches the attention towards the differences between the training and inference datasets. This question is formulated to analyse a different approach for transposing the network's performance from one data distribution (DRRs) to another (X-Rays).

4.2 Tracking model setup

In this section we explain the implementation choices for training and validation of the proposed siamese model, which are essential for running the experiments designed to answer questions Q1 and Q2, and we also define the baselines against which the model will be compared.

4.2.1 Baselines

For the challenge of tumor tracking in lung cancer patients there are no benchmark datasets; most works in this field have used their own clinical data, which cannot be made public due to confidentiality restrictions, nor have they made their implementations publicly available for reproducing the results. In order to compare the performance of our method to the selected baselines, we are therefore constrained to implement their training pipelines based on the details specified in their publications. We choose the methods of Zhao et al. [9] and Takahashi et al. [11] as our baselines, as they are the most similar to ours in terms of network structure and dataset generation.

Baseline 1 (FasterRCNN)

FasterRCNN [17] is a region-based convolutional neural network for real-time object detection. The network consists of 3 modules: a backbone, a region proposal network (RPN), and ROI classification and bounding box regression heads. The backbone is a standard convolutional neural network, usually a ResNet, that extracts the image features which serve as input to the other modules.

The region proposal network takes the features from the backbone as input and scans them in a sliding-window fashion, giving a foreground-background score and bounding box refinement for each scanned region. Based on the score, the RPN will output a set of region proposals which are most likely to contain objects.

The last module will receive the feature maps from the backbone and the proposals from RPN and for each proposal it will scan the feature maps and generate a score for each possible class in the dataset and a second bounding box refinement.

Zhao et al. [9] showed that FasterRCNN performs well for tumor detection in patients with pancreatic cancer. For this reason, we choose to implement FasterRCNN as our first baseline. We also add a feature pyramid network (FPN), introduced by Lin et al. [50], in order to better represent objects at different scales. The FPN module is also used by MaskRCNN [18] to obtain better accuracy. Unlike [9], which used a VGG16 network as feature extractor, we implement a ResNet50 feature extractor up to the 4th stage for a fairer comparison to our siamese model, which also uses a ResNet50 backbone up to the 4th stage.


Baseline 2 (Semantic Segmentation network)

The second baseline employs an encoder-decoder network for predicting the tumor's segmentation mask, instead of directly outputting its location in the form of a bounding box. Based on the predicted mask, the target's location is then calculated as the density center weighted by pixel values. Takahashi et al. [11] hypothesised that such a strategy would give better results than detection methods returning the target's coordinates directly, because a few misclassified pixels will have little impact on the final prediction.
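As a small illustration, the pixel-value-weighted density center described above could be computed as follows (a sketch assuming a 2-D array of per-pixel mask probabilities; names are illustrative):

import numpy as np

def weighted_center(mask_probs):
    # mask_probs: (H, W) array of per-pixel tumor probabilities
    ys, xs = np.indices(mask_probs.shape)
    total = mask_probs.sum()
    return (xs * mask_probs).sum() / total, (ys * mask_probs).sum() / total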

The decoder network is replaced by a pixel shuffle layer to make the model suitable for real-time detection. The encoder network consists of 5 convolutional layers with wide kernel sizes for capturing non-local features (the tumor's contour). This method showed very good results in a lung cancer phantom study, but its performance was not assessed on real patient data.

In Table 4.1 we show a summary of the main differences between our proposed siamese tracking model and the chosen baselines. The most important distinctions are the tracking strategy (by detection/by similarity score) used for target localization and the input image size. The baseline models receive a 384 × 384 search area cropped from the center of the 512 × 768 input frames, which is large enough to capture the tumor, while the siamese network receives a 256 × 256 search input image, according to the model’s description from Section 3; that is because siamese trackers do a local search of the target, while the other methods employ a tracking by detection strategy (a global search). Other relevant dissimilarities are related to the network structure (nr. of layers, kernel sizes, skip connections etc.), the total stride of convolutional layers and the batch size used for training (due to memory restrictions).

Model                 | Input Size | Tracking            | Core CNN  | Anchors | 2-stage | Stride | Batch size
Zhao et al. [9]       | 384 × 384  | By detection        | ResNet50  | ✓       | ✓       | 16     | 8
Takahashi et al. [11] | 384 × 384  | By detection        | Wider CNN | ✗       | ✗       | 8      | 16
Our model             | 256 × 256  | By similarity score | ResNet50  | ✗       | ✗       | 8      | 16

Table 4.1: Overview of the key differences between the baselines and our proposed model.

Training

All 3 models (baselines + ours) are trained from scratch on the datasets generated from the 4DCT scans of 2 lung cancer patients according to the data generation pipeline described in section 3.2, where we skip the last step of translating DRRs to X-Rays because this step is used to answer Q3.

The training procedure is done end-to-end using the SGD optimizer with an initial learning rate of $10^{-3}$, which linearly increases to $10^{-2}$ over 10 epochs and then decreases exponentially to $10^{-5}$ over another 20 epochs, where each epoch consists of 1000 steps. For SGD the momentum is set to 0.9 and we also employ gradient clipping with a clip value of 5 to reduce numerical inconsistencies. We use a batch size of 16 for the siamese model and the second baseline, and a batch size of 8 for the first baseline due to memory limitations.
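A sketch of this learning-rate schedule is shown below; the exact parametrisation of the exponential decay is an assumption, and the names are illustrative.

def learning_rate(epoch, warmup_epochs=10, decay_epochs=20,
                  lr_start=1e-3, lr_peak=1e-2, lr_end=1e-5):
    if epoch < warmup_epochs:
        # linear warm-up from lr_start to lr_peak
        return lr_start + (lr_peak - lr_start) * epoch / warmup_epochs
    # exponential decay from lr_peak towards lr_end
    t = (epoch - warmup_epochs) / decay_epochs
    return lr_peak * (lr_end / lr_peak) ** t

lrs = [learning_rate(e) for e in range(30)]   # one value per epoch (1000 steps each)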

In order to answer part of question Q2, we train our proposed siamese model with and without

the segmentation head and also using the anchor-free and anchor-based approaches. Specifically, 3 model versions are trained: anchor-free siamese, anchor-free siamese + mask and anchor-based siamese models. We compare the performance of all 3 variants to decide which one suits our challenge the most.
