
Uncertainty visualization for synthetic CT images generated from MRI scans

Kiki van Rongen

Faculty of Science

University of Amsterdam

A thesis submitted for the degree of

Master of Science


Acknowledgements

I would like to thank my UvA supervisor, Professor Marcel Worring, for his guidance throughout my thesis. His valuable feedback and quick responses have enabled me to continuously move forward.

Furthermore, I would like to acknowledge the entire team at MRIguidance for the pleasant work environment and laughter at the office. In particular, I would like to thank Daan Kuppens and Marijn van Stralen for the interesting discussions and their patience in explaining complex medical matters. Finally, I am grateful for the support of my family and friends over the past months. Many of you have taken a true interest in my research, which always sparked the motivation needed to reach the finish line.


Abstract

Uncertainty visualization is of great importance for deep learning applications in the medical domain, as tasks are often highly risk-sensitive. Moreover, it plays a pivotal role in understanding the robustness and decision-making process of neural networks. However, current methods for obtaining uncertainty estimates do not address the generation of relevant uncertainty visualizations. Rather, they mainly focus on maximizing accuracy by using the uncertainty estimates. We propose a method that enables the acquisition of relevant uncertainty visualizations by performing noise insertions at test time. We use dropout as the specific type of noise and spread several dropout layers spatially throughout the network. We assess the relevance of the findings both qualitatively and quantitatively. The qualitative approach pays attention to the visual similarity between prediction errors and uncertainty visualizations. The quantitative assessments, on the other hand, aim to reveal the correlation between prediction errors and uncertainty estimates.

We illustrate the strength of our method by evaluating on actual patient data. More specifically, we consider the task of transforming an MRI scan into a synthetic CT scan. We employ a popular 3D fully convolutional network, known as V-Net, and conduct a range of experiments accordingly.

We show that the quality of uncertainty visualizations is highly dependent on the position of noise insertion, whereas accuracy remains fairly stable. Moreover, a comparative analysis of uncertainty maps obtained by single noise insertions reveals significant deviation in highlighted regions, offering insight into the model's decision-making process.

Our research contributes to the connection between theoretical frameworks and real-world applications. In particular, we propose a tangible way of examining uncertainty estimates that can readily be used.


Contents

1 Introduction
   1.1 Problem definition
   1.2 Research aims
   1.3 Thesis outline
2 Related Work
3 Background
   3.1 Neural Networks
   3.2 Bayesian Neural Networks
   3.3 Variational Inference
4 Methodology
   4.1 Expression of epistemic uncertainty
   4.2 MC Dropout
      4.2.1 Dropout
      4.2.2 Inclusion of dropout in MC dropout
   4.3 MC dropout extension for FCNs
5 Experimental Design
   5.1 Data
   5.2 Experiments
   5.3 Evaluation criteria
      5.3.1 Synthetic CT assessment
      5.3.2 Uncertainty assessment
6 Results
7 Discussion
8 Conclusion
A Extra visualizations


List of Figures

1.1 Example of aleatoric uncertainty
1.2 Example of epistemic uncertainty
4.1 Block diagram
4.2 V-Net architecture
5.1 Example of high similarity between bad reconstructions in the synthetic CT and highlighted regions in the uncertainty map
5.2 Example of selected bone within the evaluation mask
6.1 Uncertainty maps for the pelvis data with dropout at various positions
6.2 Uncertainty maps for the SI joint data with dropout at various positions
6.3 Relative change in evaluation MAE for insertion of dropout at various positions
6.4 Uncertainty maps with dropout at high-level features compared to other feature levels
6.5 Visual difference with insertion of a single dropout layer and multiple dropout layers
6.6 Correlation between voxelwise error and uncertainty in the evaluation mask divided in four quadrants


Chapter 1

Introduction

1.1 Problem definition

Nowadays deep learning models, or neural networks (NNs), are increasingly used in various medical applications [14]. They can significantly speed up decision-making processes by providing supporting information, or eventually bypass human observation completely for routine tasks. However, they also pose a serious risk, as erroneous decisions in these applications may come at a high cost. Certainly we want to avoid sending ill patients home or providing healthy patients with hazardous treatment plans. There is ever more demand from the medical industry for uncertainty quantification in deep learning. Typical NNs, although strikingly accurate, cannot express a measure of uncertainty over model predictions.

The general notion of uncertainty in deep learning is usually divided into three subgroups: approximation, aleatoric and epistemic uncertainty [20]. The first group, approximation errors, is caused by the choice of model family. For example, a linear model is not sophisticated enough to fit data that follows a quadratic curve. NNs are famous for their ability to fit complex data due to their large number of parameters. Therefore, we may assume that the approximation error is negligible [24]. The difference between the remaining two is the source of the uncertainty. Aleatoric uncertainty arises from noisy input data deteriorating the model's performance. Epistemic uncertainty, on the other hand, concerns not the input data but the model parameters. NNs try to find the optimal set of parameters with respect to the observed samples in the data set. Yet, the observed samples only represent a subset of the actual data. For that reason, epistemic uncertainty is caused by the model lacking knowledge regarding samples it has not seen before. It can be reduced by increasing the number of samples in the data set. As approximation uncertainty is unlikely for NNs, we provide specific examples of epistemic and aleatoric uncertainty in our data set further on.

Various methods have been proposed to account for epistemic as well as aleatoric uncertainty. First of all, we consider popular epistemic uncertainty methods. To this end, we revisit the concept of likelihood in deep learning frameworks. Likelihood refers to the probability of observing an output given the model parameters. Thus, it requires some knowledge regarding the distributions of these parameters. The parameters of a NN are often called weights and indicate the model coefficients. The weights transform the input, via subsequent multiplications, to a certain output. Hence, if we have access to a distribution for every weight parameter, we gain information regarding the probability of occurrence for different possible outputs of the network. This enables us to construct an output distribution. As stated earlier, common NNs output only a point estimate for the parameter mean and are not able to express a degree of uncertainty accompanying these point estimates. Bayesian neural networks (BNNs), on the other hand, offer a more sophisticated framework [15, 19]. Indeed, BNNs follow the preceding logic by imposing a distribution over every weight. Thereupon, the variance (or spread) of the output distribution can serve as an uncertainty measure. More centered distributions have a lower variance and indicate that values occur mostly around the mean. Therefore, the probability of encountering outliers is lower, so we say that the uncertainty is lower. Unfortunately, constructing weight distributions also doubles the number of parameters to be estimated by the model. BNNs can handle epistemic and aleatoric uncertainty well, but remain a computationally expensive method. Hence, other more tangible frameworks were developed later on.

A more practical approach, known as Monte Carlo dropout or MC dropout for short, has been introduced and reduces computational time significantly [5]. Dropout is a conventional method to prevent overfitting of the parameters with respect to the samples in the training set, by randomly dropping out parts of the network [23]. The method applies to the training procedure, so dropout layers are enabled during training. MC dropout activates dropout after training as well, when we evaluate the trained network on a test set. In this way, it creates multiple predictions for the same input by forwarding the input through the network several times with different random dropout configurations. Combined, these predictions form an accurate approximation of the output distribution. Other methods include an ensemble of MC dropout models or Deep Ensembles [13]. Ensemble methods conduct multiple trainings, each time initializing the weights randomly. In doing so, these methods drastically increase training time. In contrast, MC dropout keeps training time constant, but prolongs the testing procedure. For that reason, it is favorable to employ MC dropout rather than ensemble methods for large NNs that already require a lot of training time.
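To make "activating dropout after training" concrete, the following is a minimal PyTorch sketch (PyTorch is an assumption, as the thesis does not name a framework, and enable_mc_dropout is a hypothetical helper): after model.eval() disables all stochastic layers, the dropout modules alone are switched back to training mode so that test-time forward passes remain stochastic.

```python
import torch.nn as nn

def enable_mc_dropout(model: nn.Module) -> None:
    """Hypothetical helper: re-enable dropout after model.eval() so that
    test-time forward passes remain stochastic, as MC dropout requires."""
    for module in model.modules():
        if isinstance(module, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            module.train()  # dropout is only active in training mode

# Usage sketch:
#   model.eval()               # freeze batch-norm statistics etc.
#   enable_mc_dropout(model)   # keep the dropout layers stochastic
#   predictions = [model(x) for _ in range(T)]  # T different dropout masks
```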

An ancillary feature of methods capturing epistemic uncertainty, such as MC dropout, is their ability to boost accuracy [4, 10, 13, 18, 25]. By averaging multiple outputs we obtain an improved estimate for the parameter mean, resulting in a more accurate prediction. However, models with higher accuracy are not necessarily accompanied by better uncertainty visualizations. Rather, good uncertainty visualizations may come at the cost of accuracy. We dive into the matter of defining a 'good' uncertainty visualization later on. Nevertheless, this study focuses on the acquisition of uncertainty estimates. We consider increasing accuracy only in conjunction with these uncertainty estimates, not as a goal in itself.

Additionally, the amount of epistemic uncertainty can be an indication of the robustness of the model. We refer to model robustness as the ability of the model to produce sufficient reconstructions for adversarial, or out-of-domain, inputs. To illustrate, we consider MC dropout as the approach to measure epistemic uncertainty. Since MC dropout essentially turns off neurons during inference, thereby inserting noise, we can say it forces the input to be somewhat adversarial. How well the model handles such noise insertions depends on its robustness. We can ascertain this by examining the distributions of predictions created by activating dropout during inference.

We leave the notion of epistemic uncertainty for now and proceed with deep learning frameworks that address aleatoric uncertainty. If we assume every sample in the data set is equivalent, so that there is no correlation between the input noise and uncertainty, we have homoskedasticity. Violating this assumption causes heteroskedasticity, a phenomenon that is commonly encountered in deep learning. Within heteroskedastic regression tasks, all predictions for one sample suffer from the input noise. Hence, the uncertainty is taken as a whole and not measured per parameter, as is the case for epistemic uncertainty methods. Most approaches propose to add a noise parameter to the loss function, representing the overall uncertainty of the input sample. Thereby, they induce a heteroskedastic noise model [10, 25]. The noise parameter signifying aleatoric uncertainty is a key component of the total uncertainty, or predictive variance. In particular, predictive variance is a summation of epistemic and aleatoric uncertainty. We elaborate on predictive variance in general and its relation to aleatoric uncertainty in section 4.1.


From a theoretical point of view, the respective methods for uncertainty quantification can be seen as separate fields of study. As indicated, epistemic uncertainty approaches the problem on a parameter level, while aleatoric uncertainty depends on the entire input. Nevertheless, it is fairly easy to combine both into one framework [10]. Studies that do so usually aim for a comparative analysis between the two. We choose to focus on methods capturing one specific type of uncertainty and provide an in-depth investigation of its value for the interpretation of model predictions. Epistemic uncertainty is the logical choice in this case, since its associated methods are more sophisticated than those for aleatoric uncertainty. Certainly, we do not wish to ignore the issue of noisy data, but attempt to address it by evaluating the uncertainties acquired when measuring epistemic uncertainty. Although epistemic and aleatoric uncertainty are mostly considered separately when modelling, in practice it is challenging to classify a particular uncertainty as being either epistemic or aleatoric [3].

We analyze the concept of uncertainty for a real-world application in the medical field. More specifically, we consider synthetic computed tomography (CT) images generated from magnetic resonance imaging (MRI) scans. MRI technology is typically used to visualize soft tissue, while CT ensures an accurate depiction of bone structures. Regrettably, the latter can only be produced by exposing the patient to harmful radiation. Moreover, MRI-based synthetic CT methods combine soft tissue and bone in one imaging exam. For these reasons, synthetic CTs are favorable. To understand the meaning of uncertainty in a medical setting, we present an example of aleatoric and epistemic uncertainty for our data set.

Recall that aleatoric (or data) uncertainty arises because of poor input data that disturbs model predictions. Although properly trained, the model cannot always account for external factors. An example is provided in figure 1.1, where we see the input image, prediction and ground truth respectively. The upper part of the MRI scan contains a few blurry lines, denoted in figure 1.1a, indicating that the patient has moved during the procedure. The motion artefacts degrade the quality of the scan, which causes the model to miss certain bone structures.

Figure 1.1: Example of aleatoric uncertainty: (a) MRI scan, (b) synthetic CT, (c) ground truth. The red circle indicates several motion artefacts in the MRI scan.

Next, figure 1.2 covers an example of epistemic (or model) uncertainty. It reveals a clear dissimilarity between the synthetic CT and ground truth, different from the one seen before. The synthetic CT contains a false bone reconstruction at the center. A closer look at the corresponding MRI scan reveals why. MRI scanners produce virtually no signal for bone and air; these signal voids appear black on the scan, illustrated in figure 1.2a. A common mistake made by the model, and the one presented here, is recognising air regions as bone, causing a false bone reconstruction in the synthetic CT. We consider this epistemic uncertainty, as the error occurs because of a lack of knowledge regarding the distinction between air and bone.

Figure 1.2: Example of epistemic uncertainty: (a) MRI scan, (b) synthetic CT, (c) ground truth. The red circle indicates signal voids in the MRI scan.

The synthetic CTs, as presented in figure 1.1b and 1.2b, are generated via a fully convolutional network (FCN) [22]. FCNs can learn to make dense predictions per pixel and are thus suitable for MRI to CT conversion. They adopt an encoder-decoder structure, where the encoder first decreases the size of the image and the decoder thereafter enlarges the image up to its original size. We discuss several hurdles of the task at hand. To this end, we start with the complexity of synthetic CT data and subsequently consider some difficulties of medical applications in general.

To start with, one apparent complexity is the three-dimensionality of the data. Both MRI and CT scans contain an additional depth dimension. Instead of pixels, we are dealing with voxels (volume pixels). While 2D frameworks can be applied to 3D data as well, the context of voxels is then only partially included. Performing volumetric convolutions is necessary to ensure that the context of voxels is included in all directions and not merely along two axes, as is the case for 2D models [17].
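To make the difference concrete, here is a small PyTorch sketch (framework and shapes are illustrative assumptions, not taken from the thesis) of a volumetric convolution over a voxel cube; a 2D convolution would slide over only two of the three axes:

```python
import torch
import torch.nn as nn

# A toy voxel cube: (batch, channels, depth, height, width).
patch = torch.randn(1, 1, 64, 64, 64)

# Conv3d aggregates voxel context along all three axes; Conv2d would
# treat each depth slice independently and ignore depth neighbours.
conv3d = nn.Conv3d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
features = conv3d(patch)  # shape: (1, 16, 64, 64, 64)
```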

Another obstacle is the computational cost of 3D frameworks. We typically pass cubes of voxels through the model, which requires a lot of memory. Considering larger cubes limits the number of features we can utilize. There is a trade-off between the cube size and the complexity of the model.

Furthermore, the formation of synthetic CTs relies on a training procedure where the input image and ground truth come from two distinct scanners. This means that we do not have a 1-to-1 spatial mapping between the two. An additional step is required prior to training, in which the CT scan is transformed to the MRI coordinate system. This process is referred to as registration and entails some misalignment errors, causing the targets to be noisy. The phenomenon of noisy targets is not uncommon in deep learning applications and has been studied over the past years by various researchers [8, 20]. Yet, in our case it is very hard to grasp, since there is no metric representing, for example, the amount of misalignment for a patient. Thus, it is challenging to take it into account during training.

Lastly, for medical applications in general it is considered difficult to obtain a correct error metric. Most studies measure the absolute distance between the target and prediction, in our case the CT and synthetic CT scan. Yet, this lacks essential information regarding the bone structure. On top of that, it is highly influenced by registration errors. A more correct way would be to also take into account the relevance of errors. To illustrate, for the synthetic CTs considered here the errors inside or around bone structures play a more prominent role.

Since we cannot fully rely on the error metric, it is cumbersome to link uncertainty to errors. Therefore, we evaluate a particular uncertainty map in two manners, one quantitative and the other qualitative. From a quantitative perspective we examine the correlation between uncertainty measures and the error metric, e.g. the absolute distance. If we adopt a qualitative approach, we investigate the similarity between highlighted regions in the uncertainty visualization and the visibly inaccurate regions in the prediction. The preceding evaluation criteria ensure that we are able to distinguish a 'good' uncertainty map.

1.2 Research aims

Returning to the general framework, we seek to examine the effect of epistemic uncertainty on predictions for medical applications. To this end, we propose an extension of the MC dropout approach for quantification of epistemic uncertainty by varying the placement of dropout layers throughout the network. Recall that MC dropout enables dropout during inference, thereby inserting noise. Hence, it can facilitate the understanding of a change in outcome with respect to a subset of the learned features. Most studies use dropout after each convolutional or each fully-connected layer [10, 21], thereby bypassing the effect of individual dropout layers on output variation. However, recent research has shown that finding a proper location for the acquisition of uncertainty estimates affects uncertainty performance substantially [16]. Therefore, our method plays a pivotal role in trying to understand how uncertainty manifests itself at different positions within the network. Moreover, the MC dropout extension enables us to assess model robustness.

The proposed method builds upon [9], in which a baseline network for probabilistic pixel-wise semantic segmentation is introduced. The research experimented with dropout layers for each convolutional block in the encoder and/or the decoder. They state that while model uncertainty remains visually constant for all experiments, there is a significant difference in accuracy. Moreover, they assert that high-level features benefit more effectively from Bayesian weights. We contradict these results and provide an explanatory note for both further on.

The aim of this work is to offer a tool that facilitates the comprehension of uncertainty throughout large neural networks, by cleverly making use of MC dropout. Subsequently, we elaborate on the value of the uncertainty for the interpretation of the prediction. We can henceforth summarize our approach as inserting noise at specific positions in the network with dropout. We assess the effectiveness of our approach on a typical subset of NNs adopting an encoder-decoder structure, namely FCNs.

Furthermore, we pursue a practical approach. In particular, we provide detailed assessments of the obtained images. To this end, we cover the relation between uncertainty visualizations and error metrics. Additionally, we investigate visual dissimilarities between the synthetic CT and uncertainty visualizations. Our practical approach creates a bridge between theoretical frameworks for uncertainty quantification and real-world medical applications.

1.3 Thesis outline

This study kicks off with the related work chapter, presenting a general overview of various studies that incorporate uncertainty estimates into their framework. Thereafter, chapter 3 discusses the required background knowledge, indicating the origin of BNNs and their usefulness for measuring epistemic uncertainty. From this, we derive the proposed method in chapter 4. The subsequent two chapters focus on the experimental design (chapter 5) and results (chapter 6). Finally, we close with the discussion and conclusion, pointing out potential future work.


Chapter 2

Related Work

This chapter outlines other studies forming the basis of this research. First, we examine methods that quantify different uncertainty types within NNs. Subsequently, we discuss studies that exploit the relation between epistemic uncertainty and model predictions. Finally, we consider some research that addresses uncertainty in deep learning from a medical point of view.

Approaches to quantifying various uncertainty types in deep learning frameworks have been introduced earlier [10, 12, 25]. Most studies distinguish between epistemic and aleatoric uncertainty, also known as model and data uncertainty respectively as indicated in the introduction. Current frameworks adopt separate techniques for both. For the epistemic part, MC dropout is most commonly used [6], while aleatoric uncertainty is typically measured via a heteroskedastic noise model [10]. Our proposed method covers only epistemic uncertainty, as techniques for aleatoric uncertainty are considered a separate field of study. Moreover, popular aleatoric methods do not always yield informative uncertainty maps in medical applications [12]. We leave this matter for future research.

Alternative studies focus on model uncertainty exclusively. The aim of these studies is not only to boost accuracy, but also to improve model robustness. By inserting noise at specific places during inference, they trigger the model to generate diversity in predictions. One approach is to consider noise in the form of dropout layers [6]. For regression tasks analogous to ours (with a similar encoder-decoder structured network), the conclusion is often drawn that inner layers are more suited for dropout due to their high-level feature extraction [9]. Furthermore, experiments with dropout in only the encoder or decoder have resulted in similar-looking uncertainty maps, yet significantly different model performances. On top of that, there is no reference to the assessment of the obtained uncertainty maps; therefore, we do not know whether those uncertainty maps are 'good' in any sense. This emphasizes the earlier statement that models with higher accuracy are not necessarily accompanied by better uncertainty visualizations. We discuss the matter in more detail in chapter 6.

Other kinds of noise have recently been investigated as well, including transformations and Gaussian noise. While these techniques are frequently used for data augmentation (prior to training), it has been shown that they effectively lead to better correlated uncertainty estimates than MC dropout [16]. However, they are only applicable to data sets that do not already undergo comparable data augmentation. Dropout is not considered data augmentation, since it applies to the NN and not the data set. As we already perform data augmentation extensively before training, we focus on adding noise via dropout. Similar to [5], but unlike [16], we restrict ourselves to inserting dropout layers at equal positions and activating the layers during both training and inference. If we loosen this constraint, i.e. activate dropout only at inference and not during training, the model is less robust against noise at test time, leading to potentially unstable loss functions.

Although research demonstrates a significant increase in model performance when adding noise [5, 9, 16, 25], there is little consideration beyond accuracy. Moreover, there has been no further investigation of why certain noise insertions are indeed optimal. Therefore, we extensively assess uncertainty visualizations obtained at intermediate layers of the network.

Lastly, various methods exist within the medical field to determine the quality of uncertainty estimates. A simple way is to plot the uncertainty against the absolute deviation between the predicted voxel and target voxel [7]. The evaluation is then done on a voxel level. It can be applied to different measures of uncertainty, such as predictive variance, predictive entropy or mutual information [18]. Another approach is to focus on calibration [11]. Well-calibrated uncertainty estimates arise when the distribution over predictions matches the distribution observed during training. The preceding methods all adopt a quantitative approach. Since error metrics are often biased in medical applications, we also perform qualitative assessments. More specifically, we consider the visual similarity between uncertainty and errors in the prediction. We aim to link visual observations to the metrics in specific regions to reduce metric bias.


Chapter 3

Background

We discuss several concepts that underlie our methodology. First, we outline a Bayesian approach towards neural networks, in which we consider a probability distribution over predictions. For this, we revisit BNNs in more detail. Thereafter, we discuss an approximation technique that makes BNNs feasible, known as variational inference (VI).

All formulas are expressed in a three-dimensional setting, as the data set for the considered application is also three-dimensional. The data cover $N$ patients, for which we have MRI scans and CT scans. If we regard a single patient, we denote the input (MRI scan) for that patient with $\mathbf{x}_i$ and the target (CT scan) with $\mathbf{y}_i$. We may assume that the CT is already transformed to the MRI coordinate system, so both scans $\mathbf{x}_i$ and $\mathbf{y}_i$ have an equal number of voxels $M$ (length × width × depth) and every voxel in the MRI has a target voxel in the CT scan.

3.1 Neural Networks

Following [10], we consider the dataset $\mathcal{D} = (\mathbf{X}, \mathbf{Y})$, where $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and $\mathbf{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$. Both $\mathbf{x}_i$ and $\mathbf{y}_i$ are $M$-dimensional vectors for $i = 1, \ldots, N$. First, we consider a pair of scans $(\mathbf{x}_i, \mathbf{y}_i)$ of patient $i$. Let $f^{\mathbf{W}}(\cdot)$ be a NN, where $\mathbf{W}$ denotes the weight parameters of the network. The NN is fed an input $\mathbf{x}_i$ and outputs a prediction (synthetic CT) $\hat{\mathbf{y}}_i$, so

$$\hat{\mathbf{y}}_i = f^{\mathbf{W}}(\mathbf{x}_i). \tag{3.1}$$

We say that $\hat{\mathbf{y}}_i$ is a point estimate for $\mathbf{x}_i$, since we only have a single prediction per input.


3.2 Bayesian Neural Networks

We now turn to a different segment of networks, namely BNNs. The goal of BNNs is to discard the notion of a point estimate and consider a probability distribution over the prediction instead. In the introduction it is mentioned how the variance of a distribution can serve as a degree of uncertainty over predictions, also known as predictive variance. However, in order to construct a distribution we actually need several predictions for the same input, and not merely a point estimate. Note that we thereby indicate that feeding an input through the NN may lead to different outputs, hence equation 3.1 no longer holds. Recall that we can induce an output distribution by imposing a distribution over the weight parameters. It is common practice to impose a normal (Gaussian) distribution for weight initialization. By enforcing the preceding logic, we alter equation 3.1 so the output becomes a distribution:

$$p(\mathbf{y}_i \mid f^{\mathbf{W}}(\mathbf{x}_i)) = \mathcal{N}\left(f^{\mathbf{W}}(\mathbf{x}_i), \sigma^2\right), \tag{3.2}$$

where we have a probability distribution over the target $\mathbf{y}_i$. We refer to $p(\mathbf{y}_i \mid f^{\mathbf{W}}(\mathbf{x}_i))$ as the predictive distribution. Equation 3.2 implies that we impose a distribution over the NN. This essentially means that we express an initial belief, or prior, over the weight parameters $\mathbf{W}$, such that $\mathbf{W} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ with $\mathbf{I}$ the identity matrix. Because every weight parameter in $\mathbf{W}$ is now normally distributed, the output of the NN is also Gaussian, with the NN output as mean and an additional noise parameter $\sigma^2$ serving as the variance.

Equation 3.2 represents the likelihood of an individual sample. The likelihood is a typical notion in statistics, especially within Bayesian statistics. It measures how well the model can fit new data given the model parameters, in our case the weight matrix $\mathbf{W}$. If we consider the entire dataset, and not one sample as in equation 3.2, we can also write the likelihood as $p(\mathbf{Y} \mid \mathbf{X}, \mathbf{W})$. By applying Bayes' rule we find the most plausible weight parameters given the data, also known as the posterior:

$$p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y}) = \frac{p(\mathbf{X}, \mathbf{Y} \mid \mathbf{W})\, p(\mathbf{W})}{p(\mathbf{X}, \mathbf{Y})} = \frac{p(\mathbf{Y} \mid \mathbf{X}, \mathbf{W})\, p(\mathbf{W})}{p(\mathbf{Y} \mid \mathbf{X})}, \tag{3.3}$$

where we have used the conditional probability rule $p(A, B) = p(B \mid A)\, p(A)$. Furthermore, $p(\mathbf{W})$ denotes the Gaussian prior over the weights and $p(\mathbf{X}, \mathbf{Y})$ is the probability of the data, or evidence.


With equations 3.2 and 3.3 we have all the necessary elements to perform Bayesian inference. During Bayesian inference we observe a new sample $\mathbf{x}^*$, pass it through the network to acquire the predictive distribution $p(\mathbf{y}^* \mid f^{\mathbf{W}}(\mathbf{x}^*))$ and calculate Bayes' rule to obtain the posterior. For simplification, we write $p(\mathbf{y}^* \mid f^{\mathbf{W}}(\mathbf{x}^*))$ as $p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{W})$ to indicate that the predictive distribution depends on the input $\mathbf{x}^*$ and the model parameters $\mathbf{W}$. By multiplying the posterior with the likelihood we get

$$p(\mathbf{y}^*, \mathbf{W} \mid \mathbf{x}^*, \mathbf{X}, \mathbf{Y}) = p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{W})\, p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y}). \tag{3.4}$$

This formula is problematic, since we are actually interested in the marginal probability $p(\mathbf{y}^* \mid \mathbf{x}^*)$. The dependence on $\mathbf{W}$ arises because the likelihood depends on it as well. We can get rid of $\mathbf{W}$ with marginalisation, meaning we sum over all possible values for $\mathbf{W}$ by taking the integral. In doing so, we obtain

$$p(\mathbf{y}^* \mid \mathbf{x}^*) = \int_{\mathbf{W}} p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{W})\, p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y})\, d\mathbf{W}, \tag{3.5}$$

which is exactly what we want: a predictive distribution for a new sample independent of the model parameters. For further notation we leave out the subscript denoting the integration over $\mathbf{W}$.

In theory, Bayesian inference can be done by following the preceding steps. Unfortunately, there is an issue when we try to calculate the posterior. While we have a tractable expression for the likelihood and the prior, in the sense that we are able to calculate these components individually, the evidence presents more of a challenge. The marginal probability $p(\mathbf{Y} \mid \mathbf{X})$ cannot be evaluated analytically. Rewriting the denominator of equation 3.3 reveals why:

$$p(\mathbf{X}, \mathbf{Y}) = \int p(\mathbf{X}, \mathbf{Y}, \mathbf{W})\, d\mathbf{W} = \int p(\mathbf{X}, \mathbf{Y} \mid \mathbf{W})\, p(\mathbf{W})\, d\mathbf{W}. \tag{3.6}$$

This formula is intractable as we have to integrate over all possible values for $\mathbf{W}$, which is generally impossible. Hence, the posterior is also intractable. The reason why marginalisation was not an issue before, in equation 3.5, is because we are able to take samples there. We dive into this matter in more detail in section 4.2. Nevertheless, several solutions have been proposed to overcome the integration problem, sampling being one of them [19]. Yet, it has shown slow convergence rates [2]. Approximating the posterior instead is therefore more popular and is also the approach on which we proceed.


3.3 Variational Inference

A conventional way of approximating the posterior is Bayes by Backprop, also known as variational Bayesian inference [1]. As the name suggests, a variational distribution $q_\theta(\mathbf{W})$ is used to approximate $p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y})$. Note the subscript $\theta$, representing the parameters of the variational distribution. We parameterize the distribution so that we can alter it in such a way that it closely resembles the true posterior. Although the approximation is no longer unbiased, as was the case for sampling methods, it is considerably faster. Therefore, VI is more feasible in practice.

The question is how to find the optimal configuration for $\theta$ to accomplish maximal similarity between $q_\theta(\mathbf{W})$ and $p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y})$. The KL-divergence is very suitable in this case. The metric measures how one probability distribution differs from another. Following [2], we get

$$
\begin{aligned}
\mathrm{KL}(q_\theta(\mathbf{W}) \,\|\, p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y})) &= \mathbb{E}_{q_\theta(\mathbf{W})}\left[\log \frac{q_\theta(\mathbf{W})}{p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y})}\right] \\
&= \int q_\theta(\mathbf{W}) \log \frac{q_\theta(\mathbf{W})\, p(\mathbf{X}, \mathbf{Y})}{p(\mathbf{X}, \mathbf{Y} \mid \mathbf{W})\, p(\mathbf{W})}\, d\mathbf{W} \\
&= \int q_\theta(\mathbf{W}) \log \frac{q_\theta(\mathbf{W})}{p(\mathbf{W})}\, d\mathbf{W} + \int q_\theta(\mathbf{W}) \log p(\mathbf{X}, \mathbf{Y})\, d\mathbf{W} - \int q_\theta(\mathbf{W}) \log p(\mathbf{X}, \mathbf{Y} \mid \mathbf{W})\, d\mathbf{W} \\
&= \mathrm{KL}(q_\theta(\mathbf{W}) \,\|\, p(\mathbf{W})) + \log p(\mathbf{X}, \mathbf{Y}) - \mathbb{E}_{q_\theta(\mathbf{W})}[\log p(\mathbf{X}, \mathbf{Y} \mid \mathbf{W})].
\end{aligned} \tag{3.7}
$$

Since we eventually wish to optimize this expression with respect to $\theta$, we can discard the term independent of $\theta$, namely $\log p(\mathbf{X}, \mathbf{Y})$. Negating what remains yields the evidence lower bound (ELBO):

$$\mathrm{ELBO}(\theta) = \mathbb{E}_{q_\theta(\mathbf{W})}[\log p(\mathbf{X}, \mathbf{Y} \mid \mathbf{W})] - \mathrm{KL}(q_\theta(\mathbf{W}) \,\|\, p(\mathbf{W})). \tag{3.8}$$

It can be inferred that maximizing the ELBO is equivalent to minimizing $\mathrm{KL}(q_\theta(\mathbf{W}) \,\|\, p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y}))$.

The optimization problem then reduces to

$$\hat{\theta} = \arg\max_{\theta} \mathrm{ELBO}(\theta). \tag{3.9}$$


Up to this point, we have obtained a way to find the optimal parameter setting for θ. This ensures that the parameterized approximate posterior is in close proximity to the true posterior. Finding a tractable expression for the posterior enables us to perform Bayesian inference; the final step in the theoretical framework of BNNs.


Chapter 4

Methodology

Recall that the goal of our study is to obtain an understanding of where uncertainty manifests itself in the decision-making process of the model and to investigate whether that uncertainty has value for the interpretation of the prediction. Therefore, we quantify epistemic uncertainty at various positions throughout the network by inserting noise with dropout layers. By closely evaluating intermediate uncertainty visualizations, we obtain a better comprehension of the model's robustness against noise insertions.

This chapter starts with a mathematical expression of epistemic uncertainty. Subsequently, we describe the concept of MC dropout, a popular method that enables us to quantify epistemic uncertainty. Thereafter, we outline our approach, in which we extrapolate MC dropout to FCNs and consider the network used for the current application in particular. Finally, we assess the metrics of the obtained synthetic CTs and the quality of the corresponding uncertainty maps. The key components of the preceding process are jointly outlined in a block diagram, illustrated in figure 4.1.

Figure 4.1: Block diagram of our approach. MRI scans are fed to the trained network on a patch level T times. By calculating the mean and variance we obtain the final prediction (synthetic CT) and uncertainty map. The error metrics are acquired by averaging the metrics over all patients. As a final step, the obtained synthetic CTs and uncertainty maps are assessed.

4.1 Expression of epistemic uncertainty

We return to the initial problem where we wanted to express a degree of uncertainty over the predictions via the predictive variance. We have seen how BNNs enable us to construct predictive distributions, rather than point estimates. Moreover, using VI we obtain a tractable expression for the posterior, so that we are able to perform Bayesian inference. By tying the two together, we can find an expression for the variational predictive distribution:

$$q_{\hat{\theta}}(\mathbf{y}^* \mid \mathbf{x}^*) = \int p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{W})\, q_{\hat{\theta}}(\mathbf{W})\, d\mathbf{W}, \tag{4.1}$$


where $\hat{\theta}$ is the optimized parameter resulting from optimizing the ELBO, as indicated in equation 3.9. We infer the predictive variance from equation 4.1 as

$$\mathrm{Var}_{q_{\hat{\theta}}(\mathbf{y}^* \mid \mathbf{x}^*)}(\mathbf{y}^*) = \mathbb{E}_{q_{\hat{\theta}}(\mathbf{y}^* \mid \mathbf{x}^*)}[\mathbf{y}^* \mathbf{y}^{*T}] - \mathbb{E}_{q_{\hat{\theta}}(\mathbf{y}^* \mid \mathbf{x}^*)}[\mathbf{y}^*]\, \mathbb{E}_{q_{\hat{\theta}}(\mathbf{y}^* \mid \mathbf{x}^*)}[\mathbf{y}^*]^T. \tag{4.2}$$

Following [12], we can decompose the predictive variance into two factors as Varqθˆ(y∗|x∗)(y

) = Z

diag{Ep(y∗|x,W)[y∗]} − Ep(y|x,W)[y∗]Ep(y|x,W)[y∗]T qˆ

θ(W)dW | {z } aleatoric + Z Ep(y∗|x,W)[y∗] − Eq ˆ θ(y∗|x∗)[y ∗ ] Ep(y∗|x,W)[y∗] − Eq ˆ θ(y∗|x∗)[y ∗ ]T qθˆ(W)dW | {z } epistemic . (4.3)


From this decomposition we derive how uncertainty, measured as the predictive variance, relates to an epistemic part and an aleatoric part. The main difference between the two is that one can be explained away given enough data (epistemic), while the other cannot (aleatoric). Epistemic uncertainty captures our ignorance with respect to the model. It is high for input data far away from the training distribution [2]. Increasing the number of training samples leads to a more peaked distribution around the mean, making it increasingly unlikely to encounter outliers (the model has seen more data, and therefore has more knowledge). The other component, aleatoric uncertainty, is regretfully always present. We assume that there can always be noise in the data, even if we have access to unlimited amounts of data.

If we enforce a heteroskedastic noise model, which is conventional for the quantification of aleatoric uncertainty, equation 4.3 reduces to

$$
\mathrm{Var}_{q_{\hat{\theta}}(\mathbf{y}^* \mid \mathbf{x}^*)}(\mathbf{y}^*) \approx \sigma + \int \left( \mathbb{E}_{p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{W})}[\mathbf{y}^*] - \mathbb{E}_{q_{\hat{\theta}}(\mathbf{y}^* \mid \mathbf{x}^*)}[\mathbf{y}^*] \right) \left( \mathbb{E}_{p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{W})}[\mathbf{y}^*] - \mathbb{E}_{q_{\hat{\theta}}(\mathbf{y}^* \mid \mathbf{x}^*)}[\mathbf{y}^*] \right)^T q_{\hat{\theta}}(\mathbf{W})\, d\mathbf{W}, \tag{4.4}
$$

where $\sigma$ denotes the noise parameter introduced in chapter 1. It is added as an extra component to the loss function and tuned accordingly. In our case, we focus on the second part of equation 4.4, which signifies the uncertainty of the model over the predictions. This term vanishes in the uncommon event of zero epistemic uncertainty.

4.2 MC Dropout

We now move to a suitable method for capturing epistemic uncertainty. MC dropout is the obvious choice for this affair. It has gained popularity over the years due to its simple implementation and constant training time [13]. In the previous section, we found a definition for epistemic uncertainty. This definition is established with VI and thereby comprises an approximation of the predictive distribution, formulated as

$$p(\mathbf{y}^* \mid \mathbf{x}^*) \approx \int p(\mathbf{y}^* \mid \mathbf{x}^*, \mathbf{W})\, q_\theta(\mathbf{W})\, d\mathbf{W}. \tag{4.5}$$

This expression remains problematic, due to the presence of the integral. MC integration provides an alternative; we can approximate the integral by averaging samples. Following [10], we formulate this as

$$p(\mathbf{y}^* \mid \mathbf{x}^*) \approx \frac{1}{T} \sum_{t=1}^{T} p\left(\mathbf{y}^* \mid \mathbf{x}^*, \hat{\mathbf{W}}_t\right), \tag{4.6}$$


with $T$ representing the number of samples that are taken, or MC samples. Weights are sampled from the variational distribution $\hat{\mathbf{W}}_t \sim q_{\hat{\theta}}(\mathbf{W})$, where $\hat{\theta}$ is again found by maximizing the ELBO. As the above formula no longer contains an integral, we have successfully transformed the problem of integration into a problem of optimization.

Nevertheless, the question remains how we acquire MC samples. MC dropout proposes to draw upon dropout [23], from which it takes its name. For that reason, we discuss dropout in more detail.

4.2.1 Dropout

Dropout was originally intended to reduce overfitting of deep neural networks by randomly deactivating neurons during training. In order to understand what happens when neurons are deactivated at a certain layer, we examine the mathematical transformation of the primary input $\mathbf{x}^*$ through subsequent layers. Following [23], let $\mathbf{y}^{*(l)}$ be the result of passing $\mathbf{y}^{*(l-1)}$ through layer $l$. In each layer, the input for that layer is transformed by multiplying it with a weight matrix, adding a bias vector and applying an activation function. Mathematically, this transformation is denoted as

$$\mathbf{y}^{*(l)} = f\left(\mathbf{w}^{(l)} \cdot \mathbf{y}^{*(l-1)} + \mathbf{b}^{(l)}\right), \tag{4.7}$$

where $f(\cdot)$ represents the activation function, and $\mathbf{w}^{(l)}$ and $\mathbf{b}^{(l)}$ are the weight matrix and bias vector for layer $l$ respectively. The activation function is typically non-linear, so the network learns more refined, i.e. complex, relations rather than merely linear ones. With dropout, we add an extra step to the transformation process:

$$
\begin{aligned}
\mathbf{r}^{(l)} &\sim \mathrm{Bernoulli}(p), \\
\tilde{\mathbf{y}}^{*(l-1)} &= \mathbf{r}^{(l)} * \mathbf{y}^{*(l-1)}, \\
\mathbf{y}^{*(l)} &= f\left(\mathbf{w}^{(l)} \cdot \tilde{\mathbf{y}}^{*(l-1)} + \mathbf{b}^{(l)}\right).
\end{aligned} \tag{4.8}
$$

Here $*$ denotes the element-wise product and $p$ is the probability of retaining a neuron, so the dropout rate is $1 - p$. We sample $\mathbf{r}^{(l)}$ from a Bernoulli distribution, so each element takes on the value 1 with probability $p$ and 0 with probability $1 - p$; a zero drops the corresponding neuron. Now, it is evident that while we pass the input $\mathbf{x}^*$ through subsequent layers with dropout, it is transformed by a subset of the learned features and not the entire feature space.
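A minimal NumPy sketch of equation 4.8 (all shapes and the tanh activation are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(y_prev, w, b, p=0.8, activation=np.tanh):
    """One layer transformation with dropout (equation 4.8): mask the
    incoming activations with a Bernoulli(p) sample (p = probability of
    retaining a unit), then apply the affine map and the non-linearity."""
    r = rng.binomial(1, p, size=y_prev.shape)  # r ~ Bernoulli(p)
    y_tilde = r * y_prev                       # element-wise product drops units
    return activation(w @ y_tilde + b)

# Toy usage: 4 input units -> 3 output units.
y = dropout_forward(rng.standard_normal(4), rng.standard_normal((3, 4)), np.zeros(3))
```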


4.2.2 Inclusion of dropout in MC dropout

Primarily, dropout is activated only during training. If we consider it a technique that prevents overfitting of large neural networks, it does not make sense to use it during inference as well. The notion of overfitting merely applies to the training procedure when the network tries to learn complex relations in the data. However, if we consider dropout as a method that triggers noise, as indicated in the introduction, it appears in an altogether different light. Recall that in order to obtain a measure of uncertainty, we need to construct multiple predictions for the same input. Now, it does not seem so strange to activate dropout during inference too. This is essentially the building block of MC dropout. We can acquire MC samples by passing the input through the network multiple times, each time randomly dropping a different set of neurons. It is shown that averaging these samples leads to an estimate for the approximate predictive distribution. More importantly, the sample variance can serve as a measure for epistemic uncertainty. If we increase the number of samples, the estimate moves closer to the target, thereby reducing variance. A pleasant side effect is that the error also decreases as a function of T [23].
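Tying this together, here is a sketch of the MC dropout estimate (PyTorch assumed; `model` is expected to have its dropout layers active at inference, for example via the hypothetical `enable_mc_dropout` helper sketched in the introduction): the sample mean serves as the prediction of equation 4.6 and the sample variance as the epistemic uncertainty map.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, T=50):
    """Draw T MC samples via repeated stochastic forward passes (dropout
    active at test time); return the sample mean (prediction) and the
    sample variance (epistemic uncertainty), cf. equation 4.6."""
    samples = torch.stack([model(x) for _ in range(T)])  # (T, *output_shape)
    return samples.mean(dim=0), samples.var(dim=0)

# Usage sketch:
#   prediction, uncertainty_map = mc_dropout_predict(model, mri_patch, T=50)
```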

Let us shortly summarize the above key findings. We start by obtaining a mathematical expression for epistemic uncertainty. This expression makes use of the approximate predictive distribution. Yet, it is shown that the definition of the approximate predictive distribution contains an intractable integral. Therefore, we take samples from the distribution instead by activating dropout at test time; the mean of these samples serves as the prediction and their variance as an estimate of epistemic uncertainty.

4.3 MC dropout extension for FCNs

Our method applies to FCNs, which are comprised of a sequence of blocks. Every block performs a convolution (or several convolutions) and subsequently alters the size of the input with pooling. For each convolution, a number of steps are followed. First, the convolution layer extracts the input's features. Thereafter, batch normalization is conducted to normalize the input to a certain layer. Lastly, the activation function applies a non-linearity. FCNs are encoder-decoder structured networks and can be divided into two parts accordingly. The first part increases the number of features, by which the network learns more complex relations. In addition, the resolution decreases, so the model zooms in for more detail and moves away from spatial aspects. Conversely, the second part enlarges the image up to its original resolution while jointly reducing the number of features. In doing so, the spatial dependencies become more important again.

We now turn to our more refined network architecture used to generate synthetic CTs. We adopt a popular FCN suitable for 3D data sets, known as V-Net [17]. Following the preceding approach, we consider a dropout layer at every block in the network, as illustrated in figure 4.2. The main difference between V-Net and other FCNs is the usage of forwarded features. This means that the features of the encoder also operate in the decoder, to retain fine-grained detail from early stages. Furthermore, V-Net passes volumetric input patches consecutively through the model instead of the entire image.

It is mentioned earlier that dropout is considered a kind of noise insertion that evokes diversity in the predictions of a certain layer. Conventionally, dropout is positioned after fully-connected layers at the end of the network [23]. Since FCNs typically do not contain fully-connected layers, we are not subject to this constraint. On the contrary, similar to [9], we add a dropout layer to each block in the network. In particular, we propose to place the dropout layer after the convolution(s) and before pooling, if a pooling layer is present. We allow arbitrary positions within a specific block, as long as there is some consistency. To this end, we adopt the same approach for all blocks. This ensures that dropout layers are equally spaced throughout the network. While it is also sufficient to add dropout before the convolution(s) after pooling, it is important to note that the noise then applies to a different feature and resolution level.
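A sketch of one such block in PyTorch (channel counts, kernel sizes and the strided-convolution downsampling are assumptions loosely following V-Net, not the exact thesis architecture), with the dropout layer placed after the convolutions and before the downsampling step:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: convolution + batch normalization + activation,
    an optional dropout layer after the convolutions, then downsampling."""

    def __init__(self, in_ch, out_ch, use_dropout=False, p=0.2):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=5, padding=2),
            nn.BatchNorm3d(out_ch),
            nn.PReLU(),
        )
        # Dropout sits between the convolutions and the downsampling, so
        # the inserted noise applies to this block's feature/resolution level.
        self.dropout = nn.Dropout3d(p) if use_dropout else nn.Identity()
        self.down = nn.Conv3d(out_ch, out_ch, kernel_size=2, stride=2)

    def forward(self, x):
        return self.down(self.dropout(self.convs(x)))
```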

After choosing which dropout locations to trigger, the network is ready for training. We acquire multiple forward passes with dropout activated at both training and test time. As indicated in chapter 2, activating dropout solely during inference is likely to cause unstable loss functions, as the model is then less robust against noise insertions like dropout [16]. For that reason, we restrict ourselves to activating dropout during training as well as inference.


Figure 4.2: V-Net architecture with 8 blocks, each consisting of three convolutional layers and a pooling layer (except for the last two blocks). MRI patches are passed subsequently through the network, resulting in synthetic CT patches. Eight positions are designated for potential dropout layers. Note that while the figure seems to indicate that the dimensions of the input MRI patch are not cubic, it is indeed a 3D volume with equal length, width and height.


Chapter 5

Experimental Design

This chapter is divided into three sections. First, we describe the data in more detail. Second, we address the research questions and describe the conducted experiments. We close with elucidating the evaluation criteria.

5.1 Data

We present results on two medical data sets of different body parts. More specifically, we consider one data set comprising pelvis data and another containing sacroiliac (SI) joint data. The focus on two data sets, rather than one, allows us to illustrate the consistency of results across a variety of data.

The data sets are alike in the sense that they both entail pairs of MRI and CT scans. Moreover, they mostly include scan pairs of patients, meaning there are few MRI and CT scans of 'healthy' persons, for the simple reason that exposure to CT radiation should be minimized. This is not necessarily problematic, since we adopt a patch-based network where patients with highly anomalous bone structures have very few specific patches with anomalies. To allow voxelwise comparison of the synthetic CT and target CT scan, the MRI and CT scans are spatially aligned during the process of registration. Voxel values of a CT scan are quantified in Hounsfield units (HU), with a range between -1000 and 3000.

There are also some differences between the pelvis and SI joint data. While the pelvis data set consists of 36 pairs of scans, the SI joint data entails 25 pairs of scans. Furthermore, the SI joint data contains images of size 768 × 768 × 240. The pelvis data are of considerably smaller size, namely 288 × 488 × 160. Each data set is trained with its own set of optimal hyperparameters. We use these model trainings as the baseline for further comparison.


5.2 Experiments

The experiments are designed to answer the following questions:

1. To what extent can we visualize epistemic uncertainty throughout the network by performing noise insertions at specific positions with dropout?

2. Can we find a distinction between the effect of noise insertion at outer blocks and inner blocks of the network?

3. Are there certain positions within the network, either single or multiple ones, that are more optimal for noise insertion, in the sense that they produce high-quality uncertainty maps without deteriorating performance? And if so, why?

To address the first question, we experiment with adding dropout layers at the positions indicated in figure 4.2. Thus, we start with examining the effect of adding single dropout layers on predictions and uncertainty maps. Following [6], we acquire 50 forward passes through the model, serving as MC samples, with a dropout rate of 0.2. For the second question, we try out joint insertion of several dropout layers. We distinguish between dropout insertions at inner blocks (positions 3, 4 and 5) and outer blocks (positions 1, 2, 6, 7 and 8) of the network. Lastly, we elaborate on the third and final question. To this end, we use the architecture and optimal set of hyperparameters of the baseline model and fine-tune it by specifying optimal positions to trigger dropout. Optimality here means that the model should produce high-quality uncertainty maps (explained further on), while simultaneously maintaining accuracy.
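The resulting experiment grid can be summarized as below (a hypothetical enumeration for illustration; the training and MC-sampling steps are only indicated in comments):

```python
# Hypothetical enumeration of the dropout configurations in this chapter:
# eight single positions, plus the inner-block and outer-block combinations.
SINGLE = [{pos} for pos in range(1, 9)]  # one dropout layer at a time
INNER = {3, 4, 5}                        # inner blocks of the network
OUTER = {1, 2, 6, 7, 8}                  # outer blocks of the network

for positions in SINGLE + [INNER, OUTER]:
    print(f"dropout (rate 0.2) at block position(s): {sorted(positions)}")
    # ... build the V-Net with dropout at these positions, train it, and
    # draw 50 MC samples per test patient at inference ...
```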

5.3 Evaluation criteria

The uncertainty maps and synthetic CTs resulting from the experiments are assessed in separate ways. For the evaluation of synthetic CTs we adopt a quantitative approach and outline several metrics. Conversely, the uncertainty maps are assessed qualitatively as well as quantitatively. We elaborate on both in the following sections.

5.3.1 Synthetic CT assessment

In the previous chapter we outlined our methodology for obtaining a measure of epistemic uncertainty by forcing the network to output multiple predictions at test time. Let $\hat{\mathbf{y}}_k$ be an $M \times 1$ dimensional vector comprising the mean over these predictions for $k = 1, \ldots, K$, where $K$ represents the number of patients in the test set. For both the pelvis data and the SI joint data we consider three patients for testing, so $K = 3$. Furthermore, $\mathbf{y}_k$ denotes the ground truth for $\hat{\mathbf{y}}_k$. We calculate the voxelwise difference, or mean absolute error (MAE), as

$$\mathrm{MAE}_k = \frac{1}{M} \sum_{m=1}^{M} |\hat{y}_{mk} - y_{mk}|. \tag{5.1}$$

Recall that $M$ represents the number of voxels in the prediction. Therefore, equation 5.1 is equivalent to comparing a CT scan (ground truth) to the synthetic CT (prediction) for patient $k$ on a voxel level. The average MAE, signified as $\overline{\mathrm{MAE}}$, is obtained by calculating the MAE for every patient in the test set and taking the average.

An additional measurement that we take into account is the MAE for bone regions, to which we further refer as the bone MAE for a single patient and the average bone MAE for the entire test set. The bone MAE enables us to assess the quality of the bone reconstruction in particular and is thus more specific than the MAE over the entire prediction. Since bone voxels have a high HU value compared to non-bone voxels, we can distinguish a bone region by setting a threshold. To this end, we select voxels with a value above 150 HU in the synthetic CT and calculate the MAE as in equation 5.1. It is important to note that, as a consequence, the bone MAE values tend to be higher and lie further apart among the experiments.

To investigate the effective loss or gain in accuracy due to noise insertions, we present metrics in terms of relative changes (as a percentage) with respect to the baseline. Note that the baseline does not enforce dropout activations during inference. We find the relative change via

$$\text{Relative change} = \frac{\overline{\mathrm{MAE}} - \text{baseline}}{\text{baseline}} \cdot 100\%, \tag{5.2}$$

where we regard the shift in $\overline{\mathrm{MAE}}$ specifically. Equation 5.2 can be extrapolated to other metrics by replacing $\overline{\mathrm{MAE}}$ with the respective metric.

5.3.2 Uncertainty assessment

We distinguish between high-quality and low-quality uncertainty maps based on two observations. One is the visual similarity between bad reconstructions in the synthetic CT and highlighted regions in the uncertainty map, indicated in figure 5.1; the other is the correlation between uncertainty measures and the error metric. Thus, we identify a high-quality uncertainty map as one that shows high visible similarity with bad reconstructions in the synthetic CT, while jointly showing positive correlation between the error and variance on a voxel level.

Figure 5.1: Example of high similarity between bad reconstructions in the synthetic CT and highlighted regions in the uncertainty map: (a) synthetic CT, (b) CT scan, (c) uncertainty map. The red circles indicate dissimilarities between the synthetic CT and target CT that are also highlighted in the uncertainty map.

Unfortunately, identifying correlation is a challenging task in our case. Recall that the considered application suffers from biased error metrics, largely due to registration errors. As a consequence, correlation graphs between the voxelwise error and uncertainty contain a lot of noise. In particular, if we regard all voxels in the synthetic CT, the correlation cannot be identified, as the effect is overshadowed by misregistrations. On the other hand, if we zoom in and focus on specific regions of the synthetic CT, thereby reducing noise, we risk observing only part of the correlation graph. We have manually created masks (evaluation masks, illustrated in figure 5.2a) for three patients in the test set of the pelvis data and equivalently two patients in the test set of the SI joint data. These masks encompass specific bone areas that we know are fairly stable and easy to distinguish. It is important to note that the evaluation masks are generated based on the CT scan, while the former bone masks (used to calculate the bone MAE) are generated based on the synthetic CT. The evaluation masks ensure that voxels with an extremely high error arising at, e.g., the body contour no longer disturb the correlation. However, registration errors around bone structures regretfully remain present, as indicated in figure 5.2b. The considered model does not account for these errors. As a consequence, uncertainty is also not detectable at misregistrations. Notwithstanding, the evaluation masks decrease the number of noisy voxels significantly and thus enable us to investigate correlation more clearly. We henceforth refer to the voxelwise difference in the evaluation mask as the evaluation MAE.
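As a sketch of this quantitative criterion (array names are assumptions), the correlation between voxelwise error and uncertainty inside the evaluation mask could be computed as:

```python
import numpy as np

def masked_error_uncertainty_corr(pred, target, uncertainty, mask):
    """Pearson correlation between the voxelwise absolute error and the
    uncertainty estimate, restricted to the evaluation mask."""
    error = np.abs(pred - target)[mask]
    return np.corrcoef(error, uncertainty[mask])[0, 1]
```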


Figure 5.2: Example of selected bone within the evaluation mask: (a) evaluation mask, (b) error map. The error map subtracts the predicted voxel from the target voxel, so black and white coloured regions point to over- and underestimation by the model respectively. By regarding only certain bone areas we reduce the number of noisy voxels due to registration errors. Yet, some registration errors remain present, indicated with red circles in (b).


Chapter 6

Results

We outline the key findings of our study. Following the research questions specified in chapter 5, we evaluate results on the two data sets. The assessment takes place with regard to error metrics, as well as uncertainty maps. Note that we are only able to present one slice of a 3D volume.

We start with noise insertion at one exact position within the network. The upper part of table 6.1 shows the MAE and bone MAE for a single dropout layer added with a dropout rate of 0.2. We observe less deviation in MAE and bone MAE for the SI joint data compared to the pelvis data. This is an indication that the SI joint model is more robust, since it can better cope with noise created via dropout. Moreover, the SI joint model is much more consistent in its predictions between model trainings, with error metrics in a range of ±0.3 for the MAE and ±0.4 for the bone MAE, compared to the pelvis model, which ranges over ±2 for the MAE and ±10 for the bone MAE.

Nevertheless, the SI joint data only slightly improves upon the baseline model, by 0.1% in MAE and 1.4% in bone MAE at best. Conversely, the pelvis data shows significant reductions for almost all single noise insertions, with a maximum drop of 7.0% for the MAE and 18.2% for the bone MAE. We conclude that adding dropout at inference does not degrade accuracy for either data set, i.e. we observe either a decreased error or a fairly constant error. This is of great importance for the evaluation of the uncertainty maps: had we found severely diminished accuracy, the uncertainty maps would be hard to interpret, as degraded predictions may cause larger uncertainty values that were otherwise not present.
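The underlying procedure can be sketched in PyTorch as follows. This is a simplified illustration rather than our exact implementation: dropout_layers is a hypothetical mapping from the position numbers used in table 6.1 to the dropout modules inserted in the V-Net.

import torch

@torch.no_grad()
def mc_dropout_inference(model, mri_volume, dropout_positions, n_samples=10):
    """Perform stochastic forward passes with dropout active at test time.

    Returns the mean synthetic CT over the samples and a voxelwise
    uncertainty map (the standard deviation over the samples).
    """
    model.eval()  # deterministic mode for all layers ...
    for pos in dropout_positions:
        model.dropout_layers[pos].train()  # ... except the chosen dropout layers
    samples = torch.stack([model(mri_volume) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)

The error metrics above are then computed between the mean prediction and the target CT, while the standard deviation serves as the uncertainty map discussed next.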

Turning to the uncertainty visualizations corresponding to table 6.1, shown in figure 6.1 and figure 6.2, there is substantial deviation in highlighted regions among the images for both data sets. Recall that dropout layers at positions 3, 4 and 5 operate on 64 or 128 features, which is perceived as high-level feature extraction.


             Pelvis data            SI joint data
Position     MAE      bone MAE      MAE      bone MAE
Baseline     28.3     96.5          72.8     168.7
1            -0.1%    -8.5%         +1.6%    -0.9%
2            +5.7%    -9.6%         +1.1%    +1.4%
3            -2.5%    -13.8%        +1.1%    +2.4%
4            -1.0%    -16.3%        +1.1%    +3.3%
5            -5.4%    -18.2%        -0.1%    -0.4%
6            -7.0%    -13.0%        +3.1%    -1.4%
7            -2.8%    -9.2%         +2.3%    +2.0%
8            -4.9%    -13.3%        0.0%     +5.9%
1 & 2        +1.9%    +17.9%        -        -
4 & 5        -1.5%    -9.6%         +0.5%    -0.1%
3, 4 & 5     -1.2%    -12.1%        -        -
6 & 7        +11.0%   +46.2%        -        -

Table 6.1: Relative changes for insertion of single and multiple dropout layers with respect to the baseline model. The baseline model does not enforce dropout activations during inference. The metrics are averaged over three patients for both data sets.


(a) Position 1 (b) Position 2 (c) Position 3 (d) Position 4

(e) Position 5 (f) Position 6 (g) Position 7 (h) Position 8

Figure 6.1: Uncertainty maps for the pelvis data with dropout at various positions. The images are visualized with equal window levels (brightness and contrast) to avoid a distorted picture of the size of uncertainty.
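The equal window levels mentioned in the caption can be enforced by fixing one shared display range across all maps; a minimal matplotlib sketch, assuming each map is a 2D NumPy slice and that the grayscale colormap matches our figures:

import matplotlib.pyplot as plt

def show_with_equal_window(maps, titles):
    """Display uncertainty slices with one shared display range so that
    brightness differences reflect actual differences in uncertainty."""
    vmin = min(float(m.min()) for m in maps)
    vmax = max(float(m.max()) for m in maps)
    fig, axes = plt.subplots(1, len(maps), squeeze=False,
                             figsize=(3 * len(maps), 3))
    for ax, m, title in zip(axes[0], maps, titles):
        ax.imshow(m, cmap="gray", vmin=vmin, vmax=vmax)
        ax.set_title(title)
        ax.axis("off")
    plt.show()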

We can observe that inserting noise at high-level features leads to uncertainty maps showing high similarity with the reconstruction errors. Furthermore, for the pelvis data we see that especially the uncertainty maps at positions 3 and 4 show more deviation in the background compared to earlier and later stages of the network. The majority of uncertainty maps of the SI joint data agree on the area containing the most uncertainty, namely the femur head. This points to the fact that the SI joint model is indeed more robust than the pelvis model, which is also reflected in the metrics.

Perhaps more surprisingly, we find that noise insertion at the earlier blocks of the network, i.e. positions 1 and 2, also yields strikingly good uncertainty maps for both data sets. The number of features is much lower here, namely 32 or less. However, regarding the final blocks, the uncertainty maps of the SI joint data behave differently from those of the pelvis data. More specifically, we observe that bone edges (or cortical bone) become more apparent for the pelvis data. Bone voxels have a much higher HU value than non-bone voxels, so there is a large value discrepancy along cortical bone. As a result, the uncertainty in these regions is generally also higher. Yet it is problematic that other, more relevant uncertainties seem to attenuate or even vanish. By contrast, there is no clear emergence of cortical bone for the SI joint data at positions 6 and 7; rather, it is merely present at the last position. We conclude that dropout at position 8 has resulted in low-quality uncertainty maps for both data sets. We discuss this matter in chapter 7. Notwithstanding, the uncertainty maps at the three final blocks lack sufficient detail. Although the uncertainty map at position 7 might suggest otherwise, it only represents a single slice of the complete 3D visualization for a particular patient.

In order to understand the magnitude of the noise that arises in the predictions due to the single dropout insertions, we take a closer look at the error metrics. The original metrics, MAE and bone MAE, are not suitable for this purpose. They are heavily influenced by registration errors resulting from the conversion of the CT to the MRI coordinate system. As mentioned in section 5.3.2, this overshadows the effect of noise insertions on the MAE value. Therefore, we consider the evaluation MAE, illustrated in figure 6.3. Recall that the evaluation masks encompass bone areas, but differ from the masks used to calculate the bone MAEs, as they are generated based on the CT scan and not the synthetic CT.

We consider each patient separately. This way, we present a picture of the deviation in metrics among patients. The relative changes for the pelvis data are further apart, which coincides with the earlier observation that the SI joint model is more consistent in its predictions than the pelvis model.

Furthermore, for the pelvis data we find that the performance of the model in the evaluation mask is significantly improved by single dropout insertions at positions 1 to 7, reflected in a 30% drop in evaluation MAE compared to the baseline for all patients. Additionally, the evaluation MAE varies in a range of approximately 40%, which equals roughly 25 HU. This is a much larger discrepancy than the earlier observed range of 12% (∼2 HU) for the MAE and 10% (∼10 HU) for the bone MAE, pointed out in table 6.1. The wide value range is caused by the sudden spikes in evaluation MAE for dropout at position 8 for all patients. Nevertheless, even there the evaluation MAE is still lower than the baseline for two patients. In general, we conclude that the model improves upon the baseline with dropout insertion at an arbitrary position.
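For clarity, the relative changes plotted in figure 6.3 follow the usual definition; a small sketch with hypothetical function and argument names:

import numpy as np

def relative_change_percent(metric_with_dropout, metric_baseline):
    """Relative change (%) of a per-patient metric with respect to the
    baseline model without dropout activations at inference; negative
    values indicate an improvement (lower error)."""
    metric_with_dropout = np.asarray(metric_with_dropout, dtype=float)
    metric_baseline = np.asarray(metric_baseline, dtype=float)
    return 100.0 * (metric_with_dropout - metric_baseline) / metric_baseline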


(a) Position 1 (b) Position 2 (c) Position 3 (d) Position 4

(e) Position 5 (f) Position 6 (g) Position 7 (h) Position 8

Figure 6.2: Uncertainty maps for the SI joint data with dropout at various positions. The images are visualized with equal window levels (brightness and contrast) to avoid a distorted picture of the size of uncertainty.


(a) Evaluation MAE for pelvis data (b) Evaluation MAE for SI joint data

Figure 6.3: Relative change in evaluation MAE for insertion of dropout at various positions. The red horizontal line indicates the evaluation MAE for the baseline model without dropout activations at inference. The dots depict relative changes and differ in color between the patients in the test set.

We observe an equivalent result for the SI joint data, where the evaluation MAE is generally much lower than that of the baseline model. Remarkably, this effect has not been identified in either of the other two metrics (MAE or bone MAE, found in table 6.1). In addition, we see similar behaviour to that of the pelvis data, where the evaluation MAE spikes for dropout at position 8. Yet the value range is larger: roughly 50% (∼30 HU). This is due to the high evaluation MAE for one patient at position 6. If we disregard this observation, figures 6.3a and 6.3b are more alike.

Next, we have experimented with inserting noise at multiple positions simultaneously for the pelvis data, illustrated in figure 6.4. To this end, we have added dropout layers to the inner (positions 4 and 5) and outer (positions 1, 2, 6 and 7) blocks of the network, providing a comparative analysis of the performance variation for different feature levels. Several things can be derived from these experiments. First, we observe that adding multiple dropout layers does not always yield higher-quality uncertainty maps than the single dropout layer experiments. For example, if we consider positions 1 and 2, we see that adding dropout layers to both positions results in a loss of detail compared to the previous uncertainty maps of either position 1 or position 2 individually. Moreover, we again see the phenomenon that uncertainty maps in the final blocks of the network focus mainly on cortical bone; adopting several noise insertions (at positions 6 and 7) does not solve this problem. Conversely, a combination of dropout layers at high-level features maintains most details in the uncertainty map from previous results, while also yielding the visually most accurate prediction.


(a) Position 1 & 2 (b) Position 4 & 5 (c) Position 6 & 7

Figure 6.4: Uncertainty maps obtained at high-level features (positions 4 and 5) provide more details than those obtained at other feature levels.

As indicated in table 6.1, for both the single and the multiple dropout layer experiments, the lowest MAE and bone MAE are achieved with noise insertion at high-level features, where the network learns complex relations in the data. This indicates that not only bone, but also the prediction as a whole, is reconstructed more accurately when adding dropout layers at inner positions.

For the pelvis data, the visually best performing model, i.e. the one producing the most accurate synthetic CT, is achieved with noise insertion at the innermost positions of the network. Experiments with varying the dropout rate have led to an optimal rate of 0.3, instead of the previous 0.2. As a consequence, we have designated positions 4 and 5 with dropout rate 0.3 as the optimal configuration for noise insertion. Table 6.1 indicates a decrease in bone MAE of 9.6% and a decrease in MAE of 1.5%. On top of that, the corresponding uncertainty map is very refined; it expresses sufficient detail while also showing high similarity with reconstruction errors. By contrast, the preceding results always led to uncertainty maps with high similarity to the errors, but in conjunction with somewhat deteriorated visual performance. We extrapolated the optimal setting for the pelvis data to the SI joint data. Here, we observe a slight increase in MAE and a decrease in bone MAE. Additionally, we again find detailed uncertainty maps together with good reconstructions.
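Expressed with the same hypothetical handles as the inference sketch earlier in this chapter, the chosen configuration amounts to:

# Optimal configuration found for the pelvis data: dropout at the
# innermost positions 4 and 5, with the rate raised from 0.2 to 0.3.
for pos in (4, 5):
    model.dropout_layers[pos].p = 0.3  # hypothetical handle, as before
mean_sct, uncertainty_map = mc_dropout_inference(model, mri_volume,
                                                 dropout_positions=(4, 5))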

We place the synthetic CTs with dropout at position 4, position 5 and the combination of 4 and 5 side by side in figure 6.5. Recall that we compare predictions dependent on the position of dropout mainly via the metrics and not visually. However, in this case we do provide the synthetic CTs, as they contribute to the understanding of why dropout at positions 4 and 5 is visually optimal.


(a) (b) (c) (d)

Figure 6.5: Visual difference with insertion of a single dropout layer and multiple dropout layers. The left three images contain synthetic CTs obtained with noise insertion at position 4, position 5, and positions 4 and 5, respectively. The rightmost image provides the uncertainty map associated with (c). The red circles in (a) and (b) identify errors that are not, or to a lesser extent, visible in (c).

It is evident that the combination boosts visual performance, in the sense that errors from the single dropout insertion experiments diminish or even disappear when the dropout insertions are combined. Besides, we rule out the possibility that the uncertainty obtained by joint dropout insertion is merely a summation of the single dropout insertions. It is known that the network forwards features from the third block to the fifth block, and that block 4 has the highest number of features (128). Hence, it is plausible that noise insertion is beneficial at positions 4 and 5 because this is essentially where the model learns the most detailed and complex relations. We have repeated the experiment several times, leading to coinciding uncertainty maps and stable error metrics (within a range of 0.3 for the MAE and 0.4 for the bone MAE). Therefore, we can conclude that it is not the variation between different model trainings that causes the sudden improvement in quality.

Lastly, we investigate whether the visually best model indeed manages to produce high-quality uncertainty maps. From visual assessments we have identified high similarity between uncertainty and errors. The other component necessary to justify the designation high-quality is correlation. For this, we examine the scatter plot, as it sheds light on the occurrence of combinations of voxel errors and uncertainty. We only regard voxels inside the evaluation mask.


A first glance at the scatter plot (shown in figures 6.6a, 6.6d, 6.6g and 6.6k) reveals that the majority of errors is positive. Thus, the model generally underestimates bone. We examine the scatter plot in more detail by distinguishing four quadrants based on high/low uncertainty and high/low voxelwise error. We set the threshold for the former at 5, and for the latter at 250. The synthetic CT and ground truth for figure 6.6 can be found in appendix A.
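The quadrant selection can be expressed directly as boolean masks. This is a sketch under the assumption that error (signed, target minus prediction), uncertainty and eval_mask are equally shaped arrays, with the thresholds chosen above applied to the error magnitude:

import numpy as np

def quadrant_masks(error, uncertainty, eval_mask,
                   error_threshold=250.0, uncertainty_threshold=5.0):
    """Split the voxels inside the evaluation mask into the four
    quadrants of the error/uncertainty scatter plot."""
    inside = eval_mask.astype(bool)
    high_error = np.abs(error) > error_threshold
    high_uncertainty = uncertainty > uncertainty_threshold
    return {
        "high_error_low_uncertainty":  inside & high_error & ~high_uncertainty,
        "low_error_high_uncertainty":  inside & ~high_error & high_uncertainty,
        "high_error_high_uncertainty": inside & high_error & high_uncertainty,
        "low_error_low_uncertainty":   inside & ~high_error & ~high_uncertainty,
    }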

We first consider voxels with high error and low uncertainty, illustrated in figures 6.6a, 6.6b and 6.6c. This indicates that the model is consistent in its incorrect prediction for certain voxels. One explanation, also visible in figure 6.6b, is misregistration. Recall that uncertainty is indeed not detectable in poorly registered regions, since the model does not account for such regions.

Moving on, we regard the opposite quadrant, i.e. voxels with low error and high uncertainty, presented in figures 6.6d, 6.6e and 6.6f. We observe very few voxels satisfying these constraints. This is fortunate, because it is hard to grasp why the model would produce highly varying predictions yet still maintain a low error on average. Most likely, some of these predictions are outliers. This is substantiated by the observation that the associated voxels seem to be randomly spread throughout the image and do not concentrate in specific regions.

The third quadrant, selected in figure 6.6g, contains voxels with high uncertainty and high error. Note that this is a desirable phenomenon, as it points to the model contemplating the value of voxels that indeed have a large error. Interestingly, the highlighted voxels arise mostly at cortical bone, as illustrated in figures 6.6h and 6.6i. Conversely, the last quadrant, containing voxels with low uncertainty and low error (figure 6.6j), mainly selects trabecular bone, i.e. tissue inside the cortex, as indicated in figures 6.6k and 6.6l. From a quantitative perspective, it has been observed that trabecular bone is generally better reconstructed than cortical bone. Therefore, it is plausible that the model is indeed more certain about these voxel values.


(a) (b) (c)

(d) (e) (f)

(g) (h) (i)

(j) (k) (l)

Figure 6.6: Correlation between voxelwise error and uncertainty in the evaluation mask, divided into four quadrants. From left to right, we provide scatter plots ((a), (d), (g) and (j)) with the corresponding error map ((b), (e), (h) and (k)) and uncertainty map ((c), (f), (i) and (l)). Recall that black and white coloured regions in the error map point to over- and underestimation by the model respectively. The yellow squares in the scatter plots highlight voxels with high/low uncertainty and high/low voxelwise error. Similarly, these voxels are indicated in yellow in the remaining images.
