
Deep Learning For X-ray Tomography

Jim Wagemans

June 28, 2020

Bachelor thesis Mathematics and Computer Science
Supervisor: Felix Lucka

Informatics Institute

Korteweg-de Vries Institute for Mathematics
Faculty of Sciences


Abstract

X-ray tomography is the science of imaging objects through X-ray radiation. One familiar example of this is a CT (computed tomography) scan, which has a wide variety of applications. In some settings, it is costly or not possible to measure sufficient data for tomographic reconstruction. In such settings, commonly used algorithms generate artifacts that reduce image quality. The use of 'deep learning' has become a popular approach to tackling these issues. Deep learning refers to the use of neural networks that have a large number of layers.

Recent research has suggested that a specific type of neural network, namely the mixed-scale dense convolutional neural network, performs well at the task of enhancing sparse-data reconstructions. In this thesis, we attempt to expand upon this research by performing similar experiments with a different type of dataset. We used two different algorithms and several types of constraints on the measured data to generate reconstructions in which undesirable artifacts were present. We then trained a mixed-scale dense convolutional neural network to reduce or remove these artifacts from the reconstructions. We found that, to different degrees, the networks were able to significantly enhance the quality of these sparse-data reconstructions.

Title: Deep Learning For X-ray Tomography

Author: Jim Wagemans, jimwagemans@student.uva.nl, 11912286
Supervisor: Felix Lucka

End date: June 28, 2020

Informatics Institute University of Amsterdam

Science Park 904, 1098 XH Amsterdam http://www.ivi.uva.nl

Korteweg-de Vries Institute for Mathematics University of Amsterdam

Science Park 904, 1098 XH Amsterdam http://www.kdvi.uva.nl


Contents

1. Introduction
2. X-ray Tomography
   2.1. The problem setting
   2.2. Mathematical theory
      2.2.1. Radon transform
      2.2.2. Inverse problems
      2.2.3. Fourier slice theorem
      2.2.4. Inverting the Radon transform
      2.2.5. Sampling
   2.3. Algorithms
      2.3.1. Filtered back projection
      2.3.2. SIRT
   2.4. Relation to convolutions
   2.5. Artifacts
      2.5.1. Aliasing
      2.5.2. Streaks
      2.5.3. Modelling
3. Deep Learning
   3.1. Machine Learning
   3.2. Neural Networks
      3.2.1. Convolutional neural networks
      3.2.2. Mixed-scale Dense CNN
   3.3. Network training
      3.3.1. Problem setting
      3.3.2. Gradient descent
      3.3.3. Complications
      3.3.4. Learning rate
      3.3.5. Back-propagation
      3.3.6. Epochs and batches
      3.3.7. Overfitting
4. Ethics
   4.1. Medical
   4.2. The trade-off
5. Data acquisition
   5.1. Materials
   5.2. Scan setups
6. Computational Experiments
   6.1. Reconstruction
   6.2. Datasets
   6.3. Enhancement and network training
      6.3.1. Architecture and Setup
      6.3.2. Loss
      6.3.3. Visual results
   6.4. Discussion
7. Conclusion
Bibliography
Popular summary (in Dutch)
A. Background knowledge
   A.1. Fourier Transform
   A.2. Convolutions
      A.2.1. Continuous
      A.2.2. Discrete


1. Introduction

Many people are (possibly unknowingly) already familiar with the concept of tomography, because this is where a CT (computed tomography) scan gets its name. Emitted X-ray radiation passes through an object and hits a detector. From the data gathered by this detector, an image is reconstructed that shows the internal structure of said object. Tomography concerns the reconstruction of such images from gathered data. Computed tomography scanners are widely used in many countries.¹

This thesis contains three main parts. The first part is a mathematical description of X-ray tomography. We describe the relation between the data measured by a CT scan and the object that is being imaged. This relation is mostly understood in terms of the Radon transform [1]. We shall dive into several theorems that give insight into this transform. These include its relation to the Fourier transform, its inverse transformation and its normal operator as a convolution. We formulate the problem of image reconstruction as an inverse problem that is ill-posed (difficult to solve). Two different algorithms (SIRT and filtered back projection) are introduced to reconstruct images in practice. We lastly delve into the artifacts that are generated by these algorithms when data is insufficient in different ways.

The second part of the thesis concerns deep learning. We start with an overview of machine learning as function approximation. We describe the basics of neural networks with a focus on mixed-scale dense convolutional neural networks, which is the type of network later used for the experiments. We also delve into the mathematical and practical side of network training. Network training is generally done using gradient descent. Gradient descent does not always find the optimal approximation, but is effective when the set of approximations has a high dimensionality. We consider how to efficiently train on a given dataset by dividing said dataset into batches. The practical problem of overfitting is also addressed, as well as how one can use validation datasets to prevent this overfitting.

In the third part we combine the previous parts into an experiment. The use of deep learning to improve reconstruction quality will be referred to as 'enhancement', whereas the task of generating an image from measured CT scan data will be referred to as 'reconstruction'. In some real-world settings, gathering CT scan data can be costly; in a medical setting, for example, because a CT scan exposes a patient to ionizing radiation [2]. There could thus be a benefit to enhancing sparse-data reconstructions. A 2018 paper [3] proposed the use of mixed-scale dense convolutional neural networks (MSDCNN) to enhance reconstructions in sparse-data settings. Our experiments set out to expand

¹ OECD (2020), Computed tomography (CT) scanners (indicator). doi: 10.1787/bedece12-en (accessed 26 May 2020).


on this research by testing the MSDCNN on a custom dataset of real-world reconstructions. We made such a custom dataset, based on CT scan cross-sections of coffee beans and walnuts. We then reconstructed images from this dataset under three types of constraints: sub-sampling in the angular direction, sub-sampling of the detector, and limiting the largest available angle. An MSDCNN architecture is then used to enhance these reconstructions. We found that the networks manage to significantly reduce the artifacts described in the first part.


2. X-ray Tomography

This chapter contains a basic overview of X-ray tomography. We will discuss mathematical theory, algorithms and some complications.

2.1. The problem setting

In X-ray tomography we wish to infer the internal structure of an object based on collected projections of that object. This thesis restricts itself to 2D reconstructions, but the theory extends to 3D [4]. Consider the reconstruction of a 2D 'slice' or cross-section of a scanned object. This slice can be viewed as a 2D plane on which a density function $f(x, y)$ is defined. This function is real-valued and positive. We generally limit ourselves to reconstructing objects of finite size; thus $f$ is zero outside a bounded subset of $\mathbb{R}^2$ (Definition 1). The goal of the reconstruction is to retrieve the original function from the measured data.

Before we discretize, let us consider the continuous case. An X-ray of intensity $I_e$ leaves the emitter, travels along a line $l$ and hits the detector with intensity $I_d$. The intensity of the ray decays exponentially depending on the density along $l$. More precisely, according to the Lambert-Beer law [5]
$$I_d = I_e\, e^{-\int_l f(x)\, dx} \tag{2.1}$$
We can conclude
$$\int_l f(x)\, dx = -\log\left(\frac{I_d}{I_e}\right)$$
Many algorithms take this line integral as a starting point from which to reconstruct $f(x, y)$ (section 2.3.1). To entirely reconstruct $f$, many line integrals are needed. The next sections cover the relation between these line integrals and the original function.
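As a concrete illustration, the conversion from measured intensities to line-integral (sinogram) values is a single vectorised operation. The sketch below uses hypothetical measured values and only NumPy; it is not taken from the thesis code.

```python
import numpy as np

# Hypothetical emitted intensity and detected intensities for three rays.
I_e = 1.0e5
I_d = np.array([9.2e4, 4.1e4, 1.3e4])

# Lambert-Beer: I_d = I_e * exp(-integral), so the line integral of f along
# each ray is recovered as -log(I_d / I_e). These values form the sinogram.
line_integrals = -np.log(I_d / I_e)
print(line_integrals)
```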

2.2. Mathematical theory

Should the reader find themselves wanting to know more about this topic, a good book would be 'Introduction to the Mathematics of Medical Imaging' by Charles L. Epstein [6].

2.2.1. Radon transform

A useful mathematical transformation for X-ray tomography is the Radon transform. In the problem setting we saw the relation between measured data and line integrals.


The Radon transform describes the relation between these line integrals and the original function. This transformation was introduced in 1917 by Johann Radon [1]; interestingly, this was before the invention of the CT scanner. Radon also calculated the inverse of this transformation.

The Radon transform takes a function $f : \mathbb{R}^2 \to \mathbb{R}$. The output is also a function $Rf : \mathbb{R}^2 \to \mathbb{R}$. The Radon transform is defined for piecewise continuous functions with bounded support [6, Definition 3.4.1].

Definition 1. The set $PCB$ of functions from $\mathbb{R}^2$ to $\mathbb{R}$ is defined by
$$PCB = \{f : \mathbb{R}^2 \to \mathbb{R} \mid f \text{ is piecewise continuous and has bounded support}\} \tag{2.2}$$

First we give the Radon transform for the angle zero; this is a projection straight down the y-axis:
$$Rf(0, t) = \int_{-\infty}^{\infty} f(t, y)\, dy$$
To get to $Rf(\theta, t)$ we rotate the y-axis (here counter-clockwise) and then apply $Rf(0, t)$. This formula is given by [5].

Definition 2. Let $f$ be a function in $PCB$; then $Rf : \mathbb{R}^2 \to \mathbb{R}$ is defined by
$$Rf(\theta, t) = \int_{-\infty}^{\infty} f(t\cos\theta - s\sin\theta,\ t\sin\theta + s\cos\theta)\, ds \tag{2.3}$$

Corollary 1. The Radon transform can also be formulated equivalently in the following ways. First define
$$n(\theta) = \begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix}, \qquad n^{\perp}(\theta) = \begin{pmatrix} -\sin\theta \\ \cos\theta \end{pmatrix} \tag{2.4}$$
Let $l(\theta, t)$ be the line through $t \cdot n(\theta)$ perpendicular to $n(\theta)$; then
$$Rf(\theta, t) = \int_{l(\theta, t)} f(s)\, ds \tag{2.5}$$
Or, using the Dirac delta function:
$$Rf(\theta, t) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \delta(x\cos\theta + y\sin\theta - t)\, f(x, y)\, dx\, dy \tag{2.6}$$
Or, using vectors:
$$R[f](\theta, t) = \int_{-\infty}^{\infty} f(s \cdot n^{\perp}(\theta) + t \cdot n(\theta))\, ds \tag{2.7}$$


Proof. First we prove that 2.3 is equal to 2.5 and 2.7. Define
$$r(s) = \begin{pmatrix} t\cos\theta - s\sin\theta \\ t\sin\theta + s\cos\theta \end{pmatrix}$$
Note a few things about this function. First,
$$r(s) = t \cdot n(\theta) + s \cdot n^{\perp}(\theta)$$
Thus
$$\|r'(s)\| = \|n^{\perp}(\theta)\| = \sqrt{(-\sin\theta)^2 + (\cos\theta)^2} = 1$$
We also have $r(\mathbb{R}) = l(\theta, t)$. Now, by the definition of the line integral,
$$\int_{l(\theta, t)} f(s)\, ds = \int_{-\infty}^{\infty} f(r(s))\, \|r'(s)\|\, ds = \int_{-\infty}^{\infty} f(s \cdot n^{\perp}(\theta) + t \cdot n(\theta))\, ds = \int_{-\infty}^{\infty} f(t\cos\theta - s\sin\theta,\ t\sin\theta + s\cos\theta)\, ds$$
Equality to 2.6 remains to be proven. For this we use the sifting property of the Dirac delta, which says
$$\int_{-\infty}^{\infty} \delta(x - b)\, f(x)\, dx = f(b)$$
We define $u = x\cos\theta$, which implies $du = \cos\theta\, dx$. Now
$$\begin{aligned}
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \delta(x\cos\theta + y\sin\theta - t)\, f(x, y)\, dx\, dy
&= \frac{1}{\cos\theta} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \delta(u + y\sin\theta - t)\, f\!\left(\frac{u}{\cos\theta},\, y\right) du\, dy \\
&= \frac{1}{\cos\theta} \int_{-\infty}^{\infty} f\!\left(\frac{t - y\sin\theta}{\cos\theta},\, y\right) dy
\end{aligned}$$
Substituting $y = t\sin\theta + s\cos\theta$, so that $ds = \frac{1}{\cos\theta}\, dy$, gives
$$\begin{aligned}
\frac{1}{\cos\theta} \int_{-\infty}^{\infty} f\!\left(\frac{t - y\sin\theta}{\cos\theta},\, y\right) dy
&= \int_{-\infty}^{\infty} f\!\left(\frac{t - t\sin^2\theta - s\sin\theta\cos\theta}{\cos\theta},\ t\sin\theta + s\cos\theta\right) ds \\
&= \int_{-\infty}^{\infty} f\!\left(\frac{t - t(1 - \cos^2\theta)}{\cos\theta} - s\sin\theta,\ t\sin\theta + s\cos\theta\right) ds \\
&= \int_{-\infty}^{\infty} f(t\cos\theta - s\sin\theta,\ t\sin\theta + s\cos\theta)\, ds
\end{aligned}$$

Sampling the Radon transform

Because the Radon transform is linear, in the discrete setting $R$ is a matrix. The input is an image vector $v$ (gray-scale, not RGB). $v$ encodes the pixel values on the image grid $G \subset \mathbb{Z}^2$.

Definition 3. Suppose we have an angle sampling $\theta_1, \ldots, \theta_n$. Let $R_{\theta_i}$ denote the sub-matrix which projects the image vector along the angle $\theta_i$. The full discrete Radon transform is defined by
$$R = \begin{pmatrix} R_{\theta_1} \\ \vdots \\ R_{\theta_n} \end{pmatrix}$$
If $v$ is the image vector, then every entry of $R_{\theta_i} v$ corresponds with a 'bin' between two lines of angle $\theta_i$. This models a ray of photons passing through the object and hitting the detector. See figure 2.1.

Definition 4. Let $G \subset \mathbb{Z}^2$ be a finite set denoting the 'grid'. Let $v \in \mathbb{R}^{\#G}$ be our 'image vector'. For any $x \in G$, $v(x)$ denotes the image value at $x$. We give the definition for $j \in \mathbb{Z}$ denoting the row number. Let $w_{j,\theta_i} : G \to \mathbb{R}$ denote our weight function. Then we define
$$(R_{\theta_i} v)_j = \sum_{x \in G} w_{j, \theta_i}(x)\, v(x)$$
Together with Definition 3 this defines the discrete Radon transform.

The resulting vector $Rv$ is called a sinogram. In this setting $w_{j,\theta_i}(x)$ represents the weight of the pixel $x$ for the line with angle $\theta_i$. There remains a choice of how to choose $w_{j,\theta_i}(x)$. One could count all grid cells which have any overlap with the $j$-th bin. Let $B_{j,\theta_i}$ denote all grid cells within the $j$-th bin and $G$ the image grid. This gives
$$w_{j,\theta_i}(x) = \begin{cases} 1 & x \in B_{j,\theta_i} \\ 0 & \text{otherwise} \end{cases}$$


Figure 2.2.: (a) an original function $f(x, y)$ and (b) its Radon transform $Rf(\theta, t)$.

One could also count only the grid cells whose centre is contained in the $j$-th bin (setting $w_{j,\theta_i}(x)$ to 1 or 0). Another option is to use the portion of the centre line of $B_{j,\theta_i}$ that overlaps with $x$.
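The pixel-centre weighting is simple to implement. The following NumPy sketch (the function name, bin layout and centring are my own illustrative choices, not from the thesis) builds a sinogram by assigning each pixel value to the detector bin containing the pixel centre.

```python
import numpy as np

def discrete_radon(img, thetas, n_bins):
    """Sinogram with w_{j,theta}(x) = 1 iff the centre of pixel x lies in bin j."""
    n, m = img.shape
    rows, cols = np.mgrid[0:n, 0:m].astype(float)
    # Centre the pixel grid around the origin.
    cols -= (m - 1) / 2.0
    rows -= (n - 1) / 2.0
    half_width = np.hypot(n, m) / 2.0
    edges = np.linspace(-half_width, half_width, n_bins + 1)
    sino = np.zeros((len(thetas), n_bins))
    for i, th in enumerate(thetas):
        # Signed distance t = <pixel centre, n(theta)> for every pixel centre.
        t = cols * np.cos(th) + rows * np.sin(th)
        # Sum the pixel values falling into each detector bin.
        sino[i], _ = np.histogram(t, bins=edges, weights=img)
    return sino

phantom = np.zeros((64, 64))
phantom[20:40, 25:45] = 1.0
print(discrete_radon(phantom, np.linspace(0, np.pi, 45, endpoint=False), 91).shape)
```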

2.2.2. Inverse problems

In the previous sections we described how one computes the sinogram $s$ from the measurement data and how the sinogram relates to our original function $f$. If $R$ denotes the Radon transform, we seek $f$ such that
$$s = Rf$$
We can consider this in a more general setting.

Definition 5. Let $H, K$ denote Hilbert spaces and $A \in L(H, K)$ a linear function from $H$ to $K$. A linear inverse problem reads as follows: given $g \in K$, find $f \in H$ such that
$$g = Af$$
In a well-posed problem [7, chapter 4]

• a unique $f$ exists for every $g$;

• the solution $f$ depends continuously on $g$.

The inverse problem of the Radon transform is often ill-posed. One factor is that we usually have a finite number of projections, which means that (uncountably) many different functions would have the same measured Radon transform (see the Fourier slice theorem, section 2.2.3). There are also other real-world factors which distort the measured Radon transform, such as background radiation (see section 2.5.3 on artifacts).

When dealing with an ill-posed problem, we can take the following approach. In the absence of an exact solution $f$, we try to find a solution $\tilde f$ such that the following norm is minimal:
$$\|A\tilde f - g\|$$
We first need a small lemma.

Lemma 1. Let $H, K$ denote Hilbert spaces and $A \in L(H, K)$. If $k \in \mathrm{ran}(A)$ and $k$ is non-zero, then $A^* k$ is non-zero.

Proof. Write $k = Ah$. Now
$$\|k\|^2 > 0$$
From this we deduce
$$\|k\|^2 = \langle k, k\rangle = \langle Ah, k\rangle = \langle h, A^* k\rangle$$
We know this inner product is bigger than 0, thus $A^* k \neq 0$.

The following theorem gives an equation that can be used to compute such an $f$.

Theorem 1. Let $H, K$ denote finite-dimensional Hilbert spaces, $A \in L(H, K)$ and $g \in K$. An $f \in H$ minimizes
$$\|Af - g\| \tag{2.8}$$
if and only if
$$A^* A f = A^* g \tag{2.9}$$

Proof. First we assume that $f$ minimizes
$$\|Af - g\|$$
Now $g \in K$ can be written as $g_1 + g_2$ with $g_1 \in \mathrm{ran}(A)$ and $g_2 \in \mathrm{ran}(A)^\perp$. Because, by linearity, $Af - g_1 \in \mathrm{ran}(A)$ and $-g_2 \in \mathrm{ran}(A)^\perp$, we have
$$\langle Af - g_1, -g_2\rangle = 0$$
Thus by the Pythagorean theorem we have
$$\|Af - g\| = \|Af - g_1 - g_2\| = \sqrt{\|Af - g_1\|^2 + \|g_2\|^2}$$
The only way to minimize this with respect to $f$ is to minimize $\|Af - g_1\|$. Because $g_1$ is in the image of $A$, this norm is zero if $Af = g_1$. Note that $\mathrm{ran}(A)^\perp = \ker A^*$ (finite dimension), which implies $A^* g_2 = 0$. This means that
$$A^* g = A^*(g_1 + g_2) = A^* g_1 = A^* A f$$
Now we prove the converse. We assume $A^* A f = A^* g$. Since $A^* g_2 = 0$ this means
$$A^*(Af - g_1) = 0$$
Since $Af - g_1 \in \mathrm{ran}(A)$, we must have $Af = g_1$ by Lemma 1.

It is often easier to solve equation 2.9 than to calculate $g_1$. If $A$ is an injective transformation we can calculate $g_1$ using
$$g_1 = A(A^* A)^{-1} A^* g$$
The calculation is easy: since $g_1 \in \mathrm{ran}(A)$ we may write $Af_1 = g_1$, and then
$$A(A^* A)^{-1} A^* g = A(A^* A)^{-1} A^* g_1 + A(A^* A)^{-1} A^* g_2 = A(A^* A)^{-1} A^* A f_1 + A(A^* A)^{-1} \cdot 0 = A(A^* A)^{-1}(A^* A) f_1 = A f_1 = g_1$$
We can search for $f$ such that $Af = g_1$ or $A^* A f = A^* g$. This $f$ might still not be unique; in that case one could take the $f$ with minimal norm.

It is worth noting that solving equation 2.9 can be impractical when $A$ is a large matrix. The section on the SIRT algorithm (2.3.2) goes into more detail on a practical algorithm for solving the inverse problem of the Radon transform.
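As a small illustration of equation 2.9 in the discrete case, the sketch below compares solving the normal equations directly with NumPy's least-squares routine on a random, hypothetical forward operator (not the Radon matrix of the thesis).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 30))        # hypothetical forward operator
f_true = rng.random(30)
g = A @ f_true                  # noiseless measurements

# Normal equations A^T A f = A^T g (assumes A^T A is invertible).
f_normal = np.linalg.solve(A.T @ A, A.T @ g)

# np.linalg.lstsq minimizes ||A f - g|| directly and is numerically safer.
f_lstsq, *_ = np.linalg.lstsq(A, g, rcond=None)

print(np.allclose(f_normal, f_lstsq))   # both recover the same minimizer
```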

2.2.3. Fourier slice-theorem

A theorem that gives insight into the ‘information’ that is given by each projection is the Fourier slice theorem. The theorem states a relationship between the Fourier transform and the Radon transform.

Theorem 2 (Fourier Slice Theorem). Let $f \in PCB$; then
$$\mathcal{F}_2[Rf](\theta, \omega) = \mathcal{F}[f](\omega\, n(\theta)) \tag{2.10}$$
Here $\mathcal{F}_2$ denotes the one-dimensional Fourier transform with respect to the second argument. In other words, the following two procedures give the same result:

1. Take the projection along $l_\theta$ and calculate its frequency components (1D).

2. Calculate the frequency components of $f$ in its entirety (2D) and take a slice along the line $l_\theta$.

Here follows the proof:

Proof. Here we present a proof from [8].
$$\begin{aligned}
\mathcal{F}_2[Rf](\theta, \omega) &= \int_{-\infty}^{\infty} e^{-it\omega}\, R[f](\theta, t)\, dt \\
&= \int_{-\infty}^{\infty} e^{-it\omega} \int_{-\infty}^{\infty} f(s \cdot n^{\perp}(\theta) + t \cdot n(\theta))\, ds\, dt \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(s \cdot n^{\perp}(\theta) + t \cdot n(\theta))\, e^{-it\omega}\, ds\, dt \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2)\, e^{-i\omega(x_1\cos\theta + x_2\sin\theta)}\, dx_1\, dx_2 \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2)\, e^{-i\langle (x_1, x_2),\, \omega n(\theta)\rangle}\, dx_1\, dx_2 \\
&= \mathcal{F}[f](\omega\, n(\theta))
\end{aligned}$$
using the change of variables
$$x_1 = -s\sin\theta + t\cos\theta, \qquad x_2 = s\cos\theta + t\sin\theta.$$
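The theorem is easy to check numerically in the simplest discrete case, the projection at angle zero: summing an image along one axis and taking a 1D FFT gives the corresponding central slice of the 2D FFT. The sketch below uses a random, hypothetical image and only NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))                  # hypothetical 2D density

# Projection at angle 0: integrate (sum) along one axis.
projection = img.sum(axis=1)

# 1D FFT of the projection equals the corresponding slice of the 2D FFT.
slice_from_2d = np.fft.fft2(img)[:, 0]
print(np.allclose(np.fft.fft(projection), slice_from_2d))   # True
```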

A few interesting implications of this theorem describe which ‘information’ is given by each projection.

1. Projections along different angles give different information about the frequency components (except for the origin, which is given by each projection).

2. Because the Fourier transform is invertible, a projection is needed for every line through the origin of $\mathbb{R}^2$, or equivalently for every angle in $[0, \pi)$. A finite or even countable set of projections is insufficient to reconstruct $f$.

3. Suppose we have a finite set of projections. For each radius $r > 0$, consider the circle with radius $r$ in the frequency plane: the same number of points on this circle is known regardless of $r$. Because circles with larger $r$ have greater circumference, a smaller 'fraction'¹ of points is known. Thus the projections give us proportionally more information about the lower-frequency parts of $f$.

¹This fraction would technically be 0, because a circle has an infinite number of points. Suppose we were to discretize the frequency plane; the fraction of 'pixels' we would be able to fill in would then be lower for circles with greater radii.


2.2.4. Inverting the Radon transform

This section contains the inversion of the Radon transform from Definition 2. This is Theorem 6.2.3 from [6], which is also where the proof is from.

Theorem 3 (Radon Inversion Formula). Let $f \in PCB$ be such that $\mathcal{F}[f] \in L^1(\mathbb{R}^2)$. Then
$$f(x) = \frac{1}{4\pi^2} \int_0^{\pi} \int_{-\infty}^{\infty} e^{ir\langle x, n(\theta)\rangle}\, \mathcal{F}_2[Rf](\theta, r)\, |r|\, dr\, d\theta \tag{2.11}$$

Proof. We know from Theorem 5 that
$$f(x) = \frac{1}{4\pi^2} \int_{\mathbb{R}^2} \mathcal{F}[f](\zeta)\, e^{i\langle x, \zeta\rangle}\, d\zeta$$
We can re-express this in polar coordinates as
$$\frac{1}{4\pi^2} \int_0^{2\pi} \int_0^{\infty} e^{ir\langle x, n(\theta)\rangle}\, \mathcal{F}[f](r\, n(\theta))\, r\, dr\, d\theta$$
We now use Theorem 2 (the Fourier slice theorem), so this is equal to
$$\frac{1}{4\pi^2} \int_0^{2\pi} \int_0^{\infty} e^{ir\langle x, n(\theta)\rangle}\, \mathcal{F}_2[Rf](\theta, r)\, r\, dr\, d\theta$$
Because for any $\theta, t$ we have $R[f](\theta + \pi, t) = R[f](\theta, -t)$, we can integrate $\theta$ from 0 to $\pi$ and $r$ over $\mathbb{R}$. This gives us
$$f(x) = \frac{1}{4\pi^2} \int_0^{\pi} \int_{-\infty}^{\infty} e^{ir\langle x, n(\theta)\rangle}\, \mathcal{F}_2[Rf](\theta, r)\, |r|\, dr\, d\theta.$$

Corollary 2. From equation 2.11 we can formulate the inverse Radon transform:
$$R^{-1}[g](x) = \frac{1}{4\pi^2} \int_0^{\pi} \int_{-\infty}^{\infty} e^{ir\langle x, n(\theta)\rangle}\, \mathcal{F}_2[g](\theta, r)\, |r|\, dr\, d\theta$$

2.2.5. Sampling

Due to the physical limits of measurement and computation we are not able to measure a function on an uncountable set like $\mathbb{R}^2$. Thus we must limit ourselves to a sampling of the function. Generally we sample a function on a grid. This means we have an (often equally spaced) sampling of a portion of the x-axis $\{x_n\}_{n=0}^{N}$ and the y-axis $\{y_m\}_{m=0}^{N}$. This gives the sampling of a function $f$:
$$\{f(x_n, y_m) \mid 0 \le n, m \le N\}$$


In practice we limit ourselves to reconstructing the sampling of $f$. Our measured data is also sampled. Thus our measurements consist of
$$\{Rf(\theta_i, t_j) \mid i, j \in \mathbb{Z},\ 0 \le i \le P,\ 0 \le j \le S\}$$
The $\theta_i$ and $t_j$ are equally spaced; denote the increments by $\Delta\theta$ and $\Delta t$. Here $P$ is the number of projections and $S$ the number of pixels in the detector. For a given value of $S$, how many projections $P$ are needed for a "good" reconstruction? We can consider the condition derived in [5]. We know from the Nyquist-Shannon sampling theorem that the highest frequency we can represent, $\omega_m$, is given by
$$\omega_m = \frac{1}{2\Delta t}$$
Thus in the frequency domain, we can only sample frequencies in the disk with radius $\omega_m$. If we take projections across 180 degrees, or $\pi$ radians, we have an angular increment of
$$\Delta\theta = \frac{\pi}{P}$$
The Fourier slice theorem (Theorem 2) tells us that this is also the angular increment in frequency space. The difference between consecutive values at the edge of the circle with radius $\omega_m$ is given by
$$\delta_a = \Delta\theta\, \omega_m = \frac{\pi}{2P\Delta t}$$
To get a "good" reconstruction, this angular resolution needs to be of the same order as the resolution along the radial line. We can see the interpolation of the Fourier transform of $Rf$ in figure 2.3. The radial line has length $2\omega_m$ and is sampled at $S$ points, just as in the spatial domain. Thus the radial spacing $\delta_r$ is given by
$$\delta_r = \frac{2\omega_m}{S} = \frac{1}{S\Delta t}$$
If these are of the same order, that gives us a relation between $S$ and $P$, namely
$$\frac{1}{S\Delta t} \approx \frac{\pi}{2P\Delta t} \implies P \approx \frac{\pi}{2} S$$
For example, a detector with $S = 512$ pixels calls for roughly $P \approx 804$ projections. If the projections are taken across 360 degrees instead of 180, this does not change. This is because
$$Rf(\pi + \theta, t) = Rf(\theta, -t)$$
This means a sampling across 180 degrees with increment $\Delta\theta$ contains the same information as a sampling across 360 degrees with increment $\Delta\theta$: no new information is added.

Note that this theory applies to parallel projection. In practice an X-ray source emits photons in more of a 'fan' shape. The further away the emitter is, the more these rays resemble parallel rays. In the case of such a 'fan' shape, scanning across 360 degrees does provide an additional benefit.


Figure 2.3.: Interpolation of the Fourier transform. (source [5]).

2.3. Algorithms

2.3.1. Filtered back projection

One algorithm that is often used for reconstruction is filtered back projection (FBP). Filtered back projection belongs to the class of analytic reconstruction algorithms. This class of algorithms is designed for the continuous setting and later discretized.

The first step of the FBP is the "back projection". This means applying the adjoint $R^*$ of the Radon transform. This is the unique operator such that for any $f, g \in PCB$
$$\langle Rf, g\rangle = \langle f, R^* g\rangle$$
In two dimensions it is given by [9]:
$$R^* g(x, y) = \frac{1}{2\pi} \int_0^{2\pi} g(\theta,\ x\cos\theta + y\sin\theta)\, d\theta$$
Thus the back-projected function $f_b$ is given by
$$f_b(x, y) = R^* R[f](x, y) = \frac{1}{2\pi} \int_0^{2\pi} Rf(\theta,\ x\cos\theta + y\sin\theta)\, d\theta$$
As shown in "Relation to convolutions" (section 2.4):
$$\mathcal{F}[f_b](\omega) = \frac{1}{\|\omega\|}\, \mathcal{F}[f](\omega)$$
A multiplication in Fourier space is the same as a convolution in the spatial domain. Thus
$$f = \mathcal{F}^{-1}[\omega \mapsto \|\omega\|] * f_b$$
Sadly there is no function with Fourier transform exactly $\omega \mapsto \|\omega\|$. In practice one takes an approximation $H$, such as the Ram-Lak filter. In the real and discrete setting $R$ is a matrix and the adjoint is $R^T$. Given a measured sinogram $p = Rf$, one computes an approximation $\hat f$ as
$$\hat f = H R^* p$$
Alternatively one can apply a filter to the sinogram $p$ before applying $R^*$. In this case the filter $\tilde H$ applies a 1D convolution with respect to the non-angular argument:
$$\hat f = R^* \tilde H p$$
Here $\tilde H = \sqrt{-\frac{d^2}{ds^2}}$ [10]. Because convolutions and adjoints of linear operators are also linear, the FBP algorithm itself is linear. Since we only have to back-project once, FBP is often faster than iterative algorithms that back-project in each iteration.
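In practice one rarely implements FBP from scratch; tomography libraries such as the ASTRA Toolbox or scikit-image provide it. A minimal sketch using scikit-image's radon/iradon is given below; it assumes scikit-image is installed, uses a hypothetical square phantom, and relies on the default ramp (Ram-Lak) filter since the keyword selecting filters has changed name between library versions.

```python
import numpy as np
from skimage.transform import radon, iradon

# Hypothetical test image: a square phantom.
image = np.zeros((128, 128))
image[40:90, 50:100] = 1.0

# Simulate a sparsely sampled parallel-beam scan (32 angles over 180 degrees).
angles = np.linspace(0.0, 180.0, 32, endpoint=False)
sinogram = radon(image, theta=angles, circle=False)

# Filtered back projection; the default filter is the ramp (Ram-Lak) filter.
reconstruction = iradon(sinogram, theta=angles, circle=False)
print(reconstruction.shape)
```

With only 32 angles the result shows exactly the angular sub-sampling artifacts discussed in section 2.5.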

2.3.2. SIRT

The simultaneous iterative reconstruction technique (SIRT) is a reconstruction algorithm belonging to the algebraic algorithms. In this context that means we assume a discrete setting from the start. We wish to reconstruct the original function, which we consider to be defined on a grid of size $N \times M$:
$$\{(n\Delta x, m\Delta y) \mid 0 \le n \le N,\ 0 \le m \le M\}$$
We can then consider the function to be a vector $f \in \mathbb{R}^{MN}$. If we have $P$ projections of size $K$ each, our sinogram can be considered to be a vector $g \in \mathbb{R}^{PK}$. Since the Radon transform is linear (by linearity of the integral), we can model its discrete version as a matrix $R \in \mathbb{R}^{PK \times NM}$.

In section 2.2.2 on inverse problems we discussed finding $f$ such that
$$Rf = g$$
and the complications that arise. One could attempt to find the inverse of $R$, but that poses several problems. The matrix $R$ does not have to be square. Even if $R$ is square, it can be very large, which makes inversion computationally very expensive.

If we cannot reasonably calculate an explicit inverse, we can search for $f$ such that
$$L(f) := \|g - Rf\|_2^2 / 2$$
is minimal (equivalent to minimizing $\|g - Rf\|_2$, which is how it was formulated in the inverse problems section 2.2.2). We are going to use an iterative approach based on the Landweber iteration [11].

We start with our first guess $f^0$, the zero vector, and apply gradient descent (explained in more detail in section 3.3.2). We have
$$\nabla L(f) = -R^T(g - Rf)$$
Gradient descent then gives us
$$f^{n+1} = f^n + \alpha R^T(g - Rf^n)$$
To account for the use of a different norm and scaling for convergence, define
$$\tilde R_{i,j} = \frac{R_{i,j}}{\left(\sum_k R_{i,k}\right)\left(\sum_k R_{k,j}\right)}$$
The iterative computation of the SIRT algorithm now reads
$$f^{n+1} = f^n + \alpha \tilde R^T(g - Rf^n) \tag{2.12}$$
To give some intuition: $Rf^n - g$ represents the current error between our estimate and the measured data. The SIRT algorithm back-projects this error and subtracts (a scaled version of) this back-projected error. The factor $\alpha$ here is similar to that of gradient descent, where it is called the learning rate.
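A minimal dense-matrix sketch of iteration 2.12 is given below. In practice $R$ is stored as a sparse matrix or applied implicitly; the function name, iteration count and the small epsilon guarding against empty rows or columns are my own choices.

```python
import numpy as np

def sirt(R, g, n_iters=200, alpha=1.0):
    """Minimal SIRT sketch: f_{n+1} = f_n + alpha * R_tilde^T (g - R f_n)."""
    row_sums = R.sum(axis=1)                       # sum_k R[i, k]
    col_sums = R.sum(axis=0)                       # sum_k R[k, j]
    # Small epsilon avoids division by zero for empty rows/columns.
    R_tilde = R / (row_sums[:, None] * col_sums[None, :] + 1e-12)
    f = np.zeros(R.shape[1])                       # first guess: the zero vector
    for _ in range(n_iters):
        residual = g - R @ f                       # error in sinogram space
        f = f + alpha * (R_tilde.T @ residual)     # back-project the error and update
    return f
```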

2.4. Relation to convolutions

In the computational experiments chapter (chapter 6) we use a convolutional neural network to enhance the reconstruction algorithms in sparse-data settings. Before delving into convolutional neural networks in the next chapter, one might wonder how the reconstruction problem relates to convolutions. A 2017 paper [12] proposed the use of convolutional neural networks for a specific class of ill-posed linear inverse problems: the class of inverse problems $g = Af$ where $A^*A$ is a convolution. This class includes the Radon transform.

To understand why this is, we first need a few definitions from the paper [12].

Definition 6. Let $L^2(\Omega)$ denote the space of complex-valued functions on $\Omega \subset \mathbb{R}^n$ for which
$$\int_\Omega |f(x)|^2\, dx$$
exists and is finite.

We define several types of operators.

Definition 7. A linear operator $T : L^2(\Omega_1) \to L^2(\Omega_2)$ is an isometry if
$$T^* T[f](x) = f(x) \tag{2.13}$$

Definition 8. A linear operator $M_m : L^2(\Omega) \to L^2(\Omega)$ is a multiplication (with a continuous bounded $m \in L^2(\Omega)$) if
$$M_m[f](x) = m(x)\, f(x) \tag{2.14}$$


Remark 1. Note that $M_m^* = M_{\bar m}$, with $\bar m$ denoting the complex conjugate of $m$.

Definition 9. A linear operator $A_a : L^2(\Omega) \to L^2(\Omega)$ is a convolution if
$$A_a = \mathcal{F}^* M_{\hat a} \mathcal{F} \tag{2.15}$$
Here $\hat a = \mathcal{F} a$ and $M_{\hat a}$ is a multiplication.

Remark 2. $\mathcal{F}^*$ is the adjoint of the Fourier transform. This is equal to the inverse of $\mathcal{F}$ (up to multiplication by a constant depending on the convention, see section A.1, Lemma 2).

Definition 10. A reversible change of variables $\Psi_\phi : L^2(\Omega_1) \to L^2(\Omega_2)$ is a linear operator such that
$$\Psi_\phi[f](x) = f(\phi(x)) \tag{2.16}$$

Remark 3. Here $\phi$ is invertible with $\Psi_\phi^{-1} = \Psi_{\phi^{-1}}$, and $\Psi_\phi^* = \frac{1}{|J_\phi|}\Psi_{\phi^{-1}}$, where $|J_\phi|$ denotes the determinant of the Jacobian of $\phi$.

In [12] a class of operators is described for which $A^* A$ is a convolution.

Theorem 4. Let $T$ be an isometry, $M_m$ a multiplication and $\Psi_\phi$ a reversible change of variables. If $A$ is of the form
$$A = T M_m (\Psi_\phi)^{-1} \mathcal{F}$$
then $A^* A$ is a convolution of the following form:
$$A^* A = \mathcal{F}^* M_{|J_\phi|\, \Psi_\phi|m|^2}\, \mathcal{F}$$

Proof. Let us expand $A^* A$. This gives us
$$\begin{aligned}
A^* A &= (T M_m (\Psi_\phi)^{-1}\mathcal{F})^*\, T M_m (\Psi_\phi)^{-1}\mathcal{F} \\
&= \mathcal{F}^* (\Psi_\phi)^{-*} M_m^* T^* T M_m (\Psi_\phi)^{-1}\mathcal{F} \\
&= \mathcal{F}^* (\Psi_\phi)^{-*} M_{\bar m} M_m (\Psi_\phi)^{-1}\mathcal{F} \\
&= \mathcal{F}^* (\Psi_\phi)^{-*} M_{|m|^2} (\Psi_\phi)^{-1}\mathcal{F} \\
&= \mathcal{F}^* |J_\phi|\, \Psi_\phi\, M_{|m|^2} (\Psi_\phi)^{-1}\mathcal{F} \\
&= \mathcal{F}^* M_{|J_\phi|\, \Psi_\phi|m|^2}\, \mathcal{F}
\end{aligned}$$


Corollary 3. $R^* R$ is a convolution. The Radon transform satisfies [12]
$$R = T \Psi_\phi^{-1}\mathcal{F}$$
with $\Psi_\phi$ a change of variables to polar coordinates, meaning $\phi^{-1}(\theta, r) = (r\sin\theta, r\cos\theta)$, and $T$ an isometry which applies the Fourier transform in polar coordinates with respect to $r$. Now Theorem 4 tells us that $R^* R$ is a convolution, namely
$$R^* R = \mathcal{F}^* M_{\hat a}\mathcal{F}$$
where $\hat a(\omega) = |J_\phi(\omega)| = \frac{1}{\|\omega\|}$.

Now follows a short calculation of the value of the determinant. Denote $\phi(\omega) = (\theta, r)$. We compute
$$|J_\phi(\omega)| = \left(|J_{\phi^{-1}}(\phi(\omega))|\right)^{-1} = \left|\det\begin{pmatrix} -r\sin\theta & \cos\theta \\ r\cos\theta & \sin\theta \end{pmatrix}\right|^{-1} = |-r\sin^2\theta - r\cos^2\theta|^{-1} = \frac{1}{|r|} = \frac{1}{\sqrt{\omega_1^2 + \omega_2^2}} = \frac{1}{\|\omega\|}$$
This motivates the filter used in the FBP algorithm.

2.5. Artifacts

In tomography there can be certain discrepancies between the original image and its reconstruction. Undesirable alterations from the original image are called artifacts. This section explores some of the artifacts present in X-ray tomography. In figure 2.4 several such artifacts are shown.


Figure 2.4.: Different reconstruction artifacts (FBP, brightness capped at 0.3): (a) no constraint, (b) angular sub-sampling (8 times), (c) limited angle (90°), (d) detector sub-sampling (12 times).

2.5.1. Aliasing

Aliasing occurs when the sampling rate is insufficient to accommodate all the spatial frequencies present in an image. In this case, a higher-frequency component might appear as a lower-frequency component. Consider two sine waves of different frequencies, $\sin(10\pi x)$ and $\sin(2\pi x)$. If we sample these functions with a sampling frequency of 4, they appear identical. More precisely, for any $k \in \mathbb{Z}$
$$\sin\left(10\pi\frac{k}{4}\right) = \sin\left(2\pi\frac{k}{4} + \frac{8}{4}\pi k\right) = \sin\left(2\pi\frac{k}{4}\right)$$
Thus these functions appear to be the same when sampled. The Shannon-Nyquist theorem tells us what constitutes a sufficient sampling rate: if a function has bandwidth $B$ (see the background section on the Fourier transform), one needs a sampling frequency of at least $2B$. Not all functions have a finite bandwidth; in that case aliasing cannot be prevented by changing the sampling frequency alone.
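This identity is easy to verify numerically; a short NumPy check, sampling both sines at rate 4:

```python
import numpy as np

k = np.arange(16)
# Sampled at rate 4, sin(10*pi*x) and sin(2*pi*x) give identical values: aliasing.
print(np.allclose(np.sin(10 * np.pi * k / 4), np.sin(2 * np.pi * k / 4)))  # True
```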

We can see examples of aliasing in figure 2.4. They occur under sub-sampling of both the angles and the detector size, where periodic patterns are visible.

2.5.2. Streaks

This subsection focuses specifically on the streaks present in limited-angle tomography. As can be seen in figure 2.4 (c), during limited-angle reconstruction some edges are lost and others are introduced. For parallel geometry this happens when $Rf(\theta, t)$ is only known for $0 \le \theta \le \Theta$ with the upper bound $\Theta < 180°$.

We first consider the addition of new edges. In figure 2.4 (c) only vertical and horizontal edges are introduced. A 2013 paper [10] provides an explanation for the streak artifacts added during parallel FBP reconstruction. To grasp this, we need the wavefront set $\mathrm{WF}(f)$ of a function $f : \mathbb{R}^n \to \mathbb{R}$. An exact definition falls outside the scope of this thesis. Roughly, we have $(x, v) \in \mathrm{WF}(f)$ when $f$ has a discontinuity at $x$ and $v$ gives us the direction in which the singularity occurs. In the case of an interface (a curve of discontinuities), $x$ must be on the interface and $v$ must be perpendicular to it. For example, consider a function defined on $[0,1] \times [0,1]$ by
$$f(x, y) = \begin{cases} 1 & x > \frac{1}{2} \\ 0 & x \le \frac{1}{2} \end{cases}$$
This function has an interface at $\{(x, y) \mid x = \frac{1}{2}\}$. We have
$$\mathrm{WF}(f) = \left\{\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}\right) \,\middle|\, x_1 = \tfrac{1}{2},\ \pm y_1 \in \mathbb{R}_{>0},\ y_2 = 0\right\}$$
Now [10] tells us which discontinuities are added. They are contained in the set
$$\{(x + v,\ \alpha n(\theta)) \mid (x, \alpha n(\theta)) \in \mathrm{WF}(f),\ \langle v, n(\theta)\rangle = 0,\ \theta \in \{0, \Theta\}\}$$
This might look intimidating. Let us consider an example where $\Theta = 90°$; in this case $\theta$ is either $0°$ or $90°$. We only have to consider $(x, w) \in \mathrm{WF}(f)$ when $w$ is either horizontal or vertical. In this case the streak artifacts are located at $x + v$ where $v$ is perpendicular to $w$. In the $90°$ setting that would imply horizontal or vertical streak artifacts, and we would expect these streaks to meet at $90°$ angles. In figure 2.4 (c) we can see that the streaking artifacts are in line with these predictions.


We also see missing edges in figure 2.4 (c). For an edge to be visible in the reconstruction, there needs to be a projection parallel to it. This is why the upper-left and lower-right edges are not fully present.

2.5.3. Modelling

There are other ways in which a reconstructed image can be distorted. The model that describes the relation between the density function $f$ and the measured data might be wrong. We often have to make simplifying assumptions: taking every factor into account might make the model too complicated to work with.

One example is background noise, which can distort the measured intensities. Another is the energy of the X-rays. In equation 2.1 we did not take the energy of the X-rays into account. If we do consider the energy spectrum $T(E)$ we get [7, section 1.1]
$$I_d = I_e \int T(E)\, e^{-\int_l f(x, E)\, dx}\, dE$$
The resulting distortion is called a beam hardening artifact. Here $f(x, E)$ represents the absorption of photons with energy $E$ at the point $x$.

Another way in which the model might not be correct is the assumption that the X-rays pass through the object in a straight line relative to a fixed object. For this to hold exactly, one would need to stop the rotation of the object during every projection. In reality the object often keeps rotating continuously during the projection. The bigger the scanned object and the higher the rotation speed, the more pronounced this artifact will be.


3. Deep Learning

3.1. Machine Learning

Machine learning, as the name implies, is the study of algorithms that 'learn'. Learning here means that the algorithm changes its behaviour based on previous experience or data. A common distinction is between the following types of machine learning:

1. Supervised learning: The algorithm is given a dataset with inputs and correspond-ing outputs. Based on this dataset, the algorithm then attempts to predict the correct outputs for new inputs.

2. Unsupervised learning: The algorithm learns to detect patterns in data with min-imal human supervision.

3. Reinforcement learning: The algorithm interacts with an environment. It receives feedback based on the actions it takes and learns from this.

In this thesis, supervised learning will be the main focus. Supervised learning algorithms often try to approximate some function $f$, sometimes called the objective function. In these settings, $f$ is known on a certain sample set $\{(x_i, f(x_i))\}_{i=1}^N$ of the whole input space. The goal of these algorithms is to predict the value of this function on inputs $v$ for which the output $f(v)$ is not known. A very popular class of algorithms for approximating such functions is called 'neural networks'.

3.2. Neural Networks

A Neural Network is a type of algorithm inspired by neurons in a human brain. A neuron is modelled as holding a single numerical value. A neuron also has connections to other neurons which are modelled by weights. A neural network often has several layers of neurons. A standard fully connected feed-forward neural network can be described as follows [13]:

Let $n$ be the number of layers $(L_1, \ldots, L_n)$. Here $L_1$ is the input layer and $L_n$ is the output layer. Each of these layers is modelled as a vector in $\mathbb{R}^{m_i}$, where $m_i$ is the number of nodes in layer $i$. Given a layer $L_i$, the layer $L_{i+1}$ can be computed by
$$L_{i+1} = \sigma(A_{i+1} L_i + b_{i+1})$$
where

• $A_{i+1}$ is a matrix in $\mathbb{R}^{m_{i+1} \times m_i}$, holding the weights between the neurons in the two layers;

• $b_{i+1}$ is a vector in $\mathbb{R}^{m_{i+1}}$ called the bias;

• $\sigma$ is a real non-linear function, applied entry-wise to vectors.

Consider what would happen if $b_i = 0$ and $\sigma$ were the identity function. Then every layer would only perform a multiplication with a matrix $A_i$, and the entire neural network would just be a composition of matrix transformations. A composition of matrix transformations is again a matrix transformation. This means that if the objective function is non-linear, the neural network can at best give a linear approximation. For example, a zero input would always result in a zero output.

The introduction of a bias $b_i$ results in an affine transformation between each layer. This means that a zero input no longer necessarily results in a zero output. However, a composition of affine transformations is again an affine transformation, which would limit the network to affine approximations. The introduction of $\sigma$, a non-linear function, breaks this pattern and allows for a much larger class of approximations.

3.2.1. Convolutional neural networks

A Convolutional neural network is a type of NN specifically designed to work with images. Consider first the problems that arise when using a regular (fully connected feed-forward) NN as previously described. For the sake of simplicity, the input will be a gray-scale image I of size n by m.

A naive way to implement this is to vectorize it as $v \in \mathbb{R}^{n \cdot m}$, setting $v_i = I[i\,/\!/\,m,\ i\,\%\,m]$, where $/\!/$ denotes integer division and $\%$ the remainder operation.

The first issue here is the number of trainable parameters. If the network has $k$ layers of the same size as the input, we would have $A_i \in \mathbb{R}^{nm \times nm}$ with $n^2 m^2$ parameters per layer. Every layer would also have $nm$ extra parameters for the bias. The total number of trainable parameters would be $k(n^2 m^2 + nm)$.

If we have a 1000 by 1000 image as input, then $n = m = 1000$. A network with 30 layers and parameters stored as 32-bit floating point numbers would require
$$30 \cdot (1000^2 \cdot 1000^2 + 1000 \cdot 1000) \cdot 32 \approx 9.6 \cdot 10^{14}$$
bits. Dividing by $8 \cdot 10^9$ gives the number of gigabytes: the network would require about 120,000 GB of memory. This is far more memory than a contemporary computer offers.

Another issue is the loss of spatial locality. A pixel might have a distance of 1 to its upper and lower neighbours but a distance of $m$ to its left and right neighbours. The network might learn this spatial locality, but that is an inefficient use of training time and parameters.


A convolutional neural network solves this problem by hard-coding spatial locality into the architecture. Instead of alternating between affine transformations and the non-linear activation function $\sigma$, we apply a convolution $Q_i$ between every layer, followed by the activation function. Thus
$$L_{i+1} = \sigma(Q_{i+1} * L_i + b_{i+1})$$
A convolution (section A.2) performs the same local operation everywhere in the image $L_i$. We thus only need to store parameters for this local operation. A fully connected network takes every pixel in $L_i$ into account when calculating a single pixel in $L_{i+1}$; a CNN only looks at a region around the corresponding pixel in $L_i$. Depending on the type of CNN, the use of convolutions ranges from computing only the first layer from the input to using convolutions between every layer.

To see the efficiency of using convolutions, consider the task of horizontal edge detection. In a fully connected NN, every node (neuron) would detect edges in a region of the input image. Each of these nodes requires $nm + 1$ parameters (including the bias). This would be very inefficient if each neuron performs a similar task, only in a different region. A convolutional neural network could simply use a kernel such as
$$\begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}$$
and a bias of zero. This kernel is called a Sobel filter. It does not have to be trained separately for each neuron in the first layer. For this $3 \times 3$ kernel we have 10 trainable parameters including the bias. Although a cherry-picked example, this is a significant reduction in the number of required parameters.

Convolutions allow a neural network to easily detect the same local feature in different regions across the image.
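As a small demonstration of this idea, the sketch below applies the Sobel kernel from above to a toy image with SciPy's 2D convolution; a single 3x3 kernel responds to the horizontal edge wherever it occurs.

```python
import numpy as np
from scipy.signal import convolve2d

# Toy image: upper half dark, lower half bright (one horizontal edge).
image = np.zeros((6, 6))
image[3:, :] = 1.0

# Sobel kernel for horizontal edges, as in the text.
sobel = np.array([[ 1,  2,  1],
                  [ 0,  0,  0],
                  [-1, -2, -1]], dtype=float)

# The same 10 parameters (9 weights + bias 0) are reused at every position.
response = convolve2d(image, sobel, mode='same')
print(response)
```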

3.2.2. Mixed-scale Dense CNN

This section details the type of CNN used in the reconstruction enhancement experiments (chapter 6) later in this thesis. This type is called a mixed-scale dense convolutional neural network (MSDCNN). A 2018 paper [3] made use of MSDCNNs to enhance tomographic reconstructions.

An MSDCNN makes use of dilated convolutions (equation A.7) between all layers. These dilated convolutions have a wider view but the same number of parameters as regular convolutions. The use of different dilations for different layers explains the 'mixed-scale' part of the name. The name contains the word 'dense' because in the computation of layer $i$, layers $i-1, \ldots, 1$ are used instead of only layer $i-1$. This should relieve the network of having to 'copy' information to the next layer. Every layer thus has access to everything that was computed in the previous layers.

Contrast this with another architecture based on a fully convolutional NN. The U-Net architecture [14] down-scales in the first half of the layers; in the second half, the layers scale up again. The output of each down-scaling layer is also concatenated to the up-scaling layer of the same scale. This allows the network to apply convolutions (and thus detect features) at different scales. The use of different dilations in the MSDCNN architecture has the same effect. One advantage of the MSDCNN is that it typically requires fewer trainable parameters, which decreases computation time and the risk of overfitting [3]. (Overfitting is explained later in section 3.3.7.)

Now follows a more detailed mathematical description of the MSDCNN layers, based on [3]. One computes the layer $L_i$ as follows:
$$L_i = \sigma\left(\sum_{j=0}^{i-1} C^{d_i}_{q_{i,j}} L_j + b_i\right)$$
using the following definitions:

• $L_i \in \mathbb{R}^{n \times m}$ is the $i$-th layer of the network;

• $\sigma$ is a non-linear activation function such as the sigmoid or ReLU (not learned);

• $d_i \in \mathbb{Z}_{>0}$ is the dilation of the kernels used in layer $i$ (not learned);

• $q_{i,j} \in \mathbb{R}^{3 \times 3}$ is a convolution kernel (learned);

• $b_i \in \mathbb{R}^{n \times m}$ is the bias of layer $i$ (learned);

• $C^{d_i}_{q_{i,j}}$ is a convolution with kernel $q_{i,j}$ and dilation $d_i$.

The padding of the convolution is equal to the dilation. This causes every layer to have the same dimensions as the input. It is interesting that this architecture can take images of different dimensions as input using the same parameters. One could, although the performance might not be staggering, train the network on smaller images and use the trained network on larger images. An illustration is given in figure 3.1, where different colors indicate different dilations.
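A minimal PyTorch sketch of such a network is given below. It is not the implementation used in the thesis (which follows [3]); the depth, the cyclic dilation scheme, the scalar per-layer bias (PyTorch's default, a simplification of the per-pixel bias $b_i$ above) and the final 1x1 convolution are my own simplifying choices.

```python
import torch
import torch.nn as nn

class MSDNet(nn.Module):
    """Sketch of a mixed-scale dense CNN with one channel per layer."""
    def __init__(self, depth=10, max_dilation=10):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(1, depth + 1):
            d = (i - 1) % max_dilation + 1           # cycle dilations 1..max_dilation
            # Layer i sees the input plus all i-1 previous layers, one channel each.
            # padding == dilation keeps the spatial dimensions unchanged.
            self.layers.append(nn.Conv2d(i, 1, kernel_size=3, dilation=d, padding=d))
        # Map the concatenation of all layers to a single output image.
        self.final = nn.Conv2d(depth + 1, 1, kernel_size=1)

    def forward(self, x):
        feats = [x]                                   # x: (batch, 1, n, m)
        for conv in self.layers:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))
        return self.final(torch.cat(feats, dim=1))

net = MSDNet()
print(net(torch.zeros(1, 1, 64, 64)).shape)           # torch.Size([1, 1, 64, 64])
```

Because padding always equals dilation, the same trained parameters work on inputs of any spatial size, as noted above.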


3.3. Network training

3.3.1. Problem setting

As mentioned previously, the neural network is used to approximate a function $f$ (the objective function). Letting $\theta$ denote the parameters of the neural network, $f_\theta$ is our approximation of the objective function $f$. We have a dataset $\{(x_i, y_i)\}_{i \in I}$ with $f(x_i) = y_i$ for $i$ in some index set $I$. We want network parameters $\theta$ such that $f_\theta(x_i)$ is close to $y_i$.

To define this 'closeness', a loss function $l(a, b)$ is needed. The value of $l(a, b)$ should be lower the 'closer' $a$ is to $b$, and $l(a, b)$ should be 0 if $a = b$. Common loss functions are the 'L1' and 'L2' losses, based on the distance functions on $\mathbb{R}^k$:

• 'L1': $l(a, b) = \sum_{i=1}^{k} |a_i - b_i|$

• 'L2': $l(a, b) = \sum_{i=1}^{k} (a_i - b_i)^2$

This gives us the tools to define the function $J(\theta)$, which tells us how well the parameters $\theta$ are performing on our dataset. We define
$$J(\theta) = \frac{1}{|I|} \sum_{i \in I} l(f_\theta(x_i), y_i)$$
Here $|I|$ denotes the number of elements in $I$. The division by $|I|$ provides a valid comparison when computing $J(\theta)$ on datasets of different sizes. The following section describes how we can practically find a $\theta$ that minimizes $J(\theta)$.

3.3.2. Gradient descent

The gradient descent algorithm describes a method for finding local minima of smooth functions. The problem setting is as follows:

Given a function $J : \mathbb{R}^n \to \mathbb{R}$, find a $\theta \in \mathbb{R}^n$ that minimizes $J(\theta)$. (A minimum is often, but not always, present; even if present it can be hard to find. In such cases we can search for a low value of the function.)

A straightforward approach would be to sample the function on a finite set $U \subset \mathbb{R}^n$ and pick the lowest value. Without a priori knowledge of where $J$ attains low values, it is a good idea to pick evenly spaced points. This approach runs into considerable problems when there is a significant cost associated with calculating $J(\theta)$.

Consider the following example. Let $A_n$ be an evenly spaced sampling of the unit cube $[0,1]^n$, with spacing $\frac{1}{100}$, defined by
$$A_n = \left\{\tfrac{1}{100}(a_1, \ldots, a_n)^T \,\middle|\, \forall i:\ a_i \in \mathbb{Z},\ 0 \le a_i \le 100\right\}$$


Figure 3.2.: Gradient descent (image source: https://saugatbhattarai.com.np/what-is-gradient-descent-in-machine-learning/).

This set has $101^n$ elements, which grows exponentially as the dimension increases. Computing the values of $J$ on this set and finding the lowest would take a significant amount of computing power. This type of brute-force approach is computationally very inefficient in settings with high-dimensional input spaces. Minimizing a function with respect to neural network parameters is one such setting.

The gradient descent algorithm takes an iterative approach to this problem. It starts with an initial guess $w_0$ and updates each guess as follows [13, section 4.3]:
$$w_{n+1} = w_n - \alpha \nabla J(w_n)$$
This $\alpha$ is referred to as the learning rate. The gradient $\nabla J(w)$ indicates the direction of greatest increase (from $w$), as well as the magnitude of that increase. In figure 3.2, an application of the GD algorithm is shown for a quadratic function; the $w_n$ converge to a local minimum. More sophisticated algorithms compute a different $\alpha$ in each iteration. One can stop after $J(w_n)$ dips below a certain threshold or after a predetermined number of iterations. One must necessarily stop when $\nabla J(w_n) = 0$, because in that case $w_{n+1} = w_n$. This means there is no increase in any direction (technically, it means the best linear approximation has no increase in any direction, i.e. it is constant; for example, $x^3$ is monotonically increasing but has derivative 0 at 0). In this case $w_n$ is called a stationary point with respect to $J$. In practice one might stop once $\|\nabla J\| \approx 0$.
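A minimal NumPy sketch of this update rule on a simple quadratic; the objective, starting point and learning rate are illustrative choices, not values used in the thesis.

```python
import numpy as np

def gradient_descent(grad, w0, alpha=0.1, n_steps=100):
    """Iterate w_{n+1} = w_n - alpha * grad(w_n)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - alpha * grad(w)
    return w

# Example: J(w) = ||w - 3||^2 has gradient 2(w - 3) and its minimum at w = 3.
grad_J = lambda w: 2.0 * (w - 3.0)
print(gradient_descent(grad_J, w0=[0.0, 10.0]))   # approximately [3. 3.]
```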

3.3.3. Complications

The GD approach does not suffer as much from high dimensionality as the brute-force approach. To compute the gradient of $J$ with input $x = (x_1, \ldots, x_n)^T$, one has to compute $\frac{\partial J}{\partial x_1}, \ldots, \frac{\partial J}{\partial x_n}$, as
$$\nabla J = \begin{pmatrix} \frac{\partial J}{\partial x_1} \\ \vdots \\ \frac{\partial J}{\partial x_n} \end{pmatrix}$$

That is a linear increase with respect to the dimension $n$. This is a significant improvement over the exponential increase of the brute-force method. Though gradient descent and variations of it far outperform brute force in practice, there nevertheless are drawbacks.

The linear increase in computation cost is per iteration. An increase in dimensionality might cause GD to require more iterations in order to achieve the same decrease. The ideal case is when $J$ is separable, which means
$$J(x_1, \ldots, x_n) = \sum_{i=1}^{n} J_i(x_i)$$
for some $J_1, \ldots, J_n : \mathbb{R} \to \mathbb{R}$. In this case $\frac{\partial J}{\partial x_i} = J_i'$. Note that
$$x - \alpha\nabla J(x) = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} - \alpha\begin{pmatrix} J_1'(x_1) \\ \vdots \\ J_n'(x_n) \end{pmatrix} = \begin{pmatrix} x_1 - \alpha J_1'(x_1) \\ \vdots \\ x_n - \alpha J_n'(x_n) \end{pmatrix}$$
Thus applying gradient descent to $J$ has the same result as applying GD to each $J_i$ individually. In this case the increase in computation cost is truly linear. The more closely $J$ locally resembles such a separable function, the less GD will be affected by high dimensionality. Due to the architecture of the neural network, in practice $J$ can be very non-linear.

Another complication is the GD algorithm getting 'stuck'. The algorithm stops once it reaches a stationary point. However, a stationary point does not have to be the lowest value $J$ can attain. In figure 3.3, $J = x^3 - x$ is plotted. Should GD start to the right of the local maximum at $-\sqrt{1/3}$, it will converge to, or halt at, the local minimum at $\sqrt{1/3}$. As we see, this local minimum is not a global minimum, so GD does not find the optimal solution.

Besides local minima, GD can also stop at a saddle point (gradient 0, but not a local minimum or maximum). One example is $J = x^3$. This function is monotonically increasing, but GD would halt at 0 since $J'(0) = 0$. Theoretically GD could also stop at a local maximum, but these are locally repelling.

It is interesting to note that an increase in the dimensionality $n$ of the parameter space can make stationary points less likely. For a point $w$ to be a stationary point, we need
$$\frac{\partial J}{\partial x_1}(w) = 0,\ \ldots,\ \frac{\partial J}{\partial x_n}(w) = 0$$
The larger $n$ is, the less likely it is that all partial derivatives are 0 at the same point $w$.


3.3.4. Learning rate

The learning rate is the $\alpha \in \mathbb{R}_{>0}$ in the gradient descent equation, where each iteration of the gradient descent algorithm is computed as
$$w_{n+1} = w_n - \alpha\nabla J(w_n)$$
One justification for the gradient descent approach is the following. For a smooth function $J$, it holds that for each $x \in \mathbb{R}^n$ there exists an $\epsilon \in \mathbb{R}_{>0}$ such that
$$J(x - \epsilon\nabla J(x)) \le J(x)$$
This does not guarantee that $J(w_{n+1}) < J(w_n)$, since this $\epsilon$ can be different for different $x$ and is not necessarily known. Ideally $\alpha$ should be close to this $\epsilon$. Some more sophisticated algorithms contain optimizations that compute a different $\alpha$ in each iteration. A learning rate that is either too small or too big can slow down the convergence of GD. If one picks a learning rate that is too small, it can take many steps before a stationary point is reached. If the learning rate is too big, the iteration can shoot past the stationary point.

3.3.5. Back-propagation

So far we have taken for granted our ability to compute the gradient of $J(\theta)$. This subsection describes the computation of this gradient. We first discuss how to compute the gradient for a single element of the dataset. A later section (3.3.6) deals with network training across a dataset.

Based on the image at the first layer $L_1 \in \mathbb{R}^m$ (the input image), the output image $L_n \in \mathbb{R}^k$ is computed. The output image $L_n$ is completely determined by the input image $L_1$ and the vector of trainable parameters $\theta$. The goal is to find $\theta$ such that $L_n$ is close to the target image (denoted by $y$), or, as described in the problem setting, to minimize $J(\theta)$.

Because we are dealing with a single input and output, using the notation of the problem setting,
$$J(\theta) = l(f_\theta(L_1), y)$$
Here $f_\theta$ denotes the neural network. We will denote the approximation $f_\theta(L_1)$ by either $L_n$, because it is equal to the image at the output layer, or $\hat y$ when we discuss the vector entries.

Considering the loss function $J$ as a function of the weights, we can apply gradient descent. The question remains how to compute the gradient. We first consider the computations for a fully connected neural network architecture. The gradient is calculated for each element in the set of parameters ($A_i$, $b_i$ as defined in section 3.2). First consider the gradient with respect to $b_i$. By the chain rule we have
$$\frac{\partial J}{\partial b_i} = \frac{\partial J}{\partial L_n}\frac{\partial L_n}{\partial b_i}$$
For the 'L2' loss function, the Jacobian with respect to $L_n$ is given by
$$\frac{\partial J}{\partial L_n} = \frac{\partial J}{\partial \hat y} = \begin{pmatrix} 2(\hat y_1 - y_1) & \cdots & 2(\hat y_k - y_k) \end{pmatrix}$$
The chain rule can be applied again:
$$\frac{\partial L_{j+1}}{\partial b_i} = \frac{\partial L_{j+1}}{\partial L_j}\frac{\partial L_j}{\partial b_i}$$
Recall
$$L_{j+1} = \sigma(A_{j+1} L_j + b_{j+1})$$
Thus
$$\frac{\partial L_{j+1}}{\partial L_j} = \sigma'(A_{j+1} L_j + b_{j+1})\, A_{j+1}$$
Here $\sigma'(A_{j+1} L_j + b_{j+1})$ does not denote $\sigma'$ applied to each element of the vector $A_{j+1} L_j + b_{j+1}$, but rather a diagonal matrix with $\sigma'$ applied to each of those entries along the diagonal. To compute $\frac{\partial J}{\partial b_i}$ we can recursively apply the chain rule:
$$\frac{\partial J}{\partial b_i} = \frac{\partial J}{\partial L_n}\frac{\partial L_n}{\partial L_{n-1}}\cdots\frac{\partial L_{i+1}}{\partial L_i}\frac{\partial L_i}{\partial b_i}$$
Now all that remains is to compute $\frac{\partial L_i}{\partial b_i}$. We first split $L_i$ using an intermediate $v$. Denote $v = A_i L_{i-1} + b_i$, then $L_i = \sigma(v)$. By the chain rule
$$\frac{\partial L_i}{\partial b_i} = \frac{\partial\sigma(v)}{\partial v}\frac{\partial v}{\partial b_i}$$
Here $\frac{\partial\sigma(v)}{\partial v}$ is a diagonal matrix with diagonal entries $\sigma'(v_i)$ and $\frac{\partial v}{\partial b_i} = I$, the identity matrix.

We also have parameters other than $b_i$. The matrix $A_i$ has to be updated according to the GD algorithm. To avoid using matrix calculus, we calculate $\frac{\partial J}{\partial (A_i)_j}$, where $(A_i)_j$ is the $j$-th column of $A_i$, which is a vector. We can use the same technique again:
$$\frac{\partial J}{\partial (A_i)_j} = \frac{\partial J}{\partial L_n}\frac{\partial L_n}{\partial L_{n-1}}\cdots\frac{\partial L_{i+1}}{\partial L_i}\frac{\partial L_i}{\partial v}\frac{\partial v}{\partial (A_i)_j}$$
Only $\frac{\partial v}{\partial (A_i)_j}$ needs to be calculated. This is given by
$$\frac{\partial v}{\partial (A_i)_j} = (L_{i-1})_j\, I$$


Convolutions

It is also possible that a layer contains a convolution, so the question arises how a gradient can be calculated in this setting. Here a layer can be regarded as a two-dimensional array rather than a one-dimensional one; these are just different representations of the same vector. Similar to the densely connected neural network, let us focus on a single layer with input $x$ and output $L$:
$$L = \sigma(v), \qquad v = x * w + b$$
Here $+b$ represents element-wise addition of a scalar $b$. The convolution operator $*$ here denotes cross-correlation rather than convolution in the strictest mathematical sense; the difference between them is simply whether $w$ is flipped or not.

Similar to the previous section, we can back-propagate until $\frac{\partial J}{\partial v}$ is known. The following remain to be calculated: $\frac{\partial J}{\partial x}$, $\frac{\partial J}{\partial w}$, $\frac{\partial J}{\partial b}$. According to [15], we have
$$\frac{\partial J}{\partial b} = \sum_i \sum_j \left(\frac{\partial J}{\partial v}\right)_{i,j} \qquad\text{and}\qquad \frac{\partial J}{\partial w} = x * \frac{\partial J}{\partial v}$$
One more complication is $\frac{\partial J}{\partial x}$. Let $w'$ denote the flipped version of $w$; then
$$\frac{\partial J}{\partial x} = \frac{\partial J}{\partial v} * w'$$
If the original convolution was performed with padding (zeros around the border), this convolution needs to be performed with the same padding.

3.3.6. Epochs and batches

The description of the gradient descent algorithm mentions the calculation of the loss function. This loss function is calculated with respect to a dataset of samples (input and target pairs). After this calculation, the gradient is computed and the network parameters are updated.

In practice it is not always ideal to use the entire dataset when calculating the loss function. Instead the dataset is divided into subsets called batches, on which the loss function and its gradient are calculated. Although this reduces the accuracy of each GD step, it lowers the cost of computation. If one were to divide the dataset into 10 batches, GD would take 10 less accurate steps compared to 1 more accurate step when using the entire dataset.

There now is a distinction between how often we iterate through the dataset and how often the GD algorithm takes a step. Going through the dataset once is called an epoch. The total number of iterations satisfies
$$\#\,\text{iterations} = \#\,\text{epochs} \cdot \frac{\#\,\text{samples in dataset}}{\text{batch size}}$$
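A sketch of this epoch/batch structure is given below; the gradient function `grad_fn`, the dataset format and the hyper-parameters are hypothetical placeholders, not values from the thesis.

```python
import numpy as np

def train(params, dataset, grad_fn, lr=0.01, n_epochs=10, batch_size=32):
    """One GD step per batch; one pass over the dataset per epoch."""
    n = len(dataset)
    for _ in range(n_epochs):
        order = np.random.permutation(n)                   # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = [dataset[i] for i in order[start:start + batch_size]]
            params = params - lr * grad_fn(params, batch)  # gradient on this batch only
    return params
```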




3.3.7. Overfitting

Up until now, we have considered a dataset to consist of pairs of input and target images, all of which are used for the training process. We will have to split this dataset in two because of one issue that has not yet been addressed: overfitting.

We discussed approximating the objective function $f$ by a neural network approximation $f_\theta$. Our total loss $J(\theta)$ measures how well $f_\theta$ resembles $f$ on our dataset. We might hope that a low value of $J(\theta)$ means that $f_\theta$ also resembles $f$ on data outside our dataset. For example, suppose we train a network to recognize which images contain cats. If our network performs well on our dataset of cat and non-cat images, we would hope it also guesses correctly when shown a picture not present in its dataset. Once a network keeps performing better on the dataset and worse outside it, the network is overfitting.

We can relate this problem to teaching. In this context we are the teacher and the neural network is the student. The student, let us call him Jan, has to learn trigonometry. We let him do the exercises on a trigonometry exercise sheet and then test his knowledge with a test. Jan wishes to score as high as possible on this test. We hope that this way, Jan learns generalizable approaches to trigonometry problems. A problem arises when our test consists of questions also present in the exercise sheet: if Jan has enough memory, he can simply memorize the answers to the questions on the exercise sheet. To solve this, we could remove several exercises from the exercise sheet and reserve them for the test. This way, memorization of answers is not rewarded with a high test score.

Analogous to this setting, we also split off a validation dataset. The validation set contains pairs of input and target images that the network does not use for training. Instead, the validation set is used to monitor the performance on non-training data. One might stop the training once the performance on the validation set gets worse (which means the network is overfitting). This prevents the network from learning specific patterns only present in the training data. Just like in the example of the student, if the network has more trainable parameters (like memory for the student), it can overfit more easily.

As an example, we trained an MSDCNN on randomly generated 100 by 100 images. The average loss of this network during training is plotted in figure 3.4. The average loss is given for both the training and validation data. This loss is relative to the loss during the first epoch, as the absolute values here are less relevant than the increase and decrease. We see that the loss decreases on the training data but not on the validation data. This shows the network is not learning general patterns but only coincidental patterns present in the training data. From this example we can see the need to include a validation set.
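A sketch of validation-based early stopping is given below; the callables `train_one_epoch` and `validation_loss`, as well as the patience value, are hypothetical placeholders supplied by the caller, not part of the thesis code.

```python
def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=100, patience=5):
    """Stop training once the validation loss stops improving for `patience` epochs."""
    best_val, since_best = float('inf'), 0
    for epoch in range(max_epochs):
        train_one_epoch()                 # caller-supplied: one pass over the training set
        val = validation_loss()           # caller-supplied: loss on the validation set
        if val < best_val:
            best_val, since_best = val, 0
        else:
            since_best += 1
            if since_best >= patience:    # validation loss keeps getting worse: overfitting
                break
    return best_val
```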


4. Ethics

Deep learning for image reconstruction can have several real-world applications. The reconstructions one can enhance range from archaeological to medical. Depending on the setting, different trade-offs have to be considered.

4.1. Medical

One obvious application would be medical CT (Computed Tomography) scans. Every projection in a CT scan exposes the patient to radiation, so it could be beneficial if we could make reconstructions with fewer projections. Data might also be sparse when one cannot scan fully around the patient. Deep learning assisted reconstruction (DLAR) could improve the quality of medical images when data is sparse. This in turn can give medical professionals more information on which to base their advice and actions.

The use of DLAR is not without drawbacks. A neural network can effectively be considered a ‘black box’: there are very few guarantees about how a reconstruction is ‘enhanced’. This poses the risk of the network distorting the medically relevant information in the reconstruction. Consider the following hypothetical example:

Suppose a neural network is trained to enhance reconstructions of bone fractures. The network is trained on a dataset of bone fractures gathered in a clinical setting. The network successfully learns to enhance reconstructions of the bone fractures present in the dataset. If one were to incorporate this network into software for clinical use, the following issue arises. It is very difficult to guarantee the behaviour of the network for types of bone fractures not present in the dataset. It might very well be possible that the network interprets a fracture not present in the dataset as healthy. This poses a problem for incorporation into clinical software. This problem is not limited to enhancement but also applies to classification.

A 2020 paper [16] made use of machine learning techniques to detect COVID-19 based on CT scan data. If, hypothetically, the training data is not general enough, the network might make mistakes on scans that are very different from the training data. Since the network is to some extent a ‘black box’, it can be difficult to examine why it makes the mistakes it makes.

4.2. The trade-off

Although the network might appear to improve the image quality, it does not have more information than the low quality input reconstruction. The network improves the reconstruction by making assumptions. Which assumptions it makes is based on the data it was trained on. The danger lies in the network learning assumptions from the training data that do not generalize to the real world.

The network might improve the quality (resolution) of a reconstruction, but this comes at the cost of certainty. When considering the use of DLAR for X-ray scans, the following should be taken into account:

1. DLAR allows reconstructions to be made with fewer projections per scan, thus lowering the amount of radiation the patient is exposed to.

2. It also decreases the certainty with which medical advice can be based on the reconstruction.

In order to weigh these up- and downsides, one would need a probabilistic model of the accuracy of these predictions. This could be a topic for further research.


5. Data acquisition

This section describes the acquisition of the data used for the later experiments, in which neural networks were trained to improve images that were reconstructed under certain constraints (see chapter 6 ‘Computational Experiments’). This dataset was gathered at the FleX-ray lab at the Centrum Wiskunde & Informatica (CWI), the national research institute for mathematics and computer science in the Netherlands.

5.1. Materials

In order to gather data for tomographic reconstruction, we needed an object to scan. This object needed to fit in the scanner and had to be able to remain there overnight. We chose a tube filled with materials that would yield interesting cross-sections. The approximate dimensions of this tube were a height of 30 cm and a diameter of 10 cm.

The composition of the filling by volume:

• 1 part walnuts

• 1 part coffee beans

• 2 parts imitation coffee (filler)

We needed the filling to be somewhat heterogeneous, such that there was enough variety in shapes for the network to later train on. We strove for a filling with natural variation in shapes and sizes. This was meant to mimic settings such as brain scans, within practical boundaries. If the filling were too homogeneous, network performance would be less likely to generalize to such settings.

Thus we attempted to pick filling materials that would look different in cross-sections taken at different heights. These cross-sections should still contain similarities, but also natural variation in shapes and sizes. Walnuts and coffee beans generate images with patches of homogeneity that meet at variously shaped edges. These types of shapes might help the network learn approaches that generalize to, for example, brain scans. For practical purposes, the materials should not deteriorate when left inside the scanner overnight.

We also decided to pick a background filling, as air would give too much contrast with the walnuts and coffee beans. Too much contrast would make it easier for the network to detect artifacts where there should be air. There should still be sharp boundaries present, and the background filling should not look too much like noise. We tried different background fillings, including:


Figure 5.1.: Filling materials used for data acquisition.

• powdered sugar

• sawdust

• artificial sweetener

• mixtures of salt and sweetener

• imitation coffee

Based on visual judgement, we eventually found that using imitation coffee as background filling produced an image with some contrast with both the air and the coffee beans/walnuts. The powdered sugar and sweetener provided too much contrast. The salt and sawdust looked too granular and noisy. The final mixture is shown in figure 5.1. A tube was filled with this mixture and placed in the scanner, where it was left overnight while being scanned.

5.2. Scan setups

The scanner scanned a total of 201 cross-sections at different heights across the tube. From each of these, projections were taken at 1800 angles.

The resulting dataset contained 201 directories, one for each cross-section. Each of these directories contained 1800 files in TIFF format, each holding an array of shape (1, 1944).
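The sketch below illustrates how such a directory could be stacked into a single sinogram array; the ‘.tif’ extension and alphabetical ordering are assumptions about the FleX-ray output, not a description of the actual processing pipeline.

import glob
import numpy as np
import imageio

def load_sinogram(cross_section_dir):
    """Stack the 1800 projection files of one cross-section into one sinogram array.
    File extension and sorting order are assumed here."""
    files = sorted(glob.glob(cross_section_dir + '/*.tif'))
    rows = [np.asarray(imageio.imread(f)).reshape(-1) for f in files]  # each file: shape (1, 1944)
    return np.stack(rows)  # resulting shape: (1800, 1944), one row per projection angle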


6. Computational Experiments

This section covers an experiment in tomographic reconstruction enhancement. Tomographic reconstruction algorithms give higher quality results when the data consists of a larger number of measurements. When too little data is present, certain reconstruction artifacts can occur (see the artifacts section 2.5). In some real-world settings, there can be a non-trivial cost associated with gathering measurements for tomographic reconstruction. An example of this is a medical CT scan: an increase in the number of projections used in the scan leads to an increase in the amount of radiation that the patient is exposed to [2]. In such settings, there could be a benefit to gathering less data in a scan and enhancing the resulting sparse data reconstruction. Data might also be sparse in the sense that the angular scanning range is limited.

This experiment examines the ability of neural networks (MSDCNNs) to improve tomographic reconstructions in sparse data settings. We hypothesize that the network is able to visibly improve the quality of reconstructions in these sparse data settings, removing or reducing artifacts as described in section 2.5.

6.1. Reconstruction

As mentioned in chapter 5 ‘Data acquisition’, our raw dataset consists of projections of 201 cross-sections, with 1800 projections taken from each cross-section. Reconstructions were made on 512 by 512 pixel images. Reconstruction was performed using the ‘gen data.py’ script from [17].

For each cross-section, a ‘full’ reconstruction was made using each reconstruction algorithm. The algorithms used were (a minimal reconstruction sketch follows the list):

• Filtered back-projection (FBP) as implemented in the ASTRA Toolbox (with a ‘Ram-Lak’ filter). Section 2.3.1

• Simultaneous Iterative Reconstruction Technique (SIRT) for a total of 500 iterations. Section 2.3.2
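As an illustration, the sketch below shows a minimal parallel-beam FBP reconstruction with the ASTRA Toolbox. The reconstructions in this thesis were made with the script from [17], which handles the actual FleX-ray scan geometry; the geometry and the ASTRA option name below are simplifying assumptions.

import numpy as np
import astra

def fbp_reconstruct(sinogram, n_pixels=512):
    """FBP reconstruction of one cross-section, assuming a parallel-beam geometry."""
    n_angles, n_det = sinogram.shape
    angles = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
    vol_geom = astra.create_vol_geom(n_pixels, n_pixels)
    proj_geom = astra.create_proj_geom('parallel', 1.0, n_det, angles)
    proj_id = astra.create_projector('linear', proj_geom, vol_geom)
    sino_id = astra.data2d.create('-sino', proj_geom, sinogram.astype(np.float32))
    rec_id = astra.data2d.create('-vol', vol_geom)
    cfg = astra.astra_dict('FBP')
    cfg['ProjectorId'] = proj_id
    cfg['ProjectionDataId'] = sino_id
    cfg['ReconstructionDataId'] = rec_id
    cfg['option'] = {'FilterType': 'ram-lak'}  # Ram-Lak filter (section 2.3.1); option name assumed
    alg_id = astra.algorithm.create(cfg)
    astra.algorithm.run(alg_id)
    reconstruction = astra.data2d.get(rec_id)
    astra.algorithm.delete(alg_id)
    astra.data2d.delete(sino_id)
    astra.data2d.delete(rec_id)
    astra.projector.delete(proj_id)
    return reconstruction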

‘Full’ here means that all 1800 projections were present in the sinogram. For each cross-section, four additional rotated reconstructions were also made by shifting the angles in the sinogram by 300 · i for i ∈ {1, 2, 3, 4} (together with the unshifted case i = 0, this gives the shifts {300 · i | i ∈ {0, 1, 2, 3, 4}}). This increases the available training data, as the enhancement task is rotation invariant. This results in 1005 ‘full’ reconstructions per algorithm, or five shifts per cross-section. These are used as target images in the network training.
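This shifting amounts to a cyclic permutation of the rows of the sinogram, as illustrated by the sketch below.

import numpy as np

def shifted_sinograms(sinogram, shifts=(0, 300, 600, 900, 1200)):
    """Cyclically shift the angular (row) axis of the (1800, 1944) sinogram.
    A shift of 300 out of 1800 projections over 360 degrees corresponds to a 60 degree rotation."""
    return [np.roll(sinogram, shift, axis=0) for shift in shifts]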

(45)

Aside from the ‘full’ reconstructions, we also made ‘constrained’ reconstructions. The different constraints used were:

• Sub-sampling in the angular direction

• Limited angle reconstruction

• Sub-sampling in the detector size

For each constraint, both algorithms were separately used for reconstruction.

First, the angular sub-sampling. For the FBP reconstructions, we sub-sampled the sinograms with factors 2, 4 and 8. Because the SIRT algorithm works better with sparse data, we used factors 4, 8 and 12 for SIRT. The sub-sampling of the angles in the sinogram occurred after the shifting. For each factor, all 201 cross-sections were reconstructed five times (once per shift). Based on section 2.5, we would expect these reconstructions to contain aliasing similar to figure 2.4 (b).

For the angular limitation, the full reconstructions contained projections across 360°. The constrained reconstructions had limits of 180°, 150°, 120° and 90°. For each limit, all 201 cross-sections were reconstructed (once again five times, once per shift). As opposed to the angular sub-sampling, the FBP and SIRT reconstructions had the same limits. In this setting, based on section 2.5, we expect artifacts similar to figure 2.4 (c): some edges could disappear and additional streak-like edges would be added.

The detector sub-sampling was only performed with a factor of 12. Here we would expect some aliasing and a lowering of image resolution, based on figure 2.4 (d).
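The three constraints amount to simple slicing operations on the full (1800, 1944) sinogram, as sketched below; the corresponding adjustment of the projection geometry passed to the reconstruction algorithm is omitted here.

def subsample_angles(sinogram, factor):
    """Angular sub-sampling: keep every `factor`-th projection angle (row)."""
    return sinogram[::factor, :]

def limit_angles(sinogram, angular_range_deg, full_range_deg=360.0):
    """Limited angle: keep only the projections within the first `angular_range_deg` degrees."""
    n_angles = sinogram.shape[0]
    n_keep = int(round(n_angles * angular_range_deg / full_range_deg))
    return sinogram[:n_keep, :]

def subsample_detector(sinogram, factor):
    """Detector sub-sampling: keep every `factor`-th detector pixel (column)."""
    return sinogram[:, ::factor]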

6.2. Datasets

In order to train a neural network, one needs to make a dataset containing both input and target images (section 3.3.7). We made a dataset for each combination of:

• Algorithm (SIRT/FBP)

• Constraint type (full/angular sub-sampling/angular limit/detector sub-sampling)

• Constraint factor (none for full)

Each of these datasets contained 1005 images (201 cross-sections with 5 shifts each). In each dataset, the constrained reconstructions form the input images, and the corresponding set of target images consists of the full reconstructions made with the same algorithm. The input and target images were paired based on which cross-section they were reconstructed from and their respective shift.

We split each of these datasets into a training and a validation set. The reconstructions based on cross-sections 180 to 201 were moved to the validation set (regardless of their shift).
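A sketch of this pairing and splitting is given below; the tuple layout is an assumption made for illustration.

def split_dataset(pairs, first_validation_cross_section=180):
    """Split (input, target) pairs into a training and a validation set by cross-section index.
    `pairs` is assumed to be a list of (cross_section, shift, input_image, target_image) tuples."""
    train, validation = [], []
    for cross_section, shift, input_image, target_image in pairs:
        if cross_section >= first_validation_cross_section:
            validation.append((input_image, target_image))  # cross-sections 180 to 201, any shift
        else:
            train.append((input_image, target_image))
    return train, validation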
