A Complex-Valued Recurrent Inference Machine for Accelerated MRI Reconstruction


MSc Artificial Intelligence

Track: Deep Learning

Master Thesis

A Complex-Valued Recurrent Inference

Machine for Accelerated MRI Reconstruction

by

Kai Lønning

11133023

August 21, 2017

36 EC January - August, 2017

Supervisors:

PhDc P. Putzky

PhD M. Caan

Assessor:

Dr M. Welling

Faculty of Science


Abstract

This MSc thesis in Artificial Intelligence contains a detailed description of how to use Recurrent Inference Machines for solving the inverse problem of accelerated MRI reconstruction. The network is implemented for real- and complex-valued parameters, where the latter version does backpropagation using Wirtinger-calculus. Both networks are shown to outperform Compressed Sensing with respect to MSE and SSIM metrics, and are shown to be more robust against varying sampling patterns. The superiority of the RIM over compressed sensing is not only detectable through metrics, but also perceptible in reconstructed images. The real- and complex-valued networks yielded comparable results, and no advantage was found from using Wirtinger-calculus as opposed to regular calculus. The complex-valued network is unable to generalize equally well to high acceleration factors, and it is suggested that this is because it has half as many features as the real-valued network for the same amount of storage in memory. It remains unknown whether this disadvantage diminishes as the number of features grows, and whether that could lead to scenarios where a complex-valued neural network is preferable to a real-valued network for accelerated MRI reconstruction.


Contents

1 Introduction
1.1 Motivation and Goal
1.1.1 Magnetic Resonance Imaging and the Problem at Hand
1.1.2 Artificial Neural Networks and Deep Learning
1.1.3 Related Work and the Goal of this Thesis
1.2 Overview
1.3 Notation

2 Magnetic Resonance Imaging
2.1 MRI Physics
2.2 MRI: Sampling and Reconstruction
2.2.1 Sampling K-space
2.2.2 The Inverse Problem of Accelerating MRI Reconstruction
2.2.3 Compressed Sensing

3 CR-calculus
3.1 The Motivation for CR-Calculus
3.2 The Construction of CR-Calculus
3.3 Basic Properties of CR-Derivatives
3.4 Wirtinger-Backpropagation
3.4.1 Widely Linear Transformations

4 Deep Learning
4.1 Activation Functions
4.1.1 Complex Adaptation
4.2 Convolutional Neural Networks
4.2.1 Complex Adaptation
4.3 The Gated Recurrent Unit
4.3.1 Complex Adaptation
4.4 The Recurrent Inference Machine
4.4.1 The Maximum a Posteriori
4.4.2 The Recurrent Inference Procedure
4.4.3 The Log-Likelihood Gradient
4.4.4 The Loss Function

5 Implementation and Results
5.1 RIM Implementation Details
5.2 Training
5.3 Results
5.3.1 A Comparison Between the CVNN and RVNN
5.3.2 A Comparison Between the RIM Implementation and Compressed Sensing


Chapter 1

Introduction

1.1 Motivation and Goal

1.1.1 Magnetic Resonance Imaging and the Problem at Hand

Magnetic Resonance Imaging (MRI) is a technique where a combination of a magnetic field and radio-frequency photon pulses is used to infer information about the pixel-wise density of an object. Since biological tissues vary in density, this information enables the imaging of organs within the body. In contrast to techniques using X-rays or other ionizing electromagnetic radiation, MR-imaging is a benign way of visualizing the interior parts of the body for medical or scientific purposes.

Figure 1.1: An MR-image of a brain. The image on top is a corrupted version of the image below, as a result of four-times under-sampling.

As will be seen later, MR-measurements are done in a sequential procedure, where more measurements allow for generating images at the resolution limit of the device. However, acquiring these measurements takes a long time due to physical and physiological constraints to be explained later. Aspirations to speed up MRI scanning times are therefore unlikely to be achieved through an improvement of equipment. Instead, the approach is to gather less data than necessary for constructing an image, thereby reducing the scanning time, and find a way to reconstruct the corrupted signal.

Being able to lower scanning times would result in lower costs per image, not to mention greater comfort for patients, who are required to lie perfectly still within a narrow coil during scanning. Perhaps most motivating, though, is to improve on real-time MRI technology, where images of the body's internal state are produced in real-time.

The problem of reconstructing the signal is more complicated than one might think. The MRI scanner does not measure an object's density directly, but rather the spatial frequency signal of its density, through the mathematical operator known as the Fourier transform. Due to periodic properties of the Fourier transform, constructing images from partial measurements leads to aliasing artifacts. This is when a single object in the image appears at more than one location at the same time, resulting in cluttered images where important details are no longer distinguishable. An example of this can be seen in figure 1.1.

Successful image reconstruction comes down to solving what is known in science as an inverse problem. A formal description of such problems is given later in Chapter 2, but in brief, an inverse problem starts with a signal being measured after undergoing a transformation. The forward process of the transformation is known, but the inverse transformation is not. The goal is to find the signal that resulted in the measurement by approximating the inverse transformation.

1.1.2 Artificial Neural Networks and Deep Learning

Figure 1.2: The image shows the basic structure of a neural network. Neurons are stacked in layers, which are stacked in a pipeline. Data is fed to the input layer and is passed on through the hidden layers, ending with the output layer. The output is then evaluated by the loss function during training. Deep learning refers to networks that contain more than a single hidden layer.

Artificial neural networks (ANN) are universal nonlinear function approximators with learnable parameters, capable of outperforming hand-engineered solutions to many problems [1]. ANNs contain sets of digital neurons which are stored in layers. The layers are stacked on top of each other, allowing data to be processed in a pipeline starting at the input-layer and ending at the output-layer. An illustration of the very basic concept of a neural network can be seen in Figure 1.2. Deep learning refers to the set of "deep" neural networks that contain more than just a single hidden layer of neurons between the output- and input-layers. Neurons in a neural network are called network parameters, or weights. The weight-containing layers usually combine their weights with the layer's input through linear transformations. The parametrized layer is then followed by a non-parametrized layer, containing what is referred to as activation functions. The activation functions are nonlinear functions that enable the network to transform its input onto abstract classes of manifolds. On such manifolds, datapoints that were previously indistinguishable from one another may become completely different, due to some key feature that was invisible in the input space.

Alternating between parametrized layers and activation functions, the data flowing through the network is transformed into abstract feature spaces, the shapes of which are determined by the network parameters. In other words, to change the parameters means to change the feature representation of the input data. The learning in deep learning results from a process called training, where iterative adjustments are made to the network parameters, thus reshaping the feature space over time and enabling the network to create features that result in the correct output for a given input.

What determines whether an output is correct or not is defined through the network's loss function. The loss function takes the output of the network as its input, and assigns to it a real number which represents the error. Usually the error is determined by having a set of correct answers, or targets, for each input given to the network during training. This is known as supervised learning, and it is what will be used when teaching a model how to reconstruct sparsely sampled MR-images in this thesis. By showing the model a fully sampled MR-image, an error can be calculated from what it "thought" an MR-image should look like.

When including its loss function, the entire network can be seen as a functional ℓ : D → R, mapping a datapoint x ∈ D to the error of its network output. During training, the network is exposed to datapoints in an iterative process, each datapoint resulting in a corresponding loss-value. For each iteration, network parameters are altered in such a way that ℓ decreases the next time the network receives the same input again. This is achieved through a method known as gradient descent, where parameters are updated with respect to the gradient of ℓ. To get the gradient with respect to a given weight, the gradient is first calculated for the loss function, and then passed back through the network, propagating from the network's output-layer towards the input-layer. Network parameters are updated as the gradient reaches them in the chain. This way of transferring gradients is known as backpropagation.

Since one datapoint may be very dissimilar to the next, causing abrupt changes in the gradient between network inputs, parameters are usually updated using the average gradient over a set of datapoints, called a mini-batch. Bigger mini-batch sizes lead to more stable rates of convergence, but come at the cost of a lower degree of exploration of the parameter-space, thus risking convergence to a local optimum that may be much worse than the global optimum.

To summarize, training consists of feeding the network an input, resulting in an output, which is then assigned an error through the loss function. The gradient of this error is propagated back through the network during backpropagation, and is used for updating the network parameters when their layer is reached in the chain. After several such iterations, when the loss function has converged to some optimum, the network will have learned how to map datapoints from a distribution similar to the dataset it was trained on to outputs with as low an associated error as possible.
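To make the training procedure described above concrete, here is a minimal, self-contained sketch of mini-batch gradient descent on a single linear layer with a mean-squared-error loss. The data, layer, and hyperparameters are illustrative stand-ins and not taken from the thesis.

```python
import numpy as np

# Synthetic supervised dataset: inputs and targets stand in for images and labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
true_w = rng.normal(size=(16, 1))
Y = X @ true_w + 0.1 * rng.normal(size=(1000, 1))

# A single linear layer trained with mini-batch gradient descent on an MSE loss
w = np.zeros((16, 1))
lr, batch_size = 0.05, 32
for epoch in range(20):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        x, y = X[idx], Y[idx]
        pred = x @ w                               # forward pass
        grad = 2.0 * x.T @ (pred - y) / len(idx)   # gradient of the mini-batch MSE loss
        w -= lr * grad                             # gradient descent update
```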

1.1.3 Related Work and the Goal of this Thesis

Using deep learning to solve sparsely sampled MRI reconstruction problems is currently a growing field of interest. Several algorithms have been constructed for this purpose. See for instance [2], where the success behind compressed sensing is incorporated into a convolutional neural network model, or [3], where a network was pre-trained for reconstruction on complete X-ray images before seeing corrupted MR-images, in what is referred to in deep learning as transfer learning. [4] is another example of adapting compressed sensing to a deep learning model, whereas the model in [5] learns the residual error from a given measurement in order to subtract it from the corrupted image.

As seen, there are several promising examples of sparse MR-reconstruction using deep learning; however, there is one common point of neglect: none of them apply the calculus appropriate for the complex-valued domain of MR-signals. In [2], real and imaginary components are passed to the convolutional layers as two separate channels, while the model in [4] uses the same channel for both components. This simplifies the backpropagation algorithm, allowing them to use the same procedure already built into every deep learning framework there is, but it comes at a cost. Relying on calculus designed for real variables means the network will have to compensate by learning the complex interactions involved in the signal corruption on its own. This thesis postulates that there are advantages to adapting the backpropagation process to comply with the complex domain that inherently comes with the territory of sparse MR-reconstruction problems. This should free up resources from learning complex interactions, and hopefully allow the network to better focus on learning the desired signal instead. The neural network implemented in this thesis will therefore be what is known as a complex-valued neural network (CVNN). CVNNs are parametrized by complex-valued weights, which are updated using what is called CR-calculus, or Wirtinger-calculus, replacing the regular calculus implemented for real-valued neural networks (RVNN). CVNNs are rare in deep learning, probably owing to the extra work required to re-implement the backpropagation algorithm, in addition to the fact that complex-valued datasets are rare. However, that is not to say it is completely uncharted territory. Examples can be found in [8], where the notion of complex activation functions is discussed, or in [6], which deals with the same, in addition to Wirtinger-backpropagation. In [7], the author argues that using complex-valued parameters is good for regularization, showing a big discrepancy in over-fitting between the real- and complex-valued neural networks implemented. She also reports that CVNNs are harder to train than RVNNs, a claim repeated in most papers on CVNNs; apparently, convergence is very sensitive to weight initialization. In [9], popular RVNN initialization methods are converted into corresponding complex initialization methods, while [10] takes a whole new approach by implementing a global search algorithm to find a promising region in parameter space before commencing gradient descent.

Currently, papers on CVNNs are biased towards extracting complex features from real-valued datasets, rather than using datasets that are complex-valued to begin with. This can be a valid approach. In [11], for instance, they argue for the benefit of using multiplication in Fourier-space, instead of convolutions in image-space, as a way of cutting down computation time and extracting complex-valued features otherwise hidden in natural images. However, the bias towards real-valued datasets can also seem a bit contrived at times, where the motivation begins and ends with training a CVNN as opposed to an RVNN. While such research is valuable as stepping stones, or proofs of concept for things related to CVNNs, it might be the case that the benefits of using CVNNs become clearer when the underlying dataset is inherently complex-valued. Since this is the case for MRI data, this thesis will seek to answer that question.

The model implemented in this thesis will be the Recurrent Inference Machine (RIM), as proposed in [12]. The RIM is designed to be a general inverse problem solver. It will be implemented both as a RVNN and as a CVNN, after which a comparison is given. The RIM has been successfully implemented for tasks such as image denoising, super-resolution and a range of other corruption processes. It has yet to be applied to MR-data, or other complex-valued data, nor has it been implemented using Wirtinger-calculus before.

1.2 Overview

This chapter has given a short introduction to the various topics contained in this thesis. Each of these topics will be expanded upon in the chapters that follow.


Chapter 2 will delve a bit further into the underlying physics of MRI, ending with a description of compressed sensing, which is useful when motivating certain aspects of the deep learning approach to the same problem. Much of the information in this chapter should be attributed to [13] and [14], which go to great lengths in describing not only compressed sensing, but also MR-imaging in general. Chapter 3 will describe exactly what Wirtinger-calculus is, and how it should be implemented in a CVNN in order for backpropagation to work. This chapter contains a lot of information from [15], which gives an insightful introduction to Wirtinger-calculus. Chapter 4 will describe all the components used in the RIM implementation of this thesis, ending with a proper description of the RIM itself, which can be found in greater detail in [12]. In Chapter 5, implementation details will be given, along with a description of all experiments. Results of these experiments are also shown, alongside a discussion and ideas for future work. Finally, the thesis is concluded in Chapter 6.

1.3 Notation

i will be used for the imaginary unit, i = √−1. Let z ∈ C be a complex variable z = x + iy, for x, y ∈ R. Then z̄ will be used to denote the complex conjugate, z̄ = x − iy. When z is a vector it will be written in bold, z ∈ C^n, with real components x, y ∈ R^n; z = x + iy. Its transposed complex conjugate will be denoted z^H := z̄^T.

Let f be a function taking z as an argument. This thesis will use two ways of denoting the derivative of f with respect to z and its complex conjugate z̄. First, the commonly used subscript notation, with f_z := ∂f/∂z and f_z̄ := ∂f/∂z̄. Second, when z is clear from the context, we shall also use ∂f := ∂f/∂z and ∂^c f := ∂f/∂z̄ to mean the derivative of f with respect to its argument and its complex conjugate argument, respectively. Similarly, when either one or both of z and f are vectors, we will denote the Jacobian of f by J_f := ∂f/∂z, and the Jacobian with respect to the complex conjugate as J^c_f := ∂f/∂z̄.

Two vectors x and y are combined into one with the notation col(x, y) := (x^T, y^T)^T, i.e. y stacked below x.


Chapter 2

Magnetic Resonance Imaging

2.1 MRI Physics

MRI scanners use superconductivity to generate a static magnetic field several Tesla strong. This field causes the protons inside the scanner to align their net magnetic moment, or spin, in the direction of the field. Next, the machine emits photon pulses at radio-wave resonance frequencies, disturbing the protons' magnetic spin by inducing precession with respect to the strong background field. Photons re-emitted by the precessing protons are then picked up by a receiving coil sensor, measuring an electric signal oscillating at the same frequency as the net sum of proton precession frequencies in the scanner. This signal is represented by a complex-valued function. Let ϱ(k) be the signal picked up by the scanner as a function of the net spin precession frequency k ∈ K. The letter K stems from the wavenumber ubiquitous to all wave-related physics, and K-space has become the proper terminology in MRI for the space of net precession frequencies. As Fourier analysis teaches us, for every function of time or frequency, there is a corresponding function of frequency or time, where the two functions uniquely determine each other through the Fourier transform. One crucial difference from many textbook examples, though, is that we view k not as a temporal frequency, but as a spatial frequency, seeing how k is the net precession frequency of protons summed up over a region of space. Let this region be known as the image space from now on, taking coordinates r ∈ I. The two coordinates r and k are said to form a Fourier transform pair, through the spatial Fourier transformations

$$\varrho(\mathbf{k}) = \int_I \rho(\mathbf{r})\, e^{2\pi i\, \mathbf{k} \cdot \mathbf{r}}\, d\mathbf{r}, \qquad \rho(\mathbf{r}) = \int_K \varrho(\mathbf{k})\, e^{-2\pi i\, \mathbf{k} \cdot \mathbf{r}}\, d\mathbf{k}, \tag{2.1}$$

where ρ(r) is thus the inverse Fourier transform of ϱ(k). In fact, ρ(r) is proportional to the spin density at r, effectively giving us a measure of mass as a function of image space coordinates.

Possessing equations that relate proton mass at a given location with the net frequency of precessing protons is not enough, though. As is evident in (2.1), the Fourier transform maps each image space coordinate to each K-space coordinate and vice versa, so in order to get the mass of a single point in image space, we would presumably need to know the signal for all frequencies in K-space. In other words, there also needs to be a way of varying the net precession frequency of the protons in the scanner. This is accomplished by applying a magnetic gradient field g that causes perturbations in the much stronger magnetic background field. The gradient causes precession frequencies to vary over image space, thus changing the net frequency over time, allowing us to sample values of ϱ for various values of k. It was noticed independently by Ljunggren [16] and Twieg [17] that K-space can be traversed in so-called k-trajectories by exploiting the following identity

$$\mathbf{k}(t) \propto \int_0^t \mathbf{g}(\tau)\, d\tau. \tag{2.2}$$

Using the gradient g, the values of k being sampled can be viewed as a curve through K-space, parametrized by time t. By altering the gradient over time, one alters the path taken through K-space. Notice how this implies that the acquisition of MR-data is time-constrained, a fact of great importance to this thesis, to be revisited later. Several gradient schemes have been developed in order to traverse desired k-trajectories, creating what is essentially a set of K-space sampling patterns.
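As a small illustration of (2.2), not taken from the thesis, a discretized gradient waveform can be turned into a k-trajectory by cumulative summation; the waveform and timing below are arbitrary.

```python
import numpy as np

dt = 4e-6                              # time between gradient samples, in seconds (illustrative)
t = np.arange(0.0, 2e-3, dt)
g = np.sin(2 * np.pi * 1e3 * t)        # an arbitrary oscillating gradient waveform for one axis

# k(t) is proportional to the running integral of g, here approximated by a cumulative sum
k_traj = np.cumsum(g) * dt
```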

The data acquisition process for MRI scanning traditionally consists of sampling the function ϱ at discrete and equidistant points in K-space. When sampling, low-frequency values of ϱ carry the highest signal magnitude, encoding the general shape and structure of the density function ρ, while higher frequencies carry lower magnitudes that account for the spatial details in ρ. When increasing the frequency at which the signal is measured, it eventually becomes impossible to distinguish the true signal from background noise. At this point, no additional image information is lost by discontinuing the measurements. What remains is a finite set of samples at equidistant points in K-space, which correspond to the Fourier series coefficients of the Discrete-Time Fourier Transform (DTFT) of the desired image signal ρ. Although it is possible to reconstruct the continuous density function ρ, image pixels are discrete, so it suffices to take the inverse Discrete Fourier Transform (DFT) of the samples to reconstruct the pixel intensity values.

Figure 2.1: The upper image shows the signal measured in K-space. The middle image shows the magnitude of the signal in image space; the K-space samples reveal an image of the frontal cortex through the inverse Fourier transform. The lower image shows the phase of the signal in image space as an alternative representation.

The underlying theory works the same for any d-dimensional space, but this thesis will focus on the case where d = 2. Specifically, it deals with MR-images of the brain viewed in 2D-slices. Let M and N be the number of samples measured in each dimension, such that we have a set of data points {ϱ_kl}, k = 0, ..., M−1, l = 0, ..., N−1. The corresponding pixel values {ρ_mn} in image space are then computed with the inverse DFT,

$$\rho_{mn} = \frac{1}{NM} \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} \varrho_{kl}\, e^{2\pi i \left(\frac{mk}{M} + \frac{nl}{N}\right)}. \tag{2.3}$$

The technique that allows for retrieving the pixel values described above is known as sampling the DTFT. Similar to the continuous case in (2.1), one can transform back to K-space using the DFT:

$$\varrho_{kl} = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} \rho_{mn}\, e^{-2\pi i \left(\frac{mk}{M} + \frac{nl}{N}\right)}. \tag{2.4}$$

The result of using (2.3) and (2.4) to transform between K-space and image-space can be seen in figure 2.1.
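As an aside not found in the thesis, the transforms in (2.3) and (2.4) correspond to the inverse and forward 2D fast Fourier transform routines in numpy, whose normalization conventions happen to match the equations as written; a minimal sketch:

```python
import numpy as np

# A stand-in for fully sampled, complex-valued K-space data of size M x N
M, N = 256, 256
rng = np.random.default_rng(0)
kspace = rng.normal(size=(M, N)) + 1j * rng.normal(size=(M, N))

image = np.fft.ifft2(kspace)        # inverse DFT (2.3): K-space -> image space
kspace_back = np.fft.fft2(image)    # forward DFT (2.4): image space -> K-space

assert np.allclose(kspace, kspace_back)

magnitude = np.abs(image)           # the usual pixel intensities
phase = np.angle(image)             # the alternative phase representation
```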

From now on, the abbreviation FFT will be used instead of DFT, where FFT stands for Fast Fourier Transform. Reducing the cost of computation from O(N²M²) to O(MN log(MN)), the FFT is always used when computing the DFT, leading it to become a preferred synonym. Although in general both ρ and ϱ are complex-valued functions, it is possible for at most one of them to be real-valued. In this case the complex-valued counterpart becomes a Hermitian function. For instance, if ρ is a real-valued function, then ϱ will have the property ϱ(k) = ϱ̄(−k). Some MRI acquisition methods fine-tune their settings to ensure such a symmetry, thus reducing the number of data points necessary to scan by almost a half. However, since this is not the case in general, nor for the dataset used in this thesis, we will consider both functions to be complex-valued from here on out.

As seen in figure 2.1, the complex-valued nature of ρ yields more than a single way to interpret pixel values. Most common is to use the magnitude |ρ|, but in some cases, when the point of focus is the susceptibility of tissue to the magnetic field it is placed in rather than its mass density, intensity values are assigned by the principal argument Arg(ρ) instead.

Another concept in MRI data acquisition relevant to this thesis is the disparity in precession decay rates of different body tissues. In bone, for instance, the excited protons realign their spin in the direction of the background field in a matter of just a couple of milliseconds after being exposed to the radio frequency pulse. In contrast, tissues that contain more water molecules have longer precession decay rates, reaching several hundred milliseconds. This is the reason no signal stemming from the skull is picked up when using MRI to visualize brain structure, resulting in the black enclosure seen in figure 2.1: samples are taken long after the bone molecules' proton precession has decayed. Since different tissues have different decay rates, it is possible to alter the contrasts between them in image space by choosing how long to wait after a radio frequency pulse has been emitted before sampling the signal. This waiting time will be referred to as the echo-time of an image. By setting the right repetition-time, the time between sampling consecutive radio frequency pulse echoes, MR-images are fine-tuned to highlight the tissues deemed relevant to the problem.

2.2 MRI: Sampling and Reconstruction

2.2.1 Sampling K-space

When sampling the DTFT, it is important to abide by the Nyquist-Shannon sampling theorem [18] in order to guarantee a successful image construction free from aliasing artifacts due to under-sampling. A prerequisite to the theorem is that the function we wish to recreate, in our case the image intensity ρ, has a Fourier transform which is zero everywhere except within a finite region. As previously explained, and as is evident in Figure 2.1, this is the case for MRI raw-data; values in K-space are sampled up to frequencies where the signal is lost in background noise. Given such a bandwidth B, the Nyquist-Shannon sampling theorem states that the function must be sampled at intervals of maximum length 1/2B for no aliasing artifacts to appear. In the context of MRI, B is determined by the distance from the origin we traverse in K-space, and since B is a spatial frequency, the theorem implies that our image resolution will be limited to 1/2B. Furthermore, if N equidistant points between 0 and B along a given dimension in K-space were to be sampled, the image size, or field of view (FOV) as it is known in MRI, along that dimension will be given by N/2B. In other words, for a given bandwidth, it is the K-space sampling density that determines the size of the region covered inside the scanner. If the object being measured is too big to fit in the FOV, or conversely, if the sampling rate is too low to cover the object, then the periodic properties of the Fourier transform cause the object's emitted signal to wrap around the edges and appear as aliasing artifacts in the image.

The Nyquist-Shannon sampling theorem provides the sufficient conditions to construct an MR-image free from aliasing artifacts. As such, any equidistant grid of measurements where the distance between samples is no larger than 1/2B constitutes a fully sampled MR-image.

As implied by (2.2), traversing K-space is a process of shifting gradient fields over time, and thus cannot be done instantaneously. The rate of change of the gradients is known as the slew rate. Limitations in the maximum gradient magnitude and slew rate are set by equipment, and further restricted by a physiological phenomenon: should the gradient change too rapidly, or be set too high, subjects will experience discomfort akin to an electric jolt, known as peripheral nerve stimulation. Another bottleneck to the duration of MRI scanning is caused by the echo-times described at the end of section 2.1. Differing decay-rates between various types of tissues are a physical fact, which means having to wait the desired echo-time before sampling. Because of this, the only way of lowering scanning times is by breaching the Nyquist-Shannon sampling theorem, which means solving the inverse problem mentioned in subsection 1.1.1.

2.2.2 The Inverse Problem of Accelerating MRI Reconstruction

Successful image reconstruction comes down to solving what is known in science as an inverse problem. Let y ∈ C^n be the under-sampled signal measured by the machine, and let x ∈ C^m be the true image signal, i.e. what we would obtain by applying the inverse Fourier transform to the fully sampled data. We then have the model y = a(x) + n, where a is the function taking the true signal x and corrupting it in order to produce the measurements in y, and n is the additional error stemming from imprecise measurements, usually assumed to be normally distributed. Using a and y to derive x is called solving an inverse problem. Generally speaking, inverse problems can be put into one of two categories: the linear and the non-linear case. Luckily, the corruption process associated with sparse sampling in MRI is linear, enabling us to write a as the matrix A ∈ C^{n×m}. Having n ≪ m, the goal is then to solve the under-determined linear equation

$$\mathbf{y} = A\mathbf{x} + \mathbf{n}. \tag{2.5}$$

A is not an unknown variable in inverse problems. In accelerated MRI reconstruction it is given by A := F_u, the Fourier transform followed by the sub-sampling pattern u, which replaces unsampled elements with zero. However, even though the forward procedure is known, finding x is still no easy task, since A is rank deficient, hence the equations are under-determined and there are no unique solutions for x. This implies that an optimization procedure is required. Compressed sensing has been the favored technique since entering the stage around a decade ago, but more recently deep learning models are also being developed for reconstruction purposes [2] [3] [4].
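To make the forward model in (2.5) concrete, here is a small sketch of a corruption operator of the form A := F_u, built from an FFT and a binary sub-sampling mask. The mask pattern, image size, and noise level are made up for illustration and not taken from the thesis.

```python
import numpy as np

def forward_operator(x, mask):
    """A := F_u -- the 2D FFT followed by a binary sub-sampling mask."""
    return np.fft.fft2(x) * mask       # unsampled K-space entries become zero

M, N = 256, 256
mask = np.zeros((M, N))
mask[::4, :] = 1.0                     # keep every 4th line: roughly 4x acceleration

rng = np.random.default_rng(0)
x_true = rng.normal(size=(M, N)) + 1j * rng.normal(size=(M, N))   # stand-in image
noise = 0.01 * (rng.normal(size=(M, N)) + 1j * rng.normal(size=(M, N)))
y = forward_operator(x_true, mask) + noise                        # measurements as in (2.5)

# The naive, zero-filled reconstruction exhibits the aliasing shown in figure 1.1
x_zero_filled = np.fft.ifft2(y)
```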

2.2.3 Compressed Sensing

Essentially, breaching the Nyquist-Shannon sampling theorem to reconstruct an MR-signal from a sparsely sampled set of frequencies in K-space is made possible by non-uniform sampling patterns and the exploitation of certain constraints on the signal of interest. For instance, the reconstruction technique in compressed sensing (CS) exploits the inherent compressibility of the signal it seeks to reconstruct. A prerequisite to using CS is therefore that the signal x to be reconstructed has a sparse representation u in some alternate domain, with the injective sparsifying transform Ψ known beforehand, such that u = Ψx. Natural images and videos have sparse representations that are used for compression in filetypes such as JPEG and MPEG, hence it is to be expected that the images acquired in MRI are also compressible. The sparse transform Ψ is chosen depending on the application. For example, [13] explains that angiograms, or MR-images of blood vessels, use a sparsifying technique called finite differences, while brain scans rely on wavelets, and dynamic MR-imaging (spanning time) of the heart uses sparse temporal Fourier transforms.

Figure 2.2: The images show the difference between coherent (left) and incoherent (right) aliasing artifacts, while illustrating the difference between the sub-sampling schemes that produce them.

The second prerequisite to CS is that, when applying the inverse FFT to the sub-sampled set of frequencies, the image produced should have aliasing artifacts that are incoherent. See [13] and [14] for a proper metric of incoherence. Figure 2.2 illustrates the concept perfectly, but to explain it in words, the artifacts should appear noise-like, without any distinguishable patterns normally associated with aliasing. In fact, the association of aliasing artifacts with image components overlapping each other in a repetitive fashion is due to the assumption of equidistant samples in the Nyquist-Shannon sampling theorem, but this is actually not the case when grid-like sub-sampling schemes are avoided. Sampling from a different distribution will instead create aliasing artifacts with less obvious structure, making them appear noise-like, or, indeed, incoherent.

Usually, the sampling distribution will have a higher density towards the low-frequency samples, since there are more signals with high magnitude in the center of K-space. Exactly how many samples are needed to reconstruct a perfect signal is covered in [19] and [21], although according to [14] the theory differs from practice, and empirically speaking the number of samples should be two to five times the dimensionality of u. In any case, the degree to which the fully sampled and properly reconstructed MR-images can be compressed sets a limit on how much the scanning time can be reduced with CS.

The third and final criterion of CS is, to cite from [13], that the image should be reconstructed by a nonlinear method that enforces both sparsity of the image representation and consistency of the reconstruction with the acquired samples. The nonlinearity is necessary in order to avoid recreating the artifacts introduced by naive, linear reconstruction through the inverse FFT on sub-sampled data. The notion of nonlinearities being more expressive, thus capable of compensating for errors introduced by linear transformations, is well-known, but how sparsity is enforced is less obvious for the uninitiated. Crucial to CS is using the ℓ1-norm in addition to the more standard ℓ2-norm when defining the objective to optimize. While minimizing with respect to the ℓ2-norm pushes each element down evenly, the ℓ1-norm tends to decrease by driving most elements towards zero, while allowing a few sparse elements to remain or increase.

This deals with sparsity; now finally, how about consistency between the reconstructed image and the samples in K-space? This is achieved by adding an error constraint with the ℓ2-norm. Let σ² be the expected noise variance due to inherent measurement errors, and let F_u be the Fourier transform followed by the application of a sub-sampling mask u. The objective in CS is to find

$$\underset{\mathbf{x}}{\operatorname{argmin}} \left\{ \|\Psi \mathbf{x}\|_1 \;;\; \|\mathcal{F}_u \mathbf{x} - \mathbf{y}\|_2^2 < \sigma^2 \right\}. \tag{2.6}$$

This reduces the reconstruction task to a convex optimization problem, which is solved in iterative steps, switching between image compression, through the ℓ1-norm, and data consistency, through the ℓ2-norm, until convergence.
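A minimal sketch of such an alternating scheme is given below, in the spirit of (2.6) but not the solver used in [13]: an ISTA-style iteration that alternates a data-consistency gradient step with soft-thresholding, taking Ψ to be the identity for simplicity. The step size, threshold, and iteration count are arbitrary.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the l1-norm for complex-valued entries."""
    mag = np.abs(z)
    return np.where(mag > t, (1.0 - t / np.maximum(mag, 1e-12)) * z, 0.0)

def ista_reconstruct(y, mask, lam=0.01, n_iter=100):
    """Alternate between a data-consistency step and a sparsity (shrinkage) step."""
    x = np.fft.ifft2(y)                            # zero-filled starting point
    for _ in range(n_iter):
        residual = np.fft.fft2(x) * mask - y       # F_u x - y
        x = x - np.fft.ifft2(residual * mask)      # gradient step towards data consistency
        x = soft_threshold(x, lam)                 # enforce sparsity of the image
    return x
```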


Chapter 3

CR-calculus

3.1 The Motivation for CR-Calculus

CR-calculus, also known as Wirtinger-calculus, was properly formalized in [23] from 1926, by the Austrian mathematician Wilhelm Wirtinger, but did not receive much attention outside the German-speaking engineering milieu until much later. Even today, CR-calculus is rarely taught in calculus courses at universities, perhaps because optimizing functionals of complex variables is not a very common problem. This chapter will explain the motivation behind CR-calculus, how it is derived, and what the resulting gradients look like for CVNN backpropagation.

Let f : C → C be a function of the complex argument z. The problem with differentiating f with respect to z as if z were a real number arises from the fact that complex differentiation is a highly constrained operation when compared to real differentiation. While the complex plane C can of course be viewed as a vector space R² over the field of real numbers, this representation does not automatically carry over the algebraic structure of C when viewed as a field itself. Complex differentiation is different from differentiation of vectors in R². Let f : R² → R² be a vector function taking the real vector argument x as input. For h ∈ C and h ∈ R², the definitions of differentiation in the two spaces are as follows:

$$\frac{\partial f}{\partial z} = \lim_{h \to 0} \frac{f(z+h) - f(z)}{h}, \qquad \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \lim_{\mathbf{h} \to 0} \frac{\mathbf{f}(\mathbf{x}+\mathbf{h}) - \mathbf{f}(\mathbf{x})}{\|\mathbf{h}\|}. \tag{3.1}$$

Since the vector space R² has no standard operations corresponding to multiplication and division, we simply divide by the norm of h, whereas for the complex plane C we do have an operation for division, so we use that and divide by the scalar itself. The derivatives are defined by these limits, and the limits exist as long as varying the path h or h takes as they approach the origin does not result in different or non-definite values. Only being concerned with the norm of a vector h as it approaches zero puts fewer constraints on the path it takes towards the origin. For the complex scalar case, however, division is done by multiplication of the numerator and denominator by the complex conjugate h̄ of the divisor. This adds a lot of complex interactions in the numerator, which, depending on the function f, can more easily change the value of the limit as the origin is approached from different angles. This is why, for functions of complex variables, the derivative in many cases does not exist.

Given some arbitrary complex function, it is hard to tell whether a derivative exists or not from the definition in (3.1) alone. Luckily, there is a pair of partial differential equations that, if satisfied, guarantee the limit's existence. Given f(z) as before, where z = x + iy, the real and imaginary components of f can be written as two separate real-valued functions, each taking the components of z as arguments; f(z) = u(x, y) + iv(x, y). In order for the limit in (3.1) to exist, the following two relations must be satisfied:

$$u_x = v_y, \qquad u_y = -v_x. \tag{3.2}$$

These are known in complex analysis as the Cauchy-Riemann equations [24]. Functions that abide by the Cauchy-Riemann equations everywhere in their domain are called holomorphic functions, while functions that are holomorphic on all of C are called entire functions. Holomorphic functions are the complex equivalent of analytic functions, meaning functions that are smooth everywhere and can be expressed by series expansions.


This differs from real functions, which can be differentiable once or more, but not necessarily infinitely so, and thus need not be analytic functions with Taylor series expansions. In complex analysis, analyticity and differentiability are mutually inclusive properties.

Now consider that for any given problem of optimization, we are always dealing with a real-valued functional. After all, the complex plane is not an ordered set, so the notion of optimization does not exist for complex-valued functionals. Therefore, the loss function to be optimized would be of the form f (z) = u (x, y), with v (x, y) = 0. (3.2) implies that such a function can only be complex differentiable as long as u (x, y) = c, with c being a constant.

3.2 The Construction of CR-Calculus

Due to the loss function being non-differentiable, there is no way of using regular calculus for gradient descent on functions of complex variables. It turns out, however, that while differentiation of a loss function with respect to its complex variable may be impossible, one can still differentiate it with respect to that complex variable's real-valued components x and y. The only change necessary to achieve this is to treat z and its complex conjugate z̄ as independent variables under partial differentiation. Even though knowing the value of one determines the other, establishing a formalism where this fact is ignored leads to a calculus that allows even non-holomorphic functions to be differentiated and optimized with respect to the real-valued components of their complex arguments.

In [15], a good explanation of why this works is given. To summarize, let z ∈ C^n be a complex vector variable with real components υ := col(x, y) ∈ R^{2n}. CR-calculus begins with defining an alternative representation of z, with the new coordinate ζ := col(z, z̄) ∈ Z ⊂ C^{2n}, which is said to be augmented by its complex conjugate. Z is then a proper subset of C^{2n}, but it is not a proper vector subspace, because closure under multiplication by its scalar field C is breached. If Z is viewed as a subspace of R^{4n} instead of C^{2n}, however, then scalar multiplication is closed, and it remains a proper real vector subspace. Combined with the fact that z and z̄ are of course fully and uniquely determined by υ, which in turn is fully and uniquely determined by ζ, this indicates the existence of an isomorphism R^{2n} ≅ C^n, provided that multiplication is carried out in C^n before changing coordinates to R^{2n} through Z.

The change of coordinates is given by the matrix

$$\Pi := \begin{pmatrix} I & iI \\ I & -iI \end{pmatrix}, \tag{3.3}$$

with inverse

$$\Pi^{-1} = \frac{1}{2}\begin{pmatrix} I & I \\ -iI & iI \end{pmatrix} = \frac{1}{2}\Pi^H. \tag{3.4}$$

Π is sometimes referred to as the augmentation matrix, as it takes a complex variable's real components to create the augmented representation ζ = Πυ. Looking back at equation (3.1), such a linear bijection allows one to use the Euclidean vector space definition of differentiation, instead of having to rely on the complex derivative. During inference, mathematical operations in a CVNN are thus carried out as normal in C^n, while differentiation during backpropagation is done in R^{2n}. With this technique, even non-holomorphic functions can be differentiated, as long as the real components u and v are differentiable as vector functions in R^n. This new paradigm results in an additional set of properties when differentiation is carried out through the lens of Z. The following section will illuminate these further.

3.3 Basic Properties of CR-Derivatives

Through the change of coordinates in (3.3), we now have a Euclidean vector space representation for complex variables in which differentiation of non-holomorphic functions is possible using real derivatives. But how exactly can the complex derivatives f_z and f_z̄ be expressed in terms of f's real-valued functions u and v? To see this, look at the inverse mapping Π⁻¹ and notice that x and y in terms of z and z̄ are given by x = (z + z̄)/2 and y = i(z̄ − z)/2. Now simply calculate the derivatives using this relation and the chain rule, while remembering to treat z and z̄ as two independent variables, to get

$$f_z = f_x x_z + f_y y_z = \frac{1}{2}\left(f_x - i f_y\right), \qquad f_{\bar{z}} = f_x x_{\bar{z}} + f_y y_{\bar{z}} = \frac{1}{2}\left(f_x + i f_y\right). \tag{3.5}$$

Viewing f in terms of u and v again, we get the relations

$$f_z = \frac{1}{2}\left(u_x + v_y + i\left(v_x - u_y\right)\right), \qquad f_{\bar{z}} = \frac{1}{2}\left(u_x - v_y + i\left(u_y + v_x\right)\right),$$

which, for the case when f is a holomorphic function, result in f_z̄ = 0, in accordance with the Cauchy-Riemann equations (3.2). This means that from the Wirtinger-calculus perspective, a function is holomorphic whenever it is independent of the complex conjugate variable, f(z, z̄) = f(z).
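As a quick numerical sanity check, not part of the thesis, the derivatives in (3.5) can be approximated with finite differences. For the non-holomorphic function f(z) = |z|² the known Wirtinger derivatives are f_z = z̄ and f_z̄ = z:

```python
import numpy as np

def wirtinger_derivatives(f, z, eps=1e-6):
    """Approximate f_z and f_zbar via (3.5) using central differences in x and y."""
    fx = (f(z + eps) - f(z - eps)) / (2 * eps)             # df/dx
    fy = (f(z + 1j * eps) - f(z - 1j * eps)) / (2 * eps)   # df/dy
    return 0.5 * (fx - 1j * fy), 0.5 * (fx + 1j * fy)

z = 1.0 + 2.0j
fz, fzbar = wirtinger_derivatives(lambda w: abs(w) ** 2, z)
assert np.allclose(fz, np.conj(z), atol=1e-5)   # f_z = conj(z)
assert np.allclose(fzbar, z, atol=1e-5)         # f_zbar = z
```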

Using the derivatives in (3.5), the following complex differential identities can be found:

$$\frac{\partial \bar{f}}{\partial z} = \overline{\left(\frac{\partial f}{\partial \bar{z}}\right)}, \qquad \frac{\partial \bar{f}}{\partial \bar{z}} = \overline{\left(\frac{\partial f}{\partial z}\right)}. \tag{3.6}$$

For the case when f is real-valued, f = f̄, we get the additional identity

$$\frac{\partial f}{\partial \bar{z}} = \overline{\left(\frac{\partial f}{\partial z}\right)}. \tag{3.7}$$

Imposing independence between the complex variable and its conjugate also leads to a new chain rule. Let F ∘ G(ζ) be a composition of two complex vector functions G : ζ ↦ G(ζ) and F : (G, Ḡ) ↦ F(G, Ḡ). Using the Jacobian notation explained in section 1.3, and applying the chain rule along with the identities in (3.6), leads to the CR-chain rule given by

$$J_{F \circ G} = J_F J_G + J^c_F\, \overline{J^c_G}, \qquad J^c_{F \circ G} = J_F J^c_G + J^c_F\, \overline{J_G}. \tag{3.8}$$

3.4 Wirtinger-Backpropagation

Wirtinger-derivatives have been described in the previous section; now it is time to find out how they can be used for gradient descent and backpropagation. For this, it is necessary to know the gradient of a real-valued functional over complex vector spaces. Let L : C^{2n} ≅ R^{2n} → R be such a functional. Using the triangle inequality and the Cauchy-Schwarz inequality, it is shown in [15] that the direction of steepest descent is given by −∇_z L, with

$$\nabla_{\mathbf{z}} L = \Omega_{\mathbf{z}}^{-1} J_{\mathbf{z}}^H, \tag{3.9}$$

where Ω_z is the metric tensor corresponding to the tangent space T_z L. Ω_z can be very complicated to find even in the case of RVNNs, but it has been shown for multi-layer perceptrons in [25] that using gradients with proper metric tensors can lead to great improvements in learning. To what extent a CVNN would benefit from this is not known, but the assumption of this thesis is that the metric tensor can always be approximated by the identity matrix, as is done in RVNNs. Therefore, the calculations that follow are only concerned with finding the Jacobian J_z and its conjugate transpose.

As noted, backpropagation moves gradients from the output back towards the input. This is possible because the chain rule is associative. Explicitly, if L is the loss function following two network layers F and G in an RVNN, then the derivative of their composition L ∘ F ∘ G can be expressed from front to back or, equivalently, from back to front: J_{L∘F∘G} = J_L J_{F∘G} = J_{L∘F} J_G. Is this also the case for Wirtinger-gradients? Combining (3.8), (3.6) and (3.7) yields

$$\begin{aligned}
J_{L \circ F \circ G} &= J_L J_{F \circ G} + \overline{J_L}\, \overline{J^c_{F \circ G}} \\
&= J_L \left( J_F J_G + J^c_F\, \overline{J^c_G} \right) + \overline{J_L} \left( \overline{J_F}\, \overline{J^c_G} + \overline{J^c_F}\, J_G \right) \\
&= \left( J_L J_F + \overline{J_L}\, \overline{J^c_F} \right) J_G + \left( J_L J^c_F + \overline{J_L}\, \overline{J_F} \right) \overline{J^c_G} \\
&= J_{L \circ F} J_G + \overline{J_{L \circ F}}\, \overline{J^c_G},
\end{aligned} \tag{3.10}$$

thus showing that associativity is preserved.

The derivation in (3.10) offers more than what first meets the eye. Imagine if L were complex-valued: (3.7) would no longer be applicable, and (3.8) hints at a very complicated backpropagation scheme. While the chain rule remains associative, each layer would require input-derivatives with respect to both complex and complex conjugate variables, resulting in higher computation and memory requirements in a backpropagation algorithm which is very difficult, if not impossible, to implement in most existing deep learning frameworks. When computing gradients between layers in a network, it is easy to assume that the composition between two connected layers consists of two complex-valued functions, resulting in such difficulties. In [6], this assumption is made, and the proposed solution is to get rid of the complex conjugate dependencies by having every other network layer be a holomorphic function. While this often is the case in deep learning, since networks alternate between linear (holomorphic) and nonlinear (usually non-holomorphic) layers, we will see that it is not true for the RIM implementation in this thesis. Luckily, this is an unnecessary complication: for any given layer in the network, its input-derivative can be interpreted as the derivative of the real-valued loss function, by viewing the remaining network layers as an extension of the loss function through composition. This enables the complex conjugate derivative of the subsequent layers to be replaced by the complex conjugate of their regular derivative, through the use of (3.7).

As derivatives are propagated back towards the top of the network, the gradients can finally be calculated by taking the conjugate transpose when the network parameters are reached in the chain. It is very important not to naively implement the conjugate transpose described in (3.9) for each layer of the network; this should only be done for the functions that compute the gradients with respect to the parameters. Explicitly, let D be the derivative to be passed on through a layer F, and let G be the gradient used for updating the parameters in F. Let L be the functional corresponding to the composition of layers that follows directly after F, ending in the loss function, such that J_L is the layer's input gradient during backpropagation. D and G are then given by

$$D^T = J_L J_F + \overline{J_L}\, \overline{J^c_F}, \tag{3.11}$$

$$G^T = \overline{J_L}\, \overline{J_F} + J_L J^c_F, \tag{3.12}$$

where the Jacobians of F in (3.11) are taken with respect to the layer's input, and those in (3.12) with respect to its parameters. For the case when F is real-valued, J_L will also be real-valued, and D is simplified through (3.7), becoming

$$D^T = 2 J_L J_F. \tag{3.13}$$

3.4.1 Widely Linear Transformations

Let W := W_x + iW_y ∈ C^{m×n} and b := b_x + ib_y ∈ C^m be a complex-valued matrix and vector, respectively, and let T be the transformation

$$T : \mathbb{C}^n \to \mathbb{C}^m, \quad \mathbf{z} \mapsto W\mathbf{z} + \mathbf{b} = W_x\mathbf{x} - W_y\mathbf{y} + \mathbf{b}_x + i\left(W_x\mathbf{y} + W_y\mathbf{x} + \mathbf{b}_y\right). \tag{3.14}$$

This transformation is not strictly linear over the scalar field C, but through (3.3) it can be viewed as a transformation from R^{2n} to R^{2m} which is a linear transformation of z's real components. Transformations that are non-linear in complex vector spaces, but linear in the real component vector spaces through (3.4), are called widely linear transformations [30].

The concept of widely linear transformations is important because it enables notions associated with linearity to remain valid when Wirtinger-calculus is applied. Any layer which is linear in an RVNN will therefore be regarded as a (widely) linear transformation in the CVNN implementation as well.

Since T is holomorphic with respect to both its argument and its parameters, all complex conjugate Jacobians are zero, and its regular Jacobians are analogous to the case of real calculus. Equation (3.11) then becomes

$$D_{\mathbf{z}}^T = J_L W = \Re(J_L) W_x - \Im(J_L) W_y + i\left(\Re(J_L) W_y + \Im(J_L) W_x\right), \tag{3.15}$$

while the parameters should be updated in accordance with equation (3.12):

$$G_W^T = \overline{J_L}\,\bar{\mathbf{z}} = \Re(J_L)\mathbf{x} - \Im(J_L)\mathbf{y} - i\left(\Re(J_L)\mathbf{y} + \Im(J_L)\mathbf{x}\right), \qquad G_{\mathbf{b}}^T = \overline{J_L} = \Re(J_L) - i\,\Im(J_L). \tag{3.16}$$
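The following numpy sketch, not taken from the thesis code, implements the forward pass (3.14) and the backward quantities (3.15) and (3.16) for a single widely linear layer. J_L is assumed to be passed as a 1-D array holding the derivative of the real-valued loss with respect to each output component.

```python
import numpy as np

def widely_linear_forward(W, b, z):
    """T(z) = W z + b with complex-valued W, b and z, as in (3.14)."""
    return W @ z + b

def widely_linear_backward(W, z, J_L):
    """Return the derivative to pass on (3.15) and the parameter gradients (3.16)."""
    D = J_L @ W                                   # derivative w.r.t. the layer input
    G_W = np.outer(np.conj(J_L), np.conj(z))      # gradient for the weight matrix
    G_b = np.conj(J_L)                            # gradient for the bias vector
    return D, G_W, G_b

# Small usage example with made-up sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4)) + 1j * rng.normal(size=(3, 4))
b = rng.normal(size=3) + 1j * rng.normal(size=3)
z = rng.normal(size=4) + 1j * rng.normal(size=4)
out = widely_linear_forward(W, b, z)
D, G_W, G_b = widely_linear_backward(W, z, np.ones(3) + 0j)
```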


Chapter 4

Deep Learning

By and large, the success of deep learning architectures comes down to no longer having to rely on hand-made features for data-processing, instead enabling the network itself to extract the best possible feature representations through training. This shift from hand-made features to features created by the network is what makes deep learning seem promising for accelerating MRI reconstruction. Remember that compressed sensing requires a predefined sparse domain, depending not only on the data being compressible, but also on knowing how to compress it to begin with. Of course, this is the case for MR-images, or compressed sensing would not work. But could deep learning help us create new and improved ways of compressing MR-images that in turn can be used for Ψ in the CS algorithm? Is it even necessary to make this detour? Why not let the network show us how to transform the data directly from K-space on its own?

This chapter begins by explaining the concepts behind common network architectures related to the implementation in this thesis, before the Recurrent Inference Machine itself is described in section 4.4.

4.1 Activation Functions

Parametrized layers in neural networks tend to be some kind of linear transformation, but stacking several such layers directly on top of each other amounts to the same as using a single layer. Linear transformations need to be separated by nonlinearities in order for inputs to be projected onto nonlinear manifolds, from which useful features can be extracted. With this aim, activation functions are parameter-less network layers, where nonlinear functions are applied element-wise to incoming vectors. Common examples of activation functions are tanh : R → [−1, 1] and the logistic sigmoid-function σ : R → [0, 1], t ↦ (1 + e^{−t})^{−1}. These functions are popular in deep learning due to their saturating ranges, which prevents individual neurons from becoming too dominant with respect to the output. Activation functions need not be saturating, though. Another popular nonlinearity is the Rectified Linear Unit, or ReLU : R → R⁺, t ↦ max(0, t). ReLUs are typically the preferred activation function in Convolutional Neural Networks, which will be described further in section 4.2.

4.1.1 Complex Adaptation

While linear transformations are easily transferred to CVNNs as widely linear transformations, activation functions need a whole new interpretation, leading them to become a big topic in many papers on CVNNs [26] [7] [8] [6] [10]. The problem is that tanh, σ : C → C are entire functions (holomorphic on all of C) when their domains are expanded to the complex plane. Judging from the problems related to non-holomorphic functions described in section 3.1, one would be forgiven for thinking this is good news. According to Picard's little theorem, however, an entire function which omits two or more complex values from its image must be constant [27]. Since tanh and σ are entire non-constant functions, this means they are unbounded and no longer retain the saturating properties they are associated with for real arguments. Not only that, their domains are riddled with singularities relatively close to the origin, making backpropagation unstable. There are two main methods for solving this. The first is to restrict the domain to a region without singularities. The second is to sensibly redefine the function such that it becomes non-holomorphic. Two common ways of doing this are to apply the real-valued version of the function separately on either the magnitude and phase or on the real and imaginary components. The implementation in this thesis will stick to the latter solution. The same method will also be used for the complex ReLU, which is otherwise not defined on C, since C is an unordered set.


Let a : C → C, z = x + iy ↦ a(z) := a(x) + i a(y) be the component-wise activation function, and let L : C → R be the remaining network layers leading up to and including the real-valued loss function. We wish to express the gradient passed through the activation layer in agreement with equation (3.11). Equation (3.5) is used to find the derivative and complex conjugate derivative of a,

$$\partial a = \frac{1}{2}\left(a'(x) + a'(y)\right), \qquad \partial^c a = \frac{1}{2}\left(a'(x) - a'(y)\right),$$

which are inserted into equation (3.11):

$$D_z = \partial L\, \partial a + \overline{\partial L}\, \overline{\partial^c a} = \Re(\partial L)\left(\partial a + \partial^c a\right) + i\, \Im(\partial L)\left(\partial a - \partial^c a\right) = \Re(\partial L)\, a'(x) + i\, \Im(\partial L)\, a'(y).$$
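As an illustration, not the thesis implementation, a split tanh activation and the gradient rule just derived can be written as:

```python
import numpy as np

def split_tanh(z):
    """Component-wise complex tanh: a(z) = tanh(Re z) + i tanh(Im z)."""
    return np.tanh(z.real) + 1j * np.tanh(z.imag)

def split_tanh_backward(z, dL):
    """Pass the incoming derivative dL of the real-valued loss through the activation,
    following D_z = Re(dL) a'(x) + i Im(dL) a'(y)."""
    da_x = 1.0 - np.tanh(z.real) ** 2   # a'(x)
    da_y = 1.0 - np.tanh(z.imag) ** 2   # a'(y)
    return dL.real * da_x + 1j * dL.imag * da_y
```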

4.2 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are parametrized layers that take images as input and create new images as output. They are the most fundamental building blocks when dealing with high-dimensional data such as images and videos. There are two reasons why they are so well-suited for image-related tasks. First, images tend to be big, but the number of parameters needed in a convolutional layer is invariant to the size of the input image. This means the problem is scalable in terms of memory requirements, even for very large images. The second reason is that CNNs exploit the correlation found between neighboring pixels in an image, as we will see shortly.

Parameters are stored in what are called kernels, or filters, which can be seen as image patches that are tiny relative to the input image. Each kernel creates a single output image, referred to as a feature map. Let κ be a kernel in a CNN layer, containing (2K + 1) × (2K + 1) × I pixels that are indexed by k, l, i, with k = l = 0 being the central image pixel, while i denotes the channel of the input image. Channels typically consist of the three RGB values of natural images, but can also represent other things. Let {ν^i}_{i=1}^I be a set of I input channels, each containing N × M pixels indexed by m, n, with m = n = 0 being a corner pixel. The (m, n)-th pixel in the f-th feature map ι^f, created by convolving ν with κ, is given by

$$\iota^f_{mn} = \sum_{i=1}^{I} \sum_{k=-K}^{K} \sum_{l=-K}^{K} \kappa^{if}_{kl}\, \nu^i_{m-k,\,n-l}. \tag{4.1}$$

Figure 4.1: The images show the input image with a single channel in blue, and the resulting feature map in green. The kernel slides over the image to produce one pixel at a time, here shown in a darker shade of green. The shaded region of the input image is the receptive field of the shaded feature map pixel. When using dilations, the receptive field expands for the same kernel size by skipping every other pixel in the input image, as shown in the image to the right. The dashed lines in the image to the left are pixels that lie outside the scope of the input image; they are zero-padded in order for the feature map to retain the size of the input image.


Convolutional layers contain a predetermined number of filters, with a one-to-one correspondence to the set of feature map outputs: {κ^f}_{f=1}^F → {ι^f}_{f=1}^F, where F is said to be the number of features of a CNN layer. Notice in (4.1) that ι is undefined at the edges of the input image ν; hence convolutional layers produce feature maps that decrease in size relative to the input image. When this is not the desired behaviour, it is common to pad the input image with zeros around the edges, in order for ι and ν to be equal in size. This is shown as dashed lines in figure 4.1, which illustrates equation (4.1) for an image with a single input channel, thus without the sum over I.

As illustrated, the feature maps will contain information based on a weighted average of neighboring pixels in the input image. This group of pixels in the input image is known as the receptive field of its corresponding output pixel. Variations of (4.1) exist, however, such as the use of dilations, also shown in figure 4.1. Here, every other pixel in the input image is skipped, resulting in a larger receptive field for the same kernel. When stacked in layers with nonlinearities in between, in particular Rectified Linear Units (ReLUs), CNNs have proven to be very successful in a wide range of image-related tasks and are ubiquitous in the area of deep learning known as computer vision.
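To make equation (4.1) concrete, the following numpy sketch computes a single zero-padded feature map by direct (optionally dilated) convolution. It is purely illustrative; the function name and the loop-based implementation are not taken from the thesis code.

import numpy as np

def conv2d_single_feature(nu, kappa, dilation=1):
    """nu: input of shape (I, N, M); kappa: kernel of shape (I, 2K+1, 2K+1).
    Returns one zero-padded feature map of shape (N, M), following eq. (4.1)."""
    I, N, M = nu.shape
    K = kappa.shape[-1] // 2
    pad = K * dilation
    nu_p = np.pad(nu, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((N, M))
    for m in range(N):
        for n in range(M):
            for k in range(-K, K + 1):
                for l in range(-K, K + 1):
                    out[m, n] += np.sum(
                        kappa[:, k + K, l + K]
                        * nu_p[:, m + pad - k * dilation, n + pad - l * dilation]
                    )
    return out

nu = np.random.randn(3, 8, 8)      # I = 3 channels, 8x8 image (made up)
kappa = np.random.randn(3, 3, 3)   # K = 1
print(conv2d_single_feature(nu, kappa).shape)  # (8, 8)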

4.2.1 Complex Adaptation

As seen in (4.1), convolution is a linear operator, which means it is widely linear under CR-calculus, in the case of complex-valued kernels and input images. The forward procedure thus takes the same form as in (3.14), after exchanging matrix multiplication for the convolution in (4.1), with κ = < (κ) + i= (κ) and ν = < (ν) + i= (ν) representing W and z, respectively. The same is done for (3.15) and (3.16), again exchanging matrix multiplication with standard convolutional backpropagation.
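As an aside, one way to carry out such a complex-valued convolution using only real-valued convolution routines is to combine four real convolutions of the real and imaginary components. The sketch below uses scipy and made-up array names; it is illustrative only and not the thesis implementation.

import numpy as np
from scipy.signal import convolve2d

def complex_conv2d(nu, kappa):
    """Complex convolution of a single-channel complex image nu with a complex
    kernel kappa, built from four real-valued convolutions."""
    rr = convolve2d(nu.real, kappa.real, mode="same")
    ii = convolve2d(nu.imag, kappa.imag, mode="same")
    ri = convolve2d(nu.real, kappa.imag, mode="same")
    ir = convolve2d(nu.imag, kappa.real, mode="same")
    return (rr - ii) + 1j * (ri + ir)

nu = np.random.randn(8, 8) + 1j * np.random.randn(8, 8)
kappa = np.random.randn(3, 3) + 1j * np.random.randn(3, 3)
print(complex_conv2d(nu, kappa).shape)  # (8, 8)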

4.3 The Gated Recurrent Unit

Another important concept in DL is the Recurrent Neural Network (RNN). This layer can acquire great network depth while still retaining a low number of parameters, which is achieved by re-using the same weights over a series of recurrent time-steps. Since the recurrent unit contains all the parameters, the weights must learn to differentiate between time-steps in order to deal with the current input appropriately. To achieve an awareness of which state it is currently processing, the RNN is given a hidden state variable, which it updates as the time-steps progress. Note that this is not the same as a network parameter, which is only updated during training. The hidden state variable is not learned, but rather assigned a pre-determined value when the model is initialized; it then changes from time-step to time-step as a function of the network's input.

The RNN used in this thesis is called a Gated Recurrent Unit (GRU), as detailed in [28]. GRUs maintain their hidden state variables by means of what are referred to as a reset gate $r$ and an update gate $z$. These gates can be perceived as managing the internal memory of the GRU. The update gate determines what portion of the new memory to base on the old state versus a new candidate state. Let $s_t$ denote the hidden state at time-step $t$, and denote the new candidate state by $\hat{s}_t$. The new hidden state is then given by

$$s_{t+1} = (1 - z_t) \odot s_t + z_t \odot \hat{s}_t, \tag{4.2}$$

where $\odot$ denotes element-wise multiplication. Meanwhile, the reset gate is responsible for deleting old memory deemed no longer useful when the candidate state is created:

$$\hat{s}_t = \tanh\left( W_{\hat{s}} \begin{bmatrix} x_t \\ s_t \odot r_t \end{bmatrix} + b_{\hat{s}} \right), \tag{4.3}$$

where $W_{\hat{s}}$ and $b_{\hat{s}}$ are network parameters in the form of a matrix and a vector, respectively. Similar transformations, but with a logistic sigmoid function instead of a tanh, and with separate sets of parameters, determine the gates $r$ and $z$:

$$r_t := \sigma\left( W_r \begin{bmatrix} x_t \\ s_t \end{bmatrix} + b_r \right), \qquad z_t := \sigma\left( W_z \begin{bmatrix} x_t \\ s_t \end{bmatrix} + b_z \right). \tag{4.4}$$

¹ Image is taken from Theano's Convolution arithmetic tutorial: http://deeplearning.net/software/theano_versions/dev/


Figure 4.2: Figure shows a Gated Recurrent Unit. For each time-step, it receives a new input, which is recombined with the internal state through linear transformations and gating functions. Each time-step then outputs a new state that carries on to the next iteration.


As described in section 4.1, the logistic sigmoid compresses the matrix transformations to values between 0 and 1. This value is then used to rescale the incoming signal via $\odot$-multiplication, enabling the gate to control the signal flow. The entire GRU update procedure is illustrated in figure 4.2, depicting the three equations above. The top and bottom blue rectangles are the gates in (4.4), while the blue triangle in the middle is the candidate state in (4.3). The interpolation in (4.2) is shown at the bottom right of the figure. Similar to CNNs, GRUs are also said to contain a set number of features $F$. Since the GRU takes an input $x_t$ and outputs a new hidden state $s_t$ for each time-step $t$, $F$ is analogously equal to the number of elements contained in $s_t$.

Recurrent architectures work well for datasets that come in sequences where the order of elements is important. Words in sentences, or whole sentences in paragraphs, are good examples of this, which is why RNNs are heavily relied upon in areas of deep learning such as Natural Language Processing. The relevance of RNNs for this thesis will become clear in section 4.4.
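As a minimal illustration of equations (4.2) to (4.4), one real-valued GRU time-step can be written as below. The dense matrix form and the made-up dimensions are for illustration only and are not taken from the thesis implementation.

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, s_t, params):
    """One GRU time-step following (4.2)-(4.4); params holds W_r, b_r, W_z, b_z, W_s, b_s."""
    xs = np.concatenate([x_t, s_t])
    r_t = sigmoid(params["W_r"] @ xs + params["b_r"])          # reset gate, eq. (4.4)
    z_t = sigmoid(params["W_z"] @ xs + params["b_z"])          # update gate, eq. (4.4)
    cand_in = np.concatenate([x_t, s_t * r_t])
    s_hat = np.tanh(params["W_s"] @ cand_in + params["b_s"])   # candidate state, eq. (4.3)
    return (1.0 - z_t) * s_t + z_t * s_hat                     # new state, eq. (4.2)

F, D = 4, 3                                   # number of features and input size (arbitrary)
rng = np.random.default_rng(0)
params = {k: rng.standard_normal(s) for k, s in
          [("W_r", (F, D + F)), ("b_r", (F,)),
           ("W_z", (F, D + F)), ("b_z", (F,)),
           ("W_s", (F, D + F)), ("b_s", (F,))]}
s = np.zeros(F)
for t in range(3):                            # a short sequence of random inputs
    s = gru_step(rng.standard_normal(D), s, params)
print(s.shape)  # (4,)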

4.3.1 Complex Adaptation

The linear transformations in the GRU are of course widely linear transformations, with the exact same representation as in equation (3.14). The gates are a different story. Intuitively, a gate that is tasked with managing memory should not distort the qualitative information it stores, but rather weaken the strength of the signal when deemed useful. Therefore, it is assumed best not to have the gates map to a complex number, as this would cause phase shifts in the signal under complex multiplication. The opposite could be true, however: phase shifts might actually add depth and quality to the GRU's memory capabilities, by encoding additional features in the phase dimension. Still, it seems safest not to wander too far from what is already known to work, namely gates with range [0, 1]. Finally, it is found most convenient to define the gates component-wise, as described in subsection 4.1.1, before mapping the output to its magnitude, as this definition requires few computations and is similar to the real GRU. In addition, the magnitude will be scaled by $1/\sqrt{2}$ to retain the same range. The tanh-function in (4.3) will be defined the same way, but without computing the magnitude, since the new state candidate must of course be allowed a complex value. The additional derivatives needed for this will now be presented.

Complex Magnitude

Let $\Gamma : \mathbb{C} \to [0, \infty)$ be the function mapping $z$ to its complex magnitude $|z| = \sqrt{z\bar{z}}$, and let $\partial L$ be its backpropagation input derivative. $\Gamma$'s Wirtinger-derivative is given by

$$\Gamma_z = \frac{\bar{z}}{2|z|}, \tag{4.5}$$

and since $\Gamma$ is real-valued, equation (3.13) tells us the backpropagation output derivative is given by

$$D\Gamma = \partial L\,\frac{\bar{z}}{|z|}. \tag{4.6}$$

Scaling by a real number

Scaling the complex-valued signal by the gate also requires a Wirtinger-derivative. Let $\alpha \in \mathbb{R}$ be a real-valued scalar, and define the function $f(z, \alpha) = \alpha z$. Since this function is holomorphic with respect to $z$, its complex conjugate derivative is zero, while the derivative with respect to $z$ is simply $\alpha$. The function's output derivative with respect to the scaled argument $z$ thus becomes

$$D f_z = \partial L\,\alpha, \tag{4.7}$$

in accordance with equation (3.11). Since $\bar{\alpha} = \alpha$, the derivatives with respect to the scalar $\alpha$ are both given by $f_\alpha = f_{\bar{\alpha}} = z$. Hence, with respect to $\alpha$, equation (3.11) becomes

$$D f_\alpha = \partial L\,z + \overline{\partial L\,z} = 2\,\Re(\partial L\,z) = 2\left(\Re(\partial L)\,x - \Im(\partial L)\,y\right). \tag{4.8}$$

It is this real scalar multiplication after computing the magnitude that breaks the common pattern mentioned in section 3.4, in which every other function in the computation graph is holomorphic.
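To summarize the construction, the forward pass of such a gate (component-wise sigmoid, followed by the magnitude scaled by $1/\sqrt{2}$, then used to rescale the complex signal) can be sketched as follows. The names and values are illustrative, and the backward pass, which follows (4.5) through (4.8), is omitted.

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def complex_gate(u):
    """Component-wise sigmoid followed by the magnitude, scaled by 1/sqrt(2)
    so that the gate value stays within [0, 1]."""
    g = sigmoid(u.real) + 1j * sigmoid(u.imag)
    return np.abs(g) / np.sqrt(2.0)

u = np.array([0.4 - 1.1j, 2.0 + 0.3j])        # pre-activation of the gate (made up)
signal = np.array([1.0 + 1.0j, -0.5 + 2.0j])  # complex signal to be gated
gated = complex_gate(u) * signal              # real scalar times complex value
print(complex_gate(u), gated)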

4.4 The Recurrent Inference Machine

4.4.1 The Maximum a Posteriori

The deep learning method for accelerating MRI reconstruction implemented for this thesis is the model proposed in [12] under the name Recurrent Inference Machine (RIM). The RIM is a supervised learning model designed to be a general inverse problem solver. The underlying theory formulates inverse problems in terms of probability distributions; see [22] for an in-depth description of how this is done. To motivate the RIM model, it is most important to know about the maximum a posteriori (MAP) estimator from statistics. More fundamental is the maximum likelihood (ML) estimator, where, given a set of observations $y$, its corresponding signal $x$ can be estimated by maximizing the probability distribution of $y$ conditioned on $x$: $x_{ML} := \operatorname{argmax}_x p(y|x)$. The MAP estimator takes this one step further, by using the prior distribution over $x$ as a regularizer. By Bayes' rule, this is the same as maximizing the probability distribution of $x$ conditioned on $y$: $x_{MAP} := \operatorname{argmax}_x p(x|y)$.

Using Bayes’ rule and taking the log, the MAP estimator becomes

$$x_{MAP} = \operatorname*{argmax}_x \; \log p(y|x) + \log p(x). \tag{4.9}$$

For simplicity, it is tempting to say that the RIM is designed to approximate the MAP estimator of the true signal $x$; however, this would be somewhat of a fallacy. It is more correct to say that the MAP is used as the foundation of a more complicated optimization algorithm.

4.4.2 The Recurrent Inference Procedure

The RIM attempts to infer the signal $x$ through what amounts to gradient ascent over a series of recurrent time-steps $t$. In other words, for each time-step, the RIM yields an incremental step to take in image space that should, accumulated over time, lead to a good estimate of the original signal $x$. The MAP estimator is used as a guide in this process, and so the RIM needs some representation of the gradients of the log-likelihood and log-prior in (4.9). Let these two gradients be denoted the same way as in [12], by $\nabla_{y|x} := \nabla_x \log p(y|x)$ and $\nabla_x := \nabla_x \log p(x)$, respectively. $\nabla_{y|x}$ is given as a function determined by the problem domain, meaning we must know the likelihood distribution of our signal as a prerequisite. The much more elusive prior gradient $\nabla_x$ is learned by the model itself, internalized in its parameters $\phi$. $\phi$ also takes care of the learning-rate scheme, saving us the effort of implementing hand-made optimization algorithms in the model's recurrent architecture, or constraining the model with additional hyper-parameters.

To summarize, let $\eta_t \approx x$ be the network's approximation at time-step $t$. The RIM then improves on this by producing a new estimate $\eta_{t+1}$ through the following update equations

$$\eta_{t+1} = \eta_t + h_\phi\left(\nabla_{y|\eta_t}, \eta_t, s_{t+1}\right), \qquad s_{t+1} = h^*_\phi\left(\nabla_{y|\eta_t}, \eta_t, s_t\right), \tag{4.10}$$



Figure 4.3: The diagram shows the update equation for a Recurrent Inference Machine. A neural network outputs the new increment $\Delta\eta_{t+1}$, based on the previous estimate $\eta_t$, the log-likelihood gradient with respect to that estimate, $\nabla_{y|\eta_t}$, and the hidden state $s_t$. This repeats until the desired number of time-steps has been processed.

where s is the network’s latent memory variable, or hidden state, following the RNN paradigm, and hφand h?φ

are the network update functions.

hφ produces the incremental change ∆ηt+1 needed for gradient ascent in image space, while h?φ produces the

next hidden state. These functions can be seen as the bulk of the neural network, containing all the network’s parameters, here represented by φ. Also notice that the previous estimate ηtis passed as an argument to allow the network to evaluate the log-prior gradient.
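In pseudocode-like form, the inference loop of (4.10) amounts to the following sketch, where grad_fn stands for the log-likelihood gradient of the next subsection and rim_cell stands for the learned update functions $h_\phi$ and $h^*_\phi$; both names are placeholders rather than part of the thesis code.

def rim_inference(grad_fn, rim_cell, eta0, s0, n_steps=8):
    """Recurrent inference following eq. (4.10).
    grad_fn(eta): log-likelihood gradient of eq. (4.12) at the current estimate.
    rim_cell(grad, eta, s) -> (delta_eta, s_next): the learned update functions."""
    eta, s = eta0, s0
    for _ in range(n_steps):
        grad = grad_fn(eta)
        delta_eta, s = rim_cell(grad, eta, s)
        eta = eta + delta_eta   # incremental, gradient-ascent-like step in image space
    return eta

# Tiny usage example with stand-ins: a fixed-step "network" that just follows the gradient.
dummy_cell = lambda grad, eta, s: (-0.1 * grad, s)
estimate = rim_inference(grad_fn=lambda eta: eta - 1.0, rim_cell=dummy_cell,
                         eta0=0.0, s0=None, n_steps=20)
print(estimate)  # approaches 1.0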

4.4.3 The Log-Likelihood Gradient

The idea of teaching a neural network to learn by gradient ascent (or descent) was notably described in [20]. Its name is worth a mention: Learning to learn by gradient descent by gradient descent. This paper uses the gradient of its loss function as an input to the network, resulting in a model that learns its own optimizer. This is possible because its loss function does not take targets as input, and it would not work for our task, since targets are of course unknown during inference. What we do have, however, is the measured signal $y$. Under the assumption that the signal measurement error is normally distributed with equal variance $\sigma^2$ for all frequencies in K-space, the log-likelihood (up to its sign and normalization constant) is given by

$$\log p(y|x) = \frac{1}{2\sigma^2}\left\| \mathcal{F}_u x - y \right\|_2^2. \tag{4.11}$$

This is a real-valued functional, so (3.13) can be used when calculating its gradient, after setting $J_L := 1$. Keeping in mind that the squared complex 2-norm can be written $\|\cdot\|_2^2 = (\cdot)^H(\cdot)$, and that the FFT is a unitary operator, $\mathcal{F}^{-1} = \mathcal{F}^H$ (when scaled appropriately), the gradient becomes

$$\nabla_{y|x} = \frac{1}{\sigma^2}\,\mathcal{F}_u^{-1}\left(\mathcal{F}_u x - y\right). \tag{4.12}$$

Note that $\mathcal{F}_u^{-1}\mathcal{F}_u \neq I$, due to the sub-sampling operator $u$. Strictly speaking, it is the gradient and not the derivative of (4.11) that the model should use for gradient ascent, so it could be argued that (4.12) should be based on (3.12) instead of (3.11). In this case, however, it makes no difference, as the model will learn to conjugate the gradient if necessary.

Looking back at the optimization procedure in compressed sensing, (4.11) is equivalent to the data consistency term in (2.6), where, by preventing the error in K-space from becoming greater than the expected measurement error $\sigma$, one can be certain that the reconstructed $x$ is consistent with the acquired data. In the RIM's case, however, we let the network learn how to ensure data consistency on its own, by passing it the gradient of (4.11) as input, while optimizing the error in image space instead of K-space. $\nabla_{y|x}$ should then guide it in the right direction.
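For the Fourier-domain likelihood used here, the gradient in (4.12) can be computed with a handful of FFTs, as in the following numpy sketch. Array names and sizes are made up, and the measured K-space $y$ is assumed to be zero-filled outside the sampling mask.

import numpy as np

def grad_log_likelihood(eta, y, mask, sigma=1.0):
    """Eq. (4.12): (1 / sigma^2) * F_u^{-1} (F_u eta - y)."""
    residual = mask * np.fft.fft2(eta, norm="ortho") - y            # F_u eta - y
    return np.fft.ifft2(mask * residual, norm="ortho") / sigma**2   # back to image space

rng = np.random.default_rng(0)
mask = (rng.random((8, 8)) < 0.3).astype(float)                     # random sub-sampling pattern
truth = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
y = mask * np.fft.fft2(truth, norm="ortho")                         # simulated measurements
eta = np.zeros_like(truth)                                          # current estimate
print(grad_log_likelihood(eta, y, mask).shape)                      # (8, 8)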


Backpropagating through the log-likelihood gradient

The log-likelihood in (4.11) will need to be differentiated a second time, in order for derivatives to flow through $\nabla_{y|x}$ during backpropagation. Due to (3.10), it is sufficient to find the derivatives of $\mathcal{F}$ and of the application of the under-sampling scheme $u$. Under-sampling is essentially the same as scaling a complex variable by a real number, namely by the two real numbers 0 and 1. Therefore, the same equations (4.7) and (4.8) can be used as for the gates within the GRU. As is apparent from (2.3) and (2.4), the FFT is a widely linear transformation, and can be represented as a matrix transformation. Taking the derivative of such a transformation with respect to the vector variable results in the matrix itself. Transforming back to the $\mathcal{F}$-representation again, it is clear from (3.15) that the derivative to be passed on is the Fourier transform of the input gradient: $D\mathcal{F} = \mathcal{F} J_L$.
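The unitarity $\mathcal{F}^{-1} = \mathcal{F}^H$ invoked in the derivation of (4.12) can be checked numerically with numpy's orthonormal FFT; the following is a quick sanity test rather than part of the thesis code.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
v = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))

# With the orthonormal scaling, <F x, v> = <x, F^{-1} v>, i.e. the inverse FFT acts as
# the adjoint of the forward FFT.
lhs = np.vdot(np.fft.fft2(x, norm="ortho"), v)
rhs = np.vdot(x, np.fft.ifft2(v, norm="ortho"))
print(np.allclose(lhs, rhs))  # True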

4.4.4 The Loss Function

The RIM is a supervised learning model, meaning it needs training targets to evaluate the error and help guide it towards a good estimate. Let $\tau$ be such a target, representing the true signal if K-space were fully sampled. The RIM uses the mean squared error (MSE) as a loss function, where the estimate $\eta_t$ is evaluated against $\tau$ at each time-step. The total loss to minimize is then given by the per-pixel MS-error averaged over all $T$ time-steps,

$$L_T(\eta_T) = \frac{1}{T N M} \sum_{t=1}^{T} \left\| \eta_t - \tau \right\|_2^2, \tag{4.13}$$

with N M being the total number of pixels, as in (2.3) and (2.4). Its Wirtinger-derivative is given by

$$DL_T = \frac{2}{T N M} \left\{ \eta_t - \tau \right\}_{t=1}^{T}, \tag{4.14}$$

as per equation (3.13), where $J_L$ is regarded as equal to 1, since there are no more functions following $L_T$.

Because the model is trained directly to optimize (4.13), while (4.9) is only used as a guide towards this goal, it should now be clear why the RIM does not find the MAP itself. In fact, its estimate is not only different from, but likely better than, the MAP estimator. After all, there is no reason to think the MAP minimizes (4.13), nor is there any reason to think that the MAP must be the best estimator for $x$ in the space of all estimators. This comes back to a neural network's ability to find new features that are often better than those tailor-made by humans. Whichever is best, (4.9) should be viewed more as a motivation and starting point than as the end goal of a RIM. That is not to say that the MAP paradigm is not crucial for modeling the problem, as it is shown in [12] that the RIM ceases to work when $\nabla_{y|x}$ is removed as an input.
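For completeness, the loss in (4.13) can be computed as in the sketch below, where the list of per-time-step estimates and the target are placeholder arrays and not part of the thesis code.

import numpy as np

def rim_loss(etas, tau):
    """Eq. (4.13): per-pixel MSE averaged over all T time-steps.
    etas: list of complex estimates eta_t; tau: fully sampled target of the same shape."""
    T, NM = len(etas), tau.size
    return sum(np.sum(np.abs(eta - tau) ** 2) for eta in etas) / (T * NM)

rng = np.random.default_rng(0)
tau = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
etas = [tau + 0.1 * rng.standard_normal((4, 4)) for _ in range(3)]  # mock per-step estimates
print(rim_loss(etas, tau))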
