
FF-net: A fast Fourier transform inspired neural network



FF-net: A fast Fourier transform inspired neural network

Jan Huiskes
10740929

University of Amsterdam
MSc Artificial Intelligence
MSc Thesis

July 13, 2020

Supervisor: Dr. Daniel E. Worrall
Second assessor: Dr. Erik Bekkers

Size: 48 EC


Acknowledgements

I would like to thank my supervisor Daniel Worrall. He was a great help during my thesis. He tried to have a one-hour meeting every week (which is a lot for a supervisor) and always responded quickly to my emails.


Abstract

In this thesis we develop three novel neural network architectures based on the fast Fourier transform (FFT). The goal is to replace the inverse fast Fourier transform (IFFT) in the context of Fourier reconstruction tasks. Basing our architectures on the FFT allows us to deploy techniques such as weight-tying across frequency bands, for an efficient reduction in the number of learnable parameters. The models were tested on three datasets (MNIST, STL-10, fastMRI) against a benchmark shown to be useful in the fastMRI challenge. The best model, FF-net, improved on the benchmark on MNIST (BCE loss of 0.052(5) against 0.055(6) for the benchmark) and on STL-10 (L1 loss of 0.040(6) against 0.054(7)). However, FF-net did not manage to improve the results on the fastMRI dataset.


Contents

1 Introduction
2 Background
  2.1 Understanding MRI
  2.2 Inverse problem
  2.3 Fourier transform
  2.4 1D FFT algorithms
    2.4.1 Cooley-Tukey DIT
    2.4.2 Cooley-Tukey DIF
  2.5 2D FFT algorithms
    2.5.1 Vector radix 2D DIT
    2.5.2 Vector radix 2D DIF
  2.6 Masked k-space
3 Related Work
  3.1 MAP reconstruction
  3.2 fastMRI
  3.3 Other models for the U-net
  3.4 Improvements by using the k-space
4 Method
  4.1 CT-net
  4.2 LFFR
  4.3 FF-net
  4.4 Normalization of images
5 Experiments
  5.1 Datasets
  5.2 Models & Hyperparameters
    5.2.1 Benchmark (IFFT + U-net)
    5.2.2 CT-net + U-net
    5.2.3 LFFR + U-net
    5.2.4 FF-net + U-net
  5.3 Metric
    5.3.1 Binary Cross Entropy loss
    5.3.2 L1 loss
    5.3.3 Normalized Mean Square Error
    5.3.4 Peak Signal-to-Noise Ratio
    5.3.5 Structural Similarity
6 Results
  6.1 MNIST
  6.2 STL-10
  6.3 FastMRI
7 Discussion
8 Conclusion
A Additional Results
  A.1 STL-10
  A.2 fastMRI
References


1 Introduction

Fourier reconstruction is a common task in signal processing and related fields, such as magnetic resonance imaging [1]. Even with the recent developments in deep learning, there has not been a lot of interest in applying neural network-based models to signal processing tasks such as Fourier reconstruction. Many of the advances seen in deep learning are in fields such as computer vision or natural language processing. This begs the question, what inductive biases are needed to process Fourier-based signals?

In image-space, convolutional neural networks (CNNs) are the model of choice [2]. Underlying their success are two key inductive biases: 1) localized receptive fields due to strong short-range correlations between pixels in natural images [3] and 2) translational weight-tying due to the translational equivariance [4] of many common imaging tasks. While these two inductive biases are sensible for vision tasks in image-space, they do not hold in Fourier-space, where notions of locality and translational invariance are destroyed. This is particularly acute for tasks such as magnetic resonance imaging (MRI) reconstruction, a common medical imaging paradigm, where images are reconstructed from a typically undersampled signal collected in Fourier space.

In this thesis, we present an entirely novel neural architecture family, with a structure tailored to the processing of Fourier-space images. This architecture is loosely based on the fast Fourier transform (FFT) [5] [6], which exploits the recursive, self-similar nature of the Fourier transform across various frequencies. Basing our novel neural architecture on the FFT allows us to deploy techniques such as weight-tying across frequency-bands, for efficient reduction in the number of learnable parameters, while preserving the key structure of Fourier-based data.

In this thesis we make two main contributions:

• A neural network that can generate the true image directly from the Fourier space without using the full inverse fast Fourier transform (IFFT). It has a novel architecture inspired by the recursive structure of the Fourier transform, when written as an FFT.

• We demonstrate excellent performance on a toy MNIST dataset and the STL-10 dataset compared to the benchmark.

Chapter 2 provides all the background needed to understand this thesis; Chapter 3 discusses related research and how it connects to this thesis; Chapter 4 explains our models; Chapter 5 gives the set-up for our experiments; Chapter 6 shows the results of our models; and Chapter 7 discusses these results.


2 Background

One of the biggest areas where the Fourier transform is used is the medical field. Here, MRI machines produce images of patients' body parts. These images are acquired in Fourier space, also called the k-space, seen in Figure 1. To get the real image, one applies the inverse fast Fourier transform (IFFT) to the k-space image.

Figure 1: A k-space image is transformed by the IFFT to get a real MRI image.

In this section, we will explain how MRI works, the general inverse problem, the specific problem that we face, and a possible solution.

2.1 Understanding MRI

Magnetic resonance imaging (MRI) is an imaging technique applied in the medical world to detect soft tissue contrast. It can be used for a variety of problems including neurological, musculoskeletal, and oncological diseases [7]. One interesting and everyday application is the detection of bone fractures in the knees of patients, seen in Figure 1.

However, the majority of scans take 30 to 40 minutes to produce [8]. It is easy to imagine the patient discomfort at having to lie in a machine for 30-40 minutes for a scan. But there are even more problems: it can lead to low patient throughput, problems with patient compliance, artifacts from patient motion, and high examination costs [7].

The reason for this long acquisition time is that the images are first produced in Fourier space by sampling frequencies, which takes the MRI machine a long time. Only once enough points in k-space have been captured can the real image be generated. A way to solve this problem would be to obtain a good MRI image from a k-space image with fewer frequency samples. In that case we cannot simply use the IFFT, since that results in blurry images that cannot be used for medical diagnoses, seen in Figure 2.


Figure 2: A resulting MRI image, when one uses an undersampled k-space. The image gets blurry and cannot be used for medical diagnosis.

A possible solution has already been found: fastMRI [7] proposed to use a subsample of the k-space and to apply a neural network after the IFFT. As a result, they could speed up the process (of 30 to 40 minutes) by a factor of 4-8.

In our research we propose an extra neural network that will replace the IFFT. Therefore, we need to understand how the Fourier transform works.

2.2 Inverse problem

The category of problems that MRI reconstruction falls under is called inverse problems [9]. Here, we try to recover an image from some observations (the k-space). To be more precise, we try to recover an $N$-pixel image $x^* \in \mathbb{R}^N$ from a set of $M$ observations $y \in \mathbb{C}^M$:

$$y = A(x^*) + \epsilon \tag{1}$$

where $A : \mathbb{R}^N \to \mathbb{C}^M$ is a forward measurement operator and $\epsilon$ a noise vector from an unknown distribution. We want to reconstruct $x^*$ from $y$. This is not easy, since $A^{-1}$ and $p(\epsilon)$ are usually not known.

In our case, we do know $A^{-1}$, namely the inverse Fourier transform. However, we use a masked k-space as input, which makes the problem one-to-many. From this masked k-space, we try to reconstruct the MRI image with a neural network $g_\theta$:

$$x^* = g_\theta(m \odot y) \tag{2}$$

where $\theta$ are the network's parameters, $\odot$ the element-wise product and $m \in \{0, 1\}^M$ the mask.

There are also other methods for reconstructing the image but deep learning techniques show the most promise [9].

2.3 Fourier transform

The Fourier transform decomposes a function into its constituent frequencies [6]. These frequencies are the signals that we capture in an MRI machine. However, the Fourier transform can only be applied to continuous functions, and images consist of discrete pixels. Therefore we use the discrete Fourier transform (DFT). The formula to transform normal space to Fourier space in 1D is then given by

$$\text{DFT}(x, k) = \sum_{n=0}^{N-1} x_n \cdot e^{-i 2\pi k n / N}, \tag{3}$$

where $\text{DFT}(x, k)$ is the value at position $k$ in the resulting k-space vector, $x$ a 1D signal vector, $N$ the vector size and $i$ the imaginary unit [6]. The computational complexity of the DFT is $O(N^2)$. The inverse discrete Fourier transform (IDFT) is very similar:

$$\text{IDFT}(X, n) = \frac{1}{N} \sum_{k=0}^{N-1} X_k \cdot e^{i 2\pi k n / N}. \tag{4}$$

The mechanism is the same, therefore if we can understand the Fourier transform, we understand the inverse Fourier transform [6].
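As a concrete illustration, a direct $O(N^2)$ evaluation of Equation 3 can be written in a few lines of NumPy and checked against a library FFT (a sketch for exposition only, not part of the models):

```python
import numpy as np

# Direct O(N^2) evaluation of Equation 3, verified against NumPy's FFT.
def dft(x):
    N = len(x)
    n = np.arange(N)
    k = n[:, None]
    return np.sum(x * np.exp(-2j * np.pi * k * n / N), axis=1)

x = np.random.randn(16)
assert np.allclose(dft(x), np.fft.fft(x))
```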

2.4 1D FFT algorithms

The Cooley-Tukey algorithm is a way to reduce the complexity of the DFT to $O(N \log N)$ [10] [6]. The general idea is to reshape an $N$-sized array into a 2D $N_1 \times N_2$ array, where $N_1$ indexes the columns and $N_2$ the rows [10]. Usually, $N_1$ or $N_2$ is small and is called the radix. The idea is to perform the FFT or DFT on the rows/columns first, then multiply everything by complex roots of unity (twiddle factors), and then do the same again on the columns/rows. An example is given by Equation 8. If one performs the DFT/FFT on the rows first it is called decimation in time (DIT), and if one performs it on the columns first it is called decimation in frequency (DIF). This is called the radix FFT, where we take a radix value to split the vector into subparts. A radix of 2 would split a vector $x$ into 2 parts each time, which we will explore below.

2.4.1 Cooley-Tukey DIT

Here we will illustrate with an example where we pick $N_1$ or $N_2$ to be 2. First we look at the DIT variant, on which we base one of our models. The way this FFT works is by splitting the sum into two parts [11]:

$$\begin{aligned} \text{DFT}(x, k) &= \sum_{n=0}^{N-1} x_n \cdot e^{-i 2\pi k n / N} && (5)\\ &= \sum_{m=0}^{N/2-1} x_{2m} \cdot e^{-i 2\pi k (2m)/N} + \sum_{m=0}^{N/2-1} x_{2m+1} \cdot e^{-i 2\pi k (2m+1)/N} && (6)\\ &= \sum_{m=0}^{N/2-1} x_{2m} \cdot e^{-i 2\pi k m/(N/2)} + W_N^k \sum_{m=0}^{N/2-1} x_{2m+1} \cdot e^{-i 2\pi k m/(N/2)} && (7)\\ &= \text{DFT}(x^E, k) + W_N^k \, \text{DFT}(x^O, k) && (8) \end{aligned}$$

where $W_N^k = e^{-i 2\pi k / N}$ is the twiddle factor, $x^E$ the even part of the vector $x$ and $x^O$ the odd part. This reduces the complexity by a factor 2, seen in Figure 3. This figure is also called a butterfly diagram. If one keeps on splitting, the complexity becomes $O(N \log N)$ and that is the full FFT algorithm.

Figure 3: A butterfly diagram of the first part of a DIT FFT. It splits an array in even and odd parts. On those parts it performs the DFT. Then it combines the parts with twiddle factors. If one would continue to split the DFT parts further and further, it would be the full FFT algorithm [11].

This algorithm is called a divide and conquer algorithm [12], because it recursively breaks down the DFT. The IFFT algorithm looks similar to the FFT algorithm, so it does not need much changing except for a division by a factor $N$.
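For concreteness, the recursion of Equation 8 can be written out directly; the sketch below assumes the length is a power of two and is purely illustrative:

```python
import numpy as np

# Recursive radix-2 DIT FFT following Equation 8 (assumes len(x) is a power of 2).
def fft_dit(x):
    N = len(x)
    if N == 1:
        return x.astype(complex)
    even = fft_dit(x[0::2])               # DFT(x^E, k)
    odd = fft_dit(x[1::2])                # DFT(x^O, k)
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)   # twiddle factors W_N^k
    # W_N^{k + N/2} = -W_N^k gives the second half of the spectrum.
    return np.concatenate([even + W * odd, even - W * odd])

x = np.random.randn(32)
assert np.allclose(fft_dit(x), np.fft.fft(x))
```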


2.4.2 Cooley-Tukey DIF

The DIF algorithm is the reverse of the DIT. Instead of splitting the normal space, with pixel position $n$, into two equally long vectors with position $m$, we split the k-space, with pixel position $k$, into two equally long vectors with position $r$ [13]. For the even-indexed outputs,

$$\begin{aligned} \text{DFT}(x, 2r) &= \sum_{n=0}^{N-1} x_n W_N^{2rn} && (9)\\ &= \sum_{n=0}^{N/2-1} x_n W_N^{2rn} + \sum_{n=0}^{N/2-1} x_{n+N/2} W_N^{2r(n+N/2)} && (10)\\ &= \sum_{n=0}^{N/2-1} x_n W_N^{2rn} + \sum_{n=0}^{N/2-1} x_{n+N/2} W_N^{2rn} \cdot 1 && (11)\\ &= \sum_{n=0}^{N/2-1} \left( x_n + x_{n+N/2} \right) W_{N/2}^{rn} && (12)\\ &= \text{DFT}\left[ x_n + x_{n+N/2} \right] && (13) \end{aligned}$$

and for the odd-indexed outputs,

$$\begin{aligned} \text{DFT}(x, 2r+1) &= \sum_{n=0}^{N-1} x_n W_N^{(2r+1)n} && (14)\\ &= \sum_{n=0}^{N/2-1} \left( x_n + W_N^{N/2} x_{n+N/2} \right) W_N^{(2r+1)n} && (15)\\ &= \sum_{n=0}^{N/2-1} \left( \left( x_n - x_{n+N/2} \right) W_N^n \right) W_{N/2}^{rn} && (16)\\ &= \text{DFT}\left[ \left( x_n - x_{n+N/2} \right) W_N^n \right] && (17) \end{aligned}$$

where $W_N^n$ is now the twiddle factor. This also reduces the complexity by a factor 2, seen in Figure 4. Here, we can also clearly see that it is the reverse way of splitting and applying the DFT.


Figure 4: A butterfly diagram of the first part of a DIF FFT. This time it multiplies the values with the twiddle factor and then performs the DFT, so that the output is split in even and odd parts. This is also the reverse of the DIT FFT [13].

We will make one neural network architecture based on the DIT FFT algorithm. This algorithm is easier to implement because we want the output pixels in the correct order, which is not the case in the DIF algorithm.

2.5 2D FFT algorithms

The multidimensional DFT is given by

$$\text{DFT}(x, \mathbf{k}) = \sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-1} x_{\mathbf{n}} \, e^{-2\pi i \, \mathbf{k} \cdot (\mathbf{n}/\mathbf{N})}, \tag{18}$$

where $\mathbf{k}$, $\mathbf{n}$ and $\mathbf{N}$ are now multidimensional vectors, $\mathbf{n}/\mathbf{N}$ is an element-wise division and $\sum_{\mathbf{n}=\mathbf{0}}^{\mathbf{N}-1}$ denotes multiple summations, one per dimension. The 2D case of the FFT (or IFFT) is, therefore, the 1D case applied on the rows and the columns [5]. Thus, one first transforms all the rows one by one with the 1D FFT, which yields a new image where the rows have been transformed. Afterwards, one does the same thing on the resulting columns of the image.
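This row-column decomposition is easy to verify numerically (a NumPy sketch):

```python
import numpy as np

# The 2D FFT as 1D FFTs over all rows, then over all resulting columns.
x = np.random.randn(8, 8)
rows = np.fft.fft(x, axis=1)     # transform every row
full = np.fft.fft(rows, axis=0)  # then every resulting column
assert np.allclose(full, np.fft.fft2(x))
```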

However, we also want to incorporate 2D information of the image into our neural network that simulates the FFT. This is a lot harder to do if we use a 1D FFT algorithm. Therefore a better way is using 2D FFT algorithms.


2.5.1 Vector radix 2D DIT

One of the most used 2D FFT algorithms is the vector radix 2D FFT [14]. We will base our models on the DIT vector radix 2D FFT. The 2D DFT can be written as

$$\text{DFT}(x, \mathbf{k}) = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x[n_1, n_2] \cdot W_{N_1}^{k_1 n_1} W_{N_2}^{k_2 n_2} \tag{19}$$

where $W_{N_1}^{k_1 n_1} = e^{-i 2\pi k_1 n_1 / N_1}$, $\mathbf{k} = (k_1, k_2)$ and $\mathbf{N} = (N_1, N_2)$. We can use the divide and conquer technique from above in the 2D space as well, seen in Figure 5.

Figure 5: We can split a 2D array similarly to a 1D array. Instead of 2 parts, we get 4 parts. We split it by skipping one tile horizontally and vertically.

Instead of 2 parts (odd and even), we get 4 parts in this case (even-even, even-odd, odd-even, odd-odd). Equation 19 then transforms to

$$\text{DFT}(x, \mathbf{k}) = S_{00}(k_1, k_2) + S_{01}(k_1, k_2)\, W_N^{k_2} + S_{10}(k_1, k_2)\, W_N^{k_1} + S_{11}(k_1, k_2)\, W_N^{k_1+k_2} \tag{20}$$

where $S_{ij}(k_1, k_2) = \text{DFT}(x_{ij}, \mathbf{k})$ [14] and the $W_N^{k_x}$ are the twiddle factors. Here $S_{ij}(k_1, k_2)$ is just the 2D DFT on a smaller part, hence we can use the divide and conquer technique again.
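A numerical check of the four-way split of Equation 20, for an even-sized square image (a sketch; the tiling makes the period-$N/2$ structure of the half-size sub-DFTs explicit):

```python
import numpy as np

N = 8
x = np.random.randn(N, N) + 1j * np.random.randn(N, N)
k1, k2 = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
W = np.exp(-2j * np.pi / N)

# S_ij: 2D DFT of the subimage with row offset i and column offset j.
S = {(i, j): np.fft.fft2(x[i::2, j::2]) for i in (0, 1) for j in (0, 1)}
tile = lambda s: np.tile(s, (2, 2))  # each S_ij is periodic with period N/2

X = (tile(S[0, 0])
     + tile(S[0, 1]) * W**k2
     + tile(S[1, 0]) * W**k1
     + tile(S[1, 1]) * W**(k1 + k2))
assert np.allclose(X, np.fft.fft2(x))  # matches the direct 2D FFT
```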

2.5.2 Vector radix 2D DIF

Just as in the 1D case, there is a 2D DIF vector radix algorithm. The idea of this algorithm is the same as in the 1D case. Instead of splitting the normal space, the k-space will be split apart. We apply the twiddle factors first, then the DFT, so that we get the split k-space parts.


2.6 Masked k-space

We want to under-sample the k-space pixels. In this work we consider the case where we take only 25% of the pixels. We cannot take random samples from the k-space because the lower frequencies are typically more important for reconstruction quality. Therefore, we use masking lines on the k-space image with a central region of size 8% always unmasked, seen in Figure 6. We apply this mask to all datasets.

Figure 6: Masked k-space image retaining 25% of the pixels. Vertical masking lines are used to mask the image [7].

The remaining 17% of unmasked pixels are drawn uniformly from the rest of the image, which meets the general conditions for compressed sensing [15]. The result is that the sampling process is sped up by a factor of 4. The omitted lines are vertical because that direction corresponds to the phase encoding; in practice, this makes the samples easier and faster to acquire.
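A sketch of how such a line mask can be built (our illustration of the scheme described above; the exact fastMRI implementation may differ):

```python
import numpy as np

def cartesian_mask(width, center_frac=0.08, total_frac=0.25, rng=None):
    """Keep a fully sampled central band of columns, then add random
    columns until total_frac of all columns are unmasked."""
    rng = np.random.default_rng() if rng is None else rng
    n_center = int(round(center_frac * width))
    n_total = int(round(total_frac * width))
    mask = np.zeros(width, dtype=bool)
    start = (width - n_center) // 2
    mask[start:start + n_center] = True          # low frequencies, always kept
    outside = np.flatnonzero(~mask)
    extra = rng.choice(outside, size=n_total - n_center, replace=False)
    mask[extra] = True                           # ~17% random high frequencies
    return mask

# Applied column-wise to a k-space image: kspace * cartesian_mask(W)[None, :]
```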


3 Related Work

Besides understanding the relevant background, we need to know what other research has found on reconstructing MRI images from the masked k-space. Here, we mention how reconstruction was done before deep learning, what the fastMRI challenge is, and how other research has improved on the fastMRI challenge.

3.1 MAP reconstruction

Inverse problems can be solved using the maximum a posteriori (MAP) estimate [9]:

$$x^* = \arg\min_x \frac{1}{2}\left\| A(x) - m \odot y \right\|_2^2 + R(x) \tag{21}$$

where $R(x)$ is the regularization term and $A$ the forward measurement operator. Common regularization terms in MRI reconstruction are L1 regularization $R_{L1}(x) = \|x\|_1$, wavelet regularization $R_{\text{wavelet}}(x) = \|\Psi(x)\|_1$, where $\Psi$ is a discrete wavelet transform, and total variation regularization $R_{TV}(x) = \sum_{ij} \sqrt{|x_{i+1,j} - x_{i,j}|^2 + |x_{i,j+1} - x_{i,j}|^2}$ [7].

One could solve these problems using an optimization technique as in Equation 21. However, Zbontar et al. [7] and Ongie et al. [9] showed that deep learning outperforms these methods in both time and reconstruction fidelity. Therefore, we focus on using neural architectures to reconstruct MRI images.
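For reference, a minimal sketch of solving the L1-regularized version of Equation 21 with ISTA (proximal gradient descent), with $A$ the masked 2D Fourier transform; the step size and regularization weight are illustrative assumptions, and this classical baseline is not used further in this thesis:

```python
import numpy as np

def ista_reconstruct(masked_kspace, mask, lam=1e-3, step=1.0, iters=100):
    """mask: 2D array of 0/1; masked_kspace: mask * FFT2(true image)."""
    x = np.zeros(masked_kspace.shape)                # start from a zero image
    for _ in range(iters):
        # Gradient of 0.5 * ||mask * FFT(x) - y||^2 with respect to x.
        resid = mask * np.fft.fft2(x) - masked_kspace
        x = x - step * np.real(np.fft.ifft2(mask * resid))
        # Proximal step: soft-thresholding enforces the L1 prior R(x) = ||x||_1.
        x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)
    return x
```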

3.2 fastMRI

Zbontar et al. [7] set up the fastMRI project. They made the dataset of MRI scans publicly available and provided a baseline to compare results against. Their model is given by

$$x^* = \text{Unet}_\theta(\text{IFFT}(m \odot y)) \tag{22}$$

Their model, thus, consists of an IFFT followed by a U-net [16]. The U-net consists of two deep convolutional networks, a down-sampling and up-sampling path. These paths are connected by skip connections, seen in Figure 7.


Figure 7: U-net architecture from the fastMRI paper [7].

The down-sampling path mainly consists of several convolutional layers with 3 × 3 kernels and ReLU functions. It is down-sampled by MaxPool layers. The up-sampling path is similar but uses up-convolutions to up-sample the images.
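A minimal two-scale PyTorch sketch of this design (3×3 convolutions with ReLU, MaxPool down-sampling, up-convolutions and skip connections); the channel widths and depth here are illustrative assumptions, not the exact fastMRI configuration:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, c=32):
        super().__init__()
        self.down1 = conv_block(1, c)
        self.down2 = conv_block(c, 2 * c)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(2 * c, c, 2, stride=2)
        self.up1 = conv_block(2 * c, c)     # skip concatenation doubles channels
        self.out = nn.Conv2d(c, 1, 1)

    def forward(self, x):
        d1 = self.down1(x)                  # full resolution
        d2 = self.down2(self.pool(d1))      # half resolution
        u = self.up(d2)                     # back to full resolution
        return self.out(self.up1(torch.cat([u, d1], dim=1)))
```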

3.3 Other models for the U-net

Lønning et al. [17] made a recurrent inference machine (RIM) for the fastMRI challenge. This is an iterative map which uses a hidden memory state, the current reconstruction, and the gradient of the likelihood term. This last term has information about the current state and how well it is performing. With this model they managed to outperform the U-net.

Wang et al. [18] proposed a Pyramid Convolutional RNN. The network consists of three ConvRNN modules. Each module has an encoder and decoder with a basic RNN cell in the middle. It takes the IFFT image of the masked k-space and feeds the output of each ConvRNN as input to the next. In the end, they concatenate all the images. With this model they managed to outperform the U-net.

These papers focused on improving the U-net: by building different models they managed to improve the score, but they still used the IFFT on the masked k-space.

3.4 Improvements by using the k-space

Bahadir et al. [19] looked at optimizing the undersampling of the k-space. In this project we use Cartesian undersampling, applying vertical masking lines to the k-space image. They found an optimization technique that masks the areas that are least useful in the k-space.

Sriram et al. [20] proposed a model that can reconstruct the full k-space image from the masked k-space, called the End-to-End Variational Network. It works by generating a sensitivity map of the image to be obtained and using this to generate an approximate image. From there they can get back to the full k-space with the use of other small networks.

These papers focused on improvements that make use of the k-space, either by reconstructing it or by optimizing the mask. They still used the IFFT, whereas we focus on building our own neural network to replace the IFFT and extract more information from the undersampled k-space.


4 Method

In our research we focus on replacing the IFFT with a neural network to improve the performance of models that use a masked k-space. We are essentially trying to find a mapping between the masked k-space and the normal MRI image (which one would retrieve from the whole k-space). The problem with the IFFT (or FFT) is that it is a global transformation: each pixel in the k-space contributes to each pixel in the normal space. We cannot simply use CNN-like structures in that case, because those are good for local information. The first thing one could try is an MLP from image to image, mimicking the IDFT in Equation 4. This works, but it is problematic for images of 320×320 as in fastMRI, because the computational complexity scales quadratically with the number of pixels.

The structure of the IFFT algorithms mentioned above can help reduce the computational complexity by using weight-tying on smaller parts of the images. Firstly, the IFFT splits an image into smaller parts, which reduces the number of pixels. Secondly, we can use the divide and conquer structure of the FFT to organize our neural networks, which speeds up the process compared to an MLP. Therefore, all our models consist of three main parts: 1) splitting the image into subparts, 2) using the IFFT or an MLP on the lowest leaves (the parts that are left after splitting) and 3) merging these parts to reconstruct the MRI image.

We have made three models that try to mimic the structure of the IFFT. The first model (CT-net) is based on the Cooley-Tukey 1D IFFT algorithm. The second model (LFFR) tries to mimic the structure of the vector radix 2D IFFT and is a linear model. The third model (FF-net) is a nonlinear model, which takes the values of the pixels and twiddle factors and maps these to the wanted outcome. Since we are working with complex numbers, we separate the real and imaginary parts into two channels.

4.1 CT-net

The first model that we built was based on the 1D FFT algorithm by Cooley-Tukey, so the model is called CooleyTukey-net (CT-net). This model applies a neural network to each row first and then to each of the resulting columns. Firstly, we split the 1D array into an even and odd part. We do this a specific number of times, determined by a hyperparameter depth. Secondly, on the lowest leaves we apply a linear layer. We do this instead of a DFT/IFFT to have a more complex mapping. Thirdly, we reconstruct the 1D array in normal space with trainable parameters $\phi \in \mathbb{C}$ instead of the given twiddle factors, given by

$$X_k = X_k^E + \phi_k X_k^O, \tag{23}$$

where $k$ is the position of the pixel, $X$ (part of) the column/row, $X^E$ the even part of $X$ and $X^O$ the odd part. Here, $\phi$ is not shared across depths but is different for each depth. We repeat this reconstruction until we have the full 1D array in normal space from all the split parts. The process is shown in Figure 8.

Figure 8: The steps of the CT-net. We apply a neural network on each row and then each column. We first split the parts till a given depth. Then, we use a linear layer on the lowest leaves. Lastly, we use W , a trainable parameter, to combine the split parts to get the final result.

The number of parameters for this model is given by

$$\#\text{parameters} = 2\left(\frac{N}{2^d}\right)^2 + \sum_{k=0}^{d} 4 \cdot \frac{N}{2^k}, \tag{24}$$

where $d$ is the depth and $N$ the array size. The first part is due to the linear layer and the second part due to the trainable parameters $\phi$. We need a factor 2 in the first part because we split the complex numbers into two channels. The trainable parameters $\phi$ are $2 \times 2$ transformation matrices and therefore we have an extra factor 4.
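A sketch of the merge step of Equation 23, assuming complex values stored as two real channels and $\phi$ realized as a learnable $2 \times 2$ matrix per position (consistent with the factor 4 above); the periodic extension of the half-length sub-results is left out for brevity:

```python
import torch
import torch.nn as nn

class Merge1D(nn.Module):
    """Combine even and odd sub-results: X_k = X^E_k + phi_k X^O_k."""
    def __init__(self, n):
        super().__init__()
        self.phi = nn.Parameter(torch.randn(n, 2, 2) * 0.01)  # one 2x2 map per k

    def forward(self, x_even, x_odd):
        # x_even, x_odd: (batch, n, 2), channels = (real, imag).
        twiddled = torch.einsum("nij,bnj->bni", self.phi, x_odd)
        return x_even + twiddled
```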

4.2 LFFR

The first model does not incorporate any 2D structure. Since we are working with 2D images, we would like to have a network that also operates in 2D. This second model is called learned fast Fourier result (LFFR).


In this case, we can split the image into four parts as we do in the vector radix 2D FFT, seen in Figure 5. Here, we also do this a specific number of times, determined by a hyperparameter depth. Secondly, on the lowest leaves we apply an IFFT. Thirdly, we can reconstruct the MRI image by merging the leaves with trainable parameters $\phi^i \in \mathbb{C}$, given by

$$X_{\mathbf{k}} = X_{\mathbf{k}}^{EE} + \phi^1_{\mathbf{k}} \cdot X_{\mathbf{k}}^{EO} + \phi^2_{\mathbf{k}} \cdot X_{\mathbf{k}}^{OE} + \phi^3_{\mathbf{k}} \cdot X_{\mathbf{k}}^{OO}, \tag{25}$$

where $X$ is (a part of) the reconstructed MRI image and $\mathbf{k} = (k_1, k_2)$ the position of the pixel. Here, $\phi^i$ is also not shared across depths. This equation simulates the vector radix split in Equation 20. The steps of the whole process are given in Figure 9.

Figure 9: The steps of the LFFR. We first split the k-space into smaller parts based on the depth. Secondly, we apply the IFFT on the lowest leaves. Thirdly, we use trainable parameters $\phi^i$ to merge the parts. The merging of parts is repeated a number of times based on the depth. At the end we get the reconstructed MRI image.

The number of parameters for this model is given by

$$\#\text{parameters} = 3 \sum_{k=0}^{d} 4 \cdot \frac{N}{4^k}, \tag{26}$$

where $d$ is the depth and $N = N_1 \times N_2$ the image size. We have three separate trainable parameters $\phi^1$, $\phi^2$ and $\phi^3$, hence the factor 3 in front of the summation.

The only parts being trained here are the $\phi^i$ parameters. We know their exact values in the case without a mask, but by training them instead we hope that they map the masked k-space better to the true image. It is a linear model, so this mapping is linear as well. We think this model has the correct structure but can be improved with extra nonlinearity, since a more complex nonlinear mapping is probably needed. Therefore, we made another model with a more complex nonlinear mapping.

4.3 FF-net

The third model that we made is called fast Fourier net (FF-net). The first two steps are the same as the LFFR: 1) split the image into four parts and 2) apply the IFFT on the lowest leaves. However, instead of merging the parts with a trainable parameter, we merge the parts with the exact twiddle factors using a function $f : \mathbb{C}^7 \to \mathbb{C}$:

$$X_{\mathbf{k}} = f\left(X_{\mathbf{k}}^{EE}, X_{\mathbf{k}}^{EO}, X_{\mathbf{k}}^{OE}, X_{\mathbf{k}}^{OO}, W_N^{k_1}, W_N^{k_2}, W_N^{k_1+k_2}\right). \tag{27}$$

The function is in 7 dimensions because it has 7 complex-valued channels: three twiddle factors and four split parts of the image. Note that we use the exact values of the twiddle factors in this model.

The whole process is very similar to the LFFR. The only difference is that we use the exact twiddle factors and let the neural network figure out the mapping, seen in Figure 10.


Figure 10: The steps of the FF-net. We first split the k-space into smaller parts based on the depth. Secondly, we apply the IFFT on the lowest leaves. Thirdly, we use a complex function f to merge the twiddle factors and split parts. The merging of parts is repeated a number of times based on the depth. At the end we get the reconstructed MRI image.

All the parts $X_{ij}$ and twiddle factors are used as channels for the input image of $f$. The architecture of the complex function $f$ is seen in Figure 11.


Figure 11: The architecture of the complex function $f$. The network combines the parts $X_{ij}$ and twiddle factors into one image as channels, giving 14 channels (7 complex numbers). It uses convolutional blocks to find a nonlinear mapping. The number of channels is given above each block.

Unlike the other models, the size of this model does not depend on the size of the images. Therefore, the number of parameters is 496,002 regardless of the size $N$.

The advantage of using this neural network is that we do not predefine the structure of the mapping, as we do with the LFFR. On top of that, the previous network is linear and this one is nonlinear. Therefore, the LFFR is contained in the functional family of the FF-net.
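A sketch of what $f$ could look like in PyTorch: 14 real input channels (7 complex values) entering a small convolutional stack that predicts the merged (real, imaginary) image. The widths and number of blocks are assumptions; we do not reproduce the exact 496,002-parameter configuration here:

```python
import torch
import torch.nn as nn

class MergeNet(nn.Module):
    def __init__(self, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(14, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 2, 3, padding=1))   # output: (real, imag)

    def forward(self, parts, twiddles):
        # parts: (batch, 8, H, W) = four sub-images x (real, imag);
        # twiddles: (batch, 6, H, W) = three twiddle maps x (real, imag).
        return self.net(torch.cat([parts, twiddles], dim=1))
```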

4.4 Normalization of images

The loss of the neural network is ultimately determined by pixel-value differences between the target and the generated image. If the pixel values are very low, however, the model will not train properly, since we are dealing with vanishing gradients. In the fastMRI dataset the pixel values lie between $10^{-6}$ and $10^{-5}$, which causes vanishing gradients.

A good way to deal with this is normalizing the images. We can do that by using

$$\hat{x}_n = \frac{x_n - \mu}{\sigma}, \tag{28}$$

where $\hat{x}$ is the normalized image, $x$ the image, $n$ the pixel position, $\mu$ the mean of the pixel values per image and $\sigma$ their standard deviation. This means that we can normalize the image after we have applied the IFFT.

However, we do not use the IFFT directly and instead use the k-space as input. In our models, we normalize the image after we have applied the IFFT on the lowest split parts and merged them, just before putting the result through the neural network. Additionally, it is good to do this on the training images of all datasets, since it ensures loss values that can be used to train our models. Therefore, we apply it to all our datasets.
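A minimal sketch of this per-image normalization (the small epsilon guarding against division by zero is our addition):

```python
import torch

def normalize(x, eps=1e-11):
    # x: (batch, channels, H, W); statistics are computed per image.
    mu = x.mean(dim=(1, 2, 3), keepdim=True)
    sigma = x.std(dim=(1, 2, 3), keepdim=True)
    return (x - mu) / (sigma + eps)
```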


5 Experiments

5.1 Datasets

MNIST MNIST is a dataset of 28 × 28 8-bit grayscale images. It contains handwritten digits from 0-9, centered in the middle. The pixel range is between 0 and 1. The training data has 60,000 images and the test data 10,000. It is a useful toy dataset for a first check that a model works.

STL-10 STL-10 is a dataset of 96 × 96 RGB images. It contains images of 10 classes, e.g. birds, horses, cats, etc. (the classes are not relevant in our case). The images are not centered in this case, and the full image space is important. The training data has 5,000 images and the test data 8,000. It is a useful dataset to really test the models, because these images are a lot more complex than MNIST. The images are converted to grayscale to make the task easier.

fastMRI The fastMRI data contains MRI scans of knees [7]. The scans vary between 640 × 320 and 640 × 372 pixels in k-space and are 320 × 320 in normal space. The knees are centered in the middle. The training set has 866 images and the validation set 199. This is the most important dataset since it uses a real application of the IFFT. The dataset contains single-coil k-space data, meaning we only have one k-space image mapped to the ground truth. In real MRI reconstruction, one acquires multiple k-space images from different angles; these images would be transformed with the IFFT to get multiple real images, which are then combined to get the final result. Our focus is on improving the benchmark, so single-coil is more suited.

5.2 Models & Hyperparameters

All models are trained with Adam [21] with a learning rate of 1e-3. For STL-10 and fastMRI, the learning rate is decreased over time with a step schedule, by a factor of 0.1 for the last 5 epochs. The batch size is 16, 8 and 4 for MNIST, STL-10 and fastMRI respectively.
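In PyTorch this setup looks roughly as follows (the model and the number of epochs are placeholders; the milestone is inferred from "the last 5 epochs"):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(2, 2, 3, padding=1)   # stand-in for any of the models above
num_epochs = 50                          # assumed; not stated in the text
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[num_epochs - 5], gamma=0.1)  # decay for last 5 epochs

for epoch in range(num_epochs):
    # ... forward pass, loss, backward pass, optimizer.step() ...
    scheduler.step()
```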

5.2.1 Benchmark (IFFT + U-net)

In our experiments we use the inverse fast Fourier transform (IFFT) followed by a U-net as our benchmark. This is used in the experiments of Zbontar et al. [7], described in Section 3. The number of up- and down-sampling steps is 2 for MNIST and 4 for the other datasets. This will be the same for all U-nets in the other experiments.

5.2.2 CT-net + U-net

The second model is the CT-net followed by a U-net. The U-net will be the same as the one in the benchmark. The depth of the CT-net is set to 1.


5.2.3 LFFR + U-net

The third model is the LFFR followed by a U-net. The LFFR tries to find linear mappings between the masked k-space and the true image. The U-net is the same as in the benchmark. The depth of the LFFR is set to 1.

5.2.4 FF-net + U-net

The fourth model is the FF-net followed by a U-net. The FF-net tries to find nonlinear mappings between the masked k-space and the true image, which are not found in the IFFT. The U-net is the same as in the benchmark. The depth of the FF-net is set to 1.

5.3 Metric

For our research we used different metrics to evaluate and train the models. These metrics are explained below.

5.3.1 Binary Cross Entropy loss

The Binary Cross Entropy loss (BCE loss) was used for the MNIST dataset. This loss can only be used if the pixel values fall between 0 and 1. This is only the case with the MNIST dataset. The loss function is given by

$$\ell_n = -\left[ y_n \cdot \log x_n + (1 - y_n) \cdot \log(1 - x_n) \right] \tag{29}$$

where n is the pixel number, y ∈ [0, 1] the ground truth image and x ∈ [0, 1] the output of the model. The final loss is the mean of all BCE losses for each pixel.

In the case of an output value of 1 or 0, we define $0 \log(0) := 0$. This loss was used for both training and evaluation on the MNIST dataset.

5.3.2 L1 loss

The L1 loss is a widely used loss in image reconstruction. This loss is given by

$$\ell_n = |x_n - y_n| \tag{30}$$

where n is the pixel number, y the ground truth image and x the output of the model. The final loss is the mean of all L1 losses for each pixel.

This loss is used to train the models on the STL-10 and fastMRI datasets and to evaluate the STL-10 outputs. The other losses mentioned below were used to evaluate the fastMRI dataset.

5.3.3 Normalized Mean Square Error

The recommended evaluation metric for fastMRI is the normalized mean square error (NMSE) [7]. This metric is given by

$$\text{NMSE}(x, y) = \frac{\|x - y\|_2^2}{\|y\|_2^2} \tag{31}$$

where $\|\cdot\|_2^2$ is the squared Euclidean norm.

Other metrics are also recommended for the fastMRI challenge, since NMSE favors smoothness over sharpness [7].

5.3.4 Peak Signal-to-Noise Ratio

One of these metrics is the Peak Signal-to-Noise Ratio (PSNR). This metric represents the ratio between the power of the maximum possible image intensity across a volume and the power of distorting noise and other errors [7]. It is given by

$$\text{PSNR}(x, y) = 10 \log_{10} \frac{\max(y)^2}{\text{MSE}(x, y)} \tag{32}$$

where $\max(y)$ is the largest entry (pixel value) and $\text{MSE}(x, y)$ the mean squared error. In this case, higher values of PSNR indicate that a model is performing better.

5.3.5 Structural Similarity

The structural similarity (SSIM) index evaluates two images by exploiting the inter-dependencies among nearby pixels [22]. The SSIM evaluates structural properties by sliding a window over the image at different locations. The similarity of image patches $\hat{m}$ and $m$ is given by

$$\text{SSIM}(\hat{m}, m) = \frac{(2\mu_{\hat{m}}\mu_m + c_1)(2\sigma_{\hat{m}m} + c_2)}{(\mu_{\hat{m}}^2 + \mu_m^2 + c_1)(\sigma_{\hat{m}}^2 + \sigma_m^2 + c_2)} \tag{33}$$

where $\mu_{\hat{m}}$ and $\mu_m$ are the mean values of the patches, $\sigma_{\hat{m}}$ and $\sigma_m$ the standard deviations, $\sigma_{\hat{m}m}$ the covariance, and $c_1$ and $c_2$ two variables to stabilize the division: $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$, with $L$ the dynamic range of the pixel values. In our research we used the same values as Zbontar et al. [7], namely a window size of 7 × 7, $k_1 = 0.01$ and $k_2 = 0.03$.
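Sketches of the three fastMRI evaluation metrics, matching Equations 31-33; the SSIM call uses scikit-image with the settings above, and taking the data range from the target image is our simplification:

```python
import numpy as np
from skimage.metrics import structural_similarity

def nmse(x, y):
    return np.linalg.norm(x - y) ** 2 / np.linalg.norm(y) ** 2

def psnr(x, y):
    mse = np.mean((x - y) ** 2)
    return 10 * np.log10(y.max() ** 2 / mse)

def ssim(x, y):
    return structural_similarity(x, y, win_size=7, K1=0.01, K2=0.03,
                                 data_range=y.max())
```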


6 Results

In this section we compare our models to the benchmark on the different datasets. The datasets are ordered from small to large images, where fastMRI is the last dataset.

6.1 MNIST

The MNIST dataset contains small images of 28×28 pixels. This dataset is used as a first test to see if a model can reconstruct good images. The images reconstructed by each model can be seen in Figure 12.

Figure 12: Test image results from the different models on MNIST. We used 5 different images to compare results. All the models seem to produce good results.

As one can see, it is hard to differentiate the quality of the images generated by the models. Luckily, MNIST is only used as a first test to compare the models. The final test loss is given in the table below. In this thesis we use bracket notation to denote the standard deviation: the number in brackets applies to the last decimal of the number, e.g. 0.055(6) is 0.055 ± 0.006 and 0.055(14) is 0.055 ± 0.014. The standard deviation is calculated over all the images in the test set.


Model                        BCE score
IFFT + U-net (benchmark)     0.055(6)
CT-net + U-net               0.076(7)
LFFR + U-net                 0.076(7)
FF-net + U-net               0.052(5)

Table 1: The models and their test loss on the MNIST dataset. CT-net and LFFR perform worse than the benchmark and FF-net. FF-net outperforms all the models.

Here, we can see that CT-net and LFFR perform worse than the other models. CT-net was dropped after this dataset because it underperformed. We still wanted to see if LFFR had a different result on STL-10. FF-net is the best model here and is significantly better than the benchmark, with a p-value < 0.0001 under a t-test.

6.2 STL-10

The STL-10 dataset contains larger images of 96×96 pixels. This dataset is used to improve and optimize models before running them on fastMRI. These images are complicated enough to show clear differences in results and small enough that training does not take long. The images created by the models can be seen in Figure 13.


Figure 13: Test image results from the different models on STL-10. We used 5 different images to compare results. The LFFR performs worse than the other models based on these images. However, the FF-net outperforms the other models.

As one can see, the LFFR performs worse than the other models based on the images. Therefore, we did not use this model for the fastMRI dataset. However, the FF-net clearly outperforms the other models. This is also clearly seen in the Table below.


Model                        L1 score
IFFT + U-net (benchmark)     0.054(7)
LFFR + U-net                 0.073(8)
FF-net + U-net               0.040(6)
FF-net                       0.046(6)

Table 2: The models and their test loss on the STL-10 dataset. The LFFR performs significantly worse than the other models. However, the FF-net outperforms the other models.

FF-net is the best model here and is significantly better than the benchmark, with a p-value < 0.0001 under a t-test. We also added the FF-net without the U-net. In that case the model performs slightly worse, but it still beats the benchmark.

6.3 FastMRI

As mentioned before, the fastMRI dataset is one where the IFFT is used in a real application, which makes it the most important dataset. The images created by the models can be seen in Figure 14.


Figure 14: Test image results from the different models on fastMRI. We used 3 different images to compare results. The models give very similar results.

As seen from the images, the models give very similar output images. The table below also shows that the models perform very similarly.

Model            NMSE      SSIM      PSNR
IFFT + U-net     0.004(3)  0.65(14)  29(3)
FF-net + U-net   0.004(3)  0.63(14)  29(3)

Table 3: The models and the different metrics on the fastMRI dataset. The models perform very similarly.

The models have the same score for each metric except SSIM. However, the p-value is 0.1539, so the result of the benchmark is not significantly better.


7 Discussion

From the results we can conclude that the FF-net outperforms the benchmark on the MNIST and STL-10 datasets, and that CT-net and LFFR underperform on all datasets.

When we use a higher depth for the CT-net and LFFR, the models start to perform even worse. This is consistent with our explanation of why the models underperform: in the CT-net a higher depth means more trainable parameters and a smaller linear layer, and in the LFFR we also get more trainable parameters as the depth increases. It seems that the network is not suited to finding the right parameters. The parameters in the FFT are roots of unity [10] and are thus complex-valued. We tried to use a real-valued neural network with two channels for the complex values instead of a complex neural network. Complex neural networks are networks that work with complex parameters and complex arithmetic. Moenning and Manandhar [23] showed that complex-valued neural networks are better than real-valued neural networks for tasks with complex values. Hoffmann et al. [24] also showed that using real numbers for complex data resulted in underperformance. This might explain the underperforming results of the LFFR and CT-net.

Besides the CT-net and LFFR, the FF-net also had some interesting results. Firstly, the results of the FF-net got worse when the depth increased: with a depth of 3, the FF-net had an L1 loss of 0.043(6) on STL-10, compared to 0.040(6) with a depth of 1. One would expect it to get better, since the network is getting deeper. We tested whether separate networks for each depth made a difference, but they did not. It might be that the network needs different parameters to finetune the model at a higher depth. However, this result did not obstruct our goal of improving the results on the fastMRI dataset, since the FF-net with a depth of 1 performed better than the benchmark on the STL-10 dataset. We therefore decided not to focus on that problem.

Secondly, the FF-net did not outperform the benchmark on the fastMRI dataset, while it did on the other datasets. At first we tried different hyperparameters, other activation functions and entirely different architectures, without success. Therefore, the problem might lie in the different datasets and in how the U-net and FF-net behave differently on them. We first made histograms of the pixel values of the normalized input image (the IFFT of the masked k-space) to see if the distributions differ. These histograms can be seen in Figure 15.


Figure 15: Histograms of normalized pixel values of a batch from the fastMRI (a) and STL-10 (b) datasets. The distributions show that fastMRI images have very similar pixel values, whereas the STL-10 images have very different pixel values.

One can see from the plots that the distributions of normalized pixel values are different. The fastMRI histogram looks normally distributed, whereas STL-10 has a long tail. This means that the pixel values in fastMRI lie close together, so it is harder for models to make large mistakes, whereas the pixel values of STL-10 lie further apart, which can result in larger differences. On top of that, we also made an image of the mean of 500 training samples of each dataset. These images can be seen in Figure 16.

Figure 16: The mean image of 500 training samples of the fastMRI (a) and STL-10 (b) datasets. The STL-10 image resembles nothing in particular, whereas the fastMRI image has the general structure of a knee in the middle.

Here, we can see that the mean image of STL-10 resembles nothing in particular, whereas the fastMRI image has the structure of a knee. This means that the fastMRI pixel values are not only close to each other, but that the structure of the images is very similar, because it has a knee structure. Therefore, improving on the fastMRI dataset means that we need to improve on small details; otherwise we get very similar results. Based on the evidence above, it seems that the FF-net helps when there are large differences in images and pixel values, but not when it comes to small details. The U-net, on the other hand, is especially good with small amounts of data [16] and might be better at differentiating small details. As a result, the benchmark is just as good on the fastMRI dataset, because the U-net is the most important part of the model.


8 Conclusion

In this research we made three models based on the fast Fourier transform: CT-net, LFFR and FF-net. We tested these models on the MNIST, STL-10 and fastMRI datasets, comparing them to the inverse fast Fourier transform followed by a U-net as our benchmark. The CT-net and LFFR both underperformed compared to the benchmark. However, FF-net outperformed the benchmark with an L1 loss of 0.040(6) compared to 0.054(7) on the STL-10 dataset. It did not improve the results on fastMRI, but gave results similar to the benchmark. This is probably because the fastMRI dataset contains very similar images.


A Additional Results

A.1 STL-10

A.2 fastMRI

References

[1] A. Aibinu, Momoh Salami, A. A. Shafie, and Athaur Najeeb. MRI reconstruction using discrete Fourier transform: A tutorial. World Academy of Science, Engineering and Technology, 42, 2008.

[2] Neha Sharma, Vibhor Jain, and Anju Mishra. An analysis of convolutional neural networks for image classification. Procedia Computer Science, 132:377-384, 2018. doi: 10.1016/j.procs.2018.05.198.

[3] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard S. Zemel. Understanding the effective receptive field in deep convolutional neural networks. CoRR, abs/1701.04128, 2017. URL http://arxiv.org/abs/1701.04128.

[4] Rikiya Yamashita, Mizuho Nishio, Richard Do, and Kaori Togashi. Convolutional neural networks: an overview and application in radiology. Insights into Imaging, 9, 2018. doi: 10.1007/s13244-018-0639-9.

[5] Fast Fourier transform. https://en.wikipedia.org/wiki/Fast_Fourier_transform. Accessed: 2019-12-12.

[6] Understanding the FFT algorithm. https://jakevdp.github.io/blog/2013/08/28/understanding-the-fft/. Accessed: 2019-12-12.

[7] Jure Zbontar, Florian Knoll, Anuroop Sriram, Matthew J. Muckley, Mary Bruno, Aaron Defazio, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, Zizhao Zhang, Michal Drozdzal, Adriana Romero, Michael Rabbat, Pascal Vincent, James Pinkerton, Duo Wang, Nafissa Yakubova, Erich Owens, C. Lawrence Zitnick, Michael P. Recht, Daniel K. Sodickson, and Yvonne W. Lui. fastMRI: An open dataset and benchmarks for accelerated MRI. CoRR, abs/1811.08839, 2018. URL http://arxiv.org/abs/1811.08839.

[8] MRI, CT and PET scan times. https://info.shields.com/bid/43435/MRI-CT-and-PET-Scan-Times. Accessed: 2019-11-19.

[9] Gregory Ongie, Ajil Jalal, Christopher A. Metzler, Richard G. Baraniuk, Alexandros G. Dimakis, and Rebecca Willett. Deep learning techniques for inverse problems in imaging, 2020.

[10] Cooley-Tukey FFT algorithm. https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm. Accessed: 2019-12-12.

[11] Decimation-in-time (DIT) radix-2 FFT. https://cnx.org/contents/zmcmahhR@7/Decimation-in-time-DIT-Radix-2-FFT. Accessed: 2019-12-12.

[12] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. The MIT Press, 2nd edition, 2001.

[13] Decimation-in-frequency (DIF) radix-2 FFT. https://cnx.org/contents/XaYDVUAS@6/Decimation-in-Frequency-DIF-Radix-2-FFT. Accessed: 2019-12-12.

[14] Vector-radix FFT algorithm. https://en.wikipedia.org/wiki/Vector-radix_FFT_algorithm. Accessed: 2020-01-21.

[15] Michael Lustig, David Donoho, and John Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58:1182-1195, 2007. doi: 10.1002/mrm.21391.

[16] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351 of LNCS, pages 234-241. Springer, 2015. URL http://lmb.informatik.uni-freiburg.de/Publications/2015/RFB15a. (Available on arXiv:1505.04597 [cs.CV].)

[17] Kai Lønning, Patrick Putzky, Matthan W. A. Caan, and Max Welling. Recurrent inference machines for accelerated MRI reconstruction. 2018.

[18] Puyang Wang, Eric Z. Chen, Terrence Chen, Vishal M. Patel, and Shanhui Sun. Pyramid convolutional RNN for MRI reconstruction, 2019.

[19] Cagla Deniz Bahadir, Adrian V. Dalca, and Mert R. Sabuncu. Learning-based optimization of the under-sampling pattern in MRI. Information Processing in Medical Imaging, pages 780-792, 2019. doi: 10.1007/978-3-030-20351-1_61. URL http://dx.doi.org/10.1007/978-3-030-20351-1_61.

[20] Anuroop Sriram, Jure Zbontar, Tullie Murrell, Aaron Defazio, C. Lawrence Zitnick, Nafissa Yakubova, Florian Knoll, and Patricia Johnson. End-to-end variational networks for accelerated MRI reconstruction, 2020.

[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014. URL http://arxiv.org/abs/1412.6980. Published as a conference paper at ICLR 2015.

[22] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600-612, 2004.

[23] Nils Moenning and Suresh Manandhar. Complex- and real-valued neural network architectures, 2018.

[24] Jordan Hoffmann, Simon Schmitt, Simon Osindero, Karen Simonyan, and Erich Elsen. AlgebraNets, 2020.
