Background estimation of images of the radio sky using U-Net


Layout: typeset by the author using LaTeX.


Nina Verheijen
11764198

Bachelor thesis
Credits: 18 EC
Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Dhr. D.J.J. Ruhe
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 907
1098 XG Amsterdam


Abstract

One of the main current objectives in radio astronomy is to detect transient objects in the low-frequency spectrum of the radio sky. Currently, this is done by the LOFAR Transients Pipeline, which is based on the LOFAR radio interferometer. The pipeline automates the process of identifying transients. However, the pipeline is not fast enough to be used in a real-time setting. In this thesis, a machine learning based system was created that approximates the first part of the pipeline, background estimation, but at real-time speed. This was done by implementing a neural network with a U-Net architecture. To train and test the network, a subset of a survey from LOFAR was used. The targets were taken from the LOFAR Transients Pipeline. Test results indicated that the network was able to generate output that was similar to that from the pipeline, while also being faster. However, to know for sure whether the U-Net is accurate enough, the background estimated by the U-Net should be used for source detection, to see if it would find the same number of sources as with a background estimated by the pipeline.


Contents

1 Introduction
    1.1 Problem Statement
    1.2 Related Work
        1.2.1 SegNet
        1.2.2 U-Net
        1.2.3 BGnet
    1.3 Research Question
2 Theoretical background
    2.1 LOFAR
    2.2 AARTFAAC
    2.3 The LOFAR Transients Pipeline
    2.4 Machine Learning
    2.5 Neural Networks
    2.6 Convolutional Neural Networks
        2.6.1 CNNs and classification problems
        2.6.2 CNNs and regression problems
    2.7 U-Net
3 Methodology
    3.1 The Datasets
    3.2 The CNN baseline
    3.3 The U-Net
    3.4 Data-Analysis
4 Experiments
5 Results
6 Evaluation
    6.1 Discussion
    6.2 Conclusion
    6.3 Future research


Introduction

1.1 Problem Statement

One of the most rapidly developing areas of astronomy is the recognition of rare features in data streams in near-real time. A new objective is to look beyond the objects that are always, or most of the time, present, and to find so-called ‘transient’ objects that sometimes appear in the background. Images of the radio sky, which are used to find transient objects, are generated by the Amsterdam-ASTRON Radio Transients Facility And Analysis Centre (AARTFAAC) all-sky monitor, a real-time transient detector based on the LOw Frequency ARray (LOFAR). Currently, deterministic pipelines have been created to model the radio sky. The pipeline consists of the generation of a noise model and a background model, followed by the detection and fitting of sources. However, while the pipeline is effective, it is too computationally expensive to use in a real-time setting.

Machine learning might offer a solution, because it is known for its wide range of applications and its ability to adapt and provide solutions to complex problems. By developing fast and efficient algorithms, a machine learning based system might be able to approximate the pipeline at a much higher processing speed. The overall objective is to create a system that is able to estimate and extract the background and use the resulting images to detect transient objects. The focus of this thesis was on creating a machine learning based system for background estimation.

In the next section, the related work will be described. In chapter 2, some theoretical background about the LOFAR TraP, machine learning and neural networks will be given. In chapter 3, the general methodology will be described. In chapter 4, the details of the experiments are given. The results are presented in chapter 5. Chapter 6 contains the discussion, conclusion and future research.


1.2 Related Work

1.2.1 SegNet

SegNet is a deep convolutional network with an encoder-decoder architecture that is used for image segmentation (Badrinarayanan, Kendall, & Cipolla, 2017). The architecture of the encoder part of the network is similar to that of a convolutional network. The role of the decoder part is to map the low-resolution encoder feature maps to high-resolution feature maps for pixel-wise classification. It upsamples its lower-resolution input feature maps and uses pooling indices, computed in the max-pooling step of the corresponding encoder, to perform non-linear upsampling. SegNet only stores the pooling indices instead of the whole feature map to save memory. However, this does result in a slight loss of accuracy. A network with a similar structure that does store the entire feature map is U-Net.

1.2.2 U-Net

U-Net is a convolutional network with an architecture similar to that of SegNet. Like SegNet, U-Net consists of a contracting path to capture context and a symmetric expansive path that enables precise localization (Ronneberger, Fischer, & Brox, 2015). The difference between U-Net and SegNet is that where SegNet stores the pooling indices from the contracting path to be used again in the expansive path, U-Net stores the entire feature map. Although this costs more memory, it does result in higher accuracy.

Because U-Net has an expansive path for high-resolution output and stores the feature maps instead of only the pooling indices, it is able to use context and have accurate localization at the same time. Since the ultimate goal is source detection, high-resolution output and accuracy are important. That is why U-Net seems more suitable for this project than SegNet. One of the neural networks that uses a U-Net-type architecture is BGnet.

1.2.3 BGnet

BGnet used deep neural networks to extract the structured background from images (Möckl, Roy, Petrov, & Moerner, 2019). BGnet was used to extract the background in microscopic images, in order to more accurately localize single molecules. It provided a robust and easy-to-implement method to correct point-spread function (PSF) images for structured background (sBG). BGnet was implemented in Keras with TensorFlow and trained on a standard desktop personal computer. The results indicated that the quality of the PSF shapes after sBG correction strongly improved. Importantly, the process was also very fast: 3,500 to 5,000 PSFs were analyzed in 4 to 30 s.

1.3 Research Question

Earlier research showed that a network with an encoder-decoder architecture, such as SegNet, gave good results for pixel-wise image segmentation. A similar network, U-Net, was able to obtain even better results by storing entire feature maps instead of only pooling indices. Therefore, U-Net is able to use context and have accurate localization at the same time. Thus, the question is: can a U-Net accurately filter out the background noise of radio sky images at real-time speed?

Since BGnet proved that a U-Net-like network can provide a fast way to correct images for structured background, it is to be expected that a U-Net will also perform well on a dataset with images of the radio sky.


Theoretical background

An overview will be given of LOFAR, AARTFAAC, the LOFAR TraP and the role they play in the detection of transient objects. This will be followed by some information about machine learning, neural networks, in particular convolutional networks and, most importantly, U-Net, and why it was selected for the current problem.

2.1 LOFAR

LOFAR is a new-generation radio interferometer located in the north of the Netherlands and across Europe (van Haarlem et al., 2013). LOFAR covers the largely uncharted low-frequency range from 10-240 MHz, which is divided into a low band (10-80 MHz) and a high band (110-240 MHz), and provides unique observing capabilities. Due to digital beam-forming techniques, the LOFAR system is agile and allows for rapid repointing of the telescope as well as the possibility of multiple simultaneous observations. LOFAR is one of the first radio observatories to feature automated processing pipelines, which can be used to distribute completely calibrated science products to its user community.

2.2 AARTFAAC

The AARTFAAC all-sky monitor is a real-time transient detector based on LOFAR (Prasad et al., 2016). It uses data from a subset of LOFAR and processes these data independently of LOFAR. It generates images of the low-frequency radio sky, while also observing with LOFAR. If AARTFAAC detects a transient, a trigger will be generated for LOFAR, which can interrupt its schedule for follow-up observations with a higher sensitivity and a higher resolution. AARTFAAC is capable of managing the 240 GBs of raw data that it obtains from LOFAR and of producing real-time images. Besides the generation of reliable triggers for transients, its secondary data products, like all-sky images, are useful in a variety of science cases.

2.3 The LOFAR Transients Pipeline

The LOFAR Transients Pipeline, or TraP, was developed to automate the process of identifying and responding to transients, one of the main science goals of LOFAR (Swinbank et al., 2015). The automation of this process was essential due to the large data volumes that LOFAR produces and the scientific requirement for rapid response. The identification of transient and variable sources is done in two ways. One is when a source appears at a location where no source was seen in previous epochs. The other is when sources that have been observed for multiple epochs show significant variability in their light curves. While the pipeline has been proven to be accurate, there is still much room for improvement in terms of performance.

2.4 Machine Learning

Machine learning makes it possible to complete tasks that are too difficult to solve with fixed programs written and designed by humans (Goodfellow, Bengio, & Courville, 2016). There are many kinds of tasks that can be solved using machine learning. Common examples are classification tasks and regression tasks. In a classification task, the program is asked to determine to which of a given set of categories the input belongs, for example the recognition of animals in images. In a regression task, the program is asked to predict a numerical value given some input. In order to evaluate a machine learning algorithm, there must be a quantitative measurement of its performance, such as accuracy or error rate. Which indicator is used depends on the task that is being performed by the system.

Machine learning algorithms can be separated into supervised and unsupervised learning algorithms. With unsupervised learning, the program is given a dataset and is expected to learn useful properties of the structure of that dataset. With supervised learning, the program is given a dataset in which each example is associated with a label or target.

2.5 Neural Networks

One of the ways machine learning is applied is through neural networks. Neural networks consist of an assortment of algorithms which can be used for data modelling using graphs of neurons, mimicking the way the human brain operates. A neural network consists of multiple functions, called layers, which are connected in a chain. The layers between the input and output layer, if any, are called hidden layers. An example of a hidden layer is

z = h(w^T x + b)

where h(·) is a non-linearity, w are the weights, b is the bias and x is the input.

The number of layers determines the depth of the model. The dimensionality of the layers gives the width of the model. The input of a layer is linear; however, the output needs to be non-linear in order to be useful. Activation functions are used to introduce this non-linearity into the network. The most used activation function is the Rectified Linear Unit (ReLU) function (Jarrett, Kavukcuoglu, Ranzato, & LeCun, 2009). When ReLU is used, the input is returned directly if it is positive; if it is negative, zero is returned. When using deep neural networks, networks with one or more hidden layers, gradients of complex functions need to be computed. Therefore, the back-propagation algorithm is used.
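As an illustration of the hidden layer formula above, the following minimal PyTorch sketch (not taken from the thesis; the layer sizes are arbitrary) computes z = h(w^T x + b) with h chosen as ReLU:

```python
import torch
import torch.nn as nn

# A single hidden layer: an affine map followed by a ReLU non-linearity.
# The sizes (16 inputs, 8 hidden units) are arbitrary illustration values.
hidden = nn.Sequential(
    nn.Linear(16, 8),  # computes w^T x + b
    nn.ReLU(),         # h(.): keeps positive values, sets negative values to zero
)

x = torch.randn(1, 16)  # one example with 16 features
z = hidden(x)           # z = h(w^T x + b)
print(z.shape)          # torch.Size([1, 8])
```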

The back-propagation algorithm allows information from a layer to flow backward through the network in order to compute the gradient (Rumelhart, Durbin, Golden, & Chauvin, 1995). Back-propagation only computes the gradient; other algorithms, such as stochastic gradient descent, are needed to learn how to use this gradient. Gradient descent is an optimization algorithm used to find a local minimum of a function, for example a loss function (Cauchy et al., 1847). Stochastic gradient descent is an extension of gradient descent; it works similarly but uses minibatches instead of the whole training set, to lower the computational cost.
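In formula form (a standard statement of the update rule, not quoted from the thesis), one stochastic gradient descent step on a minibatch B with learning rate η updates the parameters θ as follows:

```latex
\theta \leftarrow \theta - \eta \, \nabla_{\theta} \, \frac{1}{|B|} \sum_{(x, y) \in B} L\big(f(x; \theta),\, y\big)
```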

Training aims to minimize the cost function by adjusting the weights and biases of the network. How much the weights and biases are adjusted depends on the gradients of the cost function. LeCun et al. (1989) showed that back-propagation networks can be applied to real image-recognition problems without requiring a large, complex preprocessing stage. Nowadays, back-propagation is part of almost all neural networks that are used for object detection, chatbots and other such applications.

2.6 Convolutional Neural Networks

Convolutional neural networks (CNNs) are neural networks specialized for processing data that has a known grid-like topology, for example time-series data or image data (Goodfellow et al., 2016). CNNs are more suitable than regular neural networks when working with images, because CNNs take advantage of the local connectivity of images. This means that they can gather information from adjacent pixels by using convolutions on patches of adjacent pixels. CNNs are called convolutional because they apply a convolution in at least one of their layers. A typical layer of a convolutional network consists of three parts. In the first part, several convolutions are performed in parallel, which gives a set of linear outputs, or activations. In the second part, each linear activation is run through a non-linear activation function. In the third part, a pooling function is used, like the max pooling function. The use of a pooling function helps to make the representation approximately invariant to small translations of the input. Pooling improves the computational efficiency of the network, because it lowers the number of inputs for the next layer. CNNs are often used for image classification tasks because of their high accuracy. However, if pooling is used on images, it will also reduce the resolution of the output.
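A minimal PyTorch sketch of the three-part layer described above (illustrative channel counts and kernel sizes, not the configuration used in this thesis):

```python
import torch
import torch.nn as nn

# One "typical" convolutional layer: convolution -> non-linearity -> pooling.
conv_layer = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),  # parallel convolutions -> linear activations
    nn.ReLU(),                                                           # non-linear activation function
    nn.MaxPool2d(kernel_size=2),                                         # pooling halves the spatial resolution
)

x = torch.randn(1, 1, 28, 28)   # a batch with one grayscale image
print(conv_layer(x).shape)      # torch.Size([1, 8, 14, 14])
```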

2.6.1 CNNs and classification problems

Convolutional networks have delivered outstanding results in many visual recognition tasks, but their success was limited by the size of the networks and the size of the available training sets. A breakthrough was made by Krizhevsky et al. by training a large network with eight layers and millions of parameters on the ImageNet dataset with 1 million training images (Krizhevsky, Sutskever, & Hinton, 2017). This indicated that a deep CNN can deliver exceptional results on a challenging dataset. Their results improved as the network became larger and was trained longer. Since the goal of this project is to create a system that is faster than the LOFAR TraP, a deep CNN might not be fast enough, because it needs too much time to train and is too large.

2.6.2 CNNs and regression problems

Usually, convolutional networks are used for classification tasks, where the output for an image is a single class label. However, in many visual tasks, for example in biomedical image processing, the desired output should include localization, which means that each pixel needs to be labeled. Hence, Ciresan, Giusti, Gambardella, and Schmidhuber (2012) trained a network in a sliding-window setup to predict the class label of each pixel. This was done by providing a local region (patch) around a pixel as input. Although this strategy was effective, it had two drawbacks. First, it was quite slow, because the network had to be run separately for each patch and there was a lot of redundancy due to overlapping patches. Furthermore, there was a trade-off between localization accuracy and the use of context. Small patches only granted little context to the network, while larger patches required more max-pooling layers, which reduced the localization accuracy.


2.7 U-Net

U-Net is a convolutional network whose architecture consists of a contracting path to capture context and a symmetric expansive path that enables precise localization. To get a high-resolution output, a contracting network is expanded by the addition of successive layers, in which pooling operators are replaced by upsampling operators. This architecture is presented in figure 2.1. High-resolution features from the contracting path are combined with the output of the upsampling path. In this way, a successive convolution layer can learn to create a more precise output. At every downsampling step in the contracting path, the number of feature channels is doubled. Every upsampling step in the expansive path halves the number of feature channels. At the final layer, a 1x1 convolution is used to map each feature to the desired number of classes. U-Net was mostly used for biomedical image segmentation, but because of its efficient use of available data, high-resolution output and speed, it is also suitable for background estimation.

Figure 2.1: U-Net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.

Source: (Ronneberger et al., 2015)


Methodology

The networks that were used were trained and tested using the PyTorch Lightning API. The program was run on a workstation with a GPU with 11 GB of memory. The workstation was running Ubuntu 18.04.4 LTS.

3.1 The Datasets

Two datasets were used for this project. The first one was the MNIST dataset, which consists of 28x28 pixel grayscale images of handwritten single digits between 0 and 9. It has a training set of 60,000 examples and a test set of 10,000 examples (LeCun & Cortes, 2010). The other dataset consisted of images of the radio sky from a LOFAR survey from September 1st 2019. LOFAR provides images at 16 different frequencies divided over two frequency ranges, 57.61 MHz - 58.98 MHz and 61.13 MHz - 62.5 MHz. For this project, 1022 images of 1024x1024 pixels from the lowest frequency, 57.61 MHz, were used. In addition, for each image of the radio sky, a corresponding background image was given, provided by the LOFAR TraP. The images in the dataset had quite some noise around the edges. To solve that, the images were cropped so that only the middle 700x700 pixels were used. In addition, there were extreme values in the data. To resolve that, the images were also standardized and clamped between 0 and 5. The images were standardized by subtracting the mean pixel value and dividing the result by the standard deviation of the pixel values. The data was split into fractions of 0.5 for the training set, 0.25 for the validation set and 0.25 for the test set.
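A minimal sketch of the preprocessing described above (the crop size, clamp range and standardization follow the text; the function itself is illustrative and not the thesis code):

```python
import torch

def preprocess(image: torch.Tensor, crop: int = 700, clamp_max: float = 5.0) -> torch.Tensor:
    """Centre-crop, standardize and clamp a 1024x1024 radio sky image."""
    h, w = image.shape[-2:]
    top, left = (h - crop) // 2, (w - crop) // 2
    image = image[..., top:top + crop, left:left + crop]   # keep only the middle 700x700 pixels
    image = (image - image.mean()) / image.std()           # subtract the mean, divide by the standard deviation
    return image.clamp(0.0, clamp_max)                     # remove extreme values by clamping between 0 and 5
```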

3.2 The CNN baseline

First, a basic CNN was implemented and tested using the MNIST dataset. The MNIST dataset was used because it is easy to use and fairly small, so the testing did not take too much time. The CNN that was used as a baseline consisted of three convolution layers, each followed by a ReLU activation function. The first layer had 1 input channel, because grayscale images were used, and 8 output channels. The second layer consequently had 8 input channels and 16 output channels. The last layer had 16 input channels and 1 output channel, since the output was also a grayscale image. The input images were generated by taking the images from the MNIST dataset and adding Gaussian noise. Since the PyTorch Lightning API was used, a lot of steps, such as adding the back-propagation algorithm, optimization and creating a loop to train the model, were automated by the Lightning Trainer. The Adam optimizer was used because it is able to deal with sparse gradients and can handle non-stationary objectives (Kingma & Ba, 2014). To prevent the network from overfitting on the training dataset, an early stopping criterion was used. Early stopping kept track of the validation loss; if the loss stopped decreasing for several epochs in a row, training was stopped.
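The baseline described above could look roughly like the following PyTorch sketch (the channel counts follow the text; the kernel size and padding are assumptions):

```python
import torch.nn as nn

class BaselineCNN(nn.Module):
    """Baseline: three convolution layers (1 -> 8 -> 16 -> 1 channels), each followed by a ReLU."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)
```

In PyTorch Lightning, the early stopping criterion described above can be implemented by passing an EarlyStopping callback that monitors the validation loss to the Trainer.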

3.3 The U-Net

Next, the CNN was converted to a U-Net by adding more layers, including expanding layers which ensured a high-resolution output. The architecture is illustrated in figure 3.1. The U-Net was divided into two parts. The left side had three so-called ‘contracting blocks’. A contracting block had two convolutional layers, each followed by a batch normalization and a ReLU activation function (Ioffe & Szegedy, 2015). The last layer of the contracting block was a max pool layer.

The right side had three so-called ‘expanding blocks’, which were the same as the contracting blocks, only the max pool layer was replaced by a transpose convolutional layer (Dumoulin & Visin, 2016). The right side also differs from the left side in that, on the right side, the feature map from the left side was combined with the feature map on the right side to create an output with a higher resolution than a normal CNN would give.

On the left side, the first layer had 1 input channel, similar to the CNN, and 32 output channels. The second layer had 32 input channels and 64 output channels. The third layer had 64 input channels and 128 output channels. Then on the right side, starting at the bottom, the third layer had 128 input channels and 64 output channels. The second layer had 64 input channels and 32 output channels. The first layer had 32 input channels and 1 output channel.
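A minimal sketch of the contracting and expanding blocks described above (assuming 3x3 convolutions with padding and 2x2 pooling/upsampling; the exact kernel sizes and the ordering inside the expanding block are not stated in the text and are illustrative):

```python
import torch
import torch.nn as nn

class ContractingBlock(nn.Module):
    """Two (conv -> batch norm -> ReLU) steps followed by max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.convs(x)            # kept for the skip connection to the expanding path
        return self.pool(skip), skip

class ExpandingBlock(nn.Module):
    """Upsample with a transpose convolution, concatenate the skip connection, then convolve."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2)
        self.convs = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        )

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)  # combine left- and right-side feature maps
        return self.convs(x)
```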

The U-Net was also tested with the MNIST dataset. In the next step, the dataset from LOFAR was introduced. In order to do this, a custom dataset class was made to load and transform the data. The images were stored in Flexible Image Transport System (FITS) format and the data first needed to be loaded into NumPy arrays in order to use it. After that, the images were rescaled to their original size, 1024x1024 pixels, and transformed into PyTorch Tensors.
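A sketch of such a custom dataset class (assuming astropy is used to read the FITS files; the class name, file handling and transform hook are illustrative, not the thesis implementation):

```python
import numpy as np
import torch
from astropy.io import fits
from torch.utils.data import Dataset

class LofarBackgroundDataset(Dataset):
    """Pairs of (radio sky image, TraP background image) loaded from FITS files."""
    def __init__(self, image_paths, target_paths, transform=None):
        self.image_paths = list(image_paths)
        self.target_paths = list(target_paths)
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # fits.getdata reads the image data from a FITS file into a NumPy array
        image = fits.getdata(self.image_paths[idx]).astype(np.float32)
        target = fits.getdata(self.target_paths[idx]).astype(np.float32)
        image = torch.from_numpy(image).unsqueeze(0)    # add a channel dimension
        target = torch.from_numpy(target).unsqueeze(0)
        if self.transform is not None:
            image, target = self.transform(image, target)
        return image, target
```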


Figure 3.1: Architecture of the U-Net. The numbers show the number of channels.

3.4 Data-Analysis

The output of the U-Net was compared with the output from the LOFAR TraP to see if it was accurate. To do this, the mean squared error, structural similarity and peak signal-to-noise ratio were measured. Since the problem was a regression problem, the mean squared error (MSE) was used as the loss function. The mean squared error indicates how close a match the output of the network was to the target image. The peak signal-to-noise ratio (PSNR) was used as a quality measurement between the original and the reconstructed image. The higher the PSNR, the better the quality of the reconstructed image. The structural similarity index measure (SSIM) is a metric that quantifies image quality degradation. The difference between SSIM and MSE or PSNR is that the latter two estimate absolute errors, whereas structural similarity builds on the idea that pixels have strong inter-dependencies, which carry important information about the structure of the objects in the image. The time per image was used to see if the U-Net was faster than the LOFAR TraP.
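The three metrics can be computed, for example, with NumPy and scikit-image as in the sketch below (one possible implementation, not the thesis code; the data range passed to PSNR and SSIM is an assumption based on the clamp range used earlier):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(output: np.ndarray, target: np.ndarray, data_range: float = 5.0):
    """Return (MSE, PSNR, SSIM) between a network output and the TraP background target."""
    mse = float(np.mean((output - target) ** 2))
    psnr = peak_signal_noise_ratio(target, output, data_range=data_range)
    ssim = structural_similarity(target, output, data_range=data_range)
    return mse, psnr, ssim
```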

TensorBoard was used to log and visualize the output. In order to better visualize the quality of the images in TensorBoard, the images from the LOFAR dataset were min-max normalized before being logged, and the images from the MNIST dataset were clamped between 0 and 1.
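The min-max normalization used for logging can be as simple as the following helper (illustrative sketch only):

```python
import torch

def min_max_normalize(image: torch.Tensor) -> torch.Tensor:
    """Rescale an image to the range [0, 1] so TensorBoard displays it with full contrast."""
    return (image - image.min()) / (image.max() - image.min() + 1e-8)
```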


Experiments

To start with, the CNN was run ten times to establish a baseline. After that, the U-Net was also run ten times. For both the CNN and the U-Net, the time per image, mean squared error, peak signal-to-noise ratio and structural similarity were computed. The batch size was set as high as possible without running out of memory; in this case, a batch size of 5. The baseline CNN was not able to overfit due to its simplicity and was therefore not able to trigger the early stopping criterion. Hence, the maximum number of epochs was set to 100.


Results

The results from the U-Net and the CNN using the MNIST dataset are presented in figure 5.1. Figures 5.1a and 5.1d show the input that was given to the networks, which is the target image with noise added. Figure 5.1b shows the output from the U-Net and figure 5.1e shows the output from the CNN. The output from the CNN contains more noise than the output from the U-Net, but in both cases the digits were clearly visible.

Figure 5.1: Results from the CNN baseline and the U-Net using the MNIST dataset. Panels: (a) U-Net input, (b) U-Net output, (c) U-Net target, (d) CNN input, (e) CNN output, (f) CNN target.

As mentioned before, the images from the LOFAR dataset were preprocessed before going through the network and before being logged; this process is presented in figure 5.2. The first row presents the input, target and output of the network after only min-max normalization was applied before logging. The second row presents the input, output and target images after they have been standardized, clamped and min-max normalized. The last row presents the input, output and target after they have been standardized, clamped, cropped and min-max normalized. A lot more detail was visible after the images were preprocessed.

Figure 5.2: The process of preprocessing the images to remove noise and extreme values and to better visualize them in TensorBoard. The first row presents the original input, output and target after min-max normalization only. The second row presents the images after being standardized, clamped and min-max normalized. The last row presents the images after they have been standardized, clamped, cropped and min-max normalized.


Figure 5.3 presents what the output of both the CNN and the U-Net looked like. Figure 5.3a presents the input of the U-Net, figure 5.3b the output of the U-Net and figure 5.3c the target. Figure 5.3d presents the input image of the CNN, figure 5.3e the output of the CNN and figure 5.3f the target of the CNN. The input images were images from the dataset taken from a LOFAR survey. The targets were output from the LOFAR TraP and represent the background corresponding to the input images.

Figure 5.3: Comparison between the output of the CNN and the U-Net. Panels: (a) U-Net input, (b) U-Net output, (c) U-Net target, (d) CNN input, (e) CNN output, (f) CNN target.

Figure 5.4 presents the mean squared error of the CNN baseline and the U-Net against the number of steps taken. It indicates that, over time, the mean squared error of the CNN slowly decreases, whereas the mean squared error of the U-Net decreases more rapidly, before increasing briefly near the end.


Figure 5.4: The mean squared error of the U-Net and the CNN

The results of the CNN in table 5.1 indicate that the CNN had a minimum time per image of 0.0346 seconds and a maximum time per image of 0.0363 seconds. It had a minimum mean squared error of 0.5888 and a maximum of 0.6180. The peak signal-to-noise ratio had a minimum of 18.1772 and a maximum of 18.4383. Lastly, the CNN had a minimum structural similarity of 0.3090 and a maximum structural similarity of 0.3941.

test #  time per image (s)  MSE     PSNR     SSIM
1       0.0363              0.5938  18.3720  0.3822
2       0.0353              0.5978  18.3306  0.3670
3       0.0355              0.5888  18.4174  0.3714
4       0.0360              0.5874  18.4383  0.3491
5       0.0358              0.5973  18.3382  0.3478
6       0.0348              0.5936  18.3605  0.3662
7       0.0356              0.5866  18.4274  0.3941
8       0.0354              0.5933  18.3695  0.3807
9       0.0346              0.6180  18.1772  0.3090
10      0.0358              0.5928  18.3901  0.3402

Table 5.1: Metrics of the CNN; per test, the time per image (s), Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM).

The results from the U-Net, in table 5.2, show a minimum time per image of 0.1031 seconds and a maximum time per image of 0.1041 seconds. The mean squared error had a minimum of 0.0897 and a maximum of 0.2656. The peak signal-to-noise ratio had a minimum of 21.9103 and a maximum of 26.8213. The structural similarity had a minimum of 0.6103 and a maximum of 0.7625.

test #  time per image (s)  MSE     PSNR     SSIM
1       0.1041              0.1450  24.7368  0.7429
2       0.1035              0.0988  26.5127  0.7479
3       0.1035              0.1372  24.7578  0.7006
4       0.1038              0.1402  24.7575  0.6200
5       0.1036              0.0897  26.7971  0.7408
6       0.1034              0.0889  26.8213  0.7625
7       0.1032              0.1259  25.2075  0.7218
8       0.1031              0.1851  23.8442  0.6916
9       0.1042              0.2656  21.9103  0.6103
10      0.1036              0.0996  26.2050  0.7372

Table 5.2: Metrics of the U-Net; per test, the time per image (s), Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM).

Table 5.3 indicates that the U-Net had an average time per image of 0.1036 seconds with a standard deviation of 3.3466e−4. The mean squared error had an average of 0.1376 with a standard deviation of 0.0514. The peak signal-to-noise ratio had an average of 25.1550 with a standard deviation of 1.4546. The structural similarity had an average of 0.7076 with a standard deviation of 0.0505. When comparing these results to the CNN baseline in table 5.3, the average time per image of the CNN was lower, but its mean squared error was higher, its peak signal-to-noise ratio was lower and its structural similarity was also lower.

        time per image (s)     MSE                  PSNR                 SSIM
CNN     0.0355 ± 4.9285e−4     0.5943 ± 8.8451e−3   18.3621 ± 7.058e−2   0.3608 ± 2.3572e−2
U-Net   0.1036 ± 3.3466e−4     0.1376 ± 0.0514      25.1550 ± 1.4546     0.7076 ± 0.0505

Table 5.3: Average metrics (± standard deviation) of the CNN baseline and the U-Net.


Evaluation

6.1 Discussion

The results from the MNIST dataset indicated that the U-Net had a better performance than the CNN baseline on a low-level representation of the data. Figure 5.2 indicated that it was indeed useful to preprocess the data. In the first row, the images lost a lot of detail because of the extreme values in the images. The standardization and the clamping of the images helped a bit. However, since there was still a lot of noise around the edges, there was still not much detail in the centre of the image. Only after the images were cropped did the centre of the images start to show more detail. Figure 5.4 indicated that the CNN was indeed not able to overfit and therefore not able to trigger the early stopping criterion.

The results of the CNN and the U-Net, in table 5.1 and table 5.2, indicated that the U-Net performed better than a normal CNN on every aspect except the time per image, which was higher than that of the baseline CNN. More importantly, the average time per image of the U-Net was 0.1036 seconds, with a standard deviation of 3.3466e−4, which was much faster than the LOFAR TraP, which had a time per image of 3.8890 seconds. Furthermore, the U-Net had an MSE of 0.1376 with a standard deviation of 0.0514, which indicates that it is a reliable network.

6.2 Conclusion

The intention of this thesis was to show that a U-Net can estimate the background of images of the radio sky accurately and at real-time speed. To do this, a U-Net was implemented, with a dataset consisting of images of the radio sky and corresponding backgrounds as input. The output of the U-Net was compared to the results from the LOFAR TraP.


The U-Net performed better than the CNN baseline; only the time per image was higher. When comparing the U-Net to the LOFAR TraP, the U-Net is significantly faster at background estimation. With an average mean squared error of 0.1376, it definitely comes close to the output of the LOFAR TraP. However, since source detection was not implemented, it cannot be said for sure whether the U-Net is accurate enough. This could be tested by using the estimated background of the U-Net as input for the second part of the LOFAR TraP. If the same number of sources can be detected using the U-Net background as when using the background from the LOFAR TraP, it would mean the U-Net was accurate enough to give results similar to those of the pipeline.

6.3 Future research

The first step for future research would be to use the background estimated by the U-Net for source detection, to see if it is really good enough to replace the first part of the LOFAR TraP. A next logical step would be to implement the second part of the LOFAR TraP, source detection and fitting. This could be done, for example, by using Deep Neural Networks (DNNs) and DNN-based object mask regression (Szegedy, Toshev, & Erhan, 2013). One downside to this approach is that for each object of a different class, a new network needs to be trained. However, since the LOFAR TraP only looks for one type of object, transient objects, this might still be an approach worth taking note of. In addition, there might also be room for improvement for the U-Net itself. Changes in, for example, the number of channels or the learning rate might lead to even better results.


References

Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481-2495. doi: 10.1109/TPAMI.2016.2644615

Cauchy, A., et al. (1847). Méthode générale pour la résolution des systemes d'équations simultanées. Comp. Rend. Sci. Paris, 25(1847), 536–538.

Ciresan, D., Giusti, A., Gambardella, L., & Schmidhuber, J. (2012). Deep neural networks segment neuronal membranes in electron microscopy images. Advances in neural information processing systems, 25, 2843–2851.

Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285 .

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. (http://www.deeplearningbook.org)

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (pp. 448–456).

Jarrett, K., Kavukcuoglu, K., Ranzato, M., & LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th international conference on computer vision (pp. 2146–2153).

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017, May). Imagenet classification with deep convolutional neural networks. Commun. ACM, 60(6), 84–90. Retrieved from https://doi.org/10.1145/3065386 doi: 10.1145/3065386

LeCun, Y., & Cortes, C. (2010). MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/. Retrieved 2016-01-14 14:24:11, from http://yann.lecun.com/exdb/mnist/

Möckl, L., Roy, A. R., Petrov, P. N., & Moerner, W. E. (2019, Dec). Accurate and rapid background estimation in single-molecule localization microscopy using the deep neural network bgnet. Proceedings of the National Academy of Sciences, 117(1), 60–67. Retrieved from http://dx.doi.org/10.1073/pnas.1916219117 doi: 10.1073/pnas.1916219117

Prasad, P., Huizinga, F., Kooistra, E., van der Schuur, D., Gunst, A., Romein, J., . . . Wijers, R. A. M. J. (2016). The aartfaac all sky monitor: System design and implementation.

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation.

Rumelhart, D. E., Durbin, R., Golden, R., & Chauvin, Y. (1995). Backpropagation: The basic theory. Backpropagation: Theory, architectures and applications, 1–34.

Swinbank, J. D., Staley, T. D., Molenaar, G. J., Rol, E., Rowlinson, A., Scheers, B., . . . Zarka, P. (2015). The lofar transients pipeline.

Szegedy, C., Toshev, A., & Erhan, D. (2013). Deep neural networks for object detection. Advances in neural information processing systems, 26 , 2553– 2561.

van Haarlem, M. P., Wise, M. W., Gunst, A. W., Heald, G., McKean, J. P., Hessels, J. W. T., . . . et al. (2013, Jul). Lofar: The low-frequency array. Astronomy & Astrophysics, 556 , A2. Retrieved from http://dx.doi.org/10.1051/0004-6361/201220873 doi: 10.1051/0004-6361/201220873
