
Exploring Opportunities to Improve Spot Detection in smRNA FISH Images with Convolutional Neural Networks

Harm Manders, 10677186

Bachelor thesis, 18 EC
Bachelor Artificial Intelligence, University of Amsterdam
Faculty of Science, Science Park 904, 1098 XH Amsterdam

Supervisors

Dr. S. van Splunter, Dr. P.J. Verschure
Informatics Institute, Faculty of Science, University of Amsterdam
Science Park 904, 1098 XH Amsterdam

July 15th, 2019


Abstract

Contents

1 Introduction
  1.1 Fluorescence microscopy and smRNA FISH
  1.2 Problem description
  1.3 Aim and approach
2 Related Work
  2.1 Theoretical background of the Laplacian of Gaussian
  2.2 Supervised methods for spot detection
  2.3 Convolutional neural networks in cell biology
3 Data
  3.1 Images
  3.2 Labels
4 Method
  4.1 Preprocessing
  4.2 Laplacian of Gaussian
  4.3 Parameter optimisation
5 Results
  5.1 Dataset exploration
  5.2 Parameter optimization
6 Conclusion
7 Discussion
References
Appendices
  A Image quality

1 Introduction

1.1 Fluorescence microscopy and smRNA FISH

Fluorescence microscopy is a technique in cell biology where fluorescent molecules, bound to specific parts of a cell, are excited with a specific wavelength of light and in turn emit light of a different wavelength. By capturing this emitted light, one can detect small regions in a cell marked by these fluorescent markers. This specificity and high sensitivity (even single molecules can be observed) make the technique very powerful for cell biological research [1]. Fluorescence microscopy has limitations both in spatial resolution, due to the diffraction of light, and in noise, generated by the detector itself and by the environment, which can lead to images with a low signal-to-noise ratio (SNR) [1]. To increase the SNR, one can either use a higher-intensity light source or a longer exposure time [2], but both would cause more photobleaching, the reduction of fluorescence due to the damaging effect of light on fluorescent particles [3]. Because of photobleaching, images are usually taken with low light intensities and short exposure times, and so have a very low SNR.

In the field of single-molecule RNA fluorescence in situ hybridisation (smRNA FISH), cells are treated with fluorescent markers that bind to RNA in the cell. This method allows researchers to visualise and quantify specific RNA molecules in individual cells [4], which provides information about RNA transcription and the activity of genes involved in, for example, cancer. Due to the sub-resolution scale of the RNA molecules and the diffraction of light, these RNA molecules are imaged as small, blurry spots inside and around the image of the nucleus. Quantifying these spots can help researchers determine the effectiveness of drug treatments and clearly observe the activity of cancer genes [5].

1.2 Problem description

The Swammerdam Institute for Life Sciences (SILS) at the University of Amsterdam (UvA) uses smRNA FISH to understand how cells become resistant to drug treatment based on cell-to-cell variability in RNA transcription response. They have a data set of 23,148 three-dimensional images. All images have one channel with transcription spots and one channel with the cell nucleus, and some images also contain other channels. This data set is divided into multiple image-sets, each containing between 200 and 600 images.

Currently, they use a MATLAB script that applies a semi-three-dimensional Laplacian of Gaussian [6] to detect the spots in their images. Detecting these spots is difficult because, besides the spots themselves, the cell body also emits a small amount of light, which results in non-uniform background noise. In some images, the intensity of this background noise is so strong that the spots are barely visible. This prevents the MATLAB script from processing these images reliably, so a better method is needed.

1.3 Aim and approach

To detect spots in these low-quality images, convolutional neural networks (CNNs) are an interesting area of research. CNNs have often been used in image analysis tasks, ranging from self-driving cars [7] to cellular biology [8].

In this research, I look at the possibilities CNNs offer to improve spot detection in low-quality images. Before that can be done, labels are needed that not only specify the number of spots in an image but also where those spots are located. Therefore, a method is proposed that estimates the locations of the spots based on the Laplacian of Gaussian. Its parameters are tuned so that the results of the proposed method correspond to the results generated by the MATLAB script, thereby validating the locations of the found spots.

In this thesis, I first review the current state of spot detection in microscopy images, the applications within cell biology where convolutional neural networks are already in use, and how this could benefit our research. Next, I propose a method to find spots in smRNA FISH images, providing both a robust spot detection algorithm and a way to optimize its parameters to match the labelled data as closely as possible. Finally, I look at what might be possible in the future now that there is a way to generate location-based labels for the entire data set.

2 Related Work

A typical image studied here has some specific characteristics: it is extremely noisy, it contains non-uniform background noise, and it contains a varying number (0 < n < 2000) of low-signal, resolution-limited spots, sometimes representing only a few fluorescent molecules. Detecting such spots generally requires three steps. First, a noise reduction step increases the SNR and improves the quality of the image; Gaussian smoothing is often used. Second, a signal enhancement step increases the brightness at places where there are spots while at the same time reducing the brightness of the background; this step is where spot detection algorithms differ most. Last, a thresholding step separates the foreground signal from the background noise. All three steps are an integral part of spot detection, although some detection algorithms combine the denoising and enhancement steps into a single step [9].
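The three-step pipeline above can be sketched as follows. This is a minimal illustration using scipy; the parameter values are invented for the example and are not the settings used in this thesis.

```python
import numpy as np
from scipy import ndimage

def detect_spots(image, sigma_noise=1.0, sigma_spot=2.0, threshold=0.1):
    """Minimal three-step spot detection sketch:
    1) denoise with Gaussian smoothing,
    2) enhance spots with a Laplacian-of-Gaussian filter,
    3) threshold to separate foreground from background.
    All parameter values here are illustrative defaults."""
    # Step 1: noise reduction (Gaussian smoothing).
    denoised = ndimage.gaussian_filter(image.astype(float), sigma=sigma_noise)
    # Step 2: signal enhancement. gaussian_laplace responds negatively to
    # bright blobs, so negate it to make spots bright.
    enhanced = -ndimage.gaussian_laplace(denoised, sigma=sigma_spot)
    # Step 3: thresholding; label connected foreground regions as spots.
    mask = enhanced > threshold
    labels, n_spots = ndimage.label(mask)
    return labels, n_spots
```

On a synthetic image with a single bright Gaussian spot, this returns one labelled region; real smRNA FISH images would of course need the threshold selection discussed later.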

Smal et al. [9] compared several techniques used to find spots in biological images. They compared several unsupervised signal enhancement methods, among them the Multiscale Variance-Stabilizing Transform Detector (MSVST), H-Dome Based Detection (HD), the Spot Enhancing Filter (SEF), the Grayscale Opening Top-Hat Filter (MTH), the Wavelet Multiscale Product (WMP), and the Top-Hat Filter (TH). In addition to the unsupervised methods, they also used two supervised (machine learning) methods: AdaBoost (AB) and Fisher Discriminant Analysis (FDA).

The methods were tested on synthetic images with varying spot sizes, spot shapes, and varying signal-to-noise ratios (SNR). They found that if the SNR is relatively high (> 5) the performance difference between all methods becomes negligible, but when the SNR is low (≈ 2) the supervised methods perform best overall. Among the unsupervised methods, they found that MSVST, HD, SEF, and MTH give the best results, although none of these methods outperforms the others.

2.1 Theoretical background of the Laplacian of Gaussian

One of the unsupervised methods, the SEF method, uses a Laplacian of Gaussian (LoG) filter to enhance the spots in an image. This filter comprises two parts [15]: a Gaussian filter (Equation 1), which is used to smooth an image at a certain scale, and a Laplacian filter (Equation 2), which can be used to calculate the second derivative of an image.

$$G(x, y; \sigma) = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right) \qquad (1)$$

$$\nabla^{2} f = \frac{\partial^{2} f}{\partial x^{2}} + \frac{\partial^{2} f}{\partial y^{2}} \qquad (2)$$

Using the associative properties of convolutions,

$$(f * g) * h = f * (g * h) \qquad (3)$$

the Laplacian and Gaussian kernels can be combined, resulting in the LoG kernel:

$$\nabla^{2} G(x, y; \sigma) = \frac{x^{2} + y^{2} - 2\sigma^{2}}{2\pi\sigma^{6}} \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right) \qquad (4)$$

This kernel can be used to convolve the image resulting in the LoG scale-space representation of the image.

$$\mathrm{LoG}(x, y; \sigma) = I(x, y) * \nabla^{2} G(x, y; \sigma) \qquad (5)$$

The Gaussian part of the kernel smooths the image at a certain scale; the Laplacian part yields an image that highlights regions of the original where the intensity changes rapidly [10].

The LoG filter (Equation 5) can be used to detect bright or dark regions of a certain scale, called blobs. In the LoG scale-space representation, dark blobs create highly positive values, while light blobs result in highly negative values. By changing the sigma, the filter can be made sensitive to different blob sizes. In general, to detect a blob of size s, a sigma of (s − 1)/3 is used, as 99.7% of the intensity of a blob falls within three sigmas of the centre of the blob. Using multiple scales, blobs of different sizes can be detected [11]. The same blob can be detected in different scale-spaces, but by choosing the scale-space where the blob shows the most extreme values, a scale can be determined for the blob [11].
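The scale rule and scale selection described above can be illustrated with scikit-image's `blob_log`, which evaluates the LoG at several scales and keeps, per blob, the scale with the most extreme response. The synthetic image and parameter values below are invented for the example.

```python
import numpy as np
from skimage.feature import blob_log

def sigma_for_spot_size(s):
    """Rule of thumb from the text: to detect a blob of size s pixels,
    use sigma ~ (s - 1) / 3, since ~99.7% of a Gaussian blob's
    intensity falls within three sigmas of its centre."""
    return (s - 1) / 3.0

# Synthetic 2D image with a single Gaussian blob (sigma = 2 px) at (32, 32).
yy, xx = np.mgrid[0:64, 0:64]
image = np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / (2 * 2.0 ** 2))

# blob_log returns one (y, x, sigma) row per detected blob, where sigma
# is the scale at which the LoG response was most extreme.
blobs = blob_log(image, min_sigma=1, max_sigma=4, num_sigma=7, threshold=0.05)
```

The returned sigma column is exactly the per-blob scale selection described in [11].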

2.2 Supervised methods for spot detection

The two supervised methods tested by Smal et al. [9], AdaBoost (AB) and Fisher Discriminant Analysis (FDA), both use convolution kernels to detect spots in the images [9]. The difference between the two methods is primarily how they generate these kernels.

AB is used in tandem with Haar-like features [12, 8] to analyse local patches where spots are expected and to determine whether a patch contains a spot. These Haar-like features act as weak classifiers that individually detect features of a spot; AdaBoost selects and combines the subset of these weak classifiers that best describes a spot into a strong classifier. This strong classifier is used to create a classification map of the image.

FDA also looks at local patches but determines the presence of a spot in the patch by using statistical analysis [9]. It tries to maximize the ratio of the between-class variation and the within-class variation. This maximization produces a vector, which can be reshaped into a patch that can be used in a convolution step to create a classification map of the image.

Both of these methods use convolutions, in different ways, to detect spots in microscopy images. Convolutional neural networks (CNNs), however, have not yet been used for this specific task. CNNs perform well on image classification and segmentation tasks, are already being used in different areas of cell biology, and might be suited for spot detection as well.

2.3 Convolutional neural networks in cell biology

Xie et al. [13] used CNNs to detect and count cells in microscopy images. They made use of two fully convolutional regression networks (FCRNs) to create density maps, which allowed them to count cells without having to segment them into individual cells. One of the goals of their research was to see whether FCRNs trained entirely on synthetic data could be reliably used on real image data. They showed that high accuracy in detecting and counting cells in real data was achievable when training only on synthetic data.

Dong et al. [8] compared three pre-trained, widely used CNNs to detect and classify malaria-infected cells. They compared LeNet [14], AlexNet [15] and GoogLeNet [16] and found that all three networks had exceptionally high accuracies and all scored better than an SVM that was used as a comparison.

These papers show that CNNs are already used quite successfully in cell biology to find, count, and classify cells [8, 13]. The networks used in these papers were able to identify quite complex structures in two-dimensional images. In comparison, the spots in smRNA FISH images are far less complex in shape. In [9], Smal et al. found that the supervised convolutional methods, as described in subsection 2.2, were particularly effective compared to unsupervised methods at detecting spots at relatively low SNRs. It would be interesting to see how CNNs would perform at detecting spots at low SNRs.

3 Data

When training a CNN it is important to have a large data set with high-quality labels. Although the data at hand is vast in quantity, the quality of the labels and the type of labelling have to be improved before CNNs can be applied. The provided data consists of image-sets, which are files with multiple images, as well as a set of text files with labels. These label files contain the number of spots found in each image.

Figure 1: Two samples of smRNA FISH images. (a) High quality: spots are easy to detect. (b) Low quality: spots are hard to detect.

3.1 Images

The smRNA FISH images were acquired on a Nikon Ti-E scanning laser confocal inverted microscope in combination with the Nikon NIS-Elements software package. Each image has a resolution of 256 × 256 px and consists of multiple slices, each of which is a 0.3 µm slice of the sample. The images were taken with multiple laser channels and were stored in data sets corresponding to the experiment that was run [6]. This resulted in multiple data sets, each with multiple 3D images with multiple channels. The spots that this research tries to detect are visible in one of the channels, resulting in 3D image volumes that have to be analysed [6].

The image data sets can be split into two groups: one in which the spots are clearly visible and therefore easy to detect with a LoG, and a second in which the spots are hard to detect and will require more advanced algorithms (Figure 1). These two groups were specified by W. Beckman (Appendix A) based on the experiments used to generate the images.

3.2 Labels

The labels provided with this data set were acquired using a modified version of the MATLAB script of A. Raj [17, 18]. This script uses Gaussian smoothing to increase the SNR of the image and then uses a Laplacian of Gaussian (LoG) filter to enhance the signal [9]. In the thresholding step, multiple thresholds were tried and the resulting spot count of each threshold was plotted. A threshold was chosen at a plateau in the number of spots; this plateau can be found by looking at where the average number of spots in an area changes the slowest.

4 Method

To get results comparable to those found using the MATLAB script, the choice was made to use a three-dimensional LoG to find the spots in the images. As the data is stored in the proprietary nd2 format created by Nikon, an open-source library was used to load the files into memory. The nd2 format is a very compact format for storing microscopy image data, and as every file contains multiple three-dimensional images, each with multiple channels, the data of one file cannot be loaded into memory at once. Therefore, a generator function was written to handle image loading. The data was preprocessed to extract the metadata and to find corresponding labels. After the preprocessing step, the labelled data was used in the parameter optimization step, in which grid search over the parameter space was used to find the optimal parameters for the LoG filter. A command-line interface was made so that researchers can more easily use this tool afterwards to analyse their image data.

4.1 Preprocessing

To see how well the LoG spot detection algorithm performed, the spot count data had to be linked to the corresponding image-sets. The spot count data for each image stack was stored in text files called spotcounttable.txt, and these files were stored at different locations. A Python script was used to find all these files and store their relative paths, as the corresponding image-sets had names closely related to the names of the directories in which the label files were located. The dates and directory names in these paths were used to determine which image-set corresponds to each label file. The label files were then manually linked to the corresponding image-sets in a spreadsheet, and the number of labels in each label file was checked against the number of images in the image-set. In total, 30.46% of the image-sets were found to have labels, although only 16.2% of the images classified as good quality had labels.

There was also metadata available in the image-sets. The metadata contained information about the x, y, z dimensions as well as the number of channels and the number of images in the series.

The availability of label data, the quality of the images, and the metadata were used to filter out files that were not interesting for this research. Many image-sets contained only a few images, most of which were single images from a bigger image-set, as indicated by their file names. Together with a couple of images with a different x,y-resolution, these images were removed from the larger data set. Finally, almost one-fifth of the images were classified as bad quality and were also removed from the total data set.

4.2 Laplacian of Gaussian

As stated in subsection 2.1, a Laplacian of Gaussian (LoG) filter can be used to detect bright or dark blobs. When convolved with an image, it produces highly positive values in areas with dark blobs and highly negative values in areas with bright blobs [10]. The size of the detected blobs depends on the sigmas used. In subsection 2.1 the LoG filter is described in two dimensions, but it can also be extended to three dimensions, making it ideal for blob detection in volumetric images. After the LoG scale-spaces are calculated, the blobs can be found by finding all regions where a certain threshold is reached. As it is not known beforehand which threshold has to be used, multiple thresholds are tried. By looking at the number of blobs at different thresholds, a definitive threshold can be determined at the point where there is a plateau in the number of blobs. If the threshold is too high, no blobs are found; as the threshold is decreased, more and more blobs are found. When all blobs fall within the threshold, the number of blobs stays the same for a while. When the threshold decreases even further, the background noise begins to appear in the blob counts, so the number of blobs skyrockets. This plateau can be found by estimating the local derivative of the number of blobs: the first threshold where this derivative has a local minimum is where there should be a plateau in the number of blobs found.

The scikit-image Python library has a function that implements the LoG to detect blobs. This function returns the blobs found in an image based on the given parameters: minimum sigma, maximum sigma, number of sigmas, and threshold. The minimum, maximum, and number of sigmas relate to the scale-spaces that are calculated; the threshold parameter determines at which intensity value the function returns a blob.

As this function can only handle one threshold, and the threshold cannot be predicted beforehand, a function is needed that returns a list of blobs at different thresholds, specified by a parameter range. To implement this feature, the library's function had to be modified to receive a range of thresholds. Now, after the function calculates all scale-spaces corresponding to the given sigmas, it iterates over the different thresholds and returns an array of blobs for each threshold. It would have been possible to call the original function with different threshold values, but that would recalculate all the scale-spaces for each threshold used. This would greatly increase the running time, as applying a threshold to the scale-spaces takes a matter of milliseconds while calculating multiple scale-spaces takes multiple seconds. Therefore, the library itself was modified to increase performance.

With the array of blob counts per threshold available, a plateau can then be found by estimating the local derivative of the curve and searching for a minimum in the resulting curve. The local derivative can be calculated by convolving a kernel ⟨1/2, −1/2⟩ with the blob count array; by choosing a different kernel size, the derivative is calculated over a larger or smaller area. By finding the first local minimum in the local derivative plot, a threshold can be chosen where the number of blobs has a plateau; the count at this threshold is the number of blobs returned by the function.
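The plateau-finding step can be sketched as below. The widening of the ⟨1/2, −1/2⟩ kernel for larger kernel sizes and the tie-breaking details are assumptions for illustration, not the thesis's exact implementation.

```python
import numpy as np

def plateau_threshold(thresholds, blob_counts, kernel_size=2):
    """Pick the threshold where the blob-count curve plateaus.

    Estimates the local derivative of the blob counts by convolving with
    a <1/2, -1/2>-style kernel (widened for larger `kernel_size`), then
    returns the first threshold at a local minimum of the derivative's
    magnitude. A sketch of the procedure in the text."""
    counts = np.asarray(blob_counts, dtype=float)
    half = kernel_size // 2
    kernel = np.array([0.5] * half + [-0.5] * half)
    # Magnitude of the local derivative of the blob-count curve.
    deriv = np.abs(np.convolve(counts, kernel, mode="valid"))
    # First local minimum of the derivative magnitude marks the plateau.
    for i in range(1, len(deriv) - 1):
        if deriv[i] <= deriv[i - 1] and deriv[i] <= deriv[i + 1]:
            return thresholds[i + half]
    # Fallback: global minimum of the derivative magnitude.
    return thresholds[int(np.argmin(deriv)) + half]
```

With thresholds scanned from high to low, the blob counts rise, flatten at the plateau, and then skyrocket as background noise appears; the derivative magnitude dips to its first local minimum inside the flat region.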

4.3 Parameter optimisation

Parameter optimisation is an important part of this research: choosing the right parameters is crucial for estimating the number of blobs as well as possible. For the custom LoG function, a few parameters are important: max sigma, number of sigmas, and threshold resolution. To find the plateau, different kernel sizes were also tested to see how much variation they introduce. To find the optimal combination, the functions described in subsection 4.2 were called with a range of different parameters: max sigma [2, 4, 6], num sigma [2, 4, 6], threshold resolution [20, 30, 40], and kernel size [2, 4, 6, 8, 10, 12, 14]. Every combination of parameters was run on the labelled image dataset and the resulting blob counts were compared to the spot counts found in the labels. Every image-set was analysed separately, and due to the time required to analyse all parameters for a given file, a random selection of 50 images per image-set was used in the analysis. The results were stored in a dictionary together with the metadata for the image-set, the label data, and the labels of the 50 selected images. After the analysis of an image-set was completed, the dictionary was stored in a pickle file using the pickle Python library.
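The grid search over these ranges can be sketched as follows; `score_fn` is a hypothetical stand-in for running the detector on an image-set and comparing its blob counts against the labels.

```python
import itertools

# Parameter ranges from the text (3 * 3 * 3 * 7 = 189 combinations).
PARAM_GRID = {
    "max_sigma": [2, 4, 6],
    "num_sigma": [2, 4, 6],
    "threshold_resolution": [20, 30, 40],
    "kernel_size": [2, 4, 6, 8, 10, 12, 14],
}

def grid_search(score_fn, grid=PARAM_GRID):
    """Evaluate every parameter combination and return the best one.

    `score_fn(params)` is a caller-supplied callable returning an error
    score (lower is better) for one parameter-set."""
    keys = list(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Exhaustive search is feasible here only because the scale-spaces are computed once per sigma combination and the thresholds are then applied cheaply, as described in subsection 4.2.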

5 Results

Using the methods proposed in section 4 generates a large amount of data. In this section, multiple data visualisation techniques are used to visualise the findings. These visualisations help with the selection of a parameter-set, so that it can be used to generate spot data for the entire dataset.

5.1 Dataset exploration

As a first step, the original data sets and the analysis performed by the researchers of SILS were explored. The histograms in Figure 2 show the number of spots counted by the MATLAB script for each of the images. The number of spots in an image ranged from 0 to 850. The figure is therefore split into three parts: the first ranging from 0 to 10 spots counted, the second from 10 to 100, and the third from 100 to 1000.

In each of the histograms in Figure 2, the same colours were used for the same image-sets; for example, an image-set displayed as blue in the middle histogram is also blue in the bottom histogram. Figure 2 shows that the difference in spot count distribution between image-sets is large. In the purple image-set the spot counts range from 200 to 900, in the pink image-set the number of spots is either zero or one, while in other image-sets, such as the blue and orange sets, the spot counts range from around 10 to 400.


Figure 2: Spot distribution of each image-set of the labelled data. Each color represents one image-set.

5.2 Parameter optimization

To find the optimal parameters for the spot finding algorithm, multiple parameter-sets were tried and the results were compared to the labelled data.

Comparing different parameter-sets is challenging. With three values per parameter for max sigma, number of sigmas, and number of thresholds, and seven values for the kernel size parameter, there are 189 parameter-sets in total. To see which parameter-set resulted in the most accurate spot counts, all parameter-sets were used in the spot detection algorithm on every image-set. Due to the large difference in spot counts between image-sets, as shown in Figure 2, a performance score was used that corrects for the mean number of spots in an image-set.

Choice of parameter-set influences the number of detected spots

The performance scores used to compare the parameter-sets were the mean absolute error (MAE), reflecting the average absolute difference between the SILS analysis and the analysis performed here (Equation 6), and the normalized mean absolute error (NMAE), which corrects for the large difference in spot counts between image-sets by dividing the mean error by the mean spot count of the specific image-set (Equation 7). This makes the error scores of different image-sets comparable with each other.


$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |y_{i} - x_{i}|}{n} \qquad (6)$$

$$\mathrm{NMAE} = \frac{\sum_{i=1}^{n} |y_{i} - x_{i}|}{n \cdot \mu} \qquad (7)$$

In Figure 3, for each image-set (x-axis), the distribution of the MAE and NMAE is shown as a function of the parameter-sets. For example, the MAE scores of image-set 5 are much higher than those of the others, but the normalized error scores show that this is merely due to the high number of spots in that image-set.
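Equations 6 and 7 translate directly into code; a small sketch, with y the labelled counts, x the predicted counts, and μ the mean labelled count of the image-set:

```python
import numpy as np

def mae(y, x):
    """Mean absolute error (Equation 6) between label counts y and
    predicted counts x."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    return float(np.mean(np.abs(y - x)))

def nmae(y, x):
    """Normalized MAE (Equation 7): MAE divided by the mean label spot
    count of the image-set, making error scores of different image-sets
    comparable."""
    y = np.asarray(y, dtype=float)
    return mae(y, x) / float(np.mean(y))
```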

Figure 3: Distributions of the MAE and NMAE scores per image-set over all parameter-sets. (a) Complete y-range. (b) Maximum y set to 175.

For some image-sets, the difference in error scores between the parameter-sets is almost nothing. For image-sets where this is the case, the algorithm is not very sensitive to different parameter values, which is a good sign: it indicates that, at least in these files, the chosen parameters do not affect the scores too much. Furthermore, Figure 3 shows that the number of spots counted in some image-sets is very close to the original spot counts, with a small variance between parameter-sets. These image-sets are examples where the LoG performs well. On the other hand, Figure 3 also shows that other image-sets never have good error scores, no matter which parameter-set is chosen, which suggests that the spots in these image-sets are harder to detect.

Unique or general parameter-sets

Now that it is clear that the choice of parameter-set influences the error in the number of detected spots in an image, it can be determined, by minimizing the error, which combination of parameters is best for each image-set. The result of this optimization procedure is that each image-set is coupled to the parameter-set that performs best on it.

The next question is whether there exists a general parameter-set that can be applied to every image-set with a reasonable result (errors not too large). To find out, the error in the number of spots was determined by applying the best parameter-set of one image-set to all the other image-sets. Figure 4a shows the hypothetical case where every image-set has a unique best parameter-set. Green here indicates an error smaller than the largest error on the diagonal; red indicates a larger error. In that case, the result is always worse when an image-set is analysed with another image-set's parameter-set. If, instead, there is a parameter-set that performs well on all other image-sets (Figure 4b), that parameter-set can be used as a general parameter-set. Figure 5 shows the measured errors when all parameter-sets are used on all image-sets. Multiple parameter-sets perform well on most image-sets, but one parameter-set, (1,2,2,40,2), performs best.

Figure 4: (a) Every image-set has a unique parameter-set. (b) One parameter-set for multiple image-sets.


Figure 5: Real NMAE values for different parameter-sets for each image-set

Correlation between predicted and real spot counts

An alternative way to look for the optimal parameter-set is to look at how well the predicted spot counts correlate with the real spot counts. One way to do this is to calculate the Pearson correlation [19].

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^{2} \sum (y - \bar{y})^{2}}} \qquad (8)$$

To see which parameter-set gives the prediction that correlates best with the real spot counts, the Pearson correlation was calculated between the spot counts of each parameter-set and the real spot counts. The parameter-set with the highest Pearson correlation was found and the corresponding spot counts were plotted against the real spot counts in Figure 6.
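Equation 8 can be computed in a few lines of numpy; the sketch below is equivalent to `np.corrcoef(x, y)[0, 1]`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation (Equation 8) between predicted spot counts x
    and labelled spot counts y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    # Ratio of the covariance sum to the product of the deviation norms.
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))
```

A value of 1 means the predicted counts are a perfect linear function of the labels; unlike the NMAE, this score is insensitive to a constant scaling of the predictions.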


Figure 6: Correlation between predicted spot counts and actual spot counts. The parameter-set was chosen based on the best Pearson correlation.

Most frequent parameter values in top performing parameter-sets

A third way to find the best parameter-set is to look at which parameter values occur most often in the top parameter-sets of each image-set. Here, for each image-set, the top 10 parameter-sets were chosen, and the values for each parameter were counted to create the histograms shown in Figure 7. Some parameter values indeed occur more often in these top 10 parameter-sets than others. The parameter values with the highest frequencies together form the parameter-set (1,2,2,40,2), which is also the top performing parameter-set of the general approach.

6 Conclusion

In this research, different ways to find spots in microscopy images have been explored. Convolutional neural networks looked like a promising way to improve the current approach, but to train a CNN a better dataset was needed: the data at hand was only partially labelled and the available labels did not contain the location data required for CNNs. To generate more labelled data and find the locations of all the spots, a method was used that is similar to the method used to generate the current labels. This method, the 3D LoG, has parameters (min sigma, max sigma, number of sigmas, number of thresholds, kernel size) that adjust the algorithm. To find the parameters that give the best correspondence with the labelled data, a multitude of parameter combinations were tried and the resulting spot counts were compared with the original spot counts. Three distinct approaches were tested. Due to the wide variety in spot counts between the image-sets, the first approach examined whether the best performing parameter-sets of each image-set also performed well on the other image-sets. The second approach looked for the parameter-set whose predictions had the strongest Pearson correlation with the original labelled data. The third approach assumed that the parameters were independent of each other and looked at which parameter values occurred most often in the top performing parameter-sets.

The first approach found that there are parameter-sets that perform relatively well on all image-sets, even though some image-sets had exceptionally bad results overall. The parameter-set that performed best overall was (1,2,2,40,2), which was also chosen as the best performing parameter-set for four different image-sets. A close second was (1,2,2,30,2), which was the best set for three image-sets. The second approach used the Pearson correlation to find the parameter-set that generated the array of spot counts correlating best with the real spot count data; the parameter-set found with this method was (1,2,6,40,2).

Lastly, the third approach assumed parameter independence. This method looked at the number of times each parameter value occurred in the top ten parameter-sets of each image-set. The parameter-sets formed by the most prevalent parameter values were (1,2,2,40,2) and (1,2,2,30,2), which were also found by the general approach.

The normalised mean absolute error (NMAE) was calculated for the best performing parameter-sets and is shown in Table 1. The last row of the table represents the parameter-set with the best NMAE score overall; interestingly, this combination is not found by any of the other methods.

Method found         Parameter-set   NMAE
General / Histogram  (1,2,2,40,2)    0.872
General / Histogram  (1,2,2,30,2)    0.909
Pearson              (1,2,6,40,2)    0.858
Best NMAE overall    (1,6,6,40,2)    0.848

Table 1: Best performing parameter-sets with their normalised mean absolute error scores
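The thesis does not spell out the normalisation used for the NMAE; one common form, assumed here, divides the mean absolute error by the mean labelled spot count.

```python
import numpy as np

def nmae(predicted, true):
    """Normalised mean absolute error between predicted and labelled
    spot counts: MAE divided by the mean true count (one common
    normalisation; the thesis may use a different one)."""
    predicted = np.asarray(predicted, dtype=float)
    true = np.asarray(true, dtype=float)
    return np.mean(np.abs(predicted - true)) / np.mean(true)
```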


To summarise, there are multiple parameter-sets that can be used to generate comparable spot count results. These parameter-sets were obtained with different methods, giving a strong indication that they are indeed the best parameter-sets. When the spot detection algorithm is run with these particular parameter-sets, the location data of the spots can be used in further research.

7 Discussion

The results of this research rely heavily on the assumption that the original labelled data is correct. If the data generated with the MatLab script is not reliable, the error scores found in this research are not representative of the true error. In some of the image-sets the differences between original and predicted spot counts are quite large, while analysis of other image-sets results in very low error scores. To combat the uncertainty of a manually labelled dataset, a synthetic approach might be explored. In [9, 13, 20] researchers generated multiple image sets to compare different spot detection algorithms. It would be interesting to see if the LoG algorithm proposed in this research would be effective in detecting the spots in such synthetic images, and it would be an ideal way to test the sensitivity to different parameters due to the control one has over the spot variety.
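A synthetic smFISH-like stack can be sketched as Gaussian spots on a flat background with Poisson (shot) noise; all parameter values below are illustrative, not fitted to the real data.

```python
import numpy as np

def synthetic_stack(shape=(16, 64, 64), n_spots=20, sigma=1.5,
                    amplitude=200.0, background=100.0, rng=None):
    """Generate a synthetic 3D stack of Gaussian spots at random
    positions on a flat background, corrupted by Poisson noise.
    Returns the noisy stack and the ground-truth spot centres."""
    rng = np.random.default_rng(rng)
    img = np.full(shape, background, dtype=float)
    zz, yy, xx = np.indices(shape)
    centres = rng.uniform(0, 1, size=(n_spots, 3)) * np.array(shape)
    for cz, cy, cx in centres:
        d2 = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2
        img += amplitude * np.exp(-d2 / (2 * sigma ** 2))
    return rng.poisson(img).astype(float), centres
```

Because the centres are known exactly, detection results on such stacks can be scored without any manual labelling.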

Besides the uncertainty about the accuracy of the original labels, the difference between the original labels and the predicted labels can also be caused by the fact that the x,y-resolution of the microscopy images differs from the z-resolution. The three-dimensional LoG function assumes that the image matrix passed to the function has equal x,y,z-resolutions. But because the resolution in x,y is approximately 100 nm and the z-resolution of each slice is 300 nm, the assumption that the spots are spheres cannot be made, even though the LoG is most sensitive to spheres. To solve this, a new LoG function could be written that accepts a different sigma for each of the axes. It would also be interesting to see if a LoG filter can be created that is sensitive to differently shaped spots, e.g. elongated, elliptical etc.
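Such a per-axis LoG is directly supported by `scipy.ndimage.gaussian_laplace`, which accepts a sequence of sigmas. The sketch below shrinks the z sigma by the voxel aspect ratio so that a physically spherical spot is matched despite the coarser z sampling; the voxel sizes come from the text, while the scale normalisation is a common convention rather than the thesis's implementation.

```python
import numpy as np
from scipy import ndimage

def anisotropic_log(stack, sigma_xy=1.0, voxel_xy=100.0, voxel_z=300.0):
    """Scale-normalised LoG response with per-axis sigma.

    A z slice covers voxel_z nm versus voxel_xy nm in x,y, so a
    physically isotropic spot spans fewer voxels along z; the z sigma
    is reduced accordingly.
    """
    sigma = (sigma_xy * voxel_xy / voxel_z, sigma_xy, sigma_xy)  # (z, y, x)
    # Negate so that bright blobs give positive responses.
    return -ndimage.gaussian_laplace(stack.astype(float), sigma) * sigma_xy ** 2
```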

There are also other spot filters based on the LoG filter, such as the Difference of Gaussians (DoG) [10] and the Determinant of Hessian (DoH) [21]. Both filters use approximations of the LoG, which greatly improves the speed of the algorithms but reduces the accuracy and makes them more sensitive to parameter changes. A comparison between the LoG, DoG and DoH on real or synthetic image data could lead to an improved version of the method proposed in this research.
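The DoG approximation amounts to subtracting two Gaussian blurs whose sigmas differ by a fixed ratio; a minimal sketch, with the classic ratio of 1.6 often quoted as a good LoG approximation:

```python
import numpy as np
from scipy import ndimage

def dog(stack, sigma=1.5, ratio=1.6):
    """Difference of Gaussians: blur with a narrow and a wide Gaussian
    and subtract, approximating the (negated) LoG at scale sigma."""
    stack = stack.astype(float)
    return (ndimage.gaussian_filter(stack, sigma)
            - ndimage.gaussian_filter(stack, sigma * ratio))
```

Because each Gaussian blur is separable, the DoG is considerably cheaper to evaluate than an exact LoG over many scales.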

When it is shown that the accuracy of the proposed spot detection algorithm is sufficient, the labels generated by this algorithm can provide a great starting base for spot detection with convolutional neural networks (CNNs). The location data that is now available for this dataset, in tandem with synthetic images that can be generated, could provide a CNN with a large amount of training data. The provided dataset has images with a large variance in spot count, signal to noise ratio (SNR) and spot density. A lot of research has already been done in the field of cell classification [8, 13, 14], so the step to spot detection should be achievable.

Looking forward, smRNA FISH is only one of the fields where spot detection is useful. Manders et al. [22] used spot detection in confocal images to measure positions of DNA-replication spots, Marsh et al. [21] use spot detection in atomic force microscopy, and it can also be used in the detection of germinant proteins, which can be used to detect bacterial spores that have sub-resolution structures.

All in all, spot detection has and will probably always have an important place in microscopic cell biology. It has been through an enormous amount of development and, especially with the rise of machine learning, can be lifted to the next level in the coming years.

References

[1] J. W. Lichtman and J.-A. Conchello, “Fluorescence microscopy,” Nature Methods, vol. 2, pp. 910–919, Nov. 2005.

[2] Q. Wu, F. Merchant, and K. Castleman, Microscope image processing. El-sevier, 2010.

[3] R. A. Hoebe, C. H. V. Oven, T. W. J. Gadella, P. B. Dhonukshe, C. J. F. V. Noorden, and E. M. M. Manders, “Controlled light-exposure microscopy reduces photobleaching and phototoxicity in fluorescence live-cell imaging,” Nature Biotechnology, vol. 25, pp. 249–253, Jan. 2007.

[4] H. Kempe, A. Schwabe, F. Crémazy, P. J. Verschure, and F. J. Bruggeman, “The volumes and transcript counts of single cells reveal concentration homeostasis and capture biological noise,” Molecular Biology of the Cell, vol. 26, pp. 797–804, Feb. 2015.

[5] S. Salehi, M. N. Taheri, N. Azarpira, A. Zare, and A. Behzad-Behbahani, “State of the art technologies to explore long non-coding RNAs in cancer,” Journal of Cellular and Molecular Medicine, vol. 21, pp. 3120–3140, June 2017.

[6] H. Kempe et al., “Understanding gene expression variability in its biological context using theoretical and experimental analyses of single cells,” 2017.

[7] A. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep reinforcement learning framework for autonomous driving,” Electronic Imaging, vol. 2017, pp. 70–76, Jan. 2017.

[8] Y. Dong, Z. Jiang, H. Shen, W. D. Pan, L. A. Williams, V. V. B. Reddy, W. H. Benjamin, and A. W. Bryan, “Evaluations of deep convolutional neural networks for automatic identification of malaria infected cells,” in 2017 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), IEEE, 2017.


[10] H. Kong, H. C. Akakin, and S. E. Sarma, “A generalized Laplacian of Gaussian filter for blob detection and its applications,” IEEE Transactions on Cybernetics, vol. 43, pp. 1719–1733, Dec. 2013.

[11] T. Lindeberg, “Feature detection with automatic scale selection,” International Journal of Computer Vision, vol. 30, no. 2, pp. 79–116, 1998.

[12] C. Papageorgiou, M. Oren, and T. Poggio, “A general framework for object detection,” in Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), Narosa Publishing House.

[13] W. Xie, J. A. Noble, and A. Zisserman, “Microscopy cell counting and detection with fully convolutional regression networks,” Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, vol. 6, pp. 283–292, May 2016.

[14] Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Convolutional networks and applications in vision,” in Proceedings of 2010 IEEE International Symposium on Circuits and Systems, IEEE, May 2010.

[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.

[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.

[17] W. Beckman, I. M. Vuist, H. Kempe, and P. J. Verschure, “Cell-to-cell transcription variability as measured by single-molecule RNA FISH to detect epigenetic state switching,” in Methods in Molecular Biology, pp. 385–393, Springer New York, 2018.

[18] A. Raj, P. van den Bogaard, S. A. Rifkin, A. van Oudenaarden, and S. Tyagi, “Imaging individual mRNA molecules using multiple singly labeled probes,” Nature Methods, vol. 5, pp. 877–879, Sept. 2008.

[19] J. Benesty, J. Chen, Y. Huang, and I. Cohen, “Pearson correlation coefficient,” in Noise Reduction in Speech Processing, pp. 1–4, Springer Berlin Heidelberg, 2009.

[20] A. Basset, J. Boulanger, J. Salamero, P. Bouthemy, and C. Kervrann, “Adaptive spot detection with optimal scale selection in fluorescence microscopy images,” IEEE Transactions on Image Processing, vol. 24, pp. 4512–4527, Nov. 2015.

[21] B. P. Marsh, N. Chada, R. R. S. Gari, K. P. Sigdel, and G. M. King, “The hessian blob algorithm: Precise particle detection in atomic force microscopy imagery,” Scientific Reports, vol. 8, Jan. 2018.

[22] E. Manders, R. Hoebe, J. Strackee, A. Vossepoel, and J. Aten, “Largest contour segmentation: A tool for the localization of spots in confocal images,” Cytometry, vol. 23, pp. 15–21, Jan. 1996.


Appendices

A Image quality

Directory name               Image quality
2016/HER2                    good
2016/HES1                    good
2017/DEMI LIPIDTOX           good
2017/GAPDH                   bad
2017/PPARG                   bad
2017/SQLE                    bad
2018/CALIBRATION             good
2018/MIGUEL                  good
2018/MICHAL DRB TIMECOURSES  good
2018/SQLE                    bad
