
A comparison of classic and deep learning-based strategies for image segmentation

Bachelor’s thesis

Abstract

Nuclei segmentation is an essential step in quantifying fluorescence images, and a variety of strategies for it exist. Conventional methods for nuclei segmentation in image processing programs like ImageJ/Fiji are free to the public and extensively used. Over the past years, researchers have been trying to improve and automate nuclei segmentation with artificial intelligence. Machine and deep learning have been shown to provide accurate nuclei segmentation, but their use seems limited to experts, which creates an accessibility barrier. A basic segmentation, Labkit, Stardist and ZeroCostDL4Mic were tested on a dataset of fluorescence microscopy images with DAPI-stained nuclei to compare the accuracy and accessibility of a conventional method and artificial intelligence methods. The F1-score and Intersection over Union (IoU) were calculated to determine the accuracy. The hardware requirements, preparation time and test time were examined to determine the accessibility. The conventional method was inferior to all other strategies in terms of accuracy. The deep learning methods obtained the most accurate predictions, but there is a significant difference in accessibility between training a unique model and using a pre-trained model.

Keywords

Image analysis, nuclei segmentation, artificial intelligence, machine learning, deep learning

Author: Jamie Prevett-Mulder
Student ID: 11716460
Bachelor: Biomedical Sciences
Supervisor: dr. ir. Joachim Goedhart
Examiner: dr. ir. Mark Hink
Institute: SILS


1. Introduction

Fluorescence microscopy is an essential tool in cell biology to visualize structures and processes within the cell, and aims to disclose only the objects of interest against a black background (Lichtman & Conchello, 2005). Identification of the nucleus with a DNA stain is often a critical step in the quantification of fluorescence images (Caicedo et al., 2019a). The identification of nuclei is also known as nuclei segmentation. Datasets of fluorescence images can consist of thousands of nuclei, and therefore widely used conventional methods can be time-consuming. Researchers have been trying to improve and automate nuclei segmentation with the help of machine and deep learning. Machine learning refers to the improvement of computer algorithms through learning, experience and representation. Machine learning algorithms build a model based on so-called training data, and this model can make informed decisions or predictions on unseen data based on what it has learned from the training data (Mitchell, 1997). Deep learning is part of a broader family of machine learning techniques. Deep learning goes beyond classic machine learning in that it builds models based on artificial neural networks and can make intelligent decisions on its own (LeCun, Bengio & Hinton, 2015). Machine learning largely mimics the data, whereas deep learning powers the most human-like artificial intelligence because of its decision-making abilities. Training data for deep learning consists of input (raw images) with corresponding desired output (image masks). Subsequently, neural networks learn how to map each input to generate the desired output (von Chamier, Laine & Henriques, 2019). Once the neural network model is trained, it can be used to make predictions about unseen data. This process is called 'inference' because the prediction is inferred from the training data.

Two types of nuclei segmentation can be distinguished: semantic and instance segmentation. Semantic segmentation creates a pixel-level annotation, where each pixel is assigned to a class label. In semantic segmentation of fluorescence images, these classes are either the foreground or the background. With semantic segmentation, the distinction of individual objects is not always achieved (e.g. two overlapping nuclei are identified as one rather than two) (Moen et al., 2019). On the contrary, instance segmentation treats objects of the same class as different objects of that class (e.g. two overlapping nuclei are identified as two different nuclei of the class 'nuclei' rather than as one nucleus). Semantic and instance segmentation are both broadly used in machine and deep learning.
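
As a toy illustration of the difference, consider how the two kinds of output can be encoded as label arrays (the arrays below are hypothetical, not from the thesis dataset):

```python
# Semantic output only separates foreground from background;
# instance output gives each nucleus its own integer label.
import numpy as np

# A 1x6 'image' containing two touching nuclei.
semantic = np.array([[0, 1, 1, 1, 1, 0]])   # 1 = nucleus pixel, 0 = background
instance = np.array([[0, 1, 1, 2, 2, 0]])   # nuclei 1 and 2 are kept apart

print((semantic == 1).sum(), "foreground pixels")          # 4
print(len(np.unique(instance)) - 1, "individual nuclei")   # 2
```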

Unfortunately, training deep learning neural networks requires expensive, powerful hardware resources and programming skills (Gómez-de-Mariscal et al., 2019). These technical challenges create an accessibility barrier for novice and non-expert users. More and more programs are emerging that try to break this barrier. An example of such a program is ZeroCostDL4Mic (von Chamier et al., 2020). The question is whether deep learning excels at nuclei segmentation in comparison to a conventional method or a machine learning method, and whether it is accessible to non-expert users as well. The central question of this research, therefore, is as follows: 'Are nuclei segmentation methods that use artificial intelligence more accurate and accessible for non-expert users than conventional methods?' To answer this question, four nuclei segmentation strategies will be tested on a dataset of fluorescence microscopy images and compared based on their accessibility and segmentation accuracy.

1.1 The strategies

The first strategy is a basic segmentation in ImageJ/Fiji, which represents a conventional method of nuclei segmentation. ImageJ (Figure 1) was created in 1997 by Wayne Rasband of the National Institutes of Health and is an open-source, Java-based program for image processing (Abràmoff, Magalhães & Ram, 2004). ImageJ was the first of its kind that was open to the public, could be run on any operating system and was virtually limitless thanks to plugins and macros created by the public (Abràmoff, Magalhães & Ram, 2004). In 2011, a distribution of ImageJ called Fiji was released to extend the program with more complex plugins (Schindelin et al., 2012). Fiji makes it possible to turn machine and deep learning algorithms into ImageJ plugins, which can be shared through an integrated update system (Schindelin et al., 2012).

The second strategy, Labkit, is an example of such a plugin and uses supervised machine learning. Labkit can be used for automatic image segmentation and image labelling (Artz, 2019). Labkit performs semantic segmentation and will therefore solely distinguish the foreground from the background. In fluorescence images with stained nuclei, the foreground should only represent nuclei, and therefore Labkit is a suitable strategy.

The third strategy, Stardist, represents a deep learning method that uses a pre-trained model. Stardist is a deep learning method designed explicitly for nuclei segmentation (von Chamier et al., 2020). Stardist predicts star-convex polygons, which match the characteristic round shape of a nucleus in a fluorescence microscopy image (Schmidt, Weigert, Broaddus & Myers, 2018). Stardist performs non-maximum suppression (NMS) and has been demonstrated to perform well on images with overlapping nuclei. Stardist's method is based on U-net (Ronneberger, Fischer & Brox, 2015) and uses a lightweight neural network (Schmidt et al., 2018). In Fiji, a Stardist plugin is available that can be used for 2D images or time-lapse images. It comes with three pre-trained models, although it is also possible to upload one's own models.

The fourth and last strategy is ZeroCostDL4Mic, which will be used to train a unique deep learning model on the dataset. ZeroCostDL4Mic is a platform with an easy-to-use graphical user interface that simplifies access to deep learning in microscopy for researchers without coding expertise (von Chamier et al., 2020). ZeroCostDL4Mic lowers the accessibility barriers of installing the right dependencies and of obtaining access to powerful computational resources. It utilizes Google Colab, a free Jupyter Notebook environment where Python code can be written and executed. Colab provides free access to Graphics Processing Units (GPUs), and the notebooks are stored in Google Drive and can thus easily be shared and copied. The networks currently incorporated in ZeroCostDL4Mic are CARE, Noise2Void, Label-free prediction, U-net and Stardist. In this research, the Stardist 2D network was used. In essence, this is the same deep learning network as in the Stardist plugin; however, with ZeroCostDL4Mic, the Stardist network is used to train a unique model.

2. Methods and Materials

2.1 The dataset

All images were generated during an experiment that included the study of HeLa cells. The main goal of this experiment was to identify the number of HeLa cells in the S-phase and to study the different patterns of EdU incorporation. As part of this experiment, HeLa cells were incubated with 50 μl DAPI for 5 minutes, so that the nuclei appear blue under a fluorescence microscope. For this research, an image set of 15 images with DAPI-stained nuclei was donated by dr. J. Goedhart (Figure 2A). Five images were used as test images and the other ten images were used to train the deep learning model in ZeroCostDL4Mic.

Figure 1: (Fiji Is Just) ImageJ. The Fiji application is used to generate ground truth annotations. The 'Wand (tracing) tool' is selected to annotate a nucleus. With CTRL+T the annotation is saved as a region of interest (ROI) in the ROI manager.


2.2 Ground truth annotations

In order to obtain ground truth annotations (Figure 2B), all nuclei were manually annotated. Non-overlapping nuclei were annotated using the 'Wand (tracing) tool' in Fiji. This tool creates a selection of an object with a uniform colour; by tweaking the tolerance for each nucleus, a correct annotation could be created. To annotate overlapping nuclei, each nucleus had to be annotated using the 'Freehand selections' tool. The annotations were made as regions of interest (ROIs) and saved in the ROI manager in Fiji. All annotations were sent to an expert, dr. J. Goedhart, for verification. Overall, the training set and test set consisted of 976 and 610 manually annotated nuclei, respectively. Other tools like Quanti.us and Labkit were also tried for annotating the ground truths, but these tools did not return ROIs, which were essential for further data analysis in this research. The segmentation results of each strategy were converted into ROIs, so that they could be compared to the ground truth by overlaying both sets of ROIs.

2.3 Basic segmentation

For the basic segmentation, a set of commands was run in Fiji (Nuclei Watershed Separation, 2020) (Figure 3). First, a Gaussian blur was run to reduce noise and to blur out the 'speckles' (Figure 2A); a sigma value of 2.0 was chosen for this dataset. Secondly, pixel-intensity thresholding was used to distinguish foreground objects from the background. The default threshold feature in Fiji was used, and a different threshold was set for each image. Thresholding creates a binary image, where objects in white represent the foreground (pixel value = 255) and black represents the background (pixel value = 0). To separate touching objects created by thresholding, a classic watershed was run. The built-in ImageJ watershed method searches for the centre of an object and calculates a distance map from the centre to the edges of a region of interest. Because of this, it works well on circular objects with a similar radius in all directions; the watershed then creates a 'dam' between touching circular objects. Finally, the 'Analyze Particles' tool was run to convert the objects into ROIs.
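
As a rough sketch, this pipeline can be approximated in Python with scikit-image. This is an assumption-laden equivalent rather than the exact Fiji commands: the file name is hypothetical, and Fiji's 'Default' threshold is approximated here with the related IsoData method:

```python
# A hedged Python/scikit-image approximation of the Fiji pipeline above:
# Gaussian blur -> threshold -> watershed -> object extraction.
import numpy as np
from scipy import ndimage as ndi
from skimage import io, filters, measure, segmentation
from skimage.feature import peak_local_max

img = io.imread("dapi_test_image.tif")        # hypothetical file name

blurred = filters.gaussian(img, sigma=2.0)    # reduce noise and 'speckles'
binary = blurred > filters.threshold_isodata(blurred)  # ~ Fiji 'Default'

# Watershed on the distance map separates touching, roughly circular nuclei.
distance = ndi.distance_transform_edt(binary)
coords = peak_local_max(distance, min_distance=10, labels=ndi.label(binary)[0])
markers = np.zeros(distance.shape, dtype=int)
markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)
labels = segmentation.watershed(-distance, markers, mask=binary)

# Each labelled region corresponds to one ROI ('Analyze Particles' in Fiji).
print(len(measure.regionprops(labels)), "objects converted to ROIs")
```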

2.4 Labkit

To use machine learning for semantic segmentation in Labkit, the plugin must be told which pixels represent background values and which pixels foreground values. This is done by drawing some scribbles with the pencil tool (Figure 4A). Based on the background and foreground labels, a classifier can be trained. The classifier then predicts, for the rest of the image, which pixels should be assigned to the background or foreground (Figure 4B). A unique classifier was trained for each image in the dataset.

Figure 2: DAPI-stained nuclei. A) Image from the dataset. Note the speckly surface of the nuclei; this is characteristic of DAPI-stained nuclei. B) Example of reference ground truth annotations. C) Ground truth and prediction overlaid. In this research, all ground truth annotations are shown in blue and the predictions in white. D) Example of a merge: a segmentation error where two true nuclei are predicted as one nucleus. A merge of two nuclei is assessed as one false positive (FP) and two false negatives (FNs). E) Example of a split: a segmentation error where one true nucleus is predicted as two nuclei. Such a split is assessed as two false positives (FPs) and one false negative (FN).


The segmentation results were exported into ImageJ (Figure 4B) and transformed into a binary image (Figure 4C). Similar to the basic segmentation, a classic watershed was run to separate touching objects, and finally the 'Analyze Particles' tool was run to generate an ROI for each object.

Figure 3: Basic segmentation in Fiji. The 'Gaussian blur', 'Threshold' and 'Watershed' tools were applied to the test images in Fiji, in this particular order. The corresponding output after running each command is shown. The first picture from the right shows the raw test image overlaid by the regions of interest (ROIs) after running the 'Analyze Particles' tool.

Figure 4: Segmentation using the Labkit plugin. A) An image from the dataset has been imported into Labkit. With the pencil tool in Labkit, a background label (blue) and a foreground label (red) have been created. The red box indicates the region shown in panels B, C and D. B) The output Labkit shows after training a classifier with the background and foreground labels. C) Result after it is opened in Fiji. D) Result after the image is made binary and a watershed has been run.


2.5 Stardist

With the Stardist plugin, a pre-trained model, 'Versatile (fluorescent nuclei)', was used. For the 'NMS Postprocessing' settings, the 'Probability/Score Threshold' was set to 0.5 and the 'Overlap Threshold' to 0.4. All NMS postprocessing settings were the same for all images.
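
For reference, the same pre-trained model and thresholds can also be invoked from the stardist Python package. A minimal sketch (the model name and threshold values follow the plugin settings above; the file name and normalization percentiles are assumptions):

```python
# Run the pre-trained 'Versatile (fluorescent nuclei)' Stardist model with
# the same NMS settings as in the Fiji plugin (prob 0.5, overlap/NMS 0.4).
from csbdeep.utils import normalize
from stardist.models import StarDist2D
from tifffile import imread

model = StarDist2D.from_pretrained("2D_versatile_fluo")
img = normalize(imread("dapi_test_image.tif"), 1, 99.8)  # hypothetical file
labels, details = model.predict_instances(img, prob_thresh=0.5, nms_thresh=0.4)
print(labels.max(), "nuclei predicted")
```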

2.6 ZeroCostDL4Mic

To train a model in ZeroCostDL4Mic, the ten training set images had to be matched with images of their corresponding masks. These masks were created with the LOCI plugin in Fiji: using the ground truth annotations, the LOCI plugin created an ROI map (Figure 5). The number of epochs was set to 2000, and the other parameters were kept at their defaults. The output for each test image consisted of a predicted mask image, ROIs of all masks and a CSV file with the number of detected nuclei.
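
The Stardist 2D notebook wraps the stardist Python package, so the training step boils down to a call of roughly the following form. This is a hedged sketch: apart from the 2000 epochs, the folder layout, train/validation split and configuration values are assumptions, not the notebook's actual defaults:

```python
# Train a Stardist 2D model on paired raw images and integer label masks,
# roughly what the ZeroCostDL4Mic notebook does behind its interface.
from glob import glob
from csbdeep.utils import normalize
from stardist.models import Config2D, StarDist2D
from tifffile import imread

X = [normalize(imread(f), 1, 99.8) for f in sorted(glob("train/images/*.tif"))]
Y = [imread(f) for f in sorted(glob("train/masks/*.tif"))]  # label masks

model = StarDist2D(Config2D(n_rays=32, grid=(2, 2)),
                   name="dapi_model", basedir="models")
model.train(X[:8], Y[:8], validation_data=(X[8:], Y[8:]), epochs=2000)
```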

2.7 Metrics

To determine which strategies were the most accurate, two evaluation metrics for segmentation were calculated: the F1-score and the Intersection over Union (IoU). To calculate the F1-score, the numbers of true positives (TPs), false positives (FPs) and false negatives (FNs) were counted by overlaying the ROIs of the prediction and the ROIs of the ground truth. After this, the precision, recall and F1-score were calculated (Box 1).

To calculate the IoU, binary images with masks of the prediction and the ground truth were created from their ROIs. The MorphoLibJ plugin in Fiji was used to calculate the IoU between the binary images. The resulting IoU value therefore represents the entire prediction per image and is not calculated per ROI. In other nuclei segmentation studies, the IoU is usually calculated per ROI, but the knowledge and skills to do so were lacking in this research (see Discussion). A more extensive explanation of the IoU can be found in Box 2.

Figure 5: Generating training masks for ZeroCostDL4Mic. With the LOCI plugin in Fiji, regions of interest (ROIs) of the training images were used to generate corresponding masks. These masks were generated from the ground truth annotations.


Box 1: Calculating the F1-score

Comparing the predictions of a segmentation strategy to a reference ground truth results in a number of truly predicted nuclei (TPs), falsely predicted nuclei (FPs) and falsely non-predicted nuclei (FNs). These counts can be converted into single values using the evaluation metrics precision and recall. The precision is the fraction of truly predicted nuclei among the total number of predictions, either true or false, and can be described as a quality measure (Sokolova & Lapalme, 2009). The recall, on the other hand, is the fraction of truly predicted nuclei among the number of nuclei that should have been predicted, and can be described as a quantity measure (Sokolova & Lapalme, 2009). The precision and recall are calculated as follows:

$$\mathrm{Precision} = \frac{TPs}{TPs + FPs} \qquad \mathrm{Recall} = \frac{TPs}{TPs + FNs}$$

The F1-score represents the harmonic mean of the precision and the recall (Moen et al., 2019) and is calculated as:

$$F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Example:

In Figure 6, a prediction (white) is compared to the ground truth (blue) by superimposing the ROIs. This situation contains seven ground truth nuclei, of which the segmentation truly predicted only two. Two merges were also created, each representing one FP and two FNs. In addition, the segmentation missed one true nucleus completely (FN) and predicted one extra object (FP). In total, two TPs, three FPs and five FNs were predicted by the segmentation. This results in the following F1-score:

$$\mathrm{Precision} = \frac{2}{2 + 3} = 0.4 \qquad \mathrm{Recall} = \frac{2}{2 + 5} = 0.29$$

$$F_1 = 2 \times \frac{0.4 \times 0.29}{0.4 + 0.29} = 0.33$$

The precision and recall each take a value between 0 and 1, because the number of TPs appears in both the numerator and the denominator. A value of 0 means no true positives were predicted; a value of 1 means no FPs (precision) or no FNs (recall) were predicted. When both the precision and recall have the maximum value of 1, the F1-score is also 1 (see formula). This means the F1-score can likewise only take a value between 0 and 1, and the closer the value is to 1, the fewer errors were made.

Figure 6: Comparison between a prediction and the reference ground truth. The prediction is annotated in white and the ground truth in blue. In this example, true positives (TP), false negatives (FN), false positives (FP) and merges (M) can be found.
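
As a minimal sketch, the Box 1 metrics can be computed directly from the error counts; the snippet below reproduces the worked example (2 TPs, 3 FPs, 5 FNs):

```python
# Precision, recall and F1-score from raw TP/FP/FN counts.
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf1(tp=2, fp=3, fn=5)
print(f"precision={p:.2f}, recall={r:.2f}, F1={f1:.2f}")
# precision=0.40, recall=0.29, F1=0.33
```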


3. Results

3.1 Deep learning provides more accuracy than a conventional method

To determine the accuracy of the predictions, the F1-score and IoU were calculated. The F1-score is the harmonic mean of the precision and recall and represents the accuracy of a prediction in terms of the number of missed and extra nuclei. The F1-scores of the basic segmentation (0.835 to 0.937), Labkit (0.839 to 0.995) and Stardist (0.881 to 0.99) all have a large spread, in contrast to ZeroCostDL4Mic (0.943 to 0.992) (Figure 8A). These large spreads suggest that the identification of nuclei was easier in one or two images from the test set. Overall, the basic segmentation has the lowest F1-scores; its best F1-score is even lower than the worst score of ZeroCostDL4Mic. This indicates that deep learning can ensure better accuracy than a conventional method in terms of the correct identification of objects as nuclei. Most of ZeroCostDL4Mic's predictions also scored better than Labkit's and Stardist's predictions, and its scores lie in a narrow range.

Box 2: Calculating the Intersection over Union

The IoU is a measure of the pixel-wise overlap between the reference ground truth (GT) and the prediction (P) (Caicedo et al., 2019a). The IoU represents the accuracy of a prediction in terms of the localization of the nuclei boundaries. When binary masks of the ground truth and the prediction are overlaid, the union is the combined area that the ground truth and prediction cover, and the intersection is where they overlap (Figure 7). The IoU is calculated as follows:

$$IoU(GT, P) = \frac{GT \cap P}{GT \cup P} = \frac{\mathrm{Intersection}}{\mathrm{Union}}$$

The IoU can take a value between 0 and 1:

- IoU = 0: no overlap between the ground truth and the prediction (intersection = 0 pixels);
- 0 < IoU < 1: partial overlap between the ground truth and the prediction (intersection ≠ 0);
- IoU = 1: exact overlap between the ground truth and the prediction (intersection = union).

Example:

When the intersection in Figure 7 consists of 300 pixels and the union of 600 pixels, then:

$$IoU(GT, P) = \frac{300}{600} = 0.5$$

Figure 7: Intersection and union between ground truth and prediction. Above, the binary masks with a pixel value of 255 are shown. Below, coloured masks are used for clarification. The white pixels have a value of 255 and the background a value of 0. The true pixel counts of the masks are unknown.
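
As a minimal sketch, the whole-image IoU used in this research reduces to a pixel-wise comparison of two binary masks (the tiny arrays below are hypothetical, not taken from the dataset):

```python
# Whole-image IoU between two binary masks.
import numpy as np

def iou(gt: np.ndarray, pred: np.ndarray) -> float:
    gt, pred = gt.astype(bool), pred.astype(bool)
    intersection = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    return intersection / union if union else 1.0  # both empty -> exact match

gt = np.zeros((8, 8), dtype=bool); gt[2:6, 2:6] = True      # 16 px
pred = np.zeros((8, 8), dtype=bool); pred[4:8, 4:8] = True  # 16 px, 4 overlap
print(f"IoU = {iou(gt, pred):.2f}")  # 4 / 28 = 0.14
```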


Overall, Stardist performed better than Labkit, but not significantly. It can be concluded that training a unique model is more accurate at nuclei identification than the other methods.

To determine the accuracy based on the localization of the nuclei boundaries, the IoU was calculated over each test image (Figure 8A). The spread of results is smaller for all strategies compared to the F1-scores. This indicates that there were no critical differences between the test images that could have positively or negatively affected the localization of the boundaries. Similar to the F1-score, the basic segmentation performed the worst. Stardist and ZeroCostDL4Mic do not differ much from each other, but ZeroCostDL4Mic has slightly better IoU values. Three of the five images predicted by Labkit obtained a higher IoU than both Stardist and ZeroCostDL4Mic. From this it can be concluded that machine learning is better at localizing the true nuclei boundaries than a conventional method, and might be better than deep learning in some cases.

In Figure 8B, the two metrics are combined. The closer a data point is to the top right of the figure, the better the accuracy of the prediction; the opposite holds for data points in the lower left. This figure confirms that the basic segmentation results in the least accurate predictions. The scattering of Labkit's results is also clearly visible again: only one image predicted by Labkit has a high value for both the F1-score and the IoU. Meanwhile, ZeroCostDL4Mic's results provide reliability because they are not scattered, and most of its F1-score and IoU values are higher than those of the conventional method and the Stardist strategy. To conclude, training a deep learning model specifically on the DAPI dataset results in more accurate nuclei segmentation predictions, and more reliability in future tests, than the conventional method and a deep learning method that uses a pre-trained model.

Figure 8: Accuracy. A) The left graph shows the F1-score values of all five test images per strategy. The dots represent the images; the median with a 95% confidence interval (CI) is also indicated. The right graph shows the IoU values of all five test images per strategy; again, the median with a 95% CI is indicated. B) Combination of the F1-score and IoU. One symbol represents one image. The round blue symbols represent the basic segmentation in Fiji, the square red symbols the segmentation by Labkit, the triangular (point up) green symbols the segmentation by Stardist, and the triangular (point down) purple symbols the segmentation by ZeroCostDL4Mic.


3.2 Deep learning correctly separates overlapping nuclei

To find out how the strategies differ in the number and kind of segmentation errors they make, the numbers of merges and splits (Figure 2D & 2E) were determined. The basic segmentation and Labkit made many merges and few splits (Table 1). This means these strategies have a hard time recognizing and separating clustered nuclei. Stardist shows the exact opposite: it made only one merge over the entire test set, but also falsely separated multiple nuclei. This type of segmentation error is not found with ZeroCostDL4Mic. ZeroCostDL4Mic did make a few more merges, but this is negligible compared to the number of merges of the basic segmentation and Labkit. From this it can be concluded that deep learning methods are more successful at separating overlapping nuclei than a conventional or machine learning method, but that training a unique deep learning model is necessary to ensure that no spurious objects are detected either.

In addition, merges and splits always result in multiple FPs and/or FNs. This means that the basic segmentation generated the most FPs and FNs, followed by Labkit, Stardist and ZeroCostDL4Mic. The same pattern can be seen in Figure 8A, where the overall F1-score is the worst for the basic segmentation, followed by Labkit, Stardist and finally ZeroCostDL4Mic. This indicates that the number of merges and splits strongly influences the F1-score.

Table 1: Segmentation errors. For each image in the test set, the number of merges and splits is given per strategy, together with the totals.

Strategy             Merges per image    Total merges    Splits per image    Total splits
Basic segmentation   15, 9, 4, 10, 9     47              0, 0, 0, 1, 0       1
Labkit               11, 5, 0, 8, 6      30              1, 0, 0, 3, 1       5
Stardist             0, 1, 0, 0, 0       1               1, 1, 4, 7, 4       17
ZeroCostDL4Mic       3, 1, 0, 0, 0       4               0, 0, 0, 0, 0       0

3.3 Deep learning does not always require previous knowledge and skill

To evaluate the accessibility of the strategies, the requirements, time consumption and ease of use were investigated. Training deep learning models requires expensive and powerful hardware, usually a GPU or Tensor Processing Unit (TPU). This is disadvantageous for non-expert users because they usually only have access to a Central Processing Unit (CPU). The basic segmentation, Labkit and Stardist are all run in Fiji and therefore only require a CPU. ZeroCostDL4Mic, however, uses cloud computing and provides free access to a GPU and TPU through Google Colab. Therefore, ZeroCostDL4Mic cannot be distinguished from the other strategies based on hardware accessibility.

Table 2 shows that ZeroCostDL4Mic requires the longest preparation time and test time. The preparation time is very long because it also includes the training time: the model was retrained more than ten times before the final model was obtained. While following the steps in the notebook, several errors were also encountered. Although ZeroCostDL4Mic is a user-friendly platform and provides a helpful tutorial, a non-expert user will likely run into some errors too and not immediately get the expected results (Supplementary Note 1). Also, the one hour indicated in the table to execute a full segmentation of five test images does not mean it takes two hours for ten test images; this varies per dataset. After ZeroCostDL4Mic, the conventional method was the most time-consuming: Fiji and the use of its commands needed some understanding and familiarisation time. Training the classifier in Labkit requires some expertise, but this was easily understood, and a full segmentation of one image can be performed in under 2 minutes. The most accessible strategy was Stardist: performing a segmentation with it only takes a few clicks. It can be concluded that the use of deep learning does not always require powerful and expensive hardware or prior experience. However, this is only true when a pre-trained model is used; training a unique model requires some expertise and time.


Table 2: Accessibility. The preparation time is an indication of the time it took to understand and get familiar with each strategy; for ZeroCostDL4Mic, the generation of training masks is also included. The test time indicates the time it took to perform a full segmentation on the entire test set of 5 images.

Strategy             Preparation time                     Test time     Hardware requirements
Basic segmentation   2 hours                              20 minutes    CPU
Labkit               1 hour                               10 minutes    CPU
Stardist             30 minutes                           5 minutes     CPU
ZeroCostDL4Mic       longest (includes model training)    1 hour        GPU/TPU (free via Google Colab)

4. Discussion

Four different strategies for nuclei segmentation, ranging from a conventional method to human-like artificial intelligence, have been compared based on their accuracy and accessibility. The F1-score was calculated to determine which strategy made the fewest mistakes in correctly identifying the nuclei. It was concluded that the numbers of merges and splits have a significant influence on the F1-score. The semantic segmentation strategies predicted many merges and few splits. This can be explained by the fact that semantic segmentation focuses on pixels rather than shapes; Moen et al. (2019) already stated that the distinction of overlapping objects will not always be achieved with semantic segmentation. In this research, an attempt was made to solve this problem by applying a watershed. A watershed works well on circular objects; however, most overlapping nuclei had peculiar forms, and densely clustered nuclei (Figure S1B) cannot be separated by a watershed either. Semantic segmentation strategies are suitable for the segmentation of fluorescence images with little to no touching or overlapping objects. This also explains the large spread in the F1-score results of the basic segmentation and Labkit. One of the images in the test set contains only a minimal number of close or overlapping nuclei, and this particular image has a much higher F1-score than the other images for both semantic strategies (Figure 8A). If this image had not been included in the test set, the spread of results would be smaller and the overall F1-scores for the basic segmentation and Labkit would be worse.

Stardist's instance segmentation model had the exact opposite problem. This model is too focused on predicting roundish shapes and therefore split non-roundish nuclei into multiple roundish nuclei. This problem was solved by the uniquely trained model in ZeroCostDL4Mic. Instance segmentation with the help of deep learning turns out to provide the best accuracy in terms of correct identification of the nuclei. Nevertheless, training the model in ZeroCostDL4Mic required significantly more time than all other strategies, whereas Stardist's pre-trained model can be used in just a few clicks.

So, training a deep learning model ensures the most accurate results but is not always accessible for novice users and takes time and effort. However, using a pre-trained deep learning model that suits the dataset also produces more accurate results than a semantic segmentation strategy. To answer the central question, 'Are nuclei segmentation methods that use artificial intelligence more accurate and accessible for non-expert users than conventional methods?': artificial intelligence methods for nuclei segmentation are more accurate than a conventional method and can be more accessible for non-expert users as well.

4.1 Intersection over Union requires expertise

The IoU was used to calculate the accuracy of the predictions in terms of finding the true boundaries of the nuclei. However, this value represented the entire image instead of an average per object. This is open to debate, and the IoU is therefore not included in the conclusion of this research. Due to how the IoU was calculated, merges and splits did not influence the IoU value. For example, if a prediction merges two touching nuclei, this does not affect the IoU, because the binary image of the ground truth also shows a merge of these nuclei, simply because they overlap. A part of this problem can be solved by running a watershed, but as stated previously, a watershed will not always separate touching ROIs. A better method is to calculate the IoU per ROI rather than over an entire image with all the masks. This method was used, for example, at the 2018 Data Science Bowl, which was dedicated to instance nuclei segmentation (Caicedo et al., 2019b). Returning to the example above: if the IoU were calculated per ROI, a merge of two touching nuclei would cover one true nucleus while the other true nucleus would be covered by nothing, resulting in one low IoU and one IoU of 0. With this method, metrics like precision, recall and the F1-score can be determined at different IoU thresholds. The higher the IoU threshold is set, the fewer predicted masks are counted as TPs, and therefore the lower the F1-score. The strategy with the highest F1-score at the highest IoU threshold would then be considered the most accurate.
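
A minimal sketch of such a per-ROI evaluation is given below. It greedily matches each ground-truth object to its best-overlapping prediction and computes the F1-score at a chosen IoU threshold; the label images are hypothetical, and published implementations such as the Data Science Bowl metric use more careful matching:

```python
# Per-object IoU matching: a TP is counted only when a ground-truth object
# and a predicted object overlap with IoU >= thresh (greedy, one-to-one).
import numpy as np

def f1_at_iou(gt_labels: np.ndarray, pred_labels: np.ndarray,
              thresh: float = 0.5) -> float:
    gt_ids = [i for i in np.unique(gt_labels) if i != 0]
    pred_ids = [j for j in np.unique(pred_labels) if j != 0]
    matched, tp = set(), 0
    for i in gt_ids:
        gt_mask = gt_labels == i
        best_iou, best_j = 0.0, None
        for j in pred_ids:
            pred_mask = pred_labels == j
            inter = np.logical_and(gt_mask, pred_mask).sum()
            union = np.logical_or(gt_mask, pred_mask).sum()
            iou = inter / union if union else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= thresh and best_j not in matched:
            tp += 1
            matched.add(best_j)      # each prediction matches at most once
    fp = len(pred_ids) - tp          # unmatched predictions
    fn = len(gt_ids) - tp            # unmatched ground-truth nuclei
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0

# Toy check: two ground-truth nuclei, only one predicted.
gt = np.array([[1, 1, 0, 2, 2],
               [1, 1, 0, 2, 2]])
pred = np.array([[1, 1, 0, 0, 0],
                 [1, 1, 0, 0, 0]])
print(f"F1 @ IoU>=0.5: {f1_at_iou(gt, pred):.2f}")  # 1 TP, 1 FN -> 0.67
```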



Nevertheless, the time and programming skills to apply this method were not available in this research. This indicates that not only the use of deep learning but also the evaluation of its results creates a barrier for non-expert users. A simplified way to perform this method will probably become available soon, but until then it is recommended to get familiar with the programming language Python, which is broadly used in the analysis of nuclei segmentation. The IoU value of an entire image still says much about the coverage between the prediction and the ground truth. However, a semantic segmentation strategy with machine learning properties like Labkit benefits significantly from this whole-image calculation. If the IoU-threshold method were used, Labkit's scores would decrease significantly because it created many merges, whereas ZeroCostDL4Mic would obtain the best F1-scores at a high IoU threshold because this strategy makes minimal prediction errors.

4.2 Other deep learning strategies

During this research, other deep learning strategies were also investigated. Ultimately, these strategies turned out to either not suit the dataset or not yet be accessible for a non-expert. Two of these strategies are discussed below: U-net and DeepImageJ.

U-net is a popular deep learning architecture that can be used for a variety of biomedical segmentation tasks and uses fully convolutional networks (Ronneberger, Fischer & Brox, 2015). An interface that runs as a plugin in Fiji was also developed for the U-net architecture (Falk et al., 2019). At the beginning of this study, it was planned to include a segmentation using the U-net plugin. It is claimed that this plugin is suitable for analyzing data as a non-expert; however, this was not the case. Despite multiple tutorials that explain the use of the plugin, there were many failed attempts to use it. The main problems concern the backend server requirements and the use of Ubuntu Linux. Many attempts were made to get the plugin to work, but in vain. It has to be mentioned that the pre-trained models provided did not seem suitable for the DAPI dataset. In ZeroCostDL4Mic, a U-net network can also be used to train a model. This was tried but did not seem to work for the DAPI dataset either. Training a unique model would probably result in accurate predictions, because the same has been observed in other studies (Caicedo et al., 2019a; Caicedo et al., 2019b). For now, the U-net plugin is not accessible for non-expert users.

On the contrary, a plugin that is suitable for the non-expert user is DeepImageJ. DeepImageJ is an open-source and user-friendly plugin that runs a variety of pre-trained deep learning models in a few clicks, without the necessity of any expertise (Gómez-de-Mariscal et al., 2019). The list of pre-trained models covers various image processing tasks, like instance segmentation, denoising and super-resolution. The possibilities of DeepImageJ were investigated, but the plugin was not needed for this specific research: the only pre-trained model in DeepImageJ suitable for fluorescent DAPI nuclei and instance segmentation was a Stardist model, and Stardist models had already been extensively investigated, so DeepImageJ would not add anything new. Nevertheless, DeepImageJ is strongly recommended for non-experts looking for deep learning models: it is very accessible, works fast and offers a variety of pre-trained models for image processing tasks.

4.3 The future of deep learning

This study, and other studies (Caicedo et al., 2019a; Caicedo et al., 2019b), have shown that deep learning is on its way to matching human performance. Nevertheless, the use of deep learning is mainly limited to programming experts. More cloud computing programs such as ZeroCostDL4Mic should be released to break down the accessibility barrier, not only for nuclei segmentation but also for other image analysis tasks that are essential in cell biology research. In addition, ZeroCostDL4Mic only facilitates use and training but does not help to evaluate the results based on their accuracy (F1-score per IoU threshold). In the future, researchers should not only focus on improving deep learning but also on making it more accessible to the public. Cell biologists and other researchers should not be dependent on their programming skills and hardware possession to obtain the most accurate results for important image analysis tasks.


5. Literature

Abràmoff, M. D., Magalhães, P. J., & Ram, S. J. (2004). Image processing with ImageJ. Biophotonics international, 11(7), 36-42.

Artz, M. (2019, 29 October). Labkit. Retrieved 6 June 2020, from https://imagej.net/Labkit

Caicedo, J. C., Roth, J., Goodman, A., Becker, T., Karhohs, K. W., Broisin, M., ... & Carpenter, A. E. (2019a). Evaluation of deep learning strategies for nucleus segmentation in fluorescence images. Cytometry Part A, 95(9), 952-965.

Caicedo, J. C., Goodman, A., Karhohs, K. W., Cimini, B. A., Ackerman, J., Haghighi, M., ... & Rohban, M. (2019b). Nucleus segmentation across imaging experiments: the 2018 Data Science Bowl. Nature Methods, 16(12), 1247-1253.

Falk, T., Mai, D., Bensch, R., Çiçek, Ö., Abdulkadir, A., Marrakchi, Y., ... & Dovzhenko, A. (2019). U-Net: deep learning for cell counting, detection, and morphometry. Nature Methods, 16(1), 67-70.

Gómez-de-Mariscal, E., García-López-de-Haro, C., Donati, L., Unser, M., Muñoz-Barrutia, A., & Sage, D. (2019). DeepImageJ: A user-friendly plugin to run deep learning models in ImageJ. bioRxiv, 799270.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.

Lichtman, J. W., & Conchello, J. A. (2005). Fluorescence microscopy. Nature Methods, 2(12), 910-919.

Mitchell, T. M. (1997). Machine learning. McGraw-Hill.

Moen, E., Bannon, D., Kudo, T., Graf, W., Covert, M., & Van Valen, D. (2019). Deep learning for cellular image analysis. Nature Methods, 16, 1233-1246.

Nuclei Watershed Separation. (2020, 24 January). Retrieved 6 June 2020, from https://imagej.net/Nuclei_Watershed_Separation

Ronneberger, O., Fischer, P., & Brox, T. (2015, October). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (pp. 234-241). Springer, Cham.
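
Schindelin, J., Arganda-Carreras, I., Frise, E., Kaynig, V., Longair, M., Pietzsch, T., ... & Cardona, A. (2012). Fiji: an open-source platform for biological-image analysis. Nature Methods, 9(7), 676-682.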

Schmidt, U., Weigert, M., Broaddus, C., & Myers, G. (2018, September). Cell detection with star-convex polygons. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 265-273). Springer, Cham.

Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information processing & management, 45(4), 427-437.

Von Chamier, L., Jukkala, J., Spahn, C., Lerche, M., Hernández-Pérez, S., Mattila, P., ... & Buchholz, T. O. (2020). ZeroCostDL4Mic: an open platform to simplify access and use of deep-learning in microscopy. bioRxiv.

Von Chamier, L., Laine, R. F., & Henriques, R. (2019). Artificial intelligence for microscopy: what you should know. Biochemical Society Transactions, 47(4), 1029-1040.


6. Supplementary

6.1 Note 1: Tips for novice users of ZeroCostDL4Mic

For novice ZeroCostDL4Mic users without any image analysis expertise, achieving the desired results can be difficult and time-consuming. A few tips are given below. Firstly, immediately start training a unique model. The video tutorial shows how a particular model can be trained in the notebook, and the first problem while using ZeroCostDL4Mic was replicating this model. To apply a pre-trained model, the image properties of the test data must be the same as those of the training data of the model; properties like pixel size and bit depth therefore all have to be changed manually in the test set. When a unique model is trained with training and test images from the same dataset, this problem does not occur, which saves much time. Secondly, when training a model, make sure that the images and the corresponding masks have the same file names in Google Drive. This might seem obvious, but it is not indicated clearly enough in the tutorial or notebook and could be the first problem that novice users face. Lastly, retrain the model several times with different parameters to find out which settings best fit the dataset.

6.2 Note 2: Overlapping nuclei form a problem for deep learning

Some situations are difficult to predict, and a strategy's way of handling them gives much useful information. In the dataset, situations can be found where more than two nuclei are clustered and overlap. An example is displayed in Figure S1A; the ground truth shows that five different nuclei can be seen. Both pixel-based segmentation strategies, the basic segmentation and Labkit, predicted a so-called 'supermerge': all five nuclei were merged into one, resulting in one FP and five FNs. The deep learning methods, Stardist and ZeroCostDL4Mic, predicted two round-shaped nuclei, which is also far from correct. These circumstances are hard to assess, but both deep learning predictions were counted as two FPs, five FNs and no merges. The results of the basic segmentation/machine learning and deep learning therefore differ by only 1 FP, and the change in F1-score will be minimal. It can be concluded that the deep learning methods in this research are not good enough to predict these difficult situations where several nuclei are densely clustered.

The second example can be seen in Figure S1B. On the top left, a single nucleus can be seen; on the bottom right, it is not exactly clear what can be seen, and it will be referred to as 'contamination'. This type of contamination can be found multiple times in the dataset, but cannot always be seen by the naked eye before any adjustments are made. The basic segmentation merged the nucleus and the contamination. Attempts were made to exclude the contamination by changing the pixel intensity threshold, but this had a negative effect on the rest of the prediction. As a result, this particular prediction is counted as one FP and one FN, and it will also result in a low IoU. Although Labkit is also based on pixel values, its prediction is better than that of the basic segmentation and is counted as a TP: by repeatedly training the classifier after adding more marks, the contamination was eventually excluded. This machine learning feature makes Labkit more useful than the basic segmentation. Nevertheless, the prediction is not perfect and still results in a relatively low IoU. Stardist nicely predicted the shape of the nucleus, but also predicted a nucleus on the site of the contamination. ZeroCostDL4Mic only predicted the nucleus. To conclude, uniquely trained deep learning models like the one in ZeroCostDL4Mic are well suited to not recognizing contamination like this as nuclei.


Supplementary Figure 1: Significant situations. A) An example of an object (lower right) that is not recognized as a nucleus; a true nucleus can be found on the top left. B) An example of a cluster of overlapping nuclei. For both situations, the ground truth (GT) is indicated in blue, and the predictions by the classic segmentation (CS), Labkit (LK), Stardist (SD) and ZeroCostDL4Mic (ZC) are indicated in white.

