
Multi-class segmentation of 3D THG images of human brain tissue with deep learning to test the inside-out hypothesis on the cause of Multiple Sclerosis



Loek van Steijn, 11811048
July 17, 2020

INSTITUTE: Biophotonics & Medical Imaging
FACULTY: Faculty of Science, Vrije Universiteit Amsterdam
SUPERVISOR: Prof. Dr. M. L. Groot
DAILY SUPERVISOR: Drs. M. Blokker
SECOND EXAMINER: Dr. I. Avci
COURSE: Physics & Astronomy Bachelor thesis, 15 EC
CONDUCTED: March 30th to July 17th, 2020

Recent findings suggest that the demyelination seen in Multiple Sclerosis (MS) may be caused by a dysfunction in the axon-myelin synapse, where an increased sodium concentration leads to glutamate excitotoxicity and ultimately to blister formation on the axons. This hypothesis was tested earlier by exposing ex-vivo brain tissue to an increasing sodium concentration while collecting 3D Third Harmonic Generation (THG) microscopy data. To measure the blister growth more accurately, the THG data was first segmented. 3D multi-class segmentations were made using a convolutional neural network with a U-net architecture. The segmentations were split into three classes: fiber-like features (containing most of the axons), nuclei, and background. The resulting network can predict fibers and nuclei from 3D THG data with up to 84% similarity to the ground truth. The remaining 16% consists mainly of finer-grained fibers, and improvements to the network could be made to reduce this percentage. Due to insufficient data, the network was trained on THG data with a field of view of 350-500 µm. Although similar in appearance, the available blister-induced data had a field of view of 150-250 µm, and the network is not able to predict segmentations in this range. To segment blister-induced data in the future, the network will need to be retrained, either on additional blister data acquired with a larger field of view or on existing data adapted to match the training range.

Recent findings suggest that the cause may lie elsewhere. It may be a dysfunction in the axon-myelin synapse, the point where communication between two neurons takes place, in which an increased sodium concentration in the brain fluid leads to overstimulation by glutamate, a neurotransmitter, which in turn leads to the formation of blisters between the axons and the myelin. These blisters interrupt signal transmission between neurons and are characteristic of MS. This hypothesis was tested by exposing brain tissue to an increasing sodium concentration while recording the progression with a microscopy technique. To measure the formation of the blisters accurately, the microscopy data must be segmented so that the shape of all axons and cell nuclei can be determined accurately. These segmentations were made with a 3D neural network based on the U-net architecture. This network can distinguish shapes in three categories, namely axons, cell nuclei and background, with an accuracy of up to 84%, and was more accurate than segmentation using conventional algorithms. The network mainly struggles to segment the smaller axons, but there are clear indications of how this can be improved. The network was trained on data with a field of view of 350-500 µm because there was a shortage of data with induced blisters. Although the blister data closely resembled the training data, it had a field of view of 150-250 µm and the network could not segment it. To obtain good segmentations, more data must be collected with a larger field of view, or existing data must be adapted so that the network can train on it.

1 Introduction
2 Method - Image segmentation
  2.1 Data and pre-processing
  2.2 Image segmentation
    2.2.1 Histogram equalization
    2.2.2 Anisotropic diffusion
    2.2.3 Active contour
  2.3 Post-processing
    2.3.1 Data selection
    2.3.2 Processing
    2.3.3 Feature classification
3 Method - Convolutional neural network
  3.1 Data pipeline
  3.2 U-net architecture
    3.2.1 Convolution
    3.2.2 Max pooling
    3.2.3 Final layer
    3.2.4 Loss function
    3.2.5 GDL modifications
4 Results
  4.1 Training & Validation
  4.2 Test predictions & Network evaluations
  4.3 Additional THG images
5 Discussion
  5.1 Ground truth creation
  5.2 CNN image segmentation
  5.3 Blister growth measurements
6 Conclusion
8.2 Weight initialization & Learning rate
8.3 Activation function

1 Introduction

Multiple sclerosis (MS) is believed to be an autoimmune disorder in which the immune system's CD4+ T-lymphocytes (1) demyelinate the outer myelin sheath surrounding the axons. The myelin sheath is a lipid-rich membrane that insulates the axon and enables efficient transmission of electrical pulses along the axon. The demyelination causes deficits in the motor and sensory functions of the nerve cells. This hypothesis is referred to as the outside-in hypothesis. Recently it has been challenged by the inside-out hypothesis, in which the inner myelin sheath is demyelinated through a dysfunction in the axon-myelin synapse (AMS). This dysfunction could arise from an imbalance of the neurotransmitter glutamate, also known as excitotoxicity. Glutamate is released when the axon is electrically active, a process driven by a higher concentration of calcium ions. The imbalance stimulates the release of protein arginine deiminases, which convert arginine into citrulline and damage the functionality of the myelin proteins. The combination of excitotoxicity and the release of protein arginine deiminases is what is thought to trigger the immune system, which then no longer plays the primary role (2).

It is hypothesised that this process causes blistering of the myelin sheath, in which the myelin detaches from the axon, resulting in MS. But how does excitotoxicity arise? A possibility is that sodium channels are stimulated, resulting in a reversal of the sodium-calcium exchange pump and a higher concentration of calcium ions in the AMS, which in turn causes the release of excessive glutamate (2). To test this, 4D (3D in time) Third Harmonic Generation microscopy images of human brain tissue were made, to which a buffer of increasing sodium concentration was periodically added in order to capture the potential myelin blistering. The myelin sheath is a lipid-rich membrane, which produces a Third Harmonic Generation (THG) signal at the refractive-index mismatch between the myelin sheath and the aqueous fluids (3), namely the cerebrospinal fluid (CSF) and interstitial fluid surrounding the axons. To obtain a THG signal, a laser fires 200 fs pulses into the tissue; signal arises where phase matching at the refractive-index interface combines three photons into a single photon with three times the energy, at a third of the wavelength. This requires the source to have a sufficiently long wavelength of 1070 nm to ensure that the signal falls within the near-visible spectrum. The THG signal can only arise at the focal point of the excitation laser, which makes it possible to visualise tissue beneath the surface up to a depth of 200 µm (4). THG has previously been used to determine the ratio between the diameter of the inner and outer layer of the myelin sheath, the so-called g-ratio (5). The g-ratio is used as a quantitative measure of the degree of myelin blistering. In order to measure the g-ratio, the THG data was segmented. Due to the scarcity of brain tissue from patients diagnosed with MS, the blisters were induced in tissue from patients diagnosed with epilepsy. A segmentation is a (usually binary or multi-channel) image which, in this case, clearly indicates the shape of the fiber-like features such as axons and of the nuclei. Creating a segmentation of the THG data assists in measuring the g-ratio more accurately and reduces subjectivity. Segmentation is a time- and labour-intensive process, which was reduced with a convolutional neural network (CNN) using a U-net model architecture (6).

A CNN is a deep neural network that is most commonly used for image analysis. The network was trained and validated on 3D THG data of brain tissue from patients diagnosed with epilepsy or a variant of glioma. The available data contains both fiber-like features and nuclei, so the network is required to be able to identify both. In order for a CNN to predict these 3D segmentations, it requires ground-truth segmentations, which were made using existing algorithms and split into classes using a selection of metrics. These steps are further elaborated in the method section.

2 Method - Image segmentation

In this section the creation of the ground truths for the convolutional neural network (CNN) is addressed. The process is split into three parts: first the data acquisition, properties and pre-processing (2.1), followed by the image segmentation (2.2) and the data post-processing (2.3).

2.1 Data and pre-processing

16-bit THG stacks were obtained from ex-vivo brain tissue samples of patients diagnosed with a variant of glioma or epilepsy; a patient overview is shown in Table 1 and visually in Figures 1 and 2. The brain tissue to which an increasing concentration of sodium was added to induce myelin blistering came from a patient diagnosed with epilepsy and is shown in Figure 3. Each sample was measured three times at the same location and the average voxel value of these three measurements was stored in a slice of the image stack; a slice was made every 2 µm along the depth of the sample. The stacks were obtained between 2017 and 2018 with a variable total imaging depth (ranging between 22 and 102 µm) and field of view (350 or 500 µm).

The slices have a resolution of 1050 x 1050 or 1032 x 1032 pixels. Future THG stacks will have a voxel size of about 0.5 x 0.5 x 0.5 µm, so the CNN needs to train on voxels of a similar volume; to achieve these dimensions the stacks need to be resized accordingly. The height and width of each slice need to be decreased while the depth of the stack needs to be increased. To prevent any loss of data prior to making the segmentations, only the depth of the stack was resized during pre-processing. Resizing was done with the resize function in Fiji's ImageJ using bilinear interpolation between slices. The length and width were resized during the post-processing phase, after segmentation was performed.
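As an illustration of this resampling step, a minimal Python sketch is given below. It assumes the stack is available as a NumPy array with axes (z, y, x) and uses scipy.ndimage.zoom as a stand-in for the resize function in Fiji's ImageJ; the function name and default values are illustrative only.

```python
import numpy as np
from scipy import ndimage

def resample_depth(stack: np.ndarray, z_step_um: float = 2.0,
                   target_um: float = 0.5) -> np.ndarray:
    """Interpolate extra slices along z so the slice spacing matches target_um.

    Only the depth axis is resized here, mirroring the pre-processing step;
    x and y are resized after segmentation, during post-processing.
    """
    zoom_factors = (z_step_um / target_um, 1.0, 1.0)   # e.g. 2.0 / 0.5 = 4x more slices
    # order=1 corresponds to (bi)linear interpolation between slices
    return ndimage.zoom(stack.astype(np.float32), zoom_factors, order=1)

# Example: a 44-slice stack becomes 176 slices (real slices are 1050 x 1050 pixels)
dummy = np.random.rand(44, 128, 128).astype(np.float32)
print(resample_depth(dummy).shape)   # (176, 128, 128)
```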

Patient | Diagnosis | Tissue description
1 | Left-frontal astrocytoma type II | Both myelinated axons and nuclei are present; image shown in Figure 1
2 | Epilepsy | Myelinated axons are present; image shown in Figure 2
3 | Right-temporal glioblastoma type IV | Mainly nuclei are present
4 & 5 | Epilepsy with induced blisters | Myelinated axons and blisters are present; image shown in Figure 3

Table 1: An overview of the origins of the brain tissue used during this project. The tissue description indicates which features are primarily present.

Figure 1: The composition of an astrocytoma type II sample. The sample contains both fiber-like features (shown in green) and nuclei (shown in red); this sample is used to train the network. Glioma type IV is similar but contains fewer fiber-like features.

Figure 2: The composition of an epilepsy sample. The sample only contains fiber-like features, such as the one shown in green.

Figure 3: The composition of an epilepsy sample that has been treated with a sodium solution to induce blistering of the inner myelin sheath. One of the blisters is indicated with a green arrow.

2.2 Image segmentation

The image segmentation is based on the three-step segmentation process used by Zhang et al. (7). This process relies on a combination of histogram equalization, anisotropic diffusion and active contour steps to create a segmentation of the acquired data. Each of these steps is described in more detail below.

2.2.1 Histogram equalization

Histogram equalization is used to enhance the contrast; this is required to further segment the dark features, such as nuclei, within the THG stacks. The two most common algorithms are global histogram equalization (GHE) and local histogram equalization (LHE). GHE uses the global histogram, which adapts poorly to local bright features and therefore results in a relatively weak enhancement. LHE overcomes this problem by using a block-overlapped method in which the histogram is equalized specifically for the pixel in the center of the block; this process is repeated for each pixel and allows each pixel to adapt to its neighboring region. LHE results in a better enhancement than GHE but is much more computationally demanding. The solution is partially overlapped sub-block histogram equalization (POSHE). This method uses the block method from LHE, but instead of equalizing each pixel individually, all pixels within the sub-block are equalized. This creates a blocking effect, which is eliminated when the sub-blocks partially overlap, resulting in a good enhancement that is also less computationally demanding (8). In this project POSHE is used to equalize the THG stacks; the sub-block dimensions are 200 x 200 x 6 voxels with a step size equal to a sixth of the THG stack dimensions. An inclusion percentile ∈ [0, 1] is introduced to make sure that pixel outliers are not included in the histogram equalization. This inclusion was set at the 0.9999th percentile, which was found to be most effective when comparing several different percentiles, and was applied to all sub-blocks. During this process the voxels were equalized to a maximum value of 255 (the maximum value in 8-bit format). This process takes roughly 10-30 minutes to complete, depending on the depth of the image stack.
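A simplified 2D sketch of the partially overlapped sub-block equalization idea is given below. The block size, step size and clipping percentile are illustrative values, and the accumulate-and-average handling of the overlaps is an assumption rather than the exact scheme used in this project.

```python
import numpy as np

def poshe_2d(img, block=(200, 200), step=(100, 100),
             inclusion_percentile=99.99, out_max=255):
    """Partially overlapped sub-block histogram equalization (2D sketch).

    Each sub-block is equalized as a whole; overlapping results are averaged,
    which suppresses the blocking artifacts of non-overlapping blocks.
    """
    def starts(dim, blk, stp):
        s = list(range(0, max(dim - blk, 0) + 1, stp))
        if s[-1] != max(dim - blk, 0):
            s.append(max(dim - blk, 0))   # cover the image edge as well
        return s

    img = img.astype(np.float64)
    out = np.zeros_like(img)
    counts = np.zeros_like(img)
    for y in starts(img.shape[0], block[0], step[0]):
        for x in starts(img.shape[1], block[1], step[1]):
            sub = img[y:y + block[0], x:x + block[1]]
            clip = np.percentile(sub, inclusion_percentile)   # exclude bright outliers
            sub_c = np.clip(sub, None, clip)
            hist, edges = np.histogram(sub_c, bins=256)
            cdf = np.cumsum(hist).astype(np.float64)
            cdf = (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1) * out_max
            out[y:y + block[0], x:x + block[1]] += np.interp(sub_c, edges[:-1], cdf)
            counts[y:y + block[0], x:x + block[1]] += 1
    return (out / np.maximum(counts, 1)).astype(np.uint8)
```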

2.2.2 Anisotropic diffusion

The stacks contain noise associated with THG microscopy. To reduce the amount of noise, an anisotropic diffusion filtering (ADF) step is applied. Higher Order Statistics (HOS) is used to provide salient edge enhancement, which further reduces the noise by retaining only the most salient features present in the image stacks. It does this by determining the gradient based on the surrounding voxels; the higher the order of the HOS, the more salient the preserved edges will be. Here, first-order HOS was used throughout. One key parameter that needs to be set is the C threshold ∈ [0, 100]. This threshold is used to distinguish the flat regions (regions with similar voxel values) from the non-flat regions (regions with a large variety in voxel values) (9). Parts C and D of Figure 4 demonstrate the difference in the resulting segmentation when this C threshold is adjusted: in part C the threshold is 20 and in part D it is 70. After examining specific fiber-like features and nuclei in segmented stacks made with different C thresholds, it was found that the C threshold only has a minor effect on the quality of the final segmentation; for this reason the C threshold was set to 35 for all stacks. This process takes anywhere between one and several hours, depending on the depth of the stack and assuming the correct C threshold was used.
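For illustration, the sketch below implements a basic Perona-Malik anisotropic diffusion on a 3D stack in NumPy. It is a simplified stand-in for the HOS-based ADF described above; the conductance threshold kappa plays a role loosely analogous to the C threshold, and all parameter values are assumptions.

```python
import numpy as np

def anisotropic_diffusion_3d(stack: np.ndarray, n_iter: int = 10,
                             kappa: float = 35.0, gamma: float = 0.1) -> np.ndarray:
    """Basic Perona-Malik diffusion: smooth flat regions, preserve strong edges.

    kappa controls which gradients count as edges (kept) versus noise (smoothed),
    loosely analogous to the C threshold separating flat from non-flat regions.
    """
    out = stack.astype(np.float64).copy()
    for _ in range(n_iter):
        fluxes = []
        for axis in range(3):
            # forward and backward differences along this axis (zero at the borders)
            fwd = np.diff(out, axis=axis, append=np.take(out, [-1], axis=axis))
            bwd = -np.diff(out, axis=axis, prepend=np.take(out, [0], axis=axis))
            for g in (fwd, bwd):
                c = np.exp(-(g / kappa) ** 2)   # conductance: small across strong edges
                fluxes.append(c * g)
        out += gamma * sum(fluxes)
    return out
```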

2.2.3 Active contour

An active contour step divides the stack into features and background. A regular active contour model such as the Chan-Vese (CV) model fits a curve along the boundaries of a feature depending on the value of the zero-level set function of the contour: values below zero are considered background and values above zero are considered foreground regions (10). This usually works well for homogeneous stacks, but THG stacks are usually not homogeneous, due to the inhomogeneity of the tissue and the degradation of the signal along the depth of the stack. Because of this inhomogeneity, a regular CV model will often fall short of a good segmentation result. This is overcome by adding a weight based on prior extremes (ACPE) (7); with this ACPE weight ∈ [0, 1] it is possible to force the zero-level set function into a certain region. A smaller ACPE weight results in smaller foreground regions and vice versa.

In order to find the best ACPE weight, the stacks were segmented several times with varying weights. The segmentations were then compared to the raw data and evaluated on several key aspects: first, the segmentation around fibers and nuclei, making sure that the salient nuclei and fibers were segmented correctly (an example is marked in yellow in Figure 4); second, the amount of segmented background noise, making sure this was reduced to a minimum; and lastly, whether separated fibers and nuclei actually remain separated in the segmentation. Parts E and F of Figure 4 demonstrate the difference in the segmentation result for different ACPE weights: part E has an ACPE weight of 0.70 and part F an ACPE weight of 0.90. Following the method described above, the best ACPE weight was 0.54 and this was used for all stacks. This process takes between 15 and 30 minutes, depending on the depth of the stack and assuming the correct ACPE weight was used.

Figure 4: Segmentation series. A: original slice. B: slice pre-processed with POSHE using the 0.9999th percentile. C: segmentation using a C threshold of 20 and an ACPE weight of 0.80. D: segmentation using a C threshold of 70 and an ACPE weight of 0.80. E: segmentation using a C threshold of 20 and an ACPE weight of 0.70. F: segmentation using a C threshold of 20 and an ACPE weight of 0.90. The difference between a C threshold of 20 and 70 (C and D) is minimal for the indicated myelinated fiber (yellow marking), which is not connected in either image. In images E and F the ACPE weight is the variable (0.70 and 0.90 respectively); an ACPE weight of 0.90 does connect the indicated fiber-like feature.
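As a rough illustration of the active contour step, the sketch below runs a standard (unweighted) Chan-Vese segmentation slice by slice using scikit-image. The ACPE weighting itself is the modification described in (7) and is not reproduced by this stand-in.

```python
import numpy as np
from skimage.segmentation import chan_vese

def active_contour_stack(stack: np.ndarray) -> np.ndarray:
    """Run a plain 2D Chan-Vese per slice as a stand-in for the ACPE-weighted model.

    Returns a boolean stack where True marks foreground (feature) voxels.
    """
    result = np.zeros(stack.shape, dtype=bool)
    for z in range(stack.shape[0]):
        sl = stack[z].astype(np.float64)
        sl = (sl - sl.min()) / max(sl.max() - sl.min(), 1e-12)   # normalize to [0, 1]
        result[z] = chan_vese(sl, mu=0.25)   # mu controls contour smoothness
    return result
```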

2.3 Post-processing

In this section the post-processing steps are elaborated and substantiated. The data selection section (2.3.1) describes how the data used as ground truth for the CNN was selected, the processing section (2.3.2) describes how the selected data was processed into data with the correct properties for the network, and the classification section (2.3.3) describes how the data was split into a fiber, nucleus and inconclusive feature class so that the network is able to identify the different features.

2.3.1 Data selection

Not all segmented data is usable as ground truth. Some segmentations contained so many features that the segmentation turned into one big white blob. Other segmentations contained too many unrecognizable features, or had artifacts such as vessels or tissue deformations within the stack, and could not be used as ground truth. All these stacks were excluded from the data.

2.3.2 Processing

The process is divided into several steps. The first is the removal of any noise that did not get filtered out by the segmentation process. This removal is based on the feature (fiber or nucleus) volume: the volumes of 20 small nuclei and fibers in every segmentation stack were measured and none of these features had a volume below 80 µm3, so all features below this threshold were removed from the segmentation using ImageJ's 3D object counter. Fibers and nuclei are often connected in the segmentation stack, making it more challenging to identify the individual features. To split these connected features, a 3D distance transform watershed method was used. The distance transform converts the stack into feature and non-feature voxels, where every non-feature voxel obtains a value based on its distance to the nearest feature voxel (11). By inverting the distance transform, a watershed step can find the ridge between the local minima (the centers of the features in the inverted image) and ensures that different features cannot merge with neighboring features. The distance transform watershed method is shown in Figure 5. The 3D distance transform watershed function from MorphoLibJ for ImageJ (12) was used for this step, with the Borgefors method for the distance transform and a dynamic of 1 for the watershed; the dynamic is a measure of the intensity of the search for the local minimum. After the separation of connected features, several metrics were obtained using the MorphoLibJ 'analyze 3D regions' function; these metrics are used when labelling the features contained in the segmentation stack.

Figure 5: An example of the distance transform watershed method. A depicts two connected nuclei. B shows the inverted distance transform, with the centers of the nuclei as two local minima. The watershed step prevents the two minima from merging and splits the two nuclei, as indicated in C.
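A minimal Python equivalent of this splitting step is sketched below, using SciPy's Euclidean distance transform and scikit-image's watershed as stand-ins for MorphoLibJ's Borgefors distance transform and dynamic-based watershed; the min_distance value is an illustrative assumption.

```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed
from skimage.feature import peak_local_max

def split_touching_features(binary_stack: np.ndarray) -> np.ndarray:
    """Label touching features via an (inverted) distance transform watershed.

    binary_stack: boolean 3D array, True = feature voxel.
    Returns an integer label image where each separated feature has its own label.
    """
    # Distance of every feature voxel to the nearest background voxel
    dist = ndimage.distance_transform_edt(binary_stack)
    # Feature centers appear as maxima of the distance map (minima after inversion)
    peaks = peak_local_max(dist, labels=binary_stack, min_distance=3)
    markers = np.zeros(binary_stack.shape, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    # Flood the inverted distance map from the markers, restricted to the feature mask
    return watershed(-dist, markers, mask=binary_stack)
```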

The next section is dedicated to the process of classifying the features. Once the features have been classified, the length and width of the stack are resized so that the voxel size becomes the desired 0.5 x 0.5 x 0.5 µm. This is done using the ImageJ resize function without any form of averaging or interpolation, to ensure the structures remain unchanged. The order in which these steps are carried out was varied and compared; the order described here gave the best results.

2.3.3 Feature classification

The features in the segmentation stacks are classified as nucleus, fiber, inconclusive or background so that the network is able to distinguish these features. The classes were determined using the metrics derived in the previous section, consisting of the feature's volume, sphericity and ellipsoid ratio (ER), defined by Equation 1 with R1 being the longest radius and R3 the shortest radius of the ellipsoid.

$ER = \frac{R_1 + R_2}{R_3}$    (1)

It should be noted that at this point the stacks are still distance-transformed and watershed, which means that the fibers are split into smaller parts; after classifying the features, the fiber parts are reconnected. In each stack, between 15 and 17 nuclei and fiber parts with a sphericity of 0.4 or higher were analyzed on the three aforementioned metrics; the results can be seen in Figure 6. Based on these results the classification thresholds were set: any feature whose metrics fall into the blue area was classified as a nucleus, features in the yellow area were classified as fiber, features in the green area were classified as inconclusive (could be either a fiber or a nucleus), and the rest was classified as background. The thresholds for sphericity and ellipsoid ratio alone are not strict enough and resulted in too many false positives; to overcome this, the volume was also included. Any feature above or below a certain volume limit (limits defined individually for each image stack) would be too large or too small to be a nucleus and was classified as a fiber or inconclusive feature instead. The classes are applied using one-hot encoding, in which the voxel value is not stored as a single value but as a 1 x 4 array; a 1 in the first column and 0 in the others would, for example, correspond to the fiber class.
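A simplified sketch of this rule-based labelling and of the one-hot encoding is shown below; the threshold values and helper names are illustrative assumptions, not the exact per-stack limits used in this project.

```python
import numpy as np

# Class indices for the one-hot encoding (the order is an assumption)
FIBER, NUCLEUS, INCONCLUSIVE, BACKGROUND = 0, 1, 2, 3

def classify_feature(volume_um3, sphericity, ellipsoid_ratio,
                     nucleus_volume=(200.0, 2000.0)) -> int:
    """Toy decision rule combining volume, sphericity and ER = (R1 + R2) / R3."""
    too_small = volume_um3 < nucleus_volume[0]
    too_large = volume_um3 > nucleus_volume[1]
    if sphericity >= 0.7 and ellipsoid_ratio < 4.0 and not (too_small or too_large):
        return NUCLEUS
    if sphericity < 0.4 or ellipsoid_ratio >= 6.0 or too_large:
        return FIBER
    return INCONCLUSIVE

def one_hot(label_stack: np.ndarray, n_classes: int = 4) -> np.ndarray:
    """Convert an integer label stack (z, y, x) to one-hot (z, y, x, n_classes)."""
    return np.eye(n_classes, dtype=np.uint8)[label_stack]

# Example: a voxel labelled FIBER becomes [1, 0, 0, 0]
labels = np.full((2, 4, 4), BACKGROUND, dtype=np.int64)
labels[0, 0, 0] = FIBER
print(one_hot(labels)[0, 0, 0])   # [1 0 0 0]
```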

Figure 6: The distribution of 109 nuclei (blue) and 109 fibers (orange) with a sphericity of 0.4 or higher, from 7 samples of patient 1. The distribution is based on sphericity and ellipsoid ratio; the volume is indicated by the shape of the data point. Based on these results the classification thresholds were created, shown by the colored areas.

3 Method - Convolutional neural network

CNNs are typically used for image classification tasks in which an image is assigned a single classification label. These tasks often require a large number of training images to achieve accurate classifications. In biomedical image segmentation, however, the desired output is a classification label for each pixel, and a large number of training images is usually hard to come by. Ciresan et al. (13) solved the classification problem by sliding a kernel across the images and predicting the class of a single pixel based on the surrounding pixels within the kernel. Incidentally, the data from the training images is now stored in a much larger number of kernels, which simultaneously solves the lack-of-training-data problem. The drawback is that the network has to train on all of these kernels, which slows the training down, and the classification of a pixel is based only on the size of the kernel. A larger kernel allows the network to predict larger features, such as fibers, better, but a larger kernel also requires more pooling to keep the number of trainable parameters manageable, and more pooling reduces the accuracy of the classifications. The solution is a U-net model architecture (6), which is elaborated in section 3.2.

3.1 Data pipeline

In this section the data pipeline is described, explaining how the data is fed into the network. The pipeline consists of several steps. First, the THG data was normalized using min-max normalization. Then the THG and segmentation data were parsed into smaller sub-blocks, using a kernel size of 40 voxels in depth and 100 voxels in length and width with a stride of 30 voxels in depth and 75 voxels in length and width, creating sub-blocks with 25% overlap with each adjacent sub-block. All sub-blocks were augmented using 90, 180 and 270 degree rotations and mirroring along each axis, creating a total of 7 copies of each sub-block (including the original).
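A compact NumPy sketch of this sub-block parsing and augmentation is given below. The block size, stride and augmentation set follow the description above (three in-plane rotations plus a mirror along each axis, giving seven blocks per patch including the original); the helper names are assumptions.

```python
import numpy as np

def extract_subblocks(stack, size=(40, 100, 100), stride=(30, 75, 75)):
    """Cut a (z, y, x) stack into overlapping sub-blocks (25% overlap as described)."""
    blocks = []
    for z in range(0, stack.shape[0] - size[0] + 1, stride[0]):
        for y in range(0, stack.shape[1] - size[1] + 1, stride[1]):
            for x in range(0, stack.shape[2] - size[2] + 1, stride[2]):
                blocks.append(stack[z:z + size[0], y:y + size[1], x:x + size[2]])
    return blocks

def augment(block):
    """Return the original block plus 3 in-plane rotations and 3 mirrored copies."""
    rots = [np.rot90(block, k, axes=(1, 2)) for k in (1, 2, 3)]   # 90/180/270 degrees
    flips = [np.flip(block, axis=a) for a in (0, 1, 2)]           # mirror each axis
    return [block] + rots + flips                                  # 7 blocks in total

stack = np.random.rand(104, 682, 700).astype(np.float32)
stack = (stack - stack.min()) / (stack.max() - stack.min())        # min-max normalization
subblocks = [aug for b in extract_subblocks(stack) for aug in augment(b)]
print(len(subblocks))   # 216 blocks x 7 = 1512 for a 104x682x700 stack, matching Table 2
```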

The augmented training and validation sub-blocks were inserted into a tuple of two NumPy arrays, one for the THG sub-blocks and one for the ground truth segmentation sub-blocks, such that the first ground truth sub-block holds the ground truth for the first THG sub-block. These arrays were randomly shuffled while maintaining the THG-to-ground-truth alignment, and the shuffled arrays were then split into training and validation sets. Some feature shapes and characteristics are sample dependent; shuffling before splitting into training and validation sets increases the feature inhomogeneity within each set.

The training and validation arrays are loaded (as float32) into TensorFlow Datasets, to which padding is added to each sub-block for model compatibility (see section 3.2), resulting in dimensions of 40 x 104 x 104 voxels. In this project both zero padding and symmetric padding were compared. The batch size is the number of sub-blocks that are analyzed simultaneously; it is set to 16 to ensure that the RAM has sufficient memory and that 4 GPUs can be used at the same time.

This resulted in a total of 8064 training sub-blocks and 1344 validation sub-blocks, with 504 training steps and 84 validation steps per epoch.

The test set consists of THG and ground truth sub-blocks. It was kept completely separate from the training and validation sets and serves to test the network's capability of predicting features in data that it did not train or validate on. These test sub-blocks were sized to exactly fit the memory specifications, so that the resulting full prediction required the least amount of image stitching, and were also normalized using min-max normalization. An overview of the sub-blocks and how they are split into sets is shown in Table 2.
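A minimal tf.data sketch of the loading, padding and batching step is shown below, assuming the THG and ground truth sub-blocks are already available as NumPy arrays; the padding amounts turn 40 x 100 x 100 blocks into the 40 x 104 x 104 blocks described above, and the shuffle buffer size is an assumption.

```python
import numpy as np
import tensorflow as tf

def make_dataset(thg_blocks: np.ndarray, gt_blocks: np.ndarray,
                 batch_size: int = 16) -> tf.data.Dataset:
    """Pad 40x100x100 sub-blocks to 40x104x104 and batch them for training."""
    paddings = [[0, 0], [2, 2], [2, 2], [0, 0]]   # (z, y, x, channels)

    def pad_pair(x, y):
        # Symmetric padding mirrors edge values, avoiding a hard zero boundary
        return (tf.pad(x, paddings, mode="SYMMETRIC"),
                tf.pad(y, paddings, mode="SYMMETRIC"))

    ds = tf.data.Dataset.from_tensor_slices(
        (thg_blocks.astype(np.float32), gt_blocks.astype(np.float32)))
    return ds.shuffle(1024).map(pad_pair).batch(batch_size)

# Example shapes: THG blocks (N, 40, 100, 100, 1), one-hot ground truth (N, 40, 100, 100, 4)
thg = np.random.rand(32, 40, 100, 100, 1)
gt = np.zeros((32, 40, 100, 100, 4)); gt[..., 3] = 1
train_ds = make_dataset(thg, gt)
```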

Dataset | Patient | Sample | Field of view (µm) | Data dim. + pad (z, y, x) | Sub-block dim. + pad (z, y, x) | Total sub-blocks incl. augmentations
Training & Validation | 1 | 001 | 350 | 104x682x700 | 40x104x104 | 7x216 = 1512
Training & Validation | 1 | 002 | 500 | 104x1000x1000 | 40x104x104 | 7x468 = 3276
Training & Validation | 1 | 003 | 350 | 84x682x700 | 40x104x104 | 7x144 = 1008
Training & Validation | 1 | 004 | 350 | 84x682x700 | 40x104x104 | 7x144 = 1008
Training & Validation | 1 | 005 | 350 | 84x682x700 | 40x104x104 | 7x144 = 1008
Training & Validation | 1 | 006 | 350 | 44x682x700 | 40x104x104 | 7x72 = 504
Training & Validation | 1 | 007 | 500 | 44x1000x1000 | 40x104x104 | 7x156 = 1092
Test | 1 | 008 | 350 | 44x682x700 | 48x176x176 | 1x8 = 8
Test | 1 | 009 | 350 | 84x682x700 | 88x344x176 | 1x8 = 8
Extra | 2 | 010 | 350 | 104x682x700 | 104x171x175 | 1x16 = 16
Extra | 3 | 011 | 350 | 104x682x700 | 104x171x175 | 1x16 = 16
Extra | 4 | 012 | 150 | 92x300x300 | 96x152x152 | 1x4 = 4
Extra | 5 | 013 | 250 | 50x500x500 | 56x256x256 | 1x4 = 4

Table 2: An overview of the origins of the datasets and sub-blocks used. The extra datasets contain 3D THG images of samples the network did not train or validate on and for which no ground truths are available; this data is used to test how well the network performs on samples from different patients.

3.2 U-net architecture

A U-net consists of a contracting path and an expansive path. The contracting path starts with an input layer in which the data is loaded, as elaborated in the data pipeline section 3.1. Next are the hidden layers, in which the pattern extraction takes place: convolution layers apply the learnable parameters (weights) and pooling layers decrease the number of weights to reduce the computational load. The expansive path uses convolution to apply the weights and transposed convolution to increase the spatial resolution again in order to gain higher prediction accuracy. The contracting and expansive paths are shown on the left and right sides of Figure 7 respectively; the shape of the figure resembles the iconic "U" that lends U-net its name. The architecture used in this project is based on the network used by Çiçek (14).

At the final layer the loss is calculated, which is a way of measuring the performance of the network; the lower the value, the better the network performs. This is followed by backpropagation through the network, which uses the chain rule to calculate the gradient of the loss function with respect to the weights. The network then modifies these weights to try to minimize the loss function, which is referred to as gradient descent and is elaborated in the loss function section 3.2.4. The learning rate is one of the more important hyperparameters of the network; it indicates how much the weights may be modified during gradient descent. If the learning rate is set too low the network will take a long time to learn, and if it is set too high the loss function can start to diverge, preventing the local minimum from being found. When all the training data has been passed through, the network is validated using the validation data, which checks how well the network performs on stacks it has not trained on; this can reveal overfitting, where the network fits the training data too well and starts having trouble adapting to new data. Each of these training-plus-validation cycles is referred to as an epoch. In this project the hyperparameter optimization was done using 30 epochs and the final network was trained using 100 epochs (15).
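A schematic Keras training loop matching this description might look as follows. The custom generalized dice loss is assumed to be defined elsewhere (section 3.2.4), and the checkpoint keeps the weights of the epoch with the best validation loss, as was done for the final training session; function and file names are illustrative.

```python
import tensorflow as tf

def train(model: tf.keras.Model, train_ds, val_ds, generalized_dice_loss,
          epochs: int = 100, learning_rate: float = 0.01):
    """Compile with Adam and the custom GDL, train, and keep the best validation weights."""
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss=generalized_dice_loss)
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        "best_weights.h5", monitor="val_loss",
        save_best_only=True, save_weights_only=True)
    history = model.fit(train_ds, validation_data=val_ds,
                        epochs=epochs, callbacks=[checkpoint])
    model.load_weights("best_weights.h5")   # restore the epoch with the lowest validation loss
    return history
```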

In this section the basic components of the network are discussed, starting with the convolution layers (including batch normalization (BN) and the activation function), followed by the max pooling layer, the final layer and the loss function.

3.2.1 Convolution

The first few layers are 3D convolution layers. These layers are made up of learnable kernels of size 3 x 3 x 3 voxels. During the convolution steps the kernel output is calculated as the dot product of the weights contained in the kernel and the values at the current position in the sub-block. The kernel weights are initialized using random values between an upper and lower limit. These limits are defined by Equation 2 (Glorot uniform) or Equation 3 (He uniform), in which "in" is the number of input units and "out" the number of output units of the weight tensor. In this project both weight initializations were compared.

$\text{Limit} = \pm\sqrt{\frac{6}{in + out}}$    (2)

$\text{Limit} = \pm\sqrt{\frac{6}{in}}$    (3)

The kernel moves over the sub-block with a stride of 1 in all dimensions. Once the entire sub-block has been convolved, a feature map is produced which maps the kernel output at every location in the sub-block. The very first 3D convolution layer contains 32 kernels, which means it also creates 32 feature maps; these feature maps are stacked on top of each other and form the output volume. Figure 7 demonstrates the architecture of the entire CNN; the green arrows indicate the 3D convolution steps. The numbers above several sub-blocks indicate the dimensions of these blocks, where the first three dimensions indicate the size of the sub-block and the 4th dimension indicates the number of feature maps in that layer. The 3D convolution step is followed by a BN step. Adding BN to the CNN allows the use of higher learning rates and can eliminate the need for regularizers such as dropout (16). In a dropout layer, random weight kernels are ignored, which forces other kernels to become more generalized and reduces the chance that the network will overfit to the data.

After BN, an activation function is applied. The activation function maps the kernel output to an output value. The activation function needs to be non-linear, so that the output cannot be reproduced by a linear combination of the inputs; introducing this non-linearity allows the network to actually learn non-linear patterns. In this project two activation functions were tested. The first is the Rectified Linear Unit (ReLU), which is linear for any positive input and zero for any negative input. Unfortunately, ReLU activations are prone to losing their function if the learning rate is set too high (15). A possible solution is the leaky ReLU (LReLU) or parametric ReLU (PReLU). LReLU multiplies negative values by a small factor, e.g. 0.01, instead of setting them to zero; PReLU has a learnable parameter instead of a constant factor, which in some cases improves the model's performance (17). Both methods eliminate zero gradients while adding little computational cost or overfitting. In this project the ReLU and PReLU activation functions were compared.
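A sketch of such a convolution block in Keras is given below; bundling Conv3D, batch normalization and the activation into one helper is an assumption made for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, n_filters: int, use_prelu: bool = False,
               initializer: str = "he_uniform"):
    """3x3x3 convolution + batch normalization + (P)ReLU activation."""
    x = layers.Conv3D(n_filters, kernel_size=3, padding="same",
                      kernel_initializer=initializer)(x)   # or "glorot_uniform"
    x = layers.BatchNormalization()(x)
    # PReLU learns the negative slope; ReLU clamps negative values to zero
    return layers.PReLU(shared_axes=[1, 2, 3])(x) if use_prelu else layers.ReLU()(x)
```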

In the expansive path, transposed convolution is used to upscale the sub-block back to its original size. Upscaling is done with a 2 x 2 x 2 voxel kernel and stride. The principle of transposed convolution is similar to that of regular convolution, but now the product is taken of a single input value with all the weights in the kernel. The result is that the feature maps double in size.

Figure 7: Schematic overview of the architecture of the U-net model. The first three numbers refer to the dimensions of the arrays and the fourth to the number of feature maps.

3.2.2 Max pooling

The orange arrows in Figure 7 indicate the max pooling layers. Much like transposed convolution, this layer also uses a kernel of 2 x 2 x 2 voxels. As the kernel moves along the entire feature map, also with a stride of 2 x 2 x 2 voxels, the maximum value within the kernel is mapped to the next feature map. This step essentially halves each dimension except the number of kernels. This not only reduces the size of the feature maps, and therefore the number of values and weights that need to be stored in memory, but also reduces the computational power required to train the CNN. Another advantage of reducing the amount of data is that it enables the network to learn features that extend beyond the size of the kernel. The blue arrows in Figure 7 indicate where feature maps are concatenated; concatenation is used to integrate multi-scale features so that the network is able to learn patterns of different sizes.

3.2.3 Final layer

The final layer also uses a convolution step, but with a kernel and stride the size of a single voxel. It reduces the number of feature maps to 4, one for each class. This layer uses the Softmax activation function, which is most commonly used for multi-class CNNs: each class receives a normalized probability output between 0 and 1. Because the associated cross-entropy (log) loss is based on natural logarithms, the loss increases rapidly as the probability of the correct class drops below 1, which punishes wrong predictions.
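Putting the pieces together, a heavily reduced sketch of the U-net assembly (one contracting level, one expansive level) is shown below; the real network in Figure 7 has more levels, and the layer sizes shown here are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def tiny_unet(input_shape=(40, 104, 104, 1), n_classes=4) -> Model:
    """Minimal one-level U-net: conv, max pool, conv, upsample, concatenate, softmax."""
    conv = lambda x, f: layers.ReLU()(layers.BatchNormalization()(
        layers.Conv3D(f, 3, padding="same", kernel_initializer="he_uniform")(x)))

    inputs = layers.Input(shape=input_shape)
    c1 = conv(conv(inputs, 32), 32)                                     # contracting path
    p1 = layers.MaxPooling3D(pool_size=2)(c1)                           # halves z, y, x
    c2 = conv(conv(p1, 64), 64)                                         # bottleneck
    u1 = layers.Conv3DTranspose(32, 2, strides=2, padding="same")(c2)   # doubles z, y, x
    m1 = layers.concatenate([u1, c1])                                   # skip connection
    c3 = conv(conv(m1, 32), 32)                                         # expansive path
    outputs = layers.Conv3D(n_classes, 1, activation="softmax")(c3)     # per-voxel class probabilities
    return Model(inputs, outputs)

model = tiny_unet()
model.summary()
```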

3.2.4 Loss function

The loss function is a measure of how well the CNN is learning the features contained in the sub-blocks; the lower the loss, the more accurately the network performs. Minimizing the loss function is achieved by modifying the kernel weights, which is done during backpropagation. In this project the generalized dice loss (GDL) function, first introduced by Sudre et al. (18), is adopted and modified; the changes are described in section 3.2.5. The GDL uses the dice coefficient (DC), a measure of similarity between two samples, given by Equation 4. The numerator of the DC is the overlap between the ground truth (t) and the predicted data (p) within a single class; the denominator is the cardinality, i.e. the sum of the total number of voxels for that class in the ground truth and the predicted segmentation; and s is a small smoothing constant, typically of the order of 10^-7, to prevent division by zero. The DC results in a value between 0 and 1, where 1 is a perfect match.

$DC = \frac{2 \times |t \cap p| + s}{|t| + |p| + s}$    (4)

This DC is calculated for each class individually and subtracted from an arbitrary constant, which means that a perfect match reduces this constant and therefore also the output of the GDL. The GDL is given by Equation 5, where the DCs are the dice coefficients for the fiber, nuclei, inconclusive and background classes respectively.

$GDL = 4 - DC_{fib} - DC_{nuc} - DC_{inc} - DC_{bg}$    (5)

Backpropagation and updating of the weights is done with adaptive moment estimation (Adam) (19). Adam stores exponentially decaying averages of past squared and non-squared gradients and uses these to calculate the first and second moments, the mean and the uncentered variance respectively. Each moment takes an exponential decay rate β ∈ [0, 1], set to 0.9 for the first moment and 0.999 for the second. Adam uses these moments to calculate an adaptive learning rate for each weight, which causes the GDL to converge in fewer epochs. In this project initial learning rates of 0.01 and 0.001 were compared.

3.2.5 GDL modifications

In this section the novel changes to the GDL are described. The different classes (background, fiber-like features, inconclusive features and nuclei) are heavily imbalanced: background accounts for 88.4% of all voxels, fibers for 9.5%, inconclusive features for 1.3% and nuclei for 0.8%. The CNN can learn the background class very easily, because it can reach a high dice coefficient with little effort. To prevent the CNN from focusing too much on this less important class, the classes were balanced using several methods. First, the background class received a weight factor for its DC, shown in Equation 6, where W is the background weight factor, V the total number of voxels in the batch and t the total number of background voxels in the batch. Adding this weight factor lowers the maximum DC for the background class, so that a perfect background DC contributes less to the convergence of the GDL.

$W_{bg} = 1 - \frac{|t_{bg}|}{V_{tot}}$    (6)

An additional penalty was added to prevent the background in the predicted segmentation from covering fiber or nuclei features in the ground truth. The background penalty (P) is given by Equation 7, where the first term in the numerator is the overlap between the fiber voxels in the ground truth and the background voxels in the predicted segmentation, and the second and third terms are the corresponding overlaps for the inconclusive and nuclei voxels. The background penalty is added to the GDL, so when the overlaps in the numerator increase, the GDL also increases. The constant C ∈ [5 × 10^4, 10^5] in the denominator constrains the size of the background penalty.

$P_{bg} = \frac{|t_{fib} \cap p_{bg}| + |t_{inc} \cap p_{bg}| + |t_{nuc} \cap p_{bg}|}{C}$    (7)

Next is the inconclusive feature class. This class was introduced because the segmentation algorithm, in combination with the metrics mentioned in section 2.3.3, was not able to classify these features as fibers or nuclei. Ideally the CNN should not learn the inconclusive class itself, but should instead learn to interpret all features as a fiber or a nucleus. To achieve this, the inconclusive class DC was changed to Equation 8, where the numerator is the weighted sum of the overlaps of the inconclusive voxels in the ground truth with the fiber voxels and the nuclei voxels in the predicted segmentation. The denominator contains the sum of the total number of fiber and nuclei voxels in the predicted segmentation and the total number of inconclusive voxels in the ground truth. Equation 8 specifically favors nuclei voxels being predicted on inconclusive voxels in the ground truth by weighing them 2.5 times more than fiber voxels on inconclusive voxels; this factor was added because, compared to fibers, many more nuclei were classified as inconclusive.

$DC_{inc} = \frac{2 \times (|t_{inc} \cap p_{fib}| + 2.5 \times |t_{inc} \cap p_{nuc}|) + s}{|p_{fib}| + |p_{nuc}| + |t_{inc}| + s}$    (8)

In addition, many nuclei were also incorrectly classified as fibers. To correct for this, a reward (R) was introduced to the GDL, given by Equation 9, where the numerator is the overlap between the fiber voxels in the ground truth and the nuclei voxels in the predicted segmentation. The denominator is twice the total number of nuclei voxels in the predicted segmentation and acts as a constraint on the reward. The reward is subtracted from the GDL.

$R_{fib \rightarrow nuc} = \frac{|t_{fib} \cap p_{nuc}|}{2 \times |p_{nuc}| + s}$    (9)

The modified GDL is then given by Equation 10, where L is the starting loss and is a positive constant.

$GDL = L - DC_{fib} - DC_{nuc} - DC_{inc} - W_{bg} \times DC_{bg} + P_{bg} - R_{fib \rightarrow nuc}$    (10)

Because the DCs have been modified, they lose their meaning as similarity metrics during the evaluation of the model. For model evaluation the feature DC was therefore introduced. This function checks the similarity between any feature in the predicted segmentation and the ground truth, as a means of verifying whether all features were predicted at the correct location, and is given by Equation 11.

$DC_{feature} = \frac{2 \times (|t_{fib} \cap p_{fib}| + |t_{fib} \cap p_{nuc}| + |t_{nuc} \cap p_{nuc}| + |t_{nuc} \cap p_{fib}| + |t_{inc} \cap p_{fib}| + |t_{inc} \cap p_{nuc}|) + s}{|t_{fib}| + |t_{nuc}| + |t_{inc}| + |p_{fib}| + |p_{nuc}| + s}$    (11)
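A TensorFlow sketch of the modified loss is given below. It assumes one-hot tensors of shape (batch, z, y, x, 4) with channel order fiber, nucleus, inconclusive, background; the constants s, C and L are illustrative values within the ranges mentioned in the text, and the code is a sketch of Equations 6-10 rather than the exact implementation used in this project.

```python
import tensorflow as tf

FIB, NUC, INC, BG = 0, 1, 2, 3   # assumed channel order of the one-hot masks

def modified_gdl(t, p, s=1e-7, C=7.5e4, L=4.0):
    """Generalized dice loss with background weight, background penalty and
    fiber-to-nucleus reward, following Equations 6-10 (a sketch)."""
    overlap = lambda a, b: tf.reduce_sum(t[..., a] * p[..., b])   # |t_a ∩ p_b|
    total_t = lambda a: tf.reduce_sum(t[..., a])
    total_p = lambda a: tf.reduce_sum(p[..., a])
    dc = lambda a: (2.0 * overlap(a, a) + s) / (total_t(a) + total_p(a) + s)

    w_bg = 1.0 - total_t(BG) / tf.reduce_sum(t)                            # Eq. 6
    p_bg = (overlap(FIB, BG) + overlap(INC, BG) + overlap(NUC, BG)) / C    # Eq. 7
    dc_inc = (2.0 * (overlap(INC, FIB) + 2.5 * overlap(INC, NUC)) + s) / (
        total_p(FIB) + total_p(NUC) + total_t(INC) + s)                    # Eq. 8
    r_fib_nuc = overlap(FIB, NUC) / (2.0 * total_p(NUC) + s)               # Eq. 9

    return (L - dc(FIB) - dc(NUC) - dc_inc - w_bg * dc(BG)
            + p_bg - r_fib_nuc)                                            # Eq. 10
```

In Keras, a function of this form can be passed directly as the loss when compiling the model.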

4 Results

The results of the final training session and evaluation are given in this section, along with some additional predicted segmentations of images from different patients. Results from the network optimization, covering padding modes, weight initializations and activation functions, are shown in the appendices.

4.1 Training & Validation

This section covers the training of the CNN. Figure 8 shows the values of the generalized dice loss and of the functions contained in the GDL during the final 100-epoch training session; values during training are shown in blue and values during validation in orange. Training the network took approximately 27 hours using 4 NVIDIA TITAN Xp GPUs. The kernel weights were saved from the epoch with the best GDL value during validation, which was epoch 48. After epoch 48 the validation loss starts to diverge from the training loss, leading to some mild overfitting, especially from epoch 90 onwards. The dice coefficient for the fiber class increases rapidly in the first 10 epochs to about 0.8 in validation and remains around this value for the rest of the training. The dice coefficient for the nuclei class also increases rapidly in the first 10 epochs; the training value keeps increasing to about 0.6 and the validation value to about 0.5, and both start diverging in the last 20 epochs. The background dice coefficient for training and validation increases to about 0.95 in the first few epochs and remains relatively constant throughout the entire training session, with only minor dips at epochs 20, 50 and 90. The dice coefficient for the inconclusive features remains below 0.5; the validation value in particular fluctuates strongly and appears to decrease after epoch 60. The fiber-to-nuclei reward decreases gradually but has high peaks around epochs 80 and 90, which coincide with the drops in the fiber, nuclei and background dice coefficients for training and validation. The background penalty decreases from the start; the validation value fluctuates between 0.12 and 0.03 (interpreted as a 0.0175% overlap between ground truth and predicted segmentation), while the training value remains constant at around 0.01.

Figure 8: The top graph shows the value of the GDL during the 100-epoch training session. The other graphs show the values of the components of the GDL: the second row contains the fiber, nuclei and background dice coefficients, and the third row contains the inconclusive feature dice coefficient, the fiber-to-nuclei reward and the background penalty. Blue indicates the value during training and orange the value during validation.

4.2 Test predictions & Network evaluations

The best validation weights achieved during the training session were loaded into the network and used to predict and evaluate the test datasets. Figure 9 shows the THG image, ground truth and prediction of a slice of sample 008, and Figure 10 of sample 009.

Figure 9: Prediction of sample 008 (left-frontal astrocytoma type II) with a field of view of 350 µm at a depth of 17.5 µm. A shows the THG image in grayscale, B the ground truth segmentation and C the predicted segmentation using the trained network. In the segmentations, red indicates fiber-like features, blue indicates nuclei and green indicates inconclusive features. The yellow squares indicate faults in the nuclei segmentation, the purple square indicates faults in the fiber segmentation and the brown square indicates well-segmented features.

Figure 10: Prediction of sample 009 (left-frontal astrocytoma type II) with a field of view of 350 µm at a depth of 20.5 µm. A shows the THG image in grayscale, B the ground truth segmentation and C the predicted segmentation using the trained network. In the segmentations, red indicates fiber-like features, blue indicates nuclei and green indicates inconclusive features. The yellow squares indicate faults in the nuclei segmentation, the purple square indicates faults in the fiber segmentation and the brown square indicates well-segmented features.

In Figure 9 many nuclei can be seen in the THG image, some of which are marked with yellow squares. The top nucleus is segmented into the ground truth as a fiber and is missing from the predicted segmentation. The other marked nucleus only shows a segmented circumference and lacks its center. Very few nuclei have been segmented and classified properly. The purple square indicates a fiber that is classified as either a fiber or an inconclusive feature in the ground truth but as a fiber in the predicted segmentation; unfortunately, the finer-grained fibers did not get segmented. The brown square indicates an area that has been segmented accurately; the area contains fibers and some nuclei. In the ground truth some of these features were classified as inconclusive, while the predicted segmentation classifies these features correctly.

Figure 10 also contains many nuclei, two of which are marked in yellow. Both of these nuclei are classified incorrectly in the ground truth and are also classified incorrectly or missing in the predicted segmentation. Fortunately, many more nuclei have been segmented correctly in comparison to sample 008. The purple square contains a finer-grained fiber that was segmented in the ground truth but has parts missing in the predicted segmentation. The brown square indicates an accurately segmented area; it contains both fibers and nuclei, of which some fibers are classified as nuclei and some nuclei as inconclusive in the ground truth, but all are classified correctly in the predicted segmentation, apart from one fiber segment that was predicted as a nucleus.

Table 3 shows the efficiency of the network and indicates the time each part of the prediction took, depending on the size of the THG image. The total prediction time is between 18.6 and 27.3 seconds, which includes loading and saving the data, making the prediction with the network and stitching the sub-block images together.

Table 4 shows the evaluation metrics of both test samples over the entire stack. This table also includes the feature dice coefficient, first introduced in section 3.2.5, because the predicted segmentations lack the inconclusive feature class, which negatively impacts the individual fiber and nuclei dice coefficients. The feature DC is used to measure the overall accuracy of the network, as it measures the similarity between any feature in the ground truth and in the predicted segmentation.

The feature DC shows an 84% similarity between the ground truth and the predicted segmentation for sample 008 and an 80% similarity for sample 009. The individual class dice coefficients show minor differences compared to the training and validation values. The biggest difference is found in the background penalty: 3.3611, or 0.7% overlapping voxels, which is 40 times higher than during training and validation for sample 008, and 11.2591, or 1.3% overlapping voxels, which is 74 times higher for sample 009.

Test sample | 008 | 009
Number of sub-blocks | 8 | 8
Size of sub-blocks | 48x176x176 | 88x344x176
Load data time (s) | 2.7 | 3.2
Prediction time (s) | 13.5 | 22.5
Sub-block stitching time (s) | 1.2 | 0.8
Save data time (s) | 1.2 | 0.8
Total time (s) | 18.6 | 27.3

Table 3: Overview of the efficiency of the network. It shows the time taken to complete the essential steps, along with the size of the inserted data.

Test sample | 008 | 009 | Validation
Feature DC | 0.8433 | 0.8029 | -
Fiber DC | 0.7622 | 0.6267 | 0.8257
Nuclei DC | 0.3373 | 0.4740 | 0.4713
Background DC | 0.9729 | 0.9761 | 0.9530
Inconclusive DC | 0.3073 | 0.5622 | 0.3589
Fiber to Nuclei Reward | 0.1888 | 0.1528 | 0.1314
Background Penalty | 3.3611 | 11.2591 | 0.0242
Background Penalty (overlapping voxels) | 0.7% | 1.3% | 0.0175%

Table 4: Overview of the network evaluation metrics. It shows the dice coefficients (DC) for the individual classes and for all classes combined, for both test samples, with the validation results for comparison.

4.3 Additional THG images

The network was also tested on THG images that were excluded from the training and validation datasets because the ground truth segmentation quality was not high enough (see section 2.3.1). Figure 11 shows an example of such a ground truth segmentation, namely that of sample 010. It was excluded because too much background and noise is included around the fibers, giving the segmented fibers an altered appearance and the overall segmentation a cluttered look.

Figure 11: An example of a ground truth segmentation whose quality was not high enough to be used for training and validating the network. This is the ground truth segmentation of sample 010, of which the THG data and predicted segmentation are shown in Figure 12.

Figure 12: Prediction of sample 010 (epilepsy) with a field of view of 350 µm at a depth of 8.5 µm. A is the normalized THG grayscale image and B the network's predicted segmentation; red indicates the fiber-like features and blue the nuclei. The yellow square indicates a fault in the nuclei segmentation, the purple square indicates faults in the fiber segmentation and the brown square indicates well-segmented features.

Figure 13: Prediction of sample 011 (right-temporal glioblastoma type IV) with a field of view of 350 µm at a depth of 17.5 µm. A is the normalized THG grayscale image and B the network's predicted segmentation; red indicates the fiber-like features and blue the nuclei. The yellow square indicates a correctly segmented nucleus.

Figure 14: Prediction of sample 012 (epilepsy sample with induced blisters) with a field of view of 150 µm at a depth of 8.5 µm. A is the normalized THG grayscale image and B the network's predicted segmentation; red indicates the fiber-like features. The green arrow points at a blister on a fiber.

Figure 15: Prediction of sample 013 (epilepsy sample with induced blisters) with a field of view of 250 µm at a depth of 4 µm. A is the normalized THG grayscale image and B the network's predicted segmentation; red indicates the fiber-like features and blue the nuclei. The green arrow points at a blister on a fiber and the yellow square indicates two nuclei.

Figure 12 contains a THG image of an epilepsy sample and therefore contains almost no nuclei. The yellow square shows a blister on a fiber that is partially segmented as a nucleus. The purple square shows an area with finer-grained fibers that did not get segmented at all. The brown square contains a couple of fibers that were segmented well, with barely any deformities, which is a significant improvement over the excluded ground truth segmentation shown in Figure 11.

Figure 13 contains a THG image of a right-temporal glioblastoma type IV sample and therefore does not contain any fiber structures. Nevertheless, almost the entire predicted segmentation is classified as fiber. The yellow square indicates one of the few nuclei that was predicted properly.

Figures 14 and 15 contain THG images of epilepsy samples with induced blisters; the blisters are indicated with green arrows. Unfortunately, the entire prediction is classified as fiber, with the exception of a big nucleus in the yellow square in Figure 15, which was partially classified correctly.

5 Discussion

In this section the results are discussed, starting with the image segmentation used for the creation of the ground truths, followed by the results from the neural network and options for future research.

5.1 Ground truth creation

Medical image segmentation remains particularly challenging because of the inhomogeneity of the data and the amount of noise. The limited set of THG images stems from patients diagnosed with different conditions and was collected over several years. As a result, the challenge is not only the inhomogeneity of the data itself but also the different standards, procedures and equipment used to collect the data in the first place.

A three-step segmentation algorithm was used to make the ground truth segmentations. The first step was a histogram equalization using POSHE, which enhanced the features contained in the THG images. The main adjustment here was the inclusion percentile, which excludes the voxel outliers from the equalization step. The inclusion percentile was set at the 0.9999th percentile; any lower value created oversaturation within the images and thus a loss of detail. The second step was the anisotropic diffusion step, which denoised the image by identifying the most salient features. Here the C threshold could be set; Figure 4 shows the effect of a different C threshold. Choosing the correct C threshold makes only a small difference to the quality of the ground truth segmentation. The third step was the active contour (ACPE) step, which split the image into feature and background voxels. Here the ACPE weight could be set, which forces a region to the fore- or background. Choosing the correct ACPE weight was important for segmenting fully connected fibers, as shown in Figure 4.

Creating segmentations using this algorithm takes several hours, depending on the depth of the 3D image stack and assuming the correct C threshold and ACPE weight were used. Unfortunately, a threshold and weight that are correct for one image stack are often not optimal for a stack from another sample, and even differ between samples from the same patient. Because the THG image is 3D, the quality of the stack also varies with depth: a C threshold or ACPE weight that improves the quality of the segmentation for the top 10 slices may decrease the quality of the segmentation of slice 40. For this reason a non-optimal threshold and weight were used for all samples, which on average resulted in decent segmentations for several samples. A solution to this problem would be to split the stack into sub-stacks of 10 or fewer slices and segment each smaller stack individually with its own optimal parameters. This method would be more labour intensive, as more parameters need to be optimized, but creating the segmentation should also cost less time. A second benefit is that it may become possible to produce higher-quality segmentations of the THG images that could previously not be segmented, which would increase the number of training and validation ground truths.

Besides the quality variance, there are also benefits to 3D data. It allows information to propagate along a third axis, which helps in situations such as the following: the top of a nucleus or fiber might appear as noise and get filtered out in a 2D segmentation while being identified as a feature in a 3D segmentation, or a fiber extending along the z-axis might be classified as a nucleus in 2D, where only its circular cross-section is visible, but as a fiber in 3D. The result is that the individual slices of a 3D segmentation contain more accurate information than a 2D segmentation of the same region.

The THG images contain many overlapping nuclei and fibers, which need to be separated if one wants to classify these features individually. Separation was done using a distance transform watershed step. The distance transform part finds local minima based on the distance of each feature voxel to the nearest non-feature voxel. Fibers have an inconsistent shape, which results in the distance transform finding too many local minima. The watershed step then splits the fibers into smaller segments. These smaller fiber segments were in some cases impossible to accurately classify as a nucleus or fiber, resulting in a higher number of false positives, e.g. nuclei classified as fiber segments in the ground truth segmentation. The number of false positives was reduced by adding a less than ideal inconclusive feature class. This appears to be a recurring problem with distance transform watershed algorithms. Research shows that splitting overlapping features can also be done using curvature information in a grayscale transform algorithm, which is able to separate nuclei in grayscale images (20). This could be used prior to segmentation to create a mask of the individual features; after segmentation this mask could be used to locate and separate the individual features. Alternatively, an entire segmentation algorithm has been designed to segment overlapping fibers, which constructs a virtual boundary around each fiber's contour. The boundary is then used to connect concave points, which separates individual fibers with increased accuracy (21). However, the most promising and easiest to implement option is given by Lyanett et al. (22). This method uses the H-minima transform instead of the distance transform; the H-minima transform selects regional minima instead of local minima, which could effectively reduce the number of smaller fiber segments and therefore also reduce the number of false positives in the ground truth segmentations, possibly eliminating the inconclusive feature class.
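One common way to combine the H-minima transform with the watershed is to apply it to the inverted distance map so that only pronounced basins act as seeds. The sketch below contrasts this with the seeding used here; it assumes scikit-image and SciPy, a boolean feature mask as input, and an illustrative depth parameter h.

import numpy as np
from scipy import ndimage as ndi
from skimage.morphology import h_minima
from skimage.segmentation import watershed

def distance_transform_watershed(mask):
    # Seed from every local maximum of the distance transform; elongated fibers
    # produce many seeds and are split into small segments.
    distance = ndi.distance_transform_edt(mask)
    peaks = (ndi.maximum_filter(distance, size=3) == distance) & mask
    markers, _ = ndi.label(peaks)
    return watershed(-distance, markers, mask=mask)

def h_minima_watershed(mask, h=2.0):
    # Seed only from regional minima deeper than h, which should yield larger fiber segments.
    distance = ndi.distance_transform_edt(mask)
    seeds, _ = ndi.label(h_minima(-distance, h))
    return watershed(-distance, seeds, mask=mask)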

5.2 CNN image segmentation

Several layer parameters were tested and adjusted prior to the final training session. The parameter adjustments were based on small training sessions of 30 epochs, the results of which can be found in the preliminary results contained in the appendix. It must be noted that the time restriction allowed for only one training session per parameter, so the results demonstrated here may not be optimal.

Two different padding modes were tested: zero padding and symmetric padding. Figure 17 shows the difference between the padding modes. Symmetric padding mirrors the array values into the padding, which creates less of a hard boundary between the padding and the actual image, resulting in better segmentation around the edges and in the first and last slice. Kernel weights were initialized using the He and Glorot uniform initializations; the results are shown in Figure 18. The way in which the weights are initialized does not seem to make much difference. The initial learning rate was varied between 0.001 and 0.01, although the Adam optimizer was used, which works with adaptive learning rates and makes the initial rate less important. An initial learning rate of 0.01 was used because the training and validation loss appeared to remain closer together. Lastly, the ReLU and PReLU activation functions were tested, as shown in Figure 19. PReLU is supposed to give an improved result because of the extra learnable weight, yet the output remained similar to that of ReLU. The extra learnable weight is somewhat more computationally demanding and adds roughly 20 seconds to the training and validation of each batch, therefore the ReLU activation function was used.

The ground truths contained the inconclusive feature class, while ultimately the network is supposed to predict only fibers, nuclei and background. To achieve this, the inconclusive dice coefficient was modified to favour fiber and nuclei predictions on inconclusive ground truth voxels. Though this often classified the actual feature correctly, it also resulted in certain features being classified as both fiber and nuclei. Because of these modifications, the inconclusive dice coefficient is no longer representative as a similarity metric: fiber and nuclei voxels are also expected to be predicted on actual fibers and nuclei, so this dice coefficient can never reach perfect similarity (DC of 1). By modifying the inconclusive dice coefficient we are essentially forcing the network to learn features that conflict with the ground truth; to avoid this, the ground truth creation needs to be improved, potentially by one of the methods mentioned in the previous section.
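One possible reading of this modification is sketched below, assuming a TensorFlow/Keras setup and one-hot tensors in the order (background, fiber, nuclei, inconclusive); the exact formulation used in this project may differ.

import tensorflow as tf

def inconclusive_dice(y_true, y_pred, smooth=1e-6):
    # Fiber, nuclei or inconclusive predictions on inconclusive ground-truth voxels all count
    # as overlap; the denominator also contains fiber/nuclei predicted on real features,
    # which is why this coefficient can never reach a perfect similarity of 1.
    truth = y_true[..., 3]
    pred = y_pred[..., 1] + y_pred[..., 2] + y_pred[..., 3]
    intersection = tf.reduce_sum(truth * pred)
    return (2.0 * intersection + smooth) / (tf.reduce_sum(truth) + tf.reduce_sum(pred) + smooth)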

Another side effect of the design of the inconclusive dice coefficient is that the fiber and nuclei dice coefficients can also never reach perfect overlap with their ground truths, which leads to an underestimation of both dice coefficients. Evaluating the network on individual classes therefore becomes troublesome, which is why the feature dice coefficient was designed; it measures the similarity between any feature in the ground truth and the predicted segmentation. The feature dice coefficient shows more than 80% similarity in both test samples.
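The feature dice coefficient can be written along the same lines; the sketch below again assumes the (background, fiber, nuclei, inconclusive) one-hot ordering and is not necessarily the exact definition used here.

import tensorflow as tf

def feature_dice(y_true, y_pred, smooth=1e-6):
    # Any non-background voxel counts as "feature" on both sides of the comparison.
    truth = 1.0 - y_true[..., 0]
    pred = 1.0 - y_pred[..., 0]
    intersection = tf.reduce_sum(truth * pred)
    return (2.0 * intersection + smooth) / (tf.reduce_sum(truth) + tf.reduce_sum(pred) + smooth)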

Figures 9 and 10 contain finer-grained fibers in the THG images that did not get segmented in either the ground truth or the predicted segmentation. It was hypothesized that background covered these fibers because the fiber dice coefficient only gains a small amount of similarity for predicting these fibers correctly. The background penalty was introduced to make sure the background got punished for being predicted on fiber and nuclei voxels. This worked well for the training and validation images, as the background penalty decreased to 0.03. It resulted in the background contracting from other features, leaving a halo-like pattern classified as inconclusive around each feature, which was changed to background manually. However, the background penalty for the test samples was up to 74 times higher than the training and validation values. The high value indicates overfitting, although this did not show up in the validation metrics. It is likely caused by the way the training and validation data were split, as both sets contain data from the same samples. To avoid this it is better to keep each sample within either the training or the validation dataset.
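For reference, one plausible form of such a background penalty is sketched below: the fraction of ground-truth fiber and nuclei voxels that the network assigns to background. The class ordering and the exact definition are assumptions rather than the implementation used here.

import tensorflow as tf

def background_penalty(y_true, y_pred, smooth=1e-6):
    feature_truth = y_true[..., 1] + y_true[..., 2]             # fiber and nuclei ground truth
    wrongly_background = tf.reduce_sum(y_pred[..., 0] * feature_truth)
    return wrongly_background / (tf.reduce_sum(feature_truth) + smooth)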

Additional THG images from other patients were segmented to test the generalization of the network. Sample 010 in Figure 12 is from a patient diagnosed with epilepsy in which no blisters have been induced. The sample contains fibers with a large variety in size. Similar to the test samples, the network did not predict the finer-grained fibers correctly, which is probably a result of the overfitting of the background penalty. Otherwise the predicted segmentation is of higher quality than the segmentation produced using the three step algorithm. Sample 011 is from a patient diagnosed with right-temporal glioblastoma type IV and mainly contains nuclei. The majority of the nuclei in this sample were classified as fiber. The most likely cause is that the network was trained and validated using data from one patient with a different diagnosis, so the network has not learned to predict features with a different appearance. This could be solved by training the network using THG images from multiple patients to increase the tissue diversity.

Samples 012 and 013 in Figures 14 and 15 come from patients diagnosed with epilepsy in which blisters have been induced. The samples were collected with different fields of view, 150 and 250 µm respectively. The predicted segmentation is poor in both instances; while sample 013 has a different tissue composition compared to the training data, sample 012 appears quite similar. This means that the network's inability to predict a segmentation is most likely due to the change in field of view, as the network was trained on data with a 350-500 µm field of view. By resizing the samples to the desired voxel size of 0.5 x 0.5 x 0.5 µm, sample 012 was resized to a resolution of 300 x 300 pixels in length and width, reducing the size by 75% while maintaining the same number of features. The features in this sample therefore contain fewer voxels on average, which causes the network to fail. A solution would be to train the network using data with a wider variety of fields of view or to use data augmentation on the current ground truths.
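Resampling to the common 0.5 µm voxel size can be done with scipy.ndimage.zoom, as in the minimal sketch below; the (z, y, x) ordering of the voxel sizes is an assumption.

import numpy as np
from scipy.ndimage import zoom

def resample_to_voxel_size(stack, voxel_size_um, target_um=(0.5, 0.5, 0.5)):
    # Stacks acquired with a finer pixel pitch shrink here, so each feature ends up
    # covering fewer voxels than in the training data.
    factors = [v / t for v, t in zip(voxel_size_um, target_um)]
    return zoom(stack, factors, order=1)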

Data augmentation is a good way to compensate for the lack of training and validation data. In this project the data was rotated and mirrored in six different orientations, which resulted in seven times the amount of data. There are several other types of augmentation that could improve the generalizability of the network. An option is to apply a scaling augmentation that reduces the features in size, after which symmetric padding can be used to maintain the original size of the image stack. This could recreate images with a smaller field of view and it reduces the number of voxels per feature (23). Another potential option is using a Generative Adversarial Network (GAN) to generate new images. A GAN contains two neural networks, a generator and a discriminator: the generator creates new data and the discriminator evaluates its authenticity. GANs are still being experimented with to test the potential of their application in creating new medical images, but they are showing promising results (24) (25).
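Returning to the scaling option, a minimal sketch is given below: the stack is shrunk with scipy.ndimage.zoom and padded back to its original shape with symmetric padding, mimicking a smaller field of view. The scale factor is illustrative only.

import numpy as np
from scipy.ndimage import zoom

def scale_down_augment(stack, factor=0.6):
    # Shrink every axis so each feature covers fewer voxels; apply the same transform
    # with order=0 to the corresponding label stack.
    small = zoom(stack, factor, order=1)
    pads = [(0, o - s) for o, s in zip(stack.shape, small.shape)]
    return np.pad(small, pads, mode="symmetric")    # restore the original stack shape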

A U-net architecture with three max pooling layers was used during this project. The results indicate that a deeper architecture will not significantly improve the accuracy, as the network struggles with the finer-grained fibers, which are optimized in the first layers. A V-net architecture is often used for volumetric data and could potentially improve the results. The V-net architecture no longer contains max pooling layers but uses convolution layers to compress the data instead. Replacing the max pooling layers with convolution layers has the advantage that, when the data is compressed, the number of feature maps increases, which could lead to more accurate predictions (26).
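A minimal sketch of the V-net style replacement, assuming a Keras model, is given below: a stride-2 convolution is used in place of a max pooling layer in the contracting path, compressing the volume while doubling the number of feature maps with learned weights.

import tensorflow as tf

def vnet_downsample(x, filters):
    # Halves each spatial dimension and doubles the feature maps, with learnable weights
    # instead of a fixed pooling operation.
    return tf.keras.layers.Conv3D(filters * 2, kernel_size=2, strides=2,
                                  padding="same", activation="relu")(x)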

5.3 Blister growth measurements

The main goal was to use the segmentations to measure the growth of the blisters on the inner myelin sheath. These blisters were induced by a dysfunction in the axon myelin synapse as a result of excitotoxicity. The dysfunction is chemically induced in epilepsy samples by increasing the concentration of sodium around the tissue. The blisters are shown in Figures 14 and 15 and examples are indicated with a green arrow.

The created segmentations would assist in clearly indicating the size and shape of these blisters. Unfortunately the network has trouble segmenting the data with induced blisters, but these problems could be addressed using some of the aforementioned options. The network in its current state identifies the entire myelin sheath instead of the individual blisters. This means that measuring the exact size of these blisters is still largely a manual job, made easier by the well defined shapes of the fibers. In order to automate the measuring process, the blisters must be (manually) classified as an additional class and the network must be retrained using the new ground truth segmentations. These classified blisters can then be isolated from the remaining image and measured using basic object volume calculating algorithms such as the ones found in ImageJ.
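Once blisters are predicted as their own class, measuring their volumes reduces to counting voxels per connected component, as in the sketch below; a binary 3D blister mask and an isotropic 0.5 µm voxel size are assumed.

import numpy as np
from scipy import ndimage as ndi

def blister_volumes(blister_mask, voxel_size_um=0.5):
    labels, _ = ndi.label(blister_mask)           # separate the individual blisters
    counts = np.bincount(labels.ravel())[1:]      # voxels per blister, skipping the background
    return counts * voxel_size_um ** 3            # volume per blister in cubic micrometres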

The Mask R-CNN introduced by (27) could provide an alternative method. Mask R-CNN is a method for instance segmentation combined with object detection and gives the result as an overlay on the original image. In that research, keypoints were used to estimate the human pose by predicting them on specific bodily joints. Similarly, keypoints could be predicted along the blisters in the way shown in Figure 16. The size of the blisters can then be measured as the distance between the keypoints on either side of the myelin sheath.

Figure 16: An example of keypoint prediction to measure the size of the blisters on the myelin sheath. The green dots indicate locations where the keypoints should be placed.
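A minimal sketch of turning such keypoints into blister sizes is given below; it assumes keypoint pairs as (z, y, x) voxel coordinates on either side of the myelin sheath and an isotropic 0.5 µm voxel size.

import numpy as np

def blister_widths(keypoint_pairs, voxel_size_um=0.5):
    # keypoint_pairs: array-like of shape (n_blisters, 2, 3), one coordinate pair per blister.
    pairs = np.asarray(keypoint_pairs, dtype=float)
    return np.linalg.norm(pairs[:, 0] - pairs[:, 1], axis=1) * voxel_size_um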


6 Conclusion

For a long time it was thought that MS is caused by an autoimmune disorder, which is referred to as the outside-in hypothesis. Recently this hypothesis has been challenged by the inside-out hypothesis, in which the primary cause is a dysfunction in the axon myelin synapse. This dysfunction leads to a cascade of biochemical processes that results in blister formation (demyelination) on the inner myelin sheath. The cascade of biochemical processes was induced in ex-vivo epilepsy samples by increasing the sodium concentration around the tissue. While the sodium concentration was increased, 3D THG images were made from which the blister growth could be measured. 3D multi-class segmentations of these samples were made to increase the accuracy of measuring the blister size. A convolutional neural network with a U-net architecture was made to produce the segmentations more efficiently.

Due to the lack of images of epilepsy samples with induced blisters, the network was trained and validated using 3D THG images of left-frontal astrocytoma type II samples, which contain both fibers and nuclei. The network is able to identify fibers and nuclei, can predict these segmentations in under 30 seconds and achieves more than 80% similarity compared to its ground truth. The network has trouble segmenting the finer-grained fibers, which make up the majority of the lacking 20% similarity.

The accuracy could be increased if the quality of the ground truth segmentations improves. Improvements can be made by optimizing the algorithm parameters for each individual 2D image slice or for smaller image stacks. A distance transform watershed step was used to separate overlapping features to enable individual classification; this step separated the fibers into small segments, which made classification more difficult. Alternatively, an H-minima transform can be used to separate overlapping features into larger fiber segments, which will increase the ground truth classification accuracy and possibly eliminate the need for the inconclusive feature class.

The network is not able to predict segmentations of epilepsy images with induced blisters. This is most likely caused by the field of view of the images the network was trained on. The network was trained on images with a field of view between 350 and 500 µm, while the images with induced blisters have a field of view of 150 and 250 µm. This means that the number of voxels per feature is lower, which resulted in the network no longer being able to recognize these features. To make segmentation predictions of images containing induced blisters, the generalization of the network needs to be increased. This can be done by retraining the network with THG images from other patients, as well as by increasing the field of view of the images with induced blisters through augmentations. Alternatively, a generative adversarial network can be used to increase the number of THG images for the network to train on.

References
