
Final version: 13-07-2017

Supervisor (first examiner): Prof. dr. M. Welling

Second examiner: dr. C. Jacobs

Automatic Liver Lesion Segmentation in Abdominal CT Scans: Exploring Cascaded 2D and 2.5D U-Net Approaches

N. Kraamwinkel (10000481)

Information Science – Data Science


Automatic Liver Lesion Segmentation in Abdominal CT Scans: Exploring Cascaded 2D and 2.5D U-Net Approaches

Nadine Kraamwinkel, University of Amsterdam

MSc. Information Studies: Data Science

July 13, 2017

Abstract

Automatic segmentation of liver lesions could be an important advancement for patients and radiologists to further improve early diagnosis and treatment. To stimulate the development of such automation, researchers are currently exploring deep learning approaches. In this paper we developed and experimented with two cascaded fully convolutional neural network (FCN) approaches that work in 2D and 2.5D. The first U-Net focused on providing a liver prediction mask, which was subsequently utilized as additional input for the second U-Nets: one received the liver prediction mask as an additional input channel, and the other utilized the same mask to discard the non-liver background. The networks were trained and tested on the Liver Tumor Segmentation Challenge (LiTS) dataset, consisting of 201 contrast-enhanced abdominal CT studies. The first FCN yielded a 95% Dice score for liver segmentation on the validation set. The U-Net with a 3 slice input and masked-out non-liver background was the best performing network and obtained a 0.563 Dice score on the LiTS test set. Overall, both cascaded FCN approaches performed very promisingly in comparison to a single slice input without information from the liver prediction mask. Further improvements can be made by addressing the qualitatively derived segmentation challenges as well as by improving the networks through the implementation of ResNet connections and additional post-processing steps.

1 Introduction

Globally, the second leading cause of death is cancer. According to the World Health Organization (WHO), 8.8 million people died from cancer worldwide in 2015, with most deaths attributed to cancers of the lung (1.69 million deaths), liver (788,000 deaths), colorectal (774,000 deaths), stomach (754,000 deaths), and breast (571,000 deaths) (World Health Organization, 2017). The prognosis of a cancer patient significantly worsens when the tumor metastasizes, which means that the primary tumor spreads to other parts of the body.

The most common site of a metastasis is the liver (Gore et al., 2012). The cause behind this is two-fold. First, the dual blood supply increases the likelihood of metastatic deposits in the liver (streams from the hepatic artery and the portal vein) (Ananthakrishnan et al., 2006). Second, the hepatic sinusoidal epithelium allows easy penetration of metastatic cells (Ananthakrishnan et al., 2006).

Early detection of lesions in the liver can be achieved through Computed Tomography (CT), which is considered the primary modality for liver imaging (Frankel et al., 2012). After the detection of a lesion in the liver, it needs to be further investigated whether it is: (a) a malignant primary tumor, (b) a benign tumor, or (c) a metastasis from another organ. In this phase, liver lesion segmentation comes in for further diagnosis, tumor classification, and cancer staging, as well as helping medical experts to determine tumor growth and treatment effectiveness.

In clinical practice, liver lesion segmentation is a very tedious manual task with considerable inter-observer variation. Automatic lesion segmentation can be regarded as both an advantage to the segmentation process and a potential solution to improve reproducibility of the segmentation results. Automated liver lesion segmentation methods have previously received less attention from researchers, despite several methods having already been proposed for liver segmentation. One explanation for this is that the development of an automatic segmentation method is challenging due to irregularities of tumor shape, varying numbers of lesions, diversity in intensity, ambiguous (low-contrast) boundaries between tumor and normal liver tissue, different contrast levels across tumor types (hyper-/hypo-intense tumors), as well as abnormalities in liver tissue due to surgical operations (e.g. scarring, surgical clips, etc.) (Li et al., 2015; Christ et al., 2017). Another explanation is the relatively scarce access to publicly available data.

An important development to stimulate efforts in liver lesion segmentation was the Liver Tumor Segmentation Challenge in 2008 (LTSC). Contestants in this challenge demonstrated semi-automatic, automatic and interactive methods, with approaches ranging from fuzzy clustering, K-means clustering, and AdaBoost for classification, to utilizing SVMs for tumor extraction. All of these approaches required hand-crafted features for classifier training, which makes achieving an optimal representation challenging. After the LTSC, the number of publications decreased again, which is assumed to be caused by a lack of publicly available data. Recently, several other datasets have been made accessible to researchers: 20 CT scans with 120 tumors (3D-IRCAD) (Soler et al., 2012; Ircad Dataset, 2017), 4 CT scans with 10 tumors (MIDAS) (Midas Dataset, 2017), and a dataset from the Liver Tumor Segmentation Challenge (LiTS) (2017) with 201 contrast-enhanced CT scans (CodeLab LiTS Dataset, 2017).

Currently, researchers from various biomedical disciplines are exploring deep learning approaches. A promising approach is the Convolutional Neural Network (CNN), a deep learning model that can learn hierarchical features from data. The model consists of multilayer neural networks with intermediate layers of increasing abstraction levels (LeCun et al., 1998). CNN methods are recognized as robust to varying image appearance, which is important because natural phenomena tend to have a hierarchical structure that deep learning can naturally capture. CNNs have been applied in various medical applications, including segmentation of the liver and liver lesions (Li et al., 2015; Christ et al., 2016, 2017; Ben-Cohen et al., 2016; Han, 2017; Bi et al., 2017; Sun et al., 2017). It can be observed that the predominant articles in liver lesion segmentation report methods working on full axial slices, a so-called fully convolutional network (FCN) architecture. In comparison to patch-based CNNs that produce an output for every pixel, an FCN can produce a segmentation map in an efficient end-to-end manner from an original image. Only the paper of Li et al. (2015) explored a patch-based CNN and experimented with different patch sizes and layers. In regard to FCN architectures, the papers of Ben-Cohen et al. (2016) and Sun et al. (2017) reported methods with a stack of adjacent axial slices (3 in total) and a multi-channel approach consisting of multi-phase contrast-enhanced CT scans of the considered slice, respectively.

More recently, cascaded approaches have been explored for liver lesion segmentation to address the difficulty in learning lesion boundaries. In a cascaded approach, multiple FCNs follow one another, which enables iterative learning and thus optimizes the learning of liver lesion boundaries. The work of Bi et al. (2017) proposes a cascaded FCN with a deep residual network (ResNet) architecture and multi-scale fusion. The multi-scale fusion was implemented to promote scale-invariance by transforming the liver and liver lesions into different scales (Bi et al., 2017). Another promising approach is the U-Net architecture proposed by Ronneberger et al. (2015) for biomedical segmentation. This architecture differs from a standard classification network in that it includes an expansion path (up-convolution) (Ronneberger et al., 2015). In addition, skip connections are implemented between the different network stages to allow fusion of spatial and semantic information at later stages of the network (Ronneberger et al., 2015; Christ et al., 2017). Taking the aforementioned into account, a cascaded approach in which two U-Nets follow one another seems worth exploring further, and earlier work provides interesting avenues for experimenting with a stack of slices and multiple input channels.

To date, two papers have experimented with a cascaded U-Net architecture. Christ et al. (2017) explored a slice-wise cascaded FCN U-Net in which the first U-Net is utilized to obtain a liver segmentation, which is then fed into the second U-Net to segment the liver lesions. Han (2017) proposed a similar first step but modified it to a 2.5D input of adjacent axial slices (5 in total) and subsequently utilized the obtained segmentation in the second U-Net to obtain slice-wise predictions of both the liver and the liver lesions. Both methods revealed promising segmentation results, which will be touched upon in light of the papers' findings later on.

1.1 Objective

The overall aim of this paper is to contribute to the development of automatic liver lesion segmentation in (abdominal) CT scans by exploring two cascaded fully convolutional network (FCN) approaches in 2D and 2.5D. The aforementioned studies show the U-Net to be a promising method for segmenting small objects such as liver lesions. To this end we utilized the U-Net implementation of Ronneberger et al. (2015) and propose two cascaded approaches in 2D and 2.5D. The paper is structured as follows: Section 2 describes the methodology of the conducted experiments in detail. Section 3 presents and elaborates on the results from both a qualitative and a quantitative perspective. Lastly, Section 4 provides a discussion of the experimental findings and concludes the paper.


2 Methodology

The first two sections of the methodology provide a description of the utilized dataset and an introduction to Ronneberger's U-Net. The other sections explain the implementation details of the cascaded approaches and the training of the networks on segmenting liver lesions from CT scans.

2.1 Dataset and Pre-processing

Data from the Liver Tumor Segmentation Challenge (LiTS) was collected and consisted of a total of 201 contrast-enhanced abdominal CT studies from clinical sites around the world (CodeLab LiTS Dataset, 2017). These clinical sites included the Ludwig Maximilian University of Munich (Germany), RadboudUMC (the Netherlands), CHUM Research Center (Canada), Sheba Medical Center (Israel), IRCAD (France), and Hadassah University Medical Center (Israel). From these studies, 131 were assigned for training and 70 for testing. For each scan of the training data, the liver and liver lesion masks were provided. These reference masks were created by radiologists from the respective clinical sites with different annotation tools. The slice spacing varied between 0.69 mm and 5 mm, the in-plane voxel spacing ranged from 0.625 mm to 1.0 mm, and the number of slices per scan ranged from 75 to 986 because some scans were limited to the liver area whereas others contained the complete thorax region. This resulted in varying quality of scans and reference masks. In comparison to a single data source, such multifarious data can, despite conceivable disadvantages, be an advantage for generalizing the developed method's performance to other CT datasets.

The training set was visually inspected with the department's in-house viewing software in MeVisLab (Ritter et al., 2011). In total 33 scans were removed from the training set due to mistakes and errors in the masks, absent lesions in the liver in several studies, and one scan that was broken from the thoracic vertebral region T7 downwards. After removing these scans, 98 scans remained, which were further split into 78 training and 20 validation cases for network training. Furthermore, before testing our networks, the test set was also inspected and in total 3 CT scans were excluded, among which 2 CT studies with an empty reference mask for which the livers appeared to be lesion-free. In total 67 scans of the test set were utilized for testing. The network input in the experiments consisted of axial abdominal CT slices derived from the training cases. Ultimately two networks were built: one focusing on segmenting the liver and the other on segmenting the liver lesions, which will be further elaborated in the next sections. For the liver network the whole scan was used for training, whereas for the lesion network only slices that contained a liver lesion were included; the other (background) slices were discarded for training. Pre-experimental testing revealed no additional advantage of including background slices in this specific network. Without training on background slices, the network still showed the ability to discard objects from the background in its predictions.

Next, in order to utilize information from the z-axis and thus work in the third dimension, the model input was modified to handle multiple slices. Depending on the experiment, either 1 or 2 adjacent axial slices were added to the slice under analysis. In some cases the slice under analysis contained the lesion's last slice; in such cases the adjacent slice was a background slice. The input size at validation time was the same as at training time, but at validation time all slices of the scan were included.

The spacing between the slices varied greatly. To enable a multi-slice input stack, the data was resampled such that the slice spacing was equal and fixed across all scans; this fixed spacing was set to 2.5 mm. Furthermore, the Hounsfield Unit (HU) value range of the scans was clipped to [-300, 300] HU (see Annex I for further elaboration). In the pre-experimental phase we worked with a range of [-1000, 400] HU and noticed that reducing the HU range often reduced the time the model needed to converge. Other scholars in this field who also adjust the HU value range did not report a specific rationale other than mentioning that tactically setting the value range enables up-front exclusion of (irrelevant) organs and objects (Bi et al., 2017; Li et al., 2015; Christ et al., 2017). More specifically, adjusting the range allows the network to immediately focus on the relevant information and features. Lastly, after adjusting the HU range, the values were normalized to a [0, 1] range.
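As an illustration of this pre-processing step, the sketch below clips a slice to the chosen HU window and rescales it to [0, 1]; the function name and the handling of data types are assumptions, and the resampling to a fixed 2.5 mm slice spacing would be a separate step.

```python
import numpy as np

def preprocess_slice(hu_slice, hu_min=-300, hu_max=300):
    """Clip a CT slice to the chosen HU window and normalize it to [0, 1].

    Matches the pre-processing described above ([-300, 300] HU window).
    """
    clipped = np.clip(hu_slice.astype(np.float32), hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min)
```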

2.2 Data Augmentation

It is important to train the network to be invariant and robust. In biomedical segmentation, (lesion) tissue variations can be simulated (Ronneberger et al., 2015). This can be achieved through data augmentation by applying elastic image deformation to the training data in the training pipeline. For the deformations, a 10 by 10 grid was used with displacements of 5 pixels at the grid's intersections on the 512x512 input images (axial CT slices). In addition, such mesh warping may also contribute to avoiding overfitting of the network on the training data. The data augmentation on the images was conducted at training time.
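A minimal sketch of such a grid-based elastic deformation is given below, assuming scipy is used to upsample the coarse displacement grid and warp both the slice and its mask; the interpolation settings of the original pipeline are not reported, so they are assumptions here.

```python
import numpy as np
from scipy.ndimage import map_coordinates, zoom

def elastic_deform(image, label, grid_size=10, max_shift=5.0, seed=None):
    """Apply the same random elastic deformation to a CT slice and its mask.

    A coarse grid of random displacements (~10x10, +/- max_shift pixels) is
    upsampled to the full image resolution and used to warp both arrays.
    """
    rng = np.random.RandomState(seed)
    h, w = image.shape
    # coarse random displacement fields for rows and columns
    coarse_dy = rng.uniform(-max_shift, max_shift, (grid_size, grid_size))
    coarse_dx = rng.uniform(-max_shift, max_shift, (grid_size, grid_size))
    # upsample the coarse fields to the full image size (e.g. 512x512)
    dy = zoom(coarse_dy, (h / float(grid_size), w / float(grid_size)), order=3)
    dx = zoom(coarse_dx, (h / float(grid_size), w / float(grid_size)), order=3)
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([rows + dy, cols + dx])
    # linear interpolation for the image, nearest neighbour for the mask
    warped_image = map_coordinates(image, coords, order=1, mode="reflect")
    warped_label = map_coordinates(label, coords, order=0, mode="reflect")
    return warped_image, warped_label
```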

2.3 U-Net

The U-Net architecture by Ronneberger et al. (2015) is a fully convolutional neural network method that is tailored for biomedical image segmentation. For biomedical image segmentation the desired output entails localization in which every pixel should receive a class label. The network at hand enables such pixel-wise segmentation in which a class label is predicted for each pixel.

The total architecture consists of 19 convolutional layers and is characterized by a downsampling (contracting) and an upsampling (expansion) path (Ronneberger et al., 2015). These paths are connected through so-called skip connections, which enable concatenation of features from the different layers to infer information. More specifically, in the contracting path the context is captured, and in the expansion path precise localization of the object of interest is enabled (Ronneberger et al., 2015).

The downsampling path is a traditional convolutional network consisting of repeated 3x3 convolutions, each followed by a rectified linear unit (ReLU) activation function, and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling step the number of feature maps is doubled and the image size is reduced. The repetition of the Conv-ReLU and pooling layers has the advantage of expressing more powerful features of the input with fewer parameters. The upsampling path consists of spatial upsampling of the feature maps by a factor of 2, followed by 3x3 convolutions and a ReLU, and a concatenation with the corresponding (cropped) feature maps from the downsampling path; cropping is needed because border pixels are lost in every valid convolution. At the final layer of the upsampling path a 1x1 convolution maps the 64-dimensional feature vector to the desired number of classes (Ronneberger et al., 2015).

2.4 Cascaded U-Net Approach

The network utilized in this paper is in principle similar to the U-Net architecture of Ronneberger et al. (2015), consisting of 19 convolutional layers and skip connections. In addition to the factor 2 upsampling steps, a 3x3 convolutional layer was added with a stride and padding of 1. In the original implementation only valid convolutions are computed (i.e., where the input and the filter completely overlap), and the missing context is extrapolated by mirroring the input image. In the current case, the input images are CT slices in which the axial body slices are located at the image center. As a consequence, there are no border issues, since no valuable information is present at the image border. The same applies to the concatenation of the feature maps with the corresponding maps from the downsampling path (skip connections), which therefore do not need to be cropped. Another implemented modification to the U-Net is zero-padding, which was used to prevent the image size from decreasing after each convolution.
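To make the modification concrete, the sketch below shows one contracting and one expanding step of such a zero-padded U-Net in Lasagne (the framework used in this work); the filter counts, the reduced depth and the sigmoid output are illustrative assumptions rather than the exact thesis configuration.

```python
import lasagne
from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            Upscale2DLayer, ConcatLayer)
from lasagne.nonlinearities import rectify, sigmoid

# 1-channel axial CT slice as input
net = InputLayer((None, 1, 512, 512))

# contracting step: two zero-padded 3x3 convolutions + 2x2 max pooling
c1 = Conv2DLayer(net, 64, 3, pad=1, nonlinearity=rectify)
c1 = Conv2DLayer(c1, 64, 3, pad=1, nonlinearity=rectify)
p1 = MaxPool2DLayer(c1, 2)

# deeper level (further levels would repeat this, doubling the feature maps)
c2 = Conv2DLayer(p1, 128, 3, pad=1, nonlinearity=rectify)
c2 = Conv2DLayer(c2, 128, 3, pad=1, nonlinearity=rectify)

# expanding step: factor-2 upsampling, skip connection, zero-padded convolutions
up = Upscale2DLayer(c2, 2)
up = ConcatLayer([up, c1])   # no cropping needed thanks to the zero padding
c3 = Conv2DLayer(up, 64, 3, pad=1, nonlinearity=rectify)
c3 = Conv2DLayer(c3, 64, 3, pad=1, nonlinearity=rectify)

# final 1x1 convolution producing a per-pixel probability map
out = Conv2DLayer(c3, 1, 1, nonlinearity=sigmoid)
```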

In preliminary experiments with the U-Net architecture it was found that the network performed well on segmenting the liver, but comparatively poorly on segmenting liver lesions, when both classes were considered at the same time. A reason for this is that the U-Net learns a hierarchical representation, so the convolutional filters become less specific for the different tissue types (liver and liver lesion). A solution to overcome this is cascaded training of FCNs, as for example seen in Han (2017) and Christ et al. (2016, 2017). By cascading two U-Nets, the first can focus on learning filters to segment the liver from the background, and the second U-Net on distinguishing the liver lesions from the liver.

In this paper, two cascaded approaches have been experimented with. In both approaches, the first U-Net focused on learning to segment the liver. In the second U-Net, the input of the network is a slice or stack of slices with either (I) U-Net2a: the corresponding liver segmentation as an additional input channel, or (II) U-Net2b: the non-liver background masked out. For U-Net2a, the liver segmentation was added as an extra channel to provide the network with contextual information. For U-Net2b, the liver segmentation from U-Net1 was utilized to mask out the non-liver background, such that the sole input is the liver and the surrounding values are set to 0. Liver segmentation proved difficult for very cancerous liver areas, as these were often not completely captured. To avoid losing parts of the liver prediction mask, the Convex Hull algorithm was utilized to transform the liver prediction as close as possible to the reference mask (see Figure 2). In both cases, the output of U-Net1 helped the network focus on predicting liver lesions rather than other parts of the liver and the body.

Figure 1: U-Net1, 2a and 2b of the cascaded U-Net approach. The first U-Net focuses on segmenting the liver from the axial CT slice; the second U-Nets focus on segmenting the liver lesion, either by masking out the non-liver background or by providing the liver prediction mask as an additional input channel.

Experiments were also conducted with various input stack sizes: 1, 3 and 5 slices. For example, with a stack of size 3 we fed the network (I) the slice under analysis and (II) the two adjacent slices (above and below). Because of the fixed spacing (2.5 mm) between the slices, the adjacent slices from scans with a high slice spacing (e.g. 5 mm) had to be interpolated. Figure 1 provides an overview of the cascaded U-Net approaches.
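The sketch below illustrates how such a second-stage input could be assembled from the resampled volume and the U-Net1 liver prediction, covering both variants (extra mask channel for U-Net2a, masked-out background for U-Net2b) and the convex hull step; the function name, the edge-slice handling and the exact point at which the convex hull is applied are assumptions.

```python
import numpy as np
from skimage.morphology import convex_hull_image

def build_second_stage_input(volume, liver_pred, z, variant="2b", half_stack=1):
    """Assemble the U-Net2a/2b input around slice z of a resampled CT volume.

    volume: HU-windowed, normalized scan of shape (slices, 512, 512).
    liver_pred: binary liver mask predicted by U-Net1, same shape.
    """
    # stack the slice under analysis with its neighbours (clipped at volume edges)
    idx = np.clip(np.arange(z - half_stack, z + half_stack + 1), 0, volume.shape[0] - 1)
    stack = volume[idx].astype(np.float32)

    # convex hull of the predicted liver, so lesions near an under-segmented
    # border are not cut away (cf. Figure 2)
    liver = liver_pred[z] > 0
    hull = convex_hull_image(liver) if liver.any() else liver

    if variant == "2a":
        # U-Net2a: liver prediction mask appended as an extra input channel
        return np.concatenate([stack, hull[None].astype(np.float32)], axis=0)
    # U-Net2b: non-liver background masked out (surrounding values set to 0)
    return stack * hull[None].astype(np.float32)
```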

Figure 2: (a, d) convex hull applied to the liver segmentation mask obtained from the first U-Net, (b, e) the original liver segmentation mask obtained from U-Net1, (c, f) in purple the reference mask. It can be observed that without convex hull a lesion (red outlined) would otherwise be excluded either completely or partially in U-Net2b.

2.5 Training

The training and segmentation were performed on axial CT slices. As mentioned previously, the first network focused on liver segmentation and was trained from scratch on all slices of the training data. From all liver-labeled voxels the largest connected component was obtained before providing it as input to the subsequent U-Nets. Depending on the method, the predicted liver segmentation was either added as an extra channel (U-Net2a) or used to mask out the non-liver background (U-Net2b). The training set of the second U-Nets consisted of 2897 axial slices that contained lesions. Furthermore, for the validation set we worked with the complete scans of the 20 validation cases, which involved 10489 CT slices in total.
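A minimal sketch of this connected-component step, assuming scipy.ndimage is applied to the binary U-Net1 output, is given below; the connectivity and the helper name are assumptions.

```python
import numpy as np
from scipy import ndimage

def largest_connected_component(liver_mask):
    """Keep only the largest connected component of the predicted liver mask."""
    labeled, n_components = ndimage.label(liver_mask > 0)
    if n_components == 0:
        return np.zeros_like(liver_mask, dtype=bool)
    # size of each component, indexed 1..n_components
    sizes = ndimage.sum(liver_mask > 0, labeled, index=range(1, n_components + 1))
    return labeled == (np.argmax(sizes) + 1)
```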

At training time, the network is trained on the original 512x512 image completely end-to-end and directly produces the segmentation output. This output is a probability map with voxel values ranging between 0.0 and 1.0. The network parameters are learned with a weighted cross-entropy loss function, see Equation 1.


$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c} w_c \, \hat{P}_{ic} \ln(P_{ic}) \quad (1)$$

$P_{ic}$ denotes the probability of voxel $i$ belonging to class $c$ and $\hat{P}_{ic} \in \{0, 1\}$ denotes the ground truth. The classes in U-Net1 were background and liver, and in U-Nets 2a and 2b background and liver lesion. $N$ denotes the total number of voxels. Further, since the voxel-wise frequencies of the classes are unequal, issues can be encountered in segmenting small structures (Ronneberger et al., 2015; Christ et al., 2017). To address this, class balancing was applied to the cross-entropy function, so that more importance is attributed to each voxel of the underrepresented class. For $w_c$ (the class weight used in the weight map) we chose:

$$w_c = 1 - \frac{\sum_{i=1}^{N} P_i^c}{N \cdot n_{classes}} \quad (2)$$

where $n_{classes}$ is the number of classes (liver lesion and background).
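As a concrete illustration, the sketch below evaluates Equations (1) and (2) in numpy for a binary lesion probability map; in the actual experiments this loss was computed inside the Theano graph, so the function names and the use of the ground truth for the class frequencies are assumptions.

```python
import numpy as np

def class_balanced_weights(reference, n_classes=2):
    """Class weights as in Eq. (2): w_c = 1 - (sum_i P_i^c) / (N * n_classes)."""
    n = reference.size
    weights = np.empty(n_classes)
    for c in range(n_classes):
        weights[c] = 1.0 - (reference == c).sum() / float(n * n_classes)
    return weights

def weighted_cross_entropy(prob_map, reference, eps=1e-7):
    """Weighted cross-entropy of Eq. (1) for a binary lesion probability map."""
    w = class_balanced_weights(reference)
    # per-class probabilities and one-hot ground truth, shape (2, H, W)
    p = np.clip(np.stack([1.0 - prob_map, prob_map]), eps, 1.0)
    t = np.stack([reference == 0, reference == 1]).astype(float)
    return -np.mean(np.sum(w[:, None, None] * t * np.log(p), axis=0))
```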

Figure 3: Training of the network (U-Net2b). (a) the input, (b) the reference mask, (c) the raw prediction, (d) the weight map used to learn the liver lesions inside the liver. The color is calculated based on the error of the prediction of a pixel multiplied by the class-balanced frequency.

To train the network, a weight map was created containing the voxel-wise loss weights (see Figure 3 and Equation 2). This differs from the implementation of Ronneberger et al. (2015), whose weight map is designed to segment different types of HeLa cells by forcing the network to learn the cell borders of touching cells. It was noticed in the early experimentation phase that the same weight map approach would not perform adequately for segmentation of liver lesions. To this end, a non-static weight map was created that focuses on the mistakes of the prediction and emphasizes the voxel loss accordingly.

2.6 Post-processing

Before obtaining the Dice scores, a small post-processing step was applied: the probability map was thresholded at 0.5 to retrieve a binary (output) mask. No other post-processing was performed.


2.7 Implementation Details

The networks were implemented in Theano and Lasagne (Theano Development Team, 2016; Dieleman et al., 2015). The networks were trained with the gradient descent optimization algorithm RMSProp (Tieleman and Hinton, 2012). The initial learning rate was set to 0.0001, and for the liver lesion networks (U-Net2a and 2b) it was reduced after each epoch by a factor of 0.9. By slowly decreasing the learning rate per epoch we aimed to stimulate convergence. Training the liver lesion networks took roughly 4 days on a GTX 1080 GPU with 8 GB of memory.
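A minimal sketch of this optimization setup in Theano/Lasagne is shown below; the tiny placeholder network, the binary cross-entropy loss and the epoch count are assumptions used only to make the learning-rate schedule concrete.

```python
import numpy as np
import theano
import theano.tensor as T
import lasagne

input_var = T.tensor4("input")
target_var = T.tensor4("target")

# placeholder one-layer network standing in for the full U-Net
network = lasagne.layers.InputLayer((None, 1, 512, 512), input_var)
network = lasagne.layers.Conv2DLayer(network, 1, 3, pad=1,
                                     nonlinearity=lasagne.nonlinearities.sigmoid)

prediction = lasagne.layers.get_output(network)
loss = lasagne.objectives.binary_crossentropy(prediction, target_var).mean()
params = lasagne.layers.get_all_params(network, trainable=True)

# RMSProp with an initial learning rate of 1e-4, decayed by 0.9 per epoch
learning_rate = theano.shared(np.float32(1e-4))
updates = lasagne.updates.rmsprop(loss, params, learning_rate=learning_rate)
train_fn = theano.function([input_var, target_var], loss, updates=updates)

for epoch in range(10):  # the number of epochs is an assumption
    # ... iterate over mini-batches and call train_fn(batch_x, batch_y) ...
    learning_rate.set_value(np.float32(learning_rate.get_value() * 0.9))
```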

2.8 Performance Measurements

The performance of the networks was assessed using the key quality metric of the LiTS challenge: the Dice score. The Dice score is a similarity measure over sets with values ranging between 0 and 1, with 1 being a perfect segmentation. It compares the number of elements in the prediction versus the ground truth, and is defined as:

$$DSC(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$

In addition, we also computed the Jaccard score, which is related to the Dice score and is defined as the size of the intersection divided by the size of the union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Its values also range between 0 and 1. In comparison with the Dice score, the Jaccard score typically results in lower values, as it penalizes mistakes outside the ground truth more heavily.
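A small numpy sketch of both metrics for binary masks follows; the probability map is assumed to have been thresholded at 0.5 first, and the function name is illustrative (the official LiTS evaluation has its own implementation).

```python
import numpy as np

def dice_and_jaccard(prediction, reference):
    """Dice and Jaccard scores for binary masks, following the definitions above."""
    prediction = prediction.astype(bool)
    reference = reference.astype(bool)
    intersection = np.logical_and(prediction, reference).sum()
    union = np.logical_or(prediction, reference).sum()
    if union == 0:
        # both masks empty: treat as perfect agreement
        return 1.0, 1.0
    dice = 2.0 * intersection / (prediction.sum() + reference.sum())
    jaccard = intersection / float(union)
    return dice, jaccard
```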

3 Results

This section presents the qualitative and quantitative results of the conducted experiments. The qualitative part focuses on segmentation challenges that were encountered, and the quantitative part presents the performance of the networks as well as a sub-analysis on total lesion volume and a brief comparative overview of previously published work in this field.

3.1 Qualitative Results

Figure 4 shows examples of segmentation results. The reference segmentations are indicated in green, the segmentations generated by the proposed methods in this paper are shown in red, and successful segmentation (the overlap between the reference and the segmentation) is shown in brown. The examples show the performance of the proposed methods on the different lesion shapes and sizes. From examples 4a and 4b it can be observed that relatively large lesions and regular lesion shapes yielded promising segmentation results. Examples 4c and 4d show that lesions can also be very small in size and that the network is able to segment those, albeit inconsistently, as can be observed from the few false positive (red) and false negative (green) pixels.

Figure 4: Examples of segmentation results.

Examples of the various lesions encountered in the LiTS dataset can be viewed in Figure 5. Besides lesion size and shape, a multitude of other factors may impose challenges in achieving good liver lesion segmentation results. In our segmentation results on the LiTS dataset, we observed the following scenarios, which are discussed below.

• Lesions in the left lobe
• Lesions situated near the inferior vena cava
• Hyperdense rim
• Foreign objects
• Lesions located at the liver border
• Reference mask quality


Figure 5: Examples of liver lesions from the LiTS dataset

3.1.1 Left Lobe

Figure 6 (a, b). Upon reviewing the results it was noted that the liver prediction mask provides valuable information for predicting lesions. However, it was found that when the mask did not completely cover the left lobe of the liver, it added complexity to the lesion segmentation. The Convex Hull algorithm in U-Net2b proved very useful here, as it can include, to some extent, parts of the liver that were initially missed by the prediction of U-Net1 (as seen in Figure 2). In several cases this meant that an initially excluded lesion, one that would otherwise not be captured by the liver prediction mask, could still be included and predicted.

3.1.2 Foreign Objects

Figure 6 (c, d). Previous surgical procedures on the liver, or on other organs in proximity to the liver, can significantly influence the CT scan and segmentation quality. Besides surgical scarring, a number of CT studies contained objects, such as pedicle screws in the spine and surgical clips, which affected the quality of the scan and the segmentation. For instance, large lesions regularly yield good segmentation results, but in example (d) it can be noted that this relatively large lesion was poorly segmented.

3.1.3 Hyperdense Rim

Figure 6 (e, f) shows a slice in which the lesion is surrounded by a hyperdense (white) rim. Upon reviewing the segmentation results it was observed that only the center pixels were predicted in such a case. A possible explanation is that such cases are underrepresented in the training set and that the network predominantly learns features that identify tumors as dark (i.e. hypodense) objects.

3.1.4 Inferior Vena Cava

Figure 6 (g). Another challenge was segmenting lesions near the inferior vena cava. In the segmentation results these lesions were typically missed by the network. It can be observed in this example that the network was able to (partially) segment the small lesions in the CT scan, whereas none of the pixels of the liver lesion situated near the inferior vena cava were predicted.


Figure 6: Examples of challenges in liver lesion segmentation. See section 3.1 for a discussion of each example.


3.1.5 Borders

Irrespective of their exact location, lesions at the border of the liver were regularly difficult for the networks to spot and segment. Commonly, this was observed when the surrounding tissue and organs were difficult to distinguish from the liver (lesion) tissue.

3.1.6 Reference Mask Quality

Figure 6 (h). As mentioned previously, the quality of the reference masks varied between scans. In a number of scans the reference masks caused avoidably lower performance and increased error due to the exclusion of correctly predicted pixels.

3.2 Quantitative Results

3.2.1 Segmentation Results

Table 1 shows the results of the liver lesion segmentation methods. The best performance on the test set was obtained with U-Net2b with a 3 slice input and the non-liver background masked out. This network obtained an average test Dice of 0.563 (± 0.312). The liver segmentation from U-Net1 yielded a 0.950 Dice on the validation set.

Table 1: Segmentation results presented per applied method

Lesion Segmentation Results for Different Methods

Method | Dice (val) ± st.dev | Jaccard (val) ± st.dev | Dice (test) ± st.dev | Jaccard (test) ± st.dev
U-Net2a: 1 slice (a) | 0.426 ± 0.247 | 0.308 ± 0.257 | 0.478 ± 0.303 | 0.358 ± 0.212
U-Net2a: 1 slice | 0.565 ± 0.268 | 0.432 ± 0.241 | 0.541 ± 0.296 | 0.413 ± 0.251
U-Net2b: 1 slice | 0.565 ± 0.269 | 0.433 ± 0.243 | 0.553 ± 0.306 | 0.433 ± 0.263
U-Net2a: 3 slices | 0.578 ± 0.280 | 0.420 ± 0.250 | 0.547 ± 0.295 | 0.403 ± 0.248
U-Net2b: 3 slices | 0.571 ± 0.281 | 0.423 ± 0.258 | 0.563 ± 0.312 | 0.438 ± 0.270

(a) 1 slice excl. U-Net1's liver prediction mask.

For additional comparison, the first method presented in Table 1 uses a 1 slice input without the liver prediction mask obtained from U-Net1. Boxplots of the segmentation results can be found in Figure 7.

For the best network, 70.1% of the test cases yielded a Dice higher than 0.500. In addition, the maximum obtained Dice was 0.931 on a scan with a total lesion volume of 212.7 ml, a scan that contains 2 lesions of which one is 106 mm and the other 2.8 mm in length. The average measured lesion volume of the test set is 104 ml.

3.3 Lesion Volume

In the qualitative results it was found that the size of the lesion may influence segmentation results. To explore this further, the Dice score of each CT scan and the lesion volume of each CT scan's reference mask were calculated. Since segmentation was not performed per lesion but per whole CT scan, the total lesion volume per scan was considered and calculated in milliliters (ml). Subsequently, the CT scans were stratified into four groups by total lesion volume, and for each group the mean Dice scores were computed per method. From Table 2 it can be observed that a higher (total) lesion volume of the CT scan is related to a higher segmentation performance.

Figure 7: Segmentation results (Dice score) for each network on the LiTS test set. Red denotes the median and black the average Dice scores. * indicates U-Net2a, 1 slice excl. liver segmentation mask.
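As an illustration, the total lesion volume per scan can be derived from the reference mask and the voxel spacing as sketched below; the function name and the (z, y, x) spacing convention are assumptions.

```python
import numpy as np

def total_lesion_volume_ml(lesion_mask, voxel_spacing_mm):
    """Total lesion volume of a scan in milliliters.

    voxel_spacing_mm is the (z, y, x) spacing in mm; 1 ml = 1000 mm^3.
    """
    voxel_volume_mm3 = float(np.prod(voxel_spacing_mm))
    return lesion_mask.astype(bool).sum() * voxel_volume_mm3 / 1000.0
```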

Table 2: Dice Scores by Total Lesion Volume and Method

Total Lesion Volume (ml) | n (scans) | U-Net2a (a): 1 slice | U-Net2a: 1 slice | U-Net2b: 1 slice | U-Net2a: 3 slices | U-Net2b: 3 slices
≤ 10 | 16 | 0.15 | 0.28 | 0.25 | 0.28 | 0.24
10 - 49 | 23 | 0.52 | 0.56 | 0.57 | 0.57 | 0.60
50 - 199 | 19 | 0.58 | 0.63 | 0.67 | 0.64 | 0.69
≥ 200 | 9 | 0.77 | 0.78 | 0.80 | 0.78 | 0.79

All values are mean Dice scores per method. (a) 1 slice excl. U-Net1's liver prediction mask.

3.3.1 Comparison with Previous Work

Published work on liver lesion segmentation was collected and compared; see Table 3. Bi et al. (2017) and Han (2017) both obtained a higher average test Dice score with their LiTS challenge submissions. This can probably be attributed to the applied ResNet connections, which will be taken into account in the next section, where suggestions for further improvement are proposed.


Table 3: Segmentation Performances of other Methods and Datasets

Author | Method | Dataset | Input | Mean Dice score
Christ et al. (2017) | CFCN + 3D-CRF | 3DIRCADb: 20 CT | 1 slice | 0.56 (v)
Christ et al. (2017) | CFCN + 3D-CRF | Clinical: 100 CT | 1 slice | 0.61 (v)
Bi et al. (2017) | CFCN ResNet | LiTS: 201 CT | 1 slice | 0.498 (v)
Bi et al. (2017) | CFCN ResNet + 3D-CRF | LiTS: 201 CT | 1 slice | 0.317 (v)
Bi et al. (2017) | CFCN ResNet + Multi-scale Fusion | LiTS: 201 CT | 1 slice | 0.50 (v), 0.64 (t)
Han (2017) | CFCN + ResNet connections | LiTS: 201 CT | 1 and 5 slices | 0.67 (t)
Our present work | CFCN | LiTS: 201 CT | 3 slices | 0.57 (v), 0.56 (t)

v = validation set, t = test set

4 Discussion and Conclusion

This paper presented a contribution to the development of automatic liver lesion segmentation, in which two cascaded FCN approaches have been proposed. In addition, this paper attempted to expand the knowledge base on liver lesion segmentation. Segmentation of organs and tissue of unhealthy patients is frequently challenging due to a multitude of aspects, several of which have been touched upon in the presentation of our segmentation results. These insights can be further expanded on in the future to improve lesion segmentation and to stimulate lesion-tailored approaches.

The best performance on the test set was obtained with U-Net2b with a 3 slice input, which combines a multiple slice input with a restriction that forces the U-Net to focus on the liver when segmenting lesions; this improved segmentation accuracy. The improvement could be attributed to the network's ability to learn more specific filters for liver lesion boundaries. Also, the multiple slice input enables working in 2.5D, which increases the information available to the U-Net when predicting liver lesions for the considered slice.

In light of other work, the following suggestions could lead to further improvements of the proposed CFCN approaches. The first is the implementation of ResNet residual connections. These connections, as seen in the work of Bi et al. (2017) and Han (2017), further promote forward and backward propagation of information through the network and could lead to more accurate segmentation performance. Secondly, several improvements can be made in the post-processing, for example by applying a 3D dense conditional random field (3D-CRF). However, papers reported difficulty in finding correct hyperparameters for heterogeneous lesion structures, and the 3D-CRF's dependence on low-level features makes distinguishing lesions from liver tissue quite cumbersome (Bi et al., 2017; Christ et al., 2017). Another post-processing step, which yielded higher segmentation scores in Bi et al. (2017), is applying a multi-scale integration approach at validation/test time. This approach involves re-scaling the images to various scales and eventually averaging the results to obtain the final segmentation. Lastly, and somewhat outside the scope of the LiTS challenge as its data cannot support it, is the utilization of contrast-enhanced CT scans from multiple phases, as seen in Sun et al. (2017). By feeding the same slice from multiple contrast-enhanced phases into multiple input channels, the heterogeneity involved in correctly segmenting lesions of all different types, shapes and sizes could be addressed, which increased segmentation performance in the aforementioned paper.

All in all, cascaded fully convolutional network segmentation methods have been found very promising for automatic liver lesion segmentation in CT scans. A multiple slice input combined with the exclusion of the non-liver background performed better than the other methods experimented with in this study. Though many scholars are keen to find the successful ingredients for automatic lesion detection, the long training time and the substantial amount of necessary resources pose significant constraints on trying out and evaluating different methods. Fortunately, challenges such as LiTS enable the community to compare different approaches on a single and relatively large public dataset - and papers on this challenge are gradually being published - which tremendously aids further development and refinement of liver lesion segmentation methods.

5 Acknowledgements

I would like to thank my thesis supervisors and express my gratitude to the Diagnostic Image Analysis Group (DIAG), Department of Radiology and Nuclear Medicine of Radboud University Medical Center in Nijmegen, for the on-site supervision and the resources supporting my thesis work, which made it a truly educational and memorable journey. Thank you Colin Jacobs for your supervision and feedback on my drafts, which significantly helped me improve my work. Special thanks to Gabriel Humpire Mamani: thank you for the great Q&A times and support throughout my work; I have learned a great deal from you. I would also like to express my appreciation to Joris Bukala for his support, and thank all my other colleagues for the wonderful time I experienced at the department. All inspired me to work hard and enjoy my work. Lastly, many thanks and gratitude to my family who supported me during my master's program and thesis project.

6 About

Nadine Kraamwinkel is an Information Studies (Data Science) master's student at the University of Amsterdam (UvA). She completed a bachelor's in Health Sciences and a research master in Global Health at the Vrije Universiteit (VU) Amsterdam. This paper is part of her MSc thesis project, which will be defended on 14 July 2017 at the UvA. Thesis supervisors: dr. C. Jacobs (RadboudUMC) and Prof. dr. M. Welling (UvA).


References

Ananthakrishnan, A., Gogineni, V., and Saeian, K. (2006). Epidemiology of primary and secondary liver cancers. In Seminars in Interventional Radiology, volume 23, pages 047–063. Thieme Medical Publishers.

Ben-Cohen, A., Diamant, I., Klang, E., Amitai, M., and Greenspan, H. (2016). Fully convolutional network for liver segmentation and lesions detection. In International Workshop on Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, pages 77–85. Springer.

Bi, L., Kim, J., Kumar, A., and Feng, D. (2017). Automatic liver lesion detection using cascaded deep residual networks. arXiv preprint arXiv:1704.02703.

Christ, P. F., Elshaer, M. E. A., Ettlinger, F., Tatavarty, S., Bickel, M., Bilic, P., Rempfler, M., Armbruster, M., Hofmann, F., D'Anastasi, M., et al. (2016). Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3D conditional random fields. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 415–423. Springer.

Christ, P. F., Ettlinger, F., Grün, F., Elshaera, M. E. A., Lipkova, J., Schlecht, S., Ahmaddy, F., Tatavarty, S., Bickel, M., Bilic, P., et al. (2017). Automatic liver and tumor segmentation of CT and MRI volumes using cascaded fully convolutional neural networks. arXiv preprint arXiv:1702.05970.

CodeLab LiTS Dataset (2017). LiTS - Liver Tumor Segmentation Challenge. https://competitions.codalab.org/competitions/17094.

Dieleman, S., Schlüter, J., Raffel, C., Olson, E., Sønderby, S. K., Nouri, D., et al. (2015). Lasagne: First release.

Frankel, T. L., Do, R. K. G., and Jarnagin, W. R. (2012). Preoperative imaging for hepatic resection of colorectal cancer metastasis. Journal of Gastrointestinal Oncology, 3(1):11–18.

Gore, R. M., Thakrar, K. H., Wenzke, D. R., Newmark, G. M., Mehta, U. K., and Berlin, J. W. (2012). That liver lesion on mdct in the oncology patient: is it important? Cancer Imaging, 12(2):373.

Han, X. (2017). Automatic liver lesion segmentation using a deep convolutional neural network method. arXiv preprint arXiv:1704.07239.

Ircad Dataset (2017). 3d-ircadb-01 database. http://www.ircad.fr/research/3d-ircadb-01/.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Li, W., Jia, F., and Hu, Q. (2015). Automatic segmentation of liver tumor in ct images with deep convolutional neural networks. Journal of Computer and Communications, 3(11):146.


Midas Dataset (2017). http://www.insight-journal.org/midas/collection/view/38.

Ritter, F., Boskamp, T., Homeyer, A., Laue, H., Schwier, M., Link, F., and Peitgen, H.-O. (2011). Medical image analysis. IEEE Pulse, 2(6):60–70.

Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer.

Soler, L., Hostettler, A., Agnus, V., Charnoz, A., Fasquel, J., Moreau, J., Osswald, A., Bouhadjar, M., and Marescaux, J. (2012). 3D image reconstruction for comparison of algorithm database: a patient-specific anatomical and medical image database.

Sun, C., Guo, S., Zhang, H., Li, J., Chen, M., Ma, S., Jin, L., Liu, X., Li, X., and Qian, X. (2017). Automatic segmentation of liver tumors from multiphase contrast-enhanced CT images based on FCNs. Artificial Intelligence in Medicine.

Theano Development Team (2016). Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688.

Tieleman, T. and Hinton, G. (2012). Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31.

World Health Organization (2017). Factsheet: cancer. http://www.who.int/mediacentre/factsheets/fs297/en/.


Annex I

1 Hounsfield Units

This annex contains a brief explanation of Hounsfield Units.

Computed Tomography (CT) is an indirect imaging technique, in which an image is calculated/reconstructed. This image is based on the signals measured by the CT scanner, i.e. the detected amount of radiation. The relative amount attenuated by each volume of tissue (a voxel) is represented by a value, a so-called CT number, which is defined as the difference in attenuation between the voxel and water (Barnes, 1992). The Hounsfield scale is a scaling in which water is given the value (Hounsfield Unit, HU) of zero and air a value of -1000. The total scale ranges from -1000 HU up to 3095 HU, see Figure 1.

Figure 1: Hounsfield Scale. Illustration of the Hounsfield scale and image gray scale with typical ranges of CT numbers for various tissues. Taken from (Barnes, 1992) and slightly adapted.

Furthermore, the created image is a matrix (e.g. 512x512) in which every pixel represents a voxel of the scanned body. In every pixel, the amount of absorption of the voxel is represented by a gray shade, which can be seen on the monitor: a light (white) shade indicates high absorption, a dark (black) shade little absorption. If the gray scale were projected onto the full range of -1000 to 3095 HU, it would result in poor contrast and little would be distinguishable. Through window level and window width adjustments, the gray scale can be centered on and stretched to the desired range, such that the tissues of interest are depicted with maximum contrast (Barnes, 1992).
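As a small illustration of this windowing, the sketch below maps HU values to 8-bit display gray values for a given window level and width; the function name and the 8-bit output range are assumptions.

```python
import numpy as np

def apply_window(hu_image, level, width):
    """Map HU values to display gray values using a window level/width.

    Values below level - width/2 become black (0); values above
    level + width/2 become white (255).
    """
    lo, hi = level - width / 2.0, level + width / 2.0
    clipped = np.clip(hu_image.astype(np.float32), lo, hi)
    return ((clipped - lo) / (hi - lo) * 255.0).astype(np.uint8)
```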



Noteworthy is that radiologists can administer a contrast fluid for further tissue differentiation. Such images are referred to as contrast-enhanced CT images and they can be acquired in several phases, including the arterial phase, the portal venous phase, and the delayed phase. This allows the radiologist, for example, to differentiate and diagnose the type of liver lesion: hypervascular tumors enhance in the arterial phase, and the portal venous phase is used for hypovascular tumors (Baron, 2017). The delayed phase starts when the contrast moves away from the liver. In the LiTS dataset most lesions are hypovascular. Only a few scans were found with hypervascular lesions, which were most likely made in the arterial phase as rim enhancement is visible around the tumor. Figure 2 shows the difference between the arterial phase and the delayed phase, with Figure 2A illustrating rim enhancement in the arterial phase.

Figure 2: A. Arterial phase, contrast enhanced. B. Delayed phase. Taken from (Tiferes & D’Ippolito, 2008) and slightly adapted.

2 Setting the Hounsfield Unit range

Due to the scarce literature available on HU range settings for liver lesions, a reasonable workable estimate was taken. One paper reported a [-62, 238] HU window for lesion detection on unenhanced CT scans (Sahi et al., 2014). Another study worked with contrast-enhanced CT scans from the portal venous phase and used [-50/-100, 250/200] HU ranges (Sabouri, Khatami, Azadeh, Ghoroubi, & Azimi, 2008). Thus, a window of [-100, 250] would have been a safe range to work with, but since the LiTS dataset is very heterogeneous (i.e. different CT scanners, contrast phases, lesion types, etc.) we decided to reduce the risk of excluding potentially important information and therefore set the HU window to [-300, 300].

References

Barnes, J. (1992). Characteristics and control of contrast in ct. Radiographics, 12 (4), 825–837.

Baron, R. (2017). Liver - masses I - characterization. Retrieved from http://www.radiologyassistant.nl/en/p446f010d8f420/liver-masses-i-characterization.html

Sabouri, S., Khatami, A., Azadeh, P., Ghoroubi, J., & Azimi, G. (2008). Adding liver window setting to the standard abdominal ct scan protocol: Is it useful?



Sahi, K., Jackson, S., Wiebe, E., Armstrong, G., Winters, S., Moore, R., & Low, G. (2014). The value of “liver windows” settings in the detection of small renal cell carcinomas on unenhanced computed tomography. Canadian Association of Radiologists Journal, 65 (1), 71–76.

Tiferes, D. A. & D’Ippolito, G. (2008). Liver neoplasms: Imaging characterization. Radiologia Brasileira, 41 (2), 119–127.
