
Automated Cancer Detection Using Reinforcement Learning

Liz Verbeek (10357599)
Bachelor thesis, Credits: 18 EC
Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: E.E. van der Pol MSc
Informatics Institute, Faculty of Science
University of Amsterdam
Science Park 904, 1098 XH Amsterdam

July 14, 2019


Abstract

Lung cancer is one of the most common causes of death worldwide. Research has shown that screening programs can reduce lung cancer mortality by 20%, but such programs rely on the analysis of enormous amounts of CT images and are therefore time- and cost-expensive. To help develop automated detection systems for lung nodule detection in these CT images, the LUNA16 challenge was held. Several submissions for this challenge showed promising results using deep supervised learning methods. These kinds of methods rely on expert supervision and exhaustive search, which also takes a lot of time. In order to find a more time-efficient solution to the nodule detection task, this thesis proposes a reinforcement learning approach in which the detection task is defined as a path finding problem. This approach is split into two phases: first, a model is trained to find such a path on images that clearly contain local features; second, the same approach is applied to lung nodule scans. The two-step approach shows that lung nodule detection in these scans cannot be done with a model that was trained on images where local features were clearly present, indicating that such features might not be present in lung nodule scans. Further research should rule out other possible explanations and answer this question with more certainty.


Contents

1 Introduction
  1.1 Research Questions
2 Related Work
  2.1 Anatomical Landmark Detection
3 Theoretical Background
  3.1 Reinforcement Learning
  3.2 Deep Learning
    3.2.1 Convolutional Neural Networks
  3.3 Deep Reinforcement Learning
    3.3.1 Value-based Reinforcement Learning Algorithms
    3.3.2 Policy-based Reinforcement Learning Algorithms
    3.3.3 Actor-Critic Methods
    3.3.4 A3C and A2C
4 Method
  4.1 Color Gradient Images
  4.2 LUNA16 Dataset
  4.3 Implementation
5 Results
  5.1 Shaping Reward Function
  5.2 Color Gradient Images
  5.3 2D Lung Nodule Slices
    5.3.1 Increased Training Time
    5.3.2 Closer Starting Points
    5.3.3 Training on a Single Image
  5.4 Color Gradient Images
6 Discussion
  6.1 Tweaking of Hyperparameters
  6.2 Size of Observations
  6.3 Variations in Nodule Size
7 Conclusion
  7.1 Future Work

1 Introduction

Lung cancer is the most commonly diagnosed (11.6% of the total cancer cases) and deadliest (18.4% of the total cancer deaths) type of cancer worldwide [Bray et al., 2018]. Research has shown that screening of high-risk subjects with low-dose computed tomography (CT) can reduce the number of lung cancer deaths after 7 years by 20% compared to the previously used radiography screening [Team, 2011]. Over the last few years, these promising results of CT screening gave rise to plans for nationwide CT screening programs in the United States as well as several European countries [Bergtholdt et al., 2016, Lopez Torres et al., 2015]. The implementation of such large scale programs, however, comes with some challenges regarding the analysis of the CT data. First, because of the scale at which the screening should take place, a large number of CT scans must be analyzed. Second, each CT scan in itself consists of up to 600 2D images [Lopez Torres et al., 2015]. Together, this results in an even larger number of CT images to be analyzed by radiologists, who are paid high hourly rates, making the interpretation of these images a very time- and cost-expensive process.

To make the screening programs more time- and cost-efficient, radiologists could be supported by computer-aided detection (CAD) algorithms that provide automated identification of (in this case) lung nodules in the CT scans. In order to compare and improve such algorithms for lung nodule detection, the LUng Nodule Analysis (LUNA) 16 challenge was held. This challenge allowed research groups to evaluate their algorithms on a large shared set of CT scans from a lung cancer screening trial [Setio et al., 2017]. Submissions for the LUNA16 challenge included several deep learning algorithms that were able to detect the nodules in an efficient manner, but still needed some form of supervision by a clinical expert. Therefore, most of the submitted CAD systems serve as support for radiologists rather than as independent nodule detection systems. Although these approaches are less time- and cost-expensive than manual nodule detection, they still cost a considerable amount of time and money. For this reason, it could be useful to look into faster and possibly more accurate systems.

One possible alternative to these deep learning approaches could be to use reinforcement learning on the nodule detection task. In reinforcement learning, an agent learns about its environment through trial and error. This makes it possible to define the nodule detection task as a path finding problem where the optimal path to the goal (in this case: a nodule) can be found without having to search through the whole CT image. It would also be possible to let multiple agents search the same image in parallel; this approach therefore has the potential to be less time-consuming than the previously mentioned deep learning approaches.

Several implementations of reinforcement learning algorithms have been proven to be successful in medical imaging problems over the last few years. For example, deep reinforcement learning has successfully been applied to detect anatomical landmarks in medical scans [Ghesu et al., 2017, Alansary et al., 2018]. The goal of this thesis is to apply a similar approach to the detection of lung nodules.

1.1 Research Questions

In earlier work, the detection of landmarks was formulated as a path finding problem where an agent has to find a goal in a 3D environment. To apply this approach to CT images with lung nodules, it must first be determined whether it is possible to find a goal in a 2D image based on the color intensity of this image, resulting in the following research question:

• Is it possible for an agent to find a path to a goal through a simple 2D image?

If this proves possible, it indicates that there are features present in this simple image that point the agent towards the goal. In order to apply the same approach to the CT images successfully, such features must be present in these images as well. Hence the second research question:

• Do lung nodule scans contain features that enable an agent to find a path through the scan to this lung nodule?

2 Related Work

2.1 Anatomical Landmark Detection

Most of the current methods for the detection of anatomical landmarks are based on machine learning techniques that exploit large annotated medical image databases [Ghesu et al., 2017]. Traditionally, these machine learning methods rely on precise feature engineering to model the image information, which makes them sensitive to variations in shape, size, location and orientation of the landmarks [Alansary et al., 2018]. In order to ensure better generalization to unseen data, deep learning methods that automatically learn image features have been proposed. However, these deep learning techniques still rely on sub-optimal search strategies such as exhaustive scanning or end-to-end image mapping techniques that can lead to very high computation times [Ghesu et al., 2017]. To account for these problems, reinforcement learning approaches have been proposed that reformulate the landmark detection problem as a navigation problem in which an agent moves towards a point of interest (the landmark) [Alansary et al., 2018, Ghesu et al., 2017].

3 Theoretical Background

3.1 Reinforcement Learning

In reinforcement learning, a decision-making problem consists of a sequence of decision-making steps that can be modeled using a Markov Decision Process (MDP). Formally, an MDP is a four-tuple $\langle S, A, T^a_{ss'}, R^a_{ss'} \rangle$ where

• $S$ is the set of possible states (or state space);
• $A$ is the set of possible actions (or action space);
• $T^a_{ss'}$ is the transition function specifying the probability of arriving in state $s'$ after taking action $a$ in state $s$;
• $R^a_{ss'}$ is the reward function specifying the reward $r$ for taking action $a$ in state $s$ and arriving in state $s'$.

Because the process satisfies the Markov property, each state and action is conditionally independent of all previous states and actions, meaning that the transition function only depends on the current state $s$ and the chosen action $a$.

Chosen actions influence not only the immediate reward $R_t$ in a certain state, but also rewards associated with states that might be visited in the future. The goal is to find an optimal policy $\pi$ that maximizes the total reward (or return):

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

where $\gamma \in [0, 1]$ is a discount factor used to weight future rewards, making the agent prefer immediate rewards over future rewards.
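As a small worked example of the return defined above, the sketch below computes $G_t$ for a short, made-up reward sequence; the reward values and discount factor are chosen purely for illustration.

```python
# Minimal sketch: computing the discounted return G_t = sum_k gamma^k * R_{t+k+1}
# for a finite, made-up reward sequence (illustration only).

def discounted_return(rewards, gamma=0.99):
    """Return G_t for a list of rewards [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Hypothetical episode: a reward of 1 is only obtained at the final step.
print(discounted_return([0.0, 0.0, 1.0]))  # 0.99**2 * 1.0 = 0.9801
```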

The optimal policy maps a given state to an action by maximizing the expected return. This expectation of the return of a certain state under policy $\pi$ is given by the value function:

$$V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

We can also use the expected return over states and actions using the action-value function:

$$Q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

Both these equations can be unrolled recursively and solved iteratively by writing them in terms of the next state-action pair $(s', a')$. The resulting equations are called the Bellman equations [Bellman, 1957]:

$$V_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,[r(s, a) + \gamma V_\pi(s')]$$

$$Q_\pi(s, a) = \sum_{s'} p(s' \mid s, a)\,\Big[r(s, a) + \gamma \sum_{a'} \pi(a' \mid s')\, Q_\pi(s', a')\Big]$$

The optimal value function $V_*$ and action-value function $Q_*$ can now be defined as

$$V_*(s) = \max_a \sum_{s'} p(s' \mid s, a)\,[r(s, a) + \gamma V_*(s')]$$

$$Q_*(s, a) = \sum_{s'} p(s' \mid s, a)\,[r(s, a) + \gamma \max_{a'} Q_*(s', a')]$$

3.2 Deep Learning

Due to many successes over the last two decades, there has been a huge rise in interest in the field of deep learning. Deep learning is the area of machine learning that concerns large neural networks (i.e., neural networks with many layers).

A neural network is a model that maps an input vector $\vec{x}$ through multiple hidden layers to an output vector $\vec{y}$. These interconnected hidden layers all have associated weights $w$ according to which the input vector is mapped to an output vector [Bishop, 2006, 227-232]. Deep neural networks with many of these hidden layers can approximate highly complex functions, which also makes them harder to train.

3.2.1 Convolutional Neural Networks

A special type of neural network apt for processing image data is the convolutional neural network (CNN). Convolutional neural networks consist of convolutional layers that contain feature maps. Each of these feature maps takes its inputs only from part of the image, which makes it possible to process different features separately and therefore makes these networks suitable for image processing [Bishop, 2006, 267].
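As an illustration of this idea, the sketch below defines a small convolutional network in PyTorch of the general kind used later in this thesis as a function approximator; the layer sizes, the 9x9 observation shape and the use of PyTorch are assumptions made for illustration, not the architecture of the implementation that was actually used.

```python
# Illustrative only: a tiny CNN that maps a single-channel image observation
# to one output per action. Layer sizes and the 9x9 patch are assumptions.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # feature maps over local patches
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                       # pool to one value per feature map
        )
        self.head = nn.Linear(32, n_actions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: batch of observations, shape (N, 1, H, W)
        return self.head(self.features(x).flatten(start_dim=1))

net = SmallCNN(n_actions=4)       # four navigation actions
obs = torch.zeros(1, 1, 9, 9)     # hypothetical 9x9 observation patch
print(net(obs).shape)             # torch.Size([1, 4])
```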

3.3 Deep Reinforcement Learning

In deep reinforcement learning, the value functions are approximated using a deep neural network. This approach combines the representational power of neural networks with reinforcement learning models [Ghesu et al., 2017]. For the anatomical landmark detection problem, such a representation would be useful. Based on successes of convolutional neural networks (CNNs) in vision tasks, as mentioned in section 3.2.1, this kind of neural network will be used as a function approximator.

Based on the way the optimal policy is found, deep reinforcement learning algorithms can be split into two main categories: value-based algorithms and policy-based algorithms. Both these types of algorithms will be discussed in the next sections to provide an overview of possible approaches to the nodule detection problem.

3.3.1 Value-based Reinforcement Learning Algorithms

Using the Bellman equations we have seen in section 3.1, value-based algorithms derive an optimal policy from the optimal action-value function. In order to do so, value-based reinforcement learning algorithms learn an approximator of the optimal value or action-value function: $\hat{Q}(s, a; \theta) \approx Q^*(s, a)$. Depending on the environment, this can be a linear approximator or a non-linear one. The best known implementation of a value-based reinforcement learning algorithm is the DQN algorithm [Mnih et al., 2013]. In the DQN algorithm, the approximation of the value function is modeled by a neural network called a Q-network. This Q-network is trained by minimizing the squared distance between the output of this network and a target $y_i$ that represents the best possible value [Mnih et al., 2013]. This target is defined as

$$y_i = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1})\big]$$

In this approach, the actions resulting from following the optimal policy can be obtained using the estimated action-value function:

$$a(s) = \arg\max_a \hat{Q}(s, a; \theta)$$
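The sketch below makes these two quantities, the target $y_i$ and the greedy action, concrete with a stand-in Q-network; it only illustrates the value-based formulation and is not part of the method used in this thesis (which uses A2C, see section 4.3).

```python
# Sketch of the value-based quantities above: the target
# y = r + gamma * max_a' Q(s', a'; theta_{i-1}) and the greedy action
# argmax_a Q(s, a; theta). The linear "Q-networks" and random transitions
# are placeholders for illustration only.
import torch
import torch.nn as nn

n_actions, obs_dim, gamma = 4, 8, 0.99
q_net = nn.Linear(obs_dim, n_actions)        # stand-in for Q(s, a; theta)
target_net = nn.Linear(obs_dim, n_actions)   # stand-in for Q(s, a; theta_{i-1})

s = torch.randn(32, obs_dim)                 # batch of states
a = torch.randint(0, n_actions, (32,))       # actions taken in those states
r = torch.randn(32)                          # observed rewards
s_next = torch.randn(32, obs_dim)            # next states

with torch.no_grad():
    y = r + gamma * target_net(s_next).max(dim=1).values   # DQN target

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; theta)
loss = ((q_sa - y) ** 2).mean()                             # squared distance to the target

greedy_actions = q_net(s).argmax(dim=1)                     # a(s) = argmax_a Q(s, a; theta)
```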

3.3.2 Policy-based Reinforcement Learning Algorithms

A different technique is used in policy optimization methods: while value-based methods derive a policy from an approximation of the action-value function, policy optimization methods represent the policy explicitly as a probability distribution over actions: $\pi_\theta(a \mid s) = P[a \mid s]$.

Policy optimization algorithms have been shown to converge to locally optimal policies. Therefore, they tend to be more stable than value-based methods [Sutton et al., 2000]. In policy optimization, the parameters $\theta$ of the policy are updated by performing gradient ascent on (local approximations of) the expected return $J(\pi_\theta) = \mathbb{E}_{\pi_\theta}[R(\tau)]$ under this policy, where $R(\tau)$ is the total reward for a given trajectory $\tau$: the sequence of states, actions and rewards generated by following policy $\pi$. The goal is to find a local maximum of this performance measure $J(\pi_\theta)$ by updating the parameters $\theta$ of the policy in the direction of the greatest increase in performance:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\pi_\theta)$$

As we can see, to update the parameters we need an estimate of the policy gradient $\nabla_\theta J(\pi_\theta)$, which is difficult since this gradient depends on unknown effects of policy changes on the state distribution. To overcome this problem, we use the Policy Gradient Theorem, which provides a different expression for the performance gradient that does not use the derivative of the state distribution [Sutton and Barto, 2018, 324-325]:

$$\nabla_\theta J(\pi_\theta) \propto \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\big]$$

The probability of a trajectory $\tau$, given that actions come from $\pi_\theta$, equals

$$\pi_\theta(\tau) = p_0(s_0) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

Therefore, we can write the previous expression as

$$\nabla_\theta J(\pi_\theta) \propto \mathbb{E}_{\pi_\theta}\Big[\nabla_\theta \Big(\log p_0(s_0) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t)\Big) R(\tau)\Big]$$

$$= \mathbb{E}_{\pi_\theta}\Big[\sum_{t=1}^{T} R(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]$$

With this representation, policy optimization methods can update $\theta$ as

$$\theta_{t+1} = \theta_t + \alpha\, \mathbb{E}_{\pi_\theta}\Big[\sum_{t=1}^{T} R(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]$$

to find the optimal parameters $\theta$ and with those the optimal policy $\pi$. While these policy gradient methods have better convergence properties than value-based methods, they also have drawbacks.

First, the variance of the policy gradient estimate can be excessive [Sutton and Barto, 2018]. The most common method to reduce this variance is the use of a state-dependent baseline $b(s_t)$ to compare the return to. The policy gradient is then based on the difference between the return of the trajectory and this baseline:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=1}^{T} \big(G_t - b(s_t)\big)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]$$

Another disadvantage of policy gradient methods is that they only update the parameters at the end of an episode. The return $R(\tau)$ represents the total reward of the whole episode, which means that not every action taken in this episode necessarily resulted in a better score. This means that to find the optimal policy many episode samples are needed, which makes training slow [Sutton and Barto, 2018, 331-332]. To account for this problem and therefore make training more efficient, we might want to update a value estimate for each state at each step. This approach is used in the actor-critic methods discussed in the next section.
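To make the update with a baseline concrete, the sketch below performs one whole-episode policy gradient step in PyTorch. The linear policy, the episode data and the simple mean-return baseline are placeholders for illustration; a learned state-dependent baseline is exactly what the actor-critic methods of the next section add.

```python
# Sketch of one policy gradient update with a baseline:
# grad J ~ E[ sum_t (G_t - b(s_t)) * grad log pi_theta(a_t | s_t) ].
# The linear policy, episode data and mean-return baseline are placeholders.
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4
policy = nn.Linear(obs_dim, n_actions)            # stand-in for pi_theta
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)

states = torch.randn(10, obs_dim)                 # states s_t of one episode
actions = torch.randint(0, n_actions, (10,))      # actions a_t taken in that episode
returns = torch.randn(10)                         # returns G_t observed from each step
baseline = returns.mean()                         # simple baseline (state-independent here)

log_probs = torch.log_softmax(policy(states), dim=1)
log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # log pi(a_t | s_t)

loss = -((returns - baseline) * log_pi_a).sum()   # gradient ascent on J as descent on -J
optimizer.zero_grad()
loss.backward()
optimizer.step()
```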

3.3.3 Actor-Critic Methods

In contrast to policy gradient methods, actor-critic methods do not use the full return of a trajectory. Instead, the return is updated at every step by learning an approximation of the value function $V_w(s_t)$ like we have seen before. Since this value function determines how good or bad a certain state is, it serves as a critic to the agent's policy (the actor), hence the name actor-critic.


The policy gradient is updated using the reward of the next state together with the new return estimate as the total reward and an approximation of the value of the current state as a baseline:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=1}^{T} \big(r(s_{t+1}, a_{t+1}) + \gamma V_w(s_{t+1}) - V_w(s_t)\big)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]$$

Apart from the policy parameters, in actor-critic methods we also need to find the optimal parameters for the estimated value function, or: the parameters that minimize the difference between the value $V_w(s_t)$ based on the estimated return for the current state and the value $r(s_{t+1}, a_{t+1}) + \gamma V_w(s_{t+1})$ based on the observed reward for the next state and the estimated return for the next state. Since the latter value includes a reward that was actually observed, it is a little more reliable than the former, so we want to find the estimate that minimizes this difference. To do so, the parameters are updated with the squared error as a loss function,

$$J(w) = \big(r(s_{t+1}, a_{t+1}) + \gamma V_w(s_{t+1}) - V_w(s_t)\big)^2$$

which is minimized by moving $w$ in the direction of this error times the gradient of the value estimate:

$$w \leftarrow w + \alpha \big(r(s_{t+1}, a_{t+1}) + \gamma V_w(s_{t+1}) - V_w(s_t)\big)\, \nabla_w V_w(s_t)$$
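A single actor-critic update following these equations could look like the sketch below; the networks, the transition and the learning rates are placeholders, and the TD error is used both as the critic's error signal and in place of the full return in the policy gradient.

```python
# Sketch of one actor-critic update: the TD error
# delta = r + gamma * V_w(s') - V_w(s) drives both the critic and the actor.
# Networks, the transition and the learning rates are placeholders.
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 8, 4, 0.99
actor = nn.Linear(obs_dim, n_actions)                # stand-in for pi_theta(a | s)
critic = nn.Linear(obs_dim, 1)                       # stand-in for V_w(s)
opt_actor = torch.optim.SGD(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.SGD(critic.parameters(), lr=1e-3)

s = torch.randn(1, obs_dim)                          # current state
a = torch.tensor([2])                                # action taken in s
r = torch.tensor([0.1])                              # observed reward
s_next = torch.randn(1, obs_dim)                     # next state

with torch.no_grad():
    td_error = r + gamma * critic(s_next).squeeze(1) - critic(s).squeeze(1)

# Critic: semi-gradient step on the squared TD error (the target is held fixed).
v_s = critic(s).squeeze(1)
critic_loss = (r + gamma * critic(s_next).detach().squeeze(1) - v_s).pow(2).mean()
opt_critic.zero_grad()
critic_loss.backward()
opt_critic.step()

# Actor: ascend td_error * log pi(a | s), written as descent on its negative.
log_pi_a = torch.log_softmax(actor(s), dim=1).gather(1, a.unsqueeze(1)).squeeze(1)
actor_loss = -(td_error * log_pi_a).mean()
opt_actor.zero_grad()
actor_loss.backward()
opt_actor.step()
```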

3.3.4 A3C and A2C

Some actor-critic algorithms use an estimate of the advantage of an action $a_t$ in a state $s_t$:

$$A(a_t, s_t) = Q(a_t, s_t) - V(s_t)$$

Since $V(s_t)$ serves as a state-dependent baseline here, this approach can be used to reduce variance and therefore makes the algorithm more stable. There are many policy-based algorithms that use an advantage function to choose the best action. The two best known algorithms that combine an advantage function with an actor-critic method are A2C and A3C. A2C stands for Advantage Actor Critic, which is the synchronous version of the A3C (Asynchronous Advantage Actor Critic) algorithm [Mnih et al., 2016]. The difference between these algorithms is that, while both execute different agents in parallel on multiple instances of the environment, in A3C each agent updates the network asynchronously, whereas in A2C every agent updates the network at the same time. The consequence is that the agents in A3C do not always have the newest versions of the parameters and therefore do not use the same parameters for their calculations. In A2C, all agents do have the same parameters after every step and can therefore learn from the same experience.
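The synchronous part of A2C can be illustrated with the n-step bookkeeping below: all parallel agents contribute a fixed-length rollout, and one advantage-weighted update is computed from the combined batch, so every agent always continues with the same parameters. The rollout data here is random and episode terminations are ignored, purely for illustration; this is not the OpenAI implementation.

```python
# Sketch of the synchronous A2C bookkeeping: n-step returns and advantages
# computed for all parallel environments at once (random data, illustration only).
import numpy as np

n_steps, n_envs, gamma = 5, 4, 0.99

def n_step_returns(rewards, bootstrap_values, gamma):
    """Discounted n-step returns, shape (n_steps, n_envs)."""
    returns = np.zeros_like(rewards)
    running = bootstrap_values                  # V(s_T) estimated by the critic
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = np.random.rand(n_steps, n_envs)       # rewards from one synchronous rollout
values = np.random.rand(n_steps, n_envs)        # critic estimates V(s_t)
bootstrap = np.random.rand(n_envs)              # critic estimates at the end of the rollout

advantages = n_step_returns(rewards, bootstrap, gamma) - values
# A single gradient step would now be taken for all n_steps * n_envs transitions,
# after which every agent continues with the same, updated parameters.
print(advantages.shape)                         # (5, 4)
```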

4 Method

In order to demonstrate that the detection of lung nodules can be addressed as a path finding problem for a reinforcement learning agent, a two-step approach that follows the two research questions stated in section 1.1 is used. The first step of this approach is used to answer the first research question: is it possible for an agent to find a path to a goal through a simple 2D image? To answer this question, a model is trained to find a goal using the image intensities of simple images with a color gradient. It is expected that these images contain features that can direct the agent to the goal. The same approach is then applied to images containing lung nodules to answer the second research question: do lung nodule scans contain features that enable an agent to find a path through the scan to this lung nodule? Using the approach from the color gradient images on the LUNA16 images makes it possible to show whether the LUNA16 images contain local features that point to the goal as well. For these images it is expected that the relevant features depend more on global image properties and will therefore be more difficult to find. Training a model on the LUNA16 images will therefore also take longer, but if this is possible, it will demonstrate that such local features do exist in the lung nodule slices.

4.1 Color Gradient Images

In the first phase of testing, a randomly generated set of 1000 color gradient images (Figure 1) will be used. The images are all 100x100 pixels in size and contain one area of 4 pixels with the highest pixel value in the image. These areas are set at random locations in the images and make up the goals the agents should find.

Like the anatomical landmark detection problem, the detection of the goal in these images is formulated as a navigation problem where the agent moves towards the point of interest (in this case: the lightest area in the image). To model this as a reinforcement learning problem, we need to define an environment, a state space, an action space, an observation space and a reward function.

(13)

The environment is defined as one randomly chosen color gradient image. The states in this environment are represented as a position (x, y) in the image. The observation space (i.e., the area the agent can see) is a region of pixels around the position of the agent, defined in terms of pixel intensity. A visualization of such an observation can be found in Figure 1b, which shows that differences in pixel intensity can barely be seen by the human eye, but when represented as pixel values, there is a difference of 0.05 between these pixels. This representation of a region of pixel intensities instead of only a location in the image is used to make the model location invariant and allows the model to generalize to new, unseen images. The set of possible actions that the agent can take in this model consists of the navigation steps the agent can take: up, down, right or left. The reward function is based on the distance between the landmark position and the position of the agent. The shaping of this distance-based reward function will be covered in section 5.1. The optimal policy for this model should find the shortest route from a random starting position to the goal location.
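A minimal sketch of how such an environment could be expressed with the Gym interface is given below. It is not the thesis implementation: the patch size, the step limit and the use of the inverse-distance reward from section 5.1 are assumptions made for illustration.

```python
# Minimal sketch (not the thesis code) of the color gradient navigation task as
# a Gym environment: states are (x, y) positions, observations are a patch of
# pixel intensities around the agent, actions are up/down/left/right, and the
# reward is based on the distance to the goal (here: the inverse distance).
import gym
import numpy as np
from gym import spaces

class GradientNavEnv(gym.Env):
    def __init__(self, image, goal, patch=1, max_steps=1000):
        super().__init__()
        self.image = image.astype(np.float32)     # 2D array of pixel intensities
        self.goal = goal                          # (x, y) of the lightest area
        self.patch, self.max_steps = patch, max_steps
        side = 2 * patch + 1
        self.observation_space = spaces.Box(0.0, 1.0, shape=(side, side), dtype=np.float32)
        self.action_space = spaces.Discrete(4)    # up, down, left, right

    def _observe(self):
        x, y = self.pos
        padded = np.pad(self.image, self.patch, mode="edge")
        return padded[y:y + 2 * self.patch + 1, x:x + 2 * self.patch + 1]

    def reset(self):
        h, w = self.image.shape
        self.pos = [np.random.randint(w), np.random.randint(h)]  # random starting point
        self.steps = 0
        return self._observe()

    def step(self, action):
        dx, dy = [(0, -1), (0, 1), (-1, 0), (1, 0)][action]
        h, w = self.image.shape
        self.pos[0] = int(np.clip(self.pos[0] + dx, 0, w - 1))
        self.pos[1] = int(np.clip(self.pos[1] + dy, 0, h - 1))
        self.steps += 1
        dist = np.hypot(self.pos[0] - self.goal[0], self.pos[1] - self.goal[1])
        reward = 1.0 / dist if dist > 0 else 1.0   # inverse-distance reward (section 5.1)
        done = dist == 0 or self.steps >= self.max_steps
        return self._observe(), reward, done, {}
```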

Figure 1: Example of (a) a color gradient image with (b) a corresponding example of one observation, indicated as a red square in (a)

4.2 LUNA16 Dataset

In the second phase, the hypothesis that reinforcement learning methods can be used to find lung nodules in CT scans will be tested on the LUNA16 dataset. This dataset consists of 3D CT images stored in MetaImage (mhd/raw) format, where each .mhd file is stored with a separate .raw binary file for the pixel data. Each 3D CT scan consists of a varying number of 2D slices that are all 512x512 pixels in size. A separate .csv file with the positions of the nodules in these 3D scans is included in the data.

As mentioned in section 1.1, the scope of this project is limited to testing on 2D images that contain one lung nodule per image. Therefore, all 2D CT slices that contain nodules are extracted from the dataset (Figure 2) to create a new dataset consisting of 1066 2D CT slices.
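A sketch of how one such slice could be extracted is shown below, using SimpleITK to read the MetaImage volume and to convert the world coordinates of an annotation into voxel indices. The file paths are placeholders and the column names follow the LUNA16 annotations file; this is not necessarily how the extraction was implemented for this thesis.

```python
# Sketch (under assumptions) of extracting the 2D slice that contains an
# annotated nodule from a LUNA16 MetaImage volume. File paths are placeholders;
# coordX/coordY/coordZ are world coordinates as in the LUNA16 annotations file.
import csv
import numpy as np
import SimpleITK as sitk

def nodule_slice(mhd_path, world_coord):
    """Return the 2D slice containing the nodule and its in-slice (x, y) position."""
    image = sitk.ReadImage(mhd_path)                       # reads the .mhd/.raw pair
    volume = sitk.GetArrayFromImage(image)                 # numpy array, shape (z, y, x)
    x_idx, y_idx, z_idx = image.TransformPhysicalPointToIndex(world_coord)
    return volume[z_idx], (x_idx, y_idx)

with open("annotations.csv") as f:                         # placeholder path
    row = next(csv.DictReader(f))                          # first annotation
world = (float(row["coordX"]), float(row["coordY"]), float(row["coordZ"]))
slice_2d, nodule_xy = nodule_slice(row["seriesuid"] + ".mhd", world)  # placeholder file name
print(slice_2d.shape, nodule_xy)                           # e.g. (512, 512) and the goal pixel
```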

Figure 2: Example of (a) a 2D slice from a LUNA16 CT scan with (b) a corresponding example of one observation, indicated as a red square in (a)

As for the color gradient images, the environment is defined as one randomly chosen image from this dataset. The states, actions and observations are represented in the same way as for the color gradient images, and so is the reward function. The optimal policy should again find the shortest route from a random starting point to the goal location (which is in this case the lung nodule).

4.3 Implementation

Implementation of the reinforcement learning models for both phases of testing as mentioned in the previous section is done using the A2C algorithm [Mnih et al., 2016] discussed in section 3.3.4, with a CNN as a function approximator. The environments were built using the OpenAI Gym toolkit.


Unless indicated otherwise, testing is done using the default hyperparameters of the A2C algorithm (Table 1). Links to the OpenAI implementation of A2C, the Gym toolkit and the implementation used for this thesis can be found in Appendix A.

Parameter (type)                     Default
policy (ActorCriticPolicy or str)    no default setting
env (Gym environment or str)         no default setting
gamma (float)                        0.99
n_steps (int)                        5
vf_coef (float)                      0.25
ent_coef (float)                     0.01
max_grad_norm (float)                0.5
learning_rate (float)                0.0007
alpha (float)                        0.99
epsilon (float)                      1e-5
lr_schedule (str)                    'linear'
verbose (int)                        0
tensorboard_log (str)                None
init_setup_model (bool)              True
policy_kwargs (dict)                 None

Table 1: Default parameters used in the OpenAI implementation of A2C
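The parameter names in Table 1 match a stable-baselines-style A2C interface. Under that assumption, training a model with these defaults could look like the sketch below; the environment is the hypothetical GradientNavEnv from the section 4.1 sketch, and an MLP policy is used here only to keep the example small (the thesis uses a CNN as function approximator).

```python
# Sketch only: constructing and training A2C with the Table 1 defaults,
# assuming a stable-baselines-style interface. GradientNavEnv is the
# hypothetical environment sketched in section 4.1, not the thesis code.
import numpy as np
from stable_baselines import A2C
from stable_baselines.common.vec_env import DummyVecEnv
from gradient_nav_env import GradientNavEnv   # hypothetical module holding the section 4.1 sketch

image = np.random.rand(100, 100).astype(np.float32)       # placeholder gradient image
goal = (60, 40)                                            # placeholder goal location
env = DummyVecEnv([lambda: GradientNavEnv(image, goal)])   # single vectorized environment

model = A2C("MlpPolicy", env, gamma=0.99, n_steps=5, vf_coef=0.25,
            ent_coef=0.01, max_grad_norm=0.5, learning_rate=0.0007,
            alpha=0.99, epsilon=1e-5, lr_schedule="linear", verbose=0)
model.learn(total_timesteps=1_000_000)                     # 1M training steps, as in section 5
```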

Both the color gradient images and the LUNA 2D slices were split into training and test sets of respectively 80 and 20 percent of the available data. The training and test sets did not vary in any of the experiments.

5 Results

Training was done in runs of several episodes, where one episode of training lasts until the goal is reached or a threshold for the number of steps taken is exceeded. After one episode, a new training episode was started on a different image from the training set. This process was repeated until the total number of training steps was reached. The performance of the model was then measured in terms of average return per episode over the number of training steps taken. Unless indicated otherwise, performance was measured as an average over 5 training runs.


5.1 Shaping Reward Function

As mentioned in the previous section, the reward function in these models is based on the distance between the position of the agent and the position of the goal, where the reward should increase when the distance decreases. This can be done in two ways: 1) using the negative of the squared distance to the goal:

$$r_1 = -\sqrt{(x_s - x_g)^2 + (y_s - y_g)^2}$$

or 2) using the inverse of the squared distance to the goal:

$$r_2 = \frac{1}{\sqrt{(x_s - x_g)^2 + (y_s - y_g)^2}}$$

One advantage of using the negative squared distance as reward is that the reward is proportional to the change in distance: moving one pixel closer to the goal will result in the same increase in reward for all positions in the image. It is expected that when training a model with this reward function, the average return will increase and approach zero. As shown in Figure 3a, this turned out to not be the case: the average return decreases over time when using a negative reward function. In Figure 3b, we can see that the average return does increase when using the inverse of the squared distance as reward function. For this reason, further testing is done using this inverse function.
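Written out directly, the two candidate reward functions from the formulas above look as follows; the small epsilon guard against division by zero in the second function is an addition for the sketch, not part of the definition.

```python
# The two candidate reward functions defined above: r1 is the negative of the
# distance term, r2 its inverse. The eps guard in r2 is an addition for this
# sketch, to avoid division by zero when the agent sits exactly on the goal.
import math

def r1(pos, goal):
    return -math.hypot(pos[0] - goal[0], pos[1] - goal[1])

def r2(pos, goal, eps=1e-8):
    return 1.0 / (math.hypot(pos[0] - goal[0], pos[1] - goal[1]) + eps)

# r1 changes by roughly the same amount for every step towards the goal,
# while r2 only grows sharply once the agent is already close to the goal.
print(r1((10, 10), (50, 50)), r1((11, 10), (50, 50)))   # approx. -56.57, -55.87
print(r2((10, 10), (50, 50)), r2((49, 50), (50, 50)))   # approx. 0.0177, 1.0
```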

Figure 3: Average return and standard deviation of 5 runs with 1M training steps using (a) the negative squared distance and (b) the inverse squared distance as reward function


5.2 Color Gradient Images

Figure 4: Average return and standard deviation of 5 runs using the inverse squared distance as reward function over (a) 1M training steps with a maximum of 1000 steps per episode, (b) 1M training steps with a maximum of 10000 steps per episode and (c) 2M training steps with a maximum of 10000 steps per episode

The results of training a model with the inverse squared distance as reward function are shown in Figure 4. Figure 4a and Figure 4b show the results of training with 1M training steps and a threshold of respectively 1000 and 10000 steps per episode. For both thresholds, training shows a clear increase in average return per episode, with no clear difference between training with a threshold of 1000 steps and training with a threshold of 10000 steps. In Figure 4c the results of training with 2M training steps can be found, which showed little additional increase in average return per episode compared to training with 1M training steps. The performance of the trained models is measured according to the optimality of the found path to the goal. This optimality is calculated as the ratio between the length of the found path and the Manhattan distance (i.e., the sum of horizontal and vertical steps) from the starting point to the goal. As shown in Table 2, on average a model trained with 2M training steps finds a more optimal path to the goal than models trained with 1M training steps, but since training time also doubles, it is a trade-off between training time and optimality of the found path.

Model settings                                         Ratio used path/optimal path
Training steps: 1M, threshold episode length: 1000     3.68
Training steps: 1M, threshold episode length: 10000    3.63
Training steps: 2M, threshold episode length: 10000    2.37

Table 2: Average optimality of used paths over 100 test runs

5.3 2D Lung Nodule Slices

For the 2D slices with lung nodules, local features that can point the agent towards the goal are not expected to be as easily found as for the color gradient images. Therefore, it can be expected that more training steps are needed to find a path to the goal. Apart from this, the images are also bigger (512x512 pixels compared to 100x100 pixels), which means that a starting point can lie further from the goal than for the color gradient images. Even with these complicating factors, it was expected that some increase in return per episode could be found.

As shown in Figure 5, training a model with the same settings as used for training on the color gradient images (1M training steps and an episode threshold length of 1000 steps) did not show any increase in return.

5.3.1 Increased Training Time

Since it was expected that training on the LUNA images would take more time, it could be the case that the return does increase with more training steps. Therefore, testing was also done using 10M training steps. To save time, this was done for only one run, assuming that one run with 10 times the number of steps needed for the color gradient images would be enough to indicate whether a model can be trained successfully on the LUNA images with the default parameters.

Figure 5: Average return and standard deviation of 5 runs with 1M training steps on the LUNA 2D slices


As Figure 6 shows, no increase in return was found after 10M training steps. As mentioned before, the LUNA images are bigger than the color gradient images, which might influence the time it takes to train a model, but increasing the training time did not seem to help in increasing the return. Apart from this, as we can see in the example slice in Figure 2, nodules are lighter spots in the lungs. The lungs themselves are darker areas, but the region around the lungs is a light area as well. Visualizing the performance of a model trained with 10M steps shows that the agent easily gets stuck in one of the lighter areas without even trying to move to a darker area first. This led to the hypothesis that it is difficult for an agent to find a nodule when the starting point lies outside of the lungs.

Figure 6: Return of 1 run with 10M training steps on the LUNA 2D slices

5.3.2 Closer Starting Points

To test the hypothesis that finding a nodule is more difficult when the starting point lies outside of the lungs, a model was trained with starting points within a rectangle of 100x100 pixels around the goal. Since all nodules lie inside the lungs, a closer starting point increases the chance of this starting point being located in the lungs as well and might therefore yield better results. A side benefit of this is that the distance to the goal becomes smaller, which possibly also reduces training time.


Figure 7: Average return of 5 runs with 1M training steps starting within a close range from the goal

The results of training with starting points within a closer range of the goal are shown in Figure 7. Compared to the average return for the previous model as shown in Figure 5, the overall return obtained by this model is higher, but it still does not show any increase during training. The overall higher return can easily be explained by the fact that the starting point is located closer to the goal, so the reward of this starting state will generally be higher than the reward from the previously used starting states. The lack of increase in return during training indicates that even from a close distance, the lung nodules are difficult to locate. This conclusion was supported when testing the model on the test set: the goal was found in only 2 out of 100 test runs.

5.3.3 Training on a Single Image

Because of the lack of success of the previously mentioned approaches, a more basic test was done: to determine whether it is possible at all to fit a model with the default parameters shown in Table 1, such a model was trained and tested on the same lung nodule slice. The expectation is that this would cause the model to overfit on this image and therefore result in a model that is able to find the goal when testing on the same image. An absence of any increase in return when training on only one image can indicate bad choices for the hyperparameters of the algorithm. It is known that reinforcement learning algorithms are very sensitive to variations in hyperparameters [Henderson et al., 2018], so it is very likely that tweaking these parameters can yield better results.

To test this, training and testing was done on one single image chosen randomly from the LUNA 2D slices. As shown in Figure 8a, the results are almost the same as for training on the whole training set, and no increase in return was seen here either. As mentioned above, this could indicate that sub-optimal values were chosen for the hyperparameters. By making a visualization of a test run on the same image as the model was trained on, it is possible to analyze a bit further what might go wrong. This visualization showed that most of the time, the agent keeps searching in a small area of the image without moving closer to the goal, which seems to indicate a lack of exploration. The degree of exploration is controlled by the entropy coefficient (ent_coef in Table 1) parameter of the algorithm [Mnih et al., 2016], therefore some tests were done with different values for this parameter. The default value of the OpenAI implementation of A2C is 0.01 and higher values result in more exploration. Therefore, models were tested with values of 0.01, 0.05 and 0.1 (Figure 8).

5.4 Color Gradient Images

Figure 8: Average return and standard deviation of 5 runs with different entropy coefficients

As shown in Figure 8, changing the value of the entropy coefficient did not result in any increase in return during training. It could be the case that changing other parameters as well can yield better results, but so far the analysis of the test runs has not shown any indication of which of those parameters could be tweaked for better performance. Altogether, training a model to find a path through a simple color gradient image has shown to be possible within reasonable time. However, training a model to detect lung nodules in 2D CT slices with a similar approach has so far not been possible. Further testing should be done to determine whether this indicates the absence of local features in the image that can be used for this task, or whether other explanations can be found for these results. Some thoughts on possible explanations and suggestions for further research will be discussed in the next section.


6 Discussion

The goal of this thesis was to determine whether reinforcement learning could be used to detect lung nodules in CT scans in a faster and possibly more accurate way than deep supervised learning methods. This hypothesis was tested with a two-step approach: first, the possibility for an agent to find a goal in a simple color gradient image was investigated. It has been shown that these images do contain features that can point the agent to the goal and that it is possible to find this goal with a relatively short training time. Second, a model was trained to let an agent find a lung nodule in a CT scan in a similar manner as the model for the color gradient images was trained. So far, training this model has not resulted in a policy that enabled the agent to find the nodule in these images. The lack of increase in return during training could indicate the absence of local features that point the agent to the goal, which would mean that it is not possible to address lung nodule detection as a navigation problem that can be solved using reinforcement learning. There could, however, be several other reasons that explain this lack of improvement, which will be discussed in the following sections.

6.1 Tweaking of Hyperparameters

As stated before, deep reinforcement learning algorithms are very sensitive to the effects of variations in hyperparameters. Different values can have significantly different effects across algorithms and environments [Henderson et al., 2018]. This means that using an algorithm (in this case A2C) on different environments with the same hyperparameters can yield different results. It could therefore be possible that with different values for (some of) the hyperparameters of A2C, a model could be trained more easily and nodule detection would be possible. The effect of a different entropy coefficient was already tested in this thesis, but further research should be done to test the effects of variations in the other parameters.

6.2 Size of Observations

Apart from the potential of further tweaking of hyperparameters, it could be possible that there are features the agent can use to find the nodule, but that this agent simply cannot use enough of these features. For example, when the agent can see that it is located on the edge of the lung, it could learn that it should move inward. It might be the case that because this agent can only see one layer of pixels around itself, features like edges or other important location marks get lost. It could be possible to enlarge the observation space to prevent this loss of information. It is important to keep in mind that with this expansion, training time will increase as well.
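A hedged sketch of what an enlarged observation window could look like is given below: a square patch of pixel intensities around the agent, padded at the image borders so that edge structure (such as the lung boundary) stays visible. The patch radius is a tunable assumption, not a value used in the thesis.

```python
# Sketch of an enlarged observation window: a (2*radius+1)-sided patch of pixel
# intensities around the agent, padded at image borders. Radius is an assumption.
import numpy as np

def observation_patch(image, pos, radius=5):
    x, y = pos
    padded = np.pad(image, radius, mode="edge")
    return padded[y:y + 2 * radius + 1, x:x + 2 * radius + 1]

ct_slice = np.random.rand(512, 512)                             # placeholder for a 2D CT slice
print(observation_patch(ct_slice, (40, 300), radius=1).shape)   # (3, 3): one layer of pixels
print(observation_patch(ct_slice, (40, 300), radius=10).shape)  # (21, 21): a larger window
```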

6.3 Variations in Nodule Size

The LUNA16 dataset contains CT scans with nodules of different sizes. In testing on the 2D slices, these differences have not been accounted for, although they might have had some influence on the detection of these nodules. The goal for the agent has now been defined as one location in the center of the nodule, even though a region (the range of the size of the nodule) around this center also consists of lighter pixels that the agent could (correctly) detect as being the nodule. A major issue here is that the nodules are not perfectly round shapes. Therefore, defining the nodule as a round area based on the given diameter will not always match the actual pixel values in the image. Further research should be done to find a suitable solution for this issue.

7 Conclusion

As stated in the introduction, the goal of this thesis was to answer two questions, the first of which was

Is it possible for an agent to find a path to a goal through a simple 2D image?

The experiments on color gradient images have shown that it is possible to train a model where an agent finds a goal in these images, thereby showing that when local features can be used to find such a path, an agent is able to find it within reasonable time.

The second question to be investigated was

Do lung nodule scans contain features that enable an agent to find a path through the scan to this lung nodule?

From the approaches presented in this thesis, this question cannot yet be answered with any certainty. The results of the experiments that have been done indicate a lack of the local features that are needed to enable an agent to find such a path, but there could also be other causes that have led to these results.

7.1 Future Work

As suggested above, future research should be done to answer the second research question of this thesis with more certainty. As mentioned in the discussion, testing with different parameters could yield different (and possibly more successful) results. Also, further research could determine whether a change in the size of the observation space has any influence on the performance of lung nodule detection models.


References

[Alansary et al., 2018] Alansary, A., Oktay, O., Li, Y., Le Folgoc, L., Hou, B., Vaillant, G., Glocker, B., Kainz, B., and Rueckert, D. (2018). Evaluating reinforcement learning agents for anatomical landmark detection.

[Bellman, 1957] Bellman, R. E. (1957). Dynamic Programming. Courier Dover Publications.

[Bergtholdt et al., 2016] Bergtholdt, M., Wiemker, R., and Klinder, T. (2016). Pulmonary nodule detection using a cascaded SVM classifier. In Medical Imaging 2016: Computer-Aided Diagnosis, volume 9785, page 978513. International Society for Optics and Photonics.

[Bishop, 2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.

[Bray et al., 2018] Bray, F., Ferlay, J., Soerjomataram, I., Siegel, R. L., Torre, L. A., and Jemal, A. (2018). Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer Journal for Clinicians, 68(6):394–424.

[Ghesu et al., 2017] Ghesu, F. C., Georgescu, B., Zheng, Y., Grbic, S., Maier, A., Hornegger, J., and Comaniciu, D. (2017). Multi-scale deep reinforcement learning for real-time 3D-landmark detection in CT scans. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[Henderson et al., 2018] Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2018). Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence.

[Lopez Torres et al., 2015] Lopez Torres, E., Fiorina, E., Pennazio, F., Peroni, C., Saletta, M., Camarlinghi, N., Fantacci, M., and Cerello, P. (2015). Large scale validation of the M5L lung CAD on heterogeneous CT datasets. Medical Physics, 42(4):1477–1489.

[Mnih et al., 2016] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

[Mnih et al., 2013] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

[Setio et al., 2017] Setio, A. A. A., Traverso, A., De Bel, T., Berens, M. S., van den Bogaard, C., Cerello, P., Chen, H., Dou, Q., Fantacci, M. E., Geurts, B., et al. (2017). Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Medical Image Analysis, 42:1–13.

[Sutton and Barto, 2018] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction.

[Sutton et al., 2000] Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063.

[Team, 2011] Team, N. L. S. T. R. (2011). Reduced lung-cancer mortality with low-dose computed tomographic screening. New England Journal of Medicine, 365(5):395–409.


Appendix A

The OpenAI implementation of A2C (together with other reinforcement learning algorithms) can be found at:

https://github.com/openai/baselines

The Gym toolkit can be found at:

https://github.com/openai/gym

The full adaptation of A2C used for the experiments in this thesis can be found at:
