
Explainable Reinforcement Learning: Visual Policy Rationalizations Using Grad-CAM

Laurens Weitkamp
11011629

Bachelor thesis, Credits: 18 EC
Bachelor Opleiding Kunstmatige Intelligentie, University of Amsterdam
Faculty of Science, Science Park 904, 1098 XH Amsterdam

Supervisors: Elise van der Pol, Zeynep Akata
UvA-Bosch Delta Lab, University of Amsterdam
Science Park 904, 1098 XH Amsterdam


Abstract

Deep reinforcement learning has made large advances through the use of deep neural networks. But using such deep neural networks makes decision making opaque: it is difficult to find a logical connection between the model's input and output. Explainable Artificial Intelligence is an emerging field dedicated to the task of making such deep neural networks explainable, but little research has been done on Explainable Reinforcement Learning specifically. This thesis proposes a rationalization framework that combines Grad-CAM with the A3C reinforcement learning algorithm to create visual rationalizations that can aid in understanding decision making in such models. It does this by visualizing the evidence on which the agent bases its decision, thus increasing trust and understanding in how such agents make decisions. This framework has been evaluated on two specific tasks: analyzing an agent at different time steps during training and analyzing situations in which the agent fails at a task. Both tasks have been evaluated on three Atari 2600 environments provided by the OpenAI Gym toolkit, each chosen because it has a different level of difficulty or a different long-term reward dependency. This thesis emphasizes the importance of visual rationalizations as a means to increase both trust and understanding in deep reinforcement learning agents.


Acknowledgments

I would like to thank my supervisors, Elise van der Pol and Zeynep Akata, for guiding me through the process, giving me constant feedback and coming up with the project in the first place. Furthermore, I would like to thank Douwe van der Wal and Max Filtenborg for several feedback sessions. Lastly, I would like to thank the UvA Intelligent Robotics Lab for providing me with the computational power needed to train the models used in this thesis.


Contents

1 Introduction
2 Background
  2.1 Reinforcement Learning
    2.1.1 Exploration and Exploitation
    2.1.2 Q-Learning
    2.1.3 Policy Gradients
    2.1.4 Actor-Critic methods
  2.2 Deep Learning
    2.2.1 Convolutional Neural Network
    2.2.2 Long Short Term Memory
  2.3 Deep Reinforcement Learning
    2.3.1 Deep-Q Network
    2.3.2 Asynchronous Advantage Actor Critic
  2.4 Explainable Reinforcement Learning
    2.4.1 Gradient Based Class Activation Map
3 Visual Rationalization Model
4 Experiments
  4.1 Setup
  4.2 Ranking Grad-CAM Outputs
  4.3 Learning a Policy
  4.4 Agent Failures
5 Related Work
6 Conclusion And Discussion
  6.1 Discussion
  6.2 Further Research
Appendices
  A Grad-CAM Output for Randomly Initiated Agent
  B Hyper-parameters

1 Introduction

Reinforcement learning is an area of Machine Learning dealing with how agents should take actions in an environment they interact with, in such a manner as to maximize the agent's long-term reward signal. The reward itself can be anything that enforces a particular behaviour.

For instance, consider an agent that is learning how to drive a car. The agent would get a positive reward for driving safely and a negative reward for crashing into objects. Agents follow a policy and learn the value of a state, or the policy itself, by experiencing states in the environment, after which they can update the policy. This works well in low-dimensional, fully observable environments, but traditional reinforcement learning methods do not scale well to high-dimensional and noisy environments.

The field of Deep Learning is well equipped to deal with such high-dimensional and noisy data, such as images that display multiple objects. The most notable of these Deep Learning methods are Convolutional Neural Networks (CNNs), which can learn spatial information from high-dimensional input.

Through the use of Deep CNNs as function approximators, the field of deep reinforcement learning has recently made many advances on hard problems in a wide range of environments (Mnih et al., 2013, 2016; Schulman, Levine, Moritz, Jordan, & Abbeel, 2015). However, replacing the value or policy function in reinforcement learning methods with a deep CNN will make the decision making process a black-box operation.

Recent progress in providing explainability for black-box Deep CNN classifiers focuses on improving the user's trust in and understanding of the model. Because deep reinforcement learning methods often use Deep CNNs, explainability methods developed for CNNs may also be applicable in the field of deep reinforcement learning. Adding understanding of CNN decision making can come in various forms, such as textual explanations that specify what features of the class were prominent during classification (Hendricks et al., 2016). Another approach is providing visual explanations in the form of attention maps that point out class-specific regions.

This thesis will focus on using Grad-CAM, a gradient-based class activation map (Selvaraju et al., 2016). Grad-CAM creates a heat map that shows prominent regions of activation given an input image and a class. Using Grad-CAM and the A3C reinforcement learning algorithm described in (Mnih et al., 2016), the following research question will be investigated:

How can visual explainability methods aid in understanding Deep Reinforcement Learning decision making and learning?

The research question itself is divided into two sub-questions:

1. How do policies differ over the course of training?

2. What is the agent focused on before/when it fails at a task?

These questions will be investigated in three Atari 2600 environments provided by the OpenAI Gym toolkit. The games are Pong, BeamRider and Seaquest, chosen because all three differ in complexity, long-term reward behaviour and the number of active opponents per state.

This thesis is structured as follows: section 2 provides the theoretical background on reinforcement learning required to understand the visualization model. Section 3 presents the visual rationalization model and explains how it is adapted to reinforcement learning tasks. Section 4 provides the setup required for the experiments, including which reinforcement learning model works best, and presents the results of the rationalization model applied to the three Atari 2600 games on different tasks. The last two sections, 5 and 6, discuss related work and present the conclusion, discussion and further research.

The codebase required to reproduce this thesis can be found at GitHub¹ and interactive gifs showing agents playing the game plus Grad-CAM outputs can be found at my website².

¹ https://github.com/lweitkamp/rl rationalizations
² https://lweitkamp.github.io/thesis2018

2 Background

2.1 Reinforcement Learning

Reinforcement Learning aims to train agents that solve problems through experience. The agent learns to map situations to actions and to maximize a reward based on these actions (Sutton & Barto, 1998). Many recent advances in the field of reinforcement learning are in the area of model-free decision strategies, which do not explicitly model the agent's environment but let the agent learn directly from inputs from the environment. The main framework used in reinforcement learning is the Markov Decision Process (MDP), which can model probabilistic decision problems (Sutton & Barto, 1998). An MDP is defined as a 5-tuple:

• S is a finite set of states

• A is a finite set of actions, with As denoting the actions available in state s.

• P : S × A × S′ → [0, 1] is a state transition probability function.

• R : S × A × S′ → ℝ is a reward function.

• γ ∈ [0, 1] is a discount factor which determines the importance of future rewards.

In an MDP, the effects of an action taken in a state depend only on the current state. This is called the Markov Property, and can be expressed using the following equation:

$$P[R_{t+1} = r, S_{t+1} = s' \mid S_0, A_0, R_1, \ldots, S_{t-1}, A_{t-1}, R_t, S_t, A_t] = P[R_{t+1} = r, S_{t+1} = s' \mid S_t, A_t], \tag{1}$$

where t denotes the time step. The schema in figure 1 shows an agent interacting with an MDP-based environment.

Figure 1: The agent in a MDP environment (Sutton & Barto, 1998).

In many real-world problems the agent will not have knowledge of the complete environment. A Partially Observable MDP (POMDP) is a generalization of an MDP in which the agent cannot directly observe the current state. It instead maintains a probability distribution over the set of possible states based on observations and observation probabilities. Most modern-day reinforcement learning methods are based on the idea of having a partially observable state space.

2.1.1 Exploration and Exploitation

Given an environment and a policy, the agent can manoeuvre through the environment. But if the policy is non-optimal, the agent still needs to figure out which action to take in order to maximize its reward. Acting by maximizing reward is called a greedy or exploitative approach, and can work well if the agent has an optimal policy. If on the other hand the policy is non-optimal, this could lead to worsened results. Imagine that an agent driving a car gets a small positive reward at each step in which it does not hit an object. This agent might decide to simply stand still and do nothing each step (which is also called reward hacking). On the other hand, picking a uniformly distributed random action guarantees that the agent is explorative, but will not ensure a high return: the agent driving a car could go back and forth each step using a uniformly random action each time. Figuring out the correct way to balance exploration and exploitation is important for finding an optimal policy.


2.1.2 Q-Learning

Q-Learning is an Off-Policy value iteration algorithm (Watkins & Dayan, 1992), defined as:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\big]. \tag{2}$$

The action-value function Q directly approximates q∗, the optimal action-value function, independent of the policy being followed. All that is required for convergence is that all state-action pairs continue to be updated. Q-Learning is a tabular method, meaning that it holds a table of all state-action pairs and updates this table every iteration. The problem with tabular methods like Q-Learning is that they do not scale well to the high-dimensional data of more realistic environments. A real-world environment can have a massive number of state-action pairs, which breaks the convergence property of Q-Learning as it becomes impossible to visit every state-action pair multiple times. Another problem is that keeping track of so many state-action pairs in a table is intractable for computers with finite storage.
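As a concrete illustration, the following minimal sketch applies the update in equation 2 to a dictionary-based Q-table; the state representation, step size and discount factor are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

num_actions = 4
# Q-table: maps a state to an array of action values, all zeros initially.
Q = defaultdict(lambda: np.zeros(num_actions))

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning update following equation 2."""
    td_target = r + gamma * Q[s_next].max()    # R_{t+1} + gamma * max_a Q(S_{t+1}, a)
    Q[s][a] += alpha * (td_target - Q[s][a])   # move Q(S_t, A_t) toward the TD target
```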

2.1.3 Policy Gradients

Policy-based model-free methods parameterize the policy π(a|s; θ) and typically update the parameters θ by performing gradient ascent on a performance measure J with respect to the policy's parameters:

$$\theta_{t+1} = \theta_t + \alpha \nabla J(\theta_t) \tag{3}$$

Basing the performance measure on the value of the starting state v_{π_θ}(s_0), the policy gradient theorem establishes that its gradient is proportional to

$$\nabla J(\theta_t) \propto \mathbb{E}_\pi\Big[\sum_a \nabla_\theta \pi(a_t \mid s_t, \theta)\, q_\pi(s_t, a_t)\Big] \propto \mathbb{E}_\pi\Big[\sum_a \nabla_\theta \pi(a_t \mid s_t, \theta)\, R_t\Big] \tag{4}$$

This final expression can be sampled by interacting with the environment, and is used in the following update procedure:

$$\theta_{t+1} = \theta_t + \alpha \nabla_{\theta_t} \log \pi(a_t \mid s_t; \theta_t)\, R_t \tag{5}$$

Update procedures like equation 5 are used in the REINFORCE line of algorithms which are widely used in policy gradient methods (Williams, 1992).
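As a rough sketch of how an update like equation 5 translates into code, the snippet below performs one REINFORCE gradient step in PyTorch (the framework used for the thesis codebase); the optimizer and the way log-probabilities and returns are collected are assumed placeholders.

```python
import torch

def reinforce_update(optimizer, log_probs, returns):
    """One REINFORCE gradient step (equation 5).

    log_probs: list of log pi(a_t | s_t; theta) tensors collected during an episode.
    returns:   list of (discounted) returns R_t for the same time steps.
    """
    returns = torch.tensor(returns, dtype=torch.float32)
    # Maximizing sum_t log pi(a_t|s_t) * R_t is minimizing its negative.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```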

The Advantage Function. In order to reduce the variance of the policy estimate, a learned function of the state b_t(s_t), known as a baseline function, can be subtracted from the return (Williams, 1992). Subtracting this baseline from the return in equation 5 results in:

$$\theta_{t+1} = \theta_t + \nabla_{\theta_t} \log \pi(a_t \mid s_t; \theta_t)\,(R_t - b_t(s_t)). \tag{6}$$

A commonly used baseline function is V(s_t; θ_t), the value function, which indicates the value of being in state s_t. This function can be seen as the long-term reward for being in state s_t, in contrast to the immediate reward for being in state s_t given by the reward function R. The resulting quantity A_t = R_t − V(s_t; θ_t) is called the Advantage Function, because it is an estimate of the advantage of choosing action a_t in state s_t. Using the Advantage Function in equation 6, the policy gradient estimate is transformed into

$$\theta_{t+1} = \theta_t + \nabla_{\theta_t} \log \pi(a_t \mid s_t; \theta_t)\, A_t. \tag{7}$$


2.1.4 Actor-Critic methods

Actor-Critic methods implement both value estimation and policy estimation, through the value function V(s_t; θ) and the policy π(a_t|s_t; θ) respectively. The main idea in actor-critic methods is that the actor interacts with the environment given a policy π while the critic assigns a value to each state the actor is in using the value function V; a schema is shown in figure 2.

Figure 2: Actor-Critic architecture.

In the case of policy gradient Actor-Critic methods, the learning is done in two steps, sketched in code after the list:

• the actor updates policy π (possibly using the advantage function) in a direction suggested by the critic.

• the critic updates the current value function V .
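A minimal one-step advantage actor-critic update combining both steps could look as follows; the shared optimizer, the value-loss weight and the one-step bootstrapped target are assumptions, and the actual A3C implementation additionally runs such updates asynchronously across workers.

```python
import torch
import torch.nn.functional as F

def actor_critic_update(optimizer, log_prob, value, reward, next_value, gamma=0.99):
    """One-step advantage actor-critic update for a single transition."""
    target = reward + gamma * next_value.detach()   # bootstrapped estimate of the return
    advantage = target - value                      # A_t = R_t - V(s_t)
    actor_loss = -log_prob * advantage.detach()     # actor: policy step in the direction suggested by the critic
    critic_loss = F.mse_loss(value, target)         # critic: move V(s_t) toward the target
    loss = actor_loss + 0.5 * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```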

2.2 Deep Learning

Deep Learning methods, on the other hand, are made to handle high-dimensional data. The first breakthrough came with AlexNet (Krizhevsky, Sutskever, & Hinton, 2012), a deep neural network consisting of several convolutional layers followed by non-linear activations. This method of stacking convolutional layers in conjunction with other types of layers (max-pooling, non-linear activations) is very successful in image classification and reinforcement learning tasks. Deep Learning has also proven successful on sequentially organized data such as language, sound or video through the use of recurrent neural networks such as Long Short Term Memory (Hochreiter & Schmidhuber, 1997).

2.2.1 Convolutional Neural Network

The convolutional operation itself (in the case of images) is a two-dimensional filter that passes in a sliding-window fashion over a matrix. At each position the filter multiplies its weights element-wise with the underlying window and sums the result, producing a single new value for that position; in total this amounts to a new matrix. A convolutional layer is a series of such filters, each producing a new channel based on the original input matrix. One of the more interesting parameters of a convolutional operation is the stride, which indicates the number of values the window skips. A stride of one, for example, returns a new matrix of the same size, but a stride of two effectively downsamples the matrix by a factor of two. The stride will be discussed further in section 4, where the input to the convolutional network is a state in a reinforcement learning environment.
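The effect of the stride on the output size can be seen in the small PyTorch sketch below; the padding of one is an assumption made so that a stride of one preserves the spatial size, as described above.

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 1, 80, 80)                                # one 80x80 grayscale frame
same = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)  # stride 1: spatial size preserved
half = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)  # stride 2: spatial size halved

print(same(x).shape)   # torch.Size([1, 32, 80, 80])
print(half(x).shape)   # torch.Size([1, 32, 40, 40])
```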

An interesting insight into CNNs is that the first layers act as primitive edge detectors, and subsequent layers slowly combine these primitive edges into abstract representations of the final classes. This effect will be noticeable in the case of Pong later in this thesis, as it is a game that consists only of such primitive edges.

(9)

2.2.2 Long Short Term Memory

In classification problems, data is often presumed to be Independent and Identically Distributed (IID). IID means that any data point does not give the model more information about a different data point (they are independent of each other). The presumption that data is IID does not hold in all fields, for example in language processing, where words can depend on previous words. Dependencies can also arise in reinforcement learning when an agent is dealing with enemies, for example when learning their trajectories or velocities. Even if the environment satisfies the Markov property, there can still be long-term dependencies to learn. One way to help with this is to increase the input size, giving the agent n states instead of one. Another way is to use recurrent neural networks such as Long Short Term Memory (LSTM).

Figure 3: Abstract representation of an LSTM module. Each cell outputs the hidden value h_i and passes both h_i and C_i to the next cell.

Long Short Term Memory (LSTM) is designed to deal with problems of long-term dependency. Each cell in an LSTM uses the inputs x_i and C_{i−1} to calculate the cell state value C_i and the hidden state value h_i (see figure 3). The cell state value C_i is created using only linear operations (although the weights and biases involved can themselves be outputs of non-linear functions) and represents the long-term sequential information. The hidden state value h_i is calculated through non-linear functions and is the direct output of the cell to the next layer (Olah, 2015). In the case of reinforcement learning, the cell state could encode the velocity or direction of an enemy and the hidden state could encode how to react to this enemy.
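In code, an LSTM cell is called with the current input and the previous hidden and cell states and returns the new pair, as in the minimal PyTorch sketch below; the sizes are illustrative and not taken from the thesis.

```python
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=3200, hidden_size=256)   # illustrative sizes
x = torch.zeros(1, 3200)                               # flattened CNN features for one frame
h, c = torch.zeros(1, 256), torch.zeros(1, 256)        # hidden and cell state, reset per episode

for _ in range(4):           # feed several consecutive frames
    h, c = cell(x, (h, c))   # h: per-step output, c: long-term memory carried forward
```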

2.3 Deep Reinforcement Learning

Using both the spatial information provided by a CNN and the memory provided by LSTM cells, the field of deep reinforcement learning has become quite successful at dealing with high-dimensional and noisy data. This is in stark contrast with more traditional methods in reinforcement learning. Several state-of-the-art (SoTA) results have been achieved through deep reinforcement learning in complex environments such as Atari 2600 games and physics simulators such as the MuJoCo locomotion environment (Mnih et al., 2015; Todorov, Erez, & Tassa, 2012).

2.3.1 Deep-Q Network

The Deep-Q Network (DQN) was initially proposed in (Mnih et al., 2013), where Q-Learning uses a CNN as a function approximator for Q-values. DQN set new SoTA records in various settings and has been extended by the authors in (Mnih et al., 2015) to reach human-level results on a range of Atari 2600 games. To balance exploration and exploitation, the DQN algorithm uses an ε-greedy policy that acts greedily with probability 1 − ε and performs a uniform random action with probability ε. ε decays over time to a limit of 0.1, ensuring early exploration and late exploitation during training.
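A sketch of such an annealed ε-greedy policy is given below; the linear schedule and the step counts are illustrative assumptions rather than the exact DQN settings.

```python
import numpy as np

def epsilon_greedy(q_values, step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Pick an action epsilon-greedily, annealing epsilon from eps_start to eps_end."""
    eps = max(eps_end, eps_start - (eps_start - eps_end) * step / anneal_steps)
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: greedy action
```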

2.3.2 Asynchronous Advantage Actor Critic

Unlike the Advantage Actor-Critic method, Asynchronous Advantage Actor Critic (A3C) utilizes multiple actors to learn more efficiently. Each actor has its own environment and set of parameters and updates a global model. This algorithm was first discussed in (Mnih et al., 2016), where it was shown to be more time-efficient than the DQN algorithm. This efficiency is not only due to the parallelization of agents, but also due to an exploration factor: with multiple agents using different exploration tactics, the training experience becomes more diverse. The agents could, for instance, all have a different ratio of exploration to exploitation, which results in a wider set of states being visited. This also removes the need for the experience replay used in the DQN model (Juliani, 2016).

However, all these deep reinforcement learning methods have essentially become black-box decision-making processes precisely because they use CNNs. Agents can fail in non-obvious ways, and deep models can fail to generalize well without the user knowing why. To build understanding in deep reinforcement learning, this thesis applies methods from Explainable Artificial Intelligence to complex reinforcement learning tasks and examines whether they offer guidance or rational explanations in failure cases and during training.

2.4 Explainable Reinforcement Learning

Explainable Artificial Intelligence (XAI) is an emerging field dedicated to developing methods that make deep learning more understandable, and thus give the user a better understanding of the model (Hendricks et al., 2016; Park et al., 2018; Zintgraf, Cohen, Adel, & Welling, 2017). There are many forms such explanations can take, such as visual or textual, but due to the lack of textual data in the Atari 2600 reinforcement learning environments, only a visual approach will be discussed in this thesis.

2.4.1 Gradient Based Class Activation Map

A visual approach to building understanding of Deep Learning methods that use a convolutional neural network is the Class Activation Map (CAM) (Zhou, Khosla, Lapedriza, Oliva, & Torralba, 2015). The CAM method requires the chosen network to have a Global Average Pooling layer, and uses this layer's activations, given a predicted class, to create a class-discriminative heat map that highlights class-specific regions. Because the network architecture needs to be altered for CAM to work, it is not an ideal method for creating class-discriminative heat maps. The method has been extended to a more general framework called Grad-CAM (Selvaraju et al., 2016), which is largely agnostic to the architecture in use, as long as it is a convolutional neural network. The added benefit of Grad-CAM is that it weights class activations with gradients computed during the backward pass. Grad-CAM will be discussed further in the next section, where it is also adapted to work with deep reinforcement learning models.

3 Visual Rationalization Model

Because deep reinforcement learning methods use neural networks to estimate either the value function or the policy of an agent, the reasoning as to why a specific action (e.g. go left or go right) is taken becomes opaque. This opaqueness comes from the fact that there is no simple link between the weights in the neural network and the function it is trying to approximate: it is not easy to inspect the weights and draw intuitive conclusions about the network.

In addition to this problem, training an agent is nontrivial and highly susceptible to small changes in hyperparameters or differences in codebase (Islam, Henderson, Gomrokchi, & Precup, 2017). Even the activation functions being used can have a significant impact. For instance, the ReLU activation function commonly used in neural networks can suffer from the 'dying ReLU' problem, causing worsened learning results (Xu, Wang, Chen, & Li, 2015).

Referring back to the self-driving car metaphor, knowing why the agent makes a decision can be very important for building understanding of the agent. On top of this, justifying an agent's predictions in a way that is consistent with visual elements is likely to increase the user's understanding of the agent's decision making (Teach & Shortliffe, 1981). Being able to explain situations in which an agent fails in non-obvious ways could also provide intuitions as to why it fails.

One such explainability method that calculates class-discriminative features is the Gradient-weighted Class Activation Map (Grad-CAM), which generates visual explanations for CNN-based classifiers. Because this method applies to a wide variety of CNNs, it can also be used for the CNNs commonly found in deep reinforcement learning methods. In contrast to the Prediction Difference Analysis method, which can also be applied to a wide variety of CNNs, the Grad-CAM method is computationally feasible⁴ for a significant number of inputs.

⁴ The Prediction Difference Analysis takes 70 minutes to compute one input in the VGG model; in contrast, the Grad-CAM method is much faster.


Figure 4: The original Grad-CAM overview modified for RL-A3C-specific tasks. The model takes as input a state (in this case from the game BeamRider), calculates the state-action policy (in this case with LEFTFIRE having the highest value) and then produces an activation map based on LEFTFIRE that is overlaid on the original state.

Visual rationalization model. Grad-CAM computes a class-discriminative localization map L^c_GradCAM ∈ R^{u×v} using the gradient of any target class. These gradients are global-average-pooled to obtain the neuron importance weights α^c_k for class c and activation map k in the CNN⁵:

$$\alpha^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}. \tag{8}$$

Adapting this method to the A3C actor output, let h_a be the score for action a before the softmax; α^a_k now represents the importance weight for action a in activation map k:

$$\alpha^a_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial h_a}{\partial A^k_{ij}}, \tag{9}$$

with |h| = |A|, the total number of actions the agent can take. The gradients are then weighted by the forward-pass activations A^k and passed through an ELU activation⁶ to produce a weighted class activation map:

$$L^a_{GradCAM} = \mathrm{ELU}\Big(\sum_{k=1}^{K} \alpha^a_k A^k\Big). \tag{10}$$

This activation map has values in the range [0, 1], with higher weights corresponding to a stronger response for the output action. The same procedure can be applied to the critic output. The resulting activation map can be upsampled to the size of the input state and then overlaid on top of this state to produce a high-quality heat map that indicates the regions that motivate the agent to take action a. A visual representation of this process is depicted in figure 4.

⁵ K is usually chosen to be the last convolutional layer in the CNN.

⁶ The Exponential Linear Unit has been chosen in favor of the ReLU used in the original Grad-CAM paper due to the dying ReLU problem mentioned earlier.
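A sketch of how equations 8-10 can be computed for the actor output in PyTorch is given below; the function and variable names are illustrative, the activations of the chosen convolutional layer are assumed to have been kept from the forward pass, and the final scaling to [0, 1] and upsampling to 80x80 follow the description above.

```python
import torch
import torch.nn.functional as F

def grad_cam_for_action(activations, action_score):
    """Grad-CAM map for one pre-softmax actor score h_a (equations 8-10).

    activations:  conv feature maps A^k of shape (1, K, H, W), kept from the forward pass.
    action_score: the scalar actor output h_a for the chosen action.
    """
    grads, = torch.autograd.grad(action_score, activations, retain_graph=True)
    alpha = grads.mean(dim=(2, 3), keepdim=True)           # global-average-pool: alpha_k^a (eq. 9)
    cam = F.elu((alpha * activations).sum(dim=1))          # weighted sum of maps + ELU (eq. 10)
    cam = cam - cam.min()
    cam = cam / (cam.max() + 1e-8)                         # scale to [0, 1]
    cam = F.interpolate(cam.unsqueeze(1), size=(80, 80),   # upsample (e.g. 10x10) to the input size
                        mode='bilinear', align_corners=False)
    return cam.squeeze()
```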

4 Experiments

4.1 Setup

OpenAI Gym. The OpenAI Gym is a toolkit for developing reinforcement learning algorithms. The toolkit has a wide variety of environments that all share a common interface, enabling the writing of general algorithms. To use the Atari 2600 video games, the Arcade Learning Environment (ALE) needs to be compiled (Bellemare, Naddaf, Veness, & Bowling, 2012). Each game in the ALE comes in several versions: NoFrameskip (does not skip frames) and Deterministic (skips 3-4 frames depending on the game; this is what DeepMind uses), each available as v0 and v4 with minor environmental adjustments.

Atari 2600 Actions. The original Atari 2600 had a joystick controller with a single red 'fire' button attached. The joystick itself has eight directions: up, up-right, right, down-right, down, down-left, left and up-left. The button can be pressed by itself but also in combination with the joystick, which brings the total number of actions to 17 (up-left-fire, down-right-fire, etc.). Note that the OpenAI environment adds a 'NOOP' action that essentially skips the frame, and each game has a different selection of these actions to choose from, depending on the environment.

Table 1: Action space of the three Atari 2600 games Pong, BeamRider and Seaquest.

Pong (6 actions): NOOP, FIRE, LEFT, RIGHT, LEFTFIRE, RIGHTFIRE
BeamRider (9 actions): NOOP, FIRE, UP, LEFT, RIGHT, LEFTFIRE, RIGHTFIRE, UPLEFT, UPRIGHT
Seaquest (18 actions): NOOP, FIRE, UP, LEFT, RIGHT, DOWN, LEFTFIRE, RIGHTFIRE, UPLEFT, UPRIGHT, UPFIRE, DOWNLEFT, DOWNRIGHT, DOWNFIRE, UPLEFTFIRE, UPRIGHTFIRE, DOWNLEFTFIRE, DOWNRIGHTFIRE

Pong. Pong represents a table tennis game in 2D. The agent is represented on the right and the opponent on the left. PongDeterministic-v4 has 6 actions to choose from, which are displayed in table 1. Note that FIRE and NOOP are identical in Pong, as are RIGHT/RIGHTFIRE and LEFT/LEFTFIRE. RIGHT represents going up in the game, and LEFT represents going down. The goal in Pong for either side is to score 21 points, after which the game ends.

BeamRider. BeamRider is a typical shooter game taking place in outer space. Each level has 15 enemies combined with floating debris, an increasing variety of enemies and a boss that appears after killing all 15 enemies. The boss is optional and disappears after a few seconds of hovering at the top of the screen, and it is only killable with a torpedo, of which the agent has three each level. BeamRider has an extended set of actions compared to Pong because the torpedo is fired with the UP motion on the joystick, adding the actions UP, UPLEFT and UPRIGHT for a total of 9 actions, as seen in table 1. The game has a finite number of levels, and the agent can be hit three times before the game finishes.

Seaquest. In Seaquest, the agent controls a submarine with torpedoes that can be fired at enemy sharks and submarines for points. The submarine can also pick up divers (8 maximum) and bring them above the water for additional points. The submarine has a finite amount of oxygen and must therefore surface for air every once in a while. The oxygen mechanic adds a long-term reward dependency to the game, which makes it interesting for reinforcement learning agents. Seaquest uses the full set of joystick and button combinations, adding up to the 18 actions displayed in table 1.


Figure 5: The pre-processing of an Atari frame: (a) the original frame, (b) the pre-processed frame.

Pre-processing. The default Atari game frame returned by the Gym environment is a 3x210x160 pixel window using RGB color channels. Each frame is cropped to an 80x80 window. The original frame, filled with integers in the range [0, 255], is then normalized to the range [0, 1]. After this, the three color channels are averaged, effectively turning the image into grayscale. The result of preprocessing a frame is pictured in figure 5 for the game Seaquest.
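A minimal sketch of this pre-processing step is shown below; the exact crop region is an assumption, since the text only states that the frame ends up as an 80x80 grayscale image with values in [0, 1].

```python
import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Turn a raw 210x160x3 Atari frame into an 80x80 grayscale array in [0, 1]."""
    cropped = frame[35:195, :]                            # keep the playing field (assumed crop)
    downsampled = cropped[::2, ::2, :]                    # 160x160x3 -> 80x80x3
    normalized = downsampled.astype(np.float32) / 255.0   # scale [0, 255] -> [0, 1]
    return normalized.mean(axis=2)                        # average the RGB channels -> 80x80
```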

Model Selection. Both algorithms (DQN and A3C) have been trained on two network architectures using the game Pong from the Arcade Learning Environment in the OpenAI Gym toolkit (Brockman et al., 2016). Neural Network 1 is the network proposed by DeepMind in (Mnih et al., 2013) and used for training both the Deep-Q and A3C models in their respective papers. Neural Network 2 is an implementation optimized for quicker convergence proposed in (Kostrikov, 2018), through the use of the ELU activation and weight normalization (Salimans & Kingma, 2016). The specifics of each network can be found in table 2.

Table 2: Difference in neural network architecture for network 1 and network 2.

                   Non-linear Activations   Input Frames   LSTM Cell   Convolutional Layers
Neural Network 1   ReLU                     4              No          5
Neural Network 2   ELU                      1              Yes         4

For each architecture, five agents were trained for 1,000,000 frames, after which each agent played 100 episodes. The resulting returns per episode have been averaged to obtain four mean returns, which are summarized in table 3. Because of the success of Neural Network 2 with the A3C algorithm, this model and algorithm combination is used in the rest of this thesis.

Table 3: Mean and variance for DQN and A3C with NN1 and NN2.

       NN1 mean   NN1 variance   NN2 mean   NN2 variance
DQN    20.13      6.60           -21.00     0.00
A3C    5.40       4.30           21.0       0.00

A3C. The A3C implementation used in this thesis is a modified version of the PyTorch implementation written in (Kostrikov, 2018). Most notably, the input image is pre-processed to a pixel size of 80x80 instead of 42x42 and the third convolutional layer has a stride of one instead of two. These modifications have been made to better suit the Grad-CAM output: a stride of two effectively down-samples the image by a factor of two, with the resulting Grad-CAM output of convolutional layer four being of size 5x5. Interpolating this to an 80x80 image causes it to lose spatial information and results in a blurry attention visualization. This is in contrast to the new setup, where the Grad-CAM output is of size 10x10, which contains more spatial information and thus is more accurate when interpolated. Figure 6 contrasts the 5x5 and 10x10 images. A more detailed explanation of the network can be found in figure 7.

Figure 6: Interpolation of both a 5x5 Grad-CAM output and a 10x10 Grad-CAM output to 80x80: (a) 5x5 interpolated to 80x80, (b) 10x10 interpolated to 80x80. Noticeable is the difference in size of single activations: in (a) the bottom right represents a single activation and in (b) the middle represents a single activation.

Figure 7: The convolutional neural network used by the A3C algorithm. The input images are pre-processed to 80x80 frames and pass through four convolutional layers, each having 32 filters of size 3x3. Convolutional layers one, two and four have a stride of 2, effectively down-sampling the frame as it passes through, but convolutional layer three has a stride of 1 to keep the activation map needed for Grad-CAM large enough for high-quality heat maps. Each convolutional layer is followed by an Exponential Linear Unit. After the convolutions, the frame passes through an LSTM module to retain some direction and velocity information, after which the actor and critic linear outputs are returned. The linear output size n depends on the environment's action space.
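A PyTorch sketch of the network described in figure 7 is given below; the padding and the LSTM size are assumptions not stated in the caption, and the actual implementation is the modified (Kostrikov, 2018) codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class A3CNet(nn.Module):
    """Four 3x3 conv layers (32 filters, strides 2-2-1-2), ELU activations,
    an LSTM cell and linear actor/critic heads, as in figure 7."""

    def __init__(self, num_actions: int, lstm_size: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, stride=2, padding=1)   # 80x80 -> 40x40
        self.conv2 = nn.Conv2d(32, 32, 3, stride=2, padding=1)  # 40x40 -> 20x20
        self.conv3 = nn.Conv2d(32, 32, 3, stride=1, padding=1)  # 20x20 -> 20x20 (kept large for Grad-CAM)
        self.conv4 = nn.Conv2d(32, 32, 3, stride=2, padding=1)  # 20x20 -> 10x10
        self.lstm = nn.LSTMCell(32 * 10 * 10, lstm_size)
        self.actor = nn.Linear(lstm_size, num_actions)          # policy logits, size n
        self.critic = nn.Linear(lstm_size, 1)                   # state value

    def forward(self, x, hidden):
        x = F.elu(self.conv1(x))
        x = F.elu(self.conv2(x))
        x = F.elu(self.conv3(x))
        x = F.elu(self.conv4(x))               # 10x10 activation maps used by Grad-CAM
        h, c = self.lstm(x.flatten(1), hidden)
        return self.actor(h), self.critic(h), (h, c)
```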

Trained Models. For the purpose of this thesis, two models have been trained. The first model, which will be called the Full Agent, has been trained for (at least) 40 million frames. The second agent, which will be called the Half Agent, has been trained for 20 million frames, except in the case of Pong where it has been trained for 500,000 frames. All agents have been trained using the same hyper-parameters, described in appendix B. The mean and variance reported in table 4 have been calculated by playing 100 games with a greedy policy. These results are hard to contrast fairly with results found in other literature, because most authors only report a training time in hours, which does not account for the number of parallel threads in use.

Table 4: Trained models compared to results found in literature.

            Full Agent Mean   Full Agent Variance   Half Agent Mean   Half Agent Variance
Pong        21.00             0.00                  14.99             0.09
BeamRider   4659.04           1932.58               1597.40           1202.00
Seaquest    1749.00           11.44                 N/A               N/A

4.2 Ranking Grad-CAM Outputs

The Grad-CAM model outputs a class activation map with values in the range [0.0, 1.0], but the resulting maps are sparse, with only a few values towards 1.0. These types of activations are usually indicative of a moving object the agent is focusing on (what objects have its attention). To create a clear distinction between high and low activations, the outputs are ranked as follows:

High activations, ranging from 0.7-1.0, appear as red in Grad-CAM outputs. An example is the red bounding boxes in figure 8.

Medium activations, ranging from 0.4-0.7, appear as light green/yellow in Grad-CAM outputs. An example is the orange bounding boxes in figure 8.

Low activations, ranging from 0.0-0.4, appear as light blue in Grad-CAM outputs. An example is the yellow bounding boxes in figure 8.

Figure 8: Examples of types of attention in Pong, BeamRider and Seaquest respectively. Red indicates a high Grad-CAM activation, orange indicates a medium amount of activation and yellow indicates a low amount of activation.

An example of the different ranks at their thresholds is shown below in figure 9.

Figure 9: A Grad-CAM output thresholded at three stages: High (values higher than 0.7), Medium (values between 0.4 and 0.7) and Low (values between 0.0 and 0.4). The first image is the original Grad-CAM output.
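A minimal sketch of this ranking, assuming the Grad-CAM output is available as a NumPy array already scaled to [0, 1]:

```python
import numpy as np

def rank_activations(cam: np.ndarray) -> np.ndarray:
    """Bucket a Grad-CAM map into the three ranks of section 4.2:
    0 = low (< 0.4), 1 = medium (0.4-0.7), 2 = high (>= 0.7)."""
    return np.digitize(cam, bins=[0.4, 0.7])

# Example: fraction of a (stand-in) 10x10 map that is highly activated.
cam = np.random.rand(10, 10)
high_fraction = (rank_activations(cam) == 2).mean()
```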

4.3 Learning a Policy

Training an agent to reach human-like or superhuman performance in a complex environment can take millions of frames. Seeing how an agent reacts to the same situations at different points in training can make clear how the agent is trying to maximize long-term rewards. For each game, states were manually sampled, after which both agents 'played' the state to build up spatio-temporal information in the LSTM cells of the convolutional model. Using the action of each agent, the Grad-CAM outputs can be contrasted to highlight differences in policy. To contrast the resulting Grad-CAM outputs with a model with randomized weights as a baseline, appendix A shows a random selection of frames generated by untrained agents.


Pong. The Full Agent has learned to shoot the ball in such a way that it scores by hitting the ball only once each round. The initial round might differ, but after that all rounds are the same: the Full Agent shoots the ball up high, which makes the ball bounce off the wall all the way down over the opponent's side, at which point the agent retreats to the lower right corner. In contrast, the Half Agent is actively tracking the ball at each step and could potentially lose some rounds because of this. This tracking behaviour of the Half Agent is also demonstrated in figure 10, where it is trying to meet the ball's height in frames 50, 51 and 53. Looking at the Grad-CAM output, the Half Agent probably makes this decision based on the ball itself, because there is a high attention level on the ball in these frames. The Full Agent on the other hand has a high attention level on itself and the enemy in all but one frame (53): its action of going down is not based solely on the attention on the ball, but also largely on the attention on itself and its direct surroundings. The Full Agent is calculating where the ball might go next to hit it; the Half Agent is trying to keep up with the ball at every step.

Figure 10: Manually sampled states from the game Pong, combined with the Full Agent's and the Half Agent's actions and the Grad-CAM outputs based on these states.

BeamRider. Both agents have learned to hit enemies, but the Full Agent has a higher average return, indicating that it has more knowledge about how to play the game. Looking at figure 11, both agents have a measure of attention on the two white enemy saucers, but the rank of attention differs: the Full Agent has a high attention on the enemies, whereas the Half Agent has only a low attention on them. The Half Agent is either going right, which is essentially a NOOP in that area, or it could be shooting at the incoming enemy. More interesting are the last two frames, 175 and 176. The attention of the Full Agent turns from the directly approaching enemy saucer to the enemy saucer to the left of it, and the agent tries to move in its direction (LEFTFIRE). The Full Agent's attention in frame 176 is placed to a medium degree on the trajectory of its own laser, which will hit the enemy saucer in the next frame. This could indicate that the Full Agent knows it will hit the target and is thus moving away from it, to focus on the other remaining enemy.

Figure 11: Manually sampled states from the game BeamRider, combined with the Full Agent's and the Half Agent's actions and the Grad-CAM outputs based on these states.

The analysis of both agents reveals another interesting result: the agents do not learn to 'properly' use the torpedoes. At the beginning of each episode/level, both agents fire torpedoes until they are all used up and then continue on as usual. Figure 12 depicts a manually sampled configuration played by the Full Agent. The torpedoes have deliberately not been used yet, and there are enemies coming towards the agent at different time steps. Looking at the Grad-CAM attention map, the agent appears to be highly focused on the remaining three torpedoes in the upper right corner. This occurs even when the chosen action is not of the UP variety, which would trigger firing a torpedo.


Figure 12: Manually sampled states from the game BeamRider while not firing torpedoes, combined with the Full Agent's actions and the Grad-CAM outputs based on these states. In the 300 frames played, the agent has chosen an UP-variant 219 times, LEFTFIRE 67 times and other actions 14 times.

4.4 Agent Failures

Even well-trained agents make mistakes, and in the case of BeamRider and Seaquest the trained agents still end the episode by dying. This section provides an analysis of where the agents' attention was in the four frames leading up to death. Only BeamRider and Seaquest are considered, because a well-trained Pong agent does not lose the game.

BeamRider. In figure 13 the agent initially has a high amount of attention focused on itself, a medium amount of attention directed at an incoming laser and a low amount of attention focused on the debris right ahead of it. A white enemy saucer is approaching from the left, and the attention of the agent shifts to both this enemy and the incoming piece of debris. In all frames the agent is performing the LEFTFIRE action which, given the Grad-CAM activations and the trajectory of the laser seen in frames 4139 and 4140, is aimed at the piece of debris. Unfortunately, the agent's laser cannot destroy debris, because debris requires a torpedo. This leads to the hypothesis that the agent was too late to focus on the enemy saucer because it was looking at the debris earlier. This is also supported by the Grad-CAM output, where the agent is only looking at the enemy saucer in the three frames leading up to its death. If the agent had made the causal relation between debris and torpedoes, or had learned to always avoid debris, it would have been able to focus on the enemy saucer earlier.


Figure 13: Agent dies because it is hit by an enemy laser.

In the next situation, pictured in figure 14, the agent is approached by a number of different enemies, one of which only appears after level 7: the green bounce craft. This is another enemy that can only be destroyed with a torpedo; it jumps from beam to beam trying to hit the agent, which is what eventually kills the agent in frame 5664. In all frames, a high amount of attention is focused on the nearest three enemies, two of which are only killable using a torpedo. The agent is shooting using LEFTFIRE which, based on the Grad-CAM activations, is aimed at the green bounce craft directly in front of the agent. As in the previous scenario in figure 13, the agent cannot kill the incoming enemies and is not able to avoid them or use its torpedoes effectively.

Figure 14: Agent dies because it is hit by a green bounce craft.

In the next case, depicted in figure 15, the agent eventually dies because it is hit by a green blocker shield, another enemy it can only kill through the use of a torpedo. The first frame is rather interesting: attention is scattered throughout the frame, focused on almost everything except the enemies. Comparing frame 2817 with 2818 gives the impression that they target exactly opposite locations in the frame. It is hard to interpret such heat maps with respect to the agent, except for the fact that it is focusing on everything except enemies. Frames 2818 and 2819 are easier to interpret: the agent is focused on the enemies and tries to either hit them or avoid them, but because it is trying to hit them, the agent is too late to maneuver out of the way. It eventually dies because of this.

Figure 15: Agent dies because it is hit by a green blocker shield.

All three cases exhibit similar behaviour when it comes to the Grad-CAM rationalizations: the agent is focusing on, and shooting at, enemies that it cannot kill with its laser. The attention map shows what the agent is firing at, which leads to the conclusion that the agent's policy has not learned to use torpedoes correctly, a theory supported by the results from the previous section, where the agent cannot start a game without firing off all of its torpedoes.

Seaquest. In Seaquest, the agent needs to learn that going up for air is beneficial and gives a high long-term return (staying alive). Many factors could prevent the agent from learning this, such as a too small LSTM cell or a lack of explorative actions. A possible solution is Fine-Grained Action Repetition, which selects a random action and repeats it for a decaying number of steps every so often (Sharma, Lakshminarayanan, & Ravindran, 2017); the authors even use Seaquest as an example. During training, the Seaquest agent in this thesis did not learn to go up for air and was thus stuck at a maximum of 1900 reward. The agent will go into the water and shoot as many enemies as it can, but it dies when it runs out of oxygen. This section provides a small case study using Grad-CAM to analyze what the agent is looking at when there are enemies on the screen and right before running out of oxygen.

In figure 16 the agent is seen sporadically going left and right, sometimes shooting (there is a medium amount of attention focused on the shots, but there is no clear target). The agent has a high amount of attention focused on itself and the enemies, which could indicate some spatial awareness, but there is no attention focused on the oxygen meter pictured at the bottom.


Figure 16: A showcase of Grad-CAM applied to Seaquest in a setting with enemies and other objects.

The situation in which the agent is running out of oxygen is depicted in figure 17. The agent has been shooting at an enemy on the right-hand side, which explains a medium amount of attention focused on that area; there is also a diver below the agent with a medium amount of attention focused on it. Other than these sources, the attention is highly focused on the agent itself. The fact that there are few activations on the oxygen bar supports the theory that the agent has not learned the correlation between time spent alive and going up for air.

5 Related Work

Visual occlusion or heat maps are often used to introduce explainability into Deep Learning models, such as the Prediction Difference Analysis method proposed in (Zintgraf et al., 2017). This method is however quite computationally expensive (the authors note 70 minutes of processing for a single image), which makes it intractable for sequences of over 1000 images, as is often the case in reinforcement learning tasks. Textual approaches to explainability such as (Hendricks et al., 2016) were not considered in this thesis due to the lack of ground-truth class labels when dealing with Atari 2600 games.

In (Harrison, Ehsan, & Riedl, 2017), a model is proposed to explain an autonomous system's behavior as if a human had performed the behavior. The authors use a natural language training dataset collected from human players thinking out loud while playing a game. This dataset is then used to generate explanations for an agent encountering the same state-action pairs that the humans played. The downside to this approach is that the explanations do not come from the agent but from humans. The framework proposed in this thesis, in contrast, generates heat maps based on the agent's own actions, which are then rationalized.

In (Greydanus, Koul, Dodge, & Fern, 2017) the authors propose to use saliency maps to understand how an agent learns and follows a policy. The authors also propose a perturbation method that selectively blurs regions to calculate the impact this has on the policy. Although this method demonstrates important regions for the agent's decision making, the method proposed in this thesis highlights important regions without the need for a perturbation step.

The t-SNE embedding of the Atari 2600 game Space Invaders used in (Mnih et al., 2015) adds an understanding of how an agent ranks environments based on the value function in Q-Learning, but this is outside the scope of this work. In general, there is not much literature to be found on Explainable Reinforcement Learning, but with the recent advances in and focus on explainability in AI this might change.

6 Conclusion And Discussion

This thesis has presented a novel rationalization framework that uses Grad-CAM in combination with the A3C deep reinforcement learning model to produce an action-specific activation map. This activation map can be combined with the original input frame to produce an interpretable, location-specific heat map indicating the agent's points of attention. This heat map can then be used to rationalize the decision the agent makes.

We have shown that during the training of an agent, the rationalization framework can help by visually distinguishing different stages of training with respect to the policy the agent is using. The rationalizations help in understanding how an agent learns and what it should focus its attention on to obtain a higher reward.

We have also shown that the rationalization framework helps create understanding when an agent fails at its task, specifically by looking at where the agent's attention was in the frames before it dies. In doing so, we found that the agent can fail due to long-term reward mechanisms that it either misunderstands (BeamRider's torpedoes) or does not focus its attention on (Seaquest's oxygen bar).

This thesis has argued for and emphasized the usability of visual rationalizations as a means to increase understanding of deep reinforcement learning agents. However, there are some problems when combining reinforcement learning agents with explainability methods.

6.1 Discussion

A problem in explainable reinforcement learning is the lack of ground truth. In image classification tasks, each image usually has a class assigned to it indicating the ground truth. In the case of reinforcement learning, it is hard to say which action is correct and which is wrong; it can depend on the strategy being used. The lack of a ground truth makes it impossible to determine whether a specific rationalization is correct, but rationalizations can still enhance the understanding of the agent's decision making.

Grad-CAM Interpretability. As shown in figure 15, Grad-CAM heat maps can sometimes be quite hard to interpret. This behaviour is most prominent at the start and end of a round ((re)spawning). A few examples of these noisy Grad-CAM outputs are shown in figure 18. It is important to note that during (re)spawning the agent can act as normal, but the actions will not be executed. It could be that the agent is aware of this due to the fact that every enemy is removed from the frame. In the case of BeamRider, the agent even changes color, which could be another indicator to the agent that something is off, and in Pong and Seaquest the agent is removed from the game for a short period. These frames are also not helpful for investigating agent failure, because the agent cannot move or interact properly. But they might provide clues as to what visual objects catch the agent's attention.


Figure 18: Noisy Grad-CAM outputs generated when the agent was in the process of (re)spawning.

As mentioned, this behaviour is most prominent during (re)spawning, but it does not only appear there. Figure 19 depicts some more scenes from a different setting. Even though these images do not show a clear reason for the action being taken and are thus hard to rationalize, they still depict interesting behaviour; in the first Pong frame, for example, the attention is scattered at low intensity throughout the screen and concentrated on both agents. In the case of Pong and BeamRider, the noise often comes in the form of activations spread throughout the image that are not prominent on any moving or changing object, including the agent itself, the scores and the enemies.

6.2 Further Research

Firstly, further research could compare Grad-CAM outputs for multiple actions in one state. These could be compared to each other to see which activations motivate the agent to, for example, go left and not right. This could be done by first subtracting the critic's Grad-CAM output to produce an action-only attention map.

A second direction for further research could be finding differences between deep reinforcement learning models with the use of Grad-CAM. This research could contrast model behaviour in a controlled environment, where each model is given the same state and the resulting Grad-CAM outputs are compared across models.

A third and final direction for further research could focus on testing the model proposed in this thesis in different environments, such as MuJoCo's continuous control tasks (Todorov et al., 2012). These offer more challenging tasks with a larger action space, in contrast to the limited action space of the Atari 2600 games.

Appendices

A Grad-CAM Output for Randomly Initiated Agent

A selection of random actions and the Grad-CAM outputs can be seen in figure 20.

Figure 20: Selection of random actions and the resulting Grad-CAM outputs for untrained agents (frames are not in any particular order). Note that for the game Pong, the Grad-CAM outputs are more location-specific compared to BeamRider and Seaquest. This might be due to the fact that convolutions themselves are good at detecting abstract objects and edges, which Pong is filled with.

B Hyper-parameters

Table 5: Hyper-parameters used during training. The hyper-parameters are equal to those in (Mnih et al., 2016).

Name                  Value    Description
Learning rate         0.0001   Scale of the gradient update
γ                     0.99     Discount factor for rewards
τ                     1.00     Scale of the Advantage function return
Entropy coefficient   0.01     Scales entropy in the policy loss estimation
Number of processes   14       How many parallel agents are training


References

Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2012). The arcade learning environment: An evaluation platform for general agents. CoRR, abs/1207.4708. Retrieved from http://arxiv.org/abs/1207.4708

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. CoRR, abs/1606.01540. Retrieved from http://arxiv.org/abs/1606.01540

Greydanus, S., Koul, A., Dodge, J., & Fern, A. (2017). Visualizing and understanding Atari agents. CoRR, abs/1711.00138. Retrieved from http://arxiv.org/abs/1711.00138

Harrison, B., Ehsan, U., & Riedl, M. O. (2017). Rationalization: A neural machine translation approach to generating natural language explanations. CoRR, abs/1702.07826. Retrieved from http://arxiv.org/abs/1702.07826

Hendricks, L. A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B., & Darrell, T. (2016). Generating visual explanations. In ECCV.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. doi: 10.1162/neco.1997.9.8.1735

Islam, R., Henderson, P., Gomrokchi, M., & Precup, D. (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. CoRR, abs/1708.04133. Retrieved from http://arxiv.org/abs/1708.04133

Juliani, A. (2016). Simple reinforcement learning with Tensorflow part 8: Asynchronous actor-critic agents (A3C). Blog, December 17. https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2

Kostrikov, I. (2018). PyTorch implementations of asynchronous advantage actor critic. GitHub. https://github.com/ikostrikov/pytorch-a3c

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (pp. 1097-1105). Curran Associates, Inc. Retrieved from http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., . . . Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783. Retrieved from http://arxiv.org/abs/1602.01783

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., . . . Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533. Retrieved from http://dx.doi.org/10.1038/nature14236

Olah, C. (2015). Understanding LSTM networks. Blog, August 27. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Park, D. H., Hendricks, L. A., Akata, Z., Rohrbach, A., Schiele, B., Darrell, T., & Rohrbach, M. (2018). Multimodal explanations: Justifying decisions and pointing to the evidence. In IEEE CVPR.

Salimans, T., & Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. CoRR, abs/1602.07868. Retrieved from http://arxiv.org/abs/1602.07868

Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust region policy optimization. CoRR, abs/1502.05477. Retrieved from http://arxiv.org/abs/1502.05477

Selvaraju, R. R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., & Batra, D. (2016). Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization. CoRR, abs/1610.02391. Retrieved from http://arxiv.org/abs/1610.02391

Sharma, S., Lakshminarayanan, A. S., & Ravindran, B. (2017). Learning to repeat: Fine grained action repetition for deep reinforcement learning. CoRR, abs/1702.06054. Retrieved from http://arxiv.org/abs/1702.06054

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA, USA: MIT Press. Retrieved from http://www.cs.ualberta.ca/%7Esutton/book/ebook/the-book.html

Teach, R. L., & Shortliffe, E. H. (1981). An analysis of physician attitudes regarding computer-based clinical consultation systems. In Use and impact of computers in clinical medicine (pp. 68-85). Springer.

Todorov, E., Erez, T., & Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In IROS (pp. 5026-5033). IEEE. Retrieved from http://dblp.uni-trier.de/db/conf/iros/iros2012.html#TodorovET12

Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3), 279-292. doi: 10.1007/BF00992698

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229-256. doi: 10.1007/BF00992696

Xu, B., Wang, N., Chen, T., & Li, M. (2015). Empirical evaluation of rectified activations in convolutional network. CoRR, abs/1505.00853. Retrieved from http://arxiv.org/abs/1505.00853

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2015). Learning deep features for discriminative localization. CoRR, abs/1512.04150. Retrieved from http://arxiv.org/abs/1512.04150

Zintgraf, L. M., Cohen, T. S., Adel, T., & Welling, M. (2017). Visualizing deep neural network decisions: Prediction difference analysis. CoRR, abs/1702.04595. Retrieved from http://arxiv.org/abs/1702.04595
