
Contrastively-shaped Reward Signals for Curiosity-based Exploration Tasks



MSc Artificial Intelligence

Master Thesis

Contrastively-shaped Reward Signals for

Curiosity-based Exploration Tasks

by

Nil Stolt Ansó

12371149

7th July 2020

48 ECTS November 2019 - August 2020

Supervisor:

Tim Bakker

Examiner:

Dr. Herke van Hoof

FACULTY OF SCIENCE,

INFORMATICS INSTITUTE


Declaration of Authorship

I, Nil Stolt Ansó, hereby declare that the thesis titled “Contrastively-shaped reward signals for curiosity-based exploration tasks” and the presented research are entirely my own. Furthermore, I confirm that the complete work was done in pursuit of a research degree from the Master’s program in Artificial Intelligence at the University of Amsterdam. All sources of assistance, as well as previous work upon which this thesis is built, have been properly indicated as such. Any work which may have been conducted with the assistance of others besides myself has been dutifully indicated.

Signed: Nil Stolt Ansó    Date: August 7th, 2020


Abstract

Reinforcement learning algorithms are quite sensitive to the structure of the presented reward signal. Sparse and deceptive signals can cause an agent to fail to thoroughly explore the state space for better policies. The aim of intrinsic motivation is to provide a self-supervised reward function that incentivizes the agent to seek novel regions of the environment. However, common approaches based on prediction error (PE) can produce reward signals with unattractive properties for the learning progress of the agent.

This investigation proposes a formalism that bridges the definition of novelty under the count-based setting with the generalizability provided by PE-based approaches. This formalism is extended to high-dimensional spaces, where abstract representations are necessary to reduce input complexity. Controlled experiments on simple grid worlds indicate that structuring the abstract space under a contrastive loss approximates the conditions proposed by the formalism. Further experiments on visually complex environments show that the proposed approach achieves results on par with well-established encoders in the literature.


Acknowledgements

I would like to express my sincere gratitude to Tim Bakker, my thesis supervisor, for his patient guidance, enthusiastic encouragement, and useful critiques of this research.

Also, I wish to thank my family and friends for their continued support throughout my studies. I would especially like to extend my gratitude to Francesco Dal Canton, Anton Wiehe, and Mahsa Mojtahedi, who helped me frame the research direction through their valuable, and often long, discussions.

Finally, I would like to thank Herke van Hoof and Marco Wiering who, through their roles as teachers, have shaped my interest for the field of reinforcement learning to what it is today.


Table of Contents

Declaration of Authorship
Abstract
Acknowledgements
List of Symbols

1 Introduction

2 Related Work
2.1 Intrinsic Rewards
2.1.1 Visitation Counts
2.1.2 Transition Prediction Error
2.2 Abstract State Representations

3 Preliminaries

4 Problem description
4.1 Rewards as a function of visitations
4.2 Reward-visitation properties in a PE setting
4.3 Large state spaces
4.4 Contrastive learning of abstract MDPs
4.4.1 Satisfying the learning condition
4.4.2 Satisfying the initialization condition
4.5 Optimization of abstract space
4.6 Hypotheses

5 Methodology
5.1 Investigation under controlled grid worlds
5.1.1 Environments
5.1.2 Evaluation measures
5.1.3 Architectural choices
5.2 Complex environments
5.2.1 Environments
5.2.2 Evaluation measures
5.2.3 Architectural choices

6 Results & Analysis
6.1 Grid worlds
6.1.1 Empty 42×42 grid world
6.1.2 Random feature 42×42 grid world
6.1.3 Spiral 28×28 grid world
6.2.1 Atari Breakout-v0
6.2.2 Atari Riverraid-v0

7 Discussion
7.1 Discussion
7.1.1 Tabular approaches
7.1.2 Encoders
7.1.3 Exploration behaviour
7.2 Limitations
7.2.1 Self-transitions
7.2.2 Pretraining of encoder
7.2.3 Epsilon values
7.3 Future research

8 Conclusion

List of Figures
Bibliography
Appendix
A Qualitative evaluation of exploration policies
A.1 Empty 42×42 grid world
A.2 Random feature 42×42 grid world
A.3 Spiral 28×28 grid world
B Increasing encoder update frequency
B.1 Empty 42×42 grid world
B.2 Random feature 42×42 grid world
B.3 Spiral 28×28 grid world
C List of hyperparameters
C.1 Grid worlds


List of Notation

s : state
s′ : next state
ŝ′ : predicted next state
S : set of all states
a : action
A : set of all possible actions
π : policy
S_π : set of states visited by policy π
ε : probability of taking a random action
z : abstract state
Z : set of all abstract states
Z(s) : mapping from state to abstract state
Z_θ : learnt mapping from state to abstract state
T : environment's transition function in feature space
T_φ : learnt transition function in feature space
T̄_φ : learnt transition function in abstract space
A_s(a) : mapping of actions into feature space given state s
Ā_s(a) : mapping of actions into abstract space given state s
A_φ(s, a) : learnt mapping of an action into feature space given state s
Ā_φ(z, a) : learnt mapping of an action into abstract space given abstract state z
r^int : intrinsic reward
R(·) : reward function
G^int : intrinsic return
γ : discount factor
t : current step in the exploration process
µ(·) : probability of visitation
H : energy function for positive examples
H̃ : energy function for negative examples
L : loss
ξ : hinge value
d(·) : distance measure
L : history length


Chapter 1: Introduction

Reinforcement learning (RL) is a method of training agents to map situations to actions so as to maximize a numerical reward signal. Unlike in other subfields of machine learning, it is quite rare for the agent to be presented with a labelled collection of examples to learn from. Instead, the learner must discover the intricacies of the reward function by trying out actions, all while attempting to maximize the received reward. This, combined with the fact that rewards might not be received immediately after taking a given action, and that actions also affect which future situations are encountered, gives rise to the challenges that distinguish RL as its own field [1].

In recent years, major developments have allowed RL algorithms to generalize to a wide range of environments [2]. However, many of these milestones require the environment to have dense, well-shaped reward signals. The engineering process that goes into shaping rewards is a notoriously challenging problem. Often the reward function in question was not built with such algorithms in mind. Furthermore, the ability to build a function that would best accommodate such learning processes would require a priori knowledge on the solution of the very same task these algorithms are trying to solve. It is thus common practice to include human heuristics into the reward function where possible. However, using heuristics opens up the possibility of introducing human biases into the solutions found by these algorithms. This can cause the system to converge into suboptimal solutions, or in the worst of cases, discover ‘reward hacking’ strategies [3].

There are occasions where dense rewards are simply unavailable. As an external designer, one might come across tasks where the only information available about the agent's progress is whether the task itself (or a key subtask) has been successfully completed. In such cases, the time complexity of discovering a reward through a local exploration strategy depends on two factors: the distance to the reward through the problem space and the width of the action space. Depending on how sparse the reward signal is, traditional local exploration strategies will require exponentially increasing time just to observe one instance of change in the reward signal.

In developmental psychology, an agent's incentive to explore along extended sequences of actions is explained through the process of curiosity [4]. At the core of the idea, a distinction is made between extrinsic and intrinsic motivation. Although extrinsic motivation plays an important role in reinforcing behaviour, motivation originating from the agent itself is a key component towards developing skills useful later in life. This idea of exploring the environment through processes analogous to curiosity in humans has been given attention in different forms in the field of RL.

The idea of intrinsic motivation in the context of RL, as posed by J. Schmidhuber in 1990 [5], involves a system tasked with capturing the environment’s dynamics, often termed as the world model (WM), capable of predicting the next input given a history of past actions and inputs. An action-generating reward-maximizing controller can then be incentivized to learn what action sequences produce currently unpredictable inputs.

Despite recent progress in making RL algorithms robust to evermore complex input spaces, the naive application of intrinsic motivation can produce intrinsic rewards far too noisy for a controller to learn. The environment might contain features and dynamics that are too complex to be captured by a WM of finite size. Some of these sources of information might even be purely random in nature. Parts of the environment with these properties can cause the agent to develop novelty-seeking policies that stray away from the task it was intended to explore.

A common example of such a situation involves a tree with leaves moving in the wind. For all intents and purposes, capturing the motion of the leaves is equivalent to trying to find patterns in white noise [6]. Not only would predicting the next visual input produce very high intrinsic rewards, it is also not valuable for the agent to capture parts of the environment dynamics that fall outside of its actions' direct influence.

Recent trends in deep RL have shifted towards the use of neural networks to create abstract representations of the high-dimensional input spaces that such environments provide. The idea behind


encoding an input into a lower-dimensional space is to extract a more informative representation while cutting down on its complexity. However, there is no agreed upon optimization criterion with which to tune an encoder in order to give rise to these abstract spaces. The approach of choice varies depending on factors such as visual diversity across the environment, available computation resources, and theoretical guarantees.

Furthermore, much of the literature focused on shaping prediction error aims to determine the efficacy of an encoding function strictly based on its empirical results. Unfortunately, there is no easy way to tell ahead of time what the best choice of encoder will be, as recent studies have shown that the performance of any given encoder can easily vary across environments [7]. On the other hand, limited work has been dedicated to formalising theoretical frameworks through which to guarantee the performance of encoding functions.

A related field in the literature is that of count-based curiosity. Count-based approaches offer well-defined frameworks through which to describe state novelty [8, 9, 10], but generalizing these ideas outside of the tabular domain has proven to be a challenge. The set of works that aim to bridge the gap between prediction error and count-based formalisms is relatively small [11, 12].

Contributions:

1. This thesis introduces a new formalism through which a prediction error approach can ensure that its generated intrinsic rewards have the same properties as the rewards generated by a count-based approach.

2. Next, the formalism is investigated in multiple controlled grid world environments where, under tabular domains, the prediction error approach appears to show comparable results to visitation counts.

3. The formalism is then extended into the function approximation domain. When applied to abstract representation spaces, the formalism appears to provide an increase in performance.

4. Finally, the formalism is tested on two popular deep RL environments, which show that the use of a contrastively-learnt encoder achieves performance on par with other encoders in the intrinsic motivation literature.


Chapter 2: Related Work

2.1 Intrinsic Rewards

Exploratory behaviours derived from one's inner incentives have been discussed from a wide range of perspectives in the scientific literature. As pointed out in many of P.Y. Oudeyer's works (e.g. [4, 13]), the idea that these internal mechanisms might arise from the relation between an organism's environment and its own state of knowledge has been addressed using concepts such as curiosity [14], intrinsic motivation [15], and free play [16].

Several strands of research within artificial intelligence have investigated the role played by such exploration mechanisms in solving tasks with complex, sparse, or deceptive rewards. In reinforcement learning, the maximization of exploration-incentivizing rewards has been approached from different angles. These intrinsic rewards are typically added alongside extrinsic rewards as 'exploration bonuses' [17], forcing the agent to learn a balance between the two. Examples of such quantities include state novelty [17], prediction progress [18], and information-theoretic formalisms [19, 20].

2.1.1 Visitation Counts

In tabular RL, a measure of state novelty can be obtained by counting visitations of an observed experience [21]. Intrinsic rewards can then be generated as a function of visitation counts, such that the reward of a transition is inversely proportional to the number of past visitations of the experience. Unfortunately, when the state space is continuous (e.g. real-valued coordinates) or too large to store in memory (e.g. high-resolution images), counting how often each individual transition occurs becomes a problem.

Attempts to extend visitation counts to state spaces outside the scope of tabular approaches have shown some promise. The use of a hashing function in [10] allows similar states to be mapped into discrete, lower-dimensional spaces which allow for counting. However, outside of their empirical findings, the question of the optimal choice among possible hashing functions for such mappings is left unanswered. Another approach is that of pseudo-counts [11, 12], where a learnt function is used to calculate the recoding probability of a state. This quantity allows for an estimation of visitation density which generalizes across similar states.

2.1.2 Transition Prediction Error

Another approach to generating intrinsic rewards is that of prediction error [22]. A model of the environment dynamics is used to predict what state the agent will transition into given the current state and an action. The magnitude of the error in the model’s prediction is used as the intrinsic reward. Throughout the learning process of the dynamics model, the reward for commonly visited transitions will diminish with increased visitations on said transitions.

Prediction error has the advantage of being easy to generalize to environments where the use of tabular models may be problematic. Continuous or very large state spaces can, under a prediction error approach, be tackled with function approximation. Early work uses models with a low number of hyperparameters [21], while contemporary work has adopted deep neural architectures capable of predicting directly at the level of pixels [7, 23].

2.2 Abstract State Representations

In recent years, the widespread use of deep learning techniques has allowed reinforcement learning algorithms to generalize to domains traditionally thought to be too complex [2, 24, 25].


Directly predicting future states in a high-dimensional state space (e.g. images) based on learnt dynamics is a challenging task. The alternative, learning the dynamics in some auxiliary feature space, has been shown to be much more effective in terms of generalization and robustness to noise [26, 23, 7]. However, the specific choice of such a feature space is heavily dependent on a number of factors. These include: available computation resources, input complexity, complexity outside of the agent's direct control, and the intended training objective. In order to ensure stable dynamics, the discussion offered by [7] considers three desired properties of an embedding space: it should be compact in terms of dimensionality, sufficient information relevant to the task should be preserved, and the space should be a stationary function of the input observations.

A simple example of a type of encoder that follows such criteria is the hashing function [10]. Hashing functions can be seen as static mappings from the state space to a lower-dimensional representation. The function can have arbitrary properties, but in principle one would like similar states to map to similar hash codes. The hash function need not be discrete: a randomly initialized neural network with static weights also produces outputs that can be interpreted as hash codes of the original input. A neural network, being a continuous mapping, also has the benefit of mapping sufficiently similar inputs into neighbouring regions of the output space.

On the other hand, encoders with learnable parameters offer latent feature spaces that can be structured under an objective to extract important information. Unlike static encoders, the embedding space of a learnt encoder is dependent on the distribution over the input space. If the current distribution of states is undergoing change, as it might when transitions are being gathered by a policy being concurrently trained, the embedding space will not be stable. This issue can introduce difficulties in convergence for world models, as the transition model has to learn from a shifting embedding distribution [7, 10]. Nonetheless, the wide range of results in [7] goes on to show that it is possible for learnt encoders, in a curiosity-based exploration task, to outperform the random features produced by a static encoder in environments with an observation space of wide visual variety. An interesting discussion is also offered on why many Atari environments popular in the deep RL literature might be visually too simple to justify using a learnt encoder over a static encoder.

A common type of learnt embedding space is optimized using autoencoder-based architectures with frame prediction objectives [27, 28, 29]. These approaches demonstrate the efficacy of such methods at predicting future states at the pixel level. However, information about fine detail in observations is lost during encoding, and thus decoding back into pixel space is very noisy, producing large deviations from the target. Recent work criticizes this optimization objective [7]. A solution offered by [30] introduces skip-connections between encoder and decoder layers, allowing information about local detail to flow directly into the decoder layers. Alternatively, [31] argues that raw observations may contain unnecessary information and be insufficient for planning, and instead suggests additional training objectives such as rewards and the value of future states.

When generating intrinsic rewards, objectives reliant on pixel space prediction can be high-variance and unreliable. An early discussion offered in [6] describes how prediction error on the raw observation space can lead to disproportionately high reward values in regions with unpredictable visual detail. The overview of encoders provided in [7] empirically demonstrates this and goes on to show that prediction error in pixel space is significantly outperformed by prediction error in embedding space. However, embedding spaces that are self-supervised on pixel reconstruction objectives such as auto-encoders suffer from similar issues. The resulting structure of the embedding space is optimized to encode features useful for decoding. As such, it is not directly obvious why the learnt embedding spaces produced by these methods should be optimal for predicting across latent representations, and by extension, why the generated intrinsic rewards should be optimal for an exploration task.

Inverse Dynamics Features (IDF) [23] (commonly known under the name Intrinsic Curiosity Module (ICM)) creates an embedding space by using the prediction of the taken action given two subsequent states as the optimization objective. This is argued to produce latent spaces that ignore visual information irrelevant to the forward prediction task, since the embedding exclusively captures features that correlate with the agent's actions. Recent work [7] has argued that the action prediction objective could fail to capture information which, despite not being immediately correlated with the agent's actions, could be useful for the forward prediction task across longer time intervals.


A different approach altogether makes the optimization objective depend on the prediction of future states in embedding space. Concurrent work [32, 33] aims to create homomorphisms [34] between observation space and embedding space through negative sampling. This allows for structured world models where the transition of an action along latent space is equivalent to the action's effect in observation space.

Other notable work [35] also uses objectives involving the prediction of future states in embedding space. However, no semantics are attached to their embedding space; instead, its purpose lies in predicting relevant future quantities such as values, rewards, and policies.


Chapter 3: Preliminaries

Markov Decision Process (MDP): An MDP is a tuple M = (S, A, R, P). Here, s ∈ S is a Markov state, a ∈ A is an action available to the agent, and R : S × A → ℝ is a reward function that generates a scalar signal defining the desirability of a given transition. P : S × A × S → [0, 1] is the transition matrix, where given two states and an action, a probability is assigned for transitioning from the first state to the second.

Throughout this research, the transition dynamics are taken to be fully deterministic, P : S × A × S → {0, 1}; the stochastic formulation is left as future work. Furthermore, the rewards presented to the agent are generated by prediction error, which depends on the number of visitations of an (s, a) pair. The intrinsic reward function is thus more appropriately described by R : S × A × ℕ → ℝ⁺, where n ∈ ℕ is the number of visitations of the (s, a) pair.

Q-learning: Q-learning is a model-free RL algorithm that learns Q-values for (s, a) pairs in an MDP. A Q-value represents the expected return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ of a state-action pair under subsequent greedy actions, where γ ∈ [0, 1] is a discount factor and r_t is the reward at time step t. The expected return of an (s, a) pair under policy π is given by a Q-value function $Q^\pi : S \times A \to \mathbb{R}$, where $Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid s_t = s, a_t = a\right]$. A policy π : S × A → [0, 1] assigns probabilities to actions in states. The value of a state under an optimal policy π* is defined as $V^*(s) = \max_a Q^*(s, a)$ under the Bellman optimality equation. Q-learning is guaranteed to identify an optimal policy for any finite MDP given infinite exploration time and a partly-random policy. For simplicity's sake, the exploration policy throughout this research is an ε-greedy approach, where the greedy action is taken with probability 1 − ε.

MDP Homomorphism: An MDP homomorphism, as formalized by [34], is a tuple of functions h = ⟨Z, Ā_s⟩ that preserve the structure of an input MDP M = (S, A, R, T) into an abstract MDP M̄ = ⟨Z, Ā, R̄, T̄⟩. The function Z : S → Z is a mapping from states to abstract states, and for each state s, a function Ā_s : A → Ā maps actions to abstract actions. The MDP M̄ is a homomorphic image of M under h. The notions and definitions used here follow those of [33].

Each state s ∈ S is taken to have a unique feature representation. The set S is a collection of points in an N-dimensional raw feature space such that s ∈ ℝ^N. The function A_s(a) is used to denote the translation vector in the N-dimensional feature space equivalent to taking action a in state s. The transition function T : S × A → ℝ^N models the MDP's transition dynamics in the input feature space for a tuple (s, a, s′), such that T(s, a) = s′. Here, the transition function T is different from the traditional MDP transition function P in the sense that P returns the probability of a transition (s, a), whereas T outputs a vector in feature space.

Similarly, the set Z is a collection of points z ∈ ℝ^M in the M-dimensional embedding space into which the elements of S are mapped. Under an MDP homomorphism, the following identity holds:

$Z(T(s, a)) = \bar{T}\left(Z(s), \bar{A}_s(a)\right) = Z(s'), \quad \forall s, s' \in S, \; a \in A$   (3.0.1)

When moving into function approximation, notation from previous work is followed [36, 32]. The transition model predicting on the raw feature space is defined as T_φ(s, a) = s + A_φ(s, a) = s′, while the abstraction of T_φ acting on embedding space is defined as T̄_φ(z, a) = z + Ā_φ(z, a) = z′. Here, A_φ(s, a) and Ā_φ(z, a) are feedforward networks that approximate A_s(a) and Ā_s(a).
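To make the function-approximation notation concrete, the following is a minimal sketch (assumed layer sizes and names, not the implementation referenced later in the thesis) of the learnt action map and the residual transition model it induces, in both the raw feature space and the embedding space.

```python
import torch
import torch.nn as nn

class ActionEffect(nn.Module):
    """Approximates A_phi(s, a) (or its abstract counterpart A_bar_phi(z, a)):
    the translation vector that taking action a adds to the current representation."""
    def __init__(self, dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, a_onehot):
        # x: state or abstract-state representation, a_onehot: one-hot action vector
        return self.net(torch.cat([x, a_onehot], dim=-1))

def transition(action_effect, x, a_onehot):
    # T_phi(s, a) = s + A_phi(s, a); the same residual form gives
    # T_bar_phi(z, a) = z + A_bar_phi(z, a) when x is an embedding z.
    return x + action_effect(x, a_onehot)
```

Note that zero-initializing the final linear layer would make A_φ(s, a) output the zero vector for every input at initialization, a property that becomes relevant in chapter 4.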

Prediction Error (PE) as intrinsic rewards: In the intrinsic motivation literature, a common measure of the relative novelty of a transition (s, a) is the distance between reality and the prediction of a world model [18]. Let s ∈ ℝ^N be the N-dimensional feature representation of a state s ∈ S. Similarly, let the transition function T_φ(s, a) ∈ ℝ^N predict the outcome ŝ′ in feature space after taking action a ∈ A while in state s. An intrinsic reward r^int can then be formalised as:

$r^{int} = d\left(T_\phi(s, a), s'\right)$   (3.0.2)


where d is a distance measure in the feature space. Throughout this investigation the geometry of the feature space is treated as Euclidean and the distance measure d is the squared Euclidean distance (SED).

A reward-maximizing algorithm can learn a policy π that optimizes the long-term intrinsic return $G^{int}_t = \sum_{k=0}^{\infty} \gamma^k r^{int}_{t+k+1}$ for the current parameters of T_φ. This, however, assumes a stationary MDP. Changes to T_φ result in a non-stationary reward function R, and consequently cause G^int to become outdated. Any policy improvement step is only locally optimal with respect to the moving reward function R (since R is generated by a changing T_φ).

Furthermore, a policy maximizing the prediction error of a given T_φ will only visit a narrow subset of states S_π ⊆ S (where S_π is the set of states visited under policy π) for which G^int is maximized. Intrinsic rewards for the purpose of global exploration are only efficient when π and T_φ are trained synchronously. This dynamic creates an adversarial min-max objective for which finding globally optimal solutions is not guaranteed [37]. Throughout this research, the notation $S_{\pi_{t-k:t}}$ is used to indicate the set of states visited under policy π during the last k update steps of the min-max optimization process, such that:

$S_{\pi_{t-k}} \cup \cdots \cup S_{\pi_{t-1}} = S_{\pi_{t-k:t}} \subseteq S$   (3.0.3)


Chapter 4: Problem description

4.1 Rewards as a function of visitations

Assume our goal was to create a curiosity-based global exploration strategy that aims to visit the (s, a) pairs of a Markov decision process (MDP) as uniformly as possible, such that:

$\mu(s, a) = \frac{1}{|S \times A|}, \quad \forall (s, a) \in S \times A$   (4.1.1)

where µ(s, a) is the probability of the exploration policy visiting the (s, a) pair.

The level of uniformity in visitations that can be achieved by a curiosity-based system can be influenced by a multitude of factors. In general, these factors fall into two categories: factors relating to the MDP's structure and factors inherent to the curiosity system.

Some of the influencing factors relating to the MDP's structure include the starting state distribution and the transition matrix P. These affect the uniformity of exploration in ways that can only be compensated for by the curiosity system. An example of this is an MDP with a single, unchanging starting state. By default, this starting state (as well as nearby states) will have a higher visitation count due to being visited at the start of each episode, and will thus have a lower intrinsic reward. The curiosity system can make up for this bias by spending time visiting states that aren't near the starting state.

Factors inherent to the curiosity system involve some kind of influence on the intrinsic reward generation mechanism. Examples include the initialization of the world model's parameters, architectural choices, and preprocessing of the input. These factors introduce biases to the intrinsic reward mechanism that cause the exploration function to favour certain states over others. It is only possible to reduce these factors through design choices; having a priori information about the MDP structure can help in making good design choices.

Tabular count-based approaches to generating intrinsic rewards provide little to no design parameters. In the simplest of variations of such a system, the main factor affecting the incentive to visit a certain transition is the number n ≥ 0 of previous visitations∗. This creates a system

where the novelty reward has two desirable properties:

• An (s, a) pair with a higher visitation count can only ever provide a smaller reward than a less visited (s, a) pair.

$\forall n_1, n_2 \in \mathbb{Z}_{\geq 0}: \; n_1 < n_2 \implies \forall (s_1, a_1), (s_2, a_2) \in S \times A: \; R(s_1, a_1, n_1) > R(s_2, a_2, n_2) > 0$   (4.1.2)

where n₁ and n₂ are the visitation counts of the respective (s, a) pairs.

• Any two (s, a) pairs with equal number of visitations will generate equal rewards:

$\forall (s_1, a_1), (s_2, a_2) \in S \times A, \; n \in \mathbb{Z}_{\geq 0}: \quad R(s_1, a_1, n) = R(s_2, a_2, n)$   (4.1.3)

Throughout this investigation, these two properties (4.1.2 and 4.1.3) will be referred to as reward-visitation properties.

Although the count-based approach offers a reward generation mechanism that is simple and intuitive to relate to state novelty, it is limited to MDPs where the number of states |S| is countable and can be stored in finite memory.

∗ Works such as [8, 9] show that a more optimal count-based reward function follows an inverse square-root relation to visitations, of the form $\beta / \sqrt{n}$.


Extending the idea to domains with uncountable states requires the use of elaborate extensions [11, 10].

Meanwhile, a prediction error approach has the benefit of being easily generalizable to large (and continuous) state spaces. However, attempting to meet the reward-visitation properties outlined in 4.1.2 and 4.1.3 with a prediction error approach is not straightforward. A PE approach is open to a higher number of design parameters that influence intrinsic reward generation.

To begin with, it is possible for the transition model T_φ(s, a) to be initialized such that its prediction vectors have varying magnitudes across (s, a) pairs. This introduces variance in prediction error values (and thus in intrinsic rewards) throughout the state-action space at initialization. Even as the agent begins to visit transitions, (s, a) pairs with an equal number of visitations might not necessarily return the same reward value. This contradicts the behaviour expected from 4.1.3. Furthermore, if the PE approach is implemented in a way where certain (s, a) pairs decrease in PE at different rates, the reward property outlined in 4.1.2 would again not hold. Similarly, allowing the prediction error to ever reach zero will also cause the reward generation mechanism to deviate from the reward property described in 4.1.2. This is because two visitation numbers n and m where n < m could yield equal rewards R(s₁, a₁, n) = R(s₂, a₂, m) = 0.

Properties of the MDP can also influence a prediction error approach to a greater extent than a count-based one. An example of this is an MDP that contains nodes that transition into themselves. In such cases, the distance d for transitions where s = s′ will be d(s, s′) = 0. Assume a tabular PE approach where the transition model is defined as T_φ(s, a) = s + A_φ(s, a) and each (s, a) pair is learnt independently (no generalization across (s, a) pairs). If A_φ(s, a) is not yet learnt and is initialized as a zero vector, self-transitions will produce the following rewards:

$r^{int} = d\left(T_\phi(s, a), s'\right) = d\left(s + A_\phi(s, a), s'\right) = d(s, s') = 0$   (4.1.4)

Here, A_φ(s, a) will never change, as it is initialized at the target, and the reward r^int will remain zero for all subsequent visitations n.

4.2 Reward-visitation properties in a PE setting

The following two design choices can be used to guarantee that a tabular prediction error approach exhibits no variance in intrinsic rewards across (s, a) pairs with equal visitations. This ensures the reward function exhibits the desirable reward-visitation properties outlined in 4.1.2 and 4.1.3. This subsection assumes no generalization across (s, a) pairs.

1. The prediction of the follow-up state by the transition model T_φ(s, a) = ŝ′ approaches the true follow-up state s′ asymptotically with the number of visitations n. This happens at the same rate α for every (s, a) pair:

$T_\phi(s, a)_n \leftarrow T_\phi(s, a)_{n-1} + \alpha\left(s' - T_\phi(s, a)_{n-1}\right)$   (4.2.1)

where T_φ(s, a)_n is the transition model prediction for a given (s, a) pair after n visitations. This condition is referred to as the learning condition.

2. For every (s, a, s′) tuple, the prediction at initialization (prior to any visits) by the transition model T_φ(s, a)_init = ŝ′ should have the same distance (under the measure d) from the corresponding follow-up state s′:

$\forall (s, a, s') \in S \times A \times S: \quad T(s, a) = s' \implies d\left(T_\phi(s, a)_{init}, s'\right) = c$   (4.2.2)

where c denotes a constant and T(s, a) is the true transition function of the MDP. This condition is referred to as the initialization condition.

When these conditions are met, a tabular prediction error approach (where no generalization across (s, a) pairs takes place) is ensured to generate rewards according to the reward-visitation properties. A higher visitation count will always lead to a lower intrinsic reward (property 4.1.2), and given an equal number of visitations, equal rewards will be returned (property 4.1.3).
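As an illustration of the two conditions, the following minimal sketch (assuming one-hot state representations; class and method names are illustrative and not taken from the thesis code) stores one independent prediction vector per (s, a) pair, initializes every prediction at the current state, and moves it toward the true next state at a fixed rate α.

```python
import numpy as np

class TabularPEModel:
    def __init__(self, n_states, n_actions, alpha=0.1):
        self.alpha = alpha
        # One independent translation vector per (s, a) pair, initialized at zero,
        # so that the initial prediction is T(s, a) = s + 0 = s. With one-hot
        # states, d(s, s') is the same constant for every s != s', which is what
        # the initialization condition (4.2.2) requires.
        self.A = np.zeros((n_states, n_actions, n_states))

    def predict(self, s_idx, a, s_onehot):
        return s_onehot + self.A[s_idx, a]

    def intrinsic_reward(self, s_idx, a, s_onehot, s_next_onehot):
        # Squared Euclidean distance between prediction and true next state.
        # Note: self-transitions (s == s') yield zero reward under this
        # initialization, the corner case discussed around eq. 4.1.4.
        return float(np.sum((self.predict(s_idx, a, s_onehot) - s_next_onehot) ** 2))

    def update(self, s_idx, a, s_onehot, s_next_onehot):
        # Learning condition (4.2.1): the prediction approaches the target
        # asymptotically at the same rate alpha for every (s, a) pair.
        error = s_next_onehot - self.predict(s_idx, a, s_onehot)
        self.A[s_idx, a] += self.alpha * error
```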


4.3 Large state spaces

When dealing with very large state spaces, certain state instances might only ever be observed once (or never). Achieving uniformity in visitation across all existing state-action pairs can easily become infeasible.

Take for example a navigation task with multiple rooms. Each room can be seen as a group of multiple states, where each state has a representation based on what it is like to view the room from that given perspective. States in the same room will have similar representations. Another way to approach the environment could be to treat each room as a singular state of an abstract MDP where all the possible observations of a given room contribute to reducing the intrinsic reward of that region.

Let’s instead consider the goal to be achieving uniformity in visitations across state-action pair clusters (i.e. abstract states). To generalize across similar states, function approximation can be used to model T (s, a). However, when moving to domains outside of the scope of tabular approaches, conditions 4.2.1 and 4.2.2 become hard to uphold.

Satisfying the learning condition (4.2.1) under function approximation is troublesome because generalization across (s, a) pairs is introduced in the transition model Tφ(s, a). This causes intrinsic

rewards to change for state-action pairs despite the given (s, a) pair not being visited.

Additionally, issues with satisfying the initialization condition (4.2.2) can arise from two different sources:

• It is not guaranteed that an MDP will provide a feature space where state representations are equidistant for all neighboring states (s, s′).

• Under function approximation, random initialization of the weights φ of the transition model T_φ(s, a) = s + A_φ(s, a) introduces variance in the magnitude of the output across (s, a) pairs. This results in a model where:

$\exists (s, a) \in S \times A: \quad |A_\phi(s, a)_{init}| \neq c$   (4.3.1)

and consequently:

$\exists (s, a, s') \in S \times A \times S: \quad T(s, a) = s' \;\wedge\; d\left(T_\phi(s, a)_{init}, s'\right) \neq c$   (4.3.2)

where c is an arbitrary constant and T(s, a) is the true transition function of the MDP.

4.4 Contrastive learning of abstract MDPs

4.4.1 Satisfying the learning condition

The general idea of the learning condition (4.2.1) is to have a prediction vector that asymptotically approaches a target vector. In a feedforward network, a similar process occurs when learning through stochastic gradient descent. Low learning rates will cause predictions to converge asymptotically to their target. Under the assumption that the effect of generalization is negligible, the learning condition (4.2.1) can be sufficiently upheld.

4.4.2 Satisfying the initialization condition

The process towards approximating the initialization condition (4.2.2) is two-fold. First, an abstract MDP should be created where all neighbouring state pairs (s, s′) have equal distance between their abstract representations (Z_θ(s), Z_θ(s′)). Second, the transition model T̄_φ(z, ā) should be initialized so that every prediction ẑ′ is equally distant from its corresponding target z′. For an MDP to naturally provide state representations where all neighbouring states are equally distant apart is highly unlikely (unless the MDP has representations specifically hand-designed). Furthermore, even if equidistance between all neighbouring state pairs (s, s′) is present in the raw feature space, large state spaces can require the use of a mapping Z : S → Z to reduce dimensionality. Using such an encoding function does not guarantee that the distances between representations of S will be preserved in the abstract space representations of Z.


Contrastively-learnt encoders can be used to optimize the encoding function Z_θ : S → Z so as to introduce structure into the abstract space. In this case, the goal is to approximate the necessary properties for the initialization condition (4.2.2) by optimizing for equidistance across all pairs of abstract representations of neighbouring states (Z_θ(s), Z_θ(s′)). This is achieved through the following energy functions for positive examples H and negative examples H̃:

$H = \frac{1}{K} \sum_{k=1}^{K} \max\left(0, \, d(z, z'_k) - \xi\right), \qquad \tilde{H} = \frac{1}{K} \sum_{k=1}^{K} \max\left(0, \, \xi - d(z, \tilde{z}_k)\right)$   (4.4.1)

where K is the number of elements per sample, z′_k is the k-th neighbouring state in the positive example sample, and z̃_k is the k-th element of the negative state sample.

To prevent states from moving away from each other indefinitely due to H̃, a hinge ξ is placed over the elements in the negative energy term. The same hinge value ξ is used to avoid neighboring states from moving too close together due to H.

The overall contrastive loss can then be written as:

$\mathcal{L}(\theta) = H + \tilde{H}$   (4.4.2)

Here, the aim is to maximize the distance between all states (up to a value of ξ), while minimizing the distance between neighbouring states (down to ξ). Minimizing the contrastive loss creates an abstract MDP where distances between neighbouring states tend to d(z, z′) = ξ for all valid (z, ā, z′) tuples.
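A minimal sketch of these energy terms (illustrative PyTorch code under assumed tensor shapes, not the thesis implementation) is given below; it uses the squared Euclidean distance as d, matching the rest of the text.

```python
import torch

def contrastive_energies(z, z_pos, z_neg, xi=0.1):
    """z: (M,) anchor embedding; z_pos: (K, M) embeddings of neighbouring states;
    z_neg: (K, M) embeddings of negative samples. Returns H, H_tilde of eq. 4.4.1."""
    d_pos = ((z.unsqueeze(0) - z_pos) ** 2).sum(dim=-1)   # d(z, z'_k)
    d_neg = ((z.unsqueeze(0) - z_neg) ** 2).sum(dim=-1)   # d(z, z~_k)
    H = torch.clamp(d_pos - xi, min=0.0).mean()           # keep neighbours within xi
    H_tilde = torch.clamp(xi - d_neg, min=0.0).mean()     # push other states beyond xi
    return H, H_tilde

# The overall loss of eq. 4.4.2 is then simply L = H + H_tilde.
```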

The second step towards ensuring the initialization condition (4.2.2) in the abstract space is to initialize the transition function Ā_φ(z, a) such that, for all valid (z, a, z′) tuples, all predictions T̄_φ(z, ā) = z + Ā_φ(z, a) = ẑ′ lie equally distant from their corresponding target z′. In an embedding space where d(z, z′) = ξ holds for all neighboring states (z, z′), condition 4.2.2 can be ensured by initializing the output of the abstract action function at the origin (i.e. Ā_φ(z, a) = 0) for all (z, a) pairs, such that:

$\forall (z, a, z') \in Z \times A \times Z: \quad \bar{T}(z, \bar{A}_s(a)) = z' \implies d\left(\bar{T}_\phi(z, a), z'\right) = d\left(z + \bar{A}_\phi(z, a), z'\right) = d(z, z') = \xi$   (4.4.3)

where T̄(z, ā) is the true transition function of the abstract MDP and Ā_s(a) is the abstract action corresponding to action a in abstract state z = Z(s).

However, when Ā_φ takes the form of a feedforward network, there is no easy way to predetermine the output at initialization. Despite this, common weight initialization strategies result in an output distribution that is centered around the origin [38]. If the variance in the output of Ā_φ(z, a) were to approach zero across all (z, a) pairs, equation 4.4.3 would hold.

To summarize, as all neighboring states become equally distant in abstract space, and as equidistance is approached across all transition model predictions and their targets, the initialization condition (4.2.2) is satisfied:

$\forall (z, a, z') \in Z \times A \times Z: \quad \bar{T}(z, \bar{A}_s(a)) = z' \implies d(z, z') \to c$
$\forall (z, a) \in Z \times A: \quad \bar{A}_\phi(z, a)_{init} \to 0 \implies d\left(\bar{T}_\phi(z, a)_{init}, z'\right) = c$   (4.4.4)

where T̄(z, ā) is the true transition function of the abstract MDP and Ā_s(a) is the abstract action corresponding to action a in abstract state z = Z(s).

4.5 Optimization of abstract space

Optimizing the encoding function Z_θ so as to achieve equal distance across all neighboring states

requires access to the full set of states in S. The optimization procedure to reduce the contrastive loss should be performed prior to the exploration period. However, the full set of states S of


complex MDPs is often not known prior to exploration. Instead, it is the exploration mechanism itself that one expects to use in order to acquire the full set of states.

The next best option could be argued to be to optimize the encoding function Z_θ during

the exploration period. Such an approach would begin exploring despite not approximating the initialization condition (4.2.2).

Take some measure of performance m that evaluates the ability of an exploration mechanism to explore the state space S uniformly. Approximating the initialization condition (4.2.2) prior to starting the exploration period would be expected to reduce the variance in PE across (s, a) pairs. In contrast, approximating the condition during the exploration period should reduce the variance progressively, but should have a starting variance comparable to using a random encoder. If following the reward-visitation properties outlined in 4.1.2 and 4.1.3 offers an intrinsic reward function better suited for exploration, then a pretrained encoding function Zθ∗ is, in terms of

m, expected to surpass an encoding function being trained during t exploration steps Zθ0:t. Also,

both of these approaches should be expected to surpass the expected performance of a randomly initialized, untrained encoder.

Furthermore, another hypothesis can be made regarding the set of states used to train the encoding function Z_θ. Ideally, the full set of states S is needed to approximate the initialization

condition (4.2.2) when optimizing the encoding function. However S is not available prior to exploration. In practice one has to resort to training with the set of states visited up to step t by the exploration policy Sπ0:t. Optimizing the encoder using S approximates the initialization

condition (4.2.2) to a greater extent than Sπ0:t would.

In general, through these arguments it can be postulated that optimizing Z_θ during exploration should yield a lower (or equal) performance m than in the pretrained setting. Furthermore, using a set of states $S_{\pi_{0:t}}$ during optimization should yield a lower (or equal) performance m than using the full set S. Also, any kind of training should yield better results than a randomly initialized, untrained encoder. The following inequalities can thus be put forward:

$m(\underbrace{Z_{\theta^*}}_{\text{pretrained}} \mid S) \;\geq\; m(\underbrace{Z_{\theta_{0:t}}}_{\text{trained during exploration}} \mid S) \;\geq\; m(\underbrace{Z_{\theta_{0:t}}}_{\text{trained during exploration}} \mid S_{\pi_{0:t}}) \;\geq\; m(\underbrace{Z_{\theta_0}}_{\text{not trained}})$   (4.5.1)

where θ* symbolizes the best encoder weights found during pretraining, $\theta_{0:t}$ symbolizes the set of encoder weights obtained across the first t updates during the exploration process, and $S_{\pi_{0:t}}$ is the set of unique states encountered across the first t steps of the exploration process.

4.6 Hypotheses

1. Satisfying the learning and initialization conditions (4.2.1, 4.2.2) in a tabular PE setting should replicate the exploration behaviour observed in a tabular count-based approach.

2. Using a feedforward network as a transition model can produce exploration behaviour equivalent to using a tabular transition model.

3. A contrastively-learnt abstract MDP can sufficiently approximate the necessary conditions for a PE setting to replicate the exploration behaviour observed in a tabular count-based approach.

4. Optimizing the contrastively-learnt encoder $Z_{\theta^*}$ prior to the exploration process yields better exploration results than training an encoder $Z_{\theta_{0:t}}$ during the exploration process. Both of these surpass the performance of an untrained encoder.

5. Optimizing the contrastively-learnt encoder $Z_{\theta_{0:t}}$ using the full set of states S yields better exploration results than optimizing it using the set of states visited during exploration $S_{\pi_{0:t}}$.


Chapter 5: Methodology

To evaluate the theoretical properties proposed in section 4, as well as the effect of introducing function approximation and latent spaces, this investigation will be performed in two parts. First, simple setups will be investigated in controlled grid worlds. These conditions will allow us to closely measure properties of the undergoing exploration such as the discovery rate of novel state-action pairs and the uniformity in visitations. The second part will be dedicated to investigating how the properties observed in the previous section generalize to exploration tasks of more complex environments.

The code repository used for this thesis containing all of the algorithm and environment implementations is publicly available at the following url:

https://github.com/NILOIDE/Adversarial curiosity

5.1 Investigation under controlled grid worlds

The grid world environments discussed in this section all share the same transition dynamics. At each given state, the agent has 4 available actions: moving in one of the 4 cardinal directions. Actions that would move the agent over the edge of the map cause the agent to appear on the opposite side of the grid. Furthermore, if the agent finds itself next to a wall, actions that would transition into the wall cause the agent to remain in place.

The agent’s starting position in each episode is on the center of the grid. The episode length is lenient enough for a policy to reach any state on the grid with a substantial margin. This opposes recent practices in the intrinsic motivation literature where episodes have no definitive terminal states. In such studies, the intrinsic Q-values of terminal state are bootstrapped using the starting state’s Q-value [7].

In the case of these grid world environments, having a maximum episode length aims to ground the exploration policy around the starting state in order to better observe its development over the course of learning. Furthermore, it should be noted that the agent is not aware of the episode’s length. No done signal is used when updating Q-values and the starting state is not used for the Q-value bootstrap at the terminal state. For all intents and purposes, the Q-value of the last transition in the episode is bootstrapped using the terminal state’s Q-value as if the episode were to continue.

5.1.1 Environments

Empty 42×42 grid world:

The simplest environment where we aim to investigate the formulations outlined in section 4 is a grid world empty of obstacles. The agent is free to move in all 4 directions one cell at a time. This includes the edges of the map, where the agent loops around to the other side.

The grid has 42 · 42 = 1,764 states. The episode length is 84, twice the length necessary to reach the states in the farthest corners of the grid from the starting position at the center of the grid.

The state representation is a one-hot vector of length 1,764 where each dimension represents a cell in the grid. The dimension corresponding to the cell the agent is currently in is represented by a 1.0, whereas empty cells take a zero value. This choice in representation provides the property of having equal distance (√2) between all state representations. This is aimed to satisfy the initialization condition outlined in 4.2.2 directly at the raw representation level (as opposed to requiring an intricate encoder). This state representation also aims to investigate how well the learning condition 4.2.1 generalizes when the tabular transition model is replaced by a feedforward network.
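As a quick check of the equidistance property mentioned above (an illustrative sketch, not part of the thesis repository), any two distinct one-hot states of this grid lie at Euclidean distance √2, i.e. at a squared Euclidean distance of 2:

```python
import numpy as np

GRID_SIZE = 42
N_STATES = GRID_SIZE * GRID_SIZE  # 1,764 states

def one_hot(cell_index):
    s = np.zeros(N_STATES)
    s[cell_index] = 1.0
    return s

s_a, s_b = one_hot(0), one_hot(1000)
assert np.isclose(np.linalg.norm(s_a - s_b), np.sqrt(2.0))  # Euclidean distance
assert np.isclose(np.sum((s_a - s_b) ** 2), 2.0)            # squared Euclidean distance
```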


Random feature 42×42 grid world:

This environment is almost identical to the empty grid world outlined above, except for its feature representation. The feature vector remains of length 1,764, but each dimension has a randomly initialized float value selected uniformly from the range [0.0, 1.0). This environment aims to test how the outlined approaches behave when the initialization condition 4.2.2 is not upheld.

Spiral 28×28 grid world:

This grid environment (see figure 5.1.1) contains a spiral structure of walls (green) that restrict the agent from moving into them. The agent starts at the center of the spiral (red). The spiral corridor has a width of 3 cells.

The environment aims to test how well the aforementioned approaches deal with highly linear exploration directions. The agent is required to travel through a long, thoroughly explored path (low PE) in order to reach areas rich in intrinsic rewards. This environment is also interesting from a theoretical perspective, as it is an MDP that has states with self-transitions. When moving into a wall, the position of the agent does not change. As mentioned in section 4, (s, a, s′) tuples where s = s′ have zero prediction error at initialization in a tabular PE approach. This creates issues with the consistency of intrinsic rewards with respect to the number of visitations (see the discussion surrounding eq. 4.1.4).

Similar to the empty 42×42 grid world, the state representation is a one-hot vector of length 28 · 28 = 784 where each dimension represents a cell in the grid. The dimension corresponding to the cell the agent is currently in is represented by a 1.0, whereas empty cells take a zero value. Due to walls being inaccessible, there are only 602 accessible states for the agent to navigate into.

The episode length is set to 784, enough to visit every state at least once from the starting state before the episode ends. This is decided differently from the empty grid world because travelling in a spiral outwards from the starting state takes a lot more steps to reach the outermost states.

Figure 5.1.1: Spiral 28×28 grid world. The red cell indicates the starting state. Green cells are walls that the agent can not move into.

5.1.2 Evaluation measures

There are multiple aspects that might make a global exploration function desirable, ranging from visitation uniformity to the rate of discovery of new states. These are measured over a short history of transitions in order to ensure states are continuously visited (as opposed to only having been visited in the distant past). The history length L for each environment is calculated as |S| · 20. This would allow an ideal agent to visit each state 20 times under a uniform policy∗.

The exploration properties investigated in the grid worlds outlined in section 5.1.1 are described below.


• Number of states visited: The number of unique states in the visitation history, $|S_{\pi_{t-L:t}}|$, where t is the current update step and L is the history length.

• Visitation uniformity in the full state space: Visiting parts of the state space only once might not be sufficient to fully learn the dynamics of that region. Uniformity across the state space is measured as the total difference between the visitation density and the ideal density under uniform visitation:

$\sum_{s \in S} \left| \mu_\pi(s) - \frac{1}{|S|} \right|$   (5.1.1)

• Visitation uniformity in the visited state space: The difference between the visitation density and the density under uniform visitation over all unique visited states in the visitation history:

$\sum_{s \in S_{\pi_{t-L:t}}} \left| \mu_\pi(s) - \frac{1}{|S_{\pi_{t-L:t}}|} \right|$   (5.1.2)

This measure might prove a more meaningful exploration property than uniformity across the full state space when dealing with very large spaces. In MDPs where uniform exploration of the full set of states might be infeasible, a more sensible approach might be to steadily increase the reach of the current exploration function. (A sketch of how these measures can be computed follows this list.)
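The sketch below (illustrative only; function and variable names are assumptions, not the thesis code) computes the three measures from a buffer containing the last L visited state indices.

```python
from collections import Counter

def exploration_measures(history, n_states):
    """history: list of the last L visited state indices; n_states: |S|."""
    L = len(history)
    counts = Counter(history)
    mu = {s: c / L for s, c in counts.items()}   # empirical visitation density

    n_visited = len(counts)                      # number of unique states visited
    # Eq. 5.1.1: deviation from uniformity over the full state space
    # (unvisited states each contribute |0 - 1/|S||).
    full_uniformity = sum(abs(mu.get(s, 0.0) - 1.0 / n_states) for s in range(n_states))
    # Eq. 5.1.2: deviation from uniformity over the visited states only.
    visited_uniformity = sum(abs(mu[s] - 1.0 / n_visited) for s in counts)
    return n_visited, full_uniformity, visited_uniformity
```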

5.1.3 Architectural choices

Policy and Q-values

In order to minimize external influencing factors when comparing world models, tabular Q-learning was used across this entire section. The state space is small enough for no function approximation to be required when determining Q-values.

Using prediction error to generate intrinsic rewards causes Q-values to be non-stationary, as the prediction error for a transition (s, a) changes during learning. In order to reduce the extent to which Q-values trail the intrinsic rewards currently being received, no learning rate is employed. The update step is thus the following:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \left( r_t + \gamma \cdot \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right)$   (5.1.3)

where the discount factor γ was chosen to be 0.99.

Q-values are initialized at 0.0. In order to help the learning process escape local minima, a small amount of random local exploration is introduced into the policy by using an ε-greedy strategy, where ε is set to 0.1. When taking a greedy action, if multiple actions share the highest Q-value, ties are broken at random. This is most important when visiting a new state where all Q-values are 0.0, so as to prevent a bias towards a certain action.
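A minimal sketch of this policy and update rule (illustrative, not the thesis implementation; the 4-action count matches the grid worlds described above):

```python
import numpy as np
from collections import defaultdict

N_ACTIONS = 4          # the grid worlds expose 4 cardinal actions
GAMMA, EPSILON = 0.99, 0.1
rng = np.random.default_rng(0)
Q = defaultdict(lambda: np.zeros(N_ACTIONS))   # Q-values initialized at 0.0

def select_action(s):
    # Epsilon-greedy with random tie-breaking among maximal Q-values.
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    q = Q[s]
    return int(rng.choice(np.flatnonzero(q == q.max())))

def q_update(s, a, r_int, s_next):
    # Eq. 5.1.3 with no learning rate: the update moves Q(s, a) fully onto the
    # bootstrapped target; no done signal is used, per the text above.
    Q[s][a] = Q[s][a] + (r_int + GAMMA * Q[s_next].max() - Q[s][a])
```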

World models

In order to observe the resulting behaviour arising from the outlined conditions in section 4, first a tabular count-based approach is used to create a baseline. Next, a fully tabular PE approach is used to see how well it resembles the count-based approach. Finally, world models employing function approximation will be compared against the tabular baselines.

• Tabular count-based baseline: The count-based approach uses a data structure to track how many times each (s, a) pair has been visited. Intrinsic rewards are determined by:

$r^{int} = \frac{1}{N(s, a)}$   (5.1.4)


• Tabular transition model: The tabular PE approach uses a tabular transition model where each state-action (s, a) combination has an independent prediction vector. The transition prediction vector for every (s, a) is initialized at the origin. The vector is subsequently updated toward the corresponding s′ upon each visitation of the transition:

$A_s(s, a) \leftarrow A_s(s, a) + \alpha\left(s' - (s + A_s(s, a))\right)$   (5.1.5)

where α is the learning rate.

The approaches using function approximation use a single hidden-layer neural network as the transition function Tφ. Given a state representation and an action, the network outputs the state

representation of the next state ŝ′ = T_φ(s, a) = s + A_φ(s, a). In order to more accurately compare

against the update step in the tabular transition model, the following methods are trained using stochastic gradient descent with single element batches.

One noteworthy insight from preliminary experiments is that, assuming equal starting random seed across runs, the exploration behaviour is identical regardless of the learning rate used on the transition model. The only differences arise in the relative magnitudes of the prediction errors (which does not change the choices made by greedy action selection).

In all the following approaches, including the ones predicting in the embedding space, the transition function T_φ has a single hidden layer with 64 rectified linear units, followed by a linear

output layer. Due to working with a discrete action space, the action is fed to the network in the form of a one-hot vector of size |A|, where a 1.0 represents the action taken in a transition.

• State prediction WM: This approach uses a feedforward network to predict directly on the raw state-space. The input consists of the state’s feature vector concatenated with a one-hot action representation vector. The network is supervised using the transition loss of the next state:

$\mathcal{L}(\phi) = \left(T_\phi(s, a) - s'\right)^2$   (5.1.6)

The following approaches perform the forward prediction in embedding space. The world model is thus two-part: an encoder function Z_θ(s), followed by the same transition network structure outlined above, operating on embeddings as T̄_φ(z, a). The encoder for each approach has the same structure: a single hidden layer with 64 rectified linear units, followed by a linear output layer. The latent space is set to be 32-dimensional across all grid world environments.

• Random encoder: Prediction is performed in a static, randomly initialized embedding space. The weights of the encoder function Zθ are not trained. The transition network is trained using the transition loss based on the next state's embedding z':

L(φ) = ( T̄_φ(z, a) − z' )²     (5.1.7)

• Contrastively-learnt encoder: In addition to the embedding transition loss, a contrastive loss is added to the update step. Negative sampling is used to select 10 random states. The distance between the negative samples' embeddings and the current state's embedding is used to compute the contrastive loss:

L(φ, θ) = ( T̄_φ(z, a) − z' )²                                        [transition loss]
          + (1/K) Σ_{k=1}^{K} max( 0, d( Z_θ(s'), z'_k ) − ξ )        [positive example loss]
          + (1/K) Σ_{k=1}^{K} max( 0, ξ − d( Z_θ(s'), z̃_k ) )         [negative example loss]     (5.1.8)

Here, Zθ(s) represents embeddings through which the encoder is trained, while the lower-case z notation represents embeddings through which the loss with respect to θ is not propagated into the encoder. Preliminary experiments showed that a hyperparameter value of 0.1 was an acceptable choice for ξ. A code sketch of this contrastive update is given at the end of this list of methods.


Three experimental conditions regarding a contrastive approach are investigated:

– Encoder pretrain: Prior to the exploration process, the encoder is optimized for 3 million steps (the same number of steps as used by the methods that optimize the encoder during exploration).

After pretraining, the variance for each environment was reduced by the following amounts:

∗ Empty 42×42 grid world: 265-fold decrease (from 0.106 to 4.0 · 10^-4)
∗ Spiral 28×28 grid world: 225-fold decrease (from 0.109 to 4.8 · 10^-4)
∗ Random feature 42×42 grid world: 11-fold decrease (from 0.121 to 0.011)

– Full state space sampling: The encoder Zθ and transition model T̄φ are trained synchronously (1:1). Negative examples are sampled from the full state space S.

– Visited state space sampling: The encoder Zθ and transition model T̄φ are trained synchronously (1:1). Negative examples are sampled from a buffer of recently visited states.

For the condition using negative examples sampled from visited states, a buffer was used as opposed to the set of unique visited states in recent history S^π_{t−L:t} (where L is the history length). As a result, the negative samples are not drawn uniformly from S^π, but instead are drawn proportionally to the visitation density µ^π(s, a) of the policy π. This was shown to obtain surprisingly similar results to sampling uniformly from S^π_{t−L:t}, but at a fraction of the computational cost, as extracting S^π_{t−L:t} has complexity O(L · |S|). Sampling from a buffer also has the benefit of offering a more direct comparison to the conditions under which this algorithm has to operate when dealing with complex observation spaces, where processing of recent history is not trivial.

• Variational autoencoder: The encoder in this method also outputs standard deviations for each output dimension. The standard deviations are not included in the transition model T̄φ's input or output vectors. The embedding space is optimized to reconstruct the input state representation by minimizing the negative evidence lower bound [39]. The decoder is made up of a single hidden layer with 64 rectified linear units, followed by a linear output layer.

L(φ, θ) = ( T̄_φ(z, a) − z' )²  [transition loss]  +  L_vae(θ)  [VAE loss]     (5.1.9)

Here, z represents embeddings through which the loss with respect to θ is not propagated into the encoder.

It should be noted that the VAE loss does not form part of the prediction error through which the intrinsic rewards are generated. Intrinsic rewards are created exclusively from the transition loss.

• Inverse dynamics features [23]: Given an (s, a, s') tuple, the encoding function is optimized to predict the action a that took place based on (s, s'). The latent representations Zθ(s) and Zθ(s') are concatenated and fed to the action prediction network gψ(Zθ(s), Zθ(s')). The action prediction network gψ has a single hidden layer with 64 rectified linear units, followed by a linear layer with a softmax output over the set of possible actions.

L(φ, ψ, θ) = ( T̄_φ(z, a) − z' )²  [transition loss]  +  ( g_ψ(Z_θ(s), Z_θ(s')) − a )²  [action prediction loss]     (5.1.10)

Here, Zθ(s) represents embeddings through which the encoder is trained, while the lower-case z notation represents embeddings through which the loss with respect to θ is not propagated into the encoder.
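Closing this list of methods, the sketch below illustrates the contrastive update of equation (5.1.8). It is a minimal sketch under the following assumptions: d is the Euclidean distance, the negatives are detached embeddings of the K sampled states, and the positive example embeddings are supplied as an argument since their construction is not restated in this excerpt. All function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_world_model_loss(z, a, z_next_detached, enc_s_next,
                                 positive_embs, negative_embs,
                                 transition_model, margin=0.1):
    """Hinge-based contrastive loss on top of the embedding transition loss.

    z, z_next_detached : detached embeddings of s and s' (no gradient into the encoder).
    enc_s_next         : Z_theta(s') with gradients, used to train the encoder.
    positive_embs      : (K, D) detached embeddings treated as positive examples.
    negative_embs      : (K, D) detached embeddings of the K sampled negative states.
    """
    # Transition loss: squared error of the predicted next embedding
    # (this term alone provides the intrinsic reward).
    pred_next = transition_model(z, a)
    transition_loss = ((pred_next - z_next_detached) ** 2).sum()

    # Euclidean distances between the trainable next-state embedding and the examples.
    d_pos = torch.norm(enc_s_next.unsqueeze(0) - positive_embs, dim=-1)
    d_neg = torch.norm(enc_s_next.unsqueeze(0) - negative_embs, dim=-1)

    positive_loss = F.relu(d_pos - margin).mean()   # pull positives within the margin xi
    negative_loss = F.relu(margin - d_neg).mean()   # push negatives beyond the margin xi

    return transition_loss + positive_loss + negative_loss
```

In the visited-state-space sampling condition, negative_embs would be the detached encoder outputs of states drawn from the buffer of recently visited states; in the full-state-space condition they would be drawn uniformly from S.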


5.2 Complex environments

The second part of this investigation aims to test how the insights obtained from the grid world experiments generalize to environments with more complex dynamics. More specifically, the encoding functions will be examined in a deep RL context with environments where the raw state-space is provided at the pixel level.

5.2.1 Environments

The environments explored in this section will be two Atari game titles from the OpenAI gym package [40] commonly used in the intrinsic motivation literature:

• Breakout-v0: In this game, the player must use a ball to break bricks. The player controls a paddle with which the ball can be hit. Failure to hit the ball causes the ball to fall off the bottom of the screen, which causes the game to end.

• Riverraid-v0: In this game, the player controls a plane from a top-down perspective. While the plane moves forward, the player must avoid obstacles and gather more fuel. Failure to do either causes the game to end.

Figure 5.2.1: Example observations of the used Atari environments: (a) Breakout-v0, (b) Riverraid-v0.

5.2.2 Evaluation measures

The aforementioned environments do not provide direct access to their dynamics or full state-spaces. This makes gathering transitions for the purpose of pretraining an encoder a non-trivial process. Furthermore, their set of states is too large to meaningfully apply the measures proposed for the grid world environments. However, as discussed in [7], the extrinsic reward functions provided in many human-oriented games are typically well aligned with exploratory behaviour. For this reason, the environments outlined in 5.2.1 are commonly used in curiosity-related RL literature and make a good fit for this investigation.

In Breakout, extrinsic rewards are provided upon breaking bricks. The breaking of bricks changes the pixel configuration, which in turn provides high prediction error. In addition to this, failing to hit the ball causes the game to restart. This leads to repeated visitation of the starting states, which provide low prediction error due to their frequent visitation. Ideally, the skills learnt in the curiosity task correlate with the skills necessary to obtain high extrinsic rewards.

The same line of thinking applies to Riverraid. The game’s reward function provides extrinsic motivation for moving forward, which also happens to be a method through which to discover novel states.

For this reason, the evaluation metric for this section will be the extrinsic reward, even though the agent never maximizes this measure directly.


5.2.3 Architectural choices

Policy and Q-values

The architectural choices for this section reflect those in standard deep Q-learning literature [2]. The state representation consists of the 4 most recent grayscale frames, rescaled to a size of 84×84 and stacked in the channel dimension. The Q-network has 3 convolutional layers, followed by a fully-connected layer of 512 rectified linear units. One final linear layer is trained to output the Q-values for every action available to the agent. The batch size used during learning is 32, and the optimizer used is Adam [41].

When training the Q-network, in order to mitigate the issue of ever-decreasing rewards, intrinsic rewards are normalized by the mean and standard deviation of the last 1,000 observed rewards. When updating a Q-value, a target Q-network is used to compute the bootstrapped target Q-value. The target Q-network is updated every 1,000 steps by copying the weights from the Q-network. The discount factor γ is set to 0.99.
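A minimal sketch of this normalization step is shown below. The class name and the small epsilon guarding against division by zero are illustrative assumptions, as is the exact form of the standardization (the text specifies normalization by the mean and standard deviation of the last 1,000 rewards, which is read here as subtracting the mean and dividing by the standard deviation).

```python
from collections import deque
import numpy as np

class RewardNormalizer:
    """Normalizes intrinsic rewards by the mean and std of the last 1,000 observed rewards."""

    def __init__(self, window=1000, eps=1e-8):
        self.buffer = deque(maxlen=window)  # sliding window of recent intrinsic rewards
        self.eps = eps

    def __call__(self, r_int):
        self.buffer.append(float(r_int))
        rewards = np.asarray(self.buffer, dtype=np.float64)
        return (r_int - rewards.mean()) / (rewards.std() + self.eps)
```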

The policy follows an ε-greedy strategy, where ε is set to 0.1 for the entire duration of training.

World models

Choices in world model architectures follow those of [7]. States are represented as the 4 most recent grayscale frames, rescaled into a size of 84×84, and stacked in the channel dimension. Encoders have 3 convolution layers with the same structure as the Q-network (although parameters aren’t shared). This is followed by a fully-connected layer of 512 rectified linear units. One last layer with no non-linearity is used to output the final embedding of 512 dimensions.

The training batch size is 32, and the resulting prediction errors are directly used as intrinsic rewards to train the DQN algorithm. Unlike the Q-network, the world model is updated using stochastic gradient descent; preliminary experiments showed that the Adam optimizer caused latent spaces to become too unstable.

The transition model is a simple NN with a single hidden layer of 256 nodes and a ReLU non-linearity. This model takes an observation's embedding as input and is trained to output the subsequent observation's embedding.
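A PyTorch sketch of this encoder and forward model is given below. The (32, 64, 64) filter counts, kernel sizes, and strides are assumed from the standard DQN trunk of [2] rather than restated in this section, and feeding a one-hot action to the forward model mirrors the grid-world transition model; all class and variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to a 512-dimensional embedding."""

    def __init__(self, embed_dim=512):
        super().__init__()
        # Assumed DQN-style trunk: (32, 8x8, s4) -> (64, 4x4, s2) -> (64, 3x3, s1).
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU())
        self.head = nn.Linear(512, embed_dim)  # no non-linearity on the final embedding

    def forward(self, x):  # x: (B, 4, 84, 84)
        h = self.conv(x).flatten(start_dim=1)
        return self.head(self.fc(h))

class ForwardModel(nn.Module):
    """Single hidden layer of 256 ReLU units predicting the next embedding."""

    def __init__(self, embed_dim=512, num_actions=4, hidden=256):
        super().__init__()
        self.num_actions = num_actions
        self.net = nn.Sequential(
            nn.Linear(embed_dim + num_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, z, a):
        a_onehot = F.one_hot(a, num_classes=self.num_actions).float()
        return self.net(torch.cat([z, a_onehot], dim=-1))

# Per-sample squared prediction errors serve as intrinsic rewards for the DQN agent, e.g.:
# r_int = ((forward_model(z, a) - z_next.detach()) ** 2).mean(dim=-1)
```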

The IDF-trained encoder uses an NN to learn to predict the action taken. This NN has a single hidden layer of 256 nodes with a ReLU non-linearity. The final layer is a softmax, which is supervised using a one-hot vector indicating which action was taken in the given transition.
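A sketch of this inverse-dynamics head is shown below, under the assumption that the two 512-dimensional embeddings are simply concatenated before the hidden layer; the class name, default action count, and training details are illustrative, not the exact implementation.

```python
import torch
import torch.nn as nn

class InverseDynamicsHead(nn.Module):
    """Predicts the action taken from the concatenated embeddings of s and s'."""

    def __init__(self, embed_dim=512, num_actions=4, hidden=256):
        super().__init__()
        # num_actions depends on the environment's action set.
        self.net = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, z, z_next):
        logits = self.net(torch.cat([z, z_next], dim=-1))
        return torch.softmax(logits, dim=-1)  # distribution over actions

# The softmax output is supervised with a one-hot vector of the action taken;
# cross-entropy on the logits would be a common alternative formulation.
```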

The decoder used to train the VAE has the same intermediate architecture dimensions as the encoder, but in reverse, with up-convolutions used in place of the convolutional layers.


Chapter 6

Results & Analysis

6.1 Grid worlds

In this section, the three exploration measures outlined in section 5.1.2 are discussed for each grid world environment. The results displayed in each environment's figures represent averages over 5 exploration runs (along with corresponding standard deviations) across 3 million update steps. Each of the 5 exploration runs was performed with a different random seed.

For reference, the average performance of a purely random policy (ε = 1.0) has been added to each plot (blue).

6.1.1 Empty 42×42 grid world

Both figures 6.1.1 and 6.1.2 appear to tell a similar story about all of the methods. Surprisingly, the tabular PE approach (orange) explores better in every metric compared to the count-based approach (green). Both tabular approaches, as expected, perform much better than any of the function approximation approaches in terms of visited states and overall uniformity. However, when it comes to uniformity in visited states, figure 6.1.3 shows that almost all other approaches have a more uniform exploration of visited states than the two tabular approaches.

Moving into the function approximation setting, using a neural network to predict across raw feature space (red) outperforms all encoded approaches.

The contrastive approaches seem to mostly behave as predicted in 4.5.1. Pretraining a contrastive encoder (brown) prior to exploration results in the highest state count score (figure 6.1.1), as well as the best full state-space uniformity measure (figure 6.1.2). Updating throughout the exploration process seems to outperform an untrained encoder (purple), but only as long as negative samples are drawn from the full state space (pink). Sampling from the set of visited states (gray) appears to perform worse than an untrained encoder (purple). However, when it comes to uniformity across visited states (figure 6.1.3), all contrastive approaches seem to perform better than the untrained encoder (purple), yet roughly equally among themselves.

Lastly, the variational autoencoder (yellow) seems to fall drastically in performance after 1 million steps across all measures, while the inverse dynamics encoder (cyan) has a performance comparable to the non-encoded NN approach (red).

Figure 6.1.1: Unique states visited in the last 35,280 update steps (max: 1,764). Higher is better. Performance of a random policy is added for reference (blue).


Figure 6.1.2: Distance from uniform visitation in the last 35,280 update steps. This is measured across full state-space. Lower is better. Performance of a random policy is added for reference (blue).

Figure 6.1.3: Distance from uniform visitation within visited states in the last 35,280 update steps. States with no visitations in the last 35,280 steps are excluded. Lower is better. Performance of a random policy is added for reference (blue).

6.1.2 Random feature 42×42 grid world

Again, the results in figures 6.1.4 - 6.1.6 show that the tabular PE approach (green) surpasses the count-based approach (orange). In this environment, the tabular PE method should in theory suffer from high variance intrinsic rewards due to prediction errors having a wide initial range of magnitudes, yet it outperforms the count-based method by a significant margin.

This time around, the non-encoded NN approach (red) appears to take a lot more time to reach the maximum number of visited states (figure 6.1.4). When compared to the same method in the empty grid world environment (figure 6.1.1), the standard deviation here is much higher and it takes the method an additional 500,000 updates to reach the maximum now that representations are randomly initialized.

Interestingly, among the contrastive approaches, the pretrained encoder (brown) does not perform best here. The encoder trained online with full state-space sampling (pink) performs significantly better than any other encoded method. It is possible that the pretrained contrastive encoder (brown) could not be optimized to as low a value as in the other environments, which may have affected its performance. The contrastive (full space sampling) method (pink) even surpasses the non-encoded NN approach (red) for most of the exploration period, although it never quite reaches the maximum number of states (figure 6.1.4).


In this environment too, the contrastive encoder trained with samples of visited states (gray) shows marginally better performance than the random encoded approach (purple). Among the uniformity measures (figures 6.1.5 and 6.1.6), all contrastive approaches have relatively similar performances.

The variational autoencoder (yellow) again seems to provide results that are worse than random across all measures. Similarly, the inverse dynamics encoder (cyan) struggles to achieve consistent results that are better than those of a fully random policy.

Figure 6.1.4: Unique states visited in the last 35,280 update steps (max: 1,764). Higher is better. Performance of a random policy is added for reference (blue).

Figure 6.1.5: Distance from uniform visitation in the last 35,280 update steps. This is measured across full state-space. Lower is better. Performance of a random policy is added for reference (blue).
