THE EFFICIENCY OF ADVERSARIAL STATE EMBEDDINGS IN MODEL-BASED RL TASKS

THE EFFECTS OF USING LATENT SPACES OF ADVERSARIAL NETWORKS TO ENCODE VISUAL INPUT FOR THE WORLD MODELS ARCHITECTURE

BY ALEJANDRO GONZÁLEZ ROGEL
a.gonzalezrogel@student.ru.nl (s4805550)

AND SUPERVISED BY PROF. DR. M.A.J. VAN GERVEN
m.vangerven@donders.ru.nl
Abstract

Unravelling all the information received from our surroundings is key to understanding and interacting with the outside world. The outcome of this operation determines the performance of any other step in the reasoning process, thus making it a crucial element of any agent's learning experience. This thesis explored how different representations of visual information affected the ability of the World Models architecture to find an optimal policy using online, offline and hybrid training procedures. To that end, we replaced its original perception module with other alternatives that, with the same neural architecture, imposed different biases and defined different elements of interest. We attempted to promote disentanglement using β-VAE, and let the model define important high-level features in an adversarial fashion using VAE-GAN. We showed that VAE-GAN can be an alternative to traditional autoencoders when encoding visual input in a reinforcement learning setup. Not only that, but this technique improved the final performance of several of our configurations. To the best of our knowledge, this was the first piece of work to use an adversarial architecture to encode sensory input for a reinforcement learning task. Additionally, we were able to test the World Models architecture on a new training procedure that alternated between training in the real world and inside the model's imagination. Unfortunately, we could not directly encourage the creation of disentangled latent spaces under the current configuration, but we still provided a qualitative analysis of this characteristic for all our approaches.

Keywords Visual embeddings; knowledge representation; model-based reinforcement learning; generative adversarial networks; World Models; VAE; VAE-GAN

Contents

1 Introduction
1.1 Sensory data and reinforcement learning
1.2 Project plan
1.3 Contributions
1.4 Structure of the thesis

2 Background and related work
2.1 A brief definition of reinforcement learning
2.2 Model-based methods
2.2.1 The World Models architecture
2.3 Visual embeddings
2.3.1 Architectures to transform input data into a latent space
2.3.2 Usefulness, quality and evaluation of the latent space
2.4 Reinforcement learning and state representations

3 Methods
3.1 Architecture
3.1.1 World Models
3.1.2 Vision module
3.1.3 Memory module
3.1.4 Controller module
3.1.5 Training “inside a dream”
3.2 Environments
3.2.1 ViZDoom: Take Cover (ViZDoom-tc)
3.2.2 ViZDoom: Health Gathering Supreme (ViZDoom-hgs)
3.3 Data collection and usage

4 Results
4.1 Experiment 1: Learning online
4.2 Experiment 2: Learning inside a dream
4.3 Experiment 3: Alternating online and offline training

5 Discussion
5.1 Qualitative analysis of the latent space
5.1.1 Visual analysis of our reconstructions
5.1.2 Interpolation
5.1.3 Dimensionality reduction
5.2 The effects of our visual encodings on an RL task
5.2.1 Learning online
5.2.2 Learning inside a dream
5.2.3 Alternating online and offline training
5.3 Answering our research questions

6 Future work

7 Conclusions

Appendix A Additional experiments and results
A.1 The CarRacing-v0 environment: Comparing our World Models implementation to the original one in an online training task
A.2 The influence of the size of the latent space during reconstruction
A.3 The free bits strategy: diminishing the importance of the Kullback–Leibler divergence
A.4 The influence of our visual encoding distribution in the memory module
A.5 Considerations and advice on training VAE-GAN
A.6 Understanding which high-level features VAE-GAN learns

Appendix B Network architectures and loss functions
B.1 Architectures
B.1.1 Shapes and nomenclature of our diagrams
B.1.2 Vision module
B.1.3 Memory module
B.1.4 Reward module
B.1.5 Controller module
B.2 Loss functions
B.2.1 Mathematical notation
B.2.2 Vision module: VAE
B.2.3 Vision module: β-VAE
B.2.4 Vision module: VAE-GAN
B.2.5 Memory module
B.2.6 Reward module
B.2.7 Controller module

Appendix C Execution parameters and examples
C.1 Code and implementation
C.2 Execution parameters
C.3 Internal dynamics of the environments
C.3.1 ViZDoom: Take Cover (ViZDoom-tc)
C.3.2 ViZDoom: Health Gathering Supreme (ViZDoom-hgs)
C.4 Example executions in GIF format
C.5 Reconstruction examples
C.5.1 ViZDoom-tc
C.5.2 ViZDoom-hgs


1 Introduction

1.1 Sensory data and reinforcement learning

To interact with the world around us, we first need to learn how to analyze all available sensory information correctly. And, among all senses, we human beings find the sense of sight key to understanding our surroundings. Without us even noticing, our brains are able to manage a constant flow of rich, complex and interrelated images. Thus, if we were to design artificial models capable of successfully communicating with the real world, they too should include the necessary mechanisms to discover, on their own, how to properly interpret visual input in a way that is valuable for any subsequent cognitive process.

In such a case, our first step could be to represent visual information in a compact and comprehensible structure while still preserving any important features we might need in the future. By extracting all relevant characteristics, and leaving the rest behind, we could facilitate its later use and reduce the complexity of our model. In artificial intelligence, we might refer to this set of extracted features as a latent space and, as in many other tasks related to computer vision or image analysis, researchers have recently proposed multiple techniques to generate such spaces [8, 10, 12, 22, 29, 32, 34, 55].

However, there is no standard definition of what a good latent space is [34, 38, 43, 54]. And, depending on how we decide to answer that question, we might introduce biases that could affect the interpretability and usefulness of the data in subsequent steps [37, 38]. It is thus of vital importance to study the possible effects that these changes might have on downstream tasks.

A common hypothesis considers that a latent representation whose individual units contain unique generative factors¹ could help in a future task [22, 29, 37, 43]. These are called disentangled representations [1], and they would, intuitively, keep the independence of qualitative characteristics of an input image. A more recent line of work, not mutually exclusive with the idea of disentanglement, relies on generative adversarial networks (GANs) [17] to internally learn a good definition of the given data [8, 10, 12, 34]. Here, models learn to encode information while optimizing for a different task, a competition between two or more networks, and constantly update their idea of a good latent space as their game progresses. For this thesis, we worked on both approaches. Specifically, we compared the effectiveness of a variational autoencoder (VAE) [32], which we used as a baseline, with the latent spaces generated by β-VAE [6, 22], a model specifically designed to promote disentanglement, and VAE-GAN [34], a hybrid between an autoencoder and an adversarial architecture. We were unfortunately not able to directly encourage disentanglement using β-VAE, but we provided a qualitative analysis of the internal structure of the encoded space to search for, and compare, different factors of variation.

However, as we have already mentioned, our main purpose is not to directly evaluate the quality of the latent space but rather to see how any of these changes affect the performance of a reinforcement learning (RL) algorithm. This comes with an additional benefit: we have a common metric, the final performance of the model, with which we can compare different approaches. This differs from most of the previously cited works on visual embeddings, which assess the quality of their latent spaces in a generative task². And, again, because there is no common definition of what a good reconstruction, or a good latent space, is, some of them define their own evaluation metric, making it hard to compare results quantitatively with other models [13, 22, 29, 38, 43].

We chose the World Models architecture [19] to solve our RL problems. This is a model-based algorithm that can be subdivided into independent modules, which grants us plenty of flexibility in terms of selecting which model we wish to use for generating our latent space. Not only that, but World Models also gives us the possibility to generate a virtual model of the environment in which to train a real-world agent. This increases the importance of a good visual embedding: we do not only use it to capture sensory information; it also affects the ability of the architecture to generate an “imaginary world”, a critical task in any model-based solution.

¹ Also called factors of variation or explanatory factors, they refer to the underlying, and independent, elements that are part of the real distribution we are trying to encode into our latent space. When dealing with images, researchers have treated shape, position or illumination as generative factors [8, 44].

² A problem where a model needs to recreate previously unseen data from a real, previously learned, distribution.

1.2 Project plan

We evaluated the influence that different visual embeddings may have in the context of reinforcement learning. We initially selected three different autoencoder architectures, each with a different definition of what a good latent space should look like: VAE, whose encodings aimed to reduce pixel-wise errors in a reconstruction task; β-VAE, which prioritized disentanglement over a perfect reconstruction; and VAE-GAN, which ignored the previous pixel-based reconstruction metric and introduced a new one based on self-learned high-level features. We later discovered that our configuration did not allow β-VAE to capture valuable features of the input data, so we used it as an alternative version of our VAE model. We will describe all these algorithms in depth in Section 3.1.2 and Appendix B.2.

Using World Models to interact with our environments, we carried out three different experiments:

– Experiment 1: We tested the influence of our state representations on the final performance of the World Models architecture when training in the real world. This gave us an insight into how different forms of encoding can alter the final policy of the network and how they affect the internal dynamics model. Additionally, it served as a reference point for the following experiments.

– Experiment 2: We tested the influence of our state representations on the final performance of the World Models architecture when training “inside the agent’s dreams”. This experiment gave greater importance to the quality of our latent representations, as the training procedure relied solely on the memory module and its ability to interpret and predict state encodings. Furthermore, it allowed us to compare our results to the original implementation.

– Experiment 3: We tested the influence of our state representations on the final performance of the World Models architecture when alternating between online and offline training. This added value to the original World Models work, which did not evaluate this option. Additionally, it avoided training fully in the “dream world”, as we saw it could be unreliable if the model had not completely learned the dynamics of the original environment. Moreover, we thought this form of training could help us overcome local optima and/or partially reduce the training time.

Additionally, we performed a qualitative analysis of all our latent spaces by applying interpolation and dimensionality reduction techniques.

Finally, we used two environments, both based on the ViZDoom library [27]. The first one was part of the original World Models research paper. The other one, however, involved a more complex task, placing the agent in a much larger arena, allowing 3D exploration and forcing the network to predict a sparse reward signal. Thus, using this new scenario, we were able to evaluate whether the architecture would generalize to harder environments.

1.3 Contributions

The main purpose of this dissertation was to compare different approaches to the creation of latent state representations in RL problems. The effects of disentangled representations on this field were largely unexplored [23], and, to our knowledge, this was the first time that GANs were successfully used to encode visual information for an RL problem.

In the process, we were able to expand upon the previously shown capabilities of the World Models architecture. In particular, we tested its ability to adapt to more demanding environments and proposed an iterative procedure to update the policy of the network using both real-world and imagined experiences. The original work did not include any results regarding this last operation.

To summarize, we analyzed the importance of different forms of knowledge representation at different stages of the World Models reasoning process: memorizing, training online, training within a “dream” and alternating online and offline training. At the start of this project, we therefore aimed to answer the following research questions:

1. Can we, by providing disentangled representations of the current state, improve the performance of the World Models architecture at any of the aforementioned tasks?

2. Are features generated by generative adversarial networks suitable for a reinforcement learning task? If so, how do they compare to other methodologies?

3. Is the World Models architecture able to generalize to 3D environments with sparse reward signals? If so, can it recreate a fully functional 3D maze to train offline?

4. Does alternating between online and offline training provide any benefit in terms of performance or training time?

1.4 Structure of the thesis

We divided the content of our thesis into two parts: the main body and several appendices. The former includes an introduction to all the concepts that the reader might need to understand our work, together with the results and discussion regarding our main research questions. The appendices include several other experiments, additional information about our design choices and in-depth explanations of our World Models implementation.

The outline of this dissertation is as follows: we will introduce basic concepts and state-of-the-art methodologies related to reinforcement learning and latent state representations in Section 2. Then, in Section 3, we will describe our project plan and review all those techniques and tools that had a direct relation to it. Section 4 and Section 5 will present our results and analyses respectively. Finally, before our conclusion in Section 7, we will list future lines of work derived from this project (Section 6).

We divided our appendices into three groups: The first one, Appendix A, contains additional information that we gathered while working on the main questions of this thesis. Appendix B details our implementation and the mathematical background behind each module. Finally, Appendix C provides information about the specific parameters of our executions and contains several examples of the performance of the World Models architecture.

2 Background and related work

2.1 A brief definition of reinforcement learning

We have already presented the reader with an informal definition of a reinforcement learning (RL) problem: a scenario where an artificial agent needs to rely on sensory information to interact with its environment and perform a given task. Here, we will expand on that term and introduce new notions and ideas.

First, from a mathematical perspective, we can see an actor whose goal is to learn a policy $\pi$ that can help it to communicate with the environment while maximizing the value of a specific reward signal $r$. The agent can interact with its surroundings using a limited set of actions $a$ and receiving information about the current state of the world, $s_t$. This world, however, might not be fully observable and $s_t$ might be incomplete. RL problems are usually treated as Markov decision processes [66], and we will also work under this assumption.

Among the different solutions we can give to an RL problem, we can differentiate two paradigms: model-free and model-based. The former tries to find an optimal policy by relying on previous experience to determine the best possible action at a given state. The latter tries to learn how the environment works and bases its decisions on those learned dynamics.

Both approaches have their advantages and disadvantages, and there is ongoing research on both of them [14, 19, 45, 47, 50, 61, 70]. On one hand, model-free methods require a vast amount of training data [35, 72], which is not always available or cheap to collect, and they cannot generalize as well as model-based methods to new tasks or reward signals. They therefore lack the flexibility to adapt to new situations that we can see in intelligent entities. On the other hand, model-based methods might be more sample efficient, but the quality of their internal representation of the world limits their capabilities. In other words, if the model does not fully understand the hidden dynamics of the environment, it can lead to substantial inaccuracies during planning. As of today, this last limitation is one of the reasons why model-based algorithms have mostly been successfully deployed when the environment is fully observable, or its internal dynamics are given or very restricted [19, 47, 50]. In any other situation, model-free algorithms tend to outperform any other strategy [45, 46, 61, 63].

2.2 Model-based methods

In this dissertation, we will work with a model-based algorithm, as we argue that the ability to generate a mental representation of the world is key when defining an intelligent agent. Furthermore, there is enough biological evidence to suggest that both animals and human beings might keep a mental model of the world to imagine fictitious situations or simulate past and future experiences [21, 41, 51, 58, 62].

We can find, in the literature, multiple techniques to learn an internal representation of the environment. Nonetheless, due to the increasing popularity of deep learning and neural networks, many state-of-the-art solutions are based on recurrent neural networks (RNN) and long short-term memory (LSTM) networks [19, 50, 59, 70]. These designs can capture temporal dependencies, granting the dynamics model a sense of memory and allowing it to keep better track of the state of the world if it is not fully observable. Still, these strategies have their limitations, and long-term dependencies and complex environments are still hard to capture and understand [69]. To reduce the effects of relying on an imperfect model, researchers have proposed multiple solutions, such as inferring and ignoring the errors made by the model [70], redirecting its attention to those variables that affect the task at hand [50, 59, 60], or implementing hybrid solutions where a model-based algorithm supports a model-free controller [14, 47, 63, 70].

Finally, when it comes to training model-based RL architectures, using only visual information as input from the environment is a popular option [11, 14, 19, 47, 50, 70]. However, because using such high dimensional data directly can pose a problem of its own, researchers have often compressed this information into a low dimensional space before exploiting it. This is where the already mentioned latent representation of visual input comes into play. We will introduce this term in more depth in Section 2.3.

2.2.1 The World Models architecture

As we have already discussed, model-based architectures use their internal model to help them find an optimal policy to solve a problem. These representations of the world may be used to provide temporal and spatial information [70], predict future states [14, 47] or reason about possible solutions [70]. However, if we could create a dynamics model that was accurate enough, we should be able to fully replace the original world with the agent's own “imagination”. This would allow us to train offline, i.e. without interacting with the real environment, and could partially solve the sampling inefficiency problem of some models. Ha and Schmidhuber were able to test this hypothesis with their World Models architecture [19]. They introduced a model-based framework consisting of smaller modules that interact with each other to solve an RL problem. They subdivided the final architecture into three components: vision, which encodes sensory information into a smaller and meaningful vector representation; memory, which acts as the dynamics model and keeps an internal representation of the current state of the world; and a controller, which acts according to a policy that maximizes reward. We show a representation of the entire structure in Figure 1a.

Fig. 1: Diagram of the World Models architecture. (a) Original implementation of the World Models architecture, consisting of three modules: vision, memory and controller. (b) Modification of the World Models architecture where a new unit, the reward module, keeps a separate hidden state for the prediction of the terminal/reward signal. Here, $s_t$ stands for the state of the world at time $t$, $z_t$ and $a_t$ are the latent state and action spaces, and $h_{m_t}$, $C_{m_t}$, $h_{r_t}$, $C_{r_t}$ are the hidden and cell states of the memory and reward modules respectively.

Regarding the optimization procedure, it is important to understand that the architecture does not train end-to-end; it trains sequentially. As a result, we first need to obtain a good latent representation from the vision module, then learn the world's dynamics and, finally, train the controller using the information provided by the other two modules. Both the vision and memory modules can train offline with a previously collected dataset. However, when it comes to the controller, we have two options: online and offline training. We will further discuss these procedures in Section 3.1.1.

2.3 Visual embeddings

The design of effective visual embeddings is deeply related to generative models. These models learn the distribution of a given input, usually in an unsupervised fashion, and later use the acquired knowledge to create new samples. Thus, they can “generate” previously unseen data that resemble the original one. To us, it is important to understand that, by adding different restrictions, we can represent this gained knowledge in a low dimensional vector that contains relevant features of the input distribution [55]. We have already referred to this small vector as a latent space.

Depending on the architecture and training technique of the model, the quality and/or information contained within the latent space might change. In the following sections, we will discuss those methodologies that are of interest to our project.

2.3.1 Architectures to transform input data into a latent space

Regarding the architecture, deep neural networks have been a prominent subject of interest in recent literature [6, 8, 10, 12, 29, 32, 34]. Among the many different solutions, we will focus our attention on autoencoders and GAN architectures.

The concept of an autoencoder (AE) is not new [55]. It works under the assumption that, if the network is forced to recreate the original data, creating a bottleneck at some point in the architecture should force the algorithm to generate a low dimensional representation that contains all relevant features of the input distribution. The portion of the model that leads to that bottleneck is called the encoder, and the one that generates a sample is called the decoder. During training, which is generally unsupervised, both parts of the network train simultaneously using the same metric.

However, the original implementation of an AE had one major disadvantage. While it could represent high dimensional data into a much smaller vector, the latent space did not need to be continuous or easily interpretable. This means that the model could learn to encode data seen during training into arbitrary points of the low dimensional space. As a result, the algorithm would be unable to learn any meaningful information about the data distribution or generate unseen samples.

To solve this problem, Kingma and Welling proposed the variational autoencoder (VAE) [32]. This improvement avoids encoding an input into a specific vector and, instead, learns to create a distribution from which such a vector can be sampled. This forces the latent space to have a continuous nature. We would like to refer the reader to Section 3.1.2 or Appendix B.2.2 for an in-depth review of the mathematical background of this implementation. VAE is a popular option in the literature [3, 18, 19, 31, 50] and there are multiple recent models based on this architecture [6, 22, 29, 34].

But using AEs is not the only approach to obtaining a low dimensional representation. Let us first introduce the concept of generative adversarial networks (GAN) [17]. This technique approaches the problem of generating new data as a zero-sum competition³ between two networks: a generator, which learns how to produce real-looking samples, and a discriminator, which learns to distinguish real data from the generated one. Despite their training instability [15, 53], these architectures have recently shown state-of-the-art results in the reconstruction of images [36, 49], and have inspired new solutions to many different problems, including RL [52, 57].

While the purpose of GAN networks is not to create a latent space, the generator can, by design, perform the inverse operation: transform a given input, sometimes Gaussian or uniform noise, into a faithful sample. Furthermore, the discriminator must also learn the distribution of the real data to differentiate it from fake, or generated, samples. Thus, by adding small modifications to either the generator [10, 12] or the discriminator [8], we can easily obtain a meaningful representation of a real example.

2.3.2 Usefulness, quality and evaluation of the latent space

So far, we have discussed different network architectures to create latent spaces. However, we still need to define which information we should pay attention to, and how we should represent it.

In Section 1 we already introduced a very common assumption: if the original distribution of the data consists of independent factors of variation, also called generative factors, a good latent space would be the one that assigns an individual unit to each of these factors. This would, in theory, make it easier to interpret the information and keep the factor’s independence. A representation that fulfills these requirements is said to be disentangled [1], and many recent publications of both AE and GAN architectures directly address this issue [6, 8, 29, 37, 43]. Despite this, other authors have argued that the utility of these representations might be overestimated [38].

One of the problems of disentanglement is that there is no formal definition of it. Not only that, but different researchers may propose different interpretations of the concept [22, 29, 43]. For this thesis, we planned on using the β-VAE algorithm [6, 22], which defines a disentangled representation as one where each latent unit is responsible for one single generative factor. We consider that this definition holds in our project. Nevertheless, it might not be applicable in more complex scenarios, where the latent space might not be big enough to assign one single latent unit to all factors of variation [43].

Still, disentanglement is not the only element that can affect the quality of our latent space. Primarily, as we have previously mentioned, many of these models train on generative tasks. Thus, a good latent space should include all the information needed to generate good samples of the real data distribution. However, we face the same problem as before: what makes a “good sample”?

If we focus our attention on the generation of visual images, most AE models train to reproduce an original input from a latent space. When comparing the resulting reconstruction to the original, the first solution that might come to mind is to perform a pixel-wise comparison. This is the approach that models such as VAE, β-VAE and many other autoencoder architectures take [22, 29, 32]. However, instead of focusing on individual pixels, we could also define a good reconstruction as one that can mimic high-level features of the original input, such as a body part in an image of a person. Manually defining such features would be extremely inefficient and could introduce human biases into the model. But, by using GANs, we can rely on a neural network to find those features for us [8, 10, 12]. We chose one of these models, VAE-GAN [34], to carry out our experiments, as its hybrid nature allowed us to easily compare it to other autoencoder architectures. Section 3.1.2 and Appendix B.2.4 will further explain the algorithm.

Despite all that, it is still hard to yield a quantitative score that would show how good a latent space might be. Evaluating disentanglement usually relies on labelled data and synthetic datasets whose factors of variation are already known [6, 13, 43, 71]. And defining the importance and utility of high-level features is a hard and subjective exercise. However, in our project, we evaluated the effectiveness of our encodings using a secondary task, an RL problem, with a clearly defined criterion of evaluation. This is nothing new, and several researchers have already compared different reconstructions using similar strategies [23, 37, 38].

Moreover, there are multiple visualization techniques that can provide a qualitative measurement of the latent space. We could, for example, reduce its dimensionality to see it represented in a 2D or 3D space, for which principal component analysis (PCA) [26] and t-distributed stochastic neighbor embedding (t-SNE) [40] are two popular options [42, 68]. Or we could rely on techniques such as interpolation, i.e. manually modifying one dimension of the latent space and observing how it affects the final output of the model [8, 13, 22, 23, 34, 43]. This last technique can provide an insight into the continuity of the latent space and let us know which values affect which generative factors.
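For instance, projecting a set of encoded frames onto two dimensions takes only a few lines of code. The sketch below is purely illustrative: the array and file names are hypothetical and the scikit-learn parameters are library defaults rather than the settings used in this work.

```python
# Illustrative sketch: 2D projections of encoded frames for visual inspection.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

latents = np.load("latents.npy")  # hypothetical (n_frames, 32) array of encodings

pca_2d = PCA(n_components=2).fit_transform(latents)                   # linear projection
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(latents)  # non-linear embedding
```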

2.4 Reinforcement learning and state representations

We have seen an introduction to reinforcement learning and the representation of visual information. Now, we would like to link both concepts together.

For an artificial agent, it is essential to understand the current state of the world, which it does by analyzing incoming sensory data. If the size of the input space is small enough, e.g. information from a small number of sensors or very simple images, it might not be necessary to reduce it any further [70]. However, as the complexity of the environment increases, and the capabilities of the agent improve, carefully selecting how to represent all that information becomes an issue.

In particular, encoding visual data is a common practice [6, 19, 23, 50]. This allows us to transform the original input into an interpretable low-dimensional vector and helps greatly in reducing the complexity of the model, as it is one of the first steps of the architecture. Moreover, because we are training to perform a specific task, models can include attention mechanisms to select only those elements that might be relevant later [25, 50].

But knowledge representation is closely related to other aspects of an RL architecture, especially in model-based methods. The algorithm's internal dynamics model is usually trained to predict how a certain action may affect the state of the world. Thus, it receives data from the environment as input, stores it as part of some form of memory, and predicts future sensory information based on all of the above [14, 19, 47, 50, 70]. A good latent space will have a very important influence on all these tasks, as we are asking the internal model to reproduce future encodings while also using them as input for the algorithm.

When it comes to disentangled representations, researchers have theorized about their benefits in RL [8, 22, 48], but not many have actively compared different approaches. As far as we know, only DARLA [23], another RL architecture, has previously compared the performance of its algorithm using both entangled and disentangled representations of visual input.

As for adversarial techniques, there are some examples that show their effectiveness in the field of RL [52, 57]. In all of these, researchers use adversarial methods to directly influence how an algorithm learns a policy. However, to our knowledge, there is no evidence of GANs ever being used to encode any type of sensory information for later use in RL. We theorize this might be due to the novelty of some of these approaches, the fact that pure GAN architectures do not produce a latent space directly, and the stability problems these networks face when converging to optimal solutions [15, 17].

3 Methods

3.1 Architecture

In this subsection, we will briefly describe the implementation of all our algorithms, as well as the interactions between the modules of the World Models framework. We will also report any changes and new additions to the architecture. However, as we do not wish to overwhelm the reader with specific details about the algorithms, most of the mathematical background and the specific details of the architecture can be found in Appendix B. In case of doubt, we highly encourage the reader to visit said appendix.

3.1.1 World Models

We have already introduced the World Models architecture in Section 2.2.1 and shown a diagram of the framework in Figure 1. In short, World Models is composed of smaller independent modules that interact to solve an RL problem. The controller is responsible for interacting with the environment and searching for an optimal policy. It receives sensory data from the vision unit, and temporal and spatial information from its internal model of the world, which resides inside the memory module.

All these elements train independently of one another. However, we need to follow an order when optimizing our modules. The vision unit only receives input from the outside environment, so we tune its parameters first. After that, we can train the memory module, which requires an already existing latent representation as both input and output. Finally, we look for an optimal policy by optimizing the controller unit, which makes use of information from the two previously mentioned units. If necessary, we can repeat this procedure multiple times to better adjust our policy.

As for the implementation, all our modules are based on different neural network architectures. Appendix B.1 provides diagrams for all of them. However, while both the vision and memory components train using stochastic gradient descent [28], the controller uses evolution strategies (ES) [20]. It was the original World Models implementation that defined this approach, arguing that, because of the limited size of the controller model, an ES could help to explore new solutions to traditional RL problems. This technique is not crucial to the performance of the algorithm, and we argue it might harm the scalability of the model in more complicated environments. In our case, however, this training procedure was enough to solve all our tasks.

Next, we will introduce the training procedure and internal architecture of all the modules of World Models:

3.1.2 Vision module

At training time, the goal of this module is to recreate a given input after encoding it into a latent space. In our case, as well as in the original World Models implementation, we only deal with visual information collected from the environment at time $t$. This has two important implications: this network does not account for temporal information, and it is completely blind to the final RL task.

Once the model is fully trained, we will use the generated latent space, not the final reconstruction, as input for all the other elements of World Models. Therefore, we will only use the encoder portion of the model.

Next, we will describe three different approaches to encode visual input. Despite all of them using different functions to determine the quality of the latent space, their architecture (number of layers, activation functions, depth, etc.) is exactly the same in all three cases. This allows us to better compare their results without having to theorize about the biases that different architectural decisions might introduce.

Variational autoencoder (VAE) As we already mentioned in an earlier section, VAE does not produce a latent representation for a given input. Instead, it estimates the parameters of a distribution from which to draw a latent vector. As in many other research studies [19, 22, 32, 34, 50], we chose this distribution to be Gaussian, i.e. VAE needs to encode the input image into two vectors µ and σ. This has an additional benefit in our architecture: it introduces some stochasticity to the input of both the memory and controller modules, increasing their robustness and avoiding overfitting.

VAE defines a good model as one with a high probability of reproducing the original distribution of the input x given the latent space z. Thus, we train to increase the likelihood of that event. We do so by decreasing the mean squared error (MSE)⁴ between each individual pixel of the reconstruction and the original image.

VAE needs, however, a second term, $D_{KL}$, to ensure that the variables of the latent space follow a given distribution. This further increases the interpretability of the resulting vectors. $D_{KL}$ stands for the Kullback–Leibler divergence, or KL divergence [33], and we can see it as a regularizer. Tuning this term correctly will condition the quality of the latent space and the reconstruction properties of our network. In our VAE experiments, we had to limit its influence by applying a free bits strategy [31]. This technique sets a lower threshold $KL_{min}$ on the regularization term. As a result, we obtain the loss function $\mathcal{L}_{VAE} = MSE_{loss} + \max(D_{KL}, KL_{min})$. The reader can find a detailed explanation and several examples of this strategy in Appendix A.3.

⁴ Appendix B.2.2 contains a more detailed explanation that might help the reader understand the relationship between the two.
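To make the free bits objective concrete, the following is a minimal, PyTorch-style sketch of a VAE loss with such a floor. The tensor shapes, the reduction scheme and the default threshold are assumptions for illustration, not the exact configuration used in this thesis.

```python
# Sketch of the VAE loss with a "free bits" floor on the KL term
# (assumed PyTorch-style code, not the thesis implementation).
import torch
import torch.nn.functional as F

def vae_free_bits_loss(recon, target, mu, logvar, kl_min=32.0):
    batch = target.size(0)
    # Pixel-wise reconstruction error, summed over pixels, averaged over the batch.
    mse = F.mse_loss(recon, target, reduction="sum") / batch
    # KL divergence between N(mu, sigma^2) and the unit Gaussian prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch
    # Free bits: the regularizer never drops below kl_min, so the optimizer
    # stops pushing the KL term towards zero once the threshold is reached.
    return mse + torch.clamp(kl, min=kl_min)
```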

β-VAE β-VAE focuses on creating disentangled representations while still keeping most of the concepts of the original VAE algorithm. This model argues that, by further constraining the latent space, we can learn more efficient encodings whose individual units correspond to high-level factors of variation [22]. To that end, the algorithm increases the influence of the KL divergence by a factor of β. However, because this can harm future image reconstructions, the user can gradually weaken this constraint, keeping both a disentangled latent space and good generative capabilities [6].

Taking all that information into account, we obtain a loss $\mathcal{L}_{\beta\text{-VAE}} = MSE_{loss} + \beta\,|D_{KL} - C_t|$, where $C_t = \max(C_{max},\, C_{low} + t\,C_{step})$, $t = \{0, 1, 2, \ldots\}$.
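A matching sketch of this objective, with the capacity $C_t$ supplied by the training loop, might look as follows; again, this is an illustration under assumed PyTorch conventions rather than our actual code.

```python
# Sketch of the beta-VAE objective with a scheduled capacity term
# (assumed PyTorch-style code, not the thesis implementation).
import torch
import torch.nn.functional as F

def beta_vae_loss(recon, target, mu, logvar, beta, capacity_t):
    batch = target.size(0)
    mse = F.mse_loss(recon, target, reduction="sum") / batch
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch
    # Penalize the absolute deviation of the KL term from the scheduled capacity C_t.
    return mse + beta * torch.abs(kl - capacity_t)
```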

VAE-GAN This architecture is divided into three independent units: an encoder, a decoder/generator and a discriminator. The generator corresponds to the decoder part of an autoencoder. The discriminator is a completely new structure. It accepts an image, either real or created by the generator, and determines if the input is a reconstruction. We use the inner layers of this new network to obtain a loss function based on high-level features, not pixels. However, it has no use once the autoencoder is fully trained. Figure 19, in Appendix B.1.2, will show the reader the training procedure of a VAE-GAN network.

With this new algorithm, we aim to generate a latent space that can recreate the same output a real image would produce in a deep layer of the discriminator network. As in VAE, we can do so by optimizing the negative log-likelihood of the model. However, in this case, every unit of the VAE-GAN module (encoder, decoder/generator and discriminator) trains simultaneously but uses a different loss function. Let us briefly discuss them.

The discriminator uses binary cross-entropy to optimize for a classification task between real and recreated examples, such that $\mathcal{L}_{dis} = BCE_{loss\,dis}$. The decoder/generator attempts to reduce the MSE between the output of an internal layer of the discriminator when we use the real image and the corresponding reconstruction. Moreover, it also tries to fool the discriminator. Taking all that into consideration, $\mathcal{L}_{gen} = MSE_{loss\,dis} - BCE_{loss\,dis}$. Finally, the encoder attempts to both reduce the previously mentioned MSE error and produce an interpretable latent space. Therefore, $\mathcal{L}_{enc} = MSE_{loss\,dis} + D_{KL}$.

For the sake of completeness, during training, the discriminator uses recreated examples that are produced either by a real image or by a latent space of Gaussian noise. Appendix B.2.4 will expand on the mathematical background of the model. Moreover, as GAN architectures are notoriously hard to optimize [15, 53], Appendix A.5 will provide some insight into several techniques and architecture decisions that helped the convergence of this model.
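The sketch below restates these three objectives in code form. It assumes the discriminator's inner feature maps and output logits have already been computed for a batch of real images and their reconstructions; all names are illustrative, and the exact feature layer, weighting and optimizer handling of our implementation are described in Appendix B.2.4.

```python
# Sketch of the three VAE-GAN losses described above (assumed PyTorch-style).
import torch
import torch.nn.functional as F

def vae_gan_losses(feat_real, feat_recon, logits_real, logits_recon, mu, logvar):
    # Discriminator: separate real images from reconstructions.
    bce_dis = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
               + F.binary_cross_entropy_with_logits(logits_recon, torch.zeros_like(logits_recon)))
    # Feature-wise reconstruction error measured in an inner discriminator layer.
    mse_dis = F.mse_loss(feat_recon, feat_real)
    # KL term keeping the encoding close to the unit Gaussian prior.
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))

    loss_dis = bce_dis            # L_dis = BCE_loss_dis
    loss_gen = mse_dis - bce_dis  # L_gen = MSE_loss_dis - BCE_loss_dis
    loss_enc = mse_dis + kl       # L_enc = MSE_loss_dis + D_KL
    return loss_dis, loss_gen, loss_enc
```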

3.1.3 Memory module

Our memory module trains to predict the next observation given the current one and the agent’s action. It is, therefore, a model that needs to learn the dynamics of the world and understand how a certain action could affect the environment. And because the environment could not be fully observable at time t, or it may have long-term relations between states, we also need a memory mechanism to remember previously seen information. Taking all those points into consideration, this module is a concatenation of an RNN network and a mixture density network (MDN) [2]. The former accounts for temporal dependencies while the latter uses past and present information to look into the immediate future.

But, before we move any further, let us briefly define the concept and usefulness of an MDN network. If we assume our environment to be stochastic, we would need a model that does not output a single value as a solution but rather a distribution from which to extract the next predicted observation. However, one single distribution might not be enough to fully approximate the target data. Therefore, we can, instead, generate several candidate solutions, choosing one of them during testing to predict the next latent vector. The task of an MDN network is to find the parameters of those distributions that better approximate the ground truth data. In our case, we use a mixture of Gaussian kernels.

This approach offers some advantages: First, it does not rely on a mean squared metric. If this were the case, the network would learn to predict the next frame as an average of what usually happens at a given point, independently of whether that prediction is realistic or not. Second, it gives the dynamics model a stochastic nature, resembling that of a real environment.

Coming back to the structure of the memory module, we train the model to increase the likelihood, or decrease the negative log-likelihood, of drawing the next latent space from one of the proposed distributions. Thus, the loss function is

$\mathcal{L}_{MDN} = -\log \sum_{m=0}^{M} \sum_{n=0}^{N} \Pi_{mn}(x)\, p_{mn}\big(y_n \mid \mu_{mn}(x), \sigma_{mn}(x)\big), \qquad \sum_{m=0}^{M} \Pi_{mn}(x) = 1.$

In this formula, $x$ is the input vector, $y_n$ is value $n$ of the ground truth vector, $M$ is the number of Gaussian distributions we generate, $N$ is the size of the latent space, $\Pi_{mn}$ is the probability of choosing distribution $p_{mn}$ to generate $y_n$, and $p_{mn}\big(y_n \mid \mu_{mn}(x), \sigma_{mn}(x)\big)$ is the likelihood of drawing $y_n$ from a distribution $p_m$ with mean $\mu_{mn}$ and standard deviation $\sigma_{mn}$. For the sake of simplicity, we refer the reader to Appendix B.2.5 for a more in-depth explanation.
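For illustration, the snippet below computes a per-dimension mixture negative log-likelihood in the spirit of the loss above, following the common MDN-RNN formulation. The tensor shapes and the assumption that the mixture weights arrive as normalized log-probabilities are ours, not necessarily those of the thesis implementation.

```python
# Sketch of a mixture-density negative log-likelihood for the next latent vector
# (assumed PyTorch-style; a simplified stand-in for the loss above).
import math
import torch

def mdn_nll(log_pi, mu, sigma, target):
    """log_pi, mu, sigma: (batch, M, N); target: (batch, N).
    log_pi is assumed to be log-normalized over the M kernels."""
    target = target.unsqueeze(1)                         # (batch, 1, N)
    # Log-density of the target under each Gaussian kernel.
    log_prob = (-0.5 * ((target - mu) / sigma) ** 2
                - torch.log(sigma) - 0.5 * math.log(2 * math.pi))
    # Mix the M kernels per latent dimension, then sum over dimensions.
    log_mix = torch.logsumexp(log_pi + log_prob, dim=1)  # (batch, N)
    return -log_mix.sum(dim=1).mean()
```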

Reward module As useful as predicting future sensory data might be, it is not the only information we would like to extract from the environment. In particular, predicting the reward signal and/or terminal state of an environment can be essential to understand the real world and replicate it if necessary.

The original implementation concatenates an extra neuron and loss function to the output of the memory module to account for any additional signals. This allows them to use the same hidden state for predicting both the next observation and the terminal/reward signals. However, this solution has a few limitations in practice, especially when dealing with sparse reward signals. In such cases, it becomes hard to optimize the reward neuron without introducing biases into the rest of the network. If we were to, for example, artificially increase the number of rewarding sequences, the rest of the model would be affected.

To solve this problem, we created a new module with the sole purpose of predicting terminal states or reward signals. This can be seen as an auxiliary network. It receives the same input as the memory module, a latent space and an action, and predicts future signals using its own memory state. We refer to this network as a reward module, and the reader can see it integrated into the original World Models architecture in Figure 1b.

The network is simple: an RNN and a fully connected layer. It trains using either cross-entropy or mean squared error as a loss function, depending on whether we are training for a classification task or a regression one. In the end, we always used it as a classifier.
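A reward predictor of this kind can be sketched as follows; the layer sizes and names are placeholders and do not reflect the exact configuration reported in Appendix B.1.4.

```python
# Minimal sketch of an auxiliary reward/terminal predictor: an LSTM over
# (latent, action) pairs followed by a single linear classification head.
import torch
import torch.nn as nn

class RewardModule(nn.Module):
    def __init__(self, latent_size=32, action_size=1, hidden_size=256):
        super().__init__()
        self.rnn = nn.LSTM(latent_size + action_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)  # logit for the reward/terminal event

    def forward(self, z, a, hidden=None):
        out, hidden = self.rnn(torch.cat([z, a], dim=-1), hidden)
        return self.head(out), hidden

# Trained as a classifier with, e.g., nn.BCEWithLogitsLoss().
```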

3.1.4 Controller module

Finally, we have the controller unit. This module is in charge of finding a good policy to interact with the world. Like the other modules, it has the internal structure of a neural network. However, unlike them, it searches for an optimal solution using evolution strategies. In our case, the controller uses a covariance-matrix adaptation evolution strategy (CMA-ES) [20] to search the weight space of a simple, one-layer, fully connected network. The simplicity of this network, with only a few hundred parameters, makes training possible. But, as the number of parameters increases, this algorithm might not be able to find an optimal solution in such a wide search space⁵ [65].

Using CMA-ES, we can tackle the problem of credit assignment⁶ indirectly, as it does not update its weights based on individual actions. Additionally, it allows us to train using CPU nodes instead of GPUs, which are more suitable for running multiple independent instances of an environment.
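As a rough illustration of such a search loop, the sketch below uses the third-party pycma package; the parameter count, initial step size, iteration budget and the evaluate_rollout stub are all hypothetical stand-ins for the actual rollout evaluation.

```python
# Hedged sketch of a CMA-ES search over the controller's weight vector,
# using the pycma package (pip install cma); not the thesis code.
import cma
import numpy as np

def evaluate_rollout(weights):
    """Placeholder: run one episode with these controller weights and return
    its cumulative reward (the environment interaction is omitted here)."""
    return float(np.random.rand())

n_params = 867  # illustrative controller size
es = cma.CMAEvolutionStrategy(np.zeros(n_params), 0.5, {"maxiter": 50})

while not es.stop():
    candidates = es.ask()  # sample one generation of weight vectors
    # CMA-ES minimizes, so we hand it the negative cumulative reward.
    es.tell(candidates, [-evaluate_rollout(w) for w in candidates])
```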

When predicting the next action $a_{t+1}$, this module receives the visual encodings from the vision module at time $t$, $z_t$, and the hidden state of both the memory and reward modules at that same time, $h_{m_t}$ and $h_{r_t}$. Depending on the experiment, it might also receive the cell state of those two modules, $C_{m_t}$ and $C_{r_t}$. Therefore, the controller does not use any predictions about the future to decide its next step.

At training time, the algorithm updates the mean $\mu$ and the covariance matrix $C$ of a Gaussian distribution that samples solutions from a search space. Every generation, it creates and evaluates a set of configurations, selects the best ones and updates the value of $\mu_{t+1}$ based on those selected candidates. When recomputing $C_{t+1}$, we use those same candidates but keep the old mean value, $\mu_t$. Thus, for every dimension $d$, if the difference between $\mu_t^d$ and $\mu_{t+1}^d$ is big, the variance for the new generation will increase, stimulating the exploration of the solution space along that axis.

3.1.5 Training “inside a dream”

Although the configuration of the network does not change when training in an offline setting, some modules do have a different purpose. In this scenario, the memory module provides both the expected sensory information and the state of the world, and the controller module trains as if that data were real. Because we do not need to encode real data, the vision module has no direct use.

⁵ CMA-ES has a complexity of $O(n^2)$.

⁶ The credit assignment problem tries to understand how certain actions affect the final performance of the model. Because an action might not have an immediate effect on the reward, or might even have harmful consequences in the short term, it is hard for the network to evaluate which practices lead to a successful run.

Training inside a simulated environment offers new challenges. First, we need to consider that our memory model might not have learned the complete dynamics of the real world. But, most importantly, because we use the internal state of the dynamics model as input for the controller, we are granting it the possibility to look into the current state of the world. This transforms the environment into a fully observable one and makes it easier for the controller to find a way to cheat the learned dynamics.

World Models attempts to mitigate this issue by utilizing the stochastic nature of the memory unit. As we saw in Section 3.1.3, this module does not output a single value but rather a group of distributions from which we could sample different solutions. This uncertainty makes predicting the next state of the world much harder and thus forces the controller module to find an approach that does not rely on consciously exploiting the imperfections of the learned dynamics.

To adjust the uncertainty level of our training, the model introduces a new variable called temperature, $\tau$. It affects the prediction of the dynamics module such that $\Pi'_m = \Pi_m^{1/\tau}$, with $\sum_{m=0}^{M} \Pi'_m = 1$. Therefore, if $\tau > 1$, the output becomes more stochastic. Then, once we have chosen our distribution, we can sample a future vector as $z_t = \mu + \sigma(\epsilon\sqrt{\tau})$, where $\epsilon$ is a random value from a unit Gaussian distribution.
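A sketch of this sampling step, with illustrative array names and a sampling scheme in the spirit of the equations above, could look as follows:

```python
# Sketch of temperature-adjusted sampling from the memory module's mixture
# output for a single time step (illustrative NumPy code).
import numpy as np

def sample_next_latent(pi, mu, sigma, tau=1.15):
    """pi, mu, sigma: (M, N) mixture parameters for one prediction step."""
    # Flatten (tau > 1) or sharpen (tau < 1) the mixture weights, then renormalize.
    pi_adj = pi ** (1.0 / tau)
    pi_adj /= pi_adj.sum(axis=0, keepdims=True)
    z = np.empty(mu.shape[1])
    for n in range(mu.shape[1]):
        m = np.random.choice(pi_adj.shape[0], p=pi_adj[:, n])  # pick a kernel
        z[n] = mu[m, n] + sigma[m, n] * np.random.randn() * np.sqrt(tau)
    return z
```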

3.2 Environments

We made use of the ViZDoom library [27], which includes a source port of the game "Doom" (1993) and multiple tools to facilitate the training of RL agents. It also provides multiple premade environments, some of which have been used in recent literature [11, 19, 50], but grants plenty of flexibility to modify an existing task or add a new one if necessary⁷.

We chose two different environments: "ViZDoom: Take Cover" and "ViZDoom: Health Gathering Supreme". The first one has very limited movement options and a relatively small arena. The second one involves a sparse reward in a 3D maze that is mostly hidden behind walls. For specific details about the dynamics of the environments, please refer to Appendix C.3.

⁷ We used already available modding tools such as Doom Builder (http://www.doombuilder.com/) and SLADE.
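For reference, loading one of the bundled scenarios with the library's Python API looks roughly like the snippet below; the configuration path and the two-button action encoding are illustrative, and our actual settings (including all modifications) are listed in Appendix C.

```python
# Illustrative use of the ViZDoom Python API with a random policy.
import random
from vizdoom import DoomGame

game = DoomGame()
game.load_config("scenarios/take_cover.cfg")  # path is illustrative
game.init()

game.new_episode()
while not game.is_episode_finished():
    frame = game.get_state().screen_buffer    # raw visual observation
    action = random.choice([[1, 0], [0, 1]])  # MOVE_LEFT / MOVE_RIGHT
    reward = game.make_action(action)
game.close()
```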

3.2.1 ViZDoom: Take Cover (ViZDoom-tc)

This game is included in the ViZDoom library, and we did not change any aspect of it. Ha and Schmidhuber used it in the original World Models paper to train inside the agent's imagination [19]. As the full name is rather long, we will refer to it as "ViZDoom-tc".

This environment places the player in a confined room with enemies in front of him. The agent’s goal is to avoid incoming fireballs by moving left or right. As time passes, more enemies spawn, increasing the difficulty of the game. The game ends when the player’s health drops to zero, typically by being hit once or twice by a projectile. We present an example of the environment in Figure 2a.

3.2.2 ViZDoom: Health Gathering Supreme (ViZDoom-hgs)

This environment is also included in the original ViZDoom library, but we modified some important features. We provide a detailed description of all these changes in Appendix C.3.2. From now on, we will use the term “ViZDoom-hgs” when talking about this environment. If we are referring to a specific version of it, we will designate it as “ViZDoom-hgsvX”, where “X” is the specific version of the environment. In this dissertation, we present results from two different versions, ViZDoom-hgsv1 and ViZDoom-hgsv2 .

Unlike the previous scenario, this is a 3D navigation environment. The player spawns at a random location, facing a random angle, inside a maze. Health packages and blue spheres spawn at random locations too. The agent's objective is to navigate the maze looking for health items while avoiding the harmful blue spheres, which will hurt him. Moreover, the character loses a set amount of health points every second. We can see some examples of this environment, together with the layout of the maze, in Figure 2b. The player gains health and reward when picking up packages and loses the same amount of health and reward when collecting a sphere. Health items are twice as common as spheres.

Fig. 2: Screenshots of our environments. (a) "ViZDoom: Take Cover": from left to right and top to bottom, the first picture presents the initial state of any run, the second picture shows the limits of the room, and the third and fourth images show the game at different stages. (b) "ViZDoom: Health Gathering Supreme": from left to right and top to bottom, the first image shows the layout of the maze, the second one offers a closer look at our items, a health package (right) and a blue sphere (left), and the remaining images show random instances of the game.

As for the versions of the environment, none of them change the goal of the game significantly, but they make it more challenging. Version 2 includes some extra adjustments to the movement dynamics of the agent to help train certain modules.

3.3 Data collection and usage

All our modules, except the controller, trained offline using previously collected data. For each environment, we gathered data from 10,000 randomly generated trials and split them into a 90%/10% training/test set. A trial would end whenever the player died or once it reached a time limit.

The collection policy of the agent was random at first, but we realised that this behaviour limited its exploration capabilities. We decided to use a pseudo-random policy where the player has a 20% probability of keeping the previously selected action, randomly selecting a new one among all possible choices otherwise. Additionally, in ViZDoom-hgs environments, we limited the length of each rollout to 150 steps. By doing so, we reduced the number of frames where the agent stands facing a brick wall. This helped greatly when training our VAE-GAN models (Appendix B.1.2).
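In code, this collection policy amounts to only a few lines; the function and variable names below are ours and purely illustrative.

```python
# Sketch of the pseudo-random collection policy: with 20% probability keep
# the previous action, otherwise draw a new one uniformly at random.
import random

def next_collection_action(previous_action, all_actions):
    if previous_action is not None and random.random() < 0.2:
        return previous_action
    return random.choice(all_actions)
```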

Both the vision and memory modules, and the reward module whenever used, trained on the same dataset. However, we presented the data to them differently:

– For training the vision module, we used all available data.

– For the memory module, each training sequence contained a full trial. If the trial was not long enough, we applied zero pre-padding to it. If it was too long, we removed it from this training phase.

– For the reward module, the input sequences had a fixed length of 20 frames. We did not use the whole dataset, but rather a subset of it with all available reward instances and other randomly selected situations. During training, we balanced the data so that each batch contained at least 20% of examples with positive or negative reward. Additionally, for the ViZDoom-hgs environment, we manually increased the number of instances where the agent was very close to an object but did not pick it up. We needed to do so because the network had trouble predicting reward when close to an item. We created an alternative environment to collect this new data.

– Finally, for the controller, we did not need a dataset when training online. However, when training inside the agent's imagination, we used the first 3 steps of a previously collected trial to initialize the hidden state of the memory and reward networks.
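The following sketch illustrates the padding and batch balancing described in the list above. It assumes NumPy arrays of precomputed latent codes and per-window reward labels; the helper names and the exact way the 20% ratio is enforced are illustrative rather than our actual implementation.

```python
import numpy as np

def prepare_memory_sequence(latents, max_len):
    """Zero pre-pad one full trial of shape (T, latent_dim), or drop it if too long."""
    if len(latents) > max_len:
        return None                      # trial removed from this training phase
    pad = max_len - len(latents)
    return np.pad(latents, ((pad, 0), (0, 0)), mode="constant")

def sample_reward_batch(windows, rewards, batch_size, min_reward_frac=0.20):
    """Sample fixed-length (20-frame) windows, keeping at least 20% rewarded examples."""
    rewarded = np.flatnonzero(rewards != 0)
    neutral = np.flatnonzero(rewards == 0)
    n_rew = max(int(min_reward_frac * batch_size), 1)
    idx = np.concatenate([
        np.random.choice(rewarded, size=n_rew),
        np.random.choice(neutral, size=batch_size - n_rew),
    ])
    np.random.shuffle(idx)
    return windows[idx], rewards[idx]
```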

4 Results

The following section only includes the final results of our experiments and any remarkable circumstances that directly affected the study of our research questions. Readers who wish to know more about the models' performance and parameters are referred to Appendix A and Appendix C.2. We will discuss the outcome of our research and provide a qualitative examination of the latent space in Section 5.

4.1 Experiment 1: Learning online

We trained the three main modules of World Models (vision, memory and controller) as explained in Section 3.1.1. When we ran these experiments, we did not yet have a dedicated network to predict the reward signal; thus, for this first experiment, this task relied on the controller architecture. We used version 1 of the ViZDoom-hgs environment. Finally, the original World Models paper did not report results for online training in the ViZDoom-tc environment, so we compared our results to those of a random policy.

Unfortunately, we found an unexpected problem regarding the training of the VAE and β-VAE architectures: the regularization term was preventing the algorithms from learning a meaningful representation of the data. We used the "free bits" strategy introduced in Section 3.1.2 to overcome this problem. We recommend reading Appendix A.3, where we provide an in-depth explanation of the problem and multiple examples.

This incident had an undesired effect: we needed to diminish the influence of the regularizer on the latent space. However, β-VAE, like many other autoencoder architectures that promote disentanglement [7, 29, 43], does the opposite and strengthens it. Sadly, this meant that our environments and experimental configuration were not suitable to test the effects of a disentangled representation in RL. Still, for this first experiment, we kept our β-VAE implementation and used it to decrease the influence of the KL divergence (β < 1). Then, we compared these results with the "free bits" strategy of the VAE algorithm. Nevertheless, we still explored the latent space of our vision modules in search of different factors of disentanglement (Section 5.1).

Lastly, we would like to clarify that whenever we mention the term "VAE", we are not referring to the vanilla implementation of the algorithm but to the version that uses "free bits". We set the lower threshold of the KL divergence to 32, as we later found that this was the same value that the original World Models used in their experiments. We used β = 0.05 and β = 0.10 for the ViZDoom-tc and ViZDoom-hgsv1 experiments respectively, as these values produce a KL divergence around that same threshold.
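For clarity, the PyTorch-style sketch below contrasts the two KL treatments discussed here: the "free bits" clamp used by our VAE and the down-weighted KL term (β < 1) used by our β-VAE. The function names and the assumption of a diagonal-Gaussian posterior are illustrative; only the threshold of 32 and the β values reflect our actual settings.

```python
import torch

def kl_divergence(mu, logvar):
    # KL(q(z|x) || N(0, I)), summed over the latent dimensions.
    return 0.5 * torch.sum(torch.exp(logvar) + mu**2 - 1.0 - logvar, dim=1)

def vae_loss_free_bits(recon_loss, mu, logvar, kl_min=32.0):
    # "Free bits": the KL term is only penalised above the threshold,
    # so the regulariser cannot collapse the latent code entirely.
    kl = kl_divergence(mu, logvar).mean()
    return recon_loss + torch.clamp(kl, min=kl_min)

def beta_vae_loss(recon_loss, mu, logvar, beta=0.05):
    # Down-weighted KL (beta < 1), chosen so the KL term stays
    # around the same magnitude as the free-bits threshold.
    return recon_loss + beta * kl_divergence(mu, logvar).mean()
```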

Figure 3 compares the pixel-by-pixel reconstruction loss of all three vision-module variants in both environments. Recall that VAE-GAN uses a different metric to update its weights during training; as a consequence, its pixel-wise reconstruction loss is the highest of the three strategies. This, however, does not necessarily imply that the generated images are uninformative or lacking in detail. We encourage the reader to look at some examples in Appendix C.5 and to read about the qualitative differences between reconstructions in Section 5.1.1. Every model has a latent size of 32.


[Figure 3: two panels, "Final pixel-wise reconstruction loss for ViZDoom-tc images" and "Final pixel-wise reconstruction loss for ViZDoom-hgsv1 images", plotting the MSE loss per algorithm (VAE, β-VAE, VAE-GAN).]

Fig. 3: Pixel-wise reconstruction losses for all three vision-module variants. Results for VAE-GAN in the ViZDoom-hgsv1 environment only show 8 points; the other two models did not converge to good representations (their losses were 91.34 and 59.16).

[Figure 4: two panels, "Evolution of the loss function of the memory module in ViZDoom-tc" and "in ViZDoom-hgsv1", plotting L_MDN over 20 training epochs for memory modules trained on VAE, β-VAE and VAE-GAN encodings.]

Fig. 4: Performance of the memory module when using different strategies to encode sensory data. We ran every configuration 10 times. Vertical bars represent the standard deviation of the data.

We trained our memory module with a hidden state of 256 units, using 5 kernels to predict the next latent vector. This time, we show the evolution of the training loss in Figure 4. There is a noticeable difference between the results of the memory module with VAE-GAN encodings and the other two approaches. However, as we explain in Appendix A.4, these results can be misleading if not analysed carefully; we examine them further in our discussion of Experiment 2 (Section 5.2). The figure also shows that VAE solutions outperform β-VAE ones, despite the two using very similar strategies.
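A minimal PyTorch-style sketch of such a mixture-density output head and its loss is shown below, assuming a 256-unit hidden state, a 32-dimensional latent space and 5 Gaussian kernels per latent dimension. The class and function names are illustrative, and the recurrent (LSTM) part of the module is omitted.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNHead(nn.Module):
    """Maps the 256-unit recurrent state to a 5-component Gaussian mixture
    over each of the 32 latent dimensions of the next frame."""
    def __init__(self, hidden_size=256, latent_size=32, n_kernels=5):
        super().__init__()
        self.latent_size, self.n_kernels = latent_size, n_kernels
        self.out = nn.Linear(hidden_size, 3 * latent_size * n_kernels)

    def forward(self, h):
        params = self.out(h).view(-1, self.latent_size, self.n_kernels, 3)
        log_pi = F.log_softmax(params[..., 0], dim=-1)   # mixture weights
        mu = params[..., 1]                              # component means
        log_sigma = params[..., 2]                       # log standard deviations
        return log_pi, mu, log_sigma

def mdn_loss(log_pi, mu, log_sigma, z_next):
    """Negative log-likelihood of the true next latent under the mixture."""
    z = z_next.unsqueeze(-1)                             # broadcast over kernels
    log_prob = (-0.5 * ((z - mu) / log_sigma.exp()) ** 2
                - log_sigma - 0.5 * math.log(2 * math.pi))
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```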

Finally, we optimized the controller. This module required a large amount of time to train; because of this, we only have a maximum of 5 runs for each configuration. Models using β-VAE only have 3, as we could not answer the original research question with this configuration and its results were therefore less important to us. We used a population of 100 individuals, whose fitness value was the average result after running the environment 16 times. Every 10 generations, we selected the best solution and evaluated it extensively for 100 more trials.
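The sketch below outlines this optimization loop using the `cma` package; `evaluate_policy` is a hypothetical helper that runs one full rollout of the complete World Models pipeline with the given controller weights and returns its cumulative reward.

```python
import numpy as np
import cma

POPULATION = 100
ROLLOUTS_PER_CANDIDATE = 16
EVAL_EVERY = 10          # generations between extensive evaluations
EVAL_ROLLOUTS = 100

def train_controller(n_params, evaluate_policy, generations=500, sigma0=0.1):
    es = cma.CMAEvolutionStrategy(np.zeros(n_params), sigma0,
                                  {"popsize": POPULATION})
    for gen in range(generations):
        candidates = es.ask()
        # CMA-ES minimises, so we negate the average cumulative reward.
        fitness = [-np.mean([evaluate_policy(w)
                             for _ in range(ROLLOUTS_PER_CANDIDATE)])
                   for w in candidates]
        es.tell(candidates, fitness)
        if (gen + 1) % EVAL_EVERY == 0:
            best = es.result.xbest
            score = np.mean([evaluate_policy(best)
                             for _ in range(EVAL_ROLLOUTS)])
            print(f"generation {gen + 1}: extensive evaluation = {score:.1f}")
    return es.result.xbest
```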

We provide the results in Figure 5. Each run used a different vision and memory module, but they were all trained with the same set of parameters. Architectures based on VAE and β-VAE perform similarly in both environments. In ViZDoom-tc, the VAE alternatives outperform the VAE-GAN ones. However, VAE-GAN solutions are consistently better in ViZDoom-hgsv1, with the exception of two outliers.

We would like to draw the reader's attention to the two instances that almost doubled the average performance of their corresponding configuration in the ViZDoom-hgsv1 environment. Because of this outstanding behaviour, we ran those same experiments a second time. The final results dropped from 322.5 to 157.2 for the VAE-based outlier, and from 275.8 to 157.5 for the β-VAE outlier. We did not include these new runs in Figure 5. We concluded that training can occasionally reach a highly rewarding solution, but not consistently. We do not know whether such a solution exists for World Models architectures using VAE-GAN encodings.

We will briefly describe the final policies of our models during the discussion (Section 5.2.1). However, we would like to remind the reader that Appendix C.4 contains GIF files with examples of all these experiments.

4.2 Experiment 2: Learning inside a dream

For this experiment, we trained World Models solely inside its own dynamics model. Then, we applied the resulting policy to the real environment and evaluated its performance. In these experiments, the controller never received any feedback from the real world.

This time, we needed a network that could emulate the reward and terminal signals of the real environment. We initially tried to use the memory module to predict them. However, this approach was inaccurate in the ViZDoom-hgs environments, and it affected the performance of the memory module negatively. Thus, from this point on, we used our new reward network (Section 3.1.3). Furthermore, we introduced minor changes to the movement dynamics of the ViZDoom-hgsv1 environment to make the reward signal easier to predict; we list them in Appendix C.3.2. We therefore started using the ViZDoom-hgsv2 environment.

We tested different temperature values, τ, to modify the stochastic nature of the memory module. The range of values depended on the environment and on the quality of the imagined world. We reduced the number of generations from 500 to 250; at this point, all our models had a clearly defined strategy, and some of them had already learned to achieve a perfect score. The number of runs for each configuration was also reduced from 5 to 3 due to computational limitations. Finally, each configuration used three different vision and memory units; in other words, when training controllers with different values of τ, we reused the same three vision and memory models.
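The sketch below illustrates a common way of applying τ to a mixture-density output: the mixture logits are divided by τ and the standard deviation of each Gaussian is scaled by √τ, so higher temperatures make the imagined environment more stochastic. It is meant as an illustration of the mechanism rather than a copy of our implementation.

```python
import math
import numpy as np

def sample_next_latent(log_pi, mu, log_sigma, tau=1.0):
    """Sample z_{t+1} from the MDN output, scaled by temperature tau.
    log_pi, mu, log_sigma: arrays of shape (latent_size, n_kernels)."""
    # Flatten the mixture weights with tau and renormalise.
    logits = log_pi / tau
    pi = np.exp(logits - logits.max(axis=-1, keepdims=True))
    pi /= pi.sum(axis=-1, keepdims=True)

    z = np.empty(mu.shape[0])
    for d in range(mu.shape[0]):
        k = np.random.choice(mu.shape[1], p=pi[d])
        # Widen each Gaussian by sqrt(tau) before sampling.
        sigma = math.exp(log_sigma[d, k]) * math.sqrt(tau)
        z[d] = mu[d, k] + sigma * np.random.randn()
    return z
```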

We show the results of our experiments in Figure 6, which provides the final performance of the best models in both the "dreamed" world and the real one. The final results represent the average performance of a policy after 200 random trials. We decided not to include the evolution of the population for the sake of simplicity; in any case, we did not observe any significant differences between configurations. All models learned to create a semi-realistic environment in which to train, including our 3D navigational world. However, none of our configurations was able to recreate the dynamics of the real world completely, and their final strategies often included behaviours that actively tried to fool the memory module. This explains the gap in performance between offline scores and online ones. Regarding our experiments in the ViZDoom-tc environment, we were far from reaching the score reported in the original World Models paper. Despite these results, a low final performance does not always correspond to random behaviour; for example, several models do learn to identify projectiles as harmful.


[Figure 5: for each environment (ViZDoom-tc and ViZDoom-hgsv1), three panels show the fitness of the best model per generation, the average population fitness per generation, and the fitness of the best model after extensive evaluation, for World Models with VAE, β-VAE and VAE-GAN latent spaces, together with a random-behaviour baseline.]

Fig. 5: Performance of the World Models architecture in both the ViZDoom-tc and ViZDoom-hgsv1 environments.

As for ViZDoom-hgsv2, we tested our configurations with different τ values. The reason behind this decision was the clear difference between VAE-based and VAE-GAN-based imagined environments: while the former was overly optimistic when placing items inside the maze, the latter was extremely cautious. By modifying τ, we expected to correct these dynamics. We will expand on this idea in the discussion.


[Figure 6: panels show, for ViZDoom-tc (τ ∈ {1.0, 1.1, 1.2}) and ViZDoom-hgsv2 (τ ∈ {0.6, 0.8, 1.0, 1.2} for VAE and τ ∈ {1.0, 1.3, 1.5} for VAE-GAN), the fitness of the best model after extensive offline (imagined) and online (real) evaluation, for World Models with VAE and VAE-GAN latent spaces, together with a random-behaviour baseline.]

Fig. 6: Final performance of the World Models architecture in both imagined and real environments when training solely inside the model's internal representation of the world. The annotations next to each data point (numbers 1–3) identify a complete architecture, i.e. one that uses the exact same vision, memory and reward modules. Note that the vertical axis may vary greatly from graph to graph.


Finally, we can see how VAE-GAN architectures recreated a much more stable training environment in ViZDoom-tc, although they did not achieve the best scores. In ViZDoom-hgsv2, however, they clearly outperformed the VAE alternatives.

4.3 Experiment 3: Alternating online and offline training

Next, we explored the idea of combining online and offline optimization. To our knowledge, this strategy had not been tested before with the World Models architecture. We defined two new training procedures. First, we attached an online training phase after an offline one; this way, the agent could still gain knowledge about the world using its imagination and then refine it during an online session. Second, we alternated between online and offline training, as this could be a much more realistic setup when training for complex tasks. We only tested this second procedure in the ViZDoom-hgsv2 environment.

We used the same configuration as in Experiment 2. However, for generating our imagined experiences, we only applied the temperature values that performed best last time: τ = 1.1 for our experiments in the ViZDoom-tc environment, τ = 1.2 for VAE-based architectures in ViZDoom-hgsv2 and τ = 1.3 for VAE-GAN configurations in ViZDoom-hgsv2.

Figure 7 shows the evolution of our imagined policies when refining them in the real world. We ran our experiments for 50 or 100 epochs, depending on the difficulty of the environment, and compared them to a baseline where the agent did not have any prior knowledge of the task. Because we had implemented minor changes since Experiment 1, these reference values are not the same as the ones presented in Section 4.1.

Some previously trained models outperform the baseline, especially during the early stages of training. In ViZDoom-tc this advantage was temporary, whereas in ViZDoom-hgsv2 it lasted until the end of the training phase. There were no significant differences between World Models trained with VAE or VAE-GAN latent features.

Some models had greater difficulty converging to an optimal solution, something that was never an issue in Experiment 1. We observed that, by increasing the sampling range of the controller module during training, some models were able to escape their local optima8. However, this was not guaranteed, and they usually showed similar or lower performance than a model trained without any prior knowledge. We therefore decided to omit those results here.

Lastly, we decided to constantly alternate training between the real and imagined worlds. Our strategy was as follows: we switched between online and offline training every 30 generations, using σ = 0.1 when training in the real world and σ = 0.05 when training in the fictitious one9. This way, we limited the ability of the agent to find a strategy purely based on exploiting the imperfections of the memory module. As there was no way to determine which offline policy would work best in an online setting, we always chose the one with the highest fitness value in the imagined world.
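The alternating schedule can be summarised with the following sketch; `train_phase` is a hypothetical wrapper around the CMA-ES loop of Experiment 1, applied either to the real environment or to the learned dynamics model, and the number of cycles is arbitrary.

```python
SWITCH_EVERY = 30    # generations per phase
SIGMA_REAL = 0.10    # CMA-ES step size in the real environment
SIGMA_DREAM = 0.05   # smaller step size inside the learned model

def alternate_training(controller, real_env, dream_env, train_phase, cycles=10):
    for _ in range(cycles):
        # Offline phase: optimise purely inside the model's imagination.
        # We carry over the candidate with the highest *imagined* fitness,
        # since online fitness cannot be evaluated beforehand.
        controller = train_phase(controller, dream_env,
                                 generations=SWITCH_EVERY, sigma=SIGMA_DREAM)
        # Online phase: refine the imagined policy in the real world.
        controller = train_phase(controller, real_env,
                                 generations=SWITCH_EVERY, sigma=SIGMA_REAL)
    return controller
```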

We show our results in Figure 8. We only selected the configurations that performed best in our previous experiment: VAE-based model 3 and VAE-GAN-based model 3 (see Figure 7). Due to time constraints, we could only run short tests with the other modules, about 100 to 150 epochs long; their policies stayed the same or got worse. As these runs did not provide further information, we decided not to show them here.

Unfortunately, we can only faithfully compare these performances to a baseline with random behaviour. Again, our current environment and system architecture differ slightly from the ones we used in Experiment 1, so we cannot compare our previous results to these. We did not have time to collect enough data to provide a reliable baseline for both VAE-based and VAE-GAN-based models.

8 We increased the controller’s hyperparameter σ from σ = 0.1 to σ = 0.2.

9 The value σ refers to a hyperparameter of the CMA-ES algorithm (its step size). It is not related to the standard deviation of any of our model distributions.
