
MSc Artificial Intelligence

Master Thesis

Fixing a Broken Hierarchy:

A Study of Information Granularity in Hierarchical Reinforcement Learning

by

Laurens Weitkamp

11011629

July 15, 2020

48 EC December 2019 - August 2020

Supervisor:

Dr Herke van Hoof

Assessor:

Dr Efstratios Gavves


Abstract

Hierarchical reinforcement learning (HRL) is a research area in reinforcement learning that deals with temporal abstraction: long-term planning in tasks that possibly exhibit a hierarchical structure. At the core of HRL is the idea that we can divide a full task into several skills, and learn these skills instead of learning the full task. This is typically done by making a distinction between a higher-level policy responsible for long-term planning and a lower-level policy responsible for learning short-term skills. A benefit of this approach is that the learned agent might transfer these skills to a new domain. Recent advances in HRL have shown promising results on tasks that could not be solved without a hierarchy, specifically in the sparse reward setting.

However, some hierarchical approaches tend to degenerate to trivial solutions: instead of learning several skills, the agent learns only to solve the whole task, offering no transferability to new domains. Typically, the same state-input is given to both the higher- and lower-level policy. We theorize that giving both policies the same state-input enables the lower level to ignore the higher level if hierarchy is not enforced sufficiently at the policy level.

In this thesis, we propose learning a hierarchy through local and global state information. We structure the input for the higher- and lower-level policies: the former uses global state information relevant to long-term planning, and the latter uses local state information relevant to short-term planning. We show that by hiding information between higher- and lower-level policies, the lower-level policy is incentivized to follow the higher-level goal, resulting in a more sample efficient approach. Additionally, we show that the agent is capable of transferring the learned sub-tasks to a more complex domain not seen during training.


Contents

1 Introduction
1.1 Thesis Outline

2 Background
2.1 Reinforcement Learning
2.1.1 Policy Gradient Methods
2.2 Hierarchical Reinforcement Learning
2.2.1 Feudal Reinforcement Learning
2.3 Deep Reinforcement Learning

3 Related Literature
3.1 Hierarchical Reinforcement Learning
3.2 Goal Conditioned Hierarchical Reinforcement Learning

4 Method
4.1 Local and Global Information
4.2 Models

5 Experiments
5.1 Four Rooms
5.2 Transferable Skills Through Fine Tuning
5.3 Implementation Details

6 Discussion and Conclusion
6.1 Conclusion
6.2 Limitations
6.3 Further Research

Appendices

A Case study: Option-Critic


Chapter 1

Introduction

In reinforcement learning, an agent is tasked with solving a sequential decision-making problem: by taking actions in states, it must find a sequence of state-action combinations that maximizes an objective function. Typically, this means that an agent takes actions to maximize the cumulative (discounted) reward for visiting states. With the use of strong function approximation from the field of deep learning, reinforcement learning has been able to scale up and reach several impressive benchmarks on Atari 2600 games, the game of Go and physics simulations [1, 2, 3, 4, 5].

However, agents typically learn to solve only one specific task, and knowledge gained from this task is not easily transferred when the agent is faced with another task. Let's use an example to illustrate the problem more clearly: we are tasked with going to the grocery store to get groceries. We must therefore fetch our keys, exit our house, get on our bike and cycle towards the store, and once there we have to select the produce and pay for it. The typical reinforcement learning agent might be able to solve the task as a whole, but when faced with a related task (going to the bank) it will have to learn this from scratch. This is highly sample inefficient.

If we assume that the task as a whole requires several skills, we might be able to structure the learning process to accommodate this. Instead of learning to get groceries as a whole, we separately learn to fetch keys, to get on our bike and to navigate, and can hence more quickly transfer the learned skills when faced with going to the bank. A research field that attempts to learn such skills is hierarchical reinforcement learning (HRL). In HRL, we use several layers of policies, each trained to perform decision making at an increasing level of temporal abstraction. Typically two levels are used: the higher-level policy learns to take temporally abstracted actions and either selects from a set of lower-level policies or guides a single lower-level policy that learns skills using primitive actions directly in the environment. Indeed, several tasks have been solved using a hierarchy that could not be solved otherwise, specifically in the sparse reward domain where there is only a reward if the agent completes the task [6, 7].

Recent literature in hierarchical reinforcement learning has focused on learning a goal-conditioned lower-level policy, with a goal set out by the higher-level policy [6, 8, 9, 10]. In goal-conditioned HRL, the higher-level policy should learn a temporally abstracted representation of the environment and use this as a goal to guide the lower-level policy towards advantageous states. The representation varies according to the environment; in physics domains the higher-level goal can exactly represent a future state, but this is more complex in pixel domains due to pixel-reconstruction loss, and might require a more complex representation such as an embedding proposed in the FeUdal Network [6, 9].

However, there are several problems with hierarchical approaches. When hierarchy is not enforced strictly enough the agent might be reduced to a trivial solution:

1. The higher-level policy might dictate the primitive actions performed by the lower-level policy at each time step; this is called micro-managing behavior in the literature.

2. The lower-level policy might simply learn to solve the task whilst ignoring the higher-level goal; we will refer to this as goal-avoidance behavior.

Both solutions reduce the learned hierarchy to a single-task solving agent, and remove any possibility of learning a high-quality hierarchical skill decomposition between higher- and lower-level policies. Whereas micro-managing behavior has primarily been dealt with through regularization and an enforcement of temporal abstraction at the higher-level policy, goal-avoidance is still a potential problem [11, 12, 7]. We argue that goal-avoidance essentially breaks the hierarchy; the lower-level policy is detached from the higher-level policy, rendering it useless. Additionally, recent work has theorized that several hierarchical approaches work primarily because they add exploration to the training of the agent, even if the hierarchy is enforced strongly [13].

1.1 Thesis Outline

In this thesis, we will look at goal-avoidance in hierarchical reinforcement learning, specifically in the FeUdal Network. We propose to learn a goal-conditioned hierarchy by specifying the granularity of information at each level of the hierarchy, which serves as an incentive for the lower-level policy to follow the direction set out by the higher-level policy. Through information hiding, we suspect the lower-level policy will need to follow the goal set out by the higher-level policy in order to find the optimal policy. We will test the quality of the learned skill decomposition through two experiments. In the first, we learn in a slightly altered four rooms environment, which is traditionally used in hierarchical reinforcement learning. We modify this environment in several ways and evaluate whether the lower-level policy is capable of solving the task by following the higher-level policy's goal. In the second experiment, we scale the complexity up by introducing two difficult tasks for the agent to solve, and evaluate its performance on a combination of the two tasks.

We have structured our work in the following way: in Chapter 2 we provide the necessary background to understand the basis of our approach. In Chapter 3 we review the literature in hierarchical reinforcement learning with a focus on goal-conditioned methods. Following this, we outline the hypothesis and our approach to learning transferable hierarchies in Chapter 4. We continue with Chapter 5, which provides a detailed description of the experiments performed in this thesis and the results we have gathered from them. Finally, we provide a discussion of the results from our experiments in Chapter 6, which includes future research topics.


Chapter 2

Background

2.1 Reinforcement Learning

Reinforcement Learning is a sub-field of machine learning that is concerned with how an agent takes actions to maximize some specified performance measure. Typically, the environment that the agent is located in is treated as a Markov Decision Process (MDP). An MDP is defined as a five-tuple {S, A, r, P, γ}, where S denotes the set of states (with s ∈ S an individual state) and A the set of actions available, sometimes denoted as A(s) to emphasize the actions available in state s. The reward function r : S × A → ℝ describes the reward given by the environment for taking an action in a state. The reward function defines good behavior in the environment to the agent, but is not necessarily helpful by itself. Transition probabilities from one state to another are defined by P, which can be formulated more clearly as p(s, a, s′) to emphasize the dependency on the current state, the action, and the next state.

The discount factor γ ∈ [0, 1) is strictly smaller than 1 to account for infinite-horizon tasks; additionally, it balances the importance between immediate and future rewards. If the discount factor is (close to) zero, the agent will consider almost only the immediate reward gained. If, on the other hand, it is close to 1, the agent will also consider the subsequent rewards it will receive when reinforcing the action currently taken.

The goal in the MDP setting is to learn a policy π(a|s) that maps a state to an action which should maximize some performance measure. The trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T) that an agent takes is a collection of the states it has visited, the actions it took in these states and the rewards it received until termination at time step T. Commonly, the performance measure is defined over (a subset of) the trajectory. For example, the reward-to-go in Equation (2.1) is the sum of (discounted) rewards the agent will receive from a trajectory starting at time step t,

$$\hat{G}_t = \sum_{t'=t}^{T} \gamma^{t'-t} R_{t'} \qquad (2.1)$$
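As a concrete illustration, the following is a minimal Python sketch of this reward-to-go computation for one finished trajectory; the function name and the example rewards are illustrative and not part of the thesis implementation.

```python
from typing import List

def reward_to_go(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Discounted reward-to-go G_t for every time step of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards through the episode: G_t = R_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a sparse reward of +1 only at the final step of a four-step episode.
print(reward_to_go([0.0, 0.0, 0.0, 1.0], gamma=0.95))
```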

There are several methods for learning how to maximize the performance measure, but we will primarily discuss methods that parameterize the policy directly and estimate the gradient of the performance measure with respect to these parameters; such methods are called policy gradient methods.

2.1.1 Policy Gradient Methods

As mentioned, we will define a performance measure J(θ) for the policy, and estimate the gradient of J(θ) to take a step in the direction that maximizes the expected return. Using the value of the initial state s_0 as the performance measure (indicating that we wish to maximize the reward from the starting state onwards), we can write

$$J(\theta) = V^{\pi_\theta}(s_0) = \mathbb{E}_{\pi_\theta}\big[\hat{G}_0\big] \qquad (2.2)$$


We can then take the gradient of this performance measure to reach what is called the REINFORCE family of algorithms [14, 15]:

$$\nabla J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\big(\hat{G}_t - b(s_t)\big)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big] \qquad (2.3)$$

$$\theta_{t+1} = \theta_t + \alpha\big(\hat{G}_t - b(s_t)\big)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

where Ĝ_t is the return from Equation (2.1), b(s_t) is a state-dependent baseline and α is the learning rate [14]. The baseline is often included to reduce the variance of the gradient update. A common baseline is an approximation of the value function V_φ(s), parameterized by φ and learned through a squared-error loss.
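A minimal PyTorch sketch of such a REINFORCE-with-baseline update for one collected trajectory is given below; the `policy` and `value` networks, the 0.5 value-loss weight and the tensor shapes are illustrative assumptions, not the configuration used in this thesis.

```python
import torch
import torch.nn as nn

def reinforce_update(policy: nn.Module, value: nn.Module,
                     optimizer: torch.optim.Optimizer,
                     states: torch.Tensor,       # (T, state_dim)
                     actions: torch.Tensor,      # (T,) integer actions
                     returns: torch.Tensor):     # (T,) reward-to-go G_t
    logits = policy(states)                      # (T, num_actions)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    baseline = value(states).squeeze(-1)         # state-dependent baseline b(s_t)

    # Policy gradient term: -(G_t - b(s_t)) * log pi(a_t | s_t); the baseline is
    # detached so it only reduces variance and is not trained by the policy loss.
    advantage = returns - baseline.detach()
    policy_loss = -(advantage * log_probs).mean()

    # Baseline trained with a squared-error loss towards the observed return.
    value_loss = (returns - baseline).pow(2).mean()

    optimizer.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    optimizer.step()
```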

Actor Critic Learning If we incorporate bootstrapping into our value learning, we end up with an actor-critic method: the actor is the policy, and it is pointed in the right direction by the critic V_φ. In general, the bootstrapped estimate A_t = R_{t+1} + γ V_φ(s_{t+1}) − V_φ(s_t) is known as an estimate of the advantage function, because if its value is positive, the next state is advantageous with respect to the current state, and the action towards that state is beneficial. It is only an estimate of the advantage function because we do not have access to the true value function V^π(s_t), but only an approximation V_φ(s_t).
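As a small sketch of this bootstrapped advantage estimate (assuming a tensor of rewards and value predictions for every visited state plus the final state, and no episode termination inside the batch):

```python
import torch

def bootstrapped_advantage(rewards: torch.Tensor,    # (T,) rewards R_1..R_T
                           values: torch.Tensor,     # (T+1,) V_phi(s_0)..V_phi(s_T)
                           gamma: float = 0.99) -> torch.Tensor:
    """One-step advantage estimate A_t = R_{t+1} + gamma * V(s_{t+1}) - V(s_t)."""
    return rewards + gamma * values[1:] - values[:-1]
```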

2.2 Hierarchical Reinforcement Learning

In some situations the agent is required to learn not only one but several tasks. In these situations it might be more efficient to model the learning process to account for multiple tasks. These approaches are called hierarchical methods, because the learning process is structured in some pre-defined hierarchical way. Typically, we use two types of policies: a higher-level policy in charge of learning some long-term aspects of a task (temporal abstraction), and a lower-level policy in charge of learning (possibly sets of) sub-tasks. We can further divide hierarchical learning into two categories: (1) the higher-level policy selects from a set of lower-level policies, and (2) the higher-level policy sets out an objective for a single lower-level policy. Note that both categories can be extended to an arbitrary number of levels of hierarchy. This structuring of the learning process has several benefits: (1) a complex task is divided into multiple easier tasks, (2) the agent learns to reason over longer time spans, which enforces some degree of temporal abstraction, and (3) the sub-policies learned should make it easy to learn similar tasks or to transfer to new tasks.
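The following schematic control loop illustrates this two-level structure, with a higher-level decision every c steps and a primitive action every step; `env`, `high_policy` and `low_policy` are hypothetical placeholders rather than components from this thesis.

```python
def run_episode(env, high_policy, low_policy, c: int = 10, max_steps: int = 500):
    """Schematic two-level control loop: the higher level emits a goal (or selects
    a sub-policy) every c steps, the lower level emits a primitive action every step."""
    state = env.reset()
    goal, total_reward = None, 0.0
    for t in range(max_steps):
        if t % c == 0:                       # temporal abstraction: decide every c steps
            goal = high_policy(state)
        action = low_policy(state, goal)     # primitive action conditioned on the goal
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```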

2.2.1 Feudal Reinforcement Learning

This thesis primarily focuses on a specific type of hierarchical framework, namely Feudal Reinforcement Learning [16]. In this approach, each higher level of the hierarchy uses a more abstract representation of the state and passes an objective to the level below it. More precisely, at each level of the hierarchy a manager exists which sets out a goal for its lower-level worker, and the worker only gets rewarded for following the goal objective set out by the manager. Each manager in turn functions as a worker for the hierarchical layer above, up to a single top-level manager; at the lowest level of the hierarchy, the goals that can be set out consist of primitive actions. A key aspect of this approach is the idea of information hiding at each level of the hierarchy, as each level only has an abstract representation of the environment suitable for its level of the hierarchy. Additionally, rewards from the environment are hidden from all levels except the top manager; every other level of the hierarchy is rewarded solely based on its performance on the task given by its higher-level manager.

2.3 Deep Reinforcement Learning

As the state space grows larger and more complex (physics domains or images as input, for example), traditional methods in reinforcement learning fall short and we require strong function approximation methods. Recent advances in deep learning have carried over into reinforcement learning, yielding several interesting developments. Starting from the actor-critic method discussed previously, we will describe the asynchronous advantage actor-critic (A3C) method in detail. Following this is a discussion of the FeUdal Network, a recently proposed method based on the idea of feudal reinforcement learning.

(Asynchronous) Advantage Actor Critic We have previously discussed using the advantage function to reduce variance in gradient updates, and how learning a value function in addition to the policy is referred to as an actor-critic method. The A3C method effectively utilizes multi-core processing to have multiple agents working asynchronously, where each agent gathers experience from which the policy and value function are learned using a deep convolutional neural network [3]. When using multiple agents in this manner, we additionally obtain a higher rate of exploration, which benefits convergence, and we slightly decorrelate the data, which is required for stochastic gradient descent. It has been debated how useful the asynchronous nature is; methods that work synchronously have been found to work just as well and can additionally leverage the power of GPUs.

FeUdal Networks for Hierarchical Reinforcement Learning The most recent advance in the use of feudal reinforcement learning is the FeUdal Network (FuN) [9]. Consider a lower-level policy o_t = μ(s_t, θ) selected by a higher-level policy μ (where the selected lower-level policy is fixed for c time steps). Each lower-level policy has a corresponding transition probability distribution p(s_{t+c} | s_t, o_t), which describes in which future state s_{t+c} this lower-level policy might end up. We can then combine the higher-level policy with the lower-level transition probability distribution to create a transition policy π^{TP}(s_{t+c} | s_t) = p(s_{t+c} | s_t, μ(s_t, θ)). We can apply the policy gradient theorem to the transition policy to find the gradient with respect to the performance of this policy,

$$\nabla_\theta \pi^{TP}_t = \mathbb{E}\big[(R_t - V(s_t))\,\nabla_\theta \log p(s_{t+c} \mid s_t, \mu(s_t, \theta))\big] \qquad (2.4)$$

This re-framing of the higher-level policy could allow us to optimize the transition policy using off-the-shelf policy gradient methods. However, FeUdal Networks assume that the state-space follows a von Mises-Fisher distribution with mean direction g(o_t): p(s_{t+c} | s_t, o_t) ∝ e^{d_cos(s_{t+c} − s_t, g(o_t))}, which gives the objective of the higher-level policy a semantic meaning: we wish to emit sub-policies that, on average, move towards state s_{t+c}. Additionally, the transition to the successor state s_{t+c} should be advantageous, and hence we can write the gradient with respect to the performance measure using the advantage function:

$$\nabla_\theta \pi^{TP} = \big(R_t - V(s_t)\big) \cdot e^{\,d_{\cos}(s_{t+c} - s_t,\; g(o_t))} \qquad (2.5)$$

We can see that the higher-level policy creates an embedding space in which goals g(o_t) are learned to maximise the cosine similarity with advantageous successor states s_{t+c}. To learn this embedding space, we take the log of the distribution to reach the final update rule for the higher-level policy:

$$\nabla_\theta g_t = \big(G^{TP}_t - V_\theta(s_t)\big)\,\nabla_\theta\, d_{\cos}(s_{t+c} - s_t,\; g_t(\theta)) \qquad (2.6)$$

where we ignore the dependencies of s_t on the parameters θ to avoid trivial solutions. The goal embedding itself is used by the lower-level policy to guide it towards the direction of advantageous states. The lower-level policy has no restrictions other than that it has to use the goal in a meaningful way, and is traditionally trained using the advantage actor-critic method described earlier:

$$\nabla_\phi \pi_t = \big(G_t + \alpha R^I_t - V_\phi(s_t)\big)\,\nabla_\phi \log \pi_\phi(a_t \mid s_t, g_t) \qquad (2.7)$$

The intrinsic reward $R^I_t = \frac{1}{c}\sum_{i=1}^{c} d_{\cos}(s_t - s_{t-i},\, g_{t-i})$, scaled by α, is high when the lower-level policy manages to maximize the log-likelihood of the higher-level policy's state trajectory towards advantageous states. Note that the higher- and lower-level policies can calculate the return using different discount factors. In contrast to feudal reinforcement learning, the lower-level policy is given the extrinsic environment reward in addition to the intrinsic reward. Additionally, both higher- and lower-level policies share the same observation of the state.
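A PyTorch sketch of these two quantities follows: a loss whose gradient corresponds to the higher-level update of Equation (2.6), and the intrinsic reward for the lower-level policy. It assumes the latent states and goals are already computed; the function names, tensor shapes and the detaching choices are illustrative, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

def manager_loss(adv_hi: torch.Tensor,   # (T,) higher-level advantage, e.g. G^TP_t - V(s_t)
                 s: torch.Tensor,        # (T + c, dim) latent states
                 g: torch.Tensor,        # (T, dim) emitted goals
                 c: int) -> torch.Tensor:
    """Push goals towards the direction of advantageous latent-state transitions
    via cosine similarity, as in Equation (2.6)."""
    # Treat the state difference s_{t+c} - s_t as a constant, as described above.
    direction = (s[c:] - s[:-c]).detach()
    cos = F.cosine_similarity(direction, g, dim=-1)     # (T,)
    return -(adv_hi.detach() * cos).mean()

def intrinsic_reward(s: torch.Tensor,    # (t+1, dim) latent states up to time t
                     g: torch.Tensor,    # (t+1, dim) goals up to time t
                     t: int, c: int) -> torch.Tensor:
    """Intrinsic reward R^I_t: average cosine similarity between realised state
    differences and the goals of the previous c steps (requires t >= c)."""
    sims = [F.cosine_similarity(s[t] - s[t - i], g[t - i], dim=-1)
            for i in range(1, c + 1)]
    return torch.stack(sims).mean()
```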

We have mentioned that the lower-level policy has to use the goal in some meaningful way. Let us be specific in this description: the lower-level policy uses the state observation to create an internal representation U ∈ ℝ^{d×k}, and receives the goal g ∈ ℝ^d. The goal is transformed into an internal representation w ∈ ℝ^k through a linear layer φ with no bias. The policy is the result of a matrix-vector product π = U w, and it is because φ has no bias that the lower-level policy cannot ignore the goal embedding. An architectural view of the FeUdal Network is depicted in Figure 4.1a.
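A minimal PyTorch sketch of this goal-conditioned policy head is shown below; the perception and recurrent parts of the lower-level policy are omitted, and the class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class WorkerHead(nn.Module):
    """Goal-conditioned policy head: the goal g in R^d is projected to w in R^k
    by a bias-free linear layer phi and combined with the worker's internal
    representation U through a matrix-vector product."""
    def __init__(self, d: int, k: int):
        super().__init__()
        # No bias, so the resulting w (and hence the policy) cannot trivially
        # ignore the goal embedding.
        self.phi = nn.Linear(d, k, bias=False)

    def forward(self, U: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # U: (num_actions, k) internal representation, g: (d,) goal embedding.
        w = self.phi(g)                          # (k,)
        return torch.softmax(U @ w, dim=-1)      # distribution over primitive actions
```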


Chapter 3

Related Literature

We can make a distinction between hierarchical approaches in reinforcement learning by focusing on how the higher-level policy takes actions concerning its lower-level counterpart. This categorizes the approaches into two classes: the first has a discrete set of lower-level policies from which the higher-level policy selects at a given time step, and the second uses a continuous vector as a goal that is emitted to the lower-level policy at a given time step (sometimes called a goal-conditioned hierarchy [8]). We could see the second class as a generalization of the first: a one-hot encoded vector can act as a higher-level goal which selects from a set of lower-level policies. In practice, however, the distinction between lower-level policies is made explicitly by creating a fixed set of separate lower-level policy parameters from which the higher-level policy selects. Notably, both classes can be extended to account for an arbitrary number of levels of hierarchy.

In this chapter, we will first discuss the former approach, as it is the more traditional approach in hierarchical reinforcement learning. After discussing these approaches, we will look at more recent work on goal-conditioned hierarchical reinforcement learning agents.

3.1 Hierarchical Reinforcement Learning

The majority of work dealing with a single higher-level policy and multiple lower-level policies is concerned with the options framework [17]. At a given time step, a higher-level policy selects from a set of options (lower-level policies), which can be seen as temporally abstracted sequences of actions. Each option has its own initiation set (in which states can this option be activated) and a termination condition (at which states should this option terminate). Since the higher-level policy takes actions only when an option has terminated, it is not acting at the MDP level, but at the level of a Semi-Markov Decision Process (SMDP): the transition from one state to another is normally dependent on the current state and action, but is now additionally dependent on the time elapsed since taking the action. One well-known approach to options is the MAXQ algorithm [18], which creates a graph-based decomposition of the MDP into a set of hand-crafted skills, each learned by an option. Indeed, most approaches to option learning use hand-crafted skills, with rewards typically provided explicitly [17, 16]. Another well-known approach is intra-option value learning [19], a form of off-policy learning of options quite similar to Q-learning.
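A schematic view of an option and of an SMDP-level step is sketched below; this is purely illustrative and not an implementation from the cited literature, and `env` is a hypothetical environment with a `step()` method.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    initiation_set: Set[Any]              # states in which the option may be selected
    policy: Callable[[Any], int]          # intra-option policy over primitive actions
    termination: Callable[[Any], float]   # probability of terminating in a given state

def run_option(env, option: Option, state):
    """Execute an option until it terminates; one call is a single SMDP-level step."""
    total_reward, elapsed = 0.0, 0
    while True:
        action = option.policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        elapsed += 1                      # elapsed time matters for SMDP discounting
        if done or random.random() < option.termination(state):
            return state, total_reward, elapsed
```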

Intra-option value learning is the basis of the Option-Critic architecture [11, 12], which takes an actor-critic-like approach to learning options end-to-end. In this architecture, each option is a policy-gradient based agent that is guided towards learning a (sub-)task by a state-option value function Q_Ω(s, ω). The Q_Ω value function can be used as the higher-level policy through epsilon-greedy sampling, but the option-critic architecture does not restrict it to do so. Although the option-critic architecture has proven to be capable of learning in complex environments such as Atari games [1], the end-to-end nature tends to degenerate the hierarchy into a single-task solving agent where each lower-level policy learns to solve the game [9]. Although we do not directly discuss this architecture in this thesis, Appendix A holds a case study of the option-critic architecture in a simple environment that highlights exactly such a trivial solution.

However, options are not the only approach to hierarchical learning with multiple lower-level policies. There is a wide variety of approaches in the literature that are closely related to options (but not limited to the framework) and instead use terms such as skills and behaviors [20, 21]. More recently, Meta Learning Shared Hierarchies (MLSH) uses a higher-level policy which selects every N steps from a set of lower-level policies, where both lower- and higher-level policies are trained separately using PPO [7, 22]. Specifically, MLSH samples a task from a distribution over MDPs and performs a warm-up training of the higher-level policy, which is followed by a joint update period of both higher and lower levels. This process allows MLSH to quickly learn a wide set of skills and adapt itself to new tasks outside of the training domain.

The most notable research towards learning a hierarchy with no extrinsic reward is Diversity is All You Need and Variational Intrinsic Control [23, 24]. Although such approaches guarantee the learning of a diverse hierarchy through a carefully constructed hierarchical policy, the resulting hierarchy is not task-focused and requires additional steps to maximize extrinsic reward.

3.2 Goal Conditioned Hierarchical Reinforcement Learning

Goal-conditioned hierarchical learning, where the higher-level policy emits a goal for the lower-level policy, is a more recent approach to hierarchical reinforcement learning. This work has roots in the feudal reinforcement learning framework, which leverages an intrinsic reward for each lower-level policy and information hiding at each level of the hierarchy [16]. FeUdal Networks are based on feudal reinforcement learning but attempt to learn the abstract representation of the environment for a single higher- and lower-level policy combination [9]. In this sense, it has removed the information hiding aspect of feudal reinforcement learning, yet retains the use of intrinsic reward to guide the agent in following the higher-level policy's goal embedding. The goal embedding directs the agent towards advantageous directions in state-space, which are learned from the difference s_{t+c} − s_t between the latent representations of the state at time step t and the future state at time step t + c.

Building on this is the Data-Efficient Hierarchical Reinforcement Learning (HIRO) framework, which operates in continuous-state physics domains [6, 25]. The authors note that while the higher-level policy is learning an embedding space, the intrinsic reward for the lower-level policy is noisy. Therefore HIRO directly leverages a subset of the full state (the position and direction of the agent) as goals for the lower-level agent to navigate towards, and can use the Euclidean distance directly as an intrinsic reward signal. However, this approach is not feasible when the state space is more complex, such as pixel-based observations, where the agent must not only learn to generate full images but must additionally learn to ignore the noise contained in images. Subsequent work by the authors focuses on a measure of sub-optimality, which relates the embedding space to the episodic return, and uses this measure to learn a hierarchy [8].


Chapter 4

Method

Goal-conditioned hierarchical reinforcement learning has the promise of learning complex tasks through a hierarchy of policies where the lower-level policy is guided by the higher-level policy. The learned lower-level policy should exhibit a high-quality skill decomposition: being able to solve several sub-tasks through guidance from the higher-level policy in order to finish the task as a whole. This would enable the agent to transfer learned skills to different tasks; the higher-level policy simply has to instruct the lower-level policy on where to go in the new environment. However, recent work has shown that this does not always work out as intended: the lower-level policy might learn to solve the whole task instead of learning a decomposition of the task into several skills [9, 12]. Indeed, it has even been theorized that goal-conditioned hierarchical approaches might function as a form of exploration on top of the reinforcement learning algorithm [13].

The objective of this thesis is to learn a hierarchy with a high-quality skill decomposition that does not collapse through goal-avoidance. We theorize that hierarchies collapse because both the higher- and lower-level policy use the same state information to make decisions at different levels of temporal abstraction. Therefore, we are interested in the granularity of information at which the higher- and lower-level policies operate, and in what information is shared between levels of hierarchy. Our main hypothesis is that providing the higher- and lower-level policies with an appropriate granularity of information, hiding the rest of the state, improves the ability of the agent to learn a high-quality skill decomposition. To test the hypothesis, we derive a set of closely related questions:

Q.1 Does such an agent learn faster than an agent with all information, or an agent which does not hide information between levels of hierarchy?

Q.2 Does such an agent learn skills that are more transferable?

Q.3 Does such an agent’s lower-level policy use the higher-level input more effectively?

To answer these questions, we will focus on the FeUdal Network [9] set in pixel-based environments. The FeUdal Network uses a two-level hierarchy, where the higher-level policy emits a goal embedding which directs the lower-level policy to advantageous states. Hence, we will consider a skill decomposition where the lower-level policy learns useful short-term tasks, guided towards the long-term task learned by the higher-level policy. In Chapter 2 we have discussed that the FeUdal Network's lower-level policy cannot ignore the higher-level goal embedding, but this is no guarantee that it will use it in a meaningful way (it might still be goal-avoiding). We can theorize that the lower-level policy simply treats the goal embedding as noise, and hence will create representations U and w that are robust to it, effectively treating the goal embedding g as noise. In fact, this can be seen as a form of exploration-through-noise, which has been theorized in recent work concerning goal-conditioned hierarchical learning [13].

Our approach modifies the architecture of the FeUdal Network, allowing a detachment between the higher- and lower-level networks, which were previously connected through the perception module. Notice that by splitting information between FuN's higher- and lower-level policies, we effectively move a step closer to the original formulation of feudal reinforcement learning, where the higher-level policy has a more abstract representation of the environment.


        Default    LG        LGP        LGS
FuN     1111135    940127    948863     1136991
A2C     N/A        850837    1118905    N/A

Table 4.1: Agents used in this thesis and the number of parameters for each granularity of information. LG denotes the use of local and global information. LGP denotes the same as LG, but the global information consists of a coarse map of the environment. LGS denotes the use of local and global information without information hiding between higher and lower levels of hierarchy. Since A2C is a non-hierarchical approach, we do not use an A2C-LGS approach nor a default A2C approach.

4.1 Local and Global Information

We have hypothesized that the granularity of information might improve the agent's capacity to learn a skill decomposition. In this section, we will go into detail about the definitions of granularity used in this thesis, namely those of local and global information.

Definition 4.1.1. Local Information. State information which is only relevant for short-term planning. Local information should be a precise description of the agent’s immediate surroundings.

Definition 4.1.2. Global Information. State information which is only relevant for long-term planning. Global information should be a coarse description of the environment, with a specific focus on objects related to long-term credit assignment.

These definitions do not guarantee that a proper hierarchy will be learned, but they incentivize both levels of hierarchy to focus on each other at the correct time-scale. The first definition (4.1.1) regards local information as an incentive for autonomous behavior on a short-term scale for the lower-level policy. It will get rewarded for following the goal objective, but it must itself learn primitives such as opening doors, picking up keys and avoiding obstacles in order to follow the goal. With global information (4.1.2), the higher-level policy guides the lower-level policy in directions of interest (either immediate or long-term rewards). An added benefit of this formulation of global information is that learning an embedding is much easier on coarse information, specifically in the case of a pixel-based state representation.

Required for our approach is the assumption that the observation x_t from the environment at time step t can be decomposed into global information x^g_t and local information x^l_t. We assume we have access to some function f such that f : x_t → (x^g_t, x^l_t). Such a function f could perhaps be learned, but it is fixed for the purposes of this thesis and is a possible direction for future research.

4.2 Models

To test our hypothesis and answer the research questions, we will require three different versions of the FeUdal Network architecture. First, we require an agent which uses the full-state information; this is the default FuN approach, the specific architecture of which can be seen in Figure 4.1a. The FuN agent learns a shared representation z in the perception module f_percept, and uses this representation for both the higher and lower levels of the hierarchy. We primarily use this agent to measure sample efficiency and the goal-conditioning of the lower-level policy on the higher-level goal embedding.

The second agent utilizes a FuN architecture where a single representation is learned using local and global information. We term this agent FuN-LGS, because the latent representations of both local and global information are simply Stacked and used for both levels of hierarchy. We use this agent to measure the degree to which information hiding between levels of hierarchy helps in learning transferable skills. The architecture of the FuN-LGS agent is depicted in Figure 4.1b.

Our last FuN-based agent learns a higher-level policy using global information and a lower-level policy using local information, where no information is shared between levels of hierarchy. This agent is termed FuN-LG and learns two distinct representations. This is our primary approach, as we hypothesize that by hiding information the agent can learn a goal-conditioned hierarchy capable of transferring skills quickly to a new domain. The architecture of this agent is depicted in Figure 4.1c. For our first experiment we utilize two types of global information representations, where the former is based on Cartesian coordinates and the latter on a coarse pixel-based map. When using the pixel-based map, we term the agent FuN-LGP to make a distinction between it and the Cartesian-based higher-level policy.
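The following PyTorch sketch illustrates the FuN-LG information split: the higher level sees only global information, the lower level sees only local information, and no representation or gradient is shared between the two. The module internals (simple MLPs) and names are placeholders, not the architecture used in this thesis.

```python
import torch
import torch.nn as nn

class FuNLGSketch(nn.Module):
    """Schematic of the FuN-LG split: separate encoders per level, no shared
    representation, and no gradients flowing from the lower to the higher level."""
    def __init__(self, global_dim: int, local_dim: int, d: int, k: int, num_actions: int):
        super().__init__()
        self.manager = nn.Sequential(nn.Linear(global_dim, d), nn.ReLU(), nn.Linear(d, d))
        self.worker = nn.Sequential(nn.Linear(local_dim, k * num_actions), nn.ReLU())
        self.phi = nn.Linear(d, k, bias=False)   # goal transform without bias
        self.k, self.num_actions = k, num_actions

    def forward(self, x_global: torch.Tensor, x_local: torch.Tensor) -> torch.Tensor:
        g = self.manager(x_global)                               # goal from global info only
        U = self.worker(x_local).view(self.num_actions, self.k)  # worker from local info only
        w = self.phi(g.detach())                                 # no gradient to the higher level
        return torch.softmax(U @ w, dim=-1)
```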


(a) FuN with full-state information, adapted from Figure 1 in the original literature [9]. The agent learns a shared representation of the full state.

(b) FuN-LGS, which shares local and global information between levels of hierarchy. The agent learns a shared representation of local and global information which is used to learn both higher and lower levels of hierarchy.

(c) FuN-LG(P), which hides local and global information between levels of hierarchy. The hierarchy is split up with no gradients passing between higher and lower levels of hierarchy. The lower-level policy uses local information and the higher-level policy uses global information.

(d) A2C agent. Variants of information splits can be used, but the agent learns a single representation. Hence, x_t in this context can mean a stacked representation of x^g_t and x^l_t.

Figure 4.1: Specific architectural differences between the agents used in this thesis, as explained in Section 4.2. The variable d denotes the hidden dimension size of the goal embedding, and k denotes the lower-level policy's internal representation size. All FuN agents produce a policy through a matrix-vector product π = softmax(U w) from which an action is sampled. The A2C agent directly calculates the policy from its internal representation π = softmax(w). For a precise explanation we refer to Section 2.3.

As a non-hierarchical baseline for the agents using local and global information, we additionally utilize a synchronous Advantage Actor-Critic (A2C) agent. The architecture of this agent is shown in Figure 4.1d, and is highly similar to the FuN-LGS agent except that the A2C agent does not learn a higher-level embedding space. We choose an A2C baseline since FuN's lower-level policy is trained in a similar fashion. The most notable difference between the A2C and FuN agents is the dilated LSTM utilized by the higher-level FuN policy, which is not used by the A2C agent (similar to the original literature [9]).

We have attempted to maintain a similar number of parameters per approach defined above, which can be found in Table 4.1. However, because of the architectural differences there are some differences in parameter counts; the A2C-LG agent has slightly fewer parameters than the FuN-LG agent, whereas the A2C-LGP agent has slightly more parameters than the FuN-LGP agent¹.

¹ We have not directly found that increasing the number of parameters in an agent leads to faster convergence for FuN or A2C.


Chapter 5

Experiments

In order to test our hypothesis, we have set up two experiments that demonstrate the efficiency and transferability of learned hierarchies with local and global information. We start with an example often used in hierarchical reinforcement learning, namely the four rooms environment. This experiment is a basic goal-reaching experiment in a sparse reward setting, and is well suited to testing the hierarchical capacity (the learning of sub-tasks) of all agents with and without local and global information, and without explicit fine tuning. In the second experiment, we test the capacity of the agents to learn two complex tasks separately, and fine tune on a third, related task. We will again test whether the agents learn to effectively use the goal embedding, but focus specifically on the sample efficiency when fine tuning.

In the final section of this chapter we provide implementation details of the agents and the environments used. In order to be consistent throughout this chapter, we use the term goal embedding to refer to the higher-level policy’s learned embedding, and the term goal state to indicate the terminal rewarding state in the environment.

5.1 Four Rooms

Consider a slight modification of the classic four rooms environment, depicted in Figure 5.1. At the start of an episode both the agent (red) and the goal state (green) are placed at random, and each wall facing two rooms has one of its two passages blocked (in blue) through an unbiased coin toss. We can consider a single episode a sample from a distribution over MDPs with a similar action space and objective, i.e. the agent can go left, right, up and down and must reach the goal state within 500 steps. If the agent reaches the goal state, it is rewarded +1; otherwise the reward is 0 at all time steps.

The full-state information is represented as an RGB-image of size 68 × 68 × 3, depicted in Figure 5.1a. The environment itself is a 17 × 17 cell-based grid, where each room has a width and height of 7 cells. We specify local information as a high-resolution slice of 7 × 7 cells with the agent placed in the center, which amounts to an RGB-image of size 63 × 63 × 3 depicted in Figure 5.1b. We compare two different representations for global information, the first is a vector with Cartesian coordinates which holds the location of the agent and the goal (a vector in R4), depicted in Figure 5.1c. The second representation of global information is a coarse map of the environment visible in Figure 5.1d, of size 4 × 4 × 3, where each channel is reserved for a global object. The first channel holds the agent, the second channel holds a key and the third channel is included for optional future global objects.
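The sketch below illustrates the decomposition f : x_t → (x^g_t, x^l_t) from Chapter 4 for this environment; for brevity it operates on the 17 × 17 cell grid rather than the rendered RGB image, and the wall encoding and function name are assumptions rather than the thesis implementation.

```python
import numpy as np

def split_observation(grid: np.ndarray, agent_rc, goal_rc):
    """Illustrative decomposition for the four rooms environment: a 7x7-cell
    local crop centred on the agent and a Cartesian global vector in R^4."""
    r, c = agent_rc
    # Local information: 7x7 cells around the agent; pad the border with walls
    # (here encoded as 1) so the crop is always well defined.
    padded = np.pad(grid, 3, mode="constant", constant_values=1)
    x_local = padded[r:r + 7, c:c + 7]
    # Global information: Cartesian coordinates of the agent and the goal.
    x_global = np.array([r, c, goal_rc[0], goal_rc[1]], dtype=np.float32)
    return x_global, x_local
```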

We have modified the four rooms environment to include the local obstacles (in blue), which is represented only in local and full-state information. Hence, global information alone is not sufficient to solve this task optimally; simply minimizing the distance will not work due to walls, and memorizing the location of halls will not suffice due to the (randomized) obstacles. Similarly, local information alone is not sufficient either: the lower-level policy does not always see the goal in the local information based state. We see both as an incentive for the lower-level policy to use the higher-level goal embedding, but to maintain a degree of autonomy.

Agent Performance. In this paragraph we detail the training parameters used for learning in the four rooms environment for all agents. Each agent was trained using 16 synchronous workers, each taking steps in a copy of the environment started with a different random seed. The learning rate was kept constant at 0.0005 over all agents, the higher-level discount factor was set to γ_hi = 0.99 (which is also used for the non-hierarchical A2C agents), whereas the lower-level discount factor was set to γ_lo = 0.95. Using different discount factors for different levels of hierarchy can additionally ensure a distinct level of temporal abstraction through reward discounting.


(a) Full RGB-pixel based state.

(b) Local information, a slice from the original RGB pixel-based state.

(c) Global information, a vector containing the Cartesian coordinates of global objects (agent and goal).

(d) Global information, coarse representation of global objects in the state.

Figure 5.1: A sampled state from the Four rooms environment, and local and global (both coordinate based and pixel based) information drawn from it.

Figure 5.2: Learning curves for the agents in the four rooms environment, plotting episodic length against the number of frames (legend: FuN-LG, FuN-LGS, A2C-LG, FuN-LGP, A2C-LGP, FuN).

Entropy regularization was kept constant over all FuN agents at 0.01, but had to be tuned for the actor-critic approaches and was set to 0.008 for proper convergence (FuN was not as sensitive to this parameter). We train each agent and average the result over 5 seeds. Each experiment described in this chapter is done with the best performing agent from these 5 seeds. In contrast to the LSTM baseline in the original literature [9], we have found that setting the number of steps n for a gradient update to n = 40 results in poor performance, and set this to n = 400 (similar to the feudal approach).
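For convenience, the training settings reported above are collected below; the dictionary keys are illustrative names, not identifiers from the thesis code.

```python
FOUR_ROOMS_CONFIG = {
    "num_workers": 16,          # synchronous environment copies
    "learning_rate": 5e-4,      # constant over all agents
    "gamma_hi": 0.99,           # higher-level discount (also used for A2C)
    "gamma_lo": 0.95,           # lower-level discount
    "entropy_coef_fun": 0.01,   # FuN agents
    "entropy_coef_a2c": 0.008,  # tuned value for the actor-critic agents
    "n_steps": 400,             # steps per gradient update (n = 40 performed poorly)
    "num_seeds": 5,             # results averaged over 5 seeds
    "max_episode_steps": 500,   # episode limit in the four rooms environment
}
```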

As can be seen in Figure 5.2, we have found that the FuN agent with full information requires around 17 million frames to converge, while the local and global based agents each converge within 10 million frames. Each agent converges to the same average return of 1, which can be seen in Table 5.1. The local and global based agents learn to solve the task as a whole much faster than the default FuN agent. However, no noticeable difference in learning can be seen when we control for information hiding (the difference between FuN-LG and FuN-LGS). Hence, the granularity of information by itself might make the task easier to solve. We can also conclude that global information can be made more complex, into pixel-based maps, without losing too much efficiency in terms of samples or average reward.

Blocking a wall. After training the agents on the four rooms environment, we first test the robustness of the agents to a small change in the environment: the blocking of a wall as depicted in Figure 5.3b. Each agent is evaluated on this change for 10000 episodes, and we can see in Table 5.1 that each agent is capable of navigating to the goal state with no loss in the return. Indeed, the return is exactly equal to baseline performance. From this, we can conclude that each agent learns a policy that is robust to slight changes of the environment.


(a) Four rooms (b) Wall Block (c) Scaling, n = 2 (d) Shift

Figure 5.3: Default environment and modifications of the four rooms environment, used in Section 5.1.

Experiments        FuN           FuN-LG        A2C-LG        FuN-LGP       A2C-LGP       FuN-LGS
Baseline           0.99 ± 0.06   1.00 ± 0.00   1.00 ± 0.00   1.00 ± 0.00   1.00 ± 0.00   1.00 ± 0.00
Wall Block         0.99 ± 0.06   1.00 ± 0.00   1.00 ± 0.00   1.00 ± 0.00   1.00 ± 0.00   1.00 ± 0.00
Scaling, n = 2     N/A           0.99 ± 0.03   0.98 ± 0.12   0.79 ± 0.40   0.90 ± 0.30   0.95 ± 0.22
Scaling, n = 3     N/A           0.97 ± 0.16   0.83 ± 0.38   0.52 ± 0.50   0.57 ± 0.49   0.82 ± 0.38
Scaling, n = 4     N/A           0.88 ± 0.33   0.54 ± 0.50   0.39 ± 0.49   0.46 ± 0.50   0.64 ± 0.48
Shift, c = −25     N/A           0.96 ± 0.18   0.61 ± 0.49   N/A           N/A           0.33 ± 0.47
Shift, c = −50     N/A           0.88 ± 0.33   0.43 ± 0.50   N/A           N/A           0.16 ± 0.37
Goal mask          0.99 ± 0.10   0.00 ± 0.00   N/A           0.11 ± 0.31   N/A           0.98 ± 1.22

Table 5.1: Mean and standard deviation of the episodic return over 10000 evaluation episodes on several experiments in the Four Rooms environment. Scale and coordinate experiments can only be performed with local and global information, hence the N/A for the full-information FuN agent and the pixel-based FuN and A2C agents. Bold indicates the best score achieved per row.

Scaling up to 4^n rooms. The wall-blocking experiment shows us that each agent is relatively robust to small changes in the environment. In this experiment we change the environment more drastically: we scale the environment up from 4 rooms to 4^n rooms with n ∈ {2, 3, 4}, which results in 16, 64 and 256 rooms respectively, and evaluate the performance of each agent. A representation of the scaled-up environment for n = 2 can be seen in Figure 5.3c.

Because scaling up the environment alters the size of the full observation, we cannot use the FuN agent for this experiment. Hence, we compare the performance of the FuN-LG(P) agents to their non-hierarchical counterparts A2C-LG(P), and to the non-information-hiding hierarchical approach FuN-LGS. This experiment serves as a partial answer to research question Q.2, where we test the ability of the learned agent's hierarchy to transfer to a new domain. We will first discuss the Cartesian coordinate based agents, FuN-LG(S) and A2C-LG. When we double the size of the environment (n = 2), we see that both FuN-LG and A2C-LG agents perform similarly, although we should note that the standard deviation of the A2C-LG agent is slightly higher. When we look at scales of n > 2, we can see a significant difference in performance; the FuN-LG agent is able to transfer the navigation skill learned in the four rooms environment to a much larger environment. We must also note that the FuN-LGS agent's performance is similar to that of the A2C-LG agent, and does not achieve significant reward when we scale the environment for n > 2. Combining these two results, we theorize that information hiding between higher- and lower-level policies helps with learning transferable skills, and might additionally incentivize the lower-level policy to follow the higher-level objective.

Now, let's look at the agents using a pixel-based coarse map as global information. Although FuN-LGP outperforms the non-hierarchical A2C agent, the benefits are not directly clear; performance falls sharply when we scale the size of the environment beyond n = 2. We should note that because of the size of our global information (4 × 4 × 3), it is not too strange that performance worsens. For n = 2, each room occupies a single cell in the global information. For n = 3, however, each cell in the global information holds four rooms. Hence, the higher-level policy does not have enough information to guide the agent towards any specific room, and the lower-level policy can only see one room at a time. There is specific information missing at each level of the hierarchy to reach the goal efficiently. We believe that the pixel-based representation offers a clear advantage over the Cartesian-based representation: we do not need to extract coordinates from the environment. However, these preliminary results show that the representation might not be optimal or versatile enough for transferring skills.

(a) Value function for the FuN-LG agent. (b) Value function for the FuN-LGS agent.

Figure 5.4: Value functions for the FuN-LG and FuN-LGS agents. The top row depicts the value functions of the higher-level policy, the bottom row those of the lower-level policy. Walls and the goal are depicted in black (the latter is located in the top-right room). Per agent, the left column shows the value function without a shift, whereas the right column shows the value function with a shift of c = −50.

Coordinate shift. In this experiment, we transform the global information to values not seen during training, but which remain relative to the agent and the goal state. This experiment is again a partial answer to research question Q.2, in an attempt to see if the learned skill decomposition can be transferred to new tasks.

The default coordinate system has its origin at the top-left corner of the four rooms environment, which entails that going south in the environment corresponds to positive rather than negative y values (going east still results in positive x values). Hence, the local and global based agents using coordinates as global information only encounter these values as positive, never negative. We therefore evaluate the performance of the agents when we shift the Cartesian coordinate values by a scalar c outside of the training domain, namely c = −25 and c = −50 (where the original size of the grid is 17 × 17). Agents using pixel-based global information are specifically excluded from this experiment; these agents learn a specific representation for each channel, and shifting these values would not change the spatial location of the global objects within the global information matrix.

Since the A2C baseline does not create an embedding space, we can directly notice in Table 5.1 that the agent is not very robust against a change of base, even if this change leaves the distances relative. This shows an advantage of the hierarchical approach: the higher-level policy is forced to create an embedding space of advantageous directions, whereas the actor-critic might simply use the global information as a heuristic to reach the goal state quickly, which fails if it has not seen these distances before. Therefore, we will mainly focus on the results of the FuN-LG and FuN-LGS agents. We notice that FuN-LG reaches the highest average return, which far exceeds the FuN-LGS agent's return, even though both agents have been trained using the same information. The main difference between the approaches is that the FuN-LG agent hides information between the higher- and lower-level agents, whereas FuN-LGS uses both types of information to train the higher- and lower-level policies. We therefore argue that information hiding between higher and lower levels of hierarchy additionally leads to a more effective skill decomposition, which can indeed be transferred when the task is changed.

We wish to evaluate the difference between the FuN-LG and FuN-LGS agents more broadly, and have therefore visualized their respective value function approximations in Figure 5.4, given that the obstacles are placed as in Figure 5.1a. Per agent, each row depicts a level of hierarchy and each column depicts either the non-shifted (left column) or shifted (right column) case with c = −50. We can first notice the difference in the lower-level value functions; the lower-level policy in FuN-LG is not capable of estimating values outside of the local information, but the FuN-LGS agent uses global information to create a smooth value function outside of the local information. This is beneficial for the lower-level policy's learning, because it updates the actions more generally in the direction of the goal state, but we argue that it also incentivizes the lower-level policy to avoid the higher-level policy's goal embedding. Additionally, the shift itself (right column per agent) drastically changes the value estimation for both levels of hierarchy in the FuN-LGS agent; there is a high value directly next to the state, but otherwise it depicts a noisy value function compared to the non-shifted version. For the FuN-LG agent we see less change in the value functions. The lower-level policy does not receive global information and hence remains unchanged, and the higher-level policy has a somewhat shifted value function that will still guide the agent to the goal state. We would argue that the shift in the direction of the value function towards the top-right corner is due to the directional embedding. Finally, we can see from the lower-level value function that if an obstacle blocks a specific hallway, the alternative free hallway has an increased value surrounding it (most noticeable in FuN-LG).

Figure 5.5: State information (left column), policy conditioned on the actual goal (π(a | s, g), center column) and policy with a masked goal of ones (π(a | s), right column) for the FuN-LG, FuN-LGS and FuN agents (rows). Note that this concerns trained agents; how the agent adapts the goal embedding into the policy could be different during training.

Goal Embedding Masking. In this experiment we wish to answer research question Q.3 directly, by testing the ability of the lower-level policy to avoid the goal embedding (testing for goal-avoidance behavior). We compare all feudal-based agents and take a specific look at three scenarios: (1) the standard feudal agent FuN, (2) the feudal agent which uses local and global information but creates a shared representation between higher- and lower-level policies, FuN-LGS, and (3) the feudal agent which uses local information for the lower-level policy and global information for the higher-level policy, FuN-LG(P).

We assume that the lower-level policy is conditioned on the goal embedding, i.e. π(a_t | x_t, g_t), because it is unable to avoid using g_t ∈ ℝ^d. Let us remind ourselves that this policy is obtained through a matrix-vector product between the internal representation of the lower-level policy U_t ∈ ℝ^{k×|A|} and the goal embedding transformed by the linear layer φ. If we instead replace this goal embedding with a vector of ones of equal size, we are essentially using only the weights of φ, which results in a policy that is not conditioned on the goal embedding, i.e. π(a_t | s_t). We then evaluate the agent in the four rooms environment while masking the goal embedding with a vector of ones, and test whether the agent is indeed goal-conditioned or not.
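A sketch of how this masking can be applied at evaluation time, reusing the goal-conditioned head described in Chapter 2; here U, phi and g stand for the trained worker's internal representation, goal transform and goal embedding, and the function and argument names are illustrative.

```python
import torch
import torch.nn as nn

def masked_policy(U: torch.Tensor, phi: nn.Linear, g: torch.Tensor,
                  mask_goal: bool = False) -> torch.Tensor:
    """Evaluate the lower-level policy with or without the goal embedding:
    replacing g by a vector of ones leaves only the weights of phi, so the
    resulting policy is no longer conditioned on the goal."""
    if mask_goal:
        g = torch.ones_like(g)            # mask the goal embedding with ones
    w = phi(g)                            # internal goal representation w
    return torch.softmax(U @ w, dim=-1)   # distribution over primitive actions
```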

When we look at the results for this experiment in Table 5.1, we can conclude that both FuN and FuN-LGS do not use the goal embedding in a meaningful way. The same is not true for FuN-LG and FuN-LGP; these agents show a low return (exactly zero for FuN-LG) when the goal is masked. We do see that the FuN-LGP agent's return is non-zero, and when we simulate an episode where the agent spawns close to the goal state, the agent's policy occasionally seems to move it towards the goal. We could argue that this is a degree of autonomy that is acceptable for the lower-level policy to have.

To provide a visual understanding of both policies (with the goal embedding and with a masked goal embedding), we have visualized them in Figure 5.5. The first column depicts the state in which the agent is located (full information is depicted for clarity), the center column depicts the regular policy conditioned on the goal, and the rightmost column depicts the policy conditioned on a masked goal (a vector of ones). We can see from the policy distributions for FuN and FuN-LGS that the goal embedding seems to allow for different paths towards the goal state. The policy for FuN-LG changes from going towards the goal to going away from the goal. Recent work in goal-conditioned hierarchical approaches has found that exploration is a key component of why hierarchical approaches outperform non-hierarchical approaches [13]. Similar to this idea, we theorize that during training the goal embedding acts as a form of exploration for the lower-level policy. This could additionally explain why the FeUdal Network is able to outperform the LSTM baseline in the original literature [9].


(a) Obstacle Avoidance (b) Key Grab (c) Composition

Figure 5.6: Environments in experiment 2, Section 5.2. In the obstacle avoidance environment the agent is required to navigate to the goal while avoiding obstacles; in each episode 25 obstacles are spawned in random locations and the goal spawns at random in either the top-left or top-right corner. In the key grab environment the agent is required to fetch the key in order to open the door. The agent receives knowledge of the key's location through global information (once the agent is holding the key, the key location is placed on top of the agent). Key, goal and agent all spawn in random corners at the start of each episode. In the third environment, the agent has to combine the skills learned in the previous two environments in order to reach the goal.

5.2 Transferable Skills Through Fine Tuning

We have seen in the previous experiment that, in the four rooms environment, hiding information was beneficial for learning transferable skills. In this second experiment, we look at information hiding between levels of hierarchy in a multi-task setup to give a partial answer to the first research question Q.1: is information hiding required between levels of hierarchy when using local and global information?

We are motivated in this experiment by the idea that related tasks often require a similar set of skills. Therefore, we train an agent on two tasks that share the same state and action spaces, and evaluate performance on a third task which requires the skills learned in both tasks to solve. If the agent is capable of solving both tasks separately, it should be able to transfer knowledge to a task that is a combination of both. In this experiment, we sample one of the two tasks at the start of each episode. The first task requires the agent to find the goal in an obstacle-filled maze, depicted in Figure 5.6a. In the second task the agent has to fetch a key in order to open a door and reach the goal, depicted in Figure 5.6b. We evaluate the performance of the agent by fine tuning it on the third task, depicted in Figure 5.6c, in which the agent has to fetch the key in an obstacle-filled maze in order to open a door and reach the goal. We take a specific focus on the agents using local and (Cartesian-based) global information, FuN-LG and FuN-LGS, in order to answer research question Q.1.
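As a rough illustration of the episode-level task sampling, the wrapper below picks one of the two training environments uniformly at random at every reset; the class name and interface are hypothetical and only meant to show the sampling scheme, not our exact implementation.

```python
import random

class MultiTaskEnv:
    """Sample one of the training tasks at the start of every episode.

    `envs` would hold the obstacle-avoidance and key-grab environments;
    both are assumed to share the same observation and action spaces.
    """
    def __init__(self, envs):
        self.envs = envs
        self.env = None

    def reset(self):
        self.env = random.choice(self.envs)   # new task each episode
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)
```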

The obstacle course is primarily difficult because it requires autonomous behavior from the lower-level policy with respect to avoiding obstacles, which are only visible in local information. At the start of each episode, 25 obstacles are placed at random in the environment. The agent spawns at a random location at the bottom of the environment and the goal spawns at random in the top-left or top-right corner. If properly guided by the higher-level policy, the lower-level policy should move directly towards the goal state. In the key-grab task, the agent faces a long-term credit assignment problem. At the start of each episode the goal state spawns in either the top-left or top-right corner. The agent spawns in the bottom-left corner, and the key spawns either top-left, bottom-right or top-right depending on where the goal state spawns. Although the key is depicted in both local and global information, the door it opens is only visible through local information. An optimal policy would lead the agent directly to the key, whereas a sub-optimal policy can simply wander from corner to corner. Hence, the higher-level policy should guide the lower-level policy directly to the key. The key is again represented in global information as Cartesian coordinates, which extends the vector from $\mathbb{R}^4$ to $\mathbb{R}^6$. Once the key is fetched, its coordinates are changed to the agent's x and y coordinates (which is the default case in the obstacle course). This environment is specifically difficult for the agents: although an agent can see the key, knowledge about holding the key lies only in (1) global information once it has been fetched (for the FuN-LG(S) agent), and (2) the recurrent memory of the agent itself.
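The following sketch shows how such a global-information vector could be assembled. The exact layout of the base $\mathbb{R}^4$ vector (agent coordinates followed by goal coordinates) is an assumption made for illustration; only the key-extension and the collapse of the key coordinates onto the agent's position follow directly from the description above.

```python
import numpy as np

def global_information(agent_xy, goal_xy, key_xy=None, holding_key=False):
    """Build a Cartesian global-information vector for the key-grab task.

    Adding the key extends the vector from R^4 to R^6; once the key is held,
    its coordinates are replaced by the agent's own position (the default
    case in the obstacle course, where there is no key).
    """
    if holding_key or key_xy is None:
        key_xy = agent_xy
    return np.array([*agent_xy, *goal_xy, *key_xy], dtype=np.float32)

# Example: agent at (1, 6), goal in the top-right corner, key not yet fetched.
print(global_information((1, 6), (7, 1), key_xy=(7, 7)))
```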

Baseline Performance. We train the agents over 5 seeds and average the results, using the exact same parameter setup as described in Section 5.1 (experiment 1). The learning curves for the two-task setup can be seen on the left-hand side of Figure 5.7. The learning curves are quite similar to those of the four rooms training schedule, and we see both the FuN-LG and FuN-LGS agents converge to a return of 1, signaling that they can solve both tasks.


Experiment                               FuN-LG         FuN-LGS
Two Tasks (before fine tuning)           0.99 ± 0.03    0.99 ± 0.03
Composition Task (before fine tuning)    0.06 ± 0.23    0.72 ± 0.45
Two Tasks (after fine tuning)            0.93 ± 0.26    0.78 ± 0.42
Composition Task (after fine tuning)     0.99 ± 0.11    0.99 ± 0.07
Goal mask                                0.00 ± 0.00    0.00 ± 0.00

Table 5.2: Mean and standard deviation of the top performer’s episodic return per agent over 1000 episodes. We additionally investigate if the lower-level policy avoids following the goal objective by performing the goal mask experiment as described previously in Section 5.1.
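For reference, the evaluation protocol amounts to rolling out a fixed number of episodes with a frozen policy and reporting the mean and standard deviation of the episodic return. The sketch below assumes a gym-style environment and a hypothetical `agent.act(obs)` interface; our actual agents additionally carry recurrent state between steps.

```python
import numpy as np

def evaluate(env, agent, episodes=1000):
    """Roll out `episodes` episodes and return mean and std of episodic return."""
    returns = []
    for _ in range(episodes):
        obs, done, episode_return = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(agent.act(obs))
            episode_return += reward
        returns.append(episode_return)
    return np.mean(returns), np.std(returns)
```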

We have included the FuN agent in our training schedule, but it did not manage to converge within the given number of frames and is hence not included in any further benchmarking. This result again shows that using local and global information leads to a more sample efficient approach. Before fine tuning, we evaluated the top performing agents on each of the three environments for 1000 episodes; the results can be seen in Table 5.2. Here, we see somewhat different results from those in the four rooms environment. We have again estimated whether both agents avoid the goal embedding, but this time it seems that both agents are goal-conditioned (both before and after fine tuning). However, when we look at the results of both agents on the third task (the Composition Task before fine tuning in Table 5.2), it seems that only the FuN-LGS agent is capable of achieving a high return on the third task without any fine tuning. When we inspect the FuN-LG agent by evaluating it in the third task, we notice that, even if placed next to the key, it will still move in random directions and not fetch it. If we instead start the agent with the key, it will navigate to the exit and open the door to reach the goal state with an average return of 0.99 over 1000 episodes. Hence, we theorize that the lower-level policy in FuN-LG learns a spurious correlation between the key location in the goal embedding and local obstacles; if it sees local obstacles, the key location should always be the agent's own location. The FuN-LGS agent does not learn such a correlation, and we suspect this is because its lower-level policy receives knowledge of the key location both through the goal embedding and through its own learned representation of global information.

Fine Tuning. Let us first look at the sample efficiency of both agents when fine tuning on the third task, shown on the right-hand side of Figure 5.7. From this and the baseline performance, we can conclude that the FuN-LGS agent is more efficient at adapting to the third task after having learned the first two related tasks. However, we do see that its returns have on average flipped between the two original tasks and the third task, which means the efficiency comes at a cost. The FuN-LG agent was not capable of transferring skills without fine tuning, and takes much longer to learn the third skill (roughly 6 million frames more, a six-fold increase). However, the average performance over all tasks has not dropped significantly for the FuN-LG agent; indeed, the difference is only marginal.

When we compare the before and after fine tuning results in Table 5.2, we argue that both models fall short in some capacity: FuN-LGS does not retain performance after fine tuning, and FuN-LG does not transfer before fine tuning. If we care most about sample efficiency, sharing information between levels of hierarchy seems to lead to quicker adaptation to new tasks. If performance is most important, hiding information between levels of hierarchy leads to a more effective agent over all tasks. This leads us to theorize that information hiding should not be all or nothing; perhaps we can instead control the degree to which information is shared between levels of hierarchy.


[Figure 5.7: two panels of learning curves; x-axis: number of frames, y-axis: episodic length; left panel "Two Tasks", right panel "Fine Tune"; legend: FuN-LG, FuN-LGS, FuN.]

Figure 5.7: Left: learning curves on the two-task environments. Right: fine tuning experiment on the third environment. The FuN agent was not able to converge within 20 million frames.

5.3 Implementation Details

We provide a PyTorch [26] implementation on our GitHub page. The repository contains all agents (variants of FuN and A2C) and the wrappers used to transform the MiniGrid [27] environments, which are built around the OpenAI Gym [28]. The environments are not included in the repository, but are located in a forked build of the MiniGrid Gym, which can be sent on request. Since the authors of FeUdal Networks did not provide a codebase to reproduce their results, we have reproduced the architecture as closely as we could, including the dilated LSTM and the intrinsic reward. All agents are trained using the RMSProp optimizer [29] with a learning rate of 0.0005. We also attempted to train an A2C agent with full-state information, but it did not converge for a wide variety of learning rates and entropy regularization coefficients [3], with and without Generalized Advantage Estimation [30], and it is therefore not included in this thesis. We suspect that this is due to the sparse reward setting, which requires a decent amount of exploration. For the specific hyper-parameters used during training, see Appendix B.
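Since the dilated LSTM is a component we had to reconstruct from the paper's description, the following is a minimal sketch of our interpretation: r parallel LSTM states are kept, and at time step t only the state at index t mod r is updated. The pooling of the hidden states into an output (here a simple mean) is an assumption and may differ from the original implementation.

```python
import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    """Sketch of a dilated LSTM: r parallel LSTM states, one updated per step."""

    def __init__(self, input_size, hidden_size, radius=10):
        super().__init__()
        self.radius = radius
        self.hidden_size = hidden_size
        self.cell = nn.LSTMCell(input_size, hidden_size)

    def init_state(self, batch_size, device):
        h = torch.zeros(self.radius, batch_size, self.hidden_size, device=device)
        return h, torch.zeros_like(h)

    def forward(self, x, state, t):
        h, c = state
        idx = t % self.radius                         # only one sub-state is updated
        h_new, c_new = self.cell(x, (h[idx], c[idx]))
        h, c = h.clone(), c.clone()
        h[idx], c[idx] = h_new, c_new
        return h.mean(dim=0), (h, c)                  # pool over the r hidden states
```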


Chapter 6

Discussion and Conclusion

In this section, we discuss the results from both experiments and relate them to the original hypothesis: providing the higher and lower level policies with an appropriate granularity of information improves the ability of the agent to learn a high-quality skill decomposition. We can differentiate between the agents used in this thesis based on information granularity: an agent either uses full-state information or learns a separate or shared representation of local and global information. We first discuss the experiments in the four rooms environment, followed by a discussion of the fine tuning experiments in a more complex set of environments. After this, we give an analysis of what both experiments tell us about the use of local and global information for hierarchical reinforcement learning. Finally, we discuss the shortcomings of our research and possible avenues for future work.

In the four rooms environment, we trained several hierarchical and non-hierarchical agents to navigate to a randomized goal state, where we additionally placed obstacles only visible in local and full-state information. We found that agents using local and global information (in any capacity) learn much faster than an agent that uses full-state information, which gives us an answer to research question Q.1 for the four rooms environment: hiding information between levels of hierarchy is not important for fast convergence, but hiding information about the full state is. Hence, using local and global information in four rooms offers a better representation of the state for learning the goal-finding task. Additionally, we saw that only by completely hiding local information from the higher-level policy and global information from the lower-level policy does the lower-level policy effectively use the goal embedding. This is further backed up by two experiments that change the environment (scaling the environment up, shifting the global information), in which only the FuN-LG agent is capable of maintaining a high return without fine tuning. Relating this back to research questions Q.2 and Q.3, it is indeed the case that if we specify the granularity as local and global information hidden between levels of hierarchy in the four rooms environment, the agent learns a skill decomposition transferable to new domains.

In the two-task environments, we trained only hierarchical agents to solve two separate tasks, and fine tuned each agent on a third environment which requires both learned skills to solve. We again see that using local and global information leads to much faster convergence in solving the task(s), similar to the conclusion for research question Q.1 in the four rooms environment. The most interesting point of discussion before fine tuning is that information hiding between levels of the hierarchy does not lead to transferable skills when faced with a related task: whereas the agent that hides local and global information between levels of hierarchy is not capable of transferring skills, the agent that does not hide information is capable of doing so. This result seems to be the opposite of that found in the four rooms environment, where information hiding led to a better skill decomposition. Hence, information hiding between levels of hierarchy might not be of importance in the multi-task setup. Through fine tuning, we saw that the FuN-LG agent is much less sample efficient at learning a new task that requires similar skills. The FuN-LGS agent is capable of quickly adapting to a new task, but this comes at the cost of decreased performance on the originally learned set of tasks.

Comparing the results from both experiments, we draw the following conclusions. First, we found that the original FeUdal Network does not learn a proper skill decomposition between higher and lower levels of hierarchy, even when faced with a relatively easy environment. In fact, the lower level learns to avoid using the goal embedding from the higher-level policy. Second, specifying the granularity at which the hierarchical agent learns is beneficial to the sample efficiency of the agent in the environments we have used, even though it changes the MDP into a POMDP. Third, to learn a skill decomposition, the degree of information hiding between levels of hierarchy depends strongly on the environment. In the four rooms environment, we concluded that it is beneficial to hide information between levels of hierarchy to learn transferable skills. In the two-task
