
MSc Artificial Intelligence

Master Thesis

Finding the Boundary Between High-level and Low-level Policies Using Meta-learning

by

David Kuric

11704349

August 7, 2019

36 EC January 2019 - August 2019

Supervisor:

dr. Herke van Hoof

Assessor:

dr. Efstratios Gavves

University of Amsterdam

Informatics Institute


Abstract

Following the recent success of single-task reinforcement learning algorithms and their applicability in real-world scenarios, multi-task learning, fast adaptation and transfer learning have become more popular topics in reinforcement learning research. One branch of this research tries to learn a hierarchically structured policy that is able to solve multiple hierarchical tasks with a similar structure. In order to create such a policy, multiple low-level actions need to be combined into temporally extended high-level actions. However, if high-level actions are too short or too long, transfer performance can be diminished. It is thus crucial to set the correct boundary between high-level and low-level actions. Finding this boundary is challenging because it is hard to quantify the suitability of a boundary. In this work, we propose that a good boundary should allow for fast adaptation to multiple tasks or new tasks. By combining recent findings in meta-learning (MAML) with a hierarchical reinforcement learning approach (IOPG), we develop an algorithm that aims to find this boundary by optimizing the performance on multiple tasks after a single adaptation step (gradient update). We use a set of simple small maze environments to show the benefits of dividing a task into sub-tasks with a performance-oriented objective. Our algorithm is able to adapt to a single task faster than a task-specific algorithm while keeping its policy expressive enough to outperform a multi-task algorithm that does not use meta-learning. In subsequent experiments we use a more complex taxi environment and provide a thorough analysis of the results, where we point out possible areas for improvement.


Contents

1 Introduction
2 Related work
  2.1 Hierarchical reinforcement learning
  2.2 Meta-learning
3 Background
  3.1 Reinforcement learning and options
    3.1.1 Markov Decision Process
    3.1.2 Policy gradient
    3.1.3 Options and Inferred Options Policy Gradient
  3.2 Gradient based meta-learning
    3.2.1 MAML applied to reinforcement learning
4 Method
  4.1 Objective
  4.2 Algorithm
  4.3 Implementation Details
5 Experiments
  5.1 Small maze
  5.2 Large maze
  5.3 Taxi
    5.3.1 Single gradient step
    5.3.2 Insufficient exploration
6 Discussion
7 Conclusion and Future work


Chapter 1

Introduction

Artificial intelligence models are nowadays able to solve many complex tasks such as image classification or segmentation (Krizhevsky et al., 2012), playing classic games like chess or go (Silver et al., 2016) or generating an article with human-like or superhuman performance (Radford et al., 2018). Nevertheless, humans are still superior when it comes to transferring knowledge from one task to another. In general, machine learning models are trained to be task specific and it is usually not possible to use a model trained on one task for another, similar task, even though the difference between them may seem small from a human perspective (such as recognizing cats in images vs. recognizing dogs). Instead, a new model has to be trained from scratch, which is both time consuming and data inefficient. This problem is amplified even more in reinforcement learning (RL), where an agent acts in an unknown environment and learns from gathered experience. In this case, the data distribution (or the distribution of visited environment states) might change rapidly as the agent adjusts its behaviour. The agent's model thus also needs to determine the importance of each datapoint, since some states may be visited more often than others. This adds more complexity to the problem when compared to the supervised learning setting. Consequently, data efficiency became a popular topic in reinforcement learning research in the past couple of years and several new algorithms which increase data efficiency were proposed (Silver et al., 2014; Mnih et al., 2016; Schulman et al., 2017). However, these are mainly concerned with the data efficiency of single-task learning, and although they are able to train better performing agents much quicker, they do not utilize any prior knowledge and consequently do not produce agents that could adapt well to a broad range of tasks.

One way to achieve faster adaptation of agents on reinforcement learning tasks could be to split tasks into smaller reusable components: sub-tasks. If the agent learns to solve many of these components, it will be able to quickly adapt to many different tasks because combining solutions to sub-tasks is easier than learning a solution to a complex problem from scratch. Furthermore, the agent can also use these sub-tasks to form a solution to a new task and accelerate learning on a problem it has never seen before. However, this type of transfer learning can lead to suboptimal performance when the training tasks and new tasks do not have a common structure.

To allow for the task decomposition, the agent's policy needs to employ a hierarchical structure such as the options framework (Sutton et al., 1999). In the options framework, the agent's policy is composed of multiple low-level policies and a high-level policy, which is used to choose among the low-level policies. Therefore, if the agent needs to control a robot that has to walk forward and jump over obstacles in multiple environments where the obstacles are at different places, it can learn a low-level policy to solve the walking sub-task and another low-level policy for solving the jumping sub-task. At test time, or when encountering a new environment, the agent can already walk and jump and only needs to learn a high-level policy that combines the low-level policies. This decomposition should thus lead to faster training, since the agent does not need to learn from scratch but can use the solutions to sub-tasks from pre-learned low-level policies.

Additionally, this is similar to how humans reason about tasks. Consider a man driving a car as an example. In order to drive to different locations, he first might need to drive to a highway, then continue to a different city and finally drive to his destination inside the city. Hence he might have a policy for driving to a highway, a few other low-level policies for driving to different cities and a few more for driving to the destination once inside the city. However, this is only one of many different task decompositions


that can be applied in this case. The boundary between high-level and low-level policies could be set a lot higher or lower. If the boundary is set too high, a low-level policy would be driving to a destination which corresponds to a task-specific policy. On the other hand, if the boundary is set too low, the low-level policy may be turning the steering wheel or even flexing the muscles in the body. Even though these extreme cases are possible, there is probably no driver in the world who thinks on such a low or high level when driving.

When using a hierarchical policy it is crucial to set the correct boundary between low-level and high-level policies. If the chosen boundary is too low-level, learning on new tasks will be slow because the agent will need to choose among low-level policies too many times. The extreme case would be to choose a low-level policy before each action, which is equivalent to having task-specific low-level policies. Such behavior was observed in the option-critic architecture (Bacon et al., 2017), where the duration of high-level decisions shrank over time. On the other hand, if we choose the boundary to be too high-level, we might not achieve good generalization and we risk that the sub-tasks will not be transferable to other tasks. This brings us to the fundamental question of this research: how do we know if the boundary between low-level and high-level policies is good? Recent works (Bacon et al., 2017; Frans et al., 2018; Riemer et al., 2018) have shown that options can be beneficial for multi-task learning and transfer learning. Therefore, we believe that a simple answer to this question is that a good boundary should lead to fast adaptation and should quickly achieve good performance on many similar or new tasks. In particular, we propose an objective that optimizes the performance on multiple tasks after a single adaptation step. With this formulation, the policy should be forced to be modular enough to be transferable to many tasks from the same family, while having low-level policies that are expressive enough to allow for fast adaptation.

We will develop a hierarchical reinforcement learning algorithm which will be trained to find the right sub-task decomposition that will lead to fast adaptation of agents to new tasks. To achieve this, the model maximizes a theoretically justified objective that builds upon the MAML framework (Finn et al., 2017). This is in contrast with prior work, which used an empirically justified objective and a complex training schedule (Frans et al., 2018). Furthermore, we will use a more complex option structure (Smith et al., 2018) that should be more data efficient and does not require the master policy to use low-level policies for a pre-specified number of timesteps.

In the following Chapter 2 we position our research relative to the related work in both hierarchical reinforcement learning and meta-learning. The necessary background knowledge about reinforcement learning, the options framework and gradient-based meta-learning is explained in Chapter 3. The new method is then outlined in Chapter 4 along with implementation details. In Chapter 5 we explain the experimental setup and present the results; this chapter also contains further analysis of the negative results. Following the analysis, we discuss the results in Chapter 6 and identify the limitations of the presented approach. Finally, Chapter 7 concludes the thesis and points out possible areas for future research.


Chapter 2

Related work

Since our work mainly builds upon research done in two areas of artificial intelligence, we divide this chapter into two subsections. In the first one, we cover hierarchical reinforcement learning. This research area is mainly focused on learning a useful task decomposition of reinforcement learning tasks in order to accelerate the learning process. The second subsection is about meta-learning. We discuss a broad range of algorithms and models that are “learning to learn”, or in other words, that learn models which are able to adapt quickly to new tasks from the same family. These two areas are often intertwined and thus some of the studies are mentioned in both subsections.

2.1

Hierarchical reinforcement learning

A variety of approaches to hierarchical reinforcement learning that use different paradigms have been studied in recent history. A subset of these methods uses unsupervised training objectives to learn diverse skills (Florensa et al., 2017; Achiam et al., 2018; Eysenbach et al., 2019). While these methods prove to be useful when no reward function is available, the lack of supervision can also be seen as a drawback, since there is no guarantee that the learned diverse skills will be useful for solving real tasks. Another branch tries to learn a high-level policy which is trained to set goals for a low-level policy (Vezhnevets et al., 2017; Nachum et al., 2018; 2019). Although the goals were not interpretable in the initial work of Vezhnevets et al. (2017), because they were only used as latent vectors that influenced the policy, later work (Nachum et al., 2019) vastly improved the interpretability by using states of the environment as goal states that the agent should get into. Last but not least, most methods in hierarchical reinforcement learning build upon the options framework (Sutton et al., 1999). The major advantages of this framework are its applicability to a wide variety of methods and algorithms as well as the existence of prior work focused on an extensive theoretical analysis.

In the options framework, the options can be seen as macro actions that each have their own low-level policy, a termination condition (or termination function) and an initiation set. Earlier works focused on finding so-called bottleneck states to use as sub-goals in discrete spaces (McGovern and Barto, 2001; Menache et al., 2002; Şimşek and Barto, 2009; Silver and Ciosek, 2012) and were later extended to continuous spaces (Niekum and Barto, 2011; Machado et al., 2017). Nevertheless, these approaches require prior knowledge about the environment and are therefore much harder to apply to a wide variety of general problems.

A more promising approach in this regard is to use function approximators that are combined into a single hierarchical policy which is then optimized by the policy gradient. This can be done by augmenting the states with option indices, as proposed by Levy and Shimkin (2012). Bacon et al. (2017) further developed this idea by using neural networks as well as option-specific value functions. Their option-critic architecture was also later extended to multi-level hierarchies in the work of Riemer et al. (2018). However, the main drawback of the option-critic architecture turned out to be a long training time, since only the option that was active and therefore responsible for a given action was updated at each timestep. A different approach that does not suffer from this problem was introduced by Daniel et al. (2016). In this paper, the authors treat the options as latent variables and use a policy structure that allows for optimization of the objective with expectation-maximization. Nevertheless, this comes at the cost of restricting the policy to be a linear combination of state features. Consequently, neural networks cannot be used as approximators in the policy and the expressiveness of the policy is severely reduced. The drawbacks of the two aforementioned methods were noticed by Smith et al. (2018). By treating the options as latent variables and inferring the probability of each option being active at update time, they were able to develop the Inference-Based Policy Gradient Method for Learning Options (IOPG), which is both data efficient and does not restrict the expressiveness of the policy.

Even though IOPG was shown to work well on relatively complex tasks such as robotics environments, its performance was inferior compared to state-of-the-art non-hierarchical reinforcement learning algorithms (Schulman et al., 2015b; 2017). In our opinion, this is not surprising because IOPG also recovers additional structure. We believe that the benefits of a structured policy would be more apparent in a multi-task setting or when applied to tasks with a clear hierarchical structure. In the work by Frans et al. (2018), the options framework was successfully applied to multi-task learning. Even though the authors used a complex and empirically motivated training routine, they were able to learn a meaningful hierarchy in a couple of environments with a robotic ant and to successfully transfer the learned hierarchical policy to a new task.

As was later pointed out in the work by Igl et al. (2019), the complex training schedule can lead to local minima and an insufficient diversity of options. To overcome these issues, the authors proposed a framework that builds upon the planning-as-inference approach (Levine, 2018) and the approach known as DISTRAL (Teh et al., 2017). This framework allows for the adaptation of the pre-learned sub-policies during test time by maintaining a task-agnostic prior and controlling with a parameter how much the posterior is able to deviate from this prior.

2.2

Meta-learning

Meta-learning or “learning to learn” is a concept that has been studied extensively in recent years. A common goal of different meta-learning methods is to formulate the learning task in such a way that the model will be able to adapt to a broad range of tasks and achieve good performance. A popular approach that can be used on a wide spectrum of domains is to train a meta-learner that learns the update rule of the learner's model parameters (Bengio et al., 1991; Andrychowicz et al., 2016; Li and Malik, 2016). In this case, the update during the adaptation can be learned by a neural network (Andrychowicz et al., 2016); the necessary information from the adaptation is then usually stored in the history (Li and Malik, 2016), where recurrent neural networks are a natural choice (Duan et al., 2016; Wang et al., 2016), or induced using a combination of attention and temporal convolution (Mishra et al., 2017). These algorithms were shown to work well on various tasks, but even though they minimize finite-horizon regret (Wang et al., 2016), which is a theoretically sound objective, they need to be trained on a finite number of inner updates. Consequently, the models only see a certain number of timesteps or episodes on a single task during training. This can become a limiting factor during the adaptation if more timesteps and episodes are used. Because the model has never seen this many episodes, it might not know how to update its hidden state to improve the performance. Convergence is therefore not guaranteed. This is in contrast with gradient-based methods, which can simply take more gradient steps during test time and are guaranteed to converge to a local optimum.

Unlike the methods we have introduced so far, gradient-based methods such as Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) propose a different way of updating the learner's model parameters. Instead of a learned update, MAML updates the parameters with gradient descent. In other words, as a first step, MAML computes the pre-update loss and performs an inner gradient update with a fixed learning rate. The new parameters are then used to calculate a post-update loss. This loss is then differentiated with respect to the pre-update parameters to obtain the best initial parameters. The procedure involves higher-order derivatives but can be easily implemented with automatic differentiation software. Thanks to being model-agnostic, MAML could be applied to both supervised meta-learning problems, where it improved state-of-the-art performance, and meta-reinforcement learning problems, where it learned a model that was able to reach good performance on a variety of similar tasks after several gradient updates. This led to a rise of gradient-based meta-learning algorithms.

In the work of Li et al. (2017), the authors pointed out that a bad choice of the inner learning rate can negatively affect the performance and therefore learned the inner learning rate as well. Zintgraf et al. (2019) noticed that by splitting the parameters into a task-agnostic group, which is learned jointly during training, and a task-specific group, which is adjusted during test time, the algorithm scales better and is less sensitive to the choice of the inner learning rate.

While the aforementioned papers focused more on meta-learning and improving the general method, follow-up works in meta-reinforcement learning noticed a mismatch between the theoretical derivation of MAML and its practical implementation (Al-Shedivat et al., 2018; Stadie et al., 2018; Foerster et al., 2018). The discrepancy occurs because in reinforcement learning the sampled data is also dependent on the actions performed by the agent, and therefore on the model parameters. If one simply uses the surrogate loss (Schulman et al., 2015a) twice, as was done in the MAML implementation, one term of the gradient will be missing. This is a more general problem with the usage of the surrogate loss in stochastic computation graphs to generate higher-order gradients. Foerster et al. (2018) found a way to overcome this problem with the Infinitely Differentiable Monte Carlo Estimator (DiCE).

After noticing the missing term, Stadie et al. (2018) formulated an alternative objective which encouraged exploration by increasing or decreasing the likelihood of pre-update trajectories based on post-update rewards. Crucially, this formulation may lead to misalignment of the gradient parts from pre-update and post-update trajectories and consequently to worse updates, as shown in the follow-up work of Rothfuss et al. (2019). In this work, the authors also hypothesize about the high variance of the DiCE estimator when used in a MAML objective and propose an alternative Low Variance Curvature (LVC) estimator. Furthermore, they also empirically show on a couple of tasks that the variance of LVC is indeed smaller than when DiCE is used. Finally, the authors introduce Proximal Meta-Policy Search (ProMP), which outperforms previously introduced meta-reinforcement learning algorithms. Last but not least, it is also important to mention that several methods from the previous subsection about hierarchical reinforcement learning also utilize meta-learning (Frans et al., 2018; Igl et al., 2019).


Chapter 3

Background

In this chapter we cover the necessary reinforcement learning and machine learning background which is needed throughout the rest of the thesis. We also introduce notation that will be used in later chapters.

3.1

Reinforcement learning and options

Reinforcement learning (RL) is a sub-field of artificial intelligence that studies the behavior of an agent in environments. In a classic RL setting, the agent performs actions based on the current state of the environment and receives a reward as well as a new state from the environment. Its goal is then to maximize the amount of reward it receives (in a certain time period). It is important to emphasize that the agent does not have any prior information about the environment dynamics. Therefore, in order to learn the best strategy, it needs to explore the environment by trying out different actions.

Figure 3.1: Markov Decision Process. The agent samples an action $a_t \sim \pi(a_t|s_t)$; the environment returns a new state $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$ and a reward $r_t \sim R(s_{t+1}, s_t, a_t)$.

3.1.1

Markov Decision Process

The interaction of an agent with an environment is depicted in Figure 3.1 and can be formalized using the Markov Decision Process (MDP). The MDP refers to the part of the diagram that includes the environment dynamics. Additionally, the agent uses its policy $\pi(a_t|s_t)$ to perform actions based on the states and rewards it receives from the MDP.

Formally, a finite Markov Decision Process $\mathcal{M}$ is a tuple $\langle S, A, p_0, P, R, \gamma \rangle$ where $S$ is a finite set of states, $A$ is a finite set of actions, $p_0$ is an initial state distribution, $P$ is a state transition function, $R$ a reward function and $\gamma$ is a discount factor. During a single episode, the agent uses the current state $s_t$ at each timestep $t$ to perform an action $a_t$ according to its policy $\pi(a_t|s_t)$, after which it receives a reward $r_t$ and a new state $s_{t+1}$. These are sampled from the reward function and the state transition function of a particular MDP (the initial state is sampled from $p_0$). The procedure is then repeated until the agent reaches a termination state. When the termination state is reached the episode ends. It is important to note that in our setting we will only consider an agent in episodic environments, but most concepts and findings are easily extensible to continuing tasks.
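The interaction loop just described can be written down compactly. Below is a minimal Python sketch, assuming a gym-style environment object with reset() and step(action) methods and a policy callable that samples an action from π(a_t|s_t); these names are illustrative assumptions, not part of the thesis code.

    def run_episode(env, policy, max_steps=1000):
        """Roll out one episode and return the gathered experience."""
        states, actions, rewards = [], [], []
        state = env.reset()                           # s_0 ~ p_0
        for _ in range(max_steps):
            action = policy(state)                    # a_t ~ pi(a_t | s_t)
            next_state, reward, done, _ = env.step(action)
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            state = next_state                        # s_{t+1} becomes the current state
            if done:                                  # termination state reached
                break
        return states, actions, rewards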

A sequence of states, actions and rewards during one episode can be combined into a single trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$. Although in the ideal case the agent's objective would be to maximize the sum of rewards in a single episode, the discounted episodic return is preferred instead, mainly due to better theoretical properties such as lower variance estimates and extensibility to continuing tasks:

\[ G_t(\tau) = \sum_{t'=t}^{T} \gamma^{(t'-t)} r_{t'}, \tag{3.1} \]

where $\gamma$ is a discount factor that determines the relative importance of early and late rewards. We use $t$ to emphasize that in the general case the discounted return is measured from timestep $t$. However, there are cases where we omit $t$ for brevity; in these cases $G$ always corresponds to $G_0$. We also omit the discount factor for simplicity when decomposing $G$ in our derivations.
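As an illustration, the discounted return of Equation 3.1 can be computed for every timestep of a trajectory with a single backwards pass; the following Python sketch assumes the rewards of one episode are given as a list and is not taken from the thesis code.

    import numpy as np

    def discounted_returns(rewards, gamma=0.95):
        """Compute G_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for every t."""
        returns = np.zeros(len(rewards))
        running = 0.0
        # Iterate backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}.
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    # Example: a sparse reward of 1 received at the final step.
    print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.95))  # [0.857375 0.9025 0.95 1.]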

3.1.2

Policy gradient

Using the definition of the trajectory return, the agent's objective can be written as:

\[ J = \mathbb{E}_{\tau}\left[ G(\tau) \right]. \tag{3.2} \]

If we assume that the agent uses a parametrized policy $\pi_\theta(a_t|s_t)$ to perform actions, one can use gradient ascent to maximize the objective. However, to do that, we need to be able to find the gradient of the objective with respect to the policy parameters, $\nabla_\theta J$, which can be derived as:

\[\begin{aligned}
\nabla_\theta J &= \nabla_\theta \mathbb{E}_{\tau}\left[ G(\tau) \right] && (3.3) \\
&= \int \nabla_\theta p(\tau)\, G(\tau)\, d\tau && (3.4) \\
&= \int p(\tau) \frac{\nabla_\theta p(\tau)}{p(\tau)}\, G(\tau)\, d\tau && (3.5) \\
&= \mathbb{E}_{\tau}\left[ \nabla_\theta \log p(\tau)\, G(\tau) \right]. && (3.6)
\end{aligned}\]

Furthermore, the derivative in the expectation can be rewritten:

\[\begin{aligned}
\nabla_\theta J &= \mathbb{E}_{\tau}\left[ \nabla_\theta \log p(\tau)\, G(\tau) \right] && (3.7) \\
&= \mathbb{E}_{\tau}\left[ \left( \nabla_\theta \log p(s_0) + \sum_{t=0}^{T-1} \nabla_\theta \log p(s_{t+1}|s_t, a_t) + \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \right) G(\tau) \right] && (3.8) \\
&= \mathbb{E}_{\tau}\left[ \left( 0 + 0 + \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \right) G(\tau) \right] && (3.9) \\
&= \mathbb{E}_{\tau}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G(\tau) \right]. && (3.10)
\end{aligned}\]

The last expression is also called the REINFORCE estimator (Williams, 1992). Being one of the simplest policy gradient methods, the positives of REINFORCE are that it is easy to implement and can be used with any kind of differentiable stochastic policy (the latter being true for all policy gradient methods). On the other hand, due to the expectation being approximated by sampling, one of the major drawbacks of this estimator is its high variance. This often makes learning unstable and slow. It is easy to reduce the variance by realizing that, due to causality, future actions do not influence the current reward:

\[\begin{aligned}
\nabla_\theta J &= \mathbb{E}_{\tau}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{T} r_{t'} \right] && (3.11) \\
&= \mathbb{E}_{\tau}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t(\tau) \right]. && (3.12)
\end{aligned}\]

We will not provide a proof, but it is important to note that this adjustment keeps the estimator unbiased (Baxter and Bartlett, 2001). Additionally, the variance can be further reduced by subtracting a baseline from the return. This baseline can also be dependent on the state $s_t$, and thus a common choice is a state value function $V(s)$ or its estimate $V_\kappa(s)$. Similarly to Equation 3.11, this does not introduce any bias to the estimator.

\[\begin{aligned}
\nabla_\theta J &= \mathbb{E}_{\tau}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \left( G_t(\tau) - b \right) \right] && (3.13) \\
&= \mathbb{E}_{\tau}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \left( G_t(\tau) - V_\kappa(s_t) \right) \right]. && (3.14)
\end{aligned}\]

These are the most commonly used unbiased policy gradient estimators in reinforcement learning. Interestingly, the return $G_t(\tau)$ can be seen as a sample of the true state-action value function $Q(s_t, a_t)$ (Sutton et al., 2000). If we use the state-action value function and the value function is known, we can interpret the gradient with the baseline term as:

\[\begin{aligned}
\nabla_\theta J &= \mathbb{E}_{\tau}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \left( Q(s_t, a_t) - V(s_t) \right) \right] && (3.15) \\
&= \mathbb{E}_{\tau}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, A_t \right], && (3.16)
\end{aligned}\]

where $A_t$ is called the advantage and represents the advantage of performing $a_t$ over other actions (or the mean action). However, in practice it is not possible to use the true advantage because $Q(s_t, a_t)$ and $V(s_t)$ are not known. Consequently, several advantage estimators have been proposed by the RL community in recent years, such as the n-step temporal difference (TD) error:

\[ A_t^{TD(n)} = \sum_{k=0}^{n} \gamma^k r_{t+k} + \gamma^{n+1} V_\kappa(s_{t+n+1}) - V_\kappa(s_t) \tag{3.17} \]

or the more recently popular generalized advantage estimator (GAE) (Schulman et al., 2016), which uses an exponentially-weighted average of n-step TD errors:

\[ A_t^{GAE(\lambda)} = (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l A_t^{TD(l)}. \tag{3.18} \]

However, it is important to note that even though these estimators usually have smaller variance, they are no longer unbiased. In the case of GAE, the bias-variance trade-off can be controlled by the parameter $\lambda$. When $\lambda$ is set to 0, the estimator reduces to $A^{TD(0)}$ and is biased but also has low variance. On the other hand, when $\lambda$ is set to 1, we get an unbiased advantage estimator with higher variance that is the same as the one in Equation 3.14.
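In practice the GAE of Equation 3.18 is usually computed with the equivalent backwards recursion over one-step TD errors (Schulman et al., 2016). The sketch below assumes numpy arrays of per-timestep rewards and baseline values V_κ(s_t), with the value after the terminal state taken to be zero; it is an illustration, not the thesis implementation.

    import numpy as np

    def gae_advantages(rewards, values, gamma=0.95, lam=0.98):
        """Generalized advantage estimation from one-step TD errors.

        delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        A_t     = sum_l (gamma * lam)^l * delta_{t+l}
        """
        T = len(rewards)
        values = np.append(values, 0.0)          # V after the terminal state is zero
        advantages = np.zeros(T)
        running = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            running = delta + gamma * lam * running
            advantages[t] = running
        return advantages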

3.1.3

Options and Inferred Options Policy Gradient

In the options framework (Sutton et al., 1999), the agent's policy consists of a set of options $\Omega$ and a policy over options $\pi_\Omega(\omega_t|s_t)$. Each option has an initiation set $I$ (in our case $I = S$), an option policy $\pi^{\omega}(a_t|s_t)$ and a termination function $\xi^{\omega}(s_{t+1})$.

The policy over options is state dependent and is used by the agent to decide which option to use in a particular state (it is always used in $s_0$, or after an option was terminated, to choose a new option). Option policies $\pi^{\omega}(a_t|s_t)$ are used to perform actions depending on which option is currently active and what state the agent is in. After performing an action and receiving a reward $r_{t+1}$ and a new state $s_{t+1}$, the option can be terminated with a probability given by the corresponding termination function $\xi^{\omega}(s_{t+1})$. If the option terminates, the agent uses the policy over options $\pi_\Omega(\omega_{t+1}|s_{t+1})$ to choose a new option, otherwise it keeps using the old one. This process is repeated until the end of the episode.

Figure 3.2: Graphical model of an IOPG trajectory: options $\omega_t$ and actions $a_t$ conditioned on states $s_t$ (Smith et al., 2018).

The policy over options and termination policies can also be combined into a single option-to-option function, where $\delta$ is the Kronecker delta:

\[ \tilde{\pi}_\Omega(\omega_t|\omega_{t-1}, s_t) = \left[ 1 - \xi^{\omega_{t-1}}(s_t) \right] \delta(\omega_t, \omega_{t-1}) + \xi^{\omega_{t-1}}(s_t)\, \pi_\Omega(\omega_t|s_t). \tag{3.19} \]

The Inferred Options Policy Gradient (IOPG) (Smith et al., 2018) is an extension of the simple policy gradient that combines its update rule with the options framework. Similarly to the objective of the simple policy gradient in Equation 3.14, the objective of IOPG can be formulated as maximizing the expected episodic return. However, since the options are treated as latent variables, the gradient of the objective has to be adjusted to:

\[ \nabla_\theta J = \mathbb{E}_{\tau}\left[ \sum_{t=0}^{T} \nabla_\theta \log \underbrace{\sum_{\omega_i} m(t)^{\omega_i} \pi_\theta^{\omega_i}(a_t|s_t)}_{\pi_\theta(a_t|s_t)} \, A_t \right], \tag{3.20} \]

where $m(t)^{\omega_i}$ is the probability of being in option $\omega_i$ at time $t$ and can be calculated with the forward algorithm using the update rule:

\[ m(t+1)^{\omega_{t+1}} = \frac{\sum_{\omega_i} m(t)^{\omega_i}\, \pi_\theta^{\omega_i}(a_t|s_t)\, \tilde{\pi}_\Omega(\omega_{t+1}|\omega_i, s_{t+1})}{\sum_{\omega_k} m(t)^{\omega_k}\, \pi_\theta^{\omega_k}(a_t|s_t)}. \tag{3.21} \]

As shown in Figure 3.2, these probabilities are dependent on the previous actions $a_{0:t-1}$ and states $s_{0:t}$. Note that during the update, the options are treated as latent variables. Therefore, it is not necessary to know which option generated a particular action. Instead of updating a single option policy, all option policies are updated in proportion to how likely it is that they generated the action.
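A small sketch of the forward recursion in Equation 3.21 is given below. The input layout is our own assumption, not the thesis code: pi_actions[t, i] holds π_θ^{ω_i}(a_t|s_t) for the action that was actually executed, pi_tilde[t] is the option-to-option matrix with entries π̃_Ω(ω_j|ω_i, s_{t+1}), and m0 is the initial option distribution π_Ω(ω|s_0).

    import numpy as np

    def option_probabilities(pi_actions, pi_tilde, m0):
        """Forward recursion for m(t), the probability of each option being active."""
        T, n_options = pi_actions.shape
        m = np.zeros((T, n_options))
        m[0] = m0
        for t in range(T - 1):
            # Joint weight of being in option i and producing a_t, normalised as in Eq. 3.21 ...
            weighted = m[t] * pi_actions[t]
            weighted /= weighted.sum()
            # ... then propagated through the option-to-option transition pi~_Omega.
            m[t + 1] = weighted @ pi_tilde[t]
        return m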

This section provided the necessary reinforcement learning background and introduced part of the notation which will be used throughout the thesis. In the next section, we explain gradient-based meta-learning and introduce the remaining notation.

3.2

Gradient based meta-learning

Compared to black-box meta-learning methods (Duan et al., 2016; Wang et al., 2016), gradient-based meta-learning methods recover more structure, since the update during the adaptation (learning) is restricted to be a gradient step. As a result, these models are often less flexible but easier to train and allow for better interpretability of the updates during test time. Gradient-based meta-learning can be divided into two levels. The low-level learning consists of an ordinary gradient descent optimization and is sometimes called the adaptation step (we will also refer to this step as the inner update). The high-level learning, on the other hand, tries to optimize the post-adaptation performance with respect to the pre-adaptation parameters; in other words, its goal is to find initial parameters which lead to good performance on multiple tasks after the adaptation is complete. In theory, the adaptation can consist of any number of updates, but in practice the goal of meta-learning is to learn to adapt to tasks quickly and therefore the number of gradient steps tends to be small.


3.2.1

MAML applied to reinforcement learning

In our derivations we restrict the adaptation to a single gradient update:

\[ \min_\theta \; \mathbb{E}_{M_i \sim p(M)}\left[ \mathcal{L}_{M_i}\!\left( \theta - \alpha \nabla_\theta \mathcal{L}_{M_i}(\theta) \right) \right], \tag{3.22} \]

where $p(M)$ is a task distribution, $\theta$ are the model parameters and $\mathcal{L}_{M_i}$ is a task-specific loss (Finn et al., 2017). The extension to multiple gradient updates is then straightforward, albeit it increases the variance of the gradient. Since the loss function can be any arbitrary differentiable function, this approach is also suitable for reinforcement learning with policy gradient methods. To reformulate the objective as a reinforcement learning task, we replace the update as well as the post-update objective with the policy gradient estimate:

\[\begin{aligned}
\max_\theta \; & \mathbb{E}_{M_i \sim p(M)}\left[ \mathbb{E}_{\tau' \sim p(\tau'|\theta')}\left[ G^{M_i}(\tau') \right] \right] && (3.23) \\
& \theta' = \theta + \alpha \nabla_\theta \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ G^{M_i}(\tau) \right]. && (3.24)
\end{aligned}\]

The gradient of the objective is then:

\[\begin{aligned}
\nabla_\theta \mathbb{E}_{\tau' \sim p(\tau'|\theta')}\left[ G^{M_i}(\tau') \right] &= \mathbb{E}_{\tau' \sim p(\tau'|\theta')}\left[ \nabla_\theta \log p(\tau'|\theta')\, G^{M_i}(\tau') \right] && (3.25) \\
&= \mathbb{E}_{\tau' \sim p(\tau'|\theta')}\left[ \frac{\partial \log p(\tau'|\theta')}{\partial \theta'} \frac{\partial \theta'}{\partial \theta}\, G^{M_i}(\tau') \right], && (3.26)
\end{aligned}\]

where we omit the expectation over tasks for clarity. Note that even though we use the standard REINFORCE objective in our derivations for simplicity, the tricks which help to reduce variance with a baseline or causality can still be applied. The resulting gradient in Equation 3.26 closely resembles REINFORCE with the addition of the Jacobian term $\frac{\partial \theta'}{\partial \theta}$ inside the expectation:

\[\begin{aligned}
\frac{\partial \theta'}{\partial \theta} &= \nabla_\theta \left( \theta + \alpha \nabla_\theta \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ G^{M_i}(\tau) \right] \right) && (3.27) \\
&= \nabla_\theta \left( \theta + \alpha \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ \nabla_\theta \log p(\tau|\theta)\, G^{M_i}(\tau) \right] \right) && (3.28) \\
&= I + \alpha \nabla_\theta \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ \nabla_\theta \log p(\tau|\theta)\, G^{M_i}(\tau) \right] && (3.29) \\
&= I + \alpha \int \nabla_\theta \left( p(\tau|\theta)\, \nabla_\theta \log p(\tau|\theta)\, G^{M_i}(\tau) \right) d\tau && (3.30) \\
&= I + \alpha \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ \nabla^2_\theta \log p(\tau|\theta)\, G^{M_i}(\tau) \right] + \alpha \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ \nabla_\theta \log p(\tau|\theta)\, \nabla_\theta \log p(\tau|\theta)^{T} G^{M_i}(\tau) \right]. && (3.31)
\end{aligned}\]

As shown by Foerster et al. (2018), the surrogate loss (Schulman et al., 2015a) that was used in the MAML implementation provides incorrect higher-order derivatives. This is because a surrogate loss treats the stochastic part of the cost function as a fixed sample. Although this approach provides correct first-order gradients, it leads to missing or wrong terms in higher-order derivatives. In the MAML implementation, one term of Equation 3.31 is missing because the expectation in the update is treated as a fixed sample. To overcome this issue, Foerster et al. (2018) propose the Infinitely Differentiable Monte Carlo Estimator (DiCE), which provides correct higher-order gradients for any stochastic computation graph. With DiCE, the REINFORCE estimator becomes:

\[ \nabla_\theta J = \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ \nabla_\theta \sum_{t=0}^{T} \left( \prod_{t'=0}^{t} \frac{\pi_\theta(a_{t'}|s_{t'})}{\perp \pi_\theta(a_{t'}|s_{t'})} \right) r_t \right], \tag{3.32} \]

where $\perp$ is a "stop-gradient" operator ($\perp f_\theta(x) = f_\theta(x)$ and $\nabla_\theta \perp f_\theta(x) = 0$). This estimator provides correct higher order gradients and is equivalent to REINFORCE for the first order gradients:

\[\begin{aligned}
\nabla_\theta J &= \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ \nabla_\theta \sum_{t=0}^{T} \left( \prod_{t'=0}^{t} \frac{\pi_\theta(a_{t'}|s_{t'})}{\perp \pi_\theta(a_{t'}|s_{t'})} \right) r_t \right] && (3.33) \\
&= \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ \sum_{t=0}^{T} \nabla_\theta \left( \prod_{t'=0}^{t} \frac{\pi_\theta(a_{t'}|s_{t'})}{\perp \pi_\theta(a_{t'}|s_{t'})} \right) r_t \right] && (3.34) \\
&= \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ \sum_{t=0}^{T} \left( \frac{\nabla_\theta \prod_{t'=0}^{t} \pi_\theta(a_{t'}|s_{t'})}{\prod_{t'=0}^{t} \perp \pi_\theta(a_{t'}|s_{t'})} \right) r_t \right] && (3.35) \\
&= \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ \sum_{t=0}^{T} \left( \sum_{t'=0}^{t} \frac{\nabla_\theta \pi_\theta(a_{t'}|s_{t'})}{\perp \pi_\theta(a_{t'}|s_{t'})} \right) r_t \right] && (3.36) \\
&= \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ \sum_{t=0}^{T} \left( \sum_{t'=0}^{t} \nabla_\theta \log \pi_\theta(a_{t'}|s_{t'}) \right) r_t \right] && (3.37) \\
&= \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{T} r_{t'} \right]. && (3.38)
\end{aligned}\]

An important consequence of introducing DiCE to the objective is the product of probabilities in Equation 3.32. This negatively impacts the variance of the second order gradient as the length of the episode increases. To prevent this, one can use the low variance curvature (LVC) estimator (Rothfuss et al., 2019) instead:

\[ \nabla_\theta J = \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ \sum_{t=0}^{T} \frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\perp \pi_\theta(a_t|s_t)} \left( \sum_{t'=t}^{T} r_{t'} \right) \right]. \tag{3.39} \]

Even though the LVC estimator is no longer unbiased, Rothfuss et al. (2019) motivate this adjustment by the findings of Furmston et al. (2016) and claim that the bias of LVC becomes negligible near a local optimum $\theta^*$. Furthermore, they also analyze the performance and the variance of both estimators on different tasks, and according to their experiments, the usage of LVC over DiCE results in better performance as well as a lower variance of the gradient.
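To make the construction above concrete, the following PyTorch sketch shows how an LVC-style surrogate and a single MAML inner step can be combined so that automatic differentiation produces the meta-gradient. The policy object and its functional log_prob(states, actions, params=...) interface are hypothetical placeholders, not the thesis implementation; from the library, only standard tensor operations and torch.autograd.grad with create_graph=True are assumed.

    import torch

    def lvc_surrogate(log_probs, returns):
        """LVC-style surrogate (cf. Eq. 3.39): sum_t pi / stop_grad(pi) * G_t.

        The ratio exp(log pi - stop_grad(log pi)) equals 1 in value but keeps
        the gradient of pi_theta; `returns` are treated as constants.
        """
        ratios = torch.exp(log_probs - log_probs.detach())
        return (ratios * returns).sum()

    def maml_meta_loss(policy, pre_data, post_data, inner_lr=0.1):
        """One inner adaptation step followed by the post-update loss.

        pre_data / post_data are (states, actions, returns) tuples; the policy is
        assumed to expose parameters() and a functional log_prob(s, a, params=...).
        """
        states, actions, returns = pre_data
        inner_loss = -lvc_surrogate(policy.log_prob(states, actions), returns)
        # Keep the graph of the inner gradient so the outer loss can be
        # differentiated through the update theta' = theta - alpha * grad.
        grads = torch.autograd.grad(inner_loss, list(policy.parameters()),
                                    create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(policy.parameters(), grads)]
        states, actions, returns = post_data
        outer_loss = -lvc_surrogate(policy.log_prob(states, actions, params=adapted),
                                    returns)
        return outer_loss   # backpropagate this w.r.t. the pre-update parameters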


Chapter 4

Method

Our ultimate goal is to learn a good task to sub-task decomposition using ideas from the fields of meta-learning and hierarchical reinforcement learning. As we have previously discussed, it is usually hard to measure how good a task decomposition is. We believe that a good sub-task division should eventually lead to rapid adaptation to many related problems from the same family. Thus, in order to find these task decompositions, we combine the aforementioned methods and findings into a single algorithm that produces a hierarchical policy. The sub-policies of this policy can then be combined during test time to quickly solve many hierarchical tasks from the same family.

4.1

Objective

To allow for the hierarchical structure of the policy as well as better data efficiency we utilize the options framework (Sutton et al., 1999) and treat the options as latent variables during the calculation of the policy gradient (Daniel et al., 2016; Smith et al., 2018). Additionally, we use the performance of the model after a single adaptation step to measure the suitability of a task to sub-task division. The objective can therefore be expressed as an average discounted return after a single gradient update:

\[\begin{aligned}
\max_\theta \; & \mathbb{E}_{M_i \sim p(M)}\left[ \mathbb{E}_{\tau' \sim p(\tau'|\theta')}\left[ G^{M_i}(\tau') \right] \right] && (4.1) \\
& \theta' = \theta + \alpha_{in} \nabla_\theta \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ G^{M_i}(\tau) \right]. && (4.2)
\end{aligned}\]

Note that we use $\alpha_{in}$ and $\alpha_{out}$ to distinguish the inner learning rate (used during the adaptation step) from the outer learning rate. This corresponds to the reinforcement learning formulation of MAML (Finn et al., 2017) in Equations 3.23 and 3.24. However, we only update the policy over options during adaptation and keep the terminations and sub-policies fixed. By doing this, we encourage the model to learn sub-policies which can be combined by the policy over options to form a solution to the task, and also to divide tasks into sub-tasks by adjusting the termination probabilities.

\[\begin{aligned}
\max_{\theta_\xi, \theta_\Omega, \theta_\omega} \; & \mathbb{E}_{M_i \sim p(M)}\left[ \mathbb{E}_{\tau' \sim p(\tau'|\theta'_\xi, \theta'_\Omega, \theta'_\omega)}\left[ G^{M_i}(\tau') \right] \right] && (4.3) \\
& \theta'_\Omega = \theta_\Omega + \alpha_{in} \nabla_{\theta_\Omega} \mathbb{E}_{\tau \sim p(\tau|\theta_\xi, \theta_\Omega, \theta_\omega)}\left[ G^{M_i}(\tau) \right] && (4.4) \\
& \theta'_\xi = \theta_\xi && (4.5) \\
& \theta'_\omega = \theta_\omega. && (4.6)
\end{aligned}\]

The parameters of the policy over options $\theta_\Omega$, the terminations $\theta_\xi$ and the sub-policies $\theta_\omega$ can be concatenated into a single vector $\theta$ or $\theta'$. The gradient of the objective is then:

\[\begin{aligned}
\nabla_\theta J_{out} &= \nabla_\theta \mathbb{E}_{M_i \sim p(M)}\left[ \mathbb{E}_{\tau' \sim p(\tau'|\theta + \alpha_{in} \nabla_\theta J_{in})}\left[ G^{M_i}(\tau') \right] \right] && (4.7) \\
\nabla_\theta J_{in} &= \nabla_\theta \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ G^{M_i}(\tau) \right]. && (4.8)
\end{aligned}\]


Algorithm 1: Fast Adaptation with Inferred Options Policy Gradient (FAIOPG)

    if LearnInnerLearningRate then
        Initialize θ = {θ_Ω, θ_ξ, θ_ω, α_in}
    else
        Initialize θ = {θ_Ω, θ_ξ, θ_ω}
    end if
    repeat
        Set the gradient w.r.t. the parameters g_θ = 0
        for task = 1 to N do
            Sample a task M ∼ p(M)
            Sample k episodes τ_{1:k} ∼ p(τ|θ) on M using π_θ
            Calculate the pre-update G_t^M for all timesteps and fit the baseline
            Calculate the pre-update A_t^{GAE(λ)} for all timesteps with Equation 3.18
            Calculate the pre-update loss J_in according to Equation 4.10
            Update θ'_Ω = θ_Ω + α_in ∇_θ J_in
            Set θ' = {θ'_Ω, θ_ξ, θ_ω}
            Sample k episodes τ'_{1:k} ∼ p(τ'|θ') on M using π_{θ'}
            Calculate the post-update G_t^M for all timesteps
            Calculate the post-update loss J_out according to Equation 4.9
            Accumulate the gradients g_θ = g_θ + (1/N) ∇_θ J_out
        end for
        Update θ = θ + α_out g_θ
    until convergence

Furthermore, the inner learning rate $\alpha_{in}$ can also be learned as an additional model parameter. To reduce the variance and make the implementation suitable for automatic differentiation software, we use the LVC estimator for the gradient calculation (Rothfuss et al., 2019). Additionally, the variance can be reduced further by using a baseline or an advantage estimate in the inner update ($\theta \rightarrow \theta'$). However, the inclusion of a baseline in the outer gradient update becomes problematic, since it would need to be dependent on both the pre-update parameters and the update itself (or the post-update parameters). Consequently, we use GAE to calculate the inner gradient and use the sampled discounted return in the outer gradient:

\[\begin{aligned}
\nabla_\theta J_{out} &= \mathbb{E}_{M_i \sim p(M)}\left[ \mathbb{E}_{\tau' \sim p(\tau'|\theta + \alpha_{in} \nabla_\theta J_{in})}\left[ \sum_{t=0}^{T} \frac{\nabla_\theta \pi_{(\theta + \alpha_{in} \nabla_\theta J_{in})}(a_t|s_t)}{\perp \pi_{(\theta + \alpha_{in} \nabla_\theta J_{in})}(a_t|s_t)}\, G_t^{M_i}(\tau') \right] \right] && (4.9) \\
\nabla_\theta J_{in} &= \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[ \sum_{t=0}^{T} \frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\perp \pi_\theta(a_t|s_t)}\, A_t^{GAE(\lambda)} \right]. && (4.10)
\end{aligned}\]

4.2

Algorithm

The final FAIOPG algorithm is outlined in Algorithm 1. In each epoch we sample N tasks. For each task we sample k pre-update trajectories $\tau_{1:k}$ and use these to fit the baseline and calculate the inner gradient. After performing the inner update, we use the updated parameters to sample new post-update trajectories $\tau'_{1:k}$. These are then used to form the outer loss function. The gradients are accumulated for each of the N tasks and averaged before performing the update. Thanks to its structure, the algorithm is easily parallelizable, since the task sampling, trajectory sampling and gradient calculation for every task can be done on different cores simultaneously.
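The loop of Algorithm 1 can be summarised in a few lines of Python. In the sketch below, all problem-specific pieces are passed in as placeholder callables (sample_task, rollout, inner_update, outer_grad); they stand for the components described above and are our own illustrative assumptions, not the thesis code.

    import numpy as np

    def faiopg_epoch(sample_task, rollout, inner_update, outer_grad,
                     theta, n_tasks=60, k=20, alpha_out=0.01):
        """One epoch of Algorithm 1 with placeholder callables.

        sample_task()                        -> a task M ~ p(M)
        rollout(task, params, k)             -> k sampled episodes under pi_params
        inner_update(theta, episodes)        -> theta' (adapts only the policy over options)
        outer_grad(theta, theta_p, episodes) -> gradient of the post-update return
                                                with respect to the pre-update theta
        """
        g = np.zeros_like(theta)
        for _ in range(n_tasks):
            task = sample_task()
            pre_episodes = rollout(task, theta, k)           # pre-update trajectories
            theta_prime = inner_update(theta, pre_episodes)  # single adaptation step
            post_episodes = rollout(task, theta_prime, k)    # post-update trajectories
            g += outer_grad(theta, theta_prime, post_episodes) / n_tasks
        return theta + alpha_out * g                         # outer (meta) update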

4.3

Implementation Details

We use one-hot encoding to represent the states of our environment and feed-forward neural networks to create a mapping from states to the parameters of the action, option and termination distributions. By using single-layer neural networks without a bias for this mapping, we make the distributions in each state independent of each other. Furthermore, the usage of single-layer neural networks also allows for easier analysis and debugging, because the resulting policy is close to tabular. The parameters of the option distribution are then computed by passing the one-hot encoding through a linear layer and a soft-max module. Since the environments which we use have discrete action spaces, a similar process is employed to obtain the action distributions (in this case they are categorical distributions) of each sub-policy. Lastly, when obtaining termination probabilities, the soft-max is replaced by a sigmoid module to restrict the output to the [0, 1] range, and the output is used as the parameter of a Bernoulli distribution.
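A possible PyTorch realisation of this parameterisation is sketched below; the class and layer names are our own illustration, not the exact thesis code. Each distribution gets one bias-free linear head applied to a one-hot state encoding, with soft-max heads for options and actions and a sigmoid head for terminations.

    import torch
    import torch.nn as nn

    class TabularOptionPolicy(nn.Module):
        """Near-tabular policy: one bias-free linear head per distribution."""

        def __init__(self, n_states, n_actions, n_options):
            super().__init__()
            self.option_head = nn.Linear(n_states, n_options, bias=False)
            self.action_heads = nn.Linear(n_states, n_options * n_actions, bias=False)
            self.termination_head = nn.Linear(n_states, n_options, bias=False)
            self.n_options, self.n_actions = n_options, n_actions

        def forward(self, one_hot_state):
            # pi_Omega(. | s): categorical distribution over options
            option_probs = torch.softmax(self.option_head(one_hot_state), dim=-1)
            # pi^omega(. | s): one categorical action distribution per option
            logits = self.action_heads(one_hot_state).view(-1, self.n_options, self.n_actions)
            action_probs = torch.softmax(logits, dim=-1)
            # xi^omega(s): Bernoulli termination probability per option
            term_probs = torch.sigmoid(self.termination_head(one_hot_state))
            return option_probs, action_probs, term_probs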

Throughout our experiments we use a linear baseline that maps state features into a value function estimate for the inner update. We chose this baseline mainly because this type of baseline works well with a small amount of data and allows us to keep the optimization procedure simple. These properties are crucial since we only use the data from the k sampled pre-update trajectories to fit the baseline and do not share any information about the baseline among different epochs or even different tasks within a single epoch. Even though a more complex baseline could be used to estimate the value function (such as a neural network), we believe that such a choice would make the optimization too complex, since the value function estimator would also have both inner and outer updates.
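For completeness, a linear baseline of this kind can be fit with a regularised least-squares solve on the state features of the k sampled trajectories; the sketch below is a common way to do this and does not reproduce the exact feature map used in the thesis.

    import numpy as np

    def fit_linear_baseline(features, returns, reg=1e-5):
        """Least-squares fit of a linear value baseline V(s) ~= w^T phi(s).

        features : (N, d) matrix of state features from the sampled trajectories
        returns  : (N,) vector of observed discounted returns G_t
        A small ridge term keeps the solve stable with little data.
        """
        d = features.shape[1]
        w = np.linalg.solve(features.T @ features + reg * np.eye(d),
                            features.T @ returns)
        return lambda phi: phi @ w      # baseline estimate for new state features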


Chapter 5

Experiments

In our experiments we use two kinds of environments. First, we run our algorithm on a set of simple grid mazes in which the agent starts in a designated state and has to find its way through the maze towards the end cell. However, certain cells may be blocked, and a single block configuration represents a single task $M_i$ of a task family $M$. Therefore, in order to reach optimal performance, the agent cannot always take the same path but needs to adapt its behavior according to the blocked cells during the adaptation. By using these easy toy tasks we are able to verify that the algorithm possesses properties which are necessary for solving more complex tasks. Furthermore, we compare its performance during test time (also called the adaptation phase) to two simple baselines to highlight the key differences between learning from scratch, multi-task learning without meta-learning, and our algorithm, which includes prior knowledge in the form of a pre-learned hierarchical policy.

The second type of environment used in our work is a more complex taxi environment. In this environment the agent represents a taxi driver whose goal is to pick up a passenger and bring him to his destination. We use a slightly modified version of the taxi environment used by Igl et al. (2019). In this environment the destination is randomly selected among four different locations. However, the passenger is not initialized in an arbitrary state but in one of four starting states. A single combination of destination and passenger starting location is a single task $M_i$ of a task family $M$. Consequently, the agent needs to figure out the passenger's position and destination during the adaptation and adjust the selection of sub-policies to maximize the performance. We again compare the performance of our algorithm to both baselines and perform an analysis of the learned options. Furthermore, we use a combination of an additional taxi experiment and a simple regression task experiment to allow for a better analysis of the results. Both the grid maze environments and the taxi environment are described in more detail in the subsequent sections.

5.1

Small maze

In the first set of experiments we use a family of small gridworld maze environments whose map is shown in Figure 5.1. In each of the two environments the agent starts in the start cell (green) and has to navigate to the end cell (red) by performing one directional move at a time. However, one of the paths (upper or lower) is blocked by a wall. Because the agent does not have prior information about which environment it is currently in, it has to explore until it finds the path that leads to the end cell. If the goal state is reached, the agent receives reward r = 1 and the episode is terminated. For every other action the agent gets reward r = 0. Furthermore, the episode is also terminated if the goal was not reached within 200 steps. In a classic single-task reinforcement learning setting, the agent would need to try many trajectories, and since some of them would include walking into the wall or taking a blocked path, it could take a long time before it converged to the correct solution that maximizes the performance. However, by using the proposed meta-learning algorithm, we hope to reduce the exploration procedure (which consists of many actions) into a smaller number of high-level decisions. In the ideal case, the agent would only need to make a single decision (take the upper path or take the lower path) when exploring these two environments. Consequently, a good solution should be found much quicker.
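For illustration, a minimal gym-style sketch of such a split-maze task family is given below; the layout, grid size and interface are our own simplification, not the thesis environment. It keeps the mechanics described above: reward 1 when the goal is reached, 0 otherwise, a 200-step limit, and one of the two corridors blocked per task.

    class TwoPathMaze:
        """Minimal split-maze task family: one of two corridors is blocked per task."""

        ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

        def __init__(self, blocked_upper, width=5, max_steps=200):
            self.width, self.max_steps = width, max_steps
            self.start, self.goal = (1, 0), (1, width - 1)
            # The interior of the middle row is a wall, so the agent must use the
            # upper (row 0) or lower (row 2) corridor; one of them is blocked.
            self.walls = {(1, c) for c in range(1, width - 1)}
            self.walls.add((0, width // 2) if blocked_upper else (2, width // 2))

        def reset(self):
            self.pos, self.t = self.start, 0
            return self.pos

        def step(self, action):
            dr, dc = self.ACTIONS[action]
            r, c = self.pos[0] + dr, self.pos[1] + dc
            # Moves off the grid or into a wall leave the agent where it is.
            if 0 <= r < 3 and 0 <= c < self.width and (r, c) not in self.walls:
                self.pos = (r, c)
            self.t += 1
            done = self.pos == self.goal or self.t >= self.max_steps
            return self.pos, (1.0 if self.pos == self.goal else 0.0), done, {}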

Figure 5.1: Map of the small gridworld maze environments. In each of the two environments one block is a wall, leaving the other passage free.

We compared the performance of FAIOPG to two baselines which utilize the IOPG framework but do not incorporate meta-learning. As the first baseline we used a single-task IOPG that is trained to maximize the performance on a single task by learning from scratch during the adaptation. On the other hand, the second, multi-task IOPG baseline was pre-trained using the average gradient from multiple tasks for its updates before the adaptation. To make the comparison as fair as possible, we set all parameters to similar values for all algorithms during both pre-training and adaptation. Specifically, we set the learning rates to 0.1 for sub-policies and terminations, and 0.01 for the policy over options in the pre-training phase. Additionally, we used a fixed inner learning rate of 10 for the inner update of FAIOPG. This learning rate is therefore also used in the adaptation phase of both FAIOPG and multi-task IOPG. Note that in this phase we only adapted the policy over options for these two algorithms. On the contrary, since single-task IOPG does not have a pre-training phase, we learned all three parts of the policy during the adaptation. Here, using the high learning rate of 10 caused unstable learning. We thus opted for the learning rates which were used for the pre-training of FAIOPG instead.

All algorithms used two options and were trained for 500 epochs during both training and test time. Furthermore, we used the Adam optimizer (Kingma and Ba, 2015) to train all algorithms and to test the task-specific IOPG. However, since FAIOPG uses stochastic gradient descent in its inner update during training, we decided to use simple SGD during the adaptation of both FAIOPG and multi-task IOPG. When training FAIOPG, we set the number of tasks N = 60 (15 on each of the 4 cores) and sampled k = 20 trajectories for each update. Additionally, we used a discount factor γ = 0.95, a linear baseline and GAE with λ = 0.98 in all algorithms. The algorithms were pre-trained on three different random seeds (only applicable to multi-task IOPG and FAIOPG) and five random seeds per environment were used for adaptation. The average performance during adaptation as well as 95% confidence intervals of all three algorithms are depicted in Figure 5.2.

Figure 5.2: Average discounted return and 95% confidence interval of FAIOPG and the baseline algorithms during adaptation on the small maze environments (averaged over all environments).

Figure 5.3: Trained pre-update FAIOPG policy for the small grid environments, showing for each of the two options the maximum-likelihood actions, the termination probabilities and the probability of choosing the option in each state.

The results of the experiment correspond with our expectations. Multi-task IOPG is clearly not able to perform well and only achieves slightly more than half of the maximum possible discounted return $G_{max} = 0.774$. However, we can observe a large variance in its performance. This can be attributed to the change of the policy over options in the adaptation period. Even though the algorithm does not account for adaptation, it may still slightly increase its performance by adjusting a part of its policy. Nevertheless, in most cases multi-task IOPG learned to solve a single environment with both options. This leads to poor performance on the other environment, effectively cutting the average discounted return almost in half. Plots with the learned options can be found in Figure A.1 in Appendix A.

In contrast to multi-task IOPG, the other two algorithms are able to adapt to each task and achieve nearly the maximum discounted return. However, there is a difference in the speed with which they achieve this performance. The curve of single-task IOPG is less steep even though it uses the more complex Adam optimizer, so the algorithm takes longer to achieve nearly optimal performance on this simple task. FAIOPG, on the other hand, achieves this level of discounted return much quicker, after performing only a few gradient steps. An explanation for this discrepancy can be found in the structure of the pre-update policies of FAIOPG shown in Figure 5.3.

We can use the parameters in the figure to trace the steps taken by the policy. The probabilities of selecting each option in the starting state are close to 50% for both options. If option 1 is selected, the policy goes up, and since the termination probability in the next state is zero it continues in the same direction. In the next state, it terminates with 63% probability. The policy thus either switches to option 2 or, with lower probability, continues to use option 1. In case of the switch, the sub-policy of option 2 leads to the goal with only a small chance of terminating along the way. If, on the other hand, the policy sticks with option 1, it will take a step to the right and switch to option 2 in the next state due to the high termination probability and the low selection probability of option 1. From this state option 2 will take over and lead the agent to the goal state. Therefore, a single choice of option 1 in the starting state will guide the agent to its goal through the upper path. Similarly, option 2 follows the lower path if it is selected in the starting state. Its sub-policy follows the lower path and does not terminate along the way, except for the penultimate state, where both policies lead to the end state. We can see that the algorithm developed a structured policy where the only relevant decision is the option selection in the starting state. By doing this, FAIOPG gains an advantage over single-task IOPG because it only needs to learn to make a single decision during the adaptation. It is therefore sufficient to adjust the policy over options in the starting state to always select the correct path. The plots of FAIOPG policies after adaptation on both environments can be found in Figures A.2 and A.3 in Appendix A.

Figure 5.4: Map of the large gridworld maze environments. In each environment one block of every color is a wall, leaving the other passage free.

5.2

Large maze

The previous experiment shows the ability of FAIOPG to learn a simple split task and serves as a proof of concept. However, the discussed split problem is quite simple, and the policy could find an optimal solution that does not use terminations at all by using one option for each path and deciding which option to take in the start state. Consequently, the algorithm does not need to learn to split tasks into sub-tasks by terminating in appropriate states. Our next experiment is thus set up in a way that requires the policy to terminate at certain points on its path to the goal state in order to reach a good performance. We use the large maze environments whose map is shown in Figure 5.4. All environments consist of three rooms which are connected by corridors. In each room one of the highlighted cells is blocked while the other one is passable. Therefore, in order to get into the next room the agent must take either the upper or the lower path in each room. Similarly to the smaller environment, the agent receives a reward r = 1 only when it reaches the end cell. However, the step limit for this environment is set to a larger value of 1000 timesteps. By stacking multiple rooms together we can create a family of environments with eight family members, where each of the eight environments blocks a different combination of paths. If we then use fewer than eight options, the policy will be forced to utilize terminations in order to perform well, because it can only adjust the policy over options and cannot adjust the sub-policies during adaptation. In this experiment we will thus also use only two options. Consequently, we expect the policy to learn to terminate at least once in every corridor and to choose the upper or lower path in each room. The hyper-parameters and the settings are identical to the previous experiment with small mazes.

The results of the large maze experiment are shown in Figure 5.5. FAIOPG is again able to outper-form both baselines and achieves the average discounted return close to the optimal policy (0.463) after several updates. Single-task IOPG is also able to reach nearly optimal performance but it again needs more updates to do so. Lastly, Multi-task IOPG is not able to reach a good performance even though it performs slightly better than in the previous experiment. After closer inspection of the trained FAIOPG policy portrayed in Figure 5.6 we can notice two exploratory actions (green arrows) in each option before first and second blocks. Although these actions are not intuitive, the explanation of this phenomenon is simple. Since the amount of gathered data is limited, the agent needs to obtain the data that will produce good update within 20 episodes. Crucially, the strongest reward signal is received when the goal state is reached. Thus, if the episode terminates because of the timestep limit, the agent receives weaker signal that only depends on the baseline (since all rewards are 0). It is therefore important for the agent

(22)

0 50 100 150 200 250 Gradient update 0.0 0.1 0.2 0.3 0.4 0.5 Discoun ted return Single-task IOPG FAIOPG Multi-task IOPG

Figure 5.5: Average discounted return and 95% confidence interval of FAIOPG and baseline algorithms during adaptation on large maze environments (averaged over all environments).

[Figure 5.6 panels, for Option 1 and Option 2: maximum likelihood actions in states, termination probabilities in states, and probability of choosing the option in states.]

Figure 5.6: Trained pre-update FAIOPG policy for large grid environments.

It is therefore important for the agent to reach the goal multiple times during the exploratory period in order to reinforce the actions leading to the goal after a single update. The exploratory actions increase the probability of reaching the end state during the exploration period: they prevent the agent from getting stuck in front of a block for a whole episode and thus allow it to switch to exploitative actions with a single gradient update.

Additionally, each sub-policy uses a different path in each room and each option terminates at least once in every corridor, as we expected. This allows the agent to combine sub-policies in each room by terminating at the correct place, creating eight different paths towards the goal and solving all eight environments. However, terminations are also very likely in other states, especially in the first option. The explanation for this could be that option two discovered the correct actions in the columns before the narrow corridors earlier than option one. Consequently, using option two in those states led to a higher discounted return, and option one was therefore encouraged to terminate.



Figure 5.7: Map of the taxi environments. In each of the sixteen environments one special state is designated as the passenger's pickup location and one as the destination.

Another reason for the high number of terminations could be the high initial termination probability, which was set to 50%. Nevertheless, even with many terminations, the agent is still able to adjust the policy over options quickly and achieve a nearly optimal discounted return after a few gradient steps. The algorithm thus led to a model that learned to utilize terminations in order to solve multiple tasks. From our perspective, this task-to-sub-task division was sound since it led to fast adaptation on all tasks. We therefore decided to use a more challenging environment for our last experiment to test the scalability of our approach.
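As an aside, the initial termination probability is straightforward to control when the termination head ends in a sigmoid: setting its bias to the logit of the desired probability produces exactly that initial value. The snippet below is a minimal sketch under that assumption; the layer layout and function names are illustrative and not taken from the thesis implementation.

import math
import torch
import torch.nn as nn

def init_termination_bias(termination_head: nn.Linear, p_init: float) -> None:
    """Initialize a sigmoid termination head so that it outputs p_init
    (e.g. 0.5 for the large maze, 0.05 for the taxi experiment)."""
    logit = math.log(p_init / (1.0 - p_init))
    nn.init.zeros_(termination_head.weight)          # start state-independent
    nn.init.constant_(termination_head.bias, logit)  # sigmoid(bias) == p_init

head = nn.Linear(60, 1)  # one termination scalar per option, one-hot state input
init_termination_bias(head, 0.05)
print(torch.sigmoid(head(torch.zeros(1, 60))))       # -> ~0.05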

5.3 Taxi

The last and most complex family of environments we experimented with consists of several taxi environments in which the agent represents a taxi driver in a gridworld. The environment map is depicted in Figure 5.7. In addition to walkable cells (white) and walls (black), there are four special cells (yellow), one in each corner of the map. These states can serve as a starting point, passenger pickup location or passenger destination. By putting the goal state and the passenger pickup location in four different corners, we create a total of 16 different environments from the same family. Note that even though the passenger and the goal state might be in the same cell, the episode cannot start in the goal state; instead, the agent is randomly initialized in one of the three remaining corners. Additionally, the passenger can already be picked up at the start of the episode with 50% probability. By combining 30 walkable cells in the grid with two states of the passenger (in the taxi or waiting) we arrive at a total of 60 states, which are again represented using a one-hot encoding. The action space of this environment consists of six actions. Similarly to the environments in the previous experiments, the agent can use four actions to move in the four cardinal directions. In addition to the directional actions, it also needs to utilize a special pickup/drop-off action that is used for interaction with the passenger. Furthermore, it needs to learn to avoid the last action, which lets it stay in the same place, as this action provides no real benefit: the reward per timestep is negative (r = −0.1) for every action. However, if the agent successfully drops off the passenger at the destination, it receives a positive reward r = 2 instead. Therefore, in order to obtain the maximum reward the agent first has to move to the passenger's location, pick up the passenger, navigate to the destination and use the special action again to deliver the passenger. The first two steps are skipped if the agent is initialized with the passenger on board. Finally, episodes last at most 1500 steps, after which they terminate even if the passenger was not picked up or delivered. We used the same hyper-parameters as in the first two experiments, except for the initial termination probability, which was set to 5% to limit the excessive terminations observed in the large maze experiment.

The comparison of the different algorithms is shown in Figure 5.8. Even though FAIOPG still performs best after a small number of updates (around 10), it is eventually outperformed by single-task IOPG in the later stages of adaptation. This implies that the sub-policies and termination probabilities learned during training can only be combined into sub-optimal solutions for some tasks, and they are thus not good enough to rival the performance of a task-specific algorithm.
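To make the transition and reward structure concrete, the sketch below implements a stripped-down version of the dynamics described above. The grid size, the empty WALLS set and the class and method names are assumptions used only for illustration; the real map of Figure 5.7 has 30 walkable cells and interior walls.

import random

MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right
PICKUP_DROPOFF, STAY = 4, 5
CORNERS = [(0, 0), (0, 5), (5, 0), (5, 5)]  # placeholder special-state positions
WALLS = set()                               # placeholder; the real map has walls

class ToyTaxi:
    def __init__(self, passenger_corner, goal_corner):
        self.passenger_corner, self.goal_corner = passenger_corner, goal_corner

    def reset(self):
        starts = [c for c in CORNERS if c != self.goal_corner]
        self.pos = random.choice(starts)
        self.has_passenger = random.random() < 0.5  # 50% chance of starting loaded
        self.t = 0
        return self.pos, self.has_passenger

    def step(self, action):
        self.t += 1
        reward, done = -0.1, self.t >= 1500         # step penalty, 1500-step limit
        if action in MOVES:
            nxt = (self.pos[0] + MOVES[action][0], self.pos[1] + MOVES[action][1])
            if nxt not in WALLS and all(0 <= v <= 5 for v in nxt):
                self.pos = nxt
        elif action == PICKUP_DROPOFF:
            if not self.has_passenger and self.pos == self.passenger_corner:
                self.has_passenger = True
            elif self.has_passenger and self.pos == self.goal_corner:
                reward, done = 2.0, True            # successful delivery
        # STAY does nothing and still costs -0.1
        return (self.pos, self.has_passenger), reward, done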



Figure 5.8: Average discounted return and 95% confidence interval of FAIOPG and baseline algorithms during adaptation on taxi environments (averaged over all environments). The values are smoothed using a 0.75 coefficient.
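The smoothing mentioned in the caption is presumably a standard exponential moving average; the following minimal sketch shows how such smoothing with a coefficient of 0.75 could be applied to the plotted returns, under that assumption.

def smooth(values, coeff=0.75):
    """Exponential moving average (assumed interpretation of the 0.75 coefficient)."""
    out, running = [], None
    for v in values:
        running = v if running is None else coeff * running + (1.0 - coeff) * v
        out.append(running)
    return out

print(smooth([-1.5, -1.0, -0.5, 0.0]))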

[Figure 5.9 panels, for Options 1-4: maximum likelihood actions in states, shown separately for the no-passenger and passenger cases.]

Figure 5.9: Trained pre-update FAIOPG sub-policies for taxi environments.

Consequently, FAIOPG is not able to solve the environments by only adjusting option probabilities during adaptation, and the discounted return it achieves is much closer to the one achieved by multi-task IOPG, which suffers from a similar problem.

By inspecting the sub-policies of the trained options, depicted in Figure 5.9, we can gain more insight into which sub-policies were learned and how solutions are formed during adaptation. The extra "pickup/drop-off" and "do nothing" actions are represented by a square and a circle. Note that although we do not show the termination probabilities and the option selection probabilities in order to keep the plot interpretable, they can be found in Figures A.4 and A.5 in Appendix A. In general, all options have a high termination probability only in the special states, as expected, since these are natural places to split the tasks. On the other hand, the trained sub-policies are far from our idea of optimal sub-policies, each of which would lead to a single corner from every state (shown in Figure 5.12). Instead, every sub-policy learned an optimal path from one or more corners to another corner (or corners).


[Figure 5.10 content: vertices TL, TR, BL, BR with directed edges labelled by options ω1-ω4, shown separately for the no-passenger and passenger cases.]

Figure 5.10: Trained sub-policies of FAIOPG on taxi environments represented as directed graphs.

This may look strange, because no sub-policy contains a path from the top left corner to the top right corner, and paths for several other corner combinations are likewise missing. However, these paths can be created by combining multiple options. In our example, the agent can first utilize option one to get from the top left to the bottom left corner, then switch to option two to get to the bottom right corner, and finally use option four to reach the destination in the top right corner. Therefore, if there is no direct path from one special state to another, the agent can combine multiple options into a sub-optimal path. To provide better intuition about why the algorithm converges to this solution, we can view each special state as a vertex in a directed graph and the paths between special states in every sub-policy (the big yellow arrows) as directed edges. This graph is shown in Figure 5.10. In the ideal case the graph would be complete, because there would be an optimal path from every special state to every other special state. Despite not being complete, both graphs are still strongly connected, i.e. there exists a path from every vertex to every other vertex, and thus the agent is able to solve all tasks by combining sub-policies. However, these combined solutions are not optimal for all tasks, which leads to lower performance on some of them and subsequently to worse overall performance. This can be attributed to several factors, which we address separately in the following subsections.
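The strong-connectivity argument can be checked mechanically. The sketch below runs a depth-first search from every vertex of a small option graph; the edge set used here is only an illustrative placeholder and does not reproduce the exact edges of Figure 5.10.

# Vertices are the four special states; edges stand for option sub-policies that
# connect them (placeholder edge set, not the learned one).
EDGES = {
    "TL": {"BL"},
    "BL": {"BR"},
    "BR": {"TR"},
    "TR": {"TL"},
}

def reachable(graph, start):
    """Return all vertices reachable from `start` via a depth-first search."""
    seen, stack = set(), [start]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph.get(v, ()))
    return seen

def strongly_connected(graph):
    vertices = set(graph)
    return all(reachable(graph, v) == vertices for v in vertices)

print(strongly_connected(EDGES))  # -> True: every corner can reach every other corner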

5.3.1 Single gradient step

The first potential cause of the sub-optimal performance that we investigated is the relationship between the objective and the complexity of the problem. During optimization we maximize the performance the agent achieves after a single gradient update. However, a single gradient update might not be enough to shift the policy from exploratory to exploitative behaviour. Even though the problem is not as severe as in methods that do not use a gradient during adaptation (Wang et al., 2016; Duan et al., 2016), achieving the best performance after a single update with a MAML-like objective does not guarantee that the performance will keep improving after subsequent updates. To confirm this hypothesis we performed two simple experiments that should help explain this phenomenon. Firstly, we use a regression experiment, similar to the one presented in the MAML paper (Finn et al., 2017), to compare models that were optimized to reach maximum performance after different numbers of gradient updates.
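For reference, the n-step variant of the meta-objective can be written as follows. The notation is kept generic (standard MAML form, assumed here) and may differ slightly from Equation 3.22; $\alpha$ denotes the inner learning rate and $\mathcal{L}_i$ the loss on task $i$.

\theta_i^{(0)} = \theta, \qquad
\theta_i^{(k)} = \theta_i^{(k-1)} - \alpha \, \nabla_{\theta_i^{(k-1)}} \mathcal{L}_i\!\left(\theta_i^{(k-1)}\right),
\quad k = 1, \dots, n,
\qquad
\min_{\theta} \; \sum_i \mathcal{L}_i\!\left(\theta_i^{(n)}\right)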

In this experiment each task consists of outputting the value of a different sine wave given the input, where the amplitude and phase of the sine wave are randomly chosen from a range of values ([0.1, 5.0] and [0, π] respectively). As the first step of the training phase, we sample 25 tasks with different amplitudes and phases. For each task, we randomly sample 10 datapoints from the range [−5.0, 5.0] and calculate the mean squared error (MSE) loss between the true sine wave values and the values predicted by the model. We then calculate the gradient with respect to the model parameters and perform the inner gradient step. We repeat this process n times to obtain the final loss. The gradients from all tasks are then averaged and used for the outer update. With n = 1 the process is similar to our reinforcement learning approach. However, in this case it is not necessary to use DiCE or LVC to get correct gradients, since the additional dependency introduced in the reinforcement learning formulation is not present. An optimization with a single gradient update can thus be expressed as Equation 3.22 where the loss function is the MSE loss. We train each model with a different n for 70000 epochs, using the SGD optimizer with a learning rate of 0.01 for the inner update and the Adam optimizer with a learning rate of 0.001 for the outer update. During adaptation (at test time), we sample 10000 test tasks and perform 10 stochastic gradient steps. After every gradient step we use 1000 points from the sine wave to calculate the MSE on each task from the model predictions.
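A minimal PyTorch sketch of this n-step training loop is shown below. The hyper-parameters follow the text; the network architecture (a single hidden layer of 40 units) and the per-step resampling of datapoints are assumptions made for illustration rather than a reproduction of the thesis code.

import math
import torch

def forward(params, x):
    w1, b1, w2, b2 = params
    return torch.relu(x @ w1 + b1) @ w2 + b2

def sample_batch(amp, phase, n=10):
    x = torch.empty(n, 1).uniform_(-5.0, 5.0)
    return x, amp * torch.sin(x + phase)

params = [(torch.randn(1, 40) * 0.1).requires_grad_(),
          torch.zeros(40, requires_grad=True),
          (torch.randn(40, 1) * 0.1).requires_grad_(),
          torch.zeros(1, requires_grad=True)]
outer_opt = torch.optim.Adam(params, lr=1e-3)
inner_lr, n_inner, n_tasks = 0.01, 2, 25   # n_inner is the "n" in the text

for epoch in range(70000):                 # reduce for a quick test
    outer_loss = 0.0
    for _ in range(n_tasks):
        amp = torch.empty(1).uniform_(0.1, 5.0)
        phase = torch.empty(1).uniform_(0.0, math.pi)
        adapted = params
        for _ in range(n_inner):           # n inner (adaptation) gradient steps
            x, y = sample_batch(amp, phase)
            loss = ((forward(adapted, x) - y) ** 2).mean()
            grads = torch.autograd.grad(loss, adapted, create_graph=True)
            adapted = [p - inner_lr * g for p, g in zip(adapted, grads)]
        x, y = sample_batch(amp, phase)     # loss after n inner steps
        outer_loss = outer_loss + ((forward(adapted, x) - y) ** 2).mean()
    outer_opt.zero_grad()
    (outer_loss / n_tasks).backward()       # second-order MAML gradient
    outer_opt.step()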


[Figure 5.11 legend: models trained with n = 1, n = 2 and n = 5.]

Figure 5.11: Comparison of the performance of models trained to maximize performance after different numbers n of gradient steps on the regression task.

The average MSE of models trained with different n, after a varying number of gradient steps, is shown in Figure 5.11.

Looking at the performance after a single gradient step, the model trained with n = 1 performs best. This result is not surprising, since it was trained to maximize performance after a single update. However, the models trained with higher values of n become better as more gradient steps are taken. One would also expect this, since those models were trained for more gradient steps. Still, the difference between n = 2 and n = 1 after 10 updates is quite large, even though the former was trained for only one more step. We can thus see that a model trained for single-update performance improves only very slowly during subsequent optimization. Another possible explanation is that models with higher n see more datapoints during training, since they perform multiple gradient steps with different data. We therefore trained a new model with n = 1 in which we doubled the number of samples during training to 20. Nevertheless, the average MSE of this model after 10 gradient steps was still twice that of the n = 2 model, despite using twice as much data during adaptation. We can therefore conclude that training for multiple steps ahead, instead of a single step, is the main factor behind the better performance at the end of adaptation.

Even though the results of the regression experiment imply that a model trained for maximum performance after a single gradient step may not learn as fast in subsequent steps, in many cases, such as our maze experiments, one gradient step can be enough to find a near-optimal solution. However, as the complexity of the environments increases, finding a near-optimal solution with a single update becomes gradually more difficult. In order to test the difficulty of the taxi environments, we carried out another experiment in which we set the initial parameters of the policy to what we believed were optimal parameters. The sub-policies of this policy are shown in Figure 5.12. Each sub-policy leads to a different special state from every other state and finally uses the "pickup/drop-off" action. The termination probabilities were set to 100% in the special states and 0% otherwise, while the option selection probability was 25% in all states. The agent thus only needs to make one or two decisions to pick up the passenger and drop him off at the destination. After running our algorithm with these parameters and the same hyper-parameters as in the previous taxi experiment, we inspected the post-update performance and compared it to the average discounted return achieved by the trained FAIOPG model whose parameters were initialized randomly. Surprisingly, FAIOPG initialized with the optimal parameters achieved an average discounted return close to 0, and although it performed better than the models trained in our previous experiment, whose performance was in the range [−0.5, −0.3], this is still far from the single-task IOPG performance of 0.45 that we would hope to achieve.
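A minimal sketch of how such a hand-crafted initialization could be expressed is given below, assuming termination and option-selection heads that apply a sigmoid and a softmax to per-state logits. The array shapes, the special-state indices and the saturation constant are illustrative placeholders, not the thesis implementation.

import numpy as np

N_STATES, N_OPTIONS = 60, 4
BIG = 20.0                        # large logit so the sigmoid saturates near 0 or 1
special_states = [0, 5, 54, 59]   # hypothetical indices of the four corner states

# Termination logits: ~100% in the special states, ~0% everywhere else.
termination_logits = np.full((N_OPTIONS, N_STATES), -BIG)
termination_logits[:, special_states] = BIG

# Option-selection logits: uniform 25% over the four options in every state.
option_logits = np.zeros((N_STATES, N_OPTIONS))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(termination_logits[0, special_states[0]]))           # ~1.0
print(np.exp(option_logits[0]) / np.exp(option_logits[0]).sum())   # [0.25 0.25 0.25 0.25]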

The difference between single-task IOPG and FAIOPG with the optimal initialization, in combination with the result of the regression experiment, leads us to believe that part of the reason why FAIOPG was not able to reach near-optimal performance in the taxi experiment is the complexity of the task. Unlike the previous maze tasks, the taxi task is too complex to be solvable by a single gradient update. Thus, it is possible that the algorithm can find a different solution that will lead to the post-update
