

MSc Artificial Intelligence

Master Thesis

Deep Coherent Exploration for Continuous Control

by

Yijie Zhang

12255807

July 20, 2020

48 EC, November 2019 - July 2020

Supervisor: Dr. Herke van Hoof
Assessor: Dr. Patrick Forré


UNIVERSITY OF AMSTERDAM

Abstract

Faculty of Science, Informatics Institute, Master of Science

Deep Coherent Exploration for Continuous Control by Yijie Zhang

In policy search methods for reinforcement learning (RL), exploration is generally performed through noise injection either in action space at each step independently or in parameter space over each full trajectory. In prior work, it has been shown that with linear policies, a more balanced trade-off between these two exploration strategies is beneficial. However, that method did not scale to policies using deep neural networks. In this thesis, we introduce Deep Coherent Exploration, a general and scalable exploration framework for deep RL algorithms on continuous control, that generalizes step-based and trajectory-based exploration in parameter space. Furthermore, Deep Coherent Exploration addresses the uncertainty of sampling exploring policies using analytical integration and explicitly perturbs only the last layer of the policy networks. Through evaluating the coherent variants of A2C, PPO, and SAC on a range of high-dimensional continuous control tasks, we find that Deep Coherent Exploration results in faster and more consistent learning than other exploration strategies.


Acknowledgements

I cannot express enough gratitude to my brilliant supervisor, Herke van Hoof. To me, Herke is not only a supervisor but also an encouraging and caring friend. Throughout the last eight months, I have been fortunate to receive his perspective, insight, and guidance. Besides, I genuinely enjoyed our inspiring conversations, which often shed light on new ideas in our research. Moreover, I would like to thank Patrick Forré for agreeing to be my examiner and to read my thesis on such short notice.

I would like to thank my fellow students, particularly Laurens Weitkamp, Daniel Groothuysen, Benjamin Kolb, and Xiaoxiao (Vincent) Wen, for our interesting discussions and our support for each other.

Finally, I would like to thank my family, and especially my parents, for their endless love and support. I would like to further thank my mom for always being there for me and supporting me in pursuing my goals.


Chapter 1

Introduction

In the past decade, deep reinforcement learning (RL), the family of algorithms that uses deep neural networks to learn policies and (or) value functions in reinforcement learning, has achieved successes that many previously considered impossible. Examples include reaching human-level performance in playing Atari video games learned from raw pixels (Mnih et al., 2015), defeating the world champion in the highly complex strategic game of Go through self-play (Silver et al., 2016; Silver et al., 2017), and solving complex robotic control problems (Schulman et al., 2015). These successes have shown that deep RL agents are capable of learning good strategies through trial-and-error in extremely complex and, especially, high-dimensional environments. Although part of the credit goes to deep learning, which has proved to generalize well with high-dimensional inputs, the effective exploration strategies used by deep RL agents play an undoubtedly significant role.

The balance of exploration and exploitation (Kearns et al., 2002; Jaksch et al., 2010) has long been one of the most fundamental topics in reinforcement learning. At the beginning of learning, since the agents have no prior information about the environments, they have to try different strategies randomly, searching for the ones with high returns. As learning continues, even with better information about the uncertain environments, agents can still end up far from optimal if they stop trying new strategies and stay on the ones that have proven successful. As a result, to learn successful strategies, the trade-off between exploration and exploitation must be balanced well, and this is known as the exploration vs. exploitation dilemma.

Over the years, many sophisticated methods have been proposed for exploration using agents' experiences (Tang et al., 2017; Ostrovski et al., 2017; Houthooft et al., 2016; Pathak et al., 2017). However, since they still rely on some form of "inner" exploration, or are either complicated or bring heavy computational overhead, most of the exploration strategies used in the reinforcement learning literature are still based on injecting random noise in action space. Based on the idea of optimism in the face of uncertainty, these strategies explore by randomly perturbing the agents' actions; familiar examples include ε-greedy (Sutton et al., 1998) for discrete action spaces and additive Gaussian noise for continuous action spaces. Following that, exploration in the parameter space of linear policies (Rückstieß et al., 2008; Kober et al., 2008; Sehnke et al., 2010) was investigated. With the advance of deep RL, these exploration strategies were further extended to policies using deep neural networks (Fortunato et al., 2018; Plappert et al., 2018), aiming for more consistent, novel, and large-scale exploration behaviors.

Although these exploration strategies in parameter space (Fortunato et al., 2018; Plappert et al., 2018) were shown to lead to more consistent exploration, they still suffer from several drawbacks. Firstly, since these exploration strategies are trajectory-based, they are considered rather inefficient and bring insufficient stochasticity. Regarding this issue, prior work (Hoof et al., 2017) proposed to interpolate between step-based and trajectory-based exploration in parameter space, by generating temporally coherent exploring policies. With linear policies, this interpolation was shown to be beneficial for exploration. Secondly, since the uncertainty of sampling exploring policies is addressed using single-sample Monte Carlo integration, the gradient estimates suffer from high variance, which necessitates setting a low learning rate or risks instability in the learning process. Thirdly, the effects of injecting random noise in all layers of the policy networks are still unclear, which could result in unstable exploration behaviors.

Given these, we introduce Deep Coherent Exploration, a general and scalable exploration framework for deep RL algorithms on continuous control, that generalizes step-based and trajectory-based exploration for a more favorable trade-off. Moreover, our method addresses the uncertainty of sampling exploring policies using analytical integration and explicitly perturbs only the last layer of the policy network. Through evaluating the coherent variants of A2C (Mnih et al., 2016), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018) on high-dimensional continuous control tasks, we find that Deep Coherent Exploration results in faster and more consistent learning than other exploration strategies.

This thesis is organized as follows. Chapter 2 introduces and reviews the necessary background knowledge, including the key concepts in reinforcement learning, exploration strategies based on random perturbations, and the deep RL algorithms used in this thesis. Chapter 3 covers the related work on exploration in the parameter space of "shallow" policies (Rückstieß et al., 2008; Kober et al., 2008; Sehnke et al., 2010; Hoof et al., 2017) and of policies using deep neural networks (Fortunato et al., 2018; Plappert et al., 2018). Chapter 4 gives a comprehensive introduction to Deep Coherent Exploration, where we discuss the motivation for developing our method, compare our method with similar methods in graphical models, and derive the mathematics for the on-policy Coherent Policy Gradient step by step. Besides, we show how to adapt Deep Coherent Exploration to different families of deep RL algorithms and discuss the limitations of our method. Chapter 5 covers the detailed experimental setup and presents the results of our proposed method on the OpenAI MuJoCo continuous control tasks (Todorov et al., 2012; Brockman et al., 2016), compared with other exploration strategies (Fortunato et al., 2018; Plappert et al., 2018). Furthermore, we analyze the effects of the different components of our proposed method using three separate ablation experiments, where we show that each component is beneficial, and combining them leads to even better performance. Finally, in Chapter 6, we summarize the contributions of Deep Coherent Exploration and suggest possibilities and directions for future work.


Chapter 2

Background

2.1 Reinforcement Learning

Reinforcement learning is a sub-field of machine learning that studies how an agent learns strategies with high returns through trial-and-error by interacting with an environment. To be more precise, at each step, the agent obtains an observation of the current state and takes an action, which transitions the agent to a new state and yields a scalar reward from the environment, as depicted in Figure 2.1. Most importantly, the agent cannot modify the environment but can only try different actions in different states. These actions are then evaluated by the rewards, which guide the agent to learn return-maximizing behaviors. To better cover the necessary background knowledge, parts of our introduction are based on the materials from OpenAI Spinning Up (Achiam, 2018), which provides a comprehensive and straightforward introduction to reinforcement learning.

2.1.1 Markov Decision Processes

Formally, this interaction between an agent and an environment can be described using Markov Decision Processes (MDPs). An MDP is defined using a tuple $(\mathcal{S}, \mathcal{A}, r, P, \gamma)$, where $\mathcal{S}$ is the set of possible states, $\mathcal{A}$ is the set of possible actions, $r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function with $r_t := r(s_t, a_t, s_{t+1})$, and $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^+$ is the transition probability function, with $p(s_{t+1}|s_t, a_t)$ being the probability of transitioning into state $s_{t+1}$ when taking action $a_t$ in state $s_t$. Furthermore, $p_0(s_0)$ is the initial state distribution, and $\gamma$ is a discount factor indicating the preference for short-term rewards. It is also useful to observe that the definition of the transition probability function $P$ implies the key assumption of MDPs. This assumption is known as the Markov property: the probability of going to the next state depends only on the current state and action.

Figure 2.1: Interaction between an agent and an environment (Sutton et al., 1998).


2.1.2 Action Spaces

In reinforcement learning, there are two kinds of action spaces: the discrete action space and the continuous action space. In this thesis, we only consider the continuous action space.

2.1.3 Policies

As previously discussed, the goal of RL agents is to learn return-maximizing behaviors. These behaviors are called policies, denoted by $\pi(a|s)$ as they are functions of states. Policies can be divided into two categories, namely stochastic policies, defined as $\pi(a|s) : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^+$, and deterministic policies, defined as $\mu(s) : \mathcal{S} \to \mathcal{A}$. In this thesis, we only consider stochastic policies.

For continuous action spaces, stochastic policies are often modeled as diagonal Gaussian policies:

$$
\pi(a|s) := \mathcal{N}\big(\mu(s), \Sigma\big), \tag{2.1}
$$

which are fully described by a mean vector $\mu(s)$ and a diagonal covariance matrix $\Sigma$. Since the covariance matrix $\Sigma$ is diagonal, it is often convenient to represent its diagonal entries using a standard deviation vector $\sigma$. This standard deviation vector can be modeled either as a function of the states or as standalone parameters. However, in both cases, directly modeling the standard deviation $\sigma$ is problematic, as it could take negative values during training and become meaningless. Instead, $\log \sigma$ is modeled in practice.
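To make Equation (2.1) concrete, the following is a minimal PyTorch sketch of a diagonal Gaussian policy, with $\mu(s)$ produced by a small network and $\log\sigma$ kept as standalone parameters; the hidden size and the initial value of $\log\sigma$ are illustrative assumptions, not values used in this thesis.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class DiagonalGaussianPolicy(nn.Module):
    """Diagonal Gaussian policy: mu(s) from a small MLP, log(sigma) as standalone parameters."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        # Parameterize log(sigma) so that sigma = exp(log_sigma) is always positive.
        self.log_std = nn.Parameter(torch.full((act_dim,), -0.5))

    def forward(self, obs):
        return Normal(self.mu_net(obs), self.log_std.exp())

# Sample an action and compute its log-probability.
policy = DiagonalGaussianPolicy(obs_dim=8, act_dim=2)
dist = policy(torch.randn(1, 8))
action = dist.sample()
log_prob = dist.log_prob(action).sum(dim=-1)  # sum over independent action dimensions
```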

2.1.4 Trajectories

It’s often convenient to represent a sequence of states and actions as a trajectory, defined as:

$$
\tau := (s_0, a_0, \ldots, s_{T-1}, a_{T-1}, s_T), \tag{2.2}
$$

with the probability of the trajectory given by:

$$
p(\tau|\pi) = p_0(s_0) \prod_{t=0}^{T-1} p(s_{t+1}|s_t, a_t)\,\pi(a_t|s_t). \tag{2.3}
$$

Additionally, it is helpful to know that trajectories are often called episodes or rollouts, and these terms are often used interchangeably.

2.1.5 Rewards and Returns

The quality of a policy is evaluated by its return, which is defined as the sum of scalar rewards received over a time-horizon T . In this thesis, we consider the finite-horizon discounted return, defined as:

$$
R(\tau) := \sum_{t=0}^{T} \gamma^t r_t. \tag{2.4}
$$

There is another useful measure of return, a special case of Equation (2.4) often referred to as the discounted rewards-to-go $R_t$. The discounted rewards-to-go is defined as the sum of discounted rewards starting from step $t$:

$$
R_t(\tau) := \sum_{t'=t}^{T} \gamma^{(t'-t)} r_{t'}. \tag{2.5}
$$
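As a small worked example of Equation (2.5), the discounted rewards-to-go of a trajectory can be computed with a single backward pass over the rewards; the sketch below assumes a plain NumPy array of rewards.

```python
import numpy as np

def rewards_to_go(rewards, gamma):
    """Discounted rewards-to-go R_t = sum_{t' >= t} gamma^(t'-t) r_{t'} (Equation 2.5)."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Three steps with reward 1 and gamma = 0.9 give [2.71, 1.9, 1.0].
print(rewards_to_go(np.array([1.0, 1.0, 1.0]), gamma=0.9))
```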

2.1.6 Value Functions and Bellman Equations

Since returns evaluate how good policies are, it is often useful to measure their expected values, especially when starting from a certain state or state-action pair. These measures are known as value functions, of which there are two kinds. The first kind is the state-value function $V^\pi(s)$, also known as the V-function, which measures the expected return starting from state $s$ if following policy $\pi$, defined as:

$$
V^\pi(s) := \mathbb{E}_{\tau \sim p(\tau|\pi)}\big[R_t(\tau) \mid S_t = s\big]. \tag{2.6}
$$

The second kind is the action-value function $Q^\pi(s, a)$, also known as the Q-function, which measures the expected return starting from state $s$ and taking action $a$ if following policy $\pi$, defined as:

$$
Q^\pi(s, a) := \mathbb{E}_{\tau \sim p(\tau|\pi)}\big[R_t(\tau) \mid S_t = s, A_t = a\big]. \tag{2.7}
$$

It is often important to pay attention to a special case of Equation (2.7), known as the optimal action-value function $Q^*(s, a)$ and defined as:

$$
Q^*(s, a) := \max_\pi \mathbb{E}_{\tau \sim p(\tau|\pi)}\big[R_t(\tau) \mid S_t = s, A_t = a\big], \tag{2.8}
$$

which measures the expected return starting from state $s$ and taking action $a$ if following the optimal policy in the environment. The optimal action-value function $Q^*(s, a)$ is especially important because, if it is known, the optimal policy can be obtained implicitly. More specifically, the agent can follow the optimal policy by taking the action that maximizes $Q^*(s, a)$ at each step. Indeed, this idea builds the cornerstone of an important family of model-free RL algorithms, known as Q-learning (Watkins, 1989). Moreover, because of the property of MDPs, the value functions we discussed are closely connected to each other and can be expressed recursively through the Bellman equations. The Bellman equation for the state-value function is given by:

$$
V^\pi(s) = \mathbb{E}_{a \sim \pi(a|s),\, s' \sim p(s'|s,a)}\big[r + \gamma V^\pi(s')\big], \tag{2.9}
$$

while for the action-value function, it is given by:

$$
Q^\pi(s, a) = \mathbb{E}_{s' \sim p(s'|s,a)}\Big[r + \gamma\, \mathbb{E}_{a' \sim \pi(a'|s')}\big[Q^\pi(s', a')\big]\Big]. \tag{2.10}
$$

Finally, the Bellman equation for the optimal action-value function is given by:

$$
Q^*(s, a) = \mathbb{E}_{s' \sim p(s'|s,a)}\Big[r + \gamma \max_{a'} Q^*(s', a')\Big]. \tag{2.11}
$$

2.1.7 The Reinforcement Learning Objective

In reinforcement learning, the goal of agents is to learn strategies with high returns. Mathematically, this can be formulated naturally as an optimization problem, with a clear objective:

$$
J(\pi) := \mathbb{E}_{\tau \sim p(\tau|\pi)}[R(\tau)]. \tag{2.12}
$$

With this objective, the agents aim to learn a policy that maximizes the expected return over trajectories. This policy is called the optimal policy and is given by:

$$
\pi^* := \arg\max_\pi J(\pi). \tag{2.13}
$$

2.2 Policy Optimization in Reinforcement Learning

With the rapid development of reinforcement learning, it becomes increasingly important to have a big picture of modern RL algorithms, as they are often built on different trade-offs, assumptions, and settings.

In this thesis, we consider model-free RL algorithms, where the agent does not learn a model of the environment for better decision-making. There are two branches within the family of model-free RL algorithms: the policy optimization methods and the Q-learning methods. Policy optimization methods learn a policy directly by optimizing some performance objective. In contrast, Q-learning methods indirectly improve a policy by learning the optimal action-value function, from which the optimal policy can be inferred. As a matter of fact, all the RL algorithms used in this thesis, namely A2C (Mnih et al., 2016), PPO (Schulman et al., 2017), and SAC (Haarnoja et al., 2018), are considered policy optimization methods. Moreover, these three RL algorithms are also actor-critic methods, as all of them explicitly learn a policy (also called an actor) and a value function (also called a critic) that enhance each other concurrently.

Going one step further, there is another essential property of RL algorithms worth discussing, namely whether they are on-policy or off-policy. Of all the RL algorithms discussed in this thesis, A2C and PPO are on-policy because, in each update, only data collected by following the most recent version of the policy can be used. On the contrary, SAC is off-policy, meaning that all data collected by policies at any point can be used in updates. This distinction is critical because it highlights a subtle difference in the objectives these algorithms use to optimize their policies. This difference is deeply connected with the development of our method. In the next two sections, we will cover the ideas and mathematics behind these two approaches.

2.2.1 On-Policy Policy Optimization Methods

In this section, we cover the policy gradient method, which constitutes the fundamental expression in on-policy policy optimization methods. Here, we consider a parameterized policy $\pi_\theta$ and aim to find the optimal policy that maximizes the RL objective given in Equation (2.12). This optimal policy is defined by:

$$
\theta^* := \arg\max_\theta J(\theta), \tag{2.14}
$$

which can be improved by gradient-based optimization methods:

$$
\theta_{k+1} := \theta_k + \alpha \nabla_\theta J(\theta)\big|_{\theta_k}, \tag{2.15}
$$

where $\alpha$ is the learning rate and $k$ denotes the current iteration. More specifically, the gradient w.r.t. $\theta$ is given by:

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \mathbb{E}_{\tau \sim p(\tau|\theta)}[R(\tau)] &\text{(2.16)} \\
&= \nabla_\theta \int_\tau p(\tau|\theta) R(\tau)\, d\tau &\text{(2.17)} \\
&= \int_\tau \nabla_\theta p(\tau|\theta) R(\tau)\, d\tau &\text{(2.18)} \\
&= \int_\tau p(\tau|\theta) \frac{\nabla_\theta p(\tau|\theta)}{p(\tau|\theta)} R(\tau)\, d\tau &\text{(2.19)} \\
&= \int_\tau p(\tau|\theta) \nabla_\theta \log p(\tau|\theta) R(\tau)\, d\tau &\text{(2.20)} \\
&= \mathbb{E}_{\tau \sim p(\tau|\theta)}\big[\nabla_\theta \log p(\tau|\theta) R(\tau)\big]. &\text{(2.21)}
\end{aligned}
$$

Then, combining Equation (2.3) with Equation (2.21):

$$
\begin{aligned}
&\nabla_\theta J(\theta) &\text{(2.22)} \\
&= \mathbb{E}_{\tau \sim p(\tau|\theta)}\big[\nabla_\theta \log p(\tau|\theta) R(\tau)\big] &\text{(2.23)} \\
&= \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[\nabla_\theta \left(\log p_0(s_0) + \sum_{t=0}^{T-1} \log p(s_{t+1}|s_t, a_t) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t|s_t)\right) R(\tau)\right] &\text{(2.24)} \\
&= \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)\right], &\text{(2.25)}
\end{aligned}
$$

where Equation (2.25) is known as the REINFORCE estimator (Williams, 1992) and regarded as the most fundamental expression of all policy optimization methods.

The REINFORCE estimator can be interpreted intuitively: increase the probability of taking action $a_t$ in state $s_t$ if the return $R(\tau)$ is positive, and decrease it otherwise. However, because of the intrinsic stochasticity of the policies and environments, $R(\tau)$ suffers from high variance and can hence mislead the policy updates. Over the years, different methods have been proposed for a better evaluation, and they can be unified in a general expression (Schulman et al., 2016):

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, \Psi_t\right]. \tag{2.26}
$$

To name a few examples, Baxter et al., 2001 proposed the GPOMDP estimator with $\Psi_t := R_t(\tau)$, utilizing the fact that past rewards do not influence the qualities of later actions due to causality:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, R_t(\tau)\right]. \tag{2.27}
$$


Following that, $\Psi_t := A^{\pi_\theta}(s_t, a_t)$ was proposed, using the advantage function defined by:

$$
A^{\pi_\theta}(s_t, a_t) := Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t). \tag{2.28}
$$

The advantage function measures how much better, on average, a specific action $a_t$ is than other actions in state $s_t$. The gradient estimator with the advantage function is given by:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, A^{\pi_\theta}(s_t, a_t)\right]. \tag{2.29}
$$

More recently, the now-popular Generalized Advantage Estimator (GAE) (Schulman et al., 2016) was proposed, introducing $\Psi_t := \hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t$. Here $\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t$ is an exponentially-weighted average of $n$-step temporal difference (TD) errors (Sutton et al., 1998), given by:

$$
\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t := \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta^V_{t+l}, \tag{2.30}
$$

$$
\delta^V_t := r_t + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t). \tag{2.31}
$$

The gradient estimator with GAE is given by:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, \hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t\right]. \tag{2.32}
$$

To summarize, the GPOMDP estimator given in Equation (2.27) is unbiased but with higher variance, while the gradient estimators using advantage function and GAE, as given in Equation (2.29) and Equation (2.32) respectively, are slightly biased but with lower variance. Here, choosing an estimator for the critic is often a trade-off between bias and variance. In practice, estimators with lower variance are preferred.
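The GAE of Equations (2.30)-(2.31) can likewise be computed with one backward pass over a single trajectory; the sketch below is a simplified version that assumes a complete trajectory with a bootstrap value for the final state and omits any handling of early termination.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma, lam):
    """Generalized Advantage Estimation (Equations 2.30-2.31):
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t);  A_t = sum_l (gamma * lam)^l * delta_{t+l}."""
    T = len(rewards)
    values = np.append(values, last_value)   # bootstrap value for the state after the last step
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```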

2.2.2 Off-Policy Policy Optimization Methods

Compared with on-policy methods, off-policy policy optimization methods first find the optimal action-value function $Q^*(s, a)$. Then, the optimal policy can be obtained by finding the action that maximizes the Q-value in any given state:

$$
a^*(s) := \arg\max_a Q^*(s, a). \tag{2.33}
$$

In practice, the optimal action-value function is usually approximated by a parameterized function $Q_\phi(s, a)$. $Q_\phi(s, a)$ is trained via the Bellman equation for the optimal action-value function given in Equation (2.11), by minimizing the mean-squared Bellman error (MSBE), which measures how well $Q_\phi$ satisfies the Bellman equation:

$$
L(\phi, \mathcal{D}) := \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\left[\left(Q_\phi(s, a) - \left(r + \gamma(1 - d) \max_{a'} Q_\phi(s', a')\right)\right)^2\right], \tag{2.34}
$$


where $\mathcal{D}$, usually called the replay buffer, is the set of experiences $(s, a, r, s', d)$ collected from interaction, and $d$ is a binary variable indicating whether or not the next state is a terminal state.

Note that the MSBE objective can be unstable and divergent because of the deadly triad (Sutton et al., 1998), where the deadly triad refers to the combination of function approximation, bootstrapping, and off-policy training. Here, bootstrapping means that the update target depends on the current estimate, as shown in Equation (2.34). Besides, with function approximation, MSBE might not lead to the best Q-function for the policy, as the state distribution under the policy might differ from the state distribution in the buffer. To remedy this instability, several tricks were proposed (Lillicrap et al., 2016; Fujimoto et al., 2018).
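As an illustration of Equation (2.34), a minimal sketch of the MSBE loss for a discrete-action Q-network is shown below (the continuous-action case, discussed next, replaces the max over actions with a learned policy); the batch layout and network interfaces are assumptions for illustration only.

```python
import torch

def msbe_loss(q_net, batch, gamma):
    """Mean-squared Bellman error (Equation 2.34) for a discrete-action Q-network.
    batch holds tensors s, a, r, s2, d sampled from the replay buffer D."""
    s, a, r, s2, d = batch["s"], batch["a"], batch["r"], batch["s2"], batch["d"]
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)      # Q_phi(s, a)
    with torch.no_grad():  # semi-gradient: the bootstrapped target is not differentiated
        # In practice the target is computed with a lagged target network for stability.
        target = r + gamma * (1.0 - d) * q_net(s2).max(dim=1).values
    return ((q_sa - target) ** 2).mean()
```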

In the case of a discrete action space, if the optimal action-value function $Q^*(s, a)$ is known, it is often easy to find the optimal policy $\pi^*$ because one could always try all the actions in any given state and then choose the one with the highest Q-value. However, this is not possible for a continuous action space, as there are infinitely many possible actions. This problem is addressed by learning a deterministic policy $\mu_\theta(s)$ that approximates the optimal action $a^*$ (Silver et al., 2014; Lillicrap et al., 2016; Fujimoto et al., 2018), as shown in:

$$
\max_a Q_\phi(s, a) \approx Q_\phi(s, \mu_\theta(s)), \tag{2.35}
$$

such that the policy can be optimized by:

$$
\theta^* := \arg\max_\theta \mathbb{E}_{s \sim \mathcal{D}}\big[Q_\phi(s, \mu_\theta(s))\big]. \tag{2.36}
$$

As for the more general case of using a stochastic policy πθ, we will continue the discussion when covering SAC.

2.3 Exploration Based on Random Perturbations

Exploration strategies refer to how agents improve their decision-making by taking advantage of their experiences. With insufficient exploration, good states and actions with high rewards can be missed, resulting in policies that converge prematurely to bad local optima. In contrast, with too much exploration, agents can be distracted and waste their resources trying new states and actions, without leveraging the experiences shown to be successful. This phenomenon is known as the exploration vs. exploitation dilemma (Kearns et al., 2002; Jaksch et al., 2010), and over the years, it has been studied extensively, as briefly discussed in Chapter 1.

At the highest level, as Thrun, 1992 and Plappert et al., 2018 pointed out, it is often useful to divide exploration into directed strategies and undirected strategies, where directed strategies aim to extract useful information from agents' experiences (Tang et al., 2017; Ostrovski et al., 2017; Houthooft et al., 2016; Pathak et al., 2017) and undirected strategies rely on injecting randomness into agents' decision-making and hoping for the best. Because of the nature of undirected strategies, we believe it is clearer to call them exploration based on random perturbations. These exploration strategies can be further distinguished between exploration in action space vs. exploration in parameter space, step-based exploration vs. trajectory-based exploration, and correlated exploration vs. uncorrelated exploration. Also, note that this distinction is not absolute, and many strategies are performed in a hybrid way.

This section gives a general review of exploration strategies based on random perturbations for continuous action spaces, based on the excellent survey of Deisenroth et al., 2013. The related work using these exploration strategies is left for the next chapter.

2.3.1 Exploration in Action Space vs. in Parameter Space

In order to perturb the policy randomly, there are roughly two ways: perturb the output (actions) or perturb the policy weights. The first way perturbs the policy in action space and is usually realized by adding spherical Gaussian noise to the sampled action at each step independently. Demonstrating with the diagonal Gaussian policies given in Equation (2.1) and the reparameterization trick (Kingma et al., 2015b), it can be written as:

$$
a_t = \mu(s_t) + \sigma \odot \xi_t, \quad \xi_t \sim \mathcal{N}(0, I), \tag{2.37}
$$

where $\xi_t$ is the spherical Gaussian noise and $\odot$ is the element-wise product operator. On the other hand, the second way perturbs the policy in parameter space and is often implemented by imposing Gaussian noise on the policy parameters $\theta$ at the beginning of a trajectory:

$$
\tilde{\theta} := \theta + \sigma \odot \xi, \quad \xi \sim \mathcal{N}(0, I). \tag{2.38}
$$

In most cases, exploration in action space is preferred (Baxter et al., 2000; Sutton et al., 1999; Williams, 1992) because it is straightforward, easy to understand, and brings more randomness that could help the policy escape from a local optimum (Deisenroth et al., 2013). In contrast, exploration in parameter space is less interpretable. Still, it has the advantages of being more consistent, structured, and global, as it naturally explores conditioned on the states (Deisenroth et al., 2013). It is important to note that when using exploration in parameter space with a stochastic policy, the resulting exploration is realized in both action and parameter space, as this policy will always require some action noise to remain stochastic, no matter how small this action noise is.
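The two perturbation schemes of Equations (2.37) and (2.38) differ only in where the noise enters; a minimal sketch in PyTorch, assuming `mu_net` and `policy` are standard `nn.Module` objects, is given below.

```python
import copy
import torch

def act_with_action_noise(mu_net, log_std, s):
    """Step-based exploration in action space (Equation 2.37): a = mu(s) + sigma * xi."""
    xi = torch.randn(log_std.shape)
    return mu_net(s) + log_std.exp() * xi

def perturbed_copy(policy, sigma):
    """Trajectory-based exploration in parameter space (Equation 2.38): add Gaussian noise
    to a copy of the policy weights once, at the start of a trajectory."""
    perturbed = copy.deepcopy(policy)
    with torch.no_grad():
        for p in perturbed.parameters():
            p.add_(sigma * torch.randn_like(p))
    return perturbed  # act with this copy for the whole trajectory, then discard it
```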

2.3.2 Trajectory-based vs. Step-based Exploration

Step-based exploration strategies rely on injecting exploration noise at each step independently. This exploration is usually performed in action space (Deisenroth et al., 2013). On the contrary, trajectory-based exploration strategies often add exploration noise at the beginning of a trajectory, which is more suited to exploration in parameter space (Deisenroth et al., 2013), as discussed in the last section.

Typically, step-based exploration strategies are more random, leading to unreproducible action sequences. The effects of the perturbations can sometimes be hard to estimate, as they can be washed out by the system dynamics (Deisenroth et al., 2013). On the other hand, this randomness can sometimes be helpful, as it makes the policy less prone to getting trapped in a local optimum. In contrast, trajectory-based exploration produces reproducible action sequences and is often more stable (Deisenroth et al., 2013). This increased stability is also helpful for more consistent policy evaluation and leads to more reliable policy updates (Stulp et al., 2012; Deisenroth et al., 2013).

In this perspective, Deep Coherent Exploration lies between the two extremes on this spectrum, where a more delicate balance between randomness and stability can be achieved.


2.3.3 Uncorrelated vs. Correlated Exploration

Uncorrelated exploration refers to exploration strategies that ignore the correlation between different action dimensions. For continuous control, uncorrelated exploration models the injected noise with a diagonal covariance matrix, while correlated exploration is achieved by modeling a full representation of the covariance matrix (Deisenroth et al., 2013). The former is often used with exploration in action space, and the latter is typically used with exploration in parameter space. Theoretically, maintaining a full representation of the covariance matrix enables more comprehensive and expressive modeling of the noise, which can often lead to faster and better learning (Deisenroth et al., 2013). However, as the policy's parameters become high-dimensional, this becomes impractical, as the number of parameters of the covariance matrix increases quadratically (Deisenroth et al., 2013).

2.4 Deep Reinforcement Learning

Deep reinforcement learning refers to the area combining deep learning and reinforcement learning, where the policies and (or) value functions are usually represented by deep neural networks for more sophisticated and powerful function approximation. In this section, we briefly review the three deep RL algorithms used in this thesis: Advantage Actor-Critic (A2C) (Mnih et al., 2016), Proximal Policy Optimization (PPO) (Schulman et al., 2017), and Soft Actor-Critic (SAC) (Haarnoja et al., 2018).

2.4.1 Advantage Actor-Critic

Advantage Actor-Critic (A2C) is the synchronous version of the original Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016) algorithm. Mathematically, both of them are directly built on the REINFORCE estimator (Williams, 1992), but with the more sophisticated advantage functions as critics:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p(\tau|\theta)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, A^{\pi_\theta}(s_t, a_t)\right]. \tag{2.39}
$$

Like $Q^{\pi_\theta}(s_t, a_t)$ and $V^{\pi_\theta}(s_t)$, the true advantage function $A^{\pi_\theta}(s_t, a_t)$ is unknown. In practice, it is popular to estimate $A^{\pi_\theta}(s_t, a_t)$ from data using GAE (Schulman et al., 2016).

On the other hand, these two methods improve on REINFORCE by explicitly deploying multiple workers (copies of the agent) to interact with different instances of the environment in parallel. This parallelism often leads to higher efficiency and lower-variance gradient estimates because of the increased variety of experiences collected. The only difference between A2C and A3C, as the name suggests, is whether the updates are implemented synchronously. For A2C, an update is performed for all workers together once they are all ready. For A3C, an update is performed whenever a worker is ready, without waiting for the other workers. In other words, the synchronous approach keeps all the workers identical throughout training, while for the asynchronous approach, the workers interacting with the environments are slightly different. In practice, this difference brings extra randomness that could help accelerate training. However, researchers at OpenAI have shown that the synchronous approach achieves approximately the same performance as the asynchronous approach, and A2C is more favorable because of its simplicity of implementation. The pseudo-code of single-worker A2C is shown in Algorithm 1.


Algorithm 1: A2C

Input: initial policy parameters $\theta_0$, initial value function parameters $\phi_0$.

1. for $k = 0, 1, 2, \ldots, K$ do
2.     Collect a trajectory $\tau_k$ with $T$ steps in a buffer $\mathcal{D}_k$ by running policy $\pi_{\theta_k}$.
3.     Compute rewards-to-go $R_t$ and any kind of advantage estimates $\hat{A}_t$ based on the current value function $V_{\phi_k}$ for all steps $t$.
4.     Estimate the gradient of the policy:
       $$\hat{\nabla}_\theta J(\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_{\theta_k}(a_t|s_t)\, \hat{A}_t,$$
       and update the policy by performing a gradient step: $\theta_{k+1} \leftarrow \theta_k + \alpha_\theta \hat{\nabla}_\theta J(\theta)$.
5.     Learn the value function by minimizing the regression mean-squared error:
       $$L(\phi) = \frac{1}{T} \sum_{t=0}^{T} \big(V_{\phi_k}(s_t) - R_t\big)^2,$$
       and update the value function by performing a gradient step: $\phi_{k+1} \leftarrow \phi_k + \alpha_\phi \hat{\nabla}_\phi L(\phi)$.

2.4.2 Proximal Policy Optimization

The idea behind the aforementioned vanilla policy gradient methods (REINFORCE, A2C, and A3C) is simple and straightforward: take a small gradient step in policy parameters to push up the probability of $\pi_\theta(a|s)$ if the advantage is positive, and otherwise push down the probability of $\pi_\theta(a|s)$. However, it is hard to accurately determine the right step size, as a seemingly small difference in parameter space could change the policy and cause severe instability. To address this problem, Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) was first proposed, using a complex second-order method to determine the largest possible step size for "safe" updates. Closely related to TRPO and aiming to solve the same problem, Proximal Policy Optimization (PPO) (Schulman et al., 2017) was proposed shortly after. PPO is a family of first-order methods that combine several tricks to relieve the complexity of TRPO while still keeping new policies close to the old ones. In general, there are mainly two variants of PPO, namely PPO-Penalty and PPO-Clip, and we will only consider PPO-Clip because it is simpler to implement and its performance is shown to be at least as good as that of PPO-Penalty.

Compared with vanilla policy gradient methods, in each iteration, PPO-Clip optimizes its policy by maximizing a surrogate objective:

$$
\theta_{k+1} := \arg\max_\theta L^{\mathrm{CLIP}}_{\theta_k}(\theta), \tag{2.40}
$$

where this surrogate objective $L^{\mathrm{CLIP}}_{\theta_k}(\theta)$ is defined as:

$$
L^{\mathrm{CLIP}}_{\theta_k}(\theta) := \mathbb{E}_{\tau \sim p(\tau|\theta_k)}\left[\sum_{t=0}^{T-1} \min\Big(r_t(\theta)\, A^{\pi_{\theta_k}}_t,\ \mathrm{clip}\big(r_t(\theta), 1 - \epsilon, 1 + \epsilon\big)\, A^{\pi_{\theta_k}}_t\Big)\right], \tag{2.41}
$$

with $r_t(\theta) := \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}$, and $\epsilon$ is a small threshold that approximately restricts the greatest distance allowed between the new policy and the old policy. However, the clipping alone does not suffice to prevent the new policy from moving too far; it rather serves as a regularizer. In practice, the Kullback-Leibler (KL) divergence between the new policy and the old policy, approximated on a sampled minibatch, is often used as a further constraint for early stopping. The pseudo-code of single-worker PPO-Clip is shown in Algorithm 2.

Algorithm 2: PPO-Clip

Input: initial policy parameters $\theta_0$, initial value function parameters $\phi_0$.

1. for $k = 0, 1, 2, \ldots, K$ do
2.     Collect a trajectory $\tau_k$ with $T$ steps in a buffer $\mathcal{D}_k$ by running policy $\pi_{\theta_k}$.
3.     Compute rewards-to-go $R_t$ and any kind of advantage estimates $\hat{A}_t$ based on the current value function $V_{\phi_k}$ for all steps $t$.
4.     Learn the policy by maximizing the PPO-Clip objective:
       $$L^{\mathrm{CLIP}}_{\theta_k}(\theta) = \sum_{t=0}^{T-1} \min\Big(r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}\big(r_t(\theta), 1 - \epsilon, 1 + \epsilon\big)\, \hat{A}_t\Big),$$
       and update the policy by performing multiple gradient steps while the constraint on the approximated KL divergence remains satisfied:
       $$\theta_{k+1} \leftarrow \theta_k + \alpha_\theta \hat{\nabla}_\theta L^{\mathrm{CLIP}}_{\theta_k}(\theta).$$
5.     Learn the value function by minimizing the regression mean-squared error:
       $$L(\phi) = \frac{1}{T} \sum_{t=0}^{T} \big(V_{\phi_k}(s_t) - R_t\big)^2,$$
       and update the value function by performing a gradient step: $\phi_{k+1} \leftarrow \phi_k + \alpha_\phi \hat{\nabla}_\phi L(\phi)$.
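In code, the surrogate objective of Equation (2.41) and the approximate KL divergence used for early stopping reduce to a few lines; the sketch below assumes precomputed log-probabilities and advantages, and the clipping threshold of 0.2 is a common default rather than a value prescribed here.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """PPO-Clip surrogate (Equation 2.41), returned as a loss to minimize."""
    ratio = torch.exp(log_prob_new - log_prob_old)              # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Approximate KL divergence between the old and new policy, used for early stopping.
    approx_kl = (log_prob_old - log_prob_new).mean()
    return -surrogate.mean(), approx_kl
```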

2.4.3 Soft Actor-Critic

Soft Actor-Critic (SAC) (Haarnoja et al., 2018) is an off-policy policy optimization method with a stochastic policy. Known as an entropy-regularized RL method, SAC is developed under the framework of maximum entropy RL (Ziebart et al., 2008), with a slightly different RL objective:

$$
J(\pi) := \mathbb{E}_{\tau \sim p(\tau|\pi)}\left[\sum_{t=0}^{\infty} \gamma^t \Big(r_t + \alpha \mathcal{H}\big(\pi(a_t|s_t)\big)\Big)\right], \tag{2.42}
$$

where H is the entropy defined as:

$$
\mathcal{H}(P) := \mathbb{E}_{x \sim p(x)}[-\log p(x)], \tag{2.43}
$$

which measures how random the policy is, and α is the temperature parameter that determines the relative importance of the entropy term compared with the expected return. Here, the entropy term can be seen as regularizing the objective, giving a significant penalty when the policy becomes too certain. Indeed, this design is closely related to the exploration-exploitation trade-off, as it aims to maintain a more stochastic policy, which naturally encourages exploration and helps prevent the policy from prematurely converging to a local optimum. There are two main variants of SAC, where the first variant uses a fixed temperature parameter α over training and the second variant adapts the temperature parameter α while satisfying an entropy constraint. Here we consider the first variant with a fixed temperature parameter α.

To learn the value function, SAC uses the clipped double-Q trick (Hasselt, 2010; Fujimoto et al., 2018) to mitigate the overestimation of action values. More specifically, SAC deploys two learned Q-functions $Q_{\phi_1}$, $Q_{\phi_2}$ and takes the minimum of the two. Also note that, because of the maximum entropy RL framework and the fact that SAC uses on-policy exploration, the Bellman equation for the Q-function is different, defined as:

$$
\begin{aligned}
Q^\pi(s, a) &:= \mathbb{E}_{s' \sim p(s'|s,a),\, \tilde{a}' \sim \pi(\tilde{a}'|s')}\Big[r + \gamma\Big(Q^\pi(s', \tilde{a}') + \alpha \mathcal{H}\big(\pi(\tilde{a}'|s')\big)\Big)\Big] &\text{(2.44)} \\
&= \mathbb{E}_{s' \sim p(s'|s,a),\, \tilde{a}' \sim \pi(\tilde{a}'|s')}\Big[r + \gamma\Big(Q^\pi(s', \tilde{a}') - \alpha \log \pi(\tilde{a}'|s')\Big)\Big], &\text{(2.45)}
\end{aligned}
$$

where the next actions are denoted by $\tilde{a}'$ on purpose, to explicitly indicate that they are sampled from the current policy, $\tilde{a}' \sim \pi(\tilde{a}'|s')$, instead of taken from the replay buffer as in the common setting. Putting it all together, the loss of the Q-functions is defined as:

$$
L(\phi_i, \mathcal{D}) := \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\left[\Big(Q_{\phi_i}(s, a) - y(r, s', d)\Big)^2\right], \tag{2.46}
$$

where the target $y(r, s', d)$ is given by:

$$
y(r, s', d) = r + \gamma(1 - d)\left(\min_{i=1,2} Q_{\phi_{\mathrm{targ},i}}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}'|s')\right), \quad \tilde{a}' \sim \pi_\theta(\tilde{a}'|s'). \tag{2.47}
$$

Here $\phi_{\mathrm{targ},i}$ are the target networks, usually initialized as copies of $\phi_i$ before training and updated to lag a few steps behind $\phi_i$ during training, such that they remain close to $\phi_i$. The target networks $\phi_{\mathrm{targ},i}$ are then used to stabilize the training of the Q-functions, which is known as the target networks trick.


Since SAC learns a stochastic policy $\pi_\theta$, this policy is trained to maximize the maximum entropy V-function, defined by:

$$
\begin{aligned}
V^{\pi_\theta}(s) &:= \mathbb{E}_{a \sim \pi_\theta(a|s)}\big[Q^{\pi_\theta}(s, a) + \alpha \mathcal{H}\big(\pi_\theta(a|s)\big)\big] &\text{(2.48)} \\
&= \mathbb{E}_{a \sim \pi_\theta(a|s)}\big[Q^{\pi_\theta}(s, a) - \alpha \log \pi_\theta(a|s)\big]. &\text{(2.49)}
\end{aligned}
$$

With the reparameterization trick (Kingma et al., 2015b), actions sampled from $\pi_\theta$ can be transformed as:

$$
\tilde{a} = \tanh\big(\mu_\theta(s) + \sigma_\theta(s) \odot \xi\big), \quad \xi \sim \mathcal{N}(0, I), \tag{2.50}
$$

where $\tanh$ is used to bound the actions and $\xi$ is the spherical Gaussian noise. Combining all of this, the policy objective $J(\theta)$ is given by:

$$
J(\theta) := \mathbb{E}_{s \sim \mathcal{D},\, \xi \sim \mathcal{N}(0, I)}\left[\min_{i=1,2} Q_{\phi_i}(s, \tilde{a}) - \alpha \log \pi_\theta(\tilde{a}|s)\right]. \tag{2.51}
$$
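A minimal sketch of the policy objective in Equation (2.51) with the tanh-squashed reparameterized sample of Equation (2.50) is given below; it assumes `policy_net(s)` returns the mean and log standard deviation and that `q1`, `q2` take a state-action pair, and it uses the standard numerically stable form of the tanh log-det correction.

```python
import math
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def sac_policy_loss(policy_net, q1, q2, states, alpha):
    """SAC policy loss: minimize E[alpha * log pi(a|s) - min_i Q_i(s, a)] (Equation 2.51)."""
    mu, log_std = policy_net(states)
    dist = Normal(mu, log_std.exp())
    u = dist.rsample()                 # reparameterized pre-squash sample (Equation 2.50)
    a = torch.tanh(u)                  # squashed action in (-1, 1)
    # log pi(a|s) = log N(u; mu, sigma) - sum log(1 - tanh(u)^2),
    # with the correction written in a numerically stable form.
    log_prob = dist.log_prob(u).sum(-1)
    log_prob = log_prob - (2.0 * (math.log(2.0) - u - F.softplus(-2.0 * u))).sum(-1)
    q_min = torch.min(q1(states, a), q2(states, a))
    return (alpha * log_prob - q_min).mean()
```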


Algorithm 3: SAC

Input: initial policy parameters $\theta$, initial Q-function parameters $\phi_1, \phi_2$, empty replay buffer $\mathcal{D}$.

1.  Set target parameters equal to main parameters: $\phi_{\mathrm{targ},1} \leftarrow \phi_1$, $\phi_{\mathrm{targ},2} \leftarrow \phi_2$.
2.  for each step do
3.      Observe state $s$ and select action $a \sim \pi_\theta(a|s)$.
4.      Execute $a$ in the environment.
5.      Observe next state $s'$, reward $r$, done signal $d$, and store $(s, a, r, s', d)$ in replay buffer $\mathcal{D}$.
6.      If $s'$ is terminal, reset the environment state.
7.      if it is time to update then
8.          for $j$ in range(number of updates) do
9.              Randomly sample a batch of transitions $B = \{(s, a, r, s', d)\}$ from $\mathcal{D}$.
10.             Compute targets for the Q-functions:
                $$y(r, s', d) = r + \gamma(1 - d)\left(\min_{i=1,2} Q_{\phi_{\mathrm{targ},i}}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}'|s')\right),$$
11.             where $\tilde{a}' \sim \pi_\theta(\tilde{a}'|s')$.
12.             Update the Q-functions by one step of gradient descent using:
                $$\nabla_{\phi_i} \frac{1}{|B|} \sum_{(s,a,r,s',d) \in B} \Big(Q_{\phi_i}(s, a) - y(r, s', d)\Big)^2, \quad \text{for } i = 1, 2.$$
13.             Update the policy by one step of gradient ascent using:
                $$\nabla_\theta \frac{1}{|B|} \sum_{s \in B} \left(\min_{i=1,2} Q_{\phi_i}(s, \tilde{a}) - \alpha \log \pi_\theta(\tilde{a}|s)\right),$$
                where $\tilde{a}$ is a sample from $\pi_\theta(\tilde{a}|s)$ which is differentiable w.r.t. $\theta$ via the reparameterization trick.
14.             Update the target networks with:
                $$\phi_{\mathrm{targ},i} \leftarrow \rho \phi_{\mathrm{targ},i} + (1 - \rho)\phi_i, \quad \text{for } i = 1, 2.$$
15.         end for
16.     end if
17. end for


Chapter 3

Related Work

3.1 Exploration in Parameter Space for "Shallow" Reinforcement Learning

An early systematic study on the difference between perturbing actions and parameters dates back to Rückstieß et al., 2008, where the authors introduced a state-dependent exploration function that returns the same action for any given state during a trajectory. The authors showed that, because of lower variance and faster convergence, this method resulted in improved exploration behaviors when combined with REINFORCE (Williams, 1992) and Natural Actor-Critic (Peters et al., 2005). Later, this work was further extended by Kober et al., 2008 and Sehnke et al., 2010.

More recently, similar to the exploration strategy in Lillicrap et al., 2016, where the Ornstein-Uhlenbeck process (Uhlenbeck et al., 1930) is used to generate temporally correlated action noise for exploration, Hoof et al., 2017 introduced the Generalized Exploration framework. Generalized Exploration explores by producing temporally coherent policies, which unifies step-based and trajectory-based exploration in parameter space. Moreover, the authors showed that, with linear policies, a more delicate balance between these two extreme strategies often leads to better performance.

To be more specific, such temporally coherent policies are realized by constructing a Markov chain of policy parameters:

$$
\theta_t \sim \begin{cases} p_0(\theta_t) & \text{if } t = 0 \\ p(\theta_t|\theta_{t-1}) & \text{otherwise,} \end{cases} \tag{3.1}
$$

where $\theta_t$ denotes the policy parameters at step $t$. Equation (3.1) states that the distribution of the policy parameters at step $t$ is conditioned on the policy parameters of the previous step, with two extreme cases. The first extreme case is $p(\theta_t|\theta_{t-1}) = p_0(\theta_t)$, where $\theta_t$ and $\theta_{t-1}$ are completely independent. The second extreme case is $p(\theta_t|\theta_{t-1}) = \delta(\theta_t - \theta_{t-1})$, where $\theta_t$ and $\theta_{t-1}$ are completely dependent. Here $\delta$ is the Dirac delta function. Respectively, the first case corresponds to step-based exploration (Baxter et al., 2001), while the second case corresponds to trajectory-based exploration (Sehnke et al., 2010).

Although Generalized Exploration was shown to be beneficial, it suffers from storage and computation limitations. Since Generalized Exploration integrates out the uncertainty of sampling exploring policies at each step $t$ and computes the gradients in batch mode, all the history needs to be stored. This is undesirable and even impractical, especially when the trajectory is long or the actions and states are high-dimensional. Apart from storage, the computation in Generalized Exploration also involves inverting large matrices, which does not scale favorably in terms of processing time.


3.2 Exploration in Parameter Space for Deep Reinforcement Learning

Although the methods discussed in the last section pioneered the research on exploration in parameter space, their applicability or success is limited. More precisely, these methods were only evaluated with extremely shallow policies (often linear policies) and relatively simple tasks with low-dimensional state and action spaces. Given this, NoisyNets (Fortunato et al., 2018), Parameter Space Noise for Exploration (PSNE) (Plappert et al., 2018), and Stochastic A3C (SA3C) (Shang et al., 2019) were proposed, introducing more scalable and general methods adapted for policies using deep neural networks.

In general, both NoisyNets (Fortunato et al., 2018) and PSNE (Plappert et al., 2018) perform exploration in parameter space by adding parametric noise to the policy networks:

$$
\tilde{\theta} := \theta + \mathcal{N}(0, \sigma^2 I), \tag{3.2}
$$

where $\theta$ denotes the parameters of the policy network and $\sigma$ is the standard deviation controlling the magnitude of the noise. In this setting, the usual policy loss is obtained by taking an expectation over the noise.

Despite being conceptually similar, there are a few differences between NoisyNets and PSNE. The major difference is that NoisyNets directly learns the magnitude of noise σ for each parameter while PSNE controls the magnitude of noise using a single scalar σ for all parameters. This scalar magnitude is then adapted using some measures of distance. Furthermore, PSNE uses layer normalization (Ba et al., 2016) between perturbed layers. This extra step is used to ensure that the spherical Gaussian noise can achieve the same perturbation scale across all layers, even as learning progresses.

Although both have been shown to lead to more global and consistent exploration behaviors, NoisyNets and PSNE have several limitations. For NoisyNets, since its success is demonstrated with discrete control on Atari video games (Bellemare et al., 2015), its performance on continuous action spaces is unknown. Furthermore, learning the magnitude of the noise for all parameters increases the number of parameters significantly. For PSNE, maintaining a single scalar $\sigma$ to control the noise magnitude for all parameters is very unlikely to be the optimal solution.

At a high level, both methods can be seen as learning a master policy that samples sub-policies for trajectory-based exploration in parameter space, as shown in Figure 4.1. Additionally, these sub-policies are sampled by perturbing parameters across all layers, with the uncertainty from sampling exploring sub-policies being addressed by single-sample Monte Carlo integration. From this perspective, three major limitations motivate the development of Deep Coherent Exploration:

1. Trajectory-based exploration in parameter space is inefficient and provides insufficient randomness, as discussed in Section 2.3.2.

2. Addressing the uncertainty of sampling exploring policies using single-sample Monte-Carlo integration suffers from high variance.

3. The effects of perturbing all layers of the policy networks are unknown and could induce unstable exploration behaviors.


Chapter 4

Deep Coherent Exploration for Continuous Control

4.1 Overview

In this chapter, we give a detailed introduction to Deep Coherent Exploration, a novel exploration strategy for deep reinforcement learning on continuous control. From a bird's-eye view, our proposed method addresses the previously discussed limitations of NoisyNets (Fortunato et al., 2018) and PSNE (Plappert et al., 2018), with the following three contributions:

1. Deep Coherent Exploration interpolates between step-based and trajectory-based exploration, allowing for a more balanced trade-off between stability and stochasticity.

2. Deep Coherent Exploration integrates the uncertainty of sampling exploring policies using analytical integration in closed-forms, which significantly reduces the variance of gradient estimates.

3. Deep Coherent Exploration explicitly perturbs only the last layer of the policy network, resulting in a decreased number of learning parameters and better control of the perturbations in parameter space.

Here, we provide a graphical model of Deep Coherent Exploration, shown in Figure 4.1. This graphical model uses the same conventions as in Bishop, 2007, where empty circles denote latent random variables, shaded circles denote observed random variables, and dots denote deterministic variables. To be more precise, we briefly introduce the variables in Deep Coherent Exploration:

1. $\mu$ and $\Lambda$ are the mean and diagonal precision matrix of the Gaussian distribution from which $w_0$ (and, in part, $w_t$ for $t > 0$) is sampled. $\mu$ and $\Lambda$ are the parameters we want to learn.

2. wt denotes the last layer parameters of the policy network at step t. Moreover, wt is treated as a latent variable and will not be learned.

3. θ denotes all the parameters of the policy network except for the last layer, and it is part of the learning parameters.

4. xt denotes the input to the last layer of the policy network at step t and xt is decided deterministically by st and θ.


Figure 4.1: Graphical models of different exploration strategies. Left: action noise. Middle: NoisyNets and PSNE. Right: Deep Coherent Exploration. For exploration in action space, $w_0$ denotes the parameters of the exploring policy. For a detailed explanation of NoisyNets and PSNE, please refer to Section 4.1.2. For a detailed explanation of Deep Coherent Exploration, please refer to the paragraph below this figure.

4.1.1 Generalizing Step-based and Trajectory-based Exploration

As shown in Figure 4.1, Deep Coherent Exploration generalizes step-based and trajectory-based exploration by constructing a Markov chain of $w_t$, as in Hoof et al., 2017. This Markov chain naturally establishes temporal coherency by modeling the conditional distribution $p(w_t|w_{t-1})$, with the initial distribution $p_0(w_0)$. In this setting, step-based exploration corresponds to the extreme case $p(w_t|w_{t-1}) = p_0(w_t)$, and trajectory-based exploration corresponds to the other extreme case $p(w_t|w_{t-1}) = \delta(w_t - w_{t-1})$, where $\delta$ is the Dirac delta function.

Even though this Markov chain establishes a relationship of temporal coherency, the specific form of the distribution for $w_t$ still needs to be defined. Directly following Hoof et al., 2017, we consider $w_t$ to be generated by the following process:

$$
w_t := \beta \tilde{w} + (1 - \beta) w_{t-1}, \quad \tilde{w} \sim \mathcal{N}\!\left(\mu, \left(\frac{2}{\beta} - 1\right)\Lambda^{-1}\right), \tag{4.1}
$$

where $\beta$ is a hyperparameter that controls the temporal coherency of $w_t$ and $w_{t-1}$. To be more specific, step-based exploration corresponds to the case $\beta = 1$ and trajectory-based exploration corresponds to the case $\beta = 0$, with intermediate exploration corresponding to $\beta \in (0, 1)$. Continuing from Equation (4.1), $p(w_t|w_{t-1})$ is given by:

$$
p(w_t|w_{t-1}) = \mathcal{N}\big(\mu_t, \Lambda_t^{-1}\big), \tag{4.2}
$$

where

$$
\mu_t = (1 - \beta) w_{t-1} + \beta\mu, \tag{4.3}
$$

$$
\Lambda_t^{-1} = (2\beta - \beta^2)\Lambda^{-1}. \tag{4.4}
$$


Additionally, the initial distribution of $w_0$ is given by:

$$
p_0(w_0) = \mathcal{N}\big(\mu, \Lambda^{-1}\big). \tag{4.5}
$$

Note that the main reason we chose this scheme is to ensure that the marginal distribution of $w_t$ is equal to the initial distribution $p_0$ at any step $t$. Given a Gaussian distribution as in Equation (4.5), this constraint can be satisfied through the detailed balance condition. For a more comprehensive explanation, please refer to Hoof et al., 2017.
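The generative process of Equations (4.1)-(4.5) is straightforward to simulate; the NumPy sketch below samples a coherent chain of last-layer weights and makes the two limiting cases explicit. It is a simulation of the sampling scheme only, not the training-time implementation.

```python
import numpy as np

def sample_coherent_weights(mu, Lambda_inv, beta, T, rng=None):
    """Sample a temporally coherent chain w_0, ..., w_{T-1} of last-layer weights following
    Equations (4.1)-(4.5): w_0 ~ N(mu, Lambda^{-1}), then
    w_t = beta * w_tilde + (1 - beta) * w_{t-1} with w_tilde ~ N(mu, (2/beta - 1) Lambda^{-1})."""
    rng = np.random.default_rng() if rng is None else rng
    chol = np.linalg.cholesky(Lambda_inv)                  # chol @ z has covariance Lambda^{-1}
    ws = [mu + chol @ rng.standard_normal(mu.shape)]       # w_0 ~ p_0
    for _ in range(1, T):
        if beta == 0.0:                                    # trajectory-based: keep w_0 for all steps
            ws.append(ws[-1])
            continue
        w_tilde = mu + np.sqrt(2.0 / beta - 1.0) * (chol @ rng.standard_normal(mu.shape))
        ws.append(beta * w_tilde + (1.0 - beta) * ws[-1])
    return np.stack(ws)

# beta = 1 recovers step-based exploration (independent draws from p_0 at every step);
# beta = 0 recovers trajectory-based exploration (a single draw reused for the whole trajectory).
```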

4.1.2 Integrating Uncertainty of Sampling Exploring Policies

When exploration in parameter space is performed by sampling exploring policies, the uncertainty from this sampling process should be addressed adequately. For clarity, we present a high-level graphical model of NoisyNets and PSNE, as shown in Figure 4.1. Here NoisyNets and PSNE are presented using the same notation as Deep Coherent Exploration for better comparison. To be more specific, $\mu$ denotes the parameters of the policy network, $\Lambda$ denotes the parameters (whether learnable or not) controlling the magnitude of the noise, and $w_0$ denotes the parameters of the sampled policy. In this way, if we denote $\zeta := (\mu, \Lambda)$ as the learnable parameters, the single-sample Monte Carlo policy gradient of both NoisyNets and PSNE can be obtained as:

$$
\begin{aligned}
\nabla_\zeta \log p(\tau|\zeta) &= \nabla_\zeta \log \int_{w_0} p(\tau, w_0|\zeta)\, dw_0 &\text{(4.6)} \\
&= \nabla_\zeta \log \int_{w_0} p(\tau|w_0)\, p(w_0|\zeta)\, dw_0 &\text{(4.7)} \\
&\approx \nabla_\zeta \log p\big(\tau^1|w_0^1\big), \quad w_0^1 \sim p(w_0|\zeta) \text{ and } \tau^1 \sim p(\tau|w_0^1), &\text{(4.8)}
\end{aligned}
$$

where the gradients w.r.t. $\zeta$ can be obtained via the reparameterization trick (Kingma et al., 2015a).

We can see from Equation (4.8) that, when using single-sample Monte Carlo integration to address the uncertainty from sampling, the optimization aims to improve $p(\tau^1|w_0^1)$ rather than $p(\tau^1|\zeta)$. Since $w_0$ is sampled and differs in each trajectory, it causes additional variance that can lead to instability. If the magnitude of the noise is high, the instability becomes even more severe, leading to high-variance gradient estimates and oscillating updates. To solve this problem, Deep Coherent Exploration uses analytical integration instead, which results in low-variance gradient estimates and stabilizes the learning process.

To summarize, NoisyNets (Fortunato et al., 2018) and PSNE (Plappert et al., 2018) evaluate the actions on the currently sampled policy, while Deep Coherent Exploration evaluates the actions on average over all the policies that could have been sampled. However, the analytical integration is only applicable to on-policy policy optimization methods, as will be discussed later.

4.1.3 Perturbing Last Layers of Policy Networks

Both NoisyNets and PSNE perturb all layers of the policy network. Additionally, PSNE uses layer normalization (Ba et al., 2016) to make sure parameters across all layers have similar sensitivities to perturbations. However, as Plappert et al., 2018 pointed out, it is not clear that deep neural networks can be perturbed in meaningful ways, especially when these perturbations are used for exploration in reinforcement learning. Here we argue that perturbing only the last layer of the policy network might be better, for the following reasons. Firstly, since it is unclear how this parameter noise (especially in the lower layers) is realized as action noise, perturbing all layers of the policy network results in uncontrollable perturbations. To name a few concerns: will the perturbations in different layers offset each other to some degree? Or is it possible that perturbations in different layers have significantly different influences or behaviors? Secondly, perturbing all layers might disturb the representation learning of the state, which is undesirable for learning a good policy. Thirdly, only the last-layer parameters can be integrated analytically in general. Given these reasons, Deep Coherent Exploration perturbs only the last layer of the policy network for exploration.

4.2 On-Policy Deep Coherent Exploration

In this section, we introduce how to adapt Deep Coherent Exploration to the vanilla policy gradient, which is the most fundamental method of on-policy policy optimization. Deep Coherent Exploration can be combined with other on-policy policy optimization methods in a similar way, with little modification. In particular, we will show how to adapt Deep Coherent Exploration to A2C (Mnih et al., 2016) and PPO (Schulman et al., 2017) later.

4.2.1 Deep Coherent Policy Gradient

For convenience, we first denote the learnable parameters of Deep Coherent Exploration as $\zeta := (\mu, \Lambda, \theta)$ and start from the RL objective given in Equation (2.21), with $\theta$ replaced by $\zeta$:

$$
\nabla_\zeta J(\zeta) = \mathbb{E}_{\tau \sim p(\tau|\zeta)}\big[\nabla_\zeta \log p(\tau|\zeta) R(\tau)\big]. \tag{4.9}
$$

Then, with the chain rule and the D-separation property (Pearl, 1989) for directed graphs, $\nabla_\zeta \log p(\tau|\zeta)$ can be factorized as:

$$
\begin{aligned}
&\nabla_\zeta \log p(\tau|\zeta) &\text{(4.10)} \\
&= \nabla_\zeta \log p(s_{[0:T]}, a_{[0:T-1]}|\zeta) &\text{(4.11)} \\
&= \nabla_\zeta \log \left[p(s_0) \prod_{t=0}^{T-1} p(s_{t+1}|s_{[0:t]}, a_{[0:t]}, \zeta)\, p(a_t|s_{[0:t]}, a_{[0:t-1]}, \zeta)\right] &\text{(4.12)} \\
&= \nabla_\zeta \left[\log p(s_0) + \sum_{t=0}^{T-1} \Big(\log p(s_{t+1}|s_{[0:t]}, a_{[0:t]}) + \log p(a_t|s_{[0:t]}, a_{[0:t-1]}, \zeta)\Big)\right] &\text{(4.13)} \\
&= \sum_{t=0}^{T-1} \nabla_\zeta \log p(a_t|s_{[0:t]}, a_{[0:t-1]}, \zeta), &\text{(4.14)}
\end{aligned}
$$

where $s_{[0:t]}$ denotes the tuple of states $(s_0, \ldots, s_t)$ and $a_{[0:t]}$ denotes the tuple of actions $(a_0, \ldots, a_t)$. Furthermore, when $t = 0$, $p(a_t|s_{[0:t]}, a_{[0:t-1]}, \zeta)$ is defined as $p(a_0|s_0, \zeta)$. Instead of introducing an extra term into the above equations, we keep the current expressions for compactness and readability.

Normally, Markov policies are used, where the actions are independent given their states. However, as shown in Figure 4.1, since information can still flow through the unobserved latent variable $w_t$, our model is no longer Markov. To simplify this dependency, we introduce $w_t$ into Equation (4.14):

$$
\begin{aligned}
&\nabla_\zeta \log p(a_t|s_{[0:t]}, a_{[0:t-1]}, \zeta) &\text{(4.15)} \\
&= \nabla_\zeta \log \int_{w_t} p(a_t, w_t|s_{[0:t]}, a_{[0:t-1]}, \zeta)\, dw_t &\text{(4.16)} \\
&= \nabla_\zeta \log \int_{w_t} p(a_t|w_t, s_{[0:t]}, a_{[0:t-1]}, \zeta)\, p(w_t|s_{[0:t]}, a_{[0:t-1]}, \zeta)\, dw_t &\text{(4.17)} \\
&= \nabla_{\mu,\Lambda,\theta} \log \int_{w_t} \underbrace{p(a_t|w_t, s_t, \theta)}_{\pi_{w_t,\theta}(a_t|s_t)}\, \underbrace{p(w_t|s_{[0:t-1]}, a_{[0:t-1]}, \mu, \Lambda, \theta)}_{\alpha(w_t)}\, dw_t, &\text{(4.18)}
\end{aligned}
$$

where the first term is the density of the Gaussian policy, and the second term is the posterior probability of $w_t$ given the history, denoted by $\alpha(w_t)$.

4.2.2 Recursive Exact Inference of wt

For the next step, we decompose $\alpha(w_t)$ by introducing $w_{t-1}$:

$$
\begin{aligned}
\alpha(w_t) &= p(w_t|s_{[0:t-1]}, a_{[0:t-1]}, \zeta) &\text{(4.19)} \\
&= \int_{w_{t-1}} p(w_t, w_{t-1}|s_{[0:t-1]}, a_{[0:t-1]}, \zeta)\, dw_{t-1} &\text{(4.20)} \\
&= \int_{w_{t-1}} p(w_t|w_{t-1}, s_{[0:t-1]}, a_{[0:t-1]}, \zeta)\, p(w_{t-1}|s_{[0:t-1]}, a_{[0:t-1]}, \zeta)\, dw_{t-1} &\text{(4.21)} \\
&= \int_{w_{t-1}} \underbrace{p(w_t|w_{t-1}, \mu, \Lambda)}_{\text{prior of } w_t}\, p(w_{t-1}|s_{[0:t-1]}, a_{[0:t-1]}, \mu, \Lambda, \theta)\, dw_{t-1}, &\text{(4.22)}
\end{aligned}
$$

where the first term is the prior probability of $w_t$.

Here, it’s helpful for us to view the second term p(wt−1|s[0:t−1], a[0:t−1], µ, Λ, θ) as p(wt−1|at−1, ...). In such way, if we know p(at−1|wt−1, ...) and p(wt−1|...) respectively and both of them are Gaussians, p(wt−1|at−1, ...) can be found analytically using Equation (2.116) fromBishop, 2007. Easy to observe that:

$$\begin{aligned}
p(a_{t-1} \mid w_{t-1}, \ldots) &= p(a_{t-1} \mid w_{t-1}, s_{[0:t-1]}, a_{[0:t-2]}, \mu, \Lambda, \theta) \\
&= p(a_{t-1} \mid w_{t-1}, s_{t-1}, \theta) \\
&= \pi_{w_{t-1},\theta}(a_{t-1} \mid s_{t-1})
\end{aligned} \tag{4.23--4.25}$$

is the Gaussian action density, and

$$\begin{aligned}
p(w_{t-1} \mid \ldots) &= p(w_{t-1} \mid s_{[0:t-1]}, a_{[0:t-2]}, \mu, \Lambda, \theta) \\
&= p(w_{t-1} \mid s_{[0:t-2]}, a_{[0:t-2]}, \mu, \Lambda, \theta) \\
&= \alpha(w_{t-1})
\end{aligned} \tag{4.26--4.28}$$

is exactly the posterior probability of $w_{t-1}$ from the previous step. We can see that $\alpha(w_{t-1})$ is indeed a Gaussian using mathematical induction. For the base step $t = 0$, $\alpha(w_0) := p(w_0 \mid \mu, \Lambda)$ is a Gaussian by definition. For the induction step at $t$, if $\alpha(w_t)$ is a Gaussian, then $p(w_t \mid s_{[0:t]}, a_{[0:t]}, \mu, \Lambda, \theta)$ is found to be a Gaussian using Equation (2.116) from Bishop, 2007, applied to $\alpha(w_t)$ and $p(a_t \mid w_t, s_t, \theta)$. Next, $\alpha(w_{t+1})$ is found to be a Gaussian using Equation (2.115) from Bishop, 2007, applied to $p(w_{t+1} \mid w_t, \mu, \Lambda)$ and $p(w_t \mid s_{[0:t]}, a_{[0:t]}, \mu, \Lambda, \theta)$. This completes the proof that $\alpha(w_t)$ is a Gaussian. With these conditions satisfied, we arrive at an efficient recursive exact inference of $w_t$.
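To make these two identities concrete, the following minimal NumPy sketch (an illustration under our own naming, not the implementation used in this thesis) computes the marginal of Equation (2.115) and the posterior of Equation (2.116) from Bishop, 2007 for a generic linear-Gaussian model $p(w) = \mathcal{N}(m, S)$, $p(a \mid w) = \mathcal{N}(Xw, R)$:

```python
import numpy as np

def gaussian_marginal(m, S, X, R):
    """Marginal of a ~ N(X w, R) with w ~ N(m, S) (Bishop Eq. 2.115)."""
    return X @ m, R + X @ S @ X.T

def gaussian_posterior(m, S, X, R, a):
    """Posterior p(w | a) for the same linear-Gaussian model (Bishop Eq. 2.116)."""
    S_inv = np.linalg.inv(S)
    R_inv = np.linalg.inv(R)
    Sigma = np.linalg.inv(S_inv + X.T @ R_inv @ X)
    mu = Sigma @ (X.T @ R_inv @ a + S_inv @ m)
    return mu, Sigma

# Tiny sanity check with 2 observation dimensions and 3 latent dimensions.
rng = np.random.default_rng(0)
m, S = np.zeros(3), np.eye(3)
X, R = rng.normal(size=(2, 3)), 0.1 * np.eye(2)
a = rng.normal(size=2)
print(gaussian_marginal(m, S, X, R)[0].shape)      # (2,)
print(gaussian_posterior(m, S, X, R, a)[1].shape)  # (3, 3)
```

These are exactly the operations applied in the recursion below, with the Gaussian action density as the likelihood and $\alpha(w_{t-1})$ as the prior.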

4.2.3 Objective of Deep Coherent Policy Gradient

Next, we use $\alpha(w_t)$ to compute the objective at step $t$ as given in Equation (4.18). Suppose $a_t \in \mathbb{R}^p$, $x_t \in \mathbb{R}^q$, and our Gaussian policy is represented as:

$$p(a_t \mid w_t, s_t, \theta) := \mathcal{N}(W_t x_t + b_t,\, \Lambda_a^{-1}), \tag{4.29}$$
where $W_t \in \mathbb{R}^{p \times q}$ is the coefficient matrix, $x_t$ is the input to the last layer of the policy network, and $\Lambda_a$ is a constant precision matrix for the Gaussian policy. It is helpful to represent $w_t \in \mathbb{R}^{pq+p}$ by flattening $W_t$ and appending $b_t$:

$$w_t := \begin{pmatrix} w_{11} \\ \vdots \\ w_{1q} \\ \vdots \\ w_{p1} \\ \vdots \\ w_{pq} \\ b_1 \\ \vdots \\ b_p \end{pmatrix}, \tag{4.30}$$

such that it could still be sampled using multivariate Gaussians. Moreover, we stack $x_t$ into $X_t \in \mathbb{R}^{p \times (pq+p)}$:
$$X_t := \begin{pmatrix}
x_t^T & 0_{q,1}^T & \cdots & 0_{q,1}^T & 0_{q,1}^T & 1 & \cdots & 0 & 0 \\
0_{q,1}^T & x_t^T & \cdots & 0_{q,1}^T & 0_{q,1}^T & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0_{q,1}^T & 0_{q,1}^T & \cdots & x_t^T & 0_{q,1}^T & 0 & \cdots & 1 & 0 \\
0_{q,1}^T & 0_{q,1}^T & \cdots & 0_{q,1}^T & x_t^T & 0 & \cdots & 0 & 1
\end{pmatrix}, \tag{4.31}$$

where $0_{q,1}$ is a $q$-dimensional zero column vector. After this transformation, we have $W_t x_t + b_t = X_t w_t$, so the Gaussian policy in Equation (4.29) is represented equivalently as:

$$p(a_t \mid w_t, s_t, \theta) = \mathcal{N}(X_t w_t,\, \Lambda_a^{-1}). \tag{4.32}$$
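As a sanity check of this construction, the following NumPy sketch (illustrative only; the helper name stack_inputs is ours) builds $X_t$ from $x_t$ and verifies the identity $X_t w_t = W_t x_t + b_t$ for the flattening of Equation (4.30):

```python
import numpy as np

def stack_inputs(x, p):
    """Build X_t of shape (p, p*q + p) such that X_t @ w_t == W_t @ x_t + b_t,
    where w_t stacks the rows of W_t followed by the bias b_t (Eq. 4.30)."""
    q = x.shape[0]
    X = np.zeros((p, p * q + p))
    for i in range(p):
        X[i, i * q:(i + 1) * q] = x   # slot for the i-th row of W_t
        X[i, p * q + i] = 1.0         # slot for the i-th bias entry
    return X

# Check the identity on random numbers (p = 2 actions, q = 3 features).
rng = np.random.default_rng(1)
p, q = 2, 3
W, b, x = rng.normal(size=(p, q)), rng.normal(size=p), rng.normal(size=q)
w = np.concatenate([W.reshape(-1), b])   # flattening of Equation (4.30)
X = stack_inputs(x, p)
assert np.allclose(X @ w, W @ x + b)
```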

Base Case: $t = 0$

For the base case $t = 0$, the posterior probability $\alpha(w_0)$ and the prior probability of $w_0$ are identical by definition:

$$\alpha(w_0) = p_0(w_0 \mid \mu, \Lambda) = \mathcal{N}(\mu,\, \Lambda^{-1}). \tag{4.33}$$

Additionally, the action probability is given by Equation (4.32) with $t = 0$:
$$p(a_0 \mid w_0, s_0, \theta) = \mathcal{N}(X_0 w_0,\, \Lambda_a^{-1}). \tag{4.34}$$
With the property of multivariate Gaussians, we obtain the objective at $t = 0$:

$$\begin{aligned}
J(\mu, \Lambda, \theta)_{t=0} &= \log p(a_0 \mid s_0, \mu, \Lambda, \theta) \\
&= \log \int_{w_0} p(a_0 \mid w_0, s_0, \theta)\, \alpha(w_0)\, dw_0 \\
&= \log \mathcal{N}\big(X_0 \mu,\, \Lambda_a^{-1} + X_0 \Lambda^{-1} X_0^T\big).
\end{aligned} \tag{4.35--4.37}$$
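For illustration, a minimal sketch of Equation (4.37) using SciPy is given below; in the actual implementation this quantity would be computed with the differentiable tensors of the policy network rather than NumPy arrays, and the function name is ours:

```python
import numpy as np
from scipy.stats import multivariate_normal

def coherent_logp_t0(a0, X0, mu, Lambda_inv, Lambda_a_inv):
    """Marginal log-probability of a_0 (Equation 4.37): w_0 is integrated out
    analytically instead of plugging in the sampled value."""
    mean = X0 @ mu
    cov = Lambda_a_inv + X0 @ Lambda_inv @ X0.T
    return multivariate_normal.logpdf(a0, mean=mean, cov=cov)
```

This marginal log-probability replaces $\log \pi_{w_0,\theta}(a_0 \mid s_0)$ in the gradient estimate, so the objective no longer depends on the particular $w_0$ that happened to be sampled.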

General Case: $t > 0$

For the general case of step $t > 0$, we need the state $s_{t-1}$, the action $a_{t-1}$, as well as the mean and covariance of the posterior $\alpha(w_{t-1})$ stored from the previous step. Suppose $\alpha(w_{t-1})$ is written as:

$$\alpha(w_{t-1}) = \mathcal{N}(v_{t-1},\, L_{t-1}^{-1}), \tag{4.38}$$

and the action probability from the previous step is given by:
$$p(a_{t-1} \mid w_{t-1}, s_{t-1}, \theta) = \mathcal{N}(X_{t-1} w_{t-1},\, \Lambda_a^{-1}). \tag{4.39}$$
We have directly:
$$p(w_{t-1} \mid s_{[0:t-1]}, a_{[0:t-1]}, \mu, \Lambda, \theta) = \mathcal{N}(u_{t-1},\, \Sigma_{t-1}), \tag{4.40}$$
with
$$u_{t-1} = \Sigma_{t-1}\big(X_{t-1}^T \Lambda_a a_{t-1} + L_{t-1} v_{t-1}\big), \tag{4.41}$$
$$\Sigma_{t-1} = \big(L_{t-1} + X_{t-1}^T \Lambda_a X_{t-1}\big)^{-1}. \tag{4.42}$$

Combining the prior probability of $w_t$:
$$p(w_t \mid w_{t-1}, \mu, \Lambda) = \mathcal{N}\big((1-\beta) w_{t-1} + \beta\mu,\, (2\beta - \beta^2)\Lambda^{-1}\big), \tag{4.43}$$

we obtain the posterior probability $\alpha(w_t)$:
$$\alpha(w_t) = \mathcal{N}(v_t,\, L_t^{-1}), \tag{4.44}$$
where
$$v_t = (1-\beta) u_{t-1} + \beta\mu, \tag{4.45}$$
$$L_t^{-1} = (2\beta - \beta^2)\Lambda^{-1} + (1-\beta)^2 \Sigma_{t-1}. \tag{4.46}$$
Here, $v_t$ and $L_t^{-1}$ should be stored and used for the exact inference of $\alpha(w_{t+1})$ at the next step. Finally, the objective at step $t > 0$ is given by:

$$\begin{aligned}
J(\mu, \Lambda, \theta)_{t>0} &= \log p(a_t \mid s_{[0:t]}, a_{[0:t-1]}, \mu, \Lambda, \theta) \\
&= \log \int_{w_t} p(a_t \mid w_t, s_t, \theta)\, \alpha(w_t)\, dw_t \\
&= \log \mathcal{N}\big(X_t v_t,\, \Lambda_a^{-1} + X_t L_t^{-1} X_t^T\big).
\end{aligned} \tag{4.47--4.49}$$
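Putting Equations (4.40) through (4.49) together, one recursion step can be sketched as follows. This is a dense NumPy illustration with explicit inverses and our own function name; a practical implementation would use the autodiff tensors of the policy network and exploit the sparsity of $X_t$:

```python
import numpy as np
from scipy.stats import multivariate_normal

def coherent_step(a_prev, X_prev, v_prev, Lcov_prev, a_t, X_t,
                  mu, Lambda_inv, Lambda_a, beta):
    """One recursion step: condition on (s_{t-1}, a_{t-1}) (Eqs. 4.40-4.42),
    propagate through the prior (Eqs. 4.43-4.46), and return the marginal
    log-probability of a_t (Eq. 4.49) plus the new posterior statistics."""
    L_prev = np.linalg.inv(Lcov_prev)                      # precision of alpha(w_{t-1})
    Sigma = np.linalg.inv(L_prev + X_prev.T @ Lambda_a @ X_prev)      # Eq. (4.42)
    u = Sigma @ (X_prev.T @ Lambda_a @ a_prev + L_prev @ v_prev)      # Eq. (4.41)
    v_t = (1.0 - beta) * u + beta * mu                                # Eq. (4.45)
    Lcov_t = (2.0 * beta - beta ** 2) * Lambda_inv \
             + (1.0 - beta) ** 2 * Sigma                              # Eq. (4.46)
    logp = multivariate_normal.logpdf(
        a_t, mean=X_t @ v_t,
        cov=np.linalg.inv(Lambda_a) + X_t @ Lcov_t @ X_t.T)           # Eq. (4.49)
    return logp, v_t, Lcov_t
```

The returned $(v_t, L_t^{-1})$ are exactly the quantities that must be stored for the next step of the recursion.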


4.3 Off-Policy Deep Coherent Exploration

As previously discussed in Section 2.2.2, for continuous control, off-policy policy optimization methods usually learn the optimal action-value functions $Q^*$ first, and then optimize the policies to approximate the optimal actions $a^*$, as in Equation (2.35).

From this perspective, the quality of exploration is directly reflected in the approximated optimal action-value function $Q_\phi$. With effective exploration, information about state-action pairs with high returns is discovered, leading to a $Q_\phi$ with high values. However, in the policy optimization component, since the policy is no longer required to explore, the connection between the two disappears. Notably, the magnitude of the injected noise in parameter space can no longer be learned, but it can be adapted using heuristic measures, as in Plappert et al., 2018.

Denoting the magnitude of the injected noise by $\sigma$, we consider the same distance measure proposed by Plappert et al., 2018 for Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016), defined as:

$$d(\pi, \tilde{\pi}) := \sqrt{\frac{1}{N|B|} \sum_{i=1}^{N} \sum_{s \in B} \Big[\big(\pi(s)_i - \tilde{\pi}(s)_i\big)^2\Big]}, \tag{4.50}$$

where $\tilde{\pi}$ is the perturbed policy, $N$ denotes the dimension of the action space, and $B$ is a batch sampled from the replay buffer $D$. This is a straightforward distance that measures the average difference between a policy $\pi$ and its perturbed version $\tilde{\pi}$ over action dimensions and the batch. In other words, $d(\pi, \tilde{\pi})$ measures how much of the injected noise in parameter space is realized in action space on average. Note that $d(\pi, \tilde{\pi})$ as given in Equation (4.50) is defined for deterministic policies, so for SAC, the mean action is used for $\pi(s)_i$ instead.

Then, the magnitude of the noise σ is adapted simply as in Plappert et al., 2018:

$$\sigma_{k+1} = \begin{cases} \alpha\sigma_k, & \text{if } d(\pi, \tilde{\pi}) < \delta \\ \frac{1}{\alpha}\sigma_k, & \text{otherwise,} \end{cases} \tag{4.51}$$

where $\delta \in \mathbb{R}^+$ is a threshold value, $\alpha \in \mathbb{R}^+$ is a scaling factor, and $k$ denotes the iteration number.
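A minimal sketch of this adaptation scheme is given below. The policies are assumed to be callables returning batches of (mean) actions, and the default scaling factor $\alpha = 1.01$ is our assumption rather than a value fixed here:

```python
import numpy as np

def action_distance(pi, pi_perturbed, states):
    """Distance of Equation (4.50): root-mean-square difference between the
    (mean) actions of the unperturbed and perturbed policies over a batch."""
    diff = pi(states) - pi_perturbed(states)   # shape: (|B|, N)
    return np.sqrt(np.mean(diff ** 2))

def adapt_sigma(sigma, distance, delta, alpha=1.01):
    """Update rule of Equation (4.51): grow sigma while the realized action-space
    perturbation stays below the threshold delta, shrink it otherwise."""
    return sigma * alpha if distance < delta else sigma / alpha
```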

4.4 Coherent Deep Reinforcement Learning

In this section, we provide a brief introduction and pseudo-code for adapting Deep Coherent Exploration to A2C (Mnih et al., 2016), PPO (Schulman et al., 2017), and SAC. We call these variants Coherent-A2C, Coherent-PPO, and Coherent-SAC, respectively.

4.4.1 Coherent Advantage Actor-Critic

Coherent-A2C is straightforward to implement, by replacing the gradient estimates $\hat{\nabla}_\theta J(\theta)$ with $\hat{\nabla}_{\mu,\Lambda,\theta} J(\mu, \Lambda, \theta)$. The pseudo-code of single-worker Coherent-A2C is shown in Algorithm 4.

Algorithm 4: Coherent-A2C

Input: initial policy parameters $\mu_0, \Lambda_0, \theta_0$, initial value function parameters $\phi_0$.

1: for $k = 0, 1, 2, \ldots, K$ do
2:     Create a buffer $D_k$ for collecting a trajectory $\tau_k$ with $T$ steps.
3:     for $t = 0, \ldots, T$ do
4:         if $t = 0$ then
5:             Sample last-layer parameters of the policy network $w_t \sim p_0(w_t \mid \mu_k, \Lambda_k)$ and store $w_t$.
6:         else
7:             Sample last-layer parameters of the policy network $w_t \sim p(w_t \mid w_{t-1}, \mu_k, \Lambda_k)$ and store $w_t$.
8:         Observe state $s_t$ and select action $a_t \sim \pi_{w_t,\theta_k}(a_t \mid s_t)$.
9:         Execute $a_t$ in the environment.
10:        Observe next state $s_{t+1}$, reward $r_t$, done signal $d$, and store $(s_t, a_t, r_t, s_{t+1}, d)$ in buffer $D_k$.
11:        If $s_{t+1}$ is terminal, reset the environment state.
12:        Infer the posterior $\alpha(w_t)$ using the previous state $s_{t-1}$, previous action $a_{t-1}$, as well as the mean $v_{t-1}$ and covariance $L_{t-1}^{-1}$ of the previous posterior $\alpha(w_{t-1})$.
13:        Store the mean $v_t$ and covariance $L_t^{-1}$ of the current posterior $\alpha(w_t)$.
14:        Compute the marginal action probability $p(a_t \mid s_{[0:t]}, a_{[0:t-1]}, \mu_k, \Lambda_k, \theta_k)$.
15:    Compute rewards-to-go $R_t$ and any kind of advantage estimates $\hat{A}_t$ based on the current value function $V_{\phi_k}$ for all steps $t$.
16:    Estimate the gradient of the policy:
$$\hat{\nabla}_{\mu,\Lambda,\theta} J(\mu, \Lambda, \theta) = \sum_{t=0}^{T-1} \nabla_{\mu,\Lambda,\theta} \log p(a_t \mid s_{[0:t]}, a_{[0:t-1]}, \mu_k, \Lambda_k, \theta_k)\, \hat{A}_t,$$
and update the policy by performing a gradient step:
$$\mu_{k+1} \leftarrow \mu_k + \alpha_\mu \hat{\nabla}_\mu J(\mu, \Lambda, \theta), \quad \Lambda_{k+1} \leftarrow \Lambda_k + \alpha_\Lambda \hat{\nabla}_\Lambda J(\mu, \Lambda, \theta), \quad \theta_{k+1} \leftarrow \theta_k + \alpha_\theta \hat{\nabla}_\theta J(\mu, \Lambda, \theta).$$
17:    Learn the value function by minimizing the regression mean-squared error:
$$L(\phi) = \frac{1}{T} \sum_{t=0}^{T} \big(V_{\phi_k}(s_t) - R_t\big)^2,$$
and update the value function by performing a gradient step:
$$\phi_{k+1} \leftarrow \phi_k + \alpha_\phi \hat{\nabla}_\phi L(\phi).$$

4.4.2 Coherent Proximal Policy Optimization

To implement Coherent-PPO, the original objective $L^{CLIP}_{\theta_k}(\theta)$ is substituted with $L^{CLIP}_{\mu_k,\Lambda_k,\theta_k}(\mu, \Lambda, \theta)$, given by:
$$L^{CLIP}_{\mu_k,\Lambda_k,\theta_k}(\mu, \Lambda, \theta) = \mathbb{E}_{\tau \sim p(\tau \mid \mu_k, \Lambda_k, \theta_k)} \sum_{t=0}^{T-1} \Big[ \min\Big( r_t(\mu, \Lambda, \theta),\, \mathrm{clip}\big(r_t(\mu, \Lambda, \theta), 1-\epsilon, 1+\epsilon\big) \Big)\, A^{\pi_{\mu,\Lambda,\theta}}_t \Big], \tag{4.52--4.53}$$
where $r_t(\mu, \Lambda, \theta) = \frac{p(a_t \mid s_{[0:t]}, a_{[0:t-1]}, \mu, \Lambda, \theta)}{p(a_t \mid s_{[0:t]}, a_{[0:t-1]}, \mu_k, \Lambda_k, \theta_k)}$.

Here, after each policy update step, $p(a_t \mid s_{[0:t]}, a_{[0:t-1]}, \mu, \Lambda, \theta)$ from the new policy should be evaluated on the collected trajectory $\tau_k$, both for the next update and for the approximated KL divergence. However, this quantity cannot be calculated directly, but rather by sampling $w_t$ and then integrating $w_t$ out, as given in Equation (4.18). Since $w_t$ is integrated out in the end, it does not matter which specific $w_t$ is sampled. So one could either sample a new set of $w_t$, or use a fixed $w$ along the fixed trajectory $\tau_k$. The second option is often faster because sampling is avoided. The pseudo-code of single-worker Coherent-PPO is shown in Algorithm 5.
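For illustration, the resulting objective can be sketched as below, where new_logp and old_logp denote the marginal log-probabilities $\log p(a_t \mid s_{[0:t]}, a_{[0:t-1]}, \cdot)$ under the new and old parameters. In practice these would be autodiff tensors rather than NumPy arrays, and $\epsilon = 0.2$ is only a common default, not a value fixed here:

```python
import numpy as np

def coherent_ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """PPO-Clip loss built from marginal log-probabilities: no concrete sample of
    w_t is needed when re-evaluating the new policy on the fixed trajectory."""
    ratio = np.exp(new_logp - old_logp)             # r_t(mu, Lambda, theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Negated so that minimizing this loss maximizes the clipped objective.
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))
```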

4.4.3 Coherent Soft Actor-Critic

For Coherent-SAC, only two changes are needed. Firstly, we sample the last-layer parameters of the policy network $w_t$ at each step $t$ for exploration. Secondly, we adapt the single noise magnitude $\sigma$ using our distance measure $d(\pi_{\mu,\theta}, \tilde{\pi}_{\mu,\theta})$ after each epoch. The pseudo-code of single-worker Coherent-SAC is shown in Algorithm 6.
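The per-step sampling can be sketched as follows, assuming the off-policy variant uses an isotropic covariance $\sigma^2 I$ in place of $\Lambda^{-1}$, consistent with adapting a single scalar $\sigma$; the function name is ours:

```python
import numpy as np

def sample_coherent_weights(w_prev, mu, sigma, beta, rng):
    """Sample w_t ~ N((1-beta) w_{t-1} + beta mu, (2 beta - beta^2) sigma^2 I),
    i.e. the prior of Equation (4.43) with an isotropic covariance sigma^2 I.
    The base case t = 0, w_0 ~ N(mu, sigma^2 I), is recovered with beta = 1."""
    mean = (1.0 - beta) * w_prev + beta * mu
    std = np.sqrt(2.0 * beta - beta ** 2) * sigma
    return mean + std * rng.normal(size=w_prev.shape)

# Usage: draw a perturbed last layer at every environment step.
rng = np.random.default_rng(0)
w = sample_coherent_weights(np.zeros(8), np.zeros(8), sigma=0.1, beta=1.0, rng=rng)
```

The sampled $w_t$ is then loaded into the last layer of the policy network before selecting the action, while the unperturbed $\mu$ is kept for the policy update and for the distance measure.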

4.5 Limitations

There are two main limitations of Deep Coherent Exploration. Firstly, since Deep Coherent Exploration performs exact inference of $w_t$ at every step $t$, the computational overhead is much higher than that of conventional methods. Secondly, to leverage the properties of multivariate Gaussians for exact inference, we use big matrices obtained by stacking the original ones, which occupies more memory. However, since we assume a diagonal covariance matrix for the Gaussian policy, the matrices being manipulated in both steps are highly sparse. By taking advantage of this sparsity in a more efficient implementation, we believe both of the limitations above can be mitigated.
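As a hedged illustration of one possible way to exploit this sparsity (our own observation, not part of the method description above): when both $\Lambda_a$ and $\Lambda$ are diagonal, the posterior precision and covariance stay block-diagonal under the recursion, with one $(q+1) \times (q+1)$ block per action dimension, so the marginal covariance in Equation (4.49) is diagonal and can be computed without ever materializing $X_t$:

```python
import numpy as np

def marginal_cov_blockwise(x, cov_blocks, Lambda_a_inv_diag):
    """Diagonal of Lambda_a^{-1} + X_t L_t^{-1} X_t^T when L_t^{-1} is stored as p
    blocks of size (q+1) x (q+1), one per action dimension; this block structure
    is preserved by the recursion when Lambda and Lambda_a are diagonal."""
    x_tilde = np.append(x, 1.0)                          # [x_t; 1]
    var = np.array([x_tilde @ B @ x_tilde for B in cov_blocks])
    return Lambda_a_inv_diag + var                       # p marginal variances
```

Under this assumption, per-block updates would reduce the cost of each recursion step from roughly $O((pq+p)^3)$ to $O(p\,q^3)$ and the memory from $O((pq+p)^2)$ to $O(p\,q^2)$.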

Algorithm 5: Coherent-PPO

Input: initial policy parameters $\mu_0, \Lambda_0, \theta_0$, initial value function parameters $\phi_0$.

1: for $k = 0, 1, 2, \ldots, K$ do
2:     Create a buffer $D_k$ for collecting a trajectory $\tau_k$ with $T$ steps.
3:     for $t = 0, \ldots, T$ do
4:         if $t = 0$ then
5:             Sample last-layer parameters of the policy network $w_t \sim p_0(w_t \mid \mu_k, \Lambda_k)$ and store $w_t$.
6:         else
7:             Sample last-layer parameters of the policy network $w_t \sim p(w_t \mid w_{t-1}, \mu_k, \Lambda_k)$ and store $w_t$.
8:         Observe state $s_t$ and select action $a_t \sim \pi_{w_t,\theta_k}(a_t \mid s_t)$.
9:         Execute $a_t$ in the environment.
10:        Observe next state $s_{t+1}$, reward $r_t$, done signal $d$, and store $(s_t, a_t, r_t, s_{t+1}, d)$ in buffer $D_k$.
11:        If $s_{t+1}$ is terminal, reset the environment state.
12:        Infer the posterior $\alpha(w_t)$ using the previous state $s_{t-1}$, previous action $a_{t-1}$, as well as the mean $v_{t-1}$ and covariance $L_{t-1}^{-1}$ of the previous posterior $\alpha(w_{t-1})$.
13:        Store the mean $v_t$ and covariance $L_t^{-1}$ of the current posterior $\alpha(w_t)$.
14:        Compute the marginal action probability $p(a_t \mid s_{[0:t]}, a_{[0:t-1]}, \mu_k, \Lambda_k, \theta_k)$.
15:    Compute rewards-to-go $R_t$ and any kind of advantage estimates $\hat{A}_t$ based on the current value function $V_{\phi_k}$ for all steps $t$.
16:    Learn the policy by maximizing the PPO-Clip objective:
$$L^{CLIP}_{\mu_k,\Lambda_k,\theta_k}(\mu, \Lambda, \theta) = \sum_{t=0}^{T-1} \Big[ \min\Big( r_t(\mu, \Lambda, \theta),\, \mathrm{clip}\big(r_t(\mu, \Lambda, \theta), 1-\epsilon, 1+\epsilon\big) \Big)\, \hat{A}_t \Big],$$
17:    and update the policy by performing multiple gradient steps until the constraint on the approximated KL divergence is satisfied:
$$\mu_{k+1} \leftarrow \mu_k + \alpha_\mu \hat{\nabla}_\mu L^{CLIP}_{\mu_k,\Lambda_k,\theta_k}(\mu, \Lambda, \theta), \quad \Lambda_{k+1} \leftarrow \Lambda_k + \alpha_\Lambda \hat{\nabla}_\Lambda L^{CLIP}_{\mu_k,\Lambda_k,\theta_k}(\mu, \Lambda, \theta), \quad \theta_{k+1} \leftarrow \theta_k + \alpha_\theta \hat{\nabla}_\theta L^{CLIP}_{\mu_k,\Lambda_k,\theta_k}(\mu, \Lambda, \theta).$$
18:    Learn the value function by minimizing the regression mean-squared error:
$$L(\phi) = \frac{1}{T} \sum_{t=0}^{T} \big(V_{\phi_k}(s_t) - R_t\big)^2,$$
and update the value function by performing a gradient step:
$$\phi_{k+1} \leftarrow \phi_k + \alpha_\phi \hat{\nabla}_\phi L(\phi).$$
