MSc ARTIFICIAL INTELLIGENCE
MASTER THESIS

Robust Planning with Imperfect Models

by

MAXIM ROSTOV
11808470

September 16, 2020
36 EC, January 2020 - September 2020

Supervisor: Dr Michael Kaisers, Centrum Wiskunde & Informatica
Assessor: Dr Herke van Hoof

Abstract

This work aims to introduce a scalable robust planning method in order to cope with imprecise models. Environment models are not always known a priori, and approximating their transition dynamics can introduce errors, especially if only a small amount of data is collected and/or model misspecification is a concern. The objective of robust planning is to find a policy with the best guaranteed performance, which we approach by employing a two-stage minimization-maximization optimization procedure taken from the field of robust control. We assume a Markov Decision Process underlying the environment and the planning model thereof, while aiming for the best worst-case performance under a fixed error bound. We define a methodology that allows us to integrate planning and robust optimization techniques while also analysing the robust policy's properties. Next, we introduce a family of locally robust decision-time planning algorithms, specifically robust Monte Carlo Tree Search. These algorithms perform online heuristic search with respect to the errors assumed in the model, while the control policy is calibrated to perform safely under unfavourable scenarios. In this work, we extend the existing methods in several directions: 1) we consider a planning agent that employs an imperfect environment model to adjust the control policy to behave safely and account for uncertainty in the transition dynamics; 2) we propose two variants of Monte Carlo Tree Search that locally find the robust policy function; 3) we present examples that demonstrate the properties of the robust transition model versus the non-robust model. Our research provides insights into the strengths and challenges of robust planning, which is crucial for transitioning towards using AI in safety-critical real-life systems.


Acknowledgements

I would like to express my sincere gratitude to Dr. Michael Kaisers, my thesis supervisor, for his patient guidance and enormous support during this thesis project. I feel fortunate to have received his perspective and insight on a wide variety of topics. I also wish to thank my teammates at Dexter Energy B.V. who helped and inspired me to pursue this interesting research direction.


Contents

1 Introduction

2 Related work
  2.1 Robust Optimization and Robust Control
  2.2 Robust Reinforcement Learning and Planning
  2.3 Bayesian Approach vs Robustness

3 Background
  3.1 Markov Decision Processes as the model of sequential decision-making tasks
  3.2 Reinforcement learning and planning with models
    3.2.1 Background planning: Solving all states with Value Iteration
    3.2.2 Decision-time planning with Monte Carlo Tree Search
    3.2.3 Analysis of Monte Carlo Tree Search
  3.3 Planning with learnt models via robust optimization
  3.4 Model spaces resulting from Bayesian Model Learning
  3.5 Model spaces through distance metrics

4 Methodology
  4.1 Robust Planning Solution
  4.2 Choosing a Model Space and finding the worst case model
    4.2.1 Direct projection: Solving a Linear Program
    4.2.2 Indirect projection: Dichotomy method
    4.2.3 Critical Assumptions on Robustness
  4.3 Robust Background Planning
    4.3.1 Direct Projection via Wasserstein distance
    4.3.2 Robust Value Iteration with Wasserstein distance
    4.4.1 Robust MCTS-Batched
    4.4.2 Robust MCTS-Iterative

5 Experiments
  5.1 Illustrative example: StateGame
  5.2 Safety RL environment: Lava World
  5.3 Classical RL environment: Frozen Lake

6 Discussion and Conclusions

7 References

A Appendix
  A.1 Value function: Lava World environment
  A.2 Convergence of Robust Methods: Lava World
  A.3 Convergence of Robust Methods: Frozen Lake

1 Introduction

Decision-making tasks and challenges present in today's engineering and science are often addressed by mathematical optimization and computer simulations. For example, estimating a helicopter's flight locomotion [Schafroth, 2010], a geographical region's energy production or consumption needs [Prabavathi and Gnanadass, 2018] or transportation schedules [Kieu et al., 2019] involves working with physical or societal systems that are inherently complex and hard to program. Without perfect knowledge about all the components and rules describing these (often random) systems, it is exceedingly difficult to capture their dynamics and create a simulator that is able to account for all relevant real-world factors.

In the presence of an imperfect simulator, a practitioner may consider using optimization techniques that are aware of possible errors and inaccuracies while still guaranteeing some level of performance. Historically, such methods were extensively studied in control theory and were often referred to as robust control. Robust control methods seek to put a bound on the uncertainty or error and, given this bound, deliver results that meet the (control) system requirements. In other words, robust techniques adjust their optimization algorithm so that it is able to perform optimally under the worst-case scenario. Stable and adequate worst-case performance is necessary for many real-world systems where extreme negative events are unacceptable, e.g., energy grids, railway signalling, and air traffic control systems [Prabavathi and Gnanadass, 2018, Wang and Goverde, 2016, Kieu et al., 2019]. Moreover, many such safety-critical systems are influenced by stochastic factors, e.g., energy prices or weather forecasts [Alagoz et al., 2018, Mohan et al., 2015], and require special attention. Our research is inspired by the need for robust and efficient control in such systems.

In this work, we choose to utilize an imperfect simulator. Thus, a control task is solved with planning techniques as opposed to (reinforcement) learning methods. The difference between the two approaches lies in the fact that while learning does not require a description of the simulated system (i.e., an environment model), planning uses the model to derive the optimal control policy [Moerland et al., 2020]. It is not uncommon for traditional planning algorithms to focus on the evaluation with a single (estimated) model and avoid tackling the problem of robust performance in a real-life setting [Dulac-Arnold et al., 2019]. This may lead to unstable or sub-optimal behaviour if the model is imprecise, which is often the case when complex real-world systems are digitized [Robinson, 1999, Dulac-Arnold et al., 2019]. We thus concentrate on improving sequential decision-making where only an approximate (sample) model of the environment is given. This model is assumed to be learnt from data and/or known to have errors in its transition dynamics; hence the parameters of this sample model represent an erroneous or inaccurate representation of the real world, i.e., an imperfect model. We will explicitly account for the variance coming from the parameter estimates by integrating robustness into planning. This means maximizing the performance of the control policy function for the worst-case transition model within a predefined or learnt set of models.
To put it into perspective: a non-robust planning technique finds a policy that is optimal under the sample model; a robust method aims at a policy that is optimal under the environment where the current estimate of the (non-robust) policy is measured to perform the worst. In order to derive robust policies, we must address two optimization stages: the maximization stage, i.e., employing a planning method to find the optimal policy given a model, and the minimization stage, i.e., deriving the transition model that reduces the agent's performance the most. The result is a policy that yields a certain guaranteed minimal performance. In our case, this guaranteed performance is measured by the expected return under the worst-case transition model (see Section 4.1). The presence of such guarantees is useful, for example, when we want to make sure that we are still able to perform well if our estimated sample model carries a positive bias on the control policy, e.g., probabilities for the transitions leading to high-value states are overestimated. We note that a similar effect could be achieved with a Bayesian methodology, where a modeler assumes prior knowledge about the distribution over the model errors and incorporates it into the planning methodology. However, we provide two arguments why our robustness approach might be more applicable in some cases (Section 2.3).


The value of robustness lies in deriving a robust policy that averts highly unfavourable, or worst-case, transitions (see Section 5.1). We note that simply considering the worst-case transition on the whole space of the sample model's parameters might result in a non-performing or ill-performing system. For example, if a possible (model) error results in a value of wind speed that makes a helicopter crash (e.g., a thunderstorm), then the helicopter will never take off from the ground. Here, the worst-case model is assigned a high probability, while the reality might be such that the worst-case scenario is highly unlikely under the current weather conditions. Another example could be a navigation task where the agent is presented with two paths to reach a goal state. A shorter path is near terrains or obstacles where damage to the agent is likely, while a longer path steers clear of the danger. In expectation, the shorter path might yield better performance; however, the longer path can still be preferable in terms of (safe) operation in a real-life setting. Thus, the set of possible errors and the type of robustness should be carefully chosen when performing (robust) planning. These choices are incorporated via constraints that describe how conservative we want to be or whether the algorithm needs to consider some other properties of the environment, e.g., spatial location in the state space, in order to be robust (see Sections 4.2 & 4.2.3 for details).

An extensive body of literature has tackled problems of robust optimization and control. We base our methods on the results of Nilim and Ghaoui [2005] and Iyengar [2005], which laid down the foundations of robust Dynamic Programming (DP) within the Markov Decision Process framework [Puterman, 1994]. We extend their methodology by incorporating a recent optimization technique from Lecarpentier and Rachelson [2019] to tackle transition model uncertainty. We further generalize the assumptions posed by the previous work, which helps us devise a modular and scalable family of robust decision-time planning algorithms based on Monte Carlo Tree Search [Kocsis and Szepesvári, 2006, Coulom, 2007].

We formulate three research questions that help us find a novel and efficient algorithm to do robust planning:

RQ-1: How can we do robust planning with imperfect models and stochastic state transitions?

RQ-2: How can we efficiently solve the minimization step in robust planning?

RQ-3: What are the critical assumptions one needs to consider when creating model spaces for robust planning?

In order to answer our research questions, we start with an introduction to the related academic literature, Section 2. There, it is recognised that robustness is ubiquitous in many domains and fields of mathematical optimization. Next, we present frameworks and methods that are foundational for our research, Section 3. Concepts such as the Markov Decision Process (MDP), model learning and planning are described, and formal notations are introduced. Then, extensions to robust methods are presented. These include both robust offline and online (tabular) planning methods, Sections 4.3 & 4.4. Finally, experiments illustrating the empirical value of the proposed methods are demonstrated, Section 5. In addition, we conclude with a number of discussion points and directions for future work, Section 6.


2 Related work

This section reviews the ambitions of related work and positions our objectives within the open research problems in the literature. It is worth mentioning that Moerland et al. [2020] provide a well-defined framework to differentiate between planning, model-based and model-free reinforcement learning. We consider our work applicable to both planning and model-based reinforcement learning, as both can work with models (Section 3.2) that are learnt from data. Moreover, both the planning and reinforcement learning communities have attended to robustness extensively during the past few decades.

2.1 Robust Optimization and Robust Control

We introduce the reader to (model-based) Robust Reinforcement Learning (RRL) by drawing historical analogies within two classical optimization fields, control theory and robust control. We thus encourage the reader to view optimization in the form of RRL as a special case of robust control methods.

Many ideas in robust optimization have their roots in the robust control field. Robust control in turn refers to the control of a plant, i.e., an environment model, with unknown dynamics subject to unknown disturbances [Chandrasekharan, 1996]. A typical control loop consists of the input to the system, i.e., some reference signal which represents the desired control value. This reference is fed through a forward function of the controller that determines a decision, which is further passed on to the plant. The state of the model is unknown, which results in a stochastic component in the output from the plant. Robust control methods seek to handle the stochastic component in the control task by creating a set of possible models within which the plant is assumed to operate. Such a set is often called an ambiguity or uncertainty set in the literature. It also represents the environment model's hypothesis space (e.g., for learnt models), hence we assign it a slightly more general name: model space or model space set. Given this bounded set that entails some uncertainty, the policy can then deliver results that meet the control system's requirements in all cases. Thus, robust control guarantees that if the variations in the transition function are within the given bounds, the control law (policy) need not be changed. Hence, robust control theory might be stated as a worst-case analysis method rather than a typical expected-value method. The (robust) control theory objective can be approached by a set of methods mainly developed in the 20th century. These techniques include adaptive control [Chandrasekharan, 1996], H2 and H∞ control [Zhou et al., 1995], Lyapunov functions and fuzzy (logic) control [Qu, 1993, Abramovitch, 1994, Bhattacharyya, 2017].

Sutton et al. [1991] draw a bridge between control theory and reinforcement learning. They subdivide control tasks into two groups: 1) regulation (tracking) problems, where the goal is to adhere to some reference trajectory, and 2) optimal control problems, where the objective is to maximize/minimize a function of the controlled system behaviour (that is not necessarily described in terms of following a reference level or trajectory). Problems and methods in the second group have been viewed as harder due to their formulation as constrained optimization, while the former group has proven to be easier to solve both analytically and computationally [Sutton et al., 1991]. Additionally, optimal control problems are much more common because a detailed and accurate model of the system (required by regulation tasks) is often not available. Thus, if adaptive control methods, i.e., techniques that adapt to a controlled system whose parameters vary or are initially uncertain, could be developed, then both tracking problems and optimal control problems could be solved by those methods alike. The idea behind reinforcement learning is exactly that: RL strives to enable an agent to create an efficient adaptive control system for itself which learns from the available experience. From a mathematical point of view, RL methods can be viewed as a synthesis of dynamic programming and (stochastic) approximation methods.


2.2 Robust Reinforcement Learning and Planning

Given such a close connection, one can expect that, just as control theory progressed towards attending more to robustness [Elizondo-González, 2011], model-free and model-based reinforcement learning applied to complex systems also need to be performed with robust designs. A prominent survey on RL in robotics [Kober et al., 2012] supports this idea, while the field of robust reinforcement learning (RRL) tries to tackle the open issues within the area. In this section, we present a few influential papers that lay the ground for our work in RRL.

Morimoto and Doya [2005] were among the first to introduce a novel approach to handle reinforcement learning tasks where the algorithms inherently adjust for input disturbances and modeling errors. The authors use the differential game framework [Stoorvogel, 1991] and H∞ control theory [Zhou et al., 1995] to formulate a number of control tasks that they solve by working through systems of differential equations. The paper then compares the performance of robust reinforcement learning algorithms with their non-robust counterparts and shows that, on a number of tasks, a robust RL controller can accommodate changes in physical parameters of the model that a standard RL controller cannot handle. Additionally, the authors indicate that while H∞ control theory gives an analytical solution only for linear systems, robust RL can derive a nonlinear H∞ controller by online calculation and without prior knowledge of an environment model (model-free setting).

Nilim and Ghaoui [2005] were among the first to consider robust control in finite-state, finite-action Markov decision processes (MDPs), where robustness is considered against uncertainty in the transition matrix of an MDP. Consequently, the uncertainty is described by a model space that expresses the set of possible transition functions that are allowed. The authors assume that the set is convex (or that it is feasible to find its convex hull) and that it follows the rectangularity property [Epstein and Schneider, 2003] (Section 4.2). Once convexity and rectangularity are established, the authors find that a classical dynamic programming algorithm may be applied to find the solution. Moreover, if the model space represents the desired level of statistical confidence for the model parameters, a policy derived with robust methods serves as a lower bound on the system performance for that confidence interval. The authors show that examples of confidence sets can be derived using maximum (log-)likelihood estimates (MLE/MAP) or entropy bounds (KL-divergence). Wiesemann et al. [2013] provide interesting insight into the conditions of rectangularity and types of model space sets while further investigating the performance of robust policies analytically. Both Nilim and Ghaoui [2005] and Wiesemann et al. [2013] provide the theoretical foundation for our methods, while we further extend planning with the rectangularity assumption by considering different model spaces (e.g., based on the Wasserstein distance) and decision-time planning.

Modern research in RRL tackles questions of safe (online) exploration [García et al., 2015], significant rare events [Ciosek and Whiteson, 2017], model noise [Gu et al., 2019] and parameter initialization [Mnih et al., 2016]. These problems are tangential to ours, and we present an overview of a few recent papers that are related to our research.

Paul et al. [2016] introduce a method that is robust to significant rare events (SREs). SREs are defined to occur under variance in unobserved state features that are randomly determined by the environment, i.e., environment variables. The method applies Bayesian Optimization to learning problems with SREs while still being sample efficient. The presented algorithm, called Alternating Optimisation and Quadrature (ALOQ), uses Bayesian optimisation to address settings with SREs. This approach performs well on a number of robotics tasks where robustness is important; however, it is computationally expensive and does not touch upon decision-time planning. Pinto et al. [2017] propose a methodology of modeling uncertainties in the model/simulator via an adversarial agent that applies disturbance forces to the system. Furthermore, the adversary is also an agent that learns an optimal policy to hinder the control agent's goal. The authors demonstrate that their technique is robust to model initialization and modeling errors, i.e., the learned policy generalizes better to various test environments. A similar problem is described in the Wasserstein Robust Reinforcement Learning algorithm [Abdullah et al., 2019]. The authors propose a framework for robust reinforcement learning to tackle misspecification in the simulator's parameters that often arises due to changes in transition dynamics. Wasserstein Robust Reinforcement Learning (WR2L) aims to find the best policy, where any given policy is judged by the worst-case dynamics amongst all candidate dynamics in a Wasserstein constraint set. The authors also propose an efficient and scalable solver following a novel zero-order optimisation method for low- and high-dimensional control tasks. Abdullah et al. [2019] also recognise a useful property of the Wasserstein distance, namely its topological awareness in the state space. In our work, we extensively use the Wasserstein distance while identifying in more detail when and where this tool should be applied.

A recent paper from Lecarpentier and Rachelson [2019] introduces a method for robust planning in non-stationary stochastic environments. Just as in our case, the problem is posed as robust planning under uncertain model parameters. The authors assume that the environment model can evolve throughout an optimization epoch. Thus, the proposed algorithm aims to learn the best performance under the worst-case model parameters possible. The worst-case model parameters are derived by assuming knowledge of the MDP's start-of-the-epoch model parameters (the sample model) and a Lipschitz-continuous evolution of these parameters within an epoch. At each time step, the algorithm performs tree search to find what the worst possible model evolution could be (within the allowed set) and how the current policy performs under it. The state value function is then adjusted to represent the highest value that could be achieved by the greedy policy under the worst evolved model. Our work uses the Wasserstein optimization method proposed by Lecarpentier and Rachelson [2019] and adopts it as a convenient method to find model spaces (Section 4.3.2).

García and Peña [2018] mention that there is important room for quantitative studies to determine what kind of robustness and which model spaces should be used to solve a problem. Identifying how different model space sets act for a given problem is therefore a fundamental step towards getting a better grip on robust optimization techniques. The authors pose a number of questions that directly relate to our RQ-3, specifically: How is the quality of the solutions perturbed by the choice of the model space? How is the convergence of the algorithm altered for different sets of uncertainty? García and Peña [2018] note that answering these largely open questions would help to develop a methodology that allows identifying which type of robustness is required by a problem, which type of model space should be chosen, and how the algorithm behaves in terms of the quality of its results and its convergence.

A parallel concept to robust reinforcement learning is safe reinforcement learning. This field arose from RRL and is defined as the process of learning policies that maximize the expectation of the return while guaranteeing reasonable system performance and/or respecting safety constraints during learning and/or deployment [García et al., 2015]. Thus, safe reinforcement learning can be viewed as a specific case of robustness towards risks within the learning and/or deployment environment. The safety concept is the opposite of the risk or uncertainty concept present in RRL. Hence, many problems tackled by robust reinforcement learning/control can be framed in terms of safety. For example, safe operation of physical systems is also a robust control task [Zhou et al., 1995], and maximizing the expected exponential utility (safety-adjusted optimization) is equivalent to a robust MDP maximizing the worst-case criterion [Osogami, 2012]. Because safe RL and RRL are so related, we are able to use some of the environments from the Safety Gridworlds suite [Leike et al., 2017] to test our robust planning methodology.

2.3 Bayesian Approach vs Robustness

Methods described in the previous sections are devised in a way that accounts for one or several types of uncertainty. Two general uncertainty categories exist: epistemic and aleatoric [Kendall and Gal, 2017]. Epistemic uncertainty captures our ignorance about the models most suitable to explain the data at hand. This type of uncertainty is reducible and decreases as more data is acquired. Aleatoric uncertainty captures noise inherent to the environment. This noise cannot be explained away even if more data were available. In our setting, we consider parametric (epistemic) uncertainty. It is a type of uncertainty that comes from the model parameters whose exact values are unknown and cannot be controlled in physical experiments. Often, these values cannot be exactly inferred by statistical methods and, therefore, optimization/simulation with such models can be given a special Bayesian or robust treatment.

A common method of tackling parametric uncertainty is assuming a prior distribution over the model parameter values. Given that this distribution can be integrated into the optimisation as an initial state distribution, and that a conditional distribution of observations given the model (the likelihood) can be derived, Bayesian reasoning can be applied. Bayesian approaches are thus inherently aware of the uncertainty, as they do not condition on a single model but rather seek to accommodate all aspects of uncertainty within a joint probability model, often called a posterior or belief model. Updating the belief model using the prior and likelihood may result in improved robustness, for instance when prior knowledge on the distribution of the model errors (differences between the true and estimated models) is available. In that case, we can be more conservative for the parameters/models where we have seen a higher modeling error.

There are two main differences between robust and Bayesian approaches. First, Bayesian methods require a prior distribution to be defined over the transition parameter space. This means that if we do not have initial information about the transitions that are considered non-robust, we are unable to apply a meaningful prior that would drive the policy in the direction of stable/guaranteed rewards. Second, Bayesian approaches derive a policy that is optimal under the belief model and not the worst-case model. Thus, unlike robust techniques, there are no strict guarantees on the lowest or highest possible algorithm performance. There are, however, different assumptions and restrictions that robust algorithms have to deal with. We discuss them in Section 4.


3 Background

This section describes a number of core concepts that are used in this work. We start by introducing a common decision-making framework, the Markov Decision Process (MDP). Next, two classical types of solution methods for planning problems in MDPs are presented. We also introduce the problem of robust optimization as a (two-stage) maximisation-minimisation task. Lastly, we provide two general ways of deriving model spaces for the minimisation stage.

3.1 Markov Decision Processes as the model of sequential decision-making tasks

We start with the Markov Decision Process (MDP) framework [Puterman, 1994] often used to define reinforcement learning tasks. Let the MDP model be M = ⟨S, A, P, R, γ⟩, where S ⊆ R^d denotes the state space, A ⊆ R^n the action space, P : S × A × S → [0, 1] is a state transition probability function that defines the system's dynamics, R : S × A → R is the reward function that assesses the agent's performance, and γ ∈ [0, 1) is the discount rate that is applied to the rewards over time. Next, we assume that the dynamics model is defined by the parameters φ, and we modify the notation to M_φ for the MDP model and P_φ for the transition function. Additionally, the agent's policy acting in the environment is a function π defined by the parameter vector θ. We define ρ_{φ,θ}(τ) as the trajectory density function that depends on the current policy π_θ, the transition function P_φ and the stationary initial state distribution µ:

ρ_{φ,θ}(τ) = µ_0(s_0) π_θ(a_0|s_0) ∏_{t=1}^{T−1} P_φ(s_{t+1}|s_t, a_t) π_θ(a_t|s_t),    (3.1)

where τ is the trajectory produced upon consequent interactions with the environment and µ_0 denotes the initial state distribution. The focus is on episodic tasks where an episode consists of T (discrete) time steps; however, the notation can be extended by letting T tend to infinity. The agent aims to optimize the total reward, the return G, with respect to its policy π_θ.

G(τ) = ∑_{t=0}^{T−1} γ^t R(s_t, a_t),    (3.2)

θ* = argmax_θ E_{τ∼ρ_{φ,θ}}[G(τ)].    (3.3)

When solving MDP tasks, it is often convenient to consider the state and state-action values. We define the expected value of state s conditioned on the current policy π_θ (we sometimes simplify the notation by referring to π_θ as π) and the transition dynamics parameters φ as

V^π(s) = E_{τ∼ρ_{φ,θ}}[G(τ) | s_0 = s]   ∀s ∈ S,    (3.4)

while the value of the state-action pair (s, a) is

Q^π(s, a) = E_{τ∼ρ_{φ,θ}}[G(τ) | s_0 = s, a_0 = a]   ∀s ∈ S, ∀a ∈ A.    (3.5)

Given the previous notations, we can define the state value in terms of the recursive Bellman equation [Bellman, 2010]:

V^π(s) = R(s) + γ ∑_{s'∈S} P_φ(s'|s, π(s)) V^π(s').    (3.6)

Hence, the agent's task of finding the policy that maximizes the total reward can be seen as finding V*(s) = sup_π V^π(s) for all states s ∈ S. The function V* is called the optimal value function. By (iteratively) applying the Bellman optimality operator, we converge to the optimal value of each state (Bellman optimality equation):

V*(s) = max_{a∈A} [ R(s, a) + γ ∑_{s'∈S} P_φ(s'|s, a) V*(s') ].    (3.7)

3.2 Reinforcement learning and planning with models

Planning in the context of reinforcement learning can primarily be viewed as a method that uses a (learnt) model of the environment to search through the state space for an optimal path to a goal. Such an approach to planning is called state-space planning. Different state-space planning techniques exist; however, most of them build on the principle of simulating experience from the environment model to evaluate the value of a state(-action pair) and update the policy based on the outcome [Sutton and Barto, 1998].

An inherent component of planning is the availability of an environment model or an estimate thereof. Sutton and Barto [1998] distinguish three types of environment models used for planning: a distributional model that has a complete description of the transition probabilities and rewards; a generative model that can be used to produce samples of rewards and next states from any state and action belonging to the state and action spaces; or a trajectory model that can only simulate an (independent) episode/epoch. The latter model is the most restrictive one, as it does not allow generating experience for a state-action pair without finishing an episode. The former model is the most expressive and is used in Dynamic Programming (DP). For example, the distributional model is used in Value Iteration (VI) [Bellman, 2010], an algorithm used to solve DP tasks.
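To make this distinction concrete, the sketch below contrasts the three model interfaces in Python. The class names and method signatures are illustrative assumptions for this thesis's tabular setting, not an API defined in the text.

# Illustrative interfaces for the three model types (names/signatures are assumptions).
from typing import Protocol, Sequence, Tuple

class DistributionalModel(Protocol):
    def transition_probs(self, s: int, a: int) -> Sequence[float]: ...   # full P(.|s, a)
    def reward(self, s: int, a: int) -> float: ...

class GenerativeModel(Protocol):
    def sample(self, s: int, a: int) -> Tuple[int, float]: ...           # one (s', r) sample for any (s, a)

class TrajectoryModel(Protocol):
    def run_episode(self, policy) -> Sequence[Tuple[int, int, float]]: ...  # only whole (s, a, r) episodes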

3.2.1 Background planning: Solving all states with Value Iteration

We start our methods section by presenting a simple algorithm that makes the Value Iteration [Bellman, 2010] procedure robust. Value Iteration is an iterative model-based method that calculates the expected utility of each state using the values of the neighboring states. More formally, the algorithm performs a contraction mapping on the value function space until the values converge to a stable (fixed) point. The recursive update formula is as follows:

V(s) ← max_{a∈A} [ R(s, a) + γ ∑_{s'∈S} P_φ(s'|s, a) V(s') ]   ∀s ∈ S.    (3.8)

VI is considered a background method [Sutton and Barto, 1998], meaning that it finds the optimal policy/value function for all states and does not focus on a single (root) state. The method has the advantage of avoiding building an explicit search tree and can be used efficiently with small state spaces. However, it suffers from an obvious pitfall: VI performs an update on every state in every iteration; thus it can be computationally intensive and hardly scalable to environments with huge state/action spaces. Addressing this issue, the asynchronous Value Iteration technique [Bellman, 2010, Department et al., 1994] was introduced, where at each iteration only a subset of state values is updated. Nonetheless, asynchronous VI leaves a number of open questions, such as how to choose the subspace on which to perform the updates, how large the subspace should be, and in what order to perform the updates. In order to tackle these issues, online planning with heuristic functions has been introduced; it is discussed in the next section.
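As a concrete illustration of the update in Eq. 3.8, the sketch below runs synchronous Value Iteration on a small tabular MDP. The array encodings of P and R, the tolerance, and the toy 2-state example are assumptions made for illustration, not the thesis's implementation.

# A minimal synchronous Value Iteration sketch (Eq. 3.8) on a tabular MDP.
# P[s, a, s'] and R[s, a] are assumed array encodings of the distributional model.
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8, max_iters=10_000):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(max_iters):
        # Bellman optimality backup for every state at once
        Q = R + gamma * np.einsum("sat,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:   # stop near the fixed point
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)                 # greedy policy w.r.t. the final values
    return V, policy

# Example: a hypothetical 2-state, 2-action MDP
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [0.5, 0.0]])
V, pi = value_iteration(P, R)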


3.2.2 Decision-time planning with Monte Carlo Tree Search

An alternative to background planning is decision-time planning. Unlike VI, decision-time techniques focus on finding the optimal action for a single state or a small set of states whilst not changing the value estimates of the whole state space. Frequently, this implies growing a tree of possible path continuations for the root state, i.e., the state where the agent resides at the current time step. Collectively, these (decision-time planning) tree-growing techniques are known as heuristic search [Sutton and Barto, 1998], as they use a heuristic evaluation function to decide on the direction of the search. This helps to determine the order of updates and the state subspace on which to perform them.

An example of a decision-time planning method is Monte Carlo Tree Search (MCTS). MCTS is a recent planning technique that has been successfully applied to many reinforcement learning tasks [Coulom, 2007, Browne et al., 2012]. MCTS is a Monte Carlo rollout algorithm that is adjusted to store value estimates obtained from past simulations (rollouts) and guide later simulations towards more promising directions via a heuristic evaluation metric. MCTS expands the parts of an already built decision tree that have resulted in high evaluations from earlier rollouts. Algorithm 1 shows the procedure of running MCTS for a single root state with the UCB tree policy [Auer et al., 2002]. On each time step, MCTS receives a root state and an MDP model. Usually, the root state is a vector representing the agent's position in the state space or a state index; the MDP model is a generative or distributional model (our work) of the environment that is used to produce samples of the rewards and next states given a state and action. At the first stage of the MCTS procedure, the variables Q, N and T are initialized. The Q and N tables hold the action values and visitation counts for each state-action pair visited during the search. These tables are further used to calculate the UCB [Auer et al., 2002] heuristic value of each state-action node of the tree. The T structure holds the search tree that is built throughout the rollouts. In our implementation, T is a set of hashes where each hash is a node identifier. We use terminology akin to Keller and Helmert [2013] to identify the different types of nodes encountered in the tree: decision nodes and chance nodes. Decision nodes are tuples n_d = ⟨s, V_k⟩, where s ∈ S is a state and V_k ∈ R is the state-value estimate based on the first k trials (simulations). Chance nodes are tuples n_c = ⟨s, a, Q_k⟩, where s ∈ S is a state, a ∈ A is an action, and Q_k ∈ R is the action-value estimate based on the first k trials. In order to identify the nodes, the variable Path is used. For example, Path = [(s_root, )] would mean that the tree is at its root node n_0; Path = [(s_root, ), (s_root, 0), (s_j)] means that the current (decision) node points at the state s_j and has a parent chance node (s_root, 0), which in turn points to the state s = s_root and action a = 0. The T structure remembers which nodes were visited and expanded and helps to direct the search to unexplored states.

Once the variables have been initialized, the main loop of Monte Carlo Tree Search starts. The main loop consists of n_sims simulations. Each simulation has four phases: in the selection phase, the explicit tree is traversed by alternately choosing successor nodes according to the UCT tree policy [Auer et al., 2002, Browne et al., 2012]. The Path variable keeps track of the current node, while all acquired rewards are collected in the list R. When a previously unvisited decision node (i.e., a leaf node in the explicit graph, see Figure 1) is encountered, the expansion phase starts. A child is added to the tree with an action selected at random, which leads to another leaf node. Then, the simulation stage starts. This step is sometimes also called a playout or rollout. A playout simply chooses uniform random moves until a terminal state is encountered or the simulation horizon depth is reached, i.e., the number of steps in the simulation phase has exceeded some threshold. Throughout the playout, the return G is accumulated. Lastly, in the backpropagation phase, the state-action values of all the nodes visited during selection and expansion are updated by recalculating their average values.
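For reference, the UCT tree policy mentioned here scores each action at a decision node with the standard UCB1 bonus of Kocsis and Szepesvári [2006]; the exploration constant c is a tunable parameter whose value is not fixed by the text above:

a_{\text{UCT}}(s) = \arg\max_{a \in A} \left[ Q(s,a) + c \sqrt{\frac{\ln N(s)}{N(s,a)}} \right]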

Figure 1 presents a convenient visual summary of Algorithm 1. It also facilitates understanding the idea of MCTS search in an MDP as a two-player game where the agent is the first player and nature (the environment simulator) is the second. In this case, assuming a simple tabular setting, we can see decision nodes as the state indices where the agent is to choose an action, while chance nodes are the state-action pairs where it is nature's turn to act. The environment's choices are described by the transition dynamics P, and the agent's behaviour is characterized by its policy.


Algorithm 1: Monte Carlo Tree Search

Algorithm MCTS(s_root, M, γ, n_sims, horizon)
    Input : s_root, root state; M, model; γ, discount factor;
            n_sims, number of simulations; horizon, max depth of a rollout
    Output: Q, explored state-actions' values
    Init  : Q, explored state-actions' values; N, explored state-actions' counts;
            R, list of accrued rewards; Path, list of transition tuples;
            T, global set of explored nodes
    for k ← 1 to n_sims do
        s ← s_root
        R, Path, s ← Select(s_root)
        nodeID ← Hash(Path)
        s', a, r ← Expand(nodeID, s)
        R.append(r); Path.append(s, a, s')
        G ← Rollout(s')
        R.append(G)
        Q, N ← BackUp(Q, N, Path, R)
    end

Procedure Select(s)
    Input : s, current state
    Output: R, list of accrued rewards; Path, list of transition tuples; s, current state
    R ← list(); Path ← list()
    nodeID ← Hash(Path)
    while not (done or IsLeaf(T, nodeID)) do
        a ← TreePolicy(s)
        s', r, done ← Nature(s, a)
        Path.append(s, a, s'); R.append(r)
        nodeID ← Hash(Path)
        s ← s'; t ← t + 1
    end
    return R, Path, s

Procedure Expand(nodeID, s)
    Input : nodeID, node identifier; s, current state
    Output: s', next state; a, chosen action; r, reward
    a ← selectRandomUnexploredAction(T, nodeID)
    s', r, done ← Nature(s, a)
    return s', a, r

Procedure Rollout(s)
    Input : s, current state
    Output: G, accrued return
    G ← 0
    while t < horizon and not isTerminal(s) do
        a ← RandomPolicy(s)
        s', r, terminal ← Nature(s, a)
        G ← G + r; s ← s'; t ← t + 1
    end
    return G

Procedure BackUp(Q, N, Path, R)
    Input : Q, state-action value table; N, state-action counts table;
            Path, list of transition tuples; R, list of accrued rewards
    Output: Q, visited state-action value table; N, visited state-action counts table
    γ̄ ← 1; G ← 0
    while Path is not empty do
        nodeID ← Hash(Path)
        T.append(nodeID)
        G ← G + R.pop(); (s, a, _) ← Path.pop()
        Q[s, a] ← Q[s, a] + (G · γ̄ − Q[s, a]) / N[s, a]
        N[s] ← N[s] + 1; N[s, a] ← N[s, a] + 1
        γ̄ ← γ̄ · γ
    end
    return Q, N


Figure 1: MCTS procedure with stochastic environments (panels: selection, expansion, simulation, backpropagation; legend: decision node, chance node, tree policy, nature's move, back-up step, rollout, rollout end state). The selection phase starts at the root state s_root, and the already explored part of the state space is traversed using a tree policy. When a leaf node is encountered, the expansion phase starts and a new set of nodes is added to the tree. Next, the simulation or playout phase begins, where randomly selected actions are chosen until a terminal state is encountered or the trajectory horizon depth is reached. Lastly, the accumulated return is backpropagated through the visited states and their values are updated.

In non-robust planning, P is independent of the agent's policy, while in robust planning the environment is adversarial to the agent, i.e., it adjusts P to decrease the agent's rewards.
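To complement Algorithm 1, the following is a minimal, self-contained Python sketch of MCTS with a UCT tree policy. It is not the thesis implementation: the generative-model interface step(s, a) -> (next_state, reward, done), the hashable state encoding, and the constants c_uct, horizon and n_sims are illustrative assumptions, and the expansion step is simplified to adding the newly reached state to the tree.

# Minimal MCTS sketch with a UCT tree policy (illustrative, assumptions as noted above).
import math
import random
from collections import defaultdict

def mcts(s_root, step, actions, gamma=0.95, n_sims=500, horizon=50, c_uct=1.4):
    Q = defaultdict(float)   # state-action value estimates
    N = defaultdict(int)     # state-action visit counts
    Ns = defaultdict(int)    # state visit counts
    tree = set()             # explored (hashable) states

    def uct(s, a):
        if N[(s, a)] == 0:
            return float("inf")                  # force exploration of untried actions
        return Q[(s, a)] + c_uct * math.sqrt(math.log(Ns[s]) / N[(s, a)])

    def rollout(s, depth):
        G, disc = 0.0, 1.0
        for _ in range(depth):
            s, r, done = step(s, random.choice(actions))
            G += disc * r
            disc *= gamma
            if done:
                break
        return G

    for _ in range(n_sims):
        s, path, rewards, done = s_root, [], [], False
        # Selection: descend while states are already in the tree
        while s in tree and not done:
            a = max(actions, key=lambda act: uct(s, act))
            s2, r, done = step(s, a)
            path.append((s, a)); rewards.append(r)
            s = s2
        # Expansion + simulation (playout from the newly reached state)
        tree.add(s)
        G = 0.0 if done else rollout(s, horizon)
        # Backpropagation: accumulate the discounted return towards the root
        for (s_i, a_i), r in zip(reversed(path), reversed(rewards)):
            G = r + gamma * G
            Ns[s_i] += 1; N[(s_i, a_i)] += 1
            Q[(s_i, a_i)] += (G - Q[(s_i, a_i)]) / N[(s_i, a_i)]
    # Greedy action at the root
    return max(actions, key=lambda act: Q[(s_root, act)])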

3.2.3 Analysis of Monte Carlo Tree Search

While it is convenient to use MCTS in large state-action spaces to efficiently navigate the search via, for example, the UCT policy [Kocsis and Szepesvári, 2006], it is a sampling technique that does not aim to estimate the value function completely. Given limited computational resources and time, MCTS only builds a part of the search space in the hope of converging to the optimal policy for the root state [Browne et al., 2012]. We will show that for small environments, i.e., 5x5 grid worlds, it is relatively easy for MCTS to converge both in value and in policy to the solution of VI. For larger grid worlds/problems or environments with high entropy, MCTS may only provide an estimate of the policy and value function, e.g., preserve the optimal action ranking given a state [James et al., 2017].

Kocsis and Szepesvári [2006] prove that Monte Carlo Tree Search with the UCT policy [Browne et al., 2012] is consistent, meaning that the probability of selecting the optimal action can be made to converge to one as the number of samples grows to infinity. However, the value function produced by the UCT policy cannot be guaranteed to converge to the optimal values. This is partly due to the presence of uniform random playouts in the simulation/playout phase. James et al. [2017] investigate MCTS convergence in value and policy for stochastic (grid) environments. They show that a key consideration of performing rollouts in UCT is not how accurately they evaluate the state values, but how well they preserve the correct action ranking. In domains where there are not many nodes in the tree whose values differ greatly (smoothness assumption), this preservation often appears to be the case. However, if the smoothness assumption is broken, i.e., there is higher entropy in the probability functions and varying reward functions, the value estimates may be biased.


Figure 2: The value of leaf nodes in a search tree derived from a stochastic environment (picture taken from James et al. [2017]).

3.3 Planning with learnt models via robust optimization

This section gives a general introduction to robust optimization. It serves as context for Section 4.1, which provides an adaptation of the existing research on Robust Reinforcement Learning to our setup.

Robust optimization seeks to extremize an objective function given that a certain measure of robustness against uncertainty is satisfied. Usually, uncertainty is represented as deterministic variability in the value of the parameters of the problem and/or its solution. Therefore, the most common method of formulating a robust optimization task is worst-case analysis. This means maximizing the performance of a policy, f, for the worst-case model within a predefined or learnt set, U, of MDP models p:

max_{x∈X} min_{p∈U} f(x, p),    (3.9)

where max represents the reward maximization objective of the agent, and min represents optimization over the uncertainty in the environment model, as the environment model attempts to minimize the agent's performance over a deterministic set of models U. This is the classic example of a minimax (or maximin) optimization problem. The solution methodology for such problems involves subdividing the problem into two stages: maximization and minimization. The maximization stage employs a (planning) technique from Section 3.2 to find the optimal policy given an environment transition model. The minimization stage derives the transition model within a bounded set U that reduces the agent's performance the most. As in the majority of related papers, e.g., Lecarpentier and Rachelson [2019], Wiesemann et al. [2013], Abdullah et al. [2019], we derive U from the estimate of the transition dynamics, i.e., our (sample) imperfect model. Section 3.4 discusses how one is able to estimate the transition dynamics and derive U from the statistical confidence ranges surrounding the estimate. Section 3.5 thereafter introduces two known probability distance metrics that can be used to construct the model space U without the use of confidence bounds.
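As a toy illustration of the two-stage structure (not the thesis's solver), the snippet below evaluates a handful of candidate policies against a small, explicitly enumerated model set U and keeps the policy with the best worst-case value; the policies, models and evaluation function are hypothetical placeholders supplied by the caller.

# Toy worst-case (maximin) selection over an enumerated model set U.
# `evaluate(policy, model)` is a stand-in for estimating E[G] under a given model.

def robust_choice(policies, model_set, evaluate):
    best_policy, best_worst_case = None, float("-inf")
    for policy in policies:
        # Minimization stage: worst-case value of this policy over the model set
        worst_case = min(evaluate(policy, model) for model in model_set)
        # Maximization stage: keep the policy with the best guaranteed value
        if worst_case > best_worst_case:
            best_policy, best_worst_case = policy, worst_case
    return best_policy, best_worst_case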

3.4 Model spaces resulting from Bayesian Model Learning

There are two distinct approaches to learning a model of the environment: (frequentist) Maximum Likelihood Estimation and Bayesian Inference. Bayesian Inference allows one to estimate the unknown model parameters using Bayes' Theorem. Starting from a prior distribution over the unknown transition model parameters, the agent computes a posterior distribution over these parameters based on the encountered experience. Model learning can be done either online or offline, and the agent can compute the best policy by maximizing the future expected return under the current posterior distribution. Additionally, the agent can also consider how this distribution will evolve in the future under different feasible sequences of actions and observations.

Consider the MDP framework presented in Section 3.1. Assume there is an MDP where the finite state and action spaces S, A are known together with the reward function R. However, P is unknown and the aim is to find the transition probabilities P_{s,a} for all s ∈ S, a ∈ A, where P_{s,a} is the probability function that tells how likely we are to transition to each of the states s' given that the current state and action are s, a. Furthermore, we assume that we have access to the transition history up to time step t. Let s̄_t = ⟨s_0, s_1, ..., s_t⟩ and ā_{t−1} = ⟨a_0, a_1, ..., a_{t−1}⟩ denote the agent's history of visited states and actions up to time t. From the history, we can derive the number of times the transition (s, a, s') occurred in (s̄_t, ā_{t−1}) by counting the occurrences: N^a_{s,s'}(s̄_t, ā_{t−1}) = ∑_{i=0}^{t−1} I_{(s,a,s')}(s_i, a_i, s_{i+1}).

We start the inference procedure by identifying a prior g_{s,a} that specifies the probability over each transition function P_{s,a}, i.e., a pmf over S. Next, we select the likelihood function to be a Multinomial distribution of the form ∏_{s'∈S} (P_{s,a,s'})^{N^a_{s,s'}(s̄_t, ā_{t−1})}. The resulting posterior takes the form

g_{s,a}(P_{s,a} | s̄_t, ā_{t−1}) ∝ g_{s,a}(P_{s,a}) ∏_{s'∈S} (P_{s,a,s'})^{N^a_{s,s'}(s̄_t, ā_{t−1})}.    (3.10)

Since the Dirichlet distribution is the conjugate of the Multinomial, it follows that if the priors g_{s,a}(P_{s,a}) are Dirichlet distributions for all s, a, then the posteriors g_{s,a}(P_{s,a} | s̄_t, ā_{t−1}) will also be Dirichlet distributions for all s, a. The Dirichlet distribution is the multivariate extension of the Beta distribution and defines a probability distribution over discrete distributions. It is parameterized by a count vector φ = (φ_1, ..., φ_k), where φ_i ≥ 0, such that the density of the probability distribution p = (p_1, ..., p_k) is defined as f(p|φ) ∝ ∏_{i=1}^k p_i^{φ_i − 1}. If X ∼ Multinomial_k(p, N) is a random variable with unknown probability distribution p = (p_1, ..., p_k), and Dirichlet(φ_1, ..., φ_k) is a prior over p, then after the observation X = n the posterior over p is Dirichlet(φ_1 + n_1, ..., φ_k + n_k). Hence, if the prior g_{s,a}(P_{s,a}) is Dirichlet(φ^a_{s,s_1}, ..., φ^a_{s,s_|S|}), then after observing the history (s̄_t, ā_{t−1}) the posterior g_{s,a}(P_{s,a} | s̄_t, ā_{t−1}) is Dirichlet(φ^a_{s,s_1} + N^a_{s,s_1}(s̄_t, ā_{t−1}), ..., φ^a_{s,s_|S|} + N^a_{s,s_|S|}(s̄_t, ā_{t−1})). The posterior Dirichlet parameters φ' after a new transition (s, a, s') are then defined as

φ'^a_{s,s'} = φ^a_{s,s'} + 1,    (3.11)
φ'^{a'}_{s'',s'''} = φ^{a'}_{s'',s'''},   ∀ (s'', a', s''') ≠ (s, a, s').    (3.12)

Once the resulting Dirichlet posteriors over the densities p have been estimated, we can move to establishing a confidence range within which the model space for robust optimization is defined. A 100(1−α)% confidence interval for such an unknown population is a polygon on the simplex Δ^{K−1} containing 100(1−α)% of its Dirichlet posterior [Vermeesch, 2005]. The projection of this polygon onto the sides of the simplex yields K = |S| confidence intervals for the K possible transitions (the size of the state space). These confidence intervals form a band completely containing the 100(1−α)% most likely multinomial populations [Vermeesch, 2005]. Thus, we can consider each side of the simplex, i.e., an estimate p_i from the probability distribution p = (p_1, ..., p_k), separately. It can be found that

p_i^{lower} = Beta(1 − α/2, φ^a_{s,s_i}, N^a_s),    (3.13)
p_i^{upper} = Beta(α/2, φ^a_{s,s_i}, N^a_s).    (3.14)
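The count update and interval computation can be sketched as follows. The use of scipy.stats.beta quantiles and the parameterisation Beta(φ_i, ∑φ − φ_i) for the i-th Dirichlet marginal are my assumptions about how Eqs. 3.13-3.14 would be implemented, not code from the thesis.

# Sketch: Dirichlet posterior counts for P(.|s, a) and per-transition confidence bounds.
import numpy as np
from scipy.stats import beta

def update_counts(phi, s_next_idx):
    """Posterior update (Eqs. 3.11-3.12): add one count for the observed transition."""
    phi = phi.copy()
    phi[s_next_idx] += 1
    return phi

def transition_confidence_bounds(phi, alpha=0.05):
    """Marginal 100(1-alpha)% bounds for each transition probability.
    Assumes the i-th marginal of Dirichlet(phi) is Beta(phi_i, sum(phi) - phi_i)."""
    total = phi.sum()
    lower = beta.ppf(alpha / 2, phi, total - phi)
    upper = beta.ppf(1 - alpha / 2, phi, total - phi)
    return lower, upper

# Example: uniform Dirichlet(1,1,1) prior over 3 successor states, then two observations of state 0
phi = np.ones(3)
phi = update_counts(phi, 0)
phi = update_counts(phi, 0)
lo, hi = transition_confidence_bounds(phi)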


3.5 Model spaces through distance metrics

Instead of finding confidence ranges, we can define U(s, a) with the following constraint on a pmf p = P_{s,a}: d(p', p) < ε, where p' ∈ U(s, a) is an arbitrary element of the model space, ε is the error bound, and d is a probability distance metric. This section introduces two commonly used probability distance metrics that can be used to estimate model spaces.

Manhattan Distance

We first consider the Manhattan (L1) distance between two probability vectors. This measure is equal to the one-norm of the difference between the vectors. Let p and q denote the vectors of probability mass assigned by two distributions defined on the same state space of dimensionality n. Then, the Manhattan distance between p and q is given by

L1(p, q) = ∑_{i=1}^n |p_i − q_i|.    (3.15)

As can be deduced from the equation above, the Manhattan, or L1, distance is a fast-to-compute symmetric measure that can be used only with distributions of the same dimension. This distance also does not respect the spatial component/topology of the probability space: if r = [0, 0, 1, 0, 0], p = [0.5, 0, 0, 0, 0.5] and q = [0, 0.5, 0, 0.5, 0], then d(p, r) = d(q, r), even though the probability mass of r and q is concentrated on the central atoms while for p the probability is spread over the edge atoms. For our gridworld domains (Section 5), and to calculate the cost matrix used for the Wasserstein distance, we also use the L1 distance in the two-dimensional coordinate plane. Then, instead of atom masses, p_i and q_j represent the coordinates of two states.

Wasserstein Distance

Alternatively, the Wasserstein distance quantifies the distance between two distributions in a physical manner, respectful of the topology of the measured space [Schmitzer and Wirth, 2017, Dabney et al., 2017]. Thus, the Wasserstein distance takes into account the underlying geometry of the space that the distributions are defined on. It also possesses the symmetry property, and it can measure the distance between two discrete distributions, two continuous distributions, or a discrete and a continuous distribution. In our work, we extensively use the discrete Wasserstein distance, as we believe that the topological information is fruitful to exploit in learning optimal control. However, this comes at the cost of higher computational complexity when calculating the measure. As previously, let p and q define discrete probability mass functions,

p = ∑_{i=1}^{n_p} p_i δ_{p_i},   q = ∑_{j=1}^{n_q} q_j δ_{q_j},    (3.16)

where p_i ∈ (0, 1] and q_j ∈ (0, 1] are the probability masses of the two marginals, and δ_{p_i} and δ_{q_j}, (i, j) ∈ n_p × n_q, are Dirac-delta functions indicating to which atoms i, j the masses apply. Next, a non-negative matrix f is defined that represents feasible joint distributions of p and q:

∑_j f_ij = p_i,   i = 1, 2, ..., n_p,    (3.17)
∑_i f_ij = q_j,   j = 1, 2, ..., n_q.    (3.18)


The Wasserstein distance is defined via a minimization problem of the following form:

W(p, q) = min_f ∑_{ij} f_ij c_ij,    (3.19)

where c_ij is a (fixed) cost of transferring a unit of probability mass from i to j. As the cost metric c, we may use any classical geometrical or custom-defined distance. We choose to use the L1 distance between the coordinates of the states with indices i and j. An easy interpretation of this metric is found in the theory of Optimal Transport [opt, 2014]. The Wasserstein distance finds the minimal cost of transporting probability mass according to the cost-of-moving function c. In order to find this route, a general linear programming solver can be used.
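The primal problem in Eq. 3.19 is a small linear program, so it can be solved directly with a generic LP solver. Below is a sketch using scipy.optimize.linprog, an illustrative choice rather than the thesis's solver, with the cost matrix passed in explicitly.

# Sketch: discrete Wasserstein distance via the transport LP of Eq. 3.19.
import numpy as np
from scipy.optimize import linprog

def wasserstein(p, q, cost):
    """p, q: probability vectors of sizes n_p, n_q; cost: (n_p x n_q) cost matrix c_ij."""
    n_p, n_q = len(p), len(q)
    # Row-marginal constraints: sum_j f_ij = p_i ; column-marginal: sum_i f_ij = q_j
    A_eq = np.zeros((n_p + n_q, n_p * n_q))
    for i in range(n_p):
        A_eq[i, i * n_q:(i + 1) * n_q] = 1.0
    for j in range(n_q):
        A_eq[n_p + j, j::n_q] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Example from the L1-distance discussion: mass at the centre vs. mass on the edges
r = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
p = np.array([0.5, 0.0, 0.0, 0.0, 0.5])
cost = np.abs(np.subtract.outer(np.arange(5), np.arange(5))).astype(float)
print(wasserstein(p, r, cost))   # 2.0, unlike the L1 distance, which ignores the geometry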

Kuhn et al. [2019a] and Lecarpentier and Rachelson [2019] use the dual formulation of the Wasserstein distance (the dual Kantorovich problem [Beiglböck et al., 2010]) to change the calculation in Eq. 3.19 (with the L1 distance) into a constrained maximization problem. The dual problem can be interpreted as the profit maximization problem of a third party that reallocates the probability mass from p to q on behalf of the problem owner by "buying" mass at the origin i at unit price c_ii and selling at the destination j at unit price c_ij [Kuhn et al., 2019b]. The constraints ensure that the problem owner prefers to use the services of the third party for every origin-destination pair i, j instead of reallocating the mass independently at her own transportation cost b. The L1-Wasserstein distance becomes

W(p, q) = max_f f^T(p − q)    (3.20)
s.t. Cf ≤ b.    (3.21)


4 Methodology

This section provides answers to the previously posed RQs. Answering these RQs is necessary to fulfil our goal of devising a robust planning algorithm that is able to operate under imperfect models.

RQ-1: How can we do robust planning with imprecise models and stochastic state transitions?

RQ-2: How can we efficiently solve the minimization step in robust planning?

RQ-3: What are the critical assumptions one needs to consider when creating model spaces for robust planning?

In Section 4.1, we formalise a general methodology to solve robust reinforcement learning tasks from Iyengar [2005], Nilim and Ghaoui [2005], Wiesemann et al. [2013] in notations that are consistent with the background section. Section 4.2 gives an overview of the techniques that can be used to define sets of models (i.e., model spaces) with respect to which the algorithm is robust. In the same section, we also show how the minimization step of robust optimization (Section 3.3) is integrated with the model space. In the last few sections (Sections 4.3 & 4.4), we introduce several adjustments to two classical RL algorithms (Value Iteration and Monte Carlo Tree Search) to make them more robust.

4.1 Robust Planning Solution

A solution to a robust reinforcement learning task involves optimizing for an objective that encourages lower regret under the worst-case transition dynamics model in a model space set. This is equivalent to saying that the robust policy needs to maximize its performance under the environment model that yields the lowest return with respect to some reference (optimal, but non-robust) control policy. Thus, similar to Nilim and Ghaoui [2005] and Iyengar [2005], the robust policy evaluation function can be written as

f_U(π) = min_{φ∈U_φ} E_{τ∼ρ_{φ,θ}}[G(τ)],    (4.1)

where U_φ is the set of parameter vectors, one for each element in the model space set U. Section 4.2 goes into detail on the methods of constructing model spaces. In this section, we assume that we only need to find a robust solution for the MDP at hand and that both the maximization and minimization stages of robust optimization (see Section 3.3) can be solved (efficiently).

We call f_U(·) in Equation 4.1 the lower bound expected return (LBER). Consequently, policy evaluation with respect to LBER transforms the value function of Equation 3.4 into

V^π_{φ_min}(s) = min_{φ∈U_φ} E_{τ∼ρ_{φ,θ}}[G(τ) | s_0 = s]   ∀s ∈ S.    (4.2)

This is equivalent to Equation 2 in Wiesemann et al. [2013] where authors tackle a similar task.

There are benefits to using such a formulation of the value function. For example, it allows the desired robustness level of an algorithm to be adjusted by increasing or decreasing the model space U. This is especially important when planning with an imperfect model, i.e., when one does not have access to the ground truth model and has to assume the worst-case scenario if willing to be robust. If the error bound (or uncertainty level) is set to zero, the model space shrinks to a singleton U, which corresponds to working with classical RL methods: the robust value function of Equation 4.2 is then equivalent to the plain MDP formulation of the value function, Equation 3.4. Increasing the error bound, i.e., enlarging the model space, corresponds to encouraging a higher robustness level, as a wider range of possible transition models is being considered. Moreover, we can use results from Iyengar [2005] and Nilim and Ghaoui [2005] to transfer statistical confidence guarantees to the value function/policy if the model space sets are constructed with certain properties (see Section 4.2 for details).

Once the procedure to derive the model space is established, the lower bound expected return can be found by solving a two-stage minimax problem. In the first stage, a non-robust policy π is used to estimate the worst-case MDP transition model. In the second stage, the worst-case model is employed to derive the robust policy \tilde{\pi} = \pi_{\tilde{\theta}}:

\phi_{\min} = \arg\min_{\phi \in U_\phi} \mathbb{E}_{\tau \sim \rho_{\phi, \theta}} \left[ G(\tau) \right],    (4.3)

\tilde{\theta} = \arg\max_{\theta} \mathbb{E}_{\tau \sim \rho_{\phi_{\min}, \theta}} \left[ G(\tau) \right].    (4.4)

Provided the agent has access to the environment model parameterized by φ during learning, the problem can be framed in terms of state-space planning. Here, we also encounter two optimization stages (minimization and maximization):

Q^{\tilde{\pi}}(s, a) = \min_{\phi \in U_\phi} \left[ R(s, a) + \gamma V^{\pi}(s') \right] \quad \forall s, s' \in S, \; \forall a \in A,    (4.5)

V^{\pi}(s) = \max_{a \in A} Q^{\tilde{\pi}}(s, a) \quad \forall s \in S.    (4.6)
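To make this planning formulation concrete, the following sketch runs a robust value iteration realizing Equations 4.5-4.6 under the simplifying, purely illustrative assumption that the model space U is approximated by a finite list of candidate transition tensors; Section 4.2 constructs U more carefully. All names here are ours.

import numpy as np

def robust_value_iteration(models, R, gamma, n_iters=500, tol=1e-8):
    # models : list of arrays of shape (|A|, |S|, |S|), candidate transition tensors
    # R      : array of shape (|S|, |A|), expected immediate rewards
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        # Worst-case Bellman backup (Eq. 4.5): elementwise min over candidate
        # models, i.e., the worst case is picked per (s, a) pair (rectangular view).
        Q = np.min(
            [R + gamma * np.einsum("ast,t->sa", P, V) for P in models],
            axis=0,
        )
        # Greedy maximization over actions (Eq. 4.6).
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    robust_policy = Q.argmax(axis=1)
    return V, robust_policy

With a single candidate model the loop reduces to standard value iteration, which mirrors the earlier observation that a singleton U recovers the classical setting.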

4.2 Choosing a Model Space and finding the worst case model

Robust reinforcement learning requires a procedure to estimate the worst-case MDP transition dynamics model P_min = P_{φ_min} from a reference (sample) transition model P_0. Naturally, P_min comes from the model space U. If U satisfies the rectangularity assumption, it is proven by Iyengar [2005] and Nilim and Ghaoui [2005] that the optimal robust value function converges to its fixed point and a deterministic robust policy can be derived, i.e., the lower-bound expected return converges for all states. Moreover, the rectangularity property is crucial to ensure that the controller and the environment act alternately and independently at each time step, so that the robust optimization can be seen as a zero-sum game. Wiesemann et al. [2013] show that for general model spaces computing the optimal robust value/policy is strongly NP-hard.

In this work we focus on rectangular model space sets that have (statistical) guarantees for the optimality of the robust value/policy. Rectangular sets were introduced by Nilim and Ghaoui [2005], where the authors model the uncertainty in transition dynamics such that the transition probability p(·|s, a) for each state-action pair (s, a) ∈ S × A can be selected from a set U(p_0(·|s, a)) which is derived independently of (unrelated to) the transition probabilities of the other state-action pairs. Figure 3 is inspired by Figure 2 of Wiesemann et al. [2013] and depicts a simple visual representation of (s, a)-rectangularity. The leftmost picture presents a unit cube in the three-dimensional space of some arbitrary transitions. For the sake of simplicity, the thick line is a projection of all the other transition distribution parameters from P_0 onto the chosen three-dimensional space, i.e., it represents arbitrary interactions within the parameter space that affect the three chosen dimensions. Points on the thick line are the possible P values if we assume that we take the same (distributional) support as in P_0. The shaded area of the cube in the middle picture is the unconstrained model space: the rectangular set we would construct for P if we allowed any variance in these three dimensions. The rightmost picture shows a sphere which is the constrained model space of radius ε, where the black dot marks the current parameter values of P_0.



Figure 3: Visualization of (s, a)-rectangularity. The leftmost picture presents a unit cube in the three-dimensional space of some arbitrary transition probabilities. The thick line is a projection of all the other transition distribution parameters from a sample transition model P_0 onto the chosen three-dimensional space; thus, points on the thick line are the possible P values if we assume the same (distributional) support as in P_0. The shaded area of the cube in the middle picture is the unconstrained model space, i.e., the rectangular set we would construct for P if we allowed any variance in these three dimensions. The rightmost picture shows the constrained model space of radius ε, where the black dot is the current configuration (parameter values) of P_0.

We see that the mass shifted away from the reference model is at most ε in magnitude. If ε mass is shifted independently for each (s, a), the resulting global model space set U is constructed from the smaller sets U(p_0(·|s, a)).
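Operationally, (s, a)-rectangularity means that membership in U is checked row by row, independently for every state-action pair. The following minimal sketch makes this explicit, assuming (as an illustrative choice) an L1 ball of radius ε around each reference row.

import numpy as np

def in_rectangular_set(P, P0, eps, dist=lambda p, q: np.abs(p - q).sum()):
    # P, P0 : arrays of shape (|S|, |A|, |S|) holding p(s'|s, a).
    n_states, n_actions, _ = P0.shape
    # Rectangularity: each (s, a) row is constrained independently of all others.
    return all(
        dist(P[s, a], P0[s, a]) <= eps
        for s in range(n_states)
        for a in range(n_actions)
    )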

Previously, we presented a fixed epsilon bound for the model space sets; however, it is often convenient to rely on the parameters' confidence intervals or entropy bounds in order to obtain U(p_0(·|s, a)). When U is rectangular and obtained from the model parameters' confidence bounds (Section 3.4) or entropy bounds (e.g., upper bounds on the Kullback-Leibler divergence between two distributions), the statistical properties of the bounds can be transferred to the robust policy and value function [Nilim and Ghaoui, 2005]. For example, if the state values are derived from the worst-case transition model of a model space set that is in turn derived from a 95% confidence interval around the MLE of the reference transition dynamics, then these values are optimal and robust for 95% of the models generated from the reference MLE parameter estimates.
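As one concrete, assumed instantiation of such confidence-based bounds (not taken verbatim from this thesis), a per-(s, a) radius ε can be derived from transition counts via the L1 concentration inequality of Weissman et al. (2003), which is common in the robust MDP literature. The sketch below is illustrative only.

import numpy as np

def l1_confidence_radius(counts, n_next_states, delta=0.05):
    # Weissman et al. (2003): ||p - p_hat||_1 <= sqrt(2 * ln((2^k - 2)/delta) / n)
    # holds with probability >= 1 - delta for each (s, a) independently.
    # We upper-bound ln(2^k - 2) by k * ln(2) to avoid overflow for large |S|.
    n = np.maximum(counts, 1)  # guard against unvisited (s, a) pairs
    return np.sqrt(2.0 * (n_next_states * np.log(2.0) - np.log(delta)) / n)

Here counts[s, a] is the number of observed transitions from (s, a); the returned matrix supplies one ε per state-action pair, preserving the rectangular structure.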

While there is a pool of methods to create model spaces, the main criterion is the type of robustness a modeler wants to have (see Section 4.2.3 for discussion). For example, Lecarpentier and Rachelson [2019] assume a non-stationary environment with Lipschitz-continuous evolution of the model parameters, and are hence robust against the (non-stationary) drift of the transition model. Abdullah et al. [2019] are allowed to perturb the model specification parameters (e.g., pole lengths, torso masses) to find a robust policy that performs well under these perturbations. A unifying notion that appears in the problems mentioned above and in Section 2 is the idea of a reference model P_0, or a sample model as we call it. It often serves as a starting point for creating a model space set and can be either derived from experience history or provided by equations describing the environment (a physical simulator). It is frequently taken as the center around which the set of all possible worst-case models is drawn (see the leftmost plot of Figure 3). We define the process of finding a candidate worst-case model from the reference dynamics model as a projection operator. We project the reference model parameters of p_0(·|s, a) onto the allowed space U, thus arriving at U(p_0(·|s, a)) and/or directly at p_min(·|s, a). Mathematically, we can see the end result of the projection as finding the worst-case model p_min(·|s, a) for each pair (s, a) ∈ S × A. We focus on MDPs with finite state and action spaces; thus, for notational simplicity, we refer to the parameter vector φ of a discrete transition function p(·|s, a) simply as p(·|s, a). Considering discrete state-space planning, one can find the action value by solving the following equation:

Q(s, a) = \sum_{s' \in S} p(s'|s, a) \left[ R(s, a) + \gamma V^{\pi}(s') \right].    (4.7)


The worst-case scenario is the situation in which the environment shifts all the probability mass towards the next state s' with the lowest value. Constraining this scenario to the model space set U(p_0(·|s, a), ε) results in the linear program

\min_{p(\cdot|s,a)} \; \sum_{s' \in S} p(s'|s, a) \, V^{\pi}(s'), \quad \text{or equivalently} \quad \min_{p} \; p \cdot v

\text{subject to} \quad \sum_{s' \in S} p(s'|s, a) = 1,
\qquad p(s'|s, a) \in [0, 1] \quad \forall s' \in S,
\qquad d\left( p(\cdot|s, a), \, p_0(\cdot|s, a) \right) \leq \epsilon    (4.8)

where d is a distance metric applied to the parameter space of p(·|s, a). Formulation 4.8 encodes the constraints on the properties of a transition probability function (i.e., non-negative and summing to one) together with the constraint defining U(p_0(·|s, a), ε) (i.e., the distance between a candidate worst-case p(·|s, a) and the reference transition p_0(·|s, a) cannot be greater than ε). Thus, U(p_0(·|s, a), ε) can be seen as a set that represents a convex polytope (Section 4.3.1). The edges of this geometry represent the frontier, and one of the points on this frontier yields the lowest Q(s, a) given some policy π.
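The following sketch solves this inner problem for a single (s, a) pair with SciPy's linprog, assuming the L1 distance as d and linearizing it with auxiliary variables t_i >= |p_i - p0_i|. The helper name worst_case_transition is ours, not the thesis implementation.

import numpy as np
from scipy.optimize import linprog

def worst_case_transition(p0, v, eps):
    # Find the transition row p within L1 distance eps of the reference row p0
    # that minimizes the expected next-state value p . v (problem 4.8).
    n = len(p0)
    # Decision vector x = [p_1..p_n, t_1..t_n]; only p enters the objective.
    c = np.concatenate([v, np.zeros(n)])
    # p must be a probability distribution.
    A_eq = np.concatenate([np.ones(n), np.zeros(n)])[None, :]
    b_eq = [1.0]
    # |p_i - p0_i| <= t_i  and  sum_i t_i <= eps.
    I = np.eye(n)
    A_ub = np.vstack([
        np.hstack([I, -I]),                                   #  p_i - t_i <=  p0_i
        np.hstack([-I, -I]),                                  # -p_i - t_i <= -p0_i
        np.concatenate([np.zeros(n), np.ones(n)])[None, :],   #  sum_i t_i <= eps
    ])
    b_ub = np.concatenate([p0, -p0, [eps]])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * n + [(0, None)] * n, method="highs")
    return res.x[:n]  # the candidate worst-case row p_min(.|s, a)

With ε = 0 the solver returns the reference row p_0 itself; as ε grows, probability mass is shifted towards the lowest-value next states, reflecting the worst-case scenario described above.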

4.2.1 Direct projection: Solving a Linear Program

The constraints described in 4.8, as well as the choice of the distance metric and the epsilon value, are crucial to creating a set U(p_0^{s,a}, ε) that encourages the desired robustness properties. The first two constraints of 4.8 make sure that p_min^{s,a} is a proper probability distribution. The third constraint requires the possible transition probability functions to be located in the vicinity of the reference dynamics, e.g., if we believe that the reference dynamics model is an unbiased estimate of the true transition model, we might choose a small ε value so as not to deviate excessively from the reference dynamics. This last constraint can also introduce an interesting property to problem 4.8: if the distance metric d is convex, problem 4.8 becomes a convex optimization problem (Section 4.3.1) and can be solved analytically. Therefore, we can efficiently apply a direct projection operator, Algorithm 2, i.e., arrive at the worst-case estimate of p^{s,a} without having to iteratively project p_0^{s,a} onto U. Notice that Algorithm 2 can be applied with an arbitrary distance metric; however, the guarantees of policy/value convergence might be lost and/or the numerical methods used to solve 4.8 could significantly slow down the computations.

Algorithm 2: Direct Projection, estimation of the worst-case transition model P_min by solving linear program 4.8
Input:  P_0, reference transition model
        V^π, estimated value function
        S × A, state-action pairs for which to do the projection
        ε, error bound
Output: worst-case transition model, P_min
1  P_min ← copy of P_0
2  for s, a in S × A do
3      p_0^{s,a} ← P_0(·|s, a)
4      get p_min^{s,a} from U(p_0^{s,a}, ε) by solving 4.8
5      P_min(·|s, a) ← p_min^{s,a}
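Looping this projection over all state-action pairs gives a direct-projection sketch mirroring Algorithm 2; it reuses the hypothetical worst_case_transition helper sketched after problem 4.8.

import numpy as np

def direct_projection(P0, V, eps):
    # P0 : array of shape (|S|, |A|, |S|) holding p0(s'|s, a)
    # V  : array of shape (|S|,) with the current value estimates
    P_min = P0.copy()
    n_states, n_actions, _ = P0.shape
    for s in range(n_states):
        for a in range(n_actions):
            # Each (s, a) row is projected independently (rectangularity).
            P_min[s, a] = worst_case_transition(P0[s, a], V, eps)
    return P_min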


4.2.2 Indirect projection: Dichotomy method

We can see the projection operator as a step towards the boundary of U while trying to minimize the action value function in Equation 4.5. The direction of this step can be easily identified from Equation 4.7. The unconstrained minimizer of Equation 4.7 is the transition probability function under which we can only transition to the lowest-value (next) state s'. Hence, the point towards which the gradient of the objective function 4.8 with respect to p^{s,a} = p(·|s, a) points is a vector ∇_p (the minimization point) of all zeros except one entry with the value of 1. The index of this entry is arg min_s v(s). Alternatively, if there is more than one state that yields the lowest value, a heuristic can be employed to break ties and determine ∇_p, e.g., the 1 is assigned to the state index closest to the (s, a)-pair in question.

Algorithm 3: Indirect Projection, estimation of the worst-case transition model P̂_min via an iterative procedure
Input:  P_0, reference transition model
        P̂_min, previous estimate of the worst-case model
        V^π, estimated value function
        S × A, state (sub)-space for the projection
        ε, error bound
        Λ, initial step sizes
Output: worst-case transition model, P̂_min
        step size matrix, Λ
1   P̂_min ← copy of P_0
2   for s, a in S × A do
3       λ_{s,a} ← Λ[s, a]                                          // Get the gradient descent step size
4       p̂_min^{s,a} ← P̂_min(·|s, a)                               // Copy previous worst-case model
5       ∇_p ← h(v)                                                 // Find the step direction, e.g. h(v) = arg min_s v(s)
6       p̂_{c,min}^{s,a} ← λ_{s,a} · ∇_p + (1 − λ_{s,a}) · p̂_min^{s,a}   // Find a candidate projection p̂_{c,min}^{s,a}
7       if d(p̂_{c,min}^{s,a}, P_0(·|s, a)) > ε then
8           λ_{s,a} ← λ_{s,a} · κ
9           p̂_{c,min}^{s,a} ← P̂_min(·|s, a)                       // If out of the model space, simply decrease the step size
10      P̂_min(·|s, a) ← p̂_{c,min}^{s,a}                           // Update the estimate of P_min and Λ
11      Λ[s, a] ← λ_{s,a}
12  end

Once the direction of the gradient has been established, we need to identify a step size λ with which to descend. Under some conditions, we can find λ such that Equation 4.8 is solved within one iteration (Section 4.2.1). More generally, we can employ a numerical search method that starts with an arbitrary step size and converges close to p_min^{s,a} by updating the step size λ on each descent iteration. For example, dichotomous search [Murakami, 1971] allows us to move arbitrarily close to the boundary of U(p_0^{s,a}, ε). This numerical optimization method iteratively checks whether a candidate worst-case pmf p̂_{c,min}^{s,a} lies in the parameter space defined by U(p_0^{s,a}, ε). If the point is within the boundary, it is accepted; else the learning rate is reduced by a factor κ. This provides a simple method that converges to the global minimum of LP 4.8 given that the problem is convex, i.e., unimodal. Once we have converged to a minimum, the entries of Λ become values close to zero and further minimization is not possible. One has to reset Λ to the initial step sizes in order to begin a new minimization stage, for example, after acquiring a new reference dynamics P_0.
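A minimal sketch of a single dichotomous projection step for one (s, a) pair follows, assuming the L1 distance as d. Repeated calls move the estimate towards the boundary of U(p_0^{s,a}, ε) while shrinking the step size by κ whenever a candidate leaves the model space; all names are illustrative.

import numpy as np

def indirect_projection_step(p0, p_min_prev, v, eps, lam, kappa=0.5):
    # Step direction: put all mass on the lowest-value next state.
    grad = np.zeros_like(p0)
    grad[np.argmin(v)] = 1.0
    # Candidate: move the previous estimate towards the unconstrained minimizer.
    candidate = lam * grad + (1.0 - lam) * p_min_prev
    if np.abs(candidate - p0).sum() > eps:
        # Outside the model space: keep the previous estimate and shrink the step.
        return p_min_prev, lam * kappa
    return candidate, lam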
