Efficient Bayesian Learning in Factored Partially Observable Environments



MSc Artificial Intelligence

Track: Learning Systems

Master Thesis

Efficient Bayesian Learning in Factored

Partially Observable Environments

by

Sammie Katt

6151248

January 30, 2017

42 credits, March 2014 - January 2017

Supervisors:

Dr Frans Oliehoek

Dr Christopher Amato

Assessor:

Dr J. M. Mooij

University Of Amsterdam


Abstract

Control of autonomous agents is significantly more complicated when there is uncertainty over the current state of the environment. For many real-world scenarios, such as robots with unreliable sensors, it is unreasonable to assume complete state information. Approaches that deal with such uncertainty often utilize a model of the environment, assuming an accurate description of the world, including models of the uncertainty, is available.

The Bayes-Adaptive Partially Observable Markov Decision Process (BAPOMDP) is a framework for learning and planning that does not require full observability or a correct model. The BAPOMDP framework, however, does not exploit the conditional independences in the dynamics that real-world applications often exhibit. By taking the structure of the dynamics into consideration, a correct model can be described and learned with fewer parameters, leading to a steeper learning curve than available to BAPOMDPs. Earlier research in the fully observable domain led to the formulation of the Factored Bayes-Adaptive Markov Decision Process (FBAMDP), where the agent learns not only the probabilities of the dynamics but also the structure of the problem. Here we extend the FBAMDP to partially observable problems (FBAPOMDP). We show how the structure of a novel problem, partially observable Sysadmin, can be learned more efficiently than in BAPOMDPs using Monte-Carlo tree search (MCTS). The performance of MCTS, as with any other sample-based method, relies on the number of samples, and we introduce Linking States, a novel technique of stand-in states, in combination with an additional root-sampling step, to speed up the process.

Acknowledgements

This work would have been impossible without the support of others. First, I would like to use this opportunity to express my gratitude to my thesis supervisor, Frans Oliehoek. Both his attention to detail and his understanding of the bigger picture have always led to insightful discussions, the importance of which sometimes dawned on me only later. Second, I would like to thank Christopher Amato for his extensive feedback. His advice on how to scope and manage bigger projects like this one continues to help me. Lastly, I would like to mention my parents for never forcing but always supporting me in my educational decisions.


Contents

1 Introduction
    1.1 Outline

2 Background
    2.1 The MDP & POMDP Framework
        2.1.1 MDP
        2.1.2 Planning in MDPs
        2.1.3 POMDP
        2.1.4 Planning in POMDPs
    2.2 Learning Frameworks
        2.2.1 BAMDP
        2.2.2 BAPOMDP
        2.2.3 Sample-based Planning for Bayes-Adaptive Models

3 Factored Bayes-Adaptive POMDPs
    3.1 Factored MDP frameworks
        3.1.1 Factored MDP
        3.1.2 Factored POMDP
        3.1.3 Factored BAMDP
        3.1.4 Factored BAPOMDP
    3.2 Belief overview
        3.2.1 Belief State Types
        3.2.2 Belief representation
        3.2.3 Particle Filters
    3.3 Sample Based Planning for Factored Bayes-Adaptive Models

4 Solving (F)BAPOMDPs in Practice
    4.1 Linking States
    4.2 Root Sampling
    4.3 Expected models

5 Domains
    5.1 FBAPOMDP Benchmark Requirements
        5.1.1 Partially Observable property
        5.1.2 Encourage learning
        5.1.3 Exploitable structure
    5.2 Existing Problems in literature
        5.2.1 Tiger
        5.2.2 Sysadmin
    5.3 Partially Observable Sysadmin problem
        5.3.1 Observe Whole State
        5.3.2 Observe Single Computer

6 Experimental Evaluation
    6.1 LS & RS experiments
        6.1.1 Linking States depth experiments
        6.1.2 Comparison experiments
    6.2 Planning
        6.2.1 Planning in the Tiger problem
        6.2.2 Planning in the POSysadmin problem
    6.3 Learning counts
        6.3.1 Learning in the Tiger problem
        6.3.2 Learning in the POSysadmin problem

7 Discussion & Future work
    7.1 Optimizing POMCP
    7.2 POSysadmin
    7.3 Future Research

8 Conclusion

9 Appendix
    9.1 Episodic Tiger Problem definition
    9.2 Continuing Tiger Problem definition


List of Figures

1   An example of a policy tree. At the root is the currently chosen action; each possible observation then leads to an action. The leaves represent the last action before the end of the episode (assuming a finite horizon).
2   Two examples of a Dynamic Bayesian Graph. The arrows indicate conditional dependencies.
3   The average time spent by BA-POMCP on a single episode on different instances of the POSysadmin problem for increasing capacity of LS.
4   Comparison of the average time spent on a single episode on different sizes of the POSysadmin problem.
5   The average return of POMCP on the two Tiger problems for an increasing number of simulations (in log scale).
6   The average return of POMCP on the continuing Tiger problem for an increasing number of simulations (in log scale) for various values of the exploration constant with horizon 5.
7   The average return of POMCP on the continuing Tiger problem for an increasing number of simulations (in log scale) for various values of the exploration constant.
8   The average return of POMCP on the POSysadmin POMDP problem of 3 to 8 computers for various numbers of simulations.
9   The average return of POMCP on the episodic Tiger problem per learning episode for various numbers of simulations.
10  The average return of POMCP on the POSysadmin problem per learning episode for various levels of noise.
11  Given the original conditional probabilities that some variable is true, table 11a shows how a relevant connection is marginalized out and table 11b shows how the probabilities are set if a new connection is added.
12  The average return of FBAPOMCP on POSysadmin per learning episode for different levels of noise in the prior.

List of Tables

1   The agent's belief over the hyper-state s̄ in various settings. s is the real state of the world, X is a vector of state features, φ & ψ are the count vectors for the transition and observation models, and b() represents a belief.
2   An example of a LS during a belief update after taking action a and perceiving observation o.
3   Parameters for all the LS / RS experiments on the POSysadmin problem.
4   Parameters for the analysis of the effect of ls_d on the speed performance of POMCP.
5   Parameters used in the episodic Tiger POMDP experiments.
6   Parameters used in the continuing Tiger POMDP experiments.
7   Parameters used in the exploration and horizon Tiger POMDP experiments.
8   Parameters used in the POSysadmin POMDP experiments.
9   Parameters used in the episodic BAPOMDP Tiger problem experiments.
10  Parameters used in the POSysadmin BAPOMDP experiments for learning counts described in 6.3.2.
11  Parameters used in the POSysadmin FBAPOMDP structure learning experiments described in 6.4.

Terminology

S set of states

A set of actions


s a state, often in current time step

a an action, often selected in current time step

r a reward, often received after taking current action

o an observation, often perceived after taking current action

t a time step

t_h the last time step, that of the horizon

T the transition function/model

R the reward function/model

O the observation function/model

γ the discount factor in MDPs

h the horizon of a problem

b a belief, often over states in current time step

a' a in the next time step

a* optimal value of a

ā the hyper value of a

X state s, represented by its features

Y observation, represented by its features

φ transition count vector

ψ observation count vector

V (s) the value function, equal to the expected return of a state

Q(s, a) the Q-value (function), the expected return of selecting an action in a state

G a generative model, simulator or a graph

I indicator function

U belief update function, increments the counts of the perceived experience

F the Particle Filter representing the belief of the agent

F1 the first level Particle Filter in the naive nested belief representation

F2 the second (and last) level Particle Filter in the naive nested belief representation

l̄s a linking state with a pointer to s

D a set of parameters that generally describes the belief over a system's dynamics


1 Introduction

Artificial intelligence has progressed to the point where complicated tasks can be carried out by computers or robots, called agents. These tasks range from displaying web pages based on a search query to exploring Mars. Tasks in environments that are hostile to humans are of particular interest: search and rescue operations, for example, can be dangerous and could lead to injuries when performed by humans. Some of these environments are well understood and require little adaptability from the agent. More sophisticated problems, however, cannot be fully described and can only be solved when the agent learns from experience. Systems can be complex in many ways. The environment can be constantly changing, for example, which makes it hard to anticipate what will happen. Often the agent is not able to completely sense its environment: parts of the world may be hidden, and sensors are noisy by nature. In order to perform well, the agent must adapt by making use of observations of the environment, learning from past mistakes and making use of new knowledge when it is presented.

Reinforcement Learning is the sub-field of artificial intelligence concerned with how autonomous agents should learn to act in order to accomplish a task. The goal is to have the agent perform as well as possible with respect to this task, given a particular environment. Environments, or systems, come in many forms and can be classified as deterministic or stochastic, continuous or discrete, and fully or partially observable. A common framework to model systems is the Markov Decision Process (MDP) [5]. In an MDP time is discretized, and at each time step the agent aims to maximize its value by choosing the best action in the current state. If there is uncertainty over the system dynamics, it might be necessary for the agent to explicitly acquire knowledge, introducing the trade-off between exploitation and exploration. Exploitation refers to picking the action the agent currently thinks is best, while exploration considers actions that lead to a better understanding of the environment (which can later be exploited). Exploration is not only important when the dynamics of the system are not fully understood by the agent. The Partially Observable MDP (POMDP) is a generalization of the MDP that models systems where the agent does not have full knowledge of the current configuration of the environment. In environments where the agent is unable to access the complete state of the world, due to noisy or incomplete data, information gathering can lead to a more accurate understanding of the current situation. The trade-off between exploiting the current knowledge and exploring to increase that knowledge is more complex in POMDPs, due to the partial observability, in addition to the potential uncertainty over the behaviour of the system.

Bayesian Reinforcement Learning (BRL) [25] is a method for explicitly dealing with the exploration-exploitation trade-off in a principled way. Given a prior distribution over the unknown dynamics, BRL takes the uncertainty into account when selecting actions. The Bayes-Adaptive Partially Observable Markov Decision Process (BAPOMDP, an extension of the POMDP framework) is a mathematical BRL model for partially observable environments. Current applications of the BAPOMDP are limited to small domains, however, due to the complexity of maintaining a probability distribution over all model parameters.

Traditionally the state in an MDP is represented by a unique identifier, such as a single index, and the dynamics of the environment as the probability of transitioning from one state to the other in this representation. Experiences in a specific state cannot be generalized to similar states, as the similarity between states is lost when translated to a unique identifier. Factorizing the state into features allows for cross-state learning, and has proven to be a valuable method for solving larger problems, as in the Factored Bayes-Adaptive MDP (FBAMDP [21]). Most such approaches, however, assume full knowledge of the current state, and are thus restricted to fully observable problems. The main contribution of this thesis is the extension of the FBAMDP to the partially observable setting, introducing the Factored Bayes-Adaptive POMDP framework: the FBAPOMDP. This formulation describes states by their features, which allows agents to generalize over experiences and learn more quickly. This contribution is motivated by the research questions 'does utilizing a factorized representation of the state and dynamics of an environment in the form of the FBAPOMDP lead to faster learning than the traditional BAPOMDP does?' and 'is it possible to learn the conditional independences of the dynamics of a system using the FBAPOMDP?'. In order to answer these questions while taking scalability into account, we also attempt to get a better understanding of representative benchmark problems and optimization techniques. This means that questions such as 'what are distinctive properties of problems that can be solved by FBAPOMDP solution methods' and 'what techniques are available to optimize and speed up current solution methods' are constantly considered.

As a result, we introduce Partially Observable Sysadmin, an extension of Sysadmin [12]. This new problem captures the characteristics of real-world problems, which often exhibit structured behaviour, better than previous benchmarks. A sample-based Monte-Carlo Tree Search algorithm, Partially Observable Monte-Carlo Planning [23], is applied to the newly defined problem. We show that the new FBAPOMDP framework allows the agent to utilize the structure of the problem to learn faster and perform better.

1.1 Outline

The Background section provides an introduction to MDPs, including methods for solving and learning. The following part, Factored Bayes-Adaptive POMDPs, contains the main theoretical contribution, the formal definition of the FBAPOMDP: the factored approach to Bayes-Adaptive planning and learning. Solving (F)BAPOMDPs in Practice considers two optimization approaches to improve scalability and speed when solving large problems. Results on the Tiger and partially observable Sysadmin domains are provided in the section Experimental Evaluation. A more detailed analysis is provided in Discussion & Future work, and the thesis closes with the Conclusion.


2 Background

Before diving into the contribution of this thesis on Bayes-Adaptive Reinforcement Learning, we discuss the relevant background on Markov Decision Processes. Section 2.1 provides the definition of MDPs and the more general POMDP framework. Solution methods, in particular the Monte-Carlo Tree Search algorithm [23], are discussed in 2.1.2 for MDPs and in 2.1.4 for POMDPs. The last section (2.2) considers learning in MDPs, addressing the Bayes-Adaptive perspective specifically.

2.1 The MDP & POMDP Framework

Consider the general problem of an agent, in either a physical or a digital system, faced with the challenge of autonomously performing a task. This task can be a specific goal, such as reaching a particular situation, or it can be of a more continuing nature, where the agent aims to act as well as possible in a never-ending environment. At discrete time intervals during its interaction with the system, the agent may choose an action. Depending on the action taken by the agent (and possibly other factors related to the nature of the environment), the state of the system changes, possibly stochastically, into a new configuration. The agent is given a real-valued reward signal, depending on how beneficial the action was. Note that this description is quite flexible: there are no constraints on what actions the agent may take, on the properties of the environment, or on the type of agent.

The Rovers on Mars [4] are typical examples of such a problem. The rovers drive on Mars and explore various scientifically interesting materials. All remote vehicles require a high degree of autonomy due to the restrictions of the communication channels. The rovers face many uncertainties; their sensors are inaccurate and provide only an approximation of the surroundings, or the state. Additionally, the agent may not accurately know the reliability of its actuators, such as its wheels, and Mars’ surface may exhibit unpredictable events such as rolling rocks. In this stochastic environment the rovers aim to optimize their behaviour, or action selection, with respect to their task: exploring Mars.

Markov Decision Processes (MDPs [5]) are a natural mathematical framework for modelling fully observable stochastic problems. An example of such a problem is the task of navigating in a two-dimensional grid where actions have stochastic effects. In this problem the agent starts in an initial location and must reach a goal destination by going either up, down, left or right while avoiding obstacles. In the fully observable scenario, the agent knows exactly where it, the destination and the obstacles are. Each move, however, does not guarantee the agent will end up in the expected next location; there is a chance its action fails and the agent ends up somewhere else due to the stochastic nature of the environment. In an MDP the configuration of a system, the location of the agent in the previous example, is called the state (s). The agent influences the state by choosing an action a at each time step, such as right. During the interaction of the agent with the environment, rewards r are generated by transitions from some state s to a new state s', given a selected action a. The agent may receive a positive reward for reaching the destination, for example. After such a transition the agent perceives the new state s', its new location, and gets to pick a new action. Whether or not the sequence of transitions ever stops depends on the problem: some problems have a specific time bound, while others go on infinitely long or terminate under specific conditions. The navigation problem, for example, typically has no specific time bound, but finishes once the destination has been reached. The Mars Rovers problem, on the other hand, has neither such a goal nor a time bound. Lastly, in environments where the agent has a strict deadline, or is only expected to operate for a certain number of actions or amount of time, the MDP has what we call a finite horizon. The horizon sets the number of actions the agent is allowed to perform before termination. The problem has an infinite horizon when there is no time bound: both the navigation and Mars Rovers examples have an infinite horizon. A single run from start until termination is called an episode, and the accumulation of the rewards over the whole episode is called the return, which represents how well the agent performed.


In addition to the discretization of time, these frameworks also assume that the state transition model satisfies the Markov property [18], which refers to the memoryless nature of the process. The Markov property holds if and only if the transition probabilities, given a state and action, are independent of the history. In other words, the state s fully describes the current configuration of the system. This also ensures that the agent does not need to store previous states, but only depends on the current state for optimal behaviour. MDPs will be defined formally in the next subsection (2.1.1) [20].

In many real-world applications, such as the Mars problem above, it is often impossible to fully and accurately know s. Measurement tools have a certain degree of error, and in many situations only part of the state is observable. In search and rescue operations, for example, the number and location of victims are often unknown, so an MDP cannot accurately model the problem. The Partially Observable Markov Decision Process (POMDP) is a framework that is able to model the partially observable property of such problems. In a POMDP, the agent is not able to perceive the real state s, but receives an observation o after each taken action. The agent estimates the current state by utilizing information from those observations. To be precise, the POMDP is a generalization of the MDP, where the MDP is a special POMDP that provides the current state as an observation. Subsection 2.1.3 provides a more formal description of the POMDP.

2.1.1 MDP

The MDP framework is described by the tuple (S, A, T, R, γ, h) [5].

• S: The set of states in the environment. Each state represents a unique configuration of the system that the agent may encounter.

• A: The set of possible actions in the environment. Note that not every action (a ∈ A) necessarily has to be applicable in each state s, as some states allow different actions than others.

• T : A function that represents the transition probabilities for each state-action-state (s, a, s') tuple. T(s, a, s') describes the probability that the agent will end up in state s' given that it has performed action a in state s: p(s'|a, s).

• R: A function R(s, a, s') that represents the reward r ∈ ℝ received by the agent after selecting action a in state s and moving to state s'.

• γ: The discount factor. In order to encourage the agent to prefer earlier rewards, a reward r at time step t is discounted by γ^t. Discounting later rewards prevents the agent from delaying actions and guarantees that the sum of the rewards remains finite.

• h: The horizon of the system. Some systems have an infinite horizon (h = ∞), while others have a specific, limited number of steps.

In the MDP framework, the agent starts in an initial state s_0 ∈ S. It then iteratively acts and receives feedback for h time steps. During each time step t the agent selects an action a_t ∈ A in a particular state s_t. The transition function T then determines a next state s_{t+1} according to the transition probabilities T(s_t, a_t, s_{t+1}), and the reward function R produces a reward r_t according to R(s_t, a_t, s_{t+1}). The agent ends up in the new state s_{t+1} and receives reward r_t × γ^t.

The discount factor 0 ≤ γ ≤ 1 determines the preference for early results. It encourages the agent to prefer short solutions and prevents the agent from delaying profitable actions. In practice the discount factor is often set close to 0.9. It also simulates the agent's mortality: the agent may die or crash at any moment, denying it reward in later time steps.
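To make the interaction loop concrete, below is a minimal Python sketch that simulates one episode of a toy MDP and accumulates the discounted return; the two-state transition and reward tables are purely hypothetical and only illustrate the (S, A, T, R, γ, h) interface, not any domain used later in this thesis.

```python
import random

# Hypothetical toy MDP: 2 states, 2 actions, purely illustrative.
S = [0, 1]
A = [0, 1]
T = {  # T[s][a] -> list of (next_state, probability) pairs
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(1, 1.0)],           1: [(0, 0.5), (1, 0.5)]},
}
R = {  # R[(s, a, s')] -> reward (unlisted transitions give 0)
    (0, 1, 1): 1.0, (1, 1, 0): 5.0,
}
gamma, horizon = 0.95, 20

def step(s, a):
    """Sample s' ~ T(s, a, .) and return (s', r)."""
    next_states, probs = zip(*T[s][a])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R.get((s, a, s_next), 0.0)

def run_episode(policy):
    """Run one episode of h steps and return the discounted return."""
    s, ret = 0, 0.0
    for t in range(horizon):
        a = policy(s)
        s_next, r = step(s, a)
        ret += (gamma ** t) * r   # the reward at time t is discounted by gamma^t
        s = s_next
    return ret

print(run_episode(lambda s: random.choice(A)))  # return of a random policy
```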


2.1.2 Planning in MDPs

While the definition of an MDP is useful to formalize problems in a mathematical manner, it does not tell the agent how to act. Algorithms that solve MDPs, a process also called planning, come up with policies for the agent. A policy is a description of the behaviour of an agent, and is represented as a mapping from states to actions (s → a): given the current state, a policy describes what action to take. In the navigation problem, where the agent can move in any of the four orthogonal directions, a possible (probably silly) policy is to always go right. A more sensible policy would return actions, based on the current state, such that the agent eventually reaches its goal destination. To come up with a good policy, the agent must be able to differentiate between desirable and undesirable states.

The desirability of a state, given a policy π, depends on the return that the agent expects to gain from being in that state and following π (1). In this recursive definition of the value function, the policy determines the action (π(s)), which determines the probabilities over the next states. The value of s is the sum of the immediate reward of taking that action and the expected value of the new state (still recursively defined), weighted by their probability (T(s, π(s), s')).

$V^{\pi}(s) = \sum_{s' \in S} T(s, \pi(s), s') \left( R(s, \pi(s), s') + \gamma V^{\pi}(s') \right)$ (1)

The better the policy, the higher the expected return of following that policy. More formally, given a distribution p(s) over the initial states of the system at the first time step, the value of a policy is defined as (2).

$V(\pi) = \sum_{s \in S} p(s) \, V^{\pi}(s)$ (2)
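As a small illustration of equations (1) and (2), the sketch below, a rough sketch only that reuses the hypothetical tabular T and R dictionaries from the previous example, evaluates a fixed policy by repeatedly applying equation (1) until the values stop changing, and then weighs the state values by an initial distribution as in equation (2).

```python
# Minimal sketch of iterative policy evaluation for equation (1),
# assuming tabular T[s][a] = [(s', prob), ...] and R[(s, a, s')] dictionaries.
def evaluate_policy(S, T, R, policy, gamma=0.95, eps=1e-6):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            a = policy[s]
            # Equation (1): expected immediate reward plus discounted value of s'
            new_v = sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                        for s2, p in T[s][a])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:
            return V

# Equation (2): value of the policy under an initial state distribution p0.
def policy_value(V, p0):
    return sum(p0[s] * V[s] for s in V)
```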

When the policy selects, for every state s, the action that maximizes the expected return, it is called optimal (denoted by π*). The expected return of applying π* in s is, by definition, maximized and denoted by V*(s). Closely related to the state value function are the q-values Q(s, a): the expected return of executing an action in a particular state and acting according to the policy afterwards (3). Similarly to the optimal value function V*, the optimal policy π* always picks the action a that maximizes Q(s, ·).

$Q(s, a) = \sum_{s' \in S} T(s, a, s') \left( R(s, a, s') + \gamma V(s') \right)$ (3)

So far the equations assumed an infinite horizon, where the values of states are defined recursively and endlessly. Policies for infinite-horizon problems are independent of the time step, and are therefore called stationary. In finite-horizon systems the time step often has a significant effect on the quality of the policy: while some action in a specific state s_a may have a high expected return initially (for t = 0), the same action may be a poor choice at the end of an episode (t = h). Non-stationary policies map states to actions depending on the time step, π_t(s). Additionally, the values of states now also differ per time step (4).

$V^{\pi_t}_t(s) = \sum_{s' \in S} T(s, \pi_t(s), s') \left( R(s, \pi_t(s), s') + \gamma V^{\pi_{t+1}}_{t+1}(s') \right)$ (4)

The agent's goal in an MDP is to receive as high a return as possible. If feasible, this means maximizing the expected return, or following an optimal policy ($\arg\max_a Q(s, a)$). In more realistic scenarios the agent may not have the computational means to calculate the q-values or value function (which are interchangeable) and will have to settle for an approximation.

Over time, multiple algorithms have been designed to solve MDPs, which can be separated into two classes: offline and online methods. Whether a method is offline or online depends on whether it computes the policy during execution. Most earlier approaches, such as Value Iteration (VI [15]), are offline and pre-compute the complete policy before any action is taken. Once the agent starts interacting with the environment, it has a mapping from any possible state to a chosen action, and action selection is reduced to a lookup. While having a full policy and the ability to select an action quickly is desirable, calculating the complete policy becomes computationally demanding in larger problems. Online methods decide on a policy during execution, meaning that in between taking actions the agent computes or updates the policy. While the need for computation time before selecting actions may seem like a disadvantage, in many applications this is not an issue, as computation time is often available in between actions. An advantage of online methods is the ability to focus on the relevant part of the policy, as the agent can calculate the policy for just the current (reachable) states.

Out of the many existing methods, the next two subsections discuss the offline solution method VI [5] and the online Upper Confidence bound applied to Trees (UCT) [17]. VI has historically been one of the most important solution methods, and serves as an introduction to solving MDPs. We discuss UCT because of its importance in understanding Partially Observable Monte-Carlo Planning, the algorithm used in the experiments of this thesis to deal with the curse of dimensionality.

Value Iteration  Value Iteration [15] determines the optimal policy π* by calculating the optimal value function V* (algorithm 1). The core idea is to do so in an iterative manner, where on each iteration the approximation of V* is improved. We know from the definitions of the value function, optimality and q-values above that one way of writing down the optimal value function is as follows (5).

$V^{*}(s) = \max_a \left\{ Q^{*}(s, a) \right\}$ (5)

When writing out Q*(s, a) (6), we reach the well-known recursive equation (7), also called the Bellman Optimality Equation [5], in which the expected value of acting according to π* is a weighted summation over the return of s' with respect to the probability of ending up in s'.

$Q^{*}(s, a) = \sum_{s' \in S} T(s, a, s') \left( R(s, a, s') + \gamma V^{*}(s') \right)$ (6)

$V^{*}(s) = \max_a \left\{ \sum_{s' \in S} T(s, a, s') \left( R(s, a, s') + \gamma V^{*}(s') \right) \right\}$ (7)

The core principle of VI is to start with some arbitrary values for V(s) and repeatedly update these values using the Bellman Optimality Equation (8). With each sweep V(s) approaches V*(s), until the updated values differ from the values before the sweep by less than some threshold ε. At that point V(s) has converged to the true optimal values V*(s) up to some small error ε.

$V_{t+1}(s) = \max_a \left\{ \sum_{s' \in S} T(s, a, s') \left( R(s, a, s') + \gamma V_t(s') \right) \right\}$ (8)

The traditional implementation of VI is straightforward (algorithm 1). Given the MDP definition, initialize two arrays to hold the new and old values for V(s). Then continue to apply the Bellman update in line 7 until no value changes by more than ε, and return the resulting value function. Given V*(s), a complete and optimal policy can then be derived directly from the Q*(s, a) values by taking the maximizing action for each state, as shown by (6). In large state spaces it becomes computationally taxing to sweep through all states until convergence, and it is wasteful in systems where parts of the state space are rarely or never visited. Additionally, VI produces a full policy: an action for each possible state. In reality we are only interested in the reachable states of the domain, and more specifically, only those where we actually end up. In stochastic domains it is unpredictable in which state the agent ends up, so offline methods are required to provide a policy for all reachable states.


Algorithm 1 Value Iteration

1: procedure value-iteration(S, A, T(s, a, s'), R(s, a, s'), ε)
2:     V[S] ← array(∞)                          ▷ Initialized so the first sweep always runs
3:     V'[S] ← array(0)
4:     while ∃s : |V[s] − V'[s]| ≥ ε do          ▷ Continue until convergence
5:         V ← V'
6:         for all s ∈ S do                      ▷ A single update sweep over S
7:             V'(s) ← max_a { Σ_{s'∈S} T(s, a, s')(R(s, a, s') + γ V(s')) }
8:         end for
9:     end while
10:    return V
11: end procedure
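For reference, a compact Python version of the sweep in Algorithm 1 could look as follows; it again assumes the hypothetical tabular T and R dictionaries from the earlier sketches and is only an illustration of the Bellman update, not the implementation used in this thesis.

```python
# Minimal Value Iteration sketch: repeat Bellman updates (equation 8)
# until no state value changes by more than eps.
def value_iteration(S, A, T, R, gamma=0.95, eps=1e-6):
    V = {s: 0.0 for s in S}
    while True:
        V_new = {}
        for s in S:
            V_new[s] = max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in T[s][a])
                for a in A)
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new

# A greedy policy can then be extracted from the converged values via equation (6).
def greedy_policy(S, A, T, R, V, gamma=0.95):
    return {s: max(A, key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                        for s2, p in T[s][a]))
            for s in S}
```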

Online methods, on the other hand, decide which action to take at the current state. These algorithms can create plans for only the relevant part of the state space, such as the states reachable from the current state. One algorithm that plans specifically for the current state is UCT [17], discussed in the next subsection.

UCT  UCT [17] is an online Monte-Carlo Tree Search (MCTS) planning algorithm that, in contrast to VI, returns an action for only the current state.¹ UCT works by estimating the expected value of each of the actions using rollout-based simulations, incrementally building a look-ahead tree with the current state as the root.

A simulation is a sequence of (s, a, r) tuples, starting at the current state s_0, up to some termination state or depth d. One could naively approximate Q(s_0, a) with the average return of n simulations that follow a random policy after taking action a in state s_0. Each q-value would then approximate the return of behaving randomly, which is far from optimal; we are interested in the q-values of the optimal policy. UCT, however, builds up a lookahead tree that approximates the optimal q-values by guiding the search inside the tree towards high-performing policies. This tree contains two types of nodes: decision nodes and transition nodes. The decision nodes represent states and branch into the possible actions. Transition nodes represent actions and branch into the new possible states. Each decision node is associated with a set of counts of how often action a has been chosen in state s (N(s, a)) and the current estimate of the q-values Q(s, a).

Initially the tree consists of a single root decision node (algorithm 2). Starting from the root, while inside the tree, as shown in line 3, the agent simulates taking actions based on these counts and q-values. This part of the algorithm recursively calls the Simulate procedure in line 18, which simulates acting in the actual environment. At some point during the simulation, a leaf of the tree is reached, as that sequence of action-state-reward tuples has not been represented in the tree yet (line 14 returns true). A new decision node is then added, whose counts and q-values are generally initialized to zero, although expert knowledge can induce a prior here. After this, a rollout occurs: a random interaction with the generative model, which produces some return. At the end of the simulation the counts and q-values of the visited nodes are updated in reverse order back up the tree (lines 20 to 22). Updating the values after a simulation is straightforward: the counts N(s, a) of the chosen actions are incremented by 1, and the Q(s, a) values are simply the average over the returns, while respecting the discount factor γ.

The naive version, which would act randomly inside the tree, updates the q-values such that they converge to the value of a random policy. The statistics from the sampled simulations, which are stored in the lookahead tree, can be used to improve on this. Action selection within the tree is an interesting problem: if one picks actions greedily according to the current q-values, then it is possible that important parts of the tree remain unexplored, which results in incorrect q-values for some actions and ultimately in bad policies. If one explores more than necessary when picking actions inside the tree, then simulations continue to visit lower-performing sub-trees, which reduces the number of samples of potentially good sub-trees.

¹As opposed to VI, an explicit representation of the transition probabilities is not necessary for UCT; a generative model (simulator) suffices.

UCT applies Upper Confidence Bounds (UCB [2]) to select actions while traversing the tree, for a principled trade-off between exploration and exploitation. At each decision node in the tree, the action is based on the counts and q-values. The UCB value of each action is the weighted sum of the expected return Q(s, a) and some exploration bonus, as shown in (9). This approach ensures that UCT explores enough in the tree, but will not consider inferior parts of the tree more than necessary, leading to a faster and more accurate convergence of the q-values and the policy.

$a = \arg\max_a \left\{ Q(s, a) + C \sqrt{\frac{\log \sum_{a' \in A} N(s, a')}{N(s, a)}} \right\}$ (9)
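The UCB rule of equation (9) takes only a few lines of Python; in this sketch the node statistics N and Q are assumed to be dictionaries keyed by action, and untried actions are given infinite priority (a common convention that the text does not spell out).

```python
import math

def ucb_action(actions, Q, N, C):
    """Pick the action maximizing Q(s, a) + C * sqrt(log(sum_a N(s, a)) / N(s, a))."""
    total = sum(N[a] for a in actions)
    def ucb(a):
        if N[a] == 0:
            return float("inf")   # untried actions are explored first
        return Q[a] + C * math.sqrt(math.log(total) / N[a])
    return max(actions, key=ucb)
```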

Algorithm 2 Upper Confidence bound applied to Trees (UCT)

1: procedure UCT(s_0, n, C)
2:     for i ← 1 . . . n do
3:         simulate(s_0, 0, C)
4:     end for
5:     return argmax_a(Q^0(s_0, a))
6: end procedure
7:
8: procedure simulate(s, d, C)
9:     if TERMINAL || d == MAXDEPTH then
10:        return 0
11:    end if
12:    a ← argmax_a(Q^d(s, a) + C × sqrt(log Σ_a N^d(s, a) / N^d(s, a)))    ▷ UCB1
13:    (s', r) ← generative_model(s, a)
14:    if not(s' ∈ tree^d) then                                             ▷ Start rollouts if outside tree
15:        ∀s, a : initialize(Q^d(s, a), N^d(s, a))
16:        retval ← r + γ × rollout(s', d + 1)
17:    else
18:        retval ← r + γ × simulate(s', d + 1, C)
19:    end if
20:    N^d(s, a) ← N^d(s, a) + 1                                            ▷ Update values
21:    Q^d(s, a) ← (N^d(s, a) − 1) / N^d(s, a) × Q^d(s, a) + 1 / N^d(s, a) × retval
22:    return retval
23: end procedure

2.1.3 POMDP

The full-observability requirement of the MDP makes it unable to model a large class of systems. In many systems the true state is hidden; consider a software agent tasked with displaying pictures based on the user's mood, given some facial expressions. It is impossible to directly inspect the mood of the user (the state), so the state is hidden. The facial expressions (observations), however, do provide information, which can be used to guess the state. Similarly, agents in physical applications base their actions on measurements from noisy sensors, and thus cannot perfectly observe the true state.

The Partially Observable MDP (POMDP) framework is a generalization of the MDP that can deal with such systems [16]. In a POMDP the agent does not actually know the current state s, and will not perceive s' after taking an action a; the system instead generates an observation o. The observation, e.g. a noisy measurement or only part of the state, provides (limited) information about the state. The POMDP framework extends the dynamics of the MDP by including an observation function (O) that describes the observation probability distribution, given a transition to a new state. The tuple (S, A, T, R, Ω, O, γ, h) describes a POMDP:

• S, A, T , R, γ, h: Identical to the MDP definition.

• Ω: The set of observations in the system. Any observation o possible in the system is an element of this set.

• O: A function that describes the probability that the agent will perceive an observation given the agent's action a and the resulting state transition from s to s' (defined by T). O(a, s', o) is the probability that observation o is generated when reaching state s' after the agent has selected action a (p(o|a, s')).

The POMDP framework introduces a set of observations Ω and an observation function O, and denies direct observation of the current state s_t. Given an initial distribution b_0 over the initial states s_0, an episode now consists of selecting an action and receiving a reward r and observation o ∈ Ω until either a terminal state or h has been reached. The system still transitions from state s to state s' during these steps, according to T(s, a, s'), but this transition is hidden. The system now additionally produces an observation o, given the selected action a and the new state s', according to the observation function O(a, s', o).

Because the state of the system is observable in an MDP, the Markov property holds and a policy is based on the current state only. Since the agent no longer has access to the state of the system in a POMDP, the agent can no longer make use of a memoryless process. As a result, the agent now has to consider the full history, the sequence of actions and observations. The agent may use the perceived observations, which give it some information about the environment, to maintain a belief over the current state s: b(s). This belief is a probability distribution over the states, which describes in which state the agent believes it is. Each state is assigned a probability that represents the likelihood of being in that state, and these probabilities sum to 1 as the agent knows it has to be in some state ($\sum_{s \in S} b(s) = 1$). Given some initial belief b_0, the agent can update this belief online after each observation. A belief update simply considers the current belief and its new experience (an (a, o) pair), and calculates the new belief. Note that a belief is a sufficient statistic of the history; this approach will be used in this thesis as opposed to keeping track of the (possibly infinitely long) history.

Belief Updates  Provided with an initial belief, a selected action and a new observation, the new belief can be obtained using Bayes' rule (equations from [16]):

$b'(s') = P(s' \mid o, a, b)$ (10)

$b'(s') = \frac{P(o \mid s', a, b) \times P(s' \mid a, b)}{P(o \mid a, b)}$ (11)

$b'(s') = \frac{P(o \mid s', a) \times \sum_{s \in S} P(s' \mid a, b, s) P(s \mid a, b)}{P(o \mid a, b)}$ (12)

$b'(s') = \frac{O(s', a, o) \times \sum_{s \in S} T(s, a, s') b(s)}{P(o \mid a, b)}$ (13)

Here (11) is a simple application of Bayesian probability theory, and (12) utilizes the independence between b and o given s' and rewrites the numerator by marginalizing out s.
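A direct implementation of equation (13) is short when the model is available as tables or functions; the sketch below assumes hypothetical callables T(s, a, s2) and O(a, s2, o) that return probabilities, and normalizes by P(o | a, b).

```python
# Minimal sketch of the exact belief update, equation (13),
# assuming callables T(s, a, s2) -> p(s2 | s, a) and O(a, s2, o) -> p(o | a, s2).
def belief_update(b, a, o, S, T, O):
    b_new = {}
    for s2 in S:
        # O(s', a, o) * sum_s T(s, a, s') * b(s)
        b_new[s2] = O(a, s2, o) * sum(T(s, a, s2) * b[s] for s in S)
    norm = sum(b_new.values())          # this is P(o | a, b)
    return {s2: p / norm for s2, p in b_new.items()}
```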

Summing over all states in S is infeasible for any problem of interesting size. For this reason we turn to approximations. Particle Filters are probability approximation methods (used in e.g. [14]), where the quality of the approximation can be controlled. Particle Filters represent probabilities with a bag of particles. There are several types of filters, but the most straightforward version suffices here. In a Particle Filter with n samples, each sample represents a probability of 1/n. When the belief over the state of the environment is represented in such a way, the approximate probability of being in state s_1 is given by the number of times s_1 appears in the Particle Filter.

When planning in POMDPs, at the start of an episode the filter starts with K states sampled from the initial belief b_0. Whenever an action has been executed and an observation has been perceived, the belief is updated via rejection sampling (algorithm 3). Rejection sampling refers to the act of approximating a distribution through samples, while rejecting those that do not fulfil certain criteria. In this case samples (states) from the current belief are drawn in line 4, which are then put into a simulator of the system. The simulator returns a new state and a corresponding observation on line 5. The new state, which is potentially a sample for the new belief, is accepted only if the generated observation matches the observation received from the environment, and is rejected otherwise (lines 6 and 7). When the new belief b' has reached the correct size, the algorithm terminates.

Algorithm 3 The belief update algorithm for particle filters of N particles, given a belief b, taken action a and perceived observation o, using generative model G

1: procedure particle-filter-belief-update(N, b, a, o, G)
2:     b' ← empty()
3:     while size(b') < N do                  ▷ Continue until N elements in new belief b'
4:         s ∼ b
5:         o_sim, s' ∼ G(s, a)                ▷ Generate observation and state from (s, a)
6:         if equals(o_sim, o) then
7:             add s' to b'                   ▷ Add particle if observation matches
8:         end if
9:     end while
10:    return b'
11: end procedure
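In Python, the rejection-sampling update of Algorithm 3 can be sketched as follows; G is assumed to be a generative model returning a sampled next state and observation, as in the pseudocode, and the belief is simply a list of state particles.

```python
import random

def particle_filter_update(particles, a, o, G, N):
    """Rejection-sampling belief update (Algorithm 3): keep sampled successor
    states whose simulated observation matches the real observation o."""
    new_particles = []
    while len(new_particles) < N:
        s = random.choice(particles)      # s ~ b
        s_next, o_sim = G(s, a)           # simulate one step with the generative model
        if o_sim == o:                    # accept only matching observations
            new_particles.append(s_next)
    return new_particles
```

Note that, as in Algorithm 3, this loop can take long when the received observation is unlikely under the current belief.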

Belief MDPs  A belief over the state is also called a belief state. Belief states are continuous, as they are probability distributions, and, because they are a sufficient statistic, the Markov property holds when they are used as MDP states. Belief states allow for a neat trick in POMDPs where the observation function O is folded into the transition function in such a way that the POMDP can be represented by a continuous MDP: the Belief MDP. This MDP is slightly more complicated in its definition, and harder to solve, than its discrete counterpart:

• $\bar{A} = A$, $\bar{\gamma} = \gamma$, $\bar{h} = h$: Identical to the POMDP.

• $\bar{S}$: b(s). The set of beliefs, all possible (continuous) belief distributions over the set of states S.

• $\bar{R}$: R(b, a, b'). Since the internal state of the agent is now represented with a belief, the reward function is defined in terms of the belief b. The expected reward is the weighted summation of the reward of ⟨b, a, b'⟩ transitions: $\sum_{s \in S} \sum_{s' \in S} b(s) \times R(s, a, s') \times b'(s')$.

• $\bar{T}$: P(b, a, b'). The belief state transition function. It is defined as

$P(b' \mid a, b) = \sum_{o \in \Omega} P(b' \mid a, b, o) \times P(o \mid a, b)$

where $P(b' \mid a, b, o)$ is 1 if the belief update returns b', and 0 otherwise. $P(o \mid a, b)$ is the probability of witnessing o after doing action a with a belief over the current state b. This is the probability of observing o given a new state s' and a, multiplied by the probability of reaching s':

$P(o \mid a, b) = \sum_{s' \in S} p(o \mid a, s') \times p(s' \mid a, b)$

where $p(o \mid a, s')$ is the observation function O of the original POMDP and $p(s' \mid a, b)$ is defined by the transition function: $\sum_{s \in S} p(s' \mid a, s) \times b(s)$.
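The quantity P(o | a, b) used in the belief-MDP transition above can be computed directly from the POMDP model; the short sketch below assumes the same hypothetical T and O callables as before, and the result is exactly the normalizing constant of the belief update in equation (13).

```python
# P(o | a, b) = sum_{s'} O(a, s', o) * sum_{s} T(s, a, s') * b(s)
def observation_probability(b, a, o, S, T, O):
    return sum(O(a, s2, o) * sum(T(s, a, s2) * b[s] for s in S) for s2 in S)
```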

2.1.4 Planning in POMDPs

Planning in MDPs involves calculating an (optimal) mapping from states to actions that yields high returns. In POMDPs, policies map from histories or beliefs to actions. In infinite-horizon POMDPs, policies based on the history are practically impossible: policies that grow with the horizon until infinity cannot be defined. A natural perspective on policies in finite-horizon POMDPs is in the form of a tree. The policy tree branches on the possible observations, and returns an action for each possible sequence of ⟨a, o⟩ pairs. The root of the tree is the current, or initial, situation, while the leaves of the tree are the last 'step', at a depth equal to the horizon h, where the agent can only do one final action.

With only one action left, at a depth of h all the agent can do is execute a single action (figure 1). When the agent has two actions left before termination, it can choose a single action first and then base the second action on the perceived observation. With a tree of depth h, the agent may initially select an action, then receive an observation and select a new action h − 1 times. With |A| possible actions and |O| possible observations, there are $\frac{|O|^h - 1}{|O| - 1}$ nodes, and $|A|^{\frac{|O|^h - 1}{|O| - 1}}$ possible policies. Since this is a finite set, a straightforward brute-force solution method is to calculate the expected return for each of these policies and pick the best. While this would work, the number of policies grows exponentially and this approach would surely face performance complications.
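For a small, concrete illustration: with |A| = 2 actions, |O| = 2 observations and horizon h = 3, a policy tree has (2³ − 1)/(2 − 1) = 7 nodes and there are 2⁷ = 128 distinct policy trees; with |O| = 3 and h = 5 there are already (3⁵ − 1)/(3 − 1) = 121 nodes and 2¹²¹ (more than 10³⁶) possible policies, which shows how quickly enumeration becomes infeasible.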

Figure 1: An example of a policy tree. At the root is the currently chosen action; each possible observation then leads to an action. The leaves represent the last action before the end of the episode (assuming a finite horizon).

Many algorithms exist that attempt to efficiently search through the space of possible policies, such as POMDP Value Iteration [7], which is inspired by VI for MDPs. Online methods are currently considered effective approaches to reduce the search space. A popular example, used in this work, is an extension of UCT to POMDPs, discussed below.

Partially Observable Monte-Carlo Planning  An extension of UCT to partially observable environments is Partially Observable UCT (PO-UCT [23]), or more specifically Partially Observable Monte-Carlo Planning (POMCP), which uses PO-UCT to select actions in combination with Particle Filters and the corresponding Monte-Carlo updates to maintain the belief. POMCP constructs a search tree of histories instead of states (algorithm 4), where the root node is the current belief (maintained as shown in algorithm 3). Similar to UCT, the tree is incrementally built, one node per simulation, and a simulation ends in a rollout whenever it reaches a leaf of the tree. Each node in this tree contains the number of visits and the approximated value of that history, ⟨N(h, a), Q(h, a)⟩, just like UCT in MDPs. A notable difference is that a simulation now starts by sampling a state from the belief, s ∼ b(s), in line 3. The actions during simulations inside the tree are selected using UCB: $a = \arg\max_a \left\{ Q(h, a) + C \sqrt{\frac{\log N(h)}{N(h, a)}} \right\}$, where N(h) is the total visit count of history node h. Rollouts are still random simulations up to a terminal state or some horizon.

POMCP uses the history to build its tree and approximate the q-values. The belief is used to sample the initial state of the simulations, such that the approximation of the q-values reflects the belief of the agent. Each node contains the expected return of doing an action a ∈ A given some history h, Q(h, a), and the number of times this has been simulated, N(h, a). During simulations the agent explores various actions, just like in UCT, by utilizing the UCB formula. Eventually the root node contains the expected return for each action (based on the sub-trees) given the agent's belief, and the action that maximizes this is selected.

Algorithm 4 Partially Observable Monte-Carlo Planning

1: procedure POMCP(b(s), n, C)
2:     for i ← 1 . . . n do
3:         s ∼ b(s)
4:         simulate(s, 0, [], C)
5:     end for
6:     return argmax_a(Q(h, a))
7: end procedure
8:
9: procedure simulate(s, d, h, C)
10:    if TERMINAL || d == MAXDEPTH then
11:        return 0
12:    end if
13:    a ← argmax_a(Q(h, a) + C × sqrt(log Σ_a N(h, a) / N(h, a)))    ▷ UCB1
14:    (s', o, r) ← generative_model(s, a)
15:    if not([h, a, o] ∈ tree) then                                   ▷ Start rollouts if outside tree
16:        ∀a : initialize(Q([h, a, o], a), N([h, a, o], a))
17:        retval ← r + γ × rollout(s', d + 1)
18:    else
19:        retval ← r + γ × simulate(s', d + 1, [h, a, o], C)
20:    end if
21:    N(h, a) ← N(h, a) + 1                                           ▷ Update values
22:    Q(h, a) ← (N(h, a) − 1) / N(h, a) × Q(h, a) + 1 / N(h, a) × retval
23:    return retval
24: end procedure

2.2 Learning Frameworks

Planning requires some sort of knowledge of the agent about the system. The previous frameworks assumed direct access to the transition and observation functions (T & O). This assumption can be restrictive, as the dynamics are not known (exactly) in many real-world applications and could be changing over time. This section considers the case where these dynamics are not accurately known a priori. Note that the reward function is always assumed to be known. This is because rewards are generally defined by an operator and because the generated rewards are typically not observable and cannot be learned in a similar fashion.

Let us first consider the fully observable scenario where the transition function is unknown. This setting can be approached as a Fully Observable Bayesian Reinforcement Learning (FOBRL) problem, where the current state is observable but the models are unknown. In FOBRL the agent learns a distribution over the values of the parameters that describe the dynamics of the world. Explicitly learning these probabilities means maintaining a (belief over the) model of the world, and is also called model-based Reinforcement Learning. Other approaches exist, both Bayesian and non-Bayesian, that solve this problem model-free; model-free algorithms learn the value function or Q-values directly. Here we focus on model-based approaches.

FOBRL does not assume that the transition function is known, which makes it possible to model problems in which we are unsure of the dynamics. Most real-world problems, after all, have stochastic elements that are not known (accurately). Casting such problems as MDPs would lead to policies based on faulty models, which in turn results in suboptimal behaviour. By explicitly considering uncertainty over the transition model, the agent can behave optimally with respect to those uncertainties.

In our treatment of FOBRL, a probability distribution over the model is maintained in a Bayesian way, which requires an initial belief over the dynamics. If we additionally assume there is some discrete and finite number of actions and system configurations, and that the performance of the agent can be measured by some function R, then the problem can be formalized as a ⟨S, A, b_0(T), R⟩ tuple, where there are no assumptions about the shape of the distribution b_0.

A key observation here is that it is possible to consider the true transition model T as a hidden part of the state of a POMDP. This approach introduces a hyper-state, which consists of both the (observable) configuration of the system s and the unobservable dynamics of the system T (s̄ = ⟨s, T⟩). With this notion of the hyper-state, it is now possible to cast the FOBRL problem as a continuous and infinite POMDP, where T is unknown to the agent (just like the system's state is unknown in the ordinary POMDP in Section 2.1.3):

• $\bar{A} = A$, $\bar{R} = R$, $\bar{\gamma} = \gamma$, $\bar{h} = h$: Identical to the POMDP definition.

• $\bar{S}$: ⟨s, T⟩, the tuple of the current system's fully observable state and the partially observable transition function.

• $\bar{O}$: The observation function $P(o \mid a, \langle s', T' \rangle)$, which is defined by the indicator function $I_o(s')$ (which returns 1 if s' equals o and 0 otherwise), since the only (partially) unobservable part of this POMDP is T, while s and s' are known.

• $\bar{T}$: The transition from one s̄ to the next s̄' is defined by $P(\langle s', T' \rangle \mid \langle s, T \rangle, a)$, which factors into $P(s' \mid s, T, a) = T(s, a, s')$ and $P(T' \mid T) = I_{T'}(T)$, which is 1 if T' and T are equal and 0 otherwise.

Section 2.1.4 showed how any POMDP can be cast into, and solved as, a belief MDP. Similarly, FOBRL problems can also be modelled as belief MDPs, and solved by algorithms such as POMCP. The rest of Section 2.2 is dedicated to casting FOBRL problems efficiently as belief MDPs (Section 2.2.1), and to how this can be extended to partially observable domains in Section 2.2.2.

2.2.1 BAMDP

Since the FOBRL problem can be cast as a POMDP, we can consider its belief MDP. Although the state of the system is fully observable, the hyper-state s̄ = ⟨s, T⟩ is not. In order to be able to define a policy, the agent maintains a probability distribution over the hyper-states, a belief b(s̄). The initial belief b(s̄_0) = (s_0, P(T)) consists of the initial state of the system along with the prior over the transition function. While interacting with the system, the agent updates its posterior over the transition function P(T), based on the history h, a sequence of (a, o) pairs. As a result, at time t the belief is (s_t, P(T | h_t)). To represent b(s̄) compactly, note that the transition probabilities of an MDP are in the form of Multinomial distributions [11]. Dirichlet distributions are an appropriate prior for such distributions, and come in a relatively simple form.


Dirichlet distributions  The Dirichlet distribution is the multivariate generalization of the Beta distribution and is described by a parameter vector α of N positive real numbers, or counts. Each count i represents the probability of outcome i and the certainty over that probability. For example, a count vector [8, 1, 1] represents experiencing outcomes one, two and three respectively 8, 1 and 1 times. The expected Multinomial distribution is then [0.8, 0.1, 0.1]. A count vector of [800, 100, 100], however, represents a more certain belief over the actual distribution with the same expected probabilities. To generate samples x_1 . . . x_n from a Dirichlet distribution with parameters α, one first samples a Multinomial distribution P ∼ Dir(α), and then draws X ∼ Multi(P). To draw a possible Multinomial distribution from the Dirichlet distribution one applies the Gamma distribution to sample N outcome probabilities p_1 . . . p_N:

$p_i = \frac{y_i}{\sum_{j=1}^{N} y_j}$

where y_i is sampled from the Gamma distribution with density $\frac{y_i^{\alpha_i - 1} e^{-y_i}}{\Gamma(\alpha_i)}$.
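The count-based interpretation is easy to reproduce numerically; the NumPy sketch below (an illustration only) draws plausible Multinomial distributions from Dir(α) for the two count vectors mentioned above and shows that larger counts concentrate the samples around the same expected distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

for alpha in ([8, 1, 1], [800, 100, 100]):
    alpha = np.array(alpha, dtype=float)
    expected = alpha / alpha.sum()          # expected multinomial: [0.8, 0.1, 0.1]
    samples = rng.dirichlet(alpha, size=5)  # plausible multinomials under this belief
    print(expected, samples.std(axis=0))    # larger counts -> samples concentrate
```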

In Bayes-Adaptive Markov Decision Processes (BAMDP) [9], the belief over a transition distribution T(s, a) is represented by a set of count parameters $\phi^a_s$. Informally, count $\phi^a_{s,s'}$ represents how often the transition (s, a, s') has been experienced. Consequently, the posterior update after a transition (s, a, s') consists of just incrementing $\phi^a_{s,s'}$ by 1, which will more formally be written as $\delta^a_{s,s'}$. This results in the following update formula for the counts: $\phi' = \phi + \delta^a_{s,s'}$. The expected transition probability for an (s, a, s') tuple, given this posterior, corresponds to

$P(s' \mid s, a) = \frac{\phi^a_{s,s'}}{\sum_{s_n \in S} \phi^a_{s,s_n}}.$

This results in the following transition model

$T'(s, \phi, a, s', \phi') = \frac{\phi^a_{s,s'}}{\sum_{s_n \in S} \phi^a_{s,s_n}} \times I_{\phi'}(U(\phi, s, a, s'))$

where U(φ, s, a, s') is the result of $\phi + \delta^a_{s,s'}$ and I is the indicator function, which returns 1 if the subscript equals its input and 0 otherwise (as defined in 2.2).
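In code, the Dirichlet-count bookkeeping of the BAMDP amounts to a table of counts per (s, a) pair; the sketch below shows the update U(φ, s, a, s') and the expected transition probability, with a uniform prior count as a hypothetical choice of prior.

```python
def make_prior(S, A, prior_count=1.0):
    """Hypothetical uniform Dirichlet prior: one pseudo-count per (s, a, s')."""
    return {(s, a): {s2: prior_count for s2 in S} for s in S for a in A}

def update_counts(phi, s, a, s_next):
    """U(phi, s, a, s'): increment the count of the experienced transition."""
    phi[(s, a)][s_next] += 1.0

def expected_transition_prob(phi, s, a, s_next):
    """Expected probability under the Dirichlet posterior:
    P(s' | s, a) = phi^a_{s,s'} / sum_{s_n} phi^a_{s,s_n}."""
    counts = phi[(s, a)]
    return counts[s_next] / sum(counts.values())
```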

This leads to the following definition of the BAMDP:

• $\bar{A} = A$, $\bar{R} = R$, $\bar{\gamma} = \gamma$, $\bar{h} = h$: Identical to the MDP definition.

• $\bar{S}$: S × Φ, where S is the set of states in the system, and Φ the infinite discrete space of count vectors φ.

• $\bar{T}$: The transition function $\bar{T}(s, \phi, a, s', \phi')$ in the BAMDP describes the probability of transitioning from one (s, φ) pair to another (s', φ') given an action a, $P(s', \phi' \mid s, \phi, a)$. This function is described above by

$T'(s, \phi, a, s', \phi') = \frac{\phi^a_{s,s'}}{\sum_{s_n \in S} \phi^a_{s,s_n}} \times I_{\phi'}(U(\phi, s, a, s'))$ (14)

With this formulation the FOBRL problem is cast as a BAMDP and defined as an MDP. Note, however, that the state space is countably infinite². Any solution method capable of solving countably infinitely large MDPs is thus applicable.

²Since counts are always incremented by 1, they are traditionally integer values. It is possible to use real-valued counts as well.

2.2.2 BAPOMDP

Now consider a system with uncertainty over both the dynamics (as in the previous section) and the current state (as was the case in Section 2.1.3): the Partially Observable Bayesian Reinforcement Learning (POBRL) problem. An agent in POBRL problems aims to behave as well as possible while learning the dynamics of the environment without access to current state. To do so in a principled way, it must maintain a belief over both the current state and the hidden dynamics of the system. In contrast to the FOBRL definition, where an initial belief over just

²Since counts are always incremented by 1, they are traditionally integer values. It is possible to have real-valued counts, however.


the dynamics of the environment sufficed, the POBRL problem requires a joint distribution over the dynamics and the initial state. In the most general setting, the agent accurately knows neither the transition probabilities $T$ nor the observation probabilities $O$.

In a fashion similar to casting the FOBRL to a POMDP, the POBRL problem can also be thought of as a POMDP, where now not only the transition model, but also the state and observation model are unobserved. This can be done by defining the hyper-state as the underlying state together with the transition and observation functions, $\bar{s} = \langle s, T, O \rangle$. The transition and observation functions are modelled by Dirichlet distributions. In addition to the counts that represent the transition function ($\phi$), parameters $\psi$ describe the observation function $O$. $\psi^a_{s'}$ is a vector of counts, a Dirichlet distribution, that describes the belief over the probability of observations given a new state and action pair $(s', a)$, where $\psi^a_{s',o}$ denotes the number of times observation $o$ has been experienced after action $a$ led to state $s'$. In contrast to the BAMDP, the counts cannot be observed, and updating the counts becomes a core issue. This challenge is met by maintaining a belief over the Dirichlet parameters. Given an initial belief $b_0 = P(s, \phi, \psi)$, the BAPOMDP is defined as follows [22]:

• $\bar{A} = A$, $\bar{R} = R$, $\bar{\gamma} = \gamma$, $\bar{h} = h$: Identical to the MDP definition.

• $\bar{\Omega} = \Omega$: Identical to the POMDP definition.

• $\bar{S}$: $(S \times \Phi \times \Psi)$; compared to the BAMDP, the state of this POMDP contains the additional observation counts $\Psi$.

• $\bar{O}$: The observation function $\bar{O}(a, \bar{s}', o)$ describes the probability of an observation given an action and a new state, just as in the POMDP case ($P(o|a, \bar{s}')$). Here $\bar{s}'$ is the triplet $\langle s', \phi', \psi' \rangle$. The expected probability is a function of the counts, which are of the same type as for the transition function in BAMDPs: $P(o|a, s', \phi', \psi') = \frac{\psi^a_{s',o}}{\sum_{o_n \in \Omega} \psi^a_{s',o_n}}$.

• $\bar{T}$: The transition function $\bar{T}(s, \phi, \psi, s', \phi', \psi')$ in the BAPOMDP describes the probability of transitioning from one $(s, \phi, \psi)$ tuple to another $(s', \phi', \psi')$ given an action $a$, $P(s', \phi', \psi' \mid s, \phi, \psi, a)$. It is described below.

Whereas a BAMDP only considers a posterior over the transition model (the actual state is fully observable) in the form of a count vector, BAPOMDPs consider a joint posterior over the whole state space $\bar{S}$: $P(s, \phi, \psi)$. This joint distribution is simplified using the conditional independence between its components [22]:

$$\bar{T} = P(s', \phi', \psi' \mid s, \phi, \psi, a) = P(s'|s, a, \phi) \times P(\phi'|\phi, s, a, s') \times P(\psi'|\psi, a, s', o)$$

Here $P(s'|s, a, \phi)$ describes the probability of reaching a new state $s'$ from state $s$ after selecting action $a$, under the Dirichlet posterior $\phi$. As shown in Section 2.2.1, these terms correspond to the equations below, where the transitions for the $\phi$ counts are deterministic and simply increment (analogous to the BAMDP framework and to the $\psi$ counts in the observation function):

$$P(s'|s, a, \phi) = \frac{\phi^a_{s,s'}}{\sum_{s_n \in S} \phi^a_{s,s_n}}$$
$$P(\phi'|\phi, s, a, s') = I_{\phi'}(U(\phi, s, a, s'))$$
$$P(\psi'|\psi, a, s', o) = I_{\psi'}(U(\psi, a, s', o))$$
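A minimal sketch of these count-based terms, assuming dense arrays $\phi$ indexed by $(s, a, s')$ and $\psi$ indexed by $(s', a, o)$ (Python/NumPy assumed; names are illustrative):

```python
import numpy as np

n_states, n_actions, n_obs = 3, 2, 2
phi = np.ones((n_states, n_actions, n_states))  # transition counts phi^a_{s,s'}
psi = np.ones((n_states, n_actions, n_obs))     # observation counts psi^a_{s',o}

def expected_transition(phi, s, a):
    """Expected P(s' | s, a, phi) under the Dirichlet posterior."""
    return phi[s, a] / phi[s, a].sum()

def expected_observation(psi, s_next, a):
    """Expected P(o | a, s', psi) under the Dirichlet posterior."""
    return psi[s_next, a] / psi[s_next, a].sum()

def update(phi, psi, s, a, s_next, o):
    """Deterministic count increments U(phi, s, a, s') and U(psi, a, s', o)."""
    phi, psi = phi.copy(), psi.copy()
    phi[s, a, s_next] += 1
    psi[s_next, a, o] += 1
    return phi, psi
```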

This BAPOMDP can be solved by any POMDP solver, but suffers from high dimensionality. The intractability of both large POMDPs and BAMDPs has been mentioned, and BAPOMDPs can be seen as the worst of both worlds. The state space of BAPOMDPs consists of all possible counts for both the observation model and the transition model, so calculating a complete policy would require a mapping from every possible combination of counts (and state) to an action. Additionally, since the state and counts are not directly accessible, a distribution over the possible values is necessary. For this reason we use Particle Filters and sample-based approaches, which have been most successful in overcoming the curse of dimensionality and in tackling BAPOMDPs.


2.2.3 Sample-based Planning for Bayes-Adaptive Models

POMCP on Bayes-Adaptive MDPs is slightly different, as there is no access to a generative model from which to sample a $\langle s', o, r \rangle$ tuple for a given state and action (line 14 in Algorithm 4; note that the rest of the code can be applied directly to the Bayes-Adaptive models). The agent has, instead, a belief over those dynamics, in the form of a joint probability distribution over the dynamics and the current state. Sampling a generative model at every node would require integrating over all possible dynamics models; this costly operation is avoided by a more efficient method [13].

BAMDPs Assuming the fully observable case for now, Bayes-Adaptive Monte-Carlo Planning (BAMCP) is the extension of UCT to Bayes-Adaptive learning. As opposed to drawing samples from the posterior over the transition model at each node, a single transition function $T^i$ is sampled at the start of each simulation $i$ and used as the generative model. In all other aspects, BAMCP is very similar to the UCT procedure described in Algorithm 2. The difference lies in line 13: the generative model for BAMCP is the sampled model $T^i$, drawn from the counts $\phi$ at the root of the tree.

BAPOMDPs Bayes-Adaptive Partially Observable Monte-Carlo Tree Planning (BAPOMCP) combines the root sampling aspects of POMCP and BAMCP; here we take the perspective of extending POMCP to the learning scenario [1]. POMCP relies on a generative model (line 14 in Algorithm 4) to draw a new state, observation and reward ($\langle s', o, r \rangle$) from a state-action pair ($\langle s, a \rangle$). In BAPOMDPs there is no generative model, only a probability distribution over counts $\phi$ and $\psi$. Similar to BAMCP, a sampling step at the root node is added to avoid the costly integration over all models that sampling a generative model at each node would require. Each simulation $i$ begins by sampling not only a state, but also counts from the belief: $\langle s^i_0, \phi^i_0, \psi^i_0 \rangle \sim b(s, \phi, \psi)$. We then sample a generative model from the counts at each step during the simulation, and update the counts as we go according to (14).
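A possible sketch of this root-sampling step, assuming the belief is represented as a weighted particle filter (Python/NumPy assumed; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_root_particle(particles, weights):
    """Root sampling in BAPOMCP: draw one <s, phi, psi> particle from the belief."""
    w = np.asarray(weights, dtype=float)
    idx = rng.choice(len(particles), p=w / w.sum())
    s, phi, psi = particles[idx]
    # Copy the counts so updates made during one simulation do not alter the belief itself
    return s, phi.copy(), psi.copy()
```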

Sampling in more detail Both BAMCP and BAPOMCP perform a sampling operation at the start of each simulation. The specific samples, however, are not drawn from the same type of distribution. BAMCP is used in BAMDPs, where the belief consists of a single set of counts $\phi$ that represents the probability distribution over the transition function. At the start of each simulation, an actual transition model $T \sim \phi$ is sampled from these counts and used as the generative model during that simulation.

Recall that $\phi$ is a set of vectors of size $|S|$, one for each of the $|A| \times |S|$ state-action pairs, and that $T$ assigns a probability distribution over $S$ to each state-action pair $(s, a)$. Sampling a probability distribution $p_1 \ldots p_{|S|}$ from $\phi$ for each $(s, a)$ amounts to sampling a Multinomial distribution from a Dirichlet distribution. To sample a Multinomial distribution from a Dirichlet distribution of size $N$ with counts $\alpha_1 \ldots \alpha_N$, $N$ random samples $y_1 \ldots y_N$ are drawn according to the Gamma distribution, $y_i \sim Gamma(\alpha_i, 1)$, and then normalized (as mentioned in Section 2.2.1).
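A sketch of this root sampling of a full transition model from the counts $\phi$, under the same dense array layout as before (Python/NumPy assumed; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_transition_model(phi):
    """Sample a full transition function with T(s, a, .) ~ Dir(phi^a_s) for every (s, a)."""
    T = np.empty_like(phi, dtype=float)
    n_states, n_actions, _ = phi.shape
    for s in range(n_states):
        for a in range(n_actions):
            y = rng.gamma(shape=phi[s, a], scale=1.0)  # y_i ~ Gamma(alpha_i, 1)
            T[s, a] = y / y.sum()                      # normalize into a distribution over S
    return T
```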

In BAPOMDPs the agent maintains a belief over the counts (in addition to the current state), $P(s, \phi, \psi)$, meaning that it assigns a probability to each possible combination of counts. The sampling step involves drawing a specific combination of $s$, $\phi$ and $\psi$ from that probability distribution (i.e. sampling a particle from the particle filter). BAPOMCP updates these counts and uses them to sample generative models at each step, similarly to how BAMCP samples one at the start of each simulation. Technically, this replaces line 14, the generative model in Algorithm 4, with the code in Algorithm 5 (the reward model $R$ is considered known).

Note The generative models in the experiments in Section 6 systematically use the expected models, instead of samples from the counts. This is motivated by the observation that, given a noisy prior, each particle in the belief contains a different set of counts, so taking the expected model (instead of a sample) still takes the uncertainty over the model into consideration. Additionally, calculating the expected model is computationally far cheaper than sampling one, allowing for more simulations per time step.


Algorithm 5 Generative model for BAPOMCP

1: procedure generative_bapomcp_model(s, ⟨φ, ψ⟩, a, R)
2:     generative_model ∼ ⟨φ, ψ⟩
3:     (s′, o, r) ∼ [generative_model(s, a), R]    ▷ Use the generative model as a POMDP would
4:     φ[s, a, s′]++    ▷ Update counts
5:     ψ[s′, a, o]++
6:     return (s′, o, r)
7: end procedure

A complete proof that this leads to the same value function is left for future research. Equation (15) describes how to calculate the expected transition model over a state-action pair from a Dirichlet distribution with parameters $\alpha_1 \ldots \alpha_n$. This equation is used in line 2 of Algorithm 5 to build the generative model.

$$P(i) = \frac{\alpha_i}{\sum_{j=1}^{n} \alpha_j} \quad (15)$$
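A possible sketch of this expected-model variant of the generative step in Algorithm 5, under the same count layout as before (Python/NumPy assumed; the reward function and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_model(counts):
    """Expected categorical distribution of a Dirichlet: P(i) = alpha_i / sum_j alpha_j (Eq. 15)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def generative_step(s, phi, psi, a, reward_fn):
    """One generative call of Algorithm 5, with line 2 replaced by the expected models."""
    s_next = rng.choice(phi.shape[-1], p=expected_model(phi[s, a]))
    o = rng.choice(psi.shape[-1], p=expected_model(psi[s_next, a]))
    r = reward_fn(s, a, s_next)
    phi[s, a, s_next] += 1   # count updates, as in lines 4-5 of Algorithm 5
    psi[s_next, a, o] += 1
    return s_next, o, r
```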


3 Factored Bayes-Adaptive POMDPs

Unless describing toy problems, a large number of parameters is required to describe the dynamics of MDPs and POMDPs, which causes two problems for Bayesian approaches. First, more parameters means that more data is required to learn them. Second, dealing with probability distributions over large numbers of parameters becomes computationally expensive in Bayesian algorithms.

Regular definitions of MDPs represent states as unique identities ($s$), and the dynamics as look-up tables ($T(s, a, s')$), such that each parameter represents a single transition probability $(s, a, s')$. Most applications, however, possess structure that is not being exploited. Consider a cleaning agent in a two-dimensional grid, tasked with vacuuming dirty cells. Whenever the agent attempts to move east, its new location is the current location plus one step to the right (assuming the move is always successful). This is the case regardless of the y-position, and is also independent of which cells are dirty. In the regular BAMDP belief updates, however, such a transition would only affect the parameters concerned with that particular state, where the agent is on that specific y-position and certain cells are dirty. If the agent ever returns to a similar state, but with a different set of dirty cells, it will not be able to exploit the knowledge it could have had, had it known that its new location (when going east) only depends on its current x-position.

This work describes and unifies factored MDPs with partial observability and Bayesian learning into the Factored Bayes-Adaptive POMDP (FBAPOMDP). Section 3.1 provides the description of the FBAPOMDP, starting with the factored MDP definition, followed by a summary of the set of belief states and how to represent them in Section 3.2. POMCP as the approach for solving such frameworks is then described in Section 3.3.

3.1 Factored MDP frameworks

This section covers all relevant MDPs up to the Factored BAPOMDP. Factorization is introduced incrementally by describing the MDP transition function in factored form using Dynamic Bayesian Networks [6] in Sections 3.1.1 and 3.1.2. We then specify the Bayes-Adaptive version of this factored MDP (the FBAMDP, in Section 3.1.3), and eventually re-introduce partial observability with the factored BAPOMDP (Section 3.1.4).

3.1.1 Factored MDP

The Factored MDP (FMDP) uses a factored representation of the state and transition function, in order to describe the transition function with fewer parameters and to exploit its structure [6]. It does so by considering a state $s$ as a collection of features $X = \{X_1 \ldots X_n\}$ and representing the transition function with a dynamic Bayesian network (DBN) [8]. Each action is associated with a DBN, which describes the conditional transition probabilities between states.

Dynamic Bayesian Networks as models A DBN is defined as a two-layer directed graph $G$ together with the conditional probability distributions of each node, where the nodes are the state features [12]. Arcs between nodes connect the state features $X$ at the current time with the state features at the next step, $X'$. An arc between node $X_a$ and $X'_b$ means that the transition probability of the current state feature $X_b$ to $X'_b$ depends on the value of $X_a$. If $PV_G(X'_i)$ denotes the value of the parents of feature $X'_i$, then the probability of the new state $X'$ given the graph $G$ is defined by the joint probability of $\{X'_1 \ldots X'_n\}$ given the current values of $\{X_1 \ldots X_n\}$:

$$P_G(X'|X) = \prod_i P_G(X'_i \mid PV_G(X'_i)).$$
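As an illustration, a small sketch of how such a two-layer DBN can be stored as parent sets plus conditional probability tables, and how $P_G(X'|X)$ is computed as the product over features (Python assumed; the structure and numbers are invented purely for illustration):

```python
# A toy two-layer DBN over three binary state features.
parents = {0: [0], 1: [0, 1], 2: [2]}   # parents[i]: indices of features X that X'_i depends on

# cpt[i] maps a tuple of parent values to a distribution over the values of X'_i
cpt = {
    0: {(0,): [0.9, 0.1], (1,): [0.2, 0.8]},
    1: {(0, 0): [0.7, 0.3], (0, 1): [0.4, 0.6], (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]},
    2: {(0,): [0.8, 0.2], (1,): [0.3, 0.7]},
}

def transition_probability(x, x_next):
    """P_G(X'|X) = prod_i P_G(X'_i | PV_G(X'_i))."""
    prob = 1.0
    for i, ps in parents.items():
        parent_values = tuple(x[p] for p in ps)
        prob *= cpt[i][parent_values][x_next[i]]
    return prob

print(transition_probability((1, 0, 1), (1, 1, 0)))  # product of three CPT entries
```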

Figure 2 shows an example of two DBNs that describe different structures for an imaginary transition function between states with three features $\{X_1, X_2, X_3\}$. Features in highly structured problems have few transition conditions, which results in smaller DBNs (fewer connections) and thus fewer parameters. The graph in Figure 2a, for example, would require $|X_1| \times |X_2| \times |X_3|$ parameters to describe the conditional probabilities for node $X_2$ at $t + 1$, while graph 2b only requires $|X_1| \times |X_2|$ parameters.


Figure 2: Two examples of a Dynamic Bayesian Network: (a) a fully connected graph, and (b) a partially connected graph where $X_2$ and $X_3$ only affect $X_2$ and $X_3$ respectively in the $t+1$ layer. The arrows indicate conditional dependencies.

If the cleaning agent in the earlier example recognizes that its new x-position when going east is only conditioned on its previous x-position, then it would be able to decrease the number of parameters needed to describe the model and to generalize over all states with that specific x-position. If we replace the transition function $T$ in MDPs (traditionally represented as a table) with DBNs, we arrive at the following definition of the FMDP:

• $A, R, h, \gamma$: Identical to the MDP definition.

• $S$: Set of states, where states are represented by their state features $X_1 \ldots X_n$.

• $T$: A function that represents the transition probabilities for each $(X, a, X')$ triplet. $T(X, a, X')$ describes the probability that the agent will end up in state $X'$ given that it has performed action $a$ in state $X$: $P(X'|a, X)$. This is represented as a Dynamic Bayesian Network $G_a$ for each action $a$, where the transition probability is defined as $T(X, a, X') = P(X'|a, X) = \prod_i P_{G_a}(X'_i \mid PV_{G_a}(X'_i))$.

3.1.2 Factored POMDP

The extension of the FMDP to the Factored Partially Observable Markov Decision Process (FPOMDP) is conceptually equal to the generalization of the MDP to the POMDP. Given a factored state representation, the observation model can also be represented as a DBN, where an observation $o$ is defined by its features $Y$. This leads to the following definition:

• S, A, R, γ, h: Identical to the MDP definition.

• $\Omega$: Set of observations, where the observations are described by their features $Y_1 \ldots Y_n$.

• T : Identical to the FMDP definition.

• $O$: A function that represents the observation probabilities for each $(a, X', Y)$ tuple. $O(a, X', Y)$ describes the probability that the agent will perceive observation features $Y$ given that it has performed action $a$ in state $X$ and ended up in state $X'$: $P(Y|a, X')$. This is represented as a Dynamic Bayesian Network $G_a$ for each action, where the probability is defined as $O(a, X', Y) = P(Y|a, X') = \prod_i P_{G_a}(Y_i \mid PV_{G_a}(X'_i))$.
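Analogously, a small sketch of the factored observation function for a single action, with each observation feature conditioned on a subset of the new state features (Python assumed; the structure and numbers are invented purely for illustration):

```python
# Observation DBN for one action a over two binary observation features.
obs_parents = {0: [0], 1: [1, 2]}   # obs_parents[i]: indices of X' that Y_i depends on
obs_cpt = {
    0: {(0,): [0.95, 0.05], (1,): [0.1, 0.9]},
    1: {(0, 0): [0.8, 0.2], (0, 1): [0.6, 0.4], (1, 0): [0.4, 0.6], (1, 1): [0.05, 0.95]},
}

def observation_probability(x_next, y):
    """O(a, X', Y) = prod_i P_{G_a}(Y_i | PV_{G_a}(X'_i))."""
    prob = 1.0
    for i, ps in obs_parents.items():
        prob *= obs_cpt[i][tuple(x_next[p] for p in ps)][y[i]]
    return prob

print(observation_probability((1, 0, 1), (1, 1)))
```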
