
Encouraging Physical Activity through Reinforcement Learning with Limited Feedback

Kylian van Geijtenbeek
11226145

Bachelor thesis
Credits: 18 EC
Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: dr. S. Wang
Amsterdam Machine Learning Lab
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Abstract

A major objective in reinforcement learning is to learn a policy as quickly as possible. However, in many real-world situations, feedback is a limited resource and free exploration may introduce unwanted risk for user satisfaction. One such reinforcement learning problem involves an agent that has to send notifications at the appropriate moments in order to encourage a person to perform physical activity. In such a case, feedback cannot be requested too frequently and free exploration may annoy the user. Therefore, one task is to select an optimal policy from a set of policies that are available and known to be safe. Determining which of the available policies that encourage the user performs best within a short amount of time is a non-trivial task, and requires efficient use of feedback. In this thesis, several algorithms are proposed that efficiently search for the optimal policy in a limited set of available policies by exploiting similarity between encountered contexts. Moreover, the Context Enhanced Π≤b algorithm is proposed for extending the policy selection algorithms to safely improve an available policy. This thesis demonstrates that context similarity can be used to estimate hypothetical rewards for unused available policies, enabling an agent to rapidly switch to a better policy. It is also demonstrated that the Context Enhanced Π≤b algorithm can vastly outperform the regular Π≤b algorithm by exploiting context similarity, even as the dimensionality of the context space grows.


Contents

1 Introduction
 1.1 Related work
 1.2 Thesis contribution
2 Experimental setup
 2.1 Behavior model generation
 2.2 Policy generation
 2.3 Reward Definition
 2.4 Introducing more realism
3 Algorithms
 3.1 Policy selection algorithms
  3.1.1 Simple ε-greedy
  3.1.2 Reward Transfer
  3.1.3 MLE for Gaussian Mixture Model
  3.1.4 RBF Weighted Average
  3.1.5 K-Nearest Neighbors
 3.2 Context Enhanced SPIBB for policy improvement
4 Results
 4.1 Policy selection algorithms
  4.1.1 Average reward
  4.1.2 Best policy identification
  4.1.3 Trends for different reward setups
 4.2 Context Enhanced SPIBB
5 Conclusion
6 Discussion


1 Introduction

In real-world situations where reinforcement learning relies on human feedback to learn, a reinforcement learning agent cannot request feedback endlessly. After all, the user is likely to perceive constant requests for feedback as a negative experience and may stop interacting with the agent altogether. Moreover, an agent cannot carelessly explore unknown policies, since this introduces a considerable risk to the overall user experience. One such example, which will be used throughout this thesis, is a situation where a reinforcement learning agent has to encourage a human to perform physical activity by sending notifications that suggest the user could go for a run. In such a case, the agent has to be careful not to send notifications excessively, which limits the feedback on sending notifications. Moreover, free exploration may introduce a risk of user dissatisfaction if the agent makes many bad decisions. Because of reinforcement learning problems such as these, it is important to study how a reinforcement learning policy can be learned as quickly as possible with only very little feedback.

1.1 Related work

Using feedback efficiently and improving safely are not new areas of research for reinforcement learning. This section will describe the most relevant related work regarding the efficient use of feedback and quick, yet safe, improvement of policies.

It has been demonstrated by Li, Chu, Langford, and Wang (2011) that off-policy evaluation – where the performance of an unknown target policy is estimated using data generated by a logging policy, or behavior policy – can be utilized in a contextual bandit problem to evaluate the performance of a target policy. Besides showing the effectiveness of this method with an empirical analysis, it is demonstrated that the method guarantees an unbiased estimate. However, such claims require the logging policy to randomly select actions during data collection. Such a requirement quickly becomes infeasible when one wishes to estimate a target policy using data generated by an already in-place warm start policy. Moreover, they assume that a long stream of data is available for evaluation, an assumption that does not always hold when applying reinforcement learning in real-world environments where feedback is limited.

Further research (Strehl, Langford, Li, & Kakade, 2010) has demonstrated that the requirement of randomized data collection imposed by Li, Chu, Langford, and Wang (2011) can be lifted while still yielding positive results. This allows for data collection using a warm start policy, avoiding undesirable behavior by the agent. However, the method fails to select the optimal policy from a set of available policies when there is little data to estimate the performance of this optimal policy. Even when a large amount of data is available, the optimal policy will not be found if the data does not contain sufficient context-action pairs that the optimal policy would have chosen in those previously experienced contexts. Furthermore, Strehl, Langford, Li, and Kakade (2010) performed their experiments on a large amount of data, which impedes the extension of their results to reinforcement learning environments with little available data.

Cohen, Yu, and Wright (2018) have introduced a Diverse Exploration (DE) strategy for a Markov Decision Process (MDP), which is able to quickly and safely learn improved policies while starting with a baseline policy. They generate a new set of diverse policies that are verified to perform well on data generated by a set of policies known to be safe, where safe is defined as having a high probability of performing at least as well as any policy explored thus far.

More promising work has been done on safe policy improvement using a baseline policy in an environment where feedback is limited. Laroche and Trichelair (2018) provide a series of algorithms for MDPs that are proven to return policies with a high probability of being an improvement relative to the baseline policy. By relying on the baseline policy in case of uncertainty, they overcome the problem of frequently having too little data to adequately predict the performance of some state-action pair. This makes their algorithms suitable for operating in sparse-feedback environments where free exploration carries high risk.

Follow-up research on these algorithms (Simão & Spaan, 2019) has proposed a method that uses independence between environment features in factored environments to avoid sub-optimal choices and allow for even faster improvement of a policy. Besides ensuring safe policy improvement, Simão and Spaan (2019) use context information to aid the policy improvement, which has not been done in preceding research.

1.2 Thesis contribution

Previous research has shown that, while progress has been made in off-policy evaluation, the proposed algorithms for off-policy evaluation often require vast amounts of data to perform well (Li, Chu, Langford, and Wang, 2011; Strehl, Langford, Li, and Kakade, 2010). Attempts at solving this data problem have mainly focused on improving existing policies (Cohen, Yu, and Wright, 2018; Laroche and Trichelair, 2018; Simão and Spaan, 2019). However, in some cases where data availability is very limited, and where free exploration introduces high risk for either system safety or user satisfaction, it may be desirable to determine which policy in a fixed set of policies is optimal, instead of having to compute many new policies. Therefore, the question arises: How can an optimal policy be selected from a set of policies using very little feedback? To answer this question, this bachelor thesis aims to study the possibility of exploiting historical data and context similarity to quickly estimate a policy's performance. This thesis will specifically focus on a contextual multi-armed bandit setup.

In this thesis, the proposed problem is tackled in two stages. Firstly, several algorithms are proposed for estimating the performance of a policy and subsequently switching to the policy with the highest expected reward. Secondly, these selection algorithms are combined with an extended version of the Π≤b algorithm for Safe Policy Improvement with Baseline Bootstrapping (SPIBB) proposed by Laroche and Trichelair (2018). This extended algorithm not only estimates which available policy will perform best, but also employs an improved version of that policy to increase the agent's performance.

2 Experimental setup

This thesis aims to solve the limited feedback problem for a k-armed contextual bandit. A contextual bandit problem is a reinforcement learning problem where an agent observes a context and has to select a suitable action to perform.

We denote the context as the vector x, sampled from context space X. The agent may then choose to perform an action A from the set of available actions A = {a1, . . . , ak}. The agent chooses this action based on the policy π that it employs. Given a stochastic policy π, the probability that this policy chooses action a in context x is denoted as π(a|x). Subsequently, the agent receives a reward R_t = q_π(x, a) for choosing action a under context x.


The proposed algorithms will be tested using a model that is based on a real-life situation where only a limited number of trials may be performed. The simulated situation is one where the agent has to encourage a human to go for a run. The agent can do this by sending notifications that suggest that the user could go for a run. At each decision point, the agent can – based on the context that the user is in – decide to either send a notification (a = 1) or not send a notification (a = 0) to the user, i.e. A = {0, 1}. Therefore, the bandit problem at hand is a two-armed contextual bandit. It is assumed that the user is more likely to run when a notification is sent. The model M that simulates whether the user will run will be called the behavior model. Given this behavior model, a set of policies Π will be randomly generated. A policy π ∈ Π provides a mapping from context to action. It is the agent's task to determine which of these generated policies suits the user best, by employing only one of them at a time. This employed policy will be called the behavior policy πb – borrowing from the off-policy evaluation terminology.

The basic experimental setup that will be used to evaluate the policy selection algorithms is one where both the user behavior and policy behavior are modelled deterministically, i.e. there is no randomness involved in any decision-making process. Later, this basic setup is slightly altered to ensure that SPIBB can be used in the setup, and also to make the setup more realistic – a user is very unlikely to be deterministic. This altered setup will be used to evaluate the SPIBB-based algorithms.

2.1 Behavior model generation

The context space will be the continuous d-dimensional space X = [0, 10)^d. At each decision point, a context x will be sampled uniformly at random from this context space. In this thesis, the deterministic setup uses two-dimensional contexts, i.e. d = 2.

A set G_M = {(µ1, Σ1), . . . , (µN, ΣN)} of a random number N of randomly generated normal distributions will be generated. For this thesis, N will be sampled from the discrete uniform distribution U(2, 3), each distribution mean µi will be sampled from the continuous uniform distribution over the context space X, and each distribution covariance matrix Σi will be a diagonal matrix where each diagonal value is sampled from the continuous uniform distribution U(2, 4).

The probability of the user – modelled as behavior model M – running given a context x will be defined using the likelihood of x under the sum of Gaussians in G_M, where the likelihood is renormalized such that max_{x∈X} P_M(run = 1|x) = 1:

$$P_M(\mathrm{run} = 1 \mid x) = \frac{\sum_{(\mu,\Sigma) \in G_M} \mathcal{N}(\mu, \Sigma \mid x)}{\max_{y \in \mathcal{X}} \sum_{(\mu,\Sigma) \in G_M} \mathcal{N}(\mu, \Sigma \mid y)}, \qquad (1)$$

where N(µ, Σ|x) is a regular multivariate Gaussian density, defined as

$$\mathcal{N}(\mu, \Sigma \mid x) = \frac{\exp\left(-\frac{1}{2}(x - \mu)^{T} \Sigma^{-1} (x - \mu)\right)}{\sqrt{(2\pi)^{d} |\Sigma|}}.$$

Note that the normalizing constant – i.e. the denominator in equation (1) – only has to be computed once for it to be used for every context x.

When the likelihood as defined in equation (1) exceeds some threshold p, the user will run. Otherwise, the user will not run. Moreover, the likelihood of running will be increased by a constant ∆ when a notification is sent by the agent. Therefore, the probability of running under context x and the agent's action a is defined as

$$P_M(\mathrm{run} = 1 \mid x, a = 0) = \mathbb{1}_{P_M(\mathrm{run} = 1 \mid x) > p}, \qquad (2)$$

$$P_M(\mathrm{run} = 1 \mid x, a = 1) = \mathbb{1}_{P_M(\mathrm{run} = 1 \mid x) + \Delta > p}. \qquad (3)$$

In this thesis, the values p = 0.5 and ∆ = 0.2 were used.
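To make this construction concrete, the following is a minimal sketch (not the thesis code) of generating a deterministic behavior model and querying it, assuming Python with NumPy and SciPy; the grid-based approximation of the renormalization constant and all names are illustrative choices.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 2                                     # context dimensionality in the deterministic setup
n_gaussians = rng.integers(2, 4)          # N ~ U(2, 3), inclusive
means = rng.uniform(0.0, 10.0, size=(n_gaussians, d))
covs = [np.diag(rng.uniform(2.0, 4.0, size=d)) for _ in range(n_gaussians)]

def mixture_likelihood(x):
    # unnormalized sum of Gaussian densities at x (numerator of equation (1))
    return sum(multivariate_normal.pdf(x, mean=m, cov=c) for m, c in zip(means, covs))

# Approximate the renormalization constant max_y sum N(mu, Sigma | y) on a grid,
# since the exact maximum over the continuous context space has no closed form here.
grid = np.stack(np.meshgrid(*[np.linspace(0, 10, 101)] * d), axis=-1).reshape(-1, d)
norm_const = mixture_likelihood(grid).max()

def p_run(x, a, p=0.5, delta=0.2):
    # deterministic behavior model of equations (2) and (3): returns 0.0 or 1.0
    boost = delta if a == 1 else 0.0
    return float(mixture_likelihood(x) / norm_const + boost > p)

x = rng.uniform(0.0, 10.0, size=d)        # one context drawn from [0, 10)^2
print(p_run(x, a=0), p_run(x, a=1))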

2.2 Policy generation

The set of available deterministic policies Π that the agent has to choose from will be generated similarly to the behavior model. All policies will be generated independently of each other.

A policy πi ∈ Π will be created by generating a set of randomly created Gaussians Gi = {(µ1, Σ1), . . . , (µN, ΣN)} in context space X. The number of Gaussians will be sampled from the discrete uniform distribution U(2, 3). The mean µ of each normal distribution will be sampled uniformly at random from context space X, and the diagonal values of each covariance matrix Σ will be sampled from the continuous uniform distribution U(2, 4), while all off-diagonal values are set to zero.

Whether a policy πi regards it likely that the user will run under context x is then defined similarly to equation (1), i.e. as the likelihood of x under the sum of Gaussians in Gi, renormalized such that the maximum possible likelihood equals 1:

$$P_{\pi_i}(\mathrm{run} = 1 \mid x) = \frac{\sum_{(\mu,\Sigma) \in G_i} \mathcal{N}(\mu, \Sigma \mid x)}{\max_{y \in \mathcal{X}} \sum_{(\mu,\Sigma) \in G_i} \mathcal{N}(\mu, \Sigma \mid y)}. \qquad (4)$$

The probability of deterministic policy πi sending a notification (a = 1) is then 1 if P_{πi}(run = 1|x) exceeds some threshold, and 0 otherwise. The threshold used in this thesis is 0.5. Policy πi's context-action probabilities are thus defined as

$$\pi_i(a = 0 \mid x) = \mathbb{1}_{P_{\pi_i}(\mathrm{run} = 1 \mid x) \leq 0.5}, \qquad (5)$$

$$\pi_i(a = 1 \mid x) = \mathbb{1}_{P_{\pi_i}(\mathrm{run} = 1 \mid x) > 0.5}. \qquad (6)$$
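Analogously, a deterministic candidate policy can be sketched as follows; the construction mirrors the behavior model sketch in section 2.1, and the grid-based renormalization is again an illustrative approximation.

import numpy as np
from scipy.stats import multivariate_normal

def make_policy(rng, d=2, grid_points=101):
    n = rng.integers(2, 4)                                   # U(2, 3) Gaussians per policy
    means = rng.uniform(0.0, 10.0, size=(n, d))
    covs = [np.diag(rng.uniform(2.0, 4.0, size=d)) for _ in range(n)]

    def likelihood(x):
        return sum(multivariate_normal.pdf(x, mean=m, cov=c) for m, c in zip(means, covs))

    grid = np.stack(np.meshgrid(*[np.linspace(0, 10, grid_points)] * d), axis=-1).reshape(-1, d)
    norm_const = likelihood(grid).max()                      # approximate renormalization (eq. 4)

    def policy(x):
        # action 1 (send notification) iff the normalized likelihood exceeds 0.5 (eqs. 5 and 6)
        return int(likelihood(x) / norm_const > 0.5)

    return policy

rng = np.random.default_rng(1)
candidates = [make_policy(rng) for _ in range(10)]           # |Π| = 10, as in the experiments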

2.3 Reward Definition

The reward Rt given to behavior policy πb for taking action a in context x at time t equals 1 if a notification is sent (a = 1) and the user runs. The reward equals 0 if a notification was sent and the user does not run, or if no notification was sent (a = 0) and the user runs. If no notification is sent and the user does not run, the reward equals 0.1. This can be written as

$$q_{\pi_b}(x, a = 1) = \begin{cases} 1 & \text{with probability } P_M(\mathrm{run} = 1 \mid x, a = 1), \\ 0 & \text{with probability } 1 - P_M(\mathrm{run} = 1 \mid x, a = 1), \end{cases} \qquad (7)$$

$$q_{\pi_b}(x, a = 0) = \begin{cases} 0.1 & \text{with probability } 1 - P_M(\mathrm{run} = 1 \mid x, a = 0), \\ 0 & \text{with probability } P_M(\mathrm{run} = 1 \mid x, a = 0). \end{cases} \qquad (8)$$

Note that, since the behavior model is deterministic, the probabilities in equations (7) and (8) will, for now, always be either 0 or 1.

Although the user not running when no notification is sent is a positive outcome, it is much easier to predict, since the user is, overall, much more likely not to run. Therefore, such cases should arguably not be rewarded as highly as the case where the agent correctly sends a notification, and are thus given a reward of 0.1. Although the results in section 4 are obtained under the aforementioned reward setup, multiple reward definitions for the particular case where no notification is sent and the user does not run have been experimented with, and the observed trends are reported in section 4.1.3.
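A minimal sketch of this reward definition, assuming a behavior-model function p_run(x, a) as in the earlier sketch; the function names are illustrative.

import numpy as np

rng = np.random.default_rng(2)

def reward(x, a, p_run):
    # Sample R_t for taking action a in context x; p_run(x, a) is the behavior model
    # from section 2.1 (0 or 1 in the deterministic setup, a probability otherwise).
    runs = rng.random() < p_run(x, a)
    if a == 1:
        return 1.0 if runs else 0.0    # useful notification vs. wasted notification
    return 0.1 if not runs else 0.0    # correctly silent vs. missed opportunity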

2.4 Introducing more realism

In order to ensure that SPIBB can be used and to make the setup more realistic, the setup can be slightly altered. This section only describes the changes that were made to the setup described in sections 2.1, 2.2 and 2.3 to evaluate the SPIBB algorithms. Other than these changes, the setup is exactly the same. This altered setup will only be used to evaluate the SPIBB algorithms.

First of all, the context space X will be made discrete instead of continuous, and it will consist of ten dimensions, i.e. d = 10. Subsequently, a context x will be sampled uniformly at random from the discrete ten-dimensional context space [0, 10]^10. This ensures that a context-action pair can, in theory, be observed multiple times, which would be near impossible if a context were sampled from a continuous space. Moreover, a realistic context is likely to consist of more than two features, making ten dimensions a more realistic choice. Since the likelihood decreases very fast in higher-dimensional spaces, the behavior model and candidate policies will be created using a sum of five Gaussians, i.e. |G_M| = 5 and |G_i| = 5 for all policies πi. The covariance matrix diagonals are sampled uniformly at random from the discrete uniform distribution U(25, 35).

With the behavior model and policy setup as described in sections 2.1 and 2.2, it cannot occur that both (x, a = 0) and (x, a = 1) are observed under the same behavior policy πb. Also, the reward is deterministic for any context-action pair (x, a). In order to better simulate real-world environments and be able to apply SPIBB, both the available policies in Π and the behavior model are made non-deterministic. We therefore redefine the probabilities of the user running given some context and an action as defined in equations (2) and (3). If no notification is sent, the probability of running simply equals the original likelihood of running P_M(run = 1|x) as described in equation (1). The probability of running when a notification is sent then equals this same probability with an increase, capped at 1: min{P_M(run = 1|x) + ∆, 1}, where ∆ is the increase in probability.

Next, the behavior policy – not the other candidate policies – will be made non-deterministic, so that there is some randomness in which action is chosen given some context. Equations (5) and (6) from the policy definition in section 2.2 can therefore be rewritten for the behavior policy only as

$$\pi_b(a = 0 \mid x) = 1 - P_{\pi_b}(\mathrm{run} = 1 \mid x), \qquad (9)$$

$$\pi_b(a = 1 \mid x) = P_{\pi_b}(\mathrm{run} = 1 \mid x). \qquad (10)$$

Lastly, the probability increase ∆ of running upon receiving a notification is highly unlikely to be constant. Rather, a user is much less likely to be influenced by a notification when the user is already quite certain of the decision to run or not run. The notification is much more likely to be able to influence the user when the user is unsure about the decision to run or not run. Therefore, in this altered setup ∆ will be modelled as

$$\Delta = 0.2 \cdot \exp\left(-\frac{(P_M(\mathrm{run} = 1 \mid x, a = 0) - 0.5)^2}{0.05}\right). \qquad (11)$$
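The context-dependent notification effect of equation (11) can be sketched as follows; the capping at 1 follows the altered setup described above, and the function names are illustrative.

import numpy as np

def delta(p_run_no_notification):
    # largest boost when the user is most undecided (probability near 0.5), equation (11)
    return 0.2 * np.exp(-((p_run_no_notification - 0.5) ** 2) / 0.05)

def p_run_with_notification(p_run_no_notification):
    # probability of running when a notification is sent, capped at 1
    return min(p_run_no_notification + delta(p_run_no_notification), 1.0)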

3 Algorithms

In this thesis, multiple algorithms are proposed. Firstly, several algorithms are proposed for selecting – and employing – the best available policy from the set of available policies Π. Next, an extended version of the Π≤b algorithm is proposed for improving a policy more quickly with the use of contextual information. Ultimately, the policy selection algorithms will be combined with the extended Π≤b algorithm to not only select the best available policy, but employ an improved version of it.

3.1 Policy selection algorithms

A simple ε-greedy algorithm will be evaluated to serve as a performance baseline. Next, the Reward Transfer (RT) algorithm will be introduced, which re-uses rewards as much as possible for candidate policies in Π. Finally, multiple algorithms will be introduced that estimate a hypothetical reward for candidate policies whenever reward transfer is impossible.

3.1.1 Simple ε-greedy

The ε-greedy algorithm will, with probability 1 − ε, choose to exploit its current knowledge and therefore take the action a with the highest expected reward Qt(a) at time t (Watkins, 1989). In this thesis, the expected reward for taking action a simply equals the average reward obtained so far for taking action a:

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}.$$

With probability ε it will choose to explore by taking any of the available actions uniformly at random. Although some may prefer to have ε decay over time, for this thesis ε will be kept constant. Note that for the agent employing this algorithm, the actions consist of choosing to employ one policy π ∈ Π for the decision point at hand, and exploiting its current knowledge implies employing the policy with the highest expected reward.
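A minimal sketch of this ε-greedy baseline, where the "actions" are the candidate policies themselves and Q_t is the running average reward per candidate; the class and variable names are illustrative.

import numpy as np

class EpsilonGreedySelector:
    def __init__(self, n_policies, epsilon=0.2, rng=None):
        self.epsilon = epsilon
        self.rng = rng or np.random.default_rng()
        self.totals = np.zeros(n_policies)   # sum of rewards per candidate policy
        self.counts = np.zeros(n_policies)   # number of times each candidate was employed

    def select(self):
        # explore with probability epsilon (or when nothing has been tried yet)
        if self.rng.random() < self.epsilon or not self.counts.any():
            return int(self.rng.integers(len(self.totals)))
        # otherwise exploit: employ the candidate with the highest average reward
        q = np.where(self.counts > 0, self.totals / np.maximum(self.counts, 1), -np.inf)
        return int(np.argmax(q))

    def update(self, policy_index, reward):
        self.totals[policy_index] += reward
        self.counts[policy_index] += 1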


3.1.2 Reward Transfer

The Reward Transfer (RT) algorithm for selecting a policy from a set of policies aims to transfer the reward received by the behavior policy to candidate policies whenever possible. For every candidate policy πi ∈ Π a set of rewards Z_{πi} is stored. At every time step t, the behavior policy πb receives some reward Rt for taking action A^b_t under context xt, as described in section 2.3. This reward is stored: Z_{πb} ← Z_{πb} ∪ {Rt}.

Next, for every candidate policy πi ∈ Π – including the behavior policy itself – determine the action A^i_t that policy πi would have been most likely to take under the same context xt, had policy πi been the behavior policy at time t. If the candidate policy πi would have chosen the same action as the behavior policy πb, i.e. A^i_t = A^b_t, the reward obtained by the behavior policy can be directly transferred to policy πi: Z_{πi} ← Z_{πi} ∪ {Rt}.

When the candidate policy would, hypothetically, have chosen to act differently than the behavior policy, two different conflicts can occur. Firstly, the behavior policy may have sent a notification (a = 1), while the candidate policy would have chosen to not send a notification (a = 0). In this case, the reward for candidate policy πi is defined as

$$q_{\pi_i}(x, a = 0) = \begin{cases} 0 & \text{if } q_{\pi_b}(x, a = 1) = 1, \\ 0.1 & \text{if } q_{\pi_b}(x, a = 1) = 0. \end{cases} \qquad (12)$$

In other words, if the behavior policy has received a reward of 1 because sending a notification was a good action, the candidate policy receives a reward of 0, because not sending a notification would therefore have been a bad action. Likewise, if the behavior policy has received a reward of 0, the candidate policy receives a reward of 0.1. This reward is then stored: Z_{πi} ← Z_{πi} ∪ {q_{πi}(x, a = 0)}.

The second type of conflict that may occur is the situation where the behavior policy has chosen to not send a notification, while the candidate policy would have sent a notification. In this case, the reward cannot be transferred or deduced. Choosing to send a notification increases the likelihood of the user running and may be the encouragement required to make the user run when normally the user would not. Hence, even though not sending a notification may appear to be a good action, we cannot say that sending a notification is therefore bad.

After the reward has been transferred to all candidate policies wherever possible, the agent assigns as its new behavior policy πb the candidate policy πi that has acquired the highest average reward, through both functioning as behavior policy and through reward transfers:

$$\pi_b = \underset{\pi \in \Pi}{\arg\max} \; \frac{1}{|Z_\pi|} \sum_{r \in Z_\pi} r. \qquad (13)$$
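A minimal sketch of one Reward Transfer update, assuming deterministic candidate policies represented as functions from context to action; names are illustrative.

import numpy as np

def reward_transfer_step(policies, rewards, x, a_b, r):
    """policies: list of policy functions; rewards: list of reward lists Z_pi;
    (x, a_b, r): observed context, behavior action and reward at this decision point."""
    for i, pi in enumerate(policies):
        a_i = pi(x)
        if a_i == a_b:
            rewards[i].append(r)                        # same action: transfer directly
        elif a_b == 1 and a_i == 0:
            rewards[i].append(0.0 if r == 1 else 0.1)   # equation (12)
        # a_b == 0 and a_i == 1: reward unknown; nothing stored (or estimated, sections 3.1.3-3.1.5)
    means = [np.mean(z) if z else -np.inf for z in rewards]
    return int(np.argmax(means))                        # new behavior policy, equation (13)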

3.1.3 MLE for Gaussian Mixture Model

Instead of ignoring cases where the reward cannot be transferred, the reward may be estimated through Maximum Likelihood Estimation (MLE). The Gaussian Mixture Model MLE (GMM-MLE) algorithm is an extension of the RT algorithm and assumes that the probability of running upon receiving a notification is sampled from a Gaussian Mixture Model (GMM).

Rewards are transferred whenever possible as described in section 3.1.2. In case a conflict occurs where the behavior policy πb chooses not to send a notification while candidate policy πi would send a notification under context x, a data set D = {xt | A^b_t = 1, Rt = 1} is created, containing the contexts where behavior policy πb chose to send a notification (i.e. A^b_t = 1) and received a positive reward (i.e. Rt = 1) – because the user did indeed run upon receiving the notification. What one would now like to estimate is the probability p(run = 1|x) of the user running in the current context x. This probability will then be used as the estimated reward for sending a notification in context x. Using Bayesian probability theory, we can rewrite this probability as

$$p(\mathrm{run} = 1 \mid x) = \frac{p(x \mid \mathrm{run} = 1) \cdot p(\mathrm{run} = 1)}{p(x)}.$$

To calculate p(run = 1|x), p(x|run = 1) will be approximated through MLE by fitting a GMM over all previous data D where the user received a notification and went running, and determining the likelihood of context x given this model. Since all contexts are drawn uniformly at random from the context space, p(x) is constant. Moreover, the probability p(run = 1) is also assumed to be constant. Therefore, p(run = 1|x) ∝ p(x|run = 1). In more complex models, or real-life situations, this may not be so simple, requiring p(x) and p(run = 1) to also be estimated.

For MLE, at least two data points are required to avoid a singular covariance matrix. Therefore, if |D| < 2, no estimate is made and no reward is given to the candidate policy. Otherwise, a GMM is fit over D using variational inference (Blei & Jordan, 2006). Each Gaussian in the GMM is defined by a mean µ, covariance matrix Σ and weight w such that $\sum_i w_i = 1$.

Using the parameters Θ that are found to fit the data best, p(x|run = 1) is determined as the likelihood L(Θ|x), i.e. p(x|run = 1) = L(Θ|x). The likelihood is normalized such that the maximum likelihood equals 1, i.e. max_{y∈X} L(Θ|y) = 1. Because of this normalization, it does not matter that p(x|run = 1) is used as the probability of running given some context, rather than the other way around, even though it only holds that p(run = 1|x) ∝ p(x|run = 1). The likelihood is then defined as

$$L(\Theta \mid x) = \frac{\sum_{(\mu,\Sigma,w) \in \Theta} w \cdot \mathcal{N}(\mu, \Sigma \mid x)}{\max_{y \in \mathcal{X}} \sum_{(\mu,\Sigma,w) \in \Theta} w \cdot \mathcal{N}(\mu, \Sigma \mid y)}. \qquad (14)$$

This likelihood value will then be directly used as the reward for candidate policy πi, i.e. q_{πi}(x, a = 1) = L(Θ|x). The reward will be stored for policy πi: Z_{πi} ← Z_{πi} ∪ {q_{πi}(x, a = 1)}. Before continuing with the next decision point, the behavior policy is updated according to equation (13).
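A minimal sketch of the GMM-MLE estimate, using scikit-learn's BayesianGaussianMixture as a variational-inference stand-in for the method of Blei and Jordan (2006); the number of components and the grid-based renormalization are illustrative assumptions.

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def estimate_notification_reward(D, x, grid_points=51):
    # D: contexts where a notification was sent and the user ran; x: current context
    D = np.asarray(D, dtype=float)
    if len(D) < 2:
        return None                        # too little data for a fit: no estimate is made
    d = D.shape[1]
    gmm = BayesianGaussianMixture(n_components=min(3, len(D)),
                                  covariance_type="diag").fit(D)
    # Approximate max_y L(Theta | y) on a grid (reasonable for the two-dimensional setup)
    # so that the estimate of equation (14) lies in [0, 1].
    grid = np.stack(np.meshgrid(*[np.linspace(0, 10, grid_points)] * d), axis=-1).reshape(-1, d)
    log_max = gmm.score_samples(grid).max()
    return float(np.exp(gmm.score_samples(np.atleast_2d(x))[0] - log_max))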

3.1.4 RBF Weighted Average

Another method to estimate the reward for a candidate policy πi that would choose to send a notification while the behavior policy πb did not, is to determine a weighted average over all previous rewards where a notification was sent by the behavior policy. The more similar the context in which the reward was given, the higher its weight will be. The RBF Weighted Average (RBF-WA) algorithm is again an extension of the RT algorithm. The data D used for the estimation will be a set of context-reward pairs where the reward was obtained by choosing to send a notification: D = {(xt, Rt) | A^b_t = 1}. The weighted average is then defined as

$$q_{\pi_i}(x, a = 1) = \frac{\sum_{(x', R) \in D} \mathrm{sim}(x, x') \cdot R}{\sum_{(x', R) \in D} \mathrm{sim}(x, x')}. \qquad (15)$$

Similarity between vectors can be defined in many ways. This algorithm will use the radial basis function (RBF) kernel as a similarity measure between two vectors:

$$\mathrm{sim}(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right). \qquad (16)$$

The RBF kernel gives close points very high weight, while points further away may receive negligible weight. Therefore, the estimated reward will largely depend on the points that are only a small distance away, almost ignoring points that are too far away.

As with the GMM-MLE algorithm, the estimated reward is stored in Z_{πi}, and after every candidate policy has been dealt with, the behavior policy is chosen according to equation (13).
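A minimal sketch of the RBF weighted average of equations (15) and (16); names are illustrative.

import numpy as np

def rbf_weighted_average(D_contexts, D_rewards, x, sigma=1.0):
    # D_contexts, D_rewards: contexts and rewards where a notification was sent
    X = np.asarray(D_contexts, dtype=float)
    R = np.asarray(D_rewards, dtype=float)
    x = np.asarray(x, dtype=float)
    if len(X) == 0:
        return None
    weights = np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))   # RBF kernel
    return float(np.dot(weights, R) / weights.sum())                       # weighted average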

3.1.5 K-Nearest Neighbors

One final method to estimate the appropriate reward for a candidate policy πi that would choose to send a notification while the behavior policy πb did not, is to estimate the reward from the k most similar previous contexts, also known as k-NN regression (Altman, 1992). This k-nearest neighbors (KNN) based algorithm is an extension of the RT algorithm. First, a data set D = {(xt, Rt) | A^b_t = 1} of previous context-reward pairs where the behavior policy chose to send a notification is assembled. Given some context x for which the reward q_{πi}(x, a = 1) is being estimated, the distance to some other context x' is defined as the Euclidean distance between x and x':

$$d(x, x') = \|x - x'\|. \qquad (17)$$

When the k nearest neighbors are found, the estimated reward q_{πi}(x, a = 1) equals the average reward obtained in the contexts of those k nearest neighbors and is stored in Z_{πi}. The behavior policy is updated according to equation (13).
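A minimal sketch of the k-NN estimate, using scikit-learn's KNeighborsRegressor, which averages the rewards of the k closest contexts under Euclidean distance; names are illustrative.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_reward_estimate(D_contexts, D_rewards, x, k=1):
    # D_contexts, D_rewards: contexts and rewards where a notification was sent
    X = np.asarray(D_contexts, dtype=float)
    R = np.asarray(D_rewards, dtype=float)
    if len(X) == 0:
        return None
    knn = KNeighborsRegressor(n_neighbors=min(k, len(X))).fit(X, R)   # Euclidean by default
    return float(knn.predict(np.asarray(x, dtype=float).reshape(1, -1))[0])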

3.2

Context Enhanced SPIBB for policy improvement

It has been shown that in situations where a reinforcement learning agent cannot be allowed to freely explore, Safe Policy Improvement with Baseline Bootstrapping (SPIBB) can be used to safely improve an existing baseline policy πB (Laroche & Trichelair, 2018). SPIBB improves a policy by increasing the probability of taking some action a in state x, if a has a high expected value in state x and the expected value for state-action pair (x, a) is based on a sufficient number N∧ of previous observations. If fewer than N∧ previous observations have been made, SPIBB falls back on the baseline policy πB to ensure no unwanted behavior occurs. In their research, Laroche and Trichelair (2018) demonstrate that the Π≤b algorithm is able to safely improve the baseline policy.

The Π≤b algorithm was originally designed for a Markov Decision Process (MDP) with state-action pairs (x, a), but can very easily be extended to a contextual bandit by interpreting x as a context rather than a state, translating (x, a) into a context-action pair. In addition to researching how the previously described algorithms can be used to quickly select an optimal policy from a limited set of policies, the earlier proposed algorithms will be extended with the Π≤b algorithm to not only find the optimal policy, but also improve it. This will be achieved by assigning the candidate policy with the highest expected value – which would normally be assigned as behavior policy πb – as the baseline policy for the Π≤b algorithm, and employing an improved version of this baseline policy as the behavior policy πb.

The Π≤b algorithm largely depends on the bootstrapped set B. This set contains all context-action pairs (x, a) that have been encountered less than N∧ times. Context-action pairs in B are regarded as insufficiently explored, and will not be used to update the baseline policy πB.

When similarity between contexts can be defined (as is the case in this thesis), a context-action pair (x, a) may be regarded as safe more quickly, by also counting observations of the same action in similar contexts. For example, suppose N∧ = 5 and pair (x, a) has only been observed 4 times, so (x, a) would normally be included in B. However, context x' is less than a distance ρ away and we have one observation of (x', a). We then regard (x, a) as safe, since 4 and 1 add up to the threshold of N∧ = 5, and its expected value Q^(t)(x, a) at time t will be an average over all similar context-action pairs, weighted by how often they have been observed:

$$Q^{(t)}(x, a) = \frac{\sum_{x' \in S_x} Q^{(t)}(x', a) \cdot N(x', a)}{\sum_{x' \in S_x} N(x', a)}. \qquad (18)$$

Here, N(x, a) is the number of times (x, a) has been observed, and Sx is the set of contexts that are within a distance ρ from context x – including x itself:

$$S_x = \{x' \in \mathcal{X} \mid \|x - x'\| \leq \rho\}. \qquad (19)$$

Note that in this thesis, Euclidean distance is used, whereas other distance metrics may be more appropriate in different situations. Finally, the redefinition of B is

$$B = \left\{(x, a) \in \mathcal{X} \times \mathcal{A} \;\middle|\; \sum_{x' \in S_x} N(x', a) < N_\wedge\right\}. \qquad (20)$$

This extended version of the Π≤b algorithm will hereafter be referred to as the Context Enhanced Π≤b (CE-Π≤b) algorithm. Pseudocode for generating B is given in appendix A.

Since the Π≤b and CE-Π≤b algorithms exploit the principle that the same context-action pair may be observed multiple times, these algorithms will only be evaluated in the altered, non-deterministic setup as described in section 2.4.
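A minimal sketch of the context-enhanced bootstrapped set of equation (20) and the similarity-weighted value estimate of equation (18), assuming observation counts and Q-values are kept in dictionaries keyed by (context tuple, action); all names are illustrative.

import numpy as np

def neighbours(x, contexts, rho):
    # contexts within Euclidean distance rho of x, including x itself if present (equation (19))
    x = np.asarray(x, dtype=float)
    return [c for c in contexts if np.linalg.norm(x - np.asarray(c, dtype=float)) <= rho]

def in_bootstrapped_set(x, a, counts, rho, n_wedge):
    # (x, a) is bootstrapped if its neighbourhood holds fewer than N_wedge observations of action a
    observed_contexts = {c for c, _ in counts}
    total = sum(counts.get((c, a), 0) for c in neighbours(x, observed_contexts, rho))
    return total < n_wedge

def smoothed_q(x, a, q_values, counts, rho):
    # observation-count-weighted average of Q over similar contexts (equation (18))
    observed_contexts = {c for c, _ in counts}
    cs = [c for c in neighbours(x, observed_contexts, rho) if counts.get((c, a), 0) > 0]
    if not cs:
        return None
    weights = np.array([counts[(c, a)] for c in cs], dtype=float)
    values = np.array([q_values[(c, a)] for c in cs], dtype=float)
    return float(np.dot(weights, values) / weights.sum())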


4 Results

The results are divided into two subsections. The basic policy selection algorithms (without policy improvement) are evaluated in section 4.1 using the initial deterministic setup with two-dimensional contexts. Next, the CE-Π≤b versions of the three most representative policy selection algorithms are evaluated in section 4.2 using the non-deterministic setup with ten-dimensional contexts as described in section 2.4.

In all experiments, 10 different policies were available, i.e. |Π| = 10. Four evaluation metrics will be applied. First of all, 1) the average reward over all previous decision points obtained by the agent will be evaluated for both the policy selection and the CE-Π≤b algorithms. Note that these rewards are gathered by the agent's behavior policy, which may change over time. Secondly, the policy selection algorithms outlined in section 3.1 will be compared by 2) the frequency at which they manage to find either the optimal policy or one of the two best policies. The basic algorithms will also be compared by 3) the number of decision points needed to find the optimal policy, or one of the two best policies, as well as 4) the number of notifications that were sent before finding such a policy. A policy will be considered found when an agent employs the policy for 20 consecutive decision points. The recorded decision point at which the policy is found will be the first of those 20 consecutive decision points. An exception is made for the ε-greedy algorithm, which is forced to explore with probability ε. Such deviations from its believed optimal policy are not considered an interruption of the consecutive decision points during which it deems a policy optimal.

The reason that the algorithms are also evaluated on their ability to find one of the two best policies instead of just the optimal policy, is that the best two policies may be so similar in performance that finding the second-best policy is almost as good as finding the true optimum. Broadening the criterion to the two best policies therefore means finding a policy that is close to optimal.

4.1 Policy selection algorithms

All results for the policy selection algorithms without Context Enhanced SPIBB are averaged over 1000 different behavior models, with a different set of available policies for each model. The results in this section were acquired with the deterministic setup with two-dimensional contexts. For the RBF-WA algorithm, σ = 1.0 was found to be optimal and was used to obtain the results in this section. Also, for the KNN algorithm k = 1 was found to be optimal, and for the ε-greedy algorithm ε = 0.2 was found to perform best.

4.1.1 Average reward

In figure 1 the average obtained rewards can be seen for 500 decision points. For the first approximately 200 decision points the RT algorithm seems to outperform all other algorithms. However, after a below-average start, the RBF-WA and KNN algorithms – which perform almost equally – catch up with the RT algorithm and clearly outperform it for the remaining decision points.

Following the observation that the RBF-WA and KNN algorithms lack performance during the early stages, one could decide to delay estimation for all estimation algorithms – i.e. GMM-MLE, RBF-WA and KNN – until after the first 100 decision points. In that case, as can be seen in figure 2, the estimation algorithms profit from RT's early performance, while RBF-WA and KNN clearly outperform it thereafter – where again RBF-WA and KNN perform almost equally. This can be explained by a lack of data underlying the estimates in the beginning. As more data becomes available, the estimation algorithms are able to make more accurate estimates and therefore improve performance. By not allowing the estimation algorithms to estimate rewards in conflict situations during the first few decision points, the algorithms are able to profit from the RT algorithm's early performance, while later boosting performance using estimates.

Figure 1: Average rewards over 500 decision points.

Figure 2: Average rewards over 500 decision points, where no estimates are made for the first 100 decision points.

In a real-life situation, one may not want to allow the agent to send notifications freely. In fact, to avoid spamming the user with notifications, it can be decided that the agent is only allowed to send, for example, two notifications each day. In figure 3, the acquired reward can be seen when one day includes ten decision points and the agent can send at most two notifications each day. It ranges over 280 decision points, which equals four weeks exactly. Whenever the agent would like to send a notification but has hit its daily limit, the agent is forced to not send a notification. In such cases, the reward will not contribute to the expected performance of the current behavior policy – i.e. the reward is not stored in Z_{πb} – if the behavior policy would have been most likely to send a notification. As can be seen, no algorithm is successful at outperforming the RT algorithm when the number of notifications is limited to two every ten decision points. Again, it can be decided to not have the estimation algorithms – GMM-MLE, RBF-WA and KNN – perform any estimations for the first 100 decision points. As can be seen in figure 4, still no estimation algorithm is able to outperform the RT algorithm. The RT and GMM-MLE algorithms perform almost equally well.

Figure 3: Average rewards over four weeks when one day equals 10 decision points, and at most two notifications can be sent each day.

Figure 4: Average rewards over four weeks when one day equals 10 decision points, and at most two notifications can be sent each day. No estimations are made for the first 100 decision points.

Algorithm   Frequency optimal policy (%)   Frequency top 2 policies (%)
ε-greedy    10.0                           21.2
RT          58.1                           80.5
GMM-MLE     49.3                           72.2
RBF-WA      63.9                           82.9
KNN         63.2                           83.3

Table 1: The frequencies with which the optimal policy, or one of the two best policies, is found after 100 decision points. The best results have been highlighted.

Algorithm   # Decision points for optimal policy   # Decision points for top 2 policies
ε-greedy    181.282                                164.364
RT          127.930                                 97.354
GMM-MLE     129.738                                 96.084
RBF-WA      142.290                                100.858
KNN         141.214                                101.128

Table 2: Number of decision points required to find the optimal policy, or one of the two best policies. The best results have been highlighted.

Algorithm   # Notifications for optimal policy   # Notifications for top 2 policies
ε-greedy    37.917                               34.112
RT          31.223                               21.130
GMM-MLE     29.400                               20.168
RBF-WA      34.085                               23.186
KNN         33.991                               23.187

Table 3: Number of notifications required to find the optimal policy, or one of the two best policies. The best results have been highlighted.


4.1.2 Best policy identification

As can be seen in table 1, the RBF-WA algorithm finds the optimal policy in Π most frequently within 100 decision points, whereas the KNN algorithm finds one of the two best policies most frequently.

In table 2, the average number of decision points required to find the optimal policy – or one of the two best policies – can be seen. The RT algorithm finds the optimal policy fastest, while the GMM-MLE algorithm finds one of the two best policies fastest.

Finally, the number of notifications sent before finding the optimal policies can be seen in table 3. The GMM-MLE algorithm clearly finds the optimal policy, or one of the two best policies, using the fewest notifications.

4.1.3 Trends for different reward setups

As mentioned in section 2.3, different reward setups for the case where the agent does not send a notification and the user does not run were tried before settling on a reward of 0.1. During experimentation, it was observed for the estimation algorithms that the higher this reward, the less the estimates contribute to a candidate policy's rewards Z_{πi}. This can clearly be seen for the KNN and RBF-WA algorithms in figure 5.

Although the RT algorithm initially performs best for almost all reward definitions, it is quite often outperformed after some number of decision points. The number of decision points that the KNN and RBF-WA algorithms need to overtake the RT algorithm generally increases drastically as the reward for not sending a notification while the user does not run goes up, and for some reward values they never manage to outperform the RT algorithm at all.

The average number of decision points needed to permanently outperform the RT algorithm for different reward setups can be seen in figure 6. The maximum number of decision points given to the algorithms to outperform RT was 2000, and if they did not outperform RT by that time, a value of 2000 was recorded. As can be seen, from a reward of 0.0 up to a reward of 0.5 the time it takes for both the RBF-WA and KNN algorithms to outperform RT increases significantly as the reward assigned for not sending a notification and the user not running increases. For rewards of 0.6 up to 0.8 neither algorithm managed to outperform RT within 2000 decision points. As the reward approaches 1.0, the time it takes to outperform RT decreases again.

Figure 5: The average percentage of a candidate policy's obtained reward that originates from estimates, for the RBF-WA and KNN algorithms.

Figure 6: The average number of decision points needed for the RBF-WA and KNN algorithms to permanently outperform the RT algorithm. Dotted lines indicate that RT was not outperformed within 2000 decision points.

Figure 7: Average rewards over 500 decision points for the different variants of the Π≤b algorithm. N∧ = 5.

Figure 8: Average rewards over 500 decision points for the different variants of the CE-Π≤b algorithm. N∧ = 5, ρ = 10.

4.2 Context Enhanced SPIBB

The results in this section were all obtained with the altered non-deterministic version of the setup, as described in section 2.4. For the Π≤b and CE-Π≤b algorithms, only the average reward over all previous decision points is evaluated, since the best policy identification results should be expected to be similar to the results in section 4.1.2. New values of σ = 7.0 and k = 8 for the RBF-WA and KNN algorithms respectively were found to be optimal in the non-deterministic setup and were used to obtain the results in this section.

Since the Π≤b and CE-Π≤b algorithms are computationally more expensive, and neither the ε-greedy nor the GMM-MLE algorithm has proven to be consistently good, the following results only include the RT, RBF-WA and KNN algorithms. All results in this section are averaged over 500 different behavior models with different sets of available policies.

Figures 7 and 8 show the average reward obtained by the different Π≤b and CE-Π≤b algorithms, respectively. Although the original Π≤b algorithms do not consistently outperform their basic non-Π≤b counterparts, the CE-Π≤b algorithms improve quite drastically, showing a strong climb in the average reward. Moreover, it can be seen that the RBF-WA and KNN algorithms generally outperform the RT-based algorithms.


Figure 9: Average rewards for the different Π≤b algorithms with a restriction of 2 notifications every 10 decision points. N∧ = 5.

Figure 10: Average rewards for the different CE-Π≤b algorithms with a restriction of 2 notifications every 10 decision points. N∧ = 5, ρ = 10.

As before, a situation can be simulated where at most two notifications can be sent every day. Figures 9 and 10 show the average rewards for the Π≤b and CE-Π≤b algorithms respectively over a period of four weeks, where each day includes ten decision points. This results in 280 total decision points. Again, when the behavior policy is forced to not send a notification due to the restriction, while it normally would be most likely to send one, the reward is not stored in Z_{πb}. In this restricted environment, the RBF-WA algorithm appears to generally perform best, although the distinction between the RT, RBF-WA and KNN algorithms is less clear than before. The CE-Π≤b algorithms seem to lose a lot of performance in the beginning, but start a strong climb after that. The Π≤b algorithms are again not able to consistently outperform their non-Π≤b counterparts.

Following the observation that the CE-Π≤b algorithms appear to lose a lot of performance in the first few decision points, it can be decided to refrain from improving any policies during the first 100 decision points. Figure 11 shows the results when there is no notification restriction and the CE-Π≤b algorithms cannot change a policy for the first 100 decision points. Likewise, figure 12 shows the same result when there is a notification restriction of two notifications every ten decision points.

Figure 11: Average rewards over 500 decision points for the various CE-Π≤b algorithms, where no changes are made for the first 100 decision points. N∧ = 5, ρ = 10.

Figure 12: Average rewards for the different CE-Π≤b algorithms with a restriction of 2 notifications every 10 decision points, where no changes are made for the first 100 decision points. N∧ = 5, ρ = 10.

It must be noted that the improvements made by CE-Π≤b were primarily changes that made the agent more likely to send notifications than not to send them. Not sending a notification has a maximum reward of 0.1, while sending a notification has a significantly higher maximum reward of 1.0. As it appears, even when the probability of the user running upon receiving a notification is low, sending a notification generally has a higher expected reward in the long run than not sending one. Therefore, the CE-Π≤b based algorithms will eventually decide to (almost) always send a notification.

5 Conclusion

Overall, it has been shown that transferring rewards to candidate policies provides a significant boost in performance compared to a simple ε-greedy algorithm, which only evaluates a policy based on the rewards that the policy has obtained itself. When no reward can be transferred due to conflicts, estimation methods such as k-NN regression or weighted averaging using a radial basis function kernel can be used to estimate rewards in these conflict situations. Although these estimation algorithms – compared to the RT algorithm – underperform in the early stages of decision-making, their delayed boost in performance is enough to later outperform the RT algorithm.

When similarity between states or contexts can be defined, the Π≤b algorithm can be improved by also taking into account observations of the same action in similar contexts for any context-action pair (x, a), making (x, a) less likely to end up in the bootstrapped set B. Consequently, this new Context Enhanced (CE) Π≤b algorithm is able to improve a policy much faster, and is also able to perform well in higher-dimensional context spaces, where the unmodified Π≤b algorithm lacks performance. Even though CE-Π≤b tends to always send notifications in the long run, strictly speaking – according to the obtained rewards – it is indeed a significant improvement.

To answer the initial research question, which asked how an optimal policy can be selected from a set of policies using very little feedback, it can be concluded that an essential technique is to transfer rewards to available policies whenever possible. In case of conflicts where no reward can be determined with certainty, a hypothetical reward can be estimated using techniques such as k-NN regression or weighted averaging over previously acquired data. Since these estimates are more prone to error when little data is available in the beginning, performance may be enhanced by refraining from estimating until enough data is available. Additionally, it can be concluded that the CE-Π≤b algorithm can use context similarity to safely, yet quickly, improve an available policy. Besides being able to help increase a user's physical activity, the algorithms in this thesis should be applicable to any contextual bandit problem, and may also be relevant for Markov Decision Processes.


6 Discussion

As shown in section 4.1.3, the reward setup has a large influence on the performance of the algorithms. Throughout this thesis, a reward of 0.1 was assigned for not sending a notification while the user subsequently does not run. However, as this reward is increased, estimates made by the KNN and RBF-WA algorithms contribute less and less towards the overall rewards obtained by a candidate policy. Therefore, as the reward for not sending a notification and the user not running increases, policy performance estimation is influenced less and less by the estimates. As a result, these estimation algorithms take longer to outperform the RT algorithm, and may sometimes never outperform it at all. It therefore seems that these estimation algorithms perform particularly well when the estimated rewards contribute significantly to a policy's total obtained reward, where the total obtained reward consists of rewards obtained through serving as the behavior policy, reward transfers and reward estimates. Changing the reward setup may thus negatively impact the results, and this possibility should be taken into consideration when applying these algorithms in a new setup.

The simulator used in this thesis is only an approximation of a realistic situation and must therefore not be expected to simulate a real person accurately. However, it does provide a reliable setup for evaluating the multiple algorithms proposed in this thesis. One thing that should not be overlooked is the flawed reward setup that was used. It can be heavily debated whether a reward of 0.1 is appropriate for when the agent does not send a notification and the user does not run. As concluded before, changes in this reward result in very different performances by the policy selection algorithms. Moreover, with the current setup, the best action is almost always to send a notification, due to which CE-Π≤b will eventually always send a notification. In reality, this behavior would of course be undesirable. Therefore, in a real-life situation, one should consider different reward setups such that each available notification is used optimally.

Some algorithms require one or more hyperparameters. For example, the KNN algorithm requires a value for k, the number of neighbors. However, the CE-Π≤b version of the KNN algorithm requires as many as three hyperparameters: the number of neighbors k, the observation threshold N∧, and the distance threshold ρ for similar observations. Optimizing such a combination of hyperparameters can take a lot of time. Although brief parameter searches have been performed, it was quickly concluded that the hyperparameters found were sufficient for a satisfactory proof of concept. Therefore, the hyperparameter values used throughout this thesis are likely to be sub-optimal. In future research, or when employing the algorithms in practice, a more thorough hyperparameter search could be performed to increase performance.

Lastly, it is important to note that during this thesis the behavior model did not change over time. However, in practice it is unlikely that user behavior – or any other external behavior providing feedback – remains static. Therefore, estimated rewards produced by estimation algorithms such as GMM-MLE, KNN and RBF-WA, as well as any previously encountered context-action-reward observations, may become less relevant as time progresses. A simple solution would be to use a moving window of size w: a candidate policy would only be evaluated based on its last w obtained rewards, and estimates would only be based on the last w encountered contexts, as sketched below. A downside to this approach is that a suitable value for w would have to be determined.
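A minimal sketch of this moving-window idea, using fixed-size deques so that old rewards automatically fall out; the window size and number of policies are illustrative assumptions.

from collections import deque

w = 200
recent_rewards = {i: deque(maxlen=w) for i in range(10)}   # one window per candidate policy

def record(policy_index, reward):
    # appending beyond maxlen silently discards the oldest reward in the window
    recent_rewards[policy_index].append(reward)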

In future research, the conclusions of this thesis could be verified on different contextual bandit problems, and possibly also on Markov Decision Processes. Most importantly, it would be interesting if such a new problem had a reward setup that is less biased, such that in different contexts truly different actions are optimal and the Π≤b algorithms do not simply converge to always choosing one particular action regardless of the context. Moreover, in the setup used in this thesis, there were only two available actions that the agent could take, and the size of the set of available policies remained fixed at ten policies. It could be insightful to investigate the stability of the proposed algorithms as the number of available actions and the number of available policies changes.

Acknowledgement

I wish to sincerely thank dr. Shihan Wang for her support during this thesis. Her profound feedback and guidance have enabled me to write this thesis as it is, and have taught me what it means to be doing academic research. I would also like to thank the Playful Data-driven Active Urban Living (PAUL) project group for providing the inspiration for this thesis and for allowing me to carry out this research as part of the PAUL project.

References

Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3), 175–185. doi:10.1080/00031305.1992.10475879

Blei, D. M., & Jordan, M. I. (2006). Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1), 121–143.

Cohen, A., Yu, L., & Wright, R. (2018). Diverse exploration for fast and safe policy improvement. In Thirty-Second AAAI Conference on Artificial Intelligence.

Laroche, R., & Trichelair, P. (2018). Safe policy improvement with baseline bootstrapping.

Li, L., Chu, W., Langford, J., & Wang, X. (2011). Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In WSDM.

Simão, T., & Spaan, M. (2019). Safe policy improvement with baseline bootstrapping in factored environments.

Strehl, A. L., Langford, J., Li, L., & Kakade, S. M. (2010). Learning from logged implicit exploration data. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems (pp. 2217–2225).


Appendix A

CE-Π≤b computation of B

The algorithm below describes how the bootstrapped set B is computed by the CE-Π≤b algorithm, as described in section 3.2. The algorithm is largely adopted from Laroche and Trichelair (2018) and modified accordingly.

Algorithm 1: Compute B

input : Dataset D of previous experiences
input : Observation threshold parameter N∧
input : Distance threshold ρ
output: Bootstrapped set B

B ← ∅
for (x, a) ∈ X × A do
    Observations ← 0
    for (x', a) ∈ D do
        if ||x − x'|| ≤ ρ then
            Observations ← Observations + 1
        end
    end
    if Observations < N∧ then
        B ← B ∪ {(x, a)}
    end
end
return B
