
MSc Artificial Intelligence

Master Thesis

The Self Adapting Game: investigating the feasibility and effects of automatic game balancing through reinforcement learning

by

Thomas van Zwol

10555714

September 22, 2020

48 EC, October 28th 2019 - September 22nd 2020

Supervisor:

Dr E Pauwels

Assessor:

Dr H van Hoof


Abstract

Game balancing is a costly part of game development due to the amount of work it requires, which makes it a prime candidate for automatic optimization. However, automatic optimization may have unintended consequences when paired with the wrong objective function; in game development this may, for example, lead to game addiction. Despite the widespread nature of games, game addiction is a relatively new phenomenon and not much is known about it. This research aims to tackle two problems: investigating the feasibility of automatic game balancing through reinforcement learning, and providing a starting point for research into the way humans respond to rewards in the context of games, thus offering more insight into game addiction. Based on the literature, a stochastic player model is developed and used in conjunction with an actor-critic model in the context of an abstract game. The actor's goal is to increase the number of actions per session; the critic should form a model of the player model. Based on the simulations, I conclude that the approach works in a simulated context and I believe that, if paired with real player data, reinforcement learning is an interesting avenue for further exploration by game companies. Furthermore, the critic that is shaped through the reinforcement learning process provides an interesting opportunity to gain more insight into how human players respond to rewards, since it has a relatively good fit on the underlying data. In turn, this may lead to an increased understanding of game addiction.


Contents

1 Introduction
2 Related Works
   2.1 Automatic Game Balancing
   2.2 Game Addiction
   2.3 Reinforcement Learning in Multi-agent Systems
   2.4 Modeling User Behaviour
3 Background
   3.1 The Psychology of Rewards
       3.1.1 Reward Schedules & Conditioning
       3.1.2 Rewards in Video Games
       3.1.3 Relation between Rewards and Addiction
   3.2 Reinforcement Learning
       3.2.1 Bandit Algorithms
       3.2.2 Q-learning & Deep Q-learning
       3.2.3 Policy Gradient & Actor-Critic models
4 Methods
   4.1 Notation and definitions
   4.2 Player Model
   4.3 Actor
   4.4 Critic
   4.5 Objective Functions
   4.6 Data
   4.7 Metrics
   4.8 Implementation
5 Results and Analysis
   5.1 Preparation
   5.2 Learning the Reward Scheme
       5.2.1 Decreasing the churn rate without restraints
       5.2.2 Restricting the actor with the reward function
       5.2.3 Learned parameter values of the actor
   5.3 Model quality of critic
6 Conclusions and Future Work


1 Introduction

Models in Artificial Intelligence are trained by maximizing some metric: accuracy in classification problems, a fitness score in evolutionary algorithms, or NDCG in learning to rank. The final model is some function of the inputs that optimizes the resulting metric.

These models are often used in automation problems. Automation has many benefits and may reduce costs, improve performance or allow for better catering to the needs of customers. However, simply optimizing for a certain metric may have unintended consequences.

For example, if the input data is biased, the model may perform very well on the training data, but fail to generalise: when intentionally trained on a biased set of pictures of wolves with snow in the background and huskies without, Google’s pre-trained Inception neural network learned to predict by looking at the snow in the background, rather than the animal. This meant it failed to predict the correct animal when presented with a husky in the snow or a wolf without snow [45].

Unintended consequences may also arise when the wrong reward function is used, as can be seen from the Facebook experiment where two chatbots developed their own language [7]: the reward function didn’t include anything about the grammatical quality of the utterances by the chatbots, resulting in a new language that allowed them to satisfy their reward function but that was unintelligible to the human researchers.

These were not the only chatbots with problems: the Tay chatbot by Microsoft quickly started spewing obscene and inflammatory tweets after it interacted with Twitter users [37]. This is not something that can be blamed on the algorithms behind Tay, but is simply the result of how the chatbot was designed and the users interacting with it.

When these models are used in situations where the lives of real people are involved, these errors in the data, the objective function or the design choices may have lasting consequences. In the case of predictive policing, there may be runaway feedback loops [8]: arrest data used to train the model causes police officers to visit certain neighbourhoods more often, resulting in more data about this neighbourhood. This in turn makes the model more likely to predict that there will be crimes in this neighbourhood, which sends even more police officers to the neighbourhood.

Another example from the realm of criminal justice is the tool used by courts in the USA to help make decisions about parole: Correctional Offender Management Profiling for Alternative Sanctions (COMPAS). This tool has been shown to have a bias towards African-American people, assigning them a higher risk score than Caucasian people with the same profile [34]. Instead of a feedback loop, there was a bias in the data to begin with. These two examples clearly illustrate situations that could alter people's lives in a dramatic way, demonstrating the need to be vigilant in our design choices and to scrutinize ourselves and others when using such techniques.

Now that the broader scope of this research has been established, we can look more closely at the specific case that I aim to investigate: the duality of benefits and consequences of optimization in game design. Automatic game balancing can be used to maximize the joy players get from the game, with the number of actions they take as a proxy for the joy, but it may have the unintended consequence of making players addicted. Nefarious game companies may even optimize for the amount of money that is spent by the players, essentially turning the game into a scam.

Directly writing off automatic game balancing because of the possibility of unintended consequences might not be the best idea, since it is something of tremendous economic value, due to the large amount of time and work that balancing a game normally takes [6]. Furthermore, it may be impossible for a human designer to make sure that there are no dominating strategies in games where there are many different strategies that can be played. Using automatic game balancing may uncover these dominating strategies and make sure they become balanced by tweaking the parameters.


At the same time, unintended consequences may arise. Addiction to mobile games appears to be a very real problem [5] and different techniques are employed to have players spend as much time (and money) in the game as possible. One of these techniques is making the rewards variable. A Variable Ratio (VR) reward scheme has been discovered to work best in conditioning [48].

At this point in time, however, videogame addiction is not qualified as an addiction in itself and is only included in the research appendix of the DSM-5 as Internet gaming disorder (IGD), indicating that further research is needed [42].

In this research I want to focus on this duality of automatic game optimization. The aspect of the game that will be optimized is how the game rewards its players: the reward scheme. Since this reward scheme depends on parameters, it is a candidate for automatic optimization. Furthermore, the reward scheme is very important for a game, due to the way rewards influence human psychology.

However, since addiction to games is a field that requires further study, I also want to investigate if this approach may possibly be used to infer a model of how human players react to rewards. This may help researchers investigate reward schemes and the behaviour they elicit from players in order to gain a better understanding of this part of game addiction.

This leads to the following two overarching questions: can reinforcement learning be used to automatically optimize a game? Could such an optimization model be used in research that focuses on understanding human behaviour in the context of game addiction?

More concretely, I will do the following. I will model how players respond to stimuli in a player model. These player models will be used in a simulation where an actor-critic model presents stimuli to the player models and in this way tries to learn how to reward players in such a way that the average number of actions per session increases.

Multiple reward functions for the actor-critic will be tested to see if the actor is able to learn how to reward players if it is punished for eliciting certain behaviours, specifically making the player models play longer than the desired session length.

To measure if the optimization is a success, we’ll need to look at the churn rate, which is a measure of how many players stop playing the game completely. Lowering the churn rate is one of the ways game makers can ensure that more people play the game.

If an optimization model is to be re-purposed for research into human behaviour, it is important to see if the model formed by the critic accurately reflects the player behaviour. Therefore analysing the model fit will also be an important aspect of this research.

This leads to the following main research questions:

1. Can reinforcement learning be used to learn how to reward (models of) players in such a way that the churn rate drops?

2. How does the reward function influence the churn rate?

3. How does the reward function influence the ability of the model to learn how to reward players?

4. How accurately can the behaviour of the player models be captured in the actor-critic model?

Research question 1. is the most important question for determining if automatic game balancing might be used to learn the reward scheme of a real game. My hypothesis is that this is possible, but that how much the churn rate drops depends on the reward function used.

For question 2., I believe that a reward function that incentivizes the actor to increase the session length as much as possible will drop the churn rate the most. However, I think that a reward function that also takes into account the desired session length, will get the churn rate closer to the designed churn rate.

My hypothesis for question 3. is that the actor-critic model will need more iterations to achieve similar levels of performance when it is punished for eliciting certain behaviours.


Question 4. is relatively open ended and is there to see whether or not this model might be used to model real player behaviour. I believe that the critic will be able to model the player model relatively accurately with no clear biases.

Contributions

• I developed a first model for how humans respond to rewards in a game (section 4.2), based on theory about rewards (section 3.1) and the way human players progress through the levels of a game (section 2.4).

• The model described above was used in reinforcement learning to investigate the feasibility of using reinforcement learning to achieve game balancing (section 5.2).

• I also tested alternative reward functions that take into account player health and punish the actor, to see if these functions impede learning (section 5.2.2).

• In order to see if this approach might be used to increase understanding about game addiction, I investigated the modelling capacity of a neural network when model fit is a byproduct of using this model for some other goal (section 5.3).


2 Related Works

From the introduction it becomes clear that there are four main topics of research that influence this thesis: automatic game balancing, game addiction, reinforcement learning in multi-agent systems, and modelling user behaviour.

To understand where this research fits in, it is important to look at the research that has been conducted in these four fields, to define what each of them entails, and to see how this research differs.

2.1 Automatic Game Balancing

Game balancing (which does not necessarily mean that each strategy becomes equal, just that no strategy is strictly dominating) is an essential step during the development of a game, since it directly influences the players' enjoyment of the game. Usually this is done by careful design, and by having prospective players play the game and see how the game progresses, also known as play testing. This is rather expensive, so automation may cut down these costs considerably [6, 55].

The definition of game balancing used in this thesis is the same as found in Volz et al. [55]: “the modification of parameters of the constitutive and operational rules of a game (i.e. the underlying physics and the induced consequences / feedback) in order to achieve optimal configurations in terms of a set of goals”. The most abstract goal in this case is the enjoyment of the players, and with competing players fairness is crucial [6]. What these parameters and goals entail in this thesis will be discussed in the section describing the game that is used in this research.

Volz et al. [55] have shown that it is possible to achieve automatic game balancing through the use of an evolutionary algorithm. Even multi-objective optimisation was feasible in this approach. Kunanusont et al. [27] have also shown that automatic game balancing is possible. They introduce a new algorithm, the N-Tuple Bandit Evolutionary Algorithm, which combines aspects from reinforcement learning and evolutionary algorithms.

An important part of game balancing is identifying dominating strategies. In massively multiplayer online role-playing games (MMORPGs), many players play at the same time, often with different races or classes and several skills. To balance all these different races/classes and their respective skills, so that none of the combinations is strictly dominating, Chen et al. [6] developed a new co-evolutionary algorithm and showed that the mechanism behind it works.

Automatic game balancing has mostly been done through evolutionary algorithms. In this research, I will attempt to do this through reinforcement learning instead. The reason why I think this is a valid approach is that for this research, game balancing and learning to play the game are more or less the same: the actions taken by the computer in response to the actions taken by the player result in the balance of the game. The balancing is not concerned with interactions between players but between the game and its players.

2.2 Game Addiction

With computer games becoming more prevalent in our society during the last decades due to the development of better computers and smartphones, research is being done into whether or not excessive gaming can be qualified as an addiction [28]. At this point in time, video game addiction is not qualified as an addiction in itself and is only included in the research appendix of the DSM-5 as Internet gaming disorder (IGD), indicating that further research is needed [42].

Research into gaming addiction has been going on for some time already and according to Griffiths, “excessive behaviours add to life and addictive take away from it” [15], leaving open the possibility that not all excessive gaming is a sign of addiction. He also argues that video game playing can be seen as a non-financial form of gambling. Especially loot boxes, virtual crates with items that can usually be opened by spending money, can be considered gambling since money is usually required to open them and the player does not know what rewards he will receive [16].

In [15], Griffiths provides a concise summary of research he and others have carried out into video game addiction, which I will paraphrase here since it clearly outlines what is known about the effects of gaming. Gaming is known to have benefits: educational, social and/or therapeutic. Video game playing can, however, be addictive when done in excess. This is more prevalent when it concerns an online game without an end, making playing that game potentially a 24/7 activity. Griffiths also draws a parallel between problem gamblers that win instead of lose money, making the real problem that they spend too much time gambling which compromises other aspects of their life, and video game players that show similar patterns.

Now that smartphones are widely available, mobile games have taken off. Candy Crush Saga is one of those games that has a large number of players and has an important social aspect. Chen et al. have shown that addiction to such a game can exist by applying a questionnaire used for internet addiction to this context [5]: they found that 7.3% of the users were addicted to this game, with loneliness and self-control being significant predictors of mobile social game addiction.

Mobile games are a subset of online video games. This is relevant since Lemmens et al. [29] have shown that there is a stronger correlation between online video games and IGD than between offline video games and IGD. Furthermore, they also discovered that disordered gamers of online role playing games (RPGs) spent more than four times as much time playing their game of choice as gamers without IGD.

Since rewards are an important aspect of video games and rewards are also very important in conditioning, i.e. establishing certain behaviours in animals or humans, the link between optimizing the rewards given to players and video game addiction is evident. One only needs to look at gambling addiction to see there is a strong link between addiction and reward.

2.3 Reinforcement Learning in Multi-agent Systems

Reinforcement learning can relatively easily be applied to a single agent that has to learn how to achieve a certain task, but the more interesting situations are the ones where multiple agents have to interact. Consider self-driving cars where the cars share information and coordinate to achieve a cooperative goal [4], or competitive computer-controlled opponents in video games like AlphaStar [54]. In the first example, the vehicles can communicate, but in the situation of playing a (competitive) game, the agents may not be able or willing to communicate. There is a large amount of research into reinforcement learning in cooperative multi-agent systems, often involving deep learning [10, 12, 60] or actor-critic models [11, 31], where the agents are able to communicate, learn to communicate, or indirectly communicate due to using a centralised element in the learning procedure. However, less research has been done in competitive environments with deep learning and actor-critic models [40, 52] where agents do not share information and act independently. This has, however, been studied in game theory to quite some extent, applying it to different real-world scenarios [9, 20, 61].

In this research, I will be focusing on multi-agent systems where the agents do not communicate and where the game can be seen as a mix of cooperation and competition. The actions should be in the continuous domain. In [31], many of these aspects are tackled by using an actor-critic model. The main difference is that in this research, only one agent will be governed by a computer algorithm: all the other agents are simulated humans, making their reward function and action-value pairs opaque to the reinforcement learning model. Since it is also a sequential game, this aspect is more like [40]. The critic from the actor-critic model can, after convergence, be considered a model of how the player models respond to certain stimuli from the actor. Why this is useful becomes clear from the next section.

2.4 Modeling User Behaviour

To better understand or predict how users will act, it is often beneficial to model their behaviour. This can be done in three ways, according to Ostrom [39]: in language, in mathematics or statistics, and through computer simulation. Using an artificial neural network to model user behaviour is a combination of the statistical approach and the simulation approach: the model is learned from data and thus reflects the statistics of the data, but it can also be used to generate new data.

Since predicting the action of a user in response to some stimulus is in essence a categorization problem, artificial neural networks are well suited for this task [32]. It has been shown that artificial neural networks can indeed be used to model human behaviour [32] and that these networks can be used to analyse and explain human behaviour [51].

User behaviour is modeled in many different contexts. In the context of Information Retrieval, click models are used to model human behaviour on the results page of the retrieval system. They are often used to make learning to rank less dependent on the position bias [18]. Human behaviour is also modelled in the context of traffic [30], and it has been shown that a neural network is capable of modeling the behaviour of a driver quite accurately, although not perfectly. It should be noted that the network used is, compared to the current state of the art, rather simple. Another example of modelling specific human behaviour is learning how to behave as a human operator in the process industry [13]. These examples indicate that it is possible to model how humans behave in certain specified and bounded contexts.

A recent paper that doesn’t use neural networks to model user behaviour, but manages to capture the user behaviour in (simple) power laws, is [44]. Reguera et al. have investigated the player behaviour in mobile games like Candy Crush Saga. They discovered that the willingness of players to spend time on achieving a level of Candy Crush is related to what level they are on: the further in the game they are, the more tries they are willing to spend. This behaviour of the abandon time, ta(n), can be described by a power-law of the form

ta(n)∼ nα. They also discovered that the difficulty of the levels in these games, measured

in the time to pass a level or tp(n), can be described by a power-law of the form tp(n)∼ nβ.

The relation between ta(n), tp(n), t emp

p (n) (the empirical time to pass a level) and the churn

probability pc(n) in the following equation

tp(n) = tempp (n) 1− pc(n) and ta(n) = tempp (n) pc(n) (1) tempp (n) denotes the average number of attempts needed by players that passed level n

to pass it, which can be determined from the data set.
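As a quick numerical check of equation 1, the following sketch uses the values that will be chosen for the player model in section 4.2 (t_p^emp(n) = 1.2 and p_c(n) = 0.4):

# Check of equation (1) with the values chosen for the player model in section 4.2.
t_emp, p_c = 1.2, 0.4
t_p = t_emp / (1 - p_c)   # time to pass: 1.2 / 0.6 = 2
t_a = t_emp / p_c         # abandon time: 1.2 / 0.4 = 3
assert abs(t_p - 2.0) < 1e-9 and abs(t_a - 3.0) < 1e-9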

One of the other findings from their research is that there exists a relationship between α_r and β_r (since α and β are already used as parameters of a model in this research, a subscript r is added to the parameters from [44]) and whether or not players ever stop playing the game. If α_r − β_r < 1, eventually every player will churn; if α_r − β_r > 1, a number of players will, if there is enough content in the game, never stop playing. The higher α_r − β_r, the larger the chance that players either abandon right at the start or never stop playing.

Reguera et al. have also demonstrated that a stochastic simulation of the players validates their model and that assuming players are statistically identical reproduces their progression and survival accurately. This information is used to design the experiment for this research. If the critic fits the behaviour of the player models well, using actor-critic models to investigate how humans behave may prove to be a viable avenue for research.



3 Background

In this section, I discuss the psychological aspects of rewards and how these are related to conditioning and addiction, as well as a number of reinforcement learning techniques and the reasons for choosing one technique over another.

The reason to start with the psychology behind rewards and addiction is that this psychological aspect plays an important role in the design of the experiment and the selection of the reinforcement learning techniques.

3.1 The Psychology of Rewards

In the mobile game which is the topic of research for this thesis, the players will be rewarded for certain actions they take. These rewards have an effect on the players and their behaviour. In order to understand this effect, it is important to look at the psychological aspect of rewards.

The term reward in psychology has three meanings that are connected: it is something that a person likes, wants, and that serves as a reinforcer for learning [1]. Liking has to do with feeling pleasure or satisfaction when receiving a reward; wanting is the desire to receive rewards and is closely associated with motivation; reinforcement is used to describe the effect of rewards in the process of learning [14].

It appears that the wanting component of rewards is strongly related to the release of dopamine in the brain, but liking is not: animals that have been trained to press a lever to obtain a reward show an increased release of dopamine just before pressing the lever, but not after receiving the reward [43]. The larger the expected reward is, the more dopamine is released [47].

3.1.1 Reward Schedules & Conditioning

Dopamine is also strongly related to new learning and can be used in conditioning. Hungry animals presented with food show a burst of dopamine release. If the presentation of food is preceded by another signal, e.g. a lamp coming on, the animal learns to associate the other signal with the food and dopamine will be released in response to the signal and no longer to the food: the lamp has become a predictor for the food. In order for dopamine to be released, the reward or its predictor needs to be unexpected [49].

The above is an example of classical conditioning, with Pavlov and his dogs being a prime example. If the goal is to shape the behaviour of animals or humans, we speak of operant conditioning. In operant conditioning there are different schedules of reinforcement: continuous reinforcement, partial reinforcement and extinction. Continuous reinforcement always rewards the action, partial reinforcement only rewards sometimes and extinction is the stage in which the animal or human no longer receives rewards [14]. The partial reinforcement schedule can be further divided into four schedules:

1. Fixed-ratio (FR) schedule: a reward is given after every nth response.

2. Variable-ratio (VR) schedule: a reward is, on average, given after n responses. However, the exact number of responses varies.

3. Fixed-interval (FI) schedule: a reward is given to a response that occurs after a fixed period of time has elapsed.

4. Variable-interval (VI) schedule: similar to the fixed-interval schedule; however, the time varies around some mean amount of time.

VR schedules are able to generate the highest number of responses. This is why, for example, slot machines work using a VR schedule. To understand why this is the case, it is useful to view the organism that is conditioned as acquiring new knowledge about the relationship between the response and the reward it receives [38]: as long as the organism is still learning this relationship, the reward is unexpected and dopamine is released. Since higher levels of dopamine are associated with internal rewards, it also increases the wanting [23, 14].

Furthermore, the VR schedule also appears to be preferred explicitly, even if it amounts to the same average reward [48], and it proves to be more resilient to extinction: organisms trained with a VR schedule and then shifted to an extinction schedule, keep showing the response for far longer than the other reward schedules. It appears they have learned to be persistent because responses were not always followed by a reward [14].

3.1.2 Rewards in Video Games

Rewards earned in video games have the same effect as other rewards: they lead to a heightened amount of dopamine in the brain [26] due to the learning of the rewards that come with certain actions [50]. Since the increased dopamine increases wanting, it means that the reward scheme in the game has a huge impact on the perceived fun.

From the previous sections, it becomes clear that, if your goal is to have your users form a habit of playing, the reward for a given action should not always be the same: with variable rewards the players need to keep learning what reward is associated with the action, resulting in dopamine being released every time the action is taken and increasing the likelihood of forming a habit of playing. A habit may even turn into an addiction: addiction to mobile games is possible [19, 5], however, it is not as strong an addiction as substance addictions.

3.1.3 Relation between Rewards and Addiction

To understand how addiction to mobile games works, it is necessary to look at the relation between rewards and addiction. With brain imaging, it has been shown that games of chance where money is used as a reward, are strong activators of the reward structures in the brain [3, 25]. Because it is impossible to predict the payoff, each time a player receives a reward, the reward structures in the brain keep releasing dopamine.

Even though the person playing the game may consciously know that the reward is unpredictable and not influenced by anything they do, the brain behaves as if it is trying to learn how to predict the reward. This results in the forming of a very strong habit or even addiction [14].

Behavioural addiction differs from substance addiction, but the latter does illustrate the mechanisms of the former. The way it works is by making use of the reward system in the brain: the drugs mimic or promote the effects of dopamine and endorphins. Normal rewards only activate the dopamine release when the reward is unexpected, but the drugs always activate the dopamine circuits, resulting in a sort of ‘super learning’ [22]. A behavioural addiction is in some ways similar to a drug addiction [21]. It doesn’t matter if the dopamine release in the brain comes from the use of a drug or is the result of certain behaviours. This also goes for excessive video game playing [59, 58].

3.2 Reinforcement Learning

Since the goal of this research is to have a game learn how to reward its players, understanding the background of reinforcement learning is necessary. Reinforcement learning in computers works in a similar way to how it works in humans and animals: an agent takes an action and receives a reward for this action. The goal of reinforcement learning is to have an agent learn how to behave in a certain environment in such a way that it maximises the reward it receives: it learns a policy that tells it what action to take given a certain state.

In the most simple cases the state does not change and the action space is discrete; more complex cases deal with a partially observable state and/or continuous action spaces. The state space in this research is finite and discrete, but the action space is continuous and thus infinite. Also important to consider is the fact that in this research we are dealing with a multi-agent system: the algorithm deciding how to reward the player is an agent, but so are the players.

In this section a number of techniques will be briefly described and their possible application to this research discussed.

3.2.1 Bandit Algorithms

Bandit algorithms are one of the simplest reinforcement learning algorithms and were first described by [46]. They are suited for relatively simple environments and sets of actions: they do not incorporate information about the state and can, in practical applications, only sample from a finite number of arms.

There are a number of bandit algorithms that can be used in multi-agent environments [2], and there are generalizations to infinite numbers of arms [24, 56], but they depend on assumptions about the reward function. These assumptions do not necessarily hold for this research; therefore, using a bandit algorithm for learning the reward scheme of the game seems unsuitable.

3.2.2 Q-learning & Deep Q-learning

In contrast with the bandit algorithms, the Q-learning algorithm does take state into account. The algorithm learns the best action to take given a certain state [57] and is a generalisation of the multi-armed bandit algorithm. Q-learning is meant for a discrete and finite world where the number of actions is finite. Q-learning can also be applied in multi-agent systems [2], but in practice it does not perform well because of the non-stationary nature of the environment [33].

Q-learning can be extended to continuous domains. This is done by, for example, taking a weighted average of a number of discrete actions, where the weights are determined by the relative Q-values of these actions [35]. When actions depend on multiple variables, these variables need to vary independently in order to cover the full range of possibilities, leading to a combinatorial explosion of the number of actions when the number of variables increases. A variation on Q-learning was made possible by advances in the training of artificial neural networks: Deep Q Networks (DQN). In this approach, neural networks are used to learn the Q-values from a high-dimensional input. However, these networks still select from a finite set of actions [36]. DQN can also be extended to continuous domains [17]; however, in the context of deep reinforcement learning, other algorithms are more widely used for continuous action spaces.

3.2.3 Policy Gradient & Actor-Critic models

Another approach to reinforcement learning is through the use of Policy Gradient methods. There exist several different algorithms that make use of the Policy Gradient concept: instead of learning action-value pairs, a parameterized policy is learned. The value function is not needed for action selection, but it can be part of learning the parameters of the policy [53]. The model learns the policy parameters by doing gradient ascent on the gradient of an evaluation metric with respect to the policy parameters.

An implementation of the Policy Gradient method, is the actor-critic model [41]. This algorithm is best used when the value function is unknown and should be learned as well. The actor picks an action based on a number of parameters and the critic estimates the reward. This reward is then used to update the parameters of the actor. The critic itself is something that is learned from actual data by comparing the prediction with the rewards observed from the environment [53].
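To make the interplay between the two parts concrete, the sketch below shows one generic actor-critic update. It is an illustration only, not the model used in this thesis (that is described in section 4): the networks, the discrete action space and the TD-style learning signal are assumptions made for the example.

import torch

# Generic actor-critic update (illustrative). `actor` maps a state to action
# logits, `critic` maps a state to a scalar value estimate.
def actor_critic_step(actor, critic, actor_opt, critic_opt,
                      state, action, reward, next_state, gamma=0.99):
    # Critic estimates for the current and next state.
    value = critic(state)
    next_value = critic(next_state).detach()

    # Temporal-difference error: how much better or worse the outcome was
    # than the critic expected; it is the learning signal for both parts.
    td_error = reward + gamma * next_value - value

    # Critic update: move the value estimate towards the observed return.
    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: make the action that was actually taken more (or less)
    # likely, in proportion to the TD error (policy gradient).
    dist = torch.distributions.Categorical(logits=actor(state))
    actor_loss = -(dist.log_prob(action) * td_error.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()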

The actor-critic model seems to be the best option: it is a fully online, incremental algorithm with an infinite action set. This combines all the elements needed for the context of optimizing a game while players interact with it and also matches the desire to present users with a wide range of rewards to maintain unpredictability. Furthermore, the critic becomes, by virtue of the training process, a model of user behaviour given the rewards the players receive. This model can, when trained with human players, be used to infer how human players respond to rewards.


4 Methods

In section 3.2.3 (Policy Gradient & Actor-Critic models), I have established why I believe the actor-critic model is the algorithm class best suited for this research. In this section I will discuss how I implemented the player model as a stochastic model, as well as the implementation of the actor-critic that will be used to learn how to reward these player models. Furthermore, I will also discuss the reward functions that will be used to train the actor.

4.1 Notation and definitions

In order to make the methods section easier to comprehend, I will provide an overview of notational shorthands and definitions that may not be directly obvious.

1. Rewards (r): with rewards I mean the output of the actor that is presented to the player model. These rewards are greater than or equal to zero. When it is important that the rewards are either zero or larger than zero, this will be mentioned in the text.

2. Characteristic function: I use the standard notation to indicate a characteristic function, i.e. 1_[some condition]. This is used in the player model to influence how rewards affect the model.

4.2 Player Model

The player model simulates the behaviour of players in the mobile game. The main objective of the player model is to simulate whether or not players continue play, since this is something the Actor-Critic model aims to optimize. The type of actions or the specific style of play are therefore abstracted out.

The models will output a sequence of the shape [1,...,1,0,...,0], with 1 indicating that the user model takes another action in response to the reward received and 0 indicating no action was taken. In other words, if the model does not continue play, the rest of the outputs will be 0. This is in a way similar to the cascading browsing model where the model stops as soon as it deems a result interesting.

The model has three mechanisms in place that simulate whether or not a player continues with play. The first is very simple and operates on the notion that the game should make it possible for the players to keep playing: each model has a budget at the start of the simulation. Actions taken decrease the budget, rewards received increase the budget. If the budget ever reaches a point where not enough budget is left to take an action, the player model ‘churns’: the player decides to stop playing the game altogether because they can no longer take any actions.

The budget is tied to the second mechanism: tiers. This is a measure of progression and acts on the level of multiple sessions. It simulates players wanting to progress in the game and losing interest if this doesn’t happen fast enough. To reach a certain tier, the model will have to acquire a certain amount of budget. Players will try to reach the next tier of game play, but will only do so for a limited amount of sessions. If they do not reach the next tier within this randomly determined number of sessions, they churn as well. The tiers in this research reflect the levels from [44], but they differ in that players execute discrete actions within the tiers instead of simply passing or not passing a level.

The third mechanism operates mostly on the level of a single session and determines whether or not a player takes another action after receiving a reward. This is modelled by having a value for ’boredom’ that determines the chance the player keeps playing. It is influenced by the predictability of the rewards. The predictability is learned over sessions as well, since human players will form a mental model of the rewards they can receive across sessions.

The way the three mechanisms of the model are implemented is described in the following paragraphs:


Budget. The implementation of the budget is straightforward: each model starts with a budget w of a certain size v. Taking an action decreases w by l (the current tier). Receiving a reward r increases w by r. If w would become smaller than 0 by taking an action, the player does not take the action and churns. The size of v is more or less a hard cap on the number of actions the model can take without receiving a reward, thus incentivising the actor to make the average of the reward interval smaller than v. For this research, v was set to 10.
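As an illustration of this mechanism, a minimal sketch (not the thesis code; names and structure are my own):

# Minimal sketch of the budget mechanism. Each action at tier l costs l budget,
# rewards are added to the budget, and the player churns when an action can no
# longer be afforded. v is the starting budget (10 in this research).
class Budget:
    def __init__(self, v=10):
        self.w = v  # current budget w

    def try_action(self, tier, reward):
        """Attempt one action at the given tier; return False if the player churns."""
        if self.w < tier:      # taking an action would make the budget negative
            return False       # the player churns
        self.w -= tier         # the action costs l (= tier)
        self.w += reward       # the reward r received increases w by r
        return True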

Tiers. To simplify the simulation, players only switch tiers between sessions. The tiers are designed according to the findings of [44]. Since there is no data we can use to determine the churn rate or t_p^emp, we either need to pick values for these, or pick values for t_p and t_a. Because we want to investigate if learning the reward scheme through reinforcement learning can lower the churn rate, it makes sense to pick values for t_p and t_a. The values chosen are t_p = 2 and t_a = 3. This makes t_p^emp(n) = 1.2 and the churn rate at tier 1 equal to 40%.

Combined with the values α_r = 1.4 and β_r = 0.4 as described in [44], and the desired number of actions per session being 20, this leads to the following tiers:

Tier   t_p    Actions   Increase of b required   Avg. reward per action
1      2.00   40        10                       1.25
2      2.64   53        20                       2.37
3      3.11   63        40                       3.63
4      3.49   70        80                       5.14
5      3.81   77        160                      7.07
6      4.10   82        320                      9.90
7      4.36   88        640                      14.27
8      4.60   92        1280                     21.91
9      4.82   97        2560                     35.39
10     5.03   101       5120                     60.69

Table 1: The tiers as calculated from the chosen value for t_p at tier 1, the desired number of actions per session (20) and the increase in budget required to reach the next tier.

To ensure that α_r − β_r < 1, the values for t_p and the number of actions are rounded up, and the value for the average reward per action is rounded down. Tier 10 is the highest tier a player can reach in this research, so when a player would normally reach tier 11, they are considered to have finished the game. If more content were available (i.e. more tiers), those players would keep on playing.

The player model advances to the next tier when, at the beginning of a session, its budget is equal to or larger than the required budget for a tier. In theory this can mean that the model advances two tiers at once. Whenever it advances a tier, the number of sessions this player model will try to advance to the subsequent tier is randomly determined. This is done in a similar fashion as in [44]. If it doesn't reach this tier within the predetermined number of sessions, the model churns.

What is important to note is that if the actor-critic rewards the player models in such a way that the player models perform more than 20 actions per session, they will need fewer sessions to reach the next tier. This decreases the difficulty of reaching a tier, and thus α_r − β_r < 1 might no longer hold, depending on the increase in session length.
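The values in Table 1 can be reproduced from these choices. The sketch below assumes the rounding conventions stated above (t_p and the number of actions rounded up, the average reward per action rounded down) and assumes that the average reward per action is the total budget a player needs in a tier (action costs of l per action plus the required budget increase) divided by the number of actions; these conventions are inferred from the table, not taken from the thesis code.

import math

# Reproduces Table 1 from t_p(1) = 2, beta_r = 0.4, 20 desired actions per
# session, and a required budget increase of 10 * 2**(l - 1) at tier l.
def tier_table(t_p1=2.0, beta_r=0.4, actions_per_session=20, max_tier=10):
    rows = []
    for l in range(1, max_tier + 1):
        t_p = t_p1 * l ** beta_r                        # power law t_p(n) ~ n^beta_r
        actions = math.ceil(actions_per_session * t_p)  # rounded up
        increase = 10 * 2 ** (l - 1)                    # budget needed for the next tier
        # total reward needed = action costs (l per action) + budget increase
        avg_reward = math.floor((actions * l + increase) / actions * 100) / 100
        rows.append((l, math.ceil(t_p * 100) / 100, actions, increase, avg_reward))
    return rows

for row in tier_table():
    print(row)  # (1, 2.0, 40, 10, 1.25) ... (10, 5.03, 101, 5120, 60.69)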

Boredom. The main mechanic in the model is based on the idea of simulating a user that is (subconsciously) trying to predict the outcome of an action: from the research it becomes clear that the reward centers of the brain fire while trying to learn to predict the outcome of the action. If the outcome becomes more predictable, the user loses interest. This will be modelled through the accumulation of boredom.

Boredom increases when the rewards become predictable and decreases when the rewards are unpredictable. Rewards become predictable when: there are consecutive rewards of size zero; the size of a reward is similar to other rewards that have been received in response to past actions. The rewards are unpredictable when: a reward is given after a number of rewards of size zero; a reward is given that is greater than the average of the rewards received thus far.

These four boredom modifiers are captured in the following equations, each followed by a short explanation. In these equations, r_t represents the reward r at time t for which the difference in boredom Δb_t should be calculated. α, β, γ, δ, ε, ζ, η, θ are all parameters of the model that tune how it responds to rewards and how many steps of its history it will consider.

ZR = Σ_{i=t−α}^{t} β · 1_[r_i = 0]    (2)

ZR stands for Zero Rewards and this represents the influence of rewards of size zero (i.e. not getting a reward) on the boredom. Zero rewards increase boredom by amount β for every zero reward in the sequence of length α preceding r_t.

PR = p(r_t) · γ · 1_[r_t ≠ 0]    (3)

PR stands for Predictable Reward and this represents the influence of the predictability of the received reward. p(r_t) is the probability of receiving a reward of size r_t as estimated by the player model; how p(r_t) is calculated is described below. In short, this formula describes how non-zero rewards increase boredom by amount γ multiplied by the predicted probability of receiving a reward of this size.

LR = ε · ((r_t / E(r | r > 0) − δ)^+)^θ    (4)

LR stands for Large Reward and this represents the influence of receiving a large reward on the boredom. E(r | r > 0) is the expected value of a non-zero reward; how this is calculated is described below. δ is a threshold to determine if a reward is large and it is subtracted from the ratio between r_t and E(r | r > 0). If the outcome of this is positive, it is raised to θ, a scaling factor so that the impact of ‘large rewards’ does not necessarily scale linearly. ε is the scaling factor for the boredom decrease as a result of receiving a large reward.

RAZR = Σ_{j=t−ζ}^{t} η · 1_[r_j = 0] · (r_t / E(r | r > 0)) · 1_[r_t ≠ 0]    (5)

RAZR stands for Reward After Zero Rewards and this represents the influence of receiving a reward after a number of zero rewards. Boredom is decreased by amount η for every zero reward in the sequence of length ζ preceding the non-zero reward r_t, scaled by the relative size of r_t compared to the expected reward.

Now that the individual components that influence boredom are established, the full formula for calculating boredom is the following. It incorporates the four formulas stated and explained above.

b_0 = 0,    b_t = b_{t−1} + Δb_t,    Δb_t = ZR + PR − LR − RAZR    (6)

Notice how it is not necessary that 0 < b_t < 1. This is so that large swings in one or the other direction remain possible, for example when a highly unpredictable reward leads to a larger amount of dopamine being released, increasing the likelihood of that player playing for a longer time in order to learn how the rewards are being determined.

The boredom is used to determine the probability that the player model continues play. Since 0 < b_t < 1 is not a requirement, min(1, 1 − max(b_t, 0)) will be the probability that the model continues play.

Since the actor will be restricted in such a way that the average reward for each action of the user model will be of a size that allows players to reach the next tier within a certain number of steps, I believe there is no trivial solution for the actor: if it gives out many rewards in order to negate the first boredom-increasing term, the size of the rewards will be smaller, increasing the chance of the rewards falling in the same bucket and thus increasing boredom through the second term. Only giving out very high rewards to capitalize on the first boredom-decreasing term soon leads to a high average, reducing the contribution of this term. If it balances this with giving low rewards in order to reduce the average of the rewards, then the chance of a reward falling into a bucket containing other rewards increases and the positive effect of receiving a reward decreases.

p(r_t) is estimated by the player model. This can be done in two ways: either by collecting the received rewards into buckets of a certain size or by constructing a density function from the received rewards. For this research the bucket system is used, because the size of the buckets may prove to be an important parameter that influences the way the actor behaves. In order to see if this is the case, the number of buckets is kept constant between the tiers, but the range these buckets cover is 20 · l^2. Since the growth rate of the buckets is different from the growth rate of the average reward, we should see a difference in the parameters of the actor if the bucket size has a strong influence. Because a bucket system is used, the probability distribution is discrete.

The buckets are filled with the last 200 rewards (zero and non-zero) the player received previous to the current session and the rewards from the current session. At the end of the session, the stored rewards are trimmed to again be the 200 rewards the player received last. It should be noted that zero rewards are included in the estimate of p(r_t), but that p(r_t) is only used for calculating the influence of non-zero rewards (eq. 3) on the boredom. I have chosen to do this because otherwise zero rewards would increase boredom through two separate terms, while non-zero rewards would do so only through one term.

E(r | r > 0) is the expected value of a non-zero reward. In other words, in the most simple case, it is the average of all the received non-zero rewards. Since humans have imperfect learning, the way E(r | r > 0) is calculated for this research is different. After the first session, the average of the non-zero rewards of that session, r_0, is stored. For each subsequent session, E(r | r > 0) is updated according to the following equation:

E_{t+1}(r | r > 0) = E_t(r | r > 0) + 0.1 · r_t    (7)

where r_t here denotes the average of the non-zero rewards of session t.

It is important to note that this player model is purely developed based on theory about how humans respond to rewards and the way they progress through the levels of a game (see sections 2.4 and 3.1). It has not been verified with human data, because such data was unavailable at the time of this research.
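To summarize how the pieces of the player model fit together, the sketch below implements the boredom mechanism (equations 2 to 6), the bucket-based estimate of p(r_t), the imperfect update of E(r | r > 0) (equation 7) and the continue probability. It is an illustrative reimplementation rather than the thesis code: the parameter values α through θ and the number of buckets are placeholders.

import numpy as np

# Illustrative sketch of the boredom mechanism; parameter values are placeholders.
class BoredomModel:
    def __init__(self, tier, alpha=3, beta=0.1, gamma=0.5, delta=1.5,
                 epsilon=0.2, zeta=3, eta=0.1, theta=0.5, n_buckets=20):
        self.l = tier
        self.p = dict(alpha=alpha, beta=beta, gamma=gamma, delta=delta,
                      epsilon=epsilon, zeta=zeta, eta=eta, theta=theta)
        self.n_buckets = n_buckets
        self.history = []       # last rewards received (zero and non-zero), capped at 200
        self.expected = None    # E(r | r > 0), updated once per session (eq. 7)
        self.b = 0.0            # boredom b_t, starts at 0 (eq. 6)

    def p_reward(self, r):
        """Bucket-based estimate of p(r_t); the bucket range is 20 * l^2 per tier."""
        if not self.history:
            return 0.0
        edges = np.linspace(0, 20 * self.l ** 2, self.n_buckets + 1)
        counts, _ = np.histogram(self.history, bins=edges)
        idx = min(np.searchsorted(edges, r, side="right") - 1, self.n_buckets - 1)
        return counts[idx] / len(self.history)

    def delta_boredom(self, r):
        p = self.p
        hist = self.history
        exp_r = self.expected if self.expected else 1.0   # fallback before the first session
        zr = sum(p["beta"] for x in hist[-p["alpha"]:] if x == 0)          # eq. 2
        pr = self.p_reward(r) * p["gamma"] if r != 0 else 0.0              # eq. 3
        lr = p["epsilon"] * max(r / exp_r - p["delta"], 0) ** p["theta"]   # eq. 4
        razr = 0.0
        if r != 0:                                                         # eq. 5
            razr = sum(p["eta"] for x in hist[-p["zeta"]:] if x == 0) * (r / exp_r)
        return zr + pr - lr - razr

    def step(self, r):
        """Process one reward; return the probability of taking another action."""
        self.b += self.delta_boredom(r)                 # eq. 6
        self.history = (self.history + [r])[-200:]      # keep the last 200 rewards
        return min(1.0, 1.0 - max(self.b, 0.0))         # continue probability

    def end_session(self, session_rewards):
        """Imperfect update of E(r | r > 0) at the end of a session (eq. 7)."""
        nonzero = [r for r in session_rewards if r > 0]
        if not nonzero:
            return
        mean_r = sum(nonzero) / len(nonzero)
        self.expected = mean_r if self.expected is None else self.expected + 0.1 * mean_r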

4.3 Actor

The actor learns the parameters of the reward scheme. The reward scheme should not be deterministic, since a variable reward scheme provides the most engaging experience. This is achieved by having the actor learn parameters for a probability distribution: the p value for a Bernoulli distribution and the α value for a gamma distribution. The Bernoulli distribution is used to determine if the player should receive a reward, the gamma distribution to determine its size. Each tier (l) has its own values for p and α.


A gamma distribution is parameterized by two parameters which can be represented in two fashions: with k = shape and θ = scale, or with α = shape and β = rate (1/θ). Since the gamma distribution is implemented in the latter fashion in PyTorch, I will use these parameter names as well. However, since the player model also has parameters α and β, the parameters of the actor will be denoted with a subscript a: p_a, α_a, β_a.

To simplify training the model, the actor outputs a sequence of rewards for each tier. In the player model, the appropriate sequence is selected. This selected sequence and the models response to the rewards are then passed on to the critic for learning.

The outputs from the actor have the shape (number of player models/sessions) × (session length, 50) × (number of tiers, 10) and are the result of random samples from a Bernoulli and a gamma distribution.

Mathematically, the part of the actor that determines the actual reward can be described as

r_{t,l} = (s ∼ Bernoulli(p_{a,l})) · (a ∼ Γ(α_{a,l}, β_{a,l}))    (8)

β_{a,l} = (α_{a,l} · p_{a,l}) / r_{avg,l}    (9)

where r is the reward that is given to the player, p_{a,l} and α_{a,l} are the parameters that are learned, with l indicating the tier, r_{avg,l} is the average reward that a player should receive based on their tier, and s and a represent the samples from the Bernoulli and gamma distribution respectively.

When training the actor, the selection of which tiers to train is based on the distribution of tiers among the player models. The values for p_{a,l} and α_{a,l} are parameters the model learns directly. Since p_{a,l} needs to be between 0 and 1, it is passed through a sigmoid, and since α_{a,l} needs to be larger than 0, it is passed through a softplus.

The rewards could in theory be sampled from a large variety of distributions. The first reason for using a gamma distribution is that a sample from a gamma distribution lies between zero and infinity and is therefore always positive. As a result, there is no hard cap on the maximum reward size except for the limitations of representing a number in computer memory. Furthermore, a gamma distribution is a maximum entropy probability distribution for a random variable with a fixed positive mean. Since nothing is known about how a distribution of rewards influences humans, using the distribution with the maximum entropy is the best choice: it minimizes the amount of prior knowledge that is built into the distribution. The fixed positive mean is necessary to ensure that the mean reward for a player action matches the mean reward needed to reach the next tier. The influence of the two parameters on the shape of the gamma distribution can be seen in figure 1.

The reward scheme is subject to one major constraint: the average reward should be a constant number. This constraint is captured in equation 9. Since the β parameter of a gamma distribution is the rate and the mean of a gamma distribution is µ = α/β, β_{a,l} can be calculated from the average reward r_{avg,l}, which is predetermined, and the α_{a,l} that is learned by the model. Because the probability of receiving a reward also depends on the value of p_{a,l}, the β parameter should be scaled to reflect that.
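The sampling step of the actor (equations 8 and 9, with the sigmoid and softplus constraints from above) can be sketched as follows; the raw parameters and their initialisation are placeholders, not the values used in the thesis code.

import torch
import torch.nn.functional as F

# Sketch of the actor. For each tier l it holds two raw, learnable parameters;
# a sigmoid keeps p_{a,l} in (0, 1) and a softplus keeps alpha_{a,l} positive.
# r_avg holds the predetermined average reward per tier (Table 1).
class Actor(torch.nn.Module):
    def __init__(self, r_avg, n_tiers=10):
        super().__init__()
        self.raw_p = torch.nn.Parameter(torch.zeros(n_tiers))
        self.raw_alpha = torch.nn.Parameter(torch.zeros(n_tiers))
        self.register_buffer("r_avg", torch.as_tensor(r_avg, dtype=torch.float32))

    def distributions(self):
        p = torch.sigmoid(self.raw_p)               # p_{a,l} in (0, 1)
        alpha = F.softplus(self.raw_alpha) + 1e-6   # alpha_{a,l} > 0
        beta = alpha * p / self.r_avg               # eq. 9: fixes the average reward
        return (torch.distributions.Bernoulli(probs=p),
                torch.distributions.Gamma(concentration=alpha, rate=beta))

    def sample(self, session_length=50):
        """Sample a (session_length x n_tiers) block of rewards (eq. 8)."""
        bern, gamma = self.distributions()
        s = bern.sample((session_length,))          # whether a reward is given
        a = gamma.sample((session_length,))         # the size of the reward
        return s * a

With β_{a,l} set this way, the expected reward per action is p_{a,l} · α_{a,l} / β_{a,l} = r_{avg,l}, which is exactly the constraint of equation 9.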

4.4 Critic

The critic learns from the data generated by the player models' actions. During learning, it takes as input a sequence of rewards that were presented to the player concatenated with a one-hot vector indicating the tier of the player, and as target the sequence of actions generated by the model.

While training the actor, the critic takes as input a sequence of rewards generated by the actor concatenated with the intended tier and outputs the expected value of each reward. This output is then used in the objective function to reward the actor.


Figure 1: Plots of the probability density function of a gamma distribution under different parameters. k corresponds with α and θ with 1/β. The higher the value for α (k), the more ‘symmetrical’ the distribution becomes. Lower values for θ or higher values for β result in more of the probability mass falling in a shorter domain and closer to zero. Image by MarkSweep and Cburnett, derivative work by Autopilot, distributed under the CC BY-SA 3.0 license. Retrieved from: https://commons.wikimedia.org/wiki/File:Gamma_distribution_pdf.svg

The neural network consists of three layers: the input layer, a hidden layer and an output layer. Since the outputs need to be between 0 and 1, the activation function of the output layer is the sigmoid. The loss function used is the Binary Cross Entropy loss. Due to the nature of this loss, the outputs of the critic will reflect the probability of the model taking that action given the rewards it was presented with.

The size of the input layer is 60: sequence length of rewards of 50 + the number of tiers which is 10. The size of the output layer is equal to the sequence length. Since the hidden layer shouldn’t be smaller than the output layer, the hidden layer is set to be equal in size to the output layer. The activation function between the layers is chosen to be the Rectified Linear Unit (ReLU).
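A sketch of this network (the layer sizes follow the description above; everything else, including the class and variable names, is illustrative):

import torch

# Critic sketch: input = 50 rewards concatenated with a 10-dimensional one-hot
# tier, a hidden layer equal in size to the output, and sigmoid outputs that
# are interpreted as the probability of the player taking each action.
class Critic(torch.nn.Module):
    def __init__(self, seq_len=50, n_tiers=10):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(seq_len + n_tiers, seq_len),
            torch.nn.ReLU(),
            torch.nn.Linear(seq_len, seq_len),
            torch.nn.Sigmoid(),
        )

    def forward(self, rewards, tier_onehot):
        return self.net(torch.cat([rewards, tier_onehot], dim=-1))

# Training against the action sequences generated by the player models:
# loss = torch.nn.BCELoss()(critic(rewards, tiers_onehot), actions)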

4.5 Objective Functions

There are two different objectives for the actor-critic model that are investigated in this research:

1. Maximize the number of actions the players take during one session.

2. Maximize the number of actions the players take, but also make sure the players take no more than 20 actions per session.


The first objective simply tries to make users play as much as possible, while the second objective also tries to look out for the players' health by not making them play too much per session.

To ensure the actor-critic model tries to achieve these objectives, they are converted into loss functions for the actor part of the actor-critic. Since the score for the actor is dependent on the rewards it presents to the player model, and these rewards are dependent on samples from a probability distribution, the log probabilities of these rewards are needed in the loss function as well. The loss function corresponding to the first objective is

R_i = Σ_{j=1}^{J} o^c_{i,j},    L_i = R_i · log(p(s_i)),    L = −L_i    (10)

with o^c_i denoting the output of the critic in response to sequence s_i (and o^c_{i,j} its j-th element), J being the sequence length, s_i the sequence of rewards presented to the player model, and log(p(s_i)) the log probability of sequence s_i.

The second objective can be interpreted and converted into a loss function in two ways: either the actor-critic model is rewarded up to a maximum of the target session length τ, with longer sessions rounded down, or the actor-critic model is actively punished for sessions that go over τ actions.

The first interpretation leads to the following formula describing the loss function

$$R_i = \sum_{j=1}^{J} o^c_{i,j}, \qquad L_i = \min\left(R_i \cdot \log(p(s_i)),\ \tau\right), \qquad L = -L_i \qquad (11)$$

The second interpretation leads to three more options: ignoring all sessions over a certain length (i.e. setting the reward to zero), punishing the actor with a set amount per session regardless of how much longer the sessions are, or a scaling punishment which punishes the actor more heavily depending on how much 'too long' the sessions are. The first two options can be captured in a single, slightly more complex loss function:

$$R_i = \sum_{j=1}^{J} o^c_{i,j}, \qquad L_i = \begin{cases} R_i \cdot \log(p(s_i)) & \text{if } R_i \cdot \log(p(s_i)) \leq \tau \\ k & \text{otherwise} \end{cases}, \qquad L = -L_i \qquad (12)$$

with $k \leq 0$ representing the size of the penalty.


The third option, the scaling punishment, is implemented here with a linear scale, although an exponential or logarithmic scale would also be viable. The loss function for this is

$$R_i = \sum_{j=1}^{J} o^c_{i,j}, \qquad L_i = \begin{cases} R_i \cdot \log(p(s_i)) & \text{if } R_i \cdot \log(p(s_i)) \leq \tau \\ -(R_i - \tau) & \text{otherwise} \end{cases}, \qquad L = -L_i \qquad (13)$$

To sum up this subsection, I will provide clear names for the reward functions so they can be referenced in the following sections, together with a short description of how each reward is calculated.

• Regular: the reward for a session is equal to the session length.

• Ignore: the reward for a session is equal to the session length, except for sessions longer than the desired session length τ; for those sessions the reward is zero.

• Linear punish: the reward for a session is equal to the session length, except for sessions longer than the desired session length τ; for those sessions the reward is negative and equal to the distance to τ times some constant.

• Uniform punish: the reward for a session is equal to the session length, except for sessions longer than the desired session length τ; for those sessions the reward is some fixed negative value.

• Clip: the reward for a session is equal to the session length, except for sessions longer than the desired session length τ; for those sessions the reward is set to τ, clipping the session length.
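Following the descriptions above, the five reward functions could be written as the small helpers below, applied to the (predicted) session length R. The default values for τ, the penalty k and the linear constant c are illustrative assumptions, not values taken from the thesis.

```python
def regular(R, tau=20.0):
    return R                                    # reward equals the session length

def ignore(R, tau=20.0):
    return R if R <= tau else 0.0               # sessions longer than tau earn nothing

def linear_punish(R, tau=20.0, c=1.0):
    return R if R <= tau else -c * (R - tau)    # penalty grows with the overshoot

def uniform_punish(R, tau=20.0, k=-1.0):
    return R if R <= tau else k                 # fixed negative reward past tau

def clip(R, tau=20.0):
    return min(R, tau)                          # session length is clipped at tau
```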

4.6 Data

The data used in this research is synthetic and generated as part of training the actor-critic model. The actor generates sequences of rewards; in response, the player models generate a sequence of actions. The combination of the rewards and the actions is used to train the critic. For training the actor, a new set of rewards is generated and presented to the critic with the same distribution among the tiers as the player models during this iteration.
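A sketch of one data-generation step is given below; actor, player_models and the method names on them are placeholders for the components defined in this chapter, chosen for illustration rather than taken from the actual code.

```python
def generate_batch(actor, player_models):
    """Collect (rewards, tier, actions) triples for one training iteration."""
    batch = []
    for player in player_models:
        rewards = actor.sample_rewards(player.tier)    # actor proposes a reward sequence
        actions = player.respond(rewards)              # player model reacts with actions
        batch.append((rewards, player.tier, actions))  # inputs and targets for the critic
    return batch
```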

4.7 Metrics

The main metric for this research is the churn rate. Churn rate is defined as the percentage of players that stop playing a game. In theory this would mean people who permanently stop playing the game, but since players are often tracked in cohorts with a set end-date of observations, in practice it means the players that stopped playing at some time during the period of observation for the cohort.
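As a simple sketch, the churn rate of a cohort could be computed as below, assuming each tracked player carries a boolean churned flag indicating that it stopped playing during the observation period; the attribute name is an assumption made for illustration.

```python
def churn_rate(cohort):
    """Fraction of players in the cohort that stopped playing during observation."""
    churned = sum(1 for player in cohort if player.churned)
    return churned / len(cohort)
```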

To determine whether the critic part of an actor-critic would be able to model player behaviour in a setting with human players, we need to look at model fit. For this we look at the R² between the outputs of the player model and the critic. The R² metric alone is not enough, though, since biases may be present even with a high R². Therefore we also look at plots of the probabilities from the player model and the outputs from the critic, giving us a visual indication of whether or not the critic exhibits certain biases.
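A small sketch of this fit measure is shown below: the coefficient of determination between the player model's action probabilities and the critic's outputs, with the player model treated as the ground truth. The array names are illustrative.

```python
import numpy as np

def r_squared(p_player: np.ndarray, p_critic: np.ndarray) -> float:
    """R^2 of the critic's outputs against the player model's probabilities."""
    ss_res = np.sum((p_player - p_critic) ** 2)          # residual sum of squares
    ss_tot = np.sum((p_player - p_player.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot
```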


4.8 Implementation

The actor-critic model is implemented in PyTorch⁴ and makes use of REINFORCE to back-propagate through the random samples. The way everything is implemented can be seen in the code available on Gitlab⁵.

⁴ https://pytorch.org
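The snippet below sketches the REINFORCE trick in isolation: reward values are sampled from a distribution whose parameters require gradients, and the log probability of the sample lets the loss be back-propagated through the random draw. The use of a Gamma distribution, the parameter values and the optimiser settings are placeholders for illustration, not the exact configuration of the actor.

```python
import torch

alpha = torch.tensor(2.0, requires_grad=True)   # distribution parameters of the actor
beta = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.Adam([alpha, beta], lr=0.01)

dist = torch.distributions.Gamma(alpha, beta)
rewards = dist.sample((50,))                    # sampled reward sequence (non-differentiable draw)
log_prob = dist.log_prob(rewards).sum()         # log p(s_i), differentiable w.r.t. alpha and beta

R = torch.tensor(12.0)                          # stand-in for the critic's predicted session length
loss = -(R * log_prob)                          # REINFORCE loss, cf. equation 10
loss.backward()                                 # gradients reach alpha and beta through log_prob
optimizer.step()
```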


5 Results and Analysis

5.1 Preparation

Since it is completely unknown which values for the parameters of the player model would make for a realistic player model, a grid search over these parameters was done. From this grid search, six different parameter combinations were selected.

The selected parameter combinations all had an average session length that was smaller than 10 when presented with stimuli from the untrained actor-critic model. The parameters of the actor part were set to the same value for each combination to ensure a fair comparison.

The six parameter combinations that were selected had the following characteristics. The parameter combination with ...

• the largest growth rate;
• the smallest growth rate;
• the largest standard deviation of session length;
• the smallest standard deviation of session length;
• the smallest average session length;
• the largest number of full-length sessions.

The growth rate was determined by dividing the average session length from the fifth iteration by the average session length of the first iteration. The smallest growth rate was smaller than 1, meaning that the average session length decreased. The other four combinations were evaluated on values from the first iteration only.

This parameter tuning was done with a version of the player model that did not include tiers, budget or churning, thus evaluating only the effect of the parameters themselves.

In table 2 the possible values for the parameters that were tested can be seen. In table 3 the actual selected parameter combinations can be found.

Parameters      | α*, ζ*            | β, γ, ε, η             | δ | θ
Possible values | 5, 6, 7, 8, 9, 10 | 0.01, 0.03, 0.05, 0.07 | 1 | 1, 1.5, 2

Table 2: The possible values that each parameter could take on during the grid search. The parameters marked with an * were coupled and had the same value. All other parameters were not coupled and were varied independently from each other.

Parameters                          | α  | β    | γ    | δ | ε    | ζ  | η    | θ
Largest growth rate (1)             | 10 | 0.07 | 0.07 | 1 | 0.05 | 10 | 0.03 | 2
Smallest growth rate (2)            | 7  | 0.05 | 0.03 | 1 | 0.01 | 7  | 0.03 | 2
Largest standard deviation (3)      | 9  | 0.07 | 0.03 | 1 | 0.05 | 9  | 0.05 | 1.5
Smallest standard deviation (4)     | 8  | 0.07 | 0.07 | 1 | 0.01 | 8  | 0.01 | 1
Smallest average (5)                | 7  | 0.07 | 0.07 | 1 | 0.01 | 7  | 0.01 | 1.5
Largest number of full sessions (6) | 8  | 0.05 | 0.01 | 1 | 0.05 | 8  | 0.03 | 1

Table 3: The values for the six selected parameter combinations for the parameters of the player model as defined in the methods section.
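The grid over these values could be enumerated as in the sketch below, with α and ζ coupled and δ fixed at 1 as in table 2; how each combination is scored (growth rate, mean and standard deviation of the session length, number of full-length sessions) depends on the player-model simulation and is left out here.

```python
from itertools import product

coupled = [5, 6, 7, 8, 9, 10]          # values shared by alpha and zeta
rates = [0.01, 0.03, 0.05, 0.07]       # values for beta, gamma, epsilon and eta
thetas = [1, 1.5, 2]

grid = [
    dict(alpha=a, beta=b, gamma=g, delta=1, epsilon=e, zeta=a, eta=h, theta=t)
    for a, b, g, e, h, t in product(coupled, rates, rates, rates, rates, thetas)
]

# Growth rate as described in the text: average session length at the fifth
# iteration divided by the average session length at the first iteration.
```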

To create an understanding of how the selected parameters will probably influence the training of the actor-critic with the full player model, the actor-critic was first trained with


the same stripped-down player model that was used to select the parameters. This was done for 500 iterations and the results were plotted. The resulting average session lengths can be seen in figure 2.

It appears that the actor-critic model is not entirely stable, since in figure 2 it can be seen that a collapse may occur. It also appears that for some parameter combinations it is impossible for the actor-critic to get the average session length above 10.

Figure 2: The development of the average session length over the iterations when training the actor-critic with the simple player model, with one line per parameter combination.

Figure 3: The absolute difference between the average session length and the critic's prediction of the average session length, a measure of the quality of the critic. (a) The difference over the full training of 500 iterations; this difference quickly drops as the critic gets better at predicting. (b) A zoomed-in view giving a more detailed picture of how the difference between the real and predicted average length fluctuates.

From figure 3 it is clear that when the average session length for a parameter combination is close to either end of the range, the critic becomes rather good at predicting the average


session length, with this effect even more pronounced when the average session length is on the lower end of the range. In the other cases, the difference between the real and predicted average session length fluctuates quite rapidly. Because of this, we may conclude that, while the player model is governed by a non-changing set of rules, the critic is not able to capture this model fully. It is likely that a reason for this is that the player model is stochastic and the implemented critic is not. However, this section was only used to get an understanding of what we might expect when training the actor-critic with the full user model. In the next section we will see if the full user model has the same kind of outcome. The quality of the fit of the critic is discussed more extensively in subsection 5.3.

5.2 Learning the Reward Scheme

The main part of this research revolves around trying to learn the reward scheme that best matches the goals of the game designer. First the actor-critic will learn how to reward the player models without any restrictions. Secondly, the other reward functions described in the Methods section will be used to see if they can be used to guide the actor towards the goal of achieving a predefined average session length.

5.2.1 Decreasing the churn rate without restraints

To answer research question 1, the actor-critic model was trained with cohorts of player models being added at a regular interval. The first cohort was added at the start and a new one was added every fifth iteration. In total 100 cohorts were involved in training the critic. Each cohort had a maximum of 50 iterations of interaction with the actor-critic: the average number of sessions a player model needs to progress through all the tiers is 37.96 sessions, based on the designed tiers which can be found in table 1. This way, if the actor-critic has learned how to reward the player models, the player models should have either churned, or reached the finished state.

In figure 4 the development of the churn rate per tier can be seen. The churn rate is only plotted for tiers 1, 2 and 3 because for the other tiers no models churned. What can be observed is that for all the parameter combinations, the churn rate at tiers 1 and 2 becomes lower over time and the percentage of players that finishes the game increases.

There is, however, a slight problem: the ideal churn rate, when looking at equation 2 in [44] combined with the decision to make t_p = 2 and t_a = 3, would be 40% at tier one. Most of the models end substantially below the desired churn rate. The two models that don't end below the desired churn rate, however, seem to be on a downward trend that may take them below a churn rate of 40% at tier one.

The way the average session length developed in this experiment can be seen in figure 5. Some parameter combinations need longer to really start improving, but the average session length increases for all the parameter combinations. What is more noteworthy, however, is that the higher the tier, the higher the average session length, up to what appears to be a point of convergence. This pattern holds for most of the parameter combinations and tiers.

If we combine the information from figures 4 and 5 with the notion that α − β < 1 should hold if we want players to abandon the game at some point in time, we can see that it makes sense for the churn rate to go down and the number of players that 'finish' the game to go up: the average session length is much higher than the 20 actions per session which the tiers were designed for. The downward trend for the two parameter combinations that don't end with a churn rate of less than 40%, combined with the upward trend for the average session length at every tier, provides an indication that these parameter combinations may also end up having a churn rate that is below the churn rate that the tiers were designed for.

This means that if we want to combat the possibility of players getting addicted to the game, we should ensure that α − β < 1 remains true. For this, we need to see if the other reward functions that were described in the Methods section are able to keep the actor in check, while allowing the actor to learn how to reward the players at the same time.


Figure 4: The churn rate per tier over the iterations, as well as the fraction of players that finished the game, for each of the six parameter combinations (using cohorts to determine the churn rate). The churn rates for tiers above tier 3 are not plotted because no player models churned above tier 3: all player models that reached tier 4 went on to finish the game. Above each plot is indicated which parameter combination resulted in the plot. The legend is the same for each plot.


Figure 5: The average session length per tier over the iterations, for each of the six parameter combinations. Above each plot is indicated which parameter combination resulted in the plot.
