Bachelor Informatica

Embedding a simplicity-based theory of cognition in intentional agents

Lin Raddahi

March 1, 2021

Supervisor(s): dr. G. Sileno

Informatica, Universiteit van Amsterdam


Abstract

The aim of this study is to see whether we can increase the efficiency of Q-learning agents. We do this by enriching Q-learning agents with a cognitive model based on Simplicity Theory (ST). This theory centers on the attractiveness of situations and claims that humans are more interested in unexpected situations. We tested three different Q-learning versions in two simple games and compared them. The first game was a very basic 1D line game in which an agent was tasked with finding patterns in the numbers on the line. The second game was a 2D grid game in which a mouse had to locate and consume pieces of cheese. In the first game, the ST based Q-learning version showed an improvement in performance; in the second game, however, the ST based Q-learning version appeared to perform worse than the other Q-learning versions. Further research is needed to determine the efficiency of ST based Q-learning.


Contents

1 Introduction
2 Theoretical background
  2.1 Simplicity Theory
    2.1.1 Complexity
    2.1.2 Unexpectedness
    2.1.3 Emotion
    2.1.4 Intention
  2.2 BDI agent model
  2.3 Q-learning
  2.4 ST-based learning
3 Design and implementation
  3.1 BDI agent
  3.2 Minimal program search
    3.2.1 Brute force search algorithm
    3.2.2 A* algorithm
  3.3 Complexity calculation
  3.4 First game: String Leveller
    3.4.1 Setup
    3.4.2 Q-learning
  3.5 Second game: Cheese-seeking Mouse
    3.5.1 Setup
    3.5.2 Q-learning
4 Experiments and results
  4.1 Comparison A* and brute force search algorithms
  4.2 Comparing String Leveller game versions
    4.2.1 Uniform non-deterministic world
    4.2.2 Non-uniform non-deterministic world
  4.3 Comparing Cheese-seeking Mouse game versions
    4.3.1 First objective
    4.3.2 Second objective
5 Discussion and conclusion
  5.1 Discussion
  5.2 Conclusion
6 Appendix A
7 Appendix B
8 Appendix C
9 Appendix D


CHAPTER 1

Introduction

One of the aims of modern computer science is to create software that can independently make decisions on our behalf. This decision making consists in practice of choosing what actions to undertake and how to execute these actions. One way to create this type of software is to make use of agents (M. J. Wooldridge and Jennings 1995), acting rationally and autonomously. Reinforcement learning algorithms such as Q-learning (Watkins and Dayan 1992) are known to provide autonomous agents with ‘agent policies’, which means that agents use a certain strategy that mandates what actions the agent should execute based on the state of the agent and of the environment. The Q-learning algorithm lets the agent explore random actions and saves the corresponding rewards in a Q-table. After the learning phase, this Q-table becomes the ‘policy’ on which the agent relies to take actions.

In its current form, Q-learning offers no way of analysing or controlling the different components that determine the behaviour of the agent. In principle, it may be possible to improve Q-learning agents' behaviour by explicitly providing them with such components.

A notable difference between the way Q-learning agents and humans interact with their environment is that the latter have the ability to recognize the relevance of different events that are occurring or may occur. To develop an agent that might be competent at recognizing the relevance of events, this thesis will research the use of Simplicity Theory (ST) (Dessalles 2017; Dessalles 2020a). This computational theory of cognition has been empirically shown to predict human behavior by making use of models of relevance. The models are based on both Unexpectedness and Emotion. In short, this theory suggests that human beings give more relevance to events or situations that are easier to describe than to generate (capturing unexpectedness), and that the expected emotional response to a situation is amplified or attenuated by its unexpectedness.

By providing the Q-learning agent with explicit components that determine behaviour (Unexpectedness and Emotion), we may offer an alternative basis for explainable AI approaches. Additionally, since Simplicity Theory is based on a non-extensional theory of probability, we expect that the Q-learning algorithm would require less training data to work.

Since Q-learning belongs to the family of reinforcement learning algorithms used in many AI applications, improving Q-learning agents also has an ethical dimension. Assuming the agents are used for beneficial purposes, the outcome of this thesis could have a positive societal impact, also in terms of the transparency of the constructed policy.

This thesis tries to answer the question: can a Q-learning agent be provided with a rationality based on ST?

The thesis proceeds as follows. Chapter 2 gives the theoretical context for the various concepts used. Chapter 3 explains the design and implementation of two separate game concepts that use different forms of Q-learning. Chapter 4 reports the experiments with the different games and the different forms of Q-learning. Chapter 5 provides a discussion of the thesis and a conclusion based on the findings of the experiments.


CHAPTER 2

Theoretical background

2.1 Simplicity Theory

Simplicity Theory (ST) is a cognitive computational model proposed to explain why certain situations or events are attractive to human beings (Dessalles 2013; Dessalles 2020a). It is based on the idea that the human mind is highly sensitive to drops in complexity, as described in the context of Algorithmic Information Theory (AIT). This subfield of theoretical computer science centers on the relationship between computation and the information content of generated objects such as strings (Chaitin 1977). In short, ST states that an event or situation becomes more interesting if the complexity of generating it (that is, of reproducing its occurrence mentally) is higher than the complexity of describing it. This difference is called Unexpectedness. In this section we will go through some of the formulas used to compute the different ST components.

2.1.1 Complexity

ST makes use of Kolmogorov complexity. In the context of strings, the Kolmogorov complexity is defined as the length in bits of the shortest program generating a given input string (Li, Vitányi, et al. 2008); more generally, the Kolmogorov complexity of an object is the length of the shortest program that produces the object as output (Li, Vitányi, et al. 2008). More formally, we denote with P_M the set of possible programs for controlling a machine M. For each program p ∈ P_M we denote with |p| its length as a number of instructions and with p() its output. The Kolmogorov complexity of a finite string x for a machine M, written K_M(x), is defined as:

K_M(x) = min_{p ∈ P_M} { |p| : p() = x }

2.1.2 Unexpectedness

According to ST, Unexpectedness captures the informational relevance of a situation. In formula, the Unexpectedness of a situation (which equals how interesting it is) is computed as:

U = C_W − C_D

(U: Unexpectedness, C_W: World Complexity, C_D: Description Complexity)

The world or generation complexity (C_W) can be described as the minimum length of a program used by a machine modeling the functioning of the world. In other words, C_W depends on how the observer represents the world and its constraints, and measures how difficult it is for this model to generate the situation (Dessalles 2020g).

The description complexity (C_D) is the minimum length of a program, used by a computing machine that reproduces all cognitive abilities and knowledge of the observer, to generate the situation as a mental object (Dessalles 2020b). Both the world and the description complexity are defined as Kolmogorov complexities.

As stated before, Simplicity Theory is a model that explains why situations are attractive to humans. It is therefore important to make clear what counts as a situation.

In probability theory, the concept of an event/situation is used to describe a certain set of outcomes of an experiment to which a probability is assigned (Leon-Garcia 2008).

In Simplicity Theory, an important specification is that a situation can only be considered a situation if its description makes it unique. This distinction is important, because a 'situation' can easily be perceived as very simple when in fact it is very complex.

For example, if someone were to describe the event of a door opening, this situation could be perceived as very simple. However, since there are many doors in the world (and those may be opened many times), this simple description does not fully specify the situation. In fact, it could be a lot more complex to describe which specific door was opened and when, depending on the given context.

Example. To illustrate how the Unexpectedness formula translates to real-life situations, consider a simple lottery draw (Dessalles 2020e). When asked to imagine the outcome of a lottery draw of 5 numbers, people are likely to think of a completely random string of different numbers. In fact, it would probably be rather surprising if the actual outcome of the draw were 1-2-3-4-5. This is because, even though every outcome has the same chance of occurring, not every outcome feels equally plausible. The reason that an outcome like 1-2-3-4-5 is perceived to be very unlikely can be explained on the basis of the formula of Unexpectedness.

In a fair lottery, the complexity of the draw of five numbers (C_W) is the same for every draw. Assuming this imaginary lottery can only draw the numbers zero to nine, the complexity of generating 5 numbers equals 5 · log2(10). The complexity of describing the outcome of the lottery (C_D), however, is not the same for every draw. The draw 1-2-3-4-5 is a lot easier to describe ("one to five") than the draw 2-7-4-9-0 (which can only be described as "two, seven, four, nine, zero").

Taking the formula of Unexpectedness, U = C_W − C_D, we can now explain why the 1-2-3-4-5 lottery draw seems rather unlikely: with the same World Complexity C_W for both draws, and a lower Description Complexity C_D for the 1-2-3-4-5 draw than for the 2-7-4-9-0 draw, the Unexpectedness of the 1-2-3-4-5 draw is much higher.

ST claims that ex post probability can be defined using Unexpectedness with the following formula:

p = 2^(−U)

This formula suggests that humans assess probability through complexity, rather than the other way around (Dessalles 2020f).
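To make the two complexities concrete, the lottery example can be worked out numerically. The snippet below is only an illustration: the description complexities assigned to the two draws are assumptions made here for the sake of the example, not values prescribed by ST.

from math import log2

# Generation complexity: a fair draw of five digits (0-9) costs
# 5 * log2(10) bits, the same for every outcome.
C_W = 5 * log2(10)

# Illustrative description complexities (assumed values): "one to five" is
# treated as a short description, while the arbitrary draw can only be
# described digit by digit.
C_D = {
    "1-2-3-4-5": 2 * log2(10),
    "2-7-4-9-0": 5 * log2(10),
}

for draw, c_d in C_D.items():
    U = C_W - c_d          # Unexpectedness: U = C_W - C_D
    p = 2 ** (-U)          # ex post probability: p = 2^(-U)
    print(f"{draw}: U = {U:.2f} bits, ex post probability = {p:.4f}")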

2.1.3 Emotion

ST introduces additional formulas to capture the practical relevance of a situation. In principle, a situation can be purely 'epistemic', meaning no emotion is attached to it. However, many situations do have an emotional dimension. According to ST, the Unexpectedness of a situation adds to its Emotional Intensity, which is computed as:

E = AE + U

where E = Emotional Intensity, AE = Actualized Emotional Intensity, and U = Unexpectedness. The Emotional Intensity does not include any information on whether the emotion is positive or negative; ST includes this information through a valence parameter.


The corresponding formula takes into consideration that complexity is represented on a logarithmic scale; therefore, to add valence, the emotional intensity has to be brought back to a linear scale.

An example that illustrates the Emotional aspect in Simplicity Theory is the following: a neighbor tells a story of how last week he was visiting the bank, when suddenly a robber came in and held him at gunpoint! This story is undeniably Unexpected (the situation is very unlikely to happen, but very easy to describe), but there is also an Emotional aspect to it.

Hearing this story, it can easily be imagined how frightening this situation must have been. Now if we take the situation of (oneself currently) “being held at gunpoint by a robber during a visit to the bank” and call this s(self, now), we can use E(s(self, now)) to describe the Emotional Intensity that is connected to the situation (Dessalles 2020c). The Actualized Emotion can then be computed as:

AE(s) = E(s(self, now)) − U(s(self, now))

Generalizing the situation to other people and time, we call s(x, t) the situation "x was held at gunpoint during a visit to the bank t days ago", and we can write E(s(x, t)) as follows:

E(s(x, t)) = AE(s) + U(s(x, t)).

In this example, we can assume that the neighbor does not have any influence on the causal complexity (C_W) of the situation. We can now rewrite the formula with C_D(s(self, now)) = 0 and C_D(s(x, t)) = C_D(s) + C_D(x) + C_D(t):

E(s(x, t)) = AE(s) + U(s(x, t)) = E(s(self, now)) − C_D(s) − C_D(x) − C_D(t)

Following this formula, it can be derived that the story told by the neighbor has a higher Emotional Intensity if the content of the story is already in the context (C_D(s) = 0), if it happened to a close acquaintance (small C_D(x)) and if it happened recently (small C_D(t)).

2.1.4 Intention

Intention in Simplicity Theory is defined as the global necessity of the action from the actor’s perspective (Dessalles 2020d). If we consider an action a that has one emotional consequence s which is valued with intensity E(s), we can write the following formula:

I(a, s) = E(a ∗ s) − U(a ∗ s) ≈ E(s) − U(s||a) − U(a)

in which I = Intention, E = Emotional Intensity and U = Unexpectedness (we assume E(s) ≈ E(a ∗ s), i.e. the emotional cost of the action performance is marginal to its outcome).

We can see that the intention increases with the emotional intensity of the action (the importance of the outcome) and decreases with both the uncertainty of the causal link (represented by U(s||a)) and inadvertence (represented by U(a)).

2.2 BDI agent model

The BDI agent model is the most common template used to program intelligent agents. It is based on three mental attitudes: Beliefs (the knowledge about the world), Desires (the objectives to be completed) and Intentions (the actions that are executed in order to achieve the desires).

The general cycle of a BDI execution is as follows (Mascardi, Demergasso, and Ancona 2005):

1. observe the world and the agent's internal state, and update the event queue accordingly;

2. generate possible new plan instances whose trigger event matches an event in the event queue (relevant plan instances) and whose precondition is satisfied (applicable plan instances);

3. select one of the applicable plan instances for execution;

4. push the selected instance onto an existing or new intention stack, according to whether or not the event is a (sub)goal;

5. select an intention stack, take the topmost plan instance and execute the next step of this current instance: if the step is an action, perform it; otherwise, if it is a subgoal, insert it on the event queue.

The BDI programming model has several existing agent implementations, such as AgentSpeak (Bordini, Hübner, and M. Wooldridge 2007) and dMARS (d'Inverno et al. 2004); in this project, however, we consider a minimal interpretation of the BDI model that can be compared with the policies found by reinforcement learning methods such as Q-learning.
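The minimal interpretation adopted in this project can be pictured with a short sketch of the cycle above. This is a rough illustration rather than AgentSpeak or dMARS: plans are assumed to be plain Python dictionaries with trigger, precondition and steps fields, and the observe and perform functions are assumed to be supplied by the environment.

from collections import deque

def bdi_cycle(beliefs, plan_library, observe, perform):
    """Minimal sketch of the BDI cycle described above (an illustration,
    not a full interpreter)."""
    events = deque()
    intentions = []                          # list of stacks of plan instances

    while True:
        # 1. observe the world and update the event queue
        events.extend(observe(beliefs))
        if not events and not intentions:
            break
        if events:
            event = events.popleft()
            # 2. relevant (matching trigger) and applicable (precondition holds)
            applicable = [p for p in plan_library
                          if p["trigger"] == event["type"]
                          and p["precondition"](beliefs)]
            if applicable:
                instance = {"plan": applicable[0], "step": 0}   # 3. select one
                intentions.append([instance])   # 4. push it (always a new stack here,
                                                #    a simplification)
        if intentions:
            # 5. take the topmost instance of an intention stack, run one step
            stack = intentions[-1]
            top = stack[-1]
            steps = top["plan"]["steps"]
            if top["step"] >= len(steps):
                stack.pop()                  # plan instance finished
                if not stack:
                    intentions.pop()
            else:
                kind, value = steps[top["step"]]
                top["step"] += 1
                if kind == "action":
                    perform(value, beliefs)
                else:                        # a subgoal goes back on the queue
                    events.append({"type": value})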

2.3 Q-learning

To be able to comment on the efficiency of an agent with ST cognition, this cognition can be integrated into existing problem-solving algorithms; in this thesis that algorithm is Q-learning. Q-learning (Watkins and Dayan 1992) is a reinforcement learning algorithm that learns what actions an agent should take based on a reward system. Each action the agent takes brings a certain (positive or negative) numerical reward. In Q-learning, the agent starts with an empty Q-table, and as it proceeds to take actions the Q-table is updated with the Q-values that are tied to these actions and the state the agent is in. The formula for the updated Q-value is:

Q_new(s, a) = (1 − α) · Q(s, a) + α · (r(s, a) + γ · max_b Q(δ(s, a), b))

where α is the learning rate (between 0 and 1), Q(s, a) is the old Q-value tied to the executed action a at the old state s, r(s, a) is the reward for the action at the old state, γ is the discount factor (between 0 and 1) and max_b Q(δ(s, a), b) is the largest Q-value in the Q-table tied to the new state δ(s, a).

As seen in the formula, the updated Q-value consists of a combination of the old Q-value and a newly perceived Q-value. This combination is dependent on the learning rate α, where a higher learning rate means the agent gives priority to the most recent information and a lower learning rate means the agent gives priority to prior knowledge. The newly perceived Q-value is in turn dependent on a discount factor γ. A higher discount factor means the agent will give future rewards greater importance, while a lower discount factor means the agent mostly considers the current reward.

During the learning phase, the agent can make a decision on which action to take either by randomly selecting the next action (exploring), or by looking at the Q-table and selecting the action that is linked to the highest Q-value (exploiting). This decision making is dependent on an  value (between 0 and 1), where a higher  means the agent will mostly take actions based on the Q-table, and a lower value means the agent will mostly take random actions. After the learning phase, the  is set to 1 and every action the agent takes is based on the previously filled Q-table. Finding the current state in the Q-table and determining which action is tied to the highest Q-value is how the agent decides which action to take.
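The update rule and the action selection described above can be sketched compactly; the function and variable names below are illustrative, and ε is used, as in this thesis, as the probability of exploiting rather than exploring.

import random
from collections import defaultdict

# Q-table with a default value of 0 for (state, action) pairs not yet visited.
Q = defaultdict(float)

def q_update(s, a, r, s_next, actions, alpha=0.9, gamma=0.8):
    """Q_new(s,a) = (1-alpha)*Q(s,a) + alpha*(r + gamma * max_b Q(s_next, b))."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

def choose_action(s, actions, epsilon=0.9):
    """Epsilon as used in this thesis: the probability of exploiting the Q-table."""
    if random.random() < epsilon:
        return max(actions, key=lambda a: Q[(s, a)])   # exploit
    return random.choice(actions)                       # explore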

2.4 ST-based learning

To merge Q-learning and ST, we do so in a way that lets both Unexpectedness and Actualized Emotion influence the way the agent learns. Looking back at the formula of the newly perceived Q-value,

r(s, a) + γ · max_b Q(δ(s, a), b),

this value consists of a current reward r(s, a) and an aggregate value of future rewards max_b Q(δ(s, a), b). Considering that the discount factor γ gives more or less importance to the aggregate of future rewards, this value can be given the interpretation of a probability of success. In other words, the γ value can be interpreted as how much the agent can count on receiving the future rewards.

As this discount factor could be dependent on the target state δ(s, a), the newly perceived Q-value can be rewritten as

Q̇(s, a) = r(s, a) + max_b γ(t | a, s) · Q(t, b)

This formula can then be reinterpreted in terms of an aggregate reward:

R_s(a) = r_s(a) + max_b P(t | a, s) · R_t(b)

where r_s(a) is the immediate reward and P is the probability of transition. Now the following correspondence can be found:

P(t | a, s) · R_t(b)   ⇒_log   log R_t(b) − log(1 / P(t | a, s))

Since the formula p = 2^(−U) states that the ex post probability p depends on Unexpectedness (Dessalles 2020f; Dessalles 2011; Saillenfest and Dessalles 2015), this formula can be turned around to find U = log(1/p). This corresponds to the last part of the aggregate reward formula, and so the newly perceived Q-value can be calculated as:

AE_s^P(a) = Aggregate{ log2 r_s(a), max_b AE_t(b) − U_s(t||a) }

where AE_s^P(a) is the aggregated anticipated emotion of the action a executed in state s, r_s(a) is the immediate reward, max_b AE_t(b) is the maximum aggregate anticipated emotion of the target state t, and U_s(t||a) is the Unexpectedness of t occurring if a is executed in state s.

The new Q-value is finally calculated as follows:

AE_new(s, a) = (1 − α) · AE(s, a) + α · Aggregate{ log2 |r(s, a)|, max_b (AE_t(b) − U(t||a)) }

where AE_new(s, a) is the updated AE value, α is the learning rate, AE(s, a) is the AE value tied to the old state and the current action, |r(s, a)| is the absolute reward tied to the action and the old state, max_b AE_t(b) is the maximum AE value of the target state and U(t||a) is the Unexpectedness of the target state. In this formula, we only consider the intensity of the anticipated emotion, not its valence.

Looking in more detail, to respect optimal control principles (the max is there to select the best action available) and knowing that AE is an intensity and therefore has no valence (we need an additional "valence" function to capture whether the emotion is positive or negative), we need to distinguish the treatment of intensity and valence in this selection. Let us consider the most plausible best case AE+ and the most plausible least worst case AE−. AE+ equals the emotion of a situation with the maximum AE and positive valence; AE− equals the emotion of a situation with minimal AE and negative valence. Note that there is a priority in the selection: only if no emotion with positive valence exists is an emotion of negative valence taken into account.
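As a rough sketch of how this selection could be operationalized, the snippet below combines the AE update with the AE+/AE− priority rule. It assumes the simple additive aggregation that Chapter 3 actually implements, and a hypothetical valence function returning +1 or -1 for a (state, action) pair; neither choice is prescribed by ST itself.

from math import log2
from collections import defaultdict

AE = defaultdict(float)          # anticipated-emotion table, default 0

def best_future_emotion(t, actions, U, valence):
    """AE+ / AE- selection: prefer the largest positive-valence anticipated
    emotion; only if none exists, fall back on the least bad negative one."""
    discounted = {b: AE[(t, b)] - U for b in actions}
    positive = [v for b, v in discounted.items() if valence(t, b) > 0]
    if positive:
        return max(positive)                    # AE+: most plausible best case
    negative = [v for b, v in discounted.items() if valence(t, b) < 0]
    return min(negative) if negative else 0.0   # AE-: least worst case

def ae_update(s, a, reward, t, actions, U, valence, alpha=0.9):
    """AE_new(s,a) = (1-alpha)*AE(s,a) + alpha*Aggregate{log2|r|, AE+/- - U},
    with Aggregate taken here as a plain sum (the simplification of Chapter 3)."""
    immediate = log2(abs(reward)) if reward else 0.0
    future = best_future_emotion(t, actions, U, valence)
    AE[(s, a)] = (1 - alpha) * AE[(s, a)] + alpha * (immediate + future)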


CHAPTER 3

Design and implementation

3.1 BDI agent

All programs were written in Python. Since the goal of this thesis is to provide a (minimal) BDI agent with ST cognition, the agents that are used during the experiments have to be linked to the BDI model. This is done by providing the agents with knowledge of the world state (Belief), giving the agent a certain goal to achieve (Desire) and creating a plan for the agent to complete the goal (Intention). See the individual game setups (3.4.1 and 3.5.1) for a more specific explanation of the link between the agent and the BDI model.

3.2 Minimal program search

3.2.1 Brute force search algorithm

To be able to compute both the World Complexity and the Description Complexity, it is necessary to be able to find a minimal program. The simplest way of finding a minimal program is by using a brute force search algorithm, i.e. an algorithm that explores all possible solutions to a problem. To find the minimal program, the solution with the lowest complexity is then selected. Depending on the problem, the brute force search algorithm might have to explore a large number of solutions, which can make it unsuitable for finding the minimal program for complex problems.

The brute force search algorithm that was designed for finding the minimal program consists, in short, of the following steps:

• define what operators the program uses

• define what parameters are used by the different operators

• for every operator, see if there are any parameters and generate these parameters

• try to perform the action (execute the operator with or without the parameter(s))

• if the action is successful, add the action to the list of tried actions linked to the current state

• go to the next state and repeat the process

• at the last state (if the object input length matches the object output length), backtrack and try any other actions that have not been tried yet at every state

This backtracking algorithm makes sure that every solution is tried by keeping track of the states that have been reached and by linking each state and action to the next state. When all actions have been tried unsuccessfully at a certain state, the algorithm backtracks to the previous state and proceeds from there.
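One possible Python rendering of this backtracking search is sketched below. The operator interface (an apply function returning the next state or None, plus a complexity cost per action) is an assumption made for the sketch; the machines actually used in this project are described in Sections 3.4.2 and 3.5.2.

def brute_force_search(operators, start, target, max_len):
    """Depth-first backtracking search over all action sequences of length up
    to `max_len`, keeping the lowest-complexity program whose output equals
    `target`. `operators` maps an action name to an (apply, cost) pair, where
    `apply(state)` returns the next state or None if the action fails there."""
    best_program, best_cost = None, float("inf")

    def search(state, program, cost):
        nonlocal best_program, best_cost
        if cost >= best_cost:
            return                          # cannot improve on the best so far
        if state == target:
            best_program, best_cost = list(program), cost
            return
        if len(program) == max_len:
            return                          # depth bound reached: backtrack
        for name, (apply_op, op_cost) in operators.items():
            next_state = apply_op(state)
            if next_state is None:
                continue                    # the action failed in this state
            program.append(name)
            search(next_state, program, cost + op_cost)
            program.pop()                   # backtrack and try the next action

    search(start, [], 0.0)
    return best_program, best_cost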


3.2.2 A* algorithm

Different existing search algorithms can also be used to find the minimal program. The A* algorithm is one example of an algorithm that can be used.

The A* algorithm used for finding the minimal program follows, in short, the next steps:

• define what operators the program uses

• define what parameters are used by the different operators

• for every operator, see if there are any parameters and generate these parameters

• try to perform the action (execute the operator with or without the parameter(s))

• if the action is successful, add the achieved state and action to a list

• when all possible actions at the current state have been executed, compare the states in the created list

• select the state(s) that is/are closest to the input (based on some defined heuristic)

• for these states, repeat the process

• if the input matches the output, the shortest solution has been found

The algorithm follows the path that minimizes the function:

f(n) = g(n) + h(n).

Here, g(n) is the depth of the node and h(n) is the heuristic function. The heuristic function serves to compare the different branches and rank them based on the estimated distance from the current node to the goal node, so as to determine which branch(es) to follow (Pearl 1984). Since we use this algorithm to find a minimal program, the complexity of the latest action was added to the heuristic. This way, the search algorithm follows the branches that not only minimize the distance to the goal node, but also use the actions with the lowest complexity.

The algorithm starts with g(n) = 0 (state 0) and then tries to execute all the different actions. The outputs that belong to these actions are saved in a list. Then, all the different outputs are compared to the input and the heuristic function h(n) is calculated. The g(n) is added to the h(n) of every action to get f(n). Then the action(s) that give(s) the lowest f(n) is/are saved and g(n) is increased by one (the algorithm goes one node deeper). The process is then repeated, until the input and output are the same.
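For comparison, a generic version of the A* search is sketched below using a priority queue. As a simplification, g(n) here is the accumulated action complexity rather than the plain node depth used above, so the complexity term is folded into g instead of h; actions(state) is assumed to yield (name, next state, cost) triples and heuristic to estimate the remaining cost.

import heapq

def a_star_search(start, target, actions, heuristic):
    """A* over program states (states must be hashable), expanding the node
    with the lowest f(n) = g(n) + h(n). Returns the found program and its cost."""
    counter = 0                              # tie-breaker for the heap
    frontier = [(heuristic(start, target), 0.0, counter, start, [])]
    best_g = {start: 0.0}
    while frontier:
        _, g, _, state, program = heapq.heappop(frontier)
        if state == target:
            return program, g                # cheapest program found
        for name, next_state, cost in actions(state):
            g_next = g + cost
            if g_next < best_g.get(next_state, float("inf")):
                best_g[next_state] = g_next
                counter += 1
                f_next = g_next + heuristic(next_state, target)
                heapq.heappush(frontier,
                               (f_next, g_next, counter, next_state, program + [name]))
    return None, float("inf")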

3.3 Complexity calculation

When a program has been found, there needs to be a way to calculate the complexity of this program. The following pseudocode gives the algorithm that is used to find the complexity of a found program:

Algorithm 1: Algorithm to find the unordered complexity of a given program.

K = 0
for action in program do
    K += log2(length(operators))
    for parameter in parameters do
        K += log2(length(parameter))
    end
end

In this algorithm, both the number of operators and the number of parameters per operator are important in the calculation of the complexity of the program.
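A runnable reading of Algorithm 1 is the following. It assumes that a program is represented as a list of (operator, parameter values) pairs and that, for each operator, the sets of values its parameters can range over are known; these representation choices are assumptions, not part of the pseudocode above.

from math import log2

def program_complexity(program, operators):
    """Unordered complexity of a program: log2(#operators) bits per action,
    plus log2(#possible values) bits per parameter of that action."""
    K = 0.0
    for op, params in program:
        K += log2(len(operators))                     # choice of the operator
        for value_set, value in zip(operators[op], params):
            K += log2(len(value_set))                 # choice of each parameter
    return K

# Illustrative machine (assumed): two operators, one of which takes a single
# parameter that can take ten different values.
operators = {"MOVE": [], "SET": [list(range(10))]}
program = [("MOVE", []), ("SET", [7])]
print(program_complexity(program, operators))   # log2(2) + log2(2) + log2(10)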


3.4 First game: String Leveller

3.4.1 Setup

The first test of a Q-learning agent with ST-based cognition is done in a very simple environment: a 1D line game meant to explore a proof of concept of the integration. We call this game 'String Leveller'. In this game, an agent can "walk" over a 1D line of numbers (0-9). The goal of the game is to create certain patterns in the world configuration that are easy to memorize/recognize (the 'Desire' part of the BDI model). Examples of patterns that are recognized in the world configuration as relevant for the game are patterns consisting of all the same numbers (e.g. "33333" or "99999"). The game is defined using the following parameters:

Parameter | Definition | Value
p | probability of change | adjustable*
p1** | probability of numbers 1-5 | 0.05 · p
p2** | probability of numbers 6-7 | 0.1 · p
p3** | probability of number 8 | 0.2 · p
p4** | probability of number 9 | 0.3 · p
l | length of the line | 5***

Table 3.1: *the probability of change can be adjusted and has no set value, **part of the non-uniform non-deterministic world, ***the length of the line during the learning phase is 5, but the line to be solved can be of any length

The position of the agent starts at the beginning of the line (position 0) and after each time step the agent moves one place to the right. If the agent is at the last position of the line, instead of moving to the right, the agent returns to the beginning of the line. At its current position, the agent can either increase or decrease the number by one, or leave the number unchanged. The agent can only see the values of the cell at the current position, one cell to the left and one cell to the right (its neighborhood). This is the 'Belief' part of the BDI model. If the agent is at the beginning of the line, the cell at the end of the line becomes the left neighbor of the current cell. Likewise, when the agent is at the end of the line, the cell at the beginning of the line becomes the right neighbor of the current cell. Moving around and changing the numbers make up the 'Intention' part of the BDI model.

There are two distinct settings considered in the testing of the game:

1. Uniform non-deterministic world. In this world there is a certain probability of change, but every number has the same probability.

2. Non-uniform non-deterministic world. In this world there is a certain probability of change, and every number has a certain probability.

At each time step, there is a certain (adjustable) chance of a random change occurring in the world configuration: the number at a random position in the world configuration is then changed to a random number from 0-9. In the non-uniform non-deterministic world, the replacement numbers each have their own probability of occurring (see Table 3.1). This means that every time a number in the world configuration is randomly changed, the chance is highest that it is replaced by the number 9, then 8, and so on.

3.4.2 Q-learning

For every Q-learning version, during the learning phase, the world configuration starts with a random line of numbers of length 5. There are 1000 learning episodes, meaning that 1000 random lines of length 5 have to be solved to fill the Q-table. Each Q-learning game uses an ε of 0.9, meaning that 9 out of 10 times the agent takes an action based on the Q-table, while 1 out of 10 times it takes a random action. The α (learning rate) starts at 0.9 and then gradually lowers to 0.1. The γ (discount factor) used for the first two Q-learning versions is set to 0.8. After the learning phase, the line to be solved can be any line of any length.

Simple Q-learning

In this version of the game, the agent interacts with the line based on normal Q-learning. The reward that the agent receives is based on the neighborhood that the agent creates by performing an action. To calculate this reward, the values of the left and right cells in the neighborhood are compared to the value of the cell the agent is currently at. The absolute differences are summed and the result is multiplied by -1.

E.g. the neighborhood “123” would receive a reward of -2 (difference between 2 and 1 is one, difference between 2 and 3 is one, -(1+1) = -2), the neighborhood “369” would yield a reward of -6 (6-3=3, 9-6=3, reward=-(3+3)=-6).
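The reward computation is small enough to state directly; the sketch below reproduces the two worked examples.

def neighborhood_reward(neighborhood):
    """Reward of the simple Q-learning version: minus the summed absolute
    difference between the center cell and its two neighbors."""
    left, center, right = neighborhood
    return -(abs(center - left) + abs(center - right))

assert neighborhood_reward([1, 2, 3]) == -2   # the "123" example
assert neighborhood_reward([3, 6, 9]) == -6   # the "369" example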

Then, the Q-value is updated. This Q-value is tied to the executed action a at the old neighborhood (state s).

Since the agent prefers to execute actions that bring the highest (current and future) rewards, the agent aims to create neighborhoods with small differences between the values.

Q-learning and Description Complexity

In this version of the game, the reward for the Q-learning agent is changed to the Description Complexity of the changed neighborhood configuration.

The reward is calculated as −C_D, where C_D is the Description Complexity.

To calculate the Description Complexity of a given configuration, a machine is needed that can reproduce the configuration in a way that simulates a description of it (see Algorithm 3 in Appendix E for pseudocode).

The machine designed for this purpose has the operators MOVE, INCREASE and DECREASE.

• The MOVE operator moves the machine one place to the right, without changing the number at the current position. The number at the old position is saved as a temporary number.

• The INCREASE operator adds one to the saved temporary number and puts this number at the current position. The new number is then saved as the temporary number.

• The DECREASE operator decreases the saved temporary number by one and puts this number at the current position. The new number is then saved as the temporary number.

For example, the description program that gives the neighborhood configuration [0,3,1] would be: MOVE INCREASE INCREASE INCREASE MOVE DECREASE DECREASE.
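The machine can be made concrete with a tiny interpreter. The starting convention (the machine begins at position 0 of an all-zero line, with a temporary number of 0) is an assumption made here; it reproduces the [0, 3, 1] example above.

def run_description_program(program, length):
    """Execute a description program of MOVE/INCREASE/DECREASE operators and
    return the configuration it produces."""
    config = [0] * length
    pos, temp = 0, 0
    for op in program:
        if op == "MOVE":
            temp = config[pos]      # remember the number we are leaving behind
            pos += 1
        elif op == "INCREASE":
            temp += 1
            config[pos] = temp      # write the increased temporary number
        elif op == "DECREASE":
            temp -= 1
            config[pos] = temp      # write the decreased temporary number
    return config

program = ["MOVE", "INCREASE", "INCREASE", "INCREASE",
           "MOVE", "DECREASE", "DECREASE"]
assert run_description_program(program, 3) == [0, 3, 1]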

ST based Q-learning

The last version of the String Leveller game tries to merge Q-learning and ST together by replacing the normal calculation of the Q-value:

Q_new(s, a) = (1 − α) · Q(s, a) + α · (r(s, a) + γ · max_b Q(δ(s, a), b))

with the following calculation based on AE:

AE_new(s, a) = (1 − α) · AE(s, a) + α · (r(s, a) + min[AE_t(b) − U(t||a)])

In this formula AE_t(b) − U(t||a) > 0. This formula is based on the formula explained in Section 2.4, but slightly altered to make it simpler to implement in the code. Here we consider the sum r(s, a) + min[AE_t(b) − U(t||a)] as the aggregation part of the formula. The reward is calculated as C_D. Since the reward is higher when there is a larger difference between the numbers in the neighborhood, and the goal is to minimize the differences, we look for min[AE_t(b) − U(t||a)] instead of max[AE_t(b) − U(t||a)].

In the String Leveller game, the Unexpectedness is calculated each time the configuration is changed by the agent. The Unexpectedness of this change is calculated by comparing the neighborhood before the change (including any random change) to the neighborhood that has been changed by the agent. To calculate the Unexpectedness U(t||a) in the formula of the new Q-value, we need to be able to calculate not only C_D but also C_W. The Generation Complexity of a given configuration depends on the probability of a random change of the environment. In the case of the String Leveller game, each change in the world configuration has a certain chance of occurring randomly, so if a number in the configuration changes, the complexity of this change is calculated as log2(1/p), where p is the probability of a random change. Similarly, if there is no change in a neighborhood cell, this has probability 1 − p; the complexity of no change is thus calculated as log2(1/(1 − p)).

For example, in the case of the non-uniform non-deterministic world, if the neighborhood changes from [1, 2, 3] to [1, 5, 3], the Generation Complexity of this transition is calculated as 2 · log2(1/0.8) + log2(1/(0.2 · 0.05)) (the probability of change is 0.2, the probability of the number 5 is 0.05). (See Algorithm 2 for pseudocode.)

Algorithm 2: Algorithm to calculate the generation complexity of a given 1D neighborhood.

C_W = 0
pos = 0
old_n = old neighborhood
new_n = new neighborhood
while pos < length(new_n) do
    if old_n[pos] == new_n[pos] then
        C_W += log2(1/(1 − p))
    else
        C_W += log2(1/p)
    end
    pos += 1
end
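A runnable reading of Algorithm 2 is given below. It also folds in the digit probability used in the non-uniform world, as in the worked example above; passing number_prob=None gives the uniform case.

from math import log2

def generation_complexity(old_n, new_n, p, number_prob=None):
    """C_W of a 1D neighborhood transition: log2(1/(1-p)) per unchanged cell,
    log2(1/p) per changed cell, plus log2(1/prob(new digit)) for the
    replacement digit in the non-uniform world."""
    c_w = 0.0
    for old, new in zip(old_n, new_n):
        if old == new:
            c_w += log2(1 / (1 - p))
        else:
            c_w += log2(1 / p)
            if number_prob is not None:
                c_w += log2(1 / number_prob[new])
    return c_w

# The worked example: [1, 2, 3] -> [1, 5, 3] with p = 0.2 and a probability
# of 0.05 for the digit 5 in the non-uniform world.
c_w = generation_complexity([1, 2, 3], [1, 5, 3], p=0.2, number_prob={5: 0.05})
assert abs(c_w - (2 * log2(1 / 0.8) + log2(1 / (0.2 * 0.05)))) < 1e-9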

3.5 Second game: Cheese-seeking Mouse

3.5.1 Setup

The second game that is used to test the Q-learning agent with ST based cognition is a 2D grid game. We call this game ‘Cheese-seeking Mouse’. Table 3.2 shows the parameters used for the game.

The game consists of a grid that can be traversed by an agent, in this case a mouse. Some of the cells of the grid contain mouse traps and these cells cannot be crossed by the mouse. Each step the mouse takes costs energy, and when the mouse's energy level drops to zero it has to rest for 3 time steps. Spread across the grid are pieces of cheese that, when eaten, increase the energy level of the mouse. Whenever the mouse eats a piece of cheese, a new piece of cheese appears in a random spot on the grid. Moving around and eating the pieces of cheese make up the 'Intention' part of the BDI model. Each time step, a random cheese might be re-positioned to another place in the grid; this change of cheese position is based on a given probability p. The agent is aware whether there is a trap directly in front of or to the sides of its current position, how many pieces of cheese there are anywhere to the left/right/front of its current position, and what direction it is walking in (the 'Belief' part of the BDI model).


Parameter | Definition | Value
Gw | Grid width | 5
Gh | Grid height | 5
t | Number of traps | 3
c | Pieces of cheese on the grid | 3/5*
ec | Energy boost from cheese | 3
e | Energy level at start of game | 3
gc** | Number of cheese to eat (goal) | 5/10***
p**** | Path length of mouse | 40/60*****

Table 3.2: *Number of cheese on the grid can be either 3 or 5, **This goal is part of the first objective of the game, ***Can be either 5 or 10, ****This is part of the second objective of the game, *****Path can be either of length 40 or 60

In the Cheese-seeking Mouse game, we consider a uniform non-deterministic world, meaning there is a certain chance that a cheese changes position, and each position has the same chance.

There are two different objectives considered for this game (the 'Desire' part of the BDI model). The first objective is for the mouse to eat a given number of cheese in as few time steps as possible. The second objective is for the mouse to eat as much cheese as possible within a given time period/path length.

3.5.2 Q-learning

For every Q-learning version, during the learning phase, the cheese positions and the mouse starting position were random. The mouse traps always had the same positions (see Figure 3.1). There were 4000 learning episodes, meaning that the mouse had to fulfill the objective of the game 4000 times to fill the Q-table. Each Q-learning game used an ε of 0.9, and the α (learning rate) started at 0.9 and then gradually lowered to 0.1. The γ (discount factor) used for the first two Q-learning versions was set to 0.8.

The state that is tied to the Q-value was based on the current direction of the agent, the number of cheese that is anywhere to the left, right and front of the mouse, and the mouse traps directly to the left/right/front of the mouse.

After the learning phase, the starting position of the mouse is set to [0,4].

Simple Q-learning

In this version of the game, the mouse learns how to navigate the grid based on simple Q-learning. The rewards that the agent receives after each action are tied to the cells of the grid. The reward for ‘normal’ cells is -10, the reward for ‘cheese’ cells is 50 and the reward for ‘mouse trap’ cells is -100. Once the mouse has eaten the cheese, the cell becomes a normal cell and the reward changes from 50 to -10.

Q-learning with Description Complexity

In this version of the game, the reward is given by −C_D.

To be able to give a description of the cheese positions, the decision was made to describe all the paths from the current position of the mouse to all the positions of the cheese. To calculate the Description Complexity, the paths from the mouse to all the pieces of cheese have to be simulated by a machine.

Figure 3.1: Example Cheese-seeking Mouse game. The number on the mouse (green) depicts the current energy level.

The machine designed for this purpose has the operator MOVE and the parameters UP, DOWN, LEFT and RIGHT (see Algorithm 5 in Appendix E for pseudocode):

• The LEFT parameter moves the mouse one place to the left (on the x-axis)

• The RIGHT parameter moves the mouse one place to the right (on the x-axis)

• The UP parameter moves the mouse one place upwards (on the y-axis)

• The DOWN parameter moves the mouse one place downwards (on the y-axis)

For example, the description program that gives the path from [0,4] to [1,0] would be: MOVE RIGHT, MOVE UP, MOVE UP, MOVE UP, MOVE UP. Figure 3.2 illustrates this path. Since the Description Complexity depends on the sum of all the paths from the mouse to the pieces of cheese, the reward is higher the closer the mouse is to the cheese. In a situation where there are multiple pieces of cheese on one side of the mouse and a single piece on the other, the Description Complexity will be relatively lower if the mouse moves towards the group of cheese instead of the single piece. This means the mouse will prioritize going to more cheese if there is a choice, and since the mouse gains more energy by eating multiple pieces of cheese, it should have to rest less often.

When the mouse walks into a mouse trap or outside the grid, the Q-table is modified with a large value to deter the mouse from choosing this action again.
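The path description and the resulting C_D can be sketched as follows. The x-then-y ordering of the moves and the convention that UP decreases the y coordinate are assumptions made here (they reproduce the [0,4] to [1,0] example), and the per-step cost follows Algorithm 1 with one operator and four possible parameter values.

from math import log2

def path_program(mouse, cheese):
    """Straight path from the mouse to one piece of cheese as MOVE <dir>
    actions; traps are ignored, as in the machine used for the experiments."""
    (mx, my), (cx, cy) = mouse, cheese
    program = []
    program += [("MOVE", "RIGHT")] * max(cx - mx, 0)
    program += [("MOVE", "LEFT")] * max(mx - cx, 0)
    program += [("MOVE", "UP")] * max(my - cy, 0)      # UP: y decreases
    program += [("MOVE", "DOWN")] * max(cy - my, 0)
    return program

def description_complexity(mouse, cheeses):
    """C_D used as reward: summed complexity of the paths to every cheese,
    at log2(1) + log2(4) = 2 bits per step (one operator, four directions)."""
    return sum(len(path_program(mouse, c)) * (log2(1) + log2(4))
               for c in cheeses)

print(path_program([0, 4], [1, 0]))   # MOVE RIGHT, then MOVE UP four times
print(description_complexity([0, 4], [[1, 0], [3, 4]]))   # (5 + 3) * 2 = 16 bits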


Figure 3.2: Example of the path from the mouse (green) to the piece of cheese

ST based Q-learning

In this version of the game, the reward is given by C_D. Similar to the String Leveller game, the updated Q-value is calculated based on AE:

AE_new(s, a) = (1 − α) · AE(s, a) + α · (r(s, a) + min[AE_t(b) − U(t||a)])

In this formula, AE_t(b) − U(t||a) > 0.

In the Cheese-seeking Mouse game, the Unexpectedness is calculated each time the agent changes position. The Unexpectedness of the new position of the agent in relation to the cheese positions is calculated by comparing the cheese positions before any possible change to the cheese positions that might have been changed by the agent/environment. To calculate U(t||a) in the new Q-value formula, besides C_D we also need to calculate C_W.

The generation complexity of the cheese positions depends on the probability of a random change of the environment. In the case of the Cheese-seeking Mouse game, each change in the cheese positions has a certain chance of occurring randomly, so if a cheese position changes, the complexity of this change is calculated as log2(1/p), where p is the probability of a random change. Similarly, if a cheese position does not change, this has probability 1 − p; the complexity of no change is thus calculated as log2(1/(1 − p)).

Each cheese has its own number that identifies this specific piece of cheese. This is necessary to keep track of any position changes. For example, if the cheese positions change from [[0,[2,3]],[1,[3,4]],[2,[3,3]]] to [[0,[2,3]],[1,[3,4]],[2,[4,2]]] (cheese number 2 changed from position [3,3] to position [4,2]), the generation complexity of this transition is calculated as 2 · log2(1/0.8) + log2(1/0.2) (see Algorithm 3 for pseudocode).


Algorithm 3: Algorithm to calculate the generation complexity of a list of cheese positions.

C_W = 0
ch_prev = previous cheese positions
ch_cur = current cheese positions
for ch1 in ch_prev do
    for ch2 in ch_cur do
        if id(ch1) == id(ch2) then
            if ch1 == ch2 then
                C_W += log2(1/(1 − p))
            else
                C_W += log2(1/p)
            end
        end
    end
end


CHAPTER 4

Experiments and results

4.1 Comparison A* and brute force search algorithms

For all the different machines and for different inputs, the A* and brute force search algorithms were timed and compared.

The following table shows the comparison of the two algorithms:

Machine | Input / Mouse pos | Cheese pos | Time Brute Force | Time A* | A* − Brute Force | Quickest
CD Machine | 000 | - | 0.00011940559000000003 | 0.00024109218000000003 | 0.00012168659 | Brute Force
CD Machine | 090 | - | 0.0006385364599999999 | 0.00256354364 | 0.00192500718 | Brute Force
CD Mouse | [0,0] | [4,4] | 0.00039366411999999963 | 0.00074207425 | 0.00034841013 | Brute Force
CD Mouse | [0,0] | [0,1] | 6.988950999999965e-05 | 0.0004673608699999993 | 0.000397471359999 | Brute Force
CD Mouse | [0,0] | [19,19] | 0.00181082784 | 0.00923631495 | 0.00742548710 | Brute Force

Comparing the two search algorithms, it is evident that the brute force search algorithm was the fastest for all Description Complexity machines. Since the brute force search algorithm was faster for the programs used in this project, it was the algorithm used for finding the shortest solutions for both Description Complexity programs.

4.2 Comparing String Leveller game versions

The different versions of the String Leveller game were compared for different configurations and different probabilities of change. The given configurations were to be solved 800 times by each program, attempting to compensate for the differences in the random changes of the configuration.

4.2.1 Uniform non-deterministic world

First the uniform non-deterministic world was tested. In this world, every number that randomly replaced a number in the configuration had the same probability. The tables found in Appendix A give the average number of steps needed to reach a pattern in the given configurations per program, and the average time it took for the program to find a pattern.

Figure 4.1 shows the average number of steps needed to reach a pattern for the three different Q-learning versions for different probabilities of change (see table 4.6). The configuration to be solved was [3, 1, 6, 9, 0].


Figure 4.1: Uniform non-deterministic world, configuration [3, 1, 6, 9, 0]

Looking at the results of the experiments, the ST based Q-learning took the lowest average number of steps to find a pattern for most configurations and probabilities. Only for the configuration [3, 1, 6, 9, 0] with a 0.6 probability of change was the Q-learning with CD as reward the best at finding a pattern. It is worth noting that the differences between the Q-learn with CD program and the ST based Q-learn program were quite small, and the standard deviations of these programs largely overlapped. Only for the configurations of length 10 (tables 7.3 and 7.4) did the standard deviation of the simple Q-learn version not overlap with the two other Q-learn versions. Figure 4.1 seems to display an exponential relation between the steps needed to reach a pattern and the probability of change: the higher the probability, the more steps required.

4.2.2 Non-uniform non-deterministic world

Secondly, the non-uniform non-deterministic world was tested. In this world, every number that randomly replaced a number in the configuration had a certain probability (see table 3.1). The tables found in Appendix B show the average number of steps needed to find a pattern per program.

Figure 4.2 shows the average number of steps needed to reach a pattern for the three different Q-learning versions for different probabilities of change (see table 8.6). The configuration to be solved was [3, 1, 6, 9, 0].


Figure 4.2: Non-uniform non-deterministic world, configuration [3, 1, 6, 9, 0]

The results of the experiments in the non-uniform non-deterministic world were comparable to the ones found in the uniform non-deterministic world, except for the differences between the Q-learning with CD and the ST based Q-learning, which seemed to have increased. The standard deviations of these two versions, however, still largely overlapped. Overall, the agent needed fewer steps to find a pattern in the uniform non-deterministic world than in the non-uniform non-deterministic world.

4.3 Comparing Cheese-seeking Mouse game versions

The different versions of the Cheese-seeking Mouse game were compared for different parameters and different probabilities of change. The grid was traversed 1000 times by each program, attempting to compensate for variations in the random changes in cheese positions. Every time the mouse either walked into a mouse trap or outside of the grid, the path was marked as invalid.

4.3.1 First objective

The first objective of the game was for the mouse to eat a certain number of cheese in as few time steps as possible. The tables found in Appendix C show the average number of steps needed to eat the number of cheese in column 'Number to reach', the average energy level of the mouse during the game, the average time it took for the program to finish, the number of traps the mouse fell into and the number of invalid paths. The column 'Pieces of cheese' gives the number of cheese that are on the grid at all times. (See Figures 4.3 and 4.4 for the graphs.)


Figure 4.3: Pieces of cheese on the grid: 3, number of cheese to eat: 5


The experiments revealed that for all configurations with a probability of change up to 0.4, the simple Q-learning algorithm required the lowest average number of steps to reach the goal. Overall, the simple Q-learning mouse seemed to have the lowest risk of walking into a mouse trap. The average energy level of the mouse seemed to be largely associated with the average number of steps taken: the more steps were required to reach the goal, the lower the average energy level. The differences between the three Q-learning versions became smaller as the probability of change increased. Looking specifically at the version with the parameters set to 3 pieces of cheese on the grid and the goal to eat 5 pieces, the simple Q-learning algorithm was 'best' for all probabilities except 0.6 and 0.8. The Q-learning algorithm based on ST took the highest average number of steps to reach the goal for all probabilities. Looking at the version with the parameters set to 5 pieces of cheese on the grid and the goal to eat 10 pieces, the Q-learning algorithm with CD as reward was the 'best' algorithm for probabilities of change 0.6-0.8. For probability 0.6, the simple Q-learning was the worst of the three.

4.3.2 Second objective

The second objective of the game was for the mouse to eat as much cheese as possible while walking a path of predetermined length. The tables found in Appendix D display the average number of cheeses eaten by the mouse during the path of predetermined length (in column 'Path length'), the average energy level of the mouse during the game, the average time it took for the program to finish, the number of traps the mouse fell into and the number of invalid paths. (See Figures 4.5 and 4.6 for the graphs.)


Figure 4.6: Path length: 60

For the second objective of the game, the experiments showed that in both versions, with the path length set to 40 and to 60, the Q-learning mouse based on ST consumed the least amount of cheese for all probabilities. The differences between the amounts of cheese eaten per program became smaller as the probability of change increased. The standard deviations of all three programs overlapped. The average level of energy tended to have a direct relationship with the average amount of cheese consumed, indicating that the higher the average energy level, the more cheese the mouse had consumed on average. For the version with the path length set to 40, the simple Q-learning mouse ate the most pieces of cheese on average for all probabilities except 0.6 and 0.8, where the Q-learning mouse with CD ate the most pieces. When the path length was set to 60, the simple Q-learning algorithm performed best until p = 0.6, after which the Q-learning with CD appeared to be the best option. There did not seem to be a clear correlation between the Q-learning version and getting stuck in a mouse trap.


CHAPTER 5

Discussion and conclusion

5.1 Discussion

It was not expected that the brute force search algorithm would be the quickest search algorithm, as the brute force search algorithm explores all possible solutions while the A* algorithm stops once the shortest solution is found. The unexpected outcome of the comparison between the brute force search algorithm and the A* algorithm may be explained by the fact that the code for the A* algorithm takes longer to calculate each solution. For each action that is performed, the A* algorithm has to recalculate f(n) for all solutions, which can cause the program to take longer overall. This explanation is supported by the fact that when the A* and brute force search algorithms were tested on different machines (not used in the final version of the programs), the A* algorithm was indeed faster than the brute force search algorithm when many different solutions were possible. In both Description Complexity machines, the constraints imposed on the execution of the actions were so strict that each program had only one possible solution; this was primarily done to speed up the programs.

In the String Leveller game, the ST based Q-learning version seemed best at finding patterns in both the uniform and the non-uniform non-deterministic world, and the difference in performance between the ST based Q-learning and the other two Q-learning versions seemed to increase in the non-uniform non-deterministic world. This outcome was what we expected, since the ST based Q-learning agent is able to recognize the Unexpectedness of states, and in the non-uniform non-deterministic world the different numbers have different probabilities of occurring, amplifying the differences in Unexpectedness between states.

After the learning phase in the Cheese-seeking Mouse game, the mouse still chooses a faulty move every now and then, causing it to fall into a trap or go outside of the grid. One possible reason for this behavior is that, since the Q-table is initialized with zeros for states that have not yet been visited by the mouse, the mouse may select an incorrect action because that action has not yet been performed in that state. Depending on the Q-learning version, the Q-value of zero will then be smaller or larger than any other Q-value at this position in the Q-table, leading the mouse to prefer this action.

One of the reasons why the simple Q-learning version seemed to be the better choice in the majority of the experiments may be that the Description Complexity calculated for the Cheese-seeking Mouse game did not account for any of the grid's obstacles/mouse traps (another version of the Description Complexity machine that did take obstacles into account was too slow to be used for experimenting). This means that, according to the Description Complexity, the pieces of cheese may appear closer to the mouse than they actually are, which could result in the mouse choosing a less efficient path.


Moreover, since the results are dependent on the actual locations of the cheese, which are randomly modified, it is possible that, despite the fact that each Q-learning version ran the game 1000 times for each experiment, the findings do not adequately represent the differences in output between the three Q-learning versions.

We started this project with the plan to experiment with different ST concepts in the context of BDI agents, but in order to make the programming process simpler, we decided to consider a minimal interpretation of the BDI model instead.

5.2 Conclusion

In this project, we tried to find a way to provide Q-learning agents with a cognition based on ST. We hypothesized that if a Q-learning agent had the ability to recognize the relevance of events, it might perform better. We tested three different Q-learning versions in two separate games, and compared the results.

Overall, it seemed that the ST based Q-learning was the best performing Q-learning algorithm for the String Leveller game, but the worst performing for the Cheese-seeking Mouse game. Since the two games show very different results, the final conclusion is that further research is needed before we can draw firm conclusions about the performance of Q-learning agents based on ST. Since we used a simplified formula to calculate the Q-value, one suggestion for future research would be to implement the full formula and experiment with it.


Bibliography

Bordini, Rafael H., Jomi Fred Hübner, and Michael Wooldridge (2007). Programming Multi-Agent Systems in AgentSpeak Using Jason (Wiley Series in Agent Technology). Hoboken, NJ, USA: John Wiley & Sons, Inc. ISBN: 0470029005.

Chaitin, Gregory J (1977). "Algorithmic information theory". In: IBM Journal of Research and Development 21.4, pp. 350–359.

d'Inverno, Mark et al. (July 2004). "The dMARS Architecture: A Specification of the Distributed Multi-Agent Reasoning System". In: Autonomous Agents and Multi-Agent Systems 9, pp. 5–53. DOI: 10.1023/B:AGNT.0000019688.11109.19.

Dessalles, Jean-Louis (2011). "A structural model of intuitive probability". In: arXiv preprint arXiv:1108.4884.

— (2013). "Algorithmic simplicity and relevance". In: Algorithmic Probability and Friends, LNAI 7070. Ed. by David L. Dowe. Berlin: Springer Verlag, pp. 119–130. DOI: 10.1007/978-3-642-44958-1_9. URL: https://www.dessalles.fr/papers/Dessalles_11061001.pdf.

— (2017). "Conversational topic connectedness predicted by Simplicity Theory". In: CogSci.

— (2020a). Simplicity Theory. https://simplicitytheory.telecom-paris.fr/SimplicityTheory.html. Accessed: 2021-02-17.

— (2020b). Simplicity Theory - Description Machine. https://simplicitytheory.telecom-paris.fr/OMachine.html. Accessed: 2021-02-17.

— (2020c). Simplicity Theory - Emotion. https://simplicitytheory.telecom-paris.fr/Emotion.html. Accessed: 2020-12-18.

— (2020d). Simplicity Theory - Intention. https://simplicitytheory.telecom-paris.fr/Morality.html. Accessed: 2020-12-18.

— (2020e). Simplicity Theory - Lottery. https://simplicitytheory.telecom-paris.fr/Lottery.html. Accessed: 2021-02-17.

— (2020f). Simplicity Theory - Probability. https://simplicitytheory.telecom-paris.fr/Probability.html. Accessed: 2020-12-18.

— (2020g). Simplicity Theory - World Machine. https://simplicitytheory.telecom-paris.fr/WMachine.html. Accessed: 2021-02-17.

Leon-Garcia, A. (2008). Probability, Statistics, and Random Processes for Electrical Engineering. Pearson/Prentice Hall. ISBN: 9780131471221. URL: https://books.google.nl/books?id=GUJosCkbBywC.

Li, Ming, Paul Vitányi, et al. (2008). An Introduction to Kolmogorov Complexity and Its Applications. Vol. 3. Springer.

Mascardi, Viviana, Daniela Demergasso, and Davide Ancona (Jan. 2005). "Languages for Programming BDI-style Agents: an Overview". In: pp. 9–15.

Pearl, J (Jan. 1984). Heuristics: Intelligent Search Strategies for Computer Problem Solving. URL: https://www.osti.gov/biblio/5127296.

Saillenfest, Antoine and Jean-Louis Dessalles (July 2015). "Some Probability Judgments may Rely on Complexity Assessments". In:

Watkins, Christopher JCH and Peter Dayan (1992). "Q-learning". In: Machine Learning 8.3-4, pp. 279–292.

(34)

Wooldridge, M. J. and N. R. Jennings (1995). “Intelligent Agents: Theory and Practice”. url: https://eprints.soton.ac.uk/252102/.

(35)

CHAPTER 6

Appendix A

Average number of steps to reach pattern (st.dev) Average time

Simple Q-learn 39.57 (49.62) 0.00153989315375

Q-learn with CD 14.15 (17.08) 0.01611983187375

ST based Q-learn 13.36 (15.70) 0.011004262181250002

Table 6.1: Configuration: [3, 2, 3, 4, 3], probability: 0.2

Average number of steps to reach pattern (st.dev) Average time

Simple Q-learn 67.20 (45.39) 0.0019063875324999999

Q-learn with CD 41.22 (20.47) 0.016305988464999997

ST based Q-learn 38.09 (21.01) 0.0113169151975

Table 6.2: Configuration: [0, 5, 9, 5, 0], probability: 0.2

Average number of steps to reach pattern (st.dev) Average time

Simple Q-learn 2461.8 (2398) 0.0291384557375

Q-learn with CD 267.7 (221.5) 0.0191740077925

ST based Q-learn 242.1 (197.4) 0.014703877826249991

Table 6.3: Configuration: [3, 1, 6, 9, 0, 3, 0, 4, 8, 6], probability: 0.2

Average number of steps to reach pattern (st.dev) Average time

Simple Q-learn 3203.1 (3316) 0.034924183426250005

Q-learn with CD 268.7 (201.3) 0.018316442210000002

ST based Q-learn 242.5 (197.1) 0.013433208250000002


Version | Probability | Average n. of steps to reach pattern (st.dev) | Average time
Simple Q-learn | 0.1 | 130.3 (105.8) | 0.0023382904875000003
Q-learn with CD | 0.1 | 51.61 (26.30) | 0.018091633991249998
ST based Q-learn | 0.1 | 31.33 (10.40) | 0.008875075413749997
Simple Q-learn | 0.2 | 101.9 (77.34) | 0.0020420090825
Q-learn with CD | 0.2 | 43.27 (21.97) | 0.017638260795
ST based Q-learn | 0.2 | 37.93 (19.20) | 0.012261477675
Simple Q-learn | 0.4 | 182.5 (166.4) | 0.0050628236325
Q-learn with CD | 0.4 | 79.92 (61.71) | 0.03194513370375
ST based Q-learn | 0.4 | 79.45 (60.82) | 0.029744732921250003
Simple Q-learn | 0.6 | 413.4 (388.2) | 0.010746706997500013
Q-learn with CD | 0.6 | 170.8 (161.9) | 0.06904042051874999
ST based Q-learn | 0.6 | 180.4 (165.3) | 0.06677792251499998
Simple Q-learn | 0.8 | 736.4 (682.9) | 0.01748411660875
Q-learn with CD | 0.8 | 381.3 (390.1) | 0.13317044019000002
ST based Q-learn | 0.8 | 368.9 (368.3) | 0.13770193607375


CHAPTER 7

Appendix B

Average number of steps to reach pattern (st.dev) Average time

Simple Q-learn 54.23 (42.47) 0.00478498171875

Q-learn with CD 34.72 (24.34) 0.018464826685

ST based Q-learn 17.80 (18.62) 0.012872732571249998

Table 7.1: Configuration: [3, 2, 3, 4, 3], probability: 0.2

Average number of steps to reach pattern (st.dev) Average time

Simple Q-learn 74.51 (49.06) 0.005176297974999999

Q-learn with CD 50.17 (24.61) 0.018493518400000002

ST based Q-learn 39.86 (18.06) 0.014409296521249998

Table 7.2: Configuration: [0, 5, 9, 5, 0], probability: 0.2

Average number of steps to reach pattern (st.dev) Average time

Simple Q-learn 1722.73 (1624) 0.0548831131625

Q-learn with CD 234.5 (167.9) 0.024709352901250002

ST based Q-learn 209.1 (150.8) 0.019984976564999996

Table 7.3: Configuration: [3, 1, 6, 9, 0, 3, 0, 4, 8, 6], probability: 0.2

Average number of steps to reach pattern (st.dev) Average time

Simple Q-learn 1595.51 (1412) 0.051689092031249995

Q-learn with CD 246.6 (177.6) 0.026487643428750002

ST based Q-learn 206.4 (149.3) 0.019833487486249995


Version | Probability | Average n. of steps to reach pattern (st.dev) | Average time
Simple Q-learn | 0.1 | 398.3 (390.3) | 0.01130130072
Q-learn with CD | 0.1 | 44.59 (19.64) | 0.020162251227499997
ST based Q-learn | 0.1 | 31.40 (10.42) | 0.009829217615000002
Simple Q-learn | 0.2 | 77.37 (53.85) | 0.00482422701
Q-learn with CD | 0.2 | 49.36 (23.31) | 0.017638260795
ST based Q-learn | 0.2 | 39.99 (20.11) | 0.012930558308750001
Simple Q-learn | 0.4 | 129.9 (106.5) | 0.0095570199425
Q-learn with CD | 0.4 | 71.87 (50.61) | 0.026830536101249997
ST based Q-learn | 0.4 | 64.91 (46.30) | 0.02745569464249999
Simple Q-learn | 0.6 | 166.5 (160.7) | 0.016134348991250002
Q-learn with CD | 0.6 | 104.1 (90.13) | 0.03654140449
ST based Q-learn | 0.6 | 87.30 (75.48) | 0.03620757491999999
Simple Q-learn | 0.8 | 219.0 (209.6) | 0.018637180533749992
Q-learn with CD | 0.8 | 111.1 (102.4) | 0.047285644087499995
ST based Q-learn | 0.8 | 113.5 (100.6) | 0.04546397372249999


CHAPTER 8

Appendix C

2D game version | Pieces of cheese | Number to reach | Average number of steps (st.dev) | Average energy level | Average time | Fallen in traps | Invalid paths (out of 1000)
Simple Q-learn | 3 | 5 | 21.14 (6.650) | 2.464 | 0.042194646609000006 | 0 | 0
Q-learn with CD | 3 | 5 | 24.44 (7.480) | 1.992 | 0.18099387017 | 2 | 2
ST based Q-learn | 3 | 5 | 39.59 (18.45) | 1.932 | 0.12128994969300001 | 2 | 3
Simple Q-learn | 5 | 10 | 26.06 (5.31) | 6.255 | 0.080881596451 | 5 | 5
Q-learn with CD | 5 | 10 | 28.94 (7.501) | 5.207 | 0.30913800353000004 | 22 | 22
ST based Q-learn | 5 | 10 | 38.81 (15.47) | 3.967 | 0.18892732707600002 | 7 | 10


2D game version | Pieces of cheese | Number to reach | Average number of steps (st.dev) | Average energy level | Average time | Fallen in traps | Invalid paths (out of 1000)
Simple Q-learn | 3 | 5 | 20.92 (6.647) | 2.410 | 0.052291680218 | 0 | 0
Q-learn with CD | 3 | 5 | 22.13 (7.264) | 2.343 | 0.20872150904400003 | 6 | 6
ST based Q-learn | 3 | 5 | 28.85 (11.35) | 2.110 | 0.12891312624899998 | 4 | 4
Simple Q-learn | 5 | 10 | 26.46 (5.644) | 5.775 | 0.091668791785 | 7 | 7
Q-learn with CD | 5 | 10 | 28.15 (7.168) | 5.317 | 0.33838597063599996 | 21 | 24
ST based Q-learn | 5 | 10 | 31.84 (9.523) | 4.623 | 0.197363727017 | 14 | 18

Table 8.2: Probability of change: 0.2

2D game version | Pieces of cheese | Number to reach | Average number of steps (st.dev) | Average energy level | Average time | Fallen in traps | Invalid paths (out of 1000)
Simple Q-learn | 3 | 5 | 20.61 (6.625) | 2.379 | 0.07365931579500001 | 0 | 0
Q-learn with CD | 3 | 5 | 22.70 (8.249) | 2.294 | 0.258899648624 | 0 | 0
ST based Q-learn | 3 | 5 | 26.04 (9.764) | 2.216 | 0.155647341284 | 11 | 11
Simple Q-learn | 5 | 10 | 26.39 (6.232) | 5.461 | 0.129816405833 | 17 | 18
Q-learn with CD | 5 | 10 | 26.40 (5.868) | 5.446 | 0.41060830666900006 | 19 | 24
ST based Q-learn | 5 | 10 | 29.05 (7.759) | 4.850 | 0.22133377404 | 26 | 27


2D game version | Pieces of cheese | Number to reach | Average number of steps (st.dev) | Average energy level | Average time | Fallen in traps | Invalid paths (out of 1000)
Simple Q-learn | 3 | 5 | 22.24 (7.947) | 2.379 | 0.099999400299 | 4 | 4
Q-learn with CD | 3 | 5 | 20.48 (7.715) | 2.420 | 0.31682895987 | 3 | 3
ST based Q-learn | 3 | 5 | 24.08 (9.084) | 2.279 | 0.170768469466 | 1 | 1
Simple Q-learn | 5 | 10 | 26.80 (6.024) | 5.158 | 0.153172532722 | 6 | 7
Q-learn with CD | 5 | 10 | 26.47 (6.019) | 5.127 | 0.47635151023500005 | 16 | 16
ST based Q-learn | 5 | 10 | 27.88 (6.739) | 4.934 | 0.234942351961 | 16 | 17

Table 8.4: Probability of change: 0.6

2D game version | Pieces of cheese | Number to reach | Average number of steps (st.dev) | Average energy level | Average time | Fallen in traps | Invalid paths (out of 1000)
Simple Q-learn | 3 | 5 | 21.46 (7.531) | 2.399 | 0.119013807645 | 0 | 1
Q-learn with CD | 3 | 5 | 20.91 (7.981) | 2.435 | 0.366330224227 | 2 | 3
ST based Q-learn | 3 | 5 | 24.03 (9.536) | 2.319 | 0.18761451075 | 0 | 0
Simple Q-learn | 5 | 10 | 25.99 (5.953) | 5.176 | 0.16340512806299998 | 13 | 14
Q-learn with CD | 5 | 10 | 26.43 (6.269) | 5.124 | 0.5093081555869999 | 12 | 13
ST based Q-learn | 5 | 10 | 27.21 (6.346) | 4.939 | 0.25663775207899997 | 26 | 29


CHAPTER 9

Appendix D

2D game version | Path length | Average cheese eaten (st.dev) | Average energy level | Average time | Fallen in traps | Invalid paths (out of 1000)
Simple Q-learn | 40 | 11.73 (3.333) | 4.700 | 0.08699974591899999 | 1 | 1
Q-learn with CD | 40 | 11.33 (3.454) | 4.533 | 0.294372542444 | 3 | 3
ST based Q-learn | 40 | 6.091 (3.902) | 2.564 | 0.130557955303 | 4 | 4
Simple Q-learn | 60 | 18.35 (4.787) | 6.157 | 0.13085087647199997 | 5 | 5
Q-learn with CD | 60 | 17.61 (4.540) | 5.729 | 0.529657602793 | 13 | 23
ST based Q-learn | 60 | 11.42 (5.228) | 3.384 | 0.16625127476700002 | 6 | 6

Table 9.1: Probability of change: 0.1

2D game version | Path length | Average cheese eaten (st.dev) | Average energy level | Average time | Fallen in traps | Invalid paths (out of 1000)
Simple Q-learn | 40 | 11.59 (3.366) | 4.706 | 0.107927156799 | 3 | 3
Q-learn with CD | 40 | 10.25 (3.355) | 4.576 | 0.319750206134 | 1 | 1
ST based Q-learn | 40 | 8.299 (3.641) | 3.132 | 0.16593316945 | 1 | 5
Simple Q-learn | 60 | 18.83 (4.497) | 6.482 | 0.16345987334099998 | 4 | 7
Q-learn with CD | 60 | 17.67 (4.706) | 5.848 | 0.47342129571700003 | 11 | 13
ST based Q-learn | 60 | 14.92 (4.622) | 4.275 | 0.203519943354 | 11 | 13


2D game version | Path length | Average cheese eaten (st.dev) | Average energy level | Average time | Fallen in traps | Invalid paths (out of 1000)
Simple Q-learn | 40 | 12.02 (3.336) | 4.985 | 0.14791358137399999 | 1 | 1
Q-learn with CD | 40 | 10.98 (3.467) | 4.305 | 0.437092013951 | 2 | 2
ST based Q-learn | 40 | 9.556 (3.525) | 3.582 | 0.19930150978900002 | 4 | 5
Simple Q-learn | 60 | 19.01 (4.432) | 6.665 | 0.222785171571 | 5 | 7
Q-learn with CD | 60 | 18.29 (4.366) | 6.066 | 0.615429950703 | 6 | 10
ST based Q-learn | 60 | 15.56 (4.376) | 4.588 | 0.259701145249 | 4 | 5

Table 9.3: Probability of change: 0.4

2D game version | Path length | Average cheese eaten (st.dev) | Average energy level | Average time | Fallen in traps | Invalid paths (out of 1000)
Simple Q-learn | 40 | 11.36 (3.531) | 4.518 | 0.18646772296699998 | 5 | 6
Q-learn with CD | 40 | 11.76 (3.461) | 4.809 | 0.534007035124 | 2 | 3
ST based Q-learn | 40 | 10.22 (3.507) | 3.904 | 0.243896342373 | 7 | 8
Simple Q-learn | 60 | 18.47 (4.205) | 6.139 | 0.27780953402399994 | 1 | 1
Q-learn with CD | 60 | 18.41 (4.488) | 6.206 | 0.7479936428629999 | 7 | 7
ST based Q-learn | 60 | 14.93 (4.601) | 4.338 | 0.314694158415 | 14 | 14


2D game version | Path length | Average cheese eaten (st.dev) | Average energy level | Average time | Fallen in traps | Invalid paths (out of 1000)
Simple Q-learn | 40 | 10.91 (3.461) | 4.300 | 0.217898397375 | 5 | 5
Q-learn with CD | 40 | 11.42 (3.425) | 4.600 | 0.6157260215199999 | 0 | 0
ST based Q-learn | 40 | 9.922 (3.451) | 3.823 | 0.277149985748 | 2 | 4
Simple Q-learn | 60 | 18.14 (4.092) | 5.816 | 0.325924306347 | 1 | 2
Q-learn with CD | 60 | 18.23 (4.439) | 6.030 | 0.872587219755 | 5 | 7
ST based Q-learn | 60 | 15.14 (4.339) | 4.313 | 0.363646502087 | 5 | 5


CHAPTER 10

Appendix E

Algorithm 4: Machine to give description program of a given 1D neighborhood.

def reset():
    temp = None
    pos = 0
    output = []
    n = neighborhood

def increase():
    if temp < n[pos] then
        temp += 1
        if temp == n[pos] then
            output[pos] = temp
        end
        return "increase"
    end
    return False

def decrease():
    if temp > n[pos] then
        temp -= 1
        if temp == n[pos] then
            output[pos] = temp
        end
        return "decrease"
    end
    return False

def move():
    if pos == 0 then
        output[pos] = n[pos]
        temp = output[pos]
        pos += 1
        return "move"
    end
    if temp == n[pos] then
        output[pos] = temp
        pos += 1
        return "move"
    end
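A runnable Python rendering of the machine above could look as follows. It is only a sketch: the class name is made up, reset() is expressed as the constructor, and the bounds checks on pos are additions that the pseudocode leaves implicit. The length of the shortest sequence of accepted operations can then serve as the description complexity of the neighborhood, in the spirit of the minimal program search of Chapter 3.

    class DescriptionMachine1D:
        # Builds a description program for a 1D neighborhood of digits by
        # emitting "move", "increase" and "decrease" operations; an operation
        # that is not applicable returns False.

        def __init__(self, neighborhood):
            self.n = list(neighborhood)
            self.temp = None
            self.pos = 0
            self.output = [None] * len(self.n)

        def increase(self):
            if self.pos < len(self.n) and self.temp is not None and self.temp < self.n[self.pos]:
                self.temp += 1
                if self.temp == self.n[self.pos]:
                    self.output[self.pos] = self.temp
                return "increase"
            return False

        def decrease(self):
            if self.pos < len(self.n) and self.temp is not None and self.temp > self.n[self.pos]:
                self.temp -= 1
                if self.temp == self.n[self.pos]:
                    self.output[self.pos] = self.temp
                return "decrease"
            return False

        def move(self):
            if self.pos == 0:
                self.output[0] = self.n[0]
                self.temp = self.output[0]
                self.pos += 1
                return "move"
            if self.pos < len(self.n) and self.temp == self.n[self.pos]:
                self.output[self.pos] = self.temp
                self.pos += 1
                return "move"
            return False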


Algorithm 5: Machine to give description program of a cheese location in the 2D grid (Cheese-seeking Mouse game).

def reset(F):
    output = mouse position
    pos_temp = mouse position
    pos_cheese = cheese position

def distance(pos):
    x, y = pos
    i, j = pos_cheese
    distance = [abs(x - i), abs(y - j)]
    return distance

def move(direction):
    pos_temp = output
    if direction == left then
        if output[0] > 0 then
            output[0] -= 1
            if distance(pos_temp) < distance(output) then
                output[0] += 1
                return False
            end
            return 'move left'
        end
    else if direction == right then
        if output[0] < grid width - 1 then
            output[0] += 1
            if distance(pos_temp) < distance(output) then
                output[0] -= 1
                return False
            end
            return 'move right'
        end
    else if direction == up then
        if output[1] > 0 then
            output[1] -= 1
            if distance(pos_temp) < distance(output) then
                output[1] += 1
                return False
            end
            return 'move up'
        end
    else if direction == down then
        if output[1] < grid height - 1 then
            output[1] += 1
            if distance(pos_temp) < distance(output) then
                output[1] -= 1
                return False
            end
            return 'move down'
        end
    end
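As an illustration of how the operations of this machine could be counted, the hypothetical driver below repeatedly tries the four move operations and counts the accepted ones until the described position coincides with the cheese. The machine object, with its output attribute and move(direction) method, mirrors the pseudocode above, but both the interface and the fixed order in which directions are tried are assumptions of this sketch.

    def count_description_steps(machine, cheese):
        # Greedily apply accepted moves; every accepted move brings the
        # described position closer to the cheese, so the loop terminates.
        # The number of accepted operations can serve as an estimate of the
        # description complexity of the cheese location.
        steps = 0
        while list(machine.output) != list(cheese):
            for direction in ("left", "right", "up", "down"):
                if machine.move(direction):
                    steps += 1
                    break
            else:
                break  # no direction is accepted; give up to avoid looping
        return steps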
