
Comparing single-phase and two-phase Reinforcement Learning of a chess endgame

Daniël M.G. Staal, 10277927

Bachelor thesis, credits: 18 EC
Bachelor's Programme in Artificial Intelligence (Bachelor Opleiding Kunstmatige Intelligentie)
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor
dr. M.W. (Maarten) van Someren
Faculty of Informatics, University of Amsterdam
P.O. Box 94323, 1090 GH Amsterdam


Abstract

Reinforcement Learning is a kind of Machine Learning in which an artificial agent receives rewards for certain actions. The next state that gives the agent the highest reward is found by calculating the reward for every possible next state using a reward function. The goal of the learning process is to find the ideal reward function. In this project a comparison is made between using one or two evaluation functions for different phases of the learning problem, using a classical Reinforcement Learning problem as an example: the chess endgame King + Rook versus King. This process of dividing a problem into multiple learning phases is called Curriculum Learning.


Contents

1 Introduction
  1.1 Research Question & Hypothesis
  1.2 Literature Review
2 Method & Approach
  2.1 Block's algorithm
  2.2 Division into two learning phases
  2.3 The Program
  2.4 Discussion of choices made in the program
3 Results
  3.1 Test 1: 200 games, each game the same initial position, 10 iterations
  3.2 Test 2: 200 games, each game a random initial position, 10 iterations
4 Evaluation
5 Conclusions
6 Discussion
7 Future Research


1 Introduction

1.1 Research Question & Hypothesis

Many problems require different strategies at different stages to solve them. Most problems cannot be solved using just one strategy but need to be divided into multiple phases in which different policies are the best way to reach a solution. This research project looks into the possibility of training an agent with Reinforcement Learning to solve such a problem, using a famous chess endgame as an example. The research question is:

To what extent does the performance of an artificial agent using Reinforcement Learning to solve the chess endgame King + Rook vs King (KRK) increase when the learning phase is divided into two separate parts?

Bengio et al. (2009) point out that humans and animals learn much better when the examples are not randomly presented but organised in a meaningful order which illustrates gradually more concepts, and gradually more complex ones. They call this the concept of Curriculum Learning. This thesis compares single-phase learning with a simple example of Curriculum Learning, consisting of two separate phases, using an artificial agent that tries to learn a winning strategy for a classical chess endgame situation using Reinforcement Learning. The endgame situation consists of a white King and Rook against a black King. The agent should learn to win playing with King + Rook against an opponent player that controls only the black King. When trying to achieve checkmate in the KRK endgame, as in most problems, reaching the goal seems to consist of separate tasks. The first is driving the opponent King to the edge of the board, and the second is to achieve checkmate while avoiding stalemate. This project compares two different approaches: learning one strategy for the whole game, and learning two separate strategies for two different phases of the endgame.

Reinforcement Learning

Reinforcement Learning is a field of machine learning in which the artificial agent is rewarded for certain actions and the achievement of goals. How to achieve a goal is what the artificial agent should compute; in this project it learns from playing against an opponent controlled by a random playing agent.

According to Sutton and Barto (1998, par. 1.1), Reinforcement Learning is learning what to do, how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. Sutton and Barto (1998, par. 1.3) point out that the core elements of a Reinforcement Learning system are a policy, a reward function and a value function. A policy defines the learning agent's way of behaving at a given time, the reward function calculates the reward for a certain state, and the value function calculates the reward over time, when exploring more steps in a search tree. Such a value function is not strictly necessary. In this project the policy is obtained by calculating a reward for each possible next state using the evaluation function, and no value function is used; this is called a greedy method. This method always exploits current knowledge to maximise immediate reward; it spends no time at all sampling apparently inferior actions to see if they might really be better (Sutton and Barto, 1998). This choice is made because the time complexity increases greatly when using deeper searching methods, and time complexity already appeared to be an issue in this project. Furthermore, this choice probably does not influence the comparison between the two learning methods discussed. A more detailed description of the algorithm that is used in this project can be found in the section Method & Approach below.

Each game of chess ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. Tasks with episodes of this kind are called episodic tasks. In episodic tasks we need to distinguish the set of all non-terminal states from the set of all states plus the terminal state (Sutton and Barto, 1998).

Hypothesis

My hypothesis is that separating the learning process into two phases will improve the level of play of the artificial agent. This hypothesis is under the condition that the evaluation function, a concept that will be explained later on, is a linear function, and my expectation is that it is difficult for the agent to learn to solve a problem that consists of two different phases, which both require a different approach, using a single linear evaluation function.

To summarise, I predict that when an artificial agent has to learn to solve a problem by means of updating weights of features and using a linear evaluation function, more satisfying results will be obtained by dividing the problem into two separate learning phases, each with a different evaluation function, or by making use of an evaluation function that is not linear. The validity of the first statement, the usage of two evaluation functions, is investigated in this research.


1.2 Literature Review

Block et al. (2008) use an algorithm for a chess engine based on a Reinforcement Learning algorithm. Block et al. (2008) developed an agent that is trained to play the game of chess with all the pieces on the board, but the algorithm is suited for a more specific chess problem like the one discussed in this project as well. Their article provides an algorithm that consists of three main functions to improve the level of play of an artificial agent: an evaluation function, an update function and a function to calculate the temporal difference. The evaluation function calculates a reward for a certain state; the state with the highest reward will be the state that the agent chooses to go to next. The temporal difference function calculates the difference between the rewards in the current state and the next one. The update function updates the weights of features; or in other words, it updates the magnitude of the reward that the agent gets for certain features. When a terminal state is reached, like a checkmate or a draw by losing the Rook, respectively a positive or a negative reward is given to the agent, mostly rewarding the last played moves in a game. This algorithm is the core learning process of the program developed in this project, and will be explained more thoroughly in the section Method & Approach.

Curriculum Learning is a fairly new concept in Reinforcement Learning and is the general name for learning in multiple phases. Bengio et al. (2009) state two potential advantages of Curriculum Learning, firstly: "faster training in the online setting (i.e. faster both from an optimisation and statistical point of view) because the learner wastes less time with noisy or harder to predict examples (when it is not ready to incorporate them)"; and secondly: "guiding training towards better regions in parameter space, i.e. into basins of attraction (local minima) of the descent procedure associated with better generalisation".

The book by Sutton and Barto (1998) is a general introduction to Reinforcement Learning that in this project is mostly used to develop a proper Exploration/Exploitation algorithm and to form a general understanding of Reinforcement Learning concepts. Russell et al. (1995), like Sutton and Barto (1998), provide general concepts about Reinforcement Learning that in the present project are mostly used without specific citations, because these concepts can be found in other books on the subject of Artificial Intelligence.

Some other learning agents use Genetic Programming (GP) to solve chess endgames, a learning algorithm based on evolutionary biology. For example, Lassabe et al. (2006) developed a learning chess engine to solve the endgame Rook + King vs King, the same problem as the one discussed in this project. Lassabe et al. (2006) provide a list of the features they used to create agents. Some of the features used in the present project are based on those features.

2 Method & Approach

When implementing a Reinforcement Learning algorithm, many different approaches can be chosen. This project implements a method proposed by Block et al. (2008) to train an artificial agent using Reinforcement Learning. Below, first the algorithm is explained and a general outline of the architecture and assumptions made in the program is discussed, then the step of dividing the learning phase into two different parts, and finally the choices made in the program are discussed at the end of this section.

2.1 Block’s algorithm

Block et al. (2008) propose an algorithm for using Reinforcement Learning in chess engines. This algorithm can be used for the whole game of chess, but it is also applicable to just the endgame problem that is discussed in this project, because the algorithm is a general method to train an artificial agent to play. In the research project by ? this endgame is also part of the learning process, but it is just one of the many learning processes in their project. Block et al. (2008) describe a methodology introduced by Samuel (1959), which adopts notation and concepts first introduced by Sutton (1988); it is not an algorithm developed especially for chess but rather for Temporal Difference in Reinforcement Learning in general.

To begin with the evaluation or reward function as proposed by Block et al. (2008):

J(x, ω) = Σ_{i=1}^{k} ω_i · J_i(x)

This function approximates an ideal function to evaluate a current state in the best possible manner. The variables are the number of features k, the weight of a certain feature ω_i and a certain feature value J_i given a state x. This function is used to find the next state with the highest reward, and also in the update function to calculate the temporal difference and the gradient of the function. The present project defines a state as a board position after a move made by the player controlling the black King. When the agent calculates the best move for the white player, every possible move is evaluated after the inclusion of a random move with the black King. In practice this means that the move that the black King makes may eventually be different from the random move used to calculate the state reward (Figure 1).

Figure 1: Left: move used to calculate the next state reward. Right: move actually played by the opponent. Image source: chess.com.

To find the ideal values for the weights of the features, the weights are updated after a terminal state is achieved:

ω_i ← ω_i + α · Σ_{t=1}^{N−1} ∂J(x_t, ω)/∂ω_i · ∆_t

The update function that Block et al. (2008) use, and that is used in this project as well, consists of the weight itself, the learning rate α, the sum of the gradients over all board positions up to the last position before the terminal state is reached, the maximum number of moves per game N, and finally the correction ∆_t stated below:

∆_t = Σ_{j=t}^{N−1} λ^{j−t} · d_j

where d_j is the temporal difference:

d_j = J(x_{j+1}, ω) − J(x_j, ω)

The correction ∆_t is used to give a direction to the value that is found by computing the gradient of the evaluation function, which for a linear function is just the feature value in a certain state. This gradient is multiplied with the correction. The correction is computed by summing, from the current state up to the second to last state, the parameter λ (raised to the appropriate power) multiplied with the temporal difference. The temporal difference calculates the difference between the value of the evaluation function in the next state and in the current state. This value can be negative or positive. The correction can decrease or increase the value of the weight, changing the reward for a high value of this feature. λ is a parameter that controls the extent to which temporal differences propagate backwards in time (Baxter et al., 1997).
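As an illustration of how such an update could be implemented, a minimal Java sketch is given below, under the assumption of the linear evaluation function stated above. The class and method names are made up for this sketch and do not come from the actual program.

    // Sketch of the weight update described above, assuming a linear evaluation
    // function J(x, w) = sum_i w_i * J_i(x). Names are illustrative only.
    public final class TdLambdaUpdater {

        // evaluate a state: J(x, w) = sum_i w_i * J_i(x)
        static double evaluate(double[] weights, double[] features) {
            double sum = 0.0;
            for (int i = 0; i < weights.length; i++) {
                sum += weights[i] * features[i];
            }
            return sum;
        }

        // Update the weights after one finished game.
        // weights: current weight vector (modified in place)
        // states:  feature vectors of the N visited states, states[N-1] terminal
        // alpha:   learning rate, lambda: decay of the TD propagation
        static void update(double[] weights, double[][] states, double alpha, double lambda) {
            int n = states.length;
            // temporal differences d_j = J(x_{j+1}, w) - J(x_j, w)
            double[] d = new double[n - 1];
            for (int j = 0; j < n - 1; j++) {
                d[j] = evaluate(weights, states[j + 1]) - evaluate(weights, states[j]);
            }
            for (int i = 0; i < weights.length; i++) {
                double sum = 0.0;
                for (int t = 0; t < n - 1; t++) {
                    // gradient of the linear evaluation w.r.t. w_i is the feature value itself
                    double gradient = states[t][i];
                    // correction Delta_t = sum_{j=t}^{N-1} lambda^(j-t) * d_j
                    double delta = 0.0;
                    for (int j = t; j < n - 1; j++) {
                        delta += Math.pow(lambda, j - t) * d[j];
                    }
                    sum += gradient * delta;
                }
                weights[i] += alpha * sum;
            }
        }
    }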

2.2 Division into two learning phases

To divide the chess endgame into two learning phases, a terminal state has to be implemented for the first phase. A positive reward is given to the agent for reaching a certain position. A negative reward is given when the white Rook has been captured. Not much has changed relative to learning with a single phase, but the agent now uses two linear evaluation functions, or 'strategies', to finally come to a checkmate or a draw by stalemate or loss of the Rook. The terminal state for the first phase is the simple case of any position where the black King is on one of the edges of the board (Figure 2). When this happens the agent receives a terminal state reward, the next phase is initiated and the agent starts using the next strategy, which involves another set of weights for the same list of features. In this phase other features will be more important and different decisions will be made. In the section below the two ways of Reinforcement Learning are compared.

Figure 2: The black King moved to the edge of the board so a terminal state for the first phase is reached, 6x6 board. Image source: chess.com.
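The phase switch itself can be kept very simple. The sketch below shows one possible way to select the active set of weights, where phase two starts as soon as the black King stands on an edge of the board; the class and field names are illustrative assumptions, not the program's actual ones.

    // Illustrative sketch of selecting the active set of weights per phase.
    class TwoPhaseAgent {
        private final double[] phase1Weights; // drive the black King to the edge
        private final double[] phase2Weights; // deliver checkmate, avoid stalemate
        private final int boardSize;

        TwoPhaseAgent(double[] w1, double[] w2, int boardSize) {
            this.phase1Weights = w1;
            this.phase2Weights = w2;
            this.boardSize = boardSize;
        }

        // phase 1 is over once the black King reaches any edge of the board
        boolean blackKingOnEdge(int kingFile, int kingRank) {
            return kingFile == 0 || kingRank == 0
                    || kingFile == boardSize - 1 || kingRank == boardSize - 1;
        }

        double[] activeWeights(int blackKingFile, int blackKingRank) {
            return blackKingOnEdge(blackKingFile, blackKingRank) ? phase2Weights : phase1Weights;
        }
    }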

2.3 The Program

The last subsection discussed the algorithm of Block et al. (2008) used in this project. This subsection gives a more precise description of the program created in this project: how is Block et al. (2008)'s algorithm used and what is added? The program consists of several parts, with general assumptions and actions (see Appendix B for a UML (Unified Modeling Language) diagram of the program). First a general outline is given of the architecture and general concepts of the program, including information about changeable parameters and choices made in the program; subsequently these choices are discussed, and later on, in the results section, it is shown how certain choices turn out.

- Playing games
- Learning from games
- Features

Playing games

When a game is played, the first move is made by the opponent player with the black King and the second move by White, and so on. The opponent player is controlled by a random playing agent, whereas the player with the white pieces makes the move with which it receives the highest reward. The best reward state is found by checking all possible moves for the agent and then making a random move with the black King; consequently the program learns to checkmate a random playing agent. The board size can be changed, which greatly changes the complexity and probably also the success of the learning process, as can the initial board position of each game, which can be the same every game or be reset to a random position after each game (Table 1). The maximum number of moves per game, N, determines the maximum number of moves that are allowed to be played. If after N moves no terminal state is reached, the game is thrown away and not used for learning. This parameter strongly influences the complexity of the learning process but cannot be set too small, because this might constrain the learning process. This is why its value should be chosen with care.
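A sketch of this greedy selection step is given below. The state interface, class and method names are hypothetical and only meant to illustrate the mechanism: every legal white move is extended with one random black reply, the resulting state is evaluated with the linear evaluation function, and the white move with the highest reward is played.

    import java.util.List;
    import java.util.Random;

    // Hypothetical state interface; the real board representation differs.
    interface GameState {
        List<GameState> legalWhiteSuccessors();
        List<GameState> legalBlackSuccessors();
        double[] featureValues();
    }

    class GreedyPlayer {
        private final Random random = new Random();
        private final double[] weights;

        GreedyPlayer(double[] weights) { this.weights = weights; }

        // linear evaluation: sum of weight * feature value
        private double evaluate(double[] features) {
            double sum = 0.0;
            for (int i = 0; i < weights.length; i++) {
                sum += weights[i] * features[i];
            }
            return sum;
        }

        GameState chooseMove(GameState current) {
            GameState best = null;
            double bestReward = Double.NEGATIVE_INFINITY;
            for (GameState afterWhite : current.legalWhiteSuccessors()) {
                // add one random black reply before evaluating, as described above
                List<GameState> replies = afterWhite.legalBlackSuccessors();
                GameState evaluated = replies.isEmpty()
                        ? afterWhite
                        : replies.get(random.nextInt(replies.size()));
                double reward = evaluate(evaluated.featureValues());
                if (reward > bestReward) {
                    bestReward = reward;
                    best = afterWhite;
                }
            }
            return best;
        }
    }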

Learning from games

After every game that ended by reaching a terminal state, the weights are updated using the update function provided by Block et al. (2008). Before starting to learn from games, values have to be chosen for the learning rate α and for λ, which determine the speed of change of the weights in the update function. It is not easy to say which values are optimal; therefore, in future research a learning algorithm may be useful to find values for these parameters. Another choice to be made is the initial value of the weights before playing games and the values of the rewards for achieving different terminal states.

Features

Wiering and van Otterlo (2012, p. 549) point out that the choice of features is a difficult task. Questions arise such as how many features to use, which features are useful and which combination of features will work. Note that simply using board positions as features would probably result in an overly complex calculation process and might not even be sufficient for the agent to learn a winning strategy. After calculation, the feature values are normalised between 0 and 1; otherwise the relative values of the features are not scaled in the right fashion.
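For example, a simple min-max normalisation over the board could look like the sketch below; the class name and the bounds used are assumptions for illustration, the actual program may scale differently.

    // Sketch: min-max scaling of a raw feature value into [0, 1].
    class FeatureNormaliser {
        static double normalise(double raw, double min, double max) {
            if (max <= min) {
                return 0.0; // degenerate feature, avoid division by zero
            }
            double clipped = Math.max(min, Math.min(max, raw));
            return (clipped - min) / (max - min);
        }
    }

    // Example: the Manhattan distance between two pieces on a 6x6 board lies
    // between 1 and 2 * (6 - 1) = 10, so:
    //   double value = FeatureNormaliser.normalise(distance, 1, 10);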

Parameter               Functionality
randomPositions         Start every game in a different or every time the same initial position
numberOfGames           The number of games to be played
maxNumberOfMoves        N: the maximum number of moves per game allowed
numberOfFeatures        The number of features
alwaysTakingRookFlag    Whether the black King should be forced to take the Rook if possible
boardSize               The size of the board

Table 1: Parameters that can be changed before playing games and training the agent

2.4 Discussion of choices made in the program

Exploration/Exploitation

The concept of Exploration/Exploitation refers to balancing the use of exploring moves, random moves, and exploiting moves, moves determined by the next state in which the agent receives the highest reward. At which threshold of reward should the agent make an exploiting move? This research project does not make use of such an algorithm, because it is probably not fundamental when comparing the two-reward-function method with the single-reward-function method. However, in future research it may be interesting to implement it. In that case, for example, an algorithm stated by Sutton and Barto (1998) could be implemented. Sutton and Barto (1998, Chap. 2) discuss the use of Reinforcement Comparison. The comparison is between the reward in a current state and a reference reward. The idea is that if the reward in the current state is higher than the reference reward, an exploiting move should be executed, meaning that the next state with the highest reward is chosen. If the reward is lower than the reference reward, a random move is executed. After each move the reference reward is updated using the following function by Sutton and Barto (1998):

r̄ ← r̄ + α · (r − r̄)

where α is a learning parameter that controls the speed at which the reference reward changes. The main idea of the calculation of the reference reward is to check whether the current reward is higher than the mean of the rewards of all past states.
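A minimal sketch of this reference-reward bookkeeping is shown below; the class and method names are made up, only the update rule follows the description above.

    // Sketch of reinforcement comparison: exploit when the current reward beats
    // the running reference reward, explore (random move) otherwise.
    class ReinforcementComparison {
        private final double alpha;      // step size for the reference reward
        private double referenceReward;  // running estimate of past rewards

        ReinforcementComparison(double alpha, double initialReference) {
            this.alpha = alpha;
            this.referenceReward = initialReference;
        }

        boolean shouldExploit(double currentReward) {
            return currentReward > referenceReward;
        }

        // after each move: reference <- reference + alpha * (reward - reference)
        void updateReference(double reward) {
            referenceReward += alpha * (reward - referenceReward);
        }
    }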

Random moves

A best reward state is evaluated by checking all possible moves for the agent and then making a random move with the black King. In Lassabe et al. (2006), for example, who use genetic programming to train an artificial agent, the black King is played by opponents with different strategies. These strategies influence the learning process of the agent, and the agent adapts to beat a certain strategy. In this project the agent learns a strategy to beat a random playing agent. When playing against, for example, a good chess player, other strategies would arise.

Learning rate & λ-value

The λ value controls the contribution of the temporal difference from a position x_t until the end of a game (Block et al., 2008). Block et al. (2008) state that λ = 0.7 gives the most satisfying results. In this project this value for λ is used for all tests, but other values may yield a better algorithm. The learning rate α = 0.7 is chosen for every test. Additional research, using for example learning algorithms to find optimal values for these parameters, might increase the learning capacities of the artificial agent.

Initial value of weights and rewards

The initial values for the weights and the terminal state rewards did not seem to be very important while testing. The only difference can be made in the choice of the reward for a stalemate position. Stalemate is the trickiest part of the endgame, because a strategy to reach checkmate is almost the same as one for reaching a stalemate position, a position in which the player who has to move cannot make a legal move and is not in check (see Appendix A). In both cases the black King should be forced into a corner, but the last stage of the game is different, because stalemate has to be prevented, and the white Rook may suddenly have to move away from the black King to make room, an uncommon strategy. When driving the black King to the edge of the board, the first strategy is not a good strategy for the whole game; an additional strategy seems to be required when trying to checkmate the black King and prevent stalemate. Lassabe et al. (2006) chose to treat the achievement of a stalemate position as a relatively good action and give the agent a positive reward when a stalemate is achieved, although lower than the reward for a checkmate. Such a reward system is used in the present research project, because when stalemate is punished the agents may have a hard time achieving a higher level of play and differences between the two methods may be harder to locate.

Features

Wiering and van Otterlo (2012, p. 549) state that most evaluation functions use piece values and square values for certain pieces as features. In this project other features are used, primarily to decrease the complexity of the learning process, because when all possible board positions are used as features, the time complexity increases greatly. However, in time the level of play of the agent may benefit from that approach. This is not the main goal in this project, but it is important to realise when developing a chess engine using Reinforcement Learning. The features implemented in this project are listed in Table 2, and more features could easily be added if necessary. The meaning of some features is illustrated with respect to Figure 3. In this particular position the feature "squaresOfKingvsRook" would receive the value three: the 'cage' of the black King has three squares. The number of legal moves, "noOfPosSquaresk", would be two: the black King could go right or diagonally left to take the Rook. So in this position "kingProtectsRook" would receive the value false/zero. The "distanceBetweenWhiteRookAndBlackKing" is three. Note that these feature values are normalised before usage in the learning process.

Feature                                   Meaning
"squaresOfKingvsRook"                     The number of squares that the white Rook delimits for the black King
"noOfPosSquaresk"                         Number of legal moves available for the black King
"distanceToEdgeBlackKing"                 Distance of the black King to the edge of the board
"distanceBetweenWhiteRookAndBlackKing"*   Manhattan distance between the white Rook and the black King
"kingProtectsRook"*                       Is the white King defending the white Rook?
"threatenedRook"*                         Can the white Rook be captured in this position?
"rookLost"                                Is the white Rook captured?
"kingsInOpposition"*                      Both Kings are opposite each other with one square in between
"blackKingInStaleMate"                    This is a stalemate position
"blackKingInCheckMate"                    This is a checkmate position

Table 2: Features implemented in this project


Figure 3: Board position on a 4x4 sized chessboard. Image source: chess.com.

3 Results

To find an answer to the research question stated in this project, it is necessary to compare the level of play of the agent when using one reward function with the level of play when using two. A lot of parameters are variable, e.g. the number of games played, the maximum number of moves in a single game, and which features are used to train the agent, but not all of these variables are interesting in view of this project. The static parameters are stated first, followed by the parameters that are changed between tests.

Every test is done using the same parameter settings for one and for two evaluation functions. It does not seem to be necessary to vary every parameter, because probably not all of them influence the relative level of play of the agent when using one or two reward functions. These unchanged variables are listed in Table 3. Although in chess a stalemate position is just as bad as losing the Rook, the stalemate reward is set to 1, because for reaching a stalemate in many positions (e.g. Appendix A) the agent has to do almost everything right except for the last move. However, this reward is set lower than the reward for a checkmate position. The combination of features is the same for every test run in this project. The features that the agent uses for the reward functions are "noOfPosSquaresk", "distanceToEdgeBlackKing", "distanceBetweenWhiteRookAndBlackKing", "threatenedRook" and "kingsInOpposition" (for an explanation of the features see Table 2).

To obtain results only two variables are changed: the size of the chessboard and whether the initial positions are random. These variables seem to matter when thinking about using one or two reward functions. The size of the board may increase the time it takes for the agent to reach the terminal state of the first phase (when the black King reaches the edge of the board). When the initial positions are set to random, it can be measured how flexible the two different learning methods are when dealing with random starting positions. It may be interesting to vary other variables, but that seems to be less important and may be very time consuming. Future research could make this clear. The chessboard sizes that are used are 4x4 and 6x6, because when using the 8x8 chessboard, time complexity may become a problem, not due to the duration of playing a single game, but due to the number of games it takes to learn a satisfying strategy. Additionally, playing on an 8x8 board may not contribute to the comparison of the single-phase with the two-phase problem. The number of games is set to 200 in every test in this project, because a larger number resulted in a problem with the Java heap space, which may be caused by bad architecture of the program and may be solvable by using heap-profiling software such as that proposed in Shaham et al. (2001). Running 10 iterations of 200 games appeared to be just possible and takes approximately one hour for the 6x6 sized chessboard. Running more iterations on more games could strengthen the claim that this research project is trying to make; however, the results show the direction in which the process is going. In this section the results are stated, and in the next section the different tests are discussed and evaluated.

Variable                                     Value
Learning rate                                0.7
Lambda value                                 0.7
Iterations                                   10
Amount of games played                       200
Max. number of moves in a single game        20
Checkmate reward                             2
Stalemate reward                             1
Rook lost reward                             -2
King on the edge reward (only 1st phase)     2
Combination of features                      noOfPosSquaresk, distanceToEdgeBlackKing, distanceBetweenWhiteRookAndBlackKing, threatenedRook, kingsInOpposition

Table 3: Unchanged variables

3.1 Test 1: 200 games, each game the same initial position, 10 iterations

In this testing round, 10 times 200 games were played, each time with the same initial position, on a 4x4 and on a 6x6 sized chessboard (Figure 4). The pieces are placed like this because at least a few moves are required to achieve a checkmate. Furthermore, on the 4x4 board the black King is placed in one of the four squares in the middle of the board to ensure the position does not start in the second phase, which would initiate the second reward function immediately.


Figure 4: Initial position used for, left: the 4x4 sized chessboard; right: the 6x6 sized chessboard. Image source: chess.com.

The results after 10 times 200 games for a 4x4 sized chessboard are shown in Figure 5. These bar plots show the number of checkmates (blue), stalemates (green) and draws by losing the Rook (yellow), for every 20 games played, with a total of 200 games in 10 iterations. The x-axis counts the groups of 20 games, the y-axis shows the number of times each terminal state is achieved. The same settings for a 6x6 sized chessboard are shown in Figure 6.

Figure 5: Left: stacked bar plot for two reward functions. Right: bar plot for one reward function. Board size: 4x4. Static initial position. Yellow = draw by losing the Rook. Green = stalemate. Blue = checkmate. Shown is the mean number of each terminal state per 20 games over 10 iterations of 200 games each.


Figure 6: Left: stacked bar plot for two reward functions. Right: bar plot for one reward function. Board size: 6x6. Static initial position. Yellow = draw by losing the Rook. Green = stalemate. Blue = checkmate. Shown is the mean number of each terminal state per 20 games over 10 iterations of 200 games each.

3.2 Test 2: 200 games, each game a random initial position, 10 iterations

In this testing round, 10 times 200 games were played, but now each time with a random initial position, on a 4x4 and on a 6x6 sized chessboard.

The mean results after 10 times 200 games for a 4x4 sized chessboard are shown in Figure 7. The same settings for a 6x6 sized chessboard are shown in Figure 8.


Figure 7: Left: stacked bar plot for two reward functions. Right: bar plot for one reward function. Board size: 4x4. Random initial position. Yellow = draw by losing the Rook. Green = stalemate. Blue = checkmate. Shown is the mean number of each terminal state per 20 games over 10 iterations of 200 games each.

Figure 8: Left: stacked bar plot for two reward functions. Right: bar plot for one reward function. Board size: 6x6. Random initial position. Yellow = draw by losing the Rook. Green = stalemate. Blue = checkmate. Shown is the mean number of each terminal state per 20 games over 10 iterations of 200 games each.

4 Evaluation

In this section the results stated in the previous section are evaluated. Table 4 sketches an evaluation model, which is used to evaluate the level of play of the agent. These values are used to show the increase in level of play and to calculate whether the outcomes of the two methods are significantly different, using a paired t-test with a significance level of 5%. The values shown in Table 4 are the same as the rewards given to the agent.


Terminal state    Reward
Checkmate         2
Stalemate         1
Lose the Rook     -2

Table 4: Terminal state rewards
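As a sketch of how such an evaluation score and the paired t-test could be computed: the score weights below follow Table 4, but the class name, the grouping into per-iteration scores and the t-statistic code are illustrative assumptions (the critical value for the two-sided 5% level still has to come from a t-table with n - 1 degrees of freedom).

    // Sketch: score a block of games with the rewards of Table 4 and compare
    // the two methods with a paired t-test over per-iteration scores.
    class EvaluationSketch {
        static double score(int checkmates, int stalemates, int rooksLost) {
            return 2 * checkmates + 1 * stalemates - 2 * rooksLost;
        }

        // paired t statistic over the per-iteration scores of the two methods
        static double pairedT(double[] twoPhase, double[] singlePhase) {
            int n = twoPhase.length;
            double meanDiff = 0.0;
            for (int i = 0; i < n; i++) {
                meanDiff += (twoPhase[i] - singlePhase[i]) / n;
            }
            double variance = 0.0;
            for (int i = 0; i < n; i++) {
                double d = twoPhase[i] - singlePhase[i] - meanDiff;
                variance += d * d / (n - 1);
            }
            return meanDiff / Math.sqrt(variance / n);
        }
    }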

Test 1: 200 games, each game the same initial position, 10 iterations

Figure 5 shows that both the agent using one and the agent using two reward functions reach a decent level of play after playing 200 games on a 4x4 sized chessboard. A difference is found in the smaller number of checkmate positions reached when using one reward function (the right figure). This could mean that when the agent uses two reward functions, the second phase is helpful for achieving a higher reward by reaching a checkmate position using another strategy from the point that the opponent King is on the edge of the board. Interesting to note is that the agent using two reward functions improves more gradually. Figure 9 shows the evaluation for the 4x4 board. No significant differences between the methods are measured.

The second figure shows results for static initial positions on a 6x6 board (Figure 6); neither method seems to reach a high level of play within 200 games. The agent using two reward functions, however, performs better after playing 200 games, while the agent using one reward function does not. The number of checkmates seems to be reasonably even. Figure 10 shows the evaluation for the 6x6 board. The agent with two reward functions performs significantly better between 20 and 40 games played and between 120 and 180 games played.

Test 2: 200 games, each game a random initial position, 10 iterations

In the first figure of this test (Figure 7), a 4x4 sized board was used. Both methods reach a reasonable result after playing 200 games, and again the agent using two reward functions reaches a checkmate position more easily. Figure 11 shows the evaluation for the 4x4 board. The agent with two reward functions performs significantly better between 100 and 120 games played.

Finally, Figure 8 shows the 6x6 sized chessboard using random initial positions. The level of play of the agents appears to be similar; the two-reward-function method again finds a higher number of checkmate terminal states. Using one reward function a lot of games do not end with a terminal state, which is a problem because games that do not end with a terminal state are not used for learning. Figure 12 shows the evaluation for the 6x6 board. No significant differences between the methods are measured.


Figure 9: Evaluation of Figure 5. Blue: evaluation for two reward functions. Yellow: evaluation for one reward function. Board size: 4x4. Static initial position. The bars show the mean evaluation for 10 iterations of 200 games using the model shown in table 4.

Figure 10: Evaluation of Figure 6. Blue: evaluation for two reward functions. Yellow: evaluation for one reward function. Board size: 6x6. Static initial position. The bars show the mean evaluation for 10 iterations of 200 games using the model shown in table 4.

In general the agent using two reward functions not only performs slightly better in achieving checkmate positions, it also performs better in playing the endgame overall, including avoiding the loss of the Rook and achieving stalemate positions.


Figure 11: Evaluation of Figure 7. Blue: evaluation for two reward functions. Yellow: evaluation for one reward function. Board size: 4x4. Random initial positions. The bars show the mean evaluation for 10 iterations of 200 games using the model shown in table 4.

Figure 12: Evaluation of Figure 8. Blue: evaluation for two reward functions. Yellow: evaluation for one reward function. Board size: 6x6. Random initial positions. The bars show the mean evaluation for 10 iterations of 200 games using the model shown in table 4.

5 Conclusions

In this research project the following research question has been studied:

To what extent does the performance of an artificial agent using Reinforcement Learning to solve the chess endgame King + Rook vs King (KRK) increase when the learning phase is divided into two separate parts?


To answer this question the chess endgame problem King + Rook vs King has been used to compare training an agent with one linear reward function against training it with two. The hypothesis, as stated in the Introduction, predicted that if an agent uses two reward functions, the level of play would be higher than when the agent trains with a single reward function. When using two reward functions, the second phase starts when the black King reaches the edge of the board. From this moment until the end of the game a different set of weights for the features is used to calculate the best next state.

In the evaluation section it was shown that the agent that uses two reward functions for two different phases of the game performs better than the agent that uses a single reward function: it more often achieves a checkmate position, which receives a higher reward than a stalemate position, and it also reaches a more satisfying result in the general performance measured with Table 4.

It can be concluded that using Curriculum Learning with two separate reward functions leads to a slightly better performance of the artificial agent when learning the chess endgame King + Rook versus King.

6 Discussion

More iterations and games

Testing the program more thoroughly could have improved the reliability of the results of this research project, but time pressure and time complexity caused problems. When playing more than approximately 2500 games in one run, the program stopped due to a lack of heap space. This might be caused by bad programming, and could probably be solved by improving the architecture, by using heap-profiling software (Shaham et al., 2001), or by using more cores when testing for more computational power. It could turn out, however, that more tests with different parameter settings or feature combinations provide more or less the same results. Playing more games with these features may not improve the playing level of the agent, but note that the main goal of this research project was not to develop the best feature combination, but rather to compare the two learning methods.


General performance

The general performance of the artificial agent was not great, taking into account that an average human player would win from almost all initial positions playing with the white King and Rook. This is probably caused by a poor feature composition and the lack of an exploration/exploitation function. However, comparing the single- and two-reward-function methods was the main aspect of the project; therefore this is not a fundamental problem.

7 Future Research

Future progress in this project

During this project the agent received a positive reward for achieving a stalemate position. If this reward were negative, it is possible that the evaluation of the agent that uses two evaluation functions would be even better relative to the other. However, this is not tested in this project, mainly because doing so would slow the learning process, for the reason that achieving a stalemate position requires almost the same strategy as achieving checkmate, again facing the problem of time complexity. Other possible expansions of this research project include implementing an Exploration/Exploitation algorithm, adding a value function that uses a utility function (Sutton and Barto, 1998) to think more steps ahead, and finding better feature combinations. These implementations in the program may increase the difference in level of play between the two methods using one or two learning phases.

More reward functions

In this research a maximum of two reward functions is used for separate phases of the chess game. However, it could be interesting to see if more reward functions would lead to even better results, or how many would be ideal. Lassabe et al. (2006) use 33 different kinds of positions with different strategies. They state that neural networks may be helpful to find what kinds of positions there are, by clustering, but in chess this is hard because a different position of a single piece may result in a different evaluation of the position. Only a few open source chess programs use more than three different state classes: opening, middlegame and endgame (Lassabe et al., 2006). Schraudolph et al. (1994) use Temporal Difference Learning to evaluate a position in the board game Go. They claim that with sufficient attention to network architecture and training procedures, a connectionist system trained by temporal difference learning alone can achieve significant levels of performance in this knowledge-intensive domain.


Curriculum Learning in general

In the Introduction section of this project Bengio et al. (2009) was mentioned concerning Curriculum Learning. They state that Machine Learning algorithms can benefit from a curriculum strategy, i.e. a learning strategy based on multiple reward functions, and for this subject it remains unclear why some curriculum strategies work better than others; this should be a subject of future research.

References

Baxter, J., Tridgell, A., and Weaver, L. (1997). KnightCap: A chess program that learns by combining TD(λ) with minimax search. Technical report, Australian National University, Canberra.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM.

Block, M., Bader, M., Tapia, E., Ramírez, M., Gunnarsson, K., Cuevas, E., Zaldivar, D., and Rojas, R. (2008). Using reinforcement learning in chess engines. Research in Computing Science, 35:31–40.

Lassabe, N., Sanchez, S., Luga, H., and Duthen, Y. (2006). Genetically programmed strategies for chess endgame. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pages 831–838. ACM.

Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs.

Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229.

Schraudolph, N. N., Dayan, P., and Sejnowski, T. J. (1994). Temporal difference learning of position evaluation in the game of Go. Advances in Neural Information Processing Systems, pages 817–817.

Shaham, R., Kolodner, E. K., and Sagiv, M. (2001). Heap profiling for space-efficient Java. In ACM SIGPLAN Notices, volume 36, pages 104–113. ACM.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.

Wiering, M. and van Otterlo, M. (2012). Reinforcement Learning: State-of-the-Art, volume 12. Springer Science & Business Media.


Appendices

Appendix A

Left: final white move that leads to stalemate. Right: final white move that leads to checkmate. Image source: chess.com.


Appendix B

UML for the chess program showing the most important classes with class names in blue. Arrow explanation: ’Extends’ means this is a parent class. ’Use’ means these classes have a fundamental connection.
