
On Higher-Order Control Tasks: The Application of A3C on Space Fortress

Putri A. van der Linden
10768017

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors
Drs. G. Poppinga (department ASDO, NLR)
Dr. ir. J.J.M. Roessingh (department AOTS, NLR)
Dr. S. van Splunter (Universiteit van Amsterdam)

Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Acknowledgement

I would like to express my sincerest gratitude to a particular group of people without whom this project could not have been completed. I would like to thank the following (in no particular order):

My supervisors drs. Gerald Poppinga and dr. Jan Joris Roessingh, for giving me this opportunity, guiding me throughout the process, and providing me with the tools and environments with which the current thesis was made possible.

My supervisor dr. Sander van Splunter for guiding me throughout the writing process.

My fellow students Wijnand van Woerkom and Rijnder Wever, for delivering the previous work on which this project was built and for guiding me through parts of the code.

Andi Aliko, for helping me understand the concepts that are fundamental within this project.


Abstract

In the current thesis, the extent to which the A3C architecture is able to learn higher-order control tasks is studied. The tasks consist of a set of subtasks of the simplified Space Fortress game which vary in complexity and in the order of control required. Previous experiments with Deep Q-learning applied to these subtasks have shown substantial learning on the lower-order tasks, but the higher-order tasks could not be mastered. The GA3C architecture was applied to the same set of subtasks that was used in the Deep Q-learning experiments. The GA3C architecture was able to master a lower order of control, but could not master the higher-order control tasks. Furthermore, GA3C did not show a significant increase in learning behaviour on higher-order control tasks when compared to the Deep Q-learning experiments.


Contents

1 Introduction
2 Space Fortress Game and Subtasks
  2.1 Subtasks
    2.1.1 The Aiming Task
    2.1.2 The Control Task
    2.1.3 The Simplified Space Fortress Game
3 Related Work
  3.1 Previous learning methods for Space Fortress
4 Reinforcement Learning Background
  4.1 Q-Learning
    4.1.1 Action Strategies
    4.1.2 Deep Q-Learning
    4.1.3 Experience Replay
  4.2 Actor-Critic Learning
  4.3 Asynchronous Advantage Actor-Critic (A3C)
5 Method and Approach
  5.1 Environment
    5.1.1 OpenAI Gym
    5.1.2 GA3C
  5.2 Rewards and Terminal States
6 Experiments
  6.1 The Aiming Task
  6.2 Control Tasks
    6.2.1 Frictionless Control
    6.2.2 Grid Control
7 Conclusions and Discussion

1 Introduction

In recent years, Deep Reinforcement Learning (DRL) has proven to be a powerful tool for many learning tasks (Li, 2017; Nachum et al., 2017). These learning algorithms are roughly modeled after human neural activity and learning behaviour, and are, to a certain extent, thought to resemble the human learning process (Mnih et al., 2015).

The application of model-free Deep Reinforcement Learning has shown that agents equipped with this technique can master challenging tasks in uncluttered and relatively straightforward contexts, for instance, in specific computer games. Complex tasks such as several Atari games are being solved by state-of-the-art techniques using neural networks with intricate learning architectures (Mnih et al., 2013; Mnih et al., 2015).

With the use of Deep Q-Learning (DQN) and a convolutional neural network, (Mnih et al., 2013) succeeded in approximating or exceeding the skill of professional human players in many games. A single framework was applied to these games, using their pixel values as input.

However, other seemingly uncomplicated computer games could not be mastered by this DQN architecture. An example is a simplified version of the game Space Fortress, which was developed under US DARPA [1] to investigate human learning strategies suitable for learning complex military tasks.

An important characteristic of the simplified Space Fortress game is that it requires higher-order control. Higher-order control tasks are tasks in which simpler objectives need to be coordinated and met in order to meet a more complex general objective. In this specific game, these objectives include accelerating the space ship to navigate, while tactical maneuvering is required to defeat the space fortress. Previous attempts [2] at controlling the space ship using a DRL agent have shown that substantial learning can be observed in the agent when lower-order control (i.e. positional control or grid control) is used. This was less so the case for the more complicated control tasks, such as tasks in which more complex frictionless navigation is required.

In these attempts, DRL was based on a DeepMind [3] DQN algorithm using a convolutional neural network. However, it is thought that more advanced DRL architectures, such as the Asynchronous Advantage Actor-Critic (A3C) architecture, might exhibit a significant increase in learning behaviour compared to DQN when applied to higher-order control tasks.

In previous research, A3C is said to outperform DQN due to the increased stability of its learning process (Mnih et al., 2016). A3C uses multiple agents to solve a task asynchronously. The algorithm is said not only to outperform DQN at the tasks themselves, but also to require fewer computational resources (Mnih et al., 2016; Babaeizadeh et al., 2017).

[1] https://www.darpa.mil/ (Last visited: 02/07/2017)
[2] Previous attempts were done by BSc Artificial Intelligence students in various projects. Sources are available upon request.
[3] https://deepmind.com/


Additionally, methods have been devised for A3C that combine the utilization of CPU and GPU components to further optimize computational performance (Babaeizadeh et al., 2017).

The current thesis studies the extent to which the A3C architecture is able to learn higher-order control tasks. This is done by studying its performance on different subtasks of the simplified Space Fortress game. These tasks vary in complexity and thus require different levels of control. The performance of A3C on these increasingly difficult tasks is compared to the performance of DQN. The A3C architecture as described in (Mnih et al., 2016) showed a significantly higher average score in the game Asteroids, which involves navigation and other tasks similar to the higher-order control tasks of the Space Fortress game. It is therefore hypothesized that this architecture might also produce increased learning behaviour in the Space Fortress game, including its higher-order control tasks.

The Netherlands Aerospace Centre (NLR) is interested in an A3C architecture that can be applied to the higher-order control in the game Space Fortress. They have requested that a method for DRL, such as A3C, be devised, implemented and verified for desired functionality, and that the performance of the DRL agent be measured. The criterion is speed-based, i.e. how fast (measured in game frames or seconds of real-time play) the DRL agent is able to complete a game with a proficient score.

2 Space Fortress Game and Subtasks

The Space Fortress game was created in 1989 by researchers at the University of Illinois with the purpose of studying human learning strategies for complex tasks. The goal was to create a game that involved complex tasks with varying difficulty levels that are representative of real-world tasks, with which human skill acquisition could be studied. It was therefore deemed necessary that the game be challenging and interesting for prolonged periods of practice (Mané and Donchin, 1989).

The main objective of the game itself is to destroy a space fortress located at the centre of the screen. A frame of the full space fortress game can be seen in Figure 1.


Figure 1: A frame of the full Space Fortress game in which the space ship (a), the fortress (b), a mine (c), an instance of a bonus (d), a space ship missile (e) and a fortress missile (f) can be seen.

The player is represented by a spaceship that can navigate around the space fortress and shoot missiles at it. To destroy the space fortress, it must first be weakened by shooting ten missiles at it with intervals of at least 250 ms between shots. The fortress can then be destroyed by two consecutive shots fired less than 250 ms apart.
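The firing rule above can be summarised as a small state machine. The following is a minimal sketch, assuming shot timestamps in milliseconds; the function name and the reset behaviour after a premature fast shot are illustrative assumptions, not taken from the game's code.

```python
def fortress_destroyed(shot_times_ms):
    """Illustrative sketch of the weakening rule: ten shots fired at least
    250 ms apart build up vulnerability; a subsequent shot fired less than
    250 ms after the previous one then destroys the fortress."""
    vulnerability = 0
    prev = None
    for t in shot_times_ms:
        if prev is None or t - prev >= 250:
            vulnerability += 1        # slow shot: weakens the fortress
        elif vulnerability >= 10:
            return True               # fast double shot after ten hits: destroyed
        else:
            vulnerability = 0         # premature fast double shot: progress lost (assumption)
        prev = t
    return False
```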

The navigation is frictionless and thus consists of steering and acceleration; when the space ship is moving in a certain direction, it keeps moving in that direction without user input until the user decides to steer and re-accelerate.

Besides destroying the space fortress, the space ship must avoid being damaged by mines or by the space fortress; the space fortress also fires missiles at the space ship, which must be evaded. Additionally, mines are spawned on the screen, each accompanied by a letter. The space ship needs to evade these mines. Prior to a mine's spawning, the player is shown a letter that indicates whether the mine is a friend or a foe. Friend mines, when fired at by the space ship, will move towards the space fortress and damage it. Foe mines require a different weapon system and are destroyed when shot at. The space fortress cannot be damaged while a mine is present on the screen.

Initially, the space ship is provided with 100 missiles. When the player has run out of missiles, each shot is penalized with a negative reward of 3. Missiles can be restocked when a dollar sign appears on the screen just below the space fortress. When a dollar sign is present, the player can choose between receiving missiles or a positive reward of 100 by pressing one of two keys. A demo of the full game can be seen at https://www.youtube.com/watch?v=FWuTP_JgZwo.

2.1 Subtasks

Because of the high complexity of the game and the large number of long-term objectives, simplified versions of the game have been constructed in previous DRL attempts to study the extent to which DRL was able to master subtasks of the game. These tasks are divided into three categories, which vary in difficulty and require different orders of control.

2.1.1 The Aiming Task

The Aiming Task revolves around acquiring the skill of turning towards a target and knowing when to shoot. A frame of the aiming task can be seen in Figure 2.

Figure 2: The aiming task.

The game contains the space ship, which is located at the centre of the screen. The player is again represented by the space ship, which cannot change its location but can turn in any direction. Mines are spawned consecutively at 10-frame intervals and remain on the screen for a duration of 200 frames or until destroyed. The main objective of the aiming task is to shoot as many mines as possible during an instance of the game. The duration of a game is 2400 frames.

2.1.2 The Control Task

The goal of the Control Task is to determine whether a player can master the navigation aspects of the game. The player therefore needs to navigate the space ship to a target location as quickly as possible. A frame of the control task can be seen in Figure 3.


Figure 3: The control task.

The game consists of the space ship, which is initially located at a random position on the screen. The player controls the space ship, which can move freely to any position on the screen. Squares are spawned consecutively at 80-frame intervals at random positions on the screen. After a square is spawned, it remains for a duration of 200 frames or until the space ship has navigated to the location of the square. The duration of a full game is 2400 frames.

There are two different versions of the Control Task, which differ in navigation control. There is a "no-grid" or frictionless version of the task, which has frictionless navigation similar to that of the full Space Fortress game. The other version is the grid version, which has grid-control navigation. In this version, the position of the space ship is fully dependent on user input, meaning that a user input results in a constant change of location on the screen.

2.1.3 The Simplified Space Fortress Game

The Simplified Space Fortress game is an actual instance of the Space Fortress game with certain components left out. The task requires aiming and navigation skills, as well as tactical maneuvering to evade damage and destroy a target. A frame of the simplified Space Fortress game can be seen in Figure 4.


Figure 4: The simplified space fortress task.

The game again consists of a space ship, which represents the player. A space fortress is located at the centre of the screen. No other objects are present in the game. The space ship can move freely to any position on the screen, whilst the position of the fortress is fixed at the centre. The fortress can turn in any direction and shoot missiles. It tracks the location of the space ship and fires missiles at it at 20-frame intervals. Moreover, the space ship can fire at the space fortress. When no terminal state is reached, the duration of a game is 350 frames. A terminal state is reached either when the space ship is destroyed by two hits from the fortress, or when the fortress is destroyed by three hits from the space ship. Space ship shots only take effect when there is an interval of at least 10 frames between consecutive shots.

Analogous to the Control Task, the Simplified Space Fortress game has two versions, which again differ in navigation control: a frictionless navigation version and a grid-control navigation version.

3 Related Work

Ever since the publication of (Mnih et al., 2013), deep reinforcement learning methods have become increasingly popular and a wide range of applications has been explored (Li, 2017). The method has proven flexible and powerful in many different game environments, and achieved superhuman performance in some (Mnih et al., 2013; Mnih et al., 2015). The architecture used the pixel values of these games as input after the screen frames had been processed into 84x84 grayscale frames. These frames were fed to a deep Q-network, which contained three convolutional layers with 32 8x8 filters and stride 4, 64 4x4 filters and stride 2, and 64 3x3 filters and stride 1 respectively, followed by two fully connected layers. A single state is represented by four game frames to capture motion. In each state, the network calculates the Q-value of every action available in that state. Experience Replay was used to stabilize the learning process (see section 4.1.3).
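As an illustration of the architecture just described, the following is a minimal sketch of such a network in tf.keras (an assumption made for this example; the original work used its own implementations), with four stacked 84x84 frames as input and one Q-value output per action.

```python
import tensorflow as tf

def build_dqn(n_actions):
    """Sketch of the convolutional Q-network described above: three conv
    layers (32x8x8/4, 64x4x4/2, 64x3x3/1) followed by two fully connected
    layers, returning one Q-value per action."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(84, 84, 4)),           # four stacked grayscale frames
        tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu"),
        tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(n_actions)                     # Q-value for every action in the state
    ])
```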


New methods have since been devised that attempt to improve on or exceed the results of (Mnih et al., 2015), such as Double DQN and Prioritized Experience Replay (Li, 2017). The most prominent recent development has been the use of asynchronous methods, which have been shown to be more stable and more optimized than DQN (Mnih et al., 2016; Li, 2017).

The asynchronous framework as constructed by (Mnih et al., 2016) lets multiple learning agents learn on different instances of an environment asynchronously, while each maintains its own copy of a main network. After a certain number of steps, the weights of the agents' networks are averaged and fed to the main network. This framework has the benefit that it allows for greater flexibility; for example, it allows for experiments with different learning methods that are not bound to be on-policy or off-policy. Furthermore, the computational resources needed are reduced because an experience replay no longer has to be stored. This framework was tested with different on-policy and off-policy methods, i.e. asynchronous one-step Q-learning, asynchronous one-step Sarsa, asynchronous n-step Q-learning and asynchronous advantage actor-critic (Mnih et al., 2016). These methods were applied to several Atari games that were also used in (Mnih et al., 2015). Asynchronous advantage actor-critic outperformed all other methods at most tasks. The network used in (Mnih et al., 2016) had a convolutional layer with 16 8x8 filters and stride 4 and another convolutional layer with 32 4x4 filters and stride 2, followed by a fully connected layer. The network was used to approximate both the value function and the policy function. The A3C architecture as described by (Mnih et al., 2016) was executed on a single multi-core CPU and was able to outperform a GPU DQN architecture.

Since then, (Babaeizadeh et al., 2017) have constructed an A3C implementation that is compatible with GPU usage. Their implementation, which they call GA3C (in which the G refers to the usage of a GPU), was shown to achieve results similar to the original A3C implementation but with a significant speed-up. The GA3C implementation differs somewhat in its architecture from that of (Mnih et al., 2016). In GA3C only a single network is used, which is maintained on a GPU, meaning that the agents do not have their own networks. Instead, the agents make policy requests by sending frames to a predictor. The predictor queues the policy requests of all agents and batches them into a single input for the network. The policies returned by the network are then passed back to the corresponding agents by the predictor, which are then able to perform an action based on the policy they received. Meanwhile, the agents periodically send series of experiences, called training batches, to a trainer. The trainer queues these training batches and uses them to update the network. A systematic overview of the two models can be found in Figure 5.
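The following is a minimal sketch of the predictor's batching loop, assuming a hypothetical `network.predict_batch` call and plain Python queues; it is not the actual GA3C code, which additionally runs multiple predictor and trainer threads.

```python
import queue
import numpy as np

N_AGENTS = 32                                    # illustrative; matches the AGENTS setting in Appendix 3
prediction_queue = queue.Queue(maxsize=100)      # (agent_id, frame) policy requests from all agents
return_queues = [queue.Queue() for _ in range(N_AGENTS)]   # one result queue per agent

def predictor_loop(network, batch_size=128):
    """Collect queued policy requests, run a single batched forward pass on the
    GPU-resident network, and hand each agent back its own policy and value."""
    while True:
        ids, frames = [], []
        agent_id, frame = prediction_queue.get()                 # block until a request arrives
        ids.append(agent_id)
        frames.append(frame)
        while len(frames) < batch_size and not prediction_queue.empty():
            agent_id, frame = prediction_queue.get()
            ids.append(agent_id)
            frames.append(frame)
        policies, values = network.predict_batch(np.stack(frames))  # hypothetical batched call
        for i, agent_id in enumerate(ids):
            return_queues[agent_id].put((policies[i], values[i]))   # agent samples its action from this policy
```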


Figure 5: Systematic overview of A3C (a) and GA3C (b) architectures. Image taken from (Babaeizadeh et al., 2017).

General descriptions of Deep Q-learning, experience replay and A3C can be found in section 4.1.2, section 4.1.3 and section 4.3.

3.1 Previous learning methods for Space Fortress

Previous Deep Reinforcement Learning applications to the game Space Fortress using DQN have shown that learning is possible for simpler subtasks of the game. This was the case for the aiming task and the control task with grid control, in which the agent approximated the maximum possible score for these games. In these tasks, there was a more direct relation between an action and the goal: the grid-control navigation requires a set of turns and a set of forward moves to reach the goal, and the aiming task requires a set of turns and some shots.

However, these methods did not prove sufficient when applied to subtasks that required coordination of simpler tasks. The agent performed poorly on the control task with frictionless navigation, the simplified Space Fortress game with grid control, and consequently, the simplified Space Fortress game with frictionless control. The control task with frictionless navigation requires that the agent not only learn to move towards the objective, as in the grid-control version, but also that it controls for and counters the movement of the space ship in frictionless space. For the simplified Space Fortress game with grid control, the agent has to learn to evade shots from the fortress, as well as turn and aim to shoot at the fortress. The simplified Space Fortress game with frictionless control requires that a combination of the latter two be learned.

These higher-order control tasks require significantly more actions and subobjectives to meet their main objective, and thus the relation between a move and its result is much less obvious.

The previous DRL agents were based on a DQN architecture as described by (Mnih et al., 2013). Two different implementations of DQN, SimpleDQN and a recreation of Google's DeepMind implementation [6], were applied to the game and showed similar results.

4 Reinforcement Learning Background

Reinforcement learning is a technique in which an agent interacts with an environment through actions and the rewards it receives, and has to learn which action corresponds to which reward through trial and error. An important aspect of reinforcement learning problems is that in most cases the rewards received after performing an action are sparse and are only awarded at later time steps. A difficulty is thus to discover which action led to which future reward.

In reinforcement learning, every task can be dissected into a series of <state, action, reward> tuples, in which an agent interacts with an environment by performing an action and the environment returns a reward and a new state. An important assumption when defining an environment is that there is a relation between states and actions, and that the outcome of a state depends on a finite number of previous states, so that the task can be defined as a Markov decision process.

4.1 Q-Learning

The goal in each state is thus to choose the action that maximizes the total future reward R, in which later rewards are discounted by a factor γ due to future time steps being less certain:

R_t = \sum_{k=0}^{n} \gamma^k r_{t+k} = r_t + \gamma r_{t+1} + \dots + \gamma^n r_{t+n}

We define the function that retrieves the highest possible future reward (or quality of an action) given action a in state s as the Q-value of that action:

Q(s_t, a_t) = \max(R_{t+1})

The ambition is to find the policy π that selects the action with the highest Q-value in each state. It is thus defined as:

\pi(s) = \arg\max_{a} Q(s, a)

The Q-value for a specific state-action pair can thus equally be calculated by

Q(s_t, a_t) = E\left[ R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \mid s_t, a_t \right] = E\left[ R_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') \mid s_t, a_t \right]

This corresponds to the reward received after performing action a_t plus the discounted total future reward of the next state. The former is called the Bellman equation (Sutton and Barto, 1998). The Q-values can be calculated iteratively using this equation.
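As a concrete (and deliberately small) illustration of this iterative calculation, the update below applies the Bellman equation to a tabular Q-function; the state and action counts and the learning rate are arbitrary illustrative values, and the thesis itself approximates Q with a neural network instead (see section 4.1.2).

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
gamma, alpha = 0.99, 0.1          # discount factor and learning rate (illustrative values)

def q_update(s, a, r, s_next, done):
    """One Bellman backup: move Q(s, a) towards r + gamma * max_a' Q(s', a')."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```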

[6] https://github.com/kuz/DeepMind-Atari-Deep-Q-Learner (Last visited: 02/07/2017)

4.1.1 Action Strategies

Reinforcement learning is fundamentally a trial-and-error process. This means that initially the actions the agent chooses are random, since it has little knowledge of the environment and of the consequences of its actions. The chosen actions become less random as the agent acquires more knowledge of the environment. Q-learning is a greedy algorithm, meaning that the action with the highest Q-value is executed. A consequence is that the algorithm can get stuck in a local minimum/maximum when a suboptimal strategy is explored before the optimal strategy: the suboptimal strategy will have the highest Q-values at that moment, and thus the suboptimal strategy will be exploited.

Small amounts of randomized actions are desirable throughout the learning process, to explore strategies that deviate from the current strategy but might result in higher scores. However, large amounts of random actions would defeat the purpose of learning a strategy and slow down the learning process. This is called the explore-exploit trade-off (Sutton and Barto, 1998, Chapter 2.1).

A common solution to this problem is to introduce a factor ε that represents the probability with which a chosen action is random and thus deviates from the current strategy. This is called the epsilon-greedy strategy. The further an agent is in the learning process, the more certain it can be that its current strategy is the optimal strategy. This is why a decay rate is usually introduced to reduce the value of ε over time.
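A minimal sketch of epsilon-greedy action selection with a linear decay of ε is shown below; the start value, end value and decay horizon are illustrative (the end value 0.1 loosely mirrors the eps_end setting listed in Appendix 2).

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

def epsilon_at(step, eps_start=1.0, eps_end=0.1, decay_steps=50000):
    """Linearly decay epsilon from eps_start to eps_end over decay_steps steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```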

Q-learning is an off-policy learning algorithm used in many reinforcement learning methods. Off-policy learning methods are defined as methods in which the weight updating happens independently of the policy that is being carried out by the agent. In the case of Q-learning, the updating of the Q-function happens as if the chosen actions follow a greedy strategy, while the actions actually chosen follow an epsilon-greedy strategy. This means that there is a discrepancy between the actual action policy and the updating policy, which is why Q-learning is an off-policy algorithm (Sutton and Barto, 1998, Chapter 7.6). In on-policy algorithms, no such discrepancy is present, meaning that the policy with which the weights are updated is the same as the policy that is being carried out.

4.1.2 Deep Q-Learning

For tasks with a large number of possible states, calculating and storing all Q-values is no longer feasible. The Q-function is therefore approximated using a neural network. This approach is called Deep Q-learning. The network is fed a state and returns the Q-value of every action in that state (see Figure 6).


Figure 6: Conceptual representation of a deep Q-network.

Regarding the Stability of Deep Q-Learning

A common problem in generic Deep Q-learning is that the learning tends to be unstable (Mnih et al., 2015; Mnih et al., 2016; Schaul et al., 2015). The Q-function is updated sequentially after each performed action. A consequence is that the Q-function becomes adapted to the patterns of the most recent series of actions. The result is that if a potentially useful but rare pattern or method does not recur for a longer sequence of episodes, the update policy forgets the pattern and has to re-learn it when it is encountered again (Lin, 1992).

Many methods have been devised that attempt to counter this problem. The current thesis will only discuss Experience Replay, due to its usage in (Mnih et al., 2015).

4.1.3 Experience Replay

The concept of experience replay is to keep track of and reuse past states and actions (Lin, 1992). An experience is a set of a state, an action, a reward and the next state, and is thus defined as (s_t, a_t, r_t, s_{t+1}). Over many games, experiences are stored in a replay memory (Mnih et al., 2015). At each step during an episode, the new experience is added to the replay memory while a random mini-batch from the replay memory is sampled, over which the Q-values are updated. This means that the updating no longer happens over the most recent sequence of experiences.

This method has the advantage that samples are stored and reused, which results in more effective data usage. It also stabilizes the learning by decorrelating subsequent samples and thus averaging out the updates (Mnih et al., 2015).
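A minimal sketch of such a replay memory is shown below; the capacity and batch size follow the DQN settings in Appendix 2, but the class itself is illustrative rather than the implementation used in the experiments.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size store of (s, a, r, s_next, done) experiences."""
    def __init__(self, capacity=1000000):
        self.buffer = deque(maxlen=capacity)      # the oldest experiences are discarded automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # A uniformly random mini-batch decorrelates consecutive experiences
        return random.sample(self.buffer, batch_size)
```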


4.2 Actor-Critic Learning

On Value-based Methods and Policy-based Methods

Q-learning is a value-based algorithm, in which the policy π is approximated indirectly: in each state, certain state or action values v are calculated (in the case of Q-learning, by optimizing the Q-function), and the policy is determined by choosing the highest of these values v (here again: the Q-value). Additionally, policy-based methods have been devised, in which the policy is approximated directly from the states and actions. Rather than choosing the action with the highest value v, a sequence of n actions is chosen through a function π(a_t | s_t, θ), in which the function parameters θ can be approximated through several methods. During an episode, a set of actions is executed. The reward after a certain number of actions is then backpropagated over all used state-action pairs, usually by means of a policy gradient.

In policy-based deep learning methods, the policy π(a_t | s_t, θ) is represented using a neural network, where θ represents the weights of the network. After executing a policy, the rewards are propagated backwards in time to update the network for all state-action pairs.

An important branch of learning methods worth mentioning is that of actor-critic methods. Actor-critic methods are usually primarily policy-based methods that attempt to find the optimal policy by calculating intermediate values, with which the policy is re-evaluated or "criticized" at every timestep (Sutton and Barto, 1998, Chapter 11.1).

At every timestep, the value v_t is calculated after performing an action, the extent to which the new value v_{t+1} is surprising is computed using the temporal-difference error (the critic), and the policy (the actor) is updated with respect to this error (Sutton and Barto, 1998, Chapter 11.1). The temporal-difference error is defined as

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)

where V is the value function that retrieves the value v_t for state s_t.
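The critic's computation can be written as a small function; the sketch below is illustrative, with the actor and critic updates indicated only in comments (V, logits, alpha_v and alpha_pi are hypothetical names, not the implementation used in this thesis).

```python
def td_error(r_next, v_s, v_s_next, gamma=0.99, done=False):
    """Temporal-difference error: how surprising the observed reward and the
    next state's value estimate are, relative to the current estimate."""
    target = r_next if done else r_next + gamma * v_s_next
    return target - v_s

# Conceptual actor-critic step (hypothetical names):
#   delta = td_error(r, V[s], V[s_next], done=done)
#   V[s]         += alpha_v  * delta      # critic: improve the value estimate
#   logits[s, a] += alpha_pi * delta      # actor: make action a more (or less) likely
```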

4.3 Asynchronous Advantage Actor-Critic (A3C)

While experience replay stabilizes the learning to a certain extent, the technique has some fundamental issues. Firstly, it requires significantly more computational resources due to the storage of past experience (Mnih et al., 2016). Secondly, experience replay is limited to off-policy learning algorithms, and thus limits the exploration of potentially useful on-policy algorithms. (Mnih et al., 2016) have constructed a framework that overcomes these problems and simultaneously results in more stable learning.

In A3C, multiple actor-critic agents each manage their own network while playing in different instances of an environment in parallel. Each agent updates its own parameters θ' and θ'_v for the policy and value functions, and pushes these updates asynchronously to the global policy and value parameters θ and θ_v after a fixed number of steps or when a terminal state is reached. Since all agents are learning in different instances of an environment, the global update is constructed from many different experiences and will thus vary less than with sequential updating, as is the case in generic deep Q-learning (see section 4.1.3 and section 4.2). This effect can be enhanced by using different exploration policies for the agents. A systematic overview of the A3C algorithm can be found in Figure 7.

Figure 7: Systematic overview of A3C. Image taken from [7].
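A conceptual sketch of one such worker is given below; `local_net`, `global_net` and `env` are hypothetical interfaces used only to show the flow of an n-step rollout followed by an asynchronous update of the global parameters, not the implementation used in this thesis.

```python
import numpy as np

def worker(env, local_net, global_net, t_max=5, gamma=0.99):
    """One A3C worker: copy the global weights, act for up to t_max steps,
    compute n-step returns and advantages, and push the resulting gradients
    to the global network asynchronously."""
    state, done = env.reset(), False
    while True:
        local_net.copy_weights_from(global_net)          # synchronise the local copy
        rollout = []
        for _ in range(t_max):
            policy, value = local_net.forward(state)
            action = np.random.choice(len(policy), p=policy)
            next_state, reward, done, _ = env.step(action)
            rollout.append((state, action, reward))
            state = next_state
            if done:
                break
        # Bootstrap the n-step return from the local value estimate
        R = 0.0 if done else local_net.forward(state)[1]
        gradients = []
        for s, a, r in reversed(rollout):
            R = r + gamma * R
            advantage = R - local_net.forward(s)[1]
            gradients.append(local_net.policy_value_gradients(s, a, R, advantage))
        global_net.apply_gradients(gradients)            # asynchronous update of θ and θ_v
        if done:
            state, done = env.reset(), False
```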

5 Method and Approach

The A3C implementation as described in (Babaeizadeh et al., 2017) (GA3C) has been applied to multiple subtasks of the Space Fortress game. This was done by adjusting the C code of the game to experiment with different reward functions, and by integrating the Space Fortress Gym [8] environment with the GA3C architecture. The set of subtasks consists of the same subtasks that were used for the DQN experiments in previous attempts.

[7] https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2 (Last visited: 02/07/2017)
[8] https://gym.openai.com/


5.1 Environment

5.1.1 OpenAI Gym

In April 2016, OpenAI released Gym, a general framework in which any game can be formatted into a reinforcement learning environment and statistics can be collected. State-of-the-art reinforcement learning techniques require that an exchange of observations (the frames and pixel values of a game), actions and rewards takes place, as described for Markov decision processes (see section 4.1). For the current project, Gym version 8.0 was used. An overview of all properties a Gym environment should have can be found in Appendix 1.
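A minimal interaction loop with such an environment is shown below; the environment id "SpaceFortress-v0" is hypothetical, and the loop uses the classic Gym API of that period (reset returning an observation, step returning a 4-tuple).

```python
import gym

env = gym.make("SpaceFortress-v0")        # hypothetical id for the Space Fortress Gym environment
observation = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()     # random policy, only to illustrate the exchange
    observation, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print("episode reward:", total_reward)
```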

5.1.2 GA3C

The GA3C implementation as described by (Babaeizadeh et al., 2017), and instructions on how to run it, can be found at https://github.com/NVlabs/GA3C. Since the preprocessing step of converting the frames to 84x84 grayscale frames is done in the Space Fortress Gym environment, it is omitted from the GA3C implementation when running this game.
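A sketch of this preprocessing step is shown below, using OpenCV as an assumed dependency; the exact interpolation method and any scaling of pixel values in the thesis environment may differ.

```python
import cv2
import numpy as np

def preprocess(frame_rgb):
    """Convert an RGB game frame into the 84x84 grayscale input the network expects."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)                 # collapse the colour channels
    resized = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    return resized.astype(np.float32) / 255.0                          # scale to [0, 1] (assumption)
```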

5.2 Rewards and Terminal States

Experiments have been performed with different reward functions and terminal states to observe how these influence the learning behaviour. The terminal states and rewards used for each game in the DQN experiments are described in section 2.1. In the current thesis, reward functions and terminal states have occasionally been altered. Positive rewards generally tend to encourage behaviour, while negative rewards tend to discourage it. A difficulty is thus to determine how much, and whether, behaviour should be encouraged or discouraged, e.g. whether not achieving the goal should be penalized with a negative reward, or left with a reward of 0, resulting in no difference in behaviour.

Penalizing episodes in which the goal was not achieved might speed up the learning process by preventing behaviour that did not result in a reward from being carried out again in future time steps. However, this might also mean that behaviour that was close to being a good strategy but did not result in a reward is less likely to be explored in future time steps, which could be counterproductive.

Altering the terminal states might influence the distance between actions and their rewards. If the rewards are backpropagated at the end of an episode, shorter episodes will influence a smaller number of state-action pairs and will result in more games being played. Updates might then be done with higher frequency but might be less substantial, since shorter game durations might result in less exploration within episodes.
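One convenient way to run such variations is a Gym wrapper around the environment. The sketch below is illustrative (the flag names and the specific modifications are assumptions, not the thesis code), showing a version without penalization and with a fixed frame-count terminal state.

```python
import gym

class RewardAndTerminalWrapper(gym.Wrapper):
    """Illustrative wrapper for experimenting with altered rewards and terminal states."""
    def __init__(self, env, penalize=False, max_frames=2400):
        super().__init__(env)
        self.penalize = penalize
        self.max_frames = max_frames
        self.frames = 0

    def reset(self, **kwargs):
        self.frames = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames += 1
        if not self.penalize and reward < 0:
            reward = 0.0                  # leave undesired behaviour unrewarded rather than penalized
        if self.frames >= self.max_frames:
            done = True                   # alternative terminal state: fixed game duration
        return obs, reward, done, info
```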


6 Experiments

The GA3C algorithm has been applied to the aiming task and to multiple versions of the control task. The measurements focus on rewards as a function of game frames, to make robust comparisons that are independent of the system or machine on which they are run. Under the assumption that hyperparameter optimization is largely an empirical process that requires exhaustive testing (which is outside the scope of this thesis), the parameter values have been left unchanged from the default implementations. Tables with the parameter settings for DQN and GA3C can be found in Appendices 2 and 3, respectively.

Experiments have been done on two different computers. An overview of the tasks, running times and systems can be found in Table 1.

Table 1: Tasks and system attributes

Task                                  Time      System
Aiming task (GA3C)                    16 hours  octa-core CPU
Aiming task (DQN)                     1 day     octa-core CPU, GPU
Control task (frictionless 1, GA3C)   6 days    octa-core CPU
Control task (frictionless 1, DQN)    3 days    quad-core CPU, GPU
Control task (frictionless 2, GA3C)   3 days    quad-core CPU
Control task (frictionless 2, DQN)    3 days    quad-core CPU, GPU
Control task (grid, GA3C)             4 days    quad-core CPU
Control task (grid, DQN)              1 day     octa-core CPU, GPU

6.1 The Aiming Task

During the DQN experiments, the aiming task proved to be the least complex task and learning behaviour appeared relatively fast. The same behaviour was found when GA3C was applied to this task.

The terminal state in this experiment was reached after a duration of 2400 frames. A shot square resulted in a positive reward of 1 and a missed square in a negative reward of 1.

At peak performance, the average score reached 80-90 per episode, which means that, on average, the agent shot a square every 26-30 frames. A demo of the agent playing the aiming task can be seen at https://www.youtube.com/watch?v=_hVr6dpJmTU. The demo shows that as soon as a square appears on the screen, the space ship stops shooting, turns in the direction of the square along the shortest angle and shoots a few times. It also shows that after a square has been shot, the space ship repositions so that its nose points at the upper left corner during the interval between squares, and then starts spraying shots in that direction. It is thought that perhaps most squares spawn in this area, so the agent tries to predict where a square is most likely to spawn next. The performance can be found in Figure 8.

Figure 8: Performance of GA3C and DQN on the Aiming task

6.2 Control Tasks

Experiments have been performed with multiple instances and settings of the control task. These variations include grid and frictionless control, different rewards and different terminal states.

6.2.1 Frictionless Control

A first experiment was done in which the terminal state was set to a duration of 2400 frames. When a mine was shot, a positive reward of 1 was awarded. A mine that was not destroyed during its lifetime was penalized with a negative reward of 1. The goal was thus to shoot as many mines as possible (and, consequently, to avoid missing mines while they were alive). As can be seen in Figure 9, the GA3C algorithm did not seem to exhibit any increase in learning behaviour, and DQN seemed to outperform GA3C on this task during their training time. GA3C seems to have found a strategy that initially performs better than DQN, but shows no further increase in learning thereafter, as opposed to DQN.

Figure 9: Performance of GA3C and DQN on the control task with frictionless control and 2400-frame game duration

The fact that DQN performed better than GA3C under these conditions might be attributed to the chosen reward function and to the fact that DQN is value-based as opposed to policy-based. An example might demonstrate the consequences of this problem more clearly. With the current game settings, the minimal score that can be reached is -16. The average reward of GA3C remained stationary around a negative score of 8, which means that in these games it was able to collect 8 mines and missed out on another 8 mines. With the current game settings, this demonstrates that missing mines is penalized more heavily than collecting mines is rewarded. While intuitively one would say that collecting 8 mines is behaviour that should be encouraged, this behaviour is now discouraged.

When a negative reward is encountered, Q-learning will mostly discount the value of the specific Q(s, a) at that timestep, while the discount on other state-action pairs is limited and only influenced indirectly (Mnih et al., 2016).

However, GA3C is largely policy-based, which might have the result that when a negative reward is encountered at the end of an episode, the backpropagation causes a more direct discouragement of all state-action pairs that were used during that episode (see section 4.2). The fact that GA3C is policy-based might thus result in this potentially good strategy being discouraged more heavily than it would be with the DQN algorithm.

A second experiment was thus devised in which the reward function and terminal states were altered. A missed mine was no longer penalized with a negative reward of 1. Collected mines were still awarded a positive reward of 1. The terminal state was reached either when any mine was missed or when 6 mines were collected. With these settings, the minimum score that can be achieved is 0, while the maximum score is +6. The performance of GA3C and DQN on this task can be seen in Figure 10.

Figure 10: Performance of GA3C and DQN on the control task with frictionless control without penalization

On this task GA3C seems to perform slightly better, which is in line with the idea that negative rewards might cause heavier discouragement for GA3C than for DQN. Conversely, policy-based methods might also cause greater encouragement of actions when a positive reward is encountered, compared to the value-based DQN (Mnih et al., 2016).

A demo of the performance of GA3C after approximately half of the training time can be seen at https://www.youtube.com/watch?v=i2kEEfqRT1M, and a demo of its performance at the end can be seen at https://www.youtube.com/watch?v=eOHv3QnG0a0. The demos show that the agent did learn some strategy. However, still no significant increase in learning has been observed. This might be an indication that this level of higher-order control is too challenging for the A3C architecture.

6.2.2 Grid Control

The GA3C architecture was tested on the control task with grid control, which is considered a less complex task than the frictionless control task and requires a lower order of control. The rewards and terminal states were the same as during the DQN attempts. The terminal state was reached whenever any mine was missed, or when 3 mines had been collected. A missed mine results in a reward of -1, while a collected mine results in a reward of +1. With these settings, the minimum score that could be achieved is -1, while the maximum score is +3. The performance of this run can be found in Figure 11.

Figure 11: Performance of GA3C and DQN on the control task with grid control

The agent seemed to learn a certain ineffective strategy, in which it would traverse the screen diagonally in order to cover a greater and more constant surface of the screen than it would by remaining stationary or traversing the screen randomly. With this technique, the agent was usually able to collect one or a few mines. A demo of this can be seen at https://www.youtube.com/watch?v=TBn7AQRynfU&t=9s.

In this experiment, DQN seemed to significantly outperform GA3C. Grid control and similar types of control are also required in other Atari games, such as Gravitar, Stargunner and Time Pilot, in which A3C was shown to outperform DQN (Mnih et al., 2016). This indicates that the performance of GA3C relative to DQN on this control task is not necessarily explained by a level of control that is too challenging for the technique. Instead, the performance might again be explained by penalization affecting GA3C more strongly than DQN.

7 Conclusions and Discussion

The current experiments showed that the A3C algorithm is able to learn lower-order control tasks up to a proficient level and with better performance than the DQN algorithm, as was shown in the aiming task experiment.

However, it was not able to learn tasks more complex than the aiming task, as was shown in the control task experiments. The grid control task is closely related to the aiming task, as they both require the space ship to change its orientation towards a certain target. The aiming task then requires just a single shot or a few shots, while the grid control task requires a set of forward moves. They thus share similar conceptual subtasks, although the grid control task requires more actions and, moreover, while the location of the space ship is fixed in the aiming task, the grid control task requires that the agent learn to aim at any position on the screen.

However, in all control tasks the agent was able to learn some strategies that prevented it from achieving the minimal scores in these games. Certain strategies were observable but had very low effectiveness. Learning in higher-order control tasks under the conditions described in this thesis is thus observable, but not to a substantial or significant extent when compared to DQN.

This is likely due to the fact that policy-based methods tend to encourage or discourage actions more strongly based on the received rewards than DQN does, which did not prove beneficial when penalization is more prominent than rewarding, as was described in section 6.2.1. Despite the fact that DQN outperformed A3C on the simpler grid control task, there is some evidence that A3C performance might be increased when the reward function is adjusted accordingly. In both tasks that did not use penalization, the A3C algorithm seemed to outperform DQN, as was demonstrated in the aiming task and the second frictionless control task. Future work might therefore focus on running experiments in which penalization is less prominent and is only used for specifically bad behaviour.

The results further show that the GA3C algorithm often seemed to stabilize around a score indicating that it did learn a certain suboptimal strategy. This might be an indication that the algorithm got stuck in a local optimum, and thus might require more exploratory behaviour (see section 4.1.1). As explained in section 6, no experiments have been done with adjusted hyperparameter settings. It is speculated that adjustments to the parameter settings that involve the exploration policy might reduce the effects of local optima.

In the current thesis, the running times for training the architecture were limited (see section 6). A possible consequence is that the architecture might not have had enough time to converge (assuming that convergence is possible within a feasible time). A future direction is thus to increase the training time when some learning has been observed.

Finally, in the case of the Space Fortress game, there are some fundamental difficulties in attempting to classify tasks into orders of control. In the current thesis, the tasks were classified by means of the results from the DQN experiments and roughly estimated numbers of actions needed to meet an objective. However, these estimates need not be accurate and might in fact differ between humans and reinforcement learning agents.

This means that the tasks have been classified relative to each other with respect to order of control, with no objective quantification of order of control being given. However, the levels of complexity found in the DQN experiments for these games might confirm that the classification is correct. In the current thesis, no substantial learning was found in higher-order control tasks, but there is some evidence that a limited degree of learning did take place.


References

M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz. Reinforcement learning through asynchronous advantage actor-critic on a GPU. In ICLR, 2017.

Y. Li. Deep reinforcement learning: An overview. CoRR, abs/1701.07274, 2017. URL http://arxiv.org/abs/1701.07274.

Long-Ji Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Pittsburgh, PA, USA, 1992. UMI Order No. GAX93-22750.

A. Mané and E. Donchin. The Space Fortress game. Acta Psychologica, 71(1):17–22, 1989.

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop. 2013.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. URL http://dx.doi.org/10.1038/nature14236.

V. Mnih, A. Puigdomenech Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. ArXiv preprint arXiv:1602.01783, 2016.

O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. Bridging the gap between value and policy based reinforcement learning. CoRR, abs/1702.08892, 2017. URL http://arxiv.org/abs/1702.08892.

T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. CoRR, abs/1511.05952, 2015. URL http://arxiv.org/abs/1511.05952.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.


Appendices

Appendix 1: Overview of Gym Environment Properties

Property            Meaning
n_actions           Number of actions to choose from
action_set          A dict of keys that can be used for the game
terminal_state      A boolean that is true when the game has ended
action_space        The dimensions in which an action may occur
observation_space   The dimensions in which an observation may occur
step                A function that, given a chosen action, returns the reward after having done the action, the next frame, whether a terminal state has been reached, and (optionally) how many lives are left
render              A function that visualises every observation (not necessary for calculations)
reset               A function that resets the game and re-initializes all values
close               A function that exits the game


Appendix 2: Parameter Values for DQN

Parameter          Value
env_params         "useRGB=false"
agent              "NeuralQLearner"
n_replay           1
update_freq        4
actrep             3
discount           0.99
seed               1
learn_start        50000
pool_frms_type     "\"max\""
pool_frms_size     2
initial_priority   "false"
replay_memory      1000000
eps_end            0.1
eps_endt           replay_memory
lr                 0.00025
agent_type         "DQN3_0_1"
preproc_net        "\"net_downsample_2x_full_y\""
state_dim          7056
ncols              1
agent_params       "lr="$lr", ep=1, ep_end="$eps_end", ep_endt="$eps_endt", discount="$discount", hist_len=4, learn_start="$learn_start", replay_memory="$replay_memory", update_freq="$update_freq", n_replay="$n_replay", network="$netfile", preproc="$preproc_net", state_dim="$state_dim", minibatch_size=32, rescale_r=1, ncols="$ncols", bufferSize=512, valid_size=500, target_q=10000, clip_delta=1, min_reward=-1, max_reward=1"
steps              50000000
eval_freq          250000
eval_steps         125000
prog_freq          10000
save_freq          125000
gpu                0
random_starts      5
pool_frms          "type="$pool_frms_type",size="$pool_frms_size
num_threads        4


Appendix 3: Parameter Values for GA3C

Parameter                       Value     Note
AGENTS                          32
PREDICTORS                      2
TRAINERS                        2
DEVICE                          'cpu:0'
DYNAMIC_SETTINGS                True
DYNAMIC_SETTINGS_STEP_WAIT      20
DYNAMIC_SETTINGS_INITIAL_WAIT   10
DISCOUNT                        0.99
TIME_MAX                        5
REWARD_MIN                      -1
REWARD_MAX                      1
MAX_QUEUE_SIZE                  100
PREDICTION_BATCH_SIZE           128
STACKED_FRAMES                  4
IMAGE_WIDTH                     84
IMAGE_HEIGHT                    84
EPISODES                        400000    Sometimes increased to 1000000
ANNEALING_EPISODE_COUNT         400000    Sometimes increased to 1000000
BETA_START                      0.01
BETA_END                        0.01
LEARNING_RATE_START             0.0003
LEARNING_RATE_END               0.0003
RMSPROP_DECAY                   0.99
RMSPROP_MOMENTUM                0.0
RMSPROP_EPSILON                 0.1
DUAL_RMSPROP                    False
USE_GRAD_CLIP                   False
GRAD_CLIP_NORM                  40.0
LOG_EPSILON                     1e-6
TRAINING_MIN_BATCH_SIZE         0
SAVE_FREQUENCY                  1000
MIN_POLICY                      0.0
