Deep Multi-Agent Reinforcement Learning in Swarm Robotics


Deep Multi-Agent Reinforcement Learning in swarm robotics

David Wessels, 11323272

Bachelor thesis, Credits: 18 EC
Bachelor's Programme in Artificial Intelligence, University of Amsterdam
Faculty of Science, Science Park 904, 1098 XH Amsterdam

Supervisor
drs. Anthony (Toto) van Inge, Informatics Institute, Faculty of Science, University of Amsterdam, Science Park 904, 1098 XH Amsterdam

June 29th, 2018

Bachelor Artificial Intelligence

Deep Multi-Agent Reinforcement Learning in swarm robotics

David Wessels
11323272

University of Amsterdam
Science Park 104

July 1, 2019

Bachelor Thesis


Abstract

Have you ever stood still in the middle of the day, watching bewildered as a flock of birds wheels through the sky, slicing smoothly through the air in complex patterns without ever colliding? Such behaviour is hard not to find fascinating. It is called emergent behaviour, because combining individuals that are not very intelligent on their own produces extra complexity at the level of the group. This bachelor thesis investigates the capabilities of reinforcement learning, and in particular Q-learning, for teaching an agent to navigate through a door-way in a wall. Q-learning is a learning method that finds optimal policies for any finite Markov Decision Process, so by formulating drone navigation through a door-way as a Markov Decision Process, the agent is expected to converge to an optimal policy. To find this optimal policy, Q-learning uses the Q-function to search for the maximum state-action value. To cope with the large state-space of our application domain, deep learning is used as a function approximator of the Q-function. This thesis shows that Q-learning converges to an optimal policy; however, training the model can take a long time.


Contents

1 Introduction
2 Theoretical background
  2.1 Reinforcement Learning
    2.1.1 Markov Decision Process to model the environment
    2.1.2 From value function to Q-learning
  2.2 Deep Reinforcement Learning
    2.2.1 Deep Learning in its original form
    2.2.2 Combining Q-learning with neural networks
    2.2.3 Improvements to the Deep Q-network for better performance
    2.2.4 Different optimisation algorithms
3 Method and Implementation
  3.1 A grid world to act upon
  3.2 Deep Q-Network
    3.2.1 Epsilon-greedy action rule
    3.2.2 Fixing the targets to stabilise learning
    3.2.3 Experience Replay for a true data distribution
  3.3 Input Vector and Feature Transformation
  3.4 The reward structures for learning a single agent
  3.5 Expanding to a multi-agent version
4 Experiments and Results
  4.1 Testing the different reward structures
  4.2 Testing the different optimizers
  4.3 Testing the DQN in a centralised multi-agent version
  4.4 Testing on random door-ways
5 Discussion
  5.1 Single-Agent Experiments
  5.2 Multi-agent experiments
  5.3 Problems with the created model and its experiments
6 Conclusion
7 Future Research
  7.1 Extra improvement of the deep Q-network
  7.2 Guiding the agents into the real world
  7.3 From a centralised to a decentralised model
9 Appendix
  9.1 Appendix A: Path of a single drone moving through a standardised door
  9.2 Appendix B: Path of multiple drones moving through a standardised door


CHAPTER 1

Introduction

Have you ever stood still in the middle of the day, watching bewildered as a flock of birds wheels through the sky, slicing smoothly through the air in complex patterns without ever colliding? Such behaviour is hard not to find fascinating. It is called emergent behaviour, because combining individuals that are not very intelligent on their own produces extra complexity at the level of the group.

Such behaviour was an inspiration for the research domain of swarm robotics [1]. Swarm robotics uses relatively unintelligent drones to accomplish difficult tasks through collaboration. It is practical to let multiple inexpensive and relatively simple drones do very complicated tasks, or tasks which would take much time if performed by a single drone. One area of interest, for example, would be search and rescue in a cave. If one brilliant drone searched a huge cave on its own, the search would take a very long time, not because the drone lacks the intelligence to do it faster, but because it is limited by its movement speed and can only explore one possible path at a time. If such a problem were tackled with a swarm of drones, multiple routes could be explored at the same time, which would make the search much faster, especially if the drones could efficiently communicate their observations to each other.

This paper is part of a research project that focuses on swarm robotics and emergent behaviour, and which tries to achieve autonomous behaviour within a group of drones. When building a swarm, a few questions arise. For example, will the individuals work as a group? And if so, do leaders exist in this group? Both questions concern the hierarchical structure of the group. This thesis searches for an approach in which the drones have a form of individuality, built up from a set of basic rules. Individuality implies that there is no hierarchical structure in the group and that every drone acts on the basis of its own reasoning; consequently, there are no leaders. A system without leaders can also be called a decentralised system, because it has no central component directing the other components.

A similar effect can be found in nature, in different forms, where multiple individuals work together to accomplish difficult jobs. Individual ants, for example, have a low level of intelligence and perform only a few actions; however, a much higher level of intelligence is said to be achieved through the combination of many individuals' simple actions, mostly by trial and error. This higher level of intelligence or behaviour that arises from multiple not-so-intelligent individuals is called emergent behaviour. This research tries to achieve a similar kind of behaviour, arising from the basic rules given to the drones. The drones, as individuals, are not particularly intelligent; however, by combining swarm robotics with emergent behaviour, the goal is that the behaviour becomes more intelligent once the drones start to work together. The behaviour sought in this paper deviates from standard emergent behaviour because it is not obtained from a simple set of rules, but with reinforcement learning. Because the drones are trained with reinforcement learning, a drone can also be referred to as a software agent, or in short, an agent. In computer science terms, a software agent is a computer program which acts on behalf of someone or something else; in our application, the agent acts on behalf of the drone.


Regular reinforcement learning would not be sufficient for this problem, given the requirement that multiple agents have to act in the same environment without a central component in the system. Normal reinforcement learning trains one agent to perform a task, so when it is applied to a swarm of drones, one agent would learn to control the entire swarm, which means the swarm is controlled by a central component. Because of the time frame in which this research has to be done, the focus is nevertheless on such a centralised system. A decentralised system requires another form of reinforcement learning, in which every agent is trained individually. This form is called multi-agent reinforcement learning (MARL), which is discussed briefly in the next paragraph and in chapter 7, Future Research.

This thesis uses Q-learning as its reinforcement learning model. Q-learning finds optimal policies for Markov Decision Processes. Values for every state-action pair are kept in a Q-table, and this table shows which action is best in every state. However, because the state-space of the application presented in this thesis is large, the Q-table is replaced by a neural network, which operates as a function approximator. All of this is elaborated in the theoretical background and method. The task the Q-learning model has to learn in this thesis is navigating drones through a door-way in a wall.

Research into swarming began with Craig Reynolds, who made the first computer simulation of the aggregate motion of a flock of birds [2]. He achieved this behaviour through the dense interaction of relatively simple individual agents, whose behaviour was built on simple rules, called the flocking rules. Even though Reynolds developed his flocking rules some time ago, the research area of swarming is still young: many issues remain unsolved, and most research has focused only on parts of the problem [3]. Moreover, most research has been done with robots acting on the ground [4], or only in simulation [5]. Multi-agent reinforcement learning (MARL) is another approach for letting multiple agents collaborate while acting individually [6, 7]. In such a system, agents are typically trained within a Markov Decision Process in which they try to maximise their rewards simultaneously and independently. [6] has proven that its MARL algorithm, which uses Q-learning, finds optimal policies in deterministic environments. However, to cope with continuous state and action spaces, function approximators can be used to approximate the Q-function, which is needed to predict the best action for an agent. Using a function approximator is no guarantee of finding the optimal policy: performance decreases as the distance between the target function and the chosen linear space increases [8]. Nevertheless, it has been shown in multiple application domains, of ever-increasing complexity, that reinforcement learning works well with function approximators. One example is Deep Mind's use of neural networks as function approximators for learning Atari games with Q-learning [9].

Another solution for controlling swarms and navigating through a door can be found in Artificial Physics. Bartels [10] investigates this research field and uses force fields for creating attracting and repelling forces. These forces pull the drones into a formation and push them away from objects; an additional goal force field guides the swarm to its goal. An advantage of this approach is that the swarm is controllable while the drones still calculate everything in a decentralised manner. A disadvantage of using it for navigation through doors is that it does not always push the drone onto an optimal path.

Research into deep reinforcement learning in a drone application is useful because it offers more insight into teaching individual agents to act together and perform intelligent tasks. It could also be an improvement over emergent behaviour based purely on a set of rules, since it offers more variety in the tasks that can be learned. Furthermore, the application domain of drones is growing fast, so others could build on this thesis once it is finished.

The question asked in this research is whether drones can learn to navigate through a door, as a swarm, with deep multi-agent reinforcement learning. The expectation is that a deep multi-agent reinforcement learning approach will let a swarm of drones navigate through a door while acting individually and without colliding into any object. This expectation comes from the fact that multiple papers have shown that agents, in single- and multi-agent systems, can learn a variety of tasks [11, 12]. In this research, the main question is divided into three sub-questions. First, can a Deep Q-network teach a single drone to navigate through a door?


Second, is such a network able to teach multiple drones to navigate through a door? Third, are the single- and multi-agent applications able to learn the same task with random holes in the wall? Because reinforcement learning has proved itself useful in multiple application domains, we predict that it will prove itself useful in our application domain too, especially for the single-agent version. The biggest challenge lies in the multi-agent version, because the state- and action-space become much bigger.


CHAPTER 2

Theoretical background

In this chapter, the theoretical background of our research is presented. Each section discusses one of the main parts used in the application built for this bachelor thesis. First the learning algorithm is discussed, and after that the neural network which is used as a function approximator. Some information will be redundant for experts in the field; it is nevertheless included to make the thesis self-contained, also for readers with little or no knowledge of the subject.

2.1

Reinforcement Learning

Reinforcement learning (RL) is one of the paradigms of machine learning, alongside supervised and unsupervised learning. RL is a form of machine learning in which artificial agents learn to act in an environment based on a reward structure [13]. It differs from supervised and unsupervised learning in that it does not learn from a fixed training set; instead it searches for optimal policies by finding a balance between exploration and exploitation. With exploration, the agent makes a random move; with exploitation, the agent takes the best move based on what it has already learned. By trying different moves the agent gets closer to its goal. The environment in which the agent has to learn is usually expressed as a Markov Decision Process (MDP). With RL, an agent learns to do a particular task over multiple episodes. An episode is one run of the game or task the agent has to perform. After an episode, the agent learns and changes its parameters based on the rewards it received in that episode. Applications of RL show that such algorithms can learn multiple games from just a simple score, as was done for Atari games [9].

2.1.1

Markov Decision Process to model the environment

A Markov Decision Process (MDP) [14] is a collection of five items, $\langle S, A, T, R, \gamma \rangle$. It contains two sets: the set of all states $S$ and the set of all possible actions $A$. Furthermore, a state-transition function defines the effect of each action on the environment. This function is stated in formula (2.1) below.

$T : S \times A \to PD(S)$  (2.1)

$PD(S)$ represents the set of probability distributions over $S$, one for each action, which is implemented as a dictionary containing the state-transition probabilities. Besides a state-transition function, an MDP also contains a reward function, stated in formula (2.2) below.

$R : S \times A \to \mathbb{R}$  (2.2)

This function allocates a reward to every state-action pair. Lastly, an MDP contains a discount factor γ, with 0 < γ < 1.


The set of actions represents the moves an agent or UAV can make within the environment; this set can differ between states. An agent receives a reward for doing an action in a particular state, and these rewards get higher as the agent gets closer to the goal. The discount factor is an intuitive factor that comes from psychology: we would rather get $10 now than $100 after ten years. In MDPs, the discount factor controls how much future rewards are taken into consideration when choosing actions. Choosing γ small emphasises short-term gains, while choosing it larger gives higher weight to long-term rewards. The discount factor is therefore convenient, since it is better not to look too far into the future, because those states are harder to predict. The goal of an agent is to maximise the expected sum of discounted rewards during the MDP, as stated in formula (2.3) below. In this formula, $r_{i,t+j}$ stands for the reward received $j$ steps into the future for agent $i$.

$E\left\{ \sum_{j=0}^{\infty} \gamma^{j} r_{i,t+j} \right\}$  (2.3)

The dictionary with state-transition probabilities is needed because real-life environments are not deterministic. Since real life is not deterministic, observations of the state can be imperfect, and therefore only partial knowledge is available to reason with. Another form of randomness comes, for example, from the motion of objects that has to be included in the decision process; this motion cannot be determined when observations rely on images alone, which could be the case in a real-life situation. Therefore, probabilities over state transitions are needed to be able to determine which action provides the highest reward in the current state. Markov Decision Processes are also convenient when a possible extension to a multi-agent system is kept in mind, since MDPs can be expanded to multi-agent processes in a cooperative or coordinative manner [15].
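To make the discounted-return objective of formula (2.3) concrete, the short Python sketch below computes the discounted sum for an example reward sequence; the reward values and the discount factor are made up purely for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^j * r_j over a sequence of rewards (formula 2.3)."""
    return sum(gamma ** j * r for j, r in enumerate(rewards))

# Illustrative reward sequence: small step penalties followed by a goal reward.
rewards = [-1, -1, -1, 10]
print(discounted_return(rewards, gamma=0.9))  # -1 - 0.9 - 0.81 + 7.29 = 4.58
```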

2.1.2

From value function to Q-learning

Value functions are discussed because they are necessary to understand some decisions regarding the neural network structure, and to know what kind of learning Q-learning performs mathematically. In this subsection, the mathematical background of value functions is given. Within an MDP, rewards are probabilistic, and since the return is built up from the sum of all rewards, the return is also probabilistic. Because the returns are probabilistic, the expected return given a state can be defined. This is called the value function or V-function (2.4), which works as an estimate of the return.

$V^{\pi}(s) = E_{\pi}[G(t) \mid S_t = s] = E_{\pi}\left[ \sum_{\tau=0}^{\infty} R(t+\tau+1) \mid S_t = s \right]$  (2.4)

In the value function, $G(t)$ is the sum of discounted rewards. The function returns the value of a certain state when following a policy π, where a policy determines the best action given a particular state. Next to the V-function, there is also the action-value function. This function is also called the Q-function (2.5) and gives the value of a certain action given a state.

$Q^{\pi}(s, a) = E_{\pi}[G(t) \mid S_t = s, A_t = a]$  (2.5)

This expected value can be rewritten as an update equation for readability and implementation, as can be seen in formula (2.6).

$Q^{\pi}(s, a) = r(s, a) + \gamma V^{*}(\delta(s, a))$  (2.6)

Because formula (2.6) is an update rule, the sum is no longer needed, since every value is incorporated in the new value. The variable γ represents the discount factor in this formula. Furthermore, $V^{*}$ denotes the best value of a state when following an optimal policy, which is not known yet. This may seem unhelpful; however, $V^{*}$ can be characterised in terms of the Q-function (2.7) by taking the maximum over all Q-values given the current state and all actions.

$V^{*}(s) = \max_{a'} Q(s, a')$  (2.7)

Knowing this, it can be concluded that the Q-values can be updated in terms of Q-values only (2.8).

$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a')$  (2.8)

The Q-function is used for learning the optimal policy. Searching for the optimal policy is called the control problem, while finding the V- or Q-function for a given policy is called the prediction problem. Since the Q-function is indexed by the action, and therefore directly tells us which action to take in a given state, it is better suited to the control problem. To find the optimal policy for a state, one only needs to take the maximum over all actions.

$\pi(s) = \arg\max_{a} Q(s, a)$  (2.9)

Learning the optimal policy with RL in this way is called Q-learning. In practice, this is done by updating the Q-values every time an action is performed, as shown in formula (2.10).

$Q_{t+1}(s_t, a_t) = (1 - \alpha) \, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right)$  (2.10)

In this formula, the weighted average of the old value is combined with the newly gained information to calculate the new Q-value. The variable α represents the learning rate, $r_t$ represents the achieved reward, and γ again represents the discount factor. Melo (2001) proves that Q-learning converges to an optimal policy for any finite MDP, given infinite time [16].
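As an illustration of update rule (2.10), the sketch below applies one tabular Q-learning update using a Python dictionary as Q-table. The state encoding and the hyper-parameter values are hypothetical; this is a minimal sketch, not the code used in this thesis.

```python
from collections import defaultdict

Q = defaultdict(float)          # Q-table: maps (state, action) pairs to values
ACTIONS = range(6)              # e.g. the six grid moves used later in this thesis

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One application of formula (2.10)."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

# Example transition: state s, action a, reward r, next state s_next.
q_update(s=(0, 0, 0), a=2, r=-1.0, s_next=(0, 1, 0))
```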

2.2

Deep Reinforcement Learning

Reinforcement learning on its own has been used in a variety of application domains. When reinforcement learning is combined with deep learning architectures, RL becomes capable of scaling up to problems it previously could not tackle because of infinite or very large action and observation spaces. First, deep learning is explained on its own; afterwards, its combination with Q-learning is explained.

2.2.1

Deep Learning in its original form

Deep learning is the form of machine learning in which the methods are based on neural networks. Neural networks in their simplest form are networks built from logistic regression layers that are chained together. The layers between the input and output layers are called the hidden layers. These hidden layers make the network a non-linear function approximator, because a non-linear activation function is applied after each layer. Because of these non-linearities, a network can model highly complex functions rather than just linear ones. Each connection in a layer carries a weight of the network. To minimise the loss function, these weights are adjusted after data has passed through the network. This training of the neural network is called back propagation and is carried out by optimisation algorithms; the foundational optimisation algorithm for back propagation is gradient descent [17]. With back propagation, data is first propagated forward through the network: dot products are calculated between the input of a layer and its weights, and an activation function is applied to those sums of products to produce the layer's output. After propagating forward through the network, the error has to be propagated backwards to update the weight values according to the produced output. For updating the weights, gradient descent calculates the gradient of the error function with respect to the weights, and the parameters are then updated in the opposite direction of the gradient. The weights are updated following formula (2.11).

$\theta \leftarrow \theta - \eta \nabla J(\theta)$  (2.11)

In this formula, η is the learning rate and ∇J(θ) is the gradient of the loss function. Since RL is interested not in a single input but in sequences of inputs until an episode ends, recurrent neural networks are needed. These networks are called recurrent because at least one node loops back to an earlier node; such a connection is called a recurrent connection, and a layer containing such a node is called a recurrent layer. Because the output relies on previous inputs as well, the network has a memory of former states. This is useful when making decisions, because the agent can consider the information from every time step up to now.

2.2.2

Combining Q-learning with neural networks

Q-learning is a reinforcement learning approach that functions without a model. Within Q-learning, an agent tries to find an optimal policy, where a policy tells the agent what to do in a particular state. The agent does this by maximising the expected value of the total reward, and it uses the value function itself as the target return, as can be seen in formula (2.8). Q-values can be seen as an indicator of how good it is to take a particular action in a particular state. Q-learning does find optimal policies in controlled Markovian domains when the defined MDP is finite [16]. In traditional Q-learning, a Q-table is kept: a dictionary that maps states and actions to values in order to find the optimal policy. Such an approach is called a tabular method. However, these Q-tables or tabular methods are not practical in real-life environments, because they grow exponentially with the state-space. To scale with the state-space, the Q-table can be replaced by a function approximator, by treating the input states as feature vectors and the return as the target. Neural networks can be used as such a function approximator, and combining them with Q-learning results in a Deep Q-Network (DQN). How such a DQN acts as an agent in an environment is shown in figure 2.1.

Figure 2.1: Graphical explanation of deep reinforcement learning. Source: the towardsdatascience.com article 'Self Learning AI-Agents Part I: Markov Decision Processes'.

In figure 2.1, it can be seen that the agent takes actions in an environment. These actions are chosen by a neural network, which gets a state as input. After an action is taken, the environment returns a reward and a newly observed state, which the network uses to learn and to predict its next action.

In a DQN, a neural network is used as the function approximator for Q-learning; however, the standard neural network has to be slightly modified. The network approximates the Q-function or action-value function, but the input features do not contain state-action pairs: only states are transformed into features. Rather than calculating every value of a Q-table, the network is optimised to approximate the Q-value of each action given a state. The number of output nodes is therefore equal to the number of actions, and each output node represents one of the actions.


2.2.3

Improvements to the Deep Q-network for better performance

Because Q-learning is used in bigger state- and action-spaces, convergence to optimal policies can take a very long time. For that reason, researchers have looked for ways to improve performance. Such improvements include experience replay [18] and fixing the targets of the neural network [19]. Another important improvement of a DQN is doubling the Q-network [20]; however, this is regarded as future research.

To get a good representation of the true data distribution in the training data, Experience Replay is used. With the default Q-update, learning always proceeds in order from the beginning to the end of an episode; however, this can introduce hidden patterns or unwanted correlations. With Experience Replay, random samples are taken from past experiences, which are stored in the so-called replay buffer. The size of the buffer is considered a hyper parameter and is set to four in this research. The buffer stores four (state, action, reward, next state) tuples and is used as a queue, meaning first in, first out. To train the Q-network, a random batch is sampled from the buffer and used as training data.

Another instability comes from the fact that Q-learning is a form of Temporal Difference learning. Temporal Difference (TD) methods use the value function itself as the target return. In supervised learning, the target is always part of the training data; this is not the case with TD methods. As a consequence, when gradient descent optimisers are used in a neural network, the estimated gradient is no longer a true gradient. This estimate is called a semi-gradient and leads to instability in the network. Because the same parameters are used for selecting and estimating actions, Q-values become overestimated. The loss of the neural network is calculated between the Q-target and the estimated Q-value, and this Q-target is itself an estimate produced by the Bellman equation (2.8), so a correlation arises between the parameters used for estimating the values and the Q-target. This means that during training the Q-values shift, but so do the target values; it can be seen as chasing a moving target. As a solution, a second neural network is used, called the target network, which is used to predict the targets. The weights of the target network are not updated as often as those of the main network: after a certain number of episodes, called the copy period, the weights of the main network are copied to the target network. The lower update frequency of the parameters helps to stabilise the targets, so a steadier shift can be made in the direction of the targets before new knowledge is incorporated.

2.2.4

Different optimisation algorithms

The cost of an episode made by a neural network is calculated as the sum of the squared errors between the predicted and actual return. This squared error between the predicted and actual return is called the cost function. The cost function depends on the parameters of the model, in our case the parameters of the neural network, and is minimised with respect to those parameters. Optimisation algorithms use the gradient of this cost function to update the parameters. The most common optimisation algorithm for training neural networks is gradient descent, as discussed in section 2.2.1. The algorithms explored for this research are both improved versions of gradient descent. They are not discussed in depth, because that does not fit within the scope of this research and because they are merely tested by plugging in an algorithm from the TensorFlow library.

The first optimiser tried in this research is the AdaGrad optimiser [21]. AdaGrad is an adaptive optimisation algorithm which uses the gradient to adapt the parameters. The algorithm uses a different learning rate η for every parameter, so it can perform smaller updates for frequent and bigger updates for infrequent parameters. The main benefit of this optimisation algorithm is that there is no need to tune the learning rate manually; its disadvantage is its ever-decaying learning rate η. The second optimiser used in this research is the Adam optimiser [22], which also uses adaptive learning rates for every parameter. The difference with AdaGrad is that Adam avoids the continually decaying learning rate.
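To make the difference between the two optimisers concrete, the sketch below shows their core parameter updates in plain NumPy. It is a simplified illustration of the published algorithms [21, 22], not the TensorFlow implementation that is actually plugged in later.

```python
import numpy as np

def adagrad_step(theta, grad, cache, lr=0.01, eps=1e-8):
    """AdaGrad: per-parameter learning rates that only ever shrink."""
    cache += grad ** 2
    theta -= lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: adaptive learning rates without the permanently decaying step size."""
    m = b1 * m + (1 - b1) * grad                # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2           # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)                   # bias correction (t counts from 1)
    v_hat = v / (1 - b2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```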


CHAPTER 3

Method and Implementation

To reach the end goal of letting a swarm of drones fly through a door, the process is divided into five stages, stated below:

• Create a grid world
• Create a Deep Q-network
• Create a reward structure
• Create a simulation in which the DQN and grid world are linked
• Expand the simulation with multiple agents

These stages are needed, firstly, to create a world the agent can act upon, and secondly, to create a learning model which the agent needs to choose its actions and learn from its experiences.

3.1

A grid world to act upon

The first thing needed for this project is a world in which the drones can move and learn the task at hand. This grid world contains a representation of the world, of the drones, and of the door, together with a reward structure. All these objects are combined in a simulation, which can be seen as our Markov Decision Process, as discussed in the theoretical background.

For the world, a three-dimensional grid was chosen, implemented as a three-dimensional array. Every element of this array represents a coordinate in the world, so the world is discrete. The actions an agent can take are therefore also discrete, because they are based on the movements possible within the environment. This also makes the world easier to work with, since actions and coordinates are simply represented by integers. The actions a drone can perform in the grid world are up, down, left, right, forward and backward.

The grid world was implemented as a class, so that it can contain more information than just the grid. In the class, the drones are kept in a dictionary with the id of the drone as key and its position as value. The drones, as well as a door, are placed in the grid world. The door is made by storing a list in the world object containing all coordinates occupied by the door and the wall. These coordinates are kept in the list so that it can be checked in the training phase whether a drone collides with the door or the wall. Furthermore, a goal position is kept, which is used to calculate the reward for a move a drone made. The goal is set to be the region behind the opening of the door. The region should be big enough for a group of drones to fit in, and if all drones are in the goal region it can be concluded that the goal is achieved.
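A minimal sketch of such a grid-world class is given below. The attribute and method names, the action numbering and the way the goal is placed are assumptions made for illustration, since the thesis does not reproduce its code.

```python
import numpy as np

class GridWorld:
    """Discrete 3-D world holding the drones, the wall with a door, and a goal region."""

    def __init__(self, size=20):
        self.grid = np.zeros((size, size, size), dtype=int)   # the 3-D coordinate array
        self.drones = {}           # drone id -> (x, y, z) position
        self.wall_cells = []       # coordinates occupied by the wall and door frame
        self.goal = (size // 2, size - 2, size // 2)  # assumed point in the region behind the door

    def move(self, drone_id, action):
        """Apply one of the six discrete moves to a drone."""
        moves = {0: (0, 0, 1), 1: (0, 0, -1),    # up, down
                 2: (-1, 0, 0), 3: (1, 0, 0),    # left, right
                 4: (0, 1, 0), 5: (0, -1, 0)}    # forward, backward
        x, y, z = self.drones[drone_id]
        dx, dy, dz = moves[action]
        self.drones[drone_id] = (x + dx, y + dy, z + dz)

    def collided(self, drone_id):
        """True when the drone occupies a wall or door-frame cell."""
        return self.drones[drone_id] in self.wall_cells
```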


Besides the objects and the grid itself, some other functions were added to the simulation, for example to visualise the process. One of these functions plots the grid world; a visualisation can be seen in figure 3.1. The figures show a 20x20x20 grid world. In plot (a) one drone is placed, and in plot (b) four drones. Furthermore, both plots contain a door with a height of 6 units and a width of 4 units.

(a) Single agent (b) Multiple agents

Figure 3.1: Visualisation of the grid world with a standard door. In plot (a) a single-agent version is shown and in plot (b) a multi-agent version.

To extend the model, a function was implemented which does not create a standard door, but a random hole in the wall. The standardised door can be seen as one particular hole in the wall, so the random hole is simply a generalisation of the original door. This generalisation is made because training a model on a single standard door causes the model to overfit on the features of that particular door: it could simply learn an optimal path for navigating through the standardised door. When that happens, the model will not be able to navigate through another door, because the learned path no longer brings the agent to its goal. By using random holes, the model hopefully learns to use the features as a heuristic. However, the learning problem does become harder, because the state-space grows much larger due to the different holes. Figure 3.2 shows an example of the grid world with such a random hole.

Figure 3.2: Visualisation of the grid world with a random hole placed in the wall.

3.2

Deep Q-Network

The first network implemented is a slightly modified version of the DQN that Deep Mind used to learn the Atari games. It differs in that it does not contain any convolutional layers; convolutional layers are not needed in our application, because the input to our neural network does not consist of images.


The implemented DQN is a neural network with an input layer the same size as the feature vector. This feature vector contains all features the agent or model may rely on; more is explained about it in section 3.3. The output layer has the same size as the number of actions of the MDP created to simulate the learning goal; these actions are stated in section 3.1. The input size D can be calculated with formula (3.1) below:

$D = 3N + 5$  (3.1)

In this formula, N represents the number of agents, and the five represents the extra features that are added, which are explained in section 3.3. The number of actions within our MDP is six, because of the moves possible in the chosen discrete grid, so the output layer of the neural network also consists of six nodes. Between the input and output layer there are three hidden layers of 200 nodes each; these numbers were chosen arbitrarily. The network is built in TensorFlow.
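The sketch below expresses this architecture (input size D = 3N + 5, three hidden layers of 200 nodes, six output nodes) in the current tf.keras API. The layer sizes come from the text, but the ReLU activations, the mean-squared-error loss and the use of tf.keras instead of the original TensorFlow 1 code are assumptions.

```python
import tensorflow as tf

def build_dqn(num_agents: int, num_actions: int = 6):
    """Three hidden layers of 200 units and one linear output per action."""
    input_dim = 3 * num_agents + 5            # formula (3.1): D = 3N + 5
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(200, activation="relu"),
        tf.keras.layers.Dense(200, activation="relu"),
        tf.keras.layers.Dense(200, activation="relu"),
        tf.keras.layers.Dense(num_actions),   # one Q-value per action
    ])

model = build_dqn(num_agents=1)
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
```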

The different optimisation algorithms, AdaGrad and Adam, are not discussed further in this chapter; they are simply taken from the TensorFlow library and set as the training optimiser to minimise the cost function. To obtain a good data distribution and to handle semi-gradients, improvements were made to the neural network. These problems were discussed in the theoretical background, section 2.2.3, and the implementation of the improvements is discussed in subsections 3.2.2 and 3.2.3. First, however, the action selection rule is discussed, which is used to cope with the exploration-exploitation problem.

3.2.1

Epsilon-greedy action rule

To handle the exploration-exploitation problem in our RL approach, the epsilon-greedy approach was chosen, mostly because it is the most common one. This action selection rule is added to the DQN to balance exploratory and exploitative moves while sampling actions. This balance is sought because an agent needs to explore enough paths to find optimal policies, yet it still needs to exploit its knowledge, because pure exploration can lead to bad solutions and wasted time. The epsilon-greedy choice rule used to find this balance is stated in formula (3.2) below.

$\text{action} = \begin{cases} \text{random action} & \text{if random number} < \epsilon \\ \arg\max_{a} Q(s, a) & \text{otherwise} \end{cases}$  (3.2)

With epsilon-greedy, a random number is generated and compared with epsilon. If the random number is below epsilon, a random action is chosen; choosing a random action is an exploratory move. Exploratory moves are made to find new paths: by making a purely random move, new paths can be discovered and the consequences of actions in a particular state are observed. If the random number is not below epsilon, the action with the maximum Q-value is chosen. Such a move is an exploitative move, because it depends on what the agent has learned so far; the Q-values are updated after every episode, so the agent exploits what it learned in its former moves and episodes. There are other algorithms for balancing exploration and exploitation, but epsilon-greedy is very easy to implement and the most commonly used in practice. Using this rule does add an extra hyper parameter to our model: the way epsilon is computed over the course of training can influence the learning curve. Epsilon is calculated with a formula based on the number of episodes completed so far. It is useful if, at the beginning of training, exploratory moves happen more often than exploitative ones, because little knowledge has been gained yet and acting upon that knowledge is less useful; exploring for new knowledge is more valuable at that stage. How fast epsilon shrinks can be varied by using different functions to calculate epsilon for an episode. The function used in this thesis is stated in equation (3.3).


However, because this formula is regarded as a hyper parameter, it could be investigated whether other formulas would result in faster convergence to an optimal policy.
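A minimal sketch of the epsilon-greedy rule (3.2) is given below; the `model` argument is assumed to be a Keras-style network such as the one sketched in section 3.2. The decay schedule in `epsilon_for_episode` is an assumption: formula (3.3) itself is not reproduced in this text, and the 1/sqrt(n) form is chosen only because it matches the value of roughly 0.0996 after 100 episodes mentioned in the discussion.

```python
import numpy as np

def epsilon_for_episode(n):
    """Assumed decay schedule; the thesis' own formula (3.3) is not reproduced here."""
    return 1.0 / np.sqrt(n + 1)

def choose_action(model, state, epsilon):
    """Epsilon-greedy rule (3.2): explore with probability epsilon, otherwise exploit."""
    if np.random.rand() < epsilon:
        return np.random.randint(6)                        # exploratory move
    q_values = model.predict(state[np.newaxis, :], verbose=0)[0]
    return int(np.argmax(q_values))                        # exploitative move
```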

3.2.2

Fixing the targets to stabilise learning

As stated in the theoretical background, Q-learning is a form of Temporal Difference learning. Because Temporal Difference methods use the value function itself as the target return, this can lead to instability when training the network used as function approximator. As a solution, an extra neural network is used, called the target network, which is used to create the targets used in training. The target network is simply a copy of the main Q-network, but it is not updated as often. Updating the target network is done by copying the weights of the main network into the target network. In this research, the copy period is set to 100 episodes; the copy period is the number of episodes to wait before copying the parameters.

The target network is used for predicting the targets, so the predictions are based on the same parameters for 100 episodes. Learning does not stop, however, because all experiences are processed by the main network, whose parameters are adjusted every time a new experience is gained. After 100 episodes have passed, the main Q-network is copied into the target network to get updated parameters for the predictions. The lower update frequency of the parameters helps the targets to stabilise. This solution was used by [19] to improve the performance of their algorithm, together with experience replay, which is discussed in the next section.
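The fixed-target mechanism can be sketched as follows; the 100-episode copy period comes from the text, while the function and variable names are hypothetical.

```python
def maybe_update_target(main_model, target_model, episode, copy_period=100):
    """Copy the main network's weights into the target network every copy_period episodes."""
    if episode % copy_period == 0:
        target_model.set_weights(main_model.get_weights())
```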

3.2.3

Experience Replay for a true data distribution

Besides the network architecture, Deep Mind's replay buffer, called experience replay, is also implemented. The replay buffer is used to get a better representation of the true data distribution during training. The buffer is implemented as a dictionary which stores (state, action, reward, next state) tuples and is used as a queue: when a new experience arrives, the oldest one is popped from the buffer. In the training phase, a random mini-batch is sampled from the replay buffer and used as training data. In Q-learning, the Q-values are updated after every step. Because the buffer is empty at the start of training, it is first filled with random experiences before training begins.
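A minimal sketch of such a replay buffer is given below, using a double-ended queue for the first-in-first-out behaviour; the buffer size of four is the value stated in the text, and the batch-size handling is an assumption.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO store of (state, action, reward, next_state) experience tuples."""

    def __init__(self, max_size=4):           # buffer size used in this thesis
        self.buffer = deque(maxlen=max_size)  # oldest experience is dropped automatically

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Random mini-batch used as training data for the Q-network."""
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```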

3.3

Input Vector and Feature Transformation

As a first approach, the grid world itself was fed to the neural network as the observation for the Q-learning algorithm. This seemed the easiest option in the beginning and, moreover, gives the agent a complete overview of the environment it has to act upon. However, it resulted in errors when the neural network had to predict the returns or Q-values for the actions. The cause was that the input layer of the neural network could not cope with the three-dimensional representation of the grid structure. The network used here comes from a tutorial in which two games, Cart-Pole and Mountain Car, were tackled with a simpler version of Google's DQN; those observations were represented as one-dimensional arrays. The actual network used by Deep Mind uses two-dimensional convolutional layers, so it did not seem an immediate solution to the problem.

To tackle the problem, two solutions seemed possible: change the network structure, or change the input fed to the neural network. A feature vector seemed the best and certainly the easiest solution. Moreover, changing the neural network would require more in-depth knowledge of deep learning, which, given the time left, did not seem a convenient option.

To create a feature vector from the existing grid world, a FeatureTransformation object was built. It takes a grid world as input and produces a feature vector by calculating the features and putting them in a list. Five features are calculated in this object: the Euclidean distance from the current position to the goal position, and the Euclidean distances from the current position to the four corners of the door. The transform function of the object takes the grid world as argument and extracts all the information needed to create the feature vector. For the multi-agent version, this feature vector is expanded with the extra coordinates of the drones, and possibly the distances between the current drone and the others.
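A sketch of such a feature transformation is shown below. It computes the five features named above (distance to the goal and to the four door corners) and appends the coordinates of all drones, matching D = 3N + 5; the attribute names follow the earlier grid-world sketch and are assumptions.

```python
import numpy as np

def transform(world, drone_id, door_corners):
    """Build the feature vector fed to the DQN for one (controlled) drone."""
    pos = np.array(world.drones[drone_id], dtype=float)
    features = [np.linalg.norm(pos - np.array(world.goal))]                 # distance to goal
    features += [np.linalg.norm(pos - np.array(c)) for c in door_corners]   # 4 door corners
    for other_pos in world.drones.values():                                 # 3 coordinates per drone
        features.extend(other_pos)
    return np.array(features)
```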

3.4

The reward structures for learning a single agent

The reward structure needs to be generative, because the agents have to rely on the features alone. Generative in the sense that the structure must be set up so that all measurements needed to score a certain action can also be made in a real-life environment. In simulations, and especially in discrete grids, information is available that is not at hand, or very hard to obtain, in the real world. For example, laying an attractive force field in the direction of the goal is hard to do in a generative way. Likewise, a highly complex, hand-crafted model that automatically steers the agent towards the goal is not a desirable reward structure: if such a complex model were used, the benefits of RL would be lost. A generative approach uses a simple reward structure and lets the RL model figure out the rest.

A reward function that is generative and applicable in the real world is one where the rewards are based on the Euclidean distance from the drone's current position to the goal position. This distance could be measured with a variety of sensors, such as 3D cameras or a laser. Minimising this distance, or maximising the negated distance, is a good guideline for learning to reduce the distance to the goal. In our application the reward is negated, since negative rewards are said to incentivise finding the quickest way to perform a task: every extra step results in a lower total reward, whereas with positive rewards it could even pay off to take some extra steps.

Another reward structure that is tested is again based on the Euclidean distance. This time the Euclidean distance between the drone's position and the goal position is compared with the distance calculated in the previous step. If the distance is smaller, a reward of 1 is given; if the distance is equal, the reward is 0; and if the distance is bigger, the reward is -1. This reward structure steers a bit more firmly in the right direction, as it gives penalties for not getting closer to the goal. It can be written as the decision rule stated in formula (3.4).

$\mathrm{Reward}(\mathit{cur\_dist}, \mathit{former\_dist}) = \begin{cases} 1 & \text{if } \mathit{cur\_dist} < \mathit{former\_dist} \\ -1 & \text{if } \mathit{cur\_dist} > \mathit{former\_dist} \\ 0 & \text{otherwise} \end{cases}$  (3.4)

Within both reward structures, colliding with objects and reaching the goal are not captured by the distance-based rewards themselves, but they are still taken into account in an episode of the game. If an agent collides with an object or reaches the goal, the reward allocated by the reward structure is overridden: the reward for colliding with an object is changed to -500, and for reaching the goal it is changed to 1000. The large negative reward for colliding with the door or wall lets the model know it was a bad move; the large positive reward for reaching the goal steers the parameters in the direction of the set of moves used in that episode.
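The sketch below combines both reward structures with the collision and goal overrides described above; the function signature and the structure switch are hypothetical.

```python
import numpy as np

def reward(cur_dist, former_dist, collided=False, reached_goal=False, structure=2):
    """Reward structures of section 3.4, with the collision/goal overrides applied last."""
    if structure == 1:
        r = -cur_dist                                  # negated Euclidean distance to the goal
    else:                                              # rating of the distance change, formula (3.4)
        r = float(np.sign(former_dist - cur_dist))     # 1 if closer, -1 if further, 0 if unchanged
    if collided:
        r = -500.0
    if reached_goal:
        r = 1000.0
    return r
```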

3.5

Expanding to a multi-agent version

In this section, the multi-agent version of our model is discussed. A centralised system was chosen because it is certain that such a model converges to an optimal policy: with a centralised system one overarching agent controls the entire swarm of drones, so the proof stated in [16] and discussed in section 2.1.2 still applies. To create such a centralised system, some extras had to be added to the existing model. First, the model has to keep track of multiple drones instead of a single one. By giving the drones ids and keeping a dictionary of all (id, position) pairs in the simulation, the drones are kept separate, and these extra coordinates are also added to the feature vector passed through the DQN. The reward structure needs to be changed slightly as well. In the single-agent application, the reward is based on the Euclidean distance between the drone and the goal position; in the multi-agent version, however, there are multiple drone positions in the system. To keep track of these multiple positions at once, a centre position is calculated. This centre position is then used to represent the swarm, so the Euclidean distance between the swarm and the goal can be calculated from the centre coordinate. Besides the reward structure, one more condition is added that ends an episode: the drones may not collide with each other, and when this happens the episode is ended and a big negative reward is given to the agent.

Finally, a representation was needed for the joint moves of the drones. A list with an action for every drone could not be used directly, for the same reason that the full grid world could not be fed to the neural network: during training, the network structure could not cope with the lists of actions stacked in the replay buffer. Since the choice had already been made not to change the network but to change how everything is fed to it, the easiest solution was to represent every possible joint action by an index into a list of all possible move combinations. Formula (3.5) gives the number of possible joint moves, since there are six possible moves per drone, as explained in section 3.1.

$M = 6^{N}$  (3.5)

The list of possible moves is made by storing every possible combination of moves. Every element of this list represents an action for the centralised agent controlling the swarm, and each element is thus a list containing one move for every drone. Note that this list, and thus the action-space, grows very quickly with an increasing number of agents.
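Such a list of joint actions can be built with itertools.product, as sketched below; each index then encodes one combination of moves, and the length of the list is 6^N, as in formula (3.5).

```python
from itertools import product

SINGLE_MOVES = range(6)   # up, down, left, right, forward, backward

def joint_actions(num_drones):
    """All 6**num_drones combinations of moves for the centralised agent."""
    return list(product(SINGLE_MOVES, repeat=num_drones))

actions = joint_actions(2)
print(len(actions))        # 36 = 6**2
print(actions[7])          # (1, 1): the move with index 1 for both drones
```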


CHAPTER 4

Experiments and Results

In this chapter, all the different experiments are explained and the obtained results are presented; the results are then discussed in the discussion and conclusion. The first results presented concern the different reward structures. After that, the results are given for the different optimisers tested to improve the learning process. In all experiments it is checked whether the Q-learning model is able to converge to an optimal policy within the MDP it has to act in. This convergence is analysed and compared mainly by looking at the running average over training. The running average is calculated by taking the average of the last 100 episodes; it smooths out short-term fluctuations in the obtained data and gives a sense of the trend in the data. In our application, the data describes how well the agent performs its task, so the running average represents the trend of the learning curve. In the comparisons, therefore, it is not the absolute values that are considered, but the shape of the curve.
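The running average used in all result plots can be computed as sketched below (the mean of the most recent 100 episode rewards, the window named in the text).

```python
import numpy as np

def running_average(episode_rewards, window=100):
    """Mean of the last `window` rewards at every episode, used to smooth the learning curve."""
    rewards = np.asarray(episode_rewards, dtype=float)
    return np.array([rewards[max(0, t - window + 1): t + 1].mean()
                     for t in range(len(rewards))])
```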

4.1

Testing the different reward structures

In this section, the results of the experiment regarding the reward structures are discussed. For this research, two different reward structures were tested. The first uses the Euclidean distance between the position of the drone and the goal and assigns the distance itself as the reward. The second rates this Euclidean distance against the distance at the previous position: if the distance is smaller than before, the reward is 1; if it is equal, the reward is 0; and if it is bigger, the reward is -1. To test which of the two reward structures leads to faster convergence, both were analysed after 2000 episodes of learning. The results for the first reward structure can be found in figure 4.1 and those for the second in figure 4.2. Each figure contains two charts: chart (a) shows all rewards the agent obtained per episode and chart (b) shows the running average over all episodes. Note that the y-axis differs between figures 4.1 and 4.2, because the two reward structures allocate rewards in different ranges. With the first reward structure, real-valued distances are allocated, so the sum of rewards over one episode can grow quite large; the second reward structure allocates only single points, which keeps the sum of rewards much lower.


(a) Current reward is plotted on the y-axis and the number of episodes on the x-axis.

(b) Running average is plotted on the y-axis and the number of episodes on the x-axis.

Figure 4.1: Results after 2000 episodes for the first reward structure.

(a) Current reward is plotted on the y-axis and the number of episodes on the x-axis.

(b) Running average is plotted on the y-axis and the number of episodes on the x-axis.

Figure 4.2: Results after 2000 episodes for the second reward structure.

Looking at the figures, it can be seen that the models with both reward structures converge to an optimal policy, with the second reward structure converging slightly faster. Something that stands out more than this slightly faster convergence can be seen in chart (a) of figures 4.1 and 4.2: with the second reward structure (figure 4.2a) the goal is found at an earlier stage than with the first. This can be inferred because maximum rewards appear in the plot from the start, whereas with the first reward structure these maximum rewards only appear after about 150 episodes. Because the density of vertical lines decreases as the number of episodes grows, low rewards are achieved less and less frequently. The high rewards tell us that the goal is found, because on finding the goal the reward for that step is made very high, namely 5000. However, figure 4.2a also shows that low rewards still occur quite frequently when the number of episodes is high, which happens less often with the first reward structure (figure 4.1a).

4.2

Testing the different optimizers

After establishing with which reward structure the DQN performs best, improvements could still be made to the neural network itself. The second experiment therefore tests different optimisation algorithms. Because of the time frame, only two were tested: the AdaGrad and Adam optimisers, which were discussed in chapter 2, Theoretical background. The rewards over all episodes and the running average of the learning process with the Adam optimiser are shown in figure 4.3.


(a) Current reward is plotted on the y-axis and the number of episodes on the x-axis.

(b) Running average is plotted on the y-axis and the number of episodes on the x-axis.

Figure 4.3: Results after 2000 episodes for the second reward structure with the Adam optimisation algorithm.

In the first experiments, the neural network was always trained with the AdaGrad optimiser, so figure 4.3 can be compared with figure 4.2. Comparing the rewards over all episodes in both figures, the density of vertical lines is somewhat lower with the Adam optimiser, although it must be admitted that this result is not very clear; moreover, because the epsilon-greedy action-choice rule has a random component, it could be due to randomness. However, the running averages of both figures show slightly faster convergence with the Adam optimisation algorithm. Therefore, all later experiments use only the Adam optimiser.

4.3

Testing the DQN in a centralised multi-agent version

After finishing the single-agent version, it was time to test whether the DQN structure could cope with the multi-agent version. The multi-agent version was first tested on a standardised door-way; the results of this experiment can be found in figure 4.4. Because the action-space is much bigger this time, the number of episodes was also increased to give the neural network enough episodes to converge.

(a) Current reward is plotted on the y-axis and the number of episodes on the x-axis.

(b) Running average is plotted on the y-axis and the number of episodes on the x-axis.

Figure 4.4: Results after 5000 episodes for the multi-agent version navigating through a standardised door.


Looking at the figure, it can be seen that the algorithm converges to an optimal policy. Notably, the convergence is rather abrupt. This abrupt convergence may be explained by the reward given for finding the goal, which is quite prominent in comparison to the rewards given for ordinary steps.

4.4

Testing on random door-ways

In this section the models are discussed that are trained in an environment in which the standardised door is replaced by a random door-way in the wall. Because of the time frame, the single-agent version is the only version that could be tested. As stated in the method, section 3.1, the state-space becomes much bigger when training on random door-ways; therefore the number of episodes for this test was also increased to 5000. The results of this experiment can be found in figure 4.5.

Figure 4.5: Results after 5000 episodes for the single-agent version navigating through a random door-way. (a) Reward per episode; (b) running average of the reward, both plotted against the number of episodes.

The graph in sub-plot (a) shows that the agent achieves rewards of 1000. A reward of 1000 can only be achieved by reaching the goal, so the agent did reach the goal sporadically during the entire training period. However, as can be seen in sub-plot (b), the model does not converge to an optimal policy: the graph does not stabilise at one particular value of the running average, as was observed in all former tests.


CHAPTER 5

Discussion

The discussion is divided into three parts. Firstly, the created algorithm and results are discussed for the single-agent version; this includes the experiment on optimisation, because it was only performed with a single agent. Secondly, the multi-agent version is discussed and evaluated, especially in terms of the main goal. Lastly, some problems with the created model are discussed.

5.1 Single-Agent Experiments

Evaluating the results regarding the different reward structures, it can be concluded that the best one for this application is the second proposed structure. The second reward structure is determined by rating the change in Euclidean distance between the goal point and the position of the drone: when this distance is smaller than in the previous observation, the reward is 1; when the distance is larger, the reward is -1; and when it is equal, the reward is 0. It was expected that the second reward structure would steer the agent's parameters towards the goal earlier during learning. This expectation was made because rating a distance, instead of giving the distance itself as a reward, lays an extra layer of interpretation over the number representing the distance. This extra layer expresses how good that distance is; to the agent, the raw distance is initially just a number with no environmental meaning. By rating this number, the agent could develop a sense of direction at an earlier stage. The results of the experiment regarding the reward structures indeed show that the second reward structure converges faster than the first one, so the expectation that laying a meaningful layer over the distance told the model more about the effectiveness of a move might be right.
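For clarity, the sketch below implements the second reward structure exactly as described above: +1 for moving closer to the goal, -1 for moving away, and 0 otherwise. The function and variable names are hypothetical; only the rating rule itself comes from the thesis.

import math

def distance_rating_reward(prev_pos, new_pos, goal_pos):
    """Second reward structure: rate the change in Euclidean distance to the goal."""
    prev_dist = math.dist(prev_pos, goal_pos)
    new_dist = math.dist(new_pos, goal_pos)
    if new_dist < prev_dist:
        return 1    # the drone moved closer to the goal
    if new_dist > prev_dist:
        return -1   # the drone moved away from the goal
    return 0        # the distance did not change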

Subsequently, the experiment regarding the different optimisers is discussed. As stated in the results, the Adam optimisation algorithm performed better than the initially used AdaGrad optimiser. This result is as expected, because Adam is a further improvement in the line of gradient descent optimisers. These results were stated and discussed because they belonged to the research process of finding better performance for the DQN.

Lastly, the single-agent version is tested to find out whether it can learn to navigate a drone through a random door-way. The results suggest that the created model failed to learn this task. However, because the model used to train the agent is a Q-learning model, we can rely on the proof, given in the theoretical background, that Q-learning converges for every finite MDP given infinite time. Therefore, we expect that the model will converge when trained for a more extended period. Furthermore, the chosen function used to calculate epsilon, for the epsilon-greedy action-choice rule, could cause the slow convergence of the model. This function converges quickly to zero; after 100 episodes epsilon already has a low value of ±0.0996. This causes the model to make only a tiny amount of exploratory moves from an early stage onwards. This fast decay towards zero is not a problem for the standardised-door application, but because the state-space is much bigger with random door-ways, it might be necessary to explore much more during training to achieve the same behaviour there. It is possible that the agent did not discover enough action sequences to find the optimal one.
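The sketch below contrasts a fast-decaying epsilon schedule with a slower one that keeps a minimum amount of exploration. The fast schedule, epsilon = 1/sqrt(episode + 1), is only an assumption chosen because it roughly matches the value quoted above (about 0.0996 after 100 episodes); the thesis's exact decay function is not reproduced here, and the constants in the slower schedule are placeholders.

import math

def fast_epsilon(episode):
    """Assumed fast schedule: drops below 0.1 after about 100 episodes."""
    return 1.0 / math.sqrt(episode + 1)

def slow_epsilon(episode, eps_min=0.1, decay=0.001):
    """Hypothetical slower schedule with a floor, so exploratory moves keep occurring."""
    return eps_min + (1.0 - eps_min) * math.exp(-decay * episode)

# After 100 episodes: fast is roughly 0.0995, slow is roughly 0.91; the slower
# schedule would leave far more room for exploration in the larger
# random door-way state-space.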

To answer the first sub-question, whether a deep Q-network can navigate a single drone through a hole in the wall, it can be concluded that it can be done. However, training the model on just one particular door is not sufficient for using it in a real-life environment. When a model is trained on a particular door-way and has to navigate through a different one, it is not certain that it will succeed without colliding into any objects, because the model overfits on the parameters of the path through that particular door. A solution to this problem is to initialise a new random hole at every episode of learning, which is done in the last experiment. This takes a longer period of learning before the model converges; however, the model then learns to navigate based on the features instead of just finding an optimal path for one particular door. To show an actually accurate application of such a model, future research is needed.

5.2 Multi-agent experiments

In the multi-agent experiments, the model predicts a move for every drone at once, so the past experiences of all the drones are taken into consideration during the learning process. Because it is a centralised model that needs to predict moves for every drone, the action-space grows very fast with an increasing number of drones. With a single drone the MDP had an action-space of size 6, but with three drones the action-space increases to a size of 216, and with four drones it grows to a size of 1296. As a consequence of this rapid growth of the action-space, the time it takes to train the model also becomes much longer. This becomes a problem when the model is used for an application with a larger number of drones; some improvements could lead to better performance of the algorithm, but it is not certain that they would outweigh the growth in learning time. The formula describing this growth is formula (3.5) in section 3.5.
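The growth described above is exponential in the number of drones: the centralised agent chooses one of the six single-drone actions for each drone simultaneously. The expression below is consistent with the sizes quoted above (6, 216, 1296); whether it matches formula (3.5) exactly is an assumption, since that formula is defined in section 3.5.

\[
  |A_{\text{joint}}| = |A_{\text{single}}|^{\,n} = 6^{n},
  \qquad 6^{1} = 6, \quad 6^{3} = 216, \quad 6^{4} = 1296 .
\]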

This action-space would be smaller with a decentralised model, because every drone would be represented by an individual agent. Each agent then only needs to predict a move for a single drone; all agents still take the movements of the other drones into consideration, but this does not change the number of available moves. Still, with enough training resources, and in an application domain where a decentralised component is not necessary, the model works theoretically. Theoretically in the sense that it has only been tested within a simulation and not in a real-life application.

5.3 Problems with the created model and its experiments

It has to be noted that all tests done in this research could be done more thoroughly. The reason is that the created learning model contains a random component: within the epsilon-greedy choice rule, a random number is generated to decide what kind of move is made, and exploratory moves themselves are also chosen randomly. These random numbers affect how quickly the agent finds the goal. When the goal is found at an earlier stage, which can be a matter of luck, the model also converges faster, because the parameters are steered in the right direction sooner; exploitative moves then become more sensible, in the sense that they are based on better data, containing an already found path to the goal. To eliminate this reliance on random moves during the tests, multiple trained models could be examined and their running averages normalised. However, this would take many training hours. Moreover, this research was done on a device without a graphics processing unit (GPU). Training neural networks can also be done on a GPU instead of the CPU, and training performance is much higher on a GPU, so using one would reduce the training time considerably.
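A minimal sketch of what examining multiple trained models could look like: train the same DQN several times with different random seeds and average the per-episode reward curves, so that lucky early discoveries of the goal are smoothed out. The train_dqn function and the number of runs are placeholders, not part of the thesis implementation.

import numpy as np

def averaged_learning_curve(train_dqn, n_runs=5, n_episodes=2000):
    """Average per-episode rewards over several independently seeded training runs."""
    curves = []
    for seed in range(n_runs):
        # train_dqn is assumed to return a list of per-episode rewards
        rewards = train_dqn(seed=seed, n_episodes=n_episodes)
        curves.append(rewards)
    return np.mean(curves, axis=0)  # the seed-averaged learning curve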


Furthermore, the agents learn to act in a simulated environment, which immediately raises the reality-gap problem. Claiming that the current model can navigate real drones through a real door would be a big statement, since many real-life components are not taken into consideration in the current model. One essential component the proposed simulation needs, to represent the real world better, is mass inertia. Mass inertia is especially crucial for navigating flying objects, since the speed of a moving object influences the path it can make. Mass inertia needs to be incorporated in the path planning for a drone; otherwise the drone could collide with objects because the calculated path contains a corner that is too sharp for the drone to turn its mass in time.

Another problem with the used simulation is that the world in which the models are trained is represented discretely, which means that a discrete representation is also needed when acting in the real world. However, creating such a discrete map of the environment can be a hard and time-consuming task. For real-life practice, it would therefore be helpful to change the representation of the world within the model to a continuous one. Another option would be to make the discretisation finer: a discrete grid is an approximation, or discretisation, of reality, so a finer grid would approximate the environment more closely. However, this would not change the fact that the world is still discretised, and therefore a reality gap will always remain.


CHAPTER 6

Conclusion

In this section, we take all the considerations of the discussion together and come to a conclusion for our main question. First, however, we discuss the sub-questions. The first sub-question asked whether it is possible to use a DQN to teach a single agent to navigate a drone through a hole in a wall. The results show that it is possible: the agent converges to an optimal policy for the created MDP with a standardised door, and by inspecting the path the drone performed, it navigated through the door without colliding into any objects.1

With the second sub-question, the multi-agent version of the first model is examined. It asked whether it is possible to use a DQN to teach an agent to navigate multiple drones through a door. Again, it can be concluded that the DQN was able to converge to an optimal policy with the standardised door. Inspecting the paths of all drones shows that every drone followed a path which brought it to the goal, without colliding with each other or with another object. Moreover, the drones kept the formation they started in along the way.2 It is not tested whether the DQN can also perform the same task in a decentralised manner; this could be done as further research.

For the last sub-question, experiments regarding the multi-agent version were not executed because of the time frame. However, a conclusion can still be drawn about the single-agent model trained on random door-ways. The results show a lack of convergence; yet, considering the convergence proof, it is hard to say that the model could not converge when trained for a longer period, especially when additional improvements would be added to the Q-network. Moreover, we expect that using a different function for calculating epsilon would also boost the performance, as explained in the discussion. These expectations are strengthened by the fact that the agent achieved its goal in a substantial number of the episodes.

Lastly, we discuss the main question, which reads: "Can drones learn to navigate through a door, as a swarm, with deep multi-agent reinforcement learning?" Taking all the results of this research into consideration, we conclude that this is possible, especially when the training time for the model is increased and the training is done on a dedicated device with a strong GPU. For real-life purposes, the model needs further research; however, this first model is a good step in the right direction. Moreover, if a decentralised system is not needed, the agent still has to be trained in a continuous environment to get a better representation of the world, which is also discussed in future research.

1 See appendix section A; 9.1


CHAPTER 7

Future Research

In this chapter, the possibilities for further research are presented. After this thesis, many improvements can be thought of before moving to a real-life application in which real drones navigate through a real hole in a real wall. First, however, an improvement of the DQN is discussed which was not implemented in this thesis because of the time frame.

7.1 Extra improvement of the deep Q-network

Within deep Q-learning, the target is calculated from the maximum Q-value of the next state combined with the reward just obtained. Because the maximum Q-value is based on an approximation, it is not certain that it represents the best action for the state the agent is in; the accuracy of the Q-values depends on the actions the agent has tried so far, and the best action may not have been performed yet. As a consequence of these noisy Q-values, false positives can be produced during training, which steer the parameters in the wrong direction. In other words, the Q-values will be overestimated. This complicates the learning process, because non-optimal actions can get higher Q-values than the best possible action. As a solution, the computation of the Q-values and the estimation of the targets can be decoupled [20]. Van Hasselt et al. (2016) proposed a double Q-network in which two different neural networks make these two approximations. Their paper shows that using a double Q-network reduces the overestimation of Q-values and improves stability during training.
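As an illustration of this decoupling, the sketch below computes the double-DQN target as described by Van Hasselt et al. (2016): the online network selects the action, while a separate target network evaluates it. It is a minimal PyTorch-style sketch; the network objects, discount factor, and tensor shapes are assumptions, not this thesis's implementation.

import torch

def double_dqn_target(reward, next_state, online_net, target_net, gamma=0.99, done=False):
    """Target = r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    with torch.no_grad():
        # The online network chooses the greedy action for the next state ...
        best_action = online_net(next_state).argmax(dim=1, keepdim=True)
        # ... but the target network evaluates how good that action is.
        next_q = target_net(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * next_q * (1.0 - float(done))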

7.2 Guiding the agents into the real world

The first thing that could be looked into as further research is the environment in which the drones are trained. It could be changed by taking the real world more into consideration when designing the simulated environment, for instance by changing the discrete simulation into a continuous one. The real world is continuous as well, so a continuous simulation would be a better representation of it; training the agents in such a simulation would make the reality gap smaller.

The features selected for this project are already based on variables that would also be available in a continuous world. This was done with the thought in mind that the simulation would have to be changed for a real-world implementation, and that some features are only available in a discrete one. Features such as grid coordinates do not automatically exist in the real world, and therefore such features were not chosen for the observations during training; they are not available in a continuous world because exact integer grid positions do not exist there.

To create an even better model in which the agents act in the real world, the observations could be changed into real images. In these images, a recognition algorithm could be used to detect a hole or door-way in a wall; this algorithm could then put a bounding box around the door to let the agent know where it is. When real images are used as input for a neural network, convolutional layers are needed to extract features from the images. The DQN used for learning the Atari games already consists of such convolutional layers, so future research could adapt this network to train the agents on images for the same task as in this research.
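For reference, a sketch of the convolutional front-end used by the Atari DQN (Mnih et al., 2015), which the paragraph above proposes to reuse. The input size, number of stacked frames, and action count are assumptions for illustration only.

import torch.nn as nn

class ConvQNetwork(nn.Module):
    """Atari-style DQN: three convolutional layers followed by two fully connected layers."""
    def __init__(self, n_actions, in_channels=4):  # 4 stacked 84x84 grayscale frames (assumed)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature map for 84x84 input
            nn.Linear(512, n_actions),              # one Q-value per action
        )

    def forward(self, x):
        return self.head(self.features(x))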

7.3 From a centralised to a decentralised model

The most important direction for future research would be extending the existing model to a decentralised one. The model presented in this thesis trains one agent which directs the whole swarm. However, as stated in the conclusion, implementing such a model for real-world purposes means that a central component is needed within the swarm. Such a centralised component is not always wanted, because it needs to be in range of all the drones to communicate with them and evaluate their new observations. This limits the capabilities of the system, and therefore a decentralised model would be a justified subject for further research.

In multi-agent reinforcement learning (MARL), multiple agents are trained within the same model. Every drone could then be represented by a single agent taking its own decisions while considering the actions of the other agents. Within MARL, all agents try to maximise their rewards simultaneously, yet independently.
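One of the simplest forms this could take is independent Q-learning, where each drone gets its own DQN and treats the other drones as part of the environment. The sketch below only outlines that structure; the agent class, observation contents, and environment interface are hypothetical and not part of this thesis.

# Hypothetical outline of independent learners: one DQN per drone.
def decentralised_step(agents, observations, env):
    """Each agent picks its own action from its own local observation."""
    actions = [agent.select_action(obs)           # per-drone epsilon-greedy choice
               for agent, obs in zip(agents, observations)]
    next_obs, rewards, done = env.step(actions)   # environment applies all moves at once
    for agent, obs, act, rew, nxt in zip(agents, observations, actions, rewards, next_obs):
        agent.store_transition(obs, act, rew, nxt)  # each agent learns only from its own data
        agent.train_step()
    return next_obs, done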

A good starting point for MARL would be to exploit and build on existing work on decentralised coordination in multi-agent systems. In this area, problems can be modelled as Distributed Constraint Optimisation Problems (DCOPs), for which researchers have already found optimal solutions [23, 24]. A downside of these approaches, however, is that they scale poorly when the number of agents grows.
