
MSc Artificial Intelligence

Master Thesis

Lenient Similarity Learning for Cooperative

Multi-Agent Reinforcement Learning

by

Raymond Koopmanschap

11925582
48 ECTS
1st of November 2019 – 14th of August 2020

Supervisor:

Dr. D. Bloembergen

Co-supervisor:

Dr. M. Kaisers

Assessor:

Dr. E.J.E.M. Pauwels


Abstract

Recent success in multi-agent reinforcement learning has been achieved by incorporating a similarity metric called the Time Difference Likelihood into hysteretic learning (Lyu and Amato, 2020). This similarity metric is used to separate two sources of stochasticity which are traditionally indistinguishable in an independent learner setting, where agents can neither communicate with each other nor see each other's actions. These two sources are the environment stochasticity and the stochasticity coming from the non-observable actions of the other agents. The outcome of the similarity metric is then used to determine to what extent the agent will update its value function. This allows the agent not to update when miscoordination occurs, which can happen because the agent is not able to see the other agents' actions, but to successfully update when environment stochasticity is encountered. The similarity metric does this by calculating the similarity between the current return distribution and the temporal difference target return distribution, which includes the immediate reward and the return distribution of the next state, under the assumption that the target distribution differs from the current distribution for miscoordinated actions. One shortcoming of the work that introduced the Time Difference Likelihood is that it was not experimentally verified whether the similarity metric is effective in distinguishing miscoordination from environment stochasticity. We bridge this gap by comparing the Time Difference Likelihood together with some newly proposed similarity metrics on the effectiveness of separating miscoordination from environment stochasticity in different scenarios. These scenarios consist of various current and target return distributions that are representative of distributions occurring in multi-agent reinforcement learning environments. Additionally, the potential drawbacks of the Time Difference Likelihood are explored and verified in these experiments. Next, we incorporate the similarity metrics in lenient learning (Wei and Luke, 2016), which we call lenient similarity learning, in order to test the effectiveness of the similarity metrics in multi-agent reinforcement learning environments. To compare the performance of lenient learning and lenient similarity learning we introduce an extended version of the Climb Game (Claus and Boutilier, 1998) and show that lenient similarity learning improves over lenient learning in terms of sample efficiency and the percentage of runs that reach the optimal joint-policy. The behavior of lenient learning is analysed to point out that the improvements are due to the successful separation of miscoordination from environment stochasticity. Subsequently, we mention the shortcomings of lenient similarity learning and identify that it overestimates the value function in a modified version of the extended Climb Game, leading to a decrease in performance. We conclude by introducing the similarity metric as a general function that takes as input features that help to distinguish miscoordination from environment stochasticity and outputs a similarity value. We use the current and target return distributions as features, but potentially other features can be used. This paves the way for similarity metrics that can more precisely separate miscoordination from environment stochasticity by using additional information.


Contents

1 Introduction
2 Background
   2.1 Reinforcement learning
   2.2 Multi-agent Systems & Game Theory
   2.3 Multi-agent Reinforcement Learning
       2.3.1 Stochasticity
       2.3.2 The changing actions of other agents
       2.3.3 Relative overgeneralization
       2.3.4 Miscoordination
   2.4 Independent learner baselines
       2.4.1 Hysteretic learning
       2.4.2 Lenient learning
   2.5 Distributional reinforcement learning
3 Related work
   3.1 Negative Update Intervals
   3.2 Likelihood Hysteretic Implicit Quantile Networks
4 Methodology
   4.1 Analysis of hysteretic and lenient learning in the Climb Game
   4.2 RQ1: Measuring the effectiveness of different similarity metrics in distinguishing miscoordination from environment stochasticity
       4.2.1 Desirable properties of similarity metrics
       4.2.2 Identifying and proposing similarity metrics
   4.3 RQ2 & RQ3: Comparing sample efficiency and performance of lenient learning with different similarity metrics
5 Experiments and results
   5.1 Analysis of the difficulties in multi-agent RL and the shortcomings of hysteretic and lenient learning
       5.1.1 Experimental setup
       5.1.2 Q-learning
       5.1.3 Hysteretic learning
       5.1.4 Lenient learning
   5.2 RQ1: Comparing similarity metrics using the desired properties
       5.2.1 Analysis of the Time Difference Likelihood
       5.2.2 Computational efficiency
       5.2.3 Independence and sensitivity to scale of returns
       5.2.4 Performance of similarity metrics on different types of distributions
       5.2.5 Conclusion of experiments research question 1
   5.3 RQ2 & RQ3: Comparing sample efficiency and performance of lenient similarity learning for different similarity metrics
       5.3.1 Introducing lenient similarity learning
       5.3.2 Lenient learning in the extended Climb Game
       5.3.3 Lenient similarity learning in the partially stochastic extended Climb Game
       5.3.4 The difference of reward stochasticity and transition stochasticity
       5.3.5 Comparison of lenient similarity learning with different similarity metrics
   5.4 Conclusion of the experiment section
6 Discussion
7 Conclusion

Appendices
   A Hysteretic and lenient learning in fully stochastic Climb Game
   B Results of lenient cautious similarity learning and lenient hysteretic similarity learning in
   C Results of lenient similarity learning in the deterministic and fully stochastic extended Climb Game
   D Lenient learning in the partially stochastic extended Climb Game with transition


1 Introduction

There are many important problems where multiple agents have to cooperate to achieve a common goal. Examples include the coordination of multiple traffic lights to maximize traffic flow through a city (Bakker et al., 2010; Van der Pol and Oliehoek, 2016; Abdoos et al., 2011), smart grids consisting of multiple electric power systems to optimize demand and supply of energy (Hernández et al., 2013; Merabet et al., 2014), and optimal resource allocation in cloud computing (Wang et al., 2016; Mazrekaj et al., 2017).

Multi-agent systems can be trained and deployed in several ways. In this work, we both train and deploy in a decentralized way. Thus each agent learns from its own experience and is completely independent of the other agents, with no communication allowed between them. This is in contrast to centralized training and decentralized execution, where the agents can communicate during training but not during execution. The independent learner setting is commonly used in settings where no centralized control agent is available and where agents need to make decisions based on their local observations alone. This independent learner setting is a challenging problem but is more practical and easier to scale (Claus and Boutilier, 1998).

To explain the shortcomings of the current work in this field, we first introduce multi-agent reinforcement learning and what we ideally want the agents to learn in this setting. To clarify the concepts of multi-agent reinforcement learning we show an example in a real-life setting where we have to maximize traffic flow through a city. Next, we explain how agents commonly learn in a reinforcement learning setting and elaborate on why learning is difficult in the multi-agent reinforcement learning setting with independent agents. We address how current work in this field aims to solve these difficulties. Lastly, we cover which elements of these difficulties are not addressed yet which will guide the objective and research questions for our work.

Reinforcement learning is used to learn sequential decision making from experience. It can be applied in the single-agent as well as the multi-agent setting, both with their own specific difficulties. Through reinforcement learning, agents learn to navigate an environment defined by states, possible transitions between them, actions to go from state to state, and rewards. The goal is to learn the actions that maximize the amount of cumulative reward, also called return. We do this by learning a policy that for a given state indicates which action to take. In the multi-agent reinforcement learning setting in particular, we want to find the optimal joint-policy. Thus for each agent, we want to learn a policy that combined with the optimal policies of the other agents maximizes the total amount of return.

For example, in the case of maximizing traffic flow through a city, all possible positions of cars could represent the states in the environment. A single agent then represents the traffic lights at a single crossroad and the agent can navigate through different states by taking actions—in this example changing the color of the traffic lights. For a particular state and action, the agent observes some amount of traffic throughput that will guide the update of a better policy with the final goal of finding the optimal joint-policy that maximizes the total amount of traffic throughput.

One way to learn such an optimal joint-policy is by using a value function. The value function maps a state to an expected return and indicates how much value this state gives. This allows the agent to follow actions that lead to the states with the highest value. Thus learning a good joint-policy requires an accurate estimate of the value function so the agent can infer which states lead to the most return and this in turn will help to select the appropriate action for a given state.

The estimation of the value function is a difficult task since the estimated returns, by which to update this value function, are noisy. This noise is mainly caused by two sources of variance. One source of variance, that also plays a role in single-agent reinforcement learning, is environment stochasticity. This actually consists of two sources of stochasticity, reward stochasticity and transition stochasticity, but for now, we can treat them as one. This stochasticity results in either ending up in a different state or receiving a different reward, which are respectively transition and reward stochasticity. This stochasticity makes estimating the value function more difficult. The second source of variance is unique to the multi-agent setting and comes from the fact that in the independent learner setting an agent can not observe the actions of the other agents, e.g. they can not observe how traffic lights at a different crossroad are changing. So despite taking the same action in the same state (from the perspective of a single agent) the agent may end up with different traffic throughput because the traffic lights at another crossroad were different.

From the perspective of an independent agent, those two sources of variance are traditionally indistinguishable. Therefore, we could model them as if we are drawing a sample from a single probability distribution, in our case a return distribution. This non-distinguishability would not pose a big problem if we only want to learn the mean of the return distribution. In this case, we can simply take the average of all the samples we receive to estimate the mean. This would work for environment stochasticity, the first source of variance, since the mean return of the environment stochasticity will not change over time. This situation is similar to empirically estimating the average number of dots from a dice roll, where the number of dots does not change. Therefore, the mean can be safely estimated by algorithms that estimate the average of the return distribution like Q-learning (Watkins and Dayan, 1992), which will, given enough samples, converge to the mean of the distribution. However, for our second source of variance, the unknown actions of the other agents, we do not want to learn the mean return. If the current agent would learn the mean return over the other agent's actions it would weight each action of the other agent equally, no matter which action the other agent takes. This is undesirable since the other agent can make mistakes, especially at the beginning of a game, leading to an underestimation of the value of the action of the current agent. This could lead to an agent taking an action that currently looks optimal from its own perspective but is not part of the optimal joint-action, in which case the agent would need to take the other agent into account as well. Only by also taking the other agent into account can we achieve the best result in a cooperative setting, where both agents have to work together to achieve the best outcome instead of each maximizing only its own return. Therefore, instead of learning the mean return, we want to learn what the return would be if the other agent takes the best action. This behavior of learning the best action is called being lenient to the other agent and can be seen as forgiving their mistakes. The agent hereby anticipates that the other agent will eventually learn the correct action, which results in finding the optimal joint-policy where both agents take the best joint-actions instead of the best individual actions. Remember that the current agent does not know whether the other agent makes a mistake or not, since they are independent learners and cannot see each other. Hence it is just another form of stochasticity, as we discussed previously. Unfortunately, this creates a conflict: on the one hand, we want to learn the mean return if we are dealing with environment stochasticity, to not overestimate the value function. On the other hand, we want to learn the return of the best action of the other agents, to not underestimate our current action and learn what the reward would be if the other agent would take the best action.

To tackle this problem the literature mainly focuses on reducing the punishments from negative value updates. Those are updates where the value of an action in a given state decreases due to getting a low return. When an agent achieves a low return, this is likely because of miscoordination. Therefore, we want to reduce the impact of this update, but do not want to neglect it completely because the low return could also be the result of environment stochasticity.

Hysteretic learning (Matignon et al., 2007) reduces the impact of negative value updates by using a lower learning rate for negative updates than for positive value updates. However, it has been shown empirically that it does not perform well in stochastic environments since it overestimates the Q-value of actions that involve environment stochasticity (Wei and Luke, 2016). Lenient learning (Wei and Luke, 2016) improves on this by initially performing negative updates with a low probability and gradually increasing this probability using a temperature decaying schedule. The assumption is that the agents have already learned good policies by the time the negative update probability gets higher. This allows the agents to take into account all rewards received by actions with environment stochasticity, and not only the rewards that result in a positive Q-value update. One problem with this approach is that during the phase where the temperature has cooled down and agents learn the environment stochasticity, they can realize that their current policy was worse than estimated. This requires them to update their policies, which will be hard because they will again be susceptible to miscoordination. Secondly, the hyperparameters that control this temperature schedule are highly problem-dependent, resulting in additional effort to tune hyperparameters to achieve good performance. Both hysteretic and lenient learning have recently been scaled to work in large state spaces by integrating these concepts into deep learning (Omidshafiei et al., 2017; Palmer et al., 2018).

A way to improve on hysteretic and lenient learning is to distinguish low returns coming from environment stochasticity from the low returns coming from miscoordination. Negative update intervals (Palmer et al., 2019) is an approach to tackle this problem by using intervals to separate miscoordination from environment stochasticity. This assumes that the minimum return for optimal actions is larger than the returns from miscoordinated actions. In this way, a lower bound on the return can be determined which is used as a threshold to distinguish optimal and miscoordinated actions. Negative updates can then be performed above this threshold where miscoordination does not occur. One disadvantage of this approach is that the threshold can only decay and could therefore still decay too fast if unfavorable stochastic rewards are obtained.

Another approach to distinguish miscoordination from environment stochasticity is given by Likelihood Hysteretic Implicit Quantile Networks (Lyu and Amato, 2020). This algorithm uses a method called Time Difference Likelihood. This method calculates a similarity value that indicates how similar the return distribution from the current time step, the current distribution, is to the return distribution formed by the immediate reward and the return distribution of the next time step, which we call the target distribution. The Time Difference Likelihood then uses the result of this calculation to determine the learning rate for negative updates. The return distributions for each state-action pair are estimated using Implicit Quantile Networks (Dabney et al., 2018a). By comparing the current distribution to the target distribution, the method is able to separate miscoordination from environment stochasticity. If the other agent chooses a sub-optimal action, thus causing miscoordination, the target distribution will likely be different from the current distribution, resulting in a low similarity value. This low similarity value reduces the learning rate by which negative updates are performed and hereby avoids updating the Q-value with miscoordinated actions of the other agent. This assumes that sub-optimal actions of other agents lead to states with different return distributions, which is often the case but not always. On the other hand, if the other agent chooses a coordinated action, the current and target return distributions are likely similar, resulting in a high similarity value. This high value will increase the learning rate of negative updates so that the same learning rate as for positive updates is used. This allows the agent to learn environment stochasticity when the coordinated (optimal) joint-actions are chosen. The updating based on the similarity of the return distributions makes the algorithm more sample efficient and more likely to converge to optimal joint-policies compared to lenient and hysteretic learning (Lyu and Amato, 2020).

While looking at the difference between the current and target distribution is an interesting approach, the authors did not explicitly verify whether the Time Difference Likelihood is efficient in separating the sub-optimal actions of other agents from environment stochasticity. Improving this similarity metric could lead to a further increase in sample efficiency and performance. To bridge this gap we will compare different similarity metrics in their efficiency of detecting whether two distributions are different, which allows us to separate sub-optimal actions of other agents from environment stochasticity. Subsequently, we will introduce lenient similarity learning, which incorporates the different similarity metrics in lenient learning. We then compare variations of lenient similarity learning that use different similarity metrics and see which ones perform well. We compare both the sample efficiency and the performance, i.e. the likelihood of converging to an optimal joint-policy, of each similarity metric. Additionally, we show some shortcomings of lenient similarity learning. Finally, we propose to view the similarity metric as a general function that takes as input features that can help to separate miscoordination from environment stochasticity and outputs a similarity value indicating how likely it is that we are dealing with environment stochasticity. In our case, we use the current and target distribution as features, but these can be replaced by any feature that helps to separate miscoordination from environment stochasticity.

Objective: Improve the sample efficiency and performance of cooperative multi-agent reinforcement learning in an independent learner setting by separating sub-optimal actions of other agents from environment stochasticity with the aid of a similarity metric which provides an improved cooperative learning target to update the value function.

• RQ1: Which similarity metrics are effective in distinguishing miscoordination from environment stochasticity?

• RQ2: What is the effect of the similarity metrics from RQ1 on the sample efficiency of lenient learning?

• RQ3: What is the effect of the similarity metrics from RQ1 on the performance of lenient learning?

Answering these questions helps us to find better ways to distinguish environment stochasticity from miscoordination, allowing us to more accurately update the value function, which can improve performance and increase sample efficiency.

In the remainder of this work, we first discuss the necessary background. In the related work section, we discuss negative update intervals (Palmer et al., 2019) and Likelihood Hysteretic Implicit Quantile Networks (Lyu and Amato, 2020) more closely. Next, we lay out an approach to answering the research questions in the methodology section. Then we compare the effectiveness of different similarity metrics, which answers research question 1. To answer research questions 2 and 3 the similarity value is incorporated into lenient learning, resulting in lenient similarity learning. With this newly obtained algorithm three experiments are performed. First, we compare lenient similarity learning to lenient learning. Second, we discuss some shortcomings of lenient similarity learning. Lastly, we test different similarity metrics and compare their sample efficiency and performance. Finally, we discuss our obtained results, give our final conclusion, and finish with suggestions for future work.

2 Background

To understand the algorithms used in this work, we explain the core concepts of reinforcement learning and multi-agent learning. Together, this leads to the field of multi-agent reinforcement learning. This paradigm, while powerful, also gives rise to a unique set of pathologies that we will outline below. Next, we elaborate on the attempts of hysteretic and lenient learning to solve these pathologies and discuss the shortcomings of these methods to lay the foundation for the related work. Lastly, we introduce distributional reinforcement learning, which is used for the Time Difference Likelihood and lenient similarity learning.

2.1 Reinforcement learning

Reinforcement learning is used in settings where an agent has to complete a certain goal, specified by rewards. It can achieve this goal by making decisions in an environment to navigate itself to the desired outcome. More formally, it is modeled using a Markov Decision Process (MDP) (Sutton et al., 1998), which is a 5-tuple ⟨S, A, T, R, γ⟩. S represents all the possible states in a given environment, A consists of the possible actions an agent can take, T(s, a, s′), which can be written as P(s′|s, a), is the transition probability of ending up in state s′ after taking action a in state s, R(s, a, s′) is the reward received after such a transition, and γ is the discount factor used to discount the rewards received in future states. The goal is then to maximize the expected discounted sum of rewards, also called the return G.

G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \qquad (1)

An agent tries to achieve this goal by learning a policy π : S → A telling the agent which action to take in a particular state. The agent wants to find the optimal policy π* that maximizes the return. There are multiple ways to learn such a policy. One is to learn the policy directly using policy-based methods. Another is value-based methods, which use a value function to guide the actions of an agent. Finally, we can combine those two into actor-critic methods. Here we will use value-based methods. A value function V : S → ℝ is a mapping from states to expected returns and indicates which states have the highest value. This helps the agent guide itself to the most desired states. The value function is sufficient for model-based settings where we have access to the transitions T. However, in most problems this transition model is not given, and either we have to learn this model or use model-free methods. For model-free methods, often a Q-value function Q : S × A → ℝ is used, which estimates the value of a state-action pair, Q^π(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]. This allows us to directly obtain the greedy policy by taking the action with the maximum Q-value for a particular state.

Given a certain state and the Q-values for the state-action pairs, we need to determine which action to choose. A strategy one could follow is choosing the action with the highest Q-value at that moment, a so-called greedy approach. The problem with this approach is that at the beginning there are a lot of potential actions we have not explored which could lead to an even higher reward. This is a characteristic problem in reinforcement learning called the exploration-exploitation trade-off. On the one hand, we want to exploit our current knowledge to get a high reward but we also want to explore options that could be even better (or worse). There is a range of different methods to balance this trade-off. For example, ε-greedy, where with probability ε a random action is taken and otherwise a greedy action. For more methods, the reader is referred to the existing literature on this topic (Sutton et al., 1998).
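As a small illustration, a minimal sketch of ε-greedy action selection is given below; the function name and use of NumPy are our own and not part of the original text.

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit
```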

Next, we need to estimate these values in order to train an agent. A commonly used algorithm is Q-learning (Watkins and Dayan, 1992), which estimates the Q-value by sampling an action and looking at the immediate reward and the maximum Q-value of the next state. The current Q-value is then updated a bit towards this estimate, according to the difference between the estimated target and the current Q-value, called the temporal difference error, or simply TD-error.

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ \underbrace{\overbrace{r_t + \gamma \max_{a} Q(s_{t+1}, a)}^{\text{temporal difference target}} - Q(s_t, a_t)}_{\text{temporal difference (TD-error)}} \Big], \qquad (2)

where α is the learning rate. To keep track of all the Q-values, each value with its corresponding state and action may be stored in a table. This approach is called tabular Q-learning.
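As a concrete illustration, a minimal sketch of the tabular Q-learning update of equation (2) for a single transition could look as follows; the variable names and the small table sizes are our own choices, assuming a discrete state and action space.

```python
import numpy as np

n_states, n_actions = 10, 3
alpha, gamma = 0.1, 0.95              # learning rate and discount factor
Q = np.zeros((n_states, n_actions))   # tabular Q-values, one entry per state-action pair

def q_learning_update(s: int, a: int, r: float, s_next: int) -> None:
    """Apply the update of equation (2) to the table entry Q[s, a]."""
    td_target = r + gamma * np.max(Q[s_next])  # temporal difference target
    td_error = td_target - Q[s, a]             # temporal difference (TD-error)
    Q[s, a] += alpha * td_error
```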

However, this approach becomes infeasible when the number of states gets very large, because the table will no longer fit in memory. Moreover, in large state spaces, each state will not be sampled very often, which leads to noisy Q-value estimates. A solution is to estimate the Q-values using function approximation. Thus, instead of storing a table, we learn a function that maps state-action pairs to Q-values. However, learning this function will only be beneficial if it has significantly fewer parameters than the total number of states. Function approximation is commonly done with Deep Q-networks (DQN) (Mnih et al., 2015), where the Q-value function Q_θ(s, a) is parameterized by a (convolutional) neural network.

Unfortunately, plain DQNs come with a couple of downsides. One of the problems of DQNs is the correlation of subsequent samples, i.e. the received rewards per step. This correlation creates difficulties for neural networks since they assume i.i.d. data, and violating this assumption can make learning less stable. In reinforcement learning, the samples are highly correlated because the next state follows from the current state and so forth. To tackle this problem, Experience Replay Memory (Lin, 1993) is added to break the correlation between subsequent samples. Each transition (s_t, a_t, r_t, s_{t+1}), consisting of the state, action, reward and next state, is stored in a fixed-size buffer. During the learning phase, the agent can train its neural network by uniformly sampling over all the stored transitions and updating using the sampled one. By uniformly sampling over all the stored transitions the correlation is broken. Furthermore, we use previous experience more efficiently since the stored transitions can be used multiple times. In essence, it replays the experiences it got from its memory. Another problem that occurs is an overestimation in the Q-value update caused by the max operator in equation 2. Since the real Q-values are unknown there is always uncertainty in this estimate; by taking the maximum of this uncertain estimate an overestimation bias can occur. To overcome this problem double DQN (Van Hasselt et al., 2016) is used. Double DQN (DDQN) uses two neural networks: the main network Q(s, a, θ) and the target network Q(s, a, θ′). One network is used for selecting the action and the other for estimating the Q-value. The network is updated at each time step using the target value Y_t^Q:

Y_t^Q \equiv r_{t+1} + \gamma Q\big(s_{t+1}, \operatorname*{arg\,max}_{a} Q(s_{t+1}, a, \theta_t); \theta'_t\big) \qquad (3)

The target network is similar to the main network except its parameters are not updated but copied from the main network every N steps. Double DQN both reduces the amount of overestimation and makes the Q-network more stable.
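A small sketch of how the double-DQN target of equation (3) could be computed for a single transition is given below, assuming q_main and q_target are callables that return a vector of Q-values for a state; these names and the done flag are illustrative additions, not part of the thesis.

```python
import numpy as np

def ddqn_target(r: float, s_next, q_main, q_target, gamma: float = 0.99, done: bool = False) -> float:
    """Target of equation (3): action selected by the main net, evaluated by the target net."""
    if done:
        return r                                   # no bootstrapping at the end of an episode
    a_star = int(np.argmax(q_main(s_next)))        # arg max_a Q(s_{t+1}, a, theta_t)
    return r + gamma * float(q_target(s_next)[a_star])  # Q(s_{t+1}, a*, theta'_t)
```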

2.2 Multi-agent Systems & Game Theory

Many problems require the use of multiple agents who interact with each other, for example the problems mentioned in the introduction or coordination tasks involving robots.

To get a better understanding of the multi-agent reinforcement learning setting we will first introduce some concepts that play a role in multi-agent systems before combining it with reinforcement learning. Most of them are related to game theory. A typical characteristic is the game type, which can be categorized in different ways; here we will discuss some common distinctions. One distinction we can make is between simultaneous and sequential games. In simultaneous games like rock-paper-scissors, all agents act at the same time, or at least cannot use the information of each other's action, making them in essence simultaneous. Those games are called normal-form games and are represented by a pay-off matrix, as given in Table 1, where the rows and columns are the actions of players X and Y and the number pairs in the cells represent the rewards for each agent.

                           Player Y
                  Rock      Paper     Scissors
Player X Rock     (0, 0)    (−1, 1)   (1, −1)
         Paper    (1, −1)   (0, 0)    (−1, 1)
         Scissors (−1, 1)   (1, −1)   (0, 0)

Table 1: Rock-paper-scissors pay-off matrix

In sequential games like chess, agents act sequentially and thus an agent can use the information of the other agent. These games are also known as extensive-form games and can be represented by decision trees. A second distinction that can be made is in the reward structure of the agents. This can be zero-sum, where the rewards that the agents get per turn sum to zero. Thus a high reward for one agent means a low reward for the other and vice versa. On the other hand, in general-sum games there is no such restriction. Lastly, we distinguish cooperative games from non-cooperative games. A zero-sum game is generally a non-cooperative game since agents have a negative influence on each other. In contrast, a fully cooperative game can be made by always giving the agents the same reward. In this work, we will look at simultaneous cooperative games.

Game theory has developed solution concepts to analyze these games. One way to search for a solution is by identifying the best response of an agent. This is the best outcome for an agent, given that the other player's action is known. When both players play their best response we obtain a Nash equilibrium. This equilibrium is reached when no player can improve its reward by choosing another strategy given that the other players do not change their strategy, so no agent has an incentive to unilaterally deviate anymore. However, a Nash equilibrium does not need to be the optimal strategy. An example is given in Table 2, where (Hare, Hare) is a Nash equilibrium because for each player individually switching to Stag makes that agent worse off.

              Player Y
              Stag     Hare
Player X Stag (4, 4)   (1, 3)
         Hare (3, 1)   (2, 2)

Table 2: Stag Hunt pay-off matrix

However, we want to play (Stag, Stag) in Table 2, the so-called Pareto-optimal Nash equilibrium, because it gives a higher reward for both agents. Pareto-optimality is defined as a joint-policy for which it is impossible to find a better joint-policy without making at least one agent worse off. There can be multiple Nash equilibria in a particular game, and we want to avoid sub-optimal Nash equilibria which are not Pareto-optimal.
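To make these solution concepts concrete, the sketch below enumerates the pure-strategy Nash equilibria of the Stag Hunt in Table 2; the code and its structure are our own illustration, not part of the thesis.

```python
import numpy as np

# Pay-off matrices for the Stag Hunt (row player X, column player Y); actions: 0 = Stag, 1 = Hare.
R_X = np.array([[4, 1], [3, 2]])
R_Y = np.array([[4, 3], [1, 2]])

def pure_nash_equilibria(rx: np.ndarray, ry: np.ndarray):
    """Return all joint actions where neither player can gain by unilaterally deviating."""
    equilibria = []
    for i in range(rx.shape[0]):
        for j in range(rx.shape[1]):
            best_row = rx[i, j] >= rx[:, j].max()  # X cannot do better against column j
            best_col = ry[i, j] >= ry[i, :].max()  # Y cannot do better against row i
            if best_row and best_col:
                equilibria.append((i, j))
    return equilibria

print(pure_nash_equilibria(R_X, R_Y))  # [(0, 0), (1, 1)]: (Stag, Stag) and (Hare, Hare)
```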

2.3 Multi-agent Reinforcement Learning

The games discussed in the previous section had one state, namely a single pay-off matrix. This can be extended to multiple states, where the joint-actions of the agents lead to a next state. This so-called stochastic game generalizes Markov decision processes and repeated games. The stochastic game setting can be formalized like we did with MDPs. A stochastic game is defined by the tuple ⟨N, S, {A_i}_{i∈N}, T, {R_i}_{i∈N}, γ⟩, where N = {1, ..., N} is the set of agents with N > 1, S represents all the possible states, and A_i represents the possible actions of agent i. T is the transition probability, which can be written as P(s′|s, a) like in MDPs; however, a is now the joint-action of all agents, a ∈ A, where A = A_1 × · · · × A_N. After a transition, a reward r = R(s, a, s′) ∈ ℝ is received, and γ is the discount factor used.

A natural way to solve these games is by using reinforcement learning. However, this also introduces some difficulties, which we will explain in more detail. Most of these difficulties occur (more strongly) for independent learners (Claus and Boutilier, 1998), who only observe their own local action. This is in contrast to joint-action learners, who can observe the actions taken by all agents. Observing all actions is a strong assumption in the real world, and therefore our focus will be on independent learners. Note that another popular paradigm in MARL is centralized training with decentralized execution (Lowe et al., 2017; Foerster et al., 2018). In this setting agents can communicate and observe the actions of other agents during training but are independent learners during execution. The main advantage is more information during training. However, even during training this can be hard to realize, and it is more common when a simulator is available (Hernandez-Leal et al., 2019). Therefore, we do not want to make any assumptions regarding communication and will use decentralized training and execution, choosing the same approach as hysteretic and lenient learning as well as the algorithms discussed in related work. We will now introduce the common difficulties often occurring in multi-agent reinforcement learning (MARL) with independent learners.

2.3.1 Stochasticity

One difficulty that already occurs in single-agent RL is stochasticity. This introduces noise in the value function and thus makes it harder to estimate. There are multiple forms of stochasticity; we will explain three common ones. First, an agent can have a stochastic policy π(a|s), which is a distribution over actions. In contrast to the next two sources of stochasticity, we do have control over which policy we choose; however, we always need some exploration, so this stochasticity is inevitable. Second, the obtained rewards can be stochastic. After an agent takes an action and ends up in a particular state, it gets a reward sampled from a distribution rather than a single possible reward. For example, this could be represented by Table 3, where for the joint-action (B, B) the agents get a reward of 0 with 50% probability and a reward of 4 with 50% probability. Lastly, there is stochasticity in the transitions. In this case, the transition probabilities P(s′|s, a) are, at least for some transitions, not deterministic. Thus a particular state s and action a could lead to a different next state s′. This could also be represented by the partially stochastic matrix game given in Table 3, where for the joint-action (B, B) there is a 50% probability the agents will end up in another pay-off matrix where the reward is 0, and a 50% probability the agents will be in a pay-off matrix where the reward is 4. A more intuitive and less abstract example of stochastic transitions is a robot that with each move has a certain probability of slipping, which results in a different state than if it does not slip. For all matrix games shown in this work, we will refer to reward stochasticity when we depict (a/b, a/b) in a pay-off matrix. Moreover, when we mention environment stochasticity we will also refer to this reward stochasticity. These three sources of stochasticity make it harder to learn the Q-values we are trying to estimate, since there is variance in the received returns.

           Player Y
           A          B
Player X A (1, 1)     (3, 0)
         B (0, 3)     (0/4, 0/4)

Table 3: Partially stochastic pay-off matrix
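A tiny sketch of how rewards could be sampled from the partially stochastic pay-off matrix of Table 3 is shown below; the encoding of actions as 0 = A and 1 = B and the function name are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_rewards(a_x: int, a_y: int) -> tuple:
    """Rewards (r_x, r_y) for a joint action in Table 3; the entry (B, B) is stochastic."""
    if (a_x, a_y) == (1, 1):
        r = float(rng.choice([0.0, 4.0]))  # 0 or 4, each with 50% probability, same for both players
        return r, r
    deterministic = {(0, 0): (1.0, 1.0), (0, 1): (3.0, 0.0), (1, 0): (0.0, 3.0)}
    return deterministic[(a_x, a_y)]
```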

2.3.2 The changing actions of other agents

Another source of variance in the value estimation comes from the fact that an agent can not see the actions of the other agents. Thus the current agent may receive a different return depending on which actions the other agents play. This variance is mainly caused by two factors. First, as described in the exploration-exploitation trade-off, agents need to explore, which injects a bit of randomness into the policies of the agents and thus adds to the variance of the received returns. Second, the policies of the other agents will evolve, since each agent will try to improve its own policy. As a result, the environment appears to be non-stationary from the perspective of a single agent. Moreover, this non-stationarity leads to a loss of the Markov property (the future state depends only on the present state and not on the states before it), which results in the loss of many convergence guarantees present in single-agent reinforcement learning (Sutton et al., 1998; Foerster et al., 2017). Experience Replay Memory can make this problem worse due to the sampling of outdated transitions during training (Foerster et al., 2017; Omidshafiei et al., 2017; Palmer et al., 2018). Transitions become outdated because the policies of the other agents change. Therefore, the same stored state, action, reward, next-state transition will not occur in the current situation anymore, and learning from it can lead to the wrong learning signal.

2.3.3 Relative overgeneralization

Relative overgeneralization is a problem that mainly arises because of the previously mentioned variance in return estimates, especially the exploration of teammates. Relative overgeneralization occurs in games where a sub-optimal Nash equilibrium gives a higher average return than the optimal Nash equilibrium when an action of an agent is paired with an arbitrary action of the other agent (Wei and Luke, 2016). This is called a shadowed equilibrium. A minimal example of relative overgeneralization is given by the deterministic Climb Game (Claus and Boutilier, 1998) in Table 4.

            Player 2
            A           B           C
Player 1 A  (11, 11)    (−30, −30)  (0, 0)
         B  (−30, −30)  (7, 7)      (6, 6)
         C  (0, 0)      (0, 0)      (5, 5)

Table 4: Deterministic Climb Game

Here the Pareto-optimal Nash equilibrium is (A, A). If we assume that the two agents choose each action with equal probability (for exploration) and update using an average-based algorithm like equation 2, which for a single matrix game reduces to equation 4, then player 1 will choose action C as its best action.

Q(a) ← Q(a) + α [r − Q(a)] (4)

It will choose action C because \sum_{j} r(A, j) < \sum_{j} r(C, j) and \sum_{j} r(B, j) < \sum_{j} r(C, j), where j ∈ {A, B, C} and r(·, j) denotes the pay-off of the joint-action. The same holds for player 2; as a result the agents will end up in the shadowed equilibrium (C, C). If action B is still played with small probability, the agents will eventually end up in (B, B), which is Pareto-dominated by (A, A).
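To make this concrete, the sketch below computes the average pay-off of each of player 1's actions against a uniformly random teammate in the deterministic Climb Game, reproducing the relative overgeneralization argument; the code itself is our own illustration.

```python
import numpy as np

# Player 1's pay-offs in the deterministic Climb Game (rows: own action A, B, C; columns: teammate action A, B, C).
payoff = np.array([
    [ 11, -30,   0],   # A
    [-30,   7,   6],   # B
    [  0,   0,   5],   # C
])

# Expected return of each action against a teammate that explores uniformly.
average_return = payoff.mean(axis=1)
print(dict(zip("ABC", average_return)))  # A: -6.33, B: -5.67, C: 1.67 -> C looks best on average
```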

2.3.4 Miscoordination

Another common problem that occurs in matrix games is miscoordination, where two agents who both choose a good-looking action end up with a poor joint-action. For example, in Table 5 both actions A and B are good actions to play, but if player 1 plays A and player 2 plays B this results in a bad outcome, sometimes referred to as the equilibrium selection problem. Miscoordination can also be seen as a sub-optimal action of one agent given the action of another agent.

            Player 2
            A         B
Player 1 A  (10, 10)  (0, 0)
         B  (0, 0)    (10, 10)

Table 5: Pay-off matrix where miscoordination can occur

2.4 Independent learner baselines

Normal Q-learning fails in the MARL setting because of stochasticity, the changing actions of other agents, relative overgeneralization and miscoordination. This makes it hard to assess how good an action is and can lead to poor value estimates. Ideally, if we encounter miscoordination, we do not want to update the value function. However, if the agents do choose the optimal actions we want to update the value function to learn environment stochasticity.

Both hysteretic learning (Matignon et al., 2007) and lenient learning (Wei and Luke, 2016) try to overcome some of these difficulties and are mainly focused on increasing the convergence rate to optimal joint-policies. Recently, both hysteretic and lenient learning have been scaled to large state spaces using DDQNs, namely Hysteretic-DDQN (Omidshafiei et al., 2017) and Lenient-DDQN (Palmer et al., 2018).


2.4.1 Hysteretic learning

Hysteretic learning mainly tackles the problem of miscoordination. It does this by having a lower learning rate β for negative updates. Negative updates are Q-value updates where the TD-error is smaller than 0, i.e. δ < 0:

Q(s, a) \leftarrow \begin{cases} Q(s, a) + \beta\delta & \text{if } \delta < 0 \\ Q(s, a) + \alpha\delta & \text{otherwise} \end{cases} \qquad (5)

In this way, the agent gets punished less by miscoordination. However, the algorithm performs worse when environment stochasticity is added since it will always use a low learning rate for negative updates no matter if they come from miscoordination, or are just the effect of stochasticity. Therefore, the value function is always updated more for positive updates even when we only encounter environment stochasticity. This gives rise to an overestimation bias.
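A minimal sketch of the hysteretic update of equation (5) for a single matrix-game action is given below; the code and the chosen constants (with α > β) are our own illustration.

```python
import numpy as np

alpha, beta = 0.1, 0.01   # learning rates for positive and negative updates (alpha > beta)
Q = np.zeros(3)           # Q-values for actions A, B, C in a single-state game

def hysteretic_update(a: int, r: float) -> None:
    """Update Q[a] with a reduced learning rate when the TD-error is negative."""
    delta = r - Q[a]      # TD-error for a single-state game (cf. equation 4)
    Q[a] += (beta if delta < 0 else alpha) * delta
```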

2.4.2 Lenient learning

Lenient learning tries to solve the problem of overestimating environment stochasticity and relative overgeneralization by initially not performing negative updates and gradually increasing the probability of performing these negative updates. This leads to optimism in the value function at the beginning, since we mostly perform positive value updates. In this way, the agents learn the value of the best actions without the noise of sub-optimal actions. Lenient learning then increases the probability of doing negative updates through a temperature decaying schedule, which allows the agents to learn the stochasticity of the environment, see equation 6.

Q(s, a) \leftarrow \begin{cases} Q(s, a) + \alpha\delta & \text{if } \delta \geq 0 \text{ or } \mathrm{rand} < 1 - e^{\frac{-1}{\theta T(s, a)}} \\ Q(s, a) & \text{else} \end{cases}

T(s, a) \leftarrow d_T \times \begin{cases} (1 - \tau)\, T(s, a) + \tau\, \bar{T}(s') & \text{if } s' \text{ is not the end state (if any)} \\ T(s, a) & \text{else} \end{cases} \qquad (6)

where θ and τ are tunable parameters, d_T is the temperature decay and rand is a randomly sampled number from 0 to 1, e.g. rand ∼ U([0, 1]). The idea is that at the beginning agents change their policies often, which makes it likely an agent gets punished by sub-optimal actions of teammates. Therefore, those sub-optimal actions are ignored initially, which makes the agents more lenient towards each other. However, this does introduce an overestimation bias if the environment is stochastic. Therefore, the temperature will be reduced, thereby increasing the probability of negative updates to also learn the environment stochasticity. Each state-action pair has its own temperature, which goes down for every visit of this state. For frequently visited states it goes down rapidly, while for unexplored states the agents stay lenient. The assumption is that when the probability of negative updates gets high, the agents already have stable enough policies for those states, so the problem of relative overgeneralization will be significantly reduced. A downside of lenient learning is its tunable parameters, especially the lenient moderation factor θ, which is problem specific and needs to be tuned separately for every problem. Furthermore, due to the temperature decaying schedule, convergence can take longer than for comparable algorithms. Lastly, the agents will in the end become more vulnerable again to miscoordination, which can cause problems.
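A sketch of the lenient update of equation (6) for a single-state game is given below, with the temperature bookkeeping simplified to one temperature per action (there is no next state to fold temperatures from); the code and constants are illustrative and not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, theta, d_T = 0.1, 1.0, 0.995   # learning rate, moderation factor, temperature decay
Q = np.zeros(3)                       # Q-values for actions A, B, C
T = np.full(3, 50.0)                  # leniency temperature per action

def lenient_update(a: int, r: float) -> None:
    """Apply a negative update only with a probability that grows as the temperature cools down."""
    delta = r - Q[a]
    accept_negative = rng.random() < 1.0 - np.exp(-1.0 / (theta * T[a]))
    if delta >= 0 or accept_negative:
        Q[a] += alpha * delta
    T[a] *= d_T                       # cool down the temperature for this action
```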

2.5 Distributional reinforcement learning

In section 2.1 we have introduced Q-learning and DDQNs, where we learned a Q-value for every state-action pair, either through storing the Q-value explicitly in a table or by learning a function that can map state-action pairs to Q-values. This Q-value represents the expected return for a particular state-action pair. However, instead of learning the expected return, we could also learn the full return distribution Z(s, a) for a state-action pair. The expectation of this distribution will give the original Q-value, E[Z(s, a)] = Q(s, a).

Distributional reinforcement learning was originally used for specific purposes, for example in Bayesian Q-learning (Dearden et al., 1998) to capture uncertainty, or to control risk by avoiding large but rare negative returns (Morimura et al., 2010). However, recently Rowland et al. (2018) argued that the value distribution could be useful in normal reinforcement learning as well (Bellemare et al., 2017). In their case, a categorical distribution was used to learn the return distribution. However, a big disadvantage is that you have to specify the minimum return, maximum return, and the number of bins. This led to several improvements: a quantile regression network was proposed (Dabney et al., 2018b) with a discrete set of quantiles. Implicit Quantile Networks (Dabney et al., 2018a) improved on this by learning, instead of a discrete set of quantiles, a continuous mapping from probabilities to returns, using a base distribution such as U([0, 1]) to sample the probabilities. Finally, this led to fully parametrized quantile functions (Yang et al., 2019), where this base distribution is learned as well.


These advancements are useful since we want to use these return distributions to compare the current return distribution with the target return distribution using a similarity metric. Therefore, methods that produce an accurate estimate of the return distribution are helpful. As we will see, Likelihood Hysteretic Implicit Quantile Networks, which will be explained in section 3, uses Implicit Quantile Networks to estimate return distributions. However, estimating a return distribution is not the part we will improve upon, and any algorithm that can reasonably estimate a return distribution is good enough for the purpose of our work. Therefore, we will not go into more detail and we refer to the above mentioned papers for an overview of the field.

3 Related work

Two methods that improve over both Hysteretic-DDQN and Lenient-DDQN (see section 2.4) are Negative Update Intervals-DDQN (NUI-DDQN) (Palmer et al., 2019) and Likelihood Hysteretic Implicit Quantile Networks (LH-IQN) (Lyu and Amato, 2020). Both methods are effective in overcoming the aforementioned difficulties with environment stochasticity, changing actions of other agents, and relative overgeneralization in an independent learning setting. This leads to either an improved sample efficiency, an increase in performance, or both. More importantly, they do this by separating miscoordination from environment stochasticity. These two methods will be explained in more detail below because they are the current state-of-the-art and also try to separate sub-optimal actions from environment stochasticity. Another method worth mentioning, because it is closely related to Lenient-DDQN and also tries to tackle the aforementioned difficulties, is Weighted Double DQN (Zheng et al., 2018). Weighted Double DQN scales lenient learning (Wei and Luke, 2016) to large state spaces by extending weighted double Q-learning (Zhang et al., 2017) with a DQN and introducing a lenient reward network. However, it was only compared to the plain DDQN and lenient learning (Wei and Luke, 2016), was not tested in the same environments as NUI-DDQN and LH-IQN, and was published earlier. Hence, we assume it is not the current state-of-the-art. Moreover, it does not attempt to explicitly separate environment stochasticity from miscoordination, which is the approach our work explores. Therefore, we will not go into more detail about Weighted Double DQN.

3.1 Negative Update Intervals

NUI-DDQN (Palmer et al., 2019) limits the negative impact of miscoordination by discarding episodes from the Experience Replay Memory (ERM) that are likely the result of miscoordination. It identifies these miscoordinated episodes by determining an interval [r_u^min, r_u^max] inside which miscoordination does not occur, so that negative value updates can safely be performed. This assumes that for trajectories where miscoordination does occur, the received cumulative rewards R_τ are lower than r_u^min. The interval is determined by first initializing r_u^min and r_u^max during a random exploration period. Next, a cumulative reward R_τ is obtained by performing an episode, which consists of all the state-transition tuples (o_{t−1}, a_{t−1}, r_t, o_t). The episode is only included in the ERM if the cumulative reward R_τ of that trajectory is larger than the threshold max(r_u^min, R̄_u − SD_{R_u}), where R_u is the collection of all cumulative rewards and R̄_u and SD_{R_u} are respectively the mean and standard deviation of the obtained cumulative rewards. By discarding these low-cumulative-reward trajectories, they prevent the negative impact of miscoordination. Lastly, r_u^min is slowly decayed, but only when the cumulative reward R_τ is sufficiently high: R_τ ≥ r_u^max − ε. Here, r_u^max indicates the highest cumulative reward R_τ received so far and ε is a small constant. The reason behind this restrictive decay is to avoid premature decaying as a result of exploratory teammates. This decaying expands the interval in which trajectories are allowed to be stored in the ERM and ensures the algorithm adapts the interval to better separate miscoordination from environment stochasticity. Thus it keeps discarding miscoordinated trajectories, in contrast to leniency, which after sufficiently cooled-down temperatures will again be susceptible to miscoordination.
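A rough sketch of the episode-filtering and decay rule described above is given below for a single action u. The class name, attributes and in particular the multiplicative decay of r_min are placeholders of our own; the exact decay schedule is not specified here.

```python
import numpy as np

class NegativeUpdateInterval:
    """Tracks [r_min, r_max] for one action u and decides which episodes enter the replay memory."""

    def __init__(self, r_min: float, r_max: float, decay: float = 0.995, eps: float = 0.01):
        self.r_min, self.r_max = r_min, r_max   # initialized during a random exploration period
        self.decay, self.eps = decay, eps
        self.returns = []                        # collection R_u of observed cumulative rewards

    def store_episode(self, episode_return: float) -> bool:
        """Return True if the episode should be stored in the ERM."""
        self.returns.append(episode_return)
        self.r_max = max(self.r_max, episode_return)
        threshold = max(self.r_min, np.mean(self.returns) - np.std(self.returns))
        keep = episode_return >= threshold       # discard likely-miscoordinated, low-return episodes
        if episode_return >= self.r_max - self.eps:
            self.r_min *= self.decay             # one possible way to decay r_min slowly
        return keep
```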

The authors show that negative update intervals increase the likelihood of converging to an optimal joint-policy compared to Lenient-DDQN and Hysteretic-DDQN in a temporally extended, partially observable version of the Climb Game called the Apprentice Fireman Game (AFG). These algorithms have been compared in two layouts, layout 1 and layout 2. Both layouts include stochastic transitions and partial observability, but in the first layout agents can see each other when making an irreversible decision, while in the second layout the agents are required to make the same irreversible decision without seeing each other. A key observation in layout 1 was that Lenient-DDQN and Hysteretic-DDQN agents could implicitly avoid miscoordination by observing each other during transitions. Layout 2 removed this option and, as a result, the performance of Lenient-DDQN and Hysteretic-DDQN declined. NUI-DDQN especially outperformed Lenient-DDQN and Hysteretic-DDQN in this last setting. As noted before, Hysteretic-DDQN failed in stochastic environments. Lenient-DDQN is better able to cope with stochastic environments. Nevertheless, Lenient-DDQN struggled in fully stochastic settings because lenient agents were faced with the difficult trade-off between being misled by stochastic rewards or using average utility estimates and falling into the trap of relative overgeneralization. To reduce this problem it had to decay the temperature values very slowly, requiring many more episodes than NUI-DDQN to achieve the same performance.

While NUI-DDQN outperformed both Lenient-DDQN and Hysteretic-DDQN, its threshold can only decay and could therefore still decay too fast if unfavorable stochastic rewards are obtained. For this reason, we want to incorporate an adaptive learning rate as LH-IQN does.

3.2 Likelihood Hysteretic Implicit Quantile Networks

LH-IQN (Lyu and Amato, 2020) is an extension of hysteretic learning that does use an adaptive learning rate for negative updates. Instead of using a fixed learning rate β (see equation 5), it uses the maximum of β and l, where l is the outcome of a similarity metric between two distributions called the Time Difference Likelihood (TDL). The TDL calculates the similarity between the return distribution of the current time step, the current distribution, and the target distribution, formed by the immediate reward and the return distribution of the next time step. These return distributions are estimated using Implicit Quantile Networks (Dabney et al., 2018a). In particular, Implicit Quantile Networks estimate the inverse cumulative density function (inverse cdf) F^{-1}(s_t, a_t) for a given state-action pair. To compare the return distribution of the current time step with that of the next time step, samples are drawn from these inverse cdfs. The first set of samples, the distribution samples, are drawn from the estimated inverse cdf of the current state. This is done by first sampling M samples from a distribution with support [0, 1], e.g. the uniform distribution U([0, 1]), and then passing these samples through the inverse cdf, d_{1:M} := F^{-1}(s_t, a_t). The second set of samples, the target samples, are drawn in the same way from the target distribution, t_{1:M'} := r_t + γ max_{a'} F^{-1}(s_{t+1}, a'); see equation 2 for the temporal difference target in the case of a single Q-value instead of a distribution. The TDL uses the samples d_{1:M} and t_{1:M'} to determine the similarity between the current step and the next Q-learning step. A low similarity indicates a changing distribution, which is likely the result of teammate exploration or miscoordination, and leads to a low learning rate for negative updates. This low learning rate makes sure the agent is not punished by sub-optimal teammate actions. On the other hand, similar return distributions are likely caused by environment stochasticity, and thus a high learning rate, similar to the learning rate for positive updates, is desired. This prevents the overestimation of the value function and correctly learns environment stochasticity. This approach has three advantages. First, the adaptive learning rate prevents punishment from sub-optimal teammate actions, thereby updating the value function more accurately. Second, automatically determining the learning rate avoids additional hyperparameters, which saves tuning time. Last, as empirically shown, the learning rate does increase over time for states that have received enough training, enabling better learning of environment stochasticity and more stable learning. These advantages led to an improved sample efficiency and an increased likelihood of converging to optimal joint-policies compared to both Hysteretic-DDQN and Lenient-DDQN.
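To make this mechanism concrete, the sketch below shows how such an adaptive negative learning rate could be computed from quantile samples. The inverse-cdf callables and the similarity function overlap_similarity are placeholders of our own; the actual Time Difference Likelihood of Lyu and Amato (2020) is defined differently and is analysed later in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_returns(inv_cdf, n: int) -> np.ndarray:
    """Draw n return samples by pushing uniform probabilities through an inverse cdf."""
    taus = rng.uniform(0.0, 1.0, size=n)
    return inv_cdf(taus)

def overlap_similarity(d: np.ndarray, t: np.ndarray) -> float:
    """Placeholder similarity in [0, 1]: overlap of the two empirical sample ranges."""
    lo, hi = max(d.min(), t.min()), min(d.max(), t.max())
    span = max(d.max(), t.max()) - min(d.min(), t.min())
    return max(0.0, hi - lo) / span if span > 0 else 1.0

def negative_learning_rate(inv_cdf_current, inv_cdf_next_best, r: float,
                           gamma: float = 0.95, beta: float = 0.01, m: int = 32) -> float:
    """Adaptive learning rate max(beta, l) for negative updates, as in LH-IQN."""
    d = sample_returns(inv_cdf_current, m)                 # samples from the current distribution
    t = r + gamma * sample_returns(inv_cdf_next_best, m)   # samples from the target distribution
    return max(beta, overlap_similarity(d, t))
```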

However, they did not explicitly experiment whether LH-IQN is efficient in distinguishing sub-optimal actions of other agents from environment stochasticity. Therefore, we plan to bridge this gap by comparing different similarity metrics in their efficiency of separating sub-optimal actions of other agents from environment stochasticity. After that, we will investigate whether an improved similarity metric leads to further improvements in both sample efficiency and the likelihood of converging to an optimal joint-policy.

4

Methodology

In this section, we will describe our approach to answering the three research questions mentioned in the introduction. For clarity, they are stated again.

Objective: Improve the sample efficiency and performance of cooperative multi-agent reinforcement learning in an independent learner setting by separating sub-optimal actions of other agents from environment stochasticity with the aid of a similarity metric which provides an improved cooperative learning target to update the value function.

• RQ1: Which similarity metrics are effective in distinguishing miscoordination from environment stochasticity?

• RQ2: What is the effect of the similarity metrics from RQ1 on the sample efficiency of lenient learning?

• RQ3: What is the effect of the similarity metrics from RQ1 on the performance of lenient learning?

We aim to meet the objective by consecutively answering the three research questions. First, we describe our approach to analysing hysteretic and lenient learning in the Climb Game to better understand the problems in multi-agent reinforcement learning. Second, we describe how we will answer the first research question by identifying potential similarity metrics that will later be compared both analytically and empirically in section 5.2. Last, the approach for the second and third research question is covered. We explain how we incorporate the similarity metric in the lenient learning algorithm and elaborate on the environments we use to compare lenient learning to lenient similarity learning. The experiments for the second and third research question are combined because only the performance metrics on which we evaluate the algorithms differ.

4.1

Analysis of hysteretic and lenient learning in the Climb Game

In the introduction and background, we briefly stated why hysteretic and lenient learning only partly solve the difficulties of miscoordination, relative overgeneralization, and environment stochasticity. To understand these problems better and show that these algorithms are indeed insufficient, we analyse the performance of hysteretic and lenient learning in the Climb Game. The experiments for this are carried out in section 5.1. First, we explore why Q-learning fails, in order to better understand why hysteretic and lenient learning were proposed. Second, we show why hysteretic learning performs better than Q-learning but still struggles with stochastic environments. Last, we cover why lenient learning is better able to deal with environment stochasticity but can still be vulnerable to miscoordination. We give an example of this vulnerability and show that it results in sub-optimal performance. This helps us understand why a similarity metric can be useful.

4.2

RQ1: Measuring the effectiveness of different similarity metrics in distinguishing miscoordination from environment stochasticity

In the following, we want to find similarity metrics that are effective in distinguishing miscoordination from environment stochasticity which can then be incorporated into lenient learning to achieve better performance and sample efficiency. To guide our search we formulate six desirable properties that we ideally want from a similarity metric. Next, six similarity metrics that we think will perform well on these desired properties are proposed. One of them is the Time Difference Likelihood (Lyu and Amato, 2020) which was used for the same purpose as we use it for, namely to separate miscoordination from environment stochasticity. The other five metrics are commonly used to measure either the similarity or difference between two distributions. We will later compare these similarity metrics both analytically and empirically in section 5.2.

4.2.1 Desirable properties of similarity metrics

To determine which properties we seek in a similarity metric, we first explain what we would like to measure. We want to compare the current return distribution $Z(s_t, a_t)$ to the target distribution $r_t + \gamma \max_{a'} Z(s_{t+1}, a')$. This is the same idea as is used in the Time Difference Likelihood, section 3.2. However, we do not necessarily have to use the inverse cumulative density function $F^{-1}(s, a)$, as Implicit Quantile Networks do, and can also use the probability density function of a return distribution. In its simplest form, the distribution consists of a set of samples, namely the obtained returns. In this case, we can construct a probability mass function by counting the number of occurrences of a particular return value and dividing by the number of samples, but the distribution could also take more complicated forms, like Implicit Quantile Networks, where a mapping is learned from [0, 1] to a return. In that case, we still need to sample from the learned distribution, since a direct comparison of the two distributions is not possible. Therefore, we assume that the two distributions either consist of a set of samples or can be sampled from to obtain such a set. Given these two sets of samples, we want to know how likely it is that one set of samples comes from the same distribution as the other set. Ideally, we want the similarity metric $l$ to be close to 1 when the samples come from the same distribution and close to 0 when the distributions are different. In essence, it represents the probability that the two distributions are similar.
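To make this sample-based view concrete, a minimal sketch (our own helper, not taken from any referenced implementation) of constructing such a probability mass function from observed returns could look as follows.

```python
import numpy as np

def empirical_pmf(returns):
    """Count how often each distinct return value occurs and normalise the counts."""
    values, counts = np.unique(np.asarray(returns, dtype=float), return_counts=True)
    return dict(zip(values, counts / counts.sum()))

# Example: five sampled returns for one state-action pair.
print(empirical_pmf([10.0, 10.0, -30.0, 10.0, 12.0]))
# e.g. {-30.0: 0.2, 10.0: 0.6, 12.0: 0.2}
```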

Now that the problem is clear, we can describe the desirable characteristics of the similarity metric. We describe six properties. The first important property is that we want the outcome of the similarity metric to be a probability and thus output values in the range [0, 1]. A second important property is that it is more important to not perform a negative update (so 0 for a dissimilar distribution) when a sub-optimal action occurs than it is to perform a negative update (1 for a similar distribution) when environment stochasticity occurs. In other words, it is more important to avoid false negatives (updating when no update is needed because miscoordination is received) than to avoid false positives (not updating when an update should be done because of environment stochasticity). Remember that we use the similarity metric to determine the probability of performing a negative update. This second property is beneficial because if an agent does perform a negative update from a sub-optimal action of the other agent, it will potentially underestimate its best action and consider another, sub-optimal, action as its best action. If the exploration rate is low, this means the agent will likely not explore the best action again. On the other hand, if the agent does not perform a negative update while an update was required because of environment stochasticity, it will overestimate this action. The action will then be chosen more often, causing the agent to update it more, which leads to a correct estimation of its value. This correct estimation lets the agent discover that it is indeed a sub-optimal action, and the agent will switch to another action. Thus no harm is done and the optimal joint-action can still be found.


Therefore, we want the metric to be sensitive in detecting small differences between the two distributions so it can avoid performing a negative update when miscoordination occurs, i.e. a false negative. A third desirable property is symmetry: suppose in one situation the current distribution is A and the target distribution is B, while in another situation the current distribution is B and the target distribution is A. We then want to obtain the same similarity probability in both situations, because the difference between the distributions did not change; they were merely presented to the agent in a different order. Fourth, we ideally want the similarity metric to work robustly for a range of different kinds of distributions, e.g. not just for a normal distribution but also for multi-modal or other types of distributions. Fifth, we want the metric to be independent of the scale of the rewards. If all rewards in an environment are scaled by a factor of two, the optimal joint-policy does not change; hence, we want our similarity metric to also give the same output. Lastly, we want the metric to be computationally efficient; if it takes a lot of time to compute, it will slow down the algorithm. For clarity, the six properties are listed below.

1. $l \in [0, 1]$

2. More important to avoid false negatives (updating when no update is needed) than to avoid false positives (not updating when an update is needed)

3. Symmetry: $l(A, B) = l(B, A)$

4. Works robustly for a range of distributions

5. Independent of the scale of rewards: $l(A, B) = l(2A, 2B)$

6. Computationally efficient

Given these properties, we can search for similarity metrics that have most or all of them. This gives us an indication of how well a metric will perform. However, in practice, the best metric will likely depend on the specific problem setting and the specific distributions that occur. Therefore, we will later experiment in different kinds of settings to see which metric performs well in which setting and whether there are metrics that perform well in a wide range of settings.
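As a rough illustration, properties 1, 3, and 5 could be probed empirically for any candidate metric with a small check such as the sketch below, where `metric(a, b)` is a hypothetical function returning the similarity of two sample sets (our own harness, not part of any referenced implementation).

```python
import numpy as np

def check_basic_properties(metric, trials=100, tol=1e-8, rng=None):
    """Empirically probe property 1 (range), 3 (symmetry) and 5 (reward-scale independence)."""
    rng = np.random.default_rng(0) if rng is None else rng
    in_range = symmetric = scale_free = True
    for _ in range(trials):
        a = rng.normal(size=50)
        b = rng.normal(loc=rng.uniform(-2, 2), size=50)
        l_ab, l_ba = metric(a, b), metric(b, a)
        in_range &= 0.0 <= l_ab <= 1.0                       # property 1
        symmetric &= abs(l_ab - l_ba) < tol                  # property 3
        scale_free &= abs(metric(2 * a, 2 * b) - l_ab) < tol  # property 5
    return {"range [0,1]": bool(in_range),
            "symmetric": bool(symmetric),
            "reward-scale independent": bool(scale_free)}
```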

4.2.2 Identifying and proposing similarity metrics

These desirable properties give us a way to compare different similarity metrics and allow us to evaluate which metric can best distinguish whether two sets of samples are similar or dissimilar. The problem of detecting whether two sets of samples come from different distributions is not new. There are mainly two types of methods we can use. First, there is a range of statistical tests that measure whether two sets of samples come from different distributions. For example, Student's t-test (Student, 1908) tests whether the means of two distributions are equal, and Pearson's chi-squared test (Pearson, 1900) is used to detect whether two sets of categorical data differ. However, many of these tests apply only to normal distributions or make other assumptions that are not desirable. The second type of method are distance metrics like the Jensen-Shannon distance (Endres and Schindelin, 2003) (the square root of the Jensen-Shannon divergence (Wong and You, 1985)), the earth mover's distance (Levina and Bickel, 2001), the Hellinger distance (Hellinger, 1909), or the Bhattacharyya distance (Bhattacharyya, 1943), all used for measuring the difference between two distributions. A third option would be to design a similarity metric ourselves that fulfills these properties.

We now propose the similarity metrics that we expect to perform well on the desired properties discussed previously; see table 6 for the proposed metrics. These similarity metrics are the ones that will be compared against each other in research question 1, section 5.2. However, only properties 1, 3, and 5 are given in the table because they can be assessed directly from the formula of the similarity metric itself and can be expressed in a binary way (the metric does or does not have the property). The other three properties have to be verified empirically, which is done in section 5.2. Moreover, we will elaborate further on property 5, independence of the reward scale, in section 5.2. Additionally, a more thorough analysis of the Time Difference Likelihood will be given in section 5.2.1 to point out some of its shortcomings.

The first similarity metric in the table is the two-sample Kolmogorov-Smirnov statistic (Smirnov and Smirnov, 1939). The associated test is non-parametric and tests whether two underlying distributions are different without making assumptions about those distributions. We will not use the test itself but only the statistic used in the test. This statistic measures the largest absolute difference between two empirical cumulative density functions, see equation 7. It is the only test statistic we use; other statistics often assume a normal distribution or make other undesirable assumptions. The Kolmogorov-Smirnov statistic is defined as

$l_{KS} = \sup_x \left| F_1(x) - F_2(x) \right| \qquad (7)$

where $F_1(x)$ and $F_2(x)$ are the two empirical cumulative density functions. The advantage of using cumulative density functions is that they can be constructed directly from the samples without binning, which keeps the statistic independent of the scale of the rewards.
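A minimal sample-based sketch of equation 7 (our own implementation, evaluating the two empirical cdfs on the pooled samples) could look as follows.

```python
import numpy as np

def ks_statistic(samples_1, samples_2):
    """Two-sample Kolmogorov-Smirnov statistic: sup_x |F1(x) - F2(x)| (equation 7)."""
    x = np.sort(np.concatenate([samples_1, samples_2]))   # pooled evaluation points
    f1 = np.searchsorted(np.sort(samples_1), x, side="right") / len(samples_1)
    f2 = np.searchsorted(np.sort(samples_2), x, side="right") / len(samples_2)
    return float(np.max(np.abs(f1 - f2)))

# Identical sample sets give 0, fully disjoint sample sets give 1.
print(ks_statistic(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0])))   # 0.0
print(ks_statistic(np.array([0.0, 0.1]), np.array([5.0, 6.0])))             # 1.0
```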


Similarity metric             | Range [0,1] | Symmetric | Reward scale independent
Kolmogorov-Smirnov statistic  | ✓           | ✓         | ✓
Earth mover's distance        | ✗           | ✓         | ✗
Jensen-Shannon distance       | ✓           | ✓         | ✗
Hellinger distance            | ✓           | ✓         | ✗
Overlapping coefficient       | ✓           | ✓         | ✗
Time Difference Likelihood    | ✓           | ✗         | ✓

Table 6: Proposed metrics and the directly identifiable desired properties.

The next three proposed similarity metrics, the earth mover's distance (EMD), the Jensen-Shannon distance, and the Hellinger distance, are all distance metrics. Distance metrics measure the opposite of what we want to measure, namely a distance instead of a similarity: their output is 1 if the distributions are completely different. However, this can be converted into a similarity metric by using 1 − D, where D is the output of the distance metric. The chosen distance metrics are all symmetric and used to measure the difference between two distributions, so they are a good fit for our application. However, they all need to construct a probability mass function from samples, which, in our case, is done by dividing the samples into bins and dividing the number of samples in each bin by the total number of samples. This creates the need to choose the number of bins, which determines the bin size and thereby makes the metrics partially dependent on the scale of the rewards. This is a minor drawback of the three distance metrics. Moreover, the EMD has the additional drawback that its output is not bounded between 0 and 1. This can be circumvented by capping or scaling the output so it falls in this range, but that is not ideal.
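A minimal sketch of the binning step and the 1 − D conversion described above (our own helper names and an arbitrary default of 10 bins; not taken from any referenced implementation):

```python
import numpy as np

def binned_pmfs(samples_1, samples_2, num_bins=10):
    """Build two probability mass functions over a shared support by binning."""
    pooled = np.concatenate([samples_1, samples_2])
    edges = np.linspace(pooled.min(), pooled.max(), num_bins + 1)
    p, _ = np.histogram(samples_1, bins=edges)
    q, _ = np.histogram(samples_2, bins=edges)
    return p / p.sum(), q / q.sum()

def distance_to_similarity(distance):
    """Convert a distance D (ideally in [0, 1]) into a similarity 1 - D."""
    return 1.0 - float(np.clip(distance, 0.0, 1.0))   # clipping handles unbounded distances such as the EMD
```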

In the special case of two 1-dimensional sets of samples, the calculation of the EMD is straightforward. Intuitively, it can be seen as how much earth needs to be moved to turn one distribution into the other. Here $p(x)$ and $q(x)$ are the two distributions, see equation 8, hereby assuming that $\|p(x)\| = \|q(x)\|$, i.e. both sets contain equally many samples, and that both sets of samples are sorted.

$D_{EMD} = \sum_{x \in p(x),\, y \in q(y)} |x - y| \qquad (8)$
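A sketch of equation 8 for two equally-sized sets of 1-dimensional samples (our own implementation, sorting both sample sets first) could look as follows.

```python
import numpy as np

def emd_1d(samples_1, samples_2):
    """Earth mover's distance between two equally-sized, sorted 1-D sample sets (equation 8)."""
    x = np.sort(np.asarray(samples_1, dtype=float))
    y = np.sort(np.asarray(samples_2, dtype=float))
    assert len(x) == len(y), "equation 8 assumes equally many samples in both sets"
    return float(np.sum(np.abs(x - y)))

print(emd_1d([0.0, 1.0, 2.0], [0.0, 1.0, 5.0]))   # 3.0
```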

The Jensen-Shannon distance (JSD), the square root of the Jensen-Shannon divergence, is given by equation 9:

$D_{JSD}(p \,\|\, q) = \sqrt{\tfrac{1}{2} D_{KL}(p \,\|\, m) + \tfrac{1}{2} D_{KL}(q \,\|\, m)}, \quad \text{with } m = \tfrac{1}{2}(p + q) \qquad (9)$

where $D_{KL}$ is the Kullback-Leibler divergence $D_{KL}(p \,\|\, q) = \sum_x p(x) \log\!\left(\frac{p(x)}{q(x)}\right)$.
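A sketch of equation 9 for two probability mass functions defined over the same bins (our own implementation; base-2 logarithms are assumed so that the distance stays in [0, 1], and a small epsilon guards the logarithm) could look as follows.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for discrete distributions over the same bins (base-2 logarithm)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2((p + eps) / (q + eps))))

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (equation 9)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return float(np.sqrt(0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)))

print(jensen_shannon_distance([0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))   # ~0.707
```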

Next, the Hellinger distance (HD) is given by equation 10:

$D_{HD}(p, q) = \sqrt{1 - BC(p, q)}, \quad \text{where } BC \text{ is the Bhattacharyya coefficient } BC(p, q) = \sum_x \sqrt{p(x)\, q(x)} \qquad (10)$
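Equation 10 can be sketched for binned probability mass functions as follows (our own implementation).

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance via the Bhattacharyya coefficient (equation 10)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    bc = np.sum(np.sqrt(p * q))                 # Bhattacharyya coefficient
    return float(np.sqrt(max(0.0, 1.0 - bc)))   # clamp against tiny numerical overshoot

print(hellinger_distance([0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))   # ~0.707 for partially overlapping pmfs
```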

The fifth metric we consider is the overlapping coefficient (OVL), equation 11. This metric calculates the percentage of overlap between two distributions. However, since we do not have the real distributions, we need to estimate the overlap from samples. We use the same binning method as described for the previous three metrics, thus creating a dependence on the scale of rewards since we have to choose the number of bins that determines the bin size.

$l_{OVL} = \sum_x \min\!\left[p(x), q(x)\right] \quad \text{(for discrete distributions)} \qquad (11)$

Finally, the last metric is the Time Difference Likelihood. Its equation is slightly more involved and will be given in the analysis of the Time Difference Likelihood, section 5.2.1.
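A sketch of equation 11 for two binned probability mass functions (our own implementation; note that, unlike the distance metrics, the overlap is already a similarity, so no 1 − D conversion is needed) could look as follows.

```python
import numpy as np

def overlapping_coefficient(p, q):
    """Overlap between two discrete probability mass functions (equation 11)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(np.minimum(p, q)))

print(overlapping_coefficient([0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))   # 0.5
```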

We identified these six similarity metrics as the most promising and will compare them more thoroughly in the experiments for research question 1, section 5.2. These metrics could then be implemented in the lenient learning algorithm.
