Linear and Non-Linear Reactive Strategies
in the Iterated Continuous Prisoner’s Dilemma
Thomas A. Unger 6132375
Bachelor Thesis (18 EC) Bachelor Kunstmatige intelligentie
University of Amsterdam Faculty of Science
Science Park 904 1098 XH Amsterdam
Supervisor Jan van Eijck
Centrum Wiskunde & Informatica Science Park 123
1098 XG Amsterdam
Abstract
This report describes an investigation into homogeneous populations of strategies for the Iterated Trader’s Dilemma, a continuous version of the Iterated Prisoner’s Dilemma. Specifically, a search was conducted for homogeneous populations of strategies that are viable, i.e., highly cooperative as well as stable against invasion by random strategies and mutants. Work done by Wahl and Nowak (1999a) informed the investigation of linear reactive strategies and was expanded into an investigation into non-linear strategies. Results indicate that viable strategies are those that reciprocate maximum investment with maximum investment, reciprocate minimum investment with a low investment (≤ 0.5) and assume that other players make a high investment on the first round. These results hold true for the investigated non-linear strategies, but non-linear strategies do not provide greater viability than linear strategies.
Contents
1 Introduction
2 Method
  2.1 Goals
  2.2 Strategies
  2.3 Payoff
  2.4 Stability
    2.4.1 Universal Stability
    2.4.2 Local Stability
  2.5 Hypothesis
3 Results
  3.1 Linear Strategies
  3.2 Non-Linear Strategies
4 Conclusion, Discussion and Future Work
1 Introduction
Natural selection tends to reward entities that optimize their own relative fitness and, by extension, tends to punish those who increase the relative fitness of others at the expense of their own. How can it be, then, that entities which engage in altruism not only exist, but thrive under natural selection? Axelrod and Hamilton (1981) attempted to answer this question using the Prisoner’s Dilemma (PD). To illustrate the Prisoner’s Dilemma, let us imagine the following scenario. Two hunter-gatherers live on the African savanna. One of them is very good at constructing bows, but does not know the first thing about constructing arrows. This leaves him unable to hunt animals for food. Meanwhile, the other hunter-gatherer has precisely the opposite problem: while he knows how to construct arrows, he does not know how to construct a bow to shoot them with. Clearly, these are fertile grounds for cooperation. Should the two agree to cooperate, they will both pay a cost of time and energy needed to construct something for the other. However, both will also benefit greatly; they will be able to hunt animals for food.
This situation can be modeled using a special case of the Prisoner’s Dilemma, called the Donation Game. Let us assume that the cost and benefit of cooperation is equal for the two “players” and that the cost c = 1 and the benefit b = 2.
                        Player A
                   Defect   Cooperate
Player B
   Cooperate         2          1
   Defect            0         −1

Table 1: the payoff matrix for player A in the Donation Game—a special case of the Prisoner’s Dilemma—with c = 1 and b = 2. Note that players A and B can be swapped, i.e., the payoff matrix for player B versus A is the same. The colors are to emphasize the payoffs: a green shade indicates a net gain, while red indicates a net loss.
Should both players cooperate, each gains a net payoff that equals b − c = 2 − 1 = 1. However, both players are tempted not to cooperate. After all, if player B constructs an item for player A and A does not reciprocate, A benefits from the other player’s work at no cost to himself (a net gain of b = 2 points), while player B has only cost and no benefit (a net gain of −c = −1 points). Both players also know that the other player is tempted to defect, leaving neither willing to cooperate (a net gain of 0 for both) due to the risk of being exploited, yet both players would have been better off if they had just cooperated. Herein lies the dilemma.
Returning to the example of the two hunter-gatherers, we can imagine that the situation is more complex. For example, if one of them constructs an item for the other, who does not reciprocate, he might develop a grudge and refuse to deal with him again. Once the donated item breaks or is lost, its owner is out of luck: he will not receive a new one. This can be modeled using the so-called Iterated Prisoner’s Dilemma (IPD), which is simply the repetition of the Prisoner’s Dilemma. In this game, each player decides whether to cooperate based on their shared history. Exactly how a player comes to this decision is called a strategy.
The Iterated Prisoner’s Dilemma was used by Axelrod and Hamilton (1981) to pit strategies against each other and see which of them gain the most points after some number of rounds. Surprisingly, a simple strategy called Tit-for-Tat proved the most effective. This strategy cooperates on the first move (when there is no shared history yet) and from then on simply copies the opponent’s previous move. In a follow-up, Axelrod (2006) described tournaments to which strategies in the form of computer programs were submitted by people from all over the world. Here again, the simple strategy of Tit-for-Tat proved more effective than far more intricate strategies.
Of course, in nature, the strategies that organisms employ were not submitted by designers, but rather arose through natural selection. In computer simulations, Axelrod et al. (1987) used genetic programming to investigate how strategies evolve under selection for those who gain the most points. Starting with an entirely random population of strategies, they found that cooperative strategies ultimately became dominant and drove defective strategies to extinction [1].
The Prisoner’s Dilemma requires either total defection or total cooperation from each player. Verhoeff (1998) suggested a continuous version of the PD, which he dubbed the Trader’s Dilemma (TD), in which each player can decide to cooperate by an amount between total defection (0) and total cooperation (1). This amount is also called an investment. Supposedly, this is a more realistic version of the dilemma as faced by organisms in the real world, since cooperation is rarely a black-and-white issue.

[1] However, Press and Dyson (2012) found that strategies exist that can defeat any simple evolutionary opponent. These Zero-Determinant (ZD) strategies can force a linear relationship between their own and their opponent’s number of points. In this way, they can extort any other strategy blindly optimizing for its own fitness. Stewart and Plotkin (2013) discuss ZD strategies and their evolutionary stability.
Figure 1: the payoff function for player A in the continuous Donation Game—a special case of the Prisoner’s Dilemma—with c = 1 and b = 2. Note that players A and B can be swapped, i.e., the payoff function for player B versus A is the same. The colors are to emphasize the payoffs: a green shade indicates a net gain, while red indicates a net loss.
Figure 1 shows the payoff of player A in the continuous Donation Game, analogous to table 1. Since investment is continuous, the payoff is likewise continuous. In this case, where c = 1 and b = 2, the payoff function is a flat plane. For our two hunter-gatherers, the continuous aspect of their investment may represent the quality of the item they devise, e.g., they may decide not to spend too much effort on construction (low investment), while still cooperating.
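The flat payoff plane of figure 1 follows directly from the cost and benefit: player A receives b times player B’s investment and pays c times his own. A minimal sketch (the function name is ours, not from the paper):

```python
def donation_payoff(x_a, x_b, c=1.0, b=2.0):
    """Payoff of player A in the continuous Donation Game: A pays cost
    c * x_a for its own investment and receives benefit b * x_b from
    the other player's investment."""
    return b * x_b - c * x_a

# The corner cases reproduce the discrete payoff matrix of table 1:
# (x_a, x_b) = (1, 1) gives b - c = 1, (0, 1) gives b = 2,
# (1, 0) gives -c = -1 and (0, 0) gives 0.
```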
Wahl and Nowak (1999a) investigated the Iterated Trader’s Dilemma (ITD), analogous to the IPD, and the evolution of “linear” strategies, so-called for the linear relationship between their investment and the other player’s previous investment [2]. In this case, it turns out that it is the defective strategies that ultimately drive cooperative strategies to extinction. This is the opposite of the results found by Axelrod in the discrete case [3].
[2] In a similar study, Wahl and Nowak (1999b) investigated the effect of noise on the evolution of linear strategies.
[3] There is hope yet for cooperative strategies in the ITD. Killingback et al. (1999) showed that if strategies are spatially distributed and only play against those closest to them, then cooperation can become the dominant strategy.
2 Method
2.1 Goals
In the ITD, a random population of strategies cannot be expected to evolve towards cooperation. However, there may be possible populations of strategies that are already cooperative and, more importantly, remain so in spite of invading or mutating strategies. The goal of this report is to identify populations that have these properties, specifically among the subset of homogeneous populations. In a homogeneous population, everyone uses the same strategy. There may be possible non-homogeneous cooperative populations that remain cooperative through, for example, interspecies mutualisms, but these populations are beyond the scope of this report.
There are three separate properties that are used to identify strategies. The first of these is payoff, which describes how many points the constituents of the population accumulate in the ITD game, played against each other. The second property is universal stability, which describes how stable the population is against invaders from across the strategic spectrum. The third and final property is local stability, which describes how stable the population is against strategies that differ slightly from the native strategy (i.e., mutated strategies). These three properties are each given as a single number on a continuous scale. Strategies that have a high value for all three properties are referred to as viable strategies.
2.2 Strategies
As in the work of Wahl and Nowak (1999a), this report considers strategies defined by a function S(x), which gives the investment of player A as a function of x, which is the investment of player B in the previous round. Since an investment must be a value in the interval [0, 1], S maps from [0, 1] to [0, 1]. Wahl and Nowak (1999a) considered strategies where S(x) is linear, parameterized by a slope k, an intercept d and a starting move defined by x0.
S(x) = 0        if kx + d < 0
S(x) = kx + d   if 0 ≤ kx + d ≤ 1
S(x) = 1        if 1 < kx + d
If x is not defined (i.e., there is no previous round), then S(x) = S(x0).
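The clamping of kx + d to the valid investment range can be sketched as follows; `linear_strategy` is a hypothetical helper, not code from the paper:

```python
def linear_strategy(k, d):
    """Returns the reactive strategy S(x) = kx + d, with the result
    clamped to the valid investment range [0, 1]."""
    def S(x):
        return min(1.0, max(0.0, k * x + d))
    return S

# k = 1, d = 0 reciprocates the previous investment exactly,
# a Tit-for-Tat-like strategy; out-of-range values are clamped.
tit_for_tat = linear_strategy(1.0, 0.0)
always_generous = linear_strategy(2.0, 0.5)
```

With x0 supplied separately, the first-round investment is simply S(x0).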
This report considers linear strategies under a slightly different parameterization, as well as non-linear strategies. Linear strategies are defined with the three parameters d0 = S(0) (i.e., d0 = d), d1 = S(1) (i.e., d1 = k + d) and x0 (the same as before).
When x is in the range (0, 1), S(x) is interpolated linearly between d0 and d1. This parameterization can be extended with additional parameters, for example a parameter d0.5 = S(0.5). In this case, S(x) would consist of two line segments; any value of S(x) with x in the range (0, 0.5) would be interpolated linearly between d0 and d0.5, while any value of S(x) with x in the range (0.5, 1) would be interpolated linearly between d0.5 and d1.
Figure 2: a visualization of two example strategies: (a) an example of a linear strategy S(x); (b) an example of a non-linear strategy S(x) with a parameter d0.5. The circles indicate S(x0), the starting move. Note that values above 1 and below 0 are mapped to 1 and 0, respectively.
In general, one can define n parameters d0, d1/(n−1), d2/(n−1), . . . , d(n−2)/(n−1), d1, which define n equally spaced points on the interval [0, 1]. In this way, S(x) is a piecewise function consisting of line segments. By increasing n, any continuous function S(x) from [0, 1] to [0, 1] can be approximated to an arbitrary degree. Note that n = 2 reduces to the linear case.
Linear strategies can be differentiated a priori into nine classes as a function of d0 and d1. These nine classes are visible in figure 3. Of particular note are the strategies in the upper right and lower left of the figure. In the upper right of the figure are the indiscriminate cooperators, where 1 ≤ d0 and 1 ≤ d1. These will cooperate fully, regardless of the other’s previous investment. A homogeneous population of such strategies is maximally lucrative, but possibly prone to being exploited by invaders and thus unstable. The lower left of the figure shows the other extreme. These are the indiscriminate defectors, where d0 ≤ 0 and d1 ≤ 0. These will never invest anything at all, regardless of the other’s previous investment. A homogeneous population of such strategies is minimally lucrative, but possibly very stable against any type of invader. For both these classes, the value of x0 does not affect their behavior. For the other seven classes, x0 does have an effect on behavior.
Figure 3: linear strategies, separated into nine classes as a function of d0 and d1.
While d0, d1 and x0 could have any value in R, only certain ranges of values are considered. For the linear strategies, these ranges are [−4, 5], [−4, 4] and [0, 1] for d0, d1 and x0, respectively. These ranges are equal to or otherwise exceed the ranges in Wahl and Nowak (1999a). For the non-linear strategies, these ranges are [−1, 2] for d0 and d1, and [0, 1] for d0.5 and x0. Since these ranges still allow an infinite number of strategies, they are discretized by sampling d0 and d1 every δ = 0.02, and d0.5 and x0 every 0.2, such that the bounding values of the ranges are included. The resulting finite sets of linear and non-linear strategies are henceforth referred to as L and N, respectively. A similar discretization is used by Wahl and Nowak (1999a).
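The discretization can be sketched as below; `frange` is a hypothetical helper that includes both bounds, as the text requires, and the size of L is derived from the stated step sizes:

```python
def frange(lo, hi, step):
    """Sample points every `step` from lo to hi, with both bounds included."""
    n = round((hi - lo) / step)
    return [lo + i * step for i in range(n + 1)]

d0_values = frange(-4.0, 5.0, 0.02)   # 451 samples for d0
d1_values = frange(-4.0, 4.0, 0.02)   # 401 samples for d1
x0_values = frange(0.0, 1.0, 0.2)     # 6 samples for x0
# |L| = 451 * 401 * 6 = 1,085,106 linear strategies
```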
2.3 Payoff
To find the payoff of strategies in a homogeneous population, each strategy plays against itself in a game of ITD that lasts for twenty rounds. The accumulated payoff is then divided by two and by the number of rounds, yielding the average payoff per constituent per round.
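This procedure can be sketched as follows; `play_itd` and `homogeneous_payoff` are our names, and the payoff constants c = 1, b = 2 are assumed from the Donation Game above:

```python
def play_itd(S_a, S_b, x0_a, x0_b, rounds=20, c=1.0, b=2.0):
    """Accumulated Donation Game payoffs of two strategies over `rounds`
    rounds of ITD. On the first round each player reacts to its assumed
    previous investment x0; afterwards, to the other's actual investment."""
    pay_a = pay_b = 0.0
    inv_a, inv_b = S_a(x0_a), S_b(x0_b)
    for _ in range(rounds):
        pay_a += b * inv_b - c * inv_a
        pay_b += b * inv_a - c * inv_b
        inv_a, inv_b = S_a(inv_b), S_b(inv_a)
    return pay_a, pay_b

def homogeneous_payoff(S, x0, rounds=20):
    """Average payoff per constituent per round in a homogeneous population:
    the strategy plays itself, and the accumulated payoff is divided by two
    and by the number of rounds."""
    pay_a, pay_b = play_itd(S, S, x0, x0, rounds)
    return (pay_a + pay_b) / 2.0 / rounds
```

A Tit-for-Tat-like strategy starting from full cooperation earns the mutual-cooperation payoff b − c = 1 per round.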
2.4 Stability
To measure the stability of each strategy against invasion, a way to determine whether one strategy invades another is needed. Such a way is provided by the techniques of adaptive dynamics, also known as evolutionary invasion analysis.
Let payoff(A, B) be the function that returns the accumulated payoff of strategy A against B in a game of ITD that lasts twenty rounds. Then strategy B invades the population of strategy A, if:
payoff(B, A) > payoff(A, A)
payoff(B, B) > payoff(A, B)
In other words, if strategy B gains a higher payoff against strategy A than A against itself, and strategy B gains a higher payoff against itself than strategy A against B, then B invades A. The first of these two rules ensures that strategy B can gain a foothold in a population that is still predominantly populated by A. The other rule ensures that once B has a sizeable enclave within the population, it can marginalize strategy A and ultimately drive it to extinction.
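The two invasion conditions can be sketched as follows, with strategies represented as callables plus their x0 parameter (the helper names are ours):

```python
def payoff(S_a, x0_a, S_b, x0_b, rounds=20, c=1.0, b=2.0):
    """payoff(A, B): accumulated payoff of strategy A against B in ITD."""
    pay = 0.0
    inv_a, inv_b = S_a(x0_a), S_b(x0_b)  # first round reacts to assumed x0
    for _ in range(rounds):
        pay += b * inv_b - c * inv_a
        inv_a, inv_b = S_a(inv_b), S_b(inv_a)
    return pay

def invades(S_b, x0_b, S_a, x0_a):
    """True if strategy B invades a homogeneous population of strategy A,
    i.e., both adaptive-dynamics conditions hold."""
    return (payoff(S_b, x0_b, S_a, x0_a) > payoff(S_a, x0_a, S_a, x0_a) and
            payoff(S_b, x0_b, S_b, x0_b) > payoff(S_a, x0_a, S_b, x0_b))
```

For example, an indiscriminate defector invades a population of indiscriminate cooperators, but not vice versa.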
2.4.1 Universal Stability
The stability of each strategy against invasion by a random strategy is measured as the probability that a strategy is not invaded by a random strategy. This probability is estimated using a Monte Carlo method, by testing each strategy in L and N for invasion (as per the established rules) against 100 randomly generated strategies. The number of these strategies that fail to invade is divided by 100, yielding an estimate of this strategy’s universal stability.
For the strategies in L, the parameters d0, d1 and x0 of the invading strategies are uniformly distributed in the ranges [−4, 5], [−4, 4] and [0, 1], respectively (note that these ranges are equal to the ranges of the strategies in L).
For the strategies in N, the parameters d0, d0.5, d1 and x0 of the invading strategies are uniformly distributed in the ranges [−1, 2], [0, 1], [−1, 2] and [0, 1], respectively (note that these ranges are equal to the ranges of the strategies in N). Also note that the invading strategies can have any real value within the given ranges.
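The Monte Carlo estimate for strategies in L can be sketched as follows; `payoff` and `invades` implement the invasion rules of section 2.4 and are inlined here so the sketch is self-contained:

```python
import random

def payoff(S_a, x0_a, S_b, x0_b, rounds=20, c=1.0, b=2.0):
    """Accumulated ITD payoff of strategy A against strategy B."""
    pay = 0.0
    inv_a, inv_b = S_a(x0_a), S_b(x0_b)
    for _ in range(rounds):
        pay += b * inv_b - c * inv_a
        inv_a, inv_b = S_a(inv_b), S_b(inv_a)
    return pay

def invades(S_b, x0_b, S_a, x0_a):
    """Both invasion conditions from section 2.4."""
    return (payoff(S_b, x0_b, S_a, x0_a) > payoff(S_a, x0_a, S_a, x0_a) and
            payoff(S_b, x0_b, S_b, x0_b) > payoff(S_a, x0_a, S_b, x0_b))

def universal_stability(S, x0, trials=100):
    """Fraction of uniformly random linear invaders that fail to invade."""
    failures = 0
    for _ in range(trials):
        d0 = random.uniform(-4.0, 5.0)
        d1 = random.uniform(-4.0, 4.0)
        inv_x0 = random.uniform(0.0, 1.0)
        invader = lambda x, d0=d0, d1=d1: min(1.0, max(0.0, d0 + (d1 - d0) * x))
        if not invades(invader, inv_x0, S, x0):
            failures += 1
    return failures / trials
```

A population of indiscriminate defectors cannot be invaded at all: any invader’s payoff against it is at most 0, never strictly greater than payoff(A, A) = 0, so its estimated universal stability is 1.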
2.4.2 Local Stability
The stability of each strategy against invasion by a mutant is measured as the probability that a strategy A is not invaded by another strategy, whose parameters are mutated from the parameters of A. This probability is also estimated using a Monte Carlo method, by testing each strategy in L and N for invasion against 100 randomly generated mutants. The number of mutants that fail to invade is divided by 100, yielding an estimate of this strategy’s local stability.
The process of mutation considers a strategy space with as many dimensions as there are parameters. In the case of linear strategies, there are three dimensions, for the parameters d0, d1 and x0. Each possible strategy occupies a point in strategy space. To effect a mutation, a random unit vector r in strategy space is generated (such that the directions of generated vectors are uniformly distributed), and then scaled by a scalar m, which is drawn from a normal distribution [4] with µ = 0 and σ = δ = 0.02. Given a strategy defined by the vector v in strategy space, a mutant is generated by calculating v + mr.
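The mutation step can be sketched as follows. Normalizing a vector of independent standard normals is a standard way to obtain a uniformly distributed direction; the function name is ours:

```python
import random

def mutate(params, sigma=0.02):
    """Mutates a strategy's parameter vector v by computing v + m*r, where
    r is a uniformly distributed random unit direction in strategy space
    and m is drawn from a normal distribution with mean 0 and sd sigma."""
    # independent standard normals, normalized, give a uniform direction
    r = [random.gauss(0.0, 1.0) for _ in params]
    norm = sum(c * c for c in r) ** 0.5 or 1.0  # guard against a zero vector
    m = random.gauss(0.0, sigma)
    return [v + m * c / norm for v, c in zip(params, r)]

# mutate a linear strategy with (d0, d1, x0) = (0.5, 1.0, 1.0)
mutant = mutate([0.5, 1.0, 1.0])
```

The distance between the mutant and the original is exactly |m|, so mutants cluster tightly around the parent strategy.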
2.5 Hypothesis
Since the payoff of a strategy is determined by letting the strategy play against itself, this payoff must be symmetrical (i.e., each copy of the strategy has the same payoff, meaning exploitation is impossible). Our hypothesis, therefore, is that the strategies that achieve the highest payoff lie in the region where 1 ≤ d1. Let us refer to this region as P.
Wahl and Nowak (1999a) found that when linear strategies are allowed to evolve, they evolve towards two regions (see figure 7 of that paper). The first region is where d0 ≤ 0 and d1 ≤ 0.5, which includes the indiscriminate defectors. The second region is where d0 ≤ 0.5 and d1 = 1. Let us call the union of these two regions Q. We suspect that these regions are stable universally and locally. Our hypothesis for viable linear strategies is the intersection of P and Q. This is the region where d0 ≤ 0.5 and d1 = 1.
Our hypothesis for viable non-linear strategies is the same as for linear strategies, i.e., that the value of d0.5 has no effect on which strategies are viable.
Wahl and Nowak (1999a) also concluded that for all cooperative strategies x0 = 1. Since only cooperative strategies have a high payoff, we hypothesize that viable strategies have a high value of x0.
3 Results
3.1 Linear Strategies
Figure 4 shows that our model correctly replicated results in the work of Wahl and Nowak (1999a).
[4] The justification for the use of the normal distribution is the assumption that mutation depends on a considerable number of variables that vary uniformly.
Figure 4: payoff as a function of d = d0 and k = d1 − d0, averaged over all values of x0 in the range [0, 1]: (a) original figure from Wahl and Nowak (1999a); (b) new results. Compare the results of Wahl and Nowak (1999a) with those of the new implementation.
Figures 5, 6 and 7 show the three properties of payoff, universal stability and local stability, respectively, averaged over x0. These graphs were combined into figure 8 as an RGB image, with each of the properties as one of the three color channels. Since viable strategies have a high value in all three properties, they appear as white. Figure 8 shows a grey patch in the top middle of the graph. A very small patch of white is visible at the point where d0 = 0.5 and d1 = 1.
Figures 9 through 12 show the same type of graphs, for specific values of x0. Note that, in figure 12, as x0 increases, viable strategies appear in the region where d0 ≤ 0.5 and 1 ≤ d1. Also note that, in this region, universal stability decreases as d1 increases.
Figure 5: payoff as a function of d0 and d1, averaged over all values of x0 in the range [0, 1]. This shows the same results as figure 4b, albeit under different parameters.
Figure 6: universal stability as a function of d0 and d1, averaged over all values of x0 in the range [0, 1].
Figure 7: local stability as a function of d0 and d1, averaged over all values of x0 in the range [0, 1].
Figure 8: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, averaged over all values of x0 in the range [0, 1].
Figure 9: payoff as a function of d0 and d1, for different values of x0.
Figure 10: universal stability as a function of d0 and d1, for different values of x0.
Figure 11: local stability as a function of d0 and d1, for different values of x0.
Figure 12: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for different values of x0.
3.2 Non-Linear Strategies
Figure 13 shows the combined results for different values of x0 (similarly to figure 12), averaged over d0.5. Note that viable strategies appear as x0 increases, again in the region where d0 ≤ 0.5 and 1 ≤ d1.
Figure 13: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, averaged over all values of d0.5 in the range [0, 1], for different values of x0.
Figure 14 shows the combined results for the case where x0 = 1, for increasing values of d0.5. Note that the region where d0 ≤ 0.5 and 1 ≤ d1 is viable (similarly to the lower right graph in figure 12). However, as d0.5 increases, particularly above 0.6, viability in this region decreases.
Figure 14: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for x0 = 1 and different values of d0.5.
4 Conclusion, Discussion and Future Work
In the case of linear strategies, our hypothesis holds true that viable strategies lie in the region where d0 ≤ 0.5 and d1 = 1, on the condition that x0 = 1. However, the adjacent region where d0 ≤ 0.5 and 1 < d1 also contains strategies that are somewhat viable, albeit less universally stable as d1 increases. The number of viable strategies in this region increases as x0 increases. When x0 = 0, only the case where d0 = 0.5 and d1 = 1 might be viable, although the granularity of the collected data is too low to reliably confirm this.
In the case of non-linear strategies, our hypothesis that the value of d0.5 has no effect is falsified. As evidenced by figure 13, the value of d0.5 does have an effect on which strategies become viable as x0 increases. In figure 14 it can also be seen that highly viable strategies lie in the region where d0 ≤ 0.5 and d1 = 1 when x0 = 1 and 0.4 ≤ d0.5 ≤ 0.6.
For both linear and non-linear strategies, the hypothesis that viable strategies have a high value for x0 holds true for the most highly viable strategies, except perhaps in the linear case where d0 = 0.5 and d1 = 1. For somewhat less universally stable strategies, lower values of x0 are possible, although their number decreases with x0.
Based on the above observations, listed below are general principles of highly viable strategies that have been found to hold true for both linear and non-linear strategies.
1. They reciprocate maximum investment with maximum investment.
2. They reciprocate minimum investment with an investment ≤ 0.5.
3. They assume that other players play the maximum investment on the first round.
For the non-linear strategies, we state the additional principle that they reciprocate an investment of 0.5 with an investment ≈ 0.5.
We stress that these principles have only been shown to apply to (large) homogeneous populations, only to the set of strategies that has been investigated, and only for a specific case of the Donation Game where c = 1 and b = 2. It is possible that more exotic viable non-linear strategies exist. These might encompass strategies that use other forms of non-linearity, such as quadratic or cubic strategies, or strategies that use even higher-order polynomials. The viability of these strategies may also change as the values of c and b change, which has already been shown to be true for linear strategies by Wahl and Nowak (1999a). Furthermore, the search for viable strategies can be expanded to non-homogeneous populations or to strategies that react to a different set of parameters, instead of only the other player’s previous investment, although Press and Dyson (2012) note that in the discrete Iterated Prisoner’s Dilemma a player gains no advantage with a memory longer than one round of play, even against other players with a longer memory.
References
Axelrod, R. (2006). The evolution of cooperation: revised edition. Basic Books.
Axelrod, R. et al. (1987). The evolution of strategies in the iterated prisoners dilemma. The dynamics of norms, pages 199–220.
Axelrod, R. and Hamilton, W. D. (1981). The evolution of cooperation. Science, 211(4489):1390–1396.
Killingback, T., Doebeli, M., and Knowlton, N. (1999). Variable investment and the origin of cooperation. Proc. Natl. Acad. Sci. USA.
Press, W. H. and Dyson, F. J. (2012). Iterated prisoners dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences, 109(26):10409–10413.
Stewart, A. J. and Plotkin, J. B. (2013). From extortion to generosity, evolution in the iterated prisoners dilemma. Proceedings of the National Academy of Sci-ences, 110(38):15348–15353.
Verhoeff, T. (1998). The traders dilemma: A continuous version of the prisoners dilemma. Eindhoven University of Technology.
Wahl, L. M. and Nowak, M. A. (1999a). The continuous prisoner’s dilemma: I. Linear reactive strategies. Journal of Theoretical Biology, 200(3):307–321.
Wahl, L. M. and Nowak, M. A. (1999b). The continuous prisoner’s dilemma: II. Linear reactive strategies with noise. Journal of Theoretical Biology, 200(3):323–338.
Appendices
Figure 15: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for d0.5 = 0 and different values of x0.
Figure 16: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for d0.5 = 0.2 and different values of x0.
Figure 17: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for d0.5 = 0.4 and different values of x0.
Figure 18: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for d0.5 = 0.6 and different values of x0.
Figure 19: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for d0.5 = 0.8 and different values of x0.
Figure 20: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for d0.5 = 1 and different values of x0.
Figure 21: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, averaged over all values of d0.5 and x0.