Linear and Non-Linear Reactive Strategies
in the Iterated Continuous Prisoner’s Dilemma
Thomas A. Unger 6132375
Bachelor Thesis (18 EC) Bachelor Kunstmatige intelligentie
University of Amsterdam Faculty of Science
Science Park 904 1098 XH Amsterdam
Supervisor Jan van Eijck
Centrum Wiskunde & Informatica Science Park 123
1098 XG Amsterdam
Abstract
This report describes an investigation into homogeneous populations of strategies for the Iterated Trader’s Dilemma, a continuous version of the Iterated Prisoner’s Dilemma. Specifically, a search was conducted for homogeneous populations of strategies that are viable, i.e., highly cooperative as well as stable against invasion by random strategies and mutants. Work done by Wahl and Nowak (1999a) informed the investigation of linear reactive strategies and was expanded into an investigation into non-linear strategies. Results indicate that viable strategies are those that reciprocate maximum investment with maximum investment, reciprocate minimum investment with a low investment (≤ 0.5) and assume that other players make a high investment on the first round. These results hold true for the investigated non-linear strategies, but non-linear strategies do not provide greater viability than linear strategies.
Contents
1 Introduction
2 Method
  2.1 Goals
  2.2 Strategies
  2.3 Payoff
  2.4 Stability
    2.4.1 Universal Stability
    2.4.2 Local Stability
  2.5 Hypothesis
3 Results
  3.1 Linear Strategies
  3.2 Non-Linear Strategies
4 Conclusion, Discussion and Future Work
1 Introduction
Natural selection tends to reward entities that optimize their own relative fitness and, by extension, tends to punish those who increase the relative fitness of others at the expense of their own. How can it be, then, that entities which engage in altruism not only exist, but thrive under natural selection? Axelrod and Hamilton (1981) attempted to answer this question using the Prisoner’s Dilemma (PD). To illustrate the Prisoner’s Dilemma, let us imagine the following scenario. Two hunter-gatherers live on the African savanna. One of them is very good at constructing bows, but does not know the first thing about constructing arrows. This leaves him unable to hunt animals for food. Meanwhile, the other hunter-gatherer has precisely the opposite problem: while he knows how to construct arrows, he does not know how to construct a bow to shoot them with. Clearly, these are fertile grounds for cooperation. Should the two agree to cooperate, they will both pay a cost of time and energy needed to construct something for the other. However, both will also benefit greatly; they will be able to hunt animals for food.
This situation can be modeled using a special case of the Prisoner’s Dilemma, called the Donation Game. Let us assume that the cost and benefit of cooperation is equal for the two “players” and that the cost c = 1 and the benefit b = 2.
                        Player A
                   Defect   Cooperate
Player B
   Cooperate         2          1
   Defect            0         −1

Table 1: the payoff matrix for player A in the Donation Game—a special case of the Prisoner’s Dilemma—with c = 1 and b = 2. Note that players A and B can be swapped, i.e., the payoff matrix for player B versus A is the same. The colors are to emphasize the payoffs: a green shade indicates a net gain, while red indicates a net loss.
Should both players cooperate, each gains a net payoff that equals b − c = 2 − 1 = 1. However, both players are tempted not to cooperate. After all, if player B constructs an item for player A and A does not reciprocate, A benefits from the other player’s work at no cost to himself (a net gain of b = 2 points), while player B has only cost and no benefit (a net gain of −c = −1 points). Both players also know that the other player is tempted to defect, leaving neither willing to cooperate (a net gain of 0 for both) due to the risk of being exploited, yet both players would have been better off if they had just cooperated. Herein lies the dilemma.
Returning to the example of the two hunter-gatherers, we can imagine that the situation is more complex. For example, if one of them constructs an item for the other, who does not reciprocate, he might develop a grudge and refuse to deal with him again. Once the donated item breaks or is lost, its owner is out of luck: he will not receive a new one. This can be modeled using the so-called Iterated Prisoner’s Dilemma (IPD), which is simply the repetition of the Prisoner’s Dilemma. In this game, each player decides whether to cooperate based on their shared history. Exactly how a player comes to this decision is called a strategy.
The Iterated Prisoner’s Dilemma was used by Axelrod and Hamilton (1981) to pit strategies against each other and see which of them gain the most points after some number of rounds. Surprisingly, a simple strategy called Tit-for-Tat proved the most effective. This strategy cooperates on the first move (when there is no shared history yet) and from then on simply copies the opponent’s previous move. In a follow-up, Axelrod (2006) described tournaments to which strategies in the form of computer programs were submitted by people from all over the world. Here again, the simple strategy of Tit-for-Tat proved more effective than far more intricate strategies.
Of course, in nature, the strategies that organisms employ were not submitted by designers, but rather arose through natural selection. In computer simulations, Axelrod et al. (1987) used genetic programming to investigate how strategies evolve under selection for those who gain the most points. Starting with an entirely random population of strategies, they found that cooperative strategies ultimately became dominant and drove defective strategies to extinction [1].
The Prisoner’s Dilemma requires either total defection or total cooperation from each player. Verhoeff (1998) suggested a continuous version of the PD, which he dubbed the Trader’s Dilemma (TD), in which each player can decide to cooperate by an amount between total defection (0) and total cooperation (1). This amount is also called an investment. Supposedly, this is a more realistic version of the dilemma as faced by organisms in the real world, since cooperation is rarely a black-and-white issue.

[1] However, Press and Dyson (2012) found that strategies exist that can defeat any simple evolutionary opponent. These Zero-Determinant (ZD) strategies can force a linear relationship between their own and their opponent’s number of points. In this way, they can extort any other strategy blindly optimizing for its own fitness. Stewart and Plotkin (2013) discuss ZD strategies and their evolutionary stability.
Figure 1: the payoff function for player A in the continuous Donation Game—a special case of the Prisoner’s Dilemma—with c = 1 and b = 2. Note that players A and B can be swapped, i.e., the payoff function for player B versus A is the same. The colors are to emphasize the payoffs: a green shade indicates a net gain, while red indicates a net loss.
Figure 1 shows the payoff of player A in the continuous Donation Game, analogous to table 1. Since investment is continuous, the payoff is likewise continuous. In this case, where c = 1 and b = 2, the payoff function is a flat plane. For our two hunter-gatherers, the continuous aspect of their investment may represent the quality of the item they devise, e.g., they may decide not to spend too much effort on construction (low investment), while still cooperating.
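The flat payoff plane of figure 1 follows directly from the cost and benefit: player A receives b times player B’s investment and pays c times his own. A minimal sketch (the function name is ours, not from the paper):

```python
def donation_payoff(x_a, x_b, c=1.0, b=2.0):
    """Payoff of player A in the continuous Donation Game: A pays cost
    c * x_a for its own investment and receives benefit b * x_b from
    the other player's investment."""
    return b * x_b - c * x_a

# The corner cases reproduce the discrete payoff matrix of table 1:
# (x_a, x_b) = (1, 1) gives b - c = 1, (0, 1) gives b = 2,
# (1, 0) gives -c = -1 and (0, 0) gives 0.
```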
Wahl and Nowak (1999a) investigated the Iterated Trader’s Dilemma (ITD), analogous to the IPD, and the evolution of “linear” strategies, so-called for the linear relationship between their investment and the other player’s previous investment [2]. In this case, it turns out that it is the defective strategies that ultimately drive cooperative strategies to extinction. This is the opposite of the results found by Axelrod in the discrete case [3].
[2] In a similar study, Wahl and Nowak (1999b) investigated the effect of noise on the evolution of linear strategies.
[3] There is hope yet for cooperative strategies in the ITD. Killingback et al. (1999) showed that if strategies are spatially distributed and only play against those closest to them, then cooperation can become the dominant strategy.
2 Method
2.1 Goals
In the ITD, a random population of strategies cannot be expected to evolve towards cooperation. However, there may be possible populations of strategies that are already cooperative and, more importantly, remain so in spite of invading or mutating strategies. The goal of this report is to identify populations that have these properties, specifically among the subset of homogeneous populations. In a homogeneous population, everyone uses the same strategy. There may be possible non-homogeneous cooperative populations that remain cooperative through, for example, interspecies mutualisms, but these populations are beyond the scope of this report.
There are three separate properties that are used to identify strategies. The first of these is payoff, which describes how many points the constituents of the population accumulate in the ITD game, played against each other. The second property is universal stability, which describes how stable the population is against invaders from across the strategic spectrum. The third and final property is local stability, which describes how stable the population is against strategies that differ slightly from the native strategy (i.e., mutated strategies). These three properties are each given as a single number on a continuous scale. Strategies that have a high value for all three properties are referred to as viable strategies.
2.2 Strategies
As in the work of Wahl and Nowak (1999a), this report considers strategies defined by a function S(x), which gives the investment of player A as a function of x, which is the investment of player B in the previous round. Since an investment must be a value in the interval [0, 1], S maps from [0, 1] to [0, 1]. Wahl and Nowak (1999a) considered strategies where S(x) is linear, parameterized by a slope k, an intercept d and a starting move defined by x0.
S(x) = 0        if kx + d < 0
S(x) = kx + d   if 0 ≤ kx + d ≤ 1
S(x) = 1        if 1 < kx + d
If x is not defined (i.e., there is no previous round), then S(x) = S(x0).
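The clamping of kx + d to the valid investment range can be sketched as follows; `linear_strategy` is a hypothetical helper, not code from the paper:

```python
def linear_strategy(k, d):
    """Returns the reactive strategy S(x) = kx + d, with the result
    clamped to the valid investment range [0, 1]."""
    def S(x):
        return min(1.0, max(0.0, k * x + d))
    return S

# k = 1, d = 0 reciprocates the previous investment exactly,
# a Tit-for-Tat-like strategy; out-of-range values are clamped.
tit_for_tat = linear_strategy(1.0, 0.0)
always_generous = linear_strategy(2.0, 0.5)
```

With x0 supplied separately, the first-round investment is simply S(x0).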
This report considers linear strategies under a slightly different parameterization, as well as non-linear strategies. Linear strategies are defined with the three parameters d0 = S(0) (i.e., d0 = d), d1 = S(1) (i.e., d1 = k + d) and x0 (the same as before).
When x is in the range (0, 1), S(x) is interpolated linearly between d0 and d1. This parameterization can be extended with additional parameters, for example a parameter d0.5 = S(0.5). In this case, S(x) would consist of two line segments; any value of S(x) with x in the range (0, 0.5) would be interpolated linearly between d0 and d0.5, while any value of S(x) with x in the range (0.5, 1) would be interpolated linearly between d0.5 and d1.
Figure 2: a visualization of two example strategies: (a) an example of a linear strategy S(x); (b) an example of a non-linear strategy S(x) with a parameter d0.5. The circles indicate S(x0), the starting move. Note that values above 1 and below 0 are mapped to 1 and 0, respectively.
In general, one can define n parameters d0, d1/(n−1), d2/(n−1), . . . , d(n−2)/(n−1), d1, which define n equally spaced points on the interval [0, 1]. In this way, S(x) is a piecewise function consisting of line segments. By increasing n, any continuous function S(x) from [0, 1] to [0, 1] can be approximated to an arbitrary degree. Note that n = 2 reduces to the linear case.
Linear strategies can be differentiated a priori into nine classes as a function of d0 and d1. These nine classes are visible in figure 3. Of particular note are the strategies in the upper right and lower left of the figure. In the upper right of the figure are the indiscriminate cooperators, where 1 ≤ d0 and 1 ≤ d1. These will cooperate fully, regardless of the other’s previous investment. A homogeneous population of such strategies is maximally lucrative, but possibly prone to being exploited by invaders and thus unstable. The lower left of the figure shows the other extreme. These are the indiscriminate defectors, where d0 ≤ 0 and d1 ≤ 0. These will never invest anything at all, regardless of the other’s previous investment. A homogeneous population of such strategies is minimally lucrative, but possibly very stable against any type of invader. For both these classes, the value of x0 does not affect their behavior. For the other seven classes, x0 does have an effect on behavior.
Figure 3: linear strategies, separated into nine classes as a function of d0 and d1.
While d0, d1 and x0 could have any value in R, only certain ranges of values are considered. For the linear strategies, these ranges are [−4, 5], [−4, 4] and [0, 1] for d0, d1 and x0, respectively. These ranges are equal to or otherwise exceed the ranges in Wahl and Nowak (1999a). For the non-linear strategies, these ranges are [−1, 2] for d0 and d1, and [0, 1] for d0.5 and x0. Since these ranges still allow an infinite number of strategies, they are discretized by sampling d0 and d1 every δ = 0.02, and d0.5 and x0 every 0.2, such that the bounding values of the ranges are included. The resulting finite sets of linear and non-linear strategies are henceforth referred to as L and N, respectively. A similar discretization is used by Wahl and Nowak (1999a).
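The discretization can be sketched as below; `frange` is a hypothetical helper that includes both bounds, as the text requires, and the size of L is derived from the stated step sizes:

```python
def frange(lo, hi, step):
    """Sample points every `step` from lo to hi, with both bounds included."""
    n = round((hi - lo) / step)
    return [lo + i * step for i in range(n + 1)]

d0_values = frange(-4.0, 5.0, 0.02)   # 451 samples for d0
d1_values = frange(-4.0, 4.0, 0.02)   # 401 samples for d1
x0_values = frange(0.0, 1.0, 0.2)     # 6 samples for x0
# |L| = 451 * 401 * 6 = 1,085,106 linear strategies
```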
2.3 Payoff
To find the payoff of strategies in a homogeneous population, each strategy plays against itself in a game of ITD that lasts for twenty rounds. The accumulated payoff is then divided by two and by the number of rounds, yielding the average payoff per constituent per round.
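This procedure can be sketched as follows; `play_itd` and `homogeneous_payoff` are our names, and the payoff constants c = 1, b = 2 are assumed from the Donation Game above:

```python
def play_itd(S_a, S_b, x0_a, x0_b, rounds=20, c=1.0, b=2.0):
    """Accumulated Donation Game payoffs of two strategies over `rounds`
    rounds of ITD. On the first round each player reacts to its assumed
    previous investment x0; afterwards, to the other's actual investment."""
    pay_a = pay_b = 0.0
    inv_a, inv_b = S_a(x0_a), S_b(x0_b)
    for _ in range(rounds):
        pay_a += b * inv_b - c * inv_a
        pay_b += b * inv_a - c * inv_b
        inv_a, inv_b = S_a(inv_b), S_b(inv_a)
    return pay_a, pay_b

def homogeneous_payoff(S, x0, rounds=20):
    """Average payoff per constituent per round in a homogeneous population:
    the strategy plays itself, and the accumulated payoff is divided by two
    and by the number of rounds."""
    pay_a, pay_b = play_itd(S, S, x0, x0, rounds)
    return (pay_a + pay_b) / 2.0 / rounds
```

A Tit-for-Tat-like strategy starting from full cooperation earns the mutual-cooperation payoff b − c = 1 per round.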
2.4 Stability
To measure the stability of each strategy against invasion, a way to determine whether one strategy invades another is needed. Such a way is provided by the techniques of adaptive dynamics, also known as evolutionary invasion analysis.
Let payoff(A, B) be the function that returns the accumulated payoff of strategy A against B in a game of ITD that lasts twenty rounds. Then strategy B invades the population of strategy A, if:
payoff(B, A) > payoff(A, A)
payoff(B, B) > payoff(A, B)
In other words, if strategy B gains a higher payoff against strategy A than A against itself, and strategy B gains a higher payoff against itself than strategy A against B, then B invades A. The first of these two rules ensures that strategy B can gain a foothold in a population that is still predominantly populated by A. The other rule ensures that once B has a sizeable enclave within the population, it can marginalize strategy A and ultimately drive it to extinction.
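The two invasion conditions can be sketched as follows, with strategies represented as callables plus their x0 parameter (the helper names are ours):

```python
def payoff(S_a, x0_a, S_b, x0_b, rounds=20, c=1.0, b=2.0):
    """payoff(A, B): accumulated payoff of strategy A against B in ITD."""
    pay = 0.0
    inv_a, inv_b = S_a(x0_a), S_b(x0_b)  # first round reacts to assumed x0
    for _ in range(rounds):
        pay += b * inv_b - c * inv_a
        inv_a, inv_b = S_a(inv_b), S_b(inv_a)
    return pay

def invades(S_b, x0_b, S_a, x0_a):
    """True if strategy B invades a homogeneous population of strategy A,
    i.e., both adaptive-dynamics conditions hold."""
    return (payoff(S_b, x0_b, S_a, x0_a) > payoff(S_a, x0_a, S_a, x0_a) and
            payoff(S_b, x0_b, S_b, x0_b) > payoff(S_a, x0_a, S_b, x0_b))
```

For example, an indiscriminate defector invades a population of indiscriminate cooperators, but not vice versa.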
2.4.1 Universal Stability
The stability of each strategy against invasion by a random strategy is measured as the probability that a strategy is not invaded by a random strategy. This probability is estimated using a Monte Carlo method, by testing each strategy in L and N for invasion (as per the established rules) against 100 randomly generated strategies. The number of these strategies that fail to invade is divided by 100, yielding an estimate of this strategy’s universal stability.
For the strategies in L, the parameters d0, d1 and x0 of the invading strategies are uniformly distributed in the ranges [−4, 5], [−4, 4] and [0, 1], respectively (note that these ranges are equal to the ranges of the strategies in L).
For the strategies in N, the parameters d0, d0.5, d1 and x0 of the invading strategies are uniformly distributed in the ranges [−1, 2], [0, 1], [−1, 2] and [0, 1], respectively (note that these ranges are equal to the ranges of the strategies in N). Also note that the invading strategies can have any real value within the given ranges.
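The Monte Carlo estimate for strategies in L can be sketched as follows; `payoff` and `invades` implement the invasion rules of section 2.4 and are inlined here so the sketch is self-contained:

```python
import random

def payoff(S_a, x0_a, S_b, x0_b, rounds=20, c=1.0, b=2.0):
    """Accumulated ITD payoff of strategy A against strategy B."""
    pay = 0.0
    inv_a, inv_b = S_a(x0_a), S_b(x0_b)
    for _ in range(rounds):
        pay += b * inv_b - c * inv_a
        inv_a, inv_b = S_a(inv_b), S_b(inv_a)
    return pay

def invades(S_b, x0_b, S_a, x0_a):
    """Both invasion conditions from section 2.4."""
    return (payoff(S_b, x0_b, S_a, x0_a) > payoff(S_a, x0_a, S_a, x0_a) and
            payoff(S_b, x0_b, S_b, x0_b) > payoff(S_a, x0_a, S_b, x0_b))

def universal_stability(S, x0, trials=100):
    """Fraction of uniformly random linear invaders that fail to invade."""
    failures = 0
    for _ in range(trials):
        d0 = random.uniform(-4.0, 5.0)
        d1 = random.uniform(-4.0, 4.0)
        inv_x0 = random.uniform(0.0, 1.0)
        invader = lambda x, d0=d0, d1=d1: min(1.0, max(0.0, d0 + (d1 - d0) * x))
        if not invades(invader, inv_x0, S, x0):
            failures += 1
    return failures / trials
```

A population of indiscriminate defectors cannot be invaded at all: any invader’s payoff against it is at most 0, never strictly greater than payoff(A, A) = 0, so its estimated universal stability is 1.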
2.4.2 Local Stability
The stability of each strategy against invasion by a mutant is measured as the probability that a strategy A is not invaded by another strategy, whose parameters are mutated from the parameters of A. This probability is also estimated using a Monte Carlo method, by testing each strategy in L and N for invasion against 100 randomly generated mutants. The number of mutants that fail to invade is divided by 100, yielding an estimate of this strategy’s local stability.
The process of mutation considers a strategy space with as many dimensions as there are parameters. In the case of linear strategies, there are three dimensions, for the parameters d0, d1 and x0. Each possible strategy occupies a point in strategy space. To effect a mutation, a random unit vector r in strategy space is generated (such that the directions of generated vectors are uniformly distributed), and then scaled by a scalar m, which is drawn from a normal distribution [4] with µ = 0 and σ = δ = 0.02. Given a strategy defined by the vector v in strategy space, a mutant is generated by calculating v + mr.
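The mutation step can be sketched as follows. Normalizing a vector of independent standard normals is a standard way to obtain a uniformly distributed direction; the function name is ours:

```python
import random

def mutate(params, sigma=0.02):
    """Mutates a strategy's parameter vector v by computing v + m*r, where
    r is a uniformly distributed random unit direction in strategy space
    and m is drawn from a normal distribution with mean 0 and sd sigma."""
    # independent standard normals, normalized, give a uniform direction
    r = [random.gauss(0.0, 1.0) for _ in params]
    norm = sum(c * c for c in r) ** 0.5 or 1.0  # guard against a zero vector
    m = random.gauss(0.0, sigma)
    return [v + m * c / norm for v, c in zip(params, r)]

# mutate a linear strategy with (d0, d1, x0) = (0.5, 1.0, 1.0)
mutant = mutate([0.5, 1.0, 1.0])
```

The distance between the mutant and the original is exactly |m|, so mutants cluster tightly around the parent strategy.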
2.5 Hypothesis
Since the payoff of a strategy is determined by letting the strategy play against itself, this payoff must be symmetrical (i.e., each copy of the strategy has the same payoff, meaning exploitation is impossible). Our hypothesis, therefore, is that the strategies that achieve the highest payoff lie in the region where 1 ≤ d1. Let us refer to this region as P.
Wahl and Nowak (1999a) found that when linear strategies are allowed to evolve, they evolve towards two regions (see figure 7 of that paper). The first region is where d0 ≤ 0 and d1 ≤ 0.5, which includes the indiscriminate defectors. The second region is where d0 ≤ 0.5 and d1 = 1. Let us call the union of these two regions Q. We suspect that these regions are stable universally and locally. Our hypothesis for viable linear strategies is the intersection of P and Q. This is the region where d0 ≤ 0.5 and d1 = 1.
Our hypothesis for viable non-linear strategies is the same as for linear strategies, i.e., that the value of d0.5 has no effect on which strategies are viable.
Wahl and Nowak (1999a) also concluded that for all cooperative strategies x0 = 1. Since only cooperative strategies have a high payoff, we hypothesize that viable strategies have a high value of x0.
3 Results
3.1 Linear Strategies
Figure 4 shows that our model correctly replicated results in the work of Wahl and Nowak (1999a).
[4] The justification for the use of the normal distribution is the assumption that mutation depends on a considerable number of variables that vary uniformly.
Figure 4: payoff as a function of d = d0 and k = d1 − d0, averaged over all values of x0 in the range [0, 1]: (a) original figure from Wahl and Nowak (1999a); (b) new results. Compare the results of Wahl and Nowak (1999a) with those of the new implementation.
Figures 5, 6 and 7 show the three properties of payoff, universal stability and local stability, respectively, averaged over x0. These graphs were combined into figure 8 as an RGB image, with each of the properties as one of the three color channels. Since viable strategies have a high value in all three properties, they appear as white. Figure 8 shows a grey patch in the top middle of the graph. A very small patch of white is visible at the point where d0 = 0.5 and d1 = 1.
Figures 9 through 12 show the same type of graphs, for specific values of x0. Note that, in figure 12, as x0 increases, viable strategies appear in the region where d0 ≤ 0.5 and 1 ≤ d1. Also note that, in this region, universal stability decreases as d1 increases.
Figure 5: payoff as a function of d0 and d1, averaged over all values of x0 in the range [0, 1]. This shows the same results as figure 4b, albeit under different parameters.
Figure 6: universal stability as a function of d0 and d1, averaged over all values of x0 in the range [0, 1].
Figure 7: local stability as a function of d0 and d1, averaged over all values of x0 in the range [0, 1].
Figure 8: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, averaged over all values of x0 in the range [0, 1].
Figure 9: payoff as a function of d0 and d1, for different values of x0.
Figure 10: universal stability as a function of d0 and d1, for different values of x0.
Figure 11: local stability as a function of d0 and d1, for different values of x0.
Figure 12: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for different values of x0.
3.2 Non-Linear Strategies
Figure 13 shows the combined results for different values of x0 (similarly to figure 12), averaged over d0.5. Note that viable strategies appear as x0 increases, again in the region where d0 ≤ 0.5 and 1 ≤ d1.
Figure 13: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, averaged over all values of d0.5 in the range [0, 1], for different values of x0.
Figure 14 shows the combined results for the case where x0 = 1, for increasing values of d0.5. Note that the region where d0 ≤ 0.5 and 1 ≤ d1 is viable (similarly to the lower right graph in figure 12). However, as d0.5 increases, particularly above 0.6, viability in this region decreases.
Figure 14: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for x0 = 1 and different values of d0.5.
4 Conclusion, Discussion and Future Work
In the case of linear strategies, our hypothesis holds true that viable strategies lie in the region where d0 ≤ 0.5 and d1 = 1, on the condition that x0 = 1. However, the adjacent region where d0 ≤ 0.5 and 1 < d1 also contains strategies that are somewhat viable, albeit less universally stable as d1 increases. The number of viable strategies in this region increases as x0 increases. When x0 = 0, only the case where d0 = 0.5 and d1 = 1 might be viable, although the granularity of the collected data is too low to reliably confirm this.
In the case of non-linear strategies, our hypothesis that the value of d0.5 has no effect is falsified. As evidenced by figure 13, the value of d0.5 does have an effect on which strategies become viable as x0 increases. In figure 14 it can also be seen that highly viable strategies lie in the region where d0 ≤ 0.5 and d1 = 1 when x0 = 1 and 0.4 ≤ d0.5 ≤ 0.6.
For both linear and non-linear strategies, the hypothesis that viable strategies have a high value for x0 holds true for the most highly viable strategies, except perhaps in the linear case where d0 = 0.5 and d1 = 1. For somewhat less universally stable strategies, lower values of x0 are possible, although their number decreases with x0.
Based on the above observations, listed below are general principles of highly viable strategies that have been found to hold true for both linear and non-linear strategies.
1. They reciprocate maximum investment with maximum investment.
2. They reciprocate minimum investment with an investment ≤ 0.5.
3. They assume that other players play the maximum investment on the first round.
For the non-linear strategies, we state the additional principle that they reciprocate an investment of 0.5 with an investment ≈ 0.5.
We stress that these principles have only been shown to apply to (large) homogeneous populations, only to the set of strategies that has been investigated, and only for a specific case of the Donation Game where c = 1 and b = 2. It is possible that more exotic viable non-linear strategies exist. These might encompass strategies that use other forms of non-linearity, such as quadratic or cubic strategies, or strategies that use even higher-order polynomials. The viability of these strategies may also change as the values of c and b change, which has already been shown to be true for linear strategies by Wahl and Nowak (1999a). Furthermore, the search for viable strategies can be expanded to non-homogeneous populations or to strategies that react to a different set of parameters, instead of only the other player’s previous investment, although Press and Dyson (2012) note that in the discrete Iterated Prisoner’s Dilemma a player gains no advantage with a memory longer than one round of play, even against other players with a longer memory.
References
Axelrod, R. (2006). The evolution of cooperation: revised edition. Basic Books.
Axelrod, R. et al. (1987). The evolution of strategies in the iterated prisoners dilemma. The dynamics of norms, pages 199–220.
Axelrod, R. and Hamilton, W. D. (1981). The evolution of cooperation. Science, 211(4489):1390–1396.
Killingback, T., Doebeli, M., and Knowlton, N. (1999). Variable investment and the origin of cooperation. Proc. Natl. Acad. Sci. USA.
Press, W. H. and Dyson, F. J. (2012). Iterated prisoners dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences, 109(26):10409–10413.
Stewart, A. J. and Plotkin, J. B. (2013). From extortion to generosity, evolution in the iterated prisoners dilemma. Proceedings of the National Academy of Sci-ences, 110(38):15348–15353.
Verhoeff, T. (1998). The traders dilemma: A continuous version of the prisoners dilemma. Eindhoven University of Technology.
Wahl, L. M. and Nowak, M. A. (1999a). The continuous prisoner’s dilemma: I. Linear reactive strategies. Journal of Theoretical Biology, 200(3):307–321.
Wahl, L. M. and Nowak, M. A. (1999b). The continuous prisoner’s dilemma: II. Linear reactive strategies with noise. Journal of Theoretical Biology, 200(3):323–338.
Appendices
Figure 15: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for d0.5 = 0 and different values of x0.
Figure 16: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for d0.5 = 0.2 and different values of x0.
Figure 17: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for d0.5 = 0.4 and different values of x0.
Figure 18: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for d0.5 = 0.6 and different values of x0.
Figure 19: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for d0.5 = 0.8 and different values of x0.
Figure 20: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, for d0.5 = 1 and different values of x0.
Figure 21: payoff (green) and universal (red) and local stability (blue) added together, as a function of d0 and d1, averaged over all values of d0.5 and x0.