
Endogenous Switching in an Oligopolistic Market between Heterogeneous Learning Rules in a Continuous Space

University of Amsterdam

Faculty of Economics and Business

Stef Hendriks

10135820

MSc in Econometrics

July 9, 2015


Statement of Originality

This document is written by Student Stef Hendriks who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Introduction
2 Market Structure
3 Learning Methods in a Homogeneous Market
   3.1 Q-learning
   3.2 Gradient Learning
   3.3 Genetic Algorithm Learning
4 Simulation Results in the Homogeneous Market
   4.1 Q-learning Results
   4.2 Gradient Learning Results
   4.3 Genetic Algorithm Learning Results
5 Simulation Results in the Heterogeneous Market with Fixed Shares of Learning Methods
   5.1 Genetic Algorithm and Gradient Learning
   5.2 Genetic Algorithm and Q-learning
   5.3 Gradient Learning and Q-learning
   5.4 All Three Learning Methods
6 Switching Methods in a Heterogeneous Market
   6.1 Genetic Algorithm and Gradient Learning
   6.2 Genetic Algorithm and Q-learning
   6.3 Gradient Learning and Q-learning
   6.4 All Three Learning Methods
7 Concluding Remarks
A Appendix A
   A.1 Genetic Algorithm and Gradient Learning
   A.2 Genetic Algorithm and Q-learning
   A.3 Gradient Learning and Q-learning
   A.4 All Three Learning Methods
B Appendix B
   B.1 Cartel equilibrium
   B.2 Nash equilibrium
   B.3 Perfect Competition equilibrium


1 Introduction

Firms do not always know the full specification of the environment in which they operate. In a Cournot oligopoly, for example, firms might not know how the price depends on the production quantities they set. Moreover, firms might not know how many competitors they have, how their competitors behave and how the behaviour of their competitors affects the price. In these situations firms use learning: they obtain information from their actions, update their beliefs and make decisions based on those beliefs.

Fudenberg (1998) describes a wide variety of learning methods, which might all have different outcomes (Huck et al., 1999). This makes it essential to study the different learning methods explicitly. Huck et al. (1999) also show that (1) the more information there is about the market, the less competitive firms are, and (2) the more information firms have about the profits of other firms, the more competitive they are. Moreover, Stahl (1996) describes heterogeneity among firms and switching towards learning methods that performed better in the past. This indicates the importance of analysing the ability of different learning methods to affect other learning methods in a heterogeneous market.

This thesis analyses the interaction between different learning methods in a market setting where firms might switch between these learning methods. This issue is relevant because it is unclear what happens in a heterogeneous market when the different learning methods have different outcomes in a homogeneous market. The setting is a dynamic Cournot market with homogeneous goods. Firms do not know what quantities the other firms will set in the next period, and may use different learning methods to estimate their optimal production quantity for that period, namely Q-learning, gradient learning and genetic algorithm learning.

Q-learning is a reinforcement learning technique in which a firm bases its decision about next period's production quantity on the quantity that maximizes its expected profit (Smart and Kaelbling, 2000). Gradient learning bases the production quantity decision on information about the slope of the profit function (Bonanno and Zeeman, 1985); in this way firms systematically adjust their production quantities in the direction in which they expect to earn a higher profit. Genetic algorithm learning makes use of reproduction and mutation based on production quantities set in previous periods; in this thesis it is based upon the social learning method described by Vriend (2000). These three learning methods are analysed for the following reason: all three have different market outcomes in a homogeneous market setting, so it is not clear what could happen in a heterogeneous market where a combination of them is used. It is interesting to observe whether some learning methods could drive the others out of the market once endogenous switching is introduced.

This thesis contributes to and builds upon several economic papers on learning methods. Watkins and Dayan (1992) describe Q-learning as a model-free reinforcement learning technique for learning the optimal decision in controlled Markovian domains, and show that Q-learning converges. However, this convergence is only guaranteed when all states are represented discretely. This thesis considers the same functional form of Q-learning as Watkins and Dayan (1992), but uses the method in states that are represented continuously. Duff and Bradtke Michael (1995) extend the domain of the existing Q-learning method from discrete states to continuous states. Millán et al. (2002) show that continuous Q-learning outperforms discrete Q-learning in terms of asymptotic performance and speed of learning; hence this thesis considers continuous Q-learning. The Q-learning in this thesis is an implementation of the Q-learning of Smart and Kaelbling (2000). They state that traditional reinforcement learning methods can have problems in continuous state spaces, because these methods assume discrete states. Their method safely approximates the optimal value in a continuous space and learns quickly from a small data set. These are important features, and hence this method is implemented in this thesis.

Corchón and Mas-Colell (1996) describe a price-setting market where firms use gradient dynamics to optimize their decisions, and derive a general result on the stability conditions of the equilibria of gradient dynamics. This thesis considers these features in a discrete-time setting of the Cournot market. Bonanno and Zeeman (1985) consider monopolistic competition with differentiated products. They show that the first-order conditions for profit optimization can be satisfied simultaneously for all firms at a certain price vector. Because each firm's knowledge of the demand curve is limited to a linear approximation of its own demand function, firms believe they are maximizing their profit, while the real profit function does not reach its global maximum. This thesis assumes that firms do know the real inverse demand function, such that the profit function has a unique maximum and gradient learners converge to the unique Nash equilibrium. Singh et al. (2000) focus on two-player auctions. They use gradient learning in a repeated game to show that even if firms do not converge to a Nash equilibrium, the average profits of the firms do converge to the Nash equilibrium profit. This thesis also shows that when firms converge, they converge to the Nash equilibrium. However, in this thesis the learning methods work in a continuous space, whereas the learning methods of Singh et al. (2000) work in a discrete space. Hence, when firms do not converge, the average profits do not converge to the Nash equilibrium profit.

Arifovic (1994) considers a cobweb model with a homogeneous good, in which firms use a genetic algorithm to choose the production quantity they set in the next period, and shows that the augmented genetic algorithm converges to the rational expectations equilibrium, which in this thesis is the Nash equilibrium. Vriend (2000) shows two variants of genetic algorithm learning in a Cournot market. One of these variants is social genetic algorithm learning, where firms can imitate decisions of their competitors; the genetic algorithm learning used in this thesis is based upon this social learning. Routledge (2001) uses a genetic algorithm learning method in a financial model and describes the genetic algorithm as a tool where learning is based on imitation and experimentation. He shows that the genetic algorithm can converge to a rational expectations equilibrium, but that it can also fail to produce the standard equilibrium. Routledge (2001) states that convergence depends on the rate of asset supply noise. Just as in Routledge (2001), the experimentation rate in this thesis corresponds to the noise added in the genetic algorithm learning method.

The previously discussed papers all consider homogeneous market settings, in which all firms use the same learning method. This thesis, however, considers a heterogeneous market setting where firms can use different learning methods, and where firms might even switch between these learning methods to optimize their profit.

Gale and Rosenthal (1999) consider a model with two different learning methods: players may be experimenters or imitators. They show that players converge to a small neighbourhood around the symmetric Nash equilibrium, but that within this interval endogenous fluctuations occur. In contrast with this thesis, Gale and Rosenthal (1999) do not assume endogenous switching between the two learning methods. Stahl (1996) gives evidence for heterogeneity of learning methods among firms and for switching to learning methods that performed better in previous time periods, making use of the step-k learning described by Nagel (1995). Agents tend to learn the performance measures of these learning methods and might switch to better performing methods; just as in this thesis, agents have homogeneous learning rates. Brock and Hommes (1997) consider a cobweb model where firms might switch between naive or rational predictors, using discrete-choice probabilities to switch between the different predictors. The switching probabilities in this thesis are related to the discrete-choice probabilities of Brock and Hommes (1997); their advantage is that they give control over the sensitivity of firms to performance measures. The performance measures used in this thesis are related to the reinforcement levels in Camerer and Hua Ho (1999), who describe experience-weighted attraction learning, in which the attractions of strategies are updated based on pay-off experience. This thesis uses a special case of that updating process: when a learning rule is used, the performance measure becomes a weighted average of the realized profit and the old performance measure; when a learning rule is not used, the performance measure remains the same. In this case the experience-weighted attraction learning of Camerer and Hua Ho (1999) reduces to averaged reinforcement learning (Sarin and Vahid, 1999), which is used in this thesis.

Droste et al. (2002) analyse the interaction between learning methods in a Cournot competition. The learning methods used in their paper are Nash rules and best-reply rules. The Nash rule is a perfect foresight method that takes into account the decisions made by all firms. The best-reply method gives the best response to the average productions of the past periods. The switching method is based upon the realized profits and updated by replicator dynamics with noise. This is in contrast with this thesis, where the discrete choice model is used. Moreover, this thesis considers different learning methods than Droste et al. (2002).

Anufriev et al. (2013) analyse the interaction between least squares learning and gradient learning. They show that heterogeneity has a large impact on the convergence properties of both learning methods, and that switching between these two learning methods might induce cyclical switching. This thesis is similar to the paper of Anufriev et al. (2013), with four main differences. First, this thesis considers switching between three learning methods instead of two. Second, this thesis considers homogeneous products in a Cournot market, whereas their paper considers heterogeneous goods in a Bertrand market. Third, in Anufriev et al. (2013) firms cannot observe the demand they are facing, whereas in this thesis the firms do know it. Fourth, in Anufriev et al. (2013) firms can only observe their own decisions, whereas in this thesis firms can also observe the production quantities set by the other firms in previous periods. Indeed, in this thesis firms have complete knowledge of the market structure except for the production quantities the other firms will set in the next period.

Numerical simulations in this thesis show that the different learning methods lead to different market outcomes in the homogeneous market setting. If Q-learning converges, a stable outcome is reached where the production quantities lie close to the Perfect Competition equilibrium quantity; this learning method does not have a unique stable outcome. If gradient learning converges, firms reach a stable outcome equal to the Nash equilibrium quantity. If genetic algorithm learning reaches a stable outcome, the quantities lie near the Nash equilibrium quantity; genetic algorithm learning does not have a unique stable outcome either, meaning that the initial conditions determine which stable outcome is reached. In the heterogeneous market setting with fixed shares of learning methods Q-learning outperforms the other learning methods, whereas it was the worst-performing method in the homogeneous market setting. When endogenous switching between the learning methods is introduced, Q-learning does not completely drive the other learning methods out of the market, but it is still the best performing learning method in this setting. When the intensity of choice increases, the ratio of Q-learners in the market increases. The mutation implementation of the genetic algorithm causes instability in the heterogeneous markets.

The thesis is organized as follows. Section 2 presents the market structure and derives the different equilibria. Section 3 describes the different learning methods in a homogeneous setting. Section 4 presents the simulation results in the homogeneous markets. Section 5 combines the different learning methods in a heterogeneous market setting with fixed shares of learning rules. Section 6 presents the endogenous switching method and gives results for it. Section 7 concludes.

2 Market Structure

Consider a standard Cournot market with n = 12 firms, which compete by simultaneously setting their production quantities. They do this in a finitely repeated game over a certain number of periods t. The confrontation of market demand and market supply determines the market price P. The inverse demand function depends on the total production of all firms and is given in every period by:


$$P(Q) = \max\{a - b \cdot Q^{\gamma},\ 0\},^{1} \qquad (1)$$

where $Q = \sum_{i=1}^{n} q_i$, $q_i$ is the production quantity of firm $i$ and $a, b > 0$. The parameter $\gamma$ determines the convexity of the inverse demand function. In this thesis $\gamma = 1$ is considered, so the inverse demand function is linear and decreasing in $Q$. Firms know that marginal costs are constant and equal for all firms; this means that firms are symmetric. Hence the profit function of firm $i$ is given by:

$$\pi_i(q) = P(Q) \cdot q_i - c \cdot q_i. \qquad (2)$$

For $\gamma = 1$ the profit function is concave, such that it has a unique maximum. The first-order condition of firm $i$ for all $\gamma$ is given by:

$$a - b \cdot Q^{\gamma} - b \cdot \gamma \cdot Q^{\gamma - 1} \cdot q_i - c = 0. \qquad (3)$$

Consider the symmetric Nash equilibrium, such that $q_i = q_N\ \forall i$. The Nash equilibrium quantity is then given by:

$$q_N = \left( \frac{a - c}{b \cdot n^{\gamma - 1} \cdot (n + \gamma)} \right)^{1/\gamma}. \qquad (4)$$

In the same way the Cartel equilibrium and the Perfect Competition equilibrium are derived; the proofs are presented in Appendix B. In a Cartel market firms maximize their joint profit. The Cartel equilibrium is given by:

$$q_C = \frac{1}{n} \cdot \left( \frac{a - c}{b \cdot (\gamma + 1)} \right)^{1/\gamma}. \qquad (5)$$

Under Perfect Competition firms are price takers, so the price equals marginal cost. The Perfect Competition equilibrium is given by:

$$q_{PC} = \left( \frac{a - c}{b \cdot n^{\gamma}} \right)^{1/\gamma}. \qquad (6)$$
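As a check, for the linear case $\gamma = 1$ used throughout the simulations, expressions (4), (5) and (6) reduce to the familiar closed forms

$$q_N = \frac{a - c}{b \cdot (n + 1)}, \qquad q_C = \frac{a - c}{2 \cdot b \cdot n}, \qquad q_{PC} = \frac{a - c}{b \cdot n},$$

which, for the parameter values introduced below ($a = 20$, $b = 0.5$, $c = 1$, $n = 12$), give 2.9231, 1.5833 and 3.1667 respectively, matching Table 1.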

In the model the different parameters are fixed at $a = 20$, $b = 0.5$ and $c = 1$. There are always $n = 12$ firms in the market. For these parameters the values of the different equilibria are given in Table 1.

¹ Firms face the same inverse demand function every period. They tend to maximize their one-period profit. However, their choices from previous periods do influence their choices in the next period. Moreover, the learning methods interfere with each other. Even though firms may adjust their quantities in the direction such that they increase their expected profit, they could earn a lower profit after the quantity changes, because the profit function depends on the quantities of all firms.

Equilibrium            Quantity   Price    Profit
Cartel                 1.5833     10.5     15.0417
Nash                   2.9231     2.4615   4.2722
Perfect Competition    3.1667     1        0

Table 1: Values of the different production quantity equilibria

Firms do not know the quantities other firms will produce in the next period. However, they do know the production quantities firms have set in previous periods. Firms know the specification of the inverse demand function and they know the marginal costs. For simplicity, they are only interested in maximizing their own profit in the next period.²

Furthermore, firms will interact repeatedly in the market environment described in this section. Hence, firms will use different learning methods to obtain their optimal production quantity for the next period. The learning methods are described in the next section.
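As an illustration (a minimal sketch, not code from the thesis), the values in Table 1 can be reproduced directly from equations (4), (5) and (6):

```python
# Reproduce Table 1 from equations (4)-(6) with a = 20, b = 0.5, c = 1,
# n = 12 and gamma = 1 (linear inverse demand).
a, b, c, n, gamma = 20.0, 0.5, 1.0, 12, 1.0

q_nash = ((a - c) / (b * n ** (gamma - 1) * (n + gamma))) ** (1 / gamma)  # eq. (4)
q_cartel = ((a - c) / (b * (gamma + 1))) ** (1 / gamma) / n               # eq. (5)
q_pc = ((a - c) / (b * n ** gamma)) ** (1 / gamma)                        # eq. (6)

for name, q in [("Cartel", q_cartel), ("Nash", q_nash), ("Perfect Competition", q_pc)]:
    price = max(a - b * (n * q) ** gamma, 0.0)  # inverse demand, eq. (1)
    profit = (price - c) * q                    # per-firm profit, eq. (2)
    print(f"{name:<20} q = {q:.4f}  P = {price:.4f}  profit = {profit:.4f}")
```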

3 Learning Methods in a Homogeneous Market

This section discusses the use of the three learning methods Q-learning, gradient learning and genetic algorithm learning in a homogeneous market. Firms cannot observe contemporaneous production quantities, so they need to learn the profit-maximizing production quantity with one of the three learning methods described below.

The timing of the learning process is as follows. Each firm randomly draws $q_{i,1}$ and $q_{i,2}$ from the set $S = \{q \in \mathbb{R}^{n}_{+} : 0 < q_i < 3,\ P(Q) > c,\ i = 1, \ldots, n\}$.³ The genetic algorithm needs two initial conditions to operate normally; hence, for comparability, all three learning methods start by drawing two initial conditions from set S.⁴ After period 2 firms use their learning mechanism to obtain their next production quantity. When the learning process indicates a negative quantity, firms draw their production quantity from the uniform distribution on set S. Firms continue this process until they reach a stable outcome or until the maximum number of time periods, $t = 10000$, is reached. A stable outcome is reached when $|q_{i,t} - q_{i,t-1}| < \eta\ \forall i$, where $\eta = 10^{-5}$ is the threshold value.
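This timing can be summarized in a short simulation skeleton (a sketch under the assumptions above; `next_quantity` is a placeholder stub for one of the learning rules described in the following subsections):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, eta = 12, 10_000, 1e-5

def draw_from_S(size):
    # Uniform draws from S: with a = 20, b = 0.5 and n = 12, any vector in
    # (0, 3)^n already satisfies P(Q) > c, so a uniform draw on (0, 3) suffices.
    return rng.uniform(0.0, 3.0, size=size)

def next_quantity(q_curr, q_prev):
    """Placeholder stub for a learning rule from Sections 3.1-3.3."""
    return q_curr.copy()

q_prev, q_curr = draw_from_S(n), draw_from_S(n)   # two initial conditions
for t in range(2, T):
    q_next = next_quantity(q_curr, q_prev)
    negative = q_next < 0
    q_next[negative] = draw_from_S(int(negative.sum()))  # redraw negative quantities
    if np.all(np.abs(q_next - q_curr) < eta):            # stable-outcome criterion
        break
    q_prev, q_curr = q_curr, q_next
```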

² An alternative could be to consider firms that maximize a weighted sum of profits.

³ Initial conditions lie between 0 and 3 because these values lie close to the rational equilibria; they are assumed to be consistent. The conjecture is that firms converge close to these equilibria for all learning methods.

⁴ For both Q-learning and gradient learning it would be sufficient to take only one initial value. However, for comparability all three learning methods are initialized with two values.

3.1 Q-learning

With Q-learning firms observe their profit from the previous period. After this, firms use the Q-learning mechanism to adjust their production quantity in the direction in which they expect to earn a higher profit. The Q-learning mechanism is an implementation of the process described by Smart and Kaelbling (2000) and Waltman and Kaymak (2008).⁵ The learning mechanism is given by

$$q_{i,t+1} = q_{i,t} + \alpha \cdot (\pi_{i,t} + \delta \cdot q^{*}_{i,t+1} - q_{i,t}), \qquad (7)$$

where $\delta = 0.5$ is the discount factor and the parameter $\alpha$ represents the learning rate. Here $q^{*}_{i,t+1} = \arg\max_q E[\hat{\pi}_{i,t+1}]$ is the quantity that would maximize the expected profit of firm $i$. To compute this expected profit, firms assume that each other firm $j$ produces $\hat{q}_{j,t+1} = \frac{1}{\tau} \sum_{s=1}^{\tau} q_{j,s}$, where $\tau = \min\{t, 100\}$. Simulations are conducted for different values of $\alpha$ to observe the effect of the learning rate on the stability of the model.
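Under the linear demand used here, the inner maximization has a closed form, because the expected profit is quadratic in the own quantity. The following is a minimal sketch of one update step (one reading of the text, not necessarily the thesis implementation; the averaging window follows $\tau = \min\{t, 100\}$ as reconstructed above):

```python
import numpy as np

a, b, c = 20.0, 0.5, 1.0
alpha, delta = 0.05, 0.5            # learning rate and discount factor

def q_learning_step(i, q_hist, profit_last):
    """One Q-learning update for firm i, eq. (7), under linear demand (gamma = 1).

    q_hist: array of shape (t, n) with all quantities set so far;
    profit_last: firm i's realized profit in the previous period.
    """
    tau = min(len(q_hist), 100)
    q_hat = q_hist[:tau].mean(axis=0)          # rivals' assumed average quantities
    Q_others = q_hat.sum() - q_hat[i]
    # argmax over q of expected profit (a - b*(Q_others + q) - c) * q:
    q_star = max((a - c - b * Q_others) / (2 * b), 0.0)
    q_it = q_hist[-1, i]
    return q_it + alpha * (profit_last + delta * q_star - q_it)
```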

3.2 Gradient Learning

With gradient learning firms change their production quantity depending on the slope of the profit function. Even if firms do not know the real inverse demand function, they can still obtain a good estimate of the slope of the profit function through market experimentation. When firms know the slope of the profit function at a certain point, they adjust their production quantity in the direction of higher profit. The quantity firms set is given by

$$q_{i,t+1} = \max\left\{ q_{i,t} + \lambda \cdot \frac{\partial \pi_i(q_t)}{\partial q_{i,t}},\ 0 \right\}, \qquad (8)$$

where the derivative of the profit function is the left-hand side of (3). The learning rate $\lambda$ determines the stability of the gradient learning process: the lower $\lambda$, the more stable the process (Anufriev et al., 2013). Firms know the production quantities that all firms have set in previous periods. Given that the process continues until the derivative of the profit function is zero, the conjecture is that gradient learning converges to the unique maximum (Bonanno and Zeeman, 1985).
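A compact sketch of update (8) for $\gamma = 1$, using the analytic slope from the left-hand side of (3):

```python
import numpy as np

def gradient_step(q, i, lam, a=20.0, b=0.5, c=1.0):
    """One gradient-learning update for firm i, eq. (8), for gamma = 1."""
    slope = a - b * q.sum() - b * q[i] - c      # d pi_i / d q_i, LHS of eq. (3)
    return max(q[i] + lam * slope, 0.0)
```

All firms update simultaneously from the same quantity vector $q_t$, for example `q_next = np.array([gradient_step(q, i, 0.1) for i in range(len(q))])`.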

3.3 Genetic Algorithm Learning

With genetic algorithm learning firms do not know the optimal production quantity. Each firm is characterized by an output rule. Because production quantities are not bounded from above, binary strings are not used. As described earlier, firms start by randomly drawing $q_1$ and $q_2$ from set S. After period 2 firms set their production quantities, the market price is determined and the profits are realized. After every period the output rules are updated by reproduction and mutation. Firms look at other firms' decisions and try to imitate and improve these decisions to their own benefit. The genetic algorithm is specified as social learning (Vriend, 2000). Firms observe the other firms' quantities and determine the profit that they could have earned with these quantities. With probability p firms choose the best-performing production quantity from one period back; with probability 1 − p they choose the best-performing quantity from two periods back. This process represents the reproduction. When the chosen production quantity is their own production quantity from the past period, no noise is added. However, when they choose another firm's production quantity, a noise term r ∼ N(0, 0.1) is added with probability z = 0.1. The noise represents the mutation. After the noise is added, firms produce these quantities. This genetic algorithm is based on the models of Vriend (2000) and Routledge (2001), but is simplified here.

⁵ Waltman and Kaymak (2008) specify Q-learning in a Cournot market, where the reward is specified as the firm's realized profit.
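A sketch of one social-learning update per firm follows. Two points the text leaves open are treated as assumptions here: hypothetical profits are evaluated at the realized market price of the corresponding period, and N(0, 0.1) is read as a standard deviation of 0.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def ga_step(i, q1, P1, q2, P2, p=0.7, z=0.1, c=1.0):
    """One social-GA update for firm i: reproduction (imitation), then mutation.

    q1, P1: all firms' quantities and the realized price one period back;
    q2, P2: the same, two periods back.
    """
    q_pool, P = (q1, P1) if rng.random() < p else (q2, P2)
    best = q_pool[np.argmax((P - c) * q_pool)]    # imitate the best-performing rule
    if best != q_pool[i] and rng.random() < z:    # mutate only imitated rules
        best += rng.normal(0.0, 0.1)              # noise r ~ N(0, 0.1)
    return max(best, 0.0)
```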

4 Simulation Results in the Homogeneous Market

In the next subsection simulations illustrate the possible dynamics of the Q-learning method in the homogeneous market for different values of α. The second subsection shows the dynamics of gradient learning for different values of λ, and the third subsection gives the results of genetic algorithm learning for different values of p. Note that if the learning methods reach a stable outcome, Figures 1, 2 and 3 show the dynamics from time period 0; if the learning methods do not converge, Figures 1, 2 and 3 illustrate the dynamics of the last 50 time periods.⁶

4.1 Q-learning Results

Statistics          α = 0.05   α = 0.1   α = 0.4    α = 0.5
mean(price)         1.6517     1.6525    6.0000     5.5348
std(price)          0.3135     0.3598    5.9998     5.6366
mean(quantities)    3.0581     3.0580    4.0001     4.7683
std(quantities)     0.1973     0.1693    2.6672     4.6859
mean(profit)        1.9765     1.9737    4.0008     3.2807
std(profit)         0.5499     0.6083    10.6689    13.2305

Table 2: Statistics of Q-learning for different values of α

In line with the conjecture, Figure 1 shows convergence for smaller learning rates α (Sutton and Barto, 1998).⁷ It is interesting to observe that not all firms converge to exactly the same production quantity. For α = 0.4 a 2-cycle arises at the point of bifurcation, and for α = 0.5 aperiodic behaviour occurs. For low values of α the average production quantities lie close to the Perfect Competition equilibrium of 3.1667. Table 2 shows that Q-learners earn a higher average profit when the process does not converge to a stable outcome.

Figure 1: Time series of quantities, profits and prices in a Cournot oligopoly with 12 Q-learners for α = 0.05, α = 0.1, α = 0.4 and α = 0.5.

⁶ Monte Carlo simulations in the homogeneous market suggest that, if the learning methods converge, the results given in the next subsections are consistent.

⁷ Because the problem is stochastic, the learning rate is required to decrease to zero for guaranteed convergence; a constant learning rate is used for simplicity.


4.2 Gradient Learning Results

Statistics          λ = 0.1   λ = 0.3   λ = 0.30769   λ = 0.31   λ = 0.4
mean(price)         2.5744    2.5827    4.7735        7.3328     5.9920
std(price)          0.8988    1.2816    4.7782        7.6104     6.1485
mean(quantities)    2.9043    2.9158    2.9229        2.9382     3.3989
std(quantities)     0.2275    0.2708    1.1820        2.1175     2.2402
mean(profit)        4.4384    4.3088    5.3862        2.8907     4.2904
std(profit)         1.4639    2.7510    9.4907        8.6318     11.7676

Table 3: Statistics of Gradient learning for different values of λ

For lower learning rates λ the dynamics show convergence. In particular, Figure 2 shows convergence to a stable outcome in which all firms produce the Nash equilibrium quantity for λ = 0.1 and λ = 0.3. Figure 2 depicts a 2-cycle for λ = 0.30769. For λ = 0.31 quasi-periodic behaviour occurs, and for λ = 0.4 aperiodic behaviour occurs.

In general the gradient learners earn a higher profit than the Q-learners in a homogeneous market when there is convergence, since Q-learners converge near the Perfect Competition equilibrium and gradient learners converge to the Nash equilibrium.

Figure 2: Time series of quantities, profits and prices in a Cournot oligopoly with 12 Gradient learners for λ = 0.1, λ = 0.3, λ = 0.30769, λ = 0.31 and λ = 0.4.


4.3 Genetic Algorithm Learning Results

Figure 3: Time series of quantities, profits and prices in a Cournot oligopoly with 12 Genetic Algorithm learners for p = 1, p = 0.7, p = 0.5 and p = 0.3.

Aperiodic behaviour emerges for all values of p; it is due to the noise term z. For p = 1, p = 0.7 and p = 0.5 the model nevertheless settles at a stable outcome where all firms produce the same quantity. Figure 3 suggests that for higher values of p this "convergence" needs fewer periods, since there is less fluctuation between the succeeding quantities of firms. Moreover, in general the genetic algorithm earns a higher average profit than Q-learning, and for p = 1 the genetic algorithm also earns a higher profit than the gradient learning firms. For high values of p the genetic algorithm reaches a stable outcome near the Nash equilibrium.

Statistics          p = 1    p = 0.7   p = 0.5   p = 0.3
mean(price)         3.9522   2.7243    3.1892    3.0879
std(price)          3.6031   1.6665    2.5577    2.2206
mean(quantities)    2.6746   2.8849    2.8329    2.8253
std(quantities)     0.6424   0.5902    0.9675    0.8853
mean(profit)        5.8989   4.4979    5.0126    5.0568
std(profit)         4.5297   4.0717    6.1421    5.5303

Table 4: Statistics of Genetic Algorithm learning for different values of p

5 Simulation Results in the Heterogeneous Market with Fixed Shares of Learning Methods

This section considers markets where firms use different learning methods with fixed market shares. Only parameter values for which the dynamics in the homogeneous markets were stable are considered. In a market with two competing learning methods each method is used by six firms; in a market with three learning methods each method is used by four firms.

5.1 Genetic Algorithm and Gradient Learning

                    p = 0.7, λ = 0.3             p = 1, λ = 0.1
Statistics          All      GA       GL         All      GA       GL
mean(price)         2.4955   -        -          2.4832   -        -
std(price)          0.9569   -        -          0.4517   -        -
mean(quantities)    2.9186   2.8626   2.9746     2.9195   2.9196   2.9194
std(quantities)     0.3686   0.5234   0.2139     0.0903   0.1113   0.0694
mean(profit)        4.2091   4.1080   4.3103     4.2961   4.2852   4.3070
std(profit)         2.7407   2.7256   2.7559     0.8598   0.7903   0.9294

Table 5: Statistics of fixed market shares of Genetic Algorithm learning and Gradient learning for p = 0.7 and λ = 0.3 and for p = 1 and λ = 0.1

A heterogeneous market with six genetic algorithm learners and six gradient learners is considered. Figure 4 shows that for probability p = 0.7 and learning rate λ = 0.3 the model does not reach a stable outcome, whereas in the homogeneous markets of either learning method a stable outcome was reached. This is due to the mutation parameter in the genetic algorithm. When the learning rate λ is decreased and the probability p is increased, Figure 5 depicts convergence to a stable outcome and, not surprisingly, this is the Nash equilibrium quantity.

Figure 4: Time series of quantities of GA-learners (a), quantities of Gradient learners (b), profits of GA-learners (c), profits of Gradient learners (d) and prices (e) in a Cournot oligopoly with 6 Genetic Algorithm learners and 6 Gradient learners for p = 0.7 and λ = 0.3.

Figure 5: Time series of quantities of GA-learners (a), quantities of Gradient learners (b), profits of GA-learners (c), profits of Gradient learners (d) and prices (e) in a Cournot oligopoly with 6 Genetic Algorithm learners and 6 Gradient learners for p = 1 and λ = 0.1.


This is because gradient learning converges to the Nash equilibrium in the homogeneous market, which the genetic algorithm has an incentive to imitate. In the heterogeneous market the standard deviations of the production quantities are lower than in the homogeneous market of either learning method (Table 5).

5.2 Genetic Algorithm and Q-learning

Figure 6: Time series of quantities of GA-learners (a), quantities of Q-learners (b), profits of GA-learners (c), profits of Q-learners (d) and prices (e) in a Cournot oligopoly with 6 Genetic Algorithm learners and 6 Q-learners for p = 0.7 and α = 0.1.

                    p = 0.7, α = 0.1             p = 1, α = 0.05
Statistics          All      GA       QL         All      GA       QL
mean(price)         1.8149   -        -          1.6695   -        -
std(price)          1.4648   -        -          0.5321   -        -
mean(quantities)    3.1313   2.3587   3.9040     3.0552   2.3559   3.7544
std(quantities)     1.0043   1.2461   0.7626     0.6343   0.7304   0.5382
mean(profit)        2.0119   1.2026   2.8212     1.9980   1.5112   2.4849
std(profit)         4.4575   3.1499   5.7651     1.5437   1.1560   1.9315

Table 6: Statistics of fixed market shares of Genetic Algorithm learning and Q-learning for p = 0.7 and α = 0.1 and for p = 1 and α = 0.05

Figures 6 and 7 clearly illustrate aperiodic behaviour of the production quantities for all parameter values. This might be caused by the mutation term z of the genetic algorithm method. Table 6 shows that the Q-learners outperform the genetic algorithm learners for all parameter values, even though the two learning methods do not reach a stable outcome. This is surprising, because in the homogeneous market genetic algorithm learning outperformed the Q-learning method. It is due to the fact that Q-learners have a higher propensity to set large production quantities. In the heterogeneous market Q-learners also have lower standard deviations of quantities than genetic algorithm learners for all parameter values. In the homogeneous markets genetic algorithm learners converge close to the Nash equilibrium and Q-learners converge close to the Perfect Competition equilibrium. The conjecture is that in the heterogeneous market Q-learning pushes the average production quantity towards the Perfect Competition equilibrium, while genetic algorithm learning pushes it towards the Nash equilibrium. Table 6 shows that the average production quantity in the heterogeneous market lies closer to the Perfect Competition equilibrium quantity. Hence Q-learning has more influence in the market than genetic algorithm learning.

Figure 7: Time series of quantities of GA-learners (a), quantities of Q-learners (b), profits of GA-learners (c), profits of Q-learners (d) and prices (e) in a Cournot oligopoly with 6 Genetic Algorithm learners and 6 Q-learners for p = 1 and α = 0.05.

5.3 Gradient Learning and Q-learning

Figure 8 shows aperiodic behaviour in the dynamics for parameter values that indicate stability in the homogeneous market. Figure 9 suggests a 2-cycle for λ = 0.2 and α = 0.1. In this case all Q-learners converge to a different 2-cycle than the gradient learners; Figure 9 depicts that the 2-cycle of the Q-learners oscillates around a higher production quantity than that of the gradient learners. This leads to a higher average profit for the Q-learners. For λ = 0.1 and α = 0.05 there is convergence to a stable outcome (Figure 10). Again the Q-learners converge to a higher average quantity than the gradient learners. The gradient learners converge to the same production quantity, whereas the Q-learners produce different quantities, which lie close to each other.

Figure 8: Time series of quantities of Gradient learners (a), quantities of Q-learners (b), profits of Gradient learners (c), profits of Q-learners (d) and prices (e) in a Cournot oligopoly with 6 Gradient learners and 6 Q-learners for λ = 0.3 and α = 0.1.

Figure 9: Time series of quantities of Gradient learners (a), quantities of Q-learners (b), profits of Gradient learners (c), profits of Q-learners (d) and prices (e) in a Cournot oligopoly with 6 Gradient learners and 6 Q-learners for λ = 0.2 and α = 0.1.

Figure 10: Time series of quantities of Gradient learners (a), quantities of Q-learners (b), profits of Gradient learners (c), profits of Q-learners (d) and prices (e) in a Cournot oligopoly with 6 Gradient learners and 6 Q-learners for λ = 0.1 and α = 0.05.

                    λ = 0.3, α = 0.1          λ = 0.2, α = 0.1          λ = 0.1, α = 0.05
Statistics          All      GL      QL       All      GL      QL       All      GL      QL
mean(price)         1.9016   -       -        1.7819   -       -        1.6874   -       -
std(price)          1.8814   -       -        1.5259   -       -        0.2598   -       -
mean(quantities)    3.1296   1.0945  5.1648   3.0364   1.5618  4.5110   3.0521   1.3675  4.7367
std(quantities)     0.6230   0.6474  0.5985   0.2601   0.1642  0.3560   0.1372   0.0862  0.1881
mean(profit)        2.0165   0.2659  3.7671   1.9859   0.9769  2.9949   2.0868   0.9411  3.2326
std(profit)         5.5192   1.8249  9.2135   4.4274   2.2594  6.5955   0.4465   0.4240  0.4689

Table 7: Statistics of fixed market shares of Gradient learning and Q-learning for λ = 0.3 and α = 0.1, λ = 0.2 and α = 0.1 and for λ = 0.1 and α = 0.05

Again, Q-learning outperforms gradient learning. Table 7 shows that the standard deviations of quantities in the heterogeneous market are smaller than in the homogeneous market for both learning methods. Hence these two methods seem to have a stabilizing effect on each other.

5.4 All Three Learning Methods

Figure 11: Time series of quantities of GA-learners (a), quantities of Gradient learners (b) and quantities of Q-learners (c) in a Cournot oligopoly with 4 Genetic Algorithm learners, 4 Gradient learners and 4 Q-learners for p = 0.7, λ = 0.3 and α = 0.1.

Figure 12: Time series of profits of GA-learners (a), profits of Gradient learners (b), profits of Q-learners (c) and the price (d) in a Cournot oligopoly with 4 Genetic Algorithm learners, 4 Gradient learners and 4 Q-learners for p = 0.7, λ = 0.3 and α = 0.1.

Figure 13: Time series of quantities of GA-learners (a), quantities of Gradient learners (b) and quantities of Q-learners (c) in a Cournot oligopoly with 4 Genetic Algorithm learners, 4 Gradient learners and 4 Q-learners for p = 1, λ = 0.1 and α = 0.05.

Figure 14: Time series of profits of GA-learners (a), profits of Gradient learners (b), profits of Q-learners (c) and the price (d) in a Cournot oligopoly with 4 Genetic Algorithm learners, 4 Gradient learners and 4 Q-learners for p = 1, λ = 0.1 and α = 0.05.

                    p = 0.7, λ = 0.3, α = 0.1          p = 1, λ = 0.1, α = 0.05
Statistics          All      GA      GL      QL        All      GA      GL      QL
mean(price)         1.9542   -       -       -         1.7585   -       -       -
std(price)          2.0484   -       -       -         1.5701   -       -       -
mean(quantities)    3.1772   1.6538  1.2297  6.6480    3.0588   1.9108  1.3568  5.9088
std(quantities)     1.1661   1.0997  0.7286  1.6699    0.5108   0.8024  0.2371  0.4928
mean(profit)        2.0012   0.5741  0.4041  5.0253    1.8766   0.6275  0.8855  4.1168
std(profit)         6.1843   3.1457  2.3462  13.0610   4.6298   2.7921  2.0536  9.0439

Table 8: Statistics of fixed learning with all three learning methods for p = 0.7, λ = 0.3 and α = 0.1 and for p = 1, λ = 0.1 and α = 0.05

Given the results of the pairwise heterogeneous markets, it is interesting to model a heterogeneous market in which all three learning methods compete with each other.

For all parameter values the model does not converge, because of the destabilizing effect of the genetic algorithm. As expected, the Q-learners earn more profit than the other two learning methods.

In the homogeneous market Q-learning was the worst-performing learning method, yet this section showed that it outperforms the other two methods in the heterogeneous market. It would be interesting to model a heterogeneous market where firms can switch between different learning methods. The next section discusses this switching method.


6 Switching Methods in a Heterogeneous Market

This section introduces endogenous switching between the different learning methods. Firms update their performance measures based upon the reinforcement learning in Camerer and Hua Ho (1999). Subsequently, probabilities to choose a particular learning method are updated according to the discrete choice model by Brock and Hommes (1997). The performance measure of the different learning methods is specified as:

$$\theta^{H}_{i,t+1} = \begin{cases} (1-w) \cdot \theta^{H}_{i,t} + w \cdot \pi_{i,t} & \text{if firm } i \text{ used learning method } H \text{ in period } t, \\ \theta^{H}_{i,t} & \text{otherwise,} \end{cases} \qquad (9)$$

where $H \in \{GA, GL, QL\}$ and $w = 0.5$ is the weight of the latest profit in the performance measure. Hence, performance measures are weighted averages of previous profits realized by the particular learning method for each firm. The initial performance measures are the first profits that were realized using the learning method for each firm.

These performance measures determine the probability of applying a learning method in the following way. The derivation of the probabilities is based on the discrete choice model described by Brock and Hommes (1997). Firm $i$ applies learning method $H$ in period $t + 1$ with probability

$$P^{H}_{i,t+1} = \frac{e^{\beta \cdot \theta^{H}_{i,t}}}{\sum_{\chi} e^{\beta \cdot \theta^{\chi}_{i,t}}}, \qquad (10)$$

where the sum runs over the subset $\chi \subseteq \{GA, GL, QL\}$ of methods used in the particular market, and $\beta \geq 0$ measures how sensitive firms are to differences in the performance measures. $\beta$ is considered homogeneous for all firms.⁹
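Updates (9) and (10) are straightforward to implement; a minimal sketch follows (the max-shift in the softmax is a standard numerical safeguard, not part of the thesis):

```python
import numpy as np

def update_performance(theta_i, used, profit, w=0.5):
    """Eq. (9): update the performance of the method firm i actually used."""
    theta_i = theta_i.copy()
    theta_i[used] = (1 - w) * theta_i[used] + w * profit
    return theta_i

def choice_probabilities(theta_i, beta):
    """Eq. (10): discrete-choice probabilities over the available methods."""
    x = beta * theta_i
    e = np.exp(x - x.max())        # shift for numerical stability
    return e / e.sum()
```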

The model with switching methods is implemented as follows. Firms start by randomly drawing $q_{i,1}$ and $q_{i,2}$ from set S. After that, the learning methods are randomly divided between the n firms. In the next period(s) firms switch to the other learning method(s), so that every firm has used all learning methods and can update its performance measures and probabilities. In the subsequent periods the probabilities determine which learning method each firm uses; quantities and profits are determined and the probabilities are updated. The process stops when a predefined number of time periods, t = 10000, is reached.
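This protocol can be sketched as follows, reusing `update_performance` and `choice_probabilities` from the previous block (`market_step` is a placeholder that applies each firm's current learning rule and returns quantities and profits):

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, K, beta = 12, 10_000, 3, 5.0     # methods: GA = 0, GL = 1, QL = 2

theta = np.zeros((n, K))               # performance measure per firm and method
assign = rng.permutation(np.repeat(np.arange(K), n // K))  # random initial division

def market_step(assign):
    """Placeholder: apply each firm's current rule; return (quantities, profits)."""
    return np.full(n, 2.9), np.full(n, 4.3)

for t in range(2, T):
    q, profit = market_step(assign)
    if t - 2 < K:                      # rotation phase: each method is tried once;
        theta[np.arange(n), assign] = profit   # its first profit initializes theta
        assign = (assign + 1) % K
    else:                              # switching phase, eqs. (9) and (10)
        for i in range(n):
            theta[i] = update_performance(theta[i], assign[i], profit[i])
            assign[i] = rng.choice(K, p=choice_probabilities(theta[i], beta))
```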

In the simulations the parameters are fixed at α = 0.05, λ = 0.1 and p = 1. For simplicity, the genetic algorithm learners only look one period back in time to learn. Simulations are conducted for different intensities of choice β to observe a possible effect of this parameter on the switching mechanism. First, just as in the heterogeneous market with fixed learning methods, the simulations are conducted for markets with only two learning methods. In the last subsection the simulations are extended to a heterogeneous market with three different learning methods.

⁹ Simulations show that the results cannot be generalized to a market where firms have heterogeneous sensitivity parameters β. This emphasizes the role of path-dependence in endogenous switching. This feature is left for further research, because it is beyond the scope of this thesis.

This section shows results for β = 5 and β = 25. More results for different values of β are given in Appendix A.

6.1 Genetic Algorithm and Gradient Learning

Figure 15: Time series of quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 5.

Figure 16: Time series of performances of GA (a), performances of GL (b), number of firms using GA (c) and number of firms using GL (d), where firms can switch between learning methods, for β = 5.

Figures 15, 16, 17 and 18 show that for all intensities of choice β the quantities oscillate around the Nash equilibrium quantity, and together with Table 9 they show that the model is more stable for higher values of β. Table 9 also shows that the average quantities lie close to the Nash equilibrium quantity. The performance measures of the genetic algorithm and gradient learning lie close to each other. For higher values of β the ratio of gradient learners in the market gets higher and eventually, for β = 25, the number of gradient learners converges to 7 and the number of genetic algorithm learners converges to 5.

Figure 17: Time series of quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 25.

Figure 18: Time series of performances of GA (a), performances of GL (b), number of firms using GA (c) and number of firms using GL (d), where firms can switch between learning methods, for β = 25.


Statistics      β = 5    β = 25
mean(θGA)       4.0770   3.8783
std(θGA)        0.3640   0.0714
mean(θGL)       4.0683   3.9086
std(θGL)        0.3979   0.0726
ratio(GA)       0.4873   0.4170
ratio(GL)       0.5127   0.5830
mean(price)     2.4557   2.4634
std(price)      0.1696   0.1407

Table 9: Statistics of switching learning with GA-learning and Gradient learning for different values of β

6.2 Genetic Algorithm and Q-learning

Figure 19: Time series of quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 5.

Statistics      β = 5    β = 25
mean(θGA)       0.9064   1.4582
std(θGA)        0.3534   0.2737
mean(θQL)       2.0818   1.8817
std(θQL)        0.4650   0.2902
ratio(GA)       0.3019   0.1641
ratio(QL)       0.6981   0.8359
mean(price)     1.6940   1.6531
std(price)      0.2290   0.1943

Table 10: Statistics of switching learning with GA-learning and Q-learning for different values of β

Figures 20 and 22 and Table 10 show that the higher the intensity of choice β, the higher the ratio of Q-learners. Moreover, the performance measures of Q-learning are always higher than the performance measures of genetic algorithm learning, so in general the probability of choosing Q-learning is larger than the probability of choosing genetic algorithm learning. In Figures 20 and 22 the performance measures of genetic algorithm learning show that for higher values of β firms choose Q-learning and do not go back to genetic algorithm learning; this is indicated by a straight horizontal line.

Figure 20: Time series of performances of GA (a), performances of QL (b), number of firms using GA (c) and number of firms using QL (d), where firms can switch between learning methods, for β = 5.

Figure 21: Time series of quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 25.

For high values of β the number of Q-learners oscillates between 10 and 12; eventually, for β = 25, the number of Q-learners converges to 10. This is due to the fact that Q-learners outperform other learning methods in a heterogeneous market but are the worst-performing learners in a homogeneous market. In other words, Q-learning is more profitable when there is at least one other learning method in the market, and some firms will switch back if all of them are Q-learners. Because the performance measures are essentially weighted averages of past profits, the firms which used Q-learning for a long time will not switch back to genetic algorithm learning. Figure 19 depicts that genetic algorithm learning has a strong destabilizing effect on the market if it is competing with Q-learning.

Figure 22: Time series of performances of GA (a), performances of QL (b), number of firms using GA (c) and number of firms using QL (d), where firms can switch between learning methods, for β = 25.

6.3 Gradient Learning and Q-learning

Figure 23: Time series of quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 5.

Figures 23 and 25 suggest that for low intensities of choice β there is a lot of disturbance in the production quantities, whereas for higher intensities of choice the production quantities become more stable. Figures 24 and 26 illustrate that for higher values of β the ratio of Q-learners becomes larger.

For a higher number of Q-learners, the production quantity of gradient learners becomes lower, which results in lower profits. Table 11 shows that for higher values of β the ratio of Q-learners increases and the performance measures of gradient learners decrease. Hence the Q-learners outperform the gradient learners.

Figure 24: Time series of performances of GL (a), performances of QL (b), number of firms using GL (c) and number of firms using QL (d), where firms can switch between learning methods, for β = 5.

Figure 25: Time series of quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 25.

Again the number of Q-learners oscillates between 10 and 12 for high values of β. Performance measures of the Q-learners are higher when they are competing with gradient learners than with genetic algorithm learners. This was expected, because Tables 6 and 7 suggest that in the heterogeneous market with fixed shares of learning methods Q-learners earn a higher profit when they are competing with gradient learners instead of genetic algorithm learners.


Figure 26: Time series of performances of GL (a), performances of QL (b), number of firms using GL (c) and number of firms using QL (d), where firms can switch between learning methods, for β = 25.

Statistics      β = 5    β = 25
mean(θGL)       1.3596   1.2733
std(θGL)        0.1226   0.1802
mean(θQL)       2.1411   2.0012
std(θQL)        0.1485   0.2606
ratio(GL)       0.3800   0.0857
ratio(QL)       0.6200   0.9143
mean(price)     1.7054   1.6565
std(price)      0.1552   0.1321

Table 11: Statistics of switching learning with Gradient learning and Q-learning for different values of β

6.4 All Three Learning Methods

The previous subsections showed how firms behave when they compete in a heterogeneous market with switching between two learning methods. This subsection considers the behaviour of firms in a heterogeneous market with endogenous switching between three different learning methods.

The conjecture is that the same properties of the competing learning methods discussed in the previous subsections return in the heterogeneous market with three competing learning methods: Q-learning outperforms the other learning methods, and genetic algorithm learning has a destabilizing effect on the market for low intensities of choice β.

Figure 27: Time series of quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 5.

Figure 28: Time series of performances of GA (a), performances of GL (b), performances of QL (c), number of firms using GA (d), number of firms using GL (e) and number of firms using QL (f), where firms can switch between learning methods, for β = 5.

Figure 29: Time series of quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 25.

Figure 30: Time series of performances of GA (a), performances of GL (b), performances of QL (c), number of firms using GA (d), number of firms using GL (e) and number of firms using QL (f), where firms can switch between learning methods, for β = 25.

Statistics      β = 5    β = 25
mean(θGA)       1.1935   1.0844
std(θGA)        0.1398   0.1771
mean(θGL)       1.2601   1.1008
std(θGL)        0.1522   0.1955
mean(θQL)       2.1733   1.9863
std(θQL)        0.2350   0.3448
ratio(GA)       0.2596   0.0272
ratio(GL)       0.2845   0.0205
ratio(QL)       0.4559   0.9523
mean(price)     1.7231   1.6521
std(price)      0.1646   0.1354

Table 12: Statistics of switching learning with all three learning methods for p = 1, λ = 0.1, α = 0.05, w = 0.5 and different values of β

For values of β lower than 25 Q-learning does not fully dominate the market. Table 12 shows that the performance measures of genetic algorithm learning and gradient learning lie close to each other, whereas the performance measure of Q-learning is substantially higher. Figures 28 and 30 suggest that the average number of Q-learners is higher than the average number of other learners for all β. Moreover, because genetic algorithm learning and gradient learning have performance measures close to each other, they also oscillate around the same number of learners. Figures 27 and 29 depict that genetic algorithm learning has a stronger destabilizing effect for lower intensities of choice β. This is due to the fact that for low values of β there are more genetic algorithm learners in the market. Table 12 also shows that for higher intensities of choice β the performance measures of all learning methods decrease. For β = 1 there is one firm using Q-learning that deviates a lot from all the other learners by producing a quantity around 20 (see Appendix A). For β = 5 there are two Q-learners that deviate from all the other learners by producing a quantity around 10. The higher the value of β, the more Q-learners deviate, but the lower the production quantities of these deviating learners will be.

Notice that for β = 25 Q-learning dominates the whole market. When simulations are conducted for different initial values, Q-learning does not always dominate the whole market (see Appendix C). However, Q-learning is still the best performing learning method. Table 12 shows low performance measures for genetic algorithm learning and gradient learning for β = 25, which causes the extinction of these two learning methods in this market. Appendix C shows that Q-learning is the optimal learning method, but that in general the other two learning methods are not dominated. Moreover, Appendix C illustrates that the results for low intensities of choice β are more consistent than the results for high values of β.

Q-learning is programmed in such a way that it learns fast in a homogeneous market. This property is affected in the heterogeneous market. Low values of β allow more time for Q-learning to learn; hence results for low values of β are consistent. Q-learners have a high propensity to produce large quantities. For some initial conditions it is possible that Q-learners earn a negative profit, which, in combination with high values of β, means that some firms will never use Q-learning in the market again.

This thesis shows that learning methods react differently in homogeneous and heterogeneous markets. In the homogeneous markets genetic algorithm learners converge near the Nash equilibrium, gradient learners converge to the Nash equilibrium and Q-learners converge near the Perfect Competition equilibrium. As a result, Q-learning is outperformed by the other learning methods in the homogeneous markets.

However, in the heterogeneous markets Q-learning outperforms the other learning methods. Q-learning has a high propensity to produce large quantities, such that the other learners lower their production quantities. The genetic algorithm has a destabilizing effect on the heterogeneous market; this effect becomes larger when genetic algorithm learners compete with Q-learners.

For higher intensities of choice β the number of Q-learners increases in the heterogeneous market with endogenous switching. Higher values of β decrease the performance measures for all firms, but also reduce the switching between different learning methods.

7 Concluding Remarks

This thesis has analysed the interaction between genetic algorithm learning, gradient learning and Q-learning in a Cournot market with homogeneous goods, where firms do not know the production quantities of other firms in the next period. Firms use learning to determine their best possible production quantity. The learning methods have been used in several papers, but mainly in homogeneous markets. The added value of this thesis is that these learning methods are now considered in a heterogeneous market where firms might switch between the different methods.

In the homogeneous market the genetic algorithm converges near the Nash equilibrium when the probability of imitating other firms' production quantities from one period back is high; for low probabilities the firms do not converge to a stable outcome. In a homogeneous market with only gradient learners, firms converge to the Nash equilibrium for low learning rates, while for higher learning rates cycles, quasi-periodic and aperiodic behaviour emerge. In a homogeneous market with only Q-learners, firms converge to values close to the Perfect Competition equilibrium, but only for low learning rates; for high learning rates cycles and aperiodic behaviour emerge.
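The role of the learning rate can be made precise in a stylized version of the gradient dynamics (the exact specification of Section 3.2 may differ; a, b, c and λ are again generic placeholders). If each firm adjusts its quantity in the direction of its profit gradient, then along the symmetric path with linear inverse demand p = a − bQ the dynamics reduce to the linear map

\[
q_{t+1} = q_t + \lambda \left( a - c - (n+1) b \, q_t \right),
\]

whose fixed point is the Nash quantity q^{N} = (a − c)/((n + 1)b) and which is stable if and only if 0 < λ < 2/((n + 1)b). For learning rates above this threshold the adjustment overshoots the equilibrium, which is consistent with the cycles and more complicated behaviour observed in the simulations.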

Next, consider a heterogeneous market with fixed shares of learning methods. When genetic algorithm learners and gradient learners compete at parameter values for which both methods were stable in the homogeneous market, no stable outcome occurs in the heterogeneous market. If the learning rate of gradient learning is decreased and the probability of imitating other firms' production quantities from one period back is increased, a stable outcome equal to the Nash equilibrium does occur. There is no convergence at all when genetic algorithm learners compete with Q-learners; in terms of average profit, however, Q-learning outperforms genetic algorithm learning. When gradient learning and Q-learning compete there is only convergence for low learning rates, and Q-learning outperforms gradient learning for every learning rate. When all learning methods compete in a heterogeneous market with fixed shares there is aperiodic behaviour in all cases, and again Q-learning outperforms the other learning methods.
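To illustrate how such a fixed-shares market operates, the following skeleton couples heterogeneous rules only through the common market price. It is a minimal sketch with toy update rules (a crude gradient step and a noisy-imitation stand-in for the genetic algorithm), hypothetical parameter values, and no Q-learners; it is not the simulation code used in this thesis.

import numpy as np

# Illustrative parameters: inverse demand p = a - bQ, marginal cost c.
a, b, c, n, T = 10.0, 0.25, 1.0, 12, 500
rng = np.random.default_rng(0)

def gradient_step(i, q, lam=0.1):
    # Toy gradient learner: follow d(profit_i)/dq_i for linear demand.
    return max(q[i] + lam * (a - c - b * (q.sum() + q[i])), 0.0)

def imitation_step(i, q):
    # Crude stand-in for the GA's imitation-plus-mutation operators.
    return max(rng.choice(q) + rng.normal(0.0, 0.05), 0.0)

# Fixed shares: the first six firms use gradient learning, the rest imitate.
rules = [gradient_step] * 6 + [imitation_step] * 6
q = rng.uniform(0.5, 1.5, n)

for t in range(T):
    price = max(a - b * q.sum(), 0.0)  # one price couples all learning rules
    profit = (price - c) * q           # profits feed the performance measures
    q = np.array([rules[i](i, q) for i in range(n)])

The key design feature is that no firm observes the other firms' rules or future quantities: heterogeneity enters only through the individual updating functions, while all firms face the same realized price.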

Finally, a heterogeneous market with switching between learning methods is considered. The intensity of choice reflects how sensitively firms respond to differences in performance. When genetic algorithm learners compete with gradient learners, the quantities oscillate around the Nash equilibrium quantity; as the intensity of choice increases, the number of gradient learners converges to 7 and the number of genetic algorithm learners to 5. When genetic algorithm learners compete with Q-learners, the Q-learners outperform the genetic algorithm learners, and the same holds when gradient learners compete with Q-learners. For high intensities of choice the number of Q-learners oscillates between 10 and 12.

When all three learning methods are included in a heterogeneous market with switching, Q-learning outperforms the other methods. The larger the intensity of choice, the higher the share of Q-learners in the market and the more stable the process becomes. In general, however, Q-learning never completely dominates the whole market.

This thesis can be extended in several ways. Observations used by the Q-learning could be given different weights, because observations further back in the past should have less effect on decisions in the present. One could also introduce inertia in the switching method, so that firms keep their learning method unless there is a large drop in profit; this matters especially for Q-learning. Inertia would reduce the amount of switching in the market, but it should not influence the main results of this thesis. The analysis could further be extended to other learning methods and different market environments, for example more complex environments in which firms decide not only on production quantities but also on investments and location. It is also possible to consider heterogeneous intensities of choice or costly learning mechanisms; the conjecture is that these two additions would change the outcome of this thesis. Finally, this thesis considers the dynamics only for linear demand, and it would be interesting to see whether the results also hold in a non-linear case.
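The first two extensions are easy to formalize. The sketch below shows one possible implementation, assuming an exponentially weighted performance measure (so that older observations are discounted geometrically) and a simple inertia probability; the function names and parameter values are illustrative assumptions, not part of the model in this thesis.

import numpy as np

def update_performance(theta_old, profit, alpha=0.1):
    # Recency weighting: a profit observed k periods ago carries
    # weight alpha * (1 - alpha)**k in the performance measure.
    return (1.0 - alpha) * theta_old + alpha * profit

def choose_method(current, theta, beta, inertia, rng):
    # Logit switching with inertia: with probability `inertia` the firm
    # keeps its current method regardless of performance differences.
    if rng.random() < inertia:
        return current
    w = np.exp(beta * (theta - theta.max()))
    return rng.choice(len(theta), p=w / w.sum())

rng = np.random.default_rng(1)
theta = np.array([2.0, 1.5, 1.0])  # illustrative performance measures
method = choose_method(current=2, theta=theta, beta=5.0, inertia=0.8, rng=rng)

With high inertia a firm abandons its rule only after performance differences persist, which would dampen the rapid extinction of Q-learning seen for high β.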


A Appendix A

This appendix provides additional graphs supplementing the section on endogenous switching in heterogeneous markets. The graphs show the quantities, price and profits that occur in the heterogeneous market, as well as the performance measures of the different learning methods, for values of β equal to 1, 10, 15 and 20. Each subsection considers a heterogeneous market with a different combination of learning methods.

A.1 Genetic Algorithm and Gradient Learning

Statistics         β = 1    β = 5    β = 10   β = 15   β = 20   β = 25
mean(quantities)   2.9233   2.9240   2.9233   2.9229   2.9228   2.9228
std(quantities)    0.0511   0.0451   0.0376   0.0315   0.0292   0.0272
mean(price)        2.4600   2.4557   2.4603   2.4626   2.4633   2.4634
std(price)         0.1793   0.1696   0.1551   0.1454   0.1422   0.1407
mean(profit)       4.2627   4.2518   4.2649   4.2714   4.2735   4.2737
std(profit)        0.3619   0.3244   0.2624   0.2145   0.1962   0.1864
mean(θGA)          4.2391   4.0770   3.9988   3.8670   3.8749   3.8783
std(θGA)           0.1918   0.3640   0.3406   0.3008   0.0755   0.0714
mean(θGL)          4.2458   4.0683   3.9229   3.8689   3.9491   3.9086
std(θGL)           0.1878   0.3979   0.3105   0.2567   0.0787   0.0726
mean(PGA)          0.4983   0.4888   0.4883   0.4494   0.3763   0.4169
std(PGA)           0.0637   0.2968   0.3790   0.3473   0.0602   0.0360
mean(PGL)          0.5017   0.5112   0.5117   0.5506   0.6237   0.5828
std(PGL)           0.0637   0.2968   0.3776   0.3459   0.0601   0.0356
ratio(GA)          0.4980   0.4873   0.4884   0.4496   0.3762   0.4170
ratio(GL)          0.5020   0.5127   0.5116   0.5504   0.6238   0.5830

Table 13: Statistics of the switching model with GA learning and gradient learning for different values of β.


Figure 31: Time series of the quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 1.

Figure 32: Time series of the performance of GA (a), the performance of GL (b), the number of firms using GA (c) and the number of firms using GL (d), where firms can switch between learning methods, for β = 1.

Figure 33: Time series of the quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 10.


Figure 34: Time series of the performance of GA (a), the performance of GL (b), the number of firms using GA (c) and the number of firms using GL (d), where firms can switch between learning methods, for β = 10.

Figure 35: Time series of the quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 15.


Figure 36: Time series of the performance of GA (a), the performance of GL (b), the number of firms using GA (c) and the number of firms using GL (d), where firms can switch between learning methods, for β = 15.

Figure 37: Time series of the quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 20.


Figure 38: Time series of the performance of GA (a), the performance of GL (b), the number of firms using GA (c) and the number of firms using GL (d), where firms can switch between learning methods, for β = 20.


A.2 Genetic Algorithm and Q-learning

Statistics         β = 1    β = 5    β = 10   β = 15   β = 20   β = 25
mean(quantities)   3.0266   3.0510   3.0549   3.0571   3.0571   3.0578
std(quantities)    1.6082   0.5307   0.7389   0.6711   0.4990   0.3252
mean(price)        1.8406   1.6940   1.6705   1.6656   1.6577   1.6531
std(price)         0.3504   0.2290   0.1985   0.1958   0.1952   0.1943
mean(profit)       2.5235   2.1085   2.0416   2.0276   2.0042   1.9908
std(profit)        1.7326   0.5984   0.6162   0.5788   0.4697   0.3931
mean(θGA)          1.7071   0.9064   1.0088   1.3104   1.3325   1.4582
std(θGA)           0.9648   0.3534   0.3510   0.2896   0.3067   0.2737
mean(θQL)          2.4775   2.0818   2.0340   2.0200   1.9878   1.8817
std(θQL)           1.5314   0.4650   0.5551   0.5019   0.4013   0.2902
mean(PGA)          0.4649   0.3010   0.1687   0.1487   0.1208   0.1638
std(PGA)           0.1300   0.1226   0.1755   0.1802   0.2035   0.1815
mean(PQL)          0.5351   0.6990   0.8313   0.8513   0.8792   0.8362
std(PQL)           0.1300   0.1239   0.1767   0.1819   0.2046   0.1813
ratio(GA)          0.4643   0.3019   0.1693   0.1498   0.1217   0.1641
ratio(QL)          0.5357   0.6981   0.8307   0.8502   0.8783   0.8359

Table 14: Statistics of the switching model with GA learning and Q-learning for different values of β.

Figure 39: Time series of the quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 1.


Figure 40: Time series of the performance of GA (a), the performance of QL (b), the number of firms using GA (c) and the number of firms using QL (d), where firms can switch between learning methods, for β = 1.

Figure 41: Time series of the quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 10.


Figure 42: Time series of the performance of GA (a), the performance of QL (b), the number of firms using GA (c) and the number of firms using QL (d), where firms can switch between learning methods, for β = 10.

Figure 43: Time series of the quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 15.


Figure 44: Time series of the performance of GA (a), the performance of QL (b), the number of firms using GA (c) and the number of firms using QL (d), where firms can switch between learning methods, for β = 15.

Figure 45: Time series of the quantities of all firms (a), profits (b) and price (c) in a Cournot oligopoly where 12 firms can switch between learning methods, for β = 20.


Figure 46: Time series of the performance of GA (a), the performance of QL (b), the number of firms using GA (c) and the number of firms using QL (d), where firms can switch between learning methods, for β = 20.
