Citation for published version (APA):
van der Wal, J., & Wessels, J. (1976). Successive approximation methods for Markov games (Memorandum COSOR 76-18). Technische Hogeschool Eindhoven.

Document status and date: Published: 01/01/1976

Document version: Publisher's PDF, also known as Version of Record (includes final page, issue and volume numbers)


Department of Mathematics
PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 76-18

Successive approximation methods for Markov games

by

J. van der Wal and J. Wessels

Eindhoven, November 1976
The Netherlands


J. van der Wal and J. Wessels

Summary

In this paper an overview will be presented of the applicability of successive approximation methods for Markov games. The larger part of the paper will be devoted to two-person zero-sum Markov games with the total expected reward criterion. The analysis will include policy-iteration algorithms. Finally there are sections on the average-reward case and on the nonzero-sum case.

1. Introduction

The main purpose of this paper is to investigate the following question: can the theory of successive approximations for Markov decision processes be extended to Markov games?

A preliminary answer to this question can be very short, since Shapley [14] already introduced successive approximations for Markov games in 1953, whereas they were only introduced in 1957 for Markov decision processes [Bellman, 1]. However, for Markov decision processes, under relatively weak conditions, several types of successive approximation methods have been derived, together with sophisticated extrapolation procedures, see e.g. [8] and [9] in this volume. So the present paper will be mainly concerned with the question of generalizing this theory to Markov games. For an elementary treatment of dynamic programming in Markov games we refer to [20]. For other aspects of the theory of Markov games we refer to the recent bibliography and survey by Parthasarathy and Stern [10].

In section 2 the model will be introduced; the finite stage case will also be treated there. We will allow unbounded rewards, but as in [8] and [9] contraction will be assumed. In section 3 the infinite stage case will be treated. In that section we will show that the standard successive approximations technique (extrapolations included) for the expected total reward criterion may be extended to contracting Markov games. In section 4 it will be shown that this is less easy for the policy iteration and other value oriented methods. However, a suitable extension will be presented. In section 5 positive Markov games with stopping actions for the second player will be considered. These games are not necessarily contracting. Section 6 is devoted to Markov games with the average reward criterion, and section 7 to the nonzero-sum case.

2. The model

As in [8] and [9], we consider a system which is observed at discrete points in time t = 0, 1, 2, ... . The system can be in one of a countable number of states: S = {1, 2, ...}. In each state i and at each time t the proceedings of the system may be influenced. This may be done by two players P1 and P2. Except in section 7 these players are supposed to have completely opposite aims. In each state i there are two finite (nonempty) sets K_i and L_i of allowed actions for P1 and P2 respectively. If at some time t the system is in state i and the players choose actions k and ℓ from K_i and L_i respectively, then this results in an immediate reward r(i,k,ℓ) for P1 (to be paid by P2), and it further results in a transition of the system to state j with probability p(j|i,k,ℓ). We suppose Σ_j p(j|i,k,ℓ) ≤ 1.

A strategy π for P1 specifies for all times t and all possible histories h_t the probability π_t(k|h_t) of choosing action k. Here the history h_t equals the sequence of states and actions in the past:

h_t = (s_0, k_0, ℓ_0, ..., s_{t-1}, k_{t-1}, ℓ_{t-1}, s_t),

where s_τ is the state of the system at time τ and k_τ, ℓ_τ are the actions chosen at time τ by P1 and P2. If these probabilities only depend on s_t instead of h_t, then π is called a Markov strategy. If, moreover, π_t does not depend on t explicitly, then π is called a stationary strategy. Stationary strategies correspond to policies, where a policy f for P1 is any function on S such that f(i) is a probability distribution on K_i;

by f(i,k) we denote the probability of k ∈ K_i. The set of policies for P1 is denoted by F, the set of strategies by Π. Similarly one defines strategies γ ∈ Γ and policies g ∈ G for the player P2.

Notations

r(f,g) will denote, for f ∈ F, g ∈ G, a real-valued function on S with

r(f,g)(i) := Σ_{k∈K_i} Σ_{ℓ∈L_i} f(i,k) g(i,ℓ) r(i,k,ℓ) ;

P(f,g) will denote a nonnegative function on S × S with

P(f,g)(i,j) := Σ_{k∈K_i} Σ_{ℓ∈L_i} f(i,k) g(i,ℓ) p(j|i,k,ℓ) .

Functions on S and on S × S respectively will be treated as column vectors and matrices, with matrix products and matrix-vector products defined in the obvious way.
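For a finite instance of this model the objects r(f,g) and P(f,g) can be computed directly from the data. The following sketch is illustrative only and not part of the memorandum; it assumes, for simplicity, that every state has the same numbers of actions, and the array and function names (such as build_rP) are mine.

```python
import numpy as np

def build_rP(r, p, f, g):
    """Compute the vector r(f,g) and the matrix P(f,g) of section 2.

    r[i, k, l]    : immediate reward r(i,k,l), paid by P2 to P1
    p[i, k, l, j] : transition probability p(j|i,k,l); row sums may be < 1 (defective)
    f[i, k]       : probability that policy f chooses action k in state i
    g[i, l]       : probability that policy g chooses action l in state i
    """
    N = r.shape[0]
    r_fg = np.zeros(N)
    P_fg = np.zeros((N, N))
    for i in range(N):
        joint = np.outer(f[i], g[i])                    # joint action probabilities in state i
        r_fg[i] = np.sum(joint * r[i])                  # sum_k sum_l f(i,k) g(i,l) r(i,k,l)
        P_fg[i] = np.einsum('kl,klj->j', joint, p[i])   # sum_k sum_l f(i,k) g(i,l) p(.|i,k,l)
    return r_fg, P_fg
```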

We will work under the following assumptions in this section and in sections 3 and 4.

Assumptions

It is supposed that there is a positive function μ on S, which defines (as in [8] and [9]) a Banach space W_μ of vectors w with the norm

||w||_μ = sup_i |w(i)| μ^{-1}(i) ,

such that

a. ||r(f,g)||_μ ≤ M for some M and all f ∈ F, g ∈ G.

b. ||P(f,g)||_μ ≤ ρ < 1 for some ρ and all f ∈ F, g ∈ G.

In order to simplify our notations we will write W and || || instead of W_μ and || ||_μ whenever this is possible.

Remark

These assumptions are somewhat more restrictive than those in [8] and even than those in [9]. However, as shown in [21], assumption a. may be weakened to

r := sup_{f∈F} inf_{g∈G} r(f,g) ∈ W .

Furthermore the use of the Harrison translation does not present essential difficulties. To avoid technical details we will stick in this and the following two sections to the assumptions stated before.

As in [8] and [9] the transition probabilities may be defective, i.e. Σ_j p(j|i,k,ℓ) ≤ 1. This may be repaired by the introduction of an absorbing state. We will not do this explicitly.

Also the condition that K_i, L_i are finite is not very essential. It may be replaced by: K_i, L_i compact and p(j|i,k,ℓ), r(i,k,ℓ) continuous in k, ℓ.

A starting state i and strategies π ∈ Π, γ ∈ Γ determine a stochastic process {(S_t, K_t, L_t)}_{t=0}^∞ in an obvious way, where S_t is the state at time t and K_t, L_t the actions chosen at time t by P1 and P2 respectively. Probabilities referring to this process are denoted by P_i^{π,γ}, expectations by E_i^{π,γ}. If the index i is deleted, a column vector of probabilities or expectations is meant.

The assumptions guarantee (compare [8])

|| E^{π,γ} Σ_{t=N}^∞ |r(S_t, K_t, L_t)| || ≤ M (1 - ρ)^{-1} ρ^N .

Therefore the total expected rewards (for P1) are properly defined for any pair of strategies:

V(π,γ) := E^{π,γ} Σ_{t=0}^∞ r(S_t, K_t, L_t) .

Strategies π*, γ* are said to be optimal if

V(π,γ*) ≤ V(π*,γ*) =: V* ≤ V(π*,γ)   for all π ∈ Π, γ ∈ Γ.

V* will be called the value of the game.

Analogous to [8] we introduce the following operators in W:

L(f,g)w := r(f,g) + P(f,g)w ,

Uw := max_{f∈F} min_{g∈G} L(f,g)w ,

with max-min taken componentwise.

Note that (Uw)(i) is the value of the matrix game with entries

r(i,k,ℓ) + Σ_j p(j|i,k,ℓ) w(j) .
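Since (Uw)(i) is the value of a matrix game, U can be evaluated state by state by solving one small linear program per state. The sketch below is illustrative only and not taken from the memorandum: it assumes finite state and action sets stored as in the previous sketch, it uses the standard linear-programming formulation of a matrix game, and the function names are mine.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(A):
    """Value, for the maximizing row player, of the matrix game with payoff matrix A."""
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                  # maximize v  <=>  minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])     # for every column j: v <= sum_k x_k A[k, j]
    b_ub = np.zeros(n)
    A_eq = np.zeros((1, m + 1))
    A_eq[0, :m] = 1.0                             # the mixed strategy x sums to one
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * m + [(None, None)], method="highs")
    return res.x[-1]

def apply_U(r, p, w):
    """(Uw)(i) = value of the matrix game with entries r(i,k,l) + sum_j p(j|i,k,l) w(j)."""
    return np.array([matrix_game_value(r[i] + p[i] @ w) for i in range(r.shape[0])])
```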

Now define w_n := U w_{n-1} (n = 1, ..., T) for some w_0 ∈ W and find policies f_n, g_n (n = 1, ..., T) which satisfy

L(f,g_n)w_{n-1} ≤ L(f_n,g_n)w_{n-1} ≤ L(f_n,g)w_{n-1}   for all f, g.

Then we get the following result for the T-stage Markov game with terminal reward w_0 (actually for this result the assumption ρ < 1 may be replaced by ρ < ∞):

Theorem 1

The T-stage Markov game with terminal reward w_0 ∈ W, i.e. the game with criterion function

E^{π,γ} [ Σ_{t=0}^{T-1} r(S_t, K_t, L_t) + w_0(S_T) ] ,

has the value w_T, and the strategies π_T and γ_T, which might be denoted by (f_T, ..., f_1) and (g_T, ..., g_1), are optimal.

The proof proceeds by induction. For details see [20].
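A minimal dynamic-programming sketch of Theorem 1 (again purely illustrative, reusing the hypothetical apply_U of the previous sketch): the value of the T-stage game is obtained by applying U to the terminal reward T times, and the optimal Markov strategies may be read off from the matrix games solved along the way.

```python
def finite_horizon_value(r, p, w0, T):
    """Value w_T of the T-stage game with terminal reward w0, via w_n = U w_{n-1}."""
    w = w0
    for _ in range(T):
        w = apply_U(r, p, w)   # hypothetical helper from the previous sketch
    return w
```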

This shows that in the finite-stage case optimal strategies may be found by dynamic programming or successive approximation. In the following section we will extend this result to the infinite-stage case. For that case our methods of proof bear more heavily on the assumptions. Especially (compare e.g. [21], [17]) it is very essential that the assumptions imply

Lemma 1

L(f,g) and U are contracting with contraction radii ||P(f,g)|| and ν respectively, with

ν ≤ max_{f,g} ||P(f,g)|| ≤ ρ < 1 .

As a consequence of this lemma the operators L(f,g) and U possess unique fixed points. For L(f,g) this fixed point is exactly V(f,g), the criterion value for the stationary strategies f, g in the infinite-stage Markov game. For U this fixed point will be shown to be equal to the value V* of the game.
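For completeness, a sketch of the standard argument behind the contraction of U (it is not spelled out in the memorandum): let f attain the maximum in Uv and let g attain the minimum in min_g L(f,g)w; then, componentwise,

```latex
(Uv)(i) - (Uw)(i) \le L(f,g)v(i) - L(f,g)w(i)
                  = \bigl(P(f,g)(v-w)\bigr)(i)
                  \le \rho \,\lVert v - w \rVert_{\mu}\, \mu(i),
```

and by symmetry ||Uv - Uw||_μ ≤ ρ ||v - w||_μ; keeping the pair (f,g) fixed gives the same estimate for L(f,g).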

3. The ∞-stage Markov game

Let w* ∈ W be the unique fixed point of U in W, so Uw* = w*. Let f*, g* satisfy

L(f,g*)w* ≤ L(f*,g*)w* ≤ L(f*,g)w*   for all f ∈ F, g ∈ G.

We will prove the following result, which has been proved already in 1953 by Shapley [14] for the finite state case with

Σ_j p(j|i,k,ℓ) ≤ β < 1 .

Theorem 2

The stationary strategies f* and g* are optimal in the ∞-stage Markov game and w* is the value of the game, i.e. V* = w*.

Proof

Obviously (theorem 1) the T-stage game with terminal reward w* has value w*, and f*, g* are optimal stationary strategies for that game. Suppose P1 plays f* and P2 an arbitrary strategy γ. Then for all T

V(f*,γ) ≥ w* - P^{(T)} w* - M ρ^T (1 - ρ)^{-1} μ ,

where P^{(T)} denotes the T-step transition matrix for the pair (f*,γ). One may show

P^{(T)} w* ≤ ρ^T ||w*|| μ → 0   (T → ∞).

Hence V(f*,γ) ≥ w* = V(f*,g*) for all γ ∈ Γ. Similarly one shows

V(π,g*) ≤ w*   for all π ∈ Π.

Hence w* = V*.

So the ∞-stage game possesses a value and optimal stationary strategies. It will now be investigated whether successive approximations produce ε-optimal stationary strategies and bounds for V* which are arbitrarily close.

Definition

π_ε ∈ Π is called ε-optimal if

V(π_ε,γ) ≥ V* - εμ   for all γ ∈ Γ;

γ_ε ∈ Γ is called ε-optimal if

V(π,γ_ε) ≤ V* + εμ   for all π ∈ Π.

An obvious way of approximating V* is suggested by the fixed point property of U:

Theorem 3

Choose w_0 ∈ W. Then w_n := U w_{n-1} (n = 1, 2, ...) converges (in norm) to V*, and one actually gets the following bounds:

w_n - ν (1 - ν)^{-1} ||w_n - w_{n-1}|| μ ≤ V* ≤ w_n + ν (1 - ν)^{-1} ||w_n - w_{n-1}|| μ .

However, somewhat better estimates can be given, and one may simultaneously give bounds for the policies f_n and g_n (see section 2) found in the n-th iteration.

Define

λ_n := inf_i (w_n(i) - w_{n-1}(i)) μ^{-1}(i) ,

ν_n := sup_i (w_n(i) - w_{n-1}(i)) μ^{-1}(i) ,

a_n := inf_{i,g} μ^{-1}(i) Σ_j P(f_n,g)(i,j) μ(j)   if λ_n ≥ 0 ,
a_n := sup_{i,g} μ^{-1}(i) Σ_j P(f_n,g)(i,j) μ(j)   if λ_n < 0 ,

b_n := sup_{i,f} μ^{-1}(i) Σ_j P(f,g_n)(i,j) μ(j)   if ν_n ≥ 0 ,
b_n := inf_{i,f} μ^{-1}(i) Σ_j P(f,g_n)(i,j) μ(j)   if ν_n < 0 .

Theorem 4

Choose w_0 ∈ W and define w_n := U w_{n-1} (n = 1, 2, ...). Let f_n ∈ F, g_n ∈ G satisfy

L(f,g_n)w_{n-1} ≤ L(f_n,g_n)w_{n-1} = w_n ≤ L(f_n,g)w_{n-1} .

Then we have the following bounds for V*, V(f_n,g_n), V(f_n,γ) and V(π,g_n):

a. w_n + a_n λ_n (1 - a_n)^{-1} μ ≤ V* ≤ w_n + b_n ν_n (1 - b_n)^{-1} μ ;

b. V(f_n,γ) ≥ w_n + a_n λ_n (1 - a_n)^{-1} μ   for all γ ∈ Γ ;

c. V(π,g_n) ≤ w_n + b_n ν_n (1 - b_n)^{-1} μ   for all π ∈ Π ;

d. w_n + a_n λ_n (1 - a_n)^{-1} μ ≤ V(f_n,g_n) ≤ w_n + b_n ν_n (1 - b_n)^{-1} μ .

Proof

a and d are direct consequences of b and c. The proof of c will be sketched (for more detailed proofs in somewhat different situations, see [17], [21]).

It suffices to prove c for stationary strategies π (compare [8]). Consider a policy (or stationary strategy) f. By definition

L(f,g_n)w_{n-1} ≤ w_n ≤ w_{n-1} + ν_n μ .

Hence

L^2(f,g_n)w_{n-1} ≤ L(f,g_n)[w_{n-1} + ν_n μ] = L(f,g_n)w_{n-1} + ν_n P(f,g_n)μ ≤ w_n + ν_n b_n μ .

In this way one obtains

L^N(f,g_n)w_{n-1} ≤ w_n + ν_n (b_n + ... + b_n^{N-1}) μ .

Hence

V(f,g_n) = lim_{N→∞} L^N(f,g_n)w_{n-1} ≤ w_n + b_n ν_n (1 - b_n)^{-1} μ .

In this way the standard successive approximations technique may be extended to Markov games. On the upper and lower bounds of theorem 4 one may base tests for suboptimality (see [16] and, for a more detailed treatment, [12]).
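As an illustration of Theorem 3, and of the kind of test one may base on such bounds, the following sketch iterates w_n = U w_{n-1} and reports two-sided bounds on V*. It is illustrative only: it reuses the hypothetical apply_U helper above, takes μ ≡ 1 (bounded rewards), and uses the known bound ρ in place of the contraction radius ν, which only makes the bounds weaker.

```python
import numpy as np

def successive_approximations(r, p, rho, w0, max_iter=1000, tol=1e-8):
    """w_n = U w_{n-1} with the bounds of Theorem 3, specialised to mu = 1."""
    w = w0.copy()
    for _ in range(max_iter):
        w_new = apply_U(r, p, w)                 # hypothetical helper from the earlier sketch
        span = np.max(np.abs(w_new - w))         # ||w_n - w_{n-1}|| in the maximum norm
        lower = w_new - rho / (1.0 - rho) * span
        upper = w_new + rho / (1.0 - rho) * span
        w = w_new
        if np.max(upper - lower) < tol:          # the bounds enclose V* within tol
            break
    return w, lower, upper
```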

In [9] it has been shown that an extensive class of successive approximation techniques may be generated by using stopping times. This also holds for Markov games. This will not be worked out in this paper, since the concepts and proofs are rather straightforward (for finite state discounted Markov games this has been worked out in [16] and [17]).

4. Value oriented methods

In this volume Van Nunen and Wessels [9] consider a set of value oriented methods for MDP which can be viewed as a special type of successive approximations method. One of these methods is Howard's policy iteration method. A straightforward generalisation of Howard's method to Markov games has been proposed by Pollatschek and Avi-Itzhak [11]. This generalisation may be formulated as follows.

Algorithm

step 1. v_0(i) := 0 for all i ∈ S.

step 2. (Policy iteration). Determine policies f_n and g_n such that

L(f,g_n)v_{n-1} ≤ L(f_n,g_n)v_{n-1} ≤ L(f_n,g)v_{n-1} .

step 3. (Value determination) v_n := V(f_n,g_n).
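In the finite case step 2 amounts to taking optimal mixed strategies in the matrix games defining (Uv_{n-1})(i), and step 3 to solving a linear system. The sketch below is illustrative only (all names are mine, and it reuses the hypothetical build_rP helper from section 2's sketch); as the example further on shows, the iteration need not converge.

```python
import numpy as np
from scipy.optimize import linprog

def game_value_and_strategy(A):
    """Value and an optimal mixed strategy for the maximizing (row) player of matrix game A."""
    m, n = A.shape
    c = np.zeros(m + 1); c[-1] = -1.0
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    A_eq = np.zeros((1, m + 1)); A_eq[0, :m] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * m + [(None, None)], method="highs")
    return res.x[-1], res.x[:m]

def policy_iteration_pa(r, p, n_iter=10):
    """Sketch of the Pollatschek/Avi-Itzhak iteration for a finite discounted game."""
    N, K, L = r.shape
    v = np.zeros(N)                                    # step 1
    for _ in range(n_iter):
        f, g = np.zeros((N, K)), np.zeros((N, L))
        for i in range(N):                             # step 2: matrix games at v_{n-1}
            A = r[i] + p[i] @ v
            _, f[i] = game_value_and_strategy(A)
            _, g[i] = game_value_and_strategy(-A.T)    # the minimizer's optimal strategy
        r_fg, P_fg = build_rP(r, p, f, g)              # hypothetical helper from section 2's sketch
        v = np.linalg.solve(np.eye(N) - P_fg, r_fg)    # step 3: v_n = V(f_n, g_n)
    return v, f, g
```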

Pollatschek and Avi-Itzhak proved in the finite state case that the algorithm converges under the following condition:

max_i Σ_j { max_{k,ℓ} p(j|i,k,ℓ) - min_{k,ℓ} p(j|i,k,ℓ) } < 1 - max_{i,k,ℓ} Σ_j p(j|i,k,ℓ) .

In [18] essentially the following example has been given, which proves that this algorithm does not converge in general for finite state discounted Markov games.

Example

There is but one state. In this state both players have two actions. If P1 picks action 2 and P2 action 1, then P2 pays P1 2 units and the system stays in state 1 with probability 1/4, etc.

A policy f is completely determined by the probability f(1,1). If we apply the algorithm we find f_1(1,1) = g_1(1,1) = 1, v_1 = 12, f_2(1,1) = g_2(1,1) = 0, v_2 = 4, f_3(1,1) = g_3(1,1) = 1, v_3 = 12, etc. So the algorithm cycles without ever finding an optimal pair of strategies.

A somewhat more refined extension of Howard's method is the following. This extension has been inspired by Hoffman and Karp's method [5] for the average reward Markov game.

Algorithm (H,K)

step 1. Choose v_0 such that Uv_0 ≥ v_0.

step 2. Determine Uv_n and a policy g_{n+1} with L(f,g_{n+1})v_n ≤ Uv_n for all f.

step 3. Determine v_{n+1} := max_f V(f,g_{n+1}).

As in the case of MDP one may consider this algorithm as an extreme element of the following set of value oriented methods:

Algorithm (λ)

step 1. Choose v_0 such that Uv_0 ≥ v_0.

step 2. Determine Uv_n and a policy g_{n+1} with L(f,g_{n+1})v_n ≤ Uv_n for all f.

step 3. Determine v_{n+1} := U_{g_{n+1}}^λ v_n, where the operator U_g is defined by

U_g v := max_f L(f,g)v .
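A sketch of Algorithm (λ), under the same illustrative assumptions as before (finite sets, μ ≡ 1, names mine, reusing the hypothetical game_value_and_strategy from the earlier sketch). Step 1 requires Uv_0 ≥ v_0; the zero vector is adequate only when all rewards are nonnegative, otherwise a suitable v_0 has to be supplied.

```python
import numpy as np

def apply_U_g(r, p, g, v):
    """(U_g v)(i) = max_f L(f,g)v(i): P1's best reply against the fixed policy g."""
    out = np.zeros(r.shape[0])
    for i in range(r.shape[0]):
        A = r[i] + p[i] @ v             # entries r(i,k,l) + sum_j p(j|i,k,l) v(j)
        out[i] = np.max(A @ g[i])       # P1 may as well use a pure best reply
    return out

def value_oriented(r, p, lam, v0, n_iter=50):
    """Algorithm (lambda): pick g_{n+1} from the matrix games at v_n, then apply U_g lambda times."""
    v = v0.copy()                       # step 1: v0 must satisfy U v0 >= v0
    for _ in range(n_iter):
        g = np.zeros((r.shape[0], r.shape[2]))
        for i in range(r.shape[0]):     # step 2: the minimizer's optimal strategy at v_n
            A = r[i] + p[i] @ v
            _, g[i] = game_value_and_strategy(-A.T)   # hypothetical helper from the earlier sketch
        for _ in range(lam):            # step 3: v_{n+1} = U_g^lambda v_n
            v = apply_U_g(r, p, g, v)
    return v
```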

For λ = 1 we have again the standard successive approximations method treated in section 3. For λ = ∞ we have Hoffman and Karp's algorithm. One may prove, using the monotonicity of the operators and Uv_0 ≥ v_0, that v_n converges monotonically to V*.

For the finite state case the proof is given in [18]. The extension of this proof to the case we deal with here is straightforward. One just has to prove by induction that V* ≥ v_n ≥ Uv_{n-1} ≥ v_{n-1} and v_n ≥ U^n v_0. Since ||U^n v_0 - V*|| ≤ ν^n ||v_0 - V*||, we also have ||v_n - V*|| ≤ ν^n ||v_0 - V*||.

A possible extension is again the introduction of stopping-time based versions of the L and U operators, as in [17]. Another extension is that, instead of using a fixed λ, one may use a different λ_n in each iteration step. Note also that, if the first player has only one action in each state, we get the value oriented methods presented by Van Nunen and Wessels [9] for MDP.


5. Strictly positive Markov games with stopping actions

In this section we will consider a type of Markov game for which successive approximations still converge, but where the U and L(f,g) operators are no longer strictly contracting. We release the assumptions of section 2 and replace them by:

S, K_i, L_i all finite;

Σ_{j∈S} p(j|i,k,ℓ) ≤ 1 , r(i,k,ℓ) > 0 for all i, k and ℓ;

and moreover

L_i^{STOP} := {ℓ ∈ L_i | Σ_{j∈S} p(j|i,k,ℓ) = 0 for all k ∈ K_i} ≠ ∅   for all i ∈ S.

By ||v|| we mean the standard maximum norm, ||v|| = max_{i∈S} |v(i)|.

So all rewards are strictly positive and, since S, K_i, L_i are finite, also bounded away from zero. The assumptions allow V(π,γ)(i) = ∞ for some π, γ and i. But since L_i^{STOP} is nonempty, P2 can stop playing immediately in each state and thus restrict his loss to some finite amount.

As in section 2 we have the following lemma.

Lemma

The n-stage game with terminal reward w_0 has the value U^n w_0, with optimal strategies (f_n, ..., f_1) and (g_n, ..., g_1) satisfying

L(f,g_k)w_{k-1} ≤ L(f_k,g_k)w_{k-1} =: w_k ≤ L(f_k,g)w_{k-1}   for all f and g.

The problem remains to investigate how w_n behaves as n tends to infinity.

Let r^{STOP}(i) be defined as the value of the matrix game with entries r(i,k,ℓ), k ∈ K_i, ℓ ∈ L_i^{STOP}. Then for any w_0 we obviously have

w_n ≤ r^{STOP} ,   n ∈ ℕ,

since in state i the second player may restrict his loss to r^{STOP}(i) by choosing a good randomized action in L_i^{STOP}. We also have

0 ≤ U^{n-1} 0 ≤ U^n 0 ,   n = 2, 3, ... ,

hence lim_{n→∞} U^n 0 exists. Call this limit w*.

Theorem 5

w* is the unique fixed point of U, and U^n v → w* (n → ∞) for any v ∈ ℝ^N.

Proof

First we prove the uniqueness. Let u and v be fixed points of U which have (f_u,g_u) and (f_v,g_v) as optimal policies in the one-stage game with terminal payoff u and v respectively. Then

u ≥ L^n(f_v,g_u)u ≥ v - P^{f_v,g_u}(S_n ∈ S) ||u - v|| .

Obviously, for all i ∈ S, P_i^{f_v,g_u}(S_n ∈ S) → 0 (n → ∞), since otherwise V(f_v,g_u)(i) = ∞, contradicting V(f_v,g_u) ≤ u. Hence u ≥ v. Similarly u ≤ v, and thus u = v.

So it remains to show U^n v → w* for any v. This follows from

U^n v ≥ w* - P^{f*,(g_n,...,g_1)}(S_n ∈ S) ||v - w*|| ,

where f* is an optimal policy for the one-stage game with terminal payoff w* and (g_n,...,g_1) are optimal policies for the n-stage game with terminal payoff v. Again it is obvious that P^{f*,(g_n,...,g_1)}(S_n ∈ S) → 0 (n → ∞). Therefore lim inf_{n→∞} U^n v ≥ w*. Similarly one may show lim sup_{n→∞} U^n v ≤ w*. Hence lim_{n→∞} U^n v = w*.

Here it is again possible to determine bounds for w*, using that the probabilities P^{π,(g_n,...,g_1)}(S_n ∈ S) tend to zero.

It is not necessary to assume that P2 can quit playing in any state. It is sufficient to assume that P2 can restrict his loss to some finite amount. This more general case has been treated by Kushner and Chamberlain [6].

6. Average reward Markov game

In this section the state space will be assumed to be finite.

In the previous sections we have seen that it is possible to extend many of the results with respect to successive approximations in MDP to Markov games. In the average reward case, however, we encounter substantial difficulties. This is illustrated by the following example, called the big match. It is due to Gillette [4] and was studied by Blackwell and Ferguson [3].

Example

(The payoff and transition diagram for states 1, 2 and 3 is not reproduced here.) If in state 1 P1 picks action 2 and P2 action 1, the payoff will be zero and the system moves to state 2, etc. So states 2 and 3 are absorbing.

One easily argues that, if P2 takes in state 1 action 1 with probability 1/2, the average reward for P1 will be 1/2, whatever strategy he uses. But it is not very clear how P1 can guarantee himself an average payoff of 1/2. Any Markov strategy guarantees only 0. This is seen as follows. Let p(n) denote the probability that P1 has picked action 2 before or on time n, and define p := lim_{n→∞} p(n). Now let ε > 0 be given arbitrarily and let N_ε be such that

p - p(N_ε) ≤ ε .

Then P2's strategy "play action 1 until time N_ε and action 2 thereafter" gives an average payoff of at most ε.

Blackwell and Ferguson show that P1 can guarantee himself the average payoff N/(2(N+1)) by playing the strategy π_N defined as follows. Let P2's first n choices be i_1, ..., i_n, i_k ∈ {1,2}, and let c_n be the excess of 1's over 2's among i_1, ..., i_n. Then take action 2 with probability (N + c_n + 1)^{-2}. The difficulties here arise from the fact that there are strategies with more than one recurrent subchain.

Under the assumption that all pairs of stationary strategies induce an irreducible Markov chain (one recurrent subchain and no transient states), Hoffman and Karp [5] show that the game has a value and that their algorithm (H,K) from section 4 yields ε-optimal stationary strategies. Rios and Yanez [13] consider the game with p(j|i,k,ℓ) ≥ p > 0 for all i, j, k and ℓ. (Then obviously Hoffman and Karp's irreducibility assumption is satisfied.) They show that in this case the standard successive approximations method converges. Recently Tanaka and Wakuta [15], dealing with compact state and action spaces under appropriate continuity assumptions, considered the following condition:

P_i^{π,γ}(S_n = s_0) ≥ α > 0   for some s_0 ∈ S and all i, π and γ,

and showed that in this case the game has a value and that successive approximations converge.

7. Nonzero-sum two-person Markov games

This section shows that finite-stage two-person nonzero-sum Markov games have at least one Nash equilibrium point [7], which may be determined by successive approximations.

The main difference with the zero-sum games of the previous sections is that now we have two reward functions r_1(x,k,ℓ) and r_2(x,k,ℓ), where r_i denotes the reward for P_i, i = 1, 2. Furthermore we have two terminal reward functions w_1 and w_2. As a result we have to define two total expected reward functions V_1(π,γ) and V_2(π,γ) for P1 and P2 respectively. Now we are looking for a Nash equilibrium pair (cf. [7]) for this game, that is, a pair of strategies π*, γ* satisfying

V_1(π,γ*) ≤ V_1(π*,γ*)   and   V_2(π*,γ) ≤ V_2(π*,γ*)   for all π and γ.

In bimatrix games (1-stage games) there can in general be more than one equilibrium pair.

The assumptions in this section are the following:

(i) S is countable, K_x, L_x finite.

(ii) There exist two positive vectors μ_1 and μ_2 such that

||r_1(f,g)||_{μ_1} ≤ M_1 and ||P(f,g)||_{μ_1} ≤ ρ_1 for all f and g,

||r_2(f,g)||_{μ_2} ≤ M_2 and ||P(f,g)||_{μ_2} ≤ ρ_2 for all f and g.

Analogous to section 2 we define the operators L_1 and L_2 on W_{μ_1} and W_{μ_2} respectively by

L_i(f,g)w(x) := Σ_{k∈K_x} Σ_{ℓ∈L_x} f(x,k) g(x,ℓ) [ r_i(x,k,ℓ) + Σ_{j∈S} p(j|x,k,ℓ) w(j) ] ,   i = 1, 2.

Now, for all x ∈ S, w_1 ∈ W_{μ_1} and w_2 ∈ W_{μ_2}, L_1(f,g)w_1 and L_2(f,g)w_2 determine a bimatrix game. Note that assumption (ii) guarantees that L_i(f,g)w_i lies again in W_{μ_i}.

Let us consider the n-stage game with terminal payoffs w_1 and w_2 for P1 and P2 respectively, with w_i ∈ W_{μ_i}. Now define w_1^0 := w_1, w_2^0 := w_2. Let f_n and g_n be a pair of policies satisfying

L_1(f,g_n)w_1^{n-1} ≤ L_1(f_n,g_n)w_1^{n-1}   and   L_2(f_n,g)w_2^{n-1} ≤ L_2(f_n,g_n)w_2^{n-1}   for all f and g,

and define w_i^n := L_i(f_n,g_n)w_i^{n-1}, i = 1, 2. Then we have the following result: the pair of strategies π_n := (f_n, ..., f_1), γ_n := (g_n, ..., g_1) is a Nash equilibrium pair of strategies for the n-stage game under consideration. The proof of this statement goes along the same lines as the proof in [20] for zero-sum games, essentially using the monotonicity of the L_i operators.
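For finite state and action sets the recursion above is straightforward to implement once an equilibrium of each bimatrix game can be found. The sketch below is illustrative only and not from the memorandum: for simplicity it looks for a pure equilibrium, which exists in the example below but certainly not in every bimatrix game (in general a mixed-equilibrium routine, e.g. Lemke-Howson, would be needed), and all names are mine.

```python
import numpy as np
from itertools import product

def pure_equilibrium(A, B):
    """A pure Nash equilibrium (k, l) of the bimatrix game (A, B), or None if there is none.
    A[k, l] is P1's payoff and B[k, l] is P2's payoff."""
    for k, l in product(range(A.shape[0]), range(A.shape[1])):
        if A[k, l] >= A[:, l].max() and B[k, l] >= B[k, :].max():
            return k, l
    return None

def n_stage_equilibrium(r1, r2, p, w1, w2, n):
    """Backward recursion of section 7, restricted to pure stage policies.
    policies[m] holds the pair (f_{m+1}, g_{m+1}); in the n-stage game they are
    used in reverse order, i.e. (f_n, ..., f_1) and (g_n, ..., g_1)."""
    N = r1.shape[0]
    policies = []
    for _ in range(n):
        f = np.zeros(N, dtype=int)
        g = np.zeros(N, dtype=int)
        w1_new, w2_new = np.zeros(N), np.zeros(N)
        for x in range(N):
            A = r1[x] + p[x] @ w1            # P1's payoffs including continuation
            B = r2[x] + p[x] @ w2            # P2's payoffs including continuation
            f[x], g[x] = pure_equilibrium(A, B)   # assumes a pure equilibrium exists
            w1_new[x], w2_new[x] = A[f[x], g[x]], B[f[x], g[x]]
        w1, w2 = w1_new, w2_new
        policies.append((f, g))
    return policies, w1, w2
```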

For infinite stage games there are a number of theorems about the existence of a pair of equilibrium strategies. See for example the survey paper by Parthasarathy and Stern [10]. Beniest [2] considers a game with S finite and

Σ_{j∈S} p(j|i,k,ℓ) < 1   for all i, k and ℓ,

under two different cooperation schemes, and shows that in both cases there exists a unique pair of value vectors v_1*, v_2* which may be determined by successive approximations.

For the case of noncooperation the following example shows one of the problems we encounter when considering infinite stage games.

Example

There is only one state. Each entry of the table below gives r_1, r_2 and the probability that the system does not vanish:

                P2: action 1      P2: action 2
P1: action 1    5, 5, 3/4         1, 7, 3/4
P1: action 2    7, 1, 3/4         2, 2, 3/4

If P1 picks action 1 and P2 action 2, then P1 receives 1, P2 receives 7 and the system vanishes with probability 1/4, etc.

For each finite horizon game there is only one equilibrium pair of strategies, namely: always pick action 2. In the infinite horizon game, however, there is still another equilibrium pair, consisting of non-Markov strategies: pick action 1 until your opponent has picked action 2, then continue to play action 2. One easily argues that if both players use this strategy, this is indeed an equilibrium pair.
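Assuming the entries as arranged in the table above (the arrangement of the entries is a reconstruction, so it should be checked against the original), the claim can be verified directly. Along the trigger pair both players receive 5 in every period; a unilateral deviation yields 7 once, after which the opponent plays action 2 forever and the deviator can collect at most 2 per period:

```latex
\sum_{t=0}^{\infty} 5\left(\tfrac{3}{4}\right)^{t} = 20
\qquad\text{versus}\qquad
7 + \sum_{t=1}^{\infty} 2\left(\tfrac{3}{4}\right)^{t} = 7 + 6 = 13,
```

so no player gains by deviating from the trigger pair.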

Acknowledgement

With respect to this section, the authors gratefully acknowledge the contribution of their student mr. Pulskens.

References

[1] R.A. Bellman, A Markovian decision process. J. Math. Mech. 6 (1957), 679-684.

[2] W. Beniest, Jeux stochastiques totalement cooperatifs arbitres. Cahiers du Centre d'Etude de Recherche Operationnelle (1963), 124-138.

[3] D. Blackwell, T.S. Ferguson, The big match. Ann. Math. Statist. 39 (1968), 159-163.

[4] D. Gillette, Stochastic games with zero stop probabilities. In: Contributions to the theory of games III, eds. M. Dresher, A.W. Tucker and P. Wolfe, Princeton Univ. Press, Princeton, New Jersey, 1957, 179-187.

[5] A.J. Hoffman, R.M. Karp, On nonterminating stochastic games. Management Science 12 (1966), 359-370.

[6] H.J. Kushner, S.G. Chamberlain, Finite state stochastic games: existence theorems and computational procedures. IEEE Trans. Automatic Control AC-14 (1969), 248-255.

[7] J. Nash, Non-cooperative games. Ann. of Math. 54 (1951), 286-295.

[8] J. van Nunen, J. Wessels, Markov decision processes with unbounded rewards. In this volume.

[9] J. van Nunen, J. Wessels, The generation of successive approximation methods for Markov decision processes by using stopping times. In this volume.

[10] T. Parthasarathy, M. Stern, Markov games - a survey. Report (1976), University of Illinois at Chicago Circle, Chicago, Illinois.

[11] M.A. Pollatschek, B. Avi-Itzhak, Algorithms for stochastic games. Management Science 15 (1969), 399-415.

[12] D. Reetz, J. van der Wal, On suboptimality in two-person zero-sum Markov games. Memorandum COSOR 75-19, December 1976, Eindhoven University of Technology, Dept. of Math.

[13] S. Rios, I. Yanez, Programmation sequentielle en concurrence. In: Research papers in statistics, ed. F.N. David, John Wiley and Sons, London-New York-Sydney, 1966, 289-299.

[14] L.S. Shapley, Stochastic games. Proc. Nat. Acad. Sci. 39 (1953), 1095-1100.

[15] K. Tanaka, K. Wakuta, On Markov games with the expected average reward criterion (II). Sci. Rep. Niigata Univ. Ser. A 13 (1976), 49-54.

[16] J. van der Wal, The solution of Markov games by successive approximation. Master's thesis, February 1975, Eindhoven University of Technology, Dept. of Math.

[17] J. van der Wal, The method of successive approximations for the discounted Markov game. Memorandum COSOR 75-02, March 1975, Eindhoven University of Technology, Dept. of Math.

[18] J. van der Wal, Successive approximation and discounted Markov games. Memorandum 11~, February 1976, Twente University of Technology, Dept. of Appl. Math.

[19] J. van der Wal, Positive Markov games with stopping actions. Memorandum 131, May 1976, Twente University of Technology, Dept. of Appl. Math.

[20] J. van der Wal, J. Wessels, On Markov games. Statistica Neerlandica (1976), 51-71.

[21] J. Wessels, Markov games with unbounded rewards. Memorandum COSOR 76-05, March 1976, Eindhoven University of Technology, Dept. of Math.
