
Positive Markov games with stopping actions

Citation for published version (APA):
Wal, van der, J. (1978). Positive Markov games with stopping actions. (Memorandum COSOR; Vol. 78-22). Technische Hogeschool Eindhoven.

Document status and date: published 01/01/1978.
Document version: publisher's PDF, also known as Version of Record (includes final page, issue and volume numbers).



EINDHOVEN UNIVERSITY OF TECHNOLOGY
Department of Mathematics
Probability Theory, Statistics and Operations Research Group

Memorandum COSOR 78-22

Positive Markov games with stopping actions

by

J. van der Wal

Eindhoven, November 1978
The Netherlands


Positive Markov games with stopping actions

J. van der Wal

Department of Mathematics
Eindhoven University of Technology
Eindhoven, the Netherlands

Abstract. In this paper we consider the two-person zero-sum Markov (stochastic) game with finite state and action spaces, where the immediate payoffs from player II to player I are strictly positive, but where player II has in each state an action which terminates the play immediately. This game is a special case of the positive Markov games considered by Kushner and Chamberlain. Making explicit use of the fact that player II can terminate the game immediately, we derive with the method of successive approximations an ε-band for the value of the game and stationary ε-optimal strategies for both players.

1. Introduction and notations. Consider a dynamic system with finite state space S := {1,2,...,N} to be observed at t = 0,1,2,... . The behaviour of the system is influenced by two players, P1 and P2, having completely opposite aims. In each state i ∈ S there exist two finite nonempty sets of actions, one for each player. The set for P1 is denoted by K_i, the one for P2 by L_i. If at time t the system is observed in state i, then P1 selects an action from K_i and P2 an action from L_i. As a joint result of the state i and the two selected actions, k by P1 and ℓ by P2, P1 receives an amount r(i,k,ℓ) from P2, the system moves with probability p(j|i,k,ℓ) to state j, and the play terminates with probability 1 − Σ_j p(j|i,k,ℓ).

We make the following two assumptions:

i) r(i,k,ℓ) ≥ a > 0 for all i, k and ℓ;

ii) in each state i ∈ S there exists an action ℓ_i ∈ L_i such that Σ_j p(j|i,k,ℓ_i) = 0 for all k ∈ K_i.

So, as long as the play goes on, P2 loses at least an amount a in each step. But P2 has in each state an action which terminates the play immediately, of course at some positive cost.

Both players are interested in maximizing their total expected reward.

A policy f for P1 is any function which assigns to each state i ∈ S a probability distribution f(i) on K_i, and f(i,k) denotes the probability with which action k is taken in state i; f(i) can be called a randomized action.

A (Markov) strategy π for P1 is a sequence of policies π = (f_0, f_1, ...). If at time n the system is observed in state i, strategy π prescribes the randomized action f_n(i).

For this type of game (cf. [7]) it is sufficient to consider Markov strategies only.

A stationary strategy π is a strategy in which the policies f_n are identical. Notation: π = f^(∞).

Similarly, we define policies g and strategies ρ for P2.

For each pair of strategies π, ρ let V(π,ρ) be the N-columnvector with i-th component equal to the total expected reward for P1 (possibly ∞) if at time 0 the system is observed in state i and strategies π and ρ are used.

In [3] Maitra and Parthasarathy have considered positive Markov games; however, they assume V(π,ρ) < ∞ for all π and ρ.

Kushner and Chamberlain [2] consider, in their assumptions A2 through A4, a more general positive stochastic game than we do. They proved that this game has a value and that the method of successive approximations converges.

In section 4 of this paper we show how the structure of the game (assumptions i) and ii)) can be exploited further to obtain, with the method of successive approximations, upper and lower bounds for the value of the game and nearly optimal stationary strategies. This we do using the fact that for sensible strategies for P2 the probability that the play still goes on decreases exponentially.

Before we do this we give some preliminary results concerning the method of successive approximations for finite stage games in section 2. To make this paper self-contained we prove in section 3 that the game has a value, that stationary optimal strategies exist and that the method of successive approximations converges for any starting vector. Finally in section 5 we weaken assumption ii).

We conclude this section with some notations. For a vector v ∈ ℝ^N we define ‖v‖, v̄ and v̲ by

‖v‖ := max_i |v(i)| ,   v̄ := max_i v(i) ,   v̲ := min_i v(i) .

And e ∈ ℝ^N is the N-columnvector with all components equal to one: eᵀ = (1,1,...,1).

2. Finite stage Markov games. In section 3 we will show that, as in the discounted Markov game (cf. Shapley [5]), the value of the n-stage game may serve as an approximation for the ∞-stage game. Therefore we repeat in this section some results for finite stage Markov games.

We consider the n-stage Markov game with terminal payoff w. So we observe the system at t = 0,1,...,n−1, and if, as a result of the actions taken at time n−1, the system reaches state j at time n, then P1 receives a terminal payoff w(j) from P2. Let W_n(π,ρ) be the total expected reward vector for P1 in this game if strategies π and ρ are used.

Before we show that this game has a value and how to determine it, we first give some more notations. Define the N-columnvector r(f,g), the N × N matrix P(f,g) and the operators L(f,g) and U on ℝ^N by

r(f,g)(i) := Σ_k Σ_ℓ f(i,k) g(i,ℓ) r(i,k,ℓ) ,   i ∈ S ,

P(f,g)(i,j) := Σ_k Σ_ℓ f(i,k) g(i,ℓ) p(j|i,k,ℓ) ,   i,j ∈ S ,

L(f,g)v := r(f,g) + P(f,g)v ,

Uv := max_f min_g L(f,g)v ,

where max min is taken componentwise.

So (L(f,g)v)(i) is the expected payoff for P1 if in the matrix game with entries r(i,k,ℓ) + Σ_j p(j|i,k,ℓ)v(j) the randomized actions f(i) and g(i) are taken, and (Uv)(i) is the value of this matrix game. L(f,g)v is also the expected payoff for P1 in the 1-stage game with terminal payoff v if policies f and g are used.

Clearly the operators L(f,g) and U are monotone, i.e. if v ≥ w then L(f,g)v ≥ L(f,g)w and Uv ≥ Uw. As we will treat the n-stage game by a dynamic programming approach, it will be more convenient to renumber the observation points in reversed order: the initial time becomes t = n, the last actions are taken at t = 1, and the terminal payoff takes place at t = 0.

Define

w_0 := w ,

(2.1)   w_t := Uw_{t−1} ,   t = 1,2,...,n ,

and let π_n = (f_n,...,f_1) and ρ_n = (g_n,...,g_1) be n-stage strategies with f_t, g_t (to be applied at reverse time t) satisfying

(2.2)   L(f,g_t)w_{t−1} ≤ L(f_t,g_t)w_{t−1} = Uw_{t−1} ≤ L(f_t,g)w_{t−1}

for all f and g. So f_t(i) and g_t(i) are optimal randomized actions in the matrix games with entries r(i,k,ℓ) + Σ_j p(j|i,k,ℓ)w_{t−1}(j). Then we have the following result.

Theorem 2.1. The n-stage stochastic game with terminal payoff w has value w_n (= Uⁿw), and the strategies π_n and ρ_n are optimal in this game for P1 and P2 respectively:

W_n(π,ρ_n) ≤ W_n(π_n,ρ_n) = w_n ≤ W_n(π_n,ρ)   for all π and ρ .

Proof. The proof is fairly straightforward using the monotonicity of the L(f,g) and U operators, cf. [7]. □

An immediate consequence of this theorem is:

Corollary 2.2. Let π_n = (f_n,...,f_1) as defined by (2.2) be an optimal strategy for P1 in the n-stage game with terminal payoff w. Then (f_m,...,f_1) is optimal in the m-stage game with terminal payoff w (m ≤ n), and (f_n,...,f_{n−m+1}) is optimal in the m-stage game with terminal payoff U^{n−m}w.
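Theorem 2.1 and corollary 2.2 say that the whole n-stage game is solved by backward induction on (2.1), recording optimal randomized actions per state at every stage. The following hedged sketch builds on the helpers above; obtaining g_t(i) from the transposed game with negated entries is our own shortcut, not a construction from the memorandum.

```python
def n_stage(w, r, p, n):
    """Return w_n = U^n w and the stage policies (f_t, g_t) of (2.2), t = 1, ..., n."""
    v = np.asarray(w, dtype=float)
    policies = []
    for t in range(1, n + 1):
        f_t, g_t, new_v = [], [], np.empty_like(v)
        for i in range(len(v)):
            entries = r[i] + np.tensordot(p[i], v, axes=([2], [0]))
            new_v[i], x = matrix_game_value(entries)   # optimal f_t(i) for P1
            _, y = matrix_game_value(-entries.T)       # optimal g_t(i) for P2 (dual game)
            f_t.append(x)
            g_t.append(y)
        policies.append((f_t, g_t))
        v = new_v
    return v, policies   # policies[t-1] is applied at reverse time t
```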

3. Value and stationary optimal strategies. In this section we show that the operator U has a unique fixed point v*, that v* is the value of the ∞-stage game and that the sequence Uⁿw converges to v* for all w ∈ ℝ^N. Further it will be shown that optimal policies for the 1-stage game with terminal payoff v* give stationary optimal strategies for the ∞-stage game.

Let b be the N-columnvector with i-th component

b(i) := max_{k∈K_i} r(i,k,ℓ_i) .

And let p(π,ρ,n) be the N-columnvector with i-th component p(π,ρ,n,i) equal to the probability that the play still continues at stage n if the game has started in state i and strategies π and ρ are used.

Further denote by ρ_n(w) an optimal strategy for P2 for the n-stage game with terminal payoff w, and by ρ_n(π,w) a strategy which is optimal if it is already known that P1 plays π, i.e.

W_n(π, ρ_n(π,w)) = min_ρ W_n(π,ρ) .

Then we have the following basic lemma.

Lemma 3.1.

i) Uⁿw ≤ b for all w ∈ ℝ^N and n ∈ ℕ.

If na + w̲ > 0 then

ii) p(π, ρ_n(w), n, i) ≤ b(i)/(na + w̲),

iii) p(π, ρ_n(π,w), n, i) ≤ b(i)/(na + w̲).

Proof. i) Clearly P2 can restrict his losses in any n-stage game to b by terminating the play immediately, so Uⁿw ≤ b.

ii) From i) and ρ_n(w) being optimal we have

W_n(π, ρ_n(w)) ≤ w_n = Uⁿw ≤ b .

On the other hand the losses for P2 are at least p(π, ρ_n(w), n, i)(na + w̲), as, if the play still continues at (reversed) time 0, P2 will have lost at least a at each stage, and at time 0 he loses additionally a terminal amount of at least w̲: together at least na + w̲. So we have

p(π, ρ_n(w), n, i)(na + w̲) ≤ W_n(π, ρ_n(w))(i) ≤ b(i) ,

from which ii) follows immediately.

iii) Similarly, since W_n(π, ρ_n(π,w)) ≤ W_n(π, ρ_n(w)) ≤ b. □

So p(π, ρ_n(w), n) behaves as O(n⁻¹), and by corollary 2.2 the same bound holds along the whole sequence of approximations. In section 4 we will see that p(π, ρ_n(w), n) decreases even exponentially fast; similarly p(π, ρ_n(π,w), n).
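To give lemma 3.1(ii) some numerical feel (an illustration of ours, with assumed values b̄ = 10, a = 1 and w = 0), the lemma then gives

p(π, ρ_n(w), n, i) ≤ b(i)/(na + w̲) ≤ 10/n ,

so after 100 stages the play still continues with probability at most 0.1. Section 4 sharpens this O(1/n) rate to a geometric one.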

n

Define v_n := Uⁿ0, n = 0,1,..., then we have:

Theorem 3.2. lim v_n exists and is finite.

Proof. From r(i,k,ℓ) ≥ a > 0 we have v_1 ≥ v_0 = 0, and from the monotonicity of U we get v_{n+1} ≥ v_n, n = 1,2,... . So v_n is a nondecreasing sequence, which according to lemma 3.1(i) is bounded by b. So lim v_n exists and does not exceed b. □

Define v* := lim_{n→∞} v_n. Then, as Uv is continuous in v, Uv* = v*, so there exist policies f* and g* satisfying

(3.1)   L(f,g*)v* ≤ v* ≤ L(f*,g)v*

for all f and g. Now we get the following theorem.

Theorem 3.3.

i) v* is the value of the ∞-stage game.

ii) The strategies f*(∞) and g*(∞), satisfying (3.1), are optimal for P1 and P2 respectively.

Proof. i) P1 can get at least v_n by playing an optimal strategy for the n-stage game first and playing arbitrarily thereafter. So sup_π inf_ρ V(π,ρ) ≥ v_n, and with n → ∞, sup_π inf_ρ V(π,ρ) ≥ v*.

Further let π_n = (f_n,...,f_1) (in reversed time notation) be an arbitrary n-stage strategy for P1, and let V_n(π,ρ) denote the total expected reward for P1 in the n-stage game with terminal payoff 0. Then we have by the monotonicity of the L(f,g) operators and inequality (3.1)

(3.2)   V_n(π_n, g*(∞)) = L(f_n,g*)···L(f_1,g*)0 ≤ L(f_n,g*)···L(f_1,g*)v* ≤ v* .

So also V(π, g*(∞)) ≤ v* for all π. Hence

inf_ρ sup_π V(π,ρ) ≤ sup_π V(π, g*(∞)) ≤ v* .

So v* is the value of the ∞-stage game.

ii) From (3.2) we already have that g*(∞) is optimal for P2. It remains to show the optimality of f*(∞).

Let ρ_n(f*(∞),0) = (g̃_n,...,g̃_1) (reversed time again) be an optimal reply to f*(∞) in the n-stage game with terminal payoff 0. Then

V_n(f*(∞), ρ_n(f*(∞),0)) = L(f*,g̃_n)···L(f*,g̃_1)0
  ≥ L(f*,g̃_n)···L(f*,g̃_1)v* − p(f*(∞), ρ_n(f*(∞),0), n)‖v*‖
  ≥ v* − p(f*(∞), ρ_n(f*(∞),0), n)‖v*‖ .

With lemma 3.1(iii) we get p(f*(∞), ρ_n(f*(∞),0), n) → 0. Hence for all ρ

V(f*(∞),ρ) ≥ v* .

So f*(∞) is an optimal strategy for P1. □

By definition we have Uⁿ0 → v*. But one might also want to start the successive approximation procedure with a scrapvector w ≠ 0, for example w = b. That this also leads to a convergent algorithm is stated in the following theorem.

Theorem 3.4. lim_{n→∞} Uⁿw = v* for all w ∈ ℝ^N.

Proof. The proof we give here is similar to the proof in theorem 3.3 of the optimality of f*(∞). Consider the n-stage game with terminal payoff w and let W_n(π,ρ) denote again the total expected reward for P1 in this game. Let P1 play f*(∞) and P2 an optimal reply ρ_n(f*(∞),w) = (ḡ_n,...,ḡ_1). Then we have

Uⁿw ≥ W_n(f*(∞), ρ_n(f*(∞),w)) = L(f*,ḡ_n)···L(f*,ḡ_1)w
  ≥ L(f*,ḡ_n)···L(f*,ḡ_1)v* − p(f*(∞), ρ_n(f*(∞),w), n)‖w − v*‖
  ≥ v* − p(f*(∞), ρ_n(f*(∞),w), n)‖w − v*‖ .

From lemma 3.1(iii) we have p(f*(∞), ρ_n(f*(∞),w), n) → 0, so liminf_{n→∞} Uⁿw ≥ v*.

To prove limsup_{n→∞} Uⁿw ≤ v*, let P2 play g*(∞) and P1 an optimal n-stage reply π̃_n = (f̃_n,...,f̃_1) to g*(∞). Then we have

Uⁿw ≤ W_n(π̃_n, g*(∞)) = L(f̃_n,g*)···L(f̃_1,g*)w
  ≤ L(f̃_n,g*)···L(f̃_1,g*)v* + p(π̃_n, g*(∞), n)‖w − v*‖
  ≤ v* + p(π̃_n, g*(∞), n)‖w − v*‖ .

Also, with lemma 3.1(ii), p(π̃_n, g*(∞), n) → 0, as g*(∞) is an optimal strategy in the n-stage game with terminal payoff v* (cf. (3.1)). So also limsup_{n→∞} Uⁿw ≤ v*. Hence lim_{n→∞} Uⁿw = v*. □

4. Bounds on v* and nearly optimal stationary strategies. In the previous section we have shown that the successive approximations Uⁿw converge to v* and that p(π,ρ,n) tends to zero for all relevant strategies ρ for P2 (cf. lemma 3.1(ii) and the proof of theorem 3.4). In order to derive bounds on v* and ε-optimal strategies we need that p(π,ρ,n) → 0 exponentially fast for all sensible strategies ρ.

Consider the approximations Uⁿw, and let ρ_n(w) := (g_n,...,g_1) be an optimal strategy for P2 in the n-stage game with terminal payoff w satisfying (2.2). Then by corollary 2.2, (g_n,...,g_{n−m+1}) is optimal in the m-stage game with terminal payoff U^{n−m}w = w_{n−m}. So with lemma 3.1(ii), assuming ma + w̲_{n−m} > 0,

p(π, ρ_n(w), m) ≤ b̄/(ma + w̲_{n−m}) .

Define δ(m,w) by

δ(m,w) := min(1, b̄/(ma + w̲)) .

Then we can write

(4.1)   p(π, ρ_n(w), m) ≤ δ(m, w_{n−m}) .

So if P2 plays ρ_n(w), then at reversed time n − m the play will have terminated with a probability of at least 1 − δ(m, w_{n−m}). And given that the (n-stage) play still goes on at stage m (reversed time n − m) and that P2 continues to play ρ_n(w), the play will terminate before "stage" n + 1 (reversed time 0) with a probability of at least 1 − δ(n − m, w). From this reasoning we get

(4.2)   p(π, ρ_n(w), n) ≤ δ(m, w_{n−m}) δ(n − m, w) .

This leads us to

Lemma 4.1. Let ρ_n(w) be optimal in the n-stage game with terminal payoff w. Then, if Uw ≥ w,

p(π, ρ_n(w), n) ≤ δ(m,w) δ(n−m,w)   (m ≤ n) .

Proof. From Uw ≥ w we get with the monotonicity of U that w_{n−m} ≥ w. So δ(m, w_{n−m}) ≤ δ(m,w). Substituting this into (4.2) gives the desired result. □

The lemma implies that p(π, ρ_n(w), n) decreases exponentially fast:

Corollary 4.2. If Uw ≥ w and n = up + q, u,p,q ∈ ℕ, then

p(π, ρ_n(w), n) ≤ δᵘ(p,w) δ(q,w) .

This enables us to obtain bounds on v*.

Theorem 4.3. If Uw ≥ w and p ∈ ℕ is such that δ(p,w) < 1, then

Uw ≤ v* ≤ Uw + (1 − δ(p,w))⁻¹ Σ_{q=1}^{p} δ(q,w) ‖Uw − w‖ e .

Proof. From Uw ≥ w we have, by the monotonicity of U, w_n ≥ Uw for all n ≥ 1, so also v* = lim w_n ≥ Uw. To derive the second inequality we observe that

v* − Uw = lim_{n→∞} (w_n − w_1) = lim_{n→∞} [(w_n − w_{n−1}) + (w_{n−1} − w_{n−2}) + ··· + (w_2 − w_1)] .

Let π_{t+1}(w) = (f_{t+1},...,f_1) and ρ_{t+1}(w) = (g_{t+1},...,g_1) be optimal strategies for the (t+1)-stage game with terminal payoff w, as defined by (2.2); then

w_{t+1} − w_t = L(f_{t+1},g_{t+1})···L(f_2,g_2)w_1 − L(f_t,g_t)···L(f_1,g_1)w
  ≤ L(f_{t+1},g_t)···L(f_2,g_1)w_1 − L(f_{t+1},g_t)···L(f_2,g_1)w
  ≤ δ(t,w) ‖Uw − w‖ e ,

where the first inequality follows from the optimality of π_{t+1} and ρ_{t+1}, and the second inequality follows from lemma 4.1. Thus

v* − Uw ≤ Σ_{t=1}^{∞} δ(t,w) ‖Uw − w‖ e .

Using corollary 4.2 we get an upperbound for the sum at the right hand side:

Σ_{m=1}^{∞} δ(m,w) ≤ Σ_{u=0}^{∞} Σ_{q=1}^{p} δ(up+q, w) ≤ Σ_{u=0}^{∞} Σ_{q=1}^{p} δᵘ(p,w) δ(q,w) = (1 − δ(p,w))⁻¹ Σ_{q=1}^{p} δ(q,w) .

So

v* ≤ Uw + (1 − δ(p,w))⁻¹ Σ_{q=1}^{p} δ(q,w) ‖Uw − w‖ e . □
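Theorem 4.3 translates directly into a computable ε-band around v*. A minimal sketch under the same assumptions as the earlier code; value_band, b_bar and p_steps are our names, a is the payoff lower bound of assumption i), and b_bar stands for b̄ = max_i b(i).

```python
def value_band(w, Uw, a, b_bar, p_steps):
    """Bounds Uw <= v* <= Uw + (1 - delta(p,w))^{-1} sum_{q=1}^{p} delta(q,w) ||Uw - w|| e,
    valid when Uw >= w componentwise (theorem 4.3)."""
    def delta(m):
        denom = m * a + w.min()
        return 1.0 if denom <= 0 else min(1.0, b_bar / denom)
    d_p = delta(p_steps)
    if d_p >= 1.0:
        raise ValueError("increase p_steps until delta(p, w) < 1")
    correction = sum(delta(q) for q in range(1, p_steps + 1)) / (1.0 - d_p)
    span = np.abs(Uw - w).max()   # ||Uw - w||
    return Uw, Uw + correction * span
```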

Let f̂ and ĝ be optimal policies in the 1-stage game with terminal payoff w:

L(f,ĝ)w ≤ L(f̂,ĝ)w = Uw ≤ L(f̂,g)w   for all f and g .

Then we have the following result on the near optimality of f̂(∞) and ĝ(∞) in the ∞-stage game.

Theorem 4.4.

i) If Uw ≥ w then V(f̂(∞),ρ) ≥ Uw for all ρ.

If w > 0 and Uw ≤ w + εe with 0 ≤ ε < a, then for some γ < 1:

ii) P(f,ĝ)w ≤ γw for all f,

iii) V(π,ĝ(∞)) ≤ Uw + ε̃γ(1−γ)⁻¹w for all π, with ε̃ := ε max_{i∈S} {1/w(i)} .

Proof. i) Let ρ_n(f̂(∞),0) = (g_n,...,g_1) be an optimal reply to f̂(∞) in the n-stage game with terminal payoff 0. By lemma 3.1(iii) again p(f̂(∞), ρ_n(f̂(∞),0), n) → 0. Further

L(f̂,g_n)···L(f̂,g_1)w ≥ L(f̂,g_n)···L(f̂,g_2)Uw ≥ L(f̂,g_n)···L(f̂,g_2)w ≥ ··· ≥ Uw ,

so, as in the proof of theorem 3.3, for all ρ

V(f̂(∞),ρ) ≥ Uw .

ii) From Uw ≤ w + εe we have for all f

P(f,ĝ)w = L(f,ĝ)w − r(f,ĝ) ≤ L(f,ĝ)w − ae ≤ w + εe − ae < w .

So for some γ < 1 we have P(f,ĝ)w ≤ γw for all f.

iii) Let π_n = (f_n,...,f_1) be an arbitrary n-stage strategy; then

L(f_n,ĝ)···L(f_1,ĝ)0 ≤ L(f_n,ĝ)···L(f_1,ĝ)w ≤ L(f_n,ĝ)···L(f_2,ĝ)Uw
  ≤ L(f_n,ĝ)···L(f_2,ĝ)(w + εe) = L(f_n,ĝ)···L(f_2,ĝ)w + P(f_n,ĝ)···P(f_2,ĝ)εe .

With ε̃ = ε max_i {1/w(i)} we have εe ≤ ε̃w, so

P(f_n,ĝ)···P(f_2,ĝ)εe ≤ ε̃γⁿ⁻¹w .

Continuing in this way we get

L(f_n,ĝ)···L(f_1,ĝ)0 ≤ Uw + ε̃(γ + γ² + ··· + γⁿ⁻¹)w .

Letting n → ∞ we get for all π

V(π,ĝ(∞)) ≤ lim_{n→∞} [Uw + ε̃(γ + γ² + ··· + γⁿ⁻¹)w] = Uw + ε̃γ(1−γ)⁻¹w . □

Clearly Uw + ε̃γ(1−γ)⁻¹w is also an upperbound on v*. If we consider the successive approximations Uⁿv, then we know from theorem 3.4 that Uⁿ⁺¹v − Uⁿv tends to zero, so with w = Uⁿv we have Uw ≤ w + εe for n sufficiently large.

Theorem 4.4(ii) states that the function w is strongly excessive with respect to the set of transition matrices P(f,ĝ). So the Markov decision process which results if ĝ is fixed is contracting (cf. van Hee and Wessels [1] and van Nunen [4]). Once we have observed this, we see that theorem 4.4(iii) is a standard result in contracting dynamic programming. From theorems 4.3 and 4.4 we see that the method of successive approximations yields stationary ε-optimal strategies and (arbitrarily close) bounds on v*. For the ε-optimal strategy for P1, however, we used the monotonicity (Uw ≥ w).
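Together, theorems 4.3 and 4.4 suggest the following overall procedure: iterate v ← Uv from v = 0 (which keeps Uv ≥ v by theorem 3.2), stop once Uv ≤ v + εe, and report the band of theorem 4.3 together with the 1-stage optimal policies as stationary nearly optimal strategies. A hedged driver of ours, reusing the earlier sketches:

```python
def solve_game(r, p, a, b_bar, eps, p_steps=20, max_iter=10_000):
    """Successive approximation for v* with the stopping rule of section 4 (a sketch)."""
    v = np.zeros(len(r))
    for _ in range(max_iter):
        Uv = U(v, r, p)
        if np.all(Uv <= v + eps):        # Uw <= w + eps*e: theorem 4.4 applies
            break
        v = Uv
    lower, upper = value_band(v, Uv, a, b_bar, p_steps)   # theorem 4.3 (Uv >= v holds)
    _, policies = n_stage(v, r, p, 1)    # optimal 1-stage policies at w = v
    f_hat, g_hat = policies[0]           # stationary eps-optimal strategies (theorem 4.4)
    return lower, upper, f_hat, g_hat
```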

Also if we do not have Uw ≥ w we can derive an interesting result on the ε-optimality of f̂(∞).

Theorem 4.5. If Uw ≥ w − εe then for all ρ

V(f̂(∞),ρ) ≥ Uw − εa⁻²(b̄ − a)b̄ e .

Proof. Let ĝ(∞) be an optimal reply to f̂(∞) in the ∞-stage game: V(f̂(∞),ĝ(∞)) = min_ρ V(f̂(∞),ρ). That such a strategy exists is a result from negative dynamic programming (cf. Strauch [6], theorem 9.1). Define v̂ := V(f̂(∞),ĝ(∞)); then L(f̂,ĝ)v̂ = v̂. From this we get P(f̂,ĝ)v̂ ≤ v̂ − ae, which with ae ≤ v̂ ≤ b̄e, hence v̂/b̄ ≤ e, gives

P(f̂,ĝ)v̂ ≤ v̂ − av̂/b̄ = (1 − a/b̄)v̂ .

One may argue again that p(f̂(∞),ĝ(∞),n) → 0 (n → ∞); hence

V(f̂(∞),ĝ(∞)) = lim_{n→∞} Lⁿ(f̂,ĝ)w .

Further we have

Lⁿ(f̂,ĝ)w ≥ Lⁿ⁻¹(f̂,ĝ)Uw ≥ Lⁿ⁻¹(f̂,ĝ)(w − εe) ≥ Lⁿ⁻¹(f̂,ĝ)w − Pⁿ⁻¹(f̂,ĝ)εe
  ≥ ··· ≥ Uw − ε(P(f̂,ĝ) + ··· + Pⁿ⁻¹(f̂,ĝ))e .

And with e ≤ a⁻¹v̂ (from v̂ ≥ ae) we get

Pᵏ(f̂,ĝ)e ≤ a⁻¹Pᵏ(f̂,ĝ)v̂ ≤ a⁻¹(1 − a/b̄)ᵏ b̄ e .

Combining this yields

V(f̂(∞),ρ) ≥ Uw − εa⁻¹ Σ_{k=1}^{∞} (1 − a/b̄)ᵏ b̄ e = Uw − εa⁻²(b̄ − a)b̄ e . □

5. Some final remarks. In this paper we made the rather restrictive assumption that P2 can terminate the play immediately. This assumption can be weakened to:

ii′) there exists a strategy ρ̂ for P2 such that c := max_π V(π,ρ̂) < ∞ .

Lemma 3.1 with b replaced by c then clearly holds for w ≤ c, as Uⁿw ≤ Uⁿc ≤ c. So for w ≤ c we obtain exactly the same results as here with b replaced by c. And if w ≥ c, lemma 3.1 holds with b replaced by c + ‖w − c‖e. Again all results carry over, with b replaced by c in theorem 4.5.

Note that if we start the successive approximation procedure with b, then clearly Ub ≤ b, hence Uⁿ⁺¹b ≤ Uⁿb for all n. So if we take w = Uⁿb in theorem 4.4(iii), then we get V(π,ĝ(∞)) ≤ Uⁿ⁺¹b. In this case also the result in theorem 4.5 can be sharpened, as we also have v* ≤ Uⁿ⁺¹b; thus we can replace Uw by Uⁿ⁺¹b.

References

[1] Hee, K.M. van and Wessels, J., Markov decision processes and strongly excessive functions.

[2] Kushner, H.J. and Chamberlain, S.G., Finite state stochastic games: Existence theorems and computational procedures, IEEE Trans. Automatic Control 14 (1969), 248-255.

[3] Maitra, A. and Parthasarathy, T., On stochastic games, II, J. Opt. Theory Appl. 8 (1971), 154-160.

[4] Nunen, J.A.E.E. van, Contracting Markov decision processes, Mathematical Centre Tract 71, Mathematical Centre, Amsterdam, 1976.

[5] Shapley, L.S., Stochastic games, Proc. Nat. Acad. Sci. USA 39 (1953), 1095-1100.

[6] Strauch, R.E., Negative dynamic programming, Ann. Math. Statist. 37 (1966), 871-890.

[7] Wal, J. van der and Wessels, J., On Markov games, Statistica Neerlandica 30 (1976).
