Markov games with unbounded rewards



Citation for published version (APA):

Wessels, J. (1976). Markov games with unbounded rewards. (Memorandum COSOR; Vol. 7605). Technische Hogeschool Eindhoven.

Document status and date: Published: 01/01/1976

Document Version:

Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)

Please check the document version of this publication:

• A submitted manuscript is the version of the article upon submission and before peer-review. There can be important differences between the submitted version and the official published version of record. People interested in the research are advised to contact the author for the final version of the publication, or visit the DOI to the publisher's website.

• The final author version and the galley proof are versions of the publication after peer review.

• The final published version features the final layout of the paper including the volume, issue and page numbers.

Link to publication

General rights

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.

• You may not further distribute the material or use it for any profit-making activity or commercial gain.

• You may freely distribute the URL identifying the publication in the public portal.

If the publication is distributed under the terms of Article 25fa of the Dutch Copyright Act, indicated by the "Taverne" license above, please follow the link below for the End User Agreement: www.tue.nl/taverne

Take down policy

If you believe that this document breaches copyright, please contact us at openaccess@tue.nl providing details, and we will investigate your claim.


PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 76-05

Markov games with unbounded rewards

by

J. Wessels

Eindhoven, March 1976


J. Wessels

Summary. 2-person zero-sum Markov games with the total expected reward criterion are considered. The one-period rewards are not supposed to be bounded. However, it is assumed that the values of the one-period games in each state constitute a vector in a Banach space in which the transition probabilities are contracting. This game is proved to possess a value vector and optimal stationary strategies (in a weakened sense). Furthermore, it is exhibited how the value vector and optimal strategies may be computed using successive approximations.

1. Introduction. In this paper we will consider a dynamic system with a countable state space 𝕊 = {1,2,...}, which is observed at discrete points in time t = 0,1,... . The dynamic behaviour of the system may be influenced by two players (P1 and P2). If the system is in state i at time t, player P1 may select an action k from a finite set 𝕂_i (nonempty) and P2 may select an action ℓ from a finite set 𝕃_i (nonempty). As a result of such choices the state j will be observed at time t+1 with probability p(j;i,k,ℓ); moreover P1 obtains a (possibly negative) reward r(i,k,ℓ) from P2.

When analyzing games of this type with the total expected reward criterion, one usually requires discounting of future rewards and boundedness of the reward function (see e.g. Shapley [6], van der Wal [7], Himmelberg et al. [3], Parthasarathy [5]). Another condition which has been used is positivity of the reward function (see e.g. [3], [5]).

For discounted Markov games with bounded rewards one usually applies the theory of contracting operators in the space of all bounded functions on 𝕊 (real-valued functions on 𝕊 will be called vectors henceforth). As in the theory of Markov decision processes the conditions might be weakened by using other spaces (compare [8]). In this paper such an approach will be combined with another weakening, viz. with respect to r conditions will only be given for the values of the matrix games with entries r(i,k,ℓ) for each i.


This latter weakening of the conditions has as a consequence that the expected total rewards may not be defined properly for all pairs of decision rules. This difficulty is met by adapting the definition of the criterion and by adapting the definition of a saddle point.

In section 2 we will give definitions, notations and assumptions together with some properties of basic tools. In section 3 it will be proved that the Markov game possesses a value and optimal decision rules (stationary Markov strategies). In section 4 it will be shown how one might approximate the value and find ε-optimal decision rules. Section 5 is devoted to some remarks and extensions.

2. Preliminaries. A mixed action for player P1 in state i is a probability distribution on 𝕂_i (notations: f(i,k) for the probabilities, f(i,·) for the distribution, F_i for the set of mixed actions; analogously for P2 with notations g(i,ℓ), g(i,·), G_i).

A policy for player P1 determines a mixed action for P1 in any state (notations: f for the policy, f(i,·) for the mixed action in state i, F for the set of all policies; analogously for P2 with notations g, g(i,·), G).

A strategy for player P1 is a sequence π = (π_0, π_1, ...) such that π_n maps H_n into F for n = 1,2,... and π_0 ∈ F. Here H_n is the set of all paths until time n: (i_0, k_0, ℓ_0, ..., i_{n-1}, k_{n-1}, ℓ_{n-1}) with i_m ∈ 𝕊, k_m ∈ 𝕂_{i_m}, ℓ_m ∈ 𝕃_{i_m}. 𝔽 is the set of all strategies for P1. Analogously ρ = (ρ_0, ρ_1, ...) is an element of 𝔾, the set of all strategies for P2.

π ∈ 𝔽 is said to be a Markov strategy for P1 if π_n maps all paths of H_n onto the same element of F. A Markov strategy for P1 may be characterized by a sequence of policies (f_0, f_1, ...) ∈ F^∞. Analogously G^∞ denotes the set of Markov strategies for P2.

π ∈ 𝔽 is said to be stationary if π is a Markov strategy and π_n does not depend on n. A stationary strategy for P1 may be characterized by a policy f ∈ F. The terms stationary strategy and policy will be used interchangeably.

With respect to the transition probabilities we suppose

p(j;i,k,ℓ) ≥ 0   for i,j ∈ 𝕊, k ∈ 𝕂_i, ℓ ∈ 𝕃_i,

Σ_{j∈𝕊} p(j;i,k,ℓ) ≤ 1   for i ∈ 𝕊, k ∈ 𝕂_i, ℓ ∈ 𝕃_i.


The transition distributions might be completed by introducing an extra absorbing state 0 with

p(0;i,k,ℓ) = 1 − Σ_{j∈𝕊} p(j;i,k,ℓ)   for i ∈ 𝕊.

With this artificial extension in mind we can define a probability measure on the set of all infinite paths for each starting state i ∈ 𝕊 and each pair of strategies π, ρ, using the transition probabilities p(j;i,k,ℓ) at time t and π_t, ρ_t for determining the actual k and ℓ. This measure is denoted by P_{i,π,ρ}. Expectations with respect to this measure are denoted by E_{i,π,ρ}; furthermore E_{π,ρ} denotes a vector of expectations. (The set of vectors v = (v(1), v(2), ...) with v(i) ∈ ℝ is denoted by ℝ^∞.)

The state at time t, the actions chosen by players P1 and P2 at time t, and the path until time t are denoted by the random variables S_t, K_t, L_t, H_t. P^t(π,ρ) denotes the matrix with (i,j)-entry P_{i,π,ρ}(S_t = j), so P^1(f,g) =: P(f,g) has the (i,j)-entry

Σ_{k,ℓ} f(i,k) g(i,ℓ) p(j;i,k,ℓ).

r(f,g) is the vector with i-th entry

Σ_{k,ℓ} f(i,k) g(i,ℓ) r(i,k,ℓ).

Definition 2.1. Let V be a vector-valued function defined on a subset 𝒱 of 𝔽 × 𝔾, with V(π,ρ) ∈ ℝ^∞ for (π,ρ) ∈ 𝒱. (π*,ρ*) is said to be a saddle point of V iff

1) (π,ρ*), (π*,ρ) ∈ 𝒱 for all (π,ρ) ∈ 𝔽 × 𝔾;

2) V(π,ρ*) ≤ V(π*,ρ*) ≤ V(π*,ρ) for all (π,ρ) ∈ 𝔽 × 𝔾, where the inequalities should be read componentwise.

V(π*,ρ*) is called the value vector of the game with respect to the criterion V; π*, ρ* are called optimal strategies for P1 and P2 for the game with criterion V.

Note that a game can possess at most one value vector with respect to V. We will use the following criterion vector V in this paper.


Definition 2.2. 𝒱 ⊂ 𝔽 × 𝔾 is defined as the set of pairs (π,ρ) which satisfy at least one of the following conditions:

a) E_{i,π,ρ} Σ_{n=0}^∞ r⁺(π_n(H_n), ρ_n(H_n))(S_n) < ∞   for all i ∈ 𝕊;

b) E_{i,π,ρ} Σ_{n=0}^∞ r⁻(π_n(H_n), ρ_n(H_n))(S_n) < ∞   for all i ∈ 𝕊;

with

r⁺(f,g)(i) := max{0, r(f,g)(i)},   r⁻(f,g)(i) := max{0, −r(f,g)(i)}   for all i ∈ 𝕊.

We define V as a vector-valued function on 𝒱 by

V(π,ρ) := E_{π,ρ} Σ_{n=0}^∞ r(π_n(H_n), ρ_n(H_n))(S_n) = Σ_{n=0}^∞ E_{π,ρ} r(π_n(H_n), ρ_n(H_n))(S_n).

V(π,ρ) may contain some entries equal to +∞ or some entries −∞.

Note that V(π,ρ) is not defined as the expectation of the sum of terms like r(S_t, K_t, L_t); this would require stronger convergence conditions.

Now we will give our assumptions on rewards and transition probabilities.

Assumption 2.1.

a) μ is a given vector with positive components; b is the vector with entries μ⁻¹(i).

b) There exists β ∈ (0,1) such that

Σ_j p(j;i,k,ℓ) b(j) ≤ β b(i)   for all i ∈ 𝕊, k ∈ 𝕂_i, ℓ ∈ 𝕃_i;

the minimal allowed β-value will be denoted by β henceforth.

Assumption 2.1.b may be written in another way if we use μ for the construction of a space of vectors V ⊂ ℝ^∞:

V := {v ∈ ℝ^∞ | sup_i |v(i)| μ(i) < ∞}.


V is a Banach space when we introduce the norm

‖v‖ := sup_i |v(i)| μ(i).

By V⁺ we denote the set of vectors w ∈ ℝ^∞ such that w ≥ v for some v ∈ V; V⁻ contains those vectors w ∈ ℝ^∞ with w ≤ v for some v ∈ V; V^± is the union of V⁺ and V⁻.

Now we may introduce operator norms for matrices or linear operators. Then Assumption 2.1.b is equivalent to

∃ β ∈ (0,1):  ‖P(f,g)‖ ≤ β   for all (degenerate) f, g.

Assumption 2.1 implies for t = 0,1,... and π ∈ 𝔽, ρ ∈ 𝔾 that P^t(π,ρ)b ≤ β^t b.
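For a finite truncation of the state space, both the weighted supremum norm and the contraction condition of Assumption 2.1.b can be checked numerically. The following sketch is only an illustration under assumed data layouts (arrays p[i,k,l,j] and mu[i] are conventions of this sketch, not notation of the memorandum):

```python
import numpy as np

def weighted_norm(v, mu):
    """Weighted supremum norm ||v|| = sup_i |v(i)| * mu(i)."""
    return np.max(np.abs(v) * mu)

def contraction_factor(p, mu):
    """Smallest beta with sum_j p(j;i,k,l) b(j) <= beta * b(i), where b = 1/mu.

    p[i, k, l, j] holds the (possibly substochastic) transition probabilities.
    Assumption 2.1.b requires the returned value to be < 1.
    """
    b = 1.0 / mu                                   # b(i) = mu^{-1}(i)
    expected_b = np.einsum('iklj,j->ikl', p, b)    # sum_j p(j;i,k,l) * b(j)
    return np.max(expected_b * mu[:, None, None])  # divide by b(i), i.e. multiply by mu(i)
```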

We denote by r̄ the vector whose i-th component equals the value of the matrix game with entries r(i,k,ℓ); f̄ ∈ F and ḡ ∈ G denote optimal decision rules for this sequence of matrix games.

Assumption 2.2. r̄ ∈ V.
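For finitely many actions, the value of each matrix game (and an optimal mixed action) can be obtained by linear programming. A minimal sketch using scipy; the helper name matrix_game_value and the array layout are assumptions made for this illustration, not part of the memorandum:

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(R):
    """Value of the matrix game R[k, l] and an optimal mixed action of the
    maximizing row player (entries may be negative)."""
    K, L = R.shape
    # variables: (f_1, ..., f_K, v); maximize v  <=>  minimize -v
    c = np.concatenate([np.zeros(K), [-1.0]])
    # v <= sum_k f_k R[k, l] for every column l
    A_ub = np.hstack([-R.T, np.ones((L, 1))])
    b_ub = np.zeros(L)
    A_eq = np.concatenate([np.ones(K), [0.0]]).reshape(1, -1)  # sum_k f_k = 1
    b_eq = [1.0]
    bounds = [(0, None)] * K + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:K]
```

Applying this routine to the matrix (r(i,k,ℓ))_{k,ℓ} for every state i yields r̄ and f̄; applying it to the negated transpose of each matrix yields ḡ.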

P(f,g) is properly defined as a linear operator on V, V⁺, V⁻, V^±, and P(f,g) maps each of these sets into itself. "Properly defined" means that P(f,g)v(i) is independent of the order of summation. P(f,g) is monotone on V, V⁺, V⁻, V^±. P(f,g) is contracting on V with contraction radius ‖P(f,g)‖ ≤ β < 1.

𝒱 contains at least (f̄,ḡ). In general, if f and g are such that r(f,g) ∈ V^±, then (f,g) ∈ 𝒱 and V(f,g) is defined, with ‖V(f,g)‖ ≤ (1−β)⁻¹ ‖r(f,g)‖ whenever r(f,g) ∈ V. In general, if r(f,g) ∈ V, V⁺, V⁻, then V(f,g) ∈ V, V⁺, V⁻ respectively. Note that r(f,ḡ) ∈ V⁻ and r(f̄,g) ∈ V⁺ for all f ∈ F, g ∈ G.

Definition 2.3. Let f ∈ F, g ∈ G; then L(f,g) is defined as a mapping of V^± into ℝ^∞ by

L(f,g)v := r(f,g) + P(f,g)v   for v ∈ V^±.


Lemma 2.1. Let f ∈ F, g ∈ G.

a) If r(f,g) ∈ V, then L(f,g) maps V into V and L(f,g) is contracting on V with contraction radius ‖P(f,g)‖ ≤ β < 1; the fixed point of L(f,g) in V is V(f,g).

b) If r(f,g) ∈ V⁻, then L(f,g) maps V⁻ into V⁻.

c) If r(f,g) ∈ V⁺, then L(f,g) maps V⁺ into V⁺.

d) L(f,g) is monotone on V^±.

e) If v ∈ V and r(f,g) ∈ V^±, then L^n(f,g)v → V(f,g) (for n → ∞), componentwise.

Proof. Straightforward. As an example we indicate the proof of e). If r(f,g) ∈ V⁺, then r(f,g) = r + w with r ∈ V and w ≥ 0. Then

L^n(f,g)v = Σ_{k=0}^{n-1} P^k(f,g)[r + w] + P^n(f,g)v,

which converges to Σ_{k=0}^∞ P^k(f,g)[r + w] for n → ∞. □

Definition 2.4. Let v ∈ V. We define Uv ∈ ℝ^∞ by

Uv := max_{f∈F} min_{g∈G} L(f,g)v,

where max min is defined componentwise. (Uv)(i) is the value of the finite matrix game with entries

r(i,k,ℓ) + Σ_j p(j;i,k,ℓ) v(j).
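For a finite model this gives a direct recipe for one application of U: solve, in every state, the matrix game with the auxiliary entries above. A sketch reusing the hypothetical matrix_game_value helper from the earlier block (the array shapes are again assumptions of these illustrations):

```python
import numpy as np

def apply_U(v, r, p):
    """One application of U: (Uv)(i) is the value of the matrix game with
    entries r[i, k, l] + sum_j p[i, k, l, j] * v[j]; also returns the optimal
    mixed action of P1 in every state."""
    aux = r + np.einsum('iklj,j->ikl', p, v)   # entries of the auxiliary games
    Uv = np.empty(len(v))
    f = []
    for i in range(len(v)):
        Uv[i], f_i = matrix_game_value(aux[i])
        f.append(f_i)
    return Uv, f
```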

Lemma 2.2.

a) U maps V into V (monotonously).

b) Let W := {v ∈ V | ‖v‖ ≤ (1−β)⁻¹ ‖r̄‖}; then UW ⊂ W.

c) U is contracting on V: ‖Uv − Uw‖ ≤ β ‖v − w‖ for all v, w ∈ V.


Proof.

a) The monotonicity is trivial. Let v ∈ V; it will be proved that Uv ∈ V.

Uv ≤ max_f {min_g r(f,g) + max_g P(f,g)v} ≤ r̄ + max_f max_g P(f,g)v ≤ r̄ + β‖v‖ b.

Analogously

Uv ≥ max_f {min_g r(f,g) + min_g P(f,g)v} ≥ r̄ + min_f min_g P(f,g)v ≥ r̄ − β‖v‖ b.

Hence ‖Uv‖ ≤ ‖r̄‖ + β‖v‖.

b) If ‖v‖ ≤ (1−β)⁻¹ ‖r̄‖, then ‖Uv‖ ≤ ‖r̄‖ + β(1−β)⁻¹ ‖r̄‖ = (1−β)⁻¹ ‖r̄‖.

c) Let v, w ∈ V. Then

Uv ≤ U(w + ‖v − w‖ b) ≤ max_f min_g [r(f,g) + P(f,g)w + ‖v − w‖ P(f,g)b] ≤ Uw + β‖v − w‖ b.

So Uv − Uw ≤ β‖v − w‖ b. By interchanging v and w and combining the two results we obtain ‖Uv − Uw‖ ≤ β‖v − w‖. □

Let v ∈ V; then Uv = L(f,g)v for certain f ∈ F, g ∈ G. Hence

r(f,g) + P(f,g)v ∈ V,

which implies r(f,g) ∈ V. So by successive application of U one finds a sequence of policies f_n, g_n with r(f_n, g_n) ∈ V:


Choose v_0 ∈ V, define v_n := Uv_{n-1} for n = 1,2,... and f_n, g_n such that Uv_{n-1} = L(f_n, g_n)v_{n-1}.

Now we have v_n ∈ V and v_n → v* (in norm) for n → ∞, with v* the unique solution of Uv = v, v ∈ V; moreover r(f_n,g_n) ∈ V and (f_n,g_n) ∈ 𝒱.

Denote by f*, g* policies which satisfy

L(f*,g*)v* = Uv* = v*;

then r(f*,g*) ∈ V and (f*,g*) ∈ 𝒱.

The vector v* is a natural candidate for being the value vector of the game and the stationary strategies f*, g* for being optimal strategies.
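Numerically, this construction is plain fixed-point iteration with the contracting operator U. A sketch under the same assumed helpers as before (apply_U, weighted_norm); the stopping tolerance is an arbitrary illustrative choice:

```python
import numpy as np

def approximate_fixed_point(r, p, mu, tol=1e-8, max_iter=10_000):
    """Iterate v_n = U v_{n-1} until the weighted norm of the increment is
    below tol; returns the last iterate (an approximation of v*) together
    with the policies f_n realising U v_{n-1}."""
    v = np.zeros(len(mu))          # any v_0 in V will do
    f = None
    for _ in range(max_iter):
        v_new, f = apply_U(v, r, p)
        if weighted_norm(v_new - v, mu) < tol:
            v = v_new
            break
        v = v_new
    return v, f
```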

3. The value vector and optimal strategies

Lemma 3.1. Let f ∈ F, g ∈ G; then (f,g*), (f*,g) ∈ 𝒱, and V(f,g*) ∈ V⁻, V(f*,g) ∈ V⁺. Furthermore

V(f,g*) ≤ V(f*,g*) ≤ V(f*,g).

Proof. L(f,g*)v* ≤ L(f*,g*)v* = v*. So

r(f,g*) + P(f,g*)v* ≤ v*,

hence r(f,g*) ∈ V⁻. Lemma 2.1(d,e) now implies

V(f,g*) = lim_{n→∞} L^n(f,g*)v* ≤ v* = V(f*,g*).

Analogously r(f*,g) ∈ V⁺ and V(f*,g) ≥ v* = V(f*,g*). □

Lemma 3.2. Let π := (f_0, f_1, ...) ∈ F^∞ and ρ := (g_0, g_1, ...) ∈ G^∞; then V(π,g*) and V(f*,ρ) are defined and elements of V⁻ and V⁺ respectively. Furthermore

V(π,g*) ≤ v* ≤ V(f*,ρ).

Proof. Since L(f_t,g*)v* ≤ Uv* = v*, we have r(f_t,g*) ≤ v* − P(f_t,g*)v* ≤ (1+β)‖v*‖ b. So

r⁺(f_t,g*) ≤ (1+β)‖v*‖ b,   Σ_{t=0}^∞ P^t(π,g*) r⁺(f_t,g*) ≤ (1+β)‖v*‖ Σ_{t=0}^∞ P^t(π,g*) b ≤ (1+β)(1−β)⁻¹ ‖v*‖ b.

Hence (π,g*) ∈ 𝒱 and V(π,g*) ∈ V⁻. With lemma 2.1(d,e) we obtain for any N

Σ_{t=0}^N P^t(π,g*) r(f_t,g*) + P^{N+1}(π,g*) v* ≤ v*.

The second term in the left-hand side of this inequality converges to zero for N → ∞ (componentwise and in norm), whereas the first part converges to V(π,g*). Hence

V(π,g*) ≤ v* = V(f*,g*).

The assertions for V(f*,ρ) follow analogously. □

Theorem 3.1. v* (the unique solution of Uv = v, v ∈ V) is the value vector for the game with criterion V. Any pair of stationary strategies f*, g* satisfying

L(f,g*)v* ≤ v* ≤ L(f*,g)v*   for all f ∈ F, g ∈ G

is optimal for the game with criterion V, i.e.

(π,g*), (f*,ρ) ∈ 𝒱   if π ∈ 𝔽, ρ ∈ 𝔾,

and

V(π,g*) ≤ v* ≤ V(f*,ρ)   for all π ∈ 𝔽, ρ ∈ 𝔾.

Proof. It has to be proved that V(π,g*) and V(f*,ρ) are defined and are elements of V⁻ and V⁺ respectively; furthermore the saddle point property has to be proved. Let π = (π_0, π_1, ...) ∈ 𝔽; then (as in the proof of lemma 3.2)

E_{π,g*} r⁺(π_n(H_n), g*)(S_n) ≤ E_{π,g*} (1+β)‖v*‖ b(S_n) ≤ (1+β)‖v*‖ β^n b,

so (π,g*) ∈ 𝒱 and V(π,g*) ∈ V⁻. Now it suffices to prove

sup_{π∈𝔽} V(π,g*) = v*.

We know already

sup_{π∈F^∞} V(π,g*) = v*.

Consider the Markov decision process with state space 𝕊 and strategies π ∈ 𝔽 based on the action sets F_i. P(f,g*) is the matrix of transition probabilities if the policy f is applied, and r(f,g*) is the reward vector for policy f. For this Markov decision process the total expected reward for strategy π is

E_{π,g*} Σ_{n=0}^∞ r(π_n(H_n), g*)(S_n) = V(π,g*).

Note that for this Markov decision process F^∞ is the set of all nonrandomized Markov strategies and 𝔽 is the set of all nonrandomized strategies. We now may use the theorem of van Hee [1], which states that for the Markov decision process with total expected reward vector V(π,g*) for the strategy π

sup_{π∈𝔽} V(π,g*) = sup_{π∈F^∞} V(π,g*). □


4. Successive approximations

Since the value vector v* is the unique fixed point of the operator U on V, it is trivial to give successive approximations for v*. Choose v_0 ∈ V; then v_n := Uv_{n-1} for n = 1,2,... constitute a sequence of vectors in V which converge (in norm) to v*:

v_n − β(1−β)⁻¹ ‖v_n − v_{n-1}‖ b ≤ v* ≤ v_n + β(1−β)⁻¹ ‖v_n − v_{n-1}‖ b.

As in discounted Markov games (compare van der Wal [7]) better estimates may be found for v* in the n-th iteration. In a similar way estimates may be found for the quality of the policies found in the n-th iteration.
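As an illustration of the simple bound above, the iteration can report a certified interval for v* at every step. A sketch under the same assumed helpers as in the earlier blocks (apply_U, weighted_norm):

```python
import numpy as np

def iterate_with_bounds(r, p, mu, beta, n_iter=50):
    """Run v_n = U v_{n-1} and return the last iterate together with the
    componentwise bracket v_n -/+ beta/(1-beta) * ||v_n - v_{n-1}|| * b."""
    b = 1.0 / mu
    v = np.zeros(len(mu))
    lower = upper = v
    for _ in range(n_iter):
        v_new, _ = apply_U(v, r, p)
        slack = beta / (1.0 - beta) * weighted_norm(v_new - v, mu)
        lower, upper = v_new - slack * b, v_new + slack * b
        v = v_new
    return v, lower, upper
```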

For these estimates we use similar notations as van der Wal [7]:

λ_n := inf_i (v_n(i) − v_{n-1}(i)) μ(i),
ν_n := sup_i (v_n(i) − v_{n-1}(i)) μ(i),

a_n := sup_{i,ℓ} μ(i) Σ_j p(j; i, f_n(i), ℓ) b(j)   if λ_n < 0,
a_n := inf_{i,ℓ} μ(i) Σ_j p(j; i, f_n(i), ℓ) b(j)   if λ_n ≥ 0,

b_n := inf_{i,k} μ(i) Σ_j p(j; i, k, g_n(i)) b(j)   if ν_n < 0,
b_n := sup_{i,k} μ(i) Σ_j p(j; i, k, g_n(i)) b(j)   if ν_n ≥ 0;

here f_n, g_n denote policies which satisfy

L(f, g_n)v_{n-1} ≤ L(f_n, g_n)v_{n-1} ≤ L(f_n, g)v_{n-1}   for all f ∈ F, g ∈ G.
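For a finite model these quantities are straightforward to evaluate. A sketch, with array conventions and helper names that are assumptions of these illustrations rather than the memorandum's notation (f_n[i,k] and g_n[i,l] are the mixed actions of the policies found in step n):

```python
import numpy as np

def refined_bound_quantities(v_prev, v_new, f_n, g_n, p, mu):
    """Compute lambda_n, nu_n, a_n, b_n of section 4 for a finite model."""
    b = 1.0 / mu
    diff = (v_new - v_prev) * mu
    lam, nu = diff.min(), diff.max()

    # mu(i) * sum_j p(j; i, f_n(i), l) * b(j), for every i and l
    row = mu[:, None] * np.einsum('ik,iklj,j->il', f_n, p, b)
    a_n = row.max() if lam < 0 else row.min()

    # mu(i) * sum_j p(j; i, k, g_n(i)) * b(j), for every i and k
    col = mu[:, None] * np.einsum('il,iklj,j->ik', g_n, p, b)
    b_n = col.min() if nu < 0 else col.max()
    return lam, nu, a_n, b_n
```

These four numbers feed directly into the bounds a)-d) below.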

Now we have the following bounds for v*, V(f_n,g_n), V(f_n,ρ), V(π,g_n) (the latter two are defined and elements of V⁺, V⁻ respectively):

a) v_n + a_n λ_n (1 − a_n)⁻¹ b ≤ v* ≤ v_n + b_n ν_n (1 − b_n)⁻¹ b;

b) V(f_n,ρ) ≥ v_n + a_n λ_n (1 − a_n)⁻¹ b   for all ρ ∈ 𝔾;

c) V(π,g_n) ≤ v_n + b_n ν_n (1 − b_n)⁻¹ b   for all π ∈ 𝔽;

d) v_n + a_n λ_n (1 − a_n)⁻¹ b ≤ V(f_n,g_n) ≤ v_n + b_n ν_n (1 − b_n)⁻¹ b.

Proof. d) is a direct consequence of b) and c). a) is also a consequence of b) and c), viz. V(f_n,g*) ≤ V(f*,g*) = v* and V(f*,g_n) ≥ V(f*,g*) = v*.

We will prove c). That (π,g_n) ∈ 𝒱 is seen as follows. We have

L(f,g_n)v_{n-1} ≤ L(f_n,g_n)v_{n-1} = v_n   for all f ∈ F,

hence

r(f,g_n) ≤ v_n − P(f,g_n)v_{n-1} ≤ [‖v_n‖ + β‖v_{n-1}‖] b.

So for any f ∈ F we have

r⁺(f,g_n) ≤ [‖v_n‖ + β‖v_{n-1}‖] b,

and hence for all π ∈ 𝔽

Σ_{t=0}^∞ E_{π,g_n} r⁺(π_t(H_t), g_n)(S_t) ≤ (1−β)⁻¹ [‖v_n‖ + β‖v_{n-1}‖] b.

Using the same theorem of van Hee [1] as in the proof of theorem 3.1 (with g_n instead of g*) we obtain

sup_{π∈𝔽} V(π,g_n) = sup_{π∈F^∞} V(π,g_n).

So it remains to prove

V(π,g_n) ≤ v_n + ν_n b_n (1 − b_n)⁻¹ b   for π ∈ F^∞.


For any f ∈ F,

L(f,g_n)v_n = L(f,g_n)v_{n-1} + P(f,g_n)(v_n − v_{n-1}) ≤ v_n + ν_n P(f,g_n)b ≤ v_n + ν_n b_n b.

In this way we obtain for all f_0, ..., f_N ∈ F

L(f_0,g_n) L(f_1,g_n) ··· L(f_N,g_n) v_n ≤ v_n + ν_n (b_n + ... + b_n^{N+1}) b.

This implies for π = (f_0, f_1, ...) ∈ F^∞

V(π,g_n) ≤ v_n + ν_n b_n (1 − b_n)⁻¹ b. □

5. Remarks and extensions

a) If |r(i,k,ℓ)| ≤ M b(i) for all i,k,ℓ and a certain M > 0, then 𝒱 = 𝔽 × 𝔾 and

V(π,ρ) = E_{π,ρ} Σ_{n=0}^∞ r(S_n, K_n, L_n),

which converges absolutely. In this situation we have V(π,ρ) ∈ V. So, in this way we already have a generalization of the usual situation. A comparison with discounted Markov games and discounted semi-Markov games may be made by incorporating the discount factor in the transition probabilities.
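As a small illustration of the last remark (the function and its arguments are assumptions for this sketch only): with a proper transition kernel q and a discount factor α ∈ (0,1), the substochastic kernel p = α·q puts the discounted game into the present framework, and with μ ≡ 1 Assumption 2.1.b then holds with β = α.

```python
import numpy as np

def fold_discount(q, alpha):
    """Fold a discount factor alpha into a proper transition kernel q[i,k,l,j].

    The result p = alpha * q is substochastic; since sum_j p(j;i,k,l) = alpha
    for every (i,k,l), the contraction condition of Assumption 2.1.b holds
    with mu = b = 1 and beta = alpha (the bounded-reward case of remark a)."""
    return alpha * np.asarray(q)
```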

b) The finite action sets 𝕂_i and 𝕃_i may be replaced by compact subsets of some Euclidean space if we add the condition that r(i,k,ℓ) and p(j;i,k,ℓ) are continuous in (k,ℓ).

c) If in each state i the action set 𝕃_i contains only one element, we obtain a Markov decision process with the following requirements (we delete all parameters relating to the second player P2):

‖P(f)‖ ≤ β < 1,

max_{k∈𝕂_i} r(i,k) =: r̄(i)   with r̄ ∈ V.

Now we have for any f that r(f) ≤ r̄, hence r(f) ∈ V⁻. These properties imply that V(π) exists for all π. The finiteness of 𝕂_i is not essential if we replace "max" by "sup".


The successive approximations v_n may then be chosen in the following way (δ > 0): choose f_n such that

v_n := L(f_n)v_{n-1} ≥ max{v_{n-1}, Uv_{n-1} − δb};

this is possible if v_0 is chosen such that

(Uv_0)(i) ≥ v_0(i)   for all i ∈ 𝕊.

This sequence {v_n} generates the same sequence of estimates for v* and V(f_n) as given in [8] for the condition

|r(i,k)| μ(i) ≤ M   for all i ∈ 𝕊, k ∈ 𝕂_i.

d) Stopping times may be used to generate a class of successive approximation methods, as has been shown by van der Wal [7]. This can also be done in our situation.

e) An interesting situation, which may be treated within the framework of this paper, is the following. Suppose at each time t = 0,1,... an honest matrix game is played (the entries may depend on the history, but the actual game is always honest). Then for both players, strategies which are optimal in each single matrix game (i.e. prudent strategies with value 0) are overall optimal. No extra restrictions on the matrix entries are necessary.

f) For characterizations of the situations in which assumption 2.1.b (‖P(f,g)‖ ≤ β < 1) is satisfied we refer to [2], since in this respect there is no essential difference between Markov decision processes and Markov games.


References

[1] K.M. van Hee, Markov strategies in dynamic programming. Memorandum COSOR 75-20 (October 1975), Dept. of Mathematics, Eindhoven University of Technology.

[2] K.M. van Hee, J. Wessels, Markov decision processes and strongly excessive functions. Memorandum COSOR 75-22 (November 1975), Dept. of Mathematics, Eindhoven University of Technology.

[3] C.J. Himmelberg, T. Parthasarathy, T.E.S. Raghavan, F.S. Van Vleck, Existence of p-equilibrium and optimal stationary strategies in stochastic games. Proc. Amer. Math. Soc. (to appear).

[4] J. MacQueen, A modified dynamic programming method for Markovian decision problems. J. Math. Anal. Appl. 14 (1966), 38-43.

[5] T. Parthasarathy, Discounted, positive, and noncooperative stochastic games. Intern. J. Game Th. 2 (1973), 25-37.

[6] L.S. Shapley, Stochastic games. Proc. Nat. Acad. Sci. USA 39 (1953), 1095-1100.

[7] J. van der Wal, The method of successive approximations for the discounted Markov game. Memorandum COSOR 75-02 (March 1975), Dept. of Mathematics, Eindhoven University of Technology.

[8] J. Wessels, Markov programming by successive approximations with respect to weighted supremum norms.
