
On the relation between optimality and saddle-conservation in Markov games

Citation for published version (APA):

Groenewegen, L. P. J., & Wessels, J. (1976). On the relation between optimality and saddle-conservation in Markov games. (Memorandum COSOR; Vol. 7614). Technische Hogeschool Eindhoven.

Document status and date: Published: 01/01/1976

Document Version:

Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)

Please check the document version of this publication:

• A submitted manuscript is the version of the article upon submission and before peer-review. There can be important differences between the submitted version and the official published version of record. People interested in the research are advised to contact the author for the final version of the publication, or visit the DOI to the publisher's website.

• The final author version and the galley proof are versions of the publication after peer review.

• The final published version features the final layout of the paper including the volume, issue and page numbers.

Link to publication

General rights

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.

• You may not further distribute the material or use it for any profit-making activity or commercial gain.

• You may freely distribute the URL identifying the publication in the public portal.

If the publication is distributed under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license above, please follow the link below for the End User Agreement: www.tue.nl/taverne

Take down policy

If you believe that this document breaches copyright, please contact us at openaccess@tue.nl, providing details, and we will investigate your claim.


PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 76-14

On the relation between optimality and saddle-conservation in Markov games

by

L.P.J. Groenewegen and J. Wessels

Eindhoven, October 1976


On the relation between optimality and saddle-conservation in Markov games

by

L.P.J. Groenewegen and J. Wessels

Summary. In this paper it will be investigated how the concept of value-conserving strategies can be generalized from Markov decision processes to Markov games. It will be proved that optimal Markov strategies are necessarily saddle conserving, which is the most straightforward generalization. Another generalization (called saddling) is shown to constitute a sufficient condition for optimality under relatively strong assumptions for the convergence of total expected rewards. Counterexamples show that saddle conserving is not sufficient for optimality (even under these strong convergence assumptions), and saddling is proved to be not necessary.

0. Introduction

For Markov decision processes with the total expected reward criterion, we have the following well known theorem, which holds under fairly general assumptions:

exactly those strategies are optimal which are value conserving and equalizing.

(The concepts of value conserving and equalizing were introduced in Dubins and Savage [3]. For the above theorem see e.g. Groenewegen [2], Hordijk [4,5], Rieder [6].)

In this paper we will consider similar conditions for optimality in Markov games. In this first attempt we will be mainly concerned about the conserving condition, since we introduce a relatively strong equalizing condition when we investigate the sufficiency of our conserving concepts for optimality. Two generalizations to Markov games of the value-conservation condition will be considered. The straightforward generalization (called saddle conserving) will be proved to constitute a necessary condition for the optimality of a Markov strategy (section 3). The condition is not sufficient; this is even not the case if our strong equalizing condition holds (section 4). A somewhat stricter generalization (called saddling) is proved to be sufficient for optimality of a Markov strategy if our equalizing condition holds (section 3), although it is not necessary (section 4). Section 5 is devoted to some extensions and remarks, section 2 contains some preliminary results, and section 1 is devoted to the introduction of the model and the basic concepts.

1. The model and some basic concepts

We consider a two-person zero-sum Markov game on the countable state space S, with countable action spaces K(i) and L(i) for both players in state i. This means (compare e.g. Shapley [8], Parthasarathy and Raghavan [9], van der Wal and Wessels [10]) that a system which can be in one of the states of S is observed by two players A and B at times t = 0,1,2,.... At these times the players choose independently an action from their action spaces, viz. A from K(i), B from L(i) if the system is in state i. As a result of these actions - say k ∈ K(i), ℓ ∈ L(i) - the system moves to state j with probability p(i,j,k,ℓ) ≥ 0 and A receives the reward r(i,k,ℓ) from B.
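As a purely illustrative aside (not part of the original memorandum), a finite instance of this model can be written down as a few tables. All state and action labels, reward values and transition probabilities below are invented for illustration, and the helper name `step` is hypothetical.

```python
import random

# Minimal sketch of a finite two-person zero-sum Markov game:
# states S, action sets K(i) and L(i), rewards r(i,k,l) paid by B to A,
# and (sub)stochastic transition probabilities p(i,j,k,l).
S = [0, 1]
K = {0: [0, 1], 1: [0]}                      # actions of player A per state
L = {0: [0, 1], 1: [0]}                      # actions of player B per state
r = {(0, 0, 0): 1.0, (0, 0, 1): 0.0,
     (0, 1, 0): 0.0, (0, 1, 1): 2.0,
     (1, 0, 0): 0.0}                         # rewards r(i, k, l)
p = {(0, 0, 0): {1: 1.0}, (0, 0, 1): {0: 0.5, 1: 0.5},
     (0, 1, 0): {0: 1.0}, (0, 1, 1): {1: 1.0},
     (1, 0, 0): {}}                          # empty dict: play stops (row sums may be < 1)

def step(i, k, l):
    """Sample the successor state after actions k, l in state i; None = absorption."""
    u, acc = random.random(), 0.0
    for j, q in p[(i, k, l)].items():
        acc += q
        if u < acc:
            return j
    return None
```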

Let F(i) denote the set of all randomized actions for A in state i, i.e. an element f(i,·) ∈ F(i) maps K(i) into [0,1], such that Σ_k f(i,k) = 1; f(i,k) will be interpreted as the probability of choosing decision k. Similarly G(i) is the set of all randomized actions for B in state i.

Let F := F(1) × F(2) × ... denote the set of randomized policies for player A, i.e. a typical element f(·,·) ∈ F is such that f(i,·) ∈ F(i). Similarly G := G(1) × G(2) × .... By r(f,g) we denote the column vector with i-th entry

Σ_{k,ℓ} f(i,k) g(i,ℓ) r(i,k,ℓ)

(absolute convergence will be assumed, see the assumption below). By P(f,g) we denote the matrix with (i,j)-entry

Σ_{k,ℓ} f(i,k) g(i,ℓ) p(i,j,k,ℓ) .
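Continuing the illustrative finite encoding sketched above (again not from the memorandum; the function names are hypothetical), the vector r(f,g) and the matrix P(f,g) of the two displays can be computed directly from randomized policies f and g represented as dictionaries state -> {action: probability}.

```python
def reward_vector(f, g, S, K, L, r):
    """r(f,g)(i) = sum over k, l of f(i,k) g(i,l) r(i,k,l)."""
    return [sum(f[i][k] * g[i][l] * r[(i, k, l)]
                for k in K[i] for l in L[i]) for i in S]

def transition_matrix(f, g, S, K, L, p):
    """P(f,g)(i,j) = sum over k, l of f(i,k) g(i,l) p(i,j,k,l)."""
    return [[sum(f[i][k] * g[i][l] * p[(i, k, l)].get(j, 0.0)
                 for k in K[i] for l in L[i]) for j in S] for i in S]

# Example policies for the two-state sketch above:
f = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0}}
g = {0: {0: 1.0, 1: 0.0}, 1: {0: 1.0}}
# reward_vector(f, g, S, K, L, r) -> [0.5, 0.0] for the illustrative data above.
```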

A strategy π for A is a sequence of functions π_t (t = 0,1,...) prescribing the probability of choosing decision k ∈ K(i) for any state i (at time t) and any history h_t = (i_0,k_0,ℓ_0,i_1,...,i_{t-1},k_{t-1},ℓ_{t-1}). For any i and h_t we have

Σ_k π_t(k,h_t,i) = 1 .

Π(A) is the set of all strategies π for A. Similarly Π(B) is the set of all strategies ρ for B. Π := Π(A) × Π(B) is called the set of all strategies.

If for strategies (π,ρ) ∈ Π, π ∈ Π(A), ρ ∈ Π(B) the probabilities π_t(k,h_t,i) and ρ_t(ℓ,h_t,i) do not depend on h_t, we will call these strategies Markov strategies. The Markov strategies may be denoted as (f_0,f_1,...,g_0,g_1,...) or (f_0,f_1,...) or (g_0,g_1,...) with f_t ∈ F, g_t ∈ G. The sets of Markov strategies will be denoted by R, R(A), R(B) respectively; R = R(A) × R(B).

Any strategy (π,ρ) ∈ Π and any starting state i determine a stochastic process (X_t,K_t,L_t) (t = 0,1,...), where the random variable X_t denotes the state at time t, K_t denotes the decision chosen by A at time t and L_t denotes the decision chosen by B at time t.

By P_{i,(π,ρ)} we denote the probability measure for this process and by E_{i,(π,ρ)} the corresponding expectation operator. If the index i is deleted, we refer to the vector with these expectations or probabilities as components.

Note that we did not require Σ_{j∈S} p(i,j,k,ℓ) = 1. We will however suppose that these sums are at most 1. If one likes, the introduction of an extra absorbing state would enable one to work with complete probabilities.

Before introducing our basic assumption and the central concepts, we will first give a slight extension of a result of Derman and Strauch concerning a possible restriction to Markov strategies.

Lemma 1. Let ρ = (g_0,g_1,...) ∈ R(B), i ∈ S, π̃ ∈ Π(A). Then there exists a Markov strategy π = (f_0,f_1,...) for A, such that

P_{i,(π,ρ)}(X_t = j, K_t = k, L_t = ℓ) = P_{i,(π̃,ρ)}(X_t = j, K_t = k, L_t = ℓ)

for all t = 0,1,..., j ∈ S, k ∈ K(j), ℓ ∈ L(j).

Proof. Define f_t ∈ F as follows:

f_t(j,k) := P_{i,(π̃,ρ)}(K_t = k | X_t = j)   if P_{i,(π̃,ρ)}(X_t = j) > 0 ,
f_t(j,k) := arbitrary                          otherwise .

The proof proceeds straightforwardly by induction with respect to t. □

The basic assumption in this paper is a version of the charge structure assumption as known from Markov decision processes (e.g. [4,5,2]).

Assumption.

E_{i,(π,ρ)} Σ_{t=0}^∞ |r(X_t,K_t,L_t)| < ∞   for all (π,ρ) ∈ R .

This assumption enables us to consider the total expected rewards for Markov strategies π,ρ. However, lemma 1 proves that also for π ∈ R(A), ρ ∈ Π(B) and for π ∈ Π(A), ρ ∈ R(B) the total expected absolute rewards are finite.

v(π,ρ) := E_{(π,ρ)} Σ_{t=0}^∞ r(X_t,K_t,L_t) ,   where π or ρ is a Markov strategy.
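For intuition only, the total expected reward v(i,π,ρ) of a pair of Markov strategies can be estimated by simulation in the finite sketch above; the memorandum itself works with the exact expectations. The helper `simulate_total_reward` and the truncation `horizon` are hypothetical illustration devices, and truncating is harmless only when the basic assumption holds.

```python
import random

def simulate_total_reward(start, pi, rho, r, step, horizon=1000):
    """One sample of sum_t r(X_t, K_t, L_t) under Markov strategies pi, rho,
    given as lists of randomized policies (state -> {action: probability})."""
    i, total = start, 0.0
    for t in range(horizon):
        f = pi[t] if t < len(pi) else pi[-1]     # reuse the last policy if the list is finite
        g = rho[t] if t < len(rho) else rho[-1]
        k = random.choices(list(f[i]), weights=list(f[i].values()))[0]
        l = random.choices(list(g[i]), weights=list(g[i].values()))[0]
        total += r[(i, k, l)]
        i = step(i, k, l)                        # `step` as in the sketch of section 1
        if i is None:
            break
    return total

# Estimate v(0, (f,f,...), (g,g,...)) by averaging many sample paths:
# sum(simulate_total_reward(0, [f], [g], r, step) for _ in range(10000)) / 10000
```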

It is not certain that this game possesses a value (note that v(π,ρ) is not even defined for all π,ρ). However, in optimality criteria an important role is played by the function v:

v(i) := sup_{π∈R(A)} inf_{ρ∈R(B)} v(i,π,ρ) .

This vector v will be called the saddle vector or saddle function. To avoid overburdening the notation we will use

""

w('IT,p) :=E( )

I

1'IT(xt,Kt,L

t

)!,

'IT or pis a Markov strategy.

1T,p taO

wei) := sup inf w(i,'IT,p). 'lTER(A) pER(B)
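The saddle function v is defined here only through the sup-inf above. As a computational illustration (not a construction used in the memorandum), for a finite game satisfying a contraction condition of the type mentioned in section 5(e) it can be approximated by Shapley-style successive approximations, solving a matrix game per state with a small linear program. The helper names and the use of numpy/scipy are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(A):
    """Value of the zero-sum matrix game with payoff matrix A (row player maximizes)."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    # Variables: mixed strategy x_0..x_{m-1} of the row player and the value t; maximize t.
    c = np.concatenate([np.zeros(m), [-1.0]])
    A_ub = np.hstack([-A.T, np.ones((n, 1))])          # t <= (x^T A)_j for every column j
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])                             # probabilities sum to one
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def approximate_saddle_function(S, K, L, r, p, iters=200):
    """Shapley iteration v_{n+1}(i) = val[ r(i,k,l) + sum_j p(i,j,k,l) v_n(j) ]_{k,l};
    converges to the saddle function under a contraction assumption."""
    v = {i: 0.0 for i in S}
    for _ in range(iters):
        v = {i: matrix_game_value([[r[(i, k, l)] +
                                    sum(q * v[j] for j, q in p[(i, k, l)].items())
                                    for l in L[i]] for k in K[i]])
             for i in S}
    return v
```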

Before making the role of v somewhat more explicit, we will prove some properties of this function v.

Lemma 2.

a) |v| < ∞.

b) P(f,g)|v| < ∞ for all f ∈ F, g ∈ G.

Proof.

a) For any ρ' ∈ R(B) we have

sup_π w(i,π,ρ') =: N(i,ρ') < ∞ ,

viz. suppose N(i,ρ') = ∞, then a sequence of strategies {π^n} would exist with w(i,π^n,ρ') ≥ 2^n. However, this implies for the "strategy" which chooses π^n with probability 2^{-n} for A and ρ' for B, that the expected absolute reward equals infinity. This mixture of strategies for A may be replaced by a strategy π* ∈ R(A), such that (π*,ρ') has the same absolute expected reward (using a result in Aumann [12] and lemma 1). So a contradiction has been obtained.

Similarly we have for any π' ∈ R(A)

sup_ρ w(i,π',ρ) =: M(i,π') < ∞ .

Hence

-∞ < -M(i,π') ≤ inf_{ρ∈R(B)} v(i,π',ρ) ≤ v(i) ≤ sup_{π∈R(A)} v(i,π,ρ') ≤ N(i,ρ') < ∞ .

b) We have

v⁺(i) ≤ N(i,ρ') ≤ w(i,π^(i),ρ') + ε ,

for some π^(i) ∈ R(A) and some ε > 0, where v⁺ := max{0,v}. So one obtains that (P(f,g)v⁺)(i) − ε is dominated by the expected absolute reward for the strategy in Π consisting of (g,ρ') ∈ R(B) for B and, for A, f in the first step and π^(j) afterwards if this first step results in state j. Hence P(f,g)v⁺ is finite. Similarly P(f,g)v⁻ is finite and so is P(f,g)|v|. □

Now the basic notions of this paper can be introduced.

The strategy (π,ρ) ∈ R is said to be saddle conserving iff for all t = 0,1,... and all f ∈ F, g ∈ G the relation (1.1) holds for all components j with P_{i,(π,ρ)}(X_t = j) > 0 for some i ∈ S:

(1.1)   r(f,g_t)(j) + (P(f,g_t)v)(j) ≤ v(j) ≤ r(f_t,g)(j) + (P(f_t,g)v)(j) ,

where π = (f_0,f_1,...), ρ = (g_0,g_1,...).

The strategy (π,ρ) ∈ R is said to be saddling iff for all t = 0,1,... and all f ∈ F, g ∈ G the left part of relation (1.1) holds for all components j with P_{i,(π',ρ)}(X_t = j) > 0 for some i ∈ S, π' ∈ R(A), and the right part for those components j with P_{i,(π,ρ')}(X_t = j) > 0 for some i ∈ S, ρ' ∈ R(B).

So in fact saddling is a somewhat stronger requirement than saddle conserving (see section 5).
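Purely as an illustration of the definition (and assuming the reconstruction of relation (1.1) given above, the finite encoding of the earlier sketches, and a known saddle function v), the two inequalities can be checked numerically in a component j: both sides are linear in f(j,·) and in g(j,·) respectively, so it suffices to test the pure deviations k ∈ K(j) and ℓ ∈ L(j). The helper name below is hypothetical.

```python
def satisfies_relation_1_1(j, f_t, g_t, v, K, L, r, p, tol=1e-9):
    """Check, in component j, the left and right part of relation (1.1) for the
    one-stage policies f_t, g_t against all pure deviations of the opponent."""
    def one_step(fj, gj):
        # r(f,g)(j) + (P(f,g) v)(j) for the single-state randomized actions fj, gj
        rew = sum(fj.get(k, 0.0) * gj.get(l, 0.0) * r[(j, k, l)]
                  for k in K[j] for l in L[j])
        cont = sum(fj.get(k, 0.0) * gj.get(l, 0.0) * q * v[m]
                   for k in K[j] for l in L[j]
                   for m, q in p[(j, k, l)].items())
        return rew + cont
    left = all(one_step({k: 1.0}, g_t[j]) <= v[j] + tol for k in K[j])
    right = all(v[j] <= one_step(f_t[j], {l: 1.0}) + tol for l in L[j])
    return left and right
```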

The strategy (π,ρ) ∈ R is said to be optimal iff for all π' ∈ Π(A), ρ' ∈ Π(B)

v(π',ρ) ≤ v(π,ρ) ≤ v(π,ρ') .

Hence if there exist optimal (Markov) strategies, then the saddle vector is the value function of the game.


2. Some preliminary results

In this section some results will be presented which will be used in the next section.

Lemma 3 shows that tails of optimal Markov strategies are again optimal.

Lemma 3. Let (π,ρ) be an optimal Markov strategy with π = (f_0,f_1,...), ρ = (g_0,g_1,...). Then (π(t),ρ(t)) with π(t) := (f_t,f_{t+1},...), ρ(t) := (g_t,g_{t+1},...) is optimal in state j ∈ S if there exists a starting state i such that P_{i,(π,ρ)}(X_t = j) > 0.

Proof. Suppose P_{i,(π,ρ)}(X_t = j) > 0 for certain i,j,t. Then

v(i,π,ρ) = E_{i,(π,ρ)} Σ_{τ=0}^{t-1} r(f_τ,g_τ)(X_τ) + Σ_m P_{i,(π,ρ)}(X_t = m) v(m,π(t),ρ(t)) .

Replace π by π̄, which is equal to π except that π̄_{t+τ} := π'_τ for τ ≥ 0 if X_t = j, with π' arbitrary. This gives for v(i,π̄,ρ) the same formula as for v(i,π,ρ) with only the j-term in the final sum replaced by

P_{i,(π,ρ)}(X_t = j) v(j,π',ρ(t)) .

Since v(i,π̄,ρ) ≤ v(i,π,ρ) we obtain

v(j,π',ρ(t)) ≤ v(j,π(t),ρ(t)) .

Analogously one obtains

v(j,π(t),ρ(t)) ≤ v(j,π(t),ρ') .  □

Lemma 4 proves that restarting a Markov strategy does not destroy optimality.

Lemma 4. Let (π,ρ) be an optimal Markov strategy with π = (f_0,f_1,...), ρ = (g_0,g_1,...). Then (π(t),ρ(t)) with π(t) := (f_0,f_1,...,f_t,f_0,f_1,...), ρ(t) := (g_0,g_1,...,g_t,g_0,g_1,...) is again optimal for any t = 0,1,....

Proof. By lemma 1 we have to prove that v(i,π(t),ρ') ≥ v(i,π,ρ) for all ρ' ∈ R(B) and that v(i,π',ρ(t)) ≤ v(i,π,ρ) for all π' ∈ R(A). Let ρ' = (g'_0,g'_1,...). Since π(t) coincides with π in the first t+1 decision rules and its tail from time t+1 onwards is π itself,

v(i,π(t),ρ') = E_{i,(π,ρ')} Σ_{τ=0}^{t} r(f_τ,g'_τ)(X_τ) + Σ_j P_{i,(π,ρ')}(X_{t+1} = j) v(j,π,(g'_{t+1},g'_{t+2},...)) .

However, by the optimality of (π,ρ),

v(j,π,(g'_{t+1},g'_{t+2},...)) ≥ v(j,π,ρ) = v(j) ,

while by the definition of the saddle function inf_{ρ''∈R(B)} v(j,(f_{t+1},f_{t+2},...),ρ'') ≤ v(j). Letting B play g'_0,...,g'_t and then an ε-optimal reply against (f_{t+1},f_{t+2},...) in the state reached at time t+1, the optimality of (π,ρ) yields

E_{i,(π,ρ')} Σ_{τ=0}^{t} r(f_τ,g'_τ)(X_τ) + Σ_j P_{i,(π,ρ')}(X_{t+1} = j) v(j) ≥ v(i,π,ρ) − ε

for every ε > 0. So we obtain

v(i,π(t),ρ') ≥ v(i,π,ρ) .

Similarly

v(i,π',ρ(t)) ≤ v(i,π,ρ) .

So v(i,π(t),ρ(t)) = v(i,π,ρ), which completes the proof. □

Combining lemma 4 and lemma 3 gives the following result:

Lemma 5. Let π = (f_0,f_1,...), ρ = (g_0,g_1,...) form an optimal strategy. Then (f_t,f_0,f_1,...), (g_t,g_0,g_1,...) form an optimal strategy in any state j such that for some i

P_{i,(π,ρ)}(X_t = j) > 0 .

Saddling is a stronger requirement for a strategy than saddle conserving. However, any saddle conserving strategy may be transformed into a saddling strategy, as is shown below.

Lemma 6. If (π,ρ) ∈ R is saddle conserving, then (π̄,ρ̄) ∈ R is saddling, where (π̄,ρ̄) is defined in the following way. Suppose π = (f_0,f_1,...), ρ = (g_0,g_1,...); π̄ = (f̄_0,f̄_1,...), ρ̄ = (ḡ_0,ḡ_1,...), and

S_t := {j ∈ S | P_{i,(π,ρ)}(X_t = j) = 0 for all i ∈ S} .

Then f̄_t(j) := f_t(j), ḡ_t(j) := g_t(j) for j ∉ S_t, and f̄_t(j) := f̄_{t-1}(j), ḡ_t(j) := ḡ_{t-1}(j) for j ∈ S_t.

Proof. Note that f̄_0 = f_0, ḡ_0 = g_0 since S_0 is empty. Furthermore (π,ρ) and (π̄,ρ̄) possess the same S_t-sets.

We will prove (1.1) for (π̄,ρ̄) in all components j for all f,g,t. For j,t such that j ∉ S_t the inequalities (1.1) are satisfied since f̄_t(j) = f_t(j), ḡ_t(j) = g_t(j).

Suppose j ∈ S_t. Then f̄_t(j) = f_τ(j) and ḡ_t(j) = g_τ(j) for some τ (0 ≤ τ < t) with j ∉ S_τ. So

r(f,ḡ_t)(j) = r(f,g_τ)(j)   and   P(f,ḡ_t)(j,i) = P(f,g_τ)(j,i)   for all i ∈ S ,

and similarly with the roles of the players interchanged. Hence (1.1) holds for (π̄,ρ̄) in this component j for this t and all f,g. □

3. Necessary and sufficient conditions for optimality

Theorem 1 shows that an optimal Markov strategy (π,ρ) is saddle conserving. For the sufficient conditions for optimality (exhibited in theorem 2) we will introduce some extra equalizing conditions. A more detailed analysis of this problem will be given in a forthcoming paper.

Theorem 1. If (π,ρ) ∈ R is optimal, then (π,ρ) is saddle conserving.

Proof. Choose t,i,j such that P_{i,(π,ρ)}(X_t = j) > 0 and suppose π = (f_0,f_1,...), ρ = (g_0,g_1,...). Then (π',ρ') with π' := (f_t,f_0,f_1,...), ρ' := (g_t,g_0,g_1,...) is optimal in state j (lemma 5). So

v(j) = v(j,π',ρ') ≥ v(j,π'',ρ') ,

with π'' := (f,f_0,f_1,...) for arbitrary f. Hence

r(f,g_t)(j) + (P(f,g_t)v)(j) ≤ v(j) .

Analogously one proves for arbitrary g

v(j) ≤ r(f_t,g)(j) + (P(f_t,g)v)(j) .  □

We now suppose that all Markov strategies are equalizing.

Equalizing condition: lim_{n→∞} E_{i,(π,ρ)} v(X_n) = 0 for all i ∈ S and (π,ρ) ∈ R.
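As a small illustration (again outside the memorandum), the expectation in the equalizing condition can be evaluated with the matrices P(f_t,g_t): E_{i,(π,ρ)} v(X_n) is the i-th component of P(f_0,g_0)···P(f_{n-1},g_{n-1})v, so one can inspect numerically whether it tends to 0. The sketch assumes the `transition_matrix` helper and the finite encoding from the earlier sketches.

```python
import numpy as np

def expected_terminal_value(start, pi, rho, v, S, K, L, p, n):
    """E_{start,(pi,rho)} v(X_n) = (P(f_0,g_0) ... P(f_{n-1},g_{n-1}) v)(start);
    the equalizing condition asks that this tends to 0 as n grows."""
    vec = np.array([v[i] for i in S], dtype=float)
    for t in reversed(range(n)):
        f = pi[t] if t < len(pi) else pi[-1]
        g = rho[t] if t < len(rho) else rho[-1]
        vec = np.array(transition_matrix(f, g, S, K, L, p)) @ vec
    return vec[S.index(start)]

# Example: print(expected_terminal_value(0, [f], [g], {0: 1.0, 1: 0.0}, S, K, L, p, 25))
```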

Lemma 7. If the equalizing condition is fulfilled, then v(π,ρ) = v if (π,ρ) is saddle conserving.

Proof. Let k ≥ 0. Then

E_{(π,ρ)} r(f_k,g_k)(X_k) = E_{(π,ρ)}{r(f_k,g_k)(X_k) + v(X_{k+1}) − v(X_k)} + E_{(π,ρ)}{v(X_k) − v(X_{k+1})} = E_{(π,ρ)}{v(X_k) − v(X_{k+1})} ,

where the latter equality follows since (π,ρ) is saddle conserving. Using this equality we obtain:

v(π,ρ) = lim_{n→∞} E_{(π,ρ)} Σ_{k=0}^{n} r(f_k,g_k)(X_k)
       = lim_{n→∞} E_{(π,ρ)} Σ_{k=0}^{n} {v(X_k) − v(X_{k+1})}
       = v − lim_{n→∞} E_{(π,ρ)} v(X_{n+1}) = v ,

where all limits are componentwise. □

Theorem 2. If the equalizing condition is fulfilled, then a Markov strategy (π,ρ) is optimal if it is saddling.

Proof. It is to be proved that v(π',ρ) ≤ v(π,ρ) for any π' if (π,ρ) is saddling.

By lemma 1 we only have to consider strategies π' ∈ R(A). Suppose π' = (f'_0,f'_1,...) and ρ = (g_0,g_1,...). Using lemma 7 and the saddling property we obtain:

v(π,ρ) = v ≥ r(f'_0,g_0) + P(f'_0,g_0)v
           ≥ r(f'_0,g_0) + P(f'_0,g_0)r(f'_1,g_1) + P(f'_0,g_0)P(f'_1,g_1)v ≥ ...
           ≥ Σ_{k=0}^{n} P(f'_0,g_0)···P(f'_{k-1},g_{k-1}) r(f'_k,g_k) + P(f'_0,g_0)···P(f'_n,g_n)v
           = Σ_{k=0}^{n} E_{(π',ρ)} r(f'_k,g_k)(X_k) + E_{(π',ρ)} v(X_{n+1}) .

Hence, letting n → ∞ and using the equalizing condition, v(π,ρ) ≥ v(π',ρ). The inequality v(π,ρ') ≥ v(π,ρ) for ρ' ∈ R(B) follows analogously. □
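To tie the earlier computational sketches together (illustration only, not part of the memorandum), one can approximate the saddle function for the toy data of section 1 and then test relation (1.1) state by state for candidate one-stage policies; all helper names come from the earlier hypothetical sketches and the candidates below are made up.

```python
# Approximate v for the illustrative two-state game and check relation (1.1)
# for candidate stationary one-stage policies f_cand, g_cand in every state.
v_hat = approximate_saddle_function(S, K, L, r, p)
f_cand = {0: {0: 1.0, 1: 0.0}, 1: {0: 1.0}}
g_cand = {0: {0: 1.0, 1: 0.0}, 1: {0: 1.0}}
for j in S:
    print(j, v_hat[j], satisfies_relation_1_1(j, f_cand, g_cand, v_hat, K, L, r, p))
```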


4. Some counterexamples

In this section we will show with examples that saddle conservingness does not imply optimality and that optimal strategies are not necessarily saddling.

We will use the following notation for a game Γ_i, displayed as a matrix with rows indexed by k ∈ K(i), columns indexed by ℓ ∈ L(i), and entries of the form

r(i,k,ℓ) + p(i,j,k,ℓ) Γ_j .

This means: Γ_i is the game which starts in state i. If A chooses action k and B chooses ℓ, then A receives r(i,k,ℓ) from B and state j is reached with probability p(i,j,k,ℓ). Then in j the game Γ_j is played.

Situation: [diagram of the games Γ_1, Γ_2, Γ_3, Γ_4 on the states 1,2,3,4 in the above notation, with 0 < α < 1].

Example 1. Saddle conserving does not imply optimal.

For this game v(1) = 1+α, v(2) = 0, v(3) = 1, v(4) = 0. Define π by (2,1,1,...), where the n-th entry (n = 0,1,...) denotes the action to be chosen by A if at time n the system is in state 1. Similarly ρ is defined by

1,2,2,...
1,2,2,...

where the n-th entry of the upper row denotes the action to be chosen by B if at time n the system is in state 1; in the undermost row the same is denoted for state 3.

(π,ρ) is saddle conserving, but not optimal, since for a suitable π' ∈ R(A)

v(1,π',ρ) = 1 + α + α² + ... > 1 + α = v(1,π,ρ) .

Example 2. Optimal does not imply saddling.

Define π* := (1,1,1,...), ρ* := (1,1,1,...; 2,2,2,...), and π := (3,1,1,...). Then (π*,ρ*) is optimal, but (π*,ρ*) is not saddling, since P_{i,(π,ρ*)}(X_1 = 2) = α > 0 and ρ* prescribes a suboptimal action in state 2 from time 1 onwards.

5. Remarks and extensions

a) As remarked before, the characterization of optimal strategies will be extended to the equalizing concept in a forthcoming paper.

b) If one requires absolute convergence of the total expected rewards for all π ∈ Π(A), ρ ∈ Π(B) instead of (π,ρ) ∈ R, then v(π,ρ) is defined properly for all (π,ρ). Furthermore one obtains w(i) < ∞ and consequently

|v(π,ρ)| ≤ w ,   P(f,g)|v| ≤ w .

The finiteness of w(i) follows from the following reasoning. Suppose w(i) = ∞, then sequences π^n, ρ^n (n = 1,...) exist with

E_{i,(π^n,ρ^n)} Σ_{t=0}^∞ |r(X_t,K_t,L_t)| ≥ 4^n .

Hence, if A chooses π^n with probability 2^{-n} and (independently) B chooses ρ^m with probability 2^{-m}, then the total expected absolute reward becomes ∞. This implies however [see Aumann, 12] that there exist strategies π* ∈ Π(A), ρ* ∈ Π(B) with total expected absolute reward equal to ∞, which is a contradiction.

c) Optimal strategies are not necessarily saddling; however, if there exist optimal (Markov) strategies, then there exist optimal strategies which are saddling (theorem 1, lemma 6). The subclass of optimal strategies which are saddling consists of perfect strategies (see Selten [11]).

d) The basic assumption may surely be weakened. Absolute convergence of total expected rewards for some strategies suffices, if only it is clear that for the other strategies either the positive parts or the negative parts of the rewards have a finite expectation.


e) The equalizing condition, which enabled us to formulate a sufficient condition for optimality, is certainly fulfilled if the contraction condition holds as described in Wessels [7], assumptions 2.1 and 2.2.

f) The essential idea of lemma 3 for Markov decision processes can already be found in Gavish and Schweizer [13] and in Groenewegen [2]. In the latter this idea is used to derive a Markov decision process analogue of theorem 1.

References

[1] C. Derman, R. Strauch (1966), A note on memoryless rules for controlling sequential control processes. Ann. Math. Stat. 37, 276-278.

[2] L.P.J. Groenewegen (1975), Convergence results related to the equalizing property in Markov decision processes. COSOR 75-18, Eindhoven University of Technology.

[3] L.E. Dubins, L.J. Savage (1965), How to gamble if you must: inequalities for stochastic processes. McGraw-Hill, New York.

[4] A. Hordijk (1974), Dynamic programming and Markov potential theory. Math. Centre Tracts, no. 51, Amsterdam.

[5] A. Hordijk (1974), Convergent dynamic programming. Techn. rep. no. 28, Dep. of Op. Res., Stanford University, Stanford.

[6] U. Rieder (1976), On optimal policies and martingales in dynamic programming. Techn. rep., University of Hamburg. To appear in J. Appl. Prob.

[7] J. Wessels (1976), Markov games with unbounded rewards. COSOR 76-05, Eindhoven University of Technology.

[8] L.S. Shapley (1953), Stochastic games. Proc. Nat. Acad. Sci. USA 39, 1095-1100.

[9] T. Parthasarathy, T.E.S. Raghavan (1971), Some topics in two-person games. American Elsevier, New York, 238-252.

[10] J. van der Wal, J. Wessels (1976), On Markov games. Statistica Neerlandica 30, 51-71.

[11] R. Selten (1975), Reexamination of the perfectness concept for equilibrium points in extensive games. Internat. J. Game Theory 4, 25-55.

[12] R.J. Aumann (1964), Mixed and behaviour strategies in infinite extensive games. Advances in Game Theory, Ann. Math. Studies 52, Princeton University Press, Princeton, 627-650.

[13] B. Gavish, P.J. Schweizer (1976), An optimality principle for Markovian decision processes. J. Math. Anal. Appl. 54, 173-184.

A quantitative methodology was selected as the most appropriate approach to identify factors preventing the successful implementation of an existing fall prevention programme...