Note on the optimal strategies for the finite-stage Markov game

Citation for published version (APA):
Wal, van der, J. (1975). Note on the optimal strategies for the finite-stage Markov game. (Memorandum COSOR; Vol. 7506). Technische Hogeschool Eindhoven.

Document status and date: Published: 01/01/1975

Document version: Publisher's PDF, also known as Version of Record (includes final page, issue and volume numbers)


Technische Hogeschool Eindhoven
Department of Mathematics

Memorandum COSOR 75-06

Note on the optimal strategies for the finite-stage Markov game

by

J. van der Wal

Note on the optimal strategies for the finite-stage Markov game

J. van der Wal

Abstract. In this note we consider the finite-stage Markov game with finitely many states and actions as described by Zachrisson [5]. Zachrisson proves that this game has a value and shows that value and optimal strategies may be determined with a dynamic programming approach. However, he silently assumed that both players would use only Markov strategies. Here we will give a simple proof which shows this restriction to be irrelevant.

1. Introduction and notations

The finite-stage Markov game considered here is a game between two players which proceeds as follows. At each of a finite number of time instants both players select an action out of a finite set of allowed actions. As a result of these two actions the state of the game is changed and one of the players receives some amount, specified by the rules of the game, from the other. This we formalize as follows.

We will consider a dynamic system with finite state space S := {1, ..., N}, the behavior of which is influenced by two players, P1 and P2, having opposite aims. For each state x ∈ S two finite non-empty sets of actions exist, one for each player, denoted by K_x for P1 and L_x for P2. At T equidistant time instants, numbered in reversed order n = T, T-1, ..., 1, both players select an action out of the set available to them. As a joint result of the two selected actions, k for P1 and ℓ for P2, the system moves to a state y with probability p(y|x,k,ℓ), with Σ_{y∈S} p(y|x,k,ℓ) = 1, and P1 will receive some (possibly negative) amount from P2, denoted by r(x,k,ℓ). Moreover we will assume that if, as a result of the actions at n = 1, the system moves to state y at the end of the game, P1 will receive a final amount q(y) from P2.

We will call this game the T-stage Markov game with final payoff q. We will prove that this game has a value and we will derive optimal strategies for each player over the duration of the game. Moreover we will give a way to determine value and optimal strategies. First we give some definitions and notations.
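To make these primitives concrete, the following hypothetical two-state instance (the numbers are made up for illustration and are not taken from this memorandum) shows one way S, K_x, L_x, r, p and q might be laid out in code; it is reused in the sketches later in this note.

```python
# Hypothetical 2-state example: S = {0, 1}, K_x = L_x = {0, 1} in both states.
# r[x][k][l]    : amount r(x,k,l) that P1 receives from P2,
# p[x][k][l][y] : transition probability p(y | x, k, l),
# q[x]          : final payoff q(x).
r = [
    [[ 1.0, -1.0],      # state 0, P1 plays k = 0
     [-1.0,  1.0]],     # state 0, P1 plays k = 1
    [[ 2.0,  0.0],      # state 1, P1 plays k = 0
     [ 0.0,  1.0]],     # state 1, P1 plays k = 1
]
p = [
    [[[0.8, 0.2], [0.3, 0.7]],   # state 0, k = 0, l = 0 / l = 1
     [[0.5, 0.5], [0.9, 0.1]]],  # state 0, k = 1
    [[[1.0, 0.0], [0.4, 0.6]],   # state 1, k = 0
     [[0.2, 0.8], [0.6, 0.4]]],  # state 1, k = 1
]
q = [0.0, 3.0]   # final payoff received in the terminal state
T = 4            # number of stages
```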

A strategy π for P1 for the game is any function that specifies for each time instant n = T, T-1, ..., 1 and for each state x ∈ S the probability π(k|x,n,h_n) that action k ∈ K_x will be taken, as a function of x, n and the history h_n. By h_n we mean the history of the game up to time instant n, the sequence h_n = (x_T, k_T, ℓ_T, ..., x_{n+1}, k_{n+1}, ℓ_{n+1}) of prior states and actions (h_T is the empty sequence). We will call π a Markov strategy if all π(k|x,n,h_n) are independent of h_n.

A policy f for P1 will be defined as any function such that f(x) is a probability distribution on K_x for all x ∈ S. Thus a Markov strategy π consists of T policies and we will denote it by π = (f_T, ..., f_1) (f_n the policy to be used at time instant n). Similarly we define strategies ρ and policies g for P2.

Let V(π,ρ) denote the N-column vector with x-th component equal to the total expected reward for P1 when the game starts in state x, P1 plays strategy π and P2 plays strategy ρ. Strategies π* and ρ* satisfying

V(π,ρ*) ≤ V(π*,ρ*) ≤ V(π*,ρ) for all π and ρ

will be called optimal, and V(π*,ρ*) is called the value of the game.

The finite-stage Markov game has already been considered by Zachrisson [5]. However, he (silently) assumed that both players would use only Markov strategies. Under this assumption Zachrisson proves that the game has a value and that the value and optimal strategies for both players can be determined by a dynamic programming approach. In the early work on Markov decision processes the same restriction was made. Derman [1] proved that the "intuitively obvious" restriction to Markov strategies was correct. Here we will do the same for finite-stage Markov games.

So we will show that there exist Markov strategies π* and ρ* satisfying for all strategies π and ρ

V(π,ρ*) ≤ V(π*,ρ*) ≤ V(π*,ρ).

2. The existence of optimal Markov strategies

In order to simplify the notations we introduce two operators. Let f and g be arbitrary policies; then the operators L(f,g) and U on ℝ^N are defined by

(L(f,g)v)(x) := Σ_{k∈K_x} f_k(x) Σ_{ℓ∈L_x} g_ℓ(x) [ r(x,k,ℓ) + Σ_{y∈S} p(y|x,k,ℓ) v(y) ],   x ∈ S,

with f_k(x) (g_ℓ(x)) denoting the probability that in state x action k (ℓ) will be taken when policy f (g) is used, and

Uv := max_f min_g L(f,g)v

(where max min is taken componentwise).

Now we expect the sequence v_n, n = 0, 1, ..., T, v_n ∈ ℝ^N, defined by

v_0(x) := q(x), x ∈ S,
v_n := U v_{n-1}, n = 1, ..., T,

to be the value of the game. Before we prove this we first give two lemmas.

Lemma 1. The 1-stage Markov game with final payoff v has value Uv and there exist policies f* and g* satisfying L(f,g*)v ≤ L(f*,g*)v ≤ L(f*,g)v for all f and g.

Proof. For any x ∈ S the game with initial state x is a matrix game with value (Uv)(x). For this game (randomized) optimal actions f*(x) and g*(x) exist. Thus the game has value Uv and the policies f* and g* are optimal. □
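Lemma 1 reduces the one-stage game to an ordinary matrix game in every state. As an aside, the following is a minimal sketch of how (Uv)(x) and an optimal randomized action for P1 could be computed with the classical linear-programming formulation of a matrix game; the helper name solve_matrix_game and the use of scipy.optimize.linprog are illustrative choices, not part of the memorandum.

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(A):
    """Value of the matrix game A and an optimal mixed action for the
    maximizing (row) player P1, via the standard LP formulation."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    # Decision variables (x_1, ..., x_m, v); linprog minimizes, so use -v.
    c = np.r_[np.zeros(m), -1.0]
    # For every pure action l of P2:  v - sum_k A[k, l] * x_k <= 0.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # The x_k form a probability distribution over P1's actions.
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
    b_eq = [1.0]
    bounds = [(0.0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]   # game value, optimal mixed action of P1
```

An optimal randomized action for P2 can be obtained in the same way by solving the game from P2's side (or read off from the dual of this LP).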

Let f_n* and g_n* be optimal policies in the 1-stage Markov game with final payoff v_{n-1}, n = 1, ..., T. That is, f_n* and g_n* satisfy

L(f,g_n*)v_{n-1} ≤ L(f_n*,g_n*)v_{n-1} = v_n ≤ L(f_n*,g)v_{n-1} for all policies f and g.

Define the strategies π* and ρ* by

π* := (f_T*, ..., f_1*) and ρ* := (g_T*, ..., g_1*).

Let v_n(π,ρ,h,x), n = 0, 1, ..., T, denote the conditional expected reward for P1 from time instant n onwards if the system is in state x at time instant n, the strategies π and ρ are used and the history h has been observed; in particular v_0(π,ρ,h,x) = q(x) for all π, ρ, h and x.

Lemma 2. For all strategies π, all histories h_n and all x ∈ S

v_n(π,ρ*,h_n,x) ≤ v_n(π*,ρ*,h_n,x) = v_n(x), n = 0, 1, ..., T.


Proof. We will prove the assertion by induction. By definition we have for all π, h_0 and x ∈ S

v_0(π,ρ*,h_0,x) = v_0(π*,ρ*,h_0,x) = q(x) = v_0(x).

Now assume v_t(π,ρ*,h_t,x) ≤ v_t(π*,ρ*,h_t,x) = v_t(x), t = 0, ..., n, for all π, h_t and x. Then for all π, h_{n+1} and x we have

v_{n+1}(π,ρ*,h_{n+1},x)
  = Σ_{k∈K_x} π(k|x,n+1,h_{n+1}) Σ_{ℓ∈L_x} (g_{n+1}*)_ℓ(x) [ r(x,k,ℓ) + Σ_{y∈S} p(y|x,k,ℓ) v_n(π,ρ*,h_{n+1}∘(x,k,ℓ),y) ]
  ≤ Σ_{k∈K_x} π(k|x,n+1,h_{n+1}) Σ_{ℓ∈L_x} (g_{n+1}*)_ℓ(x) [ r(x,k,ℓ) + Σ_{y∈S} p(y|x,k,ℓ) v_n(y) ]
  ≤ v_{n+1}(x) = v_{n+1}(π*,ρ*,h_{n+1},x),

where h_{n+1}∘(x,k,ℓ) denotes the concatenation of h_{n+1} and (x,k,ℓ), with result h_n. The first inequality follows from the induction assumption, the latter one from the definition of v_{n+1} and g_{n+1}*. The latter equality follows from v_{n+1} = L(f_{n+1}*,g_{n+1}*)v_n and the induction assumption. Hence for all x ∈ S

v_T(π,ρ*,h_T,x) ≤ v_T(π*,ρ*,h_T,x) = v_T(x),

or V(π,ρ*) ≤ V(π*,ρ*). □

The proof of the above lemma is a shortcut of the proof given by Derman [1] for the existence of memoryless optimal strategies in finite-stage Markov decision processes.

We are now ready to show:

Theorem. The T-stage Markov game with final payoff q has the value v_T and the Markov strategies π* and ρ* are optimal, that is

V(π,ρ*) ≤ V(π*,ρ*) = v_T ≤ V(π*,ρ) for all strategies π and ρ.

Proof. From Lemma 2 we have V(π,ρ*) ≤ V(π*,ρ*). By interchanging the roles of π and ρ we may show in the same way V(π*,ρ*) ≤ V(π*,ρ). This proves the assertion. □

Summarizing, we have shown that the following algorithm provides the value v_T of the game and optimal strategies π* and ρ*.

(i) Set v_0(x) := q(x), x = 1, ..., N.

(ii) For n = 1, ..., T determine policies f_n* and g_n* satisfying for all f and g

L(f,g_n*)v_{n-1} ≤ L(f_n*,g_n*)v_{n-1} ≤ L(f_n*,g)v_{n-1},

and define v_n := L(f_n*,g_n*)v_{n-1}.

(iii) v_T is the value of the game, and π* = (f_T*, ..., f_1*) and ρ* = (g_T*, ..., g_1*) are optimal strategies for P1 and P2 respectively.
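As an illustration, steps (i)-(iii) could be coded as the following backward-induction sketch. It is not part of the memorandum: the name finite_stage_value is made up, the data layout for r, p and q follows the hypothetical example given in Section 1, the per-state matrix games are solved by a routine such as the solve_matrix_game LP sketch after Lemma 1, and only P1's optimal policies are collected (the g_n* for P2 follow in the same way from P2's side of each matrix game).

```python
import numpy as np

def finite_stage_value(T, r, p, q, solve_matrix_game):
    """Backward induction for the T-stage Markov game (steps (i)-(iii)).

    r[x][k][l]    : immediate payoff r(x,k,l) from P2 to P1,
    p[x][k][l][y] : transition probability p(y|x,k,l),
    q[x]          : final payoff q(x),
    solve_matrix_game : routine returning (value, optimal mixed action of P1)
                        for a matrix game, e.g. the LP sketch after Lemma 1.
    Returns v_T and the list [f_T*, ..., f_1*] of P1's optimal policies."""
    N = len(q)
    v = np.asarray(q, dtype=float)           # step (i): v_0 := q
    policies = []
    for n in range(1, T + 1):                # step (ii)
        v_new = np.empty(N)
        f_n = []
        for x in range(N):
            # Matrix game with entries r(x,k,l) + sum_y p(y|x,k,l) v_{n-1}(y).
            A = [[r[x][k][l] + float(np.dot(p[x][k][l], v))
                  for l in range(len(r[x][k]))]
                 for k in range(len(r[x]))]
            v_new[x], fx = solve_matrix_game(A)
            f_n.append(fx)
        v, policies = v_new, [f_n] + policies
    return v, policies                       # step (iii): v_T and pi* = (f_T*, ..., f_1*)
```

For the two-state example above, finite_stage_value(T, r, p, q, solve_matrix_game) returns v_T together with π* = (f_T*, ..., f_1*).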

3. Extensions and remarks

We considered the case that neither the state space nor the action spaces depend on the time t, we demanded

Σ_{y∈S} p(y|x,k,ℓ) = 1 for all x, k and ℓ,

and we took the times at which the system is influenced to be equidistant. None of these restrictions, however, is essential. It is easily seen that we may allow the state space and the action spaces to depend on t, and only trivial changes in the proofs are needed if we allow Σ_{y∈S} p(y|x,k,ℓ) < 1 for some or all x, k and ℓ. If the time between two epochs is a random variable with probability distribution F(·|y,x,k,ℓ) when in state x actions k and ℓ are taken and the system moves to y, we must be more careful. In order to avoid difficulties we demand these random variables to have finite expectations. For these finite-stage semi-Markov games only minor changes in the proofs are needed to obtain the same results; e.g. we would have to extend the history of the system with the time elapsed before the next state is reached.

Instead of considering the criterion of total expected rewards it is also possible to use the criterion of total expected discounted rewards. For the game with equidistant time instants we may use any discount factor β ∈ [0,∞). For the semi-Markov game we may use β ∈ [0,1], but if we want β > 1 we must demand

∫_0^∞ β^t dF(t|y,x,k,ℓ) < ∞ for all y, x, k and ℓ.
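As a hypothetical illustration of this condition: if the time between two epochs were exponentially distributed with rate λ, then ∫_0^∞ β^t λe^{-λt} dt = λ/(λ - ln β), which is finite precisely when ln β < λ, i.e. when β < e^λ; for β ≤ 1 the condition holds automatically.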

Here we only considered finite-stage Markov games. However, our results may easily be extended to some infinite-horizon Markov games. For example, consider the infinite-horizon Markov game as described by Shapley [2] with the criterion of total expected reward (Shapley considers the case Σ_{y∈S} p(y|x,k,ℓ) ≤ s < 1 for all x, k and ℓ) or the β-discounted (β ∈ [0,1)) infinite-horizon Markov game. In order to prove that these games have a value and to find (near) optimal strategies for both players, one usually approximates the game by a finite-stage Markov game. If we let v_n denote the value of the n-stage Markov game, we may easily show that v_n tends to the value v* of the infinite-horizon Markov game if n tends to infinity. Moreover, one may prove that if f (g) is an optimal policy for the 1-stage (discounted) Markov game with final payoff v*, the stationary strategy f^(∞) = (f, f, ...) (g^(∞)) will be optimal in the infinite-horizon Markov game. This is shown in Van der Wal [4]. Two other types of infinite-horizon Markov games with the criterion of total expected rewards may be found in Van der Wal [3].
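By way of illustration, the successive-approximation scheme just described could look as follows for the β-discounted game. This is a sketch under the assumption that the solve_matrix_game helper from the sketch after Lemma 1 is available; the sup-norm stopping rule and the function name discounted_value are illustrative choices, not taken from [3] or [4].

```python
import numpy as np

def discounted_value(r, p, beta, solve_matrix_game, eps=1e-8, max_iter=10_000):
    """Successive approximation v <- U_beta v for the beta-discounted
    infinite-horizon Markov game (0 <= beta < 1), where (U_beta v)(x) is the value
    of the matrix game with entries r(x,k,l) + beta * sum_y p(y|x,k,l) v(y)."""
    N = len(r)
    v = np.zeros(N)                  # any starting vector will do
    f = [None] * N
    for _ in range(max_iter):
        v_new = np.empty(N)
        for x in range(N):
            A = [[r[x][k][l] + beta * float(np.dot(p[x][k][l], v))
                  for l in range(len(r[x][k]))]
                 for k in range(len(r[x]))]
            v_new[x], f[x] = solve_matrix_game(A)
        if np.max(np.abs(v_new - v)) < eps:
            break
        v = v_new
    # f defines a stationary strategy f^(inf) = (f, f, ...) for P1.
    return v_new, f
```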

References.

[1] Derman, C., Finite state Markovian decision processes. Academic Press, New York and London, 1970.

[2] Shapley, L.S., Stochastic games. Proc. Nat. Acad. Sci. USA 39 (1953), 1095-1100.

[3] Van der Wal, J., The solution of Markov games by successive approximation. Master's thesis, Department of Mathematics, Technological University Eindhoven, 1975.

[4] Van der Wal, J., The method of successive approximations of the discounted Markov game. Memorandum COSOR 75-02, Department of Mathematics, Technological University Eindhoven, March 1975.

[5] Zachrisson, L.E., Markov games. Annals of Mathematics Studies No. 52, Princeton, New Jersey, 1964, 211-253.
