
A suboptimality test for two person zero sum Markov games


Citation for published version (APA):

Reetz, D., & Wal, van der, J. (1976). A suboptimality test for two person zero sum Markov games. (Memorandum COSOR; Vol. 7619). Technische Hogeschool Eindhoven.

Document status and date: Published: 01/01/1976

Document version: Publisher's PDF, also known as Version of Record (includes final page, issue and volume numbers)

EINDHOVEN UNIVERSITY OF TECHNOLOGY Department of Mathematics

PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 76-19

A suboptimality test for two person zero sum Markov games

by

Dieter Reetz and Jan van der Wal

Revised October 1977

Eindhoven, October 1976 The Netherlands


Abstract. This paper presents a games version of the nonoptimality test given by Hastings for Markov decision processes. A pure action is eliminated if, compared with some randomized action, it performs worse against every one of the opponent's possible actions.

1. Introduction and preliminaries

For Markov decision processes (MDP) several authors have proposed tests to eliminate suboptimal actions, among others [4,3,1,2]. In this note we give a test for the elimination of suboptimal actions in two person zero sum Markov games with finite state and action spaces.

Following the notation in [7], the Markov game is characterized by the state space $S := \{1, 2, \ldots, N\}$; for each state $x \in S$, two finite nonempty sets of actions, $K_x$ for player 1 ($P_1$) and $L_x$ for player 2 ($P_2$); and, if in state $x$ actions $k$ and $\ell$ are taken, an immediate payoff $r(x,k,\ell)$ from $P_2$ to $P_1$ and transition probabilities $p(y \mid x,k,\ell)$, $y \in S$. We assume

$$\sum_{y \in S} p(y \mid x,k,\ell) < 1 \quad \text{for all } x,\ k \text{ and } \ell.$$

As criterion we use total expected rewards. Shapley [6] showed that this game has a value, which we will denote by $v^*$, as well as optimal stationary strategies.

A policy $f$ for $P_1$ specifies the probabilities $f(x,k)$ by which action $k$ is taken in state $x$. The randomized action in state $x$ is denoted by $f(x)$.

In order to simplify the expressions in the remainder we define, for all $v \in \mathbb{R}^N$, $x$, $k$ and $\ell$,

$$r(x,k,\ell,v) := r(x,k,\ell) + \sum_{y \in S} p(y \mid x,k,\ell)\, v(y).$$
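The one-stage payoff defined above can be sketched in a few lines of Python. This is only an illustration: the nested-list layout `r[x][k][l]` and `p[x][k][l][y]`, the function name `one_stage_payoff`, and the toy numbers are assumptions made for this example and do not come from the memorandum. States and actions are indexed from 0.

```python
def one_stage_payoff(r, p, v, x, k, l):
    """Immediate payoff plus expected terminal payoff v after one transition.

    r[x][k][l]    : immediate payoff from P2 to P1 (assumed layout)
    p[x][k][l][y] : transition probability to state y (assumed layout)
    v[y]          : terminal payoff in state y
    """
    expected_terminal = sum(p[x][k][l][y] * v[y] for y in range(len(v)))
    return r[x][k][l] + expected_terminal


# Hypothetical two-state game; every row of p sums to less than 1.
r = [[[1.0, 0.0], [0.0, 2.0]], [[0.5]]]
p = [
    [[[0.5, 0.25], [0.0, 0.5]], [[0.25, 0.25], [0.5, 0.0]]],
    [[[0.0, 0.5]]],
]
v = [2.0, 4.0]

print(one_stage_payoff(r, p, v, x=0, k=0, l=0))  # 1.0 + 0.5*2.0 + 0.25*4.0
```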

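Let $\{v_n\}$ be determined by the standard successive approximation method and let $\lambda_n$, $\mu_n$, $a_n$ and $b_n$ be defined as follows (cf. [7]):

$$\lambda_n := \min_{x \in S}\, (v_n - v_{n-1})(x), \qquad \mu_n := \max_{x \in S}\, (v_n - v_{n-1})(x),$$

$$a_n := \begin{cases} \max\limits_{x,k,\ell} \sum\limits_{y \in S} p(y \mid x,k,\ell) & \text{if } \lambda_n < 0, \\[4pt] \min\limits_{x,k,\ell} \sum\limits_{y \in S} p(y \mid x,k,\ell) & \text{if } \lambda_n \ge 0, \end{cases}$$

$$b_n := \begin{cases} \max\limits_{x,k,\ell} \sum\limits_{y \in S} p(y \mid x,k,\ell) & \text{if } \mu_n \ge 0, \\[4pt] \min\limits_{x,k,\ell} \sum\limits_{y \in S} p(y \mid x,k,\ell) & \text{if } \mu_n < 0. \end{cases}$$

And let $f_n$ be an optimal policy for $P_1$ in the 1-stage game with terminal payoff $v_{n-1}$, i.e.

$$\min_{\ell \in L_x} \sum_{k \in K_x} f_n(x,k)\, r(x,k,\ell,v_{n-1}) = v_n(x), \quad x \in S.$$

An action $k_0 \in K_x$ will be called suboptimal at stage $n$ if no optimal policy $f_n$, satisfying the equality above, can have $f_n(x,k_0) > 0$.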
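As a rough illustration of the four bounding quantities just defined (the minimum and maximum of $v_n - v_{n-1}$ and the two case-dependent extremes of the transition row sums), the following sketch computes them from two successive iterates. The function name `bounds`, the data layout `p[x][k][l][y]`, and the toy numbers are assumptions made for this example; computing the iterates themselves requires solving a matrix game in every state, which is not shown.

```python
def bounds(p, v_new, v_old):
    """Return (lam, mu, a, b) for one step of successive approximation.

    lam, mu : min and max over states of v_new - v_old
    a, b    : max or min of the row sums sum_y p(y|x,k,l), chosen by the
              sign of lam and mu respectively
    """
    diffs = [new - old for new, old in zip(v_new, v_old)]
    lam, mu = min(diffs), max(diffs)
    row_sums = [
        sum(p[x][k][l])
        for x in range(len(p))
        for k in range(len(p[x]))
        for l in range(len(p[x][k]))
    ]
    lo, hi = min(row_sums), max(row_sums)
    a = hi if lam < 0 else lo
    b = hi if mu >= 0 else lo
    return lam, mu, a, b


# Hypothetical transition data: row sums are 0.75, 0.5, 0.5, 0.5 and 0.5.
p = [
    [[[0.5, 0.25], [0.0, 0.5]], [[0.25, 0.25], [0.5, 0.0]]],
    [[[0.0, 0.5]]],
]
print(bounds(p, v_new=[3.0, 1.0], v_old=[2.0, 2.0]))
```

With these choices, $a_n \lambda_n$ is a lower bound and $b_n \mu_n$ an upper bound on $\sum_{y} p(y \mid x,k,\ell)\,(v_n - v_{n-1})(y)$ for every $x$, $k$ and $\ell$, which is how the quantities are used in the test of Section 2.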

An action $k_0 \in K_x$ is called suboptimal if no optimal strategy $f^{*(\infty)}$, thus satisfying

$$\min_{\ell \in L_x} \sum_{k \in K_x} f^*(x,k)\, r(x,k,\ell,v^*) = v^*(x), \quad x \in S,$$

can have $f^*(x,k_0) > 0$.

In the next section we present a test for eliminating actions for one or more stages, which is a straightforward extension to Markov games of the tests that Hübner [3], Hastings [1], and Hastings and van Nunen [2] proposed for MDP.

2. The suboptimality test

First we prove an auxiliary result which says when it is possible to eliminate actions.

Lemma 1. Let $v \in \mathbb{R}^N$ be given arbitrarily. And let there exist a probability distribution $f(x)$ on $K_x$ with $f(x,k_0) = 0$, and

$$\sum_{k \in K_x} f(x,k)\, r(x,k,\ell,v) > r(x,k_0,\ell,v) \quad \text{for all } \ell \in L_x.$$

Then action $k_0$ is suboptimal in the 1-stage game with terminal payoff $v$.

Proof. We will prove this by contradiction. Let $f^*(x)$ be an optimal randomized action for $P_1$ in the game above with $f^*(x,k_0) > 0$. Now define the randomized action $\tilde{f}(x)$ by

$$\tilde{f}(x,k_0) = 0, \qquad \tilde{f}(x,k) = f^*(x,k) + f^*(x,k_0)\, f(x,k), \quad k \neq k_0.$$

Then, by the assumption on $f(x)$, we have for all $\ell \in L_x$

$$\sum_{k \in K_x} \tilde{f}(x,k)\, r(x,k,\ell,v) = \sum_{k \neq k_0} f^*(x,k)\, r(x,k,\ell,v) + f^*(x,k_0) \sum_{k \in K_x} f(x,k)\, r(x,k,\ell,v) > \sum_{k \in K_x} f^*(x,k)\, r(x,k,\ell,v).$$

But this contradicts the optimality of $f^*(x)$, hence $f^*(x,k_0) = 0$ for all optimal $f^*(x)$. I.e. $k_0$ is suboptimal in the 1-stage game with terminal payoff $v$. $\square$
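The hypothesis of Lemma 1 is easy to check numerically once a candidate randomized action is given. The sketch below does only that check; it does not search for a dominating distribution (finding one is a small linear program). The function name `is_dominated`, the matrix layout `payoff[k][l]` holding the one-stage payoffs in a fixed state, and the numbers are assumptions made for this example.

```python
def is_dominated(payoff, k0, f):
    """Check the condition of Lemma 1 for a given randomized action f.

    payoff[k][l] : one-stage payoff in a fixed state x for actions k and l
    f            : probability distribution over the rows, with f[k0] == 0
    Returns True if the mixture is strictly better than k0 against every l.
    """
    if f[k0] != 0 or abs(sum(f) - 1.0) > 1e-12 or min(f) < 0:
        raise ValueError("f must be a distribution on K_x with f[k0] == 0")
    columns = range(len(payoff[0]))
    rows = range(len(payoff))
    return all(
        sum(f[k] * payoff[k][l] for k in rows) > payoff[k0][l]
        for l in columns
    )


# Hypothetical 3x2 matrix game: neither pure row beats row 2 in both
# columns, but the equal mixture of rows 0 and 1 does.
payoff = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
print(is_dominated(payoff, k0=2, f=[0.5, 0.5, 0.0]))
```

The example is chosen so that no pure action dominates row 2, which is why the lemma is stated for randomized actions.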

Now we can formulate the suboptimality test. Define $\gamma_n(x,k_0)$ by

$$\gamma_n(x,k_0) := \min_{\ell \in L_x} \Big[ \sum_{k \in K_x} f_n(x,k)\, r(x,k,\ell,v_{n-1}) - r(x,k_0,\ell,v_{n-1}) \Big].$$

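Now we may prove the following theorem:

Theorem 1. (cf. [2]).

i) If $\gamma_n(x,k_0) - \sum_{\ell=n}^{n+m-1} (\mu_\ell b_\ell - \lambda_\ell a_\ell) > 0$, then action $k_0$ is suboptimal at stage $n+m$.

ii) If $\gamma_n(x,k_0) - \sum_{\ell=n}^{\infty} (\mu_\ell b_\ell - \lambda_\ell a_\ell) > 0$, then action $k_0$ is suboptimal in the $\infty$-stage game.

Proof. i) For all $\ell \in L_x$,

$$\begin{aligned}
&\sum_{k \in K_x} f_n(x,k)\, r(x,k,\ell,v_{n+m-1}) - r(x,k_0,\ell,v_{n+m-1}) \\
&\quad = \sum_{k \in K_x} f_n(x,k)\, r(x,k,\ell,v_{n-1}) - r(x,k_0,\ell,v_{n-1}) \\
&\qquad + \sum_{k \in K_x} f_n(x,k) \big[ r(x,k,\ell,v_n) - r(x,k,\ell,v_{n-1}) \big] - \big[ r(x,k_0,\ell,v_n) - r(x,k_0,\ell,v_{n-1}) \big] \\
&\qquad + \cdots \\
&\qquad + \sum_{k \in K_x} f_n(x,k) \big[ r(x,k,\ell,v_{n+m-1}) - r(x,k,\ell,v_{n+m-2}) \big] - \big[ r(x,k_0,\ell,v_{n+m-1}) - r(x,k_0,\ell,v_{n+m-2}) \big] \\
&\quad \ge \gamma_n(x,k_0) + a_n \lambda_n - b_n \mu_n + \cdots + a_{n+m-1} \lambda_{n+m-1} - b_{n+m-1} \mu_{n+m-1} > 0.
\end{aligned}$$

Hence with Lemma 1, $k_0$ is suboptimal at stage $n+m$.

ii) From i) with $m \to \infty$.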
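Part i) of Theorem 1 amounts to subtracting an accumulated correction from the margin by which the stage-$n$ optimal randomized action beats $k_0$. A minimal sketch under assumed names: `gamma_n` computes that margin from a payoff matrix `payoff_prev[k][l]` (the one-stage payoffs with terminal payoff $v_{n-1}$ in a fixed state), and `is_suboptimal_after` evaluates the test for given lists of the bounding quantities at stages $n, \ldots, n+m-1$. All names and numbers are hypothetical.

```python
def gamma_n(payoff_prev, f_n, k0):
    """Minimum over the opponent's actions of the margin of f_n over k0."""
    rows = range(len(payoff_prev))
    columns = range(len(payoff_prev[0]))
    return min(
        sum(f_n[k] * payoff_prev[k][l] for k in rows) - payoff_prev[k0][l]
        for l in columns
    )


def is_suboptimal_after(gamma, lam, mu, a, b):
    """Test of Theorem 1 i): lam, mu, a, b list the values for stages
    n, ..., n+m-1. True means k0 may be skipped at stage n+m."""
    correction = sum(mu_j * b_j - lam_j * a_j
                     for lam_j, mu_j, a_j, b_j in zip(lam, mu, a, b))
    return gamma - correction > 0


# Hypothetical data: f_n mixes rows 0 and 1 equally, row 2 is the candidate.
payoff_prev = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
g = gamma_n(payoff_prev, f_n=[0.5, 0.5, 0.0], k0=2)
print(g)
print(is_suboptimal_after(g, lam=[-0.25], mu=[0.5], a=[0.5], b=[0.5]))
print(is_suboptimal_after(g, lam=[-0.25, -0.5], mu=[0.5, 1.0],
                          a=[0.5, 0.5], b=[0.5, 0.5]))
```

In the second call the accumulated correction exceeds the margin, so the test is inconclusive and the action would have to be reconsidered.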

As we do not know $a_\ell$, $b_\ell$, $\lambda_\ell$ and $\mu_\ell$ in advance, there are two possible ways of using this test.

i) Eliminate action $k_0$ for as many stages as is possible at stage $n$. This means that $k_0$ is eliminated at stage $n$ until stage $n+m$, where $m$ is the largest integer (possibly $\infty$) for which

$$\gamma_n(x,k_0) - \sum_{\ell=n}^{n+m-1} \big( b_n^{\ell+1-n} \mu_n - a_n^{\ell+1-n} \lambda_n \big) > 0$$

(where we use $b_n^{\ell+1-n} \mu_n - a_n^{\ell+1-n} \lambda_n \ge b_\ell \mu_\ell - a_\ell \lambda_\ell$).

ii) Eliminate $k_0$ for one stage (if possible), after which you test whether it can be eliminated for another stage, in such a way that an action eliminated at stage $n$ will return at stage $n+m$, where $m$ is the first integer for which

$$\gamma_n(x,k_0) - \sum_{\ell=n}^{n+m-1} (\mu_\ell b_\ell - \lambda_\ell a_\ell) \le 0.$$
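The first strategy above can be sketched as a loop that accumulates the geometric bound until the margin is used up. This is an illustration under assumptions: the function name `elimination_horizon` is hypothetical, and the cap `max_m` is an artificial stand-in for the case in which the margin is never exhausted (the "possibly infinite" case in the text).

```python
def elimination_horizon(gamma, lam_n, mu_n, a_n, b_n, max_m=1000):
    """Largest m <= max_m such that
    gamma - sum_{j=1..m} (b_n**j * mu_n - a_n**j * lam_n) > 0.

    Only quantities known at stage n are used. Returning max_m means the
    margin was not exhausted within the cap (an assumption of this sketch).
    """
    slack = gamma
    m = 0
    for j in range(1, max_m + 1):
        slack -= b_n ** j * mu_n - a_n ** j * lam_n
        if slack <= 0:
            break
        m = j
    return m


# Hypothetical numbers: the per-stage terms are 0.375, 0.1875, 0.09375, ...
print(elimination_horizon(0.5, lam_n=-0.25, mu_n=0.5, a_n=0.5, b_n=0.5))
print(elimination_horizon(1.0, lam_n=-0.25, mu_n=0.5, a_n=0.5, b_n=0.5,
                          max_m=50))
```

Stopping at the first failure gives the largest admissible `m` because, with the sign conventions used for the two row-sum bounds, every subtracted term is nonnegative, so the remaining margin never increases.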

3. Some final remarks

i) If we apply the suboptimality test we get exactly the same successive approximations $v_n$ as in the algorithm without the test (cf. Karlin [4], pp. 38-39).

ii) In the preceding sections we only treated the suboptimality test for actions of $P_1$, but the case for $P_2$ is completely symmetric.

iii) The test can also be used in other successive approximation algorithms, for example Jacobi and Gauss-Seidel (in this case the definitions of $a_n$ and $b_n$ must be adapted, cf. [7]).

iv) If at stage $n_0$ action $\ell_0 \in L_x$ is eliminated for all future iterations, then in the definition of $\gamma_n(x,k_0)$, $n > n_0$, we can take the minimum over $\ell \neq \ell_0$ instead of $\ell \in L_x$.

v) In the test for suboptimality at stage $n$ the assumption

$$\sum_{y \in S} p(y \mid x,k,\ell) < 1$$

plays no role at all, so this test can also be used in the finite horizon and average reward cases.

4. References

[1] Hastings, N.A.J., A test for nonoptimal actions in undiscounted Markov decision chains, Management Science 23 (1976), 87-92.

[2] Hastings, N.A.J. and J.A.E.E. van Nunen, The action elimination algorithm for Markov decision processes, Markov Decision Theory, eds. H.C. Tijms, J. Wessels, Amsterdam, Mathematisch Centrum (Mathematical Centre Tract no. 93), 1977, 161-170.

[3] Hübner, G., Improved procedures for eliminating suboptimal actions in Markov programming by the use of contraction properties, to appear in: Transactions of the seventh Prague Conf. 1976.

[4] Karlin, S., Mathematical Methods and Theory in Games, Programming and Economics, Vol. 1, Addison-Wesley Publishing Company, Reading, Massachusetts-London, 1959.

[5] MacQueen, J., A test for suboptimal actions in Markov decision problems, Operations Research 15 (1967), 559-561.

[6] Shapley, L.S., Stochastic games, Proc. Nat. Acad. Sci. USA 39 (1953), 1095-1100.

[7] Van der Wal, J., Discounted Markov games: successive approximation and
