
EINDHOVEN UNIVERSITY OF TECHNOLOGY

Department of Mathematics

STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 1975-18

Convergence results related to the equalizing property in a Markov decision process

by

L.P.J. Groenewegen

Eindhoven, October 1975

The Netherlands

Convergence results related to the equalizing property in a Markov decision process

by

L.P.J. Groenewegen

Summary

In this paper we consider a Markov decision process with countable state and time spaces. Rewards have the so-called charge structure and the optimality criterion is the total expected reward. It is proved that, when an optimal decision rule is applied, the value of the state at time $t$ converges for $t \to \infty$ both to zero in $L^1$ norm and almost surely.

1. Introduction

Let $S$ be some countable set, the state space. Associated with each $i \in S$ is a non-empty set $P(i)$ of probability distributions on $S$. Elements of $P(i)$ may be interpreted as row vectors. A policy or decision is a choice of a matrix $P$, of which for each $i \in S$ the $i$-th row represents an element of $P(i)$. The set of all these matrices is denoted by $\mathcal{P}$. Associated with each $P \in \mathcal{P}$ is a real-valued function $r_P$, defined on $S$, representing the reward of decision $P$, such that $r_P(i)$ only depends on the $i$-th row of $P$.

In our set-up we restrict ourselves to memoryless decision rules, i.e. a decision rule or strategy $R$ can always be written in the form $P_0 P_1 \dots$ with $P_t \in \mathcal{P}$, $t \in T = \{0,1,\dots\}$, denoting the decision to be taken at time $t$. The set of all these strategies is denoted by $\mathcal{R}$.

$X_t$, $t \in T$, denotes the state at time $t$, and $\mathbb{P}_{i,R}$ and $E_{i,R}$ stand for the probability measure and the expectation operator when the starting state is $i$ and strategy $R$ is used. $E_R$ stands for the column vector with $i$-th component $E_{i,R}$, $i \in S$. We have the following assumption: the reward functions form a charge structure, i.e.

$$\forall_{i \in S}\ \forall_{R \in \mathcal{R},\, R = P_0 P_1 \dots}\ \Bigl[\, E_{i,R} \sum_{k=0}^{\infty} \bigl| r_{P_k}(X_k) \bigr| < \infty \,\Bigr].$$
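As a concrete illustration of the charge structure (this example is not in the memorandum; it is one way the discounted case mentioned in remark 2 at the end of the paper fits the assumption): suppose the rewards are uniformly bounded and discounting is embedded in the transition structure through an absorbing zero-reward state.

```latex
% Illustrative sketch, not from the original text.  Assume |r_P(i)| <= c for
% all P and i, and assume there is an absorbing state * with r_P(*) = 0 such
% that every row of every P in \mathcal{P} assigns probability at least
% 1 - \beta (0 < \beta < 1) to *.  Then, for every i and every R = P_0 P_1 ...,
\[
  E_{i,R}\sum_{k=0}^{\infty}\bigl|r_{P_k}(X_k)\bigr|
  \;\le\; c\sum_{k=0}^{\infty}\mathbb{P}_{i,R}(X_k \neq *)
  \;\le\; c\sum_{k=0}^{\infty}\beta^{k}
  \;=\;\frac{c}{1-\beta}\;<\;\infty ,
\]
% so the reward functions form a charge structure.
```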

A strategy is called optimal iff it maximizes the total expected reward for all starting states. In order to investigate such strategies we define the value of a strategy $R = P_0 P_1 \dots$ as a real-valued function $v_R$ on $S$:

$$v_R(\cdot) := E_{\cdot,R} \sum_{k=0}^{\infty} r_{P_k}(X_k),$$

and the value function $v$ as:

$$v(\cdot) := \sup_{R \in \mathcal{R}} v_R(\cdot).$$

Hence a strategy $R$ is called optimal iff $v_R = v$. We call a strategy $R$ optimal in $i \in S$ iff $v_R(i) = v(i)$.
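To make these definitions concrete, the following sketch (not part of the memorandum; the three-state example, the discount factor $\beta$ and all identifiers are invented for illustration) builds a finite instance of the model in which the charge structure holds because every decision routes probability $1-\beta$ to an absorbing zero-reward state, and computes $v_R$ for a stationary strategy $R = P\,P\,P\dots$ by truncating the series.

```python
import numpy as np

# Illustrative example only (not from the paper): states {0, 1, 2} plus an
# absorbing "cemetery" state 3 with zero reward, entered with probability
# 1 - beta from every state, so that E sum_k |r_{P_k}(X_k)| <= c / (1 - beta)
# and the charge structure holds.
beta = 0.9

def with_cemetery(P):
    """Scale a stochastic matrix by beta and send the remaining mass to state 3."""
    n = P.shape[0]
    Q = np.zeros((n + 1, n + 1))
    Q[:n, :n] = beta * P
    Q[:n, n] = 1.0 - beta
    Q[n, n] = 1.0
    return Q

# Two decisions (matrices) P_a, P_b in the sense of the paper: row i of each
# matrix is an element of P(i).  r_a, r_b are the reward functions r_P.
P_a = with_cemetery(np.array([[0.5, 0.5, 0.0],
                              [0.0, 0.5, 0.5],
                              [0.5, 0.0, 0.5]]))
P_b = with_cemetery(np.eye(3))
r_a = np.array([1.0, 0.0, 2.0, 0.0])
r_b = np.array([0.0, 1.5, 0.0, 0.0])

def value_of_strategy(decision_at, horizon=500):
    """v_R(i) = E_{i,R} sum_{k>=0} r_{P_k}(X_k), truncated after `horizon` terms.

    `decision_at(t)` returns the pair (P_t, r_{P_t}) of the memoryless
    strategy R = P_0 P_1 ...
    """
    v = np.zeros(4)
    dist = np.eye(4)              # row i = distribution of X_t when X_0 = i
    for t in range(horizon):
        P_t, r_t = decision_at(t)
        v += dist @ r_t           # adds E_{i,R} r_{P_t}(X_t) for every i
        dist = dist @ P_t
    return v

v_R = value_of_strategy(lambda t: (P_a, r_a))   # stationary strategy P_a P_a ...
print("v_R =", np.round(v_R[:3], 4))
```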

In the sequel we need the following concepts (cf. Dubins and Savage [3] and Hordijk [5]). A strategy $R = P_0 P_1 \dots$ is called $v$-conserving iff

$$\forall_{t \in T}\ \forall_{i,j \in S}\ \Bigl[\, \mathbb{P}_{i,R}(X_t = j) > 0 \implies v(j) = r_{P_t}(j) + (P_t v)(j) \,\wedge\, \bigl\{ (P_t v^+)(j) < \infty \,\vee\, (P_t v^-)(j) < \infty \bigr\} \,\Bigr].$$

Here the condition $(P_t v^+)(j) < \infty$ or $(P_t v^-)(j) < \infty$ assures the existence of $(P_t v)(j)$. A strategy $R$ is called equalizing iff

$$\forall_{i \in S}\ \Bigl[\, \lim_{t \to \infty} E_{i,R}\, v(X_t) = 0 \,\Bigr].$$

In the next section it is proved that optimal strategies are characterized by these two properties, and that when an optimal strategy $R$ is applied, the value $v(X_t)$ of the state at time $t$ converges for $t \to \infty$ to zero in $L^1$ norm and $\mathbb{P}_{i,R}$-almost surely for all $i \in S$.
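The two properties can be checked numerically in the finite discounted example sketched in section 1 (again purely illustrative and not from the memorandum): value iteration over the two available decisions yields $v$, the greedy stationary strategy is $v$-conserving by construction, and it is equalizing because all probability mass drifts to the absorbing zero-reward state, where $v$ vanishes.

```python
import numpy as np
# Illustrative check of the v-conserving and equalizing properties; it reuses
# P_a, P_b, r_a, r_b from the previous sketch (all invented for illustration).

decisions = [(P_a, r_a), (P_b, r_b)]

# Value iteration for v(i) = sup_R v_R(i); in this finite discounted example
# the supremum is attained by choosing, in each state, the better of the two rows.
v = np.zeros(4)
for _ in range(2000):
    v = np.max([r + P @ v for (P, r) in decisions], axis=0)

best = np.argmax([r + P @ v for (P, r) in decisions], axis=0)
P_opt = np.array([decisions[best[i]][0][i] for i in range(4)])  # optimal matrix, one row per state
r_opt = np.array([decisions[best[i]][1][i] for i in range(4)])  # its reward function

# v-conserving: v(j) = r_{P_t}(j) + (P_t v)(j) in every (here: every reachable) state.
print("conserving gap:", np.max(np.abs(v - (r_opt + P_opt @ v))))

# equalizing: E_{i,R} v(X_t) -> 0 for t -> infinity.
dist = np.eye(4)
for _ in range(200):
    dist = dist @ P_opt
print("E_i v(X_200)  :", np.round(dist @ v, 6))
```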

2. Convergence of $v(X_t)$

Lemma 1 is used in the proofs of theorems 1 and 2.

Lemma 1. Suppose $R \in \mathcal{R}$ with $R = P_0 P_1 \dots$ is optimal, and let $t \in T$. Then $P_t P_{t+1} \dots$ is optimal in $j \in S$ if there exists an $i \in S$ such that $\mathbb{P}_{i,R}(X_t = j) > 0$.

Proof. Define $R^* := P_0 P_1 \dots P_{t-1} P_0 P_1 \dots$, i.e. $R^*$ follows $R$ up to time $t-1$ and then starts $R$ anew at time $t$. Since $R^* \in \mathcal{R}$ and $R$ is optimal,

$$v_{R^*}(i) = E_{i,R^*} \sum_{k=0}^{t-1} r_{P_k}(X_k) + \sum_{\ell \in S} \mathbb{P}_{i,R^*}(X_t = \ell)\, E_{\ell,R} \sum_{k=0}^{\infty} r_{P_k}(X_k) \le$$
$$\le v(i) = E_{i,R} \sum_{k=0}^{t-1} r_{P_k}(X_k) + \sum_{\ell \in S} \mathbb{P}_{i,R}(X_t = \ell)\, E_{\ell,P_t P_{t+1}\dots} \sum_{k=0}^{\infty} r_{P_{t+k}}(X_k).$$

(It should be noted that, for instance, in the left-hand side of this formula two different processes $X_j$, $j \in T$, occur: the one determined by the starting state $i$ and the strategy $R^*$, the other by $\ell$ and $R$.) From the definition of $R^*$ it follows that the first term of the left-hand side equals the first term of the right-hand side, and also that

$$\mathbb{P}_{i,R^*}(X_t = \ell) = \mathbb{P}_{i,R}(X_t = \ell) \quad \text{for all } \ell \in S.$$

Also we have that

$$E_{\ell,R} \sum_{k=0}^{\infty} r_{P_k}(X_k) = v(\ell) \ge E_{\ell,P_t P_{t+1}\dots} \sum_{k=0}^{\infty} r_{P_{t+k}}(X_k).$$

This implies that

$$\forall_{\ell \in S}\ \Bigl[\, \mathbb{P}_{i,R}(X_t = \ell) > 0 \implies v(\ell) = E_{\ell,R} \sum_{k=0}^{\infty} r_{P_k}(X_k) = E_{\ell,P_t P_{t+1}\dots} \sum_{k=0}^{\infty} r_{P_{t+k}}(X_k) \,\Bigr]. \qquad \Box$$

Theorem 1 gives a characterization of optimality, as is done elsewhere for other models: by Dubins and Savage [3] and Sudderth [8] for gambling houses, and by Hordijk [5] and Groenewegen [4] for a model which in fact is the model described here with some additional conditions. In the proof of theorem 2 only the necessity of this condition for optimality is used.

Theorem 1. A necessary and sufficient condition for the optimality of a strategy $R$ is that $R$ is $v$-conserving and equalizing.

Proof. 1. "$\Rightarrow$": Suppose $R = P_0 P_1 \dots$ is optimal. Then by the charge structure of the rewards we have that

$$(P_t v)(j) = \sum_{\ell \in S} \mathbb{P}_{j,P_t}(X_1 = \ell)\, E_{\ell,R} \sum_{k=0}^{\infty} r_{P_k}(X_k) = E_{j,P_t P_0 P_1 \dots} \sum_{k=1}^{\infty} r_{P_{k-1}}(X_k) \le E_{j,P_t P_0 P_1 \dots} \sum_{k=1}^{\infty} \bigl| r_{P_{k-1}}(X_k) \bigr| < \infty.$$


Hence it is to be proved that

(i) $\forall_{t \in T}\ \forall_{i,j \in S}\ \bigl[\, \mathbb{P}_{i,R}(X_t = j) > 0 \implies v(j) = r_{P_t}(j) + (P_t v)(j) \,\bigr]$,

(ii) $\forall_{i \in S}\ \bigl[\, \lim_{t \to \infty} E_{i,R}\, v(X_t) = 0 \,\bigr]$.

First (i). Suppose $\mathbb{P}_{i,R}(X_t = j) > 0$ for some choice of $i, j \in S$ and $t \in T$. Hence by lemma 1 $P_t P_{t+1} \dots$ is optimal in $j$. So

$$v(j) = E_{j,P_t P_{t+1}\dots} \sum_{k=0}^{\infty} r_{P_{t+k}}(X_k) = r_{P_t}(j) + \sum_{\ell \in S} \mathbb{P}_{j,P_t}(X_1 = \ell)\, E_{\ell,P_{t+1} P_{t+2}\dots} \sum_{k=0}^{\infty} r_{P_{t+1+k}}(X_k) =$$

(because $P_{t+1} P_{t+2} \dots$ is optimal in those $\ell \in S$ with $\mathbb{P}_{j,P_t P_{t+1}\dots}(X_1 = \ell) > 0$)

$$= r_{P_t}(j) + (P_t v)(j).$$

Now (ii). In fact corollary 1 will be proved, which is more than is needed for this theorem:

$$E_{i,R} \bigl| v(X_t) \bigr| = \sum_{j \in S} \mathbb{P}_{i,R}(X_t = j)\, \bigl| v(j) \bigr| =$$

(again using lemma 1: $P_t P_{t+1} \dots$ is optimal in those $j \in S$ with $\mathbb{P}_{i,R}(X_t = j) > 0$)

$$= \sum_{j \in S} \mathbb{P}_{i,R}(X_t = j)\, \Bigl| E_{j,P_t P_{t+1}\dots} \sum_{k=0}^{\infty} r_{P_{t+k}}(X_k) \Bigr| \le \sum_{j \in S} \mathbb{P}_{i,R}(X_t = j)\, E_{j,P_t P_{t+1}\dots} \sum_{k=0}^{\infty} \bigl| r_{P_{t+k}}(X_k) \bigr| = E_{i,R} \sum_{k=t}^{\infty} \bigl| r_{P_k}(X_k) \bigr|.$$

Taking the limits for $t \to \infty$ at both sides and using the charge structure of the rewards one obtains

$$\lim_{t \to \infty} E_{i,R} \sum_{k=t}^{\infty} \bigl| r_{P_k}(X_k) \bigr| = 0.$$


Hence

$$\lim_{t \to \infty} E_{i,R}\, v(X_t) = \lim_{t \to \infty} E_{i,R} \bigl| v(X_t) \bigr| = 0.$$

So one side has been proved, and also corollary 1.

2. "..": Suppose R = POP

t••• is v-conserving and equalizing. This part of the

proof has been inspired by the proof in Hordijk

[Sl,

theorem

4.6.

The charge structure of the rewards allows the following reasoning:

$$\text{(a)} \quad E_{i,R} \sum_{k=0}^{n} r_{P_k}(X_k) = \sum_{k=0}^{n} \sum_{j \in S} \mathbb{P}_{i,R}(X_k = j)\, r_{P_k}(j) =$$

(because $R$ is $v$-conserving)

$$= \sum_{k=0}^{n} \sum_{j \in S} \mathbb{P}_{i,R}(X_k = j)\, \bigl\{ v(j) - (P_k v)(j) \bigr\} = \sum_{k=0}^{n} \sum_{j \in S} \mathbb{P}_{i,R}(X_k = j)\, \bigl\{ v(j) - E_{j,P_k P_{k+1}\dots}\, v(X_1) \bigr\} =$$
$$= \sum_{k=0}^{n} E_{i,R} \bigl\{ v(X_k) - v(X_{k+1}) \bigr\} = v(i) - E_{i,R}\, v(X_{n+1}).$$

Hence it follows from (a) that

$$E_{i,R} \sum_{k=0}^{\infty} r_{P_k}(X_k) = v(i) - \lim_{n \to \infty} E_{i,R}\, v(X_n) = v(i),$$

since $R$ is equalizing. So $R$ is optimal. $\Box$

As already noted, we also have proved for an optimal strategy $R$ the convergence of $v(X_t)$ to zero in $L^1$ norm if $t \to \infty$, for each choice $X_0 = i \in S$. So


Corollary 1. Suppose $R \in \mathcal{R}$ with $R = P_0 P_1 \dots$ is optimal, then

$$\lim_{t \to \infty} E_{i,R} \bigl| v(X_t) \bigr| = 0 \quad \text{for all } i \in S.$$

Now we come to the main result of this paper. As in the case of the $L^1$ convergence, this result is also obtained by using the charge structure of the rewards.

Theorem 2. Suppose $R \in \mathcal{R}$ with $R = P_0 P_1 \dots$ is optimal, then

$$v(X_t) \to 0 \quad \mathbb{P}_{i,R}\text{-a.s. if } t \to \infty, \text{ for all } i \in S.$$

In the proof the following notations will be used: $\mu(j)$ denoting the probability measure on $S$ which concentrates probability $1$ in $j \in S$; if $x \in S^{\infty}$: $x_{\ell}$ denoting the $\ell$-th coordinate of $x$.

Proof. Choose $i_0 \in S$. Suppose $\neg\bigl( v(X_t) \to 0\ \mathbb{P}_{i_0,R}\text{-a.s. if } t \to \infty \bigr)$. This implies the existence of a subset $A \subset S^{\infty}$ with $\mathbb{P}_{i_0,R}(A) > 0$ such that

$$\limsup_{t \to \infty} \bigl| v(X_t) \bigr| > 0 \quad \text{on } A.$$

Hence there exist a number $\varepsilon_0 > 0$, a subset $B \subset A$ with $\mathbb{P}_{i_0,R}(B) > 0$, and for every $i \in B$ a strictly increasing sequence $\{t_j(i)\}_{j=1}^{\infty}$ such that $\bigl| v(i_{t_j}) \bigr| > \varepsilon_0$ for all $j$. This implies

$$\text{(a)} \quad \forall_{i \in B}\ \forall_{j}\ \Bigl[\, \mathbb{P}_{i_0,R}(X_n = i_n \text{ for } n = 0,1,\dots,t_j) > 0 \,\wedge\, \bigl| v(i_{t_j}) \bigr| > \varepsilon_0 \,\Bigr].$$

Now fix such an $\varepsilon_0$, such a $B$, and for every $i \in B$ such a strictly increasing sequence $\{t_j(i)\}_{j=1}^{\infty}$. From now on $i_{t_j}$ will be used for the $t_j$-th coordinate of $i = (i_0, i_1, \dots) \in B$, with $t_j$ the $j$-th number of $\{t_j(i)\}_{j=1}^{\infty}$. (Note that $t_j$ is not necessarily the same number for all elements of $B$.) Since $R$ is $v$-conserving and $\mathbb{P}_{i_0,R}(X_n = i_n \text{ for } n = 0,1,\dots,t_j) > 0$, we have

$$\text{(b)} \quad \forall_{n \in T}\ \Bigl[\, v(i_{t_j}) = \sum_{k=0}^{n} \mu(i_{t_j})\, P_{t_j} \cdots P_{t_j+k-1}\, r_{P_{t_j+k}} + \mu(i_{t_j})\, P_{t_j} \cdots P_{t_j+n}\, v \,\Bigr].$$

Note that $P_{t_j} P_{t_j+1} \cdots P_{t_j+k-1}$ is to be read as $I$ for $k = 0$. Since $\mathbb{P}_{i_0,R}(X_{t_j} = i_{t_j}) > 0$ we have that $P_{t_j} P_{t_j+1} \dots$ is optimal in $i_{t_j}$ by lemma 1. Then by corollary 1 we have

$$\text{(c)} \quad \lim_{n \to \infty} E_{i_{t_j},\,P_{t_j} P_{t_j+1}\dots} \bigl| v(X_n) \bigr| = 0.$$

Hence

$$\forall_{i \in B}\ \forall_{j}\ \Bigl[\, \lim_{n \to \infty} \mu(i_{t_j})\, P_{t_j} \cdots P_{t_j+n}\, |v| = 0 \,\Bigr],$$

and therefore, by combining (a), (b) and (c),

$$\text{(d)} \quad \forall_{i \in B}\ \forall_{j}\ \Bigl[\, \mu(i_{t_j}) \sum_{k=0}^{\infty} P_{t_j} \cdots P_{t_j+k-1}\, \bigl| r_{P_{t_j+k}} \bigr| \ge \bigl| v(i_{t_j}) \bigr| > \varepsilon_0 \,\Bigr].$$

Now choose an arbitrary number $M \in \mathbb{N}$ and choose for every $x \in B$ a number $n_M(x)$ from the sequence $\{t_j(x)\}_{j=1}^{\infty}$ such that $n_M(x) > M$. Define

$$B_M := \bigl\{\, x \in S^{\infty} : \text{there exists } y \in B \text{ with } x_k = y_k \text{ for } k = 0,1,\dots,n_M(y) \,\bigr\}.$$

Note that $B_M \supset B$, and also

$$\forall_{x \in B_M}\ \exists_{m > M}\ \exists_{y \in B}\ \bigl[\, m = n_M(y) \,\wedge\, x_k = y_k \text{ for } k = 0,1,\dots,m \,\bigr].$$

Now fix for each $x \in B_M$ such a number $m$ and denote it by $m_M(x)$. Then

$$E_{i_0,R} \sum_{k=M+1}^{\infty} \bigl| r_{P_k}(X_k) \bigr| \ge \mathbb{P}_{i_0,R}(B)\, E_{i_0,R}\Bigl[\, \sum_{k=m_M(X)}^{\infty} \bigl| r_{P_k}(X_k) \bigr| \,\Big|\, X \in B_M \Bigr] =$$
$$= \mathbb{P}_{i_0,R}(B) \sum_{n=M+1}^{\infty} \sum_{j \in S} \mathbb{P}_{i_0,R}\bigl( m_M(X) = n \wedge X_n = j \mid X \in B_M \bigr)\, \mu(j) \sum_{k=0}^{\infty} P_n \cdots P_{n+k-1}\, \bigl| r_{P_{n+k}} \bigr| \ge$$

(using (d))

$$\text{(e)} \quad \ge \mathbb{P}_{i_0,R}(B)\, \varepsilon_0.$$

However, by the charge structure of the rewards we have

$$\text{(f)} \quad \forall_{\varepsilon > 0}\ \exists_{N \in \mathbb{N}}\ \Bigl[\, E_{i_0,R} \sum_{k=N}^{\infty} \bigl| r_{P_k}(X_k) \bigr| < \varepsilon \,\Bigr].$$

Choose $\varepsilon = \mathbb{P}_{i_0,R}(B)\, \varepsilon_0$, fix such an $N$, and choose $M = N$; then we have constructed a contradiction between (e) and (f). Hence

$$v(X_t) \to 0 \quad \mathbb{P}_{i_0,R}\text{-a.s. for } t \to \infty,$$

for all $i_0 \in S$. $\Box$

Remarks.

1) The model described here is adopted and adapted from a model which is introduced in Hordijk's thesis [5], and which is called convergent dynamic programming in a forthcoming paper (Hordijk [6]). However, for our completely independently derived results we did not need Hordijk's additional assumption for the value function:

$$\forall_{n \in \mathbb{N}}\ \forall_{i \in S}\ \forall_{R \in \mathcal{R}}\ \Bigl[\, E_{i,R} \bigl| v(X_n) \bigr| < \infty \,\Bigr].$$

2) It is clear that the following cases, which are well known from the literature, are contained in this model: discounted dynamic programming with bounded rewards (Blackwell [1]), positive dynamic programming with finite optimal return (Blackwell [2]), and quite a lot of the negative dynamic programming situation (Strauch [7]).
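For the discounted case the two convergence results can be observed directly by simulation. The sketch below (illustrative only; it reuses the invented P_opt and v from the earlier sketches) samples trajectories under the stationary optimal strategy and reports both $E_{i,R}|v(X_t)|$ (corollary 1) and the fraction of paths on which $|v(X_t)|$ is still large (theorem 2).

```python
import numpy as np
rng = np.random.default_rng(0)
# Illustrative only: reuses P_opt and v computed in the earlier sketch.

def simulate(i0, T=60):
    """One trajectory X_0, ..., X_T under the stationary strategy P_opt P_opt ..."""
    path = [i0]
    for _ in range(T):
        path.append(rng.choice(4, p=P_opt[path[-1]]))
    return path

paths = [simulate(0) for _ in range(2000)]
for t in (0, 10, 20, 40, 60):
    vt = np.array([v[p[t]] for p in paths])
    print(f"t = {t:2d}   E|v(X_t)| ~ {np.abs(vt).mean():.4f}"
          f"   share of paths with |v(X_t)| > 0.01: {(np.abs(vt) > 0.01).mean():.3f}")
```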

References

[1] Blackwell, D. (1965). Discounted dynamic programming. Ann. Math. Statist. 36, 226-235.

[2] Blackwell, D. (1971). On stationary policies. J. Roy. Statist. Soc. 133, 33-37.

[3] Dubins, L.E. and Savage, L.J. (1965). How to gamble if you must: inequalities for stochastic processes. McGraw-Hill, New York.

[4] Groenewegen, L.P.J. (1975). On stopping a Markov decision process. Memorandum COSOR 75-01, Eindhoven University of Technology, Dept. of Mathematics, Eindhoven.

[5] Hordijk, A. (1974). Dynamic programming and Markov potential theory. Mathematical Centre Tract 51, Mathematisch Centrum, Amsterdam.

[6] Hordijk, A. (1974). Convergent dynamic programming. Techn. Report 28, Dept. of Operations Research, Stanford University, Stanford.

[7] Strauch, R. (1966). Negative dynamic programming. Ann. Math. Statist. 37, 871-889.

[8] Sudderth, W.D. (1972). On the Dubins and Savage characterization of optimal strategies. Ann. Math. Statist. 43, 498-507.
