EINDHOVEN UNIVERSITY OF TECHNOLOGY
Department of Mathematics
STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 75-20

Markov strategies in Dynamic Programming

by

K.M. van Hee

Eindhoven, October 1975
Markov strategies in Dynamic Programming
by
K.M. van Hee
0. Summary
It will be proven that the supremum of the expected total return over the Markov strategies equals the supremum over all strategies. The model assumptions are: the state space is countable, the action space is measurable and the supremum of the expected total of the positive rewards over the Markov strategies is finite.
1. Introduction
In this section we develop the usual framework of dynamic programming (see for example [Strauch]).
Consider a countable set $S$, the state space, an arbitrary set $A$, called the action space, and a $\sigma$-field $\mathcal{A}$ on $A$. The function $P : S \times S \times A \to [0,1]$ is a transition function, i.e.

1) $P(j \mid i,a)$ is measurable on $(A,\mathcal{A})$ for each $j,i \in S$,

2) $P(j \mid i,a) \geq 0$ and $\sum_{j \in S} P(j \mid i,a) \leq 1$ for all $i,j \in S$, $a \in A$.

We call
$$H_n := (S \times A)^n \times S$$
the set of possible histories until time $n$. On $H_n$ we have the product $\sigma$-field induced by $\mathcal{A}$ and the power set of $S$.
A strategy $\pi$ is a sequence $(\pi_0,\pi_1,\pi_2,\ldots)$ where $\pi_n$ is a transition probability on $(A \times H_n)$ such that $\pi_n(\cdot \mid h_n)$ is a probability on $(A,\mathcal{A})$ for $h_n \in H_n$, and $\pi_n(A_0 \mid \cdot)$ is measurable for $A_0 \in \mathcal{A}$. The set of all strategies is denoted by $\Pi$. The subset $RM$ of $\Pi$ is the class of randomized Markov strategies, which contains exactly the strategies $\pi \in \Pi$ such that $\pi_n(\cdot \mid h_{n-1},a_{n-1},s_n)$ does not depend on $h_{n-1}$ and $a_{n-1}$, for all $h_{n-1} \in H_{n-1}$, $a_{n-1} \in A$ and $s_n \in S$. Further we define the subset $M$ of $RM$ by: $\pi \in M$ if and only if $\pi_n(\cdot \mid s)$ is a degenerate probability.
$M$ is called the class of Markov strategies. We may identify a strategy $\pi \in M$ with a sequence of functions $\{f_n,\ n = 0,1,2,\ldots\}$, where $f_n : S \to A$ is such that $\pi_n(f_n(s) \mid s) = 1$. We call the functions $f_n$ Markov policies. We introduce a new state $p$ and we define $S^* := S \cup \{p\}$, $P(p \mid p,a) := 1$ for all $a \in A$, and $P(p \mid i,a) := 1 - \sum_{j \in S} P(j \mid i,a)$ for all $(i,a) \in S \times A$. Any strategy $\pi \in \Pi$, any $i \in S$ and $P$ define a probability $\mathbb{P}_{i,\pi}$ on $(S^* \times A)^{\infty}$, by the Ionescu Tulcea theorem, and a stochastic process $\{(X_n,Y_n),\ n = 0,1,2,\ldots\}$ where $X_n$ and $Y_n$ are the projections from $(S^* \times A)^{\infty}$ to the $n$-th state space and $n$-th action space respectively. This means that $X_n$ is the state and $Y_n$ the action at time $n$.
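As an illustration of this construction (not part of the formal development), the following minimal Python sketch simulates the process $\{(X_n,Y_n)\}$ under a randomized Markov strategy for a small finite state set. All states, actions, probabilities and the strategy below are hypothetical; the substochastic rows of $P$ are completed by absorption in the extra state $p$ exactly as described above.

```python
import random

# Illustrative (assumed) data: states, actions and a substochastic
# transition function P[(i, a)][j]; row sums may be < 1, the deficit
# is the probability of absorption in the extra state p.
S = [0, 1]
A = ["stay", "move"]
P = {
    (0, "stay"): {0: 0.9},            # deficit 0.1 -> absorption in p
    (0, "move"): {1: 1.0},
    (1, "stay"): {1: 0.8},            # deficit 0.2 -> absorption in p
    (1, "move"): {0: 0.5, 1: 0.4},    # deficit 0.1 -> absorption in p
}

def simulate(pi, i, horizon):
    """Simulate (X_0,Y_0),(X_1,Y_1),... under a randomized Markov
    strategy pi, where pi[n][s] is a dict action -> probability."""
    path, state = [], i
    for n in range(horizon):
        if state == "p":                      # absorbed: nothing more happens
            break
        probs = pi[n][state]
        action = random.choices(list(probs), weights=probs.values())[0]
        path.append((state, action))
        row = P[(state, action)]
        # complete the substochastic row with the absorbing state p
        targets = list(row) + ["p"]
        weights = list(row.values()) + [1.0 - sum(row.values())]
        state = random.choices(targets, weights=weights)[0]
    return path

# A stationary randomized Markov strategy (the same rule at every time n).
pi = [{0: {"stay": 0.5, "move": 0.5}, 1: {"move": 1.0}} for _ in range(20)]
print(simulate(pi, 0, 20))
```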
Functions on $S \times A$ are extended to $S^* \times A$ by defining them $0$ in $(p,a)$. Generally speaking we say a measurable function $f$ on a measure space $(\Omega,\mathcal{F},\mu)$ is integrable if at least one of the terms $\int f^{+}\,d\mu$ and $\int f^{-}\,d\mu$ is finite ($f^{+} := \max\{0,f\}$, $f^{-} := -\min\{0,f\}$).

If $u(X_n,Y_n)$ is integrable with respect to $\mathbb{P}_{i,\pi}$, $\pi \in RM$, we may evaluate the integral in operator notation:
$$\pi_0 P \pi_1 P \cdots P \pi_n u(i) := E_{i,\pi}[u(X_n,Y_n)] = \int \pi_0(da_0 \mid i) \sum_{i_1 \in S^*} P(i_1 \mid i,a_0) \int \pi_1(da_1 \mid i_1) \cdots \sum_{i_n \in S^*} P(i_n \mid i_{n-1},a_{n-1}) \int \pi_n(da \mid i_n)\, u(i_n,a) .$$
If $u(X_n)$ is $\mathbb{P}_{i,\pi}$-integrable we have in the same way:
$$E_{i,\pi}[u(X_n)] = \pi_0 P \cdots \pi_{n-1} P u(i) .$$
For a Markov strategy $f = (f_0,f_1,f_2,\ldots)$ we write $f_0 P f_1 \cdots P f_n u(i)$ and $E_{i,f}$ accordingly, where $f_n u(i) := u(i,f_n(i))$.
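For a finite state set these operator products are just alternating action-substitution and transition steps. The sketch below is illustrative only: the finite $S$, the transition numbers, the policies and the function $u$ are assumptions, not part of the model above. It evaluates $f_0 P f_1 \cdots f_{N-1} P u(i)$ for a Markov strategy by applying the operators from the right.

```python
# Sketch (assumed finite S, hypothetical numbers) of the operator notation
# E_{i,f}[u(X_N)] = f_0 P f_1 P ... f_{N-1} P u (i) for a Markov strategy
# f = (f_0, f_1, ...).  P is substochastic; the missing mass goes to the
# absorbing state p, where every function is 0 by convention.
S = [0, 1]
P = {
    (0, "stay"): {0: 0.9},
    (0, "move"): {1: 1.0},
    (1, "stay"): {1: 0.8},
    (1, "move"): {0: 0.5, 1: 0.4},
}

def apply_fP(f_k, w):
    """Return the function i -> sum_j P(j | i, f_k(i)) * w(j)."""
    return {i: sum(q * w[j] for j, q in P[(i, f_k[i])].items()) for i in S}

def expected_u_at_N(f, i, u):
    """Compute f_0 P f_1 P ... f_{N-1} P u (i), where N = len(f)."""
    w = dict(u)                     # u, extended by u(p) = 0 implicitly
    for f_k in reversed(f):         # apply the operators from the right
        w = apply_fP(f_k, w)
    return w[i]

f = [{0: "move", 1: "stay"}, {0: "stay", 1: "move"}]   # f_0, f_1 (hypothetical)
u = {0: 1.0, 1: 2.0}
print(expected_u_at_N(f, 0, u))     # E_{0,f}[u(X_2)]
```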
A reward function $r$ is a measurable function on $S \times A$. The total return is defined by $\sum_{n=0}^{\infty} r(X_n,Y_n)$ if this sum exists. We are concerned with the expected total return with respect to $\mathbb{P}_{i,\pi}$. To define this we need the following lemma.
Lemma 1. Suppose $E_{i,\pi}\bigl[\sum_{n=0}^{\infty} r^{+}(X_n,Y_n)\bigr] < \infty$, $\pi \in \Pi$, $i \in S$. It holds that

a) $\sum_{n=0}^{\infty} r(X_n,Y_n)$ exists $\mathbb{P}_{i,\pi}$-a.s., and is $\mathbb{P}_{i,\pi}$-integrable;

b) $E_{i,\pi}\bigl[\sum_{n=0}^{\infty} r(X_n,Y_n)\bigr] = \sum_{n=0}^{\infty} E_{i,\pi}[r(X_n,Y_n)]$.
Proof. Since $\sum_{n=0}^{\infty} r^{+}(X_n,Y_n) < \infty$, $\mathbb{P}_{i,\pi}$-a.s., the existence of $\sum_{n=0}^{\infty} r(X_n,Y_n)$ is obvious. And since $\bigl(\sum_{n=0}^{\infty} r(X_n,Y_n)\bigr)^{+} \leq \sum_{n=0}^{\infty} r^{+}(X_n,Y_n)$ we have the integrability with respect to $\mathbb{P}_{i,\pi}$. By the monotone convergence theorem we have
$$\lim_{N \to \infty}\Bigl\{\sum_{n=0}^{N} E_{i,\pi}[r^{+}(X_n,Y_n)] - \sum_{n=0}^{N} E_{i,\pi}[r(X_n,Y_n)]\Bigr\} = E_{i,\pi}\Bigl[\sum_{n=0}^{\infty} r^{+}(X_n,Y_n)\Bigr] - E_{i,\pi}\Bigl[\sum_{n=0}^{\infty} r(X_n,Y_n)\Bigr] .$$
Since
$$\sum_{n=0}^{\infty} E_{i,\pi}[r^{+}(X_n,Y_n)] = E_{i,\pi}\Bigl[\sum_{n=0}^{\infty} r^{+}(X_n,Y_n)\Bigr] < \infty$$
we may subtract this quantity on both sides of the equality, which produces the statement. □
In the next section we need the following elementary selection theorem.

Lemma 2. Let $\pi(\cdot \mid i)$ be a probability on $(A,\mathcal{A})$ for each $i \in S$, and let $u$ be a $\pi$-integrable function such that
$$x(i) := \int \pi(da \mid i)\, u(i,a) < \infty \quad \text{for all } i \in S ;$$
then there is a Markov policy $f$ such that $x(i) \leq u(i,f(i))$, for all $i \in S$.

The proof is trivial.
In the rest of this paper we use the following assumption:

(A) $\displaystyle \sup_{f \in M} E_{i,f}\Bigl[\sum_{n=0}^{\infty} r^{+}(X_n,Y_n)\Bigr] < \infty$ for all $i \in S$.
2. The theorem
In this section we prove, among other things, the existence of the expected total return of a strategy $\pi \in \Pi$. Anticipating this result we shall use the following notations:
$$v_N(i,\pi) := E_{i,\pi}\Bigl[\sum_{n=0}^{N-1} r(X_n,Y_n)\Bigr], \quad N = 1,2,\ldots,\infty ,$$
$$v(i,\pi) := v_{\infty}(i,\pi) ,$$
$$w_N(i,\pi) := E_{i,\pi}\Bigl[\sum_{n=0}^{N-1} r^{+}(X_n,Y_n)\Bigr], \quad N = 1,2,\ldots,\infty ,$$
$$w(i,\pi) := w_{\infty}(i,\pi) .$$
The main result of this section is:
$$\sup_{\pi \in \Pi} v(i,\pi) = \sup_{f \in M} v(i,f) \quad \text{for all } i \in S .$$
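As a small illustration of these notations, the sketch below computes $v_N(i,f)$ and $w_N(i,f)$ for a Markov strategy on an assumed finite toy model (all states, actions, probabilities and rewards are hypothetical) by propagating the law of $X_n$ forward; mass absorbed in $p$ contributes nothing, in accordance with the convention that functions vanish in $(p,a)$.

```python
# Sketch (assumed finite S, hypothetical numbers): compute
# v_N(i,f) = E_{i,f}[sum_{n<N} r(X_n,Y_n)] and w_N(i,f) (same with r^+)
# for a Markov strategy f by propagating the substochastic state
# distribution forward; mass lost to the absorbing state p earns 0.
S = [0, 1]
P = {
    (0, "stay"): {0: 0.9},
    (0, "move"): {1: 1.0},
    (1, "stay"): {1: 0.8},
    (1, "move"): {0: 0.5, 1: 0.4},
}
r = {(0, "stay"): 1.0, (0, "move"): -2.0, (1, "stay"): 0.5, (1, "move"): 3.0}

def partial_returns(f, i, N):
    """Return (v_N(i,f), w_N(i,f)) for the Markov strategy f = (f_0, f_1, ...)."""
    dist = {j: (1.0 if j == i else 0.0) for j in S}   # law of X_n restricted to S
    v = w = 0.0
    for n in range(N):
        a = {j: f[n][j] for j in S}
        v += sum(dist[j] * r[(j, a[j])] for j in S)
        w += sum(dist[j] * max(r[(j, a[j])], 0.0) for j in S)
        new = {j: 0.0 for j in S}
        for j in S:
            for k, q in P[(j, a[j])].items():
                new[k] += dist[j] * q
        dist = new
    return v, w

f = [{0: "stay", 1: "move"}] * 5          # a stationary Markov strategy
print(partial_returns(f, 0, 5))
```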
Lemma 3. Suppose $E_{i,\pi}[r^{+}(X_n,Y_n)] < \infty$ for all $n$, all $\pi \in RM$ and all $i \in S$. Let $u$ be a nonnegative function on $S$ and $N$ an arbitrary nonnegative integer. For all $\pi \in RM$ there exists an $f \in M$ such that
$$\sum_{n=0}^{N-1} f_0 P f_1 \cdots P f_n r + f_0 P f_1 \cdots P f_{N-1} P(-u) \;\geq\; \sum_{n=0}^{N-1} \pi_0 P \pi_1 \cdots P \pi_n r + \pi_0 P \pi_1 \cdots P \pi_{N-1} P(-u) .$$

Proof. Define
$$z := \sum_{n=0}^{N-1} \pi_0 P \pi_1 \cdots P \pi_n r + \pi_0 P \pi_1 \cdots P \pi_{N-1} P(-u) .$$
Since $\pi_{N-1}\{r + P(-u)\} < \infty$ we may apply lemma 2 and we find a Markov policy $f_{N-1}$ such that
$$u_{N-1} := f_{N-1}\{r + P(-u)\} \;\geq\; \pi_{N-1}\{r + P(-u)\} .$$
Note that $u_{N-1}$ is integrable with respect to $\pi_0 P \pi_1 \cdots \pi_{N-2} P$ since $u_{N-1} \leq f_{N-1} r^{+}$ and $\pi_0 P \pi_1 \cdots \pi_{N-2} P f_{N-1} r^{+} < \infty$.
With backward induction we shall construct Markov policies $f_{N-2}, f_{N-3}, \ldots, f_0$. Suppose we found $f_{N-2}, f_{N-3}, \ldots, f_k$ such that for
$$u_k := \sum_{n=k}^{N-1} f_k P f_{k+1} \cdots P f_n r + f_k P f_{k+1} \cdots f_{N-1} P(-u)$$
it holds that
$$\sum_{n=0}^{k-1} \pi_0 P \pi_1 \cdots P \pi_n r + \pi_0 P \pi_1 \cdots P \pi_{k-1} P u_k \;\geq\; z .$$
Note that $\pi_{k-1} r + \pi_{k-1} P u_k < \infty$ since $u_k \leq \sum_{n=k}^{N-1} f_k P f_{k+1} \cdots P f_n r^{+}$. Apply lemma 2 again to find a Markov policy $f_{k-1}$ such that
$$u_{k-1} := f_{k-1} r + f_{k-1} P u_k \;\geq\; \pi_{k-1} r + \pi_{k-1} P u_k .$$
By the monotonicity of the operators $\pi_n P$ we have
$$\sum_{n=0}^{k-2} \pi_0 P \pi_1 \cdots P \pi_n r + \pi_0 P \pi_1 \cdots P \pi_{k-2} P u_{k-1} \;\geq\; z .$$
Hence the Markov strategy $f$ defined by $(f_0, f_1, \ldots, f_{N-1})$ for the first $N$ time periods has the desired property. □
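For a finite action space the selection in lemma 2 can be taken to be a maximization, and the backward construction above then becomes an ordinary dynamic programming recursion. The following sketch is a minimal illustration under that assumption (finite $S$ and $A$, hypothetical numbers); it builds policies $f_{N-1},\ldots,f_0$ and returns the function $u_0$, which then dominates the corresponding expression $z$ for every randomized Markov strategy.

```python
# Minimal sketch of the backward construction in lemma 3 for a finite toy
# model (states, actions, probabilities, rewards and u are hypothetical).
# Choosing a maximizing action at every stage plays the role of lemma 2.
S = [0, 1]
A = ["stay", "move"]
P = {
    (0, "stay"): {0: 0.9},
    (0, "move"): {1: 1.0},
    (1, "stay"): {1: 0.8},
    (1, "move"): {0: 0.5, 1: 0.4},
}
r = {(0, "stay"): 1.0, (0, "move"): -2.0, (1, "stay"): 0.5, (1, "move"): 3.0}

def construct_markov_policies(N, u):
    """Return (f_0,...,f_{N-1}) and u_0 from the backward recursion
    u_k(i) = max_a { r(i,a) + sum_j P(j|i,a) u_{k+1}(j) }, with u_N = -u."""
    w = {i: -u[i] for i in S}                      # u_N := -u
    policies = []
    for _ in range(N):                             # k = N-1, N-2, ..., 0
        f_k, new_w = {}, {}
        for i in S:
            value = lambda a: r[(i, a)] + sum(q * w[j]
                                              for j, q in P[(i, a)].items())
            f_k[i] = max(A, key=value)
            new_w[i] = value(f_k[i])
        policies.append(f_k)
        w = new_w
    policies.reverse()                             # we built f_{N-1} first
    return policies, w                             # w is u_0

f, u0 = construct_markov_policies(3, {0: 0.0, 1: 1.0})
print(f, u0)
```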
Lemma 4. For all $i \in S$ it holds that
$$\sup_{f \in M} w(i,f) = \sup_{\pi \in RM} w(i,\pi) .$$

Proof. We only have to show
$$a(i) := \sup_{f \in M} w(i,f) \;\geq\; \sup_{\pi \in RM} w(i,\pi) =: b(i) .$$

1) Suppose $b(i) < \infty$ for all $i \in S$. Fix $i_0 \in S$. For each $\varepsilon > 0$ there is a $\pi \in RM$ and some integer $N$ such that
$$w_N(i_0,\pi) \;\geq\; b(i_0) - \varepsilon .$$
Applying lemma 3 (with $r^{+}$ in the role of $r$ and $u \equiv 0$) we find an $f \in M$ with $w_N(i_0,f) \geq w_N(i_0,\pi)$, hence $a(i_0) \geq w(i_0,f) \geq b(i_0) - \varepsilon$. Since $\varepsilon$ was arbitrary, $a(i_0) \geq b(i_0)$.
2) Suppose $b(i_0) = \infty$ for some $i_0 \in S$. First we note that for all $\pi = (\pi_0,\pi_1,\pi_2,\ldots) \in RM$ we have $\pi_n r^{+} < \infty$. Suppose the contrary, i.e. $\pi_n r^{+}(j) = \infty$ for some $j \in S$. This implies for all $k > 0$ the existence of a Markov policy $f_0$, and even so the existence of a Markov strategy $f$, such that
$$w(j,f) \;\geq\; f_0 r^{+}(j) \;\geq\; k ,$$
which contradicts our assumption (A). In exactly the same way we may prove that
$$\pi_n\{r^{+} + P f_{n+1} r^{+} + \cdots + P f_{n+1} \cdots P f_{N-1} r^{+}\} < \infty .$$
It follows from $b(i_0) = \infty$ that there is for each $k$ a $\pi \in RM$ and an integer $N$ such that
$$w_N(i_0,\pi) \;\geq\; k .$$
Completely analogous to the proof of lemma 3 we find by backward induction a sequence of Markov policies $f_{N-1}, f_{N-2}, \ldots, f_0$, and therefore a Markov strategy $f$ starting with $(f_0,f_1,\ldots,f_{N-1})$, such that
$$w_N(i_0,f) \;\geq\; w_N(i_0,\pi) \;\geq\; k .$$
This implies that $a(i_0) = \infty$, which produces a contradiction with assumption (A). □
It is a direct consequence of lemma 4 that for all $\pi \in RM$ and all $i \in S$, $v(i,\pi)$ exists and moreover that
$$\sup_{\pi \in RM} v(i,\pi) < \infty \quad \text{for all } i \in S .$$

Lemma 5. For all $i \in S$:
$$\sup_{f \in M} v(i,f) = \sup_{\pi \in RM} v(i,\pi) =: v(i) .$$

Proof. Fix $i_0 \in S$. Choose $\varepsilon > 0$. There must be a $\pi \in RM$ with $v(i_0,\pi) \geq v(i_0) - \varepsilon$ and an integer $N$ such that
$$E_{i_0,\pi}\Bigl[\sum_{n=N}^{\infty} r^{+}(X_n,Y_n)\Bigr] < \varepsilon .$$
Hence
$$v_N(i_0,\pi) - E_{i_0,\pi}\Bigl[\sum_{n=N}^{\infty} r^{-}(X_n,Y_n)\Bigr] \;\geq\; v(i_0) - 2\varepsilon .$$
Define $\pi^N \in RM$ by $\pi^N_n := \pi_{N+n}$, $n = 0,1,2,\ldots$, and the function $u$ on $S$ by
$$u(i) := E_{i,\pi^N}\Bigl[\sum_{n=0}^{\infty} r^{-}(X_n,Y_n)\Bigr] .$$
Then we have
$$\sum_{n=0}^{N-1} \pi_0 P \pi_1 \cdots P \pi_n r(i_0) + \pi_0 P \pi_1 \cdots P \pi_{N-1} P(-u)(i_0) \;\geq\; v(i_0) - 2\varepsilon .$$
By lemma 3 we know the existence of $f = (f_0,f_1,f_2,\ldots) \in M$ such that
$$\sum_{n=0}^{N-1} f_0 P f_1 \cdots P f_n r(i_0) + f_0 P f_1 \cdots P f_{N-1} P(-u)(i_0) \;\geq\; v(i_0) - 2\varepsilon .$$
Define the probability $p$ on $S^*$ by
$$p(j) := \mathbb{P}_{i_0,f}[X_N = j] .$$
(Hence $E_{i_0,f}[u(X_N)] = \sum_{j \in S} p(j) u(j)$.)
It follows from negative dynamic programming [Strauch, th. 4.3] that there exists a $g = (g_0,g_1,g_2,\ldots) \in M$ such that
$$\sum_{j \in S} p(j) \cdot E_{j,g}\Bigl[\sum_{n=0}^{\infty} r^{-}(X_n,Y_n)\Bigr] \;\leq\; \sum_{j \in S} p(j) \cdot u(j) .$$
Let $f^* \in M$ be defined by $f^* = (f_0,f_1,\ldots,f_{N-1},g_0,g_1,\ldots)$; then it is clear that
$$v(i_0,f^*) \;\geq\; v_N(i_0,f) - \sum_{j \in S} p(j) \cdot E_{j,g}\Bigl[\sum_{n=0}^{\infty} r^{-}(X_n,Y_n)\Bigr] \;\geq\; v_N(i_0,f) - \sum_{j \in S} p(j) u(j) \;\geq\; v(i_0) - 2\varepsilon .$$
This proves the lemma. □
It follows from a theorem of [Derman and Strauch], which is generalized by [Hordijk] for a countable state space and a measurable action space, that for a fixed $i \in S$ and an arbitrary $\pi \in \Pi$ there is a $\pi^* \in RM$ such that
$$\mathbb{P}_{i,\pi}[X_n = j,\, Y_n \in A_0] = \mathbb{P}_{i,\pi^*}[X_n = j,\, Y_n \in A_0] \quad \text{for all } j \in S,\ A_0 \in \mathcal{A} .$$
Hence
$$E_{i,\pi}\Bigl[\sum_{n=0}^{\infty} r^{+}(X_n,Y_n)\Bigr] = E_{i,\pi^*}\Bigl[\sum_{n=0}^{\infty} r^{+}(X_n,Y_n)\Bigr] < \infty ,$$
which proves the existence of $v(i,\pi)$ and that $v(i,\pi) = v(i,\pi^*)$. The next theorem is now a direct consequence of lemma 5.
Theorem. For all $i \in S$:
$$\sup_{\pi \in \Pi} v(i,\pi) = \sup_{f \in M} v(i,f) .$$

3. Remarks.
a) The theorem we proved in section 2 is applicable to the well-known models of positive, negative and discounted dynamic programming, provided that the state space is countable. Note that the discount factor, if it is less than one, is included in the absorption probability of the state $p$.
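A quick numerical check of this remark, on an assumed two-state example with a fixed policy (all numbers hypothetical): multiplying every transition probability by $\beta < 1$, and letting the missing mass $1-\beta$ go to the absorbing state $p$, gives a chain whose undiscounted expected total return coincides with the $\beta$-discounted return of the original chain.

```python
# Remark a) in a toy example (hypothetical numbers): discounting by
# beta < 1 is the same as shrinking every transition probability by the
# factor beta, i.e. sending the extra mass 1 - beta to the absorbing
# state p, where all rewards are 0.
beta = 0.9
P_f = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}   # transitions under a fixed f
r_f = {0: 1.0, 1: -2.0}                            # one-step rewards under f
S = [0, 1]
N = 200                                            # truncation horizon

def expected_return(trans, discount):
    """sum_{n<N} discount^n * E_0[r(X_n)] for the chain 'trans' started in 0."""
    dist, total = {0: 1.0, 1: 0.0}, 0.0
    for n in range(N):
        total += (discount ** n) * sum(dist[i] * r_f[i] for i in S)
        dist = {j: sum(dist[i] * trans[i].get(j, 0.0) for i in S) for j in S}
    return total

damped = {i: {j: beta * q for j, q in row.items()} for i, row in P_f.items()}
print(expected_return(P_f, beta))      # discounted return, original chain
print(expected_return(damped, 1.0))    # undiscounted return, damped chain
```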
b) In the convergent dynamic programming model, developed by [Hordijk], in which it is assumed that
$$E_{i,\pi}\Bigl[\sum_{n=0}^{\infty} |r(X_n,Y_n)|\Bigr] < \infty \quad \text{for all } \pi \in RM,\ \text{for all } i \in S ,$$
the theorem is also valid, since in this situation assumption (A) is also fulfilled. (The proof of this statement may be found using the arguments of [Hordijk, corollary 13.3].)
c) If we allow a discount factor $\beta > 1$ such that
$$\sup_{f \in M} E_{i,f}\Bigl[\sum_{n=0}^{\infty} \beta^{n} r^{+}(X_n,Y_n)\Bigr] < \infty \quad \text{for all } i \in S ,$$
then the theorem remains valid for the discounted total return $\sum_{n=0}^{\infty} \beta^{n} r(X_n,Y_n)$.
d) For our model Bellman's optimality equation is valid, i.e.
$$v(i) := \sup_{f \in M} v(i,f)$$
satisfies the equation
$$v(i) = \sup_{a \in A}\Bigl\{r(i,a) + \sum_{j \in S} P(j \mid i,a)\, v(j)\Bigr\} .$$
(To prove this use the arguments of [Hordijk, th. 3.1 and th. 3.5].)
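To illustrate remark d), the sketch below solves the optimality equation by straightforward value iteration on an assumed finite toy model (all numbers hypothetical); the transition function is substochastic and under every policy the process is eventually absorbed in $p$, so the iteration settles down numerically.

```python
# Value iteration for Bellman's optimality equation
# v(i) = sup_a { r(i,a) + sum_j P(j|i,a) v(j) }
# on a hypothetical finite toy model.
S = [0, 1]
A = ["stay", "move"]
P = {
    (0, "stay"): {0: 0.9},              # substochastic: deficit is absorption in p
    (0, "move"): {1: 1.0},
    (1, "stay"): {1: 0.8},
    (1, "move"): {0: 0.5, 1: 0.4},
}
r = {(0, "stay"): 1.0, (0, "move"): -2.0, (1, "stay"): 0.5, (1, "move"): 3.0}

def bellman_update(v):
    """One application of the optimality operator to the function v on S."""
    return {i: max(r[(i, a)] + sum(q * v[j] for j, q in P[(i, a)].items())
                   for a in A) for i in S}

v = {i: 0.0 for i in S}
for _ in range(500):                    # iterate until (numerically) fixed
    v = bellman_update(v)
print(v)
```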
Acknowledgement.
The author wishes to thank dr. A. Hordijk for stimulating discussions.
4. References
Derman, C. and R. Strauch; A note on memoryless rules for controlling sequential control processes. Ann. Math. Statist. 37 (1966), 276-278.

Hordijk, A.; Dynamic programming and Markov potential theory. Amsterdam, Mathematisch Centrum, 1974. M.C. Tract no. 51.

Strauch, R.; Negative dynamic programming. Ann. Math. Statist. 37 (1966), 871-890.