Successive approximations for convergent dynamic programming


Citation for published version (APA):

Hee, van, K. M., Hordijk, A., & Wal, van der, J. (1977). Successive approximations for convergent dynamic programming. (Memorandum COSOR; Vol. 7707). Technische Hogeschool Eindhoven.

Document status and date: Published: 01/01/1977

Document Version: Publisher's PDF, also known as Version of Record (includes final page, issue and volume numbers)


PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 77-07

Successive Approximations for Convergent Dynamic Programming

by

Kees M. van Hee, Arie Hordijk and Jan van der Wal

Eindhoven, April 1977
The Netherlands


Kees M. van Hee, Arie Hordijk and Jan van der Wal

1. Introduction and Preliminaries

The main topic of this paper is the convergence of the method of successive approximations for dynamic programming with the expected total return criterion.

We first sketch the framework of the dynamic programming model we are dealing with.

Consider a countable set E, the state space, and an arbitrary set A, the action space, endowed with some σ-field containing all one-point sets. Let p be a transition probability from E × A to E (notation: p(j|i,a), i,j ∈ E, a ∈ A). Let H_n := (E × A)^n × E be the set of histories until time n (n ≥ 1) and H_0 := E.

In all generality a strategy π is a sequence (π_0, π_1, ...) where π_n is a transition probability from H_n to A. The set of all strategies is denoted by Π. The subset M of Π consists of all Markov strategies; i.e. π = (π_0, π_1, ...) ∈ M if and only if there is a sequence of functions f_0, f_1, ..., with f_n : E → A, n = 0,1,..., such that π_0({f_0(i)}|i) = 1 and π_n({f_n(i)}|h_{n-1}, a_{n-1}, i) = 1 for all h_{n-1} ∈ H_{n-1}, a_{n-1} ∈ A and i ∈ E. Each i ∈ E and π ∈ Π determine a probability P_{i,π} on (E × A)^∞ and a stochastic process {(X_n, Y_n), n = 0,1,...}, where X_n is the state and Y_n the action at time n. The expectation with respect to P_{i,π} is denoted by E_{i,π}.

The reward function r is a real measurable function on E × A. Throughout this paper we assume

(1.1) sup_{π∈Π} E_{i,π}[Σ_{n=0}^∞ r⁺(X_n,Y_n)] < ∞, i ∈ E

(note that x⁺ := max(x,0)). This assumption guarantees that the expected total return

v(i,π) := E_{i,π}[Σ_{n=0}^∞ r(X_n,Y_n)]

is defined for all i ∈ E and π ∈ Π, and in [9] it is proved, using a well-known theorem of Derman and Strauch [4], that

(1.2) sup_{π∈Π} v(i,π) = sup_{π∈M} v(i,π) for all i ∈ E.

As a consequence of 1.2 we are mainly interested in Markov strategies, and for that reason we introduce some notations which are especially useful for this class.

First we define the set P of transition probabilities from E to E: P ∈ P if there is a function f : E → A such that P(i,·) = p(·|i,f(i)) for all i ∈ E. Further we define the function r : E × P → ℝ (= the set of reals) by

r_P(i) := sup{r(i,a) | P(i,·) = p(·|i,a), a ∈ A}.

Note that each π ∈ M is completely determined by a sequence R = (P_0, P_1, ...), P_n ∈ P, n = 0,1,.... Hence we may identify each π ∈ M with such a sequence R, and express

E_{i,R}[r(X_n,Y_n)] = P_0 ⋯ P_{n-1} r_{P_n}(i), for R = (P_0, P_1, ...), i ∈ E.

(By convention the empty product of elements of P is the identity operator, and if we omit the subscript i in E_{i,R} we mean the corresponding function on E.)

On E we define the functions:

(1.3) v := sup_{R∈M} E_R[Σ_{n=0}^∞ r(X_n,Y_n)], the value function;

for a function s : E → ℝ

(1.4) v_n^s := sup_{R∈M} E_R[Σ_{k=0}^{n-1} r(X_k,Y_k) + s(X_n)], and v_n := v_n^0;

for a sequence a = (a_0, a_1, ...) of functions a_n from E to {x ∈ ℝ | x ≥ 1} we define the functions w_a and z_a on E:

(1.5) w_a(i) := sup_{R∈M} Σ_{n=0}^∞ a_n(i) |E_{i,R}[r(X_n,Y_n)]|, i ∈ E,

(1.6) z_a(i) := sup_{R∈M} Σ_{n=0}^∞ a_n(i) E_{i,R}[|r(X_n,Y_n)|], i ∈ E;

we write w for w_a and z for z_a if a_n ≡ 1 for n = 0,1,...;

(1.7) y_1 := z, y_n := sup_{R∈M} Σ_{k=0}^∞ E_R[y_{n-1}(X_k)], n = 2,3,....

A dynamic programming model is said to be stable with respect to scrapfunction s if

lim_{n→∞} v_n^s(i) = v(i) for all i ∈ E.

It is well known that positive, negative and discounted dynamic programming models with finite E and A are stable. But this is not true in convergent dynamic programming, the case that z is finite (see [13], [14]), as is shown by the following example.

Counterexample:

E = {1,2}, A = {1,2}, p(1|1,1) = p(2|1,2) = 1, r(1,1) = 0, r(1,2) = 2, p(·|2,1) = p(·|2,2) = 0, r(2,1) = r(2,2) = -1.

Then v_n(1) = 2 for all n ≥ 1, whereas v(1) = 1; hence the model is not stable with respect to scrapfunction 0, although z is finite.

It is well-known that stability (with respect to scrapfunction 0) is guaranteed if the expected total return from time n onwards tends to zero, as n tends to infinity, uniformly in the strategy. In 1.8 this uniform tail property is expressed:

(1.8) lim_{n→∞} sup_{R∈M} Σ_{k=n}^∞ |E_R[r(X_k,Y_k)]| = 0.

In this paper two types of assumptions are considered to guarantee this uniform tail convergence. In section 2 the strong convergence conditions are introduced. A model is called strongly convergent if w_a or z_a is finite for a sequence of functions a = (a_0, a_1, ...) with a_n(i) → ∞ for all i ∈ E. It turns out that property 1.8 is equivalent to a strong convergence condition. In section 3 Liapunov functions are introduced and the existence of finite Liapunov functions is related to strong convergence. In section 4 Liapunov functions turn out to be important tools in successive approximations, because they provide bounds for |v - v_n^s| and procedures for excluding suboptimal actions.

In section 5 the connection with contracting dynamic programming is made, and in section 6 a waiting line model with controllable input is presented which satisfies the strong convergence condition but which is not contracting. Finally, in section 7 some results on (nearly) optimal strategies are collected.

We conclude this section with some remarks and notations.

Models with a different action space A_i for each i ∈ E can easily be transformed into our framework.

In [13] and [14] convergent dynamic programming (z < ∞) was studied extensively. In this paper we are almost always working within this framework since, besides the overall assumption 1.1, we work with additional assumptions which are at least as strong as: w is finite. Hence, with w < ∞ and

E_{i,R}[|r(X_n,Y_n)|] ≤ 2 E_{i,R}[r⁺(X_n,Y_n)] + |E_{i,R}[r(X_n,Y_n)]|,

we have

z(i) ≤ 2 sup_{R∈M} E_{i,R}[Σ_{n=0}^∞ r⁺(X_n,Y_n)] + w(i) < ∞.

For two extended real valued functions a and b on E we write a ≤ b iff a(i) ≤ b(i) for all i ∈ E, and a ≤ x iff a(i) ≤ x for all i ∈ E (the same holds if ≤ is replaced by < or =). With the convergence of a sequence of functions on E we mean pointwise convergence, and the supremum of a collection of functions is the pointwise supremum. With convergence of a sequence of elements of P we mean elementwise convergence. For an extended real valued function a and a positive function b on E we write a/b for the function c(i) := a(i)/b(i).

For a nonnegative function μ on E we introduce the set

V(μ) := {v : E → ℝ | |v| ≤ kμ for some k ∈ ℝ}.

On V(μ) we define the norm ‖·‖_μ by

‖f‖_μ := sup{μ^{-1}(i)|f(i)| : i ∈ E, μ(i) > 0}.

The function μ is called a bounding function (cf. section 5). For functions f on E with

sup_{P∈P} Pf⁺ < ∞

we define two well-known operators:

(1.9) Uf := sup_{P∈P} {r_P + Pf},

(1.10) Df := sup_{P∈P} Pf.

Finally we formulate Bellman's optimality equations:

(1.11) v_n^s = U v_{n-1}^s, n = 1,2,...,

(1.12) v = Uv.

The Liapunov approach was presented by Hordijk at the Advanced Seminar on Markov Decision Theory, Amsterdam 1976. This inspired van Hee and van der Wal to investigate the problem of successive approximations under very general conditions, which resulted in the strong convergence approach. Then the three of us joined the investigations, which led to this paper.


2. Strong convergence

One of the main results in this section is the equivalence of the strong convergence condition with the uniform tail property expressed in 1.8. We first give some simple, but useful inequalities.

Throughout this section let a = (a_0, a_1, ...) be a nondecreasing sequence of functions, a_n : E → {x ∈ ℝ | x ≥ 1}.

Theorem 2.1.

(i) sup_{R∈M} Σ_{k=n}^∞ |E_R[r(X_k,Y_k)]| ≤ w_a / a_n,

(ii) sup_{R∈M} Σ_{k=n}^∞ E_R[|r(X_k,Y_k)|] ≤ z_a / a_n.

Proof. Since a_k(i) is nondecreasing in k and a_n(i) ≥ 1, we have, for all i ∈ E:

sup_{R∈M} Σ_{k=n}^∞ |E_{i,R}[r(X_k,Y_k)]| ≤ sup_{R∈M} Σ_{k=n}^∞ (a_k(i)/a_n(i)) |E_{i,R}[r(X_k,Y_k)]| ≤ w_a(i)/a_n(i).

The proof of (ii) is similar. □

Lemma 2.2.

(2.1) sup_{R∈M} Σ_{k=n}^∞ E_R[|r(X_k,Y_k)|] = D^n z.

Proof. For every R = (P_0, P_1, ...),

Σ_{k=n}^∞ E_R[|r(X_k,Y_k)|] = Σ_{k=0}^∞ P_0 ⋯ P_{n-1} P_n ⋯ P_{n+k-1} |r_{P_{n+k}}| ≤ P_0 ⋯ P_{n-1} z ≤ D^n z.

And further, since E is countable, the reverse inequality follows by combining the first n transition probabilities of a strategy with nearly optimal continuations, using the analogue of 1.2 for the reward |r|. Since E_R[z(X_n)] = P_0 ⋯ P_{n-1} z, the lemma also yields

sup_{R∈M} E_R[z(X_n)] ≤ sup_{R∈M} Σ_{k=n}^∞ E_R[|r(X_k,Y_k)|]. □

A direct consequence of theorem 2.1 and lemma 2.2 is (using |v| ≤ z)

(2.2) sup_{R∈M} |E_R[v(X_n)]| ≤ sup_{R∈M} E_R[z(X_n)] ≤ z_a / a_n.

And in a similar way one may prove

(2.3) sup_{R∈M} |E_R[v(X_n)]| ≤ w_a / a_n.

One of the consequences of the above inequalities is that if z_a < ∞ for some sequence a with lim_{n→∞} a_n = ∞, then lim_{n→∞} |E_R[v(X_n)]| = 0 for any strategy. Hence any strategy is equalizing (see chapter 4 of [13]). See also theorem 7.5.

Theorem 2.3 states that w_a < ∞ and lim_{n→∞} a_n = ∞ guarantee stability. Note that w_a ≤ z_a.

Theorem 2.3.

Let w_a < ∞ and lim_{n→∞} a_n = ∞. Then the problem is stable with respect to any scrapfunction s satisfying sup_R E_R[s⁺(X_n)] < ∞, n = 0,1,..., and sup_R |E_R[s(X_n)]| → 0 (n → ∞).

Proof.

v - v_n^s = sup_{R∈M} E_R[Σ_{k=0}^∞ r(X_k,Y_k)] - sup_{R∈M} E_R[Σ_{k=0}^{n-1} r(X_k,Y_k) + s(X_n)]

≤ sup_{R∈M} |E_R[Σ_{k=n}^∞ r(X_k,Y_k)]| + sup_{R∈M} |E_R[s(X_n)]|

≤ w_a / a_n + sup_{R∈M} |E_R[s(X_n)]|.

Similarly one shows

v_n^s - v ≤ w_a / a_n + sup_{R∈M} |E_R[s(X_n)]|.

Hence lim_{n→∞} |v_n^s - v| = 0. □

So theorem 2.3 gives a new criterion for stability. If z_a < ∞ and lim_{n→∞} a_n = ∞, we may use scrapfunctions s satisfying, for some K ∈ ℝ, |s| ≤ Kz, since by theorem 2.1 and lemma 2.2

lim_{n→∞} sup_{R∈M} E_R[z(X_n)] = 0. □

Consider a dynamic programming model with bounded rewards, say |r(i,a)| ≤ b for all i ∈ E, a ∈ A, and let E_0 be an absorbing subset of E with r(i,a) = 0 for all i ∈ E_0, a ∈ A. Let T be the entrance time in E_0. If sup_{R∈M} E_R[T] < ∞, then z ≤ b · sup_{R∈M} E_R[T] < ∞. If also the second moment is finite, the model satisfies the strong convergence condition in a natural way: since Σ_{n=0}^∞ (n+1)·1{T > n} = T(T+1)/2, we have z_a ≤ b · sup_{R∈M} E_R[T(T+1)]/2 for a_n = n + 1, n = 0,1,.... In fact

|v_n - v| ≤ (n+1)^{-1} b · sup_{R∈M} E_R[T(T+1)]/2.

Similar expressions can be derived with higher moments of the entrance time.

In general one may say: if w_a < ∞, then |v_n(i) - v(i)| tends to zero at a rate at least as fast as 1/a_n(i).

From the foregoing results the question arises under which conditions there exists a sequence of functions a with a_n → ∞ and w_a < ∞. The following theorem gives the already announced characterization.

Theorem 2.4.

There exists a nondecreasing sequence of functions a = (a_0, a_1, ...) on E with lim_{n→∞} a_n = ∞ and w_a < ∞ if and only if

lim_{n→∞} sup_{R∈M} Σ_{k=n}^∞ |E_R[r(X_k,Y_k)]| = 0.

Proof. First the if part. Define

b_n(i) := sup_{R∈M} Σ_{k=n}^∞ |E_{i,R}[r(X_k,Y_k)]|, i ∈ E.

Obviously, b_n ≥ b_{n+1}. Now let a_n(i) = ℓ + 1 if N_ℓ(i) ≤ n < N_{ℓ+1}(i), with N_0(i) := 0 and N_ℓ(i) := min{n | b_n(i) ≤ 2^{-ℓ}}, ℓ = 1,2,.... Then

sup_{R∈M} Σ_{n=N_ℓ(i)}^{N_{ℓ+1}(i)-1} a_n(i) |E_{i,R}[r(X_n,Y_n)]| ≤ (ℓ + 1) 2^{-ℓ}, ℓ = 1,2,...,

and consequently

w_a(i) ≤ sup_{R∈M} Σ_{n=0}^{N_1(i)-1} |E_{i,R}[r(X_n,Y_n)]| + Σ_{ℓ=1}^∞ (ℓ + 1) 2^{-ℓ} ≤ w(i) + 3 < ∞.

The only if part is immediate from theorem 2.1(i), since sup_{R∈M} Σ_{k=n}^∞ |E_R[r(X_k,Y_k)]| ≤ w_a / a_n → 0. □
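The construction in this proof is concrete enough to run. The sketch below is an assumption-laden toy of our own: a single state whose per-stage terms c_n have tails b_n = 1/(n+1). It builds N_ℓ and a_n exactly as in the proof and checks the resulting bound w_a ≤ w + 3.

```python
# Toy data for one fixed state i: per-stage terms c_n = |E r(X_n,Y_n)| with
# tails b_n = sum_{k>=n} c_k = 1/(n+1), so w = b_0 = 1.
H = 1000                       # truncation horizon for the numerical check
c = [1.0 / ((n + 1) * (n + 2)) for n in range(H)]     # c_n = b_n - b_{n+1}
b = [1.0 / (n + 1) for n in range(H)]

def N(ell):
    """N_ell = min{n : b_n <= 2^{-ell}}, with N_0 = 0 (as in the proof)."""
    if ell == 0:
        return 0
    return next((n for n in range(H) if b[n] <= 2.0 ** (-ell)), H)

# a_n = ell + 1 on [N_ell, N_{ell+1})
a = []
ell = 0
for n in range(H):
    while N(ell + 1) <= n:
        ell += 1
    a.append(ell + 1)

w_a = sum(a[n] * c[n] for n in range(H))
w = b[0]
print(w_a, "<=", w + 3)        # a_n grows to infinity, yet w_a stays below w + 3
```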

In theorem 2.5 we collect two sufficient conditions for stability which are weaker than the strong convergence condition. It is well known that positive dynamic programming models are stable, but the strong convergence condition need not be fulfilled there. The following theorem also covers the positive case.

Theorem 2.5.

Each of the following conditions guarantees stability for scrapfunction 0.

(i) liminf_{n→∞} inf_{R∈M} E_R[v(X_n)] ≥ 0;

(ii) there exists a nondecreasing sequence a = (a_0, a_1, ...) of functions a_n : E → {x ∈ ℝ | x ≥ 1} with lim_{n→∞} a_n = ∞ and

(2.4) d_a(i) := sup_{R∈M} E_{i,R}[Σ_{n=0}^∞ a_n(i) r⁻(X_n,Y_n)] < ∞.

Proof. For all R ∈ M,

v_n ≥ E_R[Σ_{k=0}^{n-1} r(X_k,Y_k)].

Hence liminf_{n→∞} v_n ≥ v(·,R) for all R ∈ M, and consequently

liminf_{n→∞} v_n ≥ sup_{R∈M} v(·,R) = v.

Hence to prove stability we have to show limsup_{n→∞} v_n ≤ v.

Part (i). By the optimality equation we have r_P + Pv ≤ v, P ∈ P. Hence by iteration

Σ_{k=0}^{n-1} P_0 ⋯ P_{k-1} r_{P_k} + P_0 ⋯ P_{n-1} v ≤ v,

or

E_R[Σ_{k=0}^{n-1} r(X_k,Y_k)] + E_R[v(X_n)] ≤ v.

Consequently, v_n ≤ v - inf_{R∈M} E_R[v(X_n)]. So with

liminf_{n→∞} inf_{R∈M} E_R[v(X_n)] ≥ 0

we find limsup_{n→∞} v_n ≤ v.

Part (ii). For R ∈ M,

v(i,R) = E_R[Σ_{k=0}^{n-1} r(X_k,Y_k)] + E_R[Σ_{k=n}^∞ r(X_k,Y_k)]

≥ E_R[Σ_{k=0}^{n-1} r(X_k,Y_k)] - sup_{R∈M} E_R[Σ_{k=n}^∞ r⁻(X_k,Y_k)].

Hence, by taking the supremum over R ∈ M,

v ≥ v_n - sup_{R∈M} E_R[Σ_{k=n}^∞ r⁻(X_k,Y_k)].

Using 2.4 one proves, in a way similar to the proof of theorem 2.1, that

sup_{R∈M} E_R[Σ_{k=n}^∞ r⁻(X_k,Y_k)] ≤ d_a / a_n.

Hence

v ≥ limsup_{n→∞} v_n - lim_{n→∞} d_a / a_n = limsup_{n→∞} v_n. □

If for some sequence P_0, P_1, ... we have v_{n+1} = r_{P_n} + P_n v_n, then

liminf_{n→∞} P_n ⋯ P_0 v ≥ 0

is already sufficient for stability, since iteration of the inequality v ≥ r_P + Pv along this sequence yields v ≥ v_{n+1} + P_n ⋯ P_0 v, and by the proof of theorem 2.5 limsup_{n→∞} v_n ≤ v is sufficient for stability.

3. Liapunov Functions and Strong Convergence

We first introduce Liapunov functions.

Consider a sequence of nonnegative extended real functions ℓ_1, ℓ_2, ... on E satisfying for all P ∈ P the inequalities

(3.1) |r_P| + Pℓ_1 ≤ ℓ_1, ℓ_{k-1} + Pℓ_k ≤ ℓ_k, k = 2,3,....

Finite solutions of 3.1 are called Liapunov functions. If ℓ_k is finite, ℓ_k is called a Liapunov function of order k. Note that ℓ_k < ∞ implies ℓ_{k-1} < ∞.
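For a concrete feeling for 3.1, the sketch below checks the two inequalities for a small substochastic toy model of our own (three states, two transition matrices with row sums at most α < 1, rewards bounded by M); the candidate functions are the geometric ones that reappear in section 5.

```python
# Toy model: 3 states, two substochastic matrices (row sums <= alpha < 1),
# rewards bounded by M; candidates l_k = M (1-alpha)^{-k} (cf. section 5).
alpha, M = 0.9, 2.0
P1 = [[0.45, 0.45, 0.0], [0.0, 0.9, 0.0], [0.18, 0.27, 0.45]]
P2 = [[0.9, 0.0, 0.0], [0.27, 0.27, 0.36], [0.0, 0.0, 0.9]]
r1 = [1.0, -2.0, 0.5]
r2 = [-1.5, 2.0, 0.0]

def matvec(P, f):
    return [sum(p * x for p, x in zip(row, f)) for row in P]

l1 = [M / (1 - alpha)] * 3          # candidate of order 1
l2 = [M / (1 - alpha) ** 2] * 3     # candidate of order 2

for P, r in ((P1, r1), (P2, r2)):
    Pl1, Pl2 = matvec(P, l1), matvec(P, l2)
    assert all(abs(ri) + y <= x + 1e-12 for ri, y, x in zip(r, Pl1, l1))
    assert all(x + y <= z + 1e-12 for x, y, z in zip(l1, Pl2, l2))

print("l1, l2 satisfy the Liapunov inequalities 3.1 for this toy model")
```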

Liapunov functions are powerful tools in dynamic programming. They were first studied in a context of dynamic programming in [13], chapter 4, for the convergent dynamic programming model. In chapter 5 of [13] and in [15] Liapunov functions are studied in connection with the average return criterion for models in which some state is recurrent under each strategy, and in [14] they are used to obtain (partial) Laurent expansions for the expected total discounted return. In section 4 the existence of a Liapunov function of order 2 is assumed to obtain bounds for |v_n^s - v|.

The functions y_1, y_2, ... defined in 1.7 satisfy Bellman's optimality equation, hence

y_1 = sup_{P∈P} {|r_P| + Py_1} and y_k = sup_{P∈P} {y_{k-1} + Py_k} < ∞, k = 2,3,....

Hence, if y_k is finite, y_1, ..., y_k are Liapunov functions, and moreover it is easy to verify that ℓ_k < ∞ implies ℓ_n ≥ y_n, n = 1,2,...,k. Although we can work with y_k instead of ℓ_k for theoretical purposes, it may happen in applications that one can find, in a relatively simple way, Liapunov functions ℓ_1, ℓ_2, ..., ℓ_k, while the functions y_1, y_2, ..., y_k are hard to obtain.

Since there is a large class of Liapunov functions, there still is some freedom to choose an appropriate one. Especially this might improve the bounds in the approximation procedure (see also section 4). In this section we concentrate on the relations between Liapunov functions and strong convergence.

We recall that the finiteness of a Liapunov function of order k is equivalent to the finiteness of y_1, ..., y_k.

Theorem 3.1.

y_n ≤ sup_{R∈M} Σ_{k=0}^∞ C(k+n-1, k) E_R[|r(X_k,Y_k)|],

where C(k+n-1, k) denotes the binomial coefficient.

Remark. Hence y_n < ∞ implies z_a < ∞ for the sequence of functions a_k = C(k+n-1, k), and consequently the strong convergence condition holds.

Proof. By induction. For n = 1 the statement holds by definition 1.7. Suppose it holds for n - 1 (n ≥ 2); then

y_n = sup_{R∈M} Σ_{m=0}^∞ E_R[y_{n-1}(X_m)] ≤ sup_{R∈M} Σ_{m=0}^∞ C(m+n-1, m) P_0 ⋯ P_{m-1} |r_{P_m}|,

where the coefficients are collected by the identity Σ_{k=0}^{m} C(m-k+n-2, m-k) = C(m+n-1, m). □

So y_n < ∞ implies z_a < ∞ for a_k(i) = O(k^{n-1}), k → ∞. The converse is not true, as is shown by the following example.

Counterexample 3.2.

The states 1, 2, ... are absorbing with reward 0. In the states n', n = 1,2,..., there are two actions. Action 1 yields reward 0 and a transition to state (n+1)'; action 2 yields reward n^{-1} and a transition to state n. Obviously we have for all R ∈ M

E_R[Σ_{n=0}^∞ (n+1)|r(X_n,Y_n)|] ≤ 1,

so z_a < ∞ for a_n = n + 1. But since y_1(n') = n^{-1}, we have for the strategy R* yielding transitions from n' to (n+1)' etc. that

Σ_{n=0}^∞ E_{1',R*}[y_1(X_n)] = Σ_{n=0}^∞ (n+1)^{-1} = ∞,

hence y_2 = ∞.

But if we make a slightly stronger assumption, namely

sup_{R∈M} Σ_{n=0}^∞ n^{N-1} E_R[|r(X_n,Y_n)|] < ∞,

the finiteness of the functions y_1, ..., y_N defined in 1.7 can be shown.

Theorem 3.3.

If for a nondecreasing sequence of numbers a_0, a_1, ..., with a_n ≥ 1 and

b := Σ_{n=0}^∞ a_n^{-1} < ∞,

it holds that

u := sup_{R∈M} Σ_{n=0}^∞ a_n^{N-1} E_R[|r(X_n,Y_n)|] < ∞,

then the functions y_1, ..., y_N defined in 1.7 are finite and satisfy the inequalities

y_k ≤ u b^{k-1} a_0^{k-N}, k = 1, ..., N.

Proof. We will prove by induction that

sup_{R∈M} E_R[y_k(X_n)] ≤ u b^{k-1} a_n^{k-N}

for k = 1,2,...,N-1 and n = 0,1,2,.... Set k = 1. Using y_1 = z (by definition) and

sup_{R∈M} E_R[z(X_n)] ≤ u a_n^{1-N}

(from lemma 2.2 and theorem 2.1(ii), applied with the weights a_n^{N-1}) we get the statement for k = 1.

Now let us assume the inequalities for k = 1, ..., m ≤ N - 2 and n = 0,1,..., and prove that they hold for k = m + 1. Since y_{m+1} = sup_{R∈M} Σ_{ℓ=0}^∞ E_R[y_m(X_ℓ)], we have

sup_{R∈M} E_R[y_{m+1}(X_n)] ≤ sup_{R∈M} Σ_{ℓ=0}^∞ E_R[y_m(X_{n+ℓ})] ≤ Σ_{ℓ=0}^∞ u b^{m-1} a_{n+ℓ}^{m-N}

≤ u b^{m-1} a_n^{(m+1)-N} Σ_{ℓ=0}^∞ a_{n+ℓ}^{-1} ≤ u b^m a_n^{(m+1)-N},

where we used that a is nondecreasing. Thus we proved the inequalities for k = 1,2,...,N-1, n = 0,1,.... Setting n = 0 we get y_k ≤ u b^{k-1} a_0^{k-N}, k = 1,...,N-1, and with

y_N = sup_{R∈M} Σ_{n=0}^∞ E_R[y_{N-1}(X_n)] ≤ Σ_{n=0}^∞ u b^{N-2} a_n^{-1} = u b^{N-1}

we get y_N ≤ u b^{N-1}. (And obviously y_1, ..., y_N are finite.) □

Corollary 3.4.

If a_n = (n+1)^{k+ε} for n = 0,1,... and some ε > 0, then z_a < ∞ implies the existence of (finite) Liapunov functions ℓ_1, ..., ℓ_{k+1} satisfying 3.1.

This is immediate from theorem 3.3 with N = k + 1 and the sequence (n+1)^{1+ε/k}, since Σ_{n=0}^∞ (n+1)^{-(1+ε/k)} < ∞ and ((n+1)^{1+ε/k})^{N-1} = (n+1)^{k+ε}.


4. Liapunov Functions and Successive Approximations

In this section we first formulate sufficient conditions for stability in terms of Liapunov functions ℓ_1 and ℓ_2 (of order 1 and order 2 respectively).

Lemma 4.1.

If some Liapunov function ℓ_1 (of order 1) exists and if in addition lim_{n→∞} D^n ℓ_1 = 0, then the problem is stable with respect to scrapfunctions s ∈ V(ℓ_1).

Proof. Since z ≤ ℓ_1 we have

lim_{n→∞} D^n z = 0.

By lemma 2.2, theorems 2.4 and 2.3 we have the desired result; note that for s ∈ V(ℓ_1), sup_R |E_R[s(X_n)]| ≤ K D^n ℓ_1 → 0 for some K ∈ ℝ. □

Lemma 4.2.

If Liapunov functions ℓ_1 and ℓ_2 exist, then

lim_{n→∞} D^n ℓ_1 = 0.

Proof. Consider a new reward structure: r̃_P := ℓ_1 - Pℓ_1, P ∈ P; by 3.1, r̃_P ≥ |r_P| ≥ 0. For all R ∈ M we have

Σ_{n=0}^{N} P_0 ⋯ P_{n-1} r̃_{P_n} = ℓ_1 - P_0 ⋯ P_N ℓ_1.

Since iteration of ℓ_1 + Pℓ_2 ≤ ℓ_2 yields Σ_{n=0}^∞ P_0 ⋯ P_{n-1} ℓ_1 ≤ ℓ_2 < ∞, we have

lim_{n→∞} P_0 ⋯ P_n ℓ_1 = 0 for all R ∈ M.

Hence ℓ_1 is the function y_1, defined in 1.7, for this new model, and ℓ_2 dominates the corresponding y_2. Therefore, by theorem 3.1, lemma 2.2 and theorem 2.1 (applied to the new model) we have the desired result. □

As a direct consequence of lemmas 4.1 and 4.2 we have

Theorem 4.3.

If Liapunov functions ℓ_1 and ℓ_2 exist, then the problem is stable with respect to scrapfunctions s ∈ V(ℓ_1).

We note that sometimes Liapunov functions ℓ_1 and ℓ_2 can be found rather simply, while y_1 and y_2 are difficult to obtain.

Remark 4.4.

If we assume, besides the existence of a first order Liapunov function ℓ_1, the compactness of P and the continuity of Pℓ_1 as a function of P, then a sufficient condition for

lim_{n→∞} D^n ℓ_1 = 0

is

lim_{n→∞} P^n ℓ_1 = 0 for all P ∈ P.

The proof of this statement proceeds in a way similar to the proof of lemma 5.7 in [13].

Theorem 4.5.

Let ℓ_1 and ℓ_2 be Liapunov functions (of order 1 and 2 respectively) and define for a function s ∈ V(ℓ_1)

b_1^s := inf{ℓ_1^{-1}(i)(Us - s)(i) | i ∈ E, ℓ_1(i) > 0},

b_2^s := sup{ℓ_1^{-1}(i)(Us - s)(i) | i ∈ E, ℓ_1(i) > 0};

then

(4.1) s + min(b_1^s, 0) ℓ_2 ≤ v ≤ s + max(b_2^s, 0) ℓ_2.

Proof. First observe that if s ∈ V(ℓ_1) then also Us ∈ V(ℓ_1), so the set {i | ℓ_1(i) = 0} gives no trouble. Since Us ≤ s + b_2^s ℓ_1 and ℓ_1 ≤ ℓ_2, we have Us ≤ s + max(b_2^s, 0) ℓ_2. Similarly, from U^n s ≤ s + max(b_2^s, 0) ℓ_2 it follows that

U^{n+1} s ≤ Us + max(b_2^s, 0) Dℓ_2 ≤ s + max(b_2^s, 0)(ℓ_1 + Dℓ_2) ≤ s + max(b_2^s, 0) ℓ_2,

hence U^n s ≤ s + max(b_2^s, 0) ℓ_2 for n = 1,2,.... Since the problem is stable (theorem 4.3), we have

v = lim_{n→∞} U^n s ≤ s + max(b_2^s, 0) ℓ_2.

The proof of the left inequality is similar. □

The following, somewhat weaker, but more elegant inequality is now immediate:

(4.2) ‖v - s‖_{ℓ_2} ≤ ‖Us - s‖_{ℓ_1}.

Remark.

If we have functions ℓ_1 and ℓ_2 satisfying the inequalities 3.1 but with ℓ_2(i) = ∞ for some i, then we may separate the state space into E_1 := {i ∈ E | ℓ_2(i) < ∞} and E_2 := E \ E_1. Since ℓ_2(i) < ∞ implies ℓ_2(j) < ∞ for all j ∈ E which can be reached under some strategy from state i, we have that ℓ_1 and ℓ_2 are Liapunov functions on the smaller model with state space E_1. Hence all results can be generalized to that situation.

If for some P, r_P + Ps = Us, ‖Us - s‖_{ℓ_1} is small and ℓ_2 < ∞, one may use the stationary strategy R := (P, P, ...). In section 7, theorem 7.2, we give bounds for the value of this strategy.

It is well-known that the α-discounted dynamic programming model (Σ_{j∈E} p(j|i,a) ≤ α < 1 for all i ∈ E and a ∈ A, and |r_P| ≤ M e for some M ∈ ℝ and all P ∈ P, where e denotes the unit function) can be brought into our framework by defining an extra absorbing state -1 with r(-1,a) = 0 for all a ∈ A and

p(-1|i,a) = 1 - Σ_{j∈E} p(j|i,a), i ∈ E.

In this new model we can take as Liapunov functions the functions defined by

ℓ_k(i) = M(1 - α)^{-k}, i ∈ E, ℓ_k(-1) = 0, k = 1,2,

and then 4.1 becomes slightly weaker than the MacQueen bounds [19], since we work with min(b_1^s, 0) and max(b_2^s, 0) instead of b_1^s and b_2^s.

In the following theorem s is an approximation of v with known bounds b_1 and b_2, i.e. b_1 ≤ v - s ≤ b_2. At the price of the extra calculation of the iterates of b_1 and b_2 we obtain bounds for v_n^s. Here D̲f := inf_{P∈P} Pf.

Theorem 4.6.

Let b_1 ≤ v - s ≤ b_2. Then

v_n^s + D̲^n b_1 ≤ v ≤ v_n^s + D^n b_2, n = 1,2,....

Proof. For n = 1 the statement is trivial. Suppose it holds for n = k. Then

v - v_{k+1}^s = sup_{P∈P} {r_P + Pv} - sup_{P∈P} {r_P + Pv_k^s} ≤ sup_{P∈P} P(v - v_k^s) ≤ D(D^k b_2) = D^{k+1} b_2,

and similarly v - v_{k+1}^s ≥ D̲^{k+1} b_1. □

If there is a sequence P_1, P_2, ... such that

v_n^s = r_{P_n} + P_n v_{n-1}^s for n = 1,2,... (with v_0^s := s),

then we can use P_n P_{n-1} ⋯ P_1 b_1 instead of D̲^n b_1. Note that we may choose b_2 = 0 if s ≥ v.

Finally we can use these bounds to eliminate suboptimal actions. (We use the notation with explicitly written actions a.)

Action a is called suboptimal or nonconserving in state i if

r(i,a) + Σ_{j∈E} p(j|i,a) v(j) < sup_{a'∈A} {r(i,a') + Σ_{j∈E} p(j|i,a') v(j)}.

Hence, if b_1 and b_2 are bounds on v, b_1 ≤ v ≤ b_2, it holds that action a is suboptimal in state i if

r(i,a) + Σ_{j∈E} p(j|i,a) b_2(j) < sup_{a'∈A} {r(i,a') + Σ_{j∈E} p(j|i,a') b_1(j)}.

In theorem 4.7 we prove that elimination of suboptimal actions gives a new model with the same value function. We only assume that the model satisfies some strong convergence condition. In [14] a similar property is proved without this condition.
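The elimination test only needs the bounds b_1 ≤ v ≤ b_2, never v itself. A sketch on a toy discounted model of our own, with b_1 from monotone value iteration and b_2 from a standard discounted error bound (the constant C = 20 below is our own choice, dominating ‖v‖ for this instance):

```python
# Suboptimality test with bounds: action a in state i is suboptimal if
#   r(i,a) + sum_j p(j|i,a) b2(j) < max_a' { r(i,a') + sum_j p(j|i,a') b1(j) }.
alpha = 0.9
acts = {0: [(1.0, [1.0, 0.0]), (0.5, [0.0, 1.0]), (-5.0, [1.0, 0.0])],
        1: [(-1.0, [0.5, 0.5]), (2.0, [1.0, 0.0])]}

def q(i, a, f):
    r, row = acts[i][a]
    return r + alpha * sum(p * x for p, x in zip(row, f))

def U(f):
    return [max(q(i, a, f) for a in range(len(acts[i]))) for i in (0, 1)]

# b1 = v_n (value iteration from 0 is monotone here, so b1 <= v),
# b2 = b1 + alpha^n * C with C >= ||v||_inf (C = 20 suffices for this model).
b1 = [0.0, 0.0]
n = 20
for _ in range(n):
    b1 = U(b1)
b2 = [x + alpha ** n * 20.0 for x in b1]

eliminated = [(i, a) for i in (0, 1) for a in range(len(acts[i]))
              if q(i, a, b2) < max(q(i, aa, b1) for aa in range(len(acts[i])))]
print(eliminated)    # (0, 2) is eliminated without ever computing v
```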

Theorem 4.7.

Suppose that some strong convergence condition holds. Consider a new model with P̃ ⊂ P such that for all ε > 0 there is a P ∈ P̃ with r_P + Pv ≥ v - εe. Then the new model has the same value function.

Proof. Fix ε > 0, let ε_n := ε 2^{-(n+1)} e and choose P_n ∈ P̃ such that

r_{P_n} + ε_n + P_n v ≥ v.

Iteration of this inequality yields

Σ_{n=0}^{N} P_0 ⋯ P_{n-1} (r_{P_n} + ε_n) + P_0 ⋯ P_N v ≥ v.

Hence, since by the strong convergence condition lim_{n→∞} P_0 ⋯ P_n v = 0 and Σ_{n=0}^∞ P_0 ⋯ P_{n-1} ε_n ≤ εe,

Σ_{n=0}^∞ P_0 ⋯ P_{n-1} r_{P_n} ≥ v - εe.

Therefore the value function of the new model is at least v - εe; since ε was arbitrary, and trivially at most v, it equals v. □


As in [7] and [8] we can also exclude actions for a finite number of iterations instead of for all future iterations.

Fix some scrapfunction s. For notational convenience we omit the dependence on s in the following definitions:

v_n(i,a) := r(i,a) + Σ_{j∈E} p(j|i,a) v_{n-1}^s(j),

d_n(i,a) := v_n^s(i) - v_n(i,a),

b_{1,n} := inf_{i∈E} {v_n^s(i) - v_{n-1}^s(i)}, b_{2,n} := sup_{i∈E} {v_n^s(i) - v_{n-1}^s(i)}, Φ_n := b_{2,n} - b_{1,n}.

Theorem 4.8.

(i) d_{n+k+1}(i,a) ≥ d_n(i,a) - Σ_{t=0}^{k} Φ_{n+t};

(ii) if d_n(i,a) - Σ_{t=0}^{k} Φ_{n+t} > 0, then action a is suboptimal at stage n + k + 1.

Proof. (ii) is a direct consequence of (i). Since

v_{n+1}(i,a) - v_n(i,a) = Σ_{j∈E} p(j|i,a){v_n^s(j) - v_{n-1}^s(j)} ≤ b_{2,n}

and

v_{n+1}^s(i) - v_n^s(i) ≥ inf_{a∈A} Σ_{j∈E} p(j|i,a){v_n^s(j) - v_{n-1}^s(j)} ≥ b_{1,n},

we have by subtraction of these inequalities

d_{n+1}(i,a) ≥ d_n(i,a) - Φ_n.

Iteration of this inequality yields the desired result. □

Hence, if we determine d_n(i,a) at stage n and Φ_{n+t} at each following stage, we need not compute v_{n+k+1}(i,a) as long as

d_n(i,a) - Σ_{t=0}^{k} Φ_{n+t} > 0.
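Theorem 4.8 turns into a simple bookkeeping device around value iteration. The sketch below (the same kind of toy discounted model, our own data) records d_n(i,a) once and the spans Φ_{n+t} afterwards, and verifies inequality (i) directly.

```python
# Temporary action elimination (theorem 4.8) around value iteration.
alpha = 0.9
acts = {0: [(1.0, [1.0, 0.0]), (0.5, [0.0, 1.0])],
        1: [(-1.0, [0.5, 0.5]), (2.0, [1.0, 0.0])]}

def q(i, a, f):
    r, row = acts[i][a]
    return r + alpha * sum(p * x for p, x in zip(row, f))

def step(f):
    return [max(q(i, a, f) for a in (0, 1)) for i in (0, 1)]

vs = [[0.0, 0.0]]                         # vs[n] = v_n^s with s = 0
for _ in range(30):
    vs.append(step(vs[-1]))

def d(n, i, a):
    return vs[n][i] - q(i, a, vs[n - 1])  # d_n(i,a) = v_n^s(i) - v_n(i,a)

def phi(n):                               # Phi_n = b_{2,n} - b_{1,n}
    diff = [vs[n][i] - vs[n - 1][i] for i in (0, 1)]
    return max(diff) - min(diff)

# Check (i): d_{n+k+1}(i,a) >= d_n(i,a) - sum_{t=0}^{k} Phi_{n+t}.
n = 5
for i in (0, 1):
    for a in (0, 1):
        for k in range(20):
            lhs = d(n + k + 1, i, a)
            rhs = d(n, i, a) - sum(phi(n + t) for t in range(k + 1))
            assert lhs >= rhs - 1e-9
print("theorem 4.8(i) verified on the toy model")
```

As long as the recorded right-hand side stays positive for some (i,a), the Q-value v_{n+k+1}(i,a) need not be computed at all.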


5. Contracting Dynamic Programming, Strong Convergence and Liapunov Functions

In this section we show how the contracting dynamic programming model introduced by Van Nunen [20] fits into the framework of strong convergence and Liapunov functions. The model assumptions are as follows:

(5.1) There exist a finite function b and a bounding function μ, and there are constants k, k' > 0 and ρ, ρ' with 0 ≤ ρ, ρ' < 1, such that

(i) sup_R Σ_{n=0}^∞ |E_R[b(X_n)]| < ∞,

and for all P ∈ P

(ii) ‖r_P - b‖_μ ≤ k,

(iii) Pμ ≤ ρμ,

(iv) ‖Pb - ρ'b‖_μ ≤ k'.

In the papers of Shapley [22], Blackwell [1] and Denardo [3] it is assumed that the rewards are bounded and that the operator U (def. 1.9) is a contraction with respect to the supremum norm. Veinott [23] showed that transient models can be transformed into discounted models using a similarity transformation, which is equivalent to working with a bounding function (see below). Harrison [6] noticed that in many practical models with a countable state space the reward function is unbounded, and he suggested a modification: he introduced the translation function b. But he worked with μ ≡ 1. Lippman [17, 18] remarked that Harrison's model is too restrictive to include, for example, the M/M/1 queueing system with quadratic cost. He introduced a special bounding function: a polynomial. Wijngaard [25] considered exponential bounding functions to study inventory models with the average cost criterion.

Wessels [24] gave the first systematic treatment of general bounding functions for total return models with a countable state space. Van Hee and Wessels [11] studied necessary and sufficient conditions for the existence of a bounding function μ such that for all P ∈ P: Pμ ≤ ρμ, 0 ≤ ρ < 1. Hinderer [12] used bounding functions for finite stage dynamic programming models with a general state space.

Let us denote

w_P := (1 - ρ')^{-1} (b - Pb);

then by iteration we find

Σ_{n=0}^{N} P_0 ⋯ P_{n-1} w_{P_n} = (1 - ρ')^{-1} (b - P_0 ⋯ P_N b).

Since by 5.1(i) we have lim_{N→∞} P_0 ⋯ P_N b = 0, it follows that

Σ_{n=0}^∞ P_0 ⋯ P_{n-1} w_{P_n} = (1 - ρ')^{-1} b.

Hence the dynamic programming model with reward function r_P - w_P, P ∈ P, is equivalent to the original problem. However, ‖r_P - w_P‖_μ is uniformly bounded. Indeed, with 5.1(ii) and (iv) we find

‖r_P - w_P‖_μ = (1 - ρ')^{-1} ‖(1 - ρ')r_P - b + Pb‖_μ = (1 - ρ')^{-1} ‖(1 - ρ')(r_P - b) + Pb - ρ'b‖_μ

≤ (1 - ρ')^{-1} {(1 - ρ')‖r_P - b‖_μ + ‖Pb - ρ'b‖_μ} < ∞.

Hence the contracting dynamic programming model is equivalent to a model satisfying, for all P ∈ P and some k > 0,

(5.2) (i) Pμ ≤ ρμ, (ii) ‖r_P‖_μ ≤ k.

Note that this model can be reduced in a similar way to a discounted dynamic programming model by a similarity transformation.

This is in fact the transformation studied by Veinott [23]. From 5.2(i) and (ii) we have immediately

E_R[|r(X_n,Y_n)|] ≤ k ρ^n μ,

and therefore we have for 1 < λ < ρ^{-1}

sup_R Σ_{n=0}^∞ λ^n E_R[|r(X_n,Y_n)|] ≤ k(1 - λρ)^{-1} μ < ∞.

Thus the contracting dynamic programming model satisfies the strong convergence condition for the sequence a_n = λ^n. And since n^k = o(λ^n) (n → ∞) for all k = 1,2,..., we have by corollary 3.4 that there exist Liapunov functions ℓ_k satisfying 3.1 for k = 1,2,....

Apart from this one immediately sees that

μ + (1 - ρ)^{-1} Pμ ≤ (1 - ρ)^{-1} μ,

thus

|r_P| + k(1 - ρ)^{-1} Pμ ≤ k(1 - ρ)^{-1} μ.

Hence k(1 - ρ)^{-1} μ suffices as Liapunov function ℓ_1, and it is easily checked that ℓ_n = k(1 - ρ)^{-n} μ, n = 1,2,..., is a system of Liapunov functions satisfying 3.1.


6. Waiting Line Model with Controllable Input; an Example which is Strongly Convergent but not necessarily Contracting

In this section we consider as an example the waiting line model with controllable input which was studied in chapter 5 of [13] and in [15]. In this queueing model the arrival process is Poisson with expected number of arrivals per unit time λ_a, where a denotes the service cost. We assume that we can control the arrival process by choosing a from the interval [a_1, a_2], and we make the reasonable assumption that λ_a decreases as a increases. The service time distribution F is general.

Each time a customer completes service, the service cost may be changed. We will be looking at the embedded Markov chain. The state space becomes E = {0,1,...} and the transition probabilities satisfy

p(j|i,a) = 0 if j < i - 1, p(j|i,a) = k_{j-i+1}(a) if j ≥ i - 1,

with

k_r(a) = ∫_0^∞ e^{-λ_a s} (λ_a s)^r (r!)^{-1} dF(s).

Furthermore we assume

λ_{a_1} ∫_0^∞ s dF(s) < 1

and r(i,a) ≥ δ > 0 for i = 1,2,... and all a ∈ A := [a_1, a_2].

If one is looking for an average optimal strategy for this problem, then one is interested in the behaviour of the system up to the first time the system empties again. In order to study the behaviour until this time, we modify the transition probabilities and rewards in state 0 as follows: state 0 is made absorbing, with reward 0.
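For exponential service times the k_r(a) of the embedded chain have a closed form, which gives a quick sanity check on the transition law. The exponential choice and its rate ν are our own assumption (F is general in the text); for Exp(ν) service the defining integral reduces to a geometric expression.

```python
import math

# Embedded-chain probabilities k_r(a): number of Poisson(lambda_a) arrivals
# during one service. For Exp(nu) service times the integral evaluates to
#   k_r = nu * lam^r / (lam + nu)^(r+1)   (a geometric distribution).
lam, nu = 0.5, 1.0      # arrival rate lambda_a and service rate (assumed)

def k_closed(r):
    return nu * lam ** r / (lam + nu) ** (r + 1)

def k_numeric(r, h=1e-3, T=60.0):
    """crude midpoint Riemann sum of the defining integral of k_r(a)."""
    total, s = 0.0, h / 2
    while s < T:
        total += math.exp(-lam * s) * (lam * s) ** r / math.factorial(r) \
                 * nu * math.exp(-nu * s) * h
        s += h
    return total

assert abs(k_closed(0) - 2.0 / 3.0) < 1e-12          # nu/(lam+nu) = 2/3
assert abs(sum(k_closed(r) for r in range(200)) - 1.0) < 1e-12
assert all(abs(k_closed(r) - k_numeric(r)) < 1e-4 for r in range(5))
print([round(k_closed(r), 4) for r in range(4)])     # [0.6667, 0.2222, 0.0741, 0.0247]
```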

If this model is contracting, then there exists a bounding function μ satisfying for all P ∈ P

(i) |r_P| ≤ kμ for some k ∈ ℝ,

(ii) Pμ ≤ ρμ for some 0 ≤ ρ < 1.

Now (i) implies |r(i,a)| ≤ kμ(i), and with r(i,a) ≥ δ > 0 it follows that μ(i) ≥ δk^{-1}, i ≥ 1. Now we may use theorem 2 in [11], which states that there exists a function μ satisfying (ii) and

inf_{i≥1} μ(i) > 0

if and only if the lifetime N of the process (here the number of transitions until state 0 is reached) is exponentially bounded.

So in order that this model is contracting, at least all moments of the lifetime must be finite, and with the inequality

E[N(N-1) ⋯ (N-k+1)] ≥ Σ_{ℓ=k}^∞ ℓ(ℓ-1) ⋯ (ℓ-k+1) ∫_0^∞ e^{-λs} (λs)^ℓ (ℓ!)^{-1} dF(s) = λ^k ∫_0^∞ s^k dF(s)

(cf. [15]) we see that all moments of the service time must be finite as well. Hence the model is certainly not contracting if not all moments of the service time are finite.

On the other hand it is shown in [ISJ that if the k-th moment of the service

time is finite and if

sup Irei,a) I :s; AiR, a

for some A E:m. and all i E S then there exist Liapunov functions Yl""'Yk-t'

.R. < k. We will prove this here using a completely different approach.

First one may show that if the k-th moment of the service time is finite then also the k-th moment of the lifetime of the imbedded process is finite. This may be seen as follows.

....

It is clear that the lifetime is maximized if we use the strategy R which corresponds to the minimal service cost in each state. For that strategy

we have an MIGI 1 queue. And the lifetime of the imbedded process is now

equal to the number of customers N in the busy period of the MIGI) queue.

*

*

Let F be the Laplace transform of the service time and N the transform of the distribution of the number of customers in a busy period. Then we have

$$N^*(t) = e^{-t}\,F^*(\lambda - \lambda N^*(t)), \qquad t > 0,$$

where $\lambda$ is the Poisson parameter (cf. Cohen [2] p. 250). Differentiating this equation once with respect to $t$ gives

$$(6.1)\qquad N^{*\prime}(t) = \frac{-e^{-t}\,F^*(\lambda - \lambda N^*(t))}{1 + \lambda e^{-t}\,F^{*\prime}(\lambda - \lambda N^*(t))}\,.$$

The denominator is bounded from below by

$$1 - \lambda \int_0^\infty s\,dF(s) > 0\,.$$

It is well-known (see for example Feller [5] p. 412) that $N^{*(k)}(t)$ has a finite limit for $t \downarrow 0$ iff

$$\sum_{n=0}^\infty n^k\,\mathbb{P}(N = n) < \infty\,.$$

Then

$$\sum_{n=0}^\infty n^k\,\mathbb{P}(N = n) = (-1)^k N^{*(k)}(0)\,.$$

Differentiating (6.1) one may show by induction that if $F^{*(\ell)}(t)$ has a finite limit for $t \downarrow 0$ for $\ell = 1,\dots,k$, then $N^{*(k)}$ has a finite limit for $t \downarrow 0$ as well.
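To see the functional equation and its derivative at work numerically, one can solve for $N^*(t)$ by fixed-point iteration and recover $\mathbb{E}N = -N^{*\prime}(0)$ by a finite difference. The sketch below (added here as an illustration) assumes exponential service with rate $\mu$, so $F^*(x) = \mu/(\mu+x)$, and the hypothetical values $\lambda = 0.5$, $\mu = 1$; for M|M|1 the busy period then serves $1/(1-\rho) = 2$ customers on average.

```python
import math

def n_star(t, lam, mu, iters=200):
    """Solve x = exp(-t) * F*(lam - lam*x) by fixed-point iteration,
    where F*(x) = mu / (mu + x) is the Laplace-Stieltjes transform
    of an exponential service time with rate mu.

    Starting from 0, the iteration converges to the smallest root,
    which is the probabilistically relevant one."""
    x = 0.0
    for _ in range(iters):
        x = math.exp(-t) * mu / (mu + lam - lam * x)
    return x

lam, mu = 0.5, 1.0            # stable: rho = lam / mu = 0.5 < 1
h = 1e-5

# N*(0) = P(N < infinity) = 1 in the stable case
print(n_star(0.0, lam, mu))

# E[N] = -N*'(0), approximated by a one-sided finite difference;
# for M|M|1 the busy-period size has mean 1 / (1 - rho) = 2
e_n = (n_star(0.0, lam, mu) - n_star(h, lam, mu)) / h
print(e_n)
```

The finite-difference estimate agrees with $1/(1-\rho)$ up to the discretization error in $h$, which is the first-moment instance of the differentiation argument above.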

So we conclude that if the $k$-th moment of the service time is finite then the $k$-th moment of the lifetime of the embedded process is finite as well. Now suppose

$$\int_0^\infty s^k\,dF(s) < \infty \quad\text{and}\quad \sup_P |r_P(i)| \le A i^{k-m-1}$$

for some $A \in \mathbb{R}$ and all $i \in E$. Then we have for all $R$

$$\sum_{t=0}^\infty t^m\,\mathbb{E}_R|r(X_t,Y_t)| = \sum_{t=0}^\infty \mathbb{P}_R(N = t)\,\mathbb{E}_R\Big[\sum_{\ell=0}^t \ell^m |r(X_\ell,Y_\ell)|\,\Big|\,N = t\Big] \le \sum_{t=0}^\infty \mathbb{P}_R(N = t)\sum_{\ell=1}^t \ell^m A t^{k-m-1} \le A \sum_{t=0}^\infty t^k\,\mathbb{P}_R(N = t) < \infty\,,$$

where the inequality

$$\mathbb{E}_R\big[|r(X_\ell,Y_\ell)|\,\big|\,N = t\big] \le A t^{k-m-1}$$

follows immediately from the fact that in the embedded process only one customer is served per unit of time.

So we see that

$$\int_0^\infty s^k\,dF(s) < \infty \quad\text{and}\quad \sup_P |r_P(i)| \le A i^{k-m-1}$$

for some $A \in \mathbb{R}$ and all $i \in E$ imply, using corollary 3.4, the finiteness of the functions $y_1,\dots,y_{m-1}$. Reasoning in a similar way one may show that for $m = 0$ the model is strongly convergent.

7. Nearly Optimal Strategies

In this section we collect some results with respect to nearly optimal strategies for the strongly convergent case. But before we do so we first give an example which shows that there need not exist for all $\varepsilon > 0$ a stationary strategy $P^{(\infty)}$ satisfying

$$(7.1)\qquad v(i,P^{(\infty)}) \ge v(i) - \varepsilon(1 + |v(i)|) \quad\text{for all } i \in E,$$

if we only assume

$$\sup_R \mathbb{E}_R \sum_{n=0}^\infty |r(X_n,Y_n)| < \infty$$

but not 1.8, the uniform tail property or positivity of all $r(i,a)$. For the positive case Ornstein [21] proved the existence of a $P^{(\infty)}$ satisfying (7.1).

Example 7.1. $E = \{1,1',2,2',\dots\}$. In the states $n'$ there is only one available action, yielding an immediate reward $1 - 2^n(1 + \frac{1}{n})$ and a transition to state $n$.

In state $n$ there are two actions. Action 1 gives reward 0 and a transition to state $n+1$ with probability

$$\alpha_n = \frac12\,\frac{b_n}{b_{n+1}}\,, \qquad b_n = 1 + \frac1n\,,$$

and with probability $1 - \alpha_n$ the system leaves $E$. Action 2 gives a reward $2^n$ and the system leaves $E$ with probability 1.

$v$ may be found as follows:

$$v(n) = \sup\big(2^n,\ \alpha_n 2^{n+1},\ \alpha_n\alpha_{n+1} 2^{n+2},\dots\big) = 2^n \sup\Big(1,\ \frac{b_n}{b_{n+1}},\ \frac{b_n}{b_{n+2}},\dots\Big) = 2^n\Big(1 + \frac1n\Big)\,,$$

since $b_n \downarrow 1$ as $n \to \infty$.

We will show that there does not exist a stationary strategy $P^{(\infty)}$ for which $v(n',P^{(\infty)}) \ge 0$ for all $n = 1,2,\dots$.

Any stationary strategy may be characterized by the probabilities $\gamma_n$ with which action 2 is taken in state $n$, $n = 1,2,\dots$. (We consider randomized strategies since, when we were looking for an example, we found that it may occur that although there is no pure $\varepsilon$-optimal strategy there does exist a randomized one.)

We see that for this strategy

$$v(n',R) \le 1 - 2^n\Big(1 + \frac1n\Big) + \gamma_n 2^n + (1 - \gamma_n)\,2^n\Big(1 + \frac1n\Big) = 1 - \gamma_n \frac{2^n}{n}\,.$$

So strategy $R$ gives for state $n'$ an immediate loss of $\gamma_n 2^n/n$ compared to what could be gained. In order that this loss is smaller than 1 we must have $\gamma_n \le n 2^{-n}$. Now let us consider an arbitrary strategy $R$ with $\gamma_n \le n 2^{-n}$ for all $n$ and see what its total expected reward for state $n$ is. Using the inequalities $\alpha_n \le \frac23$, $1 - \gamma_n \le 1$ and $\gamma_n \le n 2^{-n}$, $n = 1,2,\dots$, we get

$$v(n,R) = 2^n \gamma_n + \alpha_n(1-\gamma_n)\gamma_{n+1} 2^{n+1} + \alpha_n\alpha_{n+1}(1-\gamma_n)(1-\gamma_{n+1})\gamma_{n+2} 2^{n+2} + \cdots \le n + \frac23(n+1) + \frac49(n+2) + \cdots = 3n + 6\,.$$

So for $n \ge 4$,

$$v(n',R) \le 1 - 2^n\Big(1 + \frac1n\Big) + 3n + 6 \le -1\,.$$

Hence no stationary strategy $P^{(\infty)}$ exists with

$$v(i,P^{(\infty)}) \ge v(i) - \varepsilon(1 + |v(i)|) \quad\text{for all } i \in E$$

for any $\varepsilon < 1$. This concludes our counterexample.
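The arithmetic in the example can be checked mechanically. The sketch below, added here as an illustration and using exact rational arithmetic, verifies that $\alpha_n \le \frac23$, that $\sum_j (\frac23)^j(n+j) = 3n+6$, and that $1 - 2^n(1+\frac1n) + 3n + 6 \le -1$ for $n \ge 4$.

```python
from fractions import Fraction as F

def alpha(n):
    # alpha_n = (1/2) * b_n / b_{n+1}  with  b_n = 1 + 1/n
    b = lambda m: 1 + F(1, m)
    return F(1, 2) * b(n) / b(n + 1)

# alpha_n <= 2/3 for all n (the maximum is attained at n = 1)
assert all(alpha(n) <= F(2, 3) for n in range(1, 200))

def v_bound(n, terms=200):
    # sum_j (2/3)^j (n + j): an upper bound for v(n, R)
    # when gamma_m <= m 2^{-m} for all m
    return sum(F(2, 3) ** j * (n + j) for j in range(terms))

for n in range(1, 10):
    # the geometric series sums to 3n + 6
    assert abs(float(v_bound(n)) - (3 * n + 6)) < 1e-6

# v(n', R) <= 1 - 2^n (1 + 1/n) + 3n + 6 <= -1 for n >= 4
for n in range(4, 20):
    assert 1 - 2 ** n * (1 + F(1, n)) + 3 * n + 6 <= -1

print("bounds verified")
```

Exact fractions avoid any floating-point doubt about the borderline case $n = 4$, where the bound equals $-1$ exactly.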

Now we continue with some positive results.

If the model is strongly convergent then Howard's policy iteration algorithm converges, and as a result we conclude that in the strongly convergent case for all $i \in E$ and all $\varepsilon > 0$ there exists a stationary strategy $P^{(\infty)}$ such that

$$(7.2)\qquad v(i,P^{(\infty)}) \ge v(i) - \varepsilon$$

(cf. [10]).

If for a sequence $a = (a_0,a_1,\dots)$ with $a_n \uparrow \infty$ uniformly on $E$ it holds that $z_a < \infty$, then for all $\varepsilon > 0$ there exists a stationary strategy $P^{(\infty)}$ such that

$$v(\cdot,P^{(\infty)}) \ge v - \varepsilon z_a$$

(cf. [10]). And if $w_n/a_n \to 0$ uniformly on $E$, then there exists for all $\varepsilon > 0$ a stationary strategy $P^{(\infty)}$ satisfying $v(\cdot,P^{(\infty)}) \ge v - \varepsilon e$ (cf. [10]).

Theorem 7.2. Let $\ell_1$ and $\ell_2$ be Liapunov functions of order 1 and 2, and let either $s \in V(\ell_1)$ or $\limsup_{T\to\infty} P^T s \le 0$. If furthermore $r_P + Ps \ge s - \varepsilon\ell_1$, then

$$v(\cdot,P^{(\infty)}) \ge s - \varepsilon\ell_2\,.$$

Proof. Iterating $r_P + Ps \ge s - \varepsilon\ell_1$ gives us

$$\sum_{n=0}^{T-1} P^n r_P + P^T s \ge s - \varepsilon \sum_{n=0}^{T-1} P^n \ell_1 \ge s - \varepsilon\ell_2\,.$$

Letting $T \to \infty$ yields the desired result. □

The next theorem presents a result under a partly weaker and partly stronger assumption than 1.8.

Theorem 7.3. If $z < \infty$ and

$$\sup_{P \in \mathcal{P}} \sum_{n=1}^\infty n P^n \bar r_P < \infty\,,$$

then there exists for any state $i \in S$ and for all $\varepsilon > 0$ a stationary strategy $P^{(\infty)}$ with

$$v(i,P^{(\infty)}) \ge v(i) - \varepsilon\,.$$

Proof. Fix $i \in E$ and $\varepsilon > 0$. Let strategy $R$ be such that $v(i,R) > v(i) - \varepsilon/4$. Choose $0 < \alpha < 1$ such that

$$\mathbb{E}_{i,R} \sum_{n=0}^\infty \alpha^n r(X_n,Y_n) \ge v(i) - \frac{\varepsilon}{4}$$

and

$$(1 - \alpha)\sup_P \sum_{n=0}^\infty n P^n \bar r_P(i) \le \frac{\varepsilon}{2}\,.$$

The $\alpha$-discounted problem is strongly convergent, hence by (7.2) there exists a $Q$ such that

$$\sum_{n=0}^\infty \alpha^n Q^n r_Q(i) \ge \sup_R \mathbb{E}_{i,R} \sum_{n=0}^\infty \alpha^n r(X_n,Y_n) - \frac{\varepsilon}{4} \ge v(i) - \frac{\varepsilon}{2}\,.$$

Since $1 - \alpha^n \le (1 - \alpha)n$ for $0 < \alpha < 1$ and $n = 0,1,\dots$, we have

$$\sum_{n=0}^\infty Q^n r_Q(i) \ge \sum_{n=0}^\infty \alpha^n Q^n r_Q(i) - (1 - \alpha)\sum_{n=0}^\infty n Q^n \bar r_Q(i) \ge v(i) - \varepsilon\,.$$

Hence $v(i,Q^{(\infty)}) \ge v(i) - \varepsilon$. □
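The key step of this proof, comparing discounted and undiscounted returns via $1 - \alpha^n \le (1-\alpha)n$, can be illustrated on a small transient chain. The kernel $Q$, reward vector $r$ and discount factor below are arbitrary choices made for this sketch, not data from the memorandum.

```python
# Toy substochastic kernel Q (2 states) and reward vector r;
# row sums are < 1, so the chain is transient and all series converge.
Q = [[0.4, 0.3],
     [0.2, 0.5]]
r = [1.0, -2.0]
alpha = 0.95

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

def series(weight, v, terms=2000):
    """Componentwise sum_{n=0}^{terms-1} weight(n) * (Q^n v)."""
    out = [0.0] * len(v)
    cur = v[:]                       # Q^0 v
    for n in range(terms):
        w = weight(n)
        out = [o + w * c for o, c in zip(out, cur)]
        cur = mat_vec(Q, cur)        # advance to Q^{n+1} v
    return out

r_bar = [abs(x) for x in r]
total = series(lambda n: 1.0, r)                 # sum_n Q^n r
discounted = series(lambda n: alpha ** n, r)     # sum_n alpha^n Q^n r
weighted = series(lambda n: n, r_bar)            # sum_n n Q^n |r|

# 1 - alpha^n <= (1 - alpha) n yields, componentwise,
#     total >= discounted - (1 - alpha) * weighted
for t, d, w in zip(total, discounted, weighted):
    assert t >= d - (1 - alpha) * w - 1e-9

print(total, discounted)
```

As $\alpha \uparrow 1$ the correction term $(1-\alpha)\sum_n n Q^n \bar r_Q$ vanishes, which is exactly how $\alpha$ is chosen in the proof.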

Finally a result on optimal strategies.

Theorem 7.4. If the model is strongly convergent then any conserving $P$, i.e. any $P$ with

$$r_P + Pv = v\,,$$

constitutes a stationary optimal strategy.

Proof. Iterating $r_P + Pv = v$ we get

$$\sum_{n=0}^{N-1} P^n r_P + P^N v = v\,.$$

Since

$$\sum_{n=0}^{N-1} P^n r_P \to v(\cdot,P^{(\infty)}) \quad (N \to \infty)$$

and $P^N v \to 0$ $(N \to \infty)$ by (2.3), we have $v(\cdot,P^{(\infty)}) = v$. □

Hence if the model is strongly convergent, $\mathcal{P}$ compact, $w < \infty$, and $r_P$ and $Pw$ continuous in $P$, then there exists a stationary optimal strategy, since with the compactness and continuity assumptions one may show the existence of a conserving $P$. See also chapter 4 in [13].
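Theorem 7.4 suggests the usual computational recipe: compute $v$ by successive approximations, pick a conserving action in every state, and check that the resulting stationary strategy attains $v$. The two-state substochastic MDP below is a hypothetical example constructed for this sketch; the rewards and transition rows are arbitrary.

```python
# Hypothetical 2-state transient MDP: each action is a (reward, P-row)
# pair with row sum < 1, so the process eventually leaves the state
# space and the model is strongly convergent in the sense used above.
actions = {
    0: [(1.0, [0.0, 0.6]), (2.0, [0.0, 0.2])],
    1: [(0.5, [0.3, 0.3]), (1.5, [0.0, 0.0])],
}

def value_iteration(tol=1e-12):
    """Successive approximations v <- max_a (r_a + P_a v)."""
    v = [0.0, 0.0]
    while True:
        nv = [max(r + sum(p * vj for p, vj in zip(row, v)) for r, row in acts)
              for _, acts in sorted(actions.items())]
        if max(abs(a - b) for a, b in zip(nv, v)) < tol:
            return nv
        v = nv

v = value_iteration()

# A conserving P attains r_P + P v = v in every state.
policy = [max(range(len(actions[i])),
              key=lambda a: actions[i][a][0]
              + sum(p * vj for p, vj in zip(actions[i][a][1], v)))
          for i in range(2)]

# Evaluate the stationary strategy by iterating v_P <- r_P + P v_P;
# for a transient P this converges to v(., P^(infinity)).
vp = [0.0, 0.0]
for _ in range(5000):
    vp = [actions[i][policy[i]][0]
          + sum(p * x for p, x in zip(actions[i][policy[i]][1], vp))
          for i in range(2)]

# Theorem 7.4: the conserving policy is optimal, v(., P^(inf)) = v
assert all(abs(a - b) < 1e-8 for a, b in zip(v, vp))
print(v, policy)
```

The final assertion is precisely the conclusion of Theorem 7.4 for this toy model: the conserving stationary strategy achieves the value $v$ computed by successive approximations.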

References

[1] Blackwell, D., Discounted dynamic programming, Ann. Math. Statist. 36, 226-235, 1965.

[2] Cohen, J.W., The single server queue, North-Holland Publishing Company, 1969.

[3] Denardo, E., Contraction mappings in the theory underlying dynamic programming, SIAM Rev. 9, 165-177, 1967.

[4] Derman, C. and R. Strauch, A note on memoryless rules for controlling sequential control processes, Ann. Math. Statist. 37, 276-278, 1966.

[5] Feller, W., An introduction to probability theory and its applications, Vol. II, Wiley, New York, 1966.

[6] Harrison, J., Discrete dynamic programming with unbounded rewards, Ann. Math. Statist. 43, 636-644, 1972.

[7] Hastings, N.A.J., A test for nonoptimal actions in undiscounted finite Markov decision chains, Management Science 23, 87-91, 1976.

[8] Hastings, N.A.J. and J.A.E.E. van Nunen, The action elimination algorithm for Markov decision processes, this volume, 1977.

[9] Hee, K.M. van, Markov strategies in dynamic programming, Memorandum COSOR 75-20, University of Technology, Eindhoven, 1975.

[10] Hee, K.M. van and J. van der Wal, Strongly convergent dynamic programming, Memorandum COSOR 76-26, University of Technology, Eindhoven, 1976.

[11] Hee, K.M. van and J. Wessels, Markov decision processes and strongly excessive functions, Memorandum COSOR 75-22, University of Technology, Eindhoven, 1975.

[12] Hinderer, K., Bounds for stationary finite-stage dynamic programs with unbounded reward functions (to be published).

[13] Hordijk, A., Dynamic programming and Markov potential theory, Mathematical Centre Tracts No. 51, Amsterdam, 1974.

[14] Hordijk, A., Convergent dynamic programming, Technical Report, Department of Operations Research, Stanford University, 1974.

[15] Hordijk, A., Regenerative Markov decision models, to appear in Stochastic Systems: Modeling, Identification and Optimization II, Mathematical Programming Studies, North-Holland, Amsterdam, 1975.

[16] Hordijk, A. and K. Sladky, Sensitive optimality criteria in countable state dynamic programming, to appear in Mathematics of Operations Research, 1975.

[17] Lippman, S.A., Semi-Markov decision processes with unbounded rewards, Management Science 19, 717-731, 1973.

[18] Lippman, S.A., On dynamic programming with unbounded rewards, Management Science 21, 1225-1233, 1975.

[19] MacQueen, J., A modified dynamic programming method for Markovian decision problems, J. Math. Anal. Appl. 14, 38-43, 1966.

[20] Nunen, J.A.E.E. van, Contracting Markov decision processes, Mathematical Centre Tracts No. 71, Amsterdam, 1976.

[21] Ornstein, D., On the existence of stationary optimal strategies, Proc. Amer. Math. Soc. 20, 563-569, 1969.

[22] Shapley, L.S., Stochastic games, Proc. Nat. Acad. Sci. U.S.A. 39, 1095-1100, 1953.

[23] Veinott, A.F., Discrete dynamic programming with sensitive discount optimality criteria, Ann. Math. Statist. 40, 1635-1660, 1969.

[24] Wessels, J., Markov programming by successive approximations with respect to weighted supremum norms, to appear in J. Math. Anal. Appl., 1974.

[25] Wijngaard, J., Stationary Markovian decision problems,
