Note on a dynamic programming recursion
Citation (APA): Wal, van der, J., & Zijm, W. H. M. (1979). Note on a dynamic programming recursion. (Memorandum COSOR; Vol. 7912). Technische Hogeschool Eindhoven.
Published: 01/01/1979
EINDHOVEN UNIVERSITY OF TECHNOLOGY Department of Mathematics
PROBABILITY, STATISTICS AND OPERATIONS RESEARCH GROUP
Memorandum COSOR 79-12
Note on a dynamic programming recursion
by
J. van der Wal and W.H.M. Zijm
Abstract.

In this note we consider, for fixed k ∈ {0,1,...}, the dynamic programming recursion

    v_{n+1}^{(k)} = max_f { \binom{n+k}{k} r_f + P_f v_n^{(k)} },   n = 0,1,...,

where r_f is the reward vector and P_f the transition matrix corresponding to a policy f. It is shown that

    lim_{n→∞} v_n^{(k)} / \binom{n+k}{k+1} = g*,   k = 0,1,...,

where g* is the gain of the corresponding Markov decision process.
1. Introduction and notations.
We are concerned with a dynamical system with finite state space S := {1,2,...,N} and finite action space A, which is observed at discrete points in time t = 0,1,2,.... If, at time t, the system is in state i, we have to choose an action a ∈ A, which results in an immediate payoff r_i^a and transfers the system to state j at time t + 1 with probability p_{ij}^a, where

    Σ_{j∈S} p_{ij}^a = 1,   i ∈ S, a ∈ A.

A policy f is a map f : S → A. With each policy f we associate the immediate reward vector r_f ∈ ℝ^N and the N × N transition matrix P_f, defined by

    (r_f)_i := r_i^{f(i)},   (P_f)_{ij} := p_{ij}^{f(i)},   i,j ∈ S.

A strategy π is a sequence of policies π = (f_0, f_1, ...), where f_n(i) is the action to be taken at time n if the system is in state i. A strategy π is called stationary if f_n = f for all n; the stationary strategy (f, f, ...) is denoted by f^{(∞)}.
For any strategy π = (f_0, f_1, ...) we define, for any k ∈ {0,1,...} and n = 0,1,...,

(1.1)    v_0^{(k)}(π) := 0,

(1.2)    v_{n+1}^{(k)}((f_0, f_1, ...)) := \binom{n+k}{k} r_{f_0} + P_{f_0} v_n^{(k)}((f_1, f_2, ...)),

and

(1.3)    v_n^{(k)} := max_π v_n^{(k)}(π),   k,n = 0,1,....

Then v_n^{(k)} satisfies the following dynamic programming recursion:

    v_{n+1}^{(k)} = max_f { \binom{n+k}{k} r_f + P_f v_n^{(k)} },   k,n = 0,1,....

In this note we will study the asymptotic behaviour of v_n^{(k)} as n tends to infinity. This problem originated from research on more general dynamic programming models with nonnegative matrices (cf. Zijm [3]).
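As a plain numerical illustration (not part of the original memorandum), the recursion can be carried out directly. The following Python sketch runs it for a hypothetical 2-state, 2-action example; since the states are coupled only through P_f v_n^{(k)}, the maximization over policies f reduces to a statewise maximization over actions.

```python
# A minimal sketch (not from the memorandum): running the recursion
#   v_{n+1}^{(k)} = max_f { C(n+k,k) r_f + P_f v_n^{(k)} }
# for a hypothetical 2-state, 2-action example.
from math import comb

# Hypothetical data: p[a][i][j] = transition probability, r[a][i] = payoff.
p = [[[0.9, 0.1], [0.2, 0.8]],   # action 0
     [[0.5, 0.5], [0.6, 0.4]]]   # action 1
r = [[1.0, 0.0],                 # action 0
     [2.0, 0.5]]                 # action 1
N, A = 2, 2

def v_nk(k, steps):
    """Return v_steps^{(k)}, starting from v_0^{(k)} = 0."""
    v = [0.0] * N
    for n in range(steps):
        c = comb(n + k, k)       # binomial weight C(n+k, k)
        # Statewise maximum over actions = maximum over policies f.
        v = [max(c * r[a][i] + sum(p[a][i][j] * v[j] for j in range(N))
                 for a in range(A))
             for i in range(N)]
    return v
```

For k = 0 the weight C(n, 0) = 1 for every n, and this is ordinary value iteration for the average reward problem.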
For k = 0 we are in the situation of average reward Markov decision processes, for which we have the following result, due to Brown [2].
Lemma 1.1.

Suppose k = 0. Then there exist vectors z_1, z_2 ∈ ℝ^N and a policy f* such that

(1.4)    n g* + z_1 ≤ v_n^{(0)}(f*^{(∞)}) ≤ v_n^{(0)} ≤ n g* + z_2   for all n = 0,1,...,

where g* ∈ ℝ^N is the gain of the average reward Markov decision process (compare also Blackwell [1]).

In the next section we will use this lemma to prove, inductively, the following theorem.
Theorem 1.2.

For all k,n = 0,1,...,

    \binom{n+k}{k+1} g* + \binom{n+k-1}{k} z_1 ≤ v_n^{(k)}(f*^{(∞)}) ≤ v_n^{(k)} ≤ \binom{n+k}{k+1} g* + \binom{n+k-1}{k} z_2,

where z_1, z_2, f* and g* are the same as in Lemma 1.1 above. Here \binom{t}{m} is defined to be 0 if t < m.
2. Proof of the theorem

In order to prove the theorem we first derive the following lemma, relating v_{n+1}^{(k)}(π) to v_{n+1}^{(k-1)}(π) and v_n^{(k)}(π), and relating v_{n+1}^{(k)} to v_{n+1}^{(k-1)} and v_n^{(k)}.

Lemma 2.1.

For all k = 1,2,... and n = 0,1,... we have

(2.1)    (i)  v_{n+1}^{(k)}(π) = v_{n+1}^{(k-1)}(π) + v_n^{(k)}(π)   for all π,

(2.2)    (ii) v_{n+1}^{(k)} ≤ v_{n+1}^{(k-1)} + v_n^{(k)}.

Proof.

(i): Let π = (f_0, f_1, ...) be an arbitrary strategy. Then we have

    v_{n+1}^{(k)}(π) = \binom{n+k}{k} r_{f_0} + \binom{n+k-1}{k} P_{f_0} r_{f_1} + ... + \binom{k}{k} P_{f_0} ··· P_{f_{n-1}} r_{f_n}.

And with

(2.3)    \binom{n+k}{k} = \binom{n+k-1}{k-1} + \binom{n+k-1}{k},   k,n = 1,2,...,

and \binom{k}{k} = \binom{k-1}{k-1}, k = 1,2,..., we get

    v_{n+1}^{(k)}(π) = \binom{n+k-1}{k-1} r_{f_0} + \binom{n+k-2}{k-1} P_{f_0} r_{f_1} + ... + \binom{k-1}{k-1} P_{f_0} ··· P_{f_{n-1}} r_{f_n}
                     + \binom{n+k-1}{k} r_{f_0} + \binom{n+k-2}{k} P_{f_0} r_{f_1} + ... + \binom{k}{k} P_{f_0} ··· P_{f_{n-2}} r_{f_{n-1}}
                     = v_{n+1}^{(k-1)}(π) + v_n^{(k)}(π).

(ii): Immediately from (i), by taking maxima. □
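Identity (2.1) can also be checked numerically for a fixed stationary strategy, for which v_n^{(k)}(π) is the explicit sum appearing in the proof above. The following sketch (hypothetical transition matrix and rewards, not part of the memorandum) does so:

```python
# A numerical check (not from the memorandum) of identity (2.1) for a
# stationary strategy pi = f^{(oo)}: with hypothetical P and r,
#   v_{n+1}^{(k)}(pi) = v_{n+1}^{(k-1)}(pi) + v_n^{(k)}(pi).
from math import comb

P = [[0.7, 0.3], [0.4, 0.6]]     # transition matrix of the fixed policy f
r = [1.0, 2.0]                   # reward vector of the fixed policy f

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

def v_pi(k, n):
    """v_n^{(k)}(pi) = sum_{m=0}^{n-1} C(n-1+k-m, k) P^m r."""
    total = [0.0] * len(r)
    term = r[:]                          # holds P^m r, starting at m = 0
    for m in range(n):
        c = comb(n - 1 + k - m, k)
        total = [t + c * x for t, x in zip(total, term)]
        term = mat_vec(P, term)
    return total

k, n = 3, 5
lhs = v_pi(k, n + 1)
rhs = [a + b for a, b in zip(v_pi(k - 1, n + 1), v_pi(k, n))]
# lhs and rhs agree up to rounding, as (2.1) asserts
```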
From (2.2) we get

Lemma 2.2.

For k = 0,1,... and n = 0,1,... we have

(2.4)    v_n^{(k)} ≤ \binom{n+k}{k+1} g* + \binom{n+k-1}{k} z_2,

with z_2 and g* as in Lemma 1.1.

Proof.

The proof proceeds by induction on n and k. For k = 0, formula (2.4) follows from (1.4), and for n = 0 inequality (2.4) reduces to 0 ≤ 0, since v_0^{(k)} = v_0^{(0)} = 0 and \binom{k}{k+1} = \binom{k-1}{k} = 0.

Now assume that (2.4) holds for all pairs (ℓ,m) with ℓ ≤ k and m ≤ n and (ℓ,m) ≠ (k,n). We will prove that (2.4) holds for (k,n). From (2.2) we have, using the induction assumption for (k-1,n) and (k,n-1),

    v_n^{(k)} ≤ v_n^{(k-1)} + v_{n-1}^{(k)} ≤ \binom{n+k-1}{k} g* + \binom{n+k-2}{k-1} z_2 + \binom{n+k-1}{k+1} g* + \binom{n+k-2}{k} z_2.

Hence, with (2.3),

    v_n^{(k)} ≤ \binom{n+k}{k+1} g* + \binom{n+k-1}{k} z_2.

Thus (2.4) holds for all k,n = 0,1,.... □
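The step "hence, with (2.3)" combines two applications of Pascal's rule, one for the g* terms and one for the z_2 terms. A quick check (not part of the memorandum) over a range of k and n:

```python
# Verifying the two binomial identities used in the proof of Lemma 2.2:
#   C(n+k-1, k)   + C(n+k-1, k+1) = C(n+k, k+1)      (for the g* terms)
#   C(n+k-2, k-1) + C(n+k-2, k)   = C(n+k-1, k)      (for the z_2 terms)
from math import comb

for k in range(1, 8):
    for n in range(1, 8):
        assert comb(n + k - 1, k) + comb(n + k - 1, k + 1) == comb(n + k, k + 1)
        assert comb(n + k - 2, k - 1) + comb(n + k - 2, k) == comb(n + k - 1, k)
```

Note that Python's `math.comb` returns 0 when the lower index exceeds the upper one, matching the convention \binom{t}{m} = 0 for t < m adopted in Theorem 1.2.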
Similarly we get Lemma 2.3
For all k,n
=
0,1, ... we havev(k) (f*(oo» 2:. (n+k) * + (n+kk-l)Zl
n \k+l g
*
*
where f , g and 21 are as in lemma 1.1.
5
-Proof.
By induction, using (2.1) and the left hand inequality in (1.4).
Lemmas 2.2 and 2.3 together constitute the proof of Theorem 1.2. As a corollary of Theorem 1.2 we have

Corollary 2.4.

For k = 0,1,...,

    lim_{n→∞} v_n^{(k)} / \binom{n+k}{k+1} = g*.
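The limit in Corollary 2.4 can be observed numerically. The following sketch (not from the memorandum) uses a hypothetical 2-state, 2-action MDP whose optimal gain is g* = (2, 2): in state 1 the best action earns 2 and stays there, and from state 0 it is optimal to move to state 1 (at reward 0) immediately.

```python
# A numerical sketch (not from the memorandum) of Corollary 2.4:
# v_n^{(k)} / C(n+k, k+1) approaches the gain g*, here g* = (2, 2).
from math import comb

# p[a][i][j] = transition probability, r[a][i] = payoff (hypothetical).
p = [[[0.0, 1.0], [0.0, 1.0]],   # action 0: move to / stay in state 1
     [[1.0, 0.0], [1.0, 0.0]]]   # action 1: move to / stay in state 0
r = [[0.0, 2.0],                 # action 0
     [1.0, 0.0]]                 # action 1
N, A = 2, 2

def v_nk(k, steps):
    """Iterate v_{n+1}^{(k)} = max_f { C(n+k,k) r_f + P_f v_n^{(k)} }."""
    v = [0.0] * N
    for n in range(steps):
        c = comb(n + k, k)
        v = [max(c * r[a][i] + sum(p[a][i][j] * v[j] for j in range(N))
                 for a in range(A))
             for i in range(N)]
    return v

n, k = 4000, 1
ratio = [x / comb(n + k, k + 1) for x in v_nk(k, n)]
# ratio approaches g* = (2, 2); by Theorem 1.2 the error is O(1/n)
```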
References.

[1] Blackwell, D., Discrete dynamic programming, Ann. Math. Statist. 33 (1962), 719-726.

[2] Brown, B.W., On the iterative method of dynamic programming on a finite state space discrete time Markov process, Ann. Math. Statist. 36 (1965), 1279-1285.

[3] Zijm, W.H.M., Maximizing the growth of the utility vector in a dynamic programming model, to appear.