Note on a dynamic programming recursion
Citation (APA): Wal, van der, J., & Zijm, W. H. M. (1979). Note on a dynamic programming recursion. (Memorandum COSOR; Vol. 7912). Technische Hogeschool Eindhoven.
Published: 01/01/1979
EINDHOVEN UNIVERSITY OF TECHNOLOGY Department of Mathematics
PROBABILITY, STATISTICS AND OPERATIONS RESEARCH GROUP
Memorandum COSOR 79-12
Note on a dynamic programming recursion
by
J. van der Wal and W.H.M. Zijm
Abstract.

In this note we consider, for fixed k ∈ {0,1,...}, the dynamic programming recursion

    v_{n+1}^{(k)} = max_f { \binom{n+k}{k} r_f + P_f v_n^{(k)} },   n = 0,1,...,

where r_f is the reward vector and P_f the transition matrix corresponding to a policy f. It is shown that

    lim_{n→∞} v_n^{(k)} / \binom{n+k}{k+1} = g*,   k = 0,1,...,

where g* is the gain of the corresponding Markov decision process.
1. Introduction and notations.
We are concerned with a dynamical system with finite state space S := {1,2,...,N} and finite action space A, which is observed at discrete points in time t = 0,1,2,.... If, at time t, the system is in state i, we have to choose an action a ∈ A, which results in an immediate payoff r_i^a and transfers the system to state j at time t + 1 with probability p_{ij}^a, where

    Σ_{j∈S} p_{ij}^a = 1,   i ∈ S, a ∈ A.

A policy f is a map f : S → A. With each policy f we associate the immediate reward vector r_f ∈ ℝ^N and the N × N transition matrix P_f, defined by

    (r_f)_i := r_i^{f(i)},   (P_f)_{ij} := p_{ij}^{f(i)},   i,j ∈ S.

A strategy π is a sequence of policies π = (f_0, f_1, ...), where f_n(i) is the action to be taken at time n if the system is in state i. A strategy π is called stationary if f_n = f for all n; the stationary strategy (f, f, ...) is denoted by f^{(∞)}.
For any strategy π = (f_0, f_1, ...) we define, for any k ∈ {0,1,...} and n = 0,1,...,

(1.1)    v_0^{(k)}(π) := 0,

(1.2)    v_{n+1}^{(k)}((f_0, f_1, ...)) := \binom{n+k}{k} r_{f_0} + P_{f_0} v_n^{(k)}((f_1, f_2, ...)),

and

(1.3)    v_n^{(k)} := max_π v_n^{(k)}(π),   k,n = 0,1,....

Then v_n^{(k)} satisfies the following dynamic programming recursion:

    v_{n+1}^{(k)} = max_f { \binom{n+k}{k} r_f + P_f v_n^{(k)} },   k,n = 0,1,....

In this note we will study the asymptotic behaviour of v_n^{(k)} as n tends to infinity. This problem originated from research on more general dynamic programming models with nonnegative matrices (cf. Zijm [3]).
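As a plain numerical illustration (not part of the original memorandum), the recursion can be carried out directly. The following Python sketch runs it for a hypothetical 2-state, 2-action example; since the states are coupled only through P_f v_n^{(k)}, the maximization over policies f reduces to a statewise maximization over actions.

```python
# A minimal sketch (not from the memorandum): running the recursion
#   v_{n+1}^{(k)} = max_f { C(n+k,k) r_f + P_f v_n^{(k)} }
# for a hypothetical 2-state, 2-action example.
from math import comb

# Hypothetical data: p[a][i][j] = transition probability, r[a][i] = payoff.
p = [[[0.9, 0.1], [0.2, 0.8]],   # action 0
     [[0.5, 0.5], [0.6, 0.4]]]   # action 1
r = [[1.0, 0.0],                 # action 0
     [2.0, 0.5]]                 # action 1
N, A = 2, 2

def v_nk(k, steps):
    """Return v_steps^{(k)}, starting from v_0^{(k)} = 0."""
    v = [0.0] * N
    for n in range(steps):
        c = comb(n + k, k)       # binomial weight C(n+k, k)
        # Statewise maximum over actions = maximum over policies f.
        v = [max(c * r[a][i] + sum(p[a][i][j] * v[j] for j in range(N))
                 for a in range(A))
             for i in range(N)]
    return v
```

For k = 0 the weight C(n, 0) = 1 for every n, and this is ordinary value iteration for the average reward problem.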
For k = 0 we are in the situation of average reward Markov decision processes, for which we have the following result, due to Brown [2].
Lemma 1.1.

Suppose k = 0. Then there exist vectors z_1, z_2 ∈ ℝ^N and a policy f* such that

(1.4)    n g* + z_1 ≤ v_n^{(0)}(f*^{(∞)}) ≤ v_n^{(0)} ≤ n g* + z_2   for all n = 0,1,...,

where g* ∈ ℝ^N is the gain of the average reward Markov decision process (compare also Blackwell [1]).

In the next section we will use this lemma to prove, inductively, the following theorem.
Theorem 1.2.

For all k,n = 0,1,...,

    \binom{n+k}{k+1} g* + \binom{n+k-1}{k} z_1 ≤ v_n^{(k)}(f*^{(∞)}) ≤ v_n^{(k)} ≤ \binom{n+k}{k+1} g* + \binom{n+k-1}{k} z_2,

where z_1, z_2, f* and g* are the same as in Lemma 1.1 above. Here \binom{t}{m} is defined to be 0 if t < m.
2. Proof of the theorem

In order to prove the theorem we first derive the following lemma, relating v_{n+1}^{(k)}(π) to v_{n+1}^{(k-1)}(π) and v_n^{(k)}(π), and relating v_{n+1}^{(k)} to v_{n+1}^{(k-1)} and v_n^{(k)}.

Lemma 2.1.

For all k = 1,2,... and n = 0,1,... we have

(2.1)    (i)  v_{n+1}^{(k)}(π) = v_{n+1}^{(k-1)}(π) + v_n^{(k)}(π)   for all π,

(2.2)    (ii) v_{n+1}^{(k)} ≤ v_{n+1}^{(k-1)} + v_n^{(k)}.

Proof.

(i): Let π = (f_0, f_1, ...) be an arbitrary strategy. Then we have

    v_{n+1}^{(k)}(π) = \binom{n+k}{k} r_{f_0} + \binom{n+k-1}{k} P_{f_0} r_{f_1} + ... + \binom{k}{k} P_{f_0} ··· P_{f_{n-1}} r_{f_n}.

And with

(2.3)    \binom{n+k}{k} = \binom{n+k-1}{k-1} + \binom{n+k-1}{k},   k,n = 1,2,...,

and \binom{k}{k} = \binom{k-1}{k-1}, k = 1,2,..., we get

    v_{n+1}^{(k)}(π) = \binom{n+k-1}{k-1} r_{f_0} + \binom{n+k-2}{k-1} P_{f_0} r_{f_1} + ... + \binom{k-1}{k-1} P_{f_0} ··· P_{f_{n-1}} r_{f_n}
                     + \binom{n+k-1}{k} r_{f_0} + \binom{n+k-2}{k} P_{f_0} r_{f_1} + ... + \binom{k}{k} P_{f_0} ··· P_{f_{n-2}} r_{f_{n-1}}
                     = v_{n+1}^{(k-1)}(π) + v_n^{(k)}(π).

(ii): Immediately from (i), by taking maxima. □
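Identity (2.1) can also be checked numerically for a fixed stationary strategy, for which v_n^{(k)}(π) is the explicit sum appearing in the proof above. The following sketch (hypothetical transition matrix and rewards, not part of the memorandum) does so:

```python
# A numerical check (not from the memorandum) of identity (2.1) for a
# stationary strategy pi = f^{(oo)}: with hypothetical P and r,
#   v_{n+1}^{(k)}(pi) = v_{n+1}^{(k-1)}(pi) + v_n^{(k)}(pi).
from math import comb

P = [[0.7, 0.3], [0.4, 0.6]]     # transition matrix of the fixed policy f
r = [1.0, 2.0]                   # reward vector of the fixed policy f

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

def v_pi(k, n):
    """v_n^{(k)}(pi) = sum_{m=0}^{n-1} C(n-1+k-m, k) P^m r."""
    total = [0.0] * len(r)
    term = r[:]                          # holds P^m r, starting at m = 0
    for m in range(n):
        c = comb(n - 1 + k - m, k)
        total = [t + c * x for t, x in zip(total, term)]
        term = mat_vec(P, term)
    return total

k, n = 3, 5
lhs = v_pi(k, n + 1)
rhs = [a + b for a, b in zip(v_pi(k - 1, n + 1), v_pi(k, n))]
# lhs and rhs agree up to rounding, as (2.1) asserts
```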
From (2.2) we get

Lemma 2.2.

For k = 0,1,... and n = 0,1,... we have

(2.4)    v_n^{(k)} ≤ \binom{n+k}{k+1} g* + \binom{n+k-1}{k} z_2,

with z_2 and g* as in Lemma 1.1.

Proof.

The proof proceeds by induction on n and k. For k = 0, formula (2.4) follows from (1.4), and for n = 0 inequality (2.4) reduces to 0 ≤ 0, since v_0^{(k)} = v_0^{(0)} = 0 and \binom{k}{k+1} = \binom{k-1}{k} = 0.

Now assume that (2.4) holds for all pairs (ℓ,m) with ℓ ≤ k and m ≤ n and (ℓ,m) ≠ (k,n). We will prove that (2.4) holds for (k,n). From (2.2) we have, using the induction assumption for (k-1,n) and (k,n-1),

    v_n^{(k)} ≤ v_n^{(k-1)} + v_{n-1}^{(k)} ≤ \binom{n+k-1}{k} g* + \binom{n+k-2}{k-1} z_2 + \binom{n+k-1}{k+1} g* + \binom{n+k-2}{k} z_2.

Hence, with (2.3),

    v_n^{(k)} ≤ \binom{n+k}{k+1} g* + \binom{n+k-1}{k} z_2.

Thus (2.4) holds for all k,n = 0,1,.... □
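The step "hence, with (2.3)" combines two applications of Pascal's rule, one for the g* terms and one for the z_2 terms. A quick check (not part of the memorandum) over a range of k and n:

```python
# Verifying the two binomial identities used in the proof of Lemma 2.2:
#   C(n+k-1, k)   + C(n+k-1, k+1) = C(n+k, k+1)      (for the g* terms)
#   C(n+k-2, k-1) + C(n+k-2, k)   = C(n+k-1, k)      (for the z_2 terms)
from math import comb

for k in range(1, 8):
    for n in range(1, 8):
        assert comb(n + k - 1, k) + comb(n + k - 1, k + 1) == comb(n + k, k + 1)
        assert comb(n + k - 2, k - 1) + comb(n + k - 2, k) == comb(n + k - 1, k)
```

Note that Python's `math.comb` returns 0 when the lower index exceeds the upper one, matching the convention \binom{t}{m} = 0 for t < m adopted in Theorem 1.2.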
Similarly we get Lemma 2.3
For all k,n
=
0,1, ... we havev(k) (f*(oo» 2:. (n+k) * + (n+kk-l)Zl
n \k+l g
*
*
where f , g and 21 are as in lemma 1.1.
5
-Proof.
By induction, using (2.1) and the left hand inequality in (1.4).
Lemmas 2.2 and 2.3 together constitute the proof of Theorem 1.2. As a corollary of Theorem 1.2 we have

Corollary 2.4.

For k = 0,1,...,

    lim_{n→∞} v_n^{(k)} / \binom{n+k}{k+1} = g*.
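The limit in Corollary 2.4 can be observed numerically. The following sketch (not from the memorandum) uses a hypothetical 2-state, 2-action MDP whose optimal gain is g* = (2, 2): in state 1 the best action earns 2 and stays there, and from state 0 it is optimal to move to state 1 (at reward 0) immediately.

```python
# A numerical sketch (not from the memorandum) of Corollary 2.4:
# v_n^{(k)} / C(n+k, k+1) approaches the gain g*, here g* = (2, 2).
from math import comb

# p[a][i][j] = transition probability, r[a][i] = payoff (hypothetical).
p = [[[0.0, 1.0], [0.0, 1.0]],   # action 0: move to / stay in state 1
     [[1.0, 0.0], [1.0, 0.0]]]   # action 1: move to / stay in state 0
r = [[0.0, 2.0],                 # action 0
     [1.0, 0.0]]                 # action 1
N, A = 2, 2

def v_nk(k, steps):
    """Iterate v_{n+1}^{(k)} = max_f { C(n+k,k) r_f + P_f v_n^{(k)} }."""
    v = [0.0] * N
    for n in range(steps):
        c = comb(n + k, k)
        v = [max(c * r[a][i] + sum(p[a][i][j] * v[j] for j in range(N))
                 for a in range(A))
             for i in range(N)]
    return v

n, k = 4000, 1
ratio = [x / comb(n + k, k + 1) for x in v_nk(k, n)]
# ratio approaches g* = (2, 2); by Theorem 1.2 the error is O(1/n)
```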
References.

[1] Blackwell, D., Discrete dynamic programming, Ann. Math. Statist. 33 (1962), 719-726.

[2] Brown, B.W., On the iterative method of dynamic programming on a finite state space discrete time Markov process, Ann. Math. Statist. 36 (1965), 1279-1285.

[3] Zijm, W.H.M., Maximizing the growth of the utility vector in a dynamic programming model, to appear.