Successive Approximations for Convergent Dynamic Programming

Citation for published version (APA):
van Hee, K. M., Hordijk, A., & van der Wal, J. (1977). Successive approximations for convergent dynamic programming. (Memorandum COSOR; Vol. 7707). Technische Hogeschool Eindhoven.
PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 77-07

Successive Approximations for Convergent Dynamic Programming

by

Kees M. van Hee, Arie Hordijk and Jan van der Wal

Eindhoven, The Netherlands
April 1977
1. Introduction and Preliminaries
The main topic of this paper is the convergence of the method of successive approximations for dynamic programming with the expected total return
criterion.
We first sketch the framework of the dynamic programming model we are dealing with.
Consider a countable set $E$, the state space, and an arbitrary set $A$, the action space, endowed with some $\sigma$-field containing all one-point sets. Let $p$ be a transition probability from $E \times A$ to $E$ (notation: $p(j \mid i,a)$, $i,j \in E$, $a \in A$). Let $H_n := (E \times A)^n \times E$ be the set of histories until time $n$ ($n \geq 1$) and $H_0 := E$. In all generality a strategy $\pi$ is a sequence $(\pi_0, \pi_1, \ldots)$ where $\pi_n$ is a transition probability from $H_n$ to $A$. The set of all strategies is denoted by $\Pi$. The subset $M$ of $\Pi$ consists of all Markov strategies; i.e. $\pi = (\pi_0, \pi_1, \ldots) \in M$ if and only if there is a sequence of functions $f_0, f_1, \ldots$, $f_n : E \to A$, $n = 0,1,\ldots$, such that $\pi_0(\{f_0(i)\} \mid i) = 1$ and $\pi_n(\{f_n(i)\} \mid h_{n-1}, a_{n-1}, i) = 1$ for all $h_{n-1} \in H_{n-1}$, $a_{n-1} \in A$, and $i \in E$. Each $i \in E$ and $\pi \in \Pi$ determine a probability $\mathbb{P}_{i,\pi}$ on $(E \times A)^\infty$ and a stochastic process $\{(X_n, Y_n),\ n = 0,1,\ldots\}$, where $X_n$ is the state and $Y_n$ the action at time $n$. The expectation with respect to $\mathbb{P}_{i,\pi}$ is denoted by $\mathbb{E}_{i,\pi}$.
The reward function $r$ is a real measurable function on $E \times A$. Throughout this paper we assume

(1.1)  $\sup_{\pi \in \Pi} \mathbb{E}_{i,\pi} \left[ \sum_{n=0}^{\infty} r^+(X_n, Y_n) \right] < \infty$  (note that $x^+ := \max(x,0)$).

This assumption guarantees that the expected total return $v(i,\pi) := \mathbb{E}_{i,\pi} \left[ \sum_{n=0}^{\infty} r(X_n, Y_n) \right]$ is defined for all $i \in E$ and $\pi \in \Pi$, and in [9] it is proved, using a well-known theorem of Derman and Strauch [4], that

(1.2)  $\sup_{\pi \in M} v(i,\pi) = \sup_{\pi \in \Pi} v(i,\pi)$  for all $i \in E$.
As a consequence of 1.2 we are mainly interested in Markov strategies and
for that reason we introduce some notations which are especially useful for this class.
First we define the set $P$ of transition probabilities from $E$ to $E$ for which there is a function $f : E \to A$ such that $P(i,\cdot) = p(\cdot \mid i, f(i))$ for all $i \in E$, and further a function $r : E \times P \to \mathbb{R}$ (= the set of reals):

$r_P(i) := \sup\{ r(i,a) \mid P(i,\cdot) = p(\cdot \mid i,a),\ a \in A \}$.

Note that each $\pi \in M$ is completely determined by a sequence $R = (P_0, P_1, \ldots)$, $P_n \in P$, $n = 0,1,\ldots$. Hence we may identify each $\pi \in M$ with such a sequence $R$, and express

$\mathbb{E}_{i,R}\, r(X_n, Y_n) = P_0 \cdots P_{n-1} r_{P_n}(i)$,  for $R = (P_0, P_1, \ldots)$, $i \in E$.

(By convention the empty product of elements of $P$ is the identity operator, and if we omit the subscript $i$ in $\mathbb{E}_{i,R}$ we mean the function on $E$.)
On $E$ we define the functions:

(1.3)  $v := \sup_{R \in M} \mathbb{E}_R \left[ \sum_{n=0}^{\infty} r(X_n, Y_n) \right]$,  the value function;

(1.4)  $v_n^s := \sup_{R \in M} \mathbb{E}_R \left[ \sum_{k=0}^{n-1} r(X_k, Y_k) + s(X_n) \right]$  for a function $s$ on $E$, and $v_n := v_n^0$.

For a sequence $a = (a_0, a_1, \ldots)$ of functions $a_n : E \to \{x \in \overline{\mathbb{R}} \mid x \geq 1\}$ we define the functions $w_a$ and $z_a$ on $E$:

(1.5)  $w_a(i) := \sup_{R \in M} \sum_{n=0}^{\infty} a_n(i) \left| \mathbb{E}_{i,R}[r(X_n, Y_n)] \right|$,  $i \in E$;

(1.6)  $z_a(i) := \sup_{R \in M} \sum_{n=0}^{\infty} a_n(i)\, \mathbb{E}_{i,R} \left| r(X_n, Y_n) \right|$,  $i \in E$;

we write $w$ for $w_a$ and $z$ for $z_a$ if $a_n \equiv 1$ for $n = 0,1,\ldots$;

(1.7)  $y_1 := z$,  $y_n := \sup_{R \in M} \sum_{k=0}^{\infty} \mathbb{E}_R[y_{n-1}(X_k)]$,  $n = 2,3,\ldots$.
A dynamic programming model is said to be stable with respect to scrapfunction $s$ if

$\lim_{n\to\infty} v_n^s(i) = v(i)$  for all $i \in E$.

It is well known that positive, negative and discounted dynamic programming models with finite $E$ and $A$ are stable. But this is not true in convergent dynamic programming, the case that $z$ is finite (see [13], [14]), as is shown by the following example.

Counterexample: $E = \{1,2\}$, $A = \{1,2\}$, $p(1 \mid 1,1) = p(2 \mid 1,2) = 1$, $r(1,1) = 0$, $r(1,2) = 2$; $p(\cdot \mid 2,1) = p(\cdot \mid 2,2) = 0$, $r(2,1) = r(2,2) = -1$. Then $v_n^0(1) = 2$ for all $n \geq 1$, while $v(1) = 2 - 1 = 1$.
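The failure of convergence here is easy to check by machine. The sketch below (our own encoding of the two-state counterexample, not from the paper) iterates the successive approximations $v_{n+1} = Uv_n$ with scrapfunction 0:

```python
# Counterexample: E = {1, 2}; in state 1, action 1 loops with reward 0,
# action 2 moves to state 2 with reward 2; in state 2 both actions pay -1
# and the (substochastic) process then leaves the system.

def U(v):
    """One step of successive approximation (the Bellman operator)."""
    v1 = max(0 + v[1],   # action 1: reward 0, stay in state 1
             2 + v[2])   # action 2: reward 2, go to state 2
    v2 = -1.0            # both actions: reward -1, then absorption
    return {1: v1, 2: v2}

v = {1: 0.0, 2: 0.0}     # scrapfunction s = 0
for _ in range(50):
    v = U(v)

print(v[1], v[2])        # v_n(1) stays at 2, v_n(2) = -1
```

Every iterate gives $v_n(1) = 2$, while the best attainable expected total return from state 1 is $2 - 1 = 1$, so $\lim_n v_n(1) \neq v(1)$.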
It is well known that stability (with respect to scrapfunction 0) is guaranteed if the expected total return from time $n$ onwards tends to zero, as $n$ tends to infinity, uniformly in the strategy. In 1.8 this uniform tail property is formalized:

(1.8)  $\lim_{n\to\infty} \sup_{R \in M} \sum_{k=n}^{\infty} \left| \mathbb{E}_R[r(X_k, Y_k)] \right| = 0$.

In this paper two types of assumptions are considered to guarantee this uniform tail convergence. In section 2 the strong convergence conditions are introduced. A model is called strongly convergent if $w_a$ or $z_a$ is finite for a sequence of functions $a = (a_0, a_1, \ldots)$ with $a_n(i) \to \infty$ for all $i \in E$. It turns out that property 1.8 is equivalent to a strong convergence condition. In section 3 Liapunov functions are introduced and the existence of finite Liapunov functions is related to strong convergence. In section 4 Liapunov functions turn out to be important tools in successive approximations, because they provide bounds for $|v - v_n^s|$ and procedures for excluding suboptimal actions.

In section 5 the connection with contracting dynamic programming is made, and in section 6 a waiting line model with controllable input is presented which satisfies the strong convergence condition but which is not contracting. Finally, in section 7 some results on (nearly) optimal strategies are collected.
We conclude this section with some remarks and notations.
Models with for each $i \in E$ a different action space $A_i$ can easily be transformed into our framework.

In [13] and [14] convergent dynamic programming ($z < \infty$) was studied extensively. In this paper we are almost always working within this framework, since besides the overall assumption 1.1 we work with additional assumptions which are at least as strong as: $w$ is finite. Hence with $w < \infty$ and

$\mathbb{E}_{i,R} |r(X_n, Y_n)| \leq 2\, \mathbb{E}_{i,R}\, r^+(X_n, Y_n) + \left| \mathbb{E}_{i,R}\, r(X_n, Y_n) \right|$

we have

$z(i) \leq 2 \sup_{R \in M} \mathbb{E}_{i,R} \sum_{n=0}^{\infty} r^+(X_n, Y_n) + w(i) < \infty$.
For two extended real valued functions $a$ and $b$ on $E$ we write $a \leq b$ iff $a(i) \leq b(i)$ for all $i \in E$, and $a \leq x$ iff $a(i) \leq x$ for all $i \in E$ (the same holds if $\leq$ is replaced by $<$ or $=$). With the convergence of a sequence of functions on $E$ we mean pointwise convergence, and the supremum of a collection of functions is the pointwise supremum. With convergence of a sequence of elements of $P$ we mean elementwise convergence. For an extended real valued function $a$ and a positive function $b$ on $E$ we write $\frac{a}{b}$ for the function $c(i) := \frac{a(i)}{b(i)}$.

For a nonnegative function $\mu$ on $E$ we introduce the set

$V(\mu) := \{ v : E \to \mathbb{R} \mid |v| \leq k\mu \text{ for some } k \in \mathbb{R} \}$.

On $V(\mu)$ we define the norm $\|\cdot\|_\mu$ by

$\|f\|_\mu := \sup\{ \mu^{-1}(i)|f(i)| \mid i \in E,\ \mu(i) > 0 \}$.

The function $\mu$ is called a bounding function (cf. section 5). For functions $f$ on $E$ with

$\sup_{P \in P} P f^+ < \infty$

we define two well-known operators:

(1.10)  $Uf := \sup_{P \in P} \{ r_P + Pf \}$,  $Df := \sup_{P \in P} Pf$.

Finally we formulate Bellman's optimality equations:

(1.11)  $v_{n+1}^s = U v_n^s$,  $v_0^s = s$;

(1.12)  $v = Uv$.
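For a model with finitely many states and decision rules, the operators $U$ and $D$ and the successive approximations $v_n = U^n 0$ can be sketched as follows (the two-state substochastic chain and all its numbers are illustrative assumptions, not taken from the paper):

```python
# A tiny substochastic model: row sums equal beta < 1 (the discounting is
# absorbed into the transition law). All numbers are illustrative.
beta = 0.9
# decision rules P (elements of the set P), each with its reward vector r_P
rules = [
    ([1.0, 0.0], [[1.0, 0.0], [0.5, 0.5]]),
    ([0.0, 2.0], [[0.0, 1.0], [0.0, 1.0]]),
]

def apply_P(P, f):
    return [beta * sum(p * fj for p, fj in zip(row, f)) for row in P]

def U(f):
    """Uf = sup_P { r_P + P f }, taken componentwise."""
    vals = [[ri + pi for ri, pi in zip(r, apply_P(P, f))] for r, P in rules]
    return [max(col) for col in zip(*vals)]

def D(f):
    """Df = sup_P P f."""
    return [max(col) for col in zip(*[apply_P(P, f) for _, P in rules])]

# successive approximations v_n = U^n 0 converge to the fixed point v = Uv
v = [0.0, 0.0]
for _ in range(500):
    v = U(v)

print(v)   # the value function; it satisfies Bellman's equation 1.12
```

Because the chain is contracting, $U$ is a sup-norm contraction here and the iterates converge geometrically; the limit satisfies $v = Uv$ to machine precision.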
The Liapunov approach was presented by Hordijk at the Advanced Seminar on Markov Decision Theory, Amsterdam 1976. This inspired van Hee and van der Wal to investigate the problem of successive approximations under very general conditions, which resulted in the strong convergence approach. The three of us then joined the investigations, which led to this paper.
2. Strong convergence
One of the main results in this section is the equivalence of the strong convergence condition with the uniform tail property expressed in 1.8. We first give some simple, but useful inequalities.
Throughout this section let $a = (a_0, a_1, \ldots)$ be a nondecreasing sequence of functions, $a_n : E \to \overline{\mathbb{R}}$ with $a_n \geq 1$.
Theorem 2.1.

(i)  $\sup_{R \in M} \sum_{k=n}^{\infty} \left| \mathbb{E}_R\, r(X_k, Y_k) \right| \leq \dfrac{w_a}{a_n}$;

(ii)  $\sup_{R \in M} \sum_{k=n}^{\infty} \mathbb{E}_R \left| r(X_k, Y_k) \right| \leq \dfrac{z_a}{a_n}$.

Proof. Since $a_k(i)$ is nondecreasing in $k$ and $a_n(i) > 0$, we have, for all $i \in E$:

$\sup_{R \in M} \sum_{k=n}^{\infty} \left| \mathbb{E}_{i,R}\, r(X_k, Y_k) \right| \leq \sup_{R \in M} \sum_{k=n}^{\infty} \frac{a_k(i)}{a_n(i)} \left| \mathbb{E}_{i,R}\, r(X_k, Y_k) \right| \leq \frac{w_a(i)}{a_n(i)}$.

The proof of (ii) is similar. □
Lemma 2.2.

(2.1)  $\sup_{R \in M} \sum_{k=n}^{\infty} \mathbb{E}_R \left| r(X_k, Y_k) \right| = D^n z$.

Proof. $z$ is the value function of the model with reward function $|r|$. And further, for $R = (P_0, P_1, \ldots)$,

$\sum_{k=0}^{\infty} P_0 \cdots P_{n+k-1} |r_{P_{n+k}}| \leq P_0 \cdots P_{n-1} z \leq D^n z$,

while by choosing $P_0, \ldots, P_{n-1}$ and the continuation nearly optimally both inequalities can be approached arbitrarily closely, so that the suprema coincide. □
A direct consequence of theorem 2.1 and lemma 2.2 is

(2.2)  $\sup_{R \in M} \mathbb{E}_R |v(X_n)| \leq \sup_{R \in M} \mathbb{E}_R\, z(X_n) \leq \dfrac{z_a}{a_n}$,

and in a similar way one may prove

(2.3)  $\sup_{R \in M} \left| \mathbb{E}_R\, v(X_n) \right| \leq \dfrac{w_a}{a_n}$.

One of the consequences of the above inequalities is that if $z_a < \infty$ for some sequence $a$ with $\lim_{n\to\infty} a_n = \infty$, then

$\lim_{n\to\infty} \mathbb{E}_R |v(X_n)| = 0$

for any strategy. Hence any strategy is equalizing (see chapter 4 of [13]). See also theorem 7.5.
Theorem 2.3 states that $w_a < \infty$ and $\lim_{n\to\infty} a_n = \infty$ guarantee stability. Note that $w_a \leq z_a$.

Theorem 2.3.
Let $w_a < \infty$ and $\lim_{n\to\infty} a_n = \infty$. Then the problem is stable with respect to any scrapfunction $s$ satisfying $\sup_R \mathbb{E}_R\, s^+(X_n) < \infty$, $n = 0,1,\ldots$, and $\sup_R \left| \mathbb{E}_R\, s(X_n) \right| \to 0$ $(n \to \infty)$.
Proof.

$v - v_n^s = \sup_{R \in M} \mathbb{E}_R \left[ \sum_{k=0}^{\infty} r(X_k, Y_k) \right] - \sup_{R \in M} \mathbb{E}_R \left[ \sum_{k=0}^{n-1} r(X_k, Y_k) + s(X_n) \right]$

$\leq \sup_{R \in M} \left| \mathbb{E}_R \sum_{k=n}^{\infty} r(X_k, Y_k) \right| + \sup_{R \in M} \left| \mathbb{E}_R\, s(X_n) \right| \leq \frac{w_a}{a_n} + \sup_{R \in M} \left| \mathbb{E}_R\, s(X_n) \right|$.

Similarly one shows

$v_n^s - v \leq \frac{w_a}{a_n} + \sup_{R \in M} \left| \mathbb{E}_R\, s(X_n) \right|$.

Hence $\lim_{n\to\infty} |v_n^s - v| = 0$. □
So theorem 2.3 gives a new criterion for stability. If $z_a < \infty$ and $\lim_{n\to\infty} a_n = \infty$ we may use scrapfunctions $s$ satisfying $|s| \leq Kz$ for some $K \in \mathbb{R}$, since by theorem 2.1 and lemma 2.2

$\lim_{n\to\infty} \sup_{R \in M} \mathbb{E}_R\, z(X_n) = 0$. □
Consider a dynamic programming model with bounded rewards, say $|r(i,a)| \leq b$ for all $i \in E$, $a \in A$, and let $E_0$ be an absorbing subset of $E$ with $r(i,a) = 0$ for all $i \in E_0$, $a \in A$. Let $T$ be the entrance time in $E_0$. If $\sup_{R \in M} \mathbb{E}_R T < \infty$, then this model satisfies the strong convergence condition in a natural way, since $z_a \leq b \sup_{R \in M} \mathbb{E}_R T$ for $a_n = n + 1$, $n = 0,1,\ldots$. In fact

$|v_n - v| \leq \frac{b}{n+1} \sup_{R \in M} \mathbb{E}_R T$.

Similar expressions can be derived with higher moments of the entrance time.
In general one may say that if $w_a < \infty$ then $|v_n(i) - v(i)|$ tends to zero at a rate at least as fast as $[a_n(i)]^{-1}$.

From the foregoing results the question arises under which conditions there exists a sequence of functions $a$ with $a_n \to \infty$ and $w_a < \infty$. The following theorem gives the already announced characterization.
Theorem 2.4.
There exists a nondecreasing sequence of functions $a = (a_0, a_1, \ldots)$ on $E$ with $\lim_{n\to\infty} a_n = \infty$ and $w_a < \infty$ if and only if property 1.8 holds.

Proof. First the if part. Define

$b_n(i) := \sup_{R \in M} \sum_{k=n}^{\infty} \left| \mathbb{E}_{i,R}[r(X_k, Y_k)] \right|$,  $i \in E$.

Obviously $b_n \geq b_{n+1}$. Now let $a_n(i) := \ell + 1$ if $N_\ell(i) \leq n < N_{\ell+1}(i)$, with $N_0(i) := 0$ and $N_\ell(i) := \min\{n \mid b_n(i) \leq 2^{-\ell}\}$, $\ell = 1,2,\ldots$. Then

$\sup_{R \in M} \sum_{n=N_\ell(i)}^{N_{\ell+1}(i)-1} a_n(i) \left| \mathbb{E}_{i,R}[r(X_n, Y_n)] \right| \leq (\ell + 1) 2^{-\ell}$,  $\ell = 1,2,\ldots$,

and consequently

$w_a(i) \leq \sup_{R \in M} \sum_{n=0}^{N_1(i)-1} \left| \mathbb{E}_{i,R}[r(X_n, Y_n)] \right| + \sum_{\ell=1}^{\infty} (\ell + 1) 2^{-\ell} \leq w + 3 < \infty$.

The only if part is immediate from $w \leq w_a < \infty$ and theorem 2.1(i). □
In theorem 2.5 we collect two sufficient conditions for stability which are weaker than the strong convergence condition. It is well known that positive dynamic programming models are stable, but the strong convergence condition need not be fulfilled there. The following theorem also covers the positive case.

Theorem 2.5.
Each of the following conditions guarantees stability for scrapfunction 0:

(i)  $\liminf_{n\to\infty} \inf_{R \in M} \mathbb{E}_R\, v(X_n) \geq 0$;

(ii)  there exists a nondecreasing sequence $a = (a_0, a_1, \ldots)$ of functions $a_n : E \to \overline{\mathbb{R}}$ with $\lim_{n\to\infty} a_n = \infty$ and

(2.4)  $d_a(i) := \sup_{R \in M} \mathbb{E}_{i,R} \left[ \sum_{n=0}^{\infty} a_n(i)\, r^-(X_n, Y_n) \right] < \infty$.
Proof. For all $R \in M$

$v_n \geq \mathbb{E}_R \left[ \sum_{k=0}^{n-1} r(X_k, Y_k) \right]$.

Hence $\liminf_{n\to\infty} v_n \geq v(\cdot, R)$ for all $R \in M$, and consequently

$\liminf_{n\to\infty} v_n \geq \sup_{R \in M} v(\cdot, R) = v$.

Hence to prove stability we have to show $\limsup_{n\to\infty} v_n \leq v$.

Part (i). By the optimality equation we have $r_P + Pv \leq v$, $P \in P$. Hence by iteration

$\sum_{k=0}^{n-1} P_0 \cdots P_{k-1} r_{P_k} + P_0 \cdots P_{n-1} v \leq v$,

or

$\mathbb{E}_R \left[ \sum_{k=0}^{n-1} r(X_k, Y_k) \right] + \mathbb{E}_R[v(X_n)] \leq v$.

Consequently,

$v_n \leq v - \inf_{R \in M} \mathbb{E}_R[v(X_n)]$.

So with

$\liminf_{n\to\infty} \inf_{R \in M} \mathbb{E}_R[v(X_n)] \geq 0$

we find $\limsup_{n\to\infty} v_n \leq v$.

Part (ii). For $R \in M$

$v(i,R) = \mathbb{E}_R \left[ \sum_{k=0}^{n-1} r(X_k, Y_k) \right] + \mathbb{E}_R \left[ \sum_{k=n}^{\infty} r(X_k, Y_k) \right] \geq \mathbb{E}_R \left[ \sum_{k=0}^{n-1} r(X_k, Y_k) \right] - \sup_{R \in M} \mathbb{E}_R \left[ \sum_{k=n}^{\infty} r^-(X_k, Y_k) \right]$.

Hence, by taking the supremum over $R \in M$,

$v \geq v_n - \sup_{R \in M} \mathbb{E}_R \left[ \sum_{k=n}^{\infty} r^-(X_k, Y_k) \right]$.

Using 2.4 one proves, in a way similar to the proof of theorem 2.1,

$\sup_{R \in M} \mathbb{E}_R \left[ \sum_{k=n}^{\infty} r^-(X_k, Y_k) \right] \leq \frac{d_a}{a_n}$.

Hence

$v \geq \limsup_{n\to\infty} v_n - \lim_{n\to\infty} \frac{d_a}{a_n} = \limsup_{n\to\infty} v_n$. □
Remark. If for some sequence $P_0, P_1, \ldots$ we have

$v_n \leq \sum_{k=0}^{n-1} P_0 \cdots P_{k-1} r_{P_k} + P_0 \cdots P_{n-1} v$,

then $\liminf_{n\to\infty} P_0 \cdots P_{n-1} v \geq 0$ is sufficient for stability, since iteration of the inequality $v \geq r_P + Pv$ yields

$v_n \leq v - P_0 \cdots P_{n-1} v$,

and, as in the proof of theorem 2.5, $\limsup_{n\to\infty} v_n \leq v$ is sufficient for stability.
3. Liapunov Functions and Strong Convergence

We first introduce Liapunov functions. Consider a sequence of nonnegative extended real functions $\ell_1, \ell_2, \ldots$ on $E$ satisfying for all $P \in P$ the inequalities

(3.1)  $\ell_1 \geq |r_P| + P\ell_1$,  $\ell_k \geq \ell_{k-1} + P\ell_k$,  $k = 2,3,\ldots$.

Finite solutions of 3.1 are called Liapunov functions. If $\ell_k$ is finite, $\ell_k$ is called a Liapunov function of order $k$. Note that $\ell_k < \infty$ implies $\ell_{k-1} < \infty$.

Liapunov functions are powerful tools in dynamic programming. They were first studied in a context of dynamic programming in [13] chapter 4, for the convergent dynamic programming model; in chapter 5 of [13] and in [15] Liapunov functions are studied in connection with the average return criterion, for models in which some state is recurrent under each strategy; and in [14] they are used to obtain (partial) Laurent expansions for the expected total discounted return. In section 4 the existence of a Liapunov function of order 2 is assumed to obtain bounds for $|v_n^s - v|$.

The functions $y_1, y_2, \ldots$ defined in 1.7 satisfy Bellman's optimality equation, hence

$y_1 = \sup_{P \in P} \{|r_P| + Py_1\}$  and  $y_k = \sup_{P \in P} \{y_{k-1} + Py_k\} < \infty$,  $k = 2,3,\ldots$.

Hence, if $y_k$ is finite, $y_1, \ldots, y_k$ are Liapunov functions, and moreover it is easy to verify that $\ell_k < \infty$ implies $\ell_n \geq y_n$, $n = 1,2,\ldots,k$. Although we can work with $y_k$ instead of $\ell_k$ for theoretical purposes, it may happen in applications that one can find, in a relatively simple way, Liapunov functions $\ell_1, \ell_2, \ldots, \ell_k$, while the functions $y_1, y_2, \ldots, y_k$ are hard to obtain.

Since there is a large class of Liapunov functions, there still is some freedom to choose an appropriate one. Especially this might improve the bounds in the approximation procedure (see also section 4). In this section we concentrate on the relations between Liapunov functions and strong convergence.
We recall that the finiteness of a Liapunov function of order $k$ is equivalent to the finiteness of $y_1, \ldots, y_k$.

Theorem 3.1.

$y_n \geq \sup_{R \in M} \mathbb{E}_R \sum_{k=0}^{\infty} \binom{k+n-1}{k} \left| r(X_k, Y_k) \right|$.

Remark. Hence $y_n < \infty$ implies $z_a < \infty$ for the sequence of functions $a_k = \binom{k+n-1}{k}$, and consequently the strong convergence condition holds.

Proof. By induction. For $n = 1$ the statement holds by definition 1.7. Suppose it holds for $n - 1$ ($n \geq 2$); then

$y_n = \sup_{R \in M} \sum_{k=0}^{\infty} P_0 \cdots P_{k-1} y_{n-1} \geq \sup_{R \in M} \sum_{k=0}^{\infty} \sum_{m=0}^{\infty} \binom{m+n-2}{m} P_0 \cdots P_{k+m-1} |r_{P_{k+m}}| = \sup_{R \in M} \sum_{m=0}^{\infty} \binom{m+n-1}{m} P_0 \cdots P_{m-1} |r_{P_m}|$,

since $\sum_{j=0}^{m} \binom{j+n-2}{j} = \binom{m+n-1}{m}$. □

So $y_n < \infty$ implies $z_a < \infty$ for $a_k(i) = O(k^{n-1})$, $k \to \infty$. The converse is not true, as shown by the following example.
Counterexample 3.2.
The states $1, 2, \ldots$ are absorbing with reward 0. In the states $n'$, $n = 1,2,\ldots$, there are two actions. Action 1 yields reward 0 and a transition to state $(n+1)'$; action 2 yields reward $n^{-1}$ and a transition to state $n$. Obviously we have for all $R \in M$

$\mathbb{E}_R \sum_{n=0}^{\infty} (n+1) \left| r(X_n, Y_n) \right| \leq 1$,

but since $y_1(n') = n^{-1}$ we have for the strategy $R^*$ yielding transitions from $n'$ to $(n+1)'$ etc. that

$\sum_{k=0}^{\infty} \mathbb{E}_{1',R^*}[y_1(X_k)] = \sum_{k=0}^{\infty} (k+1)^{-1} = \infty$,

so that $y_2(1') = \infty$.
But if we make a slightly stronger assumption than

$\sup_{R \in M} \sum_{n=0}^{\infty} n^{N-1}\, \mathbb{E}_R \left| r(X_n, Y_n) \right| < \infty$,

the finiteness of the functions $y_1, \ldots, y_N$ defined in 1.7 can be shown.

Theorem 3.3.
If for a nondecreasing sequence of positive numbers $a_0, a_1, \ldots$, with

$b := \sum_{n=0}^{\infty} a_n^{-1} < \infty$,

it holds that

$u := \sup_{R \in M} \sum_{n=0}^{\infty} a_n^{N-1}\, \mathbb{E}_R \left| r(X_n, Y_n) \right| < \infty$,

then the functions $y_1, \ldots, y_N$ defined in 1.7 are finite and satisfy the inequalities $y_k \leq u b^{k-1} a_0^{k-N}$, $k = 1, \ldots, N$.

Proof. We will prove by induction

$\sup_{R \in M} \mathbb{E}_R[y_k(X_n)] \leq u b^{k-1} a_n^{k-N}$  for $k = 1,2,\ldots,N-1$, $n = 0,1,2,\ldots$.

Set $k = 1$. Using $y_1 = z$ (by definition) and

$\sup_{R \in M} \mathbb{E}_R\, z(X_n) \leq u a_n^{1-N}$

(from lemma 2.2 and theorem 2.1(ii)) we get the statement for $k = 1$. Now let us assume

$\sup_{R \in M} \mathbb{E}_R[y_k(X_n)] \leq u b^{k-1} a_n^{k-N}$  for $k = 1,\ldots,m \leq N-2$ and $n = 0,1,\ldots$,

and prove that the inequalities hold for $k = m+1$:

$\sup_{R \in M} \mathbb{E}_R[y_{m+1}(X_n)] \leq \sup_{R \in M} \sum_{\ell=n}^{\infty} \mathbb{E}_R[y_m(X_\ell)] \leq \sum_{\ell=n}^{\infty} u b^{m-1} a_\ell^{m-N} \leq u b^{m-1} a_n^{m+1-N} \sum_{\ell=n}^{\infty} a_\ell^{-1} \leq u b^m a_n^{m+1-N}$.

Thus we proved the inequalities for $k = 1,2,\ldots,N-1$, $n = 0,1,\ldots$. Setting $n = 0$ we get $y_k \leq u b^{k-1} a_0^{k-N}$, $k = 1,\ldots,N-1$, and with

$y_N = \sup_{R \in M} \sum_{n=0}^{\infty} \mathbb{E}_R[y_{N-1}(X_n)] \leq \sum_{n=0}^{\infty} u b^{N-2} a_n^{-1}$

we get $y_N \leq u b^{N-1}$. (And obviously $y_1, \ldots, y_N$ are finite.) □
Corollary 3.4.
If $a_n = n^{k+\varepsilon}$ for $n = 1,2,\ldots$ and some $\varepsilon > 0$, then $z_a < \infty$ implies the existence of (finite) Liapunov functions $\ell_1, \ldots, \ell_{k+1}$ satisfying 3.1.

This is immediate from theorem 3.3 with $N = k+1$ and the sequence $n^{1+\varepsilon/k}$.
4. Liapunov Functions and Successive Approximations

In this section we first formulate sufficient conditions for stability in terms of Liapunov functions $\ell_1$ and $\ell_2$ (of order 1 and order 2 respectively).

Lemma 4.1.
If some Liapunov function $\ell_1$ (of order 1) exists and if in addition

$\lim_{n\to\infty} D^n \ell_1 = 0$,

then the problem is stable with respect to scrapfunctions $s \in V(\ell_1)$.

Proof. Since $z \leq \ell_1$ we have

$\lim_{n\to\infty} D^n z = 0$.

By lemma 2.2, theorems 2.4 and 2.3 we have the desired result. □
Lemma 4.2.
If Liapunov functions $\ell_1$ and $\ell_2$ exist, then

$\lim_{n\to\infty} D^n \ell_1 = 0$.

Proof. Consider a new reward structure: $\hat{r}_P := \ell_1 - P\ell_1 \geq |r_P| \geq 0$, $P \in P$. For all $R = (P_0, P_1, \ldots) \in M$ we have

$\sum_{k=0}^{n-1} P_0 \cdots P_{k-1} \hat{r}_{P_k} = \ell_1 - P_0 \cdots P_{n-1} \ell_1$.

Since iteration of $\ell_2 \geq \ell_1 + P\ell_2$ gives $\sum_{k=0}^{n} P_0 \cdots P_{k-1} \ell_1 \leq \ell_2 < \infty$, we have

$\lim_{n\to\infty} P_0 \cdots P_n \ell_1 = 0$  for all $R \in M$.

Hence $\ell_1$ is the function $y_1$, defined in 1.7, for this new model. Therefore, by theorem 3.1, lemma 2.2 and theorem 2.1, we have the desired result. □
As a direct consequence of lemmas 4.1 and 4.2 we have

Theorem 4.3.
If Liapunov functions $\ell_1$ and $\ell_2$ exist, then the problem is stable with respect to scrapfunctions $s \in V(\ell_1)$.

We note that sometimes Liapunov functions $\ell_1$ and $\ell_2$ can be found rather simply, while $y_1$ and $y_2$ are difficult to obtain.
Remark 4.4.
If we assume, besides the existence of a first order Liapunov function $\ell_1$, the compactness of $P$ and the continuity of $P\ell_1$ as a function of $P$, then a sufficient condition for

$\lim_{n\to\infty} D^n \ell_1 = 0$

is

$\lim_{n\to\infty} P^n \ell_1 = 0$  for all $P \in P$.

The proof of this statement proceeds in a way similar to the proof of lemma 5.7 in [13].
Theorem 4.5.
Let $\ell_1$ and $\ell_2$ be Liapunov functions (of order 1 and 2 respectively) and define for a function $s \in V(\ell_1)$

$b_1 := \inf\{\ell_1^{-1}(i)(Us - s)(i) \mid i \in E,\ \ell_1(i) > 0\}$,
$b_2 := \sup\{\ell_1^{-1}(i)(Us - s)(i) \mid i \in E,\ \ell_1(i) > 0\}$;

then

(4.1)  $s + b_1^+ \ell_1 - b_1^- \ell_2 \leq v \leq s + b_2^+ \ell_2 - b_2^- \ell_1$.

Proof. First observe that if $s \in V(\ell_1)$ then also $Us \in V(\ell_1)$, so the set $\{i \mid \ell_1(i) = 0\}$ gives no trouble. Since $Us \leq s + b_2 \ell_1$ and $\ell_1 \leq \ell_2$, we have for $b_2 \geq 0$ that $U^n s \leq s + b_2 \ell_2$ implies

$U^{n+1} s \leq Us + b_2 D\ell_2 \leq s + b_2(\ell_1 + D\ell_2) \leq s + b_2 \ell_2$,

hence $U^n s \leq s + b_2 \ell_2$ for $n = 1,2,\ldots$; for $b_2 < 0$ we have $Us \leq s$, so $U^n s$ is nonincreasing and $U^n s \leq Us \leq s + b_2 \ell_1$. Since the problem is stable (theorem 4.3) we have

$v = \lim_{n\to\infty} U^n s$,

which gives the right inequality. The proof of the left inequality is similar. □

The following, somewhat weaker, but more elegant inequality is now immediate:

(4.2)  $\| v - s \|_{\ell_2} \leq \| Us - s \|_{\ell_1}$.
Remark.
If we have functions $\ell_1$ and $\ell_2$ satisfying the inequalities 3.1 but $\ell_2(i) = \infty$ for some $i$, then we may separate the state space into $E_1 := \{i \in E \mid \ell_2(i) < \infty\}$ and $E_2 := E \setminus E_1$. Since $\ell_2(i) < \infty$ implies $\ell_2(j) < \infty$ for all $j \in E$ which can be reached under some strategy from state $i$, we have that $\ell_1$ and $\ell_2$ are Liapunov functions on the smaller model with state space $E_1$. Hence all results can be generalized to that situation.

If for some $P$, $r_P + Ps = Us$, $\| Us - s \|_{\ell_1}$ is small and $\ell_2 < \infty$, one may use the stationary strategy $R := (P, P, \ldots)$. In section 7, theorem 7.2, we give bounds for the value of this strategy.
It is well known that the $\beta$-discounted dynamic programming model

($\sum_{j} p(j \mid i,a) \leq \beta < 1$ for all $i \in E$, and $|r_P| \leq M$ for some $M \in \mathbb{R}$ and all $P \in P$)

can be brought into our framework by defining an extra absorbing state $-1$ with $r(-1,a) = 0$ for all $a \in A$ and

$p(-1 \mid i,a) = 1 - \sum_{j \in E} p(j \mid i,a)$,  $i \in E$.

In this new model we can take as Liapunov functions the functions defined by $\ell_k(i) = M(1-\beta)^{-k}$, $i \in E$, $\ell_k(-1) = 0$, $k = 1,2$, and then 4.1 becomes slightly weaker than the MacQueen bounds [19], since we work with $b_1^+, b_1^-$ and $b_2^+, b_2^-$ instead of $b_1$ and $b_2$.
In the following theorem $s$ is an approximation for $v$ with known bounds $b_1 \leq v - s \leq b_2$. At the price of the extra calculation of $D^n b_1$ and $D^n b_2$ we obtain bounds for $v$ in terms of $v_n^s$.

Theorem 4.6.
If $b_1 \leq v - s \leq b_2$, then

$v_n^s + D^n b_1 \leq v \leq v_n^s + D^n b_2$,  $n = 1,2,\ldots$.

Proof. For $n = 1$ the statement is trivial. Suppose it holds for $n = k$. Then

$v - v_{k+1}^s \leq \sup_{P} \{r_P + Pv\} - \sup_{P} \{r_P + P v_k^s\} \leq \sup_{P} P(v - v_k^s) \leq \sup_{P} P D^k b_2 = D^{k+1} b_2$,

and the left inequality follows similarly. □

If there is a sequence $P_1, P_2, \ldots$ such that

$v_{n+1}^s \leq r_{P_{n+1}} + P_{n+1} v_n^s$  for $n = 1,2,\ldots$,

then we can use $P_n P_{n-1} \cdots P_1 b_1$ instead of $D^n b_1$. Note that we may choose $b_2 = 0$ if $s \geq v$.

Finally we can use these bounds to eliminate suboptimal actions. (We use the notation with explicitly written actions $a$.)

Action $a$ is called suboptimal or nonconserving in state $i$ if

$r(i,a) + \sum_{j \in E} p(j \mid i,a) v(j) < \sup_{a' \in A} \left\{ r(i,a') + \sum_{j \in E} p(j \mid i,a') v(j) \right\}$.

Hence, if $b_1$ and $b_2$ are bounds on $v$, $b_1 \leq v \leq b_2$, it holds that action $a$ is suboptimal if

$r(i,a) + \sum_{j \in E} p(j \mid i,a) b_2(j) < \sup_{a' \in A} \left\{ r(i,a') + \sum_{j \in E} p(j \mid i,a') b_1(j) \right\}$.

In theorem 4.7 we prove that elimination of suboptimal actions gives a new model with the same value function. We only assume the model satisfies some strong convergence condition. In [14] a similar property is proved without this condition.
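The elimination test only uses bounds $b_1 \leq v \leq b_2$, so it can be applied before $v$ itself is known. A minimal sketch (the model and the bounds are illustrative assumptions, not from the paper):

```python
# Suboptimality test: action a is suboptimal in state i if
#   r(i,a) + sum_j p(j|i,a) b2(j) < max_a' { r(i,a') + sum_j p(j|i,a') b1(j) }.
# The model and the bounds b1 <= v <= b2 are illustrative; here the true
# value function is v = [50, 42.5].

# p[i][a] = transition probabilities, r[i][a] = reward (beta absorbed in p)
p = {0: {0: [0.9, 0.0], 1: [0.0, 0.9]},
     1: {0: [0.0, 0.9], 1: [0.45, 0.45]}}
r = {0: {0: 5.0, 1: 0.0},
     1: {0: 1.0, 1: 0.9}}

b1 = [49.0, 41.0]   # known lower bound on v
b2 = [51.0, 44.0]   # known upper bound on v

def suboptimal(i, a):
    up = r[i][a] + sum(q * b for q, b in zip(p[i][a], b2))
    best_low = max(r[i][aa] + sum(q * b for q, b in zip(p[i][aa], b1))
                   for aa in p[i])
    return up < best_low

for i in p:
    for a in p[i]:
        print(i, a, suboptimal(i, a))
```

For these numbers the test eliminates action 1 in state 0 and action 0 in state 1, which are indeed the nonconserving actions of this little model.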
Theorem 4.7.
Suppose that some strong convergence condition holds. Consider a new model with $\tilde{P} \subset P$ such that for all $\varepsilon > 0$ there is a $P \in \tilde{P}$ with $r_P + Pv \geq v - \varepsilon$. Then the new model has the same value function.

Proof. Fix $\varepsilon > 0$, let $\varepsilon_n := \varepsilon\, 2^{-(n+1)}$ and choose $P_n \in \tilde{P}$ such that

$r_{P_n} + P_n v + \varepsilon_n \geq v$.

Iteration of this inequality yields

$\sum_{n=0}^{N} P_0 \cdots P_{n-1} (r_{P_n} + \varepsilon_n) + P_0 \cdots P_N v \geq v$.

Hence

$\sum_{n=0}^{\infty} P_0 \cdots P_{n-1} r_{P_n} \geq v - \varepsilon$,

since by the strong convergence condition $\lim_{N\to\infty} P_0 \cdots P_N v = 0$. Therefore the value function of the new model is at least $v - \varepsilon$ for every $\varepsilon > 0$. □

As in [7] and [8] we can also exclude actions for a finite number of iterations instead of all future iterations.
Fix some scrapfunction $s$. For notational convenience we omit the dependence on $s$ in the following definitions:

$v_n(i,a) := r(i,a) + \sum_{j \in E} p(j \mid i,a)\, v_{n-1}^s(j)$,

$d_n(i,a) := v_n^s(i) - v_n(i,a)$,

$b_{1,n} := \inf_{i \in E} \{v_n^s(i) - v_{n-1}^s(i)\}$,  $b_{2,n} := \sup_{i \in E} \{v_n^s(i) - v_{n-1}^s(i)\}$,  $\Delta_n := b_{2,n} - b_{1,n}$.

Theorem 4.8.

(i)  $d_{n+k+1}(i,a) \geq d_n(i,a) - \sum_{t=0}^{k} \Delta_{n+t}$;

(ii)  if $d_n(i,a) - \sum_{t=0}^{k} \Delta_{n+t} > 0$, then action $a$ is suboptimal at stage $n+k+1$.

Proof. (ii) is a direct consequence of (i). Since

$v_{n+1}(i,a) - v_n(i,a) = \sum_{j} p(j \mid i,a)\{v_n^s(j) - v_{n-1}^s(j)\} \leq b_{2,n}$

and

$v_{n+1}^s(i) - v_n^s(i) \geq \inf_{a \in A} \sum_{j} p(j \mid i,a)\{v_n^s(j) - v_{n-1}^s(j)\} \geq b_{1,n}$,

we have by subtraction of these inequalities

$d_{n+1}(i,a) \geq d_n(i,a) - \Delta_n$.

Iteration of this inequality yields the desired result. □

Hence, if we determine at stage $n$: $d_n(i,a)$, and at each following stage: $\Delta_{n+k}$, we need not compute $v_{n+k+1}(i,a)$ as long as

$d_n(i,a) - \sum_{t=0}^{k} \Delta_{n+t} > 0$.
5. Contracting Dynamic Programming, Strong Convergence and Liapunov Functions

In this section we show how the contracting dynamic programming model introduced by Van Nunen [20] fits into the framework of strong convergence and Liapunov functions. The model assumptions are as follows:

(5.1) There exist a finite function $b$ and a bounding function $\mu$, and there are constants $k, k' > 0$ and $\rho, \rho'$ with $0 \leq \rho, \rho' < 1$, such that

(i)  $\sup_R \sum_{n=0}^{\infty} \mathbb{E}_R[|b(X_n)|] < \infty$,

and for all $P \in P$

(ii)  $\| r_P - b \|_\mu \leq k$,
(iii)  $P\mu \leq \rho \mu$,
(iv)  $\| Pb - \rho' b \|_\mu \leq k'$.

In the papers of Shapley [22], Blackwell [1] and Denardo [3] it is assumed that the rewards are bounded and that the operator $U$ (def. 1.10) is a contraction with respect to the supremum norm. Veinott [23] showed that transient models can be transformed into discounted models using a similarity transformation, which is equivalent to working with a bounding function (see below). Harrison [6] noticed that in many practical models with a countable state space the reward function is unbounded, and he suggested a modification: he introduced the translation function $b$. But he worked with $\mu \equiv 1$. Lippman [17, 18] remarked that Harrison's model is too restrictive to include, for example, the M/M/1 queueing system with quadratic cost. He introduced a special bounding function: a polynomial. Wijngaard [25] considered exponential bounding functions to study inventory models with the average cost criterion. Wessels [24] gave the first systematic treatment of general bounding functions for total return models with a countable state space. Van Hee and Wessels [11] studied necessary and sufficient conditions for the existence of a bounding function $\mu$ such that for all $P \in P$: $P\mu \leq \rho\mu$, $0 \leq \rho < 1$. Hinderer [12] used bounding functions for finite stage dynamic programming models with a general state space.
Let us denote

$w_P := (1 - \rho')^{-1}(b - Pb)$,  $P \in P$;

then by iteration we find, for $R = (P_0, P_1, \ldots)$,

$\sum_{n=0}^{N} P_0 \cdots P_{n-1} w_{P_n} = (1 - \rho')^{-1}(b - P_0 \cdots P_N b)$.

Since by 5.1(i) we have $\lim_{N\to\infty} P_0 \cdots P_N b = 0$, it follows that

$\sum_{n=0}^{\infty} P_0 \cdots P_{n-1} w_{P_n} = (1 - \rho')^{-1} b$  for every $R$.

Hence the dynamic programming model with reward function $r_P - w_P$, $P \in P$, is equivalent to the original problem. However, $\| r_P - w_P \|_\mu$ is now uniformly bounded. Indeed, with 5.1(ii) and (iv) we find

$\| r_P - w_P \|_\mu = (1-\rho')^{-1} \| (1-\rho') r_P - b + Pb \|_\mu = (1-\rho')^{-1} \| (1-\rho')(r_P - b) + (Pb - \rho' b) \|_\mu \leq (1-\rho')^{-1} \{ (1-\rho') \| r_P - b \|_\mu + \| Pb - \rho' b \|_\mu \} < \infty$.

Hence the contracting dynamic programming model is equivalent to a model satisfying, for $P \in P$ and some $k > 0$:

(5.2)  (i)  $P\mu \leq \rho\mu$,  (ii)  $\| r_P \|_\mu \leq k$.
Note that this model can be reduced in a similar way to a discounted dynamic programming model by transforming the transition probabilities and rewards with the bounding function $\mu$. This is in fact the similarity transformation studied by Veinott [23]. From 5.2(i) and (ii) we have immediately

$\mathbb{E}_R \left| r(X_n, Y_n) \right| \leq k \rho^n \mu$,

and therefore we have, for $1 < \lambda < \rho^{-1}$,

$\sup_R \sum_{n=0}^{\infty} \lambda^n\, \mathbb{E}_R[|r(X_n, Y_n)|] \leq k(1 - \lambda\rho)^{-1} \mu < \infty$.

Thus the contracting dynamic programming model satisfies the strong convergence condition for the sequence $a_n = \lambda^n$. And since $n^k = o(\lambda^n)$ $(n \to \infty)$ for all $k = 1,2,\ldots$, we have by corollary 3.4 that there exist Liapunov functions $\ell_k$ satisfying 3.1 for $k = 1,2,\ldots$.

Apart from this one immediately sees that

$\mu + (1-\rho)^{-1} P\mu \leq (1-\rho)^{-1} \mu$,

thus

$|r_P| + k(1-\rho)^{-1} P\mu \leq k(1-\rho)^{-1} \mu$.

Hence $k(1-\rho)^{-1}\mu$ suffices as Liapunov function $\ell_1$, and it is easily checked that $k(1-\rho)^{-n}\mu$, $n = 1,2,\ldots$, is a system of Liapunov functions satisfying 3.1.
6. Waiting Line Model with Controllable Input; an Example which is Strongly Convergent but not necessarily Contracting

In this section we consider as an example the waiting line model with controllable input which was studied in chapter 5 in [13] and in [15]. In this queueing model the arrival process is Poisson with expected number of arrivals per unit time $\lambda_a$, where $a$ denotes the service cost. We assume that we can control the arrival process by choosing $a$ from the interval $[a_1, a_2]$, and we make the reasonable assumption that $\lambda_a$ decreases as $a$ increases. The service time distribution $F$ is general. At each time a customer completes service, the service cost may be changed. We will be looking at the embedded Markov chain.
The state space becomes $E = \{0, 1, \ldots\}$ and the transition probabilities satisfy

$p(j \mid i,a) = 0$ if $j < i - 1$,  $p(j \mid i,a) = k_{j-i+1}(a)$ if $j \geq i - 1$,

with

$k_r(a) = \int_0^\infty e^{-\lambda_a s} (\lambda_a s)^r (r!)^{-1} \, dF(s)$.

Furthermore we assume

$\lambda_{a_1} \int_0^\infty s \, dF(s) < 1$

and $r(i,a) \geq \delta > 0$ for $i = 1,2,\ldots$ and all $a \in A := [a_1, a_2]$.
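For exponential service times the defining integral of $k_r(a)$ has the geometric distribution as closed form, which gives a convenient numerical check. The sketch below (an illustrative special case with fixed rates, not a model from the paper) integrates the definition and compares:

```python
# k_r(a) = integral of e^{-lam s} (lam s)^r / r!  dF(s): the probability of
# r Poisson arrivals during one service. For exponential service with rate
# mu this reduces to the geometric law k_r = (1-q) q^r, q = lam/(lam+mu).
import math

lam, mu = 1.3, 2.0        # arrival rate lambda_a for some fixed a; service rate

def k(r, n_steps=100000, s_max=30.0):
    # midpoint-rule integration of the definition, dF(s) = mu e^{-mu s} ds
    h = s_max / n_steps
    total = 0.0
    for j in range(n_steps):
        s = (j + 0.5) * h
        total += (math.exp(-lam * s) * (lam * s) ** r / math.factorial(r)
                  * mu * math.exp(-mu * s) * h)
    return total

q = lam / (lam + mu)
for r in range(5):
    print(r, k(r), (1 - q) * q ** r)
```

The two columns agree to roughly seven digits, and the geometric tail makes the assumption $\lambda_{a_1} \int s\, dF(s) < 1$ (here $1.3 \cdot 0.5 < 1$) easy to check as well.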
If one is looking for an average optimal strategy for this problem, then one is interested in the behaviour of the system up to the first time the system empties again. In order to study the behaviour until this time we modify the transition probabilities and rewards in state 0: state 0 is made absorbing, with reward 0.

If this model is contracting, then there exists a bounding function $\mu$ satisfying

(i)  $|r_P| \leq k\mu$ for some $k \in \mathbb{R}$ and all $P \in P$;
(ii)  $P\mu \leq \rho\mu$ for some $0 \leq \rho < 1$ and all $P \in P$.

Now (i) implies $|r(i,a)| \leq k\mu(i)$, and with $r(i,a) \geq \delta > 0$ it follows that $\mu(i) \geq \delta k^{-1}$, $i \geq 1$. Now we may use theorem 2 in [11], which states that there exists a function $\mu$ satisfying (ii) and

$\inf_{i \geq 1} \mu(i) > 0$

if and only if the lifetime $N$ of the process (here the number of transitions until state 0 is reached) is exponentially bounded.
So in order that this model be contracting, at least all moments of the lifetime must be finite, and with the inequality

$\mathbb{E}_R\, N(N-1) \cdots (N-k+1) \geq \sum_{\ell=k}^{\infty} \ell(\ell-1) \cdots (\ell-k+1) \int_0^\infty e^{-\lambda s} \frac{(\lambda s)^\ell}{\ell!} \, dF(s) = \lambda^k \int_0^\infty s^k \, dF(s)$

(cf. [15]) we see that all moments of the service time must be finite as well. Hence the model is certainly not contracting if not all moments of the service time are finite.
On the other hand, it is shown in [15] that if the $k$-th moment of the service time is finite and if

$\sup_a |r(i,a)| \leq A i^{\ell}$

for some $A \in \mathbb{R}$ and all $i \in E$, then there exist Liapunov functions $y_1, \ldots, y_{k-\ell}$, $\ell < k$. We will prove this here using a completely different approach.
First one may show that if the $k$-th moment of the service time is finite, then also the $k$-th moment of the lifetime of the embedded process is finite. This may be seen as follows.

It is clear that the lifetime is maximized if we use the strategy $R^*$ which corresponds to the minimal service cost in each state. For that strategy we have an M/G/1 queue, and the lifetime of the embedded process is now equal to the number of customers $N$ in the busy period of the M/G/1 queue.

Let $F^*$ be the Laplace transform of the service time and $N^*$ the transform of the distribution of the number of customers in a busy period. Then we have

$N^*(t) = e^{-t} F^*(\lambda - \lambda N^*(t))$,  $t > 0$,

where $\lambda$ is the Poisson parameter (cf. Cohen [2] p. 250). Differentiating this equation once with respect to $t$ gives

(6.1)  $N^{*\prime}(t) = \dfrac{-e^{-t} F^*(\lambda - \lambda N^*(t))}{1 + \lambda e^{-t} F^{*\prime}(\lambda - \lambda N^*(t))}$.

The denominator is bounded from below by

$1 - \lambda \int_0^\infty s \, dF(s) > 0$.
It is well known (see for example Feller [5] p. 412) that $N^{*(k)}(t)$ has a finite limit for $t \downarrow 0$ iff

$\sum_{n=0}^{\infty} n^k\, \mathbb{P}(N = n) < \infty$;

then

$\sum_{n=0}^{\infty} n^k\, \mathbb{P}(N = n) = (-1)^k N^{*(k)}(0)$.

Differentiating (6.1) one may show by induction that if $F^{*(\ell)}(t)$ has a finite limit for $t \downarrow 0$ for $\ell = 1, \ldots, k$, then $N^{*(k)}$ has a finite limit for $t \downarrow 0$ as well.
So we conclude that if the $k$-th moment of the service time is finite, then also the $k$-th moment of the lifetime of the embedded process is finite.

Now suppose

$\int_0^\infty s^k \, dF(s) < \infty$  and  $\sup_P |r_P(i)| \leq A i^{k-m-1}$

for some $A \in \mathbb{R}$ and all $i \in E$. Then we have for all $R$

$\sum_{t=0}^{\infty} t^m\, \mathbb{E}_R |r(X_t, Y_t)| = \sum_{t=0}^{\infty} \mathbb{P}_R(N = t)\, \mathbb{E}_R \Big[ \sum_{\ell=0}^{t} \ell^m |r(X_\ell, Y_\ell)| \,\Big|\, N = t \Big] \leq \sum_{t=0}^{\infty} \mathbb{P}_R(N = t)\, t \cdot t^m A t^{k-m-1} = A \sum_{t=0}^{\infty} t^k\, \mathbb{P}_R(N = t) < \infty$,

where the inequality

$\mathbb{E}_R \Big[ \sum_{\ell=0}^{t} \ell^m |r(X_\ell, Y_\ell)| \,\Big|\, N = t \Big] \leq t \cdot t^m A t^{k-m-1}$

follows immediately from the fact that in the embedded process only one customer is served per unit of time, so that $X_\ell \leq t$ on $\{N = t\}$.
So we see that

$\int_0^\infty s^k \, dF(s) < \infty$  and  $\sup_P |r_P(i)| \leq A i^{k-m-1}$

for some $A \in \mathbb{R}$ and all $i \in E$ imply, using corollary 3.4, the finiteness of the functions $y_1, \ldots, y_{m-1}$. Reasoning in a similar way one may show that for $m = 0$ the model is convergent ($z < \infty$).
7. Nearly Optimal Strategies
In this section we collect some results with respect to nearly optimal strategies for the strongly convergent case. But before we do so we first give an example which shows that there need not exist for all $\varepsilon > 0$ a stationary strategy $P^{(\infty)}$ satisfying (7.1) if we only assume

\sup_R \mathbb{E}_R \sum_{n=0}^{\infty} |r(X_n, Y_n)| < \infty

but not 1.8, the uniform tail property or positivity of all $r(i,a)$. For the positive case Ornstein [21] proved the existence of a $P^{(\infty)}$ satisfying (7.1).
satisfying 7.1.Example 7.1. n r=I-2 (1 r=O (l n E:= 1,1',2,2', . . . . In the states nl there is only one available r
=
I - 2 (n+ 1) (I +---1-)
action yielding an n + I immediate reward 1 - 2n(l +1..)
and a n transition to state n.In state n there are two actions. Action 1 gives reward 0 and a transition to
state n + 1 with probability
where b (l
=!
n n bn+1 b n 1=
1 +-n 'and with probability 1 - (l the system leaves E. Action 2 gives a reward 2n
n
and the system leaves E with probability I.
v
may be found as follows( ) (2n 2n+l n + 2 ) v n
=
sup , ( l , ex ex I 2 , ••• n n n+ b b n n n=
2 sup (1, ~ , ~"") n+] n+2 since b n+
1 as n ~ ~.(00) •
We will show that there does not exist a stationary strategy $P^{(\infty)}$ for which $v(n', P^{(\infty)}) \geq 0$ for all $n = 1, 2, \ldots$. Any stationary strategy may be characterized by the probabilities $\gamma_n$ with which action 2 is taken in state $n$, $n = 1, 2, \ldots$. (We consider randomized strategies since, when we were looking for an example, we saw that it may occur that though there is no pure $\varepsilon$-optimal strategy there does exist a randomized one.)
We see that for this strategy

v(n', R) \leq 1 - 2^n \Big(1 + \frac{1}{n}\Big) + \gamma_n 2^n + (1 - \gamma_n)\, 2^n \Big(1 + \frac{1}{n}\Big) = 1 - \frac{\gamma_n 2^n}{n} \; .

So strategy $R$ gives for state $n'$ an immediate loss of $\gamma_n 2^n / n$ compared to what could be gained. In order that this loss is smaller than 1 we must have $\gamma_n \leq n 2^{-n}$. Now let us consider an arbitrary strategy $R$ with $\gamma_n \leq n 2^{-n}$ for
all n and see what its total expected reward for state n is. Using the
inequalities $\alpha_n \leq \frac{2}{3}$, $1 - \gamma_n \leq 1$ and $\gamma_n \leq n 2^{-n}$, $n = 1, 2, \ldots$ we get

v(n, R) = \gamma_n 2^n + \alpha_n (1 - \gamma_n)\, \gamma_{n+1} 2^{n+1} + \alpha_n \alpha_{n+1} (1 - \gamma_n)(1 - \gamma_{n+1})\, \gamma_{n+2} 2^{n+2} + \cdots \leq n + \tfrac{2}{3}(n+1) + \tfrac{4}{9}(n+2) + \cdots = 3n + 6 \; .

So for $n \geq 4$ we have $v(n', R) \leq 1 - 2^n(1 + \frac{1}{n}) + 3n + 6 \leq -1$. Hence no stationary strategy $P^{(\infty)}$ exists with

v(i, P^{(\infty)}) \geq v(i) - \varepsilon (1 + |v(i)|) \qquad \text{for all } i \in E

for any $\varepsilon < \frac{1}{2}$. This concludes our counterexample.
Now we continue with some positive results.
If the model is strongly convergent then Howard's policy iteration algorithm converges. And as a result we conclude that in the strongly convergent case it holds that for all $i \in E$ and all $\varepsilon > 0$ there exists a stationary strategy $P^{(\infty)}$ such that

(7.2) \quad v(i, P^{(\infty)}) \geq v(i) - \varepsilon

(cf. [10]).
If for a sequence $a = (a_0, a_1, \ldots)$ with $a_n \to \infty$ uniformly on $E$ it holds that $z_a < \infty$, then for all $\varepsilon > 0$ there exists a stationary strategy $P^{(\infty)}$ such that

v(\cdot, P^{(\infty)}) \geq v - \varepsilon z_a

(cf. [10]). And if $w_P / a \to 0$ uniformly on $E$ then there exists for all $\varepsilon > 0$ a stationary strategy $P^{(\infty)}$ satisfying $v(\cdot, P^{(\infty)}) \geq v - \varepsilon e$ (cf. [10]).
Theorem 7.2. Let $\ell_1$ and $\ell_2$ be Liapunov functions of order 1 and 2 and let either $s \in V(\ell_1)$ or $\limsup_{T \to \infty} P^T s \leq 0$. If furthermore $r_P + P s \geq s - \varepsilon \ell_1$ then

v(\cdot, P^{(\infty)}) \geq s - \varepsilon \ell_2 \; .

Proof. Iterating $r_P + P s \geq s - \varepsilon \ell_1$ gives us

\sum_{n=0}^{T-1} P^n r_P \geq s - P^T s - \varepsilon \sum_{n=0}^{T-1} P^n \ell_1 \geq s - P^T s - \varepsilon \ell_2 \; ,

since $\ell_2$ is a Liapunov function of order 2. Letting $T \to \infty$ yields the desired result. □
The next theorem presents a result under a partly weaker and partly stronger assumption than 1.8.

Theorem 7.3. If $z < \infty$ and

\sup_{P \in \mathcal{P}} \sum_{n=1}^{\infty} n P^n r_P^- < \infty

then there exists for any state $i \in E$ and for all $\varepsilon > 0$ a stationary strategy $P^{(\infty)}$ with

v(i, P^{(\infty)}) \geq v(i) - \varepsilon \; .
Proof. Fix $i \in E$ and $\varepsilon > 0$. Let strategy $R$ be such that $v(i, R) \geq v(i) - \frac{\varepsilon}{4}$. Choose $0 < \alpha < 1$ such that

\mathbb{E}_{i,R} \sum_{n=0}^{\infty} \alpha^n r(X_n, Y_n) \geq v(i) - \frac{\varepsilon}{4}

and

(1 - \alpha) \sup_P \sum_{n=0}^{\infty} n P^n r_P^-(i) \leq \frac{\varepsilon}{2} \; .

The $\alpha$-discounted problem is strongly convergent, hence by (7.2) there exists a $Q$ such that

\sum_{n=0}^{\infty} \alpha^n Q^n r_Q(i) \geq \sup_R \mathbb{E}_{i,R} \sum_{n=0}^{\infty} \alpha^n r(X_n, Y_n) - \frac{\varepsilon}{4} \geq v(i) - \frac{\varepsilon}{2} \; .
Since $1 - \alpha^n = (1 - \alpha)(1 + \alpha + \cdots + \alpha^{n-1}) \leq (1 - \alpha) n$ for $0 < \alpha < 1$ and $n = 0, 1, \ldots$ we have

\sum_{n=0}^{\infty} Q^n r_Q(i) \geq \sum_{n=0}^{\infty} \alpha^n Q^n r_Q(i) - \sum_{n=0}^{\infty} (1 - \alpha^n) Q^n r_Q^-(i) \geq v(i) - \frac{\varepsilon}{2} - (1 - \alpha) \sum_{n=0}^{\infty} n\, Q^n r_Q^-(i) \geq v(i) - \varepsilon \; .

Hence $v(i, Q^{(\infty)}) \geq v(i) - \varepsilon$. □
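The key estimate in this proof can be illustrated on a hypothetical two-state substochastic chain (the matrix $Q$, reward $r$ and discount factor below are our own toy data, not from the paper): for $0 < \alpha < 1$,

```python
import numpy as np

# Sketch of the estimate: for 0 < alpha < 1,
#   sum_n Q^n r >= sum_n alpha^n Q^n r - (1 - alpha) * sum_n n Q^n r^-,
# which follows from 1 - alpha^n <= (1 - alpha) n and Q^n r >= -Q^n r^-.
Q = np.array([[0.4, 0.3],
              [0.2, 0.5]])          # row sums < 1: the system may leave E
r = np.array([1.0, -2.0])
r_minus = np.maximum(-r, 0.0)       # negative part of the reward
alpha = 0.9
I = np.eye(2)

total = np.linalg.solve(I - Q, r)               # sum_n Q^n r
discounted = np.linalg.solve(I - alpha * Q, r)  # sum_n alpha^n Q^n r
# sum_{n>=1} n Q^n r^- = Q (I - Q)^{-2} r^-
tail = Q @ np.linalg.solve((I - Q) @ (I - Q), r_minus)

# the undiscounted total dominates the discounted total minus the
# correction term, componentwise
assert np.all(total >= discounted - (1 - alpha) * tail)
print(total, discounted - (1 - alpha) * tail)
```

The closed forms $(I - Q)^{-1} r$ and $Q (I - Q)^{-2} r^-$ exist here because the chain is strictly substochastic, which is exactly the role the strong convergence assumption plays in the proof.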
Finally a result on optimal strategies.

Theorem 7.4. If the model is strongly convergent then any conserving $P$, i.e. any $P$ with

r_P + P v = v \; ,

constitutes a stationary optimal strategy $P^{(\infty)}$.

Proof. Iterating $r_P + P v = v$ we get

\sum_{n=0}^{N-1} P^n r_P + P^N v = v \; .

Since

\sum_{n=0}^{N-1} P^n r_P \to v(\cdot, P^{(\infty)}) \qquad (N \to \infty)

and $P^N v \to 0 \; (N \to \infty)$ (2.3), we have $v(\cdot, P^{(\infty)}) = v$. □
Hence if the model is strongly convergent, $\mathcal{P}$ compact, $w < \infty$, and $r_P$ and $P w_P$ continuous in $P$, then there exists a stationary optimal strategy, since with the compactness and continuity assumptions one may show the existence of a conserving $P$. See also chapter 4 in [13].
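A minimal sketch of Theorem 7.4 on a hypothetical two-state model (the rewards and transition probabilities are our own toy data): compute $v$ by successive approximation, pick a conserving $P$ by maximizing $r_P + P v$, and check that the stationary strategy attains $v$.

```python
# Toy strongly convergent model: in every state the system leaves E with
# positive probability, so all value iterations below converge.
actions = {
    0: [(1.0, {0: 0.5}),      # (reward, substochastic transition row)
        (0.8, {1: 0.9})],
    1: [(2.0, {})],           # single action: reward 2, then leave E
}

def value_iteration(iterations=200):
    v = {s: 0.0 for s in actions}
    for _ in range(iterations):
        v = {s: max(r + sum(p * v[t] for t, p in row.items())
                    for r, row in actions[s])
             for s in actions}
    return v

v = value_iteration()

# A conserving P picks, in every state, an action with r_P + P v = v.
conserving = {s: max(actions[s],
                     key=lambda ar: ar[0] + sum(p * v[t]
                                                for t, p in ar[1].items()))
              for s in actions}

# The stationary strategy P^(infinity) then achieves v: iterate the
# partial sums sum_{n<N} P^n r_P and compare with v.
v_P = {s: 0.0 for s in actions}
for _ in range(200):
    v_P = {s: conserving[s][0] + sum(p * v_P[t]
                                     for t, p in conserving[s][1].items())
           for s in actions}

print(v, v_P)
```

Here $v(1) = 2$ and $v(0) = \max(1 + 0.5\, v(0),\; 0.8 + 0.9 \cdot 2) = 2.6$, so the conserving action in state 0 is the second one, and the policy value coincides with $v$.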
References
[1] Blackwell, D., Discounted dynamic programming, Ann. Math. Statist. 36, 226-235, 1965.
[2] Cohen, J.W., The single server queue, North-Holland Publishing Company, 1969.
[3] Denardo, E., Contraction mappings in the theory underlying dynamic programming, SIAM Rev. 9, 165-177, 1967.
[4] Derman, C. and R. Strauch, A note on memoryless rules for controlling sequential control processes, Ann. Math. Statist. 37, 276-278, 1966.
[5] Feller, W., An introduction to probability theory and its applications, Vol. II, Wiley, New York, 1966.
[6] Harrison, J., Discrete dynamic programming with unbounded rewards, Ann. Math. Statist. 43, 636-644, 1972.
[7] Hastings, N.A.J., A test for nonoptimal actions in undiscounted finite Markov decision chains, Management Science 23, 87-91, 1976.
[8] Hastings, N.A.J. and J.A.E.E. van Nunen, The action elimination algorithm for Markov decision processes, this volume, 1977.
[9] Hee, K.M. van, Markov strategies in dynamic programming, Memorandum COSOR 75-20, University of Technology, Eindhoven, 1975.
[10] Hee, K.M. van and J. van der Wal, Strongly convergent dynamic programming, Memorandum COSOR 76-26, University of Technology, Eindhoven, 1976.
[11] Hee, K.M. van and J. Wessels, Markov decision processes and strongly excessive functions, Memorandum COSOR 75-22, University of Technology, Eindhoven, 1975.
[12] Hinderer, K., Bounds for stationary finite-stage dynamic programs with unbounded reward functions (to be published).
[13] Hordijk, A., Dynamic programming and Markov potential theory, Mathematical Centre Tracts No. 51, Amsterdam, 1974.
[14] Hordijk, A., Convergent dynamic programming, Technical Report, Department of Operations Research, Stanford University, 1974.
[15] Hordijk, A., Regenerative Markov decision models, to appear in Stochastic Systems: Modeling, Identification and Optimization II, Mathematical Programming Studies, North-Holland, Amsterdam, 1975.
[16] Hordijk, A. and K. Sladky, Sensitive optimality criteria in countable state dynamic programming, to appear in Mathematics of Operations Research, 1975.
[17] Lippman, S.A., Semi-Markov decision processes with unbounded rewards, Management Science 19, 717-731, 1973.
[18] Lippman, S.A., On dynamic programming with unbounded rewards, Management Science 21, 1225-1233, 1975.
[19] MacQueen, J., A modified dynamic programming method for Markovian decision problems, J. Math. Anal. Appl. 14, 38-43, 1966.
[20] Nunen, J.A.E.E. van, Contracting Markov decision processes, Mathematical Centre Tracts No. 71, Amsterdam, 1976.
[21] Ornstein, D., On the existence of stationary optimal strategies, Proc. Amer. Math. Soc. 20, 563-569, 1969.
[22] Shapley, L.S., Stochastic games, Proc. Nat. Acad. Sci. U.S.A. 39, 1095-1100, 1953.
[23] Veinott, A.F., Discrete dynamic programming with sensitive discount optimality criteria, Ann. Math. Statist. 40, 1635-1660, 1969.
[24] Wessels, J., Markov programming by successive approximations with respect to weighted supremum norms, to appear in Journ. Math. Anal. Appl., 1974.
[25