The generation of successive approximation methods for
Markov decision processes by using stopping times
Citation for published version (APA):
van Nunen, J. A. E. E., & Wessels, J. (1977). The generation of successive approximation methods for Markov
decision processes by using stopping times. In H. C. Tijms, & J. Wessels (Eds.), Markov Decision Theory :
Proceedings of the advanced seminar, Amsterdam, The Netherlands, September 13-17, 1976 (pp. 25-37).
(Mathematical Centre Tracts; Vol. 93). Stichting Mathematisch Centrum.
Document status and date:
Published: 01/01/1977
Document Version:
Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)
THE GENERATION OF SUCCESSIVE APPROXIMATIONS FOR
MARKOV DECISION PROCESSES
BY USING STOPPING TIMES

J.A.E.E. van Nunen
Graduate School of Management, Delft, The Netherlands

J. Wessels
Eindhoven University of Technology, Eindhoven, The Netherlands
1. INTRODUCTION
In [5] we introduced the standard successive approximation method
for Markov decision processes with respect to the total expected reward
criterion. In fact there exist some variants of this method. These variants
differ in the policy improvement procedure: the standard procedure may
be replaced by a Gauss-Seidel procedure (see e.g. HASTINGS [1], KUSHNER and
KLEINMAN [4]), an overrelaxation procedure (see REETZ [11] and
SCHELLHAAS [13]) or some other variants (see VAN NUNEN [8]). In [14] it has been shown
that such variants can be generated by stopping times. This approach has
been generalized in [6]. In section 2 we will introduce the main idea of
this approach.
Policy iteration, with its several variants, as introduced by HOWARD [3]
is usually not viewed upon as a successive approximation technique.
However, in [7] it has been shown to be an extreme element of a class of
extended successive approximation techniques, the so-called value-oriented
methods. This approach has been combined in [9] with the stopping time
approach. In [6] a further generalization has been given (mainly with
respect to the conditions). Value-oriented methods will be treated in
section 3. Section 4 will be devoted to upper and lower bounds for the techniques
presented in the earlier sections. Furthermore some remarks on numerical aspects will be made.

In this paper we will use the same notations as in [5]; however, in order to keep the proofs simple, we will work under somewhat stronger assumptions. In fact, our assumptions are the same as those in [6]. For details we will refer repeatedly to [6].
ASSUMPTIONS. Our assumptions are the same as the assumptions in [5], with assumption 2.3(i) replaced by

(a) ∃ M > 0 such that ‖r(f) − r‖ ≤ M for all f ∈ F;

(b) sup_{π∈M} Σ_{n=0}^∞ […] < ∞ for all i ∈ S.
These stronger assumptions make the spaces V and W superfluous.
As remarked in [5] (remark 5.1) one may replace r in the assumptions
(and definition of V) by a vector b with b − r ∈ W. We will do so in this paper in order to facilitate referring to [6].
2. STOPPING TIMES AND SUCCESSIVE APPROXIMATIONS
In this section we will show that each stopping time characterized by a
go-ahead function δ for the sequence {X_n}_{n=0}^∞ induces an operator U_δ on V, such that U_δ is monotone and (usually) contracting.
Furthermore all these contracting operators on V have the same unique
fixed point v*. So we have for any v_0 ∈ V and any δ:

v_n := U_δ v_{n−1} → v*  and  v_n ∈ V for n = 1,2,… .
DEFINITION 2.1. A (randomized) go-ahead function δ is a function which maps
G := ∪_{n=1}^∞ S^n into [0,1]. By Δ we denote the set of all go-ahead functions.
1 − δ(s_0, s_1, …, s_n) will be interpreted as the probability to stop the process at time n, given that X_0 = s_0, …, X_n = s_n and the process has not been stopped earlier.
DEFINITION 2.2.
(a) δ ∈ Δ is said to be nonrandomized if δ(a) ∈ {0,1} for all a ∈ G;
(b) δ ∈ Δ is said to be nonzero if δ(i) > ε > 0 for some ε and all i ∈ S;
(c) δ ∈ Δ is said to be transition memoryless if δ(a) only depends on the last two entries of a, for those a with at least two entries and satisfying δ(s_0,…,s_k) ≠ 0 for all k < n, if a = s_0,…,s_n.

So for a transition memoryless go-ahead function the stopping probability only depends on the most recent transition. The relevance of this notion will become clear in the course of this section.
EXAMPLES 2.1. Below some examples of nonzero go-ahead functions will be given. These examples will be used repeatedly in this paper.
(a) Define the go-ahead function δ_n (n = 1,2,…) by δ_n(a) := 1 if a contains less than n+1 entries, otherwise δ_n(a) := 0. The go-ahead functions δ_n are nonrandomized; δ_n is only transition memoryless if n = 1.
(b) Define δ_R by δ_R(s, s, …, s) := 1 for all s and all sequences of finite length, δ_R(a) := 0 otherwise. δ_R is nonrandomized and transition memoryless.
(c) Define δ_H by δ_H(s_0, …, s_n) := 1 if s_n < s_{n−1} (any n), otherwise δ_H(a) := 0. δ_H is nonrandomized and transition memoryless.
(d) δ_r(i) := ½ for all i ∈ S, δ_r(a) := 0 elsewhere. δ_r is transition memoryless.
Since we introduced a probabilistic go-ahead concept, we have to incorporate it in the probability space and measure. Therefore we extend the space (S × A)^∞ (see [5] section 2) to (S × E × A)^∞, with E := {0,1}.
Furthermore the stochastic process {X_t, Z_t}_{t=0}^∞ is extended to {X_t, Y_t, Z_t}_{t=0}^∞,
where Y_t = 0 as long as the process may go ahead. Now any starting state i, go-ahead function δ, and any decision rule
π determine a probability measure on (S × E × A)^∞ with the required properties in an obvious way (see [6] for details). This probability measure will be denoted by P_i^{π,δ}. Expectations will be denoted by E_i^{π,δ}. Note that P_i^{π,δ} and P_i^π are equal for events which do not depend on the variables Y_t.

In fact the go-ahead concept induces a stopping time.
DEFINITION 2.3. The random variable τ taking values in {0,1,…} ∪ {∞} is defined by: τ = n if Y_n = 1 and Y_t = 0 for all t = 0,1,…,n−1; τ = ∞ if Y_t = 0 for all t. τ is a randomized stopping time with respect to X_0, X_1, … .
Now we will introduce our operators.
DEFINITION 2.4. For each δ ∈ Δ and each strategy (= nonrandomized decision
rule) π the operator L_δ^π on V is defined by

(L_δ^π v)(i) := E_i^{π,δ} [ Σ_{n=0}^{τ−1} r(X_n, Z_n) + v(X_τ) ]  for v ∈ V,

with v(X_τ) := 0 if τ = ∞.

LEMMA 2.1. L_δ^π is monotone and (for nonzero δ) strictly contracting on V. Therefore L_δ^π possesses a unique fixed point v_δ^π in V.

PROOF. The contraction factor ρ_δ^π of L_δ^π is the norm of the matrix with (i,j) entry P_i^{π,δ}(τ < ∞, X_τ = j); it is < 1 if and only if δ is nonzero.
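The stopped-process reading of this operator can be illustrated by simulation: run the process, decide after each transition whether to go ahead, and average Σ_{n<τ} r(X_n,Z_n) + v(X_τ). The sketch below uses invented 3-state data for one fixed policy (the discount is absorbed into the substochastic matrix P, so the missing row mass is the probability that the process terminates, contributing 0 as in the convention for τ = ∞), takes the Gauss-Seidel choice δ_H (go ahead exactly when the state index decreases), and compares the Monte Carlo estimate with the closed-form Gauss-Seidel operator given in the examples below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: substochastic transition matrix P under one fixed policy,
# one-step rewards r, and a terminal value vector v.
P = 0.9 * np.array([[0.5, 0.5, 0.0],
                    [0.2, 0.3, 0.5],
                    [0.0, 0.4, 0.6]])
r = np.array([1.0, 2.0, 0.5])
v = np.array([3.0, -1.0, 2.0])
C = np.cumsum(P, axis=1)          # cumulative rows, for sampling transitions

def exact_L(v):
    # the Gauss-Seidel operator induced by delta_H:
    # (L v)(i) = r(i) + sum_{j<i} P(i,j) (L v)(j) + sum_{j>=i} P(i,j) v(j)
    w = np.array(v, dtype=float)
    for i in range(len(v)):
        w[i] = r[i] + P[i, :i] @ w[:i] + P[i, i:] @ v[i:]
    return w

def simulated_L(i0, n_runs=50_000):
    # the definition read literally: collect sum_{n<tau} r(X_n) + v(X_tau),
    # with v(X_tau) := 0 if tau = infinity (here: the process terminates)
    total = 0.0
    for _ in range(n_runs):
        i, acc = i0, 0.0
        while True:
            acc += r[i]                            # reward for n < tau
            u = rng.random()
            if u >= C[i, -1]:                      # terminated before stopping
                break
            j = int(np.searchsorted(C[i], u, side="right"))
            if j < i:                              # delta_H(i,j) = 1: go ahead
                i = j
            else:                                  # stopped: terminal value
                acc += v[j]
                break
        total += acc
    return total / n_runs

print("exact    :", exact_L(v))
print("simulated:", [round(simulated_L(i), 2) for i in range(3)])
```

The two printed vectors agree up to Monte Carlo noise, which is a direct check that the closed-form Gauss-Seidel expression is the expectation in the definition above.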
EXAMPLES 2.2. Take for π an arbitrary stationary strategy (f,f,…).

(a) (L_{δ_1}^π v)(i) = r(i,f(i)) + Σ_j p^{f(i)}(i,j) v(j);

(b) (L_{δ_R}^π v)(i) = [1 − p^{f(i)}(i,i)]^{−1} [ r(i,f(i)) + Σ_{j≠i} p^{f(i)}(i,j) v(j) ];

(c) (L_{δ_H}^π v)(i) = r(i,f(i)) + Σ_{j<i} p^{f(i)}(i,j) (L_{δ_H}^π v)(j) + Σ_{j≥i} p^{f(i)}(i,j) v(j);

(d) (L_{δ_r}^π v)(i) = ½ v(i) + ½ [ r(i,f(i)) + Σ_j p^{f(i)}(i,j) v(j) ];

(e) let δ be nonzero, then v_δ^π = v(π), independent of δ.
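The closed-form operators (a)-(c) and statement (e) can be checked numerically. The sketch below uses invented 3-state data for one fixed stationary policy, again with the discount absorbed into the substochastic matrix P, and verifies that the standard, Jacobi and Gauss-Seidel operators all have the same fixed point v(f).

```python
import numpy as np

# Invented data for one fixed stationary policy f: substochastic P, rewards r.
P = 0.9 * np.array([[0.5, 0.5, 0.0],
                    [0.2, 0.3, 0.5],
                    [0.0, 0.4, 0.6]])
r = np.array([1.0, 2.0, 0.5])

def L_standard(v):
    # (a) delta_1: one transition, then stop
    return r + P @ v

def L_jacobi(v):
    # (b) delta_R: go ahead while the state repeats
    diag = np.diag(P)
    return (r + P @ v - diag * v) / (1.0 - diag)

def L_gauss_seidel(v):
    # (c) delta_H: go ahead on downward transitions j < i
    w = np.array(v, dtype=float)
    for i in range(len(v)):
        w[i] = r[i] + P[i, :i] @ w[:i] + P[i, i:] @ v[i:]
    return w

def fixed_point(L, v0, tol=1e-12):
    v = v0
    while True:
        w = L(v)
        if np.max(np.abs(w - v)) < tol:
            return w
        v = w

# statement (e): every nonzero delta yields the same fixed point v(f)
v0 = np.zeros(3)
vals = [fixed_point(L, v0) for L in (L_standard, L_jacobi, L_gauss_seidel)]
print(np.allclose(vals[0], vals[1]), np.allclose(vals[0], vals[2]))
```

All three iterations converge to v(f) = (I − P)^{-1} r, though at different speeds; the Jacobi and Gauss-Seidel variants contract faster than the standard operator on this example.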
REMARK 2.1. If π is a nonstationary strategy then there exist values for
{p^a(i,j), r(i,a)} and go-ahead functions δ′ and δ″ such that v_{δ′}^π ≠ v_{δ″}^π
(see lemma 5.1.7 in [6]). We now come to the operators U_δ.
DEFINITION 2.5. The operator U_δ on V is defined by

U_δ v := sup_π L_δ^π v,

where the supremum is taken componentwise.

Note that L_δ^π has only been defined for strategies π, so the supremum is only taken over the strategies (= nonrandomized decision rules). Extension to the randomized decision rules would not affect the value of U_δ v.

THEOREM 2.1. Let δ ∈ Δ; then U_δ is monotone and (only for nonzero δ) strictly contracting with contraction radius ν_δ := sup_π ρ_δ^π. Therefore U_δ possesses (for nonzero δ) a unique fixed point. v* is fixed point for all U_δ with δ nonzero.
PROOF. For details we refer to the proof of theorem 5.2.1 in [6]. With respect to the last statement we remark: L_δ^π v(π) = v(π) if π = (f,f,…). Since f may be chosen such that v(π) ≥ v* − εμ ([5] theorem 3.1
(ii)), we obtain U_δ v* ≥ v*. If we had U_δ v* > v*, then it would be possible to construct a strategy with total expected reward exceeding v*, which is impossible.
This theorem serves as the basis for a δ-based successive approximation algorithm, since v_n := U_δ v_{n−1} converges in norm to v* if v_0 ∈ V. In the definition of U_δ we take the supremum over all strategies. One
would naturally prefer to restrict oneself to Markov strategies and even use the algorithm for constructing ε-optimal stationary strategies. The following theorem (for the proof we refer to [6] theorems 5.2.2 and 5.2.3) shows that the concept of transition memoryless go-ahead functions plays a crucial role in this problem.
THEOREM 2.2.
(a) Let δ be transition memoryless, ε > 0, v ∈ V. Then there exists a policy f such that L_δ^f v ≥ U_δ v − εμ.
(b) Let δ be not transition memoryless; then there exist values for the parameters {p^a(i,j), r(i,a)} such that for some v ∈ V and some ε > 0 there is no f ∈ F with L_δ^f v ≥ U_δ v − εμ.

Hence, if δ is transition memoryless we have

U_δ v = sup_f L_δ^f v,

where the sup is not necessarily componentwise. Whereas if δ is not transition memoryless, sup_f L_δ^f v may only be defined componentwise and may not be equal to U_δ v. For nonzero and transition memoryless go-ahead functions we now obtain the following iteration procedures.
(a) (if the sup in U_δ v is attained for some f). Choose v_0 ∈ V, define v_n := U_δ v_{n−1} and choose f_n such that L_δ^{f_n} v_{n−1} = v_n; then
(i) ‖v_n − v*‖ ≤ ν_δ^n ‖v_0 − v*‖;
(ii) ‖v_n − v(f_n)‖ ≤ (1 − ν_δ)^{−1} ν_δ ‖v_n − v_{n−1}‖;
(iii) if v_0 satisfies U_δ v_0 ≥ v_0, then v_{n−1} ≤ v_n ≤ v(f_n) ≤ v*.

(b) Choose f_n (n = 1,2,…) such that

L_δ^{f_n} v_{n−1} ≥ max{ v_{n−1}, U_δ v_{n−1} − ε(1 − ν_δ)μ },

and define v_n := L_δ^{f_n} v_{n−1}; then, as in (ii), ‖v_n − v(f_n)‖ < ε for n sufficiently large, and v_n ≤ v(f_n) ≤ v*.

In fact, as in the case of δ_1, more efficient lower and upper bounds can be obtained (see section 4).
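Iteration procedure (a) can be sketched for the simplest choice δ = δ_1, where U_δ is the usual dynamic programming operator (U v)(i) = max_a [ r(i,a) + Σ_j p^a(i,j) v(j) ]. The 2-state, 2-action data below are invented, with the discount factor absorbed into the transition matrices.

```python
import numpy as np

# Invented data: P[a] is the (discounted) transition matrix of action a,
# r[a] the corresponding reward vector.
P = {0: 0.9 * np.array([[0.6, 0.4], [0.3, 0.7]]),
     1: 0.9 * np.array([[0.1, 0.9], [0.8, 0.2]])}
r = {0: np.array([1.0, 0.5]), 1: np.array([0.2, 2.0])}

def U(v):
    # componentwise supremum over the (two) actions, plus a maximizing policy f_n
    q = np.stack([r[a] + P[a] @ v for a in (0, 1)])
    return q.max(axis=0), q.argmax(axis=0)

# v_0 = 0 satisfies U v_0 >= v_0 here (all rewards are nonnegative),
# so by (iii) the iterates v_n increase monotonically to v*.
v = np.zeros(2)
for n in range(1, 500):
    w, f = U(v)
    if np.max(np.abs(w - v)) < 1e-10:
        break
    v = w
print("v_n ~ v* :", w, "  maximizing policy f_n :", f)
```

The returned f_n is the stationary strategy whose value brackets v* as in (ii) and (iii).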
EXAMPLES 2.3. The examples 2.2 (a)-(d) induce numerically well-executable policy improvement procedures. In fact δ_1 induces the standard successive approximation technique based on Gauss-Jordan iteration; δ_R induces Jacobi iteration (compare PORTEUS [10]); δ_H yields Gauss-Seidel iteration; other choices of δ yield overrelaxation and combinations of overrelaxation and Gauss-Seidel iteration (in this respect lemma 7.2.3 in [6] has interesting consequences).

3. VALUE ORIENTED METHODS
In the foregoing section we developed a whole class of policy improvement procedures or successive approximation techniques. As we saw in section 2, at the n-th stage of any policy improvement procedure the best estimate for the optimal strategy is the stationary strategy f_n. This makes the next policy improvement more efficient if the value v_n is nearer to v(f_n). In fact the policy iteration techniques owe their high efficiency in the policy improvement part to the fact that they have v_n = v(f_n). A disadvantage of policy iteration is in fact the computation of these v_n. However, there is an alternative in combining the advantages of policy iteration and successive approximations. Namely, suppose f_n is chosen such that L_δ^{f_n} v_{n−1} = U_δ v_{n−1}; then define

v_n := (L_δ^{f_n})^λ v_{n−1}    (λ ∈ {1,2,…,∞}).

Note that (L_δ^{f_n})^λ v_{n−1} → v(f_n) for λ → ∞, so by the choice of λ we in fact determine how good v_n approximates v(f_n).
The choice λ = 1 gives the successive approximation of section 2, whereas the choice λ = ∞ gives for any transition memoryless and nonzero go-ahead function a variant of the policy iteration technique.
Below we give a more formal treatment.
DEFINITION 3.1. Let δ be nonzero and transition memoryless and suppose that the sup in U_δ v is attained for some policy f if v ∈ V. Furthermore we assume that we have a unique way of designating such a policy. We define the operators U_δ^(λ) on V for λ = 1,2,…,∞ by

U_δ^(λ) v := (L_δ^f)^λ v,  if the sup in U_δ v is attained for f.

Note that U_δ^(1) = U_δ.
It does not seem revolutionary to conjecture that v_n := U_δ^(λ) v_{n−1} converges
to v* if v_0 ∈ V. However, one becomes somewhat more prudent as soon as one
realizes that U_δ^(λ) is neither necessarily monotone, nor necessarily contracting, as one can see in the following simple example for
δ = δ_1, S = {1,2}, μ = 1, A = {1,2}: p^1(i,2) = p^2(i,1) = 0.99, r(i,1) = 1, other probabilities and rewards being zero.
Now one obtains for v := (0,0)^T, w := (100,100)^T that U_δ^(λ) v […], whereas lim_{λ→∞} U_δ^(λ) w = (0,0)^T.

We will now prove that the proposed iteration step leads to a converging algorithm.
THEOREM 3.1. Let the situation be such that U_δ^(λ) is defined and choose v_0 ∈ V with U_δ v_0 ≥ v_0. Then v_n := U_δ^(λ) v_{n−1} converges in norm to v*, and

v_n ≤ v(f_n) ≤ v*,

where f_n is the policy (unique, possibly after tie breaking) which maximizes L_δ^f v_{n−1}.
PROOF. By assumption we have v_0 ≤ U_δ v_0 = L_δ^{f_1} v_0 ≤ v_1. Since v_1 ≤ L_δ^{f_1} v_1, one obtains v_1 ≤ v(f_1) ≤ v*. By induction this gives v_{n−1} ≤ v_n ≤ v(f_n) ≤ v*. On the other hand
v_n ≥ U_δ^n v_0, which tends to v* for n → ∞. Therefore v_n → v* and v_n ≤ v(f_n) ≤ v*.
In the same way as in [5] for the standard algorithm one may obtain more sophisticated bounds (see section 4). Furthermore the assumption that the sup in U_δ v is attained can be weakened as in [5] by introducing approximations (in norm) of the sup. This can be extended in several ways. For a detailed description of these possibilities see [6].

As already stated, the case λ = ∞ represents a variety of policy iteration procedures. In fact these procedures (for any nonzero transition memoryless δ) generate sequences of policies with increasing value. Hence an optimal policy is obtained after a finite number of iterations if the state and action spaces are finite.
If δ = δ_1, then we have the standard policy iteration algorithm as introduced by HOWARD in [3] for the finite state, finite action discounted case. If δ = δ_H, then we have the Gauss-Seidel variant as introduced by HASTINGS [1].
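The value-oriented step v_n := (L_δ^{f_n})^λ v_{n−1} can be sketched for δ = δ_1 on the same kind of invented 2-state, 2-action data: one policy improvement picks f_n, after which the fixed policy is applied λ times. λ = 1 is the successive approximation of section 2; larger λ approaches policy iteration.

```python
import numpy as np

# Invented data (discount absorbed into P).
P = {0: 0.9 * np.array([[0.6, 0.4], [0.3, 0.7]]),
     1: 0.9 * np.array([[0.1, 0.9], [0.8, 0.2]])}
r = {0: np.array([1.0, 0.5]), 1: np.array([0.2, 2.0])}

def improve(v):
    # policy improvement: a componentwise maximizer f of L^f v
    return np.stack([r[a] + P[a] @ v for a in (0, 1)]).argmax(axis=0)

def L_f(f, v):
    # one application of L^f: (L^f v)(i) = r(i,f(i)) + sum_j p^{f(i)}(i,j) v(j)
    return np.array([r[f[i]][i] + P[f[i]][i] @ v for i in range(len(v))])

def value_oriented(v0, lam, tol=1e-10):
    v = v0
    while True:
        f = improve(v)            # one policy improvement ...
        w = v
        for _ in range(lam):      # ... then lambda evaluation steps with f fixed
            w = L_f(f, w)
        if np.max(np.abs(w - v)) < tol:
            return w, f
        v = w

# v_0 = 0 satisfies U v_0 >= v_0 here, as required in theorem 3.1
v_lam1, f1 = value_oriented(np.zeros(2), lam=1)   # successive approximation
v_lam5, f5 = value_oriented(np.zeros(2), lam=5)   # value-oriented variant
print(np.allclose(v_lam1, v_lam5, atol=1e-6), f1, f5)
```

Both runs end at the same v* and the same maximizing policy; the λ = 5 run needs fewer (more expensive) improvement steps, which is the trade-off discussed in the next section.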
4. SOME REMARKS ON NUMERICAL AND OTHER ASPECTS
For the algorithms based on the operators U_δ (section 2) and U_δ^(λ)
(section 3) we proved geometric convergence. However, extrapolations based on the convergence rate only are usually not very good. As in the case of U_δ (see [5]) one can obtain better bounds rather easily. For the case that the sup in U_δ v_n is attained and exactly computed in the algorithm based on U_δ^(λ) (λ = 1,2,…,∞) we obtain, if U_δ v_0 ≥ v_0:

v_n + (1 − ρ_{f_{n+1}})^{−1} ℓ(L_δ(f_{n+1})v_n − v_n) μ ≤ v(f_{n+1}) ≤ v* ≤ v_n + (1 − ν_δ)^{−1} ‖L_δ(f_{n+1})v_n − v_n‖ μ,

where

ρ_f := inf_i μ^{−1}(i) Σ_j p^{f(i)}(i,j) μ(j),    ℓ(v) := inf_i μ^{−1}(i) v(i).
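The flavour of such extrapolation bounds can be sketched in the simplest case δ = δ_1 with μ = 1, where they reduce to essentially MacQueen's bounds: the extreme components of v_n − v_{n−1}, scaled by ν_δ/(1 − ν_δ), bracket v*. The data below are invented, with row sums of P at most ν = 0.9.

```python
import numpy as np

nu = 0.9   # contraction radius: row sums of every P[a] are at most nu
P = {0: nu * np.array([[0.6, 0.4], [0.3, 0.7]]),
     1: nu * np.array([[0.1, 0.9], [0.8, 0.2]])}
r = {0: np.array([1.0, 0.5]), 1: np.array([0.2, 2.0])}

def U(v):
    return np.stack([r[a] + P[a] @ v for a in (0, 1)]).max(axis=0)

v = np.zeros(2)
for n in range(30):
    w = U(v)
    d = w - v
    # after each step: lower <= v* <= upper, componentwise
    lower = w + nu / (1.0 - nu) * d.min()
    upper = w + nu / (1.0 - nu) * d.max()
    v = w
print("bracket after 30 steps:")
print("lower:", lower)
print("upper:", upper)
```

The width of the bracket shrinks geometrically and gives a much sharper stopping criterion than the raw rate ν^n ‖v_0 − v*‖.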
For a more detailed description we refer to [6]. The proof in this case is completely similar to the proof in the case δ = δ_1.

From numerical experience it appears that value oriented methods can
give considerable gain in computational efficiency. This is especially
true if the policy improvement requires many operations. Generally speaking one may say that U_δ-based successive approximation methods need only a smaller number of iterations to reach a near-optimal policy; however, the proof of this near-optimality requires relatively many additional iterations. So in quite a lot of iterations f_n does not change substantially. Therefore it is efficient to choose λ greater than one. In fact it is still more profitable to increase the value of λ in subsequent iterations. To give an idea of the gain in computational efficiency we mention that we found in a number of examples with δ = δ_1 a saving in computing time of 20-40% when we took λ = 5 instead of λ = 1 (in both situations we used a suboptimality test; the numbers of states ranged between 40 and 1000), see [8].

In all procedures (all δ and all λ) the standard suboptimality test is allowed and also the more sophisticated and more efficient suboptimality test which is described in the paper by HASTINGS and VAN NUNEN [2] in this volume.
Instead of defining δ-based operators U_δ one may transform the data in the problem and solve the transformed problem by the standard successive approximation methods. This approach has been presented by PORTEUS [10]. In our notation the transformation is

r̃(f)(i) := E_i^{f,δ} Σ_{n=0}^{τ−1} r(X_n, Z_n).

By introducing the matrices Q(f) with Q(f)(i,j) := p^f(i,j) δ(i,j) we obtain

P̃(f) = Σ_{k=0}^∞ Q^k(f) [P(f) − Q(f)],    r̃(f) = Σ_{k=0}^∞ Q^k(f) r(f),

being exactly Porteus' pre-inverse transformation. In fact we showed in section 2 that the transformed problem possesses the same optimal value
vector as the original problem.
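For a transition memoryless δ this transformation is easy to carry out explicitly. The sketch below (invented 3-state data, one fixed policy) takes the Gauss-Seidel choice δ_H, so Q(f) is strictly lower triangular, uses Σ_k Q^k = (I − Q)^{-1}, and checks that the transformed pair (P̃, r̃) has the same value vector as (P, r).

```python
import numpy as np

# Invented data: substochastic P under one fixed policy f, rewards r.
P = 0.9 * np.array([[0.5, 0.5, 0.0],
                    [0.2, 0.3, 0.5],
                    [0.0, 0.4, 0.6]])
r = np.array([1.0, 2.0, 0.5])

# Gauss-Seidel choice: go ahead exactly on downward transitions j < i,
# so delta(i,j) = 1 for j < i and Q(f)(i,j) = p^f(i,j) * delta(i,j).
delta = np.tril(np.ones((3, 3)), k=-1)
Q = P * delta

I = np.eye(3)
P_tilde = np.linalg.solve(I - Q, P - Q)     # sum_k Q^k (P - Q)
r_tilde = np.linalg.solve(I - Q, r)         # sum_k Q^k r

# both problems have the same value vector (I - P)^{-1} r
v_original    = np.linalg.solve(I - P, r)
v_transformed = np.linalg.solve(I - P_tilde, r_tilde)
print(np.allclose(v_original, v_transformed))
```

Since I − P̃ = (I − Q)^{-1}(I − P), the two linear systems have identical solutions, which is the statement that the pre-inverse transformation preserves the value vector.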
In fact some extension is possible with respect to the conditions
under which the U_δ- and U_δ^(λ)-based procedures converge. We mentioned already the kind of conditions of [5]. Another approach is in considering a fixed δ and requiring strict or N-stage contraction for U_δ on V or W.
In [11] Reetz chooses such an approach for δ = δ_H. One might conjecture that, as in the case of δ_1 (see [5]), N-stage contraction implies 1-stage contraction with respect to a different norm.
REFERENCES

[1] HASTINGS, N.A.J., Some notes on dynamic programming and replacement, Oper. Res. Q. (1968) 453-464.
[2] HASTINGS, N.A.J., J.A.E.E. VAN NUNEN, The action elimination algorithm for Markov decision processes, in this volume.
[3] HOWARD, R.A., Dynamic programming and Markov decision processes, Cambridge (Mass.), M.I.T. Press, 1960.
[4] KUSHNER, H.J., A.J. KLEINMAN, Accelerated procedures for the solution of discrete Markov control problems, IEEE Trans. on Aut. Contr. AC-16 (1971) 147-152.
[5] NUNEN VAN, J.A.E.E., J. WESSELS, Markov decision processes with unbounded rewards, in this volume.
[6] NUNEN VAN, J.A.E.E., Contracting Markov decision processes, Mathematical Centre Tract 71, Amsterdam, 1976.
[7] NUNEN VAN, J.A.E.E., A set of successive approximation methods for discounted Markovian decision problems, Zeitschrift für Operations Res. 20 (1976) 203-208.
[8] NUNEN VAN, J.A.E.E., Improved successive approximation methods for discounted Markov decision processes, pp. 667-682 in A. Prekopa (ed.), Progress in Operations Research, Amsterdam, North-Holland Publ. Comp., 1976.
[9] NUNEN VAN, J.A.E.E., J. WESSELS, A principle for generating optimization procedures for discounted Markov decision processes, pp. 683-695 in the same volume as [8].
[10] PORTEUS, E.L., Bounds and transformations for discounted finite Markov decision chains, Oper. Res. (1975) 761-784.
[11] REETZ, D., Solution of a Markovian decision problem by overrelaxation, Zeitschrift für Operations Res. (1973) 29-32.
[12] REETZ, D., A decision exclusion algorithm for a class of Markovian decision processes, Zeitschrift für Operations Res. (1976) 125-131.
[13] SCHELLHAAS, H., Zur Extrapolation in Markoffschen Entscheidungsmodellen mit Diskontierung, Z.f.O.R. 18 (1974) 91-104.
[14] WESSELS, J., Stopping times and Markov programming, Transactions of the seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes (including 1974 European Meeting of Statisticians), Academia, Prague (to appear).