The generation of successive approximation methods for
Markov decision processes by using stopping times
Citation for published version (APA):
van Nunen, J. A. E. E., & Wessels, J. (1977). The generation of successive approximation methods for Markov
decision processes by using stopping times. In H. C. Tijms, & J. Wessels (Eds.), Markov Decision Theory :
Proceedings of the advanced seminar, Amsterdam, The Netherlands, September 13-17, 1976 (pp. 25-37).
(Mathematical Centre Tracts; Vol. 93). Stichting Mathematisch Centrum.
Document status and date:
Published: 01/01/1977
Document Version:
Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)
THE GENERATION OF SUCCESSIVE APPROXIMATIONS FOR
MARKOV DECISION PROCESSES
BY USING STOPPING TIMES

J.A.E.E. van Nunen
Graduate School of Management, Delft, The Netherlands

J. Wessels
Eindhoven University of Technology, Eindhoven, The Netherlands
1. INTRODUCTION
In [5] we introduced the standard successive approximation method
for Markov decision processes with respect to the total expected reward
criterion. In fact there exist some variants of this method. These variants
differ in the policy improvement procedure: the standard procedure may
be replaced by a Gauss-Seidel procedure (see e.g. HASTINGS [1], KUSHNER and
KLEINMAN [4]), an overrelaxation procedure (see REETZ [11] and
SCHELLHAAS [13]) or some other variants (see VAN NUNEN [8]). In [14] it has been shown
that such variants can be generated by stopping times. This approach has
been generalized in [6]. In section 2 we will introduce the main idea of
this approach.
Policy iteration, with its several variants, as introduced by HOWARD [3]
is usually not viewed upon as a successive approximation technique.
However, in [7] it has been shown to be an extreme element of a class of
extended successive approximation techniques, the so-called value-oriented
methods. This approach has been combined in [9] with the stopping time
approach. In [6] a further generalization has been given (mainly with
respect to the conditions). Value-oriented methods will be treated in
section 3. Section 4 will be devoted to upper and lower bounds for the techniques
presented in the earlier sections. Furthermore some remarks on numerical aspects will be made.

In this paper we will use the same notations as in [5]; however, in order to keep the proofs simple, we will work under somewhat stronger assumptions. In fact, our assumptions are the same as those in [6]. For details we will refer repeatedly to [6].
ASSUMPTIONS. Our assumptions are the same as the assumptions in [5], with assumption 2.3(i) replaced by

(a) ∃ M > 0 such that ‖r(f) − r‖ ≤ M for all f ∈ F;

(b) sup_{π∈M} Σ_{n=0}^∞ […] < ∞ for all i ∈ S.
These stronger assumptions make the spaces V and W superfluous.
As remarked in [5] (remark 5.1) one may replace r in the assumptions
(and definition of V) by a vector b with b − r ∈ W. We will do so in this paper in order to facilitate referring to [6].
2. STOPPING TIMES AND SUCCESSIVE APPROXIMATIONS
In this section we will show that each stopping time characterized by a
go-ahead function δ for the sequence {X_n}_{n=0}^∞ induces an operator U_δ on V, such that U_δ is monotone and (usually) contracting.
Furthermore all these contracting operators on V have the same unique
fixed point v*. So we have for any v_0 ∈ V and any δ:

v_n := U_δ v_{n−1} → v*  and  v_n ∈ V for n = 1,2,… .
DEFINITION 2.1. A (randomized) go-ahead function δ is a function which maps
G := ∪_{n=1}^∞ S^n into [0,1]. By Δ we denote the set of all go-ahead functions.
1 − δ(s_0, s_1, …, s_n) will be interpreted as the probability to stop the process at time n, given that X_0 = s_0, …, X_n = s_n and the process has not been stopped earlier.
DEFINITION 2.2.
(a) δ ∈ Δ is said to be nonrandomized if δ(a) ∈ {0,1} for all a ∈ G;
(b) δ ∈ Δ is said to be nonzero if δ(i) > ε > 0 for some ε and all i ∈ S;
(c) δ ∈ Δ is said to be transition memoryless if δ(a) only depends on the last two entries of a, for those a with at least two entries and satisfying δ(s_0,…,s_k) ≠ 0 for all k < n, if a = s_0,…,s_n.

So for a transition memoryless go-ahead function the stopping probability only depends on the most recent transition. The relevance of this notion will become clear in the course of this section.
EXAMPLES 2.1. Below some examples of nonzero go-ahead functions will be given. These examples will be used repeatedly in this paper.
(a) Define the go-ahead function δ_n (n = 1,2,…) by δ_n(a) := 1 if a contains less than n+1 entries, otherwise δ_n(a) := 0. The go-ahead functions δ_n are nonrandomized; δ_n is only transition memoryless if n = 1.
(b) Define δ_R by δ_R(s, s, …, s) := 1 for all s and all sequences of finite length, δ_R(a) := 0 otherwise. δ_R is nonrandomized and transition memoryless.
(c) Define δ_H by δ_H(s_0, …, s_n) := 1 if s_n < s_{n−1} (any n), otherwise δ_H(a) := 0. δ_H is nonrandomized and transition memoryless.
(d) δ_r(i) := ½ for all i ∈ S, δ_r(a) := 0 elsewhere. δ_r is transition memoryless.
Since we introduced a probabilistic go-ahead concept, we have to incorporate it in the probability space and measure. Therefore we extend the space (S × A)^∞ (see [5] section 2) to (S × E × A)^∞, with E := {0,1}.
Furthermore the stochastic process {X_t, Z_t}_{t=0}^∞ is extended to {X_t, Y_t, Z_t}_{t=0}^∞,
where Y_t = 0 as long as the process may go ahead. Now any starting state i, go-ahead function δ, and any decision rule
π determine a probability measure on (S × E × A)^∞ with the required properties in an obvious way (see [6] for details). This probability measure will be denoted by P_i^{π,δ}. Expectations will be denoted by E_i^{π,δ}. Note that P_i^{π,δ} and P_i^π are equal for events which do not depend on the variables Y_t.

In fact the go-ahead concept induces a stopping time.
DEFINITION 2.3. The random variable τ taking values in {0,1,…} ∪ {∞} is defined by: τ = n if Y_n = 1 and Y_t = 0 for all t = 0,1,…,n−1; τ = ∞ if Y_t = 0 for all t. τ is a randomized stopping time with respect to X_0, X_1, … .
Now we will introduce our operators.
DEFINITION 2.4. For each δ ∈ Δ and each strategy (= nonrandomized decision
rule) π the operator L_δ^π on V is defined by

(L_δ^π v)(i) := E_i^{π,δ} [ Σ_{n=0}^{τ−1} r(X_n, Z_n) + v(X_τ) ]  for v ∈ V,

with v(X_τ) := 0 if τ = ∞.

LEMMA 2.1. L_δ^π is monotone and (for nonzero δ) strictly contracting on V. Therefore L_δ^π possesses a unique fixed point v_δ^π in V.

PROOF. The contraction factor ρ_δ^π of L_δ^π is the norm of the matrix with (i,j) entry P_i^{π,δ}(τ < ∞, X_τ = j); it is < 1 if and only if δ is nonzero.
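The stopped-process reading of this operator can be illustrated by simulation: run the process, decide after each transition whether to go ahead, and average Σ_{n<τ} r(X_n,Z_n) + v(X_τ). The sketch below uses invented 3-state data for one fixed policy (the discount is absorbed into the substochastic matrix P, so the missing row mass is the probability that the process terminates, contributing 0 as in the convention for τ = ∞), takes the Gauss-Seidel choice δ_H (go ahead exactly when the state index decreases), and compares the Monte Carlo estimate with the closed-form Gauss-Seidel operator given in the examples below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: substochastic transition matrix P under one fixed policy,
# one-step rewards r, and a terminal value vector v.
P = 0.9 * np.array([[0.5, 0.5, 0.0],
                    [0.2, 0.3, 0.5],
                    [0.0, 0.4, 0.6]])
r = np.array([1.0, 2.0, 0.5])
v = np.array([3.0, -1.0, 2.0])
C = np.cumsum(P, axis=1)          # cumulative rows, for sampling transitions

def exact_L(v):
    # the Gauss-Seidel operator induced by delta_H:
    # (L v)(i) = r(i) + sum_{j<i} P(i,j) (L v)(j) + sum_{j>=i} P(i,j) v(j)
    w = np.array(v, dtype=float)
    for i in range(len(v)):
        w[i] = r[i] + P[i, :i] @ w[:i] + P[i, i:] @ v[i:]
    return w

def simulated_L(i0, n_runs=50_000):
    # the definition read literally: collect sum_{n<tau} r(X_n) + v(X_tau),
    # with v(X_tau) := 0 if tau = infinity (here: the process terminates)
    total = 0.0
    for _ in range(n_runs):
        i, acc = i0, 0.0
        while True:
            acc += r[i]                            # reward for n < tau
            u = rng.random()
            if u >= C[i, -1]:                      # terminated before stopping
                break
            j = int(np.searchsorted(C[i], u, side="right"))
            if j < i:                              # delta_H(i,j) = 1: go ahead
                i = j
            else:                                  # stopped: terminal value
                acc += v[j]
                break
        total += acc
    return total / n_runs

print("exact    :", exact_L(v))
print("simulated:", [round(simulated_L(i), 2) for i in range(3)])
```

The two printed vectors agree up to Monte Carlo noise, which is a direct check that the closed-form Gauss-Seidel expression is the expectation in the definition above.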
EXAMPLES 2.2. Take for π an arbitrary stationary strategy (f,f,…).

(a) (L_{δ_1}^π v)(i) = r(i,f(i)) + Σ_j p^{f(i)}(i,j) v(j);

(b) (L_{δ_R}^π v)(i) = [1 − p^{f(i)}(i,i)]^{−1} [ r(i,f(i)) + Σ_{j≠i} p^{f(i)}(i,j) v(j) ];

(c) (L_{δ_H}^π v)(i) = r(i,f(i)) + Σ_{j<i} p^{f(i)}(i,j) (L_{δ_H}^π v)(j) + Σ_{j≥i} p^{f(i)}(i,j) v(j);

(d) (L_{δ_r}^π v)(i) = ½ v(i) + ½ [ r(i,f(i)) + Σ_j p^{f(i)}(i,j) v(j) ];

(e) let δ be nonzero, then v_δ^π = v(π), independent of δ.
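The closed-form operators (a)-(c) and statement (e) can be checked numerically. The sketch below uses invented 3-state data for one fixed stationary policy, again with the discount absorbed into the substochastic matrix P, and verifies that the standard, Jacobi and Gauss-Seidel operators all have the same fixed point v(f).

```python
import numpy as np

# Invented data for one fixed stationary policy f: substochastic P, rewards r.
P = 0.9 * np.array([[0.5, 0.5, 0.0],
                    [0.2, 0.3, 0.5],
                    [0.0, 0.4, 0.6]])
r = np.array([1.0, 2.0, 0.5])

def L_standard(v):
    # (a) delta_1: one transition, then stop
    return r + P @ v

def L_jacobi(v):
    # (b) delta_R: go ahead while the state repeats
    diag = np.diag(P)
    return (r + P @ v - diag * v) / (1.0 - diag)

def L_gauss_seidel(v):
    # (c) delta_H: go ahead on downward transitions j < i
    w = np.array(v, dtype=float)
    for i in range(len(v)):
        w[i] = r[i] + P[i, :i] @ w[:i] + P[i, i:] @ v[i:]
    return w

def fixed_point(L, v0, tol=1e-12):
    v = v0
    while True:
        w = L(v)
        if np.max(np.abs(w - v)) < tol:
            return w
        v = w

# statement (e): every nonzero delta yields the same fixed point v(f)
v0 = np.zeros(3)
vals = [fixed_point(L, v0) for L in (L_standard, L_jacobi, L_gauss_seidel)]
print(np.allclose(vals[0], vals[1]), np.allclose(vals[0], vals[2]))
```

All three iterations converge to v(f) = (I − P)^{-1} r, though at different speeds; the Jacobi and Gauss-Seidel variants contract faster than the standard operator on this example.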
REMARK 2.1. If π is a nonstationary strategy then there exist values for
{p^a(i,j), r(i,a)} and go-ahead functions δ′ and δ″ such that v_{δ′}^π ≠ v_{δ″}^π
(see lemma 5.1.7 in [6]). We now come to the operators U_δ.
DEFINITION 2.5. The operator U_δ on V is defined by

U_δ v := sup_π L_δ^π v,

where the supremum is taken componentwise.

Note that L_δ^π has only been defined for strategies π, so the supremum is only taken over the strategies (= nonrandomized decision rules). Extension to the randomized decision rules would not affect the value of U_δ v.

THEOREM 2.1. Let δ ∈ Δ; then U_δ is monotone and (only for nonzero δ) strictly contracting with contraction radius ν_δ := sup_π ρ_δ^π. Therefore U_δ possesses (for nonzero δ) a unique fixed point. v* is fixed point for all U_δ with δ nonzero.
PROOF. For details we refer to the proof of theorem 5.2.1 in [6]. With respect to the last statement we remark: L_δ^π v(π) = v(π) if π = (f,f,…). Since f may be chosen such that v(π) ≥ v* − εμ ([5] theorem 3.1
(ii)), we obtain U_δ v* ≥ v*. If we had U_δ v* > v*, then it would be possible to construct a strategy with total expected reward exceeding v*, which is impossible.
This theorem serves as the basis for a δ-based successive approximation algorithm, since v_n := U_δ v_{n−1} converges in norm to v* if v_0 ∈ V. In the definition of U_δ we take the supremum over all strategies. One
would naturally prefer to restrict oneself to Markov strategies and even use the algorithm for constructing ε-optimal stationary strategies. The following theorem (for the proof we refer to [6] theorems 5.2.2 and 5.2.3) shows that the concept of transition memoryless go-ahead functions plays a crucial role in this problem.
THEOREM 2.2.
(a) Let δ be transition memoryless, ε > 0, v ∈ V. Then there exists a policy f such that L_δ^f v ≥ U_δ v − εμ.
(b) Let δ be not transition memoryless; then there exist values for the parameters {p^a(i,j), r(i,a)} such that for some v ∈ V and some ε > 0 there is no f ∈ F with L_δ^f v ≥ U_δ v − εμ.

Hence, if δ is transition memoryless we have

U_δ v = sup_f L_δ^f v,

where the sup is not necessarily componentwise. Whereas if δ is not transition memoryless, sup_f L_δ^f v may only be defined componentwise and may not be equal to U_δ v. For nonzero and transition memoryless go-ahead functions we now obtain the following iteration procedures.
(a) (if the sup in U_δ v is attained for some f). Choose v_0 ∈ V, define v_n := U_δ v_{n−1} and choose f_n such that L_δ^{f_n} v_{n−1} = v_n; then
(i) ‖v_n − v*‖ ≤ ν_δ^n ‖v_0 − v*‖;
(ii) ‖v_n − v(f_n)‖ ≤ (1 − ν_δ)^{−1} ν_δ ‖v_n − v_{n−1}‖;
(iii) if v_0 satisfies U_δ v_0 ≥ v_0, then v_{n−1} ≤ v_n ≤ v(f_n) ≤ v*.

(b) Choose f_n (n = 1,2,…) such that

L_δ^{f_n} v_{n−1} ≥ max{ v_{n−1}, U_δ v_{n−1} − ε(1 − ν_δ)μ },

and define v_n := L_δ^{f_n} v_{n−1}; then, as in (ii), ‖v_n − v(f_n)‖ < ε for n sufficiently large, and v_n ≤ v(f_n) ≤ v*.

In fact, as in the case of δ_1, more efficient lower and upper bounds can be obtained (see section 4).
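Iteration procedure (a) can be sketched for the simplest choice δ = δ_1, where U_δ is the usual dynamic programming operator (U v)(i) = max_a [ r(i,a) + Σ_j p^a(i,j) v(j) ]. The 2-state, 2-action data below are invented, with the discount factor absorbed into the transition matrices.

```python
import numpy as np

# Invented data: P[a] is the (discounted) transition matrix of action a,
# r[a] the corresponding reward vector.
P = {0: 0.9 * np.array([[0.6, 0.4], [0.3, 0.7]]),
     1: 0.9 * np.array([[0.1, 0.9], [0.8, 0.2]])}
r = {0: np.array([1.0, 0.5]), 1: np.array([0.2, 2.0])}

def U(v):
    # componentwise supremum over the (two) actions, plus a maximizing policy f_n
    q = np.stack([r[a] + P[a] @ v for a in (0, 1)])
    return q.max(axis=0), q.argmax(axis=0)

# v_0 = 0 satisfies U v_0 >= v_0 here (all rewards are nonnegative),
# so by (iii) the iterates v_n increase monotonically to v*.
v = np.zeros(2)
for n in range(1, 500):
    w, f = U(v)
    if np.max(np.abs(w - v)) < 1e-10:
        break
    v = w
print("v_n ~ v* :", w, "  maximizing policy f_n :", f)
```

The returned f_n is the stationary strategy whose value brackets v* as in (ii) and (iii).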
EXAMPLES 2.3. The examples 2.2 (a)-(d) induce numerically well-executable policy improvement procedures. In fact δ_1 induces the standard successive approximation technique based on Gauss-Jordan iteration; δ_R induces Jacobi iteration (compare PORTEUS [10]); δ_H yields Gauss-Seidel iteration; other choices of δ yield overrelaxation and combinations of overrelaxation and Gauss-Seidel iteration (in this respect lemma 7.2.3 in [6] has interesting consequences).

3. VALUE ORIENTED METHODS
In the foregoing section we developed a whole class of policy improvement procedures or successive approximation techniques. As we saw in section 2, at the n-th stage of any policy improvement procedure the best estimate for the optimal strategy is the stationary strategy f_n. This makes the next policy improvement more efficient if the value v_n is nearer to v(f_n). In fact the policy iteration techniques owe their high efficiency in the policy improvement part to the fact that they have v_n = v(f_n). A disadvantage of policy iteration is in fact the computation of these v_n. However, there is an alternative in combining the advantages of policy iteration and successive approximations. Namely, suppose f_n is chosen such that L_δ^{f_n} v_{n−1} = U_δ v_{n−1}; then define

v_n := (L_δ^{f_n})^λ v_{n−1}    (λ ∈ {1,2,…,∞}).

Note that (L_δ^{f_n})^λ v_{n−1} → v(f_n) for λ → ∞, so by the choice of λ we in fact determine how good v_n approximates v(f_n).
The choice λ = 1 gives the successive approximation of section 2, whereas the choice λ = ∞ gives for any transition memoryless and nonzero go-ahead function a variant of the policy iteration technique.
Below we give a more formal treatment.
DEFINITION 3.1. Let δ be nonzero and transition memoryless and suppose that the sup in U_δ v is attained for some policy f if v ∈ V. Furthermore we assume that we have a unique way of designating such a policy. We define the operators U_δ^(λ) on V for λ = 1,2,…,∞ by

U_δ^(λ) v := (L_δ^f)^λ v,  if the sup in U_δ v is attained for f.

Note that U_δ^(1) = U_δ.
It does not seem revolutionary to conjecture that v_n := U_δ^(λ) v_{n−1} converges
to v* if v_0 ∈ V. However, one becomes somewhat more prudent as soon as one
realizes that U_δ^(λ) is neither necessarily monotone, nor necessarily contracting, as one can see in the following simple example for
δ = δ_1, S = {1,2}, μ = 1, A = {1,2}: p^1(i,2) = p^2(i,1) = 0.99, r(i,1) = 1, other probabilities and rewards being zero.
Now one obtains for v := (0,0)^T, w := (100,100)^T that U_δ^(λ) v […], whereas lim_{λ→∞} U_δ^(λ) w = (0,0)^T.

We will now prove that the proposed iteration step leads to a converging algorithm.
THEOREM 3.1. Let the situation be such that U_δ^(λ) is defined and choose v_0 ∈ V with U_δ v_0 ≥ v_0. Then v_n := U_δ^(λ) v_{n−1} converges in norm to v*, and

v_n ≤ v(f_n) ≤ v*,

where f_n is the policy (unique, possibly after tie breaking) which maximizes L_δ^f v_{n−1}.
PROOF. By assumption we have v_0 ≤ U_δ v_0 = L_δ^{f_1} v_0 ≤ v_1. Since v_1 ≤ L_δ^{f_1} v_1, one obtains v_1 ≤ v(f_1) ≤ v*. By induction this gives v_{n−1} ≤ v_n ≤ v(f_n) ≤ v*. On the other hand
v_n ≥ U_δ^n v_0, which tends to v* for n → ∞. Therefore v_n → v* and v_n ≤ v(f_n) ≤ v*.
In the same way as in [5] for the standard algorithm one may obtain more sophisticated bounds (see section 4). Furthermore the assumption that the sup in U_δ v is attained can be weakened as in [5] by introducing approximations (in norm) of the sup. This can be extended in several ways. For a detailed description of these possibilities see [6].

As already stated, the case λ = ∞ represents a variety of policy iteration procedures. In fact these procedures (for any nonzero transition memoryless δ) generate sequences of policies with increasing value. Hence an optimal policy is obtained after a finite number of iterations if the state and action spaces are finite.
If δ = δ_1, then we have the standard policy iteration algorithm as introduced by HOWARD in [3] for the finite state, finite action discounted case. If δ = δ_H, then we have the Gauss-Seidel variant as introduced by HASTINGS [1].
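The value-oriented step v_n := (L_δ^{f_n})^λ v_{n−1} can be sketched for δ = δ_1 on the same kind of invented 2-state, 2-action data: one policy improvement picks f_n, after which the fixed policy is applied λ times. λ = 1 is the successive approximation of section 2; larger λ approaches policy iteration.

```python
import numpy as np

# Invented data (discount absorbed into P).
P = {0: 0.9 * np.array([[0.6, 0.4], [0.3, 0.7]]),
     1: 0.9 * np.array([[0.1, 0.9], [0.8, 0.2]])}
r = {0: np.array([1.0, 0.5]), 1: np.array([0.2, 2.0])}

def improve(v):
    # policy improvement: a componentwise maximizer f of L^f v
    return np.stack([r[a] + P[a] @ v for a in (0, 1)]).argmax(axis=0)

def L_f(f, v):
    # one application of L^f: (L^f v)(i) = r(i,f(i)) + sum_j p^{f(i)}(i,j) v(j)
    return np.array([r[f[i]][i] + P[f[i]][i] @ v for i in range(len(v))])

def value_oriented(v0, lam, tol=1e-10):
    v = v0
    while True:
        f = improve(v)            # one policy improvement ...
        w = v
        for _ in range(lam):      # ... then lambda evaluation steps with f fixed
            w = L_f(f, w)
        if np.max(np.abs(w - v)) < tol:
            return w, f
        v = w

# v_0 = 0 satisfies U v_0 >= v_0 here, as required in theorem 3.1
v_lam1, f1 = value_oriented(np.zeros(2), lam=1)   # successive approximation
v_lam5, f5 = value_oriented(np.zeros(2), lam=5)   # value-oriented variant
print(np.allclose(v_lam1, v_lam5, atol=1e-6), f1, f5)
```

Both runs end at the same v* and the same maximizing policy; the λ = 5 run needs fewer (more expensive) improvement steps, which is the trade-off discussed in the next section.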
4. SOME REMARKS ON NUMERICAL AND OTHER ASPECTS
For the algorithms based on the operators U_δ (section 2) and U_δ^(λ)
(section 3) we proved geometric convergence. However, extrapolations based on the convergence rate only are usually not very good. As in the case of U_δ (see [5]) one can obtain better bounds rather easily. For the case that the sup in U_δ v_n is attained and exactly computed in the algorithm based on U_δ^(λ) (λ = 1,2,…,∞) we obtain, if U_δ v_0 ≥ v_0:

v_n + (1 − ρ_{f_{n+1}})^{−1} ℓ(L_δ(f_{n+1})v_n − v_n) μ ≤ v(f_{n+1}) ≤ v* ≤ v_n + (1 − ν_δ)^{−1} ‖L_δ(f_{n+1})v_n − v_n‖ μ,

where

ρ_f := inf_i μ^{−1}(i) Σ_j p^{f(i)}(i,j) μ(j),    ℓ(v) := inf_i μ^{−1}(i) v(i).
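The flavour of such extrapolation bounds can be sketched in the simplest case δ = δ_1 with μ = 1, where they reduce to essentially MacQueen's bounds: the extreme components of v_n − v_{n−1}, scaled by ν_δ/(1 − ν_δ), bracket v*. The data below are invented, with row sums of P at most ν = 0.9.

```python
import numpy as np

nu = 0.9   # contraction radius: row sums of every P[a] are at most nu
P = {0: nu * np.array([[0.6, 0.4], [0.3, 0.7]]),
     1: nu * np.array([[0.1, 0.9], [0.8, 0.2]])}
r = {0: np.array([1.0, 0.5]), 1: np.array([0.2, 2.0])}

def U(v):
    return np.stack([r[a] + P[a] @ v for a in (0, 1)]).max(axis=0)

v = np.zeros(2)
for n in range(30):
    w = U(v)
    d = w - v
    # after each step: lower <= v* <= upper, componentwise
    lower = w + nu / (1.0 - nu) * d.min()
    upper = w + nu / (1.0 - nu) * d.max()
    v = w
print("bracket after 30 steps:")
print("lower:", lower)
print("upper:", upper)
```

The width of the bracket shrinks geometrically and gives a much sharper stopping criterion than the raw rate ν^n ‖v_0 − v*‖.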
For a more detailed description we refer to [6]. The proof in this case is completely similar to the proof in the case δ = δ_1.

From numerical experience it appears that value oriented methods can
give considerable gain in computational efficiency. This is especially
true if the policy improvement requires many operations. Generally speaking one may say that U_δ-based successive approximation methods need only a smaller number of iterations to reach a near-optimal policy; however, the proof of this near-optimality requires relatively many additional iterations. So in quite a lot of iterations f_n does not change substantially. Therefore it is efficient to choose λ greater than one. In fact it is still more profitable to increase the value of λ in subsequent iterations. To give an idea of the gain in computational efficiency we mention that we found in a number of examples with δ = δ_1 a saving in computing time of 20-40% when we took λ = 5 instead of λ = 1 (in both situations we used a suboptimality test; the numbers of states ranged between 40 and 1000), see [8].

In all procedures (all δ and all λ) the standard suboptimality test is allowed and also the more sophisticated and more efficient suboptimality test which is described in the paper by HASTINGS and VAN NUNEN [2] in this volume.
Instead of defining δ-based operators U_δ one may transform the data in the problem and solve the transformed problem by the standard successive approximation methods. This approach has been presented by PORTEUS [10]. In our notation the transformation is

r̃(f)(i) := E_i^{f,δ} Σ_{n=0}^{τ−1} r(X_n, Z_n).

By introducing the matrices Q(f) with Q(f)(i,j) := p^f(i,j) δ(i,j) we obtain

P̃(f) = Σ_{k=0}^∞ Q^k(f) [P(f) − Q(f)],    r̃(f) = Σ_{k=0}^∞ Q^k(f) r(f),

being exactly Porteus' pre-inverse transformation. In fact we showed in section 2 that the transformed problem possesses the same optimal value
vector as the original problem.
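For a transition memoryless δ this transformation is easy to carry out explicitly. The sketch below (invented 3-state data, one fixed policy) takes the Gauss-Seidel choice δ_H, so Q(f) is strictly lower triangular, uses Σ_k Q^k = (I − Q)^{-1}, and checks that the transformed pair (P̃, r̃) has the same value vector as (P, r).

```python
import numpy as np

# Invented data: substochastic P under one fixed policy f, rewards r.
P = 0.9 * np.array([[0.5, 0.5, 0.0],
                    [0.2, 0.3, 0.5],
                    [0.0, 0.4, 0.6]])
r = np.array([1.0, 2.0, 0.5])

# Gauss-Seidel choice: go ahead exactly on downward transitions j < i,
# so delta(i,j) = 1 for j < i and Q(f)(i,j) = p^f(i,j) * delta(i,j).
delta = np.tril(np.ones((3, 3)), k=-1)
Q = P * delta

I = np.eye(3)
P_tilde = np.linalg.solve(I - Q, P - Q)     # sum_k Q^k (P - Q)
r_tilde = np.linalg.solve(I - Q, r)         # sum_k Q^k r

# both problems have the same value vector (I - P)^{-1} r
v_original    = np.linalg.solve(I - P, r)
v_transformed = np.linalg.solve(I - P_tilde, r_tilde)
print(np.allclose(v_original, v_transformed))
```

Since I − P̃ = (I − Q)^{-1}(I − P), the two linear systems have identical solutions, which is the statement that the pre-inverse transformation preserves the value vector.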
In fact some extension is possible with respect to the conditions
under which the U_δ- and U_δ^(λ)-based procedures converge. We mentioned already the kind of conditions of [5]. Another approach is in considering a fixed δ and requiring strict or N-stage contraction for U_δ on V or W.
In [11] Reetz chooses such an approach for δ = δ_H. One might conjecture that, as in the case of δ_1 (see [5]), N-stage contraction implies 1-stage contraction with respect to a different norm.
REFERENCES

[1] HASTINGS, N.A.J., Some notes on dynamic programming and replacement, Oper. Res. Q. (1968) 453-464.
[2] HASTINGS, N.A.J., J.A.E.E. VAN NUNEN, The action elimination algorithm for Markov decision processes, in this volume.
[3] HOWARD, R.A., Dynamic programming and Markov decision processes, Cambridge (Mass.), M.I.T. Press, 1960.
[4] KUSHNER, H.J., A.J. KLEINMAN, Accelerated procedures for the solution of discrete Markov control problems, IEEE Trans. on Aut. Contr. AC-16 (1971) 147-152.
[5] NUNEN VAN, J.A.E.E., J. WESSELS, Markov decision processes with unbounded rewards, in this volume.
[6] NUNEN VAN, J.A.E.E., Contracting Markov decision processes, Mathematical Centre Tract 71, Amsterdam, 1976.
[7] NUNEN VAN, J.A.E.E., A set of successive approximation methods for discounted Markovian decision problems, Zeitschrift für Operations Res. 20 (1976) 203-208.
[8] NUNEN VAN, J.A.E.E., Improved successive approximation methods for discounted Markov decision processes, pp. 667-682 in A. Prekopa (ed.), Progress in Operations Research, Amsterdam, North-Holland Publ. Comp., 1976.
[9] NUNEN VAN, J.A.E.E., J. WESSELS, A principle for generating optimization procedures for discounted Markov decision processes, pp. 683-695 in the same volume as [8].
[10] PORTEUS, E.L., Bounds and transformations for discounted finite Markov decision chains, Oper. Res. (1975) 761-784.
[11] REETZ, D., Solution of a Markovian decision problem by overrelaxation, Zeitschrift für Operations Res. (1973) 29-32.
[12] REETZ, D., A decision exclusion algorithm for a class of Markovian decision processes, Zeitschrift für Operations Res. (1976) 125-131.
[13] SCHELLHAAS, H., Zur Extrapolation in Markoffschen Entscheidungsmodellen mit Diskontierung, Z.f.O.R. 18 (1974) 91-104.
[14] WESSELS, J., Stopping times and Markov programming, Transactions of the seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes (including 1974 European Meeting of Statisticians), Academia, Prague (to appear).