
On theory and algorithms for (semi-) Markov decision problems with the total reward criterion

Citation for published version (APA):
van Nunen, J. A. E. E., & Wessels, J. (1978). On theory and algorithms for (semi-) Markov decision problems with the total reward criterion. (Memorandum COSOR; Vol. 78-23). Technische Hogeschool Eindhoven.

Document status and date: Published: 01/01/1978



EINDHOVEN UNIVERSITY OF TECHNOLOGY Department of Mathematics

PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 78-23

On theory and algorithms for (semi-) Markov decision problems with the total reward criterion

by

Jo van Nunen (Delft) Jaap Wessels (Eindhoven)

Eindhoven, November 1978 The Netherlands

On theory and algorithms for (semi-) Markov decision problems with the total reward criterion*

by

Jo van Nunen (Delft) and Jaap Wessels (Eindhoven)

1. Introduction

In a recent survey paper [20] the present authors gave a quite extensive overview of successive approximation methods for (semi-) Markov decision processes and Markov games. The conditions they used there guaranteed that the contraction mapping approach could be applied. The class of algorithms that was considered was based on action independent stopping times.

The present survey paper differs quite substantially from the one mentioned. A major part is devoted to the situation where geometric convergence is not presupposed. Moreover, we extend the class of s.a. methods such that the algorithms may be generated by action dependent stopping times.

At discrete points in time t = 0,1,2,... a system is observed. For simplicity we suppose the state of the system at any of those points in time to be an element of a countable state space I := {1,2,...}.

At any point of time t we may choose an action a from a set of actions A. This action a generates a probability distribution for the state j of the process at time t+1. These distributions may depend on a ∈ A and the actual state i ∈ I as well. They may thus be denoted by {p^a_ij}_{j∈I}. We allow {p^a_ij}_{j∈I} to be a deficient probability distribution. Although it might occur from a reformulation of the original problem that Σ_j p^a_ij exceeds 1, we will suppose Σ_j p^a_ij ≤ 1. This is done in order to maintain the probabilistic interpretation.

A strategy is a rule that generates the actions to be chosen at the discrete points in time. For a given starting state i ∈ I (or starting distribution) the stochastic process is completely determined by the strategy. By introducing rewards for the actual paths the system may follow, it is possible to make a comparison between stochastic processes (strategies). For example, the expected reward gives a suitable criterion. Often the reward of an actual path (i_0,a_0,i_1,a_1,...) consists of the addition of (expected) rewards r(i_t,a_t) for each stage t. It is quite well possible to weaken the additivity assumption, see Groenewegen [4], Kertz and Nachman [13] and Krebs [14] for interesting analyses and more references. We will restrict the attention to the additive case since the ideas for the nonadditive situation are naturally based on it. For a system behavioral structure and a reward structure as described, the dynamic programming approach offers an ideal set-up, since at any moment t the future reward only depends on the history until time t through the state at time t. So it becomes very natural to use an optimal strategy for the process from time t on for constructing an optimal strategy for the whole decision process.

* This paper combines the material presented in the survey lecture of J. Wessels and the lecture of J. van Nunen at the DGOR-Jahrestagung 1978 in Berlin.


In this paper we will exploit this idea and achieve strong theoretical and numerical results under relatively weak conditions.

This will mainly be done by combining several results in the literature. We do not try to give an exhaustive survey of the historical developments in this literature. However, this paper does present a survey on the recent results with respect to some of the most important issues in Markov decision theory. Some strongly related areas will not be treated extensively although we will give the main references.


After introducing the formal set-up we will prove in section 2 that under our weak conditions a restriction to nonrandomized Markov strategies is allowed. This will be done by using a proof of Van Hee [8].

In section 3 the conditions are shown to be sufficient for the optimal value to satisfy the optimality equation, and the non-uniqueness of the solution is discussed. Moreover, we consider the related question about the approximation of infinite time horizon problems by finite programs.

Section 4 is devoted to the problem of restriction to stationary strategies. In section 5 we reconsider the topics of sections 3 and 4 under the contraction assumptions. With this approach - which has a long tradition - stronger results may be obtained, although the assumptions are still relatively weak.

Section 6 gives an analysis of the contraction assumptions. Section 7 is devoted to some numerical aspects of solving M.D.P.

We will restrict the attention mainly to some new developments with respect to algorithms based on action dependent stopping times as introduced by Van Nunen and Stidham in [22]. For more complete surveys of numerical aspects we refer to the recent papers by White [34] and Van Nunen and Wessels [20].

2. Set-up and restriction to Markov strategies

At moments t = 0,1,2,... a system is observed and found to be in some state i ∈ I = {1,2,...}. At time t an action may be chosen from a set A, which is endowed with a σ-field 𝒜 containing all one-point sets. The state i at time t and the action a chosen at time t determine the probability distribution of the system's state at time t+1: {p^a_ij}_{j∈I} (regardless of what happened before time t):

p^a_ij ≥ 0,   Σ_{j∈I} p^a_ij ≤ 1.

A strategy π is a sequence {π_t}_{t=0}^∞ of transition probabilities, such that π_t is a transition probability from H(t) to A, where H(t) := I × A × ... × I (with t times A and t+1 times I, endowed with the product σ-field). If the strategy π is used, π_t(A' | h_t) with h_t ∈ H(t), A' ∈ 𝒜 denotes the probability of choosing an action from A' at time t, given that until time t the state-action history h_t has materialized. The set of all strategies is called Π.

According to a theorem of Ionescu Tulcea (see e.g. [19]), a starting state i ∈ I and a strategy π ∈ Π determine a stochastic process (I_t, A_t)_{t=0}^∞, where I_t denotes the state of the system at time t and A_t the chosen action at time t. For the application of Ionescu Tulcea's theorem one can use the following transition probabilities from (I × A)^t to I × A (both endowed with the product σ-field):

p^{a_{t-1}}_{i_{t-1} j} π_t(A' | i_0,a_0,...,a_{t-1},j),

where (i_0,a_0,...,a_{t-1}) denotes an arbitrary point in (I × A)^t and (j,A') denotes a measurable subset of I × A. The probability measure as generated in this way is denoted by P_{i,π} and expectations with respect to this probability measure are denoted by E_{i,π}.

The allowed deficiency of the transition probabilities can do no harm. Formally this might be repaired by introducing an absorbing state 0 with

p^a_{i0} = 1 - Σ_{j∈I} p^a_ij.

r(i,a), the one-stage reward in state i with action a, is a given measurable function on I × A.

The criterion of total expected reward may now be defined as

v_i(π) := E_{i,π} Σ_{t=0}^∞ r(I_t, A_t),

which is properly defined for all i ∈ I, π ∈ Π if e.g.

(*)   E_{i,π} Σ_{t=0}^∞ r^+(I_t, A_t) < ∞   for all i ∈ I, π ∈ Π,

where r^+(i,a) := max{0, r(i,a)}.

Condition (*) is a very weak and natural condition, which implies among other things

(2.1)   E_{i,π} Σ_{t=0}^∞ r(I_t, A_t) = Σ_{t=0}^∞ E_{i,π} r(I_t, A_t).

In this way we have formulated a mathematical set-up which contains e.g.

- discounted Markov decision processes (by taking p^a_ij = β q^a_ij, if β is the discount factor and q^a_ij a transition probability);
- discounted Markov decision processes with action and/or state dependent discount factors;
- fading Markov decision processes;
- fading and discounted semi-Markov decision processes;
- (semi-)Markov decision processes with fading rewards.

The whole structure is such that the very general strategy concept seems not to be necessary if one is interested in the value

v_i := sup_{π∈Π} v_i(π),

or if one looks for a good strategy π' with

v_i = v_i(π')   or   v_i(π') ≥ v_i - ε   for some given ε > 0.

It seems that one may replace Π by a set of somewhat more structured strategies.

A strategy π ∈ Π is called a randomized Markov strategy if it is such that π_t(A' | i_0,a_0,...,i_t) does not depend on i_0,a_0,...,a_{t-1}. So π_t may be written as π_t(A' | i_t). The set of all randomized Markov strategies is denoted by RM.

Using a simple and elegant idea of Derman and Strauch [3] one may prove that under condition (*) any strategy π may be replaced by some π' ∈ RM with v_i(π) = v_i(π').

Lemma 2.1: For any starting state i and any strategy π ∈ Π, there exists a strategy π' ∈ RM such that

P_{i,π}(I_t = j, A_t ∈ A') = P_{i,π'}(I_t = j, A_t ∈ A')   for all j ∈ I, t = 0,1,..., A' ∈ 𝒜.

Proof. Choose

π'_t(A' | i_t) := P_{i,π}(I_t = i_t, A_t ∈ A') / P_{i,π}(I_t = i_t)

if the denominator is greater than zero; otherwise π'_t(A' | i_t) may be chosen arbitrarily.

Lemma 2.2: Under condition (*) we have for any i ∈ I:

a. v_i = sup_{π∈RM} v_i(π);
b. if v_i = v_i(π) for some π ∈ Π, then v_i = v_i(π') for some π' ∈ RM;
c. there exists a π ∈ RM with v_i(π) ≥ v_i - ε   (ε > 0).

The proof of these results follows straightforwardly from equation (2.1).

Lemma 2.2 implies that a restriction to randomized Markov strategies is justified. One naturally would like it if RM could be further restricted to M, where M denotes the subset of RM consisting of those strategies π ∈ RM for which the probability measures π_t(· | i_t) are degenerate, i.e. the nonrandomized Markov strategies.

Moreover, one hopes that conditions like (*) may be replaced by equivalent conditions for M instead of Π. Condition (*) is equivalent to the same condition but only for randomized Markov strategies. One easily shows that condition (*) implies:

(**)   sup_{π∈M} E_{i,π} Σ_{t=0}^∞ r^+(I_t, A_t) < ∞   for all i ∈ I.

Since the converse is also true, we have

Lemma 2.3: (**) ⇔ (*).

Proof. " .. " Suppose (*) is not true, Le. for some i and 11" € RM we have

This implies that for any number R, there exists a T such that

(2.2) E.

I

T 0 r + (I ,A ) -

r

T E. r + (I ,A ) ~ R •

1,lT t- t t taO 1,lT t t

It holds that

fA

r+(j,a)1TT(dalj) < QO for all j, namely suppose the contrary for

some j, then r+(j,a) cannot be bounded as function of a, which contradicts (**). So lTT may be replaced by some deterministic rule without diminishing the value of (2.2). In a similar way one proves by backwards induction that IT may be

re-placed by a strategy nl

which still satisfies (2.2).

Actually the proof of lemma 2.3 contains the result that for positive Markov decision processes (i.e. r = r^+) the condition (**) implies that restriction to strategies in M (for fixed starting state) is allowed (this result is contained in Theorem C of Ornstein [25]). The proof we gave here is based on Van Hee [8]. This also holds for the following reasoning, which shows that restriction to M is always allowed if (**) is fulfilled.

Lemma 2.4: Let (**) be fulfilled. Let u be a nonnegative function on I. For any i ∈ I, π ∈ RM and any natural number T there exists a strategy π' ∈ M with

(2.3)   E_{i,π'} Σ_{t=0}^{T-1} r(I_t, A_t) - E_{i,π'} u(I_T) ≥ E_{i,π} Σ_{t=0}^{T-1} r(I_t, A_t) - E_{i,π} u(I_T).

Proof. The construction of π' proceeds by backwards induction with respect to t.

Theorem 2.1: Let (**) be fulfilled, then v_i = sup_{π∈RM} v_i(π) = sup_{π∈M} v_i(π).

Proof. Let π ∈ RM satisfy (for fixed i ∈ I, ε > 0)

v_i(π) ≥ v_i - ε.

Then for some T (see lemma 2.3)

E_{i,π} Σ_{t=T}^∞ r^+(I_t, A_t) ≤ ε.

Hence

(2.4)   E_{i,π} Σ_{t=0}^{T-1} r(I_t, A_t) - E_{i,π} Σ_{t=T}^∞ r^-(I_t, A_t) ≥ v_i - 2ε,

where r^-(i,a) := -min{0, r(i,a)}.

By defining u(j) := E_{i,π}[Σ_{t=T}^∞ r^-(I_t, A_t) | I_T = j], formula (2.4) becomes

(2.5)   E_{i,π} Σ_{t=0}^{T-1} r(I_t, A_t) - E_{i,π} u(I_T) ≥ v_i - 2ε.

Lemma 2.4 implies the existence of some π' ∈ M with property (2.3). So (2.5) also holds with π' instead of π. Note that

E_{i,π'} u(I_T) = Σ_{j∈I} P_{i,π'}(I_T = j) u(j).

By using theorem 4.3 of Strauch [33] on negative dynamic programming one easily shows that π_T, π_{T+1}, ... in the definition of u(j) may be replaced by some π'_T, π'_{T+1}, ... such that the value of E_{i,π'} u(I_T) does not increase. So the strategy (π'_0, ..., π'_{T-1}, π'_T, ...) is 2ε-optimal in starting state i.

3. The optimality equations

For the finite stage problem we introduce the analogous notation:

v_i^(T)(π) := E_{i,π} Σ_{t=0}^{T-1} r(I_t, A_t);    v_i^(T) := sup_{π∈Π} v_i^(T)(π).

For the T-stage problem (under condition (**) restricted to T stages) one simply obtains the value function v^(T) (and an optimal strategy if the sup's are attained) by the following iteration process for t = 1,...,T:

v^(t) = sup_{f∈F} {r(f) + P(f) v^(t-1)},

where v^(0) := 0, F consists of all mappings from I into A, r(f) is a |I|-vector and P(f) a |I| × |I|-matrix with r_i(f) := r(i,f(i)) and p_ij(f) := p^{f(i)}_ij.
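As a concrete illustration, the recursion can be sketched as follows for finite I and A; the function name, the numpy array layout and the fixed horizon argument are assumptions made only for this example and are not part of the memorandum.

```python
import numpy as np

def finite_stage_values(r, p, T):
    """T-stage value iteration  v^(t) = sup_f { r(f) + P(f) v^(t-1) },  v^(0) = 0.

    r : (n_states, n_actions) array with one-stage rewards r(i, a)
    p : (n_states, n_actions, n_states) array with (sub)stochastic p^a_ij
    Returns v^(T) and a maximizing decision rule f (a mapping from I into A).
    """
    n_states, n_actions = r.shape
    v = np.zeros(n_states)            # scrap function v^(0) = 0
    f = np.zeros(n_states, dtype=int)
    for _ in range(T):
        q = r + p @ v                 # q[i, a] = r(i, a) + sum_j p^a_ij v(j)
        f = q.argmax(axis=1)          # the supremum is attained for finite A
        v = q.max(axis=1)             # componentwise maximum over the actions
    return v, f
```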

This naturally generates the following two questions:

v = sup_{f∈F} {r(f) + P(f)v} ?

lim_{t→∞} v^(t) = v ?

Under condition (**) the answer to the first question is yes, as may be proved with essentially the same proof as Ross [28] uses for his theorem 6.1 for the discounted case. Hordijk [12] applies the same idea for a more general situation (but less general than here) in the proofs of his theorems 3.1 and 3.5.

Theorem 3.1: Let (**) be fulfilled, then v satisfies the following set of equations in x:

(3.1)   x = sup_{f∈F} {r(f) + P(f)x}.

Proof: The proof proceeds in two parts; that v satisfies (3.1) is proved by showing (3.1) with ≤ and with ≥ instead of = for v.

"≤" follows since v(π) ≤ r(π_0) + P(π_0)v.

"≥" follows since for a strategy π with ε-optimal continuations from t = 1 on it holds that v ≥ v(π) ≥ r(π_0) + P(π_0)(v - εe), where e denotes the function identically 1 on I; since π_0 and ε > 0 are arbitrary, this yields "≥".

The second question cannot be answered positively under (**), as the following simple example from van Hee, Hordijk, van der Wal [10] shows:

Example (transition diagram omitted): r(1,1) = 0, r(1,2) = 2, r(2,1) = -1; for t ≥ 1 we have v_1^(t) = 2, whereas v_1 = 1.

So apparently, some conditions should be imposed in order to obtain lim_{t→∞} v^(t) = v.

Such conditions have been presented for many special structures. The most general formulations, however, can be found in Schal [29], van Hee et al. [10] and Stidham [31]. Schal's and Stidham's treatment is for the case of a general state space and simplifies considerably for countable I. Schal's condition is strongly related to the condition of van Hee et al. Schal's condition is:

α) r(i,a) is bounded from above on I × A;
β) lim_{m→∞} sup_n sup_{π∈Π} E_π Σ_{t=m+1}^n r(I_t, A_t) = 0.

Stidham [31] extends the results of Schal by allowing non-zero terminal reward functions. Moreover he does not assume the one-stage reward function to be bounded above. However, the argument of van Hee et al. is simpler and may also - as Schal's argument - be generalized to general state spaces. Theorem and proof below are based on van Hee et al. [10]:

Theorem 3.2: Let (**) be fulfilled and let a sequence {b_t}_{t=0}^∞ of functions on I be given, such that

b_{t+1} ≥ b_t,   b_0 ≥ 1,   lim_{t→∞} b_t(i) = ∞,

Δ_b(i) := sup_{π∈M} E_{i,π} Σ_{t=0}^∞ b_t(i) r^-(I_t, A_t) < ∞.

Then we have: lim_{t→∞} v^(t) = v.

Proof. (i) v^(t) ≥ v^(t)(π), hence liminf_{t→∞} v^(t) ≥ v(π), which implies liminf_{t→∞} v^(t) ≥ sup_π v(π) = v.

(ii) Suppose π ∈ M, then

v(π) ≥ v^(t)(π) - E_π Σ_{τ=t}^∞ r^-(I_τ, A_τ).

Hence

v ≥ v^(t) - sup_{π∈M} E_π Σ_{τ=t}^∞ r^-(I_τ, A_τ) ≥ v^(t) - (1/b_t) sup_{π∈M} E_π Σ_{τ=t}^∞ b_τ r^-(I_τ, A_τ),

or v ≥ v^(t) - Δ_b/b_t, which implies limsup_{t→∞} v^(t) ≤ v.

Here we have used the choice v^(0) = 0. For other definitions of v^(0) (the so-called scrap function) similar statements may be derived (see van Hee et al. [10] and Stidham [31]).

The same example can be used to show that (**) does not imply unique solvability of the optimality equations (3.1). Actually, any pair (w_1, w_2) with w_1 ≥ 1, w_2 = -1 satisfies (3.1) for the situation of the example (compare theorem 3.1 in Hordijk [12]). One way of constituting unique solvability of (3.1) within some set of vectors is by giving conditions for the convergence of v^(t) to v for a class of scrap functions. If v itself belongs to this class (which is quite natural), then we have the unique solvability of the optimality equations within some set. We will not work this out here, since we will come back to this point in section 5. Namely, this uniqueness is especially of interest for numerical reasons. However, from a numerical point of view it is also desirable to have a stronger form of convergence than the componentwise convergence of theorem 3.2. In order to obtain such a stronger form, we will introduce another type of conditions which also settles the uniqueness problem in a simple way.

4. Stationary strategies

It is quite natural to believe that it is not necessary to consider M, but that one may restrict oneself to the subset of M consisting of those Markov strategies for which π_t does not depend on t. This subset of so-called stationary Markov strategies may be denoted by F, where F is the set of mappings from I to A, since any such function characterizes a stationary strategy in a natural way. Condition (**) appears not to be sufficient for this restriction, as is demonstrated by example 7.1 in van Hee et al. [10], in which for some states v_i = 1 whereas v_i(π) ≤ -1 for all stationary (even with randomized actions) strategies π.

For the positive case (r = r^+) condition (**) is sufficient for the restriction and even for the stronger statement:

Theorem 4.1: Let (**) be fulfilled and r ≥ 0, ε > 0. Then there exists a stationary strategy π with v(π) ≥ (1-ε)v.

For the proof see Ornstein [25], theorem C, where the proof has been given with (**) replaced by its equivalent (*).

In section 5 we will present conditions which allow restriction to stationary strategies. But here it is necessary to quote rather weak conditions which allow this restriction. Schal [29], Stidham [31] and Van Hee and Van der Wal [11] have elegant and strongly related results. However, these results are obtained by quite different techniques. In [11] the authors require (besides (**)) the existence of a set of mappings {b_t}_{t=0}^∞ from I into [1,∞) such that

(***)   b_{t+1} ≥ b_t,   lim_{t→∞} b_t(i) = ∞,

B_b(i) := sup_{π∈M} Σ_{t=0}^∞ b_t(i) |E_{i,π} r(I_t, A_t)| < ∞   (for all i).

Under these conditions they show that the so-called policy iteration algorithm converges. Here this means:

Choose a sequence {ε_n}_{n=1}^∞ with ε_n ≥ 0, ε_n ↓ 0.
Choose f_0 ∈ F.
Determine f_n ∈ F for n = 1,2,..., such that

r(f_n) + P(f_n) w^(n-1) ≥ max{w^(n-1), sup_{f∈F} [r(f) + P(f) w^(n-1) - ε_n e]},

where w^(n) = v(f_n) and e is the unit function on I.
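For finitely many states and actions the supremum in the improvement step is attained, so the ε_n above may be taken equal to zero and the scheme reduces to ordinary policy iteration. The following sketch assumes this finite, contracting situation (so that I - P(f) is invertible); the function names and array conventions are again only illustrative.

```python
import numpy as np

def policy_value(r, p, f):
    """v(f): solve (I - P(f)) v = r(f) for a fixed decision rule f (P(f) substochastic)."""
    n = r.shape[0]
    Pf = p[np.arange(n), f]            # row i of P(f) is p^{f(i)}_{i, .}
    rf = r[np.arange(n), f]
    return np.linalg.solve(np.eye(n) - Pf, rf)

def policy_iteration(r, p, n_iter=50):
    n = r.shape[0]
    f = np.zeros(n, dtype=int)         # f_0 in F
    for _ in range(n_iter):
        w = policy_value(r, p, f)      # w^(n-1) = v(f_{n-1})
        f = (r + p @ w).argmax(axis=1) # f_n maximizes r(f) + P(f) w^(n-1)
    return f, policy_value(r, p, f)
```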

Lemma 4.1: Under (**) and (***) the functions w^(n) converge to v (monotonically).

With the help of this lemma we easily see

Theorem 4.2: Let (**) and (***) hold. Then there exists for any i ∈ I and ε > 0 a stationary strategy f ∈ F with v_i(f) ≥ v_i - ε.

This statement (see theorem 6 in [11]) can be made uniform in i by requiring a somewhat stronger form of (***), namely with the defining conditions holding uniformly on I.

5. Contracting Markov decision processes

All properties mentioned in the preceding sections (and some more) can be obtained for so-called contracting Markov decision processes. This concept is essentially an outgrowth of the concept of discounted Markov decision processes and has been developed on the basis of a proof technique in Blackwell [1]. This proof technique - applying contracting operators in a Banach space - has been first refined by Denardo in [2] and later by many other authors. For a survey of what has been obtained by this technique with respect to the generalization of the conditions and the refinements of the result the authors refer to their survey paper [20]. In this section we give a short treatment of what can be reached by this technique. In the next section we will throw some light on the meaning of the conditions imposed.

The conditions we will introduce have been first introduced by the authors in [21] and generalize the conditions imposed for this approach by many authors, e.g. Blackwell [1], Denardo [2], Harrison [5], Lippman [15], Porteus [26], Wessels [36], van Nunen [23]. Moreover, we will show that the proof technique has been refined in the course of its development in such a way that important results - such as those of MacQueen [17, 18], Porteus [26] and many others (see [20]) - can be obtained in a simpler and more elegant way and under weaker conditions.

Let μ be a given positive function on I. Then μ defines a Banach space W of real-valued functions (or vectors) on I by

W := {w: I → ℝ, sup_{i∈I} |w_i| μ_i^{-1} < ∞}.

In W we have the vector norm ‖w‖ := sup_{i∈I} |w_i| μ_i^{-1} and the matrix norm

‖B‖ := sup_{‖w‖=1} ‖Bw‖ = sup_i μ_i^{-1} Σ_j |B_ij| μ_j.

Now we can formulate our conditions in terms of this Banach space. However, it should be noticed that in practice one has to find a vector μ such that the conditions are satisfied. The vector μ is called the bounding function; the norm in W is called the weighted supremum norm.

Conditions:

(i) condition (**) of section 2 is fulfilled;

(ii) ρ* := sup_{f∈F} ‖P(f)‖ < 1;

(iii) sup_{f∈F} ‖P(f)r̄ - ρ r̄‖ < ∞ for some ρ with 0 < ρ < 1, where r̄_i := sup_{a∈A} r(i,a).
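For a concrete model, condition (ii) can be checked directly once a candidate bounding function μ has been chosen: because the weighted matrix norm is a row-wise maximum and f(i) may be chosen per row, the supremum over F reduces to a maximum over the state-action pairs. The sketch below only illustrates this computation; the names and array layout are assumptions of the example.

```python
import numpy as np

def weighted_norm(B, mu):
    """||B|| = sup_i mu_i^{-1} sum_j |B_ij| mu_j  (weighted supremum norm of a matrix)."""
    return (np.abs(B) @ mu / mu).max()

def rho_star(p, mu):
    """rho* = sup_f ||P(f)||: equals the maximum of mu_i^{-1} sum_j p^a_ij mu_j over (i, a),
    since the norm is computed row by row and f(i) can be chosen per row."""
    return ((p @ mu) / mu[:, None]).max()
```

A model then satisfies condition (ii) for the chosen μ whenever rho_star(p, mu) < 1.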

In this section we will further assume that conditions (i)-(iii) are satisfied. Condition (i) is only used to restrict attention to Markov strategies (theorem 2.1). Conditions (ii) and (iii) make the operator U defined by

Ux := sup_{f∈F} {r(f) + P(f)x}

a contraction operator on the space X, which is obtained by translating W over (1-ρ)^{-1} r̄:

x ∈ X ⇔ x - (1-ρ)^{-1} r̄ ∈ W.

The contraction radius of U is at most ρ (for details, see [21]).

So the operator U has a unique fixed point x* in X, and for any choice v^(0) ∈ X we have v^(t) := U v^(t-1) = U^t v^(0) → x*. Hence one might suspect that v = x*. The convergence of v^(t) is now convergence in norm and with prescribed minimal rate. That v is indeed equal to x* can easily be demonstrated by showing first that v(π) ≤ x* for all π ∈ M, hence v ≤ x*, and secondly by choosing a (stationary) strategy f ∈ F such that

x* - δμ ≤ r(f) + P(f)x* ≤ x*,

which implies x* - (1-ρ*)^{-1} δμ ≤ v(f) ≤ x*. So we have proved the following theorem:

Theorem 5.1:

a) v is the unique solution in X of

(5.1)   x = sup_{f∈F} {r(f) + P(f)x};

b) v^(t) → v (in norm) if v^(0) ∈ X;
c) for any ε > 0 there exists a stationary strategy f ∈ F which is ε-optimal in the following sense: ‖v - v(f)‖ ≤ ε;
d) if the sup in (5.1) is attained at some f = f_0 for x = v, then this f_0 is an optimal stationary strategy.

Theorem 5.1 gives stronger assertions on the topics of sections 3 and 4. Generally speaking, also the assumptions in this section are stronger, although the conditions α, β of Schal in section 3, the condition of theorem 3.2 and the condition (***) are not implied by the contraction conditions.

Apparently, theorem 5.1 is a good basis for a numerical algorithm for the computation of v and of an optimal or nearly optimal strategy. In fact, the properties of the operator U can be used to obtain many refinements. A survey of the four ideas which lead to such refinements has been given in [20]; also White's paper [34] gives much information on these topics. These ideas are the following:

1) the use of the actual convergence of the iterates v^(t) for the construction of upper and lower bounds for v and for the value of a good strategy (this approach has been initiated by MacQueen in [17] and will be treated at the end of this section);

2) the use of alternative policy improvement procedures (i.e. variants of the operator U). This can be done in a general way by generating such operators with the use of stopping times, as initiated by Wessels in [35] (see sections 7 and 8);

3) a better evaluation of actual policies in each iteration step by a value oriented approach. In this approach the operator U is not applied on v^(t-1), but on a more or less accurate approximation of v(f_{t-1}), where f_{t-1} is the maximizing strategy in the foregoing iteration. This approach will not be worked out here. It has been initiated in van Nunen [24]; for a short introduction and more references see [20];

4) elimination during the iteration process (permanently or temporarily) of nonrelevant actions. Using the MacQueen bounds (or any other bounds) one can easily identify some actions as nonrelevant for some or more of the subsequent iteration steps. This idea stems from MacQueen [18] and has appeared to be a very efficient tool. The basic idea will be given below (see also [5], [7]).

The MacQueen bounds are based on the following theorem.

Theorem 5.2: Suppose the supremum in the operator U is attained. Let v^(0) be given such that v^(t) = U v^(t-1), v^(1) := U v^(0), and suppose v^(0) ≤ v^(1). Let f_t be such that v^(t) = r(f_t) + P(f_t) v^(t-1). Then

v^(t) + ρ_-^(t) (1 - ρ_-^(t))^{-1} ‖v^(t) - v^(t-1)‖_- μ ≤ v(f_t) ≤ v ≤ v^(t) + ρ* (1 - ρ*)^{-1} ‖v^(t) - v^(t-1)‖ μ,

where

ρ_-^(t) := inf_i [μ_i^{-1} Σ_j p^{f_t(i)}_ij μ_j]   and   ‖w‖_- := inf_i |w_i| μ_i^{-1}.

Proof: From the condition v^(0) ≤ v^(1) (which is only taken to simplify the proof) we get monotonicity of the sequence v^(t). This follows from the monotonicity of the mapping U: v^(1) = U v^(0) ≥ v^(0), v^(2) = U v^(1) ≥ U v^(0) = v^(1), etc. Now

U^n v^(t) - v^(t) = U^n v^(t) - U^{n-1} v^(t) + U^{n-1} v^(t) - U^{n-2} v^(t) + ... + U v^(t) - U v^(t-1)
≤ (ρ*^n + ρ*^{n-1} + ... + ρ*) ‖v^(t) - v^(t-1)‖ μ

and

v = lim_{n→∞} U^n v^(t).

Clearly v ≥ v(f_t). The left-hand side of the inequality, which yields a lower bound for v(f_t) and thus for v, follows in a similar way by using the mapping L(f), defined by L(f)w := r(f) + P(f)w, instead of the mapping U. Then it is easily verified (see [21]) that v(f_t) = lim_{n→∞} L^n(f_t) v^(t). Now

L(f_t) v^(t) - v^(t) = P(f_t)(v^(t) - v^(t-1)) ≥ P(f_t) ‖v^(t) - v^(t-1)‖_- μ ≥ ρ_-^(t) ‖v^(t) - v^(t-1)‖_- μ.

Using this in a similar way as we did for the upper bound we get the desired lower bound.

The idea of eliminating actions permanently is based on the following observation. Let u and ℓ be an upper bound and a lower bound for v; then action a_0 is non-optimal for state i if

r(i,a_0) + Σ_j p^{a_0}_ij u_j < sup_a {r(i,a) + Σ_j p^a_ij ℓ_j}.

That a_0 is non-optimal follows since

v(i) = sup_a {r(i,a) + Σ_j p^a_ij v_j} ≥ sup_a {r(i,a) + Σ_j p^a_ij ℓ_j} > r(i,a_0) + Σ_j p^{a_0}_ij u_j ≥ r(i,a_0) + Σ_j p^{a_0}_ij v_j.

The way in which the MacQueen bounds are constructed guarantees that none of these actions needs consideration in the subsequent iteration steps. For temporary elimination (e.g. for n steps) one uses in fact bounds for v^(t+n) instead of v^(∞) = v, see [7].
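The elimination test itself is a one-line comparison once bounds u and ℓ for v are available (e.g. the MacQueen bounds above); the following sketch reuses the numpy layout of the earlier examples and is only an illustration.

```python
import numpy as np

def eliminable_actions(r, p, u, ell):
    """Boolean mask: True for (i, a0) with
    r(i,a0) + sum_j p^{a0}_ij u_j  <  max_a { r(i,a) + sum_j p^a_ij ell_j },
    i.e. action a0 is non-optimal in state i and may be dropped."""
    optimistic = r + p @ u                              # upper estimate per (i, a)
    best_pessimistic = (r + p @ ell).max(axis=1, keepdims=True)
    return optimistic < best_pessimistic
```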

6. Analysis of the contraction assumptions

As mentioned earlier, contracting Markov decision models which allow for an unbounded reward structure were considered by several authors. Harrison [5] in fact introduced the idea of a translation function, while Lippman [15] was the first to use a polynomial as bounding function. Wessels [36] introduced general weighted supremum norms. In Van Nunen [23] and Van Nunen and Wessels [21], [20] these ideas and others are combined and extended.

For a thorough analysis of the conditions we refer to van Hee and Wessels [9], van Nunen [23], van Nunen and Wessels [20], [21]. We will only give some comments.

Van Hee and Wessels [9] describe the relation between strong excessivity of a process and the transient behaviour of that process. Here a process is said to be strongly excessive if it satisfies condition (ii) of section 5. Moreover, in [21] the authors discuss the relation between strong excessivity and so-called N-stage contraction. In [37] we gave a similar N-stage contraction result with respect to an arbitrary norm.

In this paper we will put some more emphasis on the significance of a bounding function μ for the handling of unbounded rewards.

However, it may be proved that the majority of practical problems can also be treated by using appropriate shift-functions with the bounding function μ = 1. For details see a forthcoming paper by Stidham and Van Nunen [32]. Nevertheless, the use of weighted supremum norms itself offers an elegant tool to cope with unbounded rewards. Moreover, for numerical purposes, measuring with respect to a bounding function μ has the advantage that it is possible to compute the actual value of the optimal solution with a numerical accuracy that is state dependent. For instance, the practical situation where |v_i| grows very large if i is large could be handled by assigning large μ_i values to those "large" states.

A queueing example with this aspect is given by Van Nunen and Wessels in [37], while an inventory example is described in Van Nunen [23]. In the queueing example, which treats an M/G/1 queue with a controllable server, the μ-function can be a polynomial in i if the waiting costs for i customers in the system grow as a function of i in a polynomial way. In the inventory example that was given in [23] the costs are allowed to increase in an exponential way as a function of the state. Of course in that situation an exponential μ-function was needed to keep rewards bounded (with respect to that μ-function).

In [23] it is shown that the combination of the shift-function and the bounding function approach extends the class of problems that can be considered by each approach separately. However, in case the shift-function is bounded from one side with respect to μ, e.g. r_i ≤ Mμ_i or r_i ≥ -Mμ_i, the bounding function approach by itself will suffice (see [23] for details). So also most of the practical problems can be treated by using only the weighted supremum norm approach.

The big advantage of the contracting conditions over the more general situation that is considered in sections 2, 3 and 4 is that bounds can be constructed in an easy way and that convergence is at a geometric rate.

The contraction conditions as given in section 5 are quite general and cover most of the practical problems. Moreover, they are in some aspects even more general than those given by Schal, since they do not require e.g. the rewards to be bounded from above. In addition, the contraction conditions as introduced do not require the absolute convergence for all policies as imposed in the strongly convergent case (see [10], [11]), since we only require such conditions for "good policies" (see assumption (iii) in section 5).

7. Generation of s.a. methods by using action dependent stopping times

As mentioned in section 5, stopping times can be used for the generation of alternative successive approximation methods for solving Markov decision processes. These stopping times were action-independent and defined on the process {I_t}_{t=0}^∞ only. In [22], Van Nunen and Stidham extend the existing results by allowing for action dependent stopping times defined on the process {I_t, A_t}_{t=0}^∞. This opens the possibility to exploit properties of the reward and the transition structure depending on the actions in the construction of appropriate solution techniques. For a special subset of action dependent stopping times the corresponding algorithms possess the "equal row sum" property, which can be used, for example, to transform semi-MDP into ordinary MDP. For exponential systems this is equivalent to the "new device" introduced by Lippman [16]. Our approach works also for non-exponential systems. A second advantage of "equal row sum" algorithms is that ρ_-^(t) = ρ* (see theorem 5.2), which means that good extrapolation is possible. We will suppose in this section μ ≡ 1 (see [22] for the more general situation). Moreover, we will define the go-ahead concept in a less general way than is done in [22].

We first define a go-ahead function δ for the stochastic process {I_t, A_t}_{t=0}^∞ by assigning to each path α = (i_0,a_0,i_1,a_1,...,i_t,a_t) a value δ(α) ∈ [0,1], where 1 - δ(α) may be interpreted as the probability of stopping the process at time t (before receiving the reward r(i_t,a_t)) if the path α has been observed and provided that the process has not been stopped earlier.

Some definitions:

(i) δ is said to be nonrandomized if δ(α) ∈ {0,1} for all possible paths α.
(ii) δ is said to be non-zero if δ(i,a) ≥ ε > 0 uniformly in i ∈ I and a ∈ A.
(iii) δ is said to be action independent if δ(α) only depends on (i_0,...,i_t) for α = (i_0,a_0,...,i_t,a_t).
(iv) δ is said to be transition memoryless if δ(α,i,a,j,b) = δ(i,a,j,b) is independent of b ∈ A for all α. Later on in this section the concept of transition memoryless go-ahead functions will appear to be of crucial significance to allow for restriction to stationary Markov policies.

After introducing the probabilistic go-ahead concept we will incorporate it in the probability space in a similar way as was done in [23] for action-independent stopping times, by extending the space (I × A)^∞ to (I × A × E)^∞ with E = {0,1}. Now the stochastic process {I_t, A_t}_{t=0}^∞ is extended to the process {I_t, A_t, Y_t}_{t=0}^∞ where Y_t = 0 with probability δ(I_0, A_0, ..., I_t, A_t).

Now any starting state i, any go-ahead function δ and any decision rule π determine a probability measure on (I × A × E)^∞ in the usual way (see also [23]). We will denote the probability by P^δ_{i,π} and the corresponding expectation by E^δ_{i,π}.

The go-ahead function induces a stopping time in a way similar to the way in [23]. The stochastic variable τ is defined by

τ = t ⇔ Y_0 = Y_1 = ... = Y_{t-1} = 0 and Y_t = 1;   τ = ∞ ⇔ Y_t = 0 for all t = 0,1,2,....

Now, for any strategy π, any go-ahead function δ, and any function w on I, we define the mapping

L^π_δ w(i) := E^δ_{i,π} [ Σ_{k=0}^{τ-1} r(I_k, A_k) + w(I_τ) ],

with w(I_τ) := 0 if τ = ∞.

We will give a few examples to illustrate the concept of go-ahead functions and the meaning of the mapping L^π_δ. For more examples we refer to Wessels [35], Van Nunen [23], and Van Nunen and Stidham [22].

Examples 7.1: For all these examples we suppose π to be a stationary Markov strategy π = (f,f,...).

(i) δ(i,a) = 1 for all i ∈ I, a ∈ A, and δ(α,i,a) = 0 otherwise. Then

w(i) := L^π_δ v(i) = r(i,f(i)) + Σ_j p^{f(i)}_ij v(j),

which yields the standard (pre-Jacobi) successive approximation method.

(ii) δ(α) = 1 if i_0 = i_1 = ... = i_t; δ(α) = 0 otherwise. Then

w(i) := L^π_δ v(i) = [1 - p^{f(i)}_ii]^{-1} [ r(i,f(i)) + Σ_{j≠i} p^{f(i)}_ij v(j) ],

which yields the Jacobi variant.

(iii) δ(i_0,a_0,i_1,a_1,...,i_t,a_t) = 1 if i_0 > i_1 > ... > i_t; δ(α) = 0 otherwise, where an ordering of the state space is presupposed. The resulting mapping w(i) := L^π_δ v(i) is a Gauss-Seidel variant.

The above three examples all concern non-zero action-independent stopping times.

(iv) Let δ(i,a) = δ^a_i and δ(α) = 0 otherwise, where the value δ^a_i is chosen such that

δ^a_i = (1 - ρ*) / (1 - Σ_j p^a_ij),   with ρ* = sup_{i,a} Σ_j p^a_ij.

Then for a = f(i) we have

w(i) = L^π_δ v(i) = δ^a_i [ r(i,f(i)) + Σ_j p^a_ij v(j) ] + (1 - δ^a_i) v(i).

The i-th row sum of the transition matrix belonging to this mapping is equal to

δ^a_i Σ_j p^a_ij + (1 - δ^a_i) = 1 - δ^a_i (1 - Σ_j p^a_ij) = 1 - (1 - ρ*) = ρ*.

This is independent of i, so the algorithm induces a so-called equal-row-sum algorithm with the mentioned advantages (a code sketch of this and of the earlier variants is given after example (v)).

(iva) It is not required to define δ^a_i by means of ρ*; any ρ satisfying ρ* ≤ ρ < 1 yields an algorithm which has the equal row sum property and a contraction factor ρ.

(v) In this example we introduce the action-dependent Jacobi variant of the previous example: := o~ 1 * , a a * - ) (I-p)[1 - l. p .. + p .. ( I - p ) ] 1J 11 j if and only if iO

=

i

l = •••

=

lt

=

i; o(a) ... 0 otherwise. With this

stop-ping rule the process is stopped in state i immediately with probability 1-6~. I f it is not stopped and the transition is to state j

-i

i,then it

1.

is stopped with probability one (in state j);if the transition is from state i to state i, then it is stopped again with probability o~.

1 The corresponding mapping L~ equals (componentwise)

6<:

- - _ I f'''"'('"'"·...-)I[r(i,f(i)) +

I

1 6a 1 '.J.' - iPU Jr1 v(i)] .

Again it 1S straightforwardly verified that the algorithm possesses the
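The following is a minimal sketch of the variants (i), (ii) and (iv) above, for a fixed decision rule f with μ ≡ 1 and strictly substochastic rows (Σ_j p^{f(i)}_ij < 1, p^{f(i)}_ii < 1); Pf and rf denote P(f) and r(f), and the function names are chosen only for this illustration.

```python
import numpy as np

def pre_jacobi(rf, Pf, v):
    """Example (i): w(i) = r(i, f(i)) + sum_j p^{f(i)}_ij v(j)."""
    return rf + Pf @ v

def jacobi(rf, Pf, v):
    """Example (ii): w(i) = (1 - p_ii)^{-1} [ r(i, f(i)) + sum_{j != i} p_ij v(j) ]."""
    d = np.diag(Pf)
    return (rf + Pf @ v - d * v) / (1.0 - d)

def equal_row_sum(rf, Pf, v):
    """Example (iv): w(i) = delta_i [ r(i, f(i)) + sum_j p_ij v(j) ] + (1 - delta_i) v(i),
    with delta_i = (1 - rho) / (1 - sum_j p_ij); every row of the induced transition
    matrix then sums to rho."""
    row = Pf.sum(axis=1)
    rho = row.max()                    # rho*, here taken over this rule's rows for simplicity
    delta = (1.0 - rho) / (1.0 - row)
    return delta * (rf + Pf @ v) + (1.0 - delta) * v
```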

Before giving a probabilistic interpretation of the equal-row-sum stopping functions we would like to give the basic theorem which explains the importance of the algorithms based on stopping times:

Theorem 7.1: For any non-zero transition memoryless action-(in)dependent stopping time the mappings L^f_δ and U_δ, where U_δ is defined by

U_δ v := sup_{f∈F} L^f_δ v,

have the following properties:

(i) L^f_δ, U_δ are monotone mappings, i.e. v ≤ w ⇒ U_δ v ≤ U_δ w.
(ii) L^f_δ, U_δ are contraction mappings with contraction radii ρ^f_δ := ‖P_δ(f)‖ and ρ_δ := sup_f ρ^f_δ respectively, where P_δ(f) is defined as the matrix with (i,j)-th element equal to

Σ_{t=0}^∞ P^δ_{i,f}(I_t = j, τ = t).

(iii) The fixed points of L^f_δ and U_δ are v(f) and v respectively.

ele-For a proof see Van Nunen and Stidham [22] for the action-dependent case. ele-For the action-independent case a proof may be found in [23]. As a consequence of the

pre-V10US theorem the sequence vn defined by va: I +~,

vn :: U 6vn_1

converges to v and the convergence rate is geometric. In a similar way as des-cribed for the case T ~ I upper and lower bounds can be constructed and a sub-optimality criterion may be incorporated in the actual algorithm,see e.g. [23] for similar results with action-independent stopping times. In fact,the existing variants,as introduced by Reetz [27J, Hastings [6J, van Nunen [24J. and others, are covered. Moreover, there is a close relation between our stopping time ap-proach and Porteus'spaper[26];for a discussion on this relation see [231, [20J. Several probabilistic interpretations for "equal row sum" stopping times were given in [22J. Here we will restrict the considerations to the most simple situa-tion. Suppose we have a semi-Markov decision process with transition probabilf-ties q~. and distribution functions F~.(t) for <he time until the next transition

1J 1J

given that the system is now in state i and action a is chosen and that the

(22)

possi-19

I

-- bility, if we are 1n state i,to make a trans1t1on to state i in a time 0 (a null event); we suppose such a null event to occur with probability o~ defined in

exam-1.

ple 7.I(iv). Then we see that the expected time until the next transitio~ if the --system is now in state i and action a is chose~satisfies:

E^a_i τ = δ^a_i · 0 + (1 - δ^a_i) Σ_j q^a_ij ∫_0^∞ t dF^a_ij(t).

By choosing δ^a_i in the adequate way, we see that all the expected times can be made equal. So, if we consider a transformed process with possible "null events" in each state with probability δ^a_i, then this process would behave like a process with equidistant decision points, and, as follows from theorem 7.1, with the same total expected reward. For exponential systems Lippman shows that the process is not only the same in expectation but even in distribution; see also Serfozo [30].
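The time-equalization idea can be made concrete as follows: given the expected transition times m^a_i = Σ_j q^a_ij ∫ t dF^a_ij(t), choosing the null-event probabilities so that (1 - δ^a_i) m^a_i is the same constant for every state-action pair yields the equidistant decision points mentioned above. The function below is a hypothetical illustration of just this computation; it does not reproduce the specific δ^a_i of example 7.1(iv).

```python
import numpy as np

def null_event_probabilities(mean_times, target=None):
    """Choose delta[i, a] so that the transformed process has
        E(time to next transition) = delta * 0 + (1 - delta) * mean_times = target
    for every state-action pair.  mean_times[i, a] must be positive."""
    if target is None:
        target = mean_times.min()          # any 0 < target <= min(mean_times) will do
    return 1.0 - target / mean_times       # (1 - delta) * mean_times == target everywhere
```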

References

[1] D. Blackwell, Discounted dynamic programming. Ann. Math. Statist. 36 (1965), 226-235.

[2] E.V. Denardo, Contraction mappings in the theory underlying dynamic programming. SIAM Rev. 9 (1967), 165-177.

[3] C. Derman, R. Strauch, A note on memoryless rules for controlling sequential control processes. Ann. Math. Statist. 37 (1966), 276-278.

[4] L.P.J. Groenewegen, Characterization of optimal strategies in dynamic games. Mathematical Centre Tract 90, Amsterdam (to appear).

[5] J. Harrison, Discrete dynamic programming with unbounded rewards. Ann. Math. Statist. 43 (1972), 636-644.

[6] N.A.J. Hastings, Some notes on dynamic programming and replacement. Oper. Res. Q. 19 (1968), 453-464.

[7] N.A.J. Hastings and J.A.E.E. van Nunen, The action elimination algorithm for Markov decision processes, p. 161-170 in the same volume as [10].

[8] K.M. van Hee, Markov strategies in dynamic programming. Mathematics of Operations Research 3 (1978), 37-41.

[9] K.M. van Hee and J. Wessels, Markov decision processes and strongly excessive functions. Stochastic Processes and their Applications (to appear).

[10] K.M. van Hee, A. Hordijk, J. van der Wal, Successive approximations for convergent dynamic programming, p. 183-211 in H.C. Tijms, J. Wessels (eds.), Markov decision theory, MC-tract 93, Mathematical Centre, Amsterdam, 1977.

[11] K.M. van Hee and J. van der Wal, Strongly convergent dynamic programming: some results, p. 165-172 in Dynamische Optimierung, Bonner Mathematische Schriften, Nr. 98, Bonn, 1977.

[12] A. Hordijk, Dynamic programming and Markov potential theory. MC-tract 51, Mathematical Centre, Amsterdam, 1974.

[13] R.P. Kertz, D.C. Nachman, Persistently optimal plans for non-stationary dynamic programming: the topology of weak convergence case. Ann. Probab. (to appear).

[14] D.M. Krebs, Decision problems with expected utility criteria, I: upper and lower convergent utility. Math. Oper. Res. 2 (1977), 45-53.

[15] S.A. Lippman, On dynamic programming with unbounded rewards. Management Sci. 21 (1975), 1225-1233.

[16] S.A. Lippman, Applying a new device in the optimization of exponential systems. Oper. Res. 23 (1975), 687-710.

[17] J. MacQueen, A modified dynamic programming method for Markovian decision problems. J. Math. Anal. Appl. 14 (1966), 38-43.

[18] J. MacQueen, A test for suboptimal actions in Markovian decision problems. Oper. Res. 15 (1967), 559-561.

[19] J. Neveu, Mathematical foundations of the calculus of probability. San Francisco, Holden-Day, 1965.

[20] J.A.E.E. van Nunen, J. Wessels, Successive approximations for Markov decision processes and Markov games with unbounded rewards. Mathematische Operationsforschung und Statistik (to appear).

[21] J.A.E.E. van Nunen, J. Wessels, Markov decision processes with unbounded rewards, p. 1-23 in the same volume as [10].

[22] J.A.E.E. van Nunen and S. Stidham jr., Action-dependent stopping times and Markov decision processes with unbounded rewards. Report: Program in Operations Research, North Carolina State University, August 1978.

[23] J.A.E.E. van Nunen, Contracting Markov decision processes. MC-tract 71, Mathematical Centre, Amsterdam, 1976.

[24] J.A.E.E. van Nunen, Improved successive approximation methods for discounted Markov decision processes, p. 667-682 in A. Prekopa (ed.), Progress in Operations Research, Amsterdam, North-Holland Publ. Comp., 1976.

[25] D. Ornstein, On the existence of stationary optimal strategies. Proc. Amer. Math. Soc. 20 (1969), 563-569.

[26] E.L. Porteus, Bounds and transformations for discounted finite Markov decision chains. Oper. Res. 23 (1975), 761-784.

[27] D. Reetz, Solution of a Markovian decision problem by overrelaxation. Z. Oper. Res. (1973), 29-32.

[28] S.M. Ross, Applied probability models with optimization applications. San Francisco, Holden-Day, 1970.

[29] M. Schal, Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal. Z. f. Wahrscheinlichkeitstheorie 32 (1975), 179-196.

[30] R.F. Serfozo, An equivalence between continuous and discrete time Markov decision processes. Oper. Res. (to appear).

[31] S. Stidham jr., On the convergence of successive approximations and uniqueness of the solution to the functional equation of dynamic programming. IMSOR-Report, Technical University of Denmark, May 1977.

[32] S. Stidham jr. and J.A.E.E. van Nunen, The shift function approach for Markov decision processes with unbounded rewards. Report: Program in Operations Research, North Carolina State University (to appear).

[33] R. Strauch, Negative dynamic programming. Ann. Math. Statist. 37 (1966), 871-890.

[34] D.J. White, A survey of algorithms for some restricted classes of Markov decision problems. This volume.

[35] J. Wessels, Stopping times and Markov programming, p. 575-585 in Transactions of the 7th Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, Prague, Academia Press, 1977.

[36] J. Wessels, Markov programming by successive approximations with respect to weighted supremum norms. J. Math. Anal. Appl. 58 (1977), 326-335.

[37] J.A.E.E. van Nunen and J. Wessels, A note on dynamic programming with unbounded rewards. Management Sci. 24 (1978), 576-580.
