Markov decision processes with unbounded rewards


Citation for published version (APA):

van Nunen, J. A. E. E., & Wessels, J. (1976). Markov decision processes with unbounded rewards. (Memorandum COSOR; Vol. 7623). Technische Hogeschool Eindhoven.

Document status and date: Published: 01/01/1976


Department of Mathematics

PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 76-23

Markov decision processes with unbounded rewards

by

J.A.E.E. van Nunen and J. Wessels

Eindhoven, The Netherlands, November 1976


Summary

Markov decision processes which allow for an unbounded reward structure are considered. Conditions are given which allow successive approximations with convergence in some strong sense. This "strong" convergence enables the construction of upper and lower bounds.

The conditions are weaker than those proposed by Lippman [15], Harrison [5] and Wessels [28], and are in fact a slight generalization of the conditions proposed by Van Nunen [21].

A successive approximation algorithm will be indicated. The conditions will be analysed and compared with those in the literature.

1. Introduction

We consider a Markov decision system with a countable state space S. So the states in S may be labelled by the natural numbers, S := {1,2,3,...}. The system can be controlled at discrete points in time t = 0,1,2,... by choosing an action a from an arbitrary nonempty action space A. Let 𝒜 be a σ-field on A such that {a} ∈ 𝒜 for all a ∈ A.

The chosen action a ∈ A and the current state i ∈ S at time t exclusively determine the probability of occurrence of state j ∈ S at time t + 1. This probability is denoted by p^a(i,j). If state i has been observed at time t and action a ∈ A has been chosen, the (expected) reward r(i,a) is earned. The objective is to find a decision rule for which the total expected reward over an infinite time horizon is maximal. For the determination of such a decision rule, and for the computation of the total expected reward, we have in fact to solve a functional equation of the following form:

  v(i) = sup_{a∈A} { r(i,a) + Σ_{j∈S} p^a(i,j) v(j) } ,  i ∈ S .

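For a finite toy example this functional equation can be solved by successive approximations; a minimal sketch (all numerical data below are hypothetical, chosen only for illustration):

    import numpy as np

    # Hypothetical example: states S = {0, 1}, actions A = {0, 1}.
    # p[a][i][j] are (sub)stochastic transition probabilities, r[i][a] rewards.
    p = np.array([[[0.5, 0.3], [0.2, 0.6]],    # action 0
                  [[0.1, 0.7], [0.4, 0.4]]])   # action 1
    r = np.array([[1.0, 2.0],                  # r(0, a)
                  [0.5, -1.0]])                # r(1, a)

    v = np.zeros(2)
    for _ in range(1000):
        # v_new(i) = max_a { r(i,a) + sum_j p^a(i,j) v(j) }
        q = r + np.einsum('aij,j->ia', p, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < 1e-10:
            break
        v = v_new
    print("approximate solution v:", v, "maximizing actions:", q.argmax(axis=1))

Since the transition probabilities above are strictly substochastic, the iteration converges geometrically; the conditions under which such convergence can be guaranteed in general are the subject of this paper.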

The more sophisticated methods for solving these functional equations, if they have a unique solution, are linear programming (d'Epenoux [3], de Ghellinck and Eppen [4]) and policy iteration (Howard [13]), which is a very beautiful and elegant method. Actually, linear programming and policy iteration are in a sense equivalent (Mine and Osaki [18], Wessels and Van Nunen [29]).

However, for large-scale problems, successive approximation methods tend to be more efficient than the known sophisticated methods (e.g. Van Nunen [19]).

It appears that successive approximation methods allow for elegant and relatively good extrapolation and error analysis. Moreover, the incorporation of suboptimality tests can improve those methods considerably. Finally, it appears that policy iteration methods (there are many versions with differences in the policy improvement procedures, see e.g. Hastings [6], Van Nunen [21]) are essentially successive approximation methods. These methods happen to converge in finitely many iterations if state and action space are finite.

For these reasons it is still interesting to investigate successive approximation methods for Markov decision processes, and likewise for Markov games (see Van der Wal [27]). Here we will mainly be concerned with the conditions which allow successive approximations with guaranteed convergence in some strong sense, allowing the construction of upper and lower bounds. For convergence in a weaker sense weaker conditions can, of course, be used; we refer to Schäl [25] and Hordijk [12].

After the introduction of the model and the underlying assumptions we will develop some properties. Moreover, we will indicate the specific successive approximation algorithm. Finally we will analyse the assumptions and compare them with those in the literature.

Most of the assertions can be extended to nondenumerable state spaces in the obvious way.

2. The model and the assumptions

We will first introduce our assumptions on the transition probabilities and the rewards. The assumptions will be somewhat weaker than those proposed in the papers mentioned above.


Assumption 2.1.
a) p^a(i,j) ≥ 0 and Σ_{j∈S} p^a(i,j) ≤ 1, for all i,j ∈ S and all a ∈ A;
b) p^a(i,j) is measurable for all i,j ∈ S as a function of a;
c) r(i,a) is measurable for all i ∈ S as a function of a.

Remark 2.1. We allow substochastic behaviour. Defectiveness of the transition probabilities may be interpreted as a positive probability of leaving the system, which results in the stopping of all earnings. In a more formal set-up this may be handled by introducing an extra state which is absorbing for all actions and does not give any earnings. This has been executed e.g. in [21] by Van Nunen and in [11] by Hinderer. Without such a device quite a lot can be achieved in a correct formal way, as has been done by Wessels [28]. Actually, as long as the outcomes in which one is interested may be expressed in terms of bounded order histories, there is no serious problem. In this paper we will suppose that there is such an extra state, without giving it a name or mentioning it explicitly. Compare section 5 for the meaning of substochasticity.

Definition 2.1.
(i) A decision rule π is a sequence of transition probabilities π := (q₀, q₁, ...), where q_t is a transition probability of (H_t, ℋ_t) into (A, 𝒜), with H_t := S × A × S × ⋯ × S (t + 1 times S) and ℋ_t the corresponding product σ-field. The class of all decision rules is denoted by V.
(ii) A decision rule π will be called nonrandomized, or a strategy, if q_t is degenerate for all t and all h_t ∈ H_t. So a strategy is a nonrandomized decision rule.
(iii) A decision rule π is called Markov if q_t only depends on the last component of h_t ∈ H_t. The class of (randomized) Markov decision rules is denoted by RM.
(iv) A Markov decision rule is called stationary if q_t does not depend on t.

A policy f is a function of S into A. By F we denote the set of all policies. Stationary strategies correspond (one to one) to policies, and Markov strategies correspond to sequences of policies. We will apply these correspondences deliberately. The class of Markov strategies is denoted by M.

In an obvious way (see e.g. Van Nunen [21]) any starting state i ∈ S and any decision rule π ∈ V determine a stochastic process {(X_t, Z_t)}_{t=0}^∞ on (S × A)^∞, where X_t denotes the state of the system at time t and Z_t denotes the action at time t. The relevant probability measure on (S × A)^∞ will be denoted by ℙ_i^π; expectations with respect to this measure will be denoted by 𝔼_i^π. By 𝔼^π X we denote the column vector with i-th component 𝔼_i^π X, where X is any random variable.

Assumption 2.2. We assume a positive function μ on S to be given. Let W be the Banach space of vectors w (real valued functions on S) which satisfy

  ‖w‖ := sup_{i∈S} |w(i)| μ⁻¹(i) < ∞ .

For matrices (real valued functions on S × S) we introduce the operator norm

  ‖B‖ := sup_{‖w‖=1} ‖Bw‖ .

Note that

  ‖B‖ = sup_{i∈S} μ⁻¹(i) Σ_{j∈S} |B(i,j)| μ(j) .
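For finite S these weighted norms are easily evaluated from the last formula; a minimal sketch (the weight μ and the matrix B below are hypothetical):

    import numpy as np

    def weighted_sup_norm(w, mu):
        # ||w|| := sup_i |w(i)| / mu(i)
        return np.max(np.abs(w) / mu)

    def weighted_operator_norm(B, mu):
        # ||B|| = sup_i mu(i)^{-1} * sum_j |B(i,j)| * mu(j)
        return np.max((np.abs(B) @ mu) / mu)

    mu = np.array([1.0, 2.0, 4.0])                 # a positive weight function
    B = np.array([[0.2, 0.1, 0.0],
                  [0.0, 0.3, 0.2],
                  [0.1, 0.0, 0.4]])
    w = np.array([0.5, -1.0, 2.0])
    print(weighted_sup_norm(w, mu), weighted_operator_norm(B, mu))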

Assumption 2.3.
(i) sup_{π∈M} 𝔼_i^π Σ_{n=0}^∞ r⁺(X_n, Z_n) < ∞ for all i ∈ S, where r⁺(a,b) := max{0, r(a,b)};
(ii) sup_{f∈F} ‖P(f)‖ =: ρ* < 1, where P(f) is the matrix with P(f)(i,j) := p^{f(i)}(i,j);
(iii) sup_{f∈F} ‖P(f) r̄ − ρ r̄‖ =: M₁ < ∞ for some constant ρ with 0 ≤ ρ < 1, where r̄ is the vector with i-th component r̄(i) := sup_{a∈A} r(i,a).

Remark 2.3. Note that P(f) r̄⁺ < ∞ (componentwise), since sup_{f∈F} P(f) r⁺(f) < ∞. Moreover, P(f) r̄ < ∞, as is implicitly stated in assumption 2.3 (iii). The model in fact combines the main features of the models introduced by Harrison [5], Wessels [28] and Van Hee [9], and yields a slight extension with respect to the model considered by Van Nunen [21]. Since we will prove similar results as Harrison [5], Wessels [28] and Van Nunen [21], this paper generalizes their results.

We will first show that under assumption 2.3 (i) the restriction to Markov strategies is allowed if one is interested in the criterion of total expected rewards.

Given that assumption 2.3 (i) is satisfied, it will be clear that for any π ∈ M

  v(π) := 𝔼^π Σ_{n=0}^∞ r(X_n, Z_n)

is properly defined and that all manipulations with integration and summation are allowed. However, v_i(π) may be −∞ for some i ∈ S. Furthermore sup_{π∈M} v_i(π) < ∞. In [9] Van Hee shows that under assumption 2.3 (i), v_i(π) is properly defined for all π ∈ RM since

  sup_{π∈RM} 𝔼_i^π Σ_{n=0}^∞ r⁺(X_n, Z_n) = sup_{π∈M} 𝔼_i^π Σ_{n=0}^∞ r⁺(X_n, Z_n) < ∞ .

Moreover, he proves that

  sup_{π∈RM} v_i(π) = sup_{π∈M} v_i(π) .

It then follows straightforwardly from Hordijk's [12] generalisation of a result of Derman and Strauch [2] that v_i(π) is properly defined for all π ∈ V and i ∈ S, viz. for any i ∈ S and any π ∈ V there exists a π* ∈ RM such that

  ℙ_i^π(X_n = j, Z_n ∈ A₀) = ℙ_i^{π*}(X_n = j, Z_n ∈ A₀) for all j ∈ S, A₀ ∈ 𝒜, n = 0,1,... .


In the sequel V denotes the set of vectors v of the form v = (1−ρ)⁻¹ r̄ + w with w ∈ W, and ρ₀ := max(ρ, ρ*).

Lemma 3.1. For all n ≥ 1 and all f₁, ..., f_n ∈ F

  ρⁿ r̄ − n ρ₀ⁿ⁻¹ M₁ μ ≤ P(f_n) ⋯ P(f₁) r̄ ≤ ρⁿ r̄ + n ρ₀ⁿ⁻¹ M₁ μ .

Proof. For n = 2:

  P(f₂)P(f₁) r̄ ≤ P(f₂)(ρ r̄ + M₁ μ) ≤ ρ² r̄ + ρ M₁ μ + ρ* M₁ μ ≤ ρ² r̄ + 2 ρ₀ M₁ μ ;

similarly

  P(f₂)P(f₁) r̄ ≥ P(f₂)(ρ r̄ − M₁ μ) ≥ ρ² r̄ − ρ M₁ μ − ρ* M₁ μ ≥ ρ² r̄ − 2 ρ₀ M₁ μ .

The proof proceeds further in an inductive way. □

Corollary 3.1.
(i) 𝔼^π Σ_{n=0}^∞ r̄(X_n) ∈ V for all π ∈ M;
(ii) 𝔼^π Σ_{n=0}^∞ r(X_n, Z_n) ≤ (1−ρ)⁻¹ r̄ + (1−ρ₀)⁻² M₁ μ ∈ V for all π ∈ V.

Proof. For π ∈ M part (ii) follows straightforwardly from the foregoing lemma. Because of the results of section 2 this may be extended to all π ∈ V. □
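The constants in part (ii) arise from summing the bounds of lemma 3.1 over all stages; a short verification:

  \sum_{n=0}^{\infty}\bigl(\rho^{n}\bar r + n\,\rho_0^{\,n-1} M_1 \mu\bigr)
    = (1-\rho)^{-1}\bar r + (1-\rho_0)^{-2} M_1 \mu ,

since \sum_{n\ge 0}\rho^{n} = (1-\rho)^{-1} and \sum_{n\ge 1} n\,\rho_0^{\,n-1} = (1-\rho_0)^{-2}.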


Definition 3.1. L(f) is a mapping of V into V defined by L(f)v := r(f) + P(f)v, where r(f) is the vector with i-th component equal to r(i, f(i)).

Indeed L(f) maps V into V: viz. r(f) ≤ r̄ and v ≤ v₀ for some v₀ ∈ V, so with ‖v₀ − (1−ρ)⁻¹ r̄‖ =: M₂ < ∞ we obtain

  r(f) + P(f)v ≤ r̄ + P(f)(1−ρ)⁻¹ r̄ + P(f) M₂ μ ≤ (1−ρ)⁻¹ r̄ + (M₁(1−ρ)⁻¹ + ρ* M₂) μ ∈ V .

Lemma 3.2.
(i) If r(f) − r̄ ∈ W, then L(f) maps V into V and L(f) is contracting on V with contraction radius ‖P(f)‖ ≤ ρ* < 1. The fixed point of L(f) in V is v(f) := v((f,f,f,...)).
(ii) L(f) is monotone on V.
(iii) If v ∈ V then Lⁿ(f)v → v(f) for n → ∞.

Proof. Part (i) can be found in [28]; part (ii) of the lemma is trivial. The final part is straightforward if r(f) − r̄ ∈ W, since in that case the assertion is implied by the Banach fixed point theorem and the convergence is in norm. If r(f) − r̄ ∉ W we have

  Lⁿ(f)v = Σ_{k=0}^{n−1} Pᵏ(f) r(f) + Pⁿ(f)v .

Since v can be written as v = (1−ρ)⁻¹ r̄ + w with w ∈ W, we have

  Pⁿ(f)v = (1−ρ)⁻¹ Pⁿ(f) r̄ + Pⁿ(f)w .

However, Pⁿ(f)w tends to zero for n → ∞ since P(f) is contracting on W (assumption 2.3 (ii)), and Pⁿ(f) r̄ tends to zero for n → ∞ as follows from lemma 3.1. This implies

  lim_{n→∞} Lⁿ(f)v = Σ_{k=0}^∞ Pᵏ(f) r(f) = v(f) . □

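For a finite model the fixed point v(f) of L(f) can be computed either by iterating L(f) or, equivalently, by solving the linear system (I − P(f))v = r(f); a minimal sketch (data hypothetical):

    import numpy as np

    def evaluate_policy(p, r, f):
        # Fixed point of L(f)v = r(f) + P(f)v, i.e. v(f) = (I - P(f))^{-1} r(f).
        n = len(f)
        P_f = np.array([p[f[i]][i] for i in range(n)])   # P(f)(i,j) = p^{f(i)}(i,j)
        r_f = np.array([r[i][f[i]] for i in range(n)])   # r(f)(i) = r(i, f(i))
        return np.linalg.solve(np.eye(n) - P_f, r_f)

    p = np.array([[[0.5, 0.3], [0.2, 0.6]], [[0.1, 0.7], [0.4, 0.4]]])
    r = np.array([[1.0, 2.0], [0.5, -1.0]])
    print(evaluate_policy(p, r, f=[0, 1]))

Since P(f) here is strictly substochastic, I − P(f) is invertible and the direct solve and the iteration Lⁿ(f)v give the same limit.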

Definition 3.2. U is a mapping of V into V defined by Uv := sup_{f∈F} L(f)v (componentwise).

Indeed U maps V into V: viz. with v = (1−ρ)⁻¹ r̄ + w, w ∈ W,

  Uv = sup_{f∈F} { r(f) + P(f)[(1−ρ)⁻¹ r̄ + w] }
     ≤ r̄ + sup_{f∈F} { (1−ρ)⁻¹ P(f) r̄ } + sup_{f∈F} P(f)w
     ≤ (1−ρ)⁻¹ r̄ + ((1−ρ)⁻¹ M₁ + ρ* ‖w‖) μ ∈ V

and

  Uv ≥ r̄ + inf_{f∈F} (1−ρ)⁻¹ P(f) r̄ + inf_{f∈F} P(f)w
     ≥ r̄ + (1−ρ)⁻¹ ρ r̄ − (1−ρ)⁻¹ M₁ μ − ρ* ‖w‖ μ
     = (1−ρ)⁻¹ r̄ − ((1−ρ)⁻¹ M₁ + ρ* ‖w‖) μ ∈ V .

Lemma 3.3.
(i) U is monotone on V;
(ii) U maps B := { v ∈ V : ‖v − (1−ρ)⁻¹ r̄‖ ≤ M₁ (1−ρ)⁻¹ (1−ρ*)⁻¹ } into itself;
(iii) U is contracting on V with contraction radius γ, γ ≤ ρ* < 1.

The proof proceeds in a similar way as the proof of theorem 4.3.3 in Van Nunen [21].

Remark 3.1. Suppose the supremum in Uv for v ∈ V is attained for certain f. Then r(f) + P(f)v ∈ V, hence

  r(f) + P(f)(1−ρ)⁻¹ r̄ + P(f)w ∈ V

and

  r(f) + (1−ρ)⁻¹ ρ r̄ ∈ V ,

so

  r(f) − r̄ + r̄ + (1−ρ)⁻¹ ρ r̄ = r(f) − r̄ + (1−ρ)⁻¹ r̄ ∈ V ,

consequently r(f) − r̄ ∈ W. The same holds if L(f)v approximates Uv in norm: then L(f)v ∈ V as well, hence r(f) − r̄ ∈ W. So the use of a successive approximation method (even without computing the supremum exactly) leads to a sequence of policies f_n ∈ F with r(f_n) − r̄ ∈ W.

Since U is contracting in V there exists a unique fixed point v* of U in V. This fixed point is the unique solution in V of the optimality equation

  v = sup_{f∈F} { r(f) + P(f)v } .

Furthermore ‖Uⁿv − v*‖ → 0 for n → ∞ and any v ∈ V. In the sequel we will prove that

  v* = sup_{π∈V} 𝔼^π Σ_{n=0}^∞ r(X_n, Z_n) = sup_{π∈V} v(π) .

Theorem 3.1.
(i) v(π) ≤ v* for all π ∈ V.
(ii) For any ε > 0 there exists a policy f such that ‖v(f) − v*‖ ≤ ε; hence

  sup_{π∈V} v(π) = sup_{f∈F} v(f) = v* .

Moreover, if for some f it holds that v* = r(f) + P(f)v*, then v(f) = v*.

Proof. The proof of this theorem proceeds exactly along the same lines as the proof of theorem 4.3.4 in [21]. In [21] part (i) has been proved by showing first that the assertion is true for π ∈ M and then using the results of section 2. Part (ii) follows directly if we choose f ∈ F such that

  v* − δμ ≤ L(f)v* ≤ v* ,

then

  L(f)[v* − δμ] ≤ L²(f)v* ≤ v* ,

hence

  v* − δ(1 + ρ*)μ ≤ L²(f)v* ≤ v* .

Iterating this inequality gives

  v* − δ(1 − ρ*)⁻¹ μ ≤ v(f) ≤ v* ,

so by choosing δ = ε(1 − ρ*) the statement will be clear. □

4. Successive approximation

In the previous section we showed that the unique fixed point v* of the contraction operator U in V is the optimal value vector of the Markov decision problem. Hence v* can be approximated by

  v_n := U v_{n−1}   (v₀ ∈ V and n = 1,2,...) .

Furthermore, we proved the existence of stationary Markov strategies with value functions that approximate v* (in norm).

Generally one not only wishes to find v* but one is also interested in good (stationary Markov) strategies. It may occur that the supremum in Uv cannot be computed exactly. Nevertheless, there are several successive approximation methods for the computation of v* and the determination of an (ε-)optimal stationary Markov strategy; we refer to [22] in this volume. Here, as an example, we describe a method which uses the monotonicity of the v_n.


Lemma 4.1. Let δ > 0 and suppose v, v' ∈ V are such that Uv' − δμ ≤ v. Then

  v* ≤ v + (δ + ρ* ‖v − v'‖)(1 − ρ*)⁻¹ μ .

Proof. The proof can also be found in [28] and proceeds as follows. Since Uv' ≤ v + δμ we have

  Uv ≤ Uv' + ρ* ‖v − v'‖ μ ≤ v + δμ + ρ* ‖v − v'‖ μ ,

or Uv ≤ v + εμ with ε := δ + ρ* ‖v − v'‖. Similarly

  U²v ≤ U(v + εμ) = U(v' + v − v' + εμ) ≤ v + δμ + ρ* ‖v − v'‖ μ + ρ* εμ = v + ε(1 + ρ*)μ .

Iterating in the same way gives

  Uⁿv ≤ v + ε(1 + ρ* + ⋯ + ρ*ⁿ⁻¹)μ ≤ v + ε(1 − ρ*)⁻¹ μ .

This implies

  v* = lim_{n→∞} Uⁿv ≤ v + ε(1 − ρ*)⁻¹ μ . □

Lemma 4.2. If v, v' ∈ V with L(f)v' ≥ v, then r(f) − r̄ ∈ W and

  v* ≥ v(f) ≥ v + ρ_f ⌊v − v'⌋ (1 − ρ_f)⁻¹ μ ,

where

  ⌊v − v'⌋ := inf_{i∈S} μ⁻¹(i) (v(i) − v'(i)) and ρ_f := inf_{i∈S} μ⁻¹(i) Σ_j p^{f(i)}(i,j) μ(j) .

Proof. The proof of this lemma proceeds along the same lines as the proof of the foregoing lemma. □

The convergence of the following successive approximation algorithm will be clear as a consequence of the foregoing two lemmas.

Algorithm 4.1.

STEP 0. Choose α > 0; choose δ > 0 such that δ(1 − ρ*)⁻¹ < α; choose v₀ ∈ V such that v₀ ≤ U v₀; n := 1.

STEP 1. Determine f_n such that

  v_n := L(f_n) v_{n−1} ≥ max { v_{n−1}, U v_{n−1} − δμ } .

STEP 2. If

  (δ + ρ* ‖v_n − v_{n−1}‖)(1 − ρ*)⁻¹ − ρ_{f_n} ⌊v_n − v_{n−1}⌋ (1 − ρ_{f_n})⁻¹ < α

then go to step 3, else go to step 1 with n := n + 1.

STEP 3. End of the algorithm.
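For a finite model in which Uv can be computed exactly, the algorithm admits a compact implementation; a minimal sketch (the data, the choice of v₀ and the crude bound for ρ* are all illustrative assumptions):

    import numpy as np

    def algorithm_4_1(p, r, mu, alpha):
        # Sketch of algorithm 4.1; p[a,i,j] strictly substochastic transition
        # probabilities, r[i,a] rewards, mu a positive weight function.
        nA, nS, _ = p.shape
        rho_star = max((p[a] @ mu / mu).max() for a in range(nA))  # sup_f ||P(f)||
        sigma = p.sum(axis=2).max()
        assert rho_star < 1 and sigma < 1
        delta = 0.5 * alpha * (1 - rho_star)              # delta (1-rho*)^{-1} < alpha
        v = np.full(nS, min(0.0, r.min()) / (1 - sigma))  # guarantees v0 <= U v0
        while True:
            q = r + np.einsum('aij,j->ia', p, v)
            f = q.argmax(axis=1)                  # maximizing policy f_n
            v_new = q.max(axis=1)                 # v_n = L(f_n)v_{n-1} = U v_{n-1}
            diff = v_new - v
            upper = (delta + rho_star * np.max(diff / mu)) / (1 - rho_star)
            P_f = p[f, np.arange(nS)]             # P(f)(i,j) = p^{f(i)}(i,j)
            rho_f = np.min(P_f @ mu / mu)
            lower = rho_f * np.min(diff / mu) / (1 - rho_f)
            v = v_new
            if upper - lower < alpha:             # STEP 2 stopping criterion
                return v + lower * mu, v + upper * mu, f

    p = np.array([[[0.45, 0.45], [0.2, 0.7]], [[0.1, 0.8], [0.4, 0.5]]])
    r = np.array([[1.0, 2.0], [0.5, -1.0]])
    low, up, f = algorithm_4_1(p, r, mu=np.ones(2), alpha=1e-6)
    print(low, up, f)   # v(f_n) and v* lie between the returned bound vectors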

Lemmas 4.1 and 4.2 provide that the algorithm stops after a finite number of iterations and that in the n-th iteration step of the algorithm we have

  v_n + ρ_{f_n} ⌊v_n − v_{n−1}⌋ (1 − ρ_{f_n})⁻¹ μ ≤ v(f_n) ≤ v* ≤ v_n + (δ + ρ* ‖v_n − v_{n−1}‖)(1 − ρ*)⁻¹ μ .

If the algorithm ends at iteration step n_α with policy f_{n_α}, then the distance between v* and v(f_{n_α}) is at most α, and the distance between the upper and lower bound for v(f_{n_α}) is less than α.

Note that the choice of v₀ and the way in which v_n is computed assure that v_n converges monotonically from below to v*, i.e.

  v_{n−1} ≤ v_n ≤ v(f_n) ≤ v* and lim_{n→∞} v_n = v* .

For proofs we refer to [21], [28].

If we relax the monotonicity assumptions and choose v₀ ∈ V arbitrarily, it remains possible to give adequate successive approximation algorithms; see [22] in this volume.

In all those methods a main role is played by the concept of upper and lower bounds. In fact the fast convergence of the algorithms is caused by the use of this concept; see e.g. MacQueen [16], Porteus [23], Van Nunen [19]. Moreover, upper and lower bounds can be used to formulate suboptimality tests which may even improve the efficiency of the algorithms considerably; see e.g. MacQueen [17], Hastings and Van Nunen [8], Hastings and Mello [7], Hübner [14].

5. Analysis of the assumptions

Let us first make some remarks on the assumptions.

Remark 5.1.
(i) r̄ may be replaced by any vector b with b − r̄ ∈ W, so it is not necessary to compute r̄ exactly. Such an approach is applied in Van Nunen [21].
(ii) The model also contains semi-Markov decision processes, discounted Markov decision processes and discounted semi-Markov decision processes.
(a) Semi-Markov decision processes (without discounting) are covered by taking the number of the decision instant as decision time and the expected reward until the next decision instant as reward. Alternatively spoken, one considers the embedded process; see e.g. Mine and Osaki [18].
(b) Discounted Markov decision processes are included by incorporating the discount factor β (if β ≤ 1) in the transition probabilities, i.e. p̃^a(i,j) := β p^a(i,j); a sketch of this incorporation follows after this remark. If β > 1 the theory should be slightly adapted. However,

  sup_{π∈M} 𝔼_i^π Σ_{n=0}^∞ βⁿ r⁺(X_n, Z_n) < ∞

remains a sufficient condition for restriction to stationary Markov strategies (see Van Hee [9]).
(c) For discounted semi-Markov decision processes with discount rate α ≥ 0 again incorporation in the transition probabilities is appropriate; for α < 0 the theory needs slight modifications.
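As announced under (ii)(b), the incorporation of the discount factor is a purely mechanical transformation of the data; a minimal sketch (β and the transition data hypothetical):

    import numpy as np

    def incorporate_discount(p, beta):
        # Replace p^a(i,j) by beta * p^a(i,j); the discounted problem then
        # becomes an undiscounted substochastic one, the defect 1 - beta
        # being the "stopping" probability of remark 2.1.
        return beta * np.asarray(p)

    p = np.array([[[0.5, 0.5], [0.2, 0.8]], [[0.1, 0.9], [0.4, 0.6]]])
    p_tilde = incorporate_discount(p, beta=0.9)
    print(p_tilde.sum(axis=2))   # row sums are now 0.9 < 1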

We now relate the use of the translation function (1 − ρ)⁻¹ r̄, as introduced in a slightly different way by Harrison [5], to an approach of Porteus [24]. Porteus proposed, for the finite state-finite action case, that the use of a translation function might be replaced by a transformation of the data. He therefore introduced the return transformation

  r̃(i,a) := r(i,a) − (1−ρ)⁻¹ ( r̄(i) − Σ_{j∈S} p^a(i,j) r̄(j) ) .

For the transformed problem we have, writing r̃(i) := sup_{a∈A} r̃(i,a),

  r̃(i) ≤ r̄(i) − (1−ρ)⁻¹ r̄(i) + (1−ρ)⁻¹ ρ r̄(i) + (1−ρ)⁻¹ M₁ μ(i) = (1−ρ)⁻¹ M₁ μ(i)

for all i ∈ S, and similarly

  r̃(i) ≥ r̄(i) − (1−ρ)⁻¹ r̄(i) + (1−ρ)⁻¹ ρ r̄(i) − (1−ρ)⁻¹ M₁ μ(i) = −(1−ρ)⁻¹ M₁ μ(i)

for all i ∈ S.

Hence we have

  (1) r̃ ∈ W ;
  (2) ‖P(f)‖ ≤ ρ* < 1 (the transition probabilities are not affected by the transformation).

This implies that the transformed problem can be handled without using a translation and fits into the model in Wessels [28] (see also Van Nunen [21]). The question remains whether for all i ∈ S and π ∈ V

  ṽ_i(π) = v_i(π) + u(i)

for some function u on S which is independent of π. As a consequence of (1) and (2) we have that

  ṽ_i(π) = 𝔼_i^π Σ_{n=0}^∞ r̃(X_n, Z_n) = Σ_{n=0}^∞ 𝔼_i^π r̃(X_n, Z_n) ,

and that any π may be replaced by a randomized Markov decision rule without any effect on ṽ_i(π). Now

  Σ_{n=0}^∞ 𝔼_i^π [ r(X_n,Z_n) − (1−ρ)⁻¹ r̄(X_n) + (1−ρ)⁻¹ Σ_j p^{Z_n}(X_n,j) r̄(j) ]
   = Σ_{n=0}^∞ 𝔼_i^π [ r(X_n,Z_n) − (1−ρ)⁻¹ r̄(X_n) + (1−ρ)⁻¹ 𝔼_i^π ( r̄(X_{n+1}) | X_n, Z_n ) ]
   = lim_{N→∞} Σ_{n=0}^N 𝔼_i^π { r(X_n,Z_n) − (1−ρ)⁻¹ r̄(X_n) + (1−ρ)⁻¹ r̄(X_{n+1}) }
   = lim_{N→∞} { Σ_{n=0}^N 𝔼_i^π r(X_n,Z_n) − (1−ρ)⁻¹ r̄(i) + (1−ρ)⁻¹ 𝔼_i^π r̄(X_{N+1}) }
   = v_i(π) − (1−ρ)⁻¹ r̄(i) ,

where the last equality is allowed since 𝔼_i^π r̄(X_{N+1}) tends to zero for N → ∞ (lemma 3.1). So indeed ṽ_i(π) = v_i(π) + u(i), with u(i) = −(1−ρ)⁻¹ r̄(i).
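For a finite model this identity is easily checked numerically for a stationary strategy; a minimal sketch (ρ and the data hypothetical):

    import numpy as np

    p = np.array([[[0.45, 0.45], [0.2, 0.7]], [[0.1, 0.8], [0.4, 0.5]]])
    r = np.array([[1.0, 2.0], [0.5, -1.0]])
    rho = 0.5                                   # any 0 <= rho < 1
    r_bar = r.max(axis=1)

    # Porteus' return transformation:
    # r~(i,a) = r(i,a) - (1-rho)^{-1} (r_bar(i) - sum_j p^a(i,j) r_bar(j))
    r_tilde = r - (r_bar[:, None] - np.einsum('aij,j->ia', p, r_bar)) / (1 - rho)

    f = [0, 1]                                  # a stationary strategy
    n = len(f)
    P_f = np.array([p[f[i]][i] for i in range(n)])
    v  = np.linalg.solve(np.eye(n) - P_f, np.array([r[i, f[i]] for i in range(n)]))
    vt = np.linalg.solve(np.eye(n) - P_f, np.array([r_tilde[i, f[i]] for i in range(n)]))
    print(np.allclose(vt, v - r_bar / (1 - rho)))   # True

The check works because r̃(f) = r(f) − (1−ρ)⁻¹(I − P(f)) r̄, so that (I − P(f))⁻¹ r̃(f) = v(f) − (1−ρ)⁻¹ r̄.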

We will now illustrate how the results of Lippman [15] can be embedded in our theory (see also Van Nunen and Wessels [20]). Lippman proves the convergence of successive approximations at a geometric rate under the following conditions, which are given here in our notation.

Conditions of Lippman. There exist a function u : S → [1,∞), an integer m ≥ 1, and constants 0 ≤ β < 1, b > 0 such that for all i ∈ S, a ∈ A

  Σ_{j∈S} uⁿ(j) p^a(i,j) ≤ β [u(i) + b]ⁿ for n = 1,...,m .

However, we then have for any ρ* with β ≤ ρ* < 1 and any c ≥ b [ (ρ*/β)^{1/m} − 1 ]⁻¹ that for μ defined by

  μ(i) := [u(i) + c]^m

the following holds:

  a) ‖P(f)‖ ≤ ρ* and b) ‖r(f)‖ ≤ M for all f ∈ F and some constant M < ∞ .

So for Markov decision processes as described by Lippman we can use the latter simpler and more general conditions a and b.

Assumption 2.3 (ii) requires some transient behaviour of the processes involved. This may be characterized as strong excessivity, i.e.

  P(f) μ ≤ ρ* μ for all f ∈ F ,

with ρ* < 1 and μ a positive function on S. For strong excessivity several sufficient and necessary conditions can be given. In order to make assumption 2.3 (ii) more transparent, and to relate it to the assumptions of other authors, we will give those conditions.


Lemma 5.1. The process is strongly excessive with μ(i) ≥ δ > 0 if and only if the lifetimes of the process are exponentially bounded, i.e.

  ℙ_i^π(X_n ∈ S) ≤ a(i) γⁿ

for all i ∈ S, π ∈ M, where γ < 1 and a is a positive function on S.

Proof. "if": choose

  μ(i) := sup_{π∈M} Σ_{n=0}^∞ νⁿ ℙ_i^π(X_n ∈ S) with 1 < ν < γ⁻¹

and ρ* := ν⁻¹; it is straightforwardly verified that P(f)μ ≤ ρ* μ.
"only if": note that for π := (f₀, f₁, ...)

  ℙ_i^π(X_n ∈ S) = ( P(f₀) ⋯ P(f_{n−1}) e )(i) with e := (1,1,...) ,

so that ℙ_i^π(X_n ∈ S) ≤ δ⁻¹ ( P(f₀) ⋯ P(f_{n−1}) μ )(i) ≤ δ⁻¹ ρ*ⁿ μ(i). □

Lemma 5.2 (Van Hee and Wessels [10]). The process is strongly excessive with Δ ≥ μ(i) ≥ δ > 0 for some constants Δ ≥ δ > 0, if and only if the lifetimes of the process are exponentially bounded uniformly in i ∈ S, i.e.

  ℙ_i^π(X_n ∈ S) ≤ a γⁿ (with a > 0, 0 < γ < 1) .

Proof. The "if" part of the lemma follows straightforwardly; the "only if" part can be achieved by choosing e.g. a := Δ δ⁻¹. □

Lemma 5.3 (see Veinott [26], Denardo [1], Van Hee and Wessels [10]). The process is strongly excessive with Δ ≥ μ(i) ≥ δ > 0 for some constants Δ ≥ δ > 0 if and only if the maximum expected lifetime is uniformly bounded in i ∈ S, i.e.

  sup_{π∈M} Σ_{n=0}^∞ ℙ_i^π(X_n ∈ S) < M

for some M > 0 and all i ∈ S.

Proof. Let μ(i) be the maximum expected lifetime if the process starts in state i ∈ S, so

  μ(i) := sup_{π∈M} Σ_{n=0}^∞ ℙ_i^π(X_n ∈ S) .

Clearly μ ≥ e + P(f)μ for every f ∈ F, and, since e ≥ M⁻¹ μ,

  μ ≥ M⁻¹ μ + P(f)μ .

This yields P(f)μ ≤ (1 − M⁻¹)μ. So for ρ* := 1 − M⁻¹, δ := 1 and Δ := M the "if" part will be clear. On the other hand, if the process is strongly excessive with δ ≤ μ(i) ≤ Δ, then the lifetimes are uniformly exponentially bounded and hence the maximum expected lifetimes are bounded. □

Corollary 5.1. The following three assertions are equivalent:
1) the process is strongly excessive with 0 < δ ≤ μ(i) ≤ Δ;
2) the lifetimes of the process are uniformly exponentially bounded;
3) the maximum expected lifetimes of the process are bounded as a function of the starting state.

Note that the maximum expected lifetime μ̂(i), if the process starts in state i ∈ S, can be found as the smallest positive solution of

  μ̂ ≥ sup_{f∈F} [ e + P(f) μ̂ ] .

There is a close relation between strong excessivity and so-called "N-stage" contraction. This relation is given in the following lemma.

Lemma 5.4 (see Van Hee and Wessels [10]). Let u be a positive function on S such that P(f)u ≤ M u for some M > 0 and all f ∈ F, and suppose ("N-stage" contraction)

  P(f₀) P(f₁) ⋯ P(f_{N−1}) u ≤ ρ' u with 0 < ρ' < 1

for all f₀, ..., f_{N−1} ∈ F. Then there exist a positive function μ on S and a ρ* with 0 < ρ* < 1 such that

  P(f) μ ≤ ρ* μ for all f ∈ F .

use of the "similarity transformation" as described by Porteus [24J.

For the finite state space-finite action space situation Porteus proposed the following transformation of the original process. Let Q be a diagonal

matrix with positive diagonal elements

Q :=

~-1(l)

0

~-1(2)· Define

o

"

"-.... .... "-.... "-.... and ref) := Qr(f) , -1 P (f) := QP (f) Q • -*

Then the optimal return vector v of the transformed problem is just equal

*

to QV • Viz.

;*

=

sup (I - P(f»-l;(f) fEF -1 -1

=

sup (I - QP(f)Q ) Qr(f) fEF

=

sup [Q(1 - P(f»Q-1J-1 Q

=

ref) fEF

=

sup Q(1 - P(f»-l r (f) fEF

=

Q sup (I - P(f»-l r (f)

=

fEF

*

Qv

So the assumptions 2.3 can be replaced by the same assumptions with ~(i)

=

1 for the transformed problem.

Proof. Choose ρ* such that ρ' < ρ*ᴺ < 1 and choose

  μ := sup_{π∈M} Σ_{n=0}^∞ ρ*⁻ⁿ P(f₀) ⋯ P(f_{n−1}) u , π = (f₀, f₁, ...) ;

then μ ≤ c u for some constant c > 0 and P(f)μ ≤ ρ* μ for all f ∈ F. □

As a consequence of the foregoing lemma we see that "N-stage" contraction in one norm (the u-norm) implies one-stage contraction in another norm (the μ-norm). A final characterization of strongly excessive processes is given in the following lemma, which can again be found in Van Hee and Wessels [10]. This lemma gives a probabilistic characterization of the transient behaviour of the process.

Lemma 5.5. A process is strongly excessive if and only if there exist a partition {S_k | k integer} of S and numbers ξ > 1, β ≥ 1, such that for all π ∈ M

  Σ_{n=0}^∞ ℙ_i^π(X_n ∈ S_k) ≤ β min{1, ξ^{ℓ−k}} for i ∈ S_ℓ .

Proof. First note that the lemma states that there is necessarily a drift to lower S_k or a drift out of the system. The "if" part follows by defining

  μ := sup_{π∈M} 𝔼^π Σ_{n=0}^∞ u(X_n) ,

where u(i) := (ξε)^k if i ∈ S_k, with 0 < ε < 1 and ξε > 1. The "only if" part follows by partitioning S according to the magnitude of μ, e.g. S_k := { i ∈ S : α^k ≤ μ(i) < α^{k+1} } with 1 < α < ρ*⁻¹. □

We conclude this section on the analysis of the basic assumptions by giving the relation between the use of weighted supremum norms (μ-norm) and the use of the "similarity transformation" as described by Porteus [24].

For the finite state space-finite action space situation Porteus proposed the following transformation of the original process. Let Q be the diagonal matrix with positive diagonal elements μ⁻¹(1), μ⁻¹(2), ..., and define

  r̃(f) := Q r(f) , P̃(f) := Q P(f) Q⁻¹ .

Then the optimal return vector ṽ* of the transformed problem is just equal to Q v*. Viz.

  ṽ* = sup_{f∈F} (I − P̃(f))⁻¹ r̃(f)
     = sup_{f∈F} (I − Q P(f) Q⁻¹)⁻¹ Q r(f)
     = sup_{f∈F} [ Q (I − P(f)) Q⁻¹ ]⁻¹ Q r(f)
     = sup_{f∈F} Q (I − P(f))⁻¹ r(f)
     = Q sup_{f∈F} (I − P(f))⁻¹ r(f) = Q v* .

So the assumptions 2.3 can be replaced by the same assumptions with μ(i) = 1 for the transformed problem.
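For a finite model the identity ṽ* = Q v* is easily verified numerically; a minimal sketch (the data and the weight μ, which must satisfy P(f)μ ≤ ρ*μ, are hypothetical):

    import numpy as np

    p = np.array([[[0.45, 0.45], [0.2, 0.7]], [[0.1, 0.8], [0.4, 0.5]]])
    r = np.array([[1.0, 2.0], [0.5, -1.0]])
    mu = np.array([1.2, 1.0])                 # P(f) mu <= 0.98 mu for all f here
    Q = np.diag(1.0 / mu)
    Qinv = np.diag(mu)

    def optimal_value(p, r, iters=3000):
        # plain successive approximation of the optimality equation
        v = np.zeros(r.shape[0])
        for _ in range(iters):
            v = (r + np.einsum('aij,j->ia', p, v)).max(axis=1)
        return v

    p_tilde = np.einsum('ik,akl,lj->aij', Q, p, Qinv)   # P~(f) = Q P(f) Q^{-1}
    r_tilde = np.einsum('ij,ja->ia', Q, r)              # r~(f) = Q r(f)
    print(np.allclose(optimal_value(p_tilde, r_tilde), Q @ optimal_value(p, r)))

The transformed problem has ‖P̃(f)‖ ≤ ρ* in the unweighted supremum norm, which is exactly the statement that the μ-norm can be traded for the similarity transformation.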

References

[1] E.V. Denardo, "Contraction mappings in the theory underlying dynamic programming". SIAM Rev. 9 (1967), 165-177.

[2] C. Derman, R.E. Strauch, "A note on memoryless rules for controlling sequential control processes". Ann. Math. Statist. 37 (1966), 276-278.

[3] F. d'Epenoux, "Sur un problème de production et de stockage dans l'aléatoire". Rev. Franç. Rech. Opér. 14 (1960), 3-16.

[4] G.T. de Ghellinck, G.D. Eppen, "Linear programming solutions for separable Markovian decision problems". Management Sci. 13 (1967), 371-394.

[5] J. Harrison, "Discrete dynamic programming with unbounded rewards". Ann. Math. Statist. 43 (1972), 636-644.

[6] N.A.J. Hastings, "Some notes on dynamic programming and replacement". Oper. Res. Quart. 19 (1968), 453-464.

[7] N.A.J. Hastings, J. Mello, "Test for nonoptimal actions in discounted Markov programming". Management Sci. 19 (1973), 1019-1022.

[8] N.A.J. Hastings, J.A.E.E. van Nunen, "The action elimination algorithm for Markov decision processes". In this volume.

[9] K.M. van Hee, "Markov strategies in dynamic programming". Univ. of Technology Eindhoven, Dept. of Math., 1975 (Memorandum COSOR 75-20).

[10] K.M. van Hee, J. Wessels, "Markov decision processes and strongly excessive functions". Univ. of Technology Eindhoven, Dept. of Math., 1975 (Memorandum COSOR 75-22); to appear in Stochastic Processes and their Applications.

[11] K. Hinderer, "Bounds for stationary finite stage dynamic programs with unbounded reward functions". Institut für Math. Stochastik der Univ. Hamburg, June 1975 (Report).

[12] A. Hordijk, "Dynamic programming and Markov potential theory". Amsterdam, Mathematisch Centrum, 1974 (Mathematical Centre Tract no. 51).


[13] R.A. Howard, "Dynamic programming and Markov processes". Cambridge (Mass.), M.I.T. Press, 1960.

[14] G. Hübner, "Improved procedures for eliminating suboptimal actions in Markov programming by the use of contraction properties". Transactions of the 7th Prague Conference on Information Theory, Statistical Decision Functions, Random Processes (including 1974 European Meeting of Statisticians), Academia, Prague (to appear).

[15] S.A. Lippman, "On dynamic programming with unbounded rewards". Management Sci. 21 (1975), 1225-1233.

[16] J. MacQueen, "A modified dynamic programming method for Markovian decision problems". J. Math. Anal. Appl. 14 (1966), 38-43.

[17] J. MacQueen, "A test for suboptimal actions in Markovian decision problems". Operations Res. 15 (1967), 559-561.

[18] H. Mine, S. Osaki, "Markovian decision processes". New York etc., Elsevier, 1970.

[19] J.A.E.E. van Nunen, "A set of successive approximation methods for discounted Markovian decision problems". Zeitschrift für Operations Res. 20 (1976), 203-208.

[20] J.A.E.E. van Nunen, J. Wessels, "A note on dynamic programming with unbounded rewards". Eindhoven, Univ. of Technology, Dept. of Math., 1975 (Memorandum COSOR 75-13).

[21] J.A.E.E. van Nunen, "Contracting Markov decision processes". Amsterdam, Mathematisch Centrum, 1976 (Mathematical Centre Tract no. 71).

[22] J.A.E.E. van Nunen, J. Wessels, "The generation of successive approximation methods for Markov decision processes by using stopping times". In this volume.

[23] E.L. Porteus, "Some bounds for discounted sequential decision processes". Management Sci. 18 (1971).


[24] E.L. Porteus, "Bounds and transformations for discounted finite Markov decision chains". Operations Res. 23 (1975), 761-784.

[25] M. Schäl, "Conditions for optimality in dynamic programming and for the limit of N-stage optimal policies to be optimal". Zeitschrift für Wahrscheinlichkeitstheorie und verw. Gebiete 32 (1975), 179-196.

[26] A.F. Veinott, "Discrete dynamic programming with sensitive discount optimality criteria". Ann. Math. Statist. 40 (1969), 1635-1660.

[27] J. van der Wal, J. Wessels, "Successive approximation methods for Markov games". In this volume.

[28] J. Wessels, "Markov programming by successive approximations with respect to weighted supremum norms". J. Math. Anal. Appl. (to appear).

[29] J. Wessels, J.A.E.E. van Nunen, "Discounted semi-Markov decision processes: linear programming and policy iteration". Statistica Neerlandica 29 (1975), 1-7.
