Contracting Markov decision processes
Citation for published version (APA):
van Nunen, J. A. E. E. (1976). Contracting Markov decision processes. Stichting Mathematisch Centrum. https://doi.org/10.6100/IR29937

Document status and date:
Published: 01/01/1976

Document Version:
Publisher's PDF, also known as Version of Record (includes final page, issue and volume numbers)
CONTRACTING MARKOV DECISION PROCESSES
DISSERTATION

for obtaining the degree of Doctor in Technical Sciences at the Technische Hogeschool Eindhoven, by authority of the Rector Magnificus, Prof. Dr. Ir. G. Vossers, to be defended in public before a committee appointed by the Board of Deans on Tuesday 25 May 1976 at 16.00 hours

by

JOHANNES ARNOLDUS ELISABETH EMMANUEL VAN NUNEN

born in Venlo
This dissertation has been approved by the promotor Prof. dr. J. Wessels.
ERRATA

page 11, replace line 1 from below by:
(2.2.1) δ((0)) = 1; ∀ i∈S\{0} [δ((0,i)) = 0]; ∀ α∈G∞ [[δ(α) ≠ 0] ⇒ [δ((α,0)) = 1]].

page 12, replace lines 6 and 7 from above by:
(2.2.2) ∀ α,β∈G∞ [δ(α) = 0 ⇒ δ((α,β)) = 0]; ∀ α∈G∞ ∀ i∈S\{0} [δ((α,0,i)) = 0].

page 12, replace line 9 from below by:
(ii) (0) ∈ G; ∀ α∈G [(α,0) ∈ G].

page 13, replace line 6 from above by:
G_n := B ∪ {(α,β) | α ∈ B, β ∈ ∪_{k=1}^∞ {0}^k}, with B = ∪_{k=1}^n S^k.

page 14, line 3 from above: replace δ(α) < 1 by δ(α) ≠ 0.

page 14, replace line 6 from below by:
G_B(0) := ∪_{k=1}^∞ {0}^k; G_B(i) := {(i,α) | α ∈ ∪_{j=0}^{i−1} G_B(j)} ∪ {i}, for i ≠ 0.

page 14, replace lines 4 and 3 from below by:
G := S ∪ (∪_{k=2}^∞ B^k) ∪ {(α,β) | α ∈ S ∪ (∪_{k=2}^∞ B^k), β ∈ ∪_{k=1}^∞ {0}^k}, with B ⊂ S\{0}.

page 133, line 13 from below: add (to appear in Computing).
CONTENTS

1. INTRODUCTION
2. PRELIMINARIES
   2.1. Notations
   2.2. Stopping times
   2.3. Weighted supremum norms
   2.4. Some remarks on bounding functions
3. MARKOV REWARD PROCESSES
   3.1. The Markov reward model
   3.2. Contraction mappings and stopping times
   3.3. A discussion on the assumptions 3.1.2-3.1.5
4. MARKOV DECISION PROCESSES
   4.1. The Markov decision model
   4.2. Decision rules and assumptions
   4.3. Some properties of Markov decision processes
   4.4. Remarks on the assumptions 4.2.1-4.2.4
5. STOPPING TIMES AND CONTRACTION IN MARKOV DECISION PROCESSES
   5.1. The contraction mapping L_δ^f
   5.2. The optimal return mapping U_δ
6. VALUE ORIENTED SUCCESSIVE APPROXIMATION
   6.1. Policy improvement value determination procedures
   6.2. Some remarks on the value oriented methods
7. UPPER BOUNDS, LOWER BOUNDS AND SUBOPTIMALITY
   7.1. Upper bounds and lower bounds for v_n^δ and v*
   7.2. The suboptimality of Markov policies and suboptimal actions
   7.3. Some remarks on finite state space Markov decision processes
8. N-STAGE CONTRACTION
   8.1. Convergence under the strengthened N-stage contraction
9.
   9.1. Some examples of specific Markov decision processes
   9.2. The solution of systems of linear equations
   9.3. An inventory problem
REFERENCES
SUBJECT INDEX
SELECTED LIST OF SYMBOLS
SAMENVATTING
CURRICULUM VITAE
CHAPTER 1

INTRODUCTION
In the last three decades much attention has been given to Markov decision processes. Markov decision processes were first introduced by Bellman [2] in 1957, and constitute a special class of dynamic programming problems. In 1960 Howard [35] published his book "Dynamic programming and Markov processes". This publication gave an important impulse to the investigation of Markov decision processes.
We will first give an outline of the decision processes to be investigated.
Consider a system with a countable state space S. The system can be controlled at discrete points in time t = 0,1,2,..., by a decision maker. If at time t the system is observed to be in state i ∈ S, the decision maker can select an action from a nonempty set A. This set is independent of the state i ∈ S and of the time instant t. If he selects the action a ∈ A in state i ∈ S at time t, the system's state at time t+1 will be j ∈ S with probability p_a(i,j), again independent of t. He then earns an immediate (expected) reward r(i,a).
Usually, the problem is to choose the actions "as good as possible" with respect to a given optimality criterion. We will use the (expected) total reward criterion. So the problem is:
1) to determine a recipe (decision rule) according to which actions should be taken such that the (expected) total reward over an infinite time horizon is maximal;
2) to determine the total reward that may be expected if we act according to that decision rule.
Initially, the following solution techniques for Markov decision processes with a finite state space and a finite action space were available: the policy iteration algorithm developed by Howard [35], linear programming [11], [13], and approximation by standard dynamic programming [35]. A disadvantage (especially for large-scale problems) of the former two methods is that each iteration step requires a relatively large amount of computation. Furthermore, the convergence of the method of standard dynamic programming is very slow.
Hence, the construction of numerical methods for determining optimal solutions has been the subject of much research in this area. Moreover, much attention has been paid to the generalization of the model as described by Bellman and Howard. Both subjects will be studied in this monograph.
MacQueen [46] introduced an improved version of the standard dynamic programming algorithm by constructing, in each iteration step, improved upper and lower bounds for the optimal return vector. His approach yields a rather fast algorithm for solving finite state space, finite action space Markov decision processes.
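MacQueen's bound construction is easy to illustrate on a finite discounted problem. The sketch below is not taken from the monograph; it is a minimal numpy illustration, assuming a finite state and action space, transition arrays `P[a]`, reward array `r[a]`, and a discount factor `beta` (so the contraction factor is `beta`). All names are this sketch's own.

```python
import numpy as np

def value_iteration_macqueen(P, r, beta, tol=1e-8, max_iter=10000):
    """Standard dynamic programming step plus MacQueen-style bounds:
    after each iteration, the extreme components of the difference
    v_new - v give elementwise lower/upper bounds on the optimal return.
    P has shape (A, S, S); r has shape (A, S)."""
    v = np.zeros(P.shape[1])
    lo = hi = v
    for _ in range(max_iter):
        q = r + beta * np.einsum('aij,j->ai', P, v)  # Q-values, one row per action
        v_new = q.max(axis=0)                        # dynamic programming step
        d = v_new - v
        lo = v_new + beta / (1.0 - beta) * d.min()   # lower bound on the optimal return
        hi = v_new + beta / (1.0 - beta) * d.max()   # upper bound on the optimal return
        v = v_new
        if (hi - lo).max() < tol:                    # bounds nearly touch: stop early
            break
    return lo, hi
```

The gap between the two bounds contracts geometrically, which is what makes this variant noticeably faster in practice than plain successive approximation with a fixed-iteration-count stopping rule.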
Modifications of the aforementioned optimization procedures have been given by e.g. Hastings [25], [26], who proposed a Gauss-Seidel-like technique, and Reetz [60], who based his optimization procedure on an overrelaxation idea. Modifications have also been given by Porteus [59], Wessels [74], van Nunen and Wessels [55], and van Nunen [53], [54].
By and by several extensions of the original model were presented. The finite state space and finite action space restriction was dropped. For example, Maitra [48] and Derman [14] studied Markov decision processes with a countable state space and finite action space, whereas Blackwell [5] and Denardo [12] already investigated Markov decision processes with a general state and action space.
The restriction of equidistant decision points was dropped as well; see Jewell [37], [38].
As a remaining restriction, however, a bounded reward structure is assumed in the above articles.
This restriction has been lifted recently; see Lippman [44], [45], Harrison [24], Wessels [75], and Hinderer [31].
In this monograph we will investigate Markov decision processes on a countably infinite or finite state space and with a general action space. Furthermore, we allow for an unbounded reward structure: we do not require boundedness with respect to the supremum norm. We assume the existence of a function b: S → ℝ and a positive function μ: S → ℝ⁺ := {x ∈ ℝ | x > 0} such that

∀ i∈S ∀ a∈A(i):  |r(i,a) − b(i)| ≤ μ(i) .

The function μ will be used to construct a weighting function, or a bounding function. Moreover, we assume

∃ 0<ρ<1 ∀ i∈S ∀ a∈A(i):  Σ_{j∈S} p_a(i,j) μ(j) ≤ ρ μ(i) .

In order to guarantee the existence of the total expected reward we assume the function b on S to be a charge with respect to the transition probability structure (see section 4.1).
These assumptions on the reward and transition probability structure arise in fact from a combination and a slight extension of the conditions as proposed by Wessels [75] and Harrison [24]. As will be shown in the final chapter, the assumptions allow e.g. the investigation of a large class of discounted Markov and semi-Markov decision processes. Lippman's assumptions [45] are covered as well; see van Nunen and Wessels [56].
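On a finite instance the two displayed conditions can be checked mechanically. The helper below is only an illustration of those conditions, not part of the monograph; the array layout (`P[a]` substochastic, with any discounting already folded in, and `r[a]` the rewards) and the function name are assumptions of this sketch.

```python
import numpy as np

def check_contracting_assumptions(P, r, b, mu, rho, eps=1e-12):
    """Check, for all states i and actions a:
      (1) |r(i,a) - b(i)| <= mu(i)
      (2) sum_j p_a(i,j) mu(j) <= rho * mu(i), with 0 < rho < 1.
    Shapes: P (A, S, S), r (A, S), b (S,), mu (S,) nonnegative."""
    cond1 = bool(np.all(np.abs(r - b[None, :]) <= mu[None, :] + eps))
    cond2 = bool(np.all(P @ mu <= rho * mu[None, :] + eps)) and 0.0 < rho < 1.0
    return cond1 and cond2
```

For an ordinary β-discounted problem with bounded rewards one may take b = 0, μ ≡ 1 and ρ = β, which is the simplest way these conditions are satisfied.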
We will develop a set of optimization procedures for solving Markov decision problems, satisfying the described conditions, with respect to the total reward criterion. This will be done by using the concept of stopping time (see also Wessels [74]), which results in a unifying approach. This set of methods includes the procedures for finite state, finite action space Markov decision processes as proposed by Howard [35], Reetz [60], Hastings [25], and MacQueen [46]. A main role in our approach will be played by the theory of monotone contraction mappings defined on a complete metric space of functions on S. This space will be denoted by V.
The concept of stopping time will be used to define a set of contraction mappings on V. Given a decision rule and given the starting state i ∈ S we may define the stochastic process {s_t | t = 0,1,...}, where s_t denotes the state of the system at time t. Roughly speaking, a stopping time is a recipe for terminating the stochastic process {s_t | t ≥ 0}. For each stopping time (denoted by δ) we define the mapping U_δ of V by defining (U_δ v)(i) as the supremum over all decision rules of the expected total reward until the process is stopped according to the stopping time δ, given that the process starts in state i, where a terminal reward v(j) is earned if the process is stopped in state j ∈ S. U_δ is proved to be a monotone contractive mapping on the complete metric space V.
For stopping times that are nonzero (see section 2.2), U_δ will be strictly contracting, its fixed point being equal to the requested optimal expected total reward over an infinite time horizon (denoted by v*). Hence, the fixed point is independent of the chosen nonzero stopping time. This implies that for each nonzero stopping time δ, v* may be approximated successively by a sequence v_n := U_δ v_{n−1}, starting from any v_0 ∈ V. So for each nonzero stopping time δ we have v_n → v*.
These results may be formulated alternatively as follows: for each nonzero stopping time δ, v* is the unique solution of the optimality equation U_δ v = v. The class of described methods may be extended.
For a special class of stopping times, which we call transition memoryless stopping times (see section 2.2), the mappings U_δ produce the opportunity of determining in each iteration step a decision rule of a special type for which the supremum by applying U_δ is attained or approximately attained (see chapter 5). Such a decision rule will be called a stationary Markov strategy (denoted by f^∞). We define the mapping L_δ^f of V in a similar way as we have defined U_δ, with the difference that the expected reward by applying the stationary Markov strategy f^∞ is computed instead of the supremum over all decision rules.
For transition memoryless nonzero stopping times we define for each λ ∈ ℕ = {1,2,...} a mapping U_δ^(λ) of V. If the supremum in computing U_δ v is attained for a stationary Markov strategy f^∞, then U_δ^(λ) v is defined by

U_δ^(λ) v := (L_δ^f)^λ v , with λ ∈ ℕ ,  and  U_δ^(∞) v := lim_{n→∞} (L_δ^f)^n v .

If this supremum is not attained, U_δ^(λ) is defined by using a Markov strategy for which the supremum is approximated (see chapter 6). U_δ^(λ) is neither necessarily contracting nor monotone. However, the monotone contraction property of the mappings U_δ and L_δ^f enables the use of U_δ^(λ) as a basis for successive approximation methods (see chapter 6).
For each transition memoryless nonzero stopping time δ and each λ ∈ ℕ ∪ {∞}, a sequence v_n^{δλ} defined by

v_n^{δλ} := U_δ^(λ) v_{n−1}^{δλ}

converges to v*. Here f_n is chosen such that L_δ^{f_n} v_{n−1}^{δλ} approximates U_δ v_{n−1}^{δλ} sufficiently well. So each pair (δ,λ) yields a successive approximation of v*. Moreover, the stationary Markov strategy f_n^∞ found in the n-th iteration of such a procedure becomes (ε-)optimal for n sufficiently large.
The vectors v_{n−1}^{δλ} and v_n^{δλ} enable us to construct upper and lower bounds for the optimal return v*. In addition, the availability of upper and lower bounds allows the incorporation of a suboptimality test. The use of upper and lower bounds and a suboptimality test may yield a considerable gain in computation time; see section 7.3.
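The suboptimality test can be made concrete in the discounted finite case: an action whose optimistically bounded Q-value still falls below the pessimistic bound on the optimal return can never be optimal and may be skipped in later iterations. This is a sketch under the same simplified assumptions as the earlier snippets (MacQueen-type bounds from two consecutive iterates of β-discounted value iteration), not the monograph's general test.

```python
import numpy as np

def provably_suboptimal(P, r, beta, v_prev, v):
    """Given consecutive value-iteration iterates v_prev, v of a
    beta-discounted MDP, return a boolean mask of shape (A, S):
    True means action a is provably suboptimal in state i."""
    d = v - v_prev
    lo = v + beta / (1.0 - beta) * d.min()           # lower bound on optimal return
    hi = v + beta / (1.0 - beta) * d.max()           # upper bound on optimal return
    q_hi = r + beta * np.einsum('aij,j->ai', P, hi)  # optimistic Q-value of (i, a)
    return q_hi < lo[None, :]                        # optimistic value still too small
```

Every action flagged True can be removed from the maximization in all later iterations, which is where the gain in computation time comes from.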
We conclude this introduction with a short overview of the contents of the subsequent chapters.
In the first three sections of chapter 2 some basic notions required in the sequel are presented. After the introduction of some notations (section 2.1) we discuss in sections 2.2 and 2.3 the concepts of stopping time and weighted supremum norms respectively. The final section of chapter 2 is
devoted to some properties of weighted supremum norms.
In chapter 3 we treat Markov reward processes (stochastic processes without the possibility of making decisions). In section 3.1 the Markov reward model is defined. Reward functions may be unbounded under our assumptions.
In section 3.2 the concept of stopping time is used to define the contraction mappings on the complete metric space V (introduced in section 2.3).
The study of Markov decision processes starts in chapter 4. After a description of the model (section 4.1), section 4.2 contains the introduction of decision rules and assumptions. These assumptions will be a natural extension of those in chapter 3. Under our assumptions some results about Markov decision processes will be proved (section 4.3). The final section (4.4) is again devoted to a discussion of the assumptions.
In chapter 5 the concept of stopping time is used to generate a whole set of optimization procedures based on the mappings U_δ. For each decision rule (not necessarily a stationary Markov strategy) and each stopping time δ, a contractive mapping L_δ of V will be defined and investigated (section 5.1). Next (section 5.2), the operator U_δ will be studied. Finally, we will present necessary and sufficient conditions on the stopping times under which we can restrict the attention to stationary Markov strategies only (transition memoryless stopping times).
In chapter 6 we investigate value oriented successive approximations based on the mappings U_δ^(λ). The term "value oriented" is used since in each iteration step extra effort is given to obtain better estimates for the total expected reward corresponding to the stationary Markov strategy f_n^∞.
Chapter 7 will be used to construct upper and lower bounds for the optimal reward v*. In this chapter also a suboptimality test will be introduced. In the third section of this chapter we show how our theory may be used in the special case of a Markov decision process with a finite state space and a finite action space. We indicate the relation with the existing optimization procedures.
In the brief chapter 8 we weaken the assumptions as imposed in chapter 4. This weakened version corresponds to the N-stage contraction assumption introduced by Denardo [12]. It will be proved that N-stage contraction with respect to a given bounding function implies the existence of a new bounding function satisfying the assumptions of chapter 4.
We conclude this monograph with a chapter in which we show that a number of specific Markov decision processes is covered by our theory. We will also show how a number of the existing approximation methods for certain systems of linear equations are included in our treatment of Markov reward processes (chapter 3). The final part of this chapter consists of an example. In this example we treat an inventory problem.
CHAPTER 2

PRELIMINARIES
The goal of this chapter will be the introduction of some of the notions which play an important role throughout this monograph.
First (section 2.1) we will give some notations and we will introduce the measurable spaces relevant for the stochastic processes that will be investigated.
Next (section 2.2) stopping times are introduced. We will allow for randomized stopping times. Several specific stopping times will be described.
In section 2.3 a bounding function μ is introduced. The function μ will be used to define a weighted supremum norm. In the following chapters this bounding function will appear to be one of the tools for handling Markov decision processes with an unbounded reward structure and with a transition probability structure that need not be contractive with respect to the usual supremum norm.
Using the bounding function μ a Banach space W and a complete metric space V are defined.
Finally (section 2.4), we discuss some properties of bounding functions.
2.1. Notations
As mentioned in the introductory chapter we study a system which is observed to be in one of the states from a state space S at times t = 0,1,.... We assume S to be countably infinite or finite, and represent the states by the integers, starting with zero. So if the state space is finite, S is represented by {0,1,...,N}, where N+1 is the number of states. If S is countably infinite it is represented by {0,1,2,...}. A path is a sequence of states that are subsequently visited.
NOTATIONS 2.1.1.
(i) S^k := S × S × ... × S, the k-fold Cartesian product of S, so S^1 = S. S^∞ := S × S × ... is the set of all paths.
(ii) Let α ∈ S^k with k ≥ n, k,n ∈ ℕ; then α(n) denotes the row vector of the first n components of α.
(iii) k_α is the number of components of α. So k_α is n if and only if α ∈ S^n.
(iv) The i-th component of α ∈ S^n, n ≥ i, is denoted by [α]_{i−1}.
(v) Hence α ∈ S^n may be written as α = ([α]_0, [α]_1, ..., [α]_{k_α−1}).
(vi) γ := (α,β) := ([α]_0, ..., [α]_{k_α−1}, [β]_0, ..., [β]_{k_β−1}); k_γ = k_α + k_β, where α,β ∈ G∞ with G∞ := ∪_{k=1}^∞ S^k.
(vii) The term (column) vector is used hereafter for real valued functions on S.
(viii) The term matrix is used hereafter for real valued functions on S².
(ix) The (i,j)-th element of a matrix A will be denoted by a(i,j).
(x) A^0 is the identity matrix (with diagonal entries equal to one and other entries equal to zero).
(xi) Matrix multiplication and matrix-vector multiplication are defined as usual (in all cases there will be absolute convergence).
(xii) A^n is the n-fold matrix product A × A × ... × A; the (i,j)-th entry of A^n is denoted by a^(n)(i,j).
(xiii) Let v,w be vectors; then v ≤ w if and only if v(i) ≤ w(i) for all i ∈ S; v < w if and only if v ≤ w and for at least one i ∈ S, v(i) < w(i).
Let S_0 be the σ-field of all subsets of S; then the measurable space (Ω_0, F_0) is defined to be the product space, with Ω_0 = S^∞ and F_0 the σ-algebra on Ω_0 generated by the finite products of the σ-field S_0.
In order to be able to use the concept of stopping time in an adequate way we extend the measurable space (Ω_0, F_0) to the measurable space (Ω, F). The space (Ω, F) will play a main role in the sequel. Let the set E := {0,1} and let S be the σ-field of all subsets of S × E; then the measurable space (Ω, F) is defined to be the product space with Ω := (S × E)^∞ and F the σ-algebra on Ω generated by the finite products of the σ-field S.
So Ω contains all sequences of the form ω = ((i_0,d_0),(i_1,d_1),...) with i_t ∈ S and d_t ∈ E.
REMARK 2.1.1. The state 0 will play a special role throughout this monograph. We do not exclude Markov decision processes with defective transition probabilities. Therefore, the state 0 will be used as a (fictive) absorbing state. This will give us the possibility of defining (based on the transition probabilities) adequate probability measures on (Ω_0, F_0) and (Ω, F).
2.2. Stopping times
We are now ready to introduce the notion of stopping time.
DEFINITION 2.2.1. A function δ: G∞ → [0,1] is called a (randomized) stopping time if and only if δ satisfies the following properties:

(2.2.1) δ((0)) = 1; ∀ i∈S\{0} [δ((0,i)) = 0]; ∀ α∈G∞ [[δ(α) ≠ 0] ⇒ [δ((α,0)) = 1]] ;
(2.2.2) ∀ α,β∈G∞ [δ(α) = 0 ⇒ δ((α,β)) = 0]; ∀ α∈G∞ ∀ i∈S\{0} [δ((α,0,i)) = 0] .

DEFINITION 2.2.2. The set of all (randomized) stopping times is denoted by Δ.
REMARK 2.2.1.
(i) Roughly speaking, for each α ∈ S^k we will use 1 − δ(α) as the probability that a stochastic process on S is stopped at time k − 1 in state [α]_{k−1}, given that the states [α]_0, [α]_1, ..., [α]_{k−1} have been visited successively.
(ii) From now on we will use the less formal notation δ(α,β), δ(i) instead of δ((α,β)) and δ((i)) respectively.
DEFINITION 2.2.3.
(i) δ ∈ Δ is said to be a nonrandomized stopping time if and only if ∀ α∈G∞: δ(α) ∈ {0,1}.
(ii) δ ∈ Δ is said to be memoryless if and only if δ(α) depends on α only through its last component [α]_{k_α−1}.
(iii) δ ∈ Δ is said to be an entry time if and only if δ is nonrandomized and memoryless.

DEFINITION 2.2.4. A nonempty subset G ⊂ G∞ is said to be a goahead set if and only if
(i) ∀ α,β∈G∞ [(α,β) ∈ G ⇒ α ∈ G];
(ii) (0) ∈ G;
(iii) ∀ α∈G [(α,0) ∈ G].

NOTATIONS 2.2.1.
(i) G_n is the goahead set of those sequences of G∞ for which the components [α]_i are zero for i ≥ n, if there are any. So
G_n := (∪_{k=1}^n S^k) ∪ {(α,β) | α ∈ ∪_{k=1}^n S^k, β ∈ ∪_{k=1}^∞ {0}^k} .
(ii) For a goahead set G we define G(i) by
G(i) := {α ∈ G | [α]_0 = i}, i ∈ S .
LEMMA 2.2.1. There is a one to one correspondence between nonrandomized stopping times and goahead sets.
PROOF. The characteristic function of G is a nonrandomized stopping time. □
DEFINITION 2.2.5. δ ∈ Δ is said to be a nonzero stopping time if and only if ∀ i∈S: δ(i) ≠ 0.
REMARK 2.2.2.
(i) A nonrandomized stopping time is nonzero if and only if ∀ i∈S: δ(i) = 1.
(ii) The only nonzero stopping time δ ∈ Δ which is an entry time has the following property: ∀ α∈G∞: δ(α) = 1.
DEFINITION 2.2.6. A goahead set is said to be nonzero if and only if S ⊂ G.
DEFINITION 2.2.7. Let δ_0, δ_1 ∈ Δ; then δ_0 ≤ δ_1 if and only if δ_0(α) ≤ δ_1(α) for all α ∈ G∞.
if and only ifLEMMA 2.2.2. Let Q be an index set and suppose for each q E Q,
o
E A, is aq
nonrandanized stopping time, Let
o-,o+
be defined by0-(a) :=inf 0 (a),
o+
(a) i= sup 0 (a)q£Q q qt:Q q
respectively, then
o-
ando+
are elements of 8.Consequently, if for
q
E': Q, G corresponds to the stopping timeoq
theno-+ q
and
o
correspond to the goahead sets n G , n G respectively. qE':Q q qEQ qPROOF. The proof follows by inspection.
DEFINITION 2.2.8. The nonrandomized stopping function τ: Ω → ℤ⁺ ∪ {∞} is defined by
(i) τ(ω) := min{t ∈ ℤ⁺ | d_t ≠ 0}, if such a t exists;
(ii) τ(ω) = ∞ ⟺ (d_t = 0 for all t ∈ ℤ⁺).
DEFINITION 2.2.9. A stopping time δ is said to be transition memoryless if and only if there exists a subset T ⊂ S² such that δ depends on a path only through membership of its successive transitions in T.

DEFINITION 2.2.10. A goahead set is said to be transition memoryless if and only if the corresponding nonrandomized stopping time is transition memoryless.
LEMMA 2.2.3. Memoryless stopping times are transition memoryless.
We will now give some simple examples of nonzero stopping times. The examples 2.2.1-2.2.4 are nonrandomized stopping times, so they can be expressed in terms of goahead sets. Example 2.2.5 is a simple illustration of a randomized stopping time. The examples 2.2.2-2.2.5 give transition memoryless stopping times; see also Wessels [74], and van Nunen and Wessels [55].
EXAMPLE 2.2.1. G := G_n, or in terms of stopping times: δ(α) = 1 if α ∈ G_n, else δ(α) = 0.
nEXAMPLE 2.2.2. The goahead set G
8 is defined by co i-1 := u {O}k, G 8(1) := {(i,a)
I
a e u G8(j)}, for i '/' 0 . k=l j=O EXAMPLE 2.2.3. co G := u {a E SnI
[a] j E Bi j < n} u S n=2with B c s such that O E B.
EXAMPLE 2.2.4. The goahead set G_R is defined by
G_R(i) := {α | α ∈ ∪_{k=1}^∞ {i}^k} ∪ {(α,β) | α ∈ ∪_{k=1}^∞ {i}^k, β ∈ G_R(0)} , i ∈ S\{0} .
EXAMPLE 2.2.5. δ is given by: ∀ i∈S\{0} δ(i) = ½, else δ(α) = 1.
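For nonrandomized stopping times, lemma 2.2.1 says δ is just the characteristic function of its goahead set, so the examples above become one-line predicates on finite paths. The sketch below is an illustrative transcription (paths as Python tuples, state 0 the absorbing state); the function names are this sketch's, not notation used in the monograph.

```python
def delta_G_n(path, n):
    """Example 2.2.1: go ahead on the goahead set G_n, i.e. as long as
    every component beyond the first n is the absorbing state 0."""
    return 1 if all(s == 0 for s in path[n:]) else 0

def delta_stay_in_B(path, B):
    """Example 2.2.3 (with 0 in B): go ahead while the whole path stays
    inside B; a one-component path is always allowed to go ahead."""
    return 1 if len(path) == 1 or all(s in B for s in path) else 0
```

The defining property (2.2.2) of a stopping time is visible here: once δ is 0 on a path, no extension of that path can make δ positive again.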
2.3. Weighted supremum norms

DEFINITION 2.3.1. A real valued function μ on S is said to be a bounding function if and only if
(i) μ(i) > 0, for all i ∈ S\{0};
(ii) μ(0) = 0.
DEFINITION 2.3.2. Let μ be a bounding function; then W_μ is the set of vectors w such that

(2.3.1) |w(i)| ≤ c·μ(i) for some c ∈ ℝ⁺ and all i ∈ S .

REMARK 2.3.1. Note that w(0) = 0 for each w ∈ W_μ.
DEFINITION 2.3.3. Let μ be a bounding function. Then, for each w ∈ W_μ, the μ-norm of w is defined by

‖w‖_μ := sup_{i∈S\{0}} μ(i)^{-1} |w(i)| .

LEMMA 2.3.1. The space W_μ with this μ-norm (weighted supremum norm) is a Banach space.
PROOF. The proof is straightforward. □
DEFINITION 2.3.4. Let the matrix A be a bounded linear operator in W_μ. The norm of A is defined by

‖A‖_μ := sup{‖Aw‖_μ | w ∈ W_μ, ‖w‖_μ ≤ 1} .

REMARK 2.3.2.
(i) It is easily verified that

‖A‖_μ = sup_{i∈S\{0}} μ(i)^{-1} Σ_{j∈S} |a(i,j)| μ(j) ,

and if this supremum is finite then A is a bounded linear operator.
(ii) The concept of bounding function is studied in more detail in sections 2.4, 3.3, 4.4, and 5.1.
(iii) We refer to Wessels [75], who introduced the concept of weighted supremum norms in this context, and to Hinderer [31], who used Wessels' idea of weighted supremum norms for defining bounding functions.
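For a finite state space the two formulas above translate directly into code. A minimal numpy sketch (state 0 listed first and excluded from the suprema, as in the definitions; the names are this sketch's, not the monograph's):

```python
import numpy as np

def mu_norm(w, mu):
    """||w||_mu = sup over i != 0 of mu(i)^(-1) |w(i)|; state 0 is
    excluded because mu(0) = 0 and w(0) = 0 for w in W_mu."""
    return float(np.max(np.abs(w[1:]) / mu[1:]))

def mu_matrix_norm(A, mu):
    """||A||_mu = sup over i != 0 of mu(i)^(-1) sum_j |a(i,j)| mu(j)
    (remark 2.3.2(i)); column 0 drops out automatically since mu(0) = 0."""
    return float(np.max((np.abs(A) @ mu)[1:] / mu[1:]))
```

Note how the weighting works in both directions: entries of A leading into the absorbing state 0 contribute nothing, which is exactly why a substochastic matrix can have μ-norm strictly below one even when its rows almost sum to one.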
DEFINITION 2.3.5. Let b be a vector with b(0) = 0, and let ρ ∈ [0,1). The set of vectors V_{μ,b,ρ} is defined by

V_{μ,b,ρ} := {v | (v − (1−ρ)^{-1} b) ∈ W_μ} .

REMARK 2.3.3. Note that also v(0) = 0 for v ∈ V_{μ,b,ρ}, and v_1 − v_2 ∈ W_μ for v_1, v_2 ∈ V_{μ,b,ρ}.
DEFINITION 2.3.6. The metric d_μ on V_{μ,b,ρ} is defined by

d_μ(v_1,v_2) := ‖v_1 − v_2‖_μ .

LEMMA 2.3.2. The set V_{μ,b,ρ} with the metric d_μ is a complete metric space.
Unless explicitly mentioned, we fix μ, b, and ρ for the remaining part of this monograph. Referring to these fixed μ, b and ρ we will omit the subscripts μ, b, ρ.
2.4. Some remarks on bounding functions

In this section we give some properties of a bounding function μ' with respect to the corresponding spaces W_{μ'} and V_{μ'}.
LEMMA 2.4.1. Suppose S contains a finite number of elements. Let μ_0 be a bounding function and w_n ∈ W_{μ_0} (n ≥ 0); then

[‖w_0 − w_n‖_{μ_0} → 0] ⟺ [‖w_0 − w_n‖_{μ'} → 0, for all bounding functions μ'] .

PROOF. For a proof we refer to books on numerical mathematics, see e.g. Collatz [8], Krasnosel'skii [42]. □
LEMMA 2.4.2. Suppose S contains a finite number of elements and μ_0 is a bounding function; then
(i) [‖B‖_{μ_0} < 1] ⇒ [lim_{n→∞} B^n = 0];
(ii) [lim_{n→∞} B^n = 0] ⇒ [there exists a bounding function μ' with ‖B‖_{μ'} < 1],
where B is a matrix.
PROOF. The proof of (i) follows directly from the fact that [‖B‖_{μ_0} < 1] implies lim_{n→∞} B^n = 0, where 0 is the matrix with all entries zero. The proof of (ii) can be found in e.g. Krasnosel'skii [42]. □
LEMMA 2.4.3. Suppose μ_0 is a bounding function and B is a matrix with finite μ_0-norm; then

[∃_N ‖B^N‖_{μ_0} < 1] ⇒ ∃_{μ'} [‖B‖_{μ'} < 1] .

PROOF. For a proof we refer to van Hee and Wessels [29], who proved this theorem for (sub-)Markov matrices, but their proof can easily be extended.
REMARK 2.4.1.
(i) Note that in lemma 2.4.3 S is not required to be finite.
(ii) If S is countably infinite, the linear space W may contain elements that are not bounded. If μ(i) → ∞ for i → ∞, then it is also permitted that |w(i)| → ∞.
(iii) Note that it is not required that the μ-norm of b exists. If the μ-norm of b exists then clearly W = V.
CHAPTER 3

MARKOV REWARD PROCESSES
In this chapter we restrict ourselves to Markov reward processes, so we exhibit our method for the first time in a relatively simple situation. After defining the model (section 3.1) we introduce the assumptions on the reward structure and the transition probability structure. As mentioned, we require neither the reward structure to be bounded nor the transition probabilities to be strictly defective.
Next (section 3.2) we show that each nonzero stopping time δ ∈ Δ defines a contraction mapping (L_δ) on the complete metric space V. The fixed point appears to be independent of the stopping time. It equals the total expected reward over an infinite time horizon. In the final section (3.3) we discuss the assumptions on the reward structure and the transition probabilities in relation to the bounding function μ and the function b.
3.1. The Markov reward model
We consider a system that is observed to be in one of the states of S at discrete points in time t = 0,1,.... If the system's state at time t is i ∈ S, the system's state at time t+1 will be j ∈ S with probability p(i,j), independent of the time instant t.
ASSUMPTION 3.1.1.
(i) ∀ i,j∈S: 0 ≤ p(i,j) ≤ 1;
(ii) ∀ i∈S: Σ_{j∈S} p(i,j) ≤ 1;
(iii) p(0,0) = 1.

For each i ∈ S the unique probability measure P_i on (Ω_0, F_0) is defined in the standard way, see e.g. Neveu [52], Bauer [1], by defining the probabilities of cylindrical sets.
(3.1.1) P_i({ω_0 | [ω_0]_0 = i_0, ..., [ω_0]_n = i_n}) := δ_{i i_0} p(i_0,i_1) ··· p(i_{n−1},i_n) ,

where n ∈ ℤ⁺ and δ_{ij} is the Kronecker symbol: δ_{ij} = 1 if i = j, 0 else.
Given the starting state i ∈ S and the matrix P (with entries p(i,j)) we consider the stochastic process {s_{0,n} | n ≥ 0}, where s_{0,n}(ω_0) := [ω_0]_n. So s_{0,n} is the state of the process at time n. The stochastic process {s_{0,n} | n ≥ 0} is a Markov chain with stationary transition probabilities. See e.g. Ross [62], Feller [17], Karlin [39], Cox and Miller [9], Kemeny and Snell [40].
For each stopping time δ ∈ Δ and each starting state i ∈ S we define in a similar way the unique probability measure P_{i,δ} on (Ω, F) by giving, for n ∈ ℤ⁺, the probabilities of the cylindrical sets with t_k ∈ S and d_k ∈ E. This defines for each i ∈ S and δ ∈ Δ a stochastic process {(s_n, e_n) | n ≥ 0}, where s_n(ω) := i_n, e_n(ω) := d_n. So s_n and e_n are the state and the value of the stopping indicator at time n.
REMARK 3.1.1. The stochastic process {(s_n,e_n) | n ≥ 0} is not a Markov chain, since the value δ(α) may depend on the complete history ([α]_0, [α]_1, ..., [α]_{k_α−1}) for each α ∈ G∞.
Formula (3.1.2) shows the connection between P_i and P_{i,δ}. For each ω ∈ Ω, ω = ((i_0,d_0),(i_1,d_1),...), we define ω_0 by ω_0 := (i_0,i_1,i_2,...).
For A_0 ∈ F_0 we define the set A ∈ F by

A := {ω ∈ Ω | ω_0 ∈ A_0} .
It is easily verified that for A and A_0 related in this way

(3.1.3) P_{i,δ}(A) = P_i(A_0) .

Let f_0 be a measurable function on (Ω_0, F_0). The function f on (Ω, F) is then defined such that f(ω) := f_0(ω_0). It follows from formula (3.1.3) that E_i f_0 = E_{i,δ} f, where E_i f_0, E_{i,δ} f denote the expectation of f_0 and f with respect to the probability measures P_i and P_{i,δ} respectively.
In the sequel we omit the subscript 0 in f_0. The process {s_{0,n} | n ≥ 0} will thus be denoted by {s_n | n ≥ 0}.

NOTATION 3.1.1. By Ef, E_δ f we denote the vectors with components E_i f, E_{i,δ} f respectively.
We now state the assumptions on the reward structure and the transition probabilities of the system considered in this section. Therefore, we first introduce the reward function. At each point in time a reward is earned. We assume this reward to depend on the actual state of the system only. So the reward function r is a vector.
ASSUMPTION 3.1.2. (r − b) ∈ W.
ASSUMPTION 3.1.3. Σ_{n=0}^∞ (P^n b)(i) converges for each i ∈ S.
ASSUMPTION 3.1.4. ‖P‖ < 1.
ASSUMPTION 3.1.5. (Pb − ρb) ∈ W.
REMARK 3.1. 2. (i) Since b(O}
r(O) = O.
O and µ (0)
(ii) If b
I W,
then rI W.
0, asswnption 3.1.2 implies that also
(iii) In terms of potential theory (see e.g. Hordijk [33]) the second as-swnption states that b is a charge with respect to P, which implies the existence of the total expected reward over an infinite time ho-rizon.
(iv) Assumption 3.1.4 means that the transition probabilities are such
that the expectation of µ (~
1) , with respect to Pi is at most llP H• U (i) •
This implies that the process has a tendency to decrease its J.!-value. (v) The final asswnption states that, given the startinq state i0 €
s,
the difference between the expected one-stage return and Pr (i
0) lies between -MJJ (i ) and Mµ (i ) for some M € lR + and all i €
s.
Q 0 0
(vi) Note that if b €
W
the assumptions 3.1.2-3.1.5 may be replaced by '(a) r € W,(b) l!Plf < 1.
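As a concrete illustration of the $\mu$-norm conditions, they can be checked numerically on a small chain. The transition matrix, bounding function and reward vector below are toy assumptions for the sketch, not data from the text; state 0 is absorbing with $\mu(0) = 0$ and $r(0) = 0$, as required.

```python
import numpy as np

# Toy chain on S = {0,1,2,3}; state 0 absorbs (illustrative assumption).
P = np.array([
    [1.0, 0.0, 0.0, 0.0],   # absorbing state 0
    [0.5, 0.3, 0.2, 0.0],
    [0.5, 0.1, 0.2, 0.2],
    [0.6, 0.1, 0.1, 0.2],
])
mu = np.array([0.0, 1.0, 1.0, 1.0])    # bounding function, mu(i) > 0 for i != 0
r  = np.array([0.0, 1.0, -2.0, 0.5])   # reward vector, r(0) = 0

def mu_norm_vec(v, mu):
    """mu-norm of a vector: sup over {i : mu(i) > 0} of |v(i)| / mu(i)."""
    idx = mu > 0
    return np.max(np.abs(v[idx]) / mu[idx])

def mu_norm_mat(P, mu):
    """mu-norm of a matrix: sup over {i : mu(i) > 0} of (P mu)(i) / mu(i)."""
    idx = mu > 0
    return np.max((P @ mu)[idx] / mu[idx])

norm_P = mu_norm_mat(P, mu)
print("||P|| =", norm_P, "-> assumption 3.1.4 holds:", norm_P < 1)
print("||r|| =", mu_norm_vec(r, mu))
```

Here the absorbing row of $P$ does not enter the norm because $\mu(0) = 0$; the remaining rows each leak enough mass to state 0 to make $\|P\| < 1$.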
LEMMA 3.1.1. $(Pr - \rho b) \in W$.

PROOF. $\|Pr - \rho b\| \le \|Pb - \rho b\| + 2\|r - b\| =: M_1$, which is finite according to the assumptions 3.1.2-3.1.5. $\square$
LEMMA 3.1.2. For $M_1$ as defined in the proof of lemma 3.1.1, we have

$$\|P^n r - \rho^n b\| \le n M_1 \rho_0^{\,n-1} , \qquad n = 1,2,\dots ,$$

with $\rho_0 := \max\{\|P\|, \rho\}$.

PROOF. The proof proceeds inductively. The statement is true for $n = 1$. Suppose it is true for arbitrary $n \ge 1$. Using the assumptions 3.1.2-3.1.5 we then have

$$\|P^{n+1} r - \rho^{n+1} b\| \le \|P\| \cdot \|P^n r - \rho^n b\| + \rho^n \|Pb - \rho b\| \le (n+1) M_1 \rho_0^{\,n} . \qquad \square$$

LEMMA 3.1.3. For all $i \in S$,

$$\lim_{n\to\infty} (P^n b)(i) = 0 , \qquad \lim_{n\to\infty} (P^n r)(i) = 0 .$$

PROOF. The proof is a direct consequence of assumption 3.1.3 and the foregoing lemma. $\square$
For each $n \ge 1$ we define the vector $V_n$ by

(3.1.4) $\qquad V_n := E \sum_{k=0}^{n-1} r(s_k) .$

LEMMA 3.1.4. $V_n = \sum_{k=0}^{n-1} P^k r .$

PROOF. The proof follows by inspection. $\square$

Clearly $V_n(i)$ represents the total expected reward over $n$ time periods when the initial state is $i \in S$.

THEOREM 3.1.1. $\sum_{n=0}^{\infty} P^n r$ converges.

PROOF. The convergence of $\sum_{n=0}^{\infty} P^n r$ follows from assumptions 3.1.3 and 3.1.2, since

$$\sum_{n=0}^{\infty} P^n r = \sum_{n=0}^{\infty} P^n b + \sum_{n=0}^{\infty} P^n (r - b) . \qquad \square$$

DEFINITION 3.1.1. The total expected reward vector $V$ is defined by

$$V := \sum_{n=0}^{\infty} P^n r .$$
3.2. Contraction mappings and stopping times
LEMMA 3.2.1. Let $v \in V$; then
(i) $P^n|v|$ exists for all $n \in \mathbb{N}$,
(ii) $\lim_{n\to\infty} (P^n|v|)(i) = 0$, $i \in S$.

PROOF. $v \in V$ implies that $v$ can be written as $v = (1-\rho)^{-1} b + w$ where $w \in W$. So $P^n|v| \le (1-\rho)^{-1} P^n|b| + P^n|w|$, which is defined. Moreover, since $\sum_{n=0}^{\infty} P^n|b|$ and $\sum_{n=0}^{\infty} P^n|w|$ exist, we find part (ii) of the lemma. $\square$
DEFINITION 3.2.1. The mapping $L_1$ of $V$ is defined by

$$L_1 v := r + Pv , \qquad v \in V .$$

LEMMA 3.2.2.
(i) $L_1$ maps $V$ into $V$.
(ii) $L_1$ is a monotone mapping.
(iii) The set $\{v \in V \mid \|v - (1-\rho)^{-1}b\| \le M_1 (1-\rho_0)^{-2}\}$ is mapped into itself by $L_1$.
(iv) $L_1$ is strictly contracting with contraction radius $\|P\|$.
(v) The unique fixed point of $L_1$ is $V$.

PROOF. (i) $L_1 v = r + Pv = r + P((1-\rho)^{-1} b + w)$, with $w \in W$. So

$$\|L_1 v - (1-\rho)^{-1} b\| = \|r + (1-\rho)^{-1} Pb + Pw - (1-\rho)^{-1} b\|$$
$$\le \|(1-\rho)^{-1}(Pb - \rho b)\| + \|Pw\| + \|r - b\| \le (1-\rho)^{-1}\|Pb - \rho b\| + \|P\| \cdot \|w\| + \|r - b\| < \infty .$$

The proof of part (ii) is trivial.

(iii) Let $v \in \{v \in V \mid \|v - (1-\rho)^{-1}b\| \le M_1(1-\rho_0)^{-2}\}$; then

$$\|L_1 v - (1-\rho)^{-1} b\| = \|r + Pv - (1-\rho)^{-1} b\| \le \|P(v - (1-\rho)^{-1}b)\| + \|(1-\rho)^{-1}(Pb - \rho b) + (r - b)\|$$
$$\le \rho_0 M_1 (1-\rho_0)^{-2} + (1-\rho_0)^{-1} M_1 = M_1 (1-\rho_0)^{-2} .$$

(iv) Let $v_1, v_2 \in V$; then $v_1, v_2$ can be given by $v_1 = (1-\rho)^{-1} b + w_1$ and $v_2 = (1-\rho)^{-1} b + w_2$ respectively, where $w_1, w_2 \in W$. So $v_1 - v_2 = w_1 - w_2$, thus

$$\|L_1 v_1 - L_1 v_2\| = \|P(v_1 - v_2)\| \le \|P\| \cdot \|v_1 - v_2\| .$$

By choosing $v_1$ and $v_2$ such that $w_1 = \mu$ and $w_2 = 0$ the equality is attained.

(v) The last part of the lemma follows directly from

$$L_1 V = r + P\Big(\sum_{n=0}^{\infty} P^n r\Big) = r + \sum_{n=1}^{\infty} P^n r = V ,$$

where the interchange of summation is justified since $\sum_{n=0}^{\infty} P^n |r| < \infty$. $\square$
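Lemma 3.2.2 is the basis of successive approximation: since $L_1$ is a $\|P\|$-contraction on $V$ with fixed point $V$, iterating $v_n := r + Pv_{n-1}$ converges geometrically. A minimal numerical sketch on an illustrative toy chain (state 0 absorbing, $r(0) = 0$; all data are assumptions for the example, not from the text):

```python
import numpy as np

# Illustrative absorbing chain: state 0 absorbs and r(0) = 0.
P = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.3, 0.2, 0.0],
    [0.5, 0.1, 0.2, 0.2],
    [0.6, 0.1, 0.1, 0.2],
])
r = np.array([0.0, 1.0, -2.0, 0.5])

def L1(v):
    """The mapping L1 v := r + P v of lemma 3.2.2."""
    return r + P @ v

# Successive approximation v_n = L1(v_{n-1}), starting from v_0 = 0.
v = np.zeros(4)
for _ in range(200):
    v = L1(v)

# Reference fixed point: V(0) = 0, and (I - P)V = r on the other states.
V = np.zeros(4)
V[1:] = np.linalg.solve(np.eye(3) - P[1:, 1:], r[1:])

print(np.max(np.abs(v - V)))   # residual after 200 iterations: essentially zero
```

The error contracts by at least the factor $\|P\|$ per step, so a fixed iteration budget gives a computable a priori accuracy guarantee.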
Now we return to the concept of stopping time. Note that the stopping function $\tau$ on $\Omega$ is a random variable, so we can define the random variable $s_\tau$ by

(3.2.1) $\qquad s_\tau := s_n$ if $\tau = n$; $\qquad s_\tau := 0$ if $\tau = \infty$.

Given the starting state $i \in S$ and the transition probabilities, the distribution of $\tau$ is uniquely determined by the choice of $\delta$
$\in \Delta$.

LEMMA 3.2.3. Let $\delta \in \Delta$ be a nonzero stopping time; then $P_{i,\delta}(\tau \ge 1) > 0$ for each $i \in S$.

PROOF. The proof follows directly from the definition of $\tau$ and the definition of nonzero stopping time. $\square$
DEFINITION 3.2.2. Let $\delta \in \Delta$; the mapping $L_\delta$ of $V$ is defined componentwise by

$$(L_\delta v)(i) := E_{i,\delta}\Big[\sum_{k=0}^{\tau-1} r(s_k) + v(s_\tau)\Big] , \qquad i \in S .$$

REMARK 3.2.1. Note that as a consequence of the definitions of $\delta$ and $V$, $(L_\delta v)(0) = 0$ for all $v \in V$.

EXAMPLE 3.2.1. Let $\delta \in \Delta$ be the nonrandomized stopping time that corresponds to the go-ahead set $G_1$ and let $v \in V$; then

$$(L_\delta v)(i) = r(i) + \sum_{j \in S} p(i,j)\,v(j) .$$

EXAMPLE 3.2.2. Let $\delta \in \Delta$ be the stopping time that corresponds to the go-ahead set $G_H$ and $v \in V$; then

$$(L_\delta v)(i) = r(i) + \sum_{j < i} p(i,j)(L_\delta v)(j) + \sum_{j \ge i} p(i,j)\,v(j) .$$

EXAMPLE 3.2.3. Let $\delta \in \Delta$ be the stopping time that corresponds to the go-ahead set $G_R$; then

$$(L_\delta v)(i) = (1 - p(i,i))^{-1} r(i) + (1 - p(i,i))^{-1} \sum_{j \ne i} p(i,j)\,v(j) , \qquad i \ne 0 .$$

DEFINITION 3.2.3. The matrix $P_\delta$ is defined to be the matrix with $(i,j)$-th element $p_\delta(i,j)$ equal to
$$p_\delta(i,j) := \sum_{n=0}^{\infty} P_{i,\delta}(s_n = j,\ \tau = n) .$$

LEMMA 3.2.4. Let $\delta \in \Delta$ be a nonzero stopping time; then

$$\rho_\delta := \|P_\delta\| \le \Big(1 - \inf_{i \in S} \delta(i)\Big) + \Big(\inf_{i \in S} \delta(i)\Big)\|P\| < 1 .$$
PROOF. First note that $\|P_\delta\|$ is finite, since $\|P_\delta\| \le \sum_{n=0}^{\infty} \|P^n\| < \infty$ (assumption 3.1.4).

For $\delta \in \Delta$ we define the stopping time $\delta_M \in \Delta$ by

$$\delta_M(\alpha) := 0 \ \text{ if } \alpha \in \bigcup_{k=M+1}^{\infty} (S\setminus\{0\})^k , \qquad \delta_M(\alpha) := \delta(\alpha) \ \text{ else.}$$

Now, since

$$\big|\, \|P_\delta\| - \|P_{\delta_M}\| \,\big| \le \sum_{n=M}^{\infty} \|P^n\| ,$$

it suffices to prove the lemma for stopping times $\delta_M$. This will be done by induction with respect to $M$.

Let $\Delta_n \subset \Delta$ be the set of stopping times with

$$\Delta_n := \Big\{\delta \in \Delta \ \Big|\ \delta(\alpha) = 0 \text{ for all } \alpha \in \bigcup_{k=n+1}^{\infty} (S\setminus\{0\})^k \Big\} .$$

So $\Delta_0$ only contains the stopping times with $\delta(i) = 0$ for all $i \in S\setminus\{0\}$. For $\delta \in \Delta_0$ we have $\|P_\delta\| = 1$. Suppose $\delta \in \Delta_1$ is a nonzero stopping time; then

$$\sum_{j \in S} p_\delta(i,j)\mu(j) = (1 - \delta(i))\mu(i) + \delta(i) \sum_{j \in S} p(i,j)\mu(j) .$$

Since $\delta \in \Delta_1$ is supposed to be a nonzero stopping time, there exists a number $\varepsilon > 0$ such that $\delta(i) > \varepsilon$ for all $i \in S$, which implies $\|P_\delta\| = \rho_\delta < 1$.

Now we state the induction hypothesis: suppose for arbitrary $n \ge 1$

$$\|P_\delta\| \le 1 \ \text{ on } \Delta_n \quad \text{and} \quad \|P_\delta\| < 1 \ \text{ if } \delta \in \Delta_n \text{ is nonzero.}$$

Let $\delta \in \Delta_{n+1}$; we define $\delta^i(\alpha) := \delta(i,\alpha)$ for $i \in S$ and $\alpha \in \bar G_\infty$. It is easily verified that $\delta^i \in \Delta_n$. Now for each $i \in S$ we have

$$\sum_{j\in S} p_\delta(i,j)\mu(j) = \sum_{j\in S}\sum_{m=0}^{n+1} P_{i,\delta}(s_m = j,\ \tau = m)\,\mu(j)$$
$$= \sum_{j\in S}\Big[P_{i,\delta}(s_0 = j,\ \tau = 0) + \sum_{m=1}^{n+1} P_{i,\delta}(s_m = j,\ \tau = m)\Big]\mu(j)$$
$$= (1 - \delta(i))\mu(i) + \delta(i)\sum_{k\in S} p(i,k)\sum_{j\in S}\sum_{m=0}^{n} P_{k,\delta^k}(s_m = j,\ \tau = m)\,\mu(j)$$
$$\le (1 - \delta(i))\mu(i) + \delta(i)\sum_{k\in S} p(i,k)\mu(k)$$
$$\le \big((1 - \delta(i)) + \delta(i)\|P\|\big)\mu(i) .$$

So if $\delta \in \Delta_{n+1}$ is a nonzero stopping time, then $\|P_\delta\| \le (1 - \inf_i \delta(i)) + (\inf_i \delta(i))\|P\| < 1$. $\square$

LEMMA 3.2.5. Let $\delta \in \Delta$; then $L_\delta V = V$.
PROOF. We first mention a property of Markov chains:

$$E_j\, r(s_k) = E_i\big[r(s_{n+k}) \mid s_n = j\big] \quad \text{if } P_i(s_n = j) > 0 , \qquad k,n \in \mathbb{N} .$$

Consider for each $i \in S$

$$(L_\delta V)(i) = E_{i,\delta}\Big[\sum_{k=0}^{\tau-1} r(s_k) + V(s_\tau)\Big]$$
$$= E_{i,\delta}\Big[\sum_{k=0}^{\tau-1} r(s_k)\Big] + \sum_{n=0}^{\infty}\sum_{j\in S} P_{i,\delta}(s_n = j,\ \tau = n)\,V(j)$$
$$= E_{i,\delta}\Big[\sum_{k=0}^{\tau-1} r(s_k)\Big] + \sum_{n=0}^{\infty}\sum_{j\in S} P_{i,\delta}(s_n = j,\ \tau = n)\sum_{k=0}^{\infty} E_j\, r(s_k)$$
$$= E_{i,\delta}\Big[\sum_{k=0}^{\tau-1} r(s_k)\Big] + \sum_{n=0}^{\infty}\sum_{j\in S}\sum_{k=0}^{\infty} P_{i,\delta}(s_n = j,\ \tau = n)\, E_i\big[r(s_{n+k}) \mid s_n = j\big]$$
$$= E_{i,\delta}\Big[\sum_{k=0}^{\tau-1} r(s_k)\Big] + \sum_{k=0}^{\infty} E_{i,\delta}\, r(s_{\tau+k}) = E_{i,\delta}\sum_{k=0}^{\infty} r(s_k) = V(i) ,$$

where the interchange of summations is justified by the fact that $\sum_{n=0}^{\infty} E_i|r(s_n)| < \infty$. $\square$

LEMMA 3.2.6. Let $\delta \in \Delta$; then
(i) $L_\delta$ maps $V$ into $V$.
(ii) $L_\delta$ is a monotone mapping.
(iii) $L_\delta$ is strictly contracting if and only if $\delta$ is nonzero.
(iv) The contraction radius of $L_\delta$ equals $\rho_\delta = \|P_\delta\|$.
(v) The set $\{v \in V \mid \|v - (1-\rho)^{-1}b\| \le (1-\rho_0)^{-2} M_1'\}$ is mapped into itself by $L_\delta$, where $M_1' := \dfrac{2M_1}{1-\rho_\delta}$ and $M_1$ is defined as in the proof of lemma 3.1.1 by $M_1 := \|Pb - \rho b\| + 2\|r - b\|$.

PROOF. The proof of (i) follows from lemma 3.2.5 and theorem 3.1.1, since each $v \in V$ may be written as $v = V + w$ with $w \in W$. Now

$$L_\delta v = L_\delta(V + w) = L_\delta V + P_\delta w = V + P_\delta w \in V .$$

The monotonicity of $L_\delta$ is trivial.

To prove (iii) we first note that $v_1, v_2 \in V$ imply that $(v_1 - v_2)$ and $(L_\delta v_1 - L_\delta v_2)$ are elements of $W$. Moreover $v_1, v_2$ may be given by $v_1 = V + w_1$, $v_2 = V + w_2$ with $w_1, w_2 \in W$. So

$$\|L_\delta v_1 - L_\delta v_2\| = \|P_\delta(w_1 - w_2)\| \le \|P_\delta\| \cdot \|v_1 - v_2\| .$$

That $L_\delta$ is strictly contracting if and only if $\delta$ is nonzero now follows from lemma 3.2.4. The contraction radius equals $\|P_\delta\|$, as is verified by choosing $v_1 = V + \mu$ and $v_2 = V$.

The last assertion follows from $\|V - (1-\rho)^{-1} b\| \le M_1 (1-\rho_0)^{-2}$, so each $v \in \{v \in V \mid \|v - (1-\rho)^{-1}b\| \le (1-\rho_0)^{-2} M_1'\}$ may be written as $v = V + w$, where the $\mu$-norm of $w \in W$ is at most $(M_1 + M_1')(1-\rho_0)^{-2}$. Now

$$\|L_\delta v - (1-\rho)^{-1} b\| = \|L_\delta(V+w) - (1-\rho)^{-1} b\| \le \|V - (1-\rho)^{-1} b\| + \|P_\delta\| \cdot \|w\|$$
$$\le M_1(1-\rho_0)^{-2} + \rho_\delta (M_1 + M_1')(1-\rho_0)^{-2} \le M_1'(1-\rho_0)^{-2} . \qquad \square$$
THEOREM 3.2.1. For any nonzero stopping time $\delta \in \Delta$ the mapping $L_\delta$ has the unique fixed point $V$ (independent of $\delta$).

PROOF. The proof follows directly from the fact that $V$ is a complete metric space, together with the lemmas 3.2.5 and 3.2.6. $\square$
LEMMA 3.2.7.
(i) Let $\delta_1, \delta_2 \in \Delta$ be nonzero stopping times with $\delta_1 \le \delta_2$; then $\rho_{\delta_2} \le \rho_{\delta_1}$.
(ii) Suppose $\delta_1, \delta_2$ are nonrandomized nonzero stopping times and $G_1, G_2$ the go-ahead sets corresponding to $\delta_1$ and $\delta_2$; then

$$G_1 \subset G_2 \ \Rightarrow\ \delta_1 \le \delta_2 \ \text{ and thus } \ \rho_{\delta_2} \le \rho_{\delta_1} .$$

(iii) Let $Q$ be a set of indices and let $\delta_q \in \Delta$, $q \in Q$, be nonzero and nonrandomized with go-ahead sets $G_q$. If $\delta^+$ and $\delta^-$ denote the stopping times corresponding to the go-ahead sets $\bigcup_{q\in Q} G_q$ and $\bigcap_{q\in Q} G_q$ respectively, then

$$\rho_{\delta^+} \le \sup_{q\in Q} \rho_{\delta_q} \quad \text{and} \quad \rho_{\delta^-} \ge \inf_{q\in Q} \rho_{\delta_q} .$$
LEMMA 3.2.8. Let $\delta \in \Delta$ be a nonzero stopping time and $v_0^\delta \in V$; then
(i) $v_n^\delta := L_\delta(v_{n-1}^\delta) \to V$ (in $\mu$-norm);
(ii) $L_\delta v_0^\delta \le v_0^\delta \ \Rightarrow\ v_n^\delta \downarrow V$ (in $\mu$-norm);
(iii) $L_\delta v_0^\delta \ge v_0^\delta \ \Rightarrow\ v_n^\delta \uparrow V$ (in $\mu$-norm);
where the monotone convergence is component-wise.

PROOF. The proof of (i) is a direct consequence of theorem 3.2.1, whereas parts (ii) and (iii) follow from the monotonicity of the mapping $L_\delta$ and theorem 3.2.1. $\square$
So the determination of the total expected reward over an infinite time horizon, $V$, may be done by successive approximation of $V$ by $v_n^\delta$, with arbitrary nonzero stopping time $\delta$ and arbitrary element $v_0^\delta$ of $V$.

In particular, if $\delta_1$, $\delta_H$, $\delta_R$ are the nonrandomized nonzero stopping times corresponding to the go-ahead sets $G_1$, $G_H$, $G_R$ respectively, the following schemes result:

(i) $v_n^{\delta_1} := r + P v_{n-1}^{\delta_1}$, with $v_0^{\delta_1} \in V$;

(ii) $v_n^{\delta_H}$ is component-wise defined by

$$v_n^{\delta_H}(i) := r(i) + \sum_{j<i} p(i,j)\, v_n^{\delta_H}(j) + \sum_{j\ge i} p(i,j)\, v_{n-1}^{\delta_H}(j) , \qquad i \in S ,$$

with $v_0^{\delta_H} \in V$;

(iii) $v_n^{\delta_R}$ is component-wise defined by

$$v_n^{\delta_R}(i) := (1-p(i,i))^{-1} r(i) + (1-p(i,i))^{-1} \sum_{j\ne i} p(i,j)\, v_{n-1}^{\delta_R}(j) , \qquad i \in S ,\ i \ne 0 ,$$

with $v_0^{\delta_R} \in V$.

Furthermore, if in each of these approximations $v_0^\delta$ is chosen as required in lemma 3.2.8(ii) or (iii), the convergence will be monotone.

The following lemmas will clarify some of the relations between the bounding function $\mu$ and the function $b$ (recall remark 3.1.2(vi)).
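The three approximation schemes (i)-(iii) above correspond, in numerical-analysis terms, to the standard (pre-Jacobi), Gauss-Seidel and Jacobi iterations. A minimal sketch on an illustrative toy chain (all data are assumptions for the example, not from the text) showing that all three converge to the same fixed point $V$:

```python
import numpy as np

# Illustrative absorbing chain: state 0 absorbs, r(0) = 0, p(i,i) < 1 for i != 0.
P = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.3, 0.2, 0.0],
    [0.5, 0.1, 0.2, 0.2],
    [0.6, 0.1, 0.1, 0.2],
])
r = np.array([0.0, 1.0, -2.0, 0.5])
n_states = 4

def step_standard(v_prev):
    """Scheme (i): v_n = r + P v_{n-1} (go-ahead set G_1)."""
    return r + P @ v_prev

def step_gauss_seidel(v_prev):
    """Scheme (ii): reuse already-updated components j < i (go-ahead set G_H)."""
    v = v_prev.copy()
    for i in range(1, n_states):                 # v(0) stays 0
        v[i] = r[i] + P[i, :i] @ v[:i] + P[i, i:] @ v_prev[i:]
    return v

def step_jacobi(v_prev):
    """Scheme (iii): the diagonal is divided out (go-ahead set G_R)."""
    v = v_prev.copy()
    for i in range(1, n_states):
        off_diag = P[i] @ v_prev - P[i, i] * v_prev[i]
        v[i] = (r[i] + off_diag) / (1.0 - P[i, i])
    return v

# Reference fixed point V (V(0) = 0).
V = np.zeros(n_states)
V[1:] = np.linalg.solve(np.eye(n_states - 1) - P[1:, 1:], r[1:])

for step in (step_standard, step_gauss_seidel, step_jacobi):
    v = np.zeros(n_states)
    for _ in range(200):
        v = step(v)
    print(step.__name__, np.max(np.abs(v - V)) < 1e-8)
```

All three iterations share the fixed point $V$; by lemma 3.2.7, a larger go-ahead set typically yields a smaller contraction radius and hence faster convergence.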
LEMMA 3.3.1. Suppose $\exists M' \in \mathbb{R}_+\ \forall i \in S\ [b(i) \ge -M'\mu(i)]$; then there exist a $\rho'$, $0 < \rho' < 1$, and a bounding function $\mu'$ such that

(3.3.1) $\qquad \|P\|_{\mu'} \le \rho' < 1$,

(3.3.2) $\qquad \|r\|_{\mu'} < \infty$.

PROOF. Choose $M_3 := \max\{2\|Pb - \rho b\|(1-\rho)^{-1},\ 2M'\}$, $\rho' := \rho_0 + \tfrac12(1-\rho_0)$ and $\mu'(i) := b(i) + M_3\mu(i)$; then clearly $\mu'$ is a bounding function. Now

$$\|r\|_{\mu'} = \sup_{i\in S\setminus\{0\}} |r(i)|\,\big(b(i) + M_3\mu(i)\big)^{-1}$$
$$\le \sup_{i\in S\setminus\{0\}} |b(i)|\,\big(b(i) + M_3\mu(i)\big)^{-1} + \sup_{i\in S\setminus\{0\}} |r(i) - b(i)|\,\big(b(i) + M_3\mu(i)\big)^{-1} < \infty .$$

Furthermore one verifies $P\mu' = Pb + M_3 P\mu \le \rho'\mu'$. $\square$
In a similar way the following lemma can be proved.

LEMMA 3.3.2. Suppose $\exists M' \in \mathbb{R}_+\ \forall i \in S\ [b(i) \le M'\mu(i)]$; then there exist a $\rho'$, $0 < \rho' < 1$, and a bounding function $\mu'$ such that (3.3.1) and (3.3.2) are satisfied. $\square$

The latter two lemmas express that if the reward function is bounded from one side (with respect to the weighting factors $\mu(i)$), a new bounding function $\mu'$ can be defined such that the Banach space $W_{\mu'}$ contains $b$, $r$, $V_n$ and $V$. However, for the existence of such a bounding function $\mu'$, the condition that $b$ is bounded from one side (with respect to $\mu$) is essential, as is illustrated by the following example.

EXAMPLE 3.3.1. Let $S := \{0,1,2,\dots\}$, $r(0) = 0$, $p(0,0) = 1$, and $p(i,0) := 1 - p$ for all $i \in S\setminus\{0\}$. Let $i_0 := \min\{i \in \mathbb{N} \mid \dots > p\}$; if $i_0$ is even then we redefine $i_0 := i_0 + 1$. For all $0 < i < i_0$ the rewards and transition probabilities are given by $r(i) = 0$, $p(i,i) = p$, $p(i,j) = 0$ for $j \ne i$ and $j \ne 0$. For $i > \tfrac12 i_0$ we choose

$$p(2i,2i+2) = p(2i-1,2i+1) = p(1 - a_i) , \qquad p(2i,2i+1) = p(2i-1,2i+2) = p\,a_i ,$$
$$r(2i) = i^{-2} p^{-i} , \qquad r(2i+1) = -(i+1)^{-2} p^{-(i+1)} ,$$
$$p(2i-1,j) = p(2i,j) = 0 \quad \text{otherwise.}$$

We clearly have $\|P\|_\mu = p$. Moreover it is easily verified that

$$\sum_{n=0}^{\infty}\sum_{j\in S} P^{(n)}(2i,j)\,|r(j)| \le p^{-i}\sum_{n=i}^{\infty} n^{-2}$$

and

$$\sum_{n=0}^{\infty}\sum_{j\in S} P^{(n)}(2i+1,j)\,|r(j)| \le p^{-(i+1)}\sum_{n=i+1}^{\infty} n^{-2} .$$

It can also be verified that

$$\sum_{j\in S} p(i,j)\,b(j) - \rho\,b(i) = 0 \quad \text{for all } i \in S .$$

So the assumptions 3.1.2-3.1.5 are satisfied.

However, no bounding function $\mu'$ exists for which $\rho' < 1$ and $\|r\|_{\mu'} < \infty$. This follows since $r(2i) = i^{-2} p^{-i}$ and $r(2i+1) = -(i+1)^{-2} p^{-(i+1)}$ for $i > \tfrac12 i_0$ imply that an eventual bounding function $\mu'$ should satisfy

$$\mu'(2i) \ge \tfrac{1}{M}\, i^{-2} p^{-i} =: \mu_0(2i)$$

for some $M \in \mathbb{R}_+$ and $i > \tfrac12 i_0$. Assume the existence of a bounding function $\mu'$ and a $\rho' < 1$ such that

(3.3.3) $\qquad P\mu' \le \rho'\mu' .$

We define

$$i_1 := \max\Big\{i_0,\ \min\big\{i \ \big|\ \big(\tfrac{i}{i+1}\big)^2 > \tfrac12(1+\rho')\big\}\Big\} .$$

Then, substituting $\mu_0$ in the right-hand side of (3.3.3) yields, for $i > 2i_1$, the condition

$$\mu'(i) \ge \beta\,\mu_0(i) =: \mu_1(i) , \qquad \text{with } \beta := \frac{1+\rho'}{2\rho'} > 1 .$$

Substituting $\mu_1(i)$ in (3.3.3) yields for $i > 2i_1$

$$\mu'(i) \ge \beta\,\mu_1(i) = \beta^2 \mu_0(i) .$$

Iterating in this way proves that no bounding function exists.
REMARK 3.3.1.
(i) It is easily verified that the assumptions 3.1.3 and 3.1.4 do not imply assumption 3.1.5. By replacing the rewards in the above example by $|r|$ the assumptions 3.1.3 and 3.1.4 remain satisfied, whereas assumption 3.1.5 fails.
(ii) If $b$ is bounded from one side (in $\mu$-norm) it follows from lemma 3.3.1 or 3.3.2 that $r$ is a charge with respect to $P$, since

$$\Big\|\sum_{n=0}^{\infty} P^n|r|\Big\|_{\mu'} \le (1-\rho')^{-1}\|r\|_{\mu'} ,$$

i.e.

$$\sum_{n=0}^{\infty} \big(P^n|r|\big)(i) \le (1-\rho')^{-1}\|r\|_{\mu'}\,\mu'(i) , \qquad i \ne 0 .$$

In this case assumption 3.1.3 may be replaced by the assumption that $b$ is bounded from one side (with respect to $\mu$).
CHAPTER 4
MARKOV DECISION PROCESSES
As mentioned in chapter 1 we consider in this and the following chapters Markov decision processes.

In section 4.1 the model is described. Next, in section 4.2, decision rules and the assumptions on the transition probabilities and on the reward structure will be introduced. Again the assumptions will allow for an unbounded reward structure. They are in fact a natural extension of the assumptions 3.1.2-3.1.5 to the case in which decisions are permitted. A number of results about Markov decision processes will be proved under our assumptions (section 4.3). For example, the existence of ε-optimal stationary Markov decision rules will be shown. For Markov decision processes with a bounded reward structure this result has also been obtained by Blackwell [5] and Denardo [12].

Harrison [24] proved the same for discounted Markov decision processes with a bounding function $\mu(i) = 1$ for $i \in S\setminus\{0\}$. Moreover, in section 4.3 we prove the convergence (in $\mu$-norm) of the standard dynamic programming algorithm.

The final section will again be devoted to a discussion of the assumptions.
4.1. The Markov decision model

We consider a Markov decision process on the countably infinite or finite state space $S$ at discrete points in time $t = 0,1,\dots$. In each state $i \in S$ the set of actions available is $A$. We allow $A$ to be general and suppose $\mathcal{A}$ to be a σ-field on $A$ with $\{a\} \in \mathcal{A}$ if $a \in A$. If the system's actual state is $i \in S$ and action $a \in A$ is selected, then the system's next state will be $j \in S$ with probability $p_a(i,j)$.

ASSUMPTION 4.1.1.
(i) $p_a(i,j) \ge 0$ for all $i,j \in S$ and $a \in A$;
(ii) $\sum_{j\in S} p_a(i,j) = 1$ for all $i \in S$ and $a \in A$;
(iii) $p_a(0,0) = 1$ for all $a \in A$;
(iv) $p_a(i,j)$, as a function of $a$, is a measurable function on $(A,\mathcal{A})$ for each $i,j \in S$.

If state $i \in S$ is observed at time $n$ and action $a \in A$ has been selected, then an immediate (expected) reward $r(i,a)$ is earned. So from now on the reward function $r$ is a real-valued measurable function on $S \times A$.

The objective is to choose the actions at the successive points in time in such a way that the total expected reward over an infinite time horizon is maximal. A precise formulation will be given in the following sections.

It will be shown later on (chapter 9) that our model formulation includes the discounted case (with a discount factor $\beta < 1$), since $\beta$ may be supposed to be incorporated in the $p_a(i,j)$. The same holds for semi-Markov decision processes, where it is only required that $t$ is interpreted as the number of the decision moment rather than actual time. For semi-Markov decision processes with discounting, the resulting discount factor depends on $i$, $j$ and $a \in A$ only, and may again be supposed to be incorporated in the transition probabilities $p_a(i,j)$.
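The reduction of discounting to the transition probabilities can be made concrete: multiplying every transition probability by $\beta$ and sending the remaining mass $1-\beta$ to the absorbing state 0 yields an undiscounted model with the same value vector. A minimal numerical sketch (the two-state chain is an assumption for the example, not from the text):

```python
import numpy as np

beta = 0.9
# Original stochastic chain on {1,2} with rewards r (illustrative data).
P = np.array([[0.6, 0.4],
              [0.5, 0.5]])
r = np.array([1.0, 2.0])

# Discounted value: V = sum_n beta^n P^n r = (I - beta P)^{-1} r.
V_disc = np.linalg.solve(np.eye(2) - beta * P, r)

# Equivalent undiscounted model: add an absorbing state 0 with zero reward
# and set p'(i,j) := beta * p(i,j), p'(i,0) := 1 - beta.
P_tilde = np.zeros((3, 3))
P_tilde[0, 0] = 1.0
P_tilde[1:, 1:] = beta * P
P_tilde[1:, 0] = 1.0 - beta
r_tilde = np.array([0.0, 1.0, 2.0])

# Total (undiscounted) expected reward in the enlarged model.
V_total = np.linalg.solve(np.eye(2) - P_tilde[1:, 1:], r_tilde[1:])
print(np.allclose(V_disc, V_total))   # True: the two models have the same value
```

The two linear systems are identical term by term, which is exactly why the discount factor can be absorbed into the $p_a(i,j)$.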
4.2. Decision rules

In the first part of this section we are concerned with the concept of decision rules. Roughly, a decision rule is a recipe for taking actions at each point in time. A decision rule will be denoted by $\pi$. The action to be selected at time $n$, according to $\pi$, may be a function of the entire history of the process until that time. We allow for the decision rule $\pi$ to be such that for each state $i \in S$ actions are selected by a random mechanism. This random mechanism may be a function of the history too.
DEFINITION 4.2.1.
(i) An $n$-stage history $h_n$ of the process is a $(2n+1)$-tuple $h_n := (i_0, a_0, i_1, a_1, \dots, i_n)$ of successively observed states and selected actions.
(ii) $H_n$, $n \ge 0$, denotes the set of all possible $n$-stage histories.

DEFINITION 4.2.2.
(i) Let $q_n$ be a transition probability of $(H_n, \mathcal{S}_n)$ into $(A,\mathcal{A})$, $n \ge 0$. So
(a) for every $h_n \in H_n$, $q_n(\cdot \mid h_n)$ is a probability measure on $(A,\mathcal{A})$;
(b) for every $A' \in \mathcal{A}$, $q_n(A' \mid \cdot)$ is measurable on $(H_n, \mathcal{S}_n)$.
Then a decision rule $\pi$ is defined to be a sequence of transition probabilities, $\pi := (q_0, q_1, q_2, \dots)$.
(ii) The set of all decision rules is denoted by $D$.

DEFINITION 4.2.3.
(i) A decision rule $\pi = (q_0, q_1, \dots)$ is called nonrandomized if $q_n(\cdot \mid h_n)$ is a degenerate measure on $(A,\mathcal{A})$ for each $n \ge 0$, i.e. $\exists_{a \in A}\,[q_n(\{a\} \mid h_n) = 1]$. The set of all nonrandomized decision rules is $N$.
(ii) A decision rule $\pi = (q_0, q_1, \dots)$ is said to be Markov or memoryless if for all $n \ge 0$, $q_n(\cdot \mid h_n)$ depends on the last component of $h_n$ only.
(iii) The set of all Markov decision rules is denoted by $D_M$.
(iv) A decision rule is said to be a Markov strategy if it is nonrandomized and Markov.
(v) The set of Markov strategies is denoted by $M$.
(vi) A Markov strategy can thus be identified with a sequence of functions $\{f_n \mid n = 0,1,\dots\}$ where $f_n$ is a function from $S$ into $A$. Such a function is called a (Markov) policy. The set of all possible policies is denoted by $F$.
(vii) A Markov strategy is called stationary if all its component policies are identical. We denote by $f^\infty$ the stationary Markov strategy with component $f$. $F^\infty$ denotes the set of all stationary Markov strategies.
(viii) For $\pi = (f_0, f_1, \dots) \in M$ and $g \in F$ we denote by $(g,\pi) := (g, f_0, f_1, \dots)$ the Markov strategy that applies $g$ first and then applies the policies of $\pi$ in their given order.
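Under the identifications above, a policy is simply a map $f: S \to A$ and a Markov strategy is a sequence of such maps. A minimal sketch of these objects, where the sets $S$, $A$ and the dict representation are assumptions made for the example:

```python
from itertools import chain, islice, repeat

# A (Markov) policy is a map f: S -> A; here policies are plain dicts
# over a small illustrative state space (an assumption for the sketch).
S = [0, 1, 2]
f = {0: 'stop', 1: 'left', 2: 'right'}    # one policy in F
g = {0: 'stop', 1: 'right', 2: 'right'}   # another policy in F

# The stationary strategy f^infinity repeats the same policy forever.
f_inf = repeat(f)

# (g, pi): apply g first, then the policies of pi in their given order.
pi = chain([g], f_inf)

first_three = list(islice(pi, 3))
print(first_three == [g, f, f])   # True: g at time 0, then f at every later time
```

Representing $f^\infty$ as an infinite repetition makes the composition $(g,\pi)$ a one-line prefix operation, mirroring definition 4.2.3(viii).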
For $n \in \mathbb{N}$ we define the measurable space $((S \times A)^n, \mathcal{S}^n_{0,A})$, where $\mathcal{S}^n_{0,A}$ is the product σ-field generated by $\mathcal{S}_0$ and $\mathcal{A}$. The product space $(\Omega_{0,A}, F_{0,A})$ is the space with $\Omega_{0,A} = (S \times A)^\infty$ and $F_{0,A}$ the product σ-field of subsets of $(S \times A)^\infty$ generated by $\mathcal{S}_0$ and $\mathcal{A}$. For each $\pi = (q_0, q_1, \dots)$ and each $n \in \mathbb{N}$ we define for $((S \times A)^n, \mathcal{S}^n_{0,A})$ and $(S \times A, \mathcal{S}_{0,A})$ the transition probability