Contracting Markov decision processes
Citation for published version (APA):
van Nunen, J. A. E. E. (1976). Contracting Markov decision processes. Stichting Mathematisch Centrum. https://doi.org/10.6100/IR29937

Document status and date:
Published: 01/01/1976

Document Version:
Publisher's PDF, also known as Version of Record (includes final page, issue and volume numbers)
CONTRACTING MARKOV DECISION PROCESSES
DISSERTATION

for obtaining the degree of Doctor in Technical Sciences at the Technische Hogeschool Eindhoven, by authority of the Rector Magnificus, Prof. Dr. Ir. G. Vossers, to be defended in public before a committee appointed by the Board of Deans on Tuesday 25 May 1976 at 16.00 hours

by

JOHANNES ARNOLDUS ELISABETH EMMANUEL VAN NUNEN

born in Venlo
This dissertation has been approved by the promotor Prof. dr. J. Wessels.
ERRATA

page 11, replace line 1 from below by:
(2.2.1) δ((0)) = 1; ∀ i∈S\{0} [δ((0,i)) = 0]; ∀ α∈G∞ [[δ(α) ≠ 0] ⇒ [δ((α,0)) = 1]].

page 12, replace lines 6 and 7 from above by:
(2.2.2) ∀ α,β∈G∞ [δ(α) = 0 ⇒ δ((α,β)) = 0]; ∀ α∈G∞ ∀ i∈S\{0} [δ((α,0,i)) = 0].

page 12, replace line 9 from below by:
(ii) (0) ∈ G; ∀ α∈G [(α,0) ∈ G].

page 13, replace line 6 from above by:
G_n := B ∪ {(α,β) | α ∈ B, β ∈ ∪_{k=1}^∞ {0}^k}, with B = ∪_{k=1}^n S^k.

page 14, line 3 from above: replace δ(α) < 1 by δ(α) ≠ 0.

page 14, replace line 6 from below by:
G_B(0) := ∪_{k=1}^∞ {0}^k; G_B(i) := {(i,α) | α ∈ ∪_{j=0}^{i−1} G_B(j)} ∪ {i}, for i ≠ 0.

page 14, replace lines 4 and 3 from below by:
G := S ∪ (∪_{k=2}^∞ B^k) ∪ {(α,β) | α ∈ S ∪ (∪_{k=2}^∞ B^k), β ∈ ∪_{k=1}^∞ {0}^k}, with B ⊂ S\{0}.

page 133, line 13 from below: add (to appear in Computing).
CONTENTS

1. INTRODUCTION
2. PRELIMINARIES
   2.1. Notations
   2.2. Stopping times
   2.3. Weighted supremum norms
   2.4. Some remarks on bounding functions
3. MARKOV REWARD PROCESSES
   3.1. The Markov reward model
   3.2. Contraction mappings and stopping times
   3.3. A discussion on the assumptions 3.1.2-3.1.5
4. MARKOV DECISION PROCESSES
   4.1. The Markov decision model
   4.2. Decision rules and assumptions
   4.3. Some properties of Markov decision processes
   4.4. Remarks on the assumptions 4.2.1-4.2.4
5. STOPPING TIMES AND CONTRACTION IN MARKOV DECISION PROCESSES
   5.1. The contraction mapping L_δ^f
   5.2. The optimal return mapping U_δ
6. VALUE ORIENTED SUCCESSIVE APPROXIMATION
   6.1. Policy improvement value determination procedures
   6.2. Some remarks on the value oriented methods
7. UPPER BOUNDS, LOWER BOUNDS AND SUBOPTIMALITY
   7.1. Upper bounds and lower bounds for v_n^δ and v*
   7.2. The suboptimality of Markov policies and suboptimal actions
   7.3. Some remarks on finite state space Markov decision processes
8. N-STAGE CONTRACTION
   8.1. Convergence under the strengthened N-stage contraction
9.
   9.1. Some examples of specific Markov decision processes
   9.2. The solution of systems of linear equations
   9.3. An inventory problem
REFERENCES
SUBJECT INDEX
SELECTED LIST OF SYMBOLS
SAMENVATTING
CURRICULUM VITAE
CHAPTER 1

INTRODUCTION
In the last three decades much attention has been given to Markov decision processes. Markov decision processes were first introduced by Bellman [2] in 1957, and constitute a special class of dynamic programming problems. In 1960 Howard [35] published his book "Dynamic programming and Markov processes". This publication gave an important impulse to the investigation of Markov decision processes.
We will first give an outline of the decision processes to be investigated.
Consider a system with a countable state space S. The system can be controlled at discrete points in time t = 0,1,2,..., by a decision maker. If at time t the system is observed to be in state i ∈ S, the decision maker can select an action from a nonempty set A. This set is independent of the state i ∈ S and of the time instant t. If he selects the action a ∈ A in state i ∈ S at time t, the system's state at time t+1 will be j ∈ S with probability p_a(i,j), again independent of t. He then earns an immediate (expected) reward r(i,a).
Usually, the problem is to choose the actions "as good as possible" with respect to a given optimality criterion. We will use the (expected) total reward criterion. So the problem is:
1) to determine a recipe (decision rule) according to which actions should be taken such that the (expected) total reward over an infinite time horizon is maximal;
2) to determine the total reward that may be expected if we act according to that decision rule.
Initially, the following solution techniques for Markov decision processes with a finite state space and a finite action space were available: the policy iteration algorithm developed by Howard [35], linear programming [11], [13], and approximation by standard dynamic programming [35]. A disadvantage (especially for large-scale problems) of the former two methods is that each iteration step requires a relatively large amount of computation. Furthermore, the convergence of the method of standard dynamic programming is very slow.
Hence, the construction of numerical methods for determining optimal solutions has been the subject of much research in this area. Moreover, much attention has been paid to the generalization of the model as described by Bellman and Howard. Both subjects will be studied in this monograph.
MacQueen [46] introduced an improved version of the standard dynamic programming algorithm by constructing, in each iteration step, improved upper and lower bounds for the optimal return vector. His approach yields a rather fast algorithm for solving finite state space, finite action space Markov decision processes.
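MacQueen's bound construction is easy to illustrate on a finite discounted problem. The sketch below is not taken from the monograph; it is a minimal numpy illustration, assuming a finite state and action space, transition arrays `P[a]`, reward array `r[a]`, and a discount factor `beta` (so the contraction factor is `beta`). All names are this sketch's own.

```python
import numpy as np

def value_iteration_macqueen(P, r, beta, tol=1e-8, max_iter=10000):
    """Standard dynamic programming step plus MacQueen-style bounds:
    after each iteration, the extreme components of the difference
    v_new - v give elementwise lower/upper bounds on the optimal return.
    P has shape (A, S, S); r has shape (A, S)."""
    v = np.zeros(P.shape[1])
    lo = hi = v
    for _ in range(max_iter):
        q = r + beta * np.einsum('aij,j->ai', P, v)  # Q-values, one row per action
        v_new = q.max(axis=0)                        # dynamic programming step
        d = v_new - v
        lo = v_new + beta / (1.0 - beta) * d.min()   # lower bound on the optimal return
        hi = v_new + beta / (1.0 - beta) * d.max()   # upper bound on the optimal return
        v = v_new
        if (hi - lo).max() < tol:                    # bounds nearly touch: stop early
            break
    return lo, hi
```

The gap between the two bounds contracts geometrically, which is what makes this variant noticeably faster in practice than plain successive approximation with a fixed-iteration-count stopping rule.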
Modifications of the aforementioned optimization procedures have been given by e.g. Hastings [25], [26], who proposed a Gauss-Seidel-like technique, and Reetz [60], who based his optimization procedure on an overrelaxation idea. Modifications have also been given by Porteus [59], Wessels [74], van Nunen and Wessels [55], and van Nunen [53], [54].
By and by several extensions of the original model were presented. The finite state space and finite action space restriction was dropped. For example, Maitra [48] and Derman [14] studied Markov decision processes with a countable state space and finite action space, whereas Blackwell [5] and Denardo [12] already investigated Markov decision processes with a general state and action space.
The restriction of equidistant decision points was dropped as well; see Jewell [37], [38].
As a remaining restriction, however, a bounded reward structure is assumed in the above articles.
This restriction has been lifted recently; see Lippman [44], [45], Harrison [24], Wessels [75], and Hinderer [31].
In this monograph we will investigate Markov decision processes on a countably infinite or finite state space and with a general action space. Furthermore, we allow for an unbounded reward structure: we do not require boundedness with respect to the supremum norm. We assume the existence of a function b: S → ℝ and a positive function μ: S → ℝ⁺ := {x ∈ ℝ | x > 0} such that

∀ i∈S ∀ a∈A(i):  |r(i,a) − b(i)| ≤ μ(i) .

The function μ will be used to construct a weighting function, or a bounding function. Moreover, we assume

∃ 0<ρ<1 ∀ i∈S ∀ a∈A(i):  Σ_{j∈S} p_a(i,j) μ(j) ≤ ρ μ(i) .

In order to guarantee the existence of the total expected reward we assume the function b on S to be a charge with respect to the transition probability structure (see section 4.1).
These assumptions on the reward and transition probability structure arise in fact from a combination and a slight extension of the conditions as proposed by Wessels [75] and Harrison [24]. As will be shown in the final chapter, the assumptions allow e.g. the investigation of a large class of discounted Markov and semi-Markov decision processes. Lippman's assumptions [45] are covered as well; see van Nunen and Wessels [56].
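On a finite instance the two displayed conditions can be checked mechanically. The helper below is only an illustration of those conditions, not part of the monograph; the array layout (`P[a]` substochastic, with any discounting already folded in, and `r[a]` the rewards) and the function name are assumptions of this sketch.

```python
import numpy as np

def check_contracting_assumptions(P, r, b, mu, rho, eps=1e-12):
    """Check, for all states i and actions a:
      (1) |r(i,a) - b(i)| <= mu(i)
      (2) sum_j p_a(i,j) mu(j) <= rho * mu(i), with 0 < rho < 1.
    Shapes: P (A, S, S), r (A, S), b (S,), mu (S,) nonnegative."""
    cond1 = bool(np.all(np.abs(r - b[None, :]) <= mu[None, :] + eps))
    cond2 = bool(np.all(P @ mu <= rho * mu[None, :] + eps)) and 0.0 < rho < 1.0
    return cond1 and cond2
```

For an ordinary β-discounted problem with bounded rewards one may take b = 0, μ ≡ 1 and ρ = β, which is the simplest way these conditions are satisfied.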
We will develop a set of optimization procedures for solving Markov decision problems, satisfying the described conditions, with respect to the total reward criterion. This will be done by using the concept of stopping time (see also Wessels [74]), which results in a unifying approach. This set of methods includes the procedures for finite state, finite action space Markov decision processes as proposed by Howard [35], Reetz [60], Hastings [25], and MacQueen [46]. A main role in our approach will be played by the theory of monotone contraction mappings defined on a complete metric space of functions on S. This space will be denoted by V.
The concept of stopping time will be used to define a set of contraction mappings on V. Given a decision rule and given the starting state i ∈ S we may define the stochastic process {s_t | t = 0,1,...}, where s_t denotes the state of the system at time t. Roughly speaking, a stopping time is a recipe for terminating the stochastic process {s_t | t ≥ 0}. For each stopping time (denoted by δ) we define the mapping U_δ of V by defining (U_δ v)(i) as the supremum over all decision rules of the expected total reward until the process is stopped according to the stopping time δ, given that the process starts in state i, where a terminal reward v(j) is earned if the process is stopped in state j ∈ S. U_δ is proved to be a monotone contractive mapping on the complete metric space V.
For stopping times that are nonzero (see section 2.2), U_δ will be strictly contracting, its fixed point being equal to the requested optimal expected total reward over an infinite time horizon (denoted by v*). Hence, the fixed point is independent of the chosen nonzero stopping time. This implies that for each nonzero stopping time δ, v* may be approximated successively by a sequence v_n := U_δ v_{n−1}, starting from any v_0 ∈ V. So for each nonzero stopping time δ we have v_n → v*.
These results may be formulated alternatively as follows: for each nonzero stopping time δ, v* is the unique solution of the optimality equation U_δ v = v. The class of described methods may be extended.
For a special class of stopping times, which we call transition memoryless stopping times (see section 2.2), the mappings U_δ produce the opportunity of determining in each iteration step a decision rule of a special type for which the supremum by applying U_δ is attained or approximately attained (see chapter 5). Such a decision rule will be called a stationary Markov strategy (denoted by f^∞). We define the mapping L_δ^f of V in a similar way as we have defined U_δ, with the difference that the expected reward by applying the stationary Markov strategy f^∞ is computed instead of the supremum over all decision rules.
For transition memoryless nonzero stopping times we define for each λ ∈ ℕ = {1,2,...} a mapping U_δ^(λ) of V. If the supremum in computing U_δ v is attained for a stationary Markov strategy f^∞, then U_δ^(λ) v is defined by

U_δ^(λ) v := (L_δ^f)^λ v , with λ ∈ ℕ ,  and  U_δ^(∞) v := lim_{n→∞} (L_δ^f)^n v .

If this supremum is not attained, U_δ^(λ) is defined by using a Markov strategy for which the supremum is approximated (see chapter 6). U_δ^(λ) is neither necessarily contracting nor monotone. However, the monotone contraction property of the mappings U_δ and L_δ^f enables the use of U_δ^(λ) as a basis for successive approximation methods (see chapter 6).
For each transition memoryless nonzero stopping time δ and each λ ∈ ℕ ∪ {∞}, a sequence v_n^{δλ} defined by

v_n^{δλ} := U_δ^(λ) v_{n−1}^{δλ}

converges to v*. Here f_n is chosen such that L_δ^{f_n} v_{n−1}^{δλ} approximates U_δ v_{n−1}^{δλ} sufficiently well. So each pair (δ,λ) yields a successive approximation of v*. Moreover, the stationary Markov strategy f_n^∞ found in the n-th iteration of such a procedure becomes (ε-)optimal for n sufficiently large.
The vectors v_{n−1}^{δλ} and v_n^{δλ} enable us to construct upper and lower bounds for the optimal return v*. In addition, the availability of upper and lower bounds allows the incorporation of a suboptimality test. The use of upper and lower bounds and a suboptimality test may yield a considerable gain in computation time; see section 7.3.
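The suboptimality test can be made concrete in the discounted finite case: an action whose optimistically bounded Q-value still falls below the pessimistic bound on the optimal return can never be optimal and may be skipped in later iterations. This is a sketch under the same simplified assumptions as the earlier snippets (MacQueen-type bounds from two consecutive iterates of β-discounted value iteration), not the monograph's general test.

```python
import numpy as np

def provably_suboptimal(P, r, beta, v_prev, v):
    """Given consecutive value-iteration iterates v_prev, v of a
    beta-discounted MDP, return a boolean mask of shape (A, S):
    True means action a is provably suboptimal in state i."""
    d = v - v_prev
    lo = v + beta / (1.0 - beta) * d.min()           # lower bound on optimal return
    hi = v + beta / (1.0 - beta) * d.max()           # upper bound on optimal return
    q_hi = r + beta * np.einsum('aij,j->ai', P, hi)  # optimistic Q-value of (i, a)
    return q_hi < lo[None, :]                        # optimistic value still too small
```

Every action flagged True can be removed from the maximization in all later iterations, which is where the gain in computation time comes from.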
We conclude this introduction with a short overview of the contents of the subsequent chapters.
In the first three sections of chapter 2 some basic notions required in the sequel are presented. After the introduction of some notations (section 2.1) we discuss in sections 2.2 and 2.3 the concepts of stopping time and weighted supremum norms respectively. The final section of chapter 2 is
devoted to some properties of weighted supremum norms.
In chapter 3 we treat Markov reward processes (stochastic processes without the possibility of making decisions). In section 3.1 the Markov reward model is defined. Reward functions may be unbounded under our assumptions.
In section 3.2 the concept of stopping time is used to define the contraction mappings on the complete metric space V (introduced in section 2.3).
The study of Markov decision processes starts in chapter 4. After a description of the model (section 4.1), section 4.2 contains the introduction of decision rules and assumptions. These assumptions will be a natural extension of those in chapter 3. Under our assumptions some results about Markov decision processes will be proved (section 4.3). The final section (4.4) is again devoted to a discussion of the assumptions.
In chapter 5 the concept of stopping time is used to generate a whole set of optimization procedures based on the mappings U_δ. For each decision rule (not necessarily a stationary Markov strategy) and each stopping time δ, a contractive mapping L_δ of V will be defined and investigated (section 5.1). Next (section 5.2), the operator U_δ will be studied. Finally, we will present necessary and sufficient conditions on the stopping times under which we can restrict the attention to stationary Markov strategies only (transition memoryless stopping times).
In chapter 6 we investigate value oriented successive approximations based on the mappings U_δ^(λ). The term "value oriented" is used since in each iteration step extra effort is given to obtain better estimates for the total expected reward corresponding to the stationary Markov strategy f_n^∞.
Chapter 7 will be used to construct upper and lower bounds for the optimal reward v*. In this chapter also a suboptimality test will be introduced. In the third section of this chapter we show how our theory may be used in the special case of a Markov decision process with a finite state space and a finite action space. We indicate the relation with the existing optimization procedures.
In the brief chapter 8 we weaken the assumptions as imposed in chapter 4. This weakened version corresponds to the N-stage contraction assumption introduced by Denardo [12]. It will be proved that N-stage contraction with respect to a given bounding function implies the existence of a new bounding function satisfying the assumptions of chapter 4.
We conclude this monograph with a chapter in which we show that a number of specific Markov decision processes is covered by our theory. We will also show how a number of the existing approximation methods for certain systems of linear equations are included in our treatment of Markov reward processes (chapter 3). The final part of this chapter consists of an example. In this example we treat an inventory problem.
CHAPTER 2

PRELIMINARIES
The goal of this chapter will be the introduction of some of the notions which play an important role throughout this monograph.
First (section 2.1) we will give some notations and we will introduce the measurable spaces relevant for the stochastic processes that will be investigated.
Next (section 2.2) stopping times are introduced. We will allow for randomized stopping times. Several specific stopping times will be described.
In section 2.3 a bounding function μ is introduced. The function μ will be used to define a weighted supremum norm. In the following chapters this bounding function will appear to be one of the tools for handling Markov decision processes with an unbounded reward structure and with a transition probability structure that need not be contractive with respect to the usual supremum norm.
Using the bounding function μ a Banach space W and a complete metric space V are defined.
Finally (section 2.4), we discuss some properties of bounding functions.
2.1. Notations
As mentioned in the introductory chapter we study a system which is observed to be in one of the states from a state space S at times t = 0,1,.... We assume S to be countably infinite or finite, and represent the states by the integers, starting with zero. So if the state space is finite, S is represented by {0,1,...,N}, where N+1 is the number of states. If S is countably infinite it is represented by {0,1,2,...}. A path is a sequence of states that are subsequently visited.
NOTATIONS 2.1.1.
(i) S^k := S × S × ... × S, the k-fold Cartesian product of S, so S^1 = S. S^∞ := S × S × ... is the set of all paths.
(ii) Let α ∈ S^k with k ≥ n, k,n ∈ ℕ; then α(n) denotes the row vector of the first n components of α.
(iii) k_α is the number of components of α. So k_α is n if and only if α ∈ S^n.
(iv) The i-th component of α ∈ S^n, n ≥ i, is denoted by [α]_{i−1}.
(v) Hence α ∈ S^n may be written as α = ([α]_0, [α]_1, ..., [α]_{k_α−1}).
(vi) γ := (α,β) := ([α]_0, ..., [α]_{k_α−1}, [β]_0, ..., [β]_{k_β−1}); k_γ = k_α + k_β, where α,β ∈ G∞ with G∞ := ∪_{k=1}^∞ S^k.
(vii) The term (column) vector is used hereafter for real valued functions on S.
(viii) The term matrix is used hereafter for real valued functions on S².
(ix) The (i,j)-th element of a matrix A will be denoted by a(i,j).
(x) A^0 is the identity matrix (with diagonal entries equal to one and other entries equal to zero).
(xi) Matrix multiplication and matrix-vector multiplication are defined as usual (in all cases there will be absolute convergence).
(xii) A^n is the n-fold matrix product A × A × ... × A; the (i,j)-th entry of A^n is denoted by a^(n)(i,j).
(xiii) Let v,w be vectors; then v ≤ w if and only if v(i) ≤ w(i) for all i ∈ S; v < w if and only if v ≤ w and for at least one i ∈ S, v(i) < w(i).
Let S_0 be the σ-field of all subsets of S; then the measurable space (Ω_0, F_0) is defined to be the product space, with Ω_0 = S^∞ and F_0 the σ-algebra on Ω_0 generated by the finite products of the σ-field S_0.
In order to be able to use the concept of stopping time in an adequate way we extend the measurable space (Ω_0, F_0) to the measurable space (Ω, F). The space (Ω, F) will play a main role in the sequel. Let the set E := {0,1} and let S be the σ-field of all subsets of S × E; then the measurable space (Ω, F) is defined to be the product space with Ω := (S × E)^∞ and F the σ-algebra on Ω generated by the finite products of the σ-field S.
So Ω contains all sequences of the form ω = ((i_0,d_0),(i_1,d_1),...) with i_t ∈ S and d_t ∈ E.
REMARK 2.1.1. The state 0 will play a special role throughout this monograph. We do not exclude Markov decision processes with defective transition probabilities. Therefore, the state 0 will be used as a (fictive) absorbing state. This will give us the possibility of defining (based on the transition probabilities) adequate probability measures on (Ω_0, F_0) and (Ω, F).
2.2. Stopping times
We are now ready to introduce the notion of stopping time.
DEFINITION 2.2.1. A function δ: G∞ → [0,1] is called a (randomized) stopping time if and only if δ satisfies the following properties:

(2.2.1) δ((0)) = 1; ∀ i∈S\{0} [δ((0,i)) = 0]; ∀ α∈G∞ [[δ(α) ≠ 0] ⇒ [δ((α,0)) = 1]] ;
(2.2.2) ∀ α,β∈G∞ [δ(α) = 0 ⇒ δ((α,β)) = 0]; ∀ α∈G∞ ∀ i∈S\{0} [δ((α,0,i)) = 0] .

DEFINITION 2.2.2. The set of all (randomized) stopping times is denoted by Δ.
REMARK 2.2.1.
(i) Roughly speaking, for each α ∈ S^k we will use 1 − δ(α) as the probability that a stochastic process on S is stopped at time k − 1 in state [α]_{k−1}, given that the states [α]_0, [α]_1, ..., [α]_{k−1} have been visited successively.
(ii) From now on we will use the less formal notation δ(α,β), δ(i) instead of δ((α,β)) and δ((i)) respectively.
DEFINITION 2.2.3.
(i) δ ∈ Δ is said to be a nonrandomized stopping time if and only if ∀ α∈G∞: δ(α) ∈ {0,1}.
(ii) δ ∈ Δ is said to be memoryless if and only if δ(α) depends on α only through its last component [α]_{k_α−1}.
(iii) δ ∈ Δ is said to be an entry time if and only if δ is nonrandomized and memoryless.

DEFINITION 2.2.4. A nonempty subset G ⊂ G∞ is said to be a goahead set if and only if
(i) ∀ α,β∈G∞ [(α,β) ∈ G ⇒ α ∈ G];
(ii) (0) ∈ G;
(iii) ∀ α∈G [(α,0) ∈ G].

NOTATIONS 2.2.1.
(i) G_n is the goahead set of those sequences of G∞ for which the components [α]_i are zero for i ≥ n, if there are any. So
G_n := (∪_{k=1}^n S^k) ∪ {(α,β) | α ∈ ∪_{k=1}^n S^k, β ∈ ∪_{k=1}^∞ {0}^k} .
(ii) For a goahead set G we define G(i) by
G(i) := {α ∈ G | [α]_0 = i}, i ∈ S .
LEMMA 2.2.1. There is a one to one correspondence between nonrandomized stopping times and goahead sets.
PROOF. The characteristic function of G is a nonrandomized stopping time. □
DEFINITION 2.2.5. δ ∈ Δ is said to be a nonzero stopping time if and only if ∀ i∈S: δ(i) ≠ 0.
REMARK 2.2.2.
(i) A nonrandomized stopping time is nonzero if and only if ∀ i∈S: δ(i) = 1.
(ii) The only nonzero stopping time δ ∈ Δ which is an entry time has the following property: ∀ α∈G∞: δ(α) = 1.
DEFINITION 2.2.6. A goahead set is said to be nonzero if and only if S ⊂ G.
DEFINITION 2.2.7. Let δ_0, δ_1 ∈ Δ; then δ_0 ≤ δ_1 if and only if δ_0(α) ≤ δ_1(α) for all α ∈ G∞.
if and only ifLEMMA 2.2.2. Let Q be an index set and suppose for each q E Q,
o
E A, is aq
nonrandanized stopping time, Let
o-,o+
be defined by0-(a) :=inf 0 (a),
o+
(a) i= sup 0 (a)q£Q q qt:Q q
respectively, then
o-
ando+
are elements of 8.Consequently, if for
q
E': Q, G corresponds to the stopping timeoq
theno-+ q
and
o
correspond to the goahead sets n G , n G respectively. qE':Q q qEQ qPROOF. The proof follows by inspection.
DEFINITION 2.2.8. The nonrandomized stopping function τ: Ω → ℤ⁺ ∪ {∞} is defined by
(i) τ(ω) := min{t ∈ ℤ⁺ | d_t ≠ 0}, if such a t exists;
(ii) τ(ω) = ∞ ⟺ (d_t = 0 for all t ∈ ℤ⁺).
DEFINITION 2.2.9. A stopping time δ is said to be transition memoryless if and only if there exists a subset T ⊂ S² such that δ depends on a path only through membership of its successive transitions in T.

DEFINITION 2.2.10. A goahead set is said to be transition memoryless if and only if the corresponding nonrandomized stopping time is transition memoryless.
LEMMA 2.2.3. Memoryless stopping times are transition memoryless.
We will now give some simple examples of nonzero stopping times. The examples 2.2.1-2.2.4 are nonrandomized stopping times, so they can be expressed in terms of goahead sets. Example 2.2.5 is a simple illustration of a randomized stopping time. The examples 2.2.2-2.2.5 give transition memoryless stopping times; see also Wessels [74], and van Nunen and Wessels [55].
EXAMPLE 2.2.1. G := G_n, or in terms of stopping times: δ(α) = 1 if α ∈ G_n, else δ(α) = 0.
nEXAMPLE 2.2.2. The goahead set G
8 is defined by co i-1 := u {O}k, G 8(1) := {(i,a)
I
a e u G8(j)}, for i '/' 0 . k=l j=O EXAMPLE 2.2.3. co G := u {a E SnI
[a] j E Bi j < n} u S n=2with B c s such that O E B.
EXAMPLE 2.2.4. The goahead set G_R is defined by
G_R(i) := {α | α ∈ ∪_{k=1}^∞ {i}^k} ∪ {(α,β) | α ∈ ∪_{k=1}^∞ {i}^k, β ∈ G_R(0)} , i ∈ S\{0} .
EXAMPLE 2.2.5. δ is given by: ∀ i∈S\{0} δ(i) = ½, else δ(α) = 1.
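For nonrandomized stopping times, lemma 2.2.1 says δ is just the characteristic function of its goahead set, so the examples above become one-line predicates on finite paths. The sketch below is an illustrative transcription (paths as Python tuples, state 0 the absorbing state); the function names are this sketch's, not notation used in the monograph.

```python
def delta_G_n(path, n):
    """Example 2.2.1: go ahead on the goahead set G_n, i.e. as long as
    every component beyond the first n is the absorbing state 0."""
    return 1 if all(s == 0 for s in path[n:]) else 0

def delta_stay_in_B(path, B):
    """Example 2.2.3 (with 0 in B): go ahead while the whole path stays
    inside B; a one-component path is always allowed to go ahead."""
    return 1 if len(path) == 1 or all(s in B for s in path) else 0
```

The defining property (2.2.2) of a stopping time is visible here: once δ is 0 on a path, no extension of that path can make δ positive again.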
2.3. Weighted supremum norms

DEFINITION 2.3.1. A real valued function μ on S is said to be a bounding function if and only if
(i) μ(i) > 0, for all i ∈ S\{0};
(ii) μ(0) = 0.
DEFINITION 2.3.2. Let μ be a bounding function; then W_μ is the set of vectors w such that

(2.3.1) |w(i)| ≤ c·μ(i) for some c ∈ ℝ⁺ and all i ∈ S .

REMARK 2.3.1. Note that w(0) = 0 for each w ∈ W_μ.
DEFINITION 2.3.3. Let μ be a bounding function. Then, for each w ∈ W_μ, the μ-norm of w is defined by

‖w‖_μ := sup_{i∈S\{0}} μ(i)^{-1} |w(i)| .

LEMMA 2.3.1. The space W_μ with this μ-norm (weighted supremum norm) is a Banach space.
PROOF. The proof is straightforward. □
DEFINITION 2.3.4. Let the matrix A be a bounded linear operator in W_μ. The norm of A is defined by

‖A‖_μ := sup{‖Aw‖_μ | w ∈ W_μ, ‖w‖_μ ≤ 1} .

REMARK 2.3.2.
(i) It is easily verified that

‖A‖_μ = sup_{i∈S\{0}} μ(i)^{-1} Σ_{j∈S} |a(i,j)| μ(j) ,

and if this supremum is finite then A is a bounded linear operator.
(ii) The concept of bounding function is studied in more detail in sections 2.4, 3.3, 4.4, and 5.1.
(iii) We refer to Wessels [75], who introduced the concept of weighted supremum norms in this context, and to Hinderer [31], who used Wessels' idea of weighted supremum norms for defining bounding functions.
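For a finite state space the two formulas above translate directly into code. A minimal numpy sketch (state 0 listed first and excluded from the suprema, as in the definitions; the names are this sketch's, not the monograph's):

```python
import numpy as np

def mu_norm(w, mu):
    """||w||_mu = sup over i != 0 of mu(i)^(-1) |w(i)|; state 0 is
    excluded because mu(0) = 0 and w(0) = 0 for w in W_mu."""
    return float(np.max(np.abs(w[1:]) / mu[1:]))

def mu_matrix_norm(A, mu):
    """||A||_mu = sup over i != 0 of mu(i)^(-1) sum_j |a(i,j)| mu(j)
    (remark 2.3.2(i)); column 0 drops out automatically since mu(0) = 0."""
    return float(np.max((np.abs(A) @ mu)[1:] / mu[1:]))
```

Note how the weighting works in both directions: entries of A leading into the absorbing state 0 contribute nothing, which is exactly why a substochastic matrix can have μ-norm strictly below one even when its rows almost sum to one.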
DEFINITION 2.3.5. Let b be a vector with b(0) = 0, and let ρ ∈ [0,1). The set of vectors V_{μ,b,ρ} is defined by

V_{μ,b,ρ} := {v | (v − (1−ρ)^{-1} b) ∈ W_μ} .

REMARK 2.3.3. Note that also v(0) = 0 for v ∈ V_{μ,b,ρ}, and v_1 − v_2 ∈ W_μ for v_1, v_2 ∈ V_{μ,b,ρ}.
DEFINITION 2.3.6. The metric d_μ on V_{μ,b,ρ} is defined by

d_μ(v_1,v_2) := ‖v_1 − v_2‖_μ .

LEMMA 2.3.2. The set V_{μ,b,ρ} with the metric d_μ is a complete metric space.
Unless explicitly mentioned, we fix μ, b, and ρ for the remaining part of this monograph. Referring to these fixed μ, b and ρ we will omit the subscripts μ, b, ρ.
2.4. Some remarks on bounding functions

In this section we give some properties of a bounding function μ' with respect to the corresponding spaces W_{μ'} and V_{μ'}.
LEMMA 2.4.1. Suppose S contains a finite number of elements. Let μ_0 be a bounding function and w_n ∈ W_{μ_0} (n ≥ 0); then

[‖w_0 − w_n‖_{μ_0} → 0] ⟺ [‖w_0 − w_n‖_{μ'} → 0, for all bounding functions μ'] .

PROOF. For a proof we refer to books on numerical mathematics, see e.g. Collatz [8], Krasnosel'skii [42]. □
LEMMA 2.4.2. Suppose S contains a finite number of elements and μ_0 is a bounding function; then
(i) [‖B‖_{μ_0} < 1] ⇒ [lim_{n→∞} B^n = 0];
(ii) [lim_{n→∞} B^n = 0] ⇒ [there exists a bounding function μ' with ‖B‖_{μ'} < 1],
where B is a matrix.
PROOF. The proof of (i) follows directly from the fact that [‖B‖_{μ_0} < 1] implies lim_{n→∞} B^n = 0, where 0 is the matrix with all entries zero. The proof of (ii) can be found in e.g. Krasnosel'skii [42]. □
LEMMA 2.4.3. Suppose μ_0 is a bounding function and B is a matrix with finite μ_0-norm; then

[∃_N ‖B^N‖_{μ_0} < 1] ⇒ ∃_{μ'} [‖B‖_{μ'} < 1] .

PROOF. For a proof we refer to van Hee and Wessels [29], who proved this theorem for (sub-)Markov matrices, but their proof can easily be extended.
REMARK 2.4.1.
(i) Note that in lemma 2.4.3 S is not required to be finite.
(ii) If S is countably infinite, the linear space W may contain elements that are not bounded. If μ(i) → ∞ for i → ∞, then it is also permitted that |w(i)| → ∞.
(iii) Note that it is not required that the μ-norm of b exists. If the μ-norm of b exists then clearly W = V.
CHAPTER 3

MARKOV REWARD PROCESSES
In this chapter we restrict ourselves to Markov reward processes, so we exhibit our method for the first time in a relatively simple situation. After defining the model (section 3.1) we introduce the assumptions on the reward structure and the transition probability structure. As mentioned, we require neither the reward structure to be bounded nor the transition probabilities to be strictly defective.
Next (section 3.2) we show that each nonzero stopping time δ ∈ Δ defines a contraction mapping (L_δ) on the complete metric space V. The fixed point appears to be independent of the stopping time. It equals the total expected reward over an infinite time horizon. In the final section (3.3) we discuss the assumptions on the reward structure and the transition probabilities in relation to the bounding function μ and the function b.
3.1. The Markov reward model
We consider a system that is observed to be in one of the states of S at discrete points in time t = 0,1,.... If the system's state at time t is i ∈ S, the system's state at time t+1 will be j ∈ S with probability p(i,j), independent of the time instant t.
ASSUMPTION 3.1.1.
(i) ∀ i,j∈S: 0 ≤ p(i,j) ≤ 1;
(ii) ∀ i∈S: Σ_{j∈S} p(i,j) ≤ 1;
(iii) p(0,0) = 1.

For each i ∈ S the unique probability measure P_i on (Ω_0, F_0) is defined in the standard way, see e.g. Neveu [52], Bauer [1], by defining the probabilities of cylindrical sets.
(3.1.1) P_i({ω_0 | [ω_0]_0 = i_0, ..., [ω_0]_n = i_n}) := δ_{i i_0} p(i_0,i_1) ··· p(i_{n−1},i_n) ,

where n ∈ ℤ⁺ and δ_{ij} is the Kronecker symbol: δ_{ij} = 1 if i = j, 0 else.
Given the starting state i ∈ S and the matrix P (with entries p(i,j)) we consider the stochastic process {s_{0,n} | n ≥ 0}, where s_{0,n}(ω_0) := [ω_0]_n. So s_{0,n} is the state of the process at time n. The stochastic process {s_{0,n} | n ≥ 0} is a Markov chain with stationary transition probabilities. See e.g. Ross [62], Feller [17], Karlin [39], Cox and Miller [9], Kemeny and Snell [40].
For each stopping time δ ∈ Δ and each starting state i ∈ S we define in a similar way the unique probability measure P_{i,δ} on (Ω, F) by giving, for n ∈ ℤ⁺, the probabilities of the cylindrical sets with t_k ∈ S and d_k ∈ E. This defines for each i ∈ S and δ ∈ Δ a stochastic process {(s_n, e_n) | n ≥ 0}, where s_n(ω) := i_n, e_n(ω) := d_n. So s_n and e_n are the state and the value of the stopping indicator at time n.
REMARK 3.1.1. The stochastic process {(s_n,e_n) | n ≥ 0} is not a Markov chain, since the value δ(α) may depend on the complete history ([α]_0, [α]_1, ..., [α]_{k_α−1}) for each α ∈ G∞.
Formula (3.1.2) shows the connection between P_i and P_{i,δ}. For each ω ∈ Ω, ω = ((i_0,d_0),(i_1,d_1),...), we define ω_0 by ω_0 := (i_0,i_1,i_2,...).
For A_0 ∈ F_0 we define the set A ∈ F by

A := {ω ∈ Ω | ω_0 ∈ A_0} .
It is easily verified that for A and A_0 related in this way

(3.1.3) P_{i,δ}(A) = P_i(A_0) .

Let f_0 be a measurable function on (Ω_0, F_0). The function f on (Ω, F) is then defined such that f(ω) := f_0(ω_0). It follows from formula (3.1.3) that E_i f_0 = E_{i,δ} f, where E_i f_0, E_{i,δ} f denote the expectation of f_0 and f with respect to the probability measures P_i and P_{i,δ} respectively.
In the sequel we omit the subscript 0 in f_0. The process {s_{0,n} | n ≥ 0} will thus be denoted by {s_n | n ≥ 0}.

NOTATION 3.1.1. By Ef, E_δ f we denote the vectors with components E_i f, E_{i,δ} f respectively.
We now state the assumptions on the reward structure and the transition probabilities of the system considered in this section. Therefore, we first introduce the reward function. At each point in time a reward is earned. We assume this reward to depend on the actual state of the system only. So the reward function r is a vector.
ASSUMPTION 3.1.2. (r − b) ∈ W.
ASSUMPTION 3.1.3. Σ_{n=0}^∞ (P^n b)(i) converges for each i ∈ S.
ASSUMPTION 3.1.4. ‖P‖ < 1.
ASSUMPTION 3.1.5. (Pb − ρb) ∈ W.
REMARK 3.1. 2. (i) Since b(O}
r(O) = O.
O and µ (0)
(ii) If b
I W,
then rI W.
0, asswnption 3.1.2 implies that also
(iii) In terms of potential theory (see e.g. Hordijk [33]) the second as-swnption states that b is a charge with respect to P, which implies the existence of the total expected reward over an infinite time ho-rizon.
(iv) Assumption 3.1.4 means that the transition probabilities are such
that the expectation of µ (~
1) , with respect to Pi is at most llP H• U (i) •
This implies that the process has a tendency to decrease its J.!-value. (v) The final asswnption states that, given the startinq state i0 €
s,
the difference between the expected one-stage return and Pr (i
0) lies between -MJJ (i ) and Mµ (i ) for some M € lR + and all i €
s.
Q 0 0
(vi) Note that if b €
W
the assumptions 3.1.2-3.1.5 may be replaced by '(a) r € W,(b) l!Plf < 1.
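As a concrete illustration of the $\mu$-norm conditions, they can be checked numerically on a small chain. The transition matrix, bounding function and reward vector below are toy assumptions for the sketch, not data from the text; state 0 is absorbing with $\mu(0) = 0$ and $r(0) = 0$, as required.

```python
import numpy as np

# Toy chain on S = {0,1,2,3}; state 0 absorbs (illustrative assumption).
P = np.array([
    [1.0, 0.0, 0.0, 0.0],   # absorbing state 0
    [0.5, 0.3, 0.2, 0.0],
    [0.5, 0.1, 0.2, 0.2],
    [0.6, 0.1, 0.1, 0.2],
])
mu = np.array([0.0, 1.0, 1.0, 1.0])    # bounding function, mu(i) > 0 for i != 0
r  = np.array([0.0, 1.0, -2.0, 0.5])   # reward vector, r(0) = 0

def mu_norm_vec(v, mu):
    """mu-norm of a vector: sup over {i : mu(i) > 0} of |v(i)| / mu(i)."""
    idx = mu > 0
    return np.max(np.abs(v[idx]) / mu[idx])

def mu_norm_mat(P, mu):
    """mu-norm of a matrix: sup over {i : mu(i) > 0} of (P mu)(i) / mu(i)."""
    idx = mu > 0
    return np.max((P @ mu)[idx] / mu[idx])

norm_P = mu_norm_mat(P, mu)
print("||P|| =", norm_P, "-> assumption 3.1.4 holds:", norm_P < 1)
print("||r|| =", mu_norm_vec(r, mu))
```

Here the absorbing row of $P$ does not enter the norm because $\mu(0) = 0$; the remaining rows each leak enough mass to state 0 to make $\|P\| < 1$.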
LEMMA 3.1.1. $(Pr - \rho b) \in W$.

PROOF. $\|Pr - \rho b\| \le \|Pb - \rho b\| + 2\|r - b\| =: M_1$, which is finite according to the assumptions 3.1.2-3.1.5. $\square$
LEMMA 3.1.2. For $M_1$ as defined in the proof of lemma 3.1.1, we have

$$\|P^n r - \rho^n b\| \le n M_1 \rho_0^{\,n-1} , \qquad n = 1,2,\dots ,$$

with $\rho_0 := \max\{\|P\|, \rho\}$.

PROOF. The proof proceeds inductively. The statement is true for $n = 1$. Suppose it is true for arbitrary $n \ge 1$. Using the assumptions 3.1.2-3.1.5 we then have

$$\|P^{n+1} r - \rho^{n+1} b\| \le \|P\| \cdot \|P^n r - \rho^n b\| + \rho^n \|Pb - \rho b\| \le (n+1) M_1 \rho_0^{\,n} . \qquad \square$$

LEMMA 3.1.3. For all $i \in S$,

$$\lim_{n\to\infty} (P^n b)(i) = 0 , \qquad \lim_{n\to\infty} (P^n r)(i) = 0 .$$

PROOF. The proof is a direct consequence of assumption 3.1.3 and the foregoing lemma. $\square$
For each $n \ge 1$ we define the vector $V_n$ by

(3.1.4) $\qquad V_n := E \sum_{k=0}^{n-1} r(s_k) .$

LEMMA 3.1.4. $V_n = \sum_{k=0}^{n-1} P^k r .$

PROOF. The proof follows by inspection. $\square$

Clearly $V_n(i)$ represents the total expected reward over $n$ time periods when the initial state is $i \in S$.

THEOREM 3.1.1. $\sum_{n=0}^{\infty} P^n r$ converges.

PROOF. The convergence of $\sum_{n=0}^{\infty} P^n r$ follows from assumptions 3.1.3 and 3.1.2, since

$$\sum_{n=0}^{\infty} P^n r = \sum_{n=0}^{\infty} P^n b + \sum_{n=0}^{\infty} P^n (r - b) . \qquad \square$$

DEFINITION 3.1.1. The total expected reward vector $V$ is defined by

$$V := \sum_{n=0}^{\infty} P^n r .$$
3.2. Contraction mappings and stopping times
LEMMA 3.2.1. Let $v \in V$; then
(i) $P^n|v|$ exists for all $n \in \mathbb{N}$,
(ii) $\lim_{n\to\infty} (P^n|v|)(i) = 0$, $i \in S$.

PROOF. $v \in V$ implies that $v$ can be written as $v = (1-\rho)^{-1} b + w$ where $w \in W$. So $P^n|v| \le (1-\rho)^{-1} P^n|b| + P^n|w|$, which is defined. Moreover, since $\sum_{n=0}^{\infty} P^n|b|$ and $\sum_{n=0}^{\infty} P^n|w|$ exist, we find part (ii) of the lemma. $\square$
DEFINITION 3.2.1. The mapping $L_1$ of $V$ is defined by

$$L_1 v := r + Pv , \qquad v \in V .$$

LEMMA 3.2.2.
(i) $L_1$ maps $V$ into $V$.
(ii) $L_1$ is a monotone mapping.
(iii) The set $\{v \in V \mid \|v - (1-\rho)^{-1}b\| \le M_1 (1-\rho_0)^{-2}\}$ is mapped into itself by $L_1$.
(iv) $L_1$ is strictly contracting with contraction radius $\|P\|$.
(v) The unique fixed point of $L_1$ is $V$.

PROOF. (i) $L_1 v = r + Pv = r + P((1-\rho)^{-1} b + w)$, with $w \in W$. So

$$\|L_1 v - (1-\rho)^{-1} b\| = \|r + (1-\rho)^{-1} Pb + Pw - (1-\rho)^{-1} b\|$$
$$\le \|(1-\rho)^{-1}(Pb - \rho b)\| + \|Pw\| + \|r - b\| \le (1-\rho)^{-1}\|Pb - \rho b\| + \|P\| \cdot \|w\| + \|r - b\| < \infty .$$

The proof of part (ii) is trivial.

(iii) Let $v \in \{v \in V \mid \|v - (1-\rho)^{-1}b\| \le M_1(1-\rho_0)^{-2}\}$; then

$$\|L_1 v - (1-\rho)^{-1} b\| = \|r + Pv - (1-\rho)^{-1} b\| \le \|P(v - (1-\rho)^{-1}b)\| + \|(1-\rho)^{-1}(Pb - \rho b) + (r - b)\|$$
$$\le \rho_0 M_1 (1-\rho_0)^{-2} + (1-\rho_0)^{-1} M_1 = M_1 (1-\rho_0)^{-2} .$$

(iv) Let $v_1, v_2 \in V$; then $v_1, v_2$ can be given by $v_1 = (1-\rho)^{-1} b + w_1$ and $v_2 = (1-\rho)^{-1} b + w_2$ respectively, where $w_1, w_2 \in W$. So $v_1 - v_2 = w_1 - w_2$, thus

$$\|L_1 v_1 - L_1 v_2\| = \|P(v_1 - v_2)\| \le \|P\| \cdot \|v_1 - v_2\| .$$

By choosing $v_1$ and $v_2$ such that $w_1 = \mu$ and $w_2 = 0$ the equality is attained.

(v) The last part of the lemma follows directly from

$$L_1 V = r + P\Big(\sum_{n=0}^{\infty} P^n r\Big) = r + \sum_{n=1}^{\infty} P^n r = V ,$$

where the interchange of summation is justified since $\sum_{n=0}^{\infty} P^n |r| < \infty$. $\square$
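Lemma 3.2.2 is the basis of successive approximation: since $L_1$ is a $\|P\|$-contraction on $V$ with fixed point $V$, iterating $v_n := r + Pv_{n-1}$ converges geometrically. A minimal numerical sketch on an illustrative toy chain (state 0 absorbing, $r(0) = 0$; all data are assumptions for the example, not from the text):

```python
import numpy as np

# Illustrative absorbing chain: state 0 absorbs and r(0) = 0.
P = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.3, 0.2, 0.0],
    [0.5, 0.1, 0.2, 0.2],
    [0.6, 0.1, 0.1, 0.2],
])
r = np.array([0.0, 1.0, -2.0, 0.5])

def L1(v):
    """The mapping L1 v := r + P v of lemma 3.2.2."""
    return r + P @ v

# Successive approximation v_n = L1(v_{n-1}), starting from v_0 = 0.
v = np.zeros(4)
for _ in range(200):
    v = L1(v)

# Reference fixed point: V(0) = 0, and (I - P)V = r on the other states.
V = np.zeros(4)
V[1:] = np.linalg.solve(np.eye(3) - P[1:, 1:], r[1:])

print(np.max(np.abs(v - V)))   # residual after 200 iterations: essentially zero
```

The error contracts by at least the factor $\|P\|$ per step, so a fixed iteration budget gives a computable a priori accuracy guarantee.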
Now we return to the concept of stopping time. Note that the stopping function $\tau$ on $\Omega$ is a random variable, so we can define the random variable $s_\tau$ by

(3.2.1) $\qquad s_\tau := s_n$ if $\tau = n$; $\qquad s_\tau := 0$ if $\tau = \infty$.

Given the starting state $i \in S$ and the transition probabilities, the distribution of $\tau$ is uniquely determined by the choice of $\delta$
$\in \Delta$.

LEMMA 3.2.3. Let $\delta \in \Delta$ be a nonzero stopping time; then $P_{i,\delta}(\tau \ge 1) > 0$ for each $i \in S$.

PROOF. The proof follows directly from the definition of $\tau$ and the definition of nonzero stopping time. $\square$
DEFINITION 3.2.2. Let $\delta \in \Delta$; the mapping $L_\delta$ of $V$ is defined componentwise by

$$(L_\delta v)(i) := E_{i,\delta}\Big[\sum_{k=0}^{\tau-1} r(s_k) + v(s_\tau)\Big] , \qquad i \in S .$$

REMARK 3.2.1. Note that as a consequence of the definitions of $\delta$ and $V$, $(L_\delta v)(0) = 0$ for all $v \in V$.

EXAMPLE 3.2.1. Let $\delta \in \Delta$ be the nonrandomized stopping time that corresponds to the go-ahead set $G_1$ and let $v \in V$; then

$$(L_\delta v)(i) = r(i) + \sum_{j \in S} p(i,j)\,v(j) .$$

EXAMPLE 3.2.2. Let $\delta \in \Delta$ be the stopping time that corresponds to the go-ahead set $G_H$ and $v \in V$; then

$$(L_\delta v)(i) = r(i) + \sum_{j < i} p(i,j)(L_\delta v)(j) + \sum_{j \ge i} p(i,j)\,v(j) .$$

EXAMPLE 3.2.3. Let $\delta \in \Delta$ be the stopping time that corresponds to the go-ahead set $G_R$; then

$$(L_\delta v)(i) = (1 - p(i,i))^{-1} r(i) + (1 - p(i,i))^{-1} \sum_{j \ne i} p(i,j)\,v(j) , \qquad i \ne 0 .$$

DEFINITION 3.2.3. The matrix $P_\delta$ is defined to be the matrix with $(i,j)$-th element $p_\delta(i,j)$ equal to
$$p_\delta(i,j) := \sum_{n=0}^{\infty} P_{i,\delta}(s_n = j,\ \tau = n) .$$

LEMMA 3.2.4. Let $\delta \in \Delta$ be a nonzero stopping time; then

$$\rho_\delta := \|P_\delta\| \le \Big(1 - \inf_{i \in S} \delta(i)\Big) + \Big(\inf_{i \in S} \delta(i)\Big)\|P\| < 1 .$$
PROOF. First note that $\|P_\delta\|$ is finite, since $\|P_\delta\| \le \sum_{n=0}^{\infty} \|P^n\| < \infty$ (assumption 3.1.4).

For $\delta \in \Delta$ we define the stopping time $\delta_M \in \Delta$ by

$$\delta_M(\alpha) := 0 \ \text{ if } \alpha \in \bigcup_{k=M+1}^{\infty} (S\setminus\{0\})^k , \qquad \delta_M(\alpha) := \delta(\alpha) \ \text{ else.}$$

Now, since

$$\big|\, \|P_\delta\| - \|P_{\delta_M}\| \,\big| \le \sum_{n=M}^{\infty} \|P^n\| ,$$

it suffices to prove the lemma for stopping times $\delta_M$. This will be done by induction with respect to $M$.

Let $\Delta_n \subset \Delta$ be the set of stopping times with

$$\Delta_n := \Big\{\delta \in \Delta \ \Big|\ \delta(\alpha) = 0 \text{ for all } \alpha \in \bigcup_{k=n+1}^{\infty} (S\setminus\{0\})^k \Big\} .$$

So $\Delta_0$ only contains the stopping times with $\delta(i) = 0$ for all $i \in S\setminus\{0\}$. For $\delta \in \Delta_0$ we have $\|P_\delta\| = 1$. Suppose $\delta \in \Delta_1$ is a nonzero stopping time; then

$$\sum_{j \in S} p_\delta(i,j)\mu(j) = (1 - \delta(i))\mu(i) + \delta(i) \sum_{j \in S} p(i,j)\mu(j) .$$

Since $\delta \in \Delta_1$ is supposed to be a nonzero stopping time, there exists a number $\varepsilon > 0$ such that $\delta(i) > \varepsilon$ for all $i \in S$, which implies $\|P_\delta\| = \rho_\delta < 1$.

Now we state the induction hypothesis: suppose for arbitrary $n \ge 1$

$$\|P_\delta\| \le 1 \ \text{ on } \Delta_n \quad \text{and} \quad \|P_\delta\| < 1 \ \text{ if } \delta \in \Delta_n \text{ is nonzero.}$$

Let $\delta \in \Delta_{n+1}$; we define $\delta^i(\alpha) := \delta(i,\alpha)$ for $i \in S$ and $\alpha \in \bar G_\infty$. It is easily verified that $\delta^i \in \Delta_n$. Now for each $i \in S$ we have

$$\sum_{j\in S} p_\delta(i,j)\mu(j) = \sum_{j\in S}\sum_{m=0}^{n+1} P_{i,\delta}(s_m = j,\ \tau = m)\,\mu(j)$$
$$= \sum_{j\in S}\Big[P_{i,\delta}(s_0 = j,\ \tau = 0) + \sum_{m=1}^{n+1} P_{i,\delta}(s_m = j,\ \tau = m)\Big]\mu(j)$$
$$= (1 - \delta(i))\mu(i) + \delta(i)\sum_{k\in S} p(i,k)\sum_{j\in S}\sum_{m=0}^{n} P_{k,\delta^k}(s_m = j,\ \tau = m)\,\mu(j)$$
$$\le (1 - \delta(i))\mu(i) + \delta(i)\sum_{k\in S} p(i,k)\mu(k)$$
$$\le \big((1 - \delta(i)) + \delta(i)\|P\|\big)\mu(i) .$$

So if $\delta \in \Delta_{n+1}$ is a nonzero stopping time, then $\|P_\delta\| \le (1 - \inf_i \delta(i)) + (\inf_i \delta(i))\|P\| < 1$. $\square$

LEMMA 3.2.5. Let $\delta \in \Delta$; then $L_\delta V = V$.
PROOF. We first mention a property of Markov chains:

$$E_j\, r(s_k) = E_i\big[r(s_{n+k}) \mid s_n = j\big] \quad \text{if } P_i(s_n = j) > 0 , \qquad k,n \in \mathbb{N} .$$

Consider for each $i \in S$

$$(L_\delta V)(i) = E_{i,\delta}\Big[\sum_{k=0}^{\tau-1} r(s_k) + V(s_\tau)\Big]$$
$$= E_{i,\delta}\Big[\sum_{k=0}^{\tau-1} r(s_k)\Big] + \sum_{n=0}^{\infty}\sum_{j\in S} P_{i,\delta}(s_n = j,\ \tau = n)\,V(j)$$
$$= E_{i,\delta}\Big[\sum_{k=0}^{\tau-1} r(s_k)\Big] + \sum_{n=0}^{\infty}\sum_{j\in S} P_{i,\delta}(s_n = j,\ \tau = n)\sum_{k=0}^{\infty} E_j\, r(s_k)$$
$$= E_{i,\delta}\Big[\sum_{k=0}^{\tau-1} r(s_k)\Big] + \sum_{n=0}^{\infty}\sum_{j\in S}\sum_{k=0}^{\infty} P_{i,\delta}(s_n = j,\ \tau = n)\, E_i\big[r(s_{n+k}) \mid s_n = j\big]$$
$$= E_{i,\delta}\Big[\sum_{k=0}^{\tau-1} r(s_k)\Big] + \sum_{k=0}^{\infty} E_{i,\delta}\, r(s_{\tau+k}) = E_{i,\delta}\sum_{k=0}^{\infty} r(s_k) = V(i) ,$$

where the interchange of summations is justified by the fact that $\sum_{n=0}^{\infty} E_i|r(s_n)| < \infty$. $\square$

LEMMA 3.2.6. Let $\delta \in \Delta$; then
(i) $L_\delta$ maps $V$ into $V$.
(ii) $L_\delta$ is a monotone mapping.
(iii) $L_\delta$ is strictly contracting if and only if $\delta$ is nonzero.
(iv) The contraction radius of $L_\delta$ equals $\rho_\delta = \|P_\delta\|$.
(v) The set $\{v \in V \mid \|v - (1-\rho)^{-1}b\| \le (1-\rho_0)^{-2} M_1'\}$ is mapped into itself by $L_\delta$, where $M_1' := \dfrac{2M_1}{1-\rho_\delta}$ and $M_1$ is defined as in the proof of lemma 3.1.1 by $M_1 := \|Pb - \rho b\| + 2\|r - b\|$.

PROOF. The proof of (i) follows from lemma 3.2.5 and theorem 3.1.1, since each $v \in V$ may be written as $v = V + w$ with $w \in W$. Now

$$L_\delta v = L_\delta(V + w) = L_\delta V + P_\delta w = V + P_\delta w \in V .$$

The monotonicity of $L_\delta$ is trivial.

To prove (iii) we first note that $v_1, v_2 \in V$ imply that $(v_1 - v_2)$ and $(L_\delta v_1 - L_\delta v_2)$ are elements of $W$. Moreover $v_1, v_2$ may be given by $v_1 = V + w_1$, $v_2 = V + w_2$ with $w_1, w_2 \in W$. So

$$\|L_\delta v_1 - L_\delta v_2\| = \|P_\delta(w_1 - w_2)\| \le \|P_\delta\| \cdot \|v_1 - v_2\| .$$

That $L_\delta$ is strictly contracting if and only if $\delta$ is nonzero now follows from lemma 3.2.4. The contraction radius equals $\|P_\delta\|$, as is verified by choosing $v_1 = V + \mu$ and $v_2 = V$.

The last assertion follows from $\|V - (1-\rho)^{-1} b\| \le M_1 (1-\rho_0)^{-2}$, so each $v \in \{v \in V \mid \|v - (1-\rho)^{-1}b\| \le (1-\rho_0)^{-2} M_1'\}$ may be written as $v = V + w$, where the $\mu$-norm of $w \in W$ is at most $(M_1 + M_1')(1-\rho_0)^{-2}$. Now

$$\|L_\delta v - (1-\rho)^{-1} b\| = \|L_\delta(V+w) - (1-\rho)^{-1} b\| \le \|V - (1-\rho)^{-1} b\| + \|P_\delta\| \cdot \|w\|$$
$$\le M_1(1-\rho_0)^{-2} + \rho_\delta (M_1 + M_1')(1-\rho_0)^{-2} \le M_1'(1-\rho_0)^{-2} . \qquad \square$$
THEOREM 3.2.1. For any nonzero stopping time $\delta \in \Delta$ the mapping $L_\delta$ has the unique fixed point $V$ (independent of $\delta$).

PROOF. The proof follows directly from the fact that $V$ is a complete metric space, together with the lemmas 3.2.5 and 3.2.6. $\square$
LEMMA 3.2.7.
(i) Let $\delta_1, \delta_2 \in \Delta$ be nonzero stopping times with $\delta_1 \le \delta_2$; then $\rho_{\delta_2} \le \rho_{\delta_1}$.
(ii) Suppose $\delta_1, \delta_2$ are nonrandomized nonzero stopping times and $G_1, G_2$ the go-ahead sets corresponding to $\delta_1$ and $\delta_2$; then

$$G_1 \subset G_2 \ \Rightarrow\ \delta_1 \le \delta_2 \ \text{ and thus } \ \rho_{\delta_2} \le \rho_{\delta_1} .$$

(iii) Let $Q$ be a set of indices and let $\delta_q \in \Delta$, $q \in Q$, be nonzero and nonrandomized with go-ahead sets $G_q$. If $\delta^+$ and $\delta^-$ denote the stopping times corresponding to the go-ahead sets $\bigcup_{q\in Q} G_q$ and $\bigcap_{q\in Q} G_q$ respectively, then

$$\rho_{\delta^+} \le \sup_{q\in Q} \rho_{\delta_q} \quad \text{and} \quad \rho_{\delta^-} \ge \inf_{q\in Q} \rho_{\delta_q} .$$
LEMMA 3.2.8. Let $\delta \in \Delta$ be a nonzero stopping time and $v_0^\delta \in V$; then
(i) $v_n^\delta := L_\delta(v_{n-1}^\delta) \to V$ (in $\mu$-norm);
(ii) $L_\delta v_0^\delta \le v_0^\delta \ \Rightarrow\ v_n^\delta \downarrow V$ (in $\mu$-norm);
(iii) $L_\delta v_0^\delta \ge v_0^\delta \ \Rightarrow\ v_n^\delta \uparrow V$ (in $\mu$-norm);
where the monotone convergence is component-wise.

PROOF. The proof of (i) is a direct consequence of theorem 3.2.1, whereas parts (ii) and (iii) follow from the monotonicity of the mapping $L_\delta$ and theorem 3.2.1. $\square$
So the determination of the total expected reward over an infinite time horizon, $V$, may be done by successive approximation of $V$ by $v_n^\delta$, with arbitrary nonzero stopping time $\delta$ and arbitrary element $v_0^\delta$ of $V$.

In particular, if $\delta_1$, $\delta_H$, $\delta_R$ are the nonrandomized nonzero stopping times corresponding to the go-ahead sets $G_1$, $G_H$, $G_R$ respectively, the following schemes result:

(i) $v_n^{\delta_1} := r + P v_{n-1}^{\delta_1}$, with $v_0^{\delta_1} \in V$;

(ii) $v_n^{\delta_H}$ is component-wise defined by

$$v_n^{\delta_H}(i) := r(i) + \sum_{j<i} p(i,j)\, v_n^{\delta_H}(j) + \sum_{j\ge i} p(i,j)\, v_{n-1}^{\delta_H}(j) , \qquad i \in S ,$$

with $v_0^{\delta_H} \in V$;

(iii) $v_n^{\delta_R}$ is component-wise defined by

$$v_n^{\delta_R}(i) := (1-p(i,i))^{-1} r(i) + (1-p(i,i))^{-1} \sum_{j\ne i} p(i,j)\, v_{n-1}^{\delta_R}(j) , \qquad i \in S ,\ i \ne 0 ,$$

with $v_0^{\delta_R} \in V$.

Furthermore, if in each of these approximations $v_0^\delta$ is chosen as required in lemma 3.2.8(ii) or (iii), the convergence will be monotone.

The following lemmas will clarify some of the relations between the bounding function $\mu$ and the function $b$ (recall remark 3.1.2(vi)).
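The three approximation schemes (i)-(iii) above correspond, in numerical-analysis terms, to the standard (pre-Jacobi), Gauss-Seidel and Jacobi iterations. A minimal sketch on an illustrative toy chain (all data are assumptions for the example, not from the text) showing that all three converge to the same fixed point $V$:

```python
import numpy as np

# Illustrative absorbing chain: state 0 absorbs, r(0) = 0, p(i,i) < 1 for i != 0.
P = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.3, 0.2, 0.0],
    [0.5, 0.1, 0.2, 0.2],
    [0.6, 0.1, 0.1, 0.2],
])
r = np.array([0.0, 1.0, -2.0, 0.5])
n_states = 4

def step_standard(v_prev):
    """Scheme (i): v_n = r + P v_{n-1} (go-ahead set G_1)."""
    return r + P @ v_prev

def step_gauss_seidel(v_prev):
    """Scheme (ii): reuse already-updated components j < i (go-ahead set G_H)."""
    v = v_prev.copy()
    for i in range(1, n_states):                 # v(0) stays 0
        v[i] = r[i] + P[i, :i] @ v[:i] + P[i, i:] @ v_prev[i:]
    return v

def step_jacobi(v_prev):
    """Scheme (iii): the diagonal is divided out (go-ahead set G_R)."""
    v = v_prev.copy()
    for i in range(1, n_states):
        off_diag = P[i] @ v_prev - P[i, i] * v_prev[i]
        v[i] = (r[i] + off_diag) / (1.0 - P[i, i])
    return v

# Reference fixed point V (V(0) = 0).
V = np.zeros(n_states)
V[1:] = np.linalg.solve(np.eye(n_states - 1) - P[1:, 1:], r[1:])

for step in (step_standard, step_gauss_seidel, step_jacobi):
    v = np.zeros(n_states)
    for _ in range(200):
        v = step(v)
    print(step.__name__, np.max(np.abs(v - V)) < 1e-8)
```

All three iterations share the fixed point $V$; by lemma 3.2.7, a larger go-ahead set typically yields a smaller contraction radius and hence faster convergence.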
LEMMA 3.3.1. Suppose $\exists M' \in \mathbb{R}_+\ \forall i \in S\ [b(i) \ge -M'\mu(i)]$; then there exist a $\rho'$, $0 < \rho' < 1$, and a bounding function $\mu'$ such that

(3.3.1) $\qquad \|P\|_{\mu'} \le \rho' < 1$,

(3.3.2) $\qquad \|r\|_{\mu'} < \infty$.

PROOF. Choose $M_3 := \max\{2\|Pb - \rho b\|(1-\rho)^{-1},\ 2M'\}$, $\rho' := \rho_0 + \tfrac12(1-\rho_0)$ and $\mu'(i) := b(i) + M_3\mu(i)$; then clearly $\mu'$ is a bounding function. Now

$$\|r\|_{\mu'} = \sup_{i\in S\setminus\{0\}} |r(i)|\,\big(b(i) + M_3\mu(i)\big)^{-1}$$
$$\le \sup_{i\in S\setminus\{0\}} |b(i)|\,\big(b(i) + M_3\mu(i)\big)^{-1} + \sup_{i\in S\setminus\{0\}} |r(i) - b(i)|\,\big(b(i) + M_3\mu(i)\big)^{-1} < \infty .$$

Furthermore one verifies $P\mu' = Pb + M_3 P\mu \le \rho'\mu'$. $\square$
In a similar way the following lemma can be proved.

LEMMA 3.3.2. Suppose $\exists M' \in \mathbb{R}_+\ \forall i \in S\ [b(i) \le M'\mu(i)]$; then there exist a $\rho'$, $0 < \rho' < 1$, and a bounding function $\mu'$ such that (3.3.1) and (3.3.2) are satisfied. $\square$

The latter two lemmas express that if the reward function is bounded from one side (with respect to the weighting factors $\mu(i)$), a new bounding function $\mu'$ can be defined such that the Banach space $W_{\mu'}$ contains $b$, $r$, $V_n$ and $V$. However, for the existence of such a bounding function $\mu'$, the condition that $b$ is bounded from one side (with respect to $\mu$) is essential, as is illustrated by the following example.

EXAMPLE 3.3.1. Let $S := \{0,1,2,\dots\}$, $r(0) = 0$, $p(0,0) = 1$, and $p(i,0) := 1 - p$ for all $i \in S\setminus\{0\}$. Let $i_0 := \min\{i \in \mathbb{N} \mid \dots > p\}$; if $i_0$ is even then we redefine $i_0 := i_0 + 1$. For all $0 < i < i_0$ the rewards and transition probabilities are given by $r(i) = 0$, $p(i,i) = p$, $p(i,j) = 0$ for $j \ne i$ and $j \ne 0$. For $i > \tfrac12 i_0$ we choose

$$p(2i,2i+2) = p(2i-1,2i+1) = p(1 - a_i) , \qquad p(2i,2i+1) = p(2i-1,2i+2) = p\,a_i ,$$
$$r(2i) = i^{-2} p^{-i} , \qquad r(2i+1) = -(i+1)^{-2} p^{-(i+1)} ,$$
$$p(2i-1,j) = p(2i,j) = 0 \quad \text{otherwise.}$$

We clearly have $\|P\|_\mu = p$. Moreover it is easily verified that

$$\sum_{n=0}^{\infty}\sum_{j\in S} P^{(n)}(2i,j)\,|r(j)| \le p^{-i}\sum_{n=i}^{\infty} n^{-2}$$

and

$$\sum_{n=0}^{\infty}\sum_{j\in S} P^{(n)}(2i+1,j)\,|r(j)| \le p^{-(i+1)}\sum_{n=i+1}^{\infty} n^{-2} .$$

It can also be verified that

$$\sum_{j\in S} p(i,j)\,b(j) - \rho\,b(i) = 0 \quad \text{for all } i \in S .$$

So the assumptions 3.1.2-3.1.5 are satisfied.

However, no bounding function $\mu'$ exists for which $\rho' < 1$ and $\|r\|_{\mu'} < \infty$. This follows since $r(2i) = i^{-2} p^{-i}$ and $r(2i+1) = -(i+1)^{-2} p^{-(i+1)}$ for $i > \tfrac12 i_0$ imply that an eventual bounding function $\mu'$ should satisfy

$$\mu'(2i) \ge \tfrac{1}{M}\, i^{-2} p^{-i} =: \mu_0(2i)$$

for some $M \in \mathbb{R}_+$ and $i > \tfrac12 i_0$. Assume the existence of a bounding function $\mu'$ and a $\rho' < 1$ such that

(3.3.3) $\qquad P\mu' \le \rho'\mu' .$

We define

$$i_1 := \max\Big\{i_0,\ \min\big\{i \ \big|\ \big(\tfrac{i}{i+1}\big)^2 > \tfrac12(1+\rho')\big\}\Big\} .$$

Then, substituting $\mu_0$ in the right-hand side of (3.3.3) yields, for $i > 2i_1$, the condition

$$\mu'(i) \ge \beta\,\mu_0(i) =: \mu_1(i) , \qquad \text{with } \beta := \frac{1+\rho'}{2\rho'} > 1 .$$

Substituting $\mu_1(i)$ in (3.3.3) yields for $i > 2i_1$

$$\mu'(i) \ge \beta\,\mu_1(i) = \beta^2 \mu_0(i) .$$

Iterating in this way proves that no bounding function exists.
REMARK 3.3.1.
(i) It is easily verified that the assumptions 3.1.3 and 3.1.4 do not imply assumption 3.1.5. By replacing the rewards in the above example by $|r|$ the assumptions 3.1.3 and 3.1.4 remain satisfied, whereas assumption 3.1.5 fails.
(ii) If $b$ is bounded from one side (in $\mu$-norm) it follows from lemma 3.3.1 or 3.3.2 that $r$ is a charge with respect to $P$, since

$$\Big\|\sum_{n=0}^{\infty} P^n|r|\Big\|_{\mu'} \le (1-\rho')^{-1}\|r\|_{\mu'} ,$$

i.e.

$$\sum_{n=0}^{\infty} \big(P^n|r|\big)(i) \le (1-\rho')^{-1}\|r\|_{\mu'}\,\mu'(i) , \qquad i \ne 0 .$$

In this case assumption 3.1.3 may be replaced by the assumption that $b$ is bounded from one side (with respect to $\mu$).
CHAPTER 4
MARKOV DECISION PROCESSES
As mentioned in chapter 1 we consider in this and the following chapters Markov decision processes.

In section 4.1 the model is described. Next, in section 4.2, decision rules and the assumptions on the transition probabilities and on the reward structure will be introduced. Again the assumptions will allow for an unbounded reward structure. They are in fact a natural extension of the assumptions 3.1.2-3.1.5 to the case in which decisions are permitted. A number of results about Markov decision processes will be proved under our assumptions (section 4.3). For example, the existence of ε-optimal stationary Markov decision rules will be shown. For Markov decision processes with a bounded reward structure this result has also been obtained by Blackwell [5] and Denardo [12].

Harrison [24] proved the same for discounted Markov decision processes with a bounding function $\mu(i) = 1$ for $i \in S\setminus\{0\}$. Moreover, in section 4.3 we prove the convergence (in $\mu$-norm) of the standard dynamic programming algorithm.

The final section will again be devoted to a discussion of the assumptions.
4.1. The Markov decision model

We consider a Markov decision process on the countably infinite or finite state space $S$ at discrete points in time $t = 0,1,\dots$. In each state $i \in S$ the set of actions available is $A$. We allow $A$ to be general and suppose $\mathcal{A}$ to be a σ-field on $A$ with $\{a\} \in \mathcal{A}$ if $a \in A$. If the system's actual state is $i \in S$ and action $a \in A$ is selected, then the system's next state will be $j \in S$ with probability $p_a(i,j)$.

ASSUMPTION 4.1.1.
(i) $p_a(i,j) \ge 0$ for all $i,j \in S$ and $a \in A$;
(ii) $\sum_{j\in S} p_a(i,j) = 1$ for all $i \in S$ and $a \in A$;
(iii) $p_a(0,0) = 1$ for all $a \in A$;
(iv) $p_a(i,j)$, as a function of $a$, is a measurable function on $(A,\mathcal{A})$ for each $i,j \in S$.

If state $i \in S$ is observed at time $n$ and action $a \in A$ has been selected, then an immediate (expected) reward $r(i,a)$ is earned. So from now on the reward function $r$ is a real-valued measurable function on $S \times A$.

The objective is to choose the actions at the successive points in time in such a way that the total expected reward over an infinite time horizon is maximal. A precise formulation will be given in the following sections.

It will be shown later on (chapter 9) that our model formulation includes the discounted case (with a discount factor $\beta < 1$), since $\beta$ may be supposed to be incorporated in the $p_a(i,j)$. The same holds for semi-Markov decision processes, where it is only required that $t$ is interpreted as the number of the decision moment rather than actual time. For semi-Markov decision processes with discounting, the resulting discount factor depends on $i$, $j$ and $a \in A$ only, and may again be supposed to be incorporated in the transition probabilities $p_a(i,j)$.
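The reduction of discounting to the transition probabilities can be made concrete: multiplying every transition probability by $\beta$ and sending the remaining mass $1-\beta$ to the absorbing state 0 yields an undiscounted model with the same value vector. A minimal numerical sketch (the two-state chain is an assumption for the example, not from the text):

```python
import numpy as np

beta = 0.9
# Original stochastic chain on {1,2} with rewards r (illustrative data).
P = np.array([[0.6, 0.4],
              [0.5, 0.5]])
r = np.array([1.0, 2.0])

# Discounted value: V = sum_n beta^n P^n r = (I - beta P)^{-1} r.
V_disc = np.linalg.solve(np.eye(2) - beta * P, r)

# Equivalent undiscounted model: add an absorbing state 0 with zero reward
# and set p'(i,j) := beta * p(i,j), p'(i,0) := 1 - beta.
P_tilde = np.zeros((3, 3))
P_tilde[0, 0] = 1.0
P_tilde[1:, 1:] = beta * P
P_tilde[1:, 0] = 1.0 - beta
r_tilde = np.array([0.0, 1.0, 2.0])

# Total (undiscounted) expected reward in the enlarged model.
V_total = np.linalg.solve(np.eye(2) - P_tilde[1:, 1:], r_tilde[1:])
print(np.allclose(V_disc, V_total))   # True: the two models have the same value
```

The two linear systems are identical term by term, which is exactly why the discount factor can be absorbed into the $p_a(i,j)$.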
4.2. Decision rules

In the first part of this section we are concerned with the concept of decision rules. Roughly, a decision rule is a recipe for taking actions at each point in time. A decision rule will be denoted by $\pi$. The action to be selected at time $n$, according to $\pi$, may be a function of the entire history of the process until that time. We allow for the decision rule $\pi$ to be such that for each state $i \in S$ actions are selected by a random mechanism. This random mechanism may be a function of the history too.
DEFINITION 4.2.1.
(i) An $n$-stage history $h_n$ of the process is a $(2n+1)$-tuple $h_n := (i_0, a_0, i_1, a_1, \dots, i_n)$ of successively observed states and selected actions.
(ii) $H_n$, $n \ge 0$, denotes the set of all possible $n$-stage histories.

DEFINITION 4.2.2.
(i) Let $q_n$ be a transition probability of $(H_n, \mathcal{S}_n)$ into $(A,\mathcal{A})$, $n \ge 0$. So
(a) for every $h_n \in H_n$, $q_n(\cdot \mid h_n)$ is a probability measure on $(A,\mathcal{A})$;
(b) for every $A' \in \mathcal{A}$, $q_n(A' \mid \cdot)$ is measurable on $(H_n, \mathcal{S}_n)$.
Then a decision rule $\pi$ is defined to be a sequence of transition probabilities, $\pi := (q_0, q_1, q_2, \dots)$.
(ii) The set of all decision rules is denoted by $D$.

DEFINITION 4.2.3.
(i) A decision rule $\pi = (q_0, q_1, \dots)$ is called nonrandomized if $q_n(\cdot \mid h_n)$ is a degenerate measure on $(A,\mathcal{A})$ for each $n \ge 0$, i.e. $\exists_{a \in A}\,[q_n(\{a\} \mid h_n) = 1]$. The set of all nonrandomized decision rules is $N$.
(ii) A decision rule $\pi = (q_0, q_1, \dots)$ is said to be Markov or memoryless if for all $n \ge 0$, $q_n(\cdot \mid h_n)$ depends on the last component of $h_n$ only.
(iii) The set of all Markov decision rules is denoted by $D_M$.
(iv) A decision rule is said to be a Markov strategy if it is nonrandomized and Markov.
(v) The set of Markov strategies is denoted by $M$.
(vi) A Markov strategy can thus be identified with a sequence of functions $\{f_n \mid n = 0,1,\dots\}$ where $f_n$ is a function from $S$ into $A$. Such a function is called a (Markov) policy. The set of all possible policies is denoted by $F$.
(vii) A Markov strategy is called stationary if all its component policies are identical. We denote by $f^\infty$ the stationary Markov strategy with component $f$. $F^\infty$ denotes the set of all stationary Markov strategies.
(viii) For $\pi = (f_0, f_1, \dots) \in M$ and $g \in F$ we denote by $(g,\pi) := (g, f_0, f_1, \dots)$ the Markov strategy that applies $g$ first and then applies the policies of $\pi$ in their given order.
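Under the identifications above, a policy is simply a map $f: S \to A$ and a Markov strategy is a sequence of such maps. A minimal sketch of these objects, where the sets $S$, $A$ and the dict representation are assumptions made for the example:

```python
from itertools import chain, islice, repeat

# A (Markov) policy is a map f: S -> A; here policies are plain dicts
# over a small illustrative state space (an assumption for the sketch).
S = [0, 1, 2]
f = {0: 'stop', 1: 'left', 2: 'right'}    # one policy in F
g = {0: 'stop', 1: 'right', 2: 'right'}   # another policy in F

# The stationary strategy f^infinity repeats the same policy forever.
f_inf = repeat(f)

# (g, pi): apply g first, then the policies of pi in their given order.
pi = chain([g], f_inf)

first_three = list(islice(pi, 3))
print(first_three == [g, f, f])   # True: g at time 0, then f at every later time
```

Representing $f^\infty$ as an infinite repetition makes the composition $(g,\pi)$ a one-line prefix operation, mirroring definition 4.2.3(viii).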
For $n \in \mathbb{N}$ we define the measurable space $((S \times A)^n, \mathcal{S}^n_{0,A})$, where $\mathcal{S}^n_{0,A}$ is the product σ-field generated by $\mathcal{S}_0$ and $\mathcal{A}$. The product space $(\Omega_{0,A}, F_{0,A})$ is the space with $\Omega_{0,A} = (S \times A)^\infty$ and $F_{0,A}$ the product σ-field of subsets of $(S \times A)^\infty$ generated by $\mathcal{S}_0$ and $\mathcal{A}$. For each $\pi = (q_0, q_1, \dots)$ and each $n \in \mathbb{N}$ we define for $((S \times A)^n, \mathcal{S}^n_{0,A})$ and $(S \times A, \mathcal{S}_{0,A})$ the transition probability