Markov decision processes and quasi-martingales
Citation for published version (APA):
Groenewegen, L. P. J., & Hee, van, K. M. (1976). Markov decision processes and quasi-martingales. (Memorandum COSOR; Vol. 7604). Technische Hogeschool Eindhoven.
Document status and date: Published: 01/01/1976
EINDHOVEN UNIVERSITY OF TECHNOLOGY
Department of Mathematics
Probability Theory, Statistics and Operations Research Group

Memorandum COSOR 76-04

Markov decision processes and quasi-martingales

by

L.P.J. Groenewegen and K.M. van Hee

Eindhoven, February 1976
The Netherlands
0. Abstract

In this paper it is shown that quasi-(super)martingales play an important role in the theory of Markov decision processes. For excessive functions (with respect to a charge) it is proved that the value of the state at time t converges almost surely under each Markov strategy, which implies that the value function in the state at time t converges to zero (a.s.) if an optimal strategy is used. Finally, a characterization of the conserving and equalizing properties is formulated using martingale theory.
1. Introduction

In this section the framework of convergent dynamic programming, see Hordijk (1974a, 1974b), is sketched.
A Markov decision process will be a triple (S, P, r), where S is a countable set, called the state space, r is a real measurable function on S × P, called the reward function, and P is a Borel subset of E, where E is the set of all Markov transition functions on S, i.e. P ∈ E implies

P : S × S → [0,1]   with   Σ_{j∈S} P(i,j) ≤ 1 for all i ∈ S.

It is assumed that E is endowed with a metric such that E is a Polish space. We will use the following notational convention for functions g on S × P: g_P(i) := g(i,P), and we assume that all functions on S × P are measurable on P. It is assumed that P and r have the following (product) properties: let P_1, P_2, P_3, ... ∈ P and A_1, A_2, A_3, ... ⊂ S; then there is a P ∈ P such that for all i ∈ S: P(i,·) = P_k(i,·) if i ∈ A_k; moreover r_{P_1}(i) = r_{P_2}(i) whenever P_1(i,·) = P_2(i,·).
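The product property can be pictured concretely with a small sketch (ours, not part of the memorandum): represent substochastic kernels as dicts of dicts and build the combined kernel row by row. The helper `combine` and the toy kernels are invented for illustration, and the sets A_k are assumed to partition S.

```python
# Illustrative sketch (not from the memorandum) of the product property:
# given substochastic kernels P_k on a countable S, with reward rows r_k,
# and sets A_k partitioning S, build the combined kernel P with
# P(i,.) = P_k(i,.) for i in A_k, inheriting the reward of the chosen row.

def combine(kernels_rewards, partition):
    """kernels_rewards: list of (P_k, r_k), P_k: dict i -> dict j -> prob,
    r_k: dict i -> reward; partition: list of disjoint sets A_k covering S."""
    P, r = {}, {}
    for (Pk, rk), Ak in zip(kernels_rewards, partition):
        for i in Ak:
            P[i] = Pk[i]          # row i taken from P_k
            r[i] = rk[i]          # reward consistent with the chosen row
    return P, r

# toy example on S = {0, 1}; row sums <= 1 (missing mass = absorption)
P1 = {0: {0: 1.0}, 1: {0: 0.5}}
P2 = {0: {1: 1.0}, 1: {1: 1.0}}
r1 = {0: 0.0, 1: 2.0}
r2 = {0: 1.0, 1: 3.0}
P, r = combine([(P1, r1), (P2, r2)], [{0}, {1}])
print(P)   # {0: {0: 1.0}, 1: {1: 1.0}}
print(r)   # {0: 0.0, 1: 3.0}
```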
Any sequence R = (P_0, P_1, P_2, ...) of elements of P is called a Markov strategy; the set of all Markov strategies is denoted by M. The state space S is extended to S* by adding a state ρ, S* := S ∪ {ρ}, and all P ∈ P are extended to S* by

P(i,ρ) := 1 − Σ_{j∈S} P(i,j)   and   P(ρ,ρ) := 1.

All functions on S are extended to S* by defining them 0 in ρ. Let F_n be the usual σ-field on (S*)^∞ generated by the first n + 1 coordinates of the paths, n = 0,1,2,..., and let F be the σ-field generated by ∪_{n=0}^∞ F_n. Now {X_n, n = 0,1,2,...} is a stochastic process on ((S*)^∞, F, ℙ_{i,R}) for each (i,R), where X_n(ω) selects the (n + 1)-th coordinate of ω for all ω ∈ (S*)^∞. The expectation with respect to ℙ_{i,R} is denoted by E_{i,R}. A measurable function f is said to be integrable w.r.t. ℙ_{i,R} if at least one of the terms E_{i,R} f^+ and E_{i,R} f^− is finite, and summable if both are finite. If f(X_n) is integrable w.r.t. ℙ_{i,R} the expectation may be evaluated as follows:

E_{i,R}[f(X_n)] = P_0 ··· P_{n−1} f(i)   for R = (P_0, P_1, ...)

(an empty product of Markov transition functions is defined as the identity operator).

Definition 1.1.
i) A function g : S × P → ℝ is called a charge iff

Σ_{n=0}^∞ E_{i,R}[|g_{P_n}(X_n)|] < ∞   for all i ∈ S, R ∈ M.
Let g be such a charge.
ii) A function f : S → ℝ is called superharmonic w.r.t. (a charge) g iff f(X_n) is integrable w.r.t. ℙ_{i,R} for all i, R, n and

f ≥ g_P + Pf   for all P ∈ P.

iii) A function f : S → ℝ is called excessive w.r.t. (a charge) g iff f is superharmonic w.r.t. g and

lim_{n→∞} E_{i,R}[f(X_n)] ≥ 0   for all i ∈ S, R ∈ M.
Assumption 1.2.
i) The reward function r is a charge, and
ii) sup_{R∈M} E_{i,R}[Σ_{n=0}^∞ r^+_{P_n}(X_n)] < ∞ for all i ∈ S

(recall: x^+ := max(0,x), x^− := (−x)^+).

Definition 1.3.
i) The value function v of (S, P, r) is the real function on S given by

v(i) := sup_{R∈M} E_{i,R}[Σ_{n=0}^∞ r_{P_n}(X_n)]   for all i ∈ S.

ii) A strategy R ∈ M is called optimal if this supremum is attained for R in all i ∈ S.

In this paper it will be shown, using some theory on supermartingales, that each function f superharmonic w.r.t. a charge g has the property that f(X_n) converges ℙ_{i,R}-almost surely for all i and R. In particular the value function v(X_n) converges almost surely to zero under each optimal strategy. As a last result we give a slight extension of the following theorem of Hordijk (1974a), which is based on a result of Dynkin and Juschkewitsch (1969): if τ_1 and τ_2 are stopping times for the sequence of σ-fields {F_n, n = 0,1,...}, then τ_1 ≤ τ_2 ℙ_{i,R}-a.s. implies

E_{i,R}[v(X_{τ_1})] ≥ E_{i,R}[v(X_{τ_2})].
In section 2 some theory on quasi-martingales is developed, and in section 3 this theory is applied to Markov decision processes. Most of the lemmas used in this paper are well known, but the authors do not know of any place in the literature where these facts were combined to obtain the results mentioned above.
2. Quasi-martingales
Quasi-martingales have been introduced by Fisk (1965) as continuous time stochastic processes having a decomposition into the sum of a martingale and a process having almost all sample functions of bounded variation. In
this paper we give essentially the same definition for the discrete time case.
Let ℕ be the set {0,1,2,...}, let (Ω, A, ℙ) be a probability space and {A_t, t ∈ ℕ} an increasing sequence of σ-fields contained in A. All stochastic processes in this section are defined on (Ω, A, ℙ) and have values in the set of real numbers with the Borel σ-field on it. Moreover they are adapted to {A_t, t ∈ ℕ}, i.e. the σ-field generated by the first n + 1 coordinates of the paths is a subset of A_n, n ∈ ℕ. (The conditional expectation w.r.t. A_t is denoted by E^{A_t}.)
Definition 2.1.
Let {B_t, t ∈ ℕ} be a stochastic process such that Σ_{t∈ℕ} |B_t| < ∞ ℙ-a.s. A stochastic process {V_t, t ∈ ℕ} is called a quasi-(super)martingale (QSPM) if there exists a (super)martingale {S_t, t ∈ ℕ} such that

V_t = S_t + Σ_{k=0}^{t−1} B_k   ℙ-a.s.

({V_t, t ∈ ℕ} is said to be a QSPM w.r.t. {B_t, t ∈ ℕ}.) In Fisk's paper the process {B_t, t ∈ ℕ} of definition 2.1 is called a process of bounded variation.

Lemma 2.2.
Let {B_t, t ∈ ℕ} and {V_t, t ∈ ℕ} be stochastic processes with Σ_{t∈ℕ} |B_t| < ∞ ℙ-a.s. Then {V_t, t ∈ ℕ} is a QSPM w.r.t. {B_t, t ∈ ℕ} iff

E^{A_t}[V_{t+1}] ≤ V_t + B_t   ℙ-a.s., t ∈ ℕ.

Proof. Define S_t := V_t − Σ_{k=0}^{t−1} B_k.
i) Suppose E^{A_t}[V_{t+1}] ≤ V_t + B_t. Then

E^{A_t}[S_{t+1}] = E^{A_t}[V_{t+1}] − Σ_{k=0}^{t} B_k ≤ V_t + B_t − Σ_{k=0}^{t} B_k = V_t − Σ_{k=0}^{t−1} B_k = S_t.

Hence {S_t, t ∈ ℕ} is a supermartingale.
ii) Conversely, suppose {V_t, t ∈ ℕ} is a QSPM w.r.t. {B_t, t ∈ ℕ}. Then

E^{A_t}[V_{t+1}] = E^{A_t}[S_{t+1}] + Σ_{k=0}^{t} B_k ≤ S_t + Σ_{k=0}^{t} B_k = V_t + B_t   ℙ-a.s. □

For a quasi-martingale the same characterization holds with equality. In lemma 2.3 it is shown that quasi-(super)martingales converge ℙ-a.s. under a condition analogous to that for supermartingales.
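The characterization in lemma 2.2 can be checked exactly on a finite example (ours, not part of the memorandum): on Ω = {0,1}^3 with fair coin flips, take a ±1 random walk as the martingale part S_t, a deterministic sequence B_t, and V_t := S_t + Σ_{k<t} B_k; conditional expectation w.r.t. A_t is then an average over each atom of the first t coordinates.

```python
# Exact finite-horizon illustration of lemma 2.2 (ours, not from the text):
# with S_t a fair +/-1 random walk (a martingale) and deterministic B_t,
# V_t = S_t + sum_{k<t} B_k satisfies E[V_{t+1} | A_t] = V_t + B_t on every
# atom of A_t, so {V_t} is a quasi-martingale w.r.t. {B_t}.
from itertools import product

omega_space = list(product([0, 1], repeat=3))   # all paths, each prob 1/8
B = [0.5, -0.25, 0.125]                         # deterministic B_t (A_t-measurable)

def S(w, t):                                    # martingale part: +/-1 walk
    return sum(1 if x else -1 for x in w[:t])

def V(w, t):                                    # V_t = S_t + sum_{k<t} B_k
    return S(w, t) + sum(B[:t])

for t in range(3):
    # atoms of A_t are the sets of paths sharing the first t coordinates
    for prefix in product([0, 1], repeat=t):
        atom = [w for w in omega_space if w[:t] == prefix]
        cond_exp = sum(V(w, t + 1) for w in atom) / len(atom)
        w0 = atom[0]                            # V_t, B_t constant on the atom
        assert abs(cond_exp - (V(w0, t) + B[t])) < 1e-12
print("E[V_{t+1} | A_t] = V_t + B_t on every atom")
```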
Lemma 2.3.
Let {B_t, t ∈ ℕ} and {V_t, t ∈ ℕ} be as in lemma 2.2 with E^{A_t}[V_{t+1}] ≤ V_t + B_t ℙ-a.s. Assume furthermore
i) limsup_{t∈ℕ} E[V_t^−] < ∞ and
ii) E[Σ_{k=0}^∞ B_k^+] < ∞.
Then V_t converges ℙ-a.s.

Proof. Let S_t := V_t − Σ_{k=0}^{t−1} B_k, t ∈ ℕ. Then

limsup_{t∈ℕ} E[S_t^−] ≤ limsup_{t∈ℕ} E[V_t^−] + limsup_{t∈ℕ} E[Σ_{k=0}^{t−1} B_k^+] < ∞.

Since {S_t, t ∈ ℕ} is a supermartingale it follows from the convergence theorem on supermartingales (see e.g. Neveu (1972), IV-1-2) that S_t converges ℙ-a.s. Also Σ_{k=0}^{t−1} B_k converges a.s., hence V_t does. □
Remark.
From the proof of the cited theorem of Neveu it can be seen that Neveu's condition sup_{t∈ℕ} E[V_t^−] < ∞ can be replaced by limsup_{t∈ℕ} E[V_t^−] < ∞.
The next lemma is not really used in the rest of the paper, but it shows that the requirements of lemma 2.3 almost imply that E[Σ_{n=0}^∞ |B_n|] < ∞.

Lemma 2.4.
Assume in addition to the assumptions of lemma 2.3 that V_0 is summable. Then it holds that

E[Σ_{n=0}^∞ |B_n|] < ∞.
Proof. Let S_t be defined as in the proof of lemma 2.3. Note that E[S_t] ≤ E[S_0]. Let M := limsup_{t∈ℕ} E[V_t^−] and note that limsup_{t∈ℕ} E[V_t] + M ≥ 0. Then

E[S_0] + E[Σ_{k=0}^{t−1} B_k^+] + M ≥ E[V_t + M] + E[Σ_{k=0}^{t−1} B_k^−],

hence, by taking the limsup of both sides, we have

E[S_0] + E[Σ_{k=0}^∞ B_k^+] + M ≥ E[Σ_{k=0}^∞ B_k^−].

Since E[S_0] = E[V_0] < ∞ it holds that E[Σ_{k=0}^∞ B_k^−] < ∞, which together with assumption ii) of lemma 2.3 proves the lemma. □
This section ends with a property of regular supermartingales. The definition of regularity given below is equivalent to the usual one (see e.g. Neveu (1972) IV-5-24). Let {S_t, t ∈ ℕ} be a supermartingale and let τ_1 and τ_2 be two stopping times w.r.t. {A_t, t ∈ ℕ}; A_∞ is the σ-field generated by ∪_n A_n and, for a stopping time τ,

A_τ := {B ∈ A_∞ | B ∩ {τ = n} ∈ A_n, n ∈ ℕ}.

Definition 2.5.
The supermartingale {S_t, t ∈ ℕ} is called regular iff the sequence {S_t^−, t ∈ ℕ} converges in L^1-sense.

Property 2.6.
Let {S_t, t ∈ ℕ} be a regular supermartingale and let τ_1, τ_2 be stopping times. Then S_{τ_1} and S_{τ_2} are integrable and

E^{A_{τ_1}}[S_{τ_2}] ≤ S_{τ_1}   ℙ-a.s. on {τ_1 ≤ τ_2}

(for a proof see Neveu (1972) IV-5-25).
3. Some consequences for Markov decision processes.

In this section we return to the model described in section 1. We first give a quick survey of the properties of this model which are relevant to our exposition here.

Properties.
3.1. v(i) = sup_{π∈Π} E_{i,π}[Σ_{n=0}^∞ r_{P_n}(X_n)] for all i ∈ S, where Π is the set of all strategies (see Van Hee (1975) for the definition of Π, E_{i,π} and the proof). For this property assumption 1.2 ii) is required.

3.2. The value function v satisfies Bellman's optimality equation:

v(i) = sup_{P∈P} {r_P(i) + Pv(i)}.

This statement is a standard consequence of 3.1; the proof is similar to those in Ross (1970), th. 6.1, or Hordijk (1974a), th. 3.1 and 3.5.
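Bellman's optimality equation lends itself to computing v by successive approximation, v_{n+1}(i) = sup_P {r_P(i) + P v_n(i)}. A minimal sketch (ours, not from the memorandum; the two-state toy MDP and its numbers are invented, with substochastic rows so the missing probability mass is absorption in ρ):

```python
# Hedged sketch (ours) of solving v(i) = sup_P { r_P(i) + P v(i) } by value
# iteration on a small finite MDP; rows may sum to less than 1 (absorption).

# two "actions" per state, each a (row of P, reward) pair; S = {0, 1}
actions = {
    0: [({0: 0.0}, 1.0),          # "stop": reward 1, then absorbed in rho
        ({1: 1.0}, 0.0)],         # move to state 1, no reward
    1: [({1: 0.0}, 3.0),          # "stop": reward 3
        ({0: 0.9}, 0.0)],         # back to 0 w.p. 0.9 (0.1 to rho)
}

v = {i: 0.0 for i in actions}
for _ in range(200):               # successive approximation
    v = {i: max(r + sum(p * v[j] for j, p in row.items())
                for row, r in actions[i])
         for i in actions}

print(v)
```

Here the fixed point is v(0) = v(1) = 3: in state 1 stopping yields 3, and from state 0 it is best to move to 1 first.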
3.3. Let g be a charge and f a superharmonic function w.r.t. g. Then lim_{n→∞} E_{i,R}[f(X_n)] exists for all R ∈ M, i ∈ S, and the following assertions are equivalent:
i) f is excessive w.r.t. g;
ii) lim_{n→∞} E_{i,R}[f^−(X_n)] = 0 for all R ∈ M, i ∈ S;
iii) lim_{n→∞} E_{i,R}[f(X_n)] ≥ 0 for all R ∈ M, i ∈ S.
For a proof see Hordijk (1974a), th. 2.17.
Remark 3.4.
It is obvious from 3.2 that the value function v is superharmonic w.r.t. r, and by its definition it is clear that

v(i) ≥ E_{i,R}[Σ_{n=0}^∞ r_{P_n}(X_n)]   for all i ∈ S, R ∈ M,

hence v is excessive w.r.t. r.
Remark 3.5.
If f is excessive w.r.t. a charge g it holds that

lim_{n→∞} E_{i,R}[|f(X_n)|] = lim_{n→∞} E_{i,R}[f(X_n)]

by 3.3 ii).
Lemma 3.6.
Let f be a superharmonic function w.r.t. g. Then for all R ∈ M, t ∈ ℕ and i ∈ S it holds that

E^{F_t}_{i,R}[f(X_{t+1})] ≤ f(X_t) − g_{P_t}(X_t)   ℙ_{i,R}-a.s.

Proof. It is clear that (P_t f)(X_t) is F_t-measurable, and since ℙ_{i,R}[X_{t+1} = j | F_t] = P_t(X_t, j) ℙ_{i,R}-a.s. it holds that

E^{F_t}_{i,R}[f(X_{t+1})] = (P_t f)(X_t)   ℙ_{i,R}-a.s.

Hence, by the superharmonicity of f, the statement follows. □
The main results of this paper are easy to prove now.
Theorem 3.7.
Let f be an excessive function w.r.t. a charge g. For any i ∈ S, R ∈ M, {f(X_t), t ∈ ℕ} is a quasi-supermartingale w.r.t. {−g_{P_k}(X_k), k ∈ ℕ}, and f(X_t) converges ℙ_{i,R}-a.s. (for t → ∞).

Proof. Since g is a charge we have

Σ_{k=0}^∞ |g_{P_k}(X_k)| < ∞   ℙ_{i,R}-a.s.

Fix i ∈ S, R ∈ M. By lemma 3.6 we have

E^{F_t}_{i,R}[f(X_{t+1})] ≤ f(X_t) − g_{P_t}(X_t)   ℙ_{i,R}-a.s.,

so lemma 2.2 shows that {f(X_t), t ∈ ℕ} is a QSPM w.r.t. {−g_{P_k}(X_k), k ∈ ℕ}. From g being a charge and property 3.3 ii) it follows that all conditions of lemma 2.3 are fulfilled, which proves the theorem. □
Theorem 3.8.
Let f be an excessive function w.r.t. a charge g. The supermartingale

{f(X_t) + Σ_{k=0}^{t−1} g_{P_k}(X_k), t ∈ ℕ}

is regular.

Proof. Fix i ∈ S, R ∈ M and let S_t := f(X_t) + Σ_{k=0}^{t−1} g_{P_k}(X_k). By theorem 3.7 we have that {S_t, t ∈ ℕ} is a supermartingale, so we only have to check that S_t^− converges in L^1-sense. We have

S_t ≥ f(X_t) − Σ_{k=0}^{t−1} g^−_{P_k}(X_k),

hence

S_t^− − f^−(X_t) ≤ Σ_{k=0}^{t−1} g^−_{P_k}(X_k) ≤ Σ_{k=0}^∞ g^−_{P_k}(X_k).

On the other hand, since (a + b)^− ≥ a^− − b^+,

S_t^− = [f(X_t) + Σ_{k=0}^{t−1} g_{P_k}(X_k)]^− ≥ f^−(X_t) − [Σ_{k=0}^{t−1} g_{P_k}(X_k)]^+ ≥ f^−(X_t) − Σ_{k=0}^∞ g^+_{P_k}(X_k).

Thus |S_t^− − f^−(X_t)| ≤ Σ_{k=0}^∞ |g_{P_k}(X_k)|, which is ℙ_{i,R}-summable because g is a charge, while S_t^− − f^−(X_t) converges ℙ_{i,R}-a.s. (both S_t and f(X_t) converge a.s. by theorem 3.7). By the dominated convergence theorem we have the L^1-convergence of S_t^− − f^−(X_t). Since E_{i,R}[f^−(X_t)] converges to zero by 3.3 ii), this implies the L^1-convergence of S_t^−. □
Corollary 3.9.
Let f be an excessive function w.r.t. a charge g, and let τ_1 and τ_2 be stopping times w.r.t. {F_t, t ∈ ℕ}. Then

Σ_{k=0}^{τ_1−1} g_{P_k}(X_k) + f(X_{τ_1}) ≥ E^{F_{τ_1}}_{i,R}[Σ_{k=0}^{τ_2−1} g_{P_k}(X_k) + f(X_{τ_2})]   ℙ_{i,R}-a.s. on {τ_1 ≤ τ_2}

for all i ∈ S, R ∈ M. Note that 3.9 is a direct consequence of property 2.6 and theorem 3.8. If ℙ_{i,R}[τ_1 ≤ τ_2] = 1, integration w.r.t. ℙ_{i,R} gives the theorem of Hordijk mentioned in the introduction.
4. Some remarks.

1) A strategy R = (P_0, P_1, ...) ∈ M is called conserving if v = r_{P_t} + P_t v for all t ∈ ℕ (v is the value function), and R is called equalizing if

lim_{t→∞} E_{i,R}[v(X_t)] = 0   for all i ∈ S.

It is well known (see the proof of th. 4.6 in Hordijk (1974a)) that R ∈ M is optimal iff R is equalizing and conserving. For each equalizing strategy R we may conclude that v(X_n) → 0 ℙ_{i,R}-a.s. (for all i ∈ S), since by th. 3.7 v(X_n) converges ℙ_{i,R}-a.s. and by 3.5 we know that this limit must be zero.
In Groenewegen (1975) this result has been proved for optimal strategies.

2) For a conserving strategy R = (P_0, P_1, ...) it holds that

v(X_t) = r_{P_t}(X_t) + (P_t v)(X_t),

hence {v(X_t) + Σ_{k=0}^{t−1} r_{P_k}(X_k), t ∈ ℕ} is a martingale, by the equality case of lemma 2.2.
3) Let g : S × P → ℝ and f_t : S → ℝ, t ∈ ℕ; suppose we do not know whether g is a charge or not. Assume
i) g_{P_t}(X_t) + E^{F_t}_{i,R}[f_{t+1}(X_{t+1})] ≤ f_t(X_t) ℙ_{i,R}-a.s. for all i ∈ S, t ∈ ℕ, R ∈ M, with E^{F_t}_{i,R}[f_{t+1}(X_{t+1})] well-defined;
ii) limsup_{t→∞} E_{i,R}[f_t^−(X_t)] < ∞ for all i ∈ S, R ∈ M;
iii) Σ_{t=0}^∞ |g_{P_t}(X_t)| < ∞ ℙ_{i,R}-a.s. for all i ∈ S, R ∈ M;
iv) E_{i,R}[Σ_{t=0}^∞ g^−_{P_t}(X_t)] < ∞ for all i ∈ S, R ∈ M.
Under these conditions, similar to those in lemma 2.3, we have that f_t(X_t) converges ℙ_{i,R}-a.s. for each i ∈ S, R ∈ M. If in addition f_0 is finite, this implies also, by lemma 2.4, that g is a charge. So if f_t = f for all t ∈ ℕ, f is superharmonic w.r.t. g.

4) Let N ∈ ℕ and Q_N, Q_{N+1}, ... ∈ P, and let g be a charge. Define

R := {R = (P_0, P_1, ...) ∈ M | P_k = Q_k for k ≥ N},

and let v_n (n ∈ ℕ) be the value function of the decision process that starts at time n and uses strategies from R, so that for n ≥ N the strategy (Q_n, Q_{n+1}, ...) is fixed, while for n ≤ N − 1 the transition functions up to time N − 1 may still be chosen freely. It is easy to check that the assumptions i), iii) and iv) in the above remark are satisfied for R ∈ R, and assumption ii) is satisfied by the observation that v(n, X_n) := v_n(X_n) is excessive w.r.t. g for the space-time process, where the only allowed strategies are elements of R.
Hence v_n(X_n) converges ℙ_{i,R}-a.s.

5) Let f be an excessive function w.r.t. a charge g. From theorem 3.7 we know that f(X_t) converges ℙ_{i,R}-a.s. for t → ∞, and from property 3.3 ii) we know that f^−(X_t) converges in L^1-sense for t → ∞. The following counterexample shows that in general f^+(X_t) does not converge in L^1-sense for t → ∞.
Example. S := {0,1,2,...}; P and Q are Markov transition functions with

P(0,0) = 1,   P(i, i+1) = i/(i+1) and P(i,0) = 1/(i+1) for i ≥ 1,   Q(i,0) = 1,

and P is the collection of Markov transition functions which can be generated from P and Q by using the product property. Furthermore r_P ≡ 0 and r_Q(i) = i. It can be verified easily that the conditions 1.2 i) and ii) are fulfilled, that v(i) = i, that lim_{t→∞} v(X_t) = 0 ℙ_{i,R}-a.s. for all R, and that

lim_{t→∞} E_{1,R}[v(X_t)] = 1   for R = (P, P, P, ...).

But it is well known that the L^1-limit and the a.s.-limit must be equal if both exist. So v(X_t) does not converge in L^1-sense for t → ∞.

6) In Mandl (1974) a martingale is considered in connection with the average cost criterion for the optimal control of a Markov chain. Using his construction for the total return criterion we get the following. Define a real function on S × P:

φ(i,P) := r_P(i) + Pv(i) − v(i),

and random variables

Y_n := r_{P_n}(X_n) + v(X_{n+1}) − v(X_n) − φ(X_n, P_n).

It is easy to see that E^{F_n}_{i,R}[Y_n] = 0, so

M_n := Σ_{k=0}^{n−1} Y_k

is a martingale. Written out, this martingale becomes

M_n = Σ_{m=0}^{n−1} r_{P_m}(X_m) + v(X_n) − v(X_0) − Σ_{m=0}^{n−1} φ(X_m, P_m).

Note that
1. v(X_0) = v(i) ℙ_{i,R}-a.s. for all R ∈ M;
2. by (3.2) φ(X_m, P_m) ≤ 0 ℙ_{i,R}-a.s.

Hence Σ_{m=0}^{n−1} r_{P_m}(X_m) + v(X_n) is the supermartingale treated in section 3. In view of the conserving and equalizing strategies mentioned in remark 1), it is worthwhile to note that

v(i) − E_{i,R}[Σ_{n=0}^∞ r_{P_n}(X_n)] = lim_{n→∞} E_{i,R}[v(X_n)] − E_{i,R}[Σ_{n=0}^∞ φ(X_n, P_n)]

for all R ∈ M. From this equality it is easy to see that an R ∈ M is optimal if and only if it is equalizing and conserving.
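As a closing illustration (ours, not part of the memorandum), the chain of the example in remark 5) can be checked numerically: under R = (P, P, ...) started in state 1, E[v(X_t)] stays 1 while the mass in state 0 grows to 1, and the function φ of remark 6) vanishes for both P and Q, so in this example every strategy is conserving and optimality reduces to the equalizing property.

```python
# Numerical check (ours) of the example in remark 5) and of phi in remark 6)
# for the chain with P(0,0)=1, P(i,i+1)=i/(i+1), P(i,0)=1/(i+1), Q(i,0)=1,
# rewards r_P = 0 and r_Q(i) = i, and value function v(i) = i.

def row_P(i):
    return {0: 1.0} if i == 0 else {i + 1: i / (i + 1), 0: 1 / (i + 1)}

def row_Q(i):
    return {0: 1.0}

def step(dist, row):
    """push a distribution one step through the kernel given by `row`."""
    out = {}
    for i, p in dist.items():
        for j, q in row(i).items():
            out[j] = out.get(j, 0.0) + p * q
    return out

v = lambda i: i

# under R = (P,P,...) started in 1: E[v(X_t)] = 1, P(X_t = 0) = t/(t+1)
dist = {1: 1.0}
for t in range(1, 51):
    dist = step(dist, row_P)
    assert abs(sum(i * p for i, p in dist.items()) - 1.0) < 1e-9
    assert abs(dist[0] - t / (t + 1)) < 1e-9

# phi(i,P) = r_P(i) + P v(i) - v(i) vanishes for both P and Q
def phi(row, reward, i):
    return reward(i) + sum(q * v(j) for j, q in row(i).items()) - v(i)

for i in range(100):
    assert abs(phi(row_P, lambda i: 0.0, i)) < 1e-9
    assert abs(phi(row_Q, lambda i: float(i), i)) < 1e-9

print("E[v(X_t)] = 1 under (P,P,...) although v(X_t) -> 0 a.s.; phi = 0")
```

Indeed (Q, Q, ...) is equalizing here (the chain is absorbed in 0 after one step) and hence optimal, while (P, P, ...) is conserving but not equalizing.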
Literature

Dynkin, E.B. and Juschkewitsch, A.A. (1969). Sätze und Aufgaben über Markoffsche Prozesse. Springer Verlag, Berlin.

Fisk, D.L. (1965). Quasi-martingales. Trans. Amer. Math. Soc. 120, 369-389.

Groenewegen, L.P.J. (1975). Convergence results related to the equalizing property in Markov decision processes. Memorandum COSOR 1975-18, Eindhoven University of Technology.

Van Hee, K.M. (1975). Markov strategies in dynamic programming. Memorandum COSOR 1975-20, Eindhoven University of Technology.

Hordijk, A. (1974a). Dynamic programming and Markov potential theory. Mathematical Centre Tract 51, Mathematisch Centrum, Amsterdam.

Hordijk, A. (1974b). Convergent dynamic programming. Technical Report 28, Department of Operations Research, Stanford University, Stanford.

Mandl, P. (1974). Estimation and control in Markov chains. Adv. Appl. Prob. 6, 40-60.

Neveu, J. (1972). Martingales à temps discret. Masson, Paris.

Ross, S.M. (1970). Applied probability models with optimization applications. Holden-Day, San Francisco.