EINDHOVEN UNIVERSITY OF TECHNOLOGY
Department of Mathematics
STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 75-20

Markov strategies in Dynamic Programming

by

K.M. van Hee

Eindhoven, October 1975
Markov strategies in Dynamic Programming
by
K.M. van Hee
0. Summary
It will be proven that the supremum of the expected total return over the Markov strategies equals the supremum over all strategies. The model assumptions are: the state space is countable, the action space is measurable and the supremum of the expected total of the positive rewards over the Markov strategies is finite.
1. Introduction
In this section we develop the usual framework of dynamic programming (see for example [Strauch]).
Consider a countable set $S$, the state space, an arbitrary set $A$, called the action space, and a $\sigma$-field $\mathcal{A}$ on $A$. The function $P : S \times S \times A \to [0,1]$ is a transition function, i.e.

1) $P(j \mid i,a)$ is measurable on $(A,\mathcal{A})$ for each $j,i \in S$,

2) $P(j \mid i,a) \geq 0$ and $\sum_{j \in S} P(j \mid i,a) \leq 1$ for all $i,j \in S$, $a \in A$.

We call
$$H_n := (S \times A)^n \times S$$
the set of possible histories until time $n$. On $H_n$ we have the product $\sigma$-field induced by $\mathcal{A}$ and the power set of $S$.
A strategy $\pi$ is a sequence $(\pi_0,\pi_1,\pi_2,\ldots)$ where $\pi_n$ is a transition probability on $(A \times H_n)$ such that $\pi_n(\cdot \mid h_n)$ is a probability on $(A,\mathcal{A})$ for $h_n \in H_n$, and $\pi_n(A_0 \mid \cdot)$ is measurable for $A_0 \in \mathcal{A}$. The set of all strategies is denoted by $\Pi$. The subset $RM$ of $\Pi$ is the class of randomized Markov strategies, which contains exactly the strategies $\pi \in \Pi$ such that $\pi_n(\cdot \mid h_{n-1},a_{n-1},s_n)$ does not depend on $h_{n-1}$ and $a_{n-1}$, for all $h_{n-1} \in H_{n-1}$, $a_{n-1} \in A$ and $s_n \in S$. Further we define the subset $M$ of $RM$ by: $\pi \in M$ if and only if $\pi_n(\cdot \mid s)$ is a degenerate probability.
$M$ is called the class of Markov strategies. We may identify a strategy $\pi \in M$ with a sequence of functions $\{f_n,\ n = 0,1,2,\ldots\}$, where $f_n : S \to A$ is such that $\pi_n(f_n(s) \mid s) = 1$. We call the functions $f_n$ Markov policies. We introduce a new state $p$ and we define $S^* := S \cup \{p\}$, $P(p \mid p,a) := 1$ for all $a \in A$, and $P(p \mid i,a) := 1 - \sum_{j \in S} P(j \mid i,a)$ for all $(i,a) \in S \times A$. Any strategy $\pi \in \Pi$, any $i \in S$ and $P$ define a probability $\mathbb{P}_{i,\pi}$ on $(S^* \times A)^{\infty}$, by the Ionescu Tulcea theorem, and a stochastic process $\{(X_n,Y_n),\ n = 0,1,2,\ldots\}$ where $X_n$ and $Y_n$ are the projections from $(S^* \times A)^{\infty}$ to the $n$-th state space and $n$-th action space respectively. This means that $X_n$ is the state and $Y_n$ the action at time $n$.
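As an illustration of this construction (not part of the formal development), the following minimal Python sketch simulates the process $\{(X_n,Y_n)\}$ under a randomized Markov strategy for a small finite state set. All states, actions, probabilities and the strategy below are hypothetical; the substochastic rows of $P$ are completed by absorption in the extra state $p$ exactly as described above.

```python
import random

# Illustrative (assumed) data: states, actions and a substochastic
# transition function P[(i, a)][j]; row sums may be < 1, the deficit
# is the probability of absorption in the extra state p.
S = [0, 1]
A = ["stay", "move"]
P = {
    (0, "stay"): {0: 0.9},            # deficit 0.1 -> absorption in p
    (0, "move"): {1: 1.0},
    (1, "stay"): {1: 0.8},            # deficit 0.2 -> absorption in p
    (1, "move"): {0: 0.5, 1: 0.4},    # deficit 0.1 -> absorption in p
}

def simulate(pi, i, horizon):
    """Simulate (X_0,Y_0),(X_1,Y_1),... under a randomized Markov
    strategy pi, where pi[n][s] is a dict action -> probability."""
    path, state = [], i
    for n in range(horizon):
        if state == "p":                      # absorbed: nothing more happens
            break
        probs = pi[n][state]
        action = random.choices(list(probs), weights=probs.values())[0]
        path.append((state, action))
        row = P[(state, action)]
        # complete the substochastic row with the absorbing state p
        targets = list(row) + ["p"]
        weights = list(row.values()) + [1.0 - sum(row.values())]
        state = random.choices(targets, weights=weights)[0]
    return path

# A stationary randomized Markov strategy (the same rule at every time n).
pi = [{0: {"stay": 0.5, "move": 0.5}, 1: {"move": 1.0}} for _ in range(20)]
print(simulate(pi, 0, 20))
```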
Functions on $S \times A$ are extended to $S^* \times A$ by defining them $0$ in $(p,a)$. Generally speaking we say a measurable function $f$ on a measure space $(\Omega,\mathcal{F},\mu)$ is integrable if at least one of the terms $\int f^{+}\,d\mu$ and $\int f^{-}\,d\mu$ is finite ($f^{+} := \max\{0,f\}$, $f^{-} := -\min\{0,f\}$).

If $u(X_n,Y_n)$ is integrable with respect to $\mathbb{P}_{i,\pi}$, $\pi \in RM$, we may evaluate the integral in operator notation:
$$\pi_0 P \pi_1 P \cdots P \pi_n u(i) := E_{i,\pi}[u(X_n,Y_n)] = \int \pi_0(da_0 \mid i) \sum_{i_1 \in S^*} P(i_1 \mid i,a_0) \int \pi_1(da_1 \mid i_1) \cdots \sum_{i_n \in S^*} P(i_n \mid i_{n-1},a_{n-1}) \int \pi_n(da \mid i_n)\, u(i_n,a) .$$
If $u(X_n)$ is $\mathbb{P}_{i,\pi}$-integrable we have in the same way:
$$E_{i,\pi}[u(X_n)] = \pi_0 P \cdots \pi_{n-1} P u(i) .$$
For a Markov strategy $f = (f_0,f_1,f_2,\ldots)$ we write $f_0 P f_1 \cdots P f_n u(i)$ and $E_{i,f}$ accordingly, where $f_n u(i) := u(i,f_n(i))$.
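For a finite state set these operator products are just alternating action-substitution and transition steps. The sketch below is illustrative only: the finite $S$, the transition numbers, the policies and the function $u$ are assumptions, not part of the model above. It evaluates $f_0 P f_1 \cdots f_{N-1} P u(i)$ for a Markov strategy by applying the operators from the right.

```python
# Sketch (assumed finite S, hypothetical numbers) of the operator notation
# E_{i,f}[u(X_N)] = f_0 P f_1 P ... f_{N-1} P u (i) for a Markov strategy
# f = (f_0, f_1, ...).  P is substochastic; the missing mass goes to the
# absorbing state p, where every function is 0 by convention.
S = [0, 1]
P = {
    (0, "stay"): {0: 0.9},
    (0, "move"): {1: 1.0},
    (1, "stay"): {1: 0.8},
    (1, "move"): {0: 0.5, 1: 0.4},
}

def apply_fP(f_k, w):
    """Return the function i -> sum_j P(j | i, f_k(i)) * w(j)."""
    return {i: sum(q * w[j] for j, q in P[(i, f_k[i])].items()) for i in S}

def expected_u_at_N(f, i, u):
    """Compute f_0 P f_1 P ... f_{N-1} P u (i), where N = len(f)."""
    w = dict(u)                     # u, extended by u(p) = 0 implicitly
    for f_k in reversed(f):         # apply the operators from the right
        w = apply_fP(f_k, w)
    return w[i]

f = [{0: "move", 1: "stay"}, {0: "stay", 1: "move"}]   # f_0, f_1 (hypothetical)
u = {0: 1.0, 1: 2.0}
print(expected_u_at_N(f, 0, u))     # E_{0,f}[u(X_2)]
```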
A reward function $r$ is a measurable function on $S \times A$. The total return is defined by $\sum_{n=0}^{\infty} r(X_n,Y_n)$ if this sum exists. We are concerned with the expected total return with respect to $\mathbb{P}_{i,\pi}$. To define this we need the following lemma.
Lemma 1. Suppose $E_{i,\pi}\bigl[\sum_{n=0}^{\infty} r^{+}(X_n,Y_n)\bigr] < \infty$, $\pi \in \Pi$, $i \in S$. It holds that

a) $\sum_{n=0}^{\infty} r(X_n,Y_n)$ exists $\mathbb{P}_{i,\pi}$-a.s., and is $\mathbb{P}_{i,\pi}$-integrable;

b) $E_{i,\pi}\bigl[\sum_{n=0}^{\infty} r(X_n,Y_n)\bigr] = \sum_{n=0}^{\infty} E_{i,\pi}[r(X_n,Y_n)]$.
Proof. Since $\sum_{n=0}^{\infty} r^{+}(X_n,Y_n) < \infty$, $\mathbb{P}_{i,\pi}$-a.s., the existence of $\sum_{n=0}^{\infty} r(X_n,Y_n)$ is obvious. And since $\bigl(\sum_{n=0}^{\infty} r(X_n,Y_n)\bigr)^{+} \leq \sum_{n=0}^{\infty} r^{+}(X_n,Y_n)$ we have the integrability with respect to $\mathbb{P}_{i,\pi}$. By the monotone convergence theorem we have
$$\lim_{N \to \infty}\Bigl\{\sum_{n=0}^{N} E_{i,\pi}[r^{+}(X_n,Y_n)] - \sum_{n=0}^{N} E_{i,\pi}[r(X_n,Y_n)]\Bigr\} = E_{i,\pi}\Bigl[\sum_{n=0}^{\infty} r^{+}(X_n,Y_n)\Bigr] - E_{i,\pi}\Bigl[\sum_{n=0}^{\infty} r(X_n,Y_n)\Bigr] .$$
Since
$$\sum_{n=0}^{\infty} E_{i,\pi}[r^{+}(X_n,Y_n)] = E_{i,\pi}\Bigl[\sum_{n=0}^{\infty} r^{+}(X_n,Y_n)\Bigr] < \infty$$
we may subtract this quantity on both sides of the equality, which produces the statement. □
In the next section we need the following elementary selection theorem.

Lemma 2. Let $\pi(\cdot \mid i)$ be a probability on $(A,\mathcal{A})$ for each $i \in S$, and let $u$ be a $\pi$-integrable function such that
$$x(i) := \int \pi(da \mid i)\, u(i,a) < \infty \quad \text{for all } i \in S ;$$
then there is a Markov policy $f$ such that $x(i) \leq u(i,f(i))$, for all $i \in S$.

The proof is trivial.
In the rest of this paper we use the following assumption:

(A) $\displaystyle \sup_{f \in M} E_{i,f}\Bigl[\sum_{n=0}^{\infty} r^{+}(X_n,Y_n)\Bigr] < \infty$ for all $i \in S$.
2. The theorem
In this section we prove, among other things, the existence of the expected total return of a strategy $\pi \in \Pi$. Anticipating this result we shall use the following notations:
$$v_N(i,\pi) := E_{i,\pi}\Bigl[\sum_{n=0}^{N-1} r(X_n,Y_n)\Bigr], \quad N = 1,2,\ldots,\infty ,$$
$$v(i,\pi) := v_{\infty}(i,\pi) ,$$
$$w_N(i,\pi) := E_{i,\pi}\Bigl[\sum_{n=0}^{N-1} r^{+}(X_n,Y_n)\Bigr], \quad N = 1,2,\ldots,\infty ,$$
$$w(i,\pi) := w_{\infty}(i,\pi) .$$
The main result of this section is:
$$\sup_{\pi \in \Pi} v(i,\pi) = \sup_{f \in M} v(i,f) \quad \text{for all } i \in S .$$
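As a small illustration of these notations, the sketch below computes $v_N(i,f)$ and $w_N(i,f)$ for a Markov strategy on an assumed finite toy model (all states, actions, probabilities and rewards are hypothetical) by propagating the law of $X_n$ forward; mass absorbed in $p$ contributes nothing, in accordance with the convention that functions vanish in $(p,a)$.

```python
# Sketch (assumed finite S, hypothetical numbers): compute
# v_N(i,f) = E_{i,f}[sum_{n<N} r(X_n,Y_n)] and w_N(i,f) (same with r^+)
# for a Markov strategy f by propagating the substochastic state
# distribution forward; mass lost to the absorbing state p earns 0.
S = [0, 1]
P = {
    (0, "stay"): {0: 0.9},
    (0, "move"): {1: 1.0},
    (1, "stay"): {1: 0.8},
    (1, "move"): {0: 0.5, 1: 0.4},
}
r = {(0, "stay"): 1.0, (0, "move"): -2.0, (1, "stay"): 0.5, (1, "move"): 3.0}

def partial_returns(f, i, N):
    """Return (v_N(i,f), w_N(i,f)) for the Markov strategy f = (f_0, f_1, ...)."""
    dist = {j: (1.0 if j == i else 0.0) for j in S}   # law of X_n restricted to S
    v = w = 0.0
    for n in range(N):
        a = {j: f[n][j] for j in S}
        v += sum(dist[j] * r[(j, a[j])] for j in S)
        w += sum(dist[j] * max(r[(j, a[j])], 0.0) for j in S)
        new = {j: 0.0 for j in S}
        for j in S:
            for k, q in P[(j, a[j])].items():
                new[k] += dist[j] * q
        dist = new
    return v, w

f = [{0: "stay", 1: "move"}] * 5          # a stationary Markov strategy
print(partial_returns(f, 0, 5))
```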
Lemma 3. Suppose $E_{i,\pi}[r^{+}(X_n,Y_n)] < \infty$ for all $n$, all $\pi \in RM$ and all $i \in S$. Let $u$ be a nonnegative function on $S$ and $N$ an arbitrary nonnegative integer. For all $\pi \in RM$ there exists an $f \in M$ such that
$$\sum_{n=0}^{N-1} f_0 P f_1 \cdots P f_n r + f_0 P f_1 \cdots P f_{N-1} P(-u) \;\geq\; \sum_{n=0}^{N-1} \pi_0 P \pi_1 \cdots P \pi_n r + \pi_0 P \pi_1 \cdots P \pi_{N-1} P(-u) .$$

Proof. Define
$$z := \sum_{n=0}^{N-1} \pi_0 P \pi_1 \cdots P \pi_n r + \pi_0 P \pi_1 \cdots P \pi_{N-1} P(-u) .$$
Since $\pi_{N-1}\{r + P(-u)\} < \infty$ we may apply lemma 2 and we find a Markov policy $f_{N-1}$ such that
$$u_{N-1} := f_{N-1}\{r + P(-u)\} \;\geq\; \pi_{N-1}\{r + P(-u)\} .$$
Note that $u_{N-1}$ is integrable with respect to $\pi_0 P \pi_1 \cdots \pi_{N-2} P$ since $u_{N-1} \leq f_{N-1} r^{+}$ and $\pi_0 P \pi_1 \cdots \pi_{N-2} P f_{N-1} r^{+} < \infty$.
With backward induction we shall construct Markov policies $f_{N-2}, f_{N-3}, \ldots, f_0$. Suppose we found $f_{N-2}, f_{N-3}, \ldots, f_k$ such that for
$$u_k := \sum_{n=k}^{N-1} f_k P f_{k+1} \cdots P f_n r + f_k P f_{k+1} \cdots f_{N-1} P(-u)$$
it holds that
$$\sum_{n=0}^{k-1} \pi_0 P \pi_1 \cdots P \pi_n r + \pi_0 P \pi_1 \cdots P \pi_{k-1} P u_k \;\geq\; z .$$
Note that $\pi_{k-1} r + \pi_{k-1} P u_k < \infty$ since $u_k \leq \sum_{n=k}^{N-1} f_k P f_{k+1} \cdots P f_n r^{+}$. Apply lemma 2 again to find a Markov policy $f_{k-1}$ such that
$$u_{k-1} := f_{k-1} r + f_{k-1} P u_k \;\geq\; \pi_{k-1} r + \pi_{k-1} P u_k .$$
By the monotonicity of the operators $\pi_n P$ we have
$$\sum_{n=0}^{k-2} \pi_0 P \pi_1 \cdots P \pi_n r + \pi_0 P \pi_1 \cdots P \pi_{k-2} P u_{k-1} \;\geq\; z .$$
Hence the Markov strategy $f$ defined by $(f_0, f_1, \ldots, f_{N-1})$ for the first $N$ time periods has the desired property. □
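For a finite action space the selection in lemma 2 can be taken to be a maximization, and the backward construction above then becomes an ordinary dynamic programming recursion. The following sketch is a minimal illustration under that assumption (finite $S$ and $A$, hypothetical numbers); it builds policies $f_{N-1},\ldots,f_0$ and returns the function $u_0$, which then dominates the corresponding expression $z$ for every randomized Markov strategy.

```python
# Minimal sketch of the backward construction in lemma 3 for a finite toy
# model (states, actions, probabilities, rewards and u are hypothetical).
# Choosing a maximizing action at every stage plays the role of lemma 2.
S = [0, 1]
A = ["stay", "move"]
P = {
    (0, "stay"): {0: 0.9},
    (0, "move"): {1: 1.0},
    (1, "stay"): {1: 0.8},
    (1, "move"): {0: 0.5, 1: 0.4},
}
r = {(0, "stay"): 1.0, (0, "move"): -2.0, (1, "stay"): 0.5, (1, "move"): 3.0}

def construct_markov_policies(N, u):
    """Return (f_0,...,f_{N-1}) and u_0 from the backward recursion
    u_k(i) = max_a { r(i,a) + sum_j P(j|i,a) u_{k+1}(j) }, with u_N = -u."""
    w = {i: -u[i] for i in S}                      # u_N := -u
    policies = []
    for _ in range(N):                             # k = N-1, N-2, ..., 0
        f_k, new_w = {}, {}
        for i in S:
            value = lambda a: r[(i, a)] + sum(q * w[j]
                                              for j, q in P[(i, a)].items())
            f_k[i] = max(A, key=value)
            new_w[i] = value(f_k[i])
        policies.append(f_k)
        w = new_w
    policies.reverse()                             # we built f_{N-1} first
    return policies, w                             # w is u_0

f, u0 = construct_markov_policies(3, {0: 0.0, 1: 1.0})
print(f, u0)
```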
Lemma 4. For all $i \in S$ it holds that
$$\sup_{f \in M} w(i,f) = \sup_{\pi \in RM} w(i,\pi) .$$

Proof. We only have to show
$$a(i) := \sup_{f \in M} w(i,f) \;\geq\; \sup_{\pi \in RM} w(i,\pi) =: b(i) .$$

1) Suppose $b(i) < \infty$ for all $i \in S$. Fix $i_0 \in S$. For each $\varepsilon > 0$ there is a $\pi \in RM$ and some integer $N$ such that
$$w_N(i_0,\pi) \;\geq\; b(i_0) - \varepsilon .$$
Applying lemma 3 (with $r^{+}$ in the role of $r$ and $u \equiv 0$) we find an $f \in M$ with $w_N(i_0,f) \geq w_N(i_0,\pi)$, hence $a(i_0) \geq w(i_0,f) \geq b(i_0) - \varepsilon$. Since $\varepsilon$ was arbitrary, $a(i_0) \geq b(i_0)$.
2) Suppose $b(i_0) = \infty$ for some $i_0 \in S$. First we note that for all $\pi = (\pi_0,\pi_1,\pi_2,\ldots) \in RM$ we have $\pi_n r^{+} < \infty$. Suppose the contrary, i.e. $\pi_n r^{+}(j) = \infty$ for some $j \in S$. This implies for all $k > 0$ the existence of a Markov policy $f_0$, and even so the existence of a Markov strategy $f$, such that
$$w(j,f) \;\geq\; f_0 r^{+}(j) \;\geq\; k ,$$
which contradicts our assumption (A). In exactly the same way we may prove that
$$\pi_n\{r^{+} + P f_{n+1} r^{+} + \cdots + P f_{n+1} \cdots P f_{N-1} r^{+}\} < \infty .$$
It follows from $b(i_0) = \infty$ that there is for each $k$ a $\pi \in RM$ and an integer $N$ such that
$$w_N(i_0,\pi) \;\geq\; k .$$
Completely analogous to the proof of lemma 3 we find by backward induction a sequence of Markov policies $f_{N-1}, f_{N-2}, \ldots, f_0$, and therefore a Markov strategy $f$ starting with $(f_0,f_1,\ldots,f_{N-1})$, such that
$$w_N(i_0,f) \;\geq\; w_N(i_0,\pi) \;\geq\; k .$$
This implies that $a(i_0) = \infty$, which produces a contradiction with assumption (A). □
It is a direct consequence of lemma 4 that for all $\pi \in RM$ and all $i \in S$, $v(i,\pi)$ exists and moreover that
$$\sup_{\pi \in RM} v(i,\pi) < \infty \quad \text{for all } i \in S .$$

Lemma 5. For all $i \in S$:
$$\sup_{f \in M} v(i,f) = \sup_{\pi \in RM} v(i,\pi) =: v(i) .$$

Proof. Fix $i_0 \in S$. Choose $\varepsilon > 0$. There must be a $\pi \in RM$ with $v(i_0,\pi) \geq v(i_0) - \varepsilon$ and an integer $N$ such that
$$E_{i_0,\pi}\Bigl[\sum_{n=N}^{\infty} r^{+}(X_n,Y_n)\Bigr] < \varepsilon .$$
Hence
$$v_N(i_0,\pi) - E_{i_0,\pi}\Bigl[\sum_{n=N}^{\infty} r^{-}(X_n,Y_n)\Bigr] \;\geq\; v(i_0) - 2\varepsilon .$$
Define $\pi^N \in RM$ by $\pi^N_n := \pi_{N+n}$, $n = 0,1,2,\ldots$, and the function $u$ on $S$ by
$$u(i) := E_{i,\pi^N}\Bigl[\sum_{n=0}^{\infty} r^{-}(X_n,Y_n)\Bigr] .$$
Then we have
$$\sum_{n=0}^{N-1} \pi_0 P \pi_1 \cdots P \pi_n r(i_0) + \pi_0 P \pi_1 \cdots P \pi_{N-1} P(-u)(i_0) \;\geq\; v(i_0) - 2\varepsilon .$$
By lemma 3 we know the existence of $f = (f_0,f_1,f_2,\ldots) \in M$ such that
$$\sum_{n=0}^{N-1} f_0 P f_1 \cdots P f_n r(i_0) + f_0 P f_1 \cdots P f_{N-1} P(-u)(i_0) \;\geq\; v(i_0) - 2\varepsilon .$$
Define the probability $p$ on $S^*$ by
$$p(j) := \mathbb{P}_{i_0,f}[X_N = j] .$$
(Hence $E_{i_0,f}[u(X_N)] = \sum_{j \in S} p(j) u(j)$.)
It follows from negative dynamic programming [Strauch, th. 4.3] that there exists a $g = (g_0,g_1,g_2,\ldots) \in M$ such that
$$\sum_{j \in S} p(j) \cdot E_{j,g}\Bigl[\sum_{n=0}^{\infty} r^{-}(X_n,Y_n)\Bigr] \;\leq\; \sum_{j \in S} p(j) \cdot u(j) .$$
Let $f^* \in M$ be defined by $f^* = (f_0,f_1,\ldots,f_{N-1},g_0,g_1,\ldots)$; then it is clear that
$$v(i_0,f^*) \;\geq\; v_N(i_0,f) - \sum_{j \in S} p(j) \cdot E_{j,g}\Bigl[\sum_{n=0}^{\infty} r^{-}(X_n,Y_n)\Bigr] \;\geq\; v_N(i_0,f) - \sum_{j \in S} p(j) u(j) \;\geq\; v(i_0) - 2\varepsilon .$$
This proves the lemma. □
It follows from a theorem of [Derman and Strauch], which is generalized by [Hordijk] for a countable state space and a measurable action space, that for a fixed $i \in S$ and an arbitrary $\pi \in \Pi$ there is a $\pi^* \in RM$ such that
$$\mathbb{P}_{i,\pi}[X_n = j,\, Y_n \in A_0] = \mathbb{P}_{i,\pi^*}[X_n = j,\, Y_n \in A_0] \quad \text{for all } j \in S,\ A_0 \in \mathcal{A} .$$
Hence
$$E_{i,\pi}\Bigl[\sum_{n=0}^{\infty} r^{+}(X_n,Y_n)\Bigr] = E_{i,\pi^*}\Bigl[\sum_{n=0}^{\infty} r^{+}(X_n,Y_n)\Bigr] < \infty ,$$
which proves the existence of $v(i,\pi)$ and that $v(i,\pi) = v(i,\pi^*)$. The next theorem is now a direct consequence of lemma 5.
Theorem. For all $i \in S$:
$$\sup_{\pi \in \Pi} v(i,\pi) = \sup_{f \in M} v(i,f) .$$

3. Remarks.
a) The theorem we proved in section 2 is applicable to the well-known models of positive, negative and discounted dynamic programming, provided that the state space is countable. Note that the discount factor, if it is less than one, is included in the absorption probability of the state $p$.
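A quick numerical check of this remark, on an assumed two-state example with a fixed policy (all numbers hypothetical): multiplying every transition probability by $\beta < 1$, and letting the missing mass $1-\beta$ go to the absorbing state $p$, gives a chain whose undiscounted expected total return coincides with the $\beta$-discounted return of the original chain.

```python
# Remark a) in a toy example (hypothetical numbers): discounting by
# beta < 1 is the same as shrinking every transition probability by the
# factor beta, i.e. sending the extra mass 1 - beta to the absorbing
# state p, where all rewards are 0.
beta = 0.9
P_f = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}   # transitions under a fixed f
r_f = {0: 1.0, 1: -2.0}                            # one-step rewards under f
S = [0, 1]
N = 200                                            # truncation horizon

def expected_return(trans, discount):
    """sum_{n<N} discount^n * E_0[r(X_n)] for the chain 'trans' started in 0."""
    dist, total = {0: 1.0, 1: 0.0}, 0.0
    for n in range(N):
        total += (discount ** n) * sum(dist[i] * r_f[i] for i in S)
        dist = {j: sum(dist[i] * trans[i].get(j, 0.0) for i in S) for j in S}
    return total

damped = {i: {j: beta * q for j, q in row.items()} for i, row in P_f.items()}
print(expected_return(P_f, beta))      # discounted return, original chain
print(expected_return(damped, 1.0))    # undiscounted return, damped chain
```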
b) In the convergent dynamic programming model, developed by [Hordijk], in which it is assumed that
$$E_{i,\pi}\Bigl[\sum_{n=0}^{\infty} |r(X_n,Y_n)|\Bigr] < \infty \quad \text{for all } \pi \in RM,\ \text{for all } i \in S ,$$
the theorem is also valid, since in this situation assumption (A) is also fulfilled. (The proof of this statement may be found using the arguments of [Hordijk, corollary 13.3].)
c) If we allow a discount factor $\beta > 1$ such that
$$\sup_{f \in M} E_{i,f}\Bigl[\sum_{n=0}^{\infty} \beta^{n} r^{+}(X_n,Y_n)\Bigr] < \infty \quad \text{for all } i \in S ,$$
then the theorem remains valid for the discounted total return $\sum_{n=0}^{\infty} \beta^{n} r(X_n,Y_n)$.
d) For our model Bellman's optimality equation is valid, i.e.
$$v(i) := \sup_{f \in M} v(i,f)$$
satisfies the equation
$$v(i) = \sup_{a \in A}\Bigl\{r(i,a) + \sum_{j \in S} P(j \mid i,a)\, v(j)\Bigr\} .$$
(To prove this use the arguments of [Hordijk, th. 3.1 and th. 3.5].)
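To illustrate remark d), the sketch below solves the optimality equation by straightforward value iteration on an assumed finite toy model (all numbers hypothetical); the transition function is substochastic and under every policy the process is eventually absorbed in $p$, so the iteration settles down numerically.

```python
# Value iteration for Bellman's optimality equation
# v(i) = sup_a { r(i,a) + sum_j P(j|i,a) v(j) }
# on a hypothetical finite toy model.
S = [0, 1]
A = ["stay", "move"]
P = {
    (0, "stay"): {0: 0.9},              # substochastic: deficit is absorption in p
    (0, "move"): {1: 1.0},
    (1, "stay"): {1: 0.8},
    (1, "move"): {0: 0.5, 1: 0.4},
}
r = {(0, "stay"): 1.0, (0, "move"): -2.0, (1, "stay"): 0.5, (1, "move"): 3.0}

def bellman_update(v):
    """One application of the optimality operator to the function v on S."""
    return {i: max(r[(i, a)] + sum(q * v[j] for j, q in P[(i, a)].items())
                   for a in A) for i in S}

v = {i: 0.0 for i in S}
for _ in range(500):                    # iterate until (numerically) fixed
    v = bellman_update(v)
print(v)
```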
Acknowledgement.
The author wishes to thank dr. A. Hordijk for stimulating discussions.
4. References
Derman, C. and R. Strauch; A note on memoryless rules for controlling sequential control processes. Ann. Math. Statist. 37 (1966), 276-278.

Hordijk, A.; Dynamic programming and Markov potential theory. Amsterdam, Mathematisch Centrum, 1974. M.C. Tract no. 51.

Strauch, R.; Negative dynamic programming. Ann. Math. Statist. 37 (1966), 871-890.