Markov decision processes and quasi-martingales
Citation for published version (APA):
Groenewegen, L. P. J., & Hee, van, K. M. (1976). Markov decision processes and quasi-martingales. (Memorandum COSOR; Vol. 7604). Technische Hogeschool Eindhoven.
Document status and date: Published: 01/01/1976
EINDHOVEN UNIVERSITY OF TECHNOLOGY
Department of Mathematics
Probability Theory, Statistics and Operations Research Group

Memorandum COSOR 76-04

Markov decision processes and quasi-martingales

by

L.P.J. Groenewegen and K.M. van Hee

Eindhoven, February 1976
The Netherlands
0. Abstract

In this paper it is shown that quasi-(super)martingales play an important role in the theory of Markov decision processes. For excessive functions (with respect to a charge) it is proved that the value of the state at time t converges almost surely under each Markov strategy, which implies that the value function in the state at time t converges to zero (a.s.) if an optimal strategy is used. Finally, a characterization of the conserving and equalizing properties is formulated using martingale theory.
1. Introduction

In this section the framework of convergent dynamic programming, see Hordijk (1974a, 1974b), is sketched.
A Markov decision process will be a triple (S, P, r), where S is a countable set, called the state space, r is a real measurable function on S × P, called the reward function, and P is a Borel subset of E, where E is the set of all Markov transition functions on S, i.e. P ∈ E implies

P : S × S → [0,1]   with   Σ_{j∈S} P(i,j) ≤ 1 for all i ∈ S.

It is assumed that E is endowed with a metric such that E is a Polish space. We will use the following notational convention for functions g on S × P: g_P(i) := g(i,P), and we assume that all functions on S × P are measurable on P. It is assumed that P and r have the following (product) properties: let P_1, P_2, P_3, ... ∈ P and A_1, A_2, A_3, ... ⊂ S; then there is a P ∈ P such that for all i ∈ S: P(i,·) = P_k(i,·) if i ∈ A_k; moreover r_{P_1}(i) = r_{P_2}(i) whenever P_1(i,·) = P_2(i,·).
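The product property can be pictured concretely with a small sketch (ours, not part of the memorandum): represent substochastic kernels as dicts of dicts and build the combined kernel row by row. The helper `combine` and the toy kernels are invented for illustration, and the sets A_k are assumed to partition S.

```python
# Illustrative sketch (not from the memorandum) of the product property:
# given substochastic kernels P_k on a countable S, with reward rows r_k,
# and sets A_k partitioning S, build the combined kernel P with
# P(i,.) = P_k(i,.) for i in A_k, inheriting the reward of the chosen row.

def combine(kernels_rewards, partition):
    """kernels_rewards: list of (P_k, r_k), P_k: dict i -> dict j -> prob,
    r_k: dict i -> reward; partition: list of disjoint sets A_k covering S."""
    P, r = {}, {}
    for (Pk, rk), Ak in zip(kernels_rewards, partition):
        for i in Ak:
            P[i] = Pk[i]          # row i taken from P_k
            r[i] = rk[i]          # reward consistent with the chosen row
    return P, r

# toy example on S = {0, 1}; row sums <= 1 (missing mass = absorption)
P1 = {0: {0: 1.0}, 1: {0: 0.5}}
P2 = {0: {1: 1.0}, 1: {1: 1.0}}
r1 = {0: 0.0, 1: 2.0}
r2 = {0: 1.0, 1: 3.0}
P, r = combine([(P1, r1), (P2, r2)], [{0}, {1}])
print(P)   # {0: {0: 1.0}, 1: {1: 1.0}}
print(r)   # {0: 0.0, 1: 3.0}
```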
Any sequence R = (P_0, P_1, P_2, ...) of elements of P is called a Markov strategy; the set of all Markov strategies is denoted by M. The state space S is extended to S* by adding a state ρ, S* := S ∪ {ρ}, and all P ∈ P are extended to S* by

P(i,ρ) := 1 − Σ_{j∈S} P(i,j)   and   P(ρ,ρ) := 1.

All functions on S are extended to S* by defining them 0 in ρ. Let F_n be the usual σ-field on (S*)^∞ generated by the first n + 1 coordinates of the paths, n = 0,1,2,..., and let F be the σ-field generated by ∪_{n=0}^∞ F_n. Now {X_n, n = 0,1,2,...} is a stochastic process on ((S*)^∞, F, ℙ_{i,R}) for each (i,R), where X_n(ω) selects the (n + 1)-th coordinate of ω for all ω ∈ (S*)^∞. The expectation with respect to ℙ_{i,R} is denoted by E_{i,R}. A measurable function f is said to be integrable w.r.t. ℙ_{i,R} if at least one of the terms E_{i,R} f^+ and E_{i,R} f^− is finite, and summable if both are finite. If f(X_n) is integrable w.r.t. ℙ_{i,R} the expectation may be evaluated as follows:

E_{i,R}[f(X_n)] = P_0 ··· P_{n−1} f(i)   for R = (P_0, P_1, ...)

(an empty product of Markov transition functions is defined as the identity operator).

Definition 1.1.
i) A function g : S × P → ℝ is called a charge iff

Σ_{n=0}^∞ E_{i,R}[|g_{P_n}(X_n)|] < ∞   for all i ∈ S, R ∈ M.
Let g be such a charge.
ii) A function f : S → ℝ is called superharmonic w.r.t. (a charge) g iff f(X_n) is integrable w.r.t. ℙ_{i,R} for all i, R, n and

f ≥ g_P + Pf   for all P ∈ P.

iii) A function f : S → ℝ is called excessive w.r.t. (a charge) g iff f is superharmonic w.r.t. g and

lim_{n→∞} E_{i,R}[f(X_n)] ≥ 0   for all i ∈ S, R ∈ M.
Assumption 1.2.
i) The reward function r is a charge, and
ii) sup_{R∈M} E_{i,R}[Σ_{n=0}^∞ r^+_{P_n}(X_n)] < ∞ for all i ∈ S

(recall: x^+ := max(0,x), x^− := (−x)^+).

Definition 1.3.
i) The value function v of (S, P, r) is the real function on S given by

v(i) := sup_{R∈M} E_{i,R}[Σ_{n=0}^∞ r_{P_n}(X_n)]   for all i ∈ S.

ii) A strategy R ∈ M is called optimal if this supremum is attained for R in all i ∈ S.

In this paper it will be shown, using some theory on supermartingales, that each function f superharmonic w.r.t. a charge g has the property that f(X_n) converges ℙ_{i,R}-almost surely for all i and R. In particular the value function v(X_n) converges almost surely to zero under each optimal strategy. As a last result we give a slight extension of the following theorem of Hordijk (1974a), which is based on a result of Dynkin and Juschkewitsch (1969): if τ_1 and τ_2 are stopping times for the sequence of σ-fields {F_n, n = 0,1,...}, then τ_1 ≤ τ_2 ℙ_{i,R}-a.s. implies

E_{i,R}[v(X_{τ_1})] ≥ E_{i,R}[v(X_{τ_2})].
In section 2 some theory on quasi-martingales is developed, and in section 3 this theory is applied to Markov decision processes. Most of the lemmas used in this paper are well known, but the authors do not know of any place in the literature where these facts were combined to obtain the results mentioned above.
2. Quasi-martingales
Quasi-martingales have been introduced by Fisk (1965) as continuous time stochastic processes having a decomposition into the sum of a martingale and a process having almost all sample functions of bounded variation. In
this paper we give essentially the same definition for the discrete time case.
Let ℕ be the set {0,1,2,...}, let (Ω, A, ℙ) be a probability space and {A_t, t ∈ ℕ} an increasing sequence of σ-fields contained in A. All stochastic processes in this section are defined on (Ω, A, ℙ) and have values in the set of real numbers with the Borel σ-field on it. Moreover they are adapted to {A_t, t ∈ ℕ}, i.e. the σ-field generated by the first n + 1 coordinates of the paths is a subset of A_n, n ∈ ℕ. (The conditional expectation w.r.t. A_t is denoted by E^{A_t}.)
Definition 2.1.
Let {B_t, t ∈ ℕ} be a stochastic process such that Σ_{t∈ℕ} |B_t| < ∞ ℙ-a.s. A stochastic process {V_t, t ∈ ℕ} is called a quasi-(super)martingale (QSPM) if there exists a (super)martingale {S_t, t ∈ ℕ} such that

V_t = S_t + Σ_{k=0}^{t−1} B_k   ℙ-a.s.

({V_t, t ∈ ℕ} is said to be a QSPM w.r.t. {B_t, t ∈ ℕ}.) In Fisk's paper the process {B_t, t ∈ ℕ} of definition 2.1 is called a process of bounded variation.

Lemma 2.2.
Let {B_t, t ∈ ℕ} and {V_t, t ∈ ℕ} be stochastic processes with Σ_{t∈ℕ} |B_t| < ∞ ℙ-a.s. Then {V_t, t ∈ ℕ} is a QSPM w.r.t. {B_t, t ∈ ℕ} iff

E^{A_t}[V_{t+1}] ≤ V_t + B_t   ℙ-a.s., t ∈ ℕ.

Proof. Define S_t := V_t − Σ_{k=0}^{t−1} B_k.
i) Suppose E^{A_t}[V_{t+1}] ≤ V_t + B_t. Then

E^{A_t}[S_{t+1}] = E^{A_t}[V_{t+1}] − Σ_{k=0}^{t} B_k ≤ V_t + B_t − Σ_{k=0}^{t} B_k = V_t − Σ_{k=0}^{t−1} B_k = S_t.

Hence {S_t, t ∈ ℕ} is a supermartingale.
ii) Conversely, suppose {V_t, t ∈ ℕ} is a QSPM w.r.t. {B_t, t ∈ ℕ}. Then

E^{A_t}[V_{t+1}] = E^{A_t}[S_{t+1}] + Σ_{k=0}^{t} B_k ≤ S_t + Σ_{k=0}^{t} B_k = V_t + B_t   ℙ-a.s. □

For a quasi-martingale the same characterization holds with equality. In lemma 2.3 it is shown that quasi-(super)martingales converge ℙ-a.s. under a condition analogous to that for supermartingales.
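The characterization in lemma 2.2 can be checked exactly on a finite example (ours, not part of the memorandum): on Ω = {0,1}^3 with fair coin flips, take a ±1 random walk as the martingale part S_t, a deterministic sequence B_t, and V_t := S_t + Σ_{k<t} B_k; conditional expectation w.r.t. A_t is then an average over each atom of the first t coordinates.

```python
# Exact finite-horizon illustration of lemma 2.2 (ours, not from the text):
# with S_t a fair +/-1 random walk (a martingale) and deterministic B_t,
# V_t = S_t + sum_{k<t} B_k satisfies E[V_{t+1} | A_t] = V_t + B_t on every
# atom of A_t, so {V_t} is a quasi-martingale w.r.t. {B_t}.
from itertools import product

omega_space = list(product([0, 1], repeat=3))   # all paths, each prob 1/8
B = [0.5, -0.25, 0.125]                         # deterministic B_t (A_t-measurable)

def S(w, t):                                    # martingale part: +/-1 walk
    return sum(1 if x else -1 for x in w[:t])

def V(w, t):                                    # V_t = S_t + sum_{k<t} B_k
    return S(w, t) + sum(B[:t])

for t in range(3):
    # atoms of A_t are the sets of paths sharing the first t coordinates
    for prefix in product([0, 1], repeat=t):
        atom = [w for w in omega_space if w[:t] == prefix]
        cond_exp = sum(V(w, t + 1) for w in atom) / len(atom)
        w0 = atom[0]                            # V_t, B_t constant on the atom
        assert abs(cond_exp - (V(w0, t) + B[t])) < 1e-12
print("E[V_{t+1} | A_t] = V_t + B_t on every atom")
```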
Lemma 2.3.
Let {B_t, t ∈ ℕ} and {V_t, t ∈ ℕ} be as in lemma 2.2 with E^{A_t}[V_{t+1}] ≤ V_t + B_t ℙ-a.s. Assume furthermore
i) limsup_{t∈ℕ} E[V_t^−] < ∞ and
ii) E[Σ_{k=0}^∞ B_k^+] < ∞.
Then V_t converges ℙ-a.s.

Proof. Let S_t := V_t − Σ_{k=0}^{t−1} B_k, t ∈ ℕ. Then

limsup_{t∈ℕ} E[S_t^−] ≤ limsup_{t∈ℕ} E[V_t^−] + limsup_{t∈ℕ} E[Σ_{k=0}^{t−1} B_k^+] < ∞.

Since {S_t, t ∈ ℕ} is a supermartingale it follows from the convergence theorem on supermartingales (see e.g. Neveu (1972), IV-1-2) that S_t converges ℙ-a.s. Also Σ_{k=0}^{t−1} B_k converges a.s., hence V_t does. □
Remark.
From the proof of the cited theorem of Neveu it can be seen that Neveu's condition sup_{t∈ℕ} E[V_t^−] < ∞ can be replaced by limsup_{t∈ℕ} E[V_t^−] < ∞.
The next lemma is not really used in the rest of the paper, but it shows that the requirements of lemma 2.3 almost imply that E[Σ_{n=0}^∞ |B_n|] < ∞.

Lemma 2.4.
Assume in addition to the assumptions of lemma 2.3 that V_0 is summable. Then it holds that

E[Σ_{n=0}^∞ |B_n|] < ∞.
Proof. Let S_t be defined as in the proof of lemma 2.3. Note that E[S_t] ≤ E[S_0]. Let M := limsup_{t∈ℕ} E[V_t^−] and note that limsup_{t∈ℕ} E[V_t] + M ≥ 0. Then

E[S_0] + E[Σ_{k=0}^{t−1} B_k^+] + M ≥ E[V_t + M] + E[Σ_{k=0}^{t−1} B_k^−],

hence, by taking the limsup of both sides, we have

E[S_0] + E[Σ_{k=0}^∞ B_k^+] + M ≥ E[Σ_{k=0}^∞ B_k^−].

Since E[S_0] = E[V_0] < ∞ it holds that E[Σ_{k=0}^∞ B_k^−] < ∞, which together with assumption ii) of lemma 2.3 proves the lemma. □
This section ends with a property of regular supermartingales. The definition of regularity given below is equivalent to the usual one (see e.g. Neveu (1972) IV-5-24). Let {S_t, t ∈ ℕ} be a supermartingale and let τ_1 and τ_2 be two stopping times w.r.t. {A_t, t ∈ ℕ}; A_∞ is the σ-field generated by ∪_n A_n and, for a stopping time τ,

A_τ := {B ∈ A_∞ | B ∩ {τ = n} ∈ A_n, n ∈ ℕ}.

Definition 2.5.
The supermartingale {S_t, t ∈ ℕ} is called regular iff the sequence {S_t^−, t ∈ ℕ} converges in L^1-sense.

Property 2.6.
Let {S_t, t ∈ ℕ} be a regular supermartingale and let τ_1, τ_2 be stopping times. Then S_{τ_1} and S_{τ_2} are integrable and

E^{A_{τ_1}}[S_{τ_2}] ≤ S_{τ_1}   ℙ-a.s. on {τ_1 ≤ τ_2}

(for a proof see Neveu (1972) IV-5-25).
3. Some consequences for Markov decision processes.

In this section we return to the model described in section 1. We first give a quick survey of the properties of this model which are relevant to our exposition here.

Properties.
3.1. v(i) = sup_{π∈Π} E_{i,π}[Σ_{n=0}^∞ r_{P_n}(X_n)] for all i ∈ S, where Π is the set of all strategies (see Van Hee (1975) for the definition of Π, E_{i,π} and the proof). For this property assumption 1.2 ii) is required.

3.2. The value function v satisfies Bellman's optimality equation:

v(i) = sup_{P∈P} {r_P(i) + Pv(i)}.

This statement is a standard consequence of 3.1; the proof is similar to those in Ross (1970), th. 6.1, or Hordijk (1974a), th. 3.1 and 3.5.
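Bellman's optimality equation lends itself to computing v by successive approximation, v_{n+1}(i) = sup_P {r_P(i) + P v_n(i)}. A minimal sketch (ours, not from the memorandum; the two-state toy MDP and its numbers are invented, with substochastic rows so the missing probability mass is absorption in ρ):

```python
# Hedged sketch (ours) of solving v(i) = sup_P { r_P(i) + P v(i) } by value
# iteration on a small finite MDP; rows may sum to less than 1 (absorption).

# two "actions" per state, each a (row of P, reward) pair; S = {0, 1}
actions = {
    0: [({0: 0.0}, 1.0),          # "stop": reward 1, then absorbed in rho
        ({1: 1.0}, 0.0)],         # move to state 1, no reward
    1: [({1: 0.0}, 3.0),          # "stop": reward 3
        ({0: 0.9}, 0.0)],         # back to 0 w.p. 0.9 (0.1 to rho)
}

v = {i: 0.0 for i in actions}
for _ in range(200):               # successive approximation
    v = {i: max(r + sum(p * v[j] for j, p in row.items())
                for row, r in actions[i])
         for i in actions}

print(v)
```

Here the fixed point is v(0) = v(1) = 3: in state 1 stopping yields 3, and from state 0 it is best to move to 1 first.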
3.3. Let g be a charge and f a superharmonic function w.r.t. g. Then lim_{n→∞} E_{i,R}[f(X_n)] exists for all R ∈ M, i ∈ S, and the following assertions are equivalent:
i) f is excessive w.r.t. g;
ii) lim_{n→∞} E_{i,R}[f^−(X_n)] = 0 for all R ∈ M, i ∈ S;
iii) lim_{n→∞} E_{i,R}[f(X_n)] ≥ 0 for all R ∈ M, i ∈ S.
For a proof see Hordijk (1974a), th. 2.17.
Remark 3.4.
It is obvious from 3.2 that the value function v is superharmonic w.r.t. r, and by its definition it is clear that

v(i) ≥ E_{i,R}[Σ_{n=0}^∞ r_{P_n}(X_n)]   for all i ∈ S, R ∈ M,

hence v is excessive w.r.t. r.
Remark 3.5.
If f is excessive w.r.t. a charge g it holds that

lim_{n→∞} E_{i,R}[|f(X_n)|] = lim_{n→∞} E_{i,R}[f(X_n)]

by 3.3 ii).
Lemma 3.6.
Let f be a superharmonic function w.r.t. g. Then for all R ∈ M, t ∈ ℕ and i ∈ S it holds that

E^{F_t}_{i,R}[f(X_{t+1})] ≤ f(X_t) − g_{P_t}(X_t)   ℙ_{i,R}-a.s.

Proof. It is clear that (P_t f)(X_t) is F_t-measurable, and since ℙ_{i,R}[X_{t+1} = j | F_t] = P_t(X_t, j) ℙ_{i,R}-a.s. it holds that

E^{F_t}_{i,R}[f(X_{t+1})] = (P_t f)(X_t)   ℙ_{i,R}-a.s.

Hence, by the superharmonicity of f, the statement follows. □
The main results of this paper are easy to prove now.
Theorem 3.7.
Let f be an excessive function w.r.t. a charge g. For any i ∈ S, R ∈ M, {f(X_t), t ∈ ℕ} is a quasi-supermartingale w.r.t. {−g_{P_k}(X_k), k ∈ ℕ}, and f(X_t) converges ℙ_{i,R}-a.s. (for t → ∞).

Proof. Since g is a charge we have

Σ_{k=0}^∞ |g_{P_k}(X_k)| < ∞   ℙ_{i,R}-a.s.

Fix i ∈ S, R ∈ M. By lemma 3.6 we have

E^{F_t}_{i,R}[f(X_{t+1})] ≤ f(X_t) − g_{P_t}(X_t)   ℙ_{i,R}-a.s.,

so lemma 2.2 shows that {f(X_t), t ∈ ℕ} is a QSPM w.r.t. {−g_{P_k}(X_k), k ∈ ℕ}. From g being a charge and property 3.3 ii) it follows that all conditions of lemma 2.3 are fulfilled, which proves the theorem. □
Theorem 3.8.
Let f be an excessive function w.r.t. a charge g. The supermartingale

{f(X_t) + Σ_{k=0}^{t−1} g_{P_k}(X_k), t ∈ ℕ}

is regular.

Proof. Fix i ∈ S, R ∈ M and let S_t := f(X_t) + Σ_{k=0}^{t−1} g_{P_k}(X_k). By theorem 3.7 we have that {S_t, t ∈ ℕ} is a supermartingale, so we only have to check that S_t^− converges in L^1-sense. We have

S_t ≥ f(X_t) − Σ_{k=0}^{t−1} g^−_{P_k}(X_k),

hence

S_t^− − f^−(X_t) ≤ Σ_{k=0}^{t−1} g^−_{P_k}(X_k) ≤ Σ_{k=0}^∞ g^−_{P_k}(X_k).

On the other hand, since (a + b)^− ≥ a^− − b^+,

S_t^− = [f(X_t) + Σ_{k=0}^{t−1} g_{P_k}(X_k)]^− ≥ f^−(X_t) − [Σ_{k=0}^{t−1} g_{P_k}(X_k)]^+ ≥ f^−(X_t) − Σ_{k=0}^∞ g^+_{P_k}(X_k).

Thus |S_t^− − f^−(X_t)| ≤ Σ_{k=0}^∞ |g_{P_k}(X_k)|, which is ℙ_{i,R}-summable because g is a charge, while S_t^− − f^−(X_t) converges ℙ_{i,R}-a.s. (both S_t and f(X_t) converge a.s. by theorem 3.7). By the dominated convergence theorem we have the L^1-convergence of S_t^− − f^−(X_t). Since E_{i,R}[f^−(X_t)] converges to zero by 3.3 ii), this implies the L^1-convergence of S_t^−. □
Corollary 3.9.
Let f be an excessive function w.r.t. a charge g, and let τ_1 and τ_2 be stopping times w.r.t. {F_t, t ∈ ℕ}. Then

Σ_{k=0}^{τ_1−1} g_{P_k}(X_k) + f(X_{τ_1}) ≥ E^{F_{τ_1}}_{i,R}[Σ_{k=0}^{τ_2−1} g_{P_k}(X_k) + f(X_{τ_2})]   ℙ_{i,R}-a.s. on {τ_1 ≤ τ_2}

for all i ∈ S, R ∈ M. Note that 3.9 is a direct consequence of property 2.6 and theorem 3.8. If ℙ_{i,R}[τ_1 ≤ τ_2] = 1, integration w.r.t. ℙ_{i,R} gives the theorem of Hordijk mentioned in the introduction.
4. Some remarks.

1) A strategy R = (P_0, P_1, ...) ∈ M is called conserving if v = r_{P_t} + P_t v for all t ∈ ℕ (v is the value function), and R is called equalizing if

lim_{t→∞} E_{i,R}[v(X_t)] = 0   for all i ∈ S.

It is well known (see the proof of th. 4.6 in Hordijk (1974a)) that R ∈ M is optimal iff R is equalizing and conserving. For each equalizing strategy R we may conclude that v(X_n) → 0 ℙ_{i,R}-a.s. (for all i ∈ S), since by th. 3.7 v(X_n) converges ℙ_{i,R}-a.s. and by 3.5 we know that this limit must be zero.
In Groenewegen (1975) this result has been proved for optimal strategies.

2) For a conserving strategy R = (P_0, P_1, ...) it holds that

v(X_t) = r_{P_t}(X_t) + (P_t v)(X_t),

hence {v(X_t) + Σ_{k=0}^{t−1} r_{P_k}(X_k), t ∈ ℕ} is a martingale, by the equality case of lemma 2.2.
3) Let g : S × P → ℝ and f_t : S → ℝ, t ∈ ℕ; suppose we do not know whether g is a charge or not. Assume
i) g_{P_t}(X_t) + E^{F_t}_{i,R}[f_{t+1}(X_{t+1})] ≤ f_t(X_t) ℙ_{i,R}-a.s. for all i ∈ S, t ∈ ℕ, R ∈ M, with E^{F_t}_{i,R}[f_{t+1}(X_{t+1})] well-defined;
ii) limsup_{t→∞} E_{i,R}[f_t^−(X_t)] < ∞ for all i ∈ S, R ∈ M;
iii) Σ_{t=0}^∞ |g_{P_t}(X_t)| < ∞ ℙ_{i,R}-a.s. for all i ∈ S, R ∈ M;
iv) E_{i,R}[Σ_{t=0}^∞ g^−_{P_t}(X_t)] < ∞ for all i ∈ S, R ∈ M.
Under these conditions, similar to those in lemma 2.3, we have that f_t(X_t) converges ℙ_{i,R}-a.s. for each i ∈ S, R ∈ M. If in addition f_0 is finite, this implies also, by lemma 2.4, that g is a charge. So if f_t = f for all t ∈ ℕ, f is superharmonic w.r.t. g.

4) Let N ∈ ℕ and Q_N, Q_{N+1}, ... ∈ P, and let g be a charge. Define

R := {R = (P_0, P_1, ...) ∈ M | P_k = Q_k for k ≥ N},

and let v_n (n ∈ ℕ) be the value function of the decision process that starts at time n and uses strategies from R, so that for n ≥ N the strategy (Q_n, Q_{n+1}, ...) is fixed, while for n ≤ N − 1 the transition functions up to time N − 1 may still be chosen freely. It is easy to check that the assumptions i), iii) and iv) in the above remark are satisfied for R ∈ R, and assumption ii) is satisfied by the observation that v(n, X_n) := v_n(X_n) is excessive w.r.t. g for the space-time process, where the only allowed strategies are elements of R.
Hence v_n(X_n) converges ℙ_{i,R}-a.s.

5) Let f be an excessive function w.r.t. a charge g. From theorem 3.7 we know that f(X_t) converges ℙ_{i,R}-a.s. for t → ∞, and from property 3.3 ii) we know that f^−(X_t) converges in L^1-sense for t → ∞. The following counterexample shows that in general f^+(X_t) does not converge in L^1-sense for t → ∞.
Example. S := {0,1,2,...}; P and Q are Markov transition functions with

P(0,0) = 1,   P(i, i+1) = i/(i+1) and P(i,0) = 1/(i+1) for i ≥ 1,   Q(i,0) = 1,

and P is the collection of Markov transition functions which can be generated from P and Q by using the product property. Furthermore r_P ≡ 0 and r_Q(i) = i. It can be verified easily that the conditions 1.2 i) and ii) are fulfilled, that v(i) = i, that lim_{t→∞} v(X_t) = 0 ℙ_{i,R}-a.s. for all R, and that

lim_{t→∞} E_{1,R}[v(X_t)] = 1   for R = (P, P, P, ...).

But it is well known that the L^1-limit and the a.s.-limit must be equal if both exist. So v(X_t) does not converge in L^1-sense for t → ∞.

6) In Mandl (1974) a martingale is considered in connection with the average cost criterion for the optimal control of a Markov chain. Using his construction for the total return criterion we get the following. Define a real function on S × P:

φ(i,P) := r_P(i) + Pv(i) − v(i),

and random variables

Y_n := r_{P_n}(X_n) + v(X_{n+1}) − v(X_n) − φ(X_n, P_n).

It is easy to see that E^{F_n}_{i,R}[Y_n] = 0, so

M_n := Σ_{k=0}^{n−1} Y_k

is a martingale. Written out, this martingale becomes

M_n = Σ_{m=0}^{n−1} r_{P_m}(X_m) + v(X_n) − v(X_0) − Σ_{m=0}^{n−1} φ(X_m, P_m).

Note that
1. v(X_0) = v(i) ℙ_{i,R}-a.s. for all R ∈ M;
2. by (3.2) φ(X_m, P_m) ≤ 0 ℙ_{i,R}-a.s.

Hence Σ_{m=0}^{n−1} r_{P_m}(X_m) + v(X_n) is the supermartingale treated in section 3. In view of the conserving and equalizing strategies mentioned in remark 1), it is worthwhile to note that

v(i) − E_{i,R}[Σ_{n=0}^∞ r_{P_n}(X_n)] = lim_{n→∞} E_{i,R}[v(X_n)] − E_{i,R}[Σ_{n=0}^∞ φ(X_n, P_n)]

for all R ∈ M. From this equality it is easy to see that an R ∈ M is optimal if and only if it is equalizing and conserving.
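As a closing illustration (ours, not part of the memorandum), the chain of the example in remark 5) can be checked numerically: under R = (P, P, ...) started in state 1, E[v(X_t)] stays 1 while the mass in state 0 grows to 1, and the function φ of remark 6) vanishes for both P and Q, so in this example every strategy is conserving and optimality reduces to the equalizing property.

```python
# Numerical check (ours) of the example in remark 5) and of phi in remark 6)
# for the chain with P(0,0)=1, P(i,i+1)=i/(i+1), P(i,0)=1/(i+1), Q(i,0)=1,
# rewards r_P = 0 and r_Q(i) = i, and value function v(i) = i.

def row_P(i):
    return {0: 1.0} if i == 0 else {i + 1: i / (i + 1), 0: 1 / (i + 1)}

def row_Q(i):
    return {0: 1.0}

def step(dist, row):
    """push a distribution one step through the kernel given by `row`."""
    out = {}
    for i, p in dist.items():
        for j, q in row(i).items():
            out[j] = out.get(j, 0.0) + p * q
    return out

v = lambda i: i

# under R = (P,P,...) started in 1: E[v(X_t)] = 1, P(X_t = 0) = t/(t+1)
dist = {1: 1.0}
for t in range(1, 51):
    dist = step(dist, row_P)
    assert abs(sum(i * p for i, p in dist.items()) - 1.0) < 1e-9
    assert abs(dist[0] - t / (t + 1)) < 1e-9

# phi(i,P) = r_P(i) + P v(i) - v(i) vanishes for both P and Q
def phi(row, reward, i):
    return reward(i) + sum(q * v(j) for j, q in row(i).items()) - v(i)

for i in range(100):
    assert abs(phi(row_P, lambda i: 0.0, i)) < 1e-9
    assert abs(phi(row_Q, lambda i: float(i), i)) < 1e-9

print("E[v(X_t)] = 1 under (P,P,...) although v(X_t) -> 0 a.s.; phi = 0")
```

Indeed (Q, Q, ...) is equalizing here (the chain is absorbed in 0 after one step) and hence optimal, while (P, P, ...) is conserving but not equalizing.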
Literature

Dynkin, E.B. and Juschkewitsch, A.A. (1969). Sätze und Aufgaben über Markoffsche Prozesse. Springer Verlag, Berlin.

Fisk, D.L. (1965). Quasi-martingales. Trans. Amer. Math. Soc. 120, 369-389.

Groenewegen, L.P.J. (1975). Convergence results related to the equalizing property in Markov decision processes. Memorandum COSOR 1975-18, Eindhoven University of Technology.

Van Hee, K.M. (1975). Markov strategies in dynamic programming. Memorandum COSOR 1975-20, Eindhoven University of Technology.

Hordijk, A. (1974a). Dynamic programming and Markov potential theory. Mathematical Centre Tract 51, Mathematisch Centrum, Amsterdam.

Hordijk, A. (1974b). Convergent dynamic programming. Technical Report 28, Department of Operations Research, Stanford University, Stanford.

Mandl, P. (1974). Estimation and control in Markov chains. Adv. Appl. Prob. 6, 40-60.

Neveu, J. (1972). Martingales à temps discret. Masson, Paris.

Ross, S.M. (1970). Applied probability models with optimization applications. Holden-Day, San Francisco.