
Bounding functions for Markov decision processes in relation to the spectral radius

Citation for published version (APA):
Zijm, W. H. M. (1978). Bounding functions for Markov decision processes in relation to the spectral radius. (Memorandum COSOR; Vol. 7821). Technische Hogeschool Eindhoven.

Document status and date: Published: 01/01/1978

Document version: Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)


Department of Mathematics
PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 78-21

Bounding functions for Markov decision processes in relation to the spectral radius

by W.H.M. Zijm

Eindhoven, October 1978
The Netherlands

Abstract

Bounding functions for Markov decision processes in relation to the spectral radius

by W.H.M. Zijm

We consider Markov decision processes with discrete time, countable state space and general decision space. By introducing so-called "weighted supremum norms" or "bounding functions", convergence of successive approximations to the value function can be proved under certain conditions. These bounding functions may also be applied to reduce the norms of the transition probability matrices and hence (mostly) to improve upper and lower bounds of the approximation procedure.

In this paper we show that it is possible to construct bounding functions which are strongly excessive with an excessivity factor arbitrarily close to the spectral radius of the Markov decision process, where this spectral radius is assumed to be smaller than one.

1. Introduction

Contraction properties of certain operators in a Banach space often play an important role in the theory of Markov decision processes. Assuming a positive probability of leaving the system in N stages (uniformly in the starting state and the strategy), or, in other words, having so-called N-step contraction, we can construct a strongly excessive function, i.e. an excessive function with an excessivity factor smaller than one. With the help of this strongly excessive function we can construct a norm, the weighted supremum norm (see Wessels [7]), so that we have 1-step contraction in this norm with a contraction factor equal to the excessivity factor. This contraction factor plays an important role in the estimates for the value function of a Markov decision process, and one may ask how good, that is, how small we can get it. In other words, can we construct strongly excessive functions with an excessivity factor as small as possible, in order to get a good estimate for the value function, and under which conditions?

An important result has been stated in a paper by van Hee and Wessels [2], where a strongly excessive function was proved to exist with an excessivity factor ρ* + ε, where ρ* is defined as the spectral radius of the Markov decision process and supposed to be smaller than one, and ε > 0 is arbitrarily small. In this paper the same result will be proved, but the method is essentially simpler and more constructive. Where [2] uses a policy iteration method and takes the limit as a strongly excessive function, we work with a successive approximation procedure and use the result provided by the n-th step, i.e. we may stop after a finite number of iterations. Before doing so, we first state some preliminary results, of which especially lemma 1 and lemma 3 are worth mentioning. Lemma 1, due to van Hee and Wessels [2], states that for a Markov decision process with spectral radius smaller than one, the supremum, taken over all stationary Markov strategies, of the lifetime of the process is finite. Lemma 3 gives the natural extension to all (non-randomized) Markov strategies.

In this paper, the definitions and notations of Wessels [7] will be used. We consider a Markov decision process with a countably infinite state space S and general decision space K. A system is observed at discrete points of time (t = 0, 1, 2, ...). If at time t the state of the system is i ∈ S, we may choose a decision k ∈ K, which results in a reward r_i^k and a probability p_{ij}^k of observing the system in state j at time t + 1. We suppose

    \sum_{j \in S} p_{ij}^k \le 1   for all i ∈ S, k ∈ K.

A policy f is a function on S with values in K. A strategy s is a sequence of policies: s = (f_0, f_1, f_2, ...). If strategy s is used, we take decision f_t(i) if at time t the state of the system is i. A stationary strategy is a strategy s = (f, f, f, ...).
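For concreteness, these objects can be coded up for a small finite instance (the paper's state space is countably infinite; the two-state, two-decision data below are hypothetical, purely for illustration). A minimal sketch in Python:

```python
import numpy as np

# Hypothetical finite instance: S = {0, 1}, K = {0, 1}.
# p[k][i, j] plays the role of p_ij^k; r[k][i] plays the role of r_i^k.
p = {0: np.array([[0.4, 0.3],
                  [0.2, 0.5]]),
     1: np.array([[0.1, 0.5],
                  [0.3, 0.4]])}
r = {0: np.array([1.0, -2.0]),
     1: np.array([0.5,  3.0])}

# Substochasticity: sum_j p_ij^k <= 1 for all i in S, k in K.
assert all((pk.sum(axis=1) <= 1.0 + 1e-12).all() for pk in p.values())

# A policy f assigns a decision to every state; P(f) and r(f) pick, for each
# state i, row i of p[f(i)] and component i of r[f(i)].
def P_of(f):
    return np.array([p[f[i]][i] for i in range(len(f))])

def r_of(f):
    return np.array([r[f[i]][i] for i in range(len(f))])

f = (0, 1)                     # the stationary strategy (f, f, f, ...)
print(P_of(f), r_of(f))
```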

As optimality criterion we choose the total expected reward, which is defined for a strategy s = (f_0, f_1, f_2, ...) as a vector V(s) in the following way:

(1)   V(s) = \sum_{t=0}^{\infty} \Big[ \prod_{n=0}^{t-1} P(f_n) \Big] r(f_t) ,

where the sum is supposed to remain convergent when rewards are replaced by their absolute values, r(f) is interpreted as a (column) vector with i-component r_i^{f(i)} (for i ∈ S) for any policy f, and P(f) is interpreted as a matrix with (i,j)-component p_{ij}^{f(i)} (for i, j ∈ S) for any policy f.

V(s) converges absolutely and uniformly in its components under the following conditions:

(2)   \sum_{j \in S} p_{ij}^k \le \rho < 1 ,   |r_i^k| \le M_0   (for all i ∈ S, k ∈ K).

Under these conditions the total expected reward V(s) is at most M_0/(1 - ρ). The value function

    V = \sup_s V(s)

may then be estimated by successive approximations. Upper and lower bounds for V may be given at each step. At the same time the method produces at each step a stationary strategy s = (f, f, ...) with V(s) lying between the same bounds ([3], [4], [5], [1]).
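For a finite instance satisfying (2), this scheme with MacQueen-type extrapolated bounds [3], [4] can be sketched as follows (hypothetical data; starting from v_0 = 0, for which the iterates here increase monotonically, so the extrapolations below bracket V):

```python
import numpy as np

# Hypothetical 2-state, 2-decision data satisfying (2): row sums <= rho < 1.
p = [np.array([[0.4, 0.3], [0.2, 0.5]]),
     np.array([[0.1, 0.5], [0.3, 0.4]])]
r = [np.array([1.0, -2.0]), np.array([0.5, 3.0])]
rho   = max(pk.sum(axis=1).max() for pk in p)   # 0.7  (contraction factor)
rho_L = min(pk.sum(axis=1).min() for pk in p)   # 0.6  (enters the lower bound)

v = np.zeros(2)                  # v_0 = 0; here Uv_0 >= 0, so v_n increases and d >= 0
for n in range(1000):
    # One successive-approximation step: (Uv)(i) = max_k { r_i^k + sum_j p_ij^k v(j) }.
    q = np.array([rk + pk @ v for pk, rk in zip(p, r)])
    v_new, f = q.max(axis=0), q.argmax(axis=0)   # f is the stationary strategy produced
    d = v_new - v
    # MacQueen-type extrapolation: these bracket V componentwise.
    lower = v_new + rho_L / (1 - rho_L) * d.min()
    upper = v_new + rho   / (1 - rho)   * d.max()
    v = v_new
    if (upper - lower).max() < 1e-10:
        break
print(f, lower, upper)   # V and V(s) for s = (f, f, ...) lie between these bounds
```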

Wessels [7] treated a more general situation. Instead of (2) he only assumes the existence of a positive function μ, defined on S, such that

(3)   |r_i^k| \le M_0\, \mu(i)   (for all i ∈ S, k ∈ K),

(4)   \sup_k \sup_i\, \mu^{-1}(i) \sum_{j \in S} p_{ij}^k\, \mu(j) =: \rho_\mu < 1 .

Remark: μ is a strongly excessive function with excessivity factor ρ_μ, i.e. P(f)μ ≤ ρ_μ μ for all f.

Such a μ induces a weighted supremum norm, namely for vectors v:

    \|v\|_\mu := \sup_i\, \mu^{-1}(i)\, |v(i)| ,   defined for all v with \sup_i \mu^{-1}(i)\, |v(i)| < \infty ,

and for matrices A:

    \|A\|_\mu := \sup_i\, \mu^{-1}(i) \sum_j |a_{ij}|\, \mu(j) .

Introduce

(5)   L_f v := r(f) + P(f)v ,

(6)   Uv := \sup_f L_f v .

Wessels [7] then proves that the following successive approximation procedure ends after a finite number of steps:

Start: choose α > 0, δ > 0 and v_0 with ‖v_0‖_μ < ∞ and v_0 ≤ Uv_0 (for all components), and δ(1 - ρ_μ)^{-1} < α.

Iteration part: find for n = 1, 2, ... a policy f_n such that

    v_n := L_{f_n} v_{n-1} \ge \max\{v_{n-1},\ U v_{n-1} - \delta\mu\}

until

    \sup_i\, \mu^{-1}(i)\, \big(v_n(i) - v_{n-1}(i)\big) \le \alpha\, (1 - \rho_\mu) ,

where

    \rho_L := \inf_k \inf_i\, \mu^{-1}(i) \sum_{j \in S} p_{ij}^k\, \mu(j) .

Stop. Then v_n, together with the factors ρ_L and ρ_μ, yields lower and upper bounds for the value function V, and the stationary strategy s = (f_n, f_n, ...) satisfies the same bounds.
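In a finite instance, ρ_μ of (4) and the quantity ρ_L above are directly computable for a candidate bounding function μ. A sketch (hypothetical data; μ chosen arbitrarily):

```python
import numpy as np

# Hypothetical data and a candidate bounding function mu > 0 on S.
p = [np.array([[0.4, 0.3], [0.2, 0.5]]),
     np.array([[0.1, 0.5], [0.3, 0.4]])]
mu = np.array([1.0, 1.5])

# ratios[k][i] = mu^{-1}(i) * sum_j p_ij^k mu(j); then (4) reads rho_mu < 1.
ratios = np.array([(pk @ mu) / mu for pk in p])
rho_mu = ratios.max()     # sup over k and i  -> excessivity factor of mu
rho_L  = ratios.min()     # inf over k and i  -> enters the lower bounds
print(rho_mu, rho_L)      # with these data rho_mu = 0.85 < 1

# The Remark in matrix form: P(f) mu <= rho_mu * mu for every policy f,
# i.e. mu is strongly excessive with excessivity factor rho_mu.
assert all((pk @ mu <= rho_mu * mu + 1e-12).all() for pk in p)
```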

The important question now is: can we choose an appropriate strongly excessive function μ, or, once we have a μ, can we improve it in order to obtain better bounds for the value function V? To this end we will give a characterization of strongly excessive functions in terms of the spectral radius of a Markov decision process.

2. Preliminaries

Suppose that we have a Markov decision process and a positive function μ such that for all policies f the following holds:

(9)   P(f)\mu \le M\mu

for some positive constant M. Now (9) implies P^n(f)\mu \le M^n \mu, hence \|P^n(f)\|_\mu \le M^n, and therefore:

(10)   \rho^* := \sup_f \limsup_{n \to \infty} \|P^n(f)\|_\mu^{1/n} \le M .

So ρ* (which is called the spectral radius of the Markov decision process) appears to be a lower bound for M.
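When S is finite, (10) can be evaluated by enumerating the (finitely many) deterministic policies, since limsup_n ‖P^n(f)‖_μ^{1/n} is then simply the modulus of the largest eigenvalue of P(f), independent of μ (see the next paragraph). A sketch under that finiteness assumption (hypothetical data):

```python
import numpy as np
from itertools import product

# Hypothetical finite instance: p[k][i, j] = p_ij^k.
p = [np.array([[0.4, 0.3], [0.2, 0.5]]),
     np.array([[0.1, 0.5], [0.3, 0.4]])]
n_states, n_dec = 2, 2

# (10): rho* = sup_f limsup_n ||P^n(f)||^(1/n); for a finite matrix the limsup
# equals the spectral radius of P(f), so take the max over all policies f.
rho_star = 0.0
for f in product(range(n_dec), repeat=n_states):
    P_f = np.array([p[f[i]][i] for i in range(n_states)])
    rho_star = max(rho_star, abs(np.linalg.eigvals(P_f)).max())
print(rho_star)   # spectral radius of the Markov decision process; here 0.7 < 1
```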

Notice that for a single finite matrix the spectral radius equals the modulus of the largest eigenvalue. This explains why ρ* does not depend on μ in that case. For a single infinite matrix, i.e. S countably infinite, it is easy to see that for two positive functions μ and ν with αν ≤ μ ≤ βν (α, β positive constants)

    \limsup_{n \to \infty} \|P^n(f)\|_\mu^{1/n} = \limsup_{n \to \infty} \|P^n(f)\|_\nu^{1/n} .

Hence also in a Markov decision process with countably infinite state space the spectral radius is the same for μ and ν.

Now, the main topic of this paper will be to prove that, if ρ* < 1, it is possible to construct a strongly excessive function with an excessivity factor arbitrarily close to ρ*.

We first formulate our assumptions. Suppose we have a bounding function μ such that:

(11)   \rho^* = \sup_f \limsup_{n \to \infty} \|P^n(f)\|_\mu^{1/n} < 1 ,   and

(12)   M := \max\{1,\ \sup_f \|P(f)\|_\mu\} < \infty .

Furthermore, choose λ with 1 ≤ λ < ρ*^{-1}.

In order to simplify the proofs and notations of the following lemmas, we next give a transformation of the Markov decision process (see van Hee and Wessels [2], lemma 6, or Veinott [6], lemma 3). Define P*(f) as the matrix with elements p*_{ij}^{f(i)}, i, j ∈ S, where

(13)   p_{ij}^{*f(i)} := M^{-1} \mu^{-1}(i)\, p_{ij}^{f(i)}\, \mu(j) ,

and λ* by:

(14)   \lambda^* := \lambda M .

Then assumptions (11) and (12) become:

(15)   \sup_f \limsup_{n \to \infty} \|P^{*n}(f)\|^{1/n} = M^{-1} \rho^*   (where ‖·‖ denotes the usual sup-norm), and hence

    \sup_f \limsup_{n \to \infty} \|\lambda^{*n} P^{*n}(f)\|^{1/n} = \lambda^* M^{-1} \rho^* = \lambda \rho^* < 1 ;

(16)   P^*(f)\, e \le e   (where e denotes the unit vector).

This transformation enables us to work with the usual sup-norm instead of the weighted supremum norm. For the rest of this section we will write λ and P(f) again, instead of λ* and P*(f). Hence we may write:

(17)   \sup_f \limsup_{n \to \infty} \|\lambda^n P^n(f)\|^{1/n} < 1 ,

(18)   P(f)\, e \le e .
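Numerically, the transformation (13)-(14) is a similarity scaling by D = diag(μ) followed by division by M; after it, (16) says every transformed matrix is substochastic. A sketch for the hypothetical finite data used before:

```python
import numpy as np

p = [np.array([[0.4, 0.3], [0.2, 0.5]]),
     np.array([[0.1, 0.5], [0.3, 0.4]])]
mu = np.array([1.0, 1.5])                  # hypothetical bounding function

def mu_norm(A):
    # ||A||_mu = sup_i mu^{-1}(i) sum_j |a_ij| mu(j)   (weighted matrix norm)
    return (np.abs(A) @ mu / mu).max()

M = max(1.0, max(mu_norm(pk) for pk in p))   # condition (12)

# (13): p*_ij = M^{-1} mu^{-1}(i) p_ij mu(j), i.e. P* = M^{-1} D^{-1} P D.
p_star = [pk * mu[None, :] / mu[:, None] / M for pk in p]

# (16): P*(f) e <= e for every policy f -- the transformed matrices are
# substochastic, so the usual sup-norm can replace the mu-weighted norm.
e = np.ones(2)
assert all((pk @ e <= e + 1e-12).all() for pk in p_star)
```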

Lemma 1: Define z_f := \sum_{n=0}^{\infty} \lambda^n P^n(f)\, e and z := \sup_f z_f (where the sup is taken componentwise), and let (17) and (18) hold. Then ‖z‖ < ∞.

For a proof, see van Hee and Wessels [2], lemma 11.

Lemma 2: For z defined in lemma 1, and under conditions (17) and (18), we have:

    z \ge e + \lambda P(f)\, z ,   ∀f.

Comment: This result is not trivial. If z were defined as the supremum taken over all (non-randomized) Markov strategies, then z would satisfy Bellman's optimality principle and the above inequality would be obvious. However, z is only defined as the supremum taken over all stationary Markov strategies.

Proof: Suppose that for some i ∈ S, k ∈ K we have

    z(i) < 1 + \lambda \sum_{j \in S} p_{ij}^k\, z(j) ,

or

    z(i) + \varepsilon = 1 + \lambda \sum_{j \in S} p_{ij}^k\, z(j) ,   ε > 0.

Choose δ = ε/(2‖z‖) > 0 (see lemma 1). Because of the definition of z we have, for all a ∈ S, a policy f_a : S → K such that

    z(a) \ge \sum_{n=0}^{\infty} \lambda^n \sum_{b \in S} \big(P^n(f_a)\big)_{a,b} \ge z(a) - \delta .

Hence

    1 + \lambda \sum_{j \in S} p_{aj}^{f_a(a)} \Big\{ \sum_{n=0}^{\infty} \lambda^n \sum_{b \in S} \big(P^n(f_a)\big)_{j,b} \Big\} \ge z(a) - \delta ,

and so:

    1 + \lambda \sum_{j \in S} p_{aj}^{f_a(a)}\, z(j) \ge z(a) - \delta .

Define the policy f as follows:

    f(i) := k ,   f(a) := f_a(a) for a ≠ i.

Then we have

    z(i) - \delta + \varepsilon < z(i) + \varepsilon = 1 + \lambda \sum_{j \in S} p_{ij}^{f(i)}\, z(j) ,

    z(a) - \delta \le 1 + \lambda \sum_{j \in S} p_{aj}^{f(a)}\, z(j)   (a ≠ i).

Define ε_i as the (column) vector with i-component equal to ε and all other components equal to zero. Then:

    z - \delta e + \varepsilon_i \le e + \lambda P(f)\, z ,

or

    e \ge z - \lambda P(f)\, z + \varepsilon_i - \delta e ,

and so:

    \sum_{n=0}^{N} \lambda^n P^n(f)\, e \ge z - \lambda^{N+1} P^{N+1}(f)\, z + \sum_{n=0}^{N} \lambda^n P^n(f)\, \varepsilon_i - \delta \sum_{n=0}^{N} \lambda^n P^n(f)\, e
    \ge z - \lambda^{N+1} P^{N+1}(f)\, z + \varepsilon_i - \delta z .

For N → ∞ we obtain

    z \ge \lim_{N \to \infty} \sum_{n=0}^{N} \lambda^n P^n(f)\, e \ge z + \varepsilon_i - \delta \|z\|\, e = z + \varepsilon_i - \tfrac{1}{2}\varepsilon\, e .

(Remark: for N sufficiently large we have \|\lambda^{N+1} P^{N+1}(f)\| < \rho^{N+1}, where ρ is taken in such a way that \sup_f \limsup_{n \to \infty} \|\lambda^n P^n(f)\|^{1/n} < \rho < 1; hence λ^{N+1} P^{N+1}(f) z → 0.)

Hence z(i) ≥ z(i) + ε/2, which proves the contradiction. q.e.d.

Lemma 3: Let z be defined as in lemma 1 and let (17) and (18) hold. Then

(19)   z = \sup_s \sum_{n=0}^{\infty} \lambda^n \prod_{t=0}^{n-1} P(f_t)\, e ,   where s = (f_0, f_1, f_2, ...).

Proof: follows immediately from lemma 2.

Once we have lemma 3, we know that the following successive iteration procedure converges:

    z_0 := 0 ,
    z_n := \sup_f \{ e + \lambda P(f)\, z_{n-1} \} ,   n = 1, 2, ...

(Note that z_n ≤ z for all n, and that (z_n)_{n=1}^∞ is an increasing sequence.) The next lemma states that z_n converges to z, for n → ∞, uniformly in its components. In the next section we then use this successive approximation procedure to construct a better bounding function.

Lemma 4: Let z be defined as in lemma 1 and let (17) and (18) hold. Then, for all ε > 0, there is an n_0 such that

    z - \varepsilon e \le z_n \le z   for all n ≥ n_0.

Proof: Choose ε > 0 and δ = ε/(2‖z‖).

As in the proof of lemma 2 we have, for each a ∈ S, a policy f_a such that

    z(a) \ge \sum_{n=0}^{\infty} \lambda^n \sum_{b \in S} \big(P^n(f_a)\big)_{a,b} \ge z(a) - \delta .

Hence

    1 + \lambda \sum_{j \in S} p_{aj}^{f_a(a)} \Big\{ \sum_{n=0}^{\infty} \lambda^n \sum_{b \in S} \big(P^n(f_a)\big)_{j,b} \Big\} \ge z(a) - \delta ,

and so:

    1 + \lambda \sum_{j \in S} p_{aj}^{f_a(a)}\, z(j) \ge z(a) - \delta .

Defining f as the policy with f(a) = f_a(a), a ∈ S, we have

    z - \delta e \le e + \lambda P(f)\, z \le z   (see also lemma 2),

which implies e ≥ z − λP(f)z − δe, and so

    \sum_{n=0}^{N} \lambda^n P^n(f)\, e \ge z - \lambda^{N+1} P^{N+1}(f)\, z - \delta \sum_{n=0}^{N} \lambda^n P^n(f)\, e \ge z - \lambda^{N+1} P^{N+1}(f)\, z - \delta z ,

while

    z_{N+1} = \sup_s \sum_{n=0}^{N} \lambda^n \prod_{t=0}^{n-1} P(f_t)\, e \ge \sum_{n=0}^{N} \lambda^n P^n(f)\, e .

So we have:

    z \ge z_{N+1} \ge z - \lambda^{N+1} P^{N+1}(f)\, z - \delta z .

Then, for N sufficiently large,

    \lambda^{N+1} P^{N+1}(f)\, z \le \|\lambda^{N+1} P^{N+1}(f)\|\, \|z\|\, e < \tfrac{1}{2}\varepsilon\, e ,

or:

    z \ge z_{N+1} > z - \tfrac{1}{2}\varepsilon\, e - \delta \|z\|\, e = z - \varepsilon e .   q.e.d.

Remark: Because z ≥ z_{N+1} ≥ z_N, we then also have z_{N+1} - z_N ≤ εe for all N ≥ n_0.

3. Construction of a new bounding function

We just proved that the successive approximation procedure described after lemma 3 converges uniformly to z under the conditions of lemma 1. Because of the remark at the end of section 2 we find, for N sufficiently large,

    z_{N+1} - z_N \le \varepsilon e ,

where we take ε ≤ 1. Because of z_{N+1} = \sup_f \{e + \lambda P(f) z_N\} we then have e + λP(f)z_N ≤ z_N + εe for all f, so

(20)   1 + \lambda \sum_{j \in S} p_{ij}^k\, z_N(j) \le z_N(i) + \varepsilon ,   ∀i ∈ S, ∀k ∈ K.

If we write (20) in the notations of our original problem (that is: return to our "old" P(f) and λ, and forget the transformation), we find, using ε ≤ 1,

    \lambda M\, M^{-1} \mu^{-1}(i) \sum_{j \in S} p_{ij}^k\, \mu(j)\, z_N(j) \le z_N(i) ,   ∀i ∈ S, ∀k ∈ K,

or:

(21)   \sum_{j \in S} p_{ij}^k\, \mu(j)\, z_N(j) \le \lambda^{-1} \mu(i)\, z_N(i) ,   ∀i ∈ S, ∀k ∈ K.

Defining μ∘z_N as the vector with elements μ(i)z_N(i), we may write (21) as

(22)   P(f)\, (\mu \circ z_N) \le \lambda^{-1} (\mu \circ z_N) ,   ∀f.

Combining all these elements we have proved the following theorem:

Theorem: If for a Markov decision process there is a bounding function μ such that

(11)   \rho^* = \sup_f \limsup_{n \to \infty} \|P^n(f)\|_\mu^{1/n} < 1 ,   and

(12)   M = \max\{1,\ \sup_f \|P(f)\|_\mu\} < \infty ,

then there is for each ε > 0 a bounding function μ̃ such that

(23)   P(f)\,\tilde{\mu} \le (\rho^* + \varepsilon)\,\tilde{\mu} ,   ∀f,

(24)   \mu \le \tilde{\mu} \le L\mu   for some constant L.

Proof: Choose λ = (ρ* + ε)^{-1}; then the conditions of lemmas 1 through 4 (after transforming the process) are fulfilled, so that we may apply the successive approximation procedure described in §2 and take μ̃ := μ∘z_N, where μ∘z_N satisfies (22).
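For a finite instance, the whole construction of the theorem can be carried out numerically: compute ρ*, fix λ = (ρ* + ε)^{-1}, transform, iterate z_n, and set μ̃ = μ∘z_N. A sketch with the hypothetical data used throughout (finite N; the stopping tolerance stands in for the ε-criterion of lemma 4):

```python
import numpy as np
from itertools import product

p = [np.array([[0.4, 0.3], [0.2, 0.5]]),
     np.array([[0.1, 0.5], [0.3, 0.4]])]
mu = np.array([1.0, 1.5])               # hypothetical initial bounding function
n_states, n_dec = 2, 2

# rho* by policy enumeration (finite case), then lambda = (rho* + eps)^{-1}.
rho_star = max(abs(np.linalg.eigvals(np.array([p[f[i]][i] for i in range(n_states)]))).max()
               for f in product(range(n_dec), repeat=n_states))
eps = 0.05
lam = 1.0 / (rho_star + eps)

# Transformation (13)-(14): M from (12), then P* = M^{-1} D^{-1} P D, lam* = lam M.
M = max(1.0, max((pk * mu[None, :] / mu[:, None]).sum(axis=1).max() for pk in p))
p_star = [pk * mu[None, :] / mu[:, None] / M for pk in p]
lam_star = lam * M

# Iteration z_0 = 0, z_n = sup_f { e + lam* P*(f) z_{n-1} } (componentwise sup over f).
z = np.zeros(n_states)
for _ in range(2000):
    z_new = 1.0 + lam_star * np.array([pk @ z for pk in p_star]).max(axis=0)
    if np.abs(z_new - z).max() < 1e-12:
        z = z_new
        break
    z = z_new

# New bounding function mu~ = mu o z_N; (23): P(f) mu~ <= (rho* + eps) mu~ for all f.
mu_new = mu * z
assert all((pk @ mu_new <= (rho_star + eps) * mu_new + 1e-9).all() for pk in p)
print(rho_star, mu_new)
```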

4. Comments

Of course, in practice one will not always know the spectral radius of a Markov decision process. In that case we must have at our disposal a good estimate of ρ*, or a procedure which results in a good bounding function without knowing the spectral radius exactly. One of the things we can do is the following.

Suppose we have a Markov decision problem and a bounding function μ such that (12) holds. Furthermore, let ρ*, defined in (10), be smaller than one, but otherwise unknown. Take λ = 1, start the procedure described in section 2, and continue until

    \|z_{N+1} - z_N\| < \delta < 1 .

Then e + λP(f)z_N ≤ z_{N+1} ≤ z_N + δe for all f, or:

    \lambda P(f)\, z_N \le z_N - (1 - \delta)\, e ,

and, because of ‖z_N‖ ≤ ‖z‖ < ∞,

    \lambda P(f)\, z_N \le \gamma\, z_N ,   with γ = 1 - (1 - δ)/‖z_N‖ < 1.

This implies λρ* ≤ γ < 1, so we can improve λ, i.e. take λ' = λγ^{-1}, and apply the same procedure again. As long as λ < ρ*^{-1} the procedure converges, and we will always find a γ < 1, so that λ can be improved. It can be proved that this sequence of λ-values converges to ρ*^{-1}.
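The scheme of this section is easy to sketch for a finite instance: run the z-iteration with the current λ until the successive difference drops below δ, extract γ, enlarge λ by the factor γ^{-1}, and repeat (hypothetical data, already transformed so that P(f)e ≤ e; ρ*^{-1} is computed at the end only for comparison):

```python
import numpy as np
from itertools import product

# Hypothetical transformed instance: P(f) e <= e for every policy f.
p = [np.array([[0.4, 0.3], [0.2, 0.5]]),
     np.array([[0.1, 0.5], [0.3, 0.4]])]
n_states, n_dec = 2, 2
delta = 0.5                            # stopping threshold, 0 < delta < 1

lam = 1.0
for sweep in range(20):
    # z_{N+1} = sup_f { e + lam P(f) z_N }, continued until ||z_{N+1} - z_N|| < delta.
    z = np.zeros(n_states)
    while True:
        z_new = 1.0 + lam * np.array([pk @ z for pk in p]).max(axis=0)
        done = np.abs(z_new - z).max() < delta
        z = z_new
        if done:
            break
    # lam P(f) z_N <= z_N - (1 - delta) e <= gamma z_N, with gamma computed from
    # the last iterate; hence lam rho* <= gamma < 1 and lam may grow by 1/gamma.
    gamma = 1.0 - (1.0 - delta) / np.abs(z).max()
    lam = lam / gamma

rho_star = max(abs(np.linalg.eigvals(np.array([p[f[i]][i] for i in range(n_states)]))).max()
               for f in product(range(n_dec), repeat=n_states))
print(lam, 1.0 / rho_star)             # the lam-sequence increases towards rho*^{-1}
```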

References

[1] E.V. Denardo, Contraction mappings in the theory underlying dynamic programming, SIAM Rev. 9 (1967), 165-177.

[2] K.M. van Hee and J. Wessels, Markov decision processes and strongly excessive functions, to appear in Stochastic Processes and their Appl.

[3] J. MacQueen, A modified dynamic programming method for Markovian decision problems, J. Math. Anal. Appl. 14 (1966), 38-43.

[4] J.A.E.E. van Nunen, A set of successive approximation methods for discounted Markovian decision problems, Z. f. Oper. Res. 20 (1976), 203-208.

[5] H. Schellhaas, Zur Extrapolation in Markoffschen Entscheidungsmodellen mit Diskontierung, Z. f. Oper. Res. 18 (1974), 91-104.

[6] A.F. Veinott, Discrete dynamic programming with sensitive discount optimality criteria, Ann. Math. Statist. 40 (1969), 1635-1660.

[7] J. Wessels, Markov programming by successive approximations with respect to weighted supremum norms, J. Math. Anal. Appl. 58 (1977), 326-335.
