Determination of the spectral radius of a Markov decision process
Citation for published version (APA):

Zijm, W. H. M. (1979). Determination of the spectral radius of a Markov decision process. (Memorandum COSOR; Vol. 7909). Technische Hogeschool Eindhoven.

Document status and date: Published: 01/01/1979

Document Version:

Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)


Department of Mathematics

PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 79-09

Determination of the spectral radius of a Markov decision process

by

W.H.M. Zijm

Eindhoven, October 1979, The Netherlands

Determination of the spectral radius of a Markov decision process

by W.H.M. Zijm

0. Abstract

Consider a Markov decision process with discrete time, finite state space and finite action space. A positive probability for fading of the system is allowed. In this case, contraction properties of certain operators used in dynamic programming are strictly related to the spectral radius of the process. In this paper a method for estimating this spectral radius is proposed.

The result can be extended immediately to the case in which the transition probability matrices are replaced by general nonnegative matrices.

1. Introduction

Contraction properties of certain operators in a Banach space often play an important role in the theory of Markov decision processes. Assuming a positive probability of leaving the system in N stages (uniformly in the starting state and the strategy) or, in other words, having so-called N-step contraction, one can construct a strongly excessive function, i.e. a strictly positive (column) vector μ such that for all matrices P which are involved in the system we have

(1)  Pμ ≤ ρμ ,

where ρ, the excessivity factor, is smaller than one.

This excessivity factor can be used in determining the rate of convergence of a successive approximation procedure and, closely related to this, in estimating the value function of a Markov decision process (Wessels [14]). It turns out that strongly excessive functions can be characterized in terms of the spectral radius; more specifically, it is possible to construct strongly excessive functions with an excessivity factor arbitrarily close to the spectral radius of a Markov decision process (van Hee and Wessels [2], Zijm [15]), so that an exact knowledge of this spectral radius would be very useful.

Several authors have dealt with questions concerning the spectral radius of a Markov decision process, most of them in the irreducible case (Isaacson and Luecke [5], Howard and Matheson [4], Mandl [7], Mandl and Seneta [8], Sladky [11], [12]). Rothblum [9] gives a Cesaro-Laurent expansion of the resolvent of a specific matrix on the spectral circle. Finally, Veinott [13] presents a number of results concerning transient dynamic programming (spectral radius smaller than one).

In this paper we present a method to obtain a good approximation of the spectral radius of a Markov decision process, using the strongly excessive functions mentioned above. In fact, we will find a sequence of strongly excessive functions z(λ_n), each with excessivity factor λ_n, such that the sequence (λ_n)_{n=0}^∞ converges to the spectral radius. For that purpose we first introduce some concepts and notations and give a few preliminary results (section 2). In section 3 we develop the approximation method for the spectral radius and prove that the sequence (λ_n)_{n=0}^∞ actually converges. In section 4 we take a closer look at the functions μ_n and investigate their asymptotic behaviour. As a by-product of this investigation we find that all irreducible matrices with maximal spectral radius have the same, strictly positive, right eigenvector associated with this spectral radius. Finally, in section 5, we determine upper and lower bounds for the spectral radius.

2. Preliminaries

Consider a Markov decision process with a finite state space S and a finite action space K. The system is observed at discrete points of time. If at time t the state of the system is i ∈ S, we may choose an action k ∈ K, which results in a probability p^k_{ij} of observing the system in state j at time t + 1. We suppose

(2)  Σ_{j∈S} p^k_{ij} ≤ 1 , for all i ∈ S, k ∈ K.

Hence a positive probability for fading of the system is allowed.

A policy f is a function on S with values in K. By F we denote the set of policies {f | f: S → K}. A strategy s is a sequence of policies: s = (f_0, f_1, f_2, ...). If we use strategy s, then action f_t(i) is taken if at time t the state of the system is i. A stationary strategy is a strategy s = (f, f, f, ...).

The spectral radius ρ* of this Markov decision process is defined as:

(3)  ρ* := max_f limsup_{n→∞} ||P^n(f)||^{1/n} ,

with P(f) := (p^{f(i)}_{ij})_{i,j∈S} and ||·|| the usual sup-norm. Throughout this paper we assume ρ* < 1.
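Since each policy f yields a fixed nonnegative matrix P(f) with limsup_{n→∞} ||P^n(f)||^{1/n} = ρ(P(f)), definition (3) says that ρ* is the maximum of ρ(P(f)) over the finitely many policies. A minimal numerical sketch of this (the dictionary encoding of the transition rows, the iteration count, and the example data are illustrative assumptions, not part of the memorandum):

```python
from itertools import product

def spectral_radius(P, iters=500):
    # Power iteration in the sup-norm: v is kept with max(v) == 1, so after
    # convergence max(P v) approximates rho(P) for a nonnegative matrix P.
    n = len(P)
    v = [1.0] * n
    rho = 0.0
    for _ in range(iters):
        w = [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = max(w)
        if rho == 0.0:
            return 0.0
        v = [x / rho for x in w]
    return rho

def mdp_spectral_radius(p, states, actions):
    # rho* = max over stationary policies f of rho(P(f)), as in (3);
    # p[(i, k)] is row i of the substochastic matrix under action k.
    return max(spectral_radius([p[(i, f[i])] for i in states])
               for f in product(actions, repeat=len(states)))

# Hypothetical two-state, two-action example; the best policy picks
# action 1 in state 1, giving the matrix [[0.5, 0.3], [0.0, 0.8]].
p = {(0, 0): [0.5, 0.3], (0, 1): [0.2, 0.2],
     (1, 0): [0.1, 0.6], (1, 1): [0.0, 0.8]}
print(mdp_spectral_radius(p, [0, 1], [0, 1]))   # approximately 0.8
```

The brute-force enumeration over K^S policies is only feasible for small examples; the point of the memorandum is precisely to avoid it.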

We say that a vector v is nonnegative (positive) - written v ≥ 0 (v » 0) - if all its components are nonnegative (positive). We say that v is semipositive - written v > 0 - if v ≥ 0 and v ≠ 0. We write v ≥ w (v » w, v > w) if v − w ≥ 0 (v − w » 0, v − w > 0). By e we denote the vector with all components equal to one, by I the identity matrix.

For λ > ρ* we define:

(4)  z(λ) := max_f (λI − P(f))^{-1} e = max_f Σ_{n=0}^∞ λ^{-(n+1)} P^n(f) e

(the max taken componentwise). This concept will play a central role in the approximation method. The following properties can be proved:

*

Lemma 1: For >. > p it holds: a) IIz(A)1I < '" 00 b} ZtA}

=

sup

I

s n=O n-1 >.-(n+1) IT . P(ft)e, where s = t=O

c} AZ (A) "" max {e + P (f)z(A)} f

Proof: a) follows from a more general ~esultin van Bee and Wessels [2]. In this case it is immediate from

1 max limsup II A-cn+l)pn(f)lIn ::

f n+co

-1

*

A p < 1. The proof of b) and c) can be found in Zijm [lSJ •

o

From lemma 1c it follows that, for λ > ρ*, we have

(5)  P(f) z(λ) ≤ λ z(λ) , for all f ∈ F.

For λ < 1, z(λ) is an example of a strongly excessive function.
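Lemma 1c also gives a practical way to compute z(λ): it is the unique finite fixed point of z = max_f {e + P(f)z}/λ when λ > ρ*, and since a policy chooses an action per state, the maximum decomposes componentwise over actions. A sketch by successive substitution (the dictionary encoding of p and the iteration count are illustrative assumptions):

```python
def z_of_lambda(p, states, actions, lam, iters=4000):
    # Successive approximation for lam * z = max_f {e + P(f) z} (lemma 1c);
    # the iterates stay bounded and converge iff lam exceeds rho*.
    z = [0.0] * len(states)
    for _ in range(iters):
        z = [max(1.0 + sum(p[(i, k)][j] * z[j] for j in states)
                 for k in actions) / lam
             for i in states]
    return z

# One state, one action, "transition probability" 1/2:
# then z(lam) = 1/(lam - 1/2), so z(1) = 2.
p = {(0, 0): [0.5]}
print(z_of_lambda(p, [0], [0], lam=1.0))   # approximately [2.0]
```

For λ ≤ ρ* the same iteration diverges, which is exactly the diagnostic the approximation method of section 3 exploits.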

Definition: A strongly excessive function of a Markov decision process is a (column) vector μ » 0 such that for some ρ < 1 (the excessivity factor) we have

(6)  P(f) μ ≤ ρ μ , for all f ∈ F.

Lemma 2: For a Markov decision process with spectral radius ρ* < 1 and strongly excessive function μ with excessivity factor ρ we have ρ* ≤ ρ.

Proof: Choose constants c_1, c_2 > 0 such that c_1 e ≤ μ ≤ c_2 e. Iterating (6) gives P^n(f) μ ≤ ρ^n μ, hence c_1 P^n(f) e ≤ c_2 ρ^n e, so that

ρ* = max_f limsup_{n→∞} ||P^n(f)||^{1/n} ≤ limsup_{n→∞} (c_1^{-1} c_2 ρ^n)^{1/n} = ρ. □

Remark: Strongly excessive functions in the case of a countable state space and general action space are treated extensively in van Hee and Wessels [2].

We see that ρ* is a lower bound for the excessivity factor of any strongly excessive function. Moreover, (5) shows that it is possible, e.g. by a policy iteration procedure ([3]), to construct strongly excessive functions with an excessivity factor arbitrarily close to the spectral radius ρ*, if the latter is exactly known (choose λ such that ρ* < λ ≤ ρ* + ε, ε small).

In the following, however, we suppose ρ* < 1, but otherwise unknown, and we use the concept of strongly excessive functions to find a good estimate for it.

3. Approximation of the spectral radius

In this section we will construct a sequence of numbers (λ_n)_{n=0}^∞ and a sequence of vectors (z(λ_n))_{n=0}^∞ (defined by (4)) such that

(7)  lim_{n→∞} λ_n = ρ* ,

where z(λ_n) is needed for the calculation of λ_{n+1}. The basic idea of the method is as follows. Consider, for λ ≥ ρ*, the functional equation

(8)  λ z = max_f {e + P(f) z}.

From (4) and lemma 1 it follows that the unique, strictly positive, finite solution z = z(λ) exists if and only if λ > ρ*. Hence the existence of such a solution is a method to determine whether λ > ρ* or λ = ρ*. Moreover, if λ > ρ*, then

(9)  max_f P(f) z(λ) = λ z(λ) − e ≤ λ (1 − 1/(λ ||z(λ)||)) z(λ).

Hence, by lemma 2,

(10)  ρ* ≤ λ (1 − 1/(λ ||z(λ)||)) =: λ' < λ ,

and we may investigate (8) again, now for λ' := λ (1 − 1/(λ ||z(λ)||)).

In general, starting with some λ_0 > ρ*, e.g. λ_0 = 1, we define for n ≥ 0:

λ_{n+1} = λ_n (1 − 1/(λ_n ||z(λ_n)||)) ,

if (8) has a strictly positive, finite solution z = z(λ_n) for λ = λ_n; otherwise we set λ_{n+1} = λ_n.

The convergence of (λ_n)_{n=0}^∞ is established in the following theorem.

Theorem 1: lim_{n→∞} λ_n = ρ*.

Proof: It is immediately clear that λ_n ≥ ρ*, n = 0,1,2,..., which implies (cf. lemma 1) that we define λ_{n+1} = λ_n if and only if λ_n = ρ*. Hence we may suppose that (λ_n)_{n=0}^∞ is strictly decreasing. Suppose

γ := lim_{n→∞} λ_n > ρ*  (hence γ > 0).

Then ||z(γ)|| < ∞, and, since λ z(λ) is decreasing in λ,

1 − 1/(λ_n ||z(λ_n)||) ≤ 1 − 1/(γ ||z(γ)||) < 1 , for n = 0,1,2,... ,

which implies:

lim_{n→∞} λ_n ≤ lim_{n→∞} (1 − 1/(γ ||z(γ)||))^n λ_0 = 0.

Because of the contradiction, we must conclude lim_{n→∞} λ_n = ρ*. □

In order to avoid trivialities we suppose in the rest of this paper that always the first possibility occurs, i.e. that we will find a monotone decreasing sequence (λ_n)_{n=0}^∞ such that λ_n > ρ* for all n, and lim_{n→∞} λ_n = ρ*.
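The scheme above can be sketched end to end: an inner loop solves (8) by successive approximation, and the outer loop applies λ_{n+1} = λ_n (1 − 1/(λ_n ||z(λ_n)||)). Numerical blow-up of the inner loop plays the role of "(8) has no strictly positive finite solution". The encoding of p, the iteration counts, and the blow-up threshold are all illustrative assumptions of this sketch:

```python
def approximate_rho(p, states, actions, lam0=1.0, outer=100, inner=2000, big=1e9):
    lam = lam0
    for _ in range(outer):
        # Inner loop: successive approximation for lam * z = max_f {e + P(f) z}.
        z = [0.0] * len(states)
        blown_up = False
        for _ in range(inner):
            z = [max(1.0 + sum(p[(i, k)][j] * z[j] for j in states)
                     for k in actions) / lam
                 for i in states]
            if max(z) > big:
                blown_up = True      # numerically: lam has reached rho*, stop
                break
        if blown_up:
            break
        lam *= 1.0 - 1.0 / (lam * max(z))   # max(z) = ||z(lam)|| in the sup-norm
    return lam

# Hypothetical two-state, two-action example with rho* = 0.8.
p = {(0, 0): [0.5, 0.3], (0, 1): [0.2, 0.2],
     (1, 0): [0.1, 0.6], (1, 1): [0.0, 0.8]}
print(approximate_rho(p, [0, 1], [0, 1]))   # decreases towards 0.8
```

In exact arithmetic λ_n never drops below ρ*; with a truncated inner loop it can undershoot slightly, which is why the threshold check is needed.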

In the following section we will take a closer look at the asymptotic behaviour of the sequence {z(λ_n)}_{n=0}^∞, for two reasons. First of all, the study of this asymptotic behaviour leads to a few nice properties of the Markov decision process. Secondly, we will need the results in order to derive lower bounds for the spectral radius.

4. Asymptotic behaviour of {z(λ_n)}_{n=0}^∞

In this section we study the asymptotic behaviour of {z(λ_n)}_{n=0}^∞. More specifically, defining

(11)  μ_n := z(λ_n) / ||z(λ_n)|| ,

we will show the existence of a vector μ > 0 such that

(12)  lim_{n→∞} μ_n = μ.

We will need some statements from the theory of linear operators on a finite-dimensional vector space (Dunford and Schwartz [1], Kato [6]). The resolvent R(λ,P) of an S × S matrix P is defined as

(13)  R(λ,P) := (λI − P)^{-1} , for λ ∉ σ(P).

Here σ(P) denotes the spectrum of P. The spectral radius of P is denoted by ρ(P); recall that for nonnegative matrices ρ(P) is real and nonnegative, that ρ(P) equals the largest eigenvalue of P, and that we may choose the corresponding left and right eigenvectors nonnegative (Perron-Frobenius theorem, see e.g. Seneta [10]).

From Kato [6] it is seen that we may give a Laurent expansion of R(λ,P) at λ = ρ(P), for λ sufficiently close to ρ(P), which takes the form:

(14)  R(λ,P) = Σ_{n=−k(P)}^∞ (λ − ρ(P))^n A_n(P).

We will not consider in detail the meaning of the matrices A_n(P) and of k(P) (see for example Rothblum [9]); we only note that k(P) ≤ S. It follows that ρ(P) is a pole of R(λ,P) of order k(P). In the same way we may write:

(15)  R(λ,P) e = Σ_{n=−k(P)}^∞ (λ − ρ(P))^n e_n(P) ,

where e_n(P) := A_n(P) e, for n = −k(P), −k(P)+1, ... . Furthermore,

(16)  e_{−k(P)}(P) / ||e_{−k(P)}(P)|| = lim_{λ↓ρ(P)} R(λ,P)e / ||R(λ,P)e|| > 0.
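Property (16) can be checked numerically: as λ ↓ ρ(P), the normalized vector R(λ,P)e aligns with the direction of e_{−k(P)}(P), a nonnegative right eigenvector of P at ρ(P). A small sketch with a hand-picked 2 × 2 matrix (the matrix and the sample values of λ are illustrative assumptions):

```python
def resolvent_times_e(P, lam):
    # (lam*I - P)^{-1} e for a 2x2 matrix, via the explicit inverse formula.
    a, b = lam - P[0][0], -P[0][1]
    c, d = -P[1][0], lam - P[1][1]
    det = a * d - b * c
    # (lam I - P)^{-1} = (1/det) [[d, -b], [-c, a]], applied to e = (1, 1):
    return [(d - b) / det, (a - c) / det]

P = [[0.5, 0.5], [0.0, 0.75]]   # rho(P) = 0.75, eigenvector direction (2, 1)
for lam in [0.8, 0.76, 0.7501]:
    z = resolvent_times_e(P, lam)
    print([x / max(z) for x in z])   # tends to [1.0, 0.5] as lam decreases
```

Here k(P) = 1, so R(λ,P)e blows up like (λ − ρ(P))^{-1} while its direction stabilizes, which is exactly what (16) asserts.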

Returning to the notations of the preceding sections, notice that for λ > ρ*:

(17)  z(λ) = max_f (λI − P(f))^{-1} e = max_f R(λ, P(f)) e.

Let

(18)  F_0 := {f ∈ F | ρ(P(f)) < ρ*}

and

(19)  ρ_0 := max_{f∈F_0} ρ(P(f)).

Since F_0 is finite, there exists a constant C such that for λ ≥ ρ* > ρ_0 and all f ∈ F_0:

Σ_{n=0}^∞ λ^{-(n+1)} P^n(f) e ≤ C · e.

However, lim_{λ↓ρ*} ||z(λ)|| = ∞.

Returning to the sequence {λ_n}_{n=0}^∞ of section 3, this means that for n large enough we are only dealing with policies f such that ρ(P(f)) = ρ*. Combining (15) and (17), we may write for λ_n close enough to ρ* (hence for n large enough):

(20)  z(λ_n) = max_{f∈F_1} Σ_{l=−k(P(f))}^∞ (λ_n − ρ*)^l e_l(P(f)) ,

with F_1 := {f | ρ(P(f)) = ρ*}, hence F_1 = F \ F_0. Defining

k(1) := max_{f∈F_1} k(P(f))

and

e_{−k(1)} := max_{f∈F_1} e_{−k(1)}(P(f))  (with e_{−k(1)}(P(f)) := 0 if k(P(f)) < k(1)),

we conclude from (20) and (16):

(21)  lim_{n→∞} μ_n = lim_{n→∞} z(λ_n) / ||z(λ_n)|| = e_{−k(1)} / ||e_{−k(1)}||.

We establish two immediate corollaries of the existence of lim_{n→∞} μ_n.

Theorem 2: There exists a vector μ > 0 such that

max_f P(f) μ = ρ* μ.

Proof: From lemma 1c and (11) we have

λ_n μ_n = max_f { e / ||z(λ_n)|| + P(f) μ_n }.

Define μ := e_{−k(1)} / ||e_{−k(1)}||; then, by (21) and theorem 1, the result follows. □

For the next result we need the concept of an incidence matrix.

Definition: The incidence matrix of a Markov decision process is a matrix P̄ = (p̄_{ij})_{i,j∈S}, defined by

p̄_{ij} = 1 if p^{f(i)}_{ij} > 0 for some f,  p̄_{ij} = 0 otherwise.

Theorem 3: If the incidence matrix of a Markov decision process is irreducible (we say that the system is communicating), then the vector μ, defined in theorem 2, is strictly positive.

Proof: Suppose μ(i) = 0 for i ∈ D ⊂ S, D ≠ ∅. Then, since P(f)μ ≤ ρ* μ for all f, we have

p^{f(i)}_{ij} = 0 , for i ∈ D, j ∉ D, for all f ∈ F.

Hence, for the incidence matrix P̄:

p̄_{ij} = 0 , for i ∈ D, j ∉ D,

which is impossible, since P̄ is irreducible. Hence D = ∅. □

Corollary: In a communicating system (irreducible incidence matrix), each irreducible transition matrix with maximal spectral radius ρ* possesses exactly the same, strictly positive, right eigenvector μ, associated with ρ*.

Proof: Let P(f) be irreducible, with ρ(P(f)) = ρ*. According to the well-known Perron-Frobenius theorem, P(f) possesses strictly positive left and right eigenvectors associated with ρ*. From P(f)μ ≤ ρ* μ it follows, by multiplying with the left eigenvector, that P(f)μ = ρ* μ, where μ » 0 (theorem 3). □

In the final section we will use the fact that lim_{n→∞} μ_n exists to determine upper and lower bounds for the spectral radius at each step of the approximation procedure, and we prove the convergence of these bounds to ρ*. At the end we summarize the whole procedure.

5. Upper and lower bounds for the spectral radius

In this section we will determine upper and lower bounds for the spectral radius at each step of the approximation procedure. Obviously λ_n, n = 0,1,2,..., may serve as an upper bound for ρ*. In order to determine a lower bound we define

(22)  a_n := min_i (max_f P(f) z(λ_n))(i) / (z(λ_n))(i).

Then the following result holds.

Lemma 3: a_n ≤ ρ*.

Proof: From (11) and (22) we conclude that there exists a policy f such that

P(f) z(λ_n) ≥ a_n z(λ_n).

Completely analogously to the proof of lemma 2 we find that

limsup_{k→∞} ||P^k(f)||^{1/k} ≥ a_n ,

hence ρ* ≥ a_n. □

Theorem 4: If the system is communicating, then lim_{n→∞} a_n = ρ*.

Proof:

lim_{n→∞} a_n = lim_{n→∞} min_i (max_f P(f) μ_n)(i) / μ_n(i) = min_i (max_f P(f) μ)(i) / μ(i) = ρ* ,

since lim μ_n = μ exists (theorem 2 and (21)) and μ » 0 (theorem 3). □
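The bound (22) only needs one max/min sweep over the current z(λ_n). A sketch (the dictionary encoding of p and the example data are illustrative assumptions):

```python
def lower_bound(p, states, actions, z):
    # a_n = min_i (max_f P(f) z)(i) / z(i), equation (22); always a_n <= rho*.
    return min(max(sum(p[(i, k)][j] * z[j] for j in states)
                   for k in actions) / z[i]
               for i in states)

# Hypothetical communicating two-state example with rho* = 0.8;
# at lam = 1 the fixed point of lemma 1c is z(1) = (5, 5).
p = {(0, 0): [0.5, 0.3], (0, 1): [0.2, 0.2],
     (1, 0): [0.1, 0.6], (1, 1): [0.0, 0.8]}
print(lower_bound(p, [0, 1], [0, 1], [5.0, 5.0]))   # -> 0.8
```

For this communicating example the lower bound is already tight at λ_0 = 1; in general theorem 4 only guarantees convergence in the limit.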

Hence in the communicating case we have converging upper and lower bounds (theorem 1 and theorem 4). The following simple example shows that difficulties may arise in the general case.

Example: Suppose we have a system with two states, 1 and 2, and only one policy f. Let

P(f) = ( 1/2  0 )
       (  0   p ) ,  with 1/2 < p < 1.

Hence ρ* = p, while for n large enough μ_n will take the form

μ_n = ( ε_n )
      (  1  ) ,  with ε_n → 0 (n → ∞).

By (22),

lim_{n→∞} a_n = 1/2 < ρ*.

To avoid the difficulties illustrated above, we proceed as follows. In the case that the incidence matrix P̄ is reducible we may, possibly after permuting rows and corresponding columns, write

(23)  P̄ = ( P̄_11    0     ...    0   )
           ( P̄_21   P̄_22   ...    0   )
           (  ...    ...    ...   ...  )
           ( P̄_r1   P̄_r2   ...   P̄_rr ) ,

where P̄_ll, l = 1,...,r, is irreducible and P̄_lk = 0 for k > l. Denote by C_l the subset of states corresponding to P̄_ll, for l = 1,...,r.

Now, for all f ∈ F, define the matrix Q(f) := (q^{f(i)}_{ij})_{i,j∈S} by

(24)  q^{f(i)}_{ij} := p^{f(i)}_{ij} , if i,j ∈ C_l for some l ∈ {1,...,r} ;
      q^{f(i)}_{ij} := 0 elsewhere.

It is well-known that Q(f) possesses the same eigenvalues as P(f), for f ∈ F; in particular ρ* = max_f ρ(Q(f)). Let Q_ll(f) := P_ll(f), l = 1,...,r, and define

(25)  ρ*_l := max_f ρ(Q_ll(f)) ;

then it is easy to see that we can estimate ρ*_l, l = 1,...,r, simply by applying the procedure described in section 3 to r different problems now. If we restrict ourselves to the Markov decision process with transition matrices Q_ll(f), f ∈ F, then we have a communicating system, and hence, by theorems 1 and 4, we can determine upper and lower bounds which converge to ρ*_l. If we are only interested in ρ*, then the procedure may be the following:

Step 0: Calculate the incidence matrix P̄ of the Markov decision process. Determine its irreducible classes, C_1,...,C_r say, and cancel, for all matrices P(f), the transitions between C_k and C_l, k,l = 1,...,r; k ≠ l.

Step 1: Choose δ > 0, define λ_0 = 1. For n = 0,1,2,...:

Step 2: Investigate the functional equation

(26)  λ_n z = max_f {e + P(f) z}.

If (26) has no strictly positive finite solution, then go to A. Otherwise, define

λ_{n+1} = λ_n (1 − 1/(λ_n ||z(λ_n)||)).

Step 3: Calculate

a_n = max_{l=1,...,r} min_{i∈C_l} (max_f P(f) z(λ_n))(i) / (z(λ_n))(i).

If λ_{n+1} − a_n < δ, then go to B. Otherwise, increase n by 1 and return to step 2.

A: ρ* = λ_n ; stop.

B: Stop.
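Step 0 asks for the irreducible classes of the incidence matrix, i.e. the strongly connected components of its directed graph. A transitive-closure sketch (cubic in the number of states, fine for small examples; the dictionary encoding of p is an illustrative assumption):

```python
def incidence(p, states, actions):
    # bar_p[i][j] = 1 iff p_ij under some action is positive (the definition above).
    return [[1 if any(p[(i, k)][j] > 0 for k in actions) else 0
             for j in states] for i in states]

def irreducible_classes(bar_p):
    # Classes C_1, ..., C_r: i and j share a class iff each reaches the other.
    n = len(bar_p)
    reach = [[bar_p[i][j] == 1 or i == j for j in range(n)] for i in range(n)]
    for k in range(n):                      # Floyd-Warshall transitive closure
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True
    classes, seen = [], set()
    for i in range(n):
        if i not in seen:
            cls = [j for j in range(n) if reach[i][j] and reach[j][i]]
            seen.update(cls)
            classes.append(cls)
    return classes

# Three states, one action; states 0 and 1 communicate, state 2 only feeds them.
p = {(0, 0): [0.5, 0.2, 0.0], (1, 0): [0.3, 0.4, 0.0], (2, 0): [0.1, 0.0, 0.6]}
print(irreducible_classes(incidence(p, [0, 1, 2], [0])))   # -> [[0, 1], [2]]
```

In general the classes must still be reordered (topologically) to obtain the lower block-triangular form (23); in this small example the given state numbering already has that form.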

We will end this paper with some remarks. First of all, the proofs in this paper remain unchanged if we replace the substochastic matrices by general nonnegative matrices. The only difference is that we must choose λ_0 such that λ_0 > ρ* (for instance, let λ_0 be the maximal row sum, where the maximum is taken over all rows and all matrices).

In every step of the approximation procedure we have to solve a set of functional equations. This can be done e.g. by using Howard's policy iteration procedure (see Howard [3]). However, for a large state space the amount of work will grow rapidly. For that reason we may use approximations z'(λ_n), instead of z(λ_n), in order to improve λ_n. Using a specific type of approximations, convergence of the sequence (λ_n)_{n=0}^∞ to ρ* can again be proved (see Zijm [15]).

Finally, notice that the idea of investigating the functional equation (8) has also been used in Mandl [7], although the approximation technique there is quite different. Moreover, in [7] only Markov decision processes with strictly positive transition matrices are studied.

Acknowledgement: I am grateful to Jan van der Wal and to Professor Uriel G. Rothblum for several valuable suggestions, especially with respect to section 4.

References

[1] Dunford, N. and J.T. Schwartz, Linear Operators, Part I, Interscience, New York (1958).

[2] van Hee, K.M. and J. Wessels, Markov decision processes and strongly excessive functions, Stoch. Proc. and their Appl. 8 (1978), 59-76.

[3] Howard, R.A., Dynamic programming and Markov processes, Wiley, New York (1960).

[4] Howard, R.A. and J.E. Matheson, Risk-sensitive Markov decision processes, Management Science 18 (1972), 356-369.

[5] Isaacson, D. and G.R. Luecke, Strongly ergodic Markov chains and rates of convergence using spectral conditions, Stoch. Proc. and their Appl. 2 (1978), 113-121.

[6] Kato, T., Perturbation theory for linear operators, Springer-Verlag, New York (1966).

[7] Mandl, P., An iterative method for maximizing the characteristic root of positive matrices, Rev. Roum. Math. Pures et Appl. 11 (1969), 1317-1322.

[8] Mandl, P. and E. Seneta, The theory of non-negative matrices in a dynamic programming problem, The Australian Journal of Statistics 11 (1969), 85-96.

[9] Rothblum, U.G., Expansions of sums of matrix powers and resolvents, to appear.

[10] Seneta, E., Nonnegative matrices, MacMillan Company, New York (1964).

[11] Sladky, K., On dynamic programming recursions for multiplicative Markov decision chains, Mathematical Programming Study 6 (1976), 216-226 (North-Holland Publishing Company).

[12] Sladky, K., Successive approximation methods for dynamic programming models, Proceedings of the Third Formator Symposium on Mathematical Methods for the Analysis of Large-Scale Systems, Prague (1979), 171-189.

[13] Veinott, A.F., Discrete dynamic programming with sensitive discount optimality criteria, Ann. Math. Stat. 40 (1969), 1635-1660.

[14] Wessels, J., Markov programming by successive approximations with respect to weighted supremum norms, J. Math. Anal. and Appl. 58 (1977), 326-335.

[15] Zijm, W.H.M., Bounding functions for Markov decision processes in relation to the spectral radius, Operations Research Verfahren 21 (1978), 461-472.
