Determination of the spectral radius of a Markov decision
process
Citation for published version (APA):
Zijm, W. H. M. (1979). Determination of the spectral radius of a Markov decision process. (Memorandum COSOR; Vol. 7909). Technische Hogeschool Eindhoven.
Document status and date: Published: 01/01/1979
Department of Mathematics
PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP
Memorandum COSOR 79-09
Determination of the spectral radius of a Markov decision process
by
W.H.M. Zijm
Eindhoven, October 1979, The Netherlands
Determination of the spectral radius of a Markov decision process
by W.H.M. Zijm
0. Abstract

Consider a Markov decision process in the situation of discrete time, finite state space and finite action space. A positive probability for fading of the system is allowed. In this case, contraction properties of certain operators, used in dynamic programming, are strictly related to the spectral radius of the process. In this paper a method for estimating this spectral radius is proposed.
The result can be extended immediately to the case in which the transition probability matrices are replaced by general nonnegative matrices.
1. Introduction
Contraction properties of certain operators in a Banach space often play an important role in the theory of Markov decision processes. Assuming a positive probability of leaving the system in N stages (uniform in the starting state and the strategy) or, in other words, having so-called N-step contraction, one can construct a strongly excessive function, i.e. a strictly positive (column) vector μ such that for all matrices P which are involved in the system we have

(1)  Pμ ≤ ρμ,

where ρ, the excessivity factor, is smaller than one.
This excessivity factor can be used in determining the rate of convergence of a successive approximation procedure and, closely related to this, it can be used in estimating the value function of a Markov decision process (Wessels [14]). It turns out that strongly excessive functions can be characterized in terms of the spectral radius; more specifically, it is possible to construct strongly excessive functions with an excessivity factor arbitrarily close to the spectral radius of a Markov decision process (van Hee and Wessels [2], Zijm [15]), so that exact knowledge of this spectral radius would be very useful.
Several authors have dealt with questions concerning the spectral radius of a Markov decision process, most of them in the irreducible case (Isaacson and Luecke [5], Howard and Matheson [4], Mandl [7], Mandl and Seneta [8], Sladký [11], [12]). Rothblum [9] gives a Cesàro-Laurent expansion of the resolvent of a specific matrix on the spectral circle. Finally, Veinott [13] presents a number of results concerning transient dynamic programming (spectral radius smaller than one).
In this paper we present a method to obtain a good approximation for the spectral radius of a Markov decision process, using the strongly excessive functions mentioned above. In fact, we will find a sequence of strongly excessive functions z(λ_n), with excessivity factors λ_n, such that the sequence (λ_n)_{n=0}^∞ converges to the spectral radius. For that purpose we first introduce some concepts and notations and give a few preliminary results (section 2). In section 3 we develop the approximation method for the spectral radius and prove that the sequence (λ_n)_{n=0}^∞ actually converges. In section 4 we take a closer look at the functions μ_n and investigate their asymptotic behaviour. As a by-product of this investigation we find that all irreducible matrices with maximal spectral radius have the same, strictly positive, right eigenvector associated with this spectral radius. Finally, in section 5, we determine upper and lower bounds for the spectral radius.
2. Preliminaries
Consider a Markov decision process with a finite state space S and a finite action space K. A system is observed at discrete points of time. If at time t the state of the system is i ∈ S we may choose an action k ∈ K, which results in a probability p^k_{ij} of observing the system in state j at time t+1. We suppose

(2)  Σ_{j∈S} p^k_{ij} ≤ 1, for all i ∈ S, k ∈ K.

Hence a positive probability for fading of the system is allowed.
A policy f is a function on S with values in K. By F we denote the set of policies {f | f: S → K}. A strategy s is a sequence of policies: s = (f_0, f_1, f_2, ...). If we use strategy s, then action f_t(i) is taken if at time t the state of the system is i. A stationary strategy is a strategy s = (f, f, f, ...).
The spectral radius ρ* of this Markov decision process is defined as

(3)  ρ* := max_f limsup_{n→∞} ||P^n(f)||^{1/n},

with P(f) := (p^{f(i)}_{ij})_{i,j∈S} and ||·|| the usual sup-norm. Throughout this paper we assume ρ* < 1.
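For a single nonnegative matrix, the quantity limsup_{n→∞} ||P^n(f)||^{1/n} in (3) is its Perron root and can be approximated numerically. The sketch below is my own illustration, not code from the memo; the matrix and iteration count are invented for the example. It uses power iteration in the sup-norm:

```python
def spectral_radius_power(P, iters=500):
    """Power iteration on a nonnegative matrix: v <- P v / ||P v||, rho ~ ||P v||."""
    n = len(P)
    v = [1.0] * n
    rho = 0.0
    for _ in range(iters):
        w = [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = max(w)          # sup-norm of P v, since ||v|| == 1
        if rho == 0.0:
            return 0.0
        v = [wi / rho for wi in w]
    return rho

# illustrative matrix: trace 0.9, determinant 0.14, so eigenvalues 0.7 and 0.2
rho = spectral_radius_power([[0.5, 0.3], [0.2, 0.4]])
```

For an MDP, ρ* is then the maximum of this quantity over the finitely many stationary policies.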
We say that a vector v is nonnegative (positive), written v ≥ 0 (v ≫ 0), if all its components are nonnegative (positive). We say that v is semipositive, written v > 0, if v ≥ 0 and v ≠ 0. We write v ≥ w (v ≫ w, v > w) if v - w ≥ 0 (v - w ≫ 0, v - w > 0). By e we denote the vector with all components equal to one, by I the identity matrix.
For λ > ρ* we define:

(4)  z(λ) := max_f (λI - P(f))^{-1} e = max_f Σ_{n=0}^∞ λ^{-(n+1)} P^n(f) e

(max taken component-wise). This concept will play a central role in the approximation method. The following properties can be proved.
Lemma 1: For λ > ρ* it holds:
a) ||z(λ)|| < ∞;
b) z(λ) = sup_s Σ_{n=0}^∞ λ^{-(n+1)} Π_{t=0}^{n-1} P(f_t) e, where s = (f_0, f_1, f_2, ...);
c) λz(λ) = max_f {e + P(f)z(λ)}.

Proof: a) follows from a more general result in van Hee and Wessels [2]. In this case it is immediate from

max_f limsup_{n→∞} ||λ^{-(n+1)} P^n(f)||^{1/n} = λ^{-1} ρ* < 1.

The proof of b) and c) can be found in Zijm [15]. □
From lemma 1c it follows that, for λ > ρ*, we have

(5)  P(f)z(λ) ≤ λz(λ), for all f ∈ F.

For λ < 1, z(λ) is an example of a strongly excessive function.
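As a small numerical illustration of (4) and (5) for one fixed matrix P (the matrix and the choice λ = 0.8 are invented for this sketch, not taken from the memo), the series defining z(λ) can be summed directly:

```python
def z_lambda(P, lam, terms=2000):
    """Partial sum of z(lam) = sum_{n>=0} lam^-(n+1) P^n e, for a single matrix P."""
    size = len(P)
    v = [1.0] * size              # holds P^n e
    z = [0.0] * size
    scale = 1.0 / lam             # holds lam^-(n+1)
    for _ in range(terms):
        z = [zi + scale * vi for zi, vi in zip(z, v)]
        v = [sum(P[i][j] * v[j] for j in range(size)) for i in range(size)]
        scale /= lam
    return z

P = [[0.5, 0.3], [0.2, 0.4]]      # spectral radius 0.7
z = z_lambda(P, 0.8)              # lam = 0.8 > 0.7, so the series converges
# z solves (0.8*I - P) z = e, and P z <= 0.8 z holds component-wise,
# so z is strongly excessive with factor 0.8 < 1.
```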
Definition: A strongly excessive function of a Markov decision process is a (column) vector μ ≫ 0 such that for some ρ < 1 (the excessivity factor) we have

(6)  P(f)μ ≤ ρμ, for all f ∈ F.

Lemma 2: For a Markov decision process with spectral radius ρ* < 1 and strongly excessive function μ with excessivity factor ρ we have ρ* ≤ ρ.

Proof: Put c_1 := min_i μ(i) > 0 and c_2 := max_i μ(i). Iterating (6) gives P^n(f)μ ≤ ρ^n μ, hence ||P^n(f)|| ≤ c_1^{-1} c_2 ρ^n for all f and n, so that

ρ* = max_f limsup_{n→∞} ||P^n(f)||^{1/n} ≤ limsup_{n→∞} (c_1^{-1} c_2)^{1/n} ρ = ρ. □

Remark: Strongly excessive functions in the case of a countable state space and general action space are treated extensively in van Hee and Wessels [2].
We see that ρ* is a lower bound for the excessivity factor of any strongly excessive function. Moreover, (5) shows that it is possible, e.g. by a policy iteration procedure ([3]), to construct strongly excessive functions with an excessivity factor arbitrarily close to the spectral radius ρ*, if the latter is exactly known (choose λ such that ρ* < λ ≤ ρ* + ε, ε small).
In the following, however, we suppose ρ* < 1, but otherwise unknown, and we use the concept of strongly excessive functions to find a good estimate for it.

3. Approximation of the spectral radius
In this section we will construct a sequence of numbers (λ_n)_{n=0}^∞ and a sequence of vectors (z(λ_n))_{n=0}^∞ (defined by (4)) such that

(7)  lim_{n→∞} λ_n = ρ*,

where z(λ_n) is needed for the calculation of λ_{n+1}. The basic idea of the method is as follows:
Consider, for λ ≥ ρ*, the functional equation

(8)  λz = max_f {e + P(f)z}.

From (4) and lemma 1 it follows that the unique, strictly positive, finite solution z = z(λ) exists if and only if λ > ρ*. Hence the existence of such a solution is a method to determine whether λ > ρ* or λ = ρ*. Moreover, if λ > ρ*, then

(9)  max_f P(f)z(λ) = λz(λ) - e ≤ λ(1 - 1/(λ||z(λ)||)) z(λ).

Hence, by lemma 2,

(10)  ρ* ≤ λ(1 - 1/(λ||z(λ)||)) < λ

and we may investigate (8) again, now for λ' := λ(1 - 1/(λ||z(λ)||)).
In general, starting with some λ_0 > ρ*, e.g. λ_0 = 1, we define for n ≥ 0:

λ_{n+1} = λ_n (1 - 1/(λ_n ||z(λ_n)||)), if (8) has a strictly positive, finite solution z = z(λ_n) for λ = λ_n; otherwise λ_{n+1} = λ_n.

The convergence of (λ_n)_{n=0}^∞ is established in the following theorem.
Theorem 1: lim_{n→∞} λ_n = ρ*.

Proof: It is immediately clear that λ_n ≥ ρ*, n = 0,1,2,..., which implies (cf. lemma 1) that we define λ_{n+1} = λ_n if and only if λ_n = ρ*. Hence we may suppose that (λ_n)_{n=0}^∞ is strictly decreasing. Suppose

γ := lim_{n→∞} λ_n > ρ*  (hence γ > 0).

Then ||z(γ)|| < ∞ and, since λz(λ) is decreasing in λ,

(1 - 1/(λ_n ||z(λ_n)||)) ≤ (1 - 1/(γ||z(γ)||)) < 1, for n = 0,1,2,...,

which implies

lim_{n→∞} λ_n ≤ lim_{n→∞} (1 - 1/(γ||z(γ)||))^n λ_0 = 0.

Because of the contradiction, we must conclude lim_{n→∞} λ_n = ρ*. □
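The scheme just proved convergent can be sketched numerically. This is an illustration under my own assumptions, not code from the memo: z(λ) is computed exactly from (4) by enumerating the finitely many stationary policies and solving the linear systems (λI - P(f))z = e, and the two-state, two-action transition data are invented for the example.

```python
from itertools import product

def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def z_of(rows, lam):
    """z(lam) = component-wise max over policies f of (lam*I - P(f))^-1 e, cf. (4)."""
    n = len(rows)
    best = [float("-inf")] * n
    for choice in product(*[range(len(acts)) for acts in rows]):
        P = [rows[i][choice[i]] for i in range(n)]
        A = [[(lam if i == j else 0.0) - P[i][j] for j in range(n)] for i in range(n)]
        z = solve_linear(A, [1.0] * n)
        best = [max(b, zi) for b, zi in zip(best, z)]
    return best

def approximate_rho(rows, lam=1.0, tol=1e-12, max_outer=200):
    """Iterate lam_{n+1} = lam_n * (1 - 1 / (lam_n * ||z(lam_n)||))."""
    for _ in range(max_outer):
        z = z_of(rows, lam)
        lam_next = lam * (1.0 - 1.0 / (lam * max(z)))
        if lam - lam_next < tol:
            return lam_next
        lam = lam_next
    return lam

# rows[i] lists the candidate transition rows p(.|i,k), one per action k
rows = [
    [[0.5, 0.3], [0.4, 0.1]],   # state 0: two actions
    [[0.2, 0.4], [0.1, 0.2]],   # state 1: two actions
]
rho_hat = approximate_rho(rows)
```

For this data the entrywise largest policy matrix has spectral radius 0.7, and the iterates λ_n decrease monotonically towards it, as theorem 1 asserts.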
In order to avoid trivialities we suppose in the rest of this paper that always the first possibility occurs, i.e. that we will find a monotone decreasing sequence (λ_n)_{n=0}^∞ such that λ_n > ρ* for all n, and lim_{n→∞} λ_n = ρ*.
In the following section we will take a closer look at the asymptotic behaviour of the sequence {z(λ_n)}_{n=0}^∞, for two reasons. First of all, the study of this asymptotic behaviour leads to a few nice properties of the Markov decision process. Secondly, we will need the results in order to derive lower bounds for the spectral radius.
4. Asymptotic behaviour of {z(λ_n)}_{n=0}^∞

In this section we study the asymptotic behaviour of {z(λ_n)}_{n=0}^∞. More specifically, defining

(11)  μ_n := z(λ_n) / ||z(λ_n)||,

we will show the existence of a vector μ > 0 such that

(12)  lim_{n→∞} μ_n = μ.
We will need some statements from the theory of linear operators on a finite dimensional vector space (Dunford and Schwartz [1], Kato [6]). The resolvent R(λ,P) of an S × S matrix P is defined as

(13)  R(λ,P) := (λI - P)^{-1}, for λ ∉ σ(P).

Here σ(P) denotes the spectrum of P. The spectral radius of P is denoted by ρ(P); recall that for nonnegative matrices ρ(P) is real and nonnegative, that ρ(P) equals the largest eigenvalue of P, and that we may choose the corresponding left- and right-eigenvectors nonnegative (Perron-Frobenius theorem, see e.g. Seneta [10]).
From Kato [6] it is seen that we may give a Laurent expansion of R(λ,P) at λ = ρ(P), for λ sufficiently close to ρ(P), which takes the form:
(14)  R(λ,P) = Σ_{n=-k(P)}^∞ (λ - ρ(P))^n A_n(P).

We will not consider in detail the meaning of the matrices A_n(P) and of k(P) (see for example Rothblum [9]); we only note that k(P) ≤ S. It follows that ρ(P) is a pole of R(λ,P) of order k(P). In the same way we may write

(15)  R(λ,P)e = Σ_{n=-k(P)}^∞ (λ - ρ(P))^n e_n(P),

where

e_n(P) := A_n(P)e, for n = -k(P), -k(P)+1, ...

Furthermore,

(16)  e_{-k(P)}(P) / ||e_{-k(P)}(P)|| = lim_{λ↓ρ(P)} R(λ,P)e / ||R(λ,P)e|| > 0.

Returning to our notations of the preceding sections, notice that for λ > ρ*

(17)  z(λ) = max_f (λI - P(f))^{-1} e = max_f R(λ,P(f))e.

Let

(18)  F_0 := {f ∈ F | ρ(P(f)) < ρ*}

and

(19)  ρ_0 := max_{f∈F_0} ρ(P(f)).

Since F_0 is finite, there exists a constant C such that for λ ≥ ρ* > ρ_0

Σ_{n=0}^∞ λ^{-(n+1)} P^n(f)e ≤ C·e, for all f ∈ F_0.

However, lim_{λ↓ρ*} ||z(λ)|| = ∞.
Returning to the sequence {λ_n}_{n=0}^∞ of section 3, this means that for n large enough we are only dealing with policies f such that ρ(P(f)) = ρ*. Combining (15) and (17), we may write for λ_n close enough to ρ* (hence for n large enough):

(20)  z(λ_n) = max_{f∈F_1} Σ_{l=-k(P(f))}^∞ (λ_n - ρ*)^l e_l(P(f)),

with F_1 := {f | ρ(P(f)) = ρ*}, hence F_1 = F \ F_0. Defining

k(1) := max_{f∈F_1} k(P(f))

and

e_{-k(1)} := max_{f∈F_1} e_{-k(1)}(P(f))  (with e_{-k(1)}(P(f)) := 0 if k(P(f)) < k(1)),

we conclude from (20) and (16):

(21)  lim_{n→∞} μ_n = lim_{n→∞} z(λ_n)/||z(λ_n)|| = e_{-k(1)} / ||e_{-k(1)}||.

We establish two immediate corollaries of the existence of lim_{n→∞} μ_n.
Theorem 2: There exists a vector μ > 0 such that

max_f P(f)μ = ρ* μ.

Proof: From lemma 1c and (11) we have

λ_n μ_n = max_f { e/||z(λ_n)|| + P(f)μ_n }.

Define μ := e_{-k(1)} / ||e_{-k(1)}||; then, by (21) and theorem 1, the result follows. □
For the next result we need the concept of an incidence matrix.

Definition: The incidence matrix of a Markov decision process is a matrix P̄ = (p̄_{ij})_{i,j∈S}, defined by p̄_{ij} = 1 if p^{f(i)}_{ij} > 0 for some f, and p̄_{ij} = 0 otherwise.
Theorem 3: If the incidence matrix of a Markov decision process is irreducible (we say that the system is communicating), then the vector μ, defined in theorem 2, is strictly positive.

Proof: Suppose μ(i) = 0 exactly for i ∈ D, with D nonempty; note D ≠ S, since μ > 0. Then, since P(f)μ ≤ ρ*μ for all f, we have

p^{f(i)}_{ij} = 0, for i ∈ D, j ∉ D, for all f ∈ F.

Hence, for the incidence matrix P̄: p̄_{ij} = 0 for i ∈ D, j ∉ D, which is impossible, since P̄ is irreducible. Hence D = ∅. □

Corollary: In a communicating system (irreducible incidence matrix) each irreducible transition matrix with maximal spectral radius ρ* possesses exactly the same, strictly positive, right eigenvector μ, associated with ρ*.

Proof: Let P(f) be irreducible and ρ(P(f)) = ρ*. According to the well-known Perron-Frobenius theorem, P(f) possesses strictly positive left- and right-eigenvectors associated with ρ*. From P(f)μ ≤ ρ*μ it follows, by multiplying with the left eigenvector, that P(f)μ = ρ*μ, where μ ≫ 0 (theorem 3). □
In the final section we will use the fact that lim_{n→∞} μ_n exists to determine upper and lower bounds for the spectral radius at each step of the approximation procedure, and we prove the convergence of these bounds to ρ*. At the end we summarize the whole procedure.
5. Upper and lower bounds for the spectral radius

In this section we will determine upper and lower bounds for the spectral radius at each step of the approximation procedure. Obviously λ_n, n = 0,1,2,..., may serve as an upper bound for ρ*. In order to determine a lower bound we define

(22)  α_n := min_i (max_f P(f)z(λ_n))(i) / (z(λ_n))(i).

Then the following result holds.

Lemma 3: α_n ≤ ρ*.
Proof: From (11) and (22) we conclude that there exists a policy f such that

P(f)z(λ_n) ≥ α_n z(λ_n).

Completely analogous to the proof of lemma 2 we find that

limsup_{k→∞} ||P^k(f)||^{1/k} ≥ α_n,

hence ρ* ≥ α_n. □
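For a single matrix the lower bound (22) is cheap to evaluate once z(λ_n) is known. A small illustrative check, with the matrix, the choice λ = 0.8, and the precomputed z(0.8) all invented for this sketch:

```python
def lower_bound(P, z):
    """alpha = min_i (P z)(i) / z(i): a lower bound for the spectral radius, cf. (22)."""
    n = len(P)
    Pz = [sum(P[i][j] * z[j] for j in range(n)) for i in range(n)]
    return min(Pz[i] / z[i] for i in range(n))

P = [[0.5, 0.3], [0.2, 0.4]]     # spectral radius 0.7
z = [35.0 / 3.0, 25.0 / 3.0]     # z(0.8) = (0.8*I - P)^-1 e, computed beforehand
alpha = lower_bound(P, z)
# alpha <= 0.7 <= 0.8: the spectral radius is bracketed between alpha and lambda
```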
Theorem 4: If the system is communicating, then

lim_{n→∞} α_n = ρ*.

Proof:

lim_{n→∞} α_n = lim_{n→∞} min_i (max_f P(f)μ_n)(i) / μ_n(i) = min_i (max_f P(f)μ)(i) / μ(i) = ρ*,

since μ = lim_{n→∞} μ_n exists (theorem 2 and (21)) and μ ≫ 0 (theorem 3). □
Hence in the communicating case we have converging upper and lower bounds (theorem 1 and theorem 4). The following simple example shows that difficulties may arise in the general case.

Example: Suppose we have a system with two states, 1 and 2, and only one policy f. Let

P(f) = ( 1/2  0 )
       (  0   1 ).

Hence ρ* = 1, while for n large enough μ_n will take the form μ_n = (ε_n, 1)^T, with ε_n → 0. By (22),

lim_{n→∞} α_n = 1/2 < ρ*.

To avoid the difficulties illustrated above, we proceed as follows. In the case that the incidence matrix P̄ is reducible we may, possibly after permuting rows and corresponding columns, write
(23)  P̄ = ( P̄_11                  )
           ( P̄_21  P̄_22           )
           (  ...          ...     )
           ( P̄_r1  P̄_r2  ...  P̄_rr )

where P̄_ll, l = 1,...,r, is irreducible and P̄_lk = 0 for k > l. Denote by C_l the subset of states corresponding to P̄_ll, for l = 1,...,r. Now, for all f ∈ F, define the matrix Q(f) := (q^{f(i)}_{ij})_{i,j∈S} by

(24)  q^{f(i)}_{ij} := p^{f(i)}_{ij}, if i,j ∈ C_l for some l ∈ {1,...,r}; q^{f(i)}_{ij} := 0 elsewhere.

It is well-known that Q(f) possesses the same eigenvalues as P(f), for all f ∈ F; in particular ρ* = max_f ρ(Q(f)). Let Q_ll(f) := P_ll(f) and define, for l = 1,...,r,

(25)  ρ*_l := max_f ρ(Q_ll(f));

then it is easy to see that we can estimate ρ*_l, l = 1,...,r, simply by applying the procedure described in section 3 to r different problems now. If we restrict ourselves to the Markov decision process with transition matrices Q_ll(f), f ∈ F, then we have a communicating system, and hence, by theorems 1 and 4, we can determine upper and lower bounds which converge to ρ*_l. If we are only interested in ρ*, then the procedure may be the following:
Step 0: Calculate the incidence matrix P̄ of the Markov decision process. Determine its irreducible classes, C_1,...,C_r say, and cancel, for all matrices P(f), the transitions between C_k and C_l, k,l = 1,...,r; k ≠ l.

Step 1: Choose δ > 0; define λ_0 = 1 and n = 0.

Step 2: Investigate the functional equation

(26)  λ_n z = max_f {e + P(f)z}.

If (26) has no strictly positive finite solution, then go to A. Otherwise, define

λ_{n+1} = λ_n (1 - 1/(λ_n ||z(λ_n)||)).

Step 3: Calculate

α_n = max_{l=1,...,r} min_{i∈C_l} (max_f P(f)z(λ_n))(i) / (z(λ_n))(i).

If λ_{n+1} - α_n < δ, then go to B. Otherwise, increase n by 1 and return to step 2.

A: λ_n = ρ*; stop.

B: Stop.
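Step 0 asks for the irreducible classes of the incidence matrix. A possible sketch (my own illustration, not the memo's prescription; for a large state space a Tarjan-type algorithm is preferable) uses Warshall's transitive closure, since i and j belong to the same class exactly when each reaches the other:

```python
def communicating_classes(inc):
    """Irreducible classes of an incidence matrix: i ~ j iff i reaches j and j reaches i."""
    n = len(inc)
    reach = [[bool(inc[i][j]) or i == j for j in range(n)] for i in range(n)]
    for k in range(n):                      # Warshall's transitive closure
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    classes, seen = [], set()
    for i in range(n):
        if i not in seen:
            cls = [j for j in range(n) if reach[i][j] and reach[j][i]]
            seen.update(cls)
            classes.append(cls)
    return classes

# invented 3-state incidence matrix: 0 and 1 communicate, state 2 only feeds into 1
inc = [[1, 1, 0],
       [1, 0, 0],
       [0, 1, 1]]
classes = communicating_classes(inc)
```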
We will end this paper with some remarks. First of all, the proofs in this paper remain unchanged if we replace the substochastic matrices by general nonnegative matrices. The only difference is that we must choose λ_0 such that λ_0 > ρ* (for instance, let λ_0 be the maximal row sum, where the maximum is taken over all rows and all matrices).
In every step of the approximation procedure we have to solve a set of functional equations. This can be done e.g. by using Howard's policy iteration procedure (see Howard [3]). However, for a large state space the amount of work will grow rapidly. For that reason we may use approximations of z(λ_n), instead of z(λ_n) itself, in order to improve λ_n. Using a specific type of approximations, again convergence of the sequence (λ_n)_{n=0}^∞ to ρ* can be proved (see Zijm [15]).
Finally, notice that the idea of investigating the functional equation (8) has also been used in Mandl [7], although the approximation technique there is quite different. Moreover, in [7] only Markov decision processes with strictly positive transition matrices are studied.
Acknowledgement: I am grateful to Jan van der Wal and to Professor
Uriel G. Rothblum for several valuable suggestions, especially with
respect to section 4.
References

[1] Dunford, N. and J.T. Schwartz, Linear Operators, Part I, Interscience, New York (1958).
[2] van Hee, K.M. and J. Wessels, Markov decision processes and strongly excessive functions, Stoch. Proc. and their Appl. 8 (1978), 59-76.
[3] Howard, R.A., Dynamic programming and Markov processes, Wiley, New York (1960).
[4] Howard, R.A. and J.E. Matheson, Risk-sensitive Markov decision processes, Management Science 18 (1972), 356-369.
[5] Isaacson, D. and G.R. Luecke, Strongly ergodic Markov chains and rates of convergence using spectral conditions, Stoch. Proc. and their Appl. 7 (1978), 113-121.
[6] Kato, T., Perturbation theory for linear operators, Springer-Verlag, New York (1966).
[7] Mandl, P., An iterative method for maximizing the characteristic root of positive matrices, Rev. Roum. Math. Pures et Appl. 14 (1969), 1317-1322.
[8] Mandl, P. and E. Seneta, The theory of non-negative matrices in a dynamic programming problem, The Australian Journal of Statistics 11 (1969), 85-96.
[9] Rothblum, U.G., Expansions of sums of matrix powers and resolvents, to appear.
[10] Seneta, E., Nonnegative matrices, MacMillan Company, New York (1973).
[11] Sladký, K., On dynamic programming recursions for multiplicative Markov decision chains, Mathematical Programming Study 6 (1976), 216-226 (North-Holland Publishing Company).
[12] Sladký, K., Successive approximation methods for dynamic programming models, Proceedings of the Third Formator Symposium on Mathematical Methods for the Analysis of Large-Scale Systems, Prague (1979), 171-189.
[13] Veinott, A.F., Discrete dynamic programming with sensitive discount optimality criteria, Ann. Math. Stat. 40 (1969), 1635-1660.
[14] Wessels, J., Markov programming by successive approximations with respect to weighted supremum norms, J. Math. Anal. and Appl. 58 (1977), 326-335.
[15] Zijm, W.H.M., Bounding functions for Markov decision processes in relation to the spectral radius, Operations Research Verfahren 21 (1978), 461-472.