Geometric convergence in average reward Markov decision processes


Citation for published version (APA):

Zijm, W. H. M. (1980). Geometric convergence in average reward Markov decision processes. (Memorandum COSOR; Vol. 8008). Technische Hogeschool Eindhoven.

Document status and date: Published: 01/01/1980

Document version: Publisher's PDF, also known as Version of Record (includes final page, issue and volume numbers)


EINDHOVEN UNIVERSITY OF TECHNOLOGY

Department of Mathematics

PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 80-08

Geometric convergence in average reward Markov decision processes

by

W.H.M. Zijm

Eindhoven, June 1980 The Netherlands


Abstract

Recently, Federgruen and Schweitzer [3] proved that in undiscounted Markov decision problems the value iteration method for finding maximal gain policies converges geometrically fast, whenever convergence occurs. This result was obtained without any restriction on either the periodicity or chain structure of the problem. In this paper we establish the same result once again; the proof, however, seems essentially simpler and, moreover, yields an upper bound for the convergence rate.

1. Introduction

In a recent, remarkable paper [3] Federgruen and Schweitzer showed that the value iteration method for finding maximal gain policies in undiscounted Markov decision problems exhibits a geometric rate of convergence, whenever convergence occurs. In the case that after a finite number of steps we are dealing with only one maximizing policy, this fact is immediately clear by exploiting the so-called Jordan form of a matrix (compare e.g. Pease [5]). In this case a sharp upper bound for the ultimate convergence factor is given by the absolute value of the largest eigenvalue with modulus strictly smaller than one (which is not the same as the subradius as it was defined by e.g. Morton and Wecker [4]; in multichain or periodic cases this subradius is equal to one). In this paper we present an alternative proof for the main result in Federgruen and Schweitzer [3], a proof which we believe to be essentially simpler. Moreover, we find an upper bound for the ultimate convergence rate.

We consider a discrete time Markov decision process with finite state space $S = \{1,\dots,N\}$ and finite action space $K$. Choosing action $k \in K$ when the system is in state $i \in S$ results in a probability $p_{ij}^k$ of observing the system in state $j$ at the next point of time. Furthermore a reward $r_i^k$ is earned. A policy $f$ is a function from the state space to the action space; a strategy $\pi$ is a sequence of policies: $\pi = (f_0, f_1, f_2, \dots)$. A strategy $\pi = (f, f, f, \dots)$ is called stationary. Furthermore we denote by $P(f)$ the matrix with elements $p_{ij}^{f(i)}$, $i,j = 1,\dots,N$; $r(f)$ denotes the vector with components $r_i^{f(i)}$, $i = 1,\dots,N$.

In undiscounted Markovian decision problems the value-iteration equations can be written as follows:

(1)  $v_i(n) = \max_{k \in K} \Big\{ r_i^k + \sum_{j=1}^{N} p_{ij}^k \, v_j(n-1) \Big\}, \quad i = 1,\dots,N;\ n = 1,2,\dots$

Here $v_i(n)$ denotes the $i$-th component of a vector $v(n)$ and $v(0)$ is given.
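As an illustration of recursion (1), the following is a minimal Python sketch. The nested-list layout `p[i][k][j]` and `r[i][k]`, the function names, and the two-state data are assumptions made for this example only; they do not come from the memorandum.

```python
from typing import List

# Assumed layout (not from the paper): p[i][k][j] is the probability of moving
# from state i to state j under action k, and r[i][k] is the reward earned.
Transition = List[List[List[float]]]
Reward = List[List[float]]


def value_iteration_step(p: Transition, r: Reward, v: List[float]) -> List[float]:
    """Apply recursion (1) once: v_i(n) = max_k { r_i^k + sum_j p_ij^k v_j(n-1) }."""
    n_states = len(v)
    return [
        max(
            r[i][k] + sum(p[i][k][j] * v[j] for j in range(n_states))
            for k in range(len(p[i]))
        )
        for i in range(n_states)
    ]


def value_iteration(p: Transition, r: Reward, v0: List[float], n_steps: int) -> List[float]:
    """Return v(n_steps), starting from the given vector v(0)."""
    v = list(v0)
    for _ in range(n_steps):
        v = value_iteration_step(p, r, v)
    return v


# Hypothetical two-state, two-action data, chosen only for illustration.
P_EXAMPLE = [
    [[0.5, 0.5], [1.0, 0.0]],
    [[0.0, 1.0], [0.5, 0.5]],
]
R_EXAMPLE = [[1.0, 0.0], [2.0, 3.0]]
```

Calling `value_iteration(P_EXAMPLE, R_EXAMPLE, [0.0, 0.0], 2)` yields the vector $v(2)$ for this toy problem; states are indexed from 0 in the code, whereas the text indexes them from 1.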

Brown [1] showed that

(2)  $\| v(n) - n g^* \| \le C$

for some constant $C$ (here $\|\cdot\|$ denotes the usual sup-norm). In the above expression $g^*$ denotes the maximal gain rate vector, defined by

(3)  $g_i^* = \max_f g_i(f), \quad i = 1,\dots,N, \quad \text{with} \quad g(f) = \lim_{n \to \infty} \frac{1}{n+1} \sum_{\ell=0}^{n} P^{\ell}(f)\, r(f).$

Derman [2] proved that there exists a policy that achieves the $N$ suprema in (3) simultaneously.
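The Cesàro average in (3) can be approximated numerically for a fixed policy. The sketch below truncates the limit at a finite $n$; the function names and the periodic two-state chain are hypothetical choices for illustration, not data from the memorandum.

```python
from typing import List


def mat_vec(P: List[List[float]], x: List[float]) -> List[float]:
    """Return the matrix-vector product P x."""
    return [sum(P[i][j] * x[j] for j in range(len(x))) for i in range(len(P))]


def cesaro_gain(P: List[List[float]], r: List[float], n: int) -> List[float]:
    """Finite-n version of (3): (1 / (n + 1)) * sum_{l=0}^{n} P^l r."""
    term = list(r)    # current term P^l r, starting with l = 0
    total = list(r)   # running sum of the terms
    for _ in range(n):
        term = mat_vec(P, term)
        total = [t + s for t, s in zip(total, term)]
    return [t / (n + 1) for t in total]


# Hypothetical periodic chain: the two states alternate deterministically.
P_PERIODIC = [[0.0, 1.0], [1.0, 0.0]]
R_PERIODIC = [0.0, 2.0]
```

For this chain the powers $P^{\ell}$ do not converge, yet the averaged sum does: `cesaro_gain(P_PERIODIC, R_PERIODIC, 999)` gives a gain of 1 in both states.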

In general $\{v(n) - ng^*\}_{n=1}^{\infty}$ may fail to converge for arbitrary $v(0)$ if some of the transition probability matrices are periodic. The necessary and sufficient condition for the convergence of $\{v(n) - ng^*\}_{n=1}^{\infty}$ for all $v(0)$ was obtained by Schweitzer and Federgruen [6]. A very easy sufficient condition is the assumption that all matrices are aperiodic. The main result in Federgruen and Schweitzer [3] finally states that, whenever $\lim_{n \to \infty} \{v(n) - ng^*\}$ exists, the convergence to the limit is geometric. It is this result that will be proved here again in a relatively simple way, in section 2. The method of proving the result implies the existence of an upper bound, strictly smaller than one, for the ultimate convergence rate, which is independent of the starting vector.
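The dependence on $v(0)$ in the periodic case can be seen on a small example. The sketch below assumes a single policy (so the maximum in (1) is trivial) and a hypothetical two-state periodic chain with gain $g^* = (1, 1)$; none of these numbers come from the memorandum.

```python
from typing import List


def deviations(P: List[List[float]], r: List[float], g: List[float],
               v0: List[float], n_steps: int) -> List[List[float]]:
    """Return [v(n) - n*g for n = 1..n_steps] for one fixed policy (P, r)."""
    v = list(v0)
    out = []
    for n in range(1, n_steps + 1):
        # Single-policy form of recursion (1): v(n) = r + P v(n-1).
        v = [r[i] + sum(P[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
        out.append([v[i] - n * g[i] for i in range(len(v))])
    return out


# Hypothetical periodic chain with rewards (0, 2); its gain is 1 in both states.
P_PERIODIC = [[0.0, 1.0], [1.0, 0.0]]
R_PERIODIC = [0.0, 2.0]
G_PERIODIC = [1.0, 1.0]
```

Starting from $v(0) = (0, 0)$ the sequence $v(n) - ng^*$ alternates between two vectors and does not converge, whereas starting from $v(0) = (0, 1)$ it is constant. This is the situation covered by the phrase "whenever convergence occurs".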

2. Geometric convergence of value-iteration

As a starting point in this section we assume that $\{v(n) - ng^*\}_{n=1}^{\infty}$ is converging to some vector $w^*$. Define $\{e(n)\}_{n=1}^{\infty}$ by

(4)  $e(n) = v(n) - ng^* - w^*, \quad n = 1,2,\dots$

Our aim in this section will be to prove that $e(n)$ is approaching zero geometrically fast after a finite number of steps, i.e. for $n \ge n_0$ say. Substitution of (4) into (1), and writing (1) as a vector equation, gives, for $n = 1,2,\dots$

(5)  $ng^* + w^* + e(n) = \max_f \{ r(f) + (n-1) P(f) g^* + P(f) w^* + P(f) e(n-1) \}.$

Divide both sides of (5) by $n$ and let $n$ tend to infinity. Then we find

(6)  $\max_f P(f) g^* = g^*.$

Let $A$ denote the set of policies which maximize $P(f) g^*$. It is clear that for $n$ sufficiently large, $n \ge n_1$ say, a maximizing policy $f$ in (5) will be in $A$, hence $P(f) g^* = g^*$. Subtracting this from (5) reduces the functional equation to

(7)  $g^* + w^* + e(n) = \max_{f \in A} \{ r(f) + P(f) w^* + P(f) e(n-1) \}, \quad n \ge n_1.$

Again, let $n$ tend to infinity. Then we have, since $\lim_{n \to \infty} e(n) = 0$,

(8)  $g^* + w^* = \max_{f \in A} \{ r(f) + P(f) w^* \}.$

Let $B$ denote the set of policies which maximize $r(f) + P(f) w^*$ (hence $B \subset A$). For $n$ sufficiently large, $n \ge n_2$ say, we find with the same arguments as above the following reduced functional equation

(9)  $e(n) = \max_{f \in B} P(f) e(n-1), \quad n \ge n_2$  (obviously $n_2 \ge n_1$).
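The error vector $e(n)$ of (4) can be tabulated directly once $g^*$ and $w^*$ are known. The sketch below uses a hypothetical two-state example for which $g^* = (0, 0)$ and $w^* = (2, 0)$ were worked out by hand for the starting vector $v(0) = (0, 0)$; the data layout `p[i][k][j]`, `r[i][k]` and all names are assumptions for illustration.

```python
from typing import List


def error_sequence(p: List[List[List[float]]], r: List[List[float]],
                   g_star: List[float], w_star: List[float],
                   v0: List[float], n_steps: int) -> List[List[float]]:
    """Return [e(1), ..., e(n_steps)] with e(n) = v(n) - n g* - w*, as in (4)."""
    n_states = len(v0)
    v = list(v0)
    errors = []
    for n in range(1, n_steps + 1):
        # Recursion (1): maximise over the actions available in each state.
        v = [
            max(
                r[i][k] + sum(p[i][k][j] * v[j] for j in range(n_states))
                for k in range(len(p[i]))
            )
            for i in range(n_states)
        ]
        errors.append([v[i] - n * g_star[i] - w_star[i] for i in range(n_states)])
    return errors


# Hypothetical data: state 1 is absorbing with zero reward; in state 0 the
# first action earns 1 and leaves with probability 0.5, the second earns 0
# and stays put.
P_EXAMPLE = [
    [[0.5, 0.5], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
R_EXAMPLE = [[1.0, 0.0], [0.0, 0.0]]
G_STAR = [0.0, 0.0]
W_STAR = [2.0, 0.0]
```

In this example both actions in state 0 preserve the gain, but only the first one attains the maximum in (8), and the error in state 0 is halved at every step, in line with (9).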

It will be our problem to prove that $\lim_{n \to \infty} e(n) = 0$ implies that the convergence to zero is also geometrically fast. For generality we reformulate the problem as follows:

Suppose we have a set of (column) vectors $\{x(n);\ n = 0,1,\dots\}$ which obey the following dynamic programming recursion

(10)  $x_i(n+1) = \max_{k \in K} \sum_{j=1}^{N} p_{ij}^k \, x_j(n), \quad n = 0,1,2,\dots;\ i = 1,\dots,N,$

where $K$ is a finite set of actions, as defined in the introduction. Suppose furthermore

(11)  $\lim_{n \to \infty} x(n) = 0.$

Then we want to establish the geometric convergence of $\{x(n);\ n = 0,1,\dots\}$.

We will need the following definitions:

$C_n := \{ i \in S \mid x_i(n) > 0 \}, \qquad D_n := \{ i \in S \mid x_i(n) < 0 \},$

$C'_n := S \setminus C_n, \qquad D'_n := S \setminus D_n.$
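For a concrete vector $x(n)$ these four sets are straightforward to compute. The helper below is a small sketch with a hypothetical name; states are indexed from 0 in the code rather than from 1 as in the text.

```python
from typing import List, Set, Tuple


def sign_sets(x: List[float]) -> Tuple[Set[int], Set[int], Set[int], Set[int]]:
    """Return (C, D, C', D') for the vector x: the states with positive
    components, the states with negative components, and their complements."""
    states = set(range(len(x)))
    c = {i for i in states if x[i] > 0}
    d = {i for i in states if x[i] < 0}
    return c, d, states - c, states - d
```

For example, `sign_sets([0.5, -1.0, 0.0])` places state 0 in $C$, state 1 in $D$, and the state with a zero component in both complements.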

From (10) and (11) it is clear that $C'_n \ne \emptyset$, $D'_n \ne \emptyset$ for all $n$. Furthermore $C_n \subset D'_n$, $D_n \subset C'_n$ for all $n$. The following lemma asserts that $\max_{i \in S} \{x_i(n)\}$ is decreasing to zero geometrically fast, whenever $C_0 \ne \emptyset$.

Lemma 1: Suppose $C_0 \ne \emptyset$ and let (10) and (11) hold. Then there exist numbers $\alpha \in \mathbb{R}$, $0 \le \alpha < 1$ and $n_0 \in \mathbb{N}$, $n_0 \le 2^N$ such that

$\max_{i \in S} \{x_i(k \cdot n_0)\} \le \alpha^k \max_{i \in S} \{x_i(0)\}.$

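Proof: Define $R_0 = C_0$ and $R_n$ by

$R_n = \Big\{ i \;\Big|\; \exists\, k \in K : \sum_{j \in R_{n-1}} p_{ij}^k = 1 \Big\}, \quad n = 1,2,\dots$

Then it follows immediately that $R_n \subset C_n$. Now suppose that $R_n \ne \emptyset$ for some $n \ge 2^N$. Since there are at most $2^N - 1$ non-empty subsets of $S$ we must have for some $n_1, n_2 \in \mathbb{N}$ with $0 \le n_1 < n_2 \le 2^N$ that $R_{n_1} = R_{n_2}$. Define $R := R_{n_1} = R_{n_2}$. By definition of $R_n$ there exists a finite sequence of policies $\{f_{n_1+1}, \dots, f_{n_2}\}$ such that

$\sum_{j \in R} \big[ P(f_{n_2}) \cdots P(f_{n_1+1}) \big]_{ij} = 1 \quad \text{for } i \in R,$

and hence $x_i(n_2) \ge \min_{j \in R} x_j(n_1)$ for $i \in R$. By repeating the sequence $\{f_{n_1+1}, \dots, f_{n_2}\}$ again and again we immediately conclude

$x_i(n_1 + k(n_2 - n_1)) \ge \min_{j \in R} x_j(n_1) > 0, \quad k = 1,2,\dots;\ i \in R,$

which contradicts (11). Hence $R_n = \emptyset$ for $n \ge 2^N$.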

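Now define $n'_0 \in \mathbb{N}$ and $\alpha' \in \mathbb{R}$ as follows:

$n'_0 := \min \{ n \mid R_n = \emptyset \}, \qquad \alpha' := \max_{i \in S} \max_{\pi} \sum_{j \in C_0} \big[ P^{n'_0}(\pi) \big]_{ij},$

where $P^{n}(\pi)$ denotes the $n$-step transition matrix under strategy $\pi$; then $\alpha' < 1$ and we find

$\max_i \{x_i(n'_0)\} \le \alpha' \max_i \{x_i(0)\}.$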

We may reason in the same way, using $C_1$ as a starting point; we then find numbers $n''_0$, $\alpha''$, etc. Since $S$ possesses at most $2^N - 1$ non-empty subsets it follows that we are able to determine numbers $n_0 \in \mathbb{N}$, $n_0 \le 2^N$ and $\alpha \in \mathbb{R}$, $\alpha < 1$, such that

$\max_i \{x_i(n + n_0)\} \le \alpha \max_i \{x_i(n)\}, \quad n = 0,1,2,\dots$

and hence

$\max_i \{x_i(k \cdot n_0)\} \le \alpha^k \max_i \{x_i(0)\}, \quad k = 1,2,\dots$
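The sets $R_n$ and a one-block factor of the kind used in this proof can be computed mechanically for a small example. The sketch below is an illustration under stated assumptions: the layout `p[i][k][j]`, the function names, the tolerance `tol`, and the three-state data are all hypothetical. The maximum over strategies is evaluated by applying recursion (10) to the indicator vector of $C_0$, which is a standard backward-induction argument rather than something spelled out in the memorandum.

```python
from typing import List, Set

Transition = List[List[List[float]]]  # assumed layout: p[i][k][j]


def r_sets(p: Transition, c0: Set[int], tol: float = 1e-12) -> List[Set[int]]:
    """Return [R_0, R_1, ...], stopping at the first empty set or after
    2**N steps. State i belongs to R_n if SOME action moves it into
    R_{n-1} with probability one."""
    n_states = len(p)
    sets = [set(c0)]
    while sets[-1] and len(sets) <= 2 ** n_states:
        prev = sets[-1]
        sets.append({
            i for i in range(n_states)
            if any(abs(sum(row[j] for j in prev) - 1.0) <= tol for row in p[i])
        })
    return sets


def max_mass_in_set(p: Transition, target: Set[int], n_steps: int) -> float:
    """Largest probability, over start states and strategies, of being in
    `target` after n_steps transitions."""
    n_states = len(p)
    y = [1.0 if i in target else 0.0 for i in range(n_states)]
    for _ in range(n_steps):
        y = [
            max(sum(row[j] * y[j] for j in range(n_states)) for row in p[i])
            for i in range(n_states)
        ]
    return max(y)


# Hypothetical three-state example; state 2 is absorbing.
P3 = [
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
```

With $C_0 = \{0, 1\}$ (code indices), `r_sets(P3, {0, 1})` returns three sets ending in the empty set, so the first index with an empty set is 2, and `max_mass_in_set(P3, {0, 1}, 2)` evaluates to 0.5, a factor strictly below one as the proof requires.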

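An analogous result may be found for $\min_{i \in S} \{x_i(n)\}$. We have

Lemma 2: Suppose $D_0 \ne \emptyset$ and let (10) and (11) hold. Then there exist numbers $\beta \in \mathbb{R}$, $0 \le \beta < 1$ and $m_0 \in \mathbb{N}$, $m_0 \le 2^N$ such that

$\min_{i \in S} \{x_i(k \cdot m_0)\} \ge \beta^k \min_{i \in S} \{x_i(0)\}.$

Proof: Define $T_0 = D_0$ and $T_n$ by

$T_n = \Big\{ i \;\Big|\; \sum_{j \in T_{n-1}} p_{ij}^k = 1 \ \text{for all } k \in K \Big\}, \quad n = 1,2,\dots$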

Then obviously $T_n \subset D_n$ and $T_n = \emptyset$ for $n \ge 2^N$, since the opposite assertion would imply, by an argument analogous to the one in the proof of Lemma 1, the existence of integers $n_1, n_2$ with $0 \le n_1 < n_2 \le 2^N$ such that $T_{n_1} = T_{n_2} =: T$ and, by definition of $T_n$,

$x_i(n_1 + k(n_2 - n_1)) \le \max_{j \in T} x_j(n_1) < 0, \quad k = 1,2,\dots;\ i \in T,$

contradicting (11) again.

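Hence we may define $m'_0 \in \mathbb{N}$ and $\beta' \in \mathbb{R}$ as follows:

$m'_0 := \min \{ n \mid T_n = \emptyset \}, \qquad \beta' := \max_{i \in S} \min_{\pi} \sum_{j \in D_0} \big[ P^{m'_0}(\pi) \big]_{ij}$

(the $m'_0$-step transition probabilities under strategy $\pi$); then $\beta' < 1$ and we find

$\min_i \{x_i(m'_0)\} \ge \beta' \cdot \min_i \{x_i(0)\}.$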

Using an analogous argument as in the foregoing proof, we also may determine numbers $m_0 \in \mathbb{N}$, $m_0 \le 2^N$ and $\beta \in \mathbb{R}$, $\beta < 1$ such that

$\min_i \{x_i(n + m_0)\} \ge \beta \min_i \{x_i(n)\}, \quad n = 0,1,2,\dots$

and hence

$\min_i \{x_i(k \cdot m_0)\} \ge \beta^k \min_i \{x_i(0)\}, \quad k = 1,2,\dots$
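The construction for the minimum differs from that of Lemma 1 in one respect: a state belongs to $T_n$ only if every action leads into $T_{n-1}$ with probability one. The sketch below mirrors this with `all` in place of `any` and a minimum over actions in place of a maximum. As before, the layout `p[i][k][j]`, the names, the tolerance, and the three-state data are hypothetical, and evaluating the minimum over strategies by backward induction is an assumption of this illustration.

```python
from typing import List, Set

Transition = List[List[List[float]]]  # assumed layout: p[i][k][j]


def t_sets(p: Transition, d0: Set[int], tol: float = 1e-12) -> List[Set[int]]:
    """Return [T_0, T_1, ...], stopping at the first empty set or after
    2**N steps. State i belongs to T_n if EVERY action moves it into
    T_{n-1} with probability one."""
    n_states = len(p)
    sets = [set(d0)]
    while sets[-1] and len(sets) <= 2 ** n_states:
        prev = sets[-1]
        sets.append({
            i for i in range(n_states)
            if all(abs(sum(row[j] for j in prev) - 1.0) <= tol for row in p[i])
        })
    return sets


def min_mass_factor(p: Transition, target: Set[int], n_steps: int) -> float:
    """max over start states of the min over strategies of the probability
    of being in `target` after n_steps transitions."""
    n_states = len(p)
    z = [1.0 if i in target else 0.0 for i in range(n_states)]
    for _ in range(n_steps):
        z = [
            min(sum(row[j] * z[j] for j in range(n_states)) for row in p[i])
            for i in range(n_states)
        ]
    return max(z)


# Hypothetical three-state example; state 2 is absorbing.
P3 = [
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
```

With $D_0 = \{0, 1\}$ (code indices) the set $T_1$ is already empty, and `min_mass_factor(P3, {0, 1}, 1)` evaluates to 0.5.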

Lemma 1 and Lemma 2 together establish the geometric convergence of $\{x(n);\ n = 1,2,\dots\}$ to zero. Notice that $\alpha$ and $\beta$ do not depend explicitly on the vectors $x(n)$, $n = 1,2,\dots$, but merely on the positions of $C_n$ and $D_n$ and on the probabilistic behaviour of the set of matrices. Since $S$ contains only a finite number of non-empty subsets and, moreover, we have only a finite number of policies (and hence a finite number of strategies of length $\le 2^N$), this implies that $n_0$, $m_0$, $\alpha$ and $\beta$ may be chosen independent of $x(0)$, that is: the geometric convergence is uniform with respect to the starting vector, while $\max(\alpha, \beta)$ constitutes an upper bound for the convergence rate.
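A quick numerical check of this uniformity is to iterate recursion (10) from several starting vectors and record the sup-norm at each step. The sketch below does this for a hypothetical three-state example; the layout `p[i][k][j]` and all names are assumptions, and the starting vectors are chosen with a zero component in the absorbing state so that (11) can hold.

```python
from typing import List

Transition = List[List[List[float]]]  # assumed layout: p[i][k][j]


def dp_step(p: Transition, x: List[float]) -> List[float]:
    """One application of recursion (10): x_i(n+1) = max_k sum_j p_ij^k x_j(n)."""
    n_states = len(x)
    return [
        max(sum(row[j] * x[j] for j in range(n_states)) for row in p[i])
        for i in range(n_states)
    ]


def sup_norms(p: Transition, x0: List[float], n_steps: int) -> List[float]:
    """Return [||x(0)||, ||x(1)||, ..., ||x(n_steps)||] in the sup-norm."""
    x = list(x0)
    norms = [max(abs(c) for c in x)]
    for _ in range(n_steps):
        x = dp_step(p, x)
        norms.append(max(abs(c) for c in x))
    return norms


# Hypothetical three-state example; state 2 is absorbing.
P3 = [
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
```

For the starting vector `[1.0, -1.0, 0.0]` the recorded norms are halved at every step. In general the lemmas only guarantee a contraction over blocks of $n_0$ or $m_0$ steps, so a step-by-step decrease as clean as this one is a feature of the toy data, not of the theory.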

We further remark that the finiteness of $K$ is not very essential; under the following condition we still have geometric convergence with a factor strictly smaller than one:

$\min_{i,j,k} \{\, p_{ij}^k \mid p_{ij}^k > 0 \,\} > 0.$

References

[1] Brown, B., On the iterative method of dynamic programming on a finite state space discrete time Markov process, Ann. Math. Statist. 36, 1279-1285 (1965).

[2] Derman, C., Finite State Markovian Decision Processes, Academic Press, New York (1970).

[3] Federgruen, A. and P.J. Schweitzer, Geometric convergence of value-iteration in multichain Markov decision problems, Adv. Appl. Prob. 11, 188-217 (1979).

[4] Morton, T. and W. Wecker, Ergodicity and convergence for Markov decision processes, Management Science 23, 890-900 (1977).

[5] Pease, M.C., Methods of Matrix Algebra, Academic Press, New York (1965).

[6] Schweitzer, P.J. and A. Federgruen, The asymptotic behaviour of undiscounted value-iteration in Markov decision problems, Math. Opns. Res. 2, 360-382 (1976).