The method of value oriented successive approximations for the average reward Markov decision process

Citation for published version (APA):
Wal, van der, J. (1979). The method of value oriented successive approximations for the average reward Markov decision process. (Memorandum COSOR; Vol. 7907). Technische Hogeschool Eindhoven.

Document status and date: Published: 01/01/1979

Document Version: Publisher's PDF, also known as Version of Record (includes final page, issue and volume numbers)


Department of Mathematics
Statistics and Operations Research Group

Memorandum COSOR 79-07

The method of value oriented successive approximations for the average reward Markov decision process

by

J. van der Wal

The method of value oriented successive approximations for the average reward Markov decision process.

Abstract. In this paper we consider the Markov decision process with finite state and action spaces with the criterion of average reward per unit time. We consider the method of value oriented successive approximations, which has been extensively studied by Van Nunen for the total reward case. Under various conditions which guarantee that the gain of the process is independent of the starting state, together with a strong aperiodicity assumption, we show that the method converges and produces ε-optimal policies.

1. Introduction and notations.

In this paper we consider the finite state and action Markov decision process (MDP) with the criterion of average reward per unit time. We present the method of value oriented successive approximations, which proved to be very useful for the total reward MDP, cf. Van Nunen [9,10] and Porteus [13].

This method lies somewhere between the method of standard successive approximations and the policy iteration method. In the standard successive approximation method one only performs policy improvement steps, whereas in the policy iteration method each improvement step is followed by an evaluation step in which the value of the actual policy is determined. In the method of value oriented successive approximations each maximization step (improvement step) is followed by one or more 'extrapolation' steps using the maximizing policy. This gives a better approximation of the value of the actual policy.

We will show that this method also converges in the average reward case under various conditions which guarantee that the gain is constant, together with a strong aperiodicity condition.

So, we consider a discrete time MDP with finite state space $S := \{1,2,\ldots,N\}$ and finite action sets $A(i) := \{1,2,\ldots,A_i\}$, $i \in S$. If in state $i$ action $a \in A(i)$ is taken, a reward $r(i,a)$ is earned and the system moves to state $j$ with probability $p(j \mid i,a)$, with $\sum_j p(j \mid i,a) = 1$.

A policy $f$ is a map from $S$ into $\cup_i A(i)$ such that $f(i) \in A(i)$ is the action to be taken in state $i$, $i \in S$. We denote by $r_f$ the vector with $i$-th component $r(i,f(i))$ and by $P(f)$ the matrix with $(i,j)$-th element $p(j \mid i,f(i))$, $i,j \in S$.

Throughout this paper we will work under the following assumption.

Strong aperiodicity assumption. There exists an $\alpha > 0$ such that for all $f$

(1.1)    $P(f) \ge \alpha I$ ,

where $I$ denotes the identity matrix.

Note that by using Schweitzer's aperiodicity transformation [15] one can transform any MDP into an equivalent MDP satisfying the strong aperiodicity assumption.
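For concreteness, one common form of this transformation is the following sketch (the mixing parameter $\tau \in (0,1)$ below is our notation, not the memorandum's): replace the transition law by a convex combination with 'staying put',

$\tilde p(j \mid i,a) := (1-\tau)\,\delta_{ij} + \tau\, p(j \mid i,a) , \qquad \tilde r(i,a) := r(i,a) .$

Every $\tilde P(f)$ then satisfies $\tilde P(f) \ge (1-\tau) I$, stationary distributions and hence gains are unchanged, and relative values are merely scaled by $1/\tau$, so optimal policies are preserved.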

And, except for section 6, we always make the following assumption.

Unichain assumption. For all policies $f$ the matrix $P(f)$ is unichained, i.e. the Markov chain corresponding to $P(f)$ has only one recurrent chain and possibly some transient states.

Under assumption (1.1) we can define $P^*(f)$ by

$P^*(f) := \lim_{n \to \infty} P^n(f) .$

And the unichain assumption guarantees that for any policy $f$ the gain is independent of the starting state. We denote by $g_f \in \mathbb{R}$ and $v_f \in \mathbb{R}^N$ the gain and the relative value vector corresponding to policy $f$. Thus we know that $(g_f, v_f)$ is the unique solution of

$r_f + P(f) v = g e + v , \qquad P^*(f)\, v = 0 ,$

and also

(1.2)    $g_f e = P^*(f)\, r_f .$

We are interested in determining an $\varepsilon$-optimal policy, i.e. a policy $f^*$ which satisfies $g_{f^*} + \varepsilon \ge \max_f g_f =: g^*$.
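As an illustration of these evaluation equations, the following Python sketch solves them for a single unichain policy. It is not taken from the memorandum: the function name and data layout are ours, and instead of the normalisation $P^*(f)v = 0$ it pins the last component of $v$ to zero, which yields the same gain $g_f$ and a relative value vector differing from $v_f$ only by a constant.

```python
import numpy as np

def evaluate_policy(P_f, r_f):
    """Solve g*e + v = r_f + P_f v for a unichain policy.

    P_f : (N, N) row-stochastic transition matrix of the policy
    r_f : (N,)   reward vector of the policy
    Normalisation: v[N-1] = 0 (instead of P*(f) v = 0 used in the text).
    """
    N = len(r_f)
    # Unknowns x = (g, v[0], ..., v[N-2]); equation i reads
    #   g + v[i] - sum_j P_f[i, j] * v[j] = r_f[i],   with v[N-1] = 0.
    A = np.zeros((N, N))
    A[:, 0] = 1.0                                   # coefficient of g
    A[:, 1:] = np.eye(N)[:, :N - 1] - P_f[:, :N - 1]
    x = np.linalg.solve(A, r_f)
    g, v = x[0], np.append(x[1:], 0.0)
    return g, v

# tiny check: a two-state chain with rewards (1, 0) and a fair coin flip
P = np.array([[0.5, 0.5], [0.5, 0.5]])
g, v = evaluate_policy(P, np.array([1.0, 0.0]))     # g = 0.5, v = (1.0, 0.0)
```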

Before we formulate the value oriented successive approximation method we introduce some more notation.

We define the operators $L(f)$ and $U$ on $\mathbb{R}^N$ by

$L(f)v = r_f + P(f)v , \qquad Uv = \max_f L(f)v .$

Further we denote by $f_v$ the unique policy $f$ which satisfies

(i) $L(f)v = Uv$;
(ii) if $L(h)v = Uv$ then $h(i) \ge f(i)$ for all $i$, i.e. ties are broken by taking the maximizing action with minimal number.

Then we define the operator $U^{(\lambda)}$, where $\lambda$ is a natural number, by

$U^{(\lambda)} v := L^{\lambda}(f_v)\, v .$

Now we can formulate the value oriented method.

Value oriented successive approximations. Select an arbitrary $v_0$ and determine

(1.3)    $v_{n+1} = U^{(\lambda)} v_n = L^{\lambda}(f_{v_n})\, v_n , \qquad n = 0,1,\ldots .$

For $\lambda = 1$ this is just the standard method of successive approximations, studied a.o. by Bellman [2], Howard [6], White [20], Schweitzer [14] and Schweitzer and Federgruen [16,17].
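To make the scheme (1.3) concrete, here is a minimal Python sketch (our own illustration, not part of the memorandum; the data layout and function names are assumptions): one maximization step that breaks ties by taking the lowest action index, followed by $\lambda$ extrapolation steps with the chosen policy. The second function computes the quantities $\min_i (Uv - v)(i)$ and $\max_i (Uv - v)(i)$ that reappear as $\gamma_n$ and $\delta_n$ in section 2.

```python
import numpy as np

def value_oriented_step(P, r, v, lam):
    """One step v -> U^(lambda) v of scheme (1.3).

    P[i][a] : transition row p(.|i,a) as a numpy array
    r[i][a] : reward r(i,a)
    """
    N = len(P)
    # improvement (maximization) step; np.argmax returns the smallest index on ties
    f = [int(np.argmax([r[i][a] + P[i][a] @ v for a in range(len(P[i]))]))
         for i in range(N)]
    # lambda value-oriented extrapolation steps with the fixed policy f
    for _ in range(lam):
        v = np.array([r[i][f[i]] + P[i][f[i]] @ v for i in range(N)])
    return f, v

def gain_bounds(P, r, v):
    """min and max component of Uv - v (lower/upper bounds on the optimal gain)."""
    d = [max(r[i][a] + P[i][a] @ v for a in range(len(P[i]))) - v[i]
         for i in range(len(P))]
    return min(d), max(d)
```

Iterating value_oriented_step and stopping once the two bounds are close is exactly the way an ε-optimal policy is recognized later in the paper.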

For very large $\lambda$ the method tends to become equivalent to Howard's policy iteration method [6]. This can be argued as follows (we write $f_n$ instead of $f_{v_n}$):

$v_{n+1} = L^{\lambda}(f_n) v_n = v_{f_n} + \lambda g_{f_n} e + P^{\lambda}(f_n)(v_n - v_{f_n})$
$\qquad = v_{f_n} + \lambda g_{f_n} e + P^*(f_n)(v_n - v_{f_n}) + \big(P^{\lambda}(f_n) - P^*(f_n)\big)(v_n - v_{f_n}) .$

Now $P^*(f_n)(v_n - v_{f_n})$ is a constant vector (depending on $n$), $c_n e$ say. And $P^{\lambda}(f_n) - P^*(f_n)$ tends to zero if $\lambda \to \infty$. So if $\lambda$ becomes large, $v_{n+1}$ will differ from $v_{f_n}$ by merely a constant vector (depending on $n$ and $\lambda$), $c_{n,\lambda} e$ say. Then $f_{n+1}$ is the (a) policy that maximizes

$r_f + P(f)\, v_{n+1} \approx r_f + P(f)\, v_{f_n} + c_{n,\lambda} e \qquad (\lambda \to \infty) ,$

which is nearly the same as maximizing $r_f + P(f)\, v_{f_n}$, as one does in the policy improvement step in Howard's algorithm (cf. Morton [7] and Van der Wal [18,19]).

Instead of using a constant $\lambda$, as in our formulation (1.3), one might also choose in the $n$-th iteration step

$v_{n+1} = L^{\lambda_n}(f_{v_n})\, v_n ,$

where the $\lambda_n$ may depend on anything we observe while executing the computations. A sensible choice will be to have $\lambda_n$ increasing and to enlarge $\lambda_n$ even further if policy $f_n$ is almost the same as $f_{n-1}$.

In section 2 we first derive some preliminary inequalities. Then we consider in section 3 the irreducible case, which is essentially simpler than the general unichain case with transient states. The latter case is treated in section 4. In section 5 we show that the value oriented method ultimately converges exponentially fast. In section 6 we relax the unichain assumption to the communicating and the simply connected case. Finally, we present in section 7 an example which shows that the value oriented method of successive approximations may cycle between nonoptimal policies if we only demand that all chains are aperiodic.

2. Some preliminary results.

In this section we derive some preliminary results for the method of value oriented successive approximations. So we consider the scheme

$v_{n+1} = L^{\lambda}(f_{v_n})\, v_n , \qquad n = 0,1,\ldots ,$

and we further denote $f_{v_n}$ by $f_n$. Define $\gamma_n$ and $\delta_n$, $n = 0,1,2,\ldots$, by

(2.1)    $\gamma_n := \min_i\, (Uv_n - v_n)(i) ,$
(2.2)    $\delta_n := \max_i\, (Uv_n - v_n)(i) .$

Then we have the following well-known result (cf. Odoni [11], Hastings [4], Hordijk and Tijms [5]).

Lemma 2.1. For all $n$ we have

(2.3)    $\gamma_n \le g_{f_n} \le g^* \le \delta_n .$

Proof: We have, by the definition of $f_n$,

$L(f_n)\, v_n - v_n = Uv_n - v_n .$

So

(2.4)    $\gamma_n e \le r_{f_n} + P(f_n)\, v_n - v_n \le \delta_n e .$

Multiplying (2.4) by $P^*(f_n)$ we get, using $P^*(f_n)P(f_n) = P^*(f_n)$ and (1.2),

(2.5)    $\gamma_n e \le g_{f_n} e \le \delta_n e .$

And for arbitrary $f$ we have

$r_f + P(f)\, v_n - v_n \le Uv_n - v_n \le \delta_n e .$

So, after multiplication with $P^*(f)$, also $g_f \le \delta_n$. Hence also

(2.6)    $g^* \le \delta_n .$

Combining (2.5) and (2.6) proves the assertion.    □
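As a worked consequence (our remark): (2.3) immediately gives a stopping rule, since

$\delta_n - \gamma_n \le \varepsilon \quad \Longrightarrow \quad g_{f_n} \ge \gamma_n \ge \delta_n - \varepsilon \ge g^* - \varepsilon ,$

so the current policy $f_n$ is $\varepsilon$-optimal as soon as the two bounds are within $\varepsilon$ of each other.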

So, what we would like to show is that $\delta_n - \gamma_n$ tends to zero if $n$ tends to infinity. The lemma below gives a first result in the direction of a proof.

Lemma 2.2. The sequence $\{\gamma_n,\ n = 0,1,\ldots\}$ is monotonically non-decreasing.

Proof: For arbitrary $n$ we have

$Uv_{n+1} - v_{n+1} \ge L(f_n)\, v_{n+1} - v_{n+1} = L^{\lambda}(f_n) L(f_n)\, v_n - L^{\lambda}(f_n)\, v_n$
$\qquad = P^{\lambda}(f_n)\big(L(f_n)\, v_n - v_n\big) \ge P^{\lambda}(f_n)\,\gamma_n e = \gamma_n e .$

So also $\gamma_{n+1} \ge \gamma_n$.    □

In the special case of $\lambda = 1$ also the sequence $\{\delta_n\}$ is monotone, but now non-increasing (see Odoni [11]). This, however, need not be the case if $\lambda > 1$.

Example 2.1. $S = \{1,2\}$, $A(1) = \{1\}$, $A(2) = \{1,2\}$. Furthermore, $p(1 \mid 1,1) = 1$ and $r(1,1) = 100$. In state 2 action 1 has $r(2,1) = 0$ and $p(1 \mid 2,1) = 0.9$, and action 2 has $r(2,2) = 10$ and $p(1 \mid 2,2) = 0.1$. So action 1 has the higher probability of reaching state 1, but action 2 has the higher immediate reward. Take $v_0 = 0$ and $\lambda = 2$. Then $Uv_0 - v_0 = (100, 10)^T$, so $\delta_0 = 100$, and we get $v_1 = (200, 29)^T$. Next we compute $Uv_1 - v_1 = (100, 153.9)^T$, so $\delta_1 = 153.9 > \delta_0$.
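The numbers of example 2.1 are quickly checked numerically; the following Python lines (our own sketch, with an ad-hoc encoding of the two states) reproduce $\delta_0 = 100$, $v_1 = (200,29)$ and $\delta_1 = 153.9$.

```python
import numpy as np

P = [[np.array([1.0, 0.0])],                        # state 1: p(.|1,1)
     [np.array([0.9, 0.1]), np.array([0.1, 0.9])]]  # state 2: p(.|2,1), p(.|2,2)
r = [[100.0], [0.0, 10.0]]
v = np.zeros(2)

Uv = np.array([max(r[i][a] + P[i][a] @ v for a in range(len(P[i]))) for i in range(2)])
delta0 = max(Uv - v)                                # 100.0
f = [0, 1]                                          # maximizing policy: action 2 in state 2
for _ in range(2):                                  # lambda = 2 extrapolation steps
    v = np.array([r[i][f[i]] + P[i][f[i]] @ v for i in range(2)])
# v is now (200., 29.)
Uv = np.array([max(r[i][a] + P[i][a] @ v for a in range(len(P[i]))) for i in range(2)])
delta1 = max(Uv - v)                                # 153.9 > delta0
```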

Of course the sequence $\bar\delta_n := \min\{\delta_n, \bar\delta_{n-1}\}$, $\bar\delta_0 := \delta_0$, constitutes a non-increasing sequence of upper bounds for $g^*$.

Our approach will be as follows. First we examine the sequence $\{\gamma_n\}$ and show that $\gamma_n \uparrow g^*$. Then we prove that $\delta_n$ converges to $g^*$ as well, which together implies that $f_n$ becomes $\varepsilon$-optimal in the long run.

In the following section we first investigate the simpler irreducible case.

3. The irreducible case.

The case in which there is only one recurrent chain and no transient states - the irreducible case - is essentially simpler than the case with transient states. Therefore we consider in this section the irreducible case first.

Define $g_n \in \mathbb{R}^N$ by

(3.1)    $g_n := Uv_n - v_n .$

Then we have (compare the proof of lemma 2.2)

(3.2)    $g_{n+1} \ge P^{\lambda}(f_n)\, g_n ,$

and consequently

(3.3)    $g_{n+k} \ge P^{\lambda}(f_{n+k-1}) \cdots P^{\lambda}(f_n)\, g_n .$

We will show that, if all $P(f)$ are irreducible and satisfy the strong aperiodicity assumption, then any product $P(h_1) \cdots P(h_{N-1})$ is strictly positive. This we will use to show that $g_n$ converges to $g^* e$ exponentially fast.

Lemma 3.1. If all $P(f)$ are irreducible and satisfy the strong aperiodicity assumption, then there exists a $\tau > 0$ such that $P(h_1) \cdots P(h_{N-1})(i,j) \ge \tau$ for all $h_1,\ldots,h_{N-1}$ and all $i,j \in S$.

Proof: Let $h_1, h_2, \ldots, h_{N-1}$ be an arbitrary sequence of policies. For this sequence we define $S(i,n)$, $n = 0,1,\ldots,N-1$, by

$S(i,n) := \{ j \in S \mid P(h_1) \cdots P(h_n)(i,j) > 0 \} .$

So we have to show that $S(i,N-1) = S$ for all $i \in S$. Clearly we have $S(i,0) = \{i\}$ and $S(i,n) \subset S(i,n+1)$, since for $j \in S(i,n)$

$P(h_1) \cdots P(h_{n+1})(i,j) \ge P(h_1) \cdots P(h_n)(i,j)\, P(h_{n+1})(j,j) > 0 ,$

as $P(h_1) \cdots P(h_n)(i,j) > 0$ and $p^a_{jj} > 0$ for all $a$, in particular for $a = h_{n+1}(j)$.

It remains to be shown that the sets $S(i,n)$ are strictly increasing as long as $S(i,n) \ne S$. Suppose $S(i,n) \ne S$ and $S(i,n+1) = S(i,n)$. This implies that for all $j \in S(i,n)$ and $k \notin S(i,n)$ we have $P(h_{n+1})(j,k) = 0$, otherwise $k \in S(i,n+1)$. So the set $S(i,n)$ is closed under $P(h_{n+1})$. But $P(h_{n+1})$ is irreducible, so only $S$ can be closed under $P(h_{n+1})$. Hence, by contradiction, $S(i,n)$ is strictly increasing until $S(i,n) = S$. So, ultimately for $n = N-1$, we have $S(i,N-1) = S$.

As there are only finitely many policies and states, also

$\min_{i,j}\ \min_{h_1,\ldots,h_{N-1}} P(h_1) \cdots P(h_{N-1})(i,j) =: \tau > 0 .$    □

As an immediate consequence of this lemma and lemma 2.1 we have

Lemma 3.2. If $k\lambda \ge N - 1$, then for all $n$

$g^* - \gamma_{n+k} \le (1 - \tau)(g^* - \gamma_n) .$

Proof: Let $j_0$ satisfy $g_n(j_0) = \delta_n$. Then we have for all $i$ and all $h_1,\ldots,h_{N-1}$

$P(h_1) \cdots P(h_{N-1})\, g_n\,(i) = \sum_{j \ne j_0} P(h_1) \cdots P(h_{N-1})(i,j)\, g_n(j) + P(h_1) \cdots P(h_{N-1})(i,j_0)\, \delta_n$
$\qquad \ge (1 - \tau)\gamma_n + \tau \delta_n \ge (1 - \tau)\gamma_n + \tau g^* .$

So

$P(h_1) \cdots P(h_{N-1})\, g_n \ge (1 - \tau)\gamma_n e + \tau g^* e .$

Then also for all $M \ge N-1$

$P(h_1) \cdots P(h_M)\, g_n \ge (1 - \tau)\gamma_n e + \tau g^* e .$

Hence also, for $k\lambda \ge N-1$,

$g_{n+k} \ge P^{\lambda}(f_{n+k-1}) \cdots P^{\lambda}(f_n)\, g_n \ge (1 - \tau)\gamma_n e + \tau g^* e ,$

and thus

$\gamma_{n+k} \ge (1 - \tau)\gamma_n + \tau g^* ,$

or

$g^* - \gamma_{n+k} \le (1 - \tau)(g^* - \gamma_n) .$    □

So we have obtained the following result.

Theorem 3.1. $\gamma_n$ converges to $g^*$ exponentially fast.

We now know that $f_n$ becomes $\varepsilon$-optimal for $n$ sufficiently large. The problem, however, is how we recognize this. Therefore, we want to show that also

(3.4)    $\delta_n \to g^* \qquad (n \to \infty) .$

As we have seen in the proof of lemma 2.1 one has

$g_{f_n} e = P^*(f_n)\, g_n .$

Define $\zeta$ by

(3.5)    $\zeta := \min_f \min_{i,j} P^*(f)(i,j) .$

Now we can prove the following lemma.

Lemma 3.3. For all $n$

$\delta_n - \gamma_n \le \zeta^{-1}(g^* - \gamma_n) .$

Proof: The assertion follows immediately from

$g^* e \ge g_{f_n} e = P^*(f_n)\, g_n \ge (1 - \zeta)\gamma_n e + \zeta \delta_n e .$    □

Summarizing we have shown the following result.

Theorem 3.2.
(i) $\gamma_n$ converges monotonically and exponentially fast to $g^*$.
(ii) $\delta_n$ converges exponentially fast, though not necessarily monotonically, to $g^*$.

So in the irreducible case and under the strong aperiodicity assumption the value oriented approach converges exponentially fast.

In the next section we will try to generalize this result to the general unichain case with transient states. Note that in that case lemma 3.1 need no longer hold and that $\zeta$ defined in (3.5) may be zero, so lemma 3.3 can no longer be used.

4. The general unichained case.

As we already remarked, the approach in this case will have to be different from the one in the preceding section. First we derive a lemma similar to lemma 3.1, which will enable us to show that the span of $v_n$ is bounded, where the span of a vector $v$, notation $sp(v)$, is defined as

$sp(v) := \max_i v(i) - \min_i v(i) .$

Next we show that this implies that $\gamma_n$ converges to $g^*$, and finally we show that there must exist a subsequence of $\delta_n$ which converges to $g^*$.

Lemma 4.1. There exists a constant $\eta > 0$ such that for all $h_1,\ldots,h_{N-1}$ and all $i,j \in S$

(4.1)    $\sum_k \min \{ P(h_1) \cdots P(h_{N-1})(i,k) ,\ P(h_1) \cdots P(h_{N-1})(j,k) \} \ge \eta .$

This lemma says that for any $h_1,\ldots,h_{N-1}$ any two states $i$ and $j$ have a common successor at time $N-1$. Conditions of the type (4.1) are called scrambling conditions (cf. Hajnal [3] and Morton and Wecker [8]), and give contraction in the span-norm.

Proof of lemma 4.1: The line of proof is similar to the proof of lemma 3.1. Let $h_1,\ldots,h_{N-1}$ again be an arbitrary sequence of policies. For this sequence we define again $S(i,n) := \{ s \mid P(h_1) \cdots P(h_n)(i,s) > 0 \}$, $n = 0,1,\ldots$, $i \in S$. What we have to show is that $S(i,N-1) \cap S(j,N-1)$ is nonempty for all $i,j \in S$. Again it is clear that $S(i,0) = \{i\}$ and $S(i,n) \subset S(i,n+1)$, and if $S(i,n) = S(i,n+1)$ then $S(i,n)$ is closed under $P(h_{n+1})$. Now suppose $S(i,N-1) \cap S(j,N-1)$ is empty. Then both $S(i,N-1)$ and $S(j,N-1)$ are proper subsets of $S$, so there must be an $m$ and an $n$ with $S(i,m) = S(i,m+1)$ and $S(j,n) = S(j,n+1)$. Then $S(i,m)$ is closed under $P(h_{m+1})$ and $S(j,n)$ is closed under $P(h_{n+1})$, and since $S(i,N-1) \cap S(j,N-1)$ is empty, $S(i,m) \cap S(j,n)$ is also empty. So if we define a policy $f$ by $f(s) = h_{m+1}(s)$ for $s \in S(i,m)$ and $f(s) = h_{n+1}(s)$ for $s \in S(j,n)$, then $P(f)$ has at least two recurrent subchains, $S(i,m)$ and $S(j,n)$, which contradicts the unichain assumption. Hence $S(i,N-1) \cap S(j,N-1)$ is nonempty, and thus the left hand side in (4.1) is positive. As there are only finitely many policies and states there must exist an

(4.2)    $\eta > 0$

such that (4.1) holds for all $h_1,\ldots,h_{N-1}$ and $i,j \in S$.    □

From lemma 4.1 we immediately get

Lemma 4.2. For all $v \in \mathbb{R}^N$ and all $h_1,\ldots,h_{N-1}$

$sp\big(P(h_1) \cdots P(h_{N-1})\, v\big) \le (1 - \eta)\, sp(v) .$

Proof: Let $i$ and $j$ be a maximal and a minimal component of $P(h_1) \cdots P(h_{N-1})\, v$, respectively. Then, writing $Q$ instead of $P(h_1) \cdots P(h_{N-1})$, we have

$sp(Qv) = (Qv)(i) - (Qv)(j) = \sum_k [Q(i,k) - Q(j,k)]\, v(k)$
$\qquad = \sum_k [Q(i,k) - \min\{Q(i,k),Q(j,k)\}]\, v(k) - \sum_k [Q(j,k) - \min\{Q(i,k),Q(j,k)\}]\, v(k)$
$\qquad \le (1 - \eta) \max_k v(k) - (1 - \eta) \min_k v(k) = (1 - \eta)\, sp(v) .$    □

Define $K := \max_{i,a} r(i,a) - \min_{i,a} r(i,a)$; then we also have

Lemma 4.3. For all $v \in \mathbb{R}^N$ and all $h_1,\ldots,h_{N-1}$

$sp\big(L(h_1) \cdots L(h_{N-1})\, v\big) \le (N-1)K + (1 - \eta)\, sp(v) .$

Proof: By definition of $K$ we have $sp(r_h) \le K$ for all $h$. Further we have for all $h$ and $v$ that $sp(P(h)\, v) \le sp(v)$. So

$sp\big(L(h_1) \cdots L(h_{N-1})\, v\big) = sp\big(r_{h_1} + P(h_1)\, r_{h_2} + \cdots + P(h_1) \cdots P(h_{N-2})\, r_{h_{N-1}} + P(h_1) \cdots P(h_{N-1})\, v\big)$
$\qquad \le sp(r_{h_1}) + \cdots + sp\big(P(h_1) \cdots P(h_{N-2})\, r_{h_{N-1}}\big) + sp\big(P(h_1) \cdots P(h_{N-1})\, v\big)$
$\qquad \le (N-1)K + (1 - \eta)\, sp(v) .$    □

In order to show that $sp(v_n)$ is bounded we introduce the following notation:

(4.3)    $w_{n\lambda + p} := L^p(f_n)\, v_n , \qquad n = 0,1,2,\ldots ,\ \ p = 0,1,\ldots,\lambda - 1 .$

Then we have from lemma 4.3, for $\ell = 0,1,\ldots$ and $q = 0,1,\ldots,N-2$,

$sp\big(w_{\ell(N-1)+q}\big) \le (N-1)K + (1 - \eta)\, sp\big(w_{(\ell-1)(N-1)+q}\big)$
$\qquad \le \cdots \le (N-1)K + (1-\eta)(N-1)K + \cdots + (1-\eta)^{\ell-1}(N-1)K + (1-\eta)^{\ell}\, sp(w_q)$
$\qquad \le [1 + (1-\eta) + \cdots + (1-\eta)^{\ell-1}](N-1)K + sp(v_0)$
$\qquad \le \eta^{-1}(N-1)K + sp(v_0) .$

So for all $m$

(4.4)    $sp(w_m) \le \eta^{-1}(N-1)K + sp(v_0) .$

From which we get

Theorem 4.1. $sp(v_n)$ is bounded.

Proof: Immediately from $v_n = w_{n\lambda}$ and (4.4).    □

Before we can use this to prove $\gamma_n \uparrow g^*$ we first derive a number of inequalities. Define $g_n$, $n = 0,1,2,\ldots$, by

(4.5)    $g_n := L(f_n)\, v_n - v_n .$

Then we have, from the proof of lemma 2.2,

$g_{n+1} \ge P^{\lambda}(f_n)\, g_n ,$

and

(4.6)    $g_{n+k} \ge P^{\lambda}(f_{n+k-1}) \cdots P^{\lambda}(f_n)\, g_n .$

Also

$v_{n+1} - v_n = L^{\lambda}(f_n)\, v_n - v_n = \sum_{p=0}^{\lambda-1}\big(L^{p+1}(f_n)\, v_n - L^{p}(f_n)\, v_n\big) = \big[I + P(f_n) + \cdots + P^{\lambda-1}(f_n)\big]\, g_n , \qquad n = 0,1,\ldots ,$

so

(4.7)    $v_{n+k} - v_n = \sum_{t=n}^{n+k-1} \big[I + P(f_t) + \cdots + P^{\lambda-1}(f_t)\big]\, g_t , \qquad n = 0,1,\ldots ,\ k = 1,2,\ldots .$

Define $\gamma$ by

(4.8)    $\gamma := \lim_{n \to \infty} \gamma_n .$

We will prove $\gamma = g^*$.

Let us consider $v_{m+\ell} - v_m$, where $m$ and $\ell$ for the time being are arbitrary. From (4.6) and using $P(f) \ge \alpha I$ for all $f$ (the strong aperiodicity assumption) we get for all $n$ and $k$

(4.9)    $g_{n+k}(i) \ge \alpha^{k\lambda} g_n(i) + (1 - \alpha^{k\lambda})\gamma_n$

and

$g_{n+k}(i) \ge \alpha^{k\lambda - a}\big(P^a(f_n)\, g_n\big)(i) + (1 - \alpha^{k\lambda - a}) \min_{j \in S}\big(P^a(f_n)\, g_n\big)(j)$
$\qquad \ge \alpha^{k\lambda - a}\big(P^a(f_n)\, g_n\big)(i) + (1 - \alpha^{k\lambda - a})\gamma_n$

(4.10)    $\qquad \ge \alpha^{k\lambda}\big(P^a(f_n)\, g_n\big)(i) + (1 - \alpha^{k\lambda})\gamma_n , \qquad a = 0,1,\ldots,\lambda-1 .$

Now assume that $i$ satisfies

(4.11)    $g_{m+\ell}(i) \le \gamma .$

Then we have from (4.10), with $n+k = m+\ell$ and $n = t$ ($m \le t < m+\ell$),

(4.12)    $\big(P^a(f_t)\, g_t\big)(i) \le \gamma_t + \alpha^{-\ell\lambda}\big(g_{m+\ell}(i) - \gamma_t\big) \le \gamma + \alpha^{-\ell\lambda}(\gamma - \gamma_m) , \qquad a = 0,1,\ldots,\lambda-1 .$

Hence from (4.7) and (4.12)

(4.13)    $(v_{m+\ell} - v_m)(i) \le \ell\lambda\gamma + \ell\lambda\alpha^{-\ell\lambda}(\gamma - \gamma_m) .$

On the other hand we have $\delta_n \ge g^*$ for all $n$. Hence there is a state $j_0 \in S$ which has $g_t(j_0) \ge g^*$ for at least $\ell/N$ of the indices $t = m, m+1, \ldots, m+\ell-1$. For this state $j_0$ we thus have

(4.14)    $(v_{m+\ell} - v_m)(j_0) \ge \ell\lambda\gamma_m + \frac{\ell}{N}\,(g^* - \gamma) .$

So from (4.13) and (4.14)

(4.15)    $(v_{m+\ell} - v_m)(j_0) - (v_{m+\ell} - v_m)(i) \ge \frac{\ell}{N}\,(g^* - \gamma) - 2\ell\lambda\alpha^{-\ell\lambda}(\gamma - \gamma_m) .$

Now we can prove $\gamma = g^*$.

Theorem 4.2. $\gamma = g^*$.

Proof. Suppose $\gamma < g^*$. By theorem 4.1 we have

(4.16)    $sp(v_n) \le M_1$

for some $M_1$ and all $n$. Now choose $\ell$ such that $\frac{\ell}{N}(g^* - \gamma) > 2M_1 + M_2$, where $M_2$ is some positive constant. Next choose $m$ such that $2\ell\lambda\alpha^{-\ell\lambda}(\gamma - \gamma_m) \le M_2$. Then we get from (4.15)

$(v_{m+\ell} - v_m)(j_0) - (v_{m+\ell} - v_m)(i) > 2M_1 .$

Hence, using (4.16) with $n = m$,

$v_{m+\ell}(j_0) - v_{m+\ell}(i) > M_1 ,$

which contradicts (4.16) for $n = m + \ell$. Therefore we must have $\gamma = g^*$.    □

So by now we know that $\gamma_n$ converges to $g^*$, so that $f_n$ becomes nearly optimal as $n$ becomes large. The problem remains to recognize that $f_n$ is nearly optimal. Therefore we need the following result.

Theorem 4.3. $g^*$ is a limit point of the sequence $\delta_n$.

Proof. We know $\delta_n \ge g^*$. Suppose that the smallest limit point of $\delta_n$ is strictly larger than $g^*$. Then we can construct - using a similar reasoning as before - a violation of the boundedness of $sp(v_n)$. Hence $g^*$ must be a limit point of $\delta_n$.    □

From theorems 4.2 and 4.3 we see that the method of value oriented successive approximations converges if all $P(f)$ are unichained and satisfy the strong aperiodicity assumption.

5. Geometric convergence.

For the irreducible case we have obtained that $sp(g_n)$ converges to zero geometrically. For the general unichained case we still have to show that $sp(g_n)$ converges to zero. In this section we show that also in this case $sp(g_n)$ converges to zero exponentially fast.

In the preceding we already argued that $\gamma_n$ converges to $g^*$ and that $\delta_n$ has a subsequence converging to $g^*$. And as $sp(v_n)$ is bounded, also $v_n - v_n(N)\,e$ is bounded. Further, there are only finitely many policies, so there exists a subsequence $(n_k)$ with $g_{n_k} \to g^* e$, $f_{n_k} = f$ and $v_{n_k} - v_{n_k}(N)\,e \to v$ ($k \to \infty$) for some $f$ and $v$. Then for all $k$

$\max_h L(h)\, v_{n_k} - v_{n_k} = L(f_{n_k})\, v_{n_k} - v_{n_k} = L(f)\, v_{n_k} - v_{n_k} .$

If we let $k$ tend to infinity, then we get, using $L(h)\big(v_{n_k} - v_{n_k}(N)e\big) - \big(v_{n_k} - v_{n_k}(N)e\big) = L(h)\, v_{n_k} - v_{n_k}$ for all $h$,

(5.1)    $\max_h L(h)\, v - v = L(f)\, v - v = g^* e .$

Now let $\varepsilon > 0$ be such that $L(h)v - v \ge g^* e - \varepsilon e$ implies $L(h)v - v = g^* e$. Then we have the following lemma.

Lemma 5.1. If $sp(v_n - v_n(N)e - v) \le \varepsilon$ and $L(f_n)v = v + g^* e$, then

$sp(v_{n+1} - v_{n+1}(N)e - v) \le \varepsilon$ and $L(f_{n+1})v = v + g^* e .$

Before we prove this lemma note the following. As $sp(v_{n_k} - v_{n_k}(N)e - v)$ tends to zero, there exists a $k_0$ such that $sp(v_{n_{k_0}} - v_{n_{k_0}}(N)e - v) \le \varepsilon$ and $L(f_{n_{k_0}})v = v + g^* e$. Then it is a consequence of lemma 5.1 that we have for $m > n_{k_0}$

$sp(v_m - v_m(N)e - v) \le \varepsilon$ and $L(f_m)v = v + g^* e .$

But this implies

(5.2)    $v_{n_{k_0}+\ell} = v + \ell\lambda g^* e + P^{\lambda}(f_{n_{k_0}+\ell-1}) \cdots P^{\lambda}(f_{n_{k_0}})\,(v_{n_{k_0}} - v) .$

So by lemma 4.2, $sp(v_{n_{k_0}+\ell} - v)$ decreases exponentially fast to zero as $\ell$ increases, and since $\max_h L(h)v - v = g^* e$, also $g_{n_{k_0}+\ell}$ converges to $g^* e$ exponentially fast as $\ell \to \infty$.

Proof of lemma 5.1: Since $L(f_n)v = v + g^* e$ we have

$v_{n+1} = L^{\lambda}(f_n)\, v_n = v + \lambda g^* e + P^{\lambda}(f_n)(v_n - v) .$

Hence

$sp(v_{n+1} - v_{n+1}(N)e - v) = sp\big(P^{\lambda}(f_n)(v_n - v)\big) \le sp(v_n - v) \le \varepsilon .$

And

$L(f_{n+1})v - v = L(f_{n+1})\, v_{n+1} + P(f_{n+1})(v - v_{n+1}) - v$
$\qquad \ge L(f_n)\, v_{n+1} + P(f_{n+1})(v - v_{n+1}) - v$
$\qquad = L(f_n)\, v + P(f_n)(v_{n+1} - v) + P(f_{n+1})(v - v_{n+1}) - v$
$\qquad = g^* e + \big[P(f_n) - P(f_{n+1})\big](v_{n+1} - v) \ge g^* e - \varepsilon e ,$

as for any two stochastic matrices $A$ and $B$ and any vector $w$ we have $(A - B)w \ge -sp(w)\, e$. Hence also $L(f_{n+1})v - v = g^* e$.    □

6. The communicating case.

In section 4 we treated the general unichained case. The convergence proof for the method of value oriented successive approximations was given in two stages. First we used the unichain assumption and the strong aperiodicity to prove that $sp(v_n)$ is bounded (lemmas 4.1 - 4.3 and theorem 4.1). In the second stage we used the boundedness of $sp(v_n)$ and $\delta_n \ge g^*$ to prove $\gamma_n \uparrow g^*$ and $\delta_{n_k} \to g^*$ ($k \to \infty$) for some subsequence $(n_k)_{k \in \mathbb{N}}$.

From this it is clear that the method of value oriented successive approximations will converge whenever $sp(v_n)$ is bounded and the gain of the MDP is independent of the starting state.

Below we will show that if the MDP is communicating then also $sp(v_n)$ is bounded (clearly in that case the gain is constant).

Definition 6.1. An MDP is called communicating if for any two states $i$ and $j$ there exist a policy $f_{ij}$ and a number $r$ such that $P^r(f_{ij})(i,j) > 0$ (cf. Bather [1]).

Further we define

$M := \max_{i,a} |r(i,a)| ,$
$\sigma_n := \min_i v_n(i) , \qquad \tau_n := \max_i v_n(i) , \qquad n = 0,1,2,\ldots ,$
$\zeta := \min_{i,j,a} \{ p^a_{ij} \mid p^a_{ij} > 0 \} .$

Now we will derive some lemmas which we need in order to prove that $sp(v_n)$ is bounded.

Lemma 6.1. For all $n$

(i) $\sigma_{n+1} \ge \sigma_n - \lambda M$;
(ii) $\tau_{n+1} \le \tau_n + \lambda M$.

Proof: (i) We have

$v_{n+1} = r_{f_n} + P(f_n)\, r_{f_n} + \cdots + P^{\lambda-1}(f_n)\, r_{f_n} + P^{\lambda}(f_n)\, v_n \ge -\lambda M e + P^{\lambda}(f_n)\, v_n \ge -\lambda M e + \sigma_n e .$

So also $\sigma_{n+1} \ge \sigma_n - \lambda M$. Similarly one may obtain (ii).    □

Lemma 6.2. If $sp(v_{n+N-1}) \ge sp(v_n)$, then for all $m$ with $n \le m < n+N-1$

$\sigma_{m+1} - \sigma_m \le \lambda M (2N-3) .$

Proof: Using lemma 6.1 we get

$sp(v_{n+N-1}) = \tau_{n+N-1} - \sigma_{n+N-1}$
$\qquad = \sum_{k=n}^{n+N-2} (\tau_{k+1} - \tau_k) - \sum_{k=n}^{n+N-2} (\sigma_{k+1} - \sigma_k) + \tau_n - \sigma_n$
$\qquad \le \lambda(N-1)M + \lambda(N-2)M - (\sigma_{m+1} - \sigma_m) + sp(v_n)$
$\qquad = \lambda(2N-3)M - (\sigma_{m+1} - \sigma_m) + sp(v_n) .$

And with $sp(v_{n+N-1}) \ge sp(v_n)$ one obtains $\sigma_{m+1} - \sigma_m \le \lambda M(2N-3)$.    □

Lemma 6.3. If $sp(v_{n+N-1}) \ge sp(v_n)$ and $v_{m+1}(j) \le C + \sigma_{m+1}$ for some $j \in S$ and some $m$ with $n \le m < n+N-1$, then we have for all $\ell \in S$ for which there exists an action $a \in A(j)$ with $p^a_{j\ell} > 0$

$v_m(\ell) - \sigma_m \le \alpha^{1-\lambda}\zeta^{-1}\big[ C + 2\lambda M(N-1) \big] .$

Proof: We have

$v_{m+1} = L^{\lambda}(f_m)\, v_m = L^{\lambda-1}(f_m)\, U v_m .$

So,

$v_{m+1}(j) \ge L^{\lambda-1}(f_m)\big[ -Me + \max_f P(f)\, v_m \big](j)$
$\qquad \ge -\lambda M + \big( P^{\lambda-1}(f_m) \max_f P(f)\, v_m \big)(j)$
$\qquad \ge -\lambda M + \alpha^{\lambda-1} \max_a \sum_k p^a_{jk}\, v_m(k) + (1 - \alpha^{\lambda-1})\, \sigma_m .$

From this we get

$C + \sigma_{m+1} \ge v_{m+1}(j) \ge -\lambda M + \alpha^{\lambda-1} \max_a \sum_k p^a_{jk}\, v_m(k) + (1 - \alpha^{\lambda-1})\, \sigma_m ,$

and with lemma 6.2

$C + \lambda M(2N-3) + \sigma_m \ge -\lambda M + \alpha^{\lambda-1} \max_a \sum_k p^a_{jk}\, v_m(k) + (1 - \alpha^{\lambda-1})\, \sigma_m .$

So,

$\alpha^{\lambda-1} \max_a \sum_k p^a_{jk}\,\big(v_m(k) - \sigma_m\big) \le C + 2\lambda M(N-1) .$

Hence, if for some $\ell \in S$ and $a \in A(j)$ we have $p^a_{j\ell} > 0$, so $p^a_{j\ell} \ge \zeta$, then

$v_m(\ell) - \sigma_m \le \alpha^{1-\lambda}\zeta^{-1}\big[ C + 2\lambda M(N-1) \big] .$    □

Next we will show that if $sp(v_{n+N-1}) \ge sp(v_n)$ then $sp(v_n)$ cannot be arbitrarily large. Define

$C_0 := 0 , \qquad C_n := \alpha^{1-\lambda}\zeta^{-1}\big[ C_{n-1} + 2\lambda M(N-1) \big] , \qquad n = 1,2,\ldots,N-1 .$

Then we have the following lemma.

Lemma 6.4. If the MDP is communicating then $sp(v_{n+N-1}) \ge sp(v_n)$ implies $sp(v_n) \le C_{N-1}$.

Proof: Let $j \in S$ be such that $v_{n+N-1}(j) = \sigma_{n+N-1}$ and define $S(j,t)$, $t = 0,1,\ldots,N-1$, by

$S(j,0) := \{j\} ,$
$S(j,t+1) := \{ i \in S \mid \text{there exist } k \in S(j,t) \text{ and } a \in A(k) \text{ such that } p^a_{ki} > 0 \} .$

From $p^a_{ii} \ge \alpha > 0$ for all $i$ and $a$ we have $S(j,t) \subset S(j,t+1)$. Further, it follows from the communicatingness that $S(j,t)$ increases strictly as long as $S(j,t) \ne S$, so that $S(j,N-1) = S$.

From lemma 6.3 we obtain (with $C = 0$) for all $i \in S(j,1)$

$v_{n+N-2}(i) - \sigma_{n+N-2} \le C_1 .$

Next we get for all $i \in S(j,2)$

$v_{n+N-3}(i) - \sigma_{n+N-3} \le C_2 ,$

until we get for all $i \in S(j,N-1) = S$

$v_n(i) - \sigma_n \le C_{N-1} ,$

hence $sp(v_n) \le C_{N-1}$.    □

Finally we can prove that $sp(v_n)$ is bounded.

Theorem 6.1. If the MDP is communicating then we have for all $n$

$sp(v_n) \le \max \{ sp(v_0) + 2\lambda M(N-2) ,\ C_{N-1} + 2\lambda M(N-1) \} .$

Proof. Either we have $sp(v_{n+N-1}) < sp(v_n)$ or $sp(v_{n+N-1}) \ge sp(v_n)$. But if $sp(v_{n+N-1}) \ge sp(v_n)$ then by lemma 6.4 $sp(v_n) \le C_{N-1}$, so $sp(v_{n+N-1}) \le sp(v_n) + 2\lambda M(N-1) \le C_{N-1} + 2\lambda M(N-1)$. Together this yields

$sp(v_{n+N-1}) \le \max \{ sp(v_n) ,\ C_{N-1} + 2\lambda M(N-1) \} , \qquad n = 0,1,\ldots .$

Hence, with $p = q(N-1) + r$, $q = 0,1,2,\ldots$, $r = 0,1,\ldots,N-2$,

$sp(v_p) \le \max \{ sp(v_{(q-1)(N-1)+r}) ,\ C_{N-1} + 2\lambda M(N-1) \}$
$\qquad \le \max \{ \max \{ sp(v_{(q-2)(N-1)+r}) ,\ C_{N-1} + 2\lambda M(N-1) \} ,\ C_{N-1} + 2\lambda M(N-1) \}$
$\qquad \le \cdots \le \max \{ sp(v_r) ,\ C_{N-1} + 2\lambda M(N-1) \} .$

With $sp(v_r) \le sp(v_0) + 2\lambda M r \le sp(v_0) + 2\lambda M(N-2)$, $r = 0,1,\ldots,N-2$, we finally get

$sp(v_n) \le \max \{ sp(v_0) + 2\lambda M(N-2) ,\ C_{N-1} + 2\lambda M(N-1) \} .$    □

So, $sp(v_n)$ is bounded, and as we argued in the beginning of this section that implies that the method of value oriented successive approximations also converges for communicating MDPs.

A still weaker condition that assures constant maximal gain and $sp(v_n)$ to be bounded is the condition of simple connectedness used by Platzman [12] to prove the convergence of standard successive approximations.

Definition 6.2. An MDP is simply connected if the state space $S$ is the disjoint union of two sets, one of which is a communicating class while the other is transient under any policy.

Denote the communicating class by $S^0$ and the transient class by $\bar S$. In order to prove that this condition is already sufficient, we define

$\sigma^0_n := \min_{i \in S^0} v_n(i) , \qquad \sigma_n := \min_{i \in S} v_n(i) ,$
$\tau^0_n := \max_{i \in S^0} v_n(i) , \qquad \tau_n := \max_{i \in S} v_n(i) .$

It is clear that $S^0$ is closed under any $P(f)$. Also, let $k$ be the minimal number for which $k\lambda \ge \#\bar S$ ($\#\bar S$ is the number of states in $\bar S$); then for some $\zeta$, $1 \ge \zeta > 0$, and for all $i \in S$ and all policies $h_1,\ldots,h_{k\lambda}$

$\sum_{j \in S^0} P(h_1) \cdots P(h_{k\lambda})(i,j) \ge \zeta .$

Hence for all $i \in S$

$v_{n+k}(i) \le k\lambda M + \zeta \tau^0_n + (1-\zeta)\tau_n ,$

and as for all $i \in S^0$

$v_{n+k}(i) \le k\lambda M + \tau^0_n \le k\lambda M + \zeta \tau^0_n + (1-\zeta)\tau_n ,$

we also have

$\tau_{n+k} \le k\lambda M + \zeta \tau^0_n + (1-\zeta)\tau_n .$

Similarly one may show

$\sigma_{n+k} \ge -k\lambda M + \zeta \sigma^0_n + (1-\zeta)\sigma_n ,$

so that

(6.1)    $sp(v_{n+k}) \le 2k\lambda M + \zeta(\tau^0_n - \sigma^0_n) + (1-\zeta)\, sp(v_n) .$

From theorem 6.1 we know that $\tau^0_n - \sigma^0_n$ is bounded, so we can rewrite (6.1) as

$sp(v_{n+k}) \le L + (1-\zeta)\, sp(v_n) , \qquad n = 0,1,\ldots .$

From this one may show

$sp(v_{pk+q}) \le \zeta^{-1} L + sp(v_q) \qquad \text{for all } p = 0,1,\ldots ,\ q = 0,1,\ldots,k-1 .$

Together with

$sp(v_q) \le 2\lambda M q + sp(v_0) \le sp(v_0) + 2\lambda M(k-1) , \qquad q = 0,1,\ldots,k-1 ,$

we get for all $n = 0,1,\ldots$

$sp(v_n) \le \zeta^{-1} L + 2\lambda M(k-1) + sp(v_0) .$

So we have shown

Theorem 6.2. If the MDP is simply connected then $sp(v_n)$ is bounded.

And as simple connectedness implies constant gain, one may show in identically the same way as in section 4 that the method of value oriented successive approximations converges. Via the approach in section 5 we can obtain again formula (5.2), hence $\delta_n$ converges again to $g^*$; however, lemma 4.2 does not hold here, so we cannot conclude that the convergence is geometric. But it is evident that if there is a unique policy satisfying $L(f)v = v + g^* e$ then the convergence is geometric again.

7. The necessity of the strong aperiodicity assumption.

In this section we discuss the question of the aperiodicity. In the previous sections we used the strong aperiodicity assumption. One might ask whether mere aperiodicity would not be enough. The following example shows one of the problems one might get under the weaker aperiodicity assumption: all $P(f)$ are aperiodic.

Example 7.1. $S = \{1,2,3,4\}$, $A(1) = \{1,2\}$, $A(2) = A(3) = A(4) = \{1\}$. In state 1 action 1 has $p^1_{11} = 1$ and action 2 has $p^2_{12} = 1$. Further $p^1_{23} = p^1_{24} = \tfrac12$ and $p^1_{34} = p^1_{41} = 1$. So we have only two policies: $f_1$, taking action 1 in state 1, and $f_2$, taking action 2 in state 1. Consider the strategy which uses alternately $f_1$ and $f_2$. Then the transitions in four steps are given by $Q = P(f_2)P(f_1)P(f_2)P(f_1)$. One easily verifies that

$Q = \begin{pmatrix} 1 & 0 & 0 & 0 \\ \tfrac12 & 0 & \tfrac14 & \tfrac14 \\ 0 & 0 & \tfrac12 & \tfrac12 \\ 0 & 0 & \tfrac12 & \tfrac12 \end{pmatrix} .$

So $Q$ is not unichained ($\{1\}$ and $\{3,4\}$ are both closed).
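A few lines of Python (our own check; the matrices simply encode the data above) confirm the four-step transition matrix and its two closed classes.

```python
import numpy as np

P1 = np.array([[1, 0, 0, 0],                # policy f1: state 1 stays in 1
               [0, 0, .5, .5],              # state 2 -> 3 or 4
               [0, 0, 0, 1],                # state 3 -> 4
               [1, 0, 0, 0]], dtype=float)  # state 4 -> 1
P2 = P1.copy()
P2[0] = [0, 1, 0, 0]                        # policy f2: state 1 goes to 2
Q = P2 @ P1 @ P2 @ P1                       # four steps of the alternating strategy
print(Q)
# rows: (1,0,0,0), (.5,0,.25,.25), (0,0,.5,.5), (0,0,.5,.5)
# so {1} and {3,4} are both closed under Q: the product is not unichained.
```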

Using this idea we constructed the following example.

Example 7.2. $S = \{1,2,3,4,5,6,7\}$, $A(3) = A(4) = A(6) = \{1,2\}$, $A(1) = A(2) = A(5) = A(7) = \{1\}$. So there are eight different policies, which can be characterized by the triples $(a_3,a_4,a_6)$, where $a_i$ is the action prescribed in state $i$, $i = 3,4,6$. Now consider, for the case $\lambda = 2$, the sequence of policies $(1,2,1), (2,1,2), (1,2,1)$, etc. Writing $f_1$ for policy $(1,2,1)$ and $f_2$ for policy $(2,1,2)$ we consider

$Q := P^2(f_1)\, P^2(f_2) .$

One may again verify that $Q(1,1) = 1$ and $Q(4,4) = 1$, so $Q$ is again not unichained.

One can try to construct a vector $v_0$ and rewards $r(i,a)$ such that $L(f_1)v_0 = Uv_0$, and $v_2 - v_0$ is some constant vector while $v_1 - v_0$ is not. A solution is given by

$v_0 := (3,6,4,2,2,2,2)^T ,$
$r(1,1) := 2 , \quad r(2,1) := r(3,1) := 4 , \quad r(3,2) := 6 , \quad r(4,1) := 4 ,$
$r(4,2) := r(5,1) := 6 , \quad r(6,1) := 2 , \quad r(7,1) := 0 .$

This yields, for any maximizing policy $f$ with $f(4) = 2$,

$v_1 = (10,12,12,12,10,6,8)^T \quad \text{and} \quad v_2 = (19,22,20,18,18,18,18)^T = v_0 + 16\, e .$

This shows that the value oriented method may cycle between suboptimal policies, as one easily verifies that the policies $(1,2,1)$ and $(2,1,2)$ both have gain 4, whereas $(2,2,2)$ is the optimal policy, with gain $4\tfrac{2}{7}$.

However, in example 7.2 there is some ambiguity in the choice of the maximizing policies $f_1$ and $f_2$: several policies $f$ satisfy $L(f)v_0 = Uv_0$, and any policy $h$ with $h(3) = h(6) = 1$ satisfies $L(h)v_1 = Uv_1$. So if we use as a rule for breaking ties: take the action with the lowest number, then cycling may occur.

The question remains whether cycling may occur if we use the rule: do not change any action unless another action is strictly better. Or if we use randomized actions in case of ties.

References

[1] Bather, J., Optimal decision procedures for finite Markov chains. Part II, Adv. Prob. 5 (1973), 521-540.
[2] Bellman, R., A Markovian decision process, J. Math. Mech. 6 (1957), 679-684.
[3] Hajnal, J., Weak ergodicity in nonhomogeneous Markov chains, Proc. Cambridge Philos. Soc. 54 (1958), 233-246.
[4] Hastings, N.A.J., Bounds on the gain of a Markov decision process, Oper. Res. 19 (1971), 240-243.
[5] Hordijk, A., and H.C. Tijms, The method of successive approximations and Markovian decision problems, Oper. Res. 22 (1974), 519-521.
[6] Howard, R.A., Dynamic programming and Markov processes, J. Wiley, New York, 1960.
[7] Morton, T.E., Undiscounted Markov renewal programming via modified successive approximations, Oper. Res. 19 (1971), 1081-1089.
[8] Morton, T., and W. Wecker, Ergodicity and convergence for Markov decision processes, Man. Sci. 23 (1977), 890-900.
[9] Nunen, J.A.E.E. van, A set of successive approximation methods for discounted Markovian decision problems, Zeitschrift für Oper. Res. 20 (1976), 203-208.
[10] Nunen, J.A.E.E. van, Contracting Markov decision processes, Mathematical Centre Tract 71, Mathematisch Centrum, Amsterdam, 1976.
[11] Odoni, A., On finding the maximal gain for Markov decision processes, Oper. Res. 17 (1969), 857-860.
[12] Platzman, L., Improved conditions for convergence in undiscounted Markov renewal programming, Oper. Res. 25 (1977), 529-533.
[13] Porteus, E.L., Some bounds for discounted sequential decision processes, Management Science 18 (1971), 7-11.
[14] Schweitzer, P.J., Perturbation theory and Markovian decision processes, Ph.D. dissertation, MIT Operations Research Center Report 15, 1965.
[15] Schweitzer, P.J., Iterative solution of the functional equations of undiscounted Markov renewal programming, J. Math. Anal. Appl. 34 (1971), 495-501.
[16] Schweitzer, P.J., and A. Federgruen, The asymptotic behaviour of undiscounted value iteration in Markov decision problems, Math. Oper. Res. (1978), 360-382.
[17] Schweitzer, P.J., and A. Federgruen, Geometric convergence of value-iteration in multichain Markov decision problems, Adv. Appl. Prob. 11 (1979), 188-217.
[18] Wal, J. van der, A successive approximation algorithm for an undiscounted Markov decision process, Computing 17 (1976), 157-162.
[19] Wal, J. van der, A policy improvement - value approximation algorithm for the ergodic average reward Markov decision process, Memorandum COSOR 78-27, Eindhoven University of Technology, Dept. of Mathematics, 1978.
[20] White, D.J., Dynamic programming, Markov chains and the method of successive approximations, J. Math. Anal. Appl. 6 (1963), 373-376.
