PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 78-24

A stopping time-based policy iteration algorithm for Markov decision processes with discountfactor tending to 1

by

J. van der Wal

Eindhoven, The Netherlands                                   November 1978


Abstract. This paper considers the Markov decision process with finite state and action spaces, when the discount factor tends to 1. Miller and Veinott have shown the existence of n-discount optimal policies, and Veinott has given an algorithm to determine one. In this paper we use the stopping times as introduced by Wessels to generate a set of modified policy iteration algorithms for the determination of an n-discount optimal strategy.

1. Introduction and notations. In this paper we consider the discounted Markov decision process (MDP) with finite state and action spaces when the discount factor $\beta$ tends to 1. We are interested in finding $n$-discount optimal policies. The notion of $n$-discount optimality stems from Miller and Veinott [3]. As we know, $(-1)$-discount optimality corresponds to average (or gain) optimality and $0$-discount optimality to bias optimality. In [3] the existence of $n$-discount optimal policies has been shown, and Veinott [4] has shown how to determine $n$-discount optimal policies with an extended (and adapted) version of Howard's Policy Iteration Algorithm (PIA) [2].

In a previous paper [6] we gave a variant of Howard's PIA, based on a finite transition memoryless stopping time, to determine an average optimal policy. Here we extend this stopping time-based approach to determine $n$-discount optimal policies. An example of such a stopping time-based algorithm is the Gauss-Seidel version of Howard's PIA.

So we are looking at a discrete-time MDP with finite state space $S = \{1,2,\ldots,N\}$ and finite action space $A$. If in state $i$ action $a$ is taken, then the immediate reward is $r(i,a)$ and the system moves to state $j$ with probability $p^a_{ij}$. A policy or stationary strategy is a map from $S$ into $A$. Each $i \in S$ and policy $f$ determine a probability measure $\mathbb{P}_{i,f}$ on $(S \times A)^\infty$ and a stochastic process $\{(X_n, A_n),\ n = 0,1,\ldots\}$, where $X_n$ is the state and $A_n$ the action taken at time $n$. The expectation with respect to $\mathbb{P}_{i,f}$ will be denoted by $\mathbb{E}_{i,f}$.

In Wessels [7] stopping times are used to generate successive approximation algorithms. Following the same approach we define a nonzero, finite and transition memoryless stopping time $\tau$ as a map from $S^\infty$ into $\mathbb{N} \cup \{\infty\} = \{1,2,\ldots,\infty\}$ such that $\mathbb{P}_{i,f}(\tau < \infty) = 1$ for all $i$ and $f$, and such that $\tau$ can be completely characterized by a set $T \subset S \times S$ of transitions: the process is stopped at the first transition that belongs to $T$, i.e. $\tau = \min\{n \geq 1 \mid (X_{n-1},X_n) \in T\}$.

Here we consider only this type of stopping times. As a consequence of this transition memorylessness we can restrict ourselves to policies (cf. lemmas 3.1 and 3.2 in [6]). In the remainder of this paper $T$ and $\tau$ are fixed.

We want to introduce a few more notations. Let $f$ be a policy; then define the vectors $r_f$ and $r_{\beta,\tau,f}$ and the matrices $P_f$, $P^*_f$ and $P_{\beta,\tau,f}$ by

$$r_f(i) := r(i,f(i)), \qquad r_{\beta,\tau,f}(i) := \mathbb{E}_{i,f} \sum_{n=0}^{\tau-1} \beta^n\, r(X_n,A_n) \quad (\text{cf. } [7]),$$

$$P_f(i,j) := p^{f(i)}_{ij}, \qquad P^*_f := \lim_{N'\to\infty} \frac{1}{N'}\sum_{n=0}^{N'-1} P_f^{\,n}, \qquad P_{\beta,\tau,f}(i,j) := \sum_{n=1}^{\infty} \beta^n\, \mathbb{P}_{i,f}(X_n = j,\ \tau = n).$$

Further we define the matrices $\bar P_f$ and $\tilde P_f$ by

$$\bar P_f(i,j) := \begin{cases} p^{f(i)}_{ij} & \text{if } (i,j) \notin T \\ 0 & \text{if } (i,j) \in T \end{cases}, \qquad \tilde P_f(i,j) := \begin{cases} 0 & \text{if } (i,j) \notin T \\ p^{f(i)}_{ij} & \text{if } (i,j) \in T. \end{cases}$$

Then we have

Lemma 1.1. (suppressing the dependence on $\tau$)

i) $P_f = \bar P_f + \tilde P_f$;

ii) $r_{\beta,\tau,f} = (I - \beta \bar P_f)^{-1} r_f$;

iii) $P_{\beta,\tau,f} = \beta\,(I - \beta \bar P_f)^{-1} \tilde P_f$.

From the finiteness of $\tau$ it follows that $I - \bar P_f$ is nonsingular, so that ii) and iii) also hold for $\beta = 1$. We will write $r_{\tau,f}$ and $P_{\tau,f}$ instead of $r_{1,\tau,f}$ and $P_{1,\tau,f}$.
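As a concrete illustration of these definitions (an addition of this edit, not part of the original memorandum), the sketch below builds $\bar P_f$, $\tilde P_f$ and the stopped quantities $r_{\tau,f}$, $P_{\tau,f}$, $Q_{\tau,f} := (I-\bar P_f)^{-1}$, $R_{\tau,f} := \bar P_f(I-\bar P_f)^{-1}$ for a given policy and transition set $T$. The array layout ($P[a,i,j] = p^a_{ij}$, $r[a,i] = r(i,a)$, boolean $T$) is our own convention, and $T$ is assumed to collect the transitions at which the process stops.

```python
import numpy as np

def tau_matrices(P, r, f, T):
    """Split P_f into its non-stopping part (transitions outside T) and its
    stopping part (transitions in T), and return r_{tau,f}, P_{tau,f},
    Q_{tau,f}, R_{tau,f} as in Lemma 1.1 with beta = 1.

    P : (num_actions, N, N) array of transition probabilities p^a_{ij}
    r : (num_actions, N) array of rewards r(i, a)
    f : length-N integer array, f[i] is the action taken in state i
    T : (N, N) boolean array, T[i, j] True iff transition (i, j) stops the process
    """
    N = P.shape[1]
    Pf = P[f, np.arange(N), :]               # row i equals p^{f(i)}_{i.}
    rf = r[f, np.arange(N)]                  # r(i, f(i))
    P_bar = np.where(~T, Pf, 0.0)            # non-stopping part (outside T)
    P_tilde = np.where(T, Pf, 0.0)           # stopping part (inside T)
    Q = np.linalg.inv(np.eye(N) - P_bar)     # Q_{tau,f} = (I - P_bar)^{-1}; tau finite assumed
    R = P_bar @ Q                            # R_{tau,f} = P_bar (I - P_bar)^{-1}
    r_tau = Q @ rf                           # r_{tau,f}
    P_tau = Q @ P_tilde                      # P_{tau,f}
    return P_bar, P_tilde, r_tau, P_tau, Q, R
```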

The total expected discounted reward under policy $f$, denoted by $v_{\beta,f}$, satisfies

$$v_{\beta,f} = \sum_{n=0}^{\infty} (\beta P_f)^n\, r_f .$$

A policy $f$ is $n$-discount optimal ($n = -1,0,\ldots$) if

(1.1)  $\limsup_{\beta \uparrow 1}\ (1-\beta)^{-n}\,\bigl(v_{\beta,f} - v_{\beta,g}\bigr) \geq 0$  for all $g$.

And policy $f$ is called $\infty$-discount optimal if $f$ is $n$-discount optimal for all $n = -1,0,1,\ldots$.

For $v_{\beta,f}$ we also have the Laurent series expansion in $(1-\beta)$, for $\beta \uparrow 1$,

$$v_{\beta,f} = \sum_{n=-1}^{\infty} (1-\beta)^n\, c_{n,f}.$$

(Miller and Veinott [3] used the expansion in $\rho$, with $\beta = (1+\rho)^{-1}$, but in our case the expansion in $(1-\beta)$ gives the simpler expressions.) The terms $c_{n,f}$ can be obtained as follows:

$$v_{\beta,f} = \bigl[I + \beta P_f + \beta^2 P_f^2 + \cdots\bigr] r_f = \bigl[I + \beta(P_f - P^*_f) + \beta^2 (P_f - P^*_f)^2 + \cdots\bigr] r_f + (1-\beta)^{-1} P^*_f r_f - P^*_f r_f ,$$

where we used $P_f^n - P^*_f = (P_f - P^*_f)^n$, $n = 1,2,\ldots$ (which follows from $P_f P^*_f = P^*_f P_f = P^*_f P^*_f = P^*_f$). Hence

(1.2)  $v_{\beta,f} = \bigl[I - \beta (P_f - P^*_f)\bigr]^{-1} r_f + (1-\beta)^{-1} P^*_f r_f - P^*_f r_f .$

If $I - B$ is nonsingular and $\beta$ is sufficiently close to 1, then we have the expansion

(1.3)  $(I - \beta B)^{-1} = \sum_{k=0}^{\infty} (-1)^k (1-\beta)^k\, \bigl[B(I-B)^{-1}\bigr]^k (I-B)^{-1}.$

Since $I - P_f + P^*_f$ is nonsingular (lemma 1d in [1]) we may substitute (1.3), with $B = P_f - P^*_f$, in (1.2) to obtain

(1.4)  $c_{-1,f} = P^*_f\, r_f,\qquad c_{0,f} = \bigl[(I-B)^{-1} - P^*_f\bigr] r_f,\qquad c_{k,f} = (-1)^k \bigl[B(I-B)^{-1}\bigr]^k (I-B)^{-1} r_f,\quad k = 1,2,\ldots,$

with $B = P_f - P^*_f$.
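For readers who want to experiment with (1.4), the following sketch (ours, not the author's) computes the Laurent coefficients numerically. It assumes for simplicity that $P_f$ is unichain, so that $P^*_f = \mathbf{1}\pi^{\mathsf T}$ with $\pi$ the unique stationary distribution; a multichain example would need a more general computation of $P^*_f$.

```python
import numpy as np

def laurent_coefficients(Pf, rf, n_max):
    """Laurent coefficients c_{-1,f}, ..., c_{n_max,f} of v_{beta,f} in powers
    of (1 - beta), via (1.4) with B = P_f - P*_f.  Assumes P_f is unichain."""
    N = Pf.shape[0]
    # stationary distribution: pi (I - P_f) = 0,  sum(pi) = 1
    A = np.vstack([(np.eye(N) - Pf).T, np.ones((1, N))])
    b = np.zeros(N + 1)
    b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    P_star = np.ones((N, 1)) @ pi.reshape(1, N)
    B = Pf - P_star
    inv_IB = np.linalg.inv(np.eye(N) - B)    # nonsingular (cf. lemma 1d in [1])
    c = {-1: P_star @ rf, 0: (inv_IB - P_star) @ rf}
    M = B @ inv_IB
    for k in range(1, n_max + 1):
        c[k] = (-1) ** k * np.linalg.matrix_power(M, k) @ inv_IB @ rf
    return c
```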

For any two policies $f$ and $g$ we define

(1.5)  $\Delta c_{n,f,g} := c_{n,f} - c_{n,g},\qquad n = -1,0,\ldots.$

And we define $f \overset{n}{\geq} g$ if for all $i \in S$ the first nonzero element, if any, in the row $\Delta c_{-1,f,g}(i),\ \Delta c_{0,f,g}(i),\ \ldots,\ \Delta c_{n,f,g}(i)$ is positive (cf. Miller and Veinott [3]). Further we write $f \overset{\infty}{\geq} g$ if $f \overset{n}{\geq} g$ for all $n = -1,0,\ldots$. So $\overset{n}{\geq}$ and $\overset{\infty}{\geq}$ are partial orderings on the set of policies.

We see that a policy $f$ is $n$-discount optimal [$\infty$-discount optimal] if and only if $f \overset{n}{\geq} g$ [$f \overset{\infty}{\geq} g$] for all $g$.

It is straightforward that our notion of $n$-discount optimality is identical to the $n$-discount optimality in Veinott [5], since $\lim_{\beta \uparrow 1} (1-\beta)/\rho = 1$ ($\beta = (1+\rho)^{-1}$).
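The orderings $\overset{n}{\geq}$ are componentwise lexicographic comparisons of the Laurent coefficients. A small helper along the following lines (an illustration of ours, using the dictionary returned by the sketch above and a numerical tolerance) makes the definition concrete.

```python
import numpy as np

def n_geq(c_f, c_g, n, tol=1e-9):
    """f >=_n g : for every state i, the first entry of the row
    (c_{-1,f}-c_{-1,g})(i), ..., (c_{n,f}-c_{n,g})(i) that is nonzero
    (up to tol) must be positive."""
    N = len(c_f[-1])
    for i in range(N):
        for k in range(-1, n + 1):
            diff = c_f[k][i] - c_g[k][i]
            if abs(diff) > tol:
                if diff < 0:
                    return False
                break            # first nonzero entry is positive: state i is fine
    return True
```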

In section 2 we will derive a Laurent series expansion for $r_{\beta,\tau,g} + P_{\beta,\tau,g}\, v_{\beta,f}$, from which we obtain the PIA formulated in section 4. In section 5 we show that the policy improvement step of this algorithm indeed improves the policy, and in section 6 we show that our modified PIA produces an $n$-discount optimal policy.

2. The Laurent series expansion for $r_{\beta,\tau,g} + P_{\beta,\tau,g}\, v_{\beta,f}$. Performing a stopping time-based successive approximation step on $v_{\beta,f}$ means: maximize over $g$

(2.1)  $r_{\beta,\tau,g} + P_{\beta,\tau,g}\, v_{\beta,f}$  (cf. Wessels [7]).

For (2.1) we can derive a Laurent series expansion as follows. Substitute in (2.1) lemma 1.1(ii) and (iii) and use expansion (1.3) with $B = \bar P_g$ to obtain

(2.2)  $r_{\beta,\tau,g} + P_{\beta,\tau,g}\, v_{\beta,f} = (I - \beta\bar P_g)^{-1}\bigl[r_g + \bigl(\tilde P_g - (1-\beta)\tilde P_g\bigr) v_{\beta,f}\bigr] = \displaystyle\sum_{n=0}^{\infty} \bigl[1-(1-\beta)\bigr]^n \bar P_g^{\,n} \Bigl\{ r_g + \bigl[\tilde P_g - (1-\beta)\tilde P_g\bigr] \sum_{k=-1}^{\infty} (1-\beta)^k c_{k,f} \Bigr\}.$

And we find for the coefficient $d_{k,g,f}$ of $(1-\beta)^k$ in (2.2)

(2.3)  $d_{-1,g,f} = (I - \bar P_g)^{-1}\tilde P_g\, c_{-1,f},$
       $d_{n,g,f} = (-1)^n\bigl[\bar P_g(I-\bar P_g)^{-1}\bigr]^n (I-\bar P_g)^{-1} r_g + \displaystyle\sum_{\ell=0}^{n+1} (-1)^\ell \bigl[\bar P_g(I-\bar P_g)^{-1}\bigr]^\ell (I-\bar P_g)^{-1}\tilde P_g\, c_{n-\ell,f} + \sum_{\ell=0}^{n} (-1)^{\ell+1} \bigl[\bar P_g(I-\bar P_g)^{-1}\bigr]^\ell (I-\bar P_g)^{-1}\tilde P_g\, c_{n-\ell-1,f},\quad n \geq 0.$

With the notations $r_{\tau,g}$, $P_{\tau,g}$ and

$$Q_{\tau,g} := (I - \bar P_g)^{-1}, \qquad R_{\tau,g} := \bar P_g\,(I - \bar P_g)^{-1},$$

(2.3) simplifies to

(2.4)  $d_{-1,g,f} = P_{\tau,g}\, c_{-1,f},\qquad d_{0,g,f} = r_{\tau,g} + P_{\tau,g}\, c_{0,f} - Q_{\tau,g} P_{\tau,g}\, c_{-1,f},$

(2.5)  $d_{n,g,f} = (-1)^n R_{\tau,g}^{\,n}\, r_{\tau,g} + \displaystyle\sum_{\ell=0}^{n+1} (-1)^\ell R_{\tau,g}^{\,\ell} P_{\tau,g}\, c_{n-\ell,f} + \sum_{\ell=0}^{n} (-1)^{\ell+1} R_{\tau,g}^{\,\ell} P_{\tau,g}\, c_{n-\ell-1,f},\qquad n \geq 1.$

The expression for $d_{n,g,f}$ can be simplified further to the recursion

(2.6)  $d_{n,g,f} = (-R_{\tau,g})\, d_{n-1,g,f} + P_{\tau,g}\,(c_{n,f} - c_{n-1,f}),\qquad n \geq 1.$

If we maximize (2.1) for $\beta$ sufficiently close to 1, then we maximize "lexicographically" the first terms of the expansion (2.2): first maximize $d_{-1,g,f}$, then maximize $d_{0,g,f}$ over the set of maximizers of $d_{-1,g,f}$, etc.

In [6] we showed that a policy improvement step which subsequently maximizes $d_{-1,g,f}$ and $d_{0,g,f}$ gives a convergent algorithm and produces an average optimal strategy. Here we extend this result and show that an algorithm with as improvement step the maximization of $d_{-1,g,f},\ldots,d_{n,g,f}$ produces an $(n-1)$-discount optimal strategy.
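Combining the two sketches above, the coefficients $d_{k,g,f}$ can be computed directly from (2.4) and the recursion (2.6); again this is only an illustration with our own conventions, not part of the original text.

```python
def d_coefficients(P, r, g, T, c_f, n_max):
    """Coefficients d_{k,g,f}, k = -1, ..., n_max, of the expansion (2.2),
    computed from (2.4) and the recursion (2.6).  c_f is the dictionary of
    Laurent coefficients c_{k,f} of the current policy f."""
    _, _, r_tau, P_tau, Q, R = tau_matrices(P, r, g, T)
    d = {-1: P_tau @ c_f[-1],
          0: r_tau + P_tau @ c_f[0] - Q @ (P_tau @ c_f[-1])}
    for k in range(1, n_max + 1):
        d[k] = -R @ d[k - 1] + P_tau @ (c_f[k] - c_f[k - 1])
    return d
```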

3. Some equations. In this section we collect a number of equations we need in the sequel.

In the first part of this section we derive from equations (2.4)-(2.6) a set of equivalent equations. Let $f$ be the current policy and $g$ an arbitrary policy. Define

(3.1)  $\psi_{k,g,f} := d_{k,g,f} - c_{k,f},\qquad k = -1,0,\ldots.$

Further we have

(3.2)  $r_{\tau,g} = r_g + \bar P_g\, r_{\tau,g},\qquad P_{\tau,g} = \tilde P_g + \bar P_g\, P_{\tau,g},\qquad Q_{\tau,g} = I + \bar P_g\, Q_{\tau,g},\qquad R_{\tau,g} = \bar P_g + \bar P_g\, R_{\tau,g}.$

If we substitute (3.1) and (3.2) in (2.4)-(2.6) we get

(3.3)  $\tilde P_g\, c_{-1,f} + \bar P_g\,(c_{-1,f} + \psi_{-1,g,f}) = c_{-1,f} + \psi_{-1,g,f},$

(3.4)  $r_g + \tilde P_g\, c_{0,f} - (c_{-1,f} + \psi_{-1,g,f}) + \bar P_g\,(c_{0,f} + \psi_{0,g,f}) = c_{0,f} + \psi_{0,g,f},$

(3.5)  $-\bar P_g\,(c_{k-1,f} + \psi_{k-1,g,f}) + \tilde P_g\,(c_{k,f} - c_{k-1,f}) + \bar P_g\,(c_{k,f} + \psi_{k,g,f}) = c_{k,f} + \psi_{k,g,f},\qquad k \geq 1.$

In order to rewrite (3.3)-(3.5) componentwise, define

(3.6)  $T_i := \{j \in S \mid (i,j) \in T\}.$

Then we have for all $v \in \mathbb{R}^N$

(3.7)  $(\bar P_g v)(i) = \displaystyle\sum_{j \notin T_i} p^{g(i)}_{ij}\, v(j)\qquad\text{and}\qquad (\tilde P_g v)(i) = \sum_{j \in T_i} p^{g(i)}_{ij}\, v(j).$

If we substitute this into (3.3)-(3.5) we get the componentwise formulation of (3.3)-(3.5):

(3.8)  $\displaystyle\sum_{j \in T_i} p^{g(i)}_{ij}\, c_{-1,f}(j) + \sum_{j \notin T_i} p^{g(i)}_{ij}\,(c_{-1,f} + \psi_{-1,g,f})(j) = (c_{-1,f} + \psi_{-1,g,f})(i),$

(3.9)  $r(i,g(i)) + \displaystyle\sum_{j \in T_i} p^{g(i)}_{ij}\, c_{0,f}(j) - (c_{-1,f} + \psi_{-1,g,f})(i) + \sum_{j \notin T_i} p^{g(i)}_{ij}\,(c_{0,f} + \psi_{0,g,f})(j) = (c_{0,f} + \psi_{0,g,f})(i),$

(3.10)  $-\displaystyle\sum_{j \notin T_i} p^{g(i)}_{ij}\,(c_{k-1,f} + \psi_{k-1,g,f})(j) + \sum_{j \in T_i} p^{g(i)}_{ij}\,(c_{k,f} - c_{k-1,f})(j) + \sum_{j \notin T_i} p^{g(i)}_{ij}\,(c_{k,f} + \psi_{k,g,f})(j) = (c_{k,f} + \psi_{k,g,f})(i),\qquad k \geq 1.$

So (3.8)-(3.10) follow from (2.4)-(2.6). That (3.8)-(3.10) are even equivalent to (2.4)-(2.6) is immediate from the finiteness of the stopping time $\tau$. This we see as follows. Clearly (3.8)-(3.10) and (3.3)-(3.5) are equivalent. And as $\tau$ is finite, $I - \bar P_g$ is nonsingular; multiplying (3.3)-(3.5) by $(I - \bar P_g)^{-1}$ then yields (2.4)-(2.6) again.

In the second part of this section we derive some relations between the $\Delta c_{k,g,f}$ and the $\psi_{k,g,f}$. Clearly we have from $r_{\beta,\tau,f} + P_{\beta,\tau,f}\, v_{\beta,f} = v_{\beta,f}$ (cf. lemma 1.1 in Wessels [7]) that $d_{k,f,f} = c_{k,f}$, so

(3.11;f)  $P_{\tau,f}\, c_{-1,f} = c_{-1,f},$

(3.12;f)  $r_{\tau,f} + P_{\tau,f}\, c_{0,f} - Q_{\tau,f} P_{\tau,f}\, c_{-1,f} = c_{0,f},$

(3.13;f)  $(-R_{\tau,f})\, c_{k-1,f} + P_{\tau,f}\,(c_{k,f} - c_{k-1,f}) = c_{k,f},\qquad k \geq 1.$

If we subtract (2.4)-(2.6) from (3.11;g)-(3.13;g) and substitute (3.1) and (1.5) we get

(3.14)  $P_{\tau,g}\, \Delta c_{-1,g,f} = \Delta c_{-1,g,f} - \psi_{-1,g,f},$

(3.15)  $P_{\tau,g}\, \Delta c_{0,g,f} - Q_{\tau,g} P_{\tau,g}\, \Delta c_{-1,g,f} = \Delta c_{0,g,f} - \psi_{0,g,f},$

(3.16)  $(-R_{\tau,g})\,(\Delta c_{k-1,g,f} - \psi_{k-1,g,f}) + P_{\tau,g}\,(\Delta c_{k,g,f} - \Delta c_{k-1,g,f}) = \Delta c_{k,g,f} - \psi_{k,g,f},\qquad k \geq 1.$
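As a quick numerical sanity check of (3.11;f)-(3.13;f), i.e. of $d_{k,f,f} = c_{k,f}$, one can combine the sketches above on a small random example. The particular $T$ below (stop at every transition to a state with a smaller or equal index) is chosen only because it makes the non-stopping part strictly upper triangular, so that $\tau$ is certainly finite; the example data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions, N, n = 3, 5, 3
P = rng.random((num_actions, N, N))
P /= P.sum(axis=2, keepdims=True)              # random transition law p^a_{ij}
r = rng.random((num_actions, N))
f = rng.integers(num_actions, size=N)          # arbitrary policy
T = np.tril(np.ones((N, N), dtype=bool))       # stop on transitions to j <= i

Pf = P[f, np.arange(N), :]
rf = r[f, np.arange(N)]
c_f = laurent_coefficients(Pf, rf, n)          # sketch from section 1
d_ff = d_coefficients(P, r, f, T, c_f, n)      # sketch from section 2
assert all(np.allclose(d_ff[k], c_f[k]) for k in range(-1, n + 1))
```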

4. The modified policy improvement step. In section 2 we have seen that for $\beta \uparrow 1$ the stopping time-based successive approximation step first maximizes $d_{-1,g,f}$, then $d_{0,g,f}$, etc. In [6], where we only considered $d_{-1,g,f}$ and $d_{0,g,f}$, we gave the following approach. Define $\psi_{-1,f}$ by

(4.1)  $\psi_{-1,f} := \max_g \psi_{-1,g,f} = \max_g P_{\tau,g}\, c_{-1,f} - c_{-1,f}.$

Then we have for all $a$

(4.2)  $\displaystyle\sum_{j \in T_i} p^a_{ij}\, c_{-1,f}(j) + \sum_{j \notin T_i} p^a_{ij}\,(c_{-1,f}+\psi_{-1,f})(j) \leq (c_{-1,f}+\psi_{-1,f})(i).$

For suppose the left-hand side in (4.2) is greater than the right-hand side for some $a$, and let $g$ be a maximizer in (4.1); then we see from (3.8) that (4.2) holds with equality for $a = g(i)$. Now consider the policy $h$ with $h(i) = a$ and $h(j) = g(j)$, $j \neq i$. Then from (4.2)

(4.3)  $\tilde P_h\, c_{-1,f} + \bar P_h\,(c_{-1,f}+\psi_{-1,f}) \geq c_{-1,f}+\psi_{-1,f}$, with strict inequality in component $i$,

so, multiplying by $(I - \bar P_h)^{-1} \geq I$,

(4.4)  $\psi_{-1,h,f} \geq \psi_{-1,f}$ and $\psi_{-1,h,f}(i) > \psi_{-1,f}(i),$

contradicting (4.1). Define

(4.5)  $A_{-1}(i,f) :=$ the set of actions for which (4.2) holds with equality,

and

(4.6)  $G_{-1}(f) := \{g \mid g(i) \in A_{-1}(i,f) \text{ for all } i \in S\}.$

For any policy $g \in G_{-1}(f)$, (4.3) and (4.4) will hold with equality, so $G_{-1}(f)$ is the set of maximizers of (4.1). Continuing in this way we define

(4.7)  $\psi_{0,f} := \max_{g \in G_{-1}(f)} \psi_{0,g,f}.$

Then for all $a \in A_{-1}(i,f)$

(4.8)  $r(i,a) + \displaystyle\sum_{j \in T_i} p^a_{ij}\, c_{0,f}(j) - (c_{-1,f}+\psi_{-1,f})(i) + \sum_{j \notin T_i} p^a_{ij}\,(c_{0,f}+\psi_{0,f})(j) \leq (c_{0,f}+\psi_{0,f})(i).$

If we define further

(4.9)  $A_0(i,f) :=$ the set of $a \in A_{-1}(i,f)$ for which (4.8) holds with equality,

(4.10)  $G_0(f) := \{g \mid g(i) \in A_0(i,f) \text{ for all } i \in S\}.$

Then again $G_0(f)$ is precisely the set of maximizers of (4.7). In [6] we proved that a policy iteration algorithm with as improvement step the determination of a policy $g \in G_0(f)$, with $g$ equal to $f$ whenever possible ($g(i) = f(i)$ if $f(i) \in A_0(i,f)$), converges and produces an average optimal policy, i.e. a policy $h$ with $h \overset{-1}{\geq} g$ for all $g$.

Here we extend the policy improvement step in the following way. Define

(4.11)  $\psi_{k,f} := \max_{g \in G_{k-1}(f)} \psi_{k,g,f},\qquad k = 1,2,\ldots,$

(4.12)  $A_k(i,f) :=$ the set of $a \in A_{k-1}(i,f)$ for which (4.13) below holds with equality, $k = 1,2,\ldots,$

(4.13)  $-\displaystyle\sum_{j \notin T_i} p^a_{ij}\,(c_{k-1,f}+\psi_{k-1,f})(j) + \sum_{j \in T_i} p^a_{ij}\,(c_{k,f}-c_{k-1,f})(j) + \sum_{j \notin T_i} p^a_{ij}\,(c_{k,f}+\psi_{k,f})(j) \leq (c_{k,f}+\psi_{k,f})(i),$

(4.14)  $G_k(f) := \{g \mid g(i) \in A_k(i,f) \text{ for all } i \in S\}.$

In the same way as before one may show that (4.13) holds for all $a \in A_{k-1}(i,f)$, and that $g$ maximizes (4.11) within $G_{k-1}(f)$ if and only if $g \in G_k(f)$. Now we can propose the following modified policy iteration algorithm.

(4.15)

Value determination step. Let $f$ be the current policy. Determine $c_{-1,f},\ldots,c_{n,f}$.

Policy improvement step. Determine a policy $g \in G_n(f)$ with $g(i) = f(i)$ whenever $f(i) \in A_n(i,f)$.

In the next sections we will show that this modified PIA converges and terminates with an $(n-1)$-discount optimal strategy.
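To show the overall structure of (4.15) in code, the sketch below (ours, and only a sketch) treats the special case $\tau \equiv 1$, i.e. $T = S \times S$, in which $P_{\tau,g} = P_g$, $r_{\tau,g} = r_g$, $Q_{\tau,g} = I$ and $R_{\tau,g} = 0$, so that the tests (4.2), (4.8) and (4.13) reduce to nested componentwise maximizations. It reuses laurent_coefficients (and thus its unichain assumption); the tolerance-based tie detection and the parameters f0, tol, max_iter are our own choices. For a general stopping time the quantities $\psi_{-1,f},\ldots,\psi_{n,f}$ of (4.1), (4.7) and (4.11) have to be determined as well, which is not shown here.

```python
import numpy as np

def modified_pia(P, r, f0, n, tol=1e-9, max_iter=1000):
    """Sketch of the modified PIA (4.15) for the special case tau = 1 (T = S x S)."""
    num_actions, N, _ = P.shape
    f = np.array(f0, dtype=int)
    for _ in range(max_iter):
        # value determination step: c_{-1,f}, ..., c_{n,f}
        Pf = P[f, np.arange(N), :]
        rf = r[f, np.arange(N)]
        c = laurent_coefficients(Pf, rf, n)

        # policy improvement step: nested maximization over k = -1, 0, ..., n
        g = f.copy()
        for i in range(N):
            candidates = list(range(num_actions))
            for k in range(-1, n + 1):
                if k == -1:
                    scores = [P[a, i, :] @ c[-1] for a in candidates]
                elif k == 0:
                    scores = [r[a, i] + P[a, i, :] @ (c[0] - c[-1]) for a in candidates]
                else:
                    scores = [P[a, i, :] @ (c[k] - c[k - 1]) for a in candidates]
                best = max(scores)
                candidates = [a for a, s in zip(candidates, scores) if s >= best - tol]
            # keep the old action whenever it survives all n + 2 tests
            g[i] = f[i] if f[i] in candidates else candidates[0]
        if np.array_equal(g, f):
            return f        # f in G_n(f): (n-1)-discount optimal by theorem 6.1
        f = g
    return f
```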

5. The policy improvement step. In this section we prove that the policy improvement step (4.15) produces a policy $g$ which is at least as good as $f$ with respect to the first $n+2$ terms of the Laurent series expansions for $v_{\beta,g}$ and $v_{\beta,f}$, and that these terms can only be two by two equal if the newly produced policy is identical to the old one:

Theorem 5.1. Let $f$ be an arbitrary policy and $g \in G_n(f)$ with $g(j) = f(j)$ whenever $f(j) \in A_n(j,f)$, $j \in S$. Then

i) $g \overset{n}{\geq} f$;

ii) $g \overset{n}{=} f$ only if $g = f$ (where $g \overset{n}{=} f$ means $g \overset{n}{\geq} f$ and $f \overset{n}{\geq} g$).

In order to prove this we need the following lemma.

Lemma 5.2. Let $f$ be an arbitrary policy. Then

i) $\psi_{-1,f} \geq 0$,

and if $\psi_{-1,f}(i) = \cdots = \psi_{k,f}(i) = 0$ then

ii) $\psi_{-1,f}(j) = \cdots = \psi_{k,f}(j) = 0$ for all $j \in V(i,f) := \{j \notin T_i \mid p^{f(i)}_{ij} > 0\}$;

iii) $f(i) \in A_k(i,f)$;

iv) $\psi_{k+1,f}(i) \geq 0$.

Proof. i) From (3.11;f) we have

$$\max_g P_{\tau,g}\, c_{-1,f} \geq P_{\tau,f}\, c_{-1,f} = c_{-1,f},$$

hence $\psi_{-1,f} \geq 0$.

ii)-iv) we prove by induction.

$k = -1$. Assume $\psi_{-1,f}(i) = 0$. Then from (4.2) with $a = f(i)$

(5.1)  $\displaystyle\sum_{j \in T_i} p^{f(i)}_{ij}\, c_{-1,f}(j) + \sum_{j \notin T_i} p^{f(i)}_{ij}\,(c_{-1,f}+\psi_{-1,f})(j) \leq (c_{-1,f}+\psi_{-1,f})(i) = c_{-1,f}(i).$

Also, from (3.11;f), (3.2) and (3.7),

(5.2)  $\displaystyle\sum_{j \in T_i} p^{f(i)}_{ij}\, c_{-1,f}(j) + \sum_{j \notin T_i} p^{f(i)}_{ij}\, c_{-1,f}(j) = c_{-1,f}(i).$

Subtracting (5.2) from (5.1) we get

$$\sum_{j \notin T_i} p^{f(i)}_{ij}\, \psi_{-1,f}(j) \leq 0,$$

which together with $\psi_{-1,f} \geq 0$ yields

$$\psi_{-1,f}(j) = 0 \qquad\text{for all } j \in V(i,f),$$

and (5.1) [(4.2) with $a = f(i)$] holds with equality, so $f(i) \in A_{-1}(i,f)$.

Next, let $W_{-1}(f) := \{j \mid \psi_{-1,f}(j) = 0\}$; then $W_{-1}(f)$ is closed under $\bar P_f$. Further $f(i) \in A_{-1}(i,f)$ for all $i \in W_{-1}(f)$. Now let $\tilde f$ be any policy with $\tilde f(i) = f(i)$ on $W_{-1}(f)$ and $\tilde f(i) \in A_{-1}(i,f)$ elsewhere; then $\tilde f \in G_{-1}(f)$. So if the system starts in $i \in W_{-1}(f)$ and we use policy $\tilde f$, then the system will not leave $W_{-1}(f)$ before $\tau$, and therefore it uses only actions from $f$. So

$$\psi_{0,f}(i) \geq \psi_{0,\tilde f,f}(i) = \psi_{0,f,f}(i) = 0,$$

which completes the proof for $k = -1$.

Let $W_k(f) := \{j \mid \psi_{-1,f}(j) = \cdots = \psi_{k,f}(j) = 0\}$; then we have from the induction assumption $f(i) \in A_{k-1}(i,f)$ for $i \in W_{k-1}(f)$ and $\psi_{k,f} \geq 0$ on $W_{k-1}(f)$. Assume $\psi_{-1,f}(i) = \cdots = \psi_{k,f}(i) = 0$; then $f(i) \in A_{k-1}(i,f)$, so (4.13) holds for $a = f(i)$ ($k \geq 1$):

(5.3)  $-\displaystyle\sum_{j \notin T_i} p^{f(i)}_{ij}\,(c_{k-1,f}+\psi_{k-1,f})(j) + \sum_{j \in T_i} p^{f(i)}_{ij}\,(c_{k,f}-c_{k-1,f})(j) + \sum_{j \notin T_i} p^{f(i)}_{ij}\,(c_{k,f}+\psi_{k,f})(j) \leq (c_{k,f}+\psi_{k,f})(i) = c_{k,f}(i).$

Also, from (3.13;f), (3.2) and (3.7),

(5.4)  $-\displaystyle\sum_{j \notin T_i} p^{f(i)}_{ij}\, c_{k-1,f}(j) + \sum_{j \in T_i} p^{f(i)}_{ij}\,(c_{k,f}-c_{k-1,f})(j) + \sum_{j \notin T_i} p^{f(i)}_{ij}\, c_{k,f}(j) = c_{k,f}(i).$

If we subtract (5.4) from (5.3) we get

(5.5)  $-\displaystyle\sum_{j \notin T_i} p^{f(i)}_{ij}\, \psi_{k-1,f}(j) + \sum_{j \notin T_i} p^{f(i)}_{ij}\, \psi_{k,f}(j) \leq 0.$

(For $k = 0$, (5.3) and (5.4) look different, but after the subtraction we again get (5.5).)

By the induction assumption the first term in (5.5) disappears, so

$$\sum_{j \notin T_i} p^{f(i)}_{ij}\, \psi_{k,f}(j) \leq 0.$$

But $\psi_{k,f} \geq 0$ on $W_{k-1}(f)$, so also on $V(i,f)$. Hence $\psi_{k,f}(j) = 0$ for all $j \in V(i,f)$. As a result (5.3) holds with equality, so $f(i) \in A_k(i,f)$. Finally, the same reasoning as before gives $\psi_{k+1,f}(i) \geq 0$.  □

Now we return to the proof of the theorem.

Proof of Theorem 5.1. Define $Z_k := \{i \in S \mid \Delta c_{-1,g,f}(i) = \cdots = \Delta c_{k,g,f}(i) = 0\}$, $k = -1,0,\ldots$. We will prove by induction that

$\Delta c_{-1,g,f} \geq 0$, and $\Delta c_{k,g,f}(i) \geq 0$ if $i \in Z_{k-1}$, $k = 0,1,\ldots,n$;

$\psi_{k,f} = 0$ on $Z_k$, and $Z_k$ is closed under $P_g$, $k = -1,0,\ldots,n$.

From $g \in G_n(f)$ we have $\psi_{k,g,f} = \psi_{k,f}$, $k = -1,\ldots,n$. So from (3.14)-(3.16)

(5.6)  $P_{\tau,g}\,\Delta c_{-1,g,f} = \Delta c_{-1,g,f} - \psi_{-1,f},$

(5.7)  $P_{\tau,g}\,\Delta c_{0,g,f} - Q_{\tau,g} P_{\tau,g}\,\Delta c_{-1,g,f} = \Delta c_{0,g,f} - \psi_{0,f},$

(5.8)  $-R_{\tau,g}\,(\Delta c_{k-1,g,f} - \psi_{k-1,f}) + P_{\tau,g}\,(\Delta c_{k,g,f} - \Delta c_{k-1,g,f}) = \Delta c_{k,g,f} - \psi_{k,f},\qquad k \geq 1.$

$k = -1$: In [6] we used (5.6) and (5.7) to prove $\Delta c_{-1,g,f} \geq 0$. Assume $\Delta c_{-1,g,f}(i) = 0$; then we have from (5.6), $\psi_{-1,f} \geq 0$ and $P_{\tau,g}\,\Delta c_{-1,g,f} \geq 0$ that $\psi_{-1,f}(i) = 0$ and

$$0 = \Delta c_{-1,g,f}(i) = \sum_{j \in T_i} p^{g(i)}_{ij}\,\Delta c_{-1,g,f}(j) + \sum_{j \notin T_i} p^{g(i)}_{ij}\,\bigl(\Delta c_{-1,g,f} - \psi_{-1,f}\bigr)(j).$$

From $\psi_{-1,f}(i) = 0$ and lemma 5.2(ii) also $\sum_j p^{g(i)}_{ij}\,\psi_{-1,f}(j) = 0$, which gives

$$\sum_{j \in S} p^{g(i)}_{ij}\,\Delta c_{-1,g,f}(j) = 0.$$

So with $\Delta c_{-1,g,f} \geq 0$ we get $\Delta c_{-1,g,f}(j) = 0$ for all $j \in W(i,g) := \{j \mid p^{g(i)}_{ij} > 0\}$, and the set $Z_{-1}$ is closed under $P_g$.

$-1 < k < n$: From the induction assumption and (5.7) and (5.8) we get on $Z_{k-1}$ the following two equations:

(5.9)  $P_{\tau,g}\,\Delta c_{k,g,f} = \Delta c_{k,g,f} - \psi_{k,f},$

(5.10)  $-R_{\tau,g}\,(\Delta c_{k,g,f} - \psi_{k,f}) + P_{\tau,g}\,(\Delta c_{k+1,g,f} - \Delta c_{k,g,f}) = \Delta c_{k+1,g,f} - \psi_{k+1,f}.$

With (5.9) and $I + R_{\tau,g} = Q_{\tau,g}$ we may rewrite (5.10) as

(5.11)  $P_{\tau,g}\,\Delta c_{k+1,g,f} - Q_{\tau,g} P_{\tau,g}\,\Delta c_{k,g,f} = \Delta c_{k+1,g,f} - \psi_{k+1,f}.$

Now (5.9) and (5.11) have the same form as (5.6)-(5.7), so in exactly the same way as there (cf. [6]) we obtain that $\Delta c_{k,g,f} \geq 0$ on $Z_{k-1}$, $\psi_{k,f} = 0$ on $Z_k$, and $Z_k$ is closed under $P_g$.

$k = n$: In this case we only have one equation on $Z_{k-1}$, viz.

(5.12)  $P_{\tau,g}\,\Delta c_{k,g,f} = \Delta c_{k,g,f} - \psi_{k,f}.$

But we also have $g(i) = f(i)$ if $f(i) \in A_k(i,f)$. The situation is identical to the case $n = k = 0$ in [6]. If we multiply (5.12) with

$$P^*_{\tau,g} := \lim_{N'\to\infty}\frac{1}{N'+1}\sum_{m=0}^{N'} P^m_{\tau,g}$$

on $Z_{k-1}$ ($Z_{k-1}$ is also closed under $P^*_{\tau,g}$), then we get $P^*_{\tau,g}\,\psi_{k,f} = 0$, hence $\psi_{k,f} = 0$ on the set $B_{\tau,k-1}$ of recurrent states in $Z_{k-1}$ under $P_{\tau,g}$, and therefore $f(i) \in A_k(i,f)$, $i \in B_{\tau,k-1}$ (lemma 5.2). We need however $\psi_{k,f} = 0$ on $B_{k-1}$, the set of recurrent states of $Z_{k-1}$ under $P_g$. Let $i \in B_{\tau,k-1}$ with $\psi_{k,f}(i) = 0$; then $\psi_{k,f}(j) = 0$ for all $j \in V(i,f)$. But also $\psi_{k,f}(j) = 0$ if $j \in T_i$ and $p^{f(i)}_{ij} > 0$, since then $j \in B_{\tau,k-1}$. So if $\psi_{k,f}(i) = 0$ then $\psi_{k,f}(j) = 0$ and $f(j) \in A_k(j,f)$ for all $j \in W(i,g)$. Therefore the set of states with $\psi_{k,f}(i) = 0$ is closed under $P_g$; it contains $B_{\tau,k-1}$, hence also $B_{k-1}$. As a result $f = g$ on $B_{k-1}$, and since $B_{k-1}$ is closed under $P_g$, also $c_{k,g} = c_{k,f}$, i.e. $\Delta c_{k,g,f} = 0$ on $B_{k-1}$. The equivalent of lemma 5.4 in [6] then gives

$$\min_{i \in Z_{k-1}} \Delta c_{k,g,f}(i) = 0,$$

hence $\Delta c_{k,g,f} \geq 0$ on $Z_{k-1}$.

Finally, if $\Delta c_{-1,g,f} = \cdots = \Delta c_{n,g,f} = 0$ then (by induction) $\psi_{-1,f} = \cdots = \psi_{n,f} = 0$, and by lemma 5.2(iii) $f(i) \in A_n(i,f)$ for all $i \in S$, hence $g = f$.  □

6. The convergence to an $(n-1)$-discount optimal policy. In the previous section we showed that the policy improvement step gives a policy $g \in G_n(f)$ with $g \overset{n}{\geq} f$, and $g \overset{n}{=} f$ only if $g = f$. As there are only finitely many policies, the policy iteration algorithm terminates with a policy, $f$ say, with $f \in G_n(f)$. Now we prove

Theorem 6.1. If $f \in G_n(f)$, then $f$ is $(n-1)$-discount optimal, i.e. $f \overset{n-1}{\geq} g$ for all $g$.

The way we prove this is similar to the approach of section 5. First we need the following equivalent of lemma 5.2.

Lemma 6.2. Let $g$ be an arbitrary policy and $f \in G_n(f)$. Then

i) $\psi_{-1,g,f} \leq 0$,

and if $\psi_{-1,g,f}(i) = \cdots = \psi_{k,g,f}(i) = 0$ then

ii) $\psi_{-1,g,f}(j) = \cdots = \psi_{k,g,f}(j) = 0$ for all $j \in V(i,g)$, $k \leq n$;

iii) $g(i) \in A_k(i,f)$, $k \leq n$;

iv) $\psi_{k+1,g,f}(i) \leq 0$, $k \leq n-1$.

Proof. The proof is similar to the proof of lemma 5.2.

i) $\psi_{-1,g,f} \leq \psi_{-1,f} = 0$ (since $f \in G_{-1}(f)$).

ii)-iv) we show again by induction.

$k = -1$: If we subtract (4.2) with $a = g(i)$ from (3.8) and substitute $\psi_{-1,f} = 0$, we get

$$\psi_{-1,g,f}(i) - \sum_{j \notin T_i} p^{g(i)}_{ij}\,\psi_{-1,g,f}(j) \leq 0.$$

So if $\psi_{-1,g,f}(i) = 0$ then, from $\psi_{-1,g,f} \leq 0$, also $\psi_{-1,g,f}(j) = 0$ for $j \in V(i,g)$. As a result $g(i)$ satisfies (4.2) (with $\psi_{-1,f} = 0$) with equality, so $g(i) \in A_{-1}(i,f)$. And let $\tilde g$ be an arbitrary policy in $G_{-1}(f)$ with $\tilde g(i) = g(i)$ if $g(i) \in A_{-1}(i,f)$; then for $i$ with $\psi_{-1,g,f}(i) = 0$ the system, starting in $i$ and using $g$, uses only actions of $\tilde g$ before $\tau$, so

$$\psi_{0,g,f}(i) = \psi_{0,\tilde g,f}(i) \leq \psi_{0,f}(i) = 0,$$

which completes the proof for $k = -1$.

The case $k \geq 0$ is completely analogous to the case $k \geq 0$ in lemma 5.2.  □

Proof of Theorem 6.1. The reasoning is almost identical to the one in theorem 5.1; we only give a brief outline. First we have from (3.14), (3.15) and $\psi_{-1,g,f} \leq 0$ that $\Delta c_{-1,g,f} \leq 0$. If $\Delta c_{-1,g,f}(i) = 0$, then - from (3.14) - also $\psi_{-1,g,f}(i) = 0$, and - by lemma 6.2 - $\psi_{-1,g,f}(j) = 0$ for all $j \in V(i,g)$. And again the set $\{i \in S \mid \Delta c_{-1,g,f}(i) = 0\}$ is closed under $P_g$. For $0 \leq k \leq n-1$ the reasoning is similar to the reasoning in theorem 5.1. We miss, however, the condition $g(i) = f(i)$ if $f(i) \in A_n(i,f)$; therefore we can only prove $(n-1)$- and not $n$-discount optimality.  □

7. $\infty$-discount optimality. In this section we prove the result which corresponds to theorem 4 in Miller and Veinott [3]. I.e., we show that a policy $f$ obtained from the modified policy iteration algorithm with $n = N$, the number of states in $S$ [$f \in G_N(f)$], is not only $(N-1)$-discount optimal but even $\infty$-discount optimal.

In order to do this we first copy the result of Miller and Veinott for the case $\tau \equiv 1$, i.e. the standard successive approximation step in (2.1). (We have to do this because we expand in $(1-\beta)$, whereas Miller and Veinott [3] used the expansion in $\rho$, $\beta = (1+\rho)^{-1}$.) Then we have

(7.1)  $r_g + \beta P_g v_{\beta,f} - v_{\beta,f} = \displaystyle\sum_{n=-1}^{\infty} y_{n,g,f}\,(1-\beta)^n,$

with

(7.2)  $y_{-1,g,f} = P_g c_{-1,f} - c_{-1,f},\qquad y_{0,g,f} = r_g + P_g(c_{0,f} - c_{-1,f}) - c_{0,f},\qquad y_{n,g,f} = P_g(c_{n,f} - c_{n-1,f}) - c_{n,f},\quad n = 1,2,\ldots.$

Of course $y_{n,f,f} = 0$, so

(7.3)  $c_{n,f} = P_f\,(c_{n,f} - c_{n-1,f}),\qquad n = 1,2,\ldots.$

From (7.2) and (7.3) we get for $n \geq 1$

(7.4)  $y_{n,g,f} = (P_g - P_f)\,(c_{n,f} - c_{n-1,f}).$

Further we have from (1.4), with $B = P_f - P^*_f$,

(7.5)  $c_{n,f} - c_{n-1,f} = (-1)^{n-1}\bigl[B(I-B)^{-1}\bigr]^{n-1}\bigl[-B(I-B)^{-1} - I\bigr](I-B)^{-1} r_f = -(-1)^{n-1}\bigl[B(I-B)^{-1}\bigr]^{n-1}(I-B)^{-2} r_f .$

And we get the following variant of theorem 4 in Miller and Veinott [3].

Lemma 7.1. If $y_{n,g,f} = 0$, $n = -1,\ldots,N$, then $y_{n,g,f} = 0$ for all $n$.

Proof. Substitute (7.5) in (7.4) and use lemma 4 in Miller and Veinott [3] with $x = (I - P_f + P^*_f)^{-2} r_f$, $M = -(P_f - P^*_f)(I - P_f + P^*_f)^{-1}$ and $L$ the null space of $P_g - P_f$.  □

Note that the result of this lemma can also be obtained directly from theorem 4 in Miller and Veinott, using $\lim_{\beta\uparrow 1}(1-\beta)/\rho = 1$.

For an arbitrary stopping time $\tau$ we have with lemma 1.1

(7.6)  $r_{\beta,\tau,g} + P_{\beta,\tau,g}\, v_{\beta,f} - v_{\beta,f} = (I - \beta\bar P_g)^{-1}\bigl(r_g + \beta\tilde P_g v_{\beta,f}\bigr) - v_{\beta,f} = (I - \beta\bar P_g)^{-1}\bigl(r_g + \beta P_g v_{\beta,f} - v_{\beta,f}\bigr),$

from which we obtain

Lemma 7.2. $y_{-1,g,f} = \cdots = y_{k,g,f} = 0$ if and only if $\psi_{-1,g,f} = \cdots = \psi_{k,g,f} = 0$, $k = -1,0,\ldots$.

Proof. The 'if' part follows by induction; the 'only if' part is immediate from (7.6).  □

Now we are able to prove

Theorem 7. If $f \in G_N(f)$ then $f \overset{n}{\geq} g$ for all $n$ and all $g$.

Proof. Suppose we have a policy $g$ with $g \overset{n}{\geq} f$ for some $n > N-1$; then clearly $g \overset{N-1}{\geq} f$ and $g \overset{N-1}{=} f$. From lemma 6.2 we have $\Delta c_{k,g,f} = 0$ and $\psi_{k,g,f} = 0$ for $k = -1,0,\ldots,N-1$, and $\psi_{N,g,f} \leq 0$; otherwise $g$ would be an improvement of $f$. Further (3.16) with $k = N$ reduces to

(7.7)  $P_{\tau,g}\,\Delta c_{N,g,f} = \Delta c_{N,g,f} - \psi_{N,g,f}.$

And if we multiply this with $P^*_{\tau,g}$ then we get

$$\psi_{N,g,f} = 0\qquad\text{on } \mathrm{Rec}(\tau,g),$$

where $\mathrm{Rec}(\tau,g)$ is the set of recurrent states under $P_{\tau,g}$. And with lemma 6.2(ii) also $\psi_{N,g,f}(j) = 0$ if $j \in V(i,g)$ for some $i \in \mathrm{Rec}(\tau,g)$. So even

$$\psi_{N,g,f} = 0\qquad\text{on } \mathrm{Rec}(g),$$

with $\mathrm{Rec}(g)$ the set of recurrent states under $P_g$ (cf. the proof of theorem 5.1).

So on $\mathrm{Rec}(g)$ we have $\psi_{-1,g,f} = \cdots = \psi_{N,g,f} = 0$. Hence by lemma 7.2 also $y_{-1,g,f} = \cdots = y_{N,g,f} = 0$, and by lemma 7.1 $y_{n,g,f} = 0$ for all $n$. Thus $v_{\beta,g} = v_{\beta,f}$ and especially $\Delta c_{N,g,f} = 0$ on $\mathrm{Rec}(g)$. From (7.7) we have, with $\psi_{N,g,f} \leq 0$,

$$P_{\tau,g}\,\Delta c_{N,g,f} \geq \Delta c_{N,g,f},$$

so $V := \{i \mid \Delta c_{N,g,f}(i) = \max_j \Delta c_{N,g,f}(j)\}$ is closed under $P_{\tau,g}$. Hence on $V$ we have $\Delta c_{N,g,f} = 0$, and therefore $\Delta c_{N,g,f} \leq 0$ on $S$. But we assumed $g \overset{n}{\geq} f$ for some $n > N-1$; hence, with $\Delta c_{-1,g,f} = \cdots = \Delta c_{N-1,g,f} = 0$, we must have $\Delta c_{N,g,f} = 0$ on $S$. But then $\psi_{N,g,f} = 0$ on $S$, and by lemma 7.2 also $y_{N,g,f} = 0$. So by lemma 7.1 we have $y_{n,g,f} = 0$ for all $n$ and also $\Delta c_{n,g,f} = 0$ for all $n$, i.e. $g \overset{n}{=} f$ for all $n$. Summarizing, we have shown that if $g \overset{n}{\geq} f$ for some $n \geq N$ then $g \overset{\infty}{=} f$. Hence $f \overset{n}{\geq} g$ for all $n$ and all $g$.  □

8. References

[1] Blackwell, D., Discrete dynamic programming, Ann. Math. Statist. 33 (1962), 719-726.

[2] Howard, R.A., Dynamic Programming and Markov Processes, MIT Press, Cambridge (Mass.), 1960.

[3] Miller, B.L. and Veinott, A.F., Discrete dynamic programming with a small interest rate, Ann. Math. Statist. 40 (1969), 366-370.

[4] Veinott, A.F., On finding optimal policies in discrete dynamic programming with no discounting, Ann. Math. Statist. 37 (1966), 1284-1294.

[5] Veinott, A.F., Discrete dynamic programming with sensitive discount optimality criteria, Ann. Math. Statist. 40 (1969), 1635-1660.

[6] Wal, J. van der, A stopping time-based policy iteration algorithm for average reward Markov decision processes, Memorandum COSOR 78-11, Eindhoven University of Technology, 1978.

[7] Wessels, J., Stopping times and Markov programming, Transactions of the 7th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, Prague, 1977, Vol. A, 575-585.
