
On uniformly nearly optimal stationary strategies

Citation for published version (APA):

Wal, van der, J. (1981). On uniformly nearly optimal stationary strategies. (Memorandum COSOR; Vol. 8111). Technische Hogeschool Eindhoven.

Document status and date: Published: 01/01/1981. Document version: Publisher's PDF, also known as Version of Record (includes final page, issue and volume numbers).


STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 81-11

On uniformly nearly optimal stationary strategies

by

Jan van der Wal

Eindhoven, The Netherlands
August 1981


Abstract

For Markov decision processes with countable state space and nonnegative immediate rewards, Ornstein proved the existence of a stationary strategy f which is uniformly nearly optimal in the following multiplicative sense: v(f) ≥ (1 - ε)v*. Strauch proved that if the immediate rewards are nonpositive and the action space is finite, then a uniformly optimal stationary strategy exists. This paper connects these partial results and proves the following theorem for Markov decision processes with countable state space and arbitrary action space: if in each state where the value is nonpositive a conserving action exists, then there is a stationary strategy f satisfying v(f) ≥ v* - εu*, where u* is the value of the problem if only the positive rewards are counted.

1. Introduction

Consider a Markov decision process (MDP) with countable state space S and arbitrary action space A, endowed with a σ-field A containing all one-point sets. If in state i ∈ S action a ∈ A is taken, two things happen: a (possibly negative) immediate reward r(i,a) is earned and the system moves to a new state j, j ∈ S, with probability p(i,a,j), where Σ_j p(i,a,j) = 1. The functions r(i,a) and p(i,a,j) are assumed to be A-measurable in a.
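For readers who want to experiment, here is a minimal finite sketch of this model in Python (the finiteness, the array layout and all names are assumptions of this illustration; the paper itself works with a countable S and an arbitrary measurable action space A):

```python
import numpy as np

# Hypothetical toy MDP with 3 states and 2 actions:
# r[i, a]    = immediate reward r(i, a), possibly negative,
# p[i, a, j] = transition probability p(i, a, j), with sum_j p(i, a, j) = 1.
r = np.array([[ 1.0, -2.0],
              [ 0.5,  0.0],
              [-1.0,  3.0]])
p = np.array([[[0.2, 0.8, 0.0], [0.0, 0.5, 0.5]],
              [[1.0, 0.0, 0.0], [0.3, 0.3, 0.4]],
              [[0.0, 0.1, 0.9], [0.6, 0.2, 0.2]]])

assert np.allclose(p.sum(axis=2), 1.0)  # every (i, a) row is a probability vector
```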

One may distinguish three sets of strategies, namely the set Π of all (possibly randomized and history-dependent) strategies satisfying the usual measurability conditions, the set M of all nonrandomized Markov strategies and the set F of all nonrandomized stationary strategies. So F ⊂ M ⊂ Π. The elements of F will be called policies and are treated as functions on S.

For each strategy π ∈ Π and each initial state i ∈ S we define in the usual way a probability measure P_{i,π} on (S × A)^∞ and a stochastic process {(X_n, A_n), n = 0,1,2,...}, where X_n denotes the state of the system at time n and A_n the action chosen at time n. Expectations with respect to P_{i,π} will be denoted by E_{i,π}.

Now we can define the total expected reward when the process starts in i ∈ S and strategy π ∈ Π is used:

(1.1)   v(i,π) := E_{i,π} Σ_{n=0}^{∞} r(X_n, A_n) ,

whenever the expectation at the right hand side is well-defined. To guarantee this the following assumption will be made.

Condition 1: For all i ∈ S and all π ∈ Π

(1.2)   u(i,π) := E_{i,π} Σ_{n=0}^{∞} r⁺(X_n, A_n) < ∞ ,

where r⁺(i,a) := max{0, r(i,a)}.

This condition allows us to interchange expectation and summation in (1.1) and implies

(1.3)   lim_{n→∞} v_n(i,π) = v(i,π) ,

where

(1.4)   v_n(i,π) := E_{i,π} Σ_{k=0}^{n-1} r(X_k, A_k) .

The value of the total reward MDP is defined by

(1.5)   v*(i) := sup_{π∈Π} v(i,π) .

Further we define the value of the MDP where the negative rewards are neglected by u*, so

(1.6)   u*(i) := sup_{π∈Π} u(i,π) .

If the argument i corresponding to the state is deleted, the function on S is meant. So for example v(π) and v* are the functions with i-th coordinate v(i,π) and v*(i) respectively. Often these functions will be treated as column vectors. It is useful to have the following notations. Let f be any policy; then the immediate reward function r(f) and transition probability function P(f), which will be treated as a column vector and a matrix respectively, are defined by

(1.7)   r(f)(i) := r(i,f(i)) ,   i ∈ S ,

(1.8)   P(f)(i,j) := p(i,f(i),j) ,   i,j ∈ S .

On suitable subsets of functions on S we define the operators L(f) and U by

(1.9)   L(f)v = r(f) + P(f)v ,

(1.10)  Uv = sup_{f∈F} L(f)v .
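As an illustration of (1.7)-(1.10), a small numerical sketch in terms of the arrays introduced above (the NumPy representation and the function names are assumptions of the example, and finiteness of S and A is assumed so that the supremum becomes a maximum):

```python
import numpy as np

def r_f(r, f):
    """Reward vector r(f)(i) = r(i, f(i)), cf. (1.7)."""
    return r[np.arange(len(f)), f]

def P_f(p, f):
    """Transition matrix P(f)(i, j) = p(i, f(i), j), cf. (1.8)."""
    return p[np.arange(len(f)), f, :]

def L(r, p, f, v):
    """Operator L(f)v = r(f) + P(f)v, cf. (1.9)."""
    return r_f(r, f) + P_f(p, f) @ v

def U(r, p, v):
    """Operator Uv = sup_f L(f)v, taken actionwise, cf. (1.10)."""
    return np.max(r + p @ v, axis=1)
```

In this finite sketch, applying U repeatedly to the zero vector gives the optimal n-stage values sup_π v_n(π).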

Van Hee [1978] has shown that Condition 1 implies u* < ∞ (i.e. u*(i) < ∞ for all i ∈ S), whence also v* < ∞, and further that

(1.11)  v*(i) = sup_{π∈M} v(i,π) .

Ornstein [1969] proved that if all rewards are nonnegative, then for each ε > 0 a stationary strategy f exists satisfying

(1.12)  v(f) ≥ (1 - ε)v* .

Strauch [1966] showed that if r(i,a) ≤ 0 for all i ∈ S, a ∈ A and if A is finite, then an optimal stationary strategy exists, i.e. an f ∈ F with

(1.13)  v(f) = v* .

This result has been generalized by Hordijk [1974, Theorem 6.3.c]: if v* ≤ 0 and if f is a conserving policy, i.e.

(1.14)  r(f) + P(f)v* = v* ,

then v(f) = v* .

Van der Wal [1981, Theorem 2.22] proved that if A is finite, then

(1.15)  v*(i) = sup_{f∈F} v(i,f) ,

where the condition that A is finite can be weakened to any other condition guaranteeing the existence of an optimal stationary strategy for the MDP with rewards r⁻(i,a), where r⁻(i,a) := min{0, r(i,a)}.

The purpose of this paper is to connect the partial results mentioned above. Our main result is given in the following theorem.

Theorem 1.1. If in each state i ∈ S for which v*(i) ≤ 0 a conserving action exists (cf. (1.14)), then for each ε > 0 a stationary strategy f exists satisfying

(1.16)  v(f) ≥ v* - εu* .

So in the positive and the negative dynamic programming cases this theorem yields the results of Ornstein, Strauch and Hordijk. And in the case of both positive and negative rewards it generalizes the result of Van der Wal to uniform near-optimality.

The organisation of the paper is as follows. The next section gives a brief outline of the proof of Theorem 1.1. Then in the following sections various parts of the proof will be worked out.


2. Outline of the proof

In this section the proof of Theorem 1.1 will be sketched. Therefore we first split up the state space S into the subsets S⁻ := {i ∈ S | v*(i) ≤ 0} and S⁺ := {i ∈ S | v*(i) > 0}. In the first part of the proof (Section 3) in each state i ∈ S⁻ an arbitrary conserving action is fixed and it is shown that the value of this restricted MDP is the same as the value of the original MDP. As a result of this one can embed, having fixed conserving actions on S⁻, the MDP on S⁺, thus obtaining an MDP with strictly positive value function.

The next step in the proof (Section 4) is to consider an MDP with v* > 0 but positive as well as negative immediate rewards. For such an MDP we construct a modified MDP with immediate rewards r̃(i,a) and transition probabilities p̃(i,a,j) as follows:

(2.1)
  (i)   if r(i,a) ≥ 0, then r̃(i,a) := r(i,a) and p̃(i,a,j) := p(i,a,j) ;
  (ii)  if r(i,a) < 0 and r(i,a) + Σ_j p(i,a,j)v*(j) ≥ 0, then

        r̃(i,a) := 0   and   p̃(i,a,j) := [ r(i,a) + Σ_k p(i,a,k)v*(k) ] / [ Σ_k p(i,a,k)v*(k) ] · p(i,a,j) ;

  (iii) if r(i,a) < 0 and r(i,a) + Σ_j p(i,a,j)v*(j) < 0, then

        r̃(i,a) := r(i,a) + Σ_j p(i,a,j)v*(j)   and   p̃(i,a,j) := 0 .

So in this modified MDP one "borrows from the future" if the immediate reward is negative, but never more than is needed to pay the immediate loss and also not more than is possible, and the transition probabilities are adapted accordingly.
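A minimal sketch of transformation (2.1) for a finite MDP in the array representation used earlier (the arrays, the function name and the assumption that v* is already available are all assumptions of this illustration, not part of the paper):

```python
import numpy as np

def transform(r, p, v_star):
    """Return the modified rewards and transition probabilities of (2.1)."""
    r_t, p_t = r.astype(float), p.astype(float)
    future = p @ v_star                       # sum_j p(i,a,j) v*(j)
    n_states, n_actions = r.shape
    for i in range(n_states):
        for a in range(n_actions):
            if r[i, a] >= 0:                  # case (i): leave (r, p) unchanged
                continue
            if r[i, a] + future[i, a] >= 0:   # case (ii): pay the loss by scaling p
                r_t[i, a] = 0.0
                p_t[i, a] = (r[i, a] + future[i, a]) / future[i, a] * p[i, a]
            else:                             # case (iii): borrow the whole future
                r_t[i, a] = r[i, a] + future[i, a]
                p_t[i, a] = 0.0
    return r_t, p_t
```

In case (ii) the scaling factor is chosen so that the expected value given up by the extra stopping probability exactly pays for the negative immediate reward, which is the "borrowing from the future" described above.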

It will be shown that this modified MDP has the same value as the original one (with v* > 0) and that in each state i all actions for which r̃(i,a) < 0 (case (iii)) can be eliminated without affecting the value of the MDP. This way we obtain a positive dynamic programming problem. Using Ornstein's result we have for each ε > 0 the existence of a stationary strategy f satisfying

(2.2)   ṽ(f) ≥ (1 - ε)ṽ*

(tildes are used for the modified MDP).

Then, in Section 5, it will be shown that a stationary strategy f satisfying (2.2) also satisfies

(2.3)   v(f) ≥ v* - ε(1 - ε)^{-1} u* .

This proves Theorem 1.1 for positive valued MDP's.

In Section 6 the proof is extended to the case that S⁻ is nonempty.

3. On S⁻ conserving actions are enough

In this section we prove the following result.

Theorem 3.1. Let f be any conserving policy on S⁻, i.e.

    r(f) + P(f)v* = v*   on S⁻ .

Let further Π_f denote the set of all strategies in Π which on S⁻ act according to f. Then

    sup_{π∈Π_f} v(i,π) = v*(i)   for all i ∈ S .

In order that this theorem holds one certainly needs the following lemma. Let the stopping time τ be the time of the first switch from S⁻ to S⁺ or vice versa:

(3.1)   τ(i_0,i_1,...) := inf{n : i_n ∈ S⁺}   if i_0 ∈ S⁻ ,
        τ(i_0,i_1,...) := inf{n : i_n ∈ S⁻}   if i_0 ∈ S⁺ ,

for all i_0,i_1,i_2,... ∈ S, and where inf ∅ := ∞.

(At this point we only use that τ is the first exit time from S⁻; the ampler definition is for later use.)
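For concreteness, a small sketch of this stopping time applied to a realized state trajectory (the list representation and the use of None for τ = ∞ are assumptions of this illustration):

```python
def first_switch_time(states, S_minus):
    """tau of (3.1): index of the first state on the other side of the
    S-/S+ split than the initial state; None stands for tau = infinity."""
    start_in_minus = states[0] in S_minus
    for n, s in enumerate(states):
        if (s in S_minus) != start_in_minus:
            return n
    return None
```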

Lemma 3.2. Let f be any conserving policy on S⁻. Then

(3.2)   E_{i,f} [ Σ_{n=0}^{τ-1} r(X_n,A_n) + v*(X_τ) ] = v*(i)   for all i ∈ S⁻ ,

where v*(X_τ) := 0 if τ = ∞.

(Note that if v*(i) > 0 an equality like (3.2) need not hold, as in the positive dynamic programming case conserving actions need not be optimal.)

Proof. The expression at the left-hand side of (3.2) is clearly equal to the total expected reward for the Markov process on S⁻ with rewards r̄(i,a) and transition probabilities p̄(i,a,j) defined by

(3.3)   r̄(i,a) := r(i,a) + Σ_{j∈S⁺} p(i,a,j)v*(j) ,
        p̄(i,a,j) := p(i,a,j) ,   i,j ∈ S⁻ .

(In order to have the transition probabilities add up to 1 we could add an extra absorbing state where no more returns are obtained; we will not do this explicitly.)

For the MDP with state space S⁻ defined by (3.3) we have, since f is conserving,

    r̄(f) + P̄(f)v* = v* .

Further,

    v̄(i,f) := E_{i,f} [ Σ_{n=0}^{τ-1} r(X_n,A_n) + v*(X_τ) ] ,   i ∈ S⁻ ,

where v̄ denotes total expected rewards in the MDP defined by (3.3).

Clearly v* ≤ 0 (we only consider S⁻) and v̄(f) ≤ v*. But also v̄(f) is the largest nonpositive solution of r̄(f) + P̄(f)v = v (see e.g. Van der Wal [1981, Theorem 2.18]), while by the display above v* is such a solution, hence v̄(f) ≥ v*. So v̄(f) = v*, which proves the lemma.  □

The next step in the proof of Theorem 3.1 is the construction of a strategy π* ∈ Π_f (i.e. using f on S⁻) which is nearly optimal. Let π^(k), k = 1,2,..., be a strategy satisfying

    v(i,π^(k)) ≥ v*(i) - ε2^{-k}   for all i ∈ S⁺ .

(Clearly such a strategy exists within Π but not necessarily within M.) Then also

    E_{i,π^(k)} [ Σ_{n=0}^{τ-1} r(X_n,A_n) + v*(X_τ) ] ≥ v*(i) - ε2^{-k} ,

for all k = 1,2,... and all i ∈ S⁺.

Now let π* be the strategy which on S⁻ acts according to f and on S⁺ uses strategy π^(k) during the k-th stay in S⁺, pretending the process restarts at the time of the k-th entry to S⁺. Then π* satisfies the following lemma.

Lemma 3.3.  v(π*) ≥ v* - εe .

Proof. To prove this we introduce the following notations. Define the functions b_τ(π^(k)) on S⁺ and c_τ(f) on S⁻ by

    b_τ(i,π^(k)) := E_{i,π^(k)} Σ_{n=0}^{τ-1} r(X_n,A_n) ,   i ∈ S⁺ ,

    c_τ(i,f) := E_{i,f} Σ_{n=0}^{τ-1} r(X_n,A_n) ,   i ∈ S⁻ .

Further define the transition probability matrices Q_τ(π^(k)) from S⁺ to S⁻ and R_τ(f) from S⁻ to S⁺ by

    Q_τ(π^(k))(i,j) := P_{i,π^(k)}(τ < ∞, X_τ = j) ,   i ∈ S⁺ , j ∈ S⁻ ,

    R_τ(f)(i,j) := P_{i,f}(τ < ∞, X_τ = j) ,   i ∈ S⁻ , j ∈ S⁺ .

Then we have

(3.4)   b_τ(π^(k)) + Q_τ(π^(k))v* ≥ v* - ε2^{-k}e   on S⁺ ,

where e is the unit function on S⁺, and by Lemma 3.2

(3.5)   c_τ(f) + R_τ(f)v* = v*   on S⁻ .

Now define on S⁺ (suppressing the dependence on π)

    v_0 := 0 ,
    v_1 := b_τ(π^(1)) + Q_τ(π^(1)) c_τ(f) ,
    ...
    v_n := v_{n-1} + P_{τ,n-1} [ b_τ(π^(n)) + Q_τ(π^(n)) c_τ(f) ] ,   n = 1,2,... ,

where the transition (sub)probability matrix P_{τ,n} on S⁺ is defined by

    P_{τ,0} := I ,   P_{τ,n} := P_{τ,n-1} Q_τ(π^(n)) R_τ(f) ,   n = 1,2,... .

Then

    v_n + P_{τ,n} v* ≥ v_{n-1} + P_{τ,n-1} v* - ε2^{-n} e
                     ≥ ... ≥ v* - ε[2^{-1} + ... + 2^{-n}]e > v* - εe .

Clearly v_n converges to v(π*) as n tends to ∞, so if lim sup_{n→∞} P_{τ,n} v* ≤ 0, then v(π*) ≥ v* - εe on S⁺. Now

(3.6)   lim sup_{n→∞} P_{τ,n} b_τ(π^(n+1)) ≤ 0 ,

since τ ≥ 1 and for a fixed strategy the sum of the positive rewards from time n onwards tends to 0 as n tends to ∞. Also

(3.7)   b_τ(π^(n+1)) ≥ v* - ε2^{-(n+1)} e - Q_τ(π^(n+1)) v* ≥ v* - ε2^{-(n+1)} e   (since Q_τ(π^(n+1)) v* ≤ 0) .

So from (3.6) and (3.7) we have on S⁺

(3.8)   lim sup_{n→∞} P_{τ,n} v* ≤ lim sup_{n→∞} P_{τ,n} b_τ(π^(n+1)) ≤ 0 .

It remains to prove that the inequality of the lemma also holds on S⁻. Let i ∈ S⁻; then

(3.9)   v(i,π*) = c_τ(i,f) + Σ_{j∈S⁺} R_τ(f)(i,j) v(j,π*)
                ≥ c_τ(i,f) + Σ_{j∈S⁺} R_τ(f)(i,j) v*(j) - ε
                = v*(i) - ε ,

by (3.5). Together (3.8) and (3.9) complete the proof of the lemma.  □

Since in Lemma 3.3 ε > 0 can be chosen arbitrarily, Lemma 3.3 also proves Theorem 3.1.

Note that as a consequence of Theorem 3.1 we can fix some arbitrary conserving policy on S⁻ and then embed the MDP on S⁺. The embedded MDP on S⁺ then has a strictly positive value function.
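The memorandum does not write the embedded MDP out explicitly. One natural way to make it concrete, using the quantities c_τ(f) and R_τ(f) introduced in this section (this particular formulation is an assumption of this note, not a definition taken from the text), is to give the embedded MDP on S⁺ the data

    r̂(i,a) := r(i,a) + Σ_{k∈S⁻} p(i,a,k) c_τ(k,f) ,
    p̂(i,a,j) := p(i,a,j) + Σ_{k∈S⁻} p(i,a,k) R_τ(f)(k,j) ,   i,j ∈ S⁺ ,

so that one decision epoch of the embedded process corresponds to one action in S⁺ followed by the whole (possibly empty) excursion through S⁻ under the fixed conserving policy f.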

4. Positive valued MDP's

In this section we consider an MDP with strictly positive value: v*(i) > 0 for all i ∈ S. It will be shown that transformation (2.1) yields a new MDP with the same value function and essentially only actions with nonnegative immediate rewards.

All objects in the transformed MDP will be labeled by a tilde.

Lemma 4.1.  ṽ* = v* .

Proof. First it will be shown that ṽ* ≤ v*. As one easily verifies,

    L̃(f)v* = L(f)v* .

Hence, as Uv* = v*, also Ũv* = v*. So

    ṽ* ≤ lim sup_{n→∞} Ũ^n 0 ≤ lim sup_{n→∞} Ũ^n v* = v* ,

where the first inequality can be found in Schäl [1975, Formula (2.5)] and the second follows from 0 ≤ v* together with Ũv* = v*. Hence ṽ* ≤ v*.

It follows from (1.11) that it suffices to prove that for all π ∈ M we have v(π) ≤ ṽ(π).

Let π ∈ M be an arbitrary Markov strategy; then π can be characterized by the policies to be followed at each time: π = (f_0,f_1,...). Let d(f) be the function defined by

    d(f) := v* - L(f)v* ,

so d(f) ≥ 0. Then we have for the n-period reward v_n(π) (see (1.4))

(4.1)   v_n(π) = v* - P(f_0)···P(f_{n-1})v* - Σ_{k=0}^{n-1} P(f_0)···P(f_{k-1}) d(f_k)

(with the empty product equal to the identity). Similarly, since L̃(f)v* = L(f)v*,

(4.2)   ṽ_n(π) = v* - P̃(f_0)···P̃(f_{n-1})v* - Σ_{k=0}^{n-1} P̃(f_0)···P̃(f_{k-1}) d(f_k) .

As d(f) ≥ 0, v* ≥ 0 and P̃(f) ≤ P(f) componentwise, (4.1) and (4.2) give ṽ_n(π) ≥ v_n(π) for all n, whence also v(π) ≤ ṽ(π).

As remarked before this implies v* ≤ ṽ*, which completes the proof of the lemma.  □

Next we want to show that this modified MDP can be reduced to an MDP with nonnegative immediate rewards only.

Lemma 4.2. Let Π⁺ be the set of all strategies which use only actions for which r̃(i,a) ≥ 0. Then

    sup_{π∈Π⁺} ṽ(π) = ṽ* .

I.e., we can eliminate in each state i ∈ S all those actions for which r̃(i,a) < 0 without affecting the value.

Proof. Let π ∈ M be an arbitrary Markov strategy. Let further f be some policy with r̃(f) ≥ 0. (Since v* > 0 such a policy exists.) Now consider the strategy π' which is the following combination of π and f. In words: play π until you first reach a state where π prescribes an action for which the immediate r̃-reward is negative; instead of taking this action you switch to f and you play f for ever after. Clearly ṽ(f) ≥ 0, so this switch yields you a better total expected reward, i.e.

    ṽ(π') ≥ ṽ(π) .

Hence

    sup_{π'∈Π⁺} ṽ(π') ≥ ṽ(π)   for all Markov strategies π ,

so sup_{π∈Π⁺} ṽ(π) = ṽ*, which completes the proof of the lemma.  □

Now in each state all actions yielding negative immediate payoffs can be eliminated, which gives us a positive dynamic programming problem. So by Ornstein's result we obtain the following corollary.

Corollary 4.3. For each ε > 0 a policy f exists such that

(4.3)   ṽ(f) ≥ (1 - ε)ṽ* .

5. Near-optimality for the positive valued MDP

The main result of this section is the following theorem.

Theorem 5.1. Suppose we have an MDP with v* > 0 and let f satisfy ṽ(f) ≥ (1 - ε)v*. Then

(5.1)   v(f) ≥ v* - ε(1 - ε)^{-1} u* .

This theorem extends Ornstein's result and already establishes Theorem 1.1 for a special case.

In order to prove this theorem we need two lemmas. The main tool in our approach is the concept of the so-called stationary randomized and action-dependent go-ahead function as used in Van Nunen and Stidham [1981] and Van der Wal [1981, Chapter 3].

This gives us a different view upon the data transformation (2.1). Let δ be a go-ahead function on S × A with

(5.2)   δ(i,f(i)) := 1   if r(i,f(i)) ≥ 0 ,

        δ(i,f(i)) := [ r(i,f(i)) + Σ_j p(i,f(i),j)v*(j) ] / [ Σ_j p(i,f(i),j)v*(j) ]   if r(i,f(i)) < 0 .

So p̃(i,f(i),j) = δ(i,f(i)) p(i,f(i),j).

Since we are only interested in the Markov process where policy f is used, it suffices here to define δ for the policy f only. The idea of this go-ahead function is that if in a state an action is taken with negative immediate reward, then with probability 1 - δ the process stops after the next transition and with probability δ the process continues. Let τ_δ be the time at which the process is stopped. (A more formal introduction of go-ahead functions and stopping times can be found in Van Nunen and Stidham [1981] and Van der Wal [1981].)
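A small simulation sketch of this stopping mechanism, again for a finite MDP in the array representation used earlier (the function names, the horizon truncation and the random-number handling are assumptions of this illustration):

```python
import numpy as np

def go_ahead(r, p, v_star, f, i):
    """delta(i, f(i)) of (5.2); assumes f never prescribes a case (iii) action,
    i.e. r(i, f(i)) + sum_j p(i, f(i), j) v*(j) >= 0 in every state."""
    a = f[i]
    if r[i, a] >= 0:
        return 1.0
    future = p[i, a] @ v_star              # sum_j p(i, f(i), j) v*(j)
    return (r[i, a] + future) / future

def simulate_stopped(r, p, v_star, f, i0, rng, horizon=1000):
    """One truncated run of the Markov process under policy f with delta-stopping;
    returns the realized reward collected up to the stopping time."""
    total, i = 0.0, i0
    for _ in range(horizon):
        a = f[i]
        total += r[i, a]
        j = rng.choice(len(v_star), p=p[i, a])
        if rng.random() > go_ahead(r, p, v_star, f, i):
            break                          # the process is stopped after this transition
        i = j
    return total

# usage sketch:
# rng = np.random.default_rng(0)
# simulate_stopped(r, p, v_star, f, i0=0, rng=rng)
```

Averaging the returned values over many runs estimates, up to the horizon truncation, the expected reward collected before stopping, which is the quantity the reward functions defined next decompose into positive and negative parts.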

Now define the reward functions r⁺_δ(f) and r⁻_δ(f) by

    r⁺_δ(f)(i) := E_{i,f} Σ_{n=0}^{τ_δ-1} r⁺(X_n,A_n) ,   r⁻_δ(f)(i) := E_{i,f} Σ_{n=0}^{τ_δ-1} r⁻(X_n,A_n) ,   i ∈ S ,

and the operator L_δ(f) by

    L_δ(f)v := r⁺_δ(f) + r⁻_δ(f) + P_δ(f)v ,

where P_δ(f) is the transition probability matrix with entries P_δ(f)(i,j) := P_{i,f}(τ_δ < ∞, X_{τ_δ} = j). Then we have

(5.3)   L_δ(f)v* = r⁺_δ(f) = ṽ(f) .

(Since stopping occurs to cover up the negative immediate rewards one has r⁻_δ(f) + P_δ(f)v* = 0.)

From (5.3) and f being a policy satisfying ṽ(f) ≥ (1 - ε)v* we also have

(5.4)   L_δ(f)v* ≥ (1 - ε)v* .

Further, since τ_δ ≥ 1 (cf. the proof of Lemma 4.24 in Van der Wal [1981]),

(5.5)   v(f) = lim_{n→∞} L_δ^n(f) 0 .

Hence

(5.6)   L_δ^n(f)0 ≥ L_δ^{n-1}(f)((1 - ε)v*) - P_δ^n(f)v* ≥ ... ≥ v* - ε Σ_{k=0}^{n-1} P_δ^k(f)v* - P_δ^n(f)v* .

From (5.5) and (5.6) one sees that in order to prove (5.1) it suffices to prove that

(5.7)   Σ_{n=0}^{∞} P_δ^n(f)v* ≤ (1 - ε)^{-1} u* ,

as then clearly P_δ^n(f)v* → 0 (n → ∞).

To prove this we need the following lemma.

Lemma 5.2.  Σ_{n=0}^{∞} P_δ^n(f) r⁺_δ(f) ≤ u(f) .

Proof. Immediate from τ_δ ≥ 1 (cf. also (5.5)).  □

From this we obtain almost immediately

Lemma 5.3.  Σ_{n=0}^{∞} P_δ^n(f) v* ≤ (1 - ε)^{-1} u* ,  i.e. (5.7) holds.

Proof. u(f) ≤ u* and by (5.3) and (5.4) also r⁺_δ(f) ≥ (1 - ε)v*. So

    (1 - ε) Σ_{n=0}^{∞} P_δ^n(f) v* ≤ Σ_{n=0}^{∞} P_δ^n(f) r⁺_δ(f) ≤ u(f) ≤ u* ,

which proves the lemma and with it Theorem 5.1.  □

6. Proof of Theorem 1.1

Finally we show how the proof of Theorem 5.1 can be extended to Theorem 1.1. So we start with some arbitrary countable state MDP for which in all states with v*(i) ≤ 0 a conserving action exists. Now fix a conserving action in each state in S⁻. As shown in Section 3 this does not affect the value of the MDP. Next we embed the MDP on S⁺, thus obtaining an MDP with strictly positive value function. For this MDP apply Theorem 5.1. Further it is clear that the u* function of the original MDP (on S⁺) is larger than or equal to the u* of the embedded MDP. So a policy which is conserving on S⁻ and satisfies (5.1) for the embedded MDP also satisfies (5.1) for the original MDP. This completes the argument for S⁺.

Now consider S⁻. Let f be some policy which is conserving on S⁻ and satisfies (1.16) on S⁺. Then for any i ∈ S⁻

    v(i,f) = E_{i,f} [ Σ_{n=0}^{τ-1} r(X_n,A_n) + v(X_τ,f) ]
           ≥ E_{i,f} [ Σ_{n=0}^{τ-1} r(X_n,A_n) + v*(X_τ) ] - ε E_{i,f} u*(X_τ)
           ≥ v*(i) - ε u*(i) ,

since E_{i,f} u*(X_τ) ≤ u*(i) and, by Lemma 3.2, E_{i,f} [ Σ_{n=0}^{τ-1} r(X_n,A_n) + v*(X_τ) ] = v*(i). This completes the proof of Theorem 1.1.  □

References

Hee, K.M. van (1978), Markov strategies in dynamic programming, Math. Oper. Res. 3, 37-41.

Hordijk, A. (1974), Dynamic programming and Markov potential theory, Math. Centre Tract 51, Mathematisch Centrum, Amsterdam.

Nunen, J. van and S. Stidham (1981), Action-dependent stopping times and Markov decision processes with unbounded rewards, OR Spectrum.

Ornstein, D. (1969), On the existence of stationary optimal strategies, Proc. Amer. Math. Soc. 20, 563-569.

Schäl, M. (1975), Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal, Z. Wahrscheinlichkeitstheorie verw. Gebiete 32, 179-196.

Strauch, R. (1966), Negative dynamic programming, Ann. Math. Statist. 37, 871-889.

Wal, J. van der (1981), Stochastic dynamic programming, Math. Centre Tract 139, Mathematisch Centrum, Amsterdam.
