Markov decision processes with unknown transition law: the average return case

Citation for published version (APA):
Van Hee, K.M. (1979). Markov decision processes with unknown transition law: the average return case. (Memorandum COSOR; Vol. 7901). Technische Hogeschool Eindhoven.

Published: 01/01/1979
Department of Mathematics
PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 79-01

Markov decision processes with unknown transition law; the average return case

by

Kees M. van Hee

Eindhoven, January 1979
The Netherlands
0. ABSTRACT

In this paper we consider some problems and results in the area of Markov decision processes with an incompletely known transition law. We concentrate here on the average return under the Bayes criterion. We discuss easy-to-handle strategies which are optimal under some conditions. For detailed proofs we refer to a monograph published by the present author.
1. INTRODUCTION
In this paper we review a part of [van Hee (1978a)], a monograph dealing with Markov decision processes in discrete time, with an incompletely known transition law. All proofs of statements given here can be found in this monograph. Moreover the monograph contains results for the discounted return case; some of these results are reviewed in [van Hee (1978b)]. We do not bother about measure theoretic problems and therefore we assume all sets to be countable or sometimes even finite; however, we remark that in [van Hee (1978a)] the problems are treated in a general measure theoretic setting.
We start with a description of the model and we discuss some of its properties. A Markov decision process (MDP) with unknown transition law is specified by a 5-tuple

(X, A, Θ, P, r)   (1.1)

where X is the state space, A the action space, Θ the parameter space, P a transition probability from X × A × Θ to X, and r the reward function (i.e. r: X × A → ℝ, where ℝ is the set of real numbers). (We assume r to be bounded if X or A is countable.) The parameter θ ∈ Θ is unknown to the decision maker. At each stage 0, 1, 2, ... the decision maker chooses an action a ∈ A, where he may base his choice on the sequence of past states and actions. A strategy π is a sequence π = (π_0, π_1, π_2, ...) where π_0 is a transition probability from X to A and π_n a transition probability from (X × A)^n × X to A (n ≥ 1). A strategy is called stationary if there is a function f: X → A such that always

π_n({f(x_n)} | x_0, a_0, x_1, a_1, ..., x_n) = 1.
According to the well-known Ionescu Tulcea theorem (cf. [Neveu (1965)]) we have for each starting state x ∈ X, each strategy π ∈ Π and each parameter θ ∈ Θ a probability P^π_{x,θ} on

Ω := (X × A)^ℕ, where ℕ := {0, 1, 2, ...},   (1.2)

and a random process {(X_n, A_n), n ∈ ℕ} where

X_n(ω) := x_n,  A_n(ω) := a_n  if ω = (x_0, a_0, x_1, a_1, ...).   (1.3)

(The expectation with respect to P^π_{x,θ} is denoted by E^π_{x,θ}.) The average return is defined by

g(x, θ, π) := liminf_{N→∞} (1/N) Σ_{n=0}^{N-1} E^π_{x,θ}[r(X_n, A_n)].   (1.4)
It only happens in non-interesting cases that there is a strategy π' ∈ Π such that g(x, θ, π') ≥ g(x, θ, π) for all x ∈ X, θ ∈ Θ and π ∈ Π, so we cannot use this as an optimality criterion.
We have chosen the Bayes criterion (for a motivation cf. [van Hee (1978a)]). Fix some probability q on Θ. (Such a probability is called a prior distribution.) The Bayesian average return with respect to q is defined by

g(x, q, π) := liminf_{N→∞} (1/N) Σ_{n=0}^{N-1} Σ_θ q(θ) E^π_{x,θ}[r(X_n, A_n)].   (1.5)

(Note that the definitions (1.4) and (1.5) are consistent if we identify θ ∈ Θ with the distribution that is degenerate at θ.) The set of all probabilities on Θ will be denoted by W. A strategy π' is called ε-optimal in (x, q) ∈ X × W if g(x, q, π') ≥ g(x, q, π) − ε for all π ∈ Π (a 0-optimal strategy is simply called optimal).
The Bayes criterion allows us to consider the parameter as a random variable Z with range Θ and distribution q on Θ. On Θ × Ω we have the probability P^π_{x,q} determined by

P^π_{x,q}(Z ∈ B, (X_0, A_0, X_1, A_1, ...) ∈ C) = Σ_{θ∈B} P^π_{x,θ}(C) q(θ)   (1.6)

for events B in Θ and C in Ω.

We compute the so-called posterior distributions Q_n of Z in the following way:

Q_n(B) := P^π_{x,q}(Z ∈ B | X_0, A_0, X_1, A_1, ..., X_n, A_n).   (1.8)

(Note that Q_n is determined P^π_{x,q}-a.s.)
Define the probability T_{x,a,x'}(q) on Θ by

T_{x,a,x'}(q)(θ) := p(x' | x, a, θ) q(θ) / Σ_{θ'} p(x' | x, a, θ') q(θ')   (1.9)

(x, x' ∈ X, θ ∈ Θ, a ∈ A). It is possible to choose versions of the posterior distributions such that Q_0 = q and Q_{n+1} = T_{X_n, A_n, X_{n+1}}(Q_n). As indicated by Bellman (cf. [Bellman (1961)]) and proved in a very general setting in [Rieder (1975)], this decision model is equivalent to an MDP with a known transition law, specified by a 4-tuple

(X × W, A, P̄, r̄)   (1.10)

where X × W is the state space, A the action space, P̄ the transition law defined by

P̄(x', T_{x,a,x'}(q) | x, q, a) := Σ_θ q(θ) p(x' | x, a, θ)   (1.11)

and r̄: X × W × A → ℝ the reward function, defined by

r̄(x, q, a) := r(x, a).   (1.12)

Note that the state (x, q) of the new model (1.10) consists of the original state x ∈ X and the "information state" q ∈ W. It turns out that each state (x, q) and each strategy π̄ for the new model define a probability P̄^π̄_{x,q} and a random process {(X_n, Q_n, A_n), n ∈ ℕ} on Ω̄ := (X × W × A)^ℕ. Here

X_n(ω) := x_n,  Q_n(ω) := q_n,  A_n(ω) := a_n,  where ω = (x_0, q_0, a_0, x_1, q_1, a_1, ...) ∈ Ω̄.
The original model (1.1) and the new model (1.10) have the following relationship:

E^π_{x,q}[r(X_n, A_n)] = Ē^π̄_{x,q}[r̄(X_n, Q_n, A_n)]   (1.13)

where π̄ is the strategy for model (1.10) which is defined by

π̄_n(a | x_0, q_0, a_0, ..., x_n, q_n) := π_n(a | x_0, a_0, ..., x_n).   (1.14)

Hence models (1.1) and (1.10) are equivalent and therefore we use the notations of model (1.1).
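In a computational implementation the posterior update (1.9) is a one-line normalization. The sketch below is our illustration (not part of the original memorandum), assuming a finite parameter set and a transition law given as nested dictionaries; `trans_prob[theta][(x, a)][x_next]` plays the role of p(x' | x, a, θ) and all names are ours.

```python
def bayes_update(q, x, a, x_next, trans_prob):
    """Return T_{x,a,x'}(q): the posterior over theta after observing
    the transition from x to x_next under action a, cf. (1.9)."""
    # Unnormalized posterior weight of each theta: likelihood times prior mass.
    weights = {theta: trans_prob[theta][(x, a)].get(x_next, 0.0) * q[theta]
               for theta in q}
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("transition has probability zero under every theta")
    return {theta: w / total for theta, w in weights.items()}
```

Iterating this function along the observed trajectory yields the sequence Q_0 = q, Q_{n+1} = T_{X_n, A_n, X_{n+1}}(Q_n).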
",
So we are dealing with a Markov decision process w~th known transition law again. -However this new MOP has aome odd properties. At first the state space is infinite even if the state space of the original model is finite. Further the new MOP is
transient
1n general. i.e. in most cases Q 1 Q •n m
'It
P - a.s.for all n 1 m. In section 2 we show by an example that even x.q
if
X
and A are finite sets there need not to be an optimal strategy.In the next section we introduce strategies that are easy-to-handle. at least if X, A and 0 are finite sete. and we consider conditions guaranteeing these strategies to be optimal. These conditions imply that the posterior distribut-ions Q converge to degenerate distributdistribut-ions. which property is used explicitely
n
to prove the optimality.
We conclude this section by introducing a parameter structure that is quite general and that facilitates formulating some results.
From now on we assume that we are dealing with the following structure:

(i) X = X̂ × Y;
(ii) R is a transition probability from X̂ × A × Y to X̂;
(iii) K_1, K_2, K_3, ... is a partition of X̂ × A;
(iv) Θ = ∏_{i=1}^∞ Θ_i, i.e. θ = (θ_1, θ_2, θ_3, ...);   (1.15)
(v) P(x', y' | x, y, a, θ) = R(x' | x, a, y') · P_i(y' | θ_i) if and only if (x, a) ∈ K_i,

where P_i(· | θ_i) is a probability on Y. Hence we have factorized the original transition law. If (x, a) ∈ K_i then the transition to the next state (x', y') depends on θ only through its i-th component θ_i. We present below some examples having this structure. It is straightforward to verify that

T_{(x,y),a,(x',y')}(q)(θ) = Σ_{i=1}^∞ 1_{K_i}(x, a) · P_i(y' | θ_i) q(θ) / Σ_{θ'} P_i(y' | θ'_i) q(θ')   (1.16)

(provided that the denominator does not vanish). Here 1_B represents the indicator function of the set B.
Although this parameter structure seems to be rather complicated, there are practical situations where it occurs in a natural way. The state of the system at stage n is X_n = (X̂_n, Y_n). The state component Y_n is called the supplementary state variable. It can be proved that if θ ∈ Θ is known, then it is sufficient to consider X̂_n instead of X_n (see [van Hee (1978a), page 52]). So X̂_n has to be considered as the original state variable if the parameter is known, while Y_n only occurs since it contains information concerning the unknown parameter. From now on we are dealing with this parameter structure and we shall consider only X̂ and X̂_n; to facilitate notations we omit the hat from now on.

Example 1
Consider an inventory control model without backlogging. If the demand distribution is known, the inventory level may be chosen as the state variable. However, if the demand distribution is unknown, the sequence of successive inventory levels does not reflect the sequence of successive demands, and therefore we have to consider the demand in each period as a supplementary state variable. Here X is the set of inventory levels, i.e. X = [0, ∞), and Y is the set of possible demands. The transition function is

R(x' | x, a, y') = 1 if x' = max{a − y', 0} and a ≥ x,
R(x' | x, a, y') = 0 otherwise.

(Hence the action a is the inventory level after ordering.) The sets K_2, K_3, ... are empty and P_1(· | θ_1) is the demand distribution with unknown parameter θ_1 ∈ Θ_1.

Example 2

Consider a waiting line model with bulk arrivals. At each time point 0, 1, 2, ...
a group of customers arrives and the distribution of the size of the group is unknown. The service time distribution is exponential with parameter a, which is controlled by the action a. Let y' be the number of customers arriving in some period, let x be the queue length at the beginning of that period and x' at the end. Then, if c := x + y' − x' ≥ 0, we have

R(x' | x, a, y') = (a^c / c!) e^{−a}

and if c < 0 then R(x' | x, a, y') = 0. Further K_2, K_3, ... are empty and P_1(y' | θ_1) is the probability of a group of size y'.
Example 3

Consider a linear system with random disturbances. The state at stage n is X_n and the disturbance at stage n is Y_n. Then

X_{n+1} = C_1 X_n + C_2 A_n + Y_{n+1}

where X, Y and A are Euclidean spaces and C_1 and C_2 suitable matrices. Assume that {Y_n, n ∈ ℕ} forms a sequence of i.i.d. random variables with an incompletely known distribution. If only the sequence (X_0, X_1, X_2, ...) is observable to the controller, then he may reconstruct the sequence of supplementary state variables.
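The reconstruction mentioned above is simply Y_{n+1} = X_{n+1} − C_1 X_n − C_2 A_n. A small illustrative sketch (ours, with hypothetical matrices and plain-list linear algebra to stay self-contained):

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def reconstruct_disturbances(states, actions, C1, C2):
    """Recover Y_1, Y_2, ... from the observed states X_0, X_1, ...
    and the applied actions A_0, A_1, ... via Y_{n+1} = X_{n+1} - C1 X_n - C2 A_n."""
    return [[x1 - cx - ca
             for x1, cx, ca in zip(states[n + 1],
                                   matvec(C1, states[n]),
                                   matvec(C2, actions[n]))]
            for n in range(len(states) - 1)]
```

The recovered disturbances can then be fed into the posterior update for the unknown disturbance distribution.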
2. OPTIMAL STRATEGIES

We start this section with an example showing that there need not be an optimal strategy even if X and A are finite sets.
Example 4

Let X := {1, 2, 3, 4, 5, 6}, A := {1, 2, 3}, Y := {0, 1} and Θ_1 := (0, 1). (Note that Θ_1 is not countable here, but if we replace Θ_1 by a countable subset of (0, 1) the same arguments are valid; however, notations become more difficult.) Further let R(x' | x, a, y') be defined by:

R(3 | 3, a, 1) = R(4 | 3, a, 0) = R(4 | 4, a, 1) = R(3 | 4, a, 0) = R(5 | 5, a, 1) = R(6 | 5, a, 0) = R(5 | 6, a, 1) = R(6 | 6, a, 0) = 1 for all a ∈ A,

and let

R(1 | 1, 1, 1) = R(2 | 1, 1, 0) = 1,
R(3 | 1, 2, y) = R(5 | 1, 3, y) = R(3 | 2, 2, y) = R(5 | 2, 3, y) = 1 for all y ∈ Y.

Finally K_2, K_3, ... are empty and P_1(1 | θ) = θ, P_1(0 | θ) = 1 − θ.
The example can be represented in a transition diagram (not reproduced here). Only in states 1 and 2 does the chosen action have effect. The rewards obtained are r(3) = r(5) = 7 and r(4) = r(6) = 3 for all actions; in the other states no rewards are obtained. The average return in the sub-chain {3, 4} is (7 + 3)/2 = 5 and in the sub-chain {5, 6} it is 7θ + 3(1 − θ) = 4θ + 3. Consider a starting state x ∈ {1, 2}. It is easy to verify that for known θ ∈ Θ_1 the optimal action is a maximizer of the function 5·δ(2, a) + (4θ + 3)·δ(3, a) (where δ is the Kronecker function), a ∈ {2, 3}. It is also straightforward to verify that if we have to choose one of the actions 2 or 3 and if q is the prior distribution, then the maximizer of 5·δ(2, a) + {4 ∫ θ q(dθ) + 3}·δ(3, a), a ∈ {2, 3}, is the best choice.
Let π^n be the strategy that chooses action 1 the first n times and thereafter the maximizer of the function

5 δ(2, a) + {4 ∫ θ Q_n(dθ) + 3} δ(3, a),  a ∈ {2, 3},

where Q_n is the posterior distribution at time n if the system starts in state 1 with prior distribution q. Then the Bayesian average return in states 1 and 2 is

E_q[max{5, 4 ∫ θ Q_n(dθ) + 3}]

(this expectation does not depend on the starting state and the strategy). Note that

E_q[max{5, 4 ∫ θ Q_{n+1}(dθ) + 3} | Q_1, ..., Q_n] ≥
≥ max{5, 4 E_q[∫ θ Q_{n+1}(dθ) | Q_1, ..., Q_n] + 3} = max{5, 4 ∫ θ Q_n(dθ) + 3},

with equality if and only if 5 ≥ 4 ∫ θ Q_{n+1}(dθ) + 3, P_q-a.s. However, if q gives positive mass to the set {θ ∈ Θ_1 | θ > ½} the equality never holds. Hence in this case the strategy π^n is worse than π^{n+1} and consequently there is no optimal strategy.
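For a uniform prior on Θ_1 the increase in n can be computed exactly: under a uniform prior the number of ones k among n observations is uniformly distributed on {0, ..., n} and the posterior mean of θ given k is (k + 1)/(n + 2). The following exact computation (our illustration, not in the original) shows the Bayesian average return of π^n climbing towards, but never reaching, its supremum E_q[max{5, 4θ + 3}] = 5.5:

```python
from fractions import Fraction

def value(n):
    """E_q[max{5, 4 * E[theta | n observations] + 3}] under a uniform prior:
    the success count k is uniform on {0,...,n}, posterior mean (k+1)/(n+2)."""
    total = Fraction(0)
    for k in range(n + 1):
        post_mean = Fraction(k + 1, n + 2)
        total += Fraction(1, n + 1) * max(Fraction(5), 4 * post_mean + 3)
    return total
```

For example value(0) = 5 and value(1) = 16/3, and the sequence is nondecreasing in n (by the martingale property of the posterior mean and Jensen's inequality) while staying strictly below 11/2.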
We first introduce two assumptions:

(i) r is bounded on X × A;
(ii) there are bounded functions g and h, on Θ and X × Θ respectively, such that

h(x, θ) + g(θ) = sup_{a∈A} L(x, a, θ),   (2.1)

where

L(x, a, θ) := Σ_{i=1}^∞ 1_{K_i}(x, a) { r(x, a) + Σ_{y'} Σ_{x'} R(x' | x, a, y') P_i(y' | θ_i) h(x', θ) }.

For models with known parameter value these conditions are the well-known conditions considered in [Derman (1966)] and [Ross (1968)]. In that case there exist stationary optimal strategies, at least if A is finite. Moreover the optimal average return is g(θ).
In [Ross (1968)] several situations are given where (2.1)(ii) holds for fixed parameter value θ. For instance, if X and A are finite sets, and for each stationary strategy the resulting process is an irreducible Markov chain, then (2.1)(ii) is valid. Let {ε_n, n ∈ ℕ} be a sequence of nonnegative real numbers such that lim_{n→∞} ε_n = 0. The strategies we consider choose at each stage n an ε_n-maximizer of

Σ_θ L(X_n, a, θ) Q_n(θ),  a ∈ A.   (2.2)

We call these rules Bayesian equivalent rules, because we are maximizing the "Bayesian equivalent" of the function we have to maximize in case the parameter is known. Note that we may choose ε_n = 0, n ∈ ℕ, if there is an actual maximizer of the function (2.2). In example 4 we already encountered these strategies, so it is clear by the example that these strategies are not optimal in general. However, we give at the end of this section conditions guaranteeing these strategies to be optimal.
In [van Hee (1978a)] we consider these strategies also for the discounted total return case, and we prove there that these strategies are optimal for the linear system with quadratic costs (in discrete time) and also for some inventory control models. There we also consider bounds on the discounted total return of these strategies.
In [Fox and Rolph (1973)], [Mandl (1974)] and [Georgin (1978)] another heuristic strategy is considered, which turns out to be optimal in many situations. This strategy can be formulated in the following way: "At each stage estimate the unknown parameter θ, using the available data, by θ̂. Then compute an optimal (stationary) strategy for the model where the parameter is known and equal to θ̂. Then use the corresponding action in the actual state. Repeat this procedure at each stage."

Hence, if we consider Bayes estimates, then the method proposed by these authors may be formulated in the following way: choose at each stage n a maximizer of

L(X_n, a, Σ_θ θ Q_n(θ)),  a ∈ A   (2.3)

(here we assumed the parameter set to be a subset of ℝ).
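The difference between rule (2.2) and rule (2.3) is easy to exhibit computationally. In the hypothetical sketch below (ours; `L` stands in for the function L of (2.1), given as a callable over a finite parameter set) the two rules disagree whenever L is sufficiently nonlinear in θ:

```python
def bayes_equivalent_action(x, q, L, actions):
    """Rule (2.2): maximize the posterior expectation of L(x, a, theta)."""
    return max(actions, key=lambda a: sum(p * L(x, a, t) for t, p in q.items()))

def certainty_equivalent_action(x, q, L, actions):
    """Rule (2.3): plug the posterior mean of theta into L (theta real-valued)."""
    theta_hat = sum(p * t for t, p in q.items())
    return max(actions, key=lambda a: L(x, a, theta_hat))
```

For example, with q uniform on {0, 1} and L(x, 1, θ) = (2θ − 1)^2, rule (2.2) values action 1 at 1 while rule (2.3) values it at 0, so the two rules choose differently.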
To prove the optimality of Bayesian equivalent rules we need the following limit theorem.

Theorem 1. If Σ_{n=0}^∞ 1_{K_i}(X_n, A_n) = ∞ (P^π_{x,q}-a.s.), then

lim_{n→∞} Σ_θ Q_n(θ) f(θ_i) = f(Z_i)  (P^π_{x,q}-a.s.),  i = 1, 2, 3, ...,

for all bounded functions f on Θ_i (where Z = (Z_1, Z_2, Z_3, ...), cf. (1.15)(iv)).

Note that theorem 1 gives a sufficient condition for the consistency of the Bayes estimation procedure.
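The consistency can be illustrated numerically for a single Bernoulli component: when a component is observed often enough, the posterior concentrates on the parameter value that fits the data. The setup below is our own illustration (finite grid for Θ_i, log-likelihoods for numerical stability):

```python
import math

def posterior_after(counts, thetas, prior):
    """Posterior over a finite grid of Bernoulli parameters after observing
    counts = (#ones, #zeros), computed via normalized log-likelihoods."""
    ones, zeros = counts
    logw = [math.log(p) + ones * math.log(t) + zeros * math.log(1 - t)
            for t, p in zip(thetas, prior)]
    m = max(logw)                      # subtract the max to avoid underflow
    w = [math.exp(v - m) for v in logw]
    s = sum(w)
    return [v / s for v in w]
```

With 700 ones and 300 zeros and the grid {0.3, 0.5, 0.7}, essentially all posterior mass ends up on 0.7.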
We introduce the function φ on X × Θ × A:

φ(x, θ, a) := L(x, a, θ) − h(x, θ) − g(θ).

Note that φ(x, θ, a) ≤ 0 for all x ∈ X, θ ∈ Θ, a ∈ A. Further we extend φ to a function on X × W × A, and likewise the functions h and g:

(i) φ(x, q, a) := Σ_θ q(θ) φ(x, θ, a),
(ii) h(x, q) := Σ_θ q(θ) h(x, θ),   (2.4)
(iii) g(q) := Σ_θ q(θ) g(θ).

Note that these definitions are consistent with (2.1)(ii) if we identify θ with the distribution that is degenerate at θ.
The next theorem provides us with a sufficient condition for a strategy π to be optimal.

Theorem 2. If

liminf_{N→∞} (1/N) Σ_{n=0}^{N-1} E^π_{x,q}[φ(X_n, Q_n, A_n)] = 0

then

liminf_{N→∞} (1/N) Σ_{n=0}^{N-1} E^π_{x,q}[r(X_n, A_n)] = g(q)

and π is optimal.

Theorems 1 and 2 are the key tools for proving optimality of the Bayesian equivalent rules.
As an illustration we shall consider two special cases of more general theorems. For the first we provide a proof to demonstrate the technique.
Theorem 3. Let X and A be finite sets, let M_1, M_2, ..., M_m be a partition of X and let K_i = M_i × A, i = 1, ..., m (K_i = ∅ for i > m). Let the Markov chain {X_n, n ∈ ℕ} be irreducible for each stationary strategy and each parameter value. Then a strategy π* that chooses at stage n a maximizer of the function φ(X_n, Q_n, a), a ∈ A, is optimal. (Note that a maximizer of φ(X_n, Q_n, a) is a maximizer of Σ_θ L(X_n, a, θ) Q_n(θ).)

Proof. Let A_n be the action at stage n under strategy π*. Hence

0 ≥ φ(X_n, Q_n, A_n) = max_{a∈A} Σ_θ Q_n(θ) φ(X_n, θ, a) ≥ max_{f∈F} Σ_θ Q_n(θ) min_{x∈X} φ(x, θ, f(x)),

where F is the set of all functions from X to A. Note that F represents the set of all stationary strategies for the model with known parameter value. Since the Markov chain {X_n, n ∈ ℕ} is irreducible for each stationary strategy, it can be proved (cf. lemma 4.7 in [van Hee (1978a)]) that the number of visits to each set M_i is almost surely infinite for all strategies. Therefore the condition of theorem 1 is fulfilled for all i, and so we have

lim_{n→∞} Σ_θ Q_n(θ) min_{x∈X} φ(x, θ, f(x)) = min_{x∈X} φ(x, Z, f(x)).

Since there is a stationary optimal strategy for the model with known parameter value (cf. [Ross (1968)]) we have

max_{f∈F} min_{x∈X} φ(x, θ, f(x)) = 0 for all θ ∈ Θ.

Therefore we find

0 ≥ liminf_{n→∞} φ(X_n, Q_n, A_n) ≥ lim_{n→∞} max_{f∈F} Σ_θ Q_n(θ) min_{x∈X} φ(x, θ, f(x)) = 0,

and so

lim_{N→∞} (1/N) Σ_{n=0}^{N-1} φ(X_n, Q_n, A_n) = 0.

Application of theorem 2 gives the desired result.
In this example we assumed that the information we obtain after the transition about the unknown parameter does not depend on the action chosen. In the next example we relax this assumption. Here we assume:

(i) M_1, ..., M_m is a partition of X and N_1, ..., N_n is a partition of A;
(ii) for each stationary strategy the Markov chain {X_t, t ∈ ℕ} is irreducible;   (2.5)
(iii) the partition K_1, K_2, ... of X × A consists of the sets M_j × N_k, j = 1, ..., m, k = 1, ..., n.
Before we consider the strategy that turns out to be optimal, we first introduce the concept of a sequence of density zero. A sequence s = (s_1, s_2, ...) of positive integers is said to be of density zero if

limsup_{n→∞} (1/n) max{k | s_k ≤ n} = 0.

Examples of such sequences are (s_i = 2^i, i ∈ ℕ) and (s_i = i^2, i ∈ ℕ).
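The density-zero condition is easy to check numerically; the sketch below (our illustration) evaluates (1/n) max{k | s_k ≤ n} for a given sequence:

```python
def density(s, n):
    """(1/n) * max{k >= 1 : s(k) <= n}, taking the max as 0 if s(1) > n.
    `s` maps the index k to the k-th element of the sequence."""
    k = 0
    while s(k + 1) <= n:
        k += 1
    return k / n
```

For s_i = i^2 this gives floor(sqrt(n))/n, and for s_i = 2^i roughly log2(n)/n, both tending to 0; for the full sequence s_i = i it stays equal to 1, so that sequence is not of density zero.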
In theorem 4 we consider a strategy that is inspired by an idea in [Mallows and Robbins (1964)]. In [Fox and Rolph (1973)] this idea is used in a similar way for Markov renewal programs. The idea is that we make use of forced choice actions to guarantee that we return to each set K_i infinitely often, which is necessary to apply theorem 1. However, we do this with a frequency that is so low as not to influence the Bayesian average return. (In fact the concept of a sequence of density zero is used here.)
Now we are ready to formulate in an informal way the strategy π̂ that will be optimal. In (2.6) this strategy is sketched:

Fix (forced choice) actions a_1, a_2, ..., a_n such that a_k ∈ N_k, k = 1, ..., n.
Let t_i(n) be the number of visits to the set M_i at stage n.
If X_n ∈ M_i and t_i(n) ∈ S for some i ∈ {1, ..., m}, then the next action in the sequence (a_1, ..., a_n) is chosen (the forced choice actions are chosen in order);   (2.6)
otherwise, if t_i(n) ∉ S, then a maximizer of the function φ(X_n, Q_n, a), a ∈ A, is chosen.

Hence π̂ uses the same actions as the strategy in theorem 3, except for stages where t_i(n) ∈ S; then the forced choice actions are used. The proof of theorem 4 is rather technical, although the idea is simple.

Theorem 4
Let (2.5) hold. The strategy π̂ defined in (2.6) is optimal.

This result is also true in a more abstract model. In fact it completes results of [Mandl (1974)] and it generalizes work of [Rose (1975)].
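The bookkeeping of strategy (2.6) can be sketched as follows. This skeleton is our hypothetical illustration: it takes S = {k^2, k ∈ ℕ}, lets the state cycle deterministically through the sets M_1, ..., M_m as a stand-in for a real trajectory, and only counts how often a forced choice would be triggered (the comment marks where a maximizer of φ(X_n, Q_n, ·) would be played):

```python
import math

def is_in_S(t):
    """Membership in the density-zero set S = {1, 4, 9, ...}."""
    r = math.isqrt(t)
    return t >= 1 and r * r == t

def count_forced(num_steps, num_sets):
    """Count the stages at which strategy (2.6) makes a forced choice,
    assuming the state visits the sets M_1, ..., M_m cyclically."""
    visits = [0] * num_sets
    forced = 0
    for n in range(num_steps):
        i = n % num_sets          # index of the set M_i containing X_n
        visits[i] += 1            # t_i(n) after this visit
        if is_in_S(visits[i]):
            forced += 1           # play the next forced choice action, in order
        # else: play a maximizer of phi(X_n, Q_n, .)  (the rule of theorem 3)
    return forced
```

Because S has density zero, the forced choices occur infinitely often (each M_i keeps being revisited) yet make up a vanishing fraction of all stages, which is exactly why they do not affect the Bayesian average return.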
3. LITERATURE

Bellman, R., Adaptive control processes: a guided tour. Princeton (N.J.), Princeton University Press (1961).

Derman, C., Denumerable state Markovian decision processes - average cost criterion. Ann. Math. Statist. 37 (1966), 1545-1554.

Fox, B.L. and Rolph, J.E., Adaptive policies for Markov renewal programs. Ann. Math. Statist. 35 (1973), 846-856.

Georgin, J.P., Estimation et contrôle des chaînes de Markov sur des espaces arbitraires. In: Lecture Notes in Mathematics 636, Springer-Verlag, Berlin etc. (1978).

Van Hee, K.M., Bayesian control of Markov chains. Amsterdam, Mathematical Centre Tracts 95 (1978a).

Van Hee, K.M., Markov decision processes with unknown transition law; the discounted case. To appear in: Kybernetica (1978b).

Mallows, C.L. and Robbins, H., Some problems of optimal sampling strategy. J. Math. Anal. Appl. (1964), 90-103.

Mandl, P., Estimation and control in Markov chains. Advances in Appl. Probability (1974), 40-60.

Neveu, J., Mathematical foundations of the calculus of probability. San Francisco etc., Holden-Day (1965).

Rieder, U., Bayesian dynamic programming. Advances in Appl. Probability 7 (1975), 330-348.

Rose, J.S., Markov decision processes under uncertainty - average return criterion. Unpublished report (1975).

Ross, S.M., Arbitrary state Markovian decision processes. Ann. Math. Statist. 39 (1968), 2118-2122.