
PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 76-15

Approximations in Bayesian Controlled Markov Chains

by

K.M. van Hee

Eindhoven, The Netherlands

October 1976


Approximations in Bayesian Controlled Markov Chains

by

K.M. van Hee

0. Summary

A class of Markov decision processes is considered with a finite state and action space and with an incompletely known transition mechanism. The controller is looking for a strategy maximizing the Bayesian expected total discounted return. In section 2 approximations are given for this value and in section 3 we indicate how to compute the value for a fixed prior distribution.

1. Introduction and Preliminaries

For a detailed description of the model we refer to van Hee (1976A); here we only give a sketch. For statements without proof see also van Hee (1976A). Consider a Markov decision process with a finite state space $S$ and a finite action space $A$. Let $r: S \times A \to \mathbb{R}$ be the reward function.

Let $X_n$ be the state of the system at time $n$. There is a subset $B \subset S$ such that if $X_n \in B$ the next state is partially determined by the outcome of a random variable $Y_{n+1}$, where $\{Y_n,\ n = 1,2,3,\ldots\}$ is a sequence of i.i.d. random variables not controllable by the decision maker. The process $\{Y_n,\ n = 1,2,3,\ldots\}$ is called the external process and has a finite state space $E$. If and only if $X_n \in B$ then $Y_{n+1}$ becomes visible to the decision maker. Let $P$ be a transition probability from $S \times A \times E$ to $S$ such that

$$P[X_{n+1} = t \mid X_n = s,\ A_n = a,\ Y_{n+1} = y] = P(t \mid s,a,y),$$

where $A_n$ is the action at time $n$. (For $s \in S \setminus B$, $P(t \mid s,a,\cdot)$ is constant and we omit the dependence on $y$ in this case.) Only the distribution of the external process, i.e. $p(y \mid \theta) := P_\theta[Y_{n+1} = y]$, depends on an unknown parameter $\theta \in \Theta$, where $\Theta$ is a finite parameter space. Note that for each fixed $\theta \in \Theta$ the process forms an ordinary Markov decision process with transition probability

$$P_\theta[X_{n+1} = t \mid X_n = s,\ A_n = a] = \sum_{y \in E} P(t \mid s,a,y)\, p(y \mid \theta).$$

Examples of such a model can be found in inventory control, where $Y_n$ is the demand in period $(n-1,n]$, and also in queueing models, where $Y_n$ is the number of newcomers.
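To fix ideas, the primitives above can be represented directly. The following is a minimal sketch, not taken from the memorandum: the two-state model, the particular functions `r`, `P`, `p` and all numbers are hypothetical placeholders.

```python
import random

# A hypothetical finite model in the sense of section 1: everything below
# (states, actions, outcomes, the functions r, P, p and all numbers) is an
# illustrative placeholder, not taken from the memorandum.
S = [0, 1]            # state space
A = [0, 1]            # action space
E = [0, 1]            # state space of the external process {Y_n}
Theta = ["lo", "hi"]  # finite parameter space
B = {0, 1}            # subset of S on which Y_{n+1} co-determines the next state (here B = S)

def r(s, a):
    """Reward function r: S x A -> R."""
    return 1.0 if s == a else 0.0

def P(t, s, a, y):
    """Transition probability P(t | s, a, y); degenerate in this toy example."""
    return 1.0 if t == (s + a + y) % 2 else 0.0

def p(y, theta):
    """External process: p(y | theta) = P_theta[Y_{n+1} = y]."""
    return {"lo": [0.8, 0.2], "hi": [0.3, 0.7]}[theta][y]

def step(s, a, theta, rng=random):
    """Sample one transition: draw Y_{n+1} from p(.|theta), then X_{n+1} from P."""
    y = 0 if rng.random() < p(0, theta) else 1
    t = max(S, key=lambda u: P(u, s, a, y))  # P is 0/1 here, so the argmax is the successor
    return t, y

print(step(0, 1, "hi"))
```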


Let $\Pi$ be the set of all strategies based on the visible histories (i.e. for each $\pi \in \Pi$ the action $A_n$ may depend on $X_0,\ldots,X_n$, $A_0,\ldots,A_{n-1}$ and on $Y_k$ if $X_{k-1} \in B$, $k = 1,2,\ldots,n$).

For each starting state $s \in S$, each $\pi \in \Pi$ and $\theta \in \Theta$ we have a random process $\{(X_n, A_n, Y_{n+1}),\ n = 0,1,2,\ldots\}$ and a probability $P^\pi_{s,\theta}[\cdot]$ on the sample space. (The expectation w.r.t. this probability is denoted by $E^\pi_{s,\theta}$.)

The Bayesian expected discounted total return $v(s,q,\pi)$ w.r.t. a prior distribution $q$ on $\Theta$ is defined by

$$v(s,q,\pi) := \sum_{\theta \in \Theta} E^\pi_{s,\theta}\Big[\sum_{n=0}^{\infty} \beta^n r(X_n, A_n)\Big]\, q(\theta), \quad s \in S,\ \pi \in \Pi,$$

where $\beta \in [0,1)$ is the discount factor.

The set of all distributions on $\Theta$ is denoted by $W$, and the function $v: S \times W \to \mathbb{R}$, defined by $v(s,q) := \sup_{\pi \in \Pi} v(s,q,\pi)$, is called the value function.

We define a sequence of stopping times:

$$\sigma_1 := \inf\{n \ge 0;\ X_n \in B\},$$
$$\sigma_k := \inf\{n > \sigma_{k-1};\ X_n \in B\}, \quad k = 2,3,4,\ldots,$$
$$\tau_k := \sigma_k + 1, \quad k = 1,2,3,\ldots.$$

The Bayes criterion allows us to consider the parameter $\theta \in \Theta$ as a random variable $Z$ with distribution $q$ on $\Theta$.

Given $q \in W$, $s \in S$ and $\pi \in \Pi$ we have a probability $P^\pi_{s,q}$ on the sample space of the process $\{Z, (X_n, A_n, Y_{n+1}),\ n = 0,1,2,\ldots\}$, and for each event $C$ defined by the process $\{(X_n, A_n, Y_{n+1})\}$ we have

$$P^\pi_{s,q}[C] = \sum_{\theta \in \Theta} q(\theta)\, P^\pi_{s,\theta}[C]$$

(the expectation w.r.t. $P^\pi_{s,q}$ is denoted by $E^\pi_{s,q}$).

We define on the event $\{\tau_n < \infty\}$ for $s \in S$, $q \in W$, $\pi \in \Pi$:

$$Q_n(\theta) := P^\pi_{s,q}[Z = \theta \mid Y_{\tau_1},\ldots,Y_{\tau_n}].$$


The vector-valued process $\{Q_n\}$ with $Q_n := \{Q_n(\theta),\ \theta \in \Theta\}$ is called the Bayes process.

Note that, since the values of the external process are not influenced by the starting state $s$ and the strategy $\pi$, $Q_n(\theta)$ does not depend on $s$ and $\pi$ on $\{\tau_n < \infty\}$; if $B = S$ this holds without qualification.

If we are in the situation that expectations or conditional expectations do not depend on $s$ and $\pi$ we omit these sub- and superscripts.

We sometimes need the following conditions:

(A) For all $s \in S$, $\pi \in \Pi$ and $\theta \in \Theta$:

$$P^\pi_{s,\theta}\Big[\bigcap_{n=1}^{\infty} \{\tau_n < \infty\}\Big] = 1.$$

(Note that $B = S$ implies (A).)

(B) For each pair $\theta, \bar\theta \in \Theta$ with $\theta \neq \bar\theta$ there is a $y \in E$ such that $p(y \mid \theta) \neq p(y \mid \bar\theta)$.

(The only place where (B) is used is in the proof of the following theorem.)

Theorem 1. Assume (A), (B). Then for all $s \in S$, $q \in W$ and $\pi \in \Pi$ it holds that

$$\lim_{n \to \infty} Q_n(\theta) = \delta_{Z,\theta}, \quad P^\pi_{s,q}\text{-a.s.},$$

where $\delta$ denotes the Kronecker delta.

We need some notation: $p: E \times W \to [0,1]$ such that

$$p(y,q) := \sum_{\theta \in \Theta} q(\theta)\, p(y \mid \theta),$$

and $T: W \times E \to W$ such that

$$T_y(q)(\theta) := \frac{q(\theta)\, p(y \mid \theta)}{p(y,q)} \ \text{ if } p(y,q) > 0, \qquad := q(\theta) \ \text{ otherwise.}$$

We may reduce our Bayesian decision problem to a discounted dynamic program with state space $S \times W$, action space $A$ and reward function $r$, as stated in theorem 2.
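The update $q \mapsto T_y q$ is ordinary Bayes' rule on the finite set $\Theta$. A minimal sketch, with distributions as dicts over $\Theta$ and a hypothetical likelihood table `lik[theta][y]` standing in for $p(y \mid \theta)$:

```python
# Posterior update T_y(q) and predictive probability p(y,q) on a finite Theta.
# `lik[theta][y]` plays the role of p(y | theta); the numbers are hypothetical.

def p_pred(y, q, lik):
    """p(y,q) = sum_theta q(theta) * p(y | theta)."""
    return sum(q[th] * lik[th][y] for th in q)

def T(y, q, lik):
    """Bayes update: T_y(q)(theta) = q(theta) p(y|theta) / p(y,q); q itself if p(y,q) = 0."""
    denom = p_pred(y, q, lik)
    if denom == 0.0:
        return dict(q)
    return {th: q[th] * lik[th][y] / denom for th in q}

# Hypothetical two-parameter example.
lik = {"lo": [0.8, 0.2], "hi": [0.3, 0.7]}
q0 = {"lo": 0.5, "hi": 0.5}
print(T(1, q0, lik))   # observing y = 1 shifts mass toward "hi"
```

The convention of returning $q$ unchanged when $p(y,q) = 0$ matches the definition above; such a $y$ has probability zero under $q$ anyway.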


Theorem 2. The value function $v$ is the unique solution of the functional equation

$$v(s,q) = \max_{a \in A} \Big\{ r(s,a) + \beta \sum_{y \in E} \sum_{t \in S} P(t \mid s,a,y)\, p(y,q)\, v(t, T_y q) \Big\}, \quad s \in B,$$

$$v(s,q) = \max_{a \in A} \Big\{ r(s,a) + \beta \sum_{t \in S} P(t \mid s,a)\, v(t,q) \Big\}, \quad s \in S \setminus B.$$

Corollary 1. There is an optimal strategy $\pi^*$ which is stationary, i.e. there is a function $g: S \times W \to A$ such that $\pi^*$ chooses action $g(s,q)$ in $(s,q) \in S \times W$.

2. Approximations

In this section we shall give some approximations for $v(s,q)$, $s \in S$, and a fixed prior $q \in W$. In section 3 we consider the computational aspects. We identify each $\theta \in \Theta$ with the degenerate distribution at $\theta$. Hence $v(s,\theta)$ is the optimal value of the Markov decision process if $s$ is the starting state and $\theta$ is known. Let

2.1. $M := \{f \mid f: S \to A\}$

be the set of Markov policies, and identify with $f$ the strategy $\pi \in \Pi$ that chooses action $f(s)$, $f \in M$, in state $(s,q)$. Further let

$$F_\theta := \{f \in M \mid v(s,\theta) = v(s,\theta,f) \text{ for all } s \in S\}, \quad \theta \in \Theta,$$

and let $c: \Theta \to M$ be such that $c(\theta) \in F_\theta$, $\theta \in \Theta$. We define

2.2. i) $F := \bigcup_{\theta \in \Theta} F_\theta$,

ii) $\bar F := \{f \in M \mid f = c(\theta) \text{ for some } \theta \in \Theta\}$.

On $S \times W$ we define the following functions:

2.3. i) $w(s,q) := \sum_{\theta \in \Theta} v(s,\theta)\, q(\theta)$,

ii) $\ell(s,q) := \max_{f \in F} \sum_{\theta \in \Theta} v(s,\theta,f)\, q(\theta)$,

iii) $\bar\ell(s,q) := \max_{f \in \bar F} \sum_{\theta \in \Theta} v(s,\theta,f)\, q(\theta)$.
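Once the $\theta$-wise tables $v(s,\theta)$ and $v(s,\theta,f)$ are available (section 3 discusses how they are obtained), the bounds $w$ and $\ell$ of 2.3 are plain weighted sums. A sketch with hypothetical tables; `v_opt[theta][s]` stands for $v(s,\theta)$ and `v_pol[f][theta][s]` for $v(s,\theta,f)$:

```python
# Upper and lower bounds of 2.3, given per-parameter solutions.
# v_opt[theta][s]    : optimal value v(s, theta) of the theta-MDP
# v_pol[f][theta][s] : value v(s, theta, f) of Markov policy f in the theta-MDP

def w(s, q, v_opt):
    """w(s,q) = sum_theta v(s,theta) q(theta)  (upper bound on v(s,q))."""
    return sum(v_opt[th][s] * q[th] for th in q)

def ell(s, q, v_pol):
    """ell(s,q) = max_f sum_theta v(s,theta,f) q(theta)  (lower bound on v(s,q))."""
    return max(sum(v_pol[f][th][s] * q[th] for th in q) for f in v_pol)

# Hypothetical numbers for two states, two parameters, two policies.
v_opt = {"lo": [10.0, 8.0], "hi": [6.0, 9.0]}
v_pol = {"f0": {"lo": [10.0, 8.0], "hi": [5.0, 7.0]},
         "f1": {"lo": [8.0, 6.0], "hi": [6.0, 9.0]}}
q0 = {"lo": 0.5, "hi": 0.5}
print(ell(0, q0, v_pol), "<=", w(0, q0, v_opt))   # 7.5 <= 8.0
```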


Lemma 3.

i) $\bar\ell(s,q) \le \ell(s,q) \le v(s,q) \le w(s,q)$ for all $s \in S$, $q \in W$.

ii) Let (A), (B) hold. Then for all $t \in S$, $\pi \in \Pi$ and $q \in W$

$$\lim_{n \to \infty} \max_{s \in S} \{w(s,Q_n) - \bar\ell(s,Q_n)\} = 0, \quad P^\pi_{t,q}\text{-a.s.}$$

Proof. i) $\bar\ell(s,q) \le \ell(s,q) = \max_{f \in F} v(s,q,f) \le v(s,q) \le \sum_{\theta \in \Theta} q(\theta) \sup_{\pi \in \Pi} v(s,\theta,\pi) = w(s,q)$.

ii) By theorem 1 we have

$$\lim_{n \to \infty} w(s,Q_n) = \lim_{n \to \infty} \sum_{\theta \in \Theta} v(s,\theta)\, Q_n(\theta) = v(s,Z), \quad P^\pi_{t,q}\text{-a.s.}$$

Note that

$$\Big|\max_{f \in \bar F} \sum_{\theta} Q_n(\theta)\, v(s,\theta,f) - \max_{f \in \bar F} v(s,Z,f)\Big| \le \max_{f \in \bar F} \Big|\sum_{\theta} Q_n(\theta)\, v(s,\theta,f) - v(s,Z,f)\Big|,$$

and that $\max_{f \in \bar F} v(s,Z,f) = v(s,Z)$ since $c(Z) \in \bar F$. Hence

$$\lim_{n \to \infty} \bar\ell(s,Q_n) = v(s,Z), \quad P^\pi_{t,q}\text{-a.s.} \qquad \square$$

Define two functions:

2.4. i) $\varphi(s,a,\theta) := r(s,a) + \beta \sum_{t \in S} \sum_{y \in E} P(t \mid s,a,y)\, p(y \mid \theta)\, v(t,\theta) - v(s,\theta)$, for $s \in S$, $a \in A$, $\theta \in \Theta$,

ii) $\varphi(s,q) := \max_{a \in A} \sum_{\theta \in \Theta} \varphi(s,a,\theta)\, q(\theta)$, $s \in S$, $q \in W$.

Note that $\varphi(s,a,\theta) \le 0$ for all $s \in S$, $a \in A$ and $\theta \in \Theta$, and note also that $\varphi(s,q) = 0$ if $q$ is a degenerate distribution.

Lemma 4.

i) $v(s,q) \ge w(s,q) + \dfrac{1}{1-\beta} \max_{a \in A} \sum_{\theta \in \Theta} \min_{x \in S} \varphi(x,a,\theta)\, q(\theta)$,

ii) $v(s,q) \ge w(s,q) + \dfrac{1}{1-\beta} \max_{f \in F} \sum_{\theta \in \Theta} \min_{x \in S} \varphi(x,f(x),\theta)\, q(\theta)$,

iii) if $B = S$ then $\operatorname{span}_s \{w(s,q) - v(s,q)\} \le E_q\Big[\sum_{n=0}^{\infty} \beta^n \operatorname{span}_s \varphi(s,Q_n)\Big]$,

where $\operatorname{span}_x f(x) := \sup_x f(x) - \inf_x f(x)$.
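Lemma 4 i) turns the suboptimality quantities $\varphi(x,a,\theta)$ into a computable correction below $w$. A sketch, assuming a hypothetical precomputed table `phi[x][a][theta]` obtained from 2.4 i):

```python
def lower_bound_i(q, w_sq, phi, A, states, beta):
    """Lemma 4 i): w(s,q) + (1-beta)^(-1) * max_a sum_theta min_x phi(x,a,theta) q(theta).

    w_sq is w(s,q); phi[x][a][theta] <= 0 is assumed precomputed from 2.4 i).
    """
    correction = max(
        sum(min(phi[x][a][th] for x in states) * q[th] for th in q)
        for a in A
    )
    return w_sq + correction / (1.0 - beta)

# Hypothetical phi table phi[x][a][theta] (all entries <= 0, 0 at optimal actions).
phi = {0: {0: {"lo": 0.0, "hi": -2.0}, 1: {"lo": -1.0, "hi": 0.0}},
       1: {0: {"lo": 0.0, "hi": -3.0}, 1: {"lo": -2.0, "hi": 0.0}}}
print(lower_bound_i({"lo": 0.5, "hi": 0.5}, 8.0, phi, [0, 1], [0, 1], 0.9))  # -2.0
```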


Proof. By 2.4 we have for $a \in A$:

$$\sum_{\theta \in \Theta} q(\theta)\, v(s,\theta) + \varphi(s,q) \ge r(s,a) + \beta \sum_{\theta \in \Theta} \sum_{t \in S} \sum_{y \in E} P(t \mid s,a,y)\, p(y \mid \theta)\, q(\theta)\, v(t,\theta).$$

Note that $p(y \mid \theta)\, q(\theta) = (T_y q)(\theta)\, p(y,q)$. Hence by substituting 2.3 i) we have

$$w(s,q) + \varphi(s,q) \ge r(s,a) + \beta \sum_{t \in S} \sum_{y \in E} P(t \mid s,a,y)\, p(y,q)\, w(t, T_y q) \quad \text{if } s \in B,$$

$$w(s,q) + \varphi(s,q) \ge r(s,a) + \beta \sum_{t \in S} P(t \mid s,a)\, w(t,q) \quad \text{if } s \in S \setminus B.$$

Let $\pi$ be a stationary strategy (see corollary 1) and define

$$\hat f(s,q) := E^\pi_{s,q}[w(X_1, Q_1)], \quad (s,q) \in S \times W;$$

then

$$r(s,\pi(s,q)) \le \varphi(s,q) + w(s,q) - \beta \hat f(s,q),$$

where $\pi(s,q)$ denotes the action chosen by $\pi$ in $(s,q)$. By the Markov property we have for all $(s,q) \in S \times W$:

$$r(X_n, A_n) \le \varphi(X_n, Q_n) + w(X_n, Q_n) - \beta\, E^\pi_{s,q}[w(X_{n+1}, Q_{n+1}) \mid X_n, Q_n], \quad P^\pi_{s,q}\text{-a.s.}$$

Hence

$$v(s,q,\pi) \le E^\pi_{s,q}\Big[\sum_{n=0}^{\infty} \beta^n \{\varphi(X_n,Q_n) + w(X_n,Q_n)\}\Big] - E^\pi_{s,q}\Big[\beta \sum_{n=0}^{\infty} \beta^n w(X_{n+1},Q_{n+1})\Big] = w(s,q) + E^\pi_{s,q}\Big[\sum_{n=0}^{\infty} \beta^n \varphi(X_n,Q_n)\Big].$$

Let $\bar\pi$ be the strategy that chooses in $(s,q) \in S \times W$ a fixed action $\bar a$, maximizing $\sum_\theta q(\theta)\, \varphi(s,a,\theta)$. Note that $\bar\pi$ is stationary and note also that equality holds in 2.4 if $\pi = \bar\pi$.

We first prove iii). Let $\pi^*$ be a stationary optimal strategy; then

2.6. $v(s,q) = v(s,q,\pi^*) \le w(s,q) + E^{\pi^*}_{s,q}\big[\sum_{n=0}^{\infty} \beta^n \max_{x \in S} \varphi(x,Q_n)\big]$.

But

2.7. $v(s,q) \ge v(s,q,\bar\pi) \ge w(s,q) + E^{\bar\pi}_{s,q}\big[\sum_{n=0}^{\infty} \beta^n \min_{x \in S} \varphi(x,Q_n)\big]$.

Remark that under the condition $B = S$ the distribution of $Q_n$ is independent of $s \in S$ and $\pi \in \Pi$; hence iii) is a direct consequence of 2.6 and 2.7. To prove i) and ii) note that

$$\min_{x \in S} \varphi(x,q) \ge \min_{x \in S} \max_{f \in F} \sum_{\theta} q(\theta)\, \varphi(x,f(x),\theta) \ge \max_{f \in F} \min_{x \in S} \sum_{\theta} q(\theta)\, \varphi(x,f(x),\theta) \ge$$

$$\ge \max_{f \in F} \sum_{\theta} q(\theta) \min_{x \in S} \varphi(x,f(x),\theta) \ge \max_{a \in A} \sum_{\theta} q(\theta) \min_{x \in S} \varphi(x,a,\theta).$$

Further note that the last two expressions are convex functions on $W$, so by Jensen's inequality applied to the right-hand side of 2.7 we have the desired result. $\square$

Remark 2.8. By the proof of lemma 4 we see that the lower bound given in ii) is greater than or equal to the lower bound of i), but it requires more work to compute. Further note that, if (A), (B) hold,

$$\lim_{n \to \infty} \max_{f \in F} \sum_{\theta \in \Theta} \min_{x \in S} \varphi(x,f(x),\theta)\, Q_n(\theta) = 0, \quad P^\pi_{s,q}\text{-a.s.},$$

since $Q_n(\theta) \to \delta_{Z,\theta}$, $P^\pi_{s,q}$-a.s.

We now introduce an operator $U$ working on the space $G$ of bounded measurable functions on $S \times W$ (measurable w.r.t. the Borel $\sigma$-field on $S \times W$). Let $f \in G$:

2.9. $(Uf)(s,q) := \sup_{\pi \in \Pi} E^\pi_{s,q}\Big[\sum_{n=0}^{\tau_1 - 1} \beta^n r(X_n, A_n) + \beta^{\tau_1} f(X_{\tau_1}, Q_1)\Big]$.

Note that for $f \in G$, $Uf$ is continuous on $W$ since

$$|(Uf)(s,q) - (Uf)(s,\tilde q)| \le \Big\{\frac{M}{1-\beta} + M'\Big\} \sum_{\theta} |q(\theta) - \tilde q(\theta)|$$

for $q, \tilde q \in W$, where $|r(s,a)| \le M$ and $|f| \le M'$.

Note further that $G$ is a Banach space w.r.t. the supremum norm.

In Wessels (1974) and van Nunen (1976) a class of operators of this type is studied for models with a finite respectively countable state space. They both prove the following theorem. For our situation it is proved in van Hee (1976A).


Theorem 5. The operator $U$ (defined in 2.9) is monotone and contracting. The value function $v$ is the unique fixed point of $U$ in $G$.

The next theorem is important for successive approximations. Let us assume that $\bar v$ is an approximation of $v$ and that the difference $|v - \bar v|$ is bounded by a function $\varepsilon$.

Theorem 6. Let $v$ be the value function and let $\bar v, \varepsilon \in G$ be such that for all $s \in S$, $q \in W$

$$|v(s,q) - \bar v(s,q)| \le \varepsilon(s,q);$$

then it holds that

$$|(U^n \bar v)(s,q) - v(s,q)| \le \sup_{\pi \in \Pi} E^\pi_{s,q}\big[\beta^{\tau_n}\, \varepsilon(X_{\tau_n}, Q_n)\big].$$

Proof. First we define the operator $L: G \to G$ by

$$(Lf)(s,q) := \sup_{\pi \in \Pi} E^\pi_{s,q}\big[\beta^{\tau_1} f(X_{\tau_1}, Q_1)\big], \quad f \in G,\ s \in S,\ q \in W$$

(it is easy to verify that $Lf$ is continuous on $W$, so $Lf \in G$). It holds that

$$(U(v + \varepsilon))(s,q) \le (Uv)(s,q) + (L\varepsilon)(s,q) = v(s,q) + (L\varepsilon)(s,q),$$

and therefore

$$(U\bar v)(s,q) \le v(s,q) + (L\varepsilon)(s,q),$$

and in the same way

$$(U\bar v)(s,q) \ge v(s,q) - (L\varepsilon)(s,q).$$

So, again by the monotonicity of $U$, we have

$$|(U^n \bar v)(s,q) - v(s,q)| \le (L^n \varepsilon)(s,q).$$

To complete the proof we have to verify that

$$(L^n \varepsilon)(s,q) \le \sup_{\pi \in \Pi} E^\pi_{s,q}\big[\beta^{\tau_n}\, \varepsilon(X_{\tau_n}, Q_n)\big]. \qquad \square$$


Corollary 7. Suppose that $B = S$. Let $\bar v \in G$ and let $\varepsilon: W \to \mathbb{R}$ be a bounded measurable function. If

$$|v(s,q) - \bar v(s,q)| \le \varepsilon(q)$$

then

$$|(U^n \bar v)(s,q) - v(s,q)| \le \beta^n\, E_q[\varepsilon(Q_n)].$$

To prove this statement note that $B = S$ implies $\tau_n = n$ and that the distribution of $Q_n$ is independent of the starting state and the strategy.

Corollary 8. Suppose that $B = S$. Let $\bar v(s,q) := \tfrac12 \{w(s,q) + \ell(s,q)\}$ and

$$\varepsilon(q) := \tfrac12 \min_{f \in F} \sum_{\theta \in \Theta} \max_{x \in S} \{v(x,\theta) - v(x,\theta,f)\}\, q(\theta).$$

Then:

i) $|v(s,q) - \bar v(s,q)| \le \varepsilon(q)$,

ii) $E_q[\varepsilon(Q_n)] \ge E_q[\varepsilon(Q_{n+1})]$,

iii) $\lim_{n \to \infty} E_q[\varepsilon(Q_n)] = 0$.

Proof.

i) Note that $|v(s,q) - \bar v(s,q)| \le \tfrac12 \{w(s,q) - \ell(s,q)\} \le \varepsilon(q)$.

ii) Note that $\varepsilon(q)$ is a concave function on $W$. Since $\{Q_n,\ n \in \mathbb{N}\}$ forms a martingale (see van Hee (1976A)) we have that $\{\varepsilon(Q_n),\ n \in \mathbb{N}\}$ forms a supermartingale.

iii) By theorem 1 we have, $P_q$-a.s.,

$$\lim_{n \to \infty} \varepsilon(Q_n) = \tfrac12 \min_{f \in F} \max_{x \in S} \{v(x,Z) - v(x,Z,f)\} = 0. \qquad \square$$
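Corollary 8 suggests taking the midpoint of the two bounds as the approximation, with half their gap, relaxed to a function of $q$ alone, as the error. A sketch reusing `w`, `ell` and the hypothetical tables from the earlier block:

```python
def eps(q, v_opt, v_pol, states):
    """Corollary 8: eps(q) = 1/2 min_f sum_theta max_x {v(x,theta) - v(x,theta,f)} q(theta)."""
    return 0.5 * min(
        sum(max(v_opt[th][x] - v_pol[f][th][x] for x in states) * q[th] for th in q)
        for f in v_pol
    )

def vbar(s, q, v_opt, v_pol):
    """Midpoint approximation vbar(s,q) = (w(s,q) + ell(s,q)) / 2."""
    return 0.5 * (w(s, q, v_opt) + ell(s, q, v_pol))

q0 = {"lo": 0.5, "hi": 0.5}
print(vbar(0, q0, v_opt, v_pol), "+/-", eps(q0, v_opt, v_pol, [0, 1]))  # 7.75 +/- 0.5
```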

Remark 2.10. Let $B = S$ and define

i) $\varepsilon(q) := -\tfrac12\, \dfrac{1}{1-\beta}\, \max_{f \in F} \sum_{\theta \in \Theta} \min_{x \in S} \varphi(x,f(x),\theta)\, q(\theta)$ (note that $\varepsilon(q) \ge 0$ since $\varphi \le 0$),

ii) $\bar v(s,q) := w(s,q) - \varepsilon(q)$.

Then the three statements of corollary 8 hold also. The proof proceeds along the same lines, using lemma 4 and remark 2.8.

If in corollary 8 $\ell$ is replaced by $\bar\ell$ and $F$ by $\bar F$, the statements i) and iii) remain true.

3. Computational aspects and additional remarks

The approximations given in section 2 are of interest for computations if we are prepared to determine the set $F$ and the values $\{v(s,\theta,f) \mid s \in S,\ \theta \in \Theta,\ f \in F\}$ (or with $F$ replaced by $\bar F$). Let $k := \#(\Theta)$; then the determination of $F$ requires the solution of $k$ ordinary Markov decision problems with a finite state and action space and the determination of all optimal policies. If $n := \#(F)$ (or $\#(\bar F)$) then we have to solve $(k-1)n$ systems of linear equations to determine the second set.

If there is an $f \in M$ which is optimal for all $\theta \in \Theta$ then $v(s,q) = w(s,q)$ for all $s \in S$, $q \in W$. For separable value functions, i.e. for models with $v(s,\theta) = h(s) + g(\theta)$, it holds that $\operatorname{span}_s \varphi(s,q) = 0$, hence by lemma 4 iii) we have that $\operatorname{span}_s \{v(s,q) - w(s,q)\} = 0$. In van Hee (1976B) a class of problems, including some inventory control models, is considered with this structure.

For each $q \in W$ we define

$$W_n(q) := \{\tilde q \in W \mid \tilde q = T_{y_n}(\cdots(T_{y_1}(q))\cdots) \text{ for some } y_1,\ldots,y_n \in E\};$$

hence $W_n(q)$ is the set of all $n$-stage posterior distributions of $q$. The sets $W_n(q)$ and $W_m(q)$, $n \neq m$, are in general not disjoint (see van Hee (1976A)).

For a fixed $q \in W$ it follows from section 2 that, loosely speaking, the approximations of $v(s,\tilde q)$ for $\tilde q \in W_n(q)$ are better if $n$ is large. Since $(U^n \bar v)(s,q)$ requires only the values of $\bar v(s,\tilde q)$ for $\tilde q \in W_n(q)$, $s \in S$, we may approximate $v(s,\tilde q)$ by $\bar v(s,\tilde q)$ on $W_n(q)$ and then determine $(U^n \bar v)(s,q)$ by backward induction. The only problem is the determination of $n$, the horizon.
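For $B = S$ the scheme just described can be written as a recursion that enumerates the posterior tree: at depth $n$ the terminal approximation $\bar v$ is used on $W_n(q)$, and theorem 2's backup pulls the values back to $q$; the recursion computes the same numbers as an explicit backward-induction table. A sketch reusing the earlier hypothetical helpers (`p_pred`, `T`, and the model callables); `vbar_fn` is any of the approximations of section 2:

```python
def u_n(s, q, n, vbar_fn, S, A, E, r, P, lik, beta):
    """(U^n vbar)(s, q) for B = S, by recursion over the posterior tree.

    At depth 0 the terminal approximation vbar_fn is used, i.e. v is replaced
    by vbar on W_n(q); each level performs the backup of theorem 2.
    """
    if n == 0:
        return vbar_fn(s, q)
    return max(
        r(s, a) + beta * sum(
            P(t, s, a, y) * p_pred(y, q, lik)
            * u_n(t, T(y, q, lik), n - 1, vbar_fn, S, A, E, r, P, lik, beta)
            for y in E for t in S
        )
        for a in A
    )
```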

For models with $B = S$ corollary 8 i) shows that the error determination is rather easy: we have only to compute $\beta^n E_q[\varepsilon(Q_n)]$, which requires the determination of $\varepsilon(\tilde q)$ for all $\tilde q \in W_n(q)$, to check whether horizon $n$ is sufficiently accurate or not. If $B$ is a proper subset of $S$ and if (A), (B) hold, a similar result holds since

$$\sup_{\pi \in \Pi} E^\pi_{s,q}\big[\beta^{\tau_n}\, \varepsilon(Q_n)\big] \le \beta^n\, E_q[\varepsilon(Q_n)],$$

viz. $\tau_n \ge n$ and the distribution of $Q_n$ depends only on $q$.

In van Hee (1976A) two algorithms are presented based on these arguments, for models with $B = S$ and for models where $B$ consists of only one state. Numerical results are also given there, and attention is paid to the determination of optimal actions.

In Martin (1967) the usual method of successive approximations is proposed with a terminal function $t: S \to \mathbb{R}$. In our terminology Martin approximates $v(s,q)$ by $(U^n t)(s,q)$. The difficulty of this method is that the choice of the horizon must be made on the error estimate

$$\beta^n\, \frac{\bar M - \underline M}{1-\beta}\,, \quad \text{where } \bar M := \max_{s,a} r(s,a),\ \underline M := \min_{s,a} r(s,a).$$
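By contrast with the posterior-dependent bounds above, the horizon forced by this a-priori estimate depends only on $\beta$ and the reward range. A small sketch (the tolerance is an arbitrary illustrative value):

```python
import math

def martin_horizon(beta, r_max, r_min, tol):
    """Smallest n with beta^n * (r_max - r_min) / (1 - beta) <= tol."""
    bound0 = (r_max - r_min) / (1.0 - beta)
    if bound0 <= tol:
        return 0
    return math.ceil(math.log(tol / bound0) / math.log(beta))

print(martin_horizon(beta=0.9, r_max=1.0, r_min=0.0, tol=1e-3))   # 88
```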

Satia and Lave (1973) also suggest the use of upper and lower bounds for $v(s,\tilde q)$, $\tilde q \in W_n(q)$. It is easy to see that their bounds are worse than ours (see van Hee (1976A)).

Literature

van Hee, K.M. (1976A); Bayesian control of Markov chains, to appear.

van Hee, K.M. (1976B); Adaptive control of special structured Markov chains, to appear.

Martin, J.J. (1967); Bayesian decision problems and Markov chains, Wiley, New York.

van Nunen, J.A.E.E. (1976); Contracting Markov decision processes, MC Tract, Amsterdam.

Satia, J.K. and R.E. Lave (1973); Markov decision processes with uncertain transition probabilities, Operations Research 21.

Wessels, J. (1974); Stopping times and Markov programming, Proceedings of the 1974 E.M.S. meeting and 7th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes.
