Markov decision processes with unknown transition law: the average return case

Citation for published version (APA):
Van Hee, K.M. (1979). Markov decision processes with unknown transition law: the average return case. (Memorandum COSOR; Vol. 7901). Technische Hogeschool Eindhoven.

Published: 01/01/1979
Department of Mathematics
PROBABILITY THEORY, STATISTICS AND OPERATIONS RESEARCH GROUP

Memorandum COSOR 79-01

Markov decision processes with unknown transition law; the average return case

by

Kees M. van Hee

Eindhoven, January 1979
The Netherlands
0. ABSTRACT

In this paper we consider some problems and results in the area of Markov decision processes with an incompletely known transition law. We concentrate here on the average return under the Bayes criterion. We discuss easy-to-handle strategies which are optimal under some conditions. For detailed proofs we refer to a monograph published by the present author.
1. INTRODUCTION
In this paper we review a part of [van Hee (1978a)], a monograph dealing with Markov decision processes in discrete time, with an incompletely known transition law. All proofs of statements given here can be found in this monograph. Moreover the monograph contains results for the discounted return case; some of these results are reviewed in [van Hee (1978b)]. We do not bother about measure theoretic problems and therefore we assume all sets to be countable or sometimes even finite; however, we remark that in [van Hee (1978a)] the problems are treated in a general measure theoretic setting.
We start with a description of the model and we discuss some of its properties. A Markov decision process (MDP) with unknown transition law is specified by a 5-tuple

(X, A, Θ, P, r)   (1.1)

where X is the state space, A the action space, Θ the parameter space, P a transition probability from X × A × Θ to X, and r the reward function (i.e. r: X × A → ℝ, where ℝ is the set of real numbers). (We assume r to be bounded if X or A is countable.) The parameter θ ∈ Θ is unknown to the decision maker. At each stage 0, 1, 2, ... the decision maker chooses an action a ∈ A, where he may base his choice on the sequence of past states and actions. A strategy π is a sequence π = (π_0, π_1, π_2, ...) where π_0 is a transition probability from X to A and π_n a transition probability from (X × A)^n × X to A (n ≥ 1). A strategy is called stationary if there is a function f: X → A such that always

π_n({f(x_n)} | x_0, a_0, x_1, a_1, ..., x_n) = 1.
According to the well-known Ionescu Tulcea theorem (cf. [Neveu (1965)]) we have for each starting state x ∈ X, each strategy π ∈ Π and each parameter θ ∈ Θ a probability P^π_{x,θ} on

Ω := (X × A)^ℕ, where ℕ := {0, 1, 2, ...},   (1.2)

and a random process {(X_n, A_n), n ∈ ℕ} where

X_n(ω) := x_n,  A_n(ω) := a_n  if ω = (x_0, a_0, x_1, a_1, ...).   (1.3)

(The expectation with respect to P^π_{x,θ} is denoted by E^π_{x,θ}.) The average return is defined by

g(x, θ, π) := liminf_{N→∞} (1/N) Σ_{n=0}^{N-1} E^π_{x,θ}[r(X_n, A_n)].   (1.4)
It only happens in non-interesting cases that there is a strategy π' ∈ Π such that g(x, θ, π') ≥ g(x, θ, π) for all x ∈ X, θ ∈ Θ and π ∈ Π, so we cannot use this as an optimality criterion.
We have chosen the Bayes criterion (for a motivation cf. [van Hee (1978a)]). Fix some probability q on Θ. (Such a probability is called a prior distribution.) The Bayesian average return with respect to q is defined by

g(x, q, π) := liminf_{N→∞} (1/N) Σ_{n=0}^{N-1} Σ_θ q(θ) E^π_{x,θ}[r(X_n, A_n)].   (1.5)

(Note that the definitions (1.4) and (1.5) are consistent if we identify θ ∈ Θ with the distribution that is degenerate at θ.) The set of all probabilities on Θ will be denoted by W. A strategy π' is called ε-optimal in (x, q) ∈ X × W if g(x, q, π') ≥ g(x, q, π) − ε for all π ∈ Π (a 0-optimal strategy is simply called optimal).
The Bayes criterion allows us to consider the parameter as a random variable Z with range Θ and distribution q on Θ. On Θ × Ω we have the probability P^π_{x,q} determined by

P^π_{x,q}(Z ∈ B, (X_0, A_0, X_1, A_1, ...) ∈ C) = Σ_{θ∈B} P^π_{x,θ}(C) q(θ)   (1.6)

for events B in Θ and C in Ω.

We compute the so-called posterior distributions Q_n of Z in the following way:

Q_n(B) := P^π_{x,q}(Z ∈ B | X_0, A_0, X_1, A_1, ..., X_n, A_n).   (1.8)

(Note that Q_n is determined P^π_{x,q}-a.s.)
Define the probability T_{x,a,x'}(q) on Θ by

T_{x,a,x'}(q)(θ) := p(x' | x, a, θ) q(θ) / Σ_{θ'} p(x' | x, a, θ') q(θ')   (1.9)

(x, x' ∈ X, θ ∈ Θ, a ∈ A). It is possible to choose versions of the posterior distributions such that Q_0 = q and Q_{n+1} = T_{X_n, A_n, X_{n+1}}(Q_n). As indicated by Bellman (cf. [Bellman (1961)]) and proved in a very general setting in [Rieder (1975)], this decision model is equivalent to an MDP with a known transition law, specified by a 4-tuple

(X × W, A, P̄, r̄)   (1.10)

where X × W is the state space, A the action space, P̄ the transition law defined by

P̄(x', T_{x,a,x'}(q) | x, q, a) := Σ_θ q(θ) p(x' | x, a, θ)   (1.11)

and r̄: X × W × A → ℝ the reward function, defined by

r̄(x, q, a) := r(x, a).   (1.12)

Note that the state (x, q) of the new model (1.10) consists of the original state x ∈ X and the "information state" q ∈ W. It turns out that each state (x, q) and each strategy π̄ for the new model define a probability P̄^π̄_{x,q} and a random process {(X_n, Q_n, A_n), n ∈ ℕ} on Ω̄ := (X × W × A)^ℕ. Here

X_n(ω) := x_n,  Q_n(ω) := q_n,  A_n(ω) := a_n,  where ω = (x_0, q_0, a_0, x_1, q_1, a_1, ...) ∈ Ω̄.
The original model (1.1) and the new model (1.10) have the following relationship:

E^π_{x,q}[r(X_n, A_n)] = Ē^π̄_{x,q}[r̄(X_n, Q_n, A_n)]   (1.13)

where π̄ is the strategy for model (1.10) which is defined by

π̄_n(a | x_0, q_0, a_0, ..., x_n, q_n) := π_n(a | x_0, a_0, ..., x_n).   (1.14)

Hence models (1.1) and (1.10) are equivalent and therefore we use the notations of model (1.1).
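In a computational implementation the posterior update (1.9) is a one-line normalization. The sketch below is our illustration (not part of the original memorandum), assuming a finite parameter set and a transition law given as nested dictionaries; `trans_prob[theta][(x, a)][x_next]` plays the role of p(x' | x, a, θ) and all names are ours.

```python
def bayes_update(q, x, a, x_next, trans_prob):
    """Return T_{x,a,x'}(q): the posterior over theta after observing
    the transition from x to x_next under action a, cf. (1.9)."""
    # Unnormalized posterior weight of each theta: likelihood times prior mass.
    weights = {theta: trans_prob[theta][(x, a)].get(x_next, 0.0) * q[theta]
               for theta in q}
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("transition has probability zero under every theta")
    return {theta: w / total for theta, w in weights.items()}
```

Iterating this function along the observed trajectory yields the sequence Q_0 = q, Q_{n+1} = T_{X_n, A_n, X_{n+1}}(Q_n).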
",
So we are dealing with a Markov decision process w~th known transition law again. -However this new MOP has aome odd properties. At first the state space is infinite even if the state space of the original model is finite. Further the new MOP is
transient
1n general. i.e. in most cases Q 1 Q •n m
'It
P - a.s.for all n 1 m. In section 2 we show by an example that even x.q
if
X
and A are finite sets there need not to be an optimal strategy.In the next section we introduce strategies that are easy-to-handle. at least if X, A and 0 are finite sete. and we consider conditions guaranteeing these strategies to be optimal. These conditions imply that the posterior distribut-ions Q converge to degenerate distributdistribut-ions. which property is used explicitely
n
to prove the optimality.
We conclude this section by introducing a parameter structure that is quite general and that facilitates formulating some results.
From now on we assume that we are dealing with the following structure:

(i) X = X̂ × Y;
(ii) R is a transition probability from X̂ × A × Y to X̂;
(iii) K_1, K_2, K_3, ... is a partition of X̂ × A;
(iv) Θ = ∏_{i=1}^∞ Θ_i, i.e. θ = (θ_1, θ_2, θ_3, ...);   (1.15)
(v) P(x', y' | x, y, a, θ) = R(x' | x, a, y') · P_i(y' | θ_i) if and only if (x, a) ∈ K_i,

where P_i(· | θ_i) is a probability on Y. Hence we have factorized the original transition law. If (x, a) ∈ K_i then the transition to the next state (x', y') depends on θ only through its i-th component θ_i. We present below some examples having this structure. It is straightforward to verify that

T_{(x,y),a,(x',y')}(q)(θ) = Σ_{i=1}^∞ 1_{K_i}(x, a) · P_i(y' | θ_i) q(θ) / Σ_{θ'} P_i(y' | θ'_i) q(θ')   (1.16)

(provided that the denominator does not vanish). Here 1_B represents the indicator function of the set B.
Although this parameter structure seems to be rather complicated, there are practical situations where it occurs in a natural way. The state of the system at stage n is X_n = (X̂_n, Y_n). The state component Y_n is called the supplementary state variable. It can be proved that if θ ∈ Θ is known, then it is sufficient to consider X̂_n instead of X_n (see [van Hee (1978a), page 52]). So X̂_n has to be considered as the original state variable if the parameter is known, while Y_n only occurs since it contains information concerning the unknown parameter. From now on we are dealing with this parameter structure and we shall consider only X̂ and X̂_n; to facilitate notations we omit the hat from now on.

Example 1
Consider an inventory control model without backlogging. If the demand distribution is known, the inventory level may be chosen as the state variable. However, if the demand distribution is unknown, the sequence of successive inventory levels does not reflect the sequence of successive demands, and therefore we have to consider the demand in each period as a supplementary state variable. Here X is the set of inventory levels, i.e. X = [0, ∞), and Y is the set of possible demands. The transition function is

R(x' | x, a, y') = 1 if x' = max{a − y', 0} and a ≥ x,
R(x' | x, a, y') = 0 otherwise.

(Hence the action a is the inventory level after ordering.) The sets K_2, K_3, ... are empty and P_1(· | θ_1) is the demand distribution with unknown parameter θ_1 ∈ Θ_1.

Example 2

Consider a waiting line model with bulk arrivals. At each time point 0, 1, 2, ...
a group of customers arrives and the distribution of the size of the group is unknown. The service time distribution is exponential with parameter a, which is controlled by the action a. Let y' be the number of customers arriving in some period, let x be the queue length at the beginning of that period and x' at the end. Then, if c := x + y' − x' ≥ 0, we have

R(x' | x, a, y') = (a^c / c!) e^{−a}

and if c < 0 then R(x' | x, a, y') = 0. Further K_2, K_3, ... are empty and P_1(y' | θ_1) is the probability of a group of size y'.
Example 3

Consider a linear system with random disturbances. The state at stage n is X_n and the disturbance at stage n is Y_n. Then

X_{n+1} = C_1 X_n + C_2 A_n + Y_{n+1}

where X, Y and A are Euclidean spaces and C_1 and C_2 suitable matrices. Assume that {Y_n, n ∈ ℕ} forms a sequence of i.i.d. random variables with an incompletely known distribution. If only the sequence (X_0, X_1, X_2, ...) is observable to the controller, then he may reconstruct the sequence of supplementary state variables.
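The reconstruction mentioned above is simply Y_{n+1} = X_{n+1} − C_1 X_n − C_2 A_n. A small illustrative sketch (ours, with hypothetical matrices and plain-list linear algebra to stay self-contained):

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def reconstruct_disturbances(states, actions, C1, C2):
    """Recover Y_1, Y_2, ... from the observed states X_0, X_1, ...
    and the applied actions A_0, A_1, ... via Y_{n+1} = X_{n+1} - C1 X_n - C2 A_n."""
    return [[x1 - cx - ca
             for x1, cx, ca in zip(states[n + 1],
                                   matvec(C1, states[n]),
                                   matvec(C2, actions[n]))]
            for n in range(len(states) - 1)]
```

The recovered disturbances can then be fed into the posterior update for the unknown disturbance distribution.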
2. OPTIMAL STRATEGIES

We start this section with an example showing that there need not be an optimal strategy even if X and A are finite sets.
Example 4

Let X := {1, 2, 3, 4, 5, 6}, A := {1, 2, 3}, Y := {0, 1} and Θ_1 := (0, 1). (Note that Θ_1 is not countable here, but if we replace Θ_1 by a countable subset of (0, 1) the same arguments are valid; however, notations become more difficult.) Further let R(x' | x, a, y') be defined by:

R(3 | 3, a, 1) = R(4 | 3, a, 0) = R(4 | 4, a, 1) = R(3 | 4, a, 0) = R(5 | 5, a, 1) = R(6 | 5, a, 0) = R(5 | 6, a, 1) = R(6 | 6, a, 0) = 1 for all a ∈ A,

and let

R(1 | 1, 1, 1) = R(2 | 1, 1, 0) = 1,
R(3 | 1, 2, y) = R(5 | 1, 3, y) = R(3 | 2, 2, y) = R(5 | 2, 3, y) = 1 for all y ∈ Y.

Finally K_2, K_3, ... are empty and P_1(1 | θ) = θ, P_1(0 | θ) = 1 − θ.
The example can be represented in a transition diagram (not reproduced here). Only in states 1 and 2 does the chosen action have effect. The rewards obtained are r(3) = r(5) = 7 and r(4) = r(6) = 3 for all actions; in the other states no rewards are obtained. The average return in the sub-chain {3, 4} is (7 + 3)/2 = 5 and in the sub-chain {5, 6} it is 7θ + 3(1 − θ) = 4θ + 3. Consider a starting state x ∈ {1, 2}. It is easy to verify that for known θ ∈ Θ_1 the optimal action is a maximizer of the function 5·δ(2, a) + (4θ + 3)·δ(3, a) (where δ is the Kronecker function), a ∈ {2, 3}. It is also straightforward to verify that if we have to choose one of the actions 2 or 3 and if q is the prior distribution, then the maximizer of 5·δ(2, a) + {4 ∫ θ q(dθ) + 3}·δ(3, a), a ∈ {2, 3}, is the best choice.
Let π^n be the strategy that chooses action 1 the first n times and thereafter the maximizer of the function

5 δ(2, a) + {4 ∫ θ Q_n(dθ) + 3} δ(3, a),  a ∈ {2, 3},

where Q_n is the posterior distribution at time n if the system starts in state 1 with prior distribution q. Then the Bayesian average return in states 1 and 2 is

E_q[max{5, 4 ∫ θ Q_n(dθ) + 3}]

(this expectation does not depend on the starting state and the strategy). Note that

E_q[max{5, 4 ∫ θ Q_{n+1}(dθ) + 3} | Q_1, ..., Q_n] ≥
≥ max{5, 4 E_q[∫ θ Q_{n+1}(dθ) | Q_1, ..., Q_n] + 3} = max{5, 4 ∫ θ Q_n(dθ) + 3},

with equality if and only if 5 ≥ 4 ∫ θ Q_{n+1}(dθ) + 3, P_q-a.s. However, if q gives positive mass to the set {θ ∈ Θ_1 | θ > ½} the equality never holds. Hence in this case the strategy π^n is worse than π^{n+1} and consequently there is no optimal strategy.
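For a uniform prior on Θ_1 the increase in n can be computed exactly: under a uniform prior the number of ones k among n observations is uniformly distributed on {0, ..., n} and the posterior mean of θ given k is (k + 1)/(n + 2). The following exact computation (our illustration, not in the original) shows the Bayesian average return of π^n climbing towards, but never reaching, its supremum E_q[max{5, 4θ + 3}] = 5.5:

```python
from fractions import Fraction

def value(n):
    """E_q[max{5, 4 * E[theta | n observations] + 3}] under a uniform prior:
    the success count k is uniform on {0,...,n}, posterior mean (k+1)/(n+2)."""
    total = Fraction(0)
    for k in range(n + 1):
        post_mean = Fraction(k + 1, n + 2)
        total += Fraction(1, n + 1) * max(Fraction(5), 4 * post_mean + 3)
    return total
```

For example value(0) = 5 and value(1) = 16/3, and the sequence is nondecreasing in n (by the martingale property of the posterior mean and Jensen's inequality) while staying strictly below 11/2.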
We first introduce two assumptions:

(i) r is bounded on X × A;
(ii) there are bounded functions g and h, on Θ and X × Θ respectively, such that

h(x, θ) + g(θ) = sup_{a∈A} L(x, a, θ),   (2.1)

where

L(x, a, θ) := Σ_{i=1}^∞ 1_{K_i}(x, a) { r(x, a) + Σ_{y'} Σ_{x'} R(x' | x, a, y') P_i(y' | θ_i) h(x', θ) }.

For models with known parameter value these conditions are the well-known conditions considered in [Derman (1966)] and [Ross (1968)]. In that case there exist stationary optimal strategies, at least if A is finite. Moreover the optimal average return is g(θ).
In [Ross (1968)] several situations are given where (2.1)(ii) holds for fixed parameter value θ. For instance, if X and A are finite sets, and for each stationary strategy the resulting process is an irreducible Markov chain, then (2.1)(ii) is valid. Let {ε_n, n ∈ ℕ} be a sequence of nonnegative real numbers such that lim_{n→∞} ε_n = 0. The strategies we consider choose at each stage n an ε_n-maximizer of

Σ_θ L(X_n, a, θ) Q_n(θ),  a ∈ A.   (2.2)

We call these rules Bayesian equivalent rules, because we are maximizing the "Bayesian equivalent" of the function we have to maximize in case the parameter is known. Note that we may choose ε_n = 0, n ∈ ℕ, if there is an actual maximizer of the function (2.2). In example 4 we already encountered these strategies, so it is clear by the example that these strategies are not optimal in general. However, we give at the end of this section conditions guaranteeing these strategies to be optimal.
In [van Hee (1978a)] we consider these strategies also for the discounted total return case, and we prove there that these strategies are optimal for the linear system with quadratic costs (in discrete time) and also for some inventory control models. There we also consider bounds on the discounted total return of these strategies.
In [Fox and Rolph (1973)], [Mandl (1974)] and [Georgin (1978)] another heuristic strategy is considered, which turns out to be optimal in many situations. This strategy can be formulated in the following way: "At each stage estimate the unknown parameter θ, using the available data, by θ̂. Then compute an optimal (stationary) strategy for the model where the parameter is known and equal to θ̂. Then use the corresponding action in the actual state. Repeat this procedure at each stage."

Hence, if we consider Bayes estimates, then the method proposed by these authors may be formulated in the following way: choose at each stage n a maximizer of

L(X_n, a, Σ_θ θ Q_n(θ)),  a ∈ A   (2.3)

(here we assumed the parameter set to be a subset of ℝ).
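The difference between rule (2.2) and rule (2.3) is easy to exhibit computationally. In the hypothetical sketch below (ours; `L` stands in for the function L of (2.1), given as a callable over a finite parameter set) the two rules disagree whenever L is sufficiently nonlinear in θ:

```python
def bayes_equivalent_action(x, q, L, actions):
    """Rule (2.2): maximize the posterior expectation of L(x, a, theta)."""
    return max(actions, key=lambda a: sum(p * L(x, a, t) for t, p in q.items()))

def certainty_equivalent_action(x, q, L, actions):
    """Rule (2.3): plug the posterior mean of theta into L (theta real-valued)."""
    theta_hat = sum(p * t for t, p in q.items())
    return max(actions, key=lambda a: L(x, a, theta_hat))
```

For example, with q uniform on {0, 1} and L(x, 1, θ) = (2θ − 1)^2, rule (2.2) values action 1 at 1 while rule (2.3) values it at 0, so the two rules choose differently.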
To prove the optimality of Bayesian equivalent rules we need the following limit theorem.

Theorem 1. If Σ_{n=0}^∞ 1_{K_i}(X_n, A_n) = ∞ (P^π_{x,q}-a.s.), then

lim_{n→∞} Σ_θ Q_n(θ) f(θ_i) = f(Z_i)  (P^π_{x,q}-a.s.),  i = 1, 2, 3, ...,

for all bounded functions f on Θ_i (where Z = (Z_1, Z_2, Z_3, ...), cf. (1.15)(iv)).

Note that theorem 1 gives a sufficient condition for the consistency of the Bayes estimation procedure.
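The consistency can be illustrated numerically for a single Bernoulli component: when a component is observed often enough, the posterior concentrates on the parameter value that fits the data. The setup below is our own illustration (finite grid for Θ_i, log-likelihoods for numerical stability):

```python
import math

def posterior_after(counts, thetas, prior):
    """Posterior over a finite grid of Bernoulli parameters after observing
    counts = (#ones, #zeros), computed via normalized log-likelihoods."""
    ones, zeros = counts
    logw = [math.log(p) + ones * math.log(t) + zeros * math.log(1 - t)
            for t, p in zip(thetas, prior)]
    m = max(logw)                      # subtract the max to avoid underflow
    w = [math.exp(v - m) for v in logw]
    s = sum(w)
    return [v / s for v in w]
```

With 700 ones and 300 zeros and the grid {0.3, 0.5, 0.7}, essentially all posterior mass ends up on 0.7.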
We introduce the function φ on X × Θ × A:

φ(x, θ, a) := L(x, a, θ) − h(x, θ) − g(θ).

Note that φ(x, θ, a) ≤ 0 for all x ∈ X, θ ∈ Θ, a ∈ A. Further we extend φ to a function on X × W × A, and likewise the functions h and g:

(i) φ(x, q, a) := Σ_θ q(θ) φ(x, θ, a),
(ii) h(x, q) := Σ_θ q(θ) h(x, θ),   (2.4)
(iii) g(q) := Σ_θ q(θ) g(θ).

Note that these definitions are consistent with (2.1)(ii) if we identify θ with the distribution that is degenerate at θ.
The next theorem provides us with a sufficient condition for a strategy π to be optimal.

Theorem 2. If

liminf_{N→∞} (1/N) Σ_{n=0}^{N-1} E^π_{x,q}[φ(X_n, Q_n, A_n)] = 0

then

liminf_{N→∞} (1/N) Σ_{n=0}^{N-1} E^π_{x,q}[r(X_n, A_n)] = g(q)

and π is optimal.

Theorems 1 and 2 are the key tools for proving optimality of the Bayesian equivalent rules.
As an illustration we shall consider two special cases of more general theorems. For the first we provide a proof to demonstrate the technique.
Theorem 3. Let X and A be finite sets, let M_1, M_2, ..., M_m be a partition of X and let K_i = M_i × A, i = 1, ..., m (K_i = ∅ for i > m). Let the Markov chain {X_n, n ∈ ℕ} be irreducible for each stationary strategy and each parameter value. Then a strategy π* that chooses at stage n a maximizer of the function φ(X_n, Q_n, a), a ∈ A, is optimal. (Note that a maximizer of φ(X_n, Q_n, a) is a maximizer of Σ_θ L(X_n, a, θ) Q_n(θ).)

Proof. Let A_n be the action at stage n under strategy π*. Hence

0 ≥ φ(X_n, Q_n, A_n) = max_{a∈A} Σ_θ Q_n(θ) φ(X_n, θ, a) ≥ max_{f∈F} Σ_θ Q_n(θ) min_{x∈X} φ(x, θ, f(x)),

where F is the set of all functions from X to A. Note that F represents the set of all stationary strategies for the model with known parameter value. Since the Markov chain {X_n, n ∈ ℕ} is irreducible for each stationary strategy, it can be proved (cf. lemma 4.7 in [van Hee (1978a)]) that the number of visits to each set M_i is almost surely infinite for all strategies. Therefore the condition of theorem 1 is fulfilled for all i, and so we have

lim_{n→∞} Σ_θ Q_n(θ) min_{x∈X} φ(x, θ, f(x)) = min_{x∈X} φ(x, Z, f(x)).

Since there is a stationary optimal strategy for the model with known parameter value (cf. [Ross (1968)]) we have

max_{f∈F} min_{x∈X} φ(x, θ, f(x)) = 0 for all θ ∈ Θ.

Therefore we find

0 ≥ liminf_{n→∞} φ(X_n, Q_n, A_n) ≥ lim_{n→∞} max_{f∈F} Σ_θ Q_n(θ) min_{x∈X} φ(x, θ, f(x)) = 0,

and so

lim_{N→∞} (1/N) Σ_{n=0}^{N-1} φ(X_n, Q_n, A_n) = 0.

Application of theorem 2 gives the desired result.
In this example we assumed that the information we obtain after the transition about the unknown parameter does not depend on the action chosen. In the next example we relax this assumption. Here we assume:

(i) M_1, ..., M_m is a partition of X and N_1, ..., N_n is a partition of A;
(ii) for each stationary strategy the Markov chain {X_t, t ∈ ℕ} is irreducible;   (2.5)
(iii) the partition K_1, K_2, ... of X × A consists of the sets M_j × N_k, j = 1, ..., m, k = 1, ..., n.
Before we consider the strategy that turns out to be optimal, we first introduce the concept of a sequence of density zero. A sequence s = (s_1, s_2, ...) of positive integers is said to be of density zero if

limsup_{n→∞} (1/n) max{k | s_k ≤ n} = 0.

Examples of such sequences are (s_i = 2^i, i ∈ ℕ) and (s_i = i^2, i ∈ ℕ).
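The density-zero condition is easy to check numerically; the sketch below (our illustration) evaluates (1/n) max{k | s_k ≤ n} for a given sequence:

```python
def density(s, n):
    """(1/n) * max{k >= 1 : s(k) <= n}, taking the max as 0 if s(1) > n.
    `s` maps the index k to the k-th element of the sequence."""
    k = 0
    while s(k + 1) <= n:
        k += 1
    return k / n
```

For s_i = i^2 this gives floor(sqrt(n))/n, and for s_i = 2^i roughly log2(n)/n, both tending to 0; for the full sequence s_i = i it stays equal to 1, so that sequence is not of density zero.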
In theorem 4 we consider a strategy that is inspired by an idea in [Mallows and Robbins (1964)]. In [Fox and Rolph (1973)] this idea is used in a similar way for Markov renewal programs. The idea is that we make use of forced choice actions to guarantee that we return to each set K_i infinitely often, which is necessary to apply theorem 1. However, we do this with a frequency that is so low as not to influence the Bayesian average return. (In fact the concept of a sequence of density zero is used here.)
Now we are ready to formulate in an informal way the strategy π̂ that will be optimal. In (2.6) this strategy is sketched:

Fix (forced choice) actions a_1, a_2, ..., a_n such that a_k ∈ N_k, k = 1, ..., n.
Let t_i(n) be the number of visits to the set M_i at stage n.
If X_n ∈ M_i and t_i(n) ∈ S for some i ∈ {1, ..., m}, then the next action in the sequence (a_1, ..., a_n) is chosen (the forced choice actions are chosen in order);   (2.6)
otherwise, if t_i(n) ∉ S, then a maximizer of the function φ(X_n, Q_n, a), a ∈ A, is chosen.

Hence π̂ uses the same actions as the strategy in theorem 3, except for stages where t_i(n) ∈ S; then the forced choice actions are used. The proof of theorem 4 is rather technical, although the idea is simple.

Theorem 4
Let (2.5) hold. The strategy π̂ defined in (2.6) is optimal.

This result is also true in a more abstract model. In fact it completes results of [Mandl (1974)] and it generalizes work of [Rose (1975)].
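The bookkeeping of strategy (2.6) can be sketched as follows. This skeleton is our hypothetical illustration: it takes S = {k^2, k ∈ ℕ}, lets the state cycle deterministically through the sets M_1, ..., M_m as a stand-in for a real trajectory, and only counts how often a forced choice would be triggered (the comment marks where a maximizer of φ(X_n, Q_n, ·) would be played):

```python
import math

def is_in_S(t):
    """Membership in the density-zero set S = {1, 4, 9, ...}."""
    r = math.isqrt(t)
    return t >= 1 and r * r == t

def count_forced(num_steps, num_sets):
    """Count the stages at which strategy (2.6) makes a forced choice,
    assuming the state visits the sets M_1, ..., M_m cyclically."""
    visits = [0] * num_sets
    forced = 0
    for n in range(num_steps):
        i = n % num_sets          # index of the set M_i containing X_n
        visits[i] += 1            # t_i(n) after this visit
        if is_in_S(visits[i]):
            forced += 1           # play the next forced choice action, in order
        # else: play a maximizer of phi(X_n, Q_n, .)  (the rule of theorem 3)
    return forced
```

Because S has density zero, the forced choices occur infinitely often (each M_i keeps being revisited) yet make up a vanishing fraction of all stages, which is exactly why they do not affect the Bayesian average return.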
3. LITERATURE

Bellman, R., Adaptive control processes: a guided tour. Princeton (N.J.), Princeton University Press (1961).

Derman, C., Denumerable state Markovian decision processes - average cost criterion. Ann. Math. Statist. 37 (1966), 1545-1554.

Fox, B.L. and Rolph, J.E., Adaptive policies for Markov renewal programs. Ann. Math. Statist. 35 (1973), 846-856.

Georgin, J.P., Estimation et contrôle des chaînes de Markov sur des espaces arbitraires. In: Lecture Notes in Mathematics 636, Springer-Verlag, Berlin etc. (1978).

Van Hee, K.M., Bayesian control of Markov chains. Amsterdam, Mathematical Centre Tracts 95 (1978a).

Van Hee, K.M., Markov decision processes with unknown transition law; the discounted case. To appear in: Kybernetica (1978b).

Mallows, C.L. and Robbins, H., Some problems of optimal sampling strategy. J. Math. Anal. Appl. (1964), 90-103.

Mandl, P., Estimation and control in Markov chains. Advances in Appl. Probability (1974), 40-60.

Neveu, J., Mathematical foundations of the calculus of probability. San Francisco etc., Holden-Day (1965).

Rieder, U., Bayesian dynamic programming. Advances in Appl. Probability 7 (1975), 330-348.

Rose, J.S., Markov decision processes under uncertainty - average return criterion. Unpublished report (1975).

Ross, S.M., Arbitrary state Markovian decision processes. Ann. Math. Statist. 39 (1968), 2118-2122.