
MARKOV DECISION PROCESSES

LODEWIJK KALLENBERG

UNIVERSITY OF LEIDEN


Preface

Branching out from operations research roots of the 1950s, Markov decision processes (MDPs) have gained recognition in such diverse fields as economics, telecommunication, engineering and ecology. These applications have been accompanied by many theoretical advances. Markov decision processes, also referred to as stochastic dynamic programming or stochastic control problems, are models for sequential decision making when outcomes are uncertain. The Markov decision process model consists of decision epochs, states, actions, transition probabilities and rewards. Choosing an action in a state generates a reward and determines the state at the next decision epoch through a transition probability function. Policies or strategies are prescriptions of which action to choose under any eventuality at every future decision epoch. Decision makers seek policies which are optimal in some sense.

These lecture notes aim to present a unified treatment of the theoretical and algorithmic aspects of Markov decision process models. They can serve as a text for an advanced undergraduate or graduate level course in operations research, econometrics or control engineering. As a prerequisite, the reader should have some background in linear algebra, real analysis, probability, and linear programming. Throughout the text there are many examples. At the end of each chapter there is a section with bibliographic notes and a section with exercises. A solution manual is available on request (e-mail to kallenberg@math.leidenuniv.nl).

Chapter 1 introduces the Markov decision process model as a sequential decision model with actions, transitions, rewards and policies. We illustrate these concepts with nine different applications: red-black gambling, how-to-serve in tennis, optimal stopping, replacement problems, maintenance and repair, production control, optimal control of queues, stochastic scheduling, and the multi-armed bandit problem.

Chapter 2 deals with the finite horizon model with nonstationary transitions and rewards, and the principle of dynamic programming: backward induction. We present an equivalent stationary infinite horizon model. We also study under which conditions optimal policies are monotone, i.e. nondecreasing or nonincreasing in the ordering of the state space.

In chapter 3 the discounted rewards over an infinite horizon are studied. This results in the optimality equation and methods to solve this equation: policy iteration, linear programming, value iteration and modified value iteration. Furthermore, we study under which conditions monotone optimal policies exist.

Chapter 4 discusses the total rewards over an infinite horizon under the assumption that the transition matrices are substochastic. We first present some background material on square matrices, eigenvalues and the spectral radius. Then, we introduce the linear program and its correspondence to policies. We derive equivalent statements for the properties that the model is a so-called contracting or normalized dynamic programming model. Next, we present the optimality equation and results on the computation of optimal transient policies. For contracting dynamic programming, results and algorithms can be formulated which are similar to those for the discounted reward model. Special sections are devoted to finite horizon and transient MDPs, to positive, negative and convergent MDPs, and to special models such as red-black gambling and the optimal stopping problem.

Chapter 5 discusses the criterion of average rewards over an infinite horizon, in the most general case. Firstly, polynomial algorithms are developed to classify MDPs as irreducible or communicating. The distinction between unichain and multichain turns out to be NP-complete, so there is no hope of a polynomial algorithm. Then, the stationary, the fundamental and the deviation matrices are introduced, and their internal relations and properties are derived. Next, an extension of a theorem by Blackwell and the Laurent series expansion are presented. These results are fundamental for analyzing the relation between discounted, average and more sensitive optimality criteria. With these results, as in the discounted case but via a more complicated analysis, the optimality equation is derived and methods to solve this equation are presented (policy iteration, linear programming and value iteration).

In chapter 6 special cases of the average reward criterion (irreducible, unichain and communicating) are considered. In all these cases the optimality equation and the methods of policy iteration, linear programming and value iteration can be simplified. Furthermore, we present the method of modified value iteration for these special cases.

Chapter 7 introduces more sensitive optimality criteria: bias optimality, n-discount and n-average optimality, and Blackwell optimality. The criteria of n-discount and n-average optimality are equivalent. We present a unifying framework, based on the Laurent series expansion, to derive sensitive discount optimality equations. Using a lexicographic ordering of the Laurent series, we derive the policy iteration method for n-discount optimality. In the irreducible case, one can derive a sequence of nested linear programs to compute n-discount optimal policies for any n.

Also for Blackwell optimality, even in the most general case, linear programming can be applied. However, then the elements are not real numbers, but lie in a much more general ordered field, namely an ordered field of rational functions. For bias optimality, an optimal policy can be found with a three-step linear programming approach. When in addition the model is a unichain MDP, the linear programs for bias optimality can be simplified. In this unichain case, we also derive a simple policy iteration method and turnpike results. The last sections of this chapter deal with some special optimality criteria. We consider overtaking, average overtaking and cumulative overtaking optimality. A next section deals with a weighted combination of the total discounted rewards and the long-run average rewards. For this criterion an optimal policy might not exist, even when we allow nonstationary randomized policies. We present an iterative algorithm for computing an ε-optimal nonstationary policy with a simple structure. Finally, we study an optimality criterion which is the sum of expected total discounted rewards with different one-step rewards and discount factors. It turns out that for this criterion an optimal deterministic policy exists which is nonstationary at first and eventually becomes stationary. We present an algorithm to compute such a policy.

In chapter 8, six of the applications introduced in chapter 1 (replacement problems, maintenance and repair, production and inventory control, optimal control of queues, stochastic scheduling and multi-armed bandit problems) are analyzed in much more detail. In most cases theoretical and computational (algorithmic) results are presented. It turns out that in many cases polynomial algorithms exist, e.g. of order $O(N^3)$, where N is the number of states. Finally, we present separable MDP problems.

Chapter 9 deals with some other topics. We start with complexity results (e.g. MDPs are P-complete, deterministic MDPs are in NC), additional constraints (for discounted and average rewards, and for MDPs with a sum of discounted rewards with different discount factors) and multiple objectives (both for discounted and for average MDPs). Then, the linear program approach for average rewards is revisited. Next, we consider mean-variance tradeoffs, followed by deterministic MDPs (models in which each action determines the next state with probability 1). In the last section of this chapter semi-Markov decision problems are analyzed.

The subject of the last chapter (chapter 10) is stochastic games, particularly the two-person zero-sum stochastic game. In such a game, both players may choose actions from their own action sets, resulting in transitions and rewards determined by both players. Zero-sum means that the reward for player 1 has to be paid by player 2. Hence, there is a conflicting situation: player 1 wants to maximize the rewards, while player 2 tries to minimize them. We discuss the value of the game and the concept of optimal policies for discounted, total as well as average rewards. We also derive mathematical programming formulations and iterative methods. In some special cases we can present finite solution methods to find the value and optimal policies. In the last section before the bibliographic notes and the exercises we discuss two-person general-sum stochastic games, in which each player has his own reward function and tries to maximize his own payoff.

These lecture notes use a lot of material, collected over the years and from various sources. The bibliographic notes refer to many books, papers and reports. I close this preface by expressing my gratitude to Arie Hordijk, who introduced me to the topic of MDPs. Furthermore, he was my supervisor and, after my PhD, a colleague for many years.

Lodewijk Kallenberg

Leiden, October, 2016.


Contents

1 Introduction 1

1.1 The MDP model . . . . 1

1.2 Policies and optimality criteria . . . . 3

1.2.1 Policies . . . . 3

1.2.2 Optimality criteria . . . . 7

1.3 Examples . . . 15

1.3.1 Red-black gambling . . . 15

1.3.2 Gaming: How to serve in tennis . . . 16

1.3.3 Optimal stopping . . . 17

1.3.4 Replacement problems . . . 18

1.3.5 Maintenance and repair . . . 19

1.3.6 Production control . . . 20

1.3.7 Optimal control of queues . . . 21

1.3.8 Stochastic scheduling . . . 22

1.3.9 Multi-armed bandit problem . . . 23

1.4 Bibliographic notes . . . 24

1.5 Exercises . . . 26

2 Finite Horizon 29

2.1 Introduction . . . 29

2.2 Backward induction . . . 30

2.3 An equivalent stationary infinite horizon model . . . 32

2.4 Monotone optimal policies . . . 33

2.5 Bibliographic notes . . . 39

2.6 Exercises . . . 40

3 Discounted rewards 43

3.1 Introduction . . . 43

3.2 Monotone contraction mappings . . . 44

3.3 The optimality equation . . . 48

3.4 Policy iteration . . . 53

3.5 Linear programming . . . 60


3.6 Value iteration . . . 72

3.7 Value iteration and bisection . . . 85

3.8 Modified Policy Iteration . . . 88

3.9 Monotone optimal policies . . . 96

3.10 Bibliographic notes . . . 100

3.11 Exercises . . . 101

4 Total reward 107

4.1 Introduction . . . 107

4.2 Square matrices, eigenvalues and spectral radius . . . 109

4.3 The linear program . . . 115

4.4 Transient, contracting, excessive and normalized MDPs . . . 116

4.5 The optimality equation . . . 125

4.6 Optimal transient policies . . . 127

4.7 The contracting model . . . 132

4.8 Finite horizon and transient MDPs . . . 136

4.9 Positive MDPs . . . 140

4.10 Negative MDPs . . . 147

4.11 Convergent MDPs . . . 151

4.12 Special models . . . 154

4.12.1 Red-black gambling . . . 154

4.12.2 Optimal stopping . . . 156

4.13 Bibliographic notes . . . 160

4.14 Exercises . . . 161

5 Average reward - general case 165

5.1 Introduction . . . 165

5.2 Classification of MDPs . . . 166

5.2.1 Definitions . . . 166

5.2.2 Classification of Markov chains . . . 167

5.2.3 Classification of Markov decision chains . . . 167

5.3 Stationary, fundamental and deviation matrix . . . 173

5.3.1 The stationary matrix . . . 173

5.3.2 The fundamental matrix and the deviation matrix . . . 176

5.4 Extension of Blackwell’s theorem . . . 180

5.5 The Laurent series expansion . . . 181

5.6 The optimality equation . . . 182

5.7 Policy iteration . . . 184

5.8 Linear programming . . . 192

5.9 Value iteration . . . 202

5.10 Bibliographic notes . . . 211


5.11 Exercises . . . 212

6 Average reward - special cases 215

6.1 The irreducible case . . . 215

6.1.1 Optimality equation . . . 216

6.1.2 Policy iteration . . . 217

6.1.3 Linear programming . . . 218

6.1.4 Value iteration . . . 224

6.1.5 Modified policy iteration . . . 224

6.2 The unichain case . . . 228

6.2.1 Optimality equation . . . 228

6.2.2 Policy iteration . . . 229

6.2.3 Linear programming . . . 234

6.2.4 Value iteration . . . 243

6.2.5 Modified policy iteration . . . 247

6.3 The communicating case . . . 251

6.3.1 Optimality equation . . . 252

6.3.2 Policy iteration . . . 252

6.3.3 Linear programming . . . 254

6.3.4 Value iteration . . . 258

6.3.5 Modified value iteration . . . 258

6.4 Bibliographic notes . . . 261

6.5 Exercises . . . 261

7 More sensitive optimality criteria 265

7.1 Introduction . . . 265

7.2 Equivalence between n-discount and n-average optimality . . . 266

7.3 Stationary optimal policies and optimality equations . . . 268

7.4 Lexicographic ordering of Laurent series . . . 271

7.5 Policy iteration for n-discount optimality . . . 274

7.6 Linear programming and n-discount optimality (irreducible case) . . . 279

7.6.1 Average optimality . . . 280

7.6.2 Bias optimality . . . 280

7.6.3 n-discount optimality . . . 282

7.7 Blackwell optimality and linear programming . . . 283

7.8 Bias optimality and policy iteration (unichain case) . . . 288

7.9 Bias optimality and linear programming . . . 289

7.9.1 The general case . . . 289

7.9.2 The unichain case . . . 297

7.10 Turnpike results and bias optimality (unichain case) . . . 298

7.11 Overtaking, average overtaking and cumulative overtaking optimality . . . 302


7.12 A weighted combination of discounted and average rewards . . . 303

7.13 A sum of discount factors . . . 310

7.14 Bibliographic notes . . . 315

7.15 Exercises . . . 316

8 Special models 321

8.1 Replacement problems . . . 322

8.1.1 A general replacement model . . . 322

8.1.2 A replacement model with increasing deterioration . . . 325

8.1.3 Skip to the right model with failure . . . 327

8.1.4 A separable replacement problem . . . 328

8.2 Maintenance and repair problems . . . 329

8.2.1 A surveillance-maintenance-replacement model . . . 329

8.2.2 Optimal repair allocation in a series system . . . 331

8.2.3 Maintenance of systems composed of highly reliable components . . . 334

8.3 Production and inventory control . . . 344

8.3.1 No backlogging . . . 344

8.3.2 Backlogging . . . 346

8.3.3 Inventory control and single-critical-number policies . . . 348

8.3.4 Inventory control and (s, S)-policies . . . 350

8.4 Optimal control of queues . . . 355

8.4.1 The single-server queue . . . 355

8.4.2 Parallel queues . . . 358

8.5 Stochastic scheduling . . . 359

8.5.1 Maximizing finite-time returns on a single processor . . . 359

8.5.2 Optimality of the µc-rule . . . 360

8.5.3 Optimality of threshold policies . . . 361

8.5.4 Optimality of join-the-shortest-queue policies . . . 362

8.5.5 Optimality of LEPT and SEPT policies . . . 364

8.5.6 Maximizing finite-time returns on two processors . . . 369

8.5.7 Tandem queues . . . 369

8.6 Multi-armed bandit problems . . . 372

8.6.1 Introduction . . . 372

8.6.2 A single project with a terminal reward . . . 372

8.6.3 Multi-armed bandits . . . 374

8.6.4 Methods for the computation of the Gittins indices . . . 377

8.7 Separable problems . . . 386

8.7.1 Introduction . . . 386

8.7.2 Examples (part 1) . . . 387

8.7.3 Discounted rewards . . . 388


8.7.4 Average rewards - unichain case . . . 390

8.7.5 Average rewards - general case . . . 392

8.7.6 Examples (part 2) . . . 399

8.8 Bibliographic notes . . . 401

8.9 Exercises . . . 403

9 Other topics 407

9.1 Complexity results . . . 408

9.1.1 Complexity theory . . . 408

9.1.2 MDPs are P-complete . . . 413

9.1.3 DMDPs are in N C . . . 415

9.1.4 For discounted MDPs, the policy iteration and linear programming method are strongly polynomial . . . 419

9.1.5 Value iteration for discounted MDPs . . . 433

9.2 Additional constraints . . . 436

9.2.1 Introduction . . . 436

9.2.2 Infinite horizon and discounted rewards . . . 436

9.2.3 Infinite horizon and total rewards . . . 446

9.2.4 Infinite horizon and total rewards for transient MDPs . . . 449

9.2.5 Finite horizon . . . 450

9.2.6 Infinite horizon and average rewards . . . 451

9.2.7 Constrained MDPs with sum of discounted rewards and different discount factors . . . 464

9.2.8 Constrained discounted MDPs with two discount factors . . . 475

9.3 Multiple objectives . . . 480

9.3.1 Multi-objective linear programming . . . 481

9.3.2 Discounted rewards . . . 483

9.3.3 Average rewards . . . 485

9.4 The linear program approach for average rewards revisited . . . 496

9.5 Mean-variance tradeoffs . . . 502

9.5.1 Formulations of the problem . . . 502

9.5.2 A unifying framework . . . 503

9.5.3 Determination of an optimal solution . . . 504

9.5.4 Determination of an optimal policy . . . 508

9.5.5 The unichain case . . . 511

9.5.6 Finite horizon variance-penalized MDPs . . . 512

9.6 Deterministic MDPs . . . 520

9.6.1 Introduction . . . 520

9.6.2 Average costs . . . 520

9.6.3 Discounted costs . . . 525


9.7 Semi-Markov decision processes . . . 532

9.7.1 Introduction . . . 532

9.7.2 Model formulation . . . 533

9.7.3 Examples . . . 534

9.7.4 Discounted rewards . . . 536

9.7.5 Average rewards - general case . . . 541

9.7.6 Average rewards - special cases . . . 548

9.7.7 Continuous-time Markov decision processes . . . 558

9.8 Bibliographic notes . . . 564

9.9 Exercises . . . 566

10 Stochastic Games 569

10.1 Introduction . . . 570

10.1.1 The model . . . 570

10.1.2 Optimality criteria . . . 571

10.1.3 Matrix games . . . 571

10.1.4 Bimatrix games . . . 574

10.2 Discounted rewards . . . 576

10.2.1 Value and optimal policies . . . 576

10.2.2 Mathematical programming . . . 588

10.2.3 Iterative methods . . . 589

10.2.4 Finite methods . . . 597

10.3 Total rewards . . . 611

10.3.1 Value and optimal policies . . . 611

10.3.2 Mathematical programming . . . 612

10.3.3 Single-controller stochastic game: the transient case . . . 614

10.3.4 Single-controller stochastic game: the general case . . . 618

10.4 Average rewards . . . 621

10.4.1 Value and optimal policies . . . 621

10.4.2 The Big Match . . . 622

10.4.3 Mathematical programming . . . 626

10.4.4 Perfect information and irreducible games . . . 635

10.4.5 Finite methods . . . 641

10.5 Two-person general-sum stochastic game . . . 670

10.5.1 Introduction . . . 670

10.5.2 Discounted rewards . . . 671

10.5.3 Single-controller stochastic games . . . 674

10.6 Bibliographic notes . . . 682

10.7 Exercises . . . 684


Chapter 1

Introduction

1.1 The MDP model

1.2 Policies and optimality criteria
1.2.1 Policies
1.2.2 Optimality criteria
1.3 Examples
1.3.1 Red-black gambling
1.3.2 Gaming: How to serve in tennis
1.3.3 Optimal stopping
1.3.4 Replacement problems
1.3.5 Maintenance and repair
1.3.6 Production control
1.3.7 Optimal control of queues
1.3.8 Stochastic scheduling
1.3.9 Multi-armed bandit problems
1.4 Bibliographic notes

1.5 Exercises

1.1 The MDP model

An MDP is a model for sequential decision making under uncertainty, taking into account both the short-term outcomes of current decisions and opportunities for making decisions in the future.

While the notion of an MDP may appear quite simple, it encompasses a wide range of applications and has generated a rich mathematical theory. In an MDP model one can distinguish the following seven characteristics.

1. The state space

At any time point at which a decision has to be made, the state of the system is observed by the decision maker. The set of possible states is called the state space and will be denoted by S. The state space may be finite, denumerable, compact or even more general. In a finite state space, the number of states, i.e. |S|, will be denoted by N.


2. The action sets

When the decision maker observes that the system is in state i, he (we will refer to the decision maker as 'he') chooses an action from a certain action set that may depend on the observed state: the action set in state i is denoted by A(i). As with the state space, the action sets may be finite, denumerable, compact or more general.

3. The decision time points

The time intervals between the decision points may be constant or random. In the first case the model is said to be a Markov decision process; when the times between consecutive decision points are random the model is called a semi-Markov decision process.

4. The immediate rewards (or costs)

Given the state of the system and the chosen action, an immediate reward (or cost) is earned (there is no essential difference between rewards and costs, namely: maximizing rewards is equivalent to minimizing costs). These rewards may in general depend on the decision time point, the observed state and the chosen action, but not on the history of the process. The immediate reward at decision time point t for an action a in state i will be denoted by $r_i^t(a)$; if the reward is independent of the time t, we will write $r_i(a)$ instead of $r_i^t(a)$.

5. The transition probabilities

Given the state of the system and the chosen action, the state at the next decision time point is determined by a transition law. These transitions only depend on the decision time point t, the observed state i and the chosen action a, and not on the history of the process. This property is called the Markov property. If the transitions really depend on the decision time point, the problem is said to be nonstationary. If the state at time t is i and action a is chosen, we denote the probability that at the next time point the system is in state j by $p_{ij}^t(a)$. If the transitions are independent of the time points, the problem is called stationary, and the transition probabilities are denoted by $p_{ij}(a)$.

6. The planning horizon

The process has a planning horizon, i.e. the set of time points at which the system has to be controlled. This horizon may be finite, infinite or of random length.

7. The optimality criterion

The objective of a Markov decision problem (or a semi-Markov decision problem) is to determine a policy, i.e. a decision rule for each decision time point and each history (including the present state) of the process, that optimizes the performance of the system. The performance is measured by a utility function. This utility function assigns to each policy a value, given the starting state of the process. In the next section we will explain the concept of a policy in more detail and we will present several optimality criteria.
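The seven ingredients above translate directly into a small data structure. The sketch below is our own illustration (the class and field names are not from the text); it stores a finite, stationary MDP with state-dependent action sets and checks that each row of transition probabilities is a distribution.

```python
# A minimal finite, stationary MDP container; names are illustrative only.
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    states: list   # the state space S
    actions: dict  # A(i): list of actions available in state i
    p: dict        # p[(i, a)][j] = transition probability to state j
    r: dict        # r[(i, a)]    = immediate reward

    def validate(self):
        # Each pair (i, a) must define a probability distribution over S.
        for i in self.states:
            for a in self.actions[i]:
                row = self.p[(i, a)]
                assert all(row.get(j, 0.0) >= 0.0 for j in self.states)
                assert abs(sum(row.values()) - 1.0) < 1e-12

# A two-state toy instance: in state 0, action 'stay' keeps the state and
# action 'go' moves to state 1; state 1 is absorbing.
mdp = FiniteMDP(
    states=[0, 1],
    actions={0: ['stay', 'go'], 1: ['stay']},
    p={(0, 'stay'): {0: 1.0}, (0, 'go'): {1: 1.0}, (1, 'stay'): {1: 1.0}},
    r={(0, 'stay'): 1.0, (0, 'go'): 5.0, (1, 'stay'): 0.0},
)
mdp.validate()
```

The `validate` check corresponds to the requirement that the transition law assigns, for each state-action pair, a probability distribution over the state space.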


Example 1.1 Inventory model with backlog

An inventory has to be managed over a planning horizon of T weeks. At the beginning of each week the manager observes the inventory on hand and has to decide how many units to order. We assume that orders can be delivered instantaneously and that there is a finite inventory capacity of B units. We also assume that the demands $D_t$ in week t, $1 \leq t \leq T$, are independent random variables that have nonnegative integer values and that the numbers $p_j(t) := P\{D_t = j\}$ are known for all $j \in \mathbb{N}_0$ and for $t = 1, 2, \dots, T$. If the demand during a period exceeds the inventory on hand, the shortage is backlogged in the next period. The optimization problem is: which inventory strategy minimizes the total expected costs?

If an order is made in week t, there is a fixed cost $K_t$ and a cost $k_t$ for each ordered unit. If at the end of week t there is a positive inventory, then there are inventory costs of $h_t$ per unit; when there is a shortage, there are backlogging costs of $q_t$ per unit. The data $K_t$, $k_t$, $h_t$, $q_t$ and $p_j(t)$, $j \in \mathbb{N}_0$, are known for all $t \in \{1, 2, \dots, T\}$.

Let i, the state of the system, be the inventory at the start of week t (shortages are modeled as negative inventory), let the number of ordered units be a and let j be the inventory at the end of week t; so j is the state at the next decision time point. Then, the following costs are involved, where we use the notation

$\delta(x) = \begin{cases} 1 & \text{if } x \geq 1; \\ 0 & \text{if } x \leq 0: \end{cases}$

ordering costs: $K_t \cdot \delta(a) + k_t \cdot a$;

inventory costs: $h_t \cdot \delta(j) \cdot j$;

backlogging costs: $q_t \cdot \delta(-j) \cdot (-j)$.

This inventory problem can be modeled as a nonstationary MDP over a finite planning horizon, with a denumerable state space and finite action sets:

$S = \{\dots, -1, 0, 1, \dots, B\}$; $A(i) = \{a \geq 0 \mid 0 \leq i + a \leq B\}$;

$p_{ij}^t(a) = \begin{cases} p_{i+a-j}(t) & j \leq i + a; \\ 0 & B \geq j > i + a; \end{cases}$

$r_i^t(a) = -\bigl\{K_t \cdot \delta(a) + k_t \cdot a + \sum_{j=0}^{i+a} p_j(t) \cdot h_t \cdot (i + a - j) + \sum_{j=i+a+1}^{\infty} p_j(t) \cdot q_t \cdot (j - i - a)\bigr\}$.
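The formulas for $p_{ij}^t(a)$ and $r_i^t(a)$ can be computed mechanically. The following sketch is our own code, not part of the text; it assumes stationary data and a finite demand support, so the infinite backlogging sum becomes finite. It returns the one-week transition law and expected reward for a given state i and order quantity a.

```python
# Transition law and one-week expected reward for the backlog inventory
# model of Example 1.1, with stationary data and finite demand support.

def inventory_week(i, a, demand_pmf, K, k, h, q):
    """Return (transitions, reward) for state i and order quantity a.
    transitions maps next state j = i + a - d to P{D = d};
    reward is minus the expected ordering + holding + backlog costs."""
    stock = i + a
    transitions = {stock - d: pd for d, pd in demand_pmf.items()}
    delta = 1 if a >= 1 else 0
    order_cost = K * delta + k * a
    holding = sum(pd * h * (stock - d) for d, pd in demand_pmf.items() if d <= stock)
    backlog = sum(pd * q * (d - stock) for d, pd in demand_pmf.items() if d > stock)
    return transitions, -(order_cost + holding + backlog)

# Demand 0, 1 or 2 with probabilities 0.3, 0.5, 0.2; costs K=10, k=2, h=1, q=4.
trans, rew = inventory_week(1, 1, {0: 0.3, 1: 0.5, 2: 0.2}, K=10, k=2, h=1, q=4)
# ordering cost 10 + 2, expected holding cost 0.3*2 + 0.5*1 = 1.1, no backlog,
# so rew is approximately -13.1
```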

1.2 Policies and optimality criteria

1.2.1 Policies

A policy R is a sequence of decision rules: $R = (\pi^1, \pi^2, \dots, \pi^t, \dots)$, where $\pi^t$ is the decision rule at time point t, $t = 1, 2, \dots$. The decision rule $\pi^t$ at time point t may depend on all available information on the system until time t, i.e. on the states at the time points $1, 2, \dots, t$ and the actions at the time points $1, 2, \dots, t-1$.

The formal definition of a policy is as follows. Consider the Cartesian product

$S \times A := \{(i, a) \mid i \in S, a \in A(i)\}$  (1.1)

and let $H_t$ denote the set of the possible histories of the system up to time point t, i.e.

$H_t := \{h_t = (i_1, a_1, \dots, i_{t-1}, a_{t-1}, i_t) \mid (i_k, a_k) \in S \times A, \ 1 \leq k \leq t-1; \ i_t \in S\}.$  (1.2)

A decision rule $\pi^t$ at time point t is a function on $H_t \times A := \{(h_t, a_t) \mid h_t \in H_t, a_t \in A(i_t)\}$ which gives the probability of the action to be taken at time t, given the history $h_t$, i.e.

$\pi^t_{h_t a_t} \geq 0$ for every $a_t \in A(i_t)$ and $\sum_{a_t} \pi^t_{h_t a_t} = 1$ for every $h_t \in H_t$.  (1.3)

Let C denote the set of all policies. A policy is said to be memoryless if the decision rule $\pi^t$ is independent of $(i_1, a_1, \dots, i_{t-1}, a_{t-1})$ for every $t \in \mathbb{N}$. So, for a memoryless policy, the decision rule at time t depends - with regard to the history $h_t$ - only on the state $i_t$; therefore the notation $\pi^t_{i_t a_t}$ is used instead of $\pi^t_{h_t a_t}$. We call C(M) the set of the memoryless policies. Memoryless policies are also called Markov policies.

If a policy is memoryless and the decision rules are independent of the time point t, i.e. $\pi^1 = \pi^2 = \cdots$, then the policy is called stationary. Hence, a stationary policy is determined by a nonnegative function $\pi$ on $S \times A$ such that $\sum_a \pi_{ia} = 1$ for every $i \in S$. The stationary policy $R = (\pi, \pi, \dots)$ is denoted by $\pi^\infty$ (and sometimes simply by $\pi$). The set of stationary policies is denoted by C(S).

If the decision rule $\pi$ of the stationary policy $\pi^\infty$ is nonrandomized, i.e. for every $i \in S$ we have $\pi_{ia} = 1$ for exactly one action $a_i$ and consequently $\pi_{ia} = 0$ for every $a \neq a_i$, then the policy is called deterministic. Hence, a deterministic policy can be described by a function f on S, where $f(i)$ is the chosen action $a_i$, $i \in S$. A deterministic policy is denoted by $f^\infty$ (and sometimes simply by f). The set of deterministic policies is denoted by C(D).

A matrix $P = (p_{ij})$ is a transition matrix if $p_{ij} \geq 0$ for all (i, j) and $\sum_j p_{ij} = 1$ for all i. For a Markov policy $R = (\pi^1, \pi^2, \dots)$ the transition matrix $P(\pi^t)$ and the reward vector $r(\pi^t)$ are defined by

$P(\pi^t)_{ij} := \sum_a p^t_{ij}(a) \cdot \pi^t_{ia}$ for every $i \in S$, $j \in S$ and $t \in \mathbb{N}$;  (1.4)

$r(\pi^t)_i := \sum_a r^t_i(a) \cdot \pi^t_{ia}$ for every $i \in S$ and $t \in \mathbb{N}$.  (1.5)

Take any initial distribution β defined on the state space S, i.e. $\beta_i$ is the probability that the system starts in state i, and take any policy R. Then, by the theorem of Ionescu Tulcea (see e.g. Bertsekas and Shreve [21], Proposition 7.28, p. 140), there exists a unique probability measure $P_{\beta,R}$ on $H_\infty$, where

$H_\infty := \{h = (i_1, a_1, i_2, a_2, \dots) \mid (i_k, a_k) \in S \times A, \ k = 1, 2, \dots\}.$  (1.6)

If $\beta_i = 1$ for some $i \in S$, then we write $P_{i,R}$ instead of $P_{\beta,R}$.

Let the random variables $X_t$ and $Y_t$ denote the state and action at time t, $t = 1, 2, \dots$. Given an initial distribution β and a policy R, by the theorem of Ionescu Tulcea, for all $j \in S$, $a \in A(j)$ the notion $P_{\beta,R}\{X_t = j, Y_t = a\}$ is well-defined as the probability that at time t the state is j and the action is a. Similarly, for all $j \in S$ the notion $P_{\beta,R}\{X_t = j\}$ is well-defined as the probability that at time t the state is j. Furthermore, $P_{\beta,R}\{X_t = j\} = \sum_a P_{\beta,R}\{X_t = j, Y_t = a\}$.
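Definitions (1.4) and (1.5) are plain averages of the transition rows and rewards under the decision rule. A minimal sketch on a finite state space (our own toy data, stationary rewards and transitions):

```python
# Build the transition matrix P(pi) and reward vector r(pi) of (1.4)-(1.5)
# for a decision rule pi on a finite state space {0, ..., N-1}.

def policy_matrices(p, r, pi):
    """p[i][a][j], r[i][a] and pi[i][a] are indexed by state i, action a."""
    N = len(pi)
    P = [[sum(pi[i][a] * p[i][a][j] for a in pi[i]) for j in range(N)]
         for i in range(N)]
    rv = [sum(pi[i][a] * r[i][a] for a in pi[i]) for i in range(N)]
    return P, rv

# Two states; two actions in state 0; pi randomizes evenly in state 0.
p = {0: {'u': [1.0, 0.0], 'v': [0.0, 1.0]}, 1: {'u': [0.0, 1.0]}}
r = {0: {'u': 1.0, 'v': 5.0}, 1: {'u': 0.0}}
pi = {0: {'u': 0.5, 'v': 0.5}, 1: {'u': 1.0}}
P, rv = policy_matrices(p, r, pi)
# P == [[0.5, 0.5], [0.0, 1.0]] and rv == [3.0, 0.0]
```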

Lemma 1.1

For any Markov policy $R = (\pi^1, \pi^2, \dots)$, any initial distribution β and any $t \in \mathbb{N}$, we have

(1) $P_{\beta,R}\{X_t = j, Y_t = a\} = \sum_i \beta_i \cdot \{P(\pi^1) P(\pi^2) \cdots P(\pi^{t-1})\}_{ij} \cdot \pi^t_{ja}$ for all $(j, a) \in S \times A$, where, if t = 1, $P(\pi^1) P(\pi^2) \cdots P(\pi^{t-1})$ is defined as the identity matrix I.

(2) $E_{\beta,R}\{r^t_{X_t}(Y_t)\} = \sum_i \beta_i \cdot \{P(\pi^1) P(\pi^2) \cdots P(\pi^{t-1}) \cdot r(\pi^t)\}_i$.

Proof

By induction on t. For t = 1,

$P_{\beta,R}\{X_1 = j, Y_1 = a\} = \beta_j \cdot \pi^1_{ja} = \sum_i \beta_i \cdot \{I\}_{ij} \cdot \pi^1_{ja}$

and

$E_{\beta,R}\{r^1_{X_1}(Y_1)\} = \sum_{i,a} \beta_i \cdot \pi^1_{ia} \cdot r^1_i(a) = \sum_i \beta_i \cdot \{I \cdot r(\pi^1)\}_i.$

Assume that the results are true for t; we show that the results also hold for t + 1:

$P_{\beta,R}\{X_{t+1} = j, Y_{t+1} = a\}$
$= \sum_{k,b} P_{\beta,R}\{X_t = k, Y_t = b\} \cdot p^t_{kj}(b) \cdot \pi^{t+1}_{ja}$
$= \sum_{k,b,i} \beta_i \cdot \{P(\pi^1) P(\pi^2) \cdots P(\pi^{t-1})\}_{ik} \cdot \pi^t_{kb} \cdot p^t_{kj}(b) \cdot \pi^{t+1}_{ja}$
$= \sum_i \beta_i \cdot \sum_k \{P(\pi^1) P(\pi^2) \cdots P(\pi^{t-1})\}_{ik} \cdot \sum_b \pi^t_{kb} \cdot p^t_{kj}(b) \cdot \pi^{t+1}_{ja}$
$= \sum_i \beta_i \cdot \sum_k \{P(\pi^1) P(\pi^2) \cdots P(\pi^{t-1})\}_{ik} \cdot \{P(\pi^t)\}_{kj} \cdot \pi^{t+1}_{ja}$
$= \sum_i \beta_i \cdot \{P(\pi^1) P(\pi^2) \cdots P(\pi^t)\}_{ij} \cdot \pi^{t+1}_{ja}.$

Furthermore, one has

$E_{\beta,R}\{r^{t+1}_{X_{t+1}}(Y_{t+1})\}$
$= \sum_{j,a} P_{\beta,R}\{X_{t+1} = j, Y_{t+1} = a\} \cdot r^{t+1}_j(a)$
$= \sum_{j,a,i} \beta_i \cdot \{P(\pi^1) P(\pi^2) \cdots P(\pi^t)\}_{ij} \cdot \pi^{t+1}_{ja} \cdot r^{t+1}_j(a)$
$= \sum_i \beta_i \cdot \sum_j \{P(\pi^1) P(\pi^2) \cdots P(\pi^t)\}_{ij} \cdot \sum_a \pi^{t+1}_{ja} \cdot r^{t+1}_j(a)$
$= \sum_i \beta_i \cdot \sum_j \{P(\pi^1) P(\pi^2) \cdots P(\pi^t)\}_{ij} \cdot \{r(\pi^{t+1})\}_j$
$= \sum_i \beta_i \cdot \{P(\pi^1) P(\pi^2) \cdots P(\pi^t) \, r(\pi^{t+1})\}_i.$
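Lemma 1.1(1) can be checked numerically on a toy instance (the data below are our own): enumerate all histories of length t under a stationary Markov rule and compare the resulting joint distribution of $(X_t, Y_t)$ with the matrix-product expression of the lemma.

```python
# Numerical check of Lemma 1.1(1) on a 2-state, 2-action toy MDP with a
# stationary Markov decision rule, by exhaustive trajectory enumeration.
from itertools import product

S, A = [0, 1], [0, 1]
p = {(0, 0): [0.8, 0.2], (0, 1): [0.1, 0.9],
     (1, 0): [0.5, 0.5], (1, 1): [1.0, 0.0]}
pi = {0: [0.3, 0.7], 1: [0.6, 0.4]}   # pi[i][a], the same rule at every t
beta = [0.25, 0.75]
t = 3

# Left-hand side: sum the probability of every history ending in (j, a).
lhs = {(j, a): 0.0 for j in S for a in A}
for traj in product(*[list(product(S, A))] * t):
    prob = beta[traj[0][0]] * pi[traj[0][0]][traj[0][1]]
    for (i, a), (j, b) in zip(traj, traj[1:]):
        prob *= p[(i, a)][j] * pi[j][b]
    lhs[traj[-1]] += prob

# Right-hand side: beta' P(pi)^(t-1), multiplied by pi_{ja}.
P = [[sum(pi[i][a] * p[(i, a)][j] for a in A) for j in S] for i in S]
mu = beta[:]
for _ in range(t - 1):
    mu = [sum(mu[i] * P[i][j] for i in S) for j in S]
rhs = {(j, a): mu[j] * pi[j][a] for j in S for a in A}

assert all(abs(lhs[k] - rhs[k]) < 1e-12 for k in lhs)
```

The enumeration grows as $(|S||A|)^t$, so this is only a sanity check; the point of the lemma is precisely that the matrix products avoid it.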

The next theorem shows that for any initial distribution β, any sequence of policies $R_1, R_2, \dots$ and any convex combination of the marginal distributions of $P_{\beta,R_k}$, $k \in \mathbb{N}$, there exists a Markov policy $R_*$ with the same marginal distribution.

Theorem 1.1

For any initial distribution β, any sequence of policies $R_1, R_2, \dots$ and any sequence of nonnegative real numbers $p_1, p_2, \dots$ satisfying $\sum_k p_k = 1$, there exists a Markov policy $R_*$ such that

$P_{\beta,R_*}\{X_t = j, Y_t = a\} = \sum_k p_k \cdot P_{\beta,R_k}\{X_t = j, Y_t = a\}$ for all $(j, a) \in S \times A$ and all $t \in \mathbb{N}$.  (1.7)


Proof

Define the Markov policy $R_* = (\pi^1, \pi^2, \dots)$ by

$\pi^t_{ja} := \dfrac{\sum_k p_k \cdot P_{\beta,R_k}\{X_t = j, Y_t = a\}}{\sum_k p_k \cdot P_{\beta,R_k}\{X_t = j\}}$ for all $t \in \mathbb{N}$ and all $(j, a) \in S \times A$.  (1.8)

In case the denominator is zero, take for $\pi^t_{ja}$, $a \in A(j)$, arbitrary nonnegative numbers such that $\sum_a \pi^t_{ja} = 1$, $j \in S$. Take any $(j, a) \in S \times A$. We prove the theorem by induction on t.

For t = 1, we obtain $P_{\beta,R_*}\{X_1 = j\} = \beta_j$ and $\sum_k p_k \cdot P_{\beta,R_k}\{X_1 = j\} = \sum_k p_k \cdot \beta_j = \beta_j$. If $\beta_j = 0$, then $P_{\beta,R_*}\{X_1 = j, Y_1 = a\} = \sum_k p_k \cdot P_{\beta,R_k}\{X_1 = j, Y_1 = a\} = 0$. If $\beta_j \neq 0$, then from (1.8) it follows that

$\sum_k p_k \cdot P_{\beta,R_k}\{X_1 = j, Y_1 = a\} = \sum_k p_k \cdot P_{\beta,R_k}\{X_1 = j\} \cdot \pi^1_{ja} = \beta_j \cdot \pi^1_{ja} = P_{\beta,R_*}\{X_1 = j, Y_1 = a\}.$

Assume that (1.7) is true for t. We shall prove that (1.7) is also true for t + 1.

$P_{\beta,R_*}\{X_{t+1} = j\}$
$= \sum_{l,b} P_{\beta,R_*}\{X_t = l, Y_t = b\} \cdot p^t_{lj}(b)$
$= \sum_{l,b,k} p_k \cdot P_{\beta,R_k}\{X_t = l, Y_t = b\} \cdot p^t_{lj}(b)$
$= \sum_k p_k \cdot \sum_{l,b} P_{\beta,R_k}\{X_t = l, Y_t = b\} \cdot p^t_{lj}(b)$
$= \sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j\}.$

If $P_{\beta,R_*}\{X_{t+1} = j\} = 0$, then $\sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j\} = 0$, and consequently, $P_{\beta,R_*}\{X_{t+1} = j, Y_{t+1} = a\} = \sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j, Y_{t+1} = a\} = 0$.

If $P_{\beta,R_*}\{X_{t+1} = j\} \neq 0$, then

$P_{\beta,R_*}\{X_{t+1} = j, Y_{t+1} = a\}$
$= P_{\beta,R_*}\{X_{t+1} = j\} \cdot \pi^{t+1}_{ja}$
$= \sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j\} \cdot \pi^{t+1}_{ja}$
$= \sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j\} \cdot \dfrac{\sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j, Y_{t+1} = a\}}{\sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j\}}$
$= \sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j, Y_{t+1} = a\}.$

Corollary 1.1
For any starting state $i$ and any policy $R$, there exists a Markov policy $R^*$ such that
\[
P_{i,R^*}\{X_t=j,\,Y_t=a\} = P_{i,R}\{X_t=j,\,Y_t=a\} \text{ for all } t \in \mathbb{N} \text{ and all } (j,a) \in S \times A,
\]
and
\[
E_{i,R^*}\{r^t_{X_t}(Y_t)\} = E_{i,R}\{r^t_{X_t}(Y_t)\} \text{ for all } t \in \mathbb{N}.
\]
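The construction (1.8) can be checked numerically. The sketch below is illustrative only: a hypothetical two-state, two-action model with made-up probabilities, in which two Markov policies are mixed with weights $p_1, p_2$ and the Markov policy $R^*$ is built from the ratio in (1.8); its marginal state-action distributions reproduce the mixture, as Theorem 1.1 asserts.

```python
import numpy as np

# Hypothetical 2-state, 2-action model; p[a][i][j] = transition probability.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.7, 0.3]]])  # action 1
T = 5
beta = np.array([0.6, 0.4])               # initial distribution

def marginals(beta, policy, T):
    """P{X_t = j, Y_t = a} for t = 1..T under a Markov policy.
    policy[t][j][a] = probability of action a in state j at epoch t+1."""
    out, x = [], beta.copy()
    for t in range(T):
        joint = x[:, None] * policy[t]         # joint state-action law at this epoch
        out.append(joint)
        x = np.einsum('ja,aji->i', joint, p)   # next state distribution
    return out

rng = np.random.default_rng(0)
def random_markov_policy():
    pi = rng.random((T, 2, 2))
    return pi / pi.sum(axis=2, keepdims=True)

R1, R2 = random_markov_policy(), random_markov_policy()
p1, p2 = 0.3, 0.7                              # mixture weights, sum to 1

mix = [p1 * a + p2 * b
       for a, b in zip(marginals(beta, R1, T), marginals(beta, R2, T))]

# Construction (1.8): pi^t_{ja} = mixed joint law / mixed state marginal.
Rstar = np.array([m / m.sum(axis=1, keepdims=True) for m in mix])

# The Markov policy R* reproduces the mixed marginal distributions.
assert all(np.allclose(a, b) for a, b in zip(marginals(beta, Rstar, T), mix))
```

Here the denominators in (1.8) are strictly positive, so the arbitrary-choice clause of the proof is never needed.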


1.2.2 Optimality criteria

We consider the following optimality criteria:

1. Total expected reward over a finite horizon.

2. Total expected discounted reward over an infinite horizon.

3. Total expected reward over an infinite horizon.

4. Average expected reward over an infinite horizon.

5. More sensitive optimality criteria over an infinite horizon.

Assumption 1.1
In infinite horizon models we assume that the immediate rewards and the transition probabilities are stationary, and we denote these quantities by $r_i(a)$ and $p_{ij}(a)$, respectively, for all $i$, $j$ and $a$.

Total expected reward over a finite horizon

Consider an MDP with a finite planning horizon of $T$ periods. For any policy $R$ and any initial state $i \in S$, the total expected reward over the planning horizon is defined by
\[
v^T_i(R) := \sum_{t=1}^{T} E_{i,R}\{r^t_{X_t}(Y_t)\}
= \sum_{t=1}^{T} \sum_{j,a} P_{i,R}\{X_t=j,\,Y_t=a\} \cdot r^t_j(a) \ \text{ for all } i \in S. \tag{1.9}
\]
Interchanging the summation and the expectation in (1.9) is allowed, so $v^T_i(R)$ may also be defined as the expected total reward, i.e.
\[
v^T_i(R) := E_{i,R}\Big\{\sum_{t=1}^{T} r^t_{X_t}(Y_t)\Big\} \ \text{ for all } i \in S.
\]

Let
\[
v^T_i := \sup_{R \in C} v^T_i(R) \ \text{ for all } i \in S, \tag{1.10}
\]
or in vector notation, $v^T = \sup_{R \in C} v^T(R)$. The vector $v^T$ is called the value vector. From Corollary 1.1 and Lemma 1.1, it follows that
\[
v^T = \sup_{R \in C(M)} v^T(R) \tag{1.11}
\]
and
\[
v^T(R) = \sum_{t=1}^{T} P(\pi^1)P(\pi^2)\cdots P(\pi^{t-1}) \cdot r(\pi^t)
\ \text{ for } R = (\pi^1, \pi^2, \dots) \in C(M). \tag{1.12}
\]
A policy $R^*$ is called an optimal policy if
\[
v^T(R^*) = v^T. \tag{1.13}
\]
It is nontrivial that there exists an optimal policy: the supremum has to be attained and it has to be attained simultaneously for all starting states. It can be shown (see the next chapter) that an optimal Markov policy $R^* = (f^1, f^2, \dots, f^T)$ exists, where $f^t$ is a deterministic decision rule for $t = 1, 2, \dots, T$.
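The existence of an optimal Markov policy $(f^1, \dots, f^T)$ suggests the usual backward-induction computation, which the next chapter develops formally. A minimal sketch on hypothetical stationary two-state, two-action data (all numbers made up):

```python
import numpy as np

# Made-up stationary data: r[i][a] rewards, p[a][i][j] transition probabilities.
r = np.array([[1.0, 2.0], [0.5, 3.0]])
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])

def backward_induction(r, p, T):
    """Return the value vector v^T and deterministic rules (f^1, ..., f^T)."""
    v = np.zeros(r.shape[0])          # value of the empty remaining horizon
    rules = []
    for _ in range(T):                # work backwards over epochs T, T-1, ..., 1
        q = r + np.einsum('aij,j->ia', p, v)   # q[i][a] = r_i(a) + sum_j p_ij(a) v_j
        rules.append(q.argmax(axis=1))         # deterministic decision rule
        v = q.max(axis=1)
    rules.reverse()                   # rules[0] is the decision rule for epoch 1
    return v, rules

vT, rules = backward_induction(r, p, 3)   # value vector for horizon T = 3
```

The supremum in (1.10) is attained here simultaneously for both starting states, by one Markov policy made of deterministic rules.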


Total expected discounted reward over an infinite horizon

Assume that an amount $r$ earned at time point 1 is deposited in a bank with interest rate $\rho$. This amount becomes $(1+\rho) \cdot r$ at time point 2, $(1+\rho)^2 \cdot r$ at time point 3, etc. Hence, for interest rate $\rho$, an amount $r$ at time point 1 is comparable with $(1+\rho)^{t-1} \cdot r$ at time point $t$, $t = 1, 2, \dots$.

Define $\alpha := (1+\rho)^{-1}$ and call $\alpha$ the discount factor. Note that $\alpha \in (0,1)$. Then, conversely, an amount $r$ received at time point $t$ is considered as equivalent to the amount $\alpha^{t-1} \cdot r$ at time point 1, the so-called discounted value.

Hence, the reward $r_{X_t}(Y_t)$ at time point $t$ has at time point 1 the discounted value $\alpha^{t-1} \cdot r_{X_t}(Y_t)$. The total expected $\alpha$-discounted reward, given initial state $i$ and policy $R$, is denoted by $v^{\alpha}_i(R)$ and defined by
\[
v^{\alpha}_i(R) := \sum_{t=1}^{\infty} E_{i,R}\{\alpha^{t-1} \cdot r_{X_t}(Y_t)\}. \tag{1.14}
\]
Obviously, $v^{\alpha}_i(R) = \sum_{t=1}^{\infty} \alpha^{t-1} \sum_{j,a} P_{i,R}\{X_t=j,\,Y_t=a\} \cdot r_j(a)$. Another way to consider the discounted reward is by the expected total $\alpha$-discounted reward, i.e.
\[
E_{i,R}\Big\{\sum_{t=1}^{\infty} \alpha^{t-1} \cdot r_{X_t}(Y_t)\Big\}.
\]

Since
\[
\Big|\sum_{t=1}^{\infty} \alpha^{t-1} \cdot r_{X_t}(Y_t)\Big|
\le \sum_{t=1}^{\infty} \alpha^{t-1} \cdot M = (1-\alpha)^{-1} \cdot M,
\]
where $M = \max_{i,a} |r_i(a)|$, the theorem of dominated convergence (e.g. Bauer [13], p. 71) implies
\[
E_{i,R}\Big\{\sum_{t=1}^{\infty} \alpha^{t-1} \cdot r_{X_t}(Y_t)\Big\}
= \sum_{t=1}^{\infty} E_{i,R}\{\alpha^{t-1} \cdot r_{X_t}(Y_t)\} = v^{\alpha}_i(R), \tag{1.15}
\]
i.e. the expected total discounted reward and the total expected discounted reward criteria are equivalent.

Let $R = (\pi^1, \pi^2, \dots) \in C(M)$; then
\[
v^{\alpha}(R) = \sum_{t=1}^{\infty} \alpha^{t-1} \cdot P(\pi^1)P(\pi^2)\cdots P(\pi^{t-1}) \cdot r(\pi^t). \tag{1.16}
\]
Hence, a stationary policy $\pi^{\infty}$ satisfies
\[
v^{\alpha}(\pi^{\infty}) = \sum_{t=1}^{\infty} \alpha^{t-1} P(\pi)^{t-1} r(\pi). \tag{1.17}
\]
Like before, the value vector $v^{\alpha}$ is defined by
\[
v^{\alpha} := \sup_{R \in C} v^{\alpha}(R). \tag{1.18}
\]
A policy $R^*$ is an optimal policy if
\[
v^{\alpha}(R^*) = v^{\alpha}. \tag{1.19}
\]


In Chapter 3 we will show the existence of an optimal deterministic policy $f^{\infty}$ for this criterion, and we will also prove that the value vector $v^{\alpha}$ is the unique solution of the so-called optimality equation
\[
x_i = \max_{a \in A(i)} \Big\{ r_i(a) + \alpha \sum_j p_{ij}(a) x_j \Big\} \ \text{ for all } i \in S. \tag{1.20}
\]
Furthermore, we will derive that $f^{\infty}$ is an optimal policy if
\[
r_i(f) + \alpha \sum_j p_{ij}(f) v^{\alpha}_j \ge r_i(a) + \alpha \sum_j p_{ij}(a) v^{\alpha}_j
\ \text{ for all } a \in A(i) \text{ and all } i \in S. \tag{1.21}
\]
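Equation (1.20) can be solved numerically by repeatedly applying its right-hand side, a successive-approximation scheme whose justification comes later; the sketch below uses hypothetical two-state data and reads off a decision rule with the property (1.21).

```python
import numpy as np

alpha = 0.9
r = np.array([[1.0, 2.0], [0.5, 3.0]])     # hypothetical r[i][a]
p = np.array([[[0.9, 0.1], [0.2, 0.8]],    # hypothetical p[a][i][j]
              [[0.5, 0.5], [0.7, 0.3]]])

def value_iteration(r, p, alpha, tol=1e-10):
    v = np.zeros(r.shape[0])
    while True:
        # Right-hand side of (1.20): q[i][a] = r_i(a) + alpha * sum_j p_ij(a) v_j
        q = r + alpha * np.einsum('aij,j->ia', p, v)
        v_new = q.max(axis=1)
        if np.abs(v_new - v).max() < tol:
            return v_new, q.argmax(axis=1)   # rule f satisfying (1.21) at v^alpha
        v = v_new

v_alpha, f = value_iteration(r, p, alpha)
```

At termination $v^{\alpha}$ is (up to the tolerance) a fixed point of (1.20), and the stationary policy $f^{\infty}$ built from the maximizing actions is $\alpha$-discounted optimal.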

Total expected reward over an infinite horizon

A logical definition of the total expected reward is the total expected discounted reward with discount factor $\alpha = 1$. So, given initial state $i$ and policy $R$, we obtain $\sum_{t=1}^{\infty} E_{i,R}\{r_{X_t}(Y_t)\}$. However, in general $\sum_{t=1}^{\infty} E_{i,R}\{r_{X_t}(Y_t)\}$ may not be well-defined. Therefore, we consider this criterion under the following assumptions.

Assumption 1.2
(1) The model is substochastic, i.e. $\sum_j p_{ij}(a) \le 1$ for all $(i,a) \in S \times A$.
(2) For any initial state $i$ and any policy $R$, $\sum_{t=1}^{\infty} E_{i,R}\{r_{X_t}(Y_t)\}$ is well-defined (possibly $\pm\infty$).

Under these assumptions the total expected reward, which we denote by $v_i(R)$ for initial state $i$ and policy $R$, is well-defined by
\[
v_i(R) := \sum_{t=1}^{\infty} E_{i,R}\{r_{X_t}(Y_t)\}. \tag{1.22}
\]
In this case, we can also write $v_i(R) = \sum_{t=1}^{\infty} \sum_{j,a} P_{i,R}\{X_t=j,\,Y_t=a\} \cdot r_j(a)$. The value vector, denoted by $v$, and the concept of an optimal policy are defined in the usual way:
\[
v := \sup_{R \in C} v(R). \tag{1.23}
\]
A policy $R^*$ is an optimal policy if
\[
v(R^*) = v. \tag{1.24}
\]
Under the additional assumption that every policy $R$ is transient, i.e.
\[
\sum_{t=1}^{\infty} P_{i,R}\{X_t=j,\,Y_t=a\} < \infty \ \text{ for all } i, j \text{ and all } a,
\]
it can be shown (cf. Kallenberg [148], Chapter 3) that most properties of the discounted MDP model are valid for the total reward MDP model, taking discount factor $\alpha = 1$.
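For a single stationary policy in a transient substochastic model, the series defining $v$ can be summed in closed form, since $\sum_{t \ge 1} P^{t-1} = (I-P)^{-1}$ when $P^t \to 0$. A sketch with made-up substochastic data (rows summing to less than one, the missing mass being absorption):

```python
import numpy as np

# Made-up substochastic transition matrix under one stationary policy:
# both row sums are < 1; the missing probability mass is "stopping".
P = np.array([[0.5, 0.3], [0.2, 0.4]])
r = np.array([2.0, -1.0])

# Closed form: v = sum_{t>=1} P^{t-1} r = (I - P)^{-1} r
# (valid here because the spectral radius of P is below 1, i.e. the policy is transient).
v = np.linalg.solve(np.eye(2) - P, r)

# Cross-check against a long partial sum of the series.
v_sum, Pt = np.zeros(2), np.eye(2)
for _ in range(200):
    v_sum += Pt @ r
    Pt = Pt @ P
assert np.allclose(v, v_sum)
```

This is exactly the discounted closed form with $\alpha = 1$, which is why the transient total-reward model inherits the discounted theory.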


Average expected reward over an infinite horizon

In the criterion of average reward the limiting behavior of the average reward over the first $T$ periods, i.e. $\frac{1}{T}\sum_{t=1}^{T} r_{X_t}(Y_t)$, is considered for $T \to \infty$. Since $\lim_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} r_{X_t}(Y_t)$ may not exist and interchanging limit and expectation is not allowed in general, there are four different evaluation measures which can be considered:

1. Lower limit of the average expected reward:
\[
\phi_i(R) := \liminf_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} E_{i,R}\{r_{X_t}(Y_t)\},\ i \in S,
\ \text{ with value vector } \phi := \sup_{R \in C} \phi(R).
\]
2. Upper limit of the average expected reward:
\[
\overline{\phi}_i(R) := \limsup_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} E_{i,R}\{r_{X_t}(Y_t)\},\ i \in S,
\ \text{ with value vector } \overline{\phi} := \sup_{R \in C} \overline{\phi}(R).
\]
3. Expectation of the lower limit of the average reward:
\[
\psi_i(R) := E_{i,R}\Big\{\liminf_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} r_{X_t}(Y_t)\Big\},\ i \in S,
\ \text{ with value vector } \psi := \sup_{R \in C} \psi(R).
\]
4. Expectation of the upper limit of the average reward:
\[
\overline{\psi}_i(R) := E_{i,R}\Big\{\limsup_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} r_{X_t}(Y_t)\Big\},\ i \in S,
\ \text{ with value vector } \overline{\psi} := \sup_{R \in C} \overline{\psi}(R).
\]

The next lemma shows the relation between these four criteria.

Lemma 1.2
$\psi_i(R) \le \phi_i(R) \le \overline{\phi}_i(R) \le \overline{\psi}_i(R)$ for every state $i$ and every policy $R$.

Proof
Take any state $i$ and any policy $R$. The first inequality follows from Fatou's lemma (e.g. Bauer [13], p. 126):
\[
\psi_i(R) = E_{i,R}\Big\{\liminf_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} r_{X_t}(Y_t)\Big\}
\le \liminf_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} E_{i,R}\{r_{X_t}(Y_t)\} = \phi_i(R).
\]
The second inequality, $\phi_i(R) \le \overline{\phi}_i(R)$, is obvious. The third inequality is also a consequence of Fatou's lemma:
\[
\overline{\phi}_i(R) = \limsup_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} E_{i,R}\{r_{X_t}(Y_t)\}
\le E_{i,R}\Big\{\limsup_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} r_{X_t}(Y_t)\Big\} = \overline{\psi}_i(R).
\]

We will present two examples to show that the quantities $\psi_i(R)$, $\phi_i(R)$, $\overline{\phi}_i(R)$ and $\overline{\psi}_i(R)$ may differ for some state $i$ and some policy $R$. In the first example we show that $\psi_i(R) < \phi_i(R)$ and $\overline{\phi}_i(R) < \overline{\psi}_i(R)$ is possible; the second example shows that $\phi_i(R) < \overline{\phi}_i(R)$ is possible.

We use directed graphs to illustrate examples. The nodes of the graph represent the states. If the transition probability $p_{ij}(a)$ is positive there is an arc $(i,j)$ from node $i$ to node $j$; for $a = 1$ we use a simple arc, for $a = 2$ a double arc, etc.; next to the arc from node $i$ to node $j$ we note the transition probability $p_{ij}(a)$.
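That the lower and upper limits can genuinely differ is already visible for a deterministic reward stream, before any probability enters: if rewards alternate between 1 and 0 over blocks of doubling length, the running averages oscillate between roughly $1/3$ and $2/3$ and never converge. A small sketch (made-up block scheme, not one of the book's examples):

```python
# Deterministic reward stream: blocks of 1's and 0's with doubling lengths.
rewards, value, length = [], 1, 1
while len(rewards) < 100_000:
    rewards.extend([value] * length)
    value, length = 1 - value, 2 * length

# Running Cesaro averages (1/T) * sum_{t<=T} r_t.
avg, total = [], 0
for T, rt in enumerate(rewards, start=1):
    total += rt
    avg.append(total / T)

# Past the initial transient the average keeps oscillating between ~1/3 and ~2/3,
# so the liminf and limsup of the averages differ.
lo, hi = min(avg[1000:]), max(avg[1000:])
assert hi - lo > 0.25
```

Embedded in an MDP (one reward per deterministic transition), such a stream would make $\phi_i(R) < \overline{\phi}_i(R)$ in the spirit of the second example above.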
