
MARKOV DECISION PROCESSES

LODEWIJK KALLENBERG

UNIVERSITY OF LEIDEN


Preface

Branching out from operations research roots of the 1950s, Markov decision processes (MDPs) have gained recognition in such diverse fields as economics, telecommunication, engineering and ecology. These applications have been accompanied by many theoretical advances. Markov decision processes, also referred to as stochastic dynamic programming or stochastic control problems, are models for sequential decision making when outcomes are uncertain. The Markov decision process model consists of decision epochs, states, actions, transition probabilities and rewards. Choosing an action in a state generates a reward and determines the state at the next decision epoch through a transition probability function. Policies or strategies are prescriptions of which action to choose under any eventuality at every future decision epoch. Decision makers seek policies which are optimal in some sense.

These lecture notes aim to present a unified treatment of the theoretical and algorithmic aspects of Markov decision process models. They can serve as a text for an advanced undergraduate or graduate level course in operations research, econometrics or control engineering. As a prerequisite, the reader should have some background in linear algebra, real analysis, probability, and linear programming. Throughout the text there are many examples. At the end of each chapter there is a section with bibliographic notes and a section with exercises. A solution manual is available on request (e-mail to kallenberg@math.leidenuniv.nl).

Chapter 1 introduces the Markov decision process model as a sequential decision model with actions, transitions, rewards and policies. We illustrate these concepts with nine different applications: red-black gambling, how-to-serve in tennis, optimal stopping, replacement problems, maintenance and repair, production control, optimal control of queues, stochastic scheduling, and the multi-armed bandit problem.

Chapter 2 deals with the finite horizon model with nonstationary transitions and rewards, and the principle of dynamic programming: backward induction. We present an equivalent stationary infinite horizon model. We also study under which conditions optimal policies are monotone, i.e. nondecreasing or nonincreasing in the ordering of the state space.

In chapter 3 the discounted rewards over an infinite horizon are studied. This results in the optimality equation and methods to solve this equation: policy iteration, linear programming, value iteration and modified value iteration. Furthermore, we study under which conditions monotone optimal policies exist.

Chapter 4 discusses the total rewards over an infinite horizon under the assumption that the transition matrices are substochastic. We first present some background material on square matrices, eigenvalues and the spectral radius. Then, we introduce the linear program and its correspondence to policies. We derive equivalent statements for the properties that the model is a so-called contracting or normalized dynamic programming model. Next, we present the optimality equation and results on the computation of optimal transient policies. For contracting dynamic programming, results and algorithms can be formulated which are similar to those for the discounted reward model. Special sections are devoted to finite horizon and transient MDPs, to positive, negative and convergent MDPs, and to special models such as red-black gambling and the optimal stopping problem.

Chapter 5 discusses the criterion of average rewards over an infinite horizon, in the most general case. Firstly, polynomial algorithms are developed to classify MDPs as irreducible or communicating. The distinction between unichain and multichain turns out to be NP-complete, so there is no hope of a polynomial algorithm. Then, the stationary, the fundamental and the deviation matrices are introduced, and their internal relations and properties are derived. Next, an extension of a theorem by Blackwell and the Laurent series expansion are presented. These results are fundamental for analyzing the relation between discounted, average and more sensitive optimality criteria. With these results, as in the discounted case but via a more complicated analysis, the optimality equation is derived and methods to solve this equation are presented (policy iteration, linear programming and value iteration).

In chapter 6 special cases of the average reward criterion (irreducible, unichain and communicating) are considered. In all these cases the optimality equation and the methods of policy iteration, linear programming and value iteration can be simplified. Furthermore, we present the method of modified value iteration for these special cases.

Chapter 7 introduces more sensitive optimality criteria: bias optimality, n-discount and n-average optimality, and Blackwell optimality. The criteria of n-discount and n-average optimality are equivalent. We present a unifying framework, based on the Laurent series expansion, to derive sensitive discount optimality equations. Using a lexicographic ordering of the Laurent series, we derive the policy iteration method for n-discount optimality. In the irreducible case, one can derive a sequence of nested linear programs to compute n-discount optimal policies for any n.

Also for Blackwell optimality, even in the most general case, linear programming can be applied. However, then the elements are not real numbers, but lie in a much more general ordered field, namely an ordered field of rational functions. For bias optimality, an optimal policy can be found with a three-step linear programming approach. When in addition the model is a unichain MDP, the linear programs for bias optimality can be simplified. In this unichain case, we also derive a simple policy iteration method and turnpike results. The last sections of this chapter deal with some special optimality criteria. We consider overtaking, average overtaking and cumulative overtaking optimality. A next section deals with a weighted combination of the total discounted rewards and the long-run average rewards. For this criterion an optimal policy might not exist, even when we allow nonstationary randomized policies. We present an iterative algorithm for computing an ε-optimal nonstationary policy with a simple structure. Finally, we study an optimality criterion which is the sum of expected total discounted rewards with different one-step rewards and discount factors. It turns out that for this criterion an optimal deterministic policy exists which is nonstationary at first and eventually becomes stationary. We present an algorithm to compute such a policy.

In chapter 8, six of the applications introduced in chapter 1 (replacement problems, maintenance and repair, production and inventory control, optimal control of queues, stochastic scheduling and multi-armed bandit problems) are analyzed in much more detail. In most cases theoretical and computational (algorithmic) results are presented. It turns out that in many cases polynomial algorithms exist, e.g. of order $O(N^3)$, where N is the number of states. Finally, we present separable MDP problems.

Chapter 9 deals with some other topics. We start with complexity results (e.g. MDPs are P-complete, deterministic MDPs are in NC), additional constraints (for discounted and average rewards, and for MDPs with a sum of discounted rewards with different discount factors) and multiple objectives (both for discounted and for average MDPs). Then, the linear program approach for average rewards is revisited. Next, we consider mean-variance tradeoffs, followed by deterministic MDPs (models in which each action determines the next state with probability 1). In the last section of this chapter semi-Markov decision problems are analyzed.

The subject of the last chapter (chapter 10) is stochastic games, particularly the two-person zero-sum stochastic game. In such a game, both players may choose actions from their own action sets, resulting in transitions and rewards determined by both players. Zero-sum means that the reward for player 1 has to be paid by player 2. Hence, there is a conflicting situation: player 1 wants to maximize the rewards, while player 2 tries to minimize them. We discuss the value of the game and the concept of optimal policies for discounted, total as well as average rewards. We also derive mathematical programming formulations and iterative methods. In some special cases we can present finite solution methods to find the value and optimal policies. In the last section before the bibliographic notes and the exercises we discuss two-person general-sum stochastic games, in which each player has his own reward function and tries to maximize his own payoff.

These lecture notes use a lot of material, collected over the years and from various sources. The bibliographic notes refer to many books, papers and reports. I close this preface by expressing my gratitude to Arie Hordijk, who introduced me to the topic of MDPs. Furthermore, he was my supervisor and, after my PhD, a colleague for many years.

Lodewijk Kallenberg

Leiden, October, 2016.


Contents

1 Introduction 1

1.1 The MDP model . . . . 1

1.2 Policies and optimality criteria . . . . 3

1.2.1 Policies . . . . 3

1.2.2 Optimality criteria . . . . 7

1.3 Examples . . . 15

1.3.1 Red-black gambling . . . 15

1.3.2 Gaming: How to serve in tennis . . . 16

1.3.3 Optimal stopping . . . 17

1.3.4 Replacement problems . . . 18

1.3.5 Maintenance and repair . . . 19

1.3.6 Production control . . . 20

1.3.7 Optimal control of queues . . . 21

1.3.8 Stochastic scheduling . . . 22

1.3.9 Multi-armed bandit problem . . . 23

1.4 Bibliographic notes . . . 24

1.5 Exercises . . . 26

2 Finite Horizon 29

2.1 Introduction . . . 29

2.2 Backward induction . . . 30

2.3 An equivalent stationary infinite horizon model . . . 32

2.4 Monotone optimal policies . . . 33

2.5 Bibliographic notes . . . 39

2.6 Exercises . . . 40

3 Discounted rewards 43

3.1 Introduction . . . 43

3.2 Monotone contraction mappings . . . 44

3.3 The optimality equation . . . 48

3.4 Policy iteration . . . 53

3.5 Linear programming . . . 60


3.6 Value iteration . . . 72

3.7 Value iteration and bisection . . . 85

3.8 Modified Policy Iteration . . . 88

3.9 Monotone optimal policies . . . 96

3.10 Bibliographic notes . . . 100

3.11 Exercises . . . 101

4 Total reward 107

4.1 Introduction . . . 107

4.2 Square matrices, eigenvalues and spectral radius . . . 109

4.3 The linear program . . . 115

4.4 Transient, contracting, excessive and normalized MDPs . . . 116

4.5 The optimality equation . . . 125

4.6 Optimal transient policies . . . 127

4.7 The contracting model . . . 132

4.8 Finite horizon and transient MDPs . . . 136

4.9 Positive MDPs . . . 140

4.10 Negative MDPs . . . 147

4.11 Convergent MDPs . . . 151

4.12 Special models . . . 154

4.12.1 Red-black gambling . . . 154

4.12.2 Optimal stopping . . . 156

4.13 Bibliographic notes . . . 160

4.14 Exercises . . . 161

5 Average reward - general case 165

5.1 Introduction . . . 165

5.2 Classification of MDPs . . . 166

5.2.1 Definitions . . . 166

5.2.2 Classification of Markov chains . . . 167

5.2.3 Classification of Markov decision chains . . . 167

5.3 Stationary, fundamental and deviation matrix . . . 173

5.3.1 The stationary matrix . . . 173

5.3.2 The fundamental matrix and the deviation matrix . . . 176

5.4 Extension of Blackwell’s theorem . . . 180

5.5 The Laurent series expansion . . . 181

5.6 The optimality equation . . . 182

5.7 Policy iteration . . . 184

5.8 Linear programming . . . 192

5.9 Value iteration . . . 202

5.10 Bibliographic notes . . . 211


5.11 Exercises . . . 212

6 Average reward - special cases 215

6.1 The irreducible case . . . 215

6.1.1 Optimality equation . . . 216

6.1.2 Policy iteration . . . 217

6.1.3 Linear programming . . . 218

6.1.4 Value iteration . . . 224

6.1.5 Modified policy iteration . . . 224

6.2 The unichain case . . . 228

6.2.1 Optimality equation . . . 228

6.2.2 Policy iteration . . . 229

6.2.3 Linear programming . . . 234

6.2.4 Value iteration . . . 243

6.2.5 Modified policy iteration . . . 247

6.3 The communicating case . . . 251

6.3.1 Optimality equation . . . 252

6.3.2 Policy iteration . . . 252

6.3.3 Linear programming . . . 254

6.3.4 Value iteration . . . 258

6.3.5 Modified value iteration . . . 258

6.4 Bibliographic notes . . . 261

6.5 Exercises . . . 261

7 More sensitive optimality criteria 265

7.1 Introduction . . . 265

7.2 Equivalence between n-discount and n-average optimality . . . 266

7.3 Stationary optimal policies and optimality equations . . . 268

7.4 Lexicographic ordering of Laurent series . . . 271

7.5 Policy iteration for n-discount optimality . . . 274

7.6 Linear programming and n-discount optimality (irreducible case) . . . 279

7.6.1 Average optimality . . . 280

7.6.2 Bias optimality . . . 280

7.6.3 n-discount optimality . . . 282

7.7 Blackwell optimality and linear programming . . . 283

7.8 Bias optimality and policy iteration (unichain case) . . . 288

7.9 Bias optimality and linear programming . . . 289

7.9.1 The general case . . . 289

7.9.2 The unichain case . . . 297

7.10 Turnpike results and bias optimality (unichain case) . . . 298

7.11 Overtaking, average overtaking and cumulative overtaking optimality . . . 302


7.12 A weighted combination of discounted and average rewards . . . 303

7.13 A sum of discount factors . . . 310

7.14 Bibliographic notes . . . 315

7.15 Exercises . . . 316

8 Special models 321

8.1 Replacement problems . . . 322

8.1.1 A general replacement model . . . 322

8.1.2 A replacement model with increasing deterioration . . . 325

8.1.3 Skip to the right model with failure . . . 327

8.1.4 A separable replacement problem . . . 328

8.2 Maintenance and repair problems . . . 329

8.2.1 A surveillance-maintenance-replacement model . . . 329

8.2.2 Optimal repair allocation in a series system . . . 331

8.2.3 Maintenance of systems composed of highly reliable components . . . 334

8.3 Production and inventory control . . . 344

8.3.1 No backlogging . . . 344

8.3.2 Backlogging . . . 346

8.3.3 Inventory control and single-critical-number policies . . . 348

8.3.4 Inventory control and (s, S)-policies . . . 350

8.4 Optimal control of queues . . . 355

8.4.1 The single-server queue . . . 355

8.4.2 Parallel queues . . . 358

8.5 Stochastic scheduling . . . 359

8.5.1 Maximizing finite-time returns on a single processor . . . 359

8.5.2 Optimality of the µc-rule . . . 360

8.5.3 Optimality of threshold policies . . . 361

8.5.4 Optimality of join-the-shortest-queue policies . . . 362

8.5.5 Optimality of LEPT and SEPT policies . . . 364

8.5.6 Maximizing finite-time returns on two processors . . . 369

8.5.7 Tandem queues . . . 369

8.6 Multi-armed bandit problems . . . 372

8.6.1 Introduction . . . 372

8.6.2 A single project with a terminal reward . . . 372

8.6.3 Multi-armed bandits . . . 374

8.6.4 Methods for the computation of the Gittins indices . . . 377

8.7 Separable problems . . . 386

8.7.1 Introduction . . . 386

8.7.2 Examples (part 1) . . . 387

8.7.3 Discounted rewards . . . 388


8.7.4 Average rewards - unichain case . . . 390

8.7.5 Average rewards - general case . . . 392

8.7.6 Examples (part 2) . . . 399

8.8 Bibliographic notes . . . 401

8.9 Exercises . . . 403

9 Other topics 407

9.1 Complexity results . . . 408

9.1.1 Complexity theory . . . 408

9.1.2 MDPs are P-complete . . . 413

9.1.3 DMDPs are in N C . . . 415

9.1.4 For discounted MDPs, the policy iteration and linear programming method are strongly polynomial . . . 419

9.1.5 Value iteration for discounted MDPs . . . 433

9.2 Additional constraints . . . 436

9.2.1 Introduction . . . 436

9.2.2 Infinite horizon and discounted rewards . . . 436

9.2.3 Infinite horizon and total rewards . . . 446

9.2.4 Infinite horizon and total rewards for transient MDPs . . . 449

9.2.5 Finite horizon . . . 450

9.2.6 Infinite horizon and average rewards . . . 451

9.2.7 Constrained MDPs with sum of discounted rewards and different discount factors . . . 464

9.2.8 Constrained discounted MDPs with two discount factors . . . 475

9.3 Multiple objectives . . . 480

9.3.1 Multi-objective linear programming . . . 481

9.3.2 Discounted rewards . . . 483

9.3.3 Average rewards . . . 485

9.4 The linear program approach for average rewards revisited . . . 496

9.5 Mean-variance tradeoffs . . . 502

9.5.1 Formulations of the problem . . . 502

9.5.2 A unifying framework . . . 503

9.5.3 Determination of an optimal solution . . . 504

9.5.4 Determination of an optimal policy . . . 508

9.5.5 The unichain case . . . 511

9.5.6 Finite horizon variance-penalized MDPs . . . 512

9.6 Deterministic MDPs . . . 520

9.6.1 Introduction . . . 520

9.6.2 Average costs . . . 520

9.6.3 Discounted costs . . . 525


9.7 Semi-Markov decision processes . . . 532

9.7.1 Introduction . . . 532

9.7.2 Model formulation . . . 533

9.7.3 Examples . . . 534

9.7.4 Discounted rewards . . . 536

9.7.5 Average rewards - general case . . . 541

9.7.6 Average rewards - special cases . . . 548

9.7.7 Continuous-time Markov decision processes . . . 558

9.8 Bibliographic notes . . . 564

9.9 Exercises . . . 566

10 Stochastic Games 569

10.1 Introduction . . . 570

10.1.1 The model . . . 570

10.1.2 Optimality criteria . . . 571

10.1.3 Matrix games . . . 571

10.1.4 Bimatrix games . . . 574

10.2 Discounted rewards . . . 576

10.2.1 Value and optimal policies . . . 576

10.2.2 Mathematical programming . . . 588

10.2.3 Iterative methods . . . 589

10.2.4 Finite methods . . . 597

10.3 Total rewards . . . 611

10.3.1 Value and optimal policies . . . 611

10.3.2 Mathematical programming . . . 612

10.3.3 Single-controller stochastic game: the transient case . . . 614

10.3.4 Single-controller stochastic game: the general case . . . 618

10.4 Average rewards . . . 621

10.4.1 Value and optimal policies . . . 621

10.4.2 The Big Match . . . 622

10.4.3 Mathematical programming . . . 626

10.4.4 Perfect information and irreducible games . . . 635

10.4.5 Finite methods . . . 641

10.5 Two-person general-sum stochastic game . . . 670

10.5.1 Introduction . . . 670

10.5.2 Discounted rewards . . . 671

10.5.3 Single-controller stochastic games . . . 674

10.6 Bibliographic notes . . . 682

10.7 Exercises . . . 684


Chapter 1

Introduction

1.1 The MDP model

1.2 Policies and optimality criteria
1.2.1 Policies
1.2.2 Optimality criteria
1.3 Examples
1.3.1 Red-black gambling
1.3.2 Gaming: How to serve in tennis
1.3.3 Optimal stopping
1.3.4 Replacement problems
1.3.5 Maintenance and repair
1.3.6 Production control
1.3.7 Optimal control of queues
1.3.8 Stochastic scheduling
1.3.9 Multi-armed bandit problems
1.4 Bibliographic notes

1.5 Exercises

1.1 The MDP model

An MDP is a model for sequential decision making under uncertainty, taking into account both the short-term outcomes of current decisions and opportunities for making decisions in the future.

While the notion of an MDP may appear quite simple, it encompasses a wide range of applications and has generated a rich mathematical theory. In an MDP model one can distinguish the following seven characteristics.

1. The state space

At any time point at which a decision has to be made, the state of the system is observed by the decision maker. The set of possible states is called the state space and will be denoted by S. The state space may be finite, denumerable, compact or even more general. In a finite state space, the number of states, i.e. |S|, will be denoted by N.


2. The action sets

When the decision maker observes that the system is in state i, he (we will refer to the decision maker as 'he') chooses an action from a certain action set that may depend on the observed state: the action set in state i is denoted by A(i). As with the state space, the action sets may be finite, denumerable, compact or more general.

3. The decision time points

The time intervals between the decision points may be constant or random. In the first case the model is said to be a Markov decision process; when the times between consecutive decision points are random the model is called a semi-Markov decision process.

4. The immediate rewards (or costs)

Given the state of the system and the chosen action, an immediate reward (or cost) is earned (there is no essential difference between rewards and costs, namely: maximizing rewards is equivalent to minimizing costs). These rewards may in general depend on the decision time point, the observed state and the chosen action, but not on the history of the process. The immediate reward at decision time point t for an action a in state i will be denoted by $r_i^t(a)$; if the reward is independent of the time t, we will write $r_i(a)$ instead of $r_i^t(a)$.

5. The transition probabilities

Given the state of the system and the chosen action, the state at the next decision time point is determined by a transition law. These transitions only depend on the decision time point t, the observed state i and the chosen action a, and not on the history of the process. This property is called the Markov property. If the transitions really depend on the decision time point, the problem is said to be nonstationary. If the state at time t is i and action a is chosen, we denote the probability that at the next time point the system is in state j by $p_{ij}^t(a)$. If the transitions are independent of the time points, the problem is called stationary, and the transition probabilities are denoted by $p_{ij}(a)$.

6. The planning horizon

The process has a planning horizon, i.e. the set of time points at which the system has to be controlled. This horizon may be finite, infinite or of random length.

7. The optimality criterion

The objective of a Markov decision problem (or a semi-Markov decision problem) is to determine a policy, i.e. a decision rule for each decision time point and each history (including the present state) of the process, that optimizes the performance of the system. The performance is measured by a utility function. This utility function assigns to each policy a value, given the starting state of the process. In the next section we will explain the concept of a policy in more detail and we will present several optimality criteria.
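The seven ingredients above translate directly into a small data structure. The sketch below is our own illustration (the class and field names are not from the text); it stores a finite, stationary MDP with state-dependent action sets and checks that each row of transition probabilities is a distribution.

```python
# A minimal finite, stationary MDP container; names are illustrative only.
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    states: list   # the state space S
    actions: dict  # A(i): list of actions available in state i
    p: dict        # p[(i, a)][j] = transition probability to state j
    r: dict        # r[(i, a)]    = immediate reward

    def validate(self):
        # Each pair (i, a) must define a probability distribution over S.
        for i in self.states:
            for a in self.actions[i]:
                row = self.p[(i, a)]
                assert all(row.get(j, 0.0) >= 0.0 for j in self.states)
                assert abs(sum(row.values()) - 1.0) < 1e-12

# A two-state toy instance: in state 0, action 'stay' keeps the state and
# action 'go' moves to state 1; state 1 is absorbing.
mdp = FiniteMDP(
    states=[0, 1],
    actions={0: ['stay', 'go'], 1: ['stay']},
    p={(0, 'stay'): {0: 1.0}, (0, 'go'): {1: 1.0}, (1, 'stay'): {1: 1.0}},
    r={(0, 'stay'): 1.0, (0, 'go'): 5.0, (1, 'stay'): 0.0},
)
mdp.validate()
```

The `validate` check corresponds to the requirement that the transition law assigns, for each state-action pair, a probability distribution over the state space.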


Example 1.1 Inventory model with backlog

An inventory has to be managed over a planning horizon of T weeks. At the beginning of each week the manager observes the inventory on hand and has to decide how many units to order. We assume that orders can be delivered instantaneously and that there is a finite inventory capacity of B units. We also assume that the demands $D_t$ in week t, $1 \leq t \leq T$, are independent random variables that have nonnegative integer values and that the numbers $p_j(t) := P\{D_t = j\}$ are known for all $j \in \mathbb{N}_0$ and for $t = 1, 2, \dots, T$. If the demand during a period exceeds the inventory on hand, the shortage is backlogged in the next period. The optimization problem is: which inventory strategy minimizes the total expected costs?

If an order is made in week t, there is a fixed cost $K_t$ and a cost $k_t$ for each ordered unit. If at the end of week t there is a positive inventory, then there are inventory costs of $h_t$ per unit; when there is a shortage, there are backlogging costs of $q_t$ per unit. The data $K_t$, $k_t$, $h_t$, $q_t$ and $p_j(t)$, $j \in \mathbb{N}_0$, are known for all $t \in \{1, 2, \dots, T\}$.

Let i, the state of the system, be the inventory at the start of week t (shortages are modeled as negative inventory), let the number of ordered units be a and let j be the inventory at the end of week t; so j is the state at the next decision time point. Then, the following costs are involved, where we use the notation

$\delta(x) = \begin{cases} 1 & \text{if } x \geq 1; \\ 0 & \text{if } x \leq 0: \end{cases}$

ordering costs: $K_t \cdot \delta(a) + k_t \cdot a$;

inventory costs: $h_t \cdot \delta(j) \cdot j$;

backlogging costs: $q_t \cdot \delta(-j) \cdot (-j)$.

This inventory problem can be modeled as a nonstationary MDP over a finite planning horizon, with a denumerable state space and finite action sets:

$S = \{\dots, -1, 0, 1, \dots, B\}$; $A(i) = \{a \geq 0 \mid 0 \leq i + a \leq B\}$;

$p_{ij}^t(a) = \begin{cases} p_{i+a-j}(t) & j \leq i + a; \\ 0 & B \geq j > i + a; \end{cases}$

$r_i^t(a) = -\bigl\{K_t \cdot \delta(a) + k_t \cdot a + \sum_{j=0}^{i+a} p_j(t) \cdot h_t \cdot (i + a - j) + \sum_{j=i+a+1}^{\infty} p_j(t) \cdot q_t \cdot (j - i - a)\bigr\}$.
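The formulas for $p_{ij}^t(a)$ and $r_i^t(a)$ can be computed mechanically. The following sketch is our own code, not part of the text; it assumes stationary data and a finite demand support, so the infinite backlogging sum becomes finite. It returns the one-week transition law and expected reward for a given state i and order quantity a.

```python
# Transition law and one-week expected reward for the backlog inventory
# model of Example 1.1, with stationary data and finite demand support.

def inventory_week(i, a, demand_pmf, K, k, h, q):
    """Return (transitions, reward) for state i and order quantity a.
    transitions maps next state j = i + a - d to P{D = d};
    reward is minus the expected ordering + holding + backlog costs."""
    stock = i + a
    transitions = {stock - d: pd for d, pd in demand_pmf.items()}
    delta = 1 if a >= 1 else 0
    order_cost = K * delta + k * a
    holding = sum(pd * h * (stock - d) for d, pd in demand_pmf.items() if d <= stock)
    backlog = sum(pd * q * (d - stock) for d, pd in demand_pmf.items() if d > stock)
    return transitions, -(order_cost + holding + backlog)

# Demand 0, 1 or 2 with probabilities 0.3, 0.5, 0.2; costs K=10, k=2, h=1, q=4.
trans, rew = inventory_week(1, 1, {0: 0.3, 1: 0.5, 2: 0.2}, K=10, k=2, h=1, q=4)
# ordering cost 10 + 2, expected holding cost 0.3*2 + 0.5*1 = 1.1, no backlog,
# so rew is approximately -13.1
```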

1.2 Policies and optimality criteria

1.2.1 Policies

A policy R is a sequence of decision rules: $R = (\pi^1, \pi^2, \dots, \pi^t, \dots)$, where $\pi^t$ is the decision rule at time point t, $t = 1, 2, \dots$. The decision rule $\pi^t$ at time point t may depend on all available information on the system until time t, i.e. on the states at the time points $1, 2, \dots, t$ and the actions at the time points $1, 2, \dots, t-1$.

The formal definition of a policy is as follows. Consider the Cartesian product

$S \times A := \{(i, a) \mid i \in S, a \in A(i)\}$  (1.1)

and let $H_t$ denote the set of the possible histories of the system up to time point t, i.e.

$H_t := \{h_t = (i_1, a_1, \dots, i_{t-1}, a_{t-1}, i_t) \mid (i_k, a_k) \in S \times A, \ 1 \leq k \leq t-1; \ i_t \in S\}.$  (1.2)

A decision rule $\pi^t$ at time point t is a function on $H_t \times A := \{(h_t, a_t) \mid h_t \in H_t, a_t \in A(i_t)\}$ which gives the probability of the action to be taken at time t, given the history $h_t$, i.e.

$\pi^t_{h_t a_t} \geq 0$ for every $a_t \in A(i_t)$ and $\sum_{a_t} \pi^t_{h_t a_t} = 1$ for every $h_t \in H_t$.  (1.3)

Let C denote the set of all policies. A policy is said to be memoryless if the decision rule $\pi^t$ is independent of $(i_1, a_1, \dots, i_{t-1}, a_{t-1})$ for every $t \in \mathbb{N}$. So, for a memoryless policy, the decision rule at time t depends - with regard to the history $h_t$ - only on the state $i_t$; therefore the notation $\pi^t_{i_t a_t}$ is used instead of $\pi^t_{h_t a_t}$. We call C(M) the set of the memoryless policies. Memoryless policies are also called Markov policies.

If a policy is memoryless and the decision rules are independent of the time point t, i.e. $\pi^1 = \pi^2 = \cdots$, then the policy is called stationary. Hence, a stationary policy is determined by a nonnegative function $\pi$ on $S \times A$ such that $\sum_a \pi_{ia} = 1$ for every $i \in S$. The stationary policy $R = (\pi, \pi, \dots)$ is denoted by $\pi^\infty$ (and sometimes simply by $\pi$). The set of stationary policies is denoted by C(S).

If the decision rule $\pi$ of the stationary policy $\pi^\infty$ is nonrandomized, i.e. for every $i \in S$ we have $\pi_{ia} = 1$ for exactly one action $a_i$ and consequently $\pi_{ia} = 0$ for every $a \neq a_i$, then the policy is called deterministic. Hence, a deterministic policy can be described by a function f on S, where $f(i)$ is the chosen action $a_i$, $i \in S$. A deterministic policy is denoted by $f^\infty$ (and sometimes simply by f). The set of deterministic policies is denoted by C(D).

A matrix $P = (p_{ij})$ is a transition matrix if $p_{ij} \geq 0$ for all (i, j) and $\sum_j p_{ij} = 1$ for all i. For a Markov policy $R = (\pi^1, \pi^2, \dots)$ the transition matrix $P(\pi^t)$ and the reward vector $r(\pi^t)$ are defined by

$P(\pi^t)_{ij} := \sum_a p^t_{ij}(a) \cdot \pi^t_{ia}$ for every $i \in S$, $j \in S$ and $t \in \mathbb{N}$;  (1.4)

$r(\pi^t)_i := \sum_a r^t_i(a) \cdot \pi^t_{ia}$ for every $i \in S$ and $t \in \mathbb{N}$.  (1.5)

Take any initial distribution β defined on the state space S, i.e. $\beta_i$ is the probability that the system starts in state i, and take any policy R. Then, by the theorem of Ionescu Tulcea (see e.g. Bertsekas and Shreve [21], Proposition 7.28, p. 140), there exists a unique probability measure $P_{\beta,R}$ on $H_\infty$, where

$H_\infty := \{h = (i_1, a_1, i_2, a_2, \dots) \mid (i_k, a_k) \in S \times A, \ k = 1, 2, \dots\}.$  (1.6)

If $\beta_i = 1$ for some $i \in S$, then we write $P_{i,R}$ instead of $P_{\beta,R}$.

Let the random variables $X_t$ and $Y_t$ denote the state and action at time t, $t = 1, 2, \dots$. Given an initial distribution β and a policy R, by the theorem of Ionescu Tulcea, for all $j \in S$, $a \in A(j)$ the notion $P_{\beta,R}\{X_t = j, Y_t = a\}$ is well-defined as the probability that at time t the state is j and the action is a. Similarly, for all $j \in S$ the notion $P_{\beta,R}\{X_t = j\}$ is well-defined as the probability that at time t the state is j. Furthermore, $P_{\beta,R}\{X_t = j\} = \sum_a P_{\beta,R}\{X_t = j, Y_t = a\}$.
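Definitions (1.4) and (1.5) are plain averages of the transition rows and rewards under the decision rule. A minimal sketch on a finite state space (our own toy data, stationary rewards and transitions):

```python
# Build the transition matrix P(pi) and reward vector r(pi) of (1.4)-(1.5)
# for a decision rule pi on a finite state space {0, ..., N-1}.

def policy_matrices(p, r, pi):
    """p[i][a][j], r[i][a] and pi[i][a] are indexed by state i, action a."""
    N = len(pi)
    P = [[sum(pi[i][a] * p[i][a][j] for a in pi[i]) for j in range(N)]
         for i in range(N)]
    rv = [sum(pi[i][a] * r[i][a] for a in pi[i]) for i in range(N)]
    return P, rv

# Two states; two actions in state 0; pi randomizes evenly in state 0.
p = {0: {'u': [1.0, 0.0], 'v': [0.0, 1.0]}, 1: {'u': [0.0, 1.0]}}
r = {0: {'u': 1.0, 'v': 5.0}, 1: {'u': 0.0}}
pi = {0: {'u': 0.5, 'v': 0.5}, 1: {'u': 1.0}}
P, rv = policy_matrices(p, r, pi)
# P == [[0.5, 0.5], [0.0, 1.0]] and rv == [3.0, 0.0]
```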

Lemma 1.1

For any Markov policy $R = (\pi^1, \pi^2, \dots)$, any initial distribution β and any $t \in \mathbb{N}$, we have

(1) $P_{\beta,R}\{X_t = j, Y_t = a\} = \sum_i \beta_i \cdot \{P(\pi^1) P(\pi^2) \cdots P(\pi^{t-1})\}_{ij} \cdot \pi^t_{ja}$ for all $(j, a) \in S \times A$, where, if t = 1, $P(\pi^1) P(\pi^2) \cdots P(\pi^{t-1})$ is defined as the identity matrix I.

(2) $E_{\beta,R}\{r^t_{X_t}(Y_t)\} = \sum_i \beta_i \cdot \{P(\pi^1) P(\pi^2) \cdots P(\pi^{t-1}) \cdot r(\pi^t)\}_i$.

Proof

By induction on t. For t = 1,

$P_{\beta,R}\{X_1 = j, Y_1 = a\} = \beta_j \cdot \pi^1_{ja} = \sum_i \beta_i \cdot \{I\}_{ij} \cdot \pi^1_{ja}$

and

$E_{\beta,R}\{r^1_{X_1}(Y_1)\} = \sum_{i,a} \beta_i \cdot \pi^1_{ia} \cdot r^1_i(a) = \sum_i \beta_i \cdot \{I \cdot r(\pi^1)\}_i.$

Assume that the results are true for t; we show that the results also hold for t + 1:

$P_{\beta,R}\{X_{t+1} = j, Y_{t+1} = a\}$
$= \sum_{k,b} P_{\beta,R}\{X_t = k, Y_t = b\} \cdot p^t_{kj}(b) \cdot \pi^{t+1}_{ja}$
$= \sum_{k,b,i} \beta_i \cdot \{P(\pi^1) P(\pi^2) \cdots P(\pi^{t-1})\}_{ik} \cdot \pi^t_{kb} \cdot p^t_{kj}(b) \cdot \pi^{t+1}_{ja}$
$= \sum_i \beta_i \cdot \sum_k \{P(\pi^1) P(\pi^2) \cdots P(\pi^{t-1})\}_{ik} \cdot \sum_b \pi^t_{kb} \cdot p^t_{kj}(b) \cdot \pi^{t+1}_{ja}$
$= \sum_i \beta_i \cdot \sum_k \{P(\pi^1) P(\pi^2) \cdots P(\pi^{t-1})\}_{ik} \cdot \{P(\pi^t)\}_{kj} \cdot \pi^{t+1}_{ja}$
$= \sum_i \beta_i \cdot \{P(\pi^1) P(\pi^2) \cdots P(\pi^t)\}_{ij} \cdot \pi^{t+1}_{ja}.$

Furthermore, one has

$E_{\beta,R}\{r^{t+1}_{X_{t+1}}(Y_{t+1})\}$
$= \sum_{j,a} P_{\beta,R}\{X_{t+1} = j, Y_{t+1} = a\} \cdot r^{t+1}_j(a)$
$= \sum_{j,a,i} \beta_i \cdot \{P(\pi^1) P(\pi^2) \cdots P(\pi^t)\}_{ij} \cdot \pi^{t+1}_{ja} \cdot r^{t+1}_j(a)$
$= \sum_i \beta_i \cdot \sum_j \{P(\pi^1) P(\pi^2) \cdots P(\pi^t)\}_{ij} \cdot \sum_a \pi^{t+1}_{ja} \cdot r^{t+1}_j(a)$
$= \sum_i \beta_i \cdot \sum_j \{P(\pi^1) P(\pi^2) \cdots P(\pi^t)\}_{ij} \cdot \{r(\pi^{t+1})\}_j$
$= \sum_i \beta_i \cdot \{P(\pi^1) P(\pi^2) \cdots P(\pi^t) \, r(\pi^{t+1})\}_i.$
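Lemma 1.1(1) can be checked numerically on a toy instance (the data below are our own): enumerate all histories of length t under a stationary Markov rule and compare the resulting joint distribution of $(X_t, Y_t)$ with the matrix-product expression of the lemma.

```python
# Numerical check of Lemma 1.1(1) on a 2-state, 2-action toy MDP with a
# stationary Markov decision rule, by exhaustive trajectory enumeration.
from itertools import product

S, A = [0, 1], [0, 1]
p = {(0, 0): [0.8, 0.2], (0, 1): [0.1, 0.9],
     (1, 0): [0.5, 0.5], (1, 1): [1.0, 0.0]}
pi = {0: [0.3, 0.7], 1: [0.6, 0.4]}   # pi[i][a], the same rule at every t
beta = [0.25, 0.75]
t = 3

# Left-hand side: sum the probability of every history ending in (j, a).
lhs = {(j, a): 0.0 for j in S for a in A}
for traj in product(*[list(product(S, A))] * t):
    prob = beta[traj[0][0]] * pi[traj[0][0]][traj[0][1]]
    for (i, a), (j, b) in zip(traj, traj[1:]):
        prob *= p[(i, a)][j] * pi[j][b]
    lhs[traj[-1]] += prob

# Right-hand side: beta' P(pi)^(t-1), multiplied by pi_{ja}.
P = [[sum(pi[i][a] * p[(i, a)][j] for a in A) for j in S] for i in S]
mu = beta[:]
for _ in range(t - 1):
    mu = [sum(mu[i] * P[i][j] for i in S) for j in S]
rhs = {(j, a): mu[j] * pi[j][a] for j in S for a in A}

assert all(abs(lhs[k] - rhs[k]) < 1e-12 for k in lhs)
```

The enumeration grows as $(|S||A|)^t$, so this is only a sanity check; the point of the lemma is precisely that the matrix products avoid it.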

The next theorem shows that for any initial distribution β, any sequence of policies $R_1, R_2, \dots$ and any convex combination of the marginal distributions of $P_{\beta,R_k}$, $k \in \mathbb{N}$, there exists a Markov policy $R_*$ with the same marginal distribution.

Theorem 1.1

For any initial distribution β, any sequence of policies $R_1, R_2, \dots$ and any sequence of nonnegative real numbers $p_1, p_2, \dots$ satisfying $\sum_k p_k = 1$, there exists a Markov policy $R_*$ such that

$P_{\beta,R_*}\{X_t = j, Y_t = a\} = \sum_k p_k \cdot P_{\beta,R_k}\{X_t = j, Y_t = a\}$ for all $(j, a) \in S \times A$ and all $t \in \mathbb{N}$.  (1.7)


Proof

Define the Markov policy $R_* = (\pi^1, \pi^2, \dots)$ by

$\pi^t_{ja} := \dfrac{\sum_k p_k \cdot P_{\beta,R_k}\{X_t = j, Y_t = a\}}{\sum_k p_k \cdot P_{\beta,R_k}\{X_t = j\}}$ for all $t \in \mathbb{N}$ and all $(j, a) \in S \times A$.  (1.8)

In case the denominator is zero, take for $\pi^t_{ja}$, $a \in A(j)$, arbitrary nonnegative numbers such that $\sum_a \pi^t_{ja} = 1$, $j \in S$. Take any $(j, a) \in S \times A$. We prove the theorem by induction on t.

For t = 1, we obtain $P_{\beta,R_*}\{X_1 = j\} = \beta_j$ and $\sum_k p_k \cdot P_{\beta,R_k}\{X_1 = j\} = \sum_k p_k \cdot \beta_j = \beta_j$. If $\beta_j = 0$, then $P_{\beta,R_*}\{X_1 = j, Y_1 = a\} = \sum_k p_k \cdot P_{\beta,R_k}\{X_1 = j, Y_1 = a\} = 0$. If $\beta_j \neq 0$, then from (1.8) it follows that

$\sum_k p_k \cdot P_{\beta,R_k}\{X_1 = j, Y_1 = a\} = \sum_k p_k \cdot P_{\beta,R_k}\{X_1 = j\} \cdot \pi^1_{ja} = \beta_j \cdot \pi^1_{ja} = P_{\beta,R_*}\{X_1 = j, Y_1 = a\}.$

Assume that (1.7) is true for t. We shall prove that (1.7) is also true for t + 1.

$P_{\beta,R_*}\{X_{t+1} = j\}$
$= \sum_{l,b} P_{\beta,R_*}\{X_t = l, Y_t = b\} \cdot p^t_{lj}(b)$
$= \sum_{l,b,k} p_k \cdot P_{\beta,R_k}\{X_t = l, Y_t = b\} \cdot p^t_{lj}(b)$
$= \sum_k p_k \cdot \sum_{l,b} P_{\beta,R_k}\{X_t = l, Y_t = b\} \cdot p^t_{lj}(b)$
$= \sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j\}.$

If $P_{\beta,R_*}\{X_{t+1} = j\} = 0$, then $\sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j\} = 0$, and consequently, $P_{\beta,R_*}\{X_{t+1} = j, Y_{t+1} = a\} = \sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j, Y_{t+1} = a\} = 0$.

If $P_{\beta,R_*}\{X_{t+1} = j\} \neq 0$, then

$P_{\beta,R_*}\{X_{t+1} = j, Y_{t+1} = a\}$
$= P_{\beta,R_*}\{X_{t+1} = j\} \cdot \pi^{t+1}_{ja}$
$= \sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j\} \cdot \pi^{t+1}_{ja}$
$= \sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j\} \cdot \dfrac{\sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j, Y_{t+1} = a\}}{\sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j\}}$
$= \sum_k p_k \cdot P_{\beta,R_k}\{X_{t+1} = j, Y_{t+1} = a\}.$

Corollary 1.1
For any starting state $i$ and any policy $R$, there exists a Markov policy $R^*$ such that
\[
P_{i,R^*}\{X_t=j,\,Y_t=a\} = P_{i,R}\{X_t=j,\,Y_t=a\} \text{ for all } t \in \mathbb{N} \text{ and all } (j,a) \in S \times A,
\]
and
\[
E_{i,R^*}\{r^t_{X_t}(Y_t)\} = E_{i,R}\{r^t_{X_t}(Y_t)\} \text{ for all } t \in \mathbb{N}.
\]
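The construction (1.8) can be checked numerically. The sketch below is illustrative only: a hypothetical two-state, two-action model with made-up probabilities, in which two Markov policies are mixed with weights $p_1, p_2$ and the Markov policy $R^*$ is built from the ratio in (1.8); its marginal state-action distributions reproduce the mixture, as Theorem 1.1 asserts.

```python
import numpy as np

# Hypothetical 2-state, 2-action model; p[a][i][j] = transition probability.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.7, 0.3]]])  # action 1
T = 5
beta = np.array([0.6, 0.4])               # initial distribution

def marginals(beta, policy, T):
    """P{X_t = j, Y_t = a} for t = 1..T under a Markov policy.
    policy[t][j][a] = probability of action a in state j at epoch t+1."""
    out, x = [], beta.copy()
    for t in range(T):
        joint = x[:, None] * policy[t]         # joint state-action law at this epoch
        out.append(joint)
        x = np.einsum('ja,aji->i', joint, p)   # next state distribution
    return out

rng = np.random.default_rng(0)
def random_markov_policy():
    pi = rng.random((T, 2, 2))
    return pi / pi.sum(axis=2, keepdims=True)

R1, R2 = random_markov_policy(), random_markov_policy()
p1, p2 = 0.3, 0.7                              # mixture weights, sum to 1

mix = [p1 * a + p2 * b
       for a, b in zip(marginals(beta, R1, T), marginals(beta, R2, T))]

# Construction (1.8): pi^t_{ja} = mixed joint law / mixed state marginal.
Rstar = np.array([m / m.sum(axis=1, keepdims=True) for m in mix])

# The Markov policy R* reproduces the mixed marginal distributions.
assert all(np.allclose(a, b) for a, b in zip(marginals(beta, Rstar, T), mix))
```

Here the denominators in (1.8) are strictly positive, so the arbitrary-choice clause of the proof is never needed.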


1.2.2 Optimality criteria

We consider the following optimality criteria:

1. Total expected reward over a finite horizon.

2. Total expected discounted reward over an infinite horizon.

3. Total expected reward over an infinite horizon.

4. Average expected reward over an infinite horizon.

5. More sensitive optimality criteria over an infinite horizon.

Assumption 1.1
In infinite horizon models we assume that the immediate rewards and the transition probabilities are stationary, and we denote these quantities by $r_i(a)$ and $p_{ij}(a)$, respectively, for all $i$, $j$ and $a$.

Total expected reward over a finite horizon

Consider an MDP with a finite planning horizon of $T$ periods. For any policy $R$ and any initial state $i \in S$, the total expected reward over the planning horizon is defined by
\[
v^T_i(R) := \sum_{t=1}^{T} E_{i,R}\{r^t_{X_t}(Y_t)\}
= \sum_{t=1}^{T} \sum_{j,a} P_{i,R}\{X_t=j,\,Y_t=a\} \cdot r^t_j(a) \ \text{ for all } i \in S. \tag{1.9}
\]
Interchanging the summation and the expectation in (1.9) is allowed, so $v^T_i(R)$ may also be defined as the expected total reward, i.e.
\[
v^T_i(R) := E_{i,R}\Big\{\sum_{t=1}^{T} r^t_{X_t}(Y_t)\Big\} \ \text{ for all } i \in S.
\]

Let
\[
v^T_i := \sup_{R \in C} v^T_i(R) \ \text{ for all } i \in S, \tag{1.10}
\]
or in vector notation, $v^T = \sup_{R \in C} v^T(R)$. The vector $v^T$ is called the value vector. From Corollary 1.1 and Lemma 1.1, it follows that
\[
v^T = \sup_{R \in C(M)} v^T(R) \tag{1.11}
\]
and
\[
v^T(R) = \sum_{t=1}^{T} P(\pi^1)P(\pi^2)\cdots P(\pi^{t-1}) \cdot r(\pi^t)
\ \text{ for } R = (\pi^1, \pi^2, \dots) \in C(M). \tag{1.12}
\]
A policy $R^*$ is called an optimal policy if
\[
v^T(R^*) = v^T. \tag{1.13}
\]
It is nontrivial that there exists an optimal policy: the supremum has to be attained and it has to be attained simultaneously for all starting states. It can be shown (see the next chapter) that an optimal Markov policy $R^* = (f^1, f^2, \dots, f^T)$ exists, where $f^t$ is a deterministic decision rule for $t = 1, 2, \dots, T$.
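The existence of an optimal Markov policy $(f^1, \dots, f^T)$ suggests the usual backward-induction computation, which the next chapter develops formally. A minimal sketch on hypothetical stationary two-state, two-action data (all numbers made up):

```python
import numpy as np

# Made-up stationary data: r[i][a] rewards, p[a][i][j] transition probabilities.
r = np.array([[1.0, 2.0], [0.5, 3.0]])
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])

def backward_induction(r, p, T):
    """Return the value vector v^T and deterministic rules (f^1, ..., f^T)."""
    v = np.zeros(r.shape[0])          # value of the empty remaining horizon
    rules = []
    for _ in range(T):                # work backwards over epochs T, T-1, ..., 1
        q = r + np.einsum('aij,j->ia', p, v)   # q[i][a] = r_i(a) + sum_j p_ij(a) v_j
        rules.append(q.argmax(axis=1))         # deterministic decision rule
        v = q.max(axis=1)
    rules.reverse()                   # rules[0] is the decision rule for epoch 1
    return v, rules

vT, rules = backward_induction(r, p, 3)   # value vector for horizon T = 3
```

The supremum in (1.10) is attained here simultaneously for both starting states, by one Markov policy made of deterministic rules.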


Total expected discounted reward over an infinite horizon

Assume that an amount $r$ earned at time point 1 is deposited in a bank with interest rate $\rho$. This amount becomes $(1+\rho) \cdot r$ at time point 2, $(1+\rho)^2 \cdot r$ at time point 3, etc. Hence, for interest rate $\rho$, an amount $r$ at time point 1 is comparable with $(1+\rho)^{t-1} \cdot r$ at time point $t$, $t = 1, 2, \dots$.

Define $\alpha := (1+\rho)^{-1}$ and call $\alpha$ the discount factor. Note that $\alpha \in (0,1)$. Then, conversely, an amount $r$ received at time point $t$ is considered as equivalent to the amount $\alpha^{t-1} \cdot r$ at time point 1, the so-called discounted value.

Hence, the reward $r_{X_t}(Y_t)$ at time point $t$ has at time point 1 the discounted value $\alpha^{t-1} \cdot r_{X_t}(Y_t)$. The total expected $\alpha$-discounted reward, given initial state $i$ and policy $R$, is denoted by $v^{\alpha}_i(R)$ and defined by
\[
v^{\alpha}_i(R) := \sum_{t=1}^{\infty} E_{i,R}\{\alpha^{t-1} \cdot r_{X_t}(Y_t)\}. \tag{1.14}
\]
Obviously, $v^{\alpha}_i(R) = \sum_{t=1}^{\infty} \alpha^{t-1} \sum_{j,a} P_{i,R}\{X_t=j,\,Y_t=a\} \cdot r_j(a)$. Another way to consider the discounted reward is by the expected total $\alpha$-discounted reward, i.e.
\[
E_{i,R}\Big\{\sum_{t=1}^{\infty} \alpha^{t-1} \cdot r_{X_t}(Y_t)\Big\}.
\]

Since
\[
\Big|\sum_{t=1}^{\infty} \alpha^{t-1} \cdot r_{X_t}(Y_t)\Big|
\le \sum_{t=1}^{\infty} \alpha^{t-1} \cdot M = (1-\alpha)^{-1} \cdot M,
\]
where $M = \max_{i,a} |r_i(a)|$, the theorem of dominated convergence (e.g. Bauer [13], p. 71) implies
\[
E_{i,R}\Big\{\sum_{t=1}^{\infty} \alpha^{t-1} \cdot r_{X_t}(Y_t)\Big\}
= \sum_{t=1}^{\infty} E_{i,R}\{\alpha^{t-1} \cdot r_{X_t}(Y_t)\} = v^{\alpha}_i(R), \tag{1.15}
\]
i.e. the expected total discounted reward and the total expected discounted reward criteria are equivalent.

Let $R = (\pi^1, \pi^2, \dots) \in C(M)$; then
\[
v^{\alpha}(R) = \sum_{t=1}^{\infty} \alpha^{t-1} \cdot P(\pi^1)P(\pi^2)\cdots P(\pi^{t-1}) \cdot r(\pi^t). \tag{1.16}
\]
Hence, a stationary policy $\pi^{\infty}$ satisfies
\[
v^{\alpha}(\pi^{\infty}) = \sum_{t=1}^{\infty} \alpha^{t-1} P(\pi)^{t-1} r(\pi). \tag{1.17}
\]
Like before, the value vector $v^{\alpha}$ is defined by
\[
v^{\alpha} := \sup_{R \in C} v^{\alpha}(R). \tag{1.18}
\]
A policy $R^*$ is an optimal policy if
\[
v^{\alpha}(R^*) = v^{\alpha}. \tag{1.19}
\]


In Chapter 3 we will show the existence of an optimal deterministic policy $f^{\infty}$ for this criterion, and we will also prove that the value vector $v^{\alpha}$ is the unique solution of the so-called optimality equation
\[
x_i = \max_{a \in A(i)} \Big\{ r_i(a) + \alpha \sum_j p_{ij}(a) x_j \Big\} \ \text{ for all } i \in S. \tag{1.20}
\]
Furthermore, we will derive that $f^{\infty}$ is an optimal policy if
\[
r_i(f) + \alpha \sum_j p_{ij}(f) v^{\alpha}_j \ge r_i(a) + \alpha \sum_j p_{ij}(a) v^{\alpha}_j
\ \text{ for all } a \in A(i) \text{ and all } i \in S. \tag{1.21}
\]
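Equation (1.20) can be solved numerically by repeatedly applying its right-hand side, a successive-approximation scheme whose justification comes later; the sketch below uses hypothetical two-state data and reads off a decision rule with the property (1.21).

```python
import numpy as np

alpha = 0.9
r = np.array([[1.0, 2.0], [0.5, 3.0]])     # hypothetical r[i][a]
p = np.array([[[0.9, 0.1], [0.2, 0.8]],    # hypothetical p[a][i][j]
              [[0.5, 0.5], [0.7, 0.3]]])

def value_iteration(r, p, alpha, tol=1e-10):
    v = np.zeros(r.shape[0])
    while True:
        # Right-hand side of (1.20): q[i][a] = r_i(a) + alpha * sum_j p_ij(a) v_j
        q = r + alpha * np.einsum('aij,j->ia', p, v)
        v_new = q.max(axis=1)
        if np.abs(v_new - v).max() < tol:
            return v_new, q.argmax(axis=1)   # rule f satisfying (1.21) at v^alpha
        v = v_new

v_alpha, f = value_iteration(r, p, alpha)
```

At termination $v^{\alpha}$ is (up to the tolerance) a fixed point of (1.20), and the stationary policy $f^{\infty}$ built from the maximizing actions is $\alpha$-discounted optimal.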

Total expected reward over an infinite horizon

A logical definition of the total expected reward is the total expected discounted reward with discount factor $\alpha = 1$. So, given initial state $i$ and policy $R$, we obtain $\sum_{t=1}^{\infty} E_{i,R}\{r_{X_t}(Y_t)\}$. However, in general $\sum_{t=1}^{\infty} E_{i,R}\{r_{X_t}(Y_t)\}$ may not be well-defined. Therefore, we consider this criterion under the following assumptions.

Assumption 1.2
(1) The model is substochastic, i.e. $\sum_j p_{ij}(a) \le 1$ for all $(i,a) \in S \times A$.
(2) For any initial state $i$ and any policy $R$, $\sum_{t=1}^{\infty} E_{i,R}\{r_{X_t}(Y_t)\}$ is well-defined (possibly $\pm\infty$).

Under these assumptions the total expected reward, which we denote by $v_i(R)$ for initial state $i$ and policy $R$, is well-defined by
\[
v_i(R) := \sum_{t=1}^{\infty} E_{i,R}\{r_{X_t}(Y_t)\}. \tag{1.22}
\]
In this case, we can also write $v_i(R) = \sum_{t=1}^{\infty} \sum_{j,a} P_{i,R}\{X_t=j,\,Y_t=a\} \cdot r_j(a)$. The value vector, denoted by $v$, and the concept of an optimal policy are defined in the usual way:
\[
v := \sup_{R \in C} v(R). \tag{1.23}
\]
A policy $R^*$ is an optimal policy if
\[
v(R^*) = v. \tag{1.24}
\]
Under the additional assumption that every policy $R$ is transient, i.e.
\[
\sum_{t=1}^{\infty} P_{i,R}\{X_t=j,\,Y_t=a\} < \infty \ \text{ for all } i, j \text{ and all } a,
\]
it can be shown (cf. Kallenberg [148], Chapter 3) that most properties of the discounted MDP model are valid for the total reward MDP model, taking discount factor $\alpha = 1$.
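For a single stationary policy in a transient substochastic model, the series defining $v$ can be summed in closed form, since $\sum_{t \ge 1} P^{t-1} = (I-P)^{-1}$ when $P^t \to 0$. A sketch with made-up substochastic data (rows summing to less than one, the missing mass being absorption):

```python
import numpy as np

# Made-up substochastic transition matrix under one stationary policy:
# both row sums are < 1; the missing probability mass is "stopping".
P = np.array([[0.5, 0.3], [0.2, 0.4]])
r = np.array([2.0, -1.0])

# Closed form: v = sum_{t>=1} P^{t-1} r = (I - P)^{-1} r
# (valid here because the spectral radius of P is below 1, i.e. the policy is transient).
v = np.linalg.solve(np.eye(2) - P, r)

# Cross-check against a long partial sum of the series.
v_sum, Pt = np.zeros(2), np.eye(2)
for _ in range(200):
    v_sum += Pt @ r
    Pt = Pt @ P
assert np.allclose(v, v_sum)
```

This is exactly the discounted closed form with $\alpha = 1$, which is why the transient total-reward model inherits the discounted theory.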


Average expected reward over an infinite horizon

In the criterion of average reward the limiting behavior of the average reward over the first $T$ periods, i.e. $\frac{1}{T}\sum_{t=1}^{T} r_{X_t}(Y_t)$, is considered for $T \to \infty$. Since $\lim_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} r_{X_t}(Y_t)$ may not exist and interchanging limit and expectation is not allowed in general, there are four different evaluation measures which can be considered:

1. Lower limit of the average expected reward:
\[
\phi_i(R) := \liminf_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} E_{i,R}\{r_{X_t}(Y_t)\},\ i \in S,
\ \text{ with value vector } \phi := \sup_{R \in C} \phi(R).
\]
2. Upper limit of the average expected reward:
\[
\overline{\phi}_i(R) := \limsup_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} E_{i,R}\{r_{X_t}(Y_t)\},\ i \in S,
\ \text{ with value vector } \overline{\phi} := \sup_{R \in C} \overline{\phi}(R).
\]
3. Expectation of the lower limit of the average reward:
\[
\psi_i(R) := E_{i,R}\Big\{\liminf_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} r_{X_t}(Y_t)\Big\},\ i \in S,
\ \text{ with value vector } \psi := \sup_{R \in C} \psi(R).
\]
4. Expectation of the upper limit of the average reward:
\[
\overline{\psi}_i(R) := E_{i,R}\Big\{\limsup_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} r_{X_t}(Y_t)\Big\},\ i \in S,
\ \text{ with value vector } \overline{\psi} := \sup_{R \in C} \overline{\psi}(R).
\]

The next lemma shows the relation between these four criteria.

Lemma 1.2
$\psi_i(R) \le \phi_i(R) \le \overline{\phi}_i(R) \le \overline{\psi}_i(R)$ for every state $i$ and every policy $R$.

Proof
Take any state $i$ and any policy $R$. The first inequality follows from Fatou's lemma (e.g. Bauer [13], p. 126):
\[
\psi_i(R) = E_{i,R}\Big\{\liminf_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} r_{X_t}(Y_t)\Big\}
\le \liminf_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} E_{i,R}\{r_{X_t}(Y_t)\} = \phi_i(R).
\]
The second inequality, $\phi_i(R) \le \overline{\phi}_i(R)$, is obvious. The third inequality is also a consequence of Fatou's lemma:
\[
\overline{\phi}_i(R) = \limsup_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} E_{i,R}\{r_{X_t}(Y_t)\}
\le E_{i,R}\Big\{\limsup_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} r_{X_t}(Y_t)\Big\} = \overline{\psi}_i(R).
\]

We will present two examples to show that the quantities $\psi_i(R)$, $\phi_i(R)$, $\overline{\phi}_i(R)$ and $\overline{\psi}_i(R)$ may differ for some state $i$ and some policy $R$. In the first example we show that $\psi_i(R) < \phi_i(R)$ and $\overline{\phi}_i(R) < \overline{\psi}_i(R)$ is possible; the second example shows that $\phi_i(R) < \overline{\phi}_i(R)$ is possible.

We use directed graphs to illustrate examples. The nodes of the graph represent the states. If the transition probability $p_{ij}(a)$ is positive there is an arc $(i,j)$ from node $i$ to node $j$; for $a = 1$ we use a simple arc, for $a = 2$ a double arc, etc.; next to the arc from node $i$ to node $j$ we note the transition probability $p_{ij}(a)$.
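That the lower and upper limits can genuinely differ is already visible for a deterministic reward stream, before any probability enters: if rewards alternate between 1 and 0 over blocks of doubling length, the running averages oscillate between roughly $1/3$ and $2/3$ and never converge. A small sketch (made-up block scheme, not one of the book's examples):

```python
# Deterministic reward stream: blocks of 1's and 0's with doubling lengths.
rewards, value, length = [], 1, 1
while len(rewards) < 100_000:
    rewards.extend([value] * length)
    value, length = 1 - value, 2 * length

# Running Cesaro averages (1/T) * sum_{t<=T} r_t.
avg, total = [], 0
for T, rt in enumerate(rewards, start=1):
    total += rt
    avg.append(total / T)

# Past the initial transient the average keeps oscillating between ~1/3 and ~2/3,
# so the liminf and limsup of the averages differ.
lo, hi = min(avg[1000:]), max(avg[1000:])
assert hi - lo > 0.25
```

Embedded in an MDP (one reward per deterministic transition), such a stream would make $\phi_i(R) < \overline{\phi}_i(R)$ in the spirit of the second example above.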
