• No results found

Markov-modulated Poisson Processes in Practice

N/A
N/A
Protected

Academic year: 2021

Share "Markov-modulated Poisson Processes in Practice"

Copied!
34
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst

(1)

Bachelor Project

by Ruben van Oostenbrugge

Markov-modulated Poisson

Processes in Practice

The Expectation-Maximization Algorithm in all its Glory

Supervisor: M. Heemskerk, MSc

Examiner: Prof. Dr. M. Mandjes

Universiteit van Amsterdam Studentnummer: 11020385

(2)

Abstract

A Markov-modulated Poisson process (or MMPP) is a Poisson process for which the arrival rate changes according to a finite underlying Markov pro-cess. MMPP’s are, for example, used for making models in signal processing. Parameter estimation of MMPP’s isn’t an easy task, because the sought param-eters are intertwined in a very challenging way. A solution for this problem is found in the Expectation-Maximization (EM) algorithm. This thesis explains the EM algorithm thoroughly and then applies it in several cases. The first and most simple case is a mixture of two Gaussians with fixed variance. A sample is drawn from this distribution, after which the application of the EM algorithm unfolds. The same is done for a more complex mixture of Gaussians. This mixture consists of an arbitrary amount of Gaussians with unknown means and variances.

Finally, the EM algorithm is applied to Markov-modulated Poisson Processes. This requires –conform the method of Deng & Mark [1]– that the sample drawn from the MMPP is first transformed to its discrete counterpart. Then the EM algorithm is being applied, where the Lagrange method of finding optima that are subject to a constraint is used to satisfy the constraint that the probabilities need to sum up to 1. The Forward-Backward algorithm is used to calculate the desired conditional probabilities in the Expectation step of the EM algorithm. There are certain conditions that make the estimates resulting from the EM algorithm better or worse. In general, If the difference between arrival rates of the MMPP get smaller, the resulting estimates seem to get worse. Also, a rea-sonable first estimate of our parameter tuple is needed to achieve convergence to the true parameter: after all, the EM algorithm only converges to a local maximum of the likelihood. In conclusion, the EM algorithm converges, when applied to MMPP’s and under certain conditions, to the true parameters in a satisfactory manner. An open question remains if there exists a way to find the desired parameters even quicker; the EM algorithm is known to converge quite slowly.

(3)

Contents

Preface 2

1 Notation and definitions 3

2 The EM algorithm 5

2.1 Jensen’s inequality . . . 5

2.2 The EM algorithm . . . 7

2.3 Finding the best parameter for Ψ . . . 7

2.4 The EM algorithm for Gaussian clustering . . . 8

2.4.1 Two Gaussians with variance 1 . . . 8

2.4.2 k Gaussians with unknown variance . . . 10

3 The EM algorithm for a Markov Modulated Poisson Process 13 3.1 Discretization of MMPP’s . . . 14

3.2 The probability measure for the discretized MMPP . . . 15

3.3 The Forward-Backward algorithm . . . 17

3.3.1 Forward . . . 17

3.3.2 Backward . . . 18

3.3.3 Concluding the Forward-Backward algorithm . . . 18

3.4 The Expectation Step . . . 18

3.5 The Maximization Step . . . 21

3.6 Concluding the EM algorithm . . . 22

3.7 Is it any good? . . . 23

4 Conclusion 25

5 Popular Summary 26

(4)

Preface

Let’s say that we have a binary data set: there is a time interval and at each moment in time, a certain event either occurs or not. It could be that we ob-serve clusters of events; the event occurs relatively frequent / infrequent within certain sub-intervals. In that case, if we don’t know the cause of this cluster-ing, we could assume that our data originates from a Poisson process for which the arrival rate changes according to an underlying Markov process. Such a Poisson process (which strictly isn’t a Poisson process anymore, because it has no constant rate) is called a Markov Modulated Poisson Process. But how do we find the right parameters for this MMPP? Finding the maximum likelihood estimator seems the way to go, but finding the true parameters for the under-lying Markov process turns out to be hard because the states of this Markov process are not observed. Information that is underlying to a data set but isn’t observed is also referred to as latent data.

The main purpose of this thesis is to show how the Expectation-Maximization algorithm can be used to find the maximum likelihood estimators for the param-eters of a Markov Modulated Poisson process. This non-trivial quest is thereby simplified into small steps (iterations of the EM algorithm) towards the best solution. Those steps are –as opposed to the problem as a whole– analytically tractable.

MMPP’s are being used in practice, for example in signal processing. As MMPP’s can describe processes that are very general in nature, it is not hard to come up with a large variety of potential use cases. Basically any process that involves the occurrence of events is eligible for modeling with an MMPP. The only requisite is that the arrival rate of the process changes from time to time and that we don’t have a sufficient model for these changes (and thus assume those changes to occur randomly). Therefore, this thesis gives an answer to a real world question.

The method that is described in this thesis, is first developed by Deng and Mark [1].

(5)

Chapter 1

Notation and definitions

I will now present some concepts needed to construct and analyze the Markov Modulated Poisson process.

Definition 1.0.1. Stochastic Process

A stochastic process is (the numerical interpretation of) a measurable variable that changes randomly over time. The change of the variable is thus subject to a certain stochastic distribution. Often, a stochastic process is expressed as Xt, where t is a point in time.

Remark. Note that both the mathematical object of a process and the empirical data it aims to model are being referred to as ”stochastic process”.

Definition 1.0.2. Time-Homogeneous Process

A stochastic process is called time-homogeneous (or simply: homogeneous) if and only if it’s underlying stochastic distributions are the same for each time interval of the same length.

Definition 1.0.3. Poisson Process

A Poisson process is a stochastic process that counts occurrences of an event. The variable Xtcounts the amount of occurrences in an interval of length t and is P ois(λt)-distributed, where λ is called the arrival rate (or simply: rate) of the process. A random counting process is a Poisson process if and only if it is time homogeneous. The time between two events is exponentially distributed with parameter λ.

Definition 1.0.4. Markov Process

A Markov process is a stochastic process Xtif and only if for each t1< t2holds that the distribution of the random variable Y := Xt2− Xt1 only depends on the value Xt1 (and not on the values of Xt for t < t1).

Definition 1.0.5. Finite Markov Process

A Markov process is said to be finite if and only if the process takes finitely many values. The set of values taken is called the state space.

(6)

Remark. The state space of a Markov process can be chosen to be an index set, which enables us to have the state space index any set of (real world) situations that we can describe in words!

Definition 1.0.6. Irreducible

A Markov process is said to be irreducible if and only there is, for any two states, a strictly positive probability of moving from one to the other in finite time.

Definition 1.0.7. Doubly Stochastic Poisson Process

A doubly stochastic Poisson process (or Cox process) is a Poisson process for which the arrival rate λtis subject to another stochastic process (and thus can vary randomly over time).

Remark. Strictly, a doubly stochastic Poisson process is no Poisson process. Definition 1.0.8. Markov Modulated Poisson Process

A Markov Modulated Poisson process (or MMPP) is a doubly stochastic Poisson process for which the arrival rate λtis determined by a finite irreducible Markov process.

(7)

Chapter 2

The EM algorithm

In the field of statistical data analysis, the objective is often, given a data set, to specify a probability distribution that models the data set as accurate as possible. The choice of distribution is often arbitrary and there are techniques to help make this choice. Mostly, a classical family of distributions (such as the normal distributions) is chosen, after which estimating the parameters of the distribution family becomes the objective. The scientific consensus is to find the parameters that maximize the probability for the data set to occur. This is called ’maximum-likelihood estimation’ or ’MLE’ for short. In a lot of cases, it is an easy task to find this MLE analytically. Sometimes, however, different parameters that we need to estimate are too entangled to optimize analytically; in this case, it can be useful to use the so called EM algorithm [4]. The EM algorithm makes use of latent variables in a data set (variables that belong to a data set, but aren’t observed). It takes an initial value for the parameter vector and facilitates optimizing the parameters step by step, which after enough steps leads to a better estimation of all parameters. We will unveil the EM algorithm in this chapter.

2.1

Jensen’s inequality

We start with a result that is known as Jensen’s inequality. This result is, in some sense, the basis of the EM algorithm.

Lemma 1. Let k ∈ N≥1, {x1, ..., xk}, {a1, ..., ak} ⊂ R≥0 and Pk i=1ai = 1. Then log k X i=1 aixi ! ≥ k X i=1 ailog(xi) and this inequality is an equality if x1= x2= · · · = xk.

(8)

Proof. If x1= x2= · · · = xk, then log k X i=1 aixi ! = log k X i=1 aix1 ! = log x1· k X i=1 ai ! = log(x1) = k X i=1 ailog(x1) = k X i=1 ailog(xi).

Consider that x1 = x2 = · · · = xk doesn’t hold. Let α ∈ [0, 1]. We know that the equation

log((1 − α) · x1+ α · x2) − (1 − α) · log(x1) − α · log(x2) ≥ 0 (2.1) holds with equality for α = 0 and α = 1. The second derivative of the expression in (2.1) w.r.t. α is −(x2−x1)2

((1−α)·x1+α·x2)2 ≤ 0. Thus it is must be the

case that (2.1) holds for all α ∈ [0, 1]. This proves the lemma for k = 2. We use induction to prove the lemma. Suppose that for some k and for any {x1, ..., xk−1}, {a1, ..., ak−1} ⊂ R≥0 such thatPk−1i=1 ai= 1,

log k−1 X i=1 aixi ! ≥ k−1 X i=1 ailog(xi). Let {x∗1, ..., x∗k}, {a∗ 1, ..., a∗k} ⊂ R≥0 such that Pk i=1a ∗

i = 1. then, by our induc-tion hypothesis and the fact that the lemma holds for k = 2,

log k X i=1 a∗ix∗i ! = log a∗kx∗k+ k−1 X i=1 a∗ix∗i ! = log a∗kx∗k+ (1 − a∗k) · k−1 X i=1 a∗ i 1 − a∗kx ∗ i ! ≥ a∗k· log(x∗k) + (1 − a∗k) · log k−1 X i=1 a∗i 1 − a∗ k x∗i ! ≥ a∗k· log(x∗k) + (1 − a∗k) · k−1 X i=1 a∗ i 1 − a∗k log(x ∗ i) = k X i=1 a∗ilog(x∗i).

(9)

2.2

The EM algorithm

We will now show how the EM algorithm works. Let Ω := Ω0× Ω1 be a finite two-dimensional sample space, let F := 2Ω

be a σ-algebra and let {Pθ : F → [0, 1]|θ ∈ Θ} and be a family of probability measures for some set Θ. Suppose that we have a data set x(1), ..., x(m) ∈ Ω

0 and we want to find θ such that Qm

i=1Pθ({x

(i)} × Ω

1) is maximized. In a lot of these cases, it is very hard (if not impossible) to calculate this optimum analytically. The EM algorithm iteratively leads to a better and better solution, where the calculations of each step CAN be done analytically. We have that

arg max θ  m Y i=1 Pθ({x(i)} × Ω1)  = arg max θ  m X i=1 logPθ({x(i)} × Ω1)  .

Let Pθ(z|x) denote the conditional probability Pθ({x}×{z})

Pθ({x}×Ω1) and let θn, θ ∈ Θ.

Then, by Jensen’s inequality,

φ(θ) : = m X i=1 log({x(i)} × Ω1)  (2.2) = m X i=1 log X z∈Ω1 Pθ({x(i)} × {z}) ! = m X i=1 log X z∈Ω1 Pθn(z|x (i) ) · Pθ({x (i)} × {z}) Pθn(z|x (i)) ! ≥ m X i=1 X z∈Ω1 Pθn(z|x (i)) · log Pθ({x(i)} × {z}) Pθn(z|x (i))  (2.3) =: ψn(θ). (2.4)

Let θn+1= arg maxθψn(θ). Note that θ = θngives equality in (2.3), because the logarithm in the sum is then taken over a constant. Hence φ(θn+1) ≥ φ(θn). This gives us an algorithm to maximize φ(θ): we choose an initial value θ0, plug it in expression (2.3) as θn and define θ1 = θn+1 = arg maxθψn(θ). We can then proceed to plug in θ1 as our new θn, define θ2 = θn+1= arg maxθψn(θ), etcetera. We just showed that this approach will at least not lead to a worse choice for θ than we began with, which is a great start!

2.3

Finding the best parameter for Ψ

In order to run the EM algorithm, we must first assign an initial value to θ0. Then we must find θ1 := arg maxθψ0(θ), and so on. Note that if we have θn,

(10)

we can already calculate Qz,i(θn) := 1 Pθn(z|x (i)). (2.5) We can rewrite ψn(θ) = m X i=1 X z∈Ω1 1 Qz,i(θn)

· logPθ({x(i)} × {z}) · Qz,i(θn)

= m X i=1 X z∈Ω1

log Pθ({x(i)} × {z}) + log(Qz,i n))

Qz,i(θn) .

Hence, because Qz,i(θn) is a constant with respect to θ,

arg max θ ψn(θ) = arg max θ m X i=1 X z∈Ω1 log Pθ({x(i)} × {z}) Qz,i(θn) . (2.6)

Let’s try to find the latter value by taking the derivative of ψn(θ) and set it to zero. ∂ ∂θ m X i=1 X z∈Ω1 log Pθ({x(i)} × {z}) Qz,i(θn) = m X i=1 X z∈Ω1 (∂θ∂ Pθ({x(i)} × {z})) Qz,i(θn) · Pθ({x(i)} × {z}) = X z∈Ω1  m X i=1 (∂ ∂θPθ({x (i)} × {z})) Qz,i(θn) · Pθ({x(i)} × {z})  = 0. (2.7)

We shall later see that equation (2.7) can be solved analytically for some distri-butions.

2.4

The EM algorithm for Gaussian clustering

The Em algorithm is frequently used to estimate the parameters of Gaussian mixtures. A Gaussian mixture is in fact a linear combination of multiple Gaus-sians. As we don’t know which point in our data set originates from which Gaussian, it isn’t an easy task to find the best parameters.

2.4.1

Two Gaussians with variance 1

Let’s first keep things simple. We start with a mixture of two Gaussians of which the variance is known to be 1. A point in our data set has equal probability of originating from either of the two Gaussians. Let’s say that m = 500, µ1 = 5 and µ2= 10, but we don’t know the latter two. Our data set could then (plotted in a histogram) look like in figure 2.1.

(11)

Figure 2.1: µ1= 5, µ2= 10

The latent variable z ∈ Ω1defines if a point originates from the first (z = 0) or the second (z = 1) Gaussian. Note that we can’t observe z. The mean parameters µ1 and µ2 are to be estimated. Let θ0:= (1, 3) be our initial guess of the tuple (µ1, µ2). Our ”marginal” probabilities are

Pθ({x(i)} × {z}) = 1 2· 1 √ 2π · e −1 2(θ (z)−x(i))2 = 1 2 ·√2π· e −1 2(θ (z)−x(i))2 = 1 2 ·√2π· e −1 2(θ (z))2+x(i)θ(z)1 2(x (i))2 , where θ = (θ(0), θ(1)). The derivative w.r.t. θ is

∂ ∂θPθ({x (i) } × {z}) =x (i)− θ(z) 2 ·√2π · e −1 2(θ (z)−x(i))2 .

We now write, from equation (2.7) and the fact that Ω1= {0, 1}, that

 m X i=1 (∂θ({x(i)} × {0})) Q0,i(θn) · Pθ({x(i)} × {0})  +  m X i=1 (∂θ({x(i)} × {1})) Q0,i(θn) · Pθ({x(i)} × {1})  = m X i=1 x(i)− θ(0) Q0,i(θn) , m X i=1 x(i)− θ(1) Q1,i(θn)  = (0, 0), hence, θ(z)= Pm i=1 x(i) Qz,i(θn) Pm i=1 1 Qz,i(θn) , ∀z ∈ Ω1. (2.8)

For a Gaussian mixture, expression (2.6) is quadratic with negative leading coefficient, thus expression (2.8) gives us our desired θn+1:= arg maxθψn(θ).

Running the algorithm yields a very quick convergence to the real parameters µ0and µ1, as shown in figure 2.2 ( 0,1,2,3 and 4 iterations respectively).

(12)

Figure 2.2

2.4.2

k Gaussians with unknown variance

Let’s make things a lot harder, by assuming that our data originates from an even mixture of k Gaussians where some parameters are known and other pa-rameters need to be estimated. We define:

Nz∼ N (µz, σ2

z), ∀z ∈ Ω1= {1, ..., k}; Iµ, Iσ⊂ {1, ..., k}.

Let Iµ and Iσ denote the sets of numbers z such that respectively µz or σzare unknown parameters that we would like to estimate. Let P be the real measure of our data source. Our ”marginal” probabilities are now

P({x(i)} × {z}) = 1 k · σz·√2π · e −1 2(µz −x (i) σz ) 2 . The derivatives w.r.t. µ and σ are

∂ ∂µzP({x (i)} × {z}) = (x(i)− µ z) · P({x(i)} × {z}); ∂ ∂σzP({x (i)} × {z}) =(x (i)− µz)2− σ2 z σ3 z · P({x(i)} × {z}).

Let’s assume that we posses, for some n, parameters θn,σz and θn,µz, denoting

respectively the estimated parameters for the standard deviation and the mean of our zth Gaussian after n iterations. Note that some of those parameters are fixed, but it doesn’t matter. Defining θn+1,σz and θn+1,µz as the values that maximize ψn(θ) from equation (2.4) yields a maximum greater than the expression in equation (2.2) as all parameters can still be chosen equal to the parameters θn,σz and θn,µz, in which case Jensen’s inequality holds with equality.

(13)

In fact, even if we fix additional parameters, equality can still be obtained. This way we can have the steps of the algorithm optimize only our µ’s or only our σ’s alternately. Just as in section 2.4.1, to find our optimum, we need to solve equation (2.7) and thus need that each summand of this equation equals zero:

m X i=1 (x(i)− µ z) · 1Iµ(z) +(x (i)−µ z)2−σ2z σ3 z · 1Iσ(z) Qz,i(θn) = 0, ∀z ∈ {1, ..., k}. Rewriting this with τz:= σz−1 yields

m X i=1 (x(i)− µz) · 1 Iµ(z) + ((x(i)− µz)2· τz3− τz) · 1Iσ(z) Qz,i(θn) = 0, ∀z ∈ {1, ..., k}. (2.9) For a given z, finding µz and τz simultaneously such that equation (2.9) is satisfied is an incredibly hard task to do analytically. But as mentioned earlier, we can optimize µz and τz alternately.

This means that in the µ-step, we assume all σ’s fixed. this leads to the following equation. m X i=1 (x(i)− µz) · 1Iµ(z) Qz,i(θn) = 0, ∀z ∈ {1, ..., k}, which implies θn+1,µz = µz= m X i=1 (x(i)) · 1Iµ(z) Qz,i(θn) / m X i=1 1Iµ(z) Qz,i(θn).

In the σ-step (or τ -step), we assume all µ’s fixed. this leads to the following equation. m X i=1 ((x(i)− µz)2· τ3 z − τz) · 1Iσ(z) Qz,i(θn) = 0, ∀z ∈ {1, ..., k}. (2.10)

which simplifies to a quadratic equation w.r.t. τ . Solving this and then calcu-lating σzyields θn+1,σz = σz= 1Iσ(z) · v u u t m X i=1 (x(i)− µz)2 Qz,i(θn) / m X i=1 1 Qz,i(θn).

The algorithm resulting from this θn+1,µzand θn+1,σzwas implemented in a sim-ple web app. Generating a samsim-ple of size 1000 of the distribution14(N (−40, 52)+ N (0, 42) + N (10, 32) + N (15, 22)), yields, for 2,4,8,16 and 32 iterations re-spectively the yellow areas plotted against the data set (in histogram) in fig-ure 2.3. After 32 iterations, the fitted distribution is 14(N (−40.1, (4.7)2) + N (−0.4, (3.6)2) + N (9.8, (4.5)2) + N (15.8, (1.9)2)), which is reasonable.

(14)
(15)

Chapter 3

The EM algorithm for a

Markov Modulated Poisson

Process

In this chapter, we will provide a way, as contributed by L. Deng and J.W. Mark [1], to estimate the parameters of an MMPP, provided some sample of arrivals. The method given contains the following steps:

• transform the MMPP to its discrete counterpart, • apply the EM algorithm to the discretized MMPP,

• within each iteration of the EM algorithm, carry out the maximization step with the Lagrangian method for optimization with constraints, • if the EM algorithm has reached some convergence criterion, transform

the MMPP back to the continuum.

We will now delve into it. Let’s say that we have a data set S = {S1, ..., Sn} ⊂ R≥0of interarrival times. Let’s assume that the MMPP shifts between N arrival rates λ1, ..., λN conform some matrix

Q =      −σ1 σ12 · · · σ1N σ21 −σ2 · · · σ2N .. . ... ... ... σN 1 σN 2 · · · −σN      ,

where σi =PNj=1,j6=iσij. We will try to find the λ’s and off-diagonal σ’s that maximize the probability of observing S.

(16)

3.1

Discretization of MMPP’s

According to L. Deng and J.W. Mark [1], it is sufficient to perform the parameter estimation for a discretization of our MMPP and then transform the MMPP back to the continuum. The proposed discretization is as follows. The total time interval underlying to our interarrival times Si is divided in T equal time slots of length h, where h is chosen such that it is reasonable to assume that there occurs at most one event per time slot: an arrival of the Poisson process or a jump in the hidden Markov chain. Then our MMPP can be described in terms of a discrete transition matrix A ∈ RN ×N, and probabilities b(1)

i and b (0) i for having respectively one or zero events occur in one time slot, conditional on the underlying Markov chain being in state i ∈ {1, ..., N }. From the definition of the Poisson distribution, we must have that

b(1)i = λih · e−λih,

So for small h the approximation b(1)i ≈ λih is reasonable. Note that for any given i ∈ {1, ..., N }, the holding time is distributed as the minimum of the N − 1 exponential distributions with the outgoing rates σij (j 6= i) as parameters [2]. This minimum is distributed exponentially with the sum of those parameters as a parameter, hence the probability of our Markov chain to jump in a time slot of length h is equal to 1 − e−h·PNj=0,j6=iσij. By using the standard approximation

ex≈ 1 + x (which becomes exact when x tends to zero), this probability equals approximately h·PNj=0,j6=iσij. given that there is a jump from state i, the prob-ability that this jump is to state j 6= i equals σij

PN

j=0,j6=iσij, thus the probability

of jumping from state i to j (6= i) in an interval of length h is approximately h · σij=: Aij. Summarizing, we define for all i:

b(1)i := h · λi; b(0)i := 1 − h · λi; Aij := h · σij, ∀j 6= i; Aii := 1 − N X j=0,j6=i h · σij. (3.1)

When h tends to zero, the error of these approximations tends to zero, which is convenient. Our set S of interarrival times can now be transformed to a tuple X := (X1, ..., XT) ∈ {0, 1}T =: Ω0 that expresses for each time slot whether there was an arrival (1) or not (0). We assume that jumps only occur at the very beginning of a time slot, which results in a small error for small h. We define Z := (Z1, ..., ZT) ∈ {1, ..., N }T =: Ω1 as the (latent) states in which our hidden Markov chain was at the start of each time slot. Finally, we define θ := (A11(θ), ..., AN N(θ), b(1)1 (θ), ..., b(1)N (θ)) as the tuple of parameters that we need to estimate.

(17)

Figure 3.1: Visualization of a discretized MMPP with N = 2: the vertical bars represent the 50 time slots of length h, dark color represents an arrival, red / blue represents the rate per time slot.

the time slots of length h, a dark color means that an arrival occurred in the given time slot and the colors red / blue define in which state the underlying Markov chain was in the time slot. In the visualization it is clear that the color red represents a higher rate than blue (there are a lot more arrivals in the red bars). Keep in mind that the colors are unknown (latent), we only posses a sequence of zeros and ones. The image also shows us that it is a clustering problem that we are looking at, which makes this problem similar to the problem in the previous chapter.

3.2

The probability measure for the discretized

MMPP

In order to calculate the desired entities for the maximization step of the EM algorithm, we need to obtain the values of some (conditional) probabilities. We now have a measure space, consisting of the sample space

(18)

Ω := Ω0× Ω1= X × Z = {(X1, ..., XT)} × {(Z1, ..., ZT)}, the σ-algebraF := 2Ω and a probability measure

Pθ:F → [0, 1] that depends on θ and satisfies

Pθ(Z) = T −1 Y t=1 AZt,Zt+1(θ), Pθ(X|Z) = T Y t=1 (b(1)Z t(θ)) Xt(b(0) Zt(θ)) 1−Xt,

where the first equation is just the product of the transition probabilities of the jumps that are made in our underlying Markov chain and the latter equation assigns b(1)Z

t(θ) or b

(0)

Zt(θ) to each time slot (as Xttakes a value from {0, 1}), which

expresses the probability of having an arrival within the time slot conditional on the state Ztof our underlying Markov chain. Hence,

Pθ(X, Z) = Pθ(Z) · Pθ(X|Z) = T −1 Y t=1 AZt,Zt+1(θ)  ·  T Y t=1 (b(1)Z t(θ)) Xt(b(0) Zt(θ)) 1−Xt  , (3.2) and Pθ(Z|X) = Pθ (X, Z) P Z∈Ω1Pθ(X, Z) . (3.3)

Hereby, we can theoretically calculate almost all probabilities needed to con-tinue for now. Note that we have chosen h to be small, such that there is a large number of time slots (in practice, order 100 is the minimum). The size of the set Ω1 becomes a problem if we try to obtain the conditional probabilities Pθn(Zt= i, Zt+1= j|X) and Pθn(Zt= i|X). We need to sum over all possible

paths of Z that lead to Zt= i. The amount of paths becomes really large really quickly as the amount of time slots becomes greater: 2k, where k is the amount of time slots until t. Our universe will be out of existence by the time our calculation is finished. Luckily, there is a method called the forward-backward algorithm [5] that allows to isolate the needed conditional probabilities without checking all paths.

(19)

3.3

The Forward-Backward algorithm

In this section, we will show briefly how to obtain the quantity Pθn(Zt|X) with

the forward-backward algorithm. First, note that the following probabilities are known: Pθn(Xt|Zt) = (b (1) Zt(θn)) Xt(b(0) Zt(θn)) 1−Xt; Pθn(Zt|Zt−1) = AZt,Zt+1(θn).

Also we assume Pθn(Z1) to be known.

The conditional probability Pθn(Zt|X) is equal to

Pθn(Zt|X) =

Pθn(Zt, X)

Pθn(X)

,

so it is sufficient to calculate Pθn(Zt, X) for each possible value of Zt, as we

can sum those and thereby find the normalization factor Pθn(X) to obtain

Pθn(Zt|X).

Furthermore, by decomposing the probability in conditional probabilities and then erasing an unnecessary condition:

Pθn(Zt, X) = Pθn(Zt, X1:T)

= Pθn(Xt+1:T|Zt, X1:t) · Pθn(Zt, X1:t)

= Pθn(Xt+1:T|Zt) · Pθn(Zt, X1:t)

= βt(Zt) · αt(Zt) (3.4)

Where the notation Xa:b means that the realized value of X agrees to the X within the probability measure at least for all entries from a to b.

3.3.1

Forward

The forward part of the algorithm calculates the quantity αt(Zt) = Pθn(Zt, X1:t).

We have that

α1(Z1) = Pθn(Z1, X1) = Pθn(Z1) · Pθn(X1|Z1)

is a known quantity. For t ≥ 2, we observe, by summing over all probabilities of the previous time slot and by decomposing in conditional probabilities, that

αt(Zt) = Pθn(Zt, X1:t) = N X Zt−1=1 Pθn(Zt, Zt−1, X1:t) = N X Zt−1=1 Pθn(Xt|Zt, Zt−1, X1:t−1) · Pθn(Zt|Zt−1, X1:t−1) · Pθn(Zt−1, X1:t−1) = N X Zt−1=1 Pθn(Xt|Zt) · Pθn(Zt|Zt−1) · αt−1(Zt−1)

(20)

This shows that α can be constructed by using its recurrence and the known probabilities, hence we can calculate αt(Zt) for each t.

3.3.2

Backward

The backward part of the algorithm calculates the quantity βt(Zt) = Pθn(Xt+1:T|Zt).

We have that βT(ZT) = 1 [5]. For t ≤ T − 1, we have, again by decomposing in a smart way and then removing the excessive conditions:

βt(Zt) = Pθn(Xt+1:T|Zt) = N X Zt+1=1 Pθn(Xt+1:T, Zt+1|Zt) = N X Zt+1=1 Pθn(Xt+2:T|Zt+1, Zt, Xt+1) · Pθn(Xt+1|Zt+1, Zt) · Pθn(Zt+1|Zt) = N X Zt+1=1 Pθn(Xt+2:T|Zt+1) · Pθn(Xt+1|Zt+1) · Pθn(Zt+1|Zt) = N X Zt+1=1 βt+1(Zt+1) · Pθn(Xt+1|Zt+1) · Pθn(Zt+1|Zt)

We can use this recurrence of β and the known probabilities to calculate βt(Zt) for each t.

3.3.3

Concluding the Forward-Backward algorithm

By the previous two subsections, we can now compute the quantities αt(Zt) and βt(Zt) from equation (3.4), hence we can compute the probability Pθn(Zt|X).

Note that the sums we defined to calculate this consist of N summands and the recurrence relations have a depth of at most T . Hence few calculations are needed to carry out this computation, that is: computation time will be reasonably short. In what follows, we will use the EM algorithm to converge towards a good estimate of our parameter tuple θ.

3.4

The Expectation Step

We will use the EM algorithm to find the maximum likelihood estimator for θ. We will show what one iteration looks like, beginning with the expectation step.

(21)

Suppose that θn is an estimation of our parameter tuple. We need to find arg max θ Pθ(X) = arg max θ  Pθ(X) Pθn(X)  = arg max θ  log Pθ(X) Pθn(X)   , and by the Baum inequality [1], we know that

log Pθ(X) Pθn(X)

 ≥ X Z∈Ω1

Pθn(Z|X) · (log(Pθ(X, Z)) − log(Pθn(X, Z))) =: ψn(θ),

which is trivially an equality if θ = θn. Similar to what we did in the previous chapter, we can try to optimize this lower bound and define

θn+1:= arg max θ (ψn(θ)) = arg max θ X Z∈Ω1 Pθn(Z|X) · log(Pθ(X, Z)). (3.5)

By the equation of (3.3), by applying basic rules for logarithms and by intro-ducing the indicator functions (such that we sum over the values Aij and bi), we can rewrite the logarithm in the latter expression as

log(Pθ(X, Z)) = T −1 X t=1 log AZt,Zt+1(θ)   +  T X t=1 log(b(1)Z t(θ)) Xt  + log(b(0)Z t(θ)) 1−Xt  = N X i=1 N X j=1 T −1 X t=1 1{i}(Zt) · 1{j}(Zt+1)  log(Aij(θ))  + N X i=1 T X t=1 1{i}(Zt) ·  Xt· logb(1)i (θ)+ (1 − Xt) · logb(0)i (θ)  = N X i=1 N X j=1 T −1 X t=1 1{i}(Zt) · 1{j}(Zt+1)  log(Aij(θ))  + N X i=1 T X t=1 1{i}(Zt) ·  Xt· logb(1)i (θ)+ (1 − Xt) · log1 − b(1)i (θ)  ,

which assures that the function in equation (3.5) is a linear combination of logarithms that we can rewrite as

(22)

N X i=1 N X j=1 Cij· log(Aij(θ)) + N X i=1 

Di1· logb(1)i (θ)+ Di2· log1 − b(1)i (θ) 

, (3.6)

where Cij, Di1 and Di2 are, for all i, j ∈ {1, ..., N }, constants w.r.t. θ that are fixed by θn (our previous estimate of θ), our data set X and the nature of our problem. Actually, Cij equals

Cij= X Z∈Ω1 Pθn(Z|X) · T −1 X t=1 1{i}(Zt) · 1{j}(Zt+1), Di1 equals Di1= X Z∈Ω1 Pθn(Z|X) · T X t=1 1{i}(Zt) · Xt and Di2 equals Di2= X Z∈Ω1 Pθn(Z|X) · T X t=1 1{i}(Zt) · (1 − Xt).

Note that we have the constraint that PNj=1Aij = 1, ∀i ∈ {1, ..., N }. When looking at the summands over i of both sums in (3.6), we notice that there is no dependence between them, hence we can separately maximize the individual entities N X j=1 Cij· log(Aij(θ)) (3.7) and Di1· log  b(1)i (θ)+ Di2· log  1 − b(1)i (θ). (3.8) It turns out that the maximization of the entities (3.7) and (3.8) can be done analytically, which embodies the maximization step of the EM algorithm. We will proceed with that in the next section.

(23)

3.5

The Maximization Step

Given an estimate θn of our parameter tuple θ, we decided in the previous section that, representing an iteration in the EM algorithm, we can define θn+1 to be the value of θ that maximizes each of the summands (3.7) and (3.8) of the expression in (3.6). In this section, we will carry out this maximization. As (3.7) is subject to the constraintPNj=1Aij = 1, we introduce the Lagrangian function L(Ai1, ..., AiN, λ) : =  N X j=1 Cij· log(Aij(θ))  − λ ·  N X j=1 (Aij(θ) − 1 N)  = N X j=1 Cij· log(Aij(θ)) + λ N − λ · Aij(θ). (3.9) It is known [3] that the optimum of the entity (3.7) is a saddle point of its Lagrangian function (3.9). The partial derivatives ofLare:

∂L(Ai1, ..., AiN, λ) ∂Aij = cij Aij − λ; ∂L(Ai1, ..., AiN, λ) ∂λ = N X k=1 (1 N − Aik). Setting all partial derivatives equal to zero, yields

Aij = Cij λ and λ = N X k=1 Cik, and thus Aij= PNCij k=1Cik (3.10)

defines the unique saddle point forL, hence (3.10) must be the set of points for which (3.7) is maximized. This makes sense, because

(24)

and

N X

k=1

Cik= Eθn[#{jumps from i}|X]

hold, where the expectations are taken over Z. The quantity Aij is thus the proportion (in expectation) of jumps from i that end up in j conditional on X and θn, which is exactly what we want.

We can conclude that

Aij =E

θn[#{jumps from i to j}|X]

Eθn[#{jumps from i}|X]

= PT −1

t=1 Pθn(Zt= i, Zt+1= j|X)

PT −1

t=1 Pθn(Zt= i|X)

As for equation (3.8), it is straightforward to take the derivative, set it equal to zero and, by the fact that the expectation of the indicator function is just the probability that Zt= i, arrive at

b(1)i (θ) = Di1 Di1+ Di2. = P Z∈Ω1Pθn(Z|X) · PT t=11{i}(Zt) · Xt P Z∈Ω1Pθn(Z|X) · PT t=11{i}(Zt) =Eθn[ PT t=11{i}(Zt) · Xt|X] Eθn[ PT t=11{i}(Zt)|X] = PT t=1Xt· Eθn[1{i}(Zt)|X] PT t=1Eθn[1{i}(Zt)|X] = PT t=1Xt· Pθn[Zt= i|X] PT t=1Pθn[Zt= i|X]

With the expectation step and the maximization step, we have found, given estimation θn, an estimation θn+1 of our parameter tuple θ that is at least not worse than the previous estimation.

3.6

Concluding the EM algorithm

We have now described one iteration of the EM algorithm in the context of a discretized MMPP. The best parameters are found by repeating this step until convergence, let’s say after M iterations. Now we transform our parameters back to their original form. We have parameters

(25)

and

b(1)i (θM), ∀i ∈ {1, ..., N }, which leaves us, by the definitions of (3.1), with

σij = Aij(θM) h , ∀j 6= i; σii = j=N X j=1,j6=i −σij, ∀i ∈ {1, ..., N }; λi= b(1)i (θM) h , ∀i ∈ {1, ..., N }.

This terminates our search for the maximum likelihood estimator for an MMPP successfully!

3.7

Is it any good?

Deng and Mark [1] conclude that it is necessary for the EM algorithm to con-verge nicely in this context to start with a first estimate θ0 that is reasonably good. They use k-means clustering for the first estimate and then use the EM algorithm as described in this chapter to give better estimates. They simulated, among others, five different MMPP’s with two rates and arrived at the results in Table 3.1. What we can observe when we look at this table, is that the estimation gets worse if the difference between the rates λ1and λ2 are smaller. In that case, it is not clear which rate applies at which time slot, so it seems logical that the algorithm struggles with this.

The table shows that the estimates are reasonable. Therefore we can view the application of the EM algorithm on MMPP’s as a success!

(26)

λ1 λ2 σ12 σ21 True Parameter 100 10 10 1 EM estimates 97.7 10.1 10.2 0.93 True Parameter 100 20 1 1 EM estimates 99.4 21.1 0.64 1.00 True Parameter 100 30 3 2 EM estimates 109.0 36.0 1.36 0.4 True Parameter 100 70 1 1 EM estimates 94.6 79.4 37.7 0.29 True Parameter 100 0 3 3 EM estimates 98.2 0.19 2.84 2.75 Table 3.1: EM estimates for five different MMPP’s [1]

(27)

Chapter 4

Conclusion

The EM algorithm is a special algorithm that get closer to the sought optimum in each iteration. That is, if the initial estimate is good enough; the EM algo-rithm can converge to local optima otherwise.

It seems that the algorithm exactly fits the use case as described in this thesis: maximum likelihood estimation with latent parameters. A thought experiment on how to use the EM algorithm in more general cases where there are no latent parameters failed quickly.

When applied to Gaussian mixtures, the EM algorithm seems to do a great job at finding the maximum likelihood estimates for the means and variances, although greater variances seem to cause greater errors.

When applied to MMPP’s, the EM algorithm seems to do a decent job at finding the maximum likelihood estimate for the transition probabilities and an even better job for the arrival rates.

In future research, it would be worthwhile to redo the simulations done by Deng and Mark [1]. This is especially tricky because generating a small MMPP and running the EM algorithm on it conform the method provided in this thesis costs at least one hour of computation time. This makes debugging a long-term process.

Also, the k-means method of coming up with an acceptable initial value of the parameter tuple could be worthwhile delving into: according to Deng and Mark, a good initial value is crucial for their method to produce a good result. The process, that involves the EM algorithm, is elegant because it doesn’t de-crease the likelihood (that we seek to maximize) in any iteration. However, it seems to be the case that there are known algorithms that converge to the MLE in fewer iterations. It is therefore justified to investigate other algorithms as well in future research.

In conclusion: we showed a complete method of finding reasonable estimates of the parameters of an MMPP. As this was the very goal of this thesis, we can view it as a success.

(28)

Chapter 5

Popular Summary

Imagine being locked up in a room with no windows. The only light in the room is from a digital clock telling you the time. There are no sounds you can hear, except, when it rains, you can hear the raindrops on the roof. You get the message that you are being released only if you predict the number of raindrops that fall on the roof within a week from the current moment, with an error margin of 5 %. You can make only one prediction per day. As you have very little information (you can only count the raindrops and see the current time), it is challenging to come up with a decent way to make the prediction. But if you were to find this thesis underneath a loose floorboard, it might set you free! This thesis provides a method to describe a model that agrees the most with the pattern of raindrops that you heard until now. Assuming that the model is close to correct, you can then make a reasonable prediction of the number of raindrops within the following week.

A very naive model would be to assume that, no matter what circumstances, the probability of having one raindrop in the next millisecond doesn’t change the whole week. In that case, we are talking about a Poisson Process. The Poisson process is useful in a lot of case, but we need something more sophisticated. After all: the probability of rain is greater if it is already raining.

What if we assume that there are four types of rainfall: ”no rain”, ”drizzle”, ”moderate rain” and ”shower”. We could then formulate for each type the prob-ability of having a raindrop fall on the roof within the next millisecond (arrival probability). We can also formulate for each combination of two rainfall types, a probability of going, within one millisecond, from one type of rainfall to the other (transition probability). Note that it is in this case very likely that the type of rainfall stays the same (because a millisecond is very short). What we just described is called a Markov Modulated Poisson Process.

The difficulty here is to find the arrival probabilities and transition probabilities that match best with the raindrops that we heard during the time we were in the room. Strictly, we don’t know what type of rainfall applies to the current moment. For example, it could be that it pours (”shower”), but due to some

(29)

combination of wind and trees that are near the room, there are only few rain-drops that fall on the roof of the room for a little while. Note that this is why we defined the type of rain by the probability of a raindrop falling on the roof within the next millisecond, for we are under no circumstances certain that this will happen. This makes the problem at hand really tricky.

Luckily, there are techniques to tackle this problem. The so called EM algo-rithm is a method to find better and better estimates of said probabilities step by step.

Of course, I hope the reader doesn’t get caught up in such a twisted situation as described above, but the method in this thesis is also used in other practical and scientific fields. In fact, the method applies to countless use cases in which there is a process that comes with an underlying process that isn’t observed!

(30)

Bibliography

[1] Deng, L., Mark, J.W. Parameter estimation for Markov modulated Poisson processes via the EM algorithm with time discretization. Telecommunication Systems 1, 321–338 (1993). https://doi.org/10.1007/BF02136167

[2] Norris, J. R. (1997). Markov chains. Cambridge: Cambridge University Press.

[3] Beavis, B. & Dobbs, I. M. (1999). Optimization and stability theory for economic analysis. Cambridge: Cambridge University Press.

[4] Dempster, A., Laird, N., & Rubin, D. (1977). Maximum Likelihood from In-complete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1-38.

[5] Russell, S. & Norvig, Peter. (2003). Artificial Intelligence, A Modern Ap-proach. Second Edition.

(31)

Appendix A

Python code for EM

iteration

The Python code for the implementation of the EM algortihm in a web app. The result of Figure 2.3 was calculated by this code.

import numpy as np import mpmath as mp import scipy as sp import matplotlib.pyplot as plt def p_theta_n(k,thetaMu_n,thetaSigma_n,x,z): y = mp.exp(-(((thetaMu_n[z]-x)/thetaSigma_n[z])**2)/2)/(thetaSigma_n[z]*(2* np.pi)**(0.5)*k) return y def Q_n_z_i(k,thetaMu_n,thetaSigma_n,x,z): sum0 = 0 p=0 while(p<k): sum0 += p_theta_n(k,thetaMu_n,thetaSigma_n,x,p) p+=1

return sum0 / p_theta_n(k,thetaMu_n,thetaSigma_n,x,z)

def calcMuSigma(x=y,iMu = [0,1,2,3],iSigma = [0,1,2,3],thetaMu_n = [1,3,5,7],thetaSigma_n = [1,1,1,1], chooseInitials = 0): k = len(thetaMu_n) iterations = 0 if(chooseInitials == 1): thetaMu_0 = np.arange(0,k) * 0.95124865621 thetaSigma_0 = np.ones(k) * 1.2456712354

(32)

for i in np.arange(0,k):

thetaMu_0[i] = thetaMu_0[i] * (i in iMu) + thetaMu_n[i] * (1 -(i in iMu))

thetaSigma_0[i] = thetaSigma_0[i] * (i in iSigma) + thetaSigma_n[i] * (1 -(i in iSigma))

thetaMu_n = thetaMu_0 thetaSigma_n = thetaSigma_0 while(iterations<2): z=0 if((-1)**iterations == 1): newMus = [] while(z<k): sum_0 = 0 sum_1 = 0 i=0

while(i<len(x)):

sum_0+= x[i] / Q_n_z_i(k,thetaMu_n,thetaSigma_n,x[i],z) sum_1+= 1 / Q_n_z_i(k,thetaMu_n,thetaSigma_n,x[i],z) i+=1

muCalc = sum_0 / sum_1

newMu = float(muCalc * (z in iMu) + thetaMu_n[z] * (1 -(z in iMu))) newMus.append(newMu) z+=1 thetaMu_n = newMus if((-1)**iterations == -1): newSigmas = [] while(z<k): sum_0 = 0 sum_1 = 0 i=0

while(i<len(x)):

sum_0+= (x[i] - thetaMu_n[z])**2 /

Q_n_z_i(k,thetaMu_n,thetaSigma_n,x[i],z)

sum_1+= 1 / Q_n_z_i(k,thetaMu_n,thetaSigma_n,x[i],z) i+=1

sigmaCalc = (sum_0 / sum_1)**(0.5)

newSigma = float(sigmaCalc * (z in iSigma) + thetaSigma_n[z] * (1 - (z in iSigma))) newSigmas.append(newSigma)

z+=1

(33)

iterations+= 1

return([thetaMu_n,thetaSigma_n])

def getVals(x,thetaMu_n ,thetaSigma_n): points = 500

k = len(thetaMu_n) gaussianVals = [] histoVals = [] labels = []

binAmount = min(30,max(10,int(round(len(x)/20,0)))) h = (max(x) - min(x))/binAmount

bins = np.zeros(binAmount) l = 0 while(l< binAmount): binMin = min(x) + l*h binMax = min(x) + (l+1) * h q = 0 while(q < len(x)):

if (x[q]>= binMin and x[q] < binMax): bins[l]+=1

q+=1 l+=1 j = 0

while(j < points):

x_val = min(x) + (j*(max(x)-min(x)))/points gaussianVal = 0 r = 0 while(r < k): gaussianVal += mp.exp(-(((thetaMu_n[r]-x_val)/thetaSigma_n[r])**2)/2)/(thetaSigma_n[r]*(2* np.pi)**(0.5)*k) r+=1

gaussianVals.append(float(gaussianVal)) s = 0

histoVal = 0

while(s < binAmount): binMin = min(x) + s*h binMax = min(x) + (s+1) * h

if (x_val>= binMin and x_val < binMax): histoVal += bins[s]

s+=1

histoVals.append(float(histoVal/(h*len(x)))) label = ’’

if (j % (int(round(points / 10,0))) == 0): label = round(float(x_val),2)

(34)

j+=1

Referenties

GERELATEERDE DOCUMENTEN

Hoewel er nog maar minimaal gebruik gemaakt is van de theorieën van Trauma Studies om Kanes werk te bestuderen, zal uit dit onderzoek blijken dat de ervaringen van Kanes

Je hebt bij elke vraag maar twee mogelijke uitkomsten, succes en mislukking, en de kans daarop blijft bij elke vraag gelijk.. Geen

This Act, declares the state-aided school to be a juristic person, and that the governing body shall be constituted to manage and control the state-aided

A suitable homogeneous population was determined as entailing teachers who are already in the field, but have one to three years of teaching experience after

Next to increasing a leader’s future time orientation, it is also expected that high levels of cognitive complexity will result in a greater past and present time orientation..

Finally, in order to allow users to check the accuracy of their programs and their own inverse orthogonaliza- tion procedures, BHS list in Table V of their article the I =0, 1,

At the end of each period, after the new stock value is attained, a dividend is paid and the stock price is reduced by the corresponding amount Formally, these operations are

Hence, the transition kernel corresponding to time t is a convolution of the two transition kernels corresponding to time t of the two independent processes.. Of course you have