
On the Comparison of Changepoints Models in Network Reconstruction with Themselves and their Bayesian Counterparts

Master Thesis

January 13, 2017

Student: Dennis Steenhuis

Primary supervisor: Dr. M.A. Grzegorczyk


Abstract

In this thesis we describe two different types of changepoint models: the Changepoint Model based on Segment Neighbourhood Search and the Hidden Markov Model. These models will be compared to each other and to the Homogeneous Model in their Network Reconstruction Accuracy on artificial data. After this comparison, all three frequentist methods will be compared to their Bayesian counterparts with use of the yeast data.


Contents

1 Introduction

2 Theory
2.1 Dynamic Programming
2.1.1 A simple example of Dynamic Programming
2.2 EM-algorithm
2.2.1 Outline of the EM-algorithm [13]
2.2.2 An example of the EM-algorithm
2.2.3 Convergence of the EM-algorithm
2.2.4 Convergence problems of the EM-algorithm
2.3 Graphs & Network Reconstruction Accuracy
2.3.1 Graphs
2.3.2 Network Reconstruction Accuracy Criteria

3 Change Point Models
3.1 Definition
3.2 Segment Neighbourhood Search
3.2.1 Penalised Likelihood
3.2.2 Minimum Description Length

4 Hidden Markov Models
4.1 Definition
4.1.1 Markov Models
4.1.2 Hidden Markov Models
4.2 Computations with HMMs
4.2.1 Forward Algorithm
4.2.2 Backward Algorithm
4.2.3 Viterbi Algorithm
4.2.4 The Baum-Welch Algorithm
4.2.5 The Baum-Welch algorithm with specified densities
4.3 Hidden Markov Model Regression (HMMR)
4.3.1 Example of HMMR
4.4 Scaling
4.4.1 Forward and Backward algorithm
4.4.2 Viterbi Algorithm
4.5 Left-Right HMMs and the connection with Change Point Models

5 Comparison of methods
5.1 Data Design
5.2 Homogeneous Model
5.2.1 Sequence 1
5.2.2 Sequence 2 & 3
5.3 Changepoint Model
5.3.1 Sequence 1
5.3.2 Sequence 2
5.3.3 Sequence 3
5.4 Hidden Markov Models
5.4.1 Sequence 2
5.4.2 Sequence 3
5.5 Conclusion of the comparison

6 Results on Yeast Data
6.1 Data
6.2 Inference and Results
6.2.1 Homogeneous Model
6.2.2 Changepoint Models
6.2.3 Hidden Markov Models
6.3 Conclusion

7 Conclusion & Discussion
7.1 Conclusion
7.2 Discussion
7.2.1 Possible improvements

A Reconstructed HMM sequences
A.1 Section 5.4.1
A.2 Section 5.4.2
A.3 Section 6.2.3

B Data


List of Figures

2.1 A simplified city with 7 intersections and the travel time in minutes between those intersections.
2.2 The green road is the fastest road from A to G, with a travel time of 14 minutes.
2.3 Histogram of example data.
2.4 True density and estimated density.
2.5 The EM-algorithm can converge to different (local) maxima.
2.6 Three estimated densities and the true density.
2.7 A simplified city with 7 intersections, which is an undirected graph.
2.8 Example of a directed graph.
3.1 Simulated stock prices with known changepoints at times 8, 16, 24 and 32.
3.2 Simulated stock prices with estimated changepoints at times 8.375, 16.375, 23.375 and 31.125.
3.3 Both minimizing the BIC and the MDL value gives the right number of changepoints; the AIC gives two extra.
3.4 Regression lines using the optimal number of changepoints, according to the AIC, BIC and MDL values.
4.1 Dependency structure of a Hidden Markov Model.
4.2 HMM Regression with unequal variances among the groups.
4.3 HMM Regression with equal variances among the groups.
4.4 HMM Regression with unequal variances among the groups, stuck on a local maximum of the likelihood.
4.5 HMM Regression with equal variances among the groups, stuck on a local maximum of the likelihood.
4.6 Example stock price with a CPM and a Left-Right HMM regression line.
4.7 Example stock price with a CPM and a Left-Right Equal Variance HMM regression line.
5.1 True network for the artificial data.
5.2 Reconstructed artificial network with five independent linear models over sequence 1 (AUC-ROC=0.8).
5.3 Reconstructed artificial network with five independent linear models over sequence 2 (AUC-ROC=0.43).
5.4 Reconstructed artificial network with five independent linear models over sequence 3 (AUC-ROC=0.43).
5.5 Reconstructed artificial network with five independent changepoint models on sequence 1 (AUC-ROC=0.73).
5.6 Reconstructed artificial network with five independent changepoint models over sequence 2 (AUC-ROC=0.97).
5.7 Reconstructed artificial network with five independent changepoint models over sequence 3 (AUC-ROC=0.67).
5.8 Reconstructed artificial network with five changepoint models, with a global changepoint allocation vector over sequence 3 and the BIC minimization criterion (AUC-ROC=0.80).
5.9 Reconstructed artificial network with five changepoint models, with a global changepoint allocation vector over sequence 3 and the MDL minimization criterion (AUC-ROC=0.90).
5.10 Reconstructed artificial network with five independent Hidden Markov models over sequence 2 (AUC-ROC=0.50).
5.11 Reconstructed artificial network with five independent Left-Right Hidden Markov models over sequence 2 (AUC-ROC=0.53).
5.12 Reconstructed artificial network with five independent Hidden Markov models over sequence 3 (AUC-ROC=0.50).
5.13 Boxplot of 100 converged BIC values of node C for each of the 16 parent options (left to right is increasing number of parents; option 2 is the true one).
5.14 Reconstructed artificial network with five independent Hidden Markov Models over sequence 3, in which equal variances among the states are assumed (AUC-ROC=1).
5.15 Boxplot of 100 converged BIC values of node C for each of the 16 parent options (left to right is increasing number of parents; option 2 is the true one).
5.16 Estimated parameters for node C, with the true parents. Green = State 1, Black = State 2, Red = State 3; dotted lines are the true parameters.
6.1 True network for the yeast network.
6.2 Estimated graph using the Homogeneous model (AUC-PR=0.57).
6.3 Estimated graph using the Changepoint model, with a fixed changepoint (AUC-PR=0.443).
6.4 Estimated graph using the Changepoint model, with one changepoint (AUC-PR=0.49).
6.5 Estimated graph using the Changepoint model, with BIC as minimization criterion (AUC-PR=0.43).
6.6 Estimated graph using the Changepoint model, with MDL as minimization criterion (AUC-PR=0.36).
6.7 Estimated graph using the Changepoint model, with BIC as minimization criterion (AUC-PR=0.40).
6.8 Estimated graph using the Left-Right HMM with Equal Variances (AUC-PR=0.48).
6.9 Estimated graph using the HMM with Equal Variances (AUC-PR=0.68).
B.1 CBF1 data, switch on.
B.2 CBF1 data, switch off.
B.3 GAL4 data, switch on.
B.4 GAL4 data, switch off.
B.5 SWI5 data, switch on.
B.6 SWI5 data, switch off.
B.7 GAL80 data, switch on.
B.8 GAL80 data, switch off.
B.9 ASH1 data, switch on.
B.10 ASH1 data, switch off.

List of Tables

4.1 α-pass using our example.
4.2 β-pass using our example.
4.3 Step 2 of the Viterbi algorithm applied on our example.
5.1 Regression coefficients for the artificial data.
5.2 Estimated regression coefficients using five independent linear models over sequence 1.
5.3 Estimated changepoints using five independent changepoint models on sequence 1.
5.4 Estimated changepoint allocations using five independent changepoint models and one global changepoint allocation over sequence 2.
5.5 Estimated changepoints using five independent changepoint models and one global changepoint allocation over sequence 3.
5.6 AUC-ROC scores for the tested models and the three different sequences.
6.1 Inferred changepoints for the five genes, the complete network, using both the BIC and MDL as minimization method.
6.2 Inferred changepoints for the five genes, the complete network, for both the BIC and MDL as minimization criterion.
6.3 Inferred changepoints for the five genes, for both the BIC and MDL as minimization criterion.
6.4 Inferred changepoints for the five genes, using both the Left-Right HMM as well as the HMM, both with Equal Variances among the classes.
6.5 AUC-PR scores of all tested models on the yeast data, compared to the Bayesian scores. Note that the Bayesian scores are taken from [11] and are an approximation.
A.1 Left: Reconstructed sequence using the Viterbi Algorithm with the normal HMM. Right: Reconstructed sequence using the Viterbi Algorithm with the Left-Right HMM.
A.2 Left: Reconstructed sequence using the Viterbi Algorithm with the normal HMM. Right: Reconstructed sequence using the Viterbi Algorithm with the Equal Variance HMM.
A.3 Reconstructed sequence using the Viterbi Algorithm with the Left-Right Equal Variance HMM.
A.4 Reconstructed sequence using the Viterbi Algorithm with the Equal Variance HMM.


"If we knew what it was we were doing, it would not be called research, would it?"

Albert Einstein

"In tied'n van nood, moej eerappels kunn'n schill'n met een biel." ("In times of need, you must be able to peel potatoes with an axe.")

Jan Sluiter


Chapter 1

Introduction

Ever since the introduction of Bayesian Statistics, there has been a discussion between the 'frequentists' and the 'Bayesians' over which method is better and why. In this thesis we will compare the Network Reconstruction Accuracy of three different frequentist models with their Bayesian counterparts.

Before we start comparing, we will explain some basic theory in chapter 2, e.g. Dynamic Programming and the EM-algorithm, which are used for the Segment Neighbourhood Search (SNS) and the Baum-Welch algorithm that we use for the different types of models. Besides that, we explain some basic graph theory and the two ways in which we measure the Network Reconstruction Accuracy of a model.

Next, in chapter 3, we will introduce the Changepoint Model (CPM) based on Segment Neighbourhood Search, and the methods of Penalised Likelihood and Minimum Description Length (MDL) to determine the optimal number of changepoints in a time series using a Changepoint Model.

In chapter 4 we will introduce the Hidden Markov Model (HMM) and discuss the solutions to the three most common problems when using an HMM: finding the likelihood of an observation given a model, finding the most likely state sequence given a model, and finding the Maximum Likelihood Estimator (MLE) given an observation. Besides that, we introduce the HMM with equal variances among the states and the Left-Right HMM, and show how these are connected to CPMs.

Then finally, in chapter 5, we start comparing. We have created an artificial network with three corresponding data sets: one based on a Homogeneous Model (HOM), one based on a Changepoint Model and one based on the Hidden Markov Model. We compare the Network Reconstruction Accuracy of all three models (also with different settings) on all three data sets. By comparing these models, we have found a nice solution for a problem that can occur when there are outliers in the data in combination with the EM-algorithm.

After comparing our three chosen models with themselves, we will compare them with their Bayesian counterparts in chapter 6. The comparison is done with data from the yeast species Saccharomyces cerevisiae, in which a synthetic network is built. Results for the Bayesian models are taken from earlier research by M.A. Grzegorczyk [11]; the frequentist models described here are used to compute Network Reconstruction Accuracy scores, which are compared to those earlier Bayesian results.

All of the algorithms described in this thesis are implemented in R. The R code can be found at: https://github.com/DeSteen/MSc-Thesis. Besides those algorithms, the data and more R code for reproducing the results and figures can be found there as well.


Chapter 2

Theory

2.1 Dynamic Programming

Dynamic Programming is used to solve optimization problems and is based on the idea of choosing the optimal next step under the assumption that all future decisions will be optimal. Why this yields the overall optimal path is summed up in Bellman's Principle of Optimality.

Lemma 1 (Bellman’s Principle of Optimality (1957) [14]). An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

2.1.1 A simple example of Dynamic Programming

Suppose we have a pizzeria with a delivery service, located at the south-west corner of our simplified city (see figure 2.1), and we have to deliver a pizza at the north-east corner of the city. In this city we have seven intersections, and on each road between intersections we have written the time in minutes it takes to go from one intersection to the other. Now we want to find the fastest road from the south-west corner (A) to the north-east corner (G).

Figure 2.1: A simplified city with 7 intersections and the travel time in minutes between those intersections.

Figure 2.2: The green road is the fastest road from A to G, with a travel time of 14 minutes.

1. Let us start with the intersections closest to G, which are E and F, and denote the time to travel from intersection I to G as T(I). Hence we have T(E) = 5 and T(F) = 2.

2. In the second step we take all the intersections from which we can reach E or F, which are B, C and D. Let us start with intersection B: when we are at B, we can either go to E or to F. By the Principle of Optimality, the optimal path after reaching one of those intersections is the optimal path already calculated. Hence we have:

T(B) = min{time from B to E + T(E), time from B to F + T(F)}
= min{7 + T(E), 8 + T(F)} = 10, (B → F → G).

In the same way we get T(C) = 13, (C → E → G), and T(D) = 10, (D → F → G).

3. As in step 2, in the last step we only have to consider the time to the next intersection. Hence we have:

T(A) = min{time from A to B + T(B), time from A to C + T(C), time from A to D + T(D)}
= min{5 + T(B), 2 + T(C), 4 + T(D)} = 14

Hence the fastest road from A to G is A → D → F → G, which has a travel time of 14 minutes. Besides the fastest road from A to G, we have also found the fastest road from every intersection to G. This can be seen in figure 2.2, where the fastest roads are colored green and the others red.
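The backward recursion above can be sketched in a few lines of code. The thesis's own implementations are in R; the following is an illustrative Python sketch. The two edge times not fixed by the worked example (C → F and D → E) are assumptions read off figure 2.1 and marked as such in the code.

```python
# Backward dynamic-programming pass over the example city of figure 2.1.
# times[u] maps each neighbour v (closer to G) to the travel time u -> v.
times = {
    "E": {"G": 5},
    "F": {"G": 2},
    "B": {"E": 7, "F": 8},
    "C": {"E": 8, "F": 12},   # C -> F = 12 is an assumption from the figure
    "D": {"E": 9, "F": 8},    # D -> E = 9 is an assumption from the figure
    "A": {"B": 5, "C": 2, "D": 4},
}

def fastest(times, goal="G"):
    # T[v] = minimal travel time from v to the goal; nxt[v] = optimal successor.
    T, nxt = {goal: 0}, {}
    # Process intersections so that T is already final for every successor,
    # which is exactly what Bellman's Principle of Optimality requires.
    for u in ["E", "F", "B", "C", "D", "A"]:
        best_v = min(times[u], key=lambda v: times[u][v] + T[v])
        T[u] = times[u][best_v] + T[best_v]
        nxt[u] = best_v
    return T, nxt

T, nxt = fastest(times)
path, node = ["A"], "A"
while node != "G":
    node = nxt[node]
    path.append(node)
print(T["A"], path)  # -> 14 ['A', 'D', 'F', 'G']
```

As in the text, a single backward pass yields not only the fastest road from A but the fastest road from every intersection to G.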

2.2 EM-algorithm

In some cases the equations in Maximum Likelihood Estimation (MLE) cannot be solved directly, because some of the data is unobserved or unavailable.

In that case we can use so-called latent variables: variables that are unknown, but that we treat as if they were known. Treating these variables as known can make the maximization of the likelihood a lot easier. This is the idea behind the Expectation-Maximization algorithm, or EM-algorithm for short [6].

2.2.1 Outline of the EM-algorithm [13]

Suppose we have observed data Z with log-likelihood ℓ(θ; Z), where θ are the parameters of the model. Besides that, let Zm denote our latent (missing) data, so our complete data is T = (Z, Zm), with log-likelihood ℓ0(θ; T). Now let us start with the EM-algorithm. By Bayes' rule we have:

Pθ(Zm | Z) = Pθ(Zm, Z) / Pθ(Z) = Pθ(T) / Pθ(Z), so Pθ(Z) = Pθ(T) / Pθ(Zm | Z)

and taking logarithms:

ℓ(θ; Z) = ℓ0(θ; T) − ℓ1(θ; Zm | Z)

By taking the expectation on both sides with respect to the distribution of T | Z, with parameters θ(k), we get:

Eθ(k)[ℓ(θ; Z) | Z] = Eθ(k)[ℓ0(θ; T) | Z] − Eθ(k)[ℓ1(θ; Zm | Z) | Z]
ℓ(θ; Z) = Eθ(k)[ℓ0(θ; T) | Z] − Eθ(k)[ℓ1(θ; Zm | Z) | Z] = Q(θ | θ(k)) − R(θ | θ(k))

because ℓ(θ; Z) is Z-measurable. Now finally we will show that maximizing Q(θ | θ(k)) over θ also increases ℓ(θ; Z):

ℓ(θ; Z) − ℓ(θ(k); Z) = Q(θ | θ(k)) − R(θ | θ(k)) − (Q(θ(k) | θ(k)) − R(θ(k) | θ(k)))
= Q(θ | θ(k)) − Q(θ(k) | θ(k)) + R(θ(k) | θ(k)) − R(θ | θ(k))

Note that θ(k) maximizes R(θ | θ(k)) over θ, by Jensen's Inequality. Hence:

ℓ(θ; Z) − ℓ(θ(k); Z) ≥ Q(θ | θ(k)) − Q(θ(k) | θ(k))

Therefore, when we choose θ such that Q(θ | θ(k)) is greater than Q(θ(k) | θ(k)), we certainly have that ℓ(θ; Z) for that θ is greater than ℓ(θ(k); Z). Since Q(θ | θ(k)) is an expectation, the optimization is done in two steps and can be summarized as follows:

Algorithm 1 (EM-algorithm):

1. Set an initial estimate θ(0).

2. For k = 0, 1, 2, ..., until convergence:

(a) E-step: compute Q(θ | θ(k)) = Eθ(k)[ℓ0(θ; T) | Z].

(b) M-step: set θ(k+1) = argmax_θ {Q(θ | θ(k))}.

2.2.2 An example of the EM-algorithm

To make it more concrete, we will consider a little well known example where the EM-algorithm is used. The idea and notation of this example is taken from [13]. Suppose we have 50 data points (see figure 2.3 on the following page), which seem to come from a bi-modal distribution. So a Normal dis- tribution would not be appropriate here. So let us model Y as a mixture of two normal distributions:

Y1 ∼ N (µ1, σ21) Y2 ∼ N (µ2, σ22)

Y = (1 − ∆)Y1 + ∆Y2

(20)

With ∆ ∼ Ber(π). Hence it follows that the density of Y is:

fY(y; θ) = fY(y; π, µ1, µ2, σ21, σ22) = (1 − π)φ1(y) + πφ2(y)

Where φj(y) denotes the Normal density with mean µj and variance σj2. And θ = (π, µ1, µ2, σ12, σ22) is the parameter set. Note that this distribution is also called a bimodal normal distribution. Then the log-likelihood of our fifty data

Figure 2.3: Histogram of example data.

Figure 2.4: True density and estimated density.

points would be:

ℓ(θ; y) = Σ_{i=1}^{50} log fY(yi; θ) = Σ_{i=1}^{50} log((1 − π)φ1(yi) + πφ2(yi))

Since we have a sum inside the logarithm, this expression is difficult to maximize over θ. Therefore let us take the ∆i as our latent variables: ∆i = 0 when yi comes from distribution Y1, and ∆i = 1 when it comes from distribution Y2. Now we assume our variables ∆i to be known, hence our complete-data log-likelihood becomes:

ℓ0(θ; T) = ℓ0(θ; y, ∆) = Σ_{i=1}^{50} [(1 − ∆i) log φ1(yi) + ∆i log φ2(yi)] + Σ_{i=1}^{50} [(1 − ∆i) log(1 − π) + ∆i log π]

Applying the EM-algorithm to this, the E-step gives us:

Q(θ | θ(k)) = Eθ(k)[ℓ0(θ; T) | y]
= Σ_{i=1}^{50} [(1 − γi(θ(k))) log φ1(yi) + γi(θ(k)) log φ2(yi)] + Σ_{i=1}^{50} [(1 − γi(θ(k))) log(1 − π) + γi(θ(k)) log π]

where γi(θ) = Eθ[∆i | y] = Pθ(∆i = 1 | y). Then in the M-step we get the following updates:

µ1(k+1) = Σ_{i=1}^{50} (1 − γi(θ(k))) yi / Σ_{i=1}^{50} (1 − γi(θ(k)))
µ2(k+1) = Σ_{i=1}^{50} γi(θ(k)) yi / Σ_{i=1}^{50} γi(θ(k))
σ1²,(k+1) = Σ_{i=1}^{50} (1 − γi(θ(k))) (yi − µ1(k+1))² / Σ_{i=1}^{50} (1 − γi(θ(k)))
σ2²,(k+1) = Σ_{i=1}^{50} γi(θ(k)) (yi − µ2(k+1))² / Σ_{i=1}^{50} γi(θ(k))
π(k+1) = Σ_{i=1}^{50} γi(θ(k)) / 50

Iterating this until convergence of the likelihood gives the following estimates:

µ̂1 = 4.266, µ̂2 = 6.833, σ̂1² = 0.558, σ̂2² = 0.665, π̂ = 0.633

In figure 2.4 we see the density from which the data comes (black line) and the estimated density (red line). The two lines are approximately the same, and the estimated density will converge to the true density as n goes to infinity.
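The E- and M-steps above are straightforward to implement. The thesis code is in R; below is an illustrative Python sketch on synthetic two-component data (the thesis's 50 data points are not reproduced here, so the component means 4 and 7 and standard deviation 0.8 are hypothetical values chosen for the illustration).

```python
import numpy as np

def normal_pdf(y, mu, var):
    # Density of N(mu, var), evaluated elementwise.
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_two_normals(y, theta0, n_iter=200):
    # theta = (pi, mu1, mu2, var1, var2); mirrors the update formulas in the text.
    pi, mu1, mu2, var1, var2 = theta0
    logliks = []
    for _ in range(n_iter):
        # E-step: gamma_i = P(Delta_i = 1 | y_i) under the current parameters.
        p1 = (1 - pi) * normal_pdf(y, mu1, var1)
        p2 = pi * normal_pdf(y, mu2, var2)
        gamma = p2 / (p1 + p2)
        logliks.append(np.log(p1 + p2).sum())
        # M-step: responsibility-weighted means, variances and mixing proportion.
        mu1 = ((1 - gamma) * y).sum() / (1 - gamma).sum()
        mu2 = (gamma * y).sum() / gamma.sum()
        var1 = ((1 - gamma) * (y - mu1) ** 2).sum() / (1 - gamma).sum()
        var2 = (gamma * (y - mu2) ** 2).sum() / gamma.sum()
        pi = gamma.mean()
    return (pi, mu1, mu2, var1, var2), logliks

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(4, 0.8, 30), rng.normal(7, 0.8, 20)])
theta, logliks = em_two_normals(y, (0.5, y.min(), y.max(), y.var(), y.var()))
```

As the derivation above guarantees, the stored log-likelihoods are non-decreasing over the iterations; in practice one restarts from several initial values and keeps the fit with the highest log-likelihood.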

2.2.3 Convergence of the EM-algorithm

Theoretically, the EM-algorithm always converges to an MLE of the parameters of the model [18], [19]. But it is not necessarily true that the EM-algorithm converges to the global maximum. When we go back to the example of subsection 2.2.2 and perform the EM-algorithm with several different initial values, we see that it converges to different maxima.

In figure 2.5 we can see that it not only converges to different (local) maxima, but also that the number of iterations can differ a lot (the three runs converged in 23, 64 and 16 iterations to log-likelihoods of −77.99, −83.78 and −86.64, respectively).

Figure 2.5: The EM-algorithm can converge to different (local) maxima.

Figure 2.6: Three estimated densities and the true density.

In figure 2.6 we see the estimated density for each of these MLEs, and it is clear that the one with the highest log-likelihood is the best MLE of the three found.

Therefore, when applying the EM-algorithm, it is best to start the algorithm several times with different initial values and choose the model with the highest log-likelihood. Besides the fact that the EM-algorithm does not necessarily converge to the global maximum, it can also happen that the EM-algorithm does not converge at all.

2.2.4 Convergence problems of the EM-algorithm

For some samples it can happen that the EM-algorithm collapses because the estimated variance of one of the densities becomes too small [1]. Two cases have been found for Normal mixtures where this can happen:

1. There is an outlier in the data set.

2. There is a repetition of one or more data points in the data set.

Both are numerical problems. Here we focus on the first problem; for the second we refer to the literature [1]. We will now show


why an outlier can make the EM-algorithm collapse. But before we do that, let us extend the two-component Normal mixture from subsection 2.2.2 to a mixture of M Normal distributions, with the following density:

fY(y; θ) = π1 φ1(y) + π2 φ2(y) + · · · + πM φM(y),  Σ_{m=1}^{M} πm = 1, πm > 0  (2.1)

with φm(y) being the density of a Normal distribution with mean µm and variance σm², and θ again being the complete parameter set. Hence we get the following updates in the EM-algorithm:

µm(k+1) = Σ_{i=1}^{N} γ_{i,m}(θ(k)) yi / Σ_{i=1}^{N} γ_{i,m}(θ(k))  (2.2)

σm²,(k+1) = Σ_{i=1}^{N} γ_{i,m}(θ(k)) (yi − µm(k+1))² / Σ_{i=1}^{N} γ_{i,m}(θ(k))  (2.3)

πm(k+1) = Σ_{i=1}^{N} γ_{i,m}(θ(k)) / N  (2.4)

where now γ_{i,m}(θ) = Pθ(observation i comes from distribution N(µm, σm²)) = φm(yi) πm / fY(yi; θ). First let us assume that we are close to a maximum of the likelihood, and consider an outlier yout with a component m0 (the distribution N(µm0, σm0²)) close to this outlier. Since the Normal density decreases exponentially, it is more likely that the outlier is generated by component m0 than by components far from yout, while the other data points are far from component m0:

∀ yn ≠ yout: γ_{n,m0}(θ(k)) << γ_{out,m0}(θ(k)) ≈ 1

So the only point which contributes significantly to the computation of µ_{m0}(k+1) in equation (2.2) is the outlier, therefore µ_{m0} → yout. And the variance becomes:

σ_{m0}²,(k+1) = Σ_{i=1}^{N} γ_{i,m0}(θ(k)) (yi − µ_{m0}(k+1))² / Σ_{i=1}^{N} γ_{i,m0}(θ(k))
= [1 / Σ_{i=1}^{N} γ_{i,m0}(θ(k))] Σ_{i=1}^{N} [φ_{m0}(yi) π_{m0}(k) / fY(yi; θ(k))] (yi − µ_{m0}(k+1))²
= [1 / Σ_{i=1}^{N} γ_{i,m0}(θ(k))] Σ_{i=1}^{N} (yi − µ_{m0}(k+1))² [π_{m0}(k) / fY(yi; θ(k))] · (1 / √(2π σ_{m0}²,(k))) exp{−(yi − µ_{m0}(k))² / (2σ_{m0}²,(k))}

If we look at the contribution to this update of a data point yn far from µ_{m0}(k), writing e_n(k+1) = (yn − µ_{m0}(k+1))² = e_n(k) + ∆e_n(k), with α and β positive scalars, we see that:

lim_{e_n(k)/β → ∞} α exp{−e_n(k)/β} e_n(k+1)/√β
= lim_{e_n(k)/β → ∞} [α exp{−e_n(k)/β} e_n(k)/√β + α exp{−e_n(k)/β} ∆e_n(k)/√β] = 0  (2.5)

To see that the limit is zero, we first look at the first term, which is clearly zero in the limit. For the second term, remember that we are close to a maximum of the likelihood, hence ∆e_n(k) << e_n(k), so the second term is also zero in the limit. So when µ_{m0} comes close to yout, only the outlier will contribute to the variance in the updating step. Therefore:

σ_{m0}² ∝ (yout − µ_{m0})² → 0 as µ_{m0} → yout  (2.6)

So an isolated data point, in combination with the local characteristics of the Normal distribution, can make the EM-algorithm collapse.

2.3 Graphs & Network Reconstruction Accuracy

2.3.1 Graphs

Since we want to compare Network Reconstruction Accuracy, we need to know what a network or graph is before we are able to reconstruct one.

A graph or network is a set of edges and vertices, which we will denote by G = (E, V), where E is the set of edges between the vertices collected in V. If we go back to our simplified city of subsection 2.1.1, we can see that this simplified city is actually a network (see also figure 2.7). The set of vertices or nodes V in this network consists of the intersections, so:

V = {A, B, C, D, E, F, G}

Besides the intersections we have the roads between them, which we denote as the edges of our graph. So the set E will be:

Figure 2.7: A simplified city with 7 intersections, which is an undirected graph.

Figure 2.8: Example of a directed graph.

E = {(A, B), (A, C), (A, D), (B, E), (B, F), (C, E), (C, F), (D, E), (D, F), (E, G), (F, G)}

This is an example of an undirected graph, since there are no one-way roads in our example. This also implies that an edge (A, B) is the same as (B, A); another way of denoting these edges is A − B. In figure 2.8 we have an example of a directed graph, which can be seen from the edges, which are arrows instead of lines. Such an edge can be denoted as A → B. Note that in both examples the nodes are labeled by letters, but this can be any type of labeling.

There exist many types of inference on graphs; the method we use is based on the assumption that the data is measured at the nodes. Based on this data we want to determine the set E, that is, between which vertices there exist edges. More about this type of modeling can be found in section 5.1.

2.3.2 Network Reconstruction Accuracy Criteria

The power of a model, in terms of "being able to reconstruct a network", will be expressed using Precision-Recall (PR) curves and Receiver Operator Characteristic (ROC) curves and the area below those curves. Both are used in binary decision problems in machine learning [4], and a small description of both is given below.

Such criteria can only be used when the true network or graph is known. Let us denote this true network by G_True = (E_True, V_True), and define:

G_True(i, j) = 1 if (i, j) ∈ E_True, and G_True(i, j) = 0 otherwise.

Now consider a model with the graph G_Model = (E_Model, V_True) and the same type of function as we used for G_True. Then we can define the True Positives (TP) as the number of edges for which 1 = G_True(i, j) = G_Model(i, j); the False Positives (FP): 0 = G_True(i, j) ≠ G_Model(i, j); the True Negatives (TN): 0 = G_True(i, j) = G_Model(i, j); and the False Negatives (FN): 1 = G_True(i, j) ≠ G_Model(i, j).

From these numbers we can calculate the following three quantities: the Recall (R) or True Positive Rate (TPR), the Precision (P) and the False Positive Rate (FPR).

R = TPR = TP / (TP + FN)  P = TP / (TP + FP)  FPR = FP / (FP + TN)

The Recall and Precision define points in PR-space, which are connected through non-linear interpolation [4]. The TPR and FPR define points in ROC-space, which are connected through straight lines. These points are created by increasing the cut-off probability from zero to one; when many different probabilities are available, the ROC becomes a step function. Both methods create curves in [0, 1] × [0, 1], and the area under these curves (AUC) can be used as a quantitative measure of network reconstruction accuracy. For both methods, the closer the value is to one, the better the method, while a model with an AUC-ROC smaller than 0.5 is worse than random guessing. In chapter 5 mostly AUC-ROC scores will be used, while AUC-PR will be used in chapter 6 for comparison with the results of [11].
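As a small illustration of these counts, consider the following Python sketch (the thesis code itself is in R). The two 3-node adjacency matrices are hypothetical, not taken from the thesis.

```python
import numpy as np

def confusion(G_true, G_model):
    # Compare two 0/1 adjacency matrices edge by edge.
    G_true, G_model = np.asarray(G_true), np.asarray(G_model)
    TP = int(((G_true == 1) & (G_model == 1)).sum())
    FP = int(((G_true == 0) & (G_model == 1)).sum())
    TN = int(((G_true == 0) & (G_model == 0)).sum())
    FN = int(((G_true == 1) & (G_model == 0)).sum())
    return TP, FP, TN, FN

def rates(TP, FP, TN, FN):
    recall = TP / (TP + FN)        # = TPR
    precision = TP / (TP + FP)
    fpr = FP / (FP + TN)
    return recall, precision, fpr

# Hypothetical 3-node example: the true graph has edges 1 -> 2 and 2 -> 3.
G_true = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
G_model = [[0, 1, 0], [0, 0, 0], [1, 0, 0]]  # finds 1->2, misses 2->3, adds 3->1
TP, FP, TN, FN = confusion(G_true, G_model)
recall, precision, fpr = rates(TP, FP, TN, FN)
```

Sweeping a cut-off over edge probabilities and recording the (FPR, TPR) or (Recall, Precision) pair at each value traces out the ROC and PR curves; the area under the resulting curve is then the AUC score used in chapters 5 and 6.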


Chapter 3

Change Point Models

3.1 Definition

A change point model is a model in which, at one or more time points, the distribution of the output changes. This can be a large change, such as a change in distribution type, for example going from a Poisson distribution to a Geometric distribution; or a smaller change, such as going from a Normal distribution with mean 1 to a Normal distribution with mean 2.5. Or, when using regression, it can be that the regression coefficients change.

3.2 Segment Neighbourhood Search

Let us now have data y_{1:N}, which is broken into M + 1 segments, and let us denote the set of changepoints as τ_{0:M+1} = (τ0 = 0, τ1, ..., τM, τ_{M+1} = N). To find the optimal set τ_{0:M+1}, an algorithm called Segment Neighbourhood Search (SNS), based on Dynamic Programming, is used [14]. The Segment Neighbourhood Search algorithm finds the optimum by finding

argmin_{τ_{0:M+1}} { Σ_{i=1}^{M+1} C(y_{τ_{i−1}+1:τ_i}) }  (3.1)

where C(·) is a cost function to be determined depending on the problem.


Algorithm 2 (Segment Neighbourhood Search):¹

Input: a data vector y_{1:N}, a cost function C(·) and a maximum number of changepoints M (M + 1 segments).

1. Compute:

Q_{1,i:j} = C(y_{i:j}), for 1 ≤ i < j ≤ N

2. For m = 1, 2, ..., M:

(a) For j = 1, ..., N:

Q_{m+1,1:j} = min_{k ∈ {1,...,j−1}} {Q_{m,1:k} + Q_{1,k+1:j}}

(b) Set τ_{m+1,0} = N; then for i = 1, ..., m:

τ_{m+1,i} = argmin_{k ∈ {1,...,τ_{m+1,i−1}−1}} {Q_{m+1−i,1:k} + Q_{1,k+1:τ_{m+1,i−1}}}

Note that parts of step 2 are not needed when we are searching for only one changepoint.

We apply this algorithm to a data set of simulated stock prices, which consists of 321 time points (eight measurements per time unit) at which the stock price is measured, and which has 4 known changepoints, at times 8, 16, 24 and 32. We let the algorithm run for 4 changepoints with the cost function C(y_{i:j}) = − log Pθ̂(y_{i:j}), where Pθ̂(y_{i:j}) is the likelihood of y_{i:j} under θ̂. Note that to use this cost function we need the extra condition that we can only allow segments which contain at least a minimal number of data points.

¹This algorithm has been taken from [14]; note, though, that there is a slight mistake in their third for-loop.
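A minimal sketch of the dynamic program behind Algorithm 2 is given below, in Python rather than the thesis's R, using zero-based slices and, for simplicity, a residual-sum-of-squares cost for piecewise-constant means instead of the negative log-likelihood cost used in the text.

```python
import numpy as np

def seg_cost(y, i, j):
    # Cost of segment y[i:j]: residual sum of squares around the segment mean.
    seg = y[i:j]
    return float(((seg - seg.mean()) ** 2).sum())

def segment_neighbourhood(y, M):
    # Q[m][j]: minimal cost of fitting y[0:j] with m segments;
    # back[m][j]: the split point that achieves it.
    n = len(y)
    INF = float("inf")
    Q = [[INF] * (n + 1) for _ in range(M + 2)]
    back = [[0] * (n + 1) for _ in range(M + 2)]
    for j in range(1, n + 1):
        Q[1][j] = seg_cost(y, 0, j)
    for m in range(2, M + 2):
        for j in range(m, n + 1):
            for k in range(m - 1, j):
                c = Q[m - 1][k] + seg_cost(y, k, j)
                if c < Q[m][j]:
                    Q[m][j], back[m][j] = c, k
    # Backtrack the M changepoints (M + 1 segments).
    cps, j = [], n
    for m in range(M + 1, 1, -1):
        j = back[m][j]
        cps.append(j)
    return sorted(cps), Q[M + 1][n]
```

With this cost, zero total cost is attainable only at the true split of a noiseless step signal, so the recovered changepoints are exact; with the thesis's likelihood cost one would additionally enforce a minimum segment length.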


Figure 3.1: Simulated stock prices with known changepoints at times 8, 16, 24 and 32.

Figure 3.2: Simulated stock prices with estimated changepoints at times 8.375, 16.375, 23.375 and 31.125.

3.2.1 Penalised Likelihood

But how do we determine the number of changepoints when it is not known beforehand? In theory, adding more changepoints will increase the fit of a model, but it will decrease the smoothness and may over-fit the data. Hence we should add a form of penalty for adding more changepoints. To review a few methods, we first start with the general definition of a penalised likelihood [14].

Definition 1. Consider a model with p parameters, combined in the parameter set θ, data y_{1:N}, and ℓ(θ; y_{1:N}) the log-likelihood of that model. Then the penalised likelihood is defined as:

PL(θ; y_{1:N}) = −2 ℓ(θ; y_{1:N}) + p φ(N),

with φ(N) the penalty function, which is non-decreasing in the number of data points N.

Then the model that minimizes PL(θ; y_{1:N}) will be the preferred model; clearly the chosen model depends on the choice of the penalty function.

Many different penalty functions can be chosen; two of the most popular are given by Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC) [14], which are defined as follows:

AIC: φ(N) = 2,
BIC: φ(N) = log(N).

For a changepoint model with k changepoints the number of parameters is p = k + (k + 1)·d, where d is the number of parameters per segment.
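As a sketch, the penalised-likelihood score can be computed directly from a fitted model's log-likelihood. The function below is our own illustration; it uses the parameter count p = k + (k + 1)·d from the text:

```python
import math

def penalized_likelihood(loglik, k, d, N, criterion="BIC"):
    """PL = -2*l(theta; y) + p*phi(N), with p = k + (k+1)*d parameters:
    the k changepoint locations plus d parameters for each of the
    k+1 segments. phi(N) = 2 for the AIC and log(N) for the BIC."""
    p = k + (k + 1) * d
    phi = 2.0 if criterion == "AIC" else math.log(N)
    return -2.0 * loglik + p * phi
```

The number of changepoints would then be chosen as the k minimizing this score over the fits returned by Algorithm 2.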

3.2.2 Minimum Description Length

Using penalised likelihood is one of the options to restrict the number of changepoints, another option is using the Minimum Description Length (MDL).

This is based on finding the model which maximizes the compression of the data. Given a set of changepoints we can estimate the so-called code-length of the data, which can be seen as the amount of memory needed to store the data [7]. Following the derivation of [5] we have that the code length of a changepoint model with M changepoints is proportional to:

CL(M, τ_{0:M+1}, θ, y_{1:N}) ∝ − Σ_{i=1}^{M+1} ℓ(θ_i; y_{τ_{i−1}+1:τ_i}) + log(M + 1) + (M + 1) log(N) + Σ_{i=1}^{M+1} (d/2) log(τ_i − τ_{i−1} + 1)

                             = (1/2) Σ_{i=1}^{M+1} BIC(y_{τ_{i−1}+1:τ_i}) + log(M + 1) + (M + 1) log(N),

where again d is the number of parameters to estimate for each segment.

Hence using the MDL to find the optimal number of changepoints can be done using the BIC as a cost function for each segment in the SNS algorithm.
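A direct transcription of this code-length formula might look as follows. The function name and argument layout are our own; segment boundaries are passed as τ_0 = 0 < τ_1 < ... < τ_{M+1} = N:

```python
import math

def mdl_code_length(seg_logliks, taus, d, N):
    """Code length (up to an additive constant) of a changepoint model.

    seg_logliks : log-likelihoods l(theta_i; y_{tau_{i-1}+1:tau_i}) of the
                  M+1 segments
    taus        : boundaries tau_0 = 0 < tau_1 < ... < tau_{M+1} = N
    d           : number of parameters per segment
    """
    M = len(seg_logliks) - 1
    cl = -sum(seg_logliks) + math.log(M + 1) + (M + 1) * math.log(N)
    cl += sum(0.5 * d * math.log(taus[i] - taus[i - 1] + 1)
              for i in range(1, M + 2))
    return cl
```

For a single segment (M = 0) this reduces to −ℓ + log(N) + (d/2) log(N + 1), matching the formula term by term.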

We now use the three criteria given above to find the optimal number of changepoints on our example data set of stock prices, with the maximum number of changepoints set to 20. Minimizing the AIC we find an optimum of 6 changepoints, while the BIC and MDL both give back the true number of 4 changepoints. The values of these criteria are plotted in figure 3.3, and the regression line with the optimal number of changepoints is plotted in figure 3.4. Note that minimizing the BIC value and minimizing the MDL value give the same regression line here; this does not necessarily hold for all data sets, since the cost functions used by the two minimization criteria are different.


Figure 3.3: Minimization criterion values (AIC, BIC and MDL) against the number of changepoints. Both minimizing the BIC and the MDL value give the right number of changepoints; the AIC gives two extra.

Figure 3.4: Regression lines using the optimal number of changepoints, according to the AIC, BIC and MDL values.


Chapter 4

Hidden Markov Models

4.1 Definition

4.1.1 Markov Models

One of the simplest models is a Markov Model. Let us start with a Markov Model in discrete time. Suppose we have a board game with 100 squares, of which 60 are coloured green and the remaining 40 are coloured red. Suppose that at each turn, when you are on green you stay on green with probability 0.67, and when you are on red you stay on red with probability 0.41. This can be modelled as a Markov chain, since the colour or state you are in at turn t only depends on the state you were in at turn t − 1, or:

P(S_t = s_t | S_{t−1} = s_{t−1}, ..., S_0 = s_0) = P(S_t = s_t | S_{t−1} = s_{t−1}).   (4.1)

This is also called the Markov property. Since this process is a Markov chain, we can create its transition matrix A:

          G     R
A = G ( 0.67  0.33 )   (4.2)
    R ( 0.59  0.41 )

In this case we have two states, Green and Red; we denote the state space as S = {Green, Red}. Besides the transition matrix we need the initial state distribution to complete our Markov Model. The initial state distribution describes the probability of starting on either a green or a red square, which is:

         G    R
π = (   0.6  0.4 )   (4.3)

i.e. the fractions of green and red squares.
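To make the board-game chain concrete, here is a minimal simulation sketch. The code and the dictionary layout for A and π are our own illustration:

```python
import random

# transition matrix A and initial distribution pi from the board-game example
A = {"G": {"G": 0.67, "R": 0.33},
     "R": {"G": 0.59, "R": 0.41}}
pi = {"G": 0.6, "R": 0.4}

def draw(dist, rng):
    """Sample a state from a discrete distribution given as a dict."""
    u, acc = rng.random(), 0.0
    for state, p in dist.items():
        acc += p
        if u < acc:
            return state
    return state  # guard against floating-point round-off

def simulate_markov(A, pi, n, seed=0):
    """Simulate n turns: the next colour depends only on the current one."""
    rng = random.Random(seed)
    seq = [draw(pi, rng)]
    for _ in range(n - 1):
        seq.append(draw(A[seq[-1]], rng))
    return seq
```

Over a long run the fraction of green squares visited settles near the stationary probability 0.59/(0.33 + 0.59) ≈ 0.64.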

4.1.2 Hidden Markov Models

Now suppose that besides our board with squares, we also have two dice, a green one and a red one, where the red die is biased and has probability 0.55 of rolling a six, while the green die is fair. You roll the die of the same colour as the square you are on. Now suppose that after the game somebody gives us the sequence of rolled numbers; the sequence of colours is then hidden to us. The emission probabilities are:

          Six   NoSix
B = G ( 0.17   0.83 )   (4.4)
    R ( 0.55   0.45 )

Because the real states we are interested in (the sequence of colours) are hidden, we speak of a Hidden Markov Process, where λ = (A, B, π) defines our complete Hidden Markov Model. Note that in this case the observed sequence is discrete (with two possibilities), but it can also be continuous. If we have a continuous distribution, the matrix B will be denoted as Θ and consists of the parameters of that distribution for each class.

We can visually describe the random variables (Y_i)_{i=1}^n, (S_i)_{i=1}^n with the graph of figure 4.1, where in our example the Y_i would be the rolled numbers, while S_i would be the colour of the square you were on. From this graph we can also deduce some (conditional) independence properties, namely:

1. Y_i ⊥ Y_j | S_i, for i ≠ j,
2. Y_i ⊥ S_j | S_i, for i ≠ j,
3. S_{i+1} ⊥ S_j | S_i, for j = 1, ..., i − 1,

where the symbol ⊥ stands for (conditional) independence.


Figure 4.1: Dependency structure of a Hidden Markov Model (each observation Y_i depends only on its hidden state S_i, and the states S_1, ..., S_4 form a Markov chain).

Throughout the remaining part of this chapter we will use this example for our computations, along with the following observed sequence:

2, 6, 6, 4, 1.   (4.5)

4.2 Computations with HMMs

When we do computations with HMMs we distinguish between the three most common problems [3]:

1. Given an HMM λ and observed data y_{1:N}, compute the likelihood P(Y_{1:N} = y_{1:N}; λ).
2. Given an HMM λ and observed data y_{1:N}, find the most likely state sequence s_{1:N}, or the most likely state s_i, i ∈ {1, ..., N}.
3. Given observed data y_{1:N}^{(k)}, k = 1, 2, ..., K, and the structure of the HMM, find the maximum likelihood estimator of λ.

For simplicity we will often omit the random variables in expressions like P_λ(Y_{1:N} = y_{1:N} | S_{1:N} = s_{1:N}) and write P_λ(y_{1:N} | s_{1:N}), whether Y_{1:N} is discrete or continuous. Furthermore, we write λ = (A, Θ, π) for our complete HMM. Let us start with the first problem, which is the easiest one. By the law of total probability we have that:

P_λ(y_{1:N}) = Σ_{s_{1:N} ∈ S^N} P_λ(y_{1:N} | s_{1:N}) P_λ(s_{1:N}),   (4.6)

where

P_λ(y_{1:N} | s_{1:N}) = ∏_{n=1}^{N} f(y_n; θ_{s_n}),

P_λ(s_{1:N}) = π_{s_1} ∏_{n=1}^{N−1} a_{s_n s_{n+1}},

with f(y_n; θ_{s_n}) the density of (Y_n | S_n = s_n). Combining these equations it follows that:

P_λ(y_{1:N}) = Σ_{s_{1:N} ∈ S^N} ( ∏_{n=1}^{N} f(y_n; θ_{s_n}) ) ( π_{s_1} ∏_{n=1}^{N−1} a_{s_n s_{n+1}} )

            = Σ_{s_{1:N} ∈ S^N} π_{s_1} f(y_1; θ_{s_1}) a_{s_1 s_2} f(y_2; θ_{s_2}) · · · a_{s_{N−1} s_N} f(y_N; θ_{s_N}).   (4.7)

The number of calculations using this direct definition is of order O(2N · M^N) [15], which grows exponentially in the sequence length, with base the number of states. Even for our example with M = 2 and N = 5 it already becomes 320. To reduce the number of calculations we can use the so-called forward or backward algorithm, or a combination of the two. We will start with the forward algorithm.

4.2.1 Forward Algorithm

First let us define α_n(i) as:

α_n(i) = P_λ(y_{1:n}, S_n = i).   (4.8)

Then it follows by definition that:

α_1(i) = P_λ(y_1, S_1 = i) = P_λ(S_1 = i) f(y_1; θ_i) = π_i f(y_1; θ_i),

and, for the recursion,

α_{n+1}(j) = P_λ(y_{1:n+1}, S_{n+1} = j)
           = Σ_{i=1}^{M} P_λ(y_{1:n+1}, S_{n+1} = j | S_n = i) P_λ(S_n = i)
           = Σ_{i=1}^{M} P_λ(y_{1:n+1} | S_{n+1} = j, S_n = i) P_λ(S_n = i) P_λ(S_{n+1} = j | S_n = i)
           = Σ_{i=1}^{M} P_λ(y_{n+1} | S_{n+1} = j, S_n = i) P_λ(y_{1:n} | S_{n+1} = j, S_n = i) P_λ(S_n = i) P_λ(S_{n+1} = j | S_n = i)
           = Σ_{i=1}^{M} P_λ(y_{n+1} | S_{n+1} = j) P_λ(y_{1:n}, S_n = i) a_{ij}
           = f(y_{n+1}; θ_j) Σ_{i=1}^{M} α_n(i) a_{ij}.

And by the law of total probability we have that:

P_λ(y_{1:N}) = Σ_{i=1}^{M} P_λ(y_{1:N}, S_N = i) = Σ_{i=1}^{M} α_N(i).   (4.9)

Using this we can calculate the total likelihood by the forward algorithm or α-pass [3]. The forward algorithm is summarized in Algorithm 3.

Algorithm 3 (Forward Algorithm):

1. For i = 1, ..., M:
   α_1(i) = π_i f(y_1; θ_i)
2. For n = 1, ..., N − 1:
     For j = 1, ..., M:
       α_{n+1}(j) = f(y_{n+1}; θ_j) Σ_{i=1}^{M} α_n(i) a_{ij}
3. Set:
   P_λ(y_{1:N}) = Σ_{i=1}^{M} α_N(i)
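The α-pass for the dice example can be sketched in a few lines. The dictionary layout is our own; the probabilities are those of A, B and π above, with the rolls reduced to six / no-six:

```python
A = {"G": {"G": 0.67, "R": 0.33},
     "R": {"G": 0.59, "R": 0.41}}
B = {"G": {"six": 0.17, "no": 0.83},   # fair green die
     "R": {"six": 0.55, "no": 0.45}}   # biased red die
pi = {"G": 0.6, "R": 0.4}

def forward(obs, A, B, pi):
    """Forward algorithm: alpha[n][i] = P(y_{1:n+1}, S_{n+1} = i)."""
    states = list(pi)
    alpha = [{i: pi[i] * B[i][obs[0]] for i in states}]
    for y in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: B[j][y] * sum(prev[i] * A[i][j] for i in states)
                      for j in states})
    return alpha, sum(alpha[-1].values())

# the observed sequence 2, 6, 6, 4, 1 reduced to six / no-six
obs = ["no", "six", "six", "no", "no"]
alpha, lik = forward(obs, A, B, pi)
```

Running this reproduces Table 4.1 (for instance α_1(Green) = 0.6 × 0.83 = 0.498) and the likelihood 0.031.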

Using this algorithm the number of calculations reduces to order O(M²N) [15]; for our example this is M²N = 2² × 5 = 20 basic operations, much smaller than 320. We will now work out our example:


n    α_n(Green)   α_n(Red)
1    0.498        0.180
2    0.075        0.131
3    0.022        0.043
4    0.033        0.011
5    0.024        0.007

Table 4.1: α-pass using our example.

Hence the likelihood of our sequence 2, 6, 6, 4, 1 is 0.031.

4.2.2 Backward Algorithm

For the backward algorithm we define:

β_n(i) = P_λ(y_{n+1:N} | S_n = i) for 1 ≤ n ≤ N − 1,  and  β_N(i) = 1.   (4.10)

For n < N we then have the recursion:

β_n(i) = Σ_{j=1}^{M} P_λ(y_{n+1:N}, S_{n+1} = j | S_n = i)
       = Σ_{j=1}^{M} P_λ(y_{n+2:N}, y_{n+1} | S_{n+1} = j, S_n = i) P_λ(S_{n+1} = j | S_n = i)
       = Σ_{j=1}^{M} P_λ(y_{n+2:N} | S_{n+1} = j) P_λ(y_{n+1} | S_{n+1} = j) a_{ij}
       = Σ_{j=1}^{M} β_{n+1}(j) f(y_{n+1}; θ_j) a_{ij}.

Hence it follows that:

P_λ(y_{1:N}) = Σ_{i=1}^{M} P_λ(y_{2:N} | S_1 = i) P_λ(y_1 | S_1 = i) P_λ(S_1 = i) = Σ_{i=1}^{M} β_1(i) f(y_1; θ_i) π_i.   (4.11)

In the same way as we did with the forward algorithm, we can now calculate the likelihood using the backward algorithm or β−pass [3].


Algorithm 4 (Backward Algorithm):

1. For i = 1, ..., M:
   β_N(i) = 1
2. For n = N − 1, ..., 1:
     For i = 1, ..., M:
       β_n(i) = Σ_{j=1}^{M} β_{n+1}(j) f(y_{n+1}; θ_j) a_{ij}
3. Set:
   P_λ(y_{1:N}) = Σ_{i=1}^{M} β_1(i) f(y_1; θ_i) π_i
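A matching sketch of the β-pass, using the same assumed dictionary layout as in the forward sketch; all numbers come from the example:

```python
A = {"G": {"G": 0.67, "R": 0.33},
     "R": {"G": 0.59, "R": 0.41}}
B = {"G": {"six": 0.17, "no": 0.83},   # fair green die
     "R": {"six": 0.55, "no": 0.45}}   # biased red die
pi = {"G": 0.6, "R": 0.4}

def backward(obs, A, B, pi):
    """Backward algorithm: beta[n][i] = P(y_{n+2:N} | S_{n+1} = i)."""
    states = list(pi)
    beta = [{i: 1.0 for i in states}]            # beta_N(i) = 1
    for y in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(A[i][j] * B[j][y] * nxt[j] for j in states)
                        for i in states})
    lik = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in states)
    return beta, lik

obs = ["no", "six", "six", "no", "no"]
beta, lik = backward(obs, A, B, pi)
```

This reproduces Table 4.2 (for instance β_4(Green) ≈ 0.705) and again yields the likelihood 0.031.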

Just like the forward algorithm, the number of calculations needed is of order O(M²N) [15]. We can use this algorithm to compute the likelihood of our example and compare it to the likelihood computed with the forward algorithm.

n    β_n(Green)   β_n(Red)
5    1            1
4    0.705        0.674
3    0.492        0.469
2    0.141        0.156
1    0.044        0.049

Table 4.2: β-pass using our example.

And our likelihood is:

β_1(G) f(y_1; θ_G) π_G + β_1(R) f(y_1; θ_R) π_R = 0.044 × 0.83 × 0.6 + 0.049 × 0.45 × 0.4 = 0.031,

which is, as it should be, equal to the likelihood computed by the forward algorithm.

It is possible to combine the forward and backward algorithms [3], which gives the following. Summing over the state at an arbitrary time n:

P_λ(y_{1:N}) = Σ_{i=1}^{M} P_λ(y_{1:N} | S_n = i) P_λ(S_n = i)
            = Σ_{i=1}^{M} P_λ(y_{1:n} | S_n = i) P_λ(y_{n+1:N} | S_n = i) P_λ(S_n = i)
            = Σ_{i=1}^{M} P_λ(y_{1:n}, S_n = i) P_λ(y_{n+1:N} | S_n = i)
            = Σ_{i=1}^{M} α_n(i) β_n(i),   (4.12)

and, summing over a transition,

P_λ(y_{1:N}) = Σ_{i=1}^{M} Σ_{j=1}^{M} P_λ(y_{1:N}, S_{n+1} = j | S_n = i) P_λ(S_n = i)
            = Σ_{i=1}^{M} Σ_{j=1}^{M} P_λ(y_{1:N} | S_{n+1} = j, S_n = i) P_λ(S_{n+1} = j | S_n = i) P_λ(S_n = i)
            = Σ_{i=1}^{M} Σ_{j=1}^{M} P_λ(y_{1:n} | S_n = i) P_λ(y_{n+1:N} | S_{n+1} = j, S_n = i) a_{ij} P_λ(S_n = i)
            = Σ_{i=1}^{M} Σ_{j=1}^{M} P_λ(y_{1:n}, S_n = i) P_λ(y_{n+1} | S_{n+1} = j) P_λ(y_{n+2:N} | S_{n+1} = j) a_{ij}
            = Σ_{i=1}^{M} Σ_{j=1}^{M} α_n(i) f(y_{n+1}; θ_j) β_{n+1}(j) a_{ij}.   (4.13)

Using Table 4.1, Table 4.2 and the matrices A and B, one can check that for each n = 1, ..., 5 these expressions give the same likelihood as the forward or backward algorithm alone.
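Equation (4.12) states that Σ_i α_n(i) β_n(i) is the same for every n; here is a self-contained sketch (our own layout) verifying this numerically for the dice example:

```python
A = {"G": {"G": 0.67, "R": 0.33}, "R": {"G": 0.59, "R": 0.41}}
B = {"G": {"six": 0.17, "no": 0.83}, "R": {"six": 0.55, "no": 0.45}}
pi = {"G": 0.6, "R": 0.4}
obs = ["no", "six", "six", "no", "no"]   # the sequence 2, 6, 6, 4, 1
states = list(pi)

# forward pass: alpha[n][i] = P(y_{1:n+1}, S_{n+1} = i)
alpha = [{i: pi[i] * B[i][obs[0]] for i in states}]
for y in obs[1:]:
    prev = alpha[-1]
    alpha.append({j: B[j][y] * sum(prev[i] * A[i][j] for i in states)
                  for j in states})

# backward pass: beta[n][i] = P(y_{n+2:N} | S_{n+1} = i)
beta = [{i: 1.0 for i in states}]
for y in reversed(obs[1:]):
    nxt = beta[0]
    beta.insert(0, {i: sum(A[i][j] * B[j][y] * nxt[j] for j in states)
                    for i in states})

# equation (4.12): sum_i alpha_n(i) * beta_n(i) is constant in n
liks = [sum(alpha[n][i] * beta[n][i] for i in states)
        for n in range(len(obs))]
```

All five values in `liks` agree with the likelihood 0.031 computed earlier.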

4.2.3 Viterbi Algorithm

Now that we can compute the likelihood of observed data, we may also want to find the most likely sequence of the hidden Markov chain [3]. To find this sequence, let us first define the following probability:

γ_n(i) = P_λ(S_n = i | y_{1:N}),  i ∈ {1, ..., M} and n ∈ {1, ..., N}.   (4.14)

By Bayes' rule, and following the same derivation as in equation (4.12), it follows that:

γ_n(i) = P_λ(S_n = i | y_{1:N}) = P_λ(y_{1:N}, S_n = i) / P_λ(y_{1:N}).   (4.15)
