
Comparing graphical models with pseudo-likelihood methods and their Bayesian counterparts on gene regulatory networks.

Master's thesis

February 8, 2017

Student: G. Hekkelman

Primary supervisor: dr. M.A. Grzegorczyk
Secondary supervisor: dr. W.P. Krijnen


In biology, a set of regulatory genes can have different interactions with each other. By measuring gene expression data, it becomes possible to infer a network of these interactions. Such a network is called a gene regulatory network.

In this thesis, the principle of pseudo-likelihood and graphical models, and the relation between them, is discussed and applied to graphical data. For both methods, different models on homogeneous and mixture data are used. An algorithm to estimate the different components of mixture data is the expectation-maximization (EM) algorithm. Its properties and the numerical difficulties that may arise are discussed. Also, simulation studies are performed for these models on homogeneous and mixture data, where the mixture data consists of two components. Here, the simulated mixture data has components whose parameters are similar. For this data, the methods are used to reconstruct an underlying network. These types of simulations usually take days; therefore, only the case is considered in which estimating the different components is not clear from an exploratory analysis.

Synthetic biology allows for new constructions of a gene regulatory network to seed new functions within a cell. Combining the inferences with the new constructions yields the problem of network reconstruction. An example of this is the yeast synthetic network data set (Cantone et al., 2009). For this data set, both aforementioned methods, which are frequentist statistical methods, are compared with their Bayesian counterparts. Partial F-tests and t-tests are used to test whether the data is homogeneously or mixture distributed. As a result, even if the data has a mixture distribution, the homogeneous model is often able to explain the behavior of the underlying network.

All R-code is available on https://github.com/gerard1911/msc-thesis.

Contents

1 Introduction
  1.1 Systems biology and gene regulatory networks
    1.1.1 Systems biology
    1.1.2 Quantitative measurements of gene expression
    1.1.3 Gene regulatory networks
  1.2 Graph theory
    1.2.1 Concepts
    1.2.2 Probability setting
    1.2.3 Relation to gene regulatory networks
  1.3 Variable selection procedures
    1.3.1 Models with means as linear combinations of covariables
    1.3.2 Penalized Maximum Likelihood
  1.4 Model performance: ROC & PR-curves and AUC scores
    1.4.1 Receiver Operating Characteristic (ROC) curves
    1.4.2 Precision Recall (PR) curves
  1.5 Design and structure of the study

2 Pseudo-likelihood methods
  2.1 Definition and consistency
  2.2 Pseudo-likelihood on graphs
    2.2.1 Pseudo-likelihood on gene regulatory networks
  2.3 Different distributions as marginal densities
    2.3.1 Linear Model
    2.3.2 J-component Gaussian Mixture Model (GMM)
    2.3.3 EM-Algorithm
    2.3.4 CEM-Algorithm
    2.3.5 SEM-Algorithm
    2.3.6 Convergence problems of the EM-algorithm

3 Graphical Models
  3.1 Derivation and relation to pseudo-likelihood
  3.2 Graphical Model
    3.2.1 Penalized maximum likelihood: lasso penalty
    3.2.2 Selecting best sparse model
  3.3 Mixtures of Graphical Models
  3.4 Complication of graphical models

4 Simulation Study
  4.1 Sequential linear data
    4.1.1 Homogeneous linear data
    4.1.2 Mixtures of linear data
  4.2 Static Multivariate Data
    4.2.1 Homogeneous multivariate data
    4.2.2 Mixtures of multivariate data
  4.3 Design of the study
  4.4 Results
    4.4.1 Times of being best explaining model
    4.4.2 Mean and boxplots of AUC-ROC scores
  4.5 Conclusion
  4.6 Discussion

5 An Applied Example: yeast synthetic network
  5.1 Introduction
  5.2 Exploratory data analysis
    5.2.1 Plots of gene activity through time
    5.2.2 Scatter plots of gene activity
  5.3 Formal data analysis
    5.3.1 Linear model
    5.3.2 Gaussian Mixture Model
    5.3.3 EM-Algorithm
    5.3.4 CEM-Algorithm
    5.3.5 SEM-Algorithm
    5.3.6 Graphical Model
    5.3.7 Mixture Graphical Model
    5.3.8 Hypothesis tests on regression coefficients of experiments
  5.4 Conclusion
  5.5 Discussion

6 Summary

7 Acknowledgements

Appendices

A Derivations
  A.1 Derivation variances conditional normal distribution

B Yeast Data Analysis
  B.1 Linear model
  B.2 Gaussian Mixture Model
  B.3 EM-algorithm
  B.4 CEM-algorithm
  B.5 SEM-algorithm
  B.6 Graphical Model
  B.7 Mixture Graphical Model

1 Introduction

This chapter gives an introduction to gene regulatory networks. The most important theory, concepts and applications are explained here.

After the basics of gene regulatory networks, an introduction to graph theory is given. Necessary definitions and properties are stated and expanded to a probability setting. This helps in estimating and understanding gene regulatory networks.

Since different models are learned on gene regulatory networks in this thesis, a brief introduction to model selection is given. Here, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are given. Furthermore, a brief explanation of penalized maximum likelihood is provided for obtaining sparse models.

Finally, measures for model performance on data are given. Here, the concepts of the receiver operating characteristic (ROC) and precision recall (PR) curves are explained. The ROC curve describes the predictive behavior of a model by counting the number of true positives and false positives it gives as a certain threshold decreases, whereas the PR curve, in contrast, does not involve the true negatives of the model. The area under a ROC curve (AUC-ROC) is used as the model performance measure on artificial datasets; the area under a PR curve (AUC-PR) is used to compare all methods and their Bayesian counterparts, applied to the yeast synthetic dataset (Cantone et al., 2009; Grzegorczyk, 2016).

1.1 Systems biology and gene regulatory networks

In this section, a short introduction is given to systems biology and gene regulatory networks.

Later in this thesis, different statistical methods are used to make inferences on a synthetic gene regulatory network in the yeast Saccharomyces cerevisiae (Cantone et al., 2009). But, in order to understand the meaning of these inferences, a summary is given of the basic concepts of systems biology. This summary is mainly based on the work produced by Grzegorczyk, Emmert-Streib, Fomekong Nanfack and Clancy (Grzegorczyk, 2015; Emmert-Streib et al., 2014; Fomekong Nanfack et al., 2010; Clancy and Brown, 2008).


1.1.1 Systems biology

In systems biology, the main concern is to study a collection of elements as a system. The field aims at finding the connections between the elements, the dynamics, the role of each element within the system and the entire system's functionality (Emmert-Streib et al., 2014). In this thesis, the collection of elements is a set of genes within a genome, which is the genetic material of an organism and consists of DNA (for RNA viruses, the genetic material consists of RNA instead of DNA). DNA consists of four nucleotide bases, namely adenine (A), cytosine (C), guanine (G) and thymine (T).

Figure 1.1: Example of a double strand DNA molecule, together with its mRNA copy molecule. Here, every adenine (A), cytosine (C), guanine (G) and thymine (T) molecule is opposed by a uracil (U), guanine, cytosine and adenine molecule, respectively (Win).

A gene is the basic functional unit of a genome and consists of long molecules of DNA, made up of chains in a double-helix structure (Fomekong Nanfack et al., 2010). Here, the DNA appears as a double strand, where adenine is opposed by thymine and cytosine by guanine. A gene is, roughly speaking, information stored in a DNA sequence of nucleotides on the genome. This information is required for biological processes in the cell. One of these processes is transcription. During transcription, the double-helix structure of the gene is locally unzipped by the enzyme RNA polymerase such that the sequence can be copied into messenger RNA (mRNA). Like DNA, mRNA also consists of nucleotides, with the difference that it contains uracil (U) instead of thymine. An example of an mRNA copy together with its original DNA sequence is given in figure 1.1.

After transcription, the created mRNA molecule travels from the nucleus to the cytoplasm of the cell. Here, the ribosomes are located. The ribosomes are responsible for reading the mRNA molecule to create protein. The process of reading the mRNA and generating protein is called translation. During translation, the mRNA sequence is related to a sequence of amino acids in a protein. Each group of three mRNA nucleotides in the sequence constitutes a codon, which in turn specifies a particular amino acid. Thus, the mRNA molecule is used by the ribosomes as a template to order the chain of amino acids that form a particular protein. A list of the codons together with the amino acids is given in figure 1.2.

Proteins are responsible for many different processes in the organism; examples are metabolism, transporting molecules and DNA replication. Hence, generating proteins is an essential process within the cell. An overview of the whole process of generating protein is given in figure 1.3. The activity in which mRNA, protein and other genetic products are produced from a specific gene is called the gene expression.

Figure 1.2: The amino acids specified by each mRNA codon. Multiple codons can code for the same amino acid (Clancy and Brown, 2008).

1.1.2 Quantitative measurements of gene expression

Throughout the years, many techniques have been developed to measure gene expressions in organisms (Baxevanis and Ouellette, 2004). Ideally, for every gene of the genome there is a particular mRNA molecule. Then, for each gene a specific expression can be measured (Grzegorczyk, 2015).

A way of measuring gene expressions is by DNA microarray technology. A DNA microarray is a collection of microscopic DNA spots attached to a solid surface. Usually, these microarrays are used to quantify mRNA transcribed from different genes which encode different proteins. Hence, if a DNA microarray is made for a cell at different time points, time series data are obtained.

However, this technique generates noisy data that are usually subject to bias in the biological measurement (Fomekong Nanfack et al., 2010). In particular, earlier developed microarray chips suffer from large measurement errors. Therefore, in order to analyze the microarray data, it must be standardized and normalized first (Grzegorczyk, 2015).

Figure 1.3: During transcription, the enzyme RNA polymerase (green) uses DNA as a template to produce a pre-mRNA transcript (pink). The pre-mRNA is processed to form a mature mRNA molecule that can be translated to build the protein molecule (polypeptide) encoded by the original gene (Clancy and Brown, 2008).

1.1.3 Gene regulatory networks

In a gene regulatory network (GRN), a set of regulatory genes is involved which possibly interact with each other in different ways. A gene can activate another gene, stimulating it to become more active, or inhibit it. This can happen directly, when a gene is responsible for generating a protein that regulates the other gene (transcriptional regulation), or indirectly, by generating a protein that neutralizes the activation of a gene by another protein (protein-protein regulation). In order to make a GRN statistically applicable, the following definition of a GRN, as proposed by Emmert-Streib, is used in this thesis.

Definition 1. A network that has been inferred from gene expression data is defined as a gene regulatory network (GRN) (Emmert-Streib et al., 2014).

By measuring the expressions of all genes in the network, it becomes possible to make statistical inferences on the network structure. Here, the main focus is to reveal the possibly completely unknown relations between the genes of the network. For this reason, many different techniques to model GRNs have been developed (Fomekong Nanfack et al., 2010). In recent years, many network inference methods have been introduced and many comparisons have been conducted.

However, it is unlikely that there is one method that fits all different biological, technical and experimental design conditions best (Emmert-Streib et al., 2014).

In this thesis, statistical models are used to make model inferences on GRNs. To this end, concepts of graph theory are needed to make comparisons in performance for these methods.

1.2 Graph theory

Before it is possible to make inferences on gene regulatory networks, concepts of graph theory are needed. First, the definition and properties of a graph are given in a general and a probability setting. Here, the Hammersley-Clifford theorem is also stated, as it turns out to be very important for graphical models (Lafferty et al., 2001). Finally, the link between graph theory and gene regulatory networks is explained.

1.2.1 Concepts

To understand the basic concepts of graph theory, a few definitions are needed first. To illustrate the definitions, an example is used.

Definition 2. A graph G = (V, E) is a set of vertices V and a set of edges, which are pairs of vertices, E ⊂ V × V. A subgraph G_sub = (V_sub, E_sub) of G is a subset of the vertices V_sub ⊂ V of G together with the corresponding edges E_sub ⊂ E.

Here, an ordered edge Eij is denoted as an ordered pair of vertices Vi, Vj ∈ V, i.e. Eij = (Vi, Vj).

An edge Eij ∈ E indicates that there is a direct relation of vertex Vi on vertex Vj. Examples of graphs are given in figure 1.4 and figure 1.5.

Figure 1.4: An example of a directed graph with 5 vertices and 8 edges.

Definition 3. An edge E_ij ∈ E is said to be undirected if E_ij ∈ E implies E_ji ∈ E. A graph G is undirected if all edges in E are undirected.

The graph in figure 1.4 is a directed graph. In figures, an arrowed link between the vertices indicates directed edges. If there is only a line connecting the vertices, it indicates an undirected edge.

Definition 4. Two vertices Vi, Vj ∈ V are said to be adjacent if Eij ∈ E and Eji ∈ E.

For the graph in figure 1.4, the only vertices which are adjacent are V3 and V5. As an extension of the basic definitions for single edges or pairs of vertices, the following definitions are used for sequences of vertices.

Definition 5. The sequence (V_1, ..., V_n) with V_k ∈ V for k = 1, ..., n is called a path if (V_i, V_{i+1}) ∈ E for i = 1, ..., n − 1 and all V_k are distinct.

Definition 6. The sequence (V1, . . . , Vn) with Vk ∈ V for k = 1, . . . , n is called a cycle if (V1, . . . , Vn−1) is a path, En−1,n ∈ E and V1 = Vn.

In the graph of figure 1.4, the sequence (V2, V1, V4) is a path but (V1, V4, V3) is not. The sequence (V2, V1, V4, V5, V2) is a cycle but (V2, V1, V4, V3, V5, V2) is not.

Definition 7. A set of vertices A ⊂ V separates vertices V_i and V_j if for all paths P between V_i and V_j, there exists V_k ∈ P such that V_k ∈ A and V_i, V_j ∉ A. The minimal separating set is denoted S(V_i, V_j); it is the smallest set of vertices that separates V_i and V_j.

For the graph of figure 1.4, the set A = {V2, V4} is the smallest set that separates vertices V1 and V5. The following definitions are important to understand the Hammersley-Clifford theorem, stated in subsection 1.2.2.

Definition 8. A subgraph G_sub is

• complete if all its vertices are adjacent.

• maximally complete if there is no additional vertex V ∈ V in the graph such that the subgraph together with the additional vertex is complete.

Definition 9. The neighbor set of a vertex V ∈ V, denoted N(V), is the subset of vertices that are adjacent to V.

Figure 1.5: An example of an undirected graph with 5 vertices and 8 edges.

Definition 10. A subset C ⊂ V is called a clique if for every vertex V ∈ C, C ⊂ {V} ∪ N(V).

As an immediate consequence of the definition, a clique C ⊂ V is a complete subgraph of G. By this equivalent characterization, it is easier to recognize cliques within graphs.

In figure 1.5, the subgraphs {V2, V3, V5} and {V3, V4, V5} are maximal cliques. But {V1, V2, V3} is not a clique, because vertices V2 and V4 are not adjacent. Also, {V3, V4} is not a maximal clique, since together with V5 it still forms a complete subgraph.

1.2.2 Probability setting

In this subsection, only properties of undirected graphs are discussed. Doing so, easy interpretations are obtained for dependent random variables in a graph setting.

Let Y = {Y_1, ..., Y_N} be a set of random variables with some distribution function P(y). In a probability setting, a graph is defined by taking the random variables as vertices, V = Y, together with a set of edges E containing all pairs of random variables that are dependent on each other.

Definition 11. Two random variables Y_i and Y_j are conditionally independent, given V \ {V_i, V_j}, if and only if E_ij, E_ji ∉ E. That is,

$$E_{ij} \notin \mathcal{E} \iff P\big(y_i \mid \{y_1, \ldots, y_N\} \setminus \{y_i\}\big) = P\big(y_i \mid \{y_1, \ldots, y_N\} \setminus \{y_i, y_j\}\big)$$

Therefore, having no edge between two vertices means that the two variables are conditionally independent given all other variables. This corresponds to the following properties:

Definition 12. Let V_i, V_j ∈ V be two vertices with E_ij, E_ji ∉ E. This is equivalent to the following:

1. (Pairwise Markov property) Y_i ⊥ Y_j | V \ {V_i, V_j}

2. (Local Markov property) Y_i ⊥ Y_j | N(Y_i)

3. (Global Markov property) Y_i ⊥ Y_j | S(Y_i, Y_j)

A subset C is a maximal clique if it is a maximally complete subgraph. The set of all maximal cliques in the graph is denoted by cl(G). The collection of maximal cliques and the Markov properties about conditional independence motivate the fundamental theorem of random fields (Hammersley and Clifford, 1971; Lafferty et al., 2001).

Theorem 1 (Hammersley-Clifford Theorem). Let Y = (Y_1, ..., Y_N) be a random vector of vertices with density function P(Y = y) > 0 and graph G = (Y, E). Then the following statements are equivalent:

1. Local Markov property: Y_i ⊥ Y_j | N(Y_i)

2. Factorization property: the density can be factorized in terms of the cliques of G,
$$P(y) = \frac{1}{Z} \prod_{C \in cl(G)} \phi_C(y_C)$$

3. Global Markov property: Y_i ⊥ Y_j | S(Y_i, Y_j)

For undirected graphs, the Hammersley-Clifford theorem also holds without the positivity condition (Lauritzen et al., 1990). In figure 1.5, the set of all maximal cliques is given by

$$cl(G) = \big\{\{V_1, V_2\}, \{V_1, V_4\}, \{V_2, V_3, V_5\}, \{V_2, V_4, V_5\}\big\}$$

The Hammersley-Clifford theorem is used later in this thesis through the statement that variables Y_i and Y_j are conditionally dependent on each other if and only if there is a factor f(y_i, y_j) in the original density function P(y).

1.2.3 Relation to gene regulatory networks

In a graphical setting, a GRN can be seen as a graph whose vertices are given by the specific genes. Then, there is an edge between two vertices if and only if there is either a direct or an indirect relation between the corresponding genes. Hence, by treating genes as random variables, their gene expression data is used to make statistical inferences on which genes are conditionally independent of each other. Doing so, a network is reconstructed out of gene expression data. However, to make such a reconstruction, a variable selection procedure is needed.


1.3 Variable selection procedures

After the general concepts of graph theory in a probability setting and its relation to GRNs, an introduction is given to model selection for different models on graphs. First, an information criterion selection procedure is given. Secondly, penalized maximum likelihood estimation is explained, together with how information criterion selection can be applied to it.

1.3.1 Models with means as linear combinations of covariables

Given data y = (y_1, ..., y_N)^⊤ and explanatory variables X = (x_1, ..., x_N), multiple linear models can be learned from the data by choosing different subsets out of all available explanatory variables. The matter of interest is the subset of explanatory variables which explains the data most accurately, using the least number of explanatory variables.

For this reason, an information criterion is needed, which measures the improvement of adding an explanatory variable in terms of the likelihood and adds a penalty term for adding another variable. The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are used as statistics based on the log-likelihood function, with an adjustment for the number of estimated parameters and for the amount of data, to measure the goodness of fit (Dobson and Barnett, 2008).

Definition 13. Given data y = (y_1, ..., y_N)^⊤ ∈ R^N with explanatory variables X = (x_1, ..., x_N)^⊤ ∈ R^{N×p} and a distribution function P(y | X, θ), let M denote the model, let θ̂_M be the vector of estimated parameters which maximizes the likelihood under the model M, and let k = |θ̂_M| be the number of estimated coefficients. The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are defined by

$$\mathrm{AIC}(\hat{\theta}_M \mid y) = -2 \log P(y \mid \hat{\theta}_M) + 2k$$
$$\mathrm{BIC}(\hat{\theta}_M \mid y) = -2 \log P(y \mid \hat{\theta}_M) + k \log N$$

In this thesis, the BIC criterion is used to select the best parsimonious model. Finally, a model is obtained that involves only the variables that contribute most to the variation in the observed variable. In a graph setting, this means that each vertex V is explained separately by the other vertices, for all vertices in the graph. But it is also possible to explain all observed variables simultaneously. For that, another selection procedure is needed, one that penalizes the size of the estimated parameter coefficients themselves.
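To make the BIC-based selection concrete, the following is a minimal R sketch (R being the language of the accompanying code repository) that scores all covariate subsets of a small simulated design by BIC and keeps the best-scoring model; the data, names and exhaustive enumeration are illustrative and not taken from the thesis code.

```r
# Score all non-empty covariate subsets of a small simulated design by BIC and
# keep the most parsimonious best-scoring linear model.
set.seed(1)
N <- 50; p <- 3
X <- matrix(rnorm(N * p), N, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 1 + 2 * X[, 1] - X[, 3] + rnorm(N, sd = 0.5)   # x2 carries no signal

subsets <- unlist(lapply(1:p, function(m) combn(colnames(X), m, simplify = FALSE)),
                  recursive = FALSE)
scores <- sapply(subsets, function(vars) {
  dat <- data.frame(y = y, X[, vars, drop = FALSE])
  # BIC = -2 log L + k log N; R's parameter count also includes the error variance
  BIC(lm(y ~ ., data = dat))
})
subsets[[which.min(scores)]]   # typically selects "x1" and "x3"
```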

1.3.2 Penalized Maximum Likelihood

Where the AIC and BIC criteria are goodness-of-fit statistics that involve the number of estimated parameters, penalized maximum likelihood estimation is used to penalize the size of the estimated parameters themselves.

Definition 14. Let y = (y_1, ..., y_N)^⊤ ∈ R^N be observed data together with a log-likelihood function ℓ(y | θ), with θ ∈ Θ a set of parameters to be estimated, and let t > 0 be a penalty parameter together with a norm ||·||. The penalized maximum likelihood estimator θ̂_PML is defined by

$$\hat{\theta}_{PML} = \arg\max_{\theta \in \Theta} \ell(y \mid \theta), \quad \text{subject to } \|\theta\| \le t$$

For penalized maximum likelihood estimation, the goal is to optimize the log-likelihood with a restriction on the size of the parameters to make sparse inferences. As the penalty term t decreases, it forces the estimated parameters to shrink. By gradually decreasing the penalty term t, the estimated parameters get smaller until they reach zero, meaning they have no influence on the data.

In this thesis, the size of the estimated covariance matrix of gene expression data is penalized.

Doing so, the covariances shrink to zero in order of their contribution to the likelihood. Hence, an ordered list of correlated variables is obtained.

Often, the L1 norm or the L2 norm is used for penalized maximum likelihood estimation. The first is known as the lasso and the second as ridge regression. But only for the lasso will coefficient estimates become exactly zero when t is small enough (James et al., 2013).

For variable selection, it is also possible to take the BIC score of a model each time a parameter estimate becomes exactly zero. This way, comparisons can be made between models based on penalized maximum likelihood and models with means as linear combinations of the covariables.

After obtaining BIC scores for both methods, each model is scored for the number of variables it involves; the model with the lowest AIC or BIC score explains the data best. However, in order to evaluate the performance of these models in network reconstruction, a measure is needed of which relations each model inferred correctly and which it inferred wrongly.
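As an illustration of the difference between the two penalties, the sketch below uses the glmnet R package (an assumption for illustration; the thesis code on GitHub may use a different implementation) to fit the lasso and ridge versions of a linear model and inspect the coefficients at one penalty level.

```r
# Lasso (L1) and ridge (L2) penalized linear models via the glmnet package;
# the package choice and the simulated data are illustrative only.
library(glmnet)

set.seed(2)
N <- 40; p <- 6
X <- matrix(rnorm(N * p), N, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(N)

lasso <- glmnet(X, y, alpha = 1)    # L1: some coefficients become exactly zero
ridge <- glmnet(X, y, alpha = 0)    # L2: coefficients shrink but stay nonzero

coef(lasso, s = 0.5)                # coefficients at penalty level lambda = 0.5
coef(ridge, s = 0.5)
```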

1.4 Model performance: ROC & PR-curves and AUC scores

In this section, the basics of receiver operating characteristic (ROC) and precision recall (PR) curves are explained for evaluating the performance of different binary decision methods on data.

1.4.1 Receiver Operating Characteristic (ROC) curves

The ROC curve is described by the true positives and false positives a model gives as a penalty threshold changes. The curve is created by plotting a parametrization of the true positive rate (TPR, sensitivity) against the false positive rate (FPR, fall-out), as a function of the threshold.

At a high threshold, no variables are added to the model, which yields a sensitivity and fall-out equal to zero. As the threshold decreases, more variables are added to the model, which are either true positives or false positives. Therefore, the ROC curve is non-decreasing in terms of the fall-out (Metz, 1978). When all variables are added to the model, and the threshold is thus zero, both the sensitivity and the fall-out equal one. Hence, every ROC curve is a non-decreasing function starting from (FPR = 0, TPR = 0) and ending in (FPR = 1, TPR = 1).

Desirably, a method under consideration adds more true variables than wrong variables to the model as the threshold drops. This results in a ROC curve which is steep at the beginning and flattens out as the threshold drops, see also figure 1.6. In a visual representation, the more the ROC curve tends to the upper left corner of the plot, the better the method under consideration is. A measure for the performance of a method is the area under the ROC curve, or AUC-ROC score.

Figure 1.6: Two examples of receiver operating characteristic (ROC) curves; the color of the curve changes with the threshold. (a) A ROC curve with area under the curve AUC-ROC = 0.906, classified in class A of the traditional academic point system, meaning the model performance is excellent (table 1.1). (b) A ROC curve with AUC-ROC = 0.699, classified in class D, meaning the model performance is poor (table 1.1).

The larger the AUC-ROC score is, the better the method performs in classifying. The AUC-ROC score is bounded between zero and one. A score of one means that the method classifies the data perfectly, a score of zero means that the method classifies the data completely wrong. Random guessing results in a score of 0.5. A guide of classifying the accuracy of a method is the traditional academic point system, given in table 1.1. An example of both a good and poor performing model is given in figure 1.6.

AUC-ROC       Class   Meaning
0.90 - 1.00   A       Excellent
0.80 - 0.90   B       Good
0.70 - 0.80   C       Fair
0.60 - 0.70   D       Poor
0.50 - 0.60   E       Fail

Table 1.1: The traditional academic point system for AUC-ROC scores (Metz, 1978).

1.4.2 Precision Recall (PR) curves

In relation to ROC curves, PR curves are used as an alternative. PR curves are suitable for tasks with a large skew in the class distribution (Goadrich et al., 2004). In PR curves, the recall is exactly the same as the sensitivity of ROC curves, and the precision is defined as the fraction of true positives with respect to all positives given by the model (both true and false).

A visual difference between ROC and PR curves is that the goal for ROC curves is to be close to the upper left corner of the plot, whereas the goal for PR curves is to be close to the upper right corner. As for ROC curves, the area under the PR curve (AUC-PR) is also a measure for model performance. Again, the higher the AUC-PR, the better the method under consideration performs.

Moreover, by a theorem of Davis and Goadrich: for a fixed number of positive and negative examples, one curve dominates a second curve in ROC space if and only if the first dominates the second in Precision-Recall space (Davis and Goadrich, 2006). To illustrate this, consider figures 1.6 and 1.7.

Figure 1.7: Two examples of precision recall (PR) curves on the same data as the ROC curves of figure 1.6; the color of the curve changes with the threshold. (a) The PR curve (AUC-PR = 0.863) corresponding to the excellent ROC curve in figure 1.6a. (b) The PR curve (AUC-PR = 0.664) corresponding to the poor ROC curve in figure 1.6b.
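Both curves and their AUC scores can be computed directly from a vector of edge scores and the true 0/1 edge labels. The hand-rolled R sketch below (names and simulated data are illustrative; the thesis code may rely on dedicated packages) traces both curves over a decreasing threshold and integrates them with the trapezoidal rule, which for the PR curve is only an approximation of the interpolation discussed by Davis and Goadrich.

```r
# ROC and PR curves from scores and 0/1 labels, with AUC by the trapezoidal
# rule (for the PR curve this is only an approximation).
roc_pr <- function(score, label) {
  ord <- order(score, decreasing = TRUE)          # decreasing threshold
  label <- label[ord]
  tp <- cumsum(label); fp <- cumsum(1 - label)
  tpr  <- c(0, tp / sum(label))                   # sensitivity / recall
  fpr  <- c(0, fp / sum(1 - label))               # fall-out
  prec <- c(1, tp / (tp + fp))                    # precision
  auc <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
  list(auc_roc = auc(fpr, tpr), auc_pr = auc(tpr, prec))
}

set.seed(3)
lab <- rbinom(100, 1, 0.3)                        # true edge indicators
sco <- lab + rnorm(100)                           # informative edge scores
roc_pr(sco, lab)                                  # AUC-ROC clearly above 0.5
```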

1.5 Design and structure of the study

In chapters 2 and 3, different homogeneous and mixture models are described to infer networks from data. For these different types of models, their underlying relations and properties are described and their maximum likelihood estimates are derived.

For every described model, corresponding data is generated according to an artificial network in chapter 4. In this simulation study, the different models are used to reconstruct the underlying network on every generated data set. To study the performances of the models, the ROC curves and AUC-ROC scores are calculated for each model on every type of generated data. In the simulation study, fifty data sets are generated for four different types of data. Doing so, a comparison can be made of which model performs best on the data, in terms of the highest mean AUC-ROC score and the fraction of simulations in which it had the highest AUC-ROC score of all.

In this study, the models are also applied to a gene regulatory network. In chapter 5, the model performance is investigated on the yeast Saccharomyces cerevisiae data set (Cantone et al., 2009) to compare it with earlier studies on the model performance of their Bayesian counterparts (Grzegorczyk, 2016). Since that study used PR curves to evaluate model performance, PR curves are also used in this chapter.

2 Pseudo-likelihood methods

The idea of pseudo-likelihood was developed by Julian Besag in the analysis of dependencies in spatial data (Besag, 1975). The general idea of pseudo-likelihood is to approximate the likelihood of a model by a more tractable objective. In this chapter, the maximum likelihood estimators for the pseudo-likelihood are derived using a homogeneous Gaussian distribution and a Gaussian mixture distribution with means linearly dependent on the covariables.

Furthermore, the EM-algorithm, CEM-algorithm and SEM-algorithm are explained. Here, a suggestion is given for an initial parameter estimate for these methods, as they do not necessarily converge to the maximum likelihood estimate. Also, the numerical problems that may arise in these algorithms are explained and suggestions are given to solve them.

2.1 Definition and consistency

Let (Y1, . . . , YM) be M random variables, with a set of parameters θ. The variables have some underlying dependencies, given by the set E. The set E is also known as the neighbor set, where a pair of nodes (i, j) are called neighbors if the conditional distribution at node i depends on the value of node j. So the pair (Yi, Yj) ∈ E indicates that the variable Yi is conditionally dependent on Yj, given all its neighbors. The pseudo-likelihood is defined as follows

Definition 15. Let Y = (Y_1, ..., Y_M) be a random vector with neighbor set E, together with a probability function P_θ(y). The pseudo-likelihood of the random vector (Y_1, ..., Y_M) is defined by

$$\mathrm{PL}_\theta(Y = y) = \prod_{i=1}^{M} P_\theta\big(Y_i = y_i \mid Y_j = y_j,\ (Y_i, Y_j) \in \mathcal{E}\big)$$

Intuitively, the pseudo-likelihood is built up from the densities of the individual random variables, conditioned on their neighbors. Notice that for any random vector Y, the likelihood can be factorized into conditional probabilities, and thus

$$P_\theta(Y = y) = \prod_{i=1}^{M} P_\theta(y_i \mid y_1, \ldots, y_{i-1}) \approx \prod_{i=1}^{M} P_\theta\big(Y_i = y_i \mid Y_j = y_j,\ (Y_i, Y_j) \in \mathcal{E}\big) = \mathrm{PL}_\theta(Y = y)$$

The advantage of this approach is that it breaks the original problem into a series of simpler sub-problems. Instead of having to deal with M random variables simultaneously, we can tackle each variable Y_i in turn. The assumption made here is that each random variable, conditioned on its neighbors, is independent of all other variables.

Thus, to obtain an estimate of the parameters in the model, one can use the approximation of the likelihood to do model inference. This approach is known as maximum pseudo-likelihood estimation (MPLE). Suppose there are N observations for the random variables. To obtain an MPLE, one needs to maximize the pseudo-likelihood function with respect to the parameter θ. Such estimators θ̂_N are strongly consistent, as a result of a theorem by Gong and Samaniego, stated here without proof (Gong and Samaniego, 1981):

Theorem 2 (Strong consistency of the MPLE). Let Y_1, ..., Y_N be i.i.d. ∼ F_θ, where F_θ is some distribution function and each Y_i has density or probability mass function f(y | θ). Suppose that the following two assumptions hold:

1. The first, second and third order partial derivatives of log f(y | θ) with respect to θ exist for all y and θ.

2. For any τ ≠ θ:
$$P_\theta\big(f(Y_i \mid \theta) = f(Y_i \mid \tau)\big) < 1$$

For ε > 0, let A_N(ε) be the event that there exists a root θ̂_N of the equation

$$\sum_{i=1}^{N} \frac{\partial}{\partial \theta} \log f(y_i \mid \theta) = 0$$

for which |θ̂_N − θ| < ε. Then, for any ε > 0,

$$\lim_{N \to \infty} P\big(A_N(\varepsilon)\big) = 1,$$

i.e. the MPLE θ̂_N is strongly consistent.

Notice that the MPLE θ̂_N maximizes the pseudo-likelihood, but in general not the original likelihood. Therefore, the reliability of the estimator strongly depends on the similarity between the pseudo-likelihood and the likelihood itself.


2.2 Pseudo-likelihood on graphs

Suppose we have a set of nodes Y = {Y_1, ..., Y_M} with some distribution function P(y | θ). The likelihood function of the set of nodes is approximated by the product of all conditional distributions of the single variables, given the values of all others:

$$P(Y \mid \theta) \approx \prod_{i=1}^{M} P(Y_i \mid Y_{(-i)}, \theta_i) \tag{2.1}$$

To maximize this pseudo-likelihood expression, every factor in the latter expression is maximized. Hence, the problem is split into M sub-problems:

$$\arg\max_{\theta_i} \; P(Y_i \mid Y_{(-i)}, \theta_i), \quad \text{for } i = 1, \ldots, M \tag{2.2}$$

where θ_i is the collection of parameters of the variable Y_i. For these sub-problems, various densities can be taken. In this thesis, the data either stems from a single Gaussian distribution or from a mixture of Gaussian distributions, where Y_i is the response and the regulators form the matrix of covariables X = Y_{(-i)}. Here, the search is for the most parsimonious model for each density, i.e. the model involving the least number of parameters that explains the data best. Therefore, the number of parameters used in the density is penalized, and the BIC criterion is used on each sub-problem in turn. Hence, the following sub-problems are solved to find an approximate answer to the original problem:

$$\arg\min_{\theta_i} \; -2 \log P(Y_i \mid Y_{(-i)}, \theta_i) + \log(N)\, k_i, \quad \text{for } i = 1, \ldots, M \tag{2.3}$$

where k_i = |θ_i| is the number of estimated parameters in θ_i. Doing this, the best-fitting model for each node is found, using the least number of nodes to explain it. Combining all nodes, a reconstructed network is obtained in which each node is explained by other nodes such that its BIC score is minimal. In the following section, the distributions are discussed that are used in this thesis as densities in the pseudo-likelihood estimates.
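A minimal R sketch of this node-wise decomposition is given below: every node is regressed on all subsets of the remaining nodes, the subset with the lowest BIC is kept, and the selected covariables define the incoming edges of that node. The exhaustive subset search and the simulated data are purely illustrative; for larger networks such a search quickly becomes infeasible.

```r
# Node-wise BIC selection of equations 2.1-2.3: regress every node on all
# subsets of the other nodes and keep the subset with the lowest BIC.
reconstruct_network <- function(Y) {
  M <- ncol(Y); nodes <- colnames(Y); dat <- as.data.frame(Y)
  edges <- matrix(0, M, M, dimnames = list(nodes, nodes))   # edges[j, i]: j explains i
  for (i in seq_len(M)) {
    others <- nodes[-i]
    subsets <- c(list(character(0)),                        # intercept-only model
                 unlist(lapply(seq_along(others),
                               function(m) combn(others, m, simplify = FALSE)),
                        recursive = FALSE))
    bics <- sapply(subsets, function(vars) {
      rhs <- if (length(vars)) paste(vars, collapse = " + ") else "1"
      BIC(lm(as.formula(paste(nodes[i], "~", rhs)), data = dat))
    })
    edges[subsets[[which.min(bics)]], i] <- 1
  }
  edges
}

set.seed(4)
Y <- matrix(rnorm(200), 50, 4, dimnames = list(NULL, paste0("G", 1:4)))
Y[, 2] <- Y[, 1] + rnorm(50, sd = 0.3)        # G1 regulates G2
reconstruct_network(Y)                        # should flag the edge G1 -> G2
```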

2.2.1 Pseudo-likelihood on gene regulatory networks

Suppose there are N observations for M different variables. Since time series data are used from gene regulatory networks, it is necessary to lag the covariable values. A biological motivation is that it takes time for a gene to generate its corresponding protein to interact with other genes.

Hence, it makes more sense to model the response variables Y_i(t) by the previous values of the covariables Y_{(-i)}(t − 1) than by the current values Y_{(-i)}(t), for all i = 1, ..., M and t ∈ {2, ..., N}. Hence, the pseudo-likelihood looks as follows:

$$\mathrm{PL}_\theta\big(Y(t) \mid Y(t-1)\big) = \prod_{i=1}^{M} P_\theta\big(Y_i(t) = y_i(t) \mid Y_j(t-1) = y_j(t-1),\ (Y_i(t), Y_j(t-1)) \in \mathcal{E}\big)$$
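A small R sketch of this lagged set-up, assuming an N × M matrix of expression values ordered in time (the matrix and gene names are illustrative):

```r
# Time-lagged pseudo-likelihood set-up: gene i at time t is modelled by all
# genes at time t - 1.
lagged_design <- function(expr, i) {
  list(response   = expr[-1, i],             # Y_i(t),         t = 2, ..., N
       covariates = expr[-nrow(expr), -i])   # Y_(-i)(t - 1),  t = 2, ..., N
}

set.seed(5)
expr <- matrix(rnorm(20 * 5), nrow = 20, dimnames = list(NULL, paste0("G", 1:5)))
d <- lagged_design(expr, i = 1)
coef(lm(d$response ~ d$covariates))           # lagged linear model for gene G1
```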

2.3 Different distributions as marginal densities

In this section, linear models and mixture models are discussed for the pseudo-likelihood method.

For mixture data, the difference is discussed between the case in which the data is categorized and the case in which the category labels are missing. For the latter, the expectation-maximization algorithm and its variants are described. Also, the maximum pseudo-likelihood estimates are obtained, as well as their regularized versions, for both models. Finally, the problems the EM-algorithm may have on data are described.

2.3.1 Linear Model

Assume there are N single observations {y_1, ..., y_N}, where each y_i ∈ R has p explanatory variables, given in a vector x_i = (1, x_{i1}, ..., x_{ip})^⊤ ∈ R^{p+1}. In the linear model, each observation y_i is linearly modelled by its explanatory variables x_i. Hence, the model for y_i is as follows:

$$y_i = \beta_0 + \sum_{j=1}^{p} x_{ij} \beta_j + \varepsilon_i, \qquad i = 1, \ldots, N$$

where β = (β_0, β_1, ..., β_p)^⊤ ∈ R^{p+1} is the vector of regression coefficients with intercept β_0 and ε_i ∼ N(0, σ²) is random noise. Therefore, the model can be rewritten as follows:

$$y_i \mid x_i \sim N(x_i^\top \beta, \sigma^2), \qquad i = 1, \ldots, N$$

Let now y = (y_1, ..., y_N)^⊤ ∈ R^N and assume that every y_i is independently distributed for all i = 1, ..., N. Also, let X = (x_1, ..., x_N)^⊤ ∈ R^{N×(p+1)}. As a consequence, the vector y is multivariate Gaussian distributed (Dobson and Barnett, 2008):

$$y \mid X \sim N(X\beta, \sigma^2 I_{N \times N}) \tag{2.4}$$

The likelihood function of y | X is given by

$$L(y \mid X) = (2\pi\sigma^2)^{-\frac{N}{2}} \exp\left(-\frac{1}{2\sigma^2}(y - X\beta)^\top (y - X\beta)\right)$$

Assume that X has full column rank. By taking the logarithm of the likelihood and setting the partial derivatives with respect to β and σ² equal to zero, we obtain the maximum likelihood estimates of the linear model (Friedman et al., 2001):

$$\hat{\beta}_{OLS} = (X^\top X)^{-1} X^\top y, \qquad \hat{\sigma}^2 = \frac{1}{N - p - 1} \sum_{i=1}^{N} \big(y_i - x_i^\top \hat{\beta}\big)^2 \tag{2.5}$$

where β̂_OLS is the ordinary least squares estimator. In case X does not have full column rank, observe that maximizing the log-likelihood of y | X with respect to β is equivalent to minimizing

$$(y - X\beta)^\top (y - X\beta) = \|y - X\beta\|_2^2$$

Instead of minimizing this criterion directly, add a small term λ > 0 as a penalty for the size of the regression coefficients β. This procedure is known as ridge regression, with the difference that here λ is taken as small as possible, whereas in ridge regression the idea is to increase the penalty term in order to shrink the coefficients (Friedman et al., 2001). Then, the minimization problem changes to

$$\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 = \min_{\beta} \|\tilde{y} - \tilde{X}\beta\|_2^2 \tag{2.6}$$

where

$$\tilde{X} = \begin{pmatrix} X \\ \sqrt{\lambda}\, I_{(p+1) \times (p+1)} \end{pmatrix}, \qquad \tilde{y} = \begin{pmatrix} y \\ 0_{p+1} \end{pmatrix}$$

with 0_{p+1} a vector of p + 1 zeros. Observe that X̃ has full column rank for all λ > 0. Hence, the expression in equation 2.6 is an ordinary least squares problem again, and its solution is

$$\hat{\beta}_{RLS} = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top \tilde{y} = (X^\top X + \lambda I_{(p+1) \times (p+1)})^{-1} X^\top y \tag{2.7}$$

where the latter is known as the regularized least squares estimator. Observe that if λ = 0, the regularized least squares estimator is identical to the ordinary least squares estimator. When there is not enough data available to compute a least squares estimate, one can still use the regularized least squares estimate to make inferences on the regression coefficients.
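The equivalence in equations 2.6 and 2.7 can be checked numerically. The R sketch below uses simulated data (the intercept is penalized together with the other coefficients, as in the formulas above) and compares the estimate from the augmented design with the closed-form regularized least squares estimator.

```r
# Numerical check of equations 2.6-2.7: the ordinary least squares fit on the
# augmented data (X~, y~) equals the closed-form regularized estimator.
set.seed(6)
N <- 30; p <- 2; lambda <- 0.7
X <- cbind(1, matrix(rnorm(N * p), N, p))           # N x (p + 1) design with intercept
y <- drop(X %*% c(1, 2, -1)) + rnorm(N)

X_tilde <- rbind(X, sqrt(lambda) * diag(p + 1))     # augmented design
y_tilde <- c(y, rep(0, p + 1))                      # augmented response

beta_aug    <- solve(t(X_tilde) %*% X_tilde, t(X_tilde) %*% y_tilde)
beta_closed <- solve(t(X) %*% X + lambda * diag(p + 1), t(X) %*% y)
all.equal(drop(beta_aug), drop(beta_closed))        # TRUE up to numerical error
```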

2.3.2 J-component Gaussian Mixture Model (GMM)

Instead of a model which assumes that the data is sampled from a single density, one can also use a mixture model, which assumes that multiple Gaussian densities are responsible for the sampled data. To illustrate this, consider the following example.

Example 1. Consider the plot given in figure 2.1. From the plot, it seems that there are different means for the different species. Therefore, it seems unlikely that this data stems from a single Gaussian distribution. Here, it looks more likely that the data stems from a mixture distribution of two or three components.

As illustrated by the example, it is in some cases more convenient to use a mixture of Gaussian distributions as the probability density function of a random variable Y than a single Gaussian distribution. Suppose there are J components; then the random variable Y looks as follows:

$$Y_1 \sim N(\mu_1, \Sigma_1), \;\ldots,\; Y_J \sim N(\mu_J, \Sigma_J), \qquad Y = I(\Delta = 1)\, Y_1 + \cdots + I(\Delta = J)\, Y_J$$

where ∆ ∼ U(π_1, ..., π_J) indicates a discrete random variable specifying from which component the sample stems, with π_j the probability of a sample stemming from component j. Let φ_j(y) denote the multivariate Gaussian density with parameters (µ_j, Σ_j) for j = 1, ..., J. Then the density of Y is:

Figure 2.1: An example of two-dimensional mixture data: the famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and petal length, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica (Anderson, 1935).

$$g_Y(y) = \sum_{j=1}^{J} \pi_j \phi_j(y)$$

Suppose the data given in figure 2.1 is modelled by maximum likelihood; then J = 3 and the parameters of interest are θ_j = (π_j, µ_j, Σ_j) for every j = 1, ..., J. Suppose there are N samples, where the sample size per component is given by N_j. Let Z = (y_1, ..., y_N) denote the data of the measured variable and let Z_m = (∆_1 = δ_1, ..., ∆_N = δ_N), with δ_i ∈ {1, ..., J} for i = 1, ..., N, denote the data indicating to which component each measurement belongs. The total data is denoted as T = {Z, Z_m}. The log-likelihood is then given by

$$\ell(\theta \mid T) = \sum_{i=1}^{N} \sum_{j=1}^{J} I(\Delta_i = j) \log\big(\pi_j \phi_j(y_i)\big) = \sum_{j=1}^{J} N_j \log \pi_j + \sum_{j=1}^{J} \sum_{i=1}^{N} I(\Delta_i = j) \log \phi_j(y_i)$$

By the constraint $\sum_{j=1}^{J} \pi_j = 1$, the maximization of the log-likelihood of the mixture data is a constrained optimization problem:

$$\max_{\theta} \; \ell(\theta \mid T) \quad \text{subject to} \quad \sum_{j=1}^{J} \pi_j = 1$$

This problem is solved using the method of Lagrange multipliers. Hence, the following function needs to be maximized:

$$\Lambda(\theta, \lambda \mid y_1, \ldots, y_N, \delta_1, \ldots, \delta_N) = \sum_{j=1}^{J} N_j \log \pi_j + \sum_{j=1}^{J} \sum_{i=1}^{N} I(\Delta_i = j) \log \phi_j(y_i) - \lambda \Big(\sum_{j=1}^{J} \pi_j - 1\Big)$$

Partially differentiating the latter expression with respect to π_j gives N_j / π_j = λ. Hence, by bringing π_j to the right-hand side and summing over all j = 1, ..., J, λ = N. Therefore, the maximum likelihood estimates are:

$$\hat{\pi}_j = \frac{N_j}{N}, \qquad \hat{\mu}_j = \frac{1}{N_j} \sum_{i=1}^{N} I(\delta_i = j)\, y_i, \qquad \hat{\Sigma}_j = \frac{1}{N_j - 1} \sum_{i=1}^{N} I(\delta_i = j)\,(y_i - \hat{\mu}_j)(y_i - \hat{\mu}_j)^\top, \qquad j = 1, \ldots, J \tag{2.8}$$

So, this specifically means that for every component, the mean and variance are estimated using the sample mean and the sample variance. The estimated probability of belonging to a certain component is simply the proportion of samples belonging to that component. For the resulting estimates of the iris data set, consider figure 2.2.
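Since the iris measurements of figure 2.1 ship with R, the closed-form estimates of equation 2.8 can be reproduced in a few lines, treating the species labels as the known component memberships (a sketch; the thesis code may organize this differently).

```r
# Closed-form estimates of equation 2.8 on the iris data of figure 2.1, with
# the species labels taken as the known component memberships.
Y <- as.matrix(iris[, c("Sepal.Length", "Petal.Length")])
delta <- iris$Species

fit <- lapply(levels(delta), function(j) {
  Yj <- Y[delta == j, , drop = FALSE]
  list(pi = nrow(Yj) / nrow(Y),   # mixing weight N_j / N
       mu = colMeans(Yj),         # component mean
       Sigma = cov(Yj))           # component covariance (divisor N_j - 1)
})
names(fit) <- levels(delta)
fit$setosa$mu                      # estimated mean of the setosa component
```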

Means as Mixture-Specific Linear Combinations of Covariables

Instead of estimating only the mean of each of the J components of the mixture model, the interest is now the linear modeling of the mean µ_j(x) of a response variable y ∈ R by a set of covariables, for each j = 1, ..., J. Again, assume there is a mixture of J components of Gaussian data and a set of variables y = (y_1, ..., y_N)^⊤ with covariables X = (x_1, ..., x_N)^⊤, i.e. the samples {(y_1, x_1), ..., (y_N, x_N)} stem from

$$Y \mid x = I(\Delta = 1)\,(Y_1 \mid x) + \cdots + I(\Delta = J)\,(Y_J \mid x)$$

where the variables satisfy

• Y_j | x ∼ N(µ_j(x), σ_j²) for j = 1, ..., J.

• ∆ ∼ U(π_1, ..., π_J) indicates the component from which the distribution is obtained.

For this particular kind of problem, the interest is modeling the means of Y_j by linear regression. Let φ_j(y | x) denote the univariate Gaussian density with mean µ_j(x) = x^⊤β_j and variance σ_j². The probability density function is given by

$$g_{Y \mid x}(y) = \sum_{j=1}^{J} \pi_j \phi_j(y \mid x)$$

Again, suppose the data given in figure 2.1 is modelled by maximum likelihood; then J = 3 and the parameters of interest are θ_j = (π_j, β_j, σ_j²) for every j = 1, ..., J. Suppose there are N samples, where the sample size per component is given by N_j. Let Z = {(y_1, x_1), ..., (y_N, x_N)} denote the data of the measured variable and let Z_m = (∆_1, ..., ∆_N) denote the data indicating to which component each measurement belongs. The total data is denoted as T = {Z, Z_m}. The log-likelihood is then given by

$$\ell(\theta \mid T) = \sum_{i=1}^{N} \sum_{j=1}^{J} I(\Delta_i = j) \log\big(\pi_j \phi_j(y_i \mid x_i)\big) = \sum_{j=1}^{J} N_j \log \pi_j + \sum_{j=1}^{J} \sum_{i=1}^{N} I(\Delta_i = j) \log \phi_j(y_i \mid x_i)$$

Let y_j be the vector of response variables that belong to component j, and let X_j be the corresponding matrix of covariables. Thus y_j = (y_i ∈ y | δ_i = j) ∈ R^{N_j} and X_j = (x_i ∈ X | δ_i = j) ∈ R^{N_j×(p+1)}. In line with the calculations of the previous section, the following maximum likelihood estimates are obtained for the parameters:

$$\hat{\pi}_j = \frac{N_j}{N}, \qquad \hat{\beta}_j = (X_j^\top X_j)^{-1} X_j^\top y_j, \qquad \hat{\sigma}_j^2 = \frac{1}{N_j - p - 1} \sum_{i=1}^{N} I(\delta_i = j)\,\big(y_i - x_i^\top \hat{\beta}_j\big)^2, \qquad j = 1, \ldots, J$$

So again, the maximum likelihood estimates of the model are the results of a linear model applied to the different subsets of the data set categorized by the components. For the estimated regression lines of the iris data set, consider figure 2.2.
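A corresponding R sketch for the mixture-specific regressions behind figure 2.2, again with the species labels taken as known: each component's coefficients and error variance come from an ordinary least squares fit within that component.

```r
# Mixture-specific linear regressions with known labels: within each species,
# Petal.Length is regressed on Sepal.Length, giving (pi_j, beta_j, sigma_j^2).
by(iris, iris$Species, function(d) {
  m <- lm(Petal.Length ~ Sepal.Length, data = d)
  c(pi = nrow(d) / nrow(iris), coef(m), sigma2 = summary(m)$sigma^2)
})
```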

2.3.3 EM-Algorithm

Earlier in this section, data is discussed for which the component from which each sample stems is known. However, there are applications in which it is unknown. For this reason, the expectation-maximization (EM) algorithm is used in order to estimate the probabilities that a certain sample stems from a certain component.

Still, the interest is the linear modeling of the mean µ_j(x) of a response variable y ∈ R by a set of covariables, for each j = 1, ..., J. Again, assume there is a mixture of J components of Gaussian data, with variables y = (y_1, ..., y_N)^⊤ ∈ R^N and covariables X = (x_1, ..., x_N)^⊤ ∈ R^{N×(p+1)}, i.e. the samples {(y_1, x_1), ..., (y_N, x_N)} stem from

$$Y \mid x = I(\Delta = 1)\,(Y_1 \mid x) + \cdots + I(\Delta = J)\,(Y_J \mid x)$$

where the variables satisfy

• Y_j | x ∼ N(µ_j(x), σ_j²) for j = 1, ..., J.

• ∆ ∼ U(π_1, ..., π_J) indicates the component from which the distribution is obtained, with π_j ∈ [0, 1] for all j = 1, ..., J and $\sum_{j=1}^{J} \pi_j = 1$.

Figure 2.2: The iris data set fitted by a three-component Gaussian mixture model, with the estimated means as crosses, the estimated regression lines and the 80% confidence intervals of the means.

Again, the interest is estimating the means of Y_j as a linear combination of the covariables. Hence, the model is as follows:

$$Y_j \mid x \sim N(x^\top \beta_j, \sigma_j^2), \qquad j = 1, \ldots, J$$

where β_j ∈ R^{p+1} is the (p + 1)-dimensional vector of regression coefficients for j = 1, ..., J. This time, for the response values y = (y_1, ..., y_N)^⊤, the component Y_j of the mixture of distributions to which a sample belongs is unknown. So, latent variables are involved. Let Z = (y, X) denote the observed data. The model is then rewritten as:

$$y_i = \begin{cases} x_i^\top \beta_1 + \varepsilon_{i1} & \text{with probability } \pi_1; \\ x_i^\top \beta_2 + \varepsilon_{i2} & \text{with probability } \pi_2; \\ \qquad \vdots \\ x_i^\top \beta_J + \varepsilon_{iJ} & \text{with probability } \pi_J. \end{cases} \tag{2.9}$$

where ε_ij ∼ N(0, σ_j²), for j = 1, ..., J and i = 1, ..., N, is the random noise. Hence, the likelihood based on the observed data is given by:

$$L(\theta \mid Z) = \prod_{i=1}^{N} \sum_{j=1}^{J} \pi_j \phi_j(y_i \mid x_i)$$

Here, φ_j is now the probability density function of a univariate Gaussian distributed random variable with mean µ_j = x_i^⊤ β_j and variance σ_j². The log-likelihood based on the observed data Z is therefore given by

$$\ell(\theta \mid Z) = \sum_{i=1}^{N} \log\left(\sum_{j=1}^{J} \pi_j \phi_j(y_i \mid x_i)\right) \tag{2.10}$$

Suppose that it were actually known from which distribution each sample y_i arose, i.e. that the value of the variable ∆_i is known. The latent, unobserved variables are given by Z_m = (∆_1, ..., ∆_N) and the complete data is thus given by T = (Z, Z_m). The likelihood based on the complete data is therefore given by

$$L_0(\theta \mid T) = \prod_{i=1}^{N} \prod_{j=1}^{J} \big(\pi_j \phi_j(y_i \mid x_i)\big)^{I(\Delta_i = j)}$$

Hence the log-likelihood based on the complete data (called the complete log-likelihood) is given by

$$\ell_0(\theta \mid T) = \sum_{i=1}^{N} \sum_{j=1}^{J} I(\Delta_i = j)\,\big(\log \pi_j + \log \phi_j(y_i \mid x_i)\big)$$

Expectation step

Since the latent variables are not available in real applications, the maximum likelihood estimates cannot be obtained from the complete log-likelihood directly. Instead, the conditional expectation of the complete log-likelihood is calculated, based only on the observed data and a current estimate of the parameters θ̂^(r). The conditional expectation of the complete log-likelihood is given by:

$$Q(\theta \mid \hat{\theta}^{(r)}) = E_{\hat{\theta}^{(r)}}\big[\ell_0(\theta \mid T) \mid Z\big] = \sum_{i=1}^{N} \sum_{j=1}^{J} E_{\hat{\theta}^{(r)}}\big[I(\Delta_i = j) \mid Z\big]\,\big(\log \pi_j + \log \phi_j(y_i \mid x_i)\big) = \sum_{i=1}^{N} \sum_{j=1}^{J} \hat{\gamma}_{ij}^{(r)} \log \pi_j + \sum_{i=1}^{N} \sum_{j=1}^{J} \hat{\gamma}_{ij}^{(r)} \log \phi_j(y_i \mid x_i)$$

Here, the parameter γ_ij is called the responsibility, which intuitively is the probability of observation i belonging to component j. The estimated responsibility γ̂_ij^(r) of component j = 1, ..., J for observation i = 1, ..., N is given by:

$$\hat{\gamma}_{ij}^{(r)}(y_i \mid x_i) := E_{\hat{\theta}^{(r)}}\big[I(\Delta_i = j) \mid Z\big] = P_{\hat{\theta}^{(r)}}\big(I(\Delta_i = j) = 1 \mid Z\big) \overset{\text{i.i.d.}}{=} P_{\hat{\theta}^{(r)}}\big(I(\Delta_i = j) = 1 \mid Y_i = y_i\big) = \frac{\hat{\pi}_j^{(r)} \phi_j(y_i \mid x_i)}{\sum_{j'=1}^{J} \hat{\pi}_{j'}^{(r)} \phi_{j'}(y_i \mid x_i)}$$
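In code, the E-step amounts to a single N × J matrix of responsibilities. The R sketch below reuses the illustrative parameter layout (pi, beta, sigma2) assumed in the log-likelihood sketch above.

```r
# E-step sketch: responsibilities gamma_ij^(r) for a mixture of Gaussian
# regressions, following the formula above.
e_step <- function(theta, y, X) {
  Xi <- cbind(1, X)                                   # add the intercept column
  w <- sapply(seq_along(theta$pi), function(j)
    theta$pi[j] * dnorm(y, drop(Xi %*% theta$beta[j, ]), sqrt(theta$sigma2[j])))
  w / rowSums(w)                                      # rows sum to one: N x J matrix
}
```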

Maximization step

For obtaining maximum likelihood estimates for θ = (π_1, ..., π_J, β_1, ..., β_J, σ_1², ..., σ_J²), maximization of the expected log-likelihood of the complete model indirectly maximizes the log-likelihood of the incomplete model (as will be proven later in this thesis in theorem 3). To maximize the expected complete log-likelihood Q(θ | θ̂^(r)), take the partial derivatives with respect to θ and set them equal to zero. Here, the additional constraint is that the values of π_j sum to one. Therefore, the problem is a constrained optimization problem:

$$\arg\max_{\theta} \; Q(\theta \mid \hat{\theta}^{(r)}) \quad \text{subject to} \quad \sum_{j=1}^{J} \pi_j = 1 \tag{2.11}$$

To solve this problem, the method of Lagrange multipliers is used. This brings us the Lagrangian:

$$\Lambda(\theta, \lambda \mid y, X) = \sum_{i=1}^{N} \sum_{j=1}^{J} \hat{\gamma}_{ij}^{(r)} \log \pi_j + \sum_{i=1}^{N} \sum_{j=1}^{J} \hat{\gamma}_{ij}^{(r)} \log \phi_j(y_i \mid x_i) - \lambda \left(\sum_{j=1}^{J} \pi_j - 1\right)$$

For finding the maximum of the Lagrangian, set all partial derivatives equal to zero, i.e.

$$\frac{\partial}{\partial \lambda} \Lambda(\theta, \lambda \mid y, X) = 1 - \sum_{j=1}^{J} \pi_j \tag{2.12}$$

$$\frac{\partial}{\partial \pi_j} \Lambda(\theta, \lambda \mid y, X) = \sum_{i=1}^{N} \frac{\hat{\gamma}_{ij}^{(r)}}{\pi_j} - \lambda \tag{2.13}$$

$$\frac{\partial}{\partial \beta_j} \Lambda(\theta, \lambda \mid y, X) = \sum_{i=1}^{N} \hat{\gamma}_{ij}^{(r)}\, \frac{1}{\sigma_j^2}\big(y_i - x_i^\top \beta_j\big)\, x_i \tag{2.14}$$

$$\frac{\partial}{\partial \sigma_j^2} \Lambda(\theta, \lambda \mid y, X) = \sum_{i=1}^{N} \hat{\gamma}_{ij}^{(r)} \left[ -\frac{1}{2\sigma_j^2} + \frac{1}{2\sigma_j^4}\big(y_i - x_i^\top \beta_j\big)^2 \right] \tag{2.15}$$

By equation 2.13, it follows that $\lambda \pi_j = \sum_{i=1}^{N} \hat{\gamma}_{ij}^{(r)}$. By summing over all j = 1, ..., J and using that $\sum_{j=1}^{J} \sum_{i=1}^{N} \hat{\gamma}_{ij}^{(r)} = N$, it follows that

$$\lambda = N$$
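Although the remaining M-step algebra is not reproduced here, the resulting updates are the usual ones: π_j becomes the average responsibility, while β_j and σ_j² follow from responsibility-weighted least squares, obtained by setting equations 2.13-2.15 to zero. A compact, self-contained R sketch of the full EM cycle for the mixture of Gaussian regressions is given below; the random initialization, fixed number of iterations and simulated data are illustrative, and in practice the convergence and initialization issues discussed in this chapter have to be addressed.

```r
# Sketch of a full EM cycle for a J-component mixture of Gaussian regressions.
# The parameter layout, starting values and simulated data are illustrative.
em_mixreg <- function(y, X, J = 2, iter = 100) {
  N <- length(y); Xi <- cbind(1, X); p1 <- ncol(Xi)
  theta <- list(pi = rep(1 / J, J),                       # crude random start
                beta = matrix(rnorm(J * p1, sd = 0.1), J, p1),
                sigma2 = rep(var(y), J))
  for (r in seq_len(iter)) {
    # E-step: responsibilities gamma_ij (N x J matrix)
    g <- sapply(seq_len(J), function(j)
      theta$pi[j] * dnorm(y, drop(Xi %*% theta$beta[j, ]), sqrt(theta$sigma2[j])))
    g <- g / rowSums(g)
    # M-step: pi_j from lambda = N; beta_j and sigma_j^2 from weighted least squares
    theta$pi <- colSums(g) / N
    for (j in seq_len(J)) {
      w <- g[, j]
      theta$beta[j, ] <- lm.wfit(Xi, y, w)$coefficients
      res <- y - drop(Xi %*% theta$beta[j, ])
      theta$sigma2[j] <- sum(w * res^2) / sum(w)
    }
  }
  theta
}

set.seed(7)
X <- matrix(runif(200), ncol = 1)
y <- ifelse(runif(200) < 0.5, 1 + 3 * X[, 1], 4 - 2 * X[, 1]) + rnorm(200, sd = 0.2)
em_mixreg(y, X, J = 2)       # two regression lines with distinct slopes
```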
