
Graphical Models with Latent Variables

S. Jonker BSc

Master’s thesis in Mathematics

June 2014


Abstract

This paper concerns graphical models for which not all variables have been measured.

The influence of such latent variables will be discussed, as well as two methods (factor analysis and graphical latent model selection) that deal with such models.


Contents

1 Introduction
  1.1 Conditional Independence
  1.2 Graph Theory
  1.3 Kullback-Leibler Information Divergence

2 A Discrete Model
  2.1 Log-Linear Models
  2.2 Estimation
    2.2.1 With Known Graph Structure
    2.2.2 Estimating the Graph Structure
  2.3 An Applied Example

3 A Continuous Model
  3.1 Gaussian Graphical Models
  3.2 Estimation
    3.2.1 With Known Graph Structure
    3.2.2 Estimating the Graph Structure
  3.3 Test on Simulated Data
  3.4 An Applied Example

4 Estimating the Graph Structure with Latent Variables
  4.1 Factor Analysis
  4.2 Graphical Latent Model Selection
  4.3 Proof of Chandrasekaran et al.
  4.4 Test on Simulated Data
  4.5 An Applied Example

5 The Structure of the Latent Variables
  5.1 No Structure Between the Latent Variables
  5.2 A Better Approach
  5.3 Test on Simulated Data
  5.4 An Applied Example

6 Data Analysis
  6.1 The Dataset
  6.2 General Gaussian Graphical Model Selection
  6.3 Factor Analysis
  6.4 Latent Graphical Model Selection
    6.4.1 Non-Latent Part
    6.4.2 Latent Part
  6.5 Comparison and Conclusion

7 Summary

8 Acknowledgements

Bibliography

A Optimization Code


Chapter 1

Introduction

This chapter will give an introduction to conditional independence models. The most relevant theory and importance of such models will be explained, as well as their relation to graphs. It turns out that variables may seem to be directly related to each other while this dependency in fact only exists through some other (set of) variable(s). We will see a clear example of this in the first section of this chapter, using an example of Simpson's paradox.

After understanding what we mean by graphical conditional independence models, we will sum up some definitions and properties known from graph theory. They will help us estimate and understand the models of the following chapters.

Finally we will take a first look at a test that compares different estimated conditional independence graphs. A formula for the 'distance' between having conditional independence or dependence between two sets of variables will be given. This formula in fact tells us how much information we lose when we pretend that some set of edges is not included in the corresponding graph. The more information we lose, the more evidence we have for a set of edges to be in the model.

1.1 Conditional Independence

First look at an example of Simpson's paradox borrowed from Edwards [2]. He constructed fictitious data to study possible sexual discrimination in patterns of university admissions. When we look at table 1.1, we find some discrimination against female applicants: there seems to be a preference for admitting males (62.5%) over females (37.5%).


Sex       # applications   # admitted   % admitted
Male            400            250          62.5
Female          400            150          37.5

Table 1.1: Fictitious Data of Males and Females Being Admitted to a University.

But we do not see this discrimination by gender when the data are split by university department. The cause of more males being admitted lies in the fact that more males apply to departments that accept students more easily, see table 1.2.

Department   Sex      # applications   # admitted   % admitted
I            Male           100             25           25
             Female         300             75           25
II           Male           300            225           75
             Female         100             75           75

Table 1.2: Fictitious Data of Admissions to a University Broken Down by Department.

We see that Sex (S) and Department (D) are marginally associated, as are Department and Admission (A). On the other hand, we see that Sex and Admission are independent given Department. We can visualize this graphically as follows:

Figure 1.1: Graph of College Admissions.

The lesson to be learned here is that when we only rely on pairwise marginal associations, we might end up with the wrong conclusions. At first sight, we thought that the university discriminated against females, but this turned out to be the wrong conclusion.

So it is necessary to also look at conditional associations between variables, as well as their marginal associations.


Conditional Independence in Graphical Models

We could express the graph of figure 1.1 as S ⊥ A | D, which should be read as 'Sex is independent of Admission given Department'. We would like to relate this expression to a probability distribution. Recall the definition of conditional probability for positive distributions:

P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)   (for discrete variables),
f_{Y|X}(y | x) = f_{X,Y}(x, y) / f_X(x)         (for continuous variables),

where f_X(x) denotes the marginal density of X. From now on all properties will be stated for continuous variables, but they can easily be rewritten for discrete variables in the same manner.

The conditional probability function factorizes when two (sets of) variables are independent given another (set of) variable(s). For example, if X ⊥ Y | Z, then we can factorize the probability function conditioned on Z as follows:

f_{X,Y|Z}(x, y | z) = f_{X|Z}(x | z) · f_{Y|Z}(y | z)

We can imagine the following theorem (which will not be proved here) to be true when we combine the last two equations:

Theorem 1. The Factorization Theorem

The sets X and Y are independent given the set Z if and only if their probability function can be factorized accordingly:

X ⊥ Y | Z  ⟺  there exist functions h and g such that, for all x, y and all z with f_Z(z) > 0, the probability function can be written as

f_{X,Y,Z}(x, y, z) = h(x, z) · g(y, z).

So, basically, we can express a graphical model in two equivalent ways: via the factorization of a probability function or as a graph. We will discuss the latter representation in the next section. Probability functions will be used in the last section of this chapter and will be discussed in more detail in the next chapters.

1.2 Graph Theory

Before we continue with graphical models, we have to introduce some concepts of graph theory. We begin with three definitions:

Definition 1. A graph G = (V, E) consists of a set of vertices V and a set of edges E ⊂ V × V.


Definition 2. An edge is said to be undirected if both ordered edges are in E: if (i, j) ∈ E, then (j, i) ∈ E.

Definition 3. A graph is said to be undirected if all edges in E are undirected.

Since we talk about graphs in which some vertices are connected and some are not, we also have to define some concepts of vertices which are connected in some way.

Definition 4. Two vertices a and b are adjacent if (a, b) ∈ E and (b, a) ∈ E.

Example: In figure 1.2 a and d are adjacent, but a and b are not.

Definition 5. The sequence (a_1, a_2, . . . , a_j) is a path if (a_k, a_{k+1}) ∈ E for k = 1, . . . , j − 1 and all a_l are distinct.

Example: In figure 1.2 (a, e, b, d) is a path, but (a, e, d, b) is not since e and d are not adjacent.

Definition 6. The sequence (a_1, a_2, . . . , a_j) is a cycle if (a_1, a_2, . . . , a_{j−1}) is a path, (a_{j−1}, a_j) ∈ E and a_1 = a_j.

Example: In figure 1.2 (a, e, b, d, a) is a cycle.

Definition 7. The set of vertices A separates vertices a and b if all paths between a and b contain at least one element of A and a, b /∈ A.

Example: In figure 1.2 the set {e, d} separates a and b.

Definition 8. A graph is complete if all vertices are adjacent.

Example: The graph of figure 1.2 would be complete if the vertices a and e were not part of the graph.

Definition 9. A clique is a maximal complete subgraph.

Example: In figure 1.2 the subgraph on the vertices b, c and d is a clique.

From now on we will only talk about undirected graphs. By doing so, we can easily connect our knowledge of conditional independence to the formal definition of a graph:

Definition 10. Two variables X_a and X_b are conditionally independent given all other variables if and only if (a, b) ∉ E and (b, a) ∉ E.

These conditional independencies are formalized in three equivalent Markov properties:

Definition 11. Having no edge between two vertices a and b in the corresponding graphical model corresponds to the following properties:

Pairwise Markov property:  X_a ⊥ X_b | rest
Local Markov property:     X_a ⊥ X_b | N(X_a)
Global Markov property:    X_a ⊥ X_b | S(X_a, X_b)

Where 'rest' stands for all variables except X_a and X_b, N(X_a) represents the neighbourhood of X_a (the set of all vertices that are adjacent to X_a), and S(X_a, X_b) is the (minimal) separating set of X_a and X_b.

These properties might be best understood using an example. Looking at figure 1.2 we clearly see that there is no edge between a and b. Recalling the example of Simpson's paradox, we know that this means that X_a and X_b are independent of each other when some other variables are known. We shall investigate these independencies using the three different Markov properties.

First, the pairwise property: X_a ⊥ X_b | X_c, X_d, X_e. This property clearly conditions on too many variables. The local one already leaves out some variables we do not need: X_a ⊥ X_b | X_d, X_e. But since all conditional independencies are symmetric, we could also write X_b ⊥ X_a | X_c, X_d, X_e, which is the same statement as the pairwise Markov property. Last we have the global one: X_a ⊥ X_b | X_d, X_e. This clearly is the smallest expression possible.

Figure 1.2: A Graphical Model for Variable X = (Xa, Xb, Xc, Xd, Xe).

There are several ways to prove the equivalence of these three properties. One way to do this, is by proving that the global property implies the local (G =⇒ L), that the local implies the pairwise property (L =⇒ P) and finally that the pairwise Markov implies the global (P =⇒ G) Markov property. To do this, we will follow the proof of Whittaker [11].

Proof.

(G =⇒ L) The global Markov property implies the local one: the neighbourhood N(X_a) is itself a separating set for a and any vertex b that is not adjacent to a.

(L =⇒ P) Suppose the vertices are numbered from 1 to k (denoted by the set K), corresponding to variables X_1 to X_k. Define for every vertex a its neighbourhood set N_a = N(X_a). In this case

X_a ⊥ X_{B_a} | X_{N_a},   where B_a = K \ ({a} ∪ N_a).

Select any vertex b in B_a. This vertex clearly is not adjacent to a, and the expression above can be rewritten as

X_a ⊥ {X_b, X_C} | X_{N_a},   where C = B_a \ {b}.

This statement can be rewritten as (will not be proved here)

X_a ⊥ X_b | {X_{N_a}, X_C}   and   X_a ⊥ X_C | {X_{N_a}, X_b}.

Since N_a ∪ C = K \ {a, b} = rest, the first statement is exactly the condition of the pairwise Markov property.

(P =⇒ G) This part follows from the so-called separation theorem, which states that when a set of vertices a is separated from a disjoint set b by a third disjoint set c (the separation set), then

X_a ⊥ X_b | X_c.

This is all we will say about graph theory for now. The next section will provide a way to calculate the strength of conditional independencies. This gives a first handle in graphical model selection.

1.3 Kullback-Leibler Information Divergence

It is important to have some measure of the strength of all possible edges in a graph when one wants to estimate a graphical model. One way to measure these 'strengths' is using the Kullback-Leibler Information Divergence (KL Divergence), which measures the 'distance' between two densities in the following way:

Definition 12. Kullback-Leibler Information Divergence

Let f and g be two densities. Then the KL Divergence is defined as follows:

I(f; g) = E_f[ log( f(x) / g(x) ) ],

where E_f denotes expectation with respect to the density f.

Example: Suppose that f ∼ Bern(p) and g ∼ Bern(q). Then we can calculate the KL divergence as follows:

I(f; g) = E_f[ log( p^x (1 − p)^{1−x} / ( q^x (1 − q)^{1−x} ) ) ]
        = E_f[ log( (p/q)^x ((1 − p)/(1 − q))^{1−x} ) ]
        = E_f[ x log(p/q) + (1 − x) log((1 − p)/(1 − q)) ]
        = p log(p/q) + (1 − p) log((1 − p)/(1 − q))

As you have probably seen, the divergence is not symmetric in f and g, so it is not a true distance.
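As a quick numerical illustration of the Bernoulli example (not part of the original derivation), the following R sketch evaluates the divergence for the hypothetical values p = 0.3 and q = 0.5 and shows the asymmetry directly:

```r
# KL divergence between two Bernoulli densities, as derived above.
kl_bern <- function(p, q) p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

kl_bern(0.3, 0.5)   # approx 0.0823
kl_bern(0.5, 0.3)   # approx 0.0872: different value, so I(f; g) != I(g; f)
```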

We will use this measure to calculate two kinds of distances: a) the distance between complete dependency and independence between two (groups of) variables, and b) the distance between complete dependency and conditional independence between two (groups of) variables.

Let's start with the first one. When two variables are completely independent, we can write their density function f_{XY} simply as f_X f_Y. So the distance between complete dependency and independence of X and Y can be calculated as:

Inf(X ⊥ Y) = I(f_{XY}; f_X f_Y).

This formula gives the information you lose when you pretend that X and Y are completely independent. You do not lose information when these variables really are independent, and you end up with a distance of zero, as expected.

Looking at graphical models, we are more interested in conditional independencies.

When two variables X and Y are conditionally independent given a third variable Z, their density function f_{XYZ} can simply be written as f_Z f_{X|Z} f_{Y|Z}. This gives us the following definition of the information of a conditional independency:

Inf(X ⊥ Y | Z) = I(f_{XYZ}; f_Z f_{X|Z} f_{Y|Z}).

We will be using this formula to calculate the strength of an edge compared to a full graph. The next chapter will show how this can be done for discrete data sets. The chapter thereafter will apply this distance to continuous models.


Chapter 2

A Discrete Model

This chapter will apply the topics of the last chapter to discrete multivariate models. We will be looking at contingency tables to find graphs corresponding to conditional independencies. First we will explain why these models are called log-linear models. Then we will see how these logarithms can be used to estimate a model. In the end we will apply this theory to a real dataset.

2.1 Log-Linear Models

Consider a random vector x = (x1, x2, x3) where each xi ∈ {0, 1}. There are 8 different values of x, so we can write its density function as:

P(x) = p_{111}^{x_1 x_2 x_3} · p_{110}^{x_1 x_2 (1−x_3)} · p_{101}^{x_1 (1−x_2) x_3} · p_{011}^{(1−x_1) x_2 x_3} · p_{100}^{x_1 (1−x_2)(1−x_3)} · · · p_{000}^{(1−x_1)(1−x_2)(1−x_3)},

where p_{ijk} is the probability of obtaining the value x = (i, j, k). We can rewrite this as a log expansion:

log P(x) = u_0 + u_1 x_1 + u_2 x_2 + u_3 x_3 + u_{12} x_1 x_2 + u_{13} x_1 x_3 + u_{23} x_2 x_3 + u_{123} x_1 x_2 x_3,

with:

log p_{000} = u_0
log p_{100} = u_0 + u_1
log p_{010} = u_0 + u_2
log p_{001} = u_0 + u_3
log p_{110} = u_0 + u_1 + u_2 + u_{12}
log p_{101} = u_0 + u_1 + u_3 + u_{13}
log p_{011} = u_0 + u_2 + u_3 + u_{23}
log p_{111} = u_0 + u_1 + u_2 + u_3 + u_{12} + u_{13} + u_{23} + u_{123}


Of course, this can be generalized to any x = (x_1, . . . , x_k) with each x_i ∈ {1, . . . , r_i}:

log P(x) = Σ_{A ⊂ K} u_A(x_A).

This new representation with so-called u-terms turns out to be very useful in graphical modelling. Remember that when two groups of variables x and y are independent given a third group of variables z, we should be able to write their density function as P(x, y, z) = g(x, z) · h(y, z) (Theorem 1), or using logarithms: log P(x, y, z) = log g(x, z) + log h(y, z). Looking back at the u-term representation above, we can conclude that two variables a and b are conditionally independent if and only if all u-terms involving both a and b are zero. This is formalized in the following proposition:

Proposition 1. X_A ⊥ X_B | X_C if and only if for all j ∈ A and all k ∈ B: u_D = 0 whenever j, k ∈ D.

Example: For x ∈ {0, 1}³: X_1 ⊥ X_2 | X_3 iff u_{12} = u_{123} = 0.
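The relations between the cell probabilities and the u-terms can also be inverted directly. The following R sketch, with purely hypothetical cell probabilities, recovers u_0, u_1, u_2 and u_12 from the equations above; u_12 (and u_123) being zero is exactly the kind of condition Proposition 1 refers to:

```r
# Hypothetical cell probabilities for three binary variables (x1, x2, x3).
p <- c(p000 = 0.20, p100 = 0.10, p010 = 0.15, p001 = 0.05,
       p110 = 0.20, p101 = 0.05, p011 = 0.10, p111 = 0.15)
lp  <- log(p)
u0  <- lp["p000"]
u1  <- lp["p100"] - u0
u2  <- lp["p010"] - u0
u12 <- lp["p110"] - u0 - u1 - u2   # zero iff there is no two-way interaction between x1 and x2
```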

Now we have found a nice way to connect graphical models to log density functions:

all edges in a graph should correspond to the non-zero u-terms in the corresponding log density function. A problem arises when we for instance compare the following two expressions:

log P(x) = u_0 + u_1 + u_2 + u_3 + u_{12} + u_{13} + u_{23} + u_{123}
log P(x) = u_0 + u_1 + u_2 + u_3 + u_{12} + u_{13} + u_{23}

Both expressions imply the same conditional independence relations (a full graph). They both imply a model, but we say that only the first one is graphical. So, to be graphical, all u-terms should correspond to the conditional independence relations implied by the graph G. This is equivalent to the following definition:

Definition 13. Graphical Log-Linear Models

A multinomial distribution is a graphical log-linear model if the u-terms satisfy:

1. All maximal u-terms correspond to the cliques of the graph G.

2. If u_A ≠ 0, then u_B ≠ 0 for all B ⊂ A.

2.2 Estimation

The theory of estimating a graphical log-linear model will be explained for two different problems: estimating the log-linear expression when the corresponding graph is known, and estimating the graph G itself.


2.2.1 With Known Graph Structure

A widely used method to find the best density function is maximizing the likelihood.

For our log-linear models this will be done as follows:

ℓ(P(x), x_1, . . . , x_N) = log Π_{i=1}^N P(x_i)
                         = Σ_{i=1}^N log P(x_i)
                         = Σ_x n(x) log P(x) = ℓ(P(x), n(x))

Here the x_i are the N observations, n(x) is the number of observations equal to x, and the last sum is taken over all possible values of x. This expression can be rewritten as:

ℓ(P(x), n(x)) = Σ_x n(x) log P(x)
              = N Σ_x (n(x)/N) log P(x) + Σ_x n(x) log(n(x)/N) − N Σ_x (n(x)/N) log(n(x)/N)
              = ℓ( n(x)/N, n(x) ) − N · Σ_x (n(x)/N) log( (n(x)/N) / P(x) )
              = ℓ( n(x)/N, n(x) ) − N · I( n(x)/N ; P(x) )

Clearly, maximizing the likelihood corresponds to minimizing the KL divergence. There are no constraints on P(x) when we consider a full graph; this results in the estimate P̂(x) = n(x)/N. It becomes slightly more difficult when we consider a non-full graph. Proposition 7.6.1 of [11] states that the MLE in this situation satisfies

P̂_A(x_A) = n_A(x_A)/N   for all cliques A of the graph.   (2.1)

This holds since the distribution within a clique is left unconstrained. The same solution holds for every B ⊂ A when A is a clique. Combining all cliques gives the following MLE corresponding to the graph G:

P̂(x) = Π_{A clique of G} P̂_A(x_A) / Π_{S separator of G} P̂_S(x_S),

where the separators S are the overlapping sets of the cliques.
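To make equation 2.1 concrete, the following R sketch computes the clique-factorized MLE for the chain graph 1 - 2 - 3 (cliques {1, 2} and {2, 3}, separator {2}) on hypothetical binary data; the data frame d and its column names are assumptions for illustration only:

```r
# Clique-factorized MLE for the chain graph x1 - x2 - x3 on simulated 0/1 data.
set.seed(1)
d <- data.frame(x1 = rbinom(100, 1, 0.5), x2 = rbinom(100, 1, 0.5), x3 = rbinom(100, 1, 0.5))
N <- nrow(d)
p12 <- table(d$x1, d$x2) / N   # clique {1, 2} marginal
p23 <- table(d$x2, d$x3) / N   # clique {2, 3} marginal
p2  <- table(d$x2) / N         # separator {2} marginal

# Estimate of P(x1 = a, x2 = b, x3 = c) under the chain graph (equation 2.1):
phat <- function(a, b, c) p12[a + 1, b + 1] * p23[b + 1, c + 1] / p2[b + 1]
phat(0, 1, 0)
```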


2.2.2 Estimating the Graph Structure

The most interesting part of course is finding the conditional independence graph itself.

There are several ways to do this, but one is the following: begin with a full graph and test which edges should be deleted, then use this new graph to test whether you might have deleted too many edges. We begin with the first part: testing which variables are conditionally independent, starting from full dependence. We do this using the so-called deviance, which is twice the difference between the log likelihood of the full model and that of the model M we test:

Definition 14. Deviance

The deviance of a model M is given as:

Dev(M) = 2 [ ℓ( n(x)/N, n(x) ) − ℓ( P̂_M(x), n(x) ) ]
       = 2 [ ℓ( n(x)/N, n(x) ) − ℓ( n(x)/N, n(x) ) + N · I( n(x)/N , P̂_M(x) ) ]
       = 2N Σ_x (n(x)/N) log( (n(x)/N) / P̂_M(x) )
       = 2 Σ_x n(x) log( (n(x)/N) / P̂_M(x) )

To test whether an edge should be excluded from the graph, we use the following theorem:

Theorem 2.

Dev(M) | M is true ∼ χ²_df   as N → ∞,

where df is the difference in degrees of freedom between M and the saturated (full) model; this corresponds to the number of u-terms set to zero in M.

This theorem can be used to test any model against the full model. When we exclude only one edge, we can easily specialize the results found so far: to test whether the vertices a and b are conditionally independent, we simply calculate the deviance of the model from which only the edge (a, b) is missing and test it against the χ²₁ distribution. We conclude that they are independent when the deviance is lower than the critical value of the chi-squared distribution on one degree of freedom at a certain significance level α. The deviance is calculated using:

P̂^M(x) = P̂^M_{K\{b}}(x) · P̂^M_{K\{a}}(x) / P̂^M_{K\{a,b}}(x)
        = ( n_{K\{b}}(x)/N · n_{K\{a}}(x)/N ) / ( n_{K\{a,b}}(x)/N )
        = n_{K\{b}}(x) · n_{K\{a}}(x) / ( n_{K\{a,b}}(x) · N )


This results in the following expression for the edge exclusion deviance:

Dev(M) = 2 Σ_x n(x) log( (n(x)/N) / P̂^M(x) )
       = 2 Σ_x n(x) log( n(x) · n_{K\{a,b}}(x) / ( n_{K\{b}}(x) · n_{K\{a}}(x) ) )
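In practice such an edge exclusion test can be carried out with the base-R function loglin, which returns the likelihood-ratio statistic (the deviance above) of a log-linear model against the saturated model; the 2x2x2 table below is hypothetical and only meant as a sketch:

```r
# Dropping edge (1, 2) from the full graph on three variables leaves the cliques
# K\{2} = {1, 3} and K\{1} = {2, 3}; loglin() fits the model with exactly these margins.
tab <- array(c(20, 12, 15, 9, 18, 11, 16, 10), dim = c(2, 2, 2))   # hypothetical counts
fit <- loglin(tab, margin = list(c(1, 3), c(2, 3)), print = FALSE)
fit$lrt                              # the edge exclusion deviance
fit$lrt > qchisq(0.90, fit$df)       # TRUE would suggest keeping the edge at the 10% level
```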

Once all insignificant edges are deleted, we need to test whether some edges should have stayed in the graph. This will be done by testing the new model (M1) against the same model with one edge added (M2). Their deviance difference is equal to:

2 [ ℓ( P̂_{M_2}(x), n(x) ) − ℓ( P̂_{M_1}(x), n(x) ) ]
 = 2 [ ℓ( n(x)/N, n(x) ) − ℓ( P̂_{M_1}(x), n(x) ) ] − 2 [ ℓ( n(x)/N, n(x) ) − ℓ( P̂_{M_2}(x), n(x) ) ]
 = Dev(M_1) − Dev(M_2)

For N → ∞ we have Dev(M_1) − Dev(M_2) | M_1 is true ∼ χ²_{df_{M_1} − df_{M_2}}. So we do not add the edge being tested if the deviance difference is greater than the critical value of the chi-squared distribution with degrees of freedom equal to the difference between the degrees of freedom of the models M_1 and M_2, at significance level α.

Now we know how to estimate a conditional independence graph for a discrete dataset and how to calculate the corresponding log-linear expansion. The next section gives a fully worked out example.

2.3 An Applied Example

The theory of discrete graphical models will be applied to a modified dataset with 601 observations borrowed from Fair [14]. In his paper he estimated a Tobit model for time spent on a leisure activity, in this case extramarital affairs. We adjusted the data such that all five variables are either zero or one. The following variables are investigated:

1. Sex (0: female, 1: male)

2. Years Married (0: < 8 years, 1: > 8 years)

3. Children (0: no children, 1: children)

4. Self-rating of Marriage (0: not happy, 1: happy)

5. Affairs (0: did not have any affair, 1: has had at least one affair)


The following shows the contingency table of this dataset:

                           Years Married = 0                 Years Married = 1
                       Children = 0   Children = 1       Children = 0   Children = 1
                       Self-rating    Self-rating        Self-rating    Self-rating
                         0     1        0     1            0     1        0     1
Sex 0   Affairs 0       11    72       16    45            0     4       28    67
        Affairs 1        2     8       13     8            2     0       20    19
Sex 1   Affairs 0       13    38       14    51            1     5       24    62
        Affairs 1        4     8        8    16            2     1       17    22

Table 2.1: Contingency Table of Extramarital Affairs

As a first step in estimating a graphical model for this dataset, we calculate the edge exclusion deviances. The R package 'ggm' helps us to calculate these deviances using the command 'fitConGraph'. Only the deviances corresponding to the edges (Years Married, Children), (Children, Affairs) and (Self-rating, Affairs) are higher than the critical value of 2.706 of the χ²₁ distribution at the 10% significance level, see table 2.2.

This results in the graph of figure 2.1a.
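The exact calls used in the thesis are not reproduced here, but a hedged sketch of the 'fitConGraph' approach looks as follows; the variable names, the covariance matrix S of the 0/1 data and the dropped edge are assumptions for illustration:

```r
# Sketch: edge exclusion deviance for one edge via the 'ggm' package.
library(ggm)
vars <- c("Sex", "YearsMarried", "Children", "Selfrating", "Affairs")
amat <- matrix(1, 5, 5, dimnames = list(vars, vars))
diag(amat) <- 0                                              # full graph
amat["Children", "Affairs"] <- amat["Affairs", "Children"] <- 0   # drop the edge to be tested
# With S the 5x5 sample covariance of the binary variables and N = 601:
# fit <- fitConGraph(amat, S, 601)   # fit$dev is the deviance, fit$df its degrees of freedom
```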

                  Male     Years Married   Children   Self-rating
Years Married    0.0703          *             *           *
Children         2.3494       138.41           *           *
Self-rating      0.2047         1.2122        2.5858        *
Affairs          1.2465         1.4081        3.3842       22.321

Table 2.2: First Edge Exclusion Deviances

The next step is to investigate whether we might have excluded too many edges. We do this using the edge inclusion deviances, which resulted in adding four extra edges, see figure 2.1b. After repeatedly adding and deleting edges, we see that two graphs keep coming back. A sparser graph is always preferred, since it is easier to interpret. So we use the graph of figure 2.1e as our final graph.

We see that having children does not affect any of the other variables, while gender seems to be highly connected to all other variables except for children. We also conclude that the self-rating of marriage does not directly influence whether someone has an extramarital affair, but it has an indirect influence via someone's gender and the number of years a person has been married.


Figure 2.1: Estimation Using Deviances. (a) First Edge Exclusion; (b) First Edge Inclusion; (c) Second Edge Exclusion; (d) Second Edge Inclusion; (e) Third Edge Exclusion.

We see that the final graph contains three cliques: A = {Affairs, Sex, Years Married}, B = {Sex, Self-rating} and C = {Children} and one overlapping set: D = {Sex}. We use this to find an expression for the probability function corresponding to our graph:

P̂(x_Sex, x_YearsMarried, x_Children, x_Self-rating, x_Affairs) = P̂_A(x_A) · P̂_B(x_B) · P̂_C(x_C) / P̂_D(x_D),

where P̂_S(x_S) is defined as in equation 2.1. Our last task is to estimate the u-terms for the graphical model we found. This is not hard to do, but it asks for quite some patience given the number of calculations we have to perform. We therefore only demonstrate the procedure for the u_0-term, which equals the logarithm of the probability of all variables being zero. This gives:

û_0 = log P̂(0, 0, 0, 0, 0)
    = log[ ( (11+72+16+45)/601 · (11+2+16+13+0+2+28+20)/601 · (11+2+13+4+72+8+38+8+0+2+1+2+4+0+5+1)/601 ) / ( (11+2+72+8+16+13+45+8+0+2+4+0+28+20+67+19)/601 ) ]
    = log[ ( 144/601 · 92/601 · 171/601 ) / ( 315/601 ) ]
    = log 0.0199 = −3.92

All other u-terms are estimated in a similar way, see section 2.1.
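For completeness, the arithmetic above can be checked with one line of R:

```r
# Verifying the value of u0 computed above.
log((144 / 601) * (92 / 601) * (171 / 601) / (315 / 601))   # approx -3.92
```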


Chapter 3

A Continuous Model

This chapter will focus on variables which are continuous, or more precisely, on variables which have a multivariate Gaussian distribution. It turns out to be relatively easy to estimate a graphical model for this kind of variable. The first section will discuss some of the theory of these Gaussian graphical models and the second section will focus on how to estimate a proper graphical model. The third section tests one of the estimation methods on a set of simulated data and the last section will apply the theory to a real dataset.

3.1 Gaussian Graphical Models

The inverse of the covariance matrix, Θ = Σ^{-1}, of a multivariate Gaussian distribution gives information about the dependence between two variables given the rest, and therefore about the edges in the graphical model. Whittaker [11] describes how the partial correlation of two variables given the others is related to the inverse covariance matrix:

Corr(X_i, X_j | rest) = −θ_ij / √(θ_ii θ_jj).

Thus, an element θ_ij being equal to 0 corresponds to having no edge between the i-th and j-th variable. We can make this more intuitive using the Hammersley-Clifford theorem.

This theorem states that the following are equivalent for a positive probability density function f(x_1, . . . , x_k):

• f(x_i | x_{K\{i}}) = f(x_i | N(x_i)) for i = 1, . . . , k. This uses the local Markov property, which we have discussed before.

• f(x) can be factorized according to the cliques of the graph: f(x) = (1/W) Π_{C ∈ cl(G)} h_C(x_C), where W normalizes the function and cl(G) is the set of cliques of the graph G.


We can factorize the Gaussian probability function as follows:

f(x) = C · Π_{i,j} exp( −½ θ_ij (x_i − μ_i)(x_j − μ_j) ) = C · Π_{i,j} h_ij(x_i, x_j),

where C = 1/√( (2π)^k |Σ| ) and where products of several functions h_ij form the functions h_C of the second equivalence statement. Now when θ_ij = 0, the factor h_ij(x_i, x_j) is constant and gives no contribution to f(x), which shows that x_j ∉ N(x_i), using the local Markov property.

In the second equivalence statement you see that the variables x_i and x_j are not part of a common clique when θ_ij = 0, since there will be no term h_C(x_C) with i, j ∈ C. This strengthens the statement that two variables are conditionally independent when their corresponding element of the inverse covariance matrix is zero.

So, we have just added another dimension to estimating the covariance matrix of a dataset. We no longer just want this matrix to fit the data; we also want to say something about the graphical model and therefore about the inverse covariance matrix.

3.2 Estimation

As discussed above, estimating a graphical model involves two main problems: one wants to know the graph structure as well as the values of the covariance matrix. The difference with the usual estimation of the covariance matrix is that some of the entries of Σ^{-1} will be 0. We begin with some situations in which the graph structure is known.

3.2.1 With Known Graph Structure

Assuming we have a full graph, we will run through the usual steps: estimate Σ by maximizing the log likelihood:

ℓ_0(Σ) = Σ_{i=1}^n ℓ_{x_i}(Σ)
       = Σ_{i=1}^n log[ (√(2π))^{-p} |Σ|^{-1/2} exp( −½ (x_i − x̄)^t Σ^{-1} (x_i − x̄) ) ]
       = Σ_{i=1}^n [ −(p/2) log(2π) − ½ log|Σ| − ½ (x_i − x̄)^t Σ^{-1} (x_i − x̄) ]
       = C − (n/2) log|Σ| − ½ Σ_{i=1}^n Tr[ (x_i − x̄)(x_i − x̄)^t Σ^{-1} ].

Eliminating the constant C gives us the following expression to maximize:

ℓ(Σ) = (n/2) log|Σ^{-1}| − (n/2) Tr(SΘ),   where S = (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^t.

It can easily be shown that Σ̂ = S. With probability 1 none of the entries of Θ̂ = Σ̂^{-1} is zero. We therefore have to add constraints to our optimization when the graph is sparser, which corresponds to more elements of Θ being equal to zero. The optimization then looks like:

Maximize ℓ(Σ) subject to θ_ij = 0 for all (i, j) ∉ E.

This is equivalent to maximizing the following Lagrange function:

ℓ_1(Θ) = ℓ(Σ) − Σ_{(j,k) ∉ E} γ_jk θ_jk,

where Γ is a matrix of Lagrange parameters with nonzero values for all entries corresponding to edges not in E. The gradient equation, Θ^{-1} − S − Γ = 0, has a unique solution. An algorithm for solving this equation is given in [5], page 634.

3.2.2 Estimating the Graph Structure

Deviance

A natural way to estimate the structure of an independence graph would be to use deviances as we did in the previous chapter in the case of discrete datasets. We will demonstrate this approach in a very abstract manner.

We have already seen that the log likelihood for a given covariance matrix Σ looks like

C − (n/2) log|Σ| − (n/2) Tr(SΣ^{-1})

and that the unconstrained maximum is attained at Σ̂ = S, so the maximum value of the log likelihood equals C − (n/2) log|S| − np/2. It is a little bit tedious, and not very interesting at the moment, to find an expression for the log likelihood of a model in which more than one edge is excluded. Fortunately, the deviance has a very simple expression when we exclude only one edge at a time, as Whittaker [11] describes on page 169.

The edge exclusion deviance for the edge (a, b) of a full graph equals

−n log( 1 − Corr²(X_a, X_b | X \ {X_a, X_b}) ).

This deviance should follow a chi-squared distribution with one degree of freedom when the edge indeed should be excluded from the model. We only have one degree of freedom, since we constrain one parameter to be equal to zero.
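A minimal R sketch of this single-edge test, assuming the data are stored row-wise in a matrix X (a hypothetical name) and a, b are column indices, reads:

```r
# Edge exclusion deviance -n * log(1 - Corr(Xa, Xb | rest)^2), with the partial
# correlation taken from the inverse of the sample covariance matrix.
edge_dev <- function(X, a, b) {
  n     <- nrow(X)
  Theta <- solve(cov(X))
  pcor  <- -Theta[a, b] / sqrt(Theta[a, a] * Theta[b, b])
  -n * log(1 - pcor^2)
}
# edge_dev(X, 1, 2) > qchisq(0.90, df = 1)  would suggest keeping edge (1, 2) at the 10% level
```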


This method is a nice first approach, but it does lead to some practical problems. We use a significance level in testing the edge exclusion deviance of each edge, but we cannot say anything about the resulting significance level of the whole procedure in which these deviances are tested for all edges separately. This is one of the problems we also did not solve in the previous chapter.

Secondly, we may want to calculate the edge exclusion deviances against the model we have found so far; when we conclude that edge (a, b) should be excluded from the model, we would like to test the exclusion deviance of the next edge using the model without edge (a, b) instead of the full model. The order in which we test edges would then be very influential, so we expect to end up with different outcomes for different orders.

All these problems will be tackled in the next subsection. Instead of testing each edge individually, we will use a method that penalizes having many edges in our model.

The Graphical Lasso

A sparse graph is easier to interpret than a full graph, so when we estimate the graph structure we would like to penalize every nonzero element of Θ. Penalizing the log likelihood with the zero norm (‖Θ‖_0 = Σ_{i,j} 1{θ_ij ≠ 0}) unfortunately leads to a non-convex optimization. Therefore we use a penalty for the log likelihood that looks like the Lagrange function above:

ℓ_2(Θ) = ℓ(Σ) − λ‖Θ‖_1,   (3.1)

where ‖Θ‖_1 is the sum of the absolute values of the entries of Θ. Here λ controls the sparsity of the estimated graph. We obtain a full graph when λ = 0, since this just optimizes the normal log likelihood ℓ_0, and we obtain a sparser graph as λ increases.

From now on we base our results on the optimization in which 3.1 is scaled by 2/n, to keep our results in line with the most common literature. Corollary 1 of [3] then helps us find the smallest λ for which the graph is empty: λ_s = 2 · max_{i,j ∈ {1,...,p}, i≠j} |S_ij|. This gives an interval in which we have to search: λ ∈ [0, λ_s].

We could also try to penalize the log likelihood with a norm that is very close to the L1-norm: ‖Θ‖_{1−ε}. For any ε ∈ (0, 1) we again have to deal with a non-convex optimization. On the other hand, when we optimize using the norm ‖Θ‖_{1+ε} for any ε > 0 we do optimize over a convex space, but this optimization does not lead to a sparse solution for any λ. In conclusion, we use the L1-norm since it both defines a convex optimization and leads to a sparse solution.
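A minimal sketch of this optimization with the R package 'glasso' (also used in section 3.3) is given below; the data matrix and the penalty value are hypothetical, and the exact scaling of the penalty should be taken from the thesis and the package documentation:

```r
# Graphical lasso fit of a sparse concentration matrix, cf. equation 3.1.
library(glasso)
set.seed(2)
X   <- matrix(rnorm(100 * 8), 100, 8)    # hypothetical data: 100 observations, 8 variables
S   <- cov(X)
fit <- glasso(S, rho = 0.1)              # larger rho gives a sparser graph
round(fit$wi, 2)                         # estimated concentration matrix; zeros = missing edges
sum(fit$wi[upper.tri(fit$wi)] != 0)      # number of estimated edges
```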

Meinshausen and Bühlmann [7] show that this optimization, under the right circumstances, indeed leads to the true graphical model:


They first introduce six assumptions which have to be satisfied:

1. The number of variables p is allowed to grow as the number of observations n grows: there exists a γ > 0 such that p(n) = O(n^γ) as n → ∞.

2. For all a ∈ V and n ∈ N: Var(X_a) = 1, and there exists a v² > 0 such that for all n ∈ N and a ∈ V: Var(X_a | X_{V\{a}}) ≥ v². The common variance can always be achieved by an appropriate scaling of the variables.

3. The size of the neighbourhood N(a) of any a ∈ V is restricted in terms of n: there exists some 0 ≤ κ < 1 such that max_{a∈V} |N(a)| = O(n^κ) as n → ∞.

4. The overlap of the neighbourhoods of any two neighbouring nodes is bounded above by an arbitrarily large number: there exists some m < ∞ such that for all n ∈ N: max_{a,b∈V, b∈N(a)} |N(a) ∩ N(b)| ≤ m.

5. The absolute value of the partial correlation π_ab between variables X_a and X_b, which is the correlation after having eliminated the linear effects of all other variables, is bounded from below: there exist a constant δ > 0 and some ξ > κ (with κ as in the third assumption) such that for every edge (a, b) ∈ E: |π_ab| ≥ δ n^{−(1−ξ)/2}.

6. Neighbourhood stability; see Meinshausen and Bühlmann [7].

Under these assumptions the authors prove that the probability of falsely joining two distinct connectivity components can be bounded. A connectivity component C_a ⊆ V of a node a ∈ V is the set of all nodes which are, either directly or via some chain of edges, connected to a.

The probability of falsely connecting two connectivity components is bounded as follows:

P( ∃ a ∈ V : Ĉ_a^λ ⊄ C_a ) ≤ α,

when the regularization parameter λ is chosen as

λ̂ = (2 σ̂_a / √n) · Φ̃^{-1}( α / (2p²) ).


Here Φ̃ = 1 − Φ, where Φ is the cumulative distribution function of N(0, 1), σ̂_a² = ⟨X_a, X_a⟩/n, and α is a value between 0 and 1. We will use this optimization as a basis for the rest of this thesis. In the next chapter we will investigate what has to change when not all relevant variables have been measured, but for now we apply optimization 3.1 to a set of simulated data and to a set of scores on different mathematical subjects.

3.3 Test on Simulated Data

We have tested the second method on a simulated dataset. The sample consists of 100 observations of eight variables, and the true graph has eleven edges. The inverse covariance matrix from which the dataset has been simulated is:

Σ^{-1} =
  6 0 1 0 0 1 0 0
  0 7 2 0 4 0 2 0
  1 2 7 0 1 0 0 2
  0 0 0 7 0 1 0 2
  0 4 1 0 8 2 0 0
  1 0 0 1 2 6 0 1
  0 2 0 0 0 0 7 0
  0 0 2 2 0 1 0 7

We used the R package 'glasso' (which uses the scaled optimization of equation 3.1) to test whether we are able to recover the true graph from the dataset. We used λ = 0.028 to obtain the same sparsity level as in the original model. As you can see in figure 3.1, the estimated graph is almost identical to the true one.

Figure 3.1: Estimating the Graph Structure on Simulated Data. (a) The True Graph; (b) The Estimated Graph.


3.4 An Applied Example

A nice dataset to apply the 'glasso' method to is the set of examination marks of 88 students on 5 different subjects (mechanics, vectors, algebra, analysis and statistics), borrowed from Mardia et al. [13]. The question is whether we can expect a student to be better or worse at one subject, knowing how he or she performs on the other ones.

The first step was defining an interval in which to search for a good value of λ. Using the previously discussed corollary of [3] we found λ_best ∈ [0, 311.07]. We found a graph with six edges for λ = 102, see figure 3.2.

Figure 3.2: Estimated Graph Mathmarks.

As you can see, algebra plays a role in the performance on statistics and analysis, and so does mechanics. But mechanics and algebra have no direct influence on each other, and vectors is only influenced by mechanics. The number of observations in this sample is close to the number of observations in the previous simulated example, and so is the number of variables. This gives us reason to believe that this graph is close to the true graph.


Chapter 4

Estimating the Graph Structure with Latent Variables

In this chapter we will explain what we can do with a dataset when some variables have not been measured. First we briefly discuss a widely used method for finding latent variables that can explain the behaviour of the measured ones. This method is called factor analysis and is explained further in section 4.1. After that, we discuss a newer method which focuses on the structure among all variables, measured and latent.

Latent variables do not have to be taken into account when one wants to estimate the covariance matrix, but we will see in section 4.2 why we run into problems when we estimate the graph structure. Chandrasekaran et al. [1] introduced a method for finding the graphical model with latent variables; their proof will be discussed in section 4.3.

The last two sections test this new method and apply it to the dataset of the previous chapter.

4.1 Factor Analysis

Factor analysis is a method that investigates whether a set of variables can be linearly explained by a smaller set of unobserved variables called factors. This is done in two steps. First one tries to find the best fit with respect to the so-called loading matrix B. This matrix relates the observed variables X_i to the factors F_j as in the following linear system.

X_1 − X̄_1 = β_10 + β_11 F_1 + β_12 F_2 + · · · + β_1h F_h + ε_1,
X_2 − X̄_2 = β_20 + β_21 F_1 + β_22 F_2 + · · · + β_2h F_h + ε_2,
⋮
X_p − X̄_p = β_p0 + β_p1 F_1 + β_p2 F_2 + · · · + β_ph F_h + ε_p.


Here the ε_i are error terms, indicating that this is not an exact relationship between X and F. When a loading β_ij is zero or very close to zero, we say that the factor F_j does not explain the behaviour of variable X_i. When this loading is not close to zero, we expect that the i-th variable can (partly) be explained by the j-th factor. Furthermore, we assume that all factors are uncorrelated with each other and have variance 1, such that Cov(F) = I.

The linear system above can be written as X − X̄ = BF + ε. This is very useful when we want to apply factor analysis to a covariance matrix. The covariance matrix Σ is related to the data X, the loading matrix B and the factors F as follows:

Σ = Cov(X − X̄) = Cov(BF + ε)
  = B Cov(F) B^t + Cov(ε)
  = BB^t + Cov(ε)

Several methods can be used to find such a best fit for the matrix B. A commonly used method is the principal component method which uses an estimation of the spectral decomposition of Σ to define the elements of B. This decomposition is defined as:

Σ̂ = Σ_{i=1}^p λ_i e_i e_i^t,

where the λ_i are the eigenvalues and the e_i the eigenvectors of Σ̂. The eigenvalues are ordered by size: λ_i ≥ λ_j for i < j. Assuming there are h factors, we find the following optimal estimation of Σ.

Σ̂ ≈ Σ_{i=1}^h λ_i e_i e_i^t = ( √λ_1 e_1   √λ_2 e_2   · · ·   √λ_h e_h ) ( √λ_1 e_1   √λ_2 e_2   · · ·   √λ_h e_h )^t = BB^t.

The loadings are thus defined as β_ij = √λ_j e_ji, where e_ji denotes the i-th element of the j-th eigenvector of Σ̂.
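A short R sketch of this principal component estimate of B, assuming a given covariance matrix Sigma and a chosen number of factors h (both hypothetical inputs):

```r
# Principal component estimate of the loading matrix B, so that Sigma ~ B %*% t(B).
pc_loadings <- function(Sigma, h) {
  e <- eigen(Sigma, symmetric = TRUE)    # eigenvalues returned in decreasing order
  e$vectors[, 1:h, drop = FALSE] %*% diag(sqrt(e$values[1:h]), h)
}
```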

Once we have found an optimal loading matrix, we still have to improve our solution. Our goal is to rotate B in such a way that we can easily interpret the dependence of the observed variables on the factors. We explain this in the next example.

Example. Suppose we have three variables which we would like to explain using two factors. We found the following loadings but, as you can see, it is not easy to draw any conclusion from this system.

X_1 − X̄_1 = 0.5 F_1 + 0.5 F_2 + ε_1,
X_2 − X̄_2 = 0.3 F_1 + 0.3 F_2 + ε_2,
X_3 − X̄_3 = 0.5 F_1 − 0.5 F_2 + ε_3.


We visualized these loadings in figure 4.1a. When we rotate both axes by 45°, all three points lie exactly on one of the new axes, see figure 4.1b. This rotation does not change the theoretical variances and covariances of X and therefore also leads to an optimal solution B̂ with respect to the estimation method used for the original B̂.

(a) The First Estimation of B. (b) The Rotated Estimation of B.

Figure 4.1: Rotating the Loading Matrix B.

So the only thing this rotation does is make the loadings we found in the first step easier to interpret, see the system below.

X_1 − X̄_1 = 0.5√2 F_1 + 0 · F_2 + ε_1,
X_2 − X̄_2 = 0.3√2 F_1 + 0 · F_2 + ε_2,
X_3 − X̄_3 = 0 · F_1 − 0.5√2 F_2 + ε_3.

Thus, we would like to rotate B in such a way that all points are as close as possible to as few axes as possible. This means that we would like the matrix B to have as many elements equal to zero as possible.

In conclusion, factor analysis is applied in two steps. First we try to find a best fit B in the decomposition Σ = BB^t + Cov(ε). The second step is rotating B̂ such that as many elements of this matrix as possible are equal to zero.
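In practice the rotation step is usually carried out with a standard criterion such as varimax; the thesis's example uses a simple 45-degree rotation by hand, but the following R sketch reproduces essentially the same rotated loadings (up to sign and column order):

```r
# Varimax rotation of the example loading matrix above.
B <- matrix(c(0.5, 0.3, 0.5,
              0.5, 0.3, -0.5), ncol = 2)   # columns correspond to factors F1 and F2
varimax(B)$loadings                        # rotated loadings with (near-)zero entries
```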


4.2 Graphical Latent Model Selection

Looking back at the Math Marks example, you can imagine that there are some unmeasured variables which also influence the scores on the five different subjects. Differently from factor analysis, we no longer look at them as factors which explain the behaviour of the measured variables, but as extra variables which might influence some of the other ones. So, instead of only estimating the edges between latent and observed variables, we try to find the right edges between any possible pair of variables.

Generally, we do not need latent variables like these to make a good estimate of the covariance matrix. But as you will see, they have a great influence on the selection of the inverse covariance matrix and therefore on the resulting graphical model.

In this thesis we will investigate what we can say about these 'graphical models with latent variables'. The edges of a graphical model of the observed variables generally do not correspond to the relevant edges of the complete model. This becomes clear when we split the data into observed and latent/hidden variables: X = (O, H)^t. The corresponding covariance matrix and its inverse can be written as:

Σ = ( Σ_OO  Σ_OH
      Σ_HO  Σ_HH ),

Σ^{-1} = Θ = ( Θ_OO  Θ_OH
               Θ_HO  Θ_HH ).

Using the Schur complement we see that Σ_OO^{-1} = Θ_OO − Θ_OH Θ_HH^{-1} Θ_HO. Thus, it is not possible to estimate Θ_OO directly from Σ_OO, the matrix which can be estimated directly from the observed data.
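A small numerical illustration of this identity, with a hypothetical 3x3 concentration matrix for two observed and one hidden variable, shows how marginalizing over the hidden variable creates an apparent edge between the observed ones:

```r
# Theta for (O1, O2, H): no edge between O1 and O2, but both connected to H.
Theta <- matrix(c(2.0, 0.0, 0.8,
                  0.0, 2.0, 0.8,
                  0.8, 0.8, 1.5), 3, 3)
Sigma <- solve(Theta)
solve(Sigma[1:2, 1:2])                                               # marginal concentration of O
Theta[1:2, 1:2] - Theta[1:2, 3] %*% t(Theta[1:2, 3]) / Theta[3, 3]   # Schur complement: same matrix
# Note the nonzero off-diagonal entry: the hidden variable induces an apparent O1-O2 edge.
```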

We will estimate this matrix Σ_OO^{-1} under the conditions that Θ_OO implies a sparse graph and that Θ_OH Θ_HH^{-1} Θ_HO has low rank, following the article of Chandrasekaran, Parrilo and Willsky [1]. The second condition corresponds to having a relatively small number of hidden variables compared to the observed ones, as this rank forms a lower bound for the number of latent variables.

For the penalized log likelihood we write the inverse covariance matrix of the observed variables as Σ_OO^{-1} = S − L, with S a matrix with many zeros and L a matrix of low rank.

Thus, we would like to penalize every nonzero element of S. As shown before, the best way to do this is by minimizing

‖S‖_1 = Σ_{j=1}^p Σ_{k=1}^p |s_jk|,

where p is the number of observed variables.

The next problem is how to penalize L for having high rank. Low rank corresponds to many eigenvalues being equal to 0. Thus, for convexity reasons, we would like to minimize the sum of the (absolute values of the) eigenvalues. This sum equals the trace of the diagonal matrix of the singular value decomposition of L, which simply equals the trace of L, since we will assume L to be symmetric and positive semidefinite. So, the penalty on L = UDU^t looks like:

Σ_{i=1}^p λ_i = Tr(D) = Tr(DU^tU) = Tr(UDU^t) = Tr(L).

As a result, the log likelihood we would like to maximize looks like:

ℓ_3(S − L, Σ_OO) = ℓ(S − L, Σ_OO) − λ_1 ‖S‖_1 − λ_2 Tr(L),   (4.1)
subject to S − L ≻ 0, L ⪰ 0.

Here, the extra constraints provide positive-definiteness and positive-semi-definiteness.

This log likelihood formulation implies a convex optimization problem which can be solved in polynomial time.

4.3 Proof of Chandrasekaran et al.

Chandrasekaran et al. [1] proved that maximizing equation 4.1 under certain conditions indeed leads to a good approximation of the real matrices S and L. Their proof will be briefly discussed below.

Our goal is to find estimates Ŝ and L̂ which have a bounded estimation error and which are algebraically consistent estimates of Σ_OO^{-1}, which means that all elements of Ŝ have the same sign as the corresponding elements of S, that the rank of L̂ is equal to the rank of L, and that Ŝ − L̂ can be realized as the marginal concentration matrix of an appropriate model with latent variables.

The authors begin by defining two different varieties. The first one is the set of matrices with a bounded number of nonzero entries:

S(|support(M)|) = { N ∈ R^{p×p} : |support(N)| ≤ |support(M)| },

where support(M) denotes the locations of the nonzero entries of M. M is a smooth point of this variety and the tangent space at M is denoted by Ω(M). Likewise they define the variety of matrices with bounded rank,

L(rank(M)) = { N ∈ R^{p×p} : rank(N) ≤ rank(M) };

M is a smooth point of this variety and the tangent space at M is denoted by T(M).

For identifiability reasons the authors also impose some conditions on the structure between the observed and latent variables and between the observed variables themselves. The first condition concerns the low rank matrix L. When this matrix is sparse itself, the additional correlations due to marginalization over the latent variables could be confused with the graph structure of the observed variables. Therefore, we want the latent variables to be 'spread out' over the observed variables. This means that the support of M will not be concentrated at only a few locations.

Another problem arises when the graph of the observed variables contains a densely connected subgraph. Such connections might be confused with marginalization induced by latent variables. Therefore, we need to bound the number of connections each observed variable has with other observed variables. The maximal number of connections an observed variable has with other observed variables is called the degree of S.

It is possible to uniquely recover the sparse and low rank matrices from Σ_OO^{-1} when the tangent spaces of S and L have a transverse intersection, which means that they only intersect at the origin. To quantify the level of transversality of these two spaces, the authors define the minimum gain with respect to the dual norm of the regularization function f_γ(S, L) = (λ_1/λ_2) ‖S‖_1 + Tr(L) = γ ‖S‖_1 + Tr(L), see 4.1. This function is a norm for all γ > 0 and its dual norm is given by:

g_γ(S, L) = max( ‖S‖_∞ / γ , ‖L‖_2 ).

This norm is used to conclude that when the effect of the latent variables is more 'spread out' over the observed ones, and when the graph of the observed variables contains fewer densely connected subgraphs, the tangent spaces of S and L have a more transverse intersection.

Next the authors define quantities that control the gain of the Fisher information of the true marginal concentration matrix ΘOO restricted to Ω and T . This is to make sure that elements of Ω and T are individually identifiable under the map I.

In conclusion, the authors state that when:

• the effect of the latent variables is ‘diffuse’ enough and the degree of S is not too high,

• γ = λ_1/λ_2 lies within some interval,

• λ2 has a specified value,

• the number of observations is high enough and

• the minimum nonzero singular value of L is not too small, as well as the minimum nonzero entry of S,

then we have algebraic correctness and a bounded estimation error with respect to the dual norm of f_γ, with probability greater than 1 − exp{−p}.


4.4 Test on Simulated Data

We have tested this optimization method on four sets of simulated data based on a model with eight variables, of which one is latent. The true graphical model, including the latent variable, has a sparsity level of 80%. We made four datasets with 10, 200, 600 and 60,000 observations to test on. For the optimization we used the code that professor Kim-Chuan Toh made for the article. The code makes use of the LogDetPPA package [6] and is given in appendix A.

For each dataset we searched for the best result with respect to the correctly recovered non-edges, in the setting where exactly 50% of the edges are found. The best we could do in this setting was finding 69.23% of these non-edges. We compared the Kullback-Leibler divergence and the Bayesian information criterion for these optimal combinations of λ_1 and λ_2, and as expected both the KL divergence and the BIC become smaller as the number of observations increases, see figure 4.2.

(a) Kullback-Leibler Divergence. (b) Bayesian Information Criterion.

Figure 4.2: Comparison Based on the KL-Divergence and BIC.

4.5 An Applied Example

We tried to estimate the graph structure of the dataset of section 3.4 under the assumption that there are some unobserved influences on the performance on the five subjects. We looked for a graph with the same sparsity level as in the non-latent estimation and compared it to the non-latent estimated graphical model.


For λ_1 = 0.085 and λ_2 = 0.0000425 we found the graph of figure 4.3. Even under the assumption of latent variables, we retain all conditional dependencies except for the one between algebra and analysis. The corresponding edge has been replaced by the edge statistics-vectors.

Figure 4.3: Estimated Graph Mathmarks with Latent Variables.


Chapter 5

The Structure of the Latent Variables

Once we have found S (= Θ_OO) and L (= Θ_OH Θ_HH^{-1} Θ_HO), only half of the graph structure is known: S tells us directly what the structure between the observed variables is, but it is hard to draw any conclusion from L. We have tried to come up with some techniques for making a suitable decomposition of L such that the missing parts of Θ imply a sparse graph as well. For this we use some known properties (L is symmetric and Θ_OH = Θ_HO^t) and some assumptions which are explained further below. The rank of L determines the number of latent variables in all cases.

5.1 No Structure Between the Latent Variables

First we assume that the latent variables themselves imply an empty graph. This results in Θ_HH (and Θ_HH^{-1}) being a diagonal matrix. And since Θ_OH = Θ_HO^t, we can simply use the eigendecomposition of L to find the missing parts of Θ. Below you see an applied example of this method.
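Under this assumption one convenient normalization is Θ_HH = I, so that L = Θ_OH Θ_OH^t; the following R sketch then recovers one valid choice of Θ_OH (unique only up to an orthogonal rotation of the latent variables) from the eigendecomposition of L:

```r
# Recover Theta_OH from L when Theta_HH is taken to be the identity, with r = rank(L).
recover_theta_oh <- function(L, r) {
  e <- eigen(L, symmetric = TRUE)                       # eigenvalues in decreasing order
  e$vectors[, 1:r, drop = FALSE] %*% diag(sqrt(e$values[1:r]), r)
}
```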

Example 1. Suppose you find the following matrix:

L =
  2.1200  2.0400  1.0400  2.0800  0.5200
  2.0400  4.6800  1.6800  3.3600  0.8400
  1.0400  1.6800  1.0133  2.0267  0.5067
  2.0800  3.3600  2.0267  4.0533  1.0133
  0.5200  0.8400  0.5067  1.0133  0.2533

Using its eigendecomposition we find the following precision matrices:

Θ_OH =
  −0.8750   0.3171   0.3659
  −0.0201  −0.7789   0.6268
   0.2111   0.2362   0.3002
   0.4222   0.4723   0.6004
