
Graphical Models with Latent Variables

S. Jonker BSc

Master’s thesis in Mathematics

June 2014


Abstract

This paper concerns graphical models for which not all variables have been measured.

The influence of such latent variables will be discussed, as well as two methods (factor analysis and graphical latent model selection) that deal with such models.


Contents

1 Introduction
  1.1 Conditional Independence
  1.2 Graph Theory
  1.3 Kullback-Leibler Information Divergence

2 A Discrete Model
  2.1 Log-Linear Models
  2.2 Estimation
    2.2.1 With Known Graph Structure
    2.2.2 Estimating the Graph Structure
  2.3 An Applied Example

3 A Continuous Model
  3.1 Gaussian Graphical Models
  3.2 Estimation
    3.2.1 With Known Graph Structure
    3.2.2 Estimating the Graph Structure
  3.3 Test on Simulated Data
  3.4 An Applied Example

4 Estimating the Graph Structure with Latent Variables
  4.1 Factor Analysis
  4.2 Graphical Latent Model Selection
  4.3 Proof of Chandrasekaran et al.
  4.4 Test on Simulated Data
  4.5 An Applied Example

5 The Structure of the Latent Variables
  5.1 No Structure Between the Latent Variables
  5.2 A Better Approach
  5.3 Test on Simulated Data
  5.4 An Applied Example

6 Data Analysis
  6.1 The Dataset
  6.2 General Gaussian Graphical Model Selection
  6.3 Factor Analysis
  6.4 Latent Graphical Model Selection
    6.4.1 Non-Latent Part
    6.4.2 Latent Part
  6.5 Comparison and Conclusion

7 Summary

8 Acknowledgements

Bibliography

A Optimization Code


Chapter 1

Introduction

This chapter will give an introduction to conditional independence models. The most relevant theory and importance of such models will be explained, as well as their relation to graphs. It turns out that variables may seem to be directly related to each other while this dependency in fact only exists through some other (set of) variable(s). We will see a clear example of this in the first section of this chapter, using an example of Simpson's paradox.

After understanding what we mean by graphical conditional independence models, we will sum up some definitions and properties known from graph theory. They will help us estimate and understand the models of the following chapters.

Finally we will take a first look at a test that compares different estimated conditional independence graphs. A formula for the 'distance' between having conditional independence or dependence between two sets of variables will be given. This formula in fact tells us how much information we lose when we pretend that some set of edges is not included in the corresponding graph. The more information we lose, the more evidence we have for a set of edges to be in the model.

1.1 Conditional Independence

First look at an example of Simpson's paradox borrowed from Edwards [2]. He constructed fictitious data to study possible sexual discrimination in patterns of university admissions. When we look at table 1.1, we find some discrimination against female applicants: there seems to be a preference for admitting males (62.5%) over females (37.5%).


Sex       # applications   # admitted   % admitted
Male            400            250          62.5
Female          400            150          37.5

Table 1.1: Fictitious Data of Males and Females Being Admitted to a University.

But we do not see this discrimination by gender when the data are split by university department. The cause of more males being admitted lies in the fact that more males apply to departments that accept students more easily, see table 1.2.

Department   Sex      # applications   # admitted   % admitted
I            Male           100             25           25
             Female         300             75           25
II           Male           300            225           75
             Female         100             75           75

Table 1.2: Fictitious Data of Admissions to a University Broken Down by Department.

We see that Sex (S) and Department (D) are marginally associated, as are Department and Admission (A). On the other hand, we see that Sex and Admission are independent given Department. We can visualize this graphically as follows:

Figure 1.1: Graph of College Admissions.

The lesson to be learned here is that when we only rely on pairwise marginal associations, we might end up with the wrong conclusions. At first sight, we thought that the university discriminated against females, but this turned out to be the wrong conclusion.

So it is necessary to also look at conditional associations between variables, as well as their marginal associations.


Conditional Independence in Graphical Models

We could express the graph of figure 1.1 as S ⊥ A | D, which should be read as 'Sex is independent of Admission given Department'. We would like to relate this expression to a probability distribution. Recall the definition of conditional probability for positive distributions:

P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)   (for discrete variables),
f_{Y|X}(y | x) = f_{X,Y}(x, y) / f_X(x)         (for continuous variables),

where f_X(x) denotes the marginal density of X. From now on all properties will be stated for continuous variables, but they can easily be rewritten for discrete variables in the same manner.

The conditional probability function factorizes when two (sets of) variables are independent given another (set of) variable(s). For example, if X ⊥ Y | Z, then we can factorize the probability function conditioned on Z as follows:

f_{X,Y|Z}(x, y | z) = f_{X|Z}(x | z) · f_{Y|Z}(y | z)

We can imagine the following theorem (which will not be proved here) to be true when we combine the last two equations:

Theorem 1. The Factorization Theorem

The sets X and Y are independent given the set Z if and only if their probability function can be factorized accordingly:

X ⊥ Y | Z  ⟺  there exist functions h and g such that, for all x, y and all z with f_Z(z) > 0, the probability function can be written as

f_{X,Y,Z}(x, y, z) = h(x, z) · g(y, z).

So, basically, we can express a graphical model in two equivalent ways: via the factorization of a probability function or as a graph. We will discuss the latter representation in the next section. Probability functions will be used in the last section of this chapter and will be discussed in more detail in the next chapters.

1.2 Graph Theory

Before we continue with graphical models, we have to introduce some concepts of graph theory. We begin with three definitions:

Definition 1. A graph G = (V, E) consists of a set of vertices V and a set of edges E ⊂ V × V.


Definition 2. An edge is said to be undirected if both ordered edges are in E: if (i, j) ∈ E, then (j, i) ∈ E.

Definition 3. A graph is said to be undirected if all edges in E are undirected.

Since we talk about graphs in which some vertices are connected and some are not, we also have to define some concepts of vertices which are connected in some way.

Definition 4. Two vertices a and b are adjacent if (a, b) ∈ E and (b, a) ∈ E.

Example: In figure 1.2 a and d are adjacent, but a and b are not.

Definition 5. The sequence (a_1, a_2, . . . , a_j) is a path if (a_k, a_{k+1}) ∈ E for k = 1, . . . , j − 1 and all a_l are distinct.

Example: In figure 1.2 (a, e, b, d) is a path, but (a, e, d, b) is not since e and d are not adjacent.

Definition 6. The sequence (a_1, a_2, . . . , a_j) is a cycle if (a_1, a_2, . . . , a_{j−1}) is a path, (a_{j−1}, a_j) ∈ E and a_1 = a_j.

Example: In figure 1.2 (a, e, b, d, a) is a cycle.

Definition 7. The set of vertices A separates vertices a and b if all paths between a and b contain at least one element of A and a, b /∈ A.

Example: In figure 1.2 the set {e, d} separates a and b.

Definition 8. A graph is complete if all vertices are adjacent.

Example: The graph of figure 1.2 would be complete if the vertices a and e were not part of the graph.

Definition 9. A clique is a maximal complete subgraph.

Example: In figure 1.2 the subgraph on the vertices b, c and d is a clique.

From now on we will only talk about undirected graphs. By doing so, we can easily connect our knowledge of conditional independence to the formal definition of a graph:

Definition 10. Two variables X_a and X_b are conditionally independent given all other variables if and only if (a, b) ∉ E and (b, a) ∉ E.

These conditional independencies are formalized in three equivalent Markov properties:

Definition 11. Having no edge between two vertices a and b in the corresponding graphical model corresponds to the following properties:

Pairwise Markov property:  X_a ⊥ X_b | rest
Local Markov property:     X_a ⊥ X_b | N(X_a)
Global Markov property:    X_a ⊥ X_b | S(X_a, X_b)

Where 'rest' stands for all variables except X_a and X_b, N(X_a) represents the neighbourhood of X_a (the set of all vertices that are adjacent to X_a), and S(X_a, X_b) is the (minimal) separating set of X_a and X_b.

These properties might be best understood using an example. Looking at figure 1.2 we clearly see that there is no edge between a and b. Recalling the example of Simpson's paradox, we know that this means that X_a and X_b are independent of each other when some other variables are known. We shall investigate these independencies using the three different Markov properties.

First, the pairwise property: X_a ⊥ X_b | X_c, X_d, X_e. This property clearly conditions on too many variables. The local one already leaves out some variables we do not need: X_a ⊥ X_b | X_d, X_e. But since all conditional independencies are symmetric, we could also write X_b ⊥ X_a | X_c, X_d, X_e, which is the same statement as the pairwise Markov property. Last we have the global one: X_a ⊥ X_b | X_d, X_e. This clearly is the smallest expression possible.

Figure 1.2: A Graphical Model for Variable X = (Xa, Xb, Xc, Xd, Xe).

There are several ways to prove the equivalence of these three properties. One way to do this, is by proving that the global property implies the local (G =⇒ L), that the local implies the pairwise property (L =⇒ P) and finally that the pairwise Markov implies the global (P =⇒ G) Markov property. To do this, we will follow the proof of Whittaker [11].

Proof.

(G =⇒ L) The global Markov property implies the local one: the neighbourhood N(X_a) is itself a separating set for a and any vertex b that is not adjacent to a.

(L =⇒ P) Suppose the vertices are numbered from 1 to k (denoted by the set K), corresponding to variables X_1 to X_k. Define for every vertex a its neighbourhood set N_a = N(X_a). In this case

X_a ⊥ X_{B_a} | X_{N_a},   where B_a = K \ ({a} ∪ N_a).

Select any vertex b in B_a. This vertex clearly is not adjacent to a, and the expression above can be rewritten as

X_a ⊥ {X_b, X_C} | X_{N_a},   where C = B_a \ {b}.

This statement can be rewritten as (will not be proved here)

X_a ⊥ X_b | {X_{N_a}, X_C}   and   X_a ⊥ X_C | {X_{N_a}, X_b}.

Since N_a ∪ C = K \ {a, b} = rest, the first statement is exactly the condition of the pairwise Markov property.

(P =⇒ G) This part follows from the so-called separation theorem, which states that when a set of vertices a is separated from a disjoint set b by a third disjoint set c (the separation set), then

X_a ⊥ X_b | X_c.

This is all we will say about graph theory for now. The next section will provide a way to calculate the strength of conditional independencies. This gives a first handle in graphical model selection.

1.3 Kullback-Leibler Information Divergence

It is important to have some measure of the strength of all possible edges in a graph when one wants to estimate a graphical model. One way to measure these 'strengths' is using the Kullback-Leibler Information Divergence (KL Divergence), which measures the 'distance' between two densities in the following way:

Definition 12. Kullback-Leibler Information Divergence

Let f and g be two densities. Then the KL Divergence is defined as follows:

I(f; g) = E_f[ log( f(x) / g(x) ) ],

where E_f denotes expectation with respect to the density f.

Example: Suppose that f ∼ Bern(p) and g ∼ Bern(q). Then we can calculate the KL divergence as follows:

I(f; g) = E_f[ log( p^x (1 − p)^{1−x} / ( q^x (1 − q)^{1−x} ) ) ]
        = E_f[ log( (p/q)^x ((1 − p)/(1 − q))^{1−x} ) ]
        = E_f[ x log(p/q) + (1 − x) log((1 − p)/(1 − q)) ]
        = p log(p/q) + (1 − p) log((1 − p)/(1 − q))

As you have probably seen, the divergence is not symmetric in f and g, so it is not a true distance.
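As a quick numerical illustration of the Bernoulli example (not part of the original derivation), the following R sketch evaluates the divergence for the hypothetical values p = 0.3 and q = 0.5 and shows the asymmetry directly:

```r
# KL divergence between two Bernoulli densities, as derived above.
kl_bern <- function(p, q) p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

kl_bern(0.3, 0.5)   # approx 0.0823
kl_bern(0.5, 0.3)   # approx 0.0872: different value, so I(f; g) != I(g; f)
```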

We will use this measure to calculate two kinds of distances: a) the distance between complete dependency and independence between two (groups of) variables, and b) the distance between complete dependency and conditional independence between two (groups of) variables.

Let's start with the first one. When two variables are completely independent, we can write their density function f_{XY} simply as f_X f_Y. So the distance between complete dependency and independence of X and Y can be calculated as:

Inf(X ⊥ Y) = I(f_{XY}; f_X f_Y).

This formula gives the information you lose when you pretend that X and Y are completely independent. You do not lose information when these variables really are independent, and you end up with a distance of zero, as expected.

Looking at graphical models, we are more interested in conditional independencies.

When two variables X and Y are conditionally independent given a third variable Z, their density function f_{XYZ} can simply be written as f_Z f_{X|Z} f_{Y|Z}. This gives us the following definition of the information of a conditional independency:

Inf(X ⊥ Y | Z) = I(f_{XYZ}; f_Z f_{X|Z} f_{Y|Z}).

We will be using this formula to calculate the strength of an edge compared to a full graph. The next chapter will show how this can be done for discrete data sets. The chapter thereafter will apply this distance to continuous models.


Chapter 2

A Discrete Model

This chapter will apply the topics of the last chapter to discrete multivariate models. We will be looking at contingency tables to find graphs corresponding to conditional independencies. First we will explain why these models are called log-linear models. Then we will see how these logarithms can be used to estimate a model. In the end we will apply this theory to a real dataset.

2.1 Log-Linear Models

Consider a random vector x = (x1, x2, x3) where each xi ∈ {0, 1}. There are 8 different values of x, so we can write its density function as:

P(x) = p_{111}^{x_1 x_2 x_3} · p_{110}^{x_1 x_2 (1−x_3)} · p_{101}^{x_1 (1−x_2) x_3} · p_{011}^{(1−x_1) x_2 x_3} · p_{100}^{x_1 (1−x_2)(1−x_3)} · · · p_{000}^{(1−x_1)(1−x_2)(1−x_3)},

where p_{ijk} is the probability of obtaining the value x = (i, j, k). We can rewrite this as a log expansion:

log P(x) = u_0 + u_1 x_1 + u_2 x_2 + u_3 x_3 + u_{12} x_1 x_2 + u_{13} x_1 x_3 + u_{23} x_2 x_3 + u_{123} x_1 x_2 x_3,

with:

log p_{000} = u_0
log p_{100} = u_0 + u_1
log p_{010} = u_0 + u_2
log p_{001} = u_0 + u_3
log p_{110} = u_0 + u_1 + u_2 + u_{12}
log p_{101} = u_0 + u_1 + u_3 + u_{13}
log p_{011} = u_0 + u_2 + u_3 + u_{23}
log p_{111} = u_0 + u_1 + u_2 + u_3 + u_{12} + u_{13} + u_{23} + u_{123}


Of course, this can be generalized to any x = (x_1, . . . , x_k) with each x_i ∈ {1, . . . , r_i}:

log P(x) = Σ_{A ⊂ K} u_A(x_A).

This new representation with so-called u-terms turns out to be very useful in graphical modelling. Remember that when two groups of variables x and y are independent given a third group of variables z, we should be able to write their density function as P(x, y, z) = g(x, z) · h(y, z) (Theorem 1), or using logarithms: log P(x, y, z) = log g(x, z) + log h(y, z). Looking back at the u-term representation above, we can conclude that two variables a and b are conditionally independent if and only if all u-terms involving both a and b are zero. This is formalized in the following proposition:

Proposition 1. X_A ⊥ X_B | X_C if and only if for all j ∈ A and all k ∈ B: u_D = 0 whenever j, k ∈ D.

Example: For x ∈ {0, 1}³: X_1 ⊥ X_2 | X_3 iff u_{12} = u_{123} = 0.
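The relations between the cell probabilities and the u-terms can also be inverted directly. The following R sketch, with purely hypothetical cell probabilities, recovers u_0, u_1, u_2 and u_12 from the equations above; u_12 (and u_123) being zero is exactly the kind of condition Proposition 1 refers to:

```r
# Hypothetical cell probabilities for three binary variables (x1, x2, x3).
p <- c(p000 = 0.20, p100 = 0.10, p010 = 0.15, p001 = 0.05,
       p110 = 0.20, p101 = 0.05, p011 = 0.10, p111 = 0.15)
lp  <- log(p)
u0  <- lp["p000"]
u1  <- lp["p100"] - u0
u2  <- lp["p010"] - u0
u12 <- lp["p110"] - u0 - u1 - u2   # zero iff there is no two-way interaction between x1 and x2
```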

Now we have found a nice way to connect graphical models to log density functions:

all edges in a graph should correspond to the non-zero u-terms in the corresponding log density function. A problem arises when we for instance compare the following two expressions:

log P(x) = u_0 + u_1 + u_2 + u_3 + u_{12} + u_{13} + u_{23} + u_{123}
log P(x) = u_0 + u_1 + u_2 + u_3 + u_{12} + u_{13} + u_{23}

Both expressions imply the same conditional independence relations (a full graph). They both imply a model, but we say that only the first one is graphical. So, to be graphical, all u-terms should correspond to the conditional independence relations implied by the graph G. This is equivalent to the following definition:

Definition 13. Graphical Log-Linear Models

A multinomial distribution is a graphical log-linear model if the u-terms satisfy:

1. All maximal u-terms correspond to the cliques of the graph G.

2. If u_A ≠ 0, then u_B ≠ 0 for all B ⊂ A.

2.2 Estimation

The theory of estimating a graphical log-linear model will be explained for two different problems: estimating the log-linear expression when the corresponding graph is known, and estimating the graph G itself.


2.2.1 With Known Graph Structure

A widely used method to find the best density function is maximizing the likelihood.

For our log-linear models this will be done as follows:

ℓ(P(x), x_1, . . . , x_N) = log Π_{i=1}^N P(x_i)
                         = Σ_{i=1}^N log P(x_i)
                         = Σ_x n(x) log P(x) = ℓ(P(x), n(x))

Here the x_i are the N observations, n(x) is the number of observations equal to x, and the last sum is taken over all possible values of x. This expression can be rewritten as:

ℓ(P(x), n(x)) = Σ_x n(x) log P(x)
              = N Σ_x (n(x)/N) log P(x) + Σ_x n(x) log(n(x)/N) − N Σ_x (n(x)/N) log(n(x)/N)
              = ℓ( n(x)/N, n(x) ) − N · Σ_x (n(x)/N) log( (n(x)/N) / P(x) )
              = ℓ( n(x)/N, n(x) ) − N · I( n(x)/N ; P(x) )

Clearly, maximizing the likelihood corresponds to minimizing the KL divergence. There are no constraints on P(x) when we consider a full graph; this results in the estimate P̂(x) = n(x)/N. It becomes slightly more difficult when we consider a non-full graph. Proposition 7.6.1 of [11] states that the MLE in this situation satisfies

P̂_A(x_A) = n_A(x_A)/N   for all cliques A of the graph.   (2.1)

This holds since the distribution within a clique is left unconstrained. The same solution holds for every B ⊂ A when A is a clique. Combining all cliques gives the following MLE corresponding to the graph G:

P̂(x) = Π_{A clique of G} P̂_A(x_A) / Π_{S separator of G} P̂_S(x_S),

where the separators S are the overlapping sets of the cliques.
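To make equation 2.1 concrete, the following R sketch computes the clique-factorized MLE for the chain graph 1 - 2 - 3 (cliques {1, 2} and {2, 3}, separator {2}) on hypothetical binary data; the data frame d and its column names are assumptions for illustration only:

```r
# Clique-factorized MLE for the chain graph x1 - x2 - x3 on simulated 0/1 data.
set.seed(1)
d <- data.frame(x1 = rbinom(100, 1, 0.5), x2 = rbinom(100, 1, 0.5), x3 = rbinom(100, 1, 0.5))
N <- nrow(d)
p12 <- table(d$x1, d$x2) / N   # clique {1, 2} marginal
p23 <- table(d$x2, d$x3) / N   # clique {2, 3} marginal
p2  <- table(d$x2) / N         # separator {2} marginal

# Estimate of P(x1 = a, x2 = b, x3 = c) under the chain graph (equation 2.1):
phat <- function(a, b, c) p12[a + 1, b + 1] * p23[b + 1, c + 1] / p2[b + 1]
phat(0, 1, 0)
```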


2.2.2 Estimating the Graph Structure

The most interesting part of course is finding the conditional independence graph itself.

There are several ways to do this, but one is the following: begin with a full graph and test which edges should be deleted, then use this new graph to test whether you might have deleted too many edges. We begin with the first part: testing which variables are conditionally independent, starting from full dependence. We do this using the so-called deviance, which is twice the difference between the log likelihood of the full model and that of the model M we test:

Definition 14. Deviance

The deviance of a model M is given as:

Dev(M) = 2 [ ℓ( n(x)/N, n(x) ) − ℓ( P̂_M(x), n(x) ) ]
       = 2 [ ℓ( n(x)/N, n(x) ) − ℓ( n(x)/N, n(x) ) + N · I( n(x)/N , P̂_M(x) ) ]
       = 2N Σ_x (n(x)/N) log( (n(x)/N) / P̂_M(x) )
       = 2 Σ_x n(x) log( (n(x)/N) / P̂_M(x) )

To test whether an edge should be excluded from the graph, we use the following theorem:

Theorem 2.

Dev(M) | M is true ∼ χ²_df   as N → ∞,

where df is the difference in degrees of freedom between M and the saturated (full) model; this corresponds to the number of u-terms set to zero in M.

This theorem can be used to test any model against the full model. When we exclude only one edge, we can easily specialize the results found so far: to test whether the vertices a and b are conditionally independent, we simply calculate the deviance of the model from which only the edge (a, b) is missing and test it against the χ²₁ distribution. We conclude that they are independent when the deviance is lower than the critical value of the chi-squared distribution on one degree of freedom at a certain significance level α. The deviance is calculated using:

P̂^M(x) = P̂^M_{K\{b}}(x) · P̂^M_{K\{a}}(x) / P̂^M_{K\{a,b}}(x)
        = ( n_{K\{b}}(x)/N · n_{K\{a}}(x)/N ) / ( n_{K\{a,b}}(x)/N )
        = n_{K\{b}}(x) · n_{K\{a}}(x) / ( n_{K\{a,b}}(x) · N )


This results in the following expression for the edge exclusion deviance:

Dev(M) = 2 Σ_x n(x) log( (n(x)/N) / P̂^M(x) )
       = 2 Σ_x n(x) log( n(x) · n_{K\{a,b}}(x) / ( n_{K\{b}}(x) · n_{K\{a}}(x) ) )
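In practice such an edge exclusion test can be carried out with the base-R function loglin, which returns the likelihood-ratio statistic (the deviance above) of a log-linear model against the saturated model; the 2x2x2 table below is hypothetical and only meant as a sketch:

```r
# Dropping edge (1, 2) from the full graph on three variables leaves the cliques
# K\{2} = {1, 3} and K\{1} = {2, 3}; loglin() fits the model with exactly these margins.
tab <- array(c(20, 12, 15, 9, 18, 11, 16, 10), dim = c(2, 2, 2))   # hypothetical counts
fit <- loglin(tab, margin = list(c(1, 3), c(2, 3)), print = FALSE)
fit$lrt                              # the edge exclusion deviance
fit$lrt > qchisq(0.90, fit$df)       # TRUE would suggest keeping the edge at the 10% level
```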

Once all insignificant edges are deleted, we need to test whether some edges should have stayed in the graph. This will be done by testing the new model (M1) against the same model with one edge added (M2). Their deviance difference is equal to:

2 [ ℓ( P̂_{M_2}(x), n(x) ) − ℓ( P̂_{M_1}(x), n(x) ) ]
 = 2 [ ℓ( n(x)/N, n(x) ) − ℓ( P̂_{M_1}(x), n(x) ) ] − 2 [ ℓ( n(x)/N, n(x) ) − ℓ( P̂_{M_2}(x), n(x) ) ]
 = Dev(M_1) − Dev(M_2)

For N → ∞ we have Dev(M_1) − Dev(M_2) | M_1 is true ∼ χ²_{df_{M_1} − df_{M_2}}. So we do not add the edge being tested if the deviance difference is greater than the critical value of the chi-squared distribution with degrees of freedom equal to the difference between the degrees of freedom of the models M_1 and M_2, at significance level α.

Now we know how to estimate a conditional independence graph for a discrete dataset and how to calculate the corresponding log-linear expansion. The next section gives a fully worked out example.

2.3 An Applied Example

The theory of discrete graphical models will be applied to a modified dataset with 601 observations borrowed from Fair [14]. In his paper he estimated a Tobit model for time spent on a leisure activity, in this case extramarital affairs. We adjusted the data such that all five variables are either zero or one. The following variables are investigated:

1. Sex (0: female, 1: male)

2. Years Married (0: < 8 years, 1: > 8 years)

3. Children (0: no children, 1: children)

4. Self-rating of Marriage (0: not happy, 1: happy)

5. Affairs (0: did not have any affair, 1: has had at least one affair)


The following shows the contingency table of this dataset:

                           Years Married = 0                 Years Married = 1
                       Children = 0   Children = 1       Children = 0   Children = 1
                       Self-rating    Self-rating        Self-rating    Self-rating
                         0     1        0     1            0     1        0     1
Sex 0   Affairs 0       11    72       16    45            0     4       28    67
        Affairs 1        2     8       13     8            2     0       20    19
Sex 1   Affairs 0       13    38       14    51            1     5       24    62
        Affairs 1        4     8        8    16            2     1       17    22

Table 2.1: Contingency Table of Extramarital Affairs

As a first step in estimating a graphical model for this dataset, we calculate the edge exclusion deviances. The R package 'ggm' helps us to calculate these deviances using the command 'fitConGraph'. Only the deviances corresponding to the edges (Years Married, Children), (Children, Affairs) and (Self-rating, Affairs) are higher than the critical value of 2.706 of the χ²₁ distribution at the 10% significance level, see table 2.2.

This results in the graph of figure 2.1a.
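The exact calls used in the thesis are not reproduced here, but a hedged sketch of the 'fitConGraph' approach looks as follows; the variable names, the covariance matrix S of the 0/1 data and the dropped edge are assumptions for illustration:

```r
# Sketch: edge exclusion deviance for one edge via the 'ggm' package.
library(ggm)
vars <- c("Sex", "YearsMarried", "Children", "Selfrating", "Affairs")
amat <- matrix(1, 5, 5, dimnames = list(vars, vars))
diag(amat) <- 0                                              # full graph
amat["Children", "Affairs"] <- amat["Affairs", "Children"] <- 0   # drop the edge to be tested
# With S the 5x5 sample covariance of the binary variables and N = 601:
# fit <- fitConGraph(amat, S, 601)   # fit$dev is the deviance, fit$df its degrees of freedom
```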

                  Male     Years Married   Children   Self-rating
Years Married    0.0703          *             *           *
Children         2.3494       138.41           *           *
Self-rating      0.2047         1.2122        2.5858        *
Affairs          1.2465         1.4081        3.3842       22.321

Table 2.2: First Edge Exclusion Deviances

The next step is to investigate whether we might have excluded too many edges. We do this using the edge inclusion deviances, which resulted in adding four extra edges, see figure 2.1b. After repeatedly adding and deleting edges, we see that two graphs keep coming back. A sparser graph is always preferred, since it is easier to interpret. So we use the graph of figure 2.1e as our final graph.

We see that having children does not affect any of the other variables, while gender seems to be highly connected to all other variables except for children. We also conclude that the self-rating of marriage does not directly influence whether someone has an extramarital affair, but it has an indirect influence via someone's gender and the number of years a person has been married.


Figure 2.1: Estimation Using Deviances. (a) First Edge Exclusion; (b) First Edge Inclusion; (c) Second Edge Exclusion; (d) Second Edge Inclusion; (e) Third Edge Exclusion.

We see that the final graph contains three cliques: A = {Affairs, Sex, Years Married}, B = {Sex, Self-rating} and C = {Children} and one overlapping set: D = {Sex}. We use this to find an expression for the probability function corresponding to our graph:

P̂(x_Sex, x_YearsMarried, x_Children, x_Self-rating, x_Affairs) = P̂_A(x_A) · P̂_B(x_B) · P̂_C(x_C) / P̂_D(x_D),

where P̂_S(x_S) is defined as in equation 2.1. Our last task is to estimate the u-terms for the graphical model we found. This is not hard to do, but it asks for quite some patience given the number of calculations we have to perform. We therefore only demonstrate the procedure for the u_0-term, which equals the logarithm of the probability of all variables being zero. This gives:

û_0 = log P̂(0, 0, 0, 0, 0)
    = log[ ( (11+72+16+45)/601 · (11+2+16+13+0+2+28+20)/601 · (11+2+13+4+72+8+38+8+0+2+1+2+4+0+5+1)/601 ) / ( (11+2+72+8+16+13+45+8+0+2+4+0+28+20+67+19)/601 ) ]
    = log[ ( 144/601 · 92/601 · 171/601 ) / ( 315/601 ) ]
    = log 0.0199 = −3.92

All other u-terms are estimated in a similar way, see section 2.1.
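For completeness, the arithmetic above can be checked with one line of R:

```r
# Verifying the value of u0 computed above.
log((144 / 601) * (92 / 601) * (171 / 601) / (315 / 601))   # approx -3.92
```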


Chapter 3

A Continuous Model

This chapter will focus on variables which are continuous, or more precisely, on variables which have a multivariate Gaussian distribution. It turns out to be relatively easy to estimate a graphical model for this kind of variable. The first section will discuss some of the theory of these Gaussian graphical models and the second section will focus on how to estimate a proper graphical model. The third section tests one of the estimation methods on a set of simulated data and the last section will apply the theory to a real dataset.

3.1 Gaussian Graphical Models

The inverse of the covariance matrix, Θ = Σ^{-1}, of a multivariate Gaussian distribution gives information about the dependence between two variables given the rest, and therefore about the edges in the graphical model. Whittaker [11] describes how the partial correlation of two variables given the others is related to the inverse covariance matrix:

Corr(X_i, X_j | rest) = −θ_ij / √(θ_ii θ_jj).

Thus, an element θ_ij being equal to 0 corresponds to having no edge between the i-th and j-th variable. We can make this more intuitive using the Hammersley-Clifford theorem.

This theorem states that the following are equivalent for a positive probability density function f(x_1, . . . , x_k):

• f(x_i | x_{K\{i}}) = f(x_i | N(x_i)) for i = 1, . . . , k. This uses the local Markov property, which we have discussed before.

• f(x) can be factorized according to the cliques of the graph: f(x) = (1/W) Π_{C ∈ cl(G)} h_C(x_C), where W normalizes the function and cl(G) is the set of cliques of the graph G.


We can factorize the Gaussian probability function as follows:

f(x) = C · Π_{i,j} exp( −½ θ_ij (x_i − μ_i)(x_j − μ_j) ) = C · Π_{i,j} h_ij(x_i, x_j),

where C = 1/√( (2π)^k |Σ| ) and where products of several functions h_ij form the functions h_C of the second equivalence statement. Now when θ_ij = 0, the factor h_ij(x_i, x_j) is constant and gives no contribution to f(x), which shows that x_j ∉ N(x_i), using the local Markov property.

In the second equivalence statement you see that the variables x_i and x_j are not part of a common clique when θ_ij = 0, since there will be no term h_C(x_C) with i, j ∈ C. This strengthens the statement that two variables are conditionally independent when their corresponding element of the inverse covariance matrix is zero.

So, we have just added another dimension to estimating the covariance matrix of a dataset. We no longer just want this matrix to fit the data; we also want to say something about the graphical model and therefore about the inverse covariance matrix.

3.2 Estimation

As discussed above, estimating a graphical model involves two main problems: one wants to know the graph structure as well as the values of the covariance matrix. The difference with the usual estimation of the covariance matrix is that some of the entries of Σ^{-1} will be 0. We begin with some situations in which the graph structure is known.

3.2.1 With Known Graph Structure

Assuming we have a full graph, we will run through the usual steps: estimate Σ by maximizing the log likelihood:

ℓ_0(Σ) = Σ_{i=1}^n ℓ_{x_i}(Σ)
       = Σ_{i=1}^n log[ (√(2π))^{-p} |Σ|^{-1/2} exp( −½ (x_i − x̄)^t Σ^{-1} (x_i − x̄) ) ]
       = Σ_{i=1}^n [ −(p/2) log(2π) − ½ log|Σ| − ½ (x_i − x̄)^t Σ^{-1} (x_i − x̄) ]
       = C − (n/2) log|Σ| − ½ Σ_{i=1}^n Tr[ (x_i − x̄)(x_i − x̄)^t Σ^{-1} ].

Eliminating the constant C gives us the following expression to maximize:

ℓ(Σ) = (n/2) log|Σ^{-1}| − (n/2) Tr(SΘ),   where S = (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^t.

It can easily be shown that Σ̂ = S. With probability 1 none of the entries of Θ̂ = Σ̂^{-1} is zero. We therefore have to add constraints to our optimization when the graph is sparser, which corresponds to more elements of Θ being equal to zero. The optimization then looks like:

Maximize ℓ(Σ) subject to θ_ij = 0 for all (i, j) ∉ E.

This is equivalent to maximizing the following Lagrange function:

ℓ_1(Θ) = ℓ(Σ) − Σ_{(j,k) ∉ E} γ_jk θ_jk,

where Γ is a matrix of Lagrange parameters with nonzero values for all entries corresponding to edges not in E. The gradient equation, Θ^{-1} − S − Γ = 0, has a unique solution. An algorithm for solving this equation is given in [5], page 634.

3.2.2 Estimating the Graph Structure

Deviance

A natural way to estimate the structure of an independence graph would be to use deviances as we did in the previous chapter in the case of discrete datasets. We will demonstrate this approach in a very abstract manner.

We have already seen that the log likelihood for a given covariance matrix Σ looks like

C − (n/2) log|Σ| − (n/2) Tr(SΣ^{-1})

and that the unconstrained maximum is attained at Σ̂ = S, so the maximum value of the log likelihood equals C − (n/2) log|S| − np/2. It is a little bit tedious, and not very interesting at the moment, to find an expression for the log likelihood of a model in which more than one edge is excluded. Fortunately, the deviance has a very simple expression when we exclude only one edge at a time, as Whittaker [11] describes on page 169.

The edge exclusion deviance for the edge (a, b) of a full graph equals

−n log( 1 − Corr²(X_a, X_b | X \ {X_a, X_b}) ).

This deviance should follow a chi-squared distribution with one degree of freedom when the edge indeed should be excluded from the model. We only have one degree of freedom, since we constrain one parameter to be equal to zero.
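A minimal R sketch of this single-edge test, assuming the data are stored row-wise in a matrix X (a hypothetical name) and a, b are column indices, reads:

```r
# Edge exclusion deviance -n * log(1 - Corr(Xa, Xb | rest)^2), with the partial
# correlation taken from the inverse of the sample covariance matrix.
edge_dev <- function(X, a, b) {
  n     <- nrow(X)
  Theta <- solve(cov(X))
  pcor  <- -Theta[a, b] / sqrt(Theta[a, a] * Theta[b, b])
  -n * log(1 - pcor^2)
}
# edge_dev(X, 1, 2) > qchisq(0.90, df = 1)  would suggest keeping edge (1, 2) at the 10% level
```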


This method is a nice first approach, but it does lead to some practical problems. We use a significance level in testing the edge exclusion deviance of each edge, but we cannot say anything about the resulting significance level of the whole procedure in which these deviances are tested for all edges separately. This is one of the problems we also did not solve in the previous chapter.

Secondly, we may want to calculate the edge exclusion deviances against the model we have found so far; when we conclude that edge (a, b) should be excluded from the model, we would like to test the exclusion deviance of the next edge using the model without edge (a, b) instead of the full model. The order in which we test edges would then be very influential, so we expect to end up with different outcomes for different orders.

All these problems will be tackled in the next subsection. Instead of testing each edge individually, we will use a method that penalizes having many edges in our model.

The Graphical Lasso

A sparse graph is easier to interpret than a full graph, so when we estimate the graph structure we would like to penalize every nonzero element of Θ. Penalizing the log likelihood with the zero norm (‖Θ‖_0 = Σ_{i,j} 1{θ_ij ≠ 0}) unfortunately leads to a non-convex optimization. Therefore we use a penalty for the log likelihood that looks like the Lagrange function above:

ℓ_2(Θ) = ℓ(Σ) − λ‖Θ‖_1,   (3.1)

where ‖Θ‖_1 is the sum of the absolute values of the entries of Θ. Here λ controls the sparsity of the estimated graph. We obtain a full graph when λ = 0, since this just optimizes the normal log likelihood ℓ_0, and we obtain a sparser graph as λ increases.

From now on we base our results on the optimization in which 3.1 is scaled by 2/n, to keep our results in line with the most common literature. Corollary 1 of [3] then helps us find the smallest λ for which the graph is empty: λ_s = 2 · max_{i,j ∈ {1,...,p}, i≠j} |S_ij|. This gives an interval in which we have to search: λ ∈ [0, λ_s].

We could also try to penalize the log likelihood with a norm that is very close to the L1-norm: ‖Θ‖_{1−ε}. For any ε ∈ (0, 1) we again have to deal with a non-convex optimization. On the other hand, when we optimize using the norm ‖Θ‖_{1+ε} for any ε > 0 we do optimize over a convex space, but this optimization does not lead to a sparse solution for any λ. In conclusion, we use the L1-norm since it both defines a convex optimization and leads to a sparse solution.
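A minimal sketch of this optimization with the R package 'glasso' (also used in section 3.3) is given below; the data matrix and the penalty value are hypothetical, and the exact scaling of the penalty should be taken from the thesis and the package documentation:

```r
# Graphical lasso fit of a sparse concentration matrix, cf. equation 3.1.
library(glasso)
set.seed(2)
X   <- matrix(rnorm(100 * 8), 100, 8)    # hypothetical data: 100 observations, 8 variables
S   <- cov(X)
fit <- glasso(S, rho = 0.1)              # larger rho gives a sparser graph
round(fit$wi, 2)                         # estimated concentration matrix; zeros = missing edges
sum(fit$wi[upper.tri(fit$wi)] != 0)      # number of estimated edges
```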

Meinshausen and Bühlmann [7] show that this optimization, under the right circumstances, indeed leads to the true graphical model:


They first introduce six assumptions which have to be satisfied:

1. The number of variables p is allowed to grow as the number of observations n grows: there exists a γ > 0 such that p(n) = O(n^γ) as n → ∞.

2. For all a ∈ V and n ∈ N: Var(X_a) = 1, and there exists a v² > 0 such that for all n ∈ N and a ∈ V: Var(X_a | X_{V\{a}}) ≥ v². The common variance can always be achieved by an appropriate scaling of the variables.

3. The size of the neighbourhood N(a) of any a ∈ V is restricted in terms of n: there exists some 0 ≤ κ < 1 such that max_{a∈V} |N(a)| = O(n^κ) as n → ∞.

4. The overlap of the neighbourhoods of any two neighbouring nodes is bounded above by an arbitrarily large number: there exists some m < ∞ such that for all n ∈ N: max_{a,b∈V, b∈N(a)} |N(a) ∩ N(b)| ≤ m.

5. The absolute value of the partial correlation π_ab between variables X_a and X_b, which is the correlation after having eliminated the linear effects of all other variables, is bounded from below: there exist a constant δ > 0 and some ξ > κ (with κ as in the third assumption) such that for every edge (a, b) ∈ E: |π_ab| ≥ δ n^{−(1−ξ)/2}.

6. Neighbourhood stability; see Meinshausen and Bühlmann [7].

Under these assumptions the authors prove that the probability of falsely joining two distinct connectivity components can be bounded. A connectivity component C_a ⊆ V of a node a ∈ V is the set of all nodes which are, either directly or via some chain of edges, connected to a.

The probability of falsely connecting two connectivity components is bounded as follows:

P( ∃ a ∈ V : Ĉ_a^λ ⊄ C_a ) ≤ α,

when the regularization parameter λ is chosen as

λ̂ = (2 σ̂_a / √n) · Φ̃^{-1}( α / (2p²) ).


Here Φ̃ = 1 − Φ, where Φ is the cumulative distribution function of N(0, 1), σ̂_a² = ⟨X_a, X_a⟩/n, and α is a value between 0 and 1. We will use this optimization as a basis for the rest of this thesis. In the next chapter we will investigate what has to change when not all relevant variables have been measured, but for now we apply optimization 3.1 to a set of simulated data and to a set of scores on different mathematical subjects.

3.3 Test on Simulated Data

We have tested the second method on a simulated dataset. The sample consists of 100 observations of eight variables, and the true graph has eleven edges. The inverse covariance matrix from which the dataset has been simulated is:

Σ^{-1} =
  6 0 1 0 0 1 0 0
  0 7 2 0 4 0 2 0
  1 2 7 0 1 0 0 2
  0 0 0 7 0 1 0 2
  0 4 1 0 8 2 0 0
  1 0 0 1 2 6 0 1
  0 2 0 0 0 0 7 0
  0 0 2 2 0 1 0 7

We used the R package 'glasso' (which uses the scaled optimization of equation 3.1) to test whether we are able to recover the true graph from the dataset. We used λ = 0.028 to obtain the same sparsity level as in the original model. As you can see in figure 3.1, the estimated graph is almost identical to the true one.

Figure 3.1: Estimating the Graph Structure on Simulated Data. (a) The True Graph; (b) The Estimated Graph.


3.4 An Applied Example

A nice dataset to apply the 'glasso' method to is the set of examination marks of 88 students on 5 different subjects (mechanics, vectors, algebra, analysis and statistics), borrowed from Mardia et al. [13]. The question is whether we can expect a student to be better or worse at one subject, knowing how he or she performs on the other ones.

The first step was defining an interval in which to search for a good value of λ. Using the previously discussed corollary of [3] we found λ_best ∈ [0, 311.07]. We found a graph with six edges for λ = 102, see figure 3.2.

Figure 3.2: Estimated Graph Mathmarks.

As you can see, algebra plays a role in the performance on statistics and analysis, and so does mechanics. But mechanics and algebra have no direct influence on each other, and vectors is only influenced by mechanics. The number of observations in this sample is close to the number of observations in the previous simulated example, and so is the number of variables. This gives us reason to believe that this graph is close to the true graph.


Chapter 4

Estimating the Graph Structure with Latent Variables

In this chapter we will explain what we can do with a dataset when some variables have not been measured. First we briefly discuss a widely used method for finding latent variables that can explain the behaviour of the measured ones. This method is called factor analysis and is explained further in section 4.1. After that, we discuss a newer method which focuses on the structure among all variables, measured and latent.

Latent variables do not have to be taken into account when one wants to estimate the covariance matrix, but we will see in section 4.2 why we run into problems when we estimate the graph structure. Chandrasekaran et al. [1] introduced a method for finding the graphical model with latent variables; their proof will be discussed in section 4.3.

The last two sections test this new method and apply it to the dataset of the previous chapter.

4.1 Factor Analysis

Factor analysis is a method that investigates whether a set of variables can be linearly explained by a smaller set of unobserved variables called factors. This is done in two steps. First one tries to find the best fit with respect to the so-called loading matrix B. This matrix relates the observed variables X_i to the factors F_j as in the following linear system.

X_1 − X̄_1 = β_10 + β_11 F_1 + β_12 F_2 + · · · + β_1h F_h + ε_1,
X_2 − X̄_2 = β_20 + β_21 F_1 + β_22 F_2 + · · · + β_2h F_h + ε_2,
⋮
X_p − X̄_p = β_p0 + β_p1 F_1 + β_p2 F_2 + · · · + β_ph F_h + ε_p.


Here the ε_i are error terms, indicating that this is not an exact relationship between X and F. When a loading β_ij is zero or very close to zero, we say that the factor F_j does not explain the behaviour of variable X_i. When this loading is not close to zero, we expect that the i-th variable can (partly) be explained by the j-th factor. Furthermore, we assume that all factors are uncorrelated with each other and have variance 1, such that Cov(F) = I.

The linear system above can be written as X − X̄ = BF + ε. This is very useful when we want to apply factor analysis to a covariance matrix. The covariance matrix Σ is related to the data X, the loading matrix B and the factors F as follows:

Σ = Cov(X − X̄) = Cov(BF + ε)
  = B Cov(F) B^t + Cov(ε)
  = BB^t + Cov(ε)

Several methods can be used to find such a best fit for the matrix B. A commonly used method is the principal component method which uses an estimation of the spectral decomposition of Σ to define the elements of B. This decomposition is defined as:

Σ̂ = Σ_{i=1}^p λ_i e_i e_i^t,

where the λ_i are the eigenvalues and the e_i the eigenvectors of Σ̂. The eigenvalues are ordered by size: λ_i ≥ λ_j for i < j. Assuming there are h factors, we find the following optimal estimation of Σ.

Σ̂ ≈ Σ_{i=1}^h λ_i e_i e_i^t = ( √λ_1 e_1   √λ_2 e_2   · · ·   √λ_h e_h ) ( √λ_1 e_1   √λ_2 e_2   · · ·   √λ_h e_h )^t = BB^t.

The loadings are thus defined as β_ij = √λ_j e_ji, where e_ji denotes the i-th element of the j-th eigenvector of Σ̂.
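A short R sketch of this principal component estimate of B, assuming a given covariance matrix Sigma and a chosen number of factors h (both hypothetical inputs):

```r
# Principal component estimate of the loading matrix B, so that Sigma ~ B %*% t(B).
pc_loadings <- function(Sigma, h) {
  e <- eigen(Sigma, symmetric = TRUE)    # eigenvalues returned in decreasing order
  e$vectors[, 1:h, drop = FALSE] %*% diag(sqrt(e$values[1:h]), h)
}
```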

Once we have found an optimal loading matrix, we still have to improve our solution. Our goal is to rotate B in such a way that we can easily interpret the dependence of the observed variables on the factors. We explain this in the next example.

Example. Suppose we have three variables which we would like to explain using two factors. We found the following loadings but, as you can see, it is not easy to draw any conclusion from this system.

X_1 − X̄_1 = 0.5 F_1 + 0.5 F_2 + ε_1,
X_2 − X̄_2 = 0.3 F_1 + 0.3 F_2 + ε_2,
X_3 − X̄_3 = 0.5 F_1 − 0.5 F_2 + ε_3.


We visualized these loadings in figure 4.1a. When we rotate both axes by 45°, all three points lie exactly on one of the new axes, see figure 4.1b. This rotation does not change the theoretical variances and covariances of X and therefore also leads to an optimal solution B̂ with respect to the estimation method used for the original B̂.

(a) The First Estimation of B. (b) The Rotated Estimation of B.

Figure 4.1: Rotating the Loading Matrix B.

So the only thing this rotation does is make the loadings we found in the first step easier to interpret, see the system below.

X_1 − X̄_1 = 0.5√2 F_1 + 0 · F_2 + ε_1,
X_2 − X̄_2 = 0.3√2 F_1 + 0 · F_2 + ε_2,
X_3 − X̄_3 = 0 · F_1 − 0.5√2 F_2 + ε_3.

Thus, we would like to rotate B in such a way that all points are as close as possible to as few axes as possible. This means that we would like the matrix B to have as many elements equal to zero as possible.

In conclusion, factor analysis is applied in two steps. First we try to find a best fit B in the decomposition Σ = BB^t + Cov(ε). The second step is rotating B̂ such that as many elements of this matrix as possible are equal to zero.
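In practice the rotation step is usually carried out with a standard criterion such as varimax; the thesis's example uses a simple 45-degree rotation by hand, but the following R sketch reproduces essentially the same rotated loadings (up to sign and column order):

```r
# Varimax rotation of the example loading matrix above.
B <- matrix(c(0.5, 0.3, 0.5,
              0.5, 0.3, -0.5), ncol = 2)   # columns correspond to factors F1 and F2
varimax(B)$loadings                        # rotated loadings with (near-)zero entries
```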


4.2 Graphical Latent Model Selection

Looking back at the Math Marks example, you can imagine that there are some unmeasured variables which also influence the scores on the five different subjects. Differently from factor analysis, we no longer look at them as factors which explain the behaviour of the measured variables, but as extra variables which might influence some of the other ones. So, instead of only estimating the edges between latent and observed variables, we try to find the right edges between any possible pair of variables.

Generally, we do not need latent variables like these to make a good estimate of the covariance matrix. But as you will see, they have a great influence on the selection of the inverse covariance matrix and therefore on the resulting graphical model.

In this thesis we will investigate what we can say about these 'graphical models with latent variables'. The edges of a graphical model of the observed variables generally do not correspond to the relevant edges of the complete model. This becomes clear when we split the data into observed and latent/hidden variables: X = (O, H)^t. The corresponding covariance matrix and its inverse can be written as:

Σ = ( Σ_OO  Σ_OH
      Σ_HO  Σ_HH ),

Σ^{-1} = Θ = ( Θ_OO  Θ_OH
               Θ_HO  Θ_HH ).

Using the Schur complement we see that Σ_OO^{-1} = Θ_OO − Θ_OH Θ_HH^{-1} Θ_HO. Thus, it is not possible to estimate Θ_OO directly from Σ_OO, the matrix which can be estimated directly from the observed data.
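A small numerical illustration of this identity, with a hypothetical 3x3 concentration matrix for two observed and one hidden variable, shows how marginalizing over the hidden variable creates an apparent edge between the observed ones:

```r
# Theta for (O1, O2, H): no edge between O1 and O2, but both connected to H.
Theta <- matrix(c(2.0, 0.0, 0.8,
                  0.0, 2.0, 0.8,
                  0.8, 0.8, 1.5), 3, 3)
Sigma <- solve(Theta)
solve(Sigma[1:2, 1:2])                                               # marginal concentration of O
Theta[1:2, 1:2] - Theta[1:2, 3] %*% t(Theta[1:2, 3]) / Theta[3, 3]   # Schur complement: same matrix
# Note the nonzero off-diagonal entry: the hidden variable induces an apparent O1-O2 edge.
```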

We will estimate this matrix Σ_OO^{-1} under the conditions that Θ_OO implies a sparse graph and that Θ_OH Θ_HH^{-1} Θ_HO has low rank, following the article of Chandrasekaran, Parrilo and Willsky [1]. The second condition corresponds to having a relatively small number of hidden variables compared to the observed ones, as this rank forms a lower bound for the number of latent variables.

For the penalized log likelihood we write the inverse covariance matrix of the observed variables as Σ_OO^{-1} = S − L, with S a matrix with many zeros and L a matrix of low rank.

Thus, we would like to penalize every nonzero element of S. As shown before, the best way to do this is by minimizing

‖S‖_1 = Σ_{j=1}^p Σ_{k=1}^p |s_jk|,

where p is the number of observed variables.

The next problem is how to penalize L for having high rank. Low rank corresponds to many eigenvalues being equal to 0. Thus, for convexity reasons, we would like to minimize the sum of the (absolute values of the) eigenvalues. This sum equals the trace of the diagonal matrix of the singular value decomposition of L, which simply equals the trace of L, since we will assume L to be symmetric and positive semidefinite. So, the penalty on L = UDU^t looks like:

Σ_{i=1}^p λ_i = Tr(D) = Tr(DU^tU) = Tr(UDU^t) = Tr(L).

As a result, the log likelihood we would like to maximize looks like:

ℓ_3(S − L, Σ_OO) = ℓ(S − L, Σ_OO) − λ_1 ‖S‖_1 − λ_2 Tr(L),   (4.1)
subject to S − L ≻ 0, L ⪰ 0.

Here, the extra constraints provide positive-definiteness and positive-semi-definiteness.

This log likelihood formulation implies a convex optimization problem which can be solved in polynomial time.

4.3 Proof of Chandrasekaran et al.

Chandrasekaran et al. [1] proved that maximizing equation 4.1 under certain conditions indeed leads to a good approximation of the real matrices S and L. Their proof will be briefly discussed below.

Our goal is to find estimates Ŝ and L̂ which have a bounded estimation error and which are algebraically consistent estimates of Σ_OO^{-1}, which means that all elements of Ŝ have the same sign as the corresponding elements of S, that the rank of L̂ is equal to the rank of L, and that Ŝ − L̂ can be realized as the marginal concentration matrix of an appropriate model with latent variables.

The authors begin by defining two different varieties. The first one is the set of matrices with a bounded number of nonzero entries:

S(|support(M)|) = { N ∈ R^{p×p} : |support(N)| ≤ |support(M)| },

where support(M) denotes the locations of the nonzero entries of M. M is a smooth point of this variety and the tangent space at M is denoted by Ω(M). Likewise they define the variety of matrices with bounded rank,

L(rank(M)) = { N ∈ R^{p×p} : rank(N) ≤ rank(M) };

M is a smooth point of this variety and the tangent space at M is denoted by T(M).

For identifiability reasons the authors also impose some conditions on the structure between the observed and latent variables and between the observed variables themselves. The first condition concerns the low rank matrix L. When this matrix is sparse itself, the additional correlations due to marginalization over the latent variables could be confused with the graph structure of the observed variables. Therefore, we want the latent variables to be 'spread out' over the observed variables. This means that the support of M will not be concentrated at only a few locations.

Another problem arises when the graph of the observed variables contains a densely connected subgraph. Such connections might be confused with marginalization induced by latent variables. Therefore, we need to bound the number of connections each observed variable has with other observed variables. The maximal number of connections an observed variable has with other observed variables is called the degree of S.

It is possible to uniquely recover the sparse and low rank matrices from Σ_OO^{-1} when the tangent spaces of S and L have a transverse intersection, which means that they only intersect at the origin. To quantify the level of transversality of these two spaces, the authors define the minimum gain with respect to the dual norm of the regularization function f_γ(S, L) = (λ_1/λ_2) ‖S‖_1 + Tr(L) = γ ‖S‖_1 + Tr(L), see 4.1. This function is a norm for all γ > 0 and its dual norm is given by:

g_γ(S, L) = max( ‖S‖_∞ / γ , ‖L‖_2 ).

This norm is used to conclude that when the effect of the latent variables is more 'spread out' over the observed ones, and when the graph of the observed variables contains fewer densely connected subgraphs, the tangent spaces of S and L have a more transverse intersection.

Next the authors define quantities that control the gain of the Fisher information of the true marginal concentration matrix ΘOO restricted to Ω and T . This is to make sure that elements of Ω and T are individually identifiable under the map I.

In conclusion, the authors state that when:

• the effect of the latent variables is ‘diffuse’ enough and the degree of S is not too high,

• γ = λ_1/λ_2 lies within some interval,

• λ2 has a specified value,

• the number of observations is high enough and

• the minimum nonzero singular value of L is not too small, as well as the minimum nonzero entry of S,

then we have algebraic correctness and a bounded estimation error with respect to the dual norm of f_γ, with probability greater than 1 − exp{−p}.


4.4 Test on Simulated Data

We have tested this optimization method on four sets of simulated data based on a model with eight variables, of which one is latent. The true graphical model, including the latent variable, has a sparsity level of 80%. We made four datasets with 10, 200, 600 and 60,000 observations to test on. For the optimization we used the code that professor Kim-Chuan Toh made for the article. The code makes use of the LogDetPPA package [6] and is given in appendix A.

For each dataset we searched for the best result with respect to the correctly recovered non-edges, in the setting where exactly 50% of the edges are found. The best we could do in this setting was finding 69.23% of these non-edges. We compared the Kullback-Leibler divergence and the Bayesian information criterion for these optimal combinations of λ_1 and λ_2, and as expected both the KL divergence and the BIC become smaller as the number of observations increases, see figure 4.2.

(a) Kullback-Leibler Divergence. (b) Bayesian Information Criterion.

Figure 4.2: Comparison Based on the KL-Divergence and BIC.

4.5 An Applied Example

We tried to estimate the graph structure of the dataset of section 3.4 under the assumption that there are some unobserved influences on the performance on the five subjects. We looked for a graph with the same sparsity level as in the non-latent estimation and compared it to the non-latent estimated graphical model.


For λ_1 = 0.085 and λ_2 = 0.0000425 we found the graph of figure 4.3. Even under the assumption of latent variables, we retain all conditional dependencies except for the one between algebra and analysis. The corresponding edge has been replaced by the edge statistics-vectors.

Figure 4.3: Estimated Graph Mathmarks with Latent Variables.


Chapter 5

The Structure of the Latent Variables

Once we have found S (= Θ_OO) and L (= Θ_OH Θ_HH^{-1} Θ_HO), only half of the graph structure is known: S tells us directly what the structure between the observed variables is, but it is hard to draw any conclusion from L. We have tried to come up with some techniques for making a suitable decomposition of L such that the missing parts of Θ imply a sparse graph as well. For this we use some known properties (L is symmetric and Θ_OH = Θ_HO^t) and some assumptions which are explained further below. The rank of L determines the number of latent variables in all cases.

5.1 No Structure Between the Latent Variables

First we assume that the latent variables themselves imply an empty graph. This results in Θ_HH (and Θ_HH^{-1}) being a diagonal matrix. And since Θ_OH = Θ_HO^t, we can simply use the eigendecomposition of L to find the missing parts of Θ. Below you see an applied example of this method.
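Under this assumption one convenient normalization is Θ_HH = I, so that L = Θ_OH Θ_OH^t; the following R sketch then recovers one valid choice of Θ_OH (unique only up to an orthogonal rotation of the latent variables) from the eigendecomposition of L:

```r
# Recover Theta_OH from L when Theta_HH is taken to be the identity, with r = rank(L).
recover_theta_oh <- function(L, r) {
  e <- eigen(L, symmetric = TRUE)                       # eigenvalues in decreasing order
  e$vectors[, 1:r, drop = FALSE] %*% diag(sqrt(e$values[1:r]), r)
}
```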

Example 1. Suppose you find the following matrix:

L =
  2.1200  2.0400  1.0400  2.0800  0.5200
  2.0400  4.6800  1.6800  3.3600  0.8400
  1.0400  1.6800  1.0133  2.0267  0.5067
  2.0800  3.3600  2.0267  4.0533  1.0133
  0.5200  0.8400  0.5067  1.0133  0.2533

Using its eigendecomposition we find the following precision matrices:

Θ_OH =
  −0.8750   0.3171   0.3659
  −0.0201  −0.7789   0.6268
   0.2111   0.2362   0.3002
   0.4222   0.4723   0.6004
