
Variational Inference in Graphical Models

Nikki Levering

July 13, 2017

Bachelor Thesis

Supervisor: dr. B.J.K. Kleijn

Korteweg-de Vries Instituut voor Wiskunde


Abstract

A graphical model is a graph in which the nodes index random variables and the edges define the dependence structure between those random variables. There are different methods of computing probability distributions in such graphical models. However, there are graphical models in which exact algorithms need, even in the best case, exponential time to compute probabilities, making approximation methods desirable in large graphical models. Variational inference is an approximation method that uses convex analysis to transform the computation of a probability into an optimization problem, in graphical models where the conditional probability distributions are exponential families. This optimization problem concerns the cumulant function. However, the optimization itself is often also difficult, because the set that is optimized over is too complex. Mean Field approximation deals with this problem by optimizing over only a subset of the original set. This results in a lower bound for the cumulant function that needs to be maximized. Maximizing this lower bound is equivalent to minimizing the Kullback-Leibler divergence between the true distribution and a family of tractable distributions. There are several algorithms that minimize this Kullback-Leibler divergence in graphical models, of which Variational Message Passing is one.

Title: Variational Inference in Graphical Models

Author: Nikki Levering, nikki.levering@student.uva.nl, 10787992

Supervisor: dr. B.J.K. Kleijn

Second reader: prof. dr. J.H. van Zanten

Date: July 13, 2017

Korteweg-de Vries Instituut voor Wiskunde
Universiteit van Amsterdam
Science Park 904, 1098 XH Amsterdam
http://www.science.uva.nl/math

(3)

Contents

1. Introduction 4

2. Graphical Models 7
2.1. Directed Graphical models . . . 7
2.2. Undirected Graphical models . . . 10

3. Exact Algorithms 12
3.1. The elimination algorithm . . . 12
3.2. NP-completeness . . . 14

4. Variational Inference 17
4.1. Transforming the cumulant function . . . 18
4.2. Mean Field Approximation . . . 22
4.3. Variational Message Passing . . . 24

5. Applications 28
5.1. Approximation of the posterior . . . 28
5.2. Variational Bayes . . . 30

6. Conclusion and Discussion 34

7. Popular summary (in Dutch) 36

Bibliography 38

A. Proofs, chapter 2 40

B. Proof, chapter 3 44


1. Introduction

Inference in statistics is a process in which data is analyzed to deduce the properties of the underlying distribution of the data. In this thesis we will focus on inference in graphical models. A directed (resp. undirected) graphical model is a directed acyclic (resp. undirected) graph G = (V, E) (where V are the nodes and E the edges of the graph). The nodes index a collection of random variables {Xv : v ∈ V}. The edges in a graphical model indicate its dependence structure. The intuitive interpretation of the dependence structure given by these edges is that if there is an edge from a node X1 to a node X2, the distribution of X2 will depend on the value of X1. Examples of a directed and an undirected graphical model are shown in figure 1.1.

Figure 1.1.: Example of a directed graphical model (left), with nodes that index the random variables X1, . . . , X4, and an example of an undirected graphical model (right), where the random variables Y1, . . . , Y3 are indexed by the nodes.

The goal of statistical inference in graphical models is to compute the underlying distributions in the graphical model. When only the graphical model but no data is available, one is often interested in a marginal distribution or the joint probability distribution for the model. However, when the values of some of the random variables have already been observed, one often wants to compute the probability distribution of the variables not observed, conditioned on the observed variables. The nodes corresponding to these observed variables are then called evidence nodes.

Graphical models are not only used in Mathematics, but also in Physics, Artificial Intelligence and Biology. For example, in Biology one makes use of the so-called QMR-DT database, a probabilistic database used for the diagnosis of diseases in Medicine [7]. The database can be represented as a bipartite graphical model, where one layer of nodes represents symptoms of diseases and the other layer represents the diseases themselves, and where a directed edge is present from disease to symptom if that disease is known to cause that symptom (see figure 1.2). The data is given by the observed symptoms of a patient and the marginal distributions for the diseases. Statistical inference can now be performed to deduce the conditional probability of finding these symptoms given the diseases, but other underlying distributions of the data can also be deduced.


From this example in Medicine it can be seen that statistical inference in graphical models has applications in other fields of study. This makes it a very relevant subject.

There are several methods of performing statistical inference in graphical models. These can be separated into exact and approximation methods; the focus in this thesis will be on certain methods of the latter. Exact statistical inference computes the underlying distribution in an exact way, meaning that the properties deduced for the underlying distribution of the given data are the properties of the true distribution. Approximation methods for statistical inference only approximate the true distribution, which would make these kinds of methods less desirable. However, as will be shown in this thesis, exact algorithms for performing statistical inference in graphical models are time consuming, especially when the models are large, and therefore the approximation methods are nevertheless very relevant.

As said, we will in this thesis focus on certain methods of statistical inference that approximate the true distribution. These methods are called variational methods and their use is called variational inference. In performing variational inference in graphical models we will focus on those graphical models for which the joint probability distribution is finitely-parameterized and in exponential family form. This means we can write, for a graphical model with nodes that index the dependent random variables X1, . . . , Xn (with realized values x1, . . . , xn) and some parameter vector θ:

pθ(x1, . . . , xn) = exp{⟨θ, φ(x1, . . . , xn)⟩ − A(θ)},

where φ(x1, . . . , xn) is a vector of complete and sufficient statistics and A(θ) is defined as

A(θ) = log ∫ exp{⟨θ, φ(x1, . . . , xn)⟩} ν(d(x1, . . . , xn)).

For more details about the exponential family, read the beginning of chapter 4.

The idea behind variational inference is now to identify a probability distribution as the solution of an optimization problem. For exponential families it turns out that computing A(θ) can be identified with the following optimization problem:

A(θ) = sup_{µ ∈ M} {⟨θ, µ⟩ − A∗(µ)},   (1.1)

where the set of realizable mean parameters M and the conjugate dual A∗(µ) will be defined in section 4.1. Computing the underlying distribution of the data in a graphical model where the joint probability distribution is in exponential family form is then reduced to solving the optimization problem in (1.1). However, solving this problem is not necessarily easy, because A∗(µ) often does not have an explicit form and M may be such that optimization over this set is hard.
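To make the variational representation (1.1) concrete, the following sketch works out the simplest case, a Bernoulli variable with sufficient statistic φ(x) = x. This example is not taken from the thesis but is a standard computation for exponential families:

```latex
% Bernoulli family: \phi(x) = x, cumulant function A(\theta) = \log(1 + e^{\theta}).
% The conjugate dual is the negative entropy, for \mu \in (0,1):
\[
  A^*(\mu) = \mu \log \mu + (1-\mu)\log(1-\mu).
\]
% The supremum in (1.1) is attained where
% \frac{d}{d\mu}\{\theta\mu - A^*(\mu)\} = 0, i.e.
\[
  \theta = \log\frac{\mu}{1-\mu}
  \quad\Longrightarrow\quad
  \mu = \frac{e^{\theta}}{1+e^{\theta}},
\]
% and substituting this maximizer back in recovers the cumulant function:
\[
  \sup_{\mu \in (0,1)} \{\theta\mu - A^*(\mu)\}
  = \log(1 + e^{\theta}) = A(\theta).
\]
```

Here the supremum is attained at the mean parameter µ = E_θ[X], which is the general pattern behind (1.1).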

A method for solving this optimization problem is given by Mean Field approximation, which uses only distributions for which M and A∗(µ) can be identified, called tractable distributions. As it turns out, solving the optimization problem then amounts to minimizing the so-called Kullback-Leibler divergence, which is defined as

D(q‖p) = ∫_X q(x) log (q(x)/p(x)) ν(dx),


Figure 1.2.: The structure of the QMR-DT graphical model: a bipartite graphical model where the upper layer of nodes represents diseases and the layer below represents symptoms. The grey nodes are the symptoms that have been observed. [7]

for the true distribution p and a tractable distribution q. So variational inference in graphical models with joint probability distribution in exponential family form amounts to minimizing the Kullback-Leibler divergence between the true distribution and all tractable distributions.
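For discrete distributions the Kullback-Leibler divergence reduces to a sum and can be computed directly. The sketch below uses invented numbers for p and q; it only illustrates the definition, not the Mean Field optimization itself.

```python
import math

def kl_divergence(q, p):
    """D(q || p) = sum_x q(x) * log(q(x) / p(x)) for discrete distributions.

    q and p are dicts mapping outcomes to probabilities. D(q || p) >= 0,
    with equality iff q == p, and it is not symmetric in q and p.
    """
    return sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0)

# Hypothetical true distribution p and tractable approximation q:
p = {0: 0.5, 1: 0.3, 2: 0.2}
q = {0: 0.4, 1: 0.4, 2: 0.2}

print(kl_divergence(q, q))       # 0.0: zero divergence from itself
print(kl_divergence(q, p) >= 0)  # True: the divergence is nonnegative
```

Note that D(q‖p) and D(p‖q) generally differ; Mean Field approximation minimizes the first form, with q ranging over the tractable distributions.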

In this thesis the goal will be to study how variational inference in graphical models works. To achieve this goal, in chapter 2 we will first look at graphical models and how to define probability distributions for the random variables in them. Then in chapter 3 we will look at exact inference by giving an example of an exact algorithm; we will also prove that exact algorithms are time consuming. Next, in chapter 4, we will analyze variational inference and Mean Field approximation, together with the Variational Message Passing algorithm, an example of an algorithm performing variational inference. In the last chapter, chapter 5, some statistical applications of variational inference are shown, with a focus on applications in Bayesian statistics.


2. Graphical Models

Before one can perform statistical inference in graphical models, one needs to know how the probabilities in such models are defined. So a structure on the models has to be defined that indicates how probabilities in graphical models can be computed. The dependence of the random variables in a graphical model is given by the edges in the graph. This structure differs for directed and undirected graphical models, so these cases will be treated separately.

2.1. Directed Graphical models

In a directed graphical model the direction of the edges between the nodes defines the dependencies between the random variables indexed by the nodes.

Definition 2.1.1. Let G = (V, E) be a directed acyclic graph, where V = {V1, . . . , Vn}, n ∈ N.

(i) When Vi → Vj ∈ E, for some i, j ∈ {1, . . . , n}, Vi is called the parent of Vj in G. Denote the set of all parents of a node Vi as πVi (i ∈ {1, . . . , n}).

(ii) Vj is called a descendant of Vi if there exists a directed path from Vi to Vj, i, j ∈ {1, . . . , n}. [9]

In the directed graphical model in figure 1.1, X1 → X2, so X1 is a parent of X2. The absence of another directed edge towards X2 gives πX2 = {X1}. In the graph, X1 has a path to X2, X3 and X4, leading to the conclusion that X2, X3 and X4 are descendants of X1.

To compute probabilities in a directed graphical model, the semantics of this model have to be defined. There are two ways to define the semantics of a directed graphical model: one regarding local probabilities and one regarding the global probabilities in the model. To compute probabilities, consistency between both definitions is needed. As it turns out the two definitions are indeed equivalent, which means probabilities in a directed graphical model can be computed. A directed graphical model satisfying either of the definitions is called a Bayesian network. In this section the two definitions regarding the semantics of a Bayesian network are given together with a theorem which proves their equivalence.

Definition 2.1.2. Let G = (V, E) be a directed acyclic graphical model with {Xv : v ∈ V} a collection of random variables indexed by the nodes of G. G defines a set of conditional independence assumptions, called the local independencies or local Markov assumptions: for i, j ∈ N where Xj is not a descendant of Xi,

p(Xi, Xj | πXi) = p(Xi | πXi) p(Xj | πXi).

Denote this conditional independence by (Xi ⊥ Xj | πXi). [9]

From the definition it follows that, given its parents, each node Xi is conditionally independent of the nodes that are not its descendants. A directed graphical model is by definition acyclic. This property is very important, because it means that a node Xj cannot be both a descendant and a parent of a node Xi. If this were possible, Xi would be a parent of Xj and Xj again a parent of Xi, which also implies that Xi is a parent of Xi itself. This gives rise to an infinite loop, where Xi depends on Xj, Xj on Xi, Xi again on Xj, and so on. To avoid this situation it is indeed important for a directed graphical model to be acyclic.

Figure 2.1.: The three possible directed acyclic graphical models consisting of three nodes.

Example 2.1.3. The three structures for a directed acyclic graphical model consisting of three nodes are shown in figure 2.1. In the leftmost model there is conditional independence between the random variables X2 and X3 given the value of X1, due to the fact that there is no path from either X2 to X3 or from X3 to X2, and X1 is a parent of X2. This leads to the following local independence:

p(X2, X3 | X1) = p(X2 | X1) p(X3 | X1).

In the graphical model with random variables Y1, Y2 and Y3, there is a local independence between Y1 and Y3 given the value of Y2, because Y2 is a parent of Y3 and Y1 is not a descendant of Y3. This gives us

p(Y1, Y3 | Y2) = p(Y1 | Y2) p(Y3 | Y2).

In the rightmost graphical model the random variables Z1 and Z2 are not conditionally independent given Z3:

p(Z1, Z2 | Z3) ≠ p(Z1 | Z3) p(Z2 | Z3).

Using the local Markov assumptions, a dependence structure is defined on Bayesian networks from which local probabilities can be computed. From the other definition regarding the semantics of the Bayesian network, global probabilities can be computed. That definition is the following:


Definition 2.1.4. Let G = (V, E) be a directed acyclic graphical model, with {Xv : v ∈ V} a collection of random variables indexed by the nodes of G. A distribution p over the same space factorizes according to G if p can be expressed as a product

p(X1, . . . , Xn) = ∏_{i=1}^{n} p(Xi | πXi).   (2.1)

This equation is called the chain rule for Bayesian networks, and the factors p(Xi | πXi) local probabilistic models. [9]

This definition provides a global structure on the Bayesian network. It can now be proven that both definitions of the independence structure on a Bayesian network are in fact equivalent, as stated in the following theorem, whose proof can be found in appendix A:

Theorem 2.1.5. Let G = (V, E) be a Bayesian network over a set of random variables {Xv : v ∈ V} and let p be a joint distribution over the space on which {Xv : v ∈ V} are defined. All the local Markov assumptions associated with G hold in p if and only if p factorizes according to G. [10]

The equivalence between the two definitions of a Bayesian network leads to consistency and gives a structure on the Bayesian network that allows for computing both local and global probability distributions.
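As a quick illustration of the chain rule (2.1), the following sketch builds the joint distribution of a three-node model from conditional probability tables, assuming the fork structure X2 ← X1 → X3 described for the leftmost graph of example 2.1.3; the probability values are invented for the example.

```python
# Chain rule for Bayesian networks on the fork X2 <- X1 -> X3,
# with hypothetical binary conditional probability tables.
p_x1 = {0: 0.6, 1: 0.4}                      # p(X1)
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3},        # p(X2 | X1)
                 1: {0: 0.2, 1: 0.8}}
p_x3_given_x1 = {0: {0: 0.9, 1: 0.1},        # p(X3 | X1)
                 1: {0: 0.5, 1: 0.5}}

def joint(x1, x2, x3):
    """p(X1, X2, X3) = p(X1) p(X2 | X1) p(X3 | X1), equation (2.1)."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x1[x1][x3]

# The factorized joint is a proper distribution: it sums to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 10))  # 1.0
```

The local Markov assumption p(X2, X3 | X1) = p(X2 | X1) p(X3 | X1) holds here by construction, in line with theorem 2.1.5.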

Figure 2.2.: Graphical model of a random variable X with parameter λ and hyperparameter η.

Example 2.1.6. To show how graphical models can be used, an example will be given that also shows the connection between graphical models and Bayesian statistics. Let X be a random variable with a Poisson(λ) distribution. Let λ not be exactly known, but also have a Poisson distribution, with parameter η. Then η is a hyperparameter for λ. Let η have a Γ(α, β) distribution, where α and β are known. A directed graphical model can now be drawn, as shown in figure 2.2. From Bayesian statistics it is known that X|λ, η = X|λ. So:

X|λ ∼ Poisson(λ), λ|η ∼ Poisson(η), η ∼ Γ(α, β).

Because X|λ, η = X|λ, the local Markov assumptions hold and thus the directed graphical model is a Bayesian network. This means that the joint probability distribution p(X, λ, η) should be a factorized distribution:

p(X, λ, η) = p(X|λ)p(λ|η)p(η). This is indeed the case.
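A minimal simulation sketch of this hierarchy (not from the thesis, with hypothetical values α = 3, β = 2): ancestral sampling draws each variable given its parents, mirroring the factorization p(X, λ, η) = p(X|λ)p(λ|η)p(η).

```python
import math
import random

# Ancestral sampling for the hierarchy of example 2.1.6:
# eta ~ Gamma(alpha, beta), lambda | eta ~ Poisson(eta), X | lambda ~ Poisson(lambda).
# The parameter values alpha = 3, beta = 2 are hypothetical.
random.seed(0)

def poisson(mean):
    """Knuth's method for a Poisson draw (adequate for small means)."""
    threshold, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

alpha, beta = 3.0, 2.0
eta = random.gammavariate(alpha, 1.0 / beta)  # Gamma with shape alpha, rate beta
lam = poisson(eta)                            # lambda | eta ~ Poisson(eta)
x = poisson(lam)                              # X | lambda ~ Poisson(lambda)
print(eta > 0, lam >= 0, x >= 0)  # True True True
```

Sampling parent-before-child in this way is possible precisely because the graph is acyclic and the joint factorizes according to the chain rule (2.1).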


2.2. Undirected Graphical models

Having defined the structure on a Bayesian network, we now turn to the structure on an undirected graphical model. Again a local and a global definition of the independence structure in the model are presented, which turn out to be equivalent. An undirected graphical model with this structure is called a Markov Random Field (MRF).

Definition 2.2.1. Let H = (X, D) be an undirected graph. Then for each node X ∈ X the Markov blanket N_H(X) of X is the set of neighbors of X in the graph. The local Markov independencies associated with H are defined to be

I_ℓ(H) = {(X ⊥ X − {X} − N_H(X) | N_H(X)) : X ∈ X}.

In other words, X is conditionally independent of the rest of the nodes in the graph given its immediate neighbors. [10]

Example 2.2.2. In the undirected graphical model in figure 1.1 these local Markov independencies translate to Y1 and Y3 being conditionally independent given Y2, because Y2 is an immediate neighbor of Y3, but Y1 is not.

The global structure of an MRF is defined by its joint probability function. The joint probability distribution of an MRF is to be computed in terms of factors:

Definition 2.2.3. Let Y be a set of random variables. We define a factor to be a function from dom(Y ) to R+. [10]

Definition 2.2.4. Let H = (X, D) be a Markov Random Field, where the nodes index the random variables {X1, . . . , Xn}. A distribution p factorizes over H if it is associated with

• a set of subsets Y1, . . . , Ym, where each Yi is a complete subgraph of H, meaning that every two nodes in Yi share an edge;

• factors ψ1[Y1], . . . , ψm[Ym], such that

p(X1, . . . , Xn) = (1/Z) p′(X1, . . . , Xn),   (2.2)

where

p′(X1, . . . , Xn) = ψ1[Y1] × ψ2[Y2] × · · · × ψm[Ym]

is a measure that is normalized by Z = Σ_{X1,...,Xn} p′(X1, . . . , Xn). Z is called the partition function. A distribution p that factorizes over H is also called a Gibbs distribution over H. [10]

Example 2.2.5. In the MRF of figure 2.3 the subgraph consisting of the nodes X4, X5 and X6 is complete and will be denoted by Y4. There are three other complete subgraphs of the MRF, denoted Y1, Y2, Y3 and shown in the right model of figure 2.3.

Figure 2.3.: Undirected graphical model (left) and a model of its complete subgraphs (right), with Y1 = {X1, X3}, Y2 = {X2, X3}, Y3 = {X3, X4, X5} and Y4 = {X4, X5, X6}.

On each of the complete subgraphs Yi a factor ψi[Yi] is defined, which is a function of the nodes in the complete subgraph and is positive. The Gibbs distribution p on the MRF becomes, with Z a normalising constant:

p(X1, . . . , X6) = (1/Z) ψ1[Y1] · ψ2[Y2] · ψ3[Y3] · ψ4[Y4].
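To make definition 2.2.4 concrete, the following sketch computes a partition function by brute force for the small chain MRF Y1 − Y2 − Y3 of figure 1.1, whose complete subgraphs are {Y1, Y2} and {Y2, Y3}. The factor values are hypothetical; brute-force summation is only feasible for tiny models, which is the point of the exact and variational algorithms in later chapters.

```python
from itertools import product

# Hypothetical positive factors on the chain MRF Y1 - Y2 - Y3, binary variables.
psi1 = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}  # psi1[Y1, Y2]
psi2 = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}  # psi2[Y2, Y3]

def p_unnormalized(y1, y2, y3):
    """p'(Y1, Y2, Y3) = psi1[Y1, Y2] * psi2[Y2, Y3], as in definition 2.2.4."""
    return psi1[(y1, y2)] * psi2[(y2, y3)]

# Partition function Z: the sum of p' over all joint states.
Z = sum(p_unnormalized(*ys) for ys in product((0, 1), repeat=3))

def p(y1, y2, y3):
    """The Gibbs distribution p = p' / Z."""
    return p_unnormalized(y1, y2, y3) / Z

print(round(sum(p(*ys) for ys in product((0, 1), repeat=3)), 10))  # 1.0
```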

Theorem 2.2.6. Let H = (X, D) be an MRF and p a Gibbs distribution over H. Then all the local Markov properties associated with H hold in p. [10]

This theorem can be proven by showing that p(Xi | X \ Xi) = p(Xi | N_H(Xi)). The right-hand side of the equation can be written out, using the fact that a Gibbs distribution exists for the MRF and making use of Bayes' theorem, which leads to the left side of the equation. Because those are the only elements needed in the proof, the proof is omitted here; it can be found in [2].

From this theorem it can be concluded that the global structure of an MRF, as defined by its joint probability distribution, implies that the local Markov properties hold. In fact, it can be proven that the reverse implication also holds, making both structures equivalent. This result is known as the Hammersley-Clifford theorem.

Theorem 2.2.7. (Hammersley-Clifford) Let H = (X, D) be an MRF and p a positive distribution over X, meaning p(x) > 0 for all x ∈ Val(X). If the local Markov properties implied by H hold in p, then p is a Gibbs distribution over H. [10]


3. Exact Algorithms

Now that the local and global structures of Bayesian networks and Markov Random Fields are defined, the joint probability distribution is known to be a product of conditional probability distributions, respectively a product of factors with a normalization constant. A marginal distribution for a certain random variable Xi, corresponding to a node in the graph, can now be found by summing over all other variables in the joint probability distribution. However, as can be seen in example 3.1.1, the computational complexity of computing a marginal distribution in this way can be reduced significantly by using exact inference algorithms. The next section will treat an exact inference algorithm, the so-called elimination algorithm, to see how this computational complexity can be reduced. The chapter is then closed with some general results about exact inference algorithms and their computational complexity.

3.1. The elimination algorithm

Before treating the elimination algorithm formally, the following example is presented, showing how the elimination algorithm is used and how it reduces computational complexity.

Figure 3.1.: An MRF H with the discrete random variables X1, . . . , X5.

Example 3.1.1. Consider the MRF H presented in figure 3.1, for which the marginal of X1 is asked. Assume that the cardinality of each variable is r, so |Xi| = r for i = 1, . . . , 5. It can be seen that when taking Y1 = {X1, X2, X3}, Y2 = {X2, X4} and Y3 = {X2, X3, X5}, each Yi is a complete subgraph of H. This means that when introducing the factors ψi[Yi] the joint distribution function becomes

p(X1, . . . , X5) = (1/Z) ψ1[Y1] ψ2[Y2] ψ3[Y3] = (1/Z) ψ1[X1, X2, X3] ψ2[X2, X4] ψ3[X2, X3, X5],

where Z is the corresponding partition function normalizing the distribution function. The marginal of X1 can now be obtained by summing over the remaining variables, because the variables are discrete. This gives the following marginal:

p(X1) = Σ_{X2} Σ_{X3} Σ_{X4} Σ_{X5} (1/Z) ψ1[X1, X2, X3] ψ2[X2, X4] ψ3[X2, X3, X5].

The computational complexity is now r^5, because each summation is applied to a term containing 5 variables of cardinality r [6]. However, note that the variable X5 does not appear in ψ1[X1, X2, X3] or ψ2[X2, X4]. The distributive law then gives rise to the following:

Σ_{X5} (1/Z) ψ1[X1, X2, X3] ψ2[X2, X4] ψ3[X2, X3, X5]
= (1/Z) ψ1[X1, X2, X3] ψ2[X2, X4] Σ_{X5} ψ3[X2, X3, X5]
= (1/Z) ψ1[X1, X2, X3] ψ2[X2, X4] m5(X2, X3),

where m5(X2, X3) := Σ_{X5} ψ3[X2, X3, X5]. Making use of the distributive law repeatedly, the marginal of X1 can be computed as follows:

p(X1) = Σ_{X2} Σ_{X3} Σ_{X4} Σ_{X5} (1/Z) ψ1[X1, X2, X3] ψ2[X2, X4] ψ3[X2, X3, X5]
= (1/Z) Σ_{X2} Σ_{X3} Σ_{X4} ψ1[X1, X2, X3] ψ2[X2, X4] m5(X2, X3)
= (1/Z) Σ_{X2} Σ_{X3} ψ1[X1, X2, X3] m5(X2, X3) m4(X2)
= (1/Z) Σ_{X2} m4(X2) m3(X1, X2)
= (1/Z) m2(X1),

with m4(X2) = Σ_{X4} ψ2[X2, X4], m3(X1, X2) = Σ_{X3} ψ1[X1, X2, X3] m5(X2, X3) and m2(X1) = Σ_{X2} m4(X2) m3(X1, X2). The number of variables that appear together in a summation is no more than 3, reducing the computational complexity from r^5 to r^3 [6], which for large r is a significant improvement.

Computing the marginal of X1 using the distributive law in the example above is an example of the elimination algorithm: in each step a new variable is eliminated by summing out this variable, where the summation is only performed over the factors that contain the variable to be eliminated.


The variable elimination algorithm will now be stated more formally, based on a description in [9]. Let Ψ be a set of factors and W a subset of k variables which are to be eliminated. Let W1, . . . , Wk be an ordering of W. When the current set of factors is Ψi, eliminating the variable Wi is done by partitioning Ψi into two sets: one with those factors that have Wi as an argument and one with those factors that do not. The second set, denoted Ψ″i, is clearly the complement of the first set, denoted Ψ′i, so that Ψ′i ∪ Ψ″i = Ψi. Now a new factor φ can be defined as the product of all factors with Wi as argument: φ := ∏_{ψ ∈ Ψ′i} ψ. Define τ := Σ_{Wi} φ as the factor φ with Wi summed out. The variable Wi is now eliminated, and a new set of factors Ψi+1, from which the next variable is to be eliminated, can be defined as Ψi+1 = Ψ″i ∪ {τ}. Now, given an ordering W1, . . . , Wk, first W1 is eliminated (with Ψ1 := Ψ), then W2, etc. When Wk is eliminated, Ψk+1 is the set of factors left and, by the construction of the algorithm, no factor in Ψk+1 contains any Wi, i ∈ {1, . . . , k}, as an argument.

Let Y be a set of variables for which the probability distribution is asked in a Bayesian network consisting of the variables X = {X1, . . . , Xn}. This probability distribution p can now be found by the use of the above algorithm, eliminating the variables W = X − Y. This can be done by choosing the set of factors as the set of conditional probability distributions: Ψ = {ψ_Xi}_{i=1}^{n}, where ψ_Xi = p(Xi | πXi). For an MRF consisting of the variables X = {X1, . . . , Xn}, with Y ⊂ X the set of variables for which a probability distribution is asked, the algorithm works in the same way, only with the set of factors chosen as the factors of which the joint probability distribution consists.
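The partitioning procedure above can be sketched in code. The following is a minimal illustration, not from the thesis, run on the MRF of example 3.1.1 with invented binary factors; the helpers `multiply`, `sum_out` and `eliminate` correspond to forming φ, τ and the sequence Ψ1, Ψ2, . . ..

```python
from itertools import product

# A factor is a pair (scope, table): scope is a tuple of variable names,
# table maps each joint assignment of the scope to a positive number.

def multiply(factors):
    """phi := the product of all factors that mention the variable (Psi'_i)."""
    scope = tuple(sorted({v for s, _ in factors for v in s}))
    table = {}
    for assign in product((0, 1), repeat=len(scope)):
        env = dict(zip(scope, assign))
        val = 1.0
        for s, t in factors:
            val *= t[tuple(env[v] for v in s)]
        table[assign] = val
    return scope, table

def sum_out(factor, var):
    """tau := phi with the variable summed out."""
    scope, table = factor
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    new_table = {}
    for assign, val in table.items():
        key = assign[:i] + assign[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + val
    return new_scope, new_table

def eliminate(factors, ordering):
    """Eliminate the variables in order: Psi_{i+1} = Psi''_i + {tau}."""
    for w in ordering:
        with_w = [f for f in factors if w in f[0]]         # Psi'_i
        without_w = [f for f in factors if w not in f[0]]  # Psi''_i
        factors = without_w + [sum_out(multiply(with_w), w)]
    return factors

# Hypothetical factors psi1[X1,X2,X3], psi2[X2,X4], psi3[X2,X3,X5]:
psi1 = (("X1", "X2", "X3"), {a: 1.0 + sum(a) for a in product((0, 1), repeat=3)})
psi2 = (("X2", "X4"), {a: 2.0 - 0.5 * a[1] for a in product((0, 1), repeat=2)})
psi3 = (("X2", "X3", "X5"), {a: 1.0 + a[0] * a[1] for a in product((0, 1), repeat=3)})

# Eliminating X5, X4, X3, X2 leaves a single factor over X1 only:
(scope, table), = eliminate([psi1, psi2, psi3], ["X5", "X4", "X3", "X2"])
Z = sum(table.values())
marginal = {x1: table[(x1,)] / Z for x1 in (0, 1)}
print(scope)                                 # ('X1',)
print(round(marginal[0] + marginal[1], 10))  # 1.0
```

The elimination ordering X5, X4, X3, X2 reproduces the messages m5, m4, m3, m2 of example 3.1.1; a different ordering would be equally correct but could produce larger intermediate factors.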

3.2. NP-completeness

As shown in example 3.1.1, the computational complexity of obtaining a marginal is reduced significantly by the use of the variable elimination algorithm instead of summing out all variables over all factors. The elimination algorithm is not the only exact algorithm to compute a marginal probability distribution on Bayesian networks and Markov Random Fields. However, there are results which hold for all these exact algorithms and concern their computational complexity.

First the classes NP, P and #P are introduced. Certain problems can be categorized into these classes based on their computational complexity. Computing probabilities in graphical models using exact algorithms is such a problem. The goal in this section will be to prove that in the best case exponential time is needed to solve such problems. This means that there exist graphical models for which the use of such exact algorithms takes exponential time. If such a graphical model is large, the use of exact algorithms in this model will thus be very time consuming. Any information given on NP, P and #P is based on Cook [3], Koller and Friedman [9], [10] and Valiant [15].

Definition 3.2.1. A decision problem is a problem for which the return values are only true or false (or yes or no, or 0 or 1). P is the class of all decision problems that can be solved in polynomial time. NP is the class of all decision problems for which a positive answer can be verified in polynomial time. Solvability in polynomial time means that a Turing machine needs at most polynomial time in the worst case to solve the problem.


The following example is a decision problem that is in NP.

Example 3.2.2. In a SAT problem a formula in propositional logic is taken as input. True is returned when there is an assignment which satisfies the formula; if no such assignment exists, false is returned. An example is the formula q1 ∨ q2. For this formula true is returned, q1 = true, q2 = true being a satisfying assignment. However, for the formula q1 ∧ ¬q1 no satisfying assignment exists, resulting in false as output. A restricted version of this SAT problem is the 3-SAT problem.

A formula φ is said to be a 3-SAT formula over the Boolean variables q1, . . . , qn if φ is a so-called conjunction: φ = C1 ∧ · · · ∧ Cm. Each Ci is a clause of the form ℓi,1 ∨ ℓi,2 ∨ ℓi,3. Each ℓi,j (i = 1, . . . , m; j = 1, 2, 3) is either qk or ¬qk for some k = 1, . . . , n. Given a 3-SAT formula, the decision problem is whether this formula can be satisfied. So the decision problem is whether there is an assignment for q1, . . . , qn that makes φ return true. If there were an algorithm producing such an assignment in polynomial time for all 3-SAT formulas, the problem would be in P. Let a positive answer be given, i.e. an assignment for q1, . . . , qn that makes φ return true. If such an assignment can be verified in polynomial time, for all φ and all positive answers, then the problem is in NP.

As it turns out, the 3-SAT decision problem is in NP. An example of a 3-SAT formula is φ = (q1 ∨ ¬q2 ∨ ¬q3) ∧ (q1 ∨ q2 ∨ ¬q3). Trivially, q1 = true, q2 = false, q3 = false is an assignment of q1, . . . , q3 that makes φ return true. Also, this can trivially be verified in polynomial time. This is in general the case with 3-SAT problems, so the 3-SAT decision problem is in NP. However, no one has yet found an algorithm that proves the problem is also in P.
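The decision and counting versions of this example can be checked by brute force over all 2^3 assignments; the encoding below is an illustration, not part of the thesis. Note that brute force takes exponential time in n, which is exactly why membership in P remains open.

```python
from itertools import product

# The 3-SAT formula from the example:
# phi = (q1 v ~q2 v ~q3) ^ (q1 v q2 v ~q3).
# A literal is (index, negated); a clause is a list of literals.
phi = [[(0, False), (1, True), (2, True)],
       [(0, False), (1, False), (2, True)]]

def satisfies(assignment, formula):
    """True iff every clause contains at least one true literal."""
    return all(any(assignment[i] != neg for i, neg in clause)
               for clause in formula)

# The decision problem (NP): does a satisfying assignment exist?
print(satisfies((True, False, False), phi))  # True: the assignment from the text

# The counting problem (#P): how many satisfying assignments are there?
count = sum(satisfies(a, phi) for a in product((False, True), repeat=3))
print(count)  # 6
```

Verifying one given assignment is a single pass over the clauses (polynomial time, hence NP membership), while counting in this sketch visits all 2^n assignments.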

Instead of wanting to check a positive answer for a decision problem, sometimes it is preferred to know how many positive solutions there are. For example, the following problem is in NP: given a graph G and k ∈ N, does G have a complete subgraph of size greater than or equal to k? However, if knowledge about the number of such complete subgraphs is preferred, the answer to this NP-problem does not suffice. The question becomes: given a graph G and k ∈ N, how many complete subgraphs of size greater than or equal to k does G have? This was one example, but for every problem in NP the number of positive answers can be counted. The class #P is the class of problems that count the number of positive answers to a problem in NP. The counting version of the 3-SAT problem of example 3.2.2 is in #P: it asks for the number of satisfying assignments of a given 3-SAT formula.

Many mathematicians consider it very unreasonable to assume P = NP, but to the present day it has not been proven that P ≠ NP.

Definition 3.2.3. A problem is called NP-complete if it is in NP and it is NP-hard: the polynomial solvability of this problem would imply that all other problems in NP are solvable in polynomial time as well.

The definition of NP-completeness states that it is possible to transform any problem in NP in polynomial time to the NP-complete problem, such that the solution of the NP-complete problem also gives a solution to the original problem in NP. This means that if for one NP-complete problem it can be proven that it is solvable in polynomial time, then all problems in NP can be solved in polynomial time, from which it can be concluded that P = NP. However, as stated above, if one assumes P ≠ NP, then no NP-complete problem is solvable in polynomial time.

To provide information about the computational complexity of computing distributions in graphical models using exact algorithms, two theorems will be stated that indicate in which computational complexity class this problem lies. The first theorem is the following:

Theorem 3.2.4. The following problem is NP-complete: given a Bayesian network G over X, a variable X ∈ X, and a value x ∈ Val(X), decide whether p(X = x) > 0.

Membership in NP means that a positive answer, witnessed by an assignment of the variables with positive probability in which X = x, can be verified in polynomial time. The problem being NP-complete leads to the conclusion that, if P ≠ NP, the computational complexity of deciding the problem cannot be polynomial. So in the best case the computational complexity is exponential. A proof of the theorem can be found in appendix B.

The result only concerns the problem of deciding whether p(X = x) > 0 in a Bayesian network G with a variable X. Often one is more interested in computing the actual probability p(X = x). A last result concerns this problem, where #P-completeness is defined in the same way as NP-completeness, only for the class #P:

Theorem 3.2.5. The following problem is #P-complete: given a BN over X , a variable X ∈ X and a value x ∈ V al(X), compute p(X = x).

Of course, because #P concerns counting problems instead of decision problems, a #P problem cannot be an NP problem. It turns out that counting the number of satisfying assignments of a 3-SAT formula is the canonical #P-hard problem. Although this problem cannot be in NP, it is NP-hard, because if it can be solved, then the 3-SAT decision problem can be solved. The general belief is that the counting version of the 3-SAT problem is more difficult than the decision problem itself. From this and the theorem it can again be concluded that there exist Bayesian networks for which computing p(X = x) takes, in the best case, exponential time.

Thus the computational complexity of computing exact probabilities in Bayesian networks is exponential. This indicates that for large Bayesian networks, exact algorithms are time consuming. Because many graphical models in artificial intelligence and physics are very large, good approximation algorithms that compute probabilities in graphical models in polynomial time are desirable. In the next chapter one such approximation method, called variational inference, is analyzed.


4. Variational Inference

Let a graphical model G be given, where the nodes index the random variables X = (X1, . . . , Xn), whose dependence structure is defined by the edges of the graphical model. Let X take values in X = X1 × X2 × · · · × Xn. The goal of statistical inference in graphical models is to deduce the underlying distribution of these random variables. One method to do this is variational inference, an approximate inference method which transforms the problem of computing a distribution into an optimization problem. It is widely used because it is fast. The analysis of variational inference as presented in this chapter is based on analyses by Wainwright and Jordan in [17] and [16] and by Blei [1].

In studying variational inference, we will focus on exponential families. The reason for this is that exponential families are mathematically tractable and also versatile modelling tools [8]. The most well-known probability distributions, such as the Gaussian, Exponential, Poisson and Gamma distributions, are exponential families. Example 4.0.1 will show that, with respect to the Lebesgue measure, the collection of normal distributions forms an exponential family.

The density of the exponential family is taken with respect to some measure ν which has density h : X → R+ with respect to dx, the components of dx being counting measures in the discrete case and Lebesgue measures in the continuous case. An exponential family can now be written in the following form, where x ∈ X:

pθ(x) = exp{⟨θ, φ(x)⟩ − A(θ)},   (4.1)

⟨·, ·⟩ being the standard inner product on R^d. Here, φ = (φα : X → R)α∈I is a vector of functions, so that φ maps X to R^d, where I is an index set and d = |I|. The φα are functions (in fact, they are sufficient and complete statistics). θ = (θα)α∈I is called the vector of canonical parameters associated with φ, and A(θ) the cumulant generating function or log partition function. A(θ) normalizes pθ to make it a density and is thus defined by the following logarithm:

A(θ) = log ∫_X exp{⟨θ, φ(x)⟩} ν(dx).

However, A(θ) normalizes pθ only if A(θ) is finite. Thus, only the canonical parameters θ contained in the set Ω = {θ ∈ R^d | A(θ) < ∞} will be taken into account as possible parametrizations.

The method of variational inference for exponential families in graphical models is now as follows. Assume that the factors in equation (2.1) (or (2.2), when the graphical model G is undirected) can be expressed in exponential family form, relative to a common measure ν. The joint distribution, which is a product of such factors, is then also in exponential family form and can be written as (4.1). In addition, assume that the parameter space Ω


is open. Exponential families for which Ω is open are called regular. Now the inference problem is to find the underlying distribution of X = (X1, . . . , Xn), thus the underlying

joint distribution. Variational inference tackles this problem by transforming the problem of computing the cumulant function A(θ) to the following optimization problem:

A(θ) = sup_{µ∈M} {⟨θ, µ⟩ − A*(µ)},   (4.2)

where the definitions of M and A*(µ) will be given in the following section, which describes the transformation of the problem. The optimization problem itself also has to be solved, for which Mean Field approximation is often used; this method only looks at a subset of M and is described in section 4.2.

Example 4.0.1. The density of the normal distribution with parameters (µ, σ²) can be written as:

p(µ,σ²)(x) = (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)}
= (1/√(2π)) exp{−x²/(2σ²) + (µ/σ²)x − µ²/(2σ²) − log σ}
= (1/√(2π)) exp{⟨(−1/(2σ²), µ/σ²), (x², x)⟩ − µ²/(2σ²) − log σ}.

So by defining θ = (−1/(2σ²), µ/σ²)ᵀ and noting that

µ²/(2σ²) + log σ = −θ2²/(4θ1) − (1/2) log(−2θ1) =: A(θ),

the collection of all normal distributions is indeed an exponential family, with φ(x) = (x², x).
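The identification in Example 4.0.1 can be checked numerically. The sketch below (function names are illustrative) evaluates the exponential-family form with θ = (−1/(2σ²), µ/σ²) and φ(x) = (x², x), and compares it with the standard formula for the normal density:

```python
import math

def expfam_normal_pdf(x, mu, sigma):
    theta1 = -1.0 / (2 * sigma ** 2)   # canonical parameter multiplying x^2
    theta2 = mu / sigma ** 2           # canonical parameter multiplying x
    # A(theta) = -theta2^2 / (4 * theta1) - 0.5 * log(-2 * theta1)
    A = -theta2 ** 2 / (4 * theta1) - 0.5 * math.log(-2 * theta1)
    # base density h(x) = 1 / sqrt(2 * pi); statistics phi(x) = (x^2, x)
    return math.exp(theta1 * x ** 2 + theta2 * x - A) / math.sqrt(2 * math.pi)

def normal_pdf(x, mu, sigma):
    # standard form of the N(mu, sigma^2) density
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
```

Both functions agree pointwise, confirming that the (θ, φ, A) triple above reproduces the normal density exactly.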

4.1. Transforming the cumulant function

The exponential family described is characterized by θ ∈ Ω, the vector of canonical parameters. However, in transforming the inference problem to an optimization problem an alternative parametrization is used, in terms of a vector of mean parameters or realizable expected sufficient statistics. These parameters arise as the expectations of φ under a distribution that is absolutely continuous with respect to the measure ν. The set M of realizable mean parameters is defined as:

M = {µ ∈ R^d | ∃p s.t. ∫ φ(x)p(x)ν(dx) = µ}.

Note that in this definition, p is only assumed to be absolutely continuous with respect to ν; it is not assumed that p belongs to an exponential family. Later on, when describing Mean Field approximation, the supremum in (4.2) will not be taken over M but over Mtract, which is a subset of M that only uses tractable distributions p.

As it turns out, there is a one-to-one mapping between the canonical parameters θ ∈ Ω and the interior of M. This fact, and the fact that this mapping is given by the gradient of the log partition function A, will now be proved. The results will be used to transform the inference problem of finding an expression for (4.1) to the optimization problem in (4.2).

(19)

Lemma 4.1.1. The cumulant function

A(θ) := log ∫_X exp⟨θ, φ(x)⟩ ν(dx)

associated with any regular exponential family has the following properties:

(a) It has derivatives of all orders on its domain Ω. In particular it holds that:

∂A/∂θα (θ) = Eθ[φα(X)] := ∫ φα(x)pθ(x)ν(dx).   (4.3)

∂²A/∂θα∂θβ (θ) = Eθ[φα(X)φβ(X)] − Eθ[φα(X)]Eθ[φβ(X)].   (4.4)

(b) A is a convex function of θ on its domain Ω, and strictly so if the representation is minimal: for φ there exists no vector a ∈ R^d such that ⟨a, φ(x)⟩ is a constant.

Proof. By use of the dominated convergence theorem, it can be verified that differentiating through the integral of A is valid, which yields:

∂A/∂θα (θ) = ∂/∂θα log ∫_X exp⟨θ, φ(x)⟩ν(dx)
= (∫_X ∂/∂θα exp⟨θ, φ(x)⟩ν(dx)) / (∫_X exp⟨θ, φ(x)⟩ν(dx))
= ∫_X φα(x) (exp⟨θ, φ(x)⟩ / ∫_X exp⟨θ, φ(x)⟩ν(dx)) ν(dx)
= Eθ[φα(X)].

The second equation can be proven in the same manner, and higher-order derivatives follow from the same procedure.

To prove the second statement, note that the second-order partial derivative in (4.4) is equal to Cov[φα(X), φβ(X)], implying the full Hessian ∇²A(θ) to be the covariance matrix of φ(X). A covariance matrix is always positive semidefinite, so ∇²A(θ) is positive semidefinite, which implies that A is convex [5]. Let the representation now be minimal, such that there is no nonzero a ∈ R^d and b ∈ R such that ⟨a, φ(x)⟩ = b holds ν-a.e. This yields Varθ[⟨a, φ(X)⟩] = aᵀ∇²A(θ)a > 0 for all nonzero a ∈ R^d and all θ ∈ Ω. So the Hessian is strictly positive definite, and A is strictly convex [5].
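Property (4.3) can be observed numerically in a one-parameter family. The sketch below (an illustrative check, not from the thesis) uses the exponential distribution, where A(θ) = −log(−θ) and Eθ[φ(X)] = Eθ[X] = −1/θ, and compares a central finite difference of A with the mean:

```python
import math

def A(theta):
    # log partition function of the exponential distribution, theta < 0
    return -math.log(-theta)

theta = -2.0
eps = 1e-6
# central finite difference approximating dA/dtheta
grad_numeric = (A(theta + eps) - A(theta - eps)) / (2 * eps)
mean = -1.0 / theta  # E_theta[phi(X)] = E_theta[X] for this family
```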

From this lemma it can be seen that the range of ∇A is contained within the set of realizable mean parameters M. So ∇A can transform the problem with parameters θ ∈ Ω into a problem with parameters µ ∈ M if the function is bijective, giving a one-to-one correspondence between θ and µ. As it turns out, some extra conditions are needed to guarantee this bijectivity, starting with the need for the representation to be minimal in order to make ∇A injective:

Lemma 4.1.2. The gradient mapping ∇A : Ω → M is injective if and only if the exponential representation is minimal.

Proof. Assume the representation is not minimal, indicating the existence of a nonzero vector a ∈ R^d for which ⟨a, φ(x)⟩ is constant ν-a.e. Let θ1 ∈ Ω and define θ2 = θ1 + ta, t ∈ R. Because Ω is open, t can be chosen such that θ2 ∈ Ω. Linearity in the first component of the inner product yields ⟨θ2, φ(x)⟩ = ⟨θ1, φ(x)⟩ + t⟨a, φ(x)⟩. The vector a is defined such that the last term is constant, resulting in the fact that pθ1 and pθ2 induce the same distribution (the only difference being the normalization constant), so that ∇A(θ1) = ∇A(θ2) and ∇A is not injective.

Conversely, assume the representation is minimal. Lemma 4.1.1 then states that A is strictly convex. This property, and the fact that A is differentiable, gives A(θ2) > A(θ1) + ⟨∇A(θ1), θ2 − θ1⟩ for all θ1 ≠ θ2 in Ω. The same inequality can be derived with θ1 and θ2 reversed, and adding those inequalities yields:

⟨∇A(θ1) − ∇A(θ2), θ1 − θ2⟩ > 0.

So if θ1 ≠ θ2 it follows that ∇A(θ1) ≠ ∇A(θ2), and thus ∇A is injective.
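The strict monotonicity inequality at the end of the proof can be observed numerically in the one-dimensional exponential family with A(θ) = −log(−θ) (used again in Example 4.1.7 below), where ∇A(θ) = −1/θ. This is an illustrative check with arbitrarily chosen parameters:

```python
def grad_A(theta):
    # gradient of the log partition A(theta) = -log(-theta), theta < 0
    return -1.0 / theta

theta1, theta2 = -0.7, -2.3   # two distinct canonical parameters
# strict monotonicity of the gradient map, as in the proof:
gap = (grad_A(theta1) - grad_A(theta2)) * (theta1 - theta2)
```

A positive `gap` for distinct parameters is exactly what forces ∇A(θ1) ≠ ∇A(θ2), i.e. injectivity.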

Surjectivity of ∇A can now be guaranteed if the range is restricted to the interior of M, as stated in the following theorem:

Theorem 4.1.3. In a minimal exponential family, the gradient map ∇A is onto M◦. Consequently, for each µ ∈ M◦, there exists some θ = θ(µ) ∈ Ω such that Eθ[φ(X)] = µ.

A proof of this theorem can be found in appendix C. The last lemma and this theorem together give that for minimal exponential families, each mean parameter µ ∈ M◦ is realized uniquely by a density in the exponential family. The definition of M allows p to be an arbitrary density, while a typical exponential family is only a strict subset of all possible densities. So there are other densities p which also realize µ, where p does not have to be a member of the exponential family. In Mean Field approximation, where Mtract is used instead of M, one could choose an Mtract containing only densities p that belong to exponential families, because these are tractable.

The property that makes the exponential family distribution pθ(µ) special is the fact that, among all distributions that realize µ, it has a connection with the maximum entropy principle. This will be discussed now by first introducing two new definitions: the Boltzmann-Shannon entropy and the Fenchel-Legendre conjugate.

Definition 4.1.4. The Boltzmann-Shannon entropy is defined as

H(pθ) = −∫_X pθ(x) log pθ(x) ν(dx),

where pθ is a density with respect to the measure ν.

Definition 4.1.5. Given a log partition function A, the conjugate dual function or Fenchel-Legendre conjugate of A, denoted by A*, is defined as:

A*(µ) = sup_{θ∈Ω} {⟨µ, θ⟩ − A(θ)},

µ ∈ R^d being a vector of variables having the same dimension as θ, called dual variables.

These definitions are combined in the following theorem, which also gives the relation between A and A*, resulting in the optimization problem for A given in (4.2).


Theorem 4.1.6. (a) For any µ ∈ M◦, let θ(µ) denote the unique canonical parameter corresponding to µ, so ∇A(θ(µ)) = µ. The Fenchel-Legendre dual of A has the following form:

A*(µ) = −H(pθ(µ)) if µ ∈ M◦, and A*(µ) = +∞ if µ lies outside the closure of M.

For any boundary point µ of M the dual becomes A*(µ) = lim_{n→∞} [−H(pθ(µn))], {µn} being a sequence in M◦ converging to µ.

(b) In terms of this dual, the log partition function has the following variational representation:

A(θ) = sup_{µ∈M} {⟨θ, µ⟩ − A*(µ)}.

(c) For all θ ∈ Ω, the supremum in the variational representation of A(θ) is attained uniquely at the vector µ = Eθ[φ(X)]. This statement holds in both minimal and nonminimal representations.

Before discussing the consequences of this theorem, an example will be given in which the expression for the cumulant function A of the exponential distribution is transformed into the optimization problem expressed in (4.2). This verifies the theorem for the exponential distribution. The proof of the theorem for all regular exponential families can be found in appendix C.

Example 4.1.7. Let θ ∈ (0, ∞) and let X be exponentially distributed with parameter θ. Then, for x ∈ Val(X):

pθ(x) = θe^{−θx} = e^{−θx+log θ} = e^{⟨−θ,x⟩+log θ}.

Thus for θ̂ ∈ (−∞, 0) the distribution can be written as

pθ̂(x) = e^{⟨θ̂,x⟩+log(−θ̂)}.

So if X is exponentially distributed with parameter θ̂ ∈ (−∞, 0), then it is a member of the exponential family with φ(x) = x and A(θ̂) = −log(−θ̂). The Fenchel-Legendre conjugate A*(µ) is now defined as

A*(µ) = sup_{θ̂∈(−∞,0)} {µθ̂ + log(−θ̂)}.

Taking the derivative with respect to θ̂ and equating it to 0 to find the optimum yields µ = −1/θ̂. This value for µ indeed satisfies ∇A(θ̂) = µ.

Because φ(x) = x, M = {µ : ∃p such that ∫ x p(x)ν(dx) = µ}. x ∈ Val(X) gives x ≥ 0, since otherwise x is not in the support of the exponential family. This, together with p(x) ≥ 0 for all x ∈ Val(X) and ∫ p(x)ν(dx) = 1, yields ∫ x p(x)ν(dx) > 0 for all distributions p. Thus M = (0, ∞) = M◦. Now, for all µ ∈ M, A*(µ) = −1 − log µ. This is indeed equal to the negative Boltzmann-Shannon entropy:

−H(pθ̂) = ∫_[0,∞) pθ̂(x) log pθ̂(x) dx
= ∫_[0,∞) e^{θ̂x+log(−θ̂)} (θ̂x + log(−θ̂)) dx
= E_{pθ̂}[θ̂X + log(−θ̂)]
= θ̂ · (−1/θ̂) + log(−θ̂)
= −1 − log(−1/θ̂)
= −1 − log µ.

This verifies A*(µ) = −H(pθ̂) for µ ∈ M◦. For µ ∉ M, so for µ ≤ 0, the expression µθ̂ + log(−θ̂) increases without bound as θ̂ → −∞. This yields A*(µ) = ∞. Part (b) of the theorem now gives as value of A(θ̂):

A(θ̂) = sup_{µ>0} {µθ̂ + 1 + log µ}.

It is easily proved by differentiating that this is equal to −log(−θ̂), where the supremum is attained only once, namely when µ = −1/θ̂. This also verifies (c).
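The variational representation in this example can also be checked numerically. The sketch below (a hypothetical illustration with an arbitrarily chosen parameter) maximizes µθ̂ + 1 + log µ over a grid of µ > 0 and compares the maximum with −log(−θ̂):

```python
import math

theta_hat = -1.5  # canonical parameter of the exponential family, must be < 0

def objective(mu):
    # <theta, mu> - A*(mu), with A*(mu) = -1 - log(mu)
    return mu * theta_hat + 1 + math.log(mu)

# coarse grid search over mu > 0 for the supremum
grid = [m / 1000 for m in range(1, 10_000)]
best_mu = max(grid, key=objective)

A_theta = -math.log(-theta_hat)  # claimed value of the supremum
```

The maximizing grid point lands near µ = −1/θ̂, the mean parameter, in agreement with part (c) of the theorem.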

From theorem 4.1.6 it can be concluded that A(θ) can be written as a variational problem with a unique solution, which was explicitly shown in example 4.1.7 for the exponential distribution. The first goal of variational inference was transforming the problem of computing the probability distribution in (4.1) to the problem of optimizing the cumulant function in (4.2), meaning that goal is now achieved. However, that is just the first part of variational inference, because the optimization problem still has to be solved. Solving the optimization problem is not necessarily an easy task. In many models either A*(µ) has no explicit form or the complexity of M makes it hard to take the supremum. When this is the case and the value of A(θ) cannot be derived directly, all that is left is the following inequality, which gives a lower bound on A(θ) and follows directly from the definition of A(θ) as an optimization problem:

A(θ) ≥ ⟨θ, µ⟩ − A*(µ).   (4.5)

Certain methods have been developed to maximize this lower bound on A(θ). One class of these methods is the Mean Field variational methods. These methods will be studied in the next section; as mentioned before, they only look at a subset Mtract ⊂ M in which all distributions p are tractable.

4.2. Mean Field Approximation

The mean field approach deals with the difficulties of the optimization problem in (4.2) by only looking at distributions for which M and A* can be characterized. Such distributions are called tractable. So Mtract is chosen such that both the computation of A*(µ) for µ ∈ Mtract (with θ(µ) the canonical parameter corresponding to the mean parameter µ) and the maximization over Mtract can be done tractably. So the mean field approach only looks for the best lower bound within the set Mtract. This lower bound is then equal to

sup_{µ∈Mtract} {⟨µ, θ⟩ − A*(µ)}

and the value µ ∈ Mtract realizing this supremum is called the mean field approximation of the true mean parameters.

It turns out that there exists a close connection between maximizing the lower bound and the Kullback-Leibler divergence, which is defined as follows:

Definition 4.2.1. Let X ∈ X be a continuous random variable. Given two distributions with densities q and p with respect to a base measure ν, the KL (Kullback-Leibler) divergence is defined as

D(q||p) = ∫_X q(x) log(q(x)/p(x)) ν(dx).

The integral is replaced by a sum in the case of discrete variables.

To indicate the connection, for θ1, θ2 ∈ Ω let pθ1 and pθ2 be two probability distributions from some regular exponential family with statistics φ(X). Let µ1 and µ2 be the mean parameters corresponding to θ1 and θ2 respectively, meaning Eθi[φ(X)] = µi for i = 1, 2. Computing the KL divergence now gives the following:

D(pθ1||pθ2) = Eθ1[log(pθ1(X)/pθ2(X))]
= Eθ1[log e^{⟨θ1,φ(X)⟩−A(θ1)} − log e^{⟨θ2,φ(X)⟩−A(θ2)}]
= Eθ1[⟨θ1, φ(X)⟩ − ⟨θ2, φ(X)⟩ + A(θ2) − A(θ1)]
= A(θ2) − A(θ1) + ⟨θ1, µ1⟩ − ⟨θ2, µ1⟩
= A(θ2) − ⟨θ1, µ1⟩ + A*(µ1) + ⟨θ1, µ1⟩ − ⟨θ2, µ1⟩
= A(θ2) + A*(µ1) − ⟨µ1, θ2⟩.

With this in mind, the connection is immediately clear, because maximizing the lower bound in (4.5), with µ ∈ Mtract, equals minimizing

A(θ) + A*(µ) − ⟨µ, θ⟩ = D(pθ(µ)||pθ),

where θ(µ) is the canonical parameter corresponding to µ.

Conclude that mean field approximation maximizes the lower bound in (4.5) by taking a subset Mtract of M for which A*(µ) can be computed and over which maximization is possible. This is equivalent to minimizing the Kullback-Leibler divergence between the true distribution and the tractable distributions.
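The identity D(pθ1||pθ2) = A(θ2) + A*(µ1) − ⟨µ1, θ2⟩ can be verified for the exponential distributions of Example 4.1.7, where the KL divergence between rates λ1 and λ2 is known in closed form as log(λ1/λ2) + λ2/λ1 − 1 (an illustrative check with arbitrarily chosen parameters):

```python
import math

def A(theta):       # log partition function, theta < 0
    return -math.log(-theta)

def A_star(mu):     # Fenchel-Legendre conjugate, mu > 0
    return -1 - math.log(mu)

theta1, theta2 = -3.0, -0.5   # canonical parameters (rates 3 and 1/2)
mu1 = -1.0 / theta1           # mean parameter of p_theta1

kl_variational = A(theta2) + A_star(mu1) - mu1 * theta2

# closed form for exponential distributions with rates l1 and l2
l1, l2 = -theta1, -theta2
kl_closed = math.log(l1 / l2) + l2 / l1 - 1
```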

The statistical setting was a graphical model G with random variables X = (X1, . . . , Xn), where the joint probability distribution has exponential family form. The problem was to find this joint probability distribution. Variational inference tackled the problem by transforming the computation of the cumulant function A(θ) to maximizing ⟨θ, µ⟩ − A*(µ), and in turn transforming this problem to the minimization of the KL divergence between the joint probability distribution and all tractable distributions. The chapter will now be closed by analyzing the Variational Message Passing algorithm, which is an example of an algorithm using variational inference to find a distribution of a Bayesian network.


4.3. Variational Message Passing

There are certain variational algorithms that try to minimize the KL divergence. An example of such an algorithm is the Variational Message Passing (VMP) algorithm, which will be treated in this section and is used in Bayesian networks. The algorithm is described based on an article by Winn and Bishop [18].

The statistical setting in which VMP works is as follows. Let G = (V, E) be a Bayesian network, where the nodes index the random variables X1, . . . , Xn. The dependence structure of X1, . . . , Xn is given by the edges E. Because G is a Bayesian network, the joint distribution can be written as

p(X1, . . . , Xn) = ∏_i p(Xi|πXi).   (4.6)

Assume that all factors in (4.6) are in exponential family form. In addition, assume that when Y is a parent of X and X is a parent of W, the functional form, with respect to X, of both p(X|Y) and p(W|X) is the same. For exponential families this implies that if p(X|Y) can be written as an exponential family with a vector of sufficient statistics given by φ(X), then so can p(W|X). Furthermore, assume that the values of a part of the random variables X1, . . . , Xn are observed. Denote these observed variables as E = (E1, . . . , Em), m ≤ n. The values of the other random variables are not yet observed, and the collection of these random variables will be denoted as H = (H1, . . . , Hn−m). See figure 4.1 for an example. The goal in this graphical model is to compute the conditional distribution p(H|E).

Figure 4.1.: Bayesian network with observed random variables (grey) E1, E2 and unobserved random variables (white) H1, H2 and H3.

VMP uses variational inference to find p(H|E). First, using the obtained results of Mean Field approximation, VMP tries to find p(H|E) by minimizing the KL divergence between the true distribution p(H|E) and all tractable distributions q(H). To make q tractable, q is assumed to be factorised with respect to each Hi, so

q(H) = ∏_i qi(Hi).

The VMP algorithm can now be found by performing the following steps:

(i) Rewrite D(q||p) such that minimizing it equals minimizing D(qj||p̃) with respect to each j, for a certain distribution p̃.

(ii) Minimize D(qj||p̃) with respect to qj, which is done by choosing qj equal to p̃ (up to normalization).

(iii) Use (4.6) to rewrite qj(Hj), after which it becomes evident that qj(Hj) can be computed locally, as a sum of terms in which only the children or the parents of Hj are involved: namely only terms p(Hj|πHj) and p(X|πX) where X is a child of Hj.

(iv) Use the fact that the functional form of p(Hj|πHj) and p(X|πX) has to be the same with respect to Hj, which implies they are both exponential families with sufficient statistics φj(Hj). Because these terms are the only terms present in the computation of qj(Hj), qj(Hj) can be written as an exponential family with statistics φj(Hj).

These steps are now performed in the rest of the section to find a distribution for p(H|E) and to arrive at the VMP algorithm.

To minimize D(q||p), note that it can be rewritten as follows (when X1, . . . , Xn are discrete, replace the integrals by sums):

D(q||p) = ∫_H q(H) log(q(H)/p(H|E)) ν(dH)
= ∫_H q(H) log(q(H)/p(H|E)) ν(dH) + ∫_H q(H) log(p(H, E)/q(H)) ν(dH) − ∫_H q(H) log(p(H, E)/q(H)) ν(dH)
= ∫_H q(H) log p(E) ν(dH) − ∫_H q(H) log(p(H, E)/q(H)) ν(dH)
= log p(E) − L(q),

where

L(q) := ∫_H q(H) log(p(H, E)/q(H)) ν(dH).   (4.7)

Because E consists of all nodes that are already observed, log p(E) is a constant. This implies that minimizing the KL divergence D(q||p) equals maximizing L(q). Now, using the factorised form of q, it can be seen that

L(q) = ∫_H q(H) log(p(H, E)/q(H)) ν(dH)
= ∫_H ∏_i qi(Hi) [log p(H, E) − Σ_i log qi(Hi)] ν(dH)
= ∫_H ∏_i qi(Hi) log p(H, E) ∏_i ν(dHi) − Σ_i ∫ qi(Hi) log qi(Hi) ν(dHi)
= ∫ qj(Hj) [∫ log p(H, E) ∏_{i≠j} (qi(Hi)ν(dHi))] ν(dHj) − ∫ qj(Hj) log qj(Hj) ν(dHj) − Σ_{i≠j} ∫ qi(Hi) log qi(Hi) ν(dHi).


Now denote

log p̃(H, E) = ⟨log p(H, E)⟩_{i≠j} = ∫ log p(H, E) ∏_{i≠j} (qi(Hi)ν(dHi))   (4.8)

to rewrite L(q) as

L(q) = ∫ qj(Hj) log p̃(H, E) ν(dHj) − ∫ qj(Hj) log qj(Hj) ν(dHj) − Σ_{i≠j} ∫ qi(Hi) log qi(Hi) ν(dHi)
= −D(qj(Hj)||p̃) − Σ_{i≠j} ∫ qi(Hi) log qi(Hi) ν(dHi).

Conclude that D(q||p) is minimized when L(q) is maximized, and this happens when D(qj||p̃) is minimized with respect to each j. The minimum of D(qj||p̃) is attained if and only if qj is chosen equal to p̃. So the optimal expression q*j(Hj) is:

log q*j(Hj) := log p̃(H, E) + c,  c ∈ R,   (4.9)

where c can be found by normalizing. As can be seen, each qj(Hj) depends on the factors qi, i ≠ j. So, choosing an initial distribution, one can update each factor qj by cycling through the factors. Now, substituting (4.6) in (4.9) gives

log q*j(Hj) = Σ_i log p̃(Xi|πXi) + c.

Terms in the sum over i that are independent of Hj will be constant and finite, thus:

log q*j(Hj) = log p̃(Hj|πHj) + Σ_{k:Hj∈πXk} log p̃(Xk|πXk) + c′.

The final expression for log q*j(Hj) shows that all that is needed to compute q*j(Hj) is log p̃(Hj|πHj), depending only on the parents of Hj, and Σ_{k:Hj∈πXk} log p̃(Xk|πXk), depending only on the children of Hj and their parents. The first term can therefore be viewed as a message from the parents, while the terms in the sum can be viewed as messages from the children. So an update for qj(Hj) can be computed locally, involving only messages from the children and parents of Hj.

Assume that, by cycling through the factors qj, the term qj(Hj) needs to be optimized. Let X be a child of Hj and define π̃X := πX \ Hj. Then, by the fact that the factors in (4.6) are exponential families:

p(Hj|πHj) = hj(Hj) exp{⟨ηj(πHj), φj(Hj)⟩ − Aj(πHj)}   (4.10)

p(X|Hj, π̃X) = hX(X) exp{⟨ηX(Hj, π̃X), φX(X)⟩ − AX(Hj, π̃X)}   (4.11)


where ηj, ηX, hj, hX are functions, φj, φX sufficient and complete statistics, and Aj, AX cumulant functions. One of the assumptions was that both expressions have the same functional form with respect to Hj, so both can be written as exponential families with sufficient statistics φj(Hj). Thus functions A′ and ηj,X can be defined such that (4.11) can be rewritten as:

p(X|Hj, π̃X) = exp{⟨ηj,X(X, π̃X), φj(Hj)⟩ − A′(X, π̃X)}   (4.12)

Using this, and writing ⟨·⟩_{i≠j} for the expectation from (4.8), equation (4.9) now becomes:

log q*j(Hj) = ⟨log hj(Hj) + ⟨ηj(πHj), φj(Hj)⟩ − Aj(πHj)⟩_{i≠j} + Σ_{k:Hj∈πXk} ⟨⟨ηj,Xk(Xk, π̃Xk), φj(Hj)⟩ − A′(Xk, π̃Xk)⟩_{i≠j} + c′
= ⟨⟨ηj(πHj)⟩_{i≠j} + Σ_{k:Hj∈πXk} ⟨ηj,Xk(Xk, π̃Xk)⟩_{i≠j}, φj(Hj)⟩ + log hj(Hj) + c″.

It can now be concluded that q*j is also a member of the exponential family, having parameter vector

η*j = ⟨ηj(πHj)⟩_{i≠j} + Σ_{k:Hj∈πXk} ⟨ηj,Xk(Xk, π̃Xk)⟩_{i≠j}.   (4.13)

The first term can be seen as a message from the parents of Hj, the second as a message from the children and co-parents of Hj. These messages together form η*j, which is the parameter vector of qj(Hj) and thus completely fixes its distribution.

It can be concluded that the VMP algorithm works as follows. It finds a tractable distribution q that minimizes the KL divergence between all tractable distributions and the true distribution p, and this distribution q is chosen as the optimal distribution in approximating p. The initial distribution of q is given by

q(H) = ∏_i qi(Hi).

The algorithm cycles through the factors and updates each term, where each factor qj(Hj) is updated locally via (4.13), using only messages from the parents and children of Hj.
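The cycle of local updates can be illustrated on a toy discrete model. The sketch below is entirely illustrative (the joint table p is made up and the model has no graph structure beyond two variables): it runs the mean-field coordinate updates q1(x) ∝ exp(E_{q2}[log p(x, Y)]) and q2(y) ∝ exp(E_{q1}[log p(X, y)]) and checks that the KL divergence of the factorised approximation decreases:

```python
import math

# made-up joint distribution p(x, y) over two binary variables
p = [[0.30, 0.20],
     [0.10, 0.40]]

def normalise(v):
    s = sum(v)
    return [u / s for u in v]

def kl(q1, q2):
    # D(q1 q2 || p) for the fully factorised approximation
    return sum(q1[x] * q2[y] * math.log(q1[x] * q2[y] / p[x][y])
               for x in range(2) for y in range(2))

q1, q2 = [0.5, 0.5], [0.5, 0.5]          # initial factors
kl_start = kl(q1, q2)

for _ in range(50):                      # cycle through the factors
    q1 = normalise([math.exp(sum(q2[y] * math.log(p[x][y]) for y in range(2)))
                    for x in range(2)])
    q2 = normalise([math.exp(sum(q1[x] * math.log(p[x][y]) for x in range(2)))
                    for y in range(2)])

kl_end = kl(q1, q2)
```

Each update uses only an expectation under the other factor, mirroring the parent/child messages of VMP.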


5. Applications

In this chapter, two applications of variational inference are shown. The first will be the approximation of the posterior, which is done by mean field approximation. The second application is called variational Bayes and gives an example of how mean field approximation can be used in Bayesian statistics, in the form of computing the so-called marginal likelihood.

5.1. Approximation of the posterior

The first application lies in the approximation of the posterior and follows an example from David Blei [1].

Let G be a graphical model in which the values of the random variables X = {X1, . . . , Xn} are observed and the values of the remaining variables Y = {Y1, . . . , Ym} are unobserved. The dependence structure of the random variables in G is defined by the edges of G. Let θ be a hyperparameter. Then p(Y|X, θ) is the posterior distribution, given by

p(Y|X, θ) = p(Y, X|θ) / p(X|θ).   (5.1)

Assume that this posterior distribution is from an exponential family. The goal in this statistical setting is now to approximate the posterior distribution p(Y|X, θ) using Mean Field approximation methods. As it turns out, the cumulant function A(θ, X) is in this case equal to log p(X|θ). With the results of the last chapter this means that for µ ∈ Mtract, the set of realizable and tractable mean parameters, it is the case that log p(X|θ) ≥ ⟨θ, µ⟩ − A*(µ). So the goal is to maximize this lower bound and in that way lower-bound the posterior distribution.

First, notice that (5.1) can be rewritten as

log p(Y|X, θ) = log p(Y, X|θ) − log p(X|θ),

meaning that in this case A(θ, X) = log p(X|θ) and ⟨ξ(X, θ), φ(Y)⟩ = log p(Y, X|θ), where ξ is a function of X and θ. The lower bound on A(X, θ) given by inequality (4.5) then becomes, with µ ∈ M the mean parameter corresponding to the parameter θ and H the Boltzmann-Shannon entropy:

log p(X|θ) ≥ ⟨ξ(X, θ), µ⟩ − A*(µ) = Eθ[log p(Y, X|θ)] + H(p(Y|θ)).


Now Mean Field approximation will be used, so µ will be chosen in Mtract for some Mtract. For this µ, p(Y|θ) needs to be tractable. Tractable subfamilies of exponential family distributions can be constructed by considering factorized families. In such a factorized family each factor is an exponential family distribution which depends on a so-called variational parameter. So, writing p(Y|θ) = q(Y), consider q(Y|η) = ∏_{i=1}^m q(Yi|ηi), η = (η1, . . . , ηm) being the variational parameters. For all i, q(Yi|ηi) is in the exponential family and can be written as

q(Yi|ηi) = h(Yi) exp{⟨ηi, Yi⟩ − A(ηi)}   (5.2)

with respect to the Lebesgue measure. So Mtract is here chosen such that the tractable distributions in this set can be written as (5.2).

Using the factorized form of q, the resulting lower bound for log p(X|θ) becomes:

log p(X|θ) ≥ Eθ[log p(Y1, . . . , Ym|X, θ)p(X|θ)] − Σ_{i=1}^m Eθ[log q(Yi|ηi)]
= Eθ[log p(Y1, . . . , Ym|X, θ)] + log p(X|θ) − Σ_{i=1}^m Eθ[log q(Yi|ηi)]
= Σ_{i=1}^m Eθ[log p(Yi|Y1, . . . , Yi−1, X, θ)] + log p(X|θ) − Σ_{i=1}^m Eθ[log q(Yi|ηi)].

This lower bound can now be optimized with respect to each ηi. To do this, reorder Y such that Yi is the last variable and let Y−i be Y \ Yi. Note that the part that depends on ηi is then given by:

ℓi = Eθ[log p(Yi|Y−i, X, θ)] − Eθ[log q(Yi|ηi)].

Now, using (5.2), this is equal to

ℓi = Eθ[log p(Yi|Y−i, X, θ)] − Eθ[log h(Yi)] − Eθ[⟨Yi, ηi⟩ − A(ηi)]
= Eθ[log p(Yi|Y−i, X, θ)] − Eθ[log h(Yi)] − ηiᵀA′(ηi) + A(ηi),

where the last step follows because A′(ηi) = Eθ[Yi]:

A′(ηi) = ∫ Yi exp{⟨ηi, Yi⟩ − A(ηi)} h(Yi) dYi = ∫ Yi q(Yi|ηi) dYi = Eθ[Yi].

Using

∂/∂ηi (ηiᵀA′(ηi)) = ηiᵀA″(ηi) + A′(ηi),

differentiating ℓi with respect to ηi gives

∂ℓi/∂ηi = ∂/∂ηi (Eθ[log p(Yi|Y−i, X, θ)] − Eθ[log h(Yi)]) − ηiᵀA″(ηi).


This results in an optimal value of ηi equal to

ηi = [A″(ηi)]⁻¹ (∂/∂ηi Eθ[log p(Yi|Y−i, X, θ)] − ∂/∂ηi Eθ[log h(Yi)]).   (5.3)

So one can update η1, . . . , ηm iteratively, finding the value of η that optimizes the lower bound for log p(X|θ).

So the problem of lower-bounding the posterior distribution p(Y|X, θ) can, using variational inference, be transformed into the problem of maximizing

E[log p(Y, X|θ)] − E[log p(Y|θ)].

When p(Y|θ) is chosen tractable, it can be chosen as a factorized family with variational parameters η. The optimization with respect to these parameters η is then solved when they satisfy (5.3).

5.2. Variational Bayes

Variational inference also finds applications in Bayesian statistics. An example of how graphical models and Bayesian statistics relate was already given in example 2.1.6. It can now be analyzed how variational inference works in graphical models that describe a problem in Bayesian statistics. In particular, the application of mean-field variational methods to Bayesian inference, often called variational Bayes, can be analyzed. Following Wainwright and Jordan [17], this application of variational inference will be studied in this section.

Again, consider a graphical model with X the observed variables and Y the unobserved variables, with parameter θ. Assume that p(X, Y|θ) is a member of the exponential family, meaning

p(X, Y|θ) = exp{⟨η(θ), φ(X, Y)⟩ − A(η(θ))},

where the function η : R^d → R^d is just a reparametrization of θ. Furthermore, assume that the prior over Ω is also a member of the exponential family and has, for α ∈ R^d and β ∈ R, the following form:

pα,β(θ) = exp{⟨α, η(θ)⟩ − βA(η(θ)) − B(α, β)},

which is called the conjugate prior form. This means that, given the likelihood function, the posterior will have the same functional form as the prior. More formally, the prior is an element of the conjugate family of p, defined as:

Definition 5.2.1. Let (P, A) be a measurable model for an observation Y ∈ Y. Let M denote a collection of probability distributions on (P, A). The set M is called a conjugate family for the model P if the posterior based on a prior from M again lies in M [8].

B(α, β) is a normalization constant, defined as:

B(α, β) = log ∫ exp{⟨α, η(θ)⟩ − βA(η(θ))} dθ.


We choose this form for the likelihood and the prior because many models are specified in this way. A central problem in Bayesian statistics is now to compute

p_{α̂,β̂}(x) := ∫ p(x|θ) p_{α̂,β̂}(θ) dθ,

which is called the marginal likelihood, with x an observed value of X and α̂, β̂ fixed values of α and β. To see how variational Bayes works, the problem of computing the marginal likelihood will be tackled; Mean Field approximation will be used to approximate it. A graphical model for this problem is shown in figure 5.1.

Figure 5.1.: Observed variables X, unobserved variables Y, parameter θ and hyperparameters α, β.

To tackle the problem, first Jensen’s inequality is used to derive the following lower bound: log pα, ˆˆβ(x) = log Z p(x|θ)pα, ˆˆβ(θ)dθ = log Z pα,β(θ) pα, ˆˆβ(θ) pα,β(θ) p(x|θ)dθ ≥ Z pα,β(θ) log pα, ˆˆβ(θ) pα,β(θ) dθ + Z pα,β(θ) log p(x|θ)dθ = Epα,β[log pα, ˆˆβ(θ) pα,β(θ)] + E pα,β[log p(x|θ)]. Now, define Ax(η(θ)) := log nZ Y exp {hη(θ), φ(x, y)i}ν(dy) o

and note that

\[
\log p(x\mid\theta) = \log \int_{\mathcal{Y}} \exp\{\langle \eta(\theta), \phi(x,y) \rangle - A(\eta(\theta))\}\, \nu(dy) = A_x(\eta(\theta)) - A(\eta(\theta)). \qquad (5.4)
\]
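Identity (5.4) can be verified directly on a toy discrete family. The sketch below (an illustrative parametrization with φ(x, y) = (x, y, xy) on {0, 1}² and counting measure for ν; all names are assumptions) computes the marginal of x both directly and via A_x − A:

```python
import math

# Toy discrete exponential family on {0,1}^2 (illustrative):
# p(x, y | θ) = exp(<θ, φ(x, y)> - A(θ)) with φ(x, y) = (x, y, xy).

theta = (0.3, -0.7, 1.1)

def inner(theta, x, y):
    return theta[0] * x + theta[1] * y + theta[2] * x * y

# cumulant function A(θ): log-normalizer over all (x, y)
A = math.log(sum(math.exp(inner(theta, x, y)) for x in (0, 1) for y in (0, 1)))

x = 1
# A_x(θ): log-normalizer with x held fixed, integrating (summing) over y only
A_x = math.log(sum(math.exp(inner(theta, x, y)) for y in (0, 1)))

# marginal p(x | θ) computed directly by summing the joint over y
log_p_x = math.log(sum(math.exp(inner(theta, x, y) - A) for y in (0, 1)))

print(log_p_x, A_x - A)  # the two agree, as (5.4) states
```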

For every observed x, the set of mean parameters $M_x$ can be defined, where $\mu \in M_x$ if $\mu = \mathbb{E}[\phi(x, Y)]$ for some distribution p. Let µ(θ) be the mean parameter corresponding to θ; then the following lower bound holds as a consequence of theorem 4.1.6:

\[
A_x(\eta(\theta)) \geq \langle \mu(\theta), \eta(\theta) \rangle - A_x^*(\mu(\theta)).
\]

This, together with (5.4), results in

\[
\log p(x\mid\theta) \geq \langle \mu(\theta), \eta(\theta) \rangle - A_x^*(\mu(\theta)) - A(\eta(\theta)),
\]

and thus

\[
\log p_{\hat\alpha,\hat\beta}(x) \geq \mathbb{E}_{p_{\alpha,\beta}}\!\left[\log \frac{p_{\hat\alpha,\hat\beta}(\theta)}{p_{\alpha,\beta}(\theta)}\right] + \mathbb{E}_{p_{\alpha,\beta}}\big[\langle \mu(\theta), \eta(\theta) \rangle - A_x^*(\mu(\theta)) - A(\eta(\theta))\big].
\]
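The Jensen bound on the log marginal likelihood can be checked numerically. The sketch below (a Beta-Bernoulli stand-in with hypothetical names; q ranges over the prior family) evaluates $\mathbb{E}_q[\log(p_{\hat\alpha,\hat\beta}(\theta)/q(\theta))] + \mathbb{E}_q[\log p(x\mid\theta)]$ by quadrature and compares it with the exact log marginal; when q equals the posterior the bound is tight.

```python
import math

# Numeric check of the Jensen lower bound in a Beta-Bernoulli stand-in model.

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_beta_pdf(t, a, b):
    return (a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_beta(a, b)

def log_marginal(k, n, a, b):
    # closed form of log ∫ θ^k (1-θ)^{n-k} Beta(θ; a, b) dθ
    return log_beta(a + k, b + n - k) - log_beta(a, b)

def jensen_bound(k, n, a_hat, b_hat, a, b, grid_n=20000):
    """E_q[log(prior/q)] + E_q[log-likelihood] with q = Beta(a, b), by quadrature."""
    h, bound = 1.0 / grid_n, 0.0
    for i in range(grid_n):
        t = (i + 0.5) * h
        q = math.exp(log_beta_pdf(t, a, b))
        log_ratio = log_beta_pdf(t, a_hat, b_hat) - log_beta_pdf(t, a, b)
        log_lik = k * math.log(t) + (n - k) * math.log(1 - t)
        bound += q * (log_ratio + log_lik) * h
    return bound

k, n = 3, 5
exact = log_marginal(k, n, 2.0, 2.0)
loose = jensen_bound(k, n, 2.0, 2.0, 2.0, 2.0)   # q = prior: strict inequality
tight = jensen_bound(k, n, 2.0, 2.0, 5.0, 4.0)   # q = posterior Beta(5, 4): equality
print(exact, loose, tight)
```

The gap between `exact` and `loose` is exactly the KL divergence between q and the posterior, which is what the variational optimization below drives down.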

Again, the problem is transformed into optimizing an inequality, this time with respect to $\mu \in M_x$ and α, β. So the term that needs to be optimized is

\[
\mathbb{E}_{p_{\alpha,\beta}}\!\left[\log \frac{p_{\hat\alpha,\hat\beta}(\theta)}{p_{\alpha,\beta}(\theta)}\right] + \mathbb{E}_{p_{\alpha,\beta}}\big[\langle \mu, \eta(\theta) \rangle - A_x^*(\mu) - A(\eta(\theta))\big]
\]
\[
= \mathbb{E}_{p_{\alpha,\beta}}\big[\langle \hat\alpha, \eta(\theta) \rangle - \hat\beta A(\eta(\theta)) - B(\hat\alpha, \hat\beta) - \langle \alpha, \eta(\theta) \rangle + \beta A(\eta(\theta)) + B(\alpha, \beta)\big] + \langle \mu, \mathbb{E}_{p_{\alpha,\beta}}[\eta(\theta)] \rangle - A_x^*(\mu) - \mathbb{E}_{p_{\alpha,\beta}}[A(\eta(\theta))]
\]
\[
= \langle \hat\alpha - \alpha, \tilde\eta \rangle + \langle \tilde A, \beta - \hat\beta \rangle - B(\hat\alpha, \hat\beta) + B(\alpha, \beta) + \langle \mu, \tilde\eta \rangle - A_x^*(\mu) - \tilde A,
\]

where $\tilde\eta = \mathbb{E}_{p_{\alpha,\beta}}[\eta(\theta)]$ and $\tilde A = \mathbb{E}_{p_{\alpha,\beta}}[A(\eta(\theta))]$. Writing out definitions gives the following equality:

\[
B^*(\tilde\eta, \tilde A) = \langle \tilde\eta, \alpha \rangle + \langle -\tilde A, \beta \rangle - B(\alpha, \beta). \qquad (5.5)
\]

Now using this, the term that needs to be optimized can be transformed in the following way:

\[
\langle \hat\alpha - \alpha, \tilde\eta \rangle + \langle \tilde A, \beta - \hat\beta \rangle - B(\hat\alpha, \hat\beta) + B(\alpha, \beta) + \langle \mu, \tilde\eta \rangle - A_x^*(\mu) - \tilde A
\]
\[
= \langle \hat\alpha, \tilde\eta \rangle - \langle \alpha, \tilde\eta \rangle - \langle -\tilde A, \beta \rangle - \langle \tilde A, \hat\beta \rangle - B(\hat\alpha, \hat\beta) + B(\alpha, \beta) + \langle \mu, \tilde\eta \rangle - A_x^*(\mu) - \tilde A
\]
\[
= \langle \mu + \hat\alpha, \tilde\eta \rangle - A_x^*(\mu) - B^*(\tilde\eta, \tilde A) + \langle \hat\beta + 1, -\tilde A \rangle - B(\hat\alpha, \hat\beta).
\]
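This rewriting is pure algebra, so it can be sanity-checked with arbitrary scalar values. In the sketch below all symbols are treated as scalars and $B^*(\tilde\eta, \tilde A)$ is substituted via (5.5); the variable names are ad-hoc shorthands for the quantities in the derivation.

```python
import random

# Scalar sanity check of the algebraic rewriting: LHS (before substituting
# (5.5)) must equal RHS (after) for any choice of the quantities involved.
random.seed(0)
a_hat, a, b_hat_, b = [random.uniform(-2, 2) for _ in range(4)]     # α̂, α, β̂, β
eta_t, A_t, mu, Ax_star, B_ab, B_hat = [random.uniform(-2, 2) for _ in range(6)]

B_star = eta_t * a + (-A_t) * b - B_ab          # definition (5.5)

lhs = ((a_hat - a) * eta_t + A_t * (b - b_hat_) - B_hat + B_ab
       + mu * eta_t - Ax_star - A_t)
rhs = ((mu + a_hat) * eta_t - Ax_star - B_star
       + (b_hat_ + 1) * (-A_t) - B_hat)

print(lhs, rhs)  # identical up to floating-point rounding
```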

This term needs to be optimized with respect to $(\mu, \tilde\eta, \tilde A)$, where optimizing over $\tilde\eta$ and $\tilde A$ naturally corresponds to optimizing over α and β. Because the optimal value of µ depends on the optimal values of $\tilde\eta$ and $\tilde A$, and vice versa, the values are updated iteratively, where of course

\[
\mu^{(n)} = \arg\max_{\mu \in M_x} \{\langle \mu, \tilde\eta^{(n-1)} \rangle - A_x^*(\mu)\} \qquad (5.6)
\]

and

\[
(\tilde\eta^{(n)}, \tilde A^{(n)}) = \arg\max_{(\tilde\eta, \tilde A)} \{\langle \mu^{(n)} + \hat\alpha, \tilde\eta \rangle - (1 + \hat\beta)\tilde A - B^*(\tilde\eta, \tilde A)\}. \qquad (5.7)
\]

By differentiating the term in (5.6) it can easily be checked that the optimal value of $\mu^{(n)}$ is the mean parameter associated with $\tilde\eta^{(n-1)}$:

\[
\mu^{(n)} = \int_{\mathcal{Y}} \phi(x, y) \exp\{\langle \tilde\eta^{(n-1)}, \phi(x, y) \rangle - A_x(\tilde\eta^{(n-1)})\}\, \nu(dy). \qquad (5.8)
\]

In the same way, (α, β) is updated as $(\alpha^{(n)}, \beta^{(n)}) = (\hat\alpha + \mu^{(n)}, \hat\beta + 1)$, which results in

\[
\tilde\eta^{(n)} = \int \eta(\theta)\, p_{\alpha^{(n)},\beta^{(n)}}(\theta)\, d\theta. \qquad (5.9)
\]
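The alternating structure of updates (5.6) and (5.7) is a coordinate-ascent pattern: each step maximizes the objective in one block of variables with the other held fixed, so the objective value can never decrease. A schematic sketch on a toy concave objective (an illustrative surrogate, not the thesis's actual model) makes this visible:

```python
# Schematic coordinate ascent: alternately maximize over mu with eta fixed,
# then over eta with mu fixed, as in updates (5.6)-(5.7). The toy objective
# below is a concave stand-in with coupled terms; its maximum is at (1, 1).

def f(mu, eta):
    return mu * eta - mu**2 - eta**2 + mu + eta

grid = [i / 100.0 for i in range(-300, 301)]   # search over [-3, 3]

mu, eta = -2.0, 2.5
values = [f(mu, eta)]
for _ in range(10):
    mu = max(grid, key=lambda m: f(m, eta))    # analogue of update (5.6)
    eta = max(grid, key=lambda e: f(mu, e))    # analogue of update (5.7)
    values.append(f(mu, eta))

print(values[0], values[-1])
print(mu, eta)   # the iterates approach the joint maximizer (1, 1)
```

Because each block update includes the incumbent point, the recorded objective values form a nondecreasing sequence; the same monotonicity argument applies to the true updates (5.6)–(5.9).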


6. Conclusion and Discussion

Statistical inference in graphical models can be performed by using exact algorithms. However, there are graphical models in which exact algorithms need, even in the best case, exponential time to compute distributions (assuming P ≠ NP). Due to the large amount of available data, graphical models are getting larger, meaning that executing exact algorithms for calculations is too time-consuming. That is why it is desirable to look at the alternative: approximation methods.

The approximation method analyzed in this thesis is called variational inference. Let G be a graphical model for which the factors that together form the joint probability distribution are exponential families. The joint probability distribution is then also an exponential family. To approximate this distribution one can rewrite the cumulant function A(θ), which is the normalizing factor, as

\[
A(\theta) = \sup_{\mu \in \mathcal{M}} \{\langle \theta, \mu \rangle - A^*(\mu)\},
\]

where $\mathcal{M}$ is the set of realizable mean parameters

\[
\mathcal{M} = \Big\{ \mu \in \mathbb{R}^d \;\Big|\; \exists q \text{ s.t. } \int \phi(x)\, q(x)\, \nu(dx) = \mu \Big\}
\]

and $A^*(\mu) = -H(p_{\theta(\mu)}(x))$, the negative Boltzmann-Shannon entropy for the exponential family with parameter θ(µ). This is the parameter θ for which $\mu = \mathbb{E}_\theta[\phi(X)]$. As it turns out, the complexity of $\mathcal{M}$ often makes it difficult to take the supremum, and $A^*(\mu)$ does not always have an explicit form. Mean Field approximation deals with these two difficulties by reducing $\mathcal{M}$ to the set $\mathcal{M}_{\text{tract}}$, which only considers tractable distributions q. The value of A(θ) can then be approximated by maximizing the lower bound $\langle \theta, \mu \rangle - A^*(\mu)$. This is equal to minimizing the Kullback-Leibler divergence between the true distribution p and all tractable distributions q, defined as

\[
D(q \,\|\, p) = \int_{\mathcal{X}} q(x) \log \frac{q(x)}{p(x)}\, \nu(dx).
\]
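This minimization over tractable q can be made concrete on a toy example. The sketch below (illustrative: a brute-force grid search, not a message-passing algorithm) finds the fully factorized distribution q(x1)q(x2) minimizing D(q||p) for a coupled distribution p on {0, 1}²:

```python
import math
from itertools import product

# Mean-field toy example: minimize D(q||p) over product distributions
# q(x1) q(x2) on {0,1}^2 by exhaustive grid search over the two marginals.

def kl(q, p):
    """Discrete KL divergence D(q||p); terms with q_i = 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# a coupled target p over (x1, x2), states ordered (0,0), (0,1), (1,0), (1,1)
p = [0.4, 0.1, 0.1, 0.4]

best = None
grid = [i / 200.0 for i in range(1, 200)]       # candidate marginal weights
for a, b in product(grid, repeat=2):            # q1(x1=1) = a, q2(x2=1) = b
    q = [(1 - a) * (1 - b), (1 - a) * b, a * (1 - b), a * b]
    d = kl(q, p)
    if best is None or d < best[0]:
        best = (d, a, b)

print(best)  # (minimal D(q||p) over the factored family, q1(1), q2(1))
```

The gap between the best factored q and p is the price of restricting to the tractable family; algorithms such as Variational Message Passing perform this same minimization by iterative factor updates instead of exhaustive search.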

So variational inference transforms the problem of computing the distribution into minimizing the KL divergence. For a Bayesian network this KL divergence can be minimized by using the Variational Message Passing algorithm, which initializes the tractable distribution q and then cycles through the factors in the distribution of q, updating each factor iteratively.

Statistical applications of variational inference can be found in Bayesian statistics, in the approximation of the posterior and of the marginal likelihood.
