
MSc Artificial Intelligence

Track: Learning Systems

Master Thesis

On the Usage of Herding

in Learning Sigmoid Belief Networks

by

Joost René van Amersfoort

10021248

June 14, 2016

42 credits

February 2015 – June 2016

Supervisor:

Prof. Dr. Max Welling

Assessor: Dr. Efstratios Gavves

Machine Learning Group, University of Amsterdam


Acknowledgments

I would like to thank Prof. Dr. Max Welling for his supervision and guidance throughout this thesis. Our Monday meetings were always insightful and helped me stay focused and on track. Furthermore, I would like to give special thanks to Christos Louizos and Otto Fabius for always making time to provide feedback or discuss how to proceed. I would also like to thank Dr. Efstratios Gavves and Dr. Piet Rodenburg for being part of my thesis defense committee. Finally, I would like to thank my roommates, friends and family for their support.


Abstract

In this thesis, two training algorithms for the Sigmoid Belief Network (SBN) [Neal, 1992] are augmented to include the deterministic sampling algorithm Herding [Welling, 2009]. The SBN is a directed binary latent variable model that gained popularity when it was used as a building block for the Deep Belief Network [Hinton et al., 2006]. It is difficult to train with backpropagation, due to the variance of the gradients that arises when backpropagating through the binary latent variables. Herding, as used in this thesis, is a deterministic alternative to Gibbs sampling [Geman and Geman, 1984]. Its notable advantage is quicker convergence than Gibbs sampling, at the expense of introducing a bias. The first training algorithm under consideration is Expectation-Maximization (EM) [Dempster et al., 1977], where Herding is used as a drop-in replacement for Gibbs sampling. The second training algorithm is Variational Inference for Monte Carlo Objectives (VIMCO) [Mnih and Rezende, 2016], where the stochastic sampling process is replaced by Herding. The results show that Herding achieves its intended effect of variance reduction, but also that EM is not a practical algorithm for training large SBNs and that VIMCO already has powerful variance reduction mechanisms in place, which reduces the benefits of using Herding.


Contents

1 Introduction
2 Preliminaries
   2.1 Sigmoid Belief Networks
       2.1.1 Model
       2.1.2 Autoregressive Sigmoid Belief Networks
   2.2 Herding
3 Overview of methods to train Sigmoid Belief Networks
   3.1 Expectation Maximization
       3.1.1 The Algorithm
       3.1.2 Sampling
       3.1.3 Advantages and Disadvantages of training an SBN with EM
   3.2 Variational Inference
       3.2.1 Mean-Field Theory
       3.2.2 Wake-Sleep algorithms
       3.2.3 Neural Variational Inference and Learning
       3.2.4 Variational Inference for Monte Carlo Objectives
           3.2.4.1 Derivation
           3.2.4.2 Comparison with other methods
4 Comparison of evaluation metrics for unsupervised models
5 EM experiment
   5.1 Inference
   5.2 Training
6 VIMCO experiment
   6.1 Comparison of sampling methods
   6.2 Gradient Variance
   6.3 Autoregression
7 Conclusion
   7.1 Future Work


Chapter 1

Introduction

This thesis concerns itself with learning a generative binary latent variable model in an unsupervised manner. In this introduction, we will put these technical concepts in context and describe our contributions to them.

Machine Learning algorithms are often used to model the conditional distribution of some unobserved variable y, also known as the label, given some observed variable x, 'the data point': p(y|x). These algorithms are called discriminative algorithms; examples include Linear Regression, Logistic Regression, Decision Trees and Support Vector Machines. Their counterpart is the generative algorithm, which instead of modeling the conditional probability models a joint distribution over observations and labels. This allows generating any variable in the model. Moreover, it is not necessary to supply the label of an observed variable, creating a model that depends on just the data points. Such a model can be learned in such a way that it reproduces its input, effectively allowing one to generate new samples that follow the same distribution as the input data. Learning a model without labels is called unsupervised learning.

In this thesis, we will look at one specific model of this type: the Sigmoid Belief Network (SBN) [Neal, 1992]. In the SBN, we assume that the data is generated by some latent distribution p(z) and we try to model the relation between this distribution and the observed variables, p(x|z); see also the graphical model in Figure 1.1. Specifically, we assume that the distribution p(z) is Bernoulli, i.e. binary. For more details, refer to Section 2.1 of the Preliminaries.

Figure 1.1: Graphical model of the SBN (observed x, latent z, parameters θ, plate over N data points).


Traditionally, models with latent variables are considered more difficult to learn, because it is impossible to apply the backpropagation algorithm [Rumelhart et al., 1988] without summing over (integrating, for continuous variables) all latent variable configurations. Recently, it has been suggested that it is possible to approximate the latent distribution by replacing it with a deterministic differentiable function ('the reparametrization trick'), which allows for what the authors call 'stochastic backpropagation' [Kingma and Welling, 2013, Rezende et al., 2014]. This works well when the latent distribution is continuous, such as a Gaussian, but when the latent distribution is binary the gradient estimates vary too much and the learning signal becomes too noisy to meaningfully update the parameters with. New algorithms have been proposed that combine stochastic backpropagation with variance reduction techniques, with varying levels of success. For an overview of these training methods and more background, see Chapter 3.

The main contribution of this thesis stems from adapting some of the algorithms discussed in Chapter 3 to include a different sampling algorithm called Herding [Welling, 2009]. While the concept of Herding can be applied in different ways, in this thesis it is used as a way to obtain samples from a Bernoulli distribution. The main advantage of Herding is that its convergence rate is (much) faster than that of Gibbs sampling, at the cost of a (small) bias. Sampling algorithms generally rely on a random number generator, but Herding works differently and is purely deterministic. Instead of using the probability that a random variable is one for a one-off sample, it keeps track of the probability mass assigned to each random variable over time, as if it were credits. The process works as follows: during a sampling round, the probability that a certain random variable is 1 is computed and added to its current stack of probability credits. If these credits reach 0.5, the sample value is set to one and one is deducted from the credits (possibly putting the random variable in debt). Otherwise the sample value is set to 0 and the credits are retained until the next round. This process is deterministic, which goes against the generally accepted definition of a sample, so we refer to samples created with this process as pseudo-samples. For a more formal treatment of Herding, refer to Section 2.2.
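To make the credit mechanism concrete, the following minimal NumPy sketch implements one such herding round (a sketch in our own notation; the function and variable names are not from the thesis):

```python
import numpy as np

def herding_round(p, credits):
    """One herding round: `p` holds the probabilities p(z_j = 1) computed in
    this round, `credits` the running stack of probability credits per
    variable. Returns the binary pseudo-sample and the updated credits."""
    credits = credits + p                      # add this round's probability
    sample = (credits >= 0.5).astype(float)    # fire once the credits reach 0.5
    credits = credits - sample                 # pay one credit per firing unit
    return sample, credits

# Toy usage: a unit with p = 0.1 fires deterministically about once
# every 10 rounds, at evenly spaced times.
rng = np.random.default_rng(0)
credits = rng.uniform(-0.5, 0.5, size=3)       # initialization used in Section 2.2
p = np.array([0.1, 0.5, 0.9])
for t in range(10):
    sample, credits = herding_round(p, credits)
```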

There are two main experiments in this thesis, both of which aim to test the variance reduction and bias of Herding in different settings. The first setting is the classical Expectation-Maximization case, where Herding is used as a drop-in replacement for Gibbs sampling, see Chapter 5. The second setting is based on a very recent algorithm and aims at improving the state of the art by starting from the best available algorithm and modifying it to include Herding, see Chapter 6.


Chapter 2

Preliminaries

2.1 Sigmoid Belief Networks

The SBN was introduced in 1992 by Radford Neal, while working at the University of Toronto [Neal, 1992]. As described in the introduction, the SBN is a directed binary latent variable model and can consist of any number of layers of latent variables. In this context, directed means the layers of latent variables have a one-way dependence, namely downwards. When Neal originally introduced the SBN he also described a training method based on Gibbs sampling, which achieved impressive results at the time but required a lot of computation. Interested in improving the training efficiency of the SBN, Saul et al. [1996] derived a variational algorithm based on mean field theory. Meanwhile, Hinton's group in Toronto came up with a different variational algorithm called wake-sleep [Hinton et al., 1995], in which they viewed the SBN as a Helmholtz machine [Dayan et al., 1995]. The real breakthrough in popularity for the SBN came when Hinton et al. [2006] discovered a fast greedy algorithm to train an extension of the SBN layer by layer, see also [Bengio et al., 2007]. This extension is commonly known as the Deep Belief Network (DBN) and consists of a Sigmoid Belief Network with a Restricted Boltzmann Machine (RBM) [Hinton, 2002] on top, see also Bengio et al. [2015] for more background on the naming. It is trained by starting with an RBM and stacking SBN layers one by one. The DBN was so popular because it was the first deep network and improved the state of the art on many data sets.

Recently, the research community has tried to carry the advances made in training continuous latent variable models over to binary latent variable models. The first paper to do this successfully is Neural Variational Inference and Learning [Mnih and Gregor, 2014], which attempts stochastic backpropagation using several variance reduction techniques.


An even more recent example is Variational Inference for Monte Carlo Objectives (VIMCO) [Mnih and Rezende, 2016], which employs importance sampling instead of the more elaborate techniques of NVIL. More analysis and derivations of the methods named in this section follow in Chapter 3.

2.1.1 Model

As described in the introduction, the SBN is a directed binary latent variable model. The directed part is exemplified by the arrows in Figure 2.1 pointing downwards:

Figure 2.1: One-layer Sigmoid Belief Network, with visible units x1–x4 and hidden units z1–z3.

The relation between two layers is given by the following equation:

\[
\hat{x}_i = \sigma\Big(\sum_{j} w_{ij} z_j + b_i\Big) \tag{2.1}
\]

With σ(·), not surprisingly, the Sigmoid activation function:

\[
\sigma(x) = \frac{1}{1 + e^{-x}} \tag{2.2}
\]

The prior distribution over z:

\[
p(z) = \prod_{j=1}^{K} q_j^{z_j} (1 - q_j)^{1 - z_j} \tag{2.3}
\]

Throughout this thesis the input data x is also binary (with $\hat{x}$ defined in Equation 2.1):

\[
p(x|z) = \prod_{i=1}^{D} \hat{x}_i^{\,x_i} (1 - \hat{x}_i)^{1 - x_i} \tag{2.4}
\]

The marginal log-likelihood of x conditioned on the parameters of the model then becomes:


\[
\log p(x) = \log \sum_{z} p(x|z)\, p(z) \tag{2.5}
\]

This is the quantity that we aim to maximize in order to get a good model fit.
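To illustrate why this quantity is intractable in general, here is a minimal sketch (our own code and naming, assuming a one-layer SBN with weights W of shape D × K) that evaluates Equations 2.1–2.5 exactly by enumerating all 2^K latent configurations; this is only feasible for very small K:

```python
import numpy as np
from itertools import product

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def exact_log_marginal(x, W, b, q):
    """Exact log p(x) = log sum_z p(x|z) p(z) for a one-layer SBN
    (Equations 2.1-2.5), by brute-force enumeration over 2^K terms."""
    K = len(q)
    total = -np.inf
    for bits in product([0.0, 1.0], repeat=K):
        z = np.array(bits)
        x_hat = sigmoid(W @ z + b)                                          # Eq. 2.1
        log_px_z = np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))  # Eq. 2.4
        log_pz = np.sum(z * np.log(q) + (1 - z) * np.log(1 - q))            # Eq. 2.3
        total = np.logaddexp(total, log_px_z + log_pz)  # accumulate log-sum-exp
    return total

# Tiny example: D = 4 visible units, K = 3 latent units.
rng = np.random.default_rng(0)
W, b, q = rng.normal(size=(4, 3)), np.zeros(4), np.full(3, 0.5)
print(exact_log_marginal(np.array([1.0, 0.0, 1.0, 0.0]), W, b, q))
```

With 20 latent units, as in the EM experiment of Chapter 5, this loop already runs over roughly a million configurations; with 200 units it is hopeless, which is why the approximations of Chapter 3 are needed.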

2.1.2 Autoregressive Sigmoid Belief Networks

A powerful extension of the SBN that explicitly models interactions between latent variables of the same layer is the autoregressive SBN. Such an extension was first introduced in the Deep AutoRegressive Network [Gregor et al., 2014], expanding on work by Larochelle and Murray [2011]. It was shown to improve the convergence speed and performance of the network.

The extension is straightforward and only requires the conditional probabilities to be changed to:

\[
p(x_i \mid x_{<i}, z) = \sigma\Big(\sum_{j} w_{ij} z_j + S_{i,<i}\, x_{<i} + b_i\Big) \tag{2.6}
\]

This introduces a new weight matrix S, which due to the way it is used has relevant weight parameters only in its upper triangular part. Regrettably, this change forces us to do a sequential pass through all the binary variables, which introduces a substantial increase in computation time, especially on the GPU.
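A short sketch (our own naming; W of shape D × K, S of shape D × D) makes the sequential dependency explicit:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoregressive_sample(z, W, S, b, rng):
    """Sample x from p(x_i | x_<i, z) as in Equation 2.6. Only the already
    sampled prefix x_<i enters unit i's activation, so the visible units
    must be generated one by one."""
    D = len(b)
    x = np.zeros(D)
    for i in range(D):
        a = W[i] @ z + S[i, :i] @ x[:i] + b[i]   # S contributes only x_<i terms
        x[i] = float(rng.random() < sigmoid(a))
    return x
```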

2.2 Herding

Herding was originally introduced by Welling [2009] as a way to directly and deterministically obtain pseudo-samples in a maximum entropy model. Later it was also used as an alternative to Gibbs sampling in Markov Random Fields [Bornn et al., 2013], which is similar to how the concept is applied in this thesis. Recall that in Gibbs sampling we iterate several times over all the latent variables and at each iteration we sample a random variable given the other random variables (for more background, refer to Section 11.3 of Bishop [2006]):

\[
x_i \sim p(x_i \mid x_{-i}) \tag{2.7}
\]


While this approach has been shown to converge to the true distribution, it requires long sampling chains, which are often not feasible in practice. Herding avoids these long sampling chains at the cost of introducing a bias. Formally, it is defined as:

\begin{align}
h_i(X_i) &\leftarrow h_i(X_i) + p(X_i \mid x_{-i}) \tag{2.8} \\
x_i &\leftarrow \operatorname*{argmax}_{X_i}\, h_i(X_i) \tag{2.9} \\
h_i(x_i) &\leftarrow h_i(x_i) - x_i \tag{2.10}
\end{align}

Where $X_i$ is a binary random variable and $x_i$ its current instantiation. Note that this process only works if $X_i$ is a binary latent variable; otherwise subtracting $x_i$ will not lead to correct behavior.

The vector h contains the potential stored over time. If this potential reaches 0.5 for a random variable, then the random variable's value is set to 1 in that sampling round and 1 is subtracted from the potential. Otherwise the value of the random variable is set to 0. This process leads to deterministic samples, with binary units turning on at predictable times. The potential vector h is stored across epochs, to allow random variables that have very low probability to turn on eventually. In order to further reduce bias, h is initialized with a uniform sample from (−0.5, 0.5).

The described sampling process has less variance than sampling with a random number generator. Consider the following scenario: a variable that is consistently 1 with probability 0.1 and has a start credit of 0.3 will turn on after 2 sampling rounds, but will then have to wait 10 rounds to be 1 again. This way we guarantee that the variable is not on more often than can be expected from its probability. A standard sampling algorithm might turn on the latent variable any number of times, creating a big dependency on the state of the random number generator. In fact, it has been shown that the Herding process converges at a rate of $\frac{1}{T}$, versus $\frac{1}{\sqrt{T}}$ for Gibbs, see Chen et al. [2012].

The drawback of Herding is that not all configurations of random variables will occur. Consider the situation where there are 2 random variables, the first with a start credit of 0.1 and the second with 0.2 (these are initialized at random, so this is a valid assumption), but both with the same probability of being on, namely 0.1. For the first two rounds both will produce the value 0, but in the third round the second variable will be 1, while the first variable will be 1 in the fourth round. Both will then be on again 10 rounds later. The effect is that they are never on at the same time, while given that each is on with probability 0.1, independent of the other, they should both be on simultaneously with probability 0.1 · 0.1 = 0.01. Despite these correlations being wrong, the marginal is still correct, because the random variables are 1 with roughly the same frequency as Gibbs.


In Chapter 5, several experiments are done to visualize the convergence behavior.
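The following short simulation (our own construction) reproduces the two-variable scenario above and shows that the joint configuration (1, 1) never occurs, even though each marginal comes out at roughly 0.1:

```python
import numpy as np

# Two variables, both with p(on) = 0.1, start credits 0.1 and 0.2.
credits = np.array([0.1, 0.2])
p = np.array([0.1, 0.1])
T, both_on, on_counts = 10000, 0, np.zeros(2)
for t in range(T):
    credits += p
    sample = (credits >= 0.5).astype(float)
    credits -= sample
    on_counts += sample
    both_on += int(sample.sum() == 2.0)
print(both_on / T)     # 0.0: herding never turns both on simultaneously
print(on_counts / T)   # ~[0.1, 0.1]: the marginals are still correct
```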


Chapter 3

Overview of methods to train Sigmoid Belief Networks

In this chapter the most important methods for training an SBN are described. When training an SBN we would like to maximize its marginal log-likelihood as defined in Equation 2.5. However, for any reasonable dimension of z it becomes intractable to compute the exact gradient. Therefore, we resort to some approximation; there are roughly two types of approximations possible: Expectation-Maximization (EM) and Variational Bayes (VB), with VB actually being a family of different approximations. We start with the Expectation-Maximization algorithm in Section 3.1 and continue with several different variational approaches, such as Wake-Sleep and NVIL. Not all of these approaches form the basis of extensions with Herding, but they are necessary to see the evolution of VB methods and make the introduction of recent algorithms look like natural extensions.

3.1 Expectation Maximization

The Expectation-Maximization (EM) algorithm, introduced by Dempster et al. [1977], consists of two alternating steps: the E-step and the M-step. In the E-step, the posterior is evaluated, while in the M-step the samples obtained in the E-step are used to compute a bound on the marginal log-likelihood. Maximization of this bound also leads to an increase in the true marginal log-likelihood. The EM algorithm requires that it is feasible to evaluate the posterior, and also that the lower bound (see Equation 3.1) is differentiable with respect to its parameters. Both are often the case, which has led to widespread usage of the EM algorithm.


3.1.1 The Algorithm

The following derivation is based on Section 9.4 of [Bishop, 2006] and the likelihood as defined in Section 2.1.1. It should give a comprehensive overview, but please refer to the cited work for more details. This derivation also forms the basis for Section 3.2. Note that in this section we explicitly condition on the parameters, since there are two sets of parameters (new and old) that are important to keep separate.

The quantity we want to maximize is the marginal log-likelihood:

\[
\log p(X|\theta) = \log \sum_{z} p(X, z|\theta)
\]

In this situation z is a binary variable; for a continuous variable the sum would change into an integral. In practice this value is intractable, due to the exponential growth of the number of configurations of z. For example, with just 100 latent variables it would require $2^{100}$ operations, which is outside the scope of even modern supercomputers. A quantity that we can optimize is the joint log-likelihood $p(x, z|\theta)$; however, we are only given the data, so instead we look at its expectation under the posterior $p(z|x, \theta)$ for one data point n:

\[
\mathcal{L}_n(\theta) = \sum_{z} p(z|x_n, \theta^{\text{old}}) \log p(x_n, z|\theta) \tag{3.1}
\]

We evaluate the posterior in the E-step using the old parameters and use this posterior to form the full expectation, which we optimize using gradient descent. This step is called the M-step and is performed on the log-likelihood of the entire data set:

\[
\mathcal{L}(\theta) = \sum_{n} \sum_{z} p(z|x_n, \theta^{\text{old}}) \big[\log p(x_n|z, \theta) + \log p(z|\theta)\big] \tag{3.2}
\]

The proof that optimizing this bound also optimizes the marginal log-likelihood goes as follows. We start with the standard product rule:

\[
\log p(x|\theta) = \log p(x, z|\theta) - \log p(z|x, \theta) \tag{3.3}
\]


Multiplying both sides by $p(z|x, \theta^{\text{old}})$ and summing over z:

\[
\sum_{z} p(z|x, \theta^{\text{old}}) \log p(x|\theta) = \sum_{z} p(z|x, \theta^{\text{old}}) \log p(x, z|\theta) - \sum_{z} p(z|x, \theta^{\text{old}}) \log p(z|x, \theta) \tag{3.4}
\]

Note that on the left-hand side the log-likelihood does not depend on z and that the posterior sums to 1, leading to:

\[
\log p(x|\theta) = \sum_{z} p(z|x, \theta^{\text{old}}) \log p(x, z|\theta) - \sum_{z} p(z|x, \theta^{\text{old}}) \log p(z|x, \theta) \tag{3.5}
\]

\[
\log p(x|\theta) = \sum_{z} p(z|x, \theta^{\text{old}}) \log p(x, z|\theta) + H\big(p(z|x, \theta^{\text{old}}),\, p(z|x, \theta)\big) \tag{3.6}
\]

Where $H(p, q) = -\sum_z p(z) \log q(z)$ is the cross-entropy. As a function of $\theta$ it is minimized at $\theta = \theta^{\text{old}}$, so during the M-step it can only increase; maximizing the first term (the bound of Equation 3.2) therefore never decreases the marginal log-likelihood, and the cross-entropy term can be left out of the optimization. It is however important to consider this entropy term when comparing models that are trained with EM. Unfortunately, it is difficult to compute, which makes it hard to do any comparison on models trained with EM. More background on comparing unsupervised models can be found in the next chapter. Note that in the following derivation of the gradients of $\mathcal{L}$ (see Equation 3.2), we leave out the sum over the data points for clarity.

Gradient with respect to $q_j$:

\[
\frac{\partial \mathcal{L}}{\partial q_j}
= \sum_{z} p(z|x) \left[ \frac{z_j}{q_j} - \frac{1 - z_j}{1 - q_j} \right]
= \sum_{z} p(z|x)\, \frac{z_j(1 - q_j) - (1 - z_j)\, q_j}{q_j (1 - q_j)}
= \frac{1}{q_j(1 - q_j)} \sum_{z} p(z|x)\,(z_j - q_j)
\]

Defining $q_j = \sigma(\alpha_j)$ and applying the chain rule:

\[
\frac{\partial \mathcal{L}}{\partial \alpha_j}
= \frac{\partial \mathcal{L}}{\partial q_j} \cdot \frac{\partial q_j}{\partial \alpha_j}
= \frac{\partial \mathcal{L}}{\partial q_j} \cdot \sigma(\alpha_j)\big(1 - \sigma(\alpha_j)\big)
= \sum_{z} p(z|x)\,(z_j - q_j)
\]


The gradient with respect to $w_{ij}$ is derived in similar fashion. First take the derivative with respect to $\sigma_i$ (shorthand for $\hat{x}_i$ of Equation 2.1):

\[
\frac{\partial \mathcal{L}}{\partial \sigma_i}
= \sum_{z} p(z|x) \left[ \frac{x_i}{\sigma_i} - \frac{1 - x_i}{1 - \sigma_i} \right]
= \sum_{z} p(z|x)\, \frac{x_i(1 - \sigma_i) - (1 - x_i)\,\sigma_i}{\sigma_i(1 - \sigma_i)}
= \sum_{z} p(z|x)\, \frac{x_i - \sigma_i}{\sigma_i(1 - \sigma_i)}
\]

Then apply the chain rule to compute the derivative with respect to $w_{ij}$:

\[
\frac{\partial \mathcal{L}}{\partial w_{ij}}
= \frac{\partial \mathcal{L}}{\partial \sigma_i} \cdot \frac{\partial \sigma_i}{\partial w_{ij}}
= \frac{\partial \mathcal{L}}{\partial \sigma_i} \cdot \sigma_i(1 - \sigma_i)\, z_j
= \sum_{z} p(z|x)\,(x_i - \sigma_i)\, z_j
\]

And for $b_i$ the result is the same without $z_j$:

\[
\frac{\partial \mathcal{L}}{\partial b_i} = \sum_{z} p(z|x)\,(x_i - \sigma_i)
\]

Written as expectations under the posterior $p(z|x)$:

\begin{align*}
\frac{\partial \mathcal{L}}{\partial \alpha_j} &= \mathbb{E}_{p(z|x)}[z_j] - q_j \\
\frac{\partial \mathcal{L}}{\partial w_{ij}} &= x_i\, \mathbb{E}_{p(z|x)}[z_j] - \mathbb{E}_{p(z|x)}[\sigma_i z_j] \\
\frac{\partial \mathcal{L}}{\partial b_i} &= x_i - \mathbb{E}_{p(z|x)}[\sigma_i]
\end{align*}

In order to obtain samples from the posterior p(z|x) we can use Herding or Gibbs sampling, as explained in the next section.
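As a sketch (our own naming; W has shape D × K and `sig` holds the per-sample values of $\sigma_i$, i.e. $\hat{x}_i$ of Equation 2.1), the three gradients can be estimated from a set of posterior (pseudo-)samples like this:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def em_gradients(x, z_samples, W, b, q):
    """Monte Carlo estimates of the M-step gradients for one data point,
    from S posterior (pseudo-)samples `z_samples` of shape (S, K),
    following the expectation forms derived above."""
    S = z_samples.shape[0]
    sig = sigmoid(z_samples @ W.T + b)       # (S, D): sigma_i for every sample
    E_z = z_samples.mean(axis=0)             # E[z_j]
    grad_alpha = E_z - q                                  # dL/dalpha_j
    grad_W = np.outer(x, E_z) - (sig.T @ z_samples) / S   # x_i E[z_j] - E[sigma_i z_j]
    grad_b = x - sig.mean(axis=0)                         # x_i - E[sigma_i]
    return grad_alpha, grad_W, grad_b
```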

3.1.2 Sampling

Unfortunately, we cannot sample directly from the posterior, because we simply do not know the parameters that govern the distribution. A possible solution would be to employ importance sampling, but since we have a high-dimensional latent space, we would need too many samples to reach a high-probability zone for this to be feasible. A solution for this type of problem is Markov Chain Monte Carlo (MCMC) sampling; refer to Section 11.2 of Bishop [2006] for more explanation of this topic. Gibbs sampling is a specific instantiation of an MCMC algorithm, where the random variables are sampled one by one, replacing each random variable with its newest value:


\[
z_j \sim p(z_j \mid z_{-j}) \tag{3.7}
\]

The probability that a certain hidden unit $z_j$ is 1 can be computed with the following expression:

\[
p(z_j = 1 \mid z_{-j}, x) \propto \prod_{i} p(x_i \mid z_{-j}, z_j = 1) \times p(z_j = 1)
\]

In order to renormalize, we also need $p(z_j = 0 \mid z_{-j}, x)$:

\[
p(z_j = 0 \mid z_{-j}, x) \propto \prod_{i} p(x_i \mid z_{-j}, z_j = 0) \times p(z_j = 0)
\]

Filling in the conditional distributions, we obtain the following expressions:

\[
p(z_j = 1 \mid z_{-j}, x_n) \propto \prod_i \sigma\Big(w_{ij} + \sum_{-j} w_{i,-j} z_{-j} + b_i\Big)^{x_{in}} \Big(1 - \sigma\Big(w_{ij} + \sum_{-j} w_{i,-j} z_{-j} + b_i\Big)\Big)^{1 - x_{in}} \times q_j
\]

\[
p(z_j = 0 \mid z_{-j}, x_n) \propto \prod_i \sigma\Big(\sum_{-j} w_{i,-j} z_{-j} + b_i\Big)^{x_{in}} \Big(1 - \sigma\Big(\sum_{-j} w_{i,-j} z_{-j} + b_i\Big)\Big)^{1 - x_{in}} \times (1 - q_j)
\]

Converting to log probabilities:

\[
\log p(z_j = 1 \mid z_{-j}, x) \propto \sum_i x_i \log \sigma\Big(w_{ij} + \sum_{-j} w_{i,-j} z_{-j} + b_i\Big) + (1 - x_i) \log\Big(1 - \sigma\Big(w_{ij} + \sum_{-j} w_{i,-j} z_{-j} + b_i\Big)\Big) + \log q_j
\]

\[
\log p(z_j = 0 \mid z_{-j}, x) \propto \sum_i x_i \log \sigma\Big(\sum_{-j} w_{i,-j} z_{-j} + b_i\Big) + (1 - x_i) \log\Big(1 - \sigma\Big(\sum_{-j} w_{i,-j} z_{-j} + b_i\Big)\Big) + \log(1 - q_j)
\]

We can now compute the actual probability $p(z_j = 1 \mid z_{-j}, x)$ by normalizing:

\[
p(z_j = 1 \mid z_{-j}, x) = \frac{e^{\log p(z_j = 1 \mid z_{-j}, x)}}{e^{\log p(z_j = 0 \mid z_{-j}, x)} + e^{\log p(z_j = 1 \mid z_{-j}, x)}} = \sigma\big(\log p(z_j = 1 \mid z_{-j}, x) - \log p(z_j = 0 \mid z_{-j}, x)\big)
\]


In Herding, this probability is added to the potential $H_j$:

\[
H_j \leftarrow H_j + \sigma\big(\log p(z_j = 1 \mid z_{-j}, x) - \log p(z_j = 0 \mid z_{-j}, x)\big)
\]

Next we check for each $H_j$ whether the value has reached 0.5. If so, we set $z_j^t = 1$ (with index j the latent variable and index t the sampling round) and subtract 1 from $H_j$; otherwise we set $z_j^t = 0$ and leave $H_j$ as is.

In Gibbs sampling, we instead use a random number generator to sample a value with probability $p(z_j = 1 \mid z_{-j}, x)$:

\[
z_j^{t+1} \sim p(z_j = 1 \mid z_{-j}^{t}, x)
\]

3.1.3 Advantages and Disadvantages of training an SBN with EM

Even though EM has been widely adopted by the community, it is a computationally expensive algorithm when applied to an SBN. As we have seen in the previous section, the E-step in an SBN requires MCMC sampling. This is done by computing values proportional to $p(z_i = 0 \mid z_{-i}, x)$ and $p(z_i = 1 \mid z_{-i}, x)$. However, computing these two values requires a double forward pass, which needs to be repeated for every latent variable in every sampling round. In variational methods (see the next section), evaluating the approximate posterior for all latent variables requires just one forward pass.

In recent years, deep learning has increasingly moved work from the CPU to the GPU [Krizhevsky et al., 2012]. GPUs have a much larger theoretical compute capability and, unlike CPUs, are still improving by a factor of two every two years. Their drawback is that they require the computational problem to be parallel. Since MCMC sampling is an inherently sequential process, the EM algorithm cannot take advantage of GPUs, which makes it difficult to scale EM to large data sets.

3.2 Variational Inference

Variational Inference (VI) is similar to EM in the sense that the marginal log-likelihood is also decomposed into a lower bound and a Kullback-Leibler divergence (KLD). The difference is that instead of first computing point estimates in the E-step, VI optimizes the lower bound directly, using an approximation q(z|x) of the true posterior. In this section we will look at several different approximations, each with its own trade-offs.


3.2.1 Mean-Field Theory

The first variational algorithm we will look at is Mean-Field Theory (MFT); it was invented by Parisi [1988] and first applied to SBNs by Saul et al. [1996].

As we have suggested in the introduction, every variational algorithm includes making some form of approximation. In MFT, the approximation is that the posterior can be factorized:

\[
q_n(z_n \mid x_n) = \prod_{i} q_n(z_{in} \mid x_n) \tag{3.8}
\]

A factorized posterior allows the value of a latent variable to be computed as a weighted average of all the units that feed into it. Refer to equation 10.6 in Bishop [2006] for more information.

The main benefit of MFT is that it allows training larger networks than EM, due to a much simpler inference step. This comes at the cost of MFT being unable to capture correlations between strongly fluctuating units, or logical constraints, since all units are modeled as if they were independent. Another disadvantage is that it is necessary to compute a posterior specific to each data point, which can take a lot of time for large data sets.

3.2.2 Wake-Sleep algorithms

Around the same time as MFT, Hinton et al. [1995] introduced the Wake-Sleep (WS) algorithm as a way to train Helmholtz machines, which are deep directed graphical models over visible variables x and hidden variables z. The variables are organized in layers, with the visible variables at the bottom and the hidden variables stacked on top. The top layer has a factorized unconditional distribution p(z), which makes it possible to do ancestral sampling from the top layer down to the visible layer and obtain a sample of x. The SBN is a specific instantiation of the Helmholtz machine [Dayan and Hinton, 1996], where the hidden variables are binary and the activation function is a sigmoid. WS involves training an auxiliary network, called the inference network, alongside the generative model. This network stochastically outputs samples for each layer, which estimate the conditional probability of the generative model at their respective layer. Using these samples the joint log-likelihood p(x, z) becomes fully visible and can be directly maximized. This is called the wake phase. Recalling the variational bound:

(19)

Chapter 3 Overview of methods to train Sigmoid Belief Networks 18

\[
\log p(x) \geq \sum_{z} q(z|x) \log \frac{p(x, z)}{q(z|x)} \tag{3.9}
\]

From this it can be seen that if we optimize the posterior q(z|x), we are effectively pushing up a bound on the true log-likelihood as well. The posterior can be optimized by obtaining a 'dream sample' from p(x, z) using ancestral sampling and then maximizing the likelihood of these samples. This makes q(z|x) estimate the true posterior p(z|x) and is called the sleep phase. This update minimizes the divergence KL(p(z|x)||q(z|x)), while we have seen in the derivation of the bound in Section 3.1 that it should minimize KL(q(z|x)||p(z|x)). This leads to optimization deficiencies, which are unfortunately an inherent property of the WS algorithm. The size of these deficiencies depends on the problem setting; a good example can be found in Figure 2 of Kingma and Welling [2013], where WS is compared with Auto-Encoding Variational Bayes (AEVB), an algorithm that does directly optimize the lower bound. The results in that paper show that WS is consistently outperformed by an algorithm that optimizes the correct KL divergence.

In comparison with Mean-Field Theory, WS does not rely on the assumption that the posterior is fully factorized. Only latent variables within a layer are fully factorized; there are still correlations between layers. These correlations allow the latent variables to learn more meaningful representations, at the cost of having to train extra parameters for the inference network. Generally, training these parameters is still faster than doing MFT, because there is just a single posterior for all data points.

Recently, the wake-sleep algorithm was extended to include importance sampling, yielding Reweighted Wake-Sleep (RWS) [Bornschein and Bengio, 2014]. In this extension, the inference network obtains K samples, which are weighted according to their probability. These weighted samples give an unbiased estimator of the marginal likelihood and significantly improved results over standard wake-sleep, at a computational cost that grows linearly in the number of importance samples taken. The main benefit of this method is the reduction of variance in the gradients (see Figure 1 of Bornschein and Bengio [2014]), which is often a problem in binary latent variable models. The results of RWS are impressive even when taking just five samples.

3.2.3 Neural Variational Inference and Learning

Neural Variational Inference and Learning (NVIL) [Mnih and Gregor, 2014] attempts to follow the gradient of the variational lower bound directly, in similar fashion to the Variational Autoencoder [Kingma and Welling, 2013]. Doing this naively, by just taking a sample from the posterior and computing the gradient based on that, is not feasible due to the large variance in the gradients of the encoder. Instead NVIL employs several variance reduction techniques for the encoder, which make training possible.


NVIL can be used to train both continuous and discrete latent variable models, although the authors recommend the reparametrization-trick-based algorithms for continuous latent variable models.

The derivation of NVIL starts by noticing that directly optimizing the variational bound is impossible due to the high variance of the gradients of the inference network q. It then proceeds by looking at three techniques which, when used concurrently, bring the variance down to an acceptable level. Before looking at these techniques, let us first introduce the gradient of the variational lower bound (see Equation 3.9) with respect to the inference network (equation 6 of Mnih and Gregor [2014]):

\[
\nabla_{\phi} \mathcal{L}(x) \approx \frac{1}{t} \sum_{i=1}^{t} \big(\log p_{\theta}(x, z^{(i)}) - \log q_{\phi}(z^{(i)}|x)\big)\, \nabla_{\phi} \log q_{\phi}(z^{(i)}|x) \tag{3.10}
\]

With $\phi$ the parameters of the inference network, $\theta$ the parameters of the generative network, and $t$ the number of samples taken by the inference network. Note that this is not the full gradient; for brevity we leave out the gradient of the generative network $p_\theta$, because it does not pose any problems during optimization. If we now zoom in on the subtraction that scales the gradient of $\log q_\phi$:

\[
l_{\phi}(x, z) = \log p_{\theta}(x, z^{(i)}) - \log q_{\phi}(z^{(i)}|x) \tag{3.11}
\]

This is what Mnih and Gregor [2014] call 'the learning signal'. It can potentially become very large, which leads to large variance in the parameter updates, making learning slow. This behavior has been observed in practice, leading researchers to avoid this gradient estimator. Mnih and Gregor observe that any value that does not depend on z can be subtracted from this learning signal, and introduce an input-dependent baseline C(x), trained to minimize the expected square of the centered learning signal, as originally proposed by Williams [1992]. C(x) is parameterized as a neural network. Furthermore, they normalize the variance of the learning signal and extend the input-dependent baseline to be specific per layer: $C_i(z_{i-1})$.
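A sketch of how the resulting learning signal might be computed (this is our reading of the variance reduction described above, not the authors' code; all names are ours):

```python
def nvil_signal(log_p_xz, log_q_zx, baseline, running_std):
    """Centered and variance-normalized learning signal used to scale
    grad log q(z|x). `baseline` is the output C(x) of the baseline
    network; `running_std` is a running estimate of the signal's
    standard deviation."""
    l = log_p_xz - log_q_zx           # raw learning signal, Eq. 3.11
    l = l - baseline                  # subtract the input-dependent baseline
    return l / max(running_std, 1.0)  # normalize the variance, floored at 1
```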

The result of all this variance reduction is impressive, as NVIL beats WS by a considerable margin (see Table 1 of Mnih and Gregor [2014]). This algorithm exemplifies a big step in variational inference for binary latent variable models. The only drawback is that the extensions introduced by NVIL lead to a significantly larger engineering cost when implementing new models and require more hyperparameters to be set.


3.2.4 Variational Inference for Monte Carlo Objectives

In the WS and reweighted WS setting, we have seen that the variance of the gradient can be reduced by replacing the traditional sampling step with an importance sampling approach. In the same spirit, but in the context of continuous latent variable models that directly optimize a bound on the marginal log-likelihood, Burda et al. [2015] introduced importance weighted variational autoencoders. Their results show that taking multiple samples of the posterior and weighting them by their probabilities leads to a better defined posterior and subsequently a better lower bound on the marginal log-likelihood. Following this example, Mnih and Rezende [2016] found a way to simplify the NVIL model by also incorporating multiple (weighted) samples, in what they called Variational Inference for Monte Carlo Objectives (VIMCO). Their model introduces a local learning signal for each sample that depends on all the other samples. This is the model that forms the basis for the experiments laid out in Chapter 6.

Intuitively, using multiple samples means that not every sample has to explain the observation well. This leads to smoother posterior distributions, which in turn leads to a smoother learning curve and less instability.

3.2.4.1 Derivation

Instead of starting by rewriting the log-likelihood of the data, VIMCO introduces an unbiased estimator of the marginal likelihood:

\[
\hat{I}(z^{1:K}) = \frac{1}{K} \sum_{i=1}^{K} \frac{p(x, z^i)}{q(z^i|x)} \quad \text{with} \quad z^{1:K} \sim q(z^{1:K}|x) \tag{3.12}
\]

With q the proposal distribution (also known as the inference network) and K the number of samples. By taking the logarithm of $\hat{I}$, it is straightforward to obtain a lower bound on the log-likelihood. In general, the objective that we want to use for training models is:

\[
\mathcal{L}^K(x) = \mathbb{E}_{q(z^{1:K}|x)} \left[ \log \frac{1}{K} \sum_{i=1}^{K} f(x, z^i) \right] \tag{3.13}
\]

Where $f(x, z^i)$ can be any unbiased estimator of the likelihood. Note that the expectation of an unbiased estimator equals the quantity it estimates; using $\frac{p(x, z^i)}{q(z^i|x)}$ as $f(x, z^i)$ therefore makes $\mathcal{L}^K(x)$ a lower bound on the log-likelihood.


It is also interesting to remark that setting K = 1 in this equation recovers the classical variational lower bound we have seen before.

The gradient for this objective is:

\[
\nabla_{\theta} \mathcal{L}^K = \mathbb{E}_{q(z^{1:K}|x)} \Big[ \sum_j \hat{L}(z^{1:K})\, \nabla_{\theta} \log q(z^j|x) \Big] + \mathbb{E}_{q(z^{1:K}|x)} \Big[ \sum_j \tilde{\omega}_j\, \nabla_{\theta} \log f(x, z^j) \Big] \tag{3.14}
\]

with the weights $\tilde{\omega}_j = \frac{f(x, z^j)}{\sum_{i=1}^{K} f(x, z^i)}$ and $\hat{L}(z^{1:K})$ the log of Equation 3.12, which is similar to the learning signal introduced in the NVIL derivation.

For NVIL, it was derived that it is possible to subtract any quantity from the learning signal as long as it does not depend on z, or in the case of the local learning signal on $z^i$. In VIMCO, it is proposed to subtract from the learning signal of each sample the average learning signal of all the other samples. This yields both a sample-independent value and a good estimator of the learning signal for that sample (since the samples are IID):

\[
\hat{L}(z^j \mid z^{-j}) = \hat{L}(z^{1:K}) - \log \frac{1}{K-1} \sum_{i \neq j} f(x, z^i) \tag{3.15}
\]

The final gradient estimator then becomes:

\[
\nabla_{\theta} \mathcal{L}^K(x) \simeq \sum_j \hat{L}(z^j \mid z^{-j})\, \nabla_{\theta} \log q(z^j|x) + \sum_j \tilde{\omega}_j\, \nabla_{\theta} \log f(x, z^j) \tag{3.16}
\]
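A sketch of the two ingredients of Equation 3.16, computed in log space for numerical stability (function and variable names are ours):

```python
import numpy as np
from scipy.special import logsumexp

def vimco_signals(log_f):
    """Given log f(x, z^i) for K samples, return the local learning signals
    L_hat(z^j | z^-j) of Equation 3.15 and the normalized importance
    weights of Equation 3.14."""
    K = len(log_f)
    L_hat = logsumexp(log_f) - np.log(K)   # log of Equation 3.12
    local = np.empty(K)
    for j in range(K):
        others = np.delete(log_f, j)
        # subtract (in log space) the mean of the other samples' estimates
        local[j] = L_hat - (logsumexp(others) - np.log(K - 1))
    weights = np.exp(log_f - logsumexp(log_f))
    return local, weights
```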

3.2.4.2 Comparison with other methods

VIMCO clearly outperforms NVIL, as can be seen in Table 3.1. The improvement over reweighted WS is minimal, which seems to indicate that using multiple samples negates the theoretical disadvantage of optimizing the reverse KL divergence. In comparison with NVIL, VIMCO is a simpler model that does not need to learn a special baseline network or employ other (heuristics-based) variance reduction techniques. This comes at the computational cost of having to obtain multiple samples, which scales roughly linearly with the number of samples. Concluding, we might say that these methods perform comparably on the binarized MNIST data set, but that VIMCO is the easiest to implement and use.


Experiment            Test score
10 samples, VIMCO     91.3
10 samples, RWS       92.9
10 samples, NVIL      95.0

Table 3.1: Negative log-likelihood scores on a binarized MNIST data set (lower is better), all from Mnih and Rezende [2016].


Chapter 4

Comparison of evaluation metrics

for unsupervised models

In this chapter we look at several evaluation metrics for unsupervised models and assess their potential to help compare EM to VIMCO. In the previous chapter we have seen two metrics: an expectation of the joint log-likelihood (EM) and a bound on the marginal log-likelihood (the other models). These two metrics are the most common for unsupervised latent variable models; there are, however, unsupervised models that have neither. An example is the class of models called Generative Adversarial Networks (GANs) [Goodfellow et al., 2014]; these models use an adversarial loss and do not try to estimate a probability density function.

Focusing on the models introduced in the overview chapter, we first look at ways to compare models for which we can only compute the joint log-likelihood with models that report a bound on the marginal log-likelihood. One possible way to do the comparison is by looking at the discriminatory power of the learned representation, as first suggested in the original paper on SBNs [Neal, 1992]. It works by first computing the latent representation (posterior) of each data point in the training set and then training a classifier on this representation. The comparison is done by looking at the classification scores, which is a weak proxy for the true quality of the representation, since the model was never trained to do well on discriminative tasks. This type of comparison is impossible for GAN-type networks, because these networks compute no latent representation. Another option suggested by Neal [1992] is to simply pick a very small network and compute the exact marginal log-likelihood. This is impossible for any network of interest.

A more recent suggestion is Annealed Importance Sampling (AIS) by Salakhutdinov and Murray [2008], originally introduced as a method to estimate the partition function of an RBM, a necessary prerequisite for computing the marginal log-likelihood of a Deep Belief Network. While they claim good results, they also note that AIS occasionally


gives a large overestimate of the true value. It requires careful tuning of several hyperparameters and can take a long time for a moderately sized network. We tried to implement this method, but were unable to get it to work at a satisfactory level on an SBN.

Recently, since the introduction of the GAN, there has been renewed interest in evaluation metrics for unsupervised models. Goodfellow et al. [2014] made an effort by using Parzen window estimates, a process in which many samples are drawn from the model and used to compute the parameters of some distribution under which the likelihood of the test set is computed. It has since been shown that this approach does not achieve what it intended [Theis et al., 2015], and that simply performing K-means will outperform any model. As a last resort, some researchers showed visual samples to prove the quality of their model [Denton et al., 2015]. Since there is an obvious problem when the network simply returns one of the training images, the researchers made sure to verify that the Euclidean distance to the nearest neighbor was sufficiently large. Unfortunately, Theis et al. [2015] also showed that this approach does not guarantee that the network came up with an image by itself.

Concluding, evaluating unsupervised models is a difficult problem. Theis et al. [2015] suggest that each model should be evaluated for its intended purpose. However, if the purpose is to evaluate training methods, this advice is not useful. During the experiments with EM we will look at exactly how large the problem of not being able to compute the entropy is. If it is sufficiently large, we should conclude that it is practically impossible to make a proper comparison.


Chapter 5

EM experiment

In this experiment, we aim to determine the effect of using Herding as the mechanism to perform inference and obtain samples in the E-step of EM. Specifically, we analyze to what extent Herding introduces a bias during inference and how this affects training an SBN.

5.1 Inference

In order to investigate how well Herding works for inference, we want to measure the difference between the expectation computed with samples and the true distribution. Computing the true distribution requires summing over all values of the posterior distribution, which for n latent variables requires $2^n$ computations. Therefore we had to resort to a small SBN for this experiment and chose an SBN with one hidden layer consisting of 20 latent variables. Before computing the true distribution, the network was first trained until convergence, using Gibbs sampling to perform inference in the E-step.

Using the final parameters of the converged model, we computed the ground-truth value of p(z|x) for a few data points. Next we ran a Gibbs and a Herding sampling chain for $10^5$ steps and computed the rolling expectation, which is defined as the expectation up until step t:

\[
\mathbb{E}[z]_t = \frac{1}{t} \sum_{j=0}^{t} z^j \tag{5.1}
\]


Figure 5.1: Convergence of samples for an increasing number of iterations; (a) data point 1, (b) data point 2.

With $z^j$ a sample of all latent variables at time j. By taking the difference between the rolling expectation and the ground truth, we can show the convergence speed of the sampling algorithms:

\[
\text{error} = p(z|x) - \mathbb{E}[z]_t \tag{5.2}
\]

The results can be seen on a log-log plot in Figure 5.1.
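A sketch of how the error curves can be computed from a recorded chain (our own code; averaging the absolute error over the latent units is our choice, as the thesis does not state how per-unit errors are aggregated):

```python
import numpy as np

def rolling_error(samples, p_true):
    """Error between the running mean of a chain (Equation 5.1) and the
    ground-truth posterior marginals (Equation 5.2). `samples` has shape
    (T, K); returns one error value per step for a log-log plot."""
    t = np.arange(1, samples.shape[0] + 1)[:, None]
    running_mean = np.cumsum(samples, axis=0) / t        # E[z]_t
    return np.abs(p_true - running_mean).mean(axis=1)    # averaged over units
```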

These results show that Herding's error is lower than Gibbs', but that at a certain point (around 200 iterations for data point 1 and 800 iterations for data point 2) it stops improving and becomes a flat line. Gibbs starts out worse, but catches up once Herding's curve becomes flat. These results confirm our theoretical expectation that Herding performs better in the beginning, but converges to the wrong value. We also see that Gibbs sampling performs worse at the start, but continues to improve even after taking $10^5$ samples, which is in line with the literature and confirms that no bugs are present in the code. Interestingly, our hypothesis that Gibbs would converge at a rate of $\frac{1}{\sqrt{T}}$, with T the number of samples, while Herding should manage the quicker $\frac{1}{T}$, has not been shown to be true: the slopes of the two error curves are similar. If we instead draw a line for Herding starting at the 100-iteration point of Gibbs and ending where Herding's curve becomes flat, then that line does have a slope resembling $\frac{1}{T}$. There are two important points to consider in understanding this result.

Firstly, the probability vector is initialized at 0, which means that the first sample is the maximum likelihood result for Herding, giving it a head start that makes the slope flatter than it would otherwise be. Secondly, these results were obtained on a converged model; doing these experiments halfway through training could give a more varied result, since a converged model is more certain about which latent variables should be on for a specific data point, which already reduces the variance.


Figure 5.2: Comparison of training with different numbers of samples; (a) Herding with 5 and 200 samples, (b) increasing numbers of Gibbs samples.

Reduced gradient variance in converged models was also observed in the VIMCO experiments, see Chapter 6.

Aside from the statistical difference, the computational speed and memory cost of Herding and Gibbs are very similar. Herding incurs a slightly increased memory cost for keeping track of the probability potential h, which is one 32-bit float value for each latent variable for each data point. In the context of MNIST this means 50000 · 200 · 32 bits, which is equal to about 40 megabytes, a negligible amount on a modern system. The computational cost is slightly decreased, because it is no longer necessary to obtain a value from a random number generator. In practice, there is no tangible difference in running either algorithm.

5.2 Training

Comparing different models trained with EM in combination with MCMC sampling is difficult, because it is impossible to reliably estimate the entropy term that is part of the log-likelihood. In fact, there is a trade-off between putting a lot of probability in a very small area (leading to a low entropy) but having a worse expectation over the joint log-likelihood, and spreading the probability (leading to a high entropy) but having a good expectation over the joint log-likelihood. In all the graphs in this section, the number plotted is only the expectation over the joint log-likelihood.

In the graphs of Figure 5.2 we can clearly see that the entropy can have a substantial effect. Especially in Figure 5.2a, we see that the network trained with 5 pseudo-samples appears to perform much better than the one trained with 200 pseudo-samples (the graph stops around 150 epochs because that model is very computationally expensive to train). Also, the curve of the 200 pseudo-sample model is nonsensical: it should never go down after just 10 epochs or be flat so early in the training process.


Figure 5.3: (a) samples from the model trained with 5 pseudo-samples, (b) samples from the USPS data set.

The entropy factor could explain this behavior. Looking at the Gibbs results in Figure 5.2b, we see that the behavior is more normal: increasing the number of samples generally leads to better performance, although again the 200-sample model is outperformed by a model trained with far fewer samples (30).

To verify whether Herding with 200 pseudo-samples is really as bad as it looks, we sampled from that model and from the Gibbs model trained with 200 samples. Sampling from the model works by first sampling z from the prior, using a random number generator, and then computing the activation:

\[
z \sim p(z) \tag{5.3}
\]

\[
x = \sigma(W^{T} z + b) \tag{5.4}
\]
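A sketch of this ancestral sampling step (our own naming; W is taken to have shape K × D to match the transpose in Equation 5.4):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_from_model(W, b, q, rng):
    """Ancestral sampling as in Equations 5.3-5.4: draw z from the prior
    with a random number generator, then compute the visible activations,
    which can be displayed directly as grey-scale images."""
    z = (rng.random(len(q)) < q).astype(float)   # z ~ p(z)
    return sigmoid(W.T @ z + b)                  # x = sigma(W^T z + b)
```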

In Figure 5.3, an example of samples taken from the model is shown. On the left are samples from the seemingly well-performing model trained with Herding and 5 pseudo-samples; on the right are data points from the USPS data set. The actual comparison can be found in Figure 5.4. The samples from the model trained with Gibbs have a little more contrast, but are not obviously better. Unfortunately, it is impossible to draw conclusions from this either, leading to the conclusion that it is better to focus on algorithms for which the marginal log-likelihood or a tight bound is computable, such as VIMCO.


Figure 5.4: Samples from two models trained with 200 samples; (a) obtained with Gibbs, (b) obtained with Herding.


Chapter 6

VIMCO experiment

In this chapter, we first aim to reproduce the results laid out in the original VIMCO paper [Mnih and Rezende, 2016], in order to validate that our implementation is correct, but also that the claims in the paper hold. Note that at the time of writing, this paper had not yet been peer reviewed. After confirming that the results are similar to what is reported in the paper, we replace the stochastic sampling by a deterministic version based on Herding and repeat the experiment. The hypothesis for the outcome of this experiment is that for a low number of samples, around five, Herding will outperform stochastic sampling, but not by the same margin as seen in the EM experiment. This is because the samples are importance weighted, which already decreases the variance of the gradients. The experiment is similar to one of the EM experiments: we start at five samples for both Herding and stochastic sampling and gradually increase the number of stochastic samples until we reach the same performance as Herding. After discovering at what number of samples the two methods are equal, we also want to quantify the variance reduction. One possible way to do this is to look at the variance of the gradients of the parameters. As in the previous experiment, we expect Herding to outperform stochastic sampling by a small but significant amount. We looked at the variance of the gradients in two models: one trained for three epochs and one trained until convergence.

Finally, we also made an autoregressive VIMCO. Autoregression does not change the core VIMCO algorithm, only the conditional probabilities q(z|x) and p(x|z). Recall the definition of q(z|x) from before:

\[
q(z_i \mid x) = \sigma\Big(\sum_{j} w_{ij} x_j + b_i\Big) \tag{6.1}
\]


Which now turns into:

\[
q(z_i \mid z_{<i}, x) = \sigma\Big(\sum_{j} w_{ij} x_j + s_{i,<i}\, z_{<i} + b_i\Big) \tag{6.3}
\]

p(x|z) changes in a similar fashion, with its own set of autoregressive parameters, i.e. the matrix s is not shared.

This change incurs a significant increase in computation time, because the latent variables now need to be computed sequentially: every latent variable depends on all the latent variables before it in the same layer. On the CPU this would not make a noticeable difference, but on the GPU it is quite substantial. Interestingly, the computational complexity only increases slightly, but the lack of parallelization is detrimental to performance.

Aside from the experiments on autoregression, all the experiments in this chapter are performed with a three-layer Sigmoid Belief Network, where each layer consists of 200 latent variables and the prior on the top layer is also learned. The learning rate is fixed at 0.001 and the same for each layer, and the batch size is set to 24. Adam [Kingma and Ba, 2014] is used as the optimizer, with the β values at 0.95 and 0.999. The data set is not standard MNIST, but a binarized version of MNIST [Murray and Salakhutdinov, 2009] as made available by Larochelle and Murray [2011]; it is commonly used in the binary latent variable model literature. This experimental setup closely resembles the one defined in the original VIMCO paper [Mnih and Rezende, 2016]. During the experiments we track the average lower bound of the log-likelihood on the train and validation sets, both obtained with the number of samples set by the experiment. After convergence (usually around 2000 epochs for VIMCO experiments), we compute a lower bound on the average log-likelihood of the test set with 1000 samples per data point (regardless of the number of samples the model is trained with).

6.1 Comparison of sampling methods

In Table 6.1, the results can be found for our reimplementation of VIMCO, training a model with the hyperparameters explained above. Notably, our implementation scores -91.33 while the paper reports -93.6 for the same number of samples. This is likely because our hyperparameters were more carefully chosen.


Figure 6.1: Comparison of the lower bound on the log-likelihood for different numbers of samples taken during training; (a) SBN trained with stochastic samples and Herding, both using 5 samples, (b) SBN trained with different numbers of stochastic samples and 5 Herding samples. Herding outperforms stochastic sampling, but is matched when the number of stochastic samples is doubled.

Experiment                                   Test score
5 samples, paper [Mnih and Rezende, 2016]    -93.6
5 samples, Herding                           -91.17
5 samples, stochastic                        -91.33
10 samples, stochastic                       -91.5
25 samples, stochastic                       -91.25

Table 6.1: Log-likelihood scores on the test set, higher is better.

The results of comparing stochastic sampling to Herding are shown in Figure 6.1. In the first figure (6.1a), the model is trained with 5 samples for both Herding and stochastic sampling, while in the second figure (6.1b) the number of stochastic samples is increased until it matches the performance of Herding. As can be seen in the graphs, training the model with 5 Herding samples outperforms training with 5 stochastic samples. When sampling with Herding, the model converges much quicker; however, the validation score after convergence is not that different. When increasing the number of samples, it is clear that 10 stochastic samples is the best match for 5 Herding samples, which is in line with our hypothesis that the importance weighting component of VIMCO already reduces the variance of the gradients and that Herding's effect is less significant than in the EM case.


Figure 6.2: Gradient variance plots for the model trained for 3 epochs; (a)-(c) encoder layers 1-3, (d)-(f) decoder layers 1-3.

6.2 Gradient Variance

In order to quantify the variance reduction and verify whether it is indeed the reason for the improved performance of Herding in the previous section, we performed a separate experiment to visualize the amount of gradient variance reduction when increasing the number of samples. The experiment works by doing 10 runs for each of an increasing number of samples: we first take 10 sets of 2 samples, then 10 sets of 5 samples, and so on, up to 500 samples in steps of 5. We then compute the corresponding gradients for all these sets of samples, compute the variance over the 10 sample sets of the same size, and plot the result on a log-log plot; see Figure 6.2 for the results of a model trained for 3 epochs and Figure 6.3 for the results on the converged model. Except for layer 2 of the decoder in Figure 6.2e, the variance is not noticeably lower for Herding versus stochastic sampling. It is also clear that the variance of the encoder is an order of magnitude larger than that of the decoder, which follows the theoretical predictions [Mnih and Gregor, 2014, Mnih and Rezende, 2016]. Looking at the results for the converged model, there is no discernible difference between stochastic sampling and Herding. Interestingly, when comparing the model at 3 epochs with the converged model, the variance in general dropped by an order of magnitude, even for the encoder. The fact that there is no difference between Herding and stochastic sampling could mean two things. First, it could mean this way of measuring the variance of the gradients does not work well, which is unlikely because it is also used successfully in the paper on RWS [Bornschein and Bengio, 2014]. Second, it could mean there is not really any difference and the importance weighted sampling already negates any effect.
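A sketch of this measurement (our own code; `grad_fn` is a hypothetical closure that draws n samples and returns the flattened gradient of one weight matrix):

```python
import numpy as np

def gradient_variance_curve(grad_fn, sample_sizes, runs=10):
    """For each sample size, draw `runs` independent gradient estimates
    and record the per-parameter variance, averaged for a log-log plot."""
    curve = []
    for n in sample_sizes:
        grads = np.stack([grad_fn(n) for _ in range(runs)])  # (runs, P)
        curve.append(grads.var(axis=0).mean())
    return np.array(curve)

# e.g. sizes = [2] + list(range(5, 501, 5)), as described above
```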


Figure 6.3: Gradient variance plots for the converged model; (a)-(c) encoder layers 1-3, (d)-(f) decoder layers 1-3.

Figure 6.4: Distribution of variance levels for all elements of a weight matrix; (a) Herding, (b) Gibbs.

The difference between training a model with stochastic sampling and Herding should then be explained by some other, unknown effect.

In order to understand better what is happening, we created a histogram of the distribution of variance over the parameters, the intuition being that the average might be equal, but stochastic sampling might have more outliers. In Figure 6.4, a histogram is shown of the variance per parameter when taking 5 samples, for the final weight matrix of the decoder (similar to Figures 6.3f and 6.2f). It shows that the variance of stochastic sampling has a wider distribution, with more parameters having slightly higher variance. The effect is minimal, but it could indicate that Herding is helping with training after all.


Figure 6.5: Autoregression results. Panel (a): cut off at 400 epochs; panel (b): full results.

Experiment                   Test score
Herding (2000 epochs)            -91.17
Stochastic (2000 epochs)         -91.33
AR Herding (168 epochs)          -92.45
AR Stochastic (257 epochs)       -92.30

Table 6.2: Test set scores

6.3 Autoregression

As described before, running autoregression is significantly slower than the standard VIMCO algorithm. On a GTX 750, standard VIMCO is a factor 10 faster than autoregression, and a factor 30 faster than autoregressive Herding. We therefore limited the experiments to what was practically feasible and focused on showing that these models converge faster, which indeed follows directly from the graphs in Figure 6.5: after just 50 epochs the autoregressive models reach a better validation score than the original models ever do. However, because we could only train the autoregressive models for so few epochs, they did not perform better on the test set, as Table 6.2 shows. It is very likely that training them for the same number of epochs would lead to a substantial improvement. Repeating the experiment with both models trained for the same amount of wall-clock time, on the other hand, would likely result in the standard SBN outperforming the autoregressive version, because of the aforementioned computational cost.
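The slowdown stems from the dependency structure of the sampling step. The sketch below is our own simplified formulation (the names, and the strictly lower-triangular autoregressive matrix A, are illustrative assumptions rather than the thesis code): a standard SBN layer can sample all units with one matrix product, whereas an autoregressive layer must sample unit i after units 0..i-1 of the same layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_layer(W, b, h_prev):
    """Standard SBN layer: every unit depends only on the layer above,
    so all units are sampled at once (parallel, GPU-friendly)."""
    p = sigmoid(h_prev @ W + b)
    return (rng.random(p.shape) < p).astype(np.float32)

def sample_layer_autoregressive(W, b, A, h_prev):
    """Autoregressive layer: unit i additionally conditions on units
    0..i-1 of the same layer (A is strictly lower triangular), which
    forces one sequential step per latent unit."""
    base = h_prev @ W + b
    h = np.zeros(b.shape[0], dtype=np.float32)
    for i in range(b.shape[0]):  # inherently sequential
        p_i = sigmoid(base[i] + A[i, :i] @ h[:i])
        h[i] = float(rng.random() < p_i)
    return h
```

With Herding on top, the k pseudo-samples must in addition be produced one after another, so the number of sequential steps grows as k times the number of latent units; this matches the extra slowdown of autoregressive Herding reported above.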


Chapter 7

Conclusion

In this thesis we investigated using Herding to improve the training procedure of a Sigmoid Belief Network. Specifically, we aimed to reduce the variance of the gradient updates, enabling faster convergence and better results when training the network with a small number of samples. To this end, we performed two experiments. The first consists of training the SBN using Expectation-Maximization, with Herding as the sampling algorithm that directly provides pseudo-samples from the posterior. This experiment mainly served to make a sound comparison between Gibbs sampling (the traditional way to obtain samples in EM) and Herding; the network sizes that are practically feasible prevent this method from obtaining a score that approaches the state of the art. The second experiment was aimed at approaching a state-of-the-art score with Herding. As a starting point we picked the best published algorithm, which at the time was Variational Inference for Monte Carlo Objectives (VIMCO).

The results of the first experiment were dissatisfying. The convergence plots show that Herding indeed has a lower variance in the first iterations, but the slope did not match what the theory predicted. Comparing different models trained with EM also proved difficult: the only metric we could track, the expectation of the joint log-likelihood, turned out to be unreliable for model comparison.

As for the second experiment, we showed that the results of the VIMCO paper [Mnih and Rezende, 2016] are reproducible and that Herding allows slightly faster convergence, without improving the final test set score. When inspecting the variance of the gradients directly, we found little evidence that combining the variance reduction of Herding with the variance reduction component of VIMCO has a substantial effect. For a converged model in particular there was no difference, which explains why there was no discernible difference in test set score and only a modest improvement in training speed.


When implementing autoregression on top of VIMCO, there is a clear improvement in convergence speed per epoch. Even though the increase in computational complexity is minor, computation time increases by an order of magnitude: the feed-forward pass of a layer can no longer be computed in parallel over all latent variables but becomes sequential, which is detrimental for performance, especially on modern GPUs. Combining Herding with autoregression again increased the convergence speed per epoch. However, since Herding requires the samples to be computed sequentially and autoregression requires sequential computation over the latent variables, the number of sequential steps becomes the product of the number of samples and the number of latent variables, leading to impractical running times.

Concluding, we can say that Herding shows promise as an alternative to Gibbs sampling in settings with high variance where it is only practical to obtain a few samples. It is less useful in settings where the variance is already reduced by other means, such as importance sampling. In general, the combination of variational methods and importance sampling is a powerful one, specifically for binary latent variable models, which suffer from a lot of variance: it offers a principled solution to a problem for which one previously had to resort to tricks or hard-to-optimize objectives. It also means that the future of Herding for learning deep models is not bright.

7.1 Future Work

As this thesis gives a comprehensive overview of using Herding in Sigmoid Belief Networks and concludes that it does not yield a substantial improvement, this section discusses future work for Herding and for SBNs separately. Future directions for Herding should likely focus on finding new areas where it can be applied. This may prove difficult: the recent trend in the deep learning community is to move away from sampling-based approaches and towards variational inference, which is considered easier to scale. This trend is in line with recent progress in hardware, which heavily favors algorithms that can be computed in parallel, and it will make it hard to find new applications for Herding within deep learning.

SBNs, on the other hand, have a more promising future. Continuous latent variable models have become an important topic of interest at many major conferences [Bowman et al., 2015, Gregor et al., 2015, Kingma et al., 2014, Kulkarni et al., 2015, Rasmus et al., 2015, Xu et al., 2015], and there is reason to think the same could happen for binary latent variable models. For example, it has been observed that the VAE is difficult to scale to larger networks and that its posterior does not capture the latent space well enough [Burda et al., 2015]. It is possible that these problems can now be solved by using an SBN trained with (a variation of) VIMCO.


Problems that simply require the latent variables to be binary [Graves, 2016, Reed and de Freitas, 2015] could also be improved by using VIMCO.

Another direction for future work is further improving the training method itself. VIMCO forms an excellent starting point, because it is relatively easy to implement and obtains strong performance. We have seen that the encoder gradients still exhibit much larger variance than the decoder gradients, which suggests there is room for further improvement.


Bibliography

Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19:153, 2007.

Yoshua Bengio, Ian J Goodfellow, and Aaron Courville. Deep learning. An MIT Press book in preparation. Draft chapters available at http://www.iro.umontreal.ca/~bengioy/dlbook, 2015.

CM Bishop. Pattern recognition and machine learning. Information science, 2006.

Luke Bornn, Yutian Chen, Nando de Freitas, Mareija Eskelin, Jing Fang, and Max Welling. Herded gibbs sampling. ICLR 2013, 2013.

Jörg Bornschein and Yoshua Bengio. Reweighted wake-sleep. In ICLR 2015, 2014.

Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In ICLR 2016, 2015.

Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. UAI 2010, 2012.

Peter Dayan and Geoffrey E Hinton. Varieties of helmholtz machine. Neural Networks, 9(8):1385–1403, 1996.

Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The helmholtz machine. Neural computation, 7(5):889–904, 1995.

Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.

Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486–1494, 2015.


Stuart Geman and Donald Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, (6):721–741, 1984.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.

Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. Draw: A recurrent neural network for image generation. ICML 2015, 2015.

Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.

Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The ”wake-sleep” algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.

Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR 2014, 2013.

Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2530–2538, 2015.


Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. JMLR: W&CP, 15:29–37, 2011.

Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Andriy Mnih and Danilo J Rezende. Variational inference for monte carlo objectives. arXiv preprint arXiv:1602.06725, 2016.

Iain Murray and Ruslan R Salakhutdinov. Evaluating probabilities under high-dimensional latent variable models. In Advances in neural information processing systems, pages 1137–1144, 2009.

Radford M Neal. Connectionist learning of belief networks. Artificial intelligence, 56(1):71–113, 1992.

G Parisi. Statistical field theory. Frontiers in Physics, Addison-Wesley, 1988.

Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3532–3540, 2015.

Scott Reed and Nando de Freitas. Neural programmer-interpreters. ICLR 2016, 2015.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of The 31st International Conference on Machine Learning, 2014.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.

Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pages 872–879. ACM, 2008.

Lawrence K Saul, Tommi Jaakkola, and Michael I Jordan. Mean field theory for sigmoid belief networks. Journal of artificial intelligence research, 4(1):61–76, 1996.

Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. ICLR 2016, 2015.

Max Welling. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1121–1128. ACM, 2009.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3–4):229–256, 1992.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. ICML (2015), 2015.
