
MSc Artificial Intelligence Master Thesis

Improving Controllable Generation with Semi-supervised Deep Generative Models

by Eelco van der Wel (10670033)

July 7, 2020
48 EC

Supervisor: Dr. Wilker Ferreira Aziz
Assessor: Dr. Miguel Rios Gaona

Abstract

Deep generative models are widely used to learn strong unsupervised representations of data. An important application of these models is controllable generation: the process of generating new data conditioned on a set of controlled variables. This process is interesting in various areas, such as image generation containing specific objects, or text generation with a specific sentiment. Often these models require a degree of supervision on the latent variables. Variational autoencoders are a popular choice for this application, due to the flexibility of these models. However, VAE models are not without issues. Purely training on a likelihood objective typically does not yield a well-trained model for controllable generation, and auxiliary losses that exploit partial supervision or disentanglement of latent variables are often required. This greatly complicates the training process by requiring many hyperparameters, and can lead to worse results. Furthermore, since we often control for discrete attributes, the partial supervision often uses discrete latent variables, which require relaxed distributions in order to make gradient estimation viable. Existing methods for unbiased binary relaxations are often difficult to train, and perform significantly worse than continuous distributions.

This thesis aims to improve controllable generation with variational autoencoders by introducing two new methods. Firstly, to improve training with discrete latent variables, a new continuous relaxation of a discrete random variable is introduced, based on a piecewise linear spline function. The proposed distribution differs from existing methods by having access to a closed form KL divergence, as opposed to the estimate needed for other solutions. This greatly reduces gradient variance, and improves the obtained results in multiple scenarios. Additionally, to improve the training process of semi-supervised models we propose to reformulate the loss function as a constrained optimization problem. Instead of jointly optimizing the ELBO with many side losses, the main loss is optimized subject to a set of auxiliary objectives as constraints. This method both shows promising results on a variety of side losses and significantly reduces the time spent searching for hyperparameters. Both proposed methods are tested on a semi-supervised VAE, and show significant improvements on various image datasets. However, due to posterior collapse caused by the strong generators used for language models, results on text data are not conclusive.


Acknowledgements

I want to thank Wilker Aziz for his dedicated supervision and the insightful discussions on many subjects. His enthusiasm and helpful suggestions were a great motivator. I want to thank Miguel Rios for being part of the defense committee. I also want to thank my friends from the Master room for the great coffee breaks and good times. Finally, I thank my family for the many years of support.


Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Outline
2 Preliminaries
  2.1 Variational Inference
    2.1.1 Variational Autoencoders
  2.2 Pathwise Gradient Estimation
    2.2.1 Location-Scale Distributions
    2.2.2 Inverse Transform Sampling
  2.3 Datasets
    2.3.1 MNIST
    2.3.2 SVHN
    2.3.3 CELEBA
    2.3.4 SIGMORPHON 2020
3 The Binary Piecewise Linear Distribution
  3.1 Introduction
  3.2 Background
    3.2.1 Score Function Estimator
    3.2.2 Relaxed Binary Distributions
    3.2.3 Spline Distributions
  3.3 Method
    3.3.1 Linear Spline Definition
    3.3.2 The Binary Piecewise Linear Distribution
  3.4 Experiments
    3.4.1 Experimental Setup
    3.4.2 Results
    3.4.3 Choosing a BPWL Configuration
    3.4.4 Qualitative Analysis
  3.5 Discussion
4 Controllable Generation with Deep Generative Models
  4.1 Introduction
  4.2 Background
    4.2.1 Semi-supervised Learning with Variational Autoencoders
    4.2.2 The DIVA Architecture
    4.2.3 Constrained Optimization
    4.2.4 Supervised Disentangling of Latent Variables
  4.3 Method
    4.3.1 Model
    4.3.2 Constrained Optimization
    4.3.3 Proposed Constraints
  4.4 Experiments
    4.4.1 Experimental Setup
    4.4.2 Results
    4.4.3 Qualitative Analysis
  4.5 Discussion
5 Conclusion
Bibliography
A Derivations
  A.1 The Evidence Lower Bound
B Splines
  B.1 BPWL Definition
    B.1.1 Main Definition
    B.1.2 Examples
  B.2 KL-Divergence for Piecewise Linear Distributions
C Model Details and Hyperparameters
  C.1 CNN Hyperparameters


Chapter 1

Introduction

With the large amount of data and increasing compute capabilities available in recent years, generative models have made significant progress in many areas. An important application of these models is controllable generation, the process of generating new data while controlling aspects of the output. Many applications can be found in computer vision, where such a task could be generating images with specific traits, such as a portrait with a predefined hair colour or facial expression. Another interesting application is natural language processing, where we could generate text fragments with a specific sentiment, or reconstruct words with a different morphology.

Variational autoencoders are a natural choice for controllable generation tasks, as these models allow us to construct and optimize complex graphical models with both supervised and unsupervised elements. For controllable generation, typically both of these elements are required: unsupervised learning allows us to learn strong representations of the data, and the supervised aspect gives control over the generation process.

However, there remain many issues that keep these models from being applicable in real world scenarios. An important issue for semi-supervised approaches is discrete variables that are latent when no supervision is available. In the traditional VAE setup, these discrete variables are not directly applicable, as sampling from these distributions is generally non-differentiable. Many workarounds have been proposed to solve this issue, such as substituting the discrete probability distribution with a continuous density that acts like a discrete variable. However, these methods often do not have access to a closed form KL divergence, a crucial quantity for optimizing VAE architectures, and need to estimate this quantity. This typically results in gradient variances that can be an order of magnitude higher compared to distributions that do not require an estimate. In this work, we construct a new relaxed binary distribution using spline functions. Splines are piecewise polynomial functions that can construct highly complex distributions with relatively simple definitions. An upside of this simple definition is that it does allow for a closed form KL divergence, reducing the gradient variance significantly.

A second element of semi-supervised VAE models that can complicate training is the number of required objectives. Often, these are required because the VAE does not encode any preferences in the loss with regard to controllable generation. It aims to learn a good enough reconstruction while providing structure to the latent space, and can get stuck in local optima that do not necessarily yield the type of model we want to train. Therefore, a VAE typically trains on a main loss, has an auxiliary supervision loss, a loss to prevent posterior collapse, and possibly a loss to promote disentanglement between latent variables. Tuning all these separate losses with hyperparameters is a laborious process, and recent work has shown (Lee, Hashimoto, and Liang, 2019) that simply summing many loss functions can result in worse training results. We propose a method of incorporating any differentiable side objective into the main loss by applying constrained optimization. Instead of jointly optimizing all losses simultaneously, the objective is reformulated as a main loss function, constrained by one or more side objectives.

Concluding, in this thesis we will address the following research questions:

Can we design a low variance relaxation to binary variables using splines, while maintaining the ability to model binary data?

We propose a new continuous relaxation which in theory improves existing methods by reducing the variance of the gradients in the inference network. It is not clear from the literature how much the variance from the KL estimate in existing methods reduces the performance of discrete latent variable models; by performing experiments on both synthetic and real data, we aim to show that removing this estimate improves the training process.

Can we use constrained optimization to improve the training of semi-supervised models, by formulating auxiliary objectives as constraints?

Constrained optimization using Lagrange multipliers has the advantage of being much more interpretable than methods based on hyperparameter weighting. Instead of searching for values that weigh auxiliary objectives in a desirable way, we can find a desired value for the loss itself and use it as a constraint bound. In this work, we aim to show through experiments that this way of training improves both the hyperparameter search and the training results.

1.1 Outline

First, Chapter 2 discusses preliminary subjects: variational inference, gradient estimation and the used datasets. In Chapter 3, we look at relaxed binary distributions; the BPWL distribution is introduced, followed by experiments and evaluation. Continuing, in Chapter 4 semi-supervised variational autoencoders are discussed, along with constrained optimization and the disentanglement of latent variables. Experiments are performed using the ideas from both Chapters 3 and 4. Finally, the conclusion can be found in Chapter 5.


Chapter 2

Preliminaries

This chapter discusses the preliminary concepts in this thesis. Section 2.1 gives a summary of variational inference and VAE models. Next, Section 2.2 looks at pathwise gradient estimation, which is a low variance estimator used for most VAE architectures. Finally, Section 2.3 discusses the datasets used in this work.

2.1 Variational Inference

Many problems in probabilistic modelling require approximations of probability densities, due to the presence of intractable integrals. Variational inference (VI) is one such approximation technique, and by casting approximate inference as an optimization problem it is a popular choice for machine learning methods.

Figure 2.1: A probabilistic graphical model with observed variable x, latent variable z, parameterized by parameters θ.

We start with the probabilistic model in Figure 2.1. The figure depicts a two-variable model p(x, z; θ) = p(x|z; θ)p(z; θ), where x is observed and z is latent. The parameters θ could be part of any type of model, for example a neural network. While this probabilistic model is simple, inferring the true posterior p(z|x; θ) can already be intractable, as it requires marginalizing over z:

\[ p(z \mid x; \theta) = \frac{p(x \mid z; \theta)\, p(z; \theta)}{p(x; \theta)} \tag{2.1} \]

where

\[ p(x; \theta) = \int p(x \mid z; \theta)\, p(z; \theta)\, dz \tag{2.2} \]


In other words: to infer the marginal likelihood (or evidence) p(x; θ), we need to consider all possible configurations of z. When z is a discrete random variable this marginalization might be possible, but when z is continuous it would require infinitely many forward passes through the likelihood model. Of course, this is not a viable option; an approximate solution is required.

Variational inference (VI) proposes such an approximation by introducing a new trainable distribution q(z; λ). With this new density an Evidence Lower Bound (ELBO) can be derived:

\[ \log p(x; \theta) \ge \mathbb{E}_{q(z; \lambda)}[\log p(x \mid z; \theta)] - \mathrm{KL}(q(z; \lambda)\,\|\,p(z; \theta)) \tag{2.3} \]

The full derivation can be found in Appendix A.1. By maximizing the ELBO, we can approximate the true posterior up to a constant (Appendix A.1).
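For reference, the key step of the derivation is an application of Jensen's inequality to the log evidence; a compact sketch is shown below, with the full derivation in Appendix A.1.

\[
\log p(x;\theta)
= \log \mathbb{E}_{q(z;\lambda)}\!\left[\frac{p(x \mid z;\theta)\,p(z;\theta)}{q(z;\lambda)}\right]
\ge \mathbb{E}_{q(z;\lambda)}\!\left[\log \frac{p(x \mid z;\theta)\,p(z;\theta)}{q(z;\lambda)}\right]
= \mathbb{E}_{q(z;\lambda)}[\log p(x \mid z;\theta)] - \mathrm{KL}(q(z;\lambda)\,\|\,p(z;\theta))
\]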

2.1.1 Variational Autoencoders

A Variational Autoencoder is a type of latent probabilistic model introduced by Kingma and Welling, 2014, which applies variational inference to models with neural network based likelihoods. By using amortized VI the model can share the parameters λ and θ between observed datapoints, which allows us to train a pair of neural networks to map observations between the data distribution and the latent distribution.

As an example we take the model as proposed by Kingma and Welling, 2014, which has the same graphical model as Figure 2.1. Additionally, a diagonal Gaussian is used as the latent variable, and a Bernoulli is used as the likelihood:

\[ p(x \mid z; \theta) = \prod_{d=1}^{D} \mathrm{Bern}(x_d \mid f_d(z; \theta)) \tag{2.4a} \]
\[ q(z \mid x; \lambda) = \mathcal{N}(z \mid \mu(x; \lambda), \sigma(x; \lambda)^2 I) \tag{2.4b} \]
\[ p(z) = \mathcal{N}(z \mid 0, I) \tag{2.4c} \]

Here the function f_d(z; θ) is a neural network that transforms a sample from the latent variable z to the parameters of the likelihood. This is generally called the generative network, or decoder. The functions µ(x; λ) and σ(x; λ) are also parameterized by a neural network, and usually most parameters are shared between the two functions. This network is referred to as the inference network, or encoder.

To train this model, the ELBO is used as the loss function. To approximate the expected log-likelihood, a Monte Carlo (MC) estimate is used. This results in the following training objective:

\[ \operatorname*{arg\,max}_{\theta, \lambda}\; \mathcal{L}(\mathcal{D}, \theta, \lambda) = \mathbb{E}_{q(z \mid x; \lambda)}[\log p(x \mid z; \theta)] - \mathrm{KL}(q(z \mid x; \lambda)\,\|\,p(z; \theta)) \tag{2.5a} \]
\[ \overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \log p(x \mid z^{(k)}; \theta) - \mathrm{KL}(q(z \mid x; \lambda)\,\|\,p(z; \theta)) \tag{2.5b} \]
\[ z^{(k)} \sim q(z \mid x; \lambda) \tag{2.5c} \]
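To make Equation 2.5 concrete, the sketch below computes the ELBO for the Gaussian-posterior, Bernoulli-likelihood model of Equation 2.4. It assumes hypothetical `encoder` and `decoder` modules and uses the reparameterized sampling discussed in the next section; it is a minimal illustration, not the implementation used in this thesis.

```python
# Minimal sketch of the ELBO in Equation 2.5 for a diagonal Gaussian posterior
# and a Bernoulli likelihood. `encoder` maps x -> (mu, log_var), `decoder`
# maps z -> Bernoulli logits; both are assumed PyTorch modules.
import torch
import torch.nn.functional as F

def elbo(x, encoder, decoder, K=1):
    mu, log_var = encoder(x)                       # parameters of q(z|x; lambda)
    std = torch.exp(0.5 * log_var)
    ll = 0.0
    for _ in range(K):                             # K-sample MC estimate (Eq. 2.5b)
        eps = torch.randn_like(std)                # e ~ N(0, I)
        z = mu + eps * std                         # reparameterized sample (Section 2.2)
        logits = decoder(z)
        ll = ll - F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    ll = ll / K
    # Closed-form KL between N(mu, sigma^2 I) and the standard normal prior.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return ll - kl                                 # quantity to maximize
```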


The setup described in Equation 2.5 uses diagonal Gaussian distributions, for which the KL term can be determined in closed form. Section 3.2 discusses situations where this is not the case, resulting in an additional MC estimate. Furthermore, maximizing this loss function requires backpropagation over the parameters of the model, which is discussed in the next section.

2.2 Pathwise Gradient Estimation

The ELBO offers a way to approximate the log evidence, which gives us a forward pass through the model. However, to learn the parameters of the model, we also need to be able to estimate gradients for backpropagation. In this section, we look at estimating the gradients for both the inference network and generative network with a pathwise gradient estimator (Mohamed et al., 2019) and the reparameterization trick (Kingma and Welling, 2014). The general idea is to view the stochasticity introduced by the latent variable z as an auxiliary input to the neural network, which allows us to estimate pathwise derivatives to all parameters of the model. In this section, we assume the KL term in the ELBO is available in closed form, which allows us to calculate gradients with standard backpropagation. As such, the KL term is omitted from the derivations.

First, we look at the generative network with parameters θ. In our loss function, θ does not appear in any expectation, so by Leibniz' rule we are allowed to interchange differentiation and integration to obtain an estimate:

\[ \nabla_\theta \mathbb{E}_{q(z \mid x; \lambda)}[\log p(x \mid z; \theta)] = \mathbb{E}_{q(z \mid x; \lambda)}[\nabla_\theta \log p(x \mid z; \theta)] \tag{2.6a} \]
\[ \overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \log p(x \mid z^{(k)}; \theta) \tag{2.6b} \]
\[ z^{(k)} \sim q(z \mid x; \lambda) \tag{2.6c} \]

This gives us a gradient estimator for the parameters of the generative network, making it trainable through backpropagation. Estimating gradients for the parameters λ in the inference network requires some more thought, as we cannot push the gradient operator inside the expectation:

\[ \nabla_\lambda \mathbb{E}_{q(z \mid x; \lambda)}[\log p(x \mid z; \theta)] = \nabla_\lambda \int q(z \mid x; \lambda) \log p(x \mid z; \theta)\, dz \tag{2.7a} \]
\[ = \int \nabla_\lambda q(z \mid x; \lambda) \log p(x \mid z; \theta)\, dz \tag{2.7b} \]

Equation 2.7b can no longer be written directly as an expected gradient, and taking an MC estimate is no longer possible. A solution here is to make the expectation not dependent on the parameters λ. This is what the reparameterization trick (Kingma and Welling, 2014; Rezende, Mohamed, and Wierstra, 2014) aims to achieve: by reparameterizing z in terms of a transformation f on some random variable q(e), the expectation can be rewritten as

\[ \mathbb{E}_{q(z \mid x; \lambda)}[\log p(x \mid z; \theta)] = \mathbb{E}_{q(e)}[\log p(x \mid f(e; \lambda); \theta)] \tag{2.8} \]

With this reparameterization trick the inference gradients can again be estimated with an MC estimate:

\[ \nabla_\lambda \mathbb{E}_{q(z \mid x; \lambda)}[\log p(x \mid z; \theta)] = \nabla_\lambda \mathbb{E}_{q(e)}[\log p(x \mid f(e; \lambda); \theta)] \tag{2.9a} \]
\[ \overset{\text{MC}}{\approx} \frac{1}{K} \sum_{k=1}^{K} \nabla_\lambda \log p(x \mid f(e^{(k)}; \lambda); \theta) \tag{2.9b} \]
\[ e^{(k)} \sim q(e) \tag{2.9c} \]

In order to apply this method, distribution q(z|x; λ) has to have a differentiable transformation where z = f(e; λ) for e ∼ q(e). The following two sections discuss methods of obtaining such a transformation.

2.2.1 Location-Scale Distributions

Distributions from the location-scale family have access to reparameterizable samples through a simple affine transformation. For example, if we have a Gaussian distribution q(z) = N(z|µ, σ²I) we can generate reparameterizable samples through the following procedure:

\[ z = \mu + e \odot \sigma \tag{2.10} \]
\[ e \sim \mathcal{N}(0, I) \tag{2.11} \]

where ⊙ denotes an element-wise product. The derivatives with respect to σ and µ are available, so this is a valid reparameterization.
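As a quick illustration of why this transformation is useful, the sketch below (a hypothetical example, not code from this thesis) shows that gradients with respect to µ and σ become available through standard automatic differentiation once the noise is treated as a fixed auxiliary input:

```python
# Location-scale reparameterization z = mu + e * sigma (Equations 2.10-2.11);
# autograd can differentiate any function of z with respect to mu and sigma.
import torch

mu = torch.zeros(4, requires_grad=True)
sigma = torch.ones(4, requires_grad=True)

e = torch.randn(4)          # e ~ N(0, I), treated as an auxiliary input
z = mu + e * sigma          # reparameterized sample

z.sum().backward()          # any downstream differentiable function works
print(mu.grad, sigma.grad)  # dz/dmu = 1, dz/dsigma = e
```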

2.2.2 Inverse Transform Sampling

Inverse transform sampling is a way to produce reparameterizable samples from any scalar distribution using the inverse cumulative distribution function (CDF). By using the CDF of q(z|x; λ) as a bijective transformation from the support of the random variable to [0, 1], we can generate samples by inverting this transformation and using samples from the uniform distribution:

\[ z = F^{-1}(u; \lambda) \tag{2.12a} \]
\[ u \sim \mathrm{U}(0, 1) \tag{2.12b} \]

where F⁻¹ is the inverse CDF of random variable z. A caveat of this method is that most known distributions do not have access to a closed-form inverse CDF, which makes inverse transform sampling impossible. Additionally, when q(z|x) is multivariate, the dimensions need to be independent, or autoregressive with scalar conditionals.
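As an illustration, the sketch below applies Equation 2.12 to an Exponential distribution, whose inverse CDF is available in closed form; the Exponential is only an assumed example here, not a distribution used in this thesis.

```python
# Inverse transform sampling for Exponential(rate): F^{-1}(u) = -log(1 - u) / rate.
# Because the transformation is differentiable in `rate`, gradients flow through it.
import torch

rate = torch.tensor(2.0, requires_grad=True)

u = torch.rand(1000)                 # u ~ U(0, 1)
z = -torch.log1p(-u) / rate          # z = F^{-1}(u; rate), reparameterized sample

loss = z.mean()                      # any downstream differentiable function of z
loss.backward()                      # gradient reaches `rate` through F^{-1}
print(rate.grad)
```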


2.3 Datasets

2.3.1 MNIST

We use the MNIST dataset (LeCun, 1998) as a standard benchmark for all models. The original resolution of 28×28 pixels is used, and images are binarized by sampling each pixel from a Bernoulli distribution.

Figure 2.2: Four example entries in the MNIST4 dataset. For the first image, the accompanying label would be a bitvector with the first, fourth and eighth bit set to 1: 0100100010.

MNIST4

To make an MNIST dataset with multiple annotated attributes, an MNIST4 dataset is generated by taking one image from the dataset, independently sampling three more images, and tiling them in a 56×56 pixel image. As annotation this gives a binary vector of length ten, with between one and four bits active to signify the digits present in the image. Figure 2.2 shows an example from the MNIST4 dataset, along with a label annotation.
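A minimal sketch of this construction is shown below, assuming `images` and `labels` hold the standard 28×28 MNIST arrays and digit labels; the exact preprocessing used in this thesis may differ.

```python
# Construct one MNIST4-style example: four independently sampled digits tiled
# into a 56x56 image with a ten-dimensional multi-hot label.
import numpy as np

def make_mnist4_example(images, labels, rng=np.random.default_rng()):
    idx = rng.integers(0, len(images), size=4)       # four independent digits
    tiles = images[idx].reshape(2, 2, 28, 28)
    canvas = np.block([[tiles[0, 0], tiles[0, 1]],
                       [tiles[1, 0], tiles[1, 1]]])  # 56x56 tiled image
    target = np.zeros(10, dtype=np.float32)
    target[labels[idx]] = 1.0                        # multi-hot: which digits appear
    return canvas, target
```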

2.3.2 SVHN

Because the differences in complexity between MNIST and CELEBA are quite significant, we use the Street View House Numbers (SVHN) dataset to show results on a dataset of intermediate difficulty. The cropped version of the dataset is used, which has a size of 32×32 pixels and three colour channels. Ten example entries from the dataset are shown in Figure 2.3.

Figure 2.3: Examples from the SVHN dataset.

2.3.3 CELEBA

As a second, more complex image dataset CELEBA (Liu et al., 2015) is used. This dataset contains about 200,000 RGB images of celebrities, and each image is annotated with many attributes. The aligned version of the dataset is used, and the images are cropped and resized to 64×64 pixels, as described in Lample et al., 2017 and Klys, Snell, and Zemel, 2018.


Figure 2.4: Examples from the CELEBA dataset. The first image is annotated with the features Eyeglasses, Mouth-slightly-open and No-beard set to one, and the remaining features set to zero.

Ten features that are interesting for controllable generation are chosen from the dataset: Bangs, Big Lips, Blonde Hair, Eyeglasses, Heavy Makeup, Male, Mouth Slightly Open, No Beard, Smiling and Wavy Hair. Any image that has one or more of these features is kept in the training set, and the rest is discarded. The resulting training set has about 160,000 images, with a validation and test set of about 20,000 images. Ten example images from the dataset are shown in Figure 2.4.

2.3.4 SIGMORPHON 2020

To test the model on sequential data, the Sigmorphon 2020 dataset for task 0 (Vylomova et al., 2020) is used. This dataset is created for the typologically diverse morphological inflection task, and contains morphological inflections annotated with a lemma and morphological tags. For pre-processing, each word is split into individual characters, and words longer than 30 characters are discarded. For the experiments, Dutch and English are used to evaluate the models. Dutch has 19 morphological labels, a training set of about 30,000 entries, with a validation set and test set of approximately 5000 entries. English has 9 morphological labels, with a training set of 70,000 entries, and a validation and test set of 10,000 entries.

Table 2.1 shows a few example entries in the English dataset, along with the lemma and annotated morphological features. The lemma is only shown for completeness, and is not used in experiments in this thesis.

Lemma       Inflection    Tags
grapple     grappled      V;PST
explorate   explorating   V.PTCP;PRS
keelhaul    keelhauls     V;SG;3;PRS

Table 2.1: Some example entries from the SIGMORPHON 2020 morphological inflection dataset. An inflection is shown with a corresponding lemma and morphosyntactic description tags separated by semicolons.


Chapter 3

The Binary Piecewise Linear Distribution

3.1 Introduction

Pathwise gradient estimators and the reparameterization trick allow for efficient backpropagation through computation graphs with stochastic nodes. These methods produce unbiased, low variance gradient estimates, improving training greatly compared to other methods. The core idea of the reparameterization trick is to split the sampling procedure into a stochastic component that does not require gradient updates, and a deterministic transformation that can propagate gradients through the rest of the graph. While many continuous densities have access to these reparameterizable samples, discrete distributions cannot be refactored this way due to the discontinuous nature of probability mass functions.

There exist a few methods to incorporate these discrete distributions into computation graphs, but these do not come with all the advantages of applying the reparameterization trick to continuous densities. A method like the Score Function Estimator (SFE) (Rubinstein, Shapiro, and Uryasev, 1996) can be applied to both discrete and continuous densities, and has much lower requirements on the properties of the used distribution. However, this flexibility comes at a cost: the variance of the estimator is often significantly higher, requiring other techniques or many samples during training.

Another possible method is to relax a discrete distribution to a continuous density that behaves similarly. If this continuous density has access to reparameterizable samples, the reparameterization trick can be used and we obtain a low variance estimator. However, when we look at the gradient estimation for VAE models, gradient variance can originate in two places. Firstly, the estimation of the log-likelihood always causes some amount of variance, since a closed form solution is not available when standard neural network configurations are used. A second place where variance can occur is the KL term of the ELBO. While many distributions do have access to a KL with a closed form solution, available methods for relaxing a Bernoulli variable do not. This would require another estimation in the ELBO, and could somewhat reduce the advantages gained by moving away from using the SFE.

In this chapter, a new binary relaxation is introduced by carefully parameterizing spline functions. These splines are piecewise functions built from simple building blocks, which allows us to construct sufficiently complex densities that approximate Bernoulli distributions, while keeping the definition simple enough to obtain both reparameterizable samples and a closed form KL divergence. The goal of this new distribution is keeping the overall gradient variance low, which should have a positive influence on the training properties of the model. Experimental results on both synthetic and real-world data show that the proposed method can achieve gradient variances more than an order of magnitude lower than existing methods. However, this does not always translate to better training results. While smaller models on simple datasets do show significantly better performance, for larger models it seems to matter less which binary relaxation is used.

Section 3.2 first discusses a few existing methods, followed by the proposed method in Section 3.3. In Section 3.4 the results on both synthetic and real-world models are presented, followed by a discussion of the method. Appendix B contains more details and derivations on the construction of the proposed spline distribution, and the KL divergence between splines.

3.2 Background

Discrete latent variables pose a problem for the standard VAE architecture; in order to obtain gradients for the inference network, we need access to reparameterizable samples from the latent distribution (Section 2.2). This section contains an overview of available methods to model or approximate Bernoulli latent variables by either manipulating the gradient calculation, or relaxing the binary variable to some continuous density. While many of these methods extend to other discrete distributions, such as categorical variables, only the Bernoulli case is discussed.

3.2.1 Score Function Estimator

The score function estimator (SFE) (Rubinstein, Shapiro, and Uryasev, 1996) is a general purpose gradient estimator which sees widespread use in many areas. It is known under a few different names, such as REINFORCE (Williams, 1992), or the likelihood ratio method (Glynn, 1990). As discussed in Section 2.2, the problem it aims to solve is taking a derivative of an expected value. In the case of a VAE, this occurs when deriving the gradients for the inference network from the likelihood term in the ELBO:

\[ \nabla_\lambda \mathbb{E}_{q(z \mid x; \lambda)}[\log p(x \mid z; \theta)] \tag{3.1} \]

When reparameterization is not available for distribution q, this expression cannot be turned into the expected gradient needed for an MC estimate. The score function estimator solves this by applying the log-derivative identity, which allows us to move the gradient inside the expectation:

\[ \nabla_\lambda \mathbb{E}_{q(z \mid x; \lambda)}[\log p(x \mid z; \theta)] = \nabla_\lambda \int q(z \mid x; \lambda) \log p(x \mid z; \theta)\, dz \tag{3.2a} \]
\[ = \int \nabla_\lambda q(z \mid x; \lambda) \log p(x \mid z; \theta)\, dz \tag{3.2b} \]
\[ = \int q(z \mid x; \lambda)\, \nabla_\lambda \log q(z \mid x; \lambda) \log p(x \mid z; \theta)\, dz \tag{3.2c} \]
\[ = \mathbb{E}_{q(z \mid x; \lambda)}\big[\nabla_\lambda \log q(z \mid x; \lambda) \log p(x \mid z; \theta)\big] \tag{3.2d} \]


By using Equation 3.2d, gradients can be estimated without the use of the reparameterization trick. However, while this is a general result, the variance of this estimator is often too high to provide a good learning signal for the inference network (Mohamed et al., 2019). This problem can be somewhat mitigated by applying variance reduction techniques like control variates. For an expectation over a function f(z), a control variate h(z) can be constructed if the expectation \( \mathbb{E}_{q(z \mid x; \lambda)}[h(z)] \) is a known quantity.

The expectation can be modified as follows:

\[ \mathbb{E}_{q(z \mid x; \lambda)}[f(z)] = \mathbb{E}_{q(z \mid x; \lambda)}\big[ f(z) - \big(h(z) - \mathbb{E}_{q(z \mid x; \lambda)}[h(z)]\big) \big] \tag{3.3} \]

By using information from this known quantity h(z), the high variance of the estimator can be reduced. Because of the wide applicability of the score function estimator, many advanced control variates are available (Grathwohl et al., 2018; Tucker et al., 2017). In this work, all experiments involving the score function estimator use a simple control variate that both centers the learning signal and normalizes the variance, as described in Mnih and Gregor, 2014.
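A minimal sketch of this estimator with such a centering and scaling control variate is shown below; the Bernoulli posterior, running statistics and function names are illustrative assumptions rather than the exact implementation used in this thesis.

```python
# Score function estimator (Equation 3.2d) with a simple control variate:
# the learning signal is centered by a running mean and scaled by a running
# standard deviation, in the spirit of Mnih and Gregor, 2014.
import torch

running_mean, running_var, momentum = 0.0, 1.0, 0.9

def sfe_surrogate(logits, log_lik):
    """logits parameterize q(z|x); log_lik(z) returns log p(x|z) per example."""
    global running_mean, running_var
    q = torch.distributions.Bernoulli(logits=logits)
    z = q.sample()                                   # non-reparameterized sample
    signal = log_lik(z).detach()                     # learning signal, no gradient
    running_mean = momentum * running_mean + (1 - momentum) * signal.mean().item()
    running_var = momentum * running_var + (1 - momentum) * signal.var().item()
    centered = (signal - running_mean) / (running_var ** 0.5 + 1e-6)
    # Surrogate loss whose gradient equals the SFE gradient estimate.
    return (centered * q.log_prob(z).sum(-1)).mean()
```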

3.2.2 Relaxed Binary Distributions

Because of the high gradient variance of the score function estimator, there has been much work on applying pathwise derivatives to discrete latent spaces by relaxing the discrete latent variable to a continuous density with a differentiable reparameterization. Examples of these approximations are the Binary Concrete (or Gumbel Sigmoid) distribution (Maddison, Mnih, and Teh, 2017; Jang, Gu, and Poole, 2017), the Kumaraswamy distribution (Kumaraswamy, 1980), or the Continuous Bernoulli distribution (Loaiza-Ganem and Cunningham, 2019). By assigning most density around zero or one, all of these distributions can successfully approximate Bernoulli variables while having access to reparameterizable samples. However, they share a common problem: the KL divergence term in the ELBO is not available as a closed form solution. This is usually solved by approximating the KL term with an MC estimate, which results in either much longer training times by requiring many samples, or higher gradient variances.
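For concreteness, the sketch below draws a sample from the Binary Concrete (Gumbel Sigmoid) relaxation mentioned above; the temperature value is an illustrative assumption.

```python
# Reparameterized sample from a Binary Concrete distribution: a logistic noise
# variable is added to the logits and passed through a tempered sigmoid.
import torch

def binary_concrete_sample(logits, temperature=0.5):
    u = torch.rand_like(logits)
    logistic_noise = torch.log(u) - torch.log1p(-u)                 # Logistic(0, 1)
    return torch.sigmoid((logits + logistic_noise) / temperature)   # relaxed sample in (0, 1)
```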

Another possible downside of these relaxed Bernoulli distributions is that there is no density at either zero or one, which removes the sparsity of the original discrete variable. While most density is still concentrated around the edges of the distribution, some applications do require sparse variables (Lee, Hashimoto, and Liang, 2019; Bastings, Aziz, and Titov, 2019). Louizos, Welling, and Kingma, 2018 propose a method that 'sparsifies' a Concrete distribution by collapsing a part of the density to zero and one. Bastings, Aziz, and Titov, 2019 extend this technique by applying it to the Kumaraswamy distribution, naming it the HardKumaraswamy. Rolfe, 2017 proposes a similar idea, by mixing discrete components with continuous components in a hierarchical model. While these distributions retain some of the sparsity of the Bernoulli distribution they aim to approximate, the rectifiers used to induce the sparsity will produce zero gradients when either a zero or one is sampled from the distribution. It is not clear if these missing gradients have a detrimental effect, and in the experiments in Section 3.4 the HardKumaraswamy is used to see potential differences between the approaches.


A final class of binary relaxations combines the latent Bernoulli variable with a continuous density by assuming a discrete distribution during the forward pass, and a relaxed continuous distribution during the backward pass. This method is generally known as the straight-through estimator (Bengio, Léonard, and Courville, 2013), and while it provides a low variance method to propagate gradients through a discrete random variable it is not an unbiased estimator; the produced gradients do not correspond to the forward pass of the model. A popular technique based on this straight-through estimator is the VQ-VAE (Oord, Vinyals, et al., 2017). These models are able to generate samples of a quality much higher than most standard VAE architectures (Razavi, Oord, and Vinyals, 2019), but due to the biased gradients they are considered out of scope for this research.

3.2.3 Spline Distributions

A spline is a composite function, constructed from a set of polynomial function pieces. Because of their high flexibility, these functions have many applications in interpolation problems, 3D modelling, and other function approximations.

Müller et al., 2019 propose a powerful method by applying spline functions to importance sampling. By constructing a monotonic spline CDF from either linear or quadratic function pieces, they obtain a cumulative distribution function that is both flexible and has a simple definition.

Durkan et al., 2019 apply a similar method to normalizing flows, by using spline distributions as coupling transforms. By generating splines with higher order polynomial functions, they greatly increase the flexibility of the spline while ensuring monotonicity of the CDF. This method offers advantages over existing coupling transforms, as generating complex distributions with a small number of normalizing flows can be difficult.

3.3 Method

This research aims to apply spline distributions to approximate latent Bernoulli variables. In order for such a distribution to be a suitable candidate for a VAE latent space, we generally require two properties: the ability to generate reparameterizable samples, and a closed-form KL divergence between the distribution and some pre-defined prior. In this research, we will look at priors from the same class of distributions, i.e. a spline variational posterior with a spline prior.

From this angle, linear spline distributions seem like a good candidate: reparameterizable samples are available by inverse transform sampling (Section 2.2.2), and the KL divergence is available in closed form (Appendix B.2).

3.3.1 Linear Spline Definition

In order to generate a distribution from such a spline function, one possible method is to generate a CDF from a monotonically increasing spline between zero and one. A corresponding PDF can be obtained by differentiation. The methods proposed by Durkan et al., 2019 are based on this idea; by generating a cumulative density that is monotonically increasing, the resulting distribution is always a proper density.


This section defines a simpler piecewise linear spline PDF, with a corresponding quadratic CDF. A linear spline is a collection of linear function pieces between N+1 boundaries known as knots. The domain between each pair of adjacent knots (x_{i−1}, x_i) is called a bin, and each bin has an associated polynomial.

In order to keep the notation and implementation concise, we introduce a variable ξ_i which translates each bin i to the origin x = 0:

\[ \xi_i = x - x_{i-1} \tag{3.4} \]

With this notation, a cumulative distribution function can be simply defined as:

\[
F_X(x) =
\begin{cases}
a_1 \xi_1^2 + b_1 \xi_1 + c_1, & \text{if } x_0 \le x < x_1 \\
a_2 \xi_2^2 + b_2 \xi_2 + c_2, & \text{if } x_1 \le x < x_2 \\
\quad \vdots \\
a_n \xi_n^2 + b_n \xi_n + c_n, & \text{if } x_{n-1} \le x < x_n
\end{cases}
\tag{3.5}
\]

where a_i, b_i and c_i are the parameters of the distribution. By differentiation, we obtain a piecewise linear PDF for each bin i:

\[ p_{X_i}(x) = 2 a_i \xi_i + b_i \tag{3.6} \]

To generate reparameterizable samples from this distribution, an inverse CDF is needed for each bin (Section 2.2.2). Because the CDF consists of second degree polynomials, this inverse can be found with the quadratic formula:

\[
F_{Y_i}^{-1}(y) =
\begin{cases}
\dfrac{-b_i + \sqrt{b_i^2 - 4 a_i (c_i - y)}}{2 a_i}, & \text{if } a_i \neq 0 \\[8pt]
\dfrac{y - c_i}{b_i}, & \text{otherwise}
\end{cases}
\tag{3.7}
\]
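To make the sampling procedure concrete, the sketch below draws a sample from a linear spline by inverse transform sampling, using the per-bin inverse of Equation 3.7; in a differentiable framework the same transformation yields pathwise gradients. The knot locations and (a, b, c) coefficients are assumed to describe a valid, normalized spline, and this is an illustration rather than the exact BPWL parameterization of Appendix B.1.

```python
# Inverse transform sampling from a piecewise linear PDF with the quadratic
# CDF of Equation 3.5, using the per-bin inverse of Equation 3.7.
import numpy as np

def sample_linear_spline(knots, a, b, c, rng=np.random.default_rng()):
    """knots: n+1 bin boundaries; a, b, c: arrays of n per-bin coefficients."""
    u = rng.uniform()
    # c[i] = F(knots[i]) is the cumulative mass at the left edge of bin i,
    # so the sampled bin is the last one whose left-edge CDF is below u.
    i = np.searchsorted(c, u, side="right") - 1
    if a[i] != 0.0:                                    # quadratic piece (Eq. 3.7)
        xi = (-b[i] + np.sqrt(b[i] ** 2 - 4 * a[i] * (c[i] - u))) / (2 * a[i])
    else:                                              # linear piece
        xi = (u - c[i]) / b[i]
    return knots[i] + xi                               # undo the shift xi = x - x_{i-1}
```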

To apply such a spline distribution as a VAE latent variable, we need access to a KL divergence between two instances of this distribution. Provided the x coordinates of the knots of the splines are identical, a closed form expression is available by summing over the partial KL divergence of each bin. A full derivation of this KL divergence for linear splines can be found in Appendix B.2.

It is important to observe that defining a piecewise linear spline does not necessarily ensure a normalized density, so this needs to be taken care of when parameterizing the spline distribution. Section 3.3.2 discusses the method used in this research; for a more free-form spline distribution, Müller et al., 2019 propose an alternate method.

3.3.2 The Binary Piecewise Linear Distribution

While previous applications of these spline functions aim to generate highly flexible distributions, either for importance sampling or as normalizing flows, a binary spline distribution should not encode any information other than the binary distribution it is trying to approximate. To achieve such a distribution, this section defines a restricted linear spline parameterization with a domain between zero and one, which can be controlled by a single parameter p that expresses the cumulative density at X = 1/2, analogous to the parameter of the Bernoulli distribution. Furthermore, the distribution should be symmetric around the center when p = 1/2, and have as much density as possible at the edges of the distribution while maintaining good training properties and numerical stability.

Figure 3.1: (Top) A binary spline PDF and (Bottom) accompanying CDF for different parameters p. Hyperparameters are described in Appendix B.1.2.

From this list of desired attributes, the distribution in Figure 3.1 is constructed. The slope of the line segments s, the height of the center knot h_d, and the relative width of the two center bins w_c are controlled by hyperparameters and are not learnt during training. Figure 3.1 demonstrates a few instances with different parameters p; the full description and different combinations of hyperparameters can be found in Appendix B.

3.4 Experiments

This section shows the experimental setup and results used to evaluate the proposed BPWL distribution. First, a few synthetic experiments on a single layer neural network are performed to gain more insight into the differences between existing methods. Additionally, the BPWL distribution is tested as a latent variable in a standard VAE architecture to test the real-world performance.

3.4.1 Experimental Setup


Distribution      Likelihood Estimator   KL-Divergence
Bernoulli         SFE                    Closed form
Binary Concrete   pathwise               MC estimate
HardKuma          pathwise               MC estimate
BPWL              pathwise               Closed form

Table 3.1: Methods for estimating the ELBO with different binary latent variables.

Synthetic experimental setup

To obtain gradients for backpropagation through stochastic nodes, an MC estimator is used. Many existing methods (Section 3.2) suffer from high variance problems in this MC estimate, and this research proposes the BPWL distribution to reduce this problem for relaxed binary distributions. In order to measure the differences in gradient variance between existing distributions and the BPWL, we first perform some experiments on a single layer neural network.

The setup follows a standard VAE decoder architecture. A single sample is taken from a (relaxed) binary distribution controlled by parameter p. For the Bernoulli distribution, p determines the mass at X = 1, and for relaxations p determines the amount of density around X = 1. This sample is fed into a single linear layer of ten units that parameterizes a univariate Bernoulli likelihood. Gradients are obtained by setting the labels to 0, and backpropagating the obtained ELBO. For each model, a prior of the same type as the latent distribution is used with p = 0.5. Table 3.1 lists all tested distributions, together with the method of calculating both the KL-divergence and log-likelihood loss terms. In the experiments, the BPWL distribution is also tested with an estimated KL. The pathwise gradients, SFE and KL estimates all make use of an MC estimator, which uses a single sample in all experiments.

By testing this architecture many times with random initialization, we can obtain an estimate for the variance of the gradients backpropagated to parameter p, which in a real VAE scenario would be the gradient propagated to the inference network. Additionally, by varying p we get some insight into the effect of the parameter. All experiments are repeated for 1000 iterations to estimate the gradient variance, which is repeated 100 times to estimate the variance of the experiment.

VAE Experimental Setup

In a second experiment, the BPWL distribution is applied to a standard VAE model on the MNIST, SVHN and CELEBA datasets described in Section 2.3. The results are compared to the standard VAE with a Gaussian latent space, and to VAEs with different binary latent spaces. All models are trained on the standard ELBO loss (Section 2.1). Both the inference and generative networks are a stack of convolutional layers, using the same hyperparameters for each latent distribution. The specific hyperparameters for each dataset can be found in Appendix C.1.

For MNIST, each datapoint is binarized and a Bernoulli likelihood \( p(x \mid z; \theta) = \prod_{d=1}^{D} \mathrm{Bern}(x_d \mid f_d(z; \theta)) \) is used, where f_d is the generative network with a final sigmoid activation. Both SVHN and CELEBA are RGB, and use a Gaussian likelihood \( \mathcal{N}(x \mid f(z; \theta), g(z; \theta) I) \), where f_θ and g_θ are a generative network that splits before a final linear layer. The means f(z; θ) are located between [-1, 1] by using a tanh activation, and to enforce a positive variance g_θ has a softplus activation on the final layer (Dugas et al., 2001). The variational posterior q(z|x; λ) is generated from a single parameter per dimension between zero and one, from a final sigmoid activation in the inference network. The exact parameterization can be found in Appendix C.
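A minimal sketch of such a likelihood head is shown below, with a tanh-bounded mean and a softplus-constrained positive output; the layer sizes are illustrative assumptions rather than the hyperparameters of Appendix C, and the positive output is used here as a standard deviation.

```python
# Gaussian likelihood head: a shared decoder trunk splits before the final
# linear layer into a tanh mean and a softplus scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianLikelihoodHead(nn.Module):
    def __init__(self, hidden_dim=256, out_dim=3 * 64 * 64):
        super().__init__()
        self.mean_layer = nn.Linear(hidden_dim, out_dim)
        self.scale_layer = nn.Linear(hidden_dim, out_dim)

    def forward(self, h):
        mean = torch.tanh(self.mean_layer(h))            # f(z; theta) in [-1, 1]
        scale = F.softplus(self.scale_layer(h)) + 1e-6   # g(z; theta) > 0
        return torch.distributions.Normal(mean, scale)
```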

All networks are trained with the ADAM optimizer, and tuned separately for batch size, L2 weight decay, learning rate and dropout rate.

3.4.2 Results

We will first discuss the results obtained from the synthetic experiments. Table 3.2 contains the average variance of each tested distribution, which gives a good indication of the differences between the used estimators. The Bernoulli distribution has a significantly higher variance due to the score function estimator. The log-likelihood estimates of the distributions with pathwise derivatives are not significantly different, but due to not requiring an estimate for the KL divergence the BPWL distribution obtains a total variance an order of magnitude lower than the other binary relaxations.

To understand the effect of the KL estimate better, Figure 3.2 shows the gradient variance as a function of the parameter p. The curve for the HardKuma distribution is omitted to keep the figure clear, and closely follows the observations of the Concrete distribution.

Distribution          Var LL (×10⁻²)   Var KL (×10⁻²)   Total (×10⁻²)
Bernoulli             5.70 ± 1.27      -                5.70 ± 2.27
Concrete              0.19 ± 0.09      1.81 ± 0.48      1.99 ± 0.59
HardKuma              0.25 ± 0.11      2.01 ± 0.42      2.26 ± 0.50
BPWL (standard)       0.22 ± 0.09      -                0.22 ± 0.09
BPWL (narrow)         0.31 ± 0.05      -                0.31 ± 0.05
BPWL (wide)           0.19 ± 0.07      -                0.19 ± 0.07
BPWL (KL estimated)   0.21 ± 0.08      1.97 ± 0.32      2.27 ± 0.48

Table 3.2: Gradient variances for parameter p, split into the gradient from the KL, log-likelihood and total variance. Entries with no variance are determined with a closed form solution.

We can make a few observations from these measurements. Firstly, the score function estimator has a significantly higher variance compared to pathwise derivatives. This is well documented in the literature, and it is clear that having such a high variance is detrimental to the training process.

When comparing different methods of obtaining gradients for the KL-divergence, we can observe that when using MC estimates the total variance approaches the values of the SFE when p is close to either zero or one. This shows a possible downside of the estimator: since we are using a prior with p = 0.5, the latent variable will only contain information when p is close to either zero or one. This means that the estimator variance goes up when more information is present in the latent variable, making training possibly more difficult. We can see that the BPWL distribution does not have this issue when the KL is determined in closed form.

Figure 3.2: The variance of the gradients at parameter p for different binary distributions, with a confidence interval of σ = 1. For clarity not all distributions are included; the observed curves are similar between distributions with corresponding estimators.

Continuing with the VAE experiments, Table 3.3 shows the ELBO and log-likelihood on the test set for each used dataset. At first glance we can see that all binary distributions perform significantly worse than the vanilla Gaussian VAE. This is not surprising; a diagonal Gaussian can encode much more information than a bit vector of the same size, which results in better reconstructions. For this reason the results on a Gaussian VAE are merely included as a reference, not for a direct comparison. We can see that in all cases, the measured log-likelihoods for the Bernoulli distribution are again much worse than any other result. We can attribute this to the use of the score function estimator: the learning signal received by the inference network is simply too noisy to keep up with the models trained with pathwise derivatives.

Continuing with the relaxed binary distributions, both the results on MNIST and SVHN show significantly better log-likelihoods for the BPWL distributions with standard hyperparameters. However, on CELEBA this difference disappears, and the measured likelihoods are much closer. We posit that this might simply be an issue with the maximum amount of information possible in the binary latent variables: CELEBA is both four times larger than SVHN, and contains many more complex image features. Experiments with a larger binary latent distribution size show that simply using a bigger model does not help improve the model quality.


Posterior         MNIST -ELBO     MNIST NLL       SVHN -ELBO       SVHN NLL         CELEBA -ELBO     CELEBA NLL
Gaussian          85.94 ± 0.05    81.23 ± 0.06    1117.22 ± 0.26   1105.92 ± 0.28   5261.55 ± 0.62   5218.17 ± 0.62
Bernoulli         137.44 ± 0.18   119.91 ± 0.11   1713.11 ± 1.08   1689.13 ± 1.08   7630.63 ± 2.41   7611.85 ± 3.60
Concrete          168.77 ± 0.15   104.85 ± 0.12   1429.32 ± 0.34   1361.55 ± 0.32   6363.64 ± 0.71   6301.37 ± 0.43
HardKuma          180.84 ± 0.11   105.66 ± 0.04   1421.41 ± 0.29   1349.32 ± 0.30   6381.58 ± 0.41   6305.18 ± 0.37
BPWL (standard)   129.51 ± 0.06   99.57 ± 0.07    1353.30 ± 0.31   1321.84 ± 0.29   6333.80 ± 0.50   6302.34 ± 0.49
BPWL (narrow)     134.96 ± 0.07   100.15 ± 0.08   1377.99 ± 0.52   1341.29 ± 0.55   6349.39 ± 0.97   6302.27 ± 0.88
BPWL (wide)       128.42 ± 0.04   105.11 ± 0.06   1386.69 ± 0.43   1362.43 ± 0.41   6323.14 ± 0.79   6304.81 ± 0.71

Table 3.3: ELBO and log-likelihood on the test sets with different (relaxed) binary distributions. A Gaussian VAE is included for reference. NLL is estimated with a 10-sample MC estimate; experiments are repeated 5 times.

3.4.3 Choosing a BPWL Configuration

With the large number of possible hyperparameter configurations for the BPWL distribution, choosing a good distribution shape can be difficult. We performed the experiments in Section 3.4 on three different configurations to gain some insight into this topic. The hyperparameters and plots of the configurations can be found in Appendix B.1.2.

The narrow and wide parameter configurations represent two extremes of the possible shapes of the distribution: the narrow version pushes almost all density to the sides of the distribution, and the wide configuration has the density more spread out. From the experiments in Table 3.3 we can see that both these extremes lead to worse log-likelihoods on all datasets, which can be explained by looking at the CDF of each distribution. When concentrating the density at zero and one, the corresponding CDF is almost vertical or almost horizontal everywhere. As a result the inverse CDF used for sampling from the distribution has the same issues, which either greatly amplifies the gradients, or reduces the gradients to almost zero. Consequently, we can see that the gradient variances of the narrow configuration in Table 3.2 are increased by almost 50%. In extreme cases this can even cause numerical overflow or underflow, which can be observed by pushing even more density to the sides of the distribution than the narrow configuration.

In contrast, the wide configuration is closer to a uniform distribution, and the chance of sampling on the 'wrong' side of the distribution is much larger. For example, when p = 1 the chance of sampling closer to zero is p(X ≤ 0.5) = 0.09. While this might look like a reasonably small number, the results show that this impacts the quality of the trained model significantly.

The standard configuration was designed to be a middle road between these two extremes, and in the experiments in Chapter 4 we will only use this configuration.

3.4.4 Qualitative Analysis

An upside of the binary latent variable compared to the continuous Gaussian is increased interpretability. Each latent dimension only has two states, which allows us to use the binary distribution as a set of switches to influence the reconstruction. Note that we have not disentangled the dimensions of the latent variable in this chapter, and this section only provides some insight into the structure of the latent space.


By manually analysing the most important dimensions in the latent space, we can look into the effects of flipping a single bit in the latent code. While most bits do not cause a discernible effect on their own, the reconstructions in Figure 3.3 show the effect of flipping three individual bits that do behave like switches. Similar experiments on SVHN and CELEBA do not yield such interpretable latent features.

Figure 3.3: MNIST reconstructions with one bit difference between the top and bottom rows. (A) A bit that controls the stroke thickness. (B) A bit that controls density in the bottom left. (C) A bit that controls the density around the center.

Next, an interpolation between two latent codes is performed. Because the latent space is binary, interpolation between two latent representations is done by flipping the differing bits between the codes one by one, starting at the first bit. Figure 3.4 shows four traversals of 24 steps through the latent code. We can see that some bitvectors do not produce recognizable digits, and in some cases flipping single bits causes a large difference in the resulting reconstruction. For example, the bottom left interpolation instantly switches from a three to a six at the fourth step, and does not have many proper digits in the second row.

Figure 3.4: Four interpolations in the latent code. Traversals are depicted left to right, and top to bottom.
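A minimal sketch of this bit-flip interpolation procedure is shown below; `decoder` is an assumed reconstruction network, and the function is illustrative rather than the exact code used for Figure 3.4.

```python
# Interpolate between two binary latent codes by flipping differing bits one
# by one (starting at the first bit) and decoding each intermediate code.
import torch

def bit_flip_interpolation(z_source, z_target, decoder):
    frames, z = [decoder(z_source)], z_source.clone()
    for i in range(z.numel()):
        if z[i] != z_target[i]:
            z[i] = z_target[i]          # flip one differing bit
            frames.append(decoder(z))   # reconstruct the intermediate code
    return frames
```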


3.5 Discussion

This chapter introduced a new binary relaxation using linear splines, called the binary piecewise linear distribution (BPWL). The distribution has a theoretical advantage over existing solutions, by replacing the estimated KL divergence with a closed form expression. Experiments show that this reduces the variance of the inference network gradients significantly, which improves training on small models. The more complex CELEBA dataset does not show the same improvements, and we speculate this could be caused by the limited amount of information captured by binary latent variables.

In future work, the spline distributions can be optimized for different scenarios. The spline parameterization used here is made by hand, and finetuned only on the VAE latent variable case. The flexibility of spline distributions allows for infinite variations, and better spline configurations could exist. Additionally, this flexibility also allows for different use cases, such as approximating other distributions.


Chapter 4

Controllable Generation with Deep Generative Models

4.1 Introduction

Semi-supervised variational autoencoders offer a strong way to improve model interpretability and are a popular choice for controllable generation applications. While the original VAE implementation is completely unsupervised, nothing in the VAE setting prevents adding supervision. The M2 model by Kingma et al., 2014 applies this idea by introducing a discrete latent variable y that can be marginalized by a simple summation, which allows for end-to-end training. Many newer models are based on this idea, and applied to subjects like domain invariant representation learning (Louizos et al., 2016; Ilse et al., 2019), clustering (Maaløe, Fraccaro, and Winther, 2017), structured prediction (Sohn, Lee, and Yan, 2015) or controllable generation (Klys, Snell, and Zemel, 2018). However, most of these models require a categorical latent variable with a sufficiently small sample space in order to deal with the expectations in the ELBO. This thesis looks at discrete latent variables that are combinatorial, which cannot be dealt with in this way: taking an expectation over a 10-bit latent variable would require 2^10 passes through the model, which is infeasible in most scenarios.

Another issue arises when optimizing models with many training objectives. For the type of models discussed in this section, this can happen quite easily; we want to train an ELBO, keep the KL term reasonably high, disentangle latent variables, and supervise one or more latent spaces. Training a model with all these separate objectives by simply adding them together requires us to find hyperparameters to weigh each loss; however, we have no guarantee these hyperparameters should stay the same over the course of training. Additionally, when we find a good combination of hyperparameters it is hard to determine what they actually mean. Constrained optimization is a technique that has found some popularity in recent years to deal with these kinds of problems. Instead of jointly optimizing many different losses, we can reformulate our loss function as a main loss, constrained by a few preferred scores on side losses.

In this chapter, we look at training a model for controllable generation that is applicable to data with a Bernoulli target variable, and easier to train through the use of constrained optimization. The proposed method and background on a relaxed binary variable can be found in Chapter 3. Section 4.2 discusses the background on the proposed methods, Section 4.3 defines the model and training methods, followed by experiments and results in Section 4.4.


4.2 Background

In this section, we discuss the background on semi-supervised learning and the proposed method for constrained optimization.

4.2.1 Semi-supervised Learning with Variational Autoencoders

Semi-supervised learning combines supervised and unsupervised learning, by training on both labeled and unlabeled data. Semi-supervised techniques have many applications, such as inferring unseen labels, or learning structured representations when only few labels are available. This process is vital for controllable generation; by adding a degree of supervision to the strong representations from unsupervised methods, we gain the needed control over the generation process.

The semi-supervised VAE by Kingma et al., 2014, generally referred to as M2, demonstrates this by incorporating a y variable that is both inferred in an unsupervised way, and supervised by a set of class labels. The resulting model learns both to predict the labels y from observations x, and to generate new samples from the model based on a chosen y.

Louizos et al., 2016 extend M2 with the fair VAE architecture that allows the generative network and inference networks to train jointly, and apply it to fair representation learning. The DIVA architecture (Ilse et al., 2019) extends this model by moving from a hierarchical model to a model that has parallel latent sub-spaces, which has shown benefits over hierarchical models in recent years (Siddharth et al., 2017; Klys, Snell, and Zemel, 2018; Atanov et al., 2019). The following section discusses the DIVA architecture, and the connection to the method proposed in this thesis.

4.2.2 The DIVA Architecture

The domain invariant variational autoencoder (DIVA) (Ilse et al., 2019) is a semi-supervised VAE that aims to tackle the problem of domain generalization. As an example, they use the MNIST dataset with randomly rotated digits. The method aims to generate digits with a controlled digit identity y, and a controlled rotation d, which is possibly not observed during training.

The model uses three parallel continuous latent spaces z_x, z_d, and z_y, with variational posterior q(z_x, z_d, z_y|x) = q(z_x|x)q(z_d|x)q(z_y|x). Here, z_d aims to capture domain-specific information such as image rotation. z_y captures information about the attributes y, and z_x captures any residual information. Both z_d and z_y are trained in a (semi-)supervised manner, whereas z_x is unsupervised.

In this work we use a model similar to the DIVA setup without a separate variable for domain invariance, and apply it with a combinatorial discrete latent variable instead of a categorical variable by introducing a relaxed Bernoulli distribution.

4.2.3 Constrained Optimization

Constrained optimization is a family of optimization techniques that aims to optimize a loss function under the presence of some pre-defined constraints. In recent years, it has been applied to VAE models in multiple scenarios, as an attempt to fix the training and generalization issues common in these models. Rezende and Viola, 2018 propose GECO, which aims to minimize the KL between the variational posterior and prior, while constraining the expected likelihood. They propose a few different constraints based on this idea, and show the flexibility of the method. Additionally, a smoothing mechanism for the values of the multipliers is introduced to reduce stability issues of the Lagrangian dual by applying a moving average to the multipliers. A related method to GECO is the LagVAE (Zhao, Song, and Ermon, 2018), which targets bounds on the mutual information I(X; Z), constrained with the InfoVAE objective (Zhao, Song, and Ermon, 2019).

Pelsmaeker and Aziz, 2020 take a similar approach with a minimum desired rate (MDR) constraint, and instead train the likelihood term in the ELBO, constrained by a sufficiently high KL to prevent posterior collapse. The method is a generalization of the free-bits method (Kingma et al., 2016), which omits the KL term completely when it falls below a chosen bound. MDR shows promising results on the effect of the latent variable on the reconstruction when using strong decoders.
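A minimal sketch of these two ideas is given below, assuming PyTorch and that kl holds the KL term of the ELBO for a batch. The variable names are illustrative and not taken from the cited implementations; in the MDR case, the multiplier would be updated by gradient ascent while the model parameters are updated by descent.

import torch
import torch.nn.functional as F

def free_bits_kl(kl, target_rate=5.0):
    # Free bits: the KL term is ignored by the optimizer while below the target.
    return torch.clamp(kl, min=target_rate)

def mdr_loss(neg_log_lik, kl, log_lam, target_rate=5.0):
    lam = F.softplus(log_lam)          # keep the Lagrangian multiplier non-negative
    # Constraint KL >= target_rate, written as target_rate - KL <= 0.
    return neg_log_lik + kl + lam * (target_rate - kl)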

Another application of Lagrangian multipliers is shown by Lee, Hashimoto, and Liang, 2019, who use it to optimize a communication game. They perform a detailed comparison between optimizing a multi-objective loss with a weighted sum and with a constrained optimization setting. Their experiments show that using constrained optimization leads to both significantly better results, and more interpretable hyperparameters.

Section 4.3.3 proposes an extension of the MDR constraint to other types of constraints, to both reduce the time spent searching for hyperparameters, and incorporate the possible advantages of constrained optimization observed by Lee, Hashimoto, and Liang, 2019.

4.2.4 Supervised Disentangling of Latent Variables

Disentangling latent variables is a crucial part of controllable generation. When generating a sample from a model with entangled latent variables, changing part of the latent code while keeping the rest static will not necessarily produce desirable samples. Many features will overlap, or control multiple aspects at the same time, which greatly reduces the interpretability of the latent code.

In this section, we will only look at supervised disentanglement. While unsupervised methods such as the β-VAE (Higgins et al., 2017) can show strong results, in this work we are specifically interested in disentangling a few selected, observed features, for the purpose of controlling these features while keeping other features in the reconstruction the same. Specifically, when we have some latent variable z and an observed attribute label y, we want z to be disentangled from y. Many supervised disentangling methods (Shu et al., 2020; Zellinger et al., 2017; Louizos et al., 2016) rely on disentangling the observed variable y when it is a categorical variable. The general idea here is to reduce the difference in the latent code z between observations with different observed classes y, which should promote invariance between z and y. However, when y is instead a combinatorial variable, there are simply too many combinations to consider, and disentangling all of these is generally not feasible.



A method that is more applicable to any type of disentanglement is reducing the mutual information between variables. Klys, Snell, and Zemel, 2018 propose the CSVAE based on this idea, and derive a bound on the mutual information that can be minimized during training. This method is discussed in the next section.

4.2.5 A Note on the CSVAE Disentanglement

The CSVAE (Klys, Snell, and Zemel, 2018) proposes a disentanglement loss that is not limited to categorical variables, by deriving an objective from the mutual information. First, we rewrite the mutual information as I(Y; Z) = H(Y) − H(Y|Z). Here, the entropy H(Y) is a constant obtained from a prior p(y), so maximizing H(Y|Z) would minimize the mutual information.

There are a few implicit assumptions in the original derivation from Klys, Snell, and Zemel, 2018, and we will discuss those in this section. The conditional entropy is first written out:

H(Y|Z) = − ∫_Z ∫_Y p(y, z) log p(y|z) dy dz    (4.1a)
       = − ∫_Z ∫_Y ∫_X p(z|x) p(x) p(y|z) log p(y|z) dx dy dz    (4.1b)

Next, the expression is rewritten as an expectation over the data distribution D(x):

E_{D(x)} [ − ∫_Z ∫_Y p(z|x) p(y|z) log p(y|z) dy dz ]    (4.2)

However, here we assume the marginal likelihood p(x; θ) matches the unknown data distribution D(x). This assumption does not hold in general, and in order to estimate this quantity we would need to approximate and sample from the marginal likelihood, which would result in a very different objective.

Continuing, p(z|x) and p(y|z) are replaced by variational approximations q(z|x; φ) and q(y|z; δ), and trained by minimizing the following objective:

arg min_φ E_{D(x) q(z|x)} [ − ∫_Y q(y|z; δ) log q(y|z; δ) dy ]    (4.3)

Additionally, the new posterior q(y|z; δ) is trained as a log-likelihood:

arg max_δ E_{D(x,y) q(z|x)} [ log q(y|z; δ) ]    (4.4)

This additional loss can be interpreted as an adversarial component. The network aims to predict y from z, while the objective in 4.3 tries to reduce the information about y in z.
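The sketch below makes this adversarial reading concrete for a binary y, assuming PyTorch; the classifier architecture and the latent size of 16 are arbitrary choices. The classifier step follows Equation 4.4, while the encoder-side penalty follows Equation 4.3 by maximizing the entropy of the classifier's prediction.

import torch
import torch.nn as nn
from torch.distributions import Bernoulli

classifier = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))  # q(y|z; delta)

def classifier_step_loss(z, y):
    # Equation 4.4: maximise log q(y|z; delta), i.e. minimise the negative
    # log-likelihood; z is detached so only the classifier is updated here.
    logits = classifier(z.detach())
    return -Bernoulli(logits=logits).log_prob(y).mean()

def disentanglement_penalty(z):
    # Equation 4.3: the encoder minimises the negative entropy of q(y|z; delta),
    # which pushes information about y out of z.
    return -Bernoulli(logits=classifier(z)).entropy().mean()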



Here the data distribution D(x, y) is introduced. Again this does not follow from Equation 4.1b, and the new variational posterior q(y|z; δ) is not trained to approximate p(y|z). Attempts to correct the derivation were made, but these attempts have not led to a trainable objective. As the method does produce good results in our experiments, it is used in this research without the assumption that it is a bound on the mutual information.

4.3 Method

Here we describe the model used for controllable generation, along with the constrained optimization method and the proposed constraints.

4.3.1 Model

We propose a deep generative model for semi-supervised learning based on DIVA (Ilse et al., 2019), and derive an ELBO for both the supervised and unsupervised case. We learn continuous latent variables z and w from a dataset D(x, y), where x is the observed data we want to generate, and y is an occasionally missing label. Here, w is supervised by y. The goal is to train a model that can sample new data from the learned latent space with a generative network p(x|z, w), and to vary the generated x by changing w and z independently.

FIGURE 4.1: Model 2, a four-variable model with latent variables z and w, observed variable x, and semi-observed labels y.

The proposed VAE setup used in this research is a four-variable model, shown in Figure 4.1. The joint distribution and variational posterior are factorized as follows:

p(x, y, w, z; θ) = p(x|z, w; θ) p(w|y; θ) p(y) p(z)    (4.5a)
q(z, w, y|x; λ, φ, γ) = q(z|x; φ) q(w|x; λ) q(y|w; γ)    (4.5b)

Here, a difference from the standard M2 setup is the intermediate latent space w. When disentangling variables y and z, some features dependent on y still need to be present in the latent code, which is the purpose of variable w. Without this extra variable, disentangling significantly reduces the strength of the model, since any latent features correlating with y are removed.



Each variable in this setup has the following associated distribution:

p(w|y; θ) = N(w | µ(y; θ), σ(y; θ)² I)    (4.6)
q(z|x; φ) = N(z | µ(x; φ), σ(x; φ)² I)    (4.7)
q(w|x; λ) = N(w | µ(x; λ), σ(x; λ)² I)    (4.8)
q(y|w; γ) = B(y | f(w; γ))    (4.9)

where B signifies an arbitrary multivariate binary distribution. Both the Bernoulli and the relaxed binary cases are tested; results can be found in Section 4.4. The prior p(z) is a unit Gaussian, and the prior p(y) is a binary distribution from the same class as B, with the parameters derived from the labeled data.
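A minimal sketch of these four distributions is given below, assuming PyTorch. The hidden sizes are arbitrary, and B is instantiated here as a factorised Bernoulli; the relaxed binary distribution of Chapter 3 would replace it in that setting.

import torch.nn as nn
from torch.distributions import Normal, Bernoulli

class CondGaussian(nn.Module):
    # Maps its input to a diagonal Gaussian; used for p(w|y), q(z|x) and q(w|x).
    def __init__(self, in_dim, out_dim, h_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, out_dim)
        self.log_sigma = nn.Linear(h_dim, out_dim)

    def forward(self, v):
        h = self.net(v)
        return Normal(self.mu(h), self.log_sigma(h).exp())

class BinaryPosterior(nn.Module):
    # q(y|w; gamma) = B(y | f(w; gamma)), here with B a factorised Bernoulli.
    def __init__(self, w_dim, y_dim, h_dim=256):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(w_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, y_dim))

    def forward(self, w):
        return Bernoulli(logits=self.f(w))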

We derive an unsupervised ELBO for the proposed model:

log p(x; θ) = log ∫_Z ∫_W ∫_Y p(x, y, w, z; θ) dy dw dz    (4.10a)
            = log E_{q(z,w,y|x; λ,φ,γ)} [ p(x, y, w, z; θ) / q(z, w, y|x; λ, φ, γ) ]    (4.10b)
            ≥ E_{q(z,w,y|x; λ,φ,γ)} [ log ( p(x, y, w, z; θ) / q(z, w, y|x; λ, φ, γ) ) ]    (4.10c)

L_u(D, θ, φ, λ, γ) = E_{q(z|x) q(w|x)} [log p(x|z, w; θ)]    (4.10d)
                   − KL(q(z|x; φ) || p(z))    (4.10e)
                   − E_{q(w|x)} [KL(q(y|w; γ) || p(y))]    (4.10f)
                   + E_{q(w|x) q(y|w)} [log p(w|y; θ)] + H(q(w|x; λ))    (4.10g)

For the supervised case, the samples from q(y|w; γ) are replaced by the observed labels from the data. Here, the KL term in Equation 4.10f becomes a constant with respect to optimization, and falls away. Additionally, q(y|w; γ) is trained as a classifier for y. This gives the following supervised objective:

L_s(D, θ, φ, λ, γ) = E_{q(z|x) q(w|x)} [log p(x|z, w; θ)]    (4.11a)
                   − KL(q(z|x; φ) || p(z))    (4.11b)
                   − KL(q(w|x; λ) || p(w|y; θ))    (4.11c)
                   + β_y E_{q(w|x)} [log q(y|w; γ)]    (4.11d)

where β_y in Equation 4.11d is a hyperparameter that controls the strength of supervision. Note that when the supervision constraint proposed in Section 4.3.3 is used, it replaces 4.11d. Finally, to jointly train the model on both the supervised and unsupervised objectives, the losses are added together, weighted by hyperparameter β_s.
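To summarise how the pieces fit together, the sketch below computes single-sample Monte Carlo estimates of the unsupervised objective (4.10d-4.10g), the supervised objective (4.11), and the weighted combination. It assumes PyTorch; decoder stands for a hypothetical p(x|z, w) network, the other modules follow the earlier sketches, and attaching β_s to the supervised term is an assumption made here for illustration.

import torch
from torch.distributions import kl_divergence

def unsupervised_elbo(x, decoder, q_z, q_w, q_y, p_w_given_y, p_z, p_y):
    qz, qw = q_z(x), q_w(x)
    z, w = qz.rsample(), qw.rsample()        # reparameterised samples
    qy = q_y(w)
    y = qy.sample()                          # discrete; a relaxed binary is used in practice
    rec = decoder(z, w).log_prob(x).sum(-1)                            # (4.10d)
    kl_z = kl_divergence(qz, p_z).sum(-1)                              # (4.10e)
    kl_y = kl_divergence(qy, p_y).sum(-1)                              # (4.10f)
    cross = p_w_given_y(y).log_prob(w).sum(-1) + qw.entropy().sum(-1)  # (4.10g)
    return (rec - kl_z - kl_y + cross).mean()

def supervised_objective(x, y, decoder, q_z, q_w, q_y, p_w_given_y, p_z, beta_y=1.0):
    qz, qw = q_z(x), q_w(x)
    z, w = qz.rsample(), qw.rsample()
    rec = decoder(z, w).log_prob(x).sum(-1)                            # (4.11a)
    kl_z = kl_divergence(qz, p_z).sum(-1)                              # (4.11b)
    kl_w = kl_divergence(qw, p_w_given_y(y)).sum(-1)                   # (4.11c)
    clf = q_y(w).log_prob(y).sum(-1)                                   # (4.11d)
    return (rec - kl_z - kl_w + beta_y * clf).mean()

def joint_loss(elbo_unsup, obj_sup, beta_s=1.0):
    # Both objectives are maximised, so the training loss is their negation.
    return -(elbo_unsup + beta_s * obj_sup)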
