
MSc Artificial Intelligence

Master Thesis

Effective Estimation of Deep Generative Models of Language

by

Tom Pelsmaeker

10177590

21st February 2019

36 ECTS, May 2018 - February 2019
Supervisor: Dr W.F. Aziz
Assessors: Dr R. van den Berg, Dr I.A. Titov

Institute of Logic, Language and Computation
University of Amsterdam


Abstract

We model natural language, in the form of written text, as draws from the marginal of a generative model parameterised by neural networks. We study how such models can best capture properties of language in their latent structure, in order to generate convincing novel language. The Variational Autoencoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014) forms the backbone of all these models.

In order to learn both a good density estimator and a model that can generate novel language, we combine the VAE with a recurrent network. However, when such an expressive neural network is used as decoder in a VAE, latent variables tend to be ignored, a problem known as the strong decoder problem (Alemi et al., 2018). We address this problem in the context of a Sentence VAE (SenVAE) with a strong recurrent decoder. It is shown that, through careful optimisation, the strong decoder problem can be countered, and SenVAE models can outperform a general recurrent neural network on a language modelling task. Hereafter, we investigate the effect of enriching the latent structure of these models by means of various normalising flows and expressive priors, including the inverse autoregressive flow (Kingma et al., 2016) and the variational mixture of posteriors prior (Tomczak and Welling, 2017a). These expressive prior and posterior distributions are empirically shown to enrich the latent representations without negative effects on language modelling performance. Finally, we probe the latent variables for their properties, and show that they encode surface, syntactic and semantic linguistic information. We hope that this principled investigation helps pave the way for the application of latent variable models to broader language tasks, for instance dialogue systems, question answering and translation.

The code for the experiments will be released after final publication of this thesis and a possible accompanying research paper on https://github.com/0Gemini0/msc-thesis/.


Acknowledgements

I wish to extend my gratitude to Wilker, for his great guidance over the past year. I have been especially lucky with the amount of time, effort and personal care he invested, without which my thesis would have never been possible. Furthermore, I would like to thank Dr Rianne van den Berg and Dr Ivan Titov for taking the time to read and assess my work. I am also grateful to my family and friends for having the patience to put up with the answer ‘I’m working on my thesis’ whenever they asked about my life these past months. Last, I am thankful to Maudi for her never-ending support.


CONTENTS

Abstract
Acknowledgements
List of Abbreviations
List of Tables
List of Figures
A Note on Notation

1 Introduction
  1.1 Research Questions and Contributions

2 Background
  2.1 Probabilistic Language Models
    2.1.1 Neural Language Models
  2.2 Latent Variable Models
    2.2.1 Posterior Inference
    2.2.2 Variational Inference
    2.2.3 Reparametrisation and the Variational Autoencoder
  2.3 Deep Generative Models of Language

3 Optimising Deep Generative Models of Language
  3.1 A Variational Autoencoder for Sentences
    3.1.1 Choice of Variational Family and Prior
  3.2 The Strong Decoder Problem
    3.2.1 An Information-Theoretic Perspective
    3.2.2 Directions
    3.2.3 Addressing the Strong Decoder Problem through Optimisation
  3.3 Metrics
  3.4 Experiments and Results
    3.4.1 Preliminaries
    3.4.2 Tuning the Baseline
    3.4.3 Optimising the Sentence Variational Autoencoder
    3.4.4 Language Modelling Comparison
    3.4.5 Rate and Distortion
  3.5 Related Work
  3.7 Conclusion

4 Richer Variational Families
  4.1 More Expressive Prior and Posterior Distributions
    4.1.1 Normalising Flows
    4.1.2 Inverse Autoregressive Flows
    4.1.3 Expressive Priors
  4.2 Experiments and Results
    4.2.1 General Experimental Setup
    4.2.2 Prior and Posterior Hyperparameters
    4.2.3 Language Modelling Comparison
    4.2.4 What Information do the Sentence Representations Encode?
    4.2.5 Qualitative Evaluation
  4.3 Related Work
  4.4 Discussion and Future Work
  4.5 Conclusion

5 Concluding Remarks

References

A Definitions

B Derivations
  B.1 Variational Autoencoder
  B.2 Normalising Flows
  B.3 Dropout

C Architectures
  C.1 RNNLM
  C.2 Gauss-SenVAE
  C.3 vMF-SenVAE
  C.4 MADE
  C.5 IAF-SenVAE
  C.6 Vamp and MoG
  C.7 Autoregressive Network

D Additional Experiments
  D.1 Further Tuning of the RNNLM
  D.2 Validation Set Results Expressive Variational Models
  D.3 Various SentEval Embeddings

LIST OF ABBREVIATIONS

AF: autoregressive flow
AU: active units
BPTT: backpropagation through time
CNN: convolutional neural network
ELBO: evidence lower bound
GAN: generative adversarial network
GRU: gated recurrent unit
IAF: inverse autoregressive flow
IAF-VP: volume preserving inverse autoregressive flow
KL: Kullback-Leibler (divergence)
LSTM: long short-term memory network
MADE: masked autoencoder for distribution estimation
MC: Monte Carlo
MCMC: Markov chain Monte Carlo
MDR: minimum desired rate
MMD: maximum mean discrepancy
MoG: mixture of Gaussians
NLL: negative log likelihood
NLP: natural language processing
NN: neural network
PTB: Penn Treebank
RNNLM: recurrent neural network language model
SenVAE: variational autoencoder for sentences
SFB: soft free bits
TER: translation edit rate
VAE: variational autoencoder
Vamp: variational mixture of posteriors
VI: variational inference

LIST OF TABLES

3.1  Penn Treebank descriptive statistics
3.2  Deterministic Bayesian optimisation results
3.3  RNNLM test set results
3.4  Baselines without optimisation strategies
3.5  Gaussian SenVAE Bayesian optimisation 0
3.6  Gaussian SenVAE Bayesian optimisation 5
3.7  vMF SenVAE Bayesian optimisation 0
3.8  vMF SenVAE Bayesian optimisation 5
3.9  SenVAE validation set results 0
3.10 SenVAE validation set results 0
3.11 SenVAE test set results
4.1  Flows with R = 5
4.2  Flows with R = 15
4.3  SentEval probing tasks
4.4  SentEval downstream tasks
4.5  SentEval probing tasks results
4.6  SentEval downstream tasks results
4.7  Generated sentences
4.8  Average minimum TER
4.9  Posterior samples
4.10 Homotopies
C.1  Recurrent language model decoder hyperparameters
D.1  Different vocabulary pre-processing results on Penn Treebank
D.2  Flows with R = 5, validation
D.3  Flows with R = 15, validation
D.4  SentEval results with different embedding types
D.5  SentEval downstream tasks additional results

LIST OF FIGURES

3.1  Spread and mean validation perplexity of the RNNLM
3.2  Rate-distortion curves
4.1  Context size of the inverse autoregressive flow
4.2  Inference network regularisation

A NOTE ON NOTATION

Here, the notation conventions that will be used throughout this thesis are introduced. The goal is to develop an explicit but succinct notation that helps avoid confusion.

Random Variables. A random variable X : Ω → E is a function that maps a set of events in sample space Ω to a measurable space E, oftentimes the D-dimensional real space $\mathbb{R}^D$. The probability distribution of events x ∈ E is described by a probability mass function $p_X(x)$ in the discrete case or a probability density function $f_X(x)$ [1] in the continuous case.

In line with conventions in the machine learning community, random variables will be denoted with slanted uppercase letters, and a specific realisation with the same lowercase letter, e.g. X = x, Z = z. Both PMFs and PDFs will be denoted with a lowercase letter, omitting the subscript, so $p(x) = p_X(x)$ and $p(z) = f_Z(z)$. When a random variable has multiple probability densities [2], we can represent them with different letters, e.g. p(z) and q(z). When conditioning on different random variables, we will use the same succinct notation, e.g. $p(x|z) = f_{X|Z}(x|z)$.

Expected Value. $\mathbb{E}_X[g(X)] = \int g(x)\,p(x)\,dx$ denotes the expected value or expectation. When there are competing probability densities for the same random variable, the expected value will be subscripted with the intended probability density, for instance $\mathbb{E}_{q(z|x)}[p(x|z)]$. Otherwise, we will default to the standard notation.

Parameters. A set of parameters is denoted with a lowercase Greek letter, e.g. θ, φ. When a distinction needs to be made between different sets of parameters, probability densities may be subscripted by the set of parameters that produced them, e.g. $p_\theta(x|z)$ and $q_\phi(z|x)$.

Matrices. When it is necessary to clarify the use of matrices and vectors, uppercase boldface will be used for matrices and lowercase boldface for vectors, e.g. M, m. We reserve this notation for deterministic vectors and matrices; random variables and their realisations will be denoted with slanted upper- and lowercase letters, whether they are vectors, matrices or scalars.

[1] Note that the probability of a single event given a continuous random variable is always zero, e.g. Pr(X = c) = 0 for all c ∈ E when X is continuous. However, the probability density function describes how probability is distributed over a continuous interval, where the area under the curve represents the probability of an interval.

[2] In classical probability theory, every random variable has a single function that describes the probability distribution. However, in the context of machine learning, it is often useful to take a model selection (see Appendix A) perspective. Each probability density then describes a different model of the random variable, and we try to select the model that fits best with the observed realisations of the random variable.

Sequences. A sequence of length M, $x_0^M = \langle x_0, \ldots, x_M \rangle$, will simply be denoted as $x$ to avoid clutter, whereas the tokens in the sequence will be subscripted with their index. Furthermore, we define $x_0^{i-1} \equiv x_{<i}$ to denote the sequence of tokens preceding the i'th token.

Functions and Distributions. The following functions and distributions will be used throughout the thesis:

• log: the natural logarithm.
• exp: the exponential function with base e.
• N(·|·, ·): the Normal distribution or Gaussian.
• vMF(·|·, ·): the von Mises-Fisher distribution.
• $I_v$: the modified Bessel function of the first kind at order v.
• Γ: the Gamma function.
• D[·||·]: a general divergence between two probability density functions.
• ELBO: the evidence lower bound.
• KL[·||·]: the KL divergence between two probability density functions.
• det: the matrix determinant.

CHAPTER 1: INTRODUCTION

Language is, as far as we know, unique to humans (Pagel, 2017). It enables communication of abstract thought and has spurred the evolution of human culture. According to one strand of research, it might even influence the way we can think and reason about the world (Whorf, 1956). It is safe to say that human society would not be what it is today without language.

Hence, language is a central part of the human experience. So central, it seems, that much effort has been devoted to teaching language to things that are not human. For instance, it has often been attempted, with limited success, to teach great apes basic human language (Savage-Rumbaugh and Lewin, 1994; Tomasello, 2017). Nowadays, much research is focused on teaching language to machines instead of apes. Large strides have recently been made in this field of Natural Language Processing (NLP). For example, virtual assistants [1] allow us to speak to our machines using natural language; search engines allow us to find information using keywords; automatic translation systems allow us to read texts in virtually any language; and advanced grammar checkers allow us to write better theses.

These advances are mostly powered by developments in deep learning: representation learning methods consisting of multiple nonlinear modules that transform data into a useful representation for a given problem (LeCun et al., 2015). Deep learning models are capable of automatically learning intricate structure from large amounts of high-dimensional data (LeCun et al., 2015). In NLP, deep learning models now achieve state-of-the-art results in many tasks, including machine translation (Bahdanau et al., 2014; Vaswani et al., 2017), language modelling (Melis et al., 2017), text-to-speech synthesis (Van Den Oord et al., 2016; Shen et al., 2018) and question answering (De Cao et al., 2018).

However, free-form language generation remains a challenging task. While services like virtual assistants can often respond well to questions, this is always within a constrained setting and often with the aid of built-in heuristics (Wen et al., 2015). Natural language generation is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information (Gatt and Krahmer, 2018). Ideally, a model should be able to learn and produce language from scratch, without invoking complicated heuristics or hand-crafted systems. Thus, in this thesis, we will research deep learning models for free-form language generation.

We will approach this task by learning generative models that approximate the underlying and unknown data distribution based on random observations in the form of raw sentences. In a (very broad) sense, such models attempt to learn the underlying “distribution of language”, from which they can generate novel sentences. Traditionally, such models were learned from data by statistical modelling techniques. However, enabled by recent advances in the field of machine learning, we will employ a combination of statistical modelling and deep learning to learn powerful generative models, called deep generative models.

[1] Including Siri (https://www.apple.com/siri/), Alexa (https://developer.amazon.com/alexa) and most recently Google Duplex (https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html).

Two broad classes of such deep generative models exist: the Variational Autoencoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014) and the Generative Adversarial Network (GAN) (Goodfellow et al., 2014). The former is a deep latent variable model that is optimised towards a lower bound of the data log-likelihood, whereas the latter is an implicit model optimised to generate data similar to true data. Both have seen application to language generation.

In general, generating language with GANs is difficult due to the discrete nature of text. Backpropagation through GANs requires continuous data, which is needed for gradient-based optimisation of deep learning models. Solutions to this problem exist in the form of generating continuous approximations to text (Zhang et al., 2016b; Rajeswar et al., 2017) or the use of a REINFORCE (Williams, 1992) gradient (Yu et al., 2017). However, the former modifies generated data artificially, which makes comparison to true data difficult, whereas the latter typically yields high-variance gradient estimates. Thus, applying GANs to discrete data such as text is still an active field of research.

In contrast, VAEs have no difficulty with discrete data and have lower-variance gradient estimates than REINFORCE methods. Hence, this thesis will focus on generating language with deep latent variable models. However, VAEs suffer from optimisation problems when applied to language (Bowman et al., 2015). Generally, when a VAE has a strong decoder, the part of the model that assesses the likelihood of data, latent variables tend to be ignored (Alemi et al., 2018). A likely cause for this phenomenon is that such strong decoders are expressive density estimators by themselves, and therefore do not need to rely on information in the latent variables to make accurate likelihood assessments of the data. Yet, such strong decoders are common and required for successful language generation (Melis et al., 2017).

One could wonder whether it is even necessary to add latent variables to a model that is already a strong density estimator in its own right. If a strong decoder does not utilise latent variables to reconstruct the data, there might be no point in forcing the model to do so. However, there is an important distinction between likelihood assessment and sampling or generation. Likelihood assessment only requires a model that assigns high likelihood to the points in the dataset, something a strong deterministic decoder might do well on its own. Yet, convincing sampling or data generation requires a model of the probability density that stretches beyond observed data. In other words, a model of the underlying probability space from which the data was generated is required, so that novel “realistic” data can be sampled from this space. Strong decoders without latent variables have no explicit mechanism that does so; it is unclear how probability mass is assigned outside observed datapoints. On the other hand, latent variable models explicitly model probability mass by a smooth latent distribution that covers a region of the latent space. These models can generate novel data such as language by stochastically mapping from this (lower-dimensional) latent space to the data space. Hence, deep latent variable models are likely more suited for a generative task than strong decoders on their own. Beyond this, latent variable models might even be stronger likelihood assessors than strong decoders without latent variables. For instance, it has already been shown that careful model optimisation can lead to latent variable models with strong decoders that do not ignore the latent variables (Bowman et al., 2015; Chen et al., 2016). All in all, there is enough incentive to learn deep latent variable models with strong decoders. In light of this, various optimisation techniques will be considered to counter the strong decoder problem.


In summary, the goal of this thesis is successful application of deep latent variable models to the language domain. In the next section, we will make this goal more concrete by outlining several research questions and our contributions to existing literature.

1.1 Research Questions and Contributions

The main goal of this thesis is successful application of deep latent variable models to the language domain in order to generate novel language. This broad topic will be tackled with a few specific research questions. These will be outlined in this section, together with our contributions to existing literature.

Which optimisation techniques help counter the strong decoder problem on language? As mentioned in the introduction, VAEs suffer from the so-called strong decoder problem: the model ignores the latent variable when it has enough capacity to reconstruct the data without it. This problem was first observed in language modelling with VAEs (Bowman et al., 2015), and later in other domains as well (Chen et al., 2016). It can be cast as an optimisation problem: perhaps the specific way in which VAEs are optimised stimulates the decoder to ignore the latent variable. To this end, many tweaks regarding the optimisation procedure (Bowman et al., 2015; Chen et al., 2016; Kingma et al., 2016) and loss function (Alemi et al., 2018; Zhao et al., 2017a) have been proposed. To the best of our knowledge, these methods have never been extensively compared on language data. In most applications, these methods are seen as means to an end: one is selected based on a mixture of gut feeling, convenience, and trial and error. Our contribution is to offer a fair and extensive comparison between a broad selection of such optimisation techniques. Whereas most work on deep latent variable models concerns image processing, this thesis will focus on language modelling, mainly because the strong decoder problem makes application of VAEs to language difficult. We will show that it is possible to train language VAEs that do not ignore the latent variables and still perform at least as well as a baseline language model, given careful optimisation.

Can more expressive prior and posterior distributions improve language modelling with deep latent variable models and lead to better latent representations? Even when latent variables are not ignored, the model does not necessarily benefit from them. The degree to which latent variables can improve statistical models hinges on their representational capacity, i.e. how well information can be represented by the latent variables. It has been observed that current deep latent variable models are inefficient in this regard (Alemi et al., 2018): when more information is stored in the latent variables, overall model performance degrades, hinting that the latent variables lack representational capacity. This problem has been explored in Alemi et al. (2018) by comparison of various normalising flows (Rezende and Mohamed, 2015) that improve the expressiveness of the variational prior and posterior distribution. Many such normalising flows have been proposed in recent years (e.g. Kingma et al. (2016); van den Berg et al. (2018); Louizos and Welling (2017); Tomczak and Welling (2017a, 2016); Huang et al. (2018)). However, to the best of our knowledge, normalising flows have never been applied to language VAEs. Thus, our main contribution is a comparison of the effect of more expressive prior and posterior distributions on the performance of deep latent variable models of language. Specifically, we compare between VAEs with an Inverse Autoregressive Flow (IAF) posterior (Kingma et al., 2016), a Mixture of Gaussians (MoG) and a Variational mixture of posteriors (Vamp) prior (Tomczak and Welling, 2017b), and a default VAE. Our analysis will be similar to that of Alemi et al. (2018), with a different nuance. Alemi et al. (2018) highlight a specific solution to the strong decoder problem and discover that latent variables lack representational capacity in general, whereas we explore the extent to which normalising flows can improve the representational capacity of the latent variables. We will show that more expressive variational distributions yield latent variables with a better representational capacity, without hurting the language modelling capacity of the models.

What linguistic information is captured by the latent variables? In the general setup of a deep latent variable model with a strong decoder, we arguably hope that the latent variables capture global variation in the data, whereas the decoder captures local variation. This ideal model is called a semantic encoder (Alemi et al., 2018), because it forms a highly compressed representation of the data that retains semantic features: features that capture global information about the data. This is in contrast to models that capture no information in the latent variables and models that store the data in their latent variables. We will assess the extent to which our models are semantic encoders by assessing the downstream performance of the latent representations on a collection of tasks that require different linguistic information (Conneau and Kiela, 2018). Note that semantic in this context does not correspond to linguistic semantics, which is the branch of linguistics concerned with meaning. Next to this, we will assess how well our models generate novel language by comparing generated sentences with training data. Last, we will show samples from our models. These kinds of qualitative analysis are commonplace in the assessment of latent variable models. Yet, assessment of these latent variables on downstream tasks is rare, only previously presented in Rios et al. (2018).

The three questions will each be thoroughly investigated in the third and fourth chapter of this thesis. Chapter 3 will explore the various optimisation techniques for variational autoencoders trained on language data. Chapter 4, then, will treat the second and third research question. In that chapter, an assessment will be made of the language modelling capacity of VAEs trained with various prior and posterior distributions. Then, sentence embeddings derived from these models will be probed for their linguistic information content. But first, Chapter 2 will provide background on language models and deep latent variable models.

CHAPTER 2: BACKGROUND

This chapter defines what generative models are, how they can be applied to language, and which specific class of generative models is used.

First, it is important to note that the class of models investigated in this thesis are so-called probabilistic models. Formally, a probabilistic or statistical model is a set of probability distributions on the sample space (McCullagh, 2002). In practice, probabilistic modelling is the process of finding a (set of) probability distribution(s) that best approximates the true probability distribution of a set of observed and unobserved random variables given a finite set of realisations. This process is also called model selection. A generative model, then, is any probabilistic model that describes the generative process of observed random variables, i.e. a model that learns by generating data. Given this generative process, new synthetic realisations of those variables can be generated (Bishop, 2007). Thus, a generative model of X is any model of the joint, conditional or marginal probability distribution of X given any number of other random variables, learned from a finite set of realisations $\{x_1, \ldots, x_N\}$.

First, it will be discussed how language can be modelled in such a probabilistic framework. Then, we describe a specific class of generative models over observed and latent variables, which will form the basis for most of the generative models of language in this thesis.

2.1 Probabilistic Language Models

Language is the method of human communication, either spoken or written, consisting of the use of words in a structured and conventional way. Statistical language modelling is based on the assumption that there is a “probability distribution of language” from which all such spoken or written communication is generated. A statistical model of language is an approximation of this underlying probability distribution.

To make the task of finding such a model more feasible, some simplifying assumptions need to be made. In the entirety of this thesis, the focus will be on English written language. Text will furthermore be viewed as a set of sequences, each consisting of a string of tokens that is produced in a left-to-right fashion. The simplest example of such a sequence is, of course, a sentence consisting of a string of words or characters. But depending on the specific task, a sequence could just as well be a word, paragraph or even an entire book.

Given the simplifying assumptions above, we can now define the probability of a sequence X = x of length M as a left-to-right factorisation of the probabilities of the

tokens:

$$p(x) = p(x_0) \prod_{i=1}^{M} p(x_i \mid x_{<i}) \qquad (2.1)$$

which is a general representation of a statistical model of language widely used in the language modelling community (Bengio et al., 2003). However, this model suffers from the curse of dimensionality, or equivalently, data sparsity. Note that given a token vocabulary of size |V| and sequences of length M, we have $|V|^M - 1$ possible sequences (Bengio et al., 2003). Since this quantity grows exponentially in M, most sequences will never be observed, except when M is tiny or when the dataset is enormous. How, then, do we assign probabilities to sequences or sub-sequences that have never been observed? A classical way to approach this problem is to break the assumption that a token depends on all previous tokens, and instead let the probability of a token be governed only by the n − 1 previous tokens:

$$p(x) \approx p(x_0) \prod_{i=1}^{M} p(x_i \mid x_{i-n+1}^{i-1}) \qquad (2.2)$$

Such n-gram models (Brown et al., 1992) suffer less from the curse of dimensionality, because now only $|V|^{n-1}$ possible sequences need to be assigned a probability. However, such models only address the curse of dimensionality when a sufficiently small n is chosen; in practice, n larger than 3 is unfeasible in most cases (Bengio et al., 2003). This is undesirable because sentences in natural languages often have long-range dependencies between words. Also, even given a small n, some n-grams might still be unobserved. Such n-grams could be assigned a small probability with a smoothing function (Jelinek, 1980). However, it is desirable to have a class of models that automatically assigns probability to unseen sequences in a more principled manner.
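To make the factorisations in Equations 2.1 and 2.2 concrete, the following sketch scores a sequence under a left-to-right model, either with the full history or with an n-gram truncation. The `cond_prob` callable is a hypothetical stand-in for whatever supplies $p(x_i \mid \text{context})$ (a count table or a neural network); it is not part of the thesis code.

```python
import math
from typing import Callable, Optional, Sequence

# Hypothetical conditional model: maps (token, context) to a probability.
CondProb = Callable[[str, Sequence[str]], float]

def sequence_log_prob(tokens: Sequence[str], cond_prob: CondProb,
                      n: Optional[int] = None) -> float:
    """Log-probability of a sequence under a left-to-right factorisation.

    With n=None the full history x_{<i} is used (Equation 2.1); with a finite n,
    the history is truncated to the n-1 preceding tokens (Equation 2.2).
    """
    log_p = 0.0
    for i, token in enumerate(tokens):
        context = tokens[:i] if n is None else tokens[max(0, i - n + 1):i]
        log_p += math.log(cond_prob(token, context))
    return log_p
```

Working in log space avoids numerical underflow, since the product of many small per-token probabilities quickly becomes smaller than machine precision.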

2.1.1 Neural Language Models

A first solution to the sparsity problem arose with the combination of two concepts: a distributed representation of tokens and neural networks (Bengio et al., 2003). Neural networks (NNs) are universal function approximators (Hornik et al., 1989) consisting of several layers of weights and activation functions, whose parameters are optimised with backpropagation (Werbos, 1974). NNs can be used to predict the parameters of the probability distribution of Equation 2.1 if they are given the tokens as input in some form. Such a neural language model can implicitly handle unseen sequences, as NNs are smooth functions of the input (Bengio et al., 2003).

However, tokens cannot be provided to a NN in their linguistic form, as neural networks require numerical inputs. A basic way to numerically represent arbitrary tokens is to specify a vocabulary V of tokens based on the data and then assign each token a unique number from [0, . . . , |V| − 1]. The tokens can then be provided to the NN as a |V|-dimensional one-hot vector (Harris and Harris, 2010), where the position of the token number is 1 and the rest 0. A downside to this approach is that this representation contains no information about relatedness between tokens, thus losing a lot of information that could help the neural network predict good probabilities.

Bengio et al. (2003) addressed this problem by learning continuous feature vector representations of tokens instead. This is done with the inclusion of a shared input layer in the NN that maps the numerical representation of each token to a lower-dimensional continuous vector. The general idea is that the NN can learn similarities between words by mapping them close by in the continuous embedding space. This idea was later extended to explicitly account for the distributional hypothesis (Firth, 1957; Harris, 1954).
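As a small illustration of the shared embedding layer described above (a sketch with made-up sizes, not the thesis architecture), token indices can be mapped to dense, trainable vectors with a PyTorch embedding table:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 128            # made-up sizes for illustration
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[5, 42, 7]])         # one sequence of three token indices
vectors = embedding(token_ids)                 # shape (1, 3, 128): dense, trainable vectors
```

In contrast to one-hot vectors, these embeddings are learned jointly with the rest of the network, so related tokens can end up close together in the embedding space.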


Yet, even with this distributed representation, NNs can struggle to form good probabilistic models of language. The neural language model from Bengio et al. (2003) was designed to use a fixed number n of concatenated word embeddings of previous tokens to predict the probability distribution of the next token. When n is too small, the model will be unable to capture long-term dependencies, similar to n-gram models. However, large n are computationally unfeasible, as the dimensionality of the model scales linearly with n.

The first model that could handle preceding context of arbitrary length was the Recurrent Neural Network (RNN) Language Model (RNNLM) by Mikolov et al. (2010). In contrast to feedforward NNs, RNNs pass a hidden state from the previous step to the next step. This hidden state is updated every step as a function of the input and the previous hidden state. The weights of the RNN are shared across steps, hence the name recurrent neural network. In summary, the RNNLM takes a token and the previous hidden state as input at every timestep, and outputs an updated hidden state and a probability distribution for the next token. The hidden state essentially acts as a memory vector, containing a compressed representation of all preceding tokens. Thus, the RNNLM can, in theory, encode temporal information from contexts of arbitrary length (Mikolov et al., 2010).
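The sketch below is a minimal GRU-based RNNLM in PyTorch, illustrating the embed–recur–predict loop just described; the layer sizes are placeholders and the thesis' actual architecture is given in Appendix C.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal recurrent language model: embed tokens, update a hidden state,
    and predict a distribution over the next token at every timestep."""

    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        # tokens: (batch, time) integer ids
        embedded = self.embedding(tokens)            # (batch, time, embed_dim)
        states, hidden = self.rnn(embedded, hidden)  # (batch, time, hidden_dim)
        logits = self.output(states)                 # (batch, time, vocab_size)
        return logits, hidden
```

Training such a model amounts to minimising the cross-entropy between these logits and the next token at every position, i.e. the negative log-likelihood of Equation 2.1.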

In practice, RNNs suffer from catastrophic forgetting because the gradients either vanish or explode during backpropagation through time (BPTT) (Hochreiter, 1991). As such, plain RNNs cannot effectively model long-term dependencies. Extensions such as the long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997) and the gated recurrent unit (GRU) (Cho et al., 2014) add extra machinery to the RNN in order to keep the gradient flow undisturbed. These models achieve state-of-the-art results in language modelling (Melis et al., 2017).

In this thesis, RNN-type [2] models with continuous token embeddings will be the neural networks that power our deep generative models of language. Note that an RNNLM is a generative model; it models densities of observed sequences and tokens. By feeding a draw from the probability distribution predicted at the previous timestep as input to the next timestep, whole new sequences can be generated. However, RNNLMs may not learn interesting representations of whole sequences, for they are trained to estimate the conditional distribution of next words only. Nor do they learn to capture global properties of sentences in their model (Bowman et al., 2015). The next section will discuss a class of models that can in principle learn such properties by representing language as distributions in a lower-dimensional latent space, and generate novel data from this latent representation.

[2] RNN-type models are all neural networks that use a recurrent architecture, so the LSTM and GRU count as RNN-type models as well.
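A minimal sketch of the generation procedure just described, using the hypothetical RNNLM module from the previous sketch: at every step a token is drawn from the predicted distribution and fed back in as the next input. The `bos_id` and `eos_id` arguments stand for hypothetical begin- and end-of-sentence token indices.

```python
import torch

@torch.no_grad()
def sample_sequence(model, bos_id: int, eos_id: int, max_len: int = 50):
    """Ancestral sampling from a recurrent language model."""
    tokens, hidden = [bos_id], None
    for _ in range(max_len):
        inp = torch.tensor([[tokens[-1]]])               # (batch=1, time=1)
        logits, hidden = model(inp, hidden)
        probs = torch.softmax(logits[0, -1], dim=-1)     # next-token distribution
        next_id = torch.multinomial(probs, num_samples=1).item()
        tokens.append(next_id)
        if next_id == eos_id:                            # stop at end-of-sentence
            break
    return tokens
```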

2.2 Latent Variable Models

A latent variable model is a probabilistic model over observed and latent random variables. A latent variable is simply a variable for which we do not, or cannot, have any observations, as opposed to an observed variable for which we have some realisations.

Latent variables are often induced as underlying causes of an observed phenomenon. Consider, for example, intelligence. Spearman (1904) noticed that the performance of individuals on various cognitive tasks was highly correlated. Therefore, he postulated that there must be an underlying latent factor, general intelligence, that is the cause of this correlation. The introduction of this factor in statistical models of cognitive ability helped to explain the results on the cognitive tasks. Similarly, there might be underlying causes that explain observed text. For instance, language adheres to grammatical and syntactical rules that are not directly observed, but cause language to have a certain form. Thus, latent variable models might be better models of language than models of the observed variables alone. Furthermore, latent variables that capture underlying properties of language might allow for better generation of novel language.

This section will explain latent variable models in more detail and cover a specific optimisation technique for latent variable models called variational inference (VI). Last, the Variational Autoencoder (VAE), a specific deep VI model, will be explained, as it forms the basis for all latent variable models in this thesis.

2.2.1 Posterior Inference

Consider a general latent variable model with a joint distribution over observed data X and latent variables Z, that factorises as follows:

$$p(x, z) = p(x \mid z)\,p(z) \qquad (2.3)$$

The latent variable in this hierarchical model helps govern the distribution of the data (Blei et al., 2017). Specifically, p(z) is a prior density over the latent variables. The latents are related to the data via the likelihood p(x|z). Often, we are interested in finding the posterior distribution of the latent variables given the data in order to predict the distribution of new datapoints. This can be done through Bayesian inference:

$$p(z \mid x) = \frac{p(x \mid z)\,p(z)}{p(x)} \qquad (2.4)$$

where $p(x) = \int p(x, z)\,dz$ is the marginal density of the data, also called the evidence. Given a simple model, the posterior could be computed directly with the above formula. However, modern statistics and machine learning often rely on complex and difficult-to-compute probability densities (Blei et al., 2017). In such cases, computation of the integrals required for the posterior and marginal distributions is often intractable, intractable meaning that the exact solution is not computable within a time feasible for any practical application (Hopcroft, 2013).

Fortunately, there exist several methods that make inference with intractable posteriors possible. Historically, sampling techniques have been the dominant method for approximate posterior inference (Blei et al., 2017). Markov chain Monte Carlo (MCMC) (Hastings, 1970) and its extensions have been particularly successful. A specific advantage of MCMC is that it is guaranteed to produce exact samples from the target density, in the limit (Robert and Casella, 2013). However, it may take a long time before MCMC converges to an acceptable approximation of the target density. This makes the procedure computationally intensive and, therefore, often unfeasible for large datasets (Blei et al., 2017).

Variational inference is a method of approximate inference that is better suited for large datasets. It has no guarantee of converging to exact inference in the limit, but it may converge to an acceptable approximation much faster than MCMC methods (Blei et al., 2017). Importantly, variational inference frames posterior inference as an optimisation problem. Later on, we will see that this matches well with deep learning, which also relies on optimisation to learn models of data. Thus, for the remaining part of this section, we will focus on VI and its applications.

2.2.2 Variational Inference

Variational inference differs from MCMC in that it uses optimisation instead of sampling to approximate intractable posteriors. Specifically, it posits a family of densities $\mathcal{Z}$ over the latent variables (Blei et al., 2017). The aim is then to find a member of that family that minimises a divergence to the exact posterior:

$$\hat{q}(z) = \operatorname*{arg\,min}_{q(z) \in \mathcal{Z}} \mathrm{D}[q(z)\,\|\,p(z \mid x)] \qquad (2.5)$$

where p(z|x) is the exact posterior from Equation 2.4 and q(z) an approximate posterior. The benefit of this approach is that we can freely choose any family of densities with desirable properties for our problem. For instance, $\mathcal{Z}$ is often chosen to be a family of Normal distributions. Furthermore, given a posterior over multiple latent variables, independence of the latent variables is often assumed (Equation 2.6). This is also called the mean field assumption.

$$q(z_0, \ldots, z_N) = \prod_{i=0}^{N} q(z_i) \qquad (2.6)$$

Such simplifications help to make optimisation and inference with VI tractable. However, one must be careful not to select a variational family that is too simple, for it might be impossible to closely approximate the true posterior with a simple distribution. Finding expressive variational families that still allow for tractable VI remains an open research problem (Blei et al., 2017; Rezende and Mohamed, 2015).

Now, we turn our attention to finding a solution to Equation 2.5. The focus will be on Kullback-Leibler (KL) variational inference (Barber, 2012), where the KL divergence between q(z) and p(z|x) is optimised. In principle, VI encompasses any procedure that uses optimisation to approximate densities (Wainwright et al., 2008), so other divergence measures could be chosen (Blei et al., 2017). However, that lies outside the scope of this work. Consider the following form of Equation 2.5:

$$\hat{q}(z) = \operatorname*{arg\,min}_{q(z) \in \mathcal{Z}} \mathrm{KL}[q(z)\,\|\,p(z \mid x)] \qquad (2.7)$$

and the KL divergence:

$$\begin{aligned} \mathrm{KL}[q(z)\,\|\,p(z \mid x)] &= \mathbb{E}_{q(z)}[\log q(z)] - \mathbb{E}_{q(z)}[\log p(z \mid x)] \\ &= \mathbb{E}_{q(z)}[\log q(z)] - \mathbb{E}_{q(z)}[\log p(z, x)] + \log p(x) \qquad (2.8) \end{aligned}$$

which is not computable directly in this form. Remember that the reason the posterior p(z|x) is intractable is that the marginal $p(x) = \int p(z, x)\,dz$ is intractable, which we would need in order to solve the equation above. However, an alternative objective can be derived from Equation 2.8:

$$\begin{aligned} \log p(x) - \mathrm{KL}[q(z)\,\|\,p(z \mid x)] &= \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)] \\ \log p(x) &\geq \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)] \qquad (2.9) \end{aligned}$$

which is called the Evidence Lower Bound (ELBO), since it lower-bounds the marginal distribution of X. Optimising the ELBO has two effects. First, it minimises the KL divergence in Equation 2.7, finding a (hopefully) close approximation to the true posterior. Second, when VI is applied in a frequentist framework to learn the distributions p as well, it maximises log p(x), as maximising the ELBO “pushes” the log-evidence up. Thus, next to Bayesian inference, VI can also be applied to model selection (Blei et al., 2017), finding the best model for observations of X.
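For completeness, the step from Equation 2.8 to Equation 2.9 is only a rearrangement of terms followed by the non-negativity of the KL divergence; written out under the definitions above (no new assumptions):

```latex
\begin{aligned}
\log p(x)
  &= \mathrm{KL}[q(z)\,\|\,p(z \mid x)]
     + \underbrace{\mathbb{E}_{q(z)}[\log p(x,z)] - \mathbb{E}_{q(z)}[\log q(z)]}_{\mathrm{ELBO}} \\
  &\geq \mathbb{E}_{q(z)}[\log p(x,z)] - \mathbb{E}_{q(z)}[\log q(z)],
  \qquad \text{since } \mathrm{KL}[\,\cdot\,\|\,\cdot\,] \geq 0 .
\end{aligned}
```

The gap between log p(x) and the ELBO is exactly the KL divergence of Equation 2.7, which is why maximising the bound simultaneously tightens the posterior approximation.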

Several algorithms for optimising the ELBO exist (Blei et al., 2017). An example is the CAVI algorithm (Bishop, 2007), which iteratively optimises each latent variable of a mean-field variational family whilst keeping the others fixed (Blei et al., 2017). This procedure does not easily scale to large amounts of data, as it requires iteration over the entire dataset in every optimisation step. Stochastic alternatives exist that subsample the dataset every optimisation step, which allows application to larger datasets (Paisley et al., 2012; Hoffman et al., 2013). Recently, stochastic backpropagation with amortized inference (Rezende et al., 2014; Kingma and Welling, 2013) married VI to deep learning, allowing for NN-based generative models. This led to the birth of the Variational Autoencoder (VAE) (Kingma and Welling, 2013), a generative model that parameterises the variational family and other densities with neural networks. This model will be the topic of the next section.


2.2.3 Reparametrisation and the Variational Autoencoder

The technique of stochastic backpropagation and the Variational Autoencoder have been developed concurrently by Rezende et al. (2014) and Kingma and Welling (2013). In the following short description of the VAE we will mostly stick to the style and notation of Kingma and Welling (2013).

As briefly mentioned in the previous section, the VAE offers a solution to two challenges that arise in approximate posterior inference (Kingma and Welling, 2013). The VAE is scalable to large datasets, unlike MCMC methods and traditional VI optimisation techniques, because it uses stochastic optimisation (Robbins and Monro, 1951). Also, it is applicable to arbitrarily complex likelihood functions p(x|z), unlike previous VI methods like CAVI (Bishop, 2007) and SVI (Hoffman et al., 2013). This is because stochastic backpropagation allows for the use of stochastic gradient descent (SGD) (Bottou, 2010), a powerful method of stochastic optimisation that iteratively updates the parameters of a function with the negative gradient of a loss function specified over the function outcome and the desired outcome.

We explain the VAE based on the simple model of Equation 2.4. The posterior p(z|x) and marginal p(x) are assumed to be intractable. Therefore, we specify the variational distribution q(z), and attempt to find the best approximation to the actual posterior by minimising the KL divergence. However, both the likelihood p(x|z) and the variational posterior are modelled with arbitrarily complex functions, e.g. neural networks. These functions are governed by a global set of parameters, θ denoting the parameters of p and φ the parameters of the variational approximation. Furthermore, the dataset X consists of N independently and identically distributed (i.i.d.) random variables whose realisations $X_i = x_i$ are vectors in $\mathbb{R}^{D_x}$. For every datapoint $x_i$, we specify a vector of latent variables $Z_i = z_i$ in $\mathbb{R}^{D_z}$, governed by the global parameters φ and the data. This means we are practising amortized VI [4]. With these specifications, we can derive the following ELBO (Equation 2.9) for the i'th datapoint:

$$\mathrm{ELBO}(\theta, \phi; x_i) = \mathbb{E}_{q_\phi(z_i \mid x_i)}[\log p_\theta(x_i, z_i)] - \mathbb{E}_{q_\phi(z_i \mid x_i)}[\log q_\phi(z_i \mid x_i)] \qquad (2.10)$$

whose optimisation minimises the posterior KL and maximises the marginal likelihood by point-estimating the parameters {θ, φ} [5]. The expectations in the ELBO could be approximated with an MC estimate by sampling from $q_\phi(z_i \mid x_i)$. However, parameter optimisation with SGD requires the gradients of the ELBO with respect to θ and φ. As sampling is a non-differentiable procedure, the gradients with respect to φ are unavailable when the expectations are MC-estimated.

The key to solving this problem is the reparametrisation trick (Kingma and Welling, 2013). The idea is that we specify a deterministic, invertible function over an auxiliary random variable that produces the desired latent variable, $z = g_\phi(\epsilon, x)$, where $\epsilon$ is governed by an independent marginal $p(\epsilon)$. This allows us to specify the density $q_\phi(z \mid x) = p(\epsilon)\left|\det \frac{\partial g_\phi}{\partial \epsilon}\right|^{-1}$ (Rezende and Mohamed, 2015). As a result of the law of the unconscious statistician (Ringner, 2009), any expectation under z can then be rewritten as an expectation under $\epsilon$, $\mathbb{E}_Z[f(z)] = \mathbb{E}_\epsilon[f(g_\phi(\epsilon, x))]$. Thus, the ELBO can be rewritten to the following form:

$$\mathrm{ELBO}(\theta, \phi; x_i) = \mathbb{E}_{\epsilon_i}[\log p_\theta(x_i, z_i)] - \mathbb{E}_{\epsilon_i}[\log q_\phi(z_i \mid x_i)], \quad \text{where } z_i = g_\phi(\epsilon_i, x_i) \text{ and } \epsilon_i \sim p(\epsilon_i) \qquad (2.11)$$

which can be approximated by sampling $\epsilon \sim p(\epsilon)$. With this technique, backpropagation towards φ becomes possible, as z is no longer sampled directly but is a deterministic function of $\epsilon$ and x.

[4] In traditional VI, separate local variational parameters have to be optimised for every latent variable, which can be computationally expensive. Inference with a shared set of global variational parameters that govern a function of the data is called amortized VI (Zhang et al., 2017). Note that the distribution of each latent variable is still governed by unique parameters, but these parameters are predicted by a function of the global parameters, instead of optimised directly.

[5] Generally, in Bayesian inference, no distinction is made between parameters and latent variables. They are both viewed as random variables over which posterior distributions have to be inferred. Inversely, in frequentist inference, parameters are viewed as separate entities of which an optimal point estimate can be determined using ML or MAP estimation over data. In this section and the rest of the thesis, we will take a frequentist stance with respect to the parameters, whilst being Bayesian in terms of the local latent variables. For a fully Bayesian discussion of the VAE, see Kingma and Welling (2013).
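To illustrate Equations 2.10–2.11 (a minimal sketch, not the thesis implementation), the snippet below draws one reparameterised Gaussian sample and forms a single-sample Monte Carlo estimate of the ELBO; `log_joint` is a hypothetical callable returning $\log p_\theta(x, z)$.

```python
import torch

def elbo_estimate(mu, log_sigma, x, log_joint):
    """Single-sample Monte Carlo ELBO with the Gaussian reparametrisation trick.

    mu, log_sigma: outputs of the inference network q_phi(z|x) for one datapoint.
    log_joint: callable returning log p_theta(x, z) for a given (x, z) pair.
    """
    sigma = log_sigma.exp()
    epsilon = torch.randn_like(mu)          # epsilon ~ N(0, I)
    z = mu + sigma * epsilon                # z = g_phi(epsilon, x): differentiable in phi
    # log q_phi(z|x) at the sampled z (diagonal Gaussian)
    log_q = torch.distributions.Normal(mu, sigma).log_prob(z).sum(-1)
    return log_joint(x, z) - log_q          # one-sample estimate of Equation 2.11
```

Because z is a deterministic function of (mu, sigma, epsilon), calling `.backward()` on the negated estimate yields gradients with respect to both θ (through `log_joint`) and φ (through mu and log_sigma).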

When both the variational posterior $q_\phi(z|x)$ and the likelihood $p_\theta(x|z)$ are neural networks, we arrive at the class of models called variational autoencoders. This name relates the model to autoencoders (Vincent et al., 2008), a class of deep learning models that encode data into a lossy deterministic code and which are optimised for the reconstruction loss of decoded representations. When viewed from a coding theory perspective (Kingma and Welling, 2013), $q_\phi(z|x)$ can be thought of as a probabilistic encoder, since it encodes a datapoint x into a distribution over possible values of the code z from which x could have been generated. Similarly, $p_\theta(x|z)$ can be thought of as a decoder, as it produces a distribution over possible values of x from a code z. Hence the name variational autoencoder.

However, $q_\phi(z|x)$ is also known as an inference network, and $p_\theta(x|z)$ as a generative network or generator (Hinton et al., 1995). Inference network is a term from the VI literature that refers to a variational approximation whose local parameters are predicted by a complex global function, as is the case in a VAE. The generative network or generator refers to the fact that $p_\theta(x|z)$ is a generative model of X. All terms will be used interchangeably in the rest of this thesis.

In conclusion, the VAE is a deep generative model that allows generation of new data from latent variables. The reparametrisation trick makes efficient, amortized optimisation with stochastic gradient descent possible. The remainder of this thesis will seek to apply the VAE to language modelling, in search of expressive deep generative models of language.

2.3 Deep Generative Models of Language

Now, we have all the building blocks available to specify what exactly is meant by a deep generative model of language. Recall that a generative model in general is any probabilistic model that learns by generating data, or equivalently, models the distribution of the data. Within this definition, we restrict ourselves to models with prescribed densities and tractable (approximations to) the marginal likelihood. Of language refers to the fact that we are talking about language models, specifically statistical models that in some way represent a density over written text. Last, deep refers to the fact that (part of) the model consists of neural networks. Thus, both the RNNLM of Section 2.1 and the VAE of Section 2.2, when applied to language, are deep generative models of language.

In the next chapter, we will discuss the problem of optimisation in deep generative models of language, especially when latent variables are involved. Various existing optimisation strategies will be empirically compared to offer a guideline for future research on VAEs for language.

CHAPTER 3: OPTIMISING DEEP GENERATIVE MODELS OF LANGUAGE

In the previous chapter, generative models and language models were discussed. This provides enough background to investigate deep generative models of language in more detail.

In this chapter, it will be researched how deep generative models of language can be effectively optimised. VAEs face the problem of the latent variables being ignored by the decoder when the decoder is “strong”, i.e. when it has enough capacity to reconstruct the data without the need for a latent representation to decode from. This is dubbed the strong decoder problem by Alemi et al. (2018). This problem appears when VAEs are applied to the language domain (Bowman et al., 2015), because expressive recurrent models are used as decoder. Yet, it is desirable to use such strong decoders in conjunction with latent variable models. The strong decoder provides the capacity needed to produce language in the correct form, whereas the latent variables capture a more global representation of language that forms a smooth manifold from which language can be sampled. Language models without latent variables do not form such a smooth manifold, but only encode direct representations of the data. Thus, we set out to overcome posterior collapse in VAEs coupled with strong decoders on a language modelling task.

The specific research question we address in this chapter is which optimisation techniques can help to counter the strong decoder problem on language. In doing so, the following contributions were made:

• We propose three novel techniques for the optimisation of variational autoencoders with strong decoders, and show empirically that these enable training of such VAEs without posterior collapse.

• We present a thorough comparison of a multitude of optimisation techniques for VAEs, which, to the best of our knowledge, has not been done before. This is accompanied by an extensive discussion of the strong decoder problem and potential ways to address it. Furthermore, we show that VAEs trained with the best of these techniques can outperform recurrent models without latent variables on a language modelling task.

First, a generative model of language will be described in some detail, the variational autoencoder for sentences (SenVAE). Then, challenges faced with the optimisation of deep latent variable models will be outlined.

3.1 A Variational Autoencoder for Sentences

The SenVAE by Bowman et al. (2015) is a straightforward application of a VAE to the language domain. SenVAE uses RNN-type networks as encoder and decoder instead of feedforward NNs. Thus, SenVAE is, in fact, the RNNLM combined with a stochastic encoder. The exact model architecture can be found in Appendix C.
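For concreteness, a minimal sketch of such a sentence VAE is given below: a GRU encoder produces the parameters of a diagonal Gaussian posterior, and a GRU decoder is conditioned on a reparameterised sample z through its initial hidden state. The layer sizes and the way z is injected are illustrative assumptions, not the thesis' exact architecture (see Appendix C).

```python
import torch
import torch.nn as nn

class SenVAE(nn.Module):
    """Sketch of a sentence VAE: a stochastic encoder on top of a recurrent decoder."""

    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 hidden_dim: int = 256, latent_dim: int = 32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_log_sigma = nn.Linear(hidden_dim, latent_dim)
        self.z_to_hidden = nn.Linear(latent_dim, hidden_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        embedded = self.embedding(tokens)
        _, h_enc = self.encoder(embedded)                  # final encoder state
        mu = self.to_mu(h_enc[-1])
        log_sigma = self.to_log_sigma(h_enc[-1])
        z = mu + log_sigma.exp() * torch.randn_like(mu)    # reparameterised sample
        h0 = torch.tanh(self.z_to_hidden(z)).unsqueeze(0)  # decoder initial state from z
        states, _ = self.decoder(embedded, h0)             # teacher-forced reconstruction
        return self.output(states), mu, log_sigma          # logits for p(x|z), q's parameters
```

The training loss is then the negative ELBO: the token-level cross-entropy of the logits plus the KL term of Equation 3.3 computed from mu and log_sigma.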

The SenVAE optimises the ELBO from Equation 2.11. As described in Section 2.2.3, this objective can be optimised with stochastic gradient descent when the choice of variational family is reparameterisable. In the following section, two choices of variational posterior are outlined.

3.1.1 Choice of Variational Family and Prior

The reparametrisation trick cannot be applied to any arbitrary distribution. Kingma and Welling (2013) describe three general choices of $q_\phi(z|x)$ to which the reparametrisation trick can be applied, namely distributions with a tractable inverse CDF, distributions from the location-scale family, and distributions that can be formed by a composition of different distributions. In this section, we describe reparametrisation for two distributions: the Normal or Gaussian distribution and the von Mises-Fisher (vMF) distribution.

Consider the following equivalent form of the ELBO (Equation 2.9):

$$\mathrm{ELBO}(\theta, \phi; x) = \mathbb{E}_{\epsilon}[\log p_\theta(x \mid z)] - \mathrm{KL}[q_\phi(z \mid x)\,\|\,p_\theta(z)], \quad \text{where } z = g_\phi(\epsilon, x) \text{ and } \epsilon \sim p(\epsilon) \qquad (3.1)$$

where $(z, \epsilon) \in \mathbb{R}^{D_z}$ and x is a sequence of word representations in $\mathbb{R}^{D_x}$. We highlight this specific form of the ELBO because, for the choices of prior and variational posterior discussed in this section, the KL is tractable.

Normal Distribution. As in Kingma and Welling (2013), the default SenVAE model has a diagonal Gaussian distribution as variational posterior (Bowman et al., 2015), $q_\phi(z|x) = \mathcal{N}(z \mid \mu_\phi(x), \mathrm{diag}(\sigma_\phi(x)))$. This distribution falls within the location-scale family of distributions and can be reparameterised as:

$$z = g_\phi(\epsilon, x) = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I_{D_z}) \qquad (3.2)$$

where we transform a sample from $p(\epsilon) = \mathcal{N}(\epsilon \mid 0, I_{D_z})$ into a sample from the variational posterior. Since z now depends directly on $\mu_\phi(x)$ and $\sigma_\phi(x)$, gradient flow is not obstructed between the encoder and decoder. The KL divergence is analytically solvable when selecting the prior $p(z) = \mathcal{N}(z \mid 0, I_{D_z})$:

$$\mathrm{KL}[q_\phi(z|x)\,\|\,p_\theta(z)] = \frac{1}{2} \sum_{d=1}^{D_z} \left( -\log \sigma_{\phi,d}^2(x) - 1 + \sigma_{\phi,d}^2(x) + \mu_{\phi,d}^2(x) \right) \qquad (3.3)$$

The derivation can be found in Section B.1 of Appendix B.
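A direct transcription of Equation 3.3 into code (a sketch to accompany the formula, not the thesis implementation), for a batch of diagonal Gaussian posteriors:

```python
import torch

def gaussian_kl_to_standard_normal(mu: torch.Tensor, log_sigma: torch.Tensor) -> torch.Tensor:
    """KL[ N(mu, diag(sigma^2)) || N(0, I) ] per datapoint, summed over latent dimensions.

    mu, log_sigma: tensors of shape (batch, D_z) produced by the inference network.
    """
    log_var = 2.0 * log_sigma
    # Equation 3.3: 0.5 * sum_d ( -log sigma_d^2 - 1 + sigma_d^2 + mu_d^2 )
    return 0.5 * torch.sum(-log_var - 1.0 + log_var.exp() + mu.pow(2), dim=-1)
```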

The diagonal Gaussian is a popular choice of variational posterior due to its simplicity and tractable KL. However, it has a few downsides. First, a choice of diagonal Gaussian implies the mean-field assumption, i.e. independence between the dimensions of a sentence embedding z. However, the true posterior potentially has highly correlated latent variables. A diagonal Gaussian can never approximate the true posterior when that is the case. In a similar vein, the Gaussian is a uni-modal distribution, whereas the true posterior might be highly multi-modal. Thus, a diagonal Gaussian might be too simple an approximate posterior.

Second, the KL divergence forces the Gaussian posterior to stay close to a standard Normal distribution. This acts as a regulariser, avoiding posterior collapse to point masses with no variance. On the other hand, this also forces every posterior to stay close to a zero mean, making it costly to encode data spread across the latent space (Tolstikhin et al., 2017). In the worst-case scenario, over-regularisation causes posterior collapse to a standard Normal, leading to the latent variable being ignored by the decoder.

(26)

3.1. A VARIATIONAL AUTOENCODER FOR SENTENCES

The first problem will be addressed in the next chapter. The second problem is addressed later in this chapter, when solutions to the strong decoder problem are discussed. We now describe a choice of variational family that comes with a uniform prior, placing less of a constraint on the posterior through the KL. We discuss this variational family in addition to the Gaussian because recent research posited it as a possible solution to the strong decoder problem (Xu and Durrett, 2018) and showed encouraging results on language modelling. Hence, we aim to verify empirically whether those findings can be corroborated in our experiments.

Von Mises-Fisher Distribution The von Mises-Fisher distribution (Fisher, 1953; von Mises, 1918) is a distribution on the (Dz − 1)-dimensional unit hypersphere in R^{Dz}. This means that a random variable X in R^{Dz} with a vMF probability density function only has realisations on the surface of the (Dz − 1)-dimensional unit hypersphere. The vMF is governed by a mean parameter µ and a concentration parameter κ, where µ ∈ R^{Dz}, ||µ||₂ = 1 and κ ∈ R_{>0}.

The nature of the von Mises-Fisher distribution makes it a natural choice as variational posterior when modelling data that has underlying spherical structure (Davidson et al., 2018). But even when the data is not of an explicitly spherical nature, a vMF posterior may be beneficial. Xu and Durrett (2018) have successfully applied a vMF-VAE to language modelling, where it outperformed a VAE with a Gaussian posterior. This success hinges on the property of the vMF that it becomes a uniform distribution on the hypersphere when κ = 0. Thus, choosing the prior p(z) = vMF(z | ·, 0) for a variational posterior of the form qφ(z | µφ(x), κφ(x)) provides a partial solution to the second problem sketched above; the mean of the variational posterior can now vary over the surface of the hypersphere without additional cost. This is visible in the KL divergence (Davidson et al., 2018):

KL[qφ(z|x) || p(z)] = κφ(x) · I_{Dz/2}(κφ(x)) / I_{Dz/2−1}(κφ(x))
                      + log[ κφ(x)^{Dz/2−1} / ( (2π)^{Dz/2} I_{Dz/2−1}(κφ(x)) ) ]
                      − log[ Γ(Dz/2) / (2π^{Dz/2}) ]    (3.4)

which does not depend on µφ(x). Here I_v is the modified Bessel function of the first kind at order v (Weisstein, 2002) and Γ the gamma function.
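As an illustration, Equation 3.4 can be evaluated numerically with SciPy's modified Bessel functions. This is a minimal sketch for inspecting the value of the KL term, not the implementation used in our models, and it assumes κ > 0.

```python
import numpy as np
from scipy.special import ive, gammaln

def vmf_kl(kappa, d):
    """KL[vMF(mu, kappa) || Uniform(S^{d-1})] (Equation 3.4); independent of mu.

    Uses exponentially scaled Bessel functions (ive) for numerical stability:
    the scaling cancels in the ratio and is added back in the log-normaliser.
    """
    kappa = np.asarray(kappa, dtype=np.float64)
    # E_q[mu^T z] = I_{d/2}(kappa) / I_{d/2 - 1}(kappa)
    bessel_ratio = ive(d / 2.0, kappa) / ive(d / 2.0 - 1.0, kappa)
    # log C_d(kappa): log-normaliser of the vMF density
    log_norm = ((d / 2.0 - 1.0) * np.log(kappa)
                - (d / 2.0) * np.log(2.0 * np.pi)
                - (np.log(ive(d / 2.0 - 1.0, kappa)) + kappa))
    # -log( Gamma(d/2) / (2 * pi^{d/2}) ): log surface area of S^{d-1}
    log_area = np.log(2.0) + (d / 2.0) * np.log(np.pi) - gammaln(d / 2.0)
    return kappa * bessel_ratio + log_norm + log_area
```

For small κ the value approaches zero, consistent with the vMF reducing to the uniform prior at κ = 0.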

In order to learn κφ(x) and µφ(x) directly, a reparameterised sampling procedure is needed. However, the vMF distribution does not fall within one of the three classes of distributions to which reparametrisation can be applied directly, so a more involved procedure is required.

Ulrich (1984) outlined a general procedure to sample from the von Mises-Fisher distribution. First, a “change magnitude” w is sampled from a univariate density, w ∼ g(w | κφ(x), Dz), which can be transformed into a sample z′ from the vMF distribution with the first standard basis vector as location, vMF(e₁, κφ(x)) (Davidson et al., 2018). This sample can subsequently be transformed to z via an orthonormal Householder reflection (Householder, 1958), z = U(e₁, µφ(x)) z′. This transformation can be thought of as rotating the sample over the surface of the hypersphere. The Householder reflection is deterministic in µφ(x), so the parameters of µφ are accessible to gradient descent. However, κφ(x) only enters through the samples z′ and w, so it cannot be optimised with gradient descent directly.
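The Householder step is deterministic and differentiable in µφ(x); a minimal PyTorch sketch of this reflection is shown below. The function name and batching convention are illustrative and not taken from the implementation of Davidson et al. (2018).

```python
import torch

def householder_rotation(z_prime, mu, eps=1e-8):
    """Map a sample z' ~ vMF(e1, kappa) to a sample of vMF(mu, kappa).

    Uses the reflection U = I - 2 u u^T with u = (e1 - mu) / ||e1 - mu||,
    which sends the first basis vector e1 onto the unit-norm location mu.
    z_prime, mu: tensors of shape [batch, Dz].
    """
    e1 = torch.zeros_like(mu)
    e1[:, 0] = 1.0
    u = e1 - mu
    u = u / (u.norm(dim=-1, keepdim=True) + eps)   # eps guards the case mu == e1
    # U z' = z' - 2 u (u^T z'), without materialising the Dz x Dz matrix
    return z_prime - 2.0 * u * (u * z_prime).sum(dim=-1, keepdim=True)
```

Note that this reflection only handles the location µφ(x); the concentration κφ(x) still requires the reparameterisation discussed next.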

The extended reparameterisation trick introduced by Naesseth et al. (2016) provides a solution to this problem. With their procedure, any distribution that can be simulated with the acceptance-rejection method (Von Neumann, 1951) can be reparameterised, as long as the proposal distribution can be reparameterised. Davidson et al. (2018) show that this method can be applied to all sampling steps in the procedure outlined above.


Thus, the combination of the sampling technique from Ulrich (1984) and the method from Naesseth et al. (2016) allows reparameterised sampling of z from vMF(µφ(x), κφ(x)). A more detailed description of the sampling procedure and its reparametrisation is outside the scope of this work. We refer the interested reader to Davidson et al. (2018)¹.

¹We also rely on their implementation of the procedure. See https://github.com/nicola-decao/s-vae-pytorch

However, optimising κφ(x) through this method has some stability issues. It was noted in Xu and Durrett (2018) that the value of κφ(x) had to be clamped between certain values to prevent numerical instability. We confirmed this in our experiments. Specifically, we found the model to be stable when Dz/3.51 < κφ(x) < 3.5·Dz. Hence, κφ(x) was clamped between these values in all experiments. In contrast to Xu and Durrett (2018), a soft clamp was used to prevent obstruction of the gradient flow; without it, we found that κφ(x) got stuck once it crossed either threshold. Specifically, a scaled and shifted tanh function was applied to κφ(x).
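A minimal sketch of such a soft clamp is given below. The exact scaling constants of our implementation may differ; the formulation here, a tanh rescaled to the interval (lo, hi), is one reasonable choice that keeps gradients non-zero everywhere.

```python
import torch

def soft_clamp(kappa, lo, hi):
    """Smoothly squash kappa into (lo, hi) with a scaled and shifted tanh.

    Unlike a hard clamp, the derivative never becomes exactly zero, so
    kappa can recover after drifting towards a boundary.
    """
    mid = (hi + lo) / 2.0
    half_range = (hi - lo) / 2.0
    return mid + half_range * torch.tanh((kappa - mid) / half_range)
```

In practice, lo and hi would be set to the stability bounds discussed above, and the clamp applied to the raw output of the κφ network.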

Having described the SenVAE with two choices of variational family, we now discuss the phenomenon that latent variables tend to be ignored when the decoder is strong.

3.2 The Strong Decoder Problem

In the previous section, it was hinted that the VAE may ignore the latent variables due to the interaction between the prior and posterior in the KL divergence. This problem is especially apparent when we have strong decoders (Alemi et al., 2018): conditional likelihoods pθ(x|z) parameterised by models with a large representational capacity. In such cases, the model might achieve a high ELBO without using information from the latent variable.

Bowman et al. (2015) first noticed this problem when the VAE was applied to language modelling. It is natural that this problem arises in the language domain, as the state of the art in language modelling, and many other language tasks, is driven by RNN-type models (Melis et al., 2017). These models are very ‘strong’ decoders because they condition on all previous context when generating the next word. This causes the SenVAE to ignore the latent code unless special alterations are made to the training procedure (Bowman et al., 2015).

One could ask whether strong decoders even need the added machinery of a VAE, as they seemingly do not benefit from the inclusion of latent variables. The motivation is twofold. First, even though strong decoders do not benefit from the latent variables when trained naively, this does not mean that they can never benefit from the latent code. For instance, the SenVAE can make use of future context as well as previous context when decoding a sentence, whereas the RNNLM cannot. Yet, the same can be said about regular autoencoders.

However, Bowman et al. (2015) have shown that regular autoencoders do not encode data into a smooth manifold. Hence, they do not explicitly encode meaningful relations between datapoints, a second useful property of VAEs. The ability to learn useful representations of data in an unsupervised manner is desirable, as such representations can find use in many downstream tasks. A prime example of this is word embeddings (Mikolov et al., 2013b).

Thus, there is ample reason to solve the strong decoder problem. It will first be grounded in information theory. Then, several possible causes of the problem and their potential solutions will be highlighted.

3.2.1 An Information-Theoretic Perspective

Alemi et al. (2018) phrase the strong decoder problem from an information-theoretic perspective, which will be discussed in this section. Consider any stochastic encoder q(z|x) that encodes a datapoint x into a latent vector z. This induces a joint distribution q(x, z) = q∗(x) q(z|x), with marginal posterior q(z) = ∫ q∗(x) q(z|x) dx. Note that q(z|x) is simply the variational posterior, and q∗(x) the true data density. From here, the mutual information between X and Z can be defined:

Iq(X; Z) = ∫∫ q(x, z) log [ q(x, z) / (q∗(x) q(z)) ] dx dz    (3.5)

which is a symmetric measure of how much information the random variables contain about one another (Alemi et al., 2018). This measure has two limiting cases. In one extreme, X and Z are independent and thus share no mutual information. In the other extreme, X and Z are identical, meaning that they have maximum mutual information. In neither case is Z a good representation of X. Thus, we wish to encode Z such that it shares a degree of mutual information with X, without directly reproducing the data.

Within this framework, Alemi et al. (2018) define four types of VAE models: autoencoders, semantic encoders, syntactic encoders and autodecoders. Autoencoders have maximum mutual information between X and Z; data is copied exactly to the latent space and no useful representation is learned. Syntactic encoders have relatively high mutual information, so Z stores syntactic and semantic information about X, e.g. sentence content and word order. Semantic encoders have low but non-negligible mutual information, meaning that Z only stores high-order semantic information about X. Last, autodecoders share no mutual information between X and Z, again learning no useful representation of the data. A standard RNNLM can be seen as an autodecoder: X is reproduced directly without learning a useful representation per se. Yet, what we arguably hope to achieve with the SenVAE is a semantic encoder; a model that learns compressed high-order representations of the data but still needs to rely on the decoder to reconstruct the finer details.

Hence, the type of VAE can be determined from the mutual information. However, Equation 3.5 is intractable in most cases, because the true data distribution q∗(x) is inaccessible and the marginal q(z) may be hard to compute. Fortunately, Alemi et al. (2018) derive the following tractable bounds on the mutual information:

H − D ≤ Iq(X; Z) ≤ R    (3.6)

H = −E_{q∗(x)}[log q∗(x)]    (3.7)

D = −E_{q∗(x)}[ E_{q(z|x)}[log p(x|z)] ]    (3.8)

R = E_{q∗(x)}[ KL[q(z|x) || p(z)] ]    (3.9)

where p(z) is an approximation of q(z), which we recognise as the prior distribution in the VAE framework, and p(x|z) is an approximation of q(x|z), which can similarly be recognised as the decoder in the VAE framework. See Alemi et al. (2018) for a derivation of these bounds.

H is the entropy of the data. D is the distortion, the expected reconstruction loss, i.e. the negative expected conditional log-likelihood of the VAE. R is the rate, the expected KL divergence between the posterior and the prior, or regularisation loss. Both D and R can be approximated with an MC estimate by sampling from the data distribution, i.e. by averaging over the dataset. H can be treated as a constant.

This perspective offers a novel way of framing the strong decoder problem. When the decoder ignores the latent variable and the KL divergence collapses to zero, the mutual information between X and Z must be zero and X is independent of Z. Thus, it is impossible for the latent variable to encode a useful representation of the data when the KL collapses.

Alemi et al. (2018) argue that this makes it important to report R and D alongside the ELBO and marginal NLL, because it offers better insight into the behaviour of the VAE. As discussed, it can help determine to which of the four VAE classes a model belongs. We will adopt this convention to diagnose the strong decoder problem in our models.
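For concreteness, the sketch below shows how D and R can be estimated by averaging over a dataset. The model interface is hypothetical: we assume a method that returns, per sentence, the reconstruction log-likelihood log p(x|z) under a posterior sample and the corresponding KL term; names and shapes are illustrative.

```python
import torch

@torch.no_grad()
def rate_distortion(model, data_loader):
    """MC estimates of distortion D (Eq. 3.8) and rate R (Eq. 3.9) over a dataset."""
    total_log_lik, total_kl, n = 0.0, 0.0, 0
    for x in data_loader:
        log_lik, kl = model.elbo_terms(x)   # hypothetical: per-sentence [batch] tensors
        total_log_lik += log_lik.sum().item()
        total_kl += kl.sum().item()
        n += log_lik.size(0)
    distortion = -total_log_lik / n         # D = -E_{q*(x)} E_{q(z|x)}[log p(x|z)]
    rate = total_kl / n                     # R = E_{q*(x)}[KL[q(z|x) || p(z)]]
    return distortion, rate
```

The per-sentence ELBO is then −(D + R), so reporting the pair (D, R) decomposes the bound into its reconstruction and regularisation parts.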


3.2.2 Directions

We now turn to the discussion of potential solutions to the strong decoder problem. After presenting these solutions in general terms, we discuss the practical solutions that will be applied to the SenVAE.

Weakening the Decoder A natural solution to the problem of a strong decoder is to stop using strong decoders. This forces the model to rely on the latent variables for good reconstruction of the data. Chen et al. (2016) claim that there is an information preference in VAE optimisation: information that can be modelled by the decoder without use of the latent variable will be modelled that way, because modelling it with the latent variable incurs extra cost. They motivate this by showing that, due to the form of the ELBO, a VAE incurs an extra cost equal to the KL divergence between the variational posterior and the true posterior. When no information is encoded in the latent variable, the true posterior collapses to the prior, and this cost vanishes. To avoid this problem, one can weaken the decoder so it cannot model all information about the data locally.

However, weakening the decoder does not fundamentally solve the strong decoder problem; it only avoids it. In most cases, finding the best solution to a problem requires the strongest decoder possible. From the perspective offered by Chen et al. (2016) it might seem that no alternative solution exists, but this is shortsighted. For instance, we can use stronger posteriors or priors. Furthermore, the problem arises during optimisation due to the form of the ELBO, so the optimisation procedure can be altered to mitigate it.

Altering Model Optimisation Another possible cause of the strong decoder problem is sub-optimal optimisation. It is well known that optimisation of NNs is a highly non-convex problem: (stochastic) gradient descent is only guaranteed to reach a local optimum of this non-convex landscape. It could thus be that VAEs with strong decoders more easily find local optima where the latent variable is neglected, perhaps simply because there are more local optima of that kind. By altering the optimisation procedure in the right way, the model could then be given a “push” towards local optima where the latent variable is not ignored.

Changing the ELBO One could also alter the non-convex optimisation landscape directly by changing the ELBO. As mentioned above, Chen et al. (2016) showed that the ELBO expresses an information preference for decoding without the latent variable. Furthermore, Tolstikhin et al. (2017) and Zhao et al. (2017a) noted that the KL term in the ELBO forces the variational posterior of every datapoint towards the prior as we optimise with mini-batches, whereas the posterior should only match the prior in expectation. These constraints on the latent variable imposed by the ELBO might make “good” local optima hard to reach. Thus, when the ELBO is altered to alleviate these constraints, better optima may be found. A downside to this approach is that such alterations mean we no longer have a guarantee of optimising a lower bound on the marginal log-likelihood. Nonetheless, finding a good local optimum of the altered objective might also mean finding a good local optimum of the ELBO, so this method deserves consideration.

More Complex Variational Families In Section 2.2.2 it was mentioned that the choice of variational family influences the fidelity with which the true posterior can be approximated. For instance, it is impossible to approximate a complex multi-modal posterior with a simple uni-modal distribution such as a diagonal Gaussian or vMF. Thus, using a more complex variational family might improve the model. Similarly, Alemi
