
MSc Artificial Intelligence

Master Thesis

Text Generation and Annotation with Joint Multimodal Variational Autoencoders

by

Tom Koenen

10215557

October 15, 2018

36 EC September 2017 - September 2018

Supervisor:

Dr. Peter Bloem

Assessor:

Prof. dr. Frank van Harmelen


Text Generation and Annotation with Joint Multimodal Variational Autoencoders

Tom Koenen

October 15, 2018

Abstract

Multimodal learning allows us to build models that link data points from one domain to another. By using a Joint Multimodal Variational Autoencoder we can place the point of cross-over between the text and attribute domains at the level of the latent representation and generate output in either domain by decoding the latent representation of the other. To achieve this, the latent space must contain an implicit representation of both the high-level syntax required for natural language sequences and the attributes with which these sentences have been annotated. By constraining the latent spaces to prevent the latent representations from diverging, this joint representation of the natural language sentence and its attributes can be approximated by the encoders of the two individual modalities. This cross-over allows text to be generated from annotation inputs as well as annotations to be generated from text inputs.

1 Introduction

While information from many different domains has become available in ever increasing quantities, researchers have often used the data available in one domain to directly predict the values of the parallel data it is paired with. A large variety of machine learning techniques exist to train models that find some label or value given a certain type of input. While ideally this would allow us to find the model parameters that give the desired output, it is often difficult to map directly from one domain to another when the domains in question are rarely observed together. If the data in one domain has not been observed alongside a corresponding value in the other, the mapping becomes more difficult.

Unfortunately, many of these machine learning techniques are of limited use when applied to unlabeled data. Labels would provide an objective desired outcome that the machine learning model should strive for given its input. In practice, however, they are rarely encountered outside of services specifically based on user feedback. Online review sites, for instance, might ask their users to annotate their written review with a specific score, but on the whole the request for internet users to add unambiguous explanations to their data output is often ignored or considered a cumbersome task.

A further limitation of relying on labeled datasets is that attempts to map the relation between input and label often ignore the shared underlying causes responsible for generating data of two different types. After all, the information in the descriptive text as well as in the knowledge statement annotation are descriptions of the same phenomenon, just expressed in different domains. A model that explains the data in relation to not only a corresponding label or parallel data point, but to the unobserved, latent variable that it is conditionally dependent upon would be of greater use, as it would allow us to evaluate both whether new data points are probable and how this data is most likely represented in the parallel and latent domains.

Even if two domains are observed alongside each other, it can prove useful to forgo a direct mapping between the two and instead focus on the underlying values the two domains are dependent upon. If these latent values and their mapping to the data in different domains were defined, it would be possible to generate parallel data from a single latent value. Using the mapping in the opposite direction, it would be possible to find the corresponding latent value for a given data point and from that generate the data in the parallel domain. The benefits of this approach are therefore twofold: it can be used to generate new data and to easily map data from one domain to another.


1.1 Unsupervised learning

It is worth noting that the problem of sparse data labeling only affects supervised machine learning methods, which require every input value in the dataset to have an output label. These methods approach problems by searching for the model parameters that match inputs to outputs as closely and as often as possible (as with classification or regression problems).

Unsupervised machine learning methods on the other hand, deal with unlabeled data and attempt to find the underlying structure of the data-points. This could be for the purposes of clustering or data-compression, two tasks that predict the values of unobserved variables based on the observed data but without access to corresponding objective ’true’ labels.

A different perspective on unsupervised learning comes from generative modeling, which is concerned with generating new datapoints similar to the ones found in the dataset. If the assumption is made that a joint distribution P(X,Z) exists, so that new data-points Xnew resembling the existing data X can be generated from P(X|Z) and are conditionally dependent on the latent distribution P(Z), then this changes the objective of the learning algorithm. Rather than learning the relation between input and target P(Y|X), the algorithm will learn the distribution of the data P(X) and of the latent space responsible for generating it P(Z), as well as the conditional distributions between the two, Pθ(Z|X) and Pφ(X|Z), with their parameters θ and φ.

This offers up new possibilities in terms of data analysis, as the data now exists not only in the regular data-space but also in a latent space. Under proper constraints the distance between latent representations could be used to indicate similar properties not immediately evident in the data-space. Specific regions in P(Z) will become responsible for generating specific likely types of output, as defined by the conditional distribution P(X|Z). If, for instance, a latent representation of an image of a car exists in the latent space, the individual points within that subspace could make up the different types of cars that are possible. This means that when the latent representation of an image is known, sampling around this point and mapping it back into the data-space could help generate images similar in general type, rather than just in pixel-values.

If we return to the example above of an internet review site, the benefits of using a generative model on this data become clear. We could map the reviews to a latent representation rather than a label and learn to generate probable reviews, and more specifically generate probable reviews from within a certain region of the latent space that are similar in the properties encoded into that area. However, there is no need to completely abandon the label information; in fact, it would be informative to create a joint model of the two types of data as well as the latent representation: P(X,Y,Z). If both X and Y are generated by Z and the joint model is known, we could not only generate new data-points of these two modalities, but also map the two modalities to one another through the latent distribution P(X|Z) · P(Z|Y).

This brings us to the focus of this thesis: the modeling of data from different modalities based on the same latent representation. To do this we will make use of Variational Autoencoders (VAEs), a generative model adapted from a type of unsupervised dimensionality reduction model called an autoencoder. With this generative model it is possible to couple the data from two different domains to a shared latent representation as well as generate the two types of data from one single latent representation. The question then becomes: how well does the data from each domain translate to the other after mapping it through the latent space? If the two domains are text and text annotations, the tasks of mapping from one to the other are essentially instances of knowledge extraction (when mapping from a text to its attributes) and language synthesis (when mapping from attributes to text). It will be of interest to see how well each task can be performed by this framework and how the sharing of a latent representation could enhance or impede the generation of data conditioned on it (i.e. P(X|Z) and P(Y|Z)).

1.2 Autoencoders

Autoencoders are a type of neural network that learn a data representation through unsupervised learning. This representation is usually of a lower dimensionality than that of the input data, in order to represent the original input more efficiently or to create a denoised version of it. The representation is, however, expected to be decodable back into a near approximation of the original input in order to prove useful as an alternative representation. The latent representation is connected to an output layer of the same dimensionality as the input, and the difference between the input and output is backpropagated through the network in order to minimize it.


Figure 1: Autoencoder network structure. Source: commons.wikimedia.org

In the simplest case (see Equation 1) the encoder and decoder both have one layer, and the weight matrices and bias vectors make up the parameters. Each layer output is passed through an activation function and the model's final output is used in conjunction with the original input to define the loss function (in the case of Equation 1 the mean squared error). This reconstruction error is backpropagated through the network at each new iteration and the weights of the encoder and decoder layers are updated accordingly. The hyperparameters of the autoencoder (i.e. number of layers, type of activation function, loss function) are chosen depending on the type of data that is being used, with performance varying depending on the choices made.

\theta : X \to Z, \quad z = \sigma(W_\theta x_{\mathrm{input}} + b_\theta)
\phi : Z \to X, \quad x_{\mathrm{output}} = \sigma(W_\phi z + b_\phi)
\theta, \phi = \arg\min_{\theta,\phi} \| X - (\phi \circ \theta)(X) \|^2 \quad (1)
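As a concrete illustration of this setup, the minimal sketch below implements the one-layer encoder and decoder of Equation 1 in PyTorch with a mean squared error reconstruction loss. The layer sizes, learning rate and dummy minibatch are illustrative assumptions, not values taken from this thesis.

```python
import torch
import torch.nn as nn

# Minimal one-layer autoencoder matching Equation 1: an encoder (W_theta, b_theta),
# a decoder (W_phi, b_phi), sigmoid activations, and an MSE reconstruction loss.
# The dimensions (784 -> 32) are purely illustrative.
class Autoencoder(nn.Module):
    def __init__(self, data_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(data_dim, latent_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(latent_dim, data_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)          # theta : X -> Z
        return self.decoder(z)       # phi   : Z -> X

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x = torch.rand(16, 784)              # a dummy minibatch
optimizer.zero_grad()
x_out = model(x)
loss = criterion(x_out, x)           # ||X - (phi . theta)(X)||^2
loss.backward()                      # backpropagate the reconstruction error
optimizer.step()
```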

The latent space contains learnable features of the data. In images these features might correspond to edges or corners; in sentences to the topic or the grammatical function of a word. Selecting a single value in the latent space and putting it through the decoder, with all the other values in the vector set to 0, can give an idea of the kind of structure that specific latent value has come to represent after training.

This ability of a neural network to express features of the data without annotation is also what makes the architecture as a whole appealing to the field of generative modeling. In its original form, however, an autoencoder cannot meaningfully output datapoints by putting randomly sampled values through the decoder half of the network, because the network is only used as a means of mapping input to a lower dimensional latent space and back to data space. The points that are mapped do not have meaning based on their location alone: a point sampled from the space between any two other points can result in decoded output completely unrelated to that of the two points. An autoencoder does a good job of representing a single input in latent space, but the space as a whole does not necessarily reflect the data it has been trained on.


VAEs introduce sampling of the latent space to implicitly add features to the regions the data gets mapped to. Through this new architecture the autoencoder can then be re-imagined as a Bayesian inference network, where the encoder and decoder networks function as approximators of the distributions needed for inference of the data.

2 Related Work

2.1 Variational Autoencoders

VAEs are an unsupervised machine learning method for learning data distributions, first proposed independently by Kingma and Welling (2013) and Rezende et al. (2014). They demonstrated that it was possible to create a generative model that allows for approximate inference of continuous latent variables. This model allows neural networks to perform variational Bayesian inference by using them as approximators of the necessary conditional distributions and training them with stochastic gradient descent. The model was trained on image data, after which it was able to generate "natural" looking images from decoded random samples of the continuous latent space.

In other words, this VAE model was able to recreate the data X according to a distribution P(X), by working to make sure this distribution is as close as possible to the unknown true distribution of the image data Ptrue(X). VAEs are able to approximate this distribution without making strong assumptions about the data or incurring large approximation errors, by training neural networks to act as the deterministic functions mapping data to and from the latent space (see Equation 1). These neural networks can be updated through gradient descent and have been shown to provide promising results in generating data from the image domain.

Training this VAE model essentially creates a generative model of the data P(X) as well as the structure of the underlying latent variables P(Z) and the parameters of the functions mapping between the two. Once this has been achieved it is trivial to encode new data to the latent space or generate new data altogether by decoding random samples from the latent space with a model that has been trained to output probable data-points.

2.1.1 The lower bound on the data probability

In order to capture the dependencies between the different dimensions of the data, X is taken to be dependent on a latent space Z. A function f exists such that, given the parameters θ, f maps a vector of latent variables z to a data-point x. The function fθ is deterministic and its parameters θ are trainable. The vector z is randomly sampled from the latent space probability density function P(z). The goal when constructing a VAE is to maximize the probability of the data given the other parameters of the network.

A problem in approximating this true data distribution is the intractability in the standard marginal likelihood approach:

P(X) = \int P(X|z)\,P(z)\,dz = E_{P(z)}[P(X|z)]

Under different circumstances a distribution P(X) might be obtained through repeated sampling from the latent space, learning the conditional probabilities based on the number of times a specific variable z was observed generating a data-point x. The problem here is that even when using a random sampling method, most values of z have near-zero probability of having generated the data (i.e. P(X|z) ≈ 0 for most z), and therefore an intractably large number of samples of z would be needed to accurately approximate the correct distribution over all the possible data-points P(X). What is required to maximize P(X) is a lower bound on the data probability that can be optimized with gradient descent, so it becomes necessary to find a definition of the lower bound with terms that lend themselves to gradient descent optimization.
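The sketch below illustrates this naive estimator: latent vectors are drawn from the prior and the decoder likelihoods are averaged. The function decoder_log_prob is a hypothetical placeholder for log P(X|z); because most prior samples contribute almost nothing, the estimate only becomes reliable for an impractically large number of samples, which is exactly the problem the lower bound avoids.

```python
import math
import torch

def naive_marginal_likelihood(x, decoder_log_prob, latent_dim, n_samples=10_000):
    """Monte Carlo estimate of log P(x), with P(x) = E_{P(z)}[P(x|z)],
    obtained by sampling z from the prior.

    `decoder_log_prob(x, z)` is a hypothetical placeholder returning log P(x|z)
    for every sampled z. Since P(x|z) is vanishingly small for most prior
    samples, the variance of this estimator is enormous in practice.
    """
    z = torch.randn(n_samples, latent_dim)        # z ~ P(z) = N(0, I)
    log_px_given_z = decoder_log_prob(x, z)       # tensor of shape (n_samples,)
    # log-mean-exp for numerical stability: log( (1/N) * sum_i P(x|z_i) )
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(n_samples)
```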

Part of this approach would be to sample only from variables z that are likely to have produced X in the first place. A new function Qθ(z|X) could be used to find these samples, Q being an approximation of the distribution P(z|X). When sampling from a small number of likely z, a highly similar quantity becomes much easier to compute: E_{z∼Q}[log P(X|z)].

The distribution Q can theoretically be any type of distribution we want, but for mathematical convenience it is assumed that Q is dependent on X: Qθ(z|X).


The VAE model depends on the ability to rewrite the terms that make up the data probability in terms of Kullback-Leibler (KL) divergences and vice versa. Starting from the KL divergence between the true distribution P(z|X) and its approximation Qθ(z|X), it is possible to rewrite this into a lower bound on P(X). The Kullback-Leibler divergence between Qθ(z|X) and the distribution P(z|X) is given by:

D_{KL}[Q(z|X) \,\|\, P(z|X)] = \int Q(z|X)\,\big(\log Q(z|X) - \log P(z|X)\big)\,dz = E_{z \sim Q}\big[\log Q(z|X) - \log P(z|X)\big]

Applying Bayes' Theorem replaces the intractable term P(z|X) with P(X), P(z), and P(X|z):

D_{KL}[Q(z|X) \,\|\, P(z|X)] = E_{z \sim Q}\big[\log Q(z|X) - \log P(X|z) - \log P(z) + \log P(X)\big]
= E_{z \sim Q}\big[\log Q(z|X) - \log P(X|z) - \log P(z)\big] + \log P(X)

The terms can be rearranged in order to move the independent P(X) to the left-hand side:

D_{KL}[Q(z|X) \,\|\, P(z|X)] - \log P(X) = E_{z \sim Q}\big[\log Q(z|X) - \log P(X|z) - \log P(z)\big]

This allows log Q(z|X) and log P(z) to be collapsed into one KL divergence:

D_{KL}[Q(z|X) \,\|\, P(z|X)] - \log P(X) = D_{KL}[Q(z|X) \,\|\, P(z)] - E_{z \sim Q}[\log P(X|z)]

Multiplying both sides by −1 gives:

\log P(X) - D_{KL}[Q(z|X) \,\|\, P(z|X)] = E_{z \sim Q}[\log P(X|z)] - D_{KL}[Q(z|X) \,\|\, P(z)]

Now maximizing the left-hand side increases the probability of the data P(X) and minimizes the KL divergence between the approximate and true posterior distributions Q(z|X) and P(z|X). This makes the right-hand side of the equation suitable as an objective for gradient descent optimization:

\log P(X) \geq \mathcal{L} = E_{z \sim Q}[\log P(X|z)] - D_{KL}[Q(z|X) \,\|\, P(z)]

This optimization can be performed through the two tractable terms on the right-hand side. Maximizing these terms also maximizes the data probability.

• Increasing E_{z∼Q}[log P(X|z)] increases the probability of the encoded data given a sampled z. It is referred to as the decoding probability.

• Decreasing D[Q(z|X)||P(z)] reduces the divergence between the approximated distribution Q(z|X) and the prior distribution P(z). This can be thought of as the degree to which the model needs to diverge from the standard normal distribution in order to encode the information of the data into the latent space. This measure of the space taken up by the latent encoding is also called the encoding error.

Using gradient descent with this loss function allows us to optimize P(X) in conjunction with D[Q(z|X)||P(z|X)] without having to deal with the intractable terms P(X) and P(z|X). In other words:

\log P(X) = D_{KL}[Q(z|X) \,\|\, P(z|X)] + E_{z \sim Q}[\log P(X|z)] - D_{KL}[Q(z|X) \,\|\, P(z)]

DKL[Q(z|X)||P(z|X)] might be unknown, but it is by definition a non-negative term, and so the two other terms constitute a lower bound on the log probability of X.

2.1.2 The Encoder and Decoder

The lower bound terms can be interpreted as the two terms that together make up an autoencoder (i.e. a term for the loss of encoding the data X into the latent space z and a term for the probability of decoding a sample from the latent space back into data space). It is possible to perform stochastic gradient descent if we define the loss on these terms in a way that is differentiable with regard to φ and θ.


The terms Qθ(z|X) and P(z) appear in the loss term on the encoder: D[Q(z|X)||P(z)]. For mathematical convenience both are assumed to be Gaussian distributions. P(z) is taken to be a Gaussian distribution N(0,I) and Q(z|X) to be a diagonal Gaussian with mean vector µ and diagonal covariance matrix σ (represented by a vector of the diagonal values), determined by deterministic functions with parameters θ. The encoder loss term D[N(µ(X),σ(X))||N(0,I)] can be rewritten in a closed form solution as in Equation 2 (for the full derivation see Appendix A).

D_{KL}[N(\mu(X), \sigma(X)) \,\|\, N(0, I)] = E_{N(\mu,\sigma)}\big[\log N(\mu, \sigma) - \log N(0, I)\big] = \dots = \tfrac{1}{2}\big(\mathrm{tr}(\sigma) + \mu^{T}\mu - d - \log\det\sigma\big) \quad (2)

In this equation d represents the dimensionality of the multivariate distribution. This derivation also motivates the choice of distribution N (0,I), without which the closed form solution could not be simplified to so few terms.
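In the log-variance parameterization that most VAE implementations use for numerical stability, this closed-form KL term between the diagonal Gaussian N(µ, σ) and the standard normal prior can be computed as in the sketch below. This is a standard formulation, not code taken from this thesis.

```python
import torch

def gaussian_kl(mu, logvar):
    """Closed-form KL divergence D[N(mu, diag(sigma)) || N(0, I)] per datapoint.

    `logvar` holds the log of the diagonal entries of sigma. Summing
    -0.5 * (1 + log sigma_j - mu_j^2 - sigma_j) over the d dimensions yields
    0.5 * (tr(sigma) + mu^T mu - d - log det sigma).
    """
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
```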

We can use a neural network as the deterministic function with parameters θ to find µ and σ, and consequently perform gradient descent based on the loss function we have just defined.

2.1.3 The reconstruction loss

The second lower bound term E_{z∼Q}[log Pφ(X|z)] can be calculated per datapoint for L samples of z by Equation 3 and incorporated into the loss per datapoint (see Equation 4).

E_{z^{(i,l)} \sim Q}\big[\log P(x^{(i)} | z^{(i,l)})\big] \approx \frac{1}{L} \sum_{l=1}^{L} \log P_\phi(x^{(i)} | z^{(i,l)}) \quad (3)

\tilde{\mathcal{L}}(\theta, \phi; x^{(i)}) = -D_{KL}\big[Q_\theta(z|x^{(i)}) \,\|\, P(z)\big] + \frac{1}{L} \sum_{l=1}^{L} \log P_\phi(x^{(i)} | z^{(i,l)}) \quad (4)

In practice the number of samples L can be set to 1 without negative effects, if the size of the minibatches (M) drawn from the total data (N) is large enough (e.g. M ≈ 100); even if M is set to 1 the model can still be run with only a slight decrease in performance.

With this it is possible to formulate an estimator for the lower bound of the full dataset by summing over a minibatch of M datapoints (randomly drawn from a total dataset of size N):

\mathcal{L}(\theta, \phi; X) \approx \frac{N}{M} \sum_{i=1}^{M} \tilde{\mathcal{L}}(\theta, \phi; x^{(i)})

While this loss is easily differentiable with regard to φ, the parameters θ, which z is dependent upon, are placed behind a sampling step.

The sampling of z from the distribution Qθ(z|X) is non-deterministic and therefore makes it impossible to backpropagate the loss to the parameters θ of the network layers before the sampling step when sampling directly from the latent distribution N(µ, σ). However, it is possible to take a sample and backpropagate after performing reparameterization.

2.1.4 The reparametrization trick

A way to make backpropagation through the µ and σ layers possible despite this sampling is to take a sample ε from a standard normal distribution (ε ∼ N(0,I)) instead of the latent distribution, and apply the mean and variance of the latent distribution to ε in order to generate a sample.

Through this reparametrization trick the sample ε from a standard normal distribution can generate a sample from the distribution N(µ, σ) by first multiplying ε with the (diagonal) vector σ and then adding it to the vector µ. This approach makes it possible to calculate the gradients necessary for updating the µ and σ layers (through the addition and multiplication steps respectively) and leaves only the random sampling itself out of consideration when updating the network weights θ (see Figure 2).
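A minimal sketch of this reparameterization step, assuming the encoder outputs the mean and the log of the diagonal variance:

```python
import torch

def reparameterize(mu, logvar):
    """Draw z ~ N(mu, diag(sigma)) as mu + sigma * eps with eps ~ N(0, I).

    The randomness is isolated in eps, so gradients flow deterministically
    through the multiplication (sigma) and the addition (mu) back into the
    encoder parameters theta.
    """
    std = torch.exp(0.5 * logvar)     # sigma, from the log-variance output
    eps = torch.randn_like(std)       # eps ~ N(0, I)
    return mu + eps * std
```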


Figure 2: A variational autoencoder implemented as a feed-forward neural network. The blue nodes show the loss functions, the red node shows the sampling operation. Adapted from Doersch (2016).

With these specifications of the network both lower bound terms can be used to update the network and optimize P(X). The network is made up of an encoder and a decoder section. The encoder maps the input data to a latent distribution by outputting a k-dimensional mean µ and diagonal covariance matrix σ. These two vectors are then used to transform the random sample ε into a sample z from the latent distribution. Based on these samples the decoder network will then generate an output. The cross-entropy between this output and the original input is then used to calculate the second part of the loss function.

The complete loss function is therefore the sum of the KL divergence between the distribution N(µ, σ) and the prior N(0, I) and the cross-entropy loss between the input and output values, i.e. the negative of the lower bound:

D_{KL}(Q_\theta(z|X) \,\|\, P(z)) - E_{z \sim Q}[\log P_\phi(X|z)]
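Putting the pieces together, the following compact sketch shows how the two loss terms are combined in practice. The architecture, layer sizes and the assumption of binarized inputs for the cross-entropy term are illustrative; this is not the implementation used later in this thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE sketch: the encoder outputs mu and log-variance, a
    reparametrized sample is decoded, and the loss adds cross-entropy
    reconstruction to the KL term. Layer sizes are illustrative."""

    def __init__(self, data_dim=784, hidden=256, latent=16):
        super().__init__()
        self.enc = nn.Linear(data_dim, hidden)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, data_dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        eps = torch.randn_like(mu)                       # eps ~ N(0, I)
        z = mu + eps * torch.exp(0.5 * logvar)           # reparametrized sample
        return self.dec(z), mu, logvar

def vae_loss(x, logits, mu, logvar):
    # Reconstruction term: cross-entropy between input (in [0, 1]) and decoded output.
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
    # Encoding term: closed-form KL between N(mu, sigma) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl                                    # the negative lower bound
```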

2.2 VAEs applied to Natural Language Sequences

The VAE model suggested by Kingma and Welling (2013) relies on continuous inputs like image pixel-values, but with some alterations the model architecture can be re-purposed to work with language inputs. In the case of language data, finding the probability distribution of the datapoints given a continuous latent variable corresponds to training a language model, meaning that only sentences that resemble real-world natural language examples should be given a high probability, while nonsensical sentences should be given a low probability. Fortunately, the application of neural networks to create well performing language models has gained traction in recent years and can be used to approximate the conditional distribution p(x|z) necessary for the Variational Bayes approach.

2.2.1 Neural Network Language Models

In 2010 Mikolov et al. introduced a Recurrent Neural Network Language Model (RNNLM) that was capable of addressing some of the shortcomings inherent to n-gram and standard feed-forward neural network models. Their model makes use of a Recurrent Neural Network (RNN), a class of neural network that organizes its nodes as a directed graph with cycles in order to process sequential inputs. This model takes advantage of this type of network structure and has an advantage over baseline n-gram models in terms of both time complexity and the amount of data required for its training.

The application of neural networks to language models was first introduced by Bengio et al. (2000) and further explored by Goodman (2001). Bengio deviated from the then dominant methodology of using n-gram language models and used a feed-forward neural network to define the probability distribution over sequences of words.

An n-gram language model calculates the probability of a sentence by looking at the frequency counts of subsequences of length n (an n-gram). The probability of the word following that n-gram is calculated by dividing the number of times the next word occurred after the given n-gram by the total number of occurrences of the n-gram. In other words, P(wt+1|w1, · · · , wt−2, wt−1, wt) is approximated by P(wt+1|wt−n, · · · , wt−1, wt), and the probability of the full sequence is a factorization over all the conditional probabilities. The main problem with this approach is its dependence on previous occurrences of subsequences in the training set. Given the number of possible natural language sequences, it is unrealistic to have a model dependent on counting the occurrence of training examples of any large size. If the n-gram is reduced to a 1-, 2- or 3-gram then the training data will more likely be sufficient for word predictions, but dependencies spanning more than those few words will be lost to the language model. The n-gram approach is inherently limited to finding a value of n large enough that it can predict the next word in the sequence with some meaningful insight and small enough that there are enough occurrences in the training set to draw from.
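A count-based n-gram model of this kind can be sketched in a few lines; the toy corpus and the choice n = 3 below are purely illustrative.

```python
from collections import Counter, defaultdict

def train_ngram_lm(sentences, n=3):
    """Count-based n-gram model: P(w | context) is the count of (context, w)
    divided by the count of the context, the context being the n-1
    preceding words."""
    context_counts = Counter()
    next_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            context_counts[context] += 1
            next_counts[context][tokens[i]] += 1

    def prob(word, context):
        context = tuple(context[-(n - 1):])
        if context_counts[context] == 0:       # unseen context: no estimate available
            return 0.0
        return next_counts[context][word] / context_counts[context]

    return prob

# Toy usage on a tiny corpus.
prob = train_ngram_lm(["the food was good", "the food was cheap"], n=3)
print(prob("good", ["food", "was"]))           # 0.5
```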

While Bengio's neural network model maintains the n-gram structure of defining the probability of the next word based on the n preceding words, this neural network architecture offers the advantage of a distributed representation that characterizes the preceding words in the input sequence not as a fixed category, but as a continuous number (or vector of numbers) representing some undefined attribute of the sequence that gets updated based on its contribution to predicting the next word. This ability of the model to learn features and become less dependent on exactly matching subsequences in the training set is what makes it so valuable to language modeling. The model proves able to generalize well to different inputs, so that sequences of words that have never been encountered in that order will still score high if the words are close in representation to words that have been encountered during training.

This model functions by mapping the word input to a continuous feature representation, which is connected to the full-sentence representation layer. The final output is given by a softmax distribution over all the possible next words given the input sequence, and its loss by the summed log-likelihood over that distribution. The model uses backpropagation to train the weights of the prediction layers as well as the feature embedding layer.

Goodman's analysis of the Bengio model also shows that the neural network model offers unique advantages over the standard n-gram models, with particular gains in the handling of longer input sequences. The relative simplicity of the neural network model over the smoothed and clustered alternatives also allows it to be easily interpolated with other models for improved results, indicating that it performs well in a manner that is complementary to other types of models.

2.2.2 Recurrent Neural Network Language Models

Unfortunately the standard neural network language model still has a significant shortcoming when it comes to its input sequence. The architecture of a feed-forward neural network requires an input sequence of fixed length to predict the vocabulary probabilities of the following word. This makes a neural network language model dependent on the information provided by a fixed number of preceding words. When it comes to natural language this limitation makes it impossible to accurately process longer dependencies in a sentence the way a human reader would be able to. Unless the useful length of the preceding sentence is predefined before training, a standard feed-forward neural network will be trained with a fixed length on a variety of sentences containing both longer dependencies, which require the model to take the earliest words in the sequence into account, and shorter dependencies, which only require the model to take later words in the sequence into consideration when determining the probable next word. Understandably these conflicting examples pose a problem for the training of a language model and make it important to have an alternative to the manual specification of useful context length.


Figure 3: Simple RNN network (source: Mikolov et al. 2010)

To address these types of shortcomings in sequence data, Elman (1990) proposed a model architecture known as a simple Recurrent Neural Network (RNN) or Elman network.

At each timestep in a sequence (in the case of language modeling corresponding to each word in a sentence) the model connects its inputs to a hidden context layer, which in turn is connected to the output layer as in the standard feed-forward model. However, the input layer at each timestep is concatenated with the context representation of the previous timestep and only then passed to the hidden context layer of the current timestep (which in turn will be concatenated with the input at the next timestep).
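A single Elman-style step can be sketched as follows, with the concatenation of the current input and the previous context made explicit; the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ElmanStep(nn.Module):
    """One timestep of a simple (Elman) RNN: the current input is concatenated
    with the previous context vector, passed through a sigmoid hidden layer,
    and the new context is mapped to a softmax over the vocabulary."""

    def __init__(self, input_dim, context_dim, vocab_size):
        super().__init__()
        self.hidden = nn.Linear(input_dim + context_dim, context_dim)
        self.output = nn.Linear(context_dim, vocab_size)

    def forward(self, x_t, context_prev):
        context_t = torch.sigmoid(self.hidden(torch.cat([x_t, context_prev], dim=-1)))
        y_t = torch.softmax(self.output(context_t), dim=-1)
        return y_t, context_t   # the new context is fed into the next timestep
```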

Mikolov et al. (2010) successfully applied this type of RNN to language modeling. Although the model uses the same activation functions as the feed-forward network (sigmoid from input to hidden layer, softmax from hidden to output), there is a marked decrease in the training time necessary to reach convergence, and it performs better than the contemporary back-off models even when trained on smaller datasets.

Long Short-Term Memory

An extension of the RNN architecture was proposed by Hochreiter and Schmidhuber (1997) to address some of the shortcomings of the regular RNN model, primarily the vanishing gradient problem. The vanishing gradient problem describes the phenomenon in neural networks where the error signals for layers farther from the output become vanishingly small, due to the standard activations outputting a signal between 0 and 1. Given layers with these activations, backpropagating the partial derivative of the error with respect to the activation functions at the earlier layers of the network results in very small error update values, and so the weights of those layers are hardly updated at all.

This is a significant problem for recurrent neural networks, as they pass signals through the same network at every timestep of the input sequence, and backpropagation through several timesteps therefore results in a vanishing gradient. The wider the gap between the timesteps in the sequence, the harder it is for the RNN model to handle this "long-term dependency" in the data.

Hochreiter & Schmidhuber’s Long Short-Term Memory (LSTM) network addresses this by having an architecture that is explicitly built to retain information over the course of a sequence.


LSTM networks have a far more elaborate way of handling the information passing through the network than the RNN’s relatively simplistic approach of concatenating the previous state’s context vector to the current state’s input.

Each LSTM cell has a cell state Ct that is only altered through linear operations before passing through to the next timestep's cell, thereby staying unaffected by vanishing gradients, as there are no activation functions to contend with.

The alterations to the cell state are made with two operations: a multiplication and an addition (see Equation 5). The multiplication is done with the output of the so-called "forget gate", a sigmoid activation over a neural network layer connected to the concatenation of the input at that timestep (xt) and the previous hidden state output (ht−1). The output of this gate essentially defines what information in the cell should be retained and what should be forgotten. If an output after activation is 1, for instance, the corresponding value will be completely unaffected by the multiplication; if it is 0, it will be fully forgotten. The second operation is done with the output of the "input gate", which adds new values to the cell state representation that are deemed relevant changes to the state based on the concatenated input and previous state output. This proposed change to the state, C̃t, is defined as the positive or negative values that result from a tanh activation over the network with the concatenated inputs, scaled by a vector of importance factors (the sigmoid activation over a network with the same concatenated inputs). This cell state is then passed to the next timestep and also used to predict the current state's output.

C_t = C_{t-1} \cdot f_t + i_t \cdot \tilde{C}_t \quad (5)

The output of the cell is given by the tanh activation of the cell state multiplied with the result of the output gate (a sigmoid activation over a network layer connected to the concatenated inputs [xt, ht−1]) (see Equation 6). The tanh activation maps the cell state to the range (-1, 1), which allows the state to have positive or negative values. The result of the output gate is essentially a scaling factor that determines the relevance of the information at state t to its output.

h_t = \tanh(C_t) \cdot o_t \quad (6)
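Written out gate by gate, one LSTM timestep following Equations 5 and 6 looks roughly like the sketch below; in practice a library implementation such as torch.nn.LSTM would be used.

```python
import torch
import torch.nn as nn

class LSTMStep(nn.Module):
    """One LSTM timestep spelled out gate by gate, following Equations 5 and 6:
    C_t = C_{t-1} * f_t + i_t * C~_t  and  h_t = tanh(C_t) * o_t."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        concat = input_dim + hidden_dim
        self.forget_gate = nn.Linear(concat, hidden_dim)
        self.input_gate = nn.Linear(concat, hidden_dim)
        self.candidate = nn.Linear(concat, hidden_dim)
        self.output_gate = nn.Linear(concat, hidden_dim)

    def forward(self, x_t, h_prev, c_prev):
        xh = torch.cat([x_t, h_prev], dim=-1)            # [x_t, h_{t-1}]
        f_t = torch.sigmoid(self.forget_gate(xh))        # what to keep of C_{t-1}
        i_t = torch.sigmoid(self.input_gate(xh))         # how much of the proposal to add
        c_tilde = torch.tanh(self.candidate(xh))         # proposed change C~_t
        c_t = c_prev * f_t + i_t * c_tilde               # Equation 5
        o_t = torch.sigmoid(self.output_gate(xh))        # relevance of the state to the output
        h_t = torch.tanh(c_t) * o_t                      # Equation 6
        return h_t, c_t
```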

Figure 4: LSTM and GRU layer architecture. Olah (2015)

Gated Recurrent Units

A different alteration to RNNs for the purpose of handling long term dependencies is the Gated Recurrent Unit (GRU) proposed by Cho et al. (2014). The GRU architecture changes the LSTM cell in two significant ways: it merges the cell state and hidden state, and it combines the input and forget gates.

The hidden state ht−1 passes to the next timestep after two linear operations, just as the cell state does, but it is also concatenated with the input x and passed to the networks whose outputs feed into those linear operations. Roughly speaking, rt is the reset gate, computed from the concatenated input and previous output, which determines how the new input should be combined with the previous hidden state, while zt is the update factor that determines how much of the previous state should be kept in the final addition operation that produces the next hidden state.
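For comparison with the LSTM sketch above, a single GRU step can be written out analogously, with the reset gate rt and update gate zt made explicit; again, torch.nn.GRU would normally be used instead.

```python
import torch
import torch.nn as nn

class GRUStep(nn.Module):
    """One GRU timestep: the reset gate r_t decides how the previous state
    enters the candidate, and the update gate z_t interpolates between the
    previous state and that candidate to form the next (merged) hidden state."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        concat = input_dim + hidden_dim
        self.reset_gate = nn.Linear(concat, hidden_dim)
        self.update_gate = nn.Linear(concat, hidden_dim)
        self.candidate = nn.Linear(concat, hidden_dim)

    def forward(self, x_t, h_prev):
        xh = torch.cat([x_t, h_prev], dim=-1)
        r_t = torch.sigmoid(self.reset_gate(xh))
        z_t = torch.sigmoid(self.update_gate(xh))
        h_tilde = torch.tanh(self.candidate(torch.cat([x_t, r_t * h_prev], dim=-1)))
        h_t = (1 - z_t) * h_prev + z_t * h_tilde         # merged cell/hidden state
        return h_t
```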

The difference in performance between these two types of RNN is not very noticeable on many tasks.


Figure 5: The architecture of the variational autoencoder language model of Bowman & Vilnis

In their comparative study of different RNN architectures, Greff et al. (2015) found that the standard LSTM and GRU architectures generally perform about the same on a variety of different tasks, and that the forget gate and output activation function are the most significant components of the network: removing either of these two severely diminishes the performance of the LSTM. It is hypothesized that GRUs still manage to perform well without an output activation because, where in a standard LSTM cell this activation works to prevent an unbounded cell-state from being propagated through the network, the same function is performed in a GRU by the coupling of the input and forget gates.

The most crucial network parameters were found to be the learning rate, followed by network size. The highest measured interaction between these two parameters turned out to be very small, so for the purpose of parameter tuning they can be treated as more or less independent.

2.2.3 Incorporating RNN structures into the VAE

Bowman et al. (2015) proposed a VAE architecture suited to text inputs, comparable in general functionality to existing RNNLMs. Unlike the RNNLM, however, their model proves capable of capturing global sentence features like topic and syntax. This improvement is due to the fact that the VAE model represents whole sentences in continuous latent variables. The RNNLM model, on the other hand, is based on a series of next-step predictions given slices of the preceding text as input, and defines the probability of the sentence as a whole as a function of all these individual predictions.

The input is entered into the model on a word-by-word basis and each word is represented using a learned dictionary of embedding vectors. This input is passed to an encoder in the form of a single-layer LSTM RNN, which connects to two separate linear layers that output the mean and diagonal variance matrix. After the re-parametrized sampling is executed in similar fashion to the standard VAE approach, the sample gets passed to a decoder, again consisting of a single-layer LSTM RNN.

During the training of the model, the correct word inputs get passed to the decoder at each timestep of the sequence in order to teach the model the true relations between words at different timesteps and avoid false predictions from being propagated down-sequence. During evaluation only a start-of-sequence token is passed at the first timestep and the sequence is generated based on previously predicted words.
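A condensed sketch of such a sentence VAE is given below: embed the words, encode the sequence, map the final state to µ and log-variance, sample z, and decode with teacher-forced inputs. The GRU layers and the layer sizes are illustrative assumptions, not the exact Bowman et al. configuration.

```python
import torch
import torch.nn as nn

class SentenceVAE(nn.Module):
    """Condensed sentence VAE sketch: embed the word sequence, encode it with a
    recurrent layer, map the final state to mu/logvar, sample z, and decode
    with another recurrent layer whose initial state comes from z. During
    training the ground-truth words are fed to the decoder (teacher forcing)."""

    def __init__(self, vocab_size, emb=300, hidden=256, latent=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.to_hidden = nn.Linear(latent, hidden)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.to_vocab = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids, decoder_input_ids):
        _, h_enc = self.encoder(self.embed(input_ids))           # final encoder state
        mu, logvar = self.to_mu(h_enc[-1]), self.to_logvar(h_enc[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparametrized sample
        h_dec = self.to_hidden(z).unsqueeze(0)                   # decoder state from z
        out, _ = self.decoder(self.embed(decoder_input_ids), h_dec)
        return self.to_vocab(out), mu, logvar                    # next-word logits per timestep
```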

This model structure is identical in its high-level architecture to the Variational Recurrent Autoencoder (VRAE) of Fabius and Amersfoort (2014) that models musical sequences. Unlike similar RNN based approaches, these two models do not represent the latent space per time-step and so, unlike their competitors and the RNNLM model, are capable of encoding complex global features into a single latent space variable. So far this extension of the VAE algorithm seems limited to the vector encoding of word sequences and the use of recurrent layers for the encoders and decoders. In order to make the model function both as a language model and as a generative model with a sensibly ordered latent space, specific changes need to be made to the model parameters so that the balance of these two goals is maintained.

The model runs into problems when it optimizes strictly according to the variational lower bound objective. This objective can be split into two terms: the data likelihood under the conditional probability Pφ(X|Z) (defined as the reconstruction loss or cross-entropy between input and output sequence) and the KL divergence of the conditional probability Qθ(z|X) from the marginal P(Z). Ideally the model would pay some KL cost to encode useful sentence information into z in exchange for a small cross-entropy term. Unfortunately the language model VAE tends to set the distribution Qθ(z|X) equal to the standard normal distribution P(Z) early on during training and finish training with a relatively high reconstruction loss. This essentially leaves the encoded variable z without any encoded information and simply passes the words of the sequence to the LSTM layer one at a time, like the regular RNNLM model.

This problematic tendency can be more or less resolved with two separate and complementary techniques. The first is KL cost annealing: the KL term is weighted by a variable that starts at 0 and shifts to 1 over the course of a certain number of epochs (with the rate of change dependent on the type of function used for annealing).

The decoder initially gets passed a highly information-rich variable z from the sampler; over time the KL constraint increases and most of this information will no longer be encoded into the latent space. Given that the decoder has the same structure and expressiveness as the RNNLM, only the information the RNNLM is incapable of predicting based on the previously predicted words in the sentence will depend on the sampling of z. The other information that might be encoded into the latent variable z will disappear, as it would only needlessly increase the KL-loss term to encode more information than is necessary with an RNNLM decoder. The rate of change of the KL-loss weight per epoch is a parameter that is available for tuning in the model, as is the type of function that determines the curve the weight shifts along.

The second technique is word dropout in the decoder during training. The decoder tends to become overly reliant on the ground-truth of the previous word, and it is beneficial to encourage the decoder to make use of the information encoded in the latent variable z it is conditioned on. This can be achieved by randomly setting some of these word inputs to the word token <UNK> with some probability between 0 and 1 (as defined by a tunable parameter). In the most extreme case, none of the ground-truth words will be passed to the decoder, which at each time-step is then only aware of its current place in the sentence and the whole-sentence representation z. While this is advantageous from the perspective of making use of the latent encoding, it also causes the KL-loss to greatly increase, as the model now needs to encode far more into the latent space in order to keep the reconstruction error reasonably low.
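Both techniques are small additions in code. The sketch below shows a logistic annealing schedule for the KL weight and a word dropout function that replaces decoder inputs with <UNK>; the parameter values and token indices are illustrative assumptions, not the thesis configuration.

```python
import math
import torch

def kl_anneal_weight(step, k=0.0025, x0=2500):
    """Logistic annealing schedule for the KL weight: starts near 0 and
    saturates at 1. The steepness k and midpoint x0 are tunable; the values
    here are only illustrative."""
    return 1.0 / (1.0 + math.exp(-k * (step - x0)))

def word_dropout(decoder_input_ids, unk_idx, rate=0.5, pad_idx=0, sos_idx=1):
    """Randomly replace ground-truth decoder inputs with <UNK> so the decoder
    has to rely on the latent variable z rather than the previous word."""
    ids = decoder_input_ids.clone()
    drop = torch.rand_like(ids, dtype=torch.float) < rate
    keep = (ids == pad_idx) | (ids == sos_idx)      # never drop padding or <SOS>
    ids[drop & ~keep] = unk_idx
    return ids
```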

2.2.4 Results of the Sequence VAE

The VAE model is a significant improvement over the RNNLM model when it comes to the task of inferring missing words in a given sequence of known words (imputation). Unless the missing words were all at the end of a sentence, the RNNLM model would need to sample from the posterior over the missing words, doing a full step of Gibbs sampling per entry in the vocabulary and executing a large number of sampling steps to transmit this information to known variables downstream in the sequence. The VAE model does not have this problem of transmitting information downstream, because of the latent variable passed to the RNN network. While the optimal word choice is still an intractably hard problem due to the Gibbs sampling, the full-sentence representation does allow for easier passing of information through missing variables down the sequence. The amount of information retained by the latent variable also appears to depend heavily on the dropout-rate during the training of the decoder. A higher dropout rate necessitates the retention of word-level information in the latent space, even though this additional information might start to strain the KL-divergence constraints on the latent space.

Even using greedy algorithms to decode the latent space samples, the model appears to have a rich encoding of global features, resulting for the most part in grammatically correct decoded sentences, which retain their general topic.

Further analysis of the variational model shows that the dropout-rate strongly affects the distribution of the total loss over the KL- and reconstruction-loss terms. Higher dropout rates cause the KL loss to remain far above 0, but ensure that the reconstruction error goes down.

This model also allows the latent space itself to be explored in new detail, by sampling from the posterior distribution p(z|x). Because the VAE encoder outputs a distribution rather than a specific encoding, after training the latent space has similar sentence encodings grouped together. Sampling these points and interpolating between them gives some interesting results once the latent samples have passed through the decoder model. The decodings of intermediate points between two samples appear to be grammatically correct, and shift from the sentence structure and topic of one sample to those of the other the closer you get to the other sample in the latent space (see Table 1).


he was silent for a long moment .
he was silent for a moment .
it was quiet for a moment .
it was dark and cold .
there was a pause .
it was my turn .

Table 1: Decoded interpolated latent variables from Bowman et al.'s VAE model

i went to the store to buy some groceries .
i store to buy some groceries .
i were to buy any groceries .
horses are to buy any groceries .
horses the favorite any animal .
horses the favorite favorite animal .
horses are my favorite animal .

Table 2: Decoded interpolated latent variables from a conventional autoencoder model (Bowman et al., 2015)

This stands in stark contrast to the interpolation between points decoded by the conventional RNN autoencoder network (see Table 2), which breaks down into ungrammatical structures. All this shows that the VAE learns representations that are smooth and fill up the available space, and that by sampling points from a distribution the latent space now has meaningful subspaces within it. When moving between two points in the latent space, each intervening point still decodes to a grammatical sentence, and the structure and information decoded shift gradually over the distance traversed.

2.3 Joint Latent Space Approaches

The standard VAE approach can successfully capture the high-level features of the data in its latent space representation. While that approach lends itself well to generative modeling of data within one domain, it is too limited a model architecture for learning latent encodings that are well-suited to decoding into a parallel domain, even though the latent encoding of the input domain likely does contain some features that are useful for the output domain. The model can, however, easily be expanded in order to create a generative model across different domains. To accomplish this the decoder can either be passed additional inputs alongside the latent variable, or the latent variable can be constrained in some way to ensure it has useful dependencies with the data manifold of another domain.

2.3.1 Conditional Variational Autoencoder

The Conditional Variational Autoencoder (CVAE) introduced by Sohn et al. (2015) has as its main innovation the concatenation of the label values y to the decoder model inputs and to the latent representations. This does away with the need for a latent variable to be fully representative of the data and reinterprets the latent space as a representation of more abstract features that cannot be represented by the labels of the data; the decoder model is now applied to the concatenation of the label and this latent representation of more abstract features. The formulation of the lower bound forces the latent representation to be as close as possible to the standard normal distribution, while also reconstructing the data. So over the course of the model's training, much of the basic mapping to the data manifold will be handled by the decoder model. The encoding to the latent space will primarily be concerned with representing higher-level abstract qualities of the data. When these qualities are passed to the decoder, it ensures that they are taken into account and the data is decoded correctly.

The decoding of realistic seeming data can be done regardless of the type of latent variable taken as input. This becomes evident when we observe the decoding of randomly sampled latent variables.


Figure 6: Conditional VAE

Figure 7: Bilingual VAE (BiVAE)

This principle of reintroducing inputs is fundamentally different from the added word-inputs in the VAE language model from Bowman et al. (2015). There the reintroduction of words functions as an aid during training to avoid fitting incorrect word sequences, as the time-distributed word encodings are passed to the decoder model alongside the latent representation of the sentence. This is done in order to generate the correct next-word probabilities at each time-step and help the model ignore grammatically incorrect sequences during training. In the CVAE the added input y is present during both training and evaluation and is used as an additional input for both the encoder and decoder, making both the latent and data distribution dependent on it.

This model has some advantages when it comes to generating data. Changing some of the attributes y makes it possible to generate similar images with distinct chosen differences. Because the general information of the image is maintained in the latent variable z ∼ pθ(z|x,y), the changes in y can represent specific alterations within that image (e.g. from smiling to frowning, man to woman).

2.3.2 Bilingual Word Embeddings

Parallel natural language corpora are a good example of the type of task a cross-domain VAE would be useful for. A latent representation can be shared across the two different languages, because the two decoders act as language models and interpret the high-level features of the latent variable within the context of each individual language.

The Bilingual VAE architecture (Wei and Deng, 2017) can also be used to model a joint continuous latent space p(z|x,y) from which the data in both languages (or domains) are generated: p(x,y|z). This assumption also makes it possible to easily map the data from one domain to another by encoding it to the latent space p(z|x) and decoding it into the other domain p(y|z). Though the focus of that paper is the encoding of language data to a latent space from different language inputs rather than generating joint data, it does demonstrate that a sentence can be mapped to a latent space alongside data from a parallel domain.

The Bilingual model will not be used in the cross-domain experiments, since it only maps between two different types of language data. The model is mainly useful as an indicator that joint latent representations of sentences can be used for encoding high-level attributes and are of interest for cross-domain decodings.

2.3.3 Joint multimodal autoencoder

Suzuki et al. (2016) proposed a joint multi-modal VAE (JMVAE) that can exchange multiple modalities bidirectionally. This is an extension of the VAE model of Kingma & Welling, which uses conditional distributions and therefore cannot generate data bidirectionally. This is to say that if a network has been trained to learn the conditional probability of an annotation given an image, it cannot reverse the model and generate an image based on an annotation.

In order to add this desired flexibility to the model, Suzuki et al. implemented a joint latent distribution under which all modalities are treated equally. This architecture means that the latent distribution z is now dependent on inputs from the two modalities x and w, annotations and words respectively. This is similar to the approach taken to generate bilingual word embeddings by Wei & Deng, but it has difficulties generating data bidirectionally, namely when one of the two input modalities is unknown to the model.

Figure 8: JMVAE

Figure 9: JMVAE-kl

In the most naive version of their joint model, the encoder was trained on joint inputs as well as missing inputs, where the missing modality had all its input values set to 0 or random noise. Following the iterative sampling procedure by Markov chain suggested by Rezende et al. (2014), it is then possible to complement the missing input value with a transition matrix.

\log P(x, w) \geq \mathcal{L}_{JMVAE} = -D_{KL}(Q_\theta(z|x, w) \,\|\, p(z)) + E_{Q_\theta(z|x,w)}[\log P_{\phi_x}(x|z)] + E_{Q_\theta(z|x,w)}[\log P_{\phi_w}(w|z)] \quad (7)

However, if the missing input has a high dimensionality (e.g. NLP sequences), this leaves the decoder unable to reconstruct the inputs from the latent space sample. To address this shortcoming Suzuki et al. suggested the JMVAE-h and JMVAE-kl variants of their joint multi-modal model.

JMVAE-kl trains, alongside the joint model, a separate VAE encoder for each of the two input domains. The loss function is extended with constraints that ensure the latent representations of the three encoders are grouped close together. Because of this similarity, the encoding of a single domain can pass through the two decoders of the joint model and reconstruct the input as well as the missing domain representation.

\mathcal{L}_{JMVAE\text{-}kl}(x, w) = \mathcal{L}_{JMVAE}(x, w) - \big( D_{KL}(q_\theta(z|x, w) \,\|\, q_\theta(z|x)) + D_{KL}(q_\theta(z|x, w) \,\|\, q_\theta(z|w)) \big) \quad (8)

The loss function is extended by adding additional KL-terms between the joint latent distribution and the two latent distributions of the individual modalities. This in effect forces the three latent spaces to remain close to one another, allowing the decoder of one modality to decode the encoded latent space from another modality. In this way the joint distribution has become less dependent on known inputs in high dimensional modalities and can reliably encode a joint latent distribution with inputs from a single domain.
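Assuming all three encoders output diagonal Gaussians, the extra terms of Equation 8 reduce to closed-form KL divergences between Gaussians, as in the sketch below. The reconstruction losses and encoder outputs passed in are hypothetical placeholders, not the interface of an actual implementation.

```python
import torch

def kl_between_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL between two diagonal Gaussians, D[N(mu_q, s_q) || N(mu_p, s_p)]."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p).pow(2)) / logvar_p.exp()
        - 1.0,
        dim=-1,
    )

def jmvae_kl_loss(joint_stats, x_stats, w_stats, recon_x, recon_w, kl_to_prior):
    """Negative JMVAE-kl objective (Equation 8). Each *_stats argument is a
    hypothetical (mu, logvar) pair from the joint encoder q(z|x,w) and the two
    unimodal encoders q(z|x) and q(z|w); recon_x / recon_w are the (summed)
    reconstruction losses of the two decoders from the joint sample, and
    kl_to_prior is D[q(z|x,w) || p(z)]."""
    mu_j, logvar_j = joint_stats
    pull_x = kl_between_gaussians(mu_j, logvar_j, *x_stats)   # D[q(z|x,w) || q(z|x)]
    pull_w = kl_between_gaussians(mu_j, logvar_j, *w_stats)   # D[q(z|x,w) || q(z|w)]
    neg_elbo = recon_x + recon_w + kl_to_prior                # -L_JMVAE
    return neg_elbo + (pull_x + pull_w).sum()                 # -L_JMVAE-kl
```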

3 Experimental Setup

The VAE-based models incorporating a joint latent space are now applied to datasets of annotated text and labeled images in order to examine the efficacy of encoding and decoding the two different modalities to and from their joint latent representation. Of particular interest is the accuracy of the model on cross-domain encoding and decoding. The points of comparison between the models will be:

• The reconstruction loss of the data used in the model, which corresponds to the ability of the model to encode and recreate the input data.

• The KL-divergence of the latent space encoded by each model, which indicates how efficiently the two combined inputs can be represented in a shared latent space.


• The quality of the decodings from the models’ sampled latent variables.

• The smoothness of the interpolations between two latent values, which indicates the degree to which the latent space is made up of connected regions that encode meaning and structure of the underlying data based on the specific location in the space.

For the experiments I have applied the standard, conditional and joint multi-modal VAE models to an annotated text dataset in order to model data from each domain on its own as well as in conjunction with data from parallel domains. The dataset used for annotated text is the end-to-end natural language dataset introduced by Novikova et al. (2017) for the purpose of data-driven text generation in the restaurant domain. It contains both selected elements from a list of 79 possible restaurant attributes and corresponding text descriptions.

The attribute sequence is represented as a binary vector of sufficient length to represent all encountered attributes by index. The 79 total attributes are made up of a flattened vector of the various possible values that 8 different restaurant properties can have. This input is an n-hot encoding of categorical inputs, where several individual attributes are represented in the same vector by setting the values of the indices corresponding to the observed attributes to 1 and all the other values to 0. The models have been trained on the training set, and parameter tuning has been done on a separate validation set, which is always split off from the training set before experimentation. A separate data file served as the hold-out data from which the model generates the final test results. The training set contains 37389 datapoints. The validation and test sets both contain 4672 datapoints.
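The n-hot encoding itself is straightforward; the attribute names and index mapping in the sketch below are illustrative and do not correspond to the actual E2E attribute vocabulary.

```python
import torch

def n_hot_encode(attributes, attribute_to_index, num_attributes=79):
    """Encode a set of observed attribute values as an n-hot binary vector:
    each observed attribute sets the component at its index to 1."""
    vec = torch.zeros(num_attributes)
    for attr in attributes:
        vec[attribute_to_index[attr]] = 1.0
    return vec

# Illustrative example; the real index mapping covers all 79 flattened
# attribute values of the E2E dataset.
attribute_to_index = {"food=Italian": 0, "priceRange=cheap": 1, "area=riverside": 2}
print(n_hot_encode(["food=Italian", "area=riverside"], attribute_to_index, 3))
# tensor([1., 0., 1.])
```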

The experiments on labeled images are performed on the MNIST dataset, where the pixel values and labels are interpreted as the two domains. These experiments serve as a point of comparison for the results of the different VAE models.

All final experiments were run on GPU nodes of the Distributed ASCI Supercomputer 5 (DAS-5) (Bal et al., 2016).

3.1 Standard VAE

The standard VAE model encodes the input data to the latent space through a given number of neural network layers and, per input prediction, decodes a single sample from the latent space back through its decoder layers. Depending on the type of input data, different types of layers will be chosen to produce the optimal results in encoding and decoding. In the case of image inputs this choice will be between dense and convolutional layers, and with natural language inputs between various types of RNN networks (regular, LSTM, GRU).

While the standard VAE model is only applicable to single domain datasets, its results on the different datasets still prove useful as a baseline for the results of the other model extensions of its architecture. In addition it provides insight into the structure of the data itself, given that this is the only model where the output is entirely dependent on the latent representation of one domain, without the confounding factors of parallel latent representations or decoders connected to the input space. Not only the accuracy of the decodings, but also the shape of the latent space itself gives insight into the degree to which higher order context information can be captured under different model parameters. The VAE model is an implementation of the Bowman VAE model; the PyTorch implementation of this standard model was found on github.com/timbmg and used in these experiments with the text data extracted from the end-to-end natural language dataset. This model serves as a baseline against which the other variations can be compared. Its language networks do diverge from the Bowman model, however, as for ease of implementation this model has been built with GRU units rather than LSTMs for its recurrent network layers.
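The sketch below gives a compressed, illustrative view of such a GRU-based sentence VAE. It follows the baseline layer sizes listed below, but it is not the exact timbmg implementation: details such as word dropout on the decoder inputs, padding and the output softmax are omitted.

```python
import torch
import torch.nn as nn

class SentenceVAE(nn.Module):
    """Minimal GRU-based sentence VAE sketch (encoder -> diagonal Gaussian -> decoder)."""

    def __init__(self, vocab_size, embed_size=300, hidden_size=256, latent_size=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.encoder_rnn = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.hidden_to_mu = nn.Linear(hidden_size, latent_size)
        self.hidden_to_logvar = nn.Linear(hidden_size, latent_size)
        self.latent_to_hidden = nn.Linear(latent_size, hidden_size)
        self.decoder_rnn = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.outputs_to_vocab = nn.Linear(hidden_size, vocab_size)

    def encode(self, tokens):
        embedded = self.embedding(tokens)
        _, h = self.encoder_rnn(embedded)           # h: (1, batch, hidden)
        h = h.squeeze(0)
        return self.hidden_to_mu(h), self.hidden_to_logvar(h)

    def reparameterize(self, mu, logvar):
        std = (0.5 * logvar).exp()
        return mu + std * torch.randn_like(std)     # reparameterization trick

    def decode(self, z, tokens):
        h0 = self.latent_to_hidden(z).unsqueeze(0)  # initial decoder state from z
        embedded = self.embedding(tokens)           # teacher forcing on the input tokens
        out, _ = self.decoder_rnn(embedded, h0)
        return self.outputs_to_vocab(out)           # logits over the vocabulary per time step

    def forward(self, tokens):
        mu, logvar = self.encode(tokens)
        z = self.reparameterize(mu, logvar)
        return self.decode(z, tokens), mu, logvar
```

In the actual training setup, word dropout is additionally applied to the decoder inputs so that the decoder cannot rely purely on its autoregressive context.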

The model parameters are as follows: the optimizer used was Adam (Kingma and Ba, 2014) and the lower bound function used (as derived in Equation 2 and Appendix A) is the following:

$$\log P(X) \geq \mathbb{E}_{Q(z|X)}\left[\log P(X|z)\right] - D_{KL}\left(Q(z|X)\,\|\,P(z)\right)$$

The VAE language model was trained under a set of parameters that will be considered the baseline for the experiments and are used throughout unless specified otherwise. The annealing of the KL weight follows a logistic function that reaches its final weight of 1.0 after 4 epochs (see Figure 10). The remaining parameters are: RNN type: GRU; learning rate: 0.001; batch size: 32; latent-layer size: 16; embedding-layer size: 300; hidden-layer size: 256; word-dropout rate: 0.5; maximum sequence length: 60.
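The logistic annealing schedule can be sketched as follows; the steepness constant k and midpoint x0 are illustrative values and would in practice be tuned so that the weight saturates after roughly 4 epochs of minibatches.

```python
import math

def kl_weight(step, k=0.0025, x0=2500):
    """Logistic annealing of the KL weight from ~0 toward 1.0 as the training step increases.

    k (steepness) and x0 (midpoint step) are illustrative; they are chosen so that the
    weight reaches roughly 1.0 after about 4 epochs of minibatches.
    """
    return 1.0 / (1.0 + math.exp(-k * (step - x0)))

# total_loss = nll + kl_weight(global_step) * kl_divergence
```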


[Figure 10: KL weight annealing schedule]

3.2 Conditional Variational Autoencoder

The CVAE model is a relatively simple adaptation of the VAE framework and can easily be implemented by concatenating the label inputs to both the input and latent variables. This concatenation makes the output distributions of the encoder and decoder networks dependent on the attribute data. This extension of the model changes the definition of the loss function lower bound as follows:

$$\log P(X) \geq \mathbb{E}_{Q(z|X,Y)}\left[\log P(X|z,Y)\right] - D_{KL}\left(Q(z|X,Y)\,\|\,P(z|Y)\right)$$

The two components still come down to a reconstruction loss term and a KL divergence on the latent space distribution. Though there are additional inputs to this model, the parameters themselves remain the same as those in the VAE network.
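The conditioning itself amounts to two concatenations, sketched below. The helper is illustrative: it assumes that the encoder layers and the latent-to-hidden projection are sized to accept the enlarged inputs.

```python
import torch

def cvae_condition(x_embedded, z, y):
    """Concatenate the conditioning vector y (here the n-hot attribute vector) to both the
    embedded input sequence and the latent sample, as described above."""
    # Encoder side: append y to the embedding of every time step.
    y_per_step = y.unsqueeze(1).expand(-1, x_embedded.size(1), -1)
    encoder_input = torch.cat([x_embedded, y_per_step], dim=-1)
    # Decoder side: append y to the latent sample before projecting to the initial hidden state.
    conditioned_latent = torch.cat([z, y], dim=-1)
    return encoder_input, conditioned_latent
```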

After training the CVAE model it will be of interest to see the degree to which it improves over the standard VAE network (in terms of the reconstruction loss as well as the quality of decodings and interpolations). It also serves as a further point of comparison for the JMVAE model and will indicate, based on the change in sentence reconstructions, how well the information in the attributes is retained when the label inputs are used to generate the latent variable rather than being concatenated to the latent variable outright.

The CVAE model will be tested with both the attribute and the sentence data taken as the input (X), in each case conditioned on the data from the other modality (Y).

The Sentence CVAE model was trained with the baseline parameters and differs from the VAE model only in that, because the attribute vector is used for the concatenations, the attribute data must be added as an additional input to the model.

The Attribute CVAE model was trained with a lower learning rate (0.0001) than the baseline in order to avoid the vanishing gradient problem, and uses Dense layers with a sigmoid activation instead of the RNN layers used for sentences, since here the encoder and decoder operate on attribute vectors. All other parameters are identical to the baseline. The sentence embedding is concatenated to both the input and the latent variables, but with a dropout layer applied before concatenating to the latent variable, in order to prevent the model from overfitting.
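The dropout-before-concatenation step can be sketched as follows; the dropout probability shown is an illustrative value, not necessarily the one used in the experiments.

```python
import torch
import torch.nn as nn

sentence_dropout = nn.Dropout(p=0.5)  # illustrative dropout probability

def attribute_decoder_input(z, sentence_embedding):
    """Concatenate the latent sample with a dropped-out sentence embedding, so that the
    attribute decoder cannot come to rely entirely on the sentence representation."""
    return torch.cat([z, sentence_dropout(sentence_embedding)], dim=-1)
```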

3.3 JMVAE

In the JMVAE model the two single-domain inputs are encoded to a latent space alongside the joint-domain encoder. The model is trained by minimizing each of the reconstruction and KL losses, with additional KL terms added that penalize divergence between the three latent distributions of the parallel encoders. In the case of the attribute encoder and decoder, the RNN layer has been replaced by a Dense layer. Of particular interest here are the gains made in the reconstruction of data entries when decoding from a joint encoding, compared to the decodings of single-domain encodings. The difference in quality between the two domains will also be of interest, as the attributes are far more sparse than the data they describe, and their latent space encodings and decodings will show what limitations the model has.

$$\log P(X,W) \geq \mathbb{E}_{Q(z|X,W)}\left[\log P(X|z)\right] + \mathbb{E}_{Q(z|X,W)}\left[\log P(W|z)\right] - D_{KL}\left(Q(z|X,W)\,\|\,P(z)\right) - D_{KL}\left(Q(z|X,W)\,\|\,Q(z|X)\right) - D_{KL}\left(Q(z|X,W)\,\|\,Q(z|W)\right)$$

During training only the KL losses of the three encoders and the reconstruction losses of the decodings from the joint encoding are tracked. After training, the three different encoders encode the entire test set and pass their latent variables to the decoders of the two domains. The quality of the decodings from the three encoders is then compared by looking at their reconstruction loss values and, in the case of the attribute decodings, the accuracy of their predictions.
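A minimal sketch of this training objective is given below, using PyTorch distributions for the closed-form KL terms. The function signature, the summed reductions, and the use of a single annealed weight for all KL terms are assumptions; padding tokens in the sentence targets are not masked here.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def jmvae_loss(sent_logits, sent_targets, attr_logits, attr_targets,
               mu_joint, logvar_joint, mu_x, logvar_x, mu_w, logvar_w,
               kl_weight=1.0):
    """Sketch of the JMVAE objective: two reconstruction terms, the prior KL on the joint
    posterior, and the two KL terms that pull the unimodal posteriors toward the joint one."""
    q_joint = Normal(mu_joint, (0.5 * logvar_joint).exp())
    q_x = Normal(mu_x, (0.5 * logvar_x).exp())
    q_w = Normal(mu_w, (0.5 * logvar_w).exp())
    prior = Normal(torch.zeros_like(mu_joint), torch.ones_like(mu_joint))

    # Reconstruction terms: NLL over sentence tokens and BCE over the n-hot attribute vector.
    nll = F.cross_entropy(sent_logits.flatten(0, 1), sent_targets.flatten(), reduction="sum")
    bce = F.binary_cross_entropy_with_logits(attr_logits, attr_targets, reduction="sum")

    # KL terms, summed over latent dimensions and the batch.
    kl_prior = kl_divergence(q_joint, prior).sum()
    kl_x = kl_divergence(q_joint, q_x).sum()
    kl_w = kl_divergence(q_joint, q_w).sum()

    return nll + bce + kl_weight * (kl_prior + kl_x + kl_w)
```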

4 Results

4.1 VAE

[Figure 11: Validation loss values of baseline VAE model; panels: (a) Total Loss, (b) Negative Log Likelihood, (c) Negative Log Likelihood per word, (d) KL joint distribution]

Under the baseline parameters the model converged after about 40 epochs (see Figure 11), and when the total loss is decomposed into its separate components (see Figure 11) it is clear that the Negative Log Likelihood makes up the largest part of it. The KL divergence on the joint distribution converges sharply when the KL weight increases exponentially and remains close to 0 for the rest of the run. The average per batch reconstruction loss on the holdout set after 100 epochs of training was 36.44. When this loss is averaged over the number of words in the batch, the per word loss is 1.61. Qualitatively this score translates to a language model that allows for the decoding of grammatical sentences from any of the interpolated points between two samples (see Table 3; a minimal interpolation sketch follows the table).

4.2 CVAE

4.2.1 Sentence CVAE

The Sentence CVAE model, when trained under the baseline parameters, behaves much like the VAE model, but with an even larger fraction of the loss made up of the Negative Log Likelihood (see Figure 12).


• the phoenix is a restaurant located in the city centre .
◦ the phoenix is a restaurant located in the city centre .
◦ the phoenix is a restaurant located in the city centre .
◦ the phoenix is a restaurant that serves indian food in the riverside area . it has a high customer rating and a price range of £ 20-25 .
◦ the phoenix is a restaurant that serves indian food in the riverside area . it has a high customer rating and a price range of £ 20-25 .
◦ the phoenix is a restaurant providing take-away deliveries in the low price range . it is located in the city centre .
◦ the phoenix is a restaurant providing take-away deliveries in the low price range . it is located in the city centre .
◦ the phoenix is a restaurant providing take-away deliveries in the low price range . it is located in the city centre .
◦ the phoenix is a restaurant providing take-away deliveries in the low price range . it is located in the city centre .
• the phoenix is a restaurant providing take-away deliveries in the low price range . it is located in the city centre .

Table 3: Interpolation of latent space samples through VAE decoder
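The interpolations in Table 3 can be produced by encoding two sentences, walking linearly between their latent means, and greedily decoding each intermediate point. The sketch below reuses the attribute names of the VAE sketch in Section 3.1, assumes a batch size of one, and uses illustrative special-token indices.

```python
import torch

@torch.no_grad()
def interpolate(model, tokens_a, tokens_b, steps=8, max_len=60, sos_idx=1, eos_idx=2):
    """Linearly interpolate between the latent means of two sentences and greedily decode
    each intermediate point with the sentence decoder."""
    z_a, _ = model.encode(tokens_a)   # interpolate between the posterior means
    z_b, _ = model.encode(tokens_b)
    decoded = []
    for i in range(steps + 1):
        alpha = i / steps
        z = (1 - alpha) * z_a + alpha * z_b
        # Greedy decoding: feed back the argmax token at every step.
        h = model.latent_to_hidden(z).unsqueeze(0)
        token = torch.full((1, 1), sos_idx, dtype=torch.long)
        sentence = []
        for _ in range(max_len):
            emb = model.embedding(token)
            out, h = model.decoder_rnn(emb, h)
            token = model.outputs_to_vocab(out).argmax(dim=-1)
            if token.item() == eos_idx:
                break
            sentence.append(token.item())
        decoded.append(sentence)      # token indices; mapping back to words is omitted
    return decoded
```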

The concatenation of the attribute data to the input and the latent space has evidently made it possible for the decoder to extract the necessary information for a correct decoding from the labeled sample. This causes the KL loss term to converge even faster than in the VAE model.

[Figure 12: Validation loss values of sentence CVAE model; panels: (a) Total Loss, (b) Negative Log Likelihood, (c) Negative Log Likelihood per word, (d) KL joint distribution]

The loss of the trained model on the hold-out set shows an improvement over the VAE model, after 100 epochs reaching a per batch average negative log likelihood of 29.83 and a per word negative log likelihood of 1.31.


Sampling random latent values (given a chosen attribute vector) results in a variety of decoded sentences closely tied to the input attributes (see Table 4; a minimal sampling sketch follows the table).

• customer rating[low] eatType[coffee shop] familyFriendly[yes] food[English] name[Cocum] priceRange[less than £20]
◦ there is a family friendly coffee shop called cocum . it serves english food and has a low customer rating . it has a price range of less than £ 20 .
◦ cocum is a low price coffee shop that serves breakfast food . it is family friendly and has a low customer rating .
◦ cocum is a low price coffee shop that is family friendly .
◦ cocum is a low cost coffee shop that is family friendly and serves breakfast .
◦ cocum is a low price coffee shop that is family friendly and has a customer rating of 1 out of 5 .
◦ cocum is a low cost coffee shop that is family friendly .
◦ cocum is a low priced , family friendly coffee shop with a low customer rating .
◦ cocum is a low priced , family friendly coffee shop with a low customer rating .
◦ cocum is a low priced , family friendly coffee shop with a low customer rating .

Table 4: CVAE samples of latent space with a set attribute input (chosen attributes in bold)
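Such samples can be obtained by fixing the attribute vector, drawing latent samples from the standard normal prior, and decoding each concatenated [z, y] vector. The sketch below is illustrative and assumes a decoder callable that accepts this concatenation.

```python
import torch

@torch.no_grad()
def sample_conditioned(decode_fn, attribute_vec, latent_size=16, n_samples=5):
    """Decode several sentences for one fixed attribute vector by sampling latent values
    from the standard normal prior. `decode_fn` is a hypothetical callable that maps the
    concatenated [z, y] vector to a decoded sentence."""
    y = attribute_vec.unsqueeze(0)        # shape (1, n_attributes)
    samples = []
    for _ in range(n_samples):
        z = torch.randn(1, latent_size)   # sample from the prior p(z) = N(0, I)
        samples.append(decode_fn(torch.cat([z, y], dim=-1)))
    return samples
```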

4.2.2 Attribute CVAE

The attribute CVAE reconstructs a binary vector of attributes and so uses the binary cross entropy function as its reconstruction loss. The loss values (see Figure 13) indicate that the majority of the loss is again made up of the reconstruction loss, but the KL loss is considerably more peaky than in the sentence CVAE. The average KL loss nevertheless remains close to 0 throughout the run. After 100 epochs the average Binary Cross Entropy over the dataset is 3.61 and the average KL loss is 0.034.

[Figure 13: Validation loss values of the attribute CVAE model; panels: (a) Total Loss, (b) Binary Cross Entropy, (c) KL joint distribution]

[Figure 14: Validation loss values of baseline JMVAE model; panels: (a) Total Loss, (b) Negative Log Likelihood, (c) Negative Log Likelihood per word, (d) Binary Cross-entropy]

When applying this model to the holdout dataset and comparing the attribute decodings to the true values, the performance can be measured in terms of accuracy, false positives and false negatives (see Table 6). Accuracy is the percentage of decoded binary vector values that match the true values. A false positive is a decoded value that is positive where the true label is negative, and a false negative is a decoded value that is negative where the true label is positive.
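These metrics can be computed directly from the thresholded decoder outputs, as in the sketch below; the 0.5 threshold on the sigmoid outputs is an assumption.

```python
import torch

def attribute_metrics(probs, targets, threshold=0.5):
    """Accuracy and average false positives / negatives per example for n-hot attribute decodings.

    `probs` are decoder outputs after the sigmoid, `targets` the true 0/1 attribute vectors."""
    preds = (probs >= threshold).float()
    accuracy = (preds == targets).float().mean().item()
    false_pos = ((preds == 1) & (targets == 0)).float().sum(dim=1).mean().item()
    false_neg = ((preds == 0) & (targets == 1)).float().sum(dim=1).mean().item()
    return accuracy, false_pos, false_neg
```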

4.3 JMVAE

Though alternative learning rates and layer sizes were tried, the base parameters of the VAE model proved most effective for the JMVAE model as well. Under these parameters the JMVAE model appears to converge after about 20-30 epochs (see Figure 14a). When the total loss is decomposed into its separate components (see Figure 14), by far the largest contributor to the total loss is the negative log-likelihood loss of the sentence reconstruction. It is largely responsible for the high per batch variance in the loss and the time it takes to converge. The binary cross-entropy loss by comparison is much smaller and takes longer to converge.

All of the KL-losses (see Figure15) converge as soon as the KL-weight has reached the annealed weight 1.0 and remain as constant constraints on the three types of encoders. It must be noted that the KL divergence on the joint latent space is much higher than it was in the VAE and CVAE models. The representation clearly does not express the information required by the two separate domains efficiently. The two domains appear to have little overlap in the values their decoders interpret from their latent space representation.

The per batch BCE of the JMVAE model after 100 epochs is 1.60. The average per batch NLL loss is 29.17 and the per word NLL loss is 1.28, a small decrease in NLL loss compared to the CVAE models. Based on informal analysis of the results, the sentence decodings of the different models do not appear to be noticeably affected by this change in performance (see Tables 3, 4 & 5).

When sampling from the latent space and decoding with the JMVAE's sentence decoder, the decodings (see Table 5) show a smoothness of transition similar to Bowman's original experiments: the language model reliably outputs grammatical sentences with smooth transitions between the interpolated samples.


[Figure 15: KL-values of baseline JMVAE model; panels: (a) KL joint distribution, (b) KL sentence distribution, (c) KL attribute distribution]

When looking at the outputs of the three separate encoders, there is a marked difference in the ability of the decoders to produce the correct outputs, in particular when it comes to the decoded attributes (see Table 6). The same metrics were used as in the evaluation of the Attribute CVAE model.

The average attribute decoding is more likely to contain a false positive than a false negative. In practice this means that the decoder is more likely to invent attributes that were not in the input than to miss attributes that were present (see Tables 7, 8 & 9).

The average sentence decoding from the attribute encoder has a per word NLL loss of 1.29 and, based on an informal comparison, appears to be similar in quality to the sentences decoded from the joint encoding.

• the twenty two is a family friendly restaurant located near the rice boat .
◦ the twenty two is located near the rice boat . it is a family friendly restaurant with a low customer rating .
◦ the twenty two is located near the rice boat . it is a family friendly restaurant that serves french food .
◦ the twenty two , a family friendly restaurant , is located near the rice boat .
◦ the twenty two , a family friendly restaurant , is located near the rice boat .
◦ the twenty two , a family friendly restaurant , is located near the rice boat .
◦ the green man , a chinese restaurant , is located in riverside , near all bar one , and is family friendly .
◦ the chinese restaurant , the green man , is located in riverside near all bar one .
◦ the chinese restaurant , the green man , is located in riverside near all bar one .
• the chinese pub , the mill , is located in the riverside area .

Table 5: Interpolation of latent space samples through JMVAE sentence decoder

5 Conclusion

The JMVAE architecture has proven effective for modeling joint data in the restaurant domain, but shows mixed results when projecting data across domains. Text generation from attribute data performs well, producing grammatically correct sentences with only a slight shift in topic. The text annotation task, however, is more prone to error.

The latent space does not integrate the two modalities smoothly: it needs to diverge far more drastically from the standard normal distribution to fit a distribution that can output data in both domains.

Model                      accuracy   avg false positives   avg false negatives
Attribute CVAE             0.88       5.04                  4.14
Attribute JMVAE encoder    0.95       1.91                  1.88
Joint JMVAE encoder        0.93       3.14                  2.58
Sentence JMVAE encoder     0.88       5.01                  4.29

Table 6: The average reconstruction performance on the attribute decoder of data encoded by the three JMVAE encoders and the attribute CVAE model
